# Explore Mating Behavior

## Fire up 

In [None]:
import numpy as np  
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from ggplot import *

In [None]:
df = pd.read_csv('../input/baboon_mating.csv')
# drop() returns a new frame, so the original df keeps its ID columns
df1 = df.drop(columns=['female_id', 'male_id', 'cycle_id'])
df1.head()

The data set has two labels, `consort` and `conceptive`. Let's analyze them one at a time. **Since the individual female and male may themselves influence both the mating and conception probabilities, I drop the three ID columns first in order to better explore the influence of general biological factors on the labels.**
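A note on removing columns in pandas: plain assignment binds a second name to the same frame rather than copying it, so `del` through either name changes both, whereas `drop` returns a new frame. A minimal sketch on a toy frame (the values are invented and are not the baboon data):

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "female_id": [1, 2, 3],
    "male_id": [7, 8, 9],
    "consort": [0, 1, 0],
})

alias = toy               # same object, not a copy
del alias["female_id"]    # this also removes the column from `toy`

# drop() returns a new frame and leaves its input untouched.
trimmed = toy.drop(columns=["male_id"])

print(list(toy.columns))
print(list(trimmed.columns))
```

If the original frame is needed again later, `drop` (or an explicit `.copy()`) is the safer choice.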

## Consorting Behavior

Before we start, we split the data into training and test sets so that the models can later be evaluated on data they have not seen.

In [None]:
# Work on a new frame without the second label, leaving df1 intact
df2 = df1.drop(columns=['conceptive'])
train_1,test_1 = train_test_split(df2,test_size=0.2,random_state=99)
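If the `consort` label turns out to be imbalanced (not checked in this kernel), a stratified split keeps the class ratio the same in both parts. The sketch below illustrates the idea on an invented frame with a 20% positive rate. The `stratify` argument is part of `train_test_split`, and everything else is made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Invented data: 80 negatives, 20 positives, one random feature.
toy = pd.DataFrame({
    "consort": np.repeat([0, 1], [80, 20]),
    "rank_interact": rng.normal(size=100),
})

# stratify= keeps the share of each class equal in train and test.
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, random_state=99, stratify=toy["consort"]
)

print(train_toy["consort"].mean(), test_toy["consort"].mean())
```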

**A: Exploratory Data Analysis**

In [None]:
print(train_1.describe(include = 'all'))

We first compute a correlation matrix and plot it for a quick look at the relationships between variables.

In [None]:
Cor_matrxi = train_1.iloc[:,1:].corr(method='pearson', min_periods=1)
print(Cor_matrxi)

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))
heatmap = ax.pcolor(Cor_matrxi, cmap=plt.cm.Blues, alpha=0.8)
ax.set_frame_on(False)
# Take tick positions and labels from the matrix itself instead of hard-coded counts
ticks = np.arange(len(Cor_matrxi.columns)) + 0.5
ax.set_yticks(ticks, minor=False)
ax.set_xticks(ticks, minor=False)
ax.set_xticklabels(Cor_matrxi.columns, minor=False)
ax.set_yticklabels(Cor_matrxi.columns, minor=False)
plt.xticks(rotation=90)

I am surprised that the male and female genetic variables are not strongly correlated with each other. Having no biology background, I find several of these relationships confusing, so I will approach them purely from a data mining perspective. Some variables are redundant. For example, since the data set only contains the male rank in transformed form, it makes little sense to keep it on its own, and `rank_interact` already represents the ranks of both the male and the female. Therefore, `rank_interact` is the only rank variable kept for analysis. I do not fully understand what the transform variables mean, but since their correlations with the original variables are not strong, I have decided to keep them.
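Reading redundancy off a heatmap is easy to get wrong, so it can help to list the variable pairs sorted by absolute correlation. A minimal sketch on invented columns (`x`, `x_scaled` and `other` are hypothetical names, not columns of the baboon data):

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=200)

# Invented columns: x_scaled is an exact linear function of x, other is unrelated.
toy = pd.DataFrame({
    "x": x,
    "x_scaled": 2 * x + 1,
    "other": rng.normal(size=200),
})

corr = toy.corr(method="pearson").abs()

# One entry per unordered pair of columns, strongest correlation first.
pairs = sorted(
    ((corr.loc[a, b], a, b) for a, b in combinations(corr.columns, 2)),
    reverse=True,
)

for value, a, b in pairs:
    print(f"{a} ~ {b}: {value:.3f}")
```

Pairs near the top of such a list are candidates for keeping only one member. Where to draw the line is a judgment call.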

In [None]:
variables = [
    'consort',
    'female_hybridscore', 'male_hybridscore',
    'female_gendiv', 'male_gendiv',
    'female_age', 'males_present', 'females_present',
    'gen_distance_transform', 'rank_interact', 'female_age_transform',
    'assort_index', 'gen_distance',
]

Next, I draw box plots of each continuous variable against the consorting result.

In [None]:
con = train_1[variables].copy()  # copy so that adding columns later does not touch train_1

In [None]:
for var in variables[1:]:  # skip 'consort', which is the label
    g = ggplot(con, aes(x='consort', y=var)) + geom_boxplot() + ggtitle('Box Plot of Consorting Result and ' + var) + theme_bw()
    print(g)

Few factors stand out in these plots. The two features worth noting are `rank_interact` and `males_present`.

Next, we draw several scatter plots to further explore the influence of the features on the consorting result.

In [None]:
con['label'] = con['consort'].astype(str)  # string label so that ggplot treats it as a category

In [None]:
g=ggplot(con,aes(x='female_hybridscore',y='male_hybridscore',color='label')) +geom_point() +theme_bw()+facet_grid('label')+ggtitle('Hybrid Score VS Consorting Behavior')
print(g)

The hybrid scores do not appear to separate the two classes.

In [None]:
g=ggplot(con,aes(x='female_gendiv',y='male_gendiv',color='label')) +geom_point() +theme_bw()+facet_grid('label')+ggtitle('Gen Div VS Consorting Behavior')
print(g)

The gendiv variables show a similar pattern.

In [None]:
g=ggplot(con,aes(x='females_present',y='males_present',color='label')) +geom_point() +theme_bw()+facet_grid('label')+ggtitle('Present Data VS Consorting Behavior')
print(g)

Likewise, the numbers of males and females present do not clearly separate the classes.

EDA is always a long and tough road, so I will not present every possible plot here. The general conclusion is that the features may not discriminate well between the classes, so I am not sure how effective the classifiers will be.
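One way to put a number on how well a feature separates the two classes is a standardized mean difference: the gap between class means divided by a pooled standard deviation. The helper below is a hypothetical sketch on invented data. It is not part of the original analysis and `standardized_mean_diff` is a name made up for this example.

```python
import numpy as np
import pandas as pd

def standardized_mean_diff(frame, label, features):
    """Absolute gap between class means over the pooled std, per feature."""
    g0 = frame[frame[label] == 0]
    g1 = frame[frame[label] == 1]
    out = {}
    for col in features:
        pooled = np.sqrt((g0[col].var() + g1[col].var()) / 2)
        out[col] = abs(g1[col].mean() - g0[col].mean()) / pooled
    return pd.Series(out).sort_values(ascending=False)

rng = np.random.default_rng(2)
label = np.repeat([0, 1], 100)

# Invented data: 'shifted' differs between classes, 'flat' does not.
toy = pd.DataFrame({
    "consort": label,
    "shifted": rng.normal(size=200) + 2.0 * label,
    "flat": rng.normal(size=200),
})

smd = standardized_mean_diff(toy, "consort", ["shifted", "flat"])
print(smd)
```

Values near zero suggest that a feature on its own tells the classes apart poorly, which matches what overlapping box plots show visually.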

**B: Models Fit**

In [None]:
del con['label']

I will fit eight classifiers in this case. I will also compare the results of models using all features against models using only the features chosen above.

**All Features**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve

In [None]:
Classifiers = [
    LogisticRegression(C=0.000000001,solver='liblinear',max_iter=200),
    KNeighborsClassifier(3),
    SVC(kernel="rbf", C=0.025, probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(n_estimators=200),
    AdaBoostClassifier(),
    GaussianNB(),
    GradientBoostingClassifier(n_estimators=200)]

In [None]:
All_features = train_1.iloc[:,1:]
Test_features = test_1.iloc[:,1:]
Label = train_1.iloc[:,0]
Model = []
Accuracy = []
for clf in Classifiers:
    fit=clf.fit(All_features,Label)
    pred=fit.predict(Test_features)
    Model.append(clf.__class__.__name__)
    Accuracy.append(accuracy_score(test_1['consort'],pred))
    prob = fit.predict_proba(Test_features)[:,1]
    print('Accuracy of '+clf.__class__.__name__ +' is '+str(accuracy_score(test_1['consort'],pred)))
    fpr, tpr, _ = roc_curve(test_1['consort'],prob)
    tmp = pd.DataFrame(dict(fpr=fpr, tpr=tpr))
    g = ggplot(tmp, aes(x='fpr', y='tpr')) +geom_line() +geom_abline(linetype='dashed')+ ggtitle('Roc Curve of '+clf.__class__.__name__)
    print(g)

All of the models give reasonable results. The best performers are AdaBoost, SVC, Random Forest, Gradient Boosting and Logistic Regression. I will retry four of these (Logistic Regression, SVC, Random Forest and Gradient Boosting) on the second data set, where some variables are dropped.
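Accuracy can look good simply because one class dominates, and ROC curves are hard to compare by eye. A summary table with the area under the ROC curve next to a majority-class baseline makes the comparison easier. This is a minimal sketch on synthetic data from `make_classification`, not on the baboon data, and it uses only two of the classifiers for brevity.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the training and test frames.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=99)

# Accuracy obtained by always predicting the more frequent class.
baseline = max(y_te.mean(), 1 - y_te.mean())

rows = []
for clf in [LogisticRegression(max_iter=200), GaussianNB()]:
    clf.fit(X_tr, y_tr)
    prob = clf.predict_proba(X_te)[:, 1]
    rows.append({
        "model": type(clf).__name__,
        "accuracy": accuracy_score(y_te, clf.predict(X_te)),
        "auc": roc_auc_score(y_te, prob),
    })

summary = pd.DataFrame(rows).sort_values("auc", ascending=False)
print(f"majority-class baseline accuracy: {baseline:.3f}")
print(summary)
```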

In [None]:
Classifiers_2 = [
    LogisticRegression(C=0.000000001,solver='liblinear',max_iter=200),
    SVC(kernel="rbf", C=0.025, probability=True),
    RandomForestClassifier(n_estimators=200),
    GradientBoostingClassifier(n_estimators=200)]

In [None]:
All_features_2 = con.iloc[:,1:]
Test_features_2 = test_1[variables[1:]]
Label = con.iloc[:,0]


In [None]:
Model_2 = []
Accuracy_2 = []
for clf in Classifiers_2:
    fit=clf.fit(All_features_2,Label)
    pred=fit.predict(Test_features_2)
    Model_2.append(clf.__class__.__name__)
    Accuracy_2.append(accuracy_score(test_1['consort'],pred))
    prob = fit.predict_proba(Test_features_2)[:,1]
    print('Accuracy of '+clf.__class__.__name__ +' is '+str(accuracy_score(test_1['consort'],pred)))
    fpr, tpr, _ = roc_curve(test_1['consort'],prob)
    tmp = pd.DataFrame(dict(fpr=fpr, tpr=tpr))
    g = ggplot(tmp, aes(x='fpr', y='tpr')) +geom_line() +geom_abline(linetype='dashed')+ ggtitle('Roc Curve of '+clf.__class__.__name__)
    print(g)

For each model, the difference between the two data sets is negligible. The importance of individual features can be examined further using the feature importances of a tree-based model.
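A single train/test split can hide or exaggerate small differences between feature sets. Cross-validation gives several scores per feature set to compare. The sketch below assumes synthetic data in which the first four columns are the informative ones (`make_classification` with `shuffle=False` places them first). It only illustrates the comparison and does not rerun the experiment above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: 4 informative columns first, followed by 6 noise columns.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=0,
    shuffle=False, random_state=0,
)
X_subset = X[:, :4]  # keep only the informative columns

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Five-fold cross-validated accuracy for each feature set.
scores_all = cross_val_score(clf, X, y, cv=5)
scores_subset = cross_val_score(clf, X_subset, y, cv=5)

print("all features:    %.3f +/- %.3f" % (scores_all.mean(), scores_all.std()))
print("feature subset:  %.3f +/- %.3f" % (scores_subset.mean(), scores_subset.std()))
```

If the two means differ by less than the spread across folds, treating the difference as negligible is reasonable.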

In [None]:
Model = GradientBoostingClassifier(n_estimators=200)
Fit = Model.fit(All_features,Label)
importances = Model.feature_importances_
indices = np.argsort(importances)[::-1]
plt.figure()
plt.title("Feature importances")
plt.bar(range(All_features.shape[1]), importances[indices],
       color="r",  align="center")
plt.xticks(range(All_features.shape[1]),indices)
plt.xlim([-1, All_features.shape[1]])
plt.show()

Only features 13, 4, 10, 1, 11 and 12 have a visible influence on the model. They are:

In [None]:
print(All_features.columns[13],All_features.columns[4],All_features.columns[10],All_features.columns[1],All_features.columns[11],All_features.columns[12])

The variables dropped earlier, along with the remaining lower-ranked ones, have little influence on the results.
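Looking up column names by position is error-prone. Wrapping the importances in a `pandas.Series` indexed by column name keeps names and values together. A minimal sketch on invented data, where `signal`, `noise_a` and `noise_b` are hypothetical columns:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(3)

# Invented data: the label depends only on the 'signal' column.
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["signal", "noise_a", "noise_b"])
y = (X["signal"] > 0).astype(int)

model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair each importance with its column name and sort, largest first.
ranked = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranked)
```

The sorted series can be passed straight to `ranked.plot.bar()` to get a bar chart labelled with feature names instead of indices.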

## Conclusion

In short, this kernel reaches the following conclusions:

A: Ignoring the identity of individual animals, the most influential feature is the assort index, followed by the genetic distance.

B: The female's social rank, age and genetic makeup are all unimportant for the mating rate. One guess is that this is a male-dominated community in which males control mating.