# <center> <font color='#AD3D6F'> SPEED DATING EXPERIMENT  </font> 

## <center> <font color='#E17327'> Python : your brand new wingman ! </font> </center>
*Wingman : a friend who supports you when trying to meet or talk to possible romantic partners.*  

## <font color='darkblue'> 0. Kaggle link to dataset</font>


https://www.kaggle.com/annavictoria/speed-dating-experiment

## <font color='darkblue'> 1. Context </font>


I have always been intrigued by **dating**. As the use of dating apps is booming, I feel like technology has significantly impacted our dating habits and our expectations, but also somehow debunked the myth of Prince Charming and Sleeping Beauty. In the era of globalization and with the power of social media, people are hyper-connected, and yet lonelier than ever. 

But with time optimization in mind, wouldn’t it be great to only meet people that you are more likely to match with? It would avoid disappointments and broken hearts and, above all, enable people to focus on real potential romantic partners. 
 
 <font color='darkblue'> **What is the secret of finding love at first sight ?** <br> 
    **Can an algorithm provoke fate?**</font> 
    
This dataset was compiled by Columbia University in order to investigate dating preferences. It gathers data from participants in experimental speed dating events from 2002 to 2004. During these events, participants went on 4-minute dates with partners of the opposite sex (*Pretty old-school data set...*). After this first date, participants had to decide whether they wanted to go on a second date, and to rate their date on six attributes : Attractiveness, Sincerity, Intelligence, Fun, Ambition, and Shared Interests.

To make this study even more interesting, the dataset also includes questionnaire data gathered before, during and after the dating events. Very important information was gathered on people's self-perception of their qualities and what they are looking for in the perfect romantic partner. Demographics, dating habits and lifestyle information will be necessary for us to debunk this myth of finding the perfect soulmate.


## <font color='darkblue'> 2. Problem definition </font>


The goal of this study is to crack the case of Love. Love has always been full of mysteries, but I am convinced that there is some logic behind it. As this data set is huge, I decided to start this study by answering some general questions to understand the experiment and get a handle on the dataset. These questions are: 

-	How serious were people about this experiment?  
-	How picky are they ? 
-	Did people find their soulmate ? 
-	Were hearts broken during this experiment ? 
-	Prince charming or just the master of speed dating ? 
-	What makes a man/woman THE perfect date ? 
-	Do we under or over estimate ourselves ?     

After answering all these questions, I will tackle the main problem of this study:

**Predicting whether two people are going to be a match after a 4-minute date, based on their personal features and what they are looking for in a romantic partner.**

This calls for a binary classification algorithm that, given all the features of 2 participants, can predict whether they would match or not.   
This algorithm can be very useful to optimize those speed dating events. As I explained earlier, people are a lot more involved in finding love today and way more afraid to end up alone. This feeling pushes people to try harder to find love and to participate even more in speed dating events. But instead of making all the participants meet one another, wouldn’t it be better to bring together only the people with the highest probability of matching? Not only would it save a considerable amount of time, it would also make “Love” easier by avoiding disastrous dates and focusing on people who are relationship material.       


## <font color='darkblue'> 3. Data exploration </font>


### <font color="#3268ca"> 3.1) Importing the dataset </font>

In [None]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import os
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

In [None]:
Speed_dating_data = pd.read_csv('../input/speed-dating-experiment/Speed Dating Data.csv',encoding="ISO-8859-1")

### <font color="#3268ca"> 3.2) First steps of the exploration  

In [None]:
Speed_dating_data.head()

In [None]:
Speed_dating_data.shape

There are 195 features ! This number is huge!! We will have to put serious work into feature selection.   

In [None]:
Speed_dating_data.info()

In [None]:
#Looking for empty cells 
Speed_dating_data.isnull().sum()

As we can see, many features have a lot of missing values. Depending on the features chosen, we will have to make sure that the data is cleaned. We can't drop all the rows with missing values now, because doing so across all 195 features would delete the entire dataset.

In [None]:
#Were there the same amount of male and female participants ?
Speed_dating_data.groupby(['gender'])['iid'].count().reset_index()

We can see that there is the same number of women and men in this experiment. This enables us to meaningfully compare men's and women's dating behaviours. In fact, this dataset is balanced with respect to gender.  
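One caveat about this kind of count: each row of the dataset is one date, not one person, so counting rows per gender gives the number of dates rather than the number of participants. Here is a minimal sketch of the difference, on a made-up `toy` frame (hypothetical data; only the column names `iid` and `gender` come from the dataset):

```python
import pandas as pd

# Toy stand-in for Speed_dating_data: one row per date, so each participant repeats.
toy = pd.DataFrame({
    "iid":    [1, 1, 1, 2, 2, 3, 3, 3],
    "gender": [0, 0, 0, 0, 0, 1, 1, 1],
})

# Rows (dates) per gender: this is what groupby(...)['iid'].count() reports.
dates_per_gender = toy.groupby("gender")["iid"].count()

# Distinct participants per gender.
participants_per_gender = toy.groupby("gender")["iid"].nunique()

print(dates_per_gender)
print(participants_per_gender)
```

On the real data, replacing `toy` with `Speed_dating_data` should give both views side by side.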

In [None]:
#The age of the participants 
int_corr = Speed_dating_data[np.isfinite(Speed_dating_data['age'])]['age']
plt.hist(int_corr.values, color='#900C3F')
plt.xlabel('Age')
plt.ylabel('Participants')
plt.title('The age of the participants')
plt.show()

Participants are relatively **young**, aged between 20 and 30 for the majority. 

### <font color="#3268ca"> 3.3) General questions on the experiment 

 #### <font color='#900C3F'> 3.3.1) How serious were people about this experiment ?

Participants were asked about their goal in participating in these speed dating sessions. They had to choose between 6 possibilities:  

<font color='grey'> What is your primary goal in participating in this event? 
- Seemed like a fun night out = 1
- To meet new people = 2
- To get a date = 3
- Looking for a serious relationship = 4
- To say I did it = 5
- Other = 6


In [None]:
#Encode the answers to the question above
replace_map = {'goal': {1: "Seemed like a fun night out", 
                        2: "To meet new people", 
                        3: "To get a date" , 
                        4: "Looking for a serious relationship" ,
                        5: "To say I did it" ,
                        6: "Other" }}
Goal=Speed_dating_data.replace(replace_map)
Goal['goal'].head()

In [None]:
intentions=Goal.groupby(['goal'])['iid'].count().reset_index()
intentions

In [None]:
plt.figure(figsize = (20,5))

plt.bar(Goal.groupby(['goal'])['iid'].count().reset_index()['goal'], 
        Goal.groupby(['goal'])['iid'].count().reset_index()['iid'], color='#701C3F',
        width= 0.3, align='center')
plt.title("The primary goal in participating in this event")
plt.show()

We can see that people took part in these events mostly because it seemed like a fun night out and in order to meet new people. Only a few of them were there looking for a serious relationship or to get a date. This shows people's willingness to have a good time and to get to know others, but the sincerity of some of them might be questionable.  

#### <font color='#900C3F'> 3.3.2) How picky are we ? 

The feature "dec" gives the decision of the participant after the 4-minute date. If both participants said yes, a match is recorded.  

*Reminder : Yes --> 1 and No -->0* 


In [None]:
#Encode the answers to this question 
replace_map2 = {'dec': {1: "Yes", 0: "No"}}
Decision=Speed_dating_data.replace(replace_map2)

In [None]:
Decision.groupby(['gender','dec'])['iid'].count().reset_index()

In [None]:
plt.figure(figsize = (20,5))

plt.subplot(131)


plt.bar(Decision[Decision['gender']==0].groupby(['dec'])['iid'].count().reset_index()['dec'], 
        Decision[Decision['gender']==0].groupby(['dec'])['iid'].count().reset_index()['iid'], color="#E75480",
        width= 0.1, align='center')
plt.title("Decision of women after their dates with men")
plt.subplot(132)

plt.bar(Decision[Decision['gender']==1].groupby(['dec'])['iid'].count().reset_index()['dec'],
        Decision[Decision['gender']==1].groupby(['dec'])['iid'].count().reset_index()['iid'], color="darkblue",
               width= 0.1, align='center')
plt.title("Decision of men after dating women")
plt.show()



We can see that men seem to be reasonably picky. *They said yes to 47% of the women they met.* <br> 
In contrast, women are way more picky. *They said yes to 36% of the men they met.* 

**Men tend to be more easily rejected than women after a 4-minute date.**
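Since `dec` is coded 0/1, its mean within each gender is directly the share of "yes" answers. The sketch below shows the idea on a small invented frame (the `toy` data is hypothetical; only the column names `gender` and `dec` come from the dataset):

```python
import pandas as pd

# Toy stand-in: gender 0 = women, gender 1 = men; dec 1 = yes, 0 = no.
toy = pd.DataFrame({
    "gender": [0, 0, 0, 0, 1, 1, 1, 1],
    "dec":    [1, 0, 0, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the "yes" rate.
yes_rate = toy.groupby("gender")["dec"].mean()

print((yes_rate * 100).round(1))
```

The same one-liner applied to `Speed_dating_data` should reproduce the percentages quoted above without counting by hand.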

#### <font color='#900C3F'> 3.3.3) Did people find their soulmate ? 


In [None]:
plt.figure(figsize = (10,5))

plt.bar(Speed_dating_data[Speed_dating_data['date_3']==1].groupby(['numdat_3'])['iid'].count().reset_index()['numdat_3'],
        Speed_dating_data[Speed_dating_data['date_3']==1].groupby(['numdat_3'])['iid'].count().reset_index()['iid'], color='#900C3F',
               width= 0.8, align='center')
plt.title('Number of dates participants went on after recontacting their matches')

plt.show()


People who received matches and recontacted the people they met generally went on 1 date to confirm their interest in one another. This shows that many people might have found their soulmate among their matches *- A soulmate, or at least a second date.*    

In [None]:
#Encode the answers to this question 
replace_map3 = {'match': {0: "No Match", 1: "Match"}}
matches=Speed_dating_data.replace(replace_map3)

In [None]:
#Counting how many matches were recorded 
matches.groupby(['match'])['iid'].count().reset_index()

In [None]:
#Plotting the number of matches recorded 

plt.bar(matches.groupby(['match'])['iid'].count().reset_index()['match'], 
        matches.groupby(['match'])['iid'].count().reset_index()['iid'], color='#700C3F', width=0.2 )
plt.title("Recorded matches during this experiment")
plt.show()

Based on all the records that we have, only **16% of the dates resulted in a match.** <br>
This shows how hard it can be to find an optimal romantic partner and supports the utility of the upcoming classification algorithm. In fact, by meeting everyone at a speed dating event, people go on dates that lead nowhere about 84% of the time.      <br>

Thus, this dataset is not balanced with respect to matches. That means that our model will see far more "No match" examples than "Match" examples, and will be better trained to recognise the former. 
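To see what this imbalance implies for the scores we will read later, here is a minimal sketch with invented labels in the same 16% / 84% proportion (the arrays are synthetic, not taken from the dataset). A "model" that always answers "No match" already reaches a high accuracy, so accuracy alone will not tell us much:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels: 16 matches and 84 non-matches out of 100 dates.
y = np.array([1] * 16 + [0] * 84)
X = np.zeros((len(y), 1))  # features are irrelevant for this baseline

# Baseline that always predicts the most frequent class ("No match").
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = accuracy_score(y, baseline.predict(X))

print(baseline_accuracy)
```

Any classifier trained on this data is worth comparing against that baseline rather than against 50%.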

#### <font color='#900C3F'> 3.3.4) Were hearts broken during this experiment ? 


In [None]:
#Generate the number of people who wanted to go on a second date, and the decision of their partners.  
Speed_dating_data[Speed_dating_data['dec']==1].groupby(['match'])['iid'].count().reset_index()

In [None]:
#Encode the answers to this question 
replace_map4 = {'match': {0: "Broken hearts", 1: "Love at first sight"}}
hearts=Speed_dating_data.replace(replace_map4)

In [None]:
#Plotting these results 
plt.bar(hearts[hearts['dec']==1].groupby(['match'])['iid'].count().reset_index()['match'], 
        hearts[hearts['dec']==1].groupby(['match'])['iid'].count().reset_index()['iid'], color='#600C3F', width=0.2 )
plt.title("Recorded matches during this experiment")
plt.show()

As we can see, many people felt a one-sided connection with their date that apparently was not mutual. In 61% of the cases where a participant said yes, the other person said no and a heart was "broken".  <BR>
   It could be interesting to manage the participants' feelings by protecting them from feeling rejected and unappreciated. To avoid such disillusionment, an algorithm that predicts the matches can be very useful.

 #### <font color='#900C3F'> 3.3.5) Prince charming or just the master of speed dating ? 

Let's count how many matches each participant has, to see if some people are just **good daters**. <br>

*Dating is just like sports: you become better with training and some people are more gifted.* 

In [None]:
data1=Speed_dating_data.groupby(['iid'])['match'].sum().reset_index()
data1.describe()

We can see that the mean is 2, that 25% of the participants didn't get any match and that 25% of them got between 4 and 14 matches. This shows that the participants are really heterogeneous, and that we have both inexperienced and expert daters.      

#### <font color='#900C3F'> 3.3.6) What features make you want to go on a second date ? 

Let's first take a look at the average evaluation for each feature, depending if there was a match or not. This will show us the important features and the ones that drive the participant's decision.   

In [None]:
#Selecting the features I am interested in : the way the participants perceived their partner after the 4-minute date. 
#drop=True keeps the old row index from being added as a meaningless 'index' column
DATA3=Speed_dating_data[['iid','gender','match','dec','attr','sinc','intel','fun','amb','shar','like']].drop_duplicates().reset_index(drop=True)
replace_map5 = {'match': { 1: "Match", 0:"No match" }}
DATA3.replace(replace_map5, inplace=True)


In [None]:
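#Select the columns with a list: recent pandas versions no longer accept a bare tuple of names here
Features=DATA3.groupby(['match'])[['attr','sinc','intel','fun','amb','shar']].mean()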
Features.rename(columns={'attr': 'Attractive', 'sinc': 'Sincere','intel': 'Intelligent','fun': 'Fun','amb': 'Ambitious','shar': 'Shared interests'}, inplace=True)
Features

In [None]:
#Creating a row with the difference between Match and No match
list_cols = Features.columns
Features2 = pd.DataFrame({"match":['Match','No match','Difference (Match - No match)']})

for col in list_cols:
    # Look the two means up by label; unique() would drop a value if both means were equal
    match_mean = Features.loc['Match', col]
    no_match_mean = Features.loc['No match', col]
    Features2[col] = [match_mean, no_match_mean, match_mean - no_match_mean]

In [None]:
Features2.set_index('match',inplace=True)

In [None]:
Features2

In [None]:
#Plot these results
Features2.plot(kind='bar', figsize=(10,3))
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)
plt.title('The features of participants who were matched or rejected')
plt.show()

As these numbers show, participants wanted to go on a second date with people who were **more attractive**, **funnier**, and who **shared more interests**. Sincerity, ambition and intelligence don't seem to influence one's decision to go on a second date as much. <BR>
Isn't it good news that people are not judged only on their attractiveness ? The importance of fun and shared interests shows that participants are really looking for genuine connections. 
    

#### <font color='#900C3F'> 3.3.7) Do we *under* or *over* estimate ourselves ? 


The features I will be using to answer this question are :

- **"Please rate your opinion of your own attributes**, on a scale of 1-10 (1=awful, 10=great) -- Be honest!"<br>
*The answers are encoded as : attr3_s, sinc3_s , intel3_s, fun3_s and amb3_s*
    
- **"The rating by partner the night of the event**, for all 6 attributes, on a scale of 1-10 (1=awful, 10=great)."<br> 
*The answers are encoded as : attr_o, sinc_o, intel_o, fun_o and amb_o (the five attributes that also have a self-rating)*




In [None]:
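#Select the columns with a list: recent pandas versions no longer accept a bare tuple of names here
Reality=Speed_dating_data.groupby(['gender'])[['attr3_s','sinc3_s','intel3_s','fun3_s','amb3_s','attr_o','sinc_o','intel_o','fun_o','amb_o']].mean()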
Reality


In [None]:
#Creating an 'all' column that is always = 0, so that the groupby puts every row in a single group
Speed_dating_data['all']  = 0
Reality2=Speed_dating_data.groupby(['all'])[['attr3_s','sinc3_s','intel3_s','fun3_s','amb3_s',
                                             'attr_o','sinc_o','intel_o','fun_o','amb_o']].mean()
Reality2


In [None]:
df_diff = pd.DataFrame()

list_cols = Reality2.columns
new_col_names = ['Attractive','Sincere','Intelligent','Fun','Ambitious']
for i in range(5):
    # Reality2 has a single row, so take its only value explicitly
    personal = float(Reality2[list_cols[i]].iloc[0])
    others = float(Reality2[list_cols[i+5]].iloc[0])
    df_diff[new_col_names[i]] = [personal, others, personal - others]

df_diff['Different perceptions'] = ['Personal rating','Others rating','Difference']
df_diff.set_index('Different perceptions',inplace=True)

In [None]:
df_diff

In [None]:
#Plotting these results

df_diff.plot(kind='bar', figsize=(15,5))
plt.title('Is our self-perception distorted ? ', fontsize=15)
plt.show()

We can see that the difference between the personal rating and others' rating is always positive, which means participants tend to **over-estimate** themselves. <br>
The feature they over-estimate the most is **fun**. This makes sense because fun is very subjective. Everyone is fun in their own way. The second attribute people over-estimate is **attractiveness**.  
Maybe people don't over-estimate themselves and simply underestimate the people they meet. In fact, it must be hard to figure someone out in only 4 minutes.  Not to mention the stress caused by the short time: people are probably not at their best.  

### <font color="#3268ca"> 3.4) Looking for correlations

In [None]:
import seaborn as sns
sns.pairplot(DATA3)
plt.show()

We can see that the most liked people tend to be very **attractive** and to **share many common interests** with the participant. 

## <font color='darkblue'> **4. Features Selection** </font>


The goal of this study is to **predict whether two people are going to be a match after a 4-minute date, based on their personal features.**

The features that will be needed for this study are : 
- The gender and iid
- Personal features on a scale of 1-10 (1=awful, 10=great) *--> attr3_s, sinc3_s, intel3_s, fun3_s, amb3_s*
- The other person's features on a scale of 1-10 (1=awful, 10=great) *--> attr, sinc, intel, fun, amb, shar* 
- Match



In [None]:
dating_data=Speed_dating_data[['gender','iid',
                     'attr3_s','sinc3_s','intel3_s','fun3_s','amb3_s',
                     'attr','sinc','intel','fun','amb','shar','match']]

## <font color='darkblue'> 5. Data Processing Step 1 (Cleaning, etc.)

In [None]:
dating_data.head()

In [None]:
dating_data.shape

### <font color="#3268ca"> 5.1) Looking for non available values

In [None]:
dating_data.isnull().sum()

There are a lot of missing values !    

In [None]:
dating_data.isnull().sum().max()/dating_data.shape[0]

The column with the most missing values is 52% empty. 
We have several options to tackle this issue, among them : 
- Deleting rows with empty values 
- Replacing the empty values with the mean 

I will delete the rows with empty values, because replacing the missing values with the mean would heavily bias our algorithm. I want the algorithm to learn from real evaluations, not from a crowd of artificially average individuals.      
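To make the trade-off concrete, here is a small sketch on an invented rating column (the values are hypothetical, not taken from the dataset). Filling the gaps with the mean keeps every row but shrinks the spread of the ratings, which is the "everyone becomes average" effect described above:

```python
import numpy as np
import pandas as pd

# Hypothetical 1-10 ratings with two missing answers.
ratings = pd.Series([2.0, 4.0, np.nan, 8.0, np.nan, 10.0])

# Option 1: drop the missing answers (fewer rows, real values only).
dropped = ratings.dropna()

# Option 2: replace the missing answers with the mean of the observed ones.
filled = ratings.fillna(ratings.mean())

print(len(dropped), dropped.std())
print(len(filled), filled.std())
```

The mean stays the same in both cases, but the standard deviation of the filled column is smaller because the imputed rows all sit exactly on the average.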

In [None]:
#deleting non available values
dating_data_clean=dating_data.dropna().reset_index()

In [None]:
dating_data_clean.shape

### <font color="#3268ca"> 5.2) Taking care of categorical features

In [None]:
dating_data_clean.info()

Luckily, in this dataset the gender has already been encoded as an integer. Therefore, there are no categorical features to take care of.     

### <font color="#3268ca"> 5.3) Drop duplicates

In [None]:
dating_data_clean_without_duplicates = dating_data_clean.drop_duplicates()
print(dating_data_clean.shape)
print(dating_data_clean_without_duplicates.shape)

There are no duplicates ! 

## <font color='darkblue'> 5. Data Processing Step 2 :  Features Engineering</font>


### <font color="#3268ca"> 5.4) Low variance features

Search for features that have low variance

In [None]:
feature_variances = dating_data_clean.std().sort_values(ascending=True)
features_low_variance = feature_variances[feature_variances < 0.1].index.values.tolist()
features_low_variance

None of the features have low variance. This means we can keep all of them, since each one varies enough to potentially carry information for our model.     

### <font color="#3268ca"> 5.5) Correlation study

Let's take a look at which features have the lowest correlation with the final result : the match. 

In [None]:
correlations = dating_data_clean.corr().abs().unstack().sort_values(ascending=False).drop_duplicates()
correlations = correlations[correlations != 1]

In [None]:
match_correlations = correlations.loc['match']
match_correlations[match_correlations > 0.1]

In [None]:
lowest_correlation = match_correlations[match_correlations < 0.1].axes[0].to_list()
lowest_correlation

As we can see, the sincerity and intelligence of the participant are not very important for the decision. Gender does not have a significant correlation with the matching result. I found this surprising at first, as all the results plotted before showed that it was harder for men to be liked. On reflection it makes sense: a match is recorded for both partners at once, so men and women collect matches at the same rate even if men hear "no" more often.     

In order to be able to compare both individuals' features, I decided to keep the sincerity and intelligence of one of the participants among the features, because the sincerity and intelligence of the other are significantly correlated with the matching result.     
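A more direct way to read the correlation of every feature with the target is to take the `match` column of the correlation matrix. The sketch below uses synthetic data (the `toy` frame and the way `match` is generated are assumptions for illustration; only the column names echo the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic ratings: 'attr' drives the outcome, 'sinc' is pure noise.
attr = rng.integers(1, 11, size=n)
sinc = rng.integers(1, 11, size=n)
match = (attr + rng.normal(0, 2, size=n) > 7).astype(int)
toy = pd.DataFrame({"attr": attr, "sinc": sinc, "match": match})

# Absolute correlation of each feature with the target, strongest first.
match_corr = (
    toy.corr()["match"]
    .drop("match")
    .abs()
    .sort_values(ascending=False)
)

print(match_corr)
```

Applied to `dating_data_clean`, the same expression should list all features at once, which avoids depending on which half of each pair survives `unstack().drop_duplicates()`.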

## <font color='darkblue'> **6. Model Selection** </font>


In [None]:
#Creating the training and the testing sets
from sklearn.model_selection import train_test_split
X=dating_data_clean[['gender',
                     'attr3_s','sinc3_s','intel3_s','fun3_s','amb3_s',
                     'attr','sinc','intel','fun','amb','shar']]
y=dating_data_clean['match']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=3, stratify=y)

Due to the small dataset we are working with (3373 rows), the classification models are really unstable. Therefore, results vary a lot each time I run the code. 

I will run two classifiers for this problem: 
- A logistic Regression
- A random Forest Classifier

### <font color='#FF781F'> **6.1) Logistic Regression** 

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import accuracy_score
import time

t0 = time.time()

model_log_reg = LogisticRegression(C=3, random_state=43)
log_reg = model_log_reg.fit(X_train, y_train)
predict_train_log_reg = log_reg.predict(X_train)
predict_test_log_reg = log_reg.predict(X_test)
t1 = time.time()
duration = t1 - t0
print("The duration of the training is " , duration)
print('Training Accuracy Score:', accuracy_score(y_train, predict_train_log_reg))
print('Validation Accuracy Score :', accuracy_score(y_test, predict_test_log_reg))

#Saving these values to compare before and after the dimensionality reduction 
Log_reg_before=duration
Log_reg_score_before=accuracy_score(y_test, predict_test_log_reg)

This model looks very efficient ! It has a similar accuracy score for the training and testing sets, which suggests it avoids overfitting. These scores are also pretty high, which suggests our model avoids underfitting too, at least in terms of accuracy.  

In [None]:
#Plotting the confusion matrix: 
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, model_log_reg.predict(X_test))


As we can see, 81 values were predicted negative while they were actually positive. This is a little bit worrying because I don't want my algorithm to make a participant miss their soulmate. 
Also, only 4 values were correctly predicted as matches, which confirms the fact that our dataset is not balanced regarding the matching results. Since we have a lot more "Non matches", the algorithm is better at predicting what is not a match than what is a match.   
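One lever that scikit-learn offers for this situation is the `class_weight="balanced"` option of `LogisticRegression`, which gives the rare class more weight during training. This is not something tried in this notebook; the sketch below only illustrates the effect on synthetic one-feature data with roughly 16% positives (all data here is invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic imbalanced data: about 16% positives, shifted by 1.5 on a single feature.
y = (rng.random(n) < 0.16).astype(int)
X = rng.normal(loc=1.5 * y, scale=1.0).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Same model, with and without re-weighting the classes.
plain = LogisticRegression().fit(X_tr, y_tr)
balanced = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

# Recall on the positive class = share of true matches that are found.
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))

print(recall_plain, recall_balanced)
```

The price of the higher recall is usually more false positives, so whether this is a good trade depends on how costly a wasted date is compared with a missed soulmate.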

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, model_log_reg.predict(X_test)))

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
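# Use the predicted probabilities (not the hard 0/1 predictions) for both the AUC and the curve
proba_test_log_reg = model_log_reg.predict_proba(X_test)[:, 1]
logit_roc_auc = roc_auc_score(y_test, proba_test_log_reg)
fpr, tpr, thresholds = roc_curve(y_test, proba_test_log_reg)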
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

ROC curves typically feature the true positive rate on the Y axis and the false positive rate on the X axis. This means that the top left corner of the plot is the “ideal” point: a false positive rate of zero and a true positive rate of one. The further the logistic regression curve is above the red dashed line (which represents a random classifier), the better the model is.  As we can see above, this graph is great and shows that our model is working just fine !  
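One detail worth keeping in mind with `roc_auc_score`: it is meant to receive scores or probabilities, because the area is computed by sweeping the decision threshold. If it receives hard 0/1 predictions instead, only one threshold is left and the area can look much worse than the ranking really is. A tiny hand-checkable sketch (the labels and scores below are made up):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities; every probability is below 0.5.
y_true = np.array([0, 0, 0, 0, 1, 1])
proba = np.array([0.05, 0.10, 0.15, 0.30, 0.25, 0.40])

# Hard predictions at the usual 0.5 threshold: everything becomes "No match".
hard_pred = (proba >= 0.5).astype(int)

# AUC from probabilities: 7 of the 8 (positive, negative) pairs are ranked correctly.
auc_from_proba = roc_auc_score(y_true, proba)

# AUC from hard predictions: all scores are tied, so the ranking is no better than chance.
auc_from_labels = roc_auc_score(y_true, hard_pred)

print(auc_from_proba, auc_from_labels)
```

With an imbalanced target like ours, where few rows cross the 0.5 threshold, the gap between the two numbers can be large.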

### <font color='#FF781F'> **6.2) Random Forest Classifier** 

In [None]:
from sklearn.ensemble import RandomForestClassifier
import time

t0 = time.time()

model = RandomForestClassifier()
rf_model = model.fit(X_train, y_train)
predict_train_rf = rf_model.predict(X_train)
predict_test_rf = rf_model.predict(X_test)
t1 = time.time()
duration = t1 - t0
print("The duration of the training is " , duration)
print('Training Accuracy:', accuracy_score(y_train, predict_train_rf))
print('Validation Accuracy:', accuracy_score(y_test, predict_test_rf))

#Saving these values to compare before and after the dimensionality reduction 
Rf_duration_before=duration
Rf_score_before=accuracy_score(y_test, predict_test_rf)

Oh wow ! A 99% training accuracy, it smells like Overfitting ...  

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, model.predict(X_test))


## <font color='darkblue'> **7. Learning Curves analysis** </font>


### <font color='#3268ca'> **7.1) The function that plots learning curves** 

In [None]:
%matplotlib inline

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit


def plot_learning_curve(estimator, title, X, y, axes=None, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 3)):
    """
    Generate 3 plots: the test and training learning curve, the training
    samples vs fit times curve, the fit times vs score curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    axes : array of 3 axes, optional (default=None)
        Axes to use for plotting the curves.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:

          - None, to use the default 5-fold cross-validation,
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : int or None, optional (default=None)
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    train_sizes : array-like, shape (n_ticks,), dtype float or int
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the dtype is float, it is regarded as a
        fraction of the maximum size of the training set (that is determined
        by the selected validation method), i.e. it has to be within (0, 1].
        Otherwise it is interpreted as absolute sizes of the training sets.
        Note that for classification the number of samples usually have to
        be big enough to contain at least one sample from each class.
        (default: np.linspace(0.1, 1.0, 3))
    """
    if axes is None:
        _, axes = plt.subplots(1, 3, figsize=(20, 3))

    axes[0].set_title(title)
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    axes[0].set_xlabel("Training examples")
    axes[0].set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = \
        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
                       train_sizes=train_sizes,
                      return_times=True)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)

    # Plot learning curve
    axes[0].grid()
    axes[0].fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1,
                         color="r")
    axes[0].fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1,
                         color="g")
    axes[0].plot(train_sizes, train_scores_mean, 'o-', color="r",
                 label="Training score")
    axes[0].plot(train_sizes, test_scores_mean, 'o-', color="g",
                 label="Cross-validation score")
    axes[0].legend(loc="best")

    # Plot n_samples vs fit_times
    axes[1].grid()
    axes[1].plot(train_sizes, fit_times_mean, 'o-')
    axes[1].fill_between(train_sizes, fit_times_mean - fit_times_std,
                         fit_times_mean + fit_times_std, alpha=0.1)
    axes[1].set_xlabel("Training examples")
    axes[1].set_ylabel("fit_times")
    axes[1].set_title("Scalability of the model")

    # Plot fit_time vs score
    axes[2].grid()
    axes[2].plot(fit_times_mean, test_scores_mean, 'o-')
    axes[2].fill_between(fit_times_mean, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1)
    axes[2].set_xlabel("fit_times")
    axes[2].set_ylabel("Score")
    axes[2].set_title("Performance of the model")

    return plt

### <font color='#3268ca'> **7.2) Logistic regression** 

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(10, 15))

X, y = X_train , y_train

title = "Learning Curves (Logistic Regression)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

estimator = LogisticRegression()
plot_learning_curve(estimator, title, X, y, axes=axes[:, 0], ylim=(0.7, 1.01),
                    cv=cv, n_jobs=3)

title = r"Learning Curves (SVM, RBF kernel, $\gamma=0.001$)"
# SVC is more expensive so we do a lower number of CV iterations:
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
estimator = SVC(gamma=0.001)
plot_learning_curve(estimator, title, X, y, axes=axes[:, 1], ylim=(0.7, 1.01),
                    cv=cv, n_jobs=3)

plt.show()

As we can see:
- The accuracies are fairly high (between 0.80 and 0.85).
- The training accuracy is above the cross-validation score.
- Both scores tend to converge towards 0.82 as the number of training examples increases.

This means that the Logistic Regression classifier fits the data set well, with no obvious overfitting or underfitting. <BR>
More data would have been useful to make this model even more accurate.  
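To put a number on "the curves converge", the gap between training and cross-validation scores can be computed directly with scikit-learn's `learning_curve`. The snippet below is only a sketch: `X_demo` and `y_demo` are synthetic stand-ins (an assumption, not the notebook's `X_train` and `y_train`), and the number of splits is reduced to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# Synthetic stand-in data (assumption): 13 features, binary target
X_demo, y_demo = make_classification(n_samples=600, n_features=13,
                                     n_informative=6, random_state=0)

cv = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    cv=cv, train_sizes=np.linspace(0.1, 1.0, 5))

# Mean gap between training and cross-validation accuracy per training size
gap = train_scores.mean(axis=1) - test_scores.mean(axis=1)
for n, g in zip(train_sizes, gap):
    print(f"{int(n):4d} training examples: train - CV gap = {g:.3f}")
```

A gap that shrinks towards zero as the training size grows is what the converging curves show visually. A gap that stays large would point to overfitting.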

### <font color='#3268ca'> **7.3) Random Forest Classifier** 

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(10, 15))

X, y = X_train , y_train

title = "Learning Curves (Random Forest)"
# Cross validation with 100 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

estimator = RandomForestClassifier()
plot_learning_curve(estimator, title, X, y, axes=axes[:, 0], ylim=(0.7, 1.01),
                    cv=cv, n_jobs=3)

# Raw string so that "\g" is not treated as an (invalid) escape sequence
title = r"Learning Curves (SVM, RBF kernel, $\gamma=0.001$)"
# SVC is more expensive so we do a lower number of CV iterations:
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
estimator = SVC(gamma=0.001)
plot_learning_curve(estimator, title, X, y, axes=axes[:, 1], ylim=(0.7, 1.01),
                    cv=cv, n_jobs=3)

plt.show()

These learning curves confirm that the Random Forest Classifier **overfits** a lot. The training score is much higher than the cross-validation score, and the two never converge. 
How can we prevent overfitting? 
- We can train the model with **more data**: this is the optimal solution. As mentioned earlier, the model is trained on few examples. This makes the model very unstable and very likely to overfit.  
- We can **remove some of the features** using dimensionality reduction. Let's try this option.    

## <font color='darkblue'> **8. Dimensionality Reduction** </font>
 

### <font color='#3268ca'> **8.1) PCA method** 

Compute the minimum number of dimensions required to preserve **95%** of the variance of the training data.

In [None]:
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
print(d)

This means that instead of working with 13 features, we can work with only 10 principal components. 
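As a side note, scikit-learn can pick the number of components itself when `n_components` is given as a float between 0 and 1. The sketch below compares that shortcut with the manual `cumsum` computation on synthetic stand-in data (`X_demo` is an assumption, not the notebook's `X_train`).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in data (assumption): 13 features, some of them redundant
X_demo, _ = make_classification(n_samples=500, n_features=13, n_informative=6,
                                n_redundant=4, random_state=0)

# Manual approach: smallest d whose cumulative explained variance reaches 95%
pca_full = PCA().fit(X_demo)
cumsum = np.cumsum(pca_full.explained_variance_ratio_)
d_manual = int(np.argmax(cumsum >= 0.95) + 1)

# Shortcut: a float n_components asks PCA to keep enough components
# to explain at least that fraction of the variance
pca_95 = PCA(n_components=0.95).fit(X_demo)
d_auto = int(pca_95.n_components_)

print("manual:", d_manual, "| PCA(n_components=0.95):", d_auto)
```

Both approaches should select the same number of components. The shortcut simply avoids fitting the PCA twice.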

Unfortunately, we can't plot the results of this algorithm directly, because even after PCA the data still has 10 dimensions and a standard scatter plot only shows two.  
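One possible workaround, purely for visualisation, is to fit a second PCA with only 2 components and colour the points by class. This is a sketch on synthetic stand-in data (`X_demo`, `y_demo` are assumptions). Two components keep less variance than ten, so the picture is only a rough view of the data.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in data (assumption): 13 features, binary target
X_demo, y_demo = make_classification(n_samples=500, n_features=13,
                                     n_informative=6, random_state=0)

# Project onto the first two principal components only
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_demo)

fig, ax = plt.subplots(figsize=(6, 5))
ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y_demo, cmap="coolwarm", s=12, alpha=0.7)
ax.set_xlabel("First principal component")
ax.set_ylabel("Second principal component")
ax.set_title("Data projected onto 2 principal components")

# Share of the total variance that survives the 2D projection
kept = pca_2d.explained_variance_ratio_.sum()
print(f"Variance kept by 2 components: {kept:.1%}")
```

In a notebook the figure renders at the end of the cell. In a script, `plt.show()` would be needed.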

In [None]:
pca = PCA(n_components = 10)
# Note: despite its name, X_test_reduced holds the reduced *training* set
X_test_reduced = pca.fit_transform(X_train)

In [None]:
X_test_reduced.shape

### <font color='#3268ca'> **8.2) Logistic Regression** 

In [None]:
from sklearn.metrics import accuracy_score
import time

LogReg_clf_reduced = LogisticRegression(C=3, random_state=43)

# Time the fit on the PCA-reduced training set
t0 = time.time()
LogReg_clf_reduced.fit(X_test_reduced, y_train)
t1 = time.time()
duration = t1 - t0

print("The duration of the training before dimensionality reduction is ", Log_reg_before)
print("The duration of the training with dimensionality reduction is ", duration)

# Project the test set with the PCA already fitted on the training set (transform only, no refit)
X_test_reduced2 = pca.transform(X_test)
y_pred = LogReg_clf_reduced.predict(X_test_reduced2)

print("The validation accuracy score before dimensionality reduction is ", Log_reg_score_before)
print("The validation accuracy score with dimensionality reduction is ", accuracy_score(y_test, y_pred))


Dimensionality reduction didn't really impact the accuracy score, but it has **significantly lowered the duration of the training**.
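The "fit the PCA on the training set, then only transform the test set" logic can also be wrapped in a scikit-learn pipeline, so the two steps cannot get out of sync. This is a minimal sketch on synthetic stand-in data. `X_tr`, `X_te`, `y_tr` and `y_te` are hypothetical names, not the notebook's variables, and `max_iter=1000` is an added assumption to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption): 13 features, binary target
X_demo, y_demo = make_classification(n_samples=1000, n_features=13,
                                     n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

# PCA is fitted on the training data only; predict() reuses that projection
reduced_logreg = make_pipeline(
    PCA(n_components=10),
    LogisticRegression(C=3, max_iter=1000, random_state=43),
)
reduced_logreg.fit(X_tr, y_tr)

acc = accuracy_score(y_te, reduced_logreg.predict(X_te))
print(f"Accuracy with 10 principal components: {acc:.3f}")
```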

### <font color='#3268ca'> **8.3) Random Forest** 

In [None]:
from sklearn.metrics import accuracy_score
import time

Rf_clf_reduced = RandomForestClassifier()

# Time the fit on the PCA-reduced training set
t0 = time.time()
Rf_clf_reduced.fit(X_test_reduced, y_train)
t1 = time.time()
duration = t1 - t0

print("The duration of the training before dimensionality reduction is ", Rf_duration_before)
print("The duration of the training with dimensionality reduction is ", duration)

# Score of the original (non-reduced) Random Forest on the test set
Rf_score_before = accuracy_score(y_test, predict_test_rf)

# Predict with the reduced Random Forest (not with the logistic regression's y_pred);
# the test set is projected with the PCA fitted on the training set
y_pred_rf_reduced = Rf_clf_reduced.predict(pca.transform(X_test))

print("The validation accuracy score before dimensionality reduction is ", Rf_score_before)
print("The validation accuracy score with dimensionality reduction is ", accuracy_score(y_test, y_pred_rf_reduced))

Dimensionality reduction slightly lowered the accuracy score and slightly increased the training time. 

**Unlike for the logistic regression, dimensionality reduction is not useful at all for the Random Forest model.**    

## <font color='darkblue'> **9. Results Interpretation and ideas to go further and perfect the algorithm** </font>


I decided to choose the **Logistic Regression model**. This model can tell you with **82%** accuracy whether or not you will match with someone, using only your personal features (your gender, attractiveness, intelligence, sincerity, fun and ambition) and the other person's features (their attractiveness, intelligence, sincerity, fun and ambition). Isn't it great? 


The main reason I chose this model is that it successfully avoided both overfitting and underfitting. The Random Forest classifier is clearly overfitting, which means high variance: it performs much better on the training data than on unseen data. 

Still, to perfect this algorithm, more data is mandatory. In fact, the confusion matrix was not really satisfying. As mentioned earlier, 80 matches were falsely classified as non-matches. This is very problematic! Imagine passing by the love of your life because of an algorithm... In addition to being larger, the data set should be more balanced to generate better results. This means it should contain almost as many matches as fails *(also called "non-matches")*.  
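To illustrate why accuracy alone can hide this problem, here is a tiny hand-made example (the labels are made up for illustration and are not the notebook's results). It reads the false negatives, meaning matches predicted as non-matches, out of a confusion matrix and compares accuracy with recall on the "match" class.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels (assumption): 1 = match, 0 = non-match, matches are the minority
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_hat = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_hat).ravel())

accuracy = accuracy_score(y_true, y_hat)
recall = recall_score(y_true, y_hat)  # share of real matches that were found

print("missed matches (false negatives):", fn)
print(f"accuracy = {accuracy:.2f}, recall on matches = {recall:.2f}")
```

Here the accuracy looks decent while most real matches are missed, which is the situation described above on an imbalanced data set.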

Unfortunately, I wasn't able to plot the results of this algorithm, even after applying dimensionality reduction, because the number of features is greater than 2. 

Even if there is still a lot of work to do to perfect this algorithm, I feel like we learned a lot about dating, and especially about men's and women's dating behaviours. 
Some of the most interesting conclusions:  
- In speed dating, people are really willing to have a good time and to get to know others, but the sincerity of some of them might be questionable.
- Men tend to be rejected more often than women or, put differently, women are pickier than men.
- Finding love can be very grueling, and you might have to go on numerous boring dates to find *the one* (provided he/she exists...). 
- Attractiveness, fun, and shared interests make people want to go on a second date. Participants should focus on them to make a great first impression in 4 minutes.  
- Liking somebody is very subjective. As the saying goes: there is no accounting for taste! The same holds for attractiveness and fun. This means that to find love, people just need to be themselves, so that they meet people who share their tastes and therefore fully appreciate their value.   