In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

dating_data_clean = pd.read_csv('dating_data_clean.csv')
dating_data_clean.shape

(8378, 74)

**Changing variables to category type**

In [96]:
dating_data_clean['match'] = dating_data_clean['match'].astype('category')
dating_data_clean['dec'] = dating_data_clean['dec'].astype('category')
dating_data_clean['dec_o'] = dating_data_clean['dec_o'].astype('category')

# Machine Learning

## 1) Predicting based on important attributes at sign up


Each participant rated how important six attributes (attractive, sincere, intelligent, ambitious, fun, shared interests) are to them in a partner, on a scale from 1 to 10.
The code below predicts a match based on these ratings.

### Model: KNN classifier

**Using all 6 attributes:**

In [109]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
import numpy as np
from sklearn.model_selection import train_test_split

all_1_1 = dating_data_clean[['attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1']]
three_1_1 = dating_data_clean[['attr1_1', 'fun1_1', 'shar1_1']]
two_1_1 = dating_data_clean[['attr1_1', 'shar1_1']]
target = np.ravel(dating_data_clean['match'])

# Setup the hyperparameter grid
k = [x for x in range(1,30)]
param_grid = {'n_neighbors': k}

## For all features:

# Split into training and test set
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(all_1_1, target, test_size = 0.2, random_state=42, stratify=target)

knn_a = KNeighborsClassifier()

knn_cv_a = GridSearchCV(knn_a, param_grid, cv=5)

# train for all features:
knn_cv_a.fit(X_train_a, y_train_a)


GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

Question for Andrei: what does stratify do?
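
As a minimal sketch of what `stratify` does: passing `stratify=target` asks `train_test_split` to keep the class proportions of `target` the same in the training and test sets. The example below uses a small synthetic label array (an assumption for illustration, not the dating data) and compares a stratified split with an unstratified one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (illustrative only): 80 zeros and 20 ones
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

# With stratify, each split keeps the 80/20 class ratio of y_demo
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Without stratify, the ratio in the test set depends on the random draw
_, _, y_tr_r, y_te_r = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("stratified test share of class 1:", y_te_s.mean())
print("unstratified test share of class 1:", y_te_r.mean())
```

With an imbalanced target such as `match`, this keeps the share of matches in the test set close to the share in the full data, so the test accuracy is not distorted by an unlucky draw.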

In [110]:
# Get best parameter
knn_cv_a.best_params_

{'n_neighbors': 26}

In [111]:
# Get best score
knn_cv_a.best_score_

0.83572068039391223

In [112]:
# Predict on training data to compare with score
y_pred_all = knn_cv_a.predict(X_train_a)
y_pred_all

array([0, 0, 0, ..., 0, 0, 0])

In [113]:
accuracy_train_all = knn_cv_a.score(X_train_a, y_train_a)
accuracy_train_all

0.83661593554162939

Question to Andrei: is score different than my prediction on the training data because of cross validation 5-fold?
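
A hedged sketch of why the two numbers can differ: `best_score_` is the mean accuracy over the 5 held-out folds for the best `n_neighbors`, while `score(X_train, y_train)` evaluates the refit model on the same rows it was trained on. The data below is synthetic (`make_classification`), so the values are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the notebook's features (illustrative only)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}, cv=5)
search.fit(X_tr, y_tr)

# best_score_: mean accuracy over the 5 held-out folds for the best k
cv_mean = search.cv_results_["mean_test_score"][search.best_index_]

# score(X_tr, y_tr): accuracy of the refit model on the data it was trained on
train_acc = search.score(X_tr, y_tr)

print(search.best_score_, cv_mean, train_acc)
```

Because the cross-validated score is computed on rows the model has not seen within each fold, it is usually the more honest of the two for comparing hyperparameters; the test set then gives a final check.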

In [114]:
# Predict on test data
y_pred_all = knn_cv_a.predict(X_test_a)
accuracy_all = knn_cv_a.score(X_test_a, y_test_a)
accuracy_all

0.83591885441527447

**Using 3 attributes identified in EDA with major correlation:**

In [115]:
## For 3 features:

# Setup the hyperparameter grid
k = [x for x in range(1,30)]
param_grid = {'n_neighbors': k}

# Split into training and test set
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(three_1_1, target, test_size = 0.2, random_state=42, stratify=target)

knn_3 = KNeighborsClassifier()

knn_cv_3 = GridSearchCV(knn_3, param_grid, cv=5)

# train for 3 features:
knn_cv_3.fit(X_train_3, y_train_3)

GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [116]:
knn_cv_3.best_params_

{'n_neighbors': 20}

In [117]:
knn_cv_3.best_score_

0.83497463443748132

In [118]:
# Predict on test data
y_pred_3 = knn_cv_3.predict(X_test_3)
accuracy_3 = knn_cv_3.score(X_test_3, y_test_3)
accuracy_3

0.8353221957040573

**Using 2 attributes identified in EDA with major correlation:**

In [119]:
## For 2 features

# Setup the hyperparameter grid
k = [x for x in range(1,30)]
param_grid = {'n_neighbors': k}

# Split into training and test set
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(two_1_1, target, test_size = 0.2, random_state=42, stratify=target)

knn_2 = KNeighborsClassifier()

knn_cv_2 = GridSearchCV(knn_2, param_grid, cv=5)

# train for 2 features:
knn_cv_2.fit(X_train_2, y_train_2)

GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [120]:
knn_cv_2.best_params_

{'n_neighbors': 22}

In [121]:
knn_cv_2.best_score_

0.83527305282005371

In [122]:
# Predict on test data
y_pred_2 = knn_cv_2.predict(X_test_2)
accuracy_2 = knn_cv_2.score(X_test_2, y_test_2)
accuracy_2

0.8353221957040573

**Using 1 attribute identified in EDA with major correlation:**

In [123]:
## For 1 feature:

one_1_1 = dating_data_clean[['attr1_1']]

# Setup the hyperparameter grid
k = [x for x in range(1,30)]
param_grid = {'n_neighbors': k}

# Split into training and test set
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(one_1_1, target, test_size = 0.2, random_state=42, stratify=target)

knn_1 = KNeighborsClassifier()

knn_cv_1 = GridSearchCV(knn_1, param_grid, cv=5)

# train for 1 feature:
knn_cv_1.fit(X_train_1, y_train_1)

GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [124]:
knn_cv_1.best_params_

{'n_neighbors': 22}

In [125]:
knn_cv_1.best_score_

0.83527305282005371

In [126]:
# Predict on test data
y_pred_1 = knn_cv_1.predict(X_test_1)
accuracy_1 = knn_cv_1.score(X_test_1, y_test_1)
accuracy_1

0.8353221957040573

Question to Andrei: is it possible that 3, 2 and 1 features give exactly the same score? Should best score or test data accuracy be used as indicator of better model?
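
One way to interpret identical scores is to compare them with a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on made-up label counts (an assumption for illustration); if a real model scores the same as this baseline, it may simply be predicting the most frequent class for every row.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic imbalanced labels (illustrative only): 835 zeros, 165 ones
y_demo = np.array([0] * 835 + [1] * 165)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_acc = baseline.score(X_demo, y_demo)
print(baseline_acc)
```

In this notebook the equivalent check would be something like `DummyClassifier(strategy="most_frequent").fit(X_train_1, y_train_1).score(X_test_1, y_test_1)`, compared against `accuracy_1`, `accuracy_2` and `accuracy_3`.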

Note: the model scores slightly higher on the test data with all 6 features (0.8359) than with 3, 2 or 1 features (0.8353 each).

### Model: Naive Bayes

**Using all 6 attributes:**

In [137]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

## For all features:

clf_a = MultinomialNB().fit(X_train_a, y_train_a)

clf_a.score(X_test_a, y_test_a)

0.8353221957040573

In [138]:
# Predict on test data
y_pred_a = clf_a.predict(X_test_a)
accuracy_a = clf_a.score(X_test_a, y_test_a)
accuracy_a

0.8353221957040573

**Using 3 attributes identified in EDA with major correlation:**

In [139]:
## 3 features:

clf_3 = MultinomialNB().fit(X_train_3, y_train_3)

clf_3.score(X_test_3, y_test_3)

0.8353221957040573

In [140]:
# Predict on test data
y_pred_3 = clf_3.predict(X_test_3)
accuracy_3 = clf_3.score(X_test_3, y_test_3)
accuracy_3

0.8353221957040573

**Using 2 attributes identified in EDA with major correlation:**

In [141]:
## 2 features:

clf_2 = MultinomialNB().fit(X_train_2, y_train_2)

clf_2.score(X_test_2, y_test_2)

0.8353221957040573

In [142]:
y_pred_2 = clf_2.predict(X_test_2)
accuracy_2 = clf_2.score(X_test_2, y_test_2)
accuracy_2

0.8353221957040573

**Using 1 attribute identified in EDA with major correlation:**

In [143]:
## 1 feature:

clf_1 = MultinomialNB().fit(X_train_1, y_train_1)

clf_1.score(X_test_1, y_test_1)

0.8353221957040573

In [144]:
y_pred_1 = clf_1.predict(X_test_1)
accuracy_1 = clf_1.score(X_test_1, y_test_1)
accuracy_1

0.8353221957040573

Question for Andrei: could all combination of features give the same score?
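
A quick way to investigate this is to look at the predictions themselves rather than only the accuracy. The sketch below uses hand-made arrays (assumptions for illustration) to show how `np.unique`, `np.array_equal` and `confusion_matrix` reveal whether two models make exactly the same predictions and whether they ever predict the minority class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels and predictions (illustrative only)
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
pred_model_a = np.zeros(10, dtype=int)  # e.g. a model that never predicts a match
pred_model_b = np.zeros(10, dtype=int)  # a second model with the same behaviour

# Which labels does each model actually predict, and how often?
labels_a, counts_a = np.unique(pred_model_a, return_counts=True)

# Do the two models agree on every row?
same_predictions = np.array_equal(pred_model_a, pred_model_b)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, pred_model_a, labels=[0, 1])
print(labels_a, counts_a, same_predictions)
print(cm)
```

Applied to the cells above, the same calls could be run on `y_pred_a`, `y_pred_3`, `y_pred_2` and `y_pred_1` against the matching `y_test_*` arrays.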

Observation: KNN with all 6 features and n_neighbors = 26 gave a slightly better score on the test data (0.8359) than NB (0.8353)

## 2) Predicting based on self assessment at sign up

Each participant rated themselves on the same attributes, on a scale from 1 to 10.
The code below predicts a match based on these ratings, which could indicate how self-esteem or self-awareness influences a match.

Shared interests is not part of this set, so only 5 attributes are available.

### Model: KNN classifier

**Using all 5 attributes:**

In [146]:
all_3_1 = dating_data_clean[['attr3_1', 'sinc3_1', 'intel3_1', 'fun3_1', 'amb3_1']]


# Setup the hyperparameter grid
k = [x for x in range(1,30)]
param_grid = {'n_neighbors': k}

## For all features:

# Split into training and test set
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(all_3_1, target, test_size = 0.2, random_state=52, stratify=target)

knn_a = KNeighborsClassifier()

knn_cv_a = GridSearchCV(knn_a, param_grid, cv=5)

# train for all features:
knn_cv_a.fit(X_train_a, y_train_a)

GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [147]:
# Get best parameter
knn_cv_a.best_params_

{'n_neighbors': 19}

In [148]:
# Get best score
knn_cv_a.best_score_

0.83646672635034314

In [149]:
# Predict on training data to compare with score
y_pred_all = knn_cv_a.predict(X_train_a)
accuracy_all = knn_cv_a.score(X_train_a, y_train_a)
accuracy_all

0.83900328260220824

In [150]:
# Predict on test data
y_pred_all = knn_cv_a.predict(X_test_a)
accuracy_all = knn_cv_a.score(X_test_a, y_test_a)
accuracy_all

0.83353221957040569

**Using 2 attributes identified in EDA with major correlation:**

In [151]:
## For 2 features

two_3_1 = dating_data_clean[['attr3_1', 'fun3_1']]

# Setup the hyperparameter grid
k = [x for x in range(1,30)]
param_grid = {'n_neighbors': k}

# Split into training and test set
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(two_3_1, target, test_size = 0.2, random_state=42, stratify=target)

knn_2 = KNeighborsClassifier()

knn_cv_2 = GridSearchCV(knn_2, param_grid, cv=5)

# train for 2 features:
knn_cv_2.fit(X_train_2, y_train_2)

GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [152]:
knn_cv_2.best_params_

{'n_neighbors': 28}

In [153]:
knn_cv_2.best_score_

0.83407937928976428

In [154]:
# Predict on test data
y_pred_2 = knn_cv_2.predict(X_test_2)
accuracy_2 = knn_cv_2.score(X_test_2, y_test_2)
accuracy_2

0.8353221957040573

**Using 1 attribute identified in EDA with major correlation:**

In [155]:
## For 1 feature:

one_3_1 = dating_data_clean[['attr3_1']]

# Setup the hyperparameter grid
k = [x for x in range(1,30)]
param_grid = {'n_neighbors': k}

# Split into training and test set
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(one_3_1, target, test_size = 0.2, random_state=42, stratify=target)

knn_1 = KNeighborsClassifier()

knn_cv_1 = GridSearchCV(knn_1, param_grid, cv=5)

# train for 1 feature:
knn_cv_1.fit(X_train_1, y_train_1)

GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [156]:
knn_cv_1.best_params_

{'n_neighbors': 13}

In [157]:
knn_cv_1.best_score_

0.83527305282005371

In [158]:
# Predict on test data
y_pred_1 = knn_cv_1.predict(X_test_1)
accuracy_1 = knn_cv_1.score(X_test_1, y_test_1)
accuracy_1

0.83412887828162297

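Conclusion: on the test data, two features (attractive and fun) scored slightly higher (0.8353) than all 5 features (0.8335) or attractiveness alone (0.8341). Note that the 5-feature split uses random_state=52 while the other two use random_state=42.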

### Model: Logistic regression

Trying logistic regression as a model to compare with KNN.

**Using all 5 attributes:**

In [159]:
## For all features:

# Import necessary modules
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Create training and test sets
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(all_3_1, target, test_size = 0.4, random_state=42)

# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression()

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the data
logreg_cv.fit(X_train_a, y_train_a)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))




Tuned Logistic Regression Parameters: {'C': 1.0000000000000001e-05}


In [160]:
print("Best score is {}".format(logreg_cv.best_score_))

Best score is 0.8304814962196578


**Using 2 attributes identified in EDA with major correlation:**

In [164]:
## For 2 features:

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Create training and test sets
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(two_3_1, target, test_size = 0.4, random_state=42)

# Instantiate a logistic regression classifier: logreg_2
logreg_2 = LogisticRegression()

# Instantiate the GridSearchCV object: logreg_cv_2
logreg_cv_2 = GridSearchCV(logreg_2, param_grid, cv=5)

# Fit it to the data
logreg_cv_2.fit(X_train_2, y_train_2)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv_2.best_params_))



Tuned Logistic Regression Parameters: {'C': 1.0000000000000001e-05}


In [165]:
print("Best score is {}".format(logreg_cv_2.best_score_))

Best score is 0.8304814962196578


**Using 1 attribute identified in EDA with major correlation:**

In [167]:
## For 1 feature:

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Create training and test sets
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(one_3_1, target, test_size = 0.4, random_state=42)

# Instantiate a logistic regression classifier: logreg_1
logreg_1 = LogisticRegression()

# Instantiate the GridSearchCV object: logreg_cv_1
logreg_cv_1 = GridSearchCV(logreg_1, param_grid, cv=5)

# Fit it to the data
logreg_cv_1.fit(X_train_1, y_train_1)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv_1.best_params_))

Tuned Logistic Regression Parameters: {'C': 1.0000000000000001e-05}


In [168]:
print("Best score is {}".format(logreg_cv_1.best_score_))

Best score is 0.8304814962196578


Question to Andrei: discuss why all features have the same score
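
One possible explanation worth checking: the tuned value reported in all three cells is C = 1e-05, the smallest value in the grid, and in scikit-learn a small `C` means strong regularization. The sketch below, on synthetic data (an assumption for illustration), compares the coefficients of a strongly and a weakly regularized `LogisticRegression`. With coefficients shrunk toward zero the features barely affect the prediction, which would make the score insensitive to which features are used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (illustrative only): the label depends on the first feature
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=1000) > 1.0).astype(int)

# Small C means strong regularization; large C means weak regularization
strong = LogisticRegression(C=1e-5).fit(X_demo, y_demo)
weak = LogisticRegression(C=1e3).fit(X_demo, y_demo)

max_coef_strong = np.abs(strong.coef_).max()
max_coef_weak = np.abs(weak.coef_).max()
print(max_coef_strong, max_coef_weak)
```

In the notebook, inspecting `logreg_cv.best_estimator_.coef_` and the predicted labels on the test set would show whether the same effect is at work here.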

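Conclusion: KNN's best cross-validation score (0.8365 with all 5 features) is higher than logistic regression's (0.8305). The two are not directly comparable, because the logistic regression cells use test_size = 0.4 without stratify and are not scored on the test set.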

## 3) Predicting based on important attributes at sign up for both participants

Same importance ratings as in item 1, but combining the ratings given by both participants in each date.

**Using all 6 attributes for both participants (12 features):**

A model with fewer features was not created, because the previous models scored highest with all the attributes.

In [182]:
features_all = dating_data_clean[['attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1', 'pf_o_att', 'pf_o_sin', 'pf_o_int', 'pf_o_fun', 'pf_o_amb', 'pf_o_sha']]

In [177]:
# Setup the hyperparameter grid
k = [x for x in range(1,30)]
param_grid = {'n_neighbors': k}

## For all features:

# Split into training and test set
X_train_fa, X_test_fa, y_train_fa, y_test_fa = train_test_split(features_all, target, test_size = 0.2, random_state=42, stratify=target)

knn_fa = KNeighborsClassifier()

knn_cv_fa = GridSearchCV(knn_fa, param_grid, cv=5)

# train for all features:
knn_cv_fa.fit(X_train_fa, y_train_fa)


GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [178]:
# Get best parameter
knn_cv_fa.best_params_

{'n_neighbors': 21}

In [179]:
# Get best score
knn_cv_fa.best_score_

0.83572068039391223

In [180]:
# Predict on training data to compare with score
y_pred_fa = knn_cv_fa.predict(X_train_fa)
accuracy_fa = knn_cv_fa.score(X_train_fa, y_train_fa)
accuracy_fa

0.83572068039391223

In [181]:
# Predict on test data
y_pred_test_fa = knn_cv_fa.predict(X_test_fa)
accuracy_fa = knn_cv_fa.score(X_test_fa, y_test_fa)
accuracy_fa

0.83591885441527447

Conclusion: this performs as well as using only one participant's ratings (item 1); the test score is 0.8359 in both cases.

## 4) Predicting based on ratings the night of event

Each participant rated each date on 6 attributes (attractive, sincere, intelligent, ambitious, fun, shared interests), on a scale from 1 to 10.

The code below predicts a match based on these ratings.

**Using all 6 attributes for each of the partners (12 features):**

In [184]:
features_rat = dating_data_clean[['attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o']]
target_rat = dating_data_clean['match']

# Setup the hyperparameter grid
k = [x for x in range(1,30)]
param_grid = {'n_neighbors': k}

## For all features:

# Split into training and test set
X_train_rat, X_test_rat, y_train_rat, y_test_rat = train_test_split(features_rat, target_rat, test_size = 0.2, random_state=52, stratify=target_rat)

knn_rat = KNeighborsClassifier()

knn_cv_rat = GridSearchCV(knn_rat, param_grid, cv=5)

# train for all features:
knn_cv_rat.fit(X_train_rat, y_train_rat)

GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [185]:
# Get best parameter
knn_cv_rat.best_params_

{'n_neighbors': 28}

In [186]:
# Get best score
knn_cv_rat.best_score_

0.84079379289764244

In [187]:
# Predict on training data to compare with score
y_pred_rat = knn_cv_rat.predict(X_train_rat)
accuracy_rat = knn_cv_rat.score(X_train_rat, y_train_rat)
accuracy_rat

0.84601611459265891

In [188]:
# Predict on test data
y_pred_test_rat = knn_cv_rat.predict(X_test_rat)
accuracy_test_rat = knn_cv_rat.score(X_test_rat, y_test_rat)
accuracy_test_rat

0.83711217183770881

**Using 1 attribute (attractiveness):**

In [189]:
features_attr_rat = dating_data_clean[['attr', 'attr_o']]
target_att_rat = dating_data_clean['match']

# Setup the hyperparameter grid
k = [x for x in range(1,30)]
param_grid = {'n_neighbors': k}

## For attractiveness only:

# Split into training and test set
X_train_rat, X_test_rat, y_train_rat, y_test_rat = train_test_split(features_attr_rat, target_att_rat, test_size = 0.2, random_state=52, stratify=target_att_rat)

knn_rat = KNeighborsClassifier()

knn_cv_rat = GridSearchCV(knn_rat, param_grid, cv=5)

# train on the attractiveness ratings:
knn_cv_rat.fit(X_train_rat, y_train_rat)

GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [190]:
# Predict on test data
y_pred_test_rat = knn_cv_rat.predict(X_test_rat)
accuracy_test_rat = knn_cv_rat.score(X_test_rat, y_test_rat)
accuracy_test_rat

0.83412887828162297

**Using 1 attribute (shared interests):**

In [191]:
features_shar_rat = dating_data_clean[['shar', 'shar_o']]
target_shar_rat = dating_data_clean['match']

# Setup the hyperparameter grid
k = [x for x in range(1,30)]
param_grid = {'n_neighbors': k}

## For shared interests only:

# Split into training and test set
X_train_rat, X_test_rat, y_train_rat, y_test_rat = train_test_split(features_shar_rat, target_shar_rat, test_size = 0.2, random_state=52, stratify=target_shar_rat)

knn_rat = KNeighborsClassifier()

knn_cv_rat = GridSearchCV(knn_rat, param_grid, cv=5)

# train on the shared interests ratings:
knn_cv_rat.fit(X_train_rat, y_train_rat)

GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [192]:
# Predict on test data
y_pred_test_rat = knn_cv_rat.predict(X_test_rat)
accuracy_test_rat = knn_cv_rat.score(X_test_rat, y_test_rat)
accuracy_test_rat

0.83472553699284013

**Using 2 attributes (attractiveness and shared interests):**

In [193]:
features_shar_rat = dating_data_clean[['attr', 'attr_o', 'shar', 'shar_o']]
target_shar_rat = dating_data_clean['match']

# Setup the hyperparameter grid
k = [x for x in range(1,30)]
param_grid = {'n_neighbors': k}

## For attractiveness and shared interests:

# Split into training and test set
X_train_rat, X_test_rat, y_train_rat, y_test_rat = train_test_split(features_shar_rat, target_shar_rat, test_size = 0.2, random_state=52, stratify=target_shar_rat)

knn_rat = KNeighborsClassifier()

knn_cv_rat = GridSearchCV(knn_rat, param_grid, cv=5)

# train on the attractiveness and shared interests ratings:
knn_cv_rat.fit(X_train_rat, y_train_rat)

GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [194]:
# Predict on test data
y_pred_test_rat = knn_cv_rat.predict(X_test_rat)
accuracy_test_rat = knn_cv_rat.score(X_test_rat, y_test_rat)
accuracy_test_rat

0.83233890214797135

**Using 3 attributes (attractiveness, fun, shared interests):**

In [195]:
features_shar_rat = dating_data_clean[['attr', 'attr_o', 'shar', 'shar_o', 'fun', 'fun_o']]
target_shar_rat = dating_data_clean['match']

# Setup the hyperparameter grid
k = [x for x in range(1,30)]
param_grid = {'n_neighbors': k}

## For attractiveness, fun and shared interests:

# Split into training and test set
X_train_rat, X_test_rat, y_train_rat, y_test_rat = train_test_split(features_shar_rat, target_shar_rat, test_size = 0.2, random_state=52, stratify=target_shar_rat)

knn_rat = KNeighborsClassifier()

knn_cv_rat = GridSearchCV(knn_rat, param_grid, cv=5)

# train on the attractiveness, fun and shared interests ratings:
knn_cv_rat.fit(X_train_rat, y_train_rat)

GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [196]:
# Predict on test data
y_pred_test_rat = knn_cv_rat.predict(X_test_rat)
accuracy_test_rat = knn_cv_rat.score(X_test_rat, y_test_rat)
accuracy_test_rat

0.83591885441527447

Conclusion: the model has a better test score when all 12 features are used (0.8371) than with the attribute subsets (0.8323 to 0.8359).
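
The cells in this item repeat the same steps: split, grid search over `n_neighbors`, score on the test set. A minimal sketch of how they could be wrapped in one function follows; `evaluate_knn` is a hypothetical helper (not part of the notebook or of scikit-learn), and the data is synthetic so the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier


def evaluate_knn(X, y, k_values=range(1, 30), test_size=0.2, random_state=52):
    """Hypothetical helper: split, grid-search n_neighbors, report scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    search = GridSearchCV(
        KNeighborsClassifier(), {"n_neighbors": list(k_values)}, cv=5
    )
    search.fit(X_train, y_train)
    return {
        "best_k": search.best_params_["n_neighbors"],
        "cv_score": search.best_score_,
        "test_score": search.score(X_test, y_test),
    }


# Synthetic stand-in for the ratings data (illustrative only)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
result = evaluate_knn(X_demo, y_demo, k_values=range(1, 10))
print(result)
```

With the notebook's data, a call such as `evaluate_knn(dating_data_clean[['attr', 'attr_o']], dating_data_clean['match'])` would correspond to the attractiveness-only cell, and using one function for every subset keeps the split and grid identical across comparisons.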

### Model: Naive Bayes

**Using the split from the previous cell (attractiveness, fun and shared interests for both partners, 6 features):**

In [197]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

## X_train_rat and X_test_rat still hold the 3-attribute split from the previous cell:

clf_a = MultinomialNB().fit(X_train_rat, y_train_rat)

clf_a.score(X_test_rat, y_test_rat)

0.8353221957040573

In [199]:
# Predict on test data
y_pred_a = clf_a.predict(X_test_rat)
accuracy_a = clf_a.score(X_test_rat, y_test_rat)
accuracy_a

0.8353221957040573

### Model: Decision Trees

In [204]:
# Import necessary modules
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Setup the parameters and distributions to sample from: param_dist
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}


features_rat = dating_data_clean[['attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o']]
target_rat = dating_data_clean['match']
X_train_rat, X_test_rat, y_train_rat, y_test_rat = train_test_split(features_rat, target_rat, test_size = 0.2, random_state=52, stratify=target_rat)


# Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()

# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)

# Fit it to the data
tree_cv.fit(X_train_rat, y_train_rat)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))

Tuned Decision Tree Parameters: {'criterion': 'entropy', 'max_depth': 3, 'max_features': 6, 'min_samples_leaf': 2}
Best score is 0.8352730528200537
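
`RandomizedSearchCV` draws a fixed number of random parameter combinations (`n_iter`, 10 by default) instead of trying every combination, so the tuned parameters above can change between runs. A hedged sketch of how the search can be made repeatable by passing `random_state` follows; the data is synthetic and `run_search` is a hypothetical helper used only for this illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 12 rating features (illustrative only)
X_demo, y_demo = make_classification(n_samples=400, n_features=12, random_state=0)

param_dist = {
    "max_depth": [3, None],
    "max_features": randint(1, 9),      # integers 1..8 (upper bound excluded)
    "min_samples_leaf": randint(1, 9),  # integers 1..8
    "criterion": ["gini", "entropy"],
}


def run_search(seed):
    # n_iter controls how many random combinations are tried
    search = RandomizedSearchCV(
        DecisionTreeClassifier(random_state=seed),
        param_dist,
        n_iter=10,
        cv=5,
        random_state=seed,
    )
    search.fit(X_demo, y_demo)
    return search.best_params_


params_first = run_search(0)
params_second = run_search(0)
print(params_first)
```

Running the search twice with the same seed returns the same parameters, which makes the comparison with KNN easier to reproduce.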


In [205]:
# Predict on test data
y_pred_tree_rat = tree_cv.predict(X_test_rat)
accuracy_tree_rat = tree_cv.score(X_test_rat, y_test_rat)
accuracy_tree_rat

0.8353221957040573

Conclusion: KNN had a better test score (0.8371 with all 12 features) than Naive Bayes (0.8353) or the decision tree (0.8353)

## 5) Predicting based on 'like' scale

Each participant, when asked whether they liked the other participant on the date, selected a number on a scale from 1 to 10.

The code below predicts a match based on the like scale for both partners (2 features).

In [206]:
features_like = dating_data_clean[['like', 'like_o']]
target_like = dating_data_clean['match']

# Setup the hyperparameter grid
k = [x for x in range(1,30)]
param_grid = {'n_neighbors': k}

## For all features:

# Split into training and test set
X_train_like, X_test_like, y_train_like, y_test_like = train_test_split(features_like, target_like, test_size = 0.2, random_state=52, stratify=target_like)

knn_like = KNeighborsClassifier()

knn_cv_like = GridSearchCV(knn_like, param_grid, cv=5)

# train for all features:
knn_cv_like.fit(X_train_like, y_train_like)

GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [207]:
# Get best parameter
knn_cv_like.best_params_

{'n_neighbors': 28}

In [208]:
# Predict on training data to compare with score
accuracy_like = knn_cv_like.score(X_train_like, y_train_like)
accuracy_like

0.8488510892270964

In [209]:
# Predict on test data
accuracy_test_like = knn_cv_like.score(X_test_like, y_test_like)
accuracy_test_like

0.84367541766109788

# Findings:

Best features for predicting a match:

    1) like scale collected during the event after each date (score on test data: 0.8437)

    2) evaluation of each date at the time of the event by both participants (score on test data: 0.8371)

    3) 6 attributes important to the participant at sign up (score on test data: 0.8359)

    4) self evaluation of the participant based on 2 attributes (attractiveness and fun) at sign up (score on test data: 0.8353)

The above indicates that, as soon as participants sign up, a match can be predicted with an accuracy of about 0.835 on the test data, based on how important the attributes are to them and on how they evaluate themselves.

Predictions are slightly more accurate (0.8437) when based on how much the participants liked each other during the event.

Regarding the model, the KNN classifier gave better scores than Naive Bayes, Logistic Regression or Decision Trees.
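
To put the test scores listed above in perspective, the sketch below converts each accuracy back into a number of correctly classified test rows. It assumes every test set holds 20% of the 8378 rows reported by `dating_data_clean.shape`, with the test size rounded up as `train_test_split` does; the labels in the dictionary are shorthand introduced here.

```python
import math

n_rows = 8378                       # rows reported by dating_data_clean.shape
n_test = math.ceil(0.2 * n_rows)    # train_test_split rounds the test size up

# Test accuracies copied from the cells above
test_accuracy = {
    "like scale": 0.84367541766109788,
    "ratings on the night": 0.83711217183770881,
    "importance at sign up": 0.83591885441527447,
    "self assessment (2 features)": 0.8353221957040573,
}

# Convert each accuracy back into a count of correctly classified test rows
correct_rows = {name: round(acc * n_test) for name, acc in test_accuracy.items()}
print(n_test, correct_rows)
```

Under these assumptions the best and worst feature sets differ by 14 test rows, so the ranking rests on small margins and may shift with a different `random_state`.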