# Speed Dating Analysis - Tristan Langley

**This project looks at speed dating survey results. The goal is to find the features that best predict whether someone will decide "yes" on their partner (i.e. want to match with them). Then, we will create prediction models based on these features and test how accurately they predict yes/no decisions.**

**Additional topics explored are the differences between men's and women's decisions, and how accurately people predict their partners' decisions.**

## 1. Data preprocessing
**Speed dating data is in a csv file. There are 195 columns; we will keep only the most "straightforward" ones.
E.g. attractiveness rating, correlation in interests, age difference, etc.**

In [40]:
import pandas as pd

# List of columns I want to keep
col_names = ['gender', 'int_corr', 'samerace', 'age_o', 'dec_o', 'attr_o', 'sinc_o',
            'intel_o', 'fun_o', 'amb_o', 'shar_o', 'prob_o', 'met_o', 'age', 'race', 'dec']

# Load the columns I want into a pandas dataframe
full_df = pd.read_csv('SpeedDatingData.csv', usecols=col_names)

# Drop any rows with NaN value(s)
full_df.dropna(inplace=True)

# Create a new column for difference in age (absolute value)
full_df['d_age'] = full_df.apply(lambda row: abs(row.age - row.age_o), axis = 1)

# Now drop my age from the dataframe -- not going to use this as a predictor, because it does not describe their partner
full_df = full_df.drop(['age_o'], axis=1)

# Separate data (possible predictors) from targets (what we are trying to predict: the decision yes/no)
data = full_df.drop(['gender', 'dec_o', 'dec', 'prob_o', 'age'], axis=1)
target = full_df['dec_o']

# Store the guesses for whether their partner liked them, as a probability between 0 and 1
target_guess = full_df['prob_o']/10

# Make lists of feature column names and map them to longer descriptions
ft_names = data.columns.tolist()
ft_desc = { 'gender': 'Gender', 'int_corr': 'Correlation of Interests', 'samerace': 'Same Race',
            'age_o': 'My Age', 'dec_o': 'My Decision Y/N', 'attr_o': 'Attractiveness Rating',
            'sinc_o': 'Sincerity Rating', 'intel_o': 'Intelligence Rating', 'fun_o': 'Funnyness Rating',
            'amb_o': 'Ambition Rating', 'shar_o': 'Shared Interests Rating',
            'prob_o': "My Guess of Partner's Decision No/Yes", 'met_o': "Have I Met Partner Before Y/N",
            'age': "Partner's Age", 'race': "Partner's Race", 'dec': "Partner's Decision Y/N",
            'd_age': "Age Difference" }

data

Unnamed: 0,int_corr,samerace,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,met_o,race,d_age
0,0.14,0,6.0,8.0,8.0,8.0,8.0,6.0,2.0,4.0,6.0
1,0.54,0,7.0,8.0,10.0,7.0,7.0,5.0,2.0,4.0,1.0
2,0.16,1,10.0,10.0,10.0,10.0,10.0,10.0,1.0,4.0,1.0
3,0.61,0,7.0,8.0,9.0,8.0,9.0,8.0,2.0,4.0,2.0
4,0.21,0,8.0,7.0,9.0,6.0,9.0,7.0,2.0,4.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...
8372,0.28,1,8.0,8.0,7.0,7.0,8.0,6.0,2.0,2.0,1.0
8374,0.71,0,6.0,3.0,7.0,3.0,7.0,2.0,2.0,2.0,1.0
8375,-0.46,0,2.0,1.0,2.0,2.0,2.0,1.0,2.0,2.0,4.0
8376,0.62,0,5.0,7.0,5.0,5.0,3.0,6.0,2.0,2.0,3.0


Description of columns:
- gender: my gender, F=0, M=1
- int_corr: correlation between interests (polled from 1-10 on interests like exercise, dining, museums, gaming, etc.)
- samerace: same race=1, different races=0
- dec_o: my decision, no=0 yes=1
- attr_o: my rating of partner's attractiveness (1-10)
- sinc_o: my rating of partner's sincerity (1-10)
- intel_o: my rating of partner's intelligence (1-10)
- fun_o: my rating of partner's funnyness (1-10)
- amb_o: my rating of partner's ambition (1-10)
- shar_o: my rating of partner on our shared interests (1-10)
- prob_o: how likely do I think it is that my partner decided "yes" on me (1-10)
- met_o: have I met my partner before, no=1 yes=2
- age: partner's age (years)
- race: partner's race (integer)
- dec: partner's decision on me, no=0 yes=1
- d_age: difference in age (negative means partner is younger than me)

## 2. Find best predictors
**Find the qualities that seem to be the best predictors of the yes/no decision. Use calculated correlations
between qualities and decisions, top layers of a decision tree, and top dimensions of principle component
analysis.**

In [41]:
# Calculate correlations between features and targets
corr = data.corrwith(target).sort_values(ascending=False)
print('%-30s %s' %('Quality', "Correlation with a 'Yes' Decision"))
for i in range(len(corr)):
    print('%-30s %.5f' %(ft_desc[corr.index[i]], corr[i]))

Quality                        Correlation with a 'Yes' Decision
Attractiveness Rating          0.48802
Funnyness Rating               0.41175
Shared Interests Rating        0.39771
Intelligence Rating            0.21568
Sincerity Rating               0.21061
Ambition Rating                0.18264
Same Race                      0.03336
Correlation of Interests       0.01003
Age Difference                 -0.04011
Partner's Race                 -0.06513
Have I Met Partner Before Y/N  -0.06769


The participant's rating of their partner's attractiveness is the most correlated with a 'yes' decision, at almost 50% correlation. The next two are funnyness and shared interests, which have similar values for correlation at around 40%. These results are not very surprising, although, I am a bit surprised that age difference did not have a larger magnitude of correlation.

In [42]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Train a decision tree of depth 4
dt4 = DecisionTreeClassifier(max_depth=4)
dt4.fit(data, target)
print(tree.export_text(dt4, feature_names=ft_names))

|--- attr_o <= 6.25
|   |--- shar_o <= 4.50
|   |   |--- attr_o <= 5.50
|   |   |   |--- fun_o <= 5.50
|   |   |   |   |--- class: 0
|   |   |   |--- fun_o >  5.50
|   |   |   |   |--- class: 0
|   |   |--- attr_o >  5.50
|   |   |   |--- race <= 1.50
|   |   |   |   |--- class: 1
|   |   |   |--- race >  1.50
|   |   |   |   |--- class: 0
|   |--- shar_o >  4.50
|   |   |--- attr_o <= 4.50
|   |   |   |--- shar_o <= 6.50
|   |   |   |   |--- class: 0
|   |   |   |--- shar_o >  6.50
|   |   |   |   |--- class: 0
|   |   |--- attr_o >  4.50
|   |   |   |--- fun_o <= 6.50
|   |   |   |   |--- class: 0
|   |   |   |--- fun_o >  6.50
|   |   |   |   |--- class: 0
|--- attr_o >  6.25
|   |--- shar_o <= 4.50
|   |   |--- fun_o <= 5.50
|   |   |   |--- amb_o <= 8.50
|   |   |   |   |--- class: 0
|   |   |   |--- amb_o >  8.50
|   |   |   |   |--- class: 0
|   |   |--- fun_o >  5.50
|   |   |   |--- amb_o <= 5.50
|   |   |   |   |--- class: 1
|   |   |   |--- amb_o >  5.50
|   |   |   |   |---

The decision tree generally shows the same results as the calculated correlations. The top layer of the tree is attractiveness rating, the second layer is shared interests rating, and the third layer includes both attractiveness and funnyness ratings. The entire 4-layer tree is made up of those three ratings, aside from a couple branches which consider ambition rating.

In [43]:
from sklearn.decomposition import PCA
import numpy as np

# Fit a PCA transform using training data
pca = PCA()
pca.fit(data)

# Print the amount of variance that the first three principal components explain
for i in range(3):
    print('PC-%d explains %.1f%% of the variance' %(i, pca.explained_variance_ratio_[i]*100))

# Print dataframe of the weights of each feature on the first three principal components
pd.DataFrame(abs(pca.components_[:3]), columns=data.columns, index=['PC-1','PC-2', 'PC-3'])

PC-0 explains 38.1% of the variance
PC-1 explains 28.1% of the variance
PC-2 explains 9.3% of the variance


Unnamed: 0,int_corr,samerace,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,met_o,race,d_age
PC-1,0.003151,0.006677,0.409262,0.357774,0.325021,0.473156,0.365102,0.483479,0.009594,0.019241,0.086134
PC-2,0.003498,0.012233,0.026449,0.037066,0.02637,0.036382,0.02766,0.052263,0.006026,0.025788,0.995765
PC-3,0.000419,0.023619,0.386479,0.433997,0.449123,0.173803,0.445685,0.455012,0.00544,0.155192,0.004347


Each "principal component" is a combination of various features. Principal component 1 explains about 38.1% of the variance. The heaviest weights in principal component 1 are funnyness and shared interests, followed by attractiveness, then sincerity, intelligence, and ambition. Overall, this supports the general conclusion from the correlations and the decision tree that attractiveness, funnyness, and shared interests are the best predictors of a 'yes' decision.

## 3. Train models to predict yes/no decisions
**Train decision tree, support vector machine, and k-nearest neighbor models using the training data set.
Train one set of models using all features, and another set of models using just the top 3 features from Section 2.**

In [44]:
# Create a new data frame with just the top 3 feature columns: attractiveness, funnyness, and shared interest ratings
data3 = data[['attr_o', 'shar_o', 'fun_o']]

# Split the data into training and testing. Use 20% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=0)
x_train3, x_test3, y_train3, y_test3 = train_test_split(data3, target, test_size=0.2, random_state=0)

In [45]:
# Train decision tree models of depth 6
dt = DecisionTreeClassifier(max_depth=6)
dt.fit(x_train, y_train)
dt3 = DecisionTreeClassifier(max_depth=6)
dt3.fit(x_train3, y_train3)

In [46]:
from sklearn.svm import SVC

# Train SVMs (RBF kernel)
svm = SVC(kernel='rbf') 
svm.fit(x_train, y_train)
svm3 = SVC(kernel='rbf') 
svm3.fit(x_train3, y_train3)

In [47]:
from sklearn.neighbors import KNeighborsClassifier

# Train KNN models with K=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
knn3 = KNeighborsClassifier(n_neighbors=5)
knn3.fit(x_train3, y_train3)

## 4. Compare accuracies of each prediction model
**Score each model on the testing data set to compare their accuracies.**

In [48]:
# Training set scores
print('Scores on the training set, all-feature models')
print('%-20s %.3f' %('Decision Tree:', dt.score(x_train, y_train)))
print('%-20s %.3f' %('Support Vector M.:', svm.score(x_train, y_train)))
print('%-20s %.3f' %('5-Nearest Neighbor:', knn.score(x_train, y_train)))

print('\nScores on the training set, 3-feature models')
print('%-20s %.3f' %('Decision Tree:', dt3.score(x_train3, y_train3)))
print('%-20s %.3f' %('Support Vector M.:', svm3.score(x_train3, y_train3)))
print('%-20s %.3f' %('5-Nearest Neighbor:', knn3.score(x_train3, y_train3)))

Scores on the training set, all-feature models
Decision Tree:       0.765
Support Vector M.:   0.759
5-Nearest Neighbor:  0.804

Scores on the training set, 3-feature models
Decision Tree:       0.756
Support Vector M.:   0.752
5-Nearest Neighbor:  0.737


The all-feature models perform only slightly better on the training set, but all models' scores are around the high 70%'s.

In [49]:
# Testing set scores
print('\nScores on the testing set, all-feature models')
print('%-20s %.3f' %('Decision tree:', dt6.score(x_test, y_test)))
print('%-20s %.3f' %('Support Vector M.:', svm.score(x_test, y_test)))
print('%-20s %.3f' %('5-Nearest Neighbor:', knn.score(x_test, y_test)))

print('\nScores on the testing set, 3-feature models')
print('%-20s %.3f' %('Decision tree:', dt3.score(x_test3, y_test3)))
print('%-20s %.3f' %('Support Vector M.:', svm3.score(x_test3, y_test3)))
print('%-20s %.3f' %('5-Nearest Neighbor:', knn3.score(x_test3, y_test3)))


Scores on the testing set, all-feature models
Decision tree:       0.725
Support Vector M.:   0.727
5-Nearest Neighbor:  0.707

Scores on the testing set, 3-feature models
Decision tree:       0.727
Support Vector M.:   0.720
5-Nearest Neighbor:  0.718


All the models, regardless of if they used all features or only the top 3, produced about 71% accuracy on the testing set. Removing the less important features from the dataset did not significantly worsen accuracy.

## 5. Calculate how accurately participants predicted their partners' decisions

In [51]:
from sklearn.metrics import mean_absolute_error

# Calculate the accuracy
guess_error = mean_absolute_error(target, target_guess)
guess_accuracy = 1 - guess_error

print("Accuracy of participants' predictions: %.3f" %guess_accuracy)

Accuracy of participants' predictions: 0.563


Note that participants did not just guess whether their partner decided 'yes' or 'no' (1 or 0), but rather guessed the probability of a 'yes' decision from their partner.

The accuracy that a participant predicted their partner's decision was only 56.3%. Intuitively, I expected this number to be higher since people tend to be able to gauge whether a conversation went well. However, the models trained in this code predict the decisions significantly more accurately than the participants did.

## 6. Compare best predictors for men vs. women
**Repeat Section 2, but separate the data based on whether the participant is male or female. Find the best predictors for each, and compare.**

In [52]:
# Separate into men/women
full_df_m = full_df[full_df.gender == 1]
full_df_f = full_df[full_df.gender == 0]

# Separate into data and targets
data_m = full_df_m.drop(['gender', 'dec_o', 'dec', 'prob_o', 'age'], axis=1)
target_m = full_df_m['dec_o']
data_f = full_df_f.drop(['gender', 'dec_o', 'dec', 'prob_o', 'age'], axis=1)
target_f = full_df_f['dec_o']

In [53]:
# Find best predictors for men's decisions
print("Results for Men's Decision Data:\n")

# Correlations
corr_m = data_m.corrwith(target_m).sort_values(ascending=False)
print('%-30s %s' %('Quality', 'Correlation with Decision'))
for i in range(len(corr_m)):
    print('%-30s %.5f' %(ft_desc[corr_m.index[i]], corr_m[i]))

# Decision tree
print('\nDecision tree')
dt4_m = DecisionTreeClassifier(max_depth=4)
dt4_m.fit(data_m, target_m)
print(tree.export_text(dt4_m, feature_names=ft_names))

# PCA
print('\nPrincipal component analysis')
pca_m = PCA()
pca_m.fit(data)
pd.DataFrame(abs(pca_m.components_[:3]), columns=data_m.columns, index=['PC-1','PC-2', 'PC-3'])

Results for Men's Decision Data:

Quality                        Correlation with Decision
Attractiveness Rating          0.44245
Funnyness Rating               0.42163
Shared Interests Rating        0.41254
Intelligence Rating            0.24226
Sincerity Rating               0.23081
Ambition Rating                0.18763
Same Race                      0.05962
Correlation of Interests       0.01459
Age Difference                 -0.01099
Have I Met Partner Before Y/N  -0.07548
Partner's Race                 -0.09134

Decision tree
|--- attr_o <= 6.75
|   |--- fun_o <= 6.50
|   |   |--- shar_o <= 4.50
|   |   |   |--- fun_o <= 4.50
|   |   |   |   |--- class: 0
|   |   |   |--- fun_o >  4.50
|   |   |   |   |--- class: 0
|   |   |--- shar_o >  4.50
|   |   |   |--- attr_o <= 4.50
|   |   |   |   |--- class: 0
|   |   |   |--- attr_o >  4.50
|   |   |   |   |--- class: 0
|   |--- fun_o >  6.50
|   |   |--- shar_o <= 3.50
|   |   |   |--- sinc_o <= 7.50
|   |   |   |   |--- class: 0
|   

Unnamed: 0,int_corr,samerace,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,met_o,race,d_age
PC-1,0.003151,0.006677,0.409262,0.357774,0.325021,0.473156,0.365102,0.483479,0.009594,0.019241,0.086134
PC-2,0.003498,0.012233,0.026449,0.037066,0.02637,0.036382,0.02766,0.052263,0.006026,0.025788,0.995765
PC-3,0.000419,0.023619,0.386479,0.433997,0.449123,0.173803,0.445685,0.455012,0.00544,0.155192,0.004347


The top three predictors for men are still attractiveness, funnyness, and shared interests.

In [54]:
# Find best predictors for women's decisions
print("Results for Women's Decision Data:\n")

# Correlations
corr_f = data_f.corrwith(target_f).sort_values(ascending=False)
print('%-30s %s' %('Quality', 'Correlation with Decision'))
for i in range(len(corr_f)):
    print('%-30s %.5f' %(ft_desc[corr_f.index[i]], corr_f[i]))

# Decision tree
print('\nDecision tree')
dt4_f = DecisionTreeClassifier(max_depth=4)
dt4_f.fit(data_f, target_f)
print(tree.export_text(dt4_f, feature_names=ft_names))

# PCA
print('\nPrincipal component analysis')
pca_f = PCA()
pca_f.fit(data)
pd.DataFrame(abs(pca_f.components_[:3]), columns=data_f.columns, index=['PC-1','PC-2', 'PC-3'])

Results for Women's Decision Data:

Quality                        Correlation with Decision
Attractiveness Rating          0.52090
Funnyness Rating               0.39742
Shared Interests Rating        0.38381
Intelligence Rating            0.20653
Ambition Rating                0.20602
Sincerity Rating               0.18457
Same Race                      0.00838
Correlation of Interests       0.00573
Partner's Race                 -0.04748
Have I Met Partner Before Y/N  -0.06179
Age Difference                 -0.06872

Decision tree
|--- attr_o <= 6.25
|   |--- attr_o <= 4.50
|   |   |--- shar_o <= 6.50
|   |   |   |--- race <= 5.00
|   |   |   |   |--- class: 0
|   |   |   |--- race >  5.00
|   |   |   |   |--- class: 0
|   |   |--- shar_o >  6.50
|   |   |   |--- intel_o <= 6.50
|   |   |   |   |--- class: 0
|   |   |   |--- intel_o >  6.50
|   |   |   |   |--- class: 0
|   |--- attr_o >  4.50
|   |   |--- shar_o <= 5.50
|   |   |   |--- sinc_o <= 6.50
|   |   |   |   |--- class: 0


Unnamed: 0,int_corr,samerace,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,met_o,race,d_age
PC-1,0.003151,0.006677,0.409262,0.357774,0.325021,0.473156,0.365102,0.483479,0.009594,0.019241,0.086134
PC-2,0.003498,0.012233,0.026449,0.037066,0.02637,0.036382,0.02766,0.052263,0.006026,0.025788,0.995765
PC-3,0.000419,0.023619,0.386479,0.433997,0.449123,0.173803,0.445685,0.455012,0.00544,0.155192,0.004347


The top three predictors for women are similarly attractiveness, funnyness, and shared interests.

Although the top predictors for men and women are the same, there are some slight numerical differences that can be discussed.

Firstly, for men, the correlations between attractiveness rating and decision, funnyness rating and decision, and shared interests rating and decision had similar values (low 40%'s). In contrast, for women the correlation between attractiveness rating and decision was over 50%, while funnyness rating and shared interests rating trailed behind at below 40% correlation.

Secondly, the attractiveness 'cutoff' for men in the top layer of the decision tree was a 6.75 rating out of 10, while for women it was only a 6.25 rating out of 10.

Although nothing definitive can be stated from just these results, they point towards the idea that while women place slightly more importance on a man's attractiveness in speed dating, they also are less 'harsh' on how attractive the man must be.