# Speed Dating Analysis - Tristan Langley

**This project looks at speed dating survey results. The goal is to find the features that best predict whether someone
will decide "yes" on their partner (i.e. want to match with them). Then, we will create a model based on these features
and test how accurately it predicts yes/no decisions.**

# 1. Data preprocessing
**The speed dating data is stored in a CSV file with 195 columns. We keep only the most "straightforward" ones,
e.g. attractiveness rating, correlation in interests, age difference, etc.**

In [1]:
import pandas as pd

# List of columns I want to keep
# TODO decide if I want to put partner's ranking on each individual interest (e.g. 'exercise', 'dining', etc.)
col_names = ['gender', 'int_corr', 'samerace', 'age_o', 'dec_o', 'attr_o', 'sinc_o',
            'intel_o', 'fun_o', 'amb_o', 'shar_o', 'prob_o', 'met_o', 'age', 'race', 'dec']

# Load the columns I want into a pandas dataframe
full_df = pd.read_csv('SpeedDatingData.csv', usecols=col_names)

# Drop any rows with NaN value(s)
full_df.dropna(inplace=True)

# Create a new column for the absolute difference in age (vectorized; same result as a row-wise apply)
full_df['d_age'] = (full_df['age'] - full_df['age_o']).abs()

# Now drop my age from the dataframe; it is not used as a predictor because it does not describe my partner
full_df = full_df.drop(['age_o'], axis=1)

# Separate data (possible predictors) from targets (what we are trying to predict: the decision yes/no)
data = full_df.drop(['gender', 'dec_o', 'dec', 'prob_o', 'age'], axis=1)
target = full_df['dec_o']
prob_guess = full_df['prob_o']

# Make lists of feature column names and map them to longer descriptions
ft_names = data.columns.tolist()
ft_desc = { 'gender': 'Gender', 'int_corr': 'Correlation of Interests', 'samerace': 'Same Race',
            'age_o': 'My Age', 'dec_o': 'My Decision Y/N', 'attr_o': 'Attractiveness Rating',
            'sinc_o': 'Sincerity Rating', 'intel_o': 'Intelligence Rating', 'fun_o': 'Funnyness Rating',
            'amb_o': 'Ambition Rating', 'shar_o': 'Shared Interests Rating',
            'prob_o': "My Guess of Partner's Decision No/Yes", 'met_o': "Have I Met Partner Before Y/N",
            'age': "Partner's Age", 'race': "Partner's Race", 'dec': "Partner's Decision Y/N",
            'd_age': "Age Difference" }

data

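| | int_corr | samerace | attr_o | sinc_o | intel_o | fun_o | amb_o | shar_o | met_o | race | d_age |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.14 | 0 | 6.0 | 8.0 | 8.0 | 8.0 | 8.0 | 6.0 | 2.0 | 4.0 | 6.0 |
| 1 | 0.54 | 0 | 7.0 | 8.0 | 10.0 | 7.0 | 7.0 | 5.0 | 2.0 | 4.0 | 1.0 |
| 2 | 0.16 | 1 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 1.0 | 4.0 | 1.0 |
| 3 | 0.61 | 0 | 7.0 | 8.0 | 9.0 | 8.0 | 9.0 | 8.0 | 2.0 | 4.0 | 2.0 |
| 4 | 0.21 | 0 | 8.0 | 7.0 | 9.0 | 6.0 | 9.0 | 7.0 | 2.0 | 4.0 | 3.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8372 | 0.28 | 1 | 8.0 | 8.0 | 7.0 | 7.0 | 8.0 | 6.0 | 2.0 | 2.0 | 1.0 |
| 8374 | 0.71 | 0 | 6.0 | 3.0 | 7.0 | 3.0 | 7.0 | 2.0 | 2.0 | 2.0 | 1.0 |
| 8375 | -0.46 | 0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 4.0 |
| 8376 | 0.62 | 0 | 5.0 | 7.0 | 5.0 | 5.0 | 3.0 | 6.0 | 2.0 | 2.0 | 3.0 |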


Description of columns:
- gender: F=0, M=1
- int_corr: correlation between interests (polled from 1-10 on interests like exercise, dining, museums, gaming, etc.)
- samerace: same race=1, different races=0
- dec_o: my decision, no=0 yes=1
- attr_o: my rating of partner's attractiveness (1-10)
- sinc_o: my rating of partner's sincerity (1-10)
- intel_o: my rating of partner's intelligence (1-10)
- fun_o: my rating of partner's funniness (1-10)
- amb_o: my rating of partner's ambition (1-10)
- shar_o: my rating of partner on our shared interests (1-10)
- prob_o: how likely do I think it is that my partner decided "yes" on me (1-10)
- met_o: have I met my partner before, yes=1 no=2
- age: partner's age (years)
- race: partner's race (integer)
- dec: partner's decision on me, no=0 yes=1
- d_age: absolute difference in age between me and my partner (years, always non-negative)

# 2. Find best predictors
**Find the qualities that seem to be the best predictors of a 'Yes' decision. Use calculated correlations
between qualities and decisions, top layers of a decision tree, and top dimensions of principal component
analysis.**

In [2]:
# Calculate correlations between features and targets
corr = data.corrwith(target).sort_values(ascending=False)
print('%-30s %s' %('Quality', 'Correlation with Decision'))
# Iterate over (feature name, correlation) pairs; avoids positional indexing on a label-indexed Series
for name, value in corr.items():
    print('%-30s %.5f' %(ft_desc[name], value))

Quality                        Correlation with Decision
Attractiveness Rating          0.48802
Funnyness Rating               0.41175
Shared Interests Rating        0.39771
Intelligence Rating            0.21568
Sincerity Rating               0.21061
Ambition Rating                0.18264
Same Race                      0.03336
Correlation of Interests       0.01003
Age Difference                 -0.04011
Partner's Race                 -0.06513
Have I Met Partner Before Y/N  -0.06769


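**Attractiveness (0.49), funniness (0.41), and shared interests (0.40) have the strongest correlations with a "yes"
decision. Intelligence, sincerity, and ambition are weaker (about 0.18 to 0.22). Same race, correlation of interests,
age difference, partner's race, and having met before are all close to zero (magnitude below 0.07). Note that race is
an integer category code, so its linear correlation with the decision has no meaningful interpretation.**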
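The ranking above sorts by signed correlation, so a strongly negative feature would end up at the bottom of the list even though it could be a useful predictor. A minimal sketch of ranking by magnitude instead, on synthetic data (the columns `rating` and `noise` and the target rule are made up for illustration and are not part of the speed dating data):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: one informative rating and one pure-noise column (assumed names)
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    'rating': rng.integers(1, 11, size=n).astype(float),
    'noise': rng.normal(size=n),
})

# Made-up decision rule: "yes" is more likely when the rating is high
toy_target = (toy['rating'] + rng.normal(scale=2.0, size=n) > 6).astype(int)

# Pearson correlation of each column with the 0/1 target
toy_corr = toy.corrwith(toy_target)

# Rank by absolute value, but keep the signed values for printing
ranked = toy_corr.reindex(toy_corr.abs().sort_values(ascending=False).index)
for name, value in ranked.items():
    print('%-10s %.5f' % (name, value))
```

The same `reindex` step could be applied to `corr` in the cell above if negative predictors should be ranked by strength.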

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Train a decision tree of depth 4
dt4 = DecisionTreeClassifier(max_depth=4)
dt4.fit(data, target)

# Print the tree
print(tree.export_text(dt4, feature_names=ft_names))



|--- attr_o <= 6.25
|   |--- shar_o <= 4.50
|   |   |--- attr_o <= 5.50
|   |   |   |--- fun_o <= 5.50
|   |   |   |   |--- class: 0
|   |   |   |--- fun_o >  5.50
|   |   |   |   |--- class: 0
|   |   |--- attr_o >  5.50
|   |   |   |--- race <= 1.50
|   |   |   |   |--- class: 1
|   |   |   |--- race >  1.50
|   |   |   |   |--- class: 0
|   |--- shar_o >  4.50
|   |   |--- attr_o <= 4.50
|   |   |   |--- shar_o <= 6.50
|   |   |   |   |--- class: 0
|   |   |   |--- shar_o >  6.50
|   |   |   |   |--- class: 0
|   |   |--- attr_o >  4.50
|   |   |   |--- fun_o <= 6.50
|   |   |   |   |--- class: 0
|   |   |   |--- fun_o >  6.50
|   |   |   |   |--- class: 0
|--- attr_o >  6.25
|   |--- shar_o <= 4.50
|   |   |--- fun_o <= 5.50
|   |   |   |--- amb_o <= 8.50
|   |   |   |   |--- class: 0
|   |   |   |--- amb_o >  8.50
|   |   |   |   |--- class: 0
|   |   |--- fun_o >  5.50
|   |   |   |--- amb_o <= 5.50
|   |   |   |   |--- class: 1
|   |   |   |--- amb_o >  5.50
|   |   |   |   |---

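**The root of the tree splits on attractiveness (attr_o <= 6.25), and the second level splits on shared interests
(shar_o <= 4.50) on both branches. Funniness, ambition, and race only appear in the deeper layers. This agrees with the
correlations above, where attractiveness, funniness, and shared interests ranked highest. Several sibling leaves predict
the same class, so those lower splits change the class proportions in a leaf without changing the predicted decision.**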
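Reading the top layers of a printed tree is one way to rank features; scikit-learn also exposes a numeric summary in the fitted tree's `feature_importances_` attribute. A minimal sketch on synthetic data (the feature names `strong`, `weak`, `noise` and the labeling rule are assumptions for illustration only):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with three features on a 1-10 scale (assumed names)
rng = np.random.default_rng(0)
n = 2000
toy_names = ['strong', 'weak', 'noise']
X = rng.uniform(1, 10, size=(n, 3))

# Made-up rule: the label depends mostly on 'strong', a little on 'weak', never on 'noise'
y = (X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=1.0, size=n) > 7).astype(int)

# Same depth as the tree above; random_state fixed so the result is repeatable
toy_tree = DecisionTreeClassifier(max_depth=4, random_state=0)
toy_tree.fit(X, y)

# Impurity-based importances, normalized to sum to 1
importances = dict(zip(toy_names, toy_tree.feature_importances_))
for name, imp in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print('%-10s %.3f' % (name, imp))
```

On the notebook's data, the equivalent would be `dict(zip(ft_names, dt4.feature_importances_))`.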

In [4]:
from sklearn.decomposition import PCA
import numpy as np

# Fit a PCA transform on the full feature set (no train/test split has been made yet)
pca = PCA()
pca.fit(data)

# Project the full feature set onto the principal components
data_pca = pca.transform(data)

# Compute the proportion of variance explained by each dimension
var_dimensions = pca.explained_variance_ratio_

# Show the absolute weights (loadings) of each feature on the first three principal components
pd.DataFrame(abs(pca.components_[:3]), columns=data.columns, index=['PC-1','PC-2', 'PC-3'])

| | int_corr | samerace | attr_o | sinc_o | intel_o | fun_o | amb_o | shar_o | met_o | race | d_age |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PC-1 | 0.003151 | 0.006677 | 0.409262 | 0.357774 | 0.325021 | 0.473156 | 0.365102 | 0.483479 | 0.009594 | 0.019241 | 0.086134 |
| PC-2 | 0.003498 | 0.012233 | 0.026449 | 0.037066 | 0.02637 | 0.036382 | 0.02766 | 0.052263 | 0.006026 | 0.025788 | 0.995765 |
| PC-3 | 0.000419 | 0.023619 | 0.386479 | 0.433997 | 0.449123 | 0.173803 | 0.445685 | 0.455012 | 0.00544 | 0.155192 | 0.004347 |


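**PC-1 weights the six ratings (attractiveness, sincerity, intelligence, funniness, ambition, shared interests) at
similar levels, between 0.33 and 0.48, so it acts as an overall impression score. PC-2 is almost entirely age
difference (0.996). The features were not standardized before PCA, so a feature with a wide numeric range such as d_age
can take over a component because of its scale alone. PC-3 again mixes the ratings, with a smaller weight on race (0.16).
The table shows absolute values, so the signs of the weights are not visible here. PCA also does not use the decisions,
so these components describe how the features vary together rather than how well they predict a "yes".**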
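PCA is sensitive to the scale of each feature, which matters here because the ratings, `int_corr`, and `d_age` are on different numeric ranges. A minimal sketch of the effect on synthetic data, comparing PCA on raw values with PCA after `StandardScaler` (the three columns are invented for illustration; this is not the notebook's result):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two correlated small-scale columns and one unrelated wide-scale column
rng = np.random.default_rng(0)
n = 1000
base = rng.normal(size=n)
a = base + 0.3 * rng.normal(size=n)
b = base + 0.3 * rng.normal(size=n)
wide = 10 * rng.normal(size=n)  # much larger variance than a or b
X = np.column_stack([a, b, wide])

# First principal component on the raw values: dominated by the wide-scale column
raw_pc1 = np.abs(PCA().fit(X).components_[0])

# First principal component after standardizing every column to unit variance:
# now driven by the two columns that actually vary together
scaled_pc1 = np.abs(PCA().fit(StandardScaler().fit_transform(X)).components_[0])

print('raw    PC-1 weights (a, b, wide):', np.round(raw_pc1, 3))
print('scaled PC-1 weights (a, b, wide):', np.round(scaled_pc1, 3))
```

Applying `StandardScaler` to `data` before `pca.fit` would show whether `d_age` still dominates a component once scale is removed.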

# 3. Train models to predict yes/no decisions
**Train decision tree, support vector machine, and k-nearest neighbor models using
the training data set.**

In [5]:
# Split the data into training and testing. Use 20% of the data for testing
x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=0)

In [6]:
# Train a decision tree model of depth 6
dt6 = DecisionTreeClassifier(max_depth=6)
dt6.fit(x_train, y_train)

In [7]:
from sklearn.svm import SVC

# Train an SVM (RBF kernel)
svm = SVC(kernel='rbf') 
svm.fit(x_train, y_train)

# Check the SVM's performance on the training and testing data sets
print('Score of rbf kernel SVM on training set: %.4f' %svm.score(x_train, y_train))
print('Score of rbf kernel SVM on testing set: %.4f' %svm.score(x_test, y_test))

Score of rbf kernel SVM on training set: 0.7591
Score of rbf kernel SVM on testing set: 0.7269
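An accuracy score is easier to judge next to a baseline that ignores the features and always predicts the more common decision. A minimal sketch using scikit-learn's `DummyClassifier` on synthetic labels (the 40% "yes" rate is an arbitrary assumption, not the rate in the speed dating data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with roughly 40% "yes" (assumed rate, for illustration only)
rng = np.random.default_rng(0)
toy_y = (rng.uniform(size=1000) < 0.4).astype(int)

# The baseline ignores the features, so a placeholder column is enough
toy_X = np.zeros((len(toy_y), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(toy_X, toy_y)
baseline_acc = baseline.score(toy_X, toy_y)

print('Majority-class baseline accuracy: %.4f' % baseline_acc)
```

In the notebook, fitting the same baseline on `x_train, y_train` and scoring it on `x_test, y_test` would give the number the model scores need to beat.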


In [8]:
from sklearn.neighbors import KNeighborsClassifier

# Train a KNN model with K=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)

# 4. Compare accuracies of each prediction model
**Score each model on both the training and testing data sets to compare their accuracies.**

In [9]:
# Training set scores
print('Scores on the training set')
print('%-20s %.3f' %('Decision Tree:', dt6.score(x_train, y_train)))
print('%-20s %.3f' %('Support Vector M.:', svm.score(x_train, y_train)))
print('%-20s %.3f' %('5-Nearest Neighbor:', knn.score(x_train, y_train)))

# Testing set scores
print('\nScores on the testing set')
print('%-20s %.3f' %('Decision tree:', dt6.score(x_test, y_test)))
print('%-20s %.3f' %('Support Vector M.:', svm.score(x_test, y_test)))
print('%-20s %.3f' %('5-Nearest Neighbor:', knn.score(x_test, y_test)))

Scores on the training set
Decision Tree:       0.765
Support Vector M.:   0.759
5-Nearest Neighbor:  0.804

Scores on the testing set
Decision tree:       0.725
Support Vector M.:   0.727
5-Nearest Neighbor:  0.707


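**All three models score between 0.71 and 0.73 on the testing set, so no model is clearly better on this single split.
5-nearest neighbor has the highest training score (0.804) but the lowest testing score (0.707), which suggests it
overfits the most. The SVM has the smallest gap between training (0.759) and testing (0.727), and the decision tree
falls in between (0.765 vs. 0.725).**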
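Testing scores that differ by a point or two on a single 80/20 split may not reflect a real difference between models. One common check is k-fold cross-validation, which reports a mean and spread per model. A minimal sketch on synthetic data from `make_classification` (the data set, the 5 folds, and the scaling step in the SVM and KNN pipelines are assumptions for illustration, not the notebook's setup):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 11 features, the same count as the notebook's feature table
X, y = make_classification(n_samples=500, n_features=11, n_informative=4, random_state=0)

# Same model families as above; SVM and KNN are distance-based, so they are scaled inside a pipeline
models = {
    'Decision Tree': DecisionTreeClassifier(max_depth=6, random_state=0),
    'Support Vector M.': make_pipeline(StandardScaler(), SVC(kernel='rbf')),
    '5-Nearest Neighbor': make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# 5-fold cross-validated accuracy for each model
cv_scores = {name: cross_val_score(model, X, y, cv=5) for name, model in models.items()}
for name, scores in cv_scores.items():
    print('%-20s %.3f +/- %.3f' % (name + ':', scores.mean(), scores.std()))
```

Passing `data` and `target` in place of `X` and `y` would run the same comparison on the speed dating features.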