# Speed Dating Dataset

This data was gathered from participants in experimental speed dating events from 2002-2004. During the events, the attendees would have a four-minute "first date" with every other participant of the opposite sex. At the end of their four minutes, participants were asked if they would like to see their date again. They were also asked to rate their date on six attributes: Attractiveness, Sincerity, Intelligence, Fun, Ambition, and Shared Interests. The dataset also includes questionnaire data gathered from participants at different points in the process. These fields include: demographics, dating habits, self-perception across key attributes, beliefs on what others find valuable in a mate, and lifestyle information.

The dataset has 123 columns: 122 candidate features (independent variables) plus the `match` column (dependent variable), which is the target to predict.

In [40]:
import pandas as pd
dating = pd.read_csv('speeddating.csv')
dating.head()

Unnamed: 0,has_null,wave,gender,age,age_o,d_age,d_d_age,race,race_o,samerace,...,d_expected_num_interested_in_me,d_expected_num_matches,like,guess_prob_liked,d_like,d_guess_prob_liked,met,decision,decision_o,match
0,b'',1.0,b'female',21.0,27.0,6.0,b'[4-6]',b'Asian/Pacific Islander/Asian-American',b'European/Caucasian-American',b'0',...,b'[0-3]',b'[3-5]',7.0,6.0,b'[6-8]',b'[5-6]',0.0,b'1',b'0',b'0'
1,b'',1.0,b'female',21.0,22.0,1.0,b'[0-1]',b'Asian/Pacific Islander/Asian-American',b'European/Caucasian-American',b'0',...,b'[0-3]',b'[3-5]',7.0,5.0,b'[6-8]',b'[5-6]',1.0,b'1',b'0',b'0'
2,b'',1.0,b'female',21.0,22.0,1.0,b'[0-1]',b'Asian/Pacific Islander/Asian-American',b'Asian/Pacific Islander/Asian-American',b'1',...,b'[0-3]',b'[3-5]',7.0,,b'[6-8]',b'[0-4]',1.0,b'1',b'1',b'1'
3,b'',1.0,b'female',21.0,23.0,2.0,b'[2-3]',b'Asian/Pacific Islander/Asian-American',b'European/Caucasian-American',b'0',...,b'[0-3]',b'[3-5]',7.0,6.0,b'[6-8]',b'[5-6]',0.0,b'1',b'1',b'1'
4,b'',1.0,b'female',21.0,24.0,3.0,b'[2-3]',b'Asian/Pacific Islander/Asian-American',b'Latino/Hispanic American',b'0',...,b'[0-3]',b'[3-5]',6.0,6.0,b'[6-8]',b'[5-6]',0.0,b'1',b'1',b'1'


In [41]:
dating.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8378 entries, 0 to 8377
Columns: 123 entries, has_null to match
dtypes: float64(59), object(64)
memory usage: 7.9+ MB


In [42]:
dating.shape

(8378, 123)

In [43]:
#data types in the features
dating.dtypes.value_counts()

object     64
float64    59
dtype: int64

In [44]:
#count the distinct values in each column, sorted ascending
dating.nunique().sort_values()

has_null                        1
decision                        2
decision_o                      2
samerace                        2
match                           2
                             ... 
shared_interests_important     85
attractive_important           94
pref_o_attractive              94
interests_correlate           155
field                         260
Length: 123, dtype: int64

In [45]:
#drop the column has_null because it has only one value for all the rows
dating.drop(['has_null'], axis = 1, inplace= True)

In [46]:
#create a function that removes the b'...' wrapper around every string value
def remove_characters(feature):
    return feature.replace("b'",'').replace("'","")

#select string columns
string_dataset = dating.select_dtypes(include = ['object'])

#remove the characters from each string column
for feature in string_dataset.columns:
    dating[feature] = dating[feature].apply(remove_characters)
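As a possible alternative to the element-wise `apply`, the same cleanup can be sketched with pandas' vectorized string methods. The toy frame and the helper name `strip_byte_wrapper` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the raw data (hypothetical values in the same b'...' format)
toy = pd.DataFrame({
    "gender": ["b'female'", "b'male'"],
    "samerace": ["b'0'", "b'1'"],
    "wave": [1.0, 2.0],
})

def strip_byte_wrapper(column):
    # Remove a leading b' and a trailing ' from every string in the column
    without_prefix = column.str.replace(r"^b'", "", regex=True)
    return without_prefix.str.replace(r"'$", "", regex=True)

# Only the string columns carry the wrapper; the numeric column is left alone
for name in ["gender", "samerace"]:
    toy[name] = strip_byte_wrapper(toy[name])

print(toy)
```

Unlike `remove_characters`, this sketch only strips the wrapper itself, so an apostrophe inside a value would be kept.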

In [47]:
#the columns with the prefix d_ are the values of other columns but binned
to_drop = [column_name for column_name in dating.columns if column_name.startswith('d_')]
dating.drop(to_drop, axis = 1, inplace = True)

In [48]:
dating.shape

(8378, 66)

In [49]:
#decision and decision_o are recorded at the end of each date and match is derived from them, so keeping them would leak the target
dating.drop(['decision', 'decision_o'], axis = 1, inplace= True)

In [50]:
missing_columns = dating.isnull().sum().sort_values()

In [51]:
#drop the columns with more than 5% missing values (see missing_columns)
dating.drop(['expected_num_interested_in_me', 'expected_num_matches', 'shared_interests_o',
             'shared_interests_partner', 'ambitous_o', 'ambition_partner'], axis = 1, inplace= True)
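The list of columns above is hard-coded. A minimal sketch of deriving such a list from a missing-value threshold instead, on a hypothetical toy frame (the names `threshold`, `missing_fraction` and `sparse_columns` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): column "b" is 50% missing, "a" and "c" are complete
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, 2.0, np.nan, 4.0],
    "c": [0.0, 0.0, 1.0, 1.0],
})

threshold = 0.05  # drop columns with more than 5% missing values

# isnull().mean() gives the share of missing values per column
missing_fraction = toy.isnull().mean()
sparse_columns = missing_fraction[missing_fraction > threshold].index.tolist()

toy_reduced = toy.drop(columns=sparse_columns)
print(sparse_columns)
```

Applied to `dating`, the same pattern would select whatever columns exceed the threshold at that point in the notebook, which may or may not coincide with the six listed above.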

In [52]:
#there are two age columns (self and partner); replace them with a single column, age_diff = age - age_o
dating['age_diff'] = dating['age'] - dating['age_o']
dating.drop(['age','age_o'], axis = 1, inplace = True)

In [53]:
#if they had met the partner before, yes or no (1 or 0)
dating['met'].value_counts()

0.0    7644
1.0     351
7.0       3
5.0       2
3.0       1
8.0       1
6.0       1
Name: met, dtype: int64

In [54]:
#replace the few invalid values (3, 5, 6, 7, 8) with the mode, which is 0
dating['met'] = dating['met'].replace([3.0, 5.0, 6.0, 7.0, 8.0], 0)

In [55]:
dating['met'].value_counts()

0.0    7652
1.0     351
Name: met, dtype: int64
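The cleanup above lists the invalid codes explicitly. A hedged sketch of a more general rule, "anything present that is not 0 or 1 becomes 0", on a hypothetical toy series; missing values are deliberately left untouched so they can be handled later:

```python
import numpy as np
import pandas as pd

# Toy stand-in for dating['met'] (hypothetical values)
met = pd.Series([0.0, 1.0, 7.0, np.nan, 5.0, 0.0])

# True where the value is one of the two valid codes
valid = met.isin([0.0, 1.0])

# Overwrite only values that are present but invalid; NaN stays NaN
met_clean = met.mask(met.notna() & ~valid, 0.0)
print(met_clean.tolist())
```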

In [56]:
#field has 260 distinct categorical values; one-hot encoding it would add more than 200 columns, so drop it
dating.drop(['field'], axis = 1, inplace = True)

In [57]:
dating.shape

(8378, 56)

In [58]:
missing_rows=dating.isnull().sum(axis = 1)
missing_rows.value_counts()

0     7079
1      627
2      143
11     119
3       85
4       61
7       58
8       54
33      48
5       37
32      15
6        8
34       6
19       5
9        5
13       5
12       5
44       5
10       4
15       3
37       2
18       1
43       1
39       1
40       1
dtype: int64

In [59]:
dating_clean = dating.dropna()  #drops about 15.5% of the rows (1299 of 8378)

In [60]:
dating_clean.shape

(7079, 56)
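The share of rows lost to `dropna()` follows directly from the two shapes printed in this notebook. A small arithmetic check, using only those numbers:

```python
rows_before = 8378  # dating.shape[0] before dropna()
rows_after = 7079   # dating_clean.shape[0] after dropna()

rows_dropped = rows_before - rows_after
share_dropped = rows_dropped / rows_before

print(rows_dropped, round(100 * share_dropped, 1))  # 1299 15.5
```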

In [61]:
dating_clean.isnull().sum().sum()

0

In [62]:
#numeric columns
columns_numeric = dating_clean.select_dtypes(include = ['int','float']).columns.tolist()

#categorical columns
columns_category = dating_clean.select_dtypes(include = ['object']).drop('match', axis=1).columns

In [63]:
dating_clean[columns_category]

Unnamed: 0,gender,race,race_o,samerace
0,female,Asian/Pacific Islander/Asian-American,European/Caucasian-American,0
1,female,Asian/Pacific Islander/Asian-American,European/Caucasian-American,0
3,female,Asian/Pacific Islander/Asian-American,European/Caucasian-American,0
4,female,Asian/Pacific Islander/Asian-American,Latino/Hispanic American,0
5,female,Asian/Pacific Islander/Asian-American,European/Caucasian-American,0
...,...,...,...,...
8372,male,European/Caucasian-American,European/Caucasian-American,1
8373,male,European/Caucasian-American,Latino/Hispanic American,0
8374,male,European/Caucasian-American,Other,0
8376,male,European/Caucasian-American,Asian/Pacific Islander/Asian-American,0


In [64]:
dating_clean[columns_numeric]

Unnamed: 0,wave,importance_same_race,importance_same_religion,pref_o_attractive,pref_o_sincere,pref_o_intelligence,pref_o_funny,pref_o_ambitious,pref_o_shared_interests,attractive_o,...,concerts,music,shopping,yoga,interests_correlate,expected_happy_with_sd_people,like,guess_prob_liked,met,age_diff
0,1.0,2.0,4.0,35.0,20.0,20.0,20.0,0.0,5.0,6.0,...,10.0,9.0,8.0,1.0,0.14,3.0,7.0,6.0,0.0,-6.0
1,1.0,2.0,4.0,60.0,0.0,0.0,40.0,0.0,0.0,7.0,...,10.0,9.0,8.0,1.0,0.54,3.0,7.0,5.0,1.0,-1.0
3,1.0,2.0,4.0,30.0,5.0,15.0,40.0,5.0,5.0,7.0,...,10.0,9.0,8.0,1.0,0.61,3.0,7.0,6.0,0.0,-2.0
4,1.0,2.0,4.0,30.0,10.0,20.0,10.0,10.0,20.0,8.0,...,10.0,9.0,8.0,1.0,0.21,3.0,6.0,6.0,0.0,-3.0
5,1.0,2.0,4.0,50.0,0.0,30.0,10.0,0.0,10.0,7.0,...,10.0,9.0,8.0,1.0,0.25,3.0,6.0,5.0,0.0,-4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8372,21.0,1.0,1.0,10.0,15.0,30.0,20.0,15.0,10.0,8.0,...,10.0,10.0,7.0,3.0,0.28,10.0,4.0,4.0,0.0,1.0
8373,21.0,1.0,1.0,10.0,10.0,30.0,20.0,10.0,15.0,10.0,...,10.0,10.0,7.0,3.0,0.64,10.0,2.0,5.0,0.0,-1.0
8374,21.0,1.0,1.0,50.0,20.0,10.0,5.0,10.0,5.0,6.0,...,10.0,10.0,7.0,3.0,0.71,10.0,4.0,4.0,0.0,1.0
8376,21.0,1.0,1.0,10.0,25.0,25.0,10.0,10.0,20.0,5.0,...,10.0,10.0,7.0,3.0,0.62,10.0,5.0,5.0,0.0,3.0


In [65]:
#use get_dummies to one-hot encode the categorical attributes
dating_ready = pd.get_dummies(data=dating_clean, columns=['gender', 'race', 'race_o', 'samerace', 'match'],drop_first=True)

In [66]:
dating_ready.shape

(7079, 62)
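With `drop_first=True`, a column with k categories becomes k - 1 indicator columns. Here the five encoded columns turn into 1 + 4 + 4 + 1 + 1 = 11 indicators, so 56 - 5 + 11 = 62 columns. A minimal sketch of the effect on a hypothetical toy frame with shortened labels:

```python
import pandas as pd

# Toy frame (hypothetical): one numeric column and one 3-category column
toy = pd.DataFrame({
    "like": [7.0, 5.0, 6.0],
    "race": ["Asian", "European", "Other"],
})

# The first category (alphabetically, "Asian") is dropped and acts as the reference level
encoded = pd.get_dummies(data=toy, columns=["race"], drop_first=True)
print(encoded.columns.tolist())  # ['like', 'race_European', 'race_Other']
```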

In [67]:
dating_ready.isnull().sum().sum()

0

In [68]:
dating_ready.dtypes.value_counts()

float64    51
uint8      11
dtype: int64

In [69]:
dating_ready.describe()

Unnamed: 0,wave,importance_same_race,importance_same_religion,pref_o_attractive,pref_o_sincere,pref_o_intelligence,pref_o_funny,pref_o_ambitious,pref_o_shared_interests,attractive_o,...,race_Black/African American,race_European/Caucasian-American,race_Latino/Hispanic American,race_Other,race_o_Black/African American,race_o_European/Caucasian-American,race_o_Latino/Hispanic American,race_o_Other,samerace_1,match_1
count,7079.0,7079.0,7079.0,7079.0,7079.0,7079.0,7079.0,7079.0,7079.0,7079.0,...,7079.0,7079.0,7079.0,7079.0,7079.0,7079.0,7079.0,7079.0,7079.0,7079.0
mean,11.299336,3.782738,3.65772,22.232585,17.444366,20.304576,17.490668,10.723546,11.84888,6.209549,...,0.047606,0.563921,0.077977,0.067382,0.048736,0.560107,0.080096,0.06597,0.40048,0.174318
std,5.957994,2.832566,2.81831,12.372573,6.932509,6.831764,6.092708,6.107862,6.348855,1.939503,...,0.212945,0.495932,0.268155,0.250701,0.21533,0.496409,0.271461,0.248247,0.49003,0.37941
min,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7.0,1.0,1.0,15.0,15.0,17.5,15.0,5.0,9.52,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,11.0,3.0,3.0,20.0,18.37,20.0,18.0,10.0,10.64,6.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
75%,15.0,6.0,6.0,25.0,20.0,23.81,20.0,15.0,16.0,8.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
max,21.0,10.0,10.0,100.0,47.0,50.0,50.0,53.0,30.0,10.5,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [70]:
dating_ready.columns

Index(['wave', 'importance_same_race', 'importance_same_religion',
       'pref_o_attractive', 'pref_o_sincere', 'pref_o_intelligence',
       'pref_o_funny', 'pref_o_ambitious', 'pref_o_shared_interests',
       'attractive_o', 'sinsere_o', 'intelligence_o', 'funny_o',
       'attractive_important', 'sincere_important', 'intellicence_important',
       'funny_important', 'ambtition_important', 'shared_interests_important',
       'attractive', 'sincere', 'intelligence', 'funny', 'ambition',
       'attractive_partner', 'sincere_partner', 'intelligence_partner',
       'funny_partner', 'sports', 'tvsports', 'exercise', 'dining', 'museums',
       'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater',
       'movies', 'concerts', 'music', 'shopping', 'yoga',
       'interests_correlate', 'expected_happy_with_sd_people', 'like',
       'guess_prob_liked', 'met', 'age_diff', 'gender_male',
       'race_Black/African American', 'race_European/Caucasian-American',
       'race_L

# Logistic Regression and Cross Validation

In [71]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')

In [72]:
#choose the target and the feature sets for the models
y = dating_ready['match_1'] #1-D target, the shape scikit-learn expects
X = dating_ready.drop(['match_1'], axis = 1) #all the attributes
X1 = X[['like', 'met']] 
X2 = X[['shopping', 'concerts', 'clubbing']]
X3 = X[['sports', 'tvsports', 'hiking', 'exercise']]

In [73]:
#all the attributes, cv = 5 means we split the data into 5 folds
log_reg = LogisticRegression()
result_X = cross_val_score(log_reg, X, y, cv = 5)
result_X1 = cross_val_score(log_reg, X1, y, cv = 5)
result_X2 = cross_val_score(log_reg, X2, y, cv = 5)
result_X3 = cross_val_score(log_reg, X3, y, cv = 5)

print("Cross validation X: ",result_X)
print("Mean X: ",result_X.mean())

print("Cross validation X1: ",result_X1)
print("Mean X1: ",result_X1.mean())

print("Cross validation X2: ",result_X2)
print("Mean X2: ",result_X2.mean())

print("Cross validation X3: ",result_X3)
print("Mean X3: ",result_X3.mean())

Cross validation X:  [0.83545198 0.8509887  0.82485876 0.84322034 0.84452297]
Mean X:  0.8398085484418358
Cross validation X1:  [0.82344633 0.83262712 0.82062147 0.82485876 0.82120141]
Mean X1:  0.8245510171487892
Cross validation X2:  [0.82556497 0.82556497 0.82556497 0.82556497 0.82614841]
Mean X2:  0.8256816593799285
Cross validation X3:  [0.82556497 0.82556497 0.82556497 0.82556497 0.82614841]
Mean X3:  0.8256816593799285
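The mean of `match_1` in the `describe()` output is 0.174318, so always predicting "no match" already yields an accuracy of about 1 - 0.174318 ≈ 0.8257, which is the score obtained for X2 and X3. A hedged sketch of that baseline with scikit-learn's `DummyClassifier`, on a synthetic target built only from the row count and class balance reported above (`X_toy` and `y_toy` are stand-ins, not the real data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

n_rows = 7079                        # rows in dating_ready
n_pos = round(0.174318 * n_rows)     # approximate number of matches

# Synthetic target with the same class balance; features are irrelevant here
y_toy = np.zeros(n_rows, dtype=int)
y_toy[:n_pos] = 1
X_toy = np.zeros((n_rows, 1))

# Always predicts the most frequent class (0 = no match)
baseline = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(baseline, X_toy, y_toy, cv=5)
print("Baseline mean accuracy:", scores.mean())
```

Because the classes are imbalanced, accuracy close to this baseline suggests the model is mostly predicting the majority class; passing a different `scoring` argument to `cross_val_score` (for example `'f1'` or `'roc_auc'`) is one way to check.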


# Decision Tree

In [74]:
from sklearn.tree import DecisionTreeClassifier

#all the attributes, cv = 5 means we split the data into 5 folds
tree = DecisionTreeClassifier()
result_X = cross_val_score(tree, X, y, cv = 5)
result_X1 = cross_val_score(tree, X1, y, cv = 5)
result_X2 = cross_val_score(tree, X2, y, cv = 5)
result_X3 = cross_val_score(tree, X3, y, cv = 5)

print("Cross validation X: ",result_X)
print("Mean X: ",result_X.mean())

print("Cross validation X1: ",result_X1)
print("Mean X1: ",result_X1.mean())

print("Cross validation X2: ",result_X2)
print("Mean X2: ",result_X2.mean())

print("Cross validation X3: ",result_X3)
print("Mean X3: ",result_X3.mean())

Cross validation X:  [0.46822034 0.71751412 0.7789548  0.73870056 0.78091873]
Mean X:  0.6968617116847338
Cross validation X1:  [0.82556497 0.8269774  0.82485876 0.82415254 0.82897527]
Mean X1:  0.8261057874668104
Cross validation X2:  [0.77189266 0.72316384 0.79237288 0.75353107 0.79575972]
Mean X2:  0.7673440338583778
Cross validation X3:  [0.67584746 0.65819209 0.8029661  0.69915254 0.80777385]
Mean X3:  0.7287864087361002


Tune the decision tree hyperparameters with a grid search over criterion, min_samples_leaf, max_depth and random_state.

In [75]:
from sklearn.model_selection import GridSearchCV

parameters = {'criterion': ['gini', 'entropy'], 'min_samples_leaf': [5, 10, 50, 100, 150, 200],
              'max_depth': [2, 4, 6, 8, 10, 12], 'random_state': [0, 10, 42]}

tree = DecisionTreeClassifier()

searching_X = GridSearchCV(tree, parameters, cv=5)
searching_X.fit(X, y)
searching_X1= GridSearchCV(tree, parameters, cv=5)
searching_X1.fit(X1, y)
searching_X2 = GridSearchCV(tree, parameters, cv=5)
searching_X2.fit(X2, y)
searching_X3 = GridSearchCV(tree, parameters, cv=5)
searching_X3.fit(X3, y)

print("Best parameters for X: ", searching_X.best_params_)
print("Mean for X: ", searching_X.best_score_)

print("Best parameters for X1: ", searching_X1.best_params_)
print("Mean for X1: ", searching_X1.best_score_)

print("Best parameters for X2: ", searching_X2.best_params_)
print("Mean for X2: ", searching_X2.best_score_)

print("Best parameters for X3: ", searching_X3.best_params_)
print("Mean for X3: ", searching_X3.best_score_)

Best parameters for X:  {'criterion': 'entropy', 'max_depth': 4, 'min_samples_leaf': 100, 'random_state': 0}
Mean for X:  0.8423511209598532
Best parameters for X1:  {'criterion': 'gini', 'max_depth': 4, 'min_samples_leaf': 100, 'random_state': 0}
Mean for X1:  0.8276595596015252
Best parameters for X2:  {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 5, 'random_state': 0}
Mean for X2:  0.8256816593799285
Best parameters for X3:  {'criterion': 'gini', 'max_depth': 2, 'min_samples_leaf': 5, 'random_state': 0}
Mean for X3:  0.8256816593799285
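`random_state` controls reproducibility rather than model complexity, so one option is to fix it on the estimator instead of searching over it. The four near-identical searches can also be written as a loop. A hedged sketch on synthetic data; `feature_sets`, `small_grid` and `best` are illustrative names, and the grid is shortened to keep the example fast:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (hypothetical data)
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# Column indices playing the role of X, X1, X2, X3
feature_sets = {"all": list(range(6)), "first_two": [0, 1]}

small_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4],
    "min_samples_leaf": [5, 50],
}

best = {}
for name, cols in feature_sets.items():
    # random_state is fixed on the estimator, not included in the grid
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), small_grid, cv=5)
    search.fit(X_toy[:, cols], y_toy)
    best[name] = (search.best_params_, search.best_score_)

for name, (params, score) in best.items():
    print(name, params, score)
```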


# Random Forest

In [76]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier()

result_X = cross_val_score(forest, X, y, cv = 5)
result_X1 = cross_val_score(forest, X1, y, cv = 5)
result_X2 = cross_val_score(forest, X2, y, cv = 5)
result_X3 = cross_val_score(forest, X3, y, cv = 5)

print("Cross validation X: ",result_X)
print("Mean X: ",result_X.mean())

print("Cross validation X1: ",result_X1)
print("Mean X1: ",result_X1.mean())

print("Cross validation X2: ",result_X2)
print("Mean X2: ",result_X2.mean())

print("Cross validation X3: ",result_X3)
print("Mean X3: ",result_X3.mean())

Cross validation X:  [0.84180791 0.83968927 0.83757062 0.84887006 0.84381625]
Mean X:  0.8423508215048612
Cross validation X1:  [0.82556497 0.8269774  0.82485876 0.82415254 0.82897527]
Mean X1:  0.8261057874668104
Cross validation X2:  [0.77824859 0.73163842 0.82556497 0.78107345 0.79575972]
Mean X2:  0.7824570282086601
Cross validation X3:  [0.79237288 0.70409605 0.81214689 0.7549435  0.82614841]
Mean X3:  0.7779415463855783


Tune the random forest hyperparameters with a grid search over min_samples_leaf, n_estimators and max_features.

In [77]:
parameters = {'min_samples_leaf': [5, 10, 50, 100, 150, 200], 
              'n_estimators': [50, 100, 150, 200], 'max_features': ['sqrt', 'log2']}

forest = RandomForestClassifier()

searching_X = GridSearchCV(forest, parameters, cv=5)
searching_X.fit(X, y)
searching_X1= GridSearchCV(forest, parameters, cv=5)
searching_X1.fit(X1, y)
searching_X2 = GridSearchCV(forest, parameters, cv=5)
searching_X2.fit(X2, y)
searching_X3 = GridSearchCV(forest, parameters, cv=5)
searching_X3.fit(X3, y)

print("Best parameters for X: ", searching_X.best_params_)
print("Mean for X: ", searching_X.best_score_)

print("Best parameters for X1: ", searching_X1.best_params_)
print("Mean for X1: ", searching_X1.best_score_)

print("Best parameters for X2: ", searching_X2.best_params_)
print("Mean for X2: ", searching_X2.best_score_)

print("Best parameters for X3: ", searching_X3.best_params_)
print("Mean for X3: ", searching_X3.best_score_)

Best parameters for X:  {'max_features': 'sqrt', 'min_samples_leaf': 5, 'n_estimators': 200}
Mean for X:  0.845316723563115
Best parameters for X1:  {'max_features': 'sqrt', 'min_samples_leaf': 50, 'n_estimators': 50}
Mean for X1:  0.827235531332974
Best parameters for X2:  {'max_features': 'sqrt', 'min_samples_leaf': 50, 'n_estimators': 50}
Mean for X2:  0.8256816593799285
Best parameters for X3:  {'max_features': 'sqrt', 'min_samples_leaf': 50, 'n_estimators': 50}
Mean for X3:  0.8256816593799285
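The best means for the full feature set are close across models (0.8398 for logistic regression, 0.8424 for the tuned tree, 0.8453 for the tuned forest), so it may be worth looking at the per-candidate spread and not only `best_score_`. A minimal sketch that tabulates `cv_results_` on synthetic data, with a deliberately small grid (all names and values below are illustrative assumptions):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X and y (hypothetical data)
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

small_grid = {"n_estimators": [10, 20], "min_samples_leaf": [5, 50]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
search.fit(X_toy, y_toy)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(search.cv_results_)
summary = results[["params", "mean_test_score", "std_test_score", "rank_test_score"]]
summary = summary.sort_values("rank_test_score")
print(summary)
```

If the gap between the top candidates is smaller than `std_test_score`, the ranking between them is probably not meaningful.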
