THIS IS A COURSE PROJECT DONE BY ME AND MY CLASSMATES. 



**Background**

Imagine that we run a social search app like Tinder. We have millions of users who register and write a short personal description while looking for a match. This raises the question: how do we predict a match? More precisely, for a given user of our app, whom should we recommend?

**Data description**

The dataset used here was compiled by Columbia Business School professors Ray Fisman and Sheena Iyengar for their paper Gender Differences in Mate Selection: Evidence From a Speed Dating Experiment. In speed dating, participants have a four-minute conversation and then decide whether they are interested in the other person. The subjects are students from the graduate and professional schools of Columbia University. Before the event, every registered participant filled out an online form that included basic personal information (age, gender, religion, etc.) and a self-rating on attributes such as attractiveness. There were 14 rounds conducted from 2002 to 2004, with a varying number of participants. After each speed date, both participants decide whether they want to meet the partner again. Only if both say 'yes' do we call it a 'match'.
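
As a small illustration of the match rule above, the sketch below builds a toy table of decisions. The column names `dec` (own decision) and `dec_o` (partner's decision) are assumptions made for this example, and the four rows are invented.

```python
import pandas as pd

# Toy decisions for four dates. 'dec' and 'dec_o' are assumed column names
# for "my decision" and "my partner's decision" (1 = yes, 0 = no).
toy = pd.DataFrame({
    "dec":   [1, 1, 0, 0],
    "dec_o": [1, 0, 1, 0],
})

# A match requires a 'yes' from both sides.
toy["match"] = ((toy["dec"] == 1) & (toy["dec_o"] == 1)).astype(int)
print(toy)
```
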
Besides training a model to predict whether a male will like a female or vice versa, here are some interesting questions:

 1. What are the most important factors when males or females make
    decisions?
 2. How do males and females differ when they choose their
    partners?
 3. Does a factor, such as age difference, affect one's decision
    positively or negatively?

To answer these questions, before training the model we first standardize every feature over the roughly 8k given samples so that it has zero mean and unit variance. Under this scaling, a "NaN" in the data, which means the participant did not provide the information, can be treated as zero, i.e. the feature mean. After training, the magnitude of each coefficient indicates the importance of the corresponding feature, and its sign shows whether the feature affects the outcome positively or negatively.
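
The idea can be tried on synthetic data first. The sketch below is not the notebook's pipeline: the feature names `x_help` and `x_hurt`, the sample size, and the true effects are all invented. It standardizes two features, fills missing values with 0, fits a logistic regression, and reads the coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Two hypothetical features on very different scales.
x_help = rng.normal(10.0, 4.0, n)   # raises the chance of a match
x_hurt = rng.normal(-3.0, 0.5, n)   # lowers the chance of a match
logit = 1.5 * (x_help - 10.0) / 4.0 - 0.5 * (x_hurt + 3.0) / 0.5
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

toy = pd.DataFrame({"x_help": x_help, "x_hurt": x_hurt})
# Pretend 10% of the participants left x_help blank.
toy.loc[rng.choice(n, size=200, replace=False), "x_help"] = np.nan

# z-score each column (pandas skips NaN when computing mean and std) ...
z = (toy - toy.mean()) / toy.std(ddof=0)
# ... so that filling NaN with 0 is the same as filling with the column mean.
z = z.fillna(0.0)

clf = LogisticRegression().fit(z, y)
coef = pd.Series(clf.coef_[0], index=z.columns)
print(coef)
```

Because both columns share the same scale after standardization, the coefficient magnitudes are comparable. Note that filling with 0 after scaling slightly shrinks the variance of a column with many missing values.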

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
import matplotlib.pyplot as plt
# sklearn.cross_validation no longer exists; model_selection is its successor
from sklearn import model_selection, linear_model

%matplotlib inline
data_df = pd.read_csv("../input/Speed Dating Data.csv", encoding="ISO-8859-1")
data_df.head()
fields = data_df.columns
num_dates_per_male = data_df[data_df.gender == 1].groupby('iid').apply(len)
num_dates_per_female = data_df[data_df.gender == 0].groupby('iid').apply(len)

**Preprocessing**

Convert numeric columns that were read as strings with thousands separators (commas) to floating point.

In [None]:
def str_to_float(series):
    return series.apply(lambda x: str(x).replace(",", "")).astype('float64')

for trait in ['mn_sat', 'tuition', 'income']:
    data_df[trait] = str_to_float(data_df[trait])

    
data_df['pid'] = data_df['pid'].fillna(-1.0).astype('int64')  # Invalid PID as -1

**Standardize and group up the features: personal feature**

Features:

 1. Financial: tuition + income. An indicator of the wealth of the
    family the participant comes from.
 2. Experience: date + go_out. How often the participant is involved
    in social interaction, with a higher weight given to dating.
 3. Intelligence: mn_sat. The participant's undergraduate mean SAT score, used as a proxy for the participant's 'intelligence level'.
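
A minimal sketch of how the first two composite features can be built, using a three-row toy table with invented values. The weights 5 and 1 mirror the ones used in this notebook; the constant names are made up for the example.

```python
import numpy as np
import pandas as pd

# Invented values; NaN marks a field the participant left blank.
toy = pd.DataFrame({
    "tuition": [20000.0, np.nan, 30000.0],
    "income":  [50000.0, 70000.0, np.nan],
    "date":    [1.0, 7.0, np.nan],
    "go_out":  [2.0, np.nan, 4.0],
})

# Fill every missing value with the mean of its own column.
toy = toy.fillna(toy.mean())

# Feature 1: financial = tuition + income
toy["financial"] = toy["tuition"] + toy["income"]

# Feature 2: experience = weighted sum, dating counts more than going out
DATE_WEIGHT, GO_OUT_WEIGHT = 5, 1
toy["experience"] = DATE_WEIGHT * toy["date"] + GO_OUT_WEIGHT * toy["go_out"]

print(toy[["financial", "experience"]])
```
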

In [None]:
# Standardization 1: z-score (subtract the mean, divide by the standard deviation)
def standardize_feature(series):
    return (series - series.mean()) / series.std(ddof=0)

# Standardization 2: scale to unit standard deviation without centering
def standardize_feature_2(series):
    return (series) / series.std(ddof=0)

In [None]:

    
    
# 2.PROFILE OF THE PERSON
#     2.1 'tuition' + 'income' -> 'financial'

# Fill NaN in tuition and income with the column mean.
data_df['tuition']=data_df['tuition'].fillna(data_df['tuition'].mean())
data_df['income']=data_df['income'].fillna(data_df['income'].mean())

# Feature 1: financial
data_df['financial'] = (data_df['tuition']) \
                       .add((data_df['income']))

    
    
#     2.2 'date' + 'go_out' -> 'experience' 
#     fill nan
data_df['date']=data_df['date'].fillna(data_df['date'].mean())
data_df['go_out']=data_df['go_out'].fillna(data_df['go_out'].mean())

#     different weights given to 'date' and 'go_out'
a = 5  # weight of 'date'
b = 1  # weight of 'go_out'
data_df['experience'] = a*data_df['date'] + b*data_df['go_out']


#     2.3 'mn_sat' -> 'intelligence'
data_df['int'] = data_df['mn_sat']
data_df['int'] = data_df['int'].fillna(data_df['int'].mean())

**Standardize and group up the features: pair-wise feature**

Pair-wise features mainly describe how the two people in a pair agree or differ on their personal features.
 1. Similarity: field, career;
 2. Difference: age + above personal features.
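
One way to compute such pair-wise features without a row-by-row `apply` is to look up both sides of each date in a per-person table. This is only a sketch on invented data: the `profiles` and `dates` frames and the `sim_field` column name are made up for the example.

```python
import numpy as np
import pandas as pd

# One row per person, indexed by iid (invented values).
profiles = pd.DataFrame(
    {"field_cd": [8, 1, 8], "int": [1300.0, 1400.0, 1250.0]},
    index=pd.Index([1, 2, 3], name="iid"),
)

# One row per date: iid met pid.
dates = pd.DataFrame({"iid": [1, 2, 3, 3], "pid": [3, 3, 1, 2]})

# Look up each side of the pair; reindex yields NaN for an unknown pid.
own = profiles.reindex(dates["iid"]).reset_index(drop=True)
partner = profiles.reindex(dates["pid"]).reset_index(drop=True)

# Similarity: +1 if both study the same field, -1 otherwise.
dates["sim_field"] = np.where(own["field_cd"] == partner["field_cd"], 1, -1)

# Difference: self minus partner, with 0 when the partner is unknown.
dates["int_diff"] = (own["int"] - partner["int"]).fillna(0.0)

print(dates)
```
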

First, get the personal features that will be used in the calculation.

For NaN values:

 1. If the value is a difference, fill it with 0.
 2. If the value has a meaning of its own (say age, which can't be 0), fill it with the mean value.
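
The two rules can be sketched on a toy frame as follows. The values are invented, and filling each age with the mean of the participant's own gender is one possible reading of "the mean value".

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "gender": [0, 0, 0, 1, 1, 1],
    "age":    [22.0, np.nan, 26.0, 30.0, 28.0, np.nan],
    "age_o":  [30.0, 28.0, np.nan, 22.0, 24.0, 26.0],
})

# Rule 2: age has a meaning of its own, so fill it with a mean
# (here: the mean age of the participant's own gender).
gender_mean_age = toy.groupby("gender")["age"].transform("mean")
toy["age"] = toy["age"].fillna(gender_mean_age)

# Rule 1: a difference that cannot be computed is filled with 0.
toy["age_diff"] = (toy["age"] - toy["age_o"]).fillna(0.0)

print(toy)
```

The fill is written as a single assignment to the whole column. An assignment of the form `toy[mask]['age'] = ...` would only modify a temporary copy and leave `toy` unchanged; `toy.loc[mask, 'age'] = ...` is the form that writes back.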

In [None]:
# Personal profile, read later when building the pair-wise features.
# Keep one row per participant so that iid is a unique index.
profiles = data_df[['iid', 'int',  'field_cd', 'financial', 'experience','career_c']]\
           .drop_duplicates(subset='iid').set_index(keys='iid')
for trait in ['int', 'financial', 'experience']:
    profiles[trait] = profiles[trait].fillna(profiles[trait].mean())


#     3.1 age difference
# Fill missing ages with the mean age of the participant's own gender.
# Assign through .loc: data_df[mask]['age'] = ... would only change a temporary copy.
for g in (0, 1):
    mask = data_df.gender == g
    data_df.loc[mask, 'age'] = data_df.loc[mask, 'age'].fillna(data_df.loc[mask, 'age'].mean())
data_df['age_diff'] = data_df['age'].sub(data_df['age_o'])  # Age difference = self - other
data_df['age_diff'] = data_df['age_diff'].fillna(0)  # a missing difference is filled with 0
data_df['age_diff'] = standardize_feature_2(data_df['age_diff'])

#     3.2 same field, true = 1, false = -1
def is_similar_profession(x, profiles):
    if np.isnan(x['field_cd']) or np.isnan(x['pid']) or x['pid'] not in profiles.index or \
    x['field_cd'] != profiles.loc[x['pid']]['field_cd']:
        return -1
    else:
        return ((int(x['field_cd'] == profiles.loc[x['pid']]['field_cd']))-0.5)*2.0
    
data_df['sim_profession'] = data_df[['field_cd', 'pid']]\
                            .apply(lambda x: is_similar_profession(x, profiles), axis=1)
data_df['sim_profession'] = standardize_feature_2(data_df['sim_profession'])

#     3.3 same career, true = 1, false = -1
def is_similar_career(x, profiles):
    if np.isnan(x['career_c']) or np.isnan(x['pid']) or x['pid'] not in profiles.index or\
        x['career_c'] != profiles.loc[x['pid']]['career_c']:
        return -1
    else:
        return ((int(x['career_c'] == profiles.loc[x['pid']]['career_c']))-0.5)*2.0
    
data_df['sim_career'] = data_df[['career_c', 'pid']]\
                            .apply(lambda x: is_similar_career(x, profiles), axis=1)
data_df['sim_career'] = standardize_feature_2(data_df['sim_career'])
    
    
    

#     3.4 basic traits difference (standardized)
    
def trait_difference(trait):
    trait_other = data_df['pid'].apply(lambda x: profiles.loc[x][trait] if x in profiles.index else None)
    return data_df[trait].sub(trait_other)
    
# basic trait difference: self - partner (male - female on the male rows used for training)
for trait in ['int','experience', 'financial']:
    string= trait + '_diff'
    data_df[string] = trait_difference(trait)
    data_df[string] = data_df[string].fillna(value=0) 
    data_df[string] = standardize_feature_2(data_df[string])

data_df.loc[data_df['pid']==21].loc[data_df['iid']==40]
# data_df.loc[data_df['iid']==11]['pid']

In [None]:
#test2 = data_df[data_df.gender == 0]['age'].\
#            fillna(data_df[data_df.gender == 0]['age'].mean())
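test2 = data_df
# .loc writes into the frame; test2[mask]['age'] = ... would only change a temporary copy
female_mask = data_df.gender == 0
test2.loc[female_mask, 'age'] = data_df.loc[female_mask, 'age'].\
            fillna(data_df.loc[female_mask, 'age'].mean())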

np.count_nonzero(np.isnan(test2[data_df.gender == 0]['age']))
#test2.head()

In [None]:
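test2 = data_df[data_df.gender == 1]['age'].mean()
# fillna(inplace=True) on data_df[mask]['age'] acts on a copy, so assign through .loc instead
data_df.loc[data_df.gender == 1, 'age'] = data_df.loc[data_df.gender == 1, 'age'].\
            fillna(test2)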
#data_df[data_df.gender == 1]['age'] = temp

In [None]:
# Assigning the whole column back is what makes this fill stick.
data_df['age'] = data_df['age'].fillna(data_df['age'].mean())

In [None]:
np.count_nonzero(np.isnan(data_df['age']))

In [None]:
# Work on an explicit copy of the male rows and fill their missing ages.
temp2 = data_df[data_df.gender == 1].copy()
temp2['age'] = temp2['age'].fillna(temp2['age'].mean())

In [None]:
np.count_nonzero(np.isnan(data_df[data_df.gender == 1]['age'])) 

In [None]:
#np.count_nonzero(np.isnan(temp))
np.count_nonzero(np.isnan(data_df[data_df.gender == 1]['age']))

In [None]:
#print(data_df['iid'],data_df['pid']) if int(np.isnan(data_df['age']))
#np.count_nonzero(np.isnan(data_df['age']))
profiles = data_df[['iid','age']]\
           .set_index(keys='iid').drop_duplicates()
profiles

In [None]:



############################################################
#     3.5.1 preprocess. 
attr_exp = ['attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1']
attr_o = ['attr3_1', 'sinc3_1', 'intel3_1', 'fun3_1', 'amb3_1']

# profiles_f = data_df_f[['iid', 'int',  'field_cd', 'financial', 'experience','career_c']]\
#            .set_index(keys='iid').drop_duplicates()

# attr comes from original data, contains only the pair ids and attributes we need.
# Drop duplicates while iid is still a column, so rows of different participants are never merged.
attr = data_df[['iid','gender','pid','attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1',\
               'attr3_1', 'sinc3_1', 'intel3_1', 'fun3_1', 'amb3_1', ]].drop_duplicates().set_index(keys='iid')

# data_norm['sum1']=data_norm[f1].sum(axis=1)
attr['attr1_sum']=attr[attr_exp].sum(axis=1)
attr['attr3_sum']=attr[attr_o].sum(axis=1)

In [None]:
# 3 pairwise features for male participants

#     3.0
# divide them into female data and male data
data_df_m = data_df[data_df.gender == 1] #data of male
data_df_f = data_df[data_df.gender == 0] #data of female

#############################################################

# .copy() so that the new 'rating' column is not written to a slice of data_df
df_train = data_df_m[['iid','pid','match', 'age_diff','sim_profession','sim_career','int_diff','experience_diff','financial_diff']].copy()
df_train['rating'] = np.zeros(len(df_train))

attr_exp_n = []
attr_o_n = []

for trait in attr_exp:
    
    attr[trait + '_n'] = attr[trait]/attr['attr1_sum'] 
    attr_exp_n.append(trait+'_n')

    
for trait in attr_o:
    
    attr[trait + '_n'] = attr[trait]/attr['attr3_sum'] 
    attr_o_n.append(trait+'_n')


    #######################################################
    
    
attr[attr_exp_n] = attr[attr_exp_n].fillna(value=1.0/5).astype('float64')
attr[attr_o_n] = attr[attr_o_n].fillna(value=1.0/5).astype('float64')

attr_m = attr[attr.gender == 1]
attr_f = attr[attr.gender == 0]

In [None]:


for i in attr_m.index.drop_duplicates():
    for j in attr_m.loc[[i]].pid:  # .loc[[i]] returns a DataFrame even when i has a single row
        # j = female iid, (i,j) makes a pair
        temp1=0
        temp2=0
        temp3=0
        for k in np.arange(0,5):

            if i not in attr_m.index or \
                j not in attr_f.index:
                
                temp1 = 0
                temp2 = 0
            else:

                temp1 = attr_m.loc[attr_m['pid']==j].loc[i][attr_exp_n[k]]
                temp2 = attr_f.loc[attr_f['pid']==i].loc[j][attr_o_n[k]]
            temp3 += temp1*temp2

        get_index = df_train[(df_train['iid']==i) & (df_train['pid']==j)].index
        
        # label-based assignment (DataFrame.set_value is not available in current pandas)
        df_train.loc[get_index, 'rating'] = temp3


test=df_train

In [None]:
np.count_nonzero(np.isnan(df_train['age_diff']))

In [None]:
test.columns  # df_test is only created in the next cell

In [None]:
df_test=test[3000:]
#print(df_test)
df_train=test[:3000]
#print(df_train)

In [None]:
## Use scikit-learn logistic regression
from sklearn.linear_model import LogisticRegression

X = df_train[[ 'age_diff','sim_profession',
         'sim_career','int_diff','experience_diff','financial_diff','rating']]
y = df_train["match"]
Xtest = df_test[[ 'age_diff','sim_profession',
         'sim_career','int_diff','experience_diff','financial_diff','rating']]
ytest = df_test["match"]

In [None]:

# A large C means very weak regularization, which approximates an unregularized fit.
clf_LR=LogisticRegression(C=1000)
clf_LR.fit(X,y)

In [None]:
print("Logistic regression accuracy with (almost) no regularization, C=1000:",clf_LR.score(Xtest,ytest))
# The default solver does not support the l1 penalty; liblinear does.
clf_l1_LR = LogisticRegression(C=1, penalty='l1', solver='liblinear')
clf_l1_LR.fit(X,y)
print("Logistic regression accuracy with l1 penalty:",clf_l1_LR.score(Xtest,ytest))
clf_l2_LR = LogisticRegression(C=1, penalty='l2')
clf_l2_LR.fit(X,y)
print("Logistic regression accuracy with l2 penalty:",clf_l2_LR.score(Xtest,ytest))

print("Coefficients for male participants")
print('age_diff','sim_profession',
         'sim_career','int_diff','experience_diff','financial_diff','rating')
print(clf_LR.coef_) 

In [None]:
df_train.head()