## Predict Tinder Matches

We are going to build a recommender algorithm that suggests profiles to a user based on shared interests. The aim is to predict which profiles the user will find most interesting and will try to connect with.

Reference: https://www.geeksforgeeks.org/predict-tinder-matches-with-machine-learning/

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
tinder_data_reduced = pd.read_csv("../data/tinder_data_filtered.csv")

In [3]:
type(tinder_data_reduced)

pandas.core.frame.DataFrame

In [4]:
tinder_data_reduced.head()

Unnamed: 0,user_id,username,age,sex,drinks,drugs,height,job,pets,smokes,language,new_languages,body_profile,education_level,dropped_out,bio,cleaned_bio
0,fffe31003000,Eric Goldberger,28,m,very often,sometimes,62.0,other,dislikes dogs and dislikes cats,when drinking,"english, french, german, chinese, sign language",not interested,used up,4.0,no,i work and i play. hard. all day. everyday. i ...,work play hard day everyday professional archi...
1,fffe31003200,Lucile Trexler,24,f,socially,never,65.0,medicine / health,likes dogs and likes cats,no,"english (fluently), spanish (okay), french (po...",interested,average,3.0,no,"so...there is much to say... for now, i'm a st...",student tentative degree molecular cell biolog...
2,fffe31003300,Earl Eells,29,m,socially,never,73.0,education / academia,likes dogs and dislikes cats,no,english,not interested,average,3.0,no,i moved to san francisco two years ago after l...,move san francisco year ago live east bay year...
3,fffe31003400,Claudine Shreve,33,f,socially,never,65.0,other,has dogs,no,"english, spanish (fluently)",not interested,a little extra,4.0,no,i'm a mixed bag of nuts! i'm outgoing yet shy ...,mixed bag nut outgoing shy meet people grow sf...
4,fffe31003500,Myong Ellison,39,f,often,sometimes,65.0,medicine / health,likes dogs and has cats,no,"english (fluently), spanish (fluently)",not interested,average,4.0,no,"i'm easy-going, fun and and pretty straight fo...",easy go fun pretty straight forward new englan...


In [5]:
tinder_data_reduced.shape

(1321, 17)

In [6]:
tinder_data_reduced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1321 entries, 0 to 1320
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   user_id          1321 non-null   object 
 1   username         1321 non-null   object 
 2   age              1321 non-null   int64  
 3   sex              1321 non-null   object 
 4   drinks           1321 non-null   object 
 5   drugs            1321 non-null   object 
 6   height           1321 non-null   float64
 7   job              1321 non-null   object 
 8   pets             1321 non-null   object 
 9   smokes           1321 non-null   object 
 10  language         1321 non-null   object 
 11  new_languages    1321 non-null   object 
 12  body_profile     1321 non-null   object 
 13  education_level  1321 non-null   float64
 14  dropped_out      1321 non-null   object 
 15  bio              1321 non-null   object 
 16  cleaned_bio      1320 non-null   object 
dtypes: float64(2), int64(1), object(14)
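
The `info()` output lists `cleaned_bio` with 1320 non-null values out of 1321 rows, so one profile has no cleaned bio. A minimal sketch of how such gaps could be counted per column; the toy frame below is made up for illustration and does not contain rows from the dataset:

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "age": [28, 24, 29],
    "cleaned_bio": ["work play hard", None, "move san francisco"],
})

# Count missing values per column and keep only the columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)
```

Running the same two lines on `tinder_data_reduced` should flag only `cleaned_bio`, assuming the `info()` output above reflects the current file.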

In [7]:
# identify all the features that need encoding

list(tinder_data_reduced.select_dtypes('object').columns)

['user_id',
 'username',
 'sex',
 'drinks',
 'drugs',
 'job',
 'pets',
 'smokes',
 'language',
 'new_languages',
 'body_profile',
 'dropped_out',
 'bio',
 'cleaned_bio']

In [8]:
# set user id to be the index (the user name is dropped in a later cell)

tinder_data_reduced = tinder_data_reduced.set_index('user_id')

tinder_data_reduced.head()

user_id,username,age,sex,drinks,drugs,height,job,pets,smokes,language,new_languages,body_profile,education_level,dropped_out,bio,cleaned_bio
fffe31003000,Eric Goldberger,28,m,very often,sometimes,62.0,other,dislikes dogs and dislikes cats,when drinking,"english, french, german, chinese, sign language",not interested,used up,4.0,no,i work and i play. hard. all day. everyday. i ...,work play hard day everyday professional archi...
fffe31003200,Lucile Trexler,24,f,socially,never,65.0,medicine / health,likes dogs and likes cats,no,"english (fluently), spanish (okay), french (po...",interested,average,3.0,no,"so...there is much to say... for now, i'm a st...",student tentative degree molecular cell biolog...
fffe31003300,Earl Eells,29,m,socially,never,73.0,education / academia,likes dogs and dislikes cats,no,english,not interested,average,3.0,no,i moved to san francisco two years ago after l...,move san francisco year ago live east bay year...
fffe31003400,Claudine Shreve,33,f,socially,never,65.0,other,has dogs,no,"english, spanish (fluently)",not interested,a little extra,4.0,no,i'm a mixed bag of nuts! i'm outgoing yet shy ...,mixed bag nut outgoing shy meet people grow sf...
fffe31003500,Myong Ellison,39,f,often,sometimes,65.0,medicine / health,likes dogs and has cats,no,"english (fluently), spanish (fluently)",not interested,average,4.0,no,"i'm easy-going, fun and and pretty straight fo...",easy go fun pretty straight forward new englan...


In [9]:
# drop the user name
tinder_data_reduced.drop(['username'], axis=1, inplace=True)

In [10]:
list(tinder_data_reduced.select_dtypes('object').columns)

['sex',
 'drinks',
 'drugs',
 'job',
 'pets',
 'smokes',
 'language',
 'new_languages',
 'body_profile',
 'dropped_out',
 'bio',
 'cleaned_bio']

In [11]:
# the categories of drinks

tinder_data_reduced.drinks.value_counts()

drinks
socially       937
rarely         139
often          129
not at all      96
very often      11
desperately      9
Name: count, dtype: int64

In [12]:
# convert the categories into dummy columns; these are joined to the original dataframe in a later step

pd.get_dummies(tinder_data_reduced.drinks, prefix='drinks')


user_id,drinks_desperately,drinks_not at all,drinks_often,drinks_rarely,drinks_socially,drinks_very often
fffe31003000,False,False,False,False,False,True
fffe31003200,False,False,False,False,True,False
fffe31003300,False,False,False,False,True,False
fffe31003400,False,False,False,False,True,False
fffe31003500,False,False,True,False,False,False
...,...,...,...,...,...,...
fffe3100390039003000,False,False,False,False,True,False
fffe3100390039003100,False,False,False,False,True,False
fffe3100390039003300,False,False,False,False,True,False
fffe3100390039003700,False,False,False,False,True,False
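
The dummy columns above are booleans. If 0/1 integers are preferred, `pd.get_dummies` accepts a `dtype` argument. A small sketch on made-up values (not rows from the dataset):

```python
import pandas as pd

# Made-up values standing in for the drinks column
drinks = pd.Series(["socially", "often", "socially", "rarely"], name="drinks")

# dtype=int yields 0/1 columns instead of True/False
dummies = pd.get_dummies(drinks, prefix="drinks", dtype=int)
print(dummies)
```

Whether booleans or integers are the better choice depends on what the later modelling step expects; both carry the same information.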


In [13]:
# to make life easier, we can use this function to one-hot encode columns

def one_hot(df, categorical_cols):
    """
    @param df pandas DataFrame
    @param categorical_cols a list of columns to encode
    @return a new DataFrame with the listed columns one-hot encoded
    """

    for c in categorical_cols:
        # build the dummy columns, append them, then drop the original column
        dummies = pd.get_dummies(df[c], prefix=c)
        df = pd.concat([df, dummies], axis=1).drop(columns=c)

    return df
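
As an aside, `pd.get_dummies` can also encode several columns in one call through its `columns` argument, which should give the same result as the loop above. A minimal sketch, where `one_hot_builtin` is a hypothetical name and the toy frame is made up:

```python
import pandas as pd

def one_hot_builtin(df, categorical_cols):
    """One-hot encode the given columns in a single call (hypothetical helper)."""
    # Each listed column is replaced by dummies prefixed with the column name
    return pd.get_dummies(df, columns=categorical_cols)

# Made-up data for illustration
toy = pd.DataFrame({
    "age": [28, 24, 29],
    "drinks": ["often", "socially", "socially"],
    "smokes": ["no", "yes", "no"],
})

encoded = one_hot_builtin(toy, ["drinks", "smokes"])
print(encoded.columns.tolist())
```

The input frame is left unchanged, so the call can be repeated safely in a notebook.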

In [14]:
tinder_data_encoded = one_hot(tinder_data_reduced, ['drinks'])

In [15]:
# do the same for the other categorical columns,
# e.g. drugs

tinder_data_encoded.drugs.value_counts()

drugs
never        1048
sometimes     263
often          10
Name: count, dtype: int64

In [16]:
tinder_data_encoded.job.value_counts(normalize=True)

job
other                                0.142316
student                              0.125662
science / tech / engineering         0.099924
sales / marketing / biz dev          0.085541
artistic / musical / writer          0.080242
computer / hardware / software       0.078728
medicine / health                    0.065859
education / academia                 0.059803
banking / financial / real estate    0.044663
entertainment / media                0.038607
executive / management               0.034822
law / legal services                 0.026495
construction / craftsmanship         0.026495
hospitality / travel                 0.024981
political / government               0.015897
clerical / administrative            0.014383
transportation                       0.009841
rather not say                       0.009841
unemployed                           0.009084
military                             0.004542
retired                              0.002271
Name: proportion, dtype: float64

In [17]:
tinder_data_encoded.smokes.value_counts()

smokes
no                1051
sometimes           92
when drinking       73
yes                 65
trying to quit      40
Name: count, dtype: int64

In [18]:
cols = ['drugs', 'job', 'smokes', 'body_profile', 'new_languages']
# one hot encode drugs, job, ... using the one_hot function (drinks is already encoded)
print(tinder_data_encoded.shape)
tinder_data_encoded = one_hot(tinder_data_encoded, cols)
print(tinder_data_encoded.shape)

(1321, 20)
(1321, 59)
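
The jump from 20 to 59 columns can be checked by hand: each encoded column is removed and replaced by one column per distinct category. With 3 categories for drugs, 21 for job, 5 for smokes and 3 for new_languages, the total of 59 implies 12 categories for body_profile (20 - 5 + 3 + 21 + 5 + 3 + 12 = 59). A sketch of the same check on a made-up toy frame:

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({
    "age": [28, 24, 29, 33],
    "drugs": ["never", "sometimes", "never", "often"],
    "smokes": ["no", "no", "yes", "no"],
})
cols = ["drugs", "smokes"]

# Each encoded column disappears and is replaced by one column per category
expected_width = toy.shape[1] - len(cols) + toy[cols].nunique().sum()
actual_width = pd.get_dummies(toy, columns=cols).shape[1]
print(expected_width, actual_width)
```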


In [19]:
list(tinder_data_encoded.select_dtypes('object').columns)

['sex', 'pets', 'language', 'dropped_out', 'bio', 'cleaned_bio']

In [20]:
tinder_data_encoded['sex'].value_counts()

sex
m    791
f    530
Name: count, dtype: int64

In [21]:
# encode as binary: f -> 0, m -> 1
replace_dict = {'f':0, 'm':1}


tinder_data_encoded['sex'] = tinder_data_encoded['sex'].map(replace_dict)

In [22]:
tinder_data_encoded['sex'].value_counts()

sex
1    791
0    530
Name: count, dtype: int64

In [23]:
# do the same for dropped out

tinder_data_encoded.dropped_out.value_counts()

dropped_out
no     1263
yes      58
Name: count, dtype: int64

In [24]:
replace_dict = {'no':0, 'yes':1}


tinder_data_encoded['dropped_out'] = tinder_data_encoded['dropped_out'].map(replace_dict)

In [25]:
tinder_data_encoded.dropped_out.value_counts()

dropped_out
0    1263
1      58
Name: count, dtype: int64
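
One caveat with `Series.map` and a dictionary: any value missing from the dictionary silently becomes `NaN`. That is harmless here because both columns have exactly two categories, but a guard could make the assumption explicit. A minimal sketch, where `map_binary` is a hypothetical helper and the sample values are made up:

```python
import pandas as pd

def map_binary(series, mapping):
    """Map a two-category column to 0/1 and fail loudly on unmapped values (hypothetical helper)."""
    mapped = series.map(mapping)
    # Values that were not found in the mapping come back as NaN
    unmapped = series[mapped.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"unmapped values: {list(unmapped)}")
    return mapped.astype(int)

# Made-up values for illustration
sex = pd.Series(["m", "f", "m"])
print(map_binary(sex, {"f": 0, "m": 1}).tolist())
```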

In [26]:
# how about pets?!
# has too many categories, so avoid one hot encoding it

tinder_data_encoded.pets.value_counts()

pets
likes dogs and likes cats          513
likes dogs                         219
likes dogs and has cats            125
has dogs                           120
has dogs and likes cats             93
likes dogs and dislikes cats        92
has dogs and has cats               40
likes cats                          40
has cats                            34
has dogs and dislikes cats          20
dislikes dogs and dislikes cats     11
dislikes cats                        4
dislikes dogs                        4
dislikes dogs and likes cats         4
dislikes dogs and has cats           2
Name: count, dtype: int64

In [27]:
# create new features instead, e.g. likes_dog
# note: a plain 'likes dogs' substring also matches 'dislikes dogs',
# so anchor the match on a word boundary

tinder_data_encoded['likes_dog'] = 0

tinder_data_encoded.loc[tinder_data_encoded['pets'].str.contains(r'\blikes dogs'), 'likes_dog'] = 1

In [28]:
tinder_data_encoded['likes_dog'].value_counts()

likes_dog
1    949
0    372
Name: count, dtype: int64

In [29]:
# likes cats (same word-boundary match, so 'dislikes cats' is excluded)

tinder_data_encoded['likes_cats'] = 0

tinder_data_encoded.loc[tinder_data_encoded['pets'].str.contains(r'\blikes cats'), 'likes_cats'] = 1

In [30]:
tinder_data_encoded['likes_cats'].value_counts()

likes_cats
0    671
1    650
Name: count, dtype: int64

In [31]:
# for reference - profiles that like cats and dogs

tinder_data_encoded[(tinder_data_encoded['likes_dog'] == 1) & (tinder_data_encoded['likes_cats'] == 1)].shape[0]

513
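
Substring matching on the pets column needs care, because the text 'dislikes dogs' contains 'likes dogs'. In the value counts above, the four categories that start with 'likes dogs' add up to 949 profiles, while the four 'dislikes dogs' categories add up to another 21. A small sketch on made-up values showing the difference between a plain substring match and a word-boundary match:

```python
import pandas as pd

# Made-up values that mirror the category wording of the pets column
pets = pd.Series([
    "likes dogs and likes cats",
    "dislikes dogs and likes cats",
    "likes dogs and dislikes cats",
    "has dogs",
])

# Plain substring match: also matches "dislikes dogs"
naive = pets.str.contains("likes dogs", regex=False)

# Word boundary: "likes" must start a word, so "dislikes" is excluded
strict = pets.str.contains(r"\blikes dogs", regex=True)

print(naive.tolist())
print(strict.tolist())
```

Note that neither version treats 'has dogs' as liking dogs; whether owners should count as well is a modelling choice left open here.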

In [32]:
# we can drop the pets column now that it has been converted into the likes_dog and likes_cats features

tinder_data_encoded.drop(['pets'], axis=1, inplace=True)

In [33]:
tinder_data_encoded.language.value_counts()

language
english                                                                            348
english (fluently)                                                                 177
english (fluently), spanish (poorly)                                                61
english (fluently), spanish (okay)                                                  53
english (fluently), spanish (fluently)                                              38
                                                                                  ... 
english (poorly), c++ (poorly)                                                       1
english (fluently), italian (okay), spanish (okay)                                   1
english, english (okay), japanese (fluently), chinese (fluently), french (okay)      1
english, spanish (fluently), portuguese (okay), french (poorly)                      1
english (fluently), spanish (poorly), chinese (okay)                                 1
Name: count, Length: 419, dtype: int64

In [34]:
# everyone can speak english, so this could be considered a redundant variable
tinder_data_encoded[~tinder_data_encoded['language'].str.contains('english')]

user_id,age,sex,height,language,education_level,dropped_out,bio,cleaned_bio,drinks_desperately,drinks_not at all,...,body_profile_overweight,body_profile_rather not say,body_profile_skinny,body_profile_thin,body_profile_used up,new_languages_interested,new_languages_not interested,new_languages_somewhat interested,likes_dog,likes_cats


In [35]:
# could consider another feature for if they speak spanish or not, but for another time

tinder_data_encoded[tinder_data_encoded['language'].str.contains('spanish')].head()

user_id,age,sex,height,language,education_level,dropped_out,bio,cleaned_bio,drinks_desperately,drinks_not at all,...,body_profile_overweight,body_profile_rather not say,body_profile_skinny,body_profile_thin,body_profile_used up,new_languages_interested,new_languages_not interested,new_languages_somewhat interested,likes_dog,likes_cats
fffe31003200,24,0,65.0,"english (fluently), spanish (okay), french (po...",3.0,0,"so...there is much to say... for now, i'm a st...",student tentative degree molecular cell biolog...,False,False,...,False,False,False,False,False,True,False,False,1,1
fffe31003400,33,0,65.0,"english, spanish (fluently)",4.0,0,i'm a mixed bag of nuts! i'm outgoing yet shy ...,mixed bag nut outgoing shy meet people grow sf...,False,False,...,False,False,False,False,False,False,True,False,0,0
fffe31003500,39,0,65.0,"english (fluently), spanish (fluently)",4.0,0,"i'm easy-going, fun and and pretty straight fo...",easy go fun pretty straight forward new englan...,False,False,...,False,False,False,False,False,False,True,False,1,0
fffe31003700,30,1,70.0,"english, spanish (poorly)",3.0,0,joined this online dating thing per the advice...,join online date thing advice friend recently ...,False,False,...,False,False,False,False,False,True,False,False,1,1
fffe32003200,26,0,67.0,"english (fluently), spanish (poorly)",3.0,0,not sure how to describe myself in a few parag...,sure describe paragraph risk sound like descri...,False,False,...,False,False,False,True,False,True,False,False,1,1
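
If the language column were kept a little longer, a couple of simple features could be derived from it before it is dropped. A rough sketch on made-up values; the names `speaks_spanish` and `n_languages` are hypothetical and not part of the original notebook:

```python
import pandas as pd

# Made-up values in the same format as the language column
language = pd.Series([
    "english (fluently), spanish (okay), french (poorly)",
    "english",
    "english, spanish (fluently)",
])

features = pd.DataFrame({
    # 1 if the profile lists spanish at any fluency level
    "speaks_spanish": language.str.contains("spanish", regex=False).astype(int),
    # entries are comma-separated, so count the separators and add one
    "n_languages": language.str.count(",") + 1,
})
print(features)
```

The count is only approximate: a profile that lists the same language twice, such as 'english, english (okay), ...' in the value counts above, would be counted twice.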


In [36]:
# drop language

tinder_data_encoded.drop(['language'], axis=1, inplace=True)

In [37]:
tinder_data_encoded.bio.value_counts()

bio
i work and i play. hard. all day. everyday. i am a professional architect and i make enough to make it rain. hard. i love sex. hard. i love anal.

In [38]:
list(tinder_data_encoded.select_dtypes('object').columns)

['bio', 'cleaned_bio']

In [39]:
# drop bio and work with cleaned bio
tinder_data_encoded.drop(['bio'], axis=1, inplace=True)

In [40]:
tinder_data_encoded.shape

# to convert cleaned bio into a vector, you will have to install some libraries

(1321, 58)

In [None]:
# TODO: EMBED THE TEXT
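
Until a proper embedding is chosen, one placeholder that needs no extra libraries is a binary bag-of-words built with `Series.str.get_dummies`. This is only a sketch under stated assumptions: it is not a learned embedding, the `bio_` prefix and the `min_docs` threshold are arbitrary choices, and the sample bios are made up. TF-IDF or sentence embeddings would be natural alternatives.

```python
import pandas as pd

# Made-up cleaned bios, including one missing value like in the real column
cleaned_bio = pd.Series(
    ["work play hard", None, "play outdoors", "work hard"],
    index=["u1", "u2", "u3", "u4"],
)

# Fill the missing bio with an empty string, then build one 0/1 column per token
bow = cleaned_bio.fillna("").str.get_dummies(sep=" ").add_prefix("bio_")

# Keep only tokens that appear in at least min_docs bios (arbitrary threshold)
min_docs = 2
bow = bow.loc[:, bow.sum(axis=0) >= min_docs]
print(bow)
```

On the full dataset the vocabulary would be much larger, so a higher threshold would probably be needed to keep the feature matrix manageable.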

In [50]:
# for now drop cleaned_bio - will come back to this next week

tinder_data_encoded.drop(['cleaned_bio'], axis=1, inplace=True)

In [51]:
# save the prepared data
tinder_data_encoded.reset_index().to_csv("../data/tinder_data_prepared.csv", index=False)
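
To connect the prepared data back to the goal stated at the top, here is one possible way such a feature table could be used to suggest similar profiles. It is a sketch under assumptions, not the method of the referenced article: cosine similarity and min-max scaling are choices made here for illustration, `recommend` is a hypothetical helper, and the toy frame is made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the prepared data (values are made up)
prepared = pd.DataFrame(
    {
        "age": [28, 24, 29, 33],
        "likes_dog": [1, 1, 0, 1],
        "likes_cats": [0, 1, 0, 1],
        "drinks_socially": [0, 1, 1, 1],
    },
    index=["u1", "u2", "u3", "u4"],
)

# Min-max scale so that age does not dominate the 0/1 features
span = (prepared.max() - prepared.min()).replace(0, 1)
scaled = (prepared - prepared.min()) / span

# Cosine similarity between every pair of profiles
X = scaled.to_numpy(dtype=float)
norms = np.linalg.norm(X, axis=1, keepdims=True)
norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
unit = X / norms
similarity = pd.DataFrame(unit @ unit.T, index=prepared.index, columns=prepared.index)

def recommend(user_id, k=2):
    """Return the k most similar other profiles (hypothetical helper)."""
    return similarity[user_id].drop(user_id).nlargest(k).index.tolist()

print(recommend("u2"))
```

With the real file, the frame would be loaded from the saved CSV with `user_id` as the index, and the boolean dummy columns would be cast to numbers before scaling.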