## Predict Tinder Matches

We are going to build a recommender algorithm that suggests profiles to users based on shared interests. The aim is to predict which profiles a given user will find most interesting and be most likely to connect with. This notebook covers the first step: loading, filtering, and cleaning the profile data.

Reference: https://www.geeksforgeeks.org/predict-tinder-matches-with-machine-learning/
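
As a rough illustration of what "similar interests" could mean further down the line, the sketch below ranks a few made-up profiles by the cosine similarity of their word counts. The profile names, the bios, and the helpers `bag_of_words`, `cosine_similarity`, and `recommend` are assumptions for illustration only; this notebook only prepares the data, and the referenced article may use a different method.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user_bio, candidates, top_n=2):
    """Return the names of the top_n candidates whose bios are most similar."""
    user_vec = bag_of_words(user_bio)
    scored = [
        (cosine_similarity(user_vec, bag_of_words(bio)), name)
        for name, bio in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_n]]


# Hypothetical, already-cleaned bios (not taken from the dataset)
candidates = {
    "profile_a": "love yoga tacos hiking",
    "profile_b": "software geek love board games",
    "profile_c": "architect work hard play hard",
}
top_matches = recommend("yoga hiking love dogs", candidates)
```

Here `profile_a` shares three words with the query bio and therefore ranks first. Plain word counts are only one possible representation; weighting schemes or embeddings would slot into the same ranking step.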

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
tinder_data = pd.read_csv("../../03Lecture/data/tinder_data.csv")

In [3]:
type(tinder_data)

pandas.core.frame.DataFrame

In [4]:
tinder_data.head()

Unnamed: 0,user_id,username,age,status,sex,orientation,drinks,drugs,height,job,location,pets,smokes,language,new_languages,body_profile,education_level,dropped_out,bio,location_preference
0,fffe3100,Edith Lopez,27,single,f,gay,socially,never,66.0,medicine / health,"oakland, california",likes dogs and likes cats,no,"english (fluently), spanish (poorly), sign lan...",interested,athletic,4.0,no,bottom line i love life! i work hard and i lov...,same state
1,fffe3200,Travis Young,26,single,m,gay,socially,never,68.0,other,"pleasant hill, california",likes dogs,no,"english (fluently), tagalog (okay), french (po...",interested,fit,3.0,no,"i'm a straightforward, genuine, fun loving (i'...",anywhere
2,fffe3300,Agnes Smith,20,seeing someone,f,bisexual,socially,sometimes,69.0,other,"oakland, california",has dogs and likes cats,sometimes,"english (fluently), sign language (poorly), fr...",interested,fit,2.0,no,mmmmm yummy tacosss. yoga is where it's at. i ...,same city
3,fffe3400,Salvador Klaver,27,single,m,bisexual,socially,sometimes,68.0,computer / hardware / software,"daly city, california",likes dogs and likes cats,no,english,not interested,average,3.0,no,i'm a stealth geek. that special mix of techni...,same city
4,fffe3500,Elana Sewell,22,single,f,bisexual,often,sometimes,68.0,other,"oakland, california",likes dogs and likes cats,yes,english,not interested,average,2.0,yes,with the whisper of the wind i was weaved into...,same city


In [5]:
tinder_data.shape

(2001, 20)

In [6]:
tinder_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2001 entries, 0 to 2000
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   user_id              2001 non-null   object 
 1   username             2001 non-null   object 
 2   age                  2001 non-null   int64  
 3   status               2001 non-null   object 
 4   sex                  2001 non-null   object 
 5   orientation          2001 non-null   object 
 6   drinks               2001 non-null   object 
 7   drugs                2001 non-null   object 
 8   height               2001 non-null   float64
 9   job                  2001 non-null   object 
 10  location             2001 non-null   object 
 11  pets                 2001 non-null   object 
 12  smokes               2001 non-null   object 
 13  language             2001 non-null   object 
 14  new_languages        2001 non-null   object 
 15  body_profile         2001 non-null   o

In [7]:
# filter the data step by step

# start with orientation: keep only straight profiles

tinder_data_reduced = tinder_data[tinder_data.orientation == "straight"]


In [8]:
print(tinder_data.shape)
print(tinder_data_reduced.shape)

(2001, 20)
(1736, 20)


In [9]:
# keep only profiles aged 40 or under

tinder_data_reduced = tinder_data_reduced[tinder_data_reduced.age <= 40]

In [10]:
# only consider profiles that are single and available
tinder_data_reduced = tinder_data_reduced[tinder_data_reduced.status.isin(['single', 'available'])]

In [11]:
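# drop the status and orientation columns - both were only used for filtering, and every remaining profile is straight
# (reassigning instead of using inplace=True avoids pandas' SettingWithCopyWarning on a filtered frame)
tinder_data_reduced = tinder_data_reduced.drop(columns=['status', 'orientation'])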

In [12]:
print(tinder_data_reduced.shape)

(1321, 18)
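
The three filtering steps above can also be combined into one boolean mask. The sketch below runs on a small made-up frame (the rows are hypothetical, only the column names come from the dataset) and calls `.copy()` so that later edits operate on an independent DataFrame rather than on a slice of the original.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the dataset
profiles = pd.DataFrame({
    "age": [27, 45, 31, 22],
    "status": ["single", "single", "seeing someone", "available"],
    "orientation": ["straight", "straight", "straight", "gay"],
})

# Combine all three conditions into a single boolean mask
mask = (
    (profiles["orientation"] == "straight")
    & (profiles["age"] <= 40)
    & (profiles["status"].isin(["single", "available"]))
)

# .copy() gives an independent frame; then drop the columns used only for filtering
filtered = profiles[mask].copy().drop(columns=["status", "orientation"])
```

Only the first hypothetical row satisfies all three conditions, so `filtered` ends up with a single row and the `age` column.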


In [13]:
# let's look at location

tinder_data_reduced.location.value_counts()

# assume everyone is in the same area, so location and location_preference are dropped in the next cell

location
san francisco, california    623
oakland, california          154
berkeley, california          93
san mateo, california         41
palo alto, california         32
                            ... 
larkspur, california           1
petaluma, california           1
corte madera, california       1
phoenix, arizona               1
lagunitas, california          1
Name: count, Length: 62, dtype: int64

In [14]:
tinder_data_reduced = tinder_data_reduced.drop(columns=['location', 'location_preference'])

In [15]:
# identify all the features that need encoding

list(tinder_data_reduced.select_dtypes('object').columns)

['user_id',
 'username',
 'sex',
 'drinks',
 'drugs',
 'job',
 'pets',
 'smokes',
 'language',
 'new_languages',
 'body_profile',
 'dropped_out',
 'bio']
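
Two common ways to encode columns like these are an ordinal mapping for ordered categories and one-hot encoding for unordered ones. The sketch below uses made-up rows, and the ordering chosen for `drinks` is an assumption rather than something the dataset defines.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the dataset
sample = pd.DataFrame({
    "drinks": ["socially", "often", "socially"],
    "sex": ["f", "m", "f"],
})

# Ordinal mapping (assumed ordering). Labels missing from the dict would
# become NaN, so the real column needs every one of its values listed here.
drinks_order = {"socially": 0, "often": 1}
sample["drinks_encoded"] = sample["drinks"].map(drinks_order)

# One-hot encoding: one indicator column per category of an unordered feature
encoded = pd.get_dummies(sample, columns=["sex"])
```

Identifier-like columns such as `user_id` and `username`, and free text such as `bio`, are usually handled separately rather than encoded this way.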

In [16]:
tinder_data_reduced.bio.value_counts()

bio
i work and i play. hard. all day. everyday. i am a professional architect and i make enough to make it rain. hard. i love sex. hard. i love anal.

In [17]:
# clean the data

import re

import spacy

# load the model once; reloading it on every call makes apply() very slow
nlp = spacy.load("en_core_web_sm")


def spacy_cleaner(original_text):
    """Clean a text string.

    Removes punctuation, whitespace, numbers, and stop words,
    lemmatizes each remaining token, and collapses repeated characters.
    """
    final_tokens = []
    parsed_text = nlp(original_text)

    for token in parsed_text:
        if token.is_punct or token.is_space or token.like_num or token.is_stop:
            continue
        if token.lemma_ == '-PRON-':
            # spaCy v2 lemmatizes pronouns to '-PRON-'; keep the original word instead
            final_tokens.append(str(token))
        else:
            # strip every non-letter character from the lemma
            sc_removed = re.sub("[^a-zA-Z]", '', str(token.lemma_))
            if len(sc_removed) > 1:
                final_tokens.append(sc_removed)
    joined = ' '.join(final_tokens)
    # collapse any character repeated three or more times down to two
    preprocessed_text = re.sub(r'(.)\1+', r'\1\1', joined)

    return preprocessed_text
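
The last substitution in `spacy_cleaner` shortens any run of the same character to exactly two, which tames elongated words such as "tacosss". A small standalone check of that one step, with `squeeze_repeats` as a hypothetical helper name:

```python
import re


def squeeze_repeats(text):
    """Collapse runs of the same character to at most two occurrences."""
    return re.sub(r"(.)\1+", r"\1\1", text)


examples = ["mmmmm yummy tacosss", "sooo goood", "hello"]
squeezed = [squeeze_repeats(text) for text in examples]
# squeezed == ["mm yummy tacoss", "soo good", "hello"]
```

Legitimate double letters ("yummy", "hello") pass through unchanged, because a run of two is replaced by the same two characters.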

In [18]:
# this can take some time to run
tinder_data_reduced['cleaned_bio'] = tinder_data_reduced.bio.apply(spacy_cleaner)

In [19]:
# after cleaning

tinder_data_reduced['cleaned_bio'][9]

'work play hard day everyday professional architect rain hard love sex hard love anal'

In [20]:
# before cleaning

tinder_data_reduced['bio'][9]

'i work and i play. hard. all day. everyday. i am a professional architect and i make enough to make it rain. hard. i love sex. hard. i love anal.'

In [21]:
tinder_data_reduced.shape

(1321, 17)

In [22]:
tinder_data_reduced.to_csv("../data/tinder_data_filtered.csv", index=False)