<a href="https://colab.research.google.com/github/sorandomchad/projects/blob/main/date-a-scientist/Portfolio_DateAScientist.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Portfolio Project: OkCupid Date-A-Scientist

In recent years, there has been a massive rise in the usage of dating apps to find love. Many of these apps use sophisticated data science techniques to recommend possible matches to users and to optimize the user experience. These apps give us access to a wealth of information that we’ve never had before about how different people experience romance.

In this portfolio project, we will analyze data from OkCupid, an app that matches users based on their answers to multiple-choice and short-answer questions.

We will also create a presentation about our findings from this OkCupid dataset.

The purpose of this project is to practice formulating questions and implementing machine learning techniques to answer those questions.

## The Dataset

The dataset includes about 60k profiles from the OkCupid application in CSV format. Some salient features include age, bio, religion, ethnicity, height, job, location, sex, sexual orientation, and astrological sign.

In [None]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

### Load dataset with pandas

In [None]:
# load data
data = pd.read_csv('profiles.csv')
sample_data = data.sample(n=1000, random_state=42)

## Data Cleaning

In [None]:
sample_data.head()

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
4800,26,average,anything,socially,never,graduated from high school,i am really obsessed with music and would love...,i am supervisor and i am really bored of it. i...,guitar i like to think i am alright. i am alwa...,well used to be my hair. had to get rid of the...,...,"martinez, california",,straight,,,m,sagittarius,no,english,single
56896,37,athletic,,socially,never,,"as for me, im a crazy busy hair stylist by day...",trying to juggle career and life,,i suppose you could ask them.,...,"san francisco, california",,gay,likes dogs,,m,libra and it&rsquo;s fun to think about,no,english,single
17834,30,fit,anything,socially,never,working on masters program,i like to look for the humorous side in everyt...,"currently going to business school full time, ...","writing, telling stories, laughing at myself, ...","people tell me i'm sarcastic, funny (hilarious...",...,"san francisco, california",doesn&rsquo;t have kids,straight,likes dogs and likes cats,atheism and laughing about it,m,gemini and it&rsquo;s fun to think about,no,"english (fluently), chinese (fluently), spanis...",single
27275,26,average,mostly anything,socially,,graduated from college/university,"i love trees, i'm not crazy about them though<...",12:50 press return<br />\nif you know what tha...,"different things, different times. but general...","i look way too innocent, clueless or oblivious...",...,"berkeley, california",,straight,likes dogs,other and laughing about it,m,gemini and it&rsquo;s fun to think about,yes,"english, persian",single
3335,33,average,strictly anything,often,,graduated from college/university,"i come to san francisco by way of new york, wh...",i've taken up iaido with a fantastic sensei in...,observing myself and others. sticking with it ...,"beats me! nobody has ever told me, and i've ne...",...,"san francisco, california",doesn&rsquo;t have kids,straight,likes cats,atheism,m,,no,"english, japanese (poorly)",single


In [None]:
sample_data.essay0.iloc[11]

"its always hard to write about this sort of stuff...but here's a\nshot:<br />\nalways up for an adventure on a whim. the worst has been when i\ndrove 5 hours to get lunch...but it was quite worth it.<br />\ni have eclectic hobbies to say the least.<br />\ntry to spend time as much time as possible outside enjoying what\nthe bay area has to offer.<br />\ntravelling to places that have rich per-biblical history...it is\nincredibly fascinating. next trip will either be to greece or\nitaly.<br />\nlove animals...growing up my house resembled more a zoo with\nrabbits, doves, pigeons, baters, cats, dogs and a deer.<br />\nnot too into watching sports...would much rather prefer to get off\nthe couch and do something...though every now and then will watch\nsome f-1.<br />\ni haven't owned a tv since 2003.<br />\nnot the biggest fan of the east coast winters...they're great to\nvisit...but when driving became more a scene for a roller derby\nrink, i had to reconsider my choices...hence the mov

In [None]:
sample_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 4800 to 29995
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          1000 non-null   int64  
 1   body_type    918 non-null    object 
 2   diet         608 non-null    object 
 3   drinks       957 non-null    object 
 4   drugs        768 non-null    object 
 5   education    880 non-null    object 
 6   essay0       913 non-null    object 
 7   essay1       883 non-null    object 
 8   essay2       841 non-null    object 
 9   essay3       825 non-null    object 
 10  essay4       839 non-null    object 
 11  essay5       816 non-null    object 
 12  essay6       753 non-null    object 
 13  essay7       790 non-null    object 
 14  essay8       697 non-null    object 
 15  essay9       795 non-null    object 
 16  ethnicity    900 non-null    object 
 17  height       1000 non-null   float64
 18  income       1000 non-null   int64  
 19  job    

The `last_online` column will be dropped because of bad formatting and irrelevance to our analysis.

In [None]:
# drop last_online column
if 'last_online' in sample_data.columns:
  sample_data = sample_data.drop(columns=['last_online'])
sample_data.columns

Index(['age', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'essay0',
       'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7',
       'essay8', 'essay9', 'ethnicity', 'height', 'income', 'job', 'location',
       'offspring', 'orientation', 'pets', 'religion', 'sex', 'sign', 'smokes',
       'speaks', 'status'],
      dtype='object')

In [None]:
sample_data.isnull().sum()/len(sample_data)

Unnamed: 0,0
age,0.0
body_type,0.082
diet,0.392
drinks,0.043
drugs,0.232
education,0.12
essay0,0.087
essay1,0.117
essay2,0.159
essay3,0.175


In [None]:
!pip install contractions



## Text preprocessing

In [None]:
# libraries for formatting text
import contractions
import re
import string

### Contractions

We begin cleaning the text data in the essay columns by expanding contractions. A contraction is a shortened form of one or more words in which an apostrophe marks the omitted letters (for example, "i'm" for "i am"). Expanding contractions with the `contractions` module maps variants such as "i'm" and "i am" to the same tokens, which reduces the dimensionality of the document-term matrix in later analysis.

The dataset encodes the apostrophe as the HTML entity `&rsquo;` (a right single quotation mark), which we replace with a plain `'`. The `contractions.fix()` function then expands the contractions in the text.

### Punctuation
Other than standard punctuation, we need to get rid of:
* the newline character `\n`
* HTML tags of the form `<(any number of chars)>`

The `string.punctuation` constant in Python is a string containing the following punctuation symbols:

``!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~``
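The cleaning steps described in this section can be tried on a single string before applying them to the whole dataset. The following is a minimal sketch using only the standard library; the `sample` string is made up for illustration and the contraction-expansion step is left out.

```python
import re
import string

# Made-up text containing a newline, HTML tags, an ellipsis, and punctuation.
sample = "hello,<br />\nworld... it's <b>fine</b>!"

# Replace newlines, HTML tags, and ellipses with a space.
no_markup = re.sub(r'\n|<[^>]+>|\.{3}', ' ', sample)

# Delete every character listed in string.punctuation.
no_punct = no_markup.translate(str.maketrans('', '', string.punctuation))

# Collapse runs of whitespace and trim the ends.
cleaned = re.sub(r'\s+', ' ', no_punct).strip()
print(cleaned)
```

Replacing markup with a space rather than deleting it keeps words on either side of a tag from being glued together.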

In [None]:
def remove_punctuation(text):
  '''
    Accepts a string as input and returns a cleaned string without punctuation.
  '''
  # replace &rsquo; with '
  text = re.sub(r'&rsquo;', '\'', text)

  # expand contractions
  text = contractions.fix(text)

  # remove newlines, html tags, forward slashes, and ellipses
  text = re.sub(r'\n|<[^>]+>|\.{3}|/', ' ', text)

  # remove remaining punctuation
  PUNCT_TO_REMOVE = string.punctuation
  text = text.translate(str.maketrans('', '',
                                    PUNCT_TO_REMOVE))

  # remove extra space
  text = re.sub(r'\s{2,}', ' ', text)

  return text

Gather all of the text columns.

In [None]:
# all of the text columns in the dataframe
text_cols = [col for col in sample_data.columns if sample_data[col].dtype == 'object']
text_cols

['body_type',
 'diet',
 'drinks',
 'drugs',
 'education',
 'essay0',
 'essay1',
 'essay2',
 'essay3',
 'essay4',
 'essay5',
 'essay6',
 'essay7',
 'essay8',
 'essay9',
 'ethnicity',
 'job',
 'location',
 'offspring',
 'orientation',
 'pets',
 'religion',
 'sex',
 'sign',
 'smokes',
 'speaks',
 'status']

`NaN` is a float value, so string operations fail on it. We replace each `NaN` in the text columns with the empty string so that string operations can be applied to every cell.

The `.map()` method applies a function to a DataFrame elementwise (it is available as `DataFrame.map` in pandas 2.1 and later; older versions call it `applymap`). Here, we combine this method with the `remove_punctuation()` function to strip punctuation, leaving only words in the text columns.
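To see why the missing values are filled before any string function is applied, here is a small sketch on a toy DataFrame. The column values are made up, and `str.upper` merely stands in for a text-cleaning function. It applies `Series.map` column by column, which is an assumption made so the sketch also runs on pandas versions that predate `DataFrame.map`.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN marks a missing answer.
toy = pd.DataFrame({
    'diet': ['mostly vegan', np.nan],
    'pets': [np.nan, 'likes dogs'],
})

# NaN is a float, so string methods such as .upper() are not available on it.
nan_is_float = isinstance(np.nan, float)

# After filling, every cell is a str and string functions apply safely.
filled = toy.fillna('')
shouted = filled.apply(lambda col: col.map(str.upper))
print(shouted)
```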

In [None]:
# change NaNs to empty strings to avoid problems with string operations
sample_data[text_cols] = sample_data[text_cols].fillna('')

# remove punctuation of text in the dataset
sample_data[text_cols] = sample_data[text_cols].map(remove_punctuation)
sample_data[text_cols].head()

Unnamed: 0,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,essay4,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
4800,average,anything,socially,never,graduated from high school,i am really obsessed with music and would love...,i am supervisor and i am really bored of it i ...,guitar i like to think i am alright i am alway...,well used to be my hair had to get rid of the ...,to many movies to just pick one and same with ...,...,martinez california,,straight,,,m,sagittarius,no,english,single
56896,athletic,,socially,never,,as for me i am a crazy busy hair stylist by da...,trying to juggle career and life,,i suppose you could ask them,,...,san francisco california,,gay,likes dogs,,m,libra and it is fun to think about,no,english,single
17834,fit,anything,socially,never,working on masters program,i like to look for the humorous side in everyt...,currently going to business school full time a...,writing telling stories laughing at myself hel...,people tell me i am sarcastic funny hilarious ...,movies office space fight club i love situatio...,...,san francisco california,does not have kids,straight,likes dogs and likes cats,atheism and laughing about it,m,gemini and it is fun to think about,no,english fluently chinese fluently spanish okay,single
27275,average,mostly anything,socially,,graduated from college university,i love trees i am not crazy about them though ...,1250 press return if you know what that is you...,different things different times but generally...,i look way too innocent clueless or oblivious ...,db tool in flames lately a lot of postrock god...,...,berkeley california,,straight,likes dogs,other and laughing about it,m,gemini and it is fun to think about,yes,english persian,single
3335,average,strictly anything,often,,graduated from college university,i come to san francisco by way of new york whe...,i have taken up iaido with a fantastic sensei ...,observing myself and others sticking with it t...,beats me nobody has ever told me and i have ne...,books i have been on a nonfiction history kick...,...,san francisco california,does not have kids,straight,likes cats,atheism,m,,no,english japanese poorly,single


In [None]:
import re

text = """meine amerikane brille.<br />
<br />
whatever i'm doing, i'm <a class="ilink" href=
"/interests?i=into+it">into it</a>. my <a class="ilink" href=
"/interests?i=intensity">intensity</a> is usually a boon.
(usually.) also, they often notice that i'm tall. which is weird,
because i'm really, really not."""

# Replace every HTML tag with the placeholder "<UNK>"
new_text = re.sub(r"<[^>]+>", "<UNK>", text)

new_text

"meine amerikane brille.<UNK>\n<UNK>\nwhatever i'm doing, i'm <UNK>into it<UNK>. my <UNK>intensity<UNK> is usually a boon.\n(usually.) also, they often notice that i'm tall. which is weird,\nbecause i'm really, really not."

### Removing stopwords

Stopwords are words that occur very frequently in a language, such as 'the', 'to', and 'at'. They usually add little to the meaning of a text, so we can remove them in most cases.
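As a rough illustration of the idea, the sketch below filters a sentence against a tiny hand-picked stopword set. Both `TOY_STOPWORDS` and `drop_stopwords` are hypothetical names for this example, and the set is far smaller than a real stopword list such as NLTK's.

```python
# Hand-picked stopwords for illustration only (not a real stopword list).
TOY_STOPWORDS = {'the', 'to', 'at', 'a', 'i'}

def drop_stopwords(text, stopwords=TOY_STOPWORDS):
    """Return text with every whitespace-separated stopword removed."""
    return ' '.join(word for word in text.split() if word not in stopwords)

result = drop_stopwords('i like to hike at the beach')
print(result)
```

Using a set for the stopwords keeps each membership check fast, which matters once the filter runs over thousands of essays.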

In [None]:
# library
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize     # to break down text into tokens

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
STOPWORDS = set(stopwords.words('english'))   # set of English stopwords

# function to remove stopwords
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in word_tokenize(text) if word not in STOPWORDS])

In [None]:
# cols = ['body_type',
#  'diet',
#  'drinks',
#  'drugs',
#  'education',
#  'ethnicity',
#  'job',
#  'orientation',
#  'pets',
#  'religion',
#  'sign',
#  'smokes',
#  'speaks',
#  'status']
# # removing stopwords
# sample_data[cols] = sample_data[cols].map(remove_stopwords)
# sample_data[text_cols].head()

## Cleaning Numeric Columns

In [None]:
# all of the numeric columns in the dataframe
numeric_cols = [col for col in sample_data.columns if sample_data[col].dtype != 'object']
numeric_cols

['age', 'height', 'income']

In [None]:
sample_data[numeric_cols].info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 4800 to 29995
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     1000 non-null   int64  
 1   height  1000 non-null   float64
 2   income  1000 non-null   int64  
dtypes: float64(1), int64(2)
memory usage: 31.2 KB


In [None]:
# null numeric columns
sample_data[numeric_cols].isnull().sum()

Unnamed: 0,0
age,0
height,0
income,0


### Disguised null values
Though the `income` column has no `NaN`s, the dataset uses the sentinel value of -1 to indicate a null value. We will replace them with `NaN`.
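A sentinel such as -1 is easy to overlook because it passes a null check and still takes part in arithmetic. The sketch below uses made-up income values (not taken from the dataset) to show how the sentinel distorts the mean and how converting it to `NaN` exposes the true share of missing data.

```python
import numpy as np
import pandas as pd

# Toy income values; -1 marks "no answer" (made-up numbers).
toy_income = pd.Series([-1, 20000, -1, 80000])

# Treating the sentinel as a real number drags the mean down.
naive_mean = toy_income.mean()

# Converting the sentinel to NaN lets pandas skip it in calculations.
clean_income = toy_income.replace(-1, np.nan)
clean_mean = clean_income.mean()
missing_share = clean_income.isna().mean()

print(naive_mean, clean_mean, missing_share)
```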

In [None]:
# replace the sentinel value -1 with NaN
sample_data['income'] = sample_data['income'].replace(-1, np.nan)

In [None]:
# recheck for nulls
sample_data[numeric_cols].isnull().sum()/len(sample_data)

Unnamed: 0,0
age,0.0
height,0.0
income,0.831


About 83 percent of the `income` values are missing, so using that variable in any analysis could introduce bias. We will drop it from the dataset.

In [None]:
# dropping the income column
sample_data = sample_data.drop(columns=['income'])   # assign the result; drop() returns a new DataFrame
numeric_cols.remove('income')
numeric_cols

['age', 'height']

## Exploratory Data Analysis

In this section, we will explore some summary statistics of the `age` and `height` variables.

In [None]:
sample_data[numeric_cols].describe()

Unnamed: 0,age,height
count,1000.0,1000.0
mean,32.21,68.39
std,9.118939,3.836485
min,18.0,58.0
25%,26.0,66.0
50%,30.0,68.0
75%,37.0,71.0
max,69.0,81.0


* Mean height: 5'8.39"
* Max height: 6'9"
* Min height: 4'10"
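The heights in the summary table are in inches, and the feet-and-inches figures in the list above follow from dividing by 12. A small helper (the name `inches_to_feet` is hypothetical) makes the conversion explicit:

```python
def inches_to_feet(total_inches):
    """Format a height given in inches as feet and inches."""
    feet, inches = divmod(total_inches, 12)
    # round() hides floating-point noise; :g drops trailing zeros.
    return f"{int(feet)}'{round(inches, 2):g}\""

for label, value in [('mean', 68.39), ('max', 81.0), ('min', 58.0)]:
    print(label, inches_to_feet(value))
```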

In [None]:
# unique values in text columns
sample_data[text_cols].nunique()

Unnamed: 0,0
body_type,13
diet,14
drinks,7
drugs,4
education,29
essay0,914
essay1,882
essay2,839
essay3,788
essay4,839


In [None]:
# inspecting the unique categories of some of the variables
for var in text_cols:
  if sample_data[sample_data[var] != ''][var].nunique() < 70:
    print(f"{var} categories:", sample_data[var].unique())
    print("\n\n")

body_type categories: ['average' 'athletic' 'fit' 'a little extra' 'curvy' 'thin' 'skinny' ''
 'used up' 'full figured' 'overweight' 'rather not say' 'jacked']



diet categories: ['anything' '' 'mostly anything' 'strictly anything' 'mostly vegetarian'
 'vegetarian' 'mostly other' 'strictly vegetarian' 'vegan'
 'strictly other' 'mostly vegan' 'other' 'mostly kosher' 'strictly vegan']



drinks categories: ['socially' 'often' 'not at all' 'rarely' '' 'very often' 'desperately']



drugs categories: ['never' '' 'sometimes' 'often']



education categories: ['graduated from high school' '' 'working on masters program'
 'graduated from college university' 'graduated from masters program'
 'working on law school' 'graduated from space camp'
 'graduated from twoyear college' 'graduated from law school'
 'working on college university' 'dropped out of college university'
 'working on med school' 'high school' 'working on space camp'
 'graduated from phd program' 'dropped out of twoyear colleg

### Religion and Sign
For simplicity, only the first word of each entry is kept as that person's value for these variables. For example, "atheism and laughing about it" becomes "atheism".

In [None]:
# restricting categories to the first word in the entry
sample_data[['religion', 'sign']] = sample_data[['religion', 'sign']].map(lambda text: text.split()[0] if text != '' else '')

In [None]:
# checking for correctness
print(sample_data['religion'].unique())
print('\n')
print(sample_data['sign'].unique())

['' 'atheism' 'other' 'catholicism' 'buddhism' 'agnosticism'
 'christianity' 'judaism' 'hinduism' 'islam']


['sagittarius' 'libra' 'gemini' '' 'leo' 'aquarius' 'cancer' 'taurus'
 'capricorn' 'virgo' 'aries' 'scorpio' 'pisces']


### Location
Since the respondents in our sample are all located in California, we will drop the state from each entry and keep only the city name.

In [None]:
sample_data['location'] = sample_data['location'].map(lambda text: ' '.join(text.split()[:-1]) if text != '' else '')
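The rule used here, dropping the final word of the cleaned location string, can be sketched on plain strings. The helper name `drop_last_word` is hypothetical. The last call shows a limitation worth keeping in mind: a state name made of more than one word would only be partly removed, which matters if the rule is applied beyond this sample.

```python
def drop_last_word(location):
    """Remove the final whitespace-separated word from a location string."""
    words = location.split()
    return ' '.join(words[:-1])

print(drop_last_word('san francisco california'))
print(drop_last_word(''))
print(drop_last_word('new york new york'))  # multi-word state is only partly removed
```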

In [None]:
# checking for correctness
print(sample_data['location'].unique())

['martinez' 'san francisco' 'berkeley' 'walnut creek' 'stanford'
 'san anselmo' 'oakland' 'palo alto' 'pleasant hill' 'south san francisco'
 'san rafael' 'hayward' 'redwood city' 'albany' 'belmont' 'novato'
 'daly city' 'san bruno' 'mountain view' 'san mateo' 'benicia'
 'burlingame' 'sausalito' 'el cerrito' 'emeryville' 'san leandro'
 'alameda' 'richmond' 'vallejo' 'larkspur' 'millbrae' 'green brae' 'rodeo'
 'mill valley' 'muir beach' 'menlo park' 'lagunitas' 'moraga' 'san carlos'
 'san pablo' 'los angeles' 'el sobrante' 'san lorenzo' 'crockett'
 'montara' 'castro valley' 'half moon bay' 'lafayette' 'pinole' 'woodacre'
 'los gatos']


In [None]:
# sample_data.to_csv("results.csv", index=False)