# 1. Project Overview (come back to outline the steps you'll follow in the notebook)

https://towardsdatascience.com/how-to-structure-your-data-science-notebook-to-be-easy-to-follow-2d3c2777e6e0

- describe the project in terms of business goals
- give context to the work, where it originated, and what you want to achieve.
- briefly talk about any prior knowledge (For example, if the sales data is only from one specific store, that should be mentioned. If at a certain period of time the company had problems with some of the products, like distribution issues, directly affecting sales, this should be stated too. Basically, we have to describe anything that helps the reader understand the context of the data sources and important details.)

This project uses the OKCupid Profiles dataset (sourced from Kaggle) to explore, analyze, and clean data for building a content-based filtering system. The primary goal is to recommend potential matches based on user preferences and attributes. The dataset does not include user images, but it contains user attributes covering demographics, interests, and lifestyle choices, as well as 10 "essays", which are answers to dating prompts such as "Dating me looks like..." and "Together we can..."


# 2. Import Libraries

This section imports the libraries used throughout the notebook. The next section downloads the OKCupid dataset from Kaggle, reviews its overview and descriptive statistics, and then looks at how much data is missing.

In [71]:
import kagglehub
import pandas as pd
import re

!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
from datetime import datetime, timedelta
from nltk.corpus import stopwords
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

warnings.filterwarnings('ignore')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

from nltk.stem import WordNetLemmatizer
from textblob import Word
!pip install pyspellchecker
from spellchecker import SpellChecker



[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!




# 3. Load and Explore Data

In [72]:
path = kagglehub.dataset_download("andrewmvd/okcupid-profiles")

print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/andrewmvd/okcupid-profiles/versions/1


In [73]:
# Build the file path from the download location instead of hard-coding it
df = pd.read_csv(f"{path}/okcupid_profiles.csv", sep=",", header=0)
df.sample(5)

Unnamed: 0,age,status,sex,orientation,body_type,diet,drinks,drugs,education,ethnicity,...,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9
53111,36,single,m,straight,fit,mostly anything,desperately,never,working on ph.d program,asian,...,brought to you by: okcupid - free online datin...,trying to make this the longest profile you ev...,"i can fix anything that isn't smarter than me,...",my very bad sense of fashion which i hope will...,now that i have okc in my life i'm no longer d...,socks dragsters rocky roads barack obama pengu...,people that get me excited. ...how to improve...,doing whatever hugh hefner is doing but in my ...,i like to answer roommate wanted ads and visit...,you're running a little late. you have a comp...
29097,24,single,f,straight,,,socially,,graduated from college/university,white,...,i am fun-loving and free spirited. i enjoy spe...,i moved to san francisco about a year ago. i'm...,listening and giving advice.,my eyes and hair.,"garden state, love actually, bridesmaids, grea...",yoga. friends/family. music. wine. the beach. ...,what it would have been like to live in an ent...,out with friends or hosting a dinner/cocktail ...,i am/was obsessed with dawson's creek.,you like what you read and see!
50217,21,single,f,straight,curvy,anything,socially,,graduated from college/university,white,...,i'm an epicurious lady. i play a little bit o...,i just moved back to the bay area after a 4-ye...,"making brunch, trying my hand at things, sleep...","my long hair, it's a force to be reckoned with.","some music i listen to: joanna newsom, the pun...","dark coffee, quirky beer, comfy bed, time to r...","how can i use that for my own devices?(design,...","drinking fine beers, chilling with the fam or ...",i'm in a cover band that primarily does bad im...,i'll let you ruminate on that one..
22032,29,single,m,straight,athletic,,socially,never,graduated from college/university,white,...,"not really a big fan of describing myself, but...",living life as best as i can and enjoying ever...,anything athletic...,"i'm not sure, but i would guess my smile...and...","i enjoy reading non-fiction books, but honestl...","my family, baseball, running shoes, powerade, ...",,going to a local sporting event or chilling wi...,,
41071,40,single,m,straight,fit,anything,,,graduated from college/university,"hispanic / latin, white",...,*i'll be out of town for a bit. back in early ...,,enjoying people for who they are. adapting to ...,my southamerican accent.,just ask.,good coffee and lots of it !! bikes / books ru...,"ways to improve my life in general, looking fo...",out with friends after work.,,if you like drinking coffee and riding bikes. ...


In [74]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          59946 non-null  int64  
 1   status       59946 non-null  object 
 2   sex          59946 non-null  object 
 3   orientation  59946 non-null  object 
 4   body_type    54650 non-null  object 
 5   diet         35551 non-null  object 
 6   drinks       56961 non-null  object 
 7   drugs        45866 non-null  object 
 8   education    53318 non-null  object 
 9   ethnicity    54266 non-null  object 
 10  height       59943 non-null  float64
 11  income       59946 non-null  int64  
 12  job          51748 non-null  object 
 13  last_online  59946 non-null  object 
 14  location     59946 non-null  object 
 15  offspring    24385 non-null  object 
 16  pets         40025 non-null  object 
 17  religion     39720 non-null  object 
 18  sign         48890 non-null  object 
 19  smok

In [75]:
df.describe()

Unnamed: 0,age,height,income
count,59946.0,59943.0,59946.0
mean,32.34029,68.295281,20033.222534
std,9.452779,3.994803,97346.192104
min,18.0,1.0,-1.0
25%,26.0,66.0,-1.0
50%,30.0,68.0,-1.0
75%,37.0,71.0,-1.0
max,110.0,95.0,1000000.0


In [76]:
# Calculate the count and fraction of missing values for each column
missing_summary = df.isnull().agg(['sum', 'mean']).T

# Rename columns for clarity
missing_summary.columns = ['# Missing Values', '% Missing']

# Format percentage as a string with two decimal places
missing_summary['% Missing'] = (missing_summary['% Missing'] * 100).apply(lambda x: f"{x:.2f}%")

# Display the result
print(missing_summary)

             # Missing Values % Missing
age                       0.0     0.00%
status                    0.0     0.00%
sex                       0.0     0.00%
orientation               0.0     0.00%
body_type              5296.0     8.83%
diet                  24395.0    40.69%
drinks                 2985.0     4.98%
drugs                 14080.0    23.49%
education              6628.0    11.06%
ethnicity              5680.0     9.48%
height                    3.0     0.01%
income                    0.0     0.00%
job                    8198.0    13.68%
last_online               0.0     0.00%
location                  0.0     0.00%
offspring             35561.0    59.32%
pets                  19921.0    33.23%
religion              20226.0    33.74%
sign                  11056.0    18.44%
smokes                 5512.0     9.19%
speaks                   50.0     0.08%
essay0                 5488.0     9.15%
essay1                 7572.0    12.63%
essay2                 9638.0    16.08%


The sample showed rows with a lot of missing information. I want to check whether it is worth dropping rows with more than half of their features missing, since there is no way to recover the lost information.

In [77]:
# Count the number of NaN values per row
nan_counts_per_row = df.isnull().sum(axis=1)

# Define a threshold (e.g., more than half the columns)
threshold = df.shape[1] // 2

# Count the rows with NaN values exceeding the threshold
rows_above_threshold = (nan_counts_per_row > threshold).sum()

# Display the count
print(f"Number of rows with NaNs exceeding the threshold of {threshold}: {rows_above_threshold}")

percent_missing = (rows_above_threshold / len(df)) * 100
print(f"Percentage of rows with NaNs exceeding the threshold of {threshold}: {percent_missing:.2f}%")

Number of rows with NaNs exceeding the threshold of 15: 931
Percentage of rows with NaNs exceeding the threshold of 15: 1.55%


I have decided to drop the `income`, `body_type`, `sign`, and `last_online` columns, as I do not consider them important for matchmaking.

In [78]:
df.drop(columns=['income', 'body_type', 'sign', 'last_online'], inplace=True)
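
A further point about `income`: in the `df.describe()` output above, the minimum and the 25th, 50th, and 75th percentiles are all -1, which suggests that -1 is a placeholder for "not reported" rather than a real value. The sketch below, on a small made-up frame (`toy` and its values are illustrative, not taken from the dataset), shows one way such a sentinel could be measured so that it counts as missing data.

```python
import numpy as np
import pandas as pd

# Illustrative data only: -1 stands in for "income not reported".
toy = pd.DataFrame({"income": [-1, -1, 50000, -1, 80000]})

# Share of rows carrying the sentinel value.
sentinel_share = (toy["income"] == -1).mean()

# Converting the sentinel to NaN lets isnull()-based summaries count it.
income_clean = toy["income"].replace(-1, np.nan)

print(f"{sentinel_share:.0%} of rows use the -1 placeholder")
print(income_clean.isnull().sum(), "missing after conversion")
```

Without a conversion like this, a column can report 0 missing values in `df.info()` while being mostly empty in practice.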

# 4. Data Preprocessing

1.55% of the rows (931) have more than 15 features missing. I will remove them, since dropping so few rows will have little impact on the dataset.

In [79]:
df = df[nan_counts_per_row <= threshold]
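
The same kind of row filter can also be written with `DataFrame.dropna(thresh=...)`, which keeps rows that have at least `thresh` non-null values. Below is a minimal sketch on made-up data (`toy` and `max_missing` are illustrative names). Note that in this notebook the NaN counts were taken before the four columns were dropped, so applying `dropna` to the reduced frame would not necessarily select the same rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [2, 2, np.nan],
    "c": [3, np.nan, np.nan],
    "d": [4, 4, 4],
})

max_missing = toy.shape[1] // 2  # allow at most half the columns to be NaN

# Boolean-mask version, in the style used in the notebook.
masked = toy[toy.isnull().sum(axis=1) <= max_missing]

# dropna version: require at least (n_cols - max_missing) non-null values.
kept = toy.dropna(thresh=toy.shape[1] - max_missing)

print(kept)
```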

In [80]:
df.sample(10)

Unnamed: 0,age,status,sex,orientation,diet,drinks,drugs,education,ethnicity,height,...,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9
27503,22,single,m,straight,mostly anything,not at all,never,graduated from college/university,asian,75.0,...,,i'm a software engineer creating and architect...,rock climbing (ok i'm getting a lot better and...,i'm tall.,"the count of monte cristo, watership down, men...",computer cars blazers good food python (pr...,"systems design, software architecture my futu...",with friends,,you think you can pique my interest
57447,27,single,m,straight,mostly anything,socially,never,graduated from two-year college,hispanic / latin,72.0,...,"hey im veracious, loving, romantic, without a ...",,,,,,,,,
50361,28,single,m,straight,mostly anything,socially,,graduated from college/university,white,71.0,...,i'm a software engineer musician surfer. it ke...,enjoying it,spontaneously breaking into songs,"kinda random question, nobody's ever told me. ...",,music the beach the snow electricity water my ...,making things,"at band practice, playing a show, or grabbing ...",i have a molar with 5 cusps,
36860,31,single,f,gay,anything,socially,,,black,64.0,...,"yes, my glasses are real. i did not steal them...",living it.,being awesome.,my glasses...,"to kill a mockingbird, the coldest winter ever...",laughter my mom my glasses (told you they're r...,the next 5 minutes. and the next 5 years.,probably sleeping. my work weeks are long and ...,that i'm painfully shy.,you like me...?
39959,19,single,f,bisexual,strictly other,socially,sometimes,,"black, hispanic / latin",70.0,...,hey! so... i'm a student. i'm pretty much stra...,going to school,"drawing, love kids, animals, partying, sleepin...","i'm tall, i'm super chill, and i go with the f...","the girl with the dragon tattoo, harry potter ...",the things we all need food weed alcohol music...,my future,probably up to no good...,i cried when i saw happy feet,you want to get to know me. not if you're just...
5169,26,single,f,straight,mostly anything,socially,never,graduated from masters program,asian,66.0,...,i'm passionate about life and curious to know ...,working. working. working. but when i'm not i...,"re-framing things in a better.. more ""positive...",how accepting i am.,i like variety and there are too many favorite...,in no particular order: - black eyeliner - boo...,trying to be productive.,see above.,"that my profile name is ""babyhuny"" lol.",my profile sparks an interest or brought a smi...
5005,42,single,f,straight,mostly anything,socially,,graduated from college/university,other,62.0,...,i can spend a lot of time creating an appealin...,"making time for doing things i enjoy, when pos...",i am good at creating. i am not good at lying....,my eyes,the list is too long. i enjoy cooking and have...,"art, music, good food, sleep, caffine, good fr...",what it all means in the end.,friday's are a good night to be out. seek art ...,i can be very shy. none of my friends believe ...,my profile sparks your interest and you have a...
46079,43,single,f,straight,,socially,never,graduated from two-year college,white,68.0,...,"hello there, all you single silicon valley men...","i am not employed right now, that was by choic...",getting bored with filling out online dating d...,my smile.,books: kite runner a farewell to arms a prayer...,friends music my dog fresh air laughter sunshine,,"either i go out or stay in. totally random, no...","i have webbed toes. not really, i just couldn...",you want to.
45525,31,single,m,straight,anything,socially,never,graduated from masters program,white,69.0,...,,"building on a passion, making it my job.","crepes figure jumping: flips, axels, lutzes yo...",my accent ... or lack of ... depending how obn...,"oui-oui et la gomme magique, tiens bon ninon, ...",- 12 inch round griddle - aeropress - fresh br...,"how i can make others life easier, one crepe a...","...the same thing i do every night, try to tak...",you need to earn that one :),"you have a passion other than ""travelling"" (do..."
29037,37,single,m,straight,,not at all,never,graduated from college/university,other,71.0,...,i am not good at writing about myself. i belie...,,,,"listens to all kinds of musis. likes suspense,...",,,hanging out with friends and going out to eat.,,


## 4.1 Cleaning Numerical Columns

The numerical columns are age and height. Income was also numerical, but it was dropped above.

### Age Column

In [81]:
print("Age Min", df['age'].min())
print("Age Max", df['age'].max())

Age Min 18
Age Max 109


**Handling Outliers in Age Column**
- The minimum age is acceptable and doesn't require changes.
- The maximum age needs to be addressed as it exceeds realistic limits.
- Further steps:
  1. Investigate rows with ages over 100.
  2. Determine whether to remove or impute those values.

In [82]:
df['age'].value_counts().sort_index()

Unnamed: 0_level_0,count
age,Unnamed: 1_level_1
18,298
19,582
20,925
21,1246
22,1895
23,2546
24,3189
25,3497
26,3663
27,3625


I looked more closely at ages 109 and 110 and decided to drop them, since these rows have a lot of missing data that cannot be filled in.

In [83]:
df[(df['age'] == 109) | (df['age'] == 110)]

Unnamed: 0,age,status,sex,orientation,diet,drinks,drugs,education,ethnicity,height,...,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9
25324,109,available,m,straight,mostly other,,never,working on masters program,,95.0,...,,,,nothing,,,,,,


In [84]:
# filtering the data to no longer include ages 109 and 110

df = df[(df['age'] != 109) & (df['age'] != 110)]

# verifying the changes
df['age'].value_counts().sort_index()

Unnamed: 0_level_0,count
age,Unnamed: 1_level_1
18,298
19,582
20,925
21,1246
22,1895
23,2546
24,3189
25,3497
26,3663
27,3625


The `age` column does not have any missing values, so we will move on to the `height` column.

### Height Column

In [85]:
print("Min Height:",df['height'].min())
print("Max Height:",df['height'].max())

Min Height: 1.0
Max Height: 95.0


In [86]:
df['height'].value_counts().sort_index()

Unnamed: 0_level_0,count
height,Unnamed: 1_level_1
1.0,1
3.0,1
4.0,1
6.0,1
8.0,1
9.0,1
26.0,1
36.0,9
37.0,2
42.0,1


There is no information on the unit for height, so I assume it is recorded in inches. I will filter out anomalies by keeping only heights from 4'11" (59 inches) to 6'8" (80 inches).

In [87]:
df = df[(df['height'] >= 59) & (df['height'] <= 80)]

In [88]:
print("Min Height:",df['height'].min())
print("Max Height:",df['height'].max())

Min Height: 59.0
Max Height: 80.0


In [89]:
df.sample(2)

Unnamed: 0,age,status,sex,orientation,diet,drinks,drugs,education,ethnicity,height,...,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9
4655,31,single,f,straight,mostly anything,socially,never,graduated from college/university,"native american, white",65.0,...,i am not a native californian but i came here ...,"living the dream, commuting to the east bay, w...","making others comfortable, laughing at a joke ...","my smile, it is frequently present on my face.",i go through phases with books..i have read a ...,"water, red lipstick, a book, my contacts, hot ...",myself if i am being honest. for about 8 hours...,working a my second job :(,"hmmm, i will share that when you are not a str...","you are funny, kind, easy going and a gentlema..."
41017,42,available,m,straight,,rarely,,graduated from college/university,white,68.0,...,i love people. i like nothing more than sharin...,being a pm. fixing up the house. trying to lea...,understanding what those around me need.,my presence in the room.,"books: animal farm, 1984, elric series, heart ...","friends, sex, integrity, purpose, love, and ho...","privacy, law, and the internet (somewhat ironi...",relaxing after a long week and trying to keep ...,you'd have to ask.,"...you are silly and playful, but appreciate d..."


The raw `height` column had 3 missing values. There is no way to fill in a height accurately, so I will drop any rows where it is missing.

In [90]:
df.dropna(subset=['height'], inplace=True)
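
Comparisons against NaN evaluate to False, so a range filter like the one applied to `height` earlier already excludes rows with a missing height, and the `dropna` call acts as a safeguard. The sketch below uses made-up values and `Series.between` (inclusive on both ends by default) as a compact form of the same range filter.

```python
import numpy as np
import pandas as pd

heights = pd.Series([1.0, 58.0, 59.0, 70.0, 80.0, 95.0, np.nan])

# between() is inclusive on both ends by default; NaN never satisfies it.
in_range = heights.between(59, 80)
filtered = heights[in_range]

print(filtered.tolist())
print("NaNs left:", filtered.isnull().sum())
```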

## 4.2 Cleaning Categorical Columns

Categorical columns include demographics, interests, and lifestyle choices as well as 10 "essays" which are just answers to dating prompts.

### Status

The status column has no missing values, therefore cleaning it will be simple.

In [91]:
# checking that the values in status are standardized
df['status'].value_counts()

Unnamed: 0_level_0,count
status,Unnamed: 1_level_1
single,54642
seeing someone,2032
available,1842
married,298
unknown,8


`single` has the most values, followed by `seeing someone`. `available` and `single` can be merged since the two overlap.

In [92]:
df['status'] = df['status'].replace({'single': 'single', 'available': 'single'})

print(df['status'].value_counts())

status
single            56484
seeing someone     2032
married             298
unknown               8
Name: count, dtype: int64


### Sex

In [93]:
# checking for missing values as well as making sure everything is standardized.
df['sex'].value_counts()

Unnamed: 0_level_0,count
sex,Unnamed: 1_level_1
m,35091
f,23731


### Orientation

In [94]:
df['orientation'].value_counts()

Unnamed: 0_level_0,count
orientation,Unnamed: 1_level_1
straight,50613
gay,5495
bisexual,2714


### Diet

In the raw data, `diet` has 24,395 missing values, which affects 40.69% of the rows. Let's take a look.

In [95]:
df['diet'].value_counts()

Unnamed: 0_level_0,count
diet,Unnamed: 1_level_1
mostly anything,16493
anything,6128
strictly anything,5088
mostly vegetarian,3434
mostly other,1000
strictly vegetarian,869
vegetarian,662
strictly other,444
mostly vegan,337
other,328


In [96]:
df['diet'].sample(10)

Unnamed: 0,diet
36769,mostly vegetarian
28727,
53371,
22520,
6428,mostly anything
26601,
57407,
46142,mostly anything
59540,
49593,


In [97]:
diet_mapping = {
    'anything': 'anything',
    'mostly anything': 'anything',
    'strictly anything': 'anything',
    'vegetarian': 'vegetarian',
    'mostly vegetarian': 'vegetarian',
    'strictly vegetarian': 'vegetarian',
    'vegan': 'vegan',
    'mostly vegan': 'vegan',
    'strictly vegan': 'vegan',
    'other': 'other',
    'mostly other': 'other',
    'strictly other': 'other',
    'kosher': 'kosher',
    'mostly kosher': 'kosher',
    'strictly kosher': 'kosher',
    'halal': 'halal',
    'mostly halal': 'halal',
    'strictly halal': 'halal'
}

df['diet'] = df['diet'].map(diet_mapping).fillna('unknown')
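
Because every diet value in the mapping is a base category optionally prefixed by `mostly` or `strictly`, the same grouping can also be expressed as a prefix strip. This is a sketch on made-up values (`raw` is an illustrative series), not the approach used in the notebook. The explicit dictionary sends any unexpected value to `unknown`, while this version passes unexpected values through unchanged.

```python
import numpy as np
import pandas as pd

raw = pd.Series(["mostly anything", "strictly vegan", "vegetarian", np.nan, "mostly kosher"])

# Remove a leading "mostly " or "strictly " qualifier, then label missing values.
diet_base = (
    raw.str.replace(r"^(?:mostly|strictly)\s+", "", regex=True)
       .fillna("unknown")
)

print(diet_base.tolist())
```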

In [98]:
df['diet'].value_counts()

Unnamed: 0_level_0,count
diet,Unnamed: 1_level_1
anything,27709
unknown,23486
vegetarian,4965
other,1772
vegan,701
kosher,113
halal,76


In [99]:
df['diet'].sample(5)

Unnamed: 0,diet
15388,unknown
2214,unknown
7249,unknown
44175,other
24593,anything


### Drinks/Drugs/Smoke

In the raw data:

- `drinks` has 2,985 missing values (4.98% of the data).
- `drugs` has 14,080 missing values (23.49% of the data).
- `smokes` has 5,512 missing values (9.19% of the data).

In [100]:
df['drinks'].value_counts()

Unnamed: 0_level_0,count
drinks,Unnamed: 1_level_1
socially,41376
rarely,5901
often,5104
not at all,3226
very often,457
desperately,312


In [101]:
drinks_mapping = {
    'socially': 'socially',
    'rarely': 'rarely/not at all',
    'not at all': 'rarely/not at all',
    'often': 'often',
    'very often': 'often',
    'desperately': 'often'
}

df['drinks'] = df['drinks'].map(drinks_mapping).fillna('unknown')

print(df['drinks'].value_counts())

drinks
socially             41376
rarely/not at all     9127
often                 5873
unknown               2446
Name: count, dtype: int64


In [102]:
# drugs
df['drugs'].value_counts()

Unnamed: 0_level_0,count
drugs,Unnamed: 1_level_1
never,36972
sometimes,7665
often,392


In [103]:
df['drugs'] = df['drugs'].fillna('unknown')
print(df['drugs'].isnull().sum())

0


In [104]:
# smokes
df['smokes'].value_counts()

Unnamed: 0_level_0,count
smokes,Unnamed: 1_level_1
no,43484
sometimes,3739
when drinking,3008
yes,2196
trying to quit,1471


In [105]:
smokes_mapping = {
    'no': 'no',
    'sometimes': 'occasionally',
    'when drinking': 'occasionally',
    'yes': 'regularly',
    'trying to quit': 'trying to quit'
}

df['smokes'] = df['smokes'].map(smokes_mapping).fillna('unknown')

print(df['smokes'].value_counts())

smokes
no                43484
occasionally       6747
unknown            4924
regularly          2196
trying to quit     1471
Name: count, dtype: int64


### Education

The `education` column has 6,628 missing values (11.06% of the raw data).

In [106]:
df['education'].value_counts()

Unnamed: 0_level_0,count
education,Unnamed: 1_level_1
graduated from college/university,23816
graduated from masters program,8901
working on college/university,5672
working on masters program,1674
graduated from two-year college,1525
graduated from high school,1413
graduated from ph.d program,1264
graduated from law school,1111
working on two-year college,1063
dropped out of college/university,993


In [107]:
# Define mapping for education levels
education_mapping = {
    'graduated from college/university': 'college/university_graduated',
    'working on college/university': 'college/university_studying',
    'dropped out of college/university': 'college/university_dropped out',
    'graduated from masters program': 'masters_graduated',
    'working on masters program': 'masters_studying',
    'dropped out of masters program': 'masters_dropped out',
    'graduated from ph.d program': 'ph.d_graduated',
    'working on ph.d program': 'ph.d_studying',
    'dropped out of ph.d program': 'ph.d_dropped out',
    'graduated from law school': 'law school_graduated',
    'working on law school': 'law school_studying',
    'dropped out of law school': 'law school_dropped out',
    'graduated from med school': 'med school_graduated',
    'working on med school': 'med school_studying',
    'dropped out of med school': 'med school_dropped out',
    'graduated from two-year college': 'two-year college_graduated',
    'working on two-year college': 'two-year college_studying',
    'dropped out of two-year college': 'two-year college_dropped out',
    'graduated from high school': 'high school_graduated',
    'working on high school': 'high school_studying',
    'dropped out of high school': 'high school_dropped out',
    'space camp': 'other',
    'working on space camp': 'other',
    'dropped out of space camp': 'other',
    'graduated from space camp': 'other'
}

# Apply the mapping
df['education'] = df['education'].map(education_mapping).fillna('unknown')

# Verify the result
print(df['education'].value_counts())


education
college/university_graduated      23816
masters_graduated                  8901
unknown                            7158
college/university_studying        5672
masters_studying                   1674
other                              1664
two-year college_graduated         1525
high school_graduated              1413
ph.d_graduated                     1264
law school_graduated               1111
two-year college_studying          1063
college/university_dropped out      993
ph.d_studying                       975
med school_graduated                443
law school_studying                 268
med school_studying                 211
two-year college_dropped out        191
masters_dropped out                 140
ph.d_dropped out                    126
high school_dropped out              98
high school_studying                 87
law school_dropped out               17
med school_dropped out               12
Name: count, dtype: int64


In [108]:
print(df['education'].isnull().sum())

0
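
`Series.map` with a dictionary turns every value that is not a key into NaN, and the `fillna('unknown')` that follows then hides it. The `unknown` count above (7,158) is larger than the 6,628 values that were missing from `education` in the raw data, which suggests that some non-null categories were not covered by the mapping. The sketch below shows one way to list uncovered values before mapping; `raw`, `mapping`, and the example values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

raw = pd.Series(["graduated from high school", "high school", np.nan, "working on med school"])
mapping = {
    "graduated from high school": "high school_graduated",
    "working on med school": "med school_studying",
}

# Non-null values that the mapping would silently turn into 'unknown'.
uncovered = sorted(set(raw.dropna()) - set(mapping))
print("Not covered by mapping:", uncovered)

mapped = raw.map(mapping).fillna("unknown")
print(mapped.value_counts().to_dict())
```

The same check applies to the other dictionary-based mappings in this section.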


### Ethnicity

In [109]:
# checking the values
print(df['ethnicity'].value_counts())

ethnicity
white                                              32495
asian                                               6019
hispanic / latin                                    2763
black                                               1974
other                                               1669
                                                   ...  
black, native american, indian, white                  1
black, native american, pacific islander, other        1
asian, middle eastern, black, pacific islander         1
middle eastern, black, pacific islander, white         1
asian, black, indian                                   1
Name: count, Length: 217, dtype: int64


Since there are 217 distinct ethnicity values in this dataset (including combinations), it's best to standardize the ethnicity column to reduce ambiguity.

In [110]:
def standardize_ethnicity(value):
    if pd.isna(value):  # if missing,
        return 'unknown'
    # Split by comma, strip whitespace, convert to lowercase, and remove duplicates
    ethnicities = sorted(set([eth.strip().lower() for eth in value.split(',')]))
    # Combine back into a standardized string
    return ', '.join(ethnicities)

# Apply the cleaning function to the 'ethnicity' column
df['ethnicity'] = df['ethnicity'].apply(standardize_ethnicity)

# Collapse every multi-ethnicity combination into 'mixed'; single values (including 'indian', 'other' and 'unknown') pass through unchanged
common_ethnicities = ['white', 'asian', 'black', 'hispanic / latin', 'native american', 'pacific islander', 'middle eastern']
df['ethnicity'] = df['ethnicity'].apply(
    lambda x: x if x in common_ethnicities else ('mixed' if ',' in x else x)
)

# Verify the cleaned and standardized column
print(df['ethnicity'].value_counts())


ethnicity
white               32495
mixed                6778
asian                6019
unknown              5262
hispanic / latin     2763
black                1974
other                1669
indian               1062
pacific islander      413
middle eastern        324
native american        63
Name: count, dtype: int64
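
Collapsing every combination into `mixed` discards which ethnicities were listed. If that detail turns out to matter for matching, one alternative is a multi-hot encoding with `Series.str.get_dummies`. This is a different technique from the one used in this notebook, sketched here on made-up values.

```python
import pandas as pd

raw = pd.Series(["white", "asian, white", "black, hispanic / latin", None])

# One indicator column per ethnicity; rows with no value get all zeros.
multi_hot = raw.str.get_dummies(sep=", ")

print(multi_hot)
```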


### Job

In [111]:
print(df['job'].value_counts())

job
other                                7547
student                              4851
science / tech / engineering         4825
computer / hardware / software       4682
artistic / musical / writer          4410
sales / marketing / biz dev          4373
medicine / health                    3659
education / academia                 3497
executive / management               2357
banking / financial / real estate    2240
entertainment / media                2234
law / legal services                 1369
hospitality / travel                 1352
construction / craftsmanship         1016
clerical / administrative             801
political / government                697
rather not say                        431
transportation                        363
unemployed                            270
retired                               246
military                              201
Name: count, dtype: int64


The job column is useful in its current form. The only change needed is to fill the NaNs with 'rather not say'.

In [112]:
df['job'] = df['job'].fillna('rather not say')
print(df['job'].value_counts())

job
rather not say                       7832
other                                7547
student                              4851
science / tech / engineering         4825
computer / hardware / software       4682
artistic / musical / writer          4410
sales / marketing / biz dev          4373
medicine / health                    3659
education / academia                 3497
executive / management               2357
banking / financial / real estate    2240
entertainment / media                2234
law / legal services                 1369
hospitality / travel                 1352
construction / craftsmanship         1016
clerical / administrative             801
political / government                697
transportation                        363
unemployed                            270
retired                               246
military                              201
Name: count, dtype: int64


### Location

In [113]:
df['location'].value_counts()

Unnamed: 0_level_0,count
location,Unnamed: 1_level_1
"san francisco, california",30514
"oakland, california",7107
"berkeley, california",4150
"san mateo, california",1309
"palo alto, california",1052
...,...
"jackson, mississippi",1
"ozone park, new york",1
"lake orion, michigan",1
"cambridge, massachusetts",1


We're not able to see all the locations, but we'll go ahead and standardize them by lowercasing and stripping whitespace.

In [114]:
df['location'] = df['location'].str.lower().str.strip()

### Offspring

In [115]:
df['offspring'].value_counts()

Unnamed: 0_level_0,count
offspring,Unnamed: 1_level_1
doesn't have kids,7509
"doesn't have kids, but might want them",3859
"doesn't have kids, but wants them",3554
doesn't want kids,2909
has kids,1874
has a kid,1869
"doesn't have kids, and doesn't want any",1128
"has kids, but doesn't want more",440
"has a kid, but doesn't want more",274
"has a kid, and might want more",229


In [116]:
offspring_mapping = {
    "doesn't have kids": "no kids, no preference",
    "doesn't have kids, but might want them": "no kids, might want",
    "doesn't have kids, but wants them": "no kids, wants",
    "doesn't want kids": "no kids, doesn't want",
    "has kids": "has kids, no preference",
    "has a kid": "has kids, no preference",
    "doesn't have kids, and doesn't want any": "no kids, doesn't want",
    "has kids, but doesn't want more": "has kids, doesn't want more",
    "has a kid, but doesn't want more": "has kids, doesn't want more",
    "has a kid, and might want more": "has kids, might want more",
    "wants kids": "wants kids",
    "might want kids": "might want kids",
    "has kids, and might want more": "has kids, might want more",
    "has a kid, and wants more": "has kids, wants more",
    "has kids, and wants more": "has kids, wants more"
}

df['offspring'] = df['offspring'].map(offspring_mapping).fillna('unknown')

In [117]:
df['offspring'].sample(15)

Unnamed: 0,offspring
17295,unknown
55434,"has kids, no preference"
58993,"no kids, might want"
18422,"no kids, no preference"
14556,"no kids, might want"
41641,unknown
31602,unknown
51760,unknown
11018,unknown
216,unknown


### Pets

In [118]:
df['pets'].value_counts()

Unnamed: 0_level_0,count
pets,Unnamed: 1_level_1
likes dogs and likes cats,14753
likes dogs,7191
likes dogs and has cats,4293
has dogs,4096
has dogs and likes cats,2324
likes dogs and dislikes cats,2022
has dogs and has cats,1464
has cats,1395
likes cats,1057
has dogs and dislikes cats,548


In [119]:
pets_mapping = {
    "likes dogs and likes cats": "likes pets",
    "likes dogs": "likes dogs",
    "likes cats": "likes cats",
    "likes dogs and has cats": "has pets",
    "has dogs": "has pets",
    "has dogs and likes cats": "has pets",
    "has dogs and has cats": "has pets",
    "has cats": "has pets",
    "has dogs and dislikes cats": "has pets",
    "likes dogs and dislikes cats": "likes dogs",
    "dislikes dogs and likes cats": "likes cats",
    "dislikes dogs and dislikes cats": "dislikes pets",
    "dislikes cats": "dislikes pets",
    "dislikes dogs": "dislikes pets",
    "dislikes dogs and has cats": "has pets"
}

# Apply mapping directly to the 'pets' column
df['pets'] = df['pets'].map(pets_mapping).fillna("unknown")

# Verify changes
print(df['pets'].value_counts())

pets
unknown          19006
likes pets       14753
has pets         14201
likes dogs        9213
likes cats        1292
dislikes pets      357
Name: count, dtype: int64


### Religion

In [120]:
df['religion'].value_counts()

religion,count
agnosticism,2701
other,2671
agnosticism but not too serious about it,2631
agnosticism and laughing about it,2488
catholicism but not too serious about it,2304
atheism,2166
other and laughing about it,2108
atheism and laughing about it,2067
christianity but not too serious about it,1945
christianity,1939


In [121]:
# Define a dictionary-based mapping for standardizing religion
religion_mapping = {
    "agnosticism": "agnosticism",
    "agnosticism but not too serious about it": "agnosticism",
    "agnosticism and laughing about it": "agnosticism",
    "agnosticism and somewhat serious about it": "agnosticism",
    "agnosticism and very serious about it": "agnosticism",
    "atheism": "atheism",
    "atheism but not too serious about it": "atheism",
    "atheism and laughing about it": "atheism",
    "atheism and somewhat serious about it": "atheism",
    "atheism and very serious about it": "atheism",
    "christianity": "christianity",
    "christianity but not too serious about it": "christianity",
    "christianity and laughing about it": "christianity",
    "christianity and somewhat serious about it": "christianity",
    "christianity and very serious about it": "christianity",
    "catholicism": "catholicism",
    "catholicism but not too serious about it": "catholicism",
    "catholicism and laughing about it": "catholicism",
    "catholicism and somewhat serious about it": "catholicism",
    "catholicism and very serious about it": "catholicism",
    "judaism": "judaism",
    "judaism but not too serious about it": "judaism",
    "judaism and laughing about it": "judaism",
    "judaism and somewhat serious about it": "judaism",
    "judaism and very serious about it": "judaism",
    "buddhism": "buddhism",
    "buddhism but not too serious about it": "buddhism",
    "buddhism and laughing about it": "buddhism",
    "buddhism and somewhat serious about it": "buddhism",
    "buddhism and very serious about it": "buddhism",
    "islam": "islam",
    "islam but not too serious about it": "islam",
    "islam and laughing about it": "islam",
    "islam and somewhat serious about it": "islam",
    "islam and very serious about it": "islam",
    "hinduism": "hinduism",
    "hinduism but not too serious about it": "hinduism",
    "hinduism and laughing about it": "hinduism",
    "hinduism and somewhat serious about it": "hinduism",
    "hinduism and very serious about it": "hinduism",
}

# Apply mapping directly to 'religion' column
df['religion'] = df['religion'].str.lower().map(religion_mapping).fillna("other")

# Verify results
print(df['religion'].value_counts())


religion
other           27014
agnosticism      8772
atheism          6964
christianity     5748
catholicism      4726
judaism          3080
buddhism         1937
hinduism          446
islam             135
Name: count, dtype: int64
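
Every key in `religion_mapping` is a base religion optionally followed by a " but ..." or " and ..." modifier, so the same grouping could probably be derived with a regular expression instead of a hand-written dictionary. The sketch below assumes all labels follow that pattern; the helper `base_religion` and the `"unknown"` label are hypothetical. Unlike the cell above, it keeps "other" separate from missing values.

```python
import re
import pandas as pd

# Assumption: each label is "<religion>", optionally followed by
# " but <modifier>" or " and <modifier>", as in the value_counts() output.
_MODIFIER = re.compile(r"\s+(?:but|and)\s+.*$")

def base_religion(label):
    """Strip the seriousness modifier and return the base religion."""
    if pd.isna(label):
        return "unknown"  # hypothetical label for missing values
    return _MODIFIER.sub("", str(label).lower().strip())

# Toy stand-in for df['religion']
toy = pd.Series([
    "agnosticism and laughing about it",
    "catholicism but not too serious about it",
    "other",
    None,
])

print([base_religion(value) for value in toy])
# ['agnosticism', 'catholicism', 'other', 'unknown']
```

Whether missing values should be folded into "other" or kept apart is a modeling choice; keeping them apart makes it easier to see how much of the "other" bucket is actually unanswered.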


### Speaks Column

In [122]:
df['speaks'].value_counts()

speaks,count
english,21016
english (fluently),6583
"english (fluently), spanish (poorly)",2050
"english (fluently), spanish (okay)",1905
"english (fluently), spanish (fluently)",1275
...,...
"english (fluently), french (okay), italian (okay), hebrew (okay)",1
"english (fluently), farsi (poorly), spanish (poorly), french (poorly)",1
"english (okay), tagalog (okay), japanese (poorly), french (poorly)",1
"english, spanish (fluently), lisp (okay)",1


I want to standardize this column, but I wonder whether standardizing will affect the PCA results. I will preserve the original values in a new `speaks_original` column and overwrite `speaks` with a standardized version to see whether it has an impact on PCA. If not, I will drop the extra column.

In [123]:
df['speaks_original'] = df['speaks']

In [124]:
def categorize_speaks(speaks):
    if pd.isna(speaks):
        return "unknown"  # Handle missing values

    # Remove fluency descriptors (e.g., "(fluently)", "(okay)", "(poorly)")
    cleaned_languages = re.sub(r"\s?\(.*?\)", "", speaks)

    # Convert to lowercase and split into individual languages
    languages = set(cleaned_languages.lower().split(", "))

    # Classify as 'multilingual' if more than one language is listed
    return "monolingual" if len(languages) == 1 else "multilingual"

# Apply function directly to the 'speaks' column
df['speaks'] = df['speaks_original'].apply(categorize_speaks)

In [125]:
df['speaks'].value_counts()

speaks,count
multilingual,30348
monolingual,28433
unknown,41
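
To make the regex step in `categorize_speaks` concrete, here is a small sketch that applies the same pattern to two toy strings. The helper name `split_languages` is hypothetical; it only isolates the cleaning and splitting steps.

```python
import re

def split_languages(speaks):
    """Drop fluency tags such as "(fluently)" and return the distinct languages, sorted."""
    cleaned = re.sub(r"\s?\(.*?\)", "", speaks)
    return sorted(set(cleaned.lower().split(", ")))

print(split_languages("english (fluently), spanish (poorly)"))  # ['english', 'spanish']
print(split_languages("english"))                               # ['english']
```

The length of the returned list is what decides between `monolingual` and `multilingual`.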


In [126]:
df[['speaks_original','speaks']].sample(5)

Unnamed: 0,speaks_original,speaks
8845,"english (okay), arabic (fluently)",multilingual
8640,"english (fluently), french (okay)",multilingual
39170,"english (fluently), spanish (okay), german (po...",multilingual
24030,"english, spanish (okay)",multilingual
50000,english,monolingual


For some reason, the speaks_original column contains programming languages such as C++ alongside human languages. Let's remove them.

In [127]:
# Programming languages to drop; "lisp" also appears in the value_counts() output above
programming_languages = {"c++", "java", "python", "javascript", "html", "css", "ruby", "swift", "php", "r", "sql", "lisp"}

def clean_speaks(text):
    if pd.isna(text):
        return text  # Keep NaNs as they are

    words = re.split(r',\s*', str(text).lower())  # Split on commas, dropping the spaces after them
    cleaned_words = []

    for word in words:
        # Remove proficiency labels such as "(fluently)"
        word = re.sub(r"\(.*?\)", "", word).strip()

        # Keep alphabetic names (internal spaces allowed, e.g. "sign language")
        # that are not programming languages
        if word not in programming_languages and word.replace(" ", "").isalpha():
            cleaned_words.append(word)

    return ", ".join(cleaned_words) if cleaned_words else None  # Return cleaned list or None if empty

# Apply the cleaning function to the speaks_original column
df['speaks_original'] = df['speaks_original'].apply(clean_speaks)

In [128]:
df[['speaks_original','speaks']].sample(5)

Unnamed: 0,speaks_original,speaks
53239,english,monolingual
31428,"english, other",multilingual
27812,english,monolingual
57341,english,monolingual
35175,english,monolingual
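
Note that the `speaks` labels were computed before the programming languages were removed, so a profile such as "english, c++ (okay)" would still be labeled `multilingual`. One possible fix, sketched below on toy data, is to derive the label from the cleaned column instead. The helper `label_from_cleaned` is hypothetical and not part of the notebook.

```python
import pandas as pd

def label_from_cleaned(cleaned):
    """Label a cleaned, comma-separated language string."""
    if pd.isna(cleaned) or not str(cleaned).strip():
        return "unknown"
    languages = {lang.strip() for lang in str(cleaned).split(",") if lang.strip()}
    return "monolingual" if len(languages) == 1 else "multilingual"

# Toy stand-in for the cleaned df['speaks_original'] column
toy = pd.Series(["english", "english, spanish", None])

print([label_from_cleaned(value) for value in toy])
# ['monolingual', 'multilingual', 'unknown']
```

In the notebook this could be applied as `df['speaks'] = df['speaks_original'].apply(label_from_cleaned)` after the cleaning cell has run.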


## 4.3 Cleaning Essay Columns

In [129]:
df['combined_essay_cols'] = df[
    ['essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7', 'essay8', 'essay9']
].fillna("").apply(lambda x: " ".join(x), axis=1)

In [130]:
df.drop(columns=['essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7', 'essay8', 'essay9'], inplace=True)

In [131]:
df['combined_essay_cols'].sample()

In [132]:
from nltk.stem import WordNetLemmatizer
from spellchecker import SpellChecker  # provided by the pyspellchecker package

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
spell = SpellChecker()

def preprocess_text(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation
    tokens = text.split()  # Whitespace split; faster than word_tokenize
    return tokens

def remove_stopwords(tokens):
    return [word for word in tokens if word not in stop_words]

def remove_long_words(tokens, max_length=20):
    return [word for word in tokens if len(word) <= max_length]

def lemmatization(tokens):
    return [lemmatizer.lemmatize(word) for word in tokens]


In [133]:
import swifter  # registers the .swifter accessor on pandas objects

def clean_text(text):
    # Return an empty string for non-strings (e.g. NaN) and blank text
    if not isinstance(text, str) or text.strip() == "":
        return ""

    tokens = preprocess_text(text)
    tokens = remove_stopwords(tokens)
    tokens = remove_long_words(tokens)  # Removes words > 20 characters
    tokens = lemmatization(tokens)
    return " ".join(tokens)  # Convert back to string for TF-IDF


df['cleaned_essays'] = df['combined_essay_cols'].astype(str).swifter.apply(clean_text)
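
To see what the cleaning steps do to a single essay, here is a self-contained sketch that runs without any NLTK downloads. It is a simplification, not the notebook's pipeline: `TOY_STOPWORDS` is a tiny hypothetical stand-in for NLTK's English stopword list, and the lemmatization step is left out.

```python
import re

# Tiny illustrative stopword list; the notebook uses NLTK's full English list instead
TOY_STOPWORDS = {"i", "a", "the", "and", "to", "is"}

def toy_clean(text, max_length=20):
    """Lowercase, strip punctuation, then drop stopwords and over-long tokens."""
    if not isinstance(text, str) or not text.strip():
        return ""
    text = re.sub(r"[^\w\s]", "", text.lower())
    tokens = [t for t in text.split() if t not in TOY_STOPWORDS]
    tokens = [t for t in tokens if len(t) <= max_length]
    return " ".join(tokens)

print(toy_clean("I love hiking, and the outdoors!"))  # love hiking outdoors
```

One side effect worth knowing: because punctuation is removed before the stopword filter, a contraction such as "don't" becomes "dont" and no longer matches a stopword entry that contains an apostrophe.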

## BERT Embeddings Approach

Sentence embeddings allow us to convert sentences into numerical vectors, capturing their meaning in a way that's easy for machine learning models to use.

`all-MiniLM-L6-v2` is a sentence-transformers model that maps a piece of text to a 384-dimensional vector and is commonly used for semantic search, clustering, and sentence similarity.
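
The commented-out cells further down build sentence vectors in two steps: mean pooling over token embeddings (ignoring padding) and L2 normalization. The NumPy sketch below reproduces those two steps on made-up numbers so the arithmetic is easy to follow. It does not load the model; the function names and toy values are hypothetical, and real embeddings would have 384 dimensions rather than 2.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(float)      # (batch, tokens, 1)
    summed = (token_embeddings * mask).sum(axis=1)      # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)      # avoid division by zero
    return summed / counts

def l2_normalize(vectors):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

# Toy batch: 2 "sentences", 3 token slots, 2-dimensional embeddings
tokens = np.array([
    [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]],   # third slot is padding
    [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

sentence_vectors = l2_normalize(mean_pool(tokens, mask))
similarity = sentence_vectors @ sentence_vectors.T  # pairwise cosine similarity
print(similarity)
```

Here the padded token in the first sentence has no effect on its vector, and the two sentences come out orthogonal, so their cosine similarity is 0 while each sentence's similarity with itself is 1.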

In [140]:
# from transformers import AutoTokenizer, AutoModel
# import torch
# import torch.nn.functional as F

In [139]:
#df.info()

### Final Dataframe

In [138]:
#df.to_csv('okcupid_cleaned.csv', index=False)

In [141]:
# # Load model from HuggingFace Hub
# tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
# model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# def bert_embeddings(text):

#   def mean_pooling(model_output, attention_mask):
#     token_embeddings = model_output[0] #First element of model_output contains all token embeddings
#     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
#     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

#   encoded_input = tokenizer(text, padding=True, truncation=True, return_tensors='pt')

#   # Compute token embeddings
#   with torch.no_grad():
#     model_output = model(**encoded_input)

#   # Perform pooling
#   essay_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

#   # Normalize embeddings
#   essay_embeddings = F.normalize(essay_embeddings, p=2, dim=1)
#   normalized_essay = np.array(essay_embeddings)
#   return normalized_essay


In [142]:
# from transformers import AutoTokenizer, AutoModel
# import torch
# import torch.nn.functional as F

# #Mean Pooling - Take attention mask into account for correct averaging
# def mean_pooling(model_output, attention_mask):
#     token_embeddings = model_output[0] #First element of model_output contains all token embeddings
#     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
#     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# # Sentences we want sentence embeddings for
# sentences = ['This is an example sentence', 'Each sentence is converted']

# # Load model from HuggingFace Hub
# tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
# model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# # Tokenize sentences
# encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# # Compute token embeddings
# with torch.no_grad():
#     model_output = model(**encoded_input)

# # Perform pooling
# sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# # Normalize embeddings
# sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

# print("Sentence embeddings:")
# print(sentence_embeddings)