# 1. Project Overview

This project aims to build a data-driven matchmaking system that enhances compatibility in online dating by leveraging user demographics, lifestyle traits, and personal essays. Using the OKCupid Profiles Dataset from Kaggle, the system employs BERT embeddings for text analysis and feature importance for structured data to generate personalized, content-based recommendations. Unlike modern dating apps that rely on swiping behavior, this approach prioritizes compatibility over interaction history, making it ideal for a blind dating experience where personality and shared values take center stage.

# 2. Import Libraries

This section imports the libraries used throughout the notebook. The next section loads the OKCupid dataset from Kaggle, reviews its overview and descriptive statistics, and then looks at how much data is missing.

In [None]:
import kagglehub
import pandas as pd
import re

!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer

!pip install swifter
import swifter

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
from datetime import datetime, timedelta
from nltk.corpus import stopwords
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

warnings.filterwarnings('ignore')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

from nltk.stem import WordNetLemmatizer
from textblob import Word
!pip install pyspellchecker
from spellchecker import SpellChecker



[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!




# 3. Load and Explore Data

In [None]:
path = kagglehub.dataset_download("andrewmvd/okcupid-profiles")

print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/andrewmvd/okcupid-profiles/versions/1


In [None]:
import os

# Build the CSV path from the download location instead of hard-coding it
df = pd.read_csv(os.path.join(path, 'okcupid_profiles.csv'), sep=",", header=0)
df.sample(5)

Unnamed: 0,age,status,sex,orientation,body_type,diet,drinks,drugs,education,ethnicity,...,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9
14923,28,seeing someone,f,gay,,vegetarian,socially,never,graduated from college/university,,...,i am a secret softy. i wear an armor of steel ...,"figuring out what makes me happy. oh, and wor...","blurring the lines of ""appropriate"" conversati...",my tie-clip? or suspenders? maybe my tattoos?,"no one belongs here more than you, the red ten...",,,"some things i'd enjoy doing: (specifically, th...",,you want to go on adventures or have deep/triv...
11084,24,single,m,straight,average,mostly anything,socially,never,graduated from masters program,asian,...,grew up in northern california and went to sch...,graduated from ucsd with a ms in biology. i cu...,picking up new sports and hobbies. i've been s...,i've heard people describe me as quite at firs...,"the hunger games (addicting), the princess bri...",close friends family internet food trucks iphone,"restaurants i want to go to, cities i want to ...","it depends, sometimes i'll be hanging out with...",i like visiting home decor stores... like west...,if anything here interests you!
26526,27,seeing someone,f,straight,thin,,socially,never,graduated from college/university,"asian, white",...,i feel like i'm going against cultural and soc...,"at a crossroads in sf, deciding which directio...",adapting to new situations. forcing myself out...,"some have said the voice, some say the smile, ...",two books that changed my life: the ecology of...,1) cheesy sentimentality. 2) awesome people. 3...,how i can find a calm center when everything c...,"brushing my teeth. actually, that's every nigh...",i'm a pretty strong infj (myers-briggs). but i...,"all of the following are positive, women and m..."
34561,41,single,f,straight,average,mostly anything,socially,never,,"native american, white",...,i'm so much better about answering direct ques...,i'm currently managing an independent bookstor...,making people feel better. seriously. that's m...,"i have no idea, you'll have to ask them.","books: maia, richard adams; the secret history...",-my girls -amazing food -my ipod -compassion -...,why people do what they do. why i do what i do...,"half the time, i'm with kids, which means movi...",naruto has made me cry.,-you're not just looking for a booty call. -yo...
50473,33,single,m,straight,average,,socially,,graduated from college/university,"native american, white",...,"i am a damn sexy, fun loving, geek. i watch do...",living it the best i can and making do with wh...,computers i have been tearing them down since ...,i asked a friend to answer this she said my ey...,i could fill up 2 pages answering this questio...,"1. family, always comes first and i wouldn't b...","how epic the new elder scrolls mmo will be, an...",either home gaming and listening to music or o...,every year for the last 6 years i have volunte...,


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          59946 non-null  int64  
 1   status       59946 non-null  object 
 2   sex          59946 non-null  object 
 3   orientation  59946 non-null  object 
 4   body_type    54650 non-null  object 
 5   diet         35551 non-null  object 
 6   drinks       56961 non-null  object 
 7   drugs        45866 non-null  object 
 8   education    53318 non-null  object 
 9   ethnicity    54266 non-null  object 
 10  height       59943 non-null  float64
 11  income       59946 non-null  int64  
 12  job          51748 non-null  object 
 13  last_online  59946 non-null  object 
 14  location     59946 non-null  object 
 15  offspring    24385 non-null  object 
 16  pets         40025 non-null  object 
 17  religion     39720 non-null  object 
 18  sign         48890 non-null  object 
 19  smok

In [None]:
df.describe()

Unnamed: 0,age,height,income
count,59946.0,59943.0,59946.0
mean,32.34029,68.295281,20033.222534
std,9.452779,3.994803,97346.192104
min,18.0,1.0,-1.0
25%,26.0,66.0,-1.0
50%,30.0,68.0,-1.0
75%,37.0,71.0,-1.0
max,110.0,95.0,1000000.0


In [None]:
# Calculate the percentage of missing values for each column
missing_summary = df.isnull().agg(['sum', 'mean']).T

# Rename columns for clarity
missing_summary.columns = ['# Missing Values', '% Missing']

# Format percentage as a string with two decimal places
missing_summary['% Missing'] = (missing_summary['% Missing'] * 100).apply(lambda x: f"{x:.2f}%")

# Display the result
print(missing_summary)

             # Missing Values % Missing
age                       0.0     0.00%
status                    0.0     0.00%
sex                       0.0     0.00%
orientation               0.0     0.00%
body_type              5296.0     8.83%
diet                  24395.0    40.69%
drinks                 2985.0     4.98%
drugs                 14080.0    23.49%
education              6628.0    11.06%
ethnicity              5680.0     9.48%
height                    3.0     0.01%
income                    0.0     0.00%
job                    8198.0    13.68%
last_online               0.0     0.00%
location                  0.0     0.00%
offspring             35561.0    59.32%
pets                  19921.0    33.23%
religion              20226.0    33.74%
sign                  11056.0    18.44%
smokes                 5512.0     9.19%
speaks                   50.0     0.08%
essay0                 5488.0     9.15%
essay1                 7572.0    12.63%
essay2                 9638.0    16.08%


The sample shows rows with a lot of missing information. Since there is no way to recover that information, I want to check whether it is worth dropping rows that are missing more than half of their features.

In [None]:
# Count the number of NaN values per row
nan_counts_per_row = df.isnull().sum(axis=1)

# Define a threshold (e.g., more than half the columns)
threshold = df.shape[1] // 2

# Count the rows with NaN values exceeding the threshold
rows_above_threshold = (nan_counts_per_row > threshold).sum()

# Display the count
print(f"Number of rows with NaNs exceeding the threshold of {threshold}: {rows_above_threshold}")

percent_missing = (rows_above_threshold / len(df)) * 100
print(f"Percentage of rows with NaNs exceeding the threshold of {threshold}: {percent_missing:.2f}%")

Number of rows with NaNs exceeding the threshold of 15: 931
Percentage of rows with NaNs exceeding the threshold of 15: 1.55%


I have decided to drop the income, body_type, sign, and last_online columns as they are not of significant importance in matchmaking.

In [None]:
df.drop(columns=['income', 'body_type', 'sign', 'last_online'], inplace=True)
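
One observation that supports dropping `income`: the `describe()` output above reports -1 at the 25th, 50th, and 75th percentiles, which suggests that -1 is a placeholder for unreported income. That reading is an assumption, not something the dataset documentation is quoted on here. A minimal sketch of how the placeholder share could be measured, using a toy series in place of the real column:

```python
import pandas as pd

# Toy stand-in for the income column; -1 is ASSUMED to mean "not reported"
income = pd.Series([-1, -1, 20000, -1, 80000, -1, -1, 1000000])

# Share of rows carrying the placeholder value
placeholder_share = (income == -1).mean()
print(f"Share of income values equal to -1: {placeholder_share:.2%}")

# Summary statistics over the reported values only
reported = income[income != -1]
print(reported.describe())
```

On the real data, the same check would need to run before the `drop` call, while the column still exists.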

# 4. Data Preprocessing

1.55% of the rows (931) have more than 15 features missing. I will remove them, since dropping such a small share will have little impact on the dataset.

In [None]:
df = df[nan_counts_per_row <= threshold]
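
The boolean mask above can also be written with `DataFrame.dropna(thresh=...)`, which keeps rows that have at least a given number of non-null values. The sketch below uses a small toy frame (the column names are placeholders) to show that the two forms agree when the NaN counts and the threshold are computed on the same set of columns:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the profiles data
toy = pd.DataFrame({
    "a": [1, np.nan, np.nan, 4],
    "b": [1, np.nan, 3, 4],
    "c": [np.nan, np.nan, np.nan, 4],
    "d": [1, 2, np.nan, 4],
})

threshold = toy.shape[1] // 2            # at most half the columns may be missing
nan_counts = toy.isnull().sum(axis=1)

# Boolean-mask version, mirroring the cell above
kept_mask = toy[nan_counts <= threshold]

# dropna version: keep rows with at least (n_columns - threshold) non-null values
kept_dropna = toy.dropna(thresh=toy.shape[1] - threshold)

print(kept_mask.index.tolist())
print(kept_dropna.index.tolist())
```

In this notebook the NaN counts were computed before the four columns were dropped, so calling `dropna(thresh=...)` on the current `df` would count over fewer columns and is not guaranteed to select exactly the same rows.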

In [None]:
df.sample(10)

Unnamed: 0,age,status,sex,orientation,diet,drinks,drugs,education,ethnicity,height,...,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9
56611,31,single,f,straight,strictly anything,rarely,never,graduated from masters program,hispanic / latin,63.0,...,,,,,"i love to eat i love sushi, thai, mexican stea...",,"my family, friends, my future",in the movie theater.,,"you want to have a new friend, have fun or a l..."
52802,56,single,m,straight,,socially,,dropped out of college/university,white,71.0,...,"i drive around the bay area on weekends, looki...",trying to build a company that's gonna change ...,solving problems that require you to hold seve...,my striking good looks.,"moby dick, duck soup, iko iko, an italian sub ...","food, shelter, clothing, sleep, nature, people",the information ecology. ... and what's for di...,unwinding from a busy week.,"i'm wearing a blue shirt, today",you feel like chatting.
15938,57,single,m,straight,,rarely,,graduated from masters program,white,71.0,...,,,,,,,,,,
4195,19,single,m,straight,mostly anything,not at all,never,working on high school,hispanic / latin,72.0,...,i'm 18 years and love to enjoy life. i like to...,i am a student that is trying to finish colleg...,"baseball, making people laugh, drinking coffee...",they notice my hair because when its loose its...,my favorite books are the harry potter series(...,my family. my iphone a daily shower my basebal...,life. its fricken crazy how we all came to be ...,out with my friends or drag racing. sometimes ...,i'm willing to admit that my first girlfriend ...,your interested in meeting me or just want to ...
45154,26,seeing someone,m,bisexual,mostly anything,socially,never,working on space camp,,72.0,...,i'm just me. i don't really know how to explai...,i'm currently working as a computer technician...,"martial arts (the ones i've already learned), ...","my nerd powers!! i don't wear them physically,...",i'm a big fan of shakespeare and don't try to ...,"a computer, a job, my family, a well crafted w...",why is my astrological description is complete...,doing whatever the hell i please!! i never bog...,..hmm... i will admit to sleeping bare ass! if...,"you are sane, intellegent and interested. all ..."
48684,28,single,m,straight,,socially,never,working on masters program,asian,73.0,...,"well, in a nut-shell i consider myself and eas...",finishing up my final year in grad school and ...,"board games, karaoke, and eating a lot. if you...",hmmm...dunno but i hope it's something good.,"books: the walking drunk, trainspotting, cat's...",1. whole milk (it really is the best) 2. the d...,,usually out and about around inner richmond of...,i have quite a few geeky obsessions. i've gone...,
3078,22,single,m,gay,vegan,socially,,graduated from college/university,indian,74.0,...,"just graduated from the institution, now spend...",finding beauty in post-graduation nothingness....,"recognizing dystopia, singing the destiny's ch...","""you're tall""","books: the god of small things, black skin whi...","baduizm, borderlands, my bike, a pencil and pa...",-my love for arundhati roy -projects i want to...,,,you are interested in the best vegan peanut bu...
37349,28,single,f,straight,anything,socially,,working on ph.d program,white,63.0,...,seriously i just can't summarize myself! i thi...,for now... i am trying to find my balance betw...,- laughing! laughing everyday should be a requ...,depends on the people i guess: i prefer those ...,"books: authors like milan kundera, katherine p...","1/ music, music, and music 2/ friends 3/ trave...",well... i do think a lot... human beings fasci...,as many people say i don't have a typical frid...,definitely not on the internet,you got inspired!
24038,40,single,f,straight,,socially,never,graduated from masters program,white,68.0,...,"i am an (allegedly) good looking, successful, ...",i have both achieved the improbable and failed...,intuitive driving and polarizing. but i am a l...,"how would i know. remember, i am a narcissist;-)",i read kierkegaard and listen to eminem (admit...,knowledge. freedom. laughter. love. inspiratio...,- if most people really are that spiritually u...,working out. cooking dinner with friends. at t...,is the fact that i am desperately trying to ma...,you can read and comprehend my self summary wi...
45681,29,single,m,straight,,socially,never,dropped out of college/university,white,70.0,...,,,,,,,everything. i over analyze.,,,you would like to know me better.


## 4.1 Cleaning Numerical Columns

The remaining numerical columns are age and height (income was dropped earlier).

### Age Column

In [None]:
print("Age Min", df['age'].min())
print("Age Max", df['age'].max())

Age Min 18
Age Max 109


**Handling Outliers in Age Column**
- The minimum age is acceptable and doesn't require changes.
- The maximum age needs to be addressed as it exceeds realistic limits.
- Further steps:
  1. Investigate rows with ages over 100.
  2. Determine whether to remove or impute those values.
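
The two steps above can be sketched on a toy frame as follows. The cutoff of 100 comes from the list; the toy values and the `n_missing` helper column are illustrative assumptions only:

```python
import pandas as pd

# Toy frame standing in for the profiles data
toy = pd.DataFrame({
    "age": [25, 109, 31, 110],
    "height": [68.0, 95.0, 70.0, 67.0],
    "essay0": ["hello", None, "hi there", None],
})

# Step 1: isolate the implausible ages
suspect = toy[toy["age"] > 100]

# Counting missing fields per suspect row helps decide between removal and imputation
print(suspect.assign(n_missing=suspect.isnull().sum(axis=1)))

# Step 2 (removal option): keep only plausible ages
cleaned = toy[toy["age"] <= 100]
print(cleaned["age"].max())
```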

In [None]:
df['age'].value_counts().sort_index()

Unnamed: 0_level_0,count
age,Unnamed: 1_level_1
18,298
19,582
20,925
21,1246
22,1895
23,2546
24,3189
25,3497
26,3663
27,3625


I looked more closely at ages 109 and 110 and decided to drop them because these rows have a lot of missing data that cannot be filled in.

In [None]:
df[(df['age'] == 109) | (df['age'] == 110)]

Unnamed: 0,age,status,sex,orientation,diet,drinks,drugs,education,ethnicity,height,...,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9
25324,109,available,m,straight,mostly other,,never,working on masters program,,95.0,...,,,,nothing,,,,,,


In [None]:
# filtering the data to no longer include ages 109 and 110

df = df[(df['age'] != 109) & (df['age'] != 110)]

# verifying the changes
df['age'].value_counts().sort_index()

Unnamed: 0_level_0,count
age,Unnamed: 1_level_1
18,298
19,582
20,925
21,1246
22,1895
23,2546
24,3189
25,3497
26,3663
27,3625


The `age` column does not have any missing values, so we will move on to the `height` column.

### Height Column

In [None]:
print("Min Height:",df['height'].min())
print("Max Height:",df['height'].max())

Min Height: 1.0
Max Height: 95.0


In [None]:
df['height'].value_counts().sort_index()

Unnamed: 0_level_0,count
height,Unnamed: 1_level_1
1.0,1
3.0,1
4.0,1
6.0,1
8.0,1
9.0,1
26.0,1
36.0,9
37.0,2
42.0,1


There is no information on the unit for height. In this case, it makes sense to assume the height is in inches. I will filter out anomalies by keeping only heights from 4'11" (59 inches) to 6'8" (80 inches).

In [None]:
df = df[(df['height'] >= 59) & (df['height'] <= 80)]
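
To sanity-check the bounds used in this filter, a small helper can convert inches to feet and inches. The function name `inches_to_feet_inches` is hypothetical and not part of the notebook:

```python
def inches_to_feet_inches(total_inches: int) -> str:
    # 12 inches per foot: the quotient is the feet, the remainder the leftover inches
    feet, inches = divmod(total_inches, 12)
    return f"{feet}'{inches}\""

for bound in (59, 80):
    print(bound, "inches =", inches_to_feet_inches(bound))
```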

In [None]:
print("Min Height:",df['height'].min())
print("Max Height:",df['height'].max())

Min Height: 59.0
Max Height: 80.0


In [None]:
df.sample(2)

Unnamed: 0,age,status,sex,orientation,diet,drinks,drugs,education,ethnicity,height,...,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9
33484,25,single,f,straight,anything,socially,never,graduated from college/university,white,65.0,...,"hello everybody, im looking for some nice, fri...",,,"happy, likes to party and to have fun",horror and comedy,".... my friends, sleeping,dancing,uuhhmmm....",,,,you dont just wanna have sex..... if you repla...
30693,28,single,m,straight,mostly anything,socially,sometimes,graduated from college/university,black,67.0,...,i recently moved to san francisco from dc and ...,maintaining and progressing a professional car...,being a complete mystery.,"physically, eyes or smile i suppose. i'm more ...",i wish i had more time to read fiction these d...,,ideas for screenplays. food. sex.,relaxing at home after a typical long work week.,you should be highly skeptical of anyone's ans...,i'm not really the type to place conditions on...


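The height column had 3 missing values. Since there is no way to fill in a height accurately, I will drop those rows. The range filter above already excludes them, because comparisons against NaN evaluate to False, so this step acts as a safeguard.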

In [None]:
df.dropna(subset=['height'], inplace=True)
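
A short sketch of why the earlier range filter already removes missing heights: comparisons against NaN evaluate to False, so a NaN row fails both bounds. It also shows `Series.between`, which is inclusive on both ends by default and gives the same mask. The values are toy data:

```python
import numpy as np
import pandas as pd

heights = pd.Series([58.0, 59.0, np.nan, 72.0, 80.0, 95.0])

# Explicit comparisons, mirroring the height filter used earlier
mask_explicit = (heights >= 59) & (heights <= 80)

# Equivalent inclusive range check
mask_between = heights.between(59, 80)

# The NaN entry fails both comparisons, so it is excluded without dropna
print(heights[mask_explicit].tolist())
```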

## 4.2 Cleaning Categorical Columns

Categorical columns include demographics, interests, and lifestyle choices as well as 10 "essays" which are just answers to dating prompts.

### Status

The status column has no missing values, therefore cleaning it will be simple.

In [None]:
# checking that the values in status are standardized
df['status'].value_counts()

Unnamed: 0_level_0,count
status,Unnamed: 1_level_1
single,54642
seeing someone,2032
available,1842
married,298
unknown,8


`single` has the most values, followed by `seeing someone`. `available` and `single` can be merged since the two overlap.

In [None]:
df['status'] = df['status'].replace({'available': 'single'})

print(df['status'].value_counts())

status
single            56484
seeing someone     2032
married             298
unknown               8
Name: count, dtype: int64


### Sex

In [None]:
# checking for missing values as well as making sure everything is standardized.
df['sex'].value_counts()

Unnamed: 0_level_0,count
sex,Unnamed: 1_level_1
m,35091
f,23731


### Orientation

In [None]:
df['orientation'].value_counts()

Unnamed: 0_level_0,count
orientation,Unnamed: 1_level_1
straight,50613
gay,5495
bisexual,2714


### Diet

Diet has 24,395 missing values, which affects 40.69% of the data. Let's take a look.

In [None]:
df['diet'].value_counts()

Unnamed: 0_level_0,count
diet,Unnamed: 1_level_1
mostly anything,16493
anything,6128
strictly anything,5088
mostly vegetarian,3434
mostly other,1000
strictly vegetarian,869
vegetarian,662
strictly other,444
mostly vegan,337
other,328


In [None]:
df['diet'].sample(10)

Unnamed: 0,diet
32468,mostly vegetarian
35866,
8234,strictly vegetarian
48581,mostly anything
58409,
9002,mostly anything
32175,
49627,
11568,mostly anything
6634,


In [None]:
diet_mapping = {
    'anything': 'anything',
    'mostly anything': 'anything',
    'strictly anything': 'anything',
    'vegetarian': 'vegetarian',
    'mostly vegetarian': 'vegetarian',
    'strictly vegetarian': 'vegetarian',
    'vegan': 'vegan',
    'mostly vegan': 'vegan',
    'strictly vegan': 'vegan',
    'other': 'other',
    'mostly other': 'other',
    'strictly other': 'other',
    'kosher': 'kosher',
    'mostly kosher': 'kosher',
    'strictly kosher': 'kosher',
    'halal': 'halal',
    'mostly halal': 'halal',
    'strictly halal': 'halal'
}

df['diet'] = df['diet'].map(diet_mapping).fillna('unknown')
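
Every key in the mapping above is a base diet with an optional "mostly" or "strictly" qualifier, so the same grouping can be sketched with a regular expression instead of an explicit dictionary. This is an alternative shown on toy data, not the notebook's method:

```python
import numpy as np
import pandas as pd

diet = pd.Series(["mostly anything", "strictly vegan", "vegetarian", np.nan, "mostly kosher"])

# Remove a leading "mostly " or "strictly " qualifier, then label missing values
base_diet = (
    diet.str.replace(r"^(?:mostly|strictly)\s+", "", regex=True)
        .fillna("unknown")
)
print(base_diet.tolist())
```

One difference in behavior: with the regex, an unexpected value passes through unchanged, whereas `.map()` followed by `fillna('unknown')` turns it into `unknown`.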

In [None]:
df['diet'].value_counts()

Unnamed: 0_level_0,count
diet,Unnamed: 1_level_1
anything,27709
unknown,23486
vegetarian,4965
other,1772
vegan,701
kosher,113
halal,76


In [None]:
df['diet'].sample(5)

Unnamed: 0,diet
28464,unknown
14245,vegetarian
22219,anything
1682,unknown
33599,unknown


### Drinks/Drugs/Smoke

The `drinks` column has 2,985 missing values which represents 4.98% of the data.
`drugs` has 14,080 missing values which represents 23.49% of data.
`smokes` has 5,512 missing values which represents 9.19% of data.

In [None]:
df['drinks'].value_counts()

Unnamed: 0_level_0,count
drinks,Unnamed: 1_level_1
socially,41376
rarely,5901
often,5104
not at all,3226
very often,457
desperately,312


In [None]:
drinks_mapping = {
    'socially': 'socially',
    'rarely': 'rarely/not at all',
    'not at all': 'rarely/not at all',
    'often': 'often',
    'very often': 'often',
    'desperately': 'often'
}

df['drinks'] = df['drinks'].map(drinks_mapping).fillna('unknown')

print(df['drinks'].value_counts())

drinks
socially             41376
rarely/not at all     9127
often                 5873
unknown               2446
Name: count, dtype: int64


In [None]:
# drugs
df['drugs'].value_counts()

Unnamed: 0_level_0,count
drugs,Unnamed: 1_level_1
never,36972
sometimes,7665
often,392


In [None]:
df['drugs'] = df['drugs'].fillna('unknown')
print(df['drugs'].isnull().sum())

0


In [None]:
# smokes
df['smokes'].value_counts()

Unnamed: 0_level_0,count
smokes,Unnamed: 1_level_1
no,43484
sometimes,3739
when drinking,3008
yes,2196
trying to quit,1471


In [None]:
smokes_mapping = {
    'no': 'no',
    'sometimes': 'occasionally',
    'when drinking': 'occasionally',
    'yes': 'regularly',
    'trying to quit': 'trying to quit'
}

df['smokes'] = df['smokes'].map(smokes_mapping).fillna('unknown')

print(df['smokes'].value_counts())

smokes
no                43484
occasionally       6747
unknown            4924
regularly          2196
trying to quit     1471
Name: count, dtype: int64


### Education

The `education` column has 6,628 missing values (11.06% of the data).

In [None]:
df['education'].value_counts()

Unnamed: 0_level_0,count
education,Unnamed: 1_level_1
graduated from college/university,23816
graduated from masters program,8901
working on college/university,5672
working on masters program,1674
graduated from two-year college,1525
graduated from high school,1413
graduated from ph.d program,1264
graduated from law school,1111
working on two-year college,1063
dropped out of college/university,993


In [None]:
# Define mapping for education levels
education_mapping = {
    'graduated from college/university': 'college/university_graduated',
    'working on college/university': 'college/university_studying',
    'dropped out of college/university': 'college/university_dropped out',
    'graduated from masters program': 'masters_graduated',
    'working on masters program': 'masters_studying',
    'dropped out of masters program': 'masters_dropped out',
    'graduated from ph.d program': 'ph.d_graduated',
    'working on ph.d program': 'ph.d_studying',
    'dropped out of ph.d program': 'ph.d_dropped out',
    'graduated from law school': 'law school_graduated',
    'working on law school': 'law school_studying',
    'dropped out of law school': 'law school_dropped out',
    'graduated from med school': 'med school_graduated',
    'working on med school': 'med school_studying',
    'dropped out of med school': 'med school_dropped out',
    'graduated from two-year college': 'two-year college_graduated',
    'working on two-year college': 'two-year college_studying',
    'dropped out of two-year college': 'two-year college_dropped out',
    'graduated from high school': 'high school_graduated',
    'working on high school': 'high school_studying',
    'dropped out of high school': 'high school_dropped out',
    'space camp': 'other',
    'working on space camp': 'other',
    'dropped out of space camp': 'other',
    'graduated from space camp': 'other'
}

# Apply the mapping
df['education'] = df['education'].map(education_mapping).fillna('unknown')

# Verify the result
print(df['education'].value_counts())


education
college/university_graduated      23816
masters_graduated                  8901
unknown                            7158
college/university_studying        5672
masters_studying                   1674
other                              1664
two-year college_graduated         1525
high school_graduated              1413
ph.d_graduated                     1264
law school_graduated               1111
two-year college_studying          1063
college/university_dropped out      993
ph.d_studying                       975
med school_graduated                443
law school_studying                 268
med school_studying                 211
two-year college_dropped out        191
masters_dropped out                 140
ph.d_dropped out                    126
high school_dropped out              98
high school_studying                 87
law school_dropped out               17
med school_dropped out               12
Name: count, dtype: int64


In [None]:
print(df['education'].isnull().sum())

0
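
Because `.map()` returns NaN for any value that is not a key in the dictionary, the `fillna('unknown')` step cannot tell originally missing values apart from categories the mapping does not cover. The recorded `unknown` count (7,158) is higher than the 6,628 values originally missing, which is consistent with some non-null values falling through. A minimal sketch of an audit step, where `unmapped_values` is a hypothetical helper and the toy values are illustrative:

```python
import numpy as np
import pandas as pd

def unmapped_values(series: pd.Series, mapping: dict) -> pd.Series:
    """Count non-null values in `series` that have no key in `mapping`."""
    present = series.dropna()
    return present[~present.isin(list(mapping))].value_counts()

education = pd.Series([
    "graduated from high school",
    "college/university",        # hypothetical value with no key in the mapping
    np.nan,
    "working on space camp",
    "college/university",
])
toy_mapping = {
    "graduated from high school": "high school_graduated",
    "working on space camp": "other",
}

audit = unmapped_values(education, toy_mapping)
print(audit)
```

Running such a check on the raw column before applying the mapping would show which categories, if any, need their own keys.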


### Ethnicity

In [None]:
# checking the values
print(df['ethnicity'].value_counts())

ethnicity
white                                              32495
asian                                               6019
hispanic / latin                                    2763
black                                               1974
other                                               1669
                                                   ...  
black, native american, indian, white                  1
black, native american, pacific islander, other        1
asian, middle eastern, black, pacific islander         1
middle eastern, black, pacific islander, white         1
asian, black, indian                                   1
Name: count, Length: 217, dtype: int64


Since there are 217 different ethnicity combinations in this dataset, it's best to standardize the ethnicity column to reduce ambiguity.

In [None]:
def standardize_ethnicity(value):
    if pd.isna(value):  # if missing,
        return 'unknown'
    # Split by comma, strip whitespace, convert to lowercase, and remove duplicates
    ethnicities = sorted(set([eth.strip().lower() for eth in value.split(',')]))
    # Combine back into a standardized string
    return ', '.join(ethnicities)

# Apply the cleaning function to the 'ethnicity' column
df['ethnicity'] = df['ethnicity'].apply(standardize_ethnicity)

# Collapse every multi-ethnicity combination into a single 'mixed' category;
# single values (including 'indian', 'other', and 'unknown') are kept as-is
df['ethnicity'] = df['ethnicity'].apply(lambda x: 'mixed' if ',' in x else x)

# Verify the cleaned and standardized column
print(df['ethnicity'].value_counts())


ethnicity
white               32495
mixed                6778
asian                6019
unknown              5262
hispanic / latin     2763
black                1974
other                1669
indian               1062
pacific islander      413
middle eastern        324
native american        63
Name: count, dtype: int64
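
To illustrate what the standardization does, the sketch below applies the same logic to a few toy values: entries that differ only in order, case, or duplication end up as the same string, and any entry listing more than one ethnicity is then collapsed to `mixed`. The sample values are made up for illustration:

```python
import numpy as np
import pandas as pd

def standardize_ethnicity(value):
    if pd.isna(value):
        return "unknown"
    # Split on commas, normalize each part, de-duplicate, and sort for a stable order
    parts = sorted({part.strip().lower() for part in value.split(",")})
    return ", ".join(parts)

samples = pd.Series(["white, asian", "Asian, White", "black", np.nan, "asian, asian"])

standardized = samples.apply(standardize_ethnicity)
collapsed = standardized.apply(lambda x: "mixed" if "," in x else x)

print(standardized.tolist())
print(collapsed.tolist())
```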


### Job

In [None]:
print(df['job'].value_counts())

job
other                                7547
student                              4851
science / tech / engineering         4825
computer / hardware / software       4682
artistic / musical / writer          4410
sales / marketing / biz dev          4373
medicine / health                    3659
education / academia                 3497
executive / management               2357
banking / financial / real estate    2240
entertainment / media                2234
law / legal services                 1369
hospitality / travel                 1352
construction / craftsmanship         1016
clerical / administrative             801
political / government                697
rather not say                        431
transportation                        363
unemployed                            270
retired                               246
military                              201
Name: count, dtype: int64


The job column is useful in its current form, so the only change needed is to fill in the NaNs as 'rather not say'.

In [None]:
df['job'] = df['job'].fillna('rather not say')
print(df['job'].value_counts())

job
rather not say                       7832
other                                7547
student                              4851
science / tech / engineering         4825
computer / hardware / software       4682
artistic / musical / writer          4410
sales / marketing / biz dev          4373
medicine / health                    3659
education / academia                 3497
executive / management               2357
banking / financial / real estate    2240
entertainment / media                2234
law / legal services                 1369
hospitality / travel                 1352
construction / craftsmanship         1016
clerical / administrative             801
political / government                697
transportation                        363
unemployed                            270
retired                               246
military                              201
Name: count, dtype: int64


### Location

In [None]:
df['location'].value_counts()

Unnamed: 0_level_0,count
location,Unnamed: 1_level_1
"san francisco, california",30514
"oakland, california",7107
"berkeley, california",4150
"san mateo, california",1309
"palo alto, california",1052
...,...
"jackson, mississippi",1
"ozone park, new york",1
"lake orion, michigan",1
"cambridge, massachusetts",1


We're not able to see all the locations in this output, but we're going to go ahead and standardize them.

In [None]:
df['location'] = df['location'].str.lower().str.strip()

### Offspring

In [None]:
df['offspring'].value_counts()

Unnamed: 0_level_0,count
offspring,Unnamed: 1_level_1
doesn't have kids,7509
"doesn't have kids, but might want them",3859
"doesn't have kids, but wants them",3554
doesn't want kids,2909
has kids,1874
has a kid,1869
"doesn't have kids, and doesn't want any",1128
"has kids, but doesn't want more",440
"has a kid, but doesn't want more",274
"has a kid, and might want more",229


In [None]:
offspring_mapping = {
    "doesn't have kids": "no kids, no preference",
    "doesn't have kids, but might want them": "no kids, might want",
    "doesn't have kids, but wants them": "no kids, wants",
    "doesn't want kids": "no kids, doesn't want",
    "has kids": "has kids, no preference",
    "has a kid": "has kids, no preference",
    "doesn't have kids, and doesn't want any": "no kids, doesn't want",
    "has kids, but doesn't want more": "has kids, doesn't want more",
    "has a kid, but doesn't want more": "has kids, doesn't want more",
    "has a kid, and might want more": "has kids, might want more",
    "wants kids": "wants kids",
    "might want kids": "might want kids",
    "has kids, and might want more": "has kids, might want more",
    "has a kid, and wants more": "has kids, wants more",
    "has kids, and wants more": "has kids, wants more"
}

df['offspring'] = df['offspring'].map(offspring_mapping).fillna('unknown')

In [None]:
df['offspring'].sample(15)

Unnamed: 0,offspring
23642,"no kids, wants"
45370,"no Kids, might want"
19824,"no kids, doesn't want"
5375,"no Kids, might want"
55932,"no Kids, might want"
20108,unknown
30028,unknown
40023,"has kids, no preference"
7106,"no kids, wants"
55875,"no Kids, might want"


### Pets

In [None]:
df['pets'].value_counts()

Unnamed: 0_level_0,count
pets,Unnamed: 1_level_1
likes dogs and likes cats,14753
likes dogs,7191
likes dogs and has cats,4293
has dogs,4096
has dogs and likes cats,2324
likes dogs and dislikes cats,2022
has dogs and has cats,1464
has cats,1395
likes cats,1057
has dogs and dislikes cats,548


In [None]:
pets_mapping = {
    "likes dogs and likes cats": "likes pets",
    "likes dogs": "likes dogs",
    "likes cats": "likes cats",
    "likes dogs and has cats": "has pets",
    "has dogs": "has pets",
    "has dogs and likes cats": "has pets",
    "has dogs and has cats": "has pets",
    "has cats": "has pets",
    "has dogs and dislikes cats": "has pets",
    "likes dogs and dislikes cats": "likes dogs",
    "dislikes dogs and likes cats": "likes cats",
    "dislikes dogs and dislikes cats": "dislikes pets",
    "dislikes cats": "dislikes pets",
    "dislikes dogs": "dislikes pets",
    "dislikes dogs and has cats": "has pets"
}

# Apply mapping directly to the 'pets' column
df['pets'] = df['pets'].map(pets_mapping).fillna("unknown")

# Verify changes
print(df['pets'].value_counts())

pets
unknown          19006
likes pets       14753
has pets         14201
likes dogs        9213
likes cats        1292
dislikes pets      357
Name: count, dtype: int64
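
The dictionary above follows an implicit precedence: owning any pet wins, then liking both animals, then liking one, otherwise disliking. A sketch of that rule as a function, assuming the hypothetical name `categorize_pets` and that every raw label consists of one or two `verb animal` parts joined by `and`. Parsing the parts, rather than searching for substrings, avoids the trap that `likes dogs` also occurs inside `dislikes dogs`.

```python
def categorize_pets(label):
    """Collapse a raw pets label using the precedence has > likes > dislikes.

    Hypothetical helper; assumes labels such as "has dogs and likes cats".
    """
    if not isinstance(label, str):
        return "unknown"  # None or NaN

    # Parse "verb animal" parts, e.g. [("has", "dogs"), ("likes", "cats")]
    parts = [tuple(part.split()) for part in label.lower().split(" and ")]
    verbs = {verb for verb, _ in parts}
    liked = sorted(animal for verb, animal in parts if verb == "likes")

    if "has" in verbs:
        return "has pets"
    if len(liked) == 2:
        return "likes pets"
    if len(liked) == 1:
        return f"likes {liked[0]}"
    return "dislikes pets"

for raw in ["likes dogs and likes cats", "dislikes dogs and likes cats", "dislikes dogs"]:
    print(raw, "->", categorize_pets(raw))
```

For the labels listed in `pets_mapping` this gives the same categories as the dictionary, so either form could be used; the dictionary is more explicit, the function is shorter.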


### Religion

In [None]:
df['religion'].value_counts()

religion,count
agnosticism,2701
other,2671
agnosticism but not too serious about it,2631
agnosticism and laughing about it,2488
catholicism but not too serious about it,2304
atheism,2166
other and laughing about it,2108
atheism and laughing about it,2067
christianity but not too serious about it,1945
christianity,1939


In [None]:
# Define a dictionary-based mapping for standardizing religion
religion_mapping = {
    "agnosticism": "agnosticism",
    "agnosticism but not too serious about it": "agnosticism",
    "agnosticism and laughing about it": "agnosticism",
    "agnosticism and somewhat serious about it": "agnosticism",
    "agnosticism and very serious about it": "agnosticism",
    "atheism": "atheism",
    "atheism but not too serious about it": "atheism",
    "atheism and laughing about it": "atheism",
    "atheism and somewhat serious about it": "atheism",
    "atheism and very serious about it": "atheism",
    "christianity": "christianity",
    "christianity but not too serious about it": "christianity",
    "christianity and laughing about it": "christianity",
    "christianity and somewhat serious about it": "christianity",
    "christianity and very serious about it": "christianity",
    "catholicism": "catholicism",
    "catholicism but not too serious about it": "catholicism",
    "catholicism and laughing about it": "catholicism",
    "catholicism and somewhat serious about it": "catholicism",
    "catholicism and very serious about it": "catholicism",
    "judaism": "judaism",
    "judaism but not too serious about it": "judaism",
    "judaism and laughing about it": "judaism",
    "judaism and somewhat serious about it": "judaism",
    "judaism and very serious about it": "judaism",
    "buddhism": "buddhism",
    "buddhism but not too serious about it": "buddhism",
    "buddhism and laughing about it": "buddhism",
    "buddhism and somewhat serious about it": "buddhism",
    "buddhism and very serious about it": "buddhism",
    "islam": "islam",
    "islam but not too serious about it": "islam",
    "islam and laughing about it": "islam",
    "islam and somewhat serious about it": "islam",
    "islam and very serious about it": "islam",
    "hinduism": "hinduism",
    "hinduism but not too serious about it": "hinduism",
    "hinduism and laughing about it": "hinduism",
    "hinduism and somewhat serious about it": "hinduism",
    "hinduism and very serious about it": "hinduism",
}

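# Apply the mapping to the 'religion' column. Anything without a dictionary entry,
# including the "other ..." variants and missing values, falls back to "other".
df['religion'] = df['religion'].str.lower().map(religion_mapping).fillna("other")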

# Verify results
print(df['religion'].value_counts())


religion
other           27014
agnosticism      8772
atheism          6964
christianity     5748
catholicism      4726
judaism          3080
buddhism         1937
hinduism          446
islam             135
Name: count, dtype: int64
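
Every key in `religion_mapping` starts with the name of the faith, followed by an optional seriousness qualifier, so the same grouping can be sketched as a first-word extraction. The helper name `base_religion` and the `KNOWN_FAITHS` set are illustrative assumptions, not part of the notebook. As with the mapping, everything else (the `other ...` labels and missing values) becomes `other`.

```python
# Faiths that appear as values in religion_mapping
KNOWN_FAITHS = {
    "agnosticism", "atheism", "christianity", "catholicism",
    "judaism", "buddhism", "islam", "hinduism",
}

def base_religion(label):
    """Keep only the leading faith name and discard the seriousness qualifier.

    Hypothetical helper, for illustration only.
    """
    if not isinstance(label, str) or not label.strip():
        return "other"  # None, NaN, or empty string
    first_word = label.lower().split()[0]
    return first_word if first_word in KNOWN_FAITHS else "other"

print(base_religion("agnosticism but not too serious about it"))
print(base_religion("other and laughing about it"))
```

For the labels listed in the dictionary the result is the same. The trade-off is that a prefix rule also accepts qualifiers the dictionary has never seen, which is convenient but less strict.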


### Speaks Column

In [None]:
df['speaks'].value_counts()

speaks,count
english,21016
english (fluently),6583
"english (fluently), spanish (poorly)",2050
"english (fluently), spanish (okay)",1905
"english (fluently), spanish (fluently)",1275
...,...
"english (fluently), french (okay), italian (okay), hebrew (okay)",1
"english (fluently), farsi (poorly), spanish (poorly), french (poorly)",1
"english (okay), tagalog (okay), japanese (poorly), french (poorly)",1
"english, spanish (fluently), lisp (okay)",1


I want to standardize this column, but I am not sure whether doing so will affect the PCA results. I will keep the original values in a new `speaks_original` column and overwrite `speaks` with a standardized version, then check whether the choice has any impact on PCA. If it does not, I will drop the extra column.

In [None]:
df['speaks_original'] = df['speaks']

In [None]:
def categorize_speaks(speaks):
    if pd.isna(speaks):
        return "unknown"  # Handle missing values

    # Remove fluency descriptors (e.g., "(fluently)", "(okay)", "(poorly)")
    cleaned_languages = re.sub(r"\s?\(.*?\)", "", speaks)

    # Convert to lowercase and split into individual languages
    languages = set(cleaned_languages.lower().split(", "))

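    # Classify as 'multilingual' if more than one distinct language is listed.
    # Note: programming languages (e.g. "c++") still count here, because they
    # are only stripped from 'speaks_original' in a later cell.
    return "monolingual" if len(languages) == 1 else "multilingual"

# Classify the preserved original values and overwrite the 'speaks' column
df['speaks'] = df['speaks_original'].apply(categorize_speaks)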

In [None]:
df['speaks'].value_counts()

speaks,count
multilingual,30348
monolingual,28433
unknown,41


In [None]:
df[['speaks_original','speaks']].sample(5)

Unnamed: 0,speaks_original,speaks
43778,"english (fluently), spanish (fluently), japane...",multilingual
22779,"english (fluently), hungarian (fluently), span...",multilingual
47788,english,monolingual
23823,"english (fluently), spanish (poorly), yiddish ...",multilingual
23998,english (fluently),monolingual


For some reason, the `speaks_original` column contains programming languages such as C++ and Lisp. Let's remove them.

In [None]:
# Programming languages to exclude ("lisp" appears in the value counts above)
programming_languages = {"c++", "java", "python", "javascript", "html", "css", "ruby", "swift", "php", "r", "sql", "lisp"}

def clean_speaks(text):
    if pd.isna(text):
        return text  # Keep NaNs as they are

    words = re.split(r',\s*', str(text).lower())  # Split on commas, dropping the spaces after them
    cleaned_words = []

    for word in words:
        # Remove proficiency labels such as "(fluently)"
        word = re.sub(r"\(.*?\)", "", word).strip()

        # Exclude programming languages; isalpha() additionally drops any
        # entry that contains spaces, digits, or symbols
        if word not in programming_languages and word.isalpha():
            cleaned_words.append(word)

    return ", ".join(cleaned_words) if cleaned_words else None  # None if nothing is left

# Apply the cleaning function to the speaks_original column
df['speaks_original'] = df['speaks_original'].apply(clean_speaks)

In [None]:
df[['speaks_original','speaks']].sample(5)

Unnamed: 0,speaks_original,speaks
40124,english,monolingual
12243,"english, spanish",multilingual
56309,english,monolingual
54165,english,monolingual
25234,english,monolingual
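
Two details in the cells above interact. `categorize_speaks` runs before the programming languages are removed, so an entry such as `english, c++` is counted as multilingual, and `clean_speaks` keeps repeated names. A sketch of a single-pass alternative, assuming hypothetical helpers `parse_languages` and `speaks_category` and a shortened, illustrative exclusion set:

```python
import re

# Shortened exclusion set for illustration; the notebook's set is longer
PROGRAMMING_LANGUAGES = {"c++", "lisp"}

def parse_languages(text):
    """Return the distinct human languages in a raw 'speaks' string, in order."""
    if not isinstance(text, str):
        return []  # None or NaN
    languages = []
    for entry in text.lower().split(","):
        name = re.sub(r"\(.*?\)", "", entry).strip()  # drop "(fluently)" etc.
        if name and name not in PROGRAMMING_LANGUAGES and name not in languages:
            languages.append(name)
    return languages

def speaks_category(text):
    """Classify on the cleaned list, so excluded entries are never counted."""
    count = len(parse_languages(text))
    if count == 0:
        return "unknown"
    return "monolingual" if count == 1 else "multilingual"

print(parse_languages("english (fluently), c++ (okay)"), speaks_category("english (fluently), c++ (okay)"))
print(parse_languages("english, english (okay)"))
```

Deriving the category from the cleaned list keeps the two columns consistent with each other, at the cost of changing the counts reported above.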


## 4.3 Cleaning Essay Columns

In [None]:
df['combined_essay_cols'] = df[
    ['essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7', 'essay8', 'essay9']
].fillna("").apply(lambda x: " ".join(x), axis=1)

In [None]:
df.drop(columns=['essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7', 'essay8', 'essay9'], inplace=True)

In [None]:
df['combined_essay_cols'].sample()

Unnamed: 0,combined_essay_cols
1063,"i was born here in san francisco, but was rais..."


In [None]:
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
spell = SpellChecker()

def preprocess_text(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation (so "i'm" becomes "im")
    tokens = text.split()  # Whitespace split; faster than nltk's word_tokenize
    return tokens

def remove_stopwords(tokens):
    return [word for word in tokens if word not in stop_words]

def remove_long_words(tokens, max_length=20):
    return [word for word in tokens if len(word) <= max_length]

def lemmatization(tokens):
    return [lemmatizer.lemmatize(word) for word in tokens]


In [None]:
def clean_text(text):
    if not isinstance(text, str) or text.strip() == "":  # Ensure text is a string, handle NaNs
        return ""

    tokens = preprocess_text(text)
    tokens = remove_stopwords(tokens)
    tokens = remove_long_words(tokens)  # Removes words > 20 characters
    tokens = lemmatization(tokens)
    return " ".join(tokens)  # Convert back to string for TF-IDF


df['cleaned_essays'] = df['combined_essay_cols'].astype(str).swifter.apply(clean_text)
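
One consequence of the order of operations in `clean_text`: punctuation is stripped before stop words are filtered, so a contraction such as `don't` becomes `dont` and no longer matches the apostrophe form stored in a stop-word list (tokens like `im` in the cleaned essays are a similar by-product). A sketch of the reverse order, assuming a tiny stand-in stop-word set (`TOY_STOP_WORDS`) instead of NLTK's list, hypothetical helper names, and ASCII-only text:

```python
import re

# Tiny stand-in for NLTK's English stop-word list (assumption for this sketch)
TOY_STOP_WORDS = {"i", "i'm", "don't", "the", "a", "to"}

def tokenize_keep_apostrophes(text):
    """Lowercase and split into ASCII tokens, keeping apostrophes inside words."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

def filter_then_strip(text, stop_words=TOY_STOP_WORDS):
    """Drop stop words first, then remove the remaining apostrophes."""
    tokens = tokenize_keep_apostrophes(text)
    kept = [tok for tok in tokens if tok not in stop_words]
    return [tok.replace("'", "") for tok in kept]

print(filter_then_strip("I'm sporty, I don't like the rain"))
# → ['sporty', 'like', 'rain']
```

Whether this matters depends on the downstream model: for TF-IDF it mainly trims noise tokens, while sentence-embedding models are usually fed the uncleaned text anyway.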


In [None]:
df['speaks_original'].sample(10)

Unnamed: 0,speaks_original
29632,"english, hebrew"
8466,english
39772,english
36858,english
51084,english
51380,english
12806,english
19711,english
6968,english
5445,"english, english"


In [None]:
df.sample(10)

Unnamed: 0,age,status,sex,orientation,diet,drinks,drugs,education,ethnicity,height,job,location,offspring,pets,religion,smokes,speaks,speaks_original,combined_essay_cols,cleaned_essays
21296,42,seeing someone,m,straight,unknown,socially,never,college/university_graduated,white,67.0,executive / management,"san francisco, california",unknown,unknown,agnosticism,no,multilingual,"english, french","like a lot of people i meet in sf, i'm a trans...",like lot people meet sf im transplant grew bac...
20187,45,single,m,gay,unknown,socially,never,masters_graduated,asian,66.0,artistic / musical / writer,"mill valley, california",unknown,unknown,other,no,multilingual,"english, french, spanish, norwegian","i teach, i write, i travel. i have a passion f...",teach write travel passion three thing ive for...
59528,22,single,m,straight,unknown,rarely/not at all,never,high school_graduated,hispanic / latin,66.0,construction / craftsmanship,"oakland, california",unknown,has pets,other,no,multilingual,"english, spanish",i'm sporty i play guitar i love music trying t...,im sporty play guitar love music trying join p...
5023,19,single,m,straight,anything,socially,never,high school_graduated,asian,72.0,student,"oakland, california",unknown,unknown,agnosticism,no,multilingual,"english, chinese","i am asian i am 6 feet tall, i'm into cars, an...",asian 6 foot tall im car shoe preferably jorda...
50975,25,single,f,straight,vegetarian,rarely/not at all,never,masters_graduated,mixed,68.0,education / academia,"san francisco, california",unknown,has pets,other,no,multilingual,"english, farsi, french",inquisitive. sensitive. loquacious teaching ea...,inquisitive sensitive loquacious teaching eage...
32104,32,single,f,straight,unknown,socially,never,college/university_graduated,asian,67.0,entertainment / media,"san francisco, california",unknown,unknown,other,no,monolingual,english,i love the arts and music and travel are a big...,love art music travel big part life working di...
19886,43,single,f,straight,anything,socially,never,college/university_graduated,other,60.0,banking / financial / real estate,"oakland, california","no kids, no preference",has pets,catholicism,occasionally,multilingual,"english, tagalog, spanish","""attitude is a little thing that makes a big d...",attitude little thing make big difference wins...
46941,33,single,f,straight,anything,socially,never,unknown,unknown,62.0,sales / marketing / biz dev,"san francisco, california","no kids, no preference",unknown,other,no,multilingual,"english, indonesian","hello! to start, i am from indonesia and have ...",hello start indonesia calling bay home almost ...
25358,30,single,m,straight,unknown,socially,never,college/university_graduated,unknown,69.0,political / government,"berkeley, california","no kids, wants",has pets,catholicism,no,monolingual,english,singing marvin gaye and elton john songs at ...,singing marvin gaye elton john song karaoke ba...
12118,59,single,f,bisexual,unknown,rarely/not at all,never,two-year college_graduated,white,64.0,construction / craftsmanship,"oakland, california",unknown,has pets,other,no,monolingual,english,"12 step recovery, spiritual work/retreats lis...",12 step recovery spiritual workretreats listen...


### Final Dataframe

In [None]:
df.to_csv('df_cleaned.csv', index=False)

In [None]:
# # Load model from HuggingFace Hub
# tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
# model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# def bert_embeddings(text):

#   def mean_pooling(model_output, attention_mask):
#     token_embeddings = model_output[0] #First element of model_output contains all token embeddings
#     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
#     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

#   encoded_input = tokenizer(text, padding=True, truncation=True, return_tensors='pt')

#   # Compute token embeddings
#   with torch.no_grad():
#     model_output = model(**encoded_input)

#   # Perform pooling
#   essay_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

#   # Normalize embeddings and return them as a NumPy array
#   essay_embeddings = F.normalize(essay_embeddings, p=2, dim=1)
#   normalized_essay = np.array(essay_embeddings)
#   return normalized_essay


In [None]:
# from transformers import AutoTokenizer, AutoModel
# import torch
# import torch.nn.functional as F

# #Mean Pooling - Take attention mask into account for correct averaging
# def mean_pooling(model_output, attention_mask):
#     token_embeddings = model_output[0] #First element of model_output contains all token embeddings
#     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
#     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# # Sentences we want sentence embeddings for
# sentences = ['This is an example sentence', 'Each sentence is converted']

# # Load model from HuggingFace Hub
# tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
# model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# # Tokenize sentences
# encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# # Compute token embeddings
# with torch.no_grad():
#     model_output = model(**encoded_input)

# # Perform pooling
# sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# # Normalize embeddings
# sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

# print("Sentence embeddings:")
# print(sentence_embeddings)
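
The commented-out cells above rely on two steps that are easy to check by hand: attention-masked mean pooling over the token vectors, then L2 normalization. A dependency-free sketch of that arithmetic for a single sentence, using plain Python lists in place of tensors (the function names and toy numbers are illustrative assumptions):

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding has mask 0)."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    for vector, mask in zip(token_vectors, attention_mask):
        for i in range(dim):
            totals[i] += vector[i] * mask
    # Guard against an all-padding input, mirroring clamp(min=1e-9) above
    n_tokens = max(sum(attention_mask), 1e-9)
    return [total / n_tokens for total in totals]

def l2_normalize(vector):
    """Scale the vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

# Two real tokens and one padding token with deliberately large values
tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
unit = l2_normalize(pooled)
print(pooled, unit)
```

The padding vector has no effect on the result because its mask is 0, and the normalized output has length 1, so a dot product between two such vectors equals their cosine similarity.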