# Notes to ICPSR search committee

This code is part of a project that seeks to identify the predictors of attractiveness in online dating. I scraped data from over a quarter million user profiles to support this project. This .ipynb file is one step in transforming the scraped data into a form suitable for analysis.

This code is one stage of a multi-stage process that transforms approximately 50 GB of HTML (spread across more than a quarter million txt files) into a usable dataset. The previous stage (which produced "Ok_Data_first_stage.csv") transformed the data into a tabular format. This stage further processes the resulting textual data by:

1. Cleaning users' self-response essays (e.g., self-descriptions, overviews of their interests), specifically by removing most remaining HTML and replacing textual representations of emoticons.
2. Using cleaned essays to identify users who appear to have not put much effort into their profiles.
3. Cleaning personality data that I found embedded in the HTML of users' pages. The dating website uses these data to create graphics indicating qualities of the users, such as extroversion and political beliefs.
4. Cleaning self-descriptions (e.g., descriptions of sexual identity, age, height, educational attainment).
5. Writing these changes to a new CSV file (Ok_Data_second_stage.csv).

I selected this code because it highlights my experience working with text data and should be interpretable on its own. Other Python code associated with this cleaning process, and code highlighting other skills (e.g., API development, MTurk study automation, more complex uses of the pandas library), is available upon request.

In [31]:
import re
from bs4 import BeautifulSoup
import pandas as pd
from pandas import Series, DataFrame
from ast import literal_eval
import ast
import json
import numpy as np

import pandas.api.types as ptypes  # pandas.types was moved to pandas.api.types in later pandas releases

In [32]:
df = pd.read_csv("Ok_Data_first_stage.csv", encoding="cp1252", low_memory=False)

## Drop Unwanted Columns

In [33]:
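# Drop the leftover index columns created by earlier CSV writes and merges.
# The axis is given by keyword because recent pandas versions reject a positional axis.
df = df.drop(columns=['Unnamed: 0', 'Unnamed: 0_x', 'Unnamed: 0_y'])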

"""
I grabbed several important variables while merging all data. However, I have reason to believe these data
are not completely accurate. Because of this, I drop these variables and redo the text processing necessary to 
create these variables.
"""

df = df.drop(columns=['Asian', 'Black', 'Hispanic', 'Hookup', 'Indian', 'Long-term',
                      'Middle Eastern', 'Multi-ethnic', 'Native American', 'New friends',
                      'Other', 'Pacific Islander', 'Short-term', 'White', 'build', 'educ',
                      'gender', 'height', 'sex', 'status'])

"""
The age and city variables are derived from profiles (rather than from the initial search for users at different
levels of attractiveness). I do not redo the age and location variables because I have no reason to believe they
were parsed incorrectly.
"""

In [34]:
df.columns

Index(['city', 'state', 'date of search', 'attractiveness', "user's gender",
       'location of search', 'About me',
       'Favorite books, movies, shows, music, and food',
       'I spend a lot of time thinking about', 'I’m really good at',
       'My self-summary', 'On a typical Friday night I am',
       'Six things I could never do without',
       'The first things people usually notice about me',
       'The most private thing I’m willing to admit',
       'What I’m doing with my life', 'You should message me if', 'age',
       'location', 'looking_for_section', 'personality1', 'personality2',
       'user_info_1', 'user_info_2', 'user_info_3', 'seeking_both',
       'seeking_men', 'seeking_women', 'to_drop'],
      dtype='object')

## Clean Essays

- Create two copies of each essay. The first has basic HTML removed; the second (suffixed "_no_emoticons") additionally replaces emoticon image tags with the token EMOTICON.
- Both copies currently retain links.
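
As a minimal sketch of the two passes described above, applied to a single made-up essay string. The sample markup and the helper names strip_basic_html and mask_emoticons are illustrative assumptions, not taken from the scraped data or from the cells below.

```python
import re

def strip_basic_html(text):
    """Basic pass: line breaks become spaces, closing div tags are dropped, &amp; is decoded."""
    text = text.replace("<br/>", " ")
    text = text.replace("</div>", "")
    text = text.replace("&amp;", "&")
    return text

def mask_emoticons(text):
    """Second pass: replace each emoticon <img alt=... /> tag with the token EMOTICON."""
    return re.sub(r'<img alt=(.*?)/>', 'EMOTICON', text)

# Hypothetical essay fragment (assumed markup, for illustration only)
sample = 'I like hiking &amp; tea<br/>Say hi <img alt=":)" src="smile.png"/></div>'

basic = strip_basic_html(sample)
no_emoticons = mask_emoticons(basic)

print(basic)
print(no_emoticons)
```

Keeping a placeholder token rather than deleting the tag means later word counts or text models can still see that an emoticon was used.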

In [35]:


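def essay_cleaner(c, essay):
    """Basic HTML cleanup for one essay column; returns None when the essay is missing."""
    x = c[essay]
    if isinstance(x, str):
        x = x.replace("<br/>", " ")   # line breaks become spaces
        x = x.replace('</div>', "")
        x = x.replace("\\'", "'")     # unescape backslash-escaped apostrophes ("\'" alone is a no-op)
        x = x.replace("&amp;", "&")
        x = x[40:]                    # drop the first 40 characters of the raw field
        return x
    else:
        return None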

    

def essay_cleaner_no_emo(c, essay):
    x = c[essay]
    if type(x) == str:

        t = re.sub(r'\<img alt=(.*?)\/>', 'EMOTICON', x)
        t = re.sub(r'\\img alt=(.*?)\/>', 'EMOTICON', t)
        
        #t = re.sub(r'\<a href=(.*?)\/a>', 'LINK', t)
        #t = re.sub(r'\a href=(.*?)\/a>', 'LINK', t)
        return t
    else:
        return None

    

essay_list = ['Favorite books, movies, shows, music, and food',
'I spend a lot of time thinking about',
'I’m really good at',
'My self-summary',
'On a typical Friday night I am',
'Six things I could never do without',
'The first things people usually notice about me',
'The most private thing I’m willing to admit',
'What I’m doing with my life',
'You should message me if',
'About me']

for i in essay_list:
    df[i] = df.apply(essay_cleaner,essay = i, axis=1)
    df[i + "_no_emoticons"] = df.apply(essay_cleaner_no_emo,essay = i, axis=1)


## Evaluate Missing Data for Essays, create poor-quality profile variable

- Provide a word count across all essays. A word count of 0 means the user did not write anything about themselves.
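
A small sketch of the idea on a toy frame, assuming two hypothetical essay columns (essay_a, essay_b) and a hypothetical wrote_nothing flag; the real notebook counts over the "_no_emoticons" columns instead.

```python
import pandas as pd

# Hypothetical two-essay frame; None marks an essay the user left blank.
toy = pd.DataFrame({
    "essay_a": ["I love long walks", None],
    "essay_b": ["and strong coffee", None],
})
essay_columns = ["essay_a", "essay_b"]

def count_words(row):
    """Sum whitespace-separated tokens over every essay that is actually a string."""
    return sum(len(row[col].split()) for col in essay_columns if isinstance(row[col], str))

toy["total_word_count"] = toy.apply(count_words, axis=1)
toy["wrote_nothing"] = toy["total_word_count"] == 0  # candidate low-effort profile flag

print(toy[["total_word_count", "wrote_nothing"]])
```

Checking isinstance(..., str) rather than comparing against None also skips NaN values, which pandas may substitute for missing text.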

In [45]:
print("Do we have data for the essay column?")
print(" ")

for essay in essay_list:
    print("Proportion of users who provided data: " , len(df[df[essay].notnull() == True])/len(df))
          
    print(df[essay].notnull().value_counts())
    print(" ")

Do we have data for the essay column?
 
Proportion of users who provided data:  0.7094222315890889
True     188502
False     77210
Name: Favorite books, movies, shows, music, and food, dtype: int64
 
Proportion of users who provided data:  0.663127747335461
True     176201
False     89511
Name: I spend a lot of time thinking about, dtype: int64
 
Proportion of users who provided data:  0.7256202203889925
True     192806
False     72906
Name: I’m really good at, dtype: int64
 
Proportion of users who provided data:  0.9303531643282953
True     247206
False     18506
Name: My self-summary, dtype: int64
 
Proportion of users who provided data:  0.6734697717829831
True     178949
False     86763
Name: On a typical Friday night I am, dtype: int64
 
Proportion of users who provided data:  0.6799768170048774
True     180678
False     85034
Name: Six things I could never do without, dtype: int64
 
Proportion of users who provided data:  0.23962786776660444
False    202040
True      63672
Name:

In [46]:
# Create a count of the total number of words typed in the profile (across all essays)

def profile_word_counter(p):
    counter = 0
    for essay in essay_list:
        text = p[essay + "_no_emoticons"]  # do the word count on the no-emoticon version
        if isinstance(text, str):
            counter += len(text.split())
    return counter

df['total_word_count'] = df.apply(profile_word_counter, axis=1)

## Clean Personality Data

- get conservatism measure
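
The sketch below illustrates the general technique of pulling a trait score out of JSON embedded in page text. The layout of the raw string is a made-up assumption (the real personality1 field may look different), and trait_percentile is a hypothetical helper. Unlike the cells below, it parses the whole bracketed array with one json.loads call, which only works if that array is valid JSON.

```python
import json
import re

# Made-up stand-in for the embedded page data; the real field layout may differ.
raw = ('var traits = [{"name": "Extroverted", "percentile": 62}, '
       '{"name": "Conservative", "percentile": 18}];')

def trait_percentile(raw_text, trait):
    """Return the percentile recorded for `trait`, or None if it is absent."""
    match = re.search(r'\[.*?\]', raw_text)   # first bracketed array in the text
    if match is None:
        return None
    for entry in json.loads(match.group(0)):
        if entry.get("name") == trait:
            return entry.get("percentile")
    return None

print(trait_percentile(raw, "Conservative"))
print(trait_percentile("no traits here", "Conservative"))
```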

In [47]:
"""
This starts cleaning the messy personality data by removing all non-personality data
"""

def personality_cleaner(c):
    
    return re.findall(r'\[(.*?)\]',c['personality1'])

df['personality1_clean'] = df.apply(personality_cleaner, axis=1)

In [48]:
"""
Using the somewhat cleaned variable, search through it to find the conservative score
"""
def personality_cleaner2(c):
    x = list(c['personality1_clean'])[0]
    y = re.findall(r'\{(.*?)\}',x)

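    for i in y:
        # re.findall drops the surrounding braces, so restore them before parsing the JSON object
        i = json.loads('{' + i + '}')

        if i['name'] == "Conservative":
            return i['percentile']
    # falls through and returns None when no "Conservative" entry is present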

df['conservative']= df.apply(personality_cleaner2, axis=1)

In [49]:
# How many users have a conservatism score?
df['conservative'].isnull().value_counts()

True     172511
False     93201
Name: conservative, dtype: int64

## Straight, Gay, Bi user categories, based on how they appeared in search

search_based_sexuality ignores how users self-identify in their profile. Rather, it determines their sexuality by how they appeared in my initial search for users. For example, a man who appeared while searching for users seeking women but not for users seeking men would be identified as straight. A woman who self-identifies as bisexual but only appeared in my search for users seeking men would also be identified in this variable as straight.
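
The rule can be sketched compactly as follows. This is an illustrative restatement under the assumption that the search flags are coded 1/0 and gender is "men" or "women"; the function name classify_from_search is hypothetical.

```python
def classify_from_search(gender, seeking_men, seeking_women):
    """Classify a user from the searches they appeared in (1 = appeared, 0 = did not)."""
    if gender not in ("men", "women"):
        return None
    # "same" = appeared when searching for users seeking the user's own gender
    if gender == "men":
        same, other = seeking_men, seeking_women
    else:
        same, other = seeking_women, seeking_men
    if same and other:
        return "bisexual"
    if other:
        return "straight"
    if same:
        return "gay"
    return None  # did not appear in either search

# A woman who appeared only in the search for users seeking men
print(classify_from_search("women", seeking_men=1, seeking_women=0))
```

Users who appear in neither search get None, which pandas value_counts() leaves out by default.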

In [50]:
def set_sexuality(c):
    
    g = c["user's gender"]
    
    #seeking:
    men = c["seeking_men"]
    women = c["seeking_women"]
    both = c["seeking_both"]
    
    if g == "women":
        if men == 1 and women == 0:
            return "straight"
        if men == 0 and women == 1:
            return "gay"
        if men == 1 and women == 1:
            return "bisexual"
    if g == "men":
        if men == 0 and women == 1:
            return "straight"
        if men == 1 and women == 0:
            return "gay"
        if men == 1 and women == 1:
            return "bisexual"
        
df["search_based_sexuality"] = df.apply(set_sexuality, axis=1)

In [51]:
mendf = df[df["user's gender"]=="men"]
womendf = df[df["user's gender"]=="women"]


In [52]:
mendf["search_based_sexuality"].value_counts()

straight    141333
gay          22464
bisexual      4986
Name: search_based_sexuality, dtype: int64

In [53]:
womendf["search_based_sexuality"].value_counts()

straight    73778
bisexual    14167
gay          8984
Name: search_based_sexuality, dtype: int64

## Clean User Info Section 1

The User Info data currently looks something like this for each user:
- "Straight, Man, Single, 6’ 0”, Average build"

This section parses out the following data from each user's info section

- gender
- sex (sexual orientation)
- availability
- height
- body type
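
Height is the only field here that is not matched against a fixed list, so it is worth a worked example. The sketch below converts a feet-and-inches token to total inches with a regular expression; it is an alternative illustration, not the slicing approach used in the cell below, and height_in_inches is a hypothetical name.

```python
import re

# One feet digit, a foot mark, one or two inch digits, an inch mark (curly or straight quotes)
HEIGHT_PATTERN = re.compile(r"(\d)\s*[’']\s*(\d{1,2})\s*[”\"]")

def height_in_inches(token):
    """Convert a token such as 6’ 0” to total inches; return None if it is not a height."""
    match = HEIGHT_PATTERN.fullmatch(token.strip())
    if match is None:
        return None
    feet, inches = int(match.group(1)), int(match.group(2))
    return feet * 12 + inches

print(height_in_inches("6’ 0”"))          # 6 * 12 + 0
print(height_in_inches("5’ 11”"))         # 5 * 12 + 11
print(height_in_inches("average build"))  # not a height
```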

In [54]:
gender = ['man', 'woman', 'agender', 'androgynous', 'bigender', 'cis man',
              'cis woman', 'genderfluid', 'genderqueer', 'gender nonconforming',
              'hijra', 'intersex', 'non-binary', 'other', 'pangender', 'transfeminine',
              'transgender', 'transmasculine', 'transsexual', 'trans man', 'trans woman',
              'two spirit']

sexuality = ['straight', 'gay', 'bisexual', 'asexual', 'demisexual',
                 'heteroflexible', 'homoflexible', 'lesbian', 'pansexual',
                 'queer', 'questioning', 'sapiosexual']

status = ['single','seeing someone','married','open relationship','available']

build = ['rather not say', 'thin', 'overweight', 'average build',
                 'fit', 'jacked', '"a little extra" build', 'curvy', 'full figured',
                 'used up']

def get_user_info_1(c):
    
    """
    Users can report multiple responses (up to 5) for their sex and gender. I collect these responses into a list, sort 
    the list by alphabetical order, then transform the list into a string before writing to the sex and gender variable.
    
    The sex_m/gender_m categorizes users with multiple responses as "multiple"
    """
    
    sex_list = []
    sex_var = ""
    sex_multi = ""
    gender_list = []
    gender_var = ""
    gender_multi = ""
    
    build_var = ""
    relationship_status_var = ""
    height = ""
    
    x = c['user_info_1'].split(",")
    
    for i in x:
        i = i.strip()
        i = i.lower()
        
        if i in sexuality:
            sex_list.append(i)
        elif i in gender:
            gender_list.append(i)
        elif i in build:
            build_var = i
        elif i in status:
            relationship_status_var = i
        else:
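            try:
                # heights look like 6’ 0”: feet digit first, inches between index 3 and the closing mark
                height = (int(i[0]) * 12) + int(i[3:-1])
            except (ValueError, IndexError):
                print(i)  # surface any token that matched none of the lists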
    
    sex_list.sort() 
    sex_var = ", ".join(sex_list)
    
    if len(sex_list) > 1:
        sex_multi = "multiple"
    else:
        sex_multi = sex_var
        
    
    gender_list.sort()
    gender_var = ", ".join(gender_list)
    
    if len(gender_list) > 1:
        gender_multi = "multiple"
    else:
        gender_multi = gender_var
    
    list_to_return = Series([sex_var,sex_multi,gender_var,gender_multi,build_var,relationship_status_var,height])
    
    return list_to_return
    
#create user info section 1 variables in DataFrame   
df[['sex','sex_m','gender','gender_m','body_type','relationship_status',
    'height']] = df.apply(get_user_info_1, axis=1)  
    

## Clean User Info Section 2

The User Info data currently looks something like this for each user:
- "Hispanic / Latin, Speaks English and Spanish, Working on High school, Christian"

This section parses out the following data from each user's info section

- race
- educ
- religion
- importance of religion
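
Before any matching can happen, the info string has to be split into tokens. The sketch below shows one way to do that on the example string above. It is an assumption-laden alternative to the cell below: it splits on the standalone words "and" and "but" (using word boundaries) so that words such as "islander" stay intact, and tokenize_info is a hypothetical name.

```python
import re

def tokenize_info(text):
    """Split on commas, parentheses, and the standalone words 'and' / 'but'."""
    parts = re.split(r"[,()]|\band\b|\bbut\b", text.lower())
    return [p.strip() for p in parts if p.strip()]

sample = "Hispanic / Latin, Speaks English and Spanish, Working on High school, Christian"
print(tokenize_info(sample))
print(tokenize_info("Pacific Islander, Attended university"))
```

Each resulting token can then be checked against the education, religion, and race lists.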

In [55]:
education = ['working on high school', 'working on two-year college', 'working on university',
                 'working on space camp', 'working on post grad',
                 'attended high school', 'attended two-year college', 'attended university', 'attended space camp',
                 'attended post grad',
                 'dropped out of high school', 'dropped out of two-year college', 'dropped out of university',
                 'dropped out of space camp', 'dropped out of post grad']

religion = ['agnostic','atheist','christian','jewish','catholic','muslim','hindu',
           'buddhist','sikh','other']

religion_importance = ["it’s important", "it’s not important", 'laughing about it','very serious about it']


def get_user_info_2(c):
    
    educ_var = ""
    religion_var = ""
    religion_importance_var = ""
    
    asian = False
    black = False
    hispanic = False
    middle_eastern = False
    native_american = False
    indian = False
    pacific_islander = False
    white = False
    other = False
    multi_ethnic = False
    
    x = c['user_info_2'].replace(')', '')
    x = x.replace('but', ',')
    x = x.replace('and', ',')
    x = re.split(r'[,(]',x)
    
    for i in x:
        i = i.strip()
        i = i.lower()
        
        if i in education:
            educ_var = i
        elif i in religion:
            religion_var = i
        elif i in religion_importance:
            religion_importance_var = i
        
        elif 'asian' in i:
            asian = True
        elif 'black' in i:
            black = True
        elif 'hispanic' in i:
            hispanic = True
        elif 'indian' in i:
            indian = True
        elif 'middle eastern' in i:
            middle_eastern = True
        elif 'native american' in i:
            native_american = True
        elif 'pacific isl' in i: # replacing every "and" above also splits "islander", so match the surviving prefix
            pacific_islander = True
        elif 'white' in i:
            white = True
        elif ('other' in i) and ('religion' not in i): #needed to make sure user didn't select "other religion"
            other = True
        elif 'multi-ethnic' in i:
            multi_ethnic = True    
    
    list_to_return = Series([educ_var, religion_var, religion_importance_var,
                             asian, black, hispanic, middle_eastern, native_american,
                            indian, pacific_islander, white, other, multi_ethnic])
    
    return list_to_return

df[['educ','religion','religion_importance','asian','black','hispanic','middle_eastern',
  'native_american','indian','pacific_islander','white','other','multi_ethnic']] = df.apply(get_user_info_2, axis=1)  

## Clean User Info Section 3

The User Info data currently looks something like this for each user:
- "Never smokes, Doesn’t drink, Doesn’t do drugs, Omnivore, Doesn’t have kids, Sagittarius"

This section parses out the following data from each user's info section

- kids

In [56]:
kids = ["doesn’t have kids",               
"doesn’t have kids but might want them",       
"doesn’t have kids but wants them",            
"doesn’t have kids and doesn’t want them",     
"might want kids",                             
"wants kids",                                  
"doesn’t want kids",
"has kid(s)",
"has kid(s) and might want more",
"has kid(s) and doesn’t want more",
"has kid(s) and wants more"]

def get_user_info_3(c):
    
    
    if (type(c['user_info_3']) != str):
        return None
    
    x = c['user_info_3'].split(",")
    
    for i in x:
        i = i.strip()
        i = i.lower()
        
        if i in kids:
            return i
    

df['kids_categories'] = df.apply(get_user_info_3, axis=1)  

In [57]:
df['kids_categories'].value_counts()

doesn’t have kids                          66032
doesn’t have kids but might want them      33830
doesn’t have kids but wants them           32209
doesn’t have kids and doesn’t want them    11519
might want kids                             4747
has kid(s)                                  4505
has kid(s) and might want more              2715
wants kids                                  2320
has kid(s) and doesn’t want more            1765
doesn’t want kids                           1270
has kid(s) and wants more                    982
Name: kids_categories, dtype: int64

## Clean Looking For Section

The Looking For data currently looks something like this for each user:
- "single Women, near me, ages 18?20, for short & long term dating, Casual sex, and New friends."

This section parses out the following data from each user's seeking section:

- looking for men
- looking for women
- looking for single people
- looking for single or non-single people
- lower age limit
- upper age limit
- seeking a long-term relationship
- seeking a short-term relationship
- seeking a casual relationship
- seeking a friendship
- seeking a non-monogamous relationship
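
The age limits are the one numeric piece of this section. As a hedged sketch, the helper below (parse_age_limits, a hypothetical name) accepts either a range such as "ages 18?20" or a single "age 25" token and treats any non-digit characters between the two numbers as the separator. It returns integers, whereas the cell below stores the raw strings.

```python
import re

def parse_age_limits(token):
    """Return (lower, upper) as integers, or None when the token holds no ages."""
    match = re.fullmatch(r"ages?\s+(\d+)(?:\D+(\d+))?", token.strip().lower())
    if match is None:
        return None
    lower = int(match.group(1))
    # a single age means the lower and upper limits coincide
    upper = int(match.group(2)) if match.group(2) else lower
    return lower, upper

print(parse_age_limits("ages 18?20"))
print(parse_age_limits("age 25"))
print(parse_age_limits("near me"))
```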




In [58]:
men = [
'single men',
'men',
'single people',
'people']

women = ['single women',
'women',
'single people',
'people'
]

single = ['single women',
'single men',
'single people']

non_single = [
'women',
'people',
'men']

"""
To find all possible combinations of long, short, hookup, and friend, I manually examined all responses that contained a 
relevant keyword (e.g., long, short, friend, hookup, casual) but weren't contained in my lists corresponding to these 
variables. I then added them to the appropriate list until I captured all possible combinations.
"""

long_type = ["long-term dating","short & long term dating","for short & long term dating and new friends.",
            "for short & long term dating and casual sex.","for non-monogamous short & long term dating and casual sex.",
            "for long-term dating and casual sex.","for non-monogamous long-term dating and casual sex.",
            "for non-monogamous short & long term dating","for short & long term dating.",
             "for long-term dating and new friends.","for short & long term dating","for long-term dating",
            "for non-monogamous short & long term dating and new friends.","for long-term dating.",
            "for non-monogamous short & long term dating.","for non-monogamous long-term dating and new friends.",
            "for non-monogamous long-term dating.","for non-monogamous long-term dating"]

short_type = ["short-term dating","short & long term dating","for short & long term dating and new friends.",
              "for short-term dating and casual sex.","for short & long term dating and casual sex.",
              "for non-monogamous short & long term dating and casual sex.",
              "for non-monogamous short-term dating and casual sex.","for short-term dating","for short & long term dating",
             "for non-monogamous short & long term dating","for short-term dating and new friends.",
             "for short & long term dating.","for non-monogamous short-term dating",
              "for non-monogamous short & long term dating and new friends.","for short & long term dating.",
             "for short-term dating.","for non-monogamous short-term dating and new friends.",
             "for non-monogamous short & long term dating.","for non-monogamous short-term dating."]

hookup = ["hookup","casual","casual sex","for casual sex and new friends.","for short & long term dating and casual sex.",
         "for non-monogamous casual sex and new friends.","for non-monogamous short & long term dating and casual sex.",
         "for casual sex.","for non-monogamous short-term dating and casual sex.","for short-term dating and casual sex.",
         "for long-term dating and casual sex.","for non-monogamous casual sex.","for non-monogamous long-term dating and casual sex."]

friend = ["new friends","and new friends","for casual sex and new friends","for short & long term dating and new friends.",
         "for non-monogamous casual sex and new friends.","and new friends.","for casual sex and new friends.",
         "for long-term dating and new friends.","for short-term dating and new friends.","for new friends.",
         "for non-monogamous short & long term dating and new friends.","for non-monogamous short-term dating and new friends.",
         "for non-monogamous new friends.","for non-monogamous long-term dating and new friends."]



def looking_for(c):
    
    
    men_var = False
    women_var = False
    single_var = False
    non_single_var = False
    
    age_lower_var = 0
    age_upper_var = 0
    
    hookup_var = False
    short_var = False
    long_var = False
    friend_var = False
    
    non_monogamous_var = False

    x = c['looking_for_section'].split(",")
    
    counter = 0
    for i in x:
        i = i.strip()
        i = i.lower()
        
        #gender
        if i in men:
            men_var = True
        if i in women:
            women_var = True
        
        #monogamy
        if i in single:
            single_var = True
        elif i in non_single:
            non_single_var = True
        
        #age limit
        if "ages" in i:
            
            ages = i[5:]
            ages = ages.split("?")
            
            age_lower_var = ages[0]
            age_upper_var = ages[1]
            
        elif "age" in i:
            
            ages = i[4:]
            age_lower_var = ages
            age_upper_var = ages
            
        #relationship type seeking
        if i in long_type:
            long_var = True
        if i in short_type:
            short_var = True
        if i in hookup:
            hookup_var = True
        if i in friend:
            friend_var = True
            
        if "non-monogamous" in i:
            non_monogamous_var = True
        
        """
        if "long" in i and long_var == False:
            print(i)
        """    
    
    list_to_return = Series([men_var,women_var,single_var,non_single_var,age_lower_var,age_upper_var,hookup_var,
                            short_var,long_var,friend_var,non_monogamous_var])
    
    return list_to_return
    

df[["seeking_men_profile","seeking_women_profile","seeking_only_singles","seeking_singles_and_non",
   "age_lower_limit","age_upper_limit","hookup","short_term","long_term","friendship",
   "seeking_non_monogomous_relationship"]] = df.apply(looking_for, axis=1)  

## Remove Unneeded Data

In [59]:
essay_list = ['Favorite books, movies, shows, music, and food',
'I spend a lot of time thinking about',
'I’m really good at',
'My self-summary',
'On a typical Friday night I am',
'Six things I could never do without',
'The first things people usually notice about me',
'The most private thing I’m willing to admit',
'What I’m doing with my life',
'You should message me if',
'About me']

# Drop the original essay columns; only the "_no_emoticons" copies are kept
df = df.drop(columns=essay_list)

In [60]:
df = df.drop(columns=['personality1', 'personality2', 'user_info_1', 'user_info_2',
                      'user_info_3', 'looking_for_section'])

In [61]:
df = df.drop(columns=['personality1_clean'])

## Rename Essay Variables

In [62]:
df.rename(columns={'Favorite books, movies, shows, music, and food_no_emoticons': 'Essay1', 
                   'I spend a lot of time thinking about_no_emoticons': 'Essay2',
                  'I’m really good at_no_emoticons':'Essay3',
                  'My self-summary_no_emoticons':'Essay4',
                  'On a typical Friday night I am_no_emoticons': 'Essay5',
                  'Six things I could never do without_no_emoticons':'Essay6',
                  'The first things people usually notice about me_no_emoticons':'Essay7',
                  'The most private thing I’m willing to admit_no_emoticons':'Essay8',
                  'What I’m doing with my life_no_emoticons':'Essay9',
                  'You should message me if_no_emoticons':'Essay10',
                  'About me_no_emoticons':'Essay11'}, inplace=True)

## Rename Gender Variable

In [63]:
df.rename(columns={"user's gender":'search_based_gender'}, inplace=True)

In [64]:
df.to_csv("Ok_Data_second_stage.csv")