# Data Cleaning Process

The codebook for this dataset is available at https://data.austintexas.gov/City-Government/Final-Report-of-the-Asian-American-Quality-of-Life/hc5t-p62z

## Import Libraries and Read Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('Dataset_Asian.csv')

## Explore Dataset

In [3]:
df.head()

Unnamed: 0,Survey ID,Age,Gender,Ethnicity,Marital Status,Education Completed,Household Size,No One,Spouse,Children,...,Other Description (Non-city-based Ethnic),Paper (Non-city-based General),TV/Radio (Non-city-based General),Website (Non-city-based General),Social Networks (Non-city-based General),People (Non-city-based General),Other (Non-city-based General),Other Description (Non-city-based General),Preferred Type,Concerns
0,80314,,,Vietnamese,,,,,,,...,,,,,,,,,,
1,60171,60.0,Male,Chinese,Married,13.0,6.0,0,0.0,Living with children,...,,,,,,,,,,
2,1011601,23.0,Female,Chinese,Single,16.0,3.0,0,0.0,0,...,,No,No,No,No,Yes,No,,email,traffic
3,50046,73.0,Female,Chinese,Other,13.0,1.0,Living with no one,0.0,0,...,,,,,,,,,,
4,10494,29.0,Male,Asian Indian,Single,17.0,1.0,Living with no one,0.0,0,...,,,,,,,,,,


In [4]:
df.dtypes

Survey ID                                       int64
Age                                           float64
Gender                                         object
Ethnicity                                      object
Marital Status                                 object
Education Completed                           float64
Household Size                                float64
No One                                         object
Spouse                                         object
Children                                       object
Grand Children                                 object
Parent                                         object
Grandparent                                   float64
Brother/Sister                                 object
Other Relative                                float64
Friends                                        object
Other                                         float64
Other Description                              object
Religion                    

### Calculate Percentage of Missing Values per Feature

In [5]:
# Percentage of missing values in each column, indexed by column name
df_perc = (df.isnull().mean() * 100).to_frame('percentages')

In [6]:
df_over_25 = df_perc[df_perc['percentages'] > 25]
df_over_25

Unnamed: 0,percentages
Other Description,98.581832
Religion Other,98.428517
Other Employment Description,99.386738
Occupation,30.471445
Occupation Other,91.107704
Satisfaction,25.833653
Health Info Discription,97.201993
Housing (Other),98.31353
Status of Ownership (Other),98.65849
Other Transportation Description,99.923342


### Compare Features with Codebook

In [7]:
# Print all column names on one line for comparison with the codebook
print(', '.join(df.columns))

Survey ID, Age, Gender, Ethnicity, Marital Status, Education Completed, Household Size, No One, Spouse, Children, Grand Children, Parent, Grandparent, Brother/Sister, Other Relative, Friends, Other , Other Description, Religion, Religion Other, Full Time Employment, Part Time Employment, Self Employed Full Time, Self Employed Part Time, Student, Homemaker, Disabled, Unemployed, Retired, Other Employement, Other Employment Description, Occupation, Occupation Other, Income, Achieving Ends Meet, US Born, Duration of Residency, Primary Language, English Speaking, English Difficulties, Familiarity with America, Familiarity with Ethnic Origin, Identify Ethnically, Belonging, Discrimination , Present Health, Present Mental Health, Present Oral Health, Hygiene Assistance, Smoking, Drinking, Regular Exercise, Healthy Diet, Hypertension, Heart Disease, Stroke, Diabetes, Cancer, Arthritis, Hepatitis, Kidney Problem, Asthma, COPD, Physical Check-up, Dentist Check-up, Urgentcare, Folkmedicine, Prim

* The features match the codebook, and many of them have already been transformed into dummy variables. The first step is to normalize the feature names: lowercase them and replace spaces, slashes, hyphens, and parentheses with underscores.

In [8]:
# Characters that are replaced with an underscore in column names
discard = ' /-()'
table = str.maketrans({ch: '_' for ch in discard})

# Lowercase each name, then substitute every discarded character
c_names = [column.lower().translate(table) for column in df.columns]
df.columns = c_names
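
Because several different characters all map to `_`, two distinct original names could in principle collapse into the same cleaned name. The following is a minimal sketch of a sanity check for that case. It uses a small hypothetical list of names rather than the real columns, and the helper `normalize` is introduced here only for illustration.

```python
import pandas as pd

def normalize(name, discard=' /-()'):
    """Lowercase a column name and replace each discarded character with '_'."""
    table = str.maketrans({ch: '_' for ch in discard})
    return name.lower().translate(table)

# Hypothetical sample names; 'Other ' carries a trailing space on purpose.
sample = ['Survey ID', 'Brother/Sister', 'Other ', 'Other', 'Housing (Other)']
cleaned = pd.Index([normalize(name) for name in sample])

# Any name listed here would be ambiguous after renaming.
duplicates = cleaned[cleaned.duplicated()].tolist()
print(cleaned.tolist())
print(duplicates)
```

An empty `duplicates` list means attribute-style access such as `df.other_` still refers to exactly one column.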

In [9]:
df.head()

Unnamed: 0,survey_id,age,gender,ethnicity,marital_status,education_completed,household_size,no_one,spouse,children,...,other_description__non_city_based_ethnic_,paper__non_city_based_general_,tv_radio__non_city_based_general_,website__non_city_based_general_,social_networks__non_city_based_general_,people__non_city_based_general_,other__non_city_based_general_,other_description__non_city_based_general_,preferred_type,concerns
0,80314,,,Vietnamese,,,,,,,...,,,,,,,,,,
1,60171,60.0,Male,Chinese,Married,13.0,6.0,0,0.0,Living with children,...,,,,,,,,,,
2,1011601,23.0,Female,Chinese,Single,16.0,3.0,0,0.0,0,...,,No,No,No,No,Yes,No,,email,traffic
3,50046,73.0,Female,Chinese,Other,13.0,1.0,Living with no one,0.0,0,...,,,,,,,,,,
4,10494,29.0,Male,Asian Indian,Single,17.0,1.0,Living with no one,0.0,0,...,,,,,,,,,,


In [10]:
# Print the cleaned column names on one line
print(', '.join(df.columns))

survey_id, age, gender, ethnicity, marital_status, education_completed, household_size, no_one, spouse, children, grand_children, parent, grandparent, brother_sister, other_relative, friends, other_, other_description, religion, religion_other, full_time_employment, part_time_employment, self_employed_full_time, self_employed_part_time, student, homemaker, disabled, unemployed, retired, other_employement, other_employment_description, occupation, occupation_other, income, achieving_ends_meet, us_born, duration_of_residency, primary_language, english_speaking, english_difficulties, familiarity_with_america, familiarity_with_ethnic_origin, identify_ethnically, belonging, discrimination_, present_health, present_mental_health, present_oral_health, hygiene_assistance, smoking, drinking, regular_exercise, healthy_diet, hypertension, heart_disease, stroke, diabetes, cancer, arthritis, hepatitis, kidney_problem, asthma, copd, physical_check_up, dentist_check_up, urgentcare, folkmedicine, prim

### Check the Open-Ended Questions

In [11]:
df.other_.unique()

array([nan,  0.])

In [12]:
df.other_description.unique()

array([nan, 'classmate', 'Dog', 'Girlfriend', 'roommate', 'boyfriend',
       'Boyfriend', 'Domestic partner', 'Niece', 'Significant Other',
       'Fiance', 'Parents sometimes', 'fiance', 'Grandchild on & off',
       'In-laws', 'Son In Law', 'Cousins', 'Coworker', 'Brother-in-law',
       'guidian', 'Grand Uncle', 'Partner', 'Mother-in-law', 'Other nuns',
       'brother-in-law', 'Housemates (renting room in house)',
       'girlfriend'], dtype=object)
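
Several of these free-text answers differ only in capitalization ('boyfriend' and 'Boyfriend', 'Fiance' and 'fiance'). If the column were kept, one possible cleanup is to strip whitespace and lowercase the text before counting categories. This is a rough sketch on a hypothetical sample of the values above, not a step taken from the notebook itself.

```python
import pandas as pd

# Hypothetical sample drawn from the values listed above.
responses = pd.Series([None, 'boyfriend', 'Boyfriend', 'Fiance', 'fiance',
                       'Girlfriend', 'girlfriend', 'Brother-in-law', 'brother-in-law'])

# Strip surrounding whitespace and lowercase; missing values stay missing.
normalized = responses.str.strip().str.lower()

# Compare the number of distinct answers before and after normalization.
print(responses.nunique(), normalized.nunique())
```

This only merges case and whitespace variants. Spelling variants such as 'guidian' would still need manual mapping.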

In [13]:
df.religion_other.unique()

array([nan, 'daoist', 'and 6', 'Agnostic', 'not sure yet', 'Jain',
       'Jainism', 'Christian Orthodox', 'Orthodox Christian', 'Christian',
       'INC', 'Caodaism', 'Cao Dai', 'Catholic and Buddhist', 'Lutheran',
       'Catholic, and Buddhism', 'Ancestors Veneration',
       'worship ancestors', 'christian', 'Born again Christian', 'SDA',
       'Pentecostal Christian', 'Bahai', 'Jewish', 'Shinto'], dtype=object)

In [14]:
df.other_employement.unique()

array([nan,  0.])

In [15]:
df.other_employment_description.unique()

array([nan, 'Other', 'other', 'parttime TA', 'visiting scholar',
       'Part time Registered Nurse', 'volunteer', 'In-between jobs',
       'Looking for a job', 'TA', 'temporary retirement', 'Volunteer',
       'Seasonal employee', 'Contractor', 'Pastor'], dtype=object)

In [16]:
df.occupation_other.unique()

array([nan, 'IT', 'Surgical tech', 'Software Engineer',
       'Graduate Research Assistant', 'other', 'Teacher', 'Cashier',
       'engineering', 'Research Assistant',
       'Research Assistant/ Teaching Assistant', 'TA', 'works at campus',
       'resident assistant', 'Advertising', 'research assistant',
       'Pastor', 'Engineer', 'Dental assistant', 'translator',
       'University', 'teaching assistant', 'Teaching Assistant',
       'Accountant', 'computer', 'Data Analyst', 'Computer', 'media',
       'Other', 'software engineer', 'Farmer', 'Public Safety',
       'scientific research', 'Children Minister at Church',
       'Software architect', 'Government Job', 'Chef',
       'IT software engineer', 'Pharmacy Tech', 'Data analyst',
       'Coach Gymnastics', 'Food service', 'light and salt service',
       'Consultant', '1&4', 'Electrical Engineer', 'Real estate',
       'Information technology software Engineer', 'Idk',
       'Research Scientist', 'IT Professional', 'Informa

In [17]:
df.other.unique()

array([nan,  0.,  1.])

In [18]:
df.health_info_discription.unique()

array([nan, 'TV Health program', 'courses', 'magazine/google', 'Yelp',
       'at school', 'TV, Book', 'Light and Salt Association and AARC',
       'University/School', 'HED329K / HED370',
       'Taiwanese heath info TV program', 'none', 'CS439', 'Related news',
       'church', 'insurance agent', 'myself', 'work', 'Google',
       'magazine', 'Yelp, Health web',
       'self, worked in the healthcare business',
       'higher research education. I am at nurse practitioner',
       'Internet, google', 'reading', 'workplace', 'Self', 'Internet',
       'Office clinic', 'CDC, NIH-NLM', 'Neighbors', 'self',
       'Dr. Oz show', 'zoc doc', 'through employment',
       'Health & News XXX', 'word of mouth', 'Family Doctor',
       'Co-workers', 'Work', 'mass media TV, newspaper', 'School',
       'medical journals', 'books and radio', 'Korean Newspaper',
       'google/personal research', 'newspaper, media', 'Army Base',
       'Doctors in my country of origin', 'Television, Newspapers',


In [19]:
df.housing_.unique()

array([nan, 'Apartment/ Townhouse/ Condominium', 'One-family house', '5',
       'Mobile house', 'Two-family house/ duplex', '6'], dtype=object)

In [20]:
df.housing__other_.unique()

array([nan, "children's home", 'dorm at UT', 'Dorm', 'Apartment', 'dorm',
       'apartment', 'dorms', 'residence hall', ']', 'Dormitory',
       "Friend's House", 'Coop', 'Room', 'Mansion', "Don't know",
       'Share a garage room', '4', 'duplex', 'Sharing', 'Up & down',
       'trailer'], dtype=object)

In [21]:
df.status_of_ownership__other_.unique()

array([nan, "Daughter's House", "Child's residence", 'Parents` house',
       "In-laws' home", 'Offspring`s House', "Friend's House",
       "Child's home", 'Just live in it', "Son's house",
       "residing in son's home", "Child's House", 'Live with family',
       'Relative Housing', 'Public Housing', '3',
       'Relatives owned the residence', '2',
       "Brother's house. They live in San Jose", 'share', "Son's home",
       'Sharing', 'rent w/ another family', 'Family owned',
       'live with aunt', 'Live at home??', '??', 'Living w/ son-in-law',
       'Parsonage', 'Paying Guest', 'Family', 'rent a room', 'apartment',
       'dorm? Financial aid'], dtype=object)

In [22]:
df.other_transportation.unique()

array([nan,  0.,  1.])

In [23]:
df.other_transportation_description.unique()

array([nan, 'Skateboarding', 'bus'], dtype=object)

In [24]:
df.preferred_type.unique()

array([nan, 'email', 'TV, News', 'Wechat, chinese newspaper', 'internet',
       'Social networking', 'Internet', 'social network',
       'Chinese website', 'Chinese', 'facebook', 'Phone', 'Email',
       'TV, Website', 'FB, TV', 'chinese organizations', 'phone call',
       'flyers, social media, teach-in activities',
       'internet, mobile apps, internet', 'people', 'Website',
       'government website', 'social media', 'website',
       'internet, newspaper', 'internet and social network',
       'flyers/handbooks', 'intenet', 'news', 'text message',
       'Wechat, LINE', 'Internet, email', 'message, internet',
       'TV/ Website', 'no', 'internet and newspaper', 'mobile phone',
       'information brought home by children', 'Social Media/ Newsletter',
       'email, web, TV', 'Wechat (a messeging software...)',
       'internet, TV', 'Social Media', 'Newspaper, internet',
       'facebook, Pinterest', 'weibo, wechat', 'TV',
       'newspaper, internet', 'SNS', 'personnel', 't

In [25]:
df.concerns.unique()

array([nan, 'traffic', 'Traffic, Trees',
       'Please improve public transportation, bus services, and provide senior housing.',
       'Traffic jam', 'none', 'Traffic problems, overpopulation',
       'more benefit for senior', 'speedup services like building permit',
       'tax too high', 'Public Safety, Water Supply', 'Traffic', 'no',
       'More Public Transportation & direct flight',
       'traffic is bad during rushhour',
       'Public Transportation, traffic problem', 'improve traffic system',
       'Street lights are always off/ broken', 'Good Job!',
       'more public transportation option, more metropolitan parks',
       'improve public transportation system',
       'inconvenient public trans, waiting time is too long', 'Water',
       'Solid waste disposal service is way too expensive; should have been based on need, i.e. I do not have much trash as others.  I should pay less.  The same service is charged by different item fees.  Re',
       'improve highway infras

### Drop Features with over 25% Missing Values
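
One way this step could look is sketched below on a small hypothetical frame (`toy`) standing in for `df`. The threshold of 25 comes from the heading; the variable names and toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df.
toy = pd.DataFrame({
    'age': [60.0, 23.0, 73.0, 29.0],
    'occupation_other': [np.nan, np.nan, 'IT', np.nan],  # 75% missing
    'gender': ['Male', 'Female', 'Female', np.nan],      # exactly 25% missing
})

threshold = 25  # percent of missing values above which a column is dropped

# Share of missing values per column, in percent.
missing_pct = toy.isnull().mean() * 100

# Strictly greater than the threshold, matching the earlier filter on df_perc.
to_drop = missing_pct[missing_pct > threshold].index
toy_clean = toy.drop(columns=to_drop)
print(list(toy_clean.columns))
```

Applied to the real data, the same rule would also remove `occupation` (about 30.5% missing) and `satisfaction` (about 25.8% missing), which are not open-ended questions, so it may be worth reviewing the list in `to_drop` before dropping.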

# Encode Variables