# Thanksgiving 2015 Survey on FiveThirtyEight  

This analysis follows the [Dataquest pandas tutorial](https://www.dataquest.io/blog/pandas-tutorial-python-2/). The data come from the FiveThirtyEight article [Here’s What Your Part Of America Eats On Thanksgiving](https://fivethirtyeight.com/features/heres-what-your-part-of-america-eats-on-thanksgiving/).

## I. Data Cleaning:

In [1]:
import pandas as pd

# The file is stored in Latin-1 encoding, so the encoding keyword argument must be passed explicitly.

thanksgiving = pd.read_csv("thanksgiving-2015-poll-data.csv", encoding="Latin-1")
thanksgiving.head()

Unnamed: 0,RespondentID,Do you celebrate Thanksgiving?,What is typically the main dish at your Thanksgiving dinner?,What is typically the main dish at your Thanksgiving dinner? - Other (please specify),How is the main dish typically cooked?,How is the main dish typically cooked? - Other (please specify),What kind of stuffing/dressing do you typically have?,What kind of stuffing/dressing do you typically have? - Other (please specify),What type of cranberry saucedo you typically have?,What type of cranberry saucedo you typically have? - Other (please specify),...,Have you ever tried to meet up with hometown friends on Thanksgiving night?,"Have you ever attended a ""Friendsgiving?""",Will you shop any Black Friday sales on Thanksgiving Day?,Do you work in retail?,Will you employer make you work on Black Friday?,How would you describe where you live?,Age,What is your gender?,How much total combined money did all members of your HOUSEHOLD earn last year?,US Region
0,4337954960,Yes,Turkey,,Baked,,Bread-based,,,,...,Yes,No,No,No,,Suburban,18 - 29,Male,"$75,000 to $99,999",Middle Atlantic
1,4337951949,Yes,Turkey,,Baked,,Bread-based,,Other (please specify),Homemade cranberry gelatin ring,...,No,No,Yes,No,,Rural,18 - 29,Female,"$50,000 to $74,999",East South Central
2,4337935621,Yes,Turkey,,Roasted,,Rice-based,,Homemade,,...,Yes,Yes,Yes,No,,Suburban,18 - 29,Male,"$0 to $9,999",Mountain
3,4337933040,Yes,Turkey,,Baked,,Bread-based,,Homemade,,...,Yes,No,No,No,,Urban,30 - 44,Male,"$200,000 and up",Pacific
4,4337931983,Yes,Tofurkey,,Baked,,Bread-based,,Canned,,...,Yes,No,No,No,,Urban,30 - 44,Male,"$100,000 to $124,999",Pacific
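
To illustrate why the `encoding="Latin-1"` argument matters, here is a minimal sketch in plain Python. The sample string is hypothetical and is not taken from the survey file. It only shows that bytes written as Latin-1 can fail to decode as UTF-8, which is the default that `pd.read_csv` would otherwise use.

```python
# Hypothetical sample text with accented characters (not from the survey data).
sample = "Crème brûlée"

# Bytes as a Latin-1 encoded file would store them.
raw = sample.encode("latin-1")

# Decoding those bytes as UTF-8 fails, because 0xE8 followed by an ASCII
# letter is not a valid UTF-8 sequence.
try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Decoding with the matching encoding recovers the original text.
roundtrip = raw.decode("latin-1")
```

Here `decoded_as_utf8` ends up `False`, while `roundtrip` equals the original string.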


In [3]:
thanksgiving.shape

(1058, 65)

In [4]:
# List the unique values in the "Do you celebrate Thanksgiving?" column

thanksgiving["Do you celebrate Thanksgiving?"].unique()

array(['Yes', 'No'], dtype=object)

In [6]:
thanksgiving["What is typically the main dish at your Thanksgiving dinner?"].unique()

array(['Turkey', 'Tofurkey', 'Other (please specify)', nan, 'Ham/Pork',
       'Turducken', 'Roast beef', 'Chicken', "I don't know"], dtype=object)

In [9]:
# View all the column names to see every survey question

thanksgiving.columns

Index(['RespondentID', 'Do you celebrate Thanksgiving?',
       'What is typically the main dish at your Thanksgiving dinner?',
       'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
       'How is the main dish typically cooked?',
       'How is the main dish typically cooked? - Other (please specify)',
       'What kind of stuffing/dressing do you typically have?',
       'What kind of stuffing/dressing do you typically have? - Other (please specify)',
       'What type of cranberry saucedo you typically have?',
       'What type of cranberry saucedo you typically have? - Other (please specify)',
       'Do you typically have gravy?',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
       'Which of these side dishes aretypically served

## I.1. Clean up Gender:

In [10]:
# Check whether the gender column has missing data or responses other than Male/Female

thanksgiving["What is your gender?"].value_counts(dropna=False)

Female    544
Male      481
NaN        33
Name: What is your gender?, dtype: int64

In [11]:
# I want to assign 0 to Male, and 1 to Female

import math

def gender_code(gender_string):
    if isinstance(gender_string, float) and math.isnan(gender_string):
        return gender_string
    return int(gender_string == "Female")

# Apply the custom function above to create the numeric gender column

thanksgiving["gender"] = thanksgiving["What is your gender?"].apply(gender_code)
thanksgiving["gender"].value_counts(dropna=False)

 1.0    544
 0.0    481
NaN      33
Name: gender, dtype: int64
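
As an alternative sketch, the same 0/1 coding can be produced with `Series.map` and a dictionary instead of a custom function. This is not the approach used above. The `responses` Series below is hypothetical sample data standing in for the "What is your gender?" column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample responses, including one missing value.
responses = pd.Series(["Male", "Female", np.nan, "Female"])

# Values that are not keys in the dictionary (including NaN) map to NaN,
# so missing responses stay missing.
gender = responses.map({"Male": 0, "Female": 1})
```

Because of the NaN entry, the result is stored as floats, which matches the `1.0` / `0.0` values in the output above.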

## I.2. Clean up Income:

In [12]:
thanksgiving["How much total combined money did all members of your HOUSEHOLD earn last year?"].value_counts(dropna=False)

$25,000 to $49,999      180
Prefer not to answer    136
$50,000 to $74,999      135
$75,000 to $99,999      133
$100,000 to $124,999    111
$200,000 and up          80
$10,000 to $24,999       68
$0 to $9,999             66
$125,000 to $149,999     49
$150,000 to $174,999     40
NaN                      33
$175,000 to $199,999     27
Name: How much total combined money did all members of your HOUSEHOLD earn last year?, dtype: int64

**Identify the patterns:**  
The values in this column follow four patterns:

1. X to Y
2. NaN
3. X and up
4. Prefer not to answer

I want to convert each pattern to a number:  
X to Y = (X+Y)/2 (the midpoint of the bracket)  
X and up = X  
Prefer not to answer = NaN
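
As a quick check of the "X to Y" rule, here is a small worked example. The helper name `midpoint` is hypothetical and used only for illustration; it handles just the "X to Y" pattern, not the other three.

```python
def midpoint(bracket):
    """Return the average of the two bounds in an 'X to Y' income string."""
    # Strip the dollar signs and thousands separators, then split the bounds.
    low, high = bracket.replace("$", "").replace(",", "").split(" to ")
    return (int(low) + int(high)) / 2

# (25000 + 49999) / 2
mid = midpoint("$25,000 to $49,999")
print(mid)  # 37499.5
```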

The custom function will:

1. Take a string called value as input.
2. Check whether value is "$200,000 and up", and return 200000 if so.
3. Check whether value is "Prefer not to answer", and return NaN if so.
4. Check if value is NaN, and return NaN if so.
5. Clean up value by removing any dollar signs or commas.
6. Split the string to extract the incomes, then average them.

In [13]:
import numpy as np

def clean_income(value):
    if value == "$200,000 and up":
        return 200000
    elif value == "Prefer not to answer":
        return np.nan
    elif isinstance(value, float) and math.isnan(value):
        return np.nan
    value = value.replace(",", "").replace("$", "")
    # In "X to Y", the lower bound comes first
    income_low, income_high = value.split(" to ")
    return (int(income_low) + int(income_high)) / 2

In [16]:
# Apply clean_income to the household income column

thanksgiving["income"] = thanksgiving["How much total combined money did all members of your HOUSEHOLD earn last year?"].apply(clean_income)
thanksgiving["income"].value_counts(dropna=False)

 37499.5     180
NaN          169
 62499.5     135
 87499.5     133
 112499.5    111
 200000.0     80
 17499.5      68
 4999.5       66
 137499.5     49
 162499.5     40
 187499.5     27
Name: income, dtype: int64