# Analysis of Thanksgiving dinner in the US
This is a Dataquest project that examines FiveThirtyEight survey data on Thanksgiving dinner in the US. The survey asked 1058 people a variety of questions about the food they serve, the distance they travel, and their demographics.

In [1]:
import pandas as pd
data = pd.read_csv('thanksgiving.csv', encoding = 'Latin-1')
data.head()

Unnamed: 0,RespondentID,Do you celebrate Thanksgiving?,What is typically the main dish at your Thanksgiving dinner?,What is typically the main dish at your Thanksgiving dinner? - Other (please specify),How is the main dish typically cooked?,How is the main dish typically cooked? - Other (please specify),What kind of stuffing/dressing do you typically have?,What kind of stuffing/dressing do you typically have? - Other (please specify),What type of cranberry saucedo you typically have?,What type of cranberry saucedo you typically have? - Other (please specify),...,Have you ever tried to meet up with hometown friends on Thanksgiving night?,"Have you ever attended a ""Friendsgiving?""",Will you shop any Black Friday sales on Thanksgiving Day?,Do you work in retail?,Will you employer make you work on Black Friday?,How would you describe where you live?,Age,What is your gender?,How much total combined money did all members of your HOUSEHOLD earn last year?,US Region
0,4337954960,Yes,Turkey,,Baked,,Bread-based,,,,...,Yes,No,No,No,,Suburban,18 - 29,Male,"$75,000 to $99,999",Middle Atlantic
1,4337951949,Yes,Turkey,,Baked,,Bread-based,,Other (please specify),Homemade cranberry gelatin ring,...,No,No,Yes,No,,Rural,18 - 29,Female,"$50,000 to $74,999",East South Central
2,4337935621,Yes,Turkey,,Roasted,,Rice-based,,Homemade,,...,Yes,Yes,Yes,No,,Suburban,18 - 29,Male,"$0 to $9,999",Mountain
3,4337933040,Yes,Turkey,,Baked,,Bread-based,,Homemade,,...,Yes,No,No,No,,Urban,30 - 44,Male,"$200,000 and up",Pacific
4,4337931983,Yes,Tofurkey,,Baked,,Bread-based,,Canned,,...,Yes,No,No,No,,Urban,30 - 44,Male,"$100,000 to $124,999",Pacific


In [2]:
print(data.columns)

Index(['RespondentID', 'Do you celebrate Thanksgiving?',
       'What is typically the main dish at your Thanksgiving dinner?',
       'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
       'How is the main dish typically cooked?',
       'How is the main dish typically cooked? - Other (please specify)',
       'What kind of stuffing/dressing do you typically have?',
       'What kind of stuffing/dressing do you typically have? - Other (please specify)',
       'What type of cranberry saucedo you typically have?',
       'What type of cranberry saucedo you typically have? - Other (please specify)',
       'Do you typically have gravy?',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
       'Which of these side dishes aretypically served

## Check how many people celebrate Thanksgiving

In [3]:
# Count Yes/No answers (stored, not displayed) and print the total number of respondents
celebrate_count = data["Do you celebrate Thanksgiving?"].value_counts()
print(len(data))

1058


## Remove people who don't celebrate Thanksgiving from the dataset

In [4]:
thanksgiving = data[data['Do you celebrate Thanksgiving?'] == 'Yes']
print(thanksgiving["Do you celebrate Thanksgiving?"].value_counts())
print(len(thanksgiving))

Yes    980
Name: Do you celebrate Thanksgiving?, dtype: int64
980
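
One detail worth noting about this filtering step: a boolean mask returns a subset of `data`, and pandas may warn (`SettingWithCopyWarning`) when new columns are later assigned to such a subset. A minimal sketch of one common way to avoid that is to take an explicit copy. The `toy` frame and the `flag` column below are made up for illustration and are not part of the survey.

```python
import pandas as pd

# Made-up stand-in for the survey data (illustrative only)
toy = pd.DataFrame({
    "Do you celebrate Thanksgiving?": ["Yes", "No", "Yes", "Yes"],
    "Age": ["18 - 29", "30 - 44", "60+", None],
})

# .copy() makes the filtered frame independent of the original,
# so adding columns to it later does not affect (or warn about) `toy`
celebrants = toy[toy["Do you celebrate Thanksgiving?"] == "Yes"].copy()
celebrants["flag"] = True

print(len(toy), len(celebrants))
print("flag" in toy.columns)
```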


## Look at the counts of main dish served

In [5]:
print(thanksgiving['What is typically the main dish at your Thanksgiving dinner?'].value_counts())

Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64


## Is gravy served or not when tofurkey is the main dish?
Tofurkey is an unusual main course. Do these people serve gravy with their meal?

In [6]:
tofurkey = thanksgiving[thanksgiving['What is typically the main dish at your Thanksgiving dinner?'] == 'Tofurkey']
print(tofurkey['Do you typically have gravy?'])

4      Yes
33     Yes
69      No
72      No
77     Yes
145    Yes
175    Yes
218     No
243    Yes
275     No
393    Yes
399    Yes
571    Yes
594    Yes
628     No
774     No
820     No
837    Yes
860     No
953    Yes
Name: Do you typically have gravy?, dtype: object
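
Reading 20 individual answers is harder than reading a tally. As a small sketch, the answers printed above can be summarized with `value_counts()`. The list below is copied by hand from that output rather than read from the CSV.

```python
import pandas as pd

# Gravy answers for the 20 Tofurkey respondents, copied from the output above
gravy = pd.Series(
    ["Yes", "Yes", "No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No",
     "Yes", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes"],
    name="Do you typically have gravy?",
)

gravy_counts = gravy.value_counts()                      # absolute counts
gravy_share = gravy.value_counts(normalize=True) * 100   # percentages

print(gravy_counts)
print(gravy_share)
```

In the notebook itself, the equivalent call would be `tofurkey['Do you typically have gravy?'].value_counts()`.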


## What types of pie do people eat?
The endless battle between apple, pumpkin, and pecan pie. Which is the best?\**

(\**Sweet potato is actually the best, but this study didn't include that as an option)

In [7]:
print(thanksgiving['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple'].head())

0    Apple
1    Apple
2    Apple
3      NaN
4    Apple
Name: Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple, dtype: object


In [8]:
# Pie data is in separate columns (Apple, Pumpkin, Pecan)
# Entries are either the pie type or empty (NaN)
# Generate a boolean series to see null values for each column
apple_is_null = thanksgiving['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple'].isnull()
pumpkin_is_null = thanksgiving['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin'].isnull()
pecan_is_null = thanksgiving['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan'].isnull()

In [9]:
# Combine the masks: True means all three pie columns are empty,
# i.e. the respondent did not report apple, pumpkin, or pecan pie.
# Despite its name, ate_pies is therefore True for people who did NOT eat these pies.
ate_pies = apple_is_null & pumpkin_is_null & pecan_is_null
print(len(ate_pies))

980


## How many people ate pies vs didn't eat pies?
Unknown: what is wrong with the people who are turning down pie?

In [10]:
# Look at totals of how many people ate pies vs didn't (False = ate)
ate_pies.value_counts()

False    876
True     104
dtype: int64
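
The three null masks can also be built in one step. The sketch below uses a toy frame with shortened, hypothetical column names (the real survey columns are much longer) and flips the mask so that `True` means pie was eaten.

```python
import pandas as pd

# Toy frame with shortened, hypothetical column names
pies = pd.DataFrame({
    "Apple":   ["Apple", None, None, "Apple"],
    "Pumpkin": ["Pumpkin", "Pumpkin", None, None],
    "Pecan":   [None, None, None, "Pecan"],
})

# True where all three pie columns are empty, i.e. none of these pies was served
no_pie = pies[["Apple", "Pumpkin", "Pecan"]].isnull().all(axis=1)

# Invert so that True means at least one of the three pies was served
ate_pie = ~no_pie

print(ate_pie.value_counts())
```

Naming the mask after what `True` means avoids having to remember that `False` stands for "ate pie".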

## What is the average age of people who completed the survey?

In [11]:
# Look at the age column
thanksgiving['Age'].head(15)

0     18 - 29
1     18 - 29
2     18 - 29
3     30 - 44
4     30 - 44
5     18 - 29
6     18 - 29
7     18 - 29
8     30 - 44
9     30 - 44
11    30 - 44
12    18 - 29
13    18 - 29
14        60+
15    30 - 44
Name: Age, dtype: object

In [12]:
# Need to convert this format to numeric values
# First attempt: split each at the space, strip the + sign on 60+,
# and keep the lower bound of the range (18 - 29 becomes 18)
def age_convert(age_range):
    if pd.isnull(age_range):
        return None
    lower = age_range.split(' ')[0]
    return int(lower.replace('+', ''))

### Better age formula
Set age to the integer midpoint of the age range (18 - 29 becomes 23) rather than its lower bound, which should be more representative. The open-ended `60+` bracket has no upper bound, so it is recorded as 60.

In [13]:
# Better formula: set age as middle of range (18 - 29 becomes 23)
def age_convert(age_range):
    # Missing answers stay missing
    if pd.isnull(age_range):
        return None
    # '18 - 29' -> ['18', '-', '29']; '60+' -> ['60+']
    parts = age_range.split(' ')
    low = int(parts[0].replace('+', ''))
    # '60+' has no upper bound, so use 60 for both ends
    high = int(parts[2]) if len(parts) == 3 else 60
    return (low + high) // 2
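
Because the survey offers only a handful of age brackets, the same midpoint conversion can also be written as a lookup table. This is a sketch under an assumption: the `45 - 59` label does not appear in the rows printed earlier and is assumed here, and `AGE_MIDPOINTS` and `sample_ages` are illustrative names.

```python
import pandas as pd

# Bracket label -> midpoint; '60+' has no upper bound, so it is pinned at 60.
# The '45 - 59' label is an assumption about the survey's remaining bracket.
AGE_MIDPOINTS = {"18 - 29": 23, "30 - 44": 37, "45 - 59": 52, "60+": 60}

sample_ages = pd.Series(["18 - 29", "30 - 44", "45 - 59", "60+", None])

# map() looks up each label; missing or unknown labels become NaN
sample_int_age = sample_ages.map(AGE_MIDPOINTS)

print(sample_int_age)
```

A lookup table makes the chosen value for each bracket explicit, at the cost of silently producing NaN for any label that is not listed.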

In [14]:
int_age = thanksgiving['Age'].apply(age_convert, convert_dtype = False)
int_age.head()

0    23
1    23
2    23
3    37
4    37
Name: Age, dtype: object

In [15]:
# Add integer age column to dataframe
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when a column is set directly on a filtered slice of `data`
thanksgiving = thanksgiving.assign(int_age=thanksgiving['Age'].apply(age_convert))


In [16]:
thanksgiving['int_age'].describe()

count    947.000000
mean      44.791975
std       13.630972
min       23.000000
25%       37.000000
50%       52.000000
75%       60.000000
max       60.000000
Name: int_age, dtype: float64

## Notes on age:
The ages were converted from survey ranges (18 - 29, 30 - 44, etc.) and don't reflect the actual age of each person. Each person's age is set to the midpoint of their range (e.g. everyone who answered 18 - 29 is listed as age 23), and everyone who answered 60+ is listed as 60, so the oldest group is understated.

## What is the average income of people who completed the survey?

In [17]:
thanksgiving['How much total combined money did all members of your HOUSEHOLD earn last year?'].head(10)

0      $75,000 to $99,999
1      $50,000 to $74,999
2            $0 to $9,999
3         $200,000 and up
4    $100,000 to $124,999
5            $0 to $9,999
6      $25,000 to $49,999
7    Prefer not to answer
8      $75,000 to $99,999
9      $25,000 to $49,999
Name: How much total combined money did all members of your HOUSEHOLD earn last year?, dtype: object

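The income responses are structured much like the age ranges, so we'll split them and keep the lower bound of each bracket as an integer (for example, `$75,000 to $99,999` becomes 75000). "Prefer not to answer" is treated as missing.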

In [18]:
def convert_income(income):
    # Missing or declined answers have no numeric value
    if pd.isnull(income) or income == 'Prefer not to answer':
        return None
    # Keep the lower bound of the bracket: '$75,000 to $99,999' -> 75000
    lower = income.split(' ')[0]
    return int(lower.replace('$', '').replace(',', ''))

# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when a column is set directly on a filtered slice of `data`
thanksgiving = thanksgiving.assign(
    int_income=thanksgiving['How much total combined money did all members of your HOUSEHOLD earn last year?'].apply(convert_income)
)
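
As an alternative to a row-by-row function, the lower bound of each bracket can be pulled out with pandas string methods. This is a sketch on a few sample answers taken from the output above; `sample_income` and `sample_int_income` are illustrative names.

```python
import pandas as pd

sample_income = pd.Series([
    "$75,000 to $99,999",
    "$0 to $9,999",
    "$200,000 and up",
    "Prefer not to answer",
    None,
])

# Capture the digits and commas that follow the leading dollar sign;
# answers that don't start with '$' (or are missing) become NaN
lower_bound = sample_income.str.extract(r"^\$([\d,]+)", expand=False)

# Drop the thousands separators and convert to numbers
sample_int_income = pd.to_numeric(lower_bound.str.replace(",", "", regex=False))

print(sample_int_income)
```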


In [19]:
thanksgiving['int_income'].describe()

count       829.000000
mean      75965.018094
std       59068.636748
min           0.000000
25%       25000.000000
50%       75000.000000
75%      100000.000000
max      200000.000000
Name: int_income, dtype: float64

In [20]:
thanksgiving['int_income'].value_counts()

25000.0     166
75000.0     127
50000.0     127
100000.0    109
200000.0     76
10000.0      60
0.0          52
125000.0     48
150000.0     38
175000.0     26
Name: int_income, dtype: int64

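## How does income relate to distance traveled for Thanksgiving?
Look to see if there's a relationship between income level and distance traveled by comparing people earning under 100k with people earning over 100k. Because `int_income` stores the lower bound of each bracket, the `< 100000` and `> 100000` filters used here leave the \$100,000 to \$124,999 bracket out of both groups, so "over 100k" effectively means \$125,000 and up.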

In [44]:
under_100k = thanksgiving[thanksgiving['int_income'] < 100000]

In [47]:
print('Percentage of responses')
print(under_100k['How far will you travel for Thanksgiving?'].value_counts(normalize = True)*100)

Percentage of responses
Thanksgiving is happening at my home--I won't travel at all                         38.533835
Thanksgiving is local--it will take place in the town I live in                     30.639098
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    23.684211
Thanksgiving is out of town and far away--I have to drive several hours or fly       7.142857
Name: How far will you travel for Thanksgiving?, dtype: float64


In [48]:
print('Percentage of responses')
over_100k = thanksgiving[thanksgiving['int_income'] > 100000]
print(over_100k['How far will you travel for Thanksgiving?'].value_counts(normalize = True)*100)

Percentage of responses
Thanksgiving is happening at my home--I won't travel at all                         48.936170
Thanksgiving is local--it will take place in the town I live in                     22.872340
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    16.489362
Thanksgiving is out of town and far away--I have to drive several hours or fly      11.702128
Name: How far will you travel for Thanksgiving?, dtype: float64
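
The two percentage tables can also be produced side by side with `pd.crosstab`. The sketch below uses made-up rows and shortened travel labels (`home`, `local`, `far`) instead of the full survey answers. It also uses `>=` so that a bracket whose lower bound is exactly 100000 lands in the upper group, which is a different split from the one used above.

```python
import numpy as np
import pandas as pd

# Made-up rows; 'travel' uses shortened labels rather than the full survey answers
toy = pd.DataFrame({
    "int_income": [25000, 50000, 75000, 100000, 150000, 200000, np.nan],
    "travel": ["home", "local", "home", "far", "home", "home", "local"],
})

# Drop rows without an income, then label each remaining row by income group
known = toy.dropna(subset=["int_income"])
group = pd.Series(
    np.where(known["int_income"] >= 100000, "100k and over", "under 100k"),
    index=known.index,
    name="income group",
)

# normalize='index' converts each row to shares that sum to 1
shares = pd.crosstab(group, known["travel"], normalize="index") * 100

print(shares.round(1))
```

Each row of `shares` sums to 100, so the two income groups can be compared directly even when they differ in size.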


## Findings:
It looks like people with higher incomes stay home for Thanksgiving more often than people with lower incomes (about 49% vs. 39%). One possible explanation is that people with higher incomes tend to have larger homes that can accommodate more guests, although the survey data can't confirm this.

## Is there a correlation between age and attending a "Friendsgiving"?
"Friendsgiving" is when you celebrate Thanksgiving with friends instead of family, and it is believed to be more popular with younger people. Let's see if the data supports that assumption.

In [32]:
# Make pivot table from dataset
thanksgiving.pivot_table(values = 'int_age',
                         index = 'Have you ever tried to meet up with hometown friends on Thanksgiving night?',
                         columns = 'Have you ever attended a "Friendsgiving?"')

| Have you ever tried to meet up with hometown friends on Thanksgiving night? | Have you ever attended a "Friendsgiving?": No | Have you ever attended a "Friendsgiving?": Yes |
|---|---|---|
| No | 46.645875 | 41.926316 |
| Yes | 46.289617 | 39.424419 |
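
By default `pivot_table` reports the mean of `values` for each combination of the index and column answers. The sketch below, on made-up data with shortened, hypothetical column names, shows that an explicit `groupby` followed by `mean()` and `unstack()` gives the same numbers.

```python
import pandas as pd

# Made-up data with shortened, hypothetical column names
toy = pd.DataFrame({
    "met_friends":   ["No", "No", "Yes", "Yes", "Yes", "No"],
    "friendsgiving": ["No", "Yes", "No", "Yes", "Yes", "No"],
    "int_age":       [52, 37, 60, 23, 37, 60],
})

# pivot_table averages 'values' within each index/columns combination by default
pivot = toy.pivot_table(values="int_age", index="met_friends", columns="friendsgiving")

# The same numbers via an explicit groupby + mean, reshaped with unstack
grouped = toy.groupby(["met_friends", "friendsgiving"])["int_age"].mean().unstack()

print(pivot)
print(grouped)
```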


## Look to see if there's a correlation between income and attending a "friendsgiving"

In [33]:
thanksgiving.pivot_table(values = 'int_income',
                        index = 'Have you ever tried to meet up with hometown friends on Thanksgiving night?',
                        columns = 'Have you ever attended a "Friendsgiving?"')

| Have you ever tried to meet up with hometown friends on Thanksgiving night? | Have you ever attended a "Friendsgiving?": No | Have you ever attended a "Friendsgiving?": Yes |
|---|---|---|
| No | 78914.549654 | 72894.736842 |
| Yes | 78750.0 | 66019.736842 |


## Findings:
People who have attended a "Friendsgiving" have an average age of roughly 39 to 42, while those who haven't average about 46, so it does appear to be more common among younger people. Among people who haven't attended one, there is essentially no age difference between those who have and haven't tried to meet up with friends on Thanksgiving night (46.3 vs. 46.6). Keep in mind that these ages come from survey ranges rather than exact values, so the averages are approximate.

Also, the average income of people who have attended a "Friendsgiving" is lower than that of people who have not (roughly \$66k to \$73k vs. about \$79k). This discrepancy could be due to age differences, or it may suggest that those with lower incomes are less likely to travel to see family and instead celebrate with local friends.