
In [1]:
import pandas as pd
data = pd.read_csv('thanksgiving.csv', encoding='Latin-1')
print(data.head(3))

   RespondentID  ...           US Region
0    4337954960  ...     Middle Atlantic
1    4337951949  ...  East South Central
2    4337935621  ...            Mountain

[3 rows x 65 columns]


In [2]:
print(data.columns)

Index(['RespondentID', 'Do you celebrate Thanksgiving?',
       'What is typically the main dish at your Thanksgiving dinner?',
       'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
       'How is the main dish typically cooked?',
       'How is the main dish typically cooked? - Other (please specify)',
       'What kind of stuffing/dressing do you typically have?',
       'What kind of stuffing/dressing do you typically have? - Other (please specify)',
       'What type of cranberry saucedo you typically have?',
       'What type of cranberry saucedo you typically have? - Other (please specify)',
       'Do you typically have gravy?',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
       'Which of these side dishes aretypically served

In [3]:
print(data['Do you celebrate Thanksgiving?'].value_counts())

Yes    980
No      78
Name: Do you celebrate Thanksgiving?, dtype: int64


In [4]:
celebrate_thanksgiving = data['Do you celebrate Thanksgiving?'] == 'Yes'
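# Keep only respondents who celebrate; copy so later column assignments act on an independent frame.
data = data[celebrate_thanksgiving].copy()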

In [5]:
print(data['What is typically the main dish at your Thanksgiving dinner?'].value_counts())

Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64


In [6]:
eats_tofurkey = data['What is typically the main dish at your Thanksgiving dinner?'] == 'Tofurkey'
print(data.loc[eats_tofurkey, 'Do you typically have gravy?'])

4      Yes
33     Yes
69      No
72      No
77     Yes
145    Yes
175    Yes
218     No
243    Yes
275     No
393    Yes
399    Yes
571    Yes
594    Yes
628     No
774     No
820     No
837    Yes
860     No
953    Yes
Name: Do you typically have gravy?, dtype: object


In [7]:
apple_isnull = pd.isnull(data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple'])
pumpkin_isnull = pd.isnull(data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin'])
pecan_isnull = pd.isnull(data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan'])
# True where all three pie columns are empty, i.e. none of apple, pumpkin, or pecan was selected.
no_pies = apple_isnull & pumpkin_isnull & pecan_isnull
print(no_pies.value_counts())

False    876
True     104
dtype: int64


In [8]:
def convert_age_string(age_string):
    # Missing answers stay missing.
    if pd.isnull(age_string):
        return None
    # Keep the lower bound of the bucket: '18 - 29' -> 18, '60+' -> 60.
    new_string = age_string.split(' ')[0].replace('+', '')
    return int(new_string)
data['int_age'] = data['Age'].apply(convert_age_string)
print(data['int_age'].describe())

count    947.000000
mean      40.089757
std       15.352014
min       18.000000
25%       30.000000
50%       45.000000
75%       60.000000
max       60.000000
Name: int_age, dtype: float64


Note that these statistics are based on approximate ages derived from age buckets.

The "Age" field contains four distinct buckets: 18 - 29, 30 - 44, 45 - 59, and 60+. Because my code keeps only the lower bound of each bucket, every respondent is assigned the youngest age in their range, so the aggregate statistics are biased toward younger ages.
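
One way to reduce that bias is to map each bucket to its midpoint instead of its lower bound. The snippet below is a minimal sketch of that idea, not part of the original analysis: the `AGE_MIDPOINTS` table, the `age_to_midpoint` helper, and the sample values are hypothetical, and using 60 for the open-ended 60+ bucket is an arbitrary assumption.

```python
import pandas as pd

# Bucket labels as listed above. The midpoints, and the value 60 for the
# open-ended "60+" bucket, are illustrative assumptions.
AGE_MIDPOINTS = {
    '18 - 29': 23.5,
    '30 - 44': 37.0,
    '45 - 59': 52.0,
    '60+': 60.0,
}

def age_to_midpoint(age_string):
    """Return the assumed midpoint of an age bucket, or None if missing or unknown."""
    if pd.isnull(age_string):
        return None
    return AGE_MIDPOINTS.get(age_string.strip())

# Hypothetical sample: one respondent per bucket plus a missing answer.
sample = pd.Series(['18 - 29', '30 - 44', '45 - 59', '60+', None])
midpoint_age = sample.apply(age_to_midpoint)
print(midpoint_age.describe())
```

Applied to the real data, this would be `data['Age'].apply(age_to_midpoint)`. The resulting mean would still be approximate, but it would no longer sit at the bottom of every bucket.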

In [9]:
def convert_income_string(income_string):
    # Missing answers stay missing.
    if pd.isnull(income_string):
        return None
    # Keep the lower bound of the bracket, stripping '$' and ','.
    new_string = income_string.split(' ')[0].replace('$', '').replace(',', '')
    # Non-numeric responses beginning with 'Prefer' are treated as missing.
    if new_string == 'Prefer':
        return None
    return int(new_string)
data['int_income'] = data['How much total combined money did all members of your HOUSEHOLD earn last year?'].apply(convert_income_string)
print(data['int_income'].describe())

count       829.000000
mean      75965.018094
std       59068.636748
min           0.000000
25%       25000.000000
50%       75000.000000
75%      100000.000000
max      200000.000000
Name: int_income, dtype: float64


Note that these statistics are based on approximate incomes derived from income brackets. The "How much total combined money did all members of your HOUSEHOLD earn last year?" field contains ten distinct income brackets. Because my code keeps only the first (lowest) income in each bracket, every respondent is assigned the bottom of their range, so the aggregate statistics are biased toward smaller incomes.
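
To get a feel for how much the lower-bound choice matters, both ends of each bracket can be parsed and averaged. This is only a sketch: the `income_to_midpoint` helper is hypothetical, and the bracket strings in the sample (such as `'$25,000 to $49,999'` and `'$200,000 and up'`) are assumed formats inferred from the parsing code above, not values confirmed by the output shown here.

```python
import re
import pandas as pd

def income_to_midpoint(income_string):
    """Midpoint of a bracket like '$25,000 to $49,999'.

    Open-ended brackets with a single amount return that amount;
    responses with no dollar amount return None.
    """
    if pd.isnull(income_string):
        return None
    # Extract every dollar amount, dropping thousands separators.
    amounts = [int(m.replace(',', ''))
               for m in re.findall(r'\$([\d,]+)', income_string)]
    if not amounts:
        return None
    return sum(amounts) / len(amounts)

# Hypothetical sample values in the assumed format.
sample = pd.Series(['$0 to $9,999', '$25,000 to $49,999',
                    '$200,000 and up', 'Prefer not to answer', None])
print(sample.apply(income_to_midpoint))
```

The top bracket has no upper bound, so it still contributes only its lower bound; any mean computed this way remains an underestimate for that group.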

In [10]:
# Strict comparison: rows with int_income of exactly 150000 (or missing) are excluded here.
less_than_150k = data['int_income'] < 150000
less_than_150k_rows = data[less_than_150k]
print(less_than_150k_rows['How far will you travel for Thanksgiving?'].value_counts())

Thanksgiving is happening at my home--I won't travel at all                         281
Thanksgiving is local--it will take place in the town I live in                     203
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    150
Thanksgiving is out of town and far away--I have to drive several hours or fly       55
Name: How far will you travel for Thanksgiving?, dtype: int64


In [11]:
# Strict comparison: rows with int_income of exactly 150000 are not counted as above 150k.
greater_than_150k = data['int_income'] > 150000
greater_than_150k_rows = data[greater_than_150k]
print(greater_than_150k_rows['How far will you travel for Thanksgiving?'].value_counts())

Thanksgiving is happening at my home--I won't travel at all                         49
Thanksgiving is local--it will take place in the town I live in                     25
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    16
Thanksgiving is out of town and far away--I have to drive several hours or fly      12
Name: How far will you travel for Thanksgiving?, dtype: int64


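The two groups differ in size (689 respondents below 150,000 versus 102 above), so shares are more informative than raw counts. About 41% of respondents making less than 150,000 a year host Thanksgiving at home, compared with about 48% of those making more. At the other extreme, about 8% of the lower-income group travels far, compared with about 12% of the higher-income group. These differences are modest and do not show that either group clearly travels more. One possible reading of the lower hosting share is that people making less money, who may tend to be younger, more often visit their parents' homes for Thanksgiving instead of hosting, but this data alone cannot confirm that.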
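
Because the two groups differ in size, shares are easier to compare than raw counts. The sketch below recomputes them from the counts printed in the two outputs above; the shortened row labels and the `shares` frame are names introduced here for illustration.

```python
import pandas as pd

# Shortened labels for the four travel answers, in the order printed above.
labels = ['at home', 'local', 'a few hours away', 'far away']

# Counts copied from the two value_counts() outputs above.
under_150k = pd.Series([281, 203, 150, 55], index=labels)
over_150k = pd.Series([49, 25, 16, 12], index=labels)

# Divide each count by its group total to get the share of that group.
shares = pd.DataFrame({
    'under_150k': under_150k / under_150k.sum(),
    'over_150k': over_150k / over_150k.sum(),
})
print(shares.round(3))
```

On the full data, the same shares should come from passing `normalize=True` to `value_counts()`.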

In [12]:
print(data.pivot_table(
    index='Have you ever tried to meet up with hometown friends on Thanksgiving night?',
    columns='Have you ever attended a "Friendsgiving?"',
    values='int_age'))

Have you ever attended a "Friendsgiving?"                  No        Yes
Have you ever tried to meet up with hometown fr...                      
No                                                  42.283702  37.010526
Yes                                                 41.475410  33.976744


In [13]:
print(data.pivot_table(
    index='Have you ever tried to meet up with hometown friends on Thanksgiving night?',
    columns='Have you ever attended a "Friendsgiving?"',
    values='int_income'))

Have you ever attended a "Friendsgiving?"                     No           Yes
Have you ever tried to meet up with hometown fr...                            
No                                                  78914.549654  72894.736842
Yes                                                 78750.000000  66019.736842


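In both tables, the lowest averages fall in the "Yes" column for Friendsgiving. Respondents who have attended a "Friendsgiving" are younger on average (about 34 to 37 versus about 41 to 42) and report lower household income (about 66,000 to 73,000 versus about 79,000) than those who have not. Trying to meet up with hometown friends on Thanksgiving night makes little difference on its own: among respondents who have never attended a Friendsgiving, the averages are nearly identical in both rows. The youngest, lowest-income group is the one that has done both. Since the ages and incomes are lower-bound approximations, the absolute values are understated.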
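
Cell averages like these are easier to trust when the number of respondents behind each one is known. As a rough sketch, `pivot_table` accepts a list for `aggfunc`, so means and counts can be produced together. The `toy` frame, its short column names, and its values are hypothetical stand-ins for the survey columns.

```python
import pandas as pd

# Hypothetical stand-in for the survey data, with shortened column names.
toy = pd.DataFrame({
    'met_hometown_friends': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No'],
    'attended_friendsgiving': ['No', 'Yes', 'No', 'Yes', 'Yes', 'No'],
    'int_age': [45, 30, 45, 18, 30, 60],
})

# A list of aggregation functions yields one block of columns per function.
summary = toy.pivot_table(
    index='met_hometown_friends',
    columns='attended_friendsgiving',
    values='int_age',
    aggfunc=['mean', 'count'],
)
print(summary)
```

A cell backed by only a handful of respondents can swing widely, so small counts are a reason to read the corresponding mean cautiously.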