# Analyzing Thanksgiving Data
## Data taken from FiveThirtyEight

In [1]:
# We'll be using pandas for this project
import pandas as pd

# Read in data and view head() to get a feel for it.
data = pd.read_csv('thanksgiving.csv',encoding='Latin-1')

data.head()

Unnamed: 0,RespondentID,Do you celebrate Thanksgiving?,What is typically the main dish at your Thanksgiving dinner?,What is typically the main dish at your Thanksgiving dinner? - Other (please specify),How is the main dish typically cooked?,How is the main dish typically cooked? - Other (please specify),What kind of stuffing/dressing do you typically have?,What kind of stuffing/dressing do you typically have? - Other (please specify),What type of cranberry saucedo you typically have?,What type of cranberry saucedo you typically have? - Other (please specify),...,Have you ever tried to meet up with hometown friends on Thanksgiving night?,"Have you ever attended a ""Friendsgiving?""",Will you shop any Black Friday sales on Thanksgiving Day?,Do you work in retail?,Will you employer make you work on Black Friday?,How would you describe where you live?,Age,What is your gender?,How much total combined money did all members of your HOUSEHOLD earn last year?,US Region
0,4337954960,Yes,Turkey,,Baked,,Bread-based,,,,...,Yes,No,No,No,,Suburban,18 - 29,Male,"$75,000 to $99,999",Middle Atlantic
1,4337951949,Yes,Turkey,,Baked,,Bread-based,,Other (please specify),Homemade cranberry gelatin ring,...,No,No,Yes,No,,Rural,18 - 29,Female,"$50,000 to $74,999",East South Central
2,4337935621,Yes,Turkey,,Roasted,,Rice-based,,Homemade,,...,Yes,Yes,Yes,No,,Suburban,18 - 29,Male,"$0 to $9,999",Mountain
3,4337933040,Yes,Turkey,,Baked,,Bread-based,,Homemade,,...,Yes,No,No,No,,Urban,30 - 44,Male,"$200,000 and up",Pacific
4,4337931983,Yes,Tofurkey,,Baked,,Bread-based,,Canned,,...,Yes,No,No,No,,Urban,30 - 44,Male,"$100,000 to $124,999",Pacific


In [2]:
# Review columns to get a better feel for the data.
data.columns

Index(['RespondentID', 'Do you celebrate Thanksgiving?',
       'What is typically the main dish at your Thanksgiving dinner?',
       'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
       'How is the main dish typically cooked?',
       'How is the main dish typically cooked? - Other (please specify)',
       'What kind of stuffing/dressing do you typically have?',
       'What kind of stuffing/dressing do you typically have? - Other (please specify)',
       'What type of cranberry saucedo you typically have?',
       'What type of cranberry saucedo you typically have? - Other (please specify)',
       'Do you typically have gravy?',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
       'Which of these side dishes aretypically served

In [3]:
# Count how many respondents do and don't celebrate Thanksgiving
data['Do you celebrate Thanksgiving?'].value_counts()

Yes    980
No      78
Name: Do you celebrate Thanksgiving?, dtype: int64

In [4]:
# Filter out all respondents who don't celebrate Thanksgiving
data = data[data['Do you celebrate Thanksgiving?']=='Yes']

In [5]:
# Check for only 'Yes'
data['Do you celebrate Thanksgiving?'].value_counts()

Yes    980
Name: Do you celebrate Thanksgiving?, dtype: int64

***
### Main Dish Counts:

In [6]:
# Check counts of 'What is typically the main dish at your Thanksgiving dinner?'
data['What is typically the main dish at your Thanksgiving dinner?'
    ].value_counts()

Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64

#### Which Tofurkey eaters use gravy?

In [7]:
data[data['What is typically the main dish at your Thanksgiving dinner?'
         ]=='Tofurkey']['Do you typically have gravy?']

4      Yes
33     Yes
69      No
72      No
77     Yes
145    Yes
175    Yes
218     No
243    Yes
275     No
393    Yes
399    Yes
571    Yes
594    Yes
628     No
774     No
820     No
837    Yes
860     No
953    Yes
Name: Do you typically have gravy?, dtype: object
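
The row-by-row listing above is easier to read as a tally. Here is a minimal sketch that rebuilds the 20 answers shown in the output as a standalone Series (the names `answers` and `tofurkey_gravy` are ours, not from the notebook). In the notebook itself, appending `.value_counts()` to the expression in cell 7 should give the same tally.

```python
import pandas as pd

# The 20 gravy answers from Tofurkey eaters, copied in order from the output above.
answers = ['Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No',
           'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'Yes']
tofurkey_gravy = pd.Series(answers, name='Do you typically have gravy?')

# Tally the answers instead of reading them row by row.
gravy_counts = tofurkey_gravy.value_counts()
print(gravy_counts)
```

With the values listed above, 12 of the 20 Tofurkey eaters say they typically have gravy.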

In [8]:
# How many people had none of the surveyed pies served for Thanksgiving?
apple_isnull = data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple'
                   ].isnull()
pumpkin_isnull = data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin'
                   ].isnull()
pecan_isnull = data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan'
                   ].isnull()
all_pies_null = apple_isnull & pumpkin_isnull & pecan_isnull

print(all_pies_null.value_counts())
# False are those who had at least one type of pie served
# True are those who had none of these pies served

False    876
True     104
dtype: int64


In [9]:
# Another way of getting number of "no pies people"
apple = 'Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple'
pumpkin = 'Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin'
pecan = 'Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan'

pies_df = data[(data[apple]=='Apple') | (data[pumpkin]=='Pumpkin') | (data[pecan]=='Pecan')]

pies_people = pies_df.shape[0]

print(pies_people)

no_pies_people = len(data) - pies_people

print(no_pies_people)

876
104
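
Both approaches rely on how the survey encodes checkboxes: an unchecked pie is null, and a checked pie stores the pie's name. The sketch below shows the two masks agreeing on a toy frame. The short column names and the four rows are made up for illustration and are not taken from the survey.

```python
import pandas as pd
import numpy as np

# Toy stand-in for the three pie columns; names and rows are hypothetical.
pies = pd.DataFrame({
    'apple':   ['Apple', np.nan, np.nan, 'Apple'],
    'pumpkin': [np.nan, 'Pumpkin', np.nan, 'Pumpkin'],
    'pecan':   [np.nan, np.nan, np.nan, 'Pecan'],
})

# Approach from cell 8: a respondent has no pie if every pie column is null.
no_pie_mask = pies.isnull().all(axis=1)

# Approach from cell 9: a respondent has pie if any column holds its pie name.
has_pie_mask = ((pies['apple'] == 'Apple')
                | (pies['pumpkin'] == 'Pumpkin')
                | (pies['pecan'] == 'Pecan'))

print(no_pie_mask.sum(), has_pie_mask.sum())
```

Using `isnull().all(axis=1)` on a column subset avoids naming each column separately, which may help if more pie columns are added to the check.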


In [10]:
# Analyze age data

# This function takes the age string (format: "18 - 29") and returns the
# first value as an int. Special cases are "60+", which returns 60, and null,
# which returns None.
def age_str_to_int(age):
    try:
        first_age = age.split()[0]
    except Exception:
        return None
    first_age = first_age.replace('+','')
    first_age = int(first_age)
    return first_age

data['int_age'] = data['Age'].apply(age_str_to_int,convert_dtype=False)

print(data['int_age'].head())
print(data['int_age'].describe())

0    18
1    18
2    18
3    30
4    30
Name: int_age, dtype: object
count     947
unique      4
top        45
freq      269
Name: int_age, dtype: int64
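
Because `int_age` is stored with `object` dtype, `describe()` reports `count`, `unique`, `top` and `freq` rather than a mean. The sketch below shows one possible way to get a numeric column instead, using `pd.to_numeric`. The sample Series is illustrative, and the function is a simplified variant of the one above, not a replacement for it.

```python
import pandas as pd
import numpy as np

def age_str_to_int(age):
    """Return the lower bound of an age bracket such as '18 - 29' or '60+'."""
    if pd.isnull(age):
        return None
    return int(age.split()[0].replace('+', ''))

# Sample values in the survey's format; this small Series is illustrative.
ages = pd.Series(['18 - 29', '30 - 44', '45 - 59', '60+', np.nan])

# to_numeric guarantees a numeric dtype (float64 here, because of the
# missing value), so mean() and a numeric describe() become available.
numeric_age = pd.to_numeric(ages.apply(age_str_to_int))
print(numeric_age.tolist())
print(numeric_age.mean())
```

Note that any mean computed this way is a mean of bracket lower bounds, so it understates the respondents' actual ages.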


In [11]:
# Analyze money data.
# Convert the total money string column into a new integer column.

# This function takes a string (of format: "$0 to $9,999") and returns an integer
# of the first value in the string. Special cases are answers that begin with
# "Prefer" (the respondent declined to answer) and null values, which both
# return None.
def money_str_to_int(money):
    try:
        first_money = money.split()[0]
    except Exception:
        return None
    if first_money == 'Prefer':
        return None
    first_money = first_money.replace('$','').replace(',','')
    first_money = int(first_money)
    return first_money

income_str = 'How much total combined money did all members of your HOUSEHOLD earn last year?'

data['int_income'] = data[income_str].apply(money_str_to_int,convert_dtype=False)

print(data['int_income'].head())
print(data['int_income'].describe())

0     75000
1     50000
2         0
3    200000
4    100000
Name: int_income, dtype: object
count       829
unique       10
top       25000
freq        166
Name: int_income, dtype: int64
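
The same conversion can also be written with pandas string methods instead of a row-by-row function. This is a sketch of an alternative, not the approach used in the notebook. The sample Series is illustrative, and the wording of the declined-to-answer option is assumed.

```python
import pandas as pd
import numpy as np

# Sample answers in the survey's format; this small Series is illustrative.
income = pd.Series(['$0 to $9,999', '$75,000 to $99,999', '$200,000 and up',
                    'Prefer not to answer', np.nan])

# Grab the leading dollar amount, e.g. '75,000'; rows without one become NaN.
lower_bound = income.str.extract(r'^\$([\d,]+)', expand=False)

# Drop the thousands separators and convert to a numeric (float64) column.
numeric_income = pd.to_numeric(lower_bound.str.replace(',', '', regex=False))
print(numeric_income.tolist())
```

The result is a float column with NaN for missing or declined answers, so comparisons such as `numeric_income < 150000` simply evaluate to `False` for those rows.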


***
## Traveling Miles:
Let's investigate travel distance. One hypothesis is that lower-income
households travel less far for Thanksgiving than higher-income ones. We can
test this by comparing the travel answers of households earning less than
$150,000 with those earning $150,000 or more.

In [12]:
less_income_travel_dist = data[data['int_income']<150000]['How far will you travel for Thanksgiving?'
                             ].value_counts()

more_income_travel_dist = data[data['int_income']>=150000]['How far will you travel for Thanksgiving?'
                             ].value_counts()

print(less_income_travel_dist)
print(more_income_travel_dist)


Thanksgiving is happening at my home--I won't travel at all                         281
Thanksgiving is local--it will take place in the town I live in                     203
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    150
Thanksgiving is out of town and far away--I have to drive several hours or fly       55
Name: How far will you travel for Thanksgiving?, dtype: int64
Thanksgiving is happening at my home--I won't travel at all                         66
Thanksgiving is local--it will take place in the town I live in                     34
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    25
Thanksgiving is out of town and far away--I have to drive several hours or fly      15
Name: How far will you travel for Thanksgiving?, dtype: int64


### Comparing:
Raw counts are hard to compare because the two groups differ in size (689 respondents versus 140). Dividing each count by its group total gives shares that can be compared directly, which is done next.

In [13]:
less_income_travel_dist = less_income_travel_dist/less_income_travel_dist.sum()
more_income_travel_dist = more_income_travel_dist/more_income_travel_dist.sum()
print(less_income_travel_dist)
print(more_income_travel_dist)

Thanksgiving is happening at my home--I won't travel at all                         0.407837
Thanksgiving is local--it will take place in the town I live in                     0.294630
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    0.217707
Thanksgiving is out of town and far away--I have to drive several hours or fly      0.079826
Name: How far will you travel for Thanksgiving?, dtype: float64
Thanksgiving is happening at my home--I won't travel at all                         0.471429
Thanksgiving is local--it will take place in the town I live in                     0.242857
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    0.178571
Thanksgiving is out of town and far away--I have to drive several hours or fly      0.107143
Name: How far will you travel for Thanksgiving?, dtype: float64
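
To make the comparison easier to scan, the two sets of shares can be placed side by side. The sketch below rebuilds the counts from the cell 12 output. The short labels and the column names `under_150k` and `at_least_150k` are ours. In the notebook, `value_counts(normalize=True)` would do the division in one step.

```python
import pandas as pd

# Counts copied from the cell 12 output; short labels replace the long answers.
labels = ['home', 'local', 'few hours', 'far away']
less_income = pd.Series([281, 203, 150, 55], index=labels)
more_income = pd.Series([66, 34, 25, 15], index=labels)

# Dividing by the group total turns counts into shares that sum to 1.
shares = pd.DataFrame({
    'under_150k': less_income / less_income.sum(),
    'at_least_150k': more_income / more_income.sum(),
})
# Positive values mean the answer is more common in the higher-income group.
shares['difference'] = shares['at_least_150k'] - shares['under_150k']
print(shares.round(3))
```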


### Findings:
In this sample, higher-income households (at least $150,000) are more likely to host at home (47% versus 41%) and somewhat more likely to travel far (11% versus 8%), while lower-income households are more likely to celebrate locally or within a few hours' drive. One plausible explanation is that higher-income households tend to have larger homes for hosting and can better afford flights. The survey does not test this directly, and with only 140 respondents in the higher-income group, the differences should be read with caution.
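
One rough way to gauge whether these gaps could be sampling noise is a two-proportion z-test. The sketch below uses only the standard library and the counts printed in cell 12. The helper `two_proportion_z` is our own illustration, and the test assumes independent random samples, which a web survey may not satisfy.

```python
import math

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """z statistic for the difference between two sample proportions."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / std_err

# "Far away" answers: 15 of 140 in the higher-income group, 55 of 689 in the other.
z_far = two_proportion_z(15, 140, 55, 689)
# "At my home" answers: 66 of 140 versus 281 of 689.
z_home = two_proportion_z(66, 140, 281, 689)
print(round(z_far, 2), round(z_home, 2))
```

Both statistics come out below 1.96, the usual cutoff for a two-sided test at the 5% level, so under these assumptions the income differences are suggestive rather than conclusive.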

***
## Friendship Investigation:

In [14]:
import numpy as np
indx = 'Have you ever tried to meet up with hometown friends on Thanksgiving night?'
col = 'Have you ever attended a "Friendsgiving?"'


data['float_age']=data['int_age'].astype('float')
data.pivot_table(index=indx, columns=col, values='float_age',aggfunc=np.mean)


Have you ever attended a "Friendsgiving?"                                           No        Yes
Have you ever tried to meet up with hometown friends on Thanksgiving night?
No                                                                           42.283702  37.010526
Yes                                                                          41.47541   33.976744


### Insights:
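Respondents who have attended a Friendsgiving are younger on average (about 34 to 37) than those who have not (about 41 to 42), and the youngest group is those who have also tried to meet up with hometown friends on Thanksgiving night. This suggests that younger people are more likely than older people to spend part of Thanksgiving with friends. Because `int_age` holds only the lower bound of each age bracket, these averages understate actual ages, so the ordering of the groups is more informative than the values themselves.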
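
A mean can be misleading when a group is small, so it may help to report group sizes alongside the averages. The sketch below passes a list of aggregations to `pivot_table`. The toy frame, its shortened column names and its values are made up for illustration.

```python
import pandas as pd

# Toy respondents; column names are shortened and the values are hypothetical.
toy = pd.DataFrame({
    'met_friends':   ['Yes', 'Yes', 'No', 'No', 'No', 'Yes'],
    'friendsgiving': ['Yes', 'No', 'No', 'Yes', 'No', 'Yes'],
    'age':           [18.0, 45.0, 60.0, 30.0, 45.0, 30.0],
})

# Report the mean age and the group size together, so that a mean resting
# on only a handful of respondents is easy to spot.
summary = toy.pivot_table(index='met_friends', columns='friendsgiving',
                          values='age', aggfunc=['mean', 'count'])
print(summary)
```

The result has two column levels, the aggregation name and then the Friendsgiving answer, so a single cell can be read with `summary.loc['Yes', ('mean', 'Yes')]`.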