# **Guided Project: Analyzing Thanksgiving Dinner.**



In this project we will work with a dataset on Thanksgiving dinners in the US compiled by FiveThirtyEight. Here are the first five rows of the dataset:

In [133]:
import pandas as pd
data=pd.read_csv("thanksgiving.csv",encoding="Latin-1")
print(data.head(5))

   RespondentID Do you celebrate Thanksgiving?  \
0    4337954960                            Yes   
1    4337951949                            Yes   
2    4337935621                            Yes   
3    4337933040                            Yes   
4    4337931983                            Yes   

  What is typically the main dish at your Thanksgiving dinner?  \
0                                             Turkey             
1                                             Turkey             
2                                             Turkey             
3                                             Turkey             
4                                           Tofurkey             

  What is typically the main dish at your Thanksgiving dinner? - Other (please specify)  \
0                                                NaN                                      
1                                                NaN                                      
2                            

The participants were asked about their personal situation, whether or not they celebrate Thanksgiving, what kinds of dishes and desserts they serve, how the food is cooked and served, who they invite, where the dinner is hosted, and so on.

The data consists of the following columns:

In [134]:
data.columns.tolist()

['RespondentID',
 'Do you celebrate Thanksgiving?',
 'What is typically the main dish at your Thanksgiving dinner?',
 'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
 'How is the main dish typically cooked?',
 'How is the main dish typically cooked? - Other (please specify)',
 'What kind of stuffing/dressing do you typically have?',
 'What kind of stuffing/dressing do you typically have? - Other (please specify)',
 'What type of cranberry saucedo you typically have?',
 'What type of cranberry saucedo you typically have? - Other (please specify)',
 'Do you typically have gravy?',
 'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
 'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
 'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Cauliflower',
 

First, we want to understand **what people ate for Thanksgiving** so we will filter out responses from people who don't celebrate it. This information is contained in the column: *Do you celebrate Thanksgiving*?

## Do you celebrate Thanksgiving?

In [135]:
total=len(data["Do you celebrate Thanksgiving?"].dropna())
data["Do you celebrate Thanksgiving?"].value_counts().apply(lambda x: str(round((x/total)*100,1))+' %')

Yes    92.6 %
No      7.4 %
Name: Do you celebrate Thanksgiving?, dtype: object
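The same percentages can also be obtained without computing the total by hand. The following is a minimal sketch on a made-up toy Series (the values are not from the survey), assuming only that missing answers should be excluded, as in the cell above:

```python
import pandas as pd

# Toy stand-in for the "Do you celebrate Thanksgiving?" column (values are made up).
answers = pd.Series(["Yes", "Yes", "Yes", "No", None])

# normalize=True returns fractions of the non-null answers instead of raw counts.
shares = answers.value_counts(normalize=True)

# Format as percentage strings, mirroring the output style used above.
percentages = (shares * 100).round(1).astype(str) + " %"
print(percentages)
```

Because value_counts drops missing values by default, the fractions are relative to the respondents who actually answered.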

In [136]:
data=data.loc[data["Do you celebrate Thanksgiving?"]=='Yes',:].copy()
print(data["Do you celebrate Thanksgiving?"].unique())
total_length=len(data)

['Yes']


Let's explore **what main dishes people tend to eat during Thanksgiving dinner**. This is contained in the column *What is typically the main dish at your Thanksgiving dinner*?

## What is typically the main dish at your Thanksgiving?

In [137]:
data["What is typically the main dish at your Thanksgiving dinner?"].value_counts()

Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64

We notice that 20 out of the 980 diners have Tofurkey as a main dish for Thanksgiving dinner. Let's see if they serve it with gravy. The column *Do you typically have gravy?* contains that info.

In [138]:
data_tofurkey=data.loc[data["What is typically the main dish at your Thanksgiving dinner?"]=="Tofurkey",:]
total_tofurkey=len(data_tofurkey)
print(data_tofurkey["Do you typically have gravy?"].value_counts().apply(lambda x: str(round((x/total_tofurkey)*100))+' %'))

Yes    60 %
No     40 %
Name: Do you typically have gravy?, dtype: object


60% of people who serve Tofurkey as a main dish typically have gravy with it.

## Do you eat Apple, Pecan or Pumpkin pies during Thanksgiving dinner?

Now that we've looked into the main dishes, let's explore the **dessert dishes**. Specifically, we'll look at how many people eat Apple, Pecan, or Pumpkin pie during Thanksgiving dinner. This data is encoded in the following three columns:
- Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple
- Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin
- Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan

In all three columns, the value is either the name of the pie if the person eats it for Thanksgiving dinner, or null otherwise.

In [139]:
apple_isnull=data["Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple"].isnull()
pumpkin_isnull=data["Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin"].isnull()
pecan_isnull=data["Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan"].isnull()
ate_pies=apple_isnull&pumpkin_isnull&pecan_isnull
print(ate_pies.value_counts())

False    876
True     104
dtype: int64


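Note that the variable name is misleading: **ate_pies** is built from the three isnull() masks, so it is True only when all three pie columns are null.

**ate_pies=False** means the person ate at least one of the three types of pie for Thanksgiving dinner (876 respondents).

**ate_pies=True** means the person ate none of the three types of pie (104 respondents).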
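A more direct way to express "ate at least one pie" is to test for non-null values instead of null ones. This is only a sketch on toy data; the short column names below are hypothetical stand-ins for the three long survey column names:

```python
import pandas as pd

# Toy data with shortened, hypothetical column names; each cell holds the pie
# name if the respondent selected it and is null otherwise.
pies = pd.DataFrame({
    "Apple":   ["Apple", None, None, "Apple"],
    "Pumpkin": ["Pumpkin", "Pumpkin", None, None],
    "Pecan":   [None, None, None, "Pecan"],
})

# True where the respondent selected at least one of the three pies.
ate_any_pie = pies.notnull().any(axis=1)

# Number of respondents who selected each pie.
per_pie = pies.notnull().sum()

print(ate_any_pie.value_counts())
print(per_pie)
```

With this formulation, True matches the intuitive reading of the variable name, and the per-column sums show how popular each pie is on its own.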

## How old are you?

Now, let's analyze the Age column in more depth. The Age column contains values that fall in one of the following categories:

In [140]:
data["Age"].unique().tolist()

['18 - 29', '30 - 44', '60+', '45 - 59', nan]

Because the survey records age ranges rather than exact ages, we cannot recover an exact integer age. We will use the lower bound of each range as the age representing it.

In [141]:
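def convert_str_int(string_to_conv):
    # Missing answers stay missing.
    if pd.isnull(string_to_conv):
        return None
    # "60+" has no space after the number, so turn "+" into a space before splitting.
    string_to_conv=string_to_conv.replace('+', ' ')
    # The first token is the lower bound of the age range.
    first=string_to_conv.split(' ')[0]
    return int(first)

data["int_age"]=data["Age"].apply(convert_str_int)
data["int_age"].describe()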

count    947.000000
mean      40.089757
std       15.352014
min       18.000000
25%       30.000000
50%       45.000000
75%       60.000000
max       60.000000
Name: int_age, dtype: float64

Since we took the lower bound of each age group, these statistics understate the true ages of the survey participants and should be read as rough estimates only.
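The lower-bound conversion can also be written without a helper function. The sketch below uses a toy Series containing the age categories listed above and relies on pandas string methods; it is an alternative formulation, not the approach used in this notebook:

```python
import pandas as pd

# Toy stand-in for the Age column, using the categories listed above.
age = pd.Series(["18 - 29", "30 - 44", "60+", "45 - 59", None])

# Pull the first run of digits (the lower bound) out of each category.
lower_bound = age.str.extract(r"(\d+)", expand=False)

# Missing values stay missing, so the numeric result is float rather than int.
int_age = pd.to_numeric(lower_bound)
print(int_age.tolist())
```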

## How much total combined money did all members of your Household earn last year?

Like the Age column, the *How much total combined money did all members of your HOUSEHOLD earn last year?* column contains categories that can be converted to numerical values. As with age, we will use the lower bound of each range. Here are the unique values in the column:

In [142]:
data["How much total combined money did all members of your HOUSEHOLD earn last year?"].unique().tolist()

['$75,000 to $99,999',
 '$50,000 to $74,999',
 '$0 to $9,999',
 '$200,000 and up',
 '$100,000 to $124,999',
 '$25,000 to $49,999',
 'Prefer not to answer',
 '$10,000 to $24,999',
 '$175,000 to $199,999',
 '$150,000 to $174,999',
 '$125,000 to $149,999',
 nan]

In [143]:
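def convert_income(income):
    # Missing answers stay missing.
    if pd.isnull(income):
        return None
    # Keep only the first token, e.g. "$75,000" from "$75,000 to $99,999".
    income = income.split(" ")[0]
    # "Prefer not to answer" carries no income information.
    if income == "Prefer":
        return None
    # Strip the dollar sign and thousands separators before converting.
    for element in ["$",","]:
        income = income.replace(element, "")
    return int(income)

data["int_income"]=data["How much total combined money did all members of your HOUSEHOLD earn last year?"].apply(convert_income)
data["int_income"].describe()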


count       829.000000
mean      75965.018094
std       59068.636748
min           0.000000
25%       25000.000000
50%       75000.000000
75%      100000.000000
max      200000.000000
Name: int_income, dtype: float64
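As a cross-check on the conversion logic, here is a sketch of the same lower-bound extraction using pandas string methods on a toy Series. The sample values are taken from the category list above; the regular expression is our own choice, not part of the original analysis:

```python
import pandas as pd

# Toy stand-in for the income column, using categories listed above.
income = pd.Series([
    "$75,000 to $99,999",
    "$0 to $9,999",
    "$200,000 and up",
    "Prefer not to answer",
    None,
])

# Capture the first dollar amount; rows without one (including
# "Prefer not to answer" and missing values) become NaN.
lower_bound = income.str.extract(r"\$([\d,]+)", expand=False)

# Drop the thousands separators and convert to numbers.
int_income = pd.to_numeric(lower_bound.str.replace(",", "", regex=False))
print(int_income.tolist())
```

Rows that contain no dollar amount fall out as missing without a special case for "Prefer not to answer".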

##  How far will you travel for Thanksgiving?

We can now see how the distance someone travels for Thanksgiving dinner relates to their income level. 

Hypothesis: People earning less money could be younger and would travel to their parents' houses for Thanksgiving. People earning more are more likely to host Thanksgiving at their own house.

We can test this by filtering data based on int_income, and seeing what the values in the *How far will you travel for Thanksgiving?* column are.

First, let's see how far people earning **under 150000** will travel.

In [144]:
data.loc[data['int_income']<150000,"How far will you travel for Thanksgiving?"].value_counts()

Thanksgiving is happening at my home--I won't travel at all                         281
Thanksgiving is local--it will take place in the town I live in                     203
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    150
Thanksgiving is out of town and far away--I have to drive several hours or fly       55
Name: How far will you travel for Thanksgiving?, dtype: int64

Now let's see how far people earning **150000 or more** will travel.

In [145]:
data.loc[data['int_income']>=150000,"How far will you travel for Thanksgiving?"].value_counts()

Thanksgiving is happening at my home--I won't travel at all                         66
Thanksgiving is local--it will take place in the town I live in                     34
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    25
Thanksgiving is out of town and far away--I have to drive several hours or fly      15
Name: How far will you travel for Thanksgiving?, dtype: int64
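The two groups differ a lot in size, so raw counts are hard to compare directly. The sketch below turns the counts printed in the two outputs above into within-group shares; the short row and column labels are ours and do not appear in the dataset:

```python
import pandas as pd

# Counts copied from the two value_counts() outputs above; the short row and
# column labels are illustrative, not part of the dataset.
counts = pd.DataFrame(
    {
        "at home": [281, 66],
        "local": [203, 34],
        "few hours": [150, 25],
        "far away": [55, 15],
    },
    index=["under 150000", "150000 or more"],
)

# Divide each row by its total so the two income groups are comparable.
shares = counts.div(counts.sum(axis=1), axis=0)
print((shares * 100).round(1))
```

Comparing the rows of shares rather than the raw counts keeps the much larger lower-income group from dominating the picture.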

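In both groups, staying at home is the most common answer. Among respondents earning under 150000, 281 of 689 (about 41%) have Thanksgiving at their own home, compared with 66 of 140 (about 47%) of those earning 150000 or more. This is consistent with the hypothesis, but the gap is modest, the higher-income group is small, and we have not tested whether the difference is statistically significant.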

## Have you ever attended a "Friendsgiving?"

We want to study the link between friendship and age.
There are two columns which directly pertain to friendship: *Have you ever tried to meet up with hometown friends on Thanksgiving night?* and *Have you ever attended a "Friendsgiving?"*.

Both questions seem skewed towards younger people. Let's see if this hypothesis holds up.

In [129]:
print('Average age of respondents for each category:  ')

data.pivot_table(index="Have you ever tried to meet up with hometown friends on Thanksgiving night?",
                 columns='Have you ever attended a "Friendsgiving?"',
                 values="int_age")

Average age of respondents for each category:  

Have you ever attended a "Friendsgiving?"                                           No        Yes
Have you ever tried to meet up with hometown friends on Thanksgiving night?
No                                                                           42.283702  37.010526
Yes                                                                          41.475410  33.976744


In [128]:
print('Average income of respondents for each category:  ')
data.pivot_table(index="Have you ever tried to meet up with hometown friends on Thanksgiving night?",
                 columns='Have you ever attended a "Friendsgiving?"',
                 values="int_income")

Average income of respondents for each category:  

Have you ever attended a "Friendsgiving?"                                              No           Yes
Have you ever tried to meet up with hometown friends on Thanksgiving night?
No                                                                           78914.549654  72894.736842
Yes                                                                          78750.000000  66019.736842


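Respondents who answered Yes to both questions have the lowest average age (about 34, using the lower bound of each age range) and the lowest average income (about 66K). Those who answered No to both are the oldest on average (about 42) and have the highest average income (about 79K). This supports the hypothesis that attending a Friendsgiving and meeting up with hometown friends on Thanksgiving night are more common among younger respondents.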
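For readers unfamiliar with pivot_table, the sketch below shows what the two cells above compute, using toy data with shortened, hypothetical column names and made-up values. It also builds a table of group sizes, which is worth checking because an average over very few respondents is unreliable:

```python
import pandas as pd

# Toy data with shortened, hypothetical column names and made-up values.
toy = pd.DataFrame({
    "meetup":        ["No", "No", "Yes", "Yes", "Yes"],
    "friendsgiving": ["No", "Yes", "No", "Yes", "Yes"],
    "int_age":       [60, 45, 45, 18, 30],
})

# By default, pivot_table averages `values` within each (index, columns) cell.
table = toy.pivot_table(index="meetup", columns="friendsgiving", values="int_age")

# Group sizes, which the mean alone hides.
sizes = toy.pivot_table(index="meetup", columns="friendsgiving",
                        values="int_age", aggfunc="count")

print(table)
print(sizes)
```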

Here are some potential next steps:

- Figure out the most common dessert people eat.
- Figure out the most common complete meal people eat.
- Identify how many people work on Thanksgiving.
- Find regional patterns in the dinner menus.
- Find age-, gender-, and income-based patterns in dinner menus.