# Analyzing U.S. Thanksgiving 2015 data
### In this exercise, we use pandas DataFrames to analyze the U.S. Thanksgiving 2015 survey data. We use concepts such as Boolean subsetting, string manipulation, Series objects, indexing, and pivot tables, along with apply(), list comprehensions, value_counts(), and describe().

In [1]:
import pandas as pd
data = pd.read_csv("thanksgiving.csv", encoding="Latin-1")
print(data.head(1))

   RespondentID Do you celebrate Thanksgiving?  \
0    4337954960                            Yes   

  What is typically the main dish at your Thanksgiving dinner?  \
0                                             Turkey             

  What is typically the main dish at your Thanksgiving dinner? - Other (please specify)  \
0                                                NaN                                      

  How is the main dish typically cooked?  \
0                                  Baked   

  How is the main dish typically cooked? - Other (please specify)  \
0                                                NaN                

  What kind of stuffing/dressing do you typically have?  \
0                                        Bread-based      

  What kind of stuffing/dressing do you typically have? - Other (please specify)  \
0                                                NaN                               

  What type of cranberry saucedo you typically have?  \
0          

In [2]:
data.columns

Index(['RespondentID', 'Do you celebrate Thanksgiving?',
       'What is typically the main dish at your Thanksgiving dinner?',
       'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
       'How is the main dish typically cooked?',
       'How is the main dish typically cooked? - Other (please specify)',
       'What kind of stuffing/dressing do you typically have?',
       'What kind of stuffing/dressing do you typically have? - Other (please specify)',
       'What type of cranberry saucedo you typically have?',
       'What type of cranberry saucedo you typically have? - Other (please specify)',
       'Do you typically have gravy?',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
       'Which of these side dishes aretypically served

Filtering the data set for respondents who celebrate Thanksgiving


In [3]:
data["Do you celebrate Thanksgiving?"].value_counts()


Yes    980
No      78
Name: Do you celebrate Thanksgiving?, dtype: int64

In [4]:
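# Boolean mask: True for respondents who celebrate Thanksgiving.
celebrates = data["Do you celebrate Thanksgiving?"] == "Yes"
# .copy() avoids SettingWithCopyWarning when new columns are added later.
data = data[celebrates].copy()
data.iloc[0:5]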

Unnamed: 0,RespondentID,Do you celebrate Thanksgiving?,What is typically the main dish at your Thanksgiving dinner?,What is typically the main dish at your Thanksgiving dinner? - Other (please specify),How is the main dish typically cooked?,How is the main dish typically cooked? - Other (please specify),What kind of stuffing/dressing do you typically have?,What kind of stuffing/dressing do you typically have? - Other (please specify),What type of cranberry saucedo you typically have?,What type of cranberry saucedo you typically have? - Other (please specify),...,Have you ever tried to meet up with hometown friends on Thanksgiving night?,"Have you ever attended a ""Friendsgiving?""",Will you shop any Black Friday sales on Thanksgiving Day?,Do you work in retail?,Will you employer make you work on Black Friday?,How would you describe where you live?,Age,What is your gender?,How much total combined money did all members of your HOUSEHOLD earn last year?,US Region
0,4337954960,Yes,Turkey,,Baked,,Bread-based,,,,...,Yes,No,No,No,,Suburban,18 - 29,Male,"$75,000 to $99,999",Middle Atlantic
1,4337951949,Yes,Turkey,,Baked,,Bread-based,,Other (please specify),Homemade cranberry gelatin ring,...,No,No,Yes,No,,Rural,18 - 29,Female,"$50,000 to $74,999",East South Central
2,4337935621,Yes,Turkey,,Roasted,,Rice-based,,Homemade,,...,Yes,Yes,Yes,No,,Suburban,18 - 29,Male,"$0 to $9,999",Mountain
3,4337933040,Yes,Turkey,,Baked,,Bread-based,,Homemade,,...,Yes,No,No,No,,Urban,30 - 44,Male,"$200,000 and up",Pacific
4,4337931983,Yes,Tofurkey,,Baked,,Bread-based,,Canned,,...,Yes,No,No,No,,Urban,30 - 44,Male,"$100,000 to $124,999",Pacific
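
The Boolean subsetting used in this step can be sketched on a small frame. The toy data below is invented for illustration; only the column name comes from the survey.

```python
import pandas as pd

# Toy stand-in for the survey; the values are invented for illustration.
toy = pd.DataFrame({
    "RespondentID": [1, 2, 3, 4],
    "Do you celebrate Thanksgiving?": ["Yes", "No", "Yes", "Yes"],
})

# Comparing a column to a value yields a Boolean Series aligned on the index.
celebrates = toy["Do you celebrate Thanksgiving?"] == "Yes"

# Indexing with the mask keeps only the rows where it is True.
celebrating = toy[celebrates].copy()

print(celebrates.tolist())
print(celebrating["RespondentID"].tolist())
```

The mask and the frame share the same index, which is why the mask can be built once and reused for several selections.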


Let's look at the main dish people typically have for Thanksgiving


In [5]:
data['What is typically the main dish at your Thanksgiving dinner?'].value_counts()

Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64

## As expected, Turkey takes the top spot

## Is gravy typically served with Tofurkey?


In [6]:
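# Boolean mask for respondents whose main dish is Tofurkey.
tofurkey_gravy = data['What is typically the main dish at your Thanksgiving dinner?'] == "Tofurkey"
# Note: the mask is not applied here, so the counts below cover every celebrating respondent.
data["Do you typically have gravy?"].value_counts()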

Yes    892
No      82
Name: Do you typically have gravy?, dtype: int64

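These counts cover all respondents who celebrate Thanksgiving, not just Tofurkey eaters: 892 of 974 typically have gravy. Answering the Tofurkey question requires applying the `tofurkey_gravy` mask before counting.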
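
The cell above builds a mask for Tofurkey eaters but counts gravy answers over the whole frame. A minimal sketch of applying the mask first, on invented toy data (the column names are the survey's; the rows are not):

```python
import pandas as pd

MAIN = "What is typically the main dish at your Thanksgiving dinner?"
GRAVY = "Do you typically have gravy?"

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    MAIN: ["Turkey", "Tofurkey", "Tofurkey", "Turkey", "Tofurkey"],
    GRAVY: ["Yes", "Yes", "No", "Yes", "Yes"],
})

# Build the mask, then use .loc to select the gravy column for masked rows only.
is_tofurkey = toy[MAIN] == "Tofurkey"
tofurkey_gravy_counts = toy.loc[is_tofurkey, GRAVY].value_counts()

print(tofurkey_gravy_counts)
```

With only 20 Tofurkey eaters in the survey, any conclusion drawn from the filtered counts would rest on a small sample.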

## People who ate any of the three pies: Apple, pumpkin or pecan


In [7]:
apple_isnull = data["Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple"].isnull()
pumpkin_isnull = data["Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin"].isnull()
pecan_isnull = data["Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan"].isnull()
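# True where all three pie columns are empty, i.e. the respondent ate none of these pies.
ate_no_pies = apple_isnull & pumpkin_isnull & pecan_isnull
# In the output, False = ate at least one of the three pies; True = ate none.
ate_no_pies.value_counts()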

False    876
True     104
dtype: int64
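
In the output above, True marks respondents with all three pie columns empty, so the 876 False rows are the ones who had at least one pie. A sketch of computing "ate any pie" directly, using invented toy data and shortened, hypothetical column names:

```python
import pandas as pd

# Toy data: a cell holds the pie name when selected, None otherwise.
toy = pd.DataFrame({
    "Apple": ["Apple", None, None, "Apple"],
    "Pumpkin": ["Pumpkin", None, "Pumpkin", None],
    "Pecan": [None, None, None, None],
})

pie_columns = ["Apple", "Pumpkin", "Pecan"]

# True when at least one pie column is filled in for the row.
ate_any_pie = toy[pie_columns].notnull().any(axis=1)

# Equivalent view: True when every pie column is empty.
ate_no_pie = toy[pie_columns].isnull().all(axis=1)

print(ate_any_pie.value_counts())
```

The two masks are exact complements, so either can be used as long as the variable name matches what True means.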

In [8]:
# Analyzing Age data
data["Age"].describe()


count         947
unique          4
top       45 - 59
freq          269
Name: Age, dtype: object

In [9]:
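def tonum(string):
    """Return the lower bound of an age bracket such as "18 - 29" or "60+" as an int."""
    if pd.isnull(string):
        return None
    # Keep the first token ("18", "60+") and drop the trailing "+" if present.
    string = string.split(" ")[0]
    string = string.replace('+', '')
    return int(string)

data["int_age"] = data["Age"].apply(tonum)
data['int_age'].describe()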


count    947.000000
mean      40.089757
std       15.352014
min       18.000000
25%       30.000000
50%       45.000000
75%       60.000000
max       60.000000
Name: int_age, dtype: float64
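
Because `int_age` holds the lower bound of each bracket, the mean of about 40 understates respondents' actual ages. The same conversion can also be written without `apply()`; the following is a sketch using a regular expression, with a hand-made Series standing in for the Age column:

```python
import pandas as pd

# Hand-made sample of the four age brackets plus a missing answer.
ages = pd.Series(["18 - 29", "30 - 44", "45 - 59", "60+", None])

# Extract the first run of digits from each string; missing values stay missing.
first_number = ages.str.extract(r"(\d+)", expand=False)

# Convert the extracted text to numbers (float, so NaN can be represented).
lower_bound = pd.to_numeric(first_number)

print(lower_bound.tolist())
```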

In [10]:
data["How much total combined money did all members of your HOUSEHOLD earn last year?"].value_counts()  


$25,000 to $49,999      166
$75,000 to $99,999      127
$50,000 to $74,999      127
Prefer not to answer    118
$100,000 to $124,999    109
$200,000 and up          76
$10,000 to $24,999       60
$0 to $9,999             52
$125,000 to $149,999     48
$150,000 to $174,999     38
$175,000 to $199,999     26
Name: How much total combined money did all members of your HOUSEHOLD earn last year?, dtype: int64

## Let's convert the income variable into a form suitable for numerical analysis

In [11]:
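def income(string):
    """Return the lower bound of an income bracket such as "$75,000 to $99,999" as an int."""
    if pd.isnull(string) or string == "Prefer not to answer":
        return None
    # Keep the first token ("$75,000") and strip the dollar sign and thousands separator.
    string = string.split(" ")[0]
    string = string.replace("$", "")
    string = string.replace(",", "")
    return int(string)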

data["int_income"] = data["How much total combined money did all members of your HOUSEHOLD earn last year?"].apply(income)
print(data["int_income"].iloc[0:5])

0     75000.0
1     50000.0
2         0.0
3    200000.0
4    100000.0
Name: int_income, dtype: float64
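
An alternative to handling "Prefer not to answer" explicitly is to let `pd.to_numeric` coerce anything non-numeric to NaN. This is a sketch on a hand-made Series, not the notebook's own method:

```python
import pandas as pd

# Hand-made sample of income answers, including a refusal and a missing value.
income_text = pd.Series([
    "$75,000 to $99,999",
    "Prefer not to answer",
    "$200,000 and up",
    "$0 to $9,999",
    None,
])

# Take the first whitespace-separated token, e.g. "$75,000" or "Prefer".
first_token = income_text.str.split(" ").str[0]

# Remove the dollar sign and thousands separator as literal characters.
digits_only = (
    first_token.str.replace("$", "", regex=False)
               .str.replace(",", "", regex=False)
)

# errors="coerce" turns non-numeric leftovers such as "Prefer" into NaN.
income_lower = pd.to_numeric(digits_only, errors="coerce")

print(income_lower.tolist())
```

The trade-off is that coercion silently hides unexpected values, so it is worth checking `value_counts()` on the raw column first, as cell 10 does.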


In [12]:
data["int_income"].describe()

count       829.000000
mean      75965.018094
std       59068.636748
min           0.000000
25%       25000.000000
50%       75000.000000
75%      100000.000000
max      200000.000000
Name: int_income, dtype: float64

In [13]:
# Respondents whose income bracket starts above $150,000 (the $175,000 and $200,000 brackets).
inc_over_150k = data[data["int_income"] > 150000]
inc_over_150k["How far will you travel for Thanksgiving?"].value_counts()


Thanksgiving is happening at my home--I won't travel at all                         49
Thanksgiving is local--it will take place in the town I live in                     25
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    16
Thanksgiving is out of town and far away--I have to drive several hours or fly      12
Name: How far will you travel for Thanksgiving?, dtype: int64

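### Among respondents whose income bracket starts above $150,000, the largest group (49 of 102) has Thanksgiving at home. One possible explanation is that higher-income respondents are older, possibly parents whose children visit them for Thanksgiving. The same breakdown for lower-income respondents is not computed here, so the comparison between income groups remains untested.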
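
Comparing income groups needs the travel distribution of both groups, ideally as shares rather than raw counts, since the groups differ in size. A sketch on invented toy data, with a hypothetical short column name `travel`:

```python
import pandas as pd

# Toy data, invented for illustration; None marks a missing income.
toy = pd.DataFrame({
    "int_income": [200000, 175000, 25000, 50000, 10000, 200000, None],
    "travel": ["home", "home", "local", "home", "far", "local", "home"],
})

# Rows with missing income fail both comparisons and drop out of both groups.
high = toy[toy["int_income"] > 150000]
low = toy[toy["int_income"] <= 150000]

# normalize=True returns shares that sum to 1 within each group.
high_share = high["travel"].value_counts(normalize=True)
low_share = low["travel"].value_counts(normalize=True)

# Align the two Series on the answer labels; answers absent in a group become 0.
comparison = pd.DataFrame({"high": high_share, "low": low_share}).fillna(0)

print(comparison)
```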

# Friendship and Age

In [14]:
data.pivot_table(index = "Have you ever tried to meet up with hometown friends on Thanksgiving night?", columns = 'Have you ever attended a "Friendsgiving?"', values = "int_age")


Columns: Have you ever attended a "Friendsgiving?"
Rows: Have you ever tried to meet up with hometown friends on Thanksgiving night?

            No        Yes
No   42.283702  37.010526
Yes  41.475410  33.976744


# Friendship and Income

In [15]:
data.pivot_table(index = "Have you ever tried to meet up with hometown friends on Thanksgiving night?", columns = 'Have you ever attended a "Friendsgiving?"', values = "int_income")


Columns: Have you ever attended a "Friendsgiving?"
Rows: Have you ever tried to meet up with hometown friends on Thanksgiving night?

               No           Yes
No   78914.549654  72894.736842
Yes  78750.000000  66019.736842


# Findings
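Respondents who have attended a Friendsgiving and have tried to meet up with hometown friends have the lowest average age (about 34) and the lowest average income (about $66,000). This suggests that younger, lower-income respondents are more likely to attend a Friendsgiving. Both averages use the lower bound of each bracket, so they understate the true values.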
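
By default, `pivot_table` reports the mean of the `values` column for each combination of row and column labels. The sketch below, on invented toy data with hypothetical short column names, shows the equivalent `groupby` and how `aggfunc="count"` exposes the group sizes behind each mean:

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "met_friends": ["No", "No", "Yes", "Yes", "Yes"],
    "friendsgiving": ["No", "Yes", "No", "Yes", "Yes"],
    "int_age": [45, 30, 60, 18, 30],
})

# Default aggregation is the mean of int_age per (row, column) cell.
pivot = toy.pivot_table(index="met_friends", columns="friendsgiving", values="int_age")

# The same table built from groupby + mean + unstack.
by_group = toy.groupby(["met_friends", "friendsgiving"])["int_age"].mean().unstack()

# Number of respondents behind each cell.
sizes = toy.pivot_table(index="met_friends", columns="friendsgiving",
                        values="int_age", aggfunc="count")

print(pivot)
print(sizes)
```

Checking the sizes alongside the means helps avoid reading too much into cells that contain only a handful of respondents.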

# Most common dessert

In [16]:
desert_question_list = ['Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Apple cobbler',
       'Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Blondies',
       'Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Brownies',
       'Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Carrot cake',
       'Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Cheesecake',
       'Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Cookies',
       'Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Fudge',
       'Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Ice cream',
       'Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Peach cobbler',
       'Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - None',
       'Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Other (please specify)',
       'Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Other (please specify).1']

desert_types = [d.replace('Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - ',"") for d in desert_question_list]
desert_types

['Apple cobbler',
 'Blondies',
 'Brownies',
 'Carrot cake',
 'Cheesecake',
 'Cookies',
 'Fudge',
 'Ice cream',
 'Peach cobbler',
 'None',
 'Other (please specify)',
 'Other (please specify).1']

Now we have a much better-looking dessert list.

Next, we need to find the count of each type of dessert in the data.

In [17]:
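from pandas import Series
# Each dessert column holds the dessert name when selected and NaN otherwise,
# so the top value_counts() entry is the number of respondents who picked it.
# .iloc[0] takes that entry by position; plain [0] on a label-indexed Series is deprecated.
desert_counts = [data[d].value_counts().iloc[0] for d in desert_question_list]
deserts = Series(desert_counts, index=desert_types)
deserts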

Apple cobbler               110
Blondies                     16
Brownies                    128
Carrot cake                  72
Cheesecake                  191
Cookies                     204
Fudge                        43
Ice cream                   266
Peach cobbler               103
None                        295
Other (please specify)      134
Other (please specify).1     13
dtype: int64

In [18]:
deserts.sort_values(axis = 0)

Other (please specify).1     13
Blondies                     16
Fudge                        43
Carrot cake                  72
Peach cobbler               103
Apple cobbler               110
Brownies                    128
Other (please specify)      134
Cheesecake                  191
Cookies                     204
Ice cream                   266
None                        295
dtype: int64

## Findings
The most common answer is "None" (295 respondents), so a large share of respondents do not typically have any of the listed desserts. Among those who do, ice cream, cookies, and cheesecake are the most popular choices, which makes them plausible candidates for marketing campaigns.
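
The counts above take the most frequent value in each column. That matches the number of selections for checkbox columns, but for a free-text column such as "Other (please specify).1" it only counts the single most common answer. A sketch of the difference, on invented toy data with a hypothetical "Dessert - " prefix:

```python
import pandas as pd

# Toy data: checkbox columns hold the dessert name or None; "Other" is free text.
toy = pd.DataFrame({
    "Dessert - Cookies": ["Cookies", None, "Cookies"],
    "Dessert - Fudge": [None, None, "Fudge"],
    "Dessert - Other": ["Flan", "Baklava", "Flan"],
})

prefix = "Dessert - "
labels = [column.replace(prefix, "") for column in toy.columns]

# count() is the number of non-missing cells, i.e. respondents who answered at all.
answered = pd.Series([toy[column].count() for column in toy.columns], index=labels)

# The top value_counts() entry only counts the most frequent single answer.
top_answer = pd.Series(
    [toy[column].value_counts().iloc[0] for column in toy.columns], index=labels
)

print(answered.sort_values(ascending=False))
print(top_answer)
```

For the checkbox columns the two Series agree; they diverge only where respondents typed different answers.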


# How many people work on Black Friday?

In [19]:
x = data['Will you employer make you work on Black Friday?']
x.value_counts()

Yes              43
No               20
Doesn't apply     7
Name: Will you employer make you work on Black Friday?, dtype: int64

# Findings
Only 70 respondents answered this question, possibly because it was asked only of those who work in retail. Of those 70, 43 said their employer will make them work on Black Friday.


# Patterns across regions


Let's understand how people differ in choosing side dishes across regions

In [20]:
side_dish_list = ['Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Cauliflower',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Corn',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Cornbread',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Fruit salad',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Green beans/green bean casserole',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Macaroni and cheese',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Mashed potatoes',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Rolls/biscuits',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Squash',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Vegetable salad',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Yams/sweet potato casserole',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Other (please specify)',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Other (please specify).1']

In [21]:
side_dishes = [i.replace('Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - ', "") for i in side_dish_list]

In [22]:
side_dishes

['Brussel sprouts',
 'Carrots',
 'Cauliflower',
 'Corn',
 'Cornbread',
 'Fruit salad',
 'Green beans/green bean casserole',
 'Macaroni and cheese',
 'Mashed potatoes',
 'Rolls/biscuits',
 'Squash',
 'Vegetable salad',
 'Yams/sweet potato casserole',
 'Other (please specify)',
 'Other (please specify).1']

In [23]:
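# .iloc[0] takes the top count by position; plain [0] on a label-indexed Series is deprecated.
side_dish_counts = [data[sd].value_counts().iloc[0] for sd in side_dish_list]
side_dish = Series(side_dish_counts, index=side_dishes)
side_dish.sort_values()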

Other (please specify).1              7
Cauliflower                          88
Other (please specify)              111
Brussel sprouts                     155
Squash                              171
Macaroni and cheese                 206
Vegetable salad                     209
Fruit salad                         215
Cornbread                           235
Carrots                             242
Corn                                464
Yams/sweet potato casserole         631
Green beans/green bean casserole    686
Rolls/biscuits                      766
Mashed potatoes                     817
dtype: int64

We see that mashed potatoes are the most commonly served side dish.

Now, let's look at mashed potatoes across regions.



In [24]:
data['Mashed_potatoes'] = data["Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Mashed potatoes"]
data.pivot_table(index = "US Region", columns = 'Mashed_potatoes', values = "int_age") 

Mashed_potatoes     Mashed potatoes
US Region
East North Central        41.314961
East South Central        42.733333
Middle Atlantic           40.476923
Mountain                  41.921053
New England               40.615385
Pacific                   39.080357
South Atlantic            39.859873
West North Central        37.292308
West South Central        35.957143


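This table shows the average age (lower bound of the bracket) of respondents who serve mashed potatoes, which ranges from about 36 to 43 across regions. It does not show how often mashed potatoes are served in each region; that would require the share of respondents per region who selected them.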
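
To compare how common a side dish is across regions, one option is the share of respondents in each region who selected it. A sketch on invented toy data (the regions are real labels; the rows are not):

```python
import pandas as pd

# Toy data: the dish column holds the dish name when selected, None otherwise.
toy = pd.DataFrame({
    "US Region": ["Pacific", "Pacific", "Mountain", "Mountain", "Mountain"],
    "Mashed_potatoes": ["Mashed potatoes", None, "Mashed potatoes",
                        "Mashed potatoes", "Mashed potatoes"],
})

# True for respondents who selected mashed potatoes.
serves = toy["Mashed_potatoes"].notnull()

# The mean of a Boolean Series is the fraction of True values, here per region.
share_by_region = serves.groupby(toy["US Region"]).mean()

print(share_by_region)
```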



## Now, let's explore Black Friday shopping trends across regions.

In [25]:
data['Bfs'] = data['Will you shop any Black Friday sales on Thanksgiving Day?']
data['Bfs'].value_counts()

No     727
Yes    224
Name: Bfs, dtype: int64

Only 224 respondents said they would shop Black Friday sales on Thanksgiving Day. Let's subset the data to those shoppers and see whether regional trends exist.

In [26]:
bfsyes = data['Bfs'] == "Yes" 
BFshoppers = data[bfsyes]
BFshoppers['US Region'].value_counts()

South Atlantic        54
East North Central    35
Middle Atlantic       34
West South Central    27
Pacific               21
East South Central    17
West North Central    14
Mountain               8
New England            7
Name: US Region, dtype: int64

# Findings
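The South Atlantic region (Delaware, Maryland, Virginia, West Virginia, North Carolina, South Carolina, Georgia, Florida, and the District of Columbia) has the most Thanksgiving Day shoppers in absolute terms (54 of 224). These are raw counts, so they may partly reflect how many respondents each region has rather than a stronger tendency to shop.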
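
Raw shopper counts depend on how many respondents each region has. Dividing shoppers per region by respondents per region gives a rate that can be compared across regions. A sketch on invented toy data:

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "US Region": ["South Atlantic"] * 4 + ["New England"] * 2,
    "Bfs": ["Yes", "No", "No", "Yes", "Yes", "No"],
})

# Shoppers per region and all respondents per region.
shoppers_by_region = toy[toy["Bfs"] == "Yes"]["US Region"].value_counts()
respondents_by_region = toy["US Region"].value_counts()

# Division aligns the two Series on the region labels.
shopping_rate = shoppers_by_region / respondents_by_region

print(shoppers_by_region)
print(shopping_rate)
```

In this toy example one region has twice as many shoppers as the other, yet the shopping rate is identical, which is the kind of effect raw counts can hide.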

## Let's see how people working in retail shop for Black Friday

In [27]:
rtl = data[data['Do you work in retail?'] == "Yes"]
rtl['Bfs'].value_counts()


No     37
Yes    33
Name: Bfs, dtype: int64

## Findings
Among the 70 respondents who work in retail, slightly more say they will not shop (37) than say they will (33). Even so, about 47% of retail workers plan to shop, compared with about 24% of all respondents (224 of 951), so retail workers appear more likely, not less, to shop Black Friday sales on Thanksgiving Day.
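
Comparing retail workers with everyone else is easier with row-wise shares. The sketch below uses `pd.crosstab` with `normalize="index"` on invented toy data, with a hypothetical short column name `retail`:

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "retail": ["Yes", "Yes", "No", "No", "No", "No"],
    "Bfs": ["Yes", "No", "No", "No", "No", "Yes"],
})

# normalize="index" makes each row sum to 1, giving shares within each retail group.
rates = pd.crosstab(toy["retail"], toy["Bfs"], normalize="index")

print(rates)
```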

## Let's see how main dish differs with gender and age




In [28]:
# Mean age (lower bound of the bracket) by gender and main dish.
main_dish = 'What is typically the main dish at your Thanksgiving dinner?'
data.pivot_table(index='What is your gender?', columns=main_dish, values='int_age')

Columns: What is typically the main dish at your Thanksgiving dinner?
Rows: What is your gender?

        Chicken   Ham/Pork  I don't know  Other (please specify)  Roast beef   Tofurkey  Turducken     Turkey
Female     38.0  35.600000          18.0               40.500000        45.0  35.538462       60.0  40.453540
Male       41.0  31.846154          24.0               47.307692        36.5  26.571429       30.0  40.472585


# Findings
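Respondents whose main dish is Ham/Pork or Tofurkey are younger on average (about 27 to 36) than those who have Turkey (about 40), for both women and men. One possible explanation is that these dishes are cheaper, but income is not examined here. The averages for Turducken, Roast beef, and Chicken rest on very few respondents (3, 11, and 12 in total), so they are not reliable enough to support conclusions about older or higher-income respondents.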
  