# Ramsey King
# DSC 540 - Data Preparation
# Weeks 5 & 6
# July 16, 2021
## For this assignment you need to complete 8 of the following exercises against this data.  You must select at least 2 methods from Chapters 7, 8, 10 & 11.

In [12]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

candy_2016_df = pd.read_excel('BOING-BOING-CANDY-HIERARCHY-2016-SURVEY-Responses.xlsx')
candy_2016_df.shape

(1259, 123)

### The 2 methods from Chapter 7 that I will choose are filtering out missing data and binning. This will be done by filtering out the rows where the column 'How old are you?' is non-numeric. From there, the ages will be binned.

In [13]:
# rename the 'How old are you?' column to 'age'
candy_2016_df = candy_2016_df.rename(columns={candy_2016_df.columns[3]:'age'})

# remove rows where 'age' column does not contain a number
candy_2016_df = candy_2016_df[pd.to_numeric(candy_2016_df['age'], errors='coerce').notnull()]
candy_2016_df.shape

(1191, 123)
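As a small illustration of why the filter above works, the sketch below runs the same `pd.to_numeric(..., errors='coerce')` pattern on a made-up Series (the values are invented for illustration and are not taken from the survey). Note that the filter only selects rows; it does not convert the surviving values, so a numeric string such as `"45"` stays a string until the column is converted explicitly.

```python
import pandas as pd

# Toy stand-in for the survey's age column (made-up values, not real responses)
ages = pd.Series([22, "45", "old enough", None, 57, "50+"], name="age")

# errors='coerce' turns anything that cannot be parsed as a number into NaN
numeric_ages = pd.to_numeric(ages, errors="coerce")

# Keep only the rows whose age parsed successfully
valid_ages = ages[numeric_ages.notnull()]
print(valid_ages.tolist())
```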

In [14]:
# bin 'age' column based on age ranges (0-17, 18-24, 25-34, 35-44, 45-54, 55-64, 65+)
bins = [0,17,24,34,44,54,64,140]
cats = pd.cut(candy_2016_df['age'], bins)
cats.value_counts()
# candy_2016_df.plot.bar(x=cats,y=pd.value_counts(cats))

(34, 44]     382
(24, 34]     317
(44, 54]     292
(54, 64]     121
(17, 24]      37
(0, 17]       21
(64, 140]     19
Name: age, dtype: int64
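The interval labels in the output, such as `(17, 24]`, are open on the left and closed on the right, so an age of 17 falls in the first bin and 18 in the second. A minimal sketch of the same binning with readable labels follows; the `labels` list and the toy ages are assumptions added for illustration, and the labels are only accurate for whole-number ages.

```python
import pandas as pd

# Toy ages chosen to sit on the bin edges (made-up values)
ages = pd.Series([15, 17, 18, 24, 25, 40, 70])

bins = [0, 17, 24, 34, 44, 54, 64, 140]
# Hypothetical labels, one per interval between consecutive bin edges
labels = ["0-17", "18-24", "25-34", "35-44", "45-54", "55-64", "65+"]

# pd.cut uses right-closed intervals by default: (0, 17], (17, 24], ...
age_groups = pd.cut(ages, bins, labels=labels)

# sort_index() lists the groups in age order instead of by frequency
counts = age_groups.value_counts().sort_index()
print(counts)
```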

### The 2 methods that I will choose from Chapter 8 are combining/merging datasets and pivoting data.  This will be achieved as follows:
1. Create a subset of the 2016 and 2017 candy datasets based on all participants who gave a response in the '[100 Grand Bar]' column.
2. Merge these subset datasets together (adding the rows of the 2016 dataset to the 2017 dataset).
3. Create a pivot table based on the 100 Grand Bar response type by gender.

In [15]:
# Load the 2016 and 2017 datasets (the 2016 dataframe was overwritten above, so it is reloaded here)

candy_2016_df = pd.read_excel('BOING-BOING-CANDY-HIERARCHY-2016-SURVEY-Responses.xlsx')
candy_2017_df = pd.read_excel('candyhierarchy2017.xlsx')

print("2016 dimensions:", candy_2016_df.shape, "\n2017 dimensions:", candy_2017_df.shape)

2016 dimensions: (1259, 123) 
2017 dimensions: (2460, 120)


In [16]:
# rename the '100 Grand Bar' column to 'hundred_grand_bar', 'Your gender:' to 'gender', and 'How old are you?' to 'age'
candy_2016_df = candy_2016_df.rename(columns={candy_2016_df.columns[2]:'gender',
                                              candy_2016_df.columns[3]:'age',
                                              candy_2016_df.columns[6]:'hundred_grand_bar'})
#candy_2016_df.columns


# check for empty values in the hundred_grand_bar column in the 2016 dataset.
miss = candy_2016_df['hundred_grand_bar'].isnull().sum()
if miss>0:
    print("hundred_grand_bar has {} missing value(s)".format(miss))
else:
    print("hundred_grand_bar has no missing values.")


hundred_grand_bar has 78 missing value(s)


In [17]:
# remove rows where the 100 Grand Bar value is null
candy_2016_df = candy_2016_df[candy_2016_df['hundred_grand_bar'].notnull()]
candy_2016_df.shape

(1181, 123)

In [18]:
# rename the '100 Grand Bar' column to 'hundred_grand_bar', 'Your gender:' to 'gender', and 'How old are you?' to 'age'
candy_2017_df = candy_2017_df.rename(columns={candy_2017_df.columns[2]:'gender',
                                              candy_2017_df.columns[3]:'age',
                                              candy_2017_df.columns[6]:'hundred_grand_bar'})
#candy_2017_df.columns


# check for empty values in the hundred_grand_bar column in the 2017 dataset.
miss = candy_2017_df['hundred_grand_bar'].isnull().sum()
if miss>0:
    print("hundred_grand_bar has {} missing value(s)".format(miss))
else:
    print("hundred_grand_bar has no missing values.")

hundred_grand_bar has 747 missing value(s)


In [19]:
# remove rows where the 100 Grand Bar value is null
candy_2017_df = candy_2017_df[candy_2017_df['hundred_grand_bar'].notnull()]
candy_2017_df.shape

(1713, 120)

In [20]:
'''
merge the 2016 and 2017 candy datasets together by adding rows.  Before we do this, we will grab only the gender, age,
and hundred_grand_bar columns and put them into a new dataframe
'''
candy_2016_subset = candy_2016_df[['gender', 'age', 'hundred_grand_bar']]
candy_2017_subset = candy_2017_df[['gender', 'age', 'hundred_grand_bar']]

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
combined_candy_subset = pd.concat([candy_2016_subset, candy_2017_subset])
combined_candy_subset

Unnamed: 0,gender,age,hundred_grand_bar
0,Male,22,JOY
1,Male,45,MEH
2,Female,48,JOY
3,Male,57,JOY
4,Male,42,MEH
...,...,...,...
2454,Female,26,JOY
2455,Male,24,JOY
2456,Female,33,MEH
2457,Female,26,MEH
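Because both subsets keep their original row labels, the combined index can contain the same label twice (once per survey year). One possible way to keep the rows distinguishable is sketched below, using the `keys` argument of `pd.concat`; the toy frames and the `survey_year` and `row` level names are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the 2016 and 2017 subsets (made-up rows)
subset_2016 = pd.DataFrame({"gender": ["Male", "Female"],
                            "age": [22, 48],
                            "hundred_grand_bar": ["JOY", "MEH"]})
subset_2017 = pd.DataFrame({"gender": ["Female"],
                            "age": [26],
                            "hundred_grand_bar": ["JOY"]})

# keys= adds an outer index level recording which survey each row came from
combined = pd.concat([subset_2016, subset_2017],
                     keys=[2016, 2017],
                     names=["survey_year", "row"])
print(combined)

# Alternatively, ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([subset_2016, subset_2017], ignore_index=True)
```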


In [21]:
'''
Now we are ready to pivot the data based on gender.
'''
# drop the rows where age is missing or non-numeric (errors='coerce' turns both into NaN)
combined_candy_subset = combined_candy_subset[pd.to_numeric(combined_candy_subset['age'], errors='coerce').notnull()]

# convert age column to numeric
combined_candy_subset['age'] = pd.to_numeric(combined_candy_subset['age'])

# mean age for each gender / 100 Grand Bar response combination
combined_candy_subset.pivot_table(values='age', index=['gender','hundred_grand_bar'], aggfunc='mean')

gender,hundred_grand_bar,age
Female,DESPAIR,37.26923
Female,JOY,41.43659
Female,MEH,39.3
I'd rather not say,DESPAIR,34.33333
I'd rather not say,JOY,39.35294
I'd rather not say,MEH,86.88462
Male,DESPAIR,41.37333
Male,JOY,43.60352
Male,MEH,1329787000000000.0
Other,DESPAIR,35.66667
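A mean age on the order of 1.3e15 for the Male/MEH group cannot be a real average, which suggests that at least one respondent typed an extremely large number into the age field. The sketch below shows one way this could be handled before pivoting; the toy data and the 1 to 120 cutoff are assumptions chosen for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy data with one absurd age, mimicking the effect seen in the pivot table above
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female", "Female"],
    "hundred_grand_bar": ["MEH", "MEH", "MEH", "JOY", "JOY"],
    "age": [40, 50, 1e16, 30, 50],
})

# A single outlier dominates the group mean
raw_means = toy.pivot_table(values="age",
                            index=["gender", "hundred_grand_bar"],
                            aggfunc="mean")

# Assumed plausibility range: keep ages from 1 to 120 inclusive
plausible = toy[toy["age"].between(1, 120)]
clean_means = plausible.pivot_table(values="age",
                                    index=["gender", "hundred_grand_bar"],
                                    aggfunc="mean")
print(clean_means)
```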


### The 2 methods that I will choose from Chapter 10 are Grouping with Dicts and Split-Apply-Combine. This will be achieved as follows:
1. Grouping with Dicts -- create a mapping of the different types of candy according to manufacturer (or genre: chocolate, hard, soft, sugar). Create a subset of the datasets, and use the 2015 dataset for this exercise. Ex:
mapping = {'COLUMN NAME': 'GENRE/MANUFACTURER', 'COLUMN NAME': 'GENRE/MANUFACTURER'} (page 442 of Python for Data Analysis).
2. Split-Apply-Combine (page 456) -- the exact operation is still to be decided.

In [23]:
candy_2015_df = pd.read_excel('CANDY-HIERARCHY-2015-SURVEY-Responses.xlsx')
candy_2015_df.columns
mapping = {'[Butterfinger]': 'Candy Bar',
           '[100 Grand Bar]': 'Candy Bar',
           '[Anonymous brown globs that come in black and orange wrappers]': 'Unknown',
           '[Any full-sized candy bar]': 'Candy Bar',
           '[Black Jacks]': 'Chewy',
           '[Bonkers]': 'Chewy',
           '[Bottle Caps]': 'Hard',
           '[Box’o’ Raisins]': 'Chocolate',
           '[Brach products (not including candy corn)]': 'Unknown',
           '[Bubble Gum]': 'Chewing Gum',
           '[Cadbury Creme Eggs]': 'Chocolate',
           '[Candy Corn]': 'Unknown',
           '[Chiclets]': 'Chewing Gum',
           '[Caramellos]': 'Caramels',
           '[Snickers]': 'Candy Bar',
           '[Dark Chocolate Hershey]': 'Candy Bar',
           '[Dots]': 'Chewy',
           '[Fuzzy Peaches]': 'N/A',
           '[Generic Brand Acetaminophen]': 'N/A',
           '[Glow sticks]': 'N/A',
           '[Broken glow stick]': 'N/A',
           '[Goo Goo Clusters]': 'Candy Bar',
           '[Good N\' Plenty]': 'Licorice',
           '[Gum from baseball cards]': 'Chewing Gum',
           '[Gummy Bears straight up]': 'Gummies',
           '[Creepy Religious comics/Chick Tracts]': 'N/A',
           '[Healthy Fruit]': 'N/A',
           '[Heath Bar]': 'Candy Bar',
           '[Hershey’s Kissables]': 'Chocolate',
           '[Hershey’s Milk Chocolate]': 'Chocolate',
           '[Hugs (actual physical hugs)]': 'N/A',
           '[Jolly Rancher (bad flavor)]': 'Hard',
           '[Jolly Ranchers (good flavor)]': 'Hard',
           '[Kale smoothie]': 'N/A',
           '[Kinder Happy Hippo]': 'Chocolate',
           '[Kit Kat]': 'Chocolate',
           '[Hard Candy]': 'Hard',
           '[Lapel Pins]': 'N/A',
           '[LemonHeads]': 'Hard',
           '[Licorice]': 'Licorice',
           '[Licorice (not black)]': 'Licorice',
           '[Lindt Truffle]': 'Chocolate',
           '[Lollipops]': 'Lollipops and Sours',
           '[Mars]': 'Candy Bar',
           '[Mary Janes]': 'Chewy',
           '[Maynards]': 'Lollipops and Sours',
           '[Milk Duds]': 'Chocolate',
           '[LaffyTaffy]': 'Chewy',
           '[Minibags of chips]': 'N/A',
           '[JoyJoy (Mit Iodine)]': 'N/A',
           '[Reggie Jackson Bar]': 'Candy Bar',
           '[Pixy Stix]': 'Unknown',
           '[Nerds]': 'Hard',
           '[Nestle Crunch]': 'Chocolate',
           '[Now\'n\'Laters]': 'Chewy',
           '[Pencils]': 'N/A',
           '[Milky Way]': 'Candy Bar',
           '[Reese’s Peanut Butter Cups]': 'Chocolate',
           '[Tolberone something or other]': 'N/A',
           '[Runts]': 'Hard',
           '[Junior Mints]': 'Chocolate',
           '[Senior Mints]': 'N/A',
           '[Mint Kisses]': 'Chocolate',
           '[Mint Juleps]': 'Chewy',
           '[Mint Leaves]': 'N/A',
           '[Peanut M&M’s]': 'Chocolate',
           '[Regular M&Ms]': 'Chocolate',
           '[Mint M&Ms]': 'Chocolate',
           '[Ribbon candy]': 'N/A',
           '[Rolos]': 'Chocolate',
           '[Skittles]': 'Hard',
           '[Smarties (American)]': 'Hard',
           '[Smarties (Commonwealth)]': 'Hard',
           '[Chick-o-Sticks (we don’t know what that is)]': 'N/A',
           '[Spotted Dick]': 'N/A',
           '[Starburst]': 'Chewy',
           '[Swedish Fish]': 'Chewy',
           '[Sweetums]': 'Lollipops and Sours',
           '[Those odd marshmallow circus peanut things]': 'N/A',
           '[Three Musketeers]': 'Candy Bar',
           '[Peterson Brand Sidewalk Chalk]': 'N/A',
           '[Peanut Butter Bars]': 'Candy Bar',
           '[Peanut Butter Jars]': 'N/A',
           '[Trail Mix]': 'N/A',
           '[Twix]': 'Candy Bar',
           '[Vicodin]': 'N/A',
           'TEST':'TEST',
           '[White Bread]': 'N/A',
           '[Whole Wheat anything]': 'N/A',
           '[York Peppermint Patties]': 'Chocolate',
           '[Sea-salt flavored stuff, probably chocolate, since this is the "it" flavor of the year]': 'Chocolate',
           '[Necco Wafers]': 'Unknown'}
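
The mapping above is not yet applied to the data. A minimal sketch of how a column-to-category dict could drive a grouping is shown below, using a three-column toy frame and a shortened copy of the mapping (both are assumptions for illustration). Grouping along the columns with `groupby(mapping, axis=1)` is deprecated in recent pandas releases, so this sketch transposes the frame, groups the rows by the dict, and transposes back.

```python
import pandas as pd

# Toy responses in the survey's JOY/DESPAIR style (made-up rows)
responses = pd.DataFrame({
    "[Butterfinger]": ["JOY", "DESPAIR", "JOY"],
    "[Snickers]": ["JOY", "JOY", None],
    "[Dots]": ["DESPAIR", "JOY", "DESPAIR"],
})

# Shortened, illustrative version of the column-to-category mapping
toy_mapping = {
    "[Butterfinger]": "Candy Bar",
    "[Snickers]": "Candy Bar",
    "[Dots]": "Chewy",
}

# True where the respondent answered JOY; missing answers count as False
is_joy = responses.eq("JOY")

# Transpose so candy names become row labels, group those labels through the dict,
# sum the JOY flags per category, then transpose back to one row per respondent
joy_by_category = is_joy.T.groupby(toy_mapping).sum().T
print(joy_by_category)
```

Each row of the result gives the number of JOY answers a respondent gave within each category. Columns that are missing from the dict are left out of the result.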