# DTSC 580: Data Manipulation

## Assignment:  Halloween Candy

### Name: 

## Overview

In this assignment, your job will be to clean and wrangle data from a survey of Halloween candy to prepare it for a machine learning project. After completing your assignment, you will:

1. submit your notebook to CodeGrade to be automatically graded

2. answer questions about this data and the cleaning process through a Brightspace quiz. 

Once you have completed all the tasks in the notebook, save your notebook as `halloween_candy` and verify that all the tests pass in CodeGrade.  That way, when you take the quiz, you can be more confident that your answers are accurate.  **Note that some CodeGrade automatic tests will not pass until the entire notebook is finished, so wait until you have completed at least the data cleaning section before checking that part of your work.**

## Data Set

The data set that we will be using is the 2017 Halloween Candy Hierarchy data set as discussed in this [boingboing](https://boingboing.net/2017/10/30/the-2017-halloween-candy-hiera.html) article.  You can also read more about the data in the [Science Creative Quarterly](https://www.scq.ubc.ca/so-much-candy-data-seriously/).

The following are the rating instructions from the survey:  

> Basically, consider that feeling you get when you receive this item in your Halloween haul. Does it make you really happy (JOY)? Or is it something that you automatically place in the junk pile (DESPAIR)? MEH for indifference, and you can leave blank if you have no idea what the item is.

Note that the data set provided for this assignment has been slightly altered from the original, so if you want to perform any analysis for future projects, you will need to download the data directly from the links above.

This data is a great example of a messy data set, especially since respondents were allowed to enter free text for a number of the fields. Also, note that some of the comments in the file might be considered inappropriate by some readers, but cleaning this type of data is normal in a lot of data science projects.

## Note

<u>Show Work</u>

Remember that you must show your work.  Student submissions are spot-checked manually throughout the term to verify that answers are not hard-coded from looking only at the file or at CodeGrade's expected output.  If hard-coding is found, the student's answer will be manually marked wrong and their grade will be changed to reflect this.

For example, if the answer to Q1, the mean of a specific column, is 22:
```
# correct way
Q1 = df['column_name'].mean()

# incorrect way
Q1 = 22 
```

## Our End Goal

Our end goal for this project is to clean the data so that we could then create a machine learning model. We want to see if we are able to predict a person's gender based purely on their candy preferences. You will not be creating a model for this assignment, however, only cleaning the data. The results of the models that I used after cleaning the data are provided at the end of this notebook.

## Initial Import & Exploration

In [904]:
# initial imports
import pandas as pd
import numpy as np

# Do not change this option; This allows the CodeGrade auto grading to function correctly
pd.set_option('display.max_columns', 20)

Let's start by importing our data and creating a DataFrame called `candy`.  We need to include `encoding='iso-8859-1'` during the import because the file contains special characters that are not valid under the default encoding. Pandas reads files as `utf-8` by default, so a file saved in a different encoding can raise a `UnicodeDecodeError` on import. This happens a lot with data where the public is able to type in answers, especially if foreign language characters are included. Specifying the encoding allows Pandas to read those characters.

Run the following code, with the encoding argument, and it should import correctly.

In [905]:
# read_csv with iso-8859-1 encoding; using latin-1 would also work here
candy_full = pd.read_csv('candy.csv', encoding='iso-8859-1')

# copy to new DF so that we can have a copy of the original import if needed
candy = candy_full.copy()
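
As a small illustration of why the encoding matters, the sketch below builds a made-up one-column CSV in memory (not the survey file), stores it as ISO-8859-1 bytes, and shows that a default `utf-8` read fails while the explicit encoding succeeds. The sample text and variable names are assumptions for demonstration only.

```python
import io
import pandas as pd

# Hypothetical CSV content with accented characters, stored as ISO-8859-1 bytes
raw_bytes = "name\nCrème brûlée\n".encode("iso-8859-1")

# Reading with the default utf-8 encoding fails on the accented bytes
try:
    pd.read_csv(io.BytesIO(raw_bytes))
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Passing the matching encoding reads the text correctly
demo = pd.read_csv(io.BytesIO(raw_bytes), encoding="iso-8859-1")
print(utf8_failed, demo.loc[0, "name"])
```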

Let's take a brief look at the data by using `head()`.

In [906]:
# first five rows
candy.head()

Unnamed: 0,Internal ID,Q1: GOING OUT?,Q2: GENDER,Q3: AGE,Q4: COUNTRY,"Q5: STATE, PROVINCE, COUNTY, ETC",Q6 | 100 Grand Bar,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes),Q6 | Any full-sized candy bar,Q6 | Black Jacks,...,Q8: DESPAIR OTHER,Q9: OTHER COMMENTS,Q10: DRESS,Unnamed: 113,Q11: DAY,Q12: MEDIA [Daily Dish],Q12: MEDIA [Science],Q12: MEDIA [ESPN],Q12: MEDIA [Yahoo],"Click Coordinates (x, y)"
0,90258773,,,,,,,,,,...,,,,,,,,,,
1,90272821,No,Male,44.0,USA,NM,MEH,DESPAIR,JOY,MEH,...,,Bottom line is Twix is really the only candy w...,White and gold,,Sunday,,1.0,,,"(84, 25)"
2,90272829,,Male,49.0,USA,Virginia,,,,,...,,,,,,,,,,
3,90272840,No,Male,40.0,us,or,MEH,DESPAIR,JOY,MEH,...,,Raisins can go to hell,White and gold,,Sunday,,1.0,,,"(75, 23)"
4,90272841,No,Male,23.0,usa,exton pa,JOY,DESPAIR,JOY,DESPAIR,...,,,White and gold,,Friday,,1.0,,,"(70, 10)"


Next, run the following code to see information about the DataFrame.

In [907]:
# check info about the DataFrame
candy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2479 entries, 0 to 2478
Columns: 120 entries, Internal ID to Click Coordinates (x, y)
dtypes: float64(4), int64(1), object(115)
memory usage: 2.3+ MB


Notice that this did not print the columns as you might be used to seeing. According to the Pandas documentation:  "If the DataFrame has more than max_cols columns, the truncated output is used. By default, the setting in pandas.options.display.max_info_columns is used." 

We can make the columns display by setting the `max_cols` argument equal to the number of columns in the data set.

In [908]:
# check info, set max_cols
candy.info(max_cols=120)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2479 entries, 0 to 2478
Data columns (total 120 columns):
 #    Column                                                                                 Non-Null Count  Dtype  
---   ------                                                                                 --------------  -----  
 0    Internal ID                                                                            2479 non-null   int64  
 1    Q1: GOING OUT?                                                                         2368 non-null   object 
 2    Q2: GENDER                                                                             2437 non-null   object 
 3    Q3: AGE                                                                                2394 non-null   object 
 4    Q4: COUNTRY                                                                            2414 non-null   object 
 5    Q5: STATE, PROVINCE, COUNTY, ETC                                   
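
To see the truncation rule in isolation, here is a minimal sketch on a made-up wide DataFrame (150 columns, more than the default `display.max_info_columns` of 100). It captures the output of `info()` in a buffer, once with defaults and once with `max_cols` set from the column count. The frame and names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical wide frame: one row, 150 columns
wide = pd.DataFrame([[0] * 150], columns=[f"col_{i}" for i in range(150)])

# Default call: too many columns, so only a summary line is produced
buf = io.StringIO()
wide.info(buf=buf)
short = buf.getvalue()

# Setting max_cols from the frame itself avoids hard-coding the number
buf = io.StringIO()
wide.info(buf=buf, max_cols=len(wide.columns))
full = buf.getvalue()

print(len(short.splitlines()), len(full.splitlines()))
```

Using `len(wide.columns)` rather than a literal keeps the call correct if columns are later added or dropped.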

Of course, if you are just looking for the column names, you can just use a simple `for` loop.

In [909]:
# print a list of column names
for col in candy.columns:
    print(col)

Internal ID
Q1: GOING OUT?
Q2: GENDER
Q3: AGE
Q4: COUNTRY
Q5: STATE, PROVINCE, COUNTY, ETC
Q6 | 100 Grand Bar
Q6 | Anonymous brown globs that come in black and orange wrappers	(a.k.a. Mary Janes)
Q6 | Any full-sized candy bar
Q6 | Black Jacks
Q6 | Bonkers (the candy)
Q6 | Bonkers (the board game)
Q6 | Bottle Caps
Q6 | Box'o'Raisins
Q6 | Broken glow stick
Q6 | Butterfinger
Q6 | Cadbury Creme Eggs
Q6 | Candy Corn
Q6 | Candy that is clearly just the stuff given out for free at restaurants
Q6 | Caramellos
Q6 | Cash, or other forms of legal tender
Q6 | Chardonnay
Q6 | Chick-o-Sticks (we donÕt know what that is)
Q6 | Chiclets
Q6 | Coffee Crisp
Q6 | Creepy Religious comics/Chick Tracts
Q6 | Dental paraphenalia
Q6 | Dots
Q6 | Dove Bars
Q6 | Fuzzy Peaches
Q6 | Generic Brand Acetaminophen
Q6 | Glow sticks
Q6 | Goo Goo Clusters
Q6 | Good N' Plenty
Q6 | Gum from baseball cards
Q6 | Gummy Bears straight up
Q6 | Hard Candy
Q6 | Healthy Fruit
Q6 | Heath Bar
Q6 | Hershey's Dark Chocolate
Q6 | HersheyÕ

This data set is pretty messy. Your goal is now to perform the following actions to get it to the point where it can be passed to a machine learning model.

**Note: Unless the instructions ask you to do something different, please always update the original `candy` DataFrame for the exercises below.  The automatic grading in CodeGrade will check your final DataFrame and ensure that you have performed all required data manipulations.  Also, feel free to add additional cells as needed.**  

## Data Cleaning

**Exercise_A:** Taking a look at the column names, you may notice that some include the character `Õ`. This should instead be an apostrophe (`'`). Rename the columns that include the `Õ` character, replacing it with an apostrophe.  

Remember that you should be updating the `candy` DataFrame for the tasks listed as "Exercises" unless told differently. 

In [910]:
### ENTER CODE HERE ###

# replace the mis-decoded character in every column name in one vectorized step
candy.columns = candy.columns.str.replace('Õ', "'", regex=False)
  
#for col in candy.columns:
 #   print(col)
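
One plausible explanation for the stray `Õ` (an assumption, not something the survey authors state) is that the file was saved in the Mac OS Roman encoding, where the curly apostrophe is stored as byte `0xD5`; read as ISO-8859-1, that same byte is `Õ`. The sketch below demonstrates the byte mapping and then repairs a toy frame whose column name is copied from the listing above.

```python
import pandas as pd

# The right single quotation mark is byte 0xD5 in Mac OS Roman...
mac_byte = "’".encode("mac_roman")
# ...and byte 0xD5 decodes to the letter Õ in ISO-8859-1
misread = mac_byte.decode("iso-8859-1")

# Toy frame (made-up values) with one affected column name
toy = pd.DataFrame({
    "Q6 | Chick-o-Sticks (we donÕt know what that is)": ["JOY"],
    "Q6 | Dots": ["MEH"],
})

# rename() accepts a function that is applied to every column label
toy = toy.rename(columns=lambda name: name.replace("Õ", "'"))
print(list(toy.columns))
```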

**Q1:** How many duplicated rows are there in the file? Assume that a duplicate is any row that is *exactly* the same as another one. Save this number as `Q1`.

In [911]:
### ENTER CODE HERE ###
Q1 = candy.duplicated().sum()

**Q2:** How many duplicated rows are there in the file if we were to assume that a duplicate is any row with the same `Internal ID` number as another. In other words, even if the other values are different, a row would count as a duplicate if it had the same `Internal ID` as another. Save this number as `Q2`.

In [912]:
### ENTER CODE HERE ###
Q2 = candy.duplicated(subset=['Internal ID']).sum()

**Exercise_B:** Drop any duplicates from the `candy` DataFrame.  Duplicates are to be defined as any row with the same `Internal ID` as another. Use the default setting that keeps the first record from the duplicates.

In [913]:
### ENTER CODE HERE ###
candy.drop_duplicates(subset=['Internal ID'], inplace=True)
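
The two definitions of "duplicate" used above can give different counts. A minimal sketch on made-up rows (not the survey data) shows the difference between comparing whole rows and comparing only the `Internal ID` column:

```python
import pandas as pd

# Hypothetical rows: ID 1 is repeated exactly, ID 2 is repeated with a different answer
toy = pd.DataFrame({
    "Internal ID": [1, 1, 2, 2, 3],
    "answer": ["JOY", "JOY", "MEH", "DESPAIR", "JOY"],
})

# Exact duplicates: every column must match an earlier row
exact_dupes = toy.duplicated().sum()

# ID duplicates: only the Internal ID must match an earlier row
id_dupes = toy.duplicated(subset=["Internal ID"]).sum()

# keep='first' is the default, so the first row for each ID survives
deduped = toy.drop_duplicates(subset=["Internal ID"])
print(exact_dupes, id_dupes, len(deduped))
```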

**Exercise_C:** Your next task is to remove the following columns from the `candy` DataFrame as we will not use these columns for this project.  You are welcome to do further analysis on these columns but do not save your analysis in this notebook.

Remove the following columns: `Internal ID`, `Q5: STATE, PROVINCE, COUNTY, ETC`, `Q7: JOY OTHER`, `Q8: DESPAIR OTHER`, `Q9: OTHER COMMENTS`, `Unnamed: 113`, `Click Coordinates (x, y)`.

In [914]:
### ENTER CODE HERE ###
candy.drop(['Internal ID', 'Q5: STATE, PROVINCE, COUNTY, ETC', 'Q7: JOY OTHER', 'Q8: DESPAIR OTHER', 'Q9: OTHER COMMENTS', 'Unnamed: 113', 'Click Coordinates (x, y)'], axis=1, inplace=True)


**Code Check:** As a check for the above exercises, the shape of your data should now be: `(2460, 113)`

In [915]:
### ENTER CODE HERE ###
candy.shape

(2460, 113)

**Exercise_D:** Let's now take a look at the `Q2: GENDER` column since this will be what we are trying to predict. Take a look at the value counts for this column.

In [916]:
### ENTER CODE HERE ###
candy['Q2: GENDER'].value_counts()

Male                  1466
Female                 839
I'd rather not say      83
Other                   30
Name: Q2: GENDER, dtype: int64

**Q3:** How many missing values are in the `Q2: GENDER` column? Save this as `Q3`.

In [917]:
### ENTER CODE HERE ###
Q3 = candy['Q2: GENDER'].isnull().sum()

**Exercise_E:** Using the `candy` DataFrame, remove all rows with a missing value in the `Q2: GENDER` column.  (This should overwrite and be saved as `candy` like you have been doing for the previous exercises.)

In [918]:
### ENTER CODE HERE ###
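# drop whole rows where gender is missing (dropna on the column alone would not remove rows from candy)
candy = candy.dropna(subset=['Q2: GENDER'])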
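
As a quick illustration on made-up data, `dropna(subset=...)` removes entire rows based on missing values in the named column only; missing values in other columns are left alone. Calling `dropna` on a single selected column works on that Series and does not remove rows from the DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing gender and a missing age
toy = pd.DataFrame({
    "Q2: GENDER": ["Male", np.nan, "Female"],
    "Q3: AGE": ["44", "49", np.nan],
})

# Only rows with a missing Q2: GENDER are dropped
toy = toy.dropna(subset=["Q2: GENDER"])
print(toy.shape)
```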

**Exercise_F:** For this project, we want to use binary classification, which predicts one of two classes. We want to predict between `Male` or `Female`. Because of this, select only the rows that contain either `Male` or `Female` in the `Q2: GENDER` column.

In [919]:
### ENTER CODE HERE ###
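# keep only Male/Female rows; .copy() makes candy an independent DataFrame so later assignments do not trigger SettingWithCopyWarning
candy = candy[candy['Q2: GENDER'].isin(['Male', 'Female'])].copy()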
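
A minimal sketch, on made-up data, of filtering with `isin()` and taking an explicit `.copy()` of the result. Assigning into a filtered frame that was not copied is a common source of pandas' `SettingWithCopyWarning`; copying first makes it clear that the new frame is independent of the original.

```python
import pandas as pd

# Hypothetical survey rows
toy = pd.DataFrame({
    "Q2: GENDER": ["Male", "Female", "Other", "I'd rather not say"],
    "Q1: GOING OUT?": ["No", None, "Yes", None],
})

# Filter to two classes and copy so the subset owns its data
subset = toy[toy["Q2: GENDER"].isin(["Male", "Female"])].copy()

# Assigning into the copy leaves the original frame untouched
subset["Q1: GOING OUT?"] = subset["Q1: GOING OUT?"].fillna("No")
print(subset)
```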

**Code Check:** As a check for the above exercises, the shape of your data should now be: `(2305, 113)`

In [920]:
### ENTER CODE HERE ###
candy.shape

(2305, 113)

Now, let's work on filling some of the missing data.  There are easier ways to do this with the sklearn library which you will learn about more in the machine learning classes, but for now, let's try to practice our Pandas skills.

**Q4:** How many missing values are in the `Q1: GOING OUT?` column? Save this number as `Q4`.

In [921]:
### ENTER CODE HERE ###
Q4 = candy['Q1: GOING OUT?'].isnull().sum()


**Exercise_G:** For a future analysis question, we are interested in the respondents who we know will *definitely* go out for Halloween.  Because of this, fill all missing values in the `Q1: GOING OUT?` column with a `No` value.

In [922]:
### ENTER CODE HERE ###
candy['Q1: GOING OUT?'] = candy['Q1: GOING OUT?'].fillna('No')




**Code Check:** Double check your work above and look at the value counts for the `Q1: GOING OUT?` column.  Make sure that you only have "Yes" and "No" values and that they add up to 2305, which is the number of rows you should have at this step in the assignment.

In [923]:
### ENTER CODE HERE ###
candy['Q1: GOING OUT?'].value_counts()
candy['Q1: GOING OUT?'].value_counts().sum()

2305

**Q5:** To get ready for the next step, let's practice selecting all the columns: going from `Q6 | 100 Grand Bar` to `Q11: DAY`.  Save this slice as `Q5`.

In [924]:
### ENTER CODE HERE ###
Q5 = candy.loc[:,'Q6 | 100 Grand Bar':'Q11: DAY']
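
Unlike positional slicing, a label slice with `.loc` includes both endpoints. The sketch below, on a made-up four-column frame, selects a block of columns by label and fills missing values in that block only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; columns "b" and "c" play the role of the rating columns
toy = pd.DataFrame({
    "a": [1, 2],
    "b": ["JOY", np.nan],
    "c": [np.nan, "MEH"],
    "d": [np.nan, 1.0],
})

# Both "b" and "c" are included in a label-based slice
middle = toy.loc[:, "b":"c"]

# Fill only the sliced block; column "d" keeps its missing value
toy.loc[:, "b":"c"] = toy.loc[:, "b":"c"].fillna("NO_ANSWER")
print(toy)
```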

**Exercise_H:** Now that you know how to slice the data, fill any missing values in the `candy` DataFrame for those columns (going from `Q6 | 100 Grand Bar` to `Q11: DAY`) with the string `NO_ANSWER`. 

In [925]:
### ENTER CODE HERE ###
candy.loc[:,'Q6 | 100 Grand Bar':'Q11: DAY'] = candy.loc[:,'Q6 | 100 Grand Bar':'Q11: DAY'].fillna('NO_ANSWER')



**Exercise_I:** For all four `Q12: MEDIA` columns in the `candy` DataFrame, fill the missing values with `0.0`.

In [926]:
### ENTER CODE HERE ###
candy.loc[:,candy.columns.str.contains('Q12:')] = candy.loc[:,candy.columns.str.contains('Q12:')].fillna(0.0)
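
When the target columns share a prefix, they can be picked out by name pattern instead of being listed by hand. A small sketch on made-up data, using `str.startswith` on the column index (one of several ways to do this):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two media columns and one unrelated column
toy = pd.DataFrame({
    "Q11: DAY": ["Friday", "Sunday"],
    "Q12: MEDIA [Science]": [1.0, np.nan],
    "Q12: MEDIA [ESPN]": [np.nan, np.nan],
})

# Select the column labels that start with the shared prefix
media_cols = toy.columns[toy.columns.str.startswith("Q12:")]

# Fill missing values in just those columns
toy[media_cols] = toy[media_cols].fillna(0.0)
print(toy)
```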
  

**Code Check:** As a check for the above code, make sure that there are no missing values left for the `Q6` to `Q12` columns.  

In [927]:
### ENTER CODE HERE ###
candy.loc[:,'Q6 | 100 Grand Bar':'Q12: MEDIA [Yahoo]'].isna().sum()

Q6 | 100 Grand Bar                                                                        0
Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes)    0
Q6 | Any full-sized candy bar                                                             0
Q6 | Black Jacks                                                                          0
Q6 | Bonkers (the candy)                                                                  0
                                                                                         ..
Q11: DAY                                                                                  0
Q12: MEDIA [Daily Dish]                                                                   0
Q12: MEDIA [Science]                                                                      0
Q12: MEDIA [ESPN]                                                                         0
Q12: MEDIA [Yahoo]                                                              

Now, let's look at the very messy `Q4: COUNTRY` column and see what we can do about it. First, run the code below to look at the different unique values in the data.

In [928]:
# check unique values
candy['Q4: COUNTRY'].unique()

array(['USA ', 'USA', 'us', 'usa', nan, 'canada', 'Canada', 'Us', 'US',
       'Murica', 'United States', 'uk', 'United Kingdom', 'united states',
       'Usa', 'United States ', 'United staes',
       'United States of America', 'UAE', 'England', 'UK', 'canada ',
       'United states', 'u.s.a.', '35', 'france',
       'United States of America ', 'america', 'U.S.A.', 'finland',
       'unhinged states', 'Mexico', 'Canada ', 'united states of america',
       'US of A', 'The United States', 'North Carolina ', 'Unied States',
       'Netherlands', 'germany', 'Europe', 'U S', 'u.s.', 'U.K. ',
       'Costa Rica', 'The United States of America', 'unite states',
       'U.S.', '46', 'Australia', 'Greece', 'USA? Hard to tell anymore..',
       "'merica", '45', 'United State', '32', 'France', 'australia',
       'Can', 'Canae', 'Trumpistan', 'Ireland', 'United Sates', 'Korea',
       'California', 'Unites States', 'Japan', 'USa', 'South africa',
       'I pretend to be from Canada, but I am

**Code Check:** As a check for the Country column, check to see how many unique values are in the data.  You should have `115` different unique values for the `Q4: COUNTRY` column.  If you have fewer or more than this number, double check your work above.

In [929]:
# check the Q4: COUNTRY number of unique values
candy['Q4: COUNTRY'].nunique()

115

We want to clean up this data to only include four areas: USA, Canada, Europe (the continent, not necessarily the European Union), and Other.

There are different ways to do this, but I would suggest that you look at the way we handled the `property_type` column in the `vienna` data set and the code in the `amenities_to_columns()` function in the module notebook.  These might be a little harder than those examples but they should give you a good baseline approach.  

You could use `replace()` for this step, and it is fine if you ultimately decide to do this, but I would suggest that you come up with a solution similar to what was shown in the `vienna` data cleaning notebook.  This method would be much more robust if you had many more values in your data.

I suggest the following order for this section to make it easier:
- Fill in all missing values with `Other`
- Code Australia as `Other` (doing this step will help when trying to use `us` in the next step if you use string methods)
- Combine all USA entries together as `USA`
- Combine Canadian entries as `CA`
- Combine European entries as `EU`
- Everything else gets coded as `Other`
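
One possible way to make this kind of recoding more robust, sketched here on a handful of made-up entries, is to normalize the text first (lowercase, strip everything but letters) and then map the normalized values through a lookup table. The `lookup` dictionary is a hypothetical, deliberately tiny example and is not a complete solution to the exercises below.

```python
import pandas as pd

# Hypothetical raw answers, including a missing value
raw = pd.Series(["USA ", "u.s.a.", "Canada`", "UK", "germany", None, "Narnia"])

# Normalize: lowercase, then remove every character that is not a letter
cleaned = raw.str.lower().str.replace(r"[^a-z]", "", regex=True)

# Map normalized values to regions; anything unmatched (or missing) becomes Other
lookup = {"usa": "USA", "canada": "CA", "uk": "EU", "germany": "EU"}
region = cleaned.map(lookup).fillna("Other")
print(region.tolist())
```

Normalizing first means that variants such as `'USA '` and `'u.s.a.'` collapse to one key, so the lookup table stays much shorter than a list of every raw spelling.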

**Exercise_J:** Fill the missing values in the `Q4: COUNTRY` column with `Other`.

In [930]:
### ENTER CODE HERE ###

candy.loc[:,'Q4: COUNTRY'] = candy.loc[:,'Q4: COUNTRY'].fillna('Other')

In [931]:
def update_country(data, col='Q4: COUNTRY'):
    """Rough draft of a string-matching recode; not called below (the exercises do this step by step)."""
    # fill missing values first so the .str methods do not return NaN
    data[col] = data[col].fillna('Other')
    # recode Australia before matching 'us', which would otherwise catch 'Australia'
    data.loc[data[col].str.contains('australia', case=False), col] = 'Other'
    data.loc[data[col].str.contains('us|america', case=False), col] = 'USA'
    data.loc[data[col].str.contains('canada', case=False), col] = 'CA'
    data.loc[data[col].str.contains('europe|england|uk|greece|german|land', case=False), col] = 'EU'
    # note: these patterns miss several entries (e.g. 'United States', 'france'), so this is not a full solution
    return data


**Code Check:** Double check that there are no missing values in the `Q4: COUNTRY` column.  Also, double check the unique values to make sure that "Other" was added.  This should mean that you now have `116` unique values for this column.

In [932]:
# check missing Q4 values
### ENTER CODE HERE ###
candy['Q4: COUNTRY'].isnull().sum()

0

In [933]:
# check unique values 
### ENTER CODE HERE ###
print(candy['Q4: COUNTRY'].unique())
print(candy['Q4: COUNTRY'].nunique())


['USA ' 'USA' 'us' 'usa' 'Other' 'canada' 'Canada' 'Us' 'US' 'Murica'
 'United States' 'uk' 'United Kingdom' 'united states' 'Usa'
 'United States ' 'United staes' 'United States of America' 'UAE'
 'England' 'UK' 'canada ' 'United states' 'u.s.a.' '35' 'france'
 'United States of America ' 'america' 'U.S.A.' 'finland'
 'unhinged states' 'Mexico' 'Canada ' 'united states of america' 'US of A'
 'The United States' 'North Carolina ' 'Unied States' 'Netherlands'
 'germany' 'Europe' 'U S' 'u.s.' 'U.K. ' 'Costa Rica'
 'The United States of America' 'unite states' 'U.S.' '46' 'Australia'
 'Greece' 'USA? Hard to tell anymore..' "'merica" '45' 'United State' '32'
 'France' 'australia' 'Can' 'Canae' 'Trumpistan' 'Ireland' 'United Sates'
 'Korea' 'California' 'Unites States' 'Japan' 'USa' 'South africa'
 'I pretend to be from Canada, but I am really from the United States.'
 'Usa ' 'Uk' 'Germany' 'Canada`' 'Scotland' 'UK ' 'Denmark'
 'United Stated' 'France ' 'Switzerland' 'UD' 'Scotland ' 'South

**Exercise_K:** Combine all Australia entries into `Other`.  Watch out for capitalization issues.  You should have `114` unique values after this step.

In [934]:
### ENTER CODE HERE ###
candy.loc[candy['Q4: COUNTRY'].str.contains('Australia', case=False), 'Q4: COUNTRY'] = 'Other'
candy[candy['Q4: COUNTRY']=='Other']

Unnamed: 0,Q1: GOING OUT?,Q2: GENDER,Q3: AGE,Q4: COUNTRY,Q6 | 100 Grand Bar,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes),Q6 | Any full-sized candy bar,Q6 | Black Jacks,Q6 | Bonkers (the candy),Q6 | Bonkers (the board game),...,Q6 | Whatchamacallit Bars,Q6 | White Bread,Q6 | Whole Wheat anything,Q6 | York Peppermint Patties,Q10: DRESS,Q11: DAY,Q12: MEDIA [Daily Dish],Q12: MEDIA [Science],Q12: MEDIA [ESPN],Q12: MEDIA [Yahoo]
5,No,Male,,Other,JOY,DESPAIR,JOY,NO_ANSWER,NO_ANSWER,NO_ANSWER,...,JOY,DESPAIR,DESPAIR,JOY,NO_ANSWER,NO_ANSWER,0.0,1.0,0.0,0.0
10,Yes,Male,43.0,Other,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,0.0,0.0,0.0,0.0
118,No,Male,,Other,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,0.0,0.0,0.0,0.0
154,Yes,Female,,Other,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,0.0,0.0,0.0,0.0
358,No,Female,,Other,JOY,MEH,MEH,DESPAIR,DESPAIR,MEH,...,JOY,DESPAIR,DESPAIR,JOY,Blue and black,Sunday,0.0,1.0,0.0,0.0
536,No,Female,,Other,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,0.0,0.0,0.0,0.0
576,No,Male,,Other,MEH,DESPAIR,JOY,NO_ANSWER,NO_ANSWER,NO_ANSWER,...,NO_ANSWER,DESPAIR,DESPAIR,MEH,White and gold,Sunday,0.0,0.0,1.0,0.0
611,No,Male,,Other,MEH,MEH,JOY,MEH,MEH,DESPAIR,...,DESPAIR,DESPAIR,DESPAIR,JOY,White and gold,Friday,0.0,1.0,0.0,0.0
637,Yes,Female,7.0,Other,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,0.0,0.0,0.0,0.0
716,No,Male,,Other,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,0.0,0.0,0.0,0.0


In [935]:
# check unique values
### ENTER CODE HERE ###
candy['Q4: COUNTRY'].nunique()

114

**Exercise_L:** Combine all United States entries together into `USA`.  These would include the following:
```
'USA ', 'USA', 'us', 'usa', 'Us', 'US', 'Murica', 'United States', 'united states', 'Usa', 'United States ', 'United staes', 'United States of America', 'United states', 'u.s.a.', 'United States of America ', 'america', 'U.S.A.', 'unhinged states', 'united states of america', 'US of A', 'The United States', 'North Carolina ', 'Unied States', 'U S', 'u.s.', 'The United States of America', 'unite states','U.S.', 'USA? Hard to tell anymore..', "'merica", 'United State', 'United Sates', 'California', 'Unites States', 'USa', 'I pretend to be from Canada, but I am really from the United States.', 'Usa ', 'United Stated', 'New Jersey', 'United ststes', 'America', 'United Statss', 'murrika', 'USA! USA! USA!', 'USAA', 'united States ', 'N. America', 'USSA', 'U.S. ', 'u s a', 'United Statea', 'united ststes', 'USA USA USA!!!!'
```

In [936]:
### ENTER CODE HERE ###
usa_names = ['USA ', 'USA', 'us', 'usa', 'Us', 'US', 'Murica', 'United States', 'united states', 'Usa', 'United States ', 'United staes', 'United States of America', 'United states', 'u.s.a.', 'United States of America ', 'america', 'U.S.A.', 'unhinged states', 'united states of america', 'US of A', 'The United States', 'North Carolina ', 'Unied States', 'U S', 'u.s.', 'The United States of America', 'unite states','U.S.', 'USA? Hard to tell anymore..', "'merica", 'United State', 'United Sates', 'California', 'Unites States', 'USa', 'I pretend to be from Canada, but I am really from the United States.', 'Usa ', 'United Stated', 'New Jersey', 'United ststes', 'America', 'United Statss', 'murrika', 'USA! USA! USA!', 'USAA', 'united States ', 'N. America', 'USSA', 'U.S. ', 'u s a', 'United Statea', 'united ststes', 'USA USA USA!!!!']

candy.loc[candy['Q4: COUNTRY'].isin(usa_names), 'Q4: COUNTRY'] = 'USA'

**Code Check:** You should be merging the above values together into 1 (`USA`) and be left with 61 unique values after this step (including the `USA` value).

In [937]:
# check unique values
### ENTER CODE HERE ###
print(candy['Q4: COUNTRY'].unique())
print(candy['Q4: COUNTRY'].nunique())

['USA' 'Other' 'canada' 'Canada' 'uk' 'United Kingdom' 'UAE' 'England'
 'UK' 'canada ' '35' 'france' 'finland' 'Mexico' 'Canada ' 'Netherlands'
 'germany' 'Europe' 'U.K. ' 'Costa Rica' '46' 'Greece' '45' '32' 'France'
 'Can' 'Canae' 'Trumpistan' 'Ireland' 'Korea' 'Japan' 'South africa' 'Uk'
 'Germany' 'Canada`' 'Scotland' 'UK ' 'Denmark' 'France ' 'Switzerland'
 'UD' 'Scotland ' 'South Korea' 'CANADA' 'Indonesia' 'The Netherlands'
 'endland' 'soviet canuckistan' 'Singapore' 'China' 'Taiwan' 'Ireland '
 'hong kong' 'spain' 'Sweden' 'Hong Kong' 'Narnia'
 'subscribe to dm4uz3 on youtube' 'United kingdom' "I don't know anymore"
 'Fear and Loathing']
61


**Exercise_M:** Combine the Canadian entries (both upper and lower case) and label them as `CA`. Be careful as there are extra spaces, characters, and misspellings (Can, Canae). 

These values include:
```
'canada', 'Canada', 'canada ', 'Canada ', 'Can', 'Canae', 'Canada`', 'CANADA'
```

In [938]:
### ENTER CODE HERE ###
candy.loc[candy['Q4: COUNTRY'].str.contains('^can', case=False), 'Q4: COUNTRY'] = 'CA'
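
The `^` in the pattern anchors the match to the start of the string. A short sketch on values taken from the listings above shows why that matters: without the anchor, `'soviet canuckistan'` would also be matched.

```python
import pandas as pd

# Canadian variants plus two entries that should not be recoded
toy = pd.Series(["canada", "Canada ", "Can", "Canae", "Canada`", "CANADA",
                 "soviet canuckistan", "Mexico"])

# Anchored: the value must begin with "can" (case-insensitive)
anchored = toy.str.contains("^can", case=False)

# Unanchored: "can" may appear anywhere in the value
unanchored = toy.str.contains("can", case=False)

print(anchored.sum(), unanchored.sum())
```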

**Code Check:** You should be merging 8 values together into 1 (`CA`) and be left with 54 unique values after this step (including the `CA` value).

In [939]:
# check unique values
### ENTER CODE HERE ###
print(candy['Q4: COUNTRY'].unique())
print(candy['Q4: COUNTRY'].nunique())

['USA' 'Other' 'CA' 'uk' 'United Kingdom' 'UAE' 'England' 'UK' '35'
 'france' 'finland' 'Mexico' 'Netherlands' 'germany' 'Europe' 'U.K. '
 'Costa Rica' '46' 'Greece' '45' '32' 'France' 'Trumpistan' 'Ireland'
 'Korea' 'Japan' 'South africa' 'Uk' 'Germany' 'Scotland' 'UK ' 'Denmark'
 'France ' 'Switzerland' 'UD' 'Scotland ' 'South Korea' 'Indonesia'
 'The Netherlands' 'endland' 'soviet canuckistan' 'Singapore' 'China'
 'Taiwan' 'Ireland ' 'hong kong' 'spain' 'Sweden' 'Hong Kong' 'Narnia'
 'subscribe to dm4uz3 on youtube' 'United kingdom' "I don't know anymore"
 'Fear and Loathing']
54


**Exercise_N:** Combine the European entries and label them as `EU`. Again, we are looking at the continent of Europe and not necessarily the countries that are a part of the European Union.  

These values include:
```
'uk', 'United Kingdom', 'England', 'UK', 'france', 'finland', 'Netherlands', 'germany', 'Europe', 'U.K. ', 'Greece', 'France', 'Ireland', 'Uk', 'Germany', 'Scotland', 'UK ', 'Denmark', 'France ', 'Switzerland', 'Scotland ', 'The Netherlands', 'Ireland ', 'spain', 'Sweden', 'United kingdom'
```

In [940]:
### ENTER CODE HERE ###
europe = ['uk', 'United Kingdom', 'England', 'UK', 'france', 'finland', 'Netherlands', 'germany', 'Europe', 'U.K. ', 'Greece', 'France', 'Ireland', 'Uk', 
          'Germany', 'Scotland', 'UK ', 'Denmark', 'France ', 'Switzerland', 'Scotland ', 'The Netherlands', 'Ireland ', 'spain', 'Sweden', 'United kingdom']

candy.loc[candy['Q4: COUNTRY'].isin(europe), 'Q4: COUNTRY'] = "EU"

**Code Check:** You should be merging 26 entries together and be left with 29 unique values after this step (including the `EU` value).

In [941]:
# check unique values
### ENTER CODE HERE ###
print(candy['Q4: COUNTRY'].unique())
print(candy['Q4: COUNTRY'].nunique())

['USA' 'Other' 'CA' 'EU' 'UAE' '35' 'Mexico' 'Costa Rica' '46' '45' '32'
 'Trumpistan' 'Korea' 'Japan' 'South africa' 'UD' 'South Korea'
 'Indonesia' 'endland' 'soviet canuckistan' 'Singapore' 'China' 'Taiwan'
 'hong kong' 'Hong Kong' 'Narnia' 'subscribe to dm4uz3 on youtube'
 "I don't know anymore" 'Fear and Loathing']
29


**Exercise_O:** Finally, combine the other entries and label them as `Other`.

In [942]:
### ENTER CODE HERE ###
candy.loc[~candy['Q4: COUNTRY'].isin(['USA','CA','EU']), 'Q4: COUNTRY'] = 'Other'
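
An equivalent way to express "keep these labels, recode everything else" is `Series.where`, which keeps values where the condition is true and substitutes the given value elsewhere. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical column after the USA, CA, and EU steps
toy = pd.Series(["USA", "CA", "EU", "Narnia", "46", "Other"])

keep = ["USA", "CA", "EU"]

# Values in keep are retained; all others are replaced with "Other"
recoded = toy.where(toy.isin(keep), "Other")
print(recoded.tolist())
```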

**Code Check:** Double check that you only have four unique values in the `Q4: COUNTRY` column: `USA`, `Other`, `CA`, and `EU`

In [943]:
# check unique values
### ENTER CODE HERE ###
print(candy['Q4: COUNTRY'].unique())
print(candy['Q4: COUNTRY'].nunique())

['USA' 'Other' 'CA' 'EU']
4


**Q6:** To double check that everything was coded correctly, save the value counts of the `Q4: COUNTRY` column as `Q6`.  You can check this once you run your CodeGrade check.

In [944]:
### ENTER CODE HERE ###
Q6 = candy['Q4: COUNTRY'].value_counts()

We now want to look at the `Q3: AGE` column. Let's look at all the unique values.

In [945]:
# check unique age values for the age column
### ENTER CODE HERE ###
candy['Q3: AGE'].unique()

array(['44', '49', '40', '23', nan, '53', '33', '43', '56', '64', '37',
       '48', '54', '36', '45', '25', '34', '35', '38', '58', '50', '47',
       '16', '52', '63', '65', '41', '27', '31', '59', '61', '46', '42',
       '62', '29', '39', '32', '28', '69', '67', '30', '22', '51', '70',
       '24', '19', 'Old enough', '57', '60', '66', '12', 'Many', '55',
       '72', '?', '21', '11', 'no', '9', '68', '20', '6', '10', '71',
       '13', '26', '45-55', '7', '39.4', '74', '18', 'older than dirt',
       '17', '15', '8', '75', '5u', 'Enough', 'Over 50', '90', '76',
       'sixty-nine', 'ancient', '77', 'OLD', 'old', '73', '70 1/2', '14',
       'MY NAME JEFF', '4', '59 on the day after Halloween', 'old enough',
       'your mom', 'I can remember when Java was a cool new language',
       '60+'], dtype=object)

Again, this is a pretty messy column of data. This is a good example of why people who create online surveys shouldn't allow respondents to type arbitrary values into a field. But it is now our job to clean this up.

**Exercise_P:** Your task is to put these values into the following categorical bins: `unknown`, `17 and under`, `18-25`, `26-35`, `36-45`, `46-55`, and `56+`. 

- The category labels should exactly match the above.
- Missing values should be replaced with the `unknown` category
- To make things easier and avoid ambiguity, let's say that any value with text, even if we could determine the age, will be binned with the `unknown` category. For example: `sixty-nine` should be coded as `unknown`, `45-55` should be coded as `unknown`, `59 on the day after Halloween` should be coded as `unknown`, etc.
- Ensure that the category labels are unordered but reorder the categories so that 'unknown' is listed in the first position. This is not really needed but will help us grade your assignment. The categories should be listed as follows: `Index(['unknown', '17 and under', '18-25', '26-35', '36-45', '46-55', '56+'], dtype='object')`

First, we will replace any non-numeric value (those with text as mentioned above) with a missing value.  This will allow you to turn the other values into floats so that you can bin them. Just don't forget to code the missing values as `unknown` when you are done.  To replace the non-numeric values, run the following code:

In [946]:
# create True/False index
age_index = candy['Q3: AGE'].str.isnumeric()

# for the index, fill missing values with False
age_index = age_index.fillna(False)

# select Age column for only those False values from index and code as missing
candy.loc[~age_index, 'Q3: AGE'] = np.nan
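
To see what the mask above keeps and discards, here is a minimal sketch on a handful of toy values modeled on the raw responses (the `raw_age`, `mask`, and `cleaned` names are made up for illustration and are not part of the assignment). Note that `str.isnumeric()` is only true when every character is numeric, so a decimal such as `39.4` is also treated as non-numeric.

```python
import numpy as np
import pandas as pd

# Toy values modeled on the raw age responses (hypothetical sample)
raw_age = pd.Series(['24', '39.4', 'sixty-nine', '45-55', np.nan, '5u'], dtype=object)

# True only when every character is numeric; missing values stay missing
is_numeric = raw_age.str.isnumeric()

# Treat missing entries as non-numeric and force a plain boolean mask
mask = is_numeric.fillna(False).astype(bool)

# Keep the purely numeric strings and blank out everything else
cleaned = raw_age.where(mask)
print(cleaned.tolist())
```

Only `'24'` survives in this sample; every other entry becomes missing and would later be coded as `unknown`.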

In [947]:
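### ENTER REST OF CODE HERE ###
# convert the remaining numeric strings to floats
# (assign the column directly so the dtype actually changes)
candy['Q3: AGE'] = candy['Q3: AGE'].astype(float)

# bin the ages; intervals are right-inclusive: (0, 17], (17, 25], ...
candy['Q3: AGE'] = pd.cut(candy['Q3: AGE'], bins=[0,17,25,35,45,55,candy['Q3: AGE'].max()],
                                        labels=['17 and under','18-25','26-35','36-45','46-55','56+'])

# add the 'unknown' category, move it to the front, and make the categories unordered
candy['Q3: AGE'] = candy['Q3: AGE'].cat.add_categories('unknown').cat.reorder_categories(['unknown', '17 and under','18-25','26-35','36-45','46-55','56+'], ordered=False)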

In [948]:
candy['Q3: AGE'] = candy['Q3: AGE'].fillna('unknown')
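
The binning steps can be checked on a small example before trusting them on the full column. The sketch below uses a made-up `ages` Series and, as an assumption that differs from the cell above, an open-ended last bin edge of `np.inf` instead of the column maximum. It shows that the boundaries are right-inclusive (17 falls in `17 and under`, 18 in `18-25`) and that missing values end up as `unknown`.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on the bin boundaries
ages = pd.Series([5, 17, 18, 25, 26, 56, 90, np.nan])
labels = ['17 and under', '18-25', '26-35', '36-45', '46-55', '56+']

# Right-inclusive bins: (0, 17], (17, 25], ..., (55, inf)
binned = pd.cut(ages, bins=[0, 17, 25, 35, 45, 55, np.inf], labels=labels)

# Add 'unknown', list it first, and drop the ordering
binned = binned.cat.add_categories('unknown')
binned = binned.cat.reorder_categories(['unknown'] + labels, ordered=False)

# Missing values can only be filled with a label that is already a category
binned = binned.fillna('unknown')
print(binned.tolist())
print(binned.cat.categories)
```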

**Exercise_Q:** Double check yourself by checking the categories for the `Q3: AGE` column. It should output: `Index(['unknown', '17 and under', '18-25', '26-35', '36-45', '46-55', '56+'], dtype='object')`

In [949]:
# double check categories
### ENTER CODE HERE ###
candy['Q3: AGE'].cat.categories

Index(['unknown', '17 and under', '18-25', '26-35', '36-45', '46-55', '56+'], dtype='object')

**Code Check:** To double check your above binning worked correctly, your value counts (sorted by the index) should be as follows:

```
unknown: 60 
17 and under: 49 
18-25: 85
26-35: 520
36-45: 768
46-55: 525
56+: 298
```

In [950]:
### ENTER CODE HERE ###
candy['Q3: AGE'].value_counts().sort_index()

unknown          60
17 and under     49
18-25            85
26-35           520
36-45           768
46-55           525
56+             298
Name: Q3: AGE, dtype: int64

You can also double check some of your work up to this point by making sure that there are no missing values in the data set anymore.

**Code Check:** Check to see if there are any missing values in the data set. Your output should show `0`.

In [951]:
### ENTER CODE HERE ###
# total number of missing values across all columns
candy.isnull().sum().sum()

0

**Exercise_R:** Before you move on to the Data Analysis section, reset the index for `candy` ensuring that it goes from 0 to n-1.  

In [952]:
### ENTER CODE HERE ###

candy.reset_index(inplace=True, drop=True)
candy

Unnamed: 0,Q1: GOING OUT?,Q2: GENDER,Q3: AGE,Q4: COUNTRY,Q6 | 100 Grand Bar,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes),Q6 | Any full-sized candy bar,Q6 | Black Jacks,Q6 | Bonkers (the candy),Q6 | Bonkers (the board game),...,Q6 | Whatchamacallit Bars,Q6 | White Bread,Q6 | Whole Wheat anything,Q6 | York Peppermint Patties,Q10: DRESS,Q11: DAY,Q12: MEDIA [Daily Dish],Q12: MEDIA [Science],Q12: MEDIA [ESPN],Q12: MEDIA [Yahoo]
0,No,Male,36-45,USA,MEH,DESPAIR,JOY,MEH,DESPAIR,DESPAIR,...,DESPAIR,DESPAIR,DESPAIR,DESPAIR,White and gold,Sunday,0.0,1.0,0.0,0.0
1,No,Male,46-55,USA,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,0.0,0.0,0.0,0.0
2,No,Male,36-45,USA,MEH,DESPAIR,JOY,MEH,MEH,DESPAIR,...,JOY,DESPAIR,DESPAIR,DESPAIR,White and gold,Sunday,0.0,1.0,0.0,0.0
3,No,Male,18-25,USA,JOY,DESPAIR,JOY,DESPAIR,MEH,DESPAIR,...,JOY,DESPAIR,DESPAIR,JOY,White and gold,Friday,0.0,1.0,0.0,0.0
4,No,Male,unknown,Other,JOY,DESPAIR,JOY,NO_ANSWER,NO_ANSWER,NO_ANSWER,...,JOY,DESPAIR,DESPAIR,JOY,NO_ANSWER,NO_ANSWER,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2300,No,Male,18-25,USA,JOY,DESPAIR,MEH,DESPAIR,DESPAIR,MEH,...,DESPAIR,MEH,DESPAIR,MEH,White and gold,Friday,0.0,0.0,0.0,0.0
2301,No,Female,26-35,USA,MEH,DESPAIR,JOY,NO_ANSWER,NO_ANSWER,NO_ANSWER,...,JOY,DESPAIR,MEH,JOY,Blue and black,Friday,0.0,1.0,0.0,0.0
2302,No,Female,26-35,USA,MEH,DESPAIR,JOY,DESPAIR,MEH,JOY,...,MEH,DESPAIR,DESPAIR,MEH,Blue and black,Friday,0.0,1.0,0.0,0.0
2303,No,Male,56+,USA,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,0.0,0.0,0.0,0.0


I would suggest that you stop here and run your code through CodeGrade to check the previous steps before continuing.  Just keep in mind that there are some preparation steps below that will be marked incorrect because you have not yet gotten to them.

## Data Analysis

Make sure that you are ready to answer any of the following questions about the data set that may appear on your quiz.  Please use the cleaned, final `candy` data to answer these questions. Note that the answers here may be different than any that appear in the article about this data set or that could be found using Excel. Ours has been altered and cleaned in a different way than the original authors did. Also, please do not use Excel to try to find these answers. First, you may not get the correct answer, and more importantly, we want you to practice your Pandas skills.

**Exercise_S:** How many rows were in the original, uncleaned data that you imported? How many rows are in the cleaned data? How many did we end up removing from the data set?

In [953]:
### ENTER CODE HERE ###

print(candy_full.shape)
print(candy.shape)
print(candy_full.shape[0]-candy.shape[0])


(2479, 120)
(2305, 113)
174


**Exercise_T:** What percentage of respondents are planning to go out trick-or-treating? (Again, make sure that you are using the final, cleaned data for this and all the following questions.)

In [954]:
### ENTER CODE HERE ###
candy['Q1: GOING OUT?'].value_counts(normalize=True) # about 12.9% are going out

No     0.870716
Yes    0.129284
Name: Q1: GOING OUT?, dtype: float64

**Exercise_U:** What percentage of respondents 17 and younger are planning to go out for trick-or-treating?

In [955]:
### ENTER CODE HERE ###
candy[candy['Q3: AGE']=='17 and under']['Q1: GOING OUT?'].value_counts(normalize=True) # about 69.4% are going out

Yes    0.693878
No     0.306122
Name: Q1: GOING OUT?, dtype: float64
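
As an alternative to filtering one age group at a time, `pd.crosstab` with `normalize='index'` gives the share of each answer within every age group in one table. This is only a sketch on made-up rows (the `toy` frame is hypothetical); the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical mini data set with the same column names as `candy`
toy = pd.DataFrame({
    'Q3: AGE': ['17 and under', '17 and under', '17 and under', '18-25', '18-25'],
    'Q1: GOING OUT?': ['Yes', 'Yes', 'No', 'No', 'No'],
})

# Each row of the result sums to 1: shares of Yes/No within an age group
shares = pd.crosstab(toy['Q3: AGE'], toy['Q1: GOING OUT?'], normalize='index')
print(shares)
```

Reading the `Yes` column of the `17 and under` row answers the same question as the filtered `value_counts(normalize=True)` call.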

**Exercise_V:** What are the total value counts and the normalized percentages of the age categories from the age column?

In [956]:
### ENTER CODE HERE ###
candy['Q3: AGE'].value_counts().sort_index()

unknown          60
17 and under     49
18-25            85
26-35           520
36-45           768
46-55           525
56+             298
Name: Q3: AGE, dtype: int64

In [957]:
### ENTER CODE HERE ###
candy['Q3: AGE'].value_counts(normalize=True).sort_index()

unknown         0.026030
17 and under    0.021258
18-25           0.036876
26-35           0.225597
36-45           0.333189
46-55           0.227766
56+             0.129284
Name: Q3: AGE, dtype: float64

**Exercise_W:** What are the total counts and percentages for the gender column?

In [958]:
### ENTER CODE HERE ###
candy['Q2: GENDER'].value_counts().sort_index()

Female     839
Male      1466
Name: Q2: GENDER, dtype: int64

In [959]:
### ENTER CODE HERE ###
candy['Q2: GENDER'].value_counts(normalize=True).sort_index()

Female    0.363991
Male      0.636009
Name: Q2: GENDER, dtype: float64

**Exercise_X:** What is the breakdown of counts for the country column?

In [960]:
### ENTER CODE HERE ###
candy['Q4: COUNTRY'].value_counts()

USA      1955
CA        216
EU         73
Other      61
Name: Q4: COUNTRY, dtype: int64

**Exercise_Y:** How many total respondents voted joy in candy corn and how many voted despair? Did more people vote joy or despair for candy corn?

In [961]:
### ENTER CODE HERE ###
candy['Q6 | Candy Corn'].value_counts()

DESPAIR      702
NO_ANSWER    620
MEH          529
JOY          454
Name: Q6 | Candy Corn, dtype: int64

In [962]:
### ENTER CODE HERE ###

**Exercise_Z:** How many people voted joy in Reese's Peanut Butter Cups? In Snickers? Did more people vote joy for Reese's Peanut Butter Cups or for Snickers?

In [963]:
### ENTER CODE HERE ###
candy['Q6 | Reese\'s Peanut Butter Cups'].value_counts()

JOY          1416
NO_ANSWER     623
MEH           178
DESPAIR        88
Name: Q6 | Reese's Peanut Butter Cups, dtype: int64

In [964]:
### ENTER CODE HERE ###
candy['Q6 | Snickers'].value_counts()

JOY          1325
NO_ANSWER     633
MEH           273
DESPAIR        74
Name: Q6 | Snickers, dtype: int64

**Exercise_AA:** How many people voted joy in Twix? In Kit Kats? Did more people vote joy for Twix or for Kit Kats?

In [965]:
### ENTER CODE HERE ###
candy['Q6 | Twix'].value_counts().JOY

1339

In [966]:
### ENTER CODE HERE ###
candy['Q6 | Kit Kat'].value_counts().JOY

1367

**Exercise_AB:** How many people voted joy in white bread? For whole wheat items? Did more people vote joy for white bread or whole wheat items?

In [967]:
### ENTER CODE HERE ###
candy['Q6 | White Bread'].value_counts().JOY

43

In [968]:
### ENTER CODE HERE ###
candy['Q6 | Whole Wheat anything'].value_counts().JOY

110

**Exercise_AC:** How many people voted joy for Bonkers the board game? For Bonkers the candy? Did more people vote joy for the board game or for the candy?

In [969]:
### ENTER CODE HERE ###
candy['Q6 | Bonkers (the board game)'].value_counts().JOY

188

In [970]:
### ENTER CODE HERE ###

candy['Q6 | Bonkers (the candy)'].value_counts().JOY

109

**Exercise_AD:** How many people voted joy for a box of raisins? For the Blue-Ray DVD of the Real Housewives of Orange County Season 9? Did more people vote joy for a box of raisins or for the DVD?

In [971]:
### ENTER CODE HERE ###
candy['Q6 | Box\'o\'Raisins'].value_counts().JOY

108

In [972]:
### ENTER CODE HERE ###
candy['Q6 | Real Housewives of Orange County Season 9 Blue-Ray'].value_counts().JOY

86

**Exercise_AE:** What is the favorite day of the week for the respondents (both by total counts and percentages)?

In [973]:
### ENTER CODE HERE ###
candy['Q11: DAY'].value_counts()

Friday       1026
NO_ANSWER     658
Sunday        621
Name: Q11: DAY, dtype: int64

In [974]:
### ENTER CODE HERE ###
candy['Q11: DAY'].value_counts(normalize=True)

Friday       0.445119
NO_ANSWER    0.285466
Sunday       0.269414
Name: Q11: DAY, dtype: float64

**Exercise_AF:** Do more respondents see 'white and gold' or 'blue and black' for the [color of the dress](https://en.wikipedia.org/wiki/The_dress) (both total counts and percentages)?

In [975]:
### ENTER CODE HERE ###
candy['Q10: DRESS'].value_counts()

White and gold    1027
NO_ANSWER          679
Blue and black     599
Name: Q10: DRESS, dtype: int64

In [976]:
### ENTER CODE HERE ###
candy['Q10: DRESS'].value_counts(normalize=True)

White and gold    0.445553
NO_ANSWER         0.294577
Blue and black    0.259870
Name: Q10: DRESS, dtype: float64

**Exercise_AG:** For those respondents that clicked on the media link (listed as `Q12` columns on the survey), which link did they click on the most?

In [977]:
### ENTER CODE HERE ###
candy.loc[:,'Q12: MEDIA [Daily Dish]':'Q12: MEDIA [Yahoo]'].value_counts().sort_index() #Science

Q12: MEDIA [Daily Dish]  Q12: MEDIA [Science]  Q12: MEDIA [ESPN]  Q12: MEDIA [Yahoo]
0.0                      0.0                   0.0                0.0                    776
                                                                  1.0                     61
                                               1.0                0.0                     96
                         1.0                   0.0                0.0                   1289
1.0                      0.0                   0.0                0.0                     83
dtype: int64
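
Because each `Q12` column holds 1.0 for a click and 0.0 otherwise, summing the columns is another way to compare the links. The sketch below assumes that 0/1 coding and uses a made-up `clicks` frame rather than the real data.

```python
import pandas as pd

# Hypothetical 0/1 click indicators, one row per respondent
clicks = pd.DataFrame({
    'Q12: MEDIA [Daily Dish]': [0.0, 1.0, 0.0, 0.0],
    'Q12: MEDIA [Science]':    [1.0, 0.0, 1.0, 0.0],
    'Q12: MEDIA [ESPN]':       [0.0, 0.0, 0.0, 0.0],
    'Q12: MEDIA [Yahoo]':      [0.0, 0.0, 0.0, 1.0],
})

# Column sums count the clicks per link; sort so the largest comes first
totals = clicks.sum().sort_values(ascending=False)
most_clicked = totals.idxmax()
print(totals)
print(most_clicked)
```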

**Exercise_AH:** How many males voted joy for receiving a copy of the Real Housewives of Orange County Season 9 DVD for Halloween? Females? Those 17 or younger?

In [978]:
### ENTER CODE HERE ###
candy.loc[candy['Q2: GENDER']=="Male"]['Q6 | Real Housewives of Orange County Season 9 Blue-Ray'].value_counts()

DESPAIR      901
NO_ANSWER    439
MEH           81
JOY           45
Name: Q6 | Real Housewives of Orange County Season 9 Blue-Ray, dtype: int64

In [979]:
candy.loc[candy['Q2: GENDER']=="Female"]['Q6 | Real Housewives of Orange County Season 9 Blue-Ray'].value_counts()

DESPAIR      497
NO_ANSWER    253
MEH           48
JOY           41
Name: Q6 | Real Housewives of Orange County Season 9 Blue-Ray, dtype: int64

In [980]:
candy.loc[candy['Q3: AGE']=="17 and under"]['Q6 | Real Housewives of Orange County Season 9 Blue-Ray'].value_counts()

DESPAIR      27
NO_ANSWER    17
MEH           3
JOY           2
Name: Q6 | Real Housewives of Orange County Season 9 Blue-Ray, dtype: int64
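
If only the JOY count per group is needed, a boolean mask can be summed directly, or grouped by another column. This is a sketch on invented rows; `toy` and the `item` column are placeholders for `candy` and the Real Housewives column.

```python
import pandas as pd

# Hypothetical responses; 'item' stands in for one Q6 column
toy = pd.DataFrame({
    'Q2: GENDER': ['Male', 'Male', 'Female', 'Female', 'Male'],
    'item': ['JOY', 'DESPAIR', 'JOY', 'MEH', 'JOY'],
})

# True where the respondent voted JOY
joy = toy['item'] == 'JOY'

# Count JOY votes for one group by combining masks
male_joy = int((joy & (toy['Q2: GENDER'] == 'Male')).sum())

# Or count JOY votes for every group at once
joy_by_gender = joy.groupby(toy['Q2: GENDER']).sum()
print(male_joy)
print(joy_by_gender)
```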

**Exercise_AI:** The authors tried a funny way to determine a respondent's political leaning. Note this was purely a joke and not meant to be scientific.  How many total respondents voted joy in the following: Blue M&M's, Red M&M's, Green Party M&M's, Independent M&M's, and Abstained from M&M'ing.

In [981]:
### ENTER CODE HERE ###
# select each column as a Series so value_counts() is indexed by the answer label
print("Blue", candy["Q6 | Blue M&M's"].value_counts()['JOY'])
print("Red", candy["Q6 | Red M&M's"].value_counts()['JOY'])
print("Green", candy["Q6 | Green Party M&M's"].value_counts()['JOY'])
print("Independent", candy["Q6 | Independent M&M's"].value_counts()['JOY'])
print("Abstained", candy["Q6 | Abstained from M&M'ing."].value_counts()['JOY'])

Blue 963
Red 949
Green 915
Independent 711
Abstained 202


**Exercise_AJ:** Select only the Q6 candy columns (`Q6 | 100 Grand Bar` through `Q6 | York Peppermint Patties`) in the data set and save this as a new DataFrame called `candy_reduced`.

In [982]:
### ENTER CODE HERE ###
candy_reduced = candy.loc[:,'Q6 | 100 Grand Bar' : 'Q6 | York Peppermint Patties']


**Exercise_AK:** Determine which candy/item from the `candy_reduced` DataFrame has the most JOY votes and which has the fewest. A simple way to do this is to filter the entire DataFrame for any `JOY` values, then use `count()`, then sort the values in descending order. See this [stackoverflow question](https://stackoverflow.com/questions/63103090/how-do-i-count-specific-values-across-multiple-columns-in-pandas) and answers.

In [983]:
### ENTER CODE HERE ###
candy_reduced[candy_reduced=='JOY'].count().sort_values(ascending=False)

Q6 | Any full-sized candy bar                                                  1477
Q6 | Reese's Peanut Butter Cups                                                1416
Q6 | Kit Kat                                                                   1367
Q6 | Cash, or other forms of legal tender                                      1363
Q6 | Twix                                                                      1339
                                                                               ... 
Q6 | JoyJoy (Mit Iodine!)                                                        72
Q6 | Gum from baseball cards                                                     43
Q6 | White Bread                                                                 43
Q6 | Candy that is clearly just the stuff given out for free at restaurants      37
Q6 | Broken glow stick                                                           24
Length: 103, dtype: int64
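
The filter-then-count approach works because filtering leaves `NaN` wherever the value is not `JOY`, and `count()` skips `NaN`. Summing the boolean comparison gives the same numbers, as this sketch on a made-up three-column frame shows (`toy` is hypothetical).

```python
import pandas as pd

# Hypothetical ratings for three items
toy = pd.DataFrame({
    'Q6 | Twix': ['JOY', 'JOY', 'MEH'],
    'Q6 | White Bread': ['DESPAIR', 'DESPAIR', 'JOY'],
    'Q6 | Lollipops': ['MEH', 'DESPAIR', 'JOY'],
})

# Approach used above: non-JOY cells become NaN, count() ignores them
by_count = toy[toy == 'JOY'].count()

# Equivalent: True counts as 1 when summed
by_sum = (toy == 'JOY').sum()

print(by_count.sort_values(ascending=False))
print(by_sum.idxmax())
```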

**Exercise_AL:** Using the above as an example, what candy/item has the most DESPAIR votes? 

In [984]:
### ENTER CODE HERE ###
candy_reduced[candy_reduced=='DESPAIR'].count().sort_values(ascending=False)

Q6 | Broken glow stick                                     1535
Q6 | Real Housewives of Orange County Season 9 Blue-Ray    1398
Q6 | Gum from baseball cards                               1386
Q6 | White Bread                                           1376
Q6 | Kale smoothie                                         1365
                                                           ... 
Q6 | Regular M&Ms                                            71
Q6 | Twix                                                    67
Q6 | Cash, or other forms of legal tender                    63
Q6 | Kit Kat                                                 47
Q6 | Any full-sized candy bar                                15
Length: 103, dtype: int64

**Exercise_AM:** What candy/item has the most MEH votes?

In [985]:
### ENTER CODE HERE ###
candy_reduced[candy_reduced=='MEH'].count().sort_values(ascending=False)

Q6 | Lollipops                                             877
Q6 | Hard Candy                                            855
Q6 | Bonkers (the candy)                                   818
Q6 | Minibags of chips                                     718
Q6 | 100 Grand Bar                                         715
                                                          ... 
Q6 | Reese's Peanut Butter Cups                            178
Q6 | Kale smoothie                                         162
Q6 | Real Housewives of Orange County Season 9 Blue-Ray    129
Q6 | Broken glow stick                                      99
Q6 | Creepy Religious comics/Chick Tracts                   95
Length: 103, dtype: int64

**Exercise_AN:** What candy/item did the most people not recognize or have an opinion on? (According to the survey, the respondents were asked to leave a question blank if they did not know the item)

In [986]:
### ENTER CODE HERE ###
candy_reduced[candy_reduced=='NO_ANSWER'].count().sort_values(ascending=False)

Q6 | JoyJoy (Mit Iodine!)               942
Q6 | Maynards                           939
Q6 | Reggie Jackson Bar                 933
Q6 | Bonkers (the board game)           926
Q6 | Sweetums (a friend to diabetes)    924
                                       ... 
Q6 | Kit Kat                            616
Q6 | Any full-sized candy bar           615
Q6 | Hershey's Dark Chocolate           615
Q6 | Hershey's Milk Chocolate           614
Q6 | Peanut M&M's                       614
Length: 103, dtype: int64

In the final piece of the analysis, we will determine which candy/items have the highest and lowest "net_feelies" (calculated by the authors as the total joy count minus the total despair count).

First, we will create two Series, one with JOY counts and one with DESPAIR counts to add to our `candy_reduced` data.

**Exercise_AO:** Create a Series called `joy_count` that lists total counts for JOY for each column, making sure to keep it in the same order as the columns in the `candy_reduced` DataFrame. Hint: This should be almost exactly how we determined which candy/items had the most JOY votes, but we would not do any sorting.

In [987]:
### ENTER CODE HERE ###
joy_count = candy_reduced[candy_reduced=='JOY'].count()


**Exercise_AP:** Same as above except you will create a Series called `despair_count` that lists the total counts for DESPAIR for each column.

In [988]:
### ENTER CODE HERE ###
despair_count = candy_reduced[candy_reduced=='DESPAIR'].count()

**Exercise_AQ:** Take the transpose of the `candy_reduced` DataFrame and save this transposed data as `candy_reduced_transpose`.

In [989]:
### ENTER CODE HERE ###
candy_reduced_transpose = candy_reduced.T

**Exercise_AR:** Add a new column called "joy_count" using the `joy_count` Series above and a new column called "despair_count" using the `despair_count` Series above to the `candy_reduced_transpose` DataFrame.

In [990]:
### ENTER CODE HERE ###
candy_reduced_transpose['joy_count'] = joy_count
candy_reduced_transpose['despair_count'] = despair_count

**Exercise_AS:** Add a new column to the `candy_reduced_transpose` DataFrame called "net_feelies" that takes the `joy_count` column and subtracts the `despair_count` column.

In [991]:
### ENTER CODE HERE ###
candy_reduced_transpose['net_feelies'] = candy_reduced_transpose['joy_count'] - candy_reduced_transpose['despair_count']

**Exercise_AT:** Select only the `joy_count`, `despair_count`, and `net_feelies` columns from the `candy_reduced_transpose` DataFrame. Sort this DataFrame in descending order by `net_feelies` and save this as `candy_net_sorted`.

In [992]:
### ENTER CODE HERE ###
candy_net_sorted = candy_reduced_transpose[['joy_count','despair_count','net_feelies']].sort_values(by='net_feelies',ascending=False)



Be prepared to answer what candy/item had the most and least `net_feelies` values.
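
The exercises above build `candy_net_sorted` through the transposed DataFrame because that is what gets graded. For reference only, the same three columns can be assembled directly from the two count Series. The sketch below uses invented ratings (`toy` and `summary` are hypothetical names).

```python
import pandas as pd

# Hypothetical ratings for three items
toy = pd.DataFrame({
    'Q6 | Twix': ['JOY', 'JOY', 'MEH', 'DESPAIR'],
    'Q6 | White Bread': ['DESPAIR', 'DESPAIR', 'JOY', 'DESPAIR'],
    'Q6 | Lollipops': ['MEH', 'MEH', 'NO_ANSWER', 'MEH'],
})

# Per-item counts, indexed by column name
joy_count = (toy == 'JOY').sum()
despair_count = (toy == 'DESPAIR').sum()

# Both Series share the same index, so they line up row by row
summary = pd.DataFrame({'joy_count': joy_count, 'despair_count': despair_count})
summary['net_feelies'] = summary['joy_count'] - summary['despair_count']
summary = summary.sort_values(by='net_feelies', ascending=False)
print(summary)
```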

## Encoding DataFrame

We now want to get the `candy` DataFrame ready to run a machine learning algorithm to determine if we could predict a person's gender based on what candy they prefer.

You will learn more about this in the machine learning classes, but some algorithms work exclusively with numeric values. We will now turn all of our values into numeric values.  There are easier ways to do this with sklearn, which you will study in later courses, but we will use Pandas to perform these exercises for further practice.

**Exercise_AU:** For grading purposes, we want to leave the `candy` DataFrame as is. Make a copy of the `candy` DataFrame and save this new DataFrame as `candy_encode`.

In [993]:
### ENTER CODE HERE ###
candy_encode = candy.copy()

**Exercise_AV:** For the `candy_encode` DataFrame, replace any `Female` values with `0` and any `Male` values with `1`.

In [994]:
### ENTER CODE HERE ###
gender_numeric = {'Male':1,'Female':0}

candy_encode['Q2: GENDER'] = candy_encode['Q2: GENDER'].map(gender_numeric)

candy_encode['Q2: GENDER']

0       1
1       1
2       1
3       1
4       1
       ..
2300    1
2301    0
2302    0
2303    1
2304    0
Name: Q2: GENDER, Length: 2305, dtype: int64

**Exercise_AW:** Again, you will learn more about this later, but we need to separate the column that we want to predict (called the response) and the columns that we will use to make the predictions (called the features).  **For both of the items below, make sure that the index is reset and goes from 0 to n-1.**

- Select only the `Q2: GENDER` column from `candy_encode` and save this as `candy_response`.  **Note: This should be a Series.**
- Drop the following columns from the `candy_encode` DataFrame: `Q2: GENDER`,`Q1: GOING OUT?`,`Q3: AGE`,`Q4: COUNTRY`,`Q10: DRESS`,`Q11: DAY`, `Q12: MEDIA [Daily Dish]`,`Q12: MEDIA [Science]`,`Q12: MEDIA [ESPN]`,`Q12: MEDIA [Yahoo]`.  Save the remaining columns as `candy_features`.

In [995]:
### ENTER CODE HERE ###
# reset_index(drop=True) guarantees a 0 to n-1 index on both objects
candy_response = candy_encode['Q2: GENDER'].reset_index(drop=True)
candy_features = candy_encode.drop(columns =['Q2: GENDER','Q1: GOING OUT?','Q3: AGE','Q4: COUNTRY','Q10: DRESS','Q11: DAY', 'Q12: MEDIA [Daily Dish]',
                                    'Q12: MEDIA [Science]','Q12: MEDIA [ESPN]','Q12: MEDIA [Yahoo]']).reset_index(drop=True)
candy_features

Unnamed: 0,Q6 | 100 Grand Bar,Q6 | Anonymous brown globs that come in black and orange wrappers\t(a.k.a. Mary Janes),Q6 | Any full-sized candy bar,Q6 | Black Jacks,Q6 | Bonkers (the candy),Q6 | Bonkers (the board game),Q6 | Bottle Caps,Q6 | Box'o'Raisins,Q6 | Broken glow stick,Q6 | Butterfinger,...,Q6 | Three Musketeers,Q6 | Tolberone something or other,Q6 | Trail Mix,Q6 | Twix,"Q6 | Vials of pure high fructose corn syrup, for main-lining into your vein",Q6 | Vicodin,Q6 | Whatchamacallit Bars,Q6 | White Bread,Q6 | Whole Wheat anything,Q6 | York Peppermint Patties
0,MEH,DESPAIR,JOY,MEH,DESPAIR,DESPAIR,DESPAIR,DESPAIR,DESPAIR,DESPAIR,...,JOY,JOY,DESPAIR,JOY,DESPAIR,DESPAIR,DESPAIR,DESPAIR,DESPAIR,DESPAIR
1,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER
2,MEH,DESPAIR,JOY,MEH,MEH,DESPAIR,MEH,DESPAIR,DESPAIR,MEH,...,DESPAIR,JOY,MEH,JOY,DESPAIR,JOY,JOY,DESPAIR,DESPAIR,DESPAIR
3,JOY,DESPAIR,JOY,DESPAIR,MEH,DESPAIR,MEH,DESPAIR,DESPAIR,MEH,...,JOY,JOY,DESPAIR,JOY,MEH,JOY,JOY,DESPAIR,DESPAIR,JOY
4,JOY,DESPAIR,JOY,NO_ANSWER,NO_ANSWER,NO_ANSWER,MEH,MEH,DESPAIR,JOY,...,JOY,JOY,MEH,JOY,DESPAIR,DESPAIR,JOY,DESPAIR,DESPAIR,JOY
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2300,JOY,DESPAIR,MEH,DESPAIR,DESPAIR,MEH,MEH,DESPAIR,DESPAIR,MEH,...,MEH,MEH,JOY,JOY,MEH,JOY,DESPAIR,MEH,DESPAIR,MEH
2301,MEH,DESPAIR,JOY,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,DESPAIR,DESPAIR,JOY,...,MEH,MEH,DESPAIR,JOY,NO_ANSWER,NO_ANSWER,JOY,DESPAIR,MEH,JOY
2302,MEH,DESPAIR,JOY,DESPAIR,MEH,JOY,DESPAIR,MEH,MEH,DESPAIR,...,JOY,JOY,MEH,MEH,MEH,JOY,MEH,DESPAIR,DESPAIR,MEH
2303,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,...,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER,NO_ANSWER


**Exercise_AX:** Use the pandas `get_dummies()` function to encode the `candy_features` data, making sure to set `drop_first=True`. Save this as `candy_features_encoded`.

In [999]:
### ENTER CODE HERE ###
candy_features_encoded = pd.get_dummies(candy_features,drop_first=True)
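
To see what `drop_first=True` does, consider a single made-up column with all four answers (the `toy` frame is hypothetical). `get_dummies()` creates one indicator column per category and then drops the first category in sorted order, here `DESPAIR`, which becomes the implied baseline when all remaining indicators are 0.

```python
import pandas as pd

# Hypothetical single-item frame containing every possible answer
toy = pd.DataFrame({'Q6 | Twix': ['JOY', 'MEH', 'DESPAIR', 'NO_ANSWER']})

# One indicator per category, minus the first one (DESPAIR)
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
print(encoded)
```

The new column names are built as the original column name, an underscore, and the category label.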






**Code Check:** Make sure that `candy`, `candy_features`, `candy_response`, and `candy_features_encoded` each have an index that goes from 0 to n-1 or your final CodeGrade tests will not pass.

## Final Analysis

Great work! You have now cleaned your data and prepared it to be passed to a machine learning model.  

I created models using Random Forest, Logistic Regression, and XGBoost algorithms, and they all returned accuracy rates of around 70%. However, the other performance metrics (which you will learn more about in the machine learning classes) didn't look as good. Given the metrics that were calculated, I would say that, based only on this data, candy preference is not a strong indicator of someone's gender.

**Next Steps:**  Make sure that your notebook passes all the CodeGrade tests and then use this notebook to answer questions in the corresponding quiz in Brightspace.