## Week 12 Assignment - W200 Python Fundamentals for Data Science, UC Berkeley MIDS

Write code in this Jupyter Notebook to solve the following problems. This assignment addresses material covered in Unit 11. Please upload this **Notebook** with your solutions to your GitHub repository in your SUBMISSIONS/week_12 folder by 11:59 PM PST the night before class. If you turn in anything on ISVC, please do so under the Week 12 Assignment category.

## Objectives

- Explore and glean insights from a real dataset using pandas
- Practice using pandas for exploratory analysis, information gathering, and discovery
- Practice using matplotlib for data visualization

## Dataset

You are to analyze campaign contributions to the 2016 U.S. presidential primary races made in California. Use the CSV file located here: https://drive.google.com/file/d/1Lgg-PwXQ6TQLDowd6XyBxZw5g1NGWPjB/view?usp=sharing. Download the file and save it in the same folder as this notebook. This file originally came from the U.S. Federal Election Commission (https://www.fec.gov/).

**DO NOT PUSH THIS FILE TO YOUR GITHUB REPO!**

Documentation for this data can be found here: https://drive.google.com/file/d/11o_SByceenv0NgNMstM-dxC1jL7I9fHL/view?usp=sharing

## General Guidelines

- This is a **real** dataset, so it may contain errors and other peculiarities to work through
- This dataset is ~218 MB, which will take some time to load (and probably won't load in Google Sheets or Excel)
- If you make assumptions, annotate them in your responses
- While there is one code/markdown cell positioned after each question as a placeholder, some of your code/responses may require multiple cells
- Double-click the markdown cells that say YOUR ANSWER HERE to enter your written answers. If you need more cells for your written answers, make them markdown cells (rather than code cells)

## Setup

Run the two cells below. 

The first cell will load the data into a pandas dataframe named `contrib`. Note that a custom date parser is defined to speed up loading. If Python were to guess the date format, it would take even longer to load.  

The second cell subsets the dataframe to focus on just the primary period through May 2016. Otherwise, we would see general election donations which would make it harder to draw conclusions about the primaries.
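
As a small illustration of the two steps described above, the sketch below parses date strings with the `%d-%b-%y` format and then keeps only the rows on or before a cutoff date. The `toy` DataFrame, its column names, and its values are made up for this example and are not taken from the assignment data.

```python
import datetime

import pandas as pd

# Toy data: column names and values are invented for illustration only.
toy = pd.DataFrame({
    "receipt_dt": ["15-Mar-16", "31-May-16", "01-Jun-16"],
    "amount": [25.0, 50.0, 100.0],
})

# Giving an explicit format avoids having pandas guess it for every row.
toy["receipt_dt"] = pd.to_datetime(toy["receipt_dt"], format="%d-%b-%y")

# Boolean mask: keep rows dated on or before the cutoff.
cutoff = datetime.datetime(2016, 5, 31)
toy_primary = toy[toy["receipt_dt"] <= cutoff].copy()
print(toy_primary.shape)
```

The trailing `.copy()` makes the subset an independent DataFrame, which avoids chained-assignment warnings if columns are added to it later.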

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import datetime

# Set some pandas display options and have matplotlib show charts in the notebook
pd.set_option('display.max_rows', 1000)
pd.options.display.float_format = '{:,.2f}'.format
%matplotlib inline

# Define a date parser to pass to read_csv (uses the standard-library datetime module)
d = lambda x: datetime.datetime.strptime(x, '%d-%b-%y')

# Load the data
contrib = pd.read_csv('./P00000001-CA.csv', index_col=False, parse_dates=['contb_receipt_dt'], date_parser=d)
print(contrib.shape)

# Note - for now, it is okay to ignore the warning about mixed types. 

In [None]:
# Subset data to primary period 
contrib = contrib[contrib['contb_receipt_dt'] <= datetime.datetime(2016, 5, 31)].copy()
print(contrib.shape)

## 1. Exploring Data

**1a. First, take a preliminary look at the data.**
- Print the *shape* of the data. What does this tell you about the number of variables and rows you have?
- Print a list of column names. 
- Review the documentation for this data (link above). Do you have all of the columns you expect to have?
- Sometimes variable names are not clear unless we read the documentation. In your own words, based on the documentation, what information does the `election_tp` variable contain?

In [2]:
# 1a YOUR CODE HERE

`1a YOUR RESPONSE HERE`

**1b. Print the first 5 rows from the dataset to manually look through some of your data.**

In [3]:
# 1b YOUR CODE HERE

**1c. Pick three variables from the dataset above and run some quick sanity checks.**

When working with a new dataset, it is important to explore and sanity check your variables.  For example, you may want to examine the maximum and minimum values, a frequency count, or something else. Use markdown cells to explain if your sanity checks "pass" your scrutiny or if you have concerns about the integrity of your data. 
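
For reference, here is a minimal sketch of what a few sanity checks might look like on a toy DataFrame. The column names and values are invented for this example; which checks make sense for the real data is up to you.

```python
import pandas as pd

# Toy data: one numeric column and one text column, with some oddities.
toy = pd.DataFrame({
    "amount": [25.0, -50.0, 2700.0, 10.0],
    "city": ["OAKLAND", "BERKELEY", "OAKLAND", None],
})

# Range check on a numeric variable.
amount_min = toy["amount"].min()
amount_max = toy["amount"].max()

# Count values that may need an explanation (here, negative amounts).
n_negative = int((toy["amount"] < 0).sum())

# Frequency count on a categorical variable, including missing values.
city_counts = toy["city"].value_counts(dropna=False)
n_missing_city = int(toy["city"].isna().sum())

print(amount_min, amount_max, n_negative, n_missing_city)
print(city_counts)
```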

In [4]:
# 1c YOUR CODE HERE for variable #1

In [None]:
# 1c YOUR CODE HERE for variable #2

In [None]:
# 1c YOUR CODE HERE for variable #3

`1c YOUR RESPONSE HERE`

**1d. Plotting a histogram** 

Make a histogram of **one** of the variables you picked above. What insights can you draw from this histogram?
Your histogram should include:
- A title
- Axis labels
- An appropriate number of bins to show the breakout of values
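
A minimal sketch of a labeled histogram on made-up numbers is shown here. The values, the bin count, and the label text are assumptions chosen for illustration; a suitable number of bins for the real data will likely differ.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy values, invented for illustration only.
toy = pd.Series([5, 10, 10, 25, 25, 25, 50, 100, 250], name="amount")

fig, ax = plt.subplots()

# ax.hist returns the bin counts, the bin edges, and the drawn patches.
counts, bin_edges, _ = ax.hist(toy, bins=5)

ax.set_title("Distribution of toy amounts")
ax.set_xlabel("Amount (USD)")
ax.set_ylabel("Number of records")
```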

In [2]:
# 1d YOUR CODE HERE

`1d YOUR RESPONSE HERE`

## 2. Exploring Campaign Contributions

Let's investigate the donations to the candidates.

**2a. Present a table that shows the number of donations to each candidate.**

- When presenting data as a table, it is often best to sort the data in a meaningful way. This makes it easier for your reader to examine what you've done and to glean insights.  From now on, all tables that you present in this assignment (and course) should be sorted.
- Hint: Use the `groupby` method.
- Hint: Use the `sort_values` method to sort the data so that candidates with the largest number of donations appear on top. 

Which candidate received the largest number of contributions (variable 'contb_receipt_amt')?
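
To illustrate the two hints above, this sketch counts rows per group and sorts the result on a toy DataFrame. The `store` and `sale` columns are hypothetical and unrelated to the assignment data.

```python
import pandas as pd

# Toy data: each row is one sale at a store.
toy = pd.DataFrame({
    "store": ["A", "B", "A", "C", "A", "B"],
    "sale": [10.0, 20.0, 5.0, 100.0, 15.0, 30.0],
})

# Count the rows in each group, then put the largest count on top.
sales_per_store = (
    toy.groupby("store")["sale"]
    .count()
    .sort_values(ascending=False)
)
print(sales_per_store)
```

Swapping `.count()` for another aggregation such as `.sum()` changes what is measured per group without changing the overall pattern.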

In [3]:
# 2a YOUR CODE HERE

`2a YOUR RESPONSE HERE`

**2b. Now, present a table that shows the total value of donations to each candidate.**

Which candidate raised the most money in California?

In [None]:
# 2b YOUR CODE HERE

`2b YOUR RESPONSE HERE`

**2c. Combine the tables.**

- What is the "type" of the two tables you presented above - Series or DataFrames?
- Convert any Series to DataFrames.
- Rename the variable (column) names to accurately describe what is presented.
- Merge together your tables to show the *count* and the *value* of donations to each candidate in one table.
- Hint: Use the `merge` method.
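
The steps above can be sketched on toy data as follows. The Series names, index labels, and new column names are made up for this example; merging on the index is one option among several.

```python
import pandas as pd

# Two toy Series that share the same index (one entry per store).
counts = pd.Series({"A": 3, "B": 2, "C": 1}, name="sale").rename_axis("store")
totals = pd.Series({"A": 30.0, "B": 50.0, "C": 100.0}, name="sale").rename_axis("store")
print(type(counts))

# Convert each Series to a DataFrame and give the column a descriptive name.
counts_df = counts.to_frame().rename(columns={"sale": "n_sales"})
totals_df = totals.to_frame().rename(columns={"sale": "total_sales"})

# Merge on the shared index so each store appears once, with both columns.
combined = counts_df.merge(totals_df, left_index=True, right_index=True)
print(combined)
```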

In [None]:
# 2c YOUR CODE HERE

**2d. Calculate and add a new variable to the table from 2c that shows the average \$ per donation.**
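
As a quick sketch, a derived column can be computed by dividing one column by another. The `summary` table and its column names below are hypothetical.

```python
import pandas as pd

# Toy summary table: one row per store.
summary = pd.DataFrame(
    {"n_sales": [3, 2, 1], "total_sales": [30.0, 50.0, 100.0]},
    index=pd.Index(["A", "B", "C"], name="store"),
)

# Element-wise division creates the new column in one step.
summary["avg_sale"] = summary["total_sales"] / summary["n_sales"]
print(summary)
```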

In [None]:
# 2d YOUR CODE HERE

**2e. Plotting a Bar Chart**

Make a bar chart with two bars per candidate: one for the total value of donations and one for the average \$ per donation.
- Show the candidate's name on the x-axis
- Show the amount on the y-axis
- Include a title
- Include axis labels
- Hint: You can make the y-axis a log-scale if you'd prefer
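
One possible way to draw two bars per group with a log-scaled y-axis is sketched below using the pandas plotting interface. The `summary` table, its column names, and the label text are invented for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy summary table: one row per store, two numeric columns to compare.
summary = pd.DataFrame(
    {"total_sales": [30.0, 50.0, 100.0], "avg_sale": [10.0, 25.0, 100.0]},
    index=pd.Index(["A", "B", "C"], name="store"),
)

# Each column becomes its own set of bars; logy=True uses a log-scale y-axis.
ax = summary.plot(kind="bar", logy=True)
ax.set_title("Total and average sale by store (toy data)")
ax.set_xlabel("Store")
ax.set_ylabel("Amount (USD, log scale)")
```

A log scale helps here because totals and averages can differ by orders of magnitude, which would otherwise flatten the smaller bars.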

In [None]:
# 2e YOUR CODE HERE

**2f. Comment on the results of your data analysis in a short paragraph.**

- There are several interesting conclusions you can draw from the table you have created.
- What have you learned about campaign contributions in California?

`2f YOUR RESPONSE HERE`

## 3. Exploring Donor Occupations

Above in part 2, we saw that some simple data analysis can give us insights into the campaigns of our candidates. Now let's quickly look to see what *kind* of person is donating to each campaign using the `contbr_occupation` variable.

**3a. Show the top 5 occupations of individuals that contributed to Hillary Clinton.** 

- Subset your data to create a dataframe with only donations for Hillary Clinton.
- Then use the `value_counts` and `head` methods to present the top 5 occupations (`contbr_occupation`) for her donors.
- Note: we are just interested in the count of donations, not the value of those donations.
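
The subset-then-count pattern described above looks roughly like this on toy data. The `store` and `item` columns are hypothetical stand-ins.

```python
import pandas as pd

# Toy data: each row is one purchase.
toy = pd.DataFrame({
    "store": ["A", "A", "A", "A", "A", "A", "B", "B", "B"],
    "item": ["apple", "apple", "apple", "pear", "pear", "fig",
             "kiwi", "kiwi", "apple"],
})

# Subset to one group, then count how often each value occurs.
store_a = toy[toy["store"] == "A"]
top_items_a = store_a["item"].value_counts().head(2)
print(top_items_a)
```

`value_counts` sorts in descending order by default, so `head(2)` returns the two most frequent values.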

In [None]:
# 3a YOUR CODE HERE

**3b. Write a function called `get_donors`.**

Imagine that you want to do the previous operation on several candidates.  To keep your work neat, you want to take the work you did on the Clinton-subset and wrap it in a function that you can apply to other subsets of the data.

- The function should take a DataFrame as a parameter, and return a Series containing the counts for the top 5 occupations contained in that DataFrame.

In [None]:
def get_donors(df):
    """This function takes a dataframe that contains a variable named contbr_occupation.
    It outputs a Series containing the counts for the 5 most common values of that
    variable."""
    
    # 3b YOUR CODE HERE

**3c. Now run the `get_donors` function on subsets of the dataframe corresponding to three candidates.**

- Hillary Clinton
- Bernie Sanders
- Donald Trump

In [None]:
# 3c YOUR CODE HERE

**3d. Finally, use `groupby` to separate the entire dataset by candidate.**

- Call .apply(get_donors) on your groupby object, which will apply the function you wrote to each subset of your data.
- Look at your output and marvel at what pandas can do in just one line!

In [None]:
# 3d YOUR CODE HERE

**3e. Comment on your findings in a short paragraph.**

`3e YOUR RESPONSE HERE`

**3f. Think about your findings in section 3 vs. your findings in section 2 of this assignment.**

Do you have any new insights into the results you got in section 2, now that you see the top occupations for each candidate?

`3f YOUR RESPONSE HERE`


## 4. Plotting Data

There is an important element that we have not yet explored in this dataset - time.

**4a. Present a single line chart with the following elements.**

- Show the date on the x-axis
- Show the contribution amount on the y-axis
- Include a title
- Include axis labels

In [4]:
# 4a YOUR CODE HERE

**4b. Make a better time-series line chart**

This chart is messy and it is hard to gain insights from it.  Improve the chart from 4a so that your new chart shows a specific insight. In the spot provided, write the insight(s) that can be gained from this new time-series line chart.
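
One common way to tidy a noisy time series is to aggregate it over a regular interval before plotting. The sketch below sums toy amounts by week; the data, the weekly interval, and the label text are assumptions for illustration, and other improvements (such as splitting by group) may suit your insight better.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data: several records per day across two weeks.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2016-01-04", "2016-01-04", "2016-01-06", "2016-01-12", "2016-01-13"]
    ),
    "amount": [10.0, 20.0, 5.0, 40.0, 25.0],
})

# Resampling needs a datetime index; "W" groups rows into weeks ending on Sunday.
weekly = toy.set_index("date")["amount"].resample("W").sum()

fig, ax = plt.subplots()
ax.plot(weekly.index, weekly.values, marker="o")
ax.set_title("Weekly total amount (toy data)")
ax.set_xlabel("Week ending")
ax.set_ylabel("Total amount (USD)")
```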

In [5]:
# 4b YOUR CODE HERE

`4b YOUR RESPONSE HERE`

## If you have feedback for this homework, please submit it using the link below:

http://goo.gl/forms/74yCiQTf6k