# Setup

Run the following code before working on any problem to import the utility functions used to test your code.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from cse163_utils import assert_equals, check_approx_equals

%matplotlib inline
sns.set()

# Problem 1: Pandas Continued

Run the following code to import data about cancer rates into a pandas dataframe called `df`. Note the names of the columns, as these will be useful for the functions you'll be required to write.

In [2]:
df = pd.read_csv('cancer.csv')

df.head()

Unnamed: 0,Area,Count,Event Type,Population,Race,Sex,Year
0,Alabama,4366,Mortality,2293259,All Races,Female,1999
1,Alabama,9452,Incidence,2302835,All Races,Female,2000
2,Alabama,4425,Mortality,2302835,All Races,Female,2000
3,Alabama,9938,Incidence,2309496,All Races,Female,2001
4,Alabama,4550,Mortality,2309496,All Races,Female,2001


## Problem 1.1: `washington_rate`

Write a function named `washington_rate` that accepts a Pandas dataframe and returns all rows describing cancer deaths (Event Type = Mortality) that occurred in Washington (Area = Washington).

In [12]:
# Type your solution here
def washington_rate(df):
    mask = (df['Event Type'] == 'Mortality') & (df['Area'] == 'Washington')
    return df[mask]

In [13]:
# Test
wr = washington_rate(df)
print(wr.head())

assert_equals(168, len(wr))
assert_equals(803464, wr['Count'].sum())

print('Success!')

             Area  Count Event Type  Population       Race     Sex  Year
19825  Washington   5126  Mortality     2934901  All Races  Female  1999
19827  Washington   5212  Mortality     2967734  All Races  Female  2000
19829  Washington   5224  Mortality     3002080  All Races  Female  2001
19831  Washington   5276  Mortality     3036742  All Races  Female  2002
19833  Washington   5414  Mortality     3065739  All Races  Female  2003
Success!


## Problem 1.2: `either_area_rate`

Write a function `either_area_rate` that accepts a Pandas dataframe, an area `a1`, and an area `a2`, and returns all rows describing cancer deaths (Event Type = Mortality) that occurred in an Area equal to `a1` or `a2`.

If `a2` is not specified, it should default to Washington. So, the call `either_area_rate(df, 'Nevada')` would return rows with an area of Washington or Nevada.
Note that the function should still be able to take both areas: `either_area_rate(df, 'Ohio', 'Michigan')` should return rows in Ohio or Michigan.

Note: This function will be similar to 1.1!


In [18]:
### edTest(test_either_area_rate_1) ###

# Type your solution here
def either_area_rate(df, a1, a2='Washington'):
    mask = (df['Event Type'] == 'Mortality') & ((df['Area'] == a1) | (df['Area'] == a2))
    return df[mask]

In [19]:
# Test 1
nev_col = either_area_rate(df, 'Nevada', 'Colorado')
print(nev_col.head())

assert_equals(336, len(nev_col))
assert_equals(794626, nev_col['Count'].sum())

print('Success!')

          Area  Count Event Type  Population       Race     Sex  Year
1801  Colorado   2874  Mortality     2101358  All Races  Female  1999
1803  Colorado   2902  Mortality     2147742  All Races  Female  2000
1805  Colorado   3095  Mortality     2195705  All Races  Female  2001
1807  Colorado   3109  Mortality     2228665  All Races  Female  2002
1809  Colorado   3110  Mortality     2253214  All Races  Female  2003
Success!


In [20]:
# Test 2
ohio_wash = either_area_rate(df, 'Ohio')
print(ohio_wash.head())

assert_equals(336, len(ohio_wash))
assert_equals(2607690, ohio_wash['Count'].sum())

print('Success!')

       Area  Count Event Type  Population       Race     Sex  Year
14353  Ohio  12305  Mortality     5835017  All Races  Female  1999
14355  Ohio  12131  Mortality     5845425  All Races  Female  2000
14357  Ohio  11996  Mortality     5853691  All Races  Female  2001
14359  Ohio  12253  Mortality     5859385  All Races  Female  2002
14361  Ohio  12250  Mortality     5868535  All Races  Female  2003
Success!


## Problem 1.3: `occurrences_in_pop`

Write a function named `occurrences_in_pop` that accepts a Pandas dataframe, an integer representing a population minimum `m`, and a given sex ('Male', 'Female', or 'Male and Female'), and returns all rows of cancer incidence data (Event Type = Incidence) for the given sex where the population is greater than or equal to `m`. The function returns None if no records satisfy the given conditions.

In [25]:
### edTest(test_occurrences_in_pop_1) ###

# Type your solution here
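def occurrences_in_pop(df, m, sex):
    # The spec asks for populations greater than or equal to m
    mask = (df['Population'] >= m) & (df['Sex'] == sex) & (df['Event Type'] == 'Incidence')
    result = df[mask]
    # The spec asks for None when no rows match
    return None if result.empty else result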

In [26]:
# Test 1
male_10mil = occurrences_in_pop(df, 10000000, 'Male')
print(male_10mil.head())

assert_equals(425, len(male_10mil))
assert_equals(76938250, male_10mil['Count'].sum())

print('Success!')

            Area  Count Event Type  Population       Race   Sex  Year
44       Alabama  67396  Incidence    11515137  All Races  Male  2009
706      Arizona  68986  Incidence    15749010  All Races  Male  2009
958      Arizona  63324  Incidence    13571975      White  Male  2009
1492  California  69293  Incidence    16699043  All Races  Male  1999
1494  California  69892  Incidence    16937562  All Races  Male  2000
Success!


In [27]:
# Test 2
female_500k = occurrences_in_pop(df, 500000, 'Female')
print(len(female_500k), female_500k['Count'].sum())
print(female_500k.head())

assert_equals(2422, len(female_500k))
assert_equals(95317826, female_500k['Count'].sum())

print('Success!')

2422 95317826
      Area  Count Event Type  Population       Race     Sex  Year
1  Alabama   9452  Incidence     2302835  All Races  Female  2000
3  Alabama   9938  Incidence     2309496  All Races  Female  2001
5  Alabama  10133  Incidence     2314370  All Races  Female  2002
7  Alabama   9592  Incidence     2324069  All Races  Female  2003
9  Alabama  10221  Incidence     2337857  All Races  Female  2004
Success!


## Problem 1.4: `deaths_per_year`

This problem is going to be a little different than the other pandas problems we have worked on so far. Most of the time you'll be writing your own pandas code for your homework, but learning how to debug and fix pandas errors is a useful skill, especially since the error messages you get aren't always helpful. For this problem, the task is to fix the partially filled-in function according to the spec defined below. Note that there may be more than one bug, and the bugs could range from a pandas error to an algorithmic error!

**Fix** a function named `deaths_per_year` that accepts a Pandas dataframe and returns a series with the number of cancer deaths (Event Type = Mortality) for each year between 2002 - 2008 (inclusive) for both sexes and all races.

**Hint:** If you aren't getting the correct numbers, you might want to investigate the dataset a little more closely. Here's a line of code you can try running if you're stuck!

```
df.loc[[5, 32, 59]]
```

In [None]:
### edTest(test_deaths_per_year) ###
def deaths_per_year(df):
    new = df.dropna()

    # Build the mask from the same frame it is applied to, so the indexes line up
    mask = (new['Event Type'] == 'Mortality') & (new['Year'] >= 2002) & (new['Year'] <= 2008)
    new = new[mask]
    return new.groupby('Year')['Count'].sum()

print(df.loc[[5, 32, 59]])

       Area  Count Event Type  Population       Race              Sex  Year
5   Alabama  10133  Incidence     2314370  All Races           Female  2002
32  Alabama  11250  Incidence     2165719  All Races             Male  2002
59  Alabama  21383  Incidence     4480089  All Races  Male and Female  2002


In [40]:
# Test
dpy = deaths_per_year(df)
print(dpy)

assert_equals(7, len(dpy))
assert_equals(11746632, dpy.sum())

print('Success!')

Year
2002    6752608
2003    6750617
2004    6713782
2005    6784154
2006    6792608
2007    6830699
2008    6864484
Name: Count, dtype: int64


AssertionError: Failed: Expected 11746632, but received 47488952
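
The failed assertion above is consistent with double counting: as the hinted rows show, the dataset stores pre-aggregated `'Male and Female'` rows alongside the per-sex rows, and `'All Races'` rows alongside individual races. Here is a minimal sketch of one possible fix, run on a small made-up dataframe. The `toy` data and the name `deaths_per_year_sketch` are hypothetical, and the sketch assumes the aggregate labels are spelled exactly as they appear in the outputs above.

```python
import pandas as pd

# Hypothetical toy data (made-up numbers, not from cancer.csv) that mimics
# the column layout of the real dataset.
toy = pd.DataFrame({
    'Area': ['A', 'A', 'A', 'B', 'B', 'A', 'A'],
    'Count': [10, 12, 22, 30, 5, 99, 40],
    'Event Type': ['Mortality'] * 5 + ['Incidence', 'Mortality'],
    'Race': ['All Races'] * 4 + ['White', 'All Races', 'All Races'],
    'Sex': ['Female', 'Male'] + ['Male and Female'] * 5,
    'Year': [2002, 2002, 2002, 2003, 2003, 2002, 2010],
})


def deaths_per_year_sketch(df):
    # Keep only the pre-aggregated rows so that no death is counted twice.
    mask = (
        (df['Event Type'] == 'Mortality')
        & (df['Sex'] == 'Male and Female')
        & (df['Race'] == 'All Races')
        & (df['Year'] >= 2002)
        & (df['Year'] <= 2008)
    )
    return df[mask].groupby('Year')['Count'].sum()


result = deaths_per_year_sketch(toy)
print(result)
```

On the toy data, the per-sex rows, the single-race row, the incidence row, and the 2010 row are all filtered out before the groupby, so each year's total comes from the aggregate rows only.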

### Extra edTests
You can safely ignore the following cells.

In [0]:
### edTest(test_either_area_rate_2) ###

In [0]:
### edTest(test_occurrences_in_pop_2) ###

# Problem 2: Plotting Code

We will be using Seaborn in this section to make visualizations. As mentioned in lecture, Seaborn has great documentation, so take this time to read about some of the functions listed below that you might need in this class. Feel free to look at the examples included in the links to determine whether or not you might need to use each function.


Here are some Seaborn functions you might need for this section:

* [Bar/Violin Plot](https://seaborn.pydata.org/generated/seaborn.catplot.html)
* [Plot a Distribution](https://seaborn.pydata.org/generated/seaborn.kdeplot.html)
* [Scatter/Line Plot](https://seaborn.pydata.org/generated/seaborn.relplot.html)
* [Linear Regression Plot](https://seaborn.pydata.org/generated/seaborn.regplot.html)
* [Compare Two Variables](https://seaborn.pydata.org/generated/seaborn.jointplot.html)
* [Heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html#seaborn.heatmap)


Note: The Seaborn library was imported at the top and is available under the name `sns`.

## Problem 2.1: Line Chart

Create a function called `plot_line` that accepts a Pandas dataframe and creates and displays a line chart using Seaborn that plots similar information to what you generated in 1.4.  

You should generate a line plot, where the years are on the x axis and the count is on the y axis, of the number of cancer deaths (Event Type = Mortality) for each year between 2002 - 2008 (inclusive) for both sexes and all races *in the state of Washington*.

**Please provide a descriptive title and axis labels for your generated visualization.**

**NOTE:** the information you're plotting in this problem is very similar to problem 1.4, except this time you are constrained to a single area (*only* Washington). This means that you should be able to look up the exact row that corresponds to mortality counts for all races and both sexes in the state of Washington directly **without needing a groupby** (there's nothing to aggregate in this case, since you can look it up directly in the dataframe). Most of the logic for filtering the correct data will be the same as in problem 1.4, so feel free to copy over relevant parts of your solution here.

In [0]:
# Type your solution here


In [0]:
plot_line(df)

## Problem 2.2: Regression Plot

Create a function called `plot_regression` that accepts a Pandas dataframe and creates a linear regression plot between Population and Count for cancer incidence (Event Type = Incidence). Remember that we need to pass in 'Population' and 'Count' as `x` and `y` to the `regplot` function!

**Please provide a descriptive title and axis labels for your generated visualization.**
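
As a rough illustration of how `regplot` takes its `x` and `y` columns, here is a minimal sketch on made-up data. The `toy` dataframe, the name `plot_regression_sketch`, and the title text are hypothetical, and the filtering down to incidence rows is not shown.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data (made-up numbers, not from cancer.csv).
toy = pd.DataFrame({
    'Population': [1000, 2000, 3000, 4000, 5000],
    'Count': [11, 19, 32, 39, 52],
})


def plot_regression_sketch(data):
    # regplot draws the scatter points plus a fitted line and returns
    # the matplotlib Axes it drew on.
    ax = sns.regplot(data=data, x='Population', y='Count')
    ax.set(
        title='Toy example: count vs. population',
        xlabel='Population',
        ylabel='Count',
    )
    return ax


ax = plot_regression_sketch(toy)
```

Unlike `relplot`, `regplot` returns a plain matplotlib Axes, so the title and labels are set directly on that object.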

In [0]:
# Type your solution here


In [0]:
plot_regression(df)

## Problem 2.3: Playing with Seaborn

For this problem, we will progressively add to a plot to explore some of the other arguments you can use with Seaborn. We will be focusing on the `relplot` function from Seaborn and the various arguments you can use with this function. Remember, for each visualization you create, make sure to provide a **descriptive title** and **axis labels** so that we can understand what you are visualizing.

Every function in problem 2.3 will take in the cancer dataset as a dataframe and produce a Seaborn plot.


### Problem 2.3.a) A basic `relplot`
Create a function called `plot_problem_23a` that plots the general count of cancer events over the years. Remember that you will want to use the `relplot` function.

In [0]:
# Type your solution here


In [0]:
plot_problem_23a(df)

### Problem 2.3.b) Analyzing the `relplot`
The graph that we just created in the previous problem showcased a peculiar spike at the year 2009. Since this is interesting, we should try to focus in on this year and try to visualize other aspects of the data.

Create a function `plot_problem_23b` which plots the general count of cancer events over the population of the areas in the year 2009. 

In [0]:
# Type your solution here


In [0]:
plot_problem_23b(df)

### Problem 2.3.c) Adding in extra `seaborn` *flair*
It looks like we can see a bit more detail of what is going on with our data in the year 2009, but we definitely are missing some details. We should see if we can encode more data within our visualizations.

Create a function `plot_problem_23c` which plots the general count of cancer events over race. Each point should have the hue correlated with the type of cancer event (which is either 'Mortality' or 'Incidence') and the size correlated with the population.
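
To show how the `hue` and `size` arguments of `relplot` encode extra columns, here is a minimal sketch on made-up data. The `toy` dataframe, the name `plot_hue_size_sketch`, and the label text are hypothetical.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data (made-up numbers, not from cancer.csv).
toy = pd.DataFrame({
    'Race': ['All Races', 'All Races', 'White', 'White'],
    'Count': [50, 20, 40, 15],
    'Event Type': ['Incidence', 'Mortality', 'Incidence', 'Mortality'],
    'Population': [1000, 1000, 800, 800],
})


def plot_hue_size_sketch(data):
    # hue colors each point by event type; size scales it by population.
    grid = sns.relplot(
        data=data, x='Race', y='Count',
        hue='Event Type', size='Population',
    )
    grid.set_axis_labels('Race', 'Number of events')
    grid.set(title='Toy example: events by race')
    return grid


grid = plot_hue_size_sketch(toy)
```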

In [0]:
# Type your solution here


In [0]:
plot_problem_23c(df)

### Problem 2.3.d) Thoughts and Bonus!
This last plot is pretty different than the other plots we have made, mostly due to the extra information we have plotted. Recall that in Monday's lecture we discussed three types of data (Quantitative, Ordinal, and Nominal) and a hierarchy of ways to encode that data. In this last plot, we can actually see the hierarchy in play (try to look at each column and mark which type of data it is)!

As a bonus, create a function `plot_problem_23d` that uses some of the seaborn features we haven't talked about yet. You should look through the seaborn documentation for the `relplot` function and see if there are any other arguments you could use to personalize your plot! There might be a cool way to *color* the points or perhaps a way to place plots side-by-side (the solution will showcase one such example). It may be helpful to look at the example plots in the documentation to see what seaborn showcases.
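
One option for placing plots side-by-side is the `col` argument of `relplot`, which draws one panel per value of a column. Here is a minimal sketch on made-up data; the `toy` dataframe, the name `plot_facets_sketch`, and the label text are hypothetical, and this is only one of many ways to personalize the plot.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data (made-up numbers, not from cancer.csv).
toy = pd.DataFrame({
    'Population': [1000, 2000, 1000, 2000],
    'Count': [10, 25, 12, 30],
    'Sex': ['Female', 'Female', 'Male', 'Male'],
    'Event Type': ['Incidence', 'Mortality', 'Incidence', 'Mortality'],
})


def plot_facets_sketch(data):
    # col='Sex' creates one side-by-side panel per distinct value of Sex.
    grid = sns.relplot(
        data=data, x='Population', y='Count',
        col='Sex', hue='Event Type',
    )
    grid.set_axis_labels('Population', 'Number of events')
    # {col_name} is filled in with each panel's value of the col variable.
    grid.set_titles(col_template='Sex: {col_name}')
    return grid


grid = plot_facets_sketch(toy)
```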

In [0]:
# Type your solution here


In [0]:
plot_problem_23d(df)

# Problem 3: Discussion



## Problem 3.1
Do you think a line chart is an effective visualization in problem 2.1? Explain in 1-2 sentences why or why not.






Answer:

## Problem 3.2
What do you think are the limitations of this dataset?

Answer: