In [None]:
# Initialize OK
from client.api.notebook import Notebook
ok = Notebook('lab02.ok')

# Lab 2: split-apply-combine in pandas & data visualization

Question:

Do Netflix subscribers prefer older or newer movies? Are there other factors that affect this preference such as *rating*?

Task:

- Split the dataset into groups, one for each year, and then compute one or more summary statistics.
- See whether this statistic increases over the years.
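
The split-apply-combine idea in the task above can be sketched on a tiny made-up table. The column names `year` and `score` and all values are assumptions chosen for illustration; they are not taken from `netflix.csv`.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
toy = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016, 2016],
    "score": [80.0, 90.0, 70.0, 75.0, 95.0],
})

# Split the rows by year, apply the mean to each group,
# and combine the per-group results into a single Series.
mean_score_by_year = toy.groupby("year")["score"].mean()
print(mean_score_by_year)
```

Comparing the values of the resulting Series from one year to the next is one simple way to see whether the statistic increases over time.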


Source: https://github.com/datacamp/community-groupby

## Data Exploration with pandas

### Import your data

In [69]:
# Import packages and set visualization style
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [70]:
# Import data and check out head of DataFrame
netflix_data = pd.read_csv('netflix.csv')
netflix_data.head()

In [71]:
# We can also print the last five rows using .tail():
netflix_data.tail()

#### Question 1a

Write a function `stack_frame` that takes as arguments two data frames and returns one with the data frames stacked on top of each other. Once you have the function written, test it on the head and tail of the netflix dataset.

<!--
BEGIN QUESTION
name: q1a
manual: false
points: 2
-->

In [81]:
def stack_frame(frame1, frame2):
    """Stack the two frames together.
    Assume both frames have the same number of columns.
    """
    assert len(frame1.columns) == len(frame2.columns), "both data frames must have the same number of columns"
    ...

stacked_head_tail = stack_frame(netflix_data.head(), netflix_data.tail()) # BEGIN SOLUTION

In [None]:
ok.grade("q1a");

Since both data frames have five rows, our stacked frame should have ten rows.

### View your Data

Call `.info()` on `netflix_data`. How many observations are there? How many features (columns) are there?
Note that `user_rating_score` has only 605 non-null values, which means that 395 values are missing:

In [7]:
# Check out info of DataFrame
netflix_data.info()

Call `set` on the `rating` variable of `netflix_data`.

<!--
BEGIN QUESTION
name: q1b
manual: false
points: 2
-->

In [86]:
...

Next, let's make a pivot table to see how the user rating score changes by year and the rating of the movie.

In [9]:
netflix_data.pivot_table('user_rating_score', index='rating', columns='release_year').iloc[:, -5:]


How would you make the same pivot table for the first 5 years?

<!--
BEGIN QUESTION
name: q1c
manual: false
points: 2
-->

In [83]:
first_five_years = ...
print(first_five_years)

In [None]:
ok.grade("q1c");

Is there more missing data in the early years or the later years? Does this match your expectations?

### Missing Data and Duplicates

We noticed that .... we can use `dropna` to remove the missing values.  WARNING: this is a dangerous practice in general, because the missing data can be informative about what is going on.  We'll discuss this in a couple of weeks.  For now, just know that we're throwing out a lot of data that might be useful.
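
As a rough illustration of why missing values can carry information, the sketch below uses a made-up table in which scores are missing more often in one group than in the other. The column names `era` and `score` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: scores are missing mostly in the "old" group.
toy = pd.DataFrame({
    "era": ["old", "old", "old", "new", "new", "new"],
    "score": [np.nan, np.nan, 60.0, 80.0, 90.0, np.nan],
})

# Share of missing scores within each group.
missing_share = toy["score"].isna().groupby(toy["era"]).mean()
print(missing_share)

# dropna() keeps only complete rows, so the groups shrink unevenly.
rows_left = toy.dropna().groupby("era").size()
print(rows_left)
```

When missingness is uneven like this, the rows that survive `dropna` are no longer representative of the original table.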

In [13]:
# dropna() drops rows with missing values and drop_duplicates() drops duplicate rows; each returns a new DataFrame
type(netflix_data.dropna())
type(netflix_data.drop_duplicates())

Which method of dropping data gives us a larger data frame? Write code below to find out how many rows are left after dropping rows with `dropna`, how many are left after using `drop_duplicates`, and how many are left after applying both.

<!--
BEGIN QUESTION
name: q2a
manual: false
points: 2
-->

In [14]:
num_rows_drop_na = ...
print(num_rows_drop_na)

num_rows_drop_dup = ...
print(num_rows_drop_dup)

num_rows_drop_both = ...
print(num_rows_drop_both)

In [None]:
ok.grade("q2a");

We are going to assign `netflix_data.dropna()` to the variable name `df` (short for "data frame").

In [54]:
df = netflix_data.dropna()

### What kinds of problems might arise from dropping observations with NAs in them?

*Write your answer here, replacing this text.*

In [55]:
# Get summary stats of the data frame.
df.describe()

### What do you notice is peculiar about `user_rating_size`? 
### Do those summary statistics match the data you see from the `stack_frame()` function?

*Write your answer here, replacing this text.*

## Using groupby (split-apply-combine) to answer the question

<a id='step_1'></a>

Let us first use a *groupby* method to split the data into groups, where each group is the set of movies released in a given year. 

<!--
BEGIN QUESTION
name: q3a
manual: false
points: 2
-->

In [19]:
# Group by year
netflix_by_year = ...

In [None]:
ok.grade("q3a");

Let us compute summary statistics of our response variable `user_rating_score` in order to capture the main characteristics for each value of `release_year`.

In [21]:
netflix_by_year.describe()

How would you modify the above code to only print the summary statistics for `user_rating_score`?
<!--
BEGIN QUESTION
name: q3b
manual: false
points: 2
-->

In [56]:
...

We can visualize the above information using a boxplot as follows:

In [62]:
# Create a boxplot of user_rating_score, one box per release_year
df.boxplot("user_rating_score", by="release_year")
plt.xticks(rotation=45)

Let's compute a summary statistic, such as the mean, for each combination of `release_year` and `rating`.

In [63]:
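# numeric_only=True restricts the mean to numeric columns; recent pandas versions raise an error on text columns otherwise
df.groupby(['release_year','rating']).mean(numeric_only=True)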

### Back to the Baby names dataset

Let's take another look at the babynames dataset we discussed in class.  This time, we'll get data on baby names by US state.

### `fetch_and_cache` Helper

The following function downloads and caches data in the `data/` directory and returns the `Path` to the downloaded file. The cell below the function describes how it works. 

In [27]:
import requests
from pathlib import Path

def fetch_and_cache(data_url, file, data_dir="data", force=False):
    """
    Download and cache a url and return the path to the saved file.
    
    data_url: the web address to download
    file: the file in which to save the results.
    data_dir: (default="data") the location to save the data
    force: if true the file is always re-downloaded 
    
    return: The pathlib.Path to the file.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(exist_ok=True)
    file_path = data_dir/Path(file)
    if force and file_path.exists():
        file_path.unlink()
    if force or not file_path.exists():
        print('Downloading...', end=' ')
        resp = requests.get(data_url)
        with file_path.open('wb') as f:
            f.write(resp.content)
        print('Done!')
    else:
        import time 
        created = time.ctime(file_path.stat().st_ctime)
        print("Using cached version downloaded at", created)
    return file_path

In Python, a `Path` object represents a filesystem path to a file (or other resource). The `pathlib` module is effective for writing code that works on different operating systems and filesystems. 

To check if a file exists at a path, use `.exists()`. To create a directory for a path, use `.mkdir()`. To remove a file that might be a [symbolic link](https://en.wikipedia.org/wiki/Symbolic_link), use `.unlink()`. 

This function creates a path to a directory that will contain data files. It ensures that the directory exists (which is required to write files in that directory), then proceeds to download the file based on its URL.

The benefit of this function is that not only can you force when you want a new file to be downloaded using the `force` parameter, but in cases when you don't need the file to be re-downloaded, you can use the cached version and save download time.
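
The `Path` operations described above can be tried safely in a temporary directory. This is a minimal sketch: the directory name `data` mirrors the default used by `fetch_and_cache`, while the file name `example.txt` and its contents are made up.

```python
import tempfile
from pathlib import Path

# Work inside a throwaway directory so nothing else on disk is touched.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir(exist_ok=True)          # no error if it already exists
    file_path = data_dir / "example.txt"   # hypothetical file name

    existed_before = file_path.exists()    # nothing has been written yet
    file_path.write_bytes(b"cached contents")
    existed_after = file_path.exists()     # now the "cached" file is there

    file_path.unlink()                     # same call fetch_and_cache uses when force=True
    existed_after_unlink = file_path.exists()

print(existed_before, existed_after, existed_after_unlink)  # False True False
```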

Below we use `fetch_and_cache` to download the `namesbystate.zip` zip file, which is a compressed directory of CSV files. 

**This might take a little while! Consider stretching.**

In [28]:
data_url = 'https://www.ssa.gov/oact/babynames/state/namesbystate.zip'
namesbystate_path = fetch_and_cache(data_url, 'namesbystate.zip')

*Optional Hacking Challenge:* Use the `zipfile` module, `pd.read_csv`, and `pd.concat` to build a single DataFrame called `baby_names` containing all of the data from each state with the `column_labels` below. A `ZipFile` object has an attribute `filelist` and a method `open`. Each `.TXT` file inside `namesbystate.zip` is a CSV file for the names of babies born in one state.

If you don't want to figure out how to read a zip file, you can just scroll down and use the cell where we've done it for you.

In [29]:
import zipfile
zf = zipfile.ZipFile(namesbystate_path, 'r')

column_labels = ['State', 'Sex', 'Year', 'Name', 'Count']

The following cell builds the final full `baby_names` DataFrame. It first builds one dataframe per state, because that's how the data are stored in the zip file. Here is documentation for [pd.concat](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.concat.html) if you want to know more about its functionality. 

In [30]:
import zipfile
zf = zipfile.ZipFile(namesbystate_path, 'r')

column_labels = ['State', 'Sex', 'Year', 'Name', 'Count']

def load_dataframe_from_zip(zf, f):
    with zf.open(f) as fh: 
        return pd.read_csv(fh, header=None, names=column_labels)

states = [
    load_dataframe_from_zip(zf, f)
    for f in sorted(zf.filelist, key=lambda x:x.filename) 
    if f.filename.endswith('.TXT')
]

# Concatenate all state frames in one call and renumber the rows 0..n-1
baby_names = pd.concat(states, ignore_index=True)

In [31]:
len(baby_names)

In [32]:
baby_names.head()
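
The zip-reading pattern used to build `baby_names` can also be tried on a small archive built in memory, without downloading anything. This is a sketch under stated assumptions: the member names `AA.TXT`, `BB.TXT`, and `README.pdf` and all rows are made up, and only the column labels match the ones used above.

```python
import io
import zipfile
import pandas as pd

# Build a tiny zip archive in memory; member names and rows are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf_out:
    zf_out.writestr("AA.TXT", "AA,F,2000,Ann,5\nAA,M,2000,Bob,7\n")
    zf_out.writestr("BB.TXT", "BB,F,2000,Cal,6\n")
    zf_out.writestr("README.pdf", "not a data file")

labels = ["State", "Sex", "Year", "Name", "Count"]

# Read every .TXT member as a headerless CSV, one frame per member.
frames = []
with zipfile.ZipFile(buffer, "r") as zf_in:
    for info in sorted(zf_in.filelist, key=lambda x: x.filename):
        if info.filename.endswith(".TXT"):
            with zf_in.open(info) as fh:
                frames.append(pd.read_csv(fh, header=None, names=labels))

# Stack the frames; ignore_index=True gives a fresh 0..n-1 index.
toy_names = pd.concat(frames, ignore_index=True)
print(toy_names)
```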

## Review: Slicing Data Frames - selecting rows and columns


### Selection Using Label/Index (using loc)

**Column Selection** 
To select a column of a `DataFrame` by column label, the safest and fastest way is to use the `.loc` [accessor](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html). General usage looks like `df.loc[rowname, colname]`. (Recall that the colon `:` means "everything".) For example, if we want the `color` column of the `ex` data frame, we would use `ex.loc[:, 'color']`.

- You can also slice across columns. For example, `baby_names.loc[:, 'Name':]` selects the `Name` column and all columns after it.

- *Alternative:* While `.loc` is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is the `[]` method, which takes the form `df['colname']`.

**Row Selection**
Similarly, if we want to select a row by its label, we can use the same `.loc` accessor. In this case, the "label" of each row refers to the index (i.e. primary key) of the dataframe.

In [33]:
#Example:
baby_names.loc[2:5, 'Name']

In [34]:
#Example:  Notice the difference between these two methods
baby_names.loc[2:5, ['Name']]

`.loc` selects rows by the pandas row index (the labels), not by the position of the rows in the dataframe. Also, notice that if you write `2:5` with `loc[]`, the end of the slice is included, contrary to normal Python slicing, so you get the row with index 5. 
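
The inclusive end point of a `loc[]` slice, and the difference between passing `'Name'` and `['Name']`, can be checked on a small made-up frame (the names below are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with the default index 0..5.
toy = pd.DataFrame({"Name": ["Ann", "Bob", "Cal", "Dee", "Eve", "Fay"]})

# Label-based slice with loc: both end points are included.
by_label = toy.loc[2:4, "Name"]
print(list(by_label))       # ['Cal', 'Dee', 'Eve']

# Ordinary Python slicing of a list: the end is excluded.
by_python = list(toy["Name"])[2:4]
print(by_python)            # ['Cal', 'Dee']

# A single label returns a Series; a list of labels returns a DataFrame.
as_series = toy.loc[2:4, "Name"]
as_frame = toy.loc[2:4, ["Name"]]
print(type(as_series).__name__, type(as_frame).__name__)  # Series DataFrame
```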


### Review: Selection using Integer location (using iloc)

There is another Pandas slicing accessor called `iloc[]`, which lets you slice the dataframe by row position and column position instead of by row index and column label (as `loc[]` does). This is the main difference between the two, and it is **important** that you remember the difference and why you might want to use one over the other. In addition, with `iloc[]`, the end index is NOT included, as with normal Python slicing.

Below, we have sorted the `baby_names` dataframe. Notice how the *position* of a row is not necessarily equal to the *index* of a row. For example, the first row is not necessarily the row associated with index 0. This distinction is important in understanding the difference between `loc[]` and `iloc[]`.

In [35]:
sorted_baby_names = baby_names.sort_values(by=['Name'])
sorted_baby_names.head()
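
The same position-versus-index effect can be seen on a four-row made-up frame, where it is easy to follow each row by eye. The names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with the default index 0..3.
toy = pd.DataFrame({"Name": ["Mia", "Zoe", "Ava", "Leo"]})

# Sorting reorders the rows, but each row keeps its original index label.
sorted_toy = toy.sort_values(by="Name")
print(list(sorted_toy.index))                    # [2, 3, 0, 1]

# iloc selects by position: the first row after sorting.
first_by_position = sorted_toy.iloc[0]["Name"]   # 'Ava'

# loc selects by label: the row whose index label is 0.
first_by_label = sorted_toy.loc[0, "Name"]       # 'Mia'

print(first_by_position, first_by_label)
```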

How would you get the 2nd, 3rd, and 4th rows with only the `Name` column of the `baby_names` dataframe using `iloc[]`?
<!--
BEGIN QUESTION
name: q3c
manual: false
points: 2
-->

In [36]:
...

Observe the difference when using `loc[]` with 1:4, since it selects using the *index*.

In [37]:
sorted_baby_names.loc[1:4, "Name"]

Lastly, we can change the index of a dataframe using the `set_index` method.

In [38]:
#Example: We change the index from 0,1,2... to the Name column
df = baby_names[:5].set_index("Name") 
df

How would you look up rows by name directly, using Mary and Anna?
<!--
BEGIN QUESTION
name: q3d
manual: false
points: 2
-->

In [39]:
...

However, if we still want to access rows by location we will need to use the integer loc (`iloc`) accessor:

In [40]:
#Example: 
#df.loc[2:5,"Year"] You can't do this
df.iloc[1:4,2:3]

### Question 4

Selecting multiple columns is easy.  You just need to supply a list of column names.  Select the `Name` and `Year` **in that order** from the `baby_names` table.

<!--
BEGIN QUESTION
name: q4
manual: false
points: 2
-->

In [41]:
name_and_year = ...
name_and_year[:5]

In [None]:
ok.grade("q4");

Note that `.loc[]` can be used to re-order the columns within a dataframe.
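
As a small sketch of that re-ordering, using a made-up frame with hypothetical columns `a`, `b`, and `c`:

```python
import pandas as pd

# Hypothetical toy frame with columns in the order a, b, c.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Listing the labels in a new order returns the columns in that order.
reordered = toy.loc[:, ["c", "a", "b"]]
print(list(reordered.columns))  # ['c', 'a', 'b']
```

The original frame is left unchanged; `reordered` is a new DataFrame.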

## Filtering Data

### Filtering with boolean arrays

Filtering is the process of removing unwanted material.  In your quest for cleaner data, you will undoubtedly filter your data at some point: whether it be for clearing up cases with missing values, for culling out fishy outliers, or for analyzing subgroups of your data set.  Example usage looks like `df[df['column name'] < 5]`.  Note that when you combine several conditions, each one has to be wrapped in parentheses.

For your reference, some commonly used comparison operators are given below.

Symbol | Usage      | Meaning 
------ | ---------- | -------------------------------------
==   | a == b   | Does a equal b?
<=   | a <= b   | Is a less than or equal to b?
>=   | a >= b   | Is a greater than or equal to b?
<    | a < b    | Is a less than b?
&#62;    | a &#62; b    | Is a greater than b?
~    | ~p       | Returns negation of p
&#124; | p &#124; q | p OR q
&    | p & q    | p AND q
^  | p ^ q | p XOR q (exclusive or)
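

The operators in the table can be combined on boolean Series as sketched below. The frame, its columns `group` and `value`, and the threshold are all made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame.
toy = pd.DataFrame({
    "group": ["x", "x", "y", "y"],
    "value": [1, 7, 3, 9],
})

# Each comparison yields a boolean Series with one entry per row.
is_x = toy["group"] == "x"
is_big = toy["value"] > 5

both = toy[is_x & is_big]      # AND: rows that satisfy both conditions
either = toy[is_x | is_big]    # OR: rows that satisfy at least one
not_x = toy[~is_x]             # negation

# Written inline, each condition needs its own parentheses,
# because & binds more tightly than == and >.
inline = toy[(toy["group"] == "x") & (toy["value"] > 5)]

print(list(both.index), list(either.index), list(not_x.index))  # [1] [0, 1, 3] [2, 3]
```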

In the following, we construct the DataFrame containing only names registered in California:

In [42]:
ca = baby_names[baby_names['State'] == 'CA']

### Question 5
Select the names in Year 2000 (for all baby_names) that have counts larger than 3000. What do you notice?

(If you use `p & q` to filter the dataframe, make sure to use `df[(p) & (q)]` or `df.loc[(p) & (q)]`.)

**Remember** that both `[]` indexing and `loc` achieve the same result here; `loc` is simply more explicit, which is why it is often preferred in production code. You are free to use whichever one you would like.

<!--
BEGIN QUESTION
name: q5
-->

In [43]:
result = ...
result.head()

### Question 6

Some names gain/lose popularity because of cultural phenomena such as a political figure coming to power. Below, we plot the popularity of the name Hillary over time. What do you notice about this plot? What might be the cause of the steep drop?

<!--
BEGIN QUESTION
name: q6
-->

In [44]:
hillary_baby_name = baby_names[(baby_names['Name'] == 'Hillary') & (baby_names['State'] == 'CA') & (baby_names['Sex'] == 'F')]
plt.plot(hillary_baby_name['Year'], hillary_baby_name['Count'])
plt.title("Hillary Popularity Over Time")
plt.xlabel('Year')
plt.ylabel('Count');

*Write your answer here, replacing this text.*

# Submit
Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output.
**Please save before submitting!**

In [None]:
# Save your notebook first, then run this cell to submit.
ok.submit()