In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("discussion.ipynb")

# DSC 80 - Discussion 04

### Due Date: Saturday October 22, 11:59 PM

**Discussions will be due by the end of the day on Saturday**

# Combining DataFrames, Permutation Testing & Data Visualization


In [1]:
# import libraries
import pandas as pd
import numpy as np
import os
from IPython.display import HTML

In [2]:
# for formatting purposes
def multi_table(table_list):
    ''' Accepts a list of IpyTable objects and returns a table which contains each IpyTable in a cell
    '''
    return HTML(
        '<table><tr style="background-color:white;">' + 
        ''.join(['<td>' + table._repr_html_() + '</td>' for table in table_list]) +
        '</tr></table>'
    )

In [3]:
from discussion import *

# 1. Review: Combining DataFrames 

#### `merge()`

* Used to combine two dataframes on the basis of **values of common columns** (indexes can also be used; pass `left_index=True` and/or `right_index=True`).
    * If we are joining columns on columns, the DataFrame indexes will be ignored. 
    * If we are joining indexes on indexes or indexes on a column or columns, the index will be passed on.

* **`on`**: column or index level names to join on. 
    * These must be found in both DataFrames. 
    * If `on` is `None` and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

* **`how`**: specifies how to determine which keys are to be included in the resulting table. 
    * If a key combination appears in only one of the two tables, a `left`, `right`, or `outer` join fills the columns coming from the other table with `np.nan`.
    * Defaults to `inner`; when `on` is not given, the join is performed on the columns the two tables share.
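
The defaults for `on` and `how` can be sketched on two small tables. The names `students` and `grades` and their contents are made up for illustration and are not part of the discussion data.

```python
import pandas as pd

# Hypothetical toy tables for illustration only.
students = pd.DataFrame({'pid': [1, 2, 3], 'name': ['Ann', 'Bo', 'Cy']})
grades = pd.DataFrame({'pid': [2, 3, 4], 'grade': [90, 85, 70]})

# `on` is omitted, so merge uses the only shared column, 'pid'.
# `how` defaults to 'inner': only pids present in both tables survive.
inner = students.merge(grades)

# An outer merge keeps every pid; unmatched cells are filled with NaN.
outer = students.merge(grades, how='outer')

print(inner)
print(outer)
```

Only `pid` values 2 and 3 appear in both tables, so the inner merge has two rows, while the outer merge has one row per distinct `pid`.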

#### `concat()`

* Used to append one (or more) dataframes one below the other (or sideways, depending on whether the axis option is set to 0 or 1).
    * Useful if we have two or more data sets containing the same columns but different rows of data.
    * We can also add the columns from one `DataFrame` to those of another `DataFrame`.
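
A minimal sketch of both directions of `concat()`, using two made-up tables `a` and `b` (illustrative names only):

```python
import pandas as pd

# Hypothetical toy tables for illustration only.
a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
b = pd.DataFrame({'x': [5, 6], 'y': [7, 8]})

# axis=0 (the default) stacks rows; ignore_index=True renumbers them 0..n-1.
stacked = pd.concat([a, b], ignore_index=True)

# axis=1 places the tables side by side, aligning rows on the index.
# add_suffix renames b's columns so the result has no duplicate labels.
side_by_side = pd.concat([a, b.add_suffix('_b')], axis=1)

print(stacked)
print(side_by_side)
```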

#### `join()`

* Used to merge two dataframes on the basis of the index; instead of using `merge()` with the option `left_index=True` we can use `join()`.
    * Join operation honors the object on which it is called: `a.join(b)` $ \neq $ `b.join(a)`.

<img src="imgs/join_types.jpg">

1. **Inner Join** – the default for `merge()` (`join()` defaults to a left join instead); only keep rows where the merge “on” value exists in both the left and right dataframes.
2. **Left Join** – keep every row in the left dataframe.
    * Where a value of the “on” variable is missing from the right dataframe, add `np.nan` values in the result.
3. **Right Join** – keep every row in the right dataframe. 
    * Where a value of the “on” variable is missing from the left dataframe, add `np.nan` values in the result.
4. **Outer Join** – returns all the rows from the left dataframe, all the rows from the right dataframe, and matches up rows where possible, with `NaNs` elsewhere.
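
To see which keys each join type keeps, here is a small sketch on made-up tables (`left_df` and `right_df` are illustrative names, not the tables used in the cells below):

```python
import pandas as pd

# Hypothetical toy tables for illustration only.
left_df = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right_df = pd.DataFrame({'key': ['b', 'c', 'd'], 'rval': [20, 30, 40]})

# Collect the keys kept by each join type.
kept = {
    how: sorted(left_df.merge(right_df, on='key', how=how)['key'])
    for how in ['inner', 'left', 'right', 'outer']
}

for how, keys in kept.items():
    print(how, keys)
```

Key `'a'` exists only on the left and `'d'` only on the right, so they survive only in the joins that keep every row from their own side.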

We'll start with a simple example:

In [5]:
left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]}).set_index('key')
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]}).set_index('key')

joined = left.join(right, lsuffix='_l', rsuffix='_r')

multi_table([left, right, joined])

Now let's try something a bit more complex:

In [6]:
df1_data = {
    'Year' : [2014, 2014, 2014, 2014, 2014],
    'Week' : ['A', 'B', 'B', 'C', 'D'],
    'Color' : ['Red', 'Red', 'Black', 'Red', 'Green'],
    'Val' : [50, 60, 70, 10, 20]
}

df1 = pd.DataFrame(df1_data).set_index('Week')

df2_data = {
    'Year' : [2014, 2014, 2014, 2014, 2014],
    'Week' : ['A', 'B', 'C', 'C', 'D'],
    'Color' : ['Black', 'Black', 'Green', 'Red', 'Red'],
    'Score' : [30, 100, 50, 20, 40]
}

df2 = pd.DataFrame(df2_data).set_index('Week')

multi_table([df1, df2])

In [7]:
# how many rows, how many columns?
df1.join(df2, lsuffix='_l', rsuffix = '_r')

In [8]:
# will this be any different?
df2.join(df1, lsuffix='_l', rsuffix = '_r')

# 2. Review: Permutation Testing

### Hypothesis testing

- In "vanilla" hypothesis testing, we are given a **single** observed sample, and are asked to make an assumption as to how it came to be.
    - This assumption is the **null hypothesis**.
    - This assumption must be a **probability model**, since we use it to generate new data.
- We simulate data under the null hypothesis to answer the question, **if this assumption is true, how likely is the given observation?**

### Permutation testing

* **Given two observed samples, are they fundamentally different, or could they have been generated by the same process?**
* In a permutation test, we decide whether two **fixed** random samples come from the same distribution.
* Unlike in the previous hypothesis testing examples, when conducting a permutation test, you do not know **what distribution** generated your two samples!
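
The procedure can be sketched end to end on synthetic data. Everything below is an assumption made for illustration: the two groups, their values, the choice of difference in means as the test statistic, and the 1000 repetitions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic data for illustration only: two groups with made-up values.
df = pd.DataFrame({
    'group': ['A'] * 50 + ['B'] * 50,
    'value': np.concatenate([rng.normal(120, 15, 50), rng.normal(125, 15, 50)]),
})

def diff_in_means(data, value_col='value'):
    """Mean of group B minus mean of group A."""
    means = data.groupby('group')[value_col].mean()
    return means['B'] - means['A']

observed = diff_in_means(df)

# Under the null, group labels are exchangeable, so shuffle the values
# against the fixed labels and recompute the statistic each time.
n_repetitions = 1000
simulated = np.empty(n_repetitions)
for i in range(n_repetitions):
    shuffled = df.assign(shuffled=rng.permutation(df['value'].to_numpy()))
    simulated[i] = diff_in_means(shuffled, 'shuffled')

# Two-sided p-value: fraction of shuffles at least as extreme as observed.
p_value = np.mean(np.abs(simulated) >= np.abs(observed))
print(observed, p_value)
```

No distribution is ever specified here: the null distribution of the statistic is built purely by reshuffling the observed values.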


## Revisit: Birth weight example 👶

### Birth weight and Maternal Age

- Is there a significant difference in the weights of babies born to mothers who belong to certain age categories?
- We have 3 groups:
    - Babies whose mothers belonged to age group 15-25
    - Babies whose mothers belonged to age group 25-35
    - Babies whose mothers belonged to age group 35-45
- In each group, the relevant attribute is the birth weight of the baby. 

In [9]:
# Kaiser dataset, 70s 
import os
baby_fp = os.path.join('data', 'baby1.csv')
baby = pd.read_csv(baby_fp)
baby.head()
# baby['Maternal Age'].max()

In [10]:
# .copy() so that a column can be added later without a SettingWithCopyWarning
mother_age_and_birthweight = baby[['Maternal Age', 'Birth Weight']].copy()
mother_age_and_birthweight.head()

### Exploratory data analysis

How many babies are in each group?

In [11]:
# Data binning
bins = [15,25,35,45]
group_names = ['15-25','25-35','35-45']
mother_age_and_birthweight['Age Bracket'] = pd.cut(mother_age_and_birthweight['Maternal Age'],bins,labels=group_names)
mother_age_and_birthweight.head()
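
A small sketch of how `pd.cut` treats the bin edges, and of counting the rows in each bracket with `value_counts`. The ages below are made up for illustration.

```python
import pandas as pd

# Made-up ages for illustration only.
ages = pd.Series([15, 16, 25, 26, 35, 45])
brackets = pd.cut(ages, [15, 25, 35, 45], labels=['15-25', '25-35', '35-45'])

# By default bins are right-inclusive: (15, 25], (25, 35], (35, 45].
# So 25 lands in '15-25', and 15 itself falls outside every bin (NaN).
counts = brackets.value_counts(sort=False)

print(brackets.tolist())
print(counts)
```

Passing `include_lowest=True` to `pd.cut` would make the first interval include its left edge.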

What is the average birth weight within each group?

In [12]:
mother_age_and_birthweight.groupby('Age Bracket')[['Birth Weight']].mean()

Note that there are 16 ounces in 1 pound, so the above weights are ~7-8 pounds.

### Visualizing birth weight distributions

- Below, we draw the distribution of birth weights, separated by mother's age group.
- The histograms appear to be different, but could the difference be **due to random chance**, or is there a significant difference between the distributions?

In [13]:
title = "Birth Weight by Mother's Age Group"

(
    mother_age_and_birthweight
    .groupby('Age Bracket')['Birth Weight']
    .plot(kind='hist', density=True, legend=True,
          ec='w', bins=np.arange(50, 200, 5), alpha=0.75,
          title=title)
);    

In [14]:
(
    mother_age_and_birthweight
    .groupby('Age Bracket')['Birth Weight']
    .plot(kind='kde', legend=True,
          title=title)
);    

In [15]:
from matplotlib import pyplot as plt
shuffled_weights = (
    mother_age_and_birthweight['Birth Weight']
    .sample(frac=1)
    .reset_index(drop=True) # Question: What will happen if we do not reset the index?
)

original_and_shuffled = (
    mother_age_and_birthweight
    .assign(**{'Shuffled Birth Weight': shuffled_weights})
)


original_and_shuffled.head(10)
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

title = 'Birth Weights by Maternal Age Group, SHUFFLED'
original_and_shuffled.groupby('Age Bracket')['Shuffled Birth Weight'].plot(kind='kde', title=title, ax=axes[0])

title = 'Birth Weights by Maternal Age Group'
original_and_shuffled.groupby('Age Bracket')['Birth Weight'].plot(kind='kde', title=title, ax=axes[1]);
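
On the `reset_index` question in the cell above: `assign` aligns a Series on its index labels, so a shuffled Series that keeps its original labels is put straight back into the original order. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up weights for illustration only.
df = pd.DataFrame({'w': [100, 110, 120, 130, 140]})
shuffled = df['w'].sample(frac=1, random_state=0)

# Without reset_index, assignment aligns on index labels and undoes the shuffle.
not_reset = df.assign(shuffled=shuffled)

# With reset_index(drop=True), the shuffled order is kept.
reset = df.assign(shuffled=shuffled.reset_index(drop=True))

print((not_reset['shuffled'] == not_reset['w']).all())
print(reset)
```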

In [16]:
observed_difference = (
    mother_age_and_birthweight
    .groupby('Age Bracket')['Birth Weight']
    .mean()
    .diff()
    .iloc[-1]
)

observed_difference

# 3. Plotting in `pandas`

## is as easy as `.plot()`

* `Series.plot()` plots a column.

In [17]:
data = pd.read_csv('data/data1.csv')

In [18]:
data.head()

In [19]:
# select a column from data
z0 = data['z0']
z0.head()

* Use a line plot to plot numeric data.
* `data.plot()` plots a line plot by default.
    - The x-axis is the index by default
    - Can be set explicitly using the key-word argument `x`.

In [20]:
# index is [0...1000]
z0.plot()

In [21]:
# set index to plot correct x-axis
z0 = data.set_index('x').loc[:, 'z0']
z0.head()

In [22]:
z0.plot()

In [23]:
# set x-axis using a keyword argument
data.plot(x='x', y='z0')

### Plotting (quantitative) empirical distributions in Pandas

* Use the key-word argument `kind`
```
kind : str
    - 'hist' : histogram
    - 'box' : boxplot
    - 'kde' : Kernel Density Estimation plot
    ...
```
* `kind='hist'` uses 10 bins by default and plots the *count* of observations within those bins.
    - Use `density=True` to draw a histogram whose area is normalized to 1.
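
The difference between counts and densities can be checked numerically without drawing anything. This sketch uses `np.histogram` as a stand-in for the plotting call, on made-up data.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=1000)  # made-up data for illustration only

# 10 bins mirrors the default of kind='hist'.
counts, edges = np.histogram(values, bins=10)
density, _ = np.histogram(values, bins=10, density=True)

widths = np.diff(edges)
# Counts sum to the number of observations; density bars have total area 1.
print(counts.sum(), (density * widths).sum())
```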

In [24]:
# histogram of z0 values; 
# 25 bins.
# density = normalized histogram

z0.plot(kind='hist', bins=25, density=True, ec='w')

In [25]:
# kernel density estimate of the distribution
# smooth approximation of the empirical distribution

z0.plot(kind='kde')

In [26]:
z0.plot(kind='box')

---

## Practice Problems

**Question 1**

As in Discussion-03, you are given two separate dataframes: `mlb_2017` and `mlb_2018`. They contain statistics for the 2017 and 2018 baseball seasons, respectively. Your job is to combine these two dataframes into one using the following guidelines:

* The dataframe you return should be indexed by team name (`Tm`).
* The dataframe you return should include columns `#Bat`, `BatAge`, `R/G`, `G` from `mlb_2017` and `mlb_2018`.
* Use the suffixes `_prev` and `_curr` to differentiate between statistics from both seasons.
* The dataframe you return should have all rows from `mlb_2018` irrespective of the team name `Tm`.

Create a function `specific_combined_seasons` that returns, as a tuple, the following:

* The combined dataframe described above.
* The highest average `R/G` of the 2017 and 2018 seasons combined.

In [28]:
# read in the following .txt files
mlb_2017 = pd.read_csv(os.path.join('data','mlb_2017.txt'))
mlb_2018 = pd.read_csv(os.path.join('data','mlb_2018.txt'))

multi_table([mlb_2017.head(), mlb_2018.head()])

In [29]:
# don't change this cell -- it is needed for the tests to work
mlb_2017 = pd.read_csv(os.path.join('data','mlb_2017.txt'))
mlb_2018 = pd.read_csv(os.path.join('data','mlb_2018.txt'))
q1_out = specific_combined_seasons(mlb_2017, mlb_2018)

In [None]:
grader.check("q1")

**Question 2**

Plot the counts of meals in `tips` by day. Your plotting function, `plot_meal_by_day`, should return a `matplotlib.axes._subplots.AxesSubplot` object; your plot should look like the plot below. 

Note: You don't need to have exact same colors, but the plot orientation, axis order, and title should match.

<img src="imgs/barh1.png" width="50%"/>

In [35]:
# don't change this cell -- it is needed for the tests to work
tips = sns.load_dataset('tips')
q1_fig = plot_meal_by_day(tips)

In [None]:
grader.check("q2")

## Congratulations! You're done!

* Submit your `.py` file to Gradescope. Note that you only need to submit the `.py` file; this notebook should not be uploaded. Make sure that all of your work is in the `.py` file and not here by running the doctests: `python -m doctest discussion.py`.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()