In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("discussion.ipynb")

# DSC 80 - Discussion 03

### Due Date: Saturday October 15, 11:59 PM

**Discussions will be due by the end of the day on Saturday**

---


In [1]:
# import libraries
import pandas as pd
import numpy as np
import os
from IPython.display import HTML

In [2]:
# for formatting purposes
def multi_table(table_list):
    ''' Accepts a list of IpyTable objects and returns a table which contains each IpyTable in a cell
    '''
    return HTML(
        '<table><tr style="background-color:white;">' + 
        ''.join(['<td>' + table._repr_html_() + '</td>' for table in table_list]) +
        '</tr></table>'
    )

In [3]:
from discussion import *

# Review: Hypothesis Testing & Combining DataFrames

## Hypothesis Testing

We will cover examples of two key types of hypothesis tests:
* Comparing two categorical distributions
* Comparing a sub-group statistic to a population parameter

### Steps to follow to solve a hypothesis testing problem
1. Form the Null and Alternate hypotheses
2. Define the test statistic
3. Calculate the observed/sample test statistic (form an intuition of how big/small/extreme it is)
4. Simulate one instance of the test statistic under the null hypothesis, then simulate the whole null distribution
5. Calculate the p-value based on the null distribution and the observed test statistic
6. Plot the null distribution and the observed statistic, and validate the p-value
7. Conclude whether you reject or fail to reject the null hypothesis
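
As a rough illustration of how these steps fit together, here is a minimal sketch on a made-up problem that is not part of this discussion: testing whether a coin is fair after seeing 60 heads in 100 flips. The data, the choice of statistic, and all variable names are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 60 heads observed in 100 flips.
n_flips, observed_heads = 100, 60

# Steps 1-2: null = the coin is fair; statistic = |number of heads - n/2|.
# Step 3: observed test statistic.
observed_stat = abs(observed_heads - n_flips / 2)

# Step 4: simulate the null distribution without a for loop.
n_sims = 10_000
null_heads = rng.binomial(n_flips, 0.5, size=n_sims)
null_stats = np.abs(null_heads - n_flips / 2)

# Step 5: p-value = proportion of simulated statistics at least as extreme as the observed one.
p_value = np.mean(null_stats >= observed_stat)
print(p_value)
```

Steps 6 and 7 (plotting and concluding) are omitted here; they follow the same pattern as the Snapchat example.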

### Are Snapchat users similar in age distribution compared to overall social media users?
Based on a (fake) survey of 1000 Snapchat users, we want to determine whether the age distributions of these two user groups are significantly different.
- Note that the survey data below is already aggregated and can be used directly.

In [5]:
age_df = pd.DataFrame([['GenZ', 0.64, 0.48],
                    ['Millennials', 0.24, 0.36],
                    ['GenX', 0.08, 0.10],
                    ['Boomers', 0.04, 0.06]],
                   columns=['Age Group', 'Snapchat', 'Social Media']).set_index('Age Group')

age_df

#### Step-1: Form the Null and Alternate Hypotheses
Null hypothesis:

Alternate hypothesis: 

#### Step-2: Define the test-statistic

A common test statistic for comparing two categorical distributions is the TVD (Total Variation Distance).

**TVD:** Sum of absolute differences between corresponding values of two distributions, divided by 2.
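
The definition can be sketched as a small helper function. The function name `total_variation_distance` and the toy distributions below are assumptions made for illustration, not part of the discussion code.

```python
import numpy as np

def total_variation_distance(dist_a, dist_b):
    """Half the sum of absolute differences between two categorical distributions."""
    dist_a = np.asarray(dist_a, dtype=float)
    dist_b = np.asarray(dist_b, dtype=float)
    return np.sum(np.abs(dist_a - dist_b)) / 2

# Toy distributions (made up) over the same three categories.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

toy_tvd = total_variation_distance(p, q)  # (0.3 + 0.0 + 0.3) / 2
print(toy_tvd)
```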

#### Step-3: Calculate observed/sample test statistic

In [6]:
age_df.plot(kind='barh', title='Age Group Distribution of Snapchat vs Social Media');

In [7]:
observed_tvd = np.sum(np.abs(age_df['Snapchat'] - age_df['Social Media'])) / 2
observed_tvd

# Think - What would the observed TVD be if the two distributions were identical?
# Further - What does this imply about the TVD as the difference increases/decreases?

#### Step-4: Simulate the test statistic under the null hypothesis, then simulate the null distribution

In [8]:
# Simulate one instance - Keep null hypothesis in mind
# There is no difference between Snapchat age groups wrt. overall age groups.
# So, a sample under the null hypothesis should look like the 'overall social media' distribution.
np.random.multinomial(1000, age_df['Social Media']) / 1000

In [9]:
# Simulate N times - the larger the N, the smoother the null distribution
# Traditional way - Use a for loop and store the null test statistic
# Faster way - Utilize numpy functionalities
N = 5000
np.random.multinomial(1000, age_df['Social Media'], size=N) / 1000

In [10]:
null_dists = np.random.multinomial(1000, age_df['Social Media'], size=N) / 1000

# Power of broadcasting: Can take difference between 2D array and 1D array 
# by broadcasting 1D array to 2D array
null_tvds = np.sum(np.abs(null_dists - age_df['Social Media'].to_numpy()), axis=1) / 2
null_tvds
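
To see the broadcasting step in isolation, here is a small sketch with three hand-written "simulated" distributions instead of 5000 random ones. The array values are made up for illustration; only the shapes matter.

```python
import numpy as np

# Three simulated distributions (rows) over four categories (columns).
sims = np.array([
    [0.50, 0.30, 0.10, 0.10],
    [0.40, 0.40, 0.10, 0.10],
    [0.48, 0.36, 0.10, 0.06],
])
expected = np.array([0.48, 0.36, 0.10, 0.06])  # shape (4,)

# The 1D array is broadcast across every row of the 2D array.
diffs = sims - expected                 # shape (3, 4)

# Summing along axis=1 collapses the categories, leaving one TVD per row.
tvds = np.abs(diffs).sum(axis=1) / 2    # shape (3,)
print(tvds)
```

Using `axis=1` is what makes the result one value per simulation; `axis=0` would instead sum over simulations within each category.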

#### Step-5: Calculate p-value

In [11]:
p_val = np.mean(null_tvds >= observed_tvd)
p_val

#### Step-6: Visualize and validate the p-value

In [12]:
import matplotlib.pyplot as plt
pd.Series(null_tvds).plot(kind='hist', 
                     density=True,
                     ec='w',
                     title='Simulated Null TVDs & Observed TVD');
plt.axvline(x=observed_tvd, color='red', linewidth=2);

In [13]:
# Recall Null Hypothesis - 

#  Think - What happens to your belief about the null hypothesis if we assume the distribution is taken 
#  from only 10 or 100 users?
#  Would you believe the null hypothesis more, or less?

#### Step-7: Provide conclusion

Conclusion: 

**Question 1**

A study from a competitive programming test shows a lower average score for students who used the 'Go' programming language compared to others. Test whether this lower average score for 'Go' is due to chance alone.

- The function `hyp_test_lower_avg` should take a DataFrame like `prog_df` and the number of null test statistic simulations `N`.
- It should return a list containing a) the observed test statistic and b) the p-value of the hypothesis test. Note that the function should work for any `prog_df` having the same column names and the same unique values in the `language` column.

Hint: Don't forget to utilize `numpy` functionality to eliminate the `for` loop. Check lecture 06 final example for a very similar problem and implementation.

In [15]:
prog_df = pd.read_csv(os.path.join('data','prog_df.csv'))
grouped_df = prog_df.groupby('language')['score'].agg(['mean', 'count'])
grouped_df

In [16]:
# What's the test statistic?
# What's the observed test statistic?
# How to simulate the null hypothesis?
# p-value?

In [17]:
# don't change this cell -- it is needed for the tests to work

prog_df = pd.read_csv(os.path.join('data','prog_df.csv'))
q1_out = hyp_test_lower_avg(prog_df, 1000)

In [None]:
grader.check("q1")

## Combining DataFrames 

* [`concat()`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html)
* [`merge()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)
* [`join()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html)

### `concat()`

* Used to append one (or more) dataframes one below the other (or sideways, depending on whether the axis option is set to 0 or 1).
    * Useful if we have two or more data sets containing the same columns but different rows of data.
    * We can also concat the columns from one `DataFrame` to those of another `DataFrame`.

In [21]:
# left dataframe
left = pd.DataFrame({
   'id':[1,2,3,4,5],
   'Name': ['Aaron', 'Marina', 'Justin', 'Janine', 'Ilya'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5']})

# right dataframe
right = pd.DataFrame(
   {'id':[1,2,3,4,5],
   'Name': ['Enrique', 'Parker', 'Erik', 'Allston', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5']})

multi_table([left, right])

In [22]:
# add 'left' below 'right'
pd.concat([right, left])

In [23]:
# if you want to keep track of which dataframe each row came from after concat, use 'keys'
pd.concat([right, left], keys=['right', 'left'])

In [24]:
# add 'left' to the side of 'right'
pd.concat([right, left], axis=1)
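
One detail worth noting: stacking dataframes vertically keeps each dataframe's original index labels, so labels can repeat. A minimal sketch with two made-up dataframes (`top` and `bottom` are illustrative names) shows this, along with the `ignore_index=True` option of `pd.concat` that renumbers the rows.

```python
import pandas as pd

top = pd.DataFrame({'id': [1, 2], 'Name': ['A', 'B']})
bottom = pd.DataFrame({'id': [3, 4], 'Name': ['C', 'D']})

# Original index labels are kept, so 0 and 1 each appear twice.
stacked = pd.concat([top, bottom])

# ignore_index=True discards the old labels and assigns 0..n-1.
renumbered = pd.concat([top, bottom], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```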

### `merge()`

* Used to combine two dataframes on the basis of **values of common columns** (indices can also be used via `left_index=True` and/or `right_index=True`).
    * If we are joining columns on columns, the DataFrame indexes will be ignored. 
    * If we are joining indexes on indexes or indexes on a column or columns, the index will be passed on.

* **`on`**: column or index names to join on. 
    * These must be found in both DataFrames. 
    * If `on` is `None` and not merging on indexes, then this defaults to the intersection of the columns in both DataFrames.
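
The difference between column-based and index-based merging can be sketched with two small made-up tables. The names `students`, `grades`, `student_id`, and `sid` are assumptions for illustration; `left_on`/`right_on` are the `pd.merge` parameters for key columns that are named differently in each table.

```python
import pandas as pd

students = pd.DataFrame({'student_id': [1, 2, 3], 'name': ['Ana', 'Ben', 'Cy']})
grades = pd.DataFrame({'sid': [2, 3, 4], 'grade': ['A', 'B', 'C']})

# Columns on columns: the original indexes are ignored and the result
# gets a fresh 0..n-1 index.
by_columns = pd.merge(students, grades, left_on='student_id', right_on='sid')

# Indexes on indexes: the matching index values are passed on to the result.
by_index = pd.merge(
    students.set_index('student_id'),
    grades.set_index('sid'),
    left_index=True,
    right_index=True,
)

print(by_columns)
print(by_index)
```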

In [25]:
multi_table([left, right])

In [26]:
# merge left and right tables on 'id' column
on_id = pd.merge(left,right,on='id')

# how many rows, how many columns?
multi_table([left, right, on_id])

In [27]:
# merge left and right tables on 'id' and 'subject_id' columns
on_id_subject = pd.merge(left,right,on=['id', 'subject_id'])

# how many rows, how many columns, what are the indices?
multi_table([left, right, on_id_subject])

* **`how`**: specifies how to determine which keys are to be included in the resulting table. 
    * If a key combination (the values in the `on` columns) does not appear in either the left or the right table, the corresponding values in the joined table will be `np.nan`.
    * Defaults to `inner`

In [28]:
data_a = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Manny', 'Will', 'Hunter', 'Ian', 'Eric'], 
        'last_name': ['Machado', 'Myers', 'Renfroe', 'Kinsler', 'Hosmer']}
df_a = pd.DataFrame(data_a, columns = ['subject_id', 'first_name', 'last_name'])

data_b = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Cody', 'Justin', 'Corey', 'Clayton', 'Kenley'], 
        'last_name': ['Bellinger', 'Turner', 'Seager', 'Kershaw', 'Jansen']}
df_b = pd.DataFrame(data_b, columns = ['subject_id', 'first_name', 'last_name'])

multi_table([df_a, df_b])

| Merge Method  | Description                  |
| :-------      | :---------------------------:| 
| `left`        | Use keys from left object    | 
| `right`       | Use keys from right object   | 
| `outer`       | Use union of keys            |
| `inner`       | Use intersection of keys     | 

In [29]:
# based on the output below, what 'how' argument was passed into pd.merge?
how_list = ['outer', 'inner', 'right', 'left']

merge_method = np.random.choice(how_list)

pd.merge(df_a, df_b, on='subject_id', how=merge_method)

In [30]:
# let's check!
merge_method
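
When it is unclear which table a row of a merged result came from, the `indicator=True` option of `pd.merge` adds a `_merge` column labelling each row as `left_only`, `right_only`, or `both`. A minimal sketch on two made-up tables (`a` and `b` are illustrative, not the `df_a`/`df_b` above):

```python
import pandas as pd

a = pd.DataFrame({'subject_id': ['1', '2', '3'], 'x': [10, 20, 30]})
b = pd.DataFrame({'subject_id': ['2', '3', '4'], 'y': [200, 300, 400]})

# An outer merge keeps every key; '_merge' records where each key was found.
audit = pd.merge(a, b, on='subject_id', how='outer', indicator=True)

# Map each key to its source label for easy inspection.
source_by_key = dict(zip(audit['subject_id'], audit['_merge'].astype(str)))
print(source_by_key)
```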

### `join()`

* Used to merge two dataframes on the basis of the index; instead of using `merge()` with the option `left_index=True`, we can use `join()`.
    * Join operation honors the object on which it is called: `a.join(b)` $ \neq $ `b.join(a)`.

Facts about `merge()` vs `join()`:

* `merge()` is the underlying function used for all merge/join behavior
* `join()` is basically a specific behavior of `merge()` (left join using indices)
* In practice, `merge()` is usually used. In conversation, merging and joining generally mean the same thing: combining DataFrames/tables based on common columns or indices.
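
A quick way to check these facts is to compare `join()` against the equivalent `merge()` call on two small made-up dataframes (`a` and `b` below are illustrative):

```python
import pandas as pd

a = pd.DataFrame({'val_a': [1, 2, 3]}, index=['foo', 'bar', 'baz'])
b = pd.DataFrame({'val_b': [4, 5]}, index=['foo', 'qux'])

# join() defaults to a left join on the indexes...
via_join = a.join(b)

# ...which is the same as this merge() call.
via_merge = pd.merge(a, b, left_index=True, right_index=True, how='left')
same = via_join.equals(via_merge)

# The calling object matters: a.join(b) keeps a's rows, b.join(a) keeps b's.
a_then_b = a.join(b)
b_then_a = b.join(a)

print(same, a_then_b.shape, b_then_a.shape)
```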

<img src="imgs/join_types.jpg">

1. **Inner Join** – only keep rows where the merge “on” value exists in both the left and right dataframes.
2. **Left Join** – keep every row in the left dataframe.
    * Where values of the “on” variable are missing from the right dataframe, add `np.nan` values in the result.
3. **Right Join** – keep every row in the right dataframe. 
    * Where values of the “on” variable are missing from the left dataframe, add `np.nan` values in the result.
4. **Outer Join** – returns all the rows from the left dataframe, all the rows from the right dataframe, and matches up rows where possible, with `NaN`s elsewhere.

We'll start with a simple example:

In [31]:
df1 = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]}).set_index('key')
df2 = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]}).set_index('key')

joined = df1.join(df2, lsuffix='_l', rsuffix='_r')

multi_table([df1, df2, joined])

Now let's try something a bit more complex:

In [32]:
df1_data = {
    'Year' : [2014, 2014, 2014, 2014, 2014],
    'Week' : ['A', 'B', 'B', 'C', 'D'],
    'Color' : ['Red', 'Red', 'Black', 'Red', 'Green'],
    'Val' : [50, 60, 70, 10, 20]
}

df1 = pd.DataFrame(df1_data).set_index('Week')

df2_data = {
    'Year' : [2014, 2014, 2014, 2014, 2014],
    'Week' : ['A', 'B', 'C', 'C', 'D'],
    'Color' : ['Black', 'Black', 'Green', 'Red', 'Red'],
    'Score' : [30, 100, 50, 20, 40]
}

df2 = pd.DataFrame(df2_data).set_index('Week')

multi_table([df1, df2])

In [33]:
# how many rows, how many columns?
df1.join(df2, lsuffix='_l', rsuffix = '_r')

In [34]:
# will this be any different?
df2.join(df1, lsuffix='_l', rsuffix = '_r')

### Data Science Interview Question
How many rows will you get if you perform:
- a) `df1` **left join** `df2` on `'letter'` ?
- b) `df1` **inner join** `df2` on `'letter'` ?
- c) `df1` **right join** `df2` on `'letter'` ?

Answer the question without using python code.
Can you write out what the final merged/joined table will look like?

In [35]:
df1 = pd.DataFrame({
    'letter' : [1, 1, 2, 3, 4, 4],
    'alphabet' : ['A', 'B', 'C', 'D', 'E', 'F']
})

df2 = pd.DataFrame({
    'letter' : [1, 2, 4, 4, 4],
    'alphabet' : ['G', 'H', 'I', 'J', 'K']
})

multi_table([df1, df2])

**Question 2**

You are given two separate dataframes: `mlb_2017` and `mlb_2018`. They contain statistics for the 2017 and 2018 baseball seasons, respectively. Your job is to combine these two dataframes into one using the following guidelines:

* The dataframe you return should be indexed by team name (`Tm`).
* The dataframe you return should include all columns from both `mlb_2017` and `mlb_2018`.
* Use the suffixes `_2017` and `_2018` to differentiate between statistics from both seasons.

Create a function `combined_seasons` that returns, as a tuple, the following:

* The combined dataframe described above.
* The team with the most home runs (`HR`) across the 2017 and 2018 seasons combined.

In [37]:
# read in the following .txt files
mlb_2017 = pd.read_csv(os.path.join('data','mlb_2017.txt'))
mlb_2018 = pd.read_csv(os.path.join('data','mlb_2018.txt'))

multi_table([mlb_2017.head(), mlb_2018.head()])

In [38]:
# don't change this cell -- it is needed for the tests to work
mlb_2017 = pd.read_csv(os.path.join('data','mlb_2017.txt'))
mlb_2018 = pd.read_csv(os.path.join('data','mlb_2018.txt'))
q2_out = combined_seasons(mlb_2017, mlb_2018)

In [None]:
grader.check("q2")

**Question 3**

Using the same two dataframes, `mlb_2017` and `mlb_2018`, create a function `seasonal_average` that combines them and takes the mean of each column for each team. 

* The dataframe you return should be indexed by team name (`Tm`).
* Each column should contain the average value between the *2017* and *2018* seasons for the given statistic for each team.
    * For example, the `HR` column should contain the average value for `HR` for each team between the *2017* and *2018* seasons.

In [44]:
# don't change this cell -- it is needed for the tests to work
mlb_2017 = pd.read_csv(os.path.join('data','mlb_2017.txt'))
mlb_2018 = pd.read_csv(os.path.join('data','mlb_2018.txt'))
q3_out = seasonal_average(mlb_2017, mlb_2018)

In [None]:
grader.check("q3")

## Congratulations! You're done!

* Submit your `.py` file to Gradescope. Note that you only need to submit the `.py` file; this notebook should not be uploaded. Make sure that all of your work is in the `.py` file by running the doctests: `python -m doctest discussion.py`.

In [49]:
# !python -m doctest discussion.py

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()