# Datascience module version
Like any actively maintained package, the `datascience` package is revised regularly.

Version 0.17.6 changed the API in ways that break some of the examples. We therefore need a version older than 0.17.6, which may mean downgrading to version 0.17.5.

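We do this by 'escaping' Jupyter and running the Python package manager `pip` from the shell.

We run the command:

`!pip install --user datascience==0.17.5`

The leading `!` tells Jupyter to pass the rest of the line to the shell. The command installs exactly version 0.17.5 of `datascience` into the user's own package directory (`--user`) rather than system-wide.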

Afterwards, restart the kernel so that the downgraded package is loaded, then verify which version of `datascience` is in use:

```python
import datascience
datascience.__version__
```
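The version check can also be done programmatically. The following is a minimal sketch that uses only the standard library; the helper names `version_tuple` and `is_compatible` are hypothetical, and the cutoff `(0, 17, 6)` simply encodes the version mentioned above.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn a version string such as '0.17.5' into a tuple of its first three numbers."""
    return tuple(int(number) for number in re.findall(r"\d+", version)[:3])


def is_compatible(version, first_breaking=(0, 17, 6)):
    """True if the version is older than the first release with the API change."""
    return version_tuple(version) < first_breaking


# Look up the installed version without importing the package itself.
try:
    installed = metadata.version("datascience")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("datascience is not installed")
else:
    print(installed, "compatible:", is_compatible(installed))
```

Comparing tuples of integers avoids the pitfalls of comparing version strings, where `'0.17.10'` would sort before `'0.17.5'`.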

In [None]:
!pip install --user datascience==0.17.5

In [None]:
import datascience
datascience.__version__

In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# Comparing Two Samples
- read in the 'data/baby.csv' file
- look at the data
- subset to `Birth Weight` and `Maternal Smoker` columns
- How many babies are in the smoker and the nonsmoker group?
- Make a histogram showing the birth weight distribution of all babies
- Make overlaid histograms of the birth weights for babies of smokers and babies of nonsmokers
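
The counting and splitting steps above can be sketched in plain Python, independent of the `datascience` Table API. The rows are made-up values for illustration only, not taken from `data/baby.csv`, and the names `rows`, `group_sizes`, and `weights_by_group` are hypothetical.

```python
from collections import Counter

# Made-up rows for illustration only; NOT taken from data/baby.csv.
# Each row is (birth weight, maternal smoker flag).
rows = [(120, False), (113, False), (128, True), (108, True), (136, False)]

# Count how many babies fall into each group.
group_sizes = Counter(smoker for _, smoker in rows)
print(group_sizes[True], "in the smoker group,", group_sizes[False], "in the nonsmoker group")

# Split the weights by group: the same split the two overlaid histograms display.
weights_by_group = {True: [], False: []}
for weight, smoker in rows:
    weights_by_group[smoker].append(weight)
```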

--- 

back to slides

---

# Test Statistic

**Question**: What values of our statistic are in favor of the alternative: positive or negative?

- compute the average birthweights of smokers and nonsmokers
- compute the difference between the two groups
- create a function that
    - accepts a `table`, a `variable label`, and a `group label`, and
    - returns the difference between the averages of the variable in the two groups

In [None]:
def difference_of_means(table, label, group_label):
    """Takes:
        - table: the table holding the data
        - label: column label of the numerical variable
        - group_label: column label of the group-label variable
    Returns: difference between the means of the two groups"""
    ...

In [None]:
difference_of_means(births, 'Birth Weight', 'Maternal Smoker')
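
The statistic itself is simple enough to sketch without a table. The version below is a plain-Python illustration, not the notebook's `difference_of_means`: the name `mean_difference` is hypothetical, the values are made up, and the sign convention (group labeled `True` minus group labeled `False`) is an assumption.

```python
from statistics import mean


def mean_difference(values, labels):
    """Mean of the values labeled True minus mean of the values labeled False."""
    true_group = [value for value, flag in zip(values, labels) if flag]
    false_group = [value for value, flag in zip(values, labels) if not flag]
    return mean(true_group) - mean(false_group)


# Made-up values for illustration only; not taken from the births data.
weights = [120, 113, 128, 108, 136]
smoker = [False, False, True, True, False]

observed = mean_difference(weights, smoker)
print(observed)
```

Whichever order of subtraction is chosen, it has to stay the same for the observed and the simulated statistics, since it determines which sign favors the alternative.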

---

back to slides

---

# Random Permutation (Shuffling)
- remember the [sample function](http://www.data8.org/datascience/reference-nb/datascience-reference.html#tbl.sample())
- by sampling all rows *without* replacement, we can shuffle the order of a column: every value appears exactly once, only in a different position
- shuffle the table and attach the shuffled `Letter` as a new column

In [None]:
letters = Table().with_column('Letter', make_array('a', 'b', 'c', 'd', 'e'))
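
Why the replacement setting matters can be seen in a short plain-Python sketch. It uses the standard `random` module rather than the table's `sample` method, and the seed and variable names are arbitrary choices for illustration.

```python
import random

letters = ['a', 'b', 'c', 'd', 'e']
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Drawing all items without replacement yields a permutation:
# every letter appears exactly once, in a new order.
shuffled = rng.sample(letters, k=len(letters))

# Drawing with replacement is a resample, not a shuffle:
# letters may repeat and others may be missing.
resampled = rng.choices(letters, k=len(letters))

print(shuffled)
print(resampled)
```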

# Simulation Under Null Hypothesis
- Shuffle the smoking/nonsmoking labels
- attach the shuffled labels to the `smoking_and_birthweight` table
- calculate the difference of means for the original labels
- calculate the difference of means for the shuffled labels

In [None]:
smoking_and_birthweight
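
One round of this simulation can be sketched in plain Python. The values are made up, the helper `mean_difference` is hypothetical, and the sign convention (`True` group minus `False` group) is an assumption. The point of the sketch is that shuffling moves the labels around while leaving the weights and the group sizes unchanged.

```python
import random
from statistics import mean

# Made-up values for illustration only; not taken from the births data.
weights = [120, 113, 128, 108, 136]
smoker = [False, False, True, True, False]


def mean_difference(values, labels):
    """Mean of the values labeled True minus mean of the values labeled False."""
    true_group = [value for value, flag in zip(values, labels) if flag]
    false_group = [value for value, flag in zip(values, labels) if not flag]
    return mean(true_group) - mean(false_group)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
shuffled = rng.sample(smoker, k=len(smoker))  # a permutation of the labels

original_difference = mean_difference(weights, smoker)
shuffled_difference = mean_difference(weights, shuffled)
print(original_difference, shuffled_difference)
```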

# Permutation Test
- create a function that shuffles the labels and calculates the difference of means
- simulate 2500 times; this should yield an array of 2500 differences
- put the differences in a table and plot their histogram

In [None]:
def one_simulated_difference(table, label, group_label):
    """Takes:
        - table: the table holding the data
        - label: column label of the numerical variable
        - group_label: column label of the group-label variable
    Returns: difference between the means of the two groups after shuffling the labels"""
    # array of shuffled labels (sampling without replacement gives a permutation)
    shuffled_labels = table.sample(with_replacement=False).column(group_label)

    # table of numerical variable and shuffled labels
    shuffled_table = table.select(label).with_column('Shuffled Label', shuffled_labels)

    return difference_of_means(shuffled_table, label, 'Shuffled Label')

In [None]:
one_simulated_difference(births, 'Birth Weight', 'Maternal Smoker')

In [None]:
differences = make_array()

for i in np.arange(2500):
    new_difference = one_simulated_difference(births, 'Birth Weight', 'Maternal Smoker')
    differences = np.append(differences, new_difference)
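
The whole loop can be sketched end to end in plain Python. This is an illustration under stated assumptions, not the notebook's code: the values are made up, the helpers `mean_difference` and `one_shuffled_difference` are hypothetical, and counting the lower tail assumes that smaller values of the statistic favor the alternative.

```python
import random
from statistics import mean

# Made-up values for illustration only; not taken from the births data.
weights = [120, 113, 128, 108, 136]
smoker = [False, False, True, True, False]


def mean_difference(values, labels):
    """Mean of the values labeled True minus mean of the values labeled False."""
    true_group = [value for value, flag in zip(values, labels) if flag]
    false_group = [value for value, flag in zip(values, labels) if not flag]
    return mean(true_group) - mean(false_group)


def one_shuffled_difference(values, labels, rng):
    """Shuffle the labels once and recompute the statistic."""
    shuffled = rng.sample(labels, k=len(labels))
    return mean_difference(values, shuffled)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
observed = mean_difference(weights, smoker)
differences = [one_shuffled_difference(weights, smoker, rng) for _ in range(2500)]

# Fraction of shuffled differences at or below the observed one
# (lower tail is an assumption; use the tail that favors the alternative).
at_or_below = sum(d <= observed for d in differences) / len(differences)
print(observed, at_or_below)
```

A histogram of `differences` shows the distribution of the statistic under the null hypothesis; the observed value is then judged by where it falls in that distribution.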

**Question:**
Can we conclude causality?