# Datascience module version
Like any actively maintained package, the `datascience` package is revised regularly.

Version 0.17.6 changed parts of the API, which breaks some of the examples. We therefore need a version older than 0.17.6 and may have to downgrade to version 0.17.5.

We do this by 'escaping' Jupyter with a leading `!` and running the Python package manager `pip` from the shell.

We run the command:

`!pip install --user datascience==0.17.5`

The command installs that specific version (0.17.5) of `datascience` into the user's own site-packages directory rather than system-wide.

Afterwards, we have to restart the kernel; we can then verify which version of `datascience` is in use with:

```python
import datascience
datascience.__version__
```
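
The version requirement can also be checked in code instead of by eye. The following is a minimal sketch, assuming plain dotted numeric version strings such as `'0.17.5'`; the helper names `parse_version` and `is_older_than` are hypothetical and not part of `datascience`.

```python
def parse_version(version):
    """Turn a dotted numeric version string such as '0.17.5' into a tuple of ints."""
    return tuple(int(part) for part in version.split('.'))


def is_older_than(version, limit):
    """Return True if `version` is strictly older than `limit`."""
    # Tuples compare element by element, so (0, 17, 5) < (0, 17, 6).
    return parse_version(version) < parse_version(limit)


# The two versions mentioned in the text above.
print(is_older_than('0.17.5', '0.17.6'))  # True
print(is_older_than('0.17.6', '0.17.6'))  # False
```

In the notebook, one would pass `datascience.__version__` as the first argument. Comparing integer tuples rather than raw strings avoids ranking `'0.17.10'` below `'0.17.6'`.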

In [None]:
!pip install --user datascience==0.17.5

In [1]:
import datascience
datascience.__version__

'0.17.6'

In [2]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# Comparing Two Samples

In [3]:
births = Table.read_table('data/baby.csv')

In [4]:
births

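| Birth Weight | Gestational Days | Maternal Age | Maternal Height | Maternal Pregnancy Weight | Maternal Smoker |
|---|---|---|---|---|---|
| 120 | 284 | 27 | 62 | 100 | False |
| 113 | 282 | 33 | 64 | 135 | False |
| 128 | 279 | 28 | 64 | 115 | True |
| 108 | 282 | 23 | 67 | 125 | True |
| 136 | 286 | 25 | 62 | 93 | False |
| 138 | 244 | 33 | 62 | 178 | False |
| 132 | 245 | 23 | 65 | 140 | False |
| 120 | 289 | 25 | 62 | 125 | False |
| 143 | 299 | 30 | 66 | 136 | True |
| 140 | 351 | 27 | 68 | 120 | False |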


In [None]:
smoking_and_birthweight = births.select('Maternal Smoker', 'Birth Weight')

In [None]:
smoking_and_birthweight.group('Maternal Smoker')

In [None]:
smoking_and_birthweight.hist('Birth Weight', group='Maternal Smoker')

# Test Statistic

[Question] What values of our statistic are in favor of the alternative: positive or negative?

In [None]:
means_table = smoking_and_birthweight.group('Maternal Smoker', np.average)
means_table

In [None]:
means = means_table.column('Birth Weight average')
observed_difference = means.item(1) - means.item(0)
observed_difference

In [None]:
def difference_of_means(table, label, group_label):
    """Takes: a table, column label of numerical variable,
    column label of group-label variable
    Returns: Difference of means of the two groups"""
    
    # subset table with the two relevant columns
    reduced = table.select(...)  
    
    # table containing group means
    means_table = reduced.group(...)
    
    # array of group means
    means = means_table ...
    
    return ...

In [None]:
difference_of_means(births, 'Birth Weight', 'Maternal Smoker')
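
To make the test statistic concrete without the `datascience` API, here is a minimal plain-Python sketch of the same computation: the mean of the `True` group minus the mean of the `False` group. It uses only the ten rows displayed earlier, not the full data set, so its result is not the notebook's `observed_difference`. The names `rows` and `difference_of_group_means` are illustrative assumptions.

```python
from statistics import mean

# The ten rows displayed above as (birth weight, maternal smoker).
# This is only the visible preview, not the full data set.
rows = [
    (120, False), (113, False), (128, True), (108, True), (136, False),
    (138, False), (132, False), (120, False), (143, True), (140, False),
]


def difference_of_group_means(pairs):
    """Return mean(value | label is True) - mean(value | label is False)."""
    true_values = [value for value, label in pairs if label]
    false_values = [value for value, label in pairs if not label]
    return mean(true_values) - mean(false_values)


preview_difference = difference_of_group_means(rows)
print(round(preview_difference, 2))  # -2.1
```

The subtraction order mirrors `means.item(1) - means.item(0)` in the notebook, assuming the grouped table lists the `False` group before the `True` group.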

# Random Permutation (Shuffling)

In [None]:
letters = Table().with_column('Letter', make_array('a', 'b', 'c', 'd', 'e'))

In [None]:
letters.sample()

In [None]:
letters.sample(with_replacement=False)

In [None]:
letters.with_column('Shuffled', letters.sample(with_replacement=False).column(0))
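
The difference between the two `sample` calls above can be sketched with the standard library alone: sampling with replacement may repeat or drop letters, while sampling every row without replacement yields a random permutation. This is a plain-Python illustration, not the `datascience` implementation.

```python
import random

letters = ['a', 'b', 'c', 'd', 'e']

# Sampling WITH replacement: letters may repeat or be missing.
with_replacement = random.choices(letters, k=len(letters))

# Sampling WITHOUT replacement, same size as the original: every letter
# appears exactly once, so the result is a random permutation.
shuffled = random.sample(letters, k=len(letters))

print(with_replacement)
print(shuffled)
print(sorted(shuffled) == letters)  # True: same letters, new order
```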

# Simulation Under Null Hypothesis

In [None]:
smoking_and_birthweight

In [None]:
shuffled_labels = smoking_and_birthweight.sample(with_replacement=False).column('Maternal Smoker')

In [None]:
original_and_shuffled = smoking_and_birthweight.with_column('Shuffled Label', shuffled_labels)

In [None]:
original_and_shuffled

In [None]:
difference_of_means(original_and_shuffled, 'Birth Weight', 'Shuffled Label')

In [None]:
difference_of_means(original_and_shuffled, 'Birth Weight', 'Maternal Smoker')

# Permutation Test

In [None]:
def one_simulated_difference(table, label, group_label):
    """Takes: a table, column label of numerical variable,
    column label of group-label variable
    Returns: Difference of means of the two groups after shuffling labels"""
    
    # array of shuffled labels
    shuffled_labels = table.sample(with_replacement = False) ...
    
    # table of numerical variable and shuffled labels
    shuffled_table = table.select(label).with_column(...)
    
    return difference_of_means(...)   

In [None]:
one_simulated_difference(births, 'Birth Weight', 'Maternal Smoker')
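
The idea behind one simulated difference is to keep the birth weights fixed, shuffle the group labels, and recompute the statistic. The following is a minimal plain-Python sketch under that reading. It uses only the ten preview rows shown earlier, and the names `one_shuffled_difference` and `difference_of_group_means` are hypothetical stand-ins rather than the notebook's functions.

```python
import random
from statistics import mean

# The ten preview rows shown earlier; the full data set is larger.
weights = [120, 113, 128, 108, 136, 138, 132, 120, 143, 140]
smoker = [False, False, True, True, False, False, False, False, True, False]


def difference_of_group_means(values, labels):
    """Mean of values labelled True minus mean of values labelled False."""
    true_values = [v for v, flag in zip(values, labels) if flag]
    false_values = [v for v, flag in zip(values, labels) if not flag]
    return mean(true_values) - mean(false_values)


def one_shuffled_difference(values, labels, rng=random):
    """Shuffle the labels, keep the values fixed, and recompute the statistic."""
    # Sampling all labels without replacement is a random permutation,
    # so the group sizes stay the same as in the original data.
    shuffled_labels = rng.sample(labels, k=len(labels))
    return difference_of_group_means(values, shuffled_labels)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
simulated = [one_shuffled_difference(weights, smoker, rng) for _ in range(1000)]
print(min(simulated), max(simulated))
```

Because shuffling breaks any link between label and weight, the simulated values show what the statistic looks like when the label makes no difference.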

In [None]:
differences = make_array()

for i in np.arange(2500):
    new_difference = one_simulated_difference(births, 'Birth Weight', 'Maternal Smoker')
    differences = np.append(differences, new_difference)

In [None]:
Table().with_column('Difference Between Group Means', differences).hist()
print('Observed Difference:', observed_difference)
plots.title('Prediction Under the Null Hypothesis');
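
The histogram compares the observed difference with the simulated ones visually. One common way to put a number on that comparison is the proportion of simulated differences at least as extreme as the observed one. The sketch below assumes, for illustration only, that smaller (more negative) values count as evidence for the alternative; the notebook leaves that direction as a question. The function name and the toy numbers are made up and are not results from the data.

```python
def proportion_at_or_below(simulated, observed):
    """Fraction of simulated statistics that are <= the observed statistic."""
    count = sum(1 for value in simulated if value <= observed)
    return count / len(simulated)


# Made-up numbers purely to show the call; they are not results from the data.
toy_simulated = [-3.0, -1.0, 0.0, 2.0, 4.0]
toy_observed = -1.0
print(proportion_at_or_below(toy_simulated, toy_observed))  # 0.4
```

In the notebook, the corresponding call would take `differences` and `observed_difference` as its arguments.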