## ThinkStats 9.1 - 9.3 Companion

This notebook will allow you to practice some of the concepts from ThinkStats2 Chapter 9.

### Companion to 9.1 - 9.2

First, we'll start with the question that Allen poses at the beginning of the chapter: "Suppose we toss a coin 250 times and we see 140 heads.  Is this strong evidence that the coin is biased?"

As Allen says, classical hypothesis testing is similar to a proof by contradiction.  First, we assume that the thing we are trying to show (that the coin is biased) is false; that is, we assume the coin is fair.  Second, we show that under this assumption the observed event (seeing 140 heads out of 250 tosses) is exceedingly improbable.  Finally, we can conclude that our assumption (that the coin is not biased) is unlikely to be true.

Write a function to simulate n random flips of a fair coin (p(heads) = 0.5).  Your function should return the number of heads that occur in those n coin flips.

In [None]:
from random import choice

def simulate_fair_coin_flips(n):
    """ Return the number of heads that occur in n flips of a
        fair coin p(heads) = 0.5 """
    pass

print(simulate_fair_coin_flips(250))
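If you get stuck, here is a minimal sketch of one way to fill in the stub.  It assumes each flip can be modeled as a uniform draw from the string `'HT'` using `random.choice`; the fixed seed is only there to make the sketch reproducible and is not part of the exercise.

```python
from random import choice, seed

def simulate_fair_coin_flips(n):
    """Return the number of heads in n flips of a fair coin (p(heads) = 0.5)."""
    # Each flip draws 'H' or 'T' with equal probability; count the heads.
    return sum(1 for _ in range(n) if choice('HT') == 'H')

seed(0)  # assumption: fixed seed, used only for reproducibility
num_heads = simulate_fair_coin_flips(250)
print(num_heads)
```

The result should usually land somewhere near 125, since the expected number of heads in 250 fair flips is 125.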

Next, repeat your simulation of 250 coin flips 1000 times.  Create and display a CDF of the number of heads observed in each of the 1000 random trials.

In [84]:
%matplotlib inline
import thinkstats2
import thinkplot
import matplotlib.pyplot as plt

# your implementation here (imports included for convenience)
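One possible shape for this step is sketched below.  To keep the sketch self-contained it builds the empirical CDF by hand with the standard library; `empirical_cdf` is a hypothetical helper, not part of `thinkstats2`.  In the notebook itself you would more likely pass the list of results to `thinkstats2.Cdf` and plot it with `thinkplot.Cdf`.

```python
from collections import Counter
from random import choice, seed

def simulate_fair_coin_flips(n):
    """Return the number of heads in n flips of a fair coin."""
    return sum(1 for _ in range(n) if choice('HT') == 'H')

def empirical_cdf(values):
    """Return sorted (value, P(X <= value)) pairs for a sample (hypothetical helper)."""
    counts = Counter(values)
    total = float(len(values))
    cumulative = 0
    pairs = []
    for value in sorted(counts):
        cumulative += counts[value]
        pairs.append((value, cumulative / total))
    return pairs

seed(0)  # assumption: fixed seed, used only for reproducibility
# Repeat the 250-flip experiment 1000 times under the null hypothesis.
heads_counts = [simulate_fair_coin_flips(250) for _ in range(1000)]
cdf_pairs = empirical_cdf(heads_counts)

# Show a few points of the CDF around the observed value of 140 heads.
for value, prob in cdf_pairs:
    if 136 <= value <= 144:
        print(value, prob)
```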

The p-value is simply the probability that we would have seen a result at least as extreme as 140 heads out of 250 flips under the hypothesis that the coin is fair (the null hypothesis).  Using the CDF you created in the previous cell, compute the p-value.  If you want to test your learning a bit more: compute the p-value without using the CDF explicitly (instead, use the results of the 1000 random trials directly).

Hint: you should use the `PercentileRank` method of the Cdf to compute the p-value; however, there is one important gotcha.  `PercentileRank` returns the percentage of the data that is less than or equal to the input value.  When computing the p-value, we want the percentage of the data that is greater than or equal to the observed value.
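The gotcha in the hint can be sketched as follows.  `percentile_rank` is a hypothetical stand-in that mimics the behavior described above (the percentage of values less than or equal to the input); it is not the `thinkstats2` implementation.  Because head counts are integers, "140 or more" is the complement of "139 or fewer", so the rank is evaluated at 139 rather than 140.

```python
from random import choice, seed

def simulate_fair_coin_flips(n):
    """Return the number of heads in n flips of a fair coin."""
    return sum(1 for _ in range(n) if choice('HT') == 'H')

def percentile_rank(values, x):
    """Percentage of values <= x (hypothetical stand-in for PercentileRank)."""
    return 100.0 * sum(1 for v in values if v <= x) / len(values)

seed(0)  # assumption: fixed seed, used only for reproducibility
heads_counts = [simulate_fair_coin_flips(250) for _ in range(1000)]
observed = 140

# Directly from the trials: fraction of simulations with >= 140 heads.
p_direct = sum(1 for h in heads_counts if h >= observed) / float(len(heads_counts))

# Via the percentile rank: P(X >= 140) = 1 - P(X <= 139).
p_from_rank = 1.0 - percentile_rank(heads_counts, observed - 1) / 100.0

print(p_direct, p_from_rank)
```

Evaluating the rank at 140 instead of 139 would leave out the simulations with exactly 140 heads and underestimate the p-value.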

The p-value we computed above comes from a [one-tailed test](https://en.wikipedia.org/wiki/One-_and_two-tailed_tests), in that we only counted simulations of the null hypothesis that had 140 or more heads (Allen uses the terminology of one-sided versus two-sided tests; see ThinkStats2 9.4).  A two-tailed test would also count simulations with 140 or more tails (which is what Allen shows in the book).  Whether to use a one-tailed or a two-tailed test mostly depends on your prior expectations regarding the hypothesis you are testing.  For instance, if you had a reason to suspect that the coin would be biased towards heads (but not tails), you would use a one-tailed test.  If you had no reason to assume a priori that the coin was biased towards heads or tails, you should use a two-tailed test.

Modify your coin flip simulation code to return the number of heads or tails, whichever is larger, out of n flips.

In [None]:
def simulate_fair_coin_flips_two_sided(n):
    """ Return the number of heads or tails, whichever is larger,
        that occur in n flips of a fair coin p(heads) = 0.5 """
    pass

print(simulate_fair_coin_flips_two_sided(250))
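A minimal sketch of one way to fill in the two-sided stub, under the same assumption as before that a flip is a uniform draw from `'HT'`.  Since every flip is either heads or tails, the number of tails is simply `n` minus the number of heads.

```python
from random import choice, seed

def simulate_fair_coin_flips_two_sided(n):
    """Return the count of the more common outcome (heads or tails) in n fair flips."""
    heads = sum(1 for _ in range(n) if choice('HT') == 'H')
    tails = n - heads
    return max(heads, tails)

seed(0)  # assumption: fixed seed, used only for reproducibility
most_common_count = simulate_fair_coin_flips_two_sided(250)
print(most_common_count)
```

Note that the result can never fall below n / 2, which is why the CDF of this statistic starts at 125 for 250 flips.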

Using the function `simulate_fair_coin_flips_two_sided`, create and display a CDF of the number of times the most common outcome, heads or tails, appears based on 1000 random trials.

Use the CDF to compute a two-tailed (or two-sided) p-value for the observed data (140 heads out of 250 flips).
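As a rough sketch, the two-tailed p-value can also be estimated directly from the simulated values: it is the fraction of trials in which the more common outcome appeared at least 140 times.  The helper below is one assumed implementation of the two-sided simulation, included only so the sketch runs on its own.

```python
from random import choice, seed

def simulate_fair_coin_flips_two_sided(n):
    """Return the count of the more common outcome (heads or tails) in n fair flips."""
    heads = sum(1 for _ in range(n) if choice('HT') == 'H')
    return max(heads, n - heads)

seed(0)  # assumption: fixed seed, used only for reproducibility
max_counts = [simulate_fair_coin_flips_two_sided(250) for _ in range(1000)]
observed = 140

# Fraction of simulated experiments at least as lopsided as the observed one,
# in either direction (>= 140 heads or >= 140 tails).
p_two_sided = sum(1 for c in max_counts if c >= observed) / float(len(max_counts))
print(p_two_sided)
```

Because a fair coin is symmetric, this estimate should come out at roughly twice the one-tailed p-value.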

This approach to computing p-values (via simulations of the null hypothesis) has its limitations.  For instance, suppose you observed 180 heads in 250 flips.  If you used your CDF from above to estimate the p-value for this observation, what would go wrong?  What would you need to do in order to get a sensible estimate of this p-value?
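One way to think about this: with 1000 simulations, the smallest nonzero p-value you can estimate is 1/1000, so an outcome as extreme as 180 heads will almost certainly never show up and the estimate collapses to 0.  Running many more simulations is one remedy; another, sketched below, is to compute the tail probability exactly from the binomial distribution.  `binomial_upper_tail` is a hypothetical helper, and the sketch assumes Python 3.8 or later for `math.comb`.

```python
from math import comb

def binomial_upper_tail(n, k, p=0.5):
    """Exact P(X >= k) for X ~ Binomial(n, p) (hypothetical helper)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# One-tailed p-values under the null hypothesis of a fair coin.
p_140 = binomial_upper_tail(250, 140)
p_180 = binomial_upper_tail(250, 180)

# For a fair coin the distribution is symmetric, so the two-tailed
# p-value is twice the one-tailed value.
print(p_140, 2 * p_140)
print(p_180, 2 * p_180)
```

The exact value for 140 heads gives you something to check your simulated estimate against, while the value for 180 heads is far too small to resolve with only 1000 trials.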

### Companion to 9.3

In Section 9.3 Allen uses a permutation test to examine whether there is a significant difference between the pregnancy lengths for first babies versus others.  Here, I will ask you to implement a very similar test without using the base class `thinkstats2.HypothesisTest`.  This will be the second test you have implemented on your own.  From here on out, you may implement tests by inheriting from `thinkstats2.HypothesisTest`, or you can choose to simply roll your own.

We will test the hypothesis that the mean ages of men and women on the Titanic were different.  First, let's load the data and drop any rows where age is missing.

In [85]:
import pandas as pd

data = pd.read_csv('../datasets/titanic_train.csv')
data = data.dropna(subset=['Age'])
data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |


Write a function that takes as input a data frame and computes the absolute value of the difference in mean age between men and women.

In [None]:
def compute_age_diff(data):
    """ Compute the absolute value of the difference in mean age
        between men and women on the Titanic """
    pass

observed_age_diff = compute_age_diff(data)
print("observed age difference", observed_age_diff)
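Here is a minimal sketch of one way to write `compute_age_diff`, assuming the `Sex` column contains exactly the labels `'male'` and `'female'` as in the rows displayed above.  To stay self-contained it runs on a toy frame built from the `Sex` and `Age` values of those five displayed rows rather than on the full dataset.

```python
import pandas as pd

def compute_age_diff(data):
    """Absolute difference in mean age between men and women."""
    # Group rows by sex and take the mean age within each group.
    mean_age_by_sex = data.groupby('Sex')['Age'].mean()
    return abs(mean_age_by_sex['male'] - mean_age_by_sex['female'])

# Toy frame: Sex and Age from the five rows shown by data.head().
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
})

toy_diff = compute_age_diff(toy)
print("toy age difference", toy_diff)
```

Taking the absolute value means the statistic ignores which group is older, which matches a two-sided test.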

Write a function called `shuffle_ages` that returns a copy of the original data frame but where the Ages have been randomly permuted.

Hint: there are lots of ways to do this, but `numpy.random.permutation` seems to be an especially succinct choice.  Make sure to try this function out on a small, hand-made Pandas series to get an idea of how it works.

In [None]:
from numpy.random import permutation

def shuffle_ages(data):
    """ Return a new dataframe (don't modify the original) where
        the values in the Age column have been randomly permuted. """
    pass

compute_age_diff(shuffle_ages(data))
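A minimal sketch of one way to write `shuffle_ages`, tried out on a small hand-made frame as the hint suggests.  Passing `.values` ensures a plain NumPy array is assigned by position; assigning a shuffled Series instead would be realigned on the index by pandas, which would undo the shuffle.

```python
import pandas as pd
from numpy.random import permutation, seed

def shuffle_ages(data):
    """Return a copy of data with the Age column randomly permuted."""
    shuffled = data.copy()  # leave the caller's frame untouched
    # permutation returns a shuffled copy of the underlying array of ages.
    shuffled['Age'] = permutation(data['Age'].values)
    return shuffled

seed(0)  # assumption: fixed seed, used only for reproducibility
# Small hand-made frame (Sex and Age from the five rows shown by data.head()).
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
})

shuffled = shuffle_ages(toy)
print(shuffled)
```

The shuffled frame contains the same ages as the original, only attached to different rows, which is exactly what the null hypothesis of "no relationship between sex and age" implies.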

Using 1000 random simulations, compute the p-value for the test of whether the mean ages of men and women were different (you may wish to use `Cdf` as in the previous section).
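Putting the pieces together, the permutation test can be sketched roughly as follows.  `permutation_p_value` is a hypothetical helper name, and the two functions it relies on are assumed implementations of the stubs above.  The sketch runs on a toy frame, so the number it prints says nothing about the real Titanic data.

```python
import numpy as np
import pandas as pd

def compute_age_diff(data):
    """Absolute difference in mean age between men and women."""
    means = data.groupby('Sex')['Age'].mean()
    return abs(means['male'] - means['female'])

def shuffle_ages(data):
    """Return a copy of data with the Age column randomly permuted."""
    shuffled = data.copy()
    shuffled['Age'] = np.random.permutation(data['Age'].values)
    return shuffled

def permutation_p_value(data, iters=1000):
    """Fraction of shuffled datasets with a difference >= the observed one."""
    observed = compute_age_diff(data)
    simulated = [compute_age_diff(shuffle_ages(data)) for _ in range(iters)]
    return sum(1 for diff in simulated if diff >= observed) / float(iters)

np.random.seed(0)  # assumption: fixed seed, used only for reproducibility
# Toy frame only (Sex and Age from the five rows shown by data.head()).
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
})

p_value = permutation_p_value(toy, iters=1000)
print("p-value on toy data", p_value)
```

Shuffling the ages breaks any link between sex and age while keeping the group sizes fixed, so the simulated differences show what chance alone produces under the null hypothesis.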

Ignoring passengers with missing ages:

1.  Were the average ages of male and female passengers on the Titanic different?
2.  What additional (if any) conclusions can you draw based on the p-value you just computed?  In other words, what does this p-value mean?

Disclaimer: (1) is a bit of a trick question (sorry!), but I included it to encourage being precise about the definition of the null hypothesis and exactly which population it refers to.