### Familiar: A Study In Data Analysis
Familiar, a startup in the new market of blood transfusion, has fallen on tough times lately, so our goal is to help them draw insights about their product and move the needle.

In [24]:
# Import libraries
import pandas as pd
import numpy as np

from scipy.stats import ttest_1samp
from scipy.stats import ttest_ind
from scipy.stats import chi2_contingency

# Load a dataset and print the result
lifespans = pd.read_csv("data_lifespan.csv")
print(lifespans.head())

     pack   lifespan
0    vein  76.255090
1  artery  76.404504
2  artery  75.952442
3  artery  76.923082
4  artery  73.771212


The first thing we want to know is whether Familiar’s most basic package, the Vein Pack, actually has a significant impact on its subscribers. It would be a marketing goldmine if we could show that subscribers to the Vein Pack live longer than other people.

We want to extract the life spans of subscribers to the 'vein' pack and save the data into a variable.

In [25]:
# Save the life spans of subscribers to the 'vein' pack
vein_pack_lifespans = lifespans.lifespan[lifespans.pack == "vein"]

# Calculate mean value
mean_vein_lifespans = np.mean(vein_pack_lifespans)
# Print the result
print(mean_vein_lifespans)

76.16901335636044


We’d like to find out if the average lifespan of a Vein Pack subscriber is significantly different from the average life expectancy of 73 years. We will test the following null and alternative hypotheses:
- null: The average lifespan of a Vein Pack subscriber is 73 years
- alternative: The average lifespan of a Vein Pack subscriber is NOT 73 years

To compare the mean of a single sample of lifespans (quantitative data) to a hypothetical population value of 73 years, we'll use a one-sample t-test.

In [26]:
# Run a one sample t-test
tstat, pval = ttest_1samp(vein_pack_lifespans, 73)
# Print the result
print(pval)

5.972157921433201e-07


The p-value is approximately 0.000000597. This is much smaller than the significance threshold of 0.05, so we reject the null hypothesis and conclude that the average lifespan of Vein Pack subscribers IS significantly different from 73 years.
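
To make the one-sample t-test less of a black box, here is a minimal sketch that computes the t-statistic and two-sided p-value by hand and cross-checks them against `ttest_1samp`. The `sample` array is synthetic and only stands in for `vein_pack_lifespans`; its values are an assumption, not the real subscriber data.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for vein_pack_lifespans (synthetic values, not the real data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=76, scale=4, size=20)

mu0 = 73  # hypothesized population mean from the null hypothesis
n = len(sample)

# t = (sample mean - hypothesized mean) / standard error of the mean
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Cross-check against SciPy's implementation
t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The two pairs of numbers should agree up to floating-point error, which shows that the test only depends on the sample mean, the sample standard deviation, and the sample size.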

### Pumping Life Into The Company
In order to differentiate Familiar’s different product lines, we’d like to compare this lifespan data to another package - the Artery Pack.

In [27]:
# Save the life spans of subscribers to the 'artery' pack
artery_pack_lifespans = lifespans.lifespan[lifespans.pack == "artery"]

# Calculate mean value (stored under its own name so the Vein Pack mean is not overwritten)
mean_artery_lifespans = np.mean(artery_pack_lifespans)
# Print the result
print(mean_artery_lifespans)

74.87366223517039


Now we’d like to find out if the average lifespan of a Vein Pack subscriber is significantly different from the average lifespan of an Artery Pack subscriber. We will test the following null and alternative hypotheses:
- null: The average lifespan of a Vein Pack subscriber is equal to the average lifespan of an Artery Pack subscriber
- alternative: The average lifespan of a Vein Pack subscriber is NOT equal to the average lifespan of an Artery Pack subscriber

For this test, we’re really looking at an association between two variables: whether a subscriber got the Vein Pack or Artery Pack (a binary categorical variable) and their lifespan (a quantitative variable). That makes a two-sample t-test the appropriate choice.

In [28]:
# Run a two-sample t-test
tstat, pval = ttest_ind(vein_pack_lifespans, artery_pack_lifespans)
# Print the result
print(pval)

0.055888830790708194


This p-value is larger than 0.05, so we fail to reject the null hypothesis: the average lifespan of Vein Pack subscribers is not significantly different from the average lifespan of Artery Pack subscribers.
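
One caveat worth checking: by default, `ttest_ind` assumes both groups share the same population variance. The sketch below compares that default with Welch's t-test (`equal_var=False`), which drops the assumption. The two groups are synthetic stand-ins; whether the real Vein and Artery samples actually differ in variance is not something this example can tell us.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-ins for the two groups (not the real subscriber data)
rng = np.random.default_rng(1)
group_a = rng.normal(loc=76, scale=3, size=20)
group_b = rng.normal(loc=75, scale=6, size=35)

# Default: Student's t-test with a pooled variance estimate
t_pooled, p_pooled = ttest_ind(group_a, group_b)
# Welch's t-test: does not assume equal variances
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

# Welch's statistic by hand: difference in means over the unpooled standard error
se = np.sqrt(group_a.var(ddof=1) / len(group_a) + group_b.var(ddof=1) / len(group_b))
t_manual = (group_a.mean() - group_b.mean()) / se

print(p_pooled, p_welch)
```

With a p-value as close to 0.05 as the one in this analysis, running both variants on the real data would show whether the conclusion is sensitive to the equal-variance assumption.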

### Side Effects
We have been provided another dataset containing survey data about iron counts for our subscribers. This data has been pre-processed to categorize iron counts as “low”, “normal”, and “high” for each subscriber. Familiar wants to be able to advise potential subscribers about possible side effects of these packs and whether they differ for the Vein vs. the Artery pack.

In [29]:
# Load a dataset and print the result
iron = pd.read_csv("data_iron.csv")
print(iron.head())

     pack    iron
0    vein     low
1  artery  normal
2  artery  normal
3  artery  normal
4  artery    high


First of all, we'd like to know if there is an association between the pack that a subscriber gets (Vein vs. Artery) and their iron level. To do that, we must create a contingency table of the 'pack' and 'iron' columns.

In [30]:
# Create a contingency table and print the result
Xtab = pd.crosstab(iron.pack, iron.iron)
print(Xtab)

iron    high  low  normal
pack                     
artery    87   29      29
vein      20  140      40
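
Because the two packs have different numbers of subscribers (145 Artery vs. 200 Vein), raw counts are hard to compare directly. As a small sketch, the counts from the contingency table can be converted to within-pack proportions; the `counts` frame below is typed in by hand from the printed table rather than loaded from `data_iron.csv`.

```python
import pandas as pd

# Counts copied from the printed contingency table
counts = pd.DataFrame(
    [[87, 29, 29], [20, 140, 40]],
    index=["artery", "vein"],
    columns=["high", "low", "normal"],
)

# Divide each row by its total to get within-pack proportions
proportions = counts.div(counts.sum(axis=1), axis=0)
print(proportions)
```

About 60% of Artery Pack subscribers fall in the "high" category, while about 70% of Vein Pack subscribers fall in "low". With the original data frame, `pd.crosstab(iron.pack, iron.iron, normalize="index")` should give the same proportions directly.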


To check if the association is significant, we will test the following null and alternative hypotheses:
- null: There is NOT an association between which pack (Vein vs. Artery) someone subscribes to and their iron level
- alternative: There is an association between which pack (Vein vs. Artery) someone subscribes to and their iron level

For this test, we’re interested in an association between two categorical variables, so the Chi-Square test is our choice.

In [31]:
# Run Chi-Square test
chi2, pval, dof, exp = chi2_contingency(Xtab)
print(pval)

9.359749337433008e-25


Our p-value is approximately 9.36 × 10^-25, which is effectively zero. This is far smaller than 0.05, so we reject the null hypothesis and conclude that there IS a significant association between pack and iron level.
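
To see where that result comes from, the following sketch rebuilds the expected counts and the Chi-Square statistic by hand from the contingency table, then compares them with what `chi2_contingency` returns. The `observed` array is typed in from the printed table; the other variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts (rows: artery, vein; columns: high, low, normal)
observed = np.array([[87, 29, 29], [20, 140, 40]])

# Expected count under independence = row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Chi-Square statistic: sum of (observed - expected)^2 / expected over all cells
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Cross-check against SciPy
chi2, pval, dof, expected = chi2_contingency(observed)
print(expected_manual)
print(chi2_manual, chi2, dof, pval)
```

The "normal" column contributes almost nothing to the statistic because its observed counts (29 and 40) match the expected counts, so the association is driven by the "high" and "low" categories.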