# Workshop III - Pandas and Statistics

## Why Pandas?

From https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673


Pandas is quite a game changer when it comes to analyzing data with Python, and it is one of the most widely used tools for data munging/wrangling, if not the most used one. Pandas is open source and free to use (under a BSD license), and it was originally written by Wes McKinney.

What's cool about Pandas is that it takes data (like a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns called a data frame, which looks very similar to a table in statistical software (think Excel or SPSS, for example; people who are familiar with R will see similarities to R too). This is much easier to work with than lists and/or dictionaries processed through for loops or list comprehensions.


In [None]:
import pandas as pd
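import scipy.stats  # import the stats submodule explicitly; older SciPy versions do not load it with a bare "import scipy"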
import pingouin as pg

In [None]:
a = pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
a

In [None]:
a = pd.DataFrame({
    'Cat': ['I liked it.', 'It was awful.', 'Crunchy'],
    'Dog': ['Pretty good.', 'Bland.', 'Bad.'],
    'Bird' : ['Terrible.', 'Even worse.', 'Best thing ever']})
a

In [None]:
a = pd.DataFrame(
    {
    'Cat': ['I liked it.', 'It was awful.', 'Crunchy'],
    'Dog': ['Pretty good.', 'Bland.', 'Bad.'],
    'Bird' : ['Terrible.', 'Even worse.', 'Best thing ever']},
    index=['Cat food', 'Dog food', 'Bird seed'])
a

## Access the bits of data we want

### By row or column number/s - iloc

In [None]:
a.iloc[:, 1]

### By row or column names - loc

In [None]:
a.loc[:, 'Cat']

In [None]:
a.loc[['Cat food', 'Bird seed'],['Cat', 'Bird']]
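
One detail that is easy to trip over: slices behave differently in the two accessors. The sketch below re-creates the small pet-food table under the name `reviews` (a name chosen only for this example, so the cell runs on its own) and compares a positional slice with a label slice.

```python
import pandas as pd

# Re-create the small review table (the name `reviews` is only for this example)
reviews = pd.DataFrame(
    {
        'Cat': ['I liked it.', 'It was awful.', 'Crunchy'],
        'Dog': ['Pretty good.', 'Bland.', 'Bad.'],
        'Bird': ['Terrible.', 'Even worse.', 'Best thing ever'],
    },
    index=['Cat food', 'Dog food', 'Bird seed'])

# iloc slices by position and EXCLUDES the end point: rows 0 and 1
by_position = reviews.iloc[0:2, :]

# loc slices by label and INCLUDES the end point: 'Cat food' and 'Dog food'
by_label = reviews.loc['Cat food':'Dog food', :]

# Both selections contain the same two rows
print(by_position.equals(by_label))
```

So `0:2` and `'Cat food':'Dog food'` pick the same rows here, even though the first stops before position 2 and the second includes its end label.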

In [None]:
import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic

In [None]:
titanic.loc[titanic.age > 40, ["age","fare"]].describe()

## Almost any way of manipulating your data can be done

- Load from CSV or Excel
- Add new rows or columns
- Change values
- Merge data frames
- Statistics
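
As a rough sketch of what these operations can look like, the example below runs through each item on a tiny made-up table. All names and values (`weights`, `diets`, the CSV text) are invented for illustration; the CSV is read from an in-memory string so that no file is needed.

```python
import io
import pandas as pd

# Load from CSV (an in-memory string stands in for a file on disk)
csv_text = "animal,weight_kg\ncat,4.0\ndog,20.0\nbird,0.1\n"
weights = pd.read_csv(io.StringIO(csv_text))

# Change a value, selecting the row with a boolean mask and the column by name
weights.loc[weights['animal'] == 'dog', 'weight_kg'] = 22.0

# Add a new column computed from an existing one
weights['weight_g'] = weights['weight_kg'] * 1000

# Merge with a second data frame on the shared 'animal' column
diets = pd.DataFrame({'animal': ['cat', 'dog', 'bird'],
                      'food': ['Cat food', 'Dog food', 'Bird seed']})
merged = weights.merge(diets, on='animal')

# Summary statistics for a numeric column
print(merged['weight_kg'].describe())
```

With a real file, `pd.read_csv("path/to/file.csv")` or `pd.read_excel(...)` would replace the in-memory string.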

In [None]:
from IPython.display import IFrame
IFrame("https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf", width=800, height=1500)

# Statistics

## The reproducibility crisis

In [None]:
from IPython.display import IFrame
IFrame("https://www.nature.com/articles/533452a.pdf", width=800, height=1500)

## What is a p-value?

## If a p-value is less than a cutoff, what does that mean?

-----

## Distribution

In [None]:
penguins = sns.load_dataset("penguins")
penguins

In [None]:
sns.displot(penguins, x="flipper_length_mm", height=6)

In [None]:
a = pd.DataFrame(penguins.loc[:, "flipper_length_mm"])
a.describe()

In [None]:
sns.catplot(data=penguins, x="flipper_length_mm", kind="box")

## Probability distributions

### Population = complete set of individuals we want information about

### Sample = subset of the population

#### - We want a non-biased sample

----

### What's the difference between a standard deviation and a standard error of the mean?
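
One way to explore this question is to compute both quantities on simulated samples of increasing size. The sketch below assumes a made-up normal population (mean 200, standard deviation 14; these numbers are illustrative and not taken from the penguins data). The standard deviation describes the spread of individual values, while the standard error of the mean, SD / sqrt(n), describes the uncertainty of the sample mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

summary = {}
for n in (10, 100, 1000):
    # Draw a sample of size n from the assumed population
    sample = rng.normal(loc=200, scale=14, size=n)
    sd = sample.std(ddof=1)      # sample standard deviation
    sem = sd / np.sqrt(n)        # standard error of the mean
    # scipy.stats.sem computes the same quantity directly
    summary[n] = (sd, sem, stats.sem(sample))
    print(f"n = {n:4d}   SD = {sd:6.2f}   SEM = {sem:5.2f}")
```

The SD should stay in the neighbourhood of the population value as n grows, whereas the SEM shrinks.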

-----

### Probabilities describe the process of sampling from a population

### The probabilities lie on a distribution

-----

## Discrete random variables vs. continuous random variables

### Binomial distribution - discrete random variable

In [None]:
from numpy import random

sns.displot(random.binomial(n=10, p=0.5, size=1000), kind="hist")
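
The histogram is a simulation; the exact probabilities behind it come from the binomial probability mass function. A short sketch using `scipy.stats.binom` with the same n = 10 and p = 0.5 (the dictionary name `pmf` is just for this example):

```python
from scipy import stats

n, p = 10, 0.5

# Exact probability of each possible number of successes, 0 to 10
pmf = {k: stats.binom.pmf(k, n, p) for k in range(n + 1)}

for k, prob in pmf.items():
    print(f"P(X = {k:2d}) = {prob:.4f}")
```

With 1000 simulated draws, the bar heights divided by 1000 should land close to these values.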

#### See also Poisson distribution

### Normal distribution - continuous random variable

In [None]:
import numpy as np
x = np.random.standard_normal(100000)
sns.displot(x,kind="kde")

#### See also t-distribution, F-distribution, chi-squared distribution - test statistics lie on a probability distribution

# Hypothesis testing

### 1. Null hypothesis

e.g. there is no difference between the means of two populations

### 2. Alternative hypothesis

e.g. there is a difference between the means of two populations

### 3. Test statistics

### 4. Rejection region

### 5. Check assumptions and draw conclusions

## Nominal variables - Fisher's Exact Test

In [None]:
a = pd.DataFrame(
    {
    'Cat': [7,2],
    'Dog': [1,12],},
    index=['Cat food', 'Dog food'])
a

- Null hypothesis : Whether an animal likes cat food or dog food is independent of whether it is a cat or a dog
- Alternative hypothesis : Whether an animal likes cat food or dog food depends on whether it is a cat or a dog
- Level of significance : 0.05
- Assumptions of the test: individual observations are independent; totals are fixed

In [None]:
oddsratio, pvalue = scipy.stats.fisher_exact(a)  
pvalue

#### As p < 0.05, we reject the null hypothesis; there is an association between being a cat or a dog and liking cat food or dog food
- one variable has the ability to predict the other
- the p-value does not indicate anything about the strength of the association
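
To get a feel for the strength of the association, one option is to look at the odds ratio that `scipy.stats.fisher_exact` returns alongside the p-value. A minimal sketch on the same 2x2 counts (the variable name `table` is chosen only for this example):

```python
import pandas as pd
from scipy import stats

# Same counts as the contingency table used for the test
table = pd.DataFrame({'Cat': [7, 2], 'Dog': [1, 12]},
                     index=['Cat food', 'Dog food'])

oddsratio, pvalue = stats.fisher_exact(table.to_numpy())

# The sample odds ratio is (7 * 12) / (1 * 2)
print(f"odds ratio = {oddsratio:.1f}, p = {pvalue:.4f}")
```

An odds ratio far from 1 suggests a strong association, but with counts this small its confidence interval will be wide.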

# What is a p-value?

## The probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.
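
As a concrete illustration of this definition, consider a toy example that is not part of the workshop data: a coin is flipped 10 times and lands heads 9 times. Under the null hypothesis of a fair coin, the two-sided p-value is the probability of a result at least as extreme as the one observed, that is 9 or 10 heads, or equally 0 or 1 heads. A sketch with `scipy.stats.binom`:

```python
from scipy import stats

n_flips, n_heads = 10, 9   # made-up observation for illustration

# P(X >= 9) under a fair coin: survival function evaluated at 8
upper_tail = stats.binom.sf(n_heads - 1, n_flips, 0.5)

# P(X <= 1) under a fair coin: the equally extreme results in the other tail
lower_tail = stats.binom.cdf(n_flips - n_heads, n_flips, 0.5)

p_value = upper_tail + lower_tail
print(f"two-sided p-value = {p_value:.4f}")
```

Each tail contributes 11 of the 1024 equally likely outcomes, so the p-value is 22/1024.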

In [None]:
from IPython import display
display.Image("https://s3.amazonaws.com/libapps/accounts/73970/images/hypothesis_testing.png")


## t - test

- The t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis (Wikipedia)
- The t-test was developed by William Sealy Gosset, a chemist working for the Guinness brewing company, as a simple way to monitor the consistent quality of stout; he published it under the pseudonym "Student"

In [None]:
penguins

In [None]:
penguins_subset = penguins[penguins.species != "Chinstrap"]
sns.catplot(data=penguins_subset, x= "species", y = "flipper_length_mm", orient="v", kind="box", height=6, aspect=0.6)

- Null hypothesis : The flipper length means of the Adelie and Gentoo populations are equal
- Alternative hypothesis : The flipper length means of the Adelie and Gentoo populations are different
- Level of significance : 0.05
- Assumptions of the test: Data in each group must be obtained via a random sample from the population; data in each group are normally distributed; data values are continuous; variances for the two independent groups are equal.

### Extract the values from the penguins dataset

In [None]:
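# Drop missing measurements: the scipy tests below return nan if the input contains NaN
a_flip = penguins.loc[penguins.species=='Adelie','flipper_length_mm'].dropna()
g_flip = penguins.loc[penguins.species=='Gentoo','flipper_length_mm'].dropna()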

#### Test for normality - Shapiro test
- Null hypothesis: flipper length for the population is normally distributed
- Alternative hypothesis: flipper length for the population is not normally distributed

In [None]:
print(scipy.stats.shapiro(a_flip))
print(scipy.stats.shapiro(g_flip))

- Where p > 0.05 we fail to reject the null hypothesis: there is no evidence that the population departs from a normal distribution (this is not proof of normality)

#### Test for equality of variances (homoscedasticity) - Bartlett test
- Null hypothesis: flipper length variances for the two populations are equal
- Alternative hypothesis: flipper length variances for the two populations are different

In [None]:
scipy.stats.bartlett(a_flip, g_flip)

#### Use the pingouin library to conduct the t - test
- this is a newish Python library (so be aware) that gives lots of additional outputs

In [None]:
pg.ttest(a_flip, g_flip)

#### As p < 0.05, we reject the null hypothesis; the mean flipper lengths of the Gentoo and Adelie populations are different
- the p-value does not indicate anything about the size of the difference; look at the effect size (Cohen's d) for that
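
One common way to express how large a difference between two means is uses Cohen's d: the difference in means divided by the pooled standard deviation. The sketch below computes it by hand on made-up numbers; the helper `cohens_d` and the two small groups are hypothetical and only meant to show the arithmetic.

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    # Pooled variance: the two sample variances weighted by their degrees of freedom
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

group_a = [1, 2, 3, 4, 5]   # made-up values
group_b = [3, 4, 5, 6, 7]   # made-up values

d = cohens_d(group_a, group_b)
print(f"Cohen's d = {d:.2f}")
```

The sign only reflects which group is listed first; the magnitude is what is usually interpreted.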

In [None]:
%%html
<iframe src="https://pingouin-stats.org/generated/pingouin.ttest.html#pingouin.ttest" width="1000" height="800"></iframe>

## ANOVA

#### Null and alternative hypotheses?
#### Assumptions - (approx) normal distributions, independent samples, randomly sampled, equal variances
#### What to do if these assumptions aren't met?
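
One commonly used fallback when normality is doubtful is the rank-based Kruskal-Wallis test; when only the equal-variance assumption fails, Welch's ANOVA is another option (pingouin provides `pg.kruskal` and `pg.welch_anova`). A minimal sketch with `scipy.stats.kruskal` on three made-up groups, not the penguins data:

```python
from scipy import stats

# Three small made-up groups with clearly separated values
group_1 = [1, 2, 3, 4, 5]
group_2 = [6, 7, 8, 9, 10]
group_3 = [11, 12, 13, 14, 15]

# Kruskal-Wallis compares the groups using ranks, so it does not assume normality
h_stat, p_value = stats.kruskal(group_1, group_2, group_3)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```

Which alternative is appropriate depends on which assumption is violated, so this is a starting point rather than a rule.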

In [None]:
from IPython.display import SVG
SVG("https://pingouin-stats.org/_images/flowchart_one_way_ANOVA.svg")

In [None]:
sns.catplot(data=penguins, x= "species", y = "flipper_length_mm", orient="v", kind="box", height=6, aspect=0.6)

In [None]:
pg.homoscedasticity(data=penguins.dropna(), dv='flipper_length_mm', group='species', method = "bartlett")

In [None]:
pg.anova(data=penguins, dv='flipper_length_mm', between='species', detailed=True)

#### Null hypothesis rejected - the means of the populations are not equal
#### We need a post-hoc test to see which means are not equal

In [None]:
pg.pairwise_tukey(data=penguins, dv='flipper_length_mm', between='species')

## The multiple comparisons problem

 - The look-elsewhere effect is a phenomenon in the statistical analysis of scientific experiments where an apparently statistically significant observation may have actually arisen by chance because of the sheer size of the parameter space to be searched (wikipedia)
 - The more inferences are made, the more likely erroneous inferences are to occur. 
 - Set a stricter significance threshold for individual comparisons, so as to compensate for the number of inferences being made.
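
To put rough numbers on this, the sketch below assumes m independent tests, each run at a significance level of 0.05 with every null hypothesis true. The chance of at least one false positive is then 1 - (1 - alpha)^m, and the simplest stricter threshold is the Bonferroni one, alpha / m. The independence assumption and the variable names are choices made for this illustration.

```python
alpha = 0.05

fwer_by_m = {}
for m in (1, 5, 20, 100):
    # Probability of at least one false positive across m independent tests
    fwer = 1 - (1 - alpha) ** m
    # Bonferroni: per-test threshold that keeps the overall error rate at or below alpha
    bonferroni_threshold = alpha / m
    fwer_by_m[m] = fwer
    print(f"m = {m:3d}   P(at least one false positive) = {fwer:.3f}   "
          f"Bonferroni threshold = {bonferroni_threshold:.4f}")
```

With 20 tests the chance of at least one spurious "significant" result is already well above one half.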

In [None]:
pvals = [0.04, 0.001, 0.02, 0.009]
pg.multicomp(pvals, alpha=0.05, method='holm')
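
To see what the Holm method does with these p-values, the step-down procedure can be written out by hand. The helper `holm_reject` below is a hypothetical illustration, not pingouin's implementation: sort the p-values, compare the k-th smallest (counting from 0) against alpha / (m - k), and stop at the first one that fails.

```python
def holm_reject(pvals, alpha=0.05):
    """Return a list of booleans: True where Holm's procedure rejects the null."""
    m = len(pvals)
    # Indices of the p-values, ordered from smallest to largest p-value
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The threshold relaxes step by step: alpha/m, alpha/(m-1), ..., alpha
        if pvals[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained too
    return reject

pvals = [0.04, 0.001, 0.02, 0.009]
decisions = holm_reject(pvals)
print(decisions)
```

Holm is never less powerful than plain Bonferroni, because only the smallest p-value faces the full alpha / m threshold.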

# Final thoughts

### Think about how the data will be analysed in the experimental design phase, not after the experiment is conducted
### Look at the assumptions of the test being used (and how strict they are). Use a more appropriate test if required
### Use multiple-comparison correction when doing multiple tests

In [1]:
%%html
<iframe src="https://cdn2.hubspot.net/hubfs/4627953/Essential%20Dos%20and%20Donts%20Ebook/GraphPad%20Ebook%20%7C%20Essential%20Dos%20Don%27ts.pdf?hsCtaTracking=5c7f8486-7fbb-4d16-a068-ad03d1b3af54%7C759129a2-44d6-48d7-b81c-6cb28641839e" width="1000" height="800"></iframe>
