# Basic Analysis of Data

Just looking at the data with visualization teaches us a lot, but as humans it's also easy to let our prior assumptions get in the way.  So a combination of statistical tests and direct examination of the data is usually warranted to get a more complete understanding of what's happening. We'd like to use statistical testing to estimate how likely it is that an observed difference could have arisen by chance.

For this let's use the `scikit-learn` and `scipy` modules, which can be installed with mamba using `mamba install scikit-learn scipy`.

To demonstrate - let's consider some "random" distributions, one centered at 100 and one at 200, each with a standard deviation of 100.  I'll generate them with a command from the `numpy` module (<https://numpy.org/>) - specifically the `numpy.random.normal` command (<https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html>).  Note that its second argument (`scale`) is the standard deviation, not the variance.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotnine as p9

In [None]:
dists=pd.DataFrame(np.vstack((np.random.normal(100,100,5), np.random.normal(200,100,5))).T, columns=['x', 'y']).melt()
dists

Let's plot this result:

In [None]:
p9.ggplot(data=dists, mapping=p9.aes(x='variable', y='value', color='variable'))+p9.geom_violin()+p9.geom_sina()

We can see that the two distributions overlap, which is not surprising given the large spread I set for the data.  What happens if I increase the number of samples? Instead of 5, let's take 100 samples.

In [None]:
lotsof_dists=pd.DataFrame(np.vstack((np.random.normal(100,100,100), np.random.normal(200,100,100))).T, columns=['x', 'y']).melt()
p9.ggplot(data=lotsof_dists, mapping=p9.aes(x='variable', y='value', color='variable'))+p9.geom_violin()+p9.geom_sina(alpha=0.5)

Better!  What about 1000 samples?

In [None]:
tonsof_dists=pd.DataFrame(np.vstack((np.random.normal(100,100,1000), np.random.normal(200,100,1000))).T, columns=['x', 'y']).melt()
p9.ggplot(data=tonsof_dists, mapping=p9.aes(x='variable', y='value', color='variable'))+p9.geom_violin()+p9.geom_sina(alpha=0.5)

The point I'm trying to get across is that the more samples of a population you take (especially for a noisy population) the better your measurement.  If your difference is small (100 in this case) compared to the standard deviation (100), it's hard to tell the difference with a few samples.

But we don't have to eyeball it.  We can _calculate_ how probable it is that a difference this large would arise by chance, using the t-test.

There are many possible statistical tests, depending on the kind of data you have and the shape of the distributions of the data.  We are _not_ going to have the time to exhaustively examine them here; instead we will show a simple case or two using what is perhaps the simplest statistical test, _Student's t-test_.

The t-test is a type of hypothesis test. In this case, we are testing the _null_ hypothesis $H_0$, i.e. that the means of the populations are equal.  You might say - "This is easy to measure - just take the mean of the samples and if they are the same, it's true, and if not, it's false!".  But it's not that simple - because you are _sampling_.  In the real world we only have a certain number of samples - and the samples themselves form a distribution.  So how far apart the means need to be depends to some extent on _how many_ samples you have, and therefore how accurately you can estimate the population mean.
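
To put a number on "how accurately you can estimate the population", here is a minimal sketch of the standard error of the mean, which shrinks as $1/\sqrt{n}$.  The seed and the variable names are my own choices for illustration; the mean and standard deviation match the "x" distribution above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable (an assumption)
true_mean, true_sd = 100, 100   # same settings as the "x" distribution above

theoretical_se = {}
estimated_se = {}
for n in (5, 100, 1000):
    sample = rng.normal(true_mean, true_sd, n)
    # standard error of the mean if we knew the true standard deviation
    theoretical_se[n] = true_sd / np.sqrt(n)
    # what we can actually compute: sample standard deviation / sqrt(n)
    estimated_se[n] = sample.std(ddof=1) / np.sqrt(n)
    print(f"n={n:5d}  sample mean={sample.mean():8.2f}  "
          f"estimated SE={estimated_se[n]:6.2f}  theoretical SE={theoretical_se[n]:6.2f}")
```

With 5 samples the uncertainty on the mean is a large fraction of the 100-unit difference we are trying to detect; with 1000 samples it is only a few units.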

First - let's do a "1-sample" t-test - this checks if the mean of the samples is equal to the expected population mean (`popmean`).

As a reminder, we have 3 data sets in memory right now - dists (5 samples each), lotsof_dists (100 samples each), and tonsof_dists (1000 samples each).  The "x" distribution has an expected mean of 100, and the "y" distribution has an expected mean of 200.  
Let's test using the 1-sample t-test (using `ttest_1samp` <https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html>) whether we can reject the null hypothesis for the x distribution with a population mean of 100 or with a population mean of 200.

To do this, let's use the `scipy` packages, specifically the `stats` submodule <https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html>

In [None]:
from scipy import stats

In [None]:
print("5 samples Versus population mean of 100:")
print(stats.ttest_1samp(dists[dists['variable']=="x"].value, popmean=100))
print("5 samples Versus population mean of 200:")
print(stats.ttest_1samp(dists[dists['variable']=="x"].value, popmean=200))

Breaking down these results, we see two numbers.  The first is a statistic - specifically a _t-statistic_ - which without getting into the math (<https://en.wikipedia.org/wiki/T-statistic>) is the ratio of how far the sample mean is from the expected mean to the standard error of that mean.  The second number - more immediately applicable - is the infamous _p-value_.  A p-value (<https://en.wikipedia.org/wiki/P-value>) in hypothesis testing is the probability that you would see these kinds of results, or results more extreme than these, if the null hypothesis is _true_.  
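
To make those two numbers less of a black box, here is a sketch that computes them by hand and compares them with `stats.ttest_1samp`.  The seed and the sample are my own illustration (not the `dists` data above), and I am assuming the default two-sided test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)      # fixed seed, chosen only for repeatability
sample = rng.normal(100, 100, 5)    # 5 draws, like the "x" column
popmean = 200                       # the hypothesised population mean

n = len(sample)
standard_error = sample.std(ddof=1) / np.sqrt(n)
# t-statistic: distance of the sample mean from popmean, in standard errors
t_manual = (sample.mean() - popmean) / standard_error
# two-sided p-value from a t distribution with n-1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

result = stats.ttest_1samp(sample, popmean=popmean)
print("by hand:", t_manual, p_manual)
print("scipy:  ", result.statistic, result.pvalue)
```

The two lines should agree to within floating-point error.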

Importantly, the p-value can only be used to _nullify_ or _reject_ the null hypothesis, never to prove it.  So - from the first result - we can't reject the null hypothesis (that the mean of the population is 100).  Nor would I say with _confidence_ that we can reject the null hypothesis that the population mean is 200 (p=0.106 in my run; your numbers will differ because the samples are random), but it is _suggestive_.  All too frequently in scientific work, people are seeking a p-value of 0.05 - i.e. saying that if you pass that 5% chance threshold, you can absolutely rule out the null hypothesis.  

Let's see what happens with the larger samples:

In [None]:
print("100 samples Versus population mean of 100:")
print(stats.ttest_1samp(lotsof_dists[lotsof_dists['variable']=="x"].value, popmean=100))
print("100 samples Versus population mean of 200:")
print(stats.ttest_1samp(lotsof_dists[lotsof_dists['variable']=="x"].value, popmean=200))

In [None]:
print("1000 samples Versus population mean of 100:")
print(stats.ttest_1samp(tonsof_dists[tonsof_dists['variable']=="x"].value, popmean=100))
print("1000 samples Versus population mean of 200:")
print(stats.ttest_1samp(tonsof_dists[tonsof_dists['variable']=="x"].value, popmean=200))

As you can see now, though the p-value for the population mean of 100 remains high, the p-value for the population mean of 200 has become vanishingly small, suggesting that we can successfully reject the null hypothesis there.   
But what if we don't _a priori_ know the population mean, and we just want to compare two distributions?  We can do that with a 2-sample t-test - using `ttest_ind` (<https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html>)

In [None]:
print("5 samples comparing x (5 samples from a normal distribution with a mean of 100) to y (5 samples from a normal distribution with a mean of 200)")
print(stats.ttest_ind(dists[dists['variable']=="x"].value, dists[dists['variable']=="y"].value))
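
For the curious, here is a sketch of what `ttest_ind` computes with its default settings (a pooled-variance, two-sided test).  The seed and the `x`/`y` arrays are stand-ins I generated for illustration, not the `dists` data above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)   # fixed seed, chosen only for repeatability
x = rng.normal(100, 100, 5)
y = rng.normal(200, 100, 5)

nx, ny = len(x), len(y)
# pooled variance: weighted average of the two sample variances
pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
# t-statistic: difference of the means in units of its standard error
t_manual = (x.mean() - y.mean()) / np.sqrt(pooled_var * (1 / nx + 1 / ny))
# two-sided p-value with nx + ny - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=nx + ny - 2)

result = stats.ttest_ind(x, y)   # equal_var=True is the default
print("by hand:", t_manual, p_manual)
print("scipy:  ", result.statistic, result.pvalue)
```

If you are not willing to assume the two groups have equal variance, `ttest_ind` also accepts `equal_var=False`, which switches to Welch's t-test.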

With only 5 samples - it's hard to tell them apart - this makes sense - remember:

In [None]:
p9.ggplot(data=dists, mapping=p9.aes(x='variable', y='value', color='variable'))+p9.geom_boxplot()

But what about with 100 samples:

In [None]:
print("100 samples comparing x (100 samples from a normal distribution with a mean of 100) to y (100 samples from a normal distribution with a mean of 200)")
print(stats.ttest_ind(lotsof_dists[lotsof_dists['variable']=="x"].value, lotsof_dists[lotsof_dists['variable']=="y"].value))
p9.ggplot(data=lotsof_dists, mapping=p9.aes(x='variable', y='value', color='variable'))+p9.geom_boxplot()

Now we can reject the null hypothesis that these samples have the same mean: if the population means really were equal, a difference at least this large would show up with a probability of only 3.12e-8.

But these are all simulated data - how do we do this on "real" data?  Let's go back to our good ol' penguins data set.

In [None]:
penguin=pd.read_excel('../data/penguin.xlsx')
penguin=penguin[penguin.sex.notnull()]
penguin

Let's do a simple test - let's look at body mass between male and female penguins

In [None]:
print(stats.ttest_ind(penguin[penguin['sex']=="male"].body_mass_g, penguin[penguin['sex']=="female"].body_mass_g))
p9.ggplot(data=penguin, mapping=p9.aes(x='sex', y='body_mass_g', color='sex'))+p9.geom_violin()+p9.geom_sina(alpha=0.5)

Looks like they are different! But what's this I notice in the plot - I see two separate distributions in both sex samples?

In [None]:
(p9.ggplot(data=penguin, mapping=p9.aes(x='sex', y='body_mass_g', color='species'))+p9.geom_boxplot()+p9.geom_sina(alpha=0.5)).show()
(p9.ggplot(data=penguin, mapping=p9.aes(x='sex', y='body_mass_g', color='species'))+p9.geom_violin()+p9.geom_sina(alpha=0.5)).show()

Oh look - it's due to species!  The Adelie and Chinstrap seem to have largely similar distributions, but the Gentoo are noticeably larger.  But we don't have to judge this by eye; let's test the male penguins across the 3 species:

In [None]:
male_penguin=penguin[penguin['sex']=="male"]
male_adelie=male_penguin[male_penguin['species']=="Adelie"]
male_gentoo=male_penguin[male_penguin['species']=="Gentoo"]
male_chinstrap=male_penguin[male_penguin['species']=="Chinstrap"]

Comparing the Adelie and the Gentoo:

In [None]:
print(stats.ttest_ind(male_adelie.body_mass_g, male_gentoo.body_mass_g))

And now comparing the Adelie and the Chinstrap:

In [None]:
print(stats.ttest_ind(male_adelie.body_mass_g, male_chinstrap.body_mass_g))

So from this we can't reject the null hypothesis that the Chinstrap and Adelie male penguins have the same mean body mass. What about the female?

In [None]:
female_adelie=penguin[(penguin.sex=="female")&(penguin.species=="Adelie")]
female_chinstrap=penguin[(penguin.sex=="female")&(penguin.species=="Chinstrap")]
print(stats.ttest_ind(female_adelie.body_mass_g, female_chinstrap.body_mass_g))
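
Filtering by hand gets repetitive once there are more than a couple of groups.  Here is a sketch of one way to loop over every pair of groups using `groupby` and `itertools.combinations`.  The `toy` table is synthetic stand-in data with made-up group names and numbers, not the penguin measurements.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(5)   # fixed seed, chosen only for repeatability
# synthetic stand-in table: groups A and B share a mean, C is much heavier
toy = pd.DataFrame({
    "species": np.repeat(["A", "B", "C"], 30),
    "body_mass_g": np.concatenate([
        rng.normal(4000, 300, 30),
        rng.normal(4000, 300, 30),
        rng.normal(5500, 300, 30),
    ]),
})

# one Series of body masses per group
groups = {name: grp["body_mass_g"] for name, grp in toy.groupby("species")}

pvalues = {}
for a, b in combinations(groups, 2):
    result = stats.ttest_ind(groups[a], groups[b])
    pvalues[(a, b)] = result.pvalue
    print(f"{a} vs {b}: t={result.statistic:.2f}, p={result.pvalue:.3g}")
```

The same loop should work on the real table by swapping `toy` for a filtered `penguin` data frame.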

### Principal components

One problem with this data is that - for each individual penguin - there are multiple components or dimensions to the data.  This can get even more complicated with higher "dimensionality" data - imagine you are profiling the level of each gene in a thousand different cells - how do you represent this? How do you look at the similarities?  One way is something called Principal Component Analysis (_PCA_).  

PCA operates by changing the coordinate system so that _most_ of the variation lies along a few axes, so that the variation in the data can be represented in as few as 2 dimensions (so we can plot it and examine it).

<img src="../files/pca_process.png" width=800>

More information is at this website <https://setosa.io/ev/principal-component-analysis/>
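
To show what "changing the coordinate system" means in practice, here is a sketch of PCA done by hand with `numpy` on a small synthetic 2-D data set: center the data, take the eigenvectors of the covariance matrix, and project onto them.  The data and seed are made up for illustration; the comparison against scikit-learn's `PCA` is only a sanity check.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)   # fixed seed, chosen only for repeatability
# synthetic, strongly correlated 2-D data
t = rng.normal(0, 3, 200)
data = np.column_stack([t + rng.normal(0, 0.5, 200),
                        0.5 * t + rng.normal(0, 0.5, 200)])

# 1. center each column on zero
centered = data - data.mean(axis=0)
# 2. covariance matrix of the columns
cov = np.cov(centered, rowvar=False)
# 3. eigenvectors are the new axes, eigenvalues the variance along each axis
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]          # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
# 4. project the centered data onto the new axes
projected = centered @ eigenvectors

pca = PCA(n_components=2).fit(data)
print("variance along each PC (by hand):", eigenvalues)
print("variance along each PC (sklearn):", pca.explained_variance_)
```

The sign of each axis is arbitrary, so the hand-computed projection may be a mirror image of what `pca.transform(data)` returns.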


We'll be using PCA from the `scikit-learn` module (<https://scikit-learn.org/stable/index.html>) - specifically the `PCA` command.

In [None]:
from sklearn.decomposition import PCA

Before we try to do this, let's clean up our penguins data set a little to remove the NAs and separate out the categories (sex, species):

In [None]:
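# drop=True discards the old index; otherwise it becomes a numeric 'index' column that would end up in the PCA
clean_penguin=penguin.dropna().drop(columns='year').reset_index(drop=True)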
penguin_numbers=clean_penguin.select_dtypes(np.number)
penguin_categories=clean_penguin[['species', 'sex']]

pca=PCA(n_components=4)
penguins_pca=pca.fit_transform(penguin_numbers)
penguins_pca

We have to put this result back into a pandas DataFrame to play with it further - as it's currently a "numpy array"

In [None]:
pca_dataframe=pd.DataFrame(data=penguins_pca, columns=['PC1','PC2','PC3','PC4'])

And let's add back the categorical information (the rows are still in the same order). We'll use pandas `concat` for this (<https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html>)

In [None]:
pca_dataframe=pd.concat([pca_dataframe, penguin_categories], axis=1)
pca_dataframe

Let's plot the PCA using `geom_point`

In [None]:
p9.ggplot(data=pca_dataframe, mapping=p9.aes(x="PC1", y="PC2", color="species", shape="sex"))+p9.geom_point()

Plot is too small - let's make it bigger using the theme `p9.theme` and set `figure_size`

In [None]:
p9.ggplot(data=pca_dataframe, mapping=p9.aes(x="PC1", y="PC2", color="species", shape="sex"))+p9.geom_point()+p9.theme(figure_size=(16,8))

What we can see from this is that the _primary_ (i.e. first two) principal components - which capture most of the variation in the dataset - mostly separate the species. However, it's important to note that we didn't _scale_ the data before taking the PCA - remembering our picture at the beginning - the body_mass numbers are not at all on the same scale or units as the flipper length measurements.  So we should have first scaled the columns before starting. 

Let's do that now.  We'll use the `StandardScaler` from scikit-learn (<https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html>) which basically does this to each column: $z=(x-\mu)/s$ where z is the new result, mu is the mean of the column, and s is the standard deviation of the column. 

In [None]:
from sklearn import preprocessing

In [None]:
scaler=preprocessing.StandardScaler()
scaled_penguin_numbers=scaler.fit_transform(penguin_numbers)
scaled_penguin_numbers
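
As a quick check on that formula, here is a sketch that applies $z=(x-\mu)/s$ by hand and compares it with `StandardScaler`.  The `values` array holds illustrative numbers I made up (two columns on very different scales), not rows from the penguin data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# illustrative values only: a "length-like" column and a "mass-like" column
values = np.array([[180.0, 3700.0],
                   [185.0, 3900.0],
                   [195.0, 3300.0],
                   [192.0, 3500.0],
                   [210.0, 5000.0]])

mu = values.mean(axis=0)
s = values.std(axis=0)        # population standard deviation (ddof=0), as StandardScaler uses
z_manual = (values - mu) / s

z_sklearn = StandardScaler().fit_transform(values)
print(z_manual)
print("column means:", z_manual.mean(axis=0))
print("column standard deviations:", z_manual.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so no single measurement dominates just because of its units.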

In [None]:
scaled_pca=PCA(n_components=4)
scaled_penguins_pca=scaled_pca.fit_transform(scaled_penguin_numbers)
scaled_pca_dataframe=pd.DataFrame(data=scaled_penguins_pca, columns=['PC1','PC2','PC3','PC4'])
scaled_pca_dataframe=pd.concat([scaled_pca_dataframe, penguin_categories], axis=1)
scaled_pca_dataframe

In [None]:
p9.ggplot(data=scaled_pca_dataframe, mapping=p9.aes(x="PC1", y="PC2", color="species", shape="sex"))+p9.geom_point()

Wow - now they are really clearly separated, so scaling improved our result substantially.  Further, I can now see that there is even some separation according to sex in the data along the PCs. 
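
One way to see why scaling matters so much is the `explained_variance_ratio_` attribute of a fitted `PCA`, which reports the fraction of the total variance captured by each component.  Below is a sketch on synthetic data (two independent columns on very different scales, generated with a seed of my choosing), not the penguin measurements.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)   # fixed seed, chosen only for repeatability
n = 300
small = rng.normal(40, 5, n)       # standard deviation of about 5
large = rng.normal(4000, 800, n)   # standard deviation of about 800
data = np.column_stack([small, large])

# unscaled: the large-scale column swamps everything else
raw_ratio = PCA(n_components=2).fit(data).explained_variance_ratio_
# scaled: each column contributes on an equal footing
scaled_data = StandardScaler().fit_transform(data)
scaled_ratio = PCA(n_components=2).fit(scaled_data).explained_variance_ratio_

print("unscaled:", raw_ratio)
print("scaled:  ", scaled_ratio)
```

Without scaling, the first component is essentially just the large-scale column; after scaling, the variance is shared between the components.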