## Statistical Analysis in Python

In this section, we introduce a few useful methods for analyzing your data in Python.
Namely, we cover how to compute the mean, variance, and standard error of a data set.
For more advanced statistical analysis, we cover how to perform a
Mann-Whitney-Wilcoxon (MWW) RankSum test, how to perform an analysis of variance (ANOVA)
between multiple data sets, and how to compute bootstrapped 95% confidence intervals for
non-normally distributed data sets.

### Python's SciPy Module

The majority of data analysis in Python can be performed with the SciPy module. SciPy
provides a plethora of statistical functions and tests that will handle the majority of
your analytical needs. If we don't cover a statistical function or test that you require
for your research, SciPy's full statistical library is described in detail at:
http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html

### Python's pandas Module

The pandas module provides powerful, efficient, R-like DataFrame objects capable of
calculating statistics en masse on the entire DataFrame. DataFrames are useful when
you need to compute statistics over multiple replicate runs.
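
As a minimal sketch of what computing statistics "en masse" can look like, the following groups a small, made-up table by treatment and computes several statistics per group in one call. The column names and values are hypothetical, and the code assumes Python 3 with a reasonably recent pandas.

```python
import pandas as pd

# Hypothetical miniature data set: 2 treatments x 3 replicates each
toy_df = pd.DataFrame({
    "Treatment": ["A", "A", "A", "B", "B", "B"],
    "Replicate": [1, 2, 3, 1, 2, 3],
    "Score": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})

# Mean, sample variance, and number of replicates of Score per treatment
summary = toy_df.groupby("Treatment")["Score"].agg(["mean", "var", "count"])
print(summary)
```

Each row of `summary` corresponds to one treatment, so no explicit loop over replicates is needed.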

For the purposes of this tutorial, we will use Luis' parasite data set:

In [47]:
from pandas import *

# must specify that blank space " " is NaN
experimentDF = read_csv("data-analysis-python/parasite_data.csv", na_values=[" "])

print(experimentDF)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 350 entries, 0 to 349
Data columns:
Virulence           300  non-null values
Replicate           350  non-null values
ShannonDiversity    350  non-null values
dtypes: float64(2), int64(1)


### Accessing data in pandas DataFrames

You can directly access any column and row by indexing the DataFrame.

In [48]:
# show all entries in the Virulence column
print(experimentDF["Virulence"])

0     0.5
1     0.5
2     0.5
3     0.5
4     0.5
5     0.5
6     0.5
7     0.5
8     0.5
9     0.5
10    0.5
11    0.5
12    0.5
13    0.5
14    0.5
...
335   NaN
336   NaN
337   NaN
338   NaN
339   NaN
340   NaN
341   NaN
342   NaN
343   NaN
344   NaN
345   NaN
346   NaN
347   NaN
348   NaN
349   NaN
Name: Virulence, Length: 350


In [49]:
# show the entry at index 12 (the 13th row) of the ShannonDiversity column
print(experimentDF["ShannonDiversity"][12])

1.58981


You can also access all of the rows whose value in a column meets a certain criterion.

In [50]:
# show all rows whose ShannonDiversity value is > 2.0
print(experimentDF[experimentDF["ShannonDiversity"] > 2.0])

     Virulence  Replicate  ShannonDiversity
8          0.5          9           2.04768
89         0.6         40           2.01066
92         0.6         43           2.90081
96         0.6         47           2.02915
105        0.7          6           2.23427
117        0.7         18           2.14296
127        0.7         28           2.23599
129        0.7         30           2.48422
133        0.7         34           2.18506
134        0.7         35           2.42177
139        0.7         40           2.25737
142        0.7         43           2.07258
148        0.7         49           2.38326
151        0.8          2           2.07970
153        0.8          4           2.38474
163        0.8         14           2.03252
165        0.8         16           2.38415
170        0.8         21           2.02297
172        0.8         23           2.13882
173        0.8         24           2.53339
182        0.8         33           2.17865
196        0.8         47       

### Blank/omitted data (NA or NaN) in pandas DataFrames

Blank/omitted data is a piece of cake to handle in pandas. Here are the rows of our
data set that have NA/NaN values in the Virulence column.

In [51]:
# isnull() marks the NA/NaN entries of the column
print(experimentDF[experimentDF["Virulence"].isnull()])

     Virulence  Replicate  ShannonDiversity
300        NaN          1          0.000000
301        NaN          2          0.000000
302        NaN          3          0.833645
303        NaN          4          0.000000
304        NaN          5          0.990309
305        NaN          6          0.000000
306        NaN          7          0.000000
307        NaN          8          0.000000
308        NaN          9          0.061414
309        NaN         10          0.316439
310        NaN         11          0.904773
311        NaN         12          0.884122
312        NaN         13          0.000000
313        NaN         14          0.000000
314        NaN         15          0.000000
315        NaN         16          0.000000
316        NaN         17          0.013495
317        NaN         18          0.882519
318        NaN         19          0.000000
319        NaN         20          0.986830
320        NaN         21          0.000000
321        NaN         22       

DataFrame methods automatically ignore NA/NaN values.

In [52]:
print("Mean virulence across all treatments:", experimentDF["Virulence"].mean())

Mean virulence across all treatments: 0.75


However, not all methods in Python are guaranteed to handle NA/NaN values properly.

In [53]:
from scipy import stats

# stats.sem computes the standard error of the mean, and NaN values propagate into the result
print("SEM of virulence across all treatments:", stats.sem(experimentDF["Virulence"]))

SEM of virulence across all treatments: nan


Thus, it behooves you to take care of the NA/NaN values before performing your analysis. You can either:

**(1) filter out all of the entries with NA/NaN**

In [54]:
# NOTE: this drops the entire row if any of its entries are NA/NaN!
print(experimentDF.dropna())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 300 entries, 0 to 299
Data columns:
Virulence           300  non-null values
Replicate           300  non-null values
ShannonDiversity    300  non-null values
dtypes: float64(2), int64(1)


If you only care about NA/NaN values in a specific column, you can specify the
column name first.

In [55]:
print(experimentDF["Virulence"].dropna())

0     0.5
1     0.5
2     0.5
3     0.5
4     0.5
5     0.5
6     0.5
7     0.5
8     0.5
9     0.5
10    0.5
11    0.5
12    0.5
13    0.5
14    0.5
...
285    1
286    1
287    1
288    1
289    1
290    1
291    1
292    1
293    1
294    1
295    1
296    1
297    1
298    1
299    1
Name: Virulence, Length: 300


**(2) replace all of the NA/NaN entries with a valid value**

In [56]:
print(experimentDF.fillna(0.0)["Virulence"])

0     0.5
1     0.5
2     0.5
3     0.5
4     0.5
5     0.5
6     0.5
7     0.5
8     0.5
9     0.5
10    0.5
11    0.5
12    0.5
13    0.5
14    0.5
...
335    0
336    0
337    0
338    0
339    0
340    0
341    0
342    0
343    0
344    0
345    0
346    0
347    0
348    0
349    0
Name: Virulence, Length: 350


Take care when deciding what to do with NA/NaN entries. Your choice can have a significant
impact on your results!

In [57]:
print("Mean virulence across all treatments w/ dropped NaN:", experimentDF["Virulence"].dropna().mean())

print("Mean virulence across all treatments w/ filled NaN:", experimentDF.fillna(0.0)["Virulence"].mean())

Mean virulence across all treatments w/ dropped NaN: 0.75
Mean virulence across all treatments w/ filled NaN: 0.642857142857
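
The same effect can be reproduced on a tiny, made-up series, which makes it easier to check by hand. This is only a sketch: the values are hypothetical, and the code assumes Python 3 with NumPy, pandas, and SciPy installed.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical measurements with two missing entries
values = pd.Series([0.5, 0.6, 0.7, np.nan, np.nan])

# NaN propagates through stats.sem by default
sem_raw = stats.sem(values.values)

# Dropping NaN: the SEM is computed from the 3 observed values only
sem_dropped = stats.sem(values.dropna().values)

# Filling with 0.0: the zeros are treated as real measurements
sem_filled = stats.sem(values.fillna(0.0).values)

print("raw:", sem_raw)
print("dropped:", sem_dropped)
print("filled:", sem_filled)
```

Filling with zeros inflates the spread of the data, so the filled SEM comes out larger than the SEM of the observed values alone.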


### Mean

The mean performance of an experiment gives a good idea of how the experiment will
turn out *on average* under a given treatment.

Conveniently, DataFrames have all kinds of built-in functions to perform standard
operations on them en masse: `add()`, `sub()`, `mul()`, `div()`, `mean()`, `std()`, etc.
The full list is located at: http://pandas.sourceforge.net/generated/pandas.DataFrame.html

Thus, computing the mean of a DataFrame only takes one line of code:

In [58]:
from pandas import *

print("Mean Shannon Diversity w/ 0.8 Parasite Virulence =", experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"].mean())

Mean Shannon Diversity w/ 0.8 Parasite Virulence = 1.2691338188


### Variance

The variance in the performance provides a measurement of how consistent the results
of an experiment are. The lower the variance, the more consistent the results are, and
vice versa.

Computing the variance is also built in to pandas DataFrames:

In [59]:
from pandas import *

print("Variance in Shannon Diversity w/ 0.8 Parasite Virulence =", experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"].var())

Variance in Shannon Diversity w/ 0.8 Parasite Virulence = 0.611038433313
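
Note that pandas' `var()` returns the *sample* variance (it divides by n - 1), whereas NumPy's `var()` divides by n unless told otherwise. The following sketch checks this on a few made-up numbers; it assumes Python 3.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, chosen so the arithmetic is easy to follow
scores = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Sample variance by hand: squared deviations from the mean, divided by n - 1
mean = scores.mean()
manual_var = ((scores - mean) ** 2).sum() / (len(scores) - 1)

pandas_var = scores.var()                    # divides by n - 1 by default
numpy_var = np.var(scores.values)            # divides by n by default
numpy_sample_var = np.var(scores.values, ddof=1)

print(manual_var, pandas_var, numpy_var, numpy_sample_var)
```

If you mix pandas and NumPy in one analysis, passing `ddof=1` to NumPy keeps the two consistent.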


### Standard Error of the Mean (SEM)

Combined with the mean, the SEM enables you to establish a range around a mean that
the majority of any future replicate experiments will most likely fall within.

Older pandas releases don't have an SEM method built in (recent releases provide `sem()`),
but since DataFrame rows/columns behave like arrays, you can use any NumPy/SciPy method
you like on them.

In [60]:
from pandas import *
from scipy import stats

print("SEM of Shannon Diversity w/ 0.8 Parasite Virulence =", stats.sem(experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"]))

SEM of Shannon Diversity w/ 0.8 Parasite Virulence = 0.110547585529


A range of one SEM on either side of the mean will usually envelop about 68% of the
possible replicate means, and two SEMs on either side about 95%. The range of
mean ± 2 SEMs is called the "estimated 95% confidence interval." The confidence
interval is estimated because the exact width depends on how many replicates
you have; this approximation is good when you have more than 20 replicates.
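
To illustrate why the number of replicates matters, the sketch below computes a mean ± 2 SEM interval and compares the fixed multiplier of 2 with the multiplier from Student's t distribution for several sample sizes. The helper names `sem_interval` and `t_multiplier` and the sample values are made up for this illustration; the code assumes Python 3 with NumPy and SciPy.

```python
import numpy as np
from scipy import stats

def sem_interval(data, multiplier=2.0):
    """Return (low, high) for mean +/- multiplier * SEM."""
    data = np.asarray(data, dtype=float)
    mean = data.mean()
    sem = stats.sem(data)  # sample standard deviation / sqrt(n)
    return mean - multiplier * sem, mean + multiplier * sem

def t_multiplier(n, confidence=0.95):
    """Two-sided Student's t multiplier for n replicates."""
    return stats.t.ppf(0.5 + confidence / 2.0, df=n - 1)

# Hypothetical replicate measurements
sample = [1.2, 0.8, 1.5, 1.1, 0.9, 1.3, 1.0, 1.4, 0.7, 1.1]
low, high = sem_interval(sample)
print("Estimated 95% CI (mean +/- 2 SEM):", low, high)

# The t-based multiplier approaches 2 as the number of replicates grows
for n in (5, 10, 20, 50):
    print(n, "replicates -> multiplier", round(t_multiplier(n), 3))
```

The t-based multiplier is about 2.78 with 5 replicates and about 2.09 with 20, which is why the two-SEM rule of thumb is only trusted for larger numbers of replicates.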

### Mann-Whitney-Wilcoxon (MWW) RankSum test

The MWW RankSum test is a useful test to determine if two distributions are significantly
different or not. Unlike the t-test, the RankSum test does not assume that the data
are normally distributed, potentially providing a more accurate assessment of the data sets.

As an example, let's say we want to determine if the results of the two following
treatments significantly differ or not:

In [61]:
# select two treatment data sets from the parasite data
treatment1 = experimentDF[experimentDF["Virulence"] == 0.5]["ShannonDiversity"]
treatment2 = experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"]

print("Data set 1:\n", treatment1)
print("Data set 2:\n", treatment2)

Data set 1:
0     0.059262
1     1.093600
2     1.139390
3     0.547651
4     0.065928
5     1.344330
6     1.680480
7     0.000000
8     2.047680
9     0.000000
10    1.507140
11    0.000000
12    1.589810
13    1.144800
14    1.011190
15    0.000000
16    0.776665
17    0.001749
18    1.761200
19    0.021091
20    0.790915
21    0.000000
22    0.018867
23    0.994268
24    1.729620
25    0.967537
26    0.457318
27    0.992525
28    1.506640
29    0.697241
30    1.790580
31    1.787710
32    0.857742
33    0.000000
34    0.445267
35    0.045471
36    0.003490
37    0.000000
38    0.115830
39    0.980076
40    0.000000
41    0.820405
42    0.124755
43    0.719755
44    0.584252
45    1.937930
46    1.284150
47    1.651680
48    0.000000
49    0.000000
Name: ShannonDiversity
Data set 2:
150    1.433800
151    2.079700
152    0.892139
153    2.384740
154    0.006980
155    1.971760
156    0.000000
157    1.428470
158    1.715950
159    0.000000
160    0.421927
161    1.179920
162    0.93

A RankSum test will provide a P value indicating how likely it would be to see a
difference at least this large if the two distributions were the same.

In [62]:
from scipy import stats

z_stat, p_val = stats.ranksums(treatment1, treatment2)

print("MWW RankSum P for treatments 1 and 2 =", p_val)

MWW RankSum P for treatments 1 and 2 = 0.000983355902735


If P <= 0.05, we are highly confident that the distributions significantly differ, and
can claim that the treatments had a significant impact on the measured value.

If the treatments do *not* significantly differ, we could expect a result such as
the following:

In [63]:
treatment3 = experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"]
treatment4 = experimentDF[experimentDF["Virulence"] == 0.9]["ShannonDiversity"]

print("Data set 3:\n", treatment3)
print("Data set 4:\n", treatment4)

Data set 3:
150    1.433800
151    2.079700
152    0.892139
153    2.384740
154    0.006980
155    1.971760
156    0.000000
157    1.428470
158    1.715950
159    0.000000
160    0.421927
161    1.179920
162    0.932470
163    2.032520
164    0.960912
165    2.384150
166    1.879130
167    1.238890
168    1.584300
169    1.118490
170    2.022970
171    0.000000
172    2.138820
173    2.533390
174    1.212340
175    0.059135
176    1.578260
177    1.725210
178    0.293153
179    0.000000
180    0.000000
181    1.699600
182    2.178650
183    1.792580
184    1.920800
185    0.000000
186    1.583250
187    0.343235
188    1.980010
189    0.980876
190    1.089380
191    0.979254
192    1.190450
193    1.738880
194    1.154100
195    1.981610
196    2.077180
197    1.566410
198    0.000000
199    1.990900
Name: ShannonDiversity
Data set 4:
200    1.036930
201    0.938018
202    0.995956
203    1.006970
204    0.968258
205    0.000000
206    0.416046
207    1.570310
208    2.122400
209    2.

In [64]:
z_stat, p_val = stats.ranksums(treatment3, treatment4)

print("MWW RankSum P for treatments 3 and 4 =", p_val)

MWW RankSum P for treatments 3 and 4 = 0.994499571124


With P > 0.05, we must say that the distributions do not significantly differ.
Thus changing the parasite virulence between 0.8 and 0.9 does not result in a
significant change in Shannon Diversity.
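
For readers who want to try the test without the parasite data file, here is a small self-contained sketch on made-up, deliberately skewed data. The arrays are hypothetical and constructed so that one pair clearly differs and the other pair interleaves almost perfectly; the code assumes Python 3 with NumPy and SciPy.

```python
import numpy as np
from scipy import stats

# Hypothetical skewed (non-normal) data: squares of 0, 1, ..., 49
group_a = np.arange(50.0) ** 2

# Every value in group_b is larger than every value in group_a
group_b = group_a + 2500.0

# group_c interleaves with group_a, so the ranks are almost evenly mixed
group_c = (np.arange(50.0) + 0.5) ** 2

_, p_shifted = stats.ranksums(group_a, group_b)
_, p_interleaved = stats.ranksums(group_a, group_c)

print("P for a vs. b (shifted):", p_shifted)
print("P for a vs. c (interleaved):", p_interleaved)
```

Because the test only looks at ranks, the heavy skew of the values does not affect the result.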

### One-way analysis of variance (ANOVA)

If you need to compare more than two data sets at a time, an ANOVA is your best bet. For
example, we have the results from three experiments with overlapping 95% confidence
intervals, and we want to confirm that the results for all three experiments are not
significantly different.

In [65]:
treatment1 = experimentDF[experimentDF["Virulence"] == 0.7]["ShannonDiversity"]
treatment2 = experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"]
treatment3 = experimentDF[experimentDF["Virulence"] == 0.9]["ShannonDiversity"]

print("Data set 1:\n", treatment1)
print("Data set 2:\n", treatment2)
print("Data set 3:\n", treatment3)

Data set 1:
100    1.595440
101    1.419730
102    0.000000
103    0.000000
104    0.787591
105    2.234270
106    1.700440
107    0.954747
108    1.127320
109    1.761330
110    0.000000
111    0.374074
112    1.836250
113    1.583900
114    0.998377
115    0.341714
116    0.892717
117    2.142960
118    1.824870
119    0.999703
120    0.957757
121    1.152910
122    0.597295
123    1.959020
124    0.764003
125    0.614147
126    0.617618
127    2.235990
128    0.000000
129    2.484220
130    0.008294
131    1.003480
132    1.292820
133    2.185060
134    2.421770
135    0.713224
136    0.551367
137    0.006377
138    0.948393
139    2.257370
140    1.394850
141    0.547157
142    2.072580
143    1.323440
144    1.001340
145    1.042600
146    0.000000
147    1.139100
148    2.383260
149    0.056819
Name: ShannonDiversity
Data set 2:
150    1.433800
151    2.079700
152    0.892139
153    2.384740
154    0.006980
155    1.971760
156    0.000000
157    1.428470
158    1.715950
159    0.

In [66]:
from scipy import stats

f_val, p_val = stats.f_oneway(treatment1, treatment2, treatment3)

print("One-way ANOVA P =", p_val)

One-way ANOVA P = 0.381509481874


If P > 0.05, we cannot reject the hypothesis that the means of all three experiments are
equal; in other words, the results are not significantly different.
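
The following sketch runs the same test on small, made-up groups so the F statistic can be checked by hand. The group values are hypothetical; the code assumes Python 3 with SciPy.

```python
from scipy import stats

# Hypothetical groups with similar means (3.0, 4.0, 3.5) and equal spread
group1 = [1.0, 2.0, 3.0, 4.0, 5.0]
group2 = [2.0, 3.0, 4.0, 5.0, 6.0]
group3 = [1.5, 2.5, 3.5, 4.5, 5.5]

# A hypothetical group whose mean (13.0) is far from the others
group4 = [11.0, 12.0, 13.0, 14.0, 15.0]

f_similar, p_similar = stats.f_oneway(group1, group2, group3)
f_different, p_different = stats.f_oneway(group1, group2, group4)

print("Similar groups:   F =", f_similar, " P =", p_similar)
print("Different groups: F =", f_different, " P =", p_different)
```

For the similar groups, the between-group mean square is 1.25 and the within-group mean square is 2.5, giving F = 0.5. Note that a one-way ANOVA only indicates whether at least one group mean differs; it does not say which one.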

### Bootstrapped 95% confidence intervals

Oftentimes in wet lab research, it's difficult to perform the 20 replicate runs
recommended for computing reliable confidence intervals with SEM.

In this case, bootstrapping the confidence intervals is a much more accurate method of
determining the 95% confidence interval around your experiment's mean performance.

Unfortunately, SciPy doesn't have bootstrapping built into its standard library yet.
However, there is already a scikit out there for bootstrapping. Enter the following
command to install it:

> sudo pip install -e git+http://github.org/cgevans/scikits-bootstrap.git#egg=Package

You may need to install `pip` first, e.g., for Mac:

> sudo easy_install pip

Bootstrapping 95% confidence intervals around the mean with this function is simple:

In [77]:
# subset a list of 10 data points
treatment1 = experimentDF[experimentDF["Virulence"] == 0.8]["ShannonDiversity"][:10]

print("Small data set:\n", treatment1)

Small data set:
150    1.433800
151    2.079700
152    0.892139
153    2.384740
154    0.006980
155    1.971760
156    0.000000
157    1.428470
158    1.715950
159    0.000000
Name: ShannonDiversity


In [68]:
import scipy
import scikits.bootstrap as bootstrap

CIs = bootstrap.ci(data=treatment1, statfunction=scipy.mean)

print("Bootstrapped 95% confidence intervals\nLow:", CIs[0], "\nHigh:", CIs[1])

Bootstrapped 95% confidence intervals
Low: 0.650358748 
High: 1.727339024


Note that you can change the range of the confidence interval by setting the alpha:

In [69]:
# 80% confidence interval
CIs = bootstrap.ci(treatment1, scipy.mean, alpha=0.2)
print("Bootstrapped 80% confidence interval\nLow:", CIs[0], "\nHigh:", CIs[1])

Bootstrapped 80% confidence interval
Low: 0.827824024 
High: 1.5390109


You can also modify the size of the bootstrapped sample pool that the confidence intervals
are taken from:

In [76]:
# bootstrap 20,000 samples instead of only 10,000
CIs = bootstrap.ci(treatment1, scipy.mean, n_samples=20000)
print("Bootstrapped 95% confidence interval w/ 20,000 samples\nLow:", CIs[0], "\nHigh:", CIs[1])

Bootstrapped 95% confidence interval w/ 20,000 samples
Low: 0.649929948 
High: 1.72177


Generally, bootstrapped 95% confidence intervals are more accurate than 95% confidence
intervals estimated from the SEM, especially when you have few replicates.
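
To make the idea of bootstrapping concrete, here is a minimal sketch of a plain percentile bootstrap written with NumPy alone. It is an illustration under stated assumptions, not the method that scikits-bootstrap necessarily uses internally, so its intervals may differ somewhat from the ones shown above. The function name `percentile_bootstrap_ci` is made up; its parameter names simply mirror the `bootstrap.ci` calls above, and the data are the ten values of the small data set printed earlier. The code assumes Python 3.

```python
import numpy as np

def percentile_bootstrap_ci(data, statfunction=np.mean, alpha=0.05,
                            n_samples=10000, seed=0):
    """Percentile bootstrap confidence interval for statfunction(data)."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row is one resample, drawn with replacement, the same size as the data
    resamples = rng.choice(data, size=(n_samples, data.size), replace=True)
    # Evaluate the statistic on every resample
    boot_stats = np.apply_along_axis(statfunction, 1, resamples)
    # The interval is read off the percentiles of the bootstrapped statistics
    low, high = np.percentile(boot_stats,
                              [100 * alpha / 2.0, 100 * (1 - alpha / 2.0)])
    return low, high

# The ten ShannonDiversity values from the small data set above
small_sample = [1.433800, 2.079700, 0.892139, 2.384740, 0.006980,
                1.971760, 0.000000, 1.428470, 1.715950, 0.000000]

low95, high95 = percentile_bootstrap_ci(small_sample)
low80, high80 = percentile_bootstrap_ci(small_sample, alpha=0.2)

print("95% CI:", low95, high95)
print("80% CI:", low80, high80)
```

Fixing the seed makes the result reproducible; with a different seed the interval endpoints will shift slightly.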