# Pandas & Statistical Tests

Pandas is *the* module for data analysis in Python, and aims to be similar in many ways to *R*.

- Check the pandas homepage: http://pandas.pydata.org
- The pandas 10-minute tutorial: http://pandas.pydata.org/pandas-docs/stable/10min.html
- A 10 minute video introduction: https://www.youtube.com/watch?v=DUCQ_HZamhs

## All earthquake data

We'll continue with the earthquake datasets from before, and look at **all-quakes-2.txt**, which lists every earthquake since 1900 with a scale above 5.5.

Pandas works with **DataFrames**. A DataFrame is basically a "table of values" with column names and row numbers.

##### Importing data

Pandas makes it very easy to read a **comma separated values (.csv)** file.
A **.csv** file is simply a text file in which each line is a row of data, and the columns are separated by commas.
The first line contains the column titles.

##### Hints and tips
- **pd.read_csv(**filename**)** reads a csv file, and returns a DataFrame
- **df.head()** (on a DataFrame called "df") returns the first 5 rows
- **df.tail()** (on a DataFrame called "df") returns the last 5 rows
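
To try these functions without the earthquake file, here is a minimal sketch that reads a tiny CSV from an in-memory string. The rows in `csv_text` are made up for illustration (they are not taken from **all-quakes-2.txt**), and the code assumes Python 3.

```python
import io
import pandas as pd

# Hypothetical miniature CSV: the first line holds the column titles
csv_text = """year,country,scale
1918,Indonesia,6.6
1918,Japan,7.1
1919,Chile,6.9
2007,Vanuatu,6.3
2007,Micronesia,7.0
2007,Indonesia,6.4
"""

# read_csv accepts a filename or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.head(2))   # first 2 rows
print(toy.tail(2))   # last 2 rows
print(toy.shape)     # (number of rows, number of columns)
```

Passing a number to **head()** or **tail()** changes how many rows are returned; without it, the default is 5.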

In [16]:
import pandas as pd
data = pd.read_csv("all-quakes-2.txt")
data.tail()

Unnamed: 0,year,country,scale
9769,2007,Indonesia,6.4
9770,2007,Indonesia,6.1
9771,2007,Vanuatu,6.1
9772,2007,Vanuatu,6.3
9773,2007,Micronesia,7.0


## A statistical test

We can use this data to do a simple statistical test.

Let's ask the question: "Is the intensity (scale) of earthquakes in Indonesia statistically significantly different from that of the rest of the world?"

To do this, we will first create two new DataFrames: one with all the data from Indonesia, and one with the data from all other countries. We will create these using "boolean indexing": a column of True/False values decides which rows to keep.
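
Boolean indexing is easier to follow on a table small enough to read in full. The sketch below uses a made-up four-row DataFrame (hypothetical values, Python 3 assumed) and the `~` operator, which is an alternative way of inverting a boolean mask.

```python
import pandas as pd

# Hypothetical toy data, small enough to inspect by eye
toy = pd.DataFrame({
    "country": ["Indonesia", "Japan", "Indonesia", "Chile"],
    "scale": [6.6, 7.1, 7.5, 6.9],
})

# Comparing a column to a value gives one True/False per row
mask = toy["country"] == "Indonesia"
print(mask.tolist())

selected = toy[mask]    # keeps the rows where mask is True
others = toy[~mask]     # ~ inverts the mask, keeping the remaining rows

print(len(selected), len(others))
```
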


In [17]:
# Create a boolean Series marking which rows have 'country' == 'Indonesia'
indo = data['country'] == "Indonesia"
print(indo)

# Use this Series to "filter" the rows, creating a new DataFrame that only contains rows from Indonesia
indo_data = data[indo]
print(indo_data.head())
print(indo_data.tail())

# Do the opposite to get a DataFrame of all other countries
not_indo = data['country'] != "Indonesia"
not_indo_data = data[not_indo]

# Check that the two DataFrames have different sizes
# (.size counts cells, i.e. rows times columns)
print(indo_data.size)
print(not_indo_data.size)


0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20       True
21      False
22      False
23      False
24      False
25       True
26       True
27      False
28      False
29      False
        ...  
9744    False
9745    False
9746    False
9747    False
9748    False
9749    False
9750    False
9751    False
9752    False
9753    False
9754     True
9755    False
9756    False
9757    False
9758    False
9759    False
9760    False
9761    False
9762     True
9763    False
9764    False
9765    False
9766    False
9767     True
9768     True
9769     True
9770     True
9771    False
9772    False
9773    False
Name: country, dtype: bool
    year    country  scale
20  1918  Indonesia    6.6
25  1918  Indonesia    7.5
26  1918  Indonesia    7.1
38  1919  I

## scipy.stats

The Scientific Python **scipy** library has many statistical tests in **scipy.stats**. A full list can be found here:
http://docs.scipy.org/doc/scipy/reference/stats.html#statistical-functions.

We will be using the (probably not completely appropriate) t-test for two independent samples, **ttest_ind**.
Note that **ttest_ind** returns a "Ttest_indResult" object. What exactly this is does not matter much; just remember that we can obtain the p-value by asking for **result.pvalue**.
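
Since the standard t-test may not be the best fit here, it can be instructive to compare it with alternatives. The sketch below is only an illustration, not the analysis used in this tutorial: it runs on synthetic data (the group sizes, means and spreads are arbitrary assumptions) and compares the default **ttest_ind** with Welch's variant (`equal_var=False`, which does not assume equal variances) and with the rank-based **mannwhitneyu** test.

```python
import numpy as np
from scipy import stats

# Synthetic samples with made-up parameters: two groups of unequal
# size and unequal spread
rng = np.random.default_rng(0)
group_a = rng.normal(loc=6.5, scale=0.5, size=200)
group_b = rng.normal(loc=6.3, scale=0.8, size=2000)

# Standard t-test (assumes both groups have the same variance)
student = stats.ttest_ind(group_a, group_b)

# Welch's t-test (drops the equal-variance assumption)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney U test (rank-based, does not assume normality)
mwu = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print("Student t-test p-value:", student.pvalue)
print("Welch t-test p-value:  ", welch.pvalue)
print("Mann-Whitney U p-value:", mwu.pvalue)
```

All three result objects expose the p-value as **.pvalue**, so swapping one test for another requires changing only the function call.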


In [18]:
# Get the columns holding the intensity of earthquakes in, and not in, Indonesia from the respective DataFrames
indo_scale = indo_data['scale']
not_indo_scale = not_indo_data['scale']

# Import the stats module
from scipy import stats

# Run the t-test on the two samples and print the p-value
result = stats.ttest_ind(indo_scale, not_indo_scale)
print("The result of the T-test:", result.pvalue)


The result of the T-test: 4.13978701354e-05


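## That's all

Well, the p-value is far below 0.05, so it appears that the intensity of earthquakes in Indonesia does indeed differ from that of the rest of the world. Note that the p-value alone does not tell us in which direction; comparing the mean scale of the two groups would show that. That's nice to know!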
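
A two-sided t-test only reports whether two groups differ, so the direction of the difference has to be read from the group means. Here is a minimal sketch of that check on a made-up five-row DataFrame (hypothetical values, not the real earthquake data); the same two `.mean()` calls would apply to the Indonesia and rest-of-world columns.

```python
import pandas as pd

# Hypothetical toy data standing in for the earthquake table
toy = pd.DataFrame({
    "country": ["Indonesia", "Japan", "Indonesia", "Chile", "Japan"],
    "scale": [6.6, 7.1, 7.5, 6.9, 6.0],
})

is_indo = toy["country"] == "Indonesia"

# Mean scale inside and outside the selected country
mean_indo = toy.loc[is_indo, "scale"].mean()
mean_rest = toy.loc[~is_indo, "scale"].mean()

print("Mean scale, Indonesia:", mean_indo)
print("Mean scale, elsewhere:", mean_rest)
```
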