### DataFrame and Statistics

Pandas is well suited to data analysis because its DataFrame objects come with many statistical methods built in.

In [1]:
import pandas as pd
from io import StringIO

data = StringIO(
    '''year\tinches\tlocation
    2006\t633.5\tutah
    2007\t356\tutah
    2008\t654\tutah
    2009\t578\tutah
    2010\t430\tutah
    2011\t553\tutah
    2012\t329.5\tutah
    2013\t382.5\tutah
    2014\t357.5\tutah
    2015\t267.5\tutah''')
snow = pd.read_table(data)

snow

   year  inches  location
0  2006   633.5      utah
1  2007   356.0      utah
2  2008   654.0      utah
3  2009   578.0      utah
4  2010   430.0      utah
5  2011   553.0      utah
6  2012   329.5      utah
7  2013   382.5      utah
8  2014   357.5      utah
9  2015   267.5      utah


The `.describe()` method gives an overview of the statistics of the data:

In [4]:
# The non-numeric "location" column is skipped by default;
# pass include='all' to summarize every column.
snow.describe()

          year      inches
count     10.0        10.0
mean    2010.5      454.15
std    3.02765  138.357036
min     2006.0       267.5
25%    2008.25     356.375
50%     2010.5      406.25
75%    2012.75      571.75
max     2015.0       654.0


In [5]:
snow.describe(include='all')

           year      inches  location
count      10.0        10.0        10
unique      NaN         NaN         1
top         NaN         NaN      utah
freq        NaN         NaN        10
mean     2010.5      454.15       NaN
std     3.02765  138.357036       NaN
min      2006.0       267.5       NaN
25%     2008.25     356.375       NaN
50%      2010.5      406.25       NaN
75%     2012.75      571.75       NaN
max      2015.0       654.0       NaN
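The quartiles reported by `.describe()` are only a default. As a minimal sketch, assuming the same snowfall figures rebuilt as a literal DataFrame, the `percentiles=` argument selects which percentiles appear in the summary:

```python
import pandas as pd

# Same snowfall data as above, built directly instead of parsed from text.
snow = pd.DataFrame({
    "year": list(range(2006, 2016)),
    "inches": [633.5, 356, 654, 578, 430, 553, 329.5, 382.5, 357.5, 267.5],
    "location": ["utah"] * 10,
})

# Request the 10th, 50th and 90th percentiles instead of the quartiles.
summary = snow.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```

The non-numeric `location` column is still left out unless `include='all'` is passed as well.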


By default, the `.quantile()` method returns the 50% quantile (the median). Pass `q=` to request a different quantile, or a list of values to get several at once:

In [6]:
snow.quantile()

year      2010.50
inches     406.25
Name: 0.5, dtype: float64

In [7]:
snow.quantile(q=[.1, .9])

       year  inches
0.1  2006.9   323.3
0.9  2014.1  635.55
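The fractional results above come from linear interpolation between neighbouring sorted values, which is the default behaviour of `.quantile()`. The sketch below reproduces the 10% quantile of `inches` by hand; `linear_quantile` is a hypothetical helper written for this illustration, not part of pandas:

```python
import pandas as pd

inches = pd.Series([633.5, 356, 654, 578, 430, 553, 329.5, 382.5, 357.5, 267.5])

def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)        # fractional index into the sorted data
    lower = int(position)                    # index of the value just below
    upper = min(lower + 1, len(ordered) - 1) # index of the value just above
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

manual = linear_quantile(inches, 0.1)
builtin = inches.quantile(0.1)
print(manual, builtin)
```

With ten values, the 10% quantile sits 0.9 of the way from the smallest value (267.5) to the second smallest (329.5).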


The `.count()` method returns the number of non-missing (non-NA) cells in each column:

In [9]:
snow.count()

year        10
inches      10
location    10
dtype: int64
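Every column in `snow` is complete, so all the counts are 10. A small sketch with some values deliberately missing (the `partial` frame is made up for illustration) shows what `.count()` actually skips:

```python
import pandas as pd

# A made-up frame with one missing number and one missing string.
partial = pd.DataFrame({
    "year": [2006, 2007, 2008],
    "inches": [633.5, float("nan"), 654.0],
    "location": ["utah", None, "utah"],
})

counts = partial.count()            # non-missing cells per column
row_counts = partial.count(axis=1)  # non-missing cells per row
print(counts)
print(row_counts)
```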

The `.any()` method reports, for each column, whether at least one value is truthy (evaluates to `True` in Python):

In [10]:
snow.any()

year        True
inches      True
location    True
dtype: bool

These methods accept an `axis=` argument; with `axis=1` they work across each row instead of down each column:

In [11]:
snow.any(axis=1)

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
9    True
dtype: bool

The `.all()` method, on the other hand, returns `True` only if every value in the column (or row) is truthy:

In [12]:
snow.all()

year        True
inches      True
location    True
dtype: bool

In [13]:
snow.all(axis=1)

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
9    True
dtype: bool
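Because `snow` has no zeros or empty strings, both methods return `True` everywhere. The following sketch uses a made-up frame (`flags`) containing zeros, which Python treats as falsy, to show where `.any()` and `.all()` differ:

```python
import pandas as pd

# Zero is falsy in Python; any non-zero number is truthy.
flags = pd.DataFrame({
    "all_zero": [0, 0, 0],
    "some_zero": [0, 5, 7],
    "no_zero": [1, 2, 3],
})

any_by_column = flags.any()     # at least one truthy value per column
all_by_column = flags.all()     # every value truthy per column
all_by_row = flags.all(axis=1)  # every value truthy per row
print(any_by_column, all_by_column, all_by_row, sep="\n\n")
```

Every row contains the zero from `all_zero`, so `all_by_row` is `False` throughout.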

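The `.rank()` method replaces each value with its rank within its column, starting at 1 for the smallest value. Tied values share the average of their ranks by default, which is why every `location` entry below is 5.5: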

In [14]:
snow.rank()

   year  inches  location
0   1.0     9.0       5.5
1   2.0     3.0       5.5
2   3.0    10.0       5.5
3   4.0     8.0       5.5
4   5.0     6.0       5.5
5   6.0     7.0       5.5
6   7.0     2.0       5.5
7   8.0     5.0       5.5
8   9.0     4.0       5.5
9  10.0     1.0       5.5


We can pass one of several values to `method=` to control how ties are ranked:

* `'average'`: This is the default; tied values all receive the average of the ranks they span
* `'min'`: Tied values all receive the lowest rank in the group
* `'max'`: Tied values all receive the highest rank in the group
* `'first'`: Tied values receive distinct ranks in the order they appear in the data
* `'dense'`: Like `'min'`, but the rank increases by exactly 1 from one group of tied values to the next
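The differences between these options only show up when there are ties. As an illustrative sketch, the made-up Series `scores` below has one tie, and each column of `ranks` holds the result of one `method=` value:

```python
import pandas as pd

# Two values are tied at 20, so the methods disagree on positions 1 and 2.
scores = pd.Series([10, 20, 20, 30])

methods = ["average", "min", "max", "first", "dense"]
ranks = pd.DataFrame({m: scores.rank(method=m) for m in methods})
print(ranks)
```

Note how `'dense'` gives the final value rank 3, while the other methods give it rank 4.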

The `.clip()` method limits values to a given lower and/or upper bound; any value outside the range is replaced by the nearest bound:

In [15]:
snow.loc[:, 'inches'].clip(lower=400, upper=600)

0    600.0
1    400.0
2    600.0
3    578.0
4    430.0
5    553.0
6    400.0
7    400.0
8    400.0
9    400.0
Name: inches, dtype: float64
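Either bound can be left out. A minimal sketch, assuming the same `inches` values, that applies only a floor and confirms the original Series is not modified:

```python
import pandas as pd

inches = pd.Series(
    [633.5, 356, 654, 578, 430, 553, 329.5, 382.5, 357.5, 267.5],
    name="inches",
)

# Only a lower bound: values below 400 are raised, nothing is capped.
floor_only = inches.clip(lower=400)
print(floor_only)
```

`.clip()` returns a new object, so `inches` still holds its original minimum of 267.5 afterwards.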

The `.corr()` method returns the pairwise Pearson correlation coefficient between all numeric columns. It can also calculate the Kendall or Spearman correlation if you specify one via the `method=` parameter:

In [16]:
snow.corr()

             year     inches
year          1.0  -0.698064
inches  -0.698064        1.0


In [17]:
snow.corr(method='spearman')

             year     inches
year          1.0  -0.648485
inches  -0.648485        1.0
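The Spearman coefficient is simply the Pearson coefficient computed on ranks rather than on raw values. The sketch below checks this against the figures above. It selects the two numeric columns explicitly, since recent pandas releases may raise an error when `.corr()` meets a string column such as `location`:

```python
import pandas as pd

snow = pd.DataFrame({
    "year": list(range(2006, 2016)),
    "inches": [633.5, 356, 654, 578, 430, 553, 329.5, 382.5, 357.5, 267.5],
    "location": ["utah"] * 10,
})

numeric = snow[["year", "inches"]]  # leave out the string column

# Pearson correlation on the raw values.
pearson = numeric["year"].corr(numeric["inches"])

# Spearman correlation, computed as Pearson on the ranks.
spearman_by_hand = numeric["year"].rank().corr(numeric["inches"].rank())

print(round(pearson, 6), round(spearman_by_hand, 6))
```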


The `.sum()` method sums the values down each column (or across each row if you pass `axis=1`). String values are concatenated by default:

In [18]:
snow.sum()

year                                           20105
inches                                        4541.5
location    utahutahutahutahutahutahutahutahutahutah
dtype: object
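If the concatenated string is unwanted, `numeric_only=True` restricts the sum to numeric columns. A short sketch under the same data assumptions, also showing a row-wise sum:

```python
import pandas as pd

snow = pd.DataFrame({
    "year": list(range(2006, 2016)),
    "inches": [633.5, 356, 654, 578, 430, 553, 329.5, 382.5, 357.5, 267.5],
    "location": ["utah"] * 10,
})

# Column totals for numeric columns only; "location" is dropped.
totals = snow.sum(numeric_only=True)

# Row totals (year + inches per row); numerically meaningless here,
# but it shows how axis=1 changes the direction of the sum.
row_totals = snow.sum(axis=1, numeric_only=True)

print(totals)
print(row_totals.head())
```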

Similarly, the `.prod()` method returns the product of the columns/rows (string values are ignored in this case):

In [19]:
snow.prod()

year      3.325907e+18
inches    2.443332e+26
dtype: float64

The `.mean()`, `.var()`, `.mad()`, `.skew()`, and `.kurt()` methods return the mean, variance, mean absolute deviation, skew, and kurtosis, respectively:

In [20]:
snow.mean()

year      2010.50
inches     454.15
dtype: float64

In [21]:
snow.var()

year          9.166667
inches    19142.669444
dtype: float64

In [22]:
snow.mad()

year        2.50
inches    120.38
dtype: float64

In [23]:
snow.skew()

year      0.000000
inches    0.311866
dtype: float64

In [24]:
snow.kurt()

year     -1.200000
inches   -1.586098
dtype: float64
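Note that `.mad()` was deprecated in pandas 1.5 and removed in pandas 2.0, so the call above fails on current releases. The mean absolute deviation is straightforward to compute directly; this is a sketch of one equivalent expression, with the numeric columns selected explicitly so it behaves the same across versions:

```python
import pandas as pd

snow = pd.DataFrame({
    "year": list(range(2006, 2016)),
    "inches": [633.5, 356, 654, 578, 430, 553, 329.5, 382.5, 357.5, 267.5],
    "location": ["utah"] * 10,
})

numeric = snow[["year", "inches"]]

# Mean absolute deviation: average distance of each value from its column mean.
mean_abs_dev = (numeric - numeric.mean()).abs().mean()
print(mean_abs_dev)
```

The same column selection (or `numeric_only=True`) may also be needed for `.mean()`, `.var()`, `.skew()`, and `.kurt()` on recent pandas versions, which no longer silently skip string columns.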

The `.max()` and `.min()` methods return the maximum and minimum values, while `.idxmax()` and `.idxmin()` return the index labels of those values. Note that these last two methods fail if any column holds non-numeric values, so we select the numeric columns first:

In [25]:
snow.max()

year        2015
inches       654
location    utah
dtype: object

In [27]:
snow.loc[:, ['year', 'inches']].idxmax()

year      9
inches    2
dtype: int64

In [28]:
snow.min()

year         2006
inches      267.5
location     utah
dtype: object

In [29]:
snow.loc[:, ['year', 'inches']].idxmin()

year      0
inches    9
dtype: int64
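The labels returned by `.idxmax()` and `.idxmin()` are most useful for looking up the full row they point to. A minimal sketch, assuming the same data and using `.loc` for the lookup:

```python
import pandas as pd

snow = pd.DataFrame({
    "year": list(range(2006, 2016)),
    "inches": [633.5, 356, 654, 578, 430, 553, 329.5, 382.5, 357.5, 267.5],
    "location": ["utah"] * 10,
})

# Index labels of the largest and smallest snowfall, then the matching rows.
snowiest = snow.loc[snow["inches"].idxmax()]
driest = snow.loc[snow["inches"].idxmin()]

print(snowiest["year"], snowiest["inches"])
print(driest["year"], driest["inches"])
```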