## Summary Statistics

In [1]:
import pandas as pd
dogs = pd.read_csv('./datasets/dogs.csv')
print(dogs)

      name        breed  color  height_cm  weight_kg date_of_birth
0    Bella     Labrador  Brown         56         24    2013-07-11
1  Charlie       Poodle  Black         43         24    2016-09-16
2     Lucy    Chow Chow  Brown         46         24    2014-08-25
3   Cooper    Schnauzer   Gray         49         17    2011-12-11
4      Max     Labrador  Black         59         29    2017-01-20
5   Stella    Chihuahua    Tan         18          2    2015-04-20
6   Bernie  St. Bernard  White         77         74    2018-02-27


__Mean:__ `dogs['height_cm'].mean()`

The mean tells us where the _center_ of our data is.
The other summary statistics, `.median()`, `.mode()`, `.min()`, `.max()`, `.var()` and `.std()`, compute the median, mode, minimum, maximum, variance and standard deviation, respectively.
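As a quick illustration of the methods listed above, here is a minimal sketch. It rebuilds part of the dogs table inline (copied from the printed output) so that it runs without the CSV file; the variable name `heights` is only for this example.

```python
import pandas as pd

# Inline copy of three columns of the dogs table printed above,
# so the snippet does not depend on ./datasets/dogs.csv.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

heights = dogs["height_cm"]

print(heights.median())               # middle value of the sorted heights
print(heights.min(), heights.max())   # smallest and largest height
print(heights.var(), heights.std())   # sample variance and standard deviation
print(dogs["weight_kg"].mode())       # .mode() returns a Series, because ties are possible
```

Note that `.mode()` returns a Series rather than a single number, since a column can have more than one most-frequent value.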

In [5]:
print('Mean height is ' + str(dogs['height_cm'].mean()) + ' and ' + "the oldest dog's dob is " + str(dogs['date_of_birth'].min()))

Mean height is 49.714285714285715 and the oldest dog's dob is 2011-12-11


## The .agg() method

The aggregate method, `.agg()`, lets us compute __custom__ summary statistics. We create a function called __pct30__ that takes a DataFrame column and returns that column's thirtieth percentile.

In [8]:
def pct30(column):                      # Takes a column (Series) and returns its 30th percentile.
    return column.quantile(0.3)

print(dogs['weight_kg'].agg(pct30))           # Select the weight column of dogs and pass pct30 to .agg().

22.599999999999998


The output 22.6 (printed as 22.599999999999998 because of floating-point rounding) means that roughly 30% of the dogs weigh 22.6 kg or less.

`.agg()` can also be applied to more than one column at once.
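To see where the 30th-percentile value comes from, the sketch below reproduces the linear interpolation that `.quantile()` uses by default. The weights are copied from the dogs table; the names `position`, `lower`, `fraction` and `manual` are made up for this illustration.

```python
import pandas as pd

# weight_kg values copied from the dogs table above
weights = pd.Series([24, 24, 24, 17, 29, 2, 74], name="weight_kg")

sorted_w = weights.sort_values().to_list()   # [2, 17, 24, 24, 24, 29, 74]
position = 0.3 * (len(sorted_w) - 1)         # fractional position of the 30% point in the sorted data
lower = int(position)                        # index of the value just below that position
fraction = position - lower                  # how far we are towards the next value

# Interpolate linearly between the two neighbouring sorted values.
manual = sorted_w[lower] + fraction * (sorted_w[lower + 1] - sorted_w[lower])

print(manual)
print(weights.quantile(0.3))

# Share of dogs at or below that weight (2 of 7 dogs).
share = (weights <= manual).mean()
print(share)
```

With only seven rows the share of dogs at or below the interpolated value is 2/7, which is why the "30%" reading is approximate.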

In [9]:
print(dogs[['weight_kg', 'height_cm']].agg(pct30))

weight_kg    22.6
height_cm    45.4
dtype: float64


Here we additionally find that roughly 30% of the dogs are shorter than 45.4 cm.

We can also use agg to get multiple summary statistics at once.

In [12]:
def pct40(column):
    return column.quantile(0.4)

print(dogs['weight_kg'].agg([pct30,pct40]))             # Notice we used [] inside .agg to be able to compute both functions!

pct30    22.6
pct40    24.0
Name: weight_kg, dtype: float64


To print the dogs whose weight is below the 30th percentile, we can write

In [20]:
print(dogs[dogs['weight_kg'] < dogs['weight_kg'].agg(pct30)])      # filtering and subsetting dogs df with .agg(pct30). 

     name      breed color  height_cm  weight_kg date_of_birth
3  Cooper  Schnauzer  Gray         49         17    2011-12-11
5  Stella  Chihuahua   Tan         18          2    2015-04-20


### Cumulative sum

In [26]:
dogs['weight_kg'].cumsum()     # Running total: each row holds the sum of weight_kg from the first row through that row.

0     24
1     48
2     72
3     89
4    118
5    120
6    194
Name: weight_kg, dtype: int64

Other cumulative statistics: `.cummax()`, `.cummin()`, `.cumprod()`.
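A short sketch of these other cumulative methods, using the same weights as above. The `running` DataFrame and the small `[1, 2, 3, 4]` Series are assumptions made only to keep the example readable.

```python
import pandas as pd

# weight_kg values copied from the dogs table above
weights = pd.Series([24, 24, 24, 17, 29, 2, 74], name="weight_kg")

# Put the running maximum and running minimum next to the original values.
running = pd.DataFrame({
    "weight_kg": weights,
    "cummax": weights.cummax(),   # largest value seen so far
    "cummin": weights.cummin(),   # smallest value seen so far
})
print(running)

# cumprod multiplies as it goes; a tiny Series keeps the numbers readable.
factorials = pd.Series([1, 2, 3, 4]).cumprod()
print(factorials.to_list())
```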

## EXERCISE

In [2]:
sales = pd.read_csv('./datasets/sales_subset.csv')

#pd.set_option('display.expand_frame_repr', False)   # Make the display wider. Otherwise, last few columns are displayed in next line.

print(sales.head())

   Unnamed: 0  store type  department        date  weekly_sales  is_holiday  \
0           0      1    A           1  2010-02-05      24924.50       False   
1           1      1    A           1  2010-03-05      21827.90       False   
2           2      1    A           1  2010-04-02      57258.43       False   
3           3      1    A           1  2010-05-07      17413.94       False   
4           4      1    A           1  2010-06-04      17558.09       False   

   temperature_c  fuel_price_usd_per_l  unemployment  
0       5.727778              0.679451         8.106  
1       8.055556              0.693452         8.106  
2      16.816667              0.718284         7.808  
3      22.527778              0.748928         7.808  
4      27.050000              0.714586         7.808  


In [50]:
print("Mean of the weekly sales is " + str(sales['weekly_sales'].mean()))

print("Median of the weekly sales is " + str(sales['weekly_sales'].median()))

print(sales['date'].max())              # Prints out max of the date column

print(sales['date'].min())              # Prints out min of the date column

Mean of the weekly sales is 23843.95014850566
Median of the weekly sales is 12049.064999999999
2012-10-26
2010-02-05


We now compute the _interquartile range_ (IQR), which is the 75th percentile minus the 25th percentile. It is an alternative to the standard deviation that is helpful when the data contains outliers.
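To illustrate why the IQR copes better with outliers, here is a small sketch on the dog weights from earlier, where one dog (74 kg) is far heavier than the rest. The 70 kg cutoff and the helper name `spread` are arbitrary choices for this example, not part of the exercise.

```python
import pandas as pd

# weight_kg values copied from the dogs table; 74 kg is far above the others.
weights = pd.Series([24, 24, 24, 17, 29, 2, 74])
without_outlier = weights[weights < 70]   # arbitrary cutoff, chosen only for this illustration

def spread(column):
    """Return (standard deviation, interquartile range) of a column."""
    return column.std(), column.quantile(0.75) - column.quantile(0.25)

std_all, iqr_all = spread(weights)
std_trimmed, iqr_trimmed = spread(without_outlier)

print(f"with outlier:    std={std_all:.2f}, IQR={iqr_all:.2f}")
print(f"without outlier: std={std_trimmed:.2f}, IQR={iqr_trimmed:.2f}")
```

Including the heaviest dog more than doubles the standard deviation, while the IQR changes only slightly.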

In [64]:
import numpy as np
# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

print(sales[["temperature_c", 'fuel_price_usd_per_l', 'unemployment']].agg([iqr, np.median]))

        temperature_c  fuel_price_usd_per_l  unemployment
iqr         16.583333              0.073176         0.565
median      16.966667              0.743381         8.099


  print(sales[["temperature_c", 'fuel_price_usd_per_l', 'unemployment']].agg([iqr, np.median]))


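The _warning_ that pandas emits for the cell above (the result is still printed, so it is not a fatal error) can be avoided by passing the string "median" instead of `np.median`: `.agg([iqr, "median"])`

Alternatively, wrap the call in a lambda: `.agg([iqr, lambda x: np.median(x)])`. In that case the row is labelled `<lambda>` rather than `median`.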
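A minimal sketch of the string-based variant follows. It uses only the five `temperature_c` and `unemployment` values shown in the `sales.head()` output, so the numbers differ from the full-dataset results above; the names `sample` and `summary` are assumptions for this example.

```python
import pandas as pd

def iqr(column):
    # 75th percentile minus 25th percentile
    return column.quantile(0.75) - column.quantile(0.25)

# The five rows printed by sales.head() above (two columns only),
# so the snippet runs without sales_subset.csv.
sample = pd.DataFrame({
    "temperature_c": [5.727778, 8.055556, 16.816667, 22.527778, 27.050000],
    "unemployment": [8.106, 8.106, 7.808, 7.808, 7.808],
})

# Passing the string "median" lets pandas use its own median implementation.
summary = sample.agg([iqr, "median"])
print(summary)
```

The result has one row per statistic (`iqr`, `median`) and one column per input column, matching the layout of the output shown earlier.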

In [73]:
sales_1_1 = sales[(sales['store'] == 1) & (sales['department'] == 1)]

sales_1_1 = sales_1_1.sort_values('date')

# Get the cumulative sum of weekly_sales, add as cum_weekly_sales col
sales_1_1['cum_weekly_sales'] = sales_1_1['weekly_sales'].cumsum()

# Get the cumulative max of weekly_sales, add as cum_max_sales col
sales_1_1['cum_max_sales'] = sales_1_1['weekly_sales'].cummax()

# See the columns you calculated
print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])

          date  weekly_sales  cum_weekly_sales  cum_max_sales
0   2010-02-05      24924.50          24924.50       24924.50
1   2010-03-05      21827.90          46752.40       24924.50
2   2010-04-02      57258.43         104010.83       57258.43
3   2010-05-07      17413.94         121424.77       57258.43
4   2010-06-04      17558.09         138982.86       57258.43
5   2010-07-02      16333.14         155316.00       57258.43
6   2010-08-06      17508.41         172824.41       57258.43
7   2010-09-03      16241.78         189066.19       57258.43
8   2010-10-01      20094.19         209160.38       57258.43
9   2010-11-05      34238.88         243399.26       57258.43
10  2010-12-03      22517.56         265916.82       57258.43
11  2011-01-07      15984.24         281901.06       57258.43
