### Summary Statistics
#### Descriptive Statistics
There are two categories of descriptive statistics :-
- those that describe the values of observations in a variable, for example `sum` / `median` / `mean` / `max`
- those that describe how a variable's values are spread or distributed, for example `standard deviation` / `variance` / `count` / `quartiles`

These statistics can be used for :-
- detecting outliers
- planning data prep for machine learning
- selecting features for use in machine learning
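
As an illustration of the first use, the sketch below flags outliers with the common 1.5 × IQR rule of thumb. The `hp` values are hypothetical and chosen only for the example; the rule itself is one convention among several, not the only way to define an outlier.

```python
import pandas as pd

# Hypothetical horsepower readings; 335 is deliberately far from the rest.
hp = pd.Series([110, 110, 93, 110, 175, 105, 245, 62, 95, 335], name="hp")

# Quartiles and the interquartile range (IQR).
q1 = hp.quantile(0.25)
q3 = hp.quantile(0.75)
iqr = q3 - q1

# Rule of thumb: flag values more than 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = hp[(hp < lower) | (hp > upper)]

print(outliers)  # only the 335 reading falls outside the fences
```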

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

import scipy
from scipy import stats

In [2]:
cars = pd.read_csv("../Data/mtcars.csv")
cars.head()

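|   | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|-------|-----|-----|------|----|------|----|------|----|----|------|------|
| 0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| 3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |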


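First, `sum` the values for each column. Note that the text column `model` is concatenated rather than added, which is why the result has dtype `object` :-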

In [3]:
cars.sum()

model    Mazda RX4Mazda RX4 WagDatsun 710Hornet 4 Drive...
mpg                                                  642.9
cyl                                                    198
disp                                                7383.1
hp                                                    4694
drat                                                115.09
wt                                                 102.952
qsec                                                571.16
vs                                                      14
am                                                      13
gear                                                   118
carb                                                    90
dtype: object
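
If the concatenated text column is unwanted, one option is the `numeric_only` parameter of `DataFrame.sum`. The sketch below uses a small hypothetical frame (`df` and its values are made up for illustration) rather than the full `cars` data.

```python
import pandas as pd

# Hypothetical three-row frame mixing a text column with numeric ones.
df = pd.DataFrame({
    "model": ["A", "B", "C"],
    "mpg": [21.0, 22.8, 18.7],
    "cyl": [6, 4, 8],
})

# numeric_only=True leaves the text column out of the result
# instead of concatenating its strings.
totals = df.sum(numeric_only=True)

print(totals)
```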

To `sum` the values across each _row_ instead, pass `axis=1`. Here we first _select_ the columns that hold numeric values :-

In [9]:
cars_subset = cars[['mpg','cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb']]
cars_subset.sum(axis=1)

0     328.980
1     329.795
2     259.580
3     426.135
4     590.310
5     385.540
6     656.920
7     270.980
8     299.570
9     350.460
10    349.660
11    510.740
12    511.500
13    509.850
14    728.560
15    726.644
16    725.695
17    213.850
18    195.165
19    206.955
20    273.775
21    519.650
22    506.085
23    646.280
24    631.175
25    208.215
26    272.570
27    273.683
28    670.690
29    379.590
30    694.710
31    288.890
dtype: float64
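
Listing every numeric column by hand works, but it has to be updated whenever the columns change. A possible alternative is `select_dtypes`, sketched here on a hypothetical two-row frame (the frame and its values are assumptions for the example).

```python
import pandas as pd

# Hypothetical frame with one text column and two numeric columns.
df = pd.DataFrame({
    "model": ["A", "B"],
    "mpg": [21.0, 22.8],
    "cyl": [6, 4],
})

# Keep only the columns with a numeric dtype, then sum across each row.
numeric = df.select_dtypes(include="number")
row_totals = numeric.sum(axis=1)

print(list(numeric.columns))
print(row_totals)
```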

Calculate the `median` value :-

In [11]:
cars_subset.median()

mpg      19.200
cyl       6.000
disp    196.300
hp      123.000
drat      3.695
wt        3.325
qsec     17.710
vs        0.000
am        0.000
gear      4.000
carb      2.000
dtype: float64

Calculate the `mean` value :-

In [12]:
cars_subset.mean()

mpg      20.090625
cyl       6.187500
disp    230.721875
hp      146.687500
drat      3.596563
wt        3.217250
qsec     17.848750
vs        0.437500
am        0.406250
gear      3.687500
carb      2.812500
dtype: float64

And finally - calculate the `max` values :-

In [13]:
cars_subset.max()

mpg      33.900
cyl       8.000
disp    472.000
hp      335.000
drat      4.930
wt        5.424
qsec     22.900
vs        1.000
am        1.000
gear      5.000
carb      8.000
dtype: float64

To find the index of the row that holds the _max_ value, use `idxmax` :-

In [16]:
mpg = cars_subset.mpg
mpg.idxmax()

19
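
`idxmax` returns an index label, not the row itself. To see the whole row, the label can be passed to `.loc`. The sketch below shows this on a hypothetical three-row frame; the model names and values are placeholders.

```python
import pandas as pd

# Hypothetical frame; row 1 has the largest mpg.
df = pd.DataFrame({
    "model": ["A", "B", "C"],
    "mpg": [21.0, 33.9, 18.7],
})

# Index label of the largest mpg value.
best = df["mpg"].idxmax()

# Look up the full row for that label.
row = df.loc[best]

print(best)
print(row["model"])
```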

#### Looking at summary statistics that describe variable distribution
Standard deviation

In [17]:
cars_subset.std()

mpg       6.026948
cyl       1.785922
disp    123.938694
hp       68.562868
drat      0.534679
wt        0.978457
qsec      1.786943
vs        0.504016
am        0.498991
gear      0.737804
carb      1.615200
dtype: float64
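
One detail worth knowing when comparing results across libraries: pandas computes the _sample_ standard deviation by default (`ddof=1`), while NumPy's `np.std` defaults to the _population_ form (`ddof=0`). A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up values with mean 5 and population variance 4.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()               # pandas default, ddof=1
population_std = values.std(ddof=0)     # divide by n instead of n - 1
numpy_std = np.std(values.to_numpy())   # NumPy default, ddof=0

print(sample_std, population_std, numpy_std)

# The variance is the square of the standard deviation (same ddof).
print(values.var(), sample_std ** 2)
```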

Variance

In [18]:
cars_subset.var()

mpg        36.324103
cyl         3.189516
disp    15360.799829
hp       4700.866935
drat        0.285881
wt          0.957379
qsec        3.193166
vs          0.254032
am          0.248992
gear        0.544355
carb        2.608871
dtype: float64
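
Quartiles

Quartiles, listed earlier among the spread statistics, can be obtained with `quantile`, and `describe` bundles them with several other summary statistics. The sketch below uses only the five `mpg` values shown in the `cars.head()` output, so its results do not match the full dataset.

```python
import pandas as pd

# The five mpg values from the cars.head() output.
mpg = pd.Series([21.0, 21.0, 22.8, 21.4, 18.7], name="mpg")

# First quartile, median and third quartile in one call.
quartiles = mpg.quantile([0.25, 0.5, 0.75])

# count, mean, std, min, quartiles and max together.
summary = mpg.describe()

print(quartiles)
print(summary)
```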

Count - the number of occurrences of each _unique_ value in a variable

In [20]:
gear = cars_subset.gear
gear.value_counts()

3    15
4    12
5     5
Name: gear, dtype: int64
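
Three related methods are easy to confuse: `value_counts` tallies how often each distinct value occurs, `nunique` gives the number of distinct values, and `count` gives the number of non-missing observations. A small sketch on hypothetical gear values:

```python
import pandas as pd

# Hypothetical gear values for six cars.
gear = pd.Series([4, 4, 4, 3, 3, 5], name="gear")

counts = gear.value_counts()  # occurrences of each distinct value
n_unique = gear.nunique()     # number of distinct values
n_obs = gear.count()          # number of non-missing observations

print(counts)
print(n_unique, n_obs)
```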