# Summarising and Computing Descriptive Stats
pandas objects are equipped with a set of common mathematical and statistical methods. Most of these are reductions or summary statistics: methods that extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame.
- Compared with the similar methods on NumPy arrays, they have built-in handling for missing data

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame([[1.4, np.nan],[7.1, -4.5],
                    [np.nan, np.nan],[0.75,-1.3]],
                     index = ['a','b','c','d'],
                     columns=['one', 'two'])
df

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


In [3]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [4]:
# NA values are skipped; a row that is entirely NA sums to 0.0
# (pass min_count=1 to get NaN for such rows instead)
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [5]:
#Skipping NA values can be disabled with the skipna option
df.mean(axis=1,skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
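
The following is a minimal sketch of how the NA-handling options interact for `sum`, rebuilding the same `df` as above so that it runs on its own. The variable names `default_sum`, `strict_sum` and `no_skip_sum` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same data as the df defined earlier in this section
df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# Default (skipna=True): NA values are skipped, so the all-NA row "c" sums to 0.0
default_sum = df.sum(axis=1)

# min_count=1 requires at least one non-NA value per row, so row "c" becomes NaN
strict_sum = df.sum(axis=1, min_count=1)

# skipna=False propagates NA: any row containing an NA yields NaN
no_skip_sum = df.sum(axis=1, skipna=False)

print(pd.DataFrame({
    "default": default_sum,
    "min_count=1": strict_sum,
    "skipna=False": no_skip_sum,
}))
```

Which option fits depends on whether a missing observation should count as zero, as unknown, or as invalidating the whole row.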

#### describe returns several summary statistics in one call

In [6]:
df.describe()

            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
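
On non-numeric data, `describe` returns a different set of summary statistics. A small sketch, using a made-up Series of strings (the data is illustrative, not from the example above):

```python
import pandas as pd

# Hypothetical categorical-style data: 16 strings with 3 distinct values
obj = pd.Series(["a", "a", "b", "c"] * 4)

# For non-numeric data, describe reports count, unique, top and freq
summary = obj.describe()
print(summary)
```

Here `top` is the most frequent value and `freq` is how many times it occurs.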


## List of Descriptive and Summary Stat Methods

- count: number of non-NA values
- describe: set of summary statistics
- min, max: minimum and maximum values
- argmin, argmax: integer index positions at which the minimum and maximum values occur, respectively
- idxmin, idxmax: index labels at which the minimum and maximum values occur, respectively
- quantile: sample quantile, for a quantile level between 0 and 1
- sum: sum of values
- mean: mean of values
- median: arithmetic median (50% quantile) of values
- mad: mean absolute deviation from the mean value (deprecated in pandas 1.5 and removed in 2.0)
- prod: product of all values
- var: sample variance of values
- std: sample standard deviation of values
- skew: sample skewness of values
- cumsum: cumulative sum of values
- diff: first arithmetic difference (useful for time series)
- pct_change: percent changes (valuable for time series analysis)
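
A few of the methods listed above return something other than a single reduced value. The sketch below tries them on the small `df` from earlier (rebuilt here so the block is self-contained) and on a short made-up price Series `s`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# idxmax / idxmin return the index label of the extreme value in each column
max_labels = df.idxmax()
min_labels = df.idxmin()

# cumsum is an accumulation: NA positions stay NA, and the running total carries on after them
running = df.cumsum()

# diff and pct_change compare each value with the previous one (illustrative data)
s = pd.Series([100.0, 110.0, 99.0])
first_diff = s.diff()
pct = s.pct_change()

print(max_labels, min_labels, running, first_diff, pct, sep="\n\n")
```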


## Correlation & Covariance
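<b>Correlation:</b> a normalised measure of how strongly two variables move together linearly. The Pearson correlation coefficient, which .corr computes by default, ranges from -1 to 1.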

<b>Covariance:</b> In probability theory and statistics, covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond with the lesser values of the other, the covariance is negative.


Correlation and covariance are computed from pairs of arguments. This example considers some DataFrames of stock prices and volumes obtained from Yahoo! Finance using the add-on pandas-datareader package.

In [10]:
import pandas_datareader.data as web

In [12]:
all_data = {ticker: web.get_data_yahoo(ticker)
           for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

price = pd.DataFrame({ticker: data['Adj Close']
                     for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker: data['Volume']
                      for ticker, data in all_data.items()})

In [13]:
#compute percent changes, a time series operation
returns = price.pct_change()
returns.tail()

                AAPL       IBM      MSFT      GOOG
Date
2020-01-16  0.012526  0.009955  0.018323  0.008685
2020-01-17  0.011071  0.002392  0.005597  0.019763
2020-01-21 -0.006777  0.006218 -0.003591  0.002709
2020-01-22  0.003570  0.033915 -0.004805  0.001044
2020-01-23  0.004816 -0.007089  0.006156  0.000471


<b>.corr:</b> computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series

<b>.cov:</b> computes the covariance of the same aligned values

In [14]:
returns['MSFT'].corr(returns['IBM'])

0.4796624218901193

In [15]:
returns['MSFT'].cov(returns['IBM'])

9.06944987354848e-05
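
The two quantities are related: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this on synthetic data, since the Yahoo! Finance download above needs network access. The Series `x` and `y` and the seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, reproducible stand-ins for two return series
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=250))
y = 0.5 * x + pd.Series(rng.normal(size=250))

cov_xy = x.cov(y)
corr_xy = x.corr(y)

# Pearson correlation = covariance / (std_x * std_y); cov and std both use the sample (ddof=1) form
manual_corr = cov_xy / (x.std() * y.std())

print(corr_xy, manual_corr)
```

Because of this normalisation, correlation is unit-free and comparable across pairs, whereas covariance depends on the scale of the data.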

In [16]:
#to return a full correlation or covariance matrix, call corr or cov on the DataFrame
returns.corr()

          AAPL       IBM      MSFT      GOOG
AAPL  1.000000  0.398013  0.574101  0.524679
IBM   0.398013  1.000000  0.479662  0.410956
MSFT  0.574101  0.479662  1.000000  0.660715
GOOG  0.524679  0.410956  0.660715  1.000000


In [18]:
returns.cov()

          AAPL       IBM      MSFT      GOOG
AAPL  0.000242  0.000080  0.000131  0.000123
IBM   0.000080  0.000167  0.000091  0.000080
MSFT  0.000131  0.000091  0.000215  0.000146
GOOG  0.000123  0.000080  0.000146  0.000228


<b>corrwith method</b>

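DataFrame's corrwith method computes pairwise correlations between the columns (or rows) of a DataFrame and another Series or DataFrame. Passing a Series, as below, returns the correlation of each column with that Series. Passing axis='columns' computes the correlations row by row instead.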

In [19]:
returns.corrwith(returns.IBM)

AAPL    0.398013
IBM     1.000000
MSFT    0.479662
GOOG    0.410956
dtype: float64
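
When `corrwith` is given a DataFrame instead of a Series, it correlates columns with matching names. A rough sketch with synthetic data follows. The tickers `AAA`, `BBB`, `CCC` and the random values are made up, standing in for the `returns` and `volume` frames above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the returns and volume DataFrames
rng = np.random.default_rng(1)
idx = pd.date_range("2020-01-01", periods=100, freq="B")
cols = ["AAA", "BBB", "CCC"]
returns = pd.DataFrame(rng.normal(size=(100, 3)), index=idx, columns=cols)
volume = pd.DataFrame(rng.normal(size=(100, 3)), index=idx, columns=cols)

# Column-wise: correlation of returns["AAA"] with volume["AAA"], and so on
by_column = returns.corrwith(volume)

# Row-wise: one correlation per date, computed across the three tickers
by_row = returns.corrwith(volume, axis="columns")

print(by_column)
print(by_row.head())
```

In both cases the data points are aligned by label before the correlation is computed.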

## Unique Values, Value Counts

In [21]:
#To return an array of the unique values in a Series, use .unique()
obj = pd.Series(['a','a','d','c','f','c','g','f'])
obj

0    a
1    a
2    d
3    c
4    f
5    c
6    g
7    f
dtype: object

In [23]:
uniques = obj.unique()
uniques

array(['a', 'd', 'c', 'f', 'g'], dtype=object)

In [24]:
#To compute a Series containing value frequencies (sorted by count in descending order)
obj.value_counts()

c    2
a    2
f    2
d    1
g    1
dtype: int64
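
Two common follow-ups, sketched on the same `obj` Series: `unique` returns values in order of first appearance rather than sorted, and `value_counts` can report relative frequencies instead of raw counts. The names `sorted_uniques` and `shares` are illustrative.

```python
import pandas as pd

obj = pd.Series(['a', 'a', 'd', 'c', 'f', 'c', 'g', 'f'])

# unique() keeps first-appearance order; sort explicitly if an ordered result is needed
sorted_uniques = sorted(obj.unique())

# Raw counts, and the same counts as a fraction of all values
counts = obj.value_counts()
shares = obj.value_counts(normalize=True)

print(sorted_uniques)
print(shares)
```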