# 5.3 Summarizing and Computing Descriptive Statistics

1. [General Info](#general)
1. [Correlation and Covariance](#correlation)
1. [Unique Values, Value Counts, and Membership](#unique)

<a name="general"></a>
# General Info

pandas objects are equipped with many common mathematical and statistical methods.  

Generally, these methods are *reductions* or *summary statistics*: they return a single value (e.g. the sum or mean) from a Series, or a Series of values computed across the rows or columns of a DataFrame.  
 
Missing values are handled automatically, but the details depend on the function:
1. Sum
    - If an entire row (or column) is NA, the sum is 0
    - If at least one value is NA, the result depends on `skipna`
        - `True` (default) ignores NAs and sums the remaining values
        - `False` returns NA for the result
1. Mean
    - If an entire row (or column) is NA, the mean is NA
    - If `skipna` is `True` (default), NAs are excluded entirely: they are not counted as 0 and they do not add to the denominator (the mean of [1.4, NA] is 1.4, not 0.7)
    - If `skipna` is `False`, any NA in the row (or column) makes the mean NA
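
The rules above can be checked on two tiny Series. This is a minimal sketch; the variable names are made up for illustration, and `min_count` is an optional argument of `sum` that the notes above do not cover.

```python
import numpy as np
import pandas as pd

all_na = pd.Series([np.nan, np.nan])   # illustrative data, not from the notes
mixed = pd.Series([1.4, np.nan])

# Sum of an all-NA Series is 0 by default
sum_default = all_na.sum()
# min_count=1 requires at least one valid value, otherwise the result is NaN
sum_min_count = all_na.sum(min_count=1)

# Mean ignores the NA entirely (numerator and denominator)
mean_skip = mixed.mean()
# With skipna=False, a single NA makes the result NaN
mean_no_skip = mixed.mean(skipna=False)

print(sum_default, sum_min_count, mean_skip, mean_no_skip)
```

Passing `min_count=1` is one way to get NA instead of 0 when a row or column has no valid values at all.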

<img src="./myImages/table5.7_reductionMethods.png" width=600>

In [46]:
import numpy as np
import pandas as pd

# Make a DataFrame
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])
df

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3


In [47]:
# Sum the columns - NA values are ignored
df.sum()

one    9.25
two   -5.80
dtype: float64

In [48]:
# Sum the columns - NA values included
df.sum(axis="index", skipna=False)

one   NaN
two   NaN
dtype: float64

In [49]:
# Sum the rows
df.sum(axis="columns")

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [50]:
# Sum the rows - NA values included
df.sum(axis="columns", skipna=False)

a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64

In [51]:
# Mean - default is to ignore NAs
df.mean(axis="columns", skipna=True)

a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64

In [52]:
# Mean - with skipna=False, any row containing an NA returns NaN
df.mean(axis="columns", skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

These methods come in several flavors. Some, like `idxmin` and `idxmax`, return index labels; others, like `cumsum`, return an object the same size as the input holding cumulative results. Still others, like `describe` (similar to R's `summary`), return several summary statistics at once.

In [53]:
# Get row index of max value in each column
df.idxmax()

one    b
two    d
dtype: object

In [54]:
# Get column index of max value in each row
# Row "c" is all-NA: recent pandas versions emit a FutureWarning here, and later versions are expected to raise instead
df.idxmax(axis="columns")

a    one
b    one
c    NaN
d    one
dtype: object
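
One possible way to keep a row-wise `idxmax` working when some rows are entirely NA is to drop those rows first and then restore the original index. This is a sketch, assuming the same `df` as above; `row_max` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

# Drop rows with no valid values, take idxmax per row,
# then reindex so the dropped row comes back as NaN
row_max = df.dropna(how="all").idxmax(axis="columns").reindex(df.index)
print(row_max)
```

Rows that contain only some NAs (like `a`) are still handled by the default `skipna=True`.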

In [55]:
# Cumulative sum
df.cumsum()

    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8


In [56]:
# Describe distribution
df.describe()

            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
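
Each row of the `describe` output corresponds to an individual reduction. The sketch below rebuilds the `one` column of that output by hand, assuming the same `df` as above; `manual` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

col = df["one"]
# Every reduction below skips NAs by default, just like describe
manual = pd.Series({
    "count": col.count(),
    "mean": col.mean(),
    "std": col.std(),            # sample standard deviation (ddof=1)
    "min": col.min(),
    "25%": col.quantile(0.25),
    "50%": col.quantile(0.50),
    "75%": col.quantile(0.75),
    "max": col.max(),
})
print(manual)
```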


In [57]:
# Describe has different summary stats for non-numeric values!
obj = pd.Series(["a", "a", "b", "c"] * 4)
print(obj)
obj.describe()

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object


count     16
unique     3
top        a
freq       8
dtype: object

<img src="./myImages/table5.8_descriptiveStats.png" width=600>

<a name="correlation"></a>
# Correlation and Covariance

1. Series
    - `corr` computes the correlation of the overlapping, non-NA values of two Series after aligning them by index
    - `cov` does the same, but computes the covariance
1. DataFrame
    - `corr` and `cov` return the full correlation or covariance **matrix** when called on a DataFrame
    - `corrwith` computes the correlation of each column (or row) with another Series or DataFrame
        - If you pass a Series, every column is correlated with that Series
        - If you pass a DataFrame, correlations are computed between columns with matching names (or rows with matching labels when `axis="columns"`)
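
To make the relationship between the two measures concrete: the Pearson correlation is the covariance divided by the product of the two standard deviations, all computed on the overlapping non-NA pairs. The sketch below checks this on randomly generated data (not the stock data used in these notes); `x`, `y`, and `manual_corr` are illustrative names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=50))
y = 0.5 * x + pd.Series(rng.normal(size=50))
y.iloc[::10] = np.nan            # inject some missing values

# Keep only the pairs where both Series have a value
valid = x.notna() & y.notna()
xv, yv = x[valid], y[valid]

# Correlation = covariance / (std_x * std_y) on the overlapping pairs
manual_corr = xv.cov(yv) / (xv.std() * yv.std())
print(manual_corr, x.corr(y))
```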



In [58]:
### Read in stock/volume data
price = pd.read_pickle("../../examples/yahoo_price.pkl")
volume = pd.read_pickle("../../examples/yahoo_volume.pkl")
print(f"Snippet of stock prices:\n{price.head()}\n")
print(f"Snippet of stock volumes:\n{volume.head()}\n")

Snippet of stock prices:
                 AAPL        GOOG         IBM       MSFT
Date                                                    
2010-01-04  27.990226  313.062468  113.304536  25.884104
2010-01-05  28.038618  311.683844  111.935822  25.892466
2010-01-06  27.592626  303.826685  111.208683  25.733566
2010-01-07  27.541619  296.753749  110.823732  25.465944
2010-01-08  27.724725  300.709808  111.935822  25.641571

Snippet of stock volumes:
                 AAPL      GOOG      IBM      MSFT
Date                                              
2010-01-04  123432400   3927000  6155300  38409100
2010-01-05  150476200   6031900  6841400  49749600
2010-01-06  138040000   7987100  5605300  58182400
2010-01-07  119282800  12876600  5840600  50559700
2010-01-08  111902700   9483900  4197200  51197400



In [59]:
### Use one of the above summary statistics to get percent change
returns = price.pct_change()
print(f"Snippet of percent change in stock prices:\n{returns.head()}\n")

Snippet of percent change in stock prices:
                AAPL      GOOG       IBM      MSFT
Date                                              
2010-01-04       NaN       NaN       NaN       NaN
2010-01-05  0.001729 -0.004404 -0.012080  0.000323
2010-01-06 -0.015906 -0.025209 -0.006496 -0.006137
2010-01-07 -0.001849 -0.023280 -0.003462 -0.010400
2010-01-08  0.006648  0.013331  0.010035  0.006897
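
For reference, `pct_change` with default arguments is the ratio of each value to the previous one, minus 1. A minimal sketch on made-up prices (the values are hypothetical, not taken from the pickle files):

```python
import numpy as np
import pandas as pd

# Hypothetical prices for illustration only
prices = pd.Series([100.0, 102.0, 99.96],
                   index=pd.date_range("2010-01-04", periods=3))

manual = prices / prices.shift(1) - 1   # current / previous - 1
builtin = prices.pct_change()
print(pd.DataFrame({"manual": manual, "builtin": builtin}))
```

The first row is always NaN because there is no previous value to compare against.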



In [60]:
# Compute corr and cov on two series:
# (notice the format for operation on two series in a DataFrame)
corr_MSFT_IBM = returns["MSFT"].corr(returns["IBM"])
cov_MSFT_IBM = returns["MSFT"].cov(returns["IBM"])
print(f"Correlation of MSFT returns to IBM returns:\n{corr_MSFT_IBM}\n")
print(f"Covariance of MSFT returns to IBM returns:\n{cov_MSFT_IBM}\n")

Correlation of MSFT returns to IBM returns:
0.49976361144151155

Covariance of MSFT returns to IBM returns:
8.870655479703549e-05



In [61]:
# Get a correlation or covariance matrix of each stock with all the others instead
# Note MSFT-IBM results
corrMat_returns = returns.corr()
covMat_returns = returns.cov()
print(f"Correlation matrix of stock returns:\n{corrMat_returns}\n")
print(f"Covariance matrix of stock returns:\n{covMat_returns}\n")

Correlation matrix of stock returns:
          AAPL      GOOG       IBM      MSFT
AAPL  1.000000  0.407919  0.386817  0.389695
GOOG  0.407919  1.000000  0.405099  0.465919
IBM   0.386817  0.405099  1.000000  0.499764
MSFT  0.389695  0.465919  0.499764  1.000000

Covariance matrix of stock returns:
          AAPL      GOOG       IBM      MSFT
AAPL  0.000277  0.000107  0.000078  0.000095
GOOG  0.000107  0.000251  0.000078  0.000108
IBM   0.000078  0.000078  0.000146  0.000089
MSFT  0.000095  0.000108  0.000089  0.000215



In [62]:
# Correlate every column with a reference Series using corrwith
# Note the MSFT result matches the MSFT-IBM values above
returns.corrwith(returns["IBM"])

AAPL    0.386817
GOOG    0.405099
IBM     1.000000
MSFT    0.499764
dtype: float64
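
Passing a Series to `corrwith` should give the same numbers as the matching column of the full correlation matrix. A small sketch on random data, where `toy_returns` and its column names are made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
toy_returns = pd.DataFrame(rng.normal(size=(100, 3)), columns=["A", "B", "C"])

by_corrwith = toy_returns.corrwith(toy_returns["B"])   # one Series of correlations
from_matrix = toy_returns.corr()["B"]                  # same column of the matrix
print(pd.DataFrame({"corrwith": by_corrwith, "matrix": from_matrix}))
```

`corrwith` is the lighter option when only one reference column matters, since it avoids computing every pair.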

In [63]:
# Correlate matching columns of two DataFrames with corrwith
corr_return_volume = returns.corrwith(volume)
print(f"Input DF1 (head)- returns:\n{returns.head()}\n")
print(f"Input DF2 (head)- volume:\n{volume.head()}\n")
print(f"Correlation of each stock's returns with its volume across all dates(head):\n{corr_return_volume}\n")

Input DF1 (head)- returns:
                AAPL      GOOG       IBM      MSFT
Date                                              
2010-01-04       NaN       NaN       NaN       NaN
2010-01-05  0.001729 -0.004404 -0.012080  0.000323
2010-01-06 -0.015906 -0.025209 -0.006496 -0.006137
2010-01-07 -0.001849 -0.023280 -0.003462 -0.010400
2010-01-08  0.006648  0.013331  0.010035  0.006897

Input DF2 (head)- volume:
                 AAPL      GOOG      IBM      MSFT
Date                                              
2010-01-04  123432400   3927000  6155300  38409100
2010-01-05  150476200   6031900  6841400  49749600
2010-01-06  138040000   7987100  5605300  58182400
2010-01-07  119282800  12876600  5840600  50559700
2010-01-08  111902700   9483900  4197200  51197400

Correlation of each stock's returns with its volume across all dates(head):
AAPL   -0.075565
GOOG   -0.007067
IBM    -0.204849
MSFT   -0.092950
dtype: float64
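
When both arguments are DataFrames, `corrwith` pairs columns by name. The sketch below uses two random frames that share only one column name; `left`, `right`, and the column names are hypothetical. As far as I know, labels present in only one frame come back as NaN unless `drop=True` is passed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
left = pd.DataFrame(rng.normal(size=(50, 2)), columns=["A", "B"])
right = pd.DataFrame(rng.normal(size=(50, 2)), columns=["B", "C"])

paired = left.corrwith(right)                 # only "B" exists in both frames
paired_dropped = left.corrwith(right, drop=True)
print(paired)
print(paired_dropped)
```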



In [64]:
# Correlate by row instead: one value per date, computed across the four stocks
byRow = returns.corrwith(volume, axis="columns")
print(f"Correlation of returns with volume across all stocks, per date (head):\n{byRow.head()}\n")

Correlation of returns with volume across all stocks, per date (head):
Date
2010-01-04         NaN
2010-01-05    0.737298
2010-01-06    0.017069
2010-01-07    0.507614
2010-01-08   -0.779646
dtype: float64



<a name="unique"></a>
# Unique Values, Value Counts, and Membership

The final class of methods to review provides another kind of "summary".  

These methods return information about the values contained in a one-dimensional Series:
1. `unique` returns one occurrence of each distinct value
    - values appear in order of first occurrence, not in sorted order
1. `value_counts` is essentially R's `table`: it returns the number of occurrences of each value in a Series, sorted by decreasing count
1. `isin` returns a `True`/`False` mask indicating membership (just like R's `%in%`)
1. `get_indexer` (from `pd.Index`) is somewhat like R's `match`: given an Index of unique reference values and an array of possibly repeated values, it returns the integer position of each value in the reference Index

<img src="./myImages/table5.9.png" width = 600>

In [65]:
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [66]:
# Print all uniques of the object
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)
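
The result above follows the order in which each value first appears in the Series. If sorted output is needed, one simple option is to sort the result afterwards. A short sketch, where `in_order` and `in_sorted` are illustrative names:

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

in_order = obj.unique()           # order of first appearance
in_sorted = sorted(obj.unique())  # plain Python list, sorted
print(list(in_order), in_sorted)
```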

In [67]:
# Get number of occurrences of each unique value
obj.value_counts()

c    3
a    3
b    2
d    1
Name: count, dtype: int64
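
`value_counts` can also report proportions rather than raw counts through its `normalize` argument. A minimal sketch on the same Series; `shares` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# Fraction of rows taken by each value instead of the raw count
shares = obj.value_counts(normalize=True)
print(shares)
```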

In [68]:
# Get a True/False mask
mask = obj.isin(["b", "c"])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [69]:
# Subset based on the mask
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

In [70]:
# Get reference position of each value
to_match = pd.Series(["c", "a", "b", "b", "c", "a"])
unique_vals = pd.Series(["c", "b", "a"])
indices = pd.Index(unique_vals).get_indexer(to_match)

print(f"Values to match to reference:\n{to_match}\n")
print(f"Reference values/order:\n{unique_vals}\n")
print(f"Matching indices:\n{indices}")

Values to match to reference:
0    c
1    a
2    b
3    b
4    c
5    a
dtype: object

Reference values/order:
0    c
1    b
2    a
dtype: object

Matching indices:
[0 2 1 1 0 2]
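
One detail worth knowing: values that do not appear in the reference Index are reported as `-1` rather than raising an error. The sketch below uses a made-up value `"z"` to show this; the reference Index itself has to contain unique values.

```python
import pandas as pd

reference = pd.Index(["c", "b", "a"])

# "z" is not in the reference, so its position is reported as -1
positions = reference.get_indexer(["c", "a", "z"])
print(positions)
```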


Below is a brief example of how this might be used to wrangle data for input into a histogram:

In [71]:
# Example data
data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})
data

   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4


In [72]:
# Example computation of counts of a column
data["Qu1"].value_counts().sort_index() # add sort_index to sort results by index value and not # of occurrences

Qu1
1    1
3    2
4    2
Name: count, dtype: int64

In [73]:
# Use apply to do so for all columns
# (pd.Series.value_counts replaces the top-level pd.value_counts, which is deprecated in recent pandas)
toPlot = data.apply(pd.Series.value_counts).fillna(0)
toPlot

   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0
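
An alternative route to the same table, offered only as a sketch, is to reshape the data to long form and cross-tabulate it. This keeps the counts as integers instead of floats. The names `long`, `question`, and `answer` are introduced here for illustration and are not part of the original example.

```python
import pandas as pd

data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})

# Long form: one row per (question, answer) pair
long = data.melt(var_name="question", value_name="answer")
# Count how often each answer occurs for each question
counts = pd.crosstab(long["answer"], long["question"])
print(counts)
```

The `apply` approach above is shorter; the cross-tabulation is mainly useful when integer counts or named axes are wanted.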
