# Basic Metrics

When we think about summarizing data, what are the metrics that we look at?

In this notebook, we will look at the weed price dataset along with demographic information for the United States.

To learn how the data was acquired, please read [this notebook](https://github.com/amitkaps/weed/blob/master/1-Acquire.ipynb).

This notebook will make use of pandas quite a bit.

In [None]:
import numpy as np
import pandas as pd
from datetime import datetime as dt
from scipy import stats

### Read the input datasets. There are three datasets:

1. Weed price by date / state
2. Demographics of State
3. Population of state

In [None]:
prices_pd = pd.read_csv("../data/Weed_Price.csv", parse_dates=[-1])
demography_pd = pd.read_csv("../data/Demographics_State.csv")
population_pd = pd.read_csv("../data/Population_State.csv")
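
As a side note on date parsing, here is a minimal sketch of passing the column name to `parse_dates` instead of a position. The three-row CSV is a made-up sample that only borrows the `State`, `HighQ` and `date` column names used later in this notebook; the values are illustrative, not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical sample; the real file has more columns and rows.
sample_csv = io.StringIO(
    "State,HighQ,date\n"
    "California,248.78,2014-01-01\n"
    "California,248.67,2014-01-02\n"
    "New York,351.98,2014-01-01\n"
)

# Naming the column avoids depending on its position in the file.
sample_pd = pd.read_csv(sample_csv, parse_dates=["date"])
print(sample_pd.dtypes)
```

Once the column is parsed, values such as `sample_pd["date"].iloc[0].year` are available directly.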

In [None]:
prices_pd.head()

In [None]:
prices_pd.tail()

In [None]:
demography_pd.head()

In [None]:
population_pd.head()

In [None]:
prices_pd.dtypes

#### Sort the data on state and date, then fill NA values

In [None]:
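# DataFrame.sort was removed from pandas; sort_values is its replacement
prices_pd.sort_values(by=['State', 'date'], inplace=True)
# Forward fill: each missing value takes the previous row's value
prices_pd.ffill(inplace=True)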
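
One thing to keep in mind with forward fill on data sorted by state: a missing value in the first row of a state is filled from the last row of the previous state. The sketch below uses made-up toy data (states `A` and `B` are hypothetical) to compare a plain forward fill with a per-state fill; it is an illustration, not a change to the notebook's approach.

```python
import numpy as np
import pandas as pd

# Toy data (made up): state "B" is missing its first price.
toy_pd = pd.DataFrame({
    "State": ["A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2014-01-01", "2014-01-02", "2014-01-01", "2014-01-02"]
    ),
    "HighQ": [100.0, np.nan, np.nan, 200.0],
})
toy_pd = toy_pd.sort_values(by=["State", "date"])

# Plain forward fill: B's first row inherits A's last value.
filled_all = toy_pd["HighQ"].ffill()

# Grouped forward fill: values never cross a state boundary,
# so B's first row stays missing.
filled_by_state = toy_pd.groupby("State")["HighQ"].ffill()

print(filled_all.tolist())
print(filled_by_state.tolist())
```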

### Finding mean, median, mode, variance, standard deviation for California

#### Mean

The arithmetic average of a set of values, computed by dividing the sum of all values by the number of values.

In [None]:
california_pd = prices_pd[prices_pd.State == "California"].copy(True)
california_pd.head()

In [None]:
ca_sum = california_pd['HighQ'].sum()

In [None]:
ca_count = california_pd['HighQ'].count()

In [None]:
ca_mean = ca_sum / ca_count
print("Mean weed price in CA is: {}".format(ca_mean))

#### Exercise: Find CA mean for 2013, 2014 & 2015 separately

*Hint:* `california_pd.iloc[0]['date'].year`

#### Median

The value lying at the midpoint of the sorted observations, such that an observation is equally likely to fall above or below it. Simply put, it is the *middle* value in the sorted list of numbers.

In [None]:
ca_count

If the count n is odd, the median is the value at position (n+1)/2 of the sorted data;

otherwise it is the average of the values at positions n/2 and n/2 + 1.
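
A minimal sketch of the positional rule on toy lists: the middle value for an odd count, the mean of the two middle values for an even count. The helper name `median_by_position` is made up for this illustration, and `np.median` is used only as a cross-check.

```python
import numpy as np

def median_by_position(values):
    """Median of a list: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2  # 0-based index of the upper middle element
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0

odd_values = [5, 1, 3]       # sorted: 1, 3, 5
even_values = [4, 1, 3, 2]   # sorted: 1, 2, 3, 4

print(median_by_position(odd_values))   # 3
print(median_by_position(even_values))  # 2.5
```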

In [None]:
ca_highq_pd = california_pd.sort_values(by='HighQ')
ca_highq_pd.head()

In [None]:
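# Integer division gives the 0-based middle position (exact for an odd count;
# for an even count, average the values at ca_count // 2 - 1 and ca_count // 2)
ca_median = ca_highq_pd.HighQ.iloc[ca_count // 2]
print("Median price of weed in CA is: {}".format(ca_median))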

#### Mode

It is the number which appears most often in a set of numbers. 

In [None]:
ca_mode = ca_highq_pd.HighQ.value_counts().index[0]
print("The most common price in CA, as indicated by its mode, is: {}".format(ca_mode))
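
A set of numbers can have more than one mode. Taking only the first entry of `value_counts()` returns a single value even when several are tied. The sketch below uses a made-up price series to collect every tied value and compares the result with `Series.mode()`.

```python
import pandas as pd

# Toy prices (made up): 245 and 250 both appear twice.
prices = pd.Series([240, 245, 245, 250, 250, 255])

counts = prices.value_counts()   # sorted by frequency, highest first
top_count = counts.iloc[0]       # frequency of the most common value

# Keep every value whose frequency ties the maximum.
all_modes = sorted(counts[counts == top_count].index)

# Series.mode() returns all modes, in ascending order.
builtin_modes = prices.mode().tolist()

print(all_modes)
print(builtin_modes)
```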

#### Variance

> Once, two statisticians of height 4 feet and 5 feet had to cross a river of AVERAGE depth 3 feet. Meanwhile, a third person came along and said, "What are you waiting for? You can easily cross the river."

It is the average of the squared distances of the data values from the *mean*. For a sample, the sum of squared distances is divided by n - 1 rather than n.

<img style="float: left;" src="img/variance.png" height="320" width="320">

In [None]:
california_pd['HighQ_dev'] = (california_pd['HighQ'] - ca_mean) ** 2

In [None]:
ca_HighQ_variance = california_pd.HighQ_dev.sum() / (ca_count - 1)
print("Variance of High Quality weed prices in CA is: {}".format(ca_HighQ_variance))
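
The divisor matters when comparing against library results: the calculation above divides by `ca_count - 1` (sample variance), which is also the pandas default, while `np.var` divides by n unless `ddof=1` is passed. A small sketch on made-up numbers:

```python
import numpy as np
import pandas as pd

# Toy values (made up); their mean is 5.0.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

squared_dev = (values - values.mean()) ** 2

sample_var = squared_dev.sum() / (len(values) - 1)   # divide by n - 1
population_var = squared_dev.sum() / len(values)     # divide by n

print(sample_var, values.var())                      # pandas default: ddof=1
print(population_var, np.var(values.to_numpy()))     # NumPy default: ddof=0
```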

#### Standard Deviation

It is the square root of variance. This will have the same units as the data and mean. 

In [None]:
ca_HighQ_SD = np.sqrt(ca_HighQ_variance)
print("Standard Deviation of High Quality weed prices in CA is: {}".format(ca_HighQ_SD))

#### Using Pandas built-in function

In [None]:
california_pd.describe()

In [None]:
california_pd.HighQ.mode()

#### Co-variance 

Covariance is a measure of the (average) co-variation between two variables, say x and y: how much the two variables change together. Its magnitude depends on how far the variables are spread out, and its sign describes the nature of their relationship: positive when they tend to move in the same direction, negative when they tend to move in opposite directions. Compare this to variance, which measures how far a single variable spreads around its own mean.

<img style="float: left;" src="img/covariance.png" height="270" width="270">

<br>
<br>
<br>
<br>
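
To make the sign of covariance concrete, here is a minimal sketch on made-up arrays. The helper `sample_cov` is a name invented for this illustration; it uses the n - 1 divisor, and `np.cov` is used as a cross-check.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y_up = np.array([2.0, 4.0, 6.0, 8.0])   # rises with x
y_down = y_up[::-1]                     # falls as x rises

def sample_cov(a, b):
    """Sum of products of deviations from the mean, divided by n - 1."""
    return ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)

print(sample_cov(x, y_up))     # positive: the two move together
print(sample_cov(x, y_down))   # negative: they move in opposite directions
print(np.cov(x, y_up)[0, 1])   # off-diagonal entry of the covariance matrix
```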

#### Co-variance of weed price in California vs New York

In [None]:
ny_pd = prices_pd[prices_pd['State'] == 'New York'].copy(True)
ny_pd.head()

In [None]:
# .ix was removed from pandas; select the two columns by name instead
ny_pd = ny_pd[['HighQ', 'date']].copy()
ny_pd.columns = ['NY_HighQ', 'date']

In [None]:
ny_pd.head()

In [None]:
# .ix was removed from pandas; select the two columns by name instead
ca_ny_pd = pd.merge(california_pd[['HighQ', 'date']].copy(), ny_pd, on="date")
ca_ny_pd.rename(columns={"HighQ": "CA_HighQ"}, inplace=True)
ca_ny_pd.head()
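
Note that `pd.merge` performs an inner join by default, so only dates present in both frames survive. The sketch below uses two made-up frames to show the effect; if the real CA and NY data cover exactly the same dates, nothing is dropped.

```python
import pandas as pd

# Toy frames (made up): the date ranges overlap on two days only.
ca_toy = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-03"]),
    "CA_HighQ": [248.0, 249.0, 250.0],
})
ny_toy = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-02", "2014-01-03", "2014-01-04"]),
    "NY_HighQ": [351.0, 352.0, 353.0],
})

# Default how="inner": keep only dates found in both frames.
merged = pd.merge(ca_toy, ny_toy, on="date")
print(len(ca_toy), len(ny_toy), len(merged))
```

This is why counts and means used after a merge are safest when taken from the merged frame itself.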

In [None]:
ny_mean = ca_ny_pd.NY_HighQ.mean()
ny_mean

In [None]:
# Use the mean of the merged column so it matches the rows being compared
ca_ny_pd['ca_dev'] = ca_ny_pd['CA_HighQ'] - ca_ny_pd['CA_HighQ'].mean()
ca_ny_pd.head()

In [None]:
ca_ny_pd['ny_dev'] = ca_ny_pd['NY_HighQ'] - ny_mean
ca_ny_pd.head()

In [None]:
# Divide by the number of merged rows minus one (sample covariance)
ca_ny_cov = (ca_ny_pd['ca_dev'] * ca_ny_pd['ny_dev']).sum() / (len(ca_ny_pd) - 1)
print("Covariance of the High Quality weed prices in CA and NY is: {}".format(ca_ny_cov))

#### Using Pandas built-in function

In [None]:
# Restrict to the two price columns; the date column is not numeric
ca_ny_pd[['CA_HighQ', 'NY_HighQ']].cov()

### Correlation

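Correlation is the extent to which two or more variables fluctuate together. A positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases. It is the covariance divided by the product of the two standard deviations, so it always lies between -1 and +1.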

<img style="float: left;" src="img/correlation.gif" height="270" width="270">

<br>
<br>
<br>
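
A minimal sketch of the Pearson correlation on made-up arrays: the sample covariance divided by the product of the sample standard deviations. The helper name `pearson` is invented for this illustration. Unlike covariance, the result does not change when a variable is rescaled or shifted.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

def pearson(a, b):
    """Sample covariance divided by the product of sample std deviations."""
    cov = ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)
    return cov / (a.std(ddof=1) * b.std(ddof=1))

r = pearson(x, y)
r_scaled = pearson(x * 100, y + 7)   # change of units and offset

print(r)
print(r_scaled)
print(np.corrcoef(x, y)[0, 1])       # NumPy cross-check
```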

#### Finding correlation between weed prices in New York and California

In [None]:
ca_highq_std = ca_ny_pd.CA_HighQ.std()
ny_highq_std = ca_ny_pd.NY_HighQ.std()

ca_ny_corr = ca_ny_cov / (ca_highq_std * ny_highq_std)
print("Correlation between weed prices in NY and CA: {}".format(ca_ny_corr))

In [None]:
# Restrict to the two price columns; the date column is not numeric
ca_ny_pd[['CA_HighQ', 'NY_HighQ']].corr()

# Correlation != Causation

A correlation between two variables does not necessarily imply that one causes the other.


<img style="float: left;" src="img/correlation_not_causation.gif" height="570" width="570">
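
One common way this happens is a third variable driving both. The sketch below is entirely synthetic: the variable names and coefficients are made up, and the random seed is fixed for repeatability. Two series that never influence each other end up strongly correlated because both depend on `temperature`; once that shared driver is subtracted, the correlation largely disappears.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical confounder and two quantities that both depend on it.
temperature = rng.normal(25.0, 5.0, size=500)
ice_cream = 2.0 * temperature + rng.normal(0.0, 2.0, size=500)
sunburns = 0.5 * temperature + rng.normal(0.0, 1.0, size=500)

# Strong correlation, although neither variable affects the other.
r = np.corrcoef(ice_cream, sunburns)[0, 1]

# Remove the known contribution of temperature from each series.
ice_cream_resid = ice_cream - 2.0 * temperature
sunburns_resid = sunburns - 0.5 * temperature
r_resid = np.corrcoef(ice_cream_resid, sunburns_resid)[0, 1]

print(r)
print(r_resid)
```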