# Basic Metrics

When we think about summarizing data, what are the metrics that we look at?

In this notebook, we will look at the Price of Weed dataset along with demographic information for the United States.


The Price of Weed website (http://www.priceofweed.com/) crowdsources the street price that people pay for weed. All prices are self-reported.

- Location is auto-detected or can be chosen
- Quality is classified into three categories:
    - High
    - Medium
    - Low
- Price is reported by weight:
    - an ounce
    - a half ounce
    - a quarter
    - an eighth
    - 10 grams
    - 5 grams
    - 1 gram


To learn how the data was acquired, see [this notebook](https://github.com/amitkaps/weed/blob/master/1-Acquire.ipynb).

This notebook will make use of pandas quite a bit.

In [46]:
import numpy as np
import pandas as pd
from scipy import stats

In [47]:
from datetime import datetime as dt

### Read the input datasets. There are three datasets:

1. Weed price by date / state
2. Demographics of State
3. Population of state

In [51]:
# Parse the date column by name rather than by negative position
prices_pd = pd.read_csv("data/Weed_Price.csv", parse_dates=["date"])
demography_pd = pd.read_csv("data/Demographics_State.csv")
population_pd = pd.read_csv("data/Population_State.csv")

In [52]:
prices_pd.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date
0,Alabama,339.06,1042,198.64,933,149.49,123,2014-01-01
1,Alaska,288.75,252,260.6,297,388.58,26,2014-01-01
2,Arizona,303.31,1941,209.35,1625,189.45,222,2014-01-01
3,Arkansas,361.85,576,185.62,544,125.87,112,2014-01-01
4,California,248.78,12096,193.56,12812,192.92,778,2014-01-01


In [53]:
prices_pd.tail()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date
22894,Virginia,364.98,3513,293.12,3079,,284,2014-12-31
22895,Washington,233.05,3337,189.92,3562,,160,2014-12-31
22896,West Virginia,359.35,551,224.03,545,,60,2014-12-31
22897,Wisconsin,350.52,2244,272.71,2221,,167,2014-12-31
22898,Wyoming,322.27,131,351.86,197,,12,2014-12-31


In [54]:
demography_pd.head()

Unnamed: 0,region,total_population,percent_white,percent_black,percent_asian,percent_hispanic,per_capita_income,median_rent,median_age
0,alabama,4799277,67,26,1,4,23680,501,38.1
1,alaska,720316,63,3,5,6,32651,978,33.6
2,arizona,6479703,57,4,3,30,25358,747,36.3
3,arkansas,2933369,74,15,1,7,22170,480,37.5
4,california,37659181,40,6,13,38,29527,1119,35.4


In [55]:
population_pd.head()

Unnamed: 0,region,value
0,alabama,4777326
1,alaska,711139
2,arizona,6410979
3,arkansas,2916372
4,california,37325068


In [7]:
prices_pd.dtypes

State             object
HighQ            float64
HighQN             int64
MedQ             float64
MedQN              int64
LowQ             float64
LowQN              int64
date      datetime64[ns]
dtype: object

#### Sort the data on state and date, then fill NA values

In [56]:
prices_pd.sort_values(by=['State', 'date'], ascending=True, na_position='first', inplace=True)
# Forward fill: replace each NA with the last valid value above it
# (equivalent to fillna(method='ffill'))
prices_pd.ffill(inplace=True)
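
As a side note on the fill step, here is a minimal sketch on toy data (the values are made up and do not come from `Weed_Price.csv`). It suggests that a plain forward fill on a frame sorted by state and date can carry the last value of one state into the first rows of the next state, whereas a per-state fill leaves such a gap empty. Whether that matters for this dataset depends on where the missing values actually sit.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical prices; "A" and "B" stand in for state names
toy = pd.DataFrame({
    "State": ["B", "A", "A", "B", "A"],
    "date": pd.to_datetime(["2014-01-02", "2014-01-02", "2014-01-01",
                            "2014-01-01", "2014-01-03"]),
    "LowQ": [120.0, np.nan, 150.0, np.nan, 155.0],
})

# Same ordering as in the notebook: by state, then by date
toy = toy.sort_values(by=["State", "date"]).reset_index(drop=True)

# Plain forward fill: the first row of "B" inherits the last value of "A"
filled_plain = toy["LowQ"].ffill()

# Per-state forward fill: a gap is only filled from the same state's earlier rows
filled_by_state = toy.groupby("State")["LowQ"].ffill()
```

In the per-state version, the first row of state "B" stays missing because there is no earlier "B" value to copy from.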

### Finding mean, median, mode, variance, standard deviation for California

#### Mean

The mean is the arithmetic average of a set of values, computed by dividing the total of all values by the number of values.

In [12]:
california_pd = prices_pd[prices_pd.State == "California"].copy(True)
california_pd.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date
4,California,248.78,12096,193.56,12812,192.92,778,2014-01-01
55,California,243.96,16512,189.35,19151,161.3,1096,2015-01-01
106,California,248.2,12571,192.8,13406,191.94,804,2014-02-01
157,California,243.3,16904,188.95,19764,161.3,1123,2015-02-01
208,California,247.6,12988,192.97,13906,191.4,839,2014-03-01


In [60]:
california_pd.shape

(449, 9)

In [61]:
california_pd.describe()

Unnamed: 0,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,HighQ_dev
count,449.0,449.0,449.0,449.0,449.0,449.0,449.0
mean,245.376125,14947.073497,191.268909,16769.821826,177.197617,976.298441,2.976043
std,1.727046,1656.133565,1.524028,2433.943191,14.765425,120.246714,3.961134
min,241.84,12021.0,187.85,12724.0,161.3,770.0,1.5e-05
25%,244.48,13610.0,190.26,14826.0,161.3,878.0,0.106357
50%,245.31,15037.0,191.57,16793.0,188.57,982.0,0.729103
75%,246.22,16090.0,192.55,18435.0,191.32,1060.0,4.435761
max,248.82,18492.0,193.63,22027.0,193.88,1232.0,12.504178


In [62]:
ca_sum = california_pd['HighQ'].sum()

In [63]:
ca_count = california_pd['HighQ'].count()

In [64]:
ca_mean = ca_sum / ca_count
print "Mean weed price in CA is:", ca_mean

Mean weed price in CA is: 245.37612472160356


#### Exercise: Find CA mean for 2013, 2014 & 2015 separately

*Hint:* `california_pd.iloc[0]['date'].year`
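
One possible way to approach the exercise, sketched on toy data so that the real answer is left to the reader: group the rows by the year of the `date` column and take the mean of each group. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical prices spread over three calendar years
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-12-30", "2013-12-31", "2014-01-01",
                            "2014-01-02", "2015-01-01"]),
    "HighQ": [250.0, 252.0, 248.0, 246.0, 244.0],
})

# .dt.year extracts the calendar year of every row at once;
# grouping on it gives one mean per year
mean_by_year = toy.groupby(toy["date"].dt.year)["HighQ"].mean()
```

The same pattern should carry over to `california_pd`, assuming its `date` column was parsed as a datetime.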

#### Median

The median is the value lying at the midpoint of a frequency distribution of observed values, such that there is an equal probability of falling above or below it. Simply put, it is the *middle* value in the sorted list of numbers.

In [65]:
ca_count

449

With the values sorted in ascending order: if the count n is odd, the median is the value at position (n+1)/2,

else it is the average of the values at positions n/2 and n/2 + 1

In [66]:
ca_highq_pd = california_pd['HighQ']
ca_highq_pd.head()

4      248.78
55     243.96
106    248.20
157    243.30
208    247.60
Name: HighQ, dtype: float64

In [18]:
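# Sort by price first: the median is the middle value of the ordered prices.
# ca_count is odd (449), so index ca_count // 2 is the single middle element.
ca_median = ca_highq_pd.sort_values().iloc[ca_count // 2]
print "Median price of weed in CA is:", ca_median

Median price of weed in CA is: 245.31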
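
To make the odd/even rule concrete, here is a small sketch with a hypothetical helper, `manual_median`, checked against pandas on made-up numbers. Note that the values must be sorted before picking the middle position.

```python
import pandas as pd

def manual_median(values):
    """Median by the textbook rule; `values` may be in any order."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: single middle value (position (n+1)/2, counting from 1)
        return ordered[mid]
    # Even count: average of positions n/2 and n/2 + 1, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2.0

odd_prices = [248.0, 243.0, 246.0, 244.0, 250.0]
even_prices = [248.0, 243.0, 246.0, 244.0]

odd_median = manual_median(odd_prices)
even_median = manual_median(even_prices)

# pandas applies the same rule
pandas_odd = pd.Series(odd_prices).median()
pandas_even = pd.Series(even_prices).median()
```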


#### Mode

It is the number which appears most often in a set of numbers. 

In [72]:
ca_mode = ca_highq_pd.value_counts().index[0]
print "The most common price in CA, as indicated by its mode, is:", ca_mode

The most common price in CA, as indicated by its mode, is: 245.05


#### Variance

> Once, two statisticians of height 4 feet and 5 feet had to cross a river of AVERAGE depth 3 feet. Meanwhile, a third person came along and said, "What are you waiting for? You can easily cross the river."

It is the average of the squared distances of the data values from the *mean*. The calculation below divides by n - 1 rather than n, which gives the sample variance.

<img style="float: left;" src="img/variance.png" height="320" width="320">

In [20]:
california_pd['HighQ_dev'] = (california_pd['HighQ'] - ca_mean) ** 2

In [74]:
ca_HighQ_variance = california_pd.HighQ_dev.sum() / (ca_count - 1)
print "Variance of High Quality weed prices in CA is:", ca_HighQ_variance

Variance of High Quality weed prices in CA is: 2.9826862879812275


#### Standard Deviation

It is the square root of variance. This will have the same units as the data and mean. 

In [75]:
ca_HighQ_SD = np.sqrt(ca_HighQ_variance)
print "Standard Deviation of High Quality weed prices in CA is:", ca_HighQ_SD

Standard Deviation of High Quality weed prices in CA is: 1.727045537321245
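
The manual steps above divide by n - 1. As a small sketch on made-up numbers, the following compares that sample variance with the population variance (divide by n) and with the pandas `var` and `std` methods, whose `ddof` argument selects the divisor.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to follow
prices = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = prices.mean()
squared_dev = (prices - mean) ** 2

# Sample variance: divide by n - 1 (what the notebook computes)
sample_var = squared_dev.sum() / (len(prices) - 1)

# Population variance: divide by n
population_var = squared_dev.sum() / len(prices)

# pandas defaults to ddof=1 (sample); ddof=0 gives the population version
pandas_sample_var = prices.var()
pandas_population_var = prices.var(ddof=0)

# Standard deviation is the square root of the variance
sample_sd = np.sqrt(sample_var)
```

Keep in mind that `numpy.var` and `numpy.std` default to `ddof=0`, so they do not match the pandas defaults unless `ddof=1` is passed.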


#### Using Pandas built-in function

In [23]:
california_pd.describe()

Unnamed: 0,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,HighQ_dev
count,449.0,449.0,449.0,449.0,449.0,449.0,449.0
mean,245.376125,14947.073497,191.268909,16769.821826,177.197617,976.298441,2.976043
std,1.727046,1656.133565,1.524028,2433.943191,14.765425,120.246714,3.961134
min,241.84,12021.0,187.85,12724.0,161.3,770.0,1.5e-05
25%,244.48,13610.0,190.26,14826.0,161.3,878.0,0.106357
50%,245.31,15037.0,191.57,16793.0,188.57,982.0,0.729103
75%,246.22,16090.0,192.55,18435.0,191.32,1060.0,4.435761
max,248.82,18492.0,193.63,22027.0,193.88,1232.0,12.504178


In [24]:
california_pd.HighQ.mode()

0    245.03
1    245.05
dtype: float64
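
`mode()` returns two values here because two prices are tied for the highest count, while `value_counts().index[0]` reports only one of them. A minimal sketch on made-up prices shows one way to recover every tied value from `value_counts`:

```python
import pandas as pd

# Hypothetical prices in which two values are tied for most frequent
prices = pd.Series([245.03, 245.05, 245.03, 245.05, 244.90])

counts = prices.value_counts()

# value_counts sorts by frequency, so the first entry has the top count,
# but with a tie it is only one of several modes
first_only = counts.index[0]
top_count = counts.iloc[0]

# Keep every value whose count equals the top count
all_modes = sorted(counts[counts == top_count].index)

# Series.mode() returns all tied values, in ascending order
pandas_modes = prices.mode().tolist()
```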

#### Co-variance 

Covariance is a measure of how much two variables, say x and y, change together. It is the average of the products of their deviations from their respective means, so it reflects both how far the variables are spread out and the direction of their relationship: positive when they tend to move together, negative when they tend to move in opposite directions. Compare this to variance, which measures how much a single variable varies around its own mean.

<img style="float: left;" src="img/covariance.png" height="270" width="270">

<br>
<br>
<br>
<br>

#### Co-variance of weed price in California vs New York

In [25]:
ny_pd = prices_pd[prices_pd['State'] == 'New York'].copy(True)
ny_pd.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date
26,New York,351.98,5800,268.88,5824,190.38,482,2014-01-01
77,New York,343.8,7840,263.56,8716,161.3,616,2015-01-01
128,New York,352.35,6051,268.5,6115,190.16,497,2014-02-01
179,New York,343.09,8058,262.93,9015,161.3,628,2015-02-01
230,New York,351.18,6209,267.69,6356,189.64,507,2014-03-01


In [26]:
# Keep only the high-quality price and the date, selected by column name
# (.ix is deprecated)
ny_pd = ny_pd[['HighQ', 'date']]
ny_pd.columns = ['NY_HighQ', 'date']


In [27]:
ny_pd.head()

Unnamed: 0,NY_HighQ,date
26,351.98,2014-01-01
77,343.8,2015-01-01
128,352.35,2014-02-01
179,343.09,2015-02-01
230,351.18,2014-03-01


In [28]:
# Select columns by name instead of the deprecated .ix positional lookup
ca_ny_pd = pd.merge(california_pd[['HighQ', 'date']].copy(), ny_pd, on="date")
ca_ny_pd.rename(columns={"HighQ": "CA_HighQ"}, inplace=True)
ca_ny_pd.head()


Unnamed: 0,CA_HighQ,date,NY_HighQ
0,248.78,2014-01-01,351.98
1,243.96,2015-01-01,343.8
2,248.2,2014-02-01,352.35
3,243.3,2015-02-01,343.09
4,247.6,2014-03-01,351.18


In [29]:
ny_mean = ca_ny_pd.NY_HighQ.mean()
ny_mean

346.91276169265035

In [30]:
ca_ny_pd['ca_dev'] = ca_ny_pd['CA_HighQ'] - ca_mean
ca_ny_pd.head()

Unnamed: 0,CA_HighQ,date,NY_HighQ,ca_dev
0,248.78,2014-01-01,351.98,3.403875
1,243.96,2015-01-01,343.8,-1.416125
2,248.2,2014-02-01,352.35,2.823875
3,243.3,2015-02-01,343.09,-2.076125
4,247.6,2014-03-01,351.18,2.223875


In [31]:
ca_ny_pd['ny_dev'] = ca_ny_pd['NY_HighQ'] - ny_mean
ca_ny_pd.head()

Unnamed: 0,CA_HighQ,date,NY_HighQ,ca_dev,ny_dev
0,248.78,2014-01-01,351.98,3.403875,5.067238
1,243.96,2015-01-01,343.8,-1.416125,-3.112762
2,248.2,2014-02-01,352.35,2.823875,5.437238
3,243.3,2015-02-01,343.09,-2.076125,-3.822762
4,247.6,2014-03-01,351.18,2.223875,4.267238


In [32]:
ca_ny_cov = (ca_ny_pd['ca_dev'] * ca_ny_pd['ny_dev']).sum() / (ca_count - 1)
print "Covariance of the High Quality weed prices in CA and NY is:", ca_ny_cov

Covariance of the High Quality weed prices in CA and NY is: 5.916814967288421


#### Using Pandas built-in function

In [33]:
ca_ny_pd.cov()

Unnamed: 0,CA_HighQ,NY_HighQ,ca_dev,ny_dev
CA_HighQ,2.982686,5.916815,2.982686,5.916815
NY_HighQ,5.916815,12.245147,5.916815,12.245147
ca_dev,2.982686,5.916815,2.982686,5.916815
ny_dev,5.916815,12.245147,5.916815,12.245147
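
The same calculation on a tiny made-up pair of series may be easier to verify by hand. This sketch computes the covariance from the deviations, dividing by n - 1, and compares it with `Series.cov` and `numpy.cov`. The names `x` and `y` are placeholders, not columns of the dataset.

```python
import numpy as np
import pandas as pd

# Two short hypothetical series that tend to rise together
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations of each series from its own mean
x_dev = x - x.mean()
y_dev = y - y.mean()

# Sample covariance: sum of products of deviations, divided by n - 1
manual_cov = (x_dev * y_dev).sum() / (len(x) - 1)

# Built-in equivalents; numpy.cov returns the full 2x2 covariance matrix
pandas_cov = x.cov(y)
numpy_cov = np.cov(x, y)[0, 1]
```

The diagonal of the `numpy.cov` matrix holds the variance of each series, which mirrors the layout of the `ca_ny_pd.cov()` table above.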


#### Correlation

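Correlation is the extent to which two or more variables fluctuate together. For two variables, it is the covariance divided by the product of their standard deviations, so it is unitless and always lies between -1 and 1. A positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases.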

<img style="float: left;" src="img/correlation.gif" height="270" width="270">

<br>
<br>
<br>

#### Finding correlation between weed prices in New York and California

In [34]:
ca_highq_std = ca_ny_pd.CA_HighQ.std()
ny_highq_std = ca_ny_pd.NY_HighQ.std()

ca_ny_corr = ca_ny_cov / (ca_highq_std * ny_highq_std)
print "Correlation between weed prices in NY and CA:", ca_ny_corr

Correlation between weed prices in NY and CA: 0.9790439611064713


In [35]:
ca_ny_pd.corr()

Unnamed: 0,CA_HighQ,NY_HighQ,ca_dev,ny_dev
CA_HighQ,1.0,0.979044,1.0,0.979044
NY_HighQ,0.979044,1.0,0.979044,1.0
ca_dev,1.0,0.979044,1.0,0.979044
ny_dev,0.979044,1.0,0.979044,1.0
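
A short sketch on made-up numbers illustrates why correlation is often easier to interpret than covariance: rescaling one variable (for example, changing its unit) changes the covariance but leaves the correlation untouched. The series `x` and `y` are placeholders.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Correlation as covariance divided by the product of standard deviations
manual_corr = x.cov(y) / (x.std() * y.std())
pandas_corr = x.corr(y)

# Rescale x by a factor of 100, as if switching to a smaller unit
x_scaled = x * 100

cov_scaled = x_scaled.cov(y)    # grows by the same factor of 100
corr_scaled = x_scaled.corr(y)  # unchanged
```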


#### Standard Error

The standard error of the mean (SE of the mean) estimates the variability between the sample means that you would obtain if you took multiple samples from the same population. In other words, the standard error describes variability between samples, whereas the standard deviation measures variability within a single sample.

It is calculated as the sample standard deviation divided by the square root of the sample size:

<img style="float: left;" src="img/standard-error.png" height="270" width="270">

<br>
<br>
<br>
<br>

#### Standard Error of weed price in California and New York

In [41]:
from scipy import stats

In [42]:
stats.sem(ca_ny_pd['CA_HighQ'])

0.08150431811126478

In [43]:
stats.sem(ca_ny_pd['NY_HighQ'])

0.16514249274593745
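
As a cross-check, the standard error can be computed by hand as the sample standard deviation divided by the square root of the number of observations. The sketch below does this on made-up prices and compares the result with `scipy.stats.sem`, which uses `ddof=1` by default.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of prices
prices = np.array([245.0, 246.5, 244.0, 247.5, 245.5, 243.5])

n = len(prices)

# Sample standard deviation (ddof=1), then divide by sqrt(n)
sample_sd = prices.std(ddof=1)
manual_sem = sample_sd / np.sqrt(n)

# scipy's built-in version
scipy_sem = stats.sem(prices)
```

Applied to the California figures reported earlier, a standard deviation of about 1.727 over 449 observations gives roughly 0.0815, in line with the `stats.sem` output above.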