<h3> Basic summarization of the Data </h3>

In [294]:
import numpy as np
import pandas as pd
import matplotlib

In [295]:
pop_state = pd.read_csv('Population_State.csv')

In [296]:
pop_state.shape

(51, 2)

In [297]:
pop_state.head()

Unnamed: 0,region,value
0,alabama,4777326
1,alaska,711139
2,arizona,6410979
3,arkansas,2916372
4,california,37325068


In [298]:
pop_state.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 50
Data columns (total 2 columns):
region    51 non-null object
value     51 non-null int64
dtypes: int64(1), object(1)
memory usage: 1.2+ KB


In [299]:
dem_state = pd.read_csv('Demographics_State.csv')

In [300]:
dem_state.shape

(51, 9)

In [301]:
dem_state.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 50
Data columns (total 9 columns):
region               51 non-null object
total_population     51 non-null int64
percent_white        51 non-null int64
percent_black        51 non-null int64
percent_asian        51 non-null int64
percent_hispanic     51 non-null int64
per_capita_income    51 non-null int64
median_rent          51 non-null int64
median_age           51 non-null float64
dtypes: float64(1), int64(7), object(1)
memory usage: 4.0+ KB


<h1>Central Limit Theorem</h1>

Given a distribution with mean μ and variance σ², the sampling distribution of the mean approaches a normal distribution with mean μ and variance σ²/N as the sample size N increases.

<h3> Estimating the population parameters of USA based on the states data </h3>

In [302]:
pop_state.describe()

Unnamed: 0,value
count,51.0
mean,6061543.352941
std,6838088.582209
min,562803.0
25%,1697554.5
50%,4340167.0
75%,6649654.5
max,37325068.0


Example program to illustrate the central limit theorem

In [337]:
# TODO: Put a visualization for an example or show a demo from some other program online ?
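One way to illustrate the theorem is a small simulation. The sketch below is not part of the original analysis: it assumes an exponential population (mean 1, variance 1), chosen only because it is clearly non-normal, and the helper name `sample_means` is hypothetical. It draws many samples of size n and compares the variance of the sample means with σ²/n.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Assumed population: exponential with mean 1 and variance 1 (right-skewed, not normal).
population_mean, population_var = 1.0, 1.0

def sample_means(n, repeats=10000):
    """Draw `repeats` samples of size n and return the mean of each sample."""
    draws = rng.exponential(scale=1.0, size=(repeats, n))
    return draws.mean(axis=1)

for n in (2, 10, 50):
    means = sample_means(n)
    # The mean of the sample means should stay near 1,
    # while their variance should shrink roughly like sigma^2 / n.
    print("n=%d  mean=%.4f  variance=%.4f  sigma^2/n=%.4f"
          % (n, means.mean(), means.var(), population_var / n))
```

A histogram of `sample_means(50)` (for example with matplotlib) should look close to a bell curve even though the underlying population is skewed.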

<h2> Confidence Intervals </h2>

When we estimate the mean of a population, we rarely know its standard deviation in advance.
Therefore, constructing a confidence interval almost always involves estimating both μ and σ.

The sample standard deviation s divided by the square root of N (where N is the sample size) is used as an estimate of σM, the standard error of the mean.

Let us calculate the mean and standard deviation for the second dataset.

In [303]:
pop_states = pd.read_csv('Population_State.csv')

In [304]:
d = np.array(pop_states['value']) # todo: change 'value' column name to 'population count'

In [305]:
demograph_states = pd.read_csv('Demographics_State.csv')

In [306]:
b = np.array(demograph_states['total_population'])

In [307]:
pop_states.describe()

Unnamed: 0,value
count,51.0
mean,6061543.352941
std,6838088.582209
min,562803.0
25%,1697554.5
50%,4340167.0
75%,6649654.5
max,37325068.0


In [308]:
demograph_states.describe()

Unnamed: 0,total_population,percent_white,percent_black,percent_asian,percent_hispanic,per_capita_income,median_rent,median_age
count,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0
mean,6108560.666667,70.254902,10.823529,3.72549,10.803922,28053.803922,719.490196,37.639216
std,6904016.38773,16.116877,10.867761,5.355664,9.996038,4659.378182,189.820375,2.352367
min,570134.0,23.0,0.0,1.0,1.0,20618.0,448.0,29.6
25%,1712494.5,59.5,3.0,1.0,4.5,24908.5,566.0,36.3
50%,4361333.0,74.0,7.0,2.0,8.0,26824.0,664.0,37.6
75%,6712318.5,82.5,14.5,4.0,12.5,30144.0,839.0,38.95
max,37659181.0,94.0,49.0,37.0,47.0,45290.0,1220.0,43.2


The comparison of the two datasets above shows that the sample mean varies from one dataset to another when it is used to estimate a population parameter, in this case the population mean.

Confidence intervals give us a way to specify a range within which the estimated population parameter could fall.
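As a worked example, the sketch below builds a 95% interval from the summary statistics printed by `pop_states.describe()` above (count 51, mean 6061543.352941, std 6838088.582209). Treating the 51 rows as a random sample is an assumption made only for illustration. It computes the standard error s/√N and compares the normal-based interval with a t-based one.

```python
import numpy as np
from scipy import stats

# Summary statistics copied from pop_states.describe() above.
n = 51
sample_mean = 6061543.352941
sample_std = 6838088.582209  # pandas describe() reports the sample std (ddof=1)

# Estimated standard error of the mean: s / sqrt(N).
standard_error = sample_std / np.sqrt(n)

# Normal-based 95% interval.
z_low, z_high = stats.norm.interval(0.95, loc=sample_mean, scale=standard_error)

# t-based 95% interval with n - 1 degrees of freedom; slightly wider for small n.
t_low, t_high = stats.t.interval(0.95, n - 1, loc=sample_mean, scale=standard_error)

print("standard error: %.1f" % standard_error)
print("normal interval: (%.1f, %.1f)" % (z_low, z_high))
print("t interval:      (%.1f, %.1f)" % (t_low, t_high))
```

Because σ is estimated from the data, the t-based interval is usually the more cautious choice when N is small.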

<h2> Calculating Confidence Interval for high quality weed prices for Alaska </h2>

In [309]:
# CI for high quality weed prices for Alaska
from scipy import stats
from pandas import DataFrame

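# DataFrame.from_csv is deprecated; read_csv with index_col=0 also indexes the rows by the first column
weed_price = pd.read_csv('Weed_Price.csv', index_col=0)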
df = weed_price.loc['Alaska']

nv = df['HighQ']
nv

N=len(nv)
N

mean,std=(np.mean(df['HighQ']), np.std(df['HighQ']))
mean, std

ci=stats.norm.interval(0.95,loc=mean,scale=std/np.sqrt(N))
ci

(290.96643984923782, 291.99756905944804)

<h2> Calculating Confidence Interval for high quality weed prices for Florida </h2>

In [310]:
df = weed_price.loc['Florida']
nv = df['HighQ']
nv
N=len(nv)
N
mean,std=(np.mean(df['HighQ']),
                   np.std(df['HighQ']))
mean, std
ci=stats.norm.interval(0.95,loc=mean,scale=std/np.sqrt(N))
ci

(302.35844561983299, 302.78217798818491)

<h3> Bootstrapping Confidence Intervals </h3>

In [311]:
import scikits.bootstrap as bootstrap
import scipy

In [312]:
CIs = bootstrap.ci(data=b, statfunction=np.mean)  # scipy.mean is a deprecated alias of numpy.mean

In [313]:
CIs

array([ 4580812.21568627,  8599414.03921569])

In [314]:
c = np.array(pop_state['value'])

In [315]:
CIs_for_pop = bootstrap.ci(data=c, statfunction=np.mean)  # scipy.mean is a deprecated alias of numpy.mean

In [316]:
CIs_for_pop

array([ 4607019.58823529,  8534988.09803922])
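To make the idea behind these intervals concrete, here is a minimal sketch of a percentile bootstrap written with NumPy only. It is an illustration, not the method `scikits.bootstrap` uses internally (that library may apply a bias-corrected variant by default, so its numbers need not match). The function name `percentile_bootstrap_ci` and its parameters are hypothetical, and the sample is just the first five rows of `pop_state` shown earlier.

```python
import numpy as np

def percentile_bootstrap_ci(data, statfunction=np.mean, alpha=0.05,
                            n_resamples=10000, seed=0):
    """Percentile bootstrap: resample with replacement, recompute the
    statistic each time, and take the alpha/2 and 1 - alpha/2 percentiles."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    estimates = np.empty(n_resamples)
    for i in range(n_resamples):
        resample = rng.choice(data, size=data.size, replace=True)
        estimates[i] = statfunction(resample)
    return np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# First five rows of pop_state (alabama ... california), used as a toy sample.
sample = np.array([4777326, 711139, 6410979, 2916372, 37325068])
low, high = percentile_bootstrap_ci(sample)
print("95%% percentile bootstrap CI for the mean: (%.0f, %.0f)" % (low, high))
```

With only five heavily skewed values the interval is very wide, which is the expected behaviour rather than a defect of the method.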

<h3> Introduction to Z-score </h3>

The z-score tells us how many standard deviations an observation lies from the mean of the dataset.

$$z = \frac{x - \mu}{\sigma}$$


* x is the value of the observation;
* μ is the mean of the population;
* σ is the standard deviation of the population.

In [317]:
a = np.array(pop_state['value'])

In [318]:
a

array([ 4777326,   711139,  6410979,  2916372, 37325068,  5042853,
        3572213,   900131,   605759, 18885152,  9714569,  1362730,
        1567803, 12823860,  6485530,  3047646,  2851183,  4340167,
        4529605,  1329084,  5785496,  6560595,  9897264,  5313081,
        2967620,  5982413,   990785,  1827306,  2704204,  1317474,
        8793888,  2055287, 19398125,  9544249,   676253, 11533561,
        3749005,  3836628, 12699589,  1052471,  4630351,   815871,
        6353226, 25208897,  2766233,   625498,  8014955,  6738714,
        1850481,  5687219,   562803])

* We know from the summary statistics above that the mean of these values is 6061543.352941.
* Suppose we want to know the z-scores of the observations for Alabama (index 0), California (index 4), and the last row (index 50, which is Wyoming given the alphabetical ordering of the states).

The z-score for each state can be obtained by indexing into the array, as shown below:

In [319]:
from scipy import stats
zscore_array = stats.zscore(a)

In [320]:
zscore_array[0]

-0.18967229426808219

In [321]:
zscore_array[50]

-0.81213565284583833

In [322]:
zscore_array[4]

4.6174617039192078
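The same numbers can be reproduced directly from the formula. The sketch below copies the population array printed above and applies z = (x − μ)/σ by hand, using the population standard deviation (`ddof=0`), which is also the default of `stats.zscore`.

```python
import numpy as np
from scipy import stats

# Population values copied from the array printed above (51 entries).
a = np.array([
    4777326, 711139, 6410979, 2916372, 37325068, 5042853,
    3572213, 900131, 605759, 18885152, 9714569, 1362730,
    1567803, 12823860, 6485530, 3047646, 2851183, 4340167,
    4529605, 1329084, 5785496, 6560595, 9897264, 5313081,
    2967620, 5982413, 990785, 1827306, 2704204, 1317474,
    8793888, 2055287, 19398125, 9544249, 676253, 11533561,
    3749005, 3836628, 12699589, 1052471, 4630351, 815871,
    6353226, 25208897, 2766233, 625498, 8014955, 6738714,
    1850481, 5687219, 562803])

mu = a.mean()
sigma = a.std()             # ddof=0: population standard deviation
z_manual = (a - mu) / sigma

# Should agree with scipy's implementation element by element.
z_scipy = stats.zscore(a)
print(z_manual[0], z_manual[50], z_manual[4])
```

Note that `pandas` reports the sample standard deviation (`ddof=1`) in `describe()`, so z-scores computed from that value would be slightly smaller in magnitude.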

<h2> A/B testing: Impact of regulation and deregulation on a couple of states </h2>

   * In A/B testing, we compare two groups to see whether they differ.
   * We then determine whether the difference is statistically significant.


In [323]:
weed_prices = pd.read_csv('Weed_Price.csv')

In [341]:
from pandas import DataFrame
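# DataFrame.from_csv is deprecated; read_csv with index_col=0 also indexes the rows by the first column
df = pd.read_csv('Weed_Price.csv', index_col=0)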
sf = df.loc['Alaska'][(df.loc['Alaska']['date'] < '2014-11-02') ]
sf.describe()

# alaska data before november 2014
#https://en.wikipedia.org/wiki/Legality_of_cannabis_by_U.S._jurisdiction#.C2.A0Alaska
#State = State in US
#HighQ = Price for HighQuality Weed
#HighQN = Number of people who came on the site and pledged the value for HighQ weed on a given day.

#(Alaska prices before deregulation and after de-regulation)

Unnamed: 0,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN
count,309.0,309.0,309.0,309.0,242.0,309.0
mean,288.561456,302.864078,259.331036,368.831715,387.232727,29.841424
std,2.078002,25.247238,3.337891,41.994173,14.260693,3.527825
min,284.21,252.0,251.87,296.0,359.87,26.0
25%,287.32,283.0,256.51,336.0,375.82,26.0
50%,288.7,307.0,260.97,371.0,388.58,30.0
75%,289.81,327.0,261.95,405.0,403.33,33.0
max,292.1,338.0,263.79,435.0,407.31,37.0


In [340]:
# alaska data after november 2014
af = df.loc['Alaska'][(df.loc['Alaska']['date'] > '2014-11-02') ]
af.describe()

Unnamed: 0,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN
count,139.0,139.0,139.0,139.0,0.0,139.0
mean,297.980719,361.985612,268.081871,494.604317,,37.841727
std,5.459218,22.675718,3.424556,44.888044,,1.580034
min,290.27,338.0,260.96,437.0,,37.0
25%,292.84,347.0,264.61,461.5,,37.0
50%,302.82,352.0,269.9,481.0,,37.0
75%,303.02,363.5,270.82,502.0,,37.0
max,303.78,406.0,271.5,576.0,,41.0


In [339]:
# Oregon data before regularization
of = df.loc['Oregon'][(df.loc['Oregon']['date'] < '2014-11-04') ]
of.describe()

Unnamed: 0,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN
count,311.0,311.0,311.0,311.0,242.0,311.0
mean,209.933762,1860.012862,185.487781,1600.221865,170.173388,76.209003
std,1.689441,135.259249,2.385111,155.090563,5.117401,7.70618
min,207.66,1606.0,181.5,1329.0,162.91,61.0
25%,208.64,1754.0,183.525,1472.0,165.75,70.0
50%,209.55,1853.0,185.33,1605.0,169.42,77.0
75%,211.335,1988.0,188.46,1751.5,175.2,82.0
max,213.54,2076.0,189.15,1857.0,178.05,86.0


In [342]:
# Oregon data after regularization (stored in its own variable so the "before" frame in `of` is not overwritten)
oaf = df.loc['Oregon'][(df.loc['Oregon']['date'] > '2014-11-04') ]
oaf.describe()

Unnamed: 0,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN
count,137.0,137.0,137.0,137.0,0.0,137.0
mean,205.238905,2250.40146,180.280073,2098.992701,,98.59854
std,1.727567,148.092998,1.209925,172.594582,,8.583037
min,202.02,2083.0,178.04,1871.0,,87.0
25%,205.19,2146.0,180.27,1973.0,,91.0
50%,205.57,2204.0,180.86,2049.0,,98.0
75%,206.58,2253.0,181.09,2125.0,,99.0
max,207.47,2533.0,181.73,2406.0,,113.0


In [328]:
dcf = df.loc['District of Columbia'][(df.loc['District of Columbia']['date'] < '2014-11-04') ]
dcf.describe()

Unnamed: 0,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN
count,311.0,311.0,311.0,311.0,242.0,311.0
mean,348.30836,529.276527,290.89492,442.154341,210.563554,45.041801
std,2.173169,53.350521,2.333702,55.026052,1.951919,3.081923
min,339.15,431.0,284.26,343.0,205.81,39.0
25%,346.37,483.5,288.85,395.0,209.83,41.0
50%,348.64,526.0,291.0,442.0,209.83,47.0
75%,350.22,576.5,292.52,495.0,211.38,47.0
max,352.28,614.0,296.06,533.0,216.65,48.0


In [329]:
dacf = df.loc['District of Columbia'][(df.loc['District of Columbia']['date'] > '2014-11-04') ]
dacf.describe()

Unnamed: 0,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN
count,137.0,137.0,137.0,137.0,0.0,137.0
mean,347.872774,678.781022,282.278175,613.540146,,50.072993
std,1.481656,46.465312,0.928332,64.2004,,2.147622
min,345.01,620.0,280.5,539.0,,48.0
25%,347.69,645.0,281.83,567.0,,49.0
50%,348.47,663.0,282.1,589.0,,49.0
75%,348.86,687.0,282.55,626.0,,50.0
max,349.85,764.0,286.99,729.0,,54.0


In [330]:
calicf = df.loc['California'][(df.loc['California']['date'] > '2014-11-04') ]
calicf.describe()

Unnamed: 0,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN
count,137.0,137.0,137.0,137.0,0.0,137.0
mean,243.456715,16901.036496,189.319051,19704.693431,,1120.19708
std,0.998957,892.047804,0.828273,1350.024665,,65.682373
min,241.84,15752.0,187.85,17871.0,,1029.0
25%,242.93,16231.0,188.87,18662.0,,1069.0
50%,243.68,16646.0,189.21,19378.0,,1106.0
75%,244.15,17092.0,190.11,20015.0,,1138.0
max,244.94,18492.0,190.83,22027.0,,1232.0


In [331]:
calicf = df.loc['California'][(df.loc['California']['date'] < '2014-11-04') ]
calicf.describe()

Unnamed: 0,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN
count,311.0,311.0,311.0,311.0,242.0,311.0
mean,246.222765,14083.790997,192.1291,15473.517685,190.795992,912.73955
std,1.234917,1085.216014,0.795382,1505.169027,1.586186,76.095396
min,244.94,12021.0,190.82,12724.0,187.83,770.0
25%,245.19,13189.0,191.53,14162.5,189.42,845.5
50%,245.71,14157.0,192.01,15561.0,191.075,929.0
75%,246.98,15143.0,192.9,16923.5,192.2,989.0
max,248.82,15720.0,193.63,17815.0,193.88,1026.0


<h5> Insights from the above analysis: </h5>
    
   * The mean high-quality price went up in Alaska after the regulation change.
   * It went down in Oregon, the District of Columbia, and California.
   * The number of price reports (HighQN) increased in all four.

<h2> Hypothesis Testing </h2>

In [332]:
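sf = pd.read_csv('Weed_Price.csv', index_col=0)  # replaces the deprecated DataFrame.from_csv
#sf = wp[wp['LowQ'] > 0]

# function to calculate the sum of the high quality weed prices
def sumOfHighQualityWeedPrice(array):
    total = 0.0  # named `total` to avoid shadowing the built-in sum()
    for i in array:
        total = total + i
    return total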

# Alaska before november
alaska_data_below_november = sf[sf['date'] < '2014-11-02']
alaska_data = np.array(alaska_data_below_november.loc['Alaska']['HighQ'])
alaska_data_size = len(alaska_data)
alaska_high_quality_weed_price_mean_before_november = sumOfHighQualityWeedPrice(alaska_data)/alaska_data_size

# Alaska after november
alaska_data_after_november = sf[sf['date'] > '2014-11-02']
alaska_after_nov_data = np.array(alaska_data_after_november.loc['Alaska']['HighQ'])
alaska_after_nov_data_size = len(alaska_after_nov_data)
alaska_high_quality_weed_price_mean_after_november = sumOfHighQualityWeedPrice(alaska_after_nov_data)/alaska_after_nov_data_size

oregon_data_below_november = sf[sf['date'] < '2014-11-04']
oregon_data = np.array(oregon_data_below_november.loc['Oregon']['HighQ'])
oregon_data_size = len(oregon_data)
oregon_high_quality_weed_price_mean_before_november = sumOfHighQualityWeedPrice(oregon_data)/oregon_data_size

oregon_data_after_november = sf[sf['date'] > '2014-11-04']
oregon_after_data = np.array(oregon_data_after_november.loc['Oregon']['HighQ'])
oregon_after_data_size = len(oregon_after_data)
oregon_high_quality_weed_price_mean_after_november = sumOfHighQualityWeedPrice(oregon_after_data)/oregon_after_data_size

# District of Columbia
dc_data_below_november = sf[sf['date'] < '2014-11-04']
dc_data = np.array(dc_data_below_november.loc['District of Columbia']['HighQ'])
dc_data_size = len(dc_data)
dc_high_quality_weed_price_mean_before_november = sumOfHighQualityWeedPrice(dc_data)/dc_data_size

dc_data_after_november = sf[sf['date'] > '2014-11-04']
dc_after_data = np.array(dc_data_after_november.loc['District of Columbia']['HighQ'])
dc_after_data_size = len(dc_after_data)
dc_high_quality_weed_price_mean_after_november = sumOfHighQualityWeedPrice(dc_after_data)/dc_after_data_size

# Means of the "before" and "after" data for Alaska, Oregon and District of Columbia
# TODO: make a table for these data
# TODO: Check if the data passes requirements for z-test
# Alaska's mean price increased after regularization (about 3%, based on the means in the tables above).

# Since we want to check if the price of the weed went down, this is a one-tailed hypothesis test.
# NOTE: 1 - cdf(z) is the upper-tail probability, so small values indicate a price increase.
# NOTE: the ratios below divide the difference in means by the sample standard deviation of the
# "after" data, not by a standard error (s / sqrt(n)), so they are only rough z-like scores.
# The z-test
(oregon_high_quality_weed_price_mean_after_november - oregon_high_quality_weed_price_mean_before_november)/1.727567
oregon_stats = 1 - stats.norm.cdf(-2.7176120801038213)
oregon_stats #p-value > 0.1: There is little or no evidence against H0

# DC Z-TEST
# NOTE: the 1.0 passed to cdf below is hard-coded; it is not the ratio computed on the next line
# (which is roughly -0.29 given the means in the tables above).
(dc_high_quality_weed_price_mean_after_november - dc_high_quality_weed_price_mean_before_november)/1.481656
dc_stats = 1 - stats.norm.cdf(1.0)
dc_stats #p-value > 0.1: There is little or no evidence against H0

# Alaska Z-TEST
(alaska_high_quality_weed_price_mean_after_november - alaska_high_quality_weed_price_mean_before_november)/5.459218
alaska_stats = 1 - stats.norm.cdf(1.7253868802786354) # 0.01 < p-value < 0.05: There is strong evidence against H0
alaska_stats, dc_stats, oregon_stats

# For Oregon and DC, we fail to reject the null hypothesis.
# For Alaska, the null hypothesis is rejected because the p-value is less than 0.05.

# So, find the value where Alaska will agree with the null hypothesis

# GET STANDARD DEVIATIONS
#oregon_data_after_november.loc['Oregon']['HighQ'].describe()
#dc_data_after_november.loc['District of Columbia']['HighQ'].describe()
#alaska_data_after_november.loc['Alaska']['HighQ'].describe()



(0.042228887044626462, 0.15865525393145707, 0.99671225596586954)
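For comparison, a two-sample z-test normally divides the difference in means by a standard error rather than by a standard deviation. The sketch below is one way to do that from the Oregon summary statistics in the `describe()` tables above. The helper `two_sample_z` is hypothetical, and the calculation assumes the daily prices are independent observations, which may not hold for a time series.

```python
import numpy as np
from scipy import stats

def two_sample_z(mean_before, std_before, n_before,
                 mean_after, std_after, n_after):
    """z statistic for (after - before) and the lower-tail p-value,
    i.e. the one-tailed test of 'the price went down'."""
    standard_error = np.sqrt(std_before ** 2 / n_before + std_after ** 2 / n_after)
    z = (mean_after - mean_before) / standard_error
    p_lower = stats.norm.cdf(z)  # P(Z <= z): small when the mean dropped
    return z, p_lower

# Oregon HighQ mean, std and count before and after 2014-11-04 (from the tables above).
z, p = two_sample_z(209.933762, 1.689441, 311,
                    205.238905, 1.727567, 137)
print("z = %.2f, one-tailed p = %.3g" % (z, p))
```

Because the standard error shrinks with the sample size, the resulting z is far larger in magnitude than the ratio obtained by dividing by the standard deviation alone.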

<h2> T-tests for hypothesis testing: </h2>

In [333]:
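# t-test for a difference in means between two categories (before/after) in the dataset
# NOTE: cat1/cat2 are reassigned for each state, so only the last pair (DC) is passed to ttest_ind;
# the commented tuples record the results obtained for each pair.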

from scipy.stats import ttest_ind

cat1 = alaska_data_below_november.loc['Alaska']
cat2 = alaska_data_after_november.loc['Alaska']
#(-26.401039029301675, 3.4667367833940785e-93)

cat1 = oregon_data_below_november.loc['Oregon']
cat2 = oregon_data_after_november.loc['Oregon']
#(26.914061651283518, 1.7639735830507177e-95)

cat1 = dc_data_below_november.loc['District of Columbia']
cat2 = dc_data_after_november.loc['District of Columbia']
ttest_ind(cat1['HighQ'], cat2['HighQ'])
#(2.1368196148483887, 0.03315614856932287)

# US AVERAGE WEED PRICE = 286.35$

(2.1368196148483887, 0.03315614856932287)

<h2> T-Test to test whether means of two populations are significantly different from each other (unpaired data)</h2>

Let us test the question "Is there a difference in high-quality weed prices after the regulation change?"

In [334]:
# Mean for alaska before regularization
x1 = alaska_high_quality_weed_price_mean_before_november

# Mean for alaska after regularization
x2 = alaska_high_quality_weed_price_mean_after_november

# Let us compare the variance of each population
# Let s1^2 be the variance for alaska before regularization
variance1 = np.var(alaska_data)

# Let s2^2 be the variance for alaska after regularization
variance2 = np.var(alaska_after_nov_data)

# Comparing them tells us that the variances are unequal
variance1,variance2

(4.3041192383825368, 29.58864552559384)

* The t-statistic for comparing the means of two independent (unpaired) samples with unequal variances is given by:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
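The formula above is the statistic of Welch's t-test. The sketch below computes it by hand together with the Welch-Satterthwaite degrees of freedom and checks the result against `scipy.stats.ttest_ind(..., equal_var=False)`. The data are synthetic (normal draws loosely shaped like the Alaska before/after groups), the helper `welch_t` is hypothetical, and sample variances use `ddof=1`, which differs slightly from the `np.var` default used in this notebook.

```python
import numpy as np
from scipy import stats

def welch_t(x1, x2):
    """Welch's t statistic, Welch-Satterthwaite degrees of freedom,
    and the two-sided p-value for two independent samples."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n1, n2 = x1.size, x2.size
    v1, v2 = x1.var(ddof=1), x2.var(ddof=1)      # sample variances s1^2, s2^2
    se2 = v1 / n1 + v2 / n2                       # squared standard error
    t = (x1.mean() - x2.mean()) / np.sqrt(se2)
    dof = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    p = 2 * stats.t.sf(abs(t), dof)
    return t, dof, p

# Synthetic stand-ins for the two groups (NOT the real price data).
rng = np.random.default_rng(1)
before = rng.normal(288.5, 2.0, size=309)
after = rng.normal(298.0, 5.5, size=139)

t, dof, p = welch_t(before, after)
t_ref, p_ref = stats.ttest_ind(before, after, equal_var=False)
print("t = %.3f (scipy: %.3f), dof = %.1f" % (t, t_ref, dof))
```

The degrees of freedom always fall between min(n1, n2) − 1 and n1 + n2 − 2, and that value, not a difference of sample sizes, is what determines the reference t-distribution.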

In [343]:
n1 = alaska_data_size
n2 = alaska_after_nov_data_size

def calculateTStatistics(mean1,mean2,var1,var2,samplesize1,samplesize2):
    t = (mean1-mean2)/np.sqrt(((var1)/samplesize1) + ((var2)/samplesize2))
    return t

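# NOTE: df1 - df2 (printed below) is only the difference between the two (n - 1) values;
# it is not the degrees of freedom of a two-sample t-test.
df1 = alaska_data_size-1
df2 = alaska_after_nov_data_size- 1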

t = calculateTStatistics(x1, x2, variance1, variance2, n1, n2)
print "FOR Alaska:"
print "----------"
print "degrees of freedom for alaska is:", df1 - df2
print "t-statistic for alaska is:", t
print "The P-Value is < 0.00001. The result is significant at p < 0.05"
print "The P-Value is < 0.00001. The result is significant at p < 0.01"
print " "


# Let us do this for DC data

xdf1 = dc_data_size-1
xdf2 = dc_after_data_size- 1
dcn1 = dc_data_size
dcn2 = dc_after_data_size

# Mean for DC before regularization
dcx1 = dc_high_quality_weed_price_mean_before_november

# Mean for DC after regularization
dcx2 = dc_high_quality_weed_price_mean_after_november

# Let us compare the variance of each population
# Let s1^2 be the variance for DC before regularization
dvariance1 = np.var(dc_data)

# Let s2^2 be the variance for DC after regularization
dvariance2 = np.var(dc_after_data)

dc_t = calculateTStatistics(dcx1, dcx2, dvariance1, dvariance2, dcn1, dcn2)
print " "
print "FOR DC"
print "------"
print "degrees of freedom for DC is:", xdf1 - xdf2
print "t-statistic for DC is:", dc_t
print "The P-Value is 0.013525. The result is significant at p < 0.05"
print "The P-Value is 0.013525. The result is not significant at p < 0.01."
print " "
print " "
# Oregon data

odf1 = oregon_data_size-1
odf2 = oregon_after_data_size- 1
on1 = oregon_data_size
on2 = oregon_after_data_size

# Mean for Oregon before regularization
ox1 = oregon_high_quality_weed_price_mean_before_november

# Mean for Oregon after regularization
ox2 = oregon_high_quality_weed_price_mean_after_november

# Let us compare the variance of each population
# Let s1^2 be the variance for Oregon before regularization
ovariance1 = np.var(oregon_data)

# Let s2^2 be the variance for Oregon after regularization
ovariance2 = np.var(oregon_after_data)

o_t = calculateTStatistics(ox1, ox2, ovariance1, ovariance2, on1, on2)

print "FOR Oregon:"
print "-----------"
print "degrees of freedom for Oregon is:", odf1 - odf2
print "t-statistic for Oregon is:", o_t
print " "

print "The P-Value is < 0.00001. The result is significant at p < 0.05"
print "The P-Value is < 0.00001. The result is significant at p < 0.01"
print " "
print " "
print "------"
print "The Null hypothesis H0 that there is no difference btw means of populations is rejected at level 1%"
print "But, there is some price change at significance level of 5% when price is regularized for DC data at 5%."
print "Here, we reject the null hypothesis H0 in favor of alternative hypothesis that there is statistical difference HA"

FOR Alaska:
----------
degrees of freedom for alaska is: 170
t-statistic for alaska is: -19.7787180634
The P-Value is < 0.00001. The result is significant at p < 0.05
The P-Value is < 0.00001. The result is significant at p < 0.01
 
 
FOR DC
------
degrees of freedom for DC is: 174
t-statistic for DC is: 2.47222088427
The P-Value is 0.013525. The result is significant at p < 0.05
The P-Value is 0.013525. The result is not significant at p < 0.01.
 
 
FOR Oregon:
-----------
degrees of freedom for Oregon is: 174
t-statistic for Oregon is: 26.762901661
 
The P-Value is < 0.00001. The result is significant at p < 0.05
The P-Value is < 0.00001. The result is significant at p < 0.01
 
 
------
The Null hypothesis H0 that there is no difference btw means of populations is rejected at level 1%
But, there is some price change at significance level of 5% when price is regularized for DC data at 5%.
Here, we reject the null hypothesis H0 in favor of alternative hypothesis that there is statistic

<h2> Something to think about: Which of these gives smaller p-values? </h2>
   
   * Smaller effect size
   * Smaller standard error
   * Smaller sample size
   * Higher variance
   
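   Answer: A smaller standard error. The test statistic is the effect divided by the standard error, so a smaller standard error gives a larger statistic and hence a smaller p-value.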
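A short numerical check of this answer: the sketch below holds a hypothetical effect size and standard deviation fixed and increases the sample size, which shrinks the standard error σ/√n. All numbers are made up for illustration.

```python
import numpy as np
from scipy import stats

effect = 1.0   # hypothetical difference in means
sigma = 5.0    # hypothetical standard deviation

p_values = []
for n in (25, 100, 400):
    standard_error = sigma / np.sqrt(n)   # shrinks as n grows
    z = effect / standard_error
    p_two_sided = 2 * stats.norm.sf(z)    # two-sided p-value for this z
    p_values.append(p_two_sided)
    print("n=%d  SE=%.2f  z=%.1f  p=%.5f" % (n, standard_error, z, p_two_sided))
```

The effect stays the same in every row; only the standard error changes, and the p-value falls with it. A smaller sample size or a higher variance would do the opposite.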

<h2> Chi-Square Testing </h2>

* Chi-square testing is used to conduct hypothesis tests on categorical data.

Suppose that in 2014 a certain number of people bought low-, medium-, and high-quality weed, and that an NGO then ran a campaign to encourage people to buy high-quality weed.

In 2015 we look at the numbers again (summing the number of people who bought each of the three categories in each year) and ask whether the campaign had a significant effect.

* TODO: What is a chi-square distribution.

<h2> Chi-Square test for goodness of fit </h2>

$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$

* O is the observed frequency
* E is the expected frequency
* χ² is the chi-square statistic
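As a small worked example of the formula, the sketch below uses hypothetical counts for three quality categories (not taken from `Weed_Price.csv`). The expected counts are chosen to have the same total as the observed counts, which the goodness-of-fit test assumes. The statistic is computed by hand and then with `scipy.stats.chisquare`.

```python
import numpy as np
from scipy import stats

# Hypothetical counts of buyers per category: low, medium, high quality.
observed = np.array([35.0, 380.0, 285.0])   # this year's counts (700 in total)
expected = np.array([35.0, 385.0, 280.0])   # counts implied by last year's shares (also 700)

# Chi-square statistic straight from the formula: sum of (O - E)^2 / E.
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Same test with SciPy; degrees of freedom = number of categories - 1 = 2.
chi2, p = stats.chisquare(observed, f_exp=expected)
print("chi2 = %.4f (manual %.4f), p = %.4f" % (chi2, chi2_manual, p))
```

A large p-value such as this one means the observed counts are consistent with the expected distribution, so there would be no evidence that buying habits shifted.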

In [400]:
sf = pd.read_csv('Weed_Price.csv', index_col=0)  # replaces the deprecated DataFrame.from_csv
# number of people who bought high quality weed in Alaska in 2014
alaska_data_in_2014 = sf[sf['date'] < '2015-01-01']
alaska_data_filtered_2014 = alaska_data_in_2014[alaska_data_in_2014['date'] > '2013-12-31']
alaska_14_numbers = np.array(alaska_data_filtered_2014.loc['Alaska']['HighQN'])
alaska_data_filtered_2014.loc['Alaska']['HighQN'].describe()


count    364.000000
mean     310.491758
std       27.235782
min      252.000000
25%      286.000000
50%      316.500000
75%      331.000000
max      350.000000
Name: HighQN, dtype: float64

In [401]:
# number of people who bought low quality weed in Alaska in 2014
alaska_low_14_numbers = np.array(alaska_data_filtered_2014.loc['Alaska']['LowQN'])
alaska_data_filtered_2014.loc['Alaska']['LowQN'].describe()


count    364.000000
mean      31.074176
std        4.159690
min       26.000000
25%       26.000000
50%       32.000000
75%       34.000000
max       37.000000
Name: LowQN, dtype: float64

In [386]:
# number of people who bought high quality weed in Alaska in 2015
alaska_data_in_2015 = sf[sf['date'] > '2014-12-31']
alaska_high_15_numbers = np.array(alaska_data_in_2015.loc['Alaska']['HighQN'])
alaska_data_in_2015.loc['Alaska']['HighQN'].describe()


count     80.000000
mean     374.500000
std       22.629095
min      350.000000
25%      355.000000
50%      363.000000
75%      398.750000
max      406.000000
Name: HighQN, dtype: float64

In [402]:
# number of people who bought low quality weed in Alaska in 2015
alaska_data_in_2015 = sf[sf['date'] > '2014-12-31']
alaska_low_15_numbers = np.array(alaska_data_in_2015.loc['Alaska']['LowQN'])
alaska_data_in_2015.loc['Alaska']['LowQN'].describe()


count    80.000000
mean     38.462500
std       1.855159
min      37.000000
25%      37.000000
50%      37.000000
75%      41.000000
max      41.000000
Name: LowQN, dtype: float64

In [399]:
O = alaska_high_15_numbers

len(O)
np.mean(O)

# expected frequencies: the 2015 mean (374.5) repeated for every observation
e = np.full(len(O), 374.5)

chi_test_2015 = scipy.stats.chisquare(O,e)

O = alaska_14_numbers
e = np.full(len(O), 374.5)

chi_test_2014 = scipy.stats.chisquare(O,e)

chi_test_2015, chi_test_2014

((108.02136181575435, 0.01674509201460693), (4701.1935914552741, 0.0))

In [406]:
O = alaska_low_15_numbers

len(O)
np.mean(O)

# expected frequencies: the 2015 mean (38.46) repeated for every observation
e = np.full(len(O), 38.46)

chi_test_low_2015 = scipy.stats.chisquare(O,e)

O = alaska_low_14_numbers
e = np.full(len(O), 38.46)

chi_test_low_2014 = scipy.stats.chisquare(O,e)

chi_test_low_2015, chi_test_low_2014



((7.0693707748309933, 1.0), (679.598086323453, 1.6115618694114632e-21))

In the same way, we can find the expected value at which there was a significant effect for each state in 2014 and 2015,
for the different qualities of weed.