# Inferential Statistics

Are there variables that are particularly significant in terms of explaining the answer to your project question?

The significant variables are crime subcategory, area and neighborhood, and report sums.

Are there strong correlations between pairs of independent variables or between an independent and a dependent variable?

We predict a strong relationship between the yearly sum of report numbers and crime over time.
Below we examine whether these variables are in fact correlated.

Null hypothesis: The sum of report numbers is the same across years.

Alternative hypothesis: The sum of report numbers differs across years.

In [1]:
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.tools import FigureFactory as FF

import numpy as np
import pandas as pd
import scipy
from pandas import DataFrame

import statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols
import datetime as datetime 

In [2]:
# Importing the dataframe (same one used in Tableau)
df = pd.read_csv('C:/Users/jennywhojennyvu/Desktop/Capstone_2/Crime_Data.csv')
df.head()

,Report Number,Occurred Date,Occurred Time,Reported Date,Reported Time,Crime Subcategory,Primary Offense Description,Precinct,Sector,Beat,Neighborhood
0,1975000000000.0,12/16/1975,900.0,12/16/1975,1500.0,BURGLARY-RESIDENTIAL,BURGLARY-FORCE-RES,SOUTH,R,R3,LAKEWOOD/SEWARD PARK
1,1976000000000.0,1/1/1976,1.0,1/31/1976,2359.0,SEX OFFENSE-OTHER,SEXOFF-INDECENT LIBERTIES,UNKNOWN,,,UNKNOWN
2,1979000000000.0,1/28/1979,1600.0,2/9/1979,1430.0,CAR PROWL,THEFT-CARPROWL,EAST,G,G2,CENTRAL AREA/SQUIRE PARK
3,19810000000000.0,8/22/1981,2029.0,8/22/1981,2030.0,HOMICIDE,HOMICIDE-PREMEDITATED-WEAPON,SOUTH,S,S2,BRIGHTON/DUNLAP
4,1981000000000.0,2/14/1981,2000.0,2/15/1981,435.0,BURGLARY-RESIDENTIAL,BURGLARY-FORCE-RES,SOUTHWEST,W,W3,ROXHILL/WESTWOOD/ARBOR HEIGHTS


In [3]:
# Parse 'Reported Date' and store its year in a new 'Year' column.
# Dates that cannot be parsed become NaT because of errors='coerce'.
df['Year'] = pd.to_datetime(df['Reported Date'], errors='coerce').dt.year
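
As a minimal sketch of what this step does, the toy series below (made-up values, not rows from Crime_Data.csv) shows how `errors='coerce'` turns an unparseable date into a missing value and how the year is then extracted. The explicit `format` string is an assumption based on the `month/day/year` dates visible in the output above.

```python
import pandas as pd

# Toy input: two valid dates and one invalid entry (illustrative only)
dates = pd.Series(["12/16/1975", "1/31/1976", "not a date"])

# Invalid entries become NaT instead of raising an error
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

# .dt.year yields NaN for NaT, so the result is a float column
years = parsed.dt.year

# Optional: a nullable integer dtype keeps the years as whole numbers
years_int = years.astype("Int64")
print(years_int)
```

Rows with a missing year are silently left out of a later `groupby('Year')`, so it can be worth checking `years.isna().sum()` before aggregating.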

In [4]:
df.head()

,Report Number,Occurred Date,Occurred Time,Reported Date,Reported Time,Crime Subcategory,Primary Offense Description,Precinct,Sector,Beat,Neighborhood,Year
0,1975000000000.0,12/16/1975,900.0,12/16/1975,1500.0,BURGLARY-RESIDENTIAL,BURGLARY-FORCE-RES,SOUTH,R,R3,LAKEWOOD/SEWARD PARK,1975
1,1976000000000.0,1/1/1976,1.0,1/31/1976,2359.0,SEX OFFENSE-OTHER,SEXOFF-INDECENT LIBERTIES,UNKNOWN,,,UNKNOWN,1976
2,1979000000000.0,1/28/1979,1600.0,2/9/1979,1430.0,CAR PROWL,THEFT-CARPROWL,EAST,G,G2,CENTRAL AREA/SQUIRE PARK,1979
3,19810000000000.0,8/22/1981,2029.0,8/22/1981,2030.0,HOMICIDE,HOMICIDE-PREMEDITATED-WEAPON,SOUTH,S,S2,BRIGHTON/DUNLAP,1981
4,1981000000000.0,2/14/1981,2000.0,2/15/1981,435.0,BURGLARY-RESIDENTIAL,BURGLARY-FORCE-RES,SOUTHWEST,W,W3,ROXHILL/WESTWOOD/ARBOR HEIGHTS,1981


In [5]:
df_group = df.groupby(['Year']).count()
df_group

Year,Report Number,Occurred Date,Occurred Time,Reported Date,Reported Time,Crime Subcategory,Primary Offense Description,Precinct,Sector,Beat,Neighborhood
1975,1,1,1,1,1,1,1,1,1,1,1
1976,1,1,1,1,1,1,1,1,0,0,1
1979,1,1,1,1,1,1,1,1,1,1,1
1981,2,2,2,2,2,2,2,2,2,2,2
1988,1,1,1,1,1,1,1,1,1,1,1
1993,1,1,1,1,1,1,1,1,1,1,1
1994,1,1,1,1,1,1,1,1,1,1,1
1996,1,1,1,1,1,1,1,1,1,1,1
1999,1,0,0,1,0,1,1,1,0,0,1
2000,1,1,1,1,1,1,1,1,1,1,1
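
The columns in this output can differ within a year (for example, 1999 shows 0 for several columns) because `count()` tallies non-null values per column, whereas `size()` counts rows. A minimal sketch on a made-up toy frame, not the actual crime data:

```python
import numpy as np
import pandas as pd

# Toy data with one missing 'Beat' value (illustrative only)
toy = pd.DataFrame({
    "Year": [1999, 2000, 2000],
    "Beat": [np.nan, "R3", "G2"],
    "Neighborhood": ["UNKNOWN", "LAKEWOOD", "BRIGHTON"],
})

# count(): number of non-null entries in each column, per year
non_null = toy.groupby("Year").count()

# size(): number of rows per year, regardless of missing values
rows = toy.groupby("Year").size()

print(non_null)
print(rows)
```

If the goal is simply the number of reports per year, `size()` gives a single unambiguous count.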


In [6]:
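# Select the years 2008 through 2018 (label slicing with .loc includes both ends); see
# https://chrisalbon.com/machine_learning/preprocessing_dates_and_times/select_date_and_time_ranges/
# The Year index is numeric, so slice with numbers rather than strings.
df = df_group.loc[2008:2018]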
df

Year,Report Number,Occurred Date,Occurred Time,Reported Date,Reported Time,Crime Subcategory,Primary Offense Description,Precinct,Sector,Beat,Neighborhood
2008,42766,42766,42766,42766,42766,42581,42766,42764,42430,42436,42766
2009,45121,45121,45121,45121,45121,45078,45121,45120,44837,44846,45121
2010,43413,43413,43413,43413,43413,43383,43413,43413,43122,43128,43413
2011,41271,41271,41271,41271,41271,41267,41271,41271,40987,40997,41271
2012,40959,40959,40959,40959,40959,40959,40959,40959,40724,40733,40959
2013,45553,45553,45553,45553,45553,45553,45553,45553,45252,45251,45553
2014,49345,49345,49345,49345,49345,49345,49345,49345,49127,49130,49345
2015,47688,47688,47688,47688,47688,47688,47688,47688,47481,47481,47688
2016,49255,49255,49255,49255,49255,49255,49255,49254,48984,48986,49255
2017,50423,50423,50423,50423,50423,50423,50423,50422,50034,50037,50423
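
Label-based slicing with `.loc` includes both endpoints and, on a sorted index, tolerates endpoints that are not present. A minimal sketch with made-up counts (the series and its values are hypothetical):

```python
import pandas as pd

# Toy yearly counts on a sorted integer index (illustrative only)
counts = pd.Series(
    [10, 20, 30, 40],
    index=pd.Index([2007, 2008, 2017, 2019], name="Year"),
)

# Keeps every label between 2008 and 2018, inclusive; 2018 itself is absent
subset = counts.loc[2008:2018]
print(subset)
```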


In [7]:
df.shape

(11, 11)

In [8]:
# Keep only the two count columns needed for the model
df = df[['Report Number', 'Crime Subcategory']]

In [9]:
df.head()

Year,Report Number,Crime Subcategory
2008,42766,42581
2009,45121,45078
2010,43413,43383
2011,41271,41267
2012,40959,40959


In [10]:
df.columns = ['Report_Number', 'Crime_Subcategory']
df.head()

Year,Report_Number,Crime_Subcategory
2008,42766,42581
2009,45121,45078
2010,43413,43383
2011,41271,41267
2012,40959,40959


# One-Way ANOVA

We now turn to the Analysis of Variance (ANOVA) test. A one-way ANOVA tests whether the mean of a single response (here, crime reports) is equal across the levels of one factor (here, year).
A two-way ANOVA, by contrast, examines how two factors relate to the response in question.

Null hypothesis: The number of crime reports is the same across years.

Alternative hypothesis: The number of crime reports differs across years.
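
A one-way ANOVA across years needs several observations per year (for example, monthly report counts), so that variation within a year can be compared with variation between years. The sketch below is illustrative only: the counts are made-up numbers, not values from Crime_Data.csv, and `scipy.stats.f_oneway` is one possible tool rather than the method used in this notebook.

```python
from scipy import stats

# Hypothetical monthly report counts for three years (made-up values)
counts_2015 = [3400, 3550, 3480, 3620, 3510, 3450]
counts_2016 = [3950, 4020, 4100, 3980, 4050, 4010]
counts_2017 = [4200, 4150, 4260, 4180, 4230, 4210]

# One-way ANOVA: are the group means equal across years?
f_stat, p_value = stats.f_oneway(counts_2015, counts_2016, counts_2017)

print(f"F = {f_stat:.2f}, p-value = {p_value:.3g}")

# A small p-value is evidence against equal means across years
if p_value < 0.05:
    print("Reject the null hypothesis of equal yearly means.")
else:
    print("Fail to reject the null hypothesis.")
```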


In [11]:
# Fit an ordinary least squares model of yearly report counts on yearly subcategory counts
m = ols('Report_Number ~ Crime_Subcategory', data=df).fit()

In [12]:
print(m.summary())

                            OLS Regression Results                            
Dep. Variable:          Report_Number   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 4.776e+04
Date:                Tue, 01 Oct 2019   Prob (F-statistic):           4.48e-18
Time:                        22:11:15   Log-Likelihood:                -58.394
No. Observations:                  11   AIC:                             120.8
Df Residuals:                       9   BIC:                             121.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept           282.7396    210.34


kurtosistest only valid for n>=20 ... continuing anyway, n=11



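The coefficient for Crime_Subcategory is positive, and the overall fit is significant (Prob (F-statistic) = 4.48e-18, R-squared = 1.000).
Both variables are yearly counts of non-null entries, so this near-perfect fit mostly shows that the two counts move together; it is not a direct test of differences between years.
The yearly counts themselves do vary, from 40,959 reports in 2012 to 50,423 in 2017 among the years shown. This is congruent with our speculation, so we reject the null hypothesis in favor of the alternative that crime reports differ between years.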
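
The trend over time can also be checked directly by regressing the yearly report count on the year itself. This is a minimal sketch, not part of the original analysis: it uses the ten yearly counts printed in the 2008 to 2017 table above and `scipy.stats.linregress`.

```python
from scipy import stats

# Yearly report counts copied from the 2008-2017 rows of the table above
years = [2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017]
reports = [42766, 45121, 43413, 41271, 40959, 45553, 49345, 47688, 49255, 50423]

# Simple linear regression of report count on year
trend = stats.linregress(years, reports)

print(f"slope = {trend.slope:.1f} reports per year")
print(f"r = {trend.rvalue:.3f}, p-value = {trend.pvalue:.4f}")
```

A positive slope with a small p-value would support an upward trend in reports. With only ten yearly points, the estimate should be read as rough.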