# statsmodels intro
The statsmodels library is one of the most widely used Python libraries for statistical testing and inference. It also bundles a number of example datasets, available through its `datasets` module.

In [43]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [38]:
crime = sm.datasets.statecrime.load_pandas().data
crime.head(10)

| state | violent | murder | hs_grad | poverty | single | white | urban |
|---|---|---|---|---|---|---|---|
| b'Alabama' | 459.9 | 7.1 | 82.1 | 17.5 | 29.0 | 70.0 | 48.65 |
| b'Alaska' | 632.6 | 3.2 | 91.4 | 9.0 | 25.5 | 68.3 | 44.46 |
| b'Arizona' | 423.2 | 5.5 | 84.2 | 16.5 | 25.7 | 80.0 | 80.07 |
| b'Arkansas' | 530.3 | 6.3 | 82.4 | 18.8 | 26.3 | 78.4 | 39.54 |
| b'California' | 473.4 | 5.4 | 80.6 | 14.2 | 27.8 | 62.7 | 89.73 |
| b'Colorado' | 340.9 | 3.2 | 89.3 | 12.9 | 21.4 | 84.6 | 76.86 |
| b'Connecticut' | 300.5 | 3.0 | 88.6 | 9.4 | 25.0 | 79.1 | 84.83 |
| b'Delaware' | 645.1 | 4.6 | 87.4 | 10.8 | 27.6 | 71.9 | 68.71 |
| b'District of Columbia' | 1348.9 | 24.2 | 87.1 | 18.4 | 48.0 | 38.7 | 100.0 |
| b'Florida' | 612.6 | 5.5 | 85.3 | 14.9 | 26.6 | 76.9 | 87.44 |


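## Formula vs NumPy interface
Statsmodels has two interfaces: one takes R-style string formulas, and the other takes NumPy arrays (or pandas objects) directly. With the array interface you add the intercept column yourself using `sm.add_constant`; the formula interface includes an intercept automatically.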

In [57]:
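# NumPy/array interface: pass the response and the design matrix directly
X = crime[['hs_grad', 'urban']]
X = sm.add_constant(X)  # OLS does not add an intercept column on its own
y = crime.murder.values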

results = sm.OLS(y, X).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.260
Model:                            OLS   Adj. R-squared:                  0.229
Method:                 Least Squares   F-statistic:                     8.419
Date:                Sat, 08 Jul 2017   Prob (F-statistic):           0.000734
Time:                        16:09:16   Log-Likelihood:                -130.17
No. Observations:                  51   AIC:                             266.3
Df Residuals:                      48   BIC:                             272.1
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         41.8326     12.125      3.450      0.0

In [58]:
# formula interface: the intercept is included automatically
results = smf.ols('murder ~ hs_grad + urban', data=crime).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                 murder   R-squared:                       0.260
Model:                            OLS   Adj. R-squared:                  0.229
Method:                 Least Squares   F-statistic:                     8.419
Date:                Sat, 08 Jul 2017   Prob (F-statistic):           0.000734
Time:                        16:09:48   Log-Likelihood:                -130.17
No. Observations:                  51   AIC:                             266.3
Df Residuals:                      48   BIC:                             272.1
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     41.8326     12.125      3.450      0.0

## Resources
+ [statsmodels documentation](http://www.statsmodels.org/stable/index.html)