### Challenge 1
The months of service give the time interval during which a ship can acquire damage. This can be thought of as "exposure", and the column can be used as an offset.
Model the damage incident counts with a Poisson regression.

In [57]:
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices
from pandas.io.stata import StataReader
# chisqprob was removed from scipy.stats; chi2.sf gives the same tail probability
from scipy.stats import chi2
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error


In [None]:
# StataReader.data() is deprecated (and absent from recent pandas); read() returns the DataFrame
reader = StataReader('ships.dta')
df = reader.read()

In [10]:
df.head()

  type construction operation  months  damage
0    A      1960-64   1960-74   127.0     0.0
1    A      1960-64   1975-79    63.0     0.0
2    A      1965-70   1960-74  1095.0     3.0
3    A      1965-70   1975-79  1095.0     4.0
4    A      1970-74   1960-74  1512.0     6.0


In [42]:
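# Pull the two-digit start and end years out of ranges such as "1960-64"
# (the pattern matches "60-64") and use their midpoint as a numeric feature.
year_range = r"([0-9]{2})-([0-9]{2})"

df['construction_mean'] = (df.construction.str.extract(year_range, expand=True)
                           .astype(int).mean(axis=1))
df['operation_mean'] = (df.operation.str.extract(year_range, expand=True)
                        .astype(int).mean(axis=1))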
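
To see what the extraction does on a single value, here is a minimal standalone sketch that uses only the standard library. The helper name `range_midpoint` is made up for illustration; it applies the same pattern as the cell above to one string at a time.

```python
import re

# Same pattern as in the cell above: two digits, a hyphen, two digits.
YEAR_RANGE = re.compile(r"([0-9]{2})-([0-9]{2})")

def range_midpoint(label):
    """Return the midpoint of the two-digit years in a label such as '1960-64'."""
    match = YEAR_RANGE.search(label)
    if match is None:
        raise ValueError("no two-digit year range in %r" % label)
    start, end = (int(group) for group in match.groups())
    return (start + end) / 2

for label in ["1960-64", "1965-70", "1960-74"]:
    print(label, "->", range_midpoint(label))
```

Note that the leading century digits are never captured, so the feature is on a two-digit scale (for example 62.0 rather than 1962).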



In [43]:
df.head()

  type construction operation  months  damage  construction_mean  operation_mean
0    A      1960-64   1960-74   127.0     0.0               62.0            67.0
1    A      1960-64   1975-79    63.0     0.0               62.0            77.0
2    A      1965-70   1960-74  1095.0     3.0               67.5            67.0
3    A      1965-70   1975-79  1095.0     4.0               67.5            77.0
4    A      1970-74   1960-74  1512.0     6.0               72.0            67.0


In [44]:
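# months enters here as an ordinary covariate (no offset yet)
y, X = dmatrices('damage ~ type + construction_mean + operation_mean + months', data=df, return_type='dataframe')

# The log link is the Poisson default, so it does not need to be passed explicitly
pois_m = sm.GLM(y, X, family=sm.families.Poisson())
# Fit the model by maximum likelihood (IRLS)
pois_results = pois_m.fit()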

print(pois_results.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                 damage   No. Observations:                   34
Model:                            GLM   Df Residuals:                       26
Model Family:                 Poisson   Df Model:                            7
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -108.37
Date:                Tue, 09 Aug 2016   Deviance:                       118.88
Time:                        16:26:30   Pearson chi2:                     112.
No. Iterations:                     9                                         
                        coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept            -6.2302      1.426     -4.370      0.000        -9.025    -3.436
type[T.B]             0.9475   

### Answer
> The Pearson chi2 is 112, with a deviance of 118.88 on 26 residual degrees of freedom.

### Challenge 2
The months of service give the time interval during which a ship can acquire damage. This can be thought of as "exposure", and the column can be used as an offset.
Try your model with months of service as the offset. Does it perform better?
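
Written out, and assuming the usual log link, the offset model looks like this. The notation is introduced here and is not part of the original notebook: $t_i$ is the months of service, $x_i$ the remaining covariates, and $Y_i$ the damage count for row $i$.

$$
\log \mathbb{E}[Y_i] = \log t_i + x_i^\top \beta
\quad\Longleftrightarrow\quad
\frac{\mathbb{E}[Y_i]}{t_i} = \exp\!\left(x_i^\top \beta\right)
$$

The coefficient on $\log t_i$ is fixed at 1 rather than estimated, so $\beta$ describes the damage rate per month of service instead of the raw count.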

In [46]:
y,X = dmatrices('damage ~ type + construction_mean + operation_mean', data=df, return_type='dataframe')
logmonths = np.log(df.months)
p_glm_offset = sm.GLM(y, X, offset=logmonths, family=sm.families.Poisson()).fit()

print(p_glm_offset.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                 damage   No. Observations:                   34
Model:                            GLM   Df Residuals:                       27
Model Family:                 Poisson   Df Model:                            6
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -78.076
Date:                Tue, 09 Aug 2016   Deviance:                       58.286
Time:                        16:27:04   Pearson chi2:                     64.6
No. Iterations:                     9                                         
                        coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept           -11.2451      1.031    -10.911      0.000       -13.265    -9.225
type[T.B]            -0.5397   

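### Question: Did it perform better?
> Yes. With the offset, the Pearson chi2 drops from 112 to 64.6 and the deviance from 118.88 to 58.286, while the log-likelihood rises from -108.37 to -78.076, even though the model has one fewer estimated parameter.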
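
One hedged way to put the two fits on a common scale is AIC, computed from the log-likelihoods printed in the two summaries above. The parameter counts are taken as Df Model plus one for the intercept, and the `aic` helper is written here for illustration (fitted statsmodels results also expose an `aic` attribute).

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2*logL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

# Values copied from the summaries above: (log-likelihood, Df Model + intercept)
aic_covariate = aic(-108.37, 7 + 1)   # months as an ordinary covariate
aic_offset = aic(-78.076, 6 + 1)      # log(months) as an offset

print("AIC, months as covariate:", round(aic_covariate, 3))
print("AIC, months as offset:   ", round(aic_offset, 3))
```

The offset model comes out lower (about 170 versus about 233), which points the same way as the chi2 and deviance comparison.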

### Challenge 3
Now separate your data (even though it's only 34 rows) into a training and test set (with a 25% split your test set will only be 9 rows), and check how well you predict (you can look at mean absolute error or mean squared error using sklearn.metrics).

In [56]:
train, test = train_test_split(df, test_size= .25, random_state=4444)

y_train, X_train = dmatrices("damage ~ type + construction_mean + operation_mean", 
                             data=train, return_type='dataframe')

y_test,  X_test =  dmatrices("damage ~ type + construction_mean + operation_mean", 
                             data=test, return_type='dataframe')


poisson_model3 = sm.GLM(y_train, X_train, 
                        offset = np.log(train['months']),
                        family = sm.families.Poisson()).fit()

len(train), len(test)


(25, 9)

In [59]:
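# NOTE: the model was fit with an offset, but none is passed here, so statsmodels
# uses an offset of 0 and y_pred is the expected damage for a single month of service.
# Pass offset=np.log(test['months']) to predict counts for each ship's actual exposure.
y_pred = poisson_model3.predict(X_test)
mean_absolute_error(y_test, y_pred)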


5.4412987707392837

In [60]:
mean_squared_error(y_test, y_pred)

62.280577572572106
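
Because the model in this challenge was fit with an offset, a prediction on the count scale is the predicted monthly rate multiplied by each ship's months of service. The sketch below shows that relationship in plain Python with made-up numbers: `linear_predictor`, `months`, and `observed` are hypothetical and are not taken from the ships data, and `mae` and `mse` are small stand-ins for the sklearn metrics.

```python
import math

# Hypothetical linear predictors x_i' * beta (log of the monthly damage rate)
linear_predictor = [-6.0, -5.5, -7.0]
months = [127.0, 1095.0, 63.0]   # hypothetical exposure
observed = [0.0, 4.0, 0.0]       # hypothetical damage counts

# Without the offset: expected damage in a single month of service
rate_per_month = [math.exp(eta) for eta in linear_predictor]

# With the offset log(months): expected damage over the whole exposure
expected_count = [math.exp(eta + math.log(t))
                  for eta, t in zip(linear_predictor, months)]

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

print("MAE with offset:   ", mae(observed, expected_count))
print("MAE without offset:", mae(observed, rate_per_month))
print("MSE with offset:   ", mse(observed, expected_count))
```

Errors computed without the offset compare observed counts against per-month rates, so they mostly reflect the size of the counts themselves rather than the quality of the model.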

### Challenge 4
Deviance. Compute the difference between the deviance of the null model (called the null deviance) and the deviance of your model. You can do this in one of two ways:
- Fit the null model yourself. The null model uses none of the explanatory variables; it is just a model with a mean guess. Fit a Poisson regression with only a constant, get its deviance, and take the difference of deviances between your model and this null model.
- Use the attributes of statsmodels.genmod.generalized_linear_model.GLMResults.

Check if this difference is extreme enough that we can reject the null hypothesis. If we can't reject it, we cannot say that this model tells us more than the trivial null model. To calculate the p-value (the probability of getting a deviance difference at least as extreme as this under the null hypothesis), we need to do a hypothesis test.
Is your model better than the null model?

In [61]:
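# Null model: intercept only. This rebinds y_train, X_train, y_test and X_test.
y_train, X_train = dmatrices("damage ~ 1", 
                             data=train, return_type='dataframe')

y_test,  X_test  = dmatrices("damage ~ 1", 
                             data=test, return_type='dataframe')

# The offset is left disabled, so this null model ignores months of service;
# its deviance is therefore not directly comparable to a model fit with the offset.
poisson_model_null = sm.GLM(y_train, X_train, 
                            #offset=np.log(train.months),
                            family=sm.families.Poisson())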

In [73]:
poisson_model_null.fit().summary()

Dep. Variable:                 damage   No. Observations:                   25
Model:                            GLM   Df Residuals:                       24
Model Family:                 Poisson   Df Model:                            0
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -300.47
Date:                Tue, 09 Aug 2016   Deviance:                       528.71
Time:                        16:58:06   Pearson chi2:                     616.
No. Iterations:                     8                                         
                        coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept             2.5080      0.057     43.943      0.000         2.396     2.620
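
The comparison this challenge asks for is not carried out in the cells above, so here is a minimal, dependency-free sketch of the deviance-difference (likelihood-ratio) test on made-up data. Everything in it is an assumption for illustration: the counts are hypothetical, the "full" model is a single two-level factor (whose Poisson maximum-likelihood fit is just the group means), and `chi2_sf` is a hand-written stand-in for `scipy.stats.chi2.sf` that only handles integer degrees of freedom. On a fitted statsmodels GLM, the `deviance`, `null_deviance`, and `df_model` attributes of the results object should provide the same ingredients.

```python
import math

def poisson_deviance(y, mu):
    """2 * sum(y*log(y/mu) - (y - mu)), taking y*log(y/mu) as 0 when y == 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)              # starting value for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))   # starting value for df = 1
    # Recurrence for the regularized upper incomplete gamma function Q(a, x/2)
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Hypothetical counts for two groups
group_a = [0, 1, 2, 1]
group_b = [5, 7, 6, 8]
y = group_a + group_b

# Null model: one common mean for every observation
null_mu = [sum(y) / len(y)] * len(y)

# Full model: intercept plus a group dummy; its fitted values are the group means
mean_a = sum(group_a) / len(group_a)
mean_b = sum(group_b) / len(group_b)
full_mu = [mean_a] * len(group_a) + [mean_b] * len(group_b)

null_deviance = poisson_deviance(y, null_mu)
model_deviance = poisson_deviance(y, full_mu)
difference = null_deviance - model_deviance
extra_params = 1  # the full model has one more coefficient than the null model

p_value = chi2_sf(difference, extra_params)
print("deviance difference:", difference)
print("p-value:", p_value)
```

A small p-value means the drop in deviance is larger than chance alone would explain, so the null model is rejected in favor of the fitted one. The test only makes sense when both models are fit to the same rows and treat the offset the same way.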


### Challenge 5
Now, instead of a Poisson regression, do an ordinary least squares regression with log Y. Compare the models. Are the coefficients close? Do they perform similarly?

In [76]:
# Add 0.1 before taking the log because some rows have zero damage incidents
y_train, X_train = dmatrices('np.log(damage + 0.1) ~ type + construction_mean + operation_mean',
                             data=train, return_type='dataframe')

y_test, X_test = dmatrices('np.log(damage + 0.1) ~ type + construction_mean + operation_mean',
                           data=test, return_type='dataframe')

linear_model = sm.OLS(y_train, X_train).fit()


In [79]:
linear_model.summary()

Dep. Variable:   np.log(damage + 0.1)   R-squared:                       0.715
Model:                            OLS   Adj. R-squared:                   0.62
Method:                 Least Squares   F-statistic:                     7.539
Date:                Tue, 09 Aug 2016   Prob (F-statistic):           0.000374
Time:                        17:00:58   Log-Likelihood:                -40.262
No. Observations:                  25   AIC:                             94.52
Df Residuals:                      18   BIC:                             103.1
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept           -15.0671      5.175     -2.912      0.009       -25.939    -4.195
type[T.B]             3.3320      0.830      4.016      0.001         1.589     5.075
type[T.C]            -0.7603      0.923     -0.823      0.421        -2.700     1.179
type[T.D]            -0.9786      0.830     -1.179      0.254        -2.723     0.766
type[T.E]            -0.7275      1.014     -0.717      0.482        -2.859     1.404
construction_mean     0.2399      0.061      3.957      0.001         0.113     0.367
operation_mean       -0.0113      0.061     -0.187      0.854        -0.139     0.116

Omnibus:                        0.538   Durbin-Watson:                   1.833
Prob(Omnibus):                  0.764   Jarque-Bera (JB):                0.602
Skew:                          -0.046   Prob(JB):                         0.74
Kurtosis:                       2.245   Cond. No.                       1810.0
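
To compare this fit with the Poisson model on the same scale, the OLS predictions would need to be mapped back from log(damage + 0.1) to counts. Below is a minimal sketch of that back-transformation, using hypothetical predicted values rather than output from the model above; the helper names are made up for illustration.

```python
import math

SHIFT = 0.1  # same constant that was added before taking the log

def to_log_scale(damage):
    """Forward transform used for the OLS response."""
    return math.log(damage + SHIFT)

def to_count_scale(log_prediction):
    """Inverse transform, clipped at 0 because a count cannot be negative."""
    return max(math.exp(log_prediction) - SHIFT, 0.0)

hypothetical_log_predictions = [-2.0, 0.5, 1.8]
for value in hypothetical_log_predictions:
    print(value, "->", to_count_scale(value))
```

Exponentiating a predicted mean of the log is not the same as predicting the mean count, so this gives only a rough comparison. The OLS model here also has no exposure term, unlike the Poisson model with the offset.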
