# CS-E-106: Data Modeling
## Fall 2019: Lab 07

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.graphics import gofplots
import pylab

## Section Problems

### (7.04) Reference to Grocery retailer Problem 6.9.
A large, national grocery retailer tracks productivity and costs of its facilities closely. Data were obtained from a single distribution center for a one-year period. Each data point for each variable represents one week of activity. The variables included are the number of cases shipped $(X_1)$, the indirect costs of the total labor hours as a percentage $(X_2)$, a qualitative predictor called holiday that is coded $1$ if the week has a holiday and $0$ otherwise $(X_3)$, and the total labor hours $(Y)$.

*Please use dataset titled **CH06PR09.txt** when applicable*

In [2]:
df_704 = pd.read_table("data/CH06PR09.txt", header=None, 
                       names = ["laborHours", "shippedCases", "indirectCosts", "holiday"])

a. Obtain the analysis of variance table that decomposes the regression sum of squares into extra sums of squares associated with $X_1$; with $X_3$, given $X_1$; and with $X_2$, given $X_1$ and $X_3$.

In [3]:
lmFit_704 = ols("laborHours ~ shippedCases+holiday+indirectCosts", data=df_704).fit()
lmFit_704.summary2()

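| | | | |
|:------------------|:----------------:|:--------------------|:---------:|
| Model: | OLS | Adj. R-squared: | 0.669 |
| Dependent Variable: | laborHours | AIC: | 667.7535 |
| Date: | 2019-11-11 23:17 | BIC: | 675.5585 |
| No. Observations: | 52 | Log-Likelihood: | -329.88 |
| Df Model: | 3 | F-statistic: | 35.34 |
| Df Residuals: | 48 | Prob (F-statistic): | 3.32e-12 |
| R-squared: | 0.688 | Scale: | 20532.0 |

| | Coef. | Std.Err. | t | P>\|t\| | [0.025 | 0.975] |
|:--------------|:---------:|:--------:|:-------:|:------:|:---------:|:---------:|
| Intercept | 4149.8872 | 195.5654 | 21.2199 | 0.0000 | 3756.6766 | 4543.0978 |
| shippedCases | 0.0008 | 0.0004 | 2.1590 | 0.0359 | 0.0001 | 0.0015 |
| holiday | 623.5545 | 62.6409 | 9.9544 | 0.0000 | 497.6064 | 749.5025 |
| indirectCosts | -13.1660 | 23.0917 | -0.5702 | 0.5712 | -59.5951 | 33.2630 |

| | | | |
|:---------------|:-----:|:------------------|:---------:|
| Omnibus: | 1.532 | Durbin-Watson: | 2.298 |
| Prob(Omnibus): | 0.465 | Jarque-Bera (JB): | 1.504 |
| Skew: | 0.332 | Prob(JB): | 0.471 |
| Kurtosis: | 2.496 | Condition No.: | 3042750.0 |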


In [4]:
anovaTable = anova_lm(lmFit_704)
anovaTable

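| | df | sum_sq | mean_sq | F | PR(>F) |
|:--------------|:----:|:---------:|:---------:|:---------:|:-----------:|
| shippedCases | 1.0 | 136366.2 | 136366.2 | 6.641687 | 0.01309038 |
| holiday | 1.0 | 2033565.0 | 2033565.0 | 99.044333 | 2.96334e-13 |
| indirectCosts | 1.0 | 6674.588 | 6674.588 | 0.325084 | 0.5712274 |
| Residual | 48.0 | 985529.7 | 20531.87 | | |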
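Each row of the sequential ANOVA table is an extra sum of squares, which can also be obtained as the drop in SSE between two nested models. The sketch below illustrates this with plain NumPy on synthetic stand-in data; the `sse` helper and the simulated variables are assumptions made for illustration and are not the grocery data.

```python
import numpy as np

def sse(y, *predictors):
    """Error sum of squares from an OLS fit of y on an intercept plus the given predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

# Synthetic stand-in data (hypothetical, NOT the CH06PR09 data).
rng = np.random.default_rng(0)
n = 52
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.integers(0, 2, size=n).astype(float)   # 0/1 indicator, like holiday
y = 4.0 + 1.5 * x1 - 0.2 * x2 + 3.0 * x3 + rng.normal(size=n)

ssto = sse(y)                                        # intercept-only SSE equals SSTO
ssr_x1 = ssto - sse(y, x1)                           # SSR(X1)
ssr_x3_x1 = sse(y, x1) - sse(y, x1, x3)              # SSR(X3 | X1)
ssr_x2_x1x3 = sse(y, x1, x3) - sse(y, x1, x3, x2)    # SSR(X2 | X1, X3)
ssr_full = ssto - sse(y, x1, x3, x2)                 # SSR(X1, X2, X3)

print(ssr_x1, ssr_x3_x1, ssr_x2_x1x3)
print(ssr_full)
```

The three extra sums of squares add up to the regression sum of squares of the full model, which is the decomposition part (a) asks for.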


b. Test whether $X_2$ can be dropped from the regression model given that $X_1$ and $X_3$ are retained. Use the $F^*$ test statistic and $\alpha = .05$. State the alternatives, decision rule, and conclusion. What is the P-value of the test?

In [5]:
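# Extra sum of squares SSR(X2 | X1, X3) and the full-model SSE from the sequential ANOVA table
ssr = anovaTable["sum_sq"]["indirectCosts"]
print(ssr)
sse = anovaTable["sum_sq"]["Residual"]
print(sse)

# F* = [SSR(X2 | X1, X3) / 1] / [SSE(X1, X2, X3) / (n - 4)]
fStar = (ssr/1) / (sse/anovaTable["df"]["Residual"])
print(fStar)

alpha = 0.05

# Critical value F(1 - alpha; 1, n - 4) used in the decision rule
db = stats.f.ppf(1 - alpha, 1, lmFit_704.df_resid)
print(db)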

6674.58808589122
985529.7463938453
0.32508428009918805
4.042652128566653


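**ANALYSIS**

**Hypotheses:**

$H_0: \beta_2 = 0$ 

$H_a: \beta_2 \neq 0$ 


**Decision Rules:**

If $F^* \leq 4.0427$, conclude $H_0$

If $F^* > 4.0427$, conclude $H_a$

**Conclusion:**

Since our test statistic is $F^* = 0.3251$ and $0.3251 \leq 4.0427$, we conclude $H_0$: $X_2$ (indirect costs) can be dropped from the model given that $X_1$ and $X_3$ are retained. The P-value of the test is $0.5712$, which is the `PR(>F)` entry for `indirectCosts` in the ANOVA table from part (a).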
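The P-value asked for in part (b) is the upper-tail probability of $F^*$ under $H_0$. A minimal sketch, assuming the $F^*$ value printed above and the degrees of freedom of this fit (1 numerator, 48 residual); the variable names are chosen for illustration only:

```python
from scipy import stats

f_star = 0.32508428009918805   # F* printed by the cell above
df_num, df_den = 1, 48         # one coefficient tested; 52 observations minus 4 parameters

# P-value = P(F(1, 48) > F*) under H0, via the survival function
p_value = stats.f.sf(f_star, df_num, df_den)
print(p_value)
```

The result should agree with the `PR(>F)` value reported for `indirectCosts` in the ANOVA table, and since it is far above $\alpha = .05$ it leads to the same decision as the critical-value rule.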

**(c)** 

Does $SSR(X_1)$ + $SSR(X_2 | X_1)$ equal $SSR(X_2)$ + $SSR(X_1|X_2)$ here? Must this always be the case?
(Does our sum of squares associated with x1 plus sum of squares associated with x2 given x1 equal sum of squares associated with x2 plus sum of squares associated with x1 given x2?) 

In [6]:
# Order X1, X2: the sequential (Type I) table gives SSR(X1) and SSR(X2 | X1)
anova_x1x2 = anova_lm(ols("laborHours~shippedCases+indirectCosts", data=df_704).fit())
ssr_x1 = anova_x1x2["sum_sq"]["shippedCases"]
ssr_x2x1 = anova_x1x2["sum_sq"]["indirectCosts"]
eq1_sum = np.round(ssr_x1+ssr_x2x1)
print(eq1_sum)

# Order X2, X1: refit with the predictors swapped to get SSR(X2) and SSR(X1 | X2)
anova_x2x1 = anova_lm(ols("laborHours~indirectCosts+shippedCases", data=df_704).fit())
ssr_x2 = anova_x2x1["sum_sq"]["indirectCosts"]
ssr_x1x2 = anova_x2x1["sum_sq"]["shippedCases"]
eq2_sum = np.round(ssr_x2+ssr_x1x2)

print(eq2_sum)

142092.0
142092.0


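**ANALYSIS**

Yes: both sums equal 142092 here. We can also show mathematically that $SSR(X_1)$ + $SSR(X_2 | X_1)$ = $SSR(X_2)$ + $SSR(X_1|X_2)$ in general, using the definition of an extra sum of squares.

**Equation 1:** by definition, $SSR(X_2 | X_1) = SSR(X_1, X_2) - SSR(X_1)$, so

$$SSR(X_1) + SSR(X_2 | X_1) = SSR(X_1, X_2)$$

**Equation 2:** by definition, $SSR(X_1 | X_2) = SSR(X_1, X_2) - SSR(X_2)$, so

$$SSR(X_2) + SSR(X_1|X_2) = SSR(X_1, X_2)$$

Combining equation 1 and equation 2:
        
$$SSR(X_1) + SSR(X_2 | X_1) = SSR(X_1, X_2) = SSR(X_2) + SSR(X_1|X_2)$$


As a result, we see that $SSR(X_1) + SSR(X_2 | X_1) = SSR(X_2) + SSR(X_1|X_2)$. This must always be the case, because both sides decompose the same quantity, $SSR(X_1, X_2)$, which does not depend on the order in which the predictors enter the model. The individual terms generally differ, however: $SSR(X_1) \neq SSR(X_1|X_2)$ unless $X_1$ and $X_2$ are uncorrelated.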
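To see why the entry order matters for the individual terms but not for their total, the following sketch computes both decompositions on synthetic data with deliberately correlated predictors. The `sse` helper and the simulated variables are assumptions for illustration, not the lab's data.

```python
import numpy as np

def sse(y, *predictors):
    """Error sum of squares from an OLS fit of y on an intercept plus the given predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

# Synthetic data (hypothetical); x2 is correlated with x1 on purpose.
rng = np.random.default_rng(1)
n = 52
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 2.0 + 1.0 * x1 + 0.5 * x2 + rng.normal(size=n)

ssto = sse(y)
ssr_x1 = ssto - sse(y, x1)                        # SSR(X1)
ssr_x2_given_x1 = sse(y, x1) - sse(y, x1, x2)     # SSR(X2 | X1)
ssr_x2 = ssto - sse(y, x2)                        # SSR(X2)
ssr_x1_given_x2 = sse(y, x2) - sse(y, x1, x2)     # SSR(X1 | X2)

print(ssr_x1 + ssr_x2_given_x1, ssr_x2 + ssr_x1_given_x2)   # the two totals
print(ssr_x1, ssr_x1_given_x2)                              # the individual X1 terms
```

The two totals coincide because each equals $SSTO - SSE(X_1, X_2)$, while the marginal and conditional contributions of $X_1$ differ when the predictors are correlated.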


### (7.38) Projects. Reference to SENIC data set in Appendix C.1.
The primary objective of the Study on the Efficacy of Nosocomial Infection Control (SENIC Project) was to determine whether infection surveillance and control programs have reduced the rates of nosocomial (hospital-acquired) infection in United States hospitals. This data set consists of a random sample of 113 hospitals selected from the original 338 hospitals surveyed. Each line of the dataset has an identification number and provides information on 11 variables for a single hospital. The data presented here are for the 1975-76 study period.

*Please use dataset titled **APPENC01.txt** when applicable*


For predicting the average length of stay of patients in a hospital $(Y)$, it has been decided to include age $(X_1)$ and infection risk $(X_2)$ as predictor variables. The question now is whether an additional predictor variable would be helpful in the model and, if so, which variable would be most helpful. Assume that a first-order multiple regression model is appropriate.


**(a)** 

For each of the following variables, calculate the coefficient of partial determination given that $X_1$ and $X_2$ are included in the model: routine culturing ratio $(X_3)$, average daily census $(X_4)$, number of nurses $(X_5)$, and available facilities and services $(X_6)$.

In [7]:
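cols = ["Y","X1","X2","X3","X4","X5","X6"]
# Assumes APPENC01.txt is whitespace-delimited with no header row, so split on runs of whitespace.
# usecols (0-based) picks length of stay, age, infection risk, routine culturing ratio,
# average daily census, number of nurses, and available facilities and services.
df_738 = pd.read_table("data/APPENC01.txt", header=None, sep=r"\s+",
                       usecols=[1,2,3,4,9,10,11], 
                       names = cols
                      )
df_738

**Note:** The cell above assumes the SENIC file is whitespace-delimited and has no header row. Problem 7.38 parts (a), (b) and (c) are very similar to the questions covered in HW6. Please refer to the respective Jupyter notebook for HW6.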
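For reference, the coefficient of partial determination for a candidate $X_k$ given $X_1$ and $X_2$ is $R^2_{Yk|12} = SSR(X_k | X_1, X_2) / SSE(X_1, X_2)$. The sketch below computes it for two candidates on synthetic data; the `sse` helper, the simulated predictors, and the candidate names are assumptions for illustration and do not use the SENIC values.

```python
import numpy as np

def sse(y, *predictors):
    """Error sum of squares from an OLS fit of y on an intercept plus the given predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

# Synthetic data (hypothetical, NOT the SENIC data): only "X4" truly affects y.
rng = np.random.default_rng(3)
n = 113
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
candidates = {"X3": rng.normal(size=n), "X4": rng.normal(size=n)}
y = 1.0 + 0.5 * x1 + 0.8 * x2 + 0.7 * candidates["X4"] + rng.normal(size=n)

sse_reduced = sse(y, x1, x2)                 # SSE(X1, X2)
partial_r2 = {}
for name, xk in candidates.items():
    sse_full = sse(y, x1, x2, xk)            # SSE(X1, X2, Xk)
    partial_r2[name] = (sse_reduced - sse_full) / sse_reduced

best = max(partial_r2, key=partial_r2.get)   # candidate with the largest partial R^2
print(partial_r2, best)
```

The candidate with the largest coefficient of partial determination gives the greatest proportional reduction in the error left over after $X_1$ and $X_2$.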

### (8.21) In a regression analysis of on-the-job head injuries of warehouse laborers caused by falling objects, $Y$ is a measure of severity of the injury, $X_1$ is an index reflecting both the weight of the object and the distance it fell, and $X_2$ and $X_3$ are indicator variables for nature of head protection worn at the time of the accident, coded as follows:

| Type of Protection | $X_2$ | $X_3$ |
|:------------------:|:-----:|:-----:|
|      Hard Hat      |   1   |   0   |
|      Bump Cap      |   0   |   1   |
|        None        |   0   |   0   |


The response function to be used in the study is $E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3$.

**(a)** 

Develop the response function for each type of protection category.


<!-- Input solution below -->

| Protection Category | Response Function |
|:-------------------:|:----------------:|
|      Hard Hat      |$E\{Y\} = (\beta_0+\beta_2) + \beta_1 X_1$|
|      Bump Cap      |$E\{Y\} = (\beta_0+\beta_3) + \beta_1 X_1$|
|        None        |$E\{Y\} = \beta_0 + \beta_1 X_1$|

<!-- End of solution -->

The response function used in the study implies that the regression of injury severity on $X_1$ is linear, with the same slope $\beta_1$ for all three protection categories. The coefficients $\beta_2$ and $\beta_3$ indicate how much higher or lower the response functions for hard hats and bump caps are than the response function for the no-protection category ('None') at any given level of $X_1$. Thus, $\beta_2$ and $\beta_3$ measure the differential effects of the protection classes relative to the baseline class. Because the model contains no interaction terms, these differential effects do not depend on $X_1$.
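The three response functions follow from substituting each indicator coding into $E\{Y\}$. A small sketch with hypothetical coefficient values (chosen only for illustration, not estimated from any data) makes the shared slope and shifted intercepts explicit:

```python
# Hypothetical coefficients for illustration only (not estimates from any data).
beta0, beta1, beta2, beta3 = 10.0, 2.0, -4.0, -1.5

# Indicator coding (X2, X3) from the table in the problem statement.
coding = {"Hard Hat": (1, 0), "Bump Cap": (0, 1), "None": (0, 0)}

def expected_severity(x1, x2, x3):
    """E{Y} = beta0 + beta1*X1 + beta2*X2 + beta3*X3."""
    return beta0 + beta1 * x1 + beta2 * x2 + beta3 * x3

lines = {}
for name, (x2, x3) in coding.items():
    intercept = expected_severity(0.0, x2, x3)          # value of E{Y} at X1 = 0
    slope = expected_severity(1.0, x2, x3) - intercept  # change in E{Y} per unit of X1
    lines[name] = (intercept, slope)
    print(f"{name}: E{{Y}} = {intercept} + {slope} * X1")
```

Every category has slope $\beta_1$, while the intercepts are $\beta_0 + \beta_2$, $\beta_0 + \beta_3$, and $\beta_0$ respectively.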

**(b)** 

For each of the following questions, specify the alternatives $H_0$ and $H_a$ for the appropriate test: (1) With $X_1$ fixed, does wearing a bump cap reduce the expected severity of injury as compared with wearing no protection? (2) With $X_1$ fixed, is the expected severity of injury the same when wearing a hard hat as when wearing a bump cap?

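1. With $X_1$ fixed, does wearing a bump cap reduce the expected severity of injury as compared with wearing no protection? Since $\beta_3$ is the difference between the bump-cap and no-protection response functions, a reduction corresponds to $\beta_3 < 0$. The null and alternative hypotheses are as follows:
$$
\begin{aligned}
H_0&: \beta_3 \geq 0\\
H_a&: \beta_3 < 0
\end{aligned}
$$

2. With $X_1$ fixed, is the expected severity of injury the same when wearing a hard hat as when wearing a bump cap? The two response functions coincide exactly when their intercepts are equal, that is, when $\beta_2 = \beta_3$. The null and alternative hypotheses are as follows:

$$
\begin{aligned}
H_0&: \beta_2 = \beta_3\\
H_a&: \beta_2 \neq \beta_3
\end{aligned}
$$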
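Both sets of hypotheses can be tested with $t$ statistics built from the estimated coefficients and their covariance matrix. The sketch below is one possible way to do this, using NumPy and SciPy on synthetic data; the simulated data, the coefficient values used to generate it, and all variable names are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data (hypothetical): group 0 = none, 1 = hard hat, 2 = bump cap.
rng = np.random.default_rng(2)
n = 60
x1 = rng.uniform(1, 10, size=n)
group = rng.integers(0, 3, size=n)
x2 = (group == 1).astype(float)
x3 = (group == 2).astype(float)
y = 10 + 2 * x1 - 4 * x2 - 1.5 * x3 + rng.normal(size=n)

# OLS fit and estimated covariance matrix of the coefficients: MSE * (X'X)^{-1}
X = np.column_stack([np.ones(n), x1, x2, x3])
xtx_inv = np.linalg.inv(X.T @ X)
b = xtx_inv @ X.T @ y
resid = y - X @ b
df_resid = n - X.shape[1]
cov_b = (resid @ resid / df_resid) * xtx_inv

# (1) H0: beta3 >= 0 vs Ha: beta3 < 0 -> lower-tailed t test on b3
t_bump = b[3] / np.sqrt(cov_b[3, 3])
p_bump = stats.t.cdf(t_bump, df_resid)

# (2) H0: beta2 = beta3 vs Ha: beta2 != beta3 -> two-sided t test on the contrast b2 - b3
c = np.array([0.0, 0.0, 1.0, -1.0])
t_diff = (c @ b) / np.sqrt(c @ cov_b @ c)
p_diff = 2 * stats.t.sf(abs(t_diff), df_resid)

print(t_bump, p_bump)
print(t_diff, p_diff)
```

The variance of the contrast, `c @ cov_b @ c`, equals $s^2\{b_2\} + s^2\{b_3\} - 2s\{b_2, b_3\}$, so the covariance between the two indicator coefficients must be taken into account in the second test.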

### (8.38) Projects. Reference to SENIC data set in Appendix C.1.
The primary objective of the Study on the Efficacy of Nosocomial Infection Control (SENIC Project) was to determine whether infection surveillance and control programs have reduced the rates of nosocomial (hospital-acquired) infection in United States hospitals. This data set consists of a random sample of 113 hospitals selected from the original 338 hospitals surveyed. Each line of the dataset has an identification number and provides information on 11 variables for a single hospital. The data presented here are for the 1975-76 study period.

In [13]:
# header=None (not False) marks the file as having no header row;
# sep=r"\s+" assumes the columns are separated by runs of whitespace.
df_838  = pd.read_table("data/APPENC01.txt", header=None, sep=r"\s+")
df_838