# Assignment 4 Solution - Yashveer Beniwal

In [16]:
import numpy as np
from scipy import stats
from statsmodels.formula.api import ols
import pandas as pd
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go

## Question 1
(20 Points) Four common respiration treatment methods are being compared for effectiveness following bypass surgery (the results are in the attached CSV file - Assignment_4_ANOVA_data.csv). Does respiration method affect arterial oxygen level post-bypass? 

#### a. State the null and alternate hypotheses

$$H_{0}: \mu_{1} = \mu_{2} = \mu_{3} = \mu_{4}$$
$H_{A}$: the means are not all equal, i.e. at least one respiration treatment has a different mean arterial oxygen level than the others.

#### b. Declare an appropriate alpha value

$$\alpha = 0.0001$$

In this case our test statistic is the F statistic, calculated as:

$$F = \frac{MSG}{MSE}$$

#### c. Calculate the test statistic

Relevant formulas, where $\overline{x}$ is the overall mean, $df_{G} = k - 1$ and $df_{E} = n - k$:
$$MSG = \frac{\sum_{i=1}^{k} n_{i}(\overline{x}_{i} - \overline{x})^{2}}{df_{G}}$$
$$MSE = \frac{(n_{1} - 1)s_{1}^{2} + (n_{2} - 1)s_{2}^{2} + ... + (n_{k} - 1)s_{k}^{2}}{df_{E}}$$
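
The two formulas can also be evaluated directly from per-group summaries, without the raw data. The following is a minimal sketch under that assumption: the helper name `anova_from_summary` is hypothetical, and the `(n, mean, sd)` triples are copied from the printed output further down in this notebook.

```python
# Group summaries (n, mean, sd) copied from the printed output of this notebook.
groups = [
    (10, 89.7, 7.9449494789),
    (12, 87.3333333333, 6.24257134278),
    (9, 79.2222222222, 4.23608834238),
    (8, 94.375, 2.9730936269),
]

def anova_from_summary(groups):
    """Return (MSG, MSE, F) for a one-way ANOVA given (n, mean, sd) per group."""
    k = len(groups)
    n_total = sum(n for n, _, _ in groups)
    # Overall mean is the size-weighted average of the group means.
    grand_mean = sum(n * mean for n, mean, _ in groups) / n_total
    # Between-group sum of squares (numerator of MSG).
    ssg = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    # Within-group sum of squares (numerator of MSE).
    sse = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    msg = ssg / (k - 1)
    mse = sse / (n_total - k)
    return msg, mse, msg / mse

msg, mse, f_stat = anova_from_summary(groups)
print(round(msg, 2), round(mse, 2), round(f_stat, 2))
```

This should agree with the values computed from the raw data, since $SSE = SST - SSG$ is the same quantity as the pooled within-group sum of squares.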


In [17]:
total_groups = 4
# Read the data using Pandas
dataFrame = pd.read_csv("assignment_4_ANOVA_data.csv")
treatment_1 = [a for a in dataFrame.iloc[0:, 0].values if a > 0]
treatment_2 = [a for a in dataFrame.iloc[0:, 1].values if a > 0]
treatment_3 = [a for a in dataFrame.iloc[0:, 2].values if a > 0]
treatment_4 = [a for a in dataFrame.iloc[0:, 3].values if a > 0]
overall = np.concatenate((treatment_1,treatment_2,treatment_3,treatment_4), axis=0)

treatment_1_n = len(treatment_1)
treatment_2_n = len(treatment_2)
treatment_3_n = len(treatment_3)
treatment_4_n = len(treatment_4)
overall_n = len(overall)
print("n(Size of Groups) : ", treatment_1_n, treatment_2_n, treatment_3_n, treatment_4_n, overall_n)

treatment_1_mean = np.mean(treatment_1)
treatment_2_mean = np.mean(treatment_2)
treatment_3_mean = np.mean(treatment_3)
treatment_4_mean = np.mean(treatment_4)
overall_mean = np.mean(overall)
print("Mean : ", treatment_1_mean, treatment_2_mean, treatment_3_mean, treatment_4_mean, overall_mean)

treatment_1_sd = np.std(treatment_1, ddof=1)
treatment_2_sd = np.std(treatment_2, ddof=1)
treatment_3_sd = np.std(treatment_3, ddof=1)
treatment_4_sd = np.std(treatment_4, ddof=1)
overall_sd = np.std(overall, ddof=1)
print("Standard Deviation : ", treatment_1_sd, treatment_2_sd, treatment_3_sd, treatment_4_sd, overall_sd)

n(Size of Groups) :  10 12 9 8 39
Mean :  89.7 87.3333333333 79.2222222222 94.375 87.5128205128
Standard Deviation :  7.9449494789 6.24257134278 4.23608834238 2.9730936269 7.68755588725



| Summary | n | mean | sd |
| ------------- |:-------------:| -----:| -----:|
| **treatment 1** | 10 | 89.70 | 7.94 |
| **treatment 2** | 12 | 87.33 | 6.24 |
| **treatment 3** | 9  | 79.22 | 4.24 |
| **treatment 4** | 8  | 94.38 | 2.97 |
| **overall** | 39 | 87.51 | 7.69 |


In [18]:
SST = np.sum(np.square(np.subtract(overall, overall_mean)))  # Sum of squares total
SSG = (treatment_1_n * ((treatment_1_mean - overall_mean)**2)) + (treatment_2_n * ((treatment_2_mean - overall_mean)**2)) + (treatment_3_n * ((treatment_3_mean - overall_mean)**2)) + (treatment_4_n * ((treatment_4_mean - overall_mean)**2))
SSE = SST - SSG
percentage_variablity_due_to_group = SSG/SST * 100
print(SST,SSG, SSE, percentage_variablity_due_to_group)

2245.74358974 1043.54636752 1202.19722222 46.4677433473


About 46.5% of the total variability in arterial oxygen level (SSG/SST) is explained by differences between the treatment groups.

In [19]:
df_total = overall_n - 1
df_group = total_groups - 1
df_error = df_total - df_group
print(df_total, df_group, df_error)

38 3 35


In [20]:
MSG = SSG/df_group
MSE = SSE/df_error
F_statistics = MSG/MSE
p_value = 1 - stats.f.cdf(F_statistics, dfn=df_group, dfd=df_error)
print(MSG, MSE, F_statistics, p_value)

347.848789174 34.3484920635 10.1270468739 6.03131672903e-05


In [6]:
# function in stats library to verify result
stats.f_oneway(treatment_1, treatment_2, treatment_3, treatment_4)

F_onewayResult(statistic=10.127046873871553, pvalue=6.0313167290323707e-05)


| Summary |    | Degrees of Freedom | Sum of Squares | Mean Square | F value | p-value |
| ------------- |:-------------:| -----:| -----:| -----:| -----:|-----:|
| Group |  Treatment  | 3  | 1043.55 | 347.85 | 10.13 | 0.00006 |
| Error |  Residuals | 35 | 1202.20 | 34.35  |       |         |
|       |  Total     | 38 | 2245.74 |        |       |         |


$$MSG = 347.85$$
$$MSE = 34.35$$

#### d. Determine the p-value corresponding to the test statistic
$$p\text{-value} = 6.03 \times 10^{-5} \approx 0.00006$$

#### e. Present a conclusion

Since the p-value ($\approx 0.00006$) is less than $\alpha = 0.0001$, we reject the null hypothesis. The data provide convincing evidence that at least one pair of population means differ, so respiration method does affect arterial oxygen level post-bypass. The ANOVA alone does not tell us which treatments differ.

## Question 2 
(20 Points) OpenIntro Statistics 7.4 - Identify relationships, Part II. For each of the six plots on page 357 (image attached here - 7_4_Image.png), identify the strength of the relationship (e.g. weak, moderate, or strong) in the data and whether fitting a linear model would be reasonable.

#### a. Strong, linear

#### b. Strong, nonlinear

#### c. Strong, linear

#### d. Weak, linear

#### e. Weak, linear

#### f. Moderate, linear

## Question 3
(20 Points) OpenIntro Statistics 7.8 - Match the correlation, Part II. Match the calculated correlations (below) to the corresponding scatterplot (image attached here - 7_8_Image.png) .

#### a) r = 0.49: scatterplot 2

#### b) r = -0.48: scatterplot 4

#### c) r = -0.03: scatterplot 3

#### d) r = -0.85: scatterplot 1
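
The matching relies on the sign of $r$ giving the direction of the trend and its magnitude giving how tightly the points follow a line. A minimal sketch of how $r$ is computed is shown here; the function name `pearson_r` and the data are hypothetical and are not read from the exercise plots.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 5]      # upward trend with scatter, so r is positive
falling = [5, 4, 5, 4, 2]     # same values reversed, so r is negative
perfect = [3, 5, 7, 9, 11]    # exactly y = 2x + 1, so r is 1

print(pearson_r(x, rising), pearson_r(x, falling), pearson_r(x, perfect))
```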

## Question 4
Husbands and Wives - Using the attached dataset (husbandsWives.csv), please answer the following questions:

In [14]:
husbandWives = pd.read_csv("husbandsWives.csv")
husbandAgeData = husbandWives[(husbandWives.Age_Husb_at_Marriage > 0) & (husbandWives.Age_Wife_At_Marriage > 0)].Age_Husb_at_Marriage
wifeAgeData = husbandWives[(husbandWives.Age_Husb_at_Marriage > 0) & (husbandWives.Age_Wife_At_Marriage > 0)].Age_Wife_At_Marriage

trace = go.Scatter(
    x = husbandAgeData,
    y = wifeAgeData,
    mode = 'markers'
)
trace2 = go.Scatter(
    x = np.arange(60),
    y = np.add(np.multiply(np.arange(60), 0.7148), 8.7954),
    mode = 'lines',
    name = 'lines'
)


layout= go.Layout(
    title= 'Scatterplot x vs y',
    xaxis= dict(
        title= 'Husband Age',
    ),
    yaxis=dict(
        title= 'Wife Age',
    ),
    showlegend= False
)

fig= go.Figure(data=[trace, trace2], layout=layout)
iplot(fig)


In [15]:
df=pd.DataFrame({"husb": husbandAgeData, "wife": wifeAgeData})
model = ols("wife ~ husb", data=df).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                   wife   R-squared:                       0.507
Model:                            OLS   Adj. R-squared:                  0.505
Method:                 Least Squares   F-statistic:                     172.1
Date:                Sat, 10 Mar 2018   Prob (F-statistic):           1.78e-27
Time:                        10:29:09   Log-Likelihood:                -463.78
No. Observations:                 169   AIC:                             931.6
Df Residuals:                     167   BIC:                             937.8
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.0713      1.396      3.634      0.0

#### (a) Write the equation of the regression line for predicting wife’s age from husband’s age.
```
Slope = 0.7148
Intercept = 8.7954
wife_age = 0.7148 * husband_age + 8.7954
```
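
For reference, a least-squares slope and intercept can be computed from the closed-form expressions $b_1 = S_{xy}/S_{xx}$ and $b_0 = \overline{y} - b_1\overline{x}$. The sketch below assumes a small hypothetical data set (not the contents of husbandsWives.csv), and `fit_line` is an illustrative helper name.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical ages at marriage, for illustration only.
husband_ages = [20, 25, 30, 35, 40]
wife_ages = [19, 24, 27, 33, 37]

slope, intercept = fit_line(husband_ages, wife_ages)
print(slope, intercept)
```

In the notebook itself, the fitted coefficients are available as `model.params` on the `ols` result, which avoids typing them in by hand.
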
#### (b) What does the slope of the equation from (a) tell us?

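The slope tells us that, on average, the predicted age of the wife at marriage increases by about 0.71 years for each additional year of the husband's age at marriage, so older husbands tend to have older wives.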

#### (c) Given that R-squared = 0.88, what is the correlation of ages in this data set?

The correlation $r$ is the square root of R-squared, taking the sign of the slope (positive here):
$$r = \sqrt{R^2}$$
$$R^2 = 0.88$$
$$r = \sqrt{0.88} \approx 0.938$$
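
As a quick check of this arithmetic, here is a minimal sketch; the helper name `correlation_from_r_squared` is hypothetical, and the slope is passed in only to supply the sign of $r$.

```python
import math

def correlation_from_r_squared(r_squared, slope):
    """Return r for a simple linear regression: sqrt(R^2) with the sign of the slope."""
    return math.copysign(math.sqrt(r_squared), slope)

r = correlation_from_r_squared(0.88, 0.7148)
print(round(r, 3))  # 0.938
```
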

#### (d) You meet a married man from Britain who is 55 years old. What would you predict his wife’s age to be?
```
wife_age = 0.7148 * husband_age + 8.7954
wife_age = 0.7148 * 55 + 8.7954
wife_age = 48.11, i.e. about 48 years
```
#### (e) You meet another married man from Britain who is 85 years old. Would it be wise to use the same linear model to predict his wife’s age? Explain.

No. An age of 85 is far outside the range of husbands' ages the model was fitted on, $17 \le \text{husbandAge} \le 52$, so the prediction would be an extrapolation and the linear relationship may not hold there.
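
The same idea can be built into the prediction step. This is a minimal sketch, assuming the slope and intercept from (a) and the observed range quoted above; `predict_wife_age` is a hypothetical helper that flags extrapolation rather than refusing to predict.

```python
SLOPE = 0.7148            # from answer (a)
INTERCEPT = 8.7954        # from answer (a)
OBSERVED_RANGE = (17, 52) # husband's ages covered by the data, from answer (e)

def predict_wife_age(husband_age):
    """Return (predicted wife age, True if husband_age lies outside the observed range)."""
    low, high = OBSERVED_RANGE
    is_extrapolation = not (low <= husband_age <= high)
    return SLOPE * husband_age + INTERCEPT, is_extrapolation

for age in (40, 85):
    prediction, extrapolated = predict_wife_age(age)
    note = "extrapolation, treat with caution" if extrapolated else "within observed range"
    print(f"husband {age}: predicted wife age {prediction:.1f} ({note})")
```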