# Linear Regression Exercises

**Exercise 1:** The Child Health and Development Studies investigate a range of topics. One study considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. Variables in this study are as follows:
- **response variable:** birth weight in ounces (`bwt`)
- length of pregnancy in days (`gestation`)
- mother's age in years (`age`)
- mother's height in inches (`height`)
- mother's pregnancy weight in pounds (`weight`)
- mother's `smoke` status: 1 if the mother is a smoker, 0 otherwise
- child's `parity` status: 1 if first child, 0 otherwise

The first two observations and the last observation from this data set are shown below.

| id | bwt | gestation | parity | age | height | weight | smoke |
|------|------|------|------|------|------|------|------|
| 1 | 120 | 284 | 0 | 27 | 62 | 100 | 0 |
| 2 | 113 | 282 | 0 | 33 | 64 | 135 | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 1236 | 117 | 297 | 0 | 38 | 65 | 129 | 0 |
 

The summary table below shows the results of a regression model for predicting the birth weight of babies (`bwt`) based on all of the variables included in the dataset.

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|------|------|------|------|------|
| (Intercept) | -80.41 | 14.35 | -5.60 | 0.0000 |
| gestation | 0.44 | 0.03 | 15.26 | 0.0000 |
| parity | -3.33 | 1.13 | -2.95 | 0.0033 |
| age | -0.01 | 0.09 | -0.10 | 0.9170 |
| height | 1.15 | 0.21 | 5.63 | 0.0000 |
| weight | 0.05 | 0.03 | 1.99 | 0.0471 |
| smoke | -8.40 | 0.95 | -8.81 | 0.0000 |

(A) Write the equation of the regression model that includes all of the variables.

(B) Interpret the slopes of `gestation` and `age` in this context.

(C) Calculate the residual for the first observation in the data set.

(D) Is there a statistically significant relationship between `bwt` and `smoke`?

(E) The variance of the residuals is 249.28 and the variance of the birth weights of all babies in the dataset is 332.57. Calculate the R-squared and the adjusted R-squared values. Note that there are 1,236 observations in the dataset.
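
As a way to check hand calculations for parts (C) and (E), the arithmetic can be sketched in Python. This is only a sketch under stated assumptions, not an official solution: the helper names `predict_bwt` and `r_squared` are hypothetical, the coefficients are the rounded estimates from the summary table (so results are approximate), and the adjusted R-squared is assumed to follow the common convention 1 - (Var(residuals) / Var(bwt)) * (n - 1) / (n - k - 1), with k = 6 predictors.

```python
# Rounded estimates copied from the regression summary table above.
COEFS = {
    "intercept": -80.41,
    "gestation": 0.44,
    "parity": -3.33,
    "age": -0.01,
    "height": 1.15,
    "weight": 0.05,
    "smoke": -8.40,
}


def predict_bwt(obs):
    """Fitted birth weight (ounces) for one observation given as a dict."""
    return COEFS["intercept"] + sum(
        COEFS[name] * obs[name] for name in COEFS if name != "intercept"
    )


def r_squared(var_resid, var_y, n, k):
    """Return (R-squared, adjusted R-squared) from the two variances.

    n is the number of observations, k the number of predictors.
    """
    ratio = var_resid / var_y
    r2 = 1 - ratio
    adj_r2 = 1 - ratio * (n - 1) / (n - k - 1)
    return r2, adj_r2


# First observation from the data table (id = 1, observed bwt = 120).
first = {"gestation": 284, "parity": 0, "age": 27,
         "height": 62, "weight": 100, "smoke": 0}

fitted = predict_bwt(first)
residual = 120 - fitted  # residual = observed minus fitted
r2, adj_r2 = r_squared(249.28, 332.57, n=1236, k=6)

print(f"fitted = {fitted:.2f}, residual = {residual:.2f}")
print(f"R-squared = {r2:.4f}, adjusted R-squared = {adj_r2:.4f}")
```

Because the table reports estimates rounded to two decimals, the fitted value and residual may differ slightly from what software would report using the unrounded coefficients.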

## Baseball Player Statistics (MLB11)

The movie [Moneyball](https://www.imdb.com/title/tt1210166/) focuses on the "quest for the secret of success in baseball". It follows a low-budget team, the Oakland Athletics, who believed that under-used statistics, such as a player's ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these under-used statistics turned out to be much more affordable for the team.

Data Source: www.mlb.com

The data set is available as a CSV file named `mlb11.csv` [here](https://github.com/vaksakalli/datasets).

```python
import io
import os
import ssl
import warnings

import numpy as np
import pandas as pd
import requests

warnings.filterwarnings("ignore")

# so that we can see all the columns
pd.set_option('display.max_columns', None)

# fall back to an unverified HTTPS context unless PYTHONHTTPSVERIFY is set
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
        getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context

# download the CSV file and read it into a data frame
df_url = 'https://raw.githubusercontent.com/vaksakalli/datasets/master/mlb11.csv'
url_content = requests.get(df_url).content
mlb11 = pd.read_csv(io.StringIO(url_content.decode('utf-8')))

print(f'Data shape = {mlb11.shape}')
mlb11.head()
```

**Exercise 2:** Plot pairwise relationships among `runs`, `hits`, `bat_avg` and `wins`.

**Hint**: Use seaborn's [`pairplot()`](https://seaborn.pydata.org/generated/seaborn.pairplot.html) function.

**Exercise 3:** Construct a multiple regression model with `runs` as the response (dependent) variable and `bat_avg`, `wins`, and `strikeouts` as the independent variables. Compute the R-squared and adjusted R-squared values.

**Hint**: Use [`statsmodels.api`](https://www.statsmodels.org/stable/regression.html) to fit the model.

**Exercise 4:** Construct a multiple regression model with `runs` as the dependent variable again, but this time include all the independent variables (except `team`) in the model. Compute the R-squared and adjusted R-squared values again.