## Model 11 Homework
### Please complete the following questions. 

#### 1. Complete the code below to import the four libraries we've used most commonly.

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import statsmodels.formula.api as sm

#### 2. Import the "babies.csv" file and name it df. 

<b>Background Info</b>

    The Child Health and Development Studies considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. The goal is to model the weight of the infants (bwt, in ounces) using variables including length of pregnancy in days (gestation), mother's age in years (age), mother's height in inches (height), whether the child was the first born (parity), mother's pregnancy weight in pounds (weight), and whether the mother was a smoker (smoke).

<b>Variables</b>

    case - id number
    bwt - birthweight, in ounces
    gestation - length of gestation, in days
    parity - binary indicator for a first pregnancy (0=first pregnancy)
    age - mother's age in years
    height - mother's height in inches
    weight - mother's weight in pounds
    smoke - binary indicator for whether the mother smokes

In [2]:
df = pd.read_csv("babies.csv")

In [3]:
df.head()

Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke
0,1,120,284.0,0,27.0,62.0,100.0,0.0
1,2,113,282.0,0,33.0,64.0,135.0,0.0
2,3,128,279.0,0,28.0,64.0,115.0,1.0
3,4,123,,0,36.0,69.0,190.0,0.0
4,5,108,282.0,0,23.0,67.0,125.0,1.0


#### 3. Check the shape of the dataset. How many columns and rows are there?

In [4]:
df.shape # 1236 rows, 8 columns

(1236, 8)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1236 entries, 0 to 1235
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   case       1236 non-null   int64  
 1   bwt        1236 non-null   int64  
 2   gestation  1223 non-null   float64
 3   parity     1236 non-null   int64  
 4   age        1234 non-null   float64
 5   height     1214 non-null   float64
 6   weight     1200 non-null   float64
 7   smoke      1226 non-null   float64
dtypes: float64(5), int64(3)
memory usage: 77.4 KB


#### 4. Check the first 10 rows and the last 10 rows. Drop the column "case". 

In [6]:
(df.head(10)).drop(columns="case")

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
0,120,284.0,0,27.0,62.0,100.0,0.0
1,113,282.0,0,33.0,64.0,135.0,0.0
2,128,279.0,0,28.0,64.0,115.0,1.0
3,123,,0,36.0,69.0,190.0,0.0
4,108,282.0,0,23.0,67.0,125.0,1.0
5,136,286.0,0,25.0,62.0,93.0,0.0
6,138,244.0,0,33.0,62.0,178.0,0.0
7,132,245.0,0,23.0,65.0,140.0,0.0
8,120,289.0,0,25.0,62.0,125.0,0.0
9,143,299.0,0,30.0,66.0,136.0,1.0


In [7]:
(df.tail(10)).drop(columns="case")

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
1226,109,244.0,1,21.0,63.0,102.0,1.0
1227,103,278.0,0,30.0,60.0,87.0,1.0
1228,118,276.0,0,34.0,64.0,116.0,0.0
1229,127,290.0,0,27.0,65.0,121.0,0.0
1230,132,270.0,0,27.0,65.0,126.0,0.0
1231,113,275.0,1,27.0,60.0,100.0,0.0
1232,128,265.0,0,24.0,67.0,120.0,0.0
1233,130,291.0,0,30.0,65.0,150.0,1.0
1234,125,281.0,1,21.0,65.0,110.0,0.0
1235,117,297.0,0,38.0,65.0,129.0,0.0


#### 5. Is there any missing data? Check!

In [8]:
df.isnull().sum()

case          0
bwt           0
gestation    13
parity        0
age           2
height       22
weight       36
smoke        10
dtype: int64

In [9]:
df.isna().sum()

case          0
bwt           0
gestation    13
parity        0
age           2
height       22
weight       36
smoke        10
dtype: int64

In [10]:
df.duplicated().sum() # check for dupes

0

#### 6. The amount of missing data is small considering the size of our dataset. Drop all rows in the dataset that have missing data.

In [11]:
df.dropna(inplace = True)

In [12]:
df.shape # 62 rows with missing data were dropped (1236 - 1174)

(1174, 8)
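
As a quick check on the claim that little data is lost, the share of dropped rows can be computed from the two shapes reported above. This is a minimal sketch; the variable names are illustrative and not part of the assignment.

```python
# Row counts taken from the df.shape outputs before and after dropna()
rows_before = 1236
rows_after = 1174

rows_dropped = rows_before - rows_after
pct_dropped = 100 * rows_dropped / rows_before

print(f"{rows_dropped} rows dropped ({pct_dropped:.1f}% of the data)")
```

Roughly 5% of the rows are removed, which supports dropping them rather than imputing values.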

#### 7. Check each of your numeric columns for outliers - pick one method and use it for all the columns. 

In [13]:
# use z scores to remove outliers

# make copy and check shape
dfz = df.copy()
print(dfz.shape)

(1174, 8)


In [14]:
def zscore_outliers(column,df_z,df_):
    # calculate the absolute z-score of the column, using the data in df_
    df_z[f"zscore_{column}"] = np.abs(stats.zscore(df_[column]))
    # return the index labels of the rows whose |z-score| is greater than 3
    return df_z.loc[df_z[f"zscore_{column}"] > 3].index

In [15]:
# define the columns of interest (numeric, non-categorical)
columns = ["bwt","gestation","age","height","weight"]
for col in columns:
    dfz.drop(zscore_outliers(column=col,df_z=dfz,df_=df),inplace=True)

In [16]:
dfz.shape # 40 rows containing outliers were dropped; 5 z-score columns were added

(1134, 13)

In [17]:
#delete zscore columns
dfz = dfz.drop(columns = ["zscore_bwt","zscore_gestation","zscore_age","zscore_height","zscore_weight"])

In [18]:
dfz.head()

Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke
0,1,120,284.0,0,27.0,62.0,100.0,0.0
1,2,113,282.0,0,33.0,64.0,135.0,0.0
2,3,128,279.0,0,28.0,64.0,115.0,1.0
4,5,108,282.0,0,23.0,67.0,125.0,1.0
5,6,136,286.0,0,25.0,62.0,93.0,0.0
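
The same z-score rule can also be applied without temporary `zscore_` columns. The sketch below is one possible alternative, not the method used in the cells above: it runs on made-up toy data, computes the z-scores for all listed columns at once, and keeps only the rows where every column is within 3 standard deviations. The names `toy`, `cols`, `keep`, and `toy_clean` are hypothetical.

```python
import pandas as pd

# Made-up toy data (not the babies dataset): one extreme bwt value in the last row
toy = pd.DataFrame({
    "bwt": list(range(100, 120)) + [400],
    "gestation": list(range(270, 291)),
})

cols = ["bwt", "gestation"]

# Population z-scores (ddof=0), the same default that scipy.stats.zscore uses
z = (toy[cols] - toy[cols].mean()) / toy[cols].std(ddof=0)

# Keep a row only if every listed column is within 3 standard deviations
keep = (z.abs() <= 3).all(axis=1)
toy_clean = toy[keep]

print(toy.shape, "->", toy_clean.shape)
```

Because the mask is built before any rows are removed, the z-scores are all based on the unfiltered data, and no helper columns need to be deleted afterwards.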


#### 8. Print the descriptive statistics for each numeric column. What is the average age of the mothers? What is the average gestation period?

In [19]:
dfz.describe().drop(columns = ["parity","smoke"])

Unnamed: 0,case,bwt,gestation,age,height,weight
count,1134.0,1134.0,1134.0,1134.0,1134.0,1134.0
mean,624.557319,119.661376,279.42328,27.206349,64.040564,127.218695
std,356.104467,17.860299,13.930125,5.802403,2.468179,18.385539
min,1.0,65.0,232.0,15.0,57.0,87.0
25%,319.25,109.0,272.0,23.0,62.0,114.0
50%,624.5,120.0,280.0,26.0,64.0,125.0
75%,935.75,131.0,288.0,31.0,66.0,137.0
max,1236.0,174.0,324.0,44.0,71.0,190.0


#### 9. Let's model birthweight based on the characteristics of the mother. But first... 

We want to easily distinguish between the numeric and categorical variables. Replace the values 0/1 in the "parity" and "smoke" column with meaningful labels (i.e. smokes, doesn't smoke).

In [20]:
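# assign the result back instead of calling replace(..., inplace=True) on a column selection
# parity: 0 = first pregnancy (see the variable list), so "no" here labels a first pregnancy
dfz["parity"] = dfz["parity"].replace([0,1],["no","yes"])
dfz["smoke"] = dfz["smoke"].replace([0,1],["nonsmoker","smoker"])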

In [21]:
dfz.head()

Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke
0,1,120,284.0,no,27.0,62.0,100.0,nonsmoker
1,2,113,282.0,no,33.0,64.0,135.0,nonsmoker
2,3,128,279.0,no,28.0,64.0,115.0,smoker
4,5,108,282.0,no,23.0,67.0,125.0,smoker
5,6,136,286.0,no,25.0,62.0,93.0,nonsmoker
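
Since the variable list codes `parity` as 0 = first pregnancy, labels such as "no"/"yes" can be easy to misread. The sketch below shows one possible alternative using `Series.map` with more descriptive labels on made-up rows. The label text and the names `toy`, `parity_labels`, and `smoke_labels` are assumptions for illustration; the rest of this notebook keeps the "no"/"yes" labels.

```python
import pandas as pd

# Made-up rows that mimic the coding of the two indicator columns
toy = pd.DataFrame({"parity": [0, 1, 0], "smoke": [0.0, 1.0, 1.0]})

# Hypothetical labels; 0 = first pregnancy follows the variable list
parity_labels = {0: "first pregnancy", 1: "not first pregnancy"}
smoke_labels = {0.0: "nonsmoker", 1.0: "smoker"}

# map() returns a new Series, so the result is assigned back to the column
toy["parity"] = toy["parity"].map(parity_labels)
toy["smoke"] = toy["smoke"].map(smoke_labels)

print(toy)
```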


#### 10. Run a correlation matrix with your dataset. Which variables are correlated with birthweight? 

Describe the strength of the correlation between all the numeric variables and birthweight. 

In [22]:
(dfz.drop(columns = "case")).corr(numeric_only = True)

Unnamed: 0,bwt,gestation,age,height,weight
bwt,1.0,0.411186,0.026259,0.212844,0.166821
gestation,0.411186,1.0,-0.05402,0.072687,0.045045
age,0.026259,-0.05402,1.0,-0.001016,0.16076
height,0.212844,0.072687,-0.001016,1.0,0.46382
weight,0.166821,0.045045,0.16076,0.46382,1.0


In [23]:
# bwt and gestation: r = 0.411, a moderate positive correlation
# bwt and height (r = 0.213) / weight (r = 0.167): weak positive correlations
# bwt and age: r = 0.026, essentially no correlation
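
To make the strength descriptions repeatable, the correlations with `bwt` can be passed through a small labeling helper. This is a sketch only: the cutoffs (0.1, 0.3, 0.5) are an assumed rule of thumb, not a standard taken from the assignment, and `describe_strength` is a hypothetical helper name.

```python
# Correlations with bwt, copied from the matrix above
corr_with_bwt = {
    "gestation": 0.411186,
    "height": 0.212844,
    "weight": 0.166821,
    "age": 0.026259,
}

def describe_strength(r):
    """Label |r| using assumed rule-of-thumb cutoffs (0.1, 0.3, 0.5)."""
    size = abs(r)
    if size < 0.1:
        return "negligible"
    if size < 0.3:
        return "weak"
    if size < 0.5:
        return "moderate"
    return "strong"

labels = {name: describe_strength(r) for name, r in corr_with_bwt.items()}
for name, label in labels.items():
    print(f"{name}: r = {corr_with_bwt[name]:.3f} ({label})")
```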

#### 11. Determine the relationship between birthweight and the categorical variables: parity and smoke. 

Use the groupby function to determine if there are any differences between birthweight and the different groups.  Does it seem like there is a relationship between these variables and birthweight?

In [24]:
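# the grouping key is taken from the original df, so the groups show the numeric 0/1 codes;
# pandas aligns the key to dfz on the row index
dfz["bwt"].groupby(df["parity"]).mean()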

parity
0    120.174489
1    118.254125
Name: bwt, dtype: float64

In [25]:
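# the key comes from the original df, so the groups show 0.0/1.0 rather than the labels;
# pandas aligns it to dfz on the row index
dfz["bwt"].groupby(df["smoke"]).mean()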

smoke
0.0    123.337192
1.0    113.927765
Name: bwt, dtype: float64
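
One rough way to judge whether these group differences matter is to compare each gap in mean birthweight with the overall standard deviation of `bwt` reported by `describe()`. The sketch below only redoes that arithmetic with the printed values; it is not a formal significance test, and the variable names are illustrative.

```python
# Group means copied from the groupby outputs above (ounces)
mean_nonsmoker, mean_smoker = 123.337192, 113.927765
mean_parity_0, mean_parity_1 = 120.174489, 118.254125

# Overall standard deviation of bwt, copied from describe()
bwt_std = 17.860299

smoke_gap = mean_nonsmoker - mean_smoker
parity_gap = mean_parity_0 - mean_parity_1

print(f"smoke:  {smoke_gap:.2f} oz, {smoke_gap / bwt_std:.2f} std")
print(f"parity: {parity_gap:.2f} oz, {parity_gap / bwt_std:.2f} std")
```

The smoking gap is about half a standard deviation, while the parity gap is about a tenth of one, which suggests smoking is the more informative of the two categorical variables.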

#### 12. Let's construct your regression model. Firstly, which variables do you plan to include in your model, and why? 

In the space below, write your justification for why you are including each variable. 

In [26]:
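# bwt - dependent variable

# included:
# gestation - strongest correlation with bwt (r = 0.411)
# smoke - mean bwt differs by about 9.4 ounces between nonsmokers and smokers, roughly half of the bwt std (17.9 ounces)
# height - weak correlation (r = 0.213), but the second strongest among the numeric variables

# excluded:
# parity - small difference in mean bwt (about 1.9 ounces) between first and later pregnancies
# weight - weak correlation with bwt (r = 0.167)
# age - essentially no correlation with bwt (r = 0.026)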

#### 13. Construct your regression model and print the summary. 

Write out your full interpretation of the regression results. If you are not happy with the results, tweak your model and run it again. 

In [27]:
## create the regression model
result = sm.ols('bwt ~ gestation + C(smoke) +  height', data = dfz).fit()
## print the regression model summary
result.summary()

0,1,2,3
Dep. Variable:,bwt,R-squared:,0.254
Model:,OLS,Adj. R-squared:,0.252
Method:,Least Squares,F-statistic:,127.9
Date:,"Sun, 06 Aug 2023",Prob (F-statistic):,2.43e-71
Time:,17:56:53,Log-Likelihood:,-4711.6
No. Observations:,1134,AIC:,9431.0
Df Residuals:,1130,BIC:,9451.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-99.7909,14.592,-6.839,0.000,-128.422,-71.160
C(smoke)[T.smoker],-8.2823,0.944,-8.769,0.000,-10.135,-6.429
gestation,0.4837,0.033,14.578,0.000,0.419,0.549
height,1.3670,0.187,7.329,0.000,1.001,1.733

0,1,2,3
Omnibus:,3.626,Durbin-Watson:,2.07
Prob(Omnibus):,0.163,Jarque-Bera (JB):,3.935
Skew:,0.054,Prob(JB):,0.14
Kurtosis:,3.268,Cond. No.,9130.0


In [28]:
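# R-squared = 0.254: the model explains only about 25% of the variance in bwt, a weak fit
# all three predictors are significant (p < 0.001): +0.48 oz per day of gestation, +1.37 oz per inch of height, -8.28 oz for smokers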
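
The adjusted R-squared in the summary can be reproduced from the R-squared value, the number of observations, and the number of predictors. The sketch below applies the usual adjustment formula to the rounded values printed in the summary; the variable names are illustrative.

```python
# Values copied from the regression summary
r_squared = 0.254
n_obs = 1134        # No. Observations
n_predictors = 3    # Df Model

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

print(round(adj_r_squared, 3))
```

With only three predictors and over a thousand observations, the penalty is tiny, so the adjusted value stays close to the unadjusted one.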

#### 14. Create three scenarios (i.e. make up specific values) and predict the birthweight given these factors. Use the information in the model summary to make these predictions. 

In [29]:
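# scenario 1: 280-day gestation, nonsmoker, mother 70 inches tall ('parity' is not in the model, so it is ignored)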
result.predict({
    'gestation': 280, 
    'smoke': 'nonsmoker', 
    'parity': 'yes', 
    'height': 70})

0    131.322416
dtype: float64

In [30]:
# scenario 2: 280-day gestation, smoker, mother 70 inches tall
result.predict({
    'gestation': 280, 
    'smoke': 'smoker', 
    'parity': 'yes', 
    'height': 70})

0    123.040099
dtype: float64

In [31]:
# scenario 3: 255-day gestation, nonsmoker, mother 65 inches tall
result.predict({
    'gestation': 255, 
    'smoke': 'nonsmoker', 
    'parity': 'yes', 
    'height': 65})

0    112.396057
dtype: float64

In [32]:
# scenario 4: 255-day gestation, smoker, mother 65 inches tall
result.predict({
    'gestation': 255, 
    'smoke': 'smoker', 
    'parity': 'yes', 
    'height': 65})

0    104.11374
dtype: float64