## Module 11: Intro to Correlation and Linear Regression 

***

You've learned to explore your dataset features, clean up messy data, and use descriptive statistics to summarize the characteristics of your dataset. 

We are now moving forward with exploring the relationship <b>between</b> variables. Determining how your variables interact is one of the most useful skills you will learn while working with data. 

The first step is understanding the dependency of your variables. 

***

In [97]:
import pandas as pd
import numpy as np

## <font color=DODGERBLUE>Independent and Dependent Variables</font>

When you start to look at the relationship between variables, there are two classifications of variables that need to be considered. 

***

### Independent Variables (IV)
An independent variable, also known as a predictor variable, is a variable that is independent of the other variables in your dataset. Consider this variable the "cause". Independent variables are commonly represented with "x".  

### Dependent Variables (DV)
A dependent variable, also known as an outcome variable, is a variable whose value is dependent on the values of the other variables in your dataset. Consider this variable the "effect". We assume that changes in the values of the independent variable(s) will result in changes to the dependent variable. Dependent variables are commonly represented with "y". 

***

### Examples 
The distinction between IV and DV is an important concept to master because several analyses require identification of which variables are IV and DV. IV's cause a change in the DV, and it isn't possible for the DV to cause a change in the IV.

#### Do flowers grow fastest under fluorescent or natural light?

    - IV: Type of light flowers are grown under
    - DV: Rate of flower growth

#### What is the effect of diet and regular soda on blood sugar levels?

    - IV: Type of soda
    - DV: Blood sugar levels

#### How does cellphone use before bedtime influence sleep?

    - IV: Amount of phone use before bedtime
    - DV: Hours of sleep, quality of sleep, restfulness, etc. 

In [98]:
df = pd.read_csv("EduGradeData.csv")

df.head()

## which variables can be considered independent and dependent?

# Independent Variables - fname, lname, gender, and age

# Dependent Variables - exercise, level_of_fit, hours, level_of_study, grade, and honme_state

Unnamed: 0,fname,lname,gender,age,exercise,level_of_fit,hours,level_of_study,grade,home_state
0,Marcia,Pugh,female,17,3,low,10,moderate,82.4,NJ
1,Kadeem,Morrison,male,18,4,low,4,low,78.2,MA
2,Nash,Powell,male,18,5,low,9,moderate,79.3,OH
3,Noelani,Wagner,female,14,2,high,7,moderate,83.2,FL
4,Noelani,Cherry,female,18,4,low,15,high,87.4,OH


## <font color=GOLDENROD>Your Turn</font>

    1. Import the "insurance.csv" file; name the dataset 'ins'. Preview the first 5 rows. 
    2. In the space below, determine what type of variable is in each column (qualitative/quantitative). 
    3. What variable(s) could be considered dependent?

In [99]:
ins = pd.read_csv("insurance.csv")

ins.head()

## which variables can be considered independent and dependent? 

# Independent Variables - age, sex

# Dependent Variables -bmi, children, smoke, drinker, prescription, provider, and charges

Unnamed: 0,age,sex,bmi,children,smoker,drinker,comorbidities,prescription,provider_type,charges
0,54,female,47.41,0,yes,yes,5,yes,out-of-network,63770.42801
1,45,male,30.36,0,yes,yes,5,yes,out-of-network,62592.87309
2,52,male,34.485,3,yes,yes,2,yes,out-of-network,60021.39897
3,31,female,38.095,1,yes,yes,4,yes,out-of-network,58571.07448
4,33,female,35.53,0,yes,yes,5,yes,out-of-network,55135.40209


# <font color=DODGERBLUE>Introduction to Correlation & Association</font>

***

### Correlation
***
Correlation is a tool that can be used to determine th strength of a relationship between two numeric variables. Correlation describes the direction (+/-) and magnitude (how large) of a <b>linear relationship</b> exists between two numeric variables. This is the initial check if your numeric variables have any kind of meaningful relationship. 

Correlation values range from "-1" to "+1". Variables that are positively correlated are closer to "+1" and variables that are negatively correlated are closer to "-1". 

#### Positively correlated variables move in the same direction - as one variable increases, the other variable increases in the same direction.

        - Increase in study time >> increases in test score
        - Decrease in sugar consumption >> decrease in blood sugar levels
        
#### Negatively correlated variables move in the opposite direction - as one variable increases, the other variable decreases (or vice versa). 

        - Decrease in daily spending >> increase in total savings
        - Increase in weight >> decrease in mobility

In [100]:
# Reviewing first five rows of the dataset

df.head()

Unnamed: 0,fname,lname,gender,age,exercise,level_of_fit,hours,level_of_study,grade,home_state
0,Marcia,Pugh,female,17,3,low,10,moderate,82.4,NJ
1,Kadeem,Morrison,male,18,4,low,4,low,78.2,MA
2,Nash,Powell,male,18,5,low,9,moderate,79.3,OH
3,Noelani,Wagner,female,14,2,high,7,moderate,83.2,FL
4,Noelani,Cherry,female,18,4,low,15,high,87.4,OH


In [101]:
# Reviewing last five rows of the dataset

df.tail()

Unnamed: 0,fname,lname,gender,age,exercise,level_of_fit,hours,level_of_study,grade,home_state
1995,Cody,Shepherd,male,19,1,high,8,moderate,80.1,VA
1996,Geraldine,Peterson,female,16,4,low,18,high,100.0,NY
1997,Mercedes,Leon,female,18,3,low,14,high,84.9,UT
1998,Lucius,Rowland,male,16,1,high,7,moderate,69.1,MT
1999,Linus,Morris,male,19,4,low,10,moderate,79.6,NJ


In [102]:
# Create a correlation matrix

df.corr()

Unnamed: 0,age,exercise,hours,grade
age,1.0,-0.003643,-0.017467,-0.00758
exercise,-0.003643,1.0,0.021105,0.161286
hours,-0.017467,0.021105,1.0,0.801955
grade,-0.00758,0.161286,0.801955,1.0


### Strength of Relationship between Variables
***
The strength of the relationship between variables is determined by the value of the correlation coefficient. To interpret its value, see which of the following values your correlation coefficient is closest to:

- <b>Exactly –1</b>: A perfect downhill (negative) linear relationship
- <b>–0.70</b>: A strong downhill (negative) linear relationship
- <b>–0.50</b>: A moderate downhill (negative) relationship
- <b>–0.30</b>: A weak downhill (negative) linear relationship
- <b>0</b>: No linear relationship
- <b>+0.30</b>: A weak uphill (positive) linear relationship
- <b>+0.50</b>: A moderate uphill (positive) relationship
- <b>+0.70</b>: A strong uphill (positive) linear relationship
- <b>Exactly +1</b>: A perfect uphill (positive) linear relationship

#### It is important to understand that <u>correlation is not the same as causation</u>. Identifying a correlation between two factors does not automatically mean one factor causes another factor to occur.

***

## <font color=GOLDENROD>Your Turn</font>

    1. Create a correlation matrix with the insurance dataset. 
    2. What can you say about the relationship between the independent variables and the dependent variable?

In [103]:
ins.corr()

Unnamed: 0,age,bmi,children,comorbidities,charges
age,1.0,0.109272,0.042469,0.611903,0.299008
bmi,0.109272,1.0,0.012759,0.06546,0.198341
children,0.042469,0.012759,1.0,-0.282273,0.067998
comorbidities,0.611903,0.06546,-0.282273,1.0,0.505246
charges,0.299008,0.198341,0.067998,0.505246,1.0


In [104]:
# There is no linear relationship between age and bmi
# There is no linear relationship between age and children 
# There is a moderate uphill (positive) relationship between age and comorbidities
# There is aweak uphill (positive) linear relationship between age and charges
# There is no linear relationship between bmi and children
# There is no linear relationship between bmi and comorbidities
# There is no linear relationship between bmi and charges
# There is no linear relationship between children and bmi
# There is no linear relationship children and comorbidities
# There is no linear relationship between children and charges
# There is a moderate uphill (positive) relationship comorbidities and charges

### Qualitative Variables and Association
***
The correlation matrix is great for examining the relationship between two numeric variables, however, this method does not work when you want to assess the relationship between qualitative variables, or one qualitative and one numeric variable. 

For these situations, we are interested in determining the differences between groups. These differences include variations in counts or means. If we identify notable differences between groups (the average test score for male students is 89, and the average test score for female students is 92) - this is evidence that there is some underlying relationship present that can be further explored. 

In [105]:
# Two qualitative variables

pd.crosstab(df["gender"], df["level_of_fit"], margins = True)

level_of_fit,high,low,All
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,387,613,1000
male,410,590,1000
All,797,1203,2000


In [106]:
## One qualitative variable, one numeric

df["grade"].groupby(df["gender"]).mean()

gender
female    82.7173
male      82.3948
Name: grade, dtype: float64

In [107]:
pd.crosstab(df["gender"], df["grade"], margins = True)

grade,32.0,43.0,55.9,56.1,56.3,57.9,58.9,59.0,59.2,59.3,...,99.2,99.3,99.4,99.5,99.6,99.7,99.8,99.9,100.0,All
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
female,1,0,0,1,1,1,0,1,1,0,...,3,1,2,2,1,0,1,1,49,1000
male,0,1,1,0,0,0,2,0,0,1,...,1,1,0,0,2,2,0,3,38,1000
All,1,1,1,1,1,1,2,1,1,1,...,4,2,2,2,3,2,1,4,87,2000


## <font color=GOLDENROD>Your Turn</font>

    1. Using the insurance dataset, determine if there is any association between the categorical independent variables and the dependent variable. 

In [108]:
ins.corr()

Unnamed: 0,age,bmi,children,comorbidities,charges
age,1.0,0.109272,0.042469,0.611903,0.299008
bmi,0.109272,1.0,0.012759,0.06546,0.198341
children,0.042469,0.012759,1.0,-0.282273,0.067998
comorbidities,0.611903,0.06546,-0.282273,1.0,0.505246
charges,0.299008,0.198341,0.067998,0.505246,1.0


In [109]:
# Below are the two relationships 

# That have some evidence that there is some underlying relationship present that can be further explored.

# There is a moderate uphill (positive) relationship between age and comorbidities

# There is a moderate uphill (positive) relationship comorbidities and charges

### Variance and your Dependent Variable
***
Understanding which variables have a relationship with your dependent variable is the first step. The next step is to better understand what is influencing that relationship, or the variance seen in your dependent variable. <B>Variance</B> describes the variation in values for a specific variable. 

We know that some people are taller than others - not everyone is exactly the same height. If we were to ask 100 people their specific height - we would not receive 100 identical answers; instead, we would likely receive different values (with some overlap). This variation in values is variance. When variance is observed, the next step is to determine what the factors are between the variance - in other words, what factors are responsible for the difference in height between the 100 people? Gender is likely an important consideration - as males tend to be taller on average. But is that the only reason some people are taller or shorter than others? Maybe ethnicity is influencing this variance - some ethnic groups are naturally taller than others. Genetics might also contribute; as taller parents typically have taller children.

Gender, ethnicity, genetics, etc - are all factors that may contribute to the variation in height. In this example, these factors are independent variables and the height of the individual is the dependent variable. The key question is: <b>how do changes in the independent variables explain the variance in the dependent variable?</b> An additional question is: <b>how much of the variation in the dependent variable can be attributed to each independent variable?</b>

Once we understand the factors that are influencing the dependent variable, we can purposefully manipulate the values of the independent variables to predict how that will change the dependent variable. This is where regression comes in handy! 

In [110]:
df.head(10)

Unnamed: 0,fname,lname,gender,age,exercise,level_of_fit,hours,level_of_study,grade,home_state
0,Marcia,Pugh,female,17,3,low,10,moderate,82.4,NJ
1,Kadeem,Morrison,male,18,4,low,4,low,78.2,MA
2,Nash,Powell,male,18,5,low,9,moderate,79.3,OH
3,Noelani,Wagner,female,14,2,high,7,moderate,83.2,FL
4,Noelani,Cherry,female,18,4,low,15,high,87.4,OH
5,Neil,Whitley,male,16,5,low,16,high,88.7,NJ
6,Nelle,Golden,female,17,1,high,9,moderate,80.2,PA
7,Armando,Hoffman,male,17,5,low,18,high,95.1,MI
8,Illiana,Rojas,female,15,5,low,9,moderate,76.5,LA
9,Neil,Wooten,male,15,3,low,15,high,89.7,TN


# <font color=DODGERBLUE>Introduction to Linear Regression</font>
***

<b>Linear Regression</b> is a powerful statistical tool that allows for a closer examination of the linear relationship between <b>a continuous dependent variable</b> and various independent variables. 

Linear regression allows you to :
1. Determine if the relationship between the dependent and independent variables is statistically significant (or accurate).
2. Identify how much of the variation in the dependent variable is explained by the selection of independent variables. 
3. Determine the direction and magnitude of the relationships between variables, and 
4. Predict what the value of the dependent variable would be given specific input from the independent variables. 

You can use linear regression to predict the salary of a lawyer (DV) based on the number of years they practiced law (IV). You could also determine just how much of the variation in salary is attributed to the number of years they have practiced law.

***

A simple linear regression models the relationships between a single dependent and independent variable, <b>where the independent variable is predicting the value of the dependent variable</b>. A linear regression model is mathematically represented by the formula of a line:
### \begin{align}  y = mx + b \end{align}
Where “y” is the the value of the dependent variable, “m” is the slope (also known as the coefficient), “x” represents the value of the independent variable, and “b” is the y-intercept (also known as the constant) which is the value of “y” when the coefficient is equal to 0. 

<b>Linear regression models will determine the line-of-best-fit, also known as the regression line, which is the best fitting straight line through your data points.</b> Most commonly, the best fitting line is the line that minimizes errors. The equation for the regression line is what is used to make predictions for your dependent variable. 

<center><img src='https://s3.amazonaws.com/stackabuse/media/linear-regression-python-scikit-learn-1.png'></center>

***
## Things to Consider when using Linear Regression

***

### Feature Selection

Once we get a sense of the relationship between our variables, we need to make some decisions on which variables to include in our regression analyses. <b>We only want to include the variables that have some kind of relationship with your dependent variable(s) of interest.</b> This allows us to cut back on the empty "noise" in your dataset and focus on the variables that are meaningful. 

### Redundancy
Be careful with including multiple variables that show similar information. For example, if you have a variable "Hours of Study" which represents the total hours that a student studied, and you've also binned that data and created a new variable "Study Level" which groups students into "high, med, low" - you now have two variables in your dataset that show similar information, although one variable is a lot more specific. When you have this situation, you should elect to keep the variable that has the more detailed information (i.e. the specific hours of study). 

### Multiple Groups
Variables with a lot of different levels (i.e. State of Residence) don't do well in regression models. If you have a variable that has a large number of different groups, unless this variable is vital to your analyses, you should work to reduce the number of overall groups (i.e State can become Region), or leave the variable out of the analyses. It is possible to include these complicated variables, but it isn't always the easiest interpretation.

### Confounding Variables
When we preform regression analyses, we are interested in further understanding the relationship between the dependent and independent variable. Simply put, we are interested in investigating how the independent variable effects the dependent variable. However, it's rarely this simple and there are other factors that need to be understood and controlled. A confounding variable is a third variable that influences both the independent and dependent variable to some degree. It is important to acknowledge this third variable to ensure that the results of your analyses are valid. 

For example, you collect data on sunburns and ice cream consumption. You find that higher ice cream consumption is associated with a higher rate of sunburns. Does this mean that eating ice cream causes sunburns? Absolutely not, there are several other factors that can be attributed to this trend -- but most likely, the confounding variable is temperature. Hot temperatures cause people to eat more ice cream and result in people spending more time outdoors, which can result in more sunburns. Without accounting for the confounding variable(s), you may find relationships between variables that might not actually exist. To control for the potential effects of confounding variables, you simply have to include them in your regression model as another independent variable. 

When you include all of your assumed confounding variables in your regression model, you are controlling for the effects of all of them, and if you find there is still a relationship between a specific independent variable and your dependent variable, you will know that relationship isn't being influenced by any of these other factors.

In [111]:
## new library alert! ##

## Import the StatsModels library for our regression analyses

import statsmodels.formula.api as sm

### Creating a Linear Regression Model

* <b>result</b> is the name that we are assigning the regression formula
* <b>sm</b> is the shorthand for the linear regression model library
* <b>ols</b> is Ordinary Least Squares, the most common method of calculating the regression line
* the regression equation starts with the dependent variable on the left, followed by the independent variables
* independent variables are separated by "+"
* categorical variables must be in parentheses and annotated with a "C"
* data = is where you specify your dataset that you're pulling variables from
* <b>.fit()</b> function uses the predictive values to calculate the best linear regression line
* <b>.summary()</b> function will show the calculated values (slopes and y-intercept) for the linear regression formula

In [112]:
# create the regression model

result = sm.ols('grade ~ hours + age + exercise + C(gender)', data = df).fit()

# print the regression model summary

result.summary()

0,1,2,3
Dep. Variable:,grade,R-squared:,0.665
Model:,OLS,Adj. R-squared:,0.664
Method:,Least Squares,F-statistic:,988.1
Date:,"Thu, 04 May 2023",Prob (F-statistic):,0.0
Time:,15:28:17,Log-Likelihood:,-6299.1
No. Observations:,2000,AIC:,12610.0
Df Residuals:,1995,BIC:,12640.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,58.0874,1.326,43.804,0.000,55.487,60.688
C(gender)[T.male],-0.4485,0.253,-1.773,0.076,-0.944,0.047
hours,1.9173,0.031,61.617,0.000,1.856,1.978
age,0.0405,0.075,0.543,0.587,-0.106,0.187
exercise,0.9841,0.089,11.073,0.000,0.810,1.158

0,1,2,3
Omnibus:,325.522,Durbin-Watson:,2.048
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2284.723
Skew:,-0.569,Prob(JB):,0.0
Kurtosis:,8.111,Cond. No.,214.0


# <font color=DODGERBLUE>Interpreting Linear Regression Result</font>
***
### Determining Model Fit
***
Linear regression calculates the regression line (equation) that minimizes the distance between the regression line and all the data points.  If our data points fall closely to the generated regression line, we consider the model to be a good fit. 

<img src='https://blog.minitab.com/hubfs/Imported_Blog_Media/residual_illustration-1.gif'>

But what does that mean non-graphically? Linear regression may not always be the right technique to use for the specific set of data. The fit of the model describes how well your variables explain the variance in the dependent variable. 

To assess the model fit, we look at the adjusted R-squared (Adj. R-squared). The Adj. R-squared is a statistical measure of how closely the data are to the fitted regression line. <B>The Adj. R-squared is the percentage of variation in the dependent variable that can be explained by all the independent variables included in the model.</B> For example, how much of the variation in student grades (i.e. grades ranging from 32-100) can be explained by hours of study, age, hours of exercise, gender, etc? Values range from 0 to 1; the higher the value, the better the fit. 

***
### Intercept
***
The y-intercept, or constant, is the value given to the dependent variable if all independent variables are equal to 0. For example, when all independent variables are equal to 0, the expected student grade is approximately 58. Don't worry too much about this interpretation, oftentimes this won't make sense - for example, if age equals 0, but it's important to know how to review this. 

***
### Coefficients (coef)
***
These values show how changes in the independent variable influence the dependent variable, and in what direction. 

#### <b> Interpreting Numeric Variable Coefficients </b>
When you are looking at numeric variables (i.e. age), each coefficient represents the numeric change in the dependent variable given a one-unit change in the independent variable. For example, for every one hour increase in study time (hours), grade increases by 1.9 points.

* <b>INTERCEPT</b>: when all other IV's are zero, expected grade is around 58

* <b>AGE</b>: for every one year increase in age, grade increases by 0.04 points (when controlling for hours of study, exercise, and gender)

* <b>EXERCISE</b>: for every one hour increase in exercise, grade increases by approximately 1 point (when controlling for age, hours of study, and gender)

* <b>HOURS</b>: for every one hour increase in study time, grade increases by 1.9 points (when controlling for age, hours of exercise, and gender)

If you have a variable that gives you a negative coefficient, the relationship moves in opposite directions. For example, say the coefficient for age was <b>- 0.0405</b>, you would interpret it in the following way: 

* <b>AGE</b>: for every one year increase in age, grade <u>decreases</u> by 0.04 points. 

#### <b> Interpreting Categorical Variable Coefficients </b>

When working with categorical variables, the model will automatically take one of the categories and use it as a reference (i.e. comparison category). This reference category will not show up in the listed coefficients, and that is how you will be able to identify which category is serving as the reference. In our model, gender is the only categorical variable, and can be interpreted: 

<b>GENDER</b>:
* Female is the reference category
* On average, Male students have a grade 0.44 points lower than Female students (when controlling for age, hours of study, and exercise) 

The grade is lower because the coefficient is negative in this model. If we have more than 2 categories for our categorical variable, we would still compare each level to the reference. For example, if we had a group of students who choose not to disclose their gender, the coefficient for the 'undisclosed' group would show up in our model with it's own coefficient and we would compare those results to the Female category. 

#### Standard Error (std err)

The standard error reflects the level of accuracy of the coefficients. The lower the value, the higher the level of accuracy.

***
### Statistical Significance 
***
#### p-value (P>|t|)
When trying to determine if the results we received are statistically important - we need to consider the p-value. The p-value is the probability that you will receive the same results solely by chance (aka there is no meaning behind the results). 

Because the p-value is representing the probability of random findings, we always want to minimize this value. If the p-value was .5 (50%), that would mean that 50% of the time the results we see are just by chance. The p-value reflects how confident we are in our results and requires strict interpretation. A commonly used cut-off is 0.05 (5% chance the results observed are by chance) or below. 
* If the p-value is less than or equal to 0.05, we can deem our results to be statistically significant. 
* If the p-value is greater than 0.05, our results are not statistically significant. 

***
### What do I do with non-significant variables?
***
Regression analyses are rarely a "one and done" situation. After you run your analyses, you should tweak your model (equation) and see if you can improve the model fit. A good place to start is by removing non-significant variables from our model. In this example, gender and age are not significant - let's try a model without them!

In [113]:
# Removing non-significant variables 

result2 = sm.ols('grade ~ hours + exercise', data = df).fit()

result2.summary()

0,1,2,3
Dep. Variable:,grade,R-squared:,0.664
Model:,OLS,Adj. R-squared:,0.664
Method:,Least Squares,F-statistic:,1973.0
Date:,"Thu, 04 May 2023",Prob (F-statistic):,0.0
Time:,15:28:18,Log-Likelihood:,-6300.8
No. Observations:,2000,AIC:,12610.0
Df Residuals:,1997,BIC:,12620.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,58.5316,0.447,130.828,0.000,57.654,59.409
hours,1.9162,0.031,61.575,0.000,1.855,1.977
exercise,0.9892,0.089,11.131,0.000,0.815,1.163

0,1,2,3
Omnibus:,318.721,Durbin-Watson:,2.048
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2158.0
Skew:,-0.564,Prob(JB):,0.0
Kurtosis:,7.962,Cond. No.,43.2


# <font color=DODGERBLUE>Comparing Models and Making Predictions</font>

When comparting two models, the first thing you want to check is if there are any changes in the adjusted r-squared. In this example, the adj r-squared as not changed at all between the two models. The next item you want to check is the p-value of the remaining variables in the model. Did removing variables increase/decrease the p-values for the remaining variables?

You do not need to continue running your model until all your variables are significant. You should focus on maximizing your adj r-square, regardless of the significance of your variables. 

### What can I do with this information?

Now that you have a better understanding of how your variables interact, you are in a better place to describe your data. You now know that an increase of just one hour of studying will, on average, show a significant increase in a students final grade.  You can also make predictions about future grades...

### Making Predictions based on the Regression Results

Recall that a simple linear regression model is mathematically represented by the formula of a line: <b> y = mx + b</b>. When you are creating a multiple linear regression equation, the formula is similar - with some added features: <b> y = b + (m1 x X1) + (m2 x X2) + (m3 x X3) + ....etc. </b> where each "m x X" is representative of one of the coefficients in your model, and "b" represents the intercept. Once you have your regression output, you can plug these values in to make predictions on your dependent variable given specific values for your independent variables.

    grade(y) = intercept(b) + [hours of study coef(m1) x hours of study(x1)] + [hours of exercise coef(m2) x hours exercise(x2)]

    grade(y) = 58.5316 + [1.9162 x hours of study(x1)] + [0.9892 x hours exercise(x2)]

### What grade can we expect from a student that studied 8 hours and exercised 4.3 hours?

    grade(y) = 58.5316 + [1.9162 x hours of study(8)] + [0.9892 x hours exercise(4.3)]

    grade(y) = 58.5316 + [1.9162 x (8)] + [0.9892 x (4.3)]

    grade(y) = 58.5316 + [‭15.3296‬] + ‭[4.25356‬]

    grade = ‭78.11476

## <font color=SALMON>Ta Da! You can now predict the future!</font>

<LEFT><img src='https://i0.wp.com/www.learning-mind.com/wp-content/uploads/2019/10/psychic-spiritual-energy.jpg?resize=768%2C512&ssl=1'></LEFT>

## Making Predictions: the simple way!

What grade can we expect from a 16 year old female student that studied 8 hours and exercised 5.7 hours? 

In [114]:
# We can use the predict function (from statsmodel library) to predict the outcome given specific input

# I will reference the model with my function to reference the appropriate coef's!

# model_name.predict({'variable1_name':value1, 'variable2_name':value2, ...})

result.predict({
    'hours': 15, 
    'age': 16, 
    'exercise': 4, 
    'gender': "female"})

0    91.431847
dtype: float64

In [115]:
## Another scenario

result.predict({
    'hours': 14, 
    'age': 18, 
    'exercise': 12, 
    'gender': "female"})

0    97.468586
dtype: float64

In [116]:
# Based on the results and the model females performed better than males in most scenario.

# <font color=DODGERBLUE>Model 11 Exercises</font>

#### 1. Complete the code below to import the four libraries we've used most commonly.

In [117]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm
import scipy.stats as stats 

#### 2. Import the "babies.csv" file and name it df. 

<b>Background Info</b>

    The Child Health and Development Studies considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. The goal is to model the weight of the infants (bwt, in ounces) using variables including length of pregnancy in days (gestation), mother's age in years (age), mother's height in inches (height), whether the child was the first born (parity), mother's pregnancy weight in pounds (weight), and whether the mother was a smoker (smoke).

<b>Variables</b>

    case - id number
    bwt - birthweight, in ounces
    gestation - length of gestation, in days
    parity - binary indicator for a first pregnancy (0=first pregnancy)
    age - mother's age in years
    height - mother's height in inches
    weight - mother's weight in pounds
    smoke - binary indicator for whether the mother smokes

In [118]:
df = pd.read_csv("babies.csv")

df.head()


Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke
0,1,120,284.0,0,27.0,62.0,100.0,0.0
1,2,113,282.0,0,33.0,64.0,135.0,0.0
2,3,128,279.0,0,28.0,64.0,115.0,1.0
3,4,123,,0,36.0,69.0,190.0,0.0
4,5,108,282.0,0,23.0,67.0,125.0,1.0


In [119]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1236 entries, 0 to 1235
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   case       1236 non-null   int64  
 1   bwt        1236 non-null   int64  
 2   gestation  1223 non-null   float64
 3   parity     1236 non-null   int64  
 4   age        1234 non-null   float64
 5   height     1214 non-null   float64
 6   weight     1200 non-null   float64
 7   smoke      1226 non-null   float64
dtypes: float64(5), int64(3)
memory usage: 77.4 KB


#### 3. Check the shape of the dataset. How many columns and rows are there?

In [120]:
# There are 1236 rows and 8 colums

df.shape

(1236, 8)

#### 4. Check the first 10 rows and the last 10 rows. Drop the column "case". 

In [121]:
# Check the first 10 rows and the last 10 rows. 

df.head(10)

Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke
0,1,120,284.0,0,27.0,62.0,100.0,0.0
1,2,113,282.0,0,33.0,64.0,135.0,0.0
2,3,128,279.0,0,28.0,64.0,115.0,1.0
3,4,123,,0,36.0,69.0,190.0,0.0
4,5,108,282.0,0,23.0,67.0,125.0,1.0
5,6,136,286.0,0,25.0,62.0,93.0,0.0
6,7,138,244.0,0,33.0,62.0,178.0,0.0
7,8,132,245.0,0,23.0,65.0,140.0,0.0
8,9,120,289.0,0,25.0,62.0,125.0,0.0
9,10,143,299.0,0,30.0,66.0,136.0,1.0


In [122]:
# Check the last 10 rows. 

df.tail(10)

Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke
1226,1227,109,244.0,1,21.0,63.0,102.0,1.0
1227,1228,103,278.0,0,30.0,60.0,87.0,1.0
1228,1229,118,276.0,0,34.0,64.0,116.0,0.0
1229,1230,127,290.0,0,27.0,65.0,121.0,0.0
1230,1231,132,270.0,0,27.0,65.0,126.0,0.0
1231,1232,113,275.0,1,27.0,60.0,100.0,0.0
1232,1233,128,265.0,0,24.0,67.0,120.0,0.0
1233,1234,130,291.0,0,30.0,65.0,150.0,1.0
1234,1235,125,281.0,1,21.0,65.0,110.0,0.0
1235,1236,117,297.0,0,38.0,65.0,129.0,0.0


In [123]:
# Dropping "case" colum 

df.drop(columns="case", inplace = True)

df

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
0,120,284.0,0,27.0,62.0,100.0,0.0
1,113,282.0,0,33.0,64.0,135.0,0.0
2,128,279.0,0,28.0,64.0,115.0,1.0
3,123,,0,36.0,69.0,190.0,0.0
4,108,282.0,0,23.0,67.0,125.0,1.0
...,...,...,...,...,...,...,...
1231,113,275.0,1,27.0,60.0,100.0,0.0
1232,128,265.0,0,24.0,67.0,120.0,0.0
1233,130,291.0,0,30.0,65.0,150.0,1.0
1234,125,281.0,1,21.0,65.0,110.0,0.0


#### 5. Is there any missing data? Check!

In [124]:
# Yes there is missing data

df.isnull()

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
0,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False
3,False,True,False,False,False,False,False
4,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...
1231,False,False,False,False,False,False,False
1232,False,False,False,False,False,False,False
1233,False,False,False,False,False,False,False
1234,False,False,False,False,False,False,False


#### 6. The amount of missing data is small considering the size of our dataset. Drop all rows in the dataset that have missing data.

In [125]:
# dropping all rows with at least one missing data

df.dropna(inplace = True)


In [126]:
# check the changes

df.head()

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
0,120,284.0,0,27.0,62.0,100.0,0.0
1,113,282.0,0,33.0,64.0,135.0,0.0
2,128,279.0,0,28.0,64.0,115.0,1.0
4,108,282.0,0,23.0,67.0,125.0,1.0
5,136,286.0,0,25.0,62.0,93.0,0.0


In [127]:
# check the changes

df.tail()

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
1231,113,275.0,1,27.0,60.0,100.0,0.0
1232,128,265.0,0,24.0,67.0,120.0,0.0
1233,130,291.0,0,30.0,65.0,150.0,1.0
1234,125,281.0,1,21.0,65.0,110.0,0.0
1235,117,297.0,0,38.0,65.0,129.0,0.0


#### 7. Check each of your numeric columns for outliers - pick one method and use it for all the columns. 

In [128]:
## Create a copy of your dataset to filter outliers
dfz = df.copy()

## Check original shape of the dataset
print(dfz.shape)

(1174, 7)


In [129]:
# BWT 

In [130]:
## Calculate quartiles
bwt_q1 = dfz["bwt"].quantile(.25)
bwt_q3 = dfz["bwt"].quantile(.75)

print("Q1:", bwt_q1)
print("Q3:", bwt_q3)

Q1: 108.0
Q3: 131.0


In [131]:
## Calculate the IQR
bwt_iqr = bwt_q3 - bwt_q1

print("bwt IQR:", bwt_iqr)

bwt IQR: 23.0


In [132]:
## Determine outlier fences 

bwt_top = gestation_q3 + (bwt_iqr * 1.5)
bwt_bottom = bwt_q1 - (bwt_iqr * 1.5)


print("BWT Upper Limit:", bwt_top)
print("BWT Lower Limit:", bwt_bottom)

BWT Upper Limit: 322.5
BWT Lower Limit: 73.5


In [133]:
# Determine the index locations for rows that fall outside of outlier fences

bwt_iqr_outliers = dfz.loc[(dfz['bwt'] > bwt_top) | (dfz['bwt'] < bwt_bottom)].index

print("BWT INDEX VALUES:", bwt_iqr_outliers)

BWT INDEX VALUES: Int64Index([309, 361, 462, 500, 529, 829, 904, 912, 978, 1035, 1063, 1065,
            1139, 1148, 1169],
           dtype='int64')


In [134]:
# Gestation 

In [135]:
## Calculate quartiles

gestation_q1 = dfz["gestation"].quantile(.25)
gestation_q3 = dfz["gestation"].quantile(.75)

print("gestation Q1:", gestation_q1)
print("gestation Q3:", gestation_q3)

gestation Q1: 272.0
gestation Q3: 288.0


In [136]:
## Calculate the IQR
gestation_iqr = gestation_q3 - gestation_q1

print("gestation IQR:", gestation_iqr)

gestation IQR: 16.0


In [137]:
## Determine outlier fences 

gestation_top = gestation_q3 + (gestation_iqr * 1.5)
gestation_bottom = gestation_q1 - (gestation_iqr * 1.5)


print("gestation Upper Limit:", gestation_top)
print("gestation Lower Limit:", gestation_bottom)

gestation Upper Limit: 312.0
gestation Lower Limit: 248.0


In [138]:
## Determine the index locations for rows that fall outside of outlier fences

gestation_iqr_outliers = dfz.loc[(dfz['gestation'] > gestation_top) | (dfz['gestation'] < gestation_bottom)].index

print("GESTATION INDEX VALUES:", gestation_iqr_outliers)

GESTATION INDEX VALUES: Int64Index([   6,    7,   10,   59,   63,   66,  119,  129,  155,  188,  192,
             198,  210,  215,  217,  234,  240,  253,  260,  279,  345,  361,
             373,  394,  440,  460,  462,  484,  500,  511,  523,  630,  685,
             699,  710,  726,  746,  761,  769,  778,  784,  828,  829,  830,
             833,  869,  912,  927,  952,  969,  978, 1002, 1020, 1026, 1035,
            1065, 1074, 1134, 1139, 1140, 1142, 1146, 1152, 1172, 1178, 1199,
            1206, 1207, 1217, 1219, 1226],
           dtype='int64')


In [139]:
# Age

In [140]:
## Calculate quartiles
age_q1 = dfz["age"].quantile(.25)
age_q3 = dfz["age"].quantile(.75)

print("Age Q1:", age_q1)
print("Age Q3:", age_q3)

Age Q1: 23.0
Age Q3: 31.0


In [141]:
## Calculate the IQR
age_iqr = age_q3 - age_q1

print("Age IQR:", age_iqr)

Age IQR: 8.0


In [142]:
## Determine outlier fences 

age_top = age_q3 + (age_iqr * 1.5)
age_bottom = age_q1 - (age_iqr * 1.5)


print("Age Upper Limit:", age_top)
print("Age Lower Limit:", age_bottom)

Age Upper Limit: 43.0
Age Lower Limit: 11.0


In [143]:
## Determine the index locations for rows that fall outside of outlier fences

age_iqr_outliers = dfz.loc[(dfz['age'] > age_top) | (dfz['age'] < age_bottom)].index

print("AGE INDEX VALUES:", age_iqr_outliers)

AGE INDEX VALUES: Int64Index([633, 1070], dtype='int64')


In [144]:
# Weight 

In [145]:
## Calculate quartiles
weight_q1 = dfz["weight"].quantile(.25)
weight_q3 = dfz["weight"].quantile(.75)

print("Weight Q1:", weight_q1)
print("Weight Q3:", weight_q3)

Weight Q1: 114.25
Weight Q3: 139.0


In [146]:
## Calculate the IQR
weight_iqr = weight_q3 - weight_q1

print("Weight IQR:", weight_iqr)

Weight IQR: 24.75


In [147]:
## Determine outlier fences 

weight_top = weight_q3 + (weight_iqr * 1.5)
weight_bottom = weight_q1 - (weight_iqr * 1.5)


print("Weight Upper Limit:", weight_top)
print("Weight Lower Limit:", weight_bottom)

Weight Upper Limit: 176.125
Weight Lower Limit: 77.125


In [148]:
## Determine the index locations for rows that fall outside of outlier fences

weight_iqr_outliers = dfz.loc[(dfz['weight'] > weight_top) | (dfz['weight'] < weight_bottom)].index

print("WEIGHT INDEX VALUES:", weight_iqr_outliers)

WEIGHT INDEX VALUES: Int64Index([   6,   23,   41,   88,  117,  149,  162,  181,  183,  222,  240,
             287,  411,  426,  512,  522,  528,  563,  608,  622,  632,  723,
             733,  747,  849,  858,  859,  865,  888,  924,  935, 1008, 1021,
            1148, 1154, 1167, 1219],
           dtype='int64')


In [149]:
# Height 

In [150]:
## Calculate quartiles
height_q1 = dfz["height"].quantile(.25)
height_q3 = dfz["height"].quantile(.75)

print("Height Q1:", height_q1)
print("Height Q3:", height_q3)

Height Q1: 62.0
Height Q3: 66.0


In [151]:
## Calculate the IQR
height_iqr = height_q3 - height_q1

print("Height IQR:", height_iqr)

Height IQR: 4.0


In [152]:
## Determine outlier fences 

height_top = height_q3 + (height_iqr * 1.5)
height_bottom = height_q1 - (height_iqr * 1.5)


print("Height Upper Limit:", height_top)
print("Height Lower Limit:", height_bottom)

Height Upper Limit: 72.0
Height Lower Limit: 56.0


In [153]:
## Determine the index locations for rows that fall outside of outlier fences

height_iqr_outliers = dfz.loc[(dfz['height'] > height_top) | (dfz['height'] < height_bottom)].index

print("HEIGHT INDEX VALUES:", height_iqr_outliers)

HEIGHT INDEX VALUES: Int64Index([434, 1208], dtype='int64')


In [154]:
## Using the z-score method for outlier identification and removal 

## starting shape
print ("starting shape:", df.shape)

## calculate zscores
df["zscore_bwt"] = np.abs(stats.zscore(df["bwt"]))
df["zscore_gestation"] = np.abs(stats.zscore(df["gestation"]))
df["zscore_age"] = np.abs(stats.zscore(df["age"]))
df["zscore_height"] = np.abs(stats.zscore(df["height"]))
df["zscore_weight"] = np.abs(stats.zscore(df["weight"]))

###

## determine rows with outliers for birthweight and drop
bwt_outliers = df[df["zscore_bwt"] > 3].index
df = df.drop(bwt_outliers)

###

## determine rows with outliers for gestation and drop
gest_outliers = df[df["zscore_gestation"] > 3].index
df = df.drop(gest_outliers)

###

## determine rows with outliers for age and drop
age_outliers = df[df["zscore_age"] > 3].index
df = df.drop(age_outliers)

###

## determine rows with outliers for height and drop
ht_outliers = df[df["zscore_height"] > 3].index
df = df.drop(ht_outliers)

###

## determine rows with outliers for weight and drop
wt_outliers = df[df["zscore_weight"] > 3].index
df = df.drop(wt_outliers)

###

## ending shape; dropping zscore columns
df.drop(columns=["zscore_bwt", "zscore_gestation", "zscore_age", "zscore_height", "zscore_weight"], inplace=True)


print ("shape after outliers removed:", df.shape)

starting shape: (1174, 7)
shape after outliers removed: (1134, 7)


#### 8. Print the descriptive statistics for each numeric column. What is the average age of the mothers? What is the average gestation period?

In [155]:
df.describe()

## average age: 27
## average gestation: 279 days

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
count,1134.0,1134.0,1134.0,1134.0,1134.0,1134.0,1134.0
mean,119.661376,279.42328,0.267196,27.206349,64.040564,127.218695,0.390653
std,17.860299,13.930125,0.442691,5.802403,2.468179,18.385539,0.488112
min,65.0,232.0,0.0,15.0,57.0,87.0,0.0
25%,109.0,272.0,0.0,23.0,62.0,114.0,0.0
50%,120.0,280.0,0.0,26.0,64.0,125.0,0.0
75%,131.0,288.0,1.0,31.0,66.0,137.0,1.0
max,174.0,324.0,1.0,44.0,71.0,190.0,1.0


#### 9. Let's model birthweight based on the characteristics of the mother. But first... 

We want to easily distinguish between the numeric and categorical variables. Replace the values 0/1 in the "parity" and "smoke" column with meaningful labels (i.e. smokes, doesn't smoke).

In [156]:
df["parity"].replace([0,1], ["First Preg", "Not First Preg"], inplace = True)

df["smoke"].replace([0,1], ["Non-Smoker", "Smoker"], inplace = True)

#check

df.head()

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
0,120,284.0,First Preg,27.0,62.0,100.0,Non-Smoker
1,113,282.0,First Preg,33.0,64.0,135.0,Non-Smoker
2,128,279.0,First Preg,28.0,64.0,115.0,Smoker
4,108,282.0,First Preg,23.0,67.0,125.0,Smoker
5,136,286.0,First Preg,25.0,62.0,93.0,Non-Smoker


#### 10. Run a correlation matrix with your dataset. Which variables are correlated with birthweight? 

Describe the strength of the correlation between all the numeric variables and birthweight. 

In [157]:
df.corr()

## birthweight is correlated to: A weak uphill (positive) linear relationship with gestation
## gestation (0.4) moderate positive correlation
## age (0.03) very weak positive correlation
## height (0.2) weak positive correlation
## weight (0.2) weak positive correlation

Unnamed: 0,bwt,gestation,age,height,weight
bwt,1.0,0.411186,0.026259,0.212844,0.166821
gestation,0.411186,1.0,-0.05402,0.072687,0.045045
age,0.026259,-0.05402,1.0,-0.001016,0.16076
height,0.212844,0.072687,-0.001016,1.0,0.46382
weight,0.166821,0.045045,0.16076,0.46382,1.0


#### 11. Determine the relationship between birthweight and the categorical variables: parity and smoke. 

Use the groupby function to determine if there are any differences between birthweight and the different groups.  Does it seem like there is a relationship between these variables and birthweight?

In [158]:
df["bwt"].groupby(df["smoke"]).mean()

## non-smokers have higher bwt's on average

smoke
Non-Smoker    123.337192
Smoker        113.927765
Name: bwt, dtype: float64

In [159]:
df["bwt"].groupby(df["parity"]).mean()

# mothers who are having their first child have higher bwt's on average

parity
First Preg        120.174489
Not First Preg    118.254125
Name: bwt, dtype: float64

#### 12. Let's construct your regression model. Firstly, which variables do you plan to include in your model, and why? 

In the space below, write your justification for why you are including each variable. 

In the process of conrtusting my regression model I plan keep all the variable:
BWT: this is my dependent variable; i want to determine the variables that can predict bwt/explain the variation in bwt.
GESTATION: there is a moderate relationship between bwt and gestation; logically, this makes sense, but i'm interested in further understanding how much of the variation in birthweight is attributed to how long the pregnancy lasted; how does one additional day influence birthweight?.
HEIGHT: there is a weak relationship between height and birthweight; keeping it in as a potential demographic confounding variable; I don't think height influences birthweight much, but I want to control for the demographics of the mother.
WEIGHT: I am keeping weight in for the same reason as height.
AGE: I am keeping age in, despite its VERY weak correlation, for the same reason as height - i don't expect there to be statistically significant results, but I want to control for differences in age between the mothers.
PARITY: there were slight differences in the average birthweight between first and not-first pregnancies, i want to explore this relationship further.
SMOKE: similar to above, i want to further explore the relationship between birthweight and smoking, given there were slight differences in the average birthweight for the two groups.

#### 13. Construct your regression model and print the summary. 

Write out your full interpretation of the regression results. If you are not happy with the results, tweak your model and run it again. 

In [160]:
result = sm.ols('bwt ~ gestation + age + height + weight + C(parity) + C(smoke)', data=df).fit()

result.summary()

0,1,2,3
Dep. Variable:,bwt,R-squared:,0.264
Model:,OLS,Adj. R-squared:,0.26
Method:,Least Squares,F-statistic:,67.37
Date:,"Thu, 04 May 2023",Prob (F-statistic):,1.1e-71
Time:,15:28:46,Log-Likelihood:,-4703.6
No. Observations:,1134,AIC:,9421.0
Df Residuals:,1127,BIC:,9457.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-97.8230,15.101,-6.478,0.000,-127.452,-68.194
C(parity)[T.Not First Preg],-3.6400,1.113,-3.272,0.001,-5.823,-1.457
C(smoke)[T.Smoker],-8.1849,0.944,-8.669,0.000,-10.037,-6.332
gestation,0.4926,0.033,14.867,0.000,0.428,0.558
age,-0.0249,0.085,-0.291,0.771,-0.193,0.143
height,1.2259,0.211,5.819,0.000,0.813,1.639
weight,0.0486,0.029,1.698,0.090,-0.008,0.105

0,1,2,3
Omnibus:,4.377,Durbin-Watson:,2.08
Prob(Omnibus):,0.112,Jarque-Bera (JB):,5.187
Skew:,0.007,Prob(JB):,0.0748
Kurtosis:,3.331,Cond. No.,10400.0


## INTERPRETATION OF RESULTS

* My adj r-squared is 0.260 (26% of the variation in bwt can be attributed to my independent variable), this is a fairly low value and I would typically fine tume the model to determine if this value can be maximized.

## Interpretation of Model Coef's

* <b>PARITY (ref grp is first pregnancy)<b>: on average, women who have had multiple pregnancies have babies with a birthweight 3.6 oz lower than women who are experiencing their first pregnancy, when controlling for smoking status, gestation length, age of the mother, height and weight of the mother.

* <b>SMOKE (ref grp is non-smoker)<b>: on average, women who smoke cigarettes have babies with a birthwt 8.2 oz lower than women who do not smoke cigarettes, when controlling for number of prior pregnancies, gestation length, age of the mother, height and weight of the mother.

* <b>GESTATION<b>: for every 1-day increase in gestation, birthwt increases by 0.49, when controlling for number of prior pregnancies, smoking status, age of the mother, height and weight of the mother.

* <b>AGE<b>: for every 1-year increase in age, birthwt decreases by 0.02, when controlling for number of prior pregnancies, smoking status, gestation length, height and weight of the mother. <font color=green>This coef is not statistically significant.</font> 

* <b>HEIGHT<b>: for every one inch increase in height, birthwt increases by 1.23, when controlling for number of prior pregnancies, smoking status, gestation length, age of the mother, and the weight of the mother.

* <b>WEIGHT:<b>: for every one pound increase in weight, birthwt increases by 0.05, when controlling for number of prior pregnancies, smoking status, gestation length, age of the mother, and the height of the mother. <font color=green>This coef is not statistically significant.</font> 

#### 14. Create three scenarios (i.e. make up specific values) and predict the birthweight given these factors. Use the information in the model summary to make these predictions. 

In [161]:
result.predict({
    'gestation': 282, 
    'parity': "First Preg", 
    'age': 40, 
    'height': 62,
    'weight':110,
    'smoke':"Smoker"})

0    113.258685
dtype: float64

In [162]:
result.predict({
    'gestation': 282, 
    'parity': "First Preg", 
    'age': 40, 
    'height': 62,
    'weight':110,
    'smoke':"Non-Smoker"})

0    121.443552
dtype: float64

In [163]:
result.predict({
    'gestation': 245, 
    'parity': "Not First Preg", 
    'age': 23, 
    'height': 66,
    'weight':145,
    'smoke':"Smoker"})

0    98.422221
dtype: float64

In [164]:
result.predict({
    'gestation': 245, 
    'parity': "Not First Preg", 
    'age': 23, 
    'height': 66,
    'weight':145,
    'smoke':"Non-Smoker"})

0    106.607089
dtype: float64

In [165]:
result.predict({
    'gestation': 297, 
    'parity': "First Preg", 
    'age': 33, 
    'height': 67,
    'weight':201,
    'smoke':"Smoker"})

0    131.376842
dtype: float64

In [166]:
result.predict({
    'gestation': 297, 
    'parity': "First Preg", 
    'age': 33, 
    'height': 67,
    'weight':201,
    'smoke':"Non-Smoker"})

0    139.561709
dtype: float64

### <font color = blue> Based on the results Non-Smokers tend to be overweight than smokers comparing same varible values.</font> 