## Module 11: Intro to Correlation and Linear Regression 

***

You've learned to explore your dataset features, clean up messy data, and use descriptive statistics to summarize the characteristics of your dataset. 

We are now moving forward with exploring the relationship <b>between</b> variables. Determining how your variables interact is one of the most useful skills you will learn while working with data. 

The first step is understanding the dependency of your variables. 

***

In [23]:
import pandas as pd
import numpy as np

## <font color=DODGERBLUE>Independent and Dependent Variables</font>

When you start to look at the relationship between variables, there are two classifications of variables that need to be considered. 

***

### Independent Variables (IV)
An independent variable, also known as a predictor variable, is a variable that is independent of the other variables in your dataset. Consider this variable the "cause". Independent variables are commonly represented with "x".  

### Dependent Variables (DV)
A dependent variable, also known as an outcome variable, is a variable whose value is dependent on the values of the other variables in your dataset. Consider this variable the "effect". We assume that changes in the values of the independent variable(s) will result in changes to the dependent variable. Dependent variables are commonly represented with "y". 

***

### Examples 
The distinction between IVs and DVs is an important concept to master because several analyses require you to identify which variables play each role. We treat the IVs as driving changes in the DV, not the other way around.

#### Do flowers grow fastest under fluorescent or natural light?

    - IV: Type of light flowers are grown under
    - DV: Rate of flower growth

#### What is the effect of diet and regular soda on blood sugar levels?

    - IV: Type of soda
    - DV: Blood sugar levels

#### How does cellphone use before bedtime influence sleep?

    - IV: Amount of phone use before bedtime
    - DV: Hours of sleep, quality of sleep, restfulness, etc. 

In [24]:
df = pd.read_csv("EduGradeData.csv")

df.head()

## which variables can be considered independent and dependent?

Unnamed: 0,fname,lname,gender,age,exercise,level_of_fit,hours,level_of_study,grade,home_state
0,Marcia,Pugh,female,17,3,low,10,moderate,82.4,NJ
1,Kadeem,Morrison,male,18,4,low,4,low,78.2,MA
2,Nash,Powell,male,18,5,low,9,moderate,79.3,OH
3,Noelani,Wagner,female,14,2,high,7,moderate,83.2,FL
4,Noelani,Cherry,female,18,4,low,15,high,87.4,OH


## <font color=GOLDENROD>Your Turn</font>

    1. Import the "insurance.csv" file; name the dataset 'ins'. Preview the first 5 rows. 
    2. In the space below, determine what type of variable is in each column (qualitative/quantitative). 
    3. What variable(s) could be considered dependent?

# <font color=DODGERBLUE>Introduction to Correlation & Association</font>

***

### Correlation
***
Correlation is a tool for measuring the strength of the relationship between two numeric variables. It describes the direction (+/-) and magnitude (how large) of the <b>linear relationship</b> between two numeric variables. This is the initial check of whether your numeric variables have any kind of meaningful relationship. 

Correlation values range from "-1" to "+1". Variables that are positively correlated are closer to "+1" and variables that are negatively correlated are closer to "-1". 

#### Positively correlated variables move in the same direction - as one variable increases, the other variable increases in the same direction.

        - Increase in study time >> increases in test score
        - Decrease in sugar consumption >> decrease in blood sugar levels
        
#### Negatively correlated variables move in the opposite direction - as one variable increases, the other variable decreases (or vice versa). 

        - Decrease in daily spending >> increase in total savings
        - Increase in weight >> decrease in mobility
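
To make direction and magnitude concrete, here is a minimal sketch that computes the correlation coefficient by hand. The numbers are invented for illustration, and `pearson_r` is a hypothetical helper name rather than a library function.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: shared variation scaled by each variable's own variation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# Invented data: more study hours go with higher scores (positive correlation)
study_hours = [2, 4, 6, 8, 10]
test_score = [65, 70, 78, 85, 91]

# Invented data: more daily spending goes with lower savings (negative correlation)
daily_spending = [10, 20, 30, 40, 50]
total_savings = [900, 820, 700, 650, 500]

r_positive = pearson_r(study_hours, test_score)
r_negative = pearson_r(daily_spending, total_savings)

print(round(r_positive, 3), round(r_negative, 3))
```

The first value lands close to +1 and the second close to -1. `np.corrcoef(x, y)[0, 1]` should return the same numbers.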

In [25]:
df.head()

Unnamed: 0,fname,lname,gender,age,exercise,level_of_fit,hours,level_of_study,grade,home_state
0,Marcia,Pugh,female,17,3,low,10,moderate,82.4,NJ
1,Kadeem,Morrison,male,18,4,low,4,low,78.2,MA
2,Nash,Powell,male,18,5,low,9,moderate,79.3,OH
3,Noelani,Wagner,female,14,2,high,7,moderate,83.2,FL
4,Noelani,Cherry,female,18,4,low,15,high,87.4,OH


In [26]:
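## Create a correlation matrix
## numeric_only=True skips text columns such as fname and gender (pandas 2.0+ raises an error without it)

df.corr(numeric_only=True)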

Unnamed: 0,age,exercise,hours,grade
age,1.0,-0.003643,-0.017467,-0.00758
exercise,-0.003643,1.0,0.021105,0.161286
hours,-0.017467,0.021105,1.0,0.801955
grade,-0.00758,0.161286,0.801955,1.0


### Strength of Relationship between Variables
***
The strength of the relationship between variables is determined by the value of the correlation coefficient. To interpret its value, see which of the following values your correlation coefficient is closest to:

- <b>Exactly –1</b>: A perfect downhill (negative) linear relationship
- <b>–0.70</b>: A strong downhill (negative) linear relationship
- <b>–0.50</b>: A moderate downhill (negative) linear relationship
- <b>–0.30</b>: A weak downhill (negative) linear relationship
- <b>0</b>: No linear relationship
- <b>+0.30</b>: A weak uphill (positive) linear relationship
- <b>+0.50</b>: A moderate uphill (positive) linear relationship
- <b>+0.70</b>: A strong uphill (positive) linear relationship
- <b>Exactly +1</b>: A perfect uphill (positive) linear relationship
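
The "closest benchmark" rule above can be sketched as a small function. `describe_correlation` is a hypothetical helper written for this module, and the three example values are taken from the grade column of the correlation matrix shown earlier.

```python
def describe_correlation(r):
    """Label a correlation coefficient by the benchmark it is closest to."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must be between -1 and +1")
    size = abs(r)
    if size == 1:
        strength = "perfect"  # reserved for exactly -1 or +1
    else:
        benchmarks = {0.0: "no", 0.3: "weak", 0.5: "moderate", 0.7: "strong"}
        nearest = min(benchmarks, key=lambda b: abs(size - b))
        strength = benchmarks[nearest]
    if strength == "no":
        return "no linear relationship"
    direction = "uphill (positive)" if r > 0 else "downhill (negative)"
    return f"{strength} {direction} linear relationship"

# Correlations with grade, copied from the matrix above
for name, r in [("hours", 0.801955), ("exercise", 0.161286), ("age", -0.00758)]:
    print(name, "->", describe_correlation(r))
```

These labels are rules of thumb, so treat values that fall between two benchmarks as a judgment call.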

#### It is important to understand that <u>correlation is not the same as causation</u>. Identifying a correlation between two factors does not automatically mean one factor causes another factor to occur.

***

## <font color=GOLDENROD>Your Turn</font>

    1. Create a correlation matrix with the insurance dataset. 
    2. What can you say about the relationship between the independent variables and the dependent variable?

### Qualitative Variables and Association
***
The correlation matrix is great for examining the relationship between two numeric variables. However, this method does not work when you want to assess the relationship between two qualitative variables, or between one qualitative and one numeric variable. 

For these situations, we are interested in determining the differences between groups. These differences include variations in counts or means. If we identify notable differences between groups (the average test score for male students is 89, and the average test score for female students is 92) - this is evidence that there is some underlying relationship present that can be further explored. 

In [5]:
## Two qualitative variables

pd.crosstab(df["gender"], df["level_of_fit"],margins= True)

level_of_fit,high,low,All
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,387,613,1000
male,410,590,1000
All,797,1203,2000
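
Raw counts are easier to compare once they are converted to proportions within each group. The sketch below rebuilds a table with the same counts as the output above (the `students` name and the rebuilt rows are for illustration only) and uses the `normalize="index"` option of `pd.crosstab`.

```python
import pandas as pd

# Rebuild rows that reproduce the counts in the crosstab output above
counts = {("female", "high"): 387, ("female", "low"): 613,
          ("male", "high"): 410, ("male", "low"): 590}
rows = [{"gender": g, "level_of_fit": f} for (g, f), n in counts.items() for _ in range(n)]
students = pd.DataFrame(rows)

# normalize="index" turns each row into proportions that sum to 1
row_props = pd.crosstab(students["gender"], students["level_of_fit"], normalize="index")

print(row_props.round(3))
```

If the proportions in each row look alike, as they do here, there is little evidence of an association between the two variables.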


In [6]:
## One qualitative variable, one numeric
# value_column.groupby(grouping_column).mean()
# Group grade by gender and show the mean grade for each group

df["grade"].groupby(df["gender"]).mean()

gender
female    82.7173
male      82.3948
Name: grade, dtype: float64

## <font color=GOLDENROD>Your Turn</font>

    1. Using the insurance dataset, determine if there is any association between the categorical independent variables and the dependent variable. 

### Variance and your Dependent Variable
***
Understanding which variables have a relationship with your dependent variable is the first step. The next step is to better understand what is influencing that relationship, or the variance seen in your dependent variable. <B>Variance</B> describes the variation in values for a specific variable. 

We know that some people are taller than others - not everyone is exactly the same height. If we were to ask 100 people their specific height, we would not receive 100 identical answers; instead, we would likely receive different values (with some overlap). This variation in values is variance. When variance is observed, the next step is to determine which factors are behind it - in other words, what factors are responsible for the difference in height between the 100 people? Gender is likely an important consideration, as males tend to be taller on average. But is that the only reason some people are taller or shorter than others? Maybe ethnicity is influencing this variance - some ethnic groups are taller on average than others. Genetics might also contribute, as taller parents typically have taller children.

Gender, ethnicity, genetics, etc - are all factors that may contribute to the variation in height. In this example, these factors are independent variables and the height of the individual is the dependent variable. The key question is: <b>how do changes in the independent variables explain the variance in the dependent variable?</b> An additional question is: <b>how much of the variation in the dependent variable can be attributed to each independent variable?</b>

Once we understand the factors that are influencing the dependent variable, we can purposefully manipulate the values of the independent variables to predict how that will change the dependent variable. This is where regression comes in handy! 
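
As a quick numeric illustration of variance, the sketch below computes it step by step. The heights are invented for this example.

```python
import numpy as np

# Invented heights in inches, for illustration only
heights = np.array([62.0, 64.0, 65.0, 67.0, 68.0, 70.0, 71.0, 73.0])

mean_height = heights.mean()

# Sample variance: squared distances from the mean, summed and divided by n - 1
variance = ((heights - mean_height) ** 2).sum() / (len(heights) - 1)

print(mean_height, round(variance, 2))
```

`np.var(heights, ddof=1)` and the pandas `.var()` method give the same sample variance.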


In [7]:
df.head(10)

Unnamed: 0,fname,lname,gender,age,exercise,level_of_fit,hours,level_of_study,grade,home_state
0,Marcia,Pugh,female,17,3,low,10,moderate,82.4,NJ
1,Kadeem,Morrison,male,18,4,low,4,low,78.2,MA
2,Nash,Powell,male,18,5,low,9,moderate,79.3,OH
3,Noelani,Wagner,female,14,2,high,7,moderate,83.2,FL
4,Noelani,Cherry,female,18,4,low,15,high,87.4,OH
5,Neil,Whitley,male,16,5,low,16,high,88.7,NJ
6,Nelle,Golden,female,17,1,high,9,moderate,80.2,PA
7,Armando,Hoffman,male,17,5,low,18,high,95.1,MI
8,Illiana,Rojas,female,15,5,low,9,moderate,76.5,LA
9,Neil,Wooten,male,15,3,low,15,high,89.7,TN


# <font color=DODGERBLUE>Introduction to Linear Regression</font>
***

<b>Linear Regression</b> is a powerful statistical tool that allows for a closer examination of the linear relationship between <b>a continuous dependent variable</b> and various independent variables. 

Linear regression allows you to:
1. Determine if the relationship between the dependent and independent variables is statistically significant (unlikely to be the result of chance alone).
2. Identify how much of the variation in the dependent variable is explained by the selection of independent variables. 
3. Determine the direction and magnitude of the relationships between variables, and 
4. Predict what the value of the dependent variable would be given specific input from the independent variables. 

You can use linear regression to predict the salary of a lawyer (DV) based on the number of years they practiced law (IV). You could also determine just how much of the variation in salary is attributed to the number of years they have practiced law.


***

A simple linear regression models the relationships between a single dependent and independent variable, <b>where the independent variable is predicting the value of the dependent variable</b>. A linear regression model is mathematically represented by the formula of a line:

### \begin{align}  y = mx + b \end{align}
Where “y” is the value of the dependent variable, “m” is the slope (also known as the coefficient), “x” represents the value of the independent variable, and “b” is the y-intercept (also known as the constant), which is the value of “y” when “x” is equal to 0. 


<b>Linear regression models will determine the line-of-best-fit, also known as the regression line, which is the best fitting straight line through your data points.</b> Most commonly, the best fitting line is the one that minimizes the sum of the squared errors, where an error is the vertical distance between a data point and the line. The equation for the regression line is what is used to make predictions for your dependent variable. 

<center><img src='https://s3.amazonaws.com/stackabuse/media/linear-regression-python-scikit-learn-1.png'></center>
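
For a single independent variable, the slope and intercept of the least-squares line can be computed directly. This is a minimal sketch on invented data, meant only to show what "best fitting" means in numbers.

```python
import numpy as np

# Invented data: hours studied (x) and grade (y)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([62.0, 66.0, 71.0, 74.0, 79.0])

# Least-squares slope (m) and intercept (b) for y = mx + b
m = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - m * x.mean()

def sum_squared_error(slope, intercept):
    """Total squared vertical distance between the points and a candidate line."""
    residuals = y - (slope * x + intercept)
    return (residuals ** 2).sum()

print("slope:", round(m, 3), "intercept:", round(b, 3))
print("error of fitted line:", round(sum_squared_error(m, b), 3))
print("error of a slightly steeper line:", round(sum_squared_error(m + 0.1, b), 3))
```

Any other slope or intercept gives a larger sum of squared errors, which is why this line is called the line of best fit.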

***
## Things to Consider when using Linear Regression

***

### Feature Selection

Once we get a sense of the relationship between our variables, we need to make some decisions on which variables to include in our regression analyses. <b>We only want to include the variables that have some kind of relationship with our dependent variable(s) of interest.</b> This allows us to cut back on the empty "noise" in our dataset and focus on the variables that are meaningful. 


### Redundancy
Be careful with including multiple variables that show similar information. For example, if you have a variable "Hours of Study" which represents the total hours that a student studied, and you've also binned that data and created a new variable "Study Level" which groups students into "high, med, low" - you now have two variables in your dataset that show similar information, although one variable is a lot more specific. When you have this situation, you should elect to keep the variable that has the more detailed information (i.e. the specific hours of study). 


### Multiple Groups
Variables with a lot of different levels (e.g. State of Residence) don't do well in regression models. If you have a variable that has a large number of different groups, unless this variable is vital to your analyses, you should work to reduce the number of overall groups (e.g. State can become Region), or leave the variable out of the analyses. It is possible to include these complicated variables, but the results are harder to interpret.
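
One way to reduce the number of groups is to map each level to a broader category. The sketch below uses a handful of `home_state` values like those in the data preview; the `state_to_region` dictionary is a hypothetical grouping written for this example, so adjust it to suit your own analysis.

```python
import pandas as pd

# A few home_state values like those in the data preview
students = pd.DataFrame({"home_state": ["NJ", "MA", "OH", "FL", "PA", "MI", "LA", "TN"]})

# Hypothetical grouping for illustration
state_to_region = {
    "NJ": "Northeast", "MA": "Northeast", "PA": "Northeast",
    "OH": "Midwest", "MI": "Midwest",
    "FL": "South", "LA": "South", "TN": "South",
}

# map() swaps each state for its region; states missing from the dictionary become NaN
students["region"] = students["home_state"].map(state_to_region)

print(students["home_state"].nunique(), "states ->", students["region"].nunique(), "regions")
```

After mapping, check for missing values in the new column so that unmapped states are not silently dropped from the model.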


### Confounding Variables
When we perform regression analyses, we are interested in further understanding the relationship between the dependent and independent variables. Simply put, we are interested in investigating how the independent variable affects the dependent variable. However, it's rarely this simple, and there are other factors that need to be understood and controlled. A confounding variable is a third variable that influences both the independent and dependent variables to some degree. It is important to acknowledge this third variable to ensure that the results of your analyses are valid.

For example, you collect data on sunburns and ice cream consumption. You find that higher ice cream consumption is associated with a higher rate of sunburns. Does this mean that eating ice cream causes sunburns? Absolutely not. Several other factors could explain this trend, but most likely the confounding variable is temperature. Hot temperatures cause people to eat more ice cream and to spend more time outdoors, which can result in more sunburns. Without accounting for the confounding variable(s), you may find relationships between variables that might not actually exist. To control for the potential effects of confounding variables, you can include them in your regression model as additional independent variables.



When you include all of your assumed confounding variables in your regression model, you are controlling for the effects of all of them. If you find there is still a relationship between a specific independent variable and your dependent variable, you can be more confident that the relationship isn't being driven by any of those other factors.
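
The ice cream example can be simulated to see what "controlling for" a confounder does. This is a sketch on made-up data: the effect sizes are arbitrary assumptions, and `ols_coefficients` is a hypothetical helper that uses NumPy's least-squares solver instead of a regression library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated data: temperature drives both ice cream consumption and sunburns
temperature = rng.normal(size=n)
ice_cream = temperature + 0.5 * rng.normal(size=n)
sunburns = 2.0 * temperature + rng.normal(size=n)  # ice cream has no direct effect

def ols_coefficients(y, *predictors):
    """Least-squares coefficients; the first value returned is the intercept."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

naive = ols_coefficients(sunburns, ice_cream)
controlled = ols_coefficients(sunburns, ice_cream, temperature)

print("ice cream coef, temperature ignored: ", round(naive[1], 2))
print("ice cream coef, temperature included:", round(controlled[1], 2))
```

With temperature left out, ice cream appears to have a large effect on sunburns. Once temperature is included, the ice cream coefficient shrinks toward zero.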

#### Library: statsmodels.formula.api <br/>


In [33]:
## new library alert! ##

## Import the StatsModels library for our regression analyses

import statsmodels.formula.api as sm

### Creating a Linear Regression Model

* <b>result</b> is the name that we are assigning to the fitted regression model
* <b>sm</b> is the shorthand (alias) we gave the statsmodels formula library when importing it
* <b>ols</b> is Ordinary Least Squares, the most common method of calculating the regression line
* the regression equation starts with the dependent variable on the left, followed by "~" and then the independent variables
* independent variables are separated by "+"
* categorical variables are wrapped in <b>C( )</b>, for example C(gender); this is required when the categories are coded as numbers
* <b>data =</b> is where you specify the dataset that you're pulling variables from
* <b>.fit()</b> uses the data to calculate the best-fitting regression line
* <b>.summary()</b> shows the calculated values (slopes and y-intercept) for the linear regression formula, along with the fit statistics

In [35]:
## create the regression model
#   .ols("dependent_variable ~ independent_variable + C(categorical_variable)", data = dataset_name).fit()
result = sm.ols("grade ~ hours + age + exercise + C(gender)", data = df).fit()

## print the regression model summary
result.summary()

0,1,2,3
Dep. Variable:,grade,R-squared:,0.665
Model:,OLS,Adj. R-squared:,0.664
Method:,Least Squares,F-statistic:,988.1
Date:,"Wed, 08 Dec 2021",Prob (F-statistic):,0.0
Time:,14:52:06,Log-Likelihood:,-6299.1
No. Observations:,2000,AIC:,12610.0
Df Residuals:,1995,BIC:,12640.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,58.0874,1.326,43.804,0.000,55.487,60.688
C(gender)[T.male],-0.4485,0.253,-1.773,0.076,-0.944,0.047
hours,1.9173,0.031,61.617,0.000,1.856,1.978
age,0.0405,0.075,0.543,0.587,-0.106,0.187
exercise,0.9841,0.089,11.073,0.000,0.810,1.158

0,1,2,3
Omnibus:,325.522,Durbin-Watson:,2.048
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2284.723
Skew:,-0.569,Prob(JB):,0.0
Kurtosis:,8.111,Cond. No.,214.0


# <font color=DODGERBLUE>Interpreting Linear Regression Result</font>
***
### Determining Model Fit
***
Linear regression calculates the regression line (equation) that minimizes the overall distance (measured as the sum of squared vertical distances) between the regression line and all the data points. If our data points fall close to the generated regression line, we consider the model to be a good fit.

<img src='https://blog.minitab.com/hubfs/Imported_Blog_Media/residual_illustration-1.gif'>

But what does that mean non-graphically? Linear regression may not always be the right technique to use for the specific set of data. The fit of the model describes how well your variables explain the variance in the dependent variable. 

To assess the model fit, we look at the adjusted R-squared (Adj. R-squared). The Adj. R-squared is a statistical measure of how close the data are to the fitted regression line. <B>It estimates the percentage of variation in the dependent variable that can be explained by all the independent variables included in the model, with a small penalty for each variable added.</B> For example, how much of the variation in student grades (i.e. grades ranging from 32-100) can be explained by hours of study, age, hours of exercise, gender, etc? Values typically range from 0 to 1; the higher the value, the better the fit. 
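
The adjustment itself is a short formula. As a sketch, the function below applies the standard adjusted R-squared formula to the values reported in the first regression summary (R-squared 0.665, 2000 observations, 4 model terms); `adjusted_r_squared` is a helper name chosen for this example.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Shrink R-squared according to how many predictors the model uses."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read from the first regression summary: R-squared, No. Observations, Df Model
adj = adjusted_r_squared(r_squared=0.665, n_obs=2000, n_predictors=4)

print(round(adj, 3))  # 0.664
```

With 2000 observations and only 4 model terms the penalty is tiny, which is why the two values in the summary are nearly identical.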


***
### Intercept
***
The y-intercept, or constant, is the value given to the dependent variable if all independent variables are equal to 0. For example, when all independent variables are equal to 0, the expected student grade is approximately 58. Don't worry too much about this interpretation. Oftentimes it won't make sense (for example, a student whose age equals 0), but it's important to know how to read it. 

***
### Coefficients (coef)
***
These values show how changes in the independent variable influence the dependent variable, and in what direction. 

#### <b> Interpreting Numeric Variable Coefficients </b>
When you are looking at numeric variables (e.g. age), each coefficient represents the numeric change in the dependent variable given a one-unit change in the independent variable. For example, for every one hour increase in study time (hours), grade increases by 1.9 points.

* <b>INTERCEPT</b>: when all other IVs are zero, expected grade is around 58

* <b>AGE</b>: for every one year increase in age, grade increases by 0.04 points (when controlling for hours of study, exercise, and gender)

* <b>EXERCISE</b>: for every one hour increase in exercise, grade increases by approximately 1 point (when controlling for age, hours of study, and gender)

* <b>HOURS</b>: for every one hour increase in study time, grade increases by 1.9 points (when controlling for age, hours of exercise, and gender)

If you have a variable that gives you a negative coefficient, the two variables move in opposite directions. For example, if the coefficient for age were <b>-0.0405</b>, you would interpret it in the following way: 

* <b>AGE</b>: for every one year increase in age, grade <u>decreases</u> by 0.04 points. 
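
To see why a coefficient is read as "the change for a one-unit increase", the sketch below plugs values into the fitted equation twice, changing only the study hours. The coefficients are copied from the first regression summary, the student values are made up, and the function assumes a female student so that no gender term is added.

```python
# Coefficients copied from the first regression summary
intercept = 58.0874
coef_hours = 1.9173
coef_age = 0.0405
coef_exercise = 0.9841

def predicted_grade(hours, age, exercise):
    """Fitted equation for a female student (no gender term is added)."""
    return intercept + coef_hours * hours + coef_age * age + coef_exercise * exercise

base = predicted_grade(hours=8, age=16, exercise=4)
one_more_hour = predicted_grade(hours=9, age=16, exercise=4)

# With age and exercise held fixed, the prediction moves by exactly the hours coefficient
print(round(one_more_hour - base, 4))
```

The difference equals the hours coefficient no matter which values are chosen for age and exercise, which is what "controlling for" the other variables means here.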

#### <b> Interpreting Categorical Variable Coefficients </b>

When working with categorical variables, the model will automatically take one of the categories and use it as a reference (i.e. comparison category). This reference category will not show up in the listed coefficients, and that is how you will be able to identify which category is serving as the reference. In our model, gender is the only categorical variable, and can be interpreted: 

<b>GENDER</b>:
* Female is the reference category
* On average, Male students have a grade 0.45 points lower than Female students (when controlling for age, hours of study, and exercise) 

The grade is lower because the coefficient is negative in this model. If we have more than 2 categories for our categorical variable, we would still compare each level to the reference. For example, if we had a group of students who chose not to disclose their gender, the 'undisclosed' group would show up in our model with its own coefficient, and we would compare those results to the Female category. 
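
The idea of a reference category can be illustrated with indicator (dummy) columns. statsmodels builds its own columns internally when you use `C()`, so `pd.get_dummies` is used here only to show the concept; the small `genders` series is invented.

```python
import pandas as pd

genders = pd.Series(["female", "male", "male", "female", "undisclosed"], name="gender")

# drop_first=True drops the first category alphabetically ("female"),
# which makes it the reference that the remaining columns are compared against
dummies = pd.get_dummies(genders, drop_first=True)

print(dummies)
```

A row with no indicator set in any column belongs to the reference category, which is why that category never gets a coefficient of its own.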

#### Standard Error (std err)

The standard error reflects how precisely each coefficient has been estimated. The lower the value, the more precise the estimate.

***
### Statistical Significance 
***
#### p-value (P>|t|)
When trying to determine whether the results we received are statistically meaningful, we need to consider the p-value. The p-value is the probability of seeing a result at least as extreme as ours purely by chance, assuming there is actually no relationship between the variables. 

Because the p-value describes how easily chance alone could produce our findings, we want this value to be small. If the p-value was .5 (50%), results like ours would turn up about half the time even when no real relationship exists. The p-value requires strict interpretation. A commonly used cut-off is 0.05 (a 5% chance of seeing results like these by chance alone) or below. 
* If the p-value is less than or equal to 0.05, we can deem our results to be statistically significant. 
* If the p-value is greater than 0.05, our results are not statistically significant. 
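
The p-value in the summary table can be reproduced from the coefficient and its standard error. This sketch uses the gender row of the first regression summary and the t distribution from `scipy.stats`; because the inputs are rounded, the result may differ slightly in the last decimal.

```python
from scipy import stats

# Gender row of the first regression summary
coef = -0.4485
std_err = 0.253
df_residuals = 1995

# The t statistic is the coefficient measured in standard errors
t_stat = coef / std_err

# Two-sided p-value: chance of a t statistic at least this far from 0
# if the true coefficient were 0
p_value = 2 * stats.t.sf(abs(t_stat), df_residuals)

print(round(t_stat, 3), round(p_value, 3))
print("significant at 0.05:", p_value <= 0.05)
```

This also shows how the standard error feeds into significance: the same coefficient with a smaller standard error would produce a larger t statistic and a smaller p-value.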

***
### What do I do with non-significant variables?
***
Regression analyses are rarely a "one and done" situation. After you run your analyses, you should tweak your model (equation) and see if you can improve the model fit. A good place to start is by removing non-significant variables from our model. In this example, gender and age are not significant - let's try a model without them!

In [10]:
## Removing non-significant variables 

result2 = sm.ols('grade ~ hours + exercise', data = df).fit()

result2.summary()

0,1,2,3
Dep. Variable:,grade,R-squared:,0.664
Model:,OLS,Adj. R-squared:,0.664
Method:,Least Squares,F-statistic:,1973.0
Date:,"Tue, 07 Dec 2021",Prob (F-statistic):,0.0
Time:,18:25:45,Log-Likelihood:,-6300.8
No. Observations:,2000,AIC:,12610.0
Df Residuals:,1997,BIC:,12620.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,58.5316,0.447,130.828,0.000,57.654,59.409
hours,1.9162,0.031,61.575,0.000,1.855,1.977
exercise,0.9892,0.089,11.131,0.000,0.815,1.163

0,1,2,3
Omnibus:,318.721,Durbin-Watson:,2.048
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2158.0
Skew:,-0.564,Prob(JB):,0.0
Kurtosis:,7.962,Cond. No.,43.2


# <font color=DODGERBLUE>Comparing Models and Making Predictions</font>

When comparing two models, the first thing you want to check is whether there are any changes in the adjusted r-squared. In this example, the adj r-squared has not changed at all between the two models. The next item you want to check is the p-value of the remaining variables in the model. Did removing variables increase/decrease the p-values for the remaining variables?

You do not need to keep re-running your model until all of your variables are significant. Focus on maximizing your adj r-squared, regardless of the significance of individual variables. 

### What can I do with this information?

Now that you have a better understanding of how your variables interact, you are in a better place to describe your data. You now know that an increase of just one hour of studying is associated, on average, with a significant increase in a student's final grade. You can also make predictions about future grades...

### Making Predictions based on the Regression Results

Recall that a simple linear regression model is mathematically represented by the formula of a line: <b> y = mx + b</b>. When you are creating a multiple linear regression equation, the formula is similar, with some added terms: <b> y = b + (m1 x X1) + (m2 x X2) + (m3 x X3) + ... </b> where each "m" is one of the coefficients in your model, each "X" is the value of the matching independent variable, and "b" represents the intercept. Once you have your regression output, you can plug these values in to make predictions on your dependent variable given specific values for your independent variables.

    grade(y) = intercept(b) + [hours of study coef(m1) x hours of study(x1)] + [hours of exercise coef(m2) x hours exercise(x2)]

    grade(y) = 58.5316 + [1.9162 x hours of study(x1)] + [0.9892 x hours exercise(x2)]

### What grade can we expect from a student that studied 8 hours and exercised 4.3 hours?

    grade(y) = 58.5316 + [1.9162 x hours of study(8)] + [0.9892 x hours exercise(4.3)]

    grade(y) = 58.5316 + [1.9162 x (8)] + [0.9892 x (4.3)]

    grade(y) = 58.5316 + [15.3296] + [4.25356]

    grade = 78.11476
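
The same arithmetic can be wrapped in a small function so that other scenarios are easy to try. The coefficients are copied from the second model's summary; `predict_grade` is a helper name chosen for this sketch.

```python
# Coefficients from the second model (grade ~ hours + exercise)
intercept = 58.5316
coef_hours = 1.9162
coef_exercise = 0.9892

def predict_grade(hours, exercise):
    """Evaluate the regression equation for one student."""
    return intercept + coef_hours * hours + coef_exercise * exercise

prediction = predict_grade(hours=8, exercise=4.3)

print(round(prediction, 5))  # 78.11476
```

Keep the inputs within the range seen in the data. The equation will happily return a value for 40 hours of study, but the model has no evidence about students like that.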

## <font color=SALMON>Ta Da! You can now predict the future!</font>

<LEFT><img src='https://i0.wp.com/www.learning-mind.com/wp-content/uploads/2019/10/psychic-spiritual-energy.jpg?resize=768%2C512&ssl=1'></LEFT>

## Making Predictions: the simple way!

What grade can we expect from a 16 year old female student that studied 8 hours and exercised 5.7 hours? 

In [11]:
## We can use the predict function (from the statsmodels library) to predict the outcome given specific input
## call predict on the fitted model whose coefficients you want to use

# model_name.predict({'variable1_name':value1, 'variable2_name':value2, ...})

result.predict({
    'hours': 8, 
    'age': 16, 
    'exercise': 5.7, 
    'gender': "female"})

0    79.683606
dtype: float64

In [12]:
## What about another scenario?

result.predict({
    'hours': 14, 
    'age': 18, 
    'exercise': 12, 
    'gender': "male"})

0    97.020102
dtype: float64

# <font color=DODGERBLUE>Module 11 Exercises</font>

#### 1. Complete the code below to import the four libraries we've used most commonly.

In [36]:
import pandas as pd
import numpy as np
import scipy.stats as stats

In [37]:
import statsmodels.formula.api as sm

#### 2. Import the "babies.csv" file and name it df. 

<b>Background Info</b>

    The Child Health and Development Studies considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. The goal is to model the weight of the infants (bwt, in ounces) using variables including length of pregnancy in days (gestation), mother's age in years (age), mother's height in inches (height), whether the child was the first born (parity), mother's pregnancy weight in pounds (weight), and whether the mother was a smoker (smoke).

<b>Variables</b>

    case - id number
    bwt - birthweight, in ounces
    gestation - length of gestation, in days
    parity - binary indicator for a first pregnancy (0=first pregnancy)
    age - mother's age in years
    height - mother's height in inches
    weight - mother's weight in pounds
    smoke - binary indicator for whether the mother smokes

In [38]:
df= pd.read_csv("babies.csv")

#### 3. Check the shape of the dataset. How many columns and rows are there?

In [39]:
df.shape

(1236, 8)

#### 4. Check the first 10 rows and the last 10 rows. Drop the column "case". 

In [40]:
df.head(10)

Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke
0,1,120,284.0,0,27.0,62.0,100.0,0.0
1,2,113,282.0,0,33.0,64.0,135.0,0.0
2,3,128,279.0,0,28.0,64.0,115.0,1.0
3,4,123,,0,36.0,69.0,190.0,0.0
4,5,108,282.0,0,23.0,67.0,125.0,1.0
5,6,136,286.0,0,25.0,62.0,93.0,0.0
6,7,138,244.0,0,33.0,62.0,178.0,0.0
7,8,132,245.0,0,23.0,65.0,140.0,0.0
8,9,120,289.0,0,25.0,62.0,125.0,0.0
9,10,143,299.0,0,30.0,66.0,136.0,1.0


In [41]:
df.tail(10)

Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke
1226,1227,109,244.0,1,21.0,63.0,102.0,1.0
1227,1228,103,278.0,0,30.0,60.0,87.0,1.0
1228,1229,118,276.0,0,34.0,64.0,116.0,0.0
1229,1230,127,290.0,0,27.0,65.0,121.0,0.0
1230,1231,132,270.0,0,27.0,65.0,126.0,0.0
1231,1232,113,275.0,1,27.0,60.0,100.0,0.0
1232,1233,128,265.0,0,24.0,67.0,120.0,0.0
1233,1234,130,291.0,0,30.0,65.0,150.0,1.0
1234,1235,125,281.0,1,21.0,65.0,110.0,0.0
1235,1236,117,297.0,0,38.0,65.0,129.0,0.0


#### 5. Is there any missing data? Check!

In [42]:
df.isnull().sum()

case          0
bwt           0
gestation    13
parity        0
age           2
height       22
weight       36
smoke        10
dtype: int64

#### 6. The amount of missing data is small considering the size of our dataset. Drop all rows in the dataset that have missing data.

In [55]:
a= df.copy()
a.dropna(inplace = True)

In [56]:
a.isnull().sum()

case         0
bwt          0
gestation    0
parity       0
age          0
height       0
weight       0
smoke        0
dtype: int64

#### 7. Check each of your numeric columns for outliers - pick one method and use it for all the columns. 

In [57]:
print("starting shape: ", a.shape)

## absolute z-score for each numeric column (values above 3 are treated as outliers)
a["z_bwt"] = np.abs(stats.zscore(a["bwt"]))
a["z_gestation"] = np.abs(stats.zscore(a["gestation"]))
a["z_parity"] = np.abs(stats.zscore(a["parity"]))
a["z_age"] = np.abs(stats.zscore(a["age"]))
a["z_height"] = np.abs(stats.zscore(a["height"]))
a["z_weight"] = np.abs(stats.zscore(a["weight"]))

starting shape:  (1174, 8)


In [58]:
# drop every row whose absolute z-score exceeds 3 in any numeric column
bwt_out = a[a["z_bwt"] > 3].index
a = a.drop(bwt_out)

gest_out = a[a["z_gestation"] > 3].index
a = a.drop(gest_out)

parity_out = a[a["z_parity"] > 3].index
a = a.drop(parity_out)

age_out = a[a["z_age"] > 3].index
a = a.drop(age_out)

height_out = a[a["z_height"] > 3].index
a = a.drop(height_out)

weight_out = a[a["z_weight"] > 3].index
a = a.drop(weight_out)

# remove the helper z-score columns
a.drop(columns=["z_bwt", "z_gestation", "z_parity", "z_age", "z_height", "z_weight"], inplace=True)



In [59]:
print("shape after removing outliers: ", a.shape)

shape after removing outliers:  (1134, 8)
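
The column-by-column drops above could also be written as a single boolean mask. This is a sketch on hypothetical toy data, not the course dataset. It computes z-scores with pandas using `ddof=0`, which is the default `scipy.stats.zscore` uses, and keeps a row only when every column is within 3 standard deviations.

```python
import pandas as pd

# Hypothetical toy data: 19 ordinary rows plus one extreme value in "x"
toy = pd.DataFrame({
    "x": list(range(19)) + [500],
    "y": list(range(20)),
})

# Population z-scores (ddof=0) for every column at once
z = (toy - toy.mean()) / toy.std(ddof=0)

# Keep a row only if no column has an absolute z-score above 3
keep = (z.abs() <= 3).all(axis=1)
cleaned = toy[keep]

print("starting shape:", toy.shape)
print("shape after removing outliers:", cleaned.shape)
```

Because the z-scores are computed once before any row is removed, the mask selects the same rows as dropping one column's outliers after another.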


#### 8. Print the descriptive statistics for each numeric column. What is the average age of the mothers? What is the average gestation period?

In [60]:
a.describe()

,case,bwt,gestation,parity,age,height,weight,smoke
count,1134.0,1134.0,1134.0,1134.0,1134.0,1134.0,1134.0,1134.0
mean,624.557319,119.661376,279.42328,0.267196,27.206349,64.040564,127.218695,0.390653
std,356.104467,17.860299,13.930125,0.442691,5.802403,2.468179,18.385539,0.488112
min,1.0,65.0,232.0,0.0,15.0,57.0,87.0,0.0
25%,319.25,109.0,272.0,0.0,23.0,62.0,114.0,0.0
50%,624.5,120.0,280.0,0.0,26.0,64.0,125.0,0.0
75%,935.75,131.0,288.0,1.0,31.0,66.0,137.0,1.0
max,1236.0,174.0,324.0,1.0,44.0,71.0,190.0,1.0

The average age of the mothers is about 27.2 years, and the average gestation period is about 279.4 days.


#### 9. Let's model birthweight based on the characteristics of the mother. But first... 

We want to easily distinguish between the numeric and categorical variables. Replace the 0/1 values in the "parity" and "smoke" columns with meaningful labels (e.g., smokes, doesn't smoke).

In [61]:
a["parity"] = a["parity"].replace([0, 1], ["First Preg", "Not First Preg"])

In [62]:
a["smoke"] = a["smoke"].replace([0, 1], ["NonSmoker", "Smoker"])

a.head()

,case,bwt,gestation,parity,age,height,weight,smoke
0,1,120,284.0,First Preg,27.0,62.0,100.0,NonSmoker
1,2,113,282.0,First Preg,33.0,64.0,135.0,NonSmoker
2,3,128,279.0,First Preg,28.0,64.0,115.0,Smoker
4,5,108,282.0,First Preg,23.0,67.0,125.0,Smoker
5,6,136,286.0,First Preg,25.0,62.0,93.0,NonSmoker
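
An alternative way to relabel coded values is `Series.map` with a dictionary, which keeps the meaning of each code in one place. The sketch below uses a hypothetical toy column and assigns the result back to the column rather than modifying it in place.

```python
import pandas as pd

# Hypothetical toy column of 0/1 codes like "smoke"
toy = pd.DataFrame({"smoke": [0, 1, 0, 1]})

# The dictionary documents what each code means
smoke_labels = {0: "NonSmoker", 1: "Smoker"}
toy["smoke"] = toy["smoke"].map(smoke_labels)

print(toy["smoke"].tolist())
```

One difference to keep in mind: with `map`, any value missing from the dictionary becomes `NaN`, whereas `replace` leaves unmatched values unchanged.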


#### 10. Run a correlation matrix with your dataset. Which variables are correlated with birthweight? 

Describe the strength of the correlation between all the numeric variables and birthweight. 

In [63]:
a.corr(numeric_only=True)  # skip the text columns "parity" and "smoke"


,case,bwt,gestation,age,height,weight
case,1.0,-0.049006,0.031611,0.013661,-0.019494,-0.055292
bwt,-0.049006,1.0,0.411186,0.026259,0.212844,0.166821
gestation,0.031611,0.411186,1.0,-0.05402,0.072687,0.045045
age,0.013661,0.026259,-0.05402,1.0,-0.001016,0.16076
height,-0.019494,0.212844,0.072687,-0.001016,1.0,0.46382
weight,-0.055292,0.166821,0.045045,0.16076,0.46382,1.0
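
Since the question is only about birthweight, it can be easier to read a single column of the matrix, ordered by strength. The sketch below runs on hypothetical toy data. It assumes a pandas version that accepts `numeric_only=True` (1.5 or later) so that the text column is skipped.

```python
import pandas as pd

# Hypothetical toy data with one text column, like the relabeled dataset
toy = pd.DataFrame({
    "bwt": [1, 2, 3, 4, 5],
    "gestation": [2, 4, 6, 8, 10],
    "age": [5, 3, 4, 1, 2],
    "smoke": ["NonSmoker", "Smoker", "NonSmoker", "Smoker", "NonSmoker"],
})

corr = toy.corr(numeric_only=True)

# Correlation of every other numeric column with bwt
bwt_corr = corr["bwt"].drop("bwt")

# Order by absolute value so strong negative correlations are not buried
order = bwt_corr.abs().sort_values(ascending=False).index
ranked = bwt_corr.reindex(order)
print(ranked)
```

Ranking by absolute value matters because a correlation of -0.8 is stronger than one of 0.4, even though it sorts lower numerically.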


#### 11. Determine the relationship between birthweight and the categorical variables: parity and smoke. 

Use the groupby function to compare mean birthweight across the groups of each variable. Does it seem like there is a relationship between these variables and birthweight?

In [64]:
a["bwt"].groupby(a["smoke"]).mean()

smoke
NonSmoker    123.337192
Smoker       113.927765
Name: bwt, dtype: float64

In [65]:
a["bwt"].groupby(a["parity"]).mean()

parity
First Preg        120.174489
Not First Preg    118.254125
Name: bwt, dtype: float64
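
Group means alone do not show whether a gap is large relative to the spread within each group. The sketch below, on hypothetical toy data, reports the mean, standard deviation, and count per group, then computes a Welch-style t statistic by hand with NumPy. It is an illustration of the idea, not a test that was run on the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four smokers and four non-smokers
toy = pd.DataFrame({
    "smoke": ["Smoker"] * 4 + ["NonSmoker"] * 4,
    "bwt": [110, 112, 114, 116, 120, 122, 124, 126],
})

# Mean, sample standard deviation (ddof=1), and group size in one table
summary = toy.groupby("smoke")["bwt"].agg(["mean", "std", "count"])
print(summary)

non, smk = summary.loc["NonSmoker"], summary.loc["Smoker"]

# Difference in means divided by its standard error (unequal variances allowed)
diff = non["mean"] - smk["mean"]
se = np.sqrt(non["std"] ** 2 / non["count"] + smk["std"] ** 2 / smk["count"])
t_stat = diff / se
print("difference:", diff, "t statistic:", round(t_stat, 3))
```

A t statistic far from zero suggests the gap is unlikely to come from sampling noise alone; a proper p-value would need the t distribution, for example from `scipy.stats`.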

#### 12. Let's construct your regression model. Firstly, which variables do you plan to include in your model, and why? 

In the space below, write your justification for why you are including each variable. 

The correlation coefficients of gestation, height, and weight with birthweight are 0.41, 0.21, and 0.17, which indicate moderate, weak, and weak relationships respectively, so I will keep gestation, height, and weight as independent variables.

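Comparing group means, birthweight differs by about 9.4 oz between non-smokers and smokers (123.3 vs. 113.9) and by about 1.9 oz between first and non-first pregnancies (120.2 vs. 118.3). I did not test these differences for significance, so I will leave both variables out of this model.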

#### 13. Construct your regression model and print the summary. 

Write out your full interpretation of the regression results. If you are not happy with the results, tweak your model and run it again. 

In [67]:
import statsmodels.formula.api as sm  # needed for sm.ols if not already imported above

result = sm.ols("bwt ~ gestation + height + weight", data=a).fit()
result.summary()

Dep. Variable:,bwt,R-squared:,0.208
Model:,OLS,Adj. R-squared:,0.206
Method:,Least Squares,F-statistic:,98.85
Date:,"Wed, 08 Dec 2021",Prob (F-statistic):,8.07e-57
Time:,18:07:46,Log-Likelihood:,-4745.3
No. Observations:,1134,AIC:,9499.0
Df Residuals:,1130,BIC:,9519.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-100.3913,15.267,-6.576,0.000,-130.346,-70.437
gestation,0.5089,0.034,14.950,0.000,0.442,0.576
height,1.0594,0.217,4.891,0.000,0.634,1.484
weight,0.0787,0.029,2.711,0.007,0.022,0.136

Omnibus:,0.937,Durbin-Watson:,2.062
Prob(Omnibus):,0.626,Jarque-Bera (JB):,0.816
Skew:,0.029,Prob(JB):,0.665
Kurtosis:,3.118,Cond. No.,10100.0


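gestation: holding height and weight constant, every 1-day increase in gestation is associated with a 0.51 oz increase in birthweight.
height: holding the other predictors constant, every 1-inch increase in the mother's height is associated with a 1.06 oz increase in birthweight.
weight: holding the other predictors constant, every 1-pound increase in the mother's weight is associated with a 0.08 oz increase in birthweight.

All three coefficients are statistically significant (p < 0.01), but the model explains only about 21% of the variance in birthweight (R-squared = 0.208).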
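
As a quick check on how to read the summary, the Adj. R-squared value can be reproduced from R-squared, the number of observations, and the number of predictors. The sketch below plugs in the rounded figures printed in the summary table; the variable names are just for illustration.

```python
# Values read from the regression summary above
r_squared = 0.208     # R-squared
n_obs = 1134          # No. Observations
n_predictors = 3      # Df Model

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)
print(round(adj_r_squared, 3))
```

With over a thousand observations and only three predictors, the penalty is small, which is why the two values are nearly identical.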

#### 14. Create three scenarios (i.e. make up specific values) and predict the birthweight given these factors. Use the information in the model summary to make these predictions. 

In [68]:
result.predict({ "gestation": 290, "height": 67, "weight": 145})

0    129.578683
dtype: float64

In [69]:
result.predict({ "gestation": 250, "height": 67, "weight": 145})

0    109.22385
dtype: float64

In [70]:
result.predict({ "gestation": 290, "height": 63, "weight": 145})

0    125.340896
dtype: float64
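
The same predictions can be worked out by hand from the coefficients in the model summary, which is what the prompt asks for. The sketch below uses the rounded coefficients from the table, so its results can differ from `result.predict` in the last decimal places. The helper name `predict_bwt` is made up for this example.

```python
# Coefficients copied (rounded) from the regression summary
coef = {
    "Intercept": -100.3913,
    "gestation": 0.5089,
    "height": 1.0594,
    "weight": 0.0787,
}

def predict_bwt(gestation, height, weight):
    """Fitted equation: intercept plus each coefficient times its input."""
    return (
        coef["Intercept"]
        + coef["gestation"] * gestation
        + coef["height"] * height
        + coef["weight"] * weight
    )

# The three scenarios used in the cells above
scenarios = [
    {"gestation": 290, "height": 67, "weight": 145},
    {"gestation": 250, "height": 67, "weight": 145},
    {"gestation": 290, "height": 63, "weight": 145},
]

for s in scenarios:
    print(s, "->", round(predict_bwt(**s), 2), "oz")
```

Comparing the first two scenarios shows the gestation coefficient at work: 40 fewer days lowers the prediction by roughly 40 × 0.5089 ≈ 20.4 oz.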