# **Regression Analysis**


-   Regression analysis in place of the t-test
-   Regression analysis in place of ANOVA
-   Regression analysis in place of correlation


In [None]:
# !pip install pandas
# !pip install numpy
# !pip install statsmodels

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

In [2]:
ratings_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork/labs/teachingratings.csv'
ratings_df = pd.read_csv(ratings_url)


### Regression with T-test: Using the teachers' rating data set, does gender affect teaching evaluation scores?


Previously, we used the t-test to check for a statistically significant difference in evaluations between males and females. We will now use regression instead. First, we state the hypotheses:

-   $H_0: \beta_1 = 0$ (Gender has no effect on teaching evaluation scores)
-   $H_1: \beta_1 \neq 0$ (Gender has an effect on teaching evaluation scores)


We will use the `female` variable, coded as female = 1 and male = 0.


In [3]:
## X is the input variables (or independent variables)
X = ratings_df['female']
## y is the target/dependent variable
y = ratings_df['eval']
## add an intercept (beta_0) to our model
X = sm.add_constant(X) 

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()

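| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Dep. Variable: | eval | R-squared: | 0.022 |
| Model: | OLS | Adj. R-squared: | 0.02 |
| Method: | Least Squares | F-statistic: | 10.56 |
| Date: | Tue, 12 Jan 2021 | Prob (F-statistic): | 0.00124 |
| Time: | 18:37:42 | Log-Likelihood: | -378.5 |
| No. Observations: | 463 | AIC: | 761.0 |
| Df Residuals: | 461 | BIC: | 769.3 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | 4.0690 | 0.034 | 121.288 | 0.000 | 4.003 | 4.135 |
| female | -0.1680 | 0.052 | -3.250 | 0.001 | -0.270 | -0.066 |

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Omnibus: | 17.625 | Durbin-Watson: | 1.209 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 18.97 |
| Skew: | -0.496 | Prob(JB): | 7.6e-05 |
| Kurtosis: | 2.981 | Cond. No. | 2.47 |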


**Conclusion:** As with the t-test, the p-value (0.001) is less than the alpha (α) level of 0.05, so we reject the null hypothesis: there is evidence of a difference in mean evaluation scores based on gender. The coefficient of -0.1680 means that, on average, female instructors score 0.168 points lower than male instructors.
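
To see why the regression reproduces the t-test comparison, note that with a single 0/1 predictor the fitted intercept is the mean of the group coded 0 and the slope is the difference between the two group means. The sketch below checks this on a small made-up sample (the scores are illustrative, not taken from the ratings data), using NumPy's least-squares solver rather than statsmodels.

```python
import numpy as np

# Hypothetical evaluation scores, for illustration only
female = np.array([0, 0, 0, 0, 1, 1, 1, 1])          # 1 = female, 0 = male
scores = np.array([4.2, 4.0, 4.4, 3.8, 3.9, 4.1, 3.7, 3.9])

# Design matrix with an intercept column, similar to what sm.add_constant builds
X = np.column_stack([np.ones(len(female)), female])
(intercept, slope), *_ = np.linalg.lstsq(X, scores, rcond=None)

mean_male = scores[female == 0].mean()
mean_female = scores[female == 1].mean()

print(f"intercept = {intercept:.3f}, mean of group 0 = {mean_male:.3f}")
print(f"slope = {slope:.3f}, difference in means = {mean_female - mean_male:.3f}")
```

Testing whether the slope is zero is therefore the same question as testing whether the two group means are equal.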


### Regression with ANOVA: Using the teachers' rating data set, do beauty scores for instructors differ by age?


State the hypotheses:

-   $H_0: \mu_1 = \mu_2 = \mu_3$ (the three population means are equal)
-   $H_1:$ At least one of the means differs


Then we group the data into age groups, as we did for ANOVA.


In [5]:
ratings_df.loc[(ratings_df['age'] <= 40), 'age_group'] = '40 years and younger'
ratings_df.loc[(ratings_df['age'] > 40)&(ratings_df['age'] < 57), 'age_group'] = 'between 40 and 57 years'
ratings_df.loc[(ratings_df['age'] >= 57), 'age_group'] = '57 years and older'
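
The three `.loc` assignments can also be written as a single vectorized step. The sketch below uses `np.select` on a few made-up ages (not the ratings data) with the same boundaries as above. Conditions are checked in order, so the second one only applies to ages above 40.

```python
import numpy as np

ages = np.array([35, 40, 41, 56, 57, 70])  # hypothetical ages

# The first matching condition wins; anything unmatched gets the default label
conditions = [ages <= 40, ages < 57]
labels = ['40 years and younger', 'between 40 and 57 years']
age_group = np.select(conditions, labels, default='57 years and older')

print(list(zip(ages.tolist(), age_group.tolist())))
```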

Use the `ols` function from the statsmodels library:


In [6]:
from statsmodels.formula.api import ols
lm = ols('beauty ~ age_group', data=ratings_df).fit()
table = sm.stats.anova_lm(lm)
print(table)

              df      sum_sq    mean_sq          F        PR(>F)
age_group    2.0   20.422744  10.211372  17.597559  4.322549e-08
Residual   460.0  266.925153   0.580272        NaN           NaN


**Conclusion:** The regression gives the same values as the earlier ANOVA. Since the p-value is less than 0.05, we reject the null hypothesis: there is significant evidence that at least one of the means differs.
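
The F statistic in the table can be reproduced by hand from the sums of squares: each mean square is its sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. The numbers below are copied from the ANOVA table above; the variable names are chosen here for illustration.

```python
# Values copied from the ANOVA table above
ss_between, df_between = 20.422744, 2      # age_group row
ss_within, df_within = 266.925153, 460     # Residual row

ms_between = ss_between / df_between       # mean square for age_group
ms_within = ss_within / df_within          # mean square for the residual
f_stat = ms_between / ms_within

# Share of the total variation explained by age group (the model's R-squared)
r_squared = ss_between / (ss_between + ss_within)

print(f"F = {f_stat:.4f}")
print(f"R-squared = {r_squared:.3f}")
```

The computed F agrees with the value of 17.597559 reported by `anova_lm`.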


### Regression with ANOVA option 2


Create dummy variables. A dummy variable is a numeric variable that represents categorical data, such as gender or race. Dummy variables are dichotomous, i.e., they take only two values, 0 or 1.
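
As a minimal illustration, the sketch below applies `pd.get_dummies` to a made-up three-row column (the labels are hypothetical). Each category becomes its own 0/1 column. Passing `dtype=int` is a choice made here so the output shows 0 and 1 rather than booleans.

```python
import pandas as pd

# Hypothetical category labels, one row per instructor
toy = pd.DataFrame({'age_group': ['younger', 'older', 'middle']})

# One indicator column per category, named <column>_<category>
dummies = pd.get_dummies(toy[['age_group']], dtype=int)

print(dummies)
```

Exactly one dummy equals 1 in each row, because every observation belongs to exactly one category.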


In [10]:
X = pd.get_dummies(ratings_df[['age_group']])

In [8]:
y = ratings_df['beauty']
## add an intercept (beta_0) to our model
X = sm.add_constant(X) 

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Dep. Variable: | beauty | R-squared: | 0.071 |
| Model: | OLS | Adj. R-squared: | 0.067 |
| Method: | Least Squares | F-statistic: | 17.6 |
| Date: | Tue, 12 Jan 2021 | Prob (F-statistic): | 4.32e-08 |
| Time: | 18:49:41 | Log-Likelihood: | -529.47 |
| No. Observations: | 463 | AIC: | 1065.0 |
| Df Residuals: | 460 | BIC: | 1077.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | 0.0138 | 0.028 | 0.496 | 0.620 | -0.041 | 0.069 |
| age_group_40 years and younger | 0.3224 | 0.058 | 5.574 | 0.000 | 0.209 | 0.436 |
| age_group_57 years and older | -0.2596 | 0.056 | -4.621 | 0.000 | -0.370 | -0.149 |
| age_group_between 40 and 57 years | -0.0489 | 0.045 | -1.081 | 0.280 | -0.138 | 0.040 |

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Omnibus: | 11.586 | Durbin-Watson: | 0.434 |
| Prob(Omnibus): | 0.003 | Jarque-Bera (JB): | 12.114 |
| Skew: | 0.394 | Prob(JB): | 0.00234 |
| Kurtosis: | 2.913 | Cond. No. | 5.98e+15 |


The F-statistic (17.6) and its p-value (4.32e-08) match the ANOVA table from the first approach, so the conclusion is the same.


### Correlation: Using the teachers' rating data set, is the teaching evaluation score correlated with the beauty score?


In [9]:
## X is the input variables (or independent variables)
X = ratings_df['beauty']
## y is the target/dependent variable
y = ratings_df['eval']
## add an intercept (beta_0) to our model
X = sm.add_constant(X) 

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()

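| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Dep. Variable: | eval | R-squared: | 0.036 |
| Model: | OLS | Adj. R-squared: | 0.034 |
| Method: | Least Squares | F-statistic: | 17.08 |
| Date: | Tue, 12 Jan 2021 | Prob (F-statistic): | 4.25e-05 |
| Time: | 18:49:51 | Log-Likelihood: | -375.32 |
| No. Observations: | 463 | AIC: | 754.6 |
| Df Residuals: | 461 | BIC: | 762.9 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | 3.9983 | 0.025 | 157.727 | 0.000 | 3.948 | 4.048 |
| beauty | 0.1330 | 0.032 | 4.133 | 0.000 | 0.070 | 0.196 |

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Omnibus: | 15.399 | Durbin-Watson: | 1.238 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 16.405 |
| Skew: | -0.453 | Prob(JB): | 0.000274 |
| Kurtosis: | 2.831 | Cond. No. | 1.27 |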


**Conclusion:** Since p < 0.05, there is evidence of a correlation between beauty and evaluation scores.
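
With a single predictor, the regression's R-squared is the square of the Pearson correlation coefficient, and the slope has the same sign as the correlation. Applied to the summary above, an R-squared of 0.036 with a positive slope corresponds to a correlation of roughly 0.19. The sketch below checks the identity on made-up numbers (not the ratings data) using NumPy only.

```python
import numpy as np

# Hypothetical paired observations, for illustration only
x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 1.5])
y = np.array([3.8, 4.1, 3.9, 4.2, 4.0, 4.4])

r = np.corrcoef(x, y)[0, 1]            # Pearson correlation coefficient

# Simple linear regression y = b0 + b1 * x by least squares
X = np.column_stack([np.ones(len(x)), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
ss_res = np.sum((y - fitted) ** 2)     # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation
r_squared = 1 - ss_res / ss_tot

print(f"r = {r:.4f}, r squared = {r**2:.4f}, R-squared = {r_squared:.4f}")
```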


### Using the teachers' rating data set, does tenure affect beauty scores?

-   Use α = 0.05


In [12]:
## X is the independent variable
## (assumption: the tenure indicator column is named 'tenured_prof', coded 1 = tenured, 0 = not tenured)
X = ratings_df['tenured_prof']
## y is the dependent variable: beauty scores
y = ratings_df['beauty']
## add an intercept (beta_0) to our model
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()


### Using the teachers' rating data set, does being an English speaker affect the number of students assigned to professors?

-   Use "allstudents"
-   Use α = 0.05 and α = 0.1 


In [13]:
## X is the independent variable
## (assumption: the English-speaker indicator column is named 'English_speaker', coded 1 = yes, 0 = no)
X = ratings_df['English_speaker']
## y is the dependent variable: number of students assigned
y = ratings_df['allstudents']
## add an intercept (beta_0) to our model
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()


### Using the teachers' rating data set, what is the correlation between the number of students who participated in the evaluation survey and evaluation scores?

-   Use "students" variable


In [14]:
## X is the independent variable: number of students who took part in the evaluation survey
X = ratings_df['students']
## y is the dependent variable: evaluation scores
y = ratings_df['eval']
## add an intercept (beta_0) to our model
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
predictions = model.predict(X)

# Print out the statistics
model.summary()


This project was done as part of the IBM Data Science Professional Certificate graded coursework.

## Authors


[Aije Egwaikhide](https://www.linkedin.com/in/aije-egwaikhide?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork-20531532&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ) is a Data Scientist at IBM who holds a degree in Economics and Statistics from the University of Manitoba and a Post-grad in Business Analytics from St. Lawrence College, Kingston. She is a current employee of IBM where she started as a Junior Data Scientist at the Global Business Services (GBS) in 2018. Her main role was making meaning out of data for their Oil and Gas clients through basic statistics and advanced Machine Learning algorithms. The highlight of her time in GBS was creating a customized end-to-end Machine learning and Statistics solution on optimizing operations in the Oil and Gas wells. She moved to the Cognitive Systems Group as a Senior Data Scientist where she will be providing the team with actionable insights using Data Science techniques and further improve processes through building machine learning solutions. She recently joined the IBM Developer Skills Network group where she brings her real-world experience to the courses she creates.


## Change Log


| Date (YYYY-MM-DD) | Version | Changed By      | Change Description                     |
| ----------------- | ------- | --------------- | -------------------------------------- |
| 2020-08-14        | 0.1     | Aije Egwaikhide | Created the initial version of the lab |


 Copyright © 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork-20531532&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork-20531532&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork-20531532&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork-20531532&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).
