# Ordinary Least Squares Linear Fitting

In this lab we will make scatterplots of data which include the OLS regression line.

We begin by importing libraries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

<h2> 1. Example</h2>
<h3> OLS Regression of ACT Scores and College Eligibility</h3>
Datafile: ACTCollegeEligible.csv

1) Import the data, dropping rows with missing values.

In [2]:
df=pd.read_csv("ACTCollegeEligible.csv")
df=df.dropna()
df.head(2)
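
As a minimal sketch of what `dropna()` does with its default arguments, the cell below uses a small made-up table (the column names `score` and `percent` and all values are hypothetical, not taken from the ACT file). Every row that contains at least one missing value is removed, and the surviving rows keep their original labels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the ACT file: rows 1 and 2 each have one missing value.
toy = pd.DataFrame({
    "score": [18.2, np.nan, 21.5, 24.0],
    "percent": [35.0, 42.1, np.nan, 71.3],
})

# With default arguments, dropna() removes every row containing at least one NaN.
clean = toy.dropna()

print(len(toy), "rows before,", len(clean), "rows after")
print(clean.index.tolist())  # the kept rows retain their original labels
```

Comparing `len(df)` before and after the call is a quick way to see how many rows were discarded.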

2) Specify the x and y coordinates of the points to be plotted.

In [3]:
X=df[["Average Score ACT 2012"]]
Y=df[["College Eligibility 2012 - Percent"]]
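
The double brackets above matter: scikit-learn's `fit` expects `X` to be two-dimensional, with one row per sample and one column per feature. The sketch below, on a hypothetical toy frame, shows the difference between selecting with a list of column names and selecting with a single name.

```python
import pandas as pd

# Hypothetical toy data used only to show the shapes.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.1, 5.9]})

as_frame = toy[["x"]]   # list of column names -> DataFrame with shape (3, 1)
as_series = toy["x"]    # single column name   -> Series with shape (3,)

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```

Because `Y` is also selected with double brackets, the fitted `coef_` is a two-dimensional array and `intercept_` is a one-dimensional array, which is why they can be unpacked with `[[m]]` and `[b]`.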

3)  Create the OLS regression model.

In [4]:
OLSmodel = LinearRegression()
OLSmodel.fit(X, Y)
print("OLS regression line y=mx+b")
print("slope m=",OLSmodel.coef_)
print("intercept b=",OLSmodel.intercept_)
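
For a single predictor, the least-squares slope and intercept have a closed form: m is the sum of (x - mean(x))(y - mean(y)) divided by the sum of (x - mean(x))^2, and b = mean(y) - m*mean(x). The sketch below applies these formulas to hypothetical toy data and cross-checks them against `np.polyfit`; it illustrates the quantity being computed and is not a description of how scikit-learn solves the problem internally.

```python
import numpy as np

# Hypothetical toy data, chosen only to illustrate the formulas.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()

# Least-squares slope and intercept for one predictor.
m = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b = y_bar - m * x_bar

# Cross-check against NumPy's degree-1 polynomial fit (returns slope, then intercept).
m_check, b_check = np.polyfit(x, y, 1)

print("slope m =", m, "polyfit slope =", m_check)
print("intercept b =", b, "polyfit intercept =", b_check)
```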

4) Make the scatterplot with regression line.

In [5]:
from sklearn.linear_model import LinearRegression #sklearn is a machine learning library
X=df[["Average Score ACT 2012"]]
Y=df[["College Eligibility 2012 - Percent"]]
reg=LinearRegression()
reg.fit(X,Y)
print("Intercept is ", reg.intercept_)
print("Slope is ", reg.coef_)
print("R^2 for OLS is ", reg.score(X,Y))
# x values on the regression line will be between 13.5 and 30 
x = np.linspace(13.5, 30 ,100) 
# define the regression line y = mx+b here
[[m]]=reg.coef_
[b]=reg.intercept_
y =  m*x  + b   
#plot the data points 
fig=df.plot(x="Average Score ACT 2012", y="College Eligibility 2012 - Percent", style='o')  
plt.xlabel("Average Score ACT 2012")  
plt.ylabel("College Eligibility 2012 - Percent")  
# plot the regression line 
plt.plot(x,y, 'k') # 'k' draws the line in black
plt.legend([],[], frameon=True)
plt.grid()
plt.show()
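
The value printed by `reg.score(X,Y)` is the coefficient of determination, R^2 = 1 - SS_res/SS_tot. The sketch below computes it by hand on hypothetical toy data, using `np.polyfit` as a stand-in for the fitted model, and checks that for a one-predictor fit it equals the squared Pearson correlation.

```python
import numpy as np

# Hypothetical toy data, used only to illustrate the calculation.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit the line and compute the predicted values.
m, b = np.polyfit(x, y, 1)
y_hat = m * x + b

ss_res = np.sum((y - y_hat) ** 2)      # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean of y
r_squared = 1 - ss_res / ss_tot

# For a single predictor, R^2 equals the square of the correlation coefficient.
r = np.corrcoef(x, y)[0, 1]

print("R^2 =", r_squared)
print("r^2 =", r ** 2)
```

An R^2 near 1 means the points lie close to the line; a value near 0 means the line explains little of the variation in y.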

<h2> 2. Exercise </h2>
<h3> Linear Regression of Chicago Public School Data </h3>

Data File:  Imported directly from the Chicago data portal in Step 1.

1) Let's begin by executing the following cell to retrieve a Chicago Public School (CPS) dataset.

In [6]:
import pandas as pd
import numpy as np
raw_CPS_data=  pd.read_json('https://data.cityofchicago.org/resource/kh4r-387c.json?$limit=100000')
raw_CPS_data.head() 

2) Let's get the column names

In [7]:
raw_CPS_data.columns

3) Let's find the number of non-missing entries in each column using a command of the form df.count().

In [8]:
raw_CPS_data.count()
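
To make the output of `count()` easier to read, here is a small sketch on a made-up table (the column names and values are hypothetical, not from the CPS dataset). `count()` reports the non-missing entries per column, so subtracting it from the number of rows gives the number of missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each of two columns.
toy = pd.DataFrame({
    "school": ["A", "B", "C", "D"],
    "enrollment": [410.0, np.nan, 365.0, 520.0],
    "grades": ["PK,K-8", "9-12", None, "PK,K-8"],
})

non_null = toy.count()           # non-missing entries per column
missing = len(toy) - non_null    # missing entries per column

print(non_null)
print(missing)
```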

4) Let's check what entries there are in the 'grades_offered' column.

In [9]:
raw_CPS_data['grades_offered'].value_counts()

5) Let's create a dataframe called mid with just the schools whose 'grades_offered' entry is PK,K-8.

In [10]:
mid=raw_CPS_data[raw_CPS_data['grades_offered']=='PK,K-8']
mid.head(2)
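
The filter above works in two stages: the comparison produces a Boolean Series, and indexing the dataframe with that Series keeps only the rows marked `True`. A minimal sketch on hypothetical toy data (the names `toy`, `mask`, and `subset` are illustrative):

```python
import pandas as pd

# Hypothetical toy data, not the CPS dataset.
toy = pd.DataFrame({
    "school": ["A", "B", "C", "D"],
    "grades": ["PK,K-8", "9-12", "K-8", "PK,K-8"],
})

mask = toy["grades"] == "PK,K-8"   # Boolean Series: True where the value matches exactly
subset = toy[mask]                 # keeps only the rows where mask is True

print(mask.tolist())
print(subset["school"].tolist())
```

The comparison is an exact string match, so a school listed as `K-8` is not selected.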

6) Let's streamline mid to a dataframe df which includes just the columns ['address','student_count_total','student_count_black','student_count_hispanic','student_count_white','zip']

In [11]:
df=mid[['address','student_count_total','student_count_black','student_count_hispanic','student_count_white','zip']]
df.head(2)

7) Let's get all the schools with zip 60623

In [12]:
df23=df[df['zip']==60623]
df23.head(2)

8) Let's reset the index.

In [13]:
df23=df23.reset_index(drop=True)
df23.head(5)
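
Filtering keeps the original row labels, which is why the index is reset. The sketch below shows the effect on hypothetical toy data: `drop=True` discards the old labels instead of storing them in a new column.

```python
import pandas as pd

# Hypothetical toy data, not the CPS dataset.
toy = pd.DataFrame({
    "school": ["A", "B", "C", "D"],
    "zip": [60623, 60608, 60623, 60623],
})

kept = toy[toy["zip"] == 60623]            # labels of the kept rows are 0, 2, 3
renumbered = kept.reset_index(drop=True)   # labels become 0, 1, 2

print(kept.index.tolist())
print(renumbered.index.tolist())
```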

9) What is the size of the largest CPS PK,K-8 school in 60623?

In [14]:
# use a name other than max so Python's built-in max() is not overwritten
largest=df23["student_count_total"].max()
largest

10) What is the size of the smallest CPS PK,K-8 school in 60623?

In [15]:
# use a name other than min so Python's built-in min() is not overwritten
smallest=df23["student_count_total"].min()
smallest

11) Let's simplify the column names to ["address","total","black","hispanic","white","zip"]

In [16]:
df23.columns= ["address","total","black","hispanic","white","zip"]
df23.head(1)


12) Let's create 3 new columns '%black', '%hispanic', '%white'

In [17]:
for i in df23.index:
    df23.loc[i,'%black']=round(100*df23.loc[i,'black']/df23.loc[i,'total'],1)
    df23.loc[i,'%hispanic']=round(100*df23.loc[i,'hispanic']/df23.loc[i,'total'],1)
    df23.loc[i,'%white']=round(100*df23.loc[i,'white']/df23.loc[i,'total'],1)
df23.head(5)
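
The percentage columns can also be computed without a row-by-row loop, since pandas applies arithmetic to whole columns at once. The sketch below is one possible alternative on hypothetical toy counts (the numbers are made up); each percentage is 100 times the group count divided by the total, rounded to one decimal place.

```python
import pandas as pd

# Hypothetical toy counts, not the CPS dataset.
toy = pd.DataFrame({
    "total": [400, 250, 500],
    "black": [100, 200, 25],
    "hispanic": [280, 40, 450],
    "white": [20, 10, 25],
})

# Column-wise arithmetic: one statement per group handles every row.
for group in ["black", "hispanic", "white"]:
    toy["%" + group] = (100 * toy[group] / toy["total"]).round(1)

print(toy)
```

On a dataframe with many rows this is typically much faster than assigning cell by cell with `.loc`.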

<h4>ASSIGNMENT</h4>
Make a scatterplot of %black (x-axis) vs. %hispanic (y-axis) and include the OLS regression line on the plot. What does the graph tell us about pre-K - 8 schools in Chicago zip code 60623?

<h4>Solution</h4>

In [18]:
from sklearn.linear_model import LinearRegression #sklearn is a machine learning library
X=df23[["%black"]]
Y=df23[["%hispanic"]]
reg=LinearRegression()
reg.fit(X,Y)
print("Intercept is ", reg.intercept_)
print("Slope is ", reg.coef_)
print("R^2 for OLS is ", reg.score(X,Y))
# x values on the regression line will be between 0 and 100 with a spacing of .01
x = np.arange(0, 100 ,.01) 
# define the regression line y = mx+b here
[[m]]=reg.coef_
[b]=reg.intercept_
y =  m*x  + b   

fig=df23.plot(x='%black', y='%hispanic', style='o')  
plt.title('% Black vs % Hispanic in 60623 pre-K - 8 Schools')  
plt.xlabel('% Black')  
plt.ylabel('% Hispanic')  
# plot the regression line 
plt.plot(x,y, 'r') # 'r' draws the line in red
plt.legend([],[], frameon=True)
plt.grid()
plt.show()

The graph shows that schools in this zip code are either predominantly Hispanic or predominantly Black.