<h1>Linear Regression using Spark</h1>
<p>We'll be going over a short case study of using Spark to perform a linear regression.  I'll be using data from the <a href="https://rptsvr1.tea.texas.gov/perfreport/tapr/2013/download/DownloadData.html">Texas Education Agency</a>, particularly focusing on campus level data from STAAR Phase-in 1 Level II (Grades 3 to 8, End of Course).  As with all data tasks, we'll need to clean and view data before we can make meaningful progress.</p>

In [1]:
from pyspark.sql import SparkSession
path = "/Users/joesphgartner/Desktop/data/texaseduagency/Starr_phase1_Grades3to8.txt"
spk = SparkSession.builder.master("local").getOrCreate()
df = spk.read.csv(path, header=True)
cols = df.columns
print(len(cols))
#df.show(3)

2059


<h2>Gross</h2>
<p>The above columns bum me out.  I have no clue what they mean.  We do have access to the <a href="https://rptsvr1.tea.texas.gov/perfreport/tapr/2013/download/campstaar1a.html">data dictionary</a>. Have a look at the provided link. You can see that there is data for many grades, ethnicities, and schools.  For obvious reasons, the schools have been anonymized; this is unfortunate, as fusing this data with median household income would be an interesting way of enriching this dataset.<br/><br/>

Let's see if we can use our web scraping skills to get the descriptions programmatically.</p>

In [2]:
from bs4 import BeautifulSoup
from urllib import request
import ssl

# Build an SSL context that skips certificate verification (use with care)
context = ssl._create_unverified_context()

# Fake connection coming from a browser
hdr = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
        'Accept-Encoding': 'none',
        'Accept-Language': 'en-US,en;q=0.8',
        'Connection': 'keep-alive'
    }

# URL for the data dict
url = "https://rptsvr1.tea.texas.gov/perfreport/tapr/2013/download/campstaar1a.html"

# Build a request
req = request.Request(url, headers=hdr)
# Reuse the unverified context built above; it negotiates the TLS version instead of pinning the deprecated TLSv1
res = request.urlopen(req, context=context)

# Create a BS object
soup = BeautifulSoup(res.read(), "html.parser")

<h2>Limit the data</h2>
<p>Let's start simple (generally a good idea for large messy data).  Looking at the table itself, we can see there are many columns specific to a particular grade.  Let's start by simply selecting those columns that pertain to the 3rd grade.</p>

In [3]:
# Get the table, and then the table row objects
table = soup.find("tbody")
rows = table.find_all("tr")
print(len(rows)) # <- number of rows in the html table

# Find only columns with "Grade 3" in the description, get the column name
g3_cols = []
for row in rows:
    if "Grade 3" in str(row):
        # The first cell of each row holds the column name
        g3_cols.append(row.find_all("td")[0].text.strip())
        
len(g3_cols) # <- number of column names we extract

673


168
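
<p>The row-filtering logic above can be tried without a network connection. The sketch below is a stand-in, not the notebook's method: it uses the standard library's <code>html.parser</code> instead of BeautifulSoup, and the HTML snippet and the column names <code>COL_A</code>, <code>COL_B</code>, <code>COL_C</code> are made up for illustration. It also matches on cell text only, whereas the cell above matches on the full row markup.</p>

```python
from html.parser import HTMLParser

# Hypothetical snippet shaped like a two-column data-dictionary table
# (column name, description). The names and descriptions are invented.
SAMPLE_HTML = """
<table><tbody>
<tr><td>COL_A</td><td>Grade 3 example description</td></tr>
<tr><td>COL_B</td><td>Grade 4 example description</td></tr>
<tr><td>COL_C</td><td>Another Grade 3 example description</td></tr>
</tbody></table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")      # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # accumulate text inside the cell


parser = RowCollector()
parser.feed(SAMPLE_HTML)
parser.close()

# Keep the first cell (the column name) of every row that mentions "Grade 3"
grade3 = [cells[0].strip() for cells in parser.rows
          if cells and any("Grade 3" in c for c in cells)]
print(grade3)
```

<p>On the real data dictionary, the same idea is what takes the 673 table rows down to the 168 Grade 3 column names.</p>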

In [4]:
df_small = df.select(g3_cols)
df_small.head()

Row(CB03AMA1512N='.', CB03AMA1012D='.', CB03AMA1512R='.', CA03AMA1512N='.', CA03AMA1012D='.', CA03AMA1512R='.', CI03AMA1512N='.', CI03AMA1012D='.', CI03AMA1512R='.', C303AMA1512N='.', C303AMA1012D='.', C303AMA1512R='.', CR03AMA1512N='.', CR03AMA1012D='.', CR03AMA1512R='.', CL03AMA1512N='.', CL03AMA1012D='.', CL03AMA1512R='.', CE03AMA1512N='.', CE03AMA1012D='.', CE03AMA1512R='.', CF03AMA1512N='.', CF03AMA1012D='.', CF03AMA1512R='.', CH03AMA1512N='.', CH03AMA1012D='.', CH03AMA1512R='.', CM03AMA1512N='.', CM03AMA1012D='.', CM03AMA1512R='.', C403AMA1512N='.', C403AMA1012D='.', C403AMA1512R='.', CS03AMA1512N='.', CS03AMA1012D='.', CS03AMA1512R='.', C203AMA1512N='.', C203AMA1012D='.', C203AMA1512R='.', CW03AMA1512N='.', CW03AMA1012D='.', CW03AMA1512R='.', CB03ARE1512N='.', CB03ARE1012D='.', CB03ARE1512R='.', CA03ARE1512N='.', CA03ARE1012D='.', CA03ARE1512R='.', CI03ARE1512N='.', CI03ARE1012D='.', CI03ARE1512R='.', C303ARE1512N='.', C303ARE1012D='.', C303ARE1512R='.', CR03ARE1512N='.', CR03AR

<h2>Progress</h2>
<p>OK, now let's try to answer some questions.  Let's start by getting those rows that have a percent for each of white, black, and Hispanic students.</p>

In [5]:
print(df_small.count())
df_filt = df_small.filter((df_small['CB03AMA1512R']!=".") & (df_small['CB03AMA1512R']!="-1") & \
                          (df_small['CW03AMA1512R']!=".") & (df_small['CW03AMA1512R']!="-1") & \
                          (df_small['CH03AMA1512R']!=".") & (df_small['CH03AMA1512R']!="-1"))
print(df_filt.count())

8555
912


<h2>More Sophisticated Transformations</h2>
<p>The data we have is on a per-facility basis, which means the targets (i.e. the passage rates for the different student groups) sit in the same row.  What we need to do is make a separate row for each group, and create indicator variables to distinguish between the three.  Let's have Hispanic students be our base class, meaning both indicators are 0 for them.</p>

In [6]:
from pyspark.sql.functions import lit
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import VectorAssembler

#Create limited DF with only hispanic students as our targets
df_hs = df_filt.withColumn("Passage", df_filt["CH03AMA1512R"].cast(DoubleType()))\
                 .withColumn('is_AA', lit(0))\
                 .withColumn('is_CA', lit(0))\
                 .select(["Passage", "is_AA", "is_CA"])
            
print(df_hs.head())

#Create limited DF with only black students as our targets
df_aa = df_filt.withColumn("Passage", df_filt["CB03AMA1512R"].cast(DoubleType()))\
                 .withColumn('is_AA', lit(1))\
                 .withColumn('is_CA', lit(0))\
                 .select(["Passage", "is_AA", "is_CA"])
            
print(df_aa.head())

#Create limited DF with only white students as our targets
df_ca = df_filt.withColumn("Passage", df_filt["CW03AMA1512R"].cast(DoubleType()))\
                 .withColumn('is_AA', lit(0))\
                 .withColumn('is_CA', lit(1))\
                 .select(["Passage", "is_AA", "is_CA"])
            
print(df_ca.head())

#Combined into single DF and format for ML
from pyspark.ml.linalg import Vectors
df_all = df_hs.union(df_aa).union(df_ca)
df_all.count()

vecAssembler = VectorAssembler(inputCols=["is_AA", "is_CA"], outputCol="features")
data = vecAssembler.transform(df_all).selectExpr("features as features", "Passage as label")

data.head()

Row(Passage=68.0, is_AA=0, is_CA=0)
Row(Passage=46.0, is_AA=1, is_CA=0)
Row(Passage=66.0, is_AA=0, is_CA=1)


Row(features=SparseVector(2, {}), label=68.0)
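
<p>To make the filter and reshape steps concrete, here is a plain-Python sketch of the same idea on a handful of rows. It is not Spark code, and the sample values are made up; only the three column names and the two excluded placeholder values (<code>"."</code> and <code>"-1"</code>) come from the cells above. The helper name <code>to_long</code> is hypothetical.</p>

```python
# Made-up wide rows: one dict per campus, one rate column per group.
wide_rows = [
    {"CH03AMA1512R": "68", "CB03AMA1512R": "46", "CW03AMA1512R": "66"},
    {"CH03AMA1512R": ".",  "CB03AMA1512R": "50", "CW03AMA1512R": "70"},  # dropped
    {"CH03AMA1512R": "72", "CB03AMA1512R": "-1", "CW03AMA1512R": "88"},  # dropped
    {"CH03AMA1512R": "70", "CB03AMA1512R": "64", "CW03AMA1512R": "86"},
]

# The two placeholder values the Spark filter excludes
MISSING = {".", "-1"}

# column -> (is_AA, is_CA); Hispanic students are the base class (0, 0)
GROUPS = {
    "CH03AMA1512R": (0, 0),
    "CB03AMA1512R": (1, 0),
    "CW03AMA1512R": (0, 1),
}


def to_long(rows):
    """Keep rows with a rate for all three groups, then emit one row per group."""
    complete = [r for r in rows if all(r[c] not in MISSING for c in GROUPS)]
    long_rows = []
    # Same order as the union above: Hispanic block, then black, then white
    for col, (is_aa, is_ca) in GROUPS.items():
        for r in complete:
            long_rows.append(
                {"Passage": float(r[col]), "is_AA": is_aa, "is_CA": is_ca}
            )
    return long_rows


long_rows = to_long(wide_rows)
print(len(long_rows))  # complete campuses times three groups
for row in long_rows:
    print(row)
```

<p>Each complete campus contributes three rows, one per group, which is why the combined DataFrame has three times as many rows as <code>df_filt</code>.</p>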

In [7]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator

lr = LinearRegression(maxIter=10)
train, test = data.randomSplit([0.8, 0.2])
model = lr.fit(train)

model.transform(test)\
    .select("features", "label", "prediction")\
    .show(2)

+---------+-----+-----------------+
| features|label|       prediction|
+---------+-----+-----------------+
|(2,[],[])| 45.0|71.54918032786885|
|(2,[],[])| 46.0|71.54918032786885|
+---------+-----+-----------------+
only showing top 2 rows



In [8]:
print(model.intercept)
print(model.coefficients)

71.54918032786885
[-7.49385529329,9.05301445951]
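
<p>One way to read these numbers, offered as a sketch rather than a full analysis: when the only features are two 0/1 indicators and no regularization is applied (<code>regParam</code> defaults to 0.0 in Spark's <code>LinearRegression</code>), a least-squares fit should reproduce group means. Under that assumption the intercept of about 71.55 is the mean passage rate for the base class (Hispanic students) in the training split, and the coefficients are offsets from it: roughly 71.55 - 7.49 = 64.06 for black students and 71.55 + 9.05 = 80.60 for white students. The plain-Python example below checks that relationship on made-up numbers, without Spark.</p>

```python
from statistics import mean

# Made-up long-format rows: (passage_rate, is_AA, is_CA)
rows = [
    (68.0, 0, 0), (72.0, 0, 0), (70.0, 0, 0),   # base class
    (46.0, 1, 0), (70.0, 1, 0), (64.0, 1, 0),   # is_AA = 1
    (66.0, 0, 1), (88.0, 0, 1), (86.0, 0, 1),   # is_CA = 1
]


def group_mean(is_aa, is_ca):
    """Mean passage rate for the rows with the given indicator pair."""
    return mean(y for y, a, c in rows if (a, c) == (is_aa, is_ca))


# Candidate solution: intercept = base-class mean, coefficients = differences
intercept = group_mean(0, 0)
coef_aa = group_mean(1, 0) - intercept
coef_ca = group_mean(0, 1) - intercept

# Least-squares check: the residuals must be orthogonal to every design
# column (the constant column, is_AA, and is_CA), i.e. the normal equations hold.
residuals = [y - (intercept + coef_aa * a + coef_ca * c) for y, a, c in rows]
normal_eqs = (
    sum(residuals),
    sum(r * a for r, (y, a, c) in zip(residuals, rows)),
    sum(r * c for r, (y, a, c) in zip(residuals, rows)),
)

print(intercept, coef_aa, coef_ca)
print(normal_eqs)
```

<p>Because all three sums come out to zero, the group-mean solution satisfies the normal equations, so it is the least-squares fit for this toy data.</p>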
