# NLSchools dataset

We begin by downloading the dataset and the documentation from a remote source, using a shell command.

In [None]:
!wget https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/MASS/nlschools.csv

In [None]:
!wget https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/doc/MASS/nlschools.html

In [None]:
# IPython.display is the public import path for the HTML helper
from IPython.display import HTML
HTML(filename='nlschools.html')

## Step 1: Descriptive data analysis - univariate

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
nlDf = pd.read_csv('nlschools.csv')

In [None]:
nlDf.describe()

In [None]:
nlDf['IQ'].describe()

This high-level summary shows that lang tends to have more high values than low ones and that
IQ is roughly symmetric (its mean and median are close). We also see that 629
of the students are in combined classes, while almost 3 times as many are in non-combined classes.
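The two checks described above (mean versus median, and the number of students per class type) can be made explicit with a small helper. This is only a sketch: `summarize_balance` is a hypothetical name, and the toy frame below uses made-up values rather than the real dataset. In the notebook, one would pass `nlDf` with the columns `'IQ'` and `'COMB'` instead.

```python
import pandas as pd

def summarize_balance(df, numeric_col, flag_col):
    """Return the mean and median of one column and the counts per category of another."""
    return {
        "mean": df[numeric_col].mean(),
        "median": df[numeric_col].median(),
        "counts": df[flag_col].value_counts().to_dict(),
    }

# Toy stand-in for nlDf (values are invented for illustration only).
toy = pd.DataFrame({
    "IQ": [10.0, 11.0, 12.0, 13.0, 14.0],
    "COMB": [0, 0, 0, 1, 1],
})

summary = summarize_balance(toy, "IQ", "COMB")
print(summary)
```

A mean close to the median suggests a roughly symmetric distribution; the counts show how unbalanced the two class types are.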

## Histograms of 'interesting' variables

We continue by plotting a histogram of each variable. IQ (the first histogram below) appears
roughly normal, whereas lang is concentrated at higher values, with a longer tail to the left, and
SES is concentrated at lower values, with a longer tail to the right.
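The direction of skew can also be checked numerically rather than by eye. The sketch below uses pandas' `Series.skew()` on three small made-up series: a positive value indicates a longer right tail, a negative value a longer left tail, and a value near zero a roughly symmetric shape. On the real data one would call `nlDf[['IQ', 'lang', 'SES']].skew()`; the toy values here are assumptions chosen only to show the sign convention.

```python
import pandas as pd

# Made-up series that illustrate the sign of the sample skewness.
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_tailed = pd.Series([1.0, 1.0, 1.0, 2.0, 10.0])   # most mass at low values
left_tailed = pd.Series([1.0, 9.0, 10.0, 10.0, 10.0])  # most mass at high values

skew_symmetric = symmetric.skew()
skew_right = right_tailed.skew()
skew_left = left_tailed.skew()

print(skew_symmetric, skew_right, skew_left)
```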


In [None]:
nlDf['IQ'].plot.hist(alpha=0.7, bins=25, figsize=(12,6))
plt.show()

Now let's look at the histograms for lang and SES.

In [None]:
nlDf['lang'].plot.hist(alpha=0.7, bins=50, figsize=(12,6))
plt.show()

In [None]:
nlDf['SES'].plot.hist(alpha=0.7, bins=20, figsize=(12,6))
plt.show()

SES (socio-economic status) follows a familiar pattern: few wealthy students and many poor ones,
which matches what we would usually expect. Next, the box plot of IQ shows a fair number of
unusually low and unusually high values.

In [None]:
nlDf['IQ'].plot.box(figsize=(12,6), title="Boxplot of IQ")
plt.show()
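The points drawn beyond the whiskers of a box plot can be counted directly. By default, matplotlib places the whiskers at 1.5 times the interquartile range beyond the quartiles, so the sketch below applies the same rule. `iqr_outliers` is a hypothetical helper, and the series is toy data; with the real dataset one would pass `nlDf['IQ']`.

```python
import pandas as pd

def iqr_outliers(values, whis=1.5):
    """Return the entries lying more than whis * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    spread = q3 - q1  # interquartile range
    low = q1 - whis * spread
    high = q3 + whis * spread
    return values[(values < low) | (values > high)]

# Toy scores with one unusually low and one unusually high value.
toy_scores = pd.Series([4.0, 10.0, 11.0, 12.0, 12.0, 13.0, 14.0, 20.0])
outliers = iqr_outliers(toy_scores)
print(outliers)
```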


## Step 2: Descriptive data analysis - multivariate

In [None]:
from pandas.plotting import scatter_matrix


In [None]:
scatter_matrix(nlDf[['lang', 'IQ', 'GS', 'SES']], alpha=0.4, figsize=(12, 12), diagonal='kde')
plt.show()

Now let's see whether IQ differs between students in combined and non-combined classes.

In [None]:
nlDf.boxplot(column='IQ',by='COMB', figsize=(8,8))
plt.show()
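A grouped summary is a useful numeric companion to a grouped box plot. The sketch below computes the mean, median and count per class type on a toy frame with invented values; on the real data the same call would be applied to `nlDf`, assuming `COMB` holds one value per class type.

```python
import pandas as pd

# Toy stand-in for nlDf (values are invented for illustration only).
toy = pd.DataFrame({
    "COMB": [0, 0, 0, 1, 1, 1],
    "IQ": [11.0, 12.0, 13.0, 10.0, 11.0, 12.0],
    "lang": [40.0, 42.0, 44.0, 36.0, 38.0, 40.0],
})

# One row per class type, one column per (variable, statistic) pair.
by_comb = toy.groupby("COMB")[["IQ", "lang"]].agg(["mean", "median", "count"])
print(by_comb)
```

A difference in group means does not by itself show that class type causes the difference; it only describes the two groups.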

## Step 3: Modeling the data

In [None]:
from sklearn import linear_model
from sklearn.metrics import r2_score

In [None]:
reg = linear_model.LinearRegression()
reg.fit(nlDf[['IQ']], nlDf['lang'])

In [None]:
predictions = reg.predict(nlDf[['IQ']])
# Series.as_matrix() was removed in pandas 1.0; r2_score accepts the Series directly
r2_score(nlDf['lang'], predictions)

In [None]:
reg.coef_
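For a regression with a single predictor and an intercept, the in-sample R² equals the squared Pearson correlation between predictor and response, and the fitted slope equals their covariance divided by the predictor's variance. The sketch below checks both identities on synthetic data; the generating slope of 2.5 and the noise level are arbitrary assumptions and have nothing to do with the NLSchools values.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import r2_score

# Synthetic data: a linear signal plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.5 * x + rng.normal(scale=1.0, size=200)

X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
toy_reg = linear_model.LinearRegression().fit(X, y)

r2 = r2_score(y, toy_reg.predict(X))
pearson_r = np.corrcoef(x, y)[0, 1]
slope_from_cov = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

print(r2, pearson_r ** 2)
print(toy_reg.coef_[0], slope_from_cov)
```

Read this way, the coefficient is the expected change in the response per unit change in the predictor, and R² is the share of the response's variance that the fitted line accounts for.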

## Further fiddling

Can we make it better?
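One possible direction is to add further predictors and to compare models on held-out data rather than on the data used for fitting. The sketch below does this on synthetic data in which the response depends on two features by construction. The column names mirror the dataset, but every value, coefficient and the helper `held_out_r2` are assumptions made for illustration, so the result says nothing about the real NLSchools data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic frame: lang depends on both IQ and SES, plus noise.
rng = np.random.default_rng(42)
n = 500
toy = pd.DataFrame({
    "IQ": rng.normal(12, 2, n),
    "SES": rng.normal(28, 10, n),
})
toy["lang"] = 2.0 * toy["IQ"] + 0.3 * toy["SES"] + rng.normal(0, 3, n)

train, test = train_test_split(toy, test_size=0.25, random_state=0)

def held_out_r2(features):
    """Fit on the training split and return R^2 on the test split."""
    model = LinearRegression().fit(train[features], train["lang"])
    return model.score(test[features], test["lang"])

r2_iq = held_out_r2(["IQ"])
r2_iq_ses = held_out_r2(["IQ", "SES"])
print(r2_iq, r2_iq_ses)
```

On the real data, the same comparison would use `nlDf` and whichever columns are being considered as predictors.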