# Data Science Course Week 2 - Lesson 2 
# Introduction to Linear Regression


Regression is the process of learning a mapping from a vector of input data to a quantitative output, given a set of observations. Examples include mapping coordinates in a room to Wi-Fi signal strength, someone's Body Mass Index to their life expectancy, or a stock's performance over the last 5 days to its value tomorrow.

There are numerous methods for addressing this problem, each with its own assumptions and behaviour. Some are ideal for tackling large volumes of data while others provide more informative probabilistic outputs.

In this lab, we'll focus on Linear Regression: a good point of reference for many other regression techniques, and still widely used today because of its simplicity and favourable scaling characteristics.
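
To make the idea concrete, here is a minimal sketch of ordinary least squares for a single input, written in plain Python rather than with a library. The function name `fit_line` and the toy data are invented for illustration; the formulas are the standard closed-form estimates, slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x).

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations, and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data lying exactly on y = 1 + 2x (illustrative values only)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

With several input variables the same idea applies, but the estimates are found with matrix algebra, which is what the libraries used in this lab do for us.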

We will use [statsmodels](http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/ols.html) to learn about linear regression. It exposes more detail about the model parameters than scikit-learn does, which is useful while we are learning. For the rest of the course, however, we will mainly use scikit-learn.

# Class Workshop

In [None]:
# Import the libraries required
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# this allows plots to appear directly in the notebook
%matplotlib inline

Here we are using a dataset of Chicago house prices. Each row is a house that was sold, along with several variables that describe it.

In [None]:
# Read in data
house_data = pd.read_csv("chicagohouseprices2.csv", index_col=0)

In [None]:
house_data.head()

In [None]:
# do we have any missing data?
house_data.isnull().sum() 

In [None]:
# Summarise the data
house_data.describe()
#house_data.describe(include=['object'])
#house_data.describe(include='all')

In [None]:
# Look for any linear correlations in the data
house_data.corr()
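
By default, `DataFrame.corr()` reports the Pearson correlation coefficient for each pair of columns. As a rough sketch of what that number measures, the hypothetical helper `pearson` below computes it by hand for two small made-up lists (not the house data): values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative data: one perfectly increasing line, one perfectly decreasing
r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```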

In [None]:
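# Plot each variable against every other variable
# scroll down past the subplot information
# (scatter_matrix lives in pd.plotting; the old pd.scatter_matrix alias was removed)
pd.plotting.scatter_matrix(house_data, figsize=(15, 15))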

### Questions?

- Can you describe the data set - give a summary of what's happening?
- What looks to be affecting house prices from our initial inspection?
- What is the type of relationship in those variables affecting price?

In [None]:
# create a fitted model in one line
lm = smf.ols(formula='Price ~ Bath + HouseSizeSqft', data=house_data).fit()

# print the coefficients!
lm.params

The summary function provided by the statsmodels library presents lots of useful information about the fitted model.

Key items to pay attention to are:
- R-squared
- Adj. R-Squared
- coef (for each variable)
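
The first two items can be computed by hand, which may help when reading the summary table. The sketch below uses the standard definitions with made-up observed and predicted values; the function names are illustrative and are not part of statsmodels. Adjusted R-squared penalises R-squared for the number of predictors, so it is the fairer figure when comparing models with different numbers of variables.

```python
def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """R-squared adjusted for the number of predictors (intercept not counted)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Made-up values for illustration only
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_obs=len(y_true), n_predictors=1)
print(r2, adj_r2)
```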

In [None]:
lm.summary()

In [None]:
# What would you expect a house price to be for a house with 3 bathrooms and 350 sqft?
# Calculate it.
508310 -28995.66*3 + 133*350
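
The same calculation can be written so that each coefficient is named, which makes it easier to see which term belongs to which variable. This is only a sketch: the numbers are copied from the cell above, and `predict_price` is a hypothetical helper. With the fitted model in hand, passing a DataFrame with `Bath` and `HouseSizeSqft` columns to `lm.predict` should give the same kind of estimate without hard-coding rounded coefficients.

```python
# Coefficients copied from the calculation above (rounded values)
coefficients = {"Intercept": 508310, "Bath": -28995.66, "HouseSizeSqft": 133}


def predict_price(coefs, bath, house_size_sqft):
    """Apply the linear model: intercept plus coefficient times value for each term."""
    return (
        coefs["Intercept"]
        + coefs["Bath"] * bath
        + coefs["HouseSizeSqft"] * house_size_sqft
    )


estimate = predict_price(coefficients, bath=3, house_size_sqft=350)
print(round(estimate, 2))
```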

## Student Exercise

Now try creating a regression using only EstimatedPrice.

In [None]:
# check the distribution of Price vs. EstimatedPrice using a scatterplot
import seaborn as sns
sns.set_style("darkgrid")

sns.lmplot(y='Price', x='EstimatedPrice', data=house_data)

In [None]:
# Let's try modelling Price using only the estimated price
# create a fitted model in one line
lm = smf.ols(formula='Price ~ EstimatedPrice', data=house_data).fit()

# print the coefficients
lm.params

In [None]:
lm.summary()

### Question:
Did the model built on just EstimatedPrice have a better R-squared value than the prior model built using Bath and HouseSizeSqft?

### Predicting Price using the built model


Now we will find out what price the model estimates for the minimum and maximum values of EstimatedPrice.

In [None]:
# create a DataFrame with the minimum and maximum values of EstimatedPrice
# these values will be used in the built model to predict the Price
X_new = pd.DataFrame({'EstimatedPrice': [house_data.EstimatedPrice.min(), house_data.EstimatedPrice.max()]})
X_new.head()

In [None]:
# predict price given two data points and the built model

preds = lm.predict(X_new)
preds

Now let's view a line representing the model built on just EstimatedPrice, drawn over a scatter plot of Price vs. EstimatedPrice.

To produce the line overlay, we will simply plot a straight line between the two predicted points over the scatter plot. 
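
Two points are enough because the model is a straight line: every other prediction lies on the segment joining them. The sketch below checks this with a hypothetical intercept and slope (not the fitted values) by comparing the model's prediction at an intermediate x with linear interpolation between the two endpoint predictions.

```python
# Hypothetical coefficients, for illustration only
intercept, slope = 10.0, 0.5


def model(x):
    """Prediction of a simple linear model at x."""
    return intercept + slope * x


# Predictions at the two ends of the x range
x_min, x_max = 100.0, 900.0
y_min, y_max = model(x_min), model(x_max)


def on_segment(x):
    """Height at x of the straight segment joining the two endpoint predictions."""
    t = (x - x_min) / (x_max - x_min)
    return y_min + t * (y_max - y_min)


x_mid = 400.0
print(model(x_mid), on_segment(x_mid))
```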

In [None]:
# first, plot the observed data
house_data.plot(kind='scatter', x='EstimatedPrice', y='Price')

In [None]:
# Now, plot a line over the points that uses just the two predicted points
ax = house_data.plot(kind='scatter', x='EstimatedPrice', y='Price')
# this code overlays a straight line between the coordinates created by X_new and preds
ax.plot(X_new['EstimatedPrice'], preds, c='red', linewidth=2)

### Experiment
Now try creating new models by selecting different variables or combinations of variables. Can you get a better fit?

## Themepark 

The file **themepark.csv** has some data from a theme park. Children were asked to rate their experience on a scale
from 0 to 5. These ratings were compared against the number of hours they spent in the park.

In [None]:
!ls 

In [None]:
# Create a pandas DataFrame from this data by reading it in using pandas. 
# It is located in the same directory as this notebook, so you will not need to provide a filepath.

In [None]:
# Have a look at the data (e.g. display the head or tail, use describe, etc.)

In [None]:
# Draw a scatterplot of this data.

In [None]:
# Try doing a linear regression on this data

In [None]:
# Plot this regression over the top of the data. Could the regression be better?