# Overview

In this lab you’ll use the Scikit-Learn API to perform supervised learning on sample data about Boston house prices. This dataset has been widely used in the machine learning community for many years!

You’ll load the Boston sample data from Scikit-Learn, familiarize yourself with the shape of the data, and then use linear regression to find a line of best fit for Boston house prices.

# Roadmap
There are 6 exercises in this lab; the last 2 are "if time permits". Here is a brief summary of the tasks you will perform in each exercise (more detailed instructions follow later):
1.	Loading the Boston dataset from Scikit-Learn
2.	Converting the feature matrix into a Pandas DataFrame
3.	Splitting the dataset into “training” data and “testing” data
4.	Creating a linear regression model and predicting labels for new data
5.	(If time permits) Plotting predicted vs. actual house prices 
6.	(If time permits) Determining the root mean squared error 


# Exercise 1:  Loading the Boston dataset from Scikit-Learn

In this notebook, add code to do the following:
-	From the sklearn.datasets module, import the load_boston function. Note that load_boston was deprecated in Scikit-Learn 1.0 and removed in version 1.2, so this lab requires an earlier version of Scikit-Learn.
-	Call the load_boston() function to load the Boston house price dataset. Store the dataset in a variable named boston. This dataset is well suited to linear regression: the aim is to predict house prices based on features such as the age of a house, the proportion of land in the town allocated for housing use or industrial use, etc.

The dataset object returned by load_boston() is dictionary-like, so it has a keys() method that tells you the names of the properties it holds. Add the following code to print the keys for the boston dataset object:

    print(boston.keys())

You should find that the boston dataset object has the following properties:
-	data          – Features matrix, i.e. all the data about each house
-	target        – Target array, i.e. the price of each house
-	feature_names – Array of feature names
-	DESCR         – General descriptive blurb about the dataset 
-	filename      – Name of CSV file from which the data was loaded

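As a rough illustration of what "dictionary-like" means here, the sketch below builds a small stand-in object with Scikit-Learn's Bunch class. The name toy and all the values in it are made up for illustration; they are not part of the Boston dataset.

```python
from sklearn.utils import Bunch

# A tiny stand-in for a dataset object (all values are made up for illustration).
toy = Bunch(
    data=[[0.1, 2.0], [0.3, 4.0]],
    target=[24.0, 21.6],
    feature_names=["f0", "f1"],
)

print(toy.keys())     # dictionary-style listing of the properties
print(toy["target"])  # dictionary-style access
print(toy.target)     # attribute-style access to the same value
```
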
Add code as follows, to investigate the shape of the feature matrix and the target array: 

    print("\nDetails about the feature matrix")
    print(boston.data.shape)
    print(boston.feature_names)
    print(boston.data)
    print("\nDetails about the target array")
    print(boston.target.shape)
    print(boston.target)

You should find that the boston dataset has 506 rows. This is a very small sample set, but it’s handy for getting started with machine learning concepts. In a real-world scenario, a dataset might contain thousands or millions of samples.



In [None]:
# PLACE YOUR SOLUTION HERE


# Exercise 2:  Converting the feature matrix into a Pandas DataFrame

As you saw in the previous exercise, the data in the Boston dataset (i.e. boston.data) is a 2D array. There are 506 rows (representing 506 houses) and 13 columns (representing the 13 features for each house). Generally it’s beneficial to convert the data into a Pandas DataFrame, which is easier to work with than a 2D array. You can do this as follows:

    import pandas as pd
    X = pd.DataFrame(boston.data)
   
Add the following statement, to print the first 5 rows in the DataFrame:

    print(X.head())

Note the data columns are just called 0, 1, 2, 3, …, 12.

If you want more meaningful column names (and who wouldn’t 😊), you can assign the feature names as column names as follows. The print() statement confirms the column names are now meaningful: 

    X.columns = boston.feature_names
    print(X.head())
    
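If you prefer, the column names can also be supplied when the DataFrame is created. The following is only a sketch on made-up data: the array data and the names "rooms" and "age" are hypothetical and chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up 3x2 feature matrix and hypothetical feature names.
data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
names = ["rooms", "age"]

# Name the columns at construction time instead of assigning them afterwards.
df = pd.DataFrame(data, columns=names)
print(df.head())
print(df.shape)
```
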
One last step in this exercise… assign the Boston target array (i.e. the house prices) to a variable named y for readability and according to convention:

    y = boston.target


In [None]:
# PLACE YOUR SOLUTION HERE


# Exercise 3:  Splitting the dataset into “training” data and “testing” data

A common approach in machine learning is to split the samples in a dataset into two portions:
-	A relatively large portion of samples (e.g. 80%) that can be used to train the model.
-	The remainder of the samples (e.g. 20%) that can be used to test the quality of the values predicted by the model.

This is such a common task in machine learning that Scikit-Learn has an off-the-shelf function called train_test_split() to do it. Add the following code to your script:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, 
        y, 
        train_size = 0.80)

Here’s a quick explanation of the train_test_split() function:
-	The function takes any number of arrays that all have the same number of rows (samples).
-	The train_size parameter indicates the portion of the data you want to treat as “training” data. Here, 80% will be treated as “training” data, and the remaining 20% as “testing” data.
-	The function returns a list of arrays, two for each input array, according to the specified split. In our example, X_train and X_test will hold the “training” and “testing” portions of X, and y_train and y_test will hold the “training” and “testing” portions of y. By default the samples are shuffled before being split, so each run gives a different split unless you also pass a random_state value.

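To see the split sizes on something smaller than the Boston data, here is a minimal sketch on synthetic arrays. The names X_demo and y_demo and the random_state value are assumptions added for illustration; random_state simply makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples with 3 features each (not the Boston data).
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

# random_state fixes the shuffle so the same split is produced on every run.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.80, random_state=42
)

print(X_tr.shape, X_te.shape)  # 80 training rows, 20 testing rows
print(y_tr.shape, y_te.shape)
```
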
Add the following code, to show the shape of the resultant arrays:

    print(X_train.shape)
    print(X_test.shape)
    print(y_train.shape)
    print(y_test.shape)
   

In [None]:
# PLACE YOUR SOLUTION HERE


# Exercise 4:  Creating a linear regression model and predicting labels for new data

In this exercise you’ll use Scikit-Learn to create a linear regression model, fit it to the training data about Boston house prices, and then use the model to predict the prices of other houses in Boston.

Follow these steps (refer to the chapter notes for more info, if necessary):
-	Import the LinearRegression model class.
-	Create a LinearRegression model object, and fit it to the "training" dataset.
-	Use the model to predict labels (i.e. house prices) for the "testing" dataset. To do this, pass the X_test data as a parameter to the predict() function. The function returns predicted house prices for this test data. 
-	Print the predicted house prices for the test data alongside the actual house prices for the test data. This gives you a rough sense of the quality of the model: the better the model, the closer the predicted and actual prices will be. What do you find, and why…?

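The general fit-then-predict pattern can be sketched on synthetic data as follows. This is not the lab solution: the data, the coefficients 3, -2 and 5, and the names X_demo, y_demo and y_hat are all made up for illustration. Because the synthetic target is an exact linear function with no noise, the model should recover it almost perfectly, which real house price data will not allow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the target is an exact linear function of two features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.80, random_state=0
)

model = LinearRegression()
model.fit(X_tr, y_tr)        # learn coefficients from the training portion
y_hat = model.predict(X_te)  # predict labels for the unseen testing portion

# Show the first few predictions next to the actual values.
for predicted, actual in list(zip(y_hat, y_te))[:5]:
    print(f"predicted {predicted:8.3f}   actual {actual:8.3f}")
```

After fitting, model.coef_ and model.intercept_ hold the learned slope values and the intercept, which is a handy way to check what the model has learned.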


In [None]:
# PLACE YOUR SOLUTION HERE


# Exercise 5:  Plotting predicted vs. actual house prices

Using Matplotlib, draw a scatter plot that shows predicted vs. actual house prices. The solution script contains code that produces such a graph. In a perfect model, the dots would form a completely straight diagonal line, because the predicted and actual values would always match exactly.

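One possible shape for such a plot is sketched below on made-up numbers. The arrays actual and predicted are hypothetical stand-ins for your own test labels and predictions, and the dashed reference diagonal is an optional extra that makes deviations from a perfect prediction easier to see.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up values standing in for actual and predicted prices.
actual = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
predicted = np.array([12.0, 18.0, 33.0, 37.0, 52.0])

fig, ax = plt.subplots()
ax.scatter(actual, predicted)

# Reference diagonal: points lying on this line are perfect predictions.
lims = [min(actual.min(), predicted.min()), max(actual.max(), predicted.max())]
ax.plot(lims, lims, linestyle="--")

ax.set_xlabel("Actual price")
ax.set_ylabel("Predicted price")
ax.set_title("Predicted vs. actual")
```

In a notebook, the figure is normally displayed automatically at the end of the cell.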

In [None]:
# PLACE YOUR SOLUTION HERE



# Exercise 6:  Determining the root mean squared error

In statistics, the root mean squared error (RMSE) is a measurement of the quality of predicted vs. actual results:
-	Given a series of predicted and actual results, you calculate the square of the difference (i.e. error) between each predicted and actual value. Squaring makes every term non-negative, so positive and negative errors cannot cancel each other out.
-	You then calculate the mean (average) of the squared errors.
-	Finally you take the square root, to obtain a typical error size in the same units as the target values (for the Boston dataset, house prices are recorded in thousands of dollars).

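The three steps above can be worked through by hand on a tiny made-up example. The lists predicted and actual below are hypothetical values chosen so the arithmetic is easy to follow: the errors are 1, 0 and -2, the squared errors are 1, 0 and 4, and their mean is 5/3.

```python
import math

# Made-up predictions and actual values.
predicted = [2.0, 4.0, 6.0]
actual = [1.0, 4.0, 8.0]

# Step 1: square each error (1, 0 and -2 become 1, 0 and 4).
squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]

# Step 2: average the squared errors (5 / 3).
mse = sum(squared_errors) / len(squared_errors)

# Step 3: take the square root to return to the original units.
rmse = math.sqrt(mse)

print("MSE  %f" % mse)
print("RMSE %f" % rmse)
```
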
This is such a common technique in machine learning that Scikit-Learn has a standard function called mean_squared_error() to calculate the mean squared error. The following code shows how to use it, and then takes the square root to obtain the root mean squared error:

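    from sklearn.metrics import mean_squared_error
    import math

    # y_pred holds the predictions you made for X_test in Exercise 4
    mse = mean_squared_error(y_test, y_pred)
    rmse = math.sqrt(mse)  # back to the same units as the target
    print("Root mean squared error %f" % rmse)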


In [None]:
# PLACE YOUR SOLUTION HERE

