# Prediction and Performance Evaluation

Previously, we learned the three-step process for machine learning in scikit-learn. Once our model is trained, it's time to put it to use. We will learn how to make predictions with it and then evaluate its performance by calculating $R^2$.

## Repeating the three-step machine learning process

Let's repeat the three-step machine learning process to build a linear regression model that uses above-ground living area to predict sale price. We begin by reading in our sample housing dataset and selecting the single feature `GrLivArea` as our input.

In [None]:
import pandas as pd
import numpy as np
housing = pd.read_csv('../data/housing_sample.csv')
X = housing[['GrLivArea']]
y = housing['SalePrice']

### Import, Instantiate, Fit

Let's complete the three-step process in a single cell by importing our estimator, instantiating it, and training it with the `fit` method.

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X, y)

### Make predictions with the `predict` method

Now that our model is trained, we can use the `predict` method to make predictions. We must pass it two-dimensional data, such as a 2-dimensional numpy array or a DataFrame, with the same number of columns as the data the model was trained on. We used one feature during training, so we need a one-column array with one row per house. Let's make predictions for houses with square footage of 500, 1,000, 3,000, and 5,000.

In [None]:
new_data = np.array([[500], [1000], [3000], [5000]])
new_data

Pass this two-dimensional array to the `predict` method to get the predicted sale price.

In [None]:
lr.predict(new_data)

All those decimal places are excessive and make it difficult to read. Let's round our result to the nearest thousand.

In [None]:
lr.predict(new_data).round(-3)
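
The two-dimensional requirement is easy to trip over when the new values start out as a flat array. The following is a minimal sketch, assuming synthetic data in place of the housing file; the names `area`, `price`, and `model` are illustrative and not part of the original notebook. It shows that a one-dimensional array is rejected and that `reshape(-1, 1)` turns it into the one-column shape `predict` expects.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the housing data (made-up values, not the real dataset)
rng = np.random.default_rng(0)
area = rng.uniform(500, 4000, size=200)
price = 20_000 + 100 * area + rng.normal(0, 25_000, size=200)

model = LinearRegression()
model.fit(area.reshape(-1, 1), price)  # X must be 2-D: (n_samples, n_features)

flat = np.array([500, 1000, 3000, 5000])  # shape (4,), one-dimensional
try:
    model.predict(flat)  # a 1-D input is rejected
    raised = False
except ValueError:
    raised = True

column = flat.reshape(-1, 1)  # shape (4, 1): four rows, one feature
preds = model.predict(column)  # one prediction per row
print(raised, column.shape, preds.shape)
```

The `-1` tells numpy to infer the number of rows, so the same call works for any number of new observations.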

### Could have used a DataFrame or list of lists

Instead of using a numpy array to hold the new data, we could have used a single-column DataFrame or a list of lists. We begin by creating a DataFrame.

In [None]:
df_new_data = pd.DataFrame({'GrLivArea': [500, 1000, 3000, 5000]})
df_new_data

Passing it to the `predict` method returns exactly the same values as above.

In [None]:
lr.predict(df_new_data).round(-3)

Using a list of lists, where each inner list contains the value of the single feature for one observation, also works and again returns the same predicted values.

In [None]:
new_data_list = [[500], [1000], [3000], [5000]]
lr.predict(new_data_list).round(-3)

### Write our own function to calculate predictions

We know the coefficients of our trained model and can build our own `predict` function with them.

In [None]:
def predict(x):
    return lr.intercept_ + lr.coef_ * x

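Let's verify that the values are the same. Because `new_data` is two-dimensional, our function returns a two-dimensional array with one column, whereas the `predict` method returns a one-dimensional array. The values themselves match.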

In [None]:
predict(new_data).round(-3)
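
The comparison between a hand-written prediction function and the `predict` method can also be made explicit with `np.allclose`. The sketch below assumes synthetic data and uses the hypothetical names `model` and `manual_predict`. It also shows a matrix-product form, which is one way to extend the same calculation to models with more than one feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic single-feature data (made-up values, not the real dataset)
rng = np.random.default_rng(2)
area = rng.uniform(500, 4000, size=(200, 1))
price = 20_000 + 100 * area[:, 0] + rng.normal(0, 25_000, size=200)
model = LinearRegression().fit(area, price)

def manual_predict(x):
    # coef_ has shape (1,), so broadcasting against a (n, 1) input keeps the result 2-D
    return model.intercept_ + model.coef_ * x

new_data = np.array([[500], [1000], [3000], [5000]])
manual = manual_predict(new_data)   # shape (4, 1)
builtin = model.predict(new_data)   # shape (4,)

# Flatten before comparing so the shapes line up
same_values = np.allclose(manual.ravel(), builtin)

# Matrix-product form: works for any number of features and returns a 1-D array
matrix_form = new_data @ model.coef_ + model.intercept_
print(manual.shape, builtin.shape, same_values)
```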

## Evaluating the performance of our predictions

It's only possible to evaluate the performance of our predictions if we know the ground truth for the observations being predicted. The only labeled observations we have are the ones we built the model from, which are currently stored in the variables `X` and `y`.

### Evaluating on the training data is bad, but we will do it anyway

The data that is used to train the model is called the **training data**. Typically, we would not use this same data to evaluate our model performance and instead choose labeled data that the model has not seen. We will learn formal processes for model evaluation later. For now, we will evaluate our model on the training data with the `score` method which returns $R^2$.

In [None]:
lr.score(X, y)

### Explanation of the `score` method

The `score` method uses the model built during the call to the `fit` method. It makes a prediction for each observation in `X` and then calculates $R^2$ from these predicted values and the actual sale prices in `y`. Our linear regression model with a single feature obtained a score of about 0.5, meaning that the sum of squared errors from the model was about 50% less than the sum of squared errors produced by always guessing the mean sale price. In other words, our model explains about 50% of the variance in sale price.
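
The description above corresponds to the formula $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared errors of the model and $SS_{tot}$ is the sum of squared errors from guessing the mean. The following sketch computes it by hand and compares the result with `score` and with `sklearn.metrics.r2_score`. It assumes synthetic data, so the value it produces will differ from the housing result; `area`, `price`, and `model` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic single-feature data (made-up values, not the real dataset)
rng = np.random.default_rng(3)
area = rng.uniform(500, 4000, size=(200, 1))
price = 20_000 + 100 * area[:, 0] + rng.normal(0, 60_000, size=200)

model = LinearRegression().fit(area, price)
pred = model.predict(area)

ss_res = np.sum((price - pred) ** 2)          # squared error of the model
ss_tot = np.sum((price - price.mean()) ** 2)  # squared error of guessing the mean
r2_manual = 1 - ss_res / ss_tot

# The three calculations should agree
print(r2_manual, model.score(area, price), r2_score(price, pred))
```

Note that `r2_score` takes the true values first and the predictions second.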

## Exercises

### Exercise 1

<span  style="color:green; font-size:16px">Use the garage area to build a simple linear regression model. What is the value of $R^2$ when scoring with the training data? What does the model predict for garage areas of size 100, 500, and 1000 square feet? Repeat this process for the other numeric features. Create your own arrays with new data to make predictions with.</span>