## Week 1 - Assignment - Simple Linear Regression

#### GOALS
1. Use functions to compute important summary statistics
2. Write a function to compute the Simple Linear Regression weights using the closed form solution
3. Write a function to make predictions of the output given the input feature
4. Turn the regression around to predict the input/feature given the output
5. Compare two different models for predicting house prices

**Note** - The course uses [TuriCreate](https://apple.github.io/turicreate/docs/userguide/supervised-learning/regression.html) to estimate the parameters; see this [reference notebook](https://nbviewer.org/github/tuanavu/coursera-university-of-washington/blob/master/machine_learning/2_regression/assignment/week1/week-1-simple-regression-assignment-exercise.ipynb) for that version. Here, I use plain Python with pandas and NumPy instead.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys

In [6]:
sys.version

'3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20) \n[GCC 7.3.0]'

In [7]:
df = pd.read_csv("./kc_house_data.csv")
df_train = pd.read_csv("./kc_house_train_data.csv")
df_test = pd.read_csv("./kc_house_test_data.csv")

df.shape, df_train.shape, df_test.shape

((21613, 21), (17384, 21), (4229, 21))

### Goal 1 - Summary Stats

In [42]:
df[['sqft_living', 'price']].describe()

| | sqft_living | price |
|---|---|---|
| count | 21613.0 | 21613.0 |
| mean | 2079.899736 | 540088.1 |
| std | 918.440897 | 367127.2 |
| min | 290.0 | 75000.0 |
| 25% | 1427.0 | 321950.0 |
| 50% | 1910.0 | 450000.0 |
| 75% | 2550.0 | 645000.0 |
| max | 13540.0 | 7700000.0 |


In [43]:
df_train[['sqft_living', 'price']].describe()

| | sqft_living | price |
|---|---|---|
| count | 17384.0 | 17384.0 |
| mean | 2080.02951 | 539366.6 |
| std | 921.630888 | 369691.2 |
| min | 290.0 | 75000.0 |
| 25% | 1420.0 | 320000.0 |
| 50% | 1910.0 | 450000.0 |
| 75% | 2550.0 | 640000.0 |
| max | 13540.0 | 7700000.0 |


In [44]:
df_test[['sqft_living', 'price']].describe()

| | sqft_living | price |
|---|---|---|
| count | 4229.0 | 4229.0 |
| mean | 2079.36628 | 543054.0 |
| std | 905.317454 | 356421.2 |
| min | 370.0 | 85000.0 |
| 25% | 1430.0 | 325000.0 |
| 50% | 1920.0 | 453000.0 |
| 75% | 2550.0 | 650000.0 |
| max | 9890.0 | 6885000.0 |


### Goal 2 - Closed Form Solution

Write a generic function that accepts a column of data `input_feature` (e.g., an SArray in the course; a pandas Series here) and another column `output`, and returns the Simple Linear Regression parameters `intercept` and `slope`. Use the closed-form solution from lecture to calculate the slope and intercept:

In [11]:
def simple_linear_regression(input_feature, output):
    x=input_feature
    y=output
    N=len(x)
    
    # Means
    x_mean= x.mean()
    y_mean= y.mean()
    
    # Compute the numerator for the slope
    xy= (x*y).sum()
    xy_avg= (x.sum() * y.sum())/N
    slope_nmr= xy-xy_avg
    
    # Compute the denominator for the slope
    x_sq= (x*x).sum()
    x_mul= (x.sum()*x.sum())/N
    slope_dnr= x_sq-x_mul
    
    # slope
    slope= slope_nmr/slope_dnr
    
    #intercept
    intercept= y_mean-slope * x_mean
    
    return (intercept, slope)
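
For reference, the computation in `simple_linear_regression` corresponds to the standard least-squares closed form. The symbols $\hat{w}_0$ (intercept) and $\hat{w}_1$ (slope) are assumed notation for this sketch and do not appear in the code; $N$ is the number of observations, and $\bar{x}$, $\bar{y}$ are the sample means.

$$
\hat{w}_1 = \frac{\sum_{i=1}^{N} x_i y_i - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)\left(\sum_{i=1}^{N} y_i\right)}{\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2},
\qquad
\hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x}
$$

The numerator matches `slope_nmr`, the denominator matches `slope_dnr`, and the second equation matches the `intercept` line.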

### Estimate Slope and Intercept

Use your function to calculate the estimated slope and intercept on the training data to predict `price` given `sqft_living`.
Save the values of the slope and intercept for later (you might want to call them e.g. `squarfeet_slope` and `squarefeet_intercept`).

In [24]:
input_feature = df_train['sqft_living']
output = df_train['price']

squarefeet_intercept,squarfeet_slope = simple_linear_regression(input_feature,output)
squarfeet_slope,squarefeet_intercept

(281.9588396303426, -47116.07907289418)

In [25]:
# Estimate using Sklearn 

from sklearn.linear_model import LinearRegression

model= LinearRegression(n_jobs=-1)
model.fit(input_feature.to_frame(), output)
model.intercept_, model.coef_

(-47116.07907289383, array([281.95883963]))
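
A further sanity check, sketched under assumptions: fit synthetic points that lie exactly on a known line and confirm that the closed form recovers it. The helper `closed_form_fit` and the toy arrays are hypothetical names introduced only for this example; the helper restates the same formula compactly so the block runs on its own, and `np.polyfit` serves as an independent reference.

```python
import numpy as np

def closed_form_fit(x, y):
    """Hypothetical compact restatement of the closed-form slope and intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    slope = ((x * y).sum() - x.sum() * y.sum() / n) / ((x * x).sum() - x.sum() ** 2 / n)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Synthetic data lying exactly on y = 1 + 2x, so the fit should recover (1, 2).
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = 1.0 + 2.0 * x_toy

toy_intercept, toy_slope = closed_form_fit(x_toy, y_toy)

# np.polyfit with degree 1 returns coefficients highest power first: [slope, intercept].
ref_slope, ref_intercept = np.polyfit(x_toy, y_toy, 1)

print(toy_intercept, toy_slope)
print(np.isclose(toy_intercept, ref_intercept), np.isclose(toy_slope, ref_slope))
```

Agreement on noise-free data does not prove the function correct in general, but it catches sign and ordering mistakes cheaply.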

### Goal 3 - Prediction

Write a function that accepts a column of data `input_feature`, the `slope`, and the `intercept` you learned, and returns a column of predictions `predicted_output`, one for each entry in the input column.

Using your slope and intercept from (4), what is the predicted price for a house with 2650 sqft?

In [26]:
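def get_regression_predictions(input_feature, intercept, slope):
    # np.array(slope) makes the product element-wise even when
    # input_feature is a plain Python list and slope is a Python float
    predicted_output = intercept + np.array(slope) * input_feature
    return predicted_output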

In [27]:
get_regression_predictions([2650], squarefeet_intercept,squarfeet_slope)

array([700074.84594751])
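
The same number can be reproduced by hand from the fitted line. This is a minimal sketch that hard-codes the intercept and slope printed earlier for the training data; the short variable names are chosen just for this example.

```python
# Parameters copied from the training-data fit shown earlier in this notebook.
intercept = -47116.07907289418
slope = 281.9588396303426

sqft = 2650

# price = intercept + slope * sqft
predicted_price = intercept + slope * sqft
print(round(predicted_price, 2))
```

The result is about $700,075, in line with the array returned by `get_regression_predictions`.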

### RSS COST FUNCTION

Write a function that accepts columns of data `input_feature` and `output`, plus the regression parameters `slope` and `intercept`, and returns the Residual Sum of Squares (RSS).

According to this function and the slope and intercept from (4), what is the RSS for the simple linear regression using square feet to predict prices on TRAINING data?

In [28]:
def get_residual_sum_of_squares(input_feature, output, intercept,slope):
    pred=get_regression_predictions(input_feature, intercept,slope)
    residuals=output-pred
    RSS= (residuals*residuals).sum()
    return(RSS)

In [29]:
get_residual_sum_of_squares(input_feature, output, squarefeet_intercept,squarfeet_slope)

1201918354177283.0
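
Because the training RSS is too large to check by eye, a tiny hand-verifiable case can help confirm what the quantity measures. The data and parameters below are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Three toy points and a hypothetical line y = 0 + 2x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 7.0])
intercept, slope = 0.0, 2.0

predictions = intercept + slope * x   # [2, 4, 6]
residuals = y - predictions           # [0, 0, 1]
rss = float((residuals ** 2).sum())   # 0 + 0 + 1
print(rss)  # 1.0
```

Only the third point misses the line, by 1, so the RSS is 1. Since RSS is a sum over observations, its size depends on the number of rows and on the squared units of the output, which is why the training value above is so large.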

### Goal 4 - Estimate sq. feet from Price

Note that although we estimated the regression slope and intercept in order to predict the output from the input, this is a simple linear relationship with only two variables, so we can invert the linear function to estimate the input given the output.

Write a function that accepts a column of data `output` and the regression parameters `slope` and `intercept`, and returns the column of data `estimated_input`. Do this by solving the linear function output = intercept + slope*input for the `input` variable (i.e. `input` should be on one side of the equals sign by itself).

According to this function and the regression slope and intercept from (3), what is the estimated square footage of a house costing $800,000?

In [30]:
def inverse_regression_predictions(output, intercept, slope):
    estimated_input= (output-intercept)/slope
    return(estimated_input)

In [31]:
inverse_regression_predictions([800000], squarefeet_intercept,squarfeet_slope)

array([3004.39624515])
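
One way to convince yourself that the inversion is right is a round trip: predict a price from a square footage, invert it, and check that the original square footage comes back. This sketch hard-codes the fitted parameters printed earlier; `predict` and `invert` are hypothetical helper names used only here.

```python
# Parameters copied from the training-data fit shown earlier in this notebook.
intercept = -47116.07907289418
slope = 281.9588396303426

def predict(sqft):
    """Price predicted by the fitted line for a given square footage."""
    return intercept + slope * sqft

def invert(price):
    """Solve price = intercept + slope * sqft for sqft."""
    return (price - intercept) / slope

price_for_2650 = predict(2650)
recovered_sqft = invert(price_for_2650)   # should be 2650 up to rounding error
sqft_for_800k = invert(800000)

print(recovered_sqft)
print(sqft_for_800k)
```

Note that this algebraic inversion is generally not the same line you would get by running a fresh regression of `sqft_living` on `price`, because least squares minimizes errors in the output variable only.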

### Goal 5 - Compare 2 different models 

Instead of using `sqft_living` to estimate prices, we could use `bedrooms` (a count of the number of bedrooms in the house). Using your function from (3), calculate the Simple Linear Regression slope and intercept for estimating price based on bedrooms. Save this slope and intercept for later (you might want to call them e.g. `bedroom_slope` and `bedroom_intercept`).

In [37]:
input_feature = df_train['bedrooms']
output = df_train['price']

bedroom_intercept,bedroom_slope = simple_linear_regression(input_feature,output)
bedroom_slope,bedroom_intercept

(127588.95293398784, 109473.1776229596)

Now that we have two different models, compute the RSS of BOTH models on TEST data.

13. Quiz Question: Which model (square feet or bedrooms) has the lowest RSS on TEST data? Think about why this might be the case.

In [38]:
rss_model2_bedroom= get_residual_sum_of_squares(df_test['bedrooms'], df_test['price'], bedroom_intercept,bedroom_slope)
rss_model1_sqfeet= get_residual_sum_of_squares(df_test['sqft_living'], df_test['price'], squarefeet_intercept,squarfeet_slope)
rss_model2_bedroom < rss_model1_sqfeet

False

In [39]:
rss_model2_bedroom,  rss_model1_sqfeet
# The square-feet model has the lower test RSS, so it fits better than the bedrooms model

(493364585960300.9, 275402933617812.12)
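
The raw RSS values are hard to interpret, so it can help to convert them to a root-mean-square error in dollars, RMSE = sqrt(RSS / N). This is a sketch using the two test RSS values printed above and the 4229 test rows reported by `df_test.shape`; the variable names are chosen for this example only.

```python
import math

n_test = 4229  # number of rows in the test set (from df_test.shape)

# Test-set RSS values copied from the output above.
rss_sqft = 275402933617812.12
rss_bedrooms = 493364585960300.9

# Typical size of a prediction error, in dollars, for each model.
rmse_sqft = math.sqrt(rss_sqft / n_test)
rmse_bedrooms = math.sqrt(rss_bedrooms / n_test)

# How many times larger the bedrooms RSS is than the square-feet RSS.
rss_ratio = rss_bedrooms / rss_sqft

print(round(rmse_sqft), round(rmse_bedrooms))
print(round(rss_ratio, 2))
```

On this test set the square-feet model's typical error is roughly $255,000 against roughly $342,000 for the bedrooms model. One plausible reason is that living area varies continuously with house size, while the bedroom count takes only a few distinct values.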