In [3]:
import pandas as pd
import math

In [4]:
# DATA PREPARATION
df = pd.read_csv(r"D:\#10\AI\CA4\house_data.csv")
df.info()
df['sqft_living'] = df['sqft_living'].fillna(df['sqft_living'].mode()[0])
df['floors'] = df['floors'].fillna(df['floors'].mode()[0])
df['sqft_basement'] = df['sqft_basement'].fillna(df['sqft_basement'].mode()[0])
df['yr_built'] = df['yr_built'].fillna(df['yr_built'].mode()[0])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 27 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0.2   21613 non-null  int64  
 1   Unnamed: 0.1   21613 non-null  int64  
 2   Unnamed: 0     21613 non-null  int64  
 3   id             21613 non-null  int64  
 4   date           21613 non-null  object 
 5   price          21613 non-null  float64
 6   bedrooms       21613 non-null  int64  
 7   bathrooms      21613 non-null  float64
 8   sqft_living    18528 non-null  float64
 9   sqft_lot       21613 non-null  int64  
 10  floors         18530 non-null  float64
 11  waterfront     21613 non-null  int64  
 12  view           21613 non-null  int64  
 13  condition      21613 non-null  int64  
 14  grade          21613 non-null  int64  
 15  sqft_above     21613 non-null  int64  
 16  sqft_basement  21184 non-null  float64
 17  yr_built       18531 non-null  float64
 18  yr_ren

# Linear Regression

Main form of simple linear regression function: 
$$f(x) = \alpha x + \beta$$

Here we want to find the slope ($\alpha$) and the intercept ($\beta$) that minimize the residual sum of squares (RSS). We do this by setting the derivatives of the RSS function to zero:

- step 1: Compute RSS of the training data  

$$ RSS = \Sigma (y_i - (\hat{\beta} + \hat{\alpha} * x_i) )^2 $$

- step 2: Compute the derivatives of the RSS function with respect to $\alpha$ and $\beta$, and set them equal to 0 to find the desired parameters. Below, $\bar{x}$ and $\bar{y}$ denote the sample means of $x_i$ and $y_i$.

$$ \frac{\partial RSS}{\partial \beta} = \Sigma (-2 y_i + 2 \hat{\beta} + 2 \hat{\alpha} * x_i) = 0$$
$$ \to \hat{\beta} = \bar{y} - \hat{\alpha} \bar{x} \to (1)$$


$$ \frac{\partial RSS}{\partial \alpha} = \Sigma (-2 x_i y_i + 2 \hat{\beta} x_i + 2\hat{\alpha} x_i ^ 2) = 0 \to (2)$$

$$ (1) , (2) \to \hat{\alpha} = \frac{\Sigma{(x_i - \bar{x})(y_i - \bar{y})}}{\Sigma{(x_i - \bar{x})^2}} $$
$$ \hat{\beta} = \bar{y} - \hat{\alpha} \bar{x}$$
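
As a quick sanity check of the closed-form solution, the sketch below applies it to a small made-up dataset that lies exactly on the line $y = 2x + 1$, so the coefficient of $x$ ($\alpha$) should come out as 2 and the constant term ($\beta$) as 1. The helper name `closed_form_fit` and the data values are illustrative assumptions, not part of the assignment.

```python
import math

def closed_form_fit(xs, ys):
    """Hypothetical helper: return (alpha, beta) for f(x) = alpha * x + beta."""
    n = len(xs)
    x_mean = math.fsum(xs) / n
    y_mean = math.fsum(ys) / n
    # numerator: sum of products of centered x and centered y
    s_xy = math.fsum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    # denominator: sum of squared centered x
    s_xx = math.fsum((x - x_mean) ** 2 for x in xs)
    alpha = s_xy / s_xx
    beta = y_mean - alpha * x_mean
    return alpha, beta

# made-up data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

alpha, beta = closed_form_fit(xs, ys)
print(alpha, beta)  # 2.0 1.0
```

Because the points are perfectly collinear, the fitted line reproduces them exactly; on real data such as the house prices the same formulas give the line with the smallest RSS rather than a perfect fit.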



Based on the formula above, complete this function to compute the parameters of a simple linear regression

In [5]:
def simple_linear_regression(input_feature, output):
    # TO DO:
    # compute the means of input_feature and output
    input_mean = math.fsum(input_feature) / len(input_feature)
    output_mean = math.fsum(output) / len(output)
    # compute the product of the centered input_feature and centered output, and its sum
    product = (input_feature - input_mean) * (output - output_mean)
    product_sum = math.fsum(product)
    # compute the squared value of the centered input_feature and its sum
    sq_input = (input_feature - input_mean) ** 2
    sq_sum = math.fsum(sq_input)
    # use the formula for the slope (alpha, the coefficient of input_feature)
    slope = product_sum / sq_sum
    # use the formula for the intercept (beta, the constant term)
    intercept = output_mean - (slope * input_mean)
    # the coefficient of input_feature is returned first, the constant term second
    return (slope, intercept)

Now complete this function to predict the value of given data based on the calculated intercept and slope

In [6]:
def get_regression_predictions(input_feature, slope, intercept):
    # TO DO:

    # calculate the predicted values: f(x) = slope * x + intercept
    predicted_values = slope * input_feature + intercept

    return predicted_values

Now that we have a model and can make predictions, let's evaluate it using the Root Mean Square Error (RMSE). RMSE is the square root of the mean of the squared residuals, where a residual is simply the difference between the predicted output and the true output.
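
To make the definition concrete, here is a minimal worked example on four made-up values. The function name `rmse` and the numbers are assumptions chosen for illustration; they are not taken from the house dataset.

```python
import math

def rmse(predicted, actual):
    """Hypothetical helper: root of the mean of the squared residuals."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(math.fsum(squared) / len(squared))

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [4.0, 4.0, 9.0, 7.0]

# residuals: 1, -1, 2, -2 -> squares: 1, 1, 4, 4 -> mean: 2.5 -> root: about 1.5811
value = rmse(predicted, actual)
print(round(value, 4))  # 1.5811
```

RMSE is expressed in the same units as the output, so for house prices it reads as a typical prediction error in currency units.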

Complete the following function to compute the RMSE of a simple linear regression model given its predicted values and the true output:

In [7]:
def get_root_mean_square_error(predicted_values , output):
    # TO DO:

    # Compute the residuals (since we are squaring it doesn't matter which order you subtract)
    residuals  = predicted_values - output
    # square the residuals and add them up
    sum_sq = math.fsum(residuals ** 2)
    # find the mean of the above phrase
    mean_sq = sum_sq / len(output)
    # calculate the root
    RMSE = math.sqrt(mean_sq)
    return(RMSE)

As you might have guessed, RMSE has no upper bound and depends on the scale of the output, so it is hard to tell from it alone how well the model fits the data. Instead, we use the R2 score. The R2 score compares the sum of the squared differences between the actual and predicted values of the dependent variable with the total sum of squared differences between the actual values and their mean. Mathematically, the R2 score is defined as follows:

$$R^2 = 1 - \frac{SSres}{SStot} = 1 - \frac{\sum_{i=1}^{n} (y_{i,true} - y_{i,pred})^2}{\sum_{i=1}^{n} (y_{i,true} - \bar{y}_{true})^2} $$
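
The following sketch evaluates the formula on a few made-up predictions to show how the score behaves at its reference points. The function name `r2_score` and all values are illustrative assumptions.

```python
import math

def r2_score(predicted, actual):
    """Hypothetical helper: 1 - SSres / SStot."""
    mean_actual = math.fsum(actual) / len(actual)
    ss_res = math.fsum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = math.fsum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]

perfect = list(actual)             # predictions equal to the true values
mean_only = [6.0] * 4              # always predict the mean of `actual`
reversed_fit = [9.0, 7.0, 5.0, 3.0]  # trend in the wrong direction

print(r2_score(perfect, actual))       # 1.0
print(r2_score(mean_only, actual))     # 0.0
print(r2_score(reversed_fit, actual))  # -3.0
```

A score of 1 means a perfect fit, 0 means the model does no better than always predicting the mean, and a negative score means it does worse than that baseline.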

In this step, complete the following function to calculate the R2 score given the predicted values and the true output:

In [8]:
def get_r2_score(predicted_values, output):
    # TO DO:

    # compute the residuals (since we are squaring it doesn't matter which order you subtract)
    residuals = predicted_values - output
    # square the residuals and add them up -> SSres
    SSres = math.fsum(residuals ** 2)
    # compute the SStot: squared deviations of the output from its mean
    output_mean = math.fsum(output) / len(output)
    SStot = math.fsum((output - output_mean) ** 2)
    # compute the R2 score value
    R2_score = 1 - (SSres / SStot)
    return R2_score

Now evaluate how well the model fits the data for each feature and explain the outputs.

In [12]:
# TO DO:

designated_feature_list = ['sqft_living' , 'yr_built' , 'grade' , 'zipcode']
output = df['price']
for feature in designated_feature_list:
    # TO DO: calculate R2 score and RMSE for each given feature
    # the first returned value is the coefficient of the feature, the second is the constant term
    (slope , intercept) = simple_linear_regression(df[feature] , output)
    prediction = get_regression_predictions(df[feature] , slope , intercept)
    print("RMSE:" , get_root_mean_square_error(prediction , output))
    print("R2:" , get_r2_score(prediction , output))

RMSE: 298822.8059063252
R2: 0.33830391674057914
RMSE: 366999.2530635096
R2: 0.0019289793749904804
RMSE: 338384.9116288185
R2: 0.15149761982248722
RMSE: 366834.1372163335
R2: 0.0028268573348371184
