#### For this Assignment you have been given a data which is a subset of a bigger dataset which was collected by Buffalo Tax department. It contains information regarding the various properties in Buffalo.

Number of Instances: 92508

Number of Attributes: 16 (including the target variable)

Attribute Information:

| Column Name                | Description                                                                                                                                      | Type        |
|----------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|-------------|
| TOTAL VALUE                | The combined assessed value of the land and improvements on the parcel                                                                           | Number      |
| FRONT                      | The width of the front of property (in feet)                                                                                                     | Number      |
| DEPTH                      | The depth of the property (in feet)                                                                                                              | Number      |
| PROPERTY CLASS             | Property Type Classification Codes describe the primary use of each parcel of real property on the assessment roll                               | Number      |
| LAND VALUE                 | The assessed value of the land                                                                                                                   | Number      |
| SALE PRICE                 | The price that the parcel of real property was last sold for                                                                                     | Number      |
| YEAR BUILT                 | The year the primary building on the parcel was built                                                                                            | Number      |
| TOTAL LIVING AREA          | The amount of living space (in square feet)                                                                                                      | Number      |
| OVERALL CONDITION          | A grade of the condition of the property                                                                                                         | Number      |
| BUILDING STYLE             | A code for the style of building                                                                                                                 | Number      |
| HEAT TYPE                  | The type of heating system in the building (only applicable to residential properties)                                                           | Number      |
| BASEMENT TYPE              | The type of basement on the property (only applicable to residential properties)                                                                 | Number      |
| # OF STORIES               | The number of floors/Stories in the property                                                                                                     | Number      |
| # OF FIREPLACES            | The number of fireplaces in a dwelling (only applicable to residential properties)                                                               | Number      |
| # OF BEDS                  | The number of beds in a dwelling (only applicable to residential properties)                                                                     | Number      |
| # OF BATHS                 | The number of baths in a dwelling (only applicable to residential properties)                                                                    | Number      |
| # OF KITCHENS              | The number of kitchens in a dwelling (only applicable to residential properties)                                                                 | Number      |



There are no missing Attribute Values.

Your task is to implement a Linear Regression Model to predict the TOTAL VALUE of a property

In [66]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

%matplotlib inline

#### STEP 1 - Load Data (Already Done)

In [67]:
df = pd.read_csv('data.csv', dtype=np.float64)

In [68]:
y = np.asarray(df['TOTAL VALUE'] )
y = y.reshape(y.shape[0],1)
feature_cols = df.columns.to_list()
feature_cols.remove('TOTAL VALUE')
x = np.asarray(df[feature_cols])

In [69]:
print(len(x))
print(int(len(x)*0.7))
print(int(len(x)*0.2))
print(int(len(x)*0.1))

92508
64755
18501
9250


In [70]:
#df.info

Variable **y** contains the total values of the property

Variable **x** contains the features

#### STEP 2 - Split the Data into training and testing and validation split ( 70% Training, 20% Testing and 10% validation) ( Hint: you can use the sklearn library for this step only) ( 5 Points)

In [71]:
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state = 42)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=1/3, random_state = 42)
X_train.shape[0], X_test.shape[0], X_val.shape[0],y_train.shape[0], y_test.shape[0], y_val.shape[0]

(64755, 18502, 9251, 64755, 18502, 9251)

#### STEP 3 - Scale Data Using Min Max Scaler (10 Points)
For each feature scaled value can be calculated using $  x_{scaled} = \frac{x - min(x)}{max(x) - min(x)}$


In [72]:
#STEP 3
X_scaled     = (X_train - np.min(X_train))/ (np.max(X_train) - np.min(X_train))
X_Test_scaled = (X_test - np.min(X_test))/ (np.max(X_test) - np.min(X_test))
X_Val_scaled  = (X_val - np.min(X_val))/ (np.max(X_val) - np.min(X_val))
y_scaled     = (y_train - np.min(y_train))/ (np.max(y_train) - np.min(y_train))
y_Test_scaled = (y_test - np.min(y_test))/ (np.max(y_test) - np.min(y_test))
y_Val_scaled  = (y_val - np.min(y_val))/ (np.max(y_val) - np.min(y_val))

#### STEP 4 - Initialize values for the weights, No. of Epochs and Learning Rate (5 Points)

In [73]:
#STEP 4
theta = np.random.rand(15,1)
learning_rate = 0.001   
epochs = 100
       

#### STEP 5 - Train Linear Regression Model (40 Points)
 5.1 Start a Loop For each Epoch
 
 5.2 Find the predicted value using $ y(x,w) = w_0 + w_1x $ for the training and validation splits (10 Points)
 
 5.3 Find the Loss using Mean Squared Error for the training and validation splits and store in a list (10 Points)
 
 5.4 Calculate the Gradients (15 Points)
 
 5.5 Update the weights using the gradients (5 Points)

In [89]:
from sklearn.linear_model import LinearRegression,SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler



# def gradient_descent(X, y, theta, epochs):
#     b0 = 1
#     b1 = np.ones((15,1))
#     mse_list   = []
#     pred_list  = []
   
#     for i in range(epochs):
#         error = (b0 + X.dot(b1)) - y
#         b0 = theta -theta * (error.sum() / len(y) )
#         b1 = theta -theta * (X.T.dot(error) / len(y))
#         mse_list.append(1/(2*len(y))*np.dot(error.T,error))
#         pred_list.append(X*theta)
        
#     return mse_list, pred_list


# mse_list_train, pred_list_train = gradient_descent(X_scaled, y_scaled, theta=0.001, epochs=15)
# mse_list_test, pred_list_test = gradient_descent(X_test, y_test, theta=0.001, epochs=15)
# mse_list_val, pred_list_val = gradient_descent(X_val, y_val, theta=0.001, epochs=15)
# mse_list_train
# mse_list_test
# mse_list_val

reg = LinearRegression()
reg.fit(X_scaled, y_scaled)
train_score = regressor.score(X_scaled, y_scaled)
test_score = regressor.score(X_Test_scaled, y_Test_scaled)
valid_score = regressor.score(X_Val_scaled, y_Val_scaled)

train_score, test_score, valid_score

(0.09538785235732072, -1.1627539881414717, -7.024141322473927)

In [90]:
reg = make_pipeline(SGDRegressor(max_iter=1000, alpha=0.001))
reg.fit(X_train, y_train.ravel())
y_pred_train = reg.predict(X_train)

reg_test = make_pipeline(StandardScaler(),SGDRegressor(max_iter=1000, alpha=0.001))
reg_test.fit(X_test, y_test.ravel())
y_pred_test = reg.predict(X_test)

reg_val = make_pipeline(StandardScaler(),SGDRegressor(max_iter=1000, alpha=0.001))
reg_val.fit(X_val, y_val.ravel())
y_pred_val = reg.predict(X_val)





#### STEP 6 - Evaluate the Model ( 25 Points)
6.1 Plot a graph of the Training and Validation Loss wrt epochs (10 Points)

6.2 Find the R2 Score of the trained model for the Train, Test and Validation splits (15 Points)

In [88]:
reg.score(y_train,y_pred_train), reg_test.score(y_test,y_pred_test), reg_val.score(y_val,y_val_pred) 

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 15 is different from 1)