### Linear Regression Model
This notebook trains a linear regression model on the training data and evaluates it on the validation data. I will first use the model's default parameters and then the best parameters found by a random search.

In [1]:
#importing libraries
import pandas as pd
import numpy as np
from pathlib import Path

In [2]:
#importing the data
PARENT = "Predicing_House_Prices"
path = Path(PARENT).parent / "../Data/X_train_model2.csv"
X_train_model2 = pd.read_csv(path)

path2 = Path(PARENT).parent / "../Data/X_valid_model2.csv"
X_valid_model2 = pd.read_csv(path2)

path3 = Path(PARENT).parent / "../Data/y_train_model2.csv"
y_train_model2 = pd.read_csv(path3)

path4 = Path(PARENT).parent / "../Data/y_valid_model2.csv"
y_valid_model2 = pd.read_csv(path4)

In [3]:
#dropping unnamed column
X_train_model2 = X_train_model2.drop(columns="Unnamed: 0")
X_valid_model2 = X_valid_model2.drop(columns="Unnamed: 0")
y_train_model2 = y_train_model2.drop(columns="Unnamed: 0")
y_valid_model2 = y_valid_model2.drop(columns="Unnamed: 0")
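
An `Unnamed: 0` column usually appears when a CSV was written with the row index included and then read back without telling pandas about it. The sketch below assumes that is what happened to these files (I have not checked how they were written) and uses a small hypothetical frame, not the housing data, to show two ways of avoiding the extra column instead of dropping it afterwards.

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for X_train_model2.
df = pd.DataFrame({"livingArea": [1200, 1850], "bedrooms": [2, 3]})

# to_csv() writes the row index as an unnamed first column by default.
csv_text = df.to_csv()

# Reading it back naively turns that index into an "Unnamed: 0" column.
naive = pd.read_csv(io.StringIO(csv_text))

# Option 1: treat the first column as the index when reading.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: do not write the index in the first place.
clean = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(naive.columns))
print(list(indexed.columns))
print(list(clean.columns))
```

Either option would make the `drop(columns="Unnamed: 0")` step unnecessary, provided the assumption about how the files were saved holds.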

In [4]:
#training the default linear regression model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg1 = LinearRegression().fit(X_train_model2, y_train_model2)

In [5]:
#getting the mean squared error and R^2

#y predictions on the validation data
preds= lin_reg1.predict(X_valid_model2)

print("Mean Squared Error:", format(mean_squared_error(y_valid_model2, preds)))
print("R Squared:", format(lin_reg1.score(X_valid_model2, y_valid_model2)))

Mean Squared Error: 4.152894233082195e+30
R Squared: -1.226869361830285e+18
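
For a regressor, `score` returns R², so an R² this far below zero means the validation predictions are much worse than simply predicting the mean price. As a reminder of what the two metrics measure, here is a minimal sketch that computes them by hand and checks the result against scikit-learn. The arrays are hypothetical values chosen for illustration, not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions, not taken from the housing data.
y_true = np.array([200_000.0, 350_000.0, 500_000.0, 650_000.0])
y_good = np.array([210_000.0, 340_000.0, 520_000.0, 630_000.0])
y_bad = y_true + 1e9  # predictions that are wildly off


def mse_and_r2(y, y_hat):
    """Return (MSE, R^2) computed from their definitions."""
    ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares around the mean
    return ss_res / len(y), 1.0 - ss_res / ss_tot


for name, y_hat in [("good", y_good), ("bad", y_bad)]:
    mse, r2 = mse_and_r2(y_true, y_hat)
    # The hand-computed values should agree with scikit-learn's metrics.
    assert np.isclose(mse, mean_squared_error(y_true, y_hat))
    assert np.isclose(r2, r2_score(y_true, y_hat))
    print(name, mse, r2)
```

Because R² is one minus the ratio of the residual to the total sum of squares, it has no lower bound: a few extremely large prediction errors are enough to push it far below zero.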


In [6]:
# random search for parameters

from sklearn.model_selection import RandomizedSearchCV

# base estimator; RandomizedSearchCV clones and fits it internally
lin_reg2 = LinearRegression(n_jobs=None)

#initializing possible parameters
params = {
    "fit_intercept": [True, False],
    "copy_X": [True, False],
    "positive": [True, False]
}
#number of iterations (only 2*2*2 = 8 distinct combinations exist,
#so scikit-learn caps the search at 8 and issues a warning)
n_iter_search = 400
# run randomized search; random_state makes the sampling reproducible
random_search = RandomizedSearchCV(
    lin_reg2, param_distributions=params,
    n_iter=n_iter_search,
    cv=3, n_jobs=-1, random_state=3467
)
#fitting the random search
random_search.fit(X_train_model2, y_train_model2)
#printing the results
rand_opt = random_search.best_params_
print(rand_opt)



{'positive': True, 'fit_intercept': False, 'copy_X': True}
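
Since every parameter in `params` is a two-element list, the search space contains only eight combinations, so an exhaustive grid search covers the same ground as the random search. The sketch below counts the combinations with `ParameterGrid` and runs `GridSearchCV` on synthetic data from `make_regression`. The synthetic data is an assumption made to keep the example self-contained, so the best parameters it prints say nothing about the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid

params = {
    "fit_intercept": [True, False],
    "copy_X": [True, False],
    "positive": [True, False],
}

# Every combination of the three boolean options: 2 * 2 * 2.
n_candidates = len(ParameterGrid(params))
print(n_candidates)

# Synthetic stand-in data (not the housing data).
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# Exhaustive search over the same parameter space with 3-fold CV.
grid_search = GridSearchCV(LinearRegression(), param_grid=params, cv=3)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(len(grid_search.cv_results_["params"]))
```

Note that `copy_X` only controls whether the input array may be overwritten during fitting, so I would not expect it to change the fitted coefficients. It is kept here only to mirror the search above.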


In [7]:
#fitting the model with random search parameters
lin_reg3 = LinearRegression(positive= True, fit_intercept=False, copy_X= True,
    n_jobs=None).fit(X_train_model2, y_train_model2)

In [8]:
#getting the mean squared error and R^2

#y predictions on the validation data
preds2= lin_reg3.predict(X_valid_model2)

print("Mean Squared Error:", format(mean_squared_error(y_valid_model2, preds2)))
print("R Squared:", format(lin_reg3.score(X_valid_model2, y_valid_model2)))

Mean Squared Error: 1709983475856.7566
R Squared: 0.49482789158160845


I will now fit a model with fewer features. I picked the features that I expect to have a strong effect on house price and limited the data to only those features.
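
One thing to watch when selecting columns by name: `DataFrame.filter` keeps the requested labels that exist and silently ignores the rest, so a misspelled feature name would go unnoticed. The sketch below uses a small hypothetical frame (not the housing data) to show the difference between `filter` and plain bracket indexing, which raises a `KeyError` instead.

```python
import pandas as pd

# Hypothetical toy frame; "garageSpaces" is deliberately absent.
X = pd.DataFrame({"livingArea": [1200, 1850], "bedrooms": [2, 3], "pool": [0, 1]})
wanted = ["livingArea", "bedrooms", "garageSpaces"]

# filter() returns only the columns that exist, without complaining.
silently_filtered = X.filter(items=wanted)
print(list(silently_filtered.columns))

# An explicit check makes the missing name visible.
missing = [c for c in wanted if c not in X.columns]
print(missing)

# Bracket indexing fails loudly when a requested column is missing.
try:
    X[wanted]
except KeyError as err:
    print("KeyError:", err)
```

A quick check like `missing` before fitting is a cheap way to confirm that the training and validation subsets really contain every intended feature.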

In [9]:
#looking at all of the columns
for t in X_train_model2.columns:
    print(t)

yearBuilt
livingArea
bathrooms
bedrooms
parking
garageSpaces
hasGarage
pool
spa
isNewConstruction
hasPetsAllowed
state_CA
state_GA
city_"oneals"
city_abbeville
city_acampo
city_acton
city_acworth
city_adairsville
city_adel
city_adelanto
city_adin
city_adrian
city_agoura
city_agoura hills
city_agua dulce
city_ahwahnee
city_ailey
city_alameda
city_alamo
city_alapaha
city_albany
city_albion
city_alderpoint
city_alhambra
city_aliso viejo
city_allenhurst
city_alma
city_alpaugh
city_alpharetta
city_alpine
city_alta
city_alta loma
city_altadena
city_alto
city_alturas
city_alviso
city_amador city
city_american canyon
city_americus
city_anaheim
city_anaheim hills
city_anderson
city_angels camp
city_angelus oaks
city_angwin
city_annapolis
city_antelope
city_antioch
city_apple valley
city_applegate
city_appling
city_aptos
city_arabi
city_aragon
city_arcadia
city_arcata
city_armuchee
city_arnold
city_arnoldsville
city_arrowbear lake
city_arrowhead
city_arroyo grande
city_artesia
city_arvin
city_as

In [10]:
# subsetting for only the necessary variables
selected_features = ["livingArea", "bathrooms", "bedrooms", "garageSpaces",
                     "pool", "isNewConstruction", "state_CA", "state_GA"]
X_train_model3 = X_train_model2.filter(selected_features)
X_valid_model3 = X_valid_model2.filter(selected_features)


In [11]:
#fitting the model with limited features
lin_reg4 = LinearRegression(positive= True, fit_intercept=False, copy_X= True,
    n_jobs=None).fit(X_train_model3, y_train_model2)

In [12]:
#getting the mean squared error and R^2

#y predictions on the validation data
preds4= lin_reg4.predict(X_valid_model3)

print("Mean Squared Error:", format(mean_squared_error(y_valid_model2, preds4)))
print("R Squared:", format(lin_reg4.score(X_valid_model3, y_valid_model2)))

Mean Squared Error: 2305960452157.2104
R Squared: 0.3187613096892473
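
The raw MSE values are hard to compare at a glance because they are in squared units of the target. Taking the square root gives the RMSE, which is in the same units as the house price (presumably dollars, which is my assumption about the target column). The sketch below only reuses the three validation MSE values printed in this notebook. The dictionary labels are mine.

```python
import math

# Validation MSE values copied from the printed outputs in this notebook.
mse_by_model = {
    "default (lin_reg1)": 4.152894233082195e+30,
    "random search (lin_reg3)": 1709983475856.7566,
    "limited features (lin_reg4)": 2305960452157.2104,
}

# RMSE is the square root of MSE, expressed in the target's own units.
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}

for name, rmse in rmse_by_model.items():
    print(f"{name}: RMSE = {rmse:,.0f}")
```

On this scale the ordering matches the R² values above: the model with the random search parameters on all features has the lowest validation error, the limited-feature model is somewhat worse, and the default model is not usable.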
