## Prediction of house prices

The goal is to create a model to predict house prices by given house characteristics. The assignment would be evaluated based on models performance on the test data (which is not included in the attached dataset) and full training flow implementation. Here are a few hints to avoid mistakes and get a better model and a better implementation.

1. Avoid overfitting - check your models on validation/test datasets except training ones.
2. Use cross validation with different algorithms (Linear Regression, Ridge Regression, KNNR etc.) and different params (e.g., KNN with different number of neighbors, Ridge Regression with different alpha/lambda values). The model with best score/metric on validation data is selected as your final model
3. Retrain your selected model with best parameters on your data. Use this as a final predictor for the final function.
4. The best practice would be to save the trained model and submit it with the notebook file. You can read about saving sklearn models here - https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/;
5. However, this is optional for now, and for the final test we can rerun your training process and use trained model for the prediction. If your training time is long, please try to go on with solution described in 4.
6. Keep in mind, that even the initial columns and structure of test dataset would be the same as the train dataset, after your processing the column set can be different (which can appear because of One Hot Encoding) and you should handle this somehow.
7. Please leave some kind of instruction for the TA to run your final function and, if necessary, all your codes.

In [83]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
df = pd.read_csv('houses_train.csv')
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from numpy import arange
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
import pickle
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import KNeighborsRegressor

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5001 entries, 0 to 5000
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      5001 non-null   int64  
 1   price           5001 non-null   float64
 2   condition       5001 non-null   object 
 3   district        5001 non-null   object 
 4   max_floor       5001 non-null   int64  
 5   street          5001 non-null   object 
 6   num_rooms       5001 non-null   int64  
 7   region          5001 non-null   object 
 8   area            5001 non-null   float64
 9   url             5001 non-null   object 
 10  num_bathrooms   5001 non-null   int64  
 11  building_type   5001 non-null   object 
 12  floor           5001 non-null   int64  
 13  ceiling_height  5001 non-null   float64
dtypes: float64(3), int64(5), object(6)
memory usage: 547.1+ KB


In [3]:
df

Unnamed: 0.1,Unnamed: 0,price,condition,district,max_floor,street,num_rooms,region,area,url,num_bathrooms,building_type,floor,ceiling_height
0,4598,100000.0,newly repaired,Arabkir,6,Kievyan St,3,Yerevan,96.0,http://www.myrealty.am/en/item/26229/3-senyaka...,1,stone,4,3.0
1,5940,52000.0,good,Arabkir,14,Mamikoniants St,3,Yerevan,78.0,http://www.myrealty.am/en/item/32897/3-senyaka...,1,panel,10,2.8
2,2302,52000.0,newly repaired,Qanaqer-Zeytun,9,M. Melikyan St,3,Yerevan,97.0,http://www.myrealty.am/en/item/1459/apartment-...,1,panel,1,2.8
3,5628,130000.0,good,Center,4,Spendiaryan St,3,Yerevan,80.0,http://www.myrealty.am/en/item/2099/3-senyakan...,1,stone,2,3.2
4,760,81600.0,zero condition,Center,9,Ler. Kamsar St,3,Yerevan,107.0,http://www.myrealty.am/en/item/22722/3-senyaka...,1,monolit,9,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4996,3585,70000.0,newly repaired,Arabkir,5,Griboedov St,3,Yerevan,97.0,http://www.myrealty.am/en/item/36852/3-senyaka...,1,stone,4,2.8
4997,3291,77000.0,newly repaired,Arabkir,4,Orbeli Yeghbayrner St,3,Yerevan,71.0,http://www.myrealty.am/en/item/13933/Apartment...,1,stone,4,2.8
4998,5959,46000.0,zero condition,Center,5,Mashtots Ave,1,Yerevan,40.0,http://www.myrealty.am/en/item/31190/1-senyaka...,1,stone,2,3.0
4999,542,99000.0,newly repaired,Center,14,Argishti St,4,Yerevan,118.0,http://www.myrealty.am/en/item/25905/4-senyaka...,2,monolit,14,3.0


In [4]:
df=df.drop(['url', 'street','region'], axis=1)
df=df.drop(['Unnamed: 0'], axis=1)
df=df.drop(['max_floor'], axis=1)
#df.floor.unique()
df

Unnamed: 0,price,condition,district,num_rooms,area,num_bathrooms,building_type,floor,ceiling_height
0,100000.0,newly repaired,Arabkir,3,96.0,1,stone,4,3.0
1,52000.0,good,Arabkir,3,78.0,1,panel,10,2.8
2,52000.0,newly repaired,Qanaqer-Zeytun,3,97.0,1,panel,1,2.8
3,130000.0,good,Center,3,80.0,1,stone,2,3.2
4,81600.0,zero condition,Center,3,107.0,1,monolit,9,3.0
...,...,...,...,...,...,...,...,...,...
4996,70000.0,newly repaired,Arabkir,3,97.0,1,stone,4,2.8
4997,77000.0,newly repaired,Arabkir,3,71.0,1,stone,4,2.8
4998,46000.0,zero condition,Center,1,40.0,1,stone,2,3.0
4999,99000.0,newly repaired,Center,4,118.0,2,monolit,14,3.0


In [21]:
new= pd.get_dummies(df, columns=['condition','district','num_rooms','num_bathrooms','building_type'], drop_first=True)

target = new['price']
features = new[new.columns.difference(['price'])]

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)

In [34]:
X_train_area = X_train.loc[:, "area"].to_numpy()
X_train_remaining = X_train.loc[:, X_train.columns != "area"].to_numpy()
X_test_area = X_test.loc[:, "area"].to_numpy()
X_test_remaining = X_test.loc[:, X_test.columns != "area"].to_numpy()

# Apply scaling separately to the training set
scaler1 = StandardScaler()
X_train_area_scaled = scaler1.fit_transform(X_train_area.reshape(-1,1))

scaler2 = MinMaxScaler()
X_train_remaining_scaled = scaler2.fit_transform(X_train_remaining)

# Apply the same scaling transformations to the testing set
X_test_area_scaled = scaler1.transform(X_test_area.reshape(-1,1))
X_test_remaining_scaled = scaler2.transform(X_test_remaining)

# Merge the scaled feature and the scaled remaining features back into the training and testing sets
X_train_scaled = np.concatenate((X_train_area_scaled, X_train_remaining_scaled), axis=1)
X_test_scaled = np.concatenate((X_test_area_scaled, X_test_remaining_scaled), axis=1)


In [62]:
model = linear_model.LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

print("MSE for linear regression: %.3f" %mean_squared_error(y_test, y_pred))
print("RMSE for linear regression: %.3f" %np.sqrt(mean_squared_error(y_test, y_pred)))

MSE for linear regression: 916705971.724
RMSE for linear regression: 30277.153


In [66]:
lasso = Lasso()
alphas = np.logspace(-5, -0.5, 30)
scores = []
for alpha in alphas:
    lasso.alpha = alpha
    cv_score = cross_val_score(lasso, X_train_scaled, y_train, cv=5)
    scores.append(np.mean(cv_score))

# Find the alpha that gives the highest cross-validation score
best_alpha = alphas[np.argmax(scores)]
print("Best alpha:", best_alpha)

# Train a final Lasso regression model with the best alpha
lasso.alpha = best_alpha
lasso.fit(X_train_scaled, y_train)
y_predd = lasso.predict(X_test_scaled)

mse = mean_squared_error(y_test, y_predd)
print(f"MSE for lasso: {mse:.3f}")
print("RMSE for lasso: %.3f" %np.sqrt(mean_squared_error(y_test, y_predd)))

Best alpha: 0.31622776601683794
MSE for lasso: 916680616.624
RMSE for lasso: 30276.734


In [71]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)
ridge = RidgeCV(alphas=arange(0, 1, 0.1), cv=kf, scoring='neg_mean_absolute_error')
ridge.fit(X_train_scaled, y_train)

y_preddd = ridge.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_preddd)

print("Best alpha:",ridge.alpha_)
print("MSE:", mse)
print("RMSE for ridge: %.3f" %np.sqrt(mean_squared_error(y_test, y_preddd)))

Best alpha: 0.9
MSE: 916891672.222476
RMSE for ridge: 30280.219


In [80]:
# Choose a value for k
k_range = range(1, 31)

# List to store the cross-validation scores
cv_scores = []

# Perform 10-fold cross-validation for each k value
for k in k_range:
    knn = KNeighborsRegressor(n_neighbors=k)
    scores = cross_val_score(knn, X_train_scaled, y_train, cv=10, scoring='neg_mean_squared_error')
    cv_scores.append(-scores.mean())

# Find the optimal k value with the minimum MSE score
optimal_k = k_range[cv_scores.index(min(cv_scores))]
print("Optimal k-value:", optimal_k)

# Fit the model on the training set
knn = KNeighborsRegressor(n_neighbors=k)
knn.fit(X_train_scaled, y_train)

# Make predictions on the testing set
y_pred4 = knn.predict(X_test_scaled)

# Evaluate the performance of the model
print("MSE for knn:", mse)
print("RMSE for knn: %.3f" %np.sqrt(mean_squared_error(y_test, y_pred4)))

Optimal k value: 13
MSE for knn: 916891672.222476
RMSE for knn: 29007.217


### The model that resulted in the smallest RMSE was KNN. So below you can find the loaded model and run the test on it.

In [82]:
filename = 'finalized_model.sav'
pickle.dump(knn, open(filename, 'wb'))
 
# some time later...
 
# load the model from disk
loaded_model = pickle.load(open(filename, 'rb'))
# result = loaded_model.score(X_test, Y_test)
# print(result)

In [None]:
def final_predict(final_test_df):
    1. preprocessing of final_test_df (scaling, one hot encoding ...)
    2. make sure that columns and their order in train and test are the same
    2. return predictions

In [None]:
df = pd.read_csv('houses_test.csv')
final_predict(df)