## <center>Predictive modelling</center>

### Content

 1. Data description
 2. Data Preprocessing
 3. Data Modelling
 4. Conclusion

## 1. Data description

The dataset was scraped from https://krisha.kz/ by parsing the first 120 apartments (6 pages) listed for sale in the city of Atyrau. The following fields were scraped and used for analysis:
+ "Название" - title of the apartment listing
+ "Цена" - price of the apartment
+ "Город" - city of the apartment
+ "Автор" - author of the advertisement
+ "Телефон" - mobile phone number of the author of the advertisement
+ "Тип дома" - building (construction) type of the apartment
+ "Год постройки" - year the building was constructed
+ "Количество комнат" - number of rooms in the apartment
+ "Площадь" - area of the apartment in m²

In [1]:
# import libraries
import pandas as pd
import numpy as np
from numpy import isnan
from numpy import nan
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn import linear_model
from sklearn.model_selection import train_test_split

In [2]:
# import dataset
df = pd.read_excel('KrishaAtyrauApps.xlsx')

df.head()

Unnamed: 0.1,Unnamed: 0,Название,Цена,Город,Автор,Телефон,Тип дома,Год постройки,Количество комнат,Площадь
0,0,"1-комнатная квартира, 43.46 м², Абылхаир хана ...",15646500,"Атырау, Атырауская обл.",[],[],монолитный,2021,1,43
1,1,"4-комнатная квартира, 156.2 м², 2-ая Береговая...",49984000,"Атырау, Атырауская обл.",[],[],монолитный,2020,4,156
2,2,"1-комнатная квартира, 56.9 м², Авангард 2 микр...",15932000,"Атырау, Атырауская обл.",[],[],монолитный,2020,1,56
3,3,"3-комнатная квартира, 113.9 м², 8/9 этаж, улиц...",38000000,"Атырау, Атырауская обл.",Хозяин недвижимости,+7 701 588 5651,монолитный,2006,3,113
4,4,"2-комнатная квартира, 82.6 м², 1/8 этаж, мкр Н...",27000000,"Атырау, мкр Нурсая",Хозяин недвижимости,+7 702 311 5951,,2014,2,82


In [3]:
df.dtypes

Unnamed: 0            int64
Название             object
Цена                  int64
Город                object
Автор                object
Телефон              object
Тип дома             object
Год постройки         int64
Количество комнат     int64
Площадь               int64
dtype: object

## 2. Data Preprocessing
At this step the dataset is more or less ready, but I still need to do the following:
+ Handle missing values
+ Encode categorical variables
+ Transform the data: standardization

In [3]:
# drop columns that are not used further
df1 = df.drop(['Unnamed: 0','Название', 'Город', 'Автор', 'Телефон'], axis=1)

### 2.1. Handling missing values

The number of missing values per column is shown below.

In [4]:
# total missing values
print(df1.isnull().sum())

Цена                  0
Тип дома             11
Год постройки         0
Количество комнат     0
Площадь               0
dtype: int64


In the previous homework, listwise deletion was identified as the best option for handling missing values in this dataset, so the 11 rows with a missing "Тип дома" value are dropped.

In [5]:
# Listwise Deletion or Complete Case
df1.dropna(subset=["Тип дома"], how='any', inplace=True)
df1.isnull().sum()

Цена                 0
Тип дома             0
Год постройки        0
Количество комнат    0
Площадь              0
dtype: int64
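
As a minimal sketch of what listwise deletion does, the toy frame below uses made-up values and hypothetical English column names (`price`, `building_type`, `rooms`), not the scraped data. Every row that lacks a value in the chosen column is removed as a whole.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names are illustrative only
toy = pd.DataFrame({
    "price": [100, 200, 300, 400],
    "building_type": ["brick", np.nan, "panel", np.nan],
    "rooms": [1, 2, 3, 4],
})

# Count missing values per column before deletion
missing_before = toy.isnull().sum()

# Listwise deletion: drop every row that lacks a building type
complete = toy.dropna(subset=["building_type"])

print(missing_before["building_type"])
print(len(toy), "->", len(complete))
```

Returning a new frame instead of using `inplace=True` keeps the original rows available for comparison.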

### 2.2 Encoding categorical variables: Ordinal encoding

One of the columns in the dataset gives the building type of the listed apartments. Because the 'Тип дома' column is a categorical variable with more than two categories, I use ordinal encoding to turn it into numbers with sklearn's OrdinalEncoder. The resulting integer codes follow the sorted order of the category labels, so a linear model treats the categories as if they were ordered.

In [6]:
# Ordinal encoding
from sklearn.preprocessing import OrdinalEncoder

# Create the ordinal encoder
apps_ord_enc = OrdinalEncoder()
# Select the non-null values of 'Тип дома'
apps = df1['Тип дома']
apps_not_null = apps[apps.notnull()]
# The encoder expects a 2D array, so reshape to a single column
reshaped_vals = apps_not_null.values.reshape(-1, 1)
# Encode the non-null values of 'Тип дома'
encoded_vals = apps_ord_enc.fit_transform(reshaped_vals)
# Replace the 'Тип дома' column with the ordinal codes
df1.loc[apps.notnull(), 'Тип дома'] = np.squeeze(encoded_vals)

df1.head()

Unnamed: 0,Цена,Тип дома,Год постройки,Количество комнат,Площадь
0,15646500,2.0,2021,1,43
1,49984000,2.0,2020,4,156
2,15932000,2.0,2020,1,56
3,38000000,2.0,2006,3,113
5,8000000,1.0,2021,1,53
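
For comparison, here is a small sketch of the two encodings on a toy column. The category labels are made up, and `pd.get_dummies` is used as one possible way to one-hot encode; it is not what the notebook does. Ordinal encoding yields a single numeric column, while one-hot encoding yields one indicator column per category with no implied order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column with made-up category labels
toy = pd.DataFrame({"building_type": ["brick", "panel", "monolithic", "panel"]})

# Ordinal encoding: one integer code per category, in sorted label order
ordinal = OrdinalEncoder().fit_transform(toy[["building_type"]])

# One-hot encoding: one indicator column per category, no implied order
one_hot = pd.get_dummies(toy["building_type"], prefix="type")

print(ordinal.ravel())
print(list(one_hot.columns))
```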


### 2.3. Data transformation: standardization

In [8]:
# descriptive statistics
df1.describe()

Unnamed: 0,Цена,Тип дома,Год постройки,Количество комнат,Площадь
count,109.0,109.0,109.0,109.0,109.0
mean,25212260.0,1.761468,2003.669725,2.752294,86.302752
std,13961210.0,0.848736,16.086027,0.914451,43.237395
min,8000000.0,0.0,1966.0,1.0,40.0
25%,14000000.0,1.0,1989.0,2.0,59.0
50%,23000000.0,2.0,2006.0,3.0,79.0
75%,30000000.0,2.0,2019.0,3.0,100.0
max,75000000.0,3.0,2021.0,5.0,330.0


Since the "Год постройки", "Количество комнат" and "Площадь" columns are all on different scales, I standardize them, together with the encoded "Тип дома" column, so that they can be used in a linear model.

In [9]:
# standardization: rescale each feature to zero mean and unit variance
df1[["Год постройки","Количество комнат", "Площадь","Тип дома"]] = StandardScaler().fit_transform(df1[["Год постройки","Количество комнат","Площадь","Тип дома"]])

df1.head()

Unnamed: 0,Цена,Тип дома,Год постройки,Количество комнат,Площадь
0,15646500,0.282342,1.082326,-1.925075,-1.006138
1,49984000,0.282342,1.019873,1.370734,1.619412
2,15932000,0.282342,1.019873,-1.925075,-0.704083
3,38000000,0.282342,0.145532,0.272131,0.620309
5,8000000,-0.901322,1.082326,-1.925075,-0.773788
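
The standardization can be checked by hand. This is a sketch on toy values, followed by a check that uses only the summary statistics reported above for "Площадь". `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas `describe()` reports the sample standard deviation (`ddof=1`), so the reported value has to be rescaled first.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values, not taken from the dataset
x = np.array([[40.0], [59.0], [79.0], [100.0], [330.0]])

# StandardScaler computes z = (x - mean) / std with the population std (ddof=0)
scaled = StandardScaler().fit_transform(x)
manual = (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)
print(np.allclose(scaled, manual))

# Reproduce one standardized value from the reported summary statistics:
# describe() gives the sample std (ddof=1), so convert it to the population std
n, mean, sample_std = 109, 86.302752, 43.237395
population_std = sample_std * np.sqrt((n - 1) / n)
z_area_43 = (43 - mean) / population_std
print(round(z_area_43, 3))
```

The second value comes out at about -1.006, which agrees with the standardized "Площадь" of the first row.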


## 3. Data Modelling
I am going to fit a linear regression model that predicts the price of apartments for sale and evaluate its results on a held-out test set.

### 3.1. Linear Regression Model

In [8]:
df_f = df1.copy()

X = df_f[["Год постройки","Количество комнат","Площадь", "Тип дома"]]
y = df_f['Цена'] 

Below I apply principal component analysis (PCA) for feature extraction, reducing the four standardized features to three principal components.

In [10]:
from sklearn.decomposition import PCA

# calculating the principal components
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)
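
One way to judge whether three components are enough is to inspect `explained_variance_ratio_` on the fitted PCA object. The sketch below uses synthetic data as a stand-in for the four standardized features, so the printed numbers say nothing about the apartment data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for four standardized features (not the real data)
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X_toy = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + 0.1 * rng.normal(size=100),  # nearly duplicates feature 0
    rng.normal(size=100),
])
X_toy = StandardScaler().fit_transform(X_toy)

pca_toy = PCA(n_components=3)
X_toy_pca = pca_toy.fit_transform(X_toy)

# Share of total variance carried by each retained component
print(pca_toy.explained_variance_ratio_)
# Share retained by all three components together
print(pca_toy.explained_variance_ratio_.sum())
```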

In [11]:
# Split the data into training/testing sets
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state = 0)

In [12]:
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)

# Make predictions using the testing set
y_pred = regr.predict(X_test)

print(regr.intercept_)

# The coefficients
print(regr.coef_)

25591680.634215143
[ 9011170.41625777   283347.52835997 -2404470.78414975]


In [13]:
# Evaluate the model against the test data using RMSE and R2
print('Root Mean Squared Error:', round(np.sqrt(metrics.mean_squared_error(y_test, y_pred))))
print('R2:', round(metrics.r2_score(y_test, y_pred), 3))


Root Mean Squared Error: 6662634.0
R2: 0.716
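
To make the two metrics concrete, here is a sketch that computes RMSE and R2 by hand on made-up values and compares them with the `sklearn.metrics` functions used above. The arrays are illustrative and unrelated to the apartment prices.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted values for illustration
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

# RMSE: square root of the mean squared residual
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Both should agree with the library implementations
print(np.isclose(rmse, np.sqrt(metrics.mean_squared_error(y_true, y_hat))))
print(np.isclose(r2, metrics.r2_score(y_true, y_hat)))
```

RMSE is expressed in the units of the target, while R2 is the share of the target's variance that the predictions account for.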


In [14]:
df_s = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df_s

Unnamed: 0,Actual,Predicted
91,32000000,25421390.0
11,18000000,34113300.0
82,23500000,31711300.0
2,15932000,16383180.0
25,10608000,24674050.0
110,20000000,18571250.0
119,30000000,34684150.0
8,23828000,34113300.0
17,22500000,20015070.0
93,14000000,14125740.0
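
In the cells above, StandardScaler and PCA are fitted on all rows before the train/test split, so the test rows influence the transformation. One possible variation, sketched here on synthetic data from `make_regression` rather than the apartment data, is to wrap the steps in a scikit-learn pipeline so that scaling and PCA are fitted on the training rows only. The variable names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the four features and the price
X_syn, y_syn = make_regression(n_samples=109, n_features=4, noise=10.0, random_state=0)

# Split first, so the test rows stay unseen during fitting
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

# Scaling, PCA and regression are all fitted on the training rows only
model = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
model.fit(X_tr, y_tr)

# R2 on the held-out rows
r2_test = model.score(X_te, y_te)
print(round(r2_test, 3))
```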


## 4. Conclusion

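Overall, the linear regression model built on three principal components explains about 72% of the variance in price on the test set (R2 = 0.716), with a root mean squared error of about 6.66 million in the units of "Цена".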