### Course project

Task
Using the data in train.csv, build
a model that predicts real estate (apartment) prices.
Use the resulting model to predict
prices for the apartments in test.csv.

Target variable:
Price

Metric
R2, the coefficient of determination (sklearn.metrics.r2_score)
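
As a quick illustration of the metric, R2 can be computed by hand as 1 - SS_res / SS_tot. The sketch below uses made-up numbers and a hypothetical helper `r2_manual` (not part of the assignment); on the same inputs it should agree with `sklearn.metrics.r2_score`.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy prices (illustrative values only, not taken from train.csv)
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

score = r2_manual(y_true, y_pred)
print(round(score, 4))  # 0.992
```

A model that always predicts the mean of `y_true` scores 0 under this definition, and a perfect model scores 1.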

Submission:
1. Send a link to the program on GitHub to the Assignments section of Lesson 10 ("Webinar. Final project consultation")
(the program must be a Jupyter Notebook file
with the .ipynb extension). (No pull request is needed, only a link that leads to the notebook itself.)
2. Attach a file named after the pattern SShirkin_predictions.csv
with the predicted prices for the apartments in test.csv (the file must contain two fields: Id, Price).
The predictions file must contain 5001 lines (header + 5000 predictions).

Deadline and submission rules
Deadline: the project must be submitted within 72 hours after the start of Lesson 10 ("Webinar. Final project consultation").
To pass, the submission must contain all predictions (for 5000 apartments) and R2 must be greater than 0.6.
A project submitted before the deadline can enter the leaderboard of best results.
Resubmission and re-checking are possible only if the previous submission failed.
A successful project cannot be resubmitted to improve the result.
Projects submitted after the deadline or resubmitted do not enter the leaderboard, but you can still find out the result.
The first successful result counts as the final one; later successful results are ignored.

Note
All csv files must contain field names (a header row),
with a comma as the separator. The files must not contain the DataFrame index.
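
A minimal sketch of a file written in this format, using a hypothetical three-row frame and an in-memory buffer instead of a real file:

```python
import io

import pandas as pd

# Hypothetical predictions for three flats (illustrative values only)
predictions = pd.DataFrame({"Id": [725, 15856, 5480],
                            "Price": [163286.3, 226749.7, 235864.4]})

# index=False keeps the DataFrame index out of the file;
# header=True (the default) writes the field names as the first line.
buffer = io.StringIO()
predictions.to_csv(buffer, sep=",", index=False)

lines = buffer.getvalue().splitlines()
print(lines[0])    # Id,Price
print(len(lines))  # 4: one header line + three predictions
```

With 5000 predictions the same call produces the required 5001 lines.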

Recommendations for the code file (ipynb)
1. The file should contain headings and comments.
2. Repeated operations are best wrapped in functions.
3. Do not print large numbers of table rows (5-10 is enough).
4. Where possible, add plots that describe the data (about 3-5).
5. Include only the best model; do not keep every attempted solution in the code.
6. The project script must run from start to finish (from loading the data to writing the predictions).
7. The whole project must be in a single script (one ipynb file).
8. When using statistics (mean, median, etc.) as features,
compute them on the training split and reuse those values on the validation and test data
rather than recomputing them. If you know how, you can use cross-validation,
but for this project it is enough to split the data from train.csv into train and valid.
9. The project must finish in a reasonable time (no more than 10 minutes),
so the final version should not include a GridSearch over
a large number of parameter combinations.
10. Python libraries and machine learning models may be used.
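
Recommendation 8 can be sketched as follows. The frames `train` and `valid` are toy stand-ins for the real splits, and the column name is borrowed from the dataset only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the train and validation splits
train = pd.DataFrame({"Healthcare_1": [100.0, np.nan, 300.0, 500.0]})
valid = pd.DataFrame({"Healthcare_1": [np.nan, 900.0]})

# Compute the statistic once, on the training split only
train_median = train["Healthcare_1"].median()

# Reuse the same value for every split instead of recomputing it
train["Healthcare_1"] = train["Healthcare_1"].fillna(train_median)
valid["Healthcare_1"] = valid["Healthcare_1"].fillna(train_median)

print(train_median)                     # 300.0
print(valid["Healthcare_1"].tolist())   # [300.0, 900.0]
```

The validation gap is filled with 300.0, the training median, not with 900.0, which a median recomputed on `valid` would give.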

In [1]:
# import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor 
from sklearn.preprocessing import MinMaxScaler, StandardScaler

from sklearn.metrics import r2_score, mean_squared_error

plt.style.use("fivethirtyeight")
warnings.filterwarnings("ignore")

%matplotlib inline
%config InlineBackend.figure_format = "svg"

In [2]:
# define path to local files

PATH_TRAIN_DATASET = 'train.csv'
PATH_TEST_DATASET = 'test.csv'

In [3]:
# import data from csv file

data = pd.read_csv(PATH_TRAIN_DATASET, sep=',', index_col=0, encoding='utf-8')

In [4]:
# preview the first few rows of the data

data.head()

Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Ecology_2,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2,Price
14038,35,2.0,47.981561,29.442751,6.0,7,9.0,1969,0.08904,B,B,33,7976,5,,0,11,B,184966.93073
15053,41,3.0,65.68364,40.049543,8.0,7,9.0,1978,7e-05,B,B,46,10309,1,240.0,1,16,B,300009.450063
4765,53,2.0,44.947953,29.197612,0.0,8,12.0,1968,0.049637,B,B,34,7759,0,229.0,1,3,B,220925.908524
5809,58,2.0,53.352981,52.731512,9.0,8,17.0,1977,0.437885,B,B,23,5735,3,1084.0,0,5,B,175616.227217
10783,99,1.0,39.649192,23.776169,7.0,11,12.0,1976,0.012339,B,B,35,5776,1,2078.0,2,4,B,150226.531644


In [5]:
# read info about our data

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 14038 to 6306
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   DistrictId     10000 non-null  int64  
 1   Rooms          10000 non-null  float64
 2   Square         10000 non-null  float64
 3   LifeSquare     7887 non-null   float64
 4   KitchenSquare  10000 non-null  float64
 5   Floor          10000 non-null  int64  
 6   HouseFloor     10000 non-null  float64
 7   HouseYear      10000 non-null  int64  
 8   Ecology_1      10000 non-null  float64
 9   Ecology_2      10000 non-null  object 
 10  Ecology_3      10000 non-null  object 
 11  Social_1       10000 non-null  int64  
 12  Social_2       10000 non-null  int64  
 13  Social_3       10000 non-null  int64  
 14  Healthcare_1   5202 non-null   float64
 15  Helthcare_2    10000 non-null  int64  
 16  Shops_1        10000 non-null  int64  
 17  Shops_2        10000 non-null  object 
 18  Pri

The output above shows that two columns have missing values:

##### Remember this (non-null counts out of 10000 rows):

LifeSquare: 7887 (2113 missing)
Healthcare_1: 5202 (4798 missing)
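
The same information can be read directly as counts of missing values rather than non-null counts. A small sketch on a toy frame (the values are illustrative, not taken from train.csv):

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only) with gaps in two columns
df = pd.DataFrame({
    "Square": [47.9, 65.6, 44.9, 53.3],
    "LifeSquare": [29.4, np.nan, 29.1, np.nan],
    "Healthcare_1": [np.nan, 240.0, 229.0, np.nan],
})

# Number of missing values per column, keeping only columns that have any
missing = df.isna().sum()
missing = missing[missing > 0]
print(missing.to_dict())  # {'LifeSquare': 2, 'Healthcare_1': 2}
```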

In [6]:
# look at summary statistics for the numeric columns

data.describe()

Unnamed: 0,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Price
count,10000.0,10000.0,10000.0,7887.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,5202.0,10000.0,10000.0,10000.0
mean,50.4008,1.8905,56.315775,37.199645,6.2733,8.5267,12.6094,3990.166,0.118858,24.687,5352.1574,8.0392,1142.90446,1.3195,4.2313,214138.857399
std,43.587592,0.839512,21.058732,86.241209,28.560917,5.241148,6.775974,200500.3,0.119025,17.532614,4006.799803,23.831875,1021.517264,1.493601,4.806341,92872.293865
min,0.0,0.0,1.136859,0.370619,0.0,1.0,0.0,1910.0,0.0,0.0,168.0,0.0,0.0,0.0,0.0,59174.778028
25%,20.0,1.0,41.774881,22.769832,1.0,4.0,9.0,1974.0,0.017647,6.0,1564.0,0.0,350.0,0.0,1.0,153872.633942
50%,36.0,2.0,52.51331,32.78126,6.0,7.0,13.0,1977.0,0.075424,25.0,5285.0,2.0,900.0,1.0,3.0,192269.644879
75%,75.0,2.0,65.900625,45.128803,9.0,12.0,17.0,2001.0,0.195781,36.0,7227.0,5.0,1548.0,2.0,6.0,249135.462171
max,209.0,19.0,641.065193,7480.592129,2014.0,42.0,117.0,20052010.0,0.521867,74.0,19083.0,141.0,4849.0,6.0,23.0,633233.46657


The maximum of HouseYear is far outside any valid range of years, so something is wrong with this column.

In [7]:
data.HouseYear.unique()

array([    1969,     1978,     1968,     1977,     1976,     2011,
           1960,     2014,     1973,     1959,     1999,     1980,
           1979,     1983,     2001,     2012,     2002,     1996,
           1964,     2018,     1972,     1965,     1984,     1961,
           1971,     1963,     2017,     1970,     1981,     2003,
           2016,     1991,     1975,     2006,     2009,     1985,
           1974,     1994,     2000,     1987,     1998,     2005,
           1990,     1982,     1997,     2015,     2008,     2010,
           2004,     2007,     1967,     1957,     1962,     1993,
           1966,     1955,     1937,     1992,     1954,     1995,
           2019,     1948,     1986,     2013,     1989,     1958,
           1938,     1956,     1988,     2020,     1951,     1952,
           1935,     1914,     1932,     1950,     1917,     1918,
           1940, 20052011,     1942,     1939,     1934,     1931,
           1919,     1912,     1953,     1936,     1947,     1

In [8]:
# replace the implausible year values with plausible ones

data.loc[data.HouseYear == 20052011, 'HouseYear'] = 2005
data.loc[data.HouseYear == 4968, 'HouseYear'] = 1968

data.HouseYear.unique()

array([1969, 1978, 1968, 1977, 1976, 2011, 1960, 2014, 1973, 1959, 1999,
       1980, 1979, 1983, 2001, 2012, 2002, 1996, 1964, 2018, 1972, 1965,
       1984, 1961, 1971, 1963, 2017, 1970, 1981, 2003, 2016, 1991, 1975,
       2006, 2009, 1985, 1974, 1994, 2000, 1987, 1998, 2005, 1990, 1982,
       1997, 2015, 2008, 2010, 2004, 2007, 1967, 1957, 1962, 1993, 1966,
       1955, 1937, 1992, 1954, 1995, 2019, 1948, 1986, 2013, 1989, 1958,
       1938, 1956, 1988, 2020, 1951, 1952, 1935, 1914, 1932, 1950, 1917,
       1918, 1940, 1942, 1939, 1934, 1931, 1919, 1912, 1953, 1936, 1947,
       1929, 1930, 1933, 1941, 1916, 1910, 1928], dtype=int64)
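
The replacement above targets two specific values. A more general check, sketched here on a toy series, flags any year outside a plausible range; the bounds 1900 and 2020 are assumptions chosen for illustration.

```python
import pandas as pd

# Toy column containing the two bad values handled in the notebook
years = pd.Series([1969, 20052011, 1978, 4968, 2020], name="HouseYear")

MIN_YEAR, MAX_YEAR = 1900, 2020  # assumed plausible bounds

# Boolean mask of rows whose year falls outside the plausible range
bad = ~years.between(MIN_YEAR, MAX_YEAR)
print(years[bad].tolist())  # [20052011, 4968]
```

A mask like this also catches bad values in test.csv that were never seen in train.csv.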

In [9]:
data.describe()

Unnamed: 0,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Price
count,10000.0,10000.0,10000.0,7887.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,5202.0,10000.0,10000.0,10000.0
mean,50.4008,1.8905,56.315775,37.199645,6.2733,8.5267,12.6094,1984.8657,0.118858,24.687,5352.1574,8.0392,1142.90446,1.3195,4.2313,214138.857399
std,43.587592,0.839512,21.058732,86.241209,28.560917,5.241148,6.775974,18.411517,0.119025,17.532614,4006.799803,23.831875,1021.517264,1.493601,4.806341,92872.293865
min,0.0,0.0,1.136859,0.370619,0.0,1.0,0.0,1910.0,0.0,0.0,168.0,0.0,0.0,0.0,0.0,59174.778028
25%,20.0,1.0,41.774881,22.769832,1.0,4.0,9.0,1974.0,0.017647,6.0,1564.0,0.0,350.0,0.0,1.0,153872.633942
50%,36.0,2.0,52.51331,32.78126,6.0,7.0,13.0,1977.0,0.075424,25.0,5285.0,2.0,900.0,1.0,3.0,192269.644879
75%,75.0,2.0,65.900625,45.128803,9.0,12.0,17.0,2001.0,0.195781,36.0,7227.0,5.0,1548.0,2.0,6.0,249135.462171
max,209.0,19.0,641.065193,7480.592129,2014.0,42.0,117.0,2020.0,0.521867,74.0,19083.0,141.0,4849.0,6.0,23.0,633233.46657


HouseYear looks plausible now: the maximum is 2020.

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 14038 to 6306
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   DistrictId     10000 non-null  int64  
 1   Rooms          10000 non-null  float64
 2   Square         10000 non-null  float64
 3   LifeSquare     7887 non-null   float64
 4   KitchenSquare  10000 non-null  float64
 5   Floor          10000 non-null  int64  
 6   HouseFloor     10000 non-null  float64
 7   HouseYear      10000 non-null  int64  
 8   Ecology_1      10000 non-null  float64
 9   Ecology_2      10000 non-null  object 
 10  Ecology_3      10000 non-null  object 
 11  Social_1       10000 non-null  int64  
 12  Social_2       10000 non-null  int64  
 13  Social_3       10000 non-null  int64  
 14  Healthcare_1   5202 non-null   float64
 15  Helthcare_2    10000 non-null  int64  
 16  Shops_1        10000 non-null  int64  
 17  Shops_2        10000 non-null  object 
 18  Pri

In [11]:
# define which columns will be used to build and train the model
# (.copy() gives an independent frame, so later fills do not act on a slice of `data`)

features = data[['DistrictId', 'Rooms', 'Square', 'KitchenSquare', 'Floor', 'HouseFloor', 'HouseYear',
               'Ecology_1', 'Social_1', 'Social_2', 'Social_3',
               'Healthcare_1', 'Helthcare_2', 'Shops_1']].copy()

target = data['Price']

In [12]:
# split the data into a training set and a hold-out (validation) set

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)

In [13]:
X_train.describe()

Unnamed: 0,DistrictId,Rooms,Square,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1
count,7000.0,7000.0,7000.0,7000.0,7000.0,7000.0,7000.0,7000.0,7000.0,7000.0,7000.0,3642.0,7000.0,7000.0
mean,50.560857,1.889286,56.307485,6.490714,8.521286,12.654571,1984.918857,0.11784,24.767571,5364.425714,8.044571,1133.876167,1.313143,4.260571
std,43.744178,0.843116,20.507466,33.999713,5.222043,6.851357,18.309534,0.118599,17.583889,4011.768297,23.869975,1018.354716,1.488326,4.825024
min,0.0,0.0,2.377248,0.0,1.0,0.0,1912.0,0.0,0.0,168.0,0.0,30.0,0.0,0.0
25%,19.0,1.0,41.74471,1.0,4.0,9.0,1974.0,0.017647,6.0,1564.0,0.0,325.0,0.0,1.0
50%,37.0,2.0,52.633656,6.0,7.0,14.0,1977.0,0.075424,25.0,5285.0,2.0,900.0,1.0,3.0
75%,75.0,2.0,65.981105,9.0,12.0,17.0,2001.0,0.194489,36.0,7227.0,5.0,1548.0,2.0,6.0
max,209.0,19.0,604.705972,2014.0,42.0,117.0,2020.0,0.521867,74.0,19083.0,141.0,4849.0,6.0,23.0


In [14]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7000 entries, 14604 to 2135
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   DistrictId     7000 non-null   int64  
 1   Rooms          7000 non-null   float64
 2   Square         7000 non-null   float64
 3   KitchenSquare  7000 non-null   float64
 4   Floor          7000 non-null   int64  
 5   HouseFloor     7000 non-null   float64
 6   HouseYear      7000 non-null   int64  
 7   Ecology_1      7000 non-null   float64
 8   Social_1       7000 non-null   int64  
 9   Social_2       7000 non-null   int64  
 10  Social_3       7000 non-null   int64  
 11  Healthcare_1   3642 non-null   float64
 12  Helthcare_2    7000 non-null   int64  
 13  Shops_1        7000 non-null   int64  
dtypes: float64(6), int64(8)
memory usage: 820.3 KB


In [15]:
# Column Healthcare_1 has missing values;
# fill them with the column median

#### Define a Python function to fill missing values


In [16]:
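def fill_Healthcare1(ds_name, col_name):
    """Fill missing values in ds_name[col_name] with that column's median."""
    # Assign back instead of calling fillna(inplace=True) on a column selection,
    # which may operate on a temporary copy and leave ds_name unchanged.
    ds_name[col_name] = ds_name[col_name].fillna(ds_name[col_name].median())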

In [17]:
fill_Healthcare1(X_train, 'Healthcare_1')
fill_Healthcare1(X_test, 'Healthcare_1')


X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7000 entries, 14604 to 2135
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   DistrictId     7000 non-null   int64  
 1   Rooms          7000 non-null   float64
 2   Square         7000 non-null   float64
 3   KitchenSquare  7000 non-null   float64
 4   Floor          7000 non-null   int64  
 5   HouseFloor     7000 non-null   float64
 6   HouseYear      7000 non-null   int64  
 7   Ecology_1      7000 non-null   float64
 8   Social_1       7000 non-null   int64  
 9   Social_2       7000 non-null   int64  
 10  Social_3       7000 non-null   int64  
 11  Healthcare_1   7000 non-null   float64
 12  Helthcare_2    7000 non-null   int64  
 13  Shops_1        7000 non-null   int64  
dtypes: float64(6), int64(8)
memory usage: 820.3 KB


#### The commented-out block below was used to find the best hyperparameters

In [18]:
# regressor = RandomForestRegressor(random_state=42, n_jobs=1)

# find good parameters for model using RandomizedSearchCV

# grid = {
#     'n_estimators': np.arange(200, 501, 20),
#     'max_depth': np.arange(2, 51, 2),
#     'max_features': [0.5, 0.6, 0.7, 0.8, 0.9],
#     'min_samples_leaf': [1, 2, 4],
#     'min_samples_split': [2, 5, 10]
# }

# search = RandomizedSearchCV(
#     estimator = regressor,
#     param_distributions = grid,
#     n_iter = 50,
#     scoring = 'r2',
#     cv = 10,
#     verbose = 2,
#     random_state = 42,
#     n_jobs = -1
# )

# search.fit(X_train, y_train)

# print(search.best_score_)
# print(search.best_params_)


# best_score_: 0.7393137762152002
# best_params: {'n_estimators': 340, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 0.5, 'max_depth': 42}

# y_pred_train = search.best_estimator_.predict(X_train)
# y_pred_test = search.best_estimator_.predict(X_test)

In [19]:
# standardize the features with StandardScaler (fitted on the training split only)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# prepare model
regressor = RandomForestRegressor(
    n_estimators = 340,
    max_depth = 42,
    max_features = 0.5,
    min_samples_leaf = 1,
    min_samples_split = 5,
    random_state = 42,
    n_jobs = -1
)

regressor.fit(X_train, y_train)
# predictions
y_pred_train = regressor.predict(X_train)
y_pred_test = regressor.predict(X_test)

#### Compute the R2 score

In [20]:
# measure the quality of the predictions with r2_score

r2_train = r2_score(y_train, y_pred_train)
r2_test  = r2_score(y_test, y_pred_test)

print("R2 train: ", r2_train)
print("R2 test: ", r2_test)

R2 train:  0.9413813094606348
R2 test:  0.7310905294155272
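
A train R2 well above the hold-out R2 suggests some overfitting, and a single split gives only one estimate of the hold-out score. Cross-validation, mentioned in the recommendations, averages over several splits. The sketch below runs on synthetic data from `make_regression`; the sample size, number of trees, and fold count are illustrative assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the prepared features and Price
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

model = RandomForestRegressor(n_estimators=50, random_state=42)

# 5-fold cross-validated R2: one score per held-out fold
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```

The spread of the fold scores gives a rough idea of how much a single hold-out R2 could vary.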


#### The model is selected; now train it on all of the training data

In [21]:
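# prepare data: fill the gaps, then rescale
fill_Healthcare1(features, 'Healthcare_1')

# note: validation above used StandardScaler, this final fit uses MinMaxScaler;
# the same fitted `scaler` is applied to test.csv below
scaler = MinMaxScaler()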

features_scaled = scaler.fit_transform(features)

# fit our model
regressor.fit(features_scaled, target)


RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=42, max_features=0.5, max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=5, min_weight_fraction_leaf=0.0,
                      n_estimators=340, n_jobs=-1, oob_score=False,
                      random_state=42, verbose=0, warm_start=False)

In [22]:
# Load test data for prediction

test_data = pd.read_csv(PATH_TEST_DATASET, sep=',', index_col=0, encoding='utf-8')

test_data.head()

Id,DistrictId,Rooms,Square,LifeSquare,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Ecology_2,Ecology_3,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1,Shops_2
725,58,2.0,49.882643,33.432782,6.0,6,14.0,1972,0.310199,B,B,11,2748,1,,0,0,B
15856,74,2.0,69.263183,,1.0,6,1.0,1977,0.075779,B,B,6,1437,3,,0,2,B
5480,190,1.0,13.597819,15.948246,12.0,2,5.0,1909,0.0,B,B,30,7538,87,4702.0,5,5,B
15664,47,2.0,73.046609,51.940842,9.0,22,22.0,2007,0.101872,B,B,23,4583,3,,3,3,B
14275,27,1.0,47.527111,43.387569,1.0,17,17.0,2017,0.072158,B,B,2,629,1,,0,0,A


In [23]:
# check HouseYear values

test_data.HouseYear.unique()

array([1972, 1977, 1909, 2007, 2017, 1997, 2014, 1981, 1971, 1968, 1974,
       1959, 1976, 2015, 2004, 2000, 1970, 1964, 1975, 1988, 1963, 1987,
       1933, 1962, 1969, 1984, 1980, 1929, 1990, 1960, 2016, 1954, 1996,
       2019, 1993, 1911, 1985, 1982, 1966, 1978, 2003, 1983, 1973, 2018,
       2013, 2010, 1957, 1958, 1965, 2008, 1986, 1979, 2012, 1995, 1999,
       1989, 1992, 2009, 1956, 2005, 1998, 1940, 2002, 1991, 1967, 1994,
       2020, 1955, 1961, 2006, 2011, 1926, 2001, 1934, 1917, 1931, 1953,
       1943, 1941, 1930, 1912, 1935, 1927, 1937, 1918, 1950, 1952, 1910,
       1939, 1914, 1908, 1938, 1928, 1932, 1948, 1949, 1920], dtype=int64)

In [24]:
# select the same feature columns that were used for training
# (.copy() so that filling missing values below does not act on a slice of test_data)
test_features = test_data[['DistrictId', 'Rooms', 'Square', 'KitchenSquare', 'Floor', 'HouseFloor', 'HouseYear',
               'Ecology_1', 'Social_1', 'Social_2', 'Social_3',
               'Healthcare_1', 'Helthcare_2', 'Shops_1']].copy()

In [25]:
test_features.describe()

Unnamed: 0,DistrictId,Rooms,Square,KitchenSquare,Floor,HouseFloor,HouseYear,Ecology_1,Social_1,Social_2,Social_3,Healthcare_1,Helthcare_2,Shops_1
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,2623.0,5000.0,5000.0
mean,51.2792,1.91,56.4495,5.9768,8.632,12.601,1984.3926,0.119874,24.9338,5406.9,8.2626,1146.657263,1.3194,4.2428
std,44.179466,0.838594,19.092787,9.950018,5.483228,6.789213,18.573149,0.12007,17.532202,4026.614773,23.863762,1044.744231,1.47994,4.777365
min,0.0,0.0,1.378543,0.0,1.0,0.0,1908.0,0.0,0.0,168.0,0.0,0.0,0.0,0.0
25%,21.0,1.0,41.906231,1.0,4.0,9.0,1973.0,0.019509,6.0,1564.0,0.0,325.0,0.0,1.0
50%,37.0,2.0,52.92134,6.0,7.0,12.0,1977.0,0.072158,25.0,5285.0,2.0,900.0,1.0,3.0
75%,77.0,2.0,66.285129,9.0,12.0,17.0,2000.0,0.195781,36.0,7287.0,5.0,1548.0,2.0,6.0
max,212.0,17.0,223.453689,620.0,78.0,99.0,2020.0,0.521867,74.0,19083.0,141.0,4849.0,6.0,23.0


In [26]:
test_features.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 725 to 12504
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   DistrictId     5000 non-null   int64  
 1   Rooms          5000 non-null   float64
 2   Square         5000 non-null   float64
 3   KitchenSquare  5000 non-null   float64
 4   Floor          5000 non-null   int64  
 5   HouseFloor     5000 non-null   float64
 6   HouseYear      5000 non-null   int64  
 7   Ecology_1      5000 non-null   float64
 8   Social_1       5000 non-null   int64  
 9   Social_2       5000 non-null   int64  
 10  Social_3       5000 non-null   int64  
 11  Healthcare_1   2623 non-null   float64
 12  Helthcare_2    5000 non-null   int64  
 13  Shops_1        5000 non-null   int64  
dtypes: float64(6), int64(8)
memory usage: 585.9 KB


In [27]:
# fill missing values in column Healthcare_1

fill_Healthcare1(test_features, 'Healthcare_1')


In [28]:
# rescale the test features with the scaler fitted on the training data (MinMaxScaler)

test_featured_scalled = scaler.transform(test_features)

In [29]:
test_features.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 725 to 12504
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   DistrictId     5000 non-null   int64  
 1   Rooms          5000 non-null   float64
 2   Square         5000 non-null   float64
 3   KitchenSquare  5000 non-null   float64
 4   Floor          5000 non-null   int64  
 5   HouseFloor     5000 non-null   float64
 6   HouseYear      5000 non-null   int64  
 7   Ecology_1      5000 non-null   float64
 8   Social_1       5000 non-null   int64  
 9   Social_2       5000 non-null   int64  
 10  Social_3       5000 non-null   int64  
 11  Healthcare_1   5000 non-null   float64
 12  Helthcare_2    5000 non-null   int64  
 13  Shops_1        5000 non-null   int64  
dtypes: float64(6), int64(8)
memory usage: 585.9 KB


#### Make final Price prediction

In [30]:
y_pred_test_prices = regressor.predict(test_featured_scalled)

In [31]:
# collect the Id and the predicted Price in a DataFrame

price_predictions = pd.DataFrame({
    'Id': test_features.index,
    'Price': y_pred_test_prices
})

In [32]:
price_predictions.head()

Unnamed: 0,Id,Price
0,725,163286.304483
1,15856,226749.719677
2,5480,235864.365739
3,15664,336627.165769
4,14275,143637.439089


In [33]:
price_predictions.shape

(5000, 2)
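
Beyond the shape, a few more properties of the predictions frame can be checked before it is written to disk. The helper `validate_predictions` below is hypothetical and only sketches the checks implied by the submission rules (two fields Id and Price, one row per apartment, no gaps); the uniqueness check on Id is an added assumption.

```python
import pandas as pd


def validate_predictions(df, expected_rows):
    """Return a list of problems found in a predictions frame (empty if none)."""
    problems = []
    if list(df.columns) != ["Id", "Price"]:
        problems.append("columns must be exactly ['Id', 'Price']")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(df)}")
    if df.isna().any().any():
        problems.append("missing values present")
    if "Id" in df.columns and df["Id"].duplicated().any():
        problems.append("duplicate Id values")
    return problems


# Toy frame (illustrative values only); the real file needs 5000 rows
toy = pd.DataFrame({"Id": [725, 15856], "Price": [163286.3, 226749.7]})
print(validate_predictions(toy, expected_rows=2))  # []
```

For the real submission the call would be `validate_predictions(price_predictions, expected_rows=5000)`.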

#### Write final result to csv file

In [34]:
price_predictions.to_csv('ATokmakov_predictions.csv', sep=',', index=False, encoding='utf-8')