# Course project

### Task
Build and train a model that predicts real estate prices.

Metric:<br>
- <b>R2</b> - coefficient of determination (sklearn.metrics.r2_score)
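
To make the metric concrete, here is a minimal sketch of how the coefficient of determination can be computed by hand. The helper name `r2_manual` and the sample prices are made up for illustration; the notebook itself relies on `sklearn.metrics.r2_score`.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

prices = np.array([100.0, 150.0, 200.0, 250.0])  # hypothetical prices

perfect = r2_manual(prices, prices)                                 # exact predictions
baseline = r2_manual(prices, np.full_like(prices, prices.mean()))   # always predict the mean
reversed_order = r2_manual(prices, prices[::-1])                    # worse than the mean
```

A perfect model scores 1, predicting the mean everywhere scores 0, and a model that does worse than the mean gets a negative score.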

**Work plan**
* [Loading the data](#load)
* [1. EDA](#eda)
* [2. Handling outliers](#outlier)
* [3. Handling missing values](#nan)
* [4. Building new features](#feature)
* [5. Feature selection](#feature_selection)
* [6. Splitting into train and test](#split)
* [7. Building the models](#modeling)
* [7.1 RandomForestRegressor](#forest)
* [7.2 LinearRegression](#linear)
* [7.3 CatBoostRegressor](#cbt)
* [8. Prediction on the test dataset](#prediction)

**Loading libraries and scripts**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**Importing the required libraries, classes and functions**

In [None]:
import numpy as np
import pandas as pd
import random

from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.metrics import r2_score as r2

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import StackingRegressor, VotingRegressor, BaggingRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

from datetime import datetime
from decimal import Decimal

import warnings

<b>Library configuration</b>

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
warnings.filterwarnings('ignore')
matplotlib.rcParams.update({'font.size': 14})
pd.set_option('display.max_columns', None)
sns.set_style('darkgrid')

<b>Function for evaluating model quality</b>

In [None]:
def evaluate_preds(train_true_values, train_pred_values, test_true_values, test_pred_values):
    print("Train R2:\t" + str(round(r2(train_true_values, train_pred_values), 3)))
    print("Test R2:\t" + str(round(r2(test_true_values, test_pred_values), 3)))
    
    plt.figure(figsize=(18,10))
    
    plt.subplot(121)
    sns.scatterplot(x=train_pred_values, y=train_true_values)
    plt.xlabel('Predicted values')
    plt.ylabel('True values')
    plt.title('Train sample prediction')
    
    plt.subplot(122)
    sns.scatterplot(x=test_pred_values, y=test_true_values)
    plt.xlabel('Predicted values')
    plt.ylabel('True values')
    plt.title('Test sample prediction')

    plt.show()

**Paths to the data directories and files**

In [None]:
TRAIN_DATASET_PATH = '../input/real-estate-price-prediction-moscow/train.csv'
TEST_DATASET_PATH = '../input/real-estate-price-prediction-moscow/test.csv'

### Loading the data <a class='anchor' id='load'>

**Dataset description**

* **Id** - apartment identifier
* **DistrictId** - district identifier
* **Rooms** - number of rooms
* **Square** - total area
* **LifeSquare** - living area
* **KitchenSquare** - kitchen area
* **Floor** - floor
* **HouseFloor** - number of floors in the building
* **HouseYear** - year the building was constructed
* **Ecology_1, Ecology_2, Ecology_3** - environmental indicators of the area
* **Social_1, Social_2, Social_3** - social indicators of the area
* **Healthcare_1, Helthcare_2** - healthcare-related indicators of the area
* **Shops_1, Shops_2** - indicators related to the presence of shops and shopping centers
* **Price** - apartment price

In [None]:
df_train = pd.read_csv(TRAIN_DATASET_PATH)
df_train.tail()

In [None]:
df_train.dtypes

In [None]:
df_test = pd.read_csv(TEST_DATASET_PATH)
df_test.tail()

In [None]:
print('Rows in train:', df_train.shape[0])
print('Rows in test:', df_test.shape[0])

In [None]:
df_train.shape[1] - 1 == df_test.shape[1]

### Type conversion

In [None]:
df_train.dtypes

In [None]:
df_train['Id'] = df_train['Id'].astype(str)
df_train['DistrictId'] = df_train['DistrictId'].astype(str)

## 1. EDA  <a class='anchor' id='eda'>
We run EDA in order to:
- Fix outliers
- Fill in NaN values
- Get ideas for generating new features

**Target variable**

In [None]:
target_name = ['Price']
def target_plt(df, targ_name):
    plt.figure(figsize = (10, 8))

    df[targ_name].hist(bins=30)
    plt.ylabel('Count')
    plt.xlabel(targ_name)

    plt.title('Target distribution')
    plt.show()
    
target_plt(df_train, 'Price')

Visualizing the numeric features

In [None]:
num_feat = list(df_train.select_dtypes(exclude='object').columns)
df_train[num_feat].hist(
    figsize=(14,14)
)
plt.show()

**Quantitative variables**

In [None]:
df_train.describe()

**Nominal variables**

In [None]:
df_train.select_dtypes(include='object').columns.tolist()

In [None]:
df_train['DistrictId'].value_counts()

In [None]:
df_train['Ecology_2'].value_counts()

In [None]:
df_train['Ecology_3'].value_counts()

In [None]:
df_train['Shops_2'].value_counts()

### 2. Handling outliers  <a class='anchor' id='outlier'>
What can be done with them?
1. Drop the affected rows (only in train; nothing is dropped from test)
2. Replace the outliers using various methods (medians, means, np.clip, etc.)
3. Add or skip an extra indicator feature
4. Do nothing
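
The first three options above can be sketched on a toy column. The data and the bounds `LOWER` and `UPPER` are assumptions chosen for illustration only; they are not the thresholds used later in this notebook.

```python
import pandas as pd

# Hypothetical kitchen areas with two implausible values (0 and 2000)
toy = pd.DataFrame({'KitchenSquare': [0.0, 5.0, 6.0, 8.0, 9.0, 10.0, 2000.0]})
LOWER, UPPER = 3, 30  # assumed plausible range, illustration only

is_outlier = (toy['KitchenSquare'] < LOWER) | (toy['KitchenSquare'] > UPPER)

# Option 3: keep an indicator feature so the model knows the value was changed
toy['KitchenSquare_outlier'] = is_outlier.astype(int)

# Option 2a: clip the values into the plausible range
toy['KitchenSquare_clipped'] = toy['KitchenSquare'].clip(lower=LOWER, upper=UPPER)

# Option 2b: replace the outliers with the column median
median = toy['KitchenSquare'].median()
toy['KitchenSquare_median'] = toy['KitchenSquare'].where(~is_outlier, median)

# Option 1: drop the rows (train only, never test)
dropped = toy.loc[~is_outlier]
```

Computing the indicator before changing the values matters: once the column is clipped or imputed, the original outliers can no longer be told apart.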

**Rooms**

In [None]:
df_train['Rooms'].value_counts()

In [None]:
df_train['Rooms_outlier'] = 0
df_train.loc[(df_train['Rooms'] == 0) | (df_train['Rooms'] >= 6), 'Rooms_outlier'] = 1
df_train.head()

In [None]:
df_train.loc[df_train['Rooms'] == 0, 'Rooms'] = 1
df_train.loc[df_train['Rooms'] >= 6, 'Rooms'] = df_train['Rooms'].median()

In [None]:
df_train['Rooms'].value_counts()

**KitchenSquare** 

In [None]:
df_train['KitchenSquare'].value_counts()

In [None]:
df_train['KitchenSquare'].quantile(.975), df_train['KitchenSquare'].quantile(.025)

In [None]:
condition = (df_train['KitchenSquare'].isna()) \
             | (df_train['KitchenSquare'] > df_train['KitchenSquare'].quantile(.975))
        
df_train.loc[condition, 'KitchenSquare'] = df_train['KitchenSquare'].median()

df_train.loc[df_train['KitchenSquare'] < 3, 'KitchenSquare'] = 3

In [None]:
df_train['KitchenSquare'].value_counts()

**HouseFloor, Floor**

In [None]:
df_train['HouseFloor'].sort_values().unique()

In [None]:
df_train['Floor'].sort_values().unique()

In [None]:
(df_train['Floor'] > df_train['HouseFloor']).sum()

In [None]:
df_train['HouseFloor_outlier'] = 0
df_train.loc[df_train['HouseFloor'] == 0, 'HouseFloor_outlier'] = 1
df_train.loc[df_train['Floor'] > df_train['HouseFloor'], 'HouseFloor_outlier'] = 1

In [None]:
df_train.loc[df_train['HouseFloor'] == 0, 'HouseFloor'] = df_train['HouseFloor'].median()

In [None]:
floor_outliers = df_train.loc[df_train['Floor'] > df_train['HouseFloor']].index
floor_outliers

In [None]:
# HouseFloor is stored as float; random.randint needs integer bounds
df_train.loc[floor_outliers, 'Floor'] = df_train.loc[floor_outliers, 'HouseFloor']\
                                                .apply(lambda x: random.randint(1, int(x)))

In [None]:
(df_train['Floor'] > df_train['HouseFloor']).sum()

**HouseYear**

In [None]:
df_train['HouseYear'].sort_values(ascending=False)

In [None]:
df_train.loc[df_train['HouseYear'] > 2020, 'HouseYear'] = 2020

In [None]:
df_train['HouseYear'].sort_values(ascending=False)

### 3. Handling missing values  <a class='anchor' id='nan'>

In [None]:
df_train.isna().sum()

In [None]:
df_train[['LifeSquare', 'Healthcare_1']].head(10)

**LifeSquare**

In [None]:
df_train['LifeSquare_nan'] = df_train['LifeSquare'].isna() * 1

In [None]:
df_train.head(10)

In [None]:
condition = (df_train['LifeSquare'].isna()) \
             & (~df_train['Square'].isna()) \
             & (~df_train['KitchenSquare'].isna())

condition

In [None]:
# Estimate the missing living area as 80% of the total area
df_train.loc[condition, 'LifeSquare'] = df_train.loc[condition, 'Square'] * 0.8

In [None]:
df_train.head()

**Healthcare_1**

In [None]:
#df_train.drop('Healthcare_1', axis=1, inplace=True)

In [None]:
health_per_district = df_train.groupby('DistrictId', as_index=False)\
            .agg({'Healthcare_1': 'mean'})\
            .rename(columns={'Healthcare_1': 'AverageHealthcare_1'})


In [None]:
health_per_district

In [None]:
df_train = df_train.merge(health_per_district, on=["DistrictId"], how='left')

In [None]:
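# Assign the result back: fillna(..., inplace=True) on a selected column is unreliable in recent pandas
df_train['AverageHealthcare_1'] = df_train['AverageHealthcare_1'].fillna(df_train['AverageHealthcare_1'].median())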

In [None]:
df_train[df_train['AverageHealthcare_1'].notna()]

In [None]:
df_train.head(40)

In [None]:
# DistrictId was cast to str above, so it has to be compared with a string
df_train[(df_train['DistrictId'] == '8') & (df_train['Healthcare_1'].isna())]

In [None]:
ids = df_train['DistrictId'].sort_values().unique()

In [None]:
pd.DataFrame(ids, columns=['id'])

In [None]:
#df_train.loc(df_train )

In [None]:
df_train[(df_train['Healthcare_1'] == 2857.0) ]

In [None]:
#df_train['Healthcare_1'].sort_values().unique()
# DistrictId was cast to str earlier, so the comparison value must be a string too
df_train[df_train['DistrictId'] == '0']

# temp_id = []
# for i in df_train['DistrictId'].sort_values().unique():
#     iid = df_train[(df_train['DistrictId'] == i) & (df_train['Healthcare_1'].notna())]['Healthcare_1'].head(1).values
#     if iid.size > 0:
#         with_nan = df_train[(df_train['DistrictId'] == i) & (df_train['Healthcare_1'].isna())]['Healthcare_1'].head(1).values
#     print(f'{i} : {iid} {with_nan}')
    

In [None]:
#df_train['Social_2'].sort_values().unique()

In [None]:
#df_train.info()

In [None]:
df_train = df_train.drop(columns='Healthcare_1')

In [None]:
class DataPreprocessing:
    """Preparation of the raw data"""

    def __init__(self):
        """Class parameters"""
        self.medians = None
        self.kitchen_square_quantile = None
        self.kitchen_square_default = 3
        
    def fit(self, X):
        """Save the statistics"""       
        # Medians of the numeric columns (non-numeric columns such as Ecology_2 are skipped)
        self.medians = X.median(numeric_only=True)
        self.kitchen_square_quantile = X['KitchenSquare'].quantile(.975)
    
    def transform(self, X):
        """Transform the data using the statistics saved in fit"""
        # Work on a copy so the caller's frame (often a slice) is not modified in place
        X = X.copy()
        
        # HouseFloor, Floor
        X['HouseFloor_outlier'] = 0
        X.loc[X['HouseFloor'] == 0, 'HouseFloor_outlier'] = 1
        X.loc[X['Floor'] > X['HouseFloor'], 'HouseFloor_outlier'] = 1
        
        X.loc[X['HouseFloor'] == 0, 'HouseFloor'] = self.medians['HouseFloor']
        
        floor_outliers = X.loc[X['Floor'] > X['HouseFloor']].index
        # HouseFloor is stored as float; random.randint needs integer bounds
        X.loc[floor_outliers, 'Floor'] = X.loc[floor_outliers, 'HouseFloor']\
                                            .apply(lambda x: random.randint(1, int(x)))
        
        # Rooms
        X['Rooms_outlier'] = 0
        X.loc[(X['Rooms'] == 0) | (X['Rooms'] >= 6), 'Rooms_outlier'] = 1
        
        X.loc[X['Rooms'] == 0, 'Rooms'] = 1
        X.loc[X['Rooms'] >= 6, 'Rooms'] = self.medians['Rooms']
        
        # KitchenSquare
        condition = (X['KitchenSquare'].isna()) \
                    | (X['KitchenSquare'] > self.kitchen_square_quantile)
        
        X.loc[condition, 'KitchenSquare'] = self.medians['KitchenSquare']

        X.loc[X['KitchenSquare'] < self.kitchen_square_default, 'KitchenSquare'] = self.kitchen_square_default
        

        
        # HouseYear
        current_year = datetime.now().year
        
        X['HouseYear_outlier'] = 0
        X.loc[X['HouseYear'] > current_year, 'HouseYear_outlier'] = 1
        
        X.loc[X['HouseYear'] > current_year, 'HouseYear'] = current_year
        
        # Healthcare_1
        # Note: the district means are computed from the frame being transformed,
        # and merge() returns a frame with a fresh RangeIndex (row order is kept)
        if 'Healthcare_1' in X.columns:
            health_per_district = X.groupby('DistrictId', as_index=False)\
            .agg({'Healthcare_1': 'mean'})\
            .rename(columns={'Healthcare_1': 'AverageHealthcare_1'})
            X = X.merge(health_per_district, on=["DistrictId"], how='left')
            X['AverageHealthcare_1'] = X['AverageHealthcare_1'].fillna(X['AverageHealthcare_1'].median())
            
            
            
        #    X.drop('Healthcare_1', axis=1, inplace=True)
            
        # LifeSquare
        X['LifeSquare_nan'] = X['LifeSquare'].isna() * 1
        condition = (X['LifeSquare'].isna()) & \
                      (~X['Square'].isna()) & \
                      (~X['KitchenSquare'].isna())
        
        X.loc[condition, 'LifeSquare'] = X['Square'].div(100).mul(80)
        
        
        X.fillna(self.medians, inplace=True)
        
        return X

### 4. Building new features  <a class='anchor' id='feature'>

**Dummies**

In [None]:
binary_to_numbers = {'A': 0, 'B': 1}

df_train['Ecology_2'] = df_train['Ecology_2'].replace(binary_to_numbers)
df_train['Ecology_3'] = df_train['Ecology_3'].replace(binary_to_numbers)
df_train['Shops_2'] = df_train['Shops_2'].replace(binary_to_numbers)

**DistrictSize, IsDistrictLarge**

In [None]:
# rename_axis + reset_index(name=...) yields the same column names on old and new pandas versions
district_size = df_train['DistrictId'].value_counts().rename_axis('DistrictId')\
                    .reset_index(name='DistrictSize')

district_size.head()

In [None]:
df_train = df_train.merge(district_size, on='DistrictId', how='left')
df_train.head()

In [None]:
(df_train['DistrictSize'] > 100).value_counts()

In [None]:
df_train['IsDistrictLarge'] = (df_train['DistrictSize'] > 100).astype(int)

**MedPriceByDistrict**

In [None]:
med_price_by_district = df_train.groupby(['DistrictId', 'Rooms'], as_index=False).agg({'Price':'median'})\
                            .rename(columns={'Price':'MedPriceByDistrict'})

med_price_by_district.head()

In [None]:
med_price_by_district.shape

In [None]:
df_train = df_train.merge(med_price_by_district, on=['DistrictId', 'Rooms'], how='left')
df_train.head()
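
The median-price feature above is a form of target encoding, so the statistics should come from the training rows only, and groups that never appear in train need a fallback. A minimal sketch of that idea on made-up data follows; the frames `train` and `valid` and the column `is_unseen_group` are hypothetical names for this example.

```python
import pandas as pd

train = pd.DataFrame({
    'DistrictId': [1, 1, 1, 2, 2],
    'Rooms':      [1, 1, 2, 1, 1],
    'Price':      [100.0, 140.0, 200.0, 90.0, 110.0],
})
valid = pd.DataFrame({'DistrictId': [1, 2, 3], 'Rooms': [2, 2, 1]})

# Statistics are computed on train only, so validation prices never leak in
stats = (train.groupby(['DistrictId', 'Rooms'], as_index=False)['Price']
              .median()
              .rename(columns={'Price': 'MedPriceByDistrict'}))
fallback = stats['MedPriceByDistrict'].median()

# Left merge keeps every validation row; unseen (district, rooms) pairs become NaN
valid_enc = valid.merge(stats, on=['DistrictId', 'Rooms'], how='left')
valid_enc['is_unseen_group'] = valid_enc['MedPriceByDistrict'].isna().astype(int)
valid_enc['MedPriceByDistrict'] = valid_enc['MedPriceByDistrict'].fillna(fallback)
```

Here the pair (1, 2) is known from train, while (2, 2) and (3, 1) are not and receive the fallback value.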

**MedPriceByFloorYear**

In [None]:
def floor_to_cat(X):

    X['floor_cat'] = 0

    X.loc[X['Floor'] <= 3, 'floor_cat'] = 1  
    X.loc[(X['Floor'] > 3) & (X['Floor'] <= 5), 'floor_cat'] = 2
    X.loc[(X['Floor'] > 5) & (X['Floor'] <= 9), 'floor_cat'] = 3
    X.loc[(X['Floor'] > 9) & (X['Floor'] <= 15), 'floor_cat'] = 4
    X.loc[X['Floor'] > 15, 'floor_cat'] = 5

    return X


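def floor_to_cat_pandas(X):
    bins = [X['Floor'].min(), 3, 5, 9, 15, X['Floor'].max()]
    # include_lowest=True keeps the minimum floor in the first bin instead of turning it into NaN
    X['floor_cat'] = pd.cut(X['Floor'], bins=bins, labels=False, include_lowest=True)
    
    X['floor_cat'] = X['floor_cat'].fillna(-1)
    return X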


def year_to_cat(X):

    X['year_cat'] = 0

    X.loc[X['HouseYear'] <= 1941, 'year_cat'] = 1
    X.loc[(X['HouseYear'] > 1941) & (X['HouseYear'] <= 1945), 'year_cat'] = 2
    X.loc[(X['HouseYear'] > 1945) & (X['HouseYear'] <= 1980), 'year_cat'] = 3
    X.loc[(X['HouseYear'] > 1980) & (X['HouseYear'] <= 2000), 'year_cat'] = 4
    X.loc[(X['HouseYear'] > 2000) & (X['HouseYear'] <= 2010), 'year_cat'] = 5
    X.loc[(X['HouseYear'] > 2010), 'year_cat'] = 6

    return X


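def year_to_cat_pandas(X):
    bins = [X['HouseYear'].min(), 1941, 1945, 1980, 2000, 2010, X['HouseYear'].max()]
    # include_lowest=True keeps the earliest year in the first bin instead of turning it into NaN
    X['year_cat'] = pd.cut(X['HouseYear'], bins=bins, labels=False, include_lowest=True)
    
    X['year_cat'] = X['year_cat'].fillna(-1)
    return X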

In [None]:
bins = [df_train['Floor'].min(), 3, 5, 9, 15, df_train['Floor'].max()]
pd.cut(df_train['Floor'], bins=bins, labels=False)

In [None]:
bins = [df_train['Floor'].min(), 3, 5, 9, 15, df_train['Floor'].max()]
pd.cut(df_train['Floor'], bins=bins)
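
One detail of `pd.cut` is easy to miss: with the default `right=True` the bins are left-open, so a value equal to the first edge falls outside every bin unless `include_lowest=True` is passed. A small sketch on made-up floors (the series and edges below are illustrative, not taken from the dataset):

```python
import pandas as pd

floors = pd.Series([1, 2, 4, 16])   # hypothetical floors
bins = [1, 3, 5, 9, 15, 20]         # first edge equals the minimum floor

# Default: intervals are (1, 3], (3, 5], ...; floor 1 is not in any of them
default_codes = pd.cut(floors, bins=bins, labels=False)

# include_lowest=True closes the first interval on the left
inclusive_codes = pd.cut(floors, bins=bins, labels=False, include_lowest=True)
```

In the default call the first element becomes NaN, which is why the binning helpers need either `include_lowest=True` or a fill value for the unassigned rows.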

In [None]:
df_train = year_to_cat(df_train)
df_train = floor_to_cat(df_train)
df_train.head()

In [None]:
med_price_by_floor_year = df_train.groupby(['year_cat', 'floor_cat'], as_index=False).agg({'Price':'median'}).\
                                            rename(columns={'Price':'MedPriceByFloorYear'})
med_price_by_floor_year.head()

In [None]:
df_train = df_train.merge(med_price_by_floor_year, on=['year_cat', 'floor_cat'], how='left')
df_train.head()

In [None]:
class FeatureGenetator():
    """Generation of new features"""
    
    def __init__(self):
        self.DistrictId_counts = None
        self.binary_to_numbers = None
        self.med_price_by_district = None
        self.med_price_by_floor_year = None
        self.house_year_max = None
        self.floor_max = None
        self.house_year_min = None
        self.floor_min = None
        self.district_size = None
        
    def fit(self, X, y=None):
        
        X = X.copy()
        
        # Binary features
        self.binary_to_numbers = {'A': 0, 'B': 1}
        
        # DistrictID
        # rename_axis + reset_index(name=...) yields the same column names on old and new pandas versions
        self.district_size = X['DistrictId'].value_counts().rename_axis('DistrictId') \
                               .reset_index(name='DistrictSize')
                
        # Target encoding
        ## District, Rooms
        df = X.copy()
        
        if y is not None:
            df['Price'] = y.values
            
            self.med_price_by_district = df.groupby(['DistrictId', 'Rooms'], as_index=False).agg({'Price':'median'})\
                                            .rename(columns={'Price':'MedPriceByDistrict'})
            
            self.med_price_by_district_median = self.med_price_by_district['MedPriceByDistrict'].median()
            
        ## floor, year
        if y is not None:
            self.floor_max = df['Floor'].max()
            self.floor_min = df['Floor'].min()
            self.house_year_max = df['HouseYear'].max()
            self.house_year_min = df['HouseYear'].min()
            df['Price'] = y.values
            df = self.floor_to_cat(df)
            df = self.year_to_cat(df)
            self.med_price_by_floor_year = df.groupby(['year_cat', 'floor_cat'], as_index=False).agg({'Price':'median'}).\
                                            rename(columns={'Price':'MedPriceByFloorYear'})
            self.med_price_by_floor_year_median = self.med_price_by_floor_year['MedPriceByFloorYear'].median()
        

        
    def transform(self, X):
        
        # Binary features
        X['Ecology_2'] = X['Ecology_2'].map(self.binary_to_numbers)  # self.binary_to_numbers = {'A': 0, 'B': 1}
        X['Ecology_3'] = X['Ecology_3'].map(self.binary_to_numbers)
        X['Shops_2'] = X['Shops_2'].map(self.binary_to_numbers)
        
        # DistrictId, IsDistrictLarge
        X = X.merge(self.district_size, on='DistrictId', how='left')
        
        # Districts that were not seen during fit get a flag and a small default size
        X['new_district'] = 0
        X.loc[X['DistrictSize'].isna(), 'new_district'] = 1
        
        X['DistrictSize'] = X['DistrictSize'].fillna(5)
        
        X['IsDistrictLarge'] = (X['DistrictSize'] > 100).astype(int)
        
        # More categorical features
        X = self.floor_to_cat(X)  # adds the floor_cat column
        X = self.year_to_cat(X)   # adds the year_cat column
        
        # Target encoding
        if self.med_price_by_district is not None:
            X = X.merge(self.med_price_by_district, on=['DistrictId', 'Rooms'], how='left')
            X['MedPriceByDistrict'] = X['MedPriceByDistrict'].fillna(self.med_price_by_district_median)
            
        if self.med_price_by_floor_year is not None:
            X = X.merge(self.med_price_by_floor_year, on=['year_cat', 'floor_cat'], how='left')
            X['MedPriceByFloorYear'] = X['MedPriceByFloorYear'].fillna(self.med_price_by_floor_year_median)
        
        return X
    
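    def floor_to_cat(self, X):
        bins = [self.floor_min, 3, 5, 9, 15, self.floor_max]
        # include_lowest=True keeps the minimum floor in the first bin instead of turning it into NaN
        X['floor_cat'] = pd.cut(X['Floor'], bins=bins, labels=False, include_lowest=True)

        # Values outside the range seen during fit fall into category -1
        X['floor_cat'] = X['floor_cat'].fillna(-1)
        return X
     
    def year_to_cat(self, X):
        bins = [self.house_year_min, 1941, 1945, 1980, 2000, 2010, self.house_year_max]
        # include_lowest=True keeps the earliest year in the first bin instead of turning it into NaN
        X['year_cat'] = pd.cut(X['HouseYear'], bins=bins, labels=False, include_lowest=True)

        # Values outside the range seen during fit fall into category -1
        X['year_cat'] = X['year_cat'].fillna(-1)
        return X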

In [None]:
num_feat = list(df_train.select_dtypes(exclude='object').columns)
df_train[num_feat].hist(
    figsize=(14,14)
)
plt.show()

In [None]:
df_train.describe()

In [None]:
# A helper for inspecting and visualizing real-valued features
def learn_real_features(df, feature_name):
    if df[feature_name].dtype.name in ['float64', 'float32', 'float16', 'int8', 'int16', 'int32']:
        plt.figure(figsize = (16, 8))
        # histplot replaces the deprecated distplot; stat='density' keeps the y axis on the density scale
        sns.histplot(df[feature_name], kde=True, stat='density')
        y = np.linspace(0, 0.000005, 10)
        feature_mean = round(df[feature_name].mean(), 2)
        feature_median = df[feature_name].median()
        feature_mode = df[feature_name].mode()[0]
        feature_min = round(df[feature_name].min(), 2)
        feature_max = round(df[feature_name].max(), 2)
        feature_NA_number = df[feature_name].isnull().sum()
        plt.plot([feature_mean] * 10, y, label='mean',  linewidth=4)
        plt.plot([feature_median] * 10, y, label='median',  linewidth=4)
        plt.plot([feature_mode] * 10, y, label='mode', linewidth=4)
        plt.title('Distribution of {} '.format(feature_name))
        plt.legend()
        print(f'feature_name - {feature_name}\nmedian - {feature_median}\nmean - {feature_mean}\nmode - {feature_mode}\nMin - {feature_min}\nMax - {feature_max}\nNA number - {feature_NA_number}')
        plt.show()
    else:
        print("The feature is not real-valued")

learn_real_features(df_train, 'Price')

### 5. Feature selection  <a class='anchor' id='feature_selection'>

In [None]:
df_train.columns.tolist()

In [None]:
feature_names = [
    'Rooms', 
    'Square', 
    'LifeSquare', 
    'KitchenSquare', 
    'Floor', 
    'HouseFloor', 
    'HouseYear',
                 
    'Ecology_1', 
    #'Ecology_2', 
    'Ecology_3', 
    'Social_1', 
    'Social_2', 
    'Social_3',
                 
    #'Helthcare_2', 
    'Shops_1', 
    'Shops_2'
]

new_feature_names = [
    #'Rooms_outlier', 
    #'HouseFloor_outlier', 
    #'HouseYear_outlier', 
    #'LifeSquare_nan', 
    'DistrictSize',
    #'IsDistrictLarge',  
    #'MedPriceByDistrict', 
    #'MedPriceByFloorYear', 
    #'AverageHealthcare_1',
    #'new_district'
]

target_name = 'Price'

In [None]:

# df_train[feature_names].hist(
#     figsize=(16,16)
# )
# plt.show()

### 6. Splitting into train and test  <a class='anchor' id='split'>

In [None]:
df_train = pd.read_csv(TRAIN_DATASET_PATH)
df_test = pd.read_csv(TEST_DATASET_PATH)

X = df_train.drop(columns=target_name)
y = df_train[target_name]

In [None]:
#df_train

In [None]:
#df_test.info()

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.33, shuffle=True, random_state=21)

In [None]:
preprocessor = DataPreprocessing()
preprocessor.fit(X_train)

X_train = preprocessor.transform(X_train)
X_valid = preprocessor.transform(X_valid)
df_test = preprocessor.transform(df_test)

X_train.shape, X_valid.shape, df_test.shape

In [None]:
X_train.describe()

In [None]:
features_gen = FeatureGenetator()
features_gen.fit(X_train, y_train)

X_train = features_gen.transform(X_train)
X_valid = features_gen.transform(X_valid)
df_test = features_gen.transform(df_test)

X_train.shape, X_valid.shape, df_test.shape

In [None]:
X_train = X_train[feature_names + new_feature_names]
X_valid = X_valid[feature_names + new_feature_names]
df_test = df_test[feature_names + new_feature_names]

In [None]:
X_train

In [None]:
X_train.isna().sum().sum(), X_valid.isna().sum().sum(), df_test.isna().sum().sum()

### 7. Building the models  <a class='anchor' id='modeling'>

### 7.1 Training a RandomForestRegressor model <a class='anchor' id='forest'>

In [None]:
rf_model = RandomForestRegressor(
            random_state=21, 
            criterion='squared_error'  # the old name 'mse' was removed in scikit-learn 1.2
        )
rf_model.fit(X_train, y_train)

In [None]:
y_train_preds = rf_model.predict(X_train)
y_test_preds = rf_model.predict(X_valid)

evaluate_preds(y_train, y_train_preds, y_valid, y_test_preds)

**Cross-validation**

In [None]:
cv_score = cross_val_score(rf_model, X_train, y_train, scoring='r2', cv=KFold(n_splits=3, shuffle=True, random_state=21))
cv_score

In [None]:
cv_score.mean()

**Feature importance**

In [None]:
def feature_imp():

    feature_importances = pd.DataFrame(zip(X_train.columns, rf_model.feature_importances_), 
                                       columns=['feature_name', 'importance']).sort_values(by='importance', ascending=False)

    #feature_importances = feature_importances.sort_values(by='importance', ascending=False)
    

    plt.figure(figsize=(10,7))
    plt.barh(feature_importances['feature_name'], feature_importances['importance'])

    plt.title('Feature importance')
    plt.xlabel('Importance')
    plt.ylabel('Feature name')
    
    return feature_importances
    
feature_imp()


**Feature correlation**

In [None]:
X_corr = X_train.corr()

plt.figure(figsize = (10,10))

# Font size
sns.set(font_scale=0.9)

corr_matrix = X_corr
corr_matrix = np.round(corr_matrix, 2)
corr_matrix[np.abs(corr_matrix) < 0.3] = 0

sns.heatmap(corr_matrix, annot=True, linewidths=.5, cmap='coolwarm', label='small')

plt.title('Feature correlation matrix')

### 7.2 Training a LinearRegression model <a class='anchor' id='linear'>

In [None]:
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

In [None]:
y_train_preds = linear_model.predict(X_train)
y_test_preds = linear_model.predict(X_valid)

evaluate_preds(y_train, y_train_preds, y_valid, y_test_preds)

### 7.3 Training a CatBoostRegressor model <a class='anchor' id='cbt'>

In [None]:
from catboost import CatBoostRegressor

In [None]:
cbt_model = CatBoostRegressor(iterations=1000, verbose=False, learning_rate=0.04, depth=7, eval_metric='R2', random_seed=42)
cbt_model.fit(X_train, y_train)

In [None]:
y_train_preds = cbt_model.predict(X_train)
y_test_preds = cbt_model.predict(X_valid)

evaluate_preds(y_train, y_train_preds, y_valid, y_test_preds)

In [None]:
cv_score = cross_val_score(cbt_model, X_train, y_train, scoring='r2', cv=KFold(n_splits=3, shuffle=True, random_state=21))
cv_score

In [None]:
feature_importances = pd.DataFrame(zip(X_train.columns, cbt_model.feature_importances_), columns=['feature_name', 'importance'])

feature_importances.sort_values(by='importance', ascending=False)

### 8. Prediction on the test dataset and saving the result <a class='anchor' id='prediction'>

1. Apply the same preprocessing and feature-generation steps to the test dataset
2. Do not lose or shuffle the row indices while building the predictions
3. Predictions must be produced for every example in the test dataset (for all rows)
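
The second and third requirements can be checked mechanically. The sketch below uses made-up frames (`test`, `sizes`) and placeholder predictions to show that a left merge against a lookup table with unique keys keeps every row in its original order, although it replaces the index with a fresh RangeIndex.

```python
import pandas as pd

# Hypothetical test rows with a non-trivial index
test = pd.DataFrame({'Id': [11, 7, 42], 'DistrictId': [2, 1, 2]}, index=[5, 9, 3])
# Hypothetical lookup table with one row per district
sizes = pd.DataFrame({'DistrictId': [1, 2], 'DistrictSize': [10, 20]})

ids_before = test['Id'].tolist()

# validate='many_to_one' raises if the lookup table has duplicate keys,
# which is the usual way rows get silently duplicated
merged = test.merge(sizes, on='DistrictId', how='left', validate='many_to_one')

# Same number of rows, same order of Ids; only the index was reset
assert len(merged) == len(test)
assert merged['Id'].tolist() == ids_before

predictions = [1.0, 2.0, 3.0]  # placeholder values standing in for model output
submission = pd.DataFrame({'Id': merged['Id'], 'Price': predictions})
```

Because the index is reset by the merge, predictions should be matched to rows by position or by `Id`, not by the original index labels.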

In [None]:
df_test.shape

In [None]:
df_test

In [None]:
submit = pd.read_csv('/kaggle/input/real-estate-price-prediction-moscow/sample_submission.csv')
submit.head()

In [None]:
predictions = cbt_model.predict(df_test)
predictions

In [None]:
submit['Price'] = predictions
submit.head()

In [None]:
submit.to_csv('rf_submit77.csv', index=False)