<a href="https://colab.research.google.com/github/yandexdataschool/MLatImperial2021/blob/master/02_lab/kaggle_challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Set up variables and download data

Register on [Kaggle](https://www.kaggle.com) and accept the [competition](https://www.kaggle.com/c/mlimperial2022-predict-the-house-price/overview) rules.

Go to My Account and, under the API section, click **Create New API Token**.
Download the generated `kaggle.json`.

Upload this file to the root folder of your Google Drive.

Now execute the following cells. They mount Google Drive, install the `kaggle` package, and download the competition data to your Drive.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
!mkdir -p /root/.kaggle
!cp /content/gdrive/My\ Drive/kaggle.json /root/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json
!ls -l /root/.kaggle

In [None]:
!pip install --upgrade --force-reinstall --no-deps kaggle

In [None]:
#!kaggle config set -n path -v /content
!kaggle competitions download -c mlimperial2022-predict-the-house-price -p '/content/gdrive/My Drive/mlimperial2022-predict-the-house-price'

In [None]:
!unzip -q /content/gdrive/My\ Drive/mlimperial2022-predict-the-house-price/mlimperial2022-predict-the-house-price.zip -d /content/gdrive/My\ Drive/mlimperial2022-predict-the-house-price/

In [None]:
DATA_PATH = "/content/gdrive/My Drive/mlimperial2022-predict-the-house-price"
!ls /content/gdrive/My\ Drive/mlimperial2022-predict-the-house-price

# https://www.kaggle.com/c/mlimperial2022-predict-the-house-price/overview

### Metric

For a regression task, we could use the most common metric, mean squared error (MSE). However, sometimes it's better to use a logarithmic error. In this challenge, we will use RMSLE, the root mean squared logarithmic error:

$$
RMSLE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} [\log(y_i + 1) - \log(p_i + 1)]^2},
$$

where $y_i$ is the true value, $p_i$ is the predicted value, and $N$ is the number of samples.
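
The formula above can be sketched directly in NumPy. This is only an illustration: the helper name `rmsle` and the example prices are made up, and the inputs are assumed to be non-negative.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(1 + x)
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

# A 10% overestimate is penalised almost equally for a cheap and an expensive item
small = rmsle([100.0], [110.0])
large = rmsle([1_000_000.0], [1_100_000.0])
```

Because the error is computed on logarithms, it measures relative rather than absolute deviation, which suits prices that span several orders of magnitude.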

# Grading

Your task is to try as many of the techniques you have learned this week as possible.

The outcome of your work should be a properly documented Jupyter notebook that contains all the experiments you did, plus your explanations and comments on them.

The archive with the Jupyter notebook should be sent to mlicl-2022-seminars@mail.ru with the subject: Surname_name_kaggle_1

### The total amount of points is 10. You will get additional points based on your final ranking

Start with the baseline solution for convenience.

- 1 Point. Work with missing values. Which features are better to remove? Are there any features worth fixing (filling values)?
- 1 Point. Work with categorical features. Try to encode them (one-hot encoding, frequency encoding). Does this improve your score?
- 1 Point. Work with the timestamps. What information can you extract from them? <i>Example</i>: convert them to separate year, month, day features.
- 1 Point. Find highly correlated features in train.csv and macro.csv (determine your own threshold). How does removing one of them affect the model's predictive capability? Analyse the correlation between features and target. Decide what to do with features that have a negative correlation with the target (throw them away or process them).
- 1 Point. Compare various linear regression methods (Ridge, Lasso, SVR, SGD-based) and decision-tree-based algorithms (random forest, boosting). Grid search parameters.
- 1 Point. Try to find badly defined features and outliers in the dataset. Remove them. Did it help?
- 1 Point. Try generating your own features that produce an improvement (ratio of life_sq to full_sq, age of a building, ratio of floor to max floor, ...)
- 1 Point. Try using PCA on some subset of features (determine this subset). Is it useful? Why? Visualise the components with the highest explained variance.
- 1 Point. Estimate feature importances w.r.t. your best model. Try to remove the least important features. What difference do you notice compared with removing correlated features? Visualize the dependency between the target and the two most important features.
- 1 Point. Stack the models trained above. Does it improve your score?

## Bonus

Beat the medium baseline and we will give you +3 points :)

# Baseline

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error, mean_squared_error
import os

In [None]:
X_train = pd.read_csv(os.path.join(DATA_PATH, 'X_train.csv'), index_col=0, parse_dates=['timestamp'])
X_test = pd.read_csv(os.path.join(DATA_PATH, 'X_test.csv'), index_col=0, parse_dates=['timestamp'])
y_train = pd.read_csv(os.path.join(DATA_PATH, 'y_train.csv'), index_col=0)

macro = pd.read_csv(os.path.join(DATA_PATH, "macro.csv"), index_col=0, parse_dates=['timestamp'])

### Explore train data

In [None]:
X_train.head()

In [None]:
X_train.info()

In [None]:
X_train.describe()

In [None]:
y_train.head()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(16,8))

sns.histplot(y_train['price_doc'], ax=ax[0]);
ax[0].grid();ax[0].set_title('Price doc distr')
sns.histplot(np.log1p(y_train['price_doc']), ax=ax[1]);
ax[1].grid();ax[1].set_title('log Price doc distr');

### Let's see what 'macro' data offers

In [None]:
macro.head()

As you can see, timestamps are important here because they determine the values of variables that change over time. For example, we can merge the two tables (X_train and macro) on their timestamps.
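
To see what a left merge on timestamps does, here is a small sketch on toy tables. The tables `sales` and `macro_toy` and the column `usd_rate` are made up for illustration; they are not taken from the competition files.

```python
import pandas as pd

# Toy stand-ins for X_train and macro (hypothetical values)
sales = pd.DataFrame({
    'timestamp': pd.to_datetime(['2014-01-01', '2014-01-01', '2014-01-03']),
    'full_sq': [40, 55, 72],
})
macro_toy = pd.DataFrame({
    'timestamp': pd.to_datetime(['2014-01-01', '2014-01-02']),
    'usd_rate': [32.6, 32.7],
})

# how='left' keeps every row of the left table, in its original order
merged = pd.merge(sales, macro_toy, on='timestamp', how='left')
```

Both sales from 2014-01-01 receive the same `usd_rate`, while the sale from 2014-01-03 has no matching macro row and gets NaN. A left merge can therefore introduce new missing values.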

In [None]:
X_train_aug = pd.merge(X_train, macro, on='timestamp', how='left')

In [None]:
X_train_aug.info()

Before we proceed to model training, we must get rid of NaNs and non-numerical data!

### Working with missing values

In [None]:
# calculate ratio of missing values to total size of a column
nan_ratio = X_train_aug.isna().sum(axis=0)/len(X_train_aug)

In [None]:
plt.figure(figsize=(12, 8))
plt.bar(np.arange(len(nan_ratio)), nan_ratio);plt.grid();plt.show()

In [None]:
def process_nans(X):
    # work on a copy so the caller's DataFrame is not modified in place
    X = X.copy()

    # select numerical columns
    numeric_cols = X.select_dtypes(include='number').columns

    # replace all NaN with mean value of corresponding column
    X[numeric_cols] = X[numeric_cols].fillna(X[numeric_cols].mean())
    return X

In [None]:
X_train_aug = process_nans(X_train_aug)

In [None]:
X_train_aug.isna().sum(axis=0)

Have all NaNs gone?
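
One way to answer this is to list only the columns that still contain NaNs. The sketch below uses a toy table with made-up column names (`area`, `material`) and the same mean-filling idea as `process_nans`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'area': [30.0, np.nan, 50.0],
    'material': ['brick', None, 'panel'],  # non-numerical column
})

# Fill numerical columns with their column mean
numeric_cols = toy.select_dtypes(include='number').columns
filled = toy.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

# Keep only the columns that still have missing values
remaining = filled.isna().sum()
remaining = remaining[remaining > 0]
```

In this toy case `area` is filled with 40.0, but `material` still has a missing value, because the mean is only defined for numerical columns.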

### Working with object data

In [None]:
# select all 'string' columns (called Object type)
obj_cols = X_train_aug.columns[X_train_aug.dtypes == 'O']
X_train_aug[obj_cols].head()

For now, within the baseline solution, we will just drop columns of 'Object' type. In YOUR solution, of course, you should find a way to extract useful information from them.
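
As a starting point for your own solution, the two encodings named in the grading list can be sketched as follows. The column name `sub_area` and its values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'sub_area': ['A', 'B', 'A', 'C', 'A', 'B']})  # hypothetical column

# Frequency encoding: replace each category with its share of rows
freq = toy['sub_area'].value_counts(normalize=True)
toy['sub_area_freq'] = toy['sub_area'].map(freq)

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy['sub_area'], prefix='sub_area')
```

Frequency encoding keeps a single column regardless of the number of categories, whereas one-hot encoding adds one column per category. If you use frequency encoding, it is probably safest to compute `freq` on the training data and reuse it for the test data.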

In [None]:
def process_obj(X):
    # just drop them for now
    obj_cols = X.columns[X.dtypes == 'O']
    return X.drop(columns = obj_cols)

In [None]:
X_train_aug = process_obj(X_train_aug)

### Working with timestamp data

The same goes for 'datetime' data: we just drop it.
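
Instead of dropping timestamps, your own solution could turn them into numerical features. A minimal sketch with pandas' `.dt` accessor follows; the helper name `add_date_parts` and the example dates are made up.

```python
import pandas as pd

toy = pd.DataFrame({'timestamp': pd.to_datetime(['2013-05-17', '2014-12-01'])})

def add_date_parts(X, col='timestamp'):
    """Return a copy of X with calendar features derived from a datetime column."""
    X = X.copy()
    X['year'] = X[col].dt.year
    X['month'] = X[col].dt.month
    X['day'] = X[col].dt.day
    X['dayofweek'] = X[col].dt.dayofweek  # Monday = 0
    return X

with_parts = add_date_parts(toy)
```

The original `timestamp` column is left in place here, so a later call to `process_ts` would still drop it while the derived numerical columns survive.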

In [None]:
def process_ts(X):
    ts_cols = X.columns[X.dtypes == "datetime64[ns]"]
    return X.drop(columns = ts_cols)

In [None]:
X_train_aug = process_ts(X_train_aug)

In [None]:
X_train_aug.info()

In [None]:
X_train_aug.head()

### Feature filtering

Feature filtering is dedicated to feature postprocessing: dropping unnecessary features, transforming them or generating new ones (by hand, PCA, etc.)

In [None]:
def filter_feats(X):
    return X.drop(columns = ['id'])

In [None]:
X_train_aug = filter_feats(X_train_aug)

### Scaling data

Make sure our features are normalised:

Scale the data before passing it to the algorithms. NOTE that outliers affect some scaling techniques.
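
To illustrate the note about outliers, the sketch below scales a toy column containing one extreme value with two scikit-learn scalers. `RobustScaler` is not used in the baseline; it appears here only for comparison.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Four ordinary values and one extreme outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

standard = StandardScaler().fit_transform(x)  # uses mean and standard deviation
robust = RobustScaler().fit_transform(x)      # uses median and interquartile range

# Spread of the four ordinary points after each scaling
standard_spread = standard[:4].max() - standard[:4].min()
robust_spread = robust[:4].max() - robust[:4].min()
```

With `StandardScaler` the outlier inflates the standard deviation, so the four ordinary points are squeezed into a very narrow range. With `RobustScaler` they keep a spread of 1.5.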

In [None]:
from sklearn.preprocessing import StandardScaler

def scale_data(X):
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    return X_scaled, scaler

In [None]:
col_names = X_train_aug.columns

In [None]:
X_train_aug, scaler = scale_data(X_train_aug)

In [None]:
X_train_aug.mean(axis=0)

In [None]:
X_train_aug.std(axis=0)

### Forming preprocess pipeline

Let's create our pipeline procedure that incorporates all operations/transformations with data we did before. We would need that in the future in order to apply our transformations to X_test (the SAME way as for X_train!).
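
One detail worth checking in such a pipeline is where the statistics come from. The baseline `process_nans` fills NaNs with the means of whatever table it receives, so a test table would be filled with its own means. A possible alternative, sketched here on toy data, is to compute both the fill values and the scaler on the training data and reuse them for the test data. The variable names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({'full_sq': [30.0, np.nan, 50.0, 40.0]})
test = pd.DataFrame({'full_sq': [np.nan, 60.0]})

# Statistics are computed on the training data only
train_means = train.mean()
scaler = StandardScaler().fit(train.fillna(train_means))

# The same fill values and the same scaler are reused for the test data
test_scaled = scaler.transform(test.fillna(train_means))
```

The missing test value is replaced by the training mean (40.0) and therefore maps to 0 after scaling.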

In [None]:
def my_preprocessing(X, scaler=None):
    """
    Input:
    - pandas table X
    - sklearn Scaler (None for train, not None for test)
    
    Applies the following transformations to input X (within baseline):
    - Merge X with macro data by timestamps
    - Replace all NaNs for numerical features with their column mean value
    - Drop all Object-type data (strings)
    - Drop all timestamps
    - Drop id column
    - Scale features (converts pandas table to numpy array)
    
    Output:
    - Transformed X_preproc
    - sklearn Scaler that we use later for X_test scaling
    """
    # augment data with macro data
    X_preproc = pd.merge(X, macro, on='timestamp', how='left')
    
    # remove or fix NaNs
    X_preproc = process_nans(X_preproc)
    
    # Drop object columns (strings); encode them here in your own solution
    X_preproc = process_obj(X_preproc)
    
    # Drop timestamp columns; extract date features here in your own solution
    X_preproc = process_ts(X_preproc)
    
    # Do feature filtering ! MOST IMPORTANT part
    X_preproc = filter_feats(X_preproc)
    
    col_names = X_preproc.columns
    
    # Apply scaler if available (test) or create and apply one (train)
    if (scaler is None):
        X_preproc, scaler = scale_data(X_preproc)
    else:
        X_preproc = scaler.transform(X_preproc)
        
    return X_preproc, scaler

### Training

In [None]:
from sklearn.metrics import make_scorer

# Create our metric (Loss)
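# Both inputs are expected to be log1p-transformed already.
# make_scorer calls the function as score_func(y_true, y_pred).
def RMSLE(log_y_true, log_y_pred):
    # np.sqrt works across scikit-learn versions (the `squared` argument was removed in 1.6)
    return np.sqrt(mean_squared_error(log_y_true, log_y_pred))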

scorer = make_scorer(RMSLE, greater_is_better=False)

Let's train a basic Ridge regression on the whole training set. Is that a good idea? Why?
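
One way to explore this question is to hold out part of the data and compare the error on the fitted part with the error on the held-out part. The sketch below uses synthetic data, not the competition data, and the names `X_fit`, `X_val`, `rmse_fit` and `rmse_val` are made up.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression problem: linear signal plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

# Keep 25% of the rows aside; the model never sees them during fitting
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = Ridge(alpha=1.0).fit(X_fit, y_fit)
rmse_fit = np.sqrt(np.mean((model.predict(X_fit) - y_fit) ** 2))
rmse_val = np.sqrt(np.mean((model.predict(X_val) - y_val) ** 2))
```

Only `rmse_val` estimates how the model behaves on unseen data. An error computed on the same rows that were used for fitting tends to be optimistic.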

In [None]:
from sklearn.linear_model import Ridge

# Train on whole training data
predictor = Ridge(alpha=1.0)
predictor.fit(X_train_aug, np.log1p(y_train['price_doc']))

RMSLE(predictor.predict(X_train_aug), np.log1p(y_train['price_doc']))

What was our model trained to predict?

Now let's run the cross-validation procedure. Before that, we need to specify the number of folds (parameter cv) and the grid of parameters (alpha).

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10]}

gscv = GridSearchCV(predictor, param_grid, scoring=scorer, cv=5, n_jobs=-1, verbose=1)
gscv.fit(X_train_aug, np.log1p(y_train['price_doc']))

In [None]:
for ind, alpha in enumerate(param_grid['alpha']):
    print(f'alpha = {alpha} || Mean cv-loss = {-gscv.cv_results_["mean_test_score"][ind]:.2f}')

Why do we multiply our mean test score by '-1'? We can see that with alpha=1.0 the cv-loss differs from the training loss. Why?
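
As a hint for the first question, the sketch below shows what `make_scorer(..., greater_is_better=False)` returns on a tiny example. The helper `rmse` and the toy data are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# greater_is_better=False tells scikit-learn that this is a loss
neg_rmse_scorer = make_scorer(rmse, greater_is_better=False)

X = np.zeros((4, 1))
y = np.array([0.0, 0.0, 2.0, 2.0])
model = DummyRegressor(strategy='mean').fit(X, y)  # always predicts 1.0

loss = rmse(y, model.predict(X))      # 1.0
score = neg_rmse_scorer(model, X, y)  # -1.0
```

The scorer flips the sign of the loss so that model selection tools such as `GridSearchCV` can always maximise the score.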

Find best 'alpha':

In [None]:
best_score_ind = np.argmax(gscv.cv_results_['mean_test_score'])
gscv.cv_results_['params'][best_score_ind], -gscv.cv_results_['mean_test_score'][best_score_ind]

In [None]:
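# refit on the whole training set with the best alpha found by the grid search
predictor = Ridge(alpha=gscv.best_params_['alpha'])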
predictor.fit(X_train_aug, np.log1p(y_train['price_doc']))

In [None]:
# names of the 20 features with the largest (most positive) coefficients
col_names[np.argsort(predictor.coef_)[-20:]]

# Make predictions on the test set



In [None]:
X_test_preproc, _ = my_preprocessing(X_test, scaler)

In [None]:
# inverse of y_log = log(1 + y) -> y = exp(y_log) - 1
prediction = np.expm1(predictor.predict(X_test_preproc))
# build the submission table column by column so that 'id' keeps its integer dtype
prediction = pd.DataFrame({'id': X_test['id'].to_numpy(), 'price_doc': prediction})

In [None]:
prediction.to_csv(os.path.join(DATA_PATH, "prediction.csv"), index=False)

In [None]:
!head -n 5 '/content/gdrive/My Drive/mlimperial2022-predict-the-house-price/prediction.csv'

# Let's use the Kaggle API again to submit the results


In [None]:
!kaggle competitions submit -c mlimperial2022-predict-the-house-price -f "{DATA_PATH}/prediction.csv" -m "Test"