## 1. Import libraries<a class="anchor" id="1"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Work with Data - the main Python libraries
import numpy as np
import pandas as pd
import pandas_profiling as pp

# Visualization
import matplotlib.pyplot as plt

# Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split, KFold, ShuffleSplit, GridSearchCV

# Modeling
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from xgboost.sklearn import XGBRegressor

# Metrics
from sklearn.metrics import r2_score

import warnings
warnings.filterwarnings("ignore")

## 2. Download data<a class="anchor" id="2"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Download training data
train = pd.read_csv('/kaggle/input/ammonium-prediction-in-river-water/train.csv')

In [None]:
# Display the first 5 rows of the training dataframe.
train.head()

In [None]:
# Information for training data
train.info()

In [None]:
# Download test data
test = pd.read_csv('../input/ammonium-prediction-in-river-water/test.csv')

In [None]:
# Display the last 5 rows of the test dataframe
test.tail()

In [None]:
test.info()

## 3. EDA & FE & Preprocessing data<a class="anchor" id="3"></a>

[Back to Table of Contents](#0.1)

### 3.1. Statistics & FE<a class="anchor" id="3.1"></a>

[Back to Table of Contents](#0.1)

The analysis showed that many values are available only for stations 1 and 2, while the other stations have much less data. I propose selecting only these two stations.
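One way to check this kind of claim is to count the share of non-null values per column. The snippet below is a minimal sketch on a synthetic frame: the column names mirror the station columns used in this notebook, but the values and the 90% threshold are assumptions made for illustration, not the competition data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data: stations '1' and '2' are complete,
# the remaining stations have many gaps (values are made up for illustration)
demo = pd.DataFrame({
    'target': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    '1': [0.2, 0.3, 0.1, 0.4, 0.6, 0.5],
    '2': [0.3, 0.2, 0.2, 0.5, 0.4, 0.6],
    '3': [0.1, np.nan, np.nan, 0.3, np.nan, np.nan],
    '4': [np.nan, np.nan, np.nan, 0.2, np.nan, np.nan],
})

# Share of non-null values per column, sorted from most to least complete
coverage = demo.notna().mean().sort_values(ascending=False)
print(coverage)

# Keep only the columns that are at least 90% complete (the threshold is an assumption)
keep = coverage[coverage >= 0.9].index.tolist()
print(keep)
```

On the real data the same two lines (`notna().mean()` and the threshold filter) could be applied to `train` before dropping columns.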

In [None]:
# Select the stations with the most data in training dataset
train = train.drop(['Id','3','4','5','6','7'], axis = 1)
train = train.dropna().reset_index(drop=True)
train.info()

In [None]:
# Display the statistics for training data
train.describe()

In [None]:
# EDA with Pandas Profiling
pp.ProfileReport(train)

In [None]:
# Select the target feature and remove it from the training dataset
target = train.pop('target')

In [None]:
# Select the stations with the most data in test dataset
test = test.drop(['Id','3','4','5','6','7'], axis = 1)
test = test.dropna().reset_index(drop=True)

**TASK:** Perform EDA for the test dataset with Pandas Profiling

In [None]:
# EDA with Pandas Profiling
pp.ProfileReport(test)

In [None]:
# Display basic information about the test data
test.info()

### 3.2. Data standardization<a class="anchor" id="3.2"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Standardize the training data
scaler = StandardScaler()
train = pd.DataFrame(scaler.fit_transform(train), columns = train.columns)

# Display training data
train

In [None]:
# Display the statistics for training data
train.describe()

**TASK:** Standardize the test dataset with the same scaler and display it

In [None]:
# Standardize the test data with the scaler fitted on the training data
test = pd.DataFrame(scaler.transform(test), columns = test.columns)
# Display test data
display(test)

**TASK:** Display the statistics for test data

In [None]:
# Display the statistics for test data
test.describe()

**It is important to make sure** that all features in the training and test datasets:
* have no missing values (the number of non-null values equals the number of index entries);
* have a numeric data type (int8, int16, int32, int64 or float16, float32, float64).
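Both conditions can be verified programmatically instead of reading the output of `info()` by eye. The sketch below uses a hypothetical helper, `check_features`, and two tiny synthetic frames; neither is part of the original notebook.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def check_features(df):
    """Return a list of problems found in df (an empty list means the frame is ready)."""
    problems = []
    for col in df.columns:
        # Condition 1: no missing values
        if df[col].isna().any():
            problems.append(f"{col}: has missing values")
        # Condition 2: numeric data type
        if not is_numeric_dtype(df[col]):
            problems.append(f"{col}: non-numeric dtype {df[col].dtype}")
    return problems

# Synthetic examples: one clean frame and one that violates both conditions
clean = pd.DataFrame({'1': [0.1, 0.2], '2': [0.3, 0.4]})
dirty = pd.DataFrame({'1': [0.1, np.nan], '2': ['a', 'b']})

print(check_features(clean))
print(check_features(dirty))
```

Calling `check_features(train)` and `check_features(test)` before modeling would be one way to apply the same check to the notebook's data.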

**ADDITIONAL TASK:** Try using RobustScaler or MinMaxScaler instead of StandardScaler and analyze how the accuracy of the models below changes.

In [None]:
# Reload the training data for the RobustScaler experiment
trainAdd = pd.read_csv('../input/ammonium-prediction-in-river-water/train.csv')

In [None]:
# Scale the data with RobustScaler (uses the median and the interquartile range)
rScaler = RobustScaler()

trainAdd = pd.DataFrame(rScaler.fit_transform(trainAdd), columns = trainAdd.columns)
# Display the scaled data
trainAdd
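To see how the three scalers differ before comparing model scores, it can help to apply them to one small feature that contains an outlier. The values below are synthetic and chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

# One feature with a single large outlier (synthetic values)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Fit each scaler on the same data and keep the scaled values for comparison
scaled_by = {}
for scaler in (StandardScaler(), RobustScaler(), MinMaxScaler()):
    scaled_by[type(scaler).__name__] = scaler.fit_transform(x).ravel()
    print(type(scaler).__name__, np.round(scaled_by[type(scaler).__name__], 2))
```

StandardScaler centers on the mean and divides by the standard deviation, RobustScaler centers on the median and divides by the interquartile range, and MinMaxScaler maps the range to [0, 1]. With an outlier present, only RobustScaler keeps the ordinary points well spread.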

### 3.3. Training data splitting<a class="anchor" id="3.3"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Split the training data into a new training set (part of the full training data) and a validation set
train_all = train.copy()
target_all = target.copy()
train, valid, target_train, target_valid = train_test_split(train_all, target_all, test_size=0.2, random_state=0)

In [None]:
# Display information about new training data
train.info()

**TASK:** Display information about validation data

In [None]:
# Display information about validation data
valid.info()

**ADDITIONAL TASK:** Try other values of the parameter test_size above (0.1, 0.15, 0.3, 0.5) and analyze how the accuracy of the models below changes.
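A loop over the candidate values shows how `test_size` changes the number of rows left for training and validation. This is a sketch on synthetic arrays; `X_demo` and `y_demo` are stand-ins for `train_all` and `target_all`, and the row count of 200 is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and target, 200 rows (stand-ins for train_all and target_all)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = rng.normal(size=200)

# Record the sizes of the training and validation parts for each test_size
split_sizes = {}
for test_size in (0.1, 0.15, 0.2, 0.3, 0.5):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_demo, y_demo, test_size=test_size, random_state=0)
    split_sizes[test_size] = (len(X_tr), len(X_val))
    print(f"test_size={test_size}: {len(X_tr)} training rows, {len(X_val)} validation rows")
```

A larger `test_size` gives a more stable validation score but leaves fewer rows to fit the model, which matters for a small dataset.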

### 3.4. Cross-validation of training data<a class="anchor" id="3.4"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Cross-validation of training data with shuffle
# (random_state is only valid together with shuffle=True)
cv_train = KFold(n_splits=5, shuffle=True, random_state=0)

**ADDITIONAL TASKS:** 
1. Set the number of splits to 5, 7 and 10 and compare the results.
2. Try another method for cross-validation of the training data (without shuffle):

        KFold(n_splits=5, shuffle=False)
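The difference between the two variants is easiest to see on a tiny array. The sketch below uses ten synthetic rows and prints which row indices land in each validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # ten synthetic rows

# Without shuffling, the validation folds are consecutive blocks of rows
ordered = [val_idx.tolist()
           for _, val_idx in KFold(n_splits=5, shuffle=False).split(X_demo)]
print(ordered)

# With shuffling, rows are assigned to folds at random;
# random_state makes the assignment repeatable
shuffled = [val_idx.tolist()
            for _, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_demo)]
print(shuffled)
```

If the rows are ordered in time, as river measurements often are, unshuffled folds test the model on contiguous periods, while shuffled folds mix periods together.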

## 4. Modeling<a class="anchor" id="4"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Create the dataframe with the resulting scores of all models
result = pd.DataFrame({'model' : ['Decision Tree Regressor', 'Random Forest Regressor', 'XGBoost Regressor'], 
                       'train_score': 0, 'valid_score': 0, 'y_train':  [[], [], []], 'y_val':  [[], [], []], 'y_test':  [[], [], []]})
result

In [None]:
# For further use and code optimization, this function generalizes running the models.
# It returns the results computed by a given model for the training, validation and test sets.

def get_model(train, valid, target_train, target_valid, model_name, name, param_grid, cv_train, result):
    model = model_name
    grid = GridSearchCV(model,
                        param_grid,
                        cv = cv_train,
                        verbose=False)
    grid.fit(train, target_train)
    
    # Prediction for training data
    y_train = grid.predict(train)
    print(grid.best_params_)
    
    # Score of the model on the training data (R2 multiplied by 100)
    r2_score_acc = round(r2_score(target_train, y_train)*100,1)
    print(f'Accuracy of {name} model training is {r2_score_acc}')
    
    result.loc[result['model'] == name, 'train_score'] = r2_score_acc
    result.at[result.loc[result['model'] == name].index[0], 'y_train'] = y_train
    
    # Prediction for validation data and its score (R2 multiplied by 100)
    y_val = grid.predict(valid)
    r2_score_acc_valid = round(r2_score(target_valid, y_val)*100,1)
    result.loc[result['model'] == name, 'valid_score'] = r2_score_acc_valid
    result.at[result.loc[result['model'] == name].index[0], 'y_val'] = y_val
    
    print(f'Accuracy of {name} model prediction for valid dataset is {r2_score_acc_valid}')
    
    # Prediction for test data (relies on the global `test` dataframe)
    result.at[result.loc[result['model'] == name].index[0], 'y_test'] = grid.predict(test)
    return result
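The "accuracy" printed by `get_model` is the coefficient of determination multiplied by 100, where $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it by hand on made-up numbers and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up target values and predictions, for illustration only
y_true = np.array([0.2, 0.4, 0.6, 0.8])
y_pred = np.array([0.25, 0.35, 0.65, 0.7])

# Residual sum of squares and total sum of squares around the mean
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Both values are reported the same way as in get_model: R2 * 100, rounded to 1 decimal
print(round(r2_manual * 100, 1), round(r2_score(y_true, y_pred) * 100, 1))
```

Unlike a classification accuracy, this score can be negative when the model predicts worse than the mean of the target.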

### 4.1. Decision Tree Regressor<a class="anchor" id="4.1"></a>

[Back to Table of Contents](#0.1)

In [None]:
decision_tree = DecisionTreeRegressor()
param_grid = {'min_samples_leaf': [i for i in range(5,10)], 'max_depth': [i for i in range(3,12)]}
result = get_model(train, valid, target_train, target_valid, decision_tree ,'Decision Tree Regressor', param_grid, cv_train, result)

### 4.2. Random Forest Regressor<a class="anchor" id="4.2"></a>

[Back to Table of Contents](#0.1)

In [None]:
rf = RandomForestRegressor()
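# In recent scikit-learn versions 'mse' is named 'squared_error' and
# max_features='auto' (all features for regressors) is written as 1.0
param_grid = {'n_estimators': [10, 100, 500], 'min_samples_leaf': [i for i in range(5,10)], 
              'max_features': [1.0], 'max_depth': [i for i in range(4,6)], 
              'criterion': ['squared_error'], 'bootstrap': [False]}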
result = get_model(train, valid, target_train, target_valid, rf ,'Random Forest Regressor', param_grid, cv_train, result)

### 4.3. XGBoost Regressor<a class="anchor" id="4.3"></a>

[Back to Table of Contents](#0.1)

**ADDITIONAL TASK:** Add the XGBRegressor model (the same commands as in 4.1 and 4.2 adapted to the library xgb). Please see example in the notebooks: 
* [BOD prediction in river - 15 regression models](https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models)
* [XGBRegressor with GridSearchCV](https://www.kaggle.com/jayatou/xgbregressor-with-gridsearchcv)

In [None]:
# XGBoost Regressor
# The model gets its own name so that it does not shadow the imported `xgb` module
xgb_model = XGBRegressor(verbosity=0)
# The grid must be named param_grid; otherwise the Random Forest grid from 4.2 is passed below
param_grid = {'n_jobs': [4], # with hyperthreading, xgboost may become slower
              'objective': ['reg:squarederror'], # replaces the deprecated 'reg:linear'
              'learning_rate': [.03, 0.05, .07], # the so-called `eta` value
              'max_depth': [5, 6, 7],
              'min_child_weight': [4],
              'subsample': [0.7],
              'colsample_bytree': [0.7],
              'n_estimators': [500]}
result = get_model(train, valid, target_train, target_valid, xgb_model, 'XGBoost Regressor', param_grid, cv_train, result)

**ADDITIONAL TASKS:** 
1. Also add the calculated arrays y_train and y_val to the dataframe result.
2. Create a function with all commands and output information (from each section of this chapter 4) for all models:

        result = get_model(train, valid, target_train, target_valid, model_name, name, param_grid, cv_train, result)

## 5. Visualization<a class="anchor" id="6"></a>

[Back to Table of Contents](#0.1)

**TASK:** Build a plot of the predictions for the validation data.

In [None]:
# Build a plot of the predictions for the validation data


**TASK:** Build a plot of the predictions for the test data.

In [None]:
# Build a plot of the predictions for the test data


In [None]:
def plot_prediction(result, type_plot, target_train=None):
    # Map the plot type to the column of `result` and to the plot title
    plot_types = {'training': ('y_train', 'Prediction for the training data'),
                  'validation': ('y_val', 'Prediction for the validation data'),
                  'testing': ('y_test', 'Prediction for the testing data')}
    if type_plot not in plot_types:
        raise ValueError(f"type_plot must be one of {list(plot_types)}")
    result_type, result_title = plot_types[type_plot]
    x = np.arange(len(result.at[0, result_type]))
    plt.figure(figsize=(16,10))
    # Target values are known only for the training and validation data
    if type_plot != 'testing' and target_train is not None:
        plt.scatter(x, target_train, label = "Target data", color = 'g')
    plt.scatter(x, result.at[0, result_type], label = "Decision Tree prediction", color = 'b')
    plt.scatter(x, result.at[1, result_type], label = "Random Forest prediction", color = 'y')
    plt.scatter(x, result.at[2, result_type], label = "XGB prediction", color = '#17becf')
    plt.plot(x, np.full(len(x), 0.5), label = "Maximum allowable value", color = 'r')
    plt.title(result_title)
    plt.legend(loc='best')
    plt.grid(True)
    plt.show()


In [None]:
plot_prediction(result, 'training', target_train)

In [None]:
plot_prediction(result, 'validation', target_valid)

In [None]:
plot_prediction(result, 'testing')

**ADDITIONAL TASKS:** 
1. Also add the calculated array y_test to the dataframe result.
2. Add the XGBRegressor model predictions to the plot (take train, valid and test predictions from the dataframe result).
3. Create a function with all commands and output information for all models (for type_plot = 'training', 'validation' or 'testing'):

        plot_prediction(result, 'training', target_train)

## 6. Select the best model <a class="anchor" id="7"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Display results of modeling
result.sort_values(by=['valid_score', 'train_score'], ascending=False)

In [None]:
# Select models with minimal overfitting
result_best = result[(result['train_score'] - result['valid_score']).abs() < 5]
result_best.sort_values(by=['valid_score', 'train_score'], ascending=False)

In [None]:
# Select the best model
result_best.nlargest(1, 'valid_score')

In [None]:
# Find a name of the best model (with maximal valid score)
best_model_name = result_best.loc[result_best['valid_score'].idxmax(), 'model']

In [None]:
print(f'The best model is "{best_model_name}"')
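The selection logic of this section can be traced on a small table. The scores below are hypothetical and made up for illustration; the filter and the 5-point threshold follow the cells above.

```python
import pandas as pd

# Hypothetical scores, made up for illustration only
demo = pd.DataFrame({'model': ['A', 'B', 'C'],
                     'train_score': [99.0, 85.0, 80.0],
                     'valid_score': [70.0, 82.0, 78.0]})

# Keep models whose training and validation scores differ by less than 5 points
demo_best = demo[(demo['train_score'] - demo['valid_score']).abs() < 5]

# idxmax() returns the index label of the row with the highest validation score
best_name = demo_best.loc[demo_best['valid_score'].idxmax(), 'model']
print(best_name)
```

Model A has the highest training score but is filtered out because of the large gap to its validation score, so the choice is made between B and C only.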