<a class="anchor" id="0"></a>

# BOD prediction in the river water by 3 models

## **Acknowledgements**
#### This kernel builds on the following kernels:
   - https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
   - https://www.kaggle.com/anastasiyamarch05/ammonium-prediction-in-river-water-starter-code
   - https://www.kaggle.com/spice4ever/bod-prediction-in-the-river-water

<a class="anchor" id="0.1"></a>
## **Table of Contents**
1. [Import libraries](#1)
2. [Download datasets](#2)
3. [EDA](#3)
4. [Preparing for modeling](#4)
5. [Tuning models and test for all features](#5)
    - [Stochastic Gradient Descent](#5.1)
    - [Decision Tree Regressor](#5.2)
    - [GradientBoostingRegressor with HyperOpt](#5.3)
6. [Models comparison](#6)
7. [Prediction](#7)

<a class="anchor" id="1"></a>
## 1. Import libraries 
##### [Back to Table of Contents](#0.1)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# preprocessing
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
import pandas_profiling as pp

# models
from sklearn.linear_model import LinearRegression, SGDRegressor, RidgeCV
from sklearn.svm import SVR, LinearSVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor 
from sklearn.ensemble import BaggingRegressor, AdaBoostRegressor, VotingRegressor 
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
import sklearn.model_selection
from sklearn.model_selection import cross_val_predict as cvp
from sklearn import metrics
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import xgboost as xgb
import lightgbm as lgb

# model tuning
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe, space_eval

import warnings
warnings.filterwarnings("ignore")

<a class="anchor" id="2"></a>
## 2. Download datasets 
##### [Back to Table of Contents](#0.1)

In [None]:
# Thanks to: https://www.kaggle.com/anastasiyamarch05/ammonium-prediction-in-river-water-starter-code
valid_part = 0.3

In [None]:
train0 = pd.read_csv('/kaggle/input/prediction-bod-in-river-water/train.csv')

In [None]:
train0.head(10)

In [None]:
train0.info()

<a class="anchor" id="3"></a>
## 3. EDA
##### [Back to Table of Contents](#0.1)



In [None]:
train0.describe()

In [None]:
# Thanks to: https://www.kaggle.com/anastasiyamarch05/ammonium-prediction-in-river-water-starter-code
# Distribution graphs (histogram/bar graph) of column data
def plotPerColumnDistribution(df, nGraphShown, nGraphPerRow):
    nunique = df.nunique()
    df = df[[col for col in df if nunique[col] > 1 and nunique[col] < 50]] # For displaying purposes, pick columns that have between 1 and 50 unique values
    nRow, nCol = df.shape
    columnNames = list(df)
    nGraphRow = (nCol + nGraphPerRow - 1) // nGraphPerRow  # integer row count; plt.subplot requires ints
    plt.figure(num = None, figsize = (6 * nGraphPerRow, 8 * nGraphRow), dpi = 80, facecolor = 'w', edgecolor = 'k')
    for i in range(min(nCol, nGraphShown)):
        plt.subplot(nGraphRow, nGraphPerRow, i + 1)
        columnDf = df.iloc[:, i]
        if (not np.issubdtype(type(columnDf.iloc[0]), np.number)):
            valueCounts = columnDf.value_counts()
            valueCounts.plot.bar()
        else:
            columnDf.hist()
        plt.ylabel('counts')
        plt.xticks(rotation = 90)
        plt.title(f'{columnNames[i]} (column {i})')
    plt.tight_layout(pad = 1.0, w_pad = 1.0, h_pad = 1.0)
    plt.show()

In [None]:
# Thanks to: https://www.kaggle.com/anastasiyamarch05/ammonium-prediction-in-river-water-starter-code
# Correlation matrix
def plotCorrelationMatrix(df, graphWidth):
    filename = getattr(df, 'dataframeName', 'DataFrame')  # label for the plot title, with a generic fallback
    df = df.dropna(axis='columns') # drop columns with NaN
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values
    if df.shape[1] < 2:
        print(f'No correlation plots shown: The number of non-NaN or constant columns ({df.shape[1]}) is less than 2')
        return
    corr = df.corr()
    plt.figure(num=None, figsize=(graphWidth, graphWidth), dpi=80, facecolor='w', edgecolor='k')
    corrMat = plt.matshow(corr, fignum = 1)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.gca().xaxis.tick_bottom()
    plt.colorbar(corrMat)
    plt.title(f'Correlation Matrix for {filename}', fontsize=15)
    plt.show()

In [None]:
# Thanks to: https://www.kaggle.com/anastasiyamarch05/ammonium-prediction-in-river-water-starter-code
# Scatter and density plots
def plotScatterMatrix(df, plotSize, textSize):
    df = df.select_dtypes(include =[np.number]) # keep only numerical columns
    # Remove rows and columns that would lead to df being singular
    df = df.dropna(axis='columns')
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values
    columnNames = list(df)
    if len(columnNames) > 10: # reduce the number of columns for matrix inversion of kernel density plots
        columnNames = columnNames[:10]
    df = df[columnNames]
    ax = pd.plotting.scatter_matrix(df, alpha=0.75, figsize=[plotSize, plotSize], diagonal='kde')
    corrs = df.corr().values
    for i, j in zip(*np.triu_indices_from(ax, k = 1)):
        ax[i, j].annotate('Corr. coef = %.3f' % corrs[i, j], (0.8, 0.2), xycoords='axes fraction', ha='center', va='center', size=textSize)
    plt.suptitle('Scatter and Density Plot')
    plt.show()

In [None]:
plotPerColumnDistribution(train0 , 10, 5)

In [None]:
# Thanks to: https://www.kaggle.com/anastasiyamarch05/ammonium-prediction-in-river-water-starter-code
plotCorrelationMatrix(train0, 8)

In [None]:
# Thanks to: https://www.kaggle.com/anastasiyamarch05/ammonium-prediction-in-river-water-starter-code
plotScatterMatrix(train0 , 20, 10)

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Thanks to: https://www.kaggle.com/anastasiyamarch05/ammonium-prediction-in-river-water-starter-code
nRowsRead = 1000 # specify 'None' if want to read whole file
# train.csv may have more rows in reality, but we are only loading/previewing the first 1000 rows
df2 = pd.read_csv('/kaggle/input/prediction-bod-in-river-water/train.csv', delimiter=',', nrows = nRowsRead)
df2.dataframeName = 'train.csv'
nRow, nCol = df2.shape
print(f'There are {nRow} rows and {nCol} columns')

In [None]:
df2.head(5)

In [None]:
# Thanks to: https://www.kaggle.com/anastasiyamarch05/ammonium-prediction-in-river-water-starter-code
plotPerColumnDistribution(df2, 10, 5)

In [None]:
# Thanks to: https://www.kaggle.com/anastasiyamarch05/ammonium-prediction-in-river-water-starter-code
plotCorrelationMatrix(df2, 8)

In [None]:
pp.ProfileReport(train0)

The analysis showed that many values are available only for stations 1 and 2, while the other stations have much less data. For this starter code, we therefore propose predicting BOD5 using only the data from the first two stations.
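
One way to check how complete each station column is, before deciding which ones to drop, is to compute the share of non-missing values per column. The sketch below uses a small made-up frame (`toy`, its values, and the 70% threshold are illustrative assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: columns '1'..'3' stand in for station columns.
toy = pd.DataFrame({
    'target': [5.1, 4.8, 6.0, 5.5],
    '1': [5.0, 4.9, 6.2, 5.4],
    '2': [4.7, np.nan, 5.9, 5.6],
    '3': [np.nan, np.nan, np.nan, 3.1],
})

# Share of non-missing values per column, from most to least complete.
coverage = toy.notna().mean().sort_values(ascending=False)
print(coverage)

# Keep columns with at least 70% coverage (an arbitrary threshold),
# then drop the rows that still contain missing values.
keep = coverage[coverage >= 0.7].index.tolist()
toy_clean = toy[keep].dropna()
print(toy_clean.shape)
```

The same `notna().mean()` call on `train0` would show which columns are too sparse to keep.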

In [None]:
train0 = train0.drop(['Id', '4', '5', '6','7'], axis = 1)
train0 = train0.dropna()
train0.info()

In [None]:
train0.head(3)

<a class="anchor" id="4"></a>
## 4. Preparing for modeling
##### [Back to Table of Contents](#0.1)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
target_name = 'target'

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
# For boosting model
train0b = train0
train_target0b = train0b[target_name]
train0b = train0b.drop([target_name], axis=1)
# Hold out a validation split that serves as the test set for model selection
trainb, testb, targetb, target_testb = train_test_split(train0b, train_target0b, test_size=valid_part, random_state=0)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
train_target0 = train0[target_name]
train0 = train0.drop([target_name], axis=1)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
#For models from Sklearn
scaler = StandardScaler()
train0 = pd.DataFrame(scaler.fit_transform(train0), columns = train0.columns)
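
For reference, `StandardScaler` rescales each column to zero mean and unit variance using statistics learned during `fit`, and `transform` reuses those same statistics on new data. A minimal NumPy sketch of that arithmetic, with made-up numbers (`train_col` and `new_col` are illustrative names, not part of this notebook):

```python
import numpy as np

train_col = np.array([1.0, 2.0, 3.0, 4.0])  # made-up training column
new_col = np.array([2.5, 5.0])              # made-up unseen values

# Statistics are estimated on the training column only.
mean = train_col.mean()
std = train_col.std()  # population standard deviation (ddof=0)

# The same mean and std are applied to both training and unseen data.
train_scaled = (train_col - mean) / std
new_scaled = (new_col - mean) / std

print(train_scaled)
print(new_scaled)
```

This is why the fitted `scaler` object has to be kept and applied to the test file later, rather than fitting a second scaler on it.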

In [None]:
train0.head(3)

In [None]:
len(train0)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
# Hold out a validation split that serves as the test set for model selection
train, test, target, target_test = train_test_split(train0, train_target0, test_size=valid_part, random_state=0)

In [None]:
train.head(3)

In [None]:
test.head(3)

In [None]:
train.info()

In [None]:
test.info()

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
acc_train_r2 = []
acc_test_r2 = []
acc_train_d = []
acc_test_d = []
acc_train_rmse = []
acc_test_rmse = []

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
def acc_d(y_meas, y_pred):
    # Relative error between predicted y_pred and measured y_meas values
    return mean_absolute_error(y_meas, y_pred)*len(y_meas)/sum(abs(y_meas))

def acc_rmse(y_meas, y_pred):
    # RMSE between predicted y_pred and measured y_meas values
    return (mean_squared_error(y_meas, y_pred))**0.5
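
As a quick check of what these two helpers measure, the sketch below recomputes both quantities directly with NumPy. The relative error reduces to the sum of absolute errors divided by the sum of absolute measured values, since the `len(y_meas)` factor cancels the averaging inside the MAE. The input values are made up for illustration:

```python
import numpy as np

y_meas = np.array([2.0, 4.0, 6.0, 8.0])  # made-up measured values
y_pred = np.array([2.5, 3.5, 6.0, 9.0])  # made-up predictions

# Relative error: total absolute error over total absolute measured value.
rel_err = np.abs(y_meas - y_pred).sum() / np.abs(y_meas).sum()

# RMSE: square root of the mean squared error.
rmse = np.sqrt(np.mean((y_meas - y_pred) ** 2))

print(rel_err)  # 2.0 / 20.0 = 0.1
print(rmse)     # sqrt(0.375)
```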

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
def acc_boosting_model(num,model,train,test,num_iteration=0):
    # Calculation of accuracy of boosting model by different metrics
    
    global acc_train_r2, acc_test_r2, acc_train_d, acc_test_d, acc_train_rmse, acc_test_rmse
    
    if num_iteration > 0:
        ytrain = model.predict(train, num_iteration = num_iteration)  
        ytest = model.predict(test, num_iteration = num_iteration)
    else:
        ytrain = model.predict(train)  
        ytest = model.predict(test)

    print('target = ', targetb[:5].values)
    print('ytrain = ', ytrain[:5])

    acc_train_r2_num = round(r2_score(targetb, ytrain) * 100, 2)
    print('acc(r2_score) for train =', acc_train_r2_num)   
    acc_train_r2.insert(num, acc_train_r2_num)

    acc_train_d_num = round(acc_d(targetb, ytrain) * 100, 2)
    print('acc(relative error) for train =', acc_train_d_num)   
    acc_train_d.insert(num, acc_train_d_num)

    acc_train_rmse_num = round(acc_rmse(targetb, ytrain) * 100, 2)
    print('acc(rmse) for train =', acc_train_rmse_num)   
    acc_train_rmse.insert(num, acc_train_rmse_num)

    print('target_test =', target_testb[:5].values)
    print('ytest =', ytest[:5])
    
    acc_test_r2_num = round(r2_score(target_testb, ytest) * 100, 2)
    print('acc(r2_score) for test =', acc_test_r2_num)
    acc_test_r2.insert(num, acc_test_r2_num)
    
    acc_test_d_num = round(acc_d(target_testb, ytest) * 100, 2)
    print('acc(relative error) for test =', acc_test_d_num)
    acc_test_d.insert(num, acc_test_d_num)
    
    acc_test_rmse_num = round(acc_rmse(target_testb, ytest) * 100, 2)
    print('acc(rmse) for test =', acc_test_rmse_num)
    acc_test_rmse.insert(num, acc_test_rmse_num)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
def acc_model(num,model,train,test):
    # Calculation of accuracy of model from Sklearn by different metrics
  
    global acc_train_r2, acc_test_r2, acc_train_d, acc_test_d, acc_train_rmse, acc_test_rmse
    
    ytrain = model.predict(train)  
    ytest = model.predict(test)

    print('target = ', target[:5].values)
    print('ytrain = ', ytrain[:5])

    acc_train_r2_num = round(r2_score(target, ytrain) * 100, 2)
    print('acc(r2_score) for train =', acc_train_r2_num)   
    acc_train_r2.insert(num, acc_train_r2_num)

    acc_train_d_num = round(acc_d(target, ytrain) * 100, 2)
    print('acc(relative error) for train =', acc_train_d_num)   
    acc_train_d.insert(num, acc_train_d_num)

    acc_train_rmse_num = round(acc_rmse(target, ytrain) * 100, 2)
    print('acc(rmse) for train =', acc_train_rmse_num)   
    acc_train_rmse.insert(num, acc_train_rmse_num)

    print('target_test =', target_test[:5].values)
    print('ytest =', ytest[:5])
    
    acc_test_r2_num = round(r2_score(target_test, ytest) * 100, 2)
    print('acc(r2_score) for test =', acc_test_r2_num)
    acc_test_r2.insert(num, acc_test_r2_num)
    
    acc_test_d_num = round(acc_d(target_test, ytest) * 100, 2)
    print('acc(relative error) for test =', acc_test_d_num)
    acc_test_d.insert(num, acc_test_d_num)
    
    acc_test_rmse_num = round(acc_rmse(target_test, ytest) * 100, 2)
    print('acc(rmse) for test =', acc_test_rmse_num)
    acc_test_rmse.insert(num, acc_test_rmse_num)

<a class="anchor" id="5"></a>
## 5. Tuning models and test for all features
##### [Back to Table of Contents](#0.1)

<a class="anchor" id="5.1"></a>
### 5.1 Stochastic Gradient Descent
##### [Back to Table of Contents](#0.1)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
# Stochastic Gradient Descent

sgd = SGDRegressor()
sgd.fit(train, target)
acc_model(1,sgd,train,test)

<a class="anchor" id="5.2"></a>
### 5.2 Decision Tree Regressor
##### [Back to Table of Contents](#0.1)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
# Decision Tree Regression

decision_tree = DecisionTreeRegressor()
decision_tree.fit(train, target)
acc_model(2,decision_tree,train,test)

<a class="anchor" id="5.3"></a>
### 5.3 GradientBoostingRegressor with HyperOpt
##### [Back to Table of Contents](#0.1)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
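def hyperopt_gb_score(params):
    # Objective for hyperopt: fmin minimizes the returned value,
    # so return the negated mean cross-validated R2 score
    clf = GradientBoostingRegressor(**params)
    current_score = cross_val_score(clf, train, target, cv=10).mean()
    print(current_score, params)
    return -current_score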
 
space_gb = {
            'n_estimators': hp.choice('n_estimators', range(100, 1000)),
            'max_depth': hp.choice('max_depth', np.arange(2, 10, dtype=int))            
        }
 
best = fmin(fn=hyperopt_gb_score, space=space_gb, algo=tpe.suggest, max_evals=10)
print('best:')
print(best)
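
Note that hyperopt's `fmin` minimizes the value returned by the objective, so a score where higher is better (such as the R2 that `cross_val_score` returns for a regressor by default) needs to be negated before it is returned. The sketch below illustrates the sign convention with Python's built-in `min` as a stand-in for `fmin`. The candidate names and scores are made up:

```python
# Hypothetical cross-validated R2 scores for three candidate settings.
cv_r2 = {'max_depth=2': 0.61, 'max_depth=4': 0.74, 'max_depth=8': 0.58}

# A minimizer applied to the raw score selects the worst candidate...
worst_name = min(cv_r2, key=lambda name: cv_r2[name])

# ...while minimizing the negated score selects the best one.
best_name = min(cv_r2, key=lambda name: -cv_r2[name])

print(worst_name)  # max_depth=8
print(best_name)   # max_depth=4
```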

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
params = space_eval(space_gb, best)
params

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
# Gradient Boosting Regression

gradient_boosting = GradientBoostingRegressor(**params)
gradient_boosting.fit(train, target)
acc_model(3,gradient_boosting,train,test)

<a class="anchor" id="6"></a>
## 6. Models comparison 
##### [Back to Table of Contents](#0.1)

We can now compare our models and choose the best one for our problem.

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
models = pd.DataFrame({
    'Model': ['Stochastic Gradient Descent', 
              'Decision Tree Regressor', 
              'GradientBoostingRegressor'],
    
    'r2_train': acc_train_r2,
    'r2_test': acc_test_r2,
    'd_train': acc_train_d,
    'd_test': acc_test_d,
    'rmse_train': acc_train_rmse,
    'rmse_test': acc_test_rmse
                     })

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
pd.options.display.float_format = '{:,.2f}'.format

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
print('Prediction accuracy for models by R2 criterion - r2_test')
models.sort_values(by=['r2_test', 'r2_train'], ascending=False)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
print('Prediction accuracy for models by relative error - d_test')
models.sort_values(by=['d_test', 'd_train'], ascending=True)

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
print('Prediction accuracy for models by RMSE - rmse_test')
models.sort_values(by=['rmse_test', 'rmse_train'], ascending=True)
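
The sorts above rank the models by one metric at a time, and the top model can differ between metrics. The sketch below shows how the winner per metric could be pulled out programmatically. The frame `demo` and all of its numbers are placeholders, not results from this notebook:

```python
import pandas as pd

# Placeholder metrics for three hypothetical models (not real results).
demo = pd.DataFrame({
    'Model': ['A', 'B', 'C'],
    'r2_test': [55.0, 40.0, 62.0],  # higher is better
    'd_test': [18.0, 25.0, 20.0],   # lower is better
})

# Highest R2 on the test split.
best_by_r2 = demo.sort_values('r2_test', ascending=False).iloc[0]['Model']

# Lowest relative error on the test split.
best_by_d = demo.sort_values('d_test', ascending=True).iloc[0]['Model']

print(best_by_r2, best_by_d)
```

In this made-up case the two criteria disagree, which is why the notebook reports all three metrics side by side.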

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
# Plot
plt.figure(figsize=[25,6])
xx = models['Model']
plt.tick_params(labelsize=14)
plt.plot(xx, models['r2_train'], label = 'r2_train')
plt.plot(xx, models['r2_test'], label = 'r2_test')
plt.legend()
plt.title('R2-criterion for 3 models for train and test datasets')
plt.xlabel('Models')
plt.ylabel('R2-criterion, %')
plt.xticks(xx, rotation='vertical')
plt.savefig('graph.png')
plt.show()

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
# Plot
plt.figure(figsize=[25,6])
xx = models['Model']
plt.tick_params(labelsize=14)
plt.plot(xx, models['d_train'], label = 'd_train')
plt.plot(xx, models['d_test'], label = 'd_test')
plt.legend()
plt.title('Relative errors for 3 models for train and test datasets')
plt.xlabel('Models')
plt.ylabel('Relative error, %')
plt.xticks(xx, rotation='vertical')
plt.savefig('graph.png')
plt.show()

In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
# Plot
plt.figure(figsize=[25,6])
xx = models['Model']
plt.tick_params(labelsize=14)
plt.plot(xx, models['rmse_train'], label = 'rmse_train')
plt.plot(xx, models['rmse_test'], label = 'rmse_test')
plt.legend()
plt.title('RMSE for 3 models for train and test datasets')
plt.xlabel('Models')
plt.ylabel('RMSE, %')
plt.xticks(xx, rotation='vertical')
plt.savefig('graph.png')
plt.show()

Thus, the best model by RMSE is the one with the lowest `rmse_test` among the three models compared above.

<a class="anchor" id="7"></a>
## 7. Prediction
##### [Back to Table of Contents](#0.1)


In [None]:
# Thanks to: https://www.kaggle.com/vbmokin/bod-prediction-in-river-15-regression-models
testn = pd.read_csv('/kaggle/input/prediction-bod-in-river-water/test.csv')
testn.info()

In [None]:
testn = testn.drop(['Id','4', '5','6','7'], axis = 1)
testn.head(3)

In [None]:
#For models from Sklearn
testn = pd.DataFrame(scaler.transform(testn), columns = testn.columns)

In [None]:
# Stochastic Gradient Descent: refit on all training data, then predict for the test file
# (assumes testn has no missing values in the remaining columns)
sgd.fit(train0, train_target0)
sgd.predict(testn)[:3]

In [None]:
# Decision Tree Regression
decision_tree.fit(train0, train_target0)
decision_tree.predict(testn)[:3]

In [None]:
# Gradient Boosting Regression
gradient_boosting.fit(train0, train_target0)
gradient_boosting.predict(testn)[:3]