<h1><center>Water quality</center></h1>

## CatBoost. Cross-validation

**Importing libraries**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier, cv, Pool
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
SEED = 42

**Reading dataset**

In [None]:
df = pd.read_csv('../input/water-potability/water_potability.csv')
df.head()

**Correlation matrix**

In [None]:
sns.heatmap(df.corr())

**Correlation of potability versus other features**

In [None]:
df.corr()['Potability']

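As we can see, the correlation coefficients between `Potability` and the other features are all low, so no single feature is a strong linear predictor of the target. From this we can conclude that we should collect more data.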

**Histogram of potability**

In [None]:
df['Potability'].hist(bins=2)

The histogram shows that the dataset is imbalanced, so we should calculate `class_weights`. Following the tip in the [CatBoost docs](https://catboost.ai/docs/concepts/python-reference_parameters-list.html), I calculate the weights with this formula:  
<center>$\large\begin{array}{l}
{w_0} = 1;\\
{w_1} = \frac{{sum\_negative}}{{sum\_positive}}.
\end{array}$</center>

In [None]:
pots = df['Potability']
w0 = 1
w1 = pots[pots == 0].count() / pots[pots == 1].count()
class_weights = [w0, w1]
class_weights
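As a worked example of the formula, here is a minimal sketch on made-up labels (the counts below are illustrative and are not taken from the water potability dataset). With this weighting the total weight of each class comes out equal, which is the point of the correction.

```python
# Made-up labels for illustration only: 6 negatives (0) and 4 positives (1)
labels = [0] * 6 + [1] * 4

sum_negative = labels.count(0)
sum_positive = labels.count(1)

w0 = 1
w1 = sum_negative / sum_positive   # 6 / 4 = 1.5
class_weights = [w0, w1]

# After weighting, both classes contribute the same total weight
total_weight_negative = w0 * sum_negative
total_weight_positive = w1 * sum_positive

print(class_weights)                                  # [1, 1.5]
print(total_weight_negative, total_weight_positive)   # 6 6.0
```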

Because the dataset is imbalanced, a single train/test split could misrepresent the class ratio, so we use _cross-validation_ to evaluate the model.

**Cross-validation**

In [None]:
cv_pool = Pool(df.drop('Potability', axis=1), df['Potability'])

params = {'iterations': 500,
          'loss_function': 'Logloss',
          'eval_metric': 'F1',
          'class_weights': class_weights,
          'roc_file': 'roc_file',
          'random_seed': SEED,  # use the seed defined above for reproducibility
          'verbose': False}

**Grid search**

Let's search for the best parameters for training.

In [None]:
model = CatBoostClassifier()
model.set_params(**params)

grid = {'learning_rate': [0.03, 0.1],
        'depth': [4, 6, 10],
        'l2_leaf_reg': [1, 3, 5, 7, 9]}

grid_search_result = model.grid_search(grid, 
                                       cv_pool,
                                       verbose=False,
                                       plot=False)
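To get a feel for the size of this search, the sketch below enumerates every combination in the grid with the standard library. It only illustrates what an exhaustive grid covers; it is not how CatBoost implements `grid_search` internally, and the `combinations` name is introduced here for illustration.

```python
from itertools import product

grid = {'learning_rate': [0.03, 0.1],
        'depth': [4, 6, 10],
        'l2_leaf_reg': [1, 3, 5, 7, 9]}

# Cartesian product of all value lists: one dict per candidate configuration
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

print(len(combinations))   # 2 * 3 * 5 = 30
print(combinations[0])     # {'learning_rate': 0.03, 'depth': 4, 'l2_leaf_reg': 1}
```

Each of these 30 candidates has to be trained and evaluated, so adding values to the grid increases the running time multiplicatively.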

Add the best parameter values to `params`.

In [None]:
params.update(grid_search_result['params'])

Then run cross-validation.

In [None]:
scores = cv(cv_pool,
            params,
            fold_count=5, 
            verbose=False)
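With an imbalanced target it matters that every fold keeps roughly the same class ratio as the full dataset; `cv` has a `stratified` argument for this. The sketch below shows the idea on made-up labels. The `stratified_folds` helper is hypothetical and deliberately simplified (no shuffling); it is not CatBoost's implementation.

```python
from collections import defaultdict

def stratified_folds(labels, fold_count):
    """Assign sample indices to folds so that each class is spread evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(fold_count)]
    # Deal the indices of each class round-robin across the folds
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % fold_count].append(index)
    return folds

# Made-up labels for illustration: 15 negatives and 10 positives (40% positive)
labels = [0] * 15 + [1] * 10
folds = stratified_folds(labels, fold_count=5)

for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)   # each fold: 5 samples, 2 of them positive
```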

**ROC curve**

In [None]:
roc_curve = pd.read_csv('catboost_info/roc_file', sep='\t')

plt.plot(roc_curve['FPR'], roc_curve['TPR'])

plt.title('ROC-curve')
plt.xlabel('FPR')
plt.ylabel('TPR')

plt.show()

The ROC curve shows that the classifier is of low quality. Let's calculate the _AUC_ (Area Under the Curve) to quantify this.

In [None]:
from sklearn.metrics import auc  # trapezoidal area under the (FPR, TPR) points
print('AUC:', auc(roc_curve['FPR'], roc_curve['TPR']))

The AUC is above 0.5, but not by enough to call this a good classifier.
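The area under a ROC curve can be computed from its points with the trapezoidal rule. Below is a minimal sketch on made-up points, using a hypothetical helper `trapezoid_auc`. It also shows that a plain average of the TPR values is a different quantity when the FPR values are unevenly spaced, which is usually the case for a ROC curve.

```python
def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points.

    Assumes fpr is sorted in non-decreasing order.
    """
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2
        area += width * mean_height
    return area

# Made-up ROC points for illustration, unevenly spaced along the FPR axis
fpr = [0.0, 0.1, 0.2, 1.0]
tpr = [0.0, 0.5, 0.8, 1.0]

auc_trapezoid = trapezoid_auc(fpr, tpr)
mean_tpr = sum(tpr) / len(tpr)

print(round(auc_trapezoid, 3))   # 0.81
print(round(mean_tpr, 3))        # 0.575
```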

**Training**

In [None]:
model = CatBoostClassifier()
model.set_params(**params)
model.fit(cv_pool, silent=True)

In [None]:
print('Logloss:', scores['test-Logloss-mean'].iloc[-1])
print('F1-score:', scores['test-F1-mean'].iloc[-1])

**Conclusions**  
The features are almost uncorrelated with the target, so the resulting classifier is of low quality. I think it is necessary to collect more data before machine learning algorithms can be applied usefully.