# Table of Contents



 1. [Load and Check Data](#1)
 1. [Feature Description](#2)
     * [Count of Data Types and Information](#3)
     * [Data Visualization](#4)
 1. [Missing Values](#5)
 1. [Filling Values](#6)
 1. [Detect Outlier](#7)
 1. [Correlation Matrix](#8)
 1. [Binning](#9)
 1. [Feature Analysis](#10)
     * [(x) - Potability](#11)
     * [ph-Sulfate-Potability](#12)
     * [ph-Hardness-Potability](#13)
     * [ph-Solids-Potability](#14)
     * [ph-Conductivity-Potability](#15)
     * [ph-Organic_carbon-Potability](#16)
     * [ph-Chloramines-Potability](#17)
     * [ph-Trihalomethanes-Potability](#18)
     * [ph-Turbidity-Potability](#19)
 1. [Feature Engineering](#20)
 1. [Dummies Section](#21)
 1. [Data Transforming](#22)
 1. [Modeling](#23)
     * [Import Library](#24)
     * [Train-Test Split](#25)
     * [Machine Learning Algorithms](#26)
         * [Random Forest](#27)
         * [Artificial Neural Network](#28)
         * [Gradient Boosting](#29)
         * [Cat Boost](#30)
     * [Classifiers and Parameters](#31)
     * [Optimization](#32)
         * [Grid Search and Cross Validation](#33)
         * [Ensemble Modeling](#34)
     * [Prediction](#35)
 1. [Submission](#36)
<br></br>

## Insight List:
* [Insight-1](#101)
* [Insight-2](#102)
* [Insight-3](#103)
* [Insight-4](#104)
* [Insight-5](#105)
* [Insight-6](#106)
* [Insight-7](#107)
* [Insight-8](#108)
* [Insight-9-10-11-12-13-14](#109)
* [Insight-15-16-17-18](#115)
* [Insight-19-20-21-22](#119)
* [Insight-23-24-25-26](#123)
* [Insight-27-28-29](#127)
* [Insight-30-31-32-33](#130)
* [Insight-34-35-36-37](#134)
* [Insight-38-39](#138)


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id='1'></a>

# Load and Check Data

In [None]:
data = pd.read_csv('/kaggle/input/water-potability/water_potability.csv')

In [None]:
data.info()

<a id='101'></a>
### Insight-1

There are missing values in the **'ph', 'Sulfate', and 'Trihalomethanes'** columns.

In [None]:
data.head()

In [None]:
data.describe()

<a id='102'></a>
### Insight-2:
Standardization, scaling, or binning is required because the maximum values of the features differ greatly in scale.
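
As a rough illustration of why scaling matters, the sketch below standardizes two made-up columns whose ranges differ by several orders of magnitude. The numbers are invented for the example and are not taken from the dataset; `standardize` is a hypothetical helper.

```python
import numpy as np

# Hypothetical values: one column in the tens of thousands, one in single digits
solids = np.array([20000.0, 22000.0, 18000.0, 24000.0])
turbidity = np.array([3.0, 4.0, 5.0, 4.0])

def standardize(x):
    """Return z-scores: subtract the mean, divide by the population standard deviation."""
    return (x - x.mean()) / x.std()

solids_z = standardize(solids)
turbidity_z = standardize(turbidity)

# After standardization both columns have mean 0 and standard deviation 1
print(solids_z.round(3), turbidity_z.round(3))
```

After this transformation neither column dominates the other purely because of its unit.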

<a id='2'></a>
# Feature Description

This chapter describes each feature and counts the features by data type.

<a id='3'></a>
### Count of Data Types and Information

* float: (9)
    * **ph              :** a measure of how acidic or basic the water is (pH, the "power of hydrogen")
    * **Hardness        :** amount of dissolved calcium and magnesium in the water
    * **Solids          :** total dissolved solids (fresh, brackish, saline water)
    * **Chloramines     :** disinfectants used to treat drinking water
    * **Sulfate         :** a salt that forms when sulfuric acid reacts with another chemical
    * **Conductivity    :** the water's ability to conduct electricity
    * **Organic_carbon  :** a measure of the carbon contained in organic compounds in the water
    * **Trihalomethanes :** chemical compounds in which three of the four hydrogen atoms of methane (CH4) are replaced by halogen atoms
    * **Turbidity       :** the measure of relative clarity of water
* int: (1)
    * **Potability      :** whether the water is suitable for drinking (1) or not (0)

<a id='103'></a>
### Insight-3
The dataset consists of continuous (decimal) variables. Therefore, their relationship with the label is hard to plot without grouping the values first.

In [None]:
data.columns

<a id='4'></a>
### Data Visualization

In [None]:
def feature_plot(df,feature):
    #set plot size
    plt.figure(figsize=(7,2))

    #draw plot
    plt.hist(df[feature])

    #set plot labels
    y_label = 'Frequency'
    plt.xlabel(feature)
    plt.ylabel(y_label)

    #show plot
    plt.show()
    
feature_list = ['ph', 'Hardness', 'Solids', 'Chloramines', 'Sulfate', 'Conductivity','Organic_carbon', 'Trihalomethanes', 'Turbidity', 'Potability']

for i in feature_list:
    feature_plot(data,i)

<a id='104'></a>
### Insight-4

The dataset is imbalanced, and most feature values are concentrated around the median. Additionally, there are more non-potable samples than potable ones.

<a id='5'></a>
# Missing Values

This section checks whether the dataset contains missing values.

In [None]:
data.isnull().sum()

<a id='6'></a>
# Filling Values

Missing values are filled with the mean of the column, computed separately for each **Potability** class.
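
A minimal sketch of class-wise mean imputation on a toy frame is shown here. The numbers are made up; only the `groupby(...).transform('mean')` pattern mirrors the cell that follows.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up numbers) with one missing 'ph' value per class
toy = pd.DataFrame({
    'ph': [6.0, 8.0, np.nan, 7.0, 9.0, np.nan],
    'Potability': [0, 0, 0, 1, 1, 1],
})

# transform('mean') returns a Series aligned with the original rows,
# holding the mean of each row's own Potability group (NaN values are skipped)
group_mean = toy.groupby('Potability')['ph'].transform('mean')
toy['ph'] = toy['ph'].fillna(group_mean)

print(toy)
```

Because the fill value depends on `Potability`, this approach only works for rows whose label is known.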

In [None]:
data.ph = data.ph.fillna(data.groupby(['Potability'])['ph'].transform('mean'))
data.Sulfate=data.Sulfate.fillna(data.groupby(['Potability'])['Sulfate'].transform('mean'))
data.Trihalomethanes =data.Trihalomethanes.fillna(data.groupby(['Potability'])['Trihalomethanes'].transform('mean'))

In [None]:
data.head()

In [None]:
#Check again

data.isnull().sum()

<a id='7'></a>
# Detect Outlier
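
The cell below applies the interquartile-range (IQR) rule to each feature and drops the rows that are flagged as outliers in more than two features. The rule itself can be sketched on a small made-up array:

```python
import numpy as np

# Made-up sample with one obviously extreme value
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

q1 = np.percentile(values, 25)   # lower quartile
q3 = np.percentile(values, 75)   # upper quartile
iqr = q3 - q1                    # interquartile range

# Fences: anything further than 1.5 * IQR from the quartiles counts as an outlier
lower_rule = q1 - 1.5 * iqr
upper_rule = q3 + 1.5 * iqr

# Positions of the values that fall outside the fences
outlier_positions = np.where((values < lower_rule) | (values > upper_rule))[0]
print(lower_rule, upper_rule, outlier_positions)
```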

In [None]:
from collections import Counter

#drop the label 
outlier_list = ['ph', 'Hardness', 'Solids', 'Chloramines', 'Sulfate', 'Conductivity','Organic_carbon', 'Trihalomethanes', 'Turbidity'] 

def detect_outlier(df,feature):
    
    outlier_indices = []
    
    for f in feature:
        
        #lower quartile
        q1 = np.percentile(df[f],25)
        
        #upper quartile
        q3 = np.percentile(df[f],75)
        
        #interquartile range
        iqr = q3-q1
        
        #with coefficient
        with_coef = 1.5*iqr
        
        #detect outlier(s)
        lower_rule = q1-with_coef
        upper_rule = q3+with_coef
        
        outlier_variable = df[(df[f]<lower_rule) | (df[f]>upper_rule)].index
        outlier_indices.extend(outlier_variable)
    
    #count how often each row was flagged; keep rows that are outliers in more than 2 features
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(i for i,v in outlier_indices.items() if v>2)
    
    return multiple_outliers

 
#drop outliers
data = data.drop(detect_outlier(data,outlier_list),axis=0).reset_index(drop=True)

<a id='8'></a>
# Correlation Matrix

In [None]:
plt.figure(figsize=(9,3))
sns.heatmap(data[feature_list].corr(),annot=True,fmt='.2f')
plt.show()

In [None]:
data.head()

<a id='105'></a>
### Insight-5
The features show no notable linear correlation with one another.

In [None]:
data.info()

<a id='9'></a>
# Binning

In [None]:
data.tail()

Each continuous feature is split into four ordinal groups (1 to 4) based on its quartiles.
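
For comparison, quartile binning can also be sketched with `pd.qcut`. This is a swapped-in shortcut on made-up values, not the method used in the next cell, which compares each value against the quartiles explicitly; the two may treat values that fall exactly on a bin edge differently.

```python
import pandas as pd

# Made-up continuous values
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# q=4 cuts at the 25th, 50th and 75th percentiles; labels name the groups 1..4
groups = pd.qcut(s, q=4, labels=[1, 2, 3, 4]).astype(int)

print(groups.tolist())
```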

In [None]:
ftr_list = ['ph','Hardness','Solids','Chloramines', 'Sulfate', 'Conductivity','Organic_carbon', 'Trihalomethanes', 'Turbidity'] 

for f in ftr_list:

    #lower quartile
    q1 = np.percentile(data[f],25)

    #median
    q2 = data[f].median()

    #upper quartile
    q3 = np.percentile(data[f],75)

    #assign quartile groups 1-4; the maximum value belongs to group 4
    data[f] = [4 if i>=q3 else 3 if i>=q2 else 2 if i>=q1 else 1 for i in data[f]]

In [None]:
data.head()

<a id='10'></a>
# Feature Analysis

This chapter visualizes the binned features with categorical bar plots and histograms.

<a id='11'></a>
### (x) - Potability

In [None]:
def feature_analysis(df,x,y):
    g = sns.catplot(x=x,y=y,data=df,kind='bar',height=4)
    plt.show()
    
feature_list = ['ph', 'Hardness', 'Solids', 'Chloramines', 'Sulfate', 'Conductivity','Organic_carbon', 'Trihalomethanes', 'Turbidity'] 
label = 'Potability'

for i in feature_list:
    feature_analysis(data,i,label)

<a id='106'></a>
### Insight-6

Water whose **'pH'** value is in the second group is more often potable than water in the other groups, so the second group can be flagged on its own.

<a id='107'></a>
### Insight-7

Water whose **'Hardness'** value is in the first or fourth group is more often potable than water in the other groups, so these two groups can be flagged together.

<a id='108'></a>
### Insight-8

Water whose **'Sulfate'** value is in the first, second, or fourth group is more often potable than water in the third group, so these three groups can be flagged together.

<a id='12'></a>
### ph-Sulfate-Potability

In [None]:
g = sns.FacetGrid(data,col='Potability',row='ph',height=3)
g.map(plt.hist,'Sulfate',bins=25)
g.add_legend()
plt.show()

<a id='109'></a>
### Insight-9-10-11-12-13-14:
* **(9)** If the **'pH'** group is 1 and the **'Sulfate'** group is 1, 25% of the samples are potable and 75% are not.
* **(10)** If the **'pH'** group is 1 and the **'Sulfate'** group is 3, 16% of the samples are potable and 84% are not.
* **(11)** If the **'pH'** group is 2 and the **'Sulfate'** group is 2, 25% of the samples are potable and 75% are not.
* **(12)** If the **'pH'** group is 3 and the **'Sulfate'** group is 3, 16% of the samples are potable and 84% are not.
* **(13)** If the **'pH'** group is 4 and the **'Sulfate'** group is 3, 20% of the samples are potable and 80% are not.
* **(14)** If the **'pH'** group is 4 and the **'Sulfate'** group is 4, 25% of the samples are potable and 75% are not.

<a id='13'></a>
### ph-Hardness-Potability

In [None]:
g = sns.FacetGrid(data,col='Potability',row='ph',height=3)
g.map(plt.hist,'Hardness',bins=25)
g.add_legend()
plt.show()

<a id='115'></a>
### Insight-15-16-17-18:

* **(15)** If the **'pH'** group is 3 and the **'Hardness'** group is 2, 22% of the samples are potable and 78% are not.
* **(16)** If the **'pH'** group is 3 and the **'Hardness'** group is 4, 20% of the samples are potable and 80% are not.
* **(17)** If the **'pH'** group is 4 and the **'Hardness'** group is 3, 27% of the samples are potable and 73% are not.
* **(18)** If the **'pH'** group is 4 and the **'Hardness'** group is 4, 33% of the samples are potable and 67% are not.

<a id='14'></a>
### ph-Solids-Potability

In [None]:
g = sns.FacetGrid(data,col='Potability',row='ph',height=3)
g.map(plt.hist,'Solids',bins=25)
g.add_legend()
plt.show()

<a id='119'></a>
### Insight-19-20-21-22:
* **(19)** If the **'pH'** group is 1 and the **'Solids'** group is 4, 27% of the samples are potable and 73% are not.
* **(20)** If the **'pH'** group is 3 and the **'Solids'** group is 1, 21% of the samples are potable and 78% are not.
* **(21)** If the **'pH'** group is 3 and the **'Solids'** group is 2, 25% of the samples are potable and 75% are not.
* **(22)** If the **'pH'** group is 3 and the **'Solids'** group is 3, 24% of the samples are potable and 76% are not.

<a id='15'></a>
### ph-Conductivity-Potability

In [None]:
g = sns.FacetGrid(data,col='Potability',row='ph',height=3)
g.map(plt.hist,'Conductivity',bins=25)
g.add_legend()
plt.show()

<a id='123'></a>
### Insight-23-24-25-26:
* **(23)** If the **'pH'** group is 3 and the **'Conductivity'** group is 1, 27% of the samples are potable and 73% are not.
* **(24)** If the **'pH'** group is 3 and the **'Conductivity'** group is 2, 25% of the samples are potable and 75% are not.
* **(25)** If the **'pH'** group is 3 and the **'Conductivity'** group is 3, 29% of the samples are potable and 71% are not.
* **(26)** If the **'pH'** group is 3 and the **'Conductivity'** group is 4, 28% of the samples are potable and 72% are not.

<a id='16'></a>
### ph-Organic_carbon-Potability

In [None]:
g = sns.FacetGrid(data,col='Potability',row='ph',height=3)
g.map(plt.hist,'Organic_carbon',bins=25)
g.add_legend()
plt.show()

<a id='127'></a>
### Insight-27-28-29:
* **(27)** If the **'pH'** group is 3 and the **'Organic_carbon'** group is 1, 27% of the samples are potable and 73% are not.
* **(28)** If the **'pH'** group is 3 and the **'Organic_carbon'** group is 2, 30% of the samples are potable and 70% are not.
* **(29)** If the **'pH'** group is 3 and the **'Organic_carbon'** group is 3, 33% of the samples are potable and 67% are not.

<a id='17'></a>
### ph-Chloramines-Potability

In [None]:
g = sns.FacetGrid(data,col='Potability',row='ph',height=3)
g.map(plt.hist,'Chloramines',bins=25)
g.add_legend()
plt.show()

<a id='130'></a>
### Insight-30-31-32-33:
* **(30)** If the **'pH'** group is 1 and the **'Chloramines'** group is 4, 30% of the samples are potable and 70% are not.
* **(31)** If the **'pH'** group is 3 and the **'Chloramines'** group is 1, 27% of the samples are potable and 73% are not.
* **(32)** If the **'pH'** group is 3 and the **'Chloramines'** group is 2, 25% of the samples are potable and 75% are not.
* **(33)** If the **'pH'** group is 4 and the **'Chloramines'** group is 1, 27% of the samples are potable and 73% are not.

<a id='18'></a>
### ph-Trihalomethanes-Potability

In [None]:
g = sns.FacetGrid(data,col='Potability',row='ph',height=3)
g.map(plt.hist,'Trihalomethanes',bins=25)
g.add_legend()
plt.show()

<a id='134'></a>
### Insight-34-35-36-37:
* **(34)** If the **'pH'** group is 3 and the **'Trihalomethanes'** group is 1, 25% of the samples are potable and 75% are not.
* **(35)** If the **'pH'** group is 3 and the **'Trihalomethanes'** group is 2, 29% of the samples are potable and 71% are not.
* **(36)** If the **'pH'** group is 3 and the **'Trihalomethanes'** group is 3, 30% of the samples are potable and 70% are not.
* **(37)** If the **'pH'** group is 3 and the **'Trihalomethanes'** group is 4, 33% of the samples are potable and 67% are not.

<a id='19'></a>
### ph-Turbidity-Potability

In [None]:
g = sns.FacetGrid(data,col='Potability',row='ph',height=3)
g.map(plt.hist,'Turbidity',bins=25)
g.add_legend()
plt.show()

<a id='138'></a>
### Insight-38-39:
* **(38)** If the **'pH'** group is 3 and the **'Turbidity'** group is 1, 25% of the samples are potable and 75% are not.
* **(39)** If the **'pH'** group is 3 and the **'Turbidity'** group is 2, 29% of the samples are potable and 71% are not.

<a id='20'></a>
# Feature Engineering

#### [According to Insight-6:](#106)


In [None]:
data['ph_in']=[1 if i == 2 else 0 for i in data.ph]

#### [According to Insight-7:](#107)


In [None]:
data['Hardness_in']=[1 if i == 1 or i == 4 else 0 for i in data.Hardness]

#### [According to Insight-8:](#108)


In [None]:
data['Sulfate_in']=[0 if i == 3 else 1 for i in data.Sulfate]

#### [According to Insight-9-10-11-12-13-14:](#109)
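
The flags in the next cell mark rows where two binned features take a specific pair of group values. A vectorized way to build such a flag is sketched below on made-up data; `pair_flag` is a hypothetical helper and not part of the notebook.

```python
import pandas as pd

# Toy binned data (group numbers are made up)
toy = pd.DataFrame({
    'ph':      [1, 1, 2, 3, 4],
    'Sulfate': [1, 3, 2, 3, 4],
})

def pair_flag(df, col_a, val_a, col_b, val_b):
    """Hypothetical helper: 1 where df[col_a]==val_a and df[col_b]==val_b, else 0."""
    return ((df[col_a] == val_a) & (df[col_b] == val_b)).astype(int)

toy['ph_Sulfate_11'] = pair_flag(toy, 'ph', 1, 'Sulfate', 1)
toy['ph_Sulfate_33'] = pair_flag(toy, 'ph', 3, 'Sulfate', 3)

print(toy)
```

Comparing whole columns at once avoids the row-by-row indexing and does not depend on the index running from 0 to `len(data) - 1`.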


In [None]:
data['ph_Sulfate_11']=[1 if data['ph'][i]==1 and data['Sulfate'][i]==1 else 0 for i in range(len(data['ph']))]
data['ph_Sulfate_13']=[1 if data['ph'][i]==1 and data['Sulfate'][i]==3 else 0 for i in range(len(data['ph']))]
data['ph_Sulfate_22']=[1 if data['ph'][i]==2 and data['Sulfate'][i]==2 else 0 for i in range(len(data['ph']))]
data['ph_Sulfate_33']=[1 if data['ph'][i]==3 and data['Sulfate'][i]==3 else 0 for i in range(len(data['ph']))]
data['ph_Sulfate_43']=[1 if data['ph'][i]==4 and data['Sulfate'][i]==3 else 0 for i in range(len(data['ph']))]
data['ph_Sulfate_44']=[1 if data['ph'][i]==4 and data['Sulfate'][i]==4 else 0 for i in range(len(data['ph']))]

#### [According to Insight-15-16-17-18:](#115)


In [None]:
data['ph_Hardness_32']=[1 if data['ph'][i]==3 and data['Hardness'][i]==2 else 0 for i in range(len(data['ph']))]
data['ph_Hardness_34']=[1 if data['ph'][i]==3 and data['Hardness'][i]==4 else 0 for i in range(len(data['ph']))]
data['ph_Hardness_43']=[1 if data['ph'][i]==4 and data['Hardness'][i]==3 else 0 for i in range(len(data['ph']))]
data['ph_Hardness_44']=[1 if data['ph'][i]==4 and data['Hardness'][i]==4 else 0 for i in range(len(data['ph']))]

#### [According to Insight-19-20-21-22:](#119)

In [None]:
data['ph_Solids_14']=[1 if data['ph'][i]==1 and data['Solids'][i]==4 else 0 for i in range(len(data['ph']))]
data['ph_Solids_31']=[1 if data['ph'][i]==3 and data['Solids'][i]==1 else 0 for i in range(len(data['ph']))]
data['ph_Solids_32']=[1 if data['ph'][i]==3 and data['Solids'][i]==2 else 0 for i in range(len(data['ph']))]
data['ph_Solids_33']=[1 if data['ph'][i]==3 and data['Solids'][i]==3 else 0 for i in range(len(data['ph']))]

#### [According to Insight-23-24-25-26:](#123)

In [None]:
data['ph_Conductivity_31']=[1 if data['ph'][i]==3 and data['Conductivity'][i]==1 else 0 for i in range(len(data['ph']))]
data['ph_Conductivity_32']=[1 if data['ph'][i]==3 and data['Conductivity'][i]==2 else 0 for i in range(len(data['ph']))]
data['ph_Conductivity_33']=[1 if data['ph'][i]==3 and data['Conductivity'][i]==3 else 0 for i in range(len(data['ph']))]
data['ph_Conductivity_34']=[1 if data['ph'][i]==3 and data['Conductivity'][i]==4 else 0 for i in range(len(data['ph']))]

#### [According to Insight-27-28-29:](#127)

In [None]:
data['ph_Organic_c_31']=[1 if data['ph'][i]==3 and data['Organic_carbon'][i]==1 else 0 for i in range(len(data['ph']))]
data['ph_Organic_c_32']=[1 if data['ph'][i]==3 and data['Organic_carbon'][i]==2 else 0 for i in range(len(data['ph']))]
data['ph_Organic_c_33']=[1 if data['ph'][i]==3 and data['Organic_carbon'][i]==3 else 0 for i in range(len(data['ph']))]

#### [According to Insight-30-31-32-33:](#130)

In [None]:
data['ph_Chloramines_14']=[1 if data['ph'][i]==1 and data['Chloramines'][i]==4 else 0 for i in range(len(data['ph']))]
data['ph_Chloramines_31']=[1 if data['ph'][i]==3 and data['Chloramines'][i]==1 else 0 for i in range(len(data['ph']))]
data['ph_Chloramines_32']=[1 if data['ph'][i]==3 and data['Chloramines'][i]==2 else 0 for i in range(len(data['ph']))]
data['ph_Chloramines_41']=[1 if data['ph'][i]==4 and data['Chloramines'][i]==1 else 0 for i in range(len(data['ph']))]


#### [According to Insight-34-35-36-37:](#134)

In [None]:
data['ph_Trihalomethanes_31']=[1 if data['ph'][i]==3 and data['Trihalomethanes'][i]==1 else 0 for i in range(len(data['ph']))]
data['ph_Trihalomethanes_32']=[1 if data['ph'][i]==3 and data['Trihalomethanes'][i]==2 else 0 for i in range(len(data['ph']))]
data['ph_Trihalomethanes_33']=[1 if data['ph'][i]==3 and data['Trihalomethanes'][i]==3 else 0 for i in range(len(data['ph']))]
data['ph_Trihalomethanes_34']=[1 if data['ph'][i]==3 and data['Trihalomethanes'][i]==4 else 0 for i in range(len(data['ph']))]

#### [According to Insight-38-39:](#138)

In [None]:
data['ph_Turbidity_31']=[1 if data['ph'][i]==3 and data['Turbidity'][i]==1 else 0 for i in range(len(data['ph']))]
data['ph_Turbidity_32']=[1 if data['ph'][i]==3 and data['Turbidity'][i]==2 else 0 for i in range(len(data['ph']))]

<a id='21'></a>
# Dummies Section
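
One-hot encoding with `pd.get_dummies` can be sketched on a toy frame as follows. The values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'ph': [1, 2, 4], 'Potability': [0, 1, 0]})

# Each distinct value of 'ph' becomes its own indicator column: ph_1, ph_2, ph_4
encoded = pd.get_dummies(toy, columns=['ph'])

print(encoded.columns.tolist())
```

Only values that actually occur get a column, so group 3 is absent from this toy result.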

In [None]:
# Expand each binned feature into indicator columns:
#   from ph to ph_1, ph_2, ph_3, and ph_4
#   ...
#   from Turbidity to Turbidity_1, Turbidity_2, Turbidity_3, Turbidity_4
data = pd.get_dummies(data,columns=['ph','Hardness','Solids','Chloramines','Sulfate',
                                    'Conductivity','Organic_carbon','Trihalomethanes','Turbidity'])

<a id='22'></a>
# Data Transforming

In [None]:
original_data = pd.read_csv('/kaggle/input/water-potability/water_potability.csv')

In [None]:
#filling missing values
original_data.ph = original_data.ph.fillna(original_data.groupby(['Potability'])['ph'].transform('mean'))
original_data.Sulfate=original_data.Sulfate.fillna(original_data.groupby(['Potability'])['Sulfate'].transform('mean'))
original_data.Trihalomethanes =original_data.Trihalomethanes.fillna(original_data.groupby(['Potability'])['Trihalomethanes'].transform('mean'))

In [None]:
#drop outliers
original_data = original_data.drop(detect_outlier(original_data,outlier_list),axis=0).reset_index(drop=True)

In [None]:
#adding the new feature from binning
original_data['ph_in']=data['ph_in']
original_data['Sulfate_in']=data['Sulfate_in']
original_data['Hardness_in']=data['Hardness_in']



#from sulfate and ph
original_data['ph_Sulfate_11']=data['ph_Sulfate_11']
original_data['ph_Sulfate_13']=data['ph_Sulfate_13']
original_data['ph_Sulfate_22']=data['ph_Sulfate_22']
original_data['ph_Sulfate_33']=data['ph_Sulfate_33']
original_data['ph_Sulfate_43']=data['ph_Sulfate_43']
original_data['ph_Sulfate_44']=data['ph_Sulfate_44']

#from hardness and ph
original_data['ph_Hardness_32']=data['ph_Hardness_32']
original_data['ph_Hardness_34']=data['ph_Hardness_34']
original_data['ph_Hardness_43']=data['ph_Hardness_43']
original_data['ph_Hardness_44']=data['ph_Hardness_44']

#from solids and ph
original_data['ph_Solids_14']=data['ph_Solids_14']
original_data['ph_Solids_31']=data['ph_Solids_31']
original_data['ph_Solids_32']=data['ph_Solids_32']
original_data['ph_Solids_33']=data['ph_Solids_33']

#from organic carbon and ph
original_data['ph_Organic_c_31']=data['ph_Organic_c_31']
original_data['ph_Organic_c_32']=data['ph_Organic_c_32']
original_data['ph_Organic_c_33']=data['ph_Organic_c_33']

#from chloramines and ph
original_data['ph_Chloramines_14']=data['ph_Chloramines_14']
original_data['ph_Chloramines_31']=data['ph_Chloramines_31']
original_data['ph_Chloramines_32']=data['ph_Chloramines_32']
original_data['ph_Chloramines_41']=data['ph_Chloramines_41']

#from trihalomethanes and ph
original_data['ph_Trihalomethanes_31']=data['ph_Trihalomethanes_31']
original_data['ph_Trihalomethanes_32']=data['ph_Trihalomethanes_32']
original_data['ph_Trihalomethanes_33']=data['ph_Trihalomethanes_33']
original_data['ph_Trihalomethanes_34']=data['ph_Trihalomethanes_34']

#from turbidity and ph
original_data['ph_Turbidity_31']=data['ph_Turbidity_31']
original_data['ph_Turbidity_32']=data['ph_Turbidity_32']


<a id='23'></a>
# Modeling

* The following machine learning classifiers from the literature are used:
    * **Random Forest**
    * **Artificial Neural Network**
    * **Gradient Boosting**
    * **Cat Boost**
   <br></br>
* Hyperparameters are tuned with **Grid Search Cross Validation**.  
<br></br>
* Finally, the best estimator found for each algorithm is combined through *ensemble modeling.*

<a id='24'></a>
## Import Library

First, import the required libraries.

In [None]:
from sklearn.model_selection import train_test_split,StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier,VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error
import math

<a id='25'></a>
## Train-Test Split

The dataset is split into training and test sets; cross validation is later run on the training set.

In [None]:
X = original_data.drop(['Potability'],axis=1)
y = original_data['Potability']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    random_state =42)

<a id='26'></a>
## Machine Learning Algorithms

The parameter grids below use values commonly reported in the literature.

<a id='27'></a>
### Random Forest

In [None]:
rf_params = {'max_features':[1,3,10],
           'min_samples_split':[2,3,10],
           'min_samples_leaf':[1,3,10],
           'bootstrap':[False],
           'n_estimators':[100,300],
           'criterion':['gini']}

<a id='28'></a>
### Artificial Neural Network

In [None]:
from sklearn.neural_network import MLPClassifier

ann_params = {'alpha':[0.1,0.01,0.02,0.005,0.0001,0.00001],
             'hidden_layer_sizes':[(10,10,10),
                                 (100,100,100),
                                 (100,100),
                                 (3,5),
                                 (5,3)],
             'solver':['lbfgs','adam','sgd'],
             'activation':['relu','logistic']
        
}

<a id='29'></a>
### Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gb_params = {'learning_rate':[0.001,0.01,0.1,0.05],
            'n_estimators':[100,500],
            'max_depth':[3,5,10],
            'min_samples_split':[2,5,10]
    
}

<a id='30'></a>
### Cat Boost

In [None]:
from catboost import CatBoostClassifier

cb_params = {
    'iterations':[200,500],
    'learning_rate':[0.01,0.05,0.1],
    'depth':[3,5,8]
}

<a id='31'></a>
### Classifiers and Parameters

In [None]:
classifier = [
    RandomForestClassifier(),
    CatBoostClassifier(),
    MLPClassifier(),
    GradientBoostingClassifier(),
   
     ]

classifier_param = [
    rf_params,
    cb_params,
    ann_params,
    gb_params,
    
    
]

<a id='32'></a>
## Optimization

In a continuation of this study, **Bayesian Optimization** and **Randomized Search Cross Validation** will be added. Furthermore, given a longer running time, the score can probably be improved.
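
As a rough idea of what a randomized search could look like, the sketch below runs `RandomizedSearchCV` on synthetic stand-in data. The data, the search space, and `n_iter=5` are assumptions chosen only to keep the example small; they are not settings from this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in data; not the water potability dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Hypothetical search space, much smaller than a full grid
param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_features': [1, 3, 5],
    'min_samples_leaf': [1, 3, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                          # sample 5 of the 27 possible combinations
    cv=StratifiedKFold(n_splits=3),
    scoring='roc_auc',
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

Unlike a grid search, the number of fits is fixed by `n_iter`, so the running time does not grow with the size of the search space.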


<a id='33'></a>
### Grid Search and Cross Validation

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

#cross validation results
cv_results = []

#to use in ensemble modeling
best_estimators = []

for i in range(len(classifier)):
    clf = GridSearchCV(classifier[i],
                       param_grid = classifier_param[i],
                       cv = StratifiedKFold(n_splits=2),
                       scoring = 'roc_auc',
                       n_jobs = -1,
                       verbose=1)
    
    clf.fit(X_train_scaled,y_train)
    cv_results.append(clf.best_score_)
    best_estimators.append(clf.best_estimator_)
    print('Method: {}  Score: {}'.format(classifier[i],cv_results[i]))
    


results = pd.DataFrame({'Cross Validation Means':cv_results,
                       'ML Models':[
                           'Random Forest',
                           'Cat Boosting',
                           'Artificial Neural Network',
                           'Gradient Boosting',
                           
                       ]})


#visualization of results
g = sns.barplot(x='Cross Validation Means',y='ML Models',data=results)
g.set_xlabel('Mean ROC-AUC')
g.set_title('Cross Validation Scores')

<a id='34'></a>
## Ensemble Modeling
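
Hard voting means that each classifier casts one vote per sample and the majority class wins. A minimal sketch with made-up binary predictions from three hypothetical classifiers:

```python
import numpy as np

# Made-up class predictions from three classifiers for five samples
predictions = np.array([
    [1, 0, 1, 1, 0],   # classifier A
    [1, 1, 0, 1, 0],   # classifier B
    [0, 0, 1, 1, 1],   # classifier C
])

# For binary labels, the majority class is 1 when more than half the votes are 1
votes_for_one = predictions.sum(axis=0)
majority = (votes_for_one > predictions.shape[0] / 2).astype(int)

print(majority)
```

With an even number of voters, as in the four-model ensemble below, ties are possible; in this sketch a tie resolves to class 0.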

In [None]:
#combine the tuned estimators into a voting ensemble
voting_c = VotingClassifier(estimators=[('cb',best_estimators[1]),
                                        ('rf',best_estimators[0]),
                                        ('gb',best_estimators[3]),
                                       ('ann',best_estimators[2])],
                            voting='hard',
                            n_jobs= -1
                            
)
 

<a id='35'></a>
## Prediction

In [None]:
voting_c = voting_c.fit(X_train_scaled,y_train)
#accuracy_score expects (y_true, y_pred)
my_score = accuracy_score(y_test,voting_c.predict(X_test_scaled))
print(my_score)

<a id='36'></a>
## Submission

In [None]:
# road to KAGGLE!
# water_quality=pd.Series(original_data,name='Potability').astype(int)
original_data.to_csv('water_potability.csv',index=False)
