## Introduction To The Dataset

This dataset comes from a flotation plant, where froth flotation is used to concentrate iron ore. The process is very common in mining plants.

Froth flotation is a process for selectively separating hydrophobic materials from hydrophilic ones. It is used in the mineral processing, paper recycling and waste-water treatment industries. It was first used in the mining industry, where it was one of the great enabling technologies of the 20th century.

The development of froth flotation has improved the recovery of valuable minerals, such as copper- and lead-bearing minerals. Along with mechanized mining, it has allowed the economic recovery of valuable metals from much lower grade ore than previously.

### Data Dictionary

- The first column shows the date and time (from March 2017 until September 2017). Some columns were sampled every 20 seconds; others were sampled hourly.

- The second and third columns are quality measures of the iron ore pulp right before it is fed into the flotation plant.

- Columns 4 to 8 are the most important variables affecting ore quality at the end of the process.

- Columns 9 to 22 contain process data (level and air flow inside the flotation columns), which also affect ore quality.

- The last two columns are the final iron ore pulp quality measurements from the lab.

- The target to predict is the last column, the % of silica in the iron ore concentrate.
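The column positions in the list above can be mapped to names. The sketch below assumes the 24 column names reported by the notebook's own `isnull()` output; the group labels (`feed_quality`, `key_process_variables`, and so on) are made up for illustration and are not part of the dataset.

```python
# Column names as reported by the notebook's isnull() output.
columns = (
    ["date", "% Iron Feed", "% Silica Feed",
     "Starch Flow", "Amina Flow", "Ore Pulp Flow", "Ore Pulp pH", "Ore Pulp Density"]
    + [f"Flotation Column {i:02d} Air Flow" for i in range(1, 8)]
    + [f"Flotation Column {i:02d} Level" for i in range(1, 8)]
    + ["% Iron Concentrate", "% Silica Concentrate"]
)

# Hypothetical grouping that mirrors the bullets above
# (the bullets count columns from 1, Python slices count from 0).
column_groups = {
    "timestamp": columns[0:1],                    # column 1
    "feed_quality": columns[1:3],                 # columns 2-3
    "key_process_variables": columns[3:8],        # columns 4-8
    "column_air_flow_and_level": columns[8:22],   # columns 9-22
    "lab_quality": columns[22:24],                # last two columns
}

target = columns[-1]
print(target)
```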

### Goal Of This Project

The main goal is to use this data to predict how much impurity is in the ore concentrate.
Because this impurity is measured only once an hour, predicting how much silica (impurity) is in the ore concentrate gives the engineers early information to act on (empowering!).
They will then be able to take corrective action in advance (reducing impurity, if needed) and also help the environment (reducing silica in the ore concentrate reduces the amount of ore that goes to tailings).


In this kernel, I'll:

- Prepare the data for machine learning
- Train a regression model
- Measure & optimize the accuracy of the model


In [1]:
import pandas as pd
Mining = pd.read_csv("MiningProcess_Flotation_Plant_Database.csv")
Mining.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 737453 entries, 0 to 737452
Data columns (total 24 columns):
date                            737453 non-null object
% Iron Feed                     737453 non-null object
% Silica Feed                   737453 non-null object
Starch Flow                     737453 non-null object
Amina Flow                      737453 non-null object
Ore Pulp Flow                   737453 non-null object
Ore Pulp pH                     737453 non-null object
Ore Pulp Density                737453 non-null object
Flotation Column 01 Air Flow    737453 non-null object
Flotation Column 02 Air Flow    737453 non-null object
Flotation Column 03 Air Flow    737453 non-null object
Flotation Column 04 Air Flow    737453 non-null object
Flotation Column 05 Air Flow    737453 non-null object
Flotation Column 06 Air Flow    737453 non-null object
Flotation Column 07 Air Flow    737453 non-null object
Flotation Column 01 Level       737453 non-null object
Flotation

In [2]:
Mining.head(10)

Unnamed: 0,date,% Iron Feed,% Silica Feed,Starch Flow,Amina Flow,Ore Pulp Flow,Ore Pulp pH,Ore Pulp Density,Flotation Column 01 Air Flow,Flotation Column 02 Air Flow,...,Flotation Column 07 Air Flow,Flotation Column 01 Level,Flotation Column 02 Level,Flotation Column 03 Level,Flotation Column 04 Level,Flotation Column 05 Level,Flotation Column 06 Level,Flotation Column 07 Level,% Iron Concentrate,% Silica Concentrate
0,2017-03-10 01:00:00,552,1698,301953,557434,395713,100664,174,249214,253235,...,250884,457396,432962,424954,443558,502255,44637,523344,6691,131
1,2017-03-10 01:00:00,552,1698,302441,563965,397383,100672,174,249719,250532,...,248994,451891,42956,432939,448086,496363,445922,498075,6691,131
2,2017-03-10 01:00:00,552,1698,304346,568054,399668,10068,174,249741,247874,...,248071,45124,468927,43461,449688,484411,447826,458567,6691,131
3,2017-03-10 01:00:00,552,1698,304736,568665,397939,100689,174,249917,254487,...,251147,452441,458165,442865,44621,471411,43769,427669,6691,131
4,2017-03-10 01:00:00,552,1698,303369,558167,400254,100697,174,250203,252136,...,248928,452441,4529,450523,45367,462598,443682,425679,6691,131
5,2017-03-10 01:00:00,552,1698,30791,564697,396533,100705,174,25073,248906,...,251873,444384,443269,460449,43992,451588,433539,425458,6691,131
6,2017-03-10 01:00:00,552,1698,312779,566467,3929,100713,174,250313,252202,...,253477,446185,444571,452306,431328,443548,444575,431251,6691,131
7,2017-03-10 01:00:00,552,1698,315293,558777,397002,100722,174,249895,25363,...,253345,445985,461341,46164,442067,44173,46177,449679,6691,131
8,2017-03-10 01:00:00,552,1698,314727,55603,394307,10073,174,250137,251104,...,250884,446686,478385,459103,455074,439798,457738,455915,6691,131
9,2017-03-10 01:00:00,552,1698,314258,565857,393105,100738,174,249653,252202,...,248137,445685,478779,460665,457225,453236,449898,45575,6691,131


## Finding Missing Values


In [3]:
Mining.isnull().sum()

date                            0
% Iron Feed                     0
% Silica Feed                   0
Starch Flow                     0
Amina Flow                      0
Ore Pulp Flow                   0
Ore Pulp pH                     0
Ore Pulp Density                0
Flotation Column 01 Air Flow    0
Flotation Column 02 Air Flow    0
Flotation Column 03 Air Flow    0
Flotation Column 04 Air Flow    0
Flotation Column 05 Air Flow    0
Flotation Column 06 Air Flow    0
Flotation Column 07 Air Flow    0
Flotation Column 01 Level       0
Flotation Column 02 Level       0
Flotation Column 03 Level       0
Flotation Column 04 Level       0
Flotation Column 05 Level       0
Flotation Column 06 Level       0
Flotation Column 07 Level       0
% Iron Concentrate              0
% Silica Concentrate            0
dtype: int64

## Feature Selection

- The 'date' feature doesn't carry any relevant information for the ML model; it constitutes noise, so I'll drop it

In [4]:
Mining = Mining.drop(["date"], axis=1)

## Data Cleaning and Feature Engineering

- All columns are read as the object data type instead of float because the numbers are written with a comma (',')

In [5]:
for col in Mining.columns:
    Mining[col] = Mining[col].str.replace(',', '').astype(float)
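One caveat about the cell above: values such as the pulp pH suggest that the comma is a decimal separator, and if so, deleting it changes magnitudes (and even ordering) instead of just changing the type. The sketch below uses a tiny made-up sample, assumed to resemble the raw file, to contrast the two treatments. It is an alternative for comparison, not the conversion this kernel ran, so the outputs shown in the notebook still reflect the comma-stripped values.

```python
import io
import pandas as pd

# Made-up sample: quoted values with a comma as the decimal mark (an assumption
# about how the raw CSV looks).
raw = io.StringIO(
    'date,% Iron Feed,Ore Pulp pH\n'
    '2017-03-10 01:00:00,"55,2","10,0664"\n'
    '2017-03-10 01:00:00,"55,2","10,068"\n'
)

# Option 1: tell pandas that ',' is the decimal separator while reading.
parsed = pd.read_csv(raw, decimal=",")

# Option 2: convert an already-loaded text column by swapping ',' for '.'.
as_text = pd.Series(["10,0664", "10,068"])
converted = as_text.str.replace(",", ".", regex=False).astype(float)

# For contrast: deleting the comma inflates the values by different factors.
stripped = as_text.str.replace(",", "", regex=False).astype(float)

print(parsed.dtypes)
print(converted.tolist(), stripped.tolist())
```

With the comma removed, the second reading (10068) appears ten times smaller than the first (100664), even though the underlying values are nearly equal.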

In [6]:
display(Mining.head(9))

Unnamed: 0,% Iron Feed,% Silica Feed,Starch Flow,Amina Flow,Ore Pulp Flow,Ore Pulp pH,Ore Pulp Density,Flotation Column 01 Air Flow,Flotation Column 02 Air Flow,Flotation Column 03 Air Flow,...,Flotation Column 07 Air Flow,Flotation Column 01 Level,Flotation Column 02 Level,Flotation Column 03 Level,Flotation Column 04 Level,Flotation Column 05 Level,Flotation Column 06 Level,Flotation Column 07 Level,% Iron Concentrate,% Silica Concentrate
0,552.0,1698.0,301953.0,557434.0,395713.0,100664.0,174.0,249214.0,253235.0,250576.0,...,250884.0,457396.0,432962.0,424954.0,443558.0,502255.0,44637.0,523344.0,6691.0,131.0
1,552.0,1698.0,302441.0,563965.0,397383.0,100672.0,174.0,249719.0,250532.0,250862.0,...,248994.0,451891.0,42956.0,432939.0,448086.0,496363.0,445922.0,498075.0,6691.0,131.0
2,552.0,1698.0,304346.0,568054.0,399668.0,10068.0,174.0,249741.0,247874.0,250313.0,...,248071.0,45124.0,468927.0,43461.0,449688.0,484411.0,447826.0,458567.0,6691.0,131.0
3,552.0,1698.0,304736.0,568665.0,397939.0,100689.0,174.0,249917.0,254487.0,250049.0,...,251147.0,452441.0,458165.0,442865.0,44621.0,471411.0,43769.0,427669.0,6691.0,131.0
4,552.0,1698.0,303369.0,558167.0,400254.0,100697.0,174.0,250203.0,252136.0,249895.0,...,248928.0,452441.0,4529.0,450523.0,45367.0,462598.0,443682.0,425679.0,6691.0,131.0
5,552.0,1698.0,30791.0,564697.0,396533.0,100705.0,174.0,25073.0,248906.0,249521.0,...,251873.0,444384.0,443269.0,460449.0,43992.0,451588.0,433539.0,425458.0,6691.0,131.0
6,552.0,1698.0,312779.0,566467.0,3929.0,100713.0,174.0,250313.0,252202.0,249082.0,...,253477.0,446185.0,444571.0,452306.0,431328.0,443548.0,444575.0,431251.0,6691.0,131.0
7,552.0,1698.0,315293.0,558777.0,397002.0,100722.0,174.0,249895.0,25363.0,249258.0,...,253345.0,445985.0,461341.0,46164.0,442067.0,44173.0,46177.0,449679.0,6691.0,131.0
8,552.0,1698.0,314727.0,55603.0,394307.0,10073.0,174.0,250137.0,251104.0,248774.0,...,250884.0,446686.0,478385.0,459103.0,455074.0,439798.0,457738.0,455915.0,6691.0,131.0


In [7]:
Mining.corr()["% Silica Concentrate"]

% Iron Feed                    -0.002580
% Silica Feed                   0.034893
Starch Flow                    -0.024802
Amina Flow                     -0.003580
Ore Pulp Flow                  -0.006495
Ore Pulp pH                     0.023905
Ore Pulp Density               -0.002378
Flotation Column 01 Air Flow    0.041565
Flotation Column 02 Air Flow    0.037116
Flotation Column 03 Air Flow    0.069685
Flotation Column 04 Air Flow    0.088561
Flotation Column 05 Air Flow    0.091065
Flotation Column 06 Air Flow   -0.023285
Flotation Column 07 Air Flow   -0.007616
Flotation Column 01 Level       0.015068
Flotation Column 02 Level       0.005887
Flotation Column 03 Level       0.048918
Flotation Column 04 Level      -0.005961
Flotation Column 05 Level       0.002197
Flotation Column 06 Level       0.003561
Flotation Column 07 Level       0.008262
% Iron Concentrate              0.696886
% Silica Concentrate            1.000000
Name: % Silica Concentrate, dtype: float64
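A long correlation listing like the one above is easier to scan when sorted by magnitude. The following is a minimal sketch on synthetic data; the column names `a`, `b` and `target` are placeholders, not columns from the mining dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data: "target" depends strongly (and negatively) on "a".
a = rng.normal(size=n)
b = rng.normal(size=n)
demo = pd.DataFrame({"a": a, "b": b, "target": -2.0 * a + 0.1 * rng.normal(size=n)})

# Correlation of every feature with the target, excluding the target itself.
corr = demo.corr()["target"].drop("target")

# Order by absolute value so strong negative correlations are not buried.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```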

'% Iron Concentrate' and '% Silica Concentrate' are strongly correlated (about 0.70)

- The aim of this project is to predict how much impurity is in the ore concentrate, giving the engineers early information so that they can take corrective action in advance

- However, '% Iron Concentrate', like '% Silica Concentrate', is an output of the process and will not be available while flotation is running, so it cannot be used as a feature

In [14]:
Mining_features = Mining.drop(["% Silica Concentrate","% Iron Concentrate"], axis=1)

In [16]:
Mining_features.head()

Unnamed: 0,% Iron Feed,% Silica Feed,Starch Flow,Amina Flow,Ore Pulp Flow,Ore Pulp pH,Ore Pulp Density,Flotation Column 01 Air Flow,Flotation Column 02 Air Flow,Flotation Column 03 Air Flow,...,Flotation Column 05 Air Flow,Flotation Column 06 Air Flow,Flotation Column 07 Air Flow,Flotation Column 01 Level,Flotation Column 02 Level,Flotation Column 03 Level,Flotation Column 04 Level,Flotation Column 05 Level,Flotation Column 06 Level,Flotation Column 07 Level
0,552.0,1698.0,301953.0,557434.0,395713.0,100664.0,174.0,249214.0,253235.0,250576.0,...,3064.0,250225.0,250884.0,457396.0,432962.0,424954.0,443558.0,502255.0,44637.0,523344.0
1,552.0,1698.0,302441.0,563965.0,397383.0,100672.0,174.0,249719.0,250532.0,250862.0,...,3064.0,250137.0,248994.0,451891.0,42956.0,432939.0,448086.0,496363.0,445922.0,498075.0
2,552.0,1698.0,304346.0,568054.0,399668.0,10068.0,174.0,249741.0,247874.0,250313.0,...,3064.0,251345.0,248071.0,45124.0,468927.0,43461.0,449688.0,484411.0,447826.0,458567.0
3,552.0,1698.0,304736.0,568665.0,397939.0,100689.0,174.0,249917.0,254487.0,250049.0,...,3064.0,250422.0,251147.0,452441.0,458165.0,442865.0,44621.0,471411.0,43769.0,427669.0
4,552.0,1698.0,303369.0,558167.0,400254.0,100697.0,174.0,250203.0,252136.0,249895.0,...,3064.0,249983.0,248928.0,452441.0,4529.0,450523.0,45367.0,462598.0,443682.0,425679.0


## Training a Model with a Random Forest Regressor Model

Because the training data is large and the number of observations (data points) is much higher than the number of features, I'll use a low-bias/high-variance algorithm, a category the Random Forest Regressor fits into

### Splitting the Data into Train & Test

In [20]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestRegressor
# Random split with the default test_size (25% of the rows); random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(Mining_features, Mining["% Silica Concentrate"], random_state=1)
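`train_test_split` shuffles rows before splitting. Since the first rows shown earlier all carry the same timestamp and the same '% Silica Concentrate' value, rows from the same hour can end up in both the train and test sets. One possible alternative, which is not what this kernel does, is a chronological hold-out. The sketch below assumes the rows are already in time order and uses made-up data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 12 rows in time order, 3 rows per "hour" sharing one target value.
demo_X = pd.DataFrame({"feature": np.arange(12, dtype=float)})
demo_y = pd.Series(np.repeat([1.0, 2.0, 3.0, 4.0], 3), name="target")

# shuffle=False keeps the original row order, so the test set is the last 25% of rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.25, shuffle=False
)

print(list(X_te.index))
```

A score obtained this way answers a different question (how well the model predicts later hours it has never seen) and may differ from the shuffled-split score.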


In [21]:
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

In [22]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([("scaler", RobustScaler()), ("rf", RandomForestRegressor())])

In [23]:
pipe.fit(X_train, y_train)
print("Test score: {:.2f}".format(pipe.score(X_test, y_test)))

Test score: 0.76
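For scikit-learn regressors, `score` returns the coefficient of determination (R²), so the value printed above is an R² and not a classification-style accuracy. The sketch below, on synthetic data with a smaller forest than the default, shows how the same number can be computed explicitly and paired with an error in the target's own units. The names `demo_pipe`, `X` and `y` are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic regression problem standing in for the mining data.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

demo_pipe = Pipeline([
    ("scaler", RobustScaler()),
    ("rf", RandomForestRegressor(n_estimators=20, random_state=1)),
])
demo_pipe.fit(X_tr, y_tr)

pred = demo_pipe.predict(X_te)
r2 = r2_score(y_te, pred)               # same quantity demo_pipe.score(X_te, y_te) returns
mae = mean_absolute_error(y_te, pred)   # average absolute error, in the target's units
print("R^2: {:.2f}  MAE: {:.2f}".format(r2, mae))
```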
