## Practical project 1 - 01 - Dataset adjustments

Explainable Automated Machine Learning course, LTAT.02.023
University of Tartu, Institute of Computer Science

Each team will work on a machine learning problem end to end.

#### Project's general description:
<details>
  <summary>Click here for more information!</summary>

  **Step 1:** Choose a dataset. Build and train a baseline for comparison. To construct the baseline, do the following:
  * Try a set of possible machine learning algorithms (**13 algorithms**) using their **default hyperparameters** and choose the one with the highest performance for comparison.


  **Step 2:** Based on the problem at hand, study:
  * the **potential pipeline structure**,
  * the candidate **algorithms** or **feature transformers** at each step,
  * the **hyperparameter ranges**.

  Use Hyperopt with the resulting search space to beat the baseline.


  **Step 3:** Monitor the performance of the pipeline constructed in the previous step across different time budgets (number of iterations) and report the smallest time budget at which it outperforms the baseline.


  **Step 4:** Determine whether the difference in performance between the constructed pipeline and the baseline is statistically significant.

</details>

#### Dataset used:
<details>
  <summary>Click here for more information!</summary>

  The dataset is taken from the Kaggle competition "Drinking Water Quality Prediction". The goal of the competition is to create a model that predicts the water quality at Estonian water stations based on the government's open data from previous measurements.

  [Reference](https://www.kaggle.com/competitions/copy-of-drinking-water-quality)
</details>

#### References:
<details>
  <summary>Click here for more information!</summary>

  [MLJ for Data Scientists in Two Hours](https://juliaai.github.io/DataScienceTutorials.jl/end-to-end/telco/)
</details>

### Activate the current project. Check the packages available

In [1]:
using Pkg

Pkg.activate(".")
Pkg.status()

[32m[1m  Activating[22m[39m project at `/Volumes/Data Science 214386/DataScience214386/LTAT.02.023 - Explainable Automated Machine Learning/project-1`


[32m[1mStatus[22m[39m `/Volumes/Data Science 214386/DataScience214386/LTAT.02.023 - Explainable Automated Machine Learning/project-1/Project.toml`
 [90m [336ed68f] [39mCSV v0.10.7
 [90m [324d7699] [39mCategoricalArrays v0.10.7
 [90m [af321ab8] [39mCategoricalDistributions v0.1.9
 [90m [861a8166] [39mCombinatorics v1.0.2
[32m⌃[39m[90m [a93c6f00] [39mDataFrames v1.3.6
[32m⌃[39m[90m [add582a8] [39mMLJ v0.18.6
 [90m [c6f25543] [39mMLJDecisionTreeInterface v0.2.5
 [90m [1b6a4a23] [39mMLJMultivariateStatsInterface v0.5.0
 [90m [17a086e9] [39mMLJParticleSwarmOptimization v0.1.2
 [90m [5ae90465] [39mMLJScikitLearnInterface v0.2.0
 [90m [54119dfa] [39mMLJXGBoostInterface v0.2.1
 [90m [eff96d63] [39mMeasurements v2.8.0
[32m⌃[39m[90m [91a5bcdd] [39mPlots v1.35.7
 [90m [860ef19b] [39mStableRNGs v1.0.0
 [90m [fd094767] [39mSuppressor v0.2.1
[36m[1mInfo[22m[39m Packages marked with [32m⌃[39m have new versions available and may be upgradable.


### Get packages to use

In [2]:
using DataFrames
using CSV
using MLJ

### Do initial dataset transformations

#### Get train/test data

In [3]:
# Read the original (comma-delimited) train/test files into DataFrames
df_train = CSV.read(joinpath(@__DIR__, "data/original/train_original.csv"), DataFrame; delim=',')
display(first(df_train, 3))
df_test = CSV.read(joinpath(@__DIR__, "data/original/test_original.csv"), DataFrame; delim=',')
display(first(df_test, 3))

| Row | station_id | Aluminium_2019 | Aluminium_2020 | Ammonium_2019 | Ammonium_2020 | Boron_2019 |
|---|---|---|---|---|---|---|
| | Int64 | Float64? | Float64? | Float64? | Float64? | Float64? |
| 1 | 487 | missing | missing | 0.05 | 0.05 | missing |
| 2 | 1555 | missing | missing | 0.05 | 0.05 | missing |
| 3 | 205 | missing | 10.0 | 0.05 | 0.24 | missing |


| Row | station_id | Aluminium_2019 | Aluminium_2020 | Ammonium_2019 | Ammonium_2020 | Boron_2019 |
|---|---|---|---|---|---|---|
| | Int64 | Float64? | Float64? | Float64? | Float64? | Float64? |
| 1 | 163 | 5.0 | 5.0 | 0.08 | 0.08 | 0.071 |
| 2 | 167 | missing | missing | 0.08 | 0.08 | missing |
| 3 | 171 | missing | missing | missing | missing | missing |


#### Declare the function required for adjusting the dataset's column names

* convert all column names to lowercase
* remove a year suffix such as '_20XX'

In [4]:
# Ref, inspired by: dataframes.juliadata.org/stable/man/querying_frameworks/
# Lowercase every column name of `df` and remove `suffix` (e.g. "_2019") from it, in place.
function adjustcolnames(df, suffix)
    newnames = [replace(lowercase(col), suffix => "") for col in names(df)]
    rename!(df, Symbol.(newnames))
end

adjustcolnames (generic function with 1 method)
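
As a quick sanity check, the function can be tried on a small hand-made table. This is only a sketch: the `toy` table and its values are made up for illustration, and only the column-name pattern mirrors the real data. The function is repeated so that the snippet runs on its own.

```julia
using DataFrames

# Same behaviour as `adjustcolnames` above, repeated so the sketch is self-contained.
function adjustcolnames(df, suffix)
    newnames = [replace(lowercase(col), suffix => "") for col in names(df)]
    rename!(df, Symbol.(newnames))
end

# Hypothetical toy data; the values are made up.
toy = DataFrame("station_id" => [1, 2],
                "Aluminium_2019" => [5.0, missing],
                "Ammonium_2019" => [0.05, 0.08])

adjustcolnames(toy, "_2019")
names(toy)  # ["station_id", "aluminium", "ammonium"]
```

Note that `replace` removes the suffix text wherever it occurs in a name, not only at the end. For the column names shown in this notebook that makes no difference, since the year marker appears only as a trailing suffix.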

#### Apply the relevant adjustments to the 2019/2020 datasets

* select 'station_id' and the columns of one year, then apply 'adjustcolnames' to the column names
* add a new column 'year'
* concatenate the 2019/2020 dataframes into one
* exclude 'station_id' and 'year'

In [5]:
# Adjust train

# 2019 - keep 'station_id' and the columns carrying the '_2019' suffix
df_train_2019 = select(df_train, :station_id, Cols(r"_2019"))
# add new column 'year'
df_train_2019[!, "year"] .= 2019
# lowercase the column names and remove the '_2019' suffix
adjustcolnames(df_train_2019, "_2019")

# 2020 - keep 'station_id' and the columns carrying the '_2020' suffix
df_train_2020 = select(df_train, :station_id, Cols(r"_2020"))
# add new column 'year'
df_train_2020[!, "year"] .= 2020
# lowercase the column names and remove the '_2020' suffix
adjustcolnames(df_train_2020, "_2020")

# union the 2019/2020 dataframes into one
df_train_all = vcat(df_train_2019, df_train_2020)

# exclude 'station_id' and 'year'
select!(df_train_all, Not([:year, :station_id]))
first(df_train_all, 3)

| Row | aluminium | ammonium | boron | chloride | coli-like-bacteria-colilert | coli-like-bacteria |
|---|---|---|---|---|---|---|
| | Float64? | Float64? | Float64? | Float64? | Float64? | Float64? |
| 1 | missing | 0.05 | missing | missing | missing | 0.0 |
| 2 | missing | 0.05 | missing | missing | missing | 0.0 |
| 3 | missing | 0.05 | missing | missing | missing | 0.0 |


In [6]:
# Adjust test

# 2019 - keep 'station_id' and the columns carrying the '_2019' suffix
df_test_2019 = select(df_test, :station_id, Cols(r"_2019"))
# add new column 'year'
df_test_2019[!, "year"] .= 2019
# lowercase the column names and remove the '_2019' suffix
adjustcolnames(df_test_2019, "_2019")

# 2020 - keep 'station_id' and the columns carrying the '_2020' suffix
df_test_2020 = select(df_test, :station_id, Cols(r"_2020"))
# add new column 'year'
df_test_2020[!, "year"] .= 2020
# lowercase the column names and remove the '_2020' suffix
adjustcolnames(df_test_2020, "_2020")

# union the 2019/2020 dataframes into one
df_test_all = vcat(df_test_2019, df_test_2020)

# exclude 'station_id' and 'year'
select!(df_test_all, Not([:year, :station_id]))
first(df_test_all, 3)

| Row | aluminium | ammonium | boron | chloride | coli-like-bacteria-colilert | coli-like-bacteria |
|---|---|---|---|---|---|---|
| | Float64? | Float64? | Float64? | Float64? | Float64? | Float64? |
| 1 | 5.0 | 0.08 | 0.071 | 130.0 | missing | 0.0 |
| 2 | missing | 0.08 | missing | missing | missing | 0.0 |
| 3 | missing | missing | missing | 112.0 | missing | missing |
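
The train and test cells repeat the same steps for each year, so they could be folded into one helper. The sketch below is one possible way to do that, not part of the original notebook: the name `stackyears` and the `wide` table with its values are hypothetical. It also skips the temporary 'year' column, since that column is dropped again at the end anyway.

```julia
using DataFrames

# Hypothetical helper: split a wide table into one block per year,
# remove the year suffix from the column names, and stack the blocks.
function stackyears(df, years)
    parts = map(years) do year
        suffix = "_$(year)"
        # keep 'station_id' plus the columns whose name contains the year suffix
        part = select(df, :station_id, Cols(Regex(suffix)))
        # lowercase the names and remove the suffix
        rename!(part, [replace(lowercase(col), suffix => "") for col in names(part)])
        part
    end
    stacked = vcat(parts...)
    # drop the identifier, as in the cells above
    select!(stacked, Not(:station_id))
end

# Made-up wide table with the same naming pattern as the competition data.
wide = DataFrame("station_id" => [1, 2],
                 "Aluminium_2019" => [5.0, missing], "Aluminium_2020" => [6.0, 7.0],
                 "Ammonium_2019" => [0.05, 0.08], "Ammonium_2020" => [missing, 0.24])

long = stackyears(wide, [2019, 2020])
size(long)   # (4, 2): two stations times two years, two measurement columns
names(long)  # ["aluminium", "ammonium"]
```

With such a helper, the same call would serve both `df_train` and `df_test`, which keeps the two datasets from drifting apart if a step is changed later.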


### Save adjusted train/test datasets

In [7]:
# Write the adjusted datasets using ';' as the delimiter
CSV.write(joinpath(@__DIR__, "data/adjusted/train_adjusted.csv"), df_train_all; delim=';')
CSV.write(joinpath(@__DIR__, "data/adjusted/test_adjusted.csv"), df_test_all; delim=';')

"/Volumes/Data Science 214386/DataScience214386/LTAT.02.023 - Explainable Automated Machine Learning/project-1/data/adjusted/test_adjusted.csv"
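
The original files are read with ',' as the delimiter, while the adjusted files are written with ';'. It is therefore safest to pass `delim=';'` explicitly when the adjusted files are loaded later. A minimal round-trip sketch, assuming a made-up table and a temporary file in place of the real data and paths:

```julia
using CSV, DataFrames

# Made-up table standing in for `df_train_all`.
df = DataFrame(aluminium = [5.0, missing], ammonium = [0.05, 0.08])

# Write to a temporary file with ';' as the delimiter, then read it back the same way.
path = joinpath(mktempdir(), "adjusted_example.csv")
CSV.write(path, df; delim=';')
df_back = CSV.read(path, DataFrame; delim=';')

# Compare with `isequal`: `==` yields `missing` when the tables contain missing values.
roundtrip_ok = isequal(df, df_back)
```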