# Control point: Data preprocessing
#### Máster en Análisis de Datos, Ciberseguridad y Computación en la Nube
#### Aprendizaje Automático - Punto de Control 1 (31/10/2019)

#### Name & Surnames: 

### Introduction

The dataset named Hoerchen (Hoerchen.csv) has more than 145K samples and more than 70 features.

The main objective of the control point is to preprocess the train data, designing a complete preprocessing scheme, and test it on test data. 

You must take into account that this is not a toy dataset, and its size could be relevant.

The function "automatic_scoring" provides a way for comparing different schemes using a classifier, by means of 10-Fold CV and using AUC as metric. You will need to put the right seed as requested. Notice that the function just needs inputs (X) and target (y) arrays as input.

If you try anytime several options it is important to show the results of those discarded trials, because what is not visible cannot be evaluated.

The function "automatic_testing" trains the model on the train data and applies it to the test data. Do not change the classification algorithm, its parameters and the scoring choice. Those are fixed and their optimization is out of the scope of this control point.

The deliverable of this control point is this Jupyter Notebook containing the code, plus some short answers in markdown cells if required.

NOTE: Keep in mind that some functions accept both Pandas dataframes and Numpy arrays, but some others only one of them. Nevertheless, we should know how to pass form one to the other and viceversa.

NOTE: Keep in mind that some functions will take some time to run. You can continue working on other cells during the run to avoid wasting time waiting.

### Exercises:

* (i) Split the data into 4 parts, i.e. train inputs and target (xtr, ytr) and test inputs and target (xte and yte), in such a way that the proportion of the classes is kept constant in train and test parts. The size of the training set must be 70% of the total size of the data, and the random seed to be used must be your ID card number (i.e. DNI without the letter). This random seed must be kept during all the control point in any possible place. [5%] <br>
<br>
* (ii) Checking for missing values and outliers. If any, treat the data however you consider better, arguing your decisions. [20%] <br>
  <br>
    - (a) Is there any missing value? If so, regarding the characteristics of the data, decide what to do arguing your answer. Modify your data according to your answer if necessary.  <br>
      <br>
    - (b) Is there any collective outlier? If so, regarding the characteristics of the data, decide what to do arguing your answer.  Modify your data according to your answer if necessary. <br>
    <br>
    - (c) From now on, this is your basic data. Therefore, it is save to overwrite the names of the data parts. <br>
<br>

* (iii) The feature selection method SelectPercentile (sklearn.feature_selection.SelectPercentile) uses different scores (f_classif, mutual_info_classif, chi2, f_regression, etc) in order to select the most relevant features. In the Scikit-Learn documentation
(https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html#sklearn.feature_selection.SelectPercentile)
you have the function info and an example of use of chi2 score. Use the feature selection method SelectPercentile with the mutual_info_classif score, and percentile parameter 20. [30%] <br>
<br>
    - (a) Which is the compression ratio you obtained? (Note: Compression ratio is the proportion of variables kept after the selection). <br>
    <br>
    - (b) Compare the performance with and without feature selection with the right scheme and function. Is selecting those variables a good idea? Argue your response. <br>
    <br>
    - (c) Regarding the answer to (b), get your current data in order to continue preprocessing.  <br>
    <br>
    
* (iv) Check the balance of your current dataset. Which is its imbalance ratio? We can understand it both as the number of times the majority class is bigger than the minority class, or the proportion of the samples that are from minority class. If imbalance ratio is higher than 49 to 1 (equivalent to having less than 2% of minority class samples), discuss if it makes sense to apply imbalanced data treatments or not. Consider the size of the data and the performance you have obtained in (iii) (b) for the data you currently have. Act in consequence with total freedom on the sampling method to use if you need any. [20%] <br>
<br>
* (v) Apply principal component analysis to your data for compression, capturing at least 95% of the cumulative variance. How many extracted variables do you have? Which reduction percentage would you get if you apply it? Compare the performance with the one of your current non-compressed data. Would you use the pca compression here? Act consequently with your answer, and keep the data overwriting the names. [15%] <br> 
<br>
* (vi) Once you are here, you have final preprocessed data using the definitive preprocessing scheme you have reasonably chosen. Check now the performance using the test data. Comment on the result you have obtained compared to the one in (v). [10%]

#### Auxiliar functions

In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


seed = 9425400# Your DNI number without letter and left zeros here, e.g. 09425400T => 9425400


def automatic_scoring(X, y):
    average_score = cross_val_score(estimator=RandomForestClassifier(n_estimators=100, random_state=seed), X=X, y=y, cv=5, scoring='roc_auc').mean()
    return average_score

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score


def automatic_testing(X_train, y_train, X_test, y_test):
    auc_score = roc_auc_score(y_test, RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_train, y_train).predict_proba(X_test)[:,1])
    return auc_score

### Solution:

In [3]:
# Tic-Toc
def tic():
    import time
    global startTime_for_tictoc
    startTime_for_tictoc = time.time()

    
def toc(verbose=True):
    import time
    gap = time.time() - startTime_for_tictoc
    if verbose:
        if 'startTime_for_tictoc' in globals():
            print("Elapsed time is " + str(gap) + " seconds.")
        else:
            print("Toc: start time not set")
    else:
        return gap

In [4]:
# (i)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


data = pd.read_csv('Hoerchen.csv')
xtr, xte, ytr, yte = train_test_split(data.values[:,:-1], data.values[:,-1], test_size=0.3, random_state=seed, stratify=data.values[:,-1])

In [5]:
# (ii)
# We check for imbalance rate (only for information purposes)
tic1=tic()
imb_rate = (-sum([data.values[x, -1] for x in list(range(0, len(data.values[:, -1]))) if data.values[x, -1] == -1]))/sum([data.values[x, -1] for x in list(range(0, len(data.values[:, -1]))) if data.values[x, -1] == 1])
imb_rate

111.4621913580247

In [6]:
# (ii) (a)
#Check for missing values in train
df_tr = pd.DataFrame(xtr)
df_tr = pd.concat([df_tr, pd.DataFrame(ytr, columns=[str(int(df_tr.columns[-1]) + 1)])], axis=1)
df_tr.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,65,66,67,68,69,70,71,72,73,74
count,102024.0,102025.0,102024.0,102024.0,102025.0,102024.0,102024.0,102025.0,102025.0,102024.0,...,102024.0,102024.0,102024.0,102024.0,102024.0,102024.0,102024.0,102025.0,102024.0,102025.0
mean,61.208536,26.490502,0.180773,1.66749,18.289738,1820.021907,-0.005729,0.200864,1.016555,-73.737062,...,1819.271988,0.02497,0.529898,0.278601,-64.647318,471.97156,0.473091,0.260587,0.191212,-0.98222
std,18.972717,4.494317,1.243443,31.875404,81.499202,1416.116244,1.028956,1.421043,20.647745,26.558552,...,1700.947179,1.180713,1.647576,9.641195,36.719066,406.384807,1.073515,0.187366,0.499566,0.187734
min,3.57,12.0,-3.83,-144.0,-1039.0,-562.3,-6.12,-2.86,-83.0,-1044.0,...,-413.7,-7.78,-10.0,-63.0,-311.0,-420.3,-20.0,-0.55,-1.91,-1.0
25%,48.0,23.56,-0.58,-17.5,-13.0,1014.8,-0.66,-0.6,-9.5,-86.0,...,860.9,-0.72,-0.55,-5.0,-82.0,174.6,-0.17,0.14,-0.1,-1.0
50%,62.46,25.79,0.1,1.0,11.5,1531.45,0.04,0.03,0.5,-69.5,...,1434.5,0.08,0.39,0.0,-55.0,377.6,0.56,0.26,0.28,-1.0
75%,75.42,28.57,0.82,19.5,40.0,2299.4,0.7,0.77,10.5,-56.5,...,2334.9,0.83,1.46,5.0,-38.0,667.225,1.21,0.39,0.58,-1.0
max,100.0,100.0,50.38,954.5,3380.0,52817.9,5.76,66.36,917.0,-23.0,...,64129.4,5.57,15.56,146.0,0.0,4197.9,6.6,1.0,1.0,1.0


In [7]:
# Remove them in train and get back numpy arrays
df_tr.dropna(axis=0, how='any', inplace=True)
xtr = df_tr.values[:, :-1]
ytr = df_tr.values[:, -1]
# Check
df_tr.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,65,66,67,68,69,70,71,72,73,74
count,101975.0,101975.0,101975.0,101975.0,101975.0,101975.0,101975.0,101975.0,101975.0,101975.0,...,101975.0,101975.0,101975.0,101975.0,101975.0,101975.0,101975.0,101975.0,101975.0,101975.0
mean,61.209005,26.490635,0.180929,1.669782,18.290689,1820.090182,-0.005694,0.200944,1.016524,-73.739608,...,1819.315531,0.025117,0.529994,0.278696,-64.649649,471.994169,0.473037,0.26058,0.191183,-0.982211
std,18.973114,4.494733,1.243541,31.876338,81.515264,1416.24531,1.028989,1.421225,20.650043,26.560722,...,1701.099294,1.180754,1.647679,9.642012,36.720379,406.411683,1.073562,0.187363,0.49954,0.18778
min,3.57,12.0,-3.83,-144.0,-1039.0,-562.3,-6.12,-2.86,-83.0,-1044.0,...,-413.7,-7.78,-10.0,-63.0,-311.0,-420.3,-20.0,-0.55,-1.91,-1.0
25%,48.0,23.56,-0.58,-17.5,-13.0,1014.8,-0.66,-0.6,-9.5,-86.0,...,860.9,-0.72,-0.55,-5.0,-82.0,174.55,-0.17,0.14,-0.1,-1.0
50%,62.46,25.79,0.1,1.0,11.5,1531.6,0.04,0.03,0.5,-69.5,...,1434.5,0.08,0.39,0.0,-55.0,377.6,0.56,0.26,0.28,-1.0
75%,75.42,28.57,0.82,19.5,40.0,2299.25,0.7,0.77,10.5,-56.5,...,2335.0,0.83,1.46,5.0,-38.0,667.25,1.21,0.39,0.58,-1.0
max,100.0,100.0,50.38,954.5,3380.0,52817.9,5.76,66.36,917.0,-23.0,...,64129.4,5.57,15.56,146.0,0.0,4197.9,6.6,1.0,1.0,1.0


In [8]:
# Remove them in test and get back numpy arrays
df_te = pd.DataFrame(xte)
df_te = pd.concat([df_te, pd.DataFrame(yte, columns=[str(int(df_te.columns[-1]) + 1)])], axis=1)
df_te.dropna(axis=0, how='any', inplace=True)
xte = df_te.values[:, :-1]
yte = df_te.values[:, -1]

In [9]:
# (ii) (b)
# Check for outliers => Rule
tic()
from sklearn.covariance import EllipticEnvelope
elip_env = EllipticEnvelope().fit(xtr)
toc()

Elapsed time is 190.12281107902527 seconds.


In [10]:
# Check for outliers in train
tic()
detection = elip_env.predict(xtr)
outlier_positions_mah = [x for x in range(xtr.shape[0]) if detection[x] == -1]
# Total amount of outliers in train
print("Outliers: " + str(len(outlier_positions_mah)))
# Those from minority class (+1.0)
print("From minority class: " + str(sum(ytr[outlier_positions_mah] != -1)))
# and majority class (-1.0)
print("From majority class: " + str(sum(ytr[outlier_positions_mah] == -1)))
# Positions from majority class train outliers
outlier_positions_mah_major = [x for x in range(xtr.shape[0]) if (detection[x] == -1 and ytr[x] == -1)]
# Check
print(len(outlier_positions_mah_major) == sum(ytr[outlier_positions_mah] == -1))
toc()

Outliers: 10198
From minority class: 526
From majority class: 9672
True
Elapsed time is 1.6098523139953613 seconds.


In [11]:
# Check for outliers in test
tic()
detection_test = elip_env.predict(xte)
outlier_positions_mah_test = [x for x in range(xte.shape[0]) if detection_test[x] == -1]
# Total amount of outliers in train
print("Outliers: " + str(len(outlier_positions_mah_test)))
# Those from minority class (+1.0)
print("From minority class: " + str(sum(ytr[outlier_positions_mah_test] != -1)))
# and majority class (-1.0)
print("From majority class: " + str(sum(ytr[outlier_positions_mah_test] == -1)))
# Positions from majority class train outliers
outlier_positions_mah_major_test = [x for x in range(xte.shape[0]) if (detection_test[x] == -1 and yte[x] == -1)]
# Check
print(len(outlier_positions_mah_major_test) == sum(yte[outlier_positions_mah_test] == -1))
toc()

Outliers: 4409
From minority class: 33
From majority class: 4376
True
Elapsed time is 0.6803216934204102 seconds.


In [12]:
# Outliers deletion train
df_tr = pd.DataFrame(xtr)
df_tr = pd.concat([df_tr, pd.DataFrame(ytr, columns=[str(int(df_tr.columns[-1]) + 1)])], axis=1)
df_tr.drop(df_tr.index[outlier_positions_mah_major], inplace=True)

In [13]:
# Outliers deletion test
df_te = pd.DataFrame(xte)
df_te = pd.concat([df_te, pd.DataFrame(yte, columns=[str(int(df_te.columns[-1]) + 1)])], axis=1)
df_te.drop(df_te.index[outlier_positions_mah_major_test], inplace=True)

In [14]:
# Data modified overwritten
xtr = df_tr.values[:, :-1]
ytr = df_tr.values[:, -1]
xte = df_te.values[:, :-1]
yte = df_te.values[:, -1]

In [15]:
# Check
print([xtr.shape, xte.shape, ytr.shape, yte.shape])

[(92303, 74), (39521, 74), (92303,), (39521,)]


In [16]:
# (iii)
tic()
from sklearn.feature_selection import SelectPercentile, mutual_info_classif
selperc = SelectPercentile(mutual_info_classif, percentile=20).fit(xtr, ytr)
xtr_selperc = selperc.transform(xtr)
xte_selperc = selperc.transform(xte)
toc()

Elapsed time is 172.21773481369019 seconds.


In [17]:
# Check
print(xtr_selperc.shape)
print(xte_selperc.shape)

(92303, 15)
(39521, 15)


In [18]:
# (iii) (a)
reduction_rate = 1 - (xtr_selperc.shape[1]/xtr.shape[1])
reduction_rate

0.7972972972972973

(a) Reduction is 79.73%.

In [19]:
# (iii) (b)
print("Original data: ")
tic()
auc = automatic_scoring(xtr, ytr)
toc()
print("Reduced data: ")
tic()
auc_selperc = automatic_scoring(xtr_selperc, ytr)
print([auc, auc_selperc])
toc()

Original data: 
Elapsed time is 551.9796137809753 seconds.
Reduced data: 
[0.9742921897252323, 0.9610601860395127]
Elapsed time is 150.432599067688 seconds.


(b) Selecting those 15 variables instead of the original 74 makes sense, because the performance is really similar.
In fact it is $auc = 0.978$ with the original 74 variables and $auc = 0.960$ with the selected subset of 15.

In [20]:
# Data modified overwritten
xtr = xtr_selperc
xte = xte_selperc

In [21]:
# (iv)
df_tr = pd.DataFrame(xtr)
df_tr = pd.concat([df_tr, pd.DataFrame(ytr, columns=[str(int(df_tr.columns[-1]) + 1)])], axis=1)

In [22]:
imb_rate = (-sum([df_tr.values[x, -1] for x in list(range(0, len(df_tr.values[:, -1]))) if df_tr.values[x, -1] == -1]))/sum([df_tr.values[x, -1] for x in list(range(0, len(df_tr.values[:, -1]))) if df_tr.values[x, -1] == 1])
imb_rate

100.76736493936053

In [23]:
min_prop = sum([df_tr.values[x, -1] for x in list(range(0, len(df_tr.values[:, -1]))) if df_tr.values[x, -1] == 1])/len(df_tr.values[:, -1])
min_prop

0.009826332838585961

The imbalance is higher that 100 to 1, with a ratio below 1%.

Taking into account the huge amount of training samples (almost 100k), any pairwise distance matrix will be humongous. Moreover, distances will be always big because we are in a 15-D space. Besides, the performance is good ($auc = 0.960$). For all, I would not try to balance the data.

In [24]:
# (v)
# PCA auxiliar functions
from sklearn.decomposition import PCA

def pca_projections_train(df, n_components=0.95):
    pca = PCA(n_components)
    X = df[df.columns[:-1]]  # Assuming the class in in the last column
    pca.fit(X)
    X = pca.transform(X)
    proj_df = pd.DataFrame(data=X, columns=['PC' + str(x) for x in list(range(1, X.shape[1] + 1))])
    proj_df = pd.concat([proj_df, df[df.columns[-1]]], axis=1)
    return proj_df


def pca_projections_test(df_train, df_test, n_components=0.95):
    pca = PCA(n_components)
    XTR = df_train[df_train.columns[:-1]]  # Assuming the class in in the last column
    XTE = df_test[df_test.columns[:-1]]  # Assuming the class in in the last column
    pca.fit(XTR)
    XTE = pca.transform(XTE)
    proj_df = pd.DataFrame(data=XTE, columns=['PC' + str(x) for x in list(range(1, XTE.shape[1] + 1))])
    proj_df = pd.concat([proj_df, df_test[df_test.columns[-1]]], axis=1)
    return proj_df

In [25]:
df_tr = pd.DataFrame(xtr)
df_tr = pd.concat([df_tr, pd.DataFrame(ytr, columns=[str(int(df_tr.columns[-1]) + 1)])], axis=1)

In [26]:
tic()
df_tr = pca_projections_train(df_tr, n_components=0.95)
print(df_tr.describe())
toc()

                PC1           PC2           PC3           PC4           PC5  \
count  9.230300e+04  9.230300e+04  9.230300e+04  9.230300e+04  9.230300e+04   
mean  -2.240716e-15  7.870371e-16 -3.645839e-16 -2.394059e-17 -4.955932e-16   
std    7.458612e+01  4.030751e+01  2.643358e+01  2.524495e+01  1.591198e+01   
min   -2.874108e+02 -1.100575e+02 -2.488094e+02 -1.059369e+02 -1.023504e+02   
25%   -3.552112e+01 -2.565724e+01 -1.527327e+01 -1.583123e+01 -9.530237e+00   
50%    1.872573e+00 -6.015692e+00  2.660715e+00 -8.142105e-01  1.294871e+00   
75%    3.532206e+01  1.937319e+01  1.790973e+01  1.458722e+01  1.087128e+01   
max    2.512953e+03  1.195989e+03  3.484352e+02  5.159453e+02  2.478668e+02   

                 15  
count  92303.000000  
mean      -0.980347  
std        0.197280  
min       -1.000000  
25%       -1.000000  
50%       -1.000000  
75%       -1.000000  
max        1.000000  
Elapsed time is 0.2902262210845947 seconds.


In [27]:
xtr_pca = df_tr.values[:, :-1]
ytr_pca = df_tr.values[:, -1]

In [28]:
tic()
auc_pca = automatic_scoring(xtr_pca, ytr)
print(auc_pca)
toc()

0.9112180560975179
Elapsed time is 203.51836967468262 seconds.


The performance using pca is much worse than not using it (AUC of 0.920 against 0.960). Therefore, considering the compresion is not that high (from 15 to 5 extracted variables), I would not use it.

No need to overwrite any name of the data parts.

In [29]:
# Check
print([xtr.shape, xte.shape, ytr.shape, yte.shape])

[(92303, 15), (39521, 15), (92303,), (39521,)]


In [30]:
# (vi)
df_tr = pd.DataFrame(xtr)
df_tr = pd.concat([df_tr, pd.DataFrame(ytr, columns=[str(int(df_tr.columns[-1]) + 1)])], axis=1)
df_te = pd.DataFrame(xte)
df_te = pd.concat([df_te, pd.DataFrame(yte, columns=[str(int(df_te.columns[-1]) + 1)])], axis=1)

In [31]:
tic()
auc_test = automatic_testing(xtr, ytr, xte, yte)
print(auc_test)
toc()

0.9553353070104559
Elapsed time is 36.9154589176178 seconds.


Taking into account the original size and imbalance, the performance with the final preprocessing scheme in testing is good ($auc_{test} = 0.945$) and similar to the cross-validated training one ($auc_{train} = 0.960$).