# Control point: Data preprocessing
#### Master's in Data Analysis, Cybersecurity and Cloud Computing
#### Machine Learning - Control Point 1 (31/10/2019)

#### Name & Surnames: Iker Ocio Zuazo

### Introduction

The dataset named Hoerchen (Hoerchen.csv) has more than 145K samples and more than 70 features.

The main objective of the control point is to preprocess the train data, designing a complete preprocessing scheme, and test it on test data. 

You must take into account that this is not a toy dataset, and its size could be relevant.

The function "automatic_scoring" provides a way to compare different schemes using a classifier, by means of 10-Fold CV with AUC as the metric. You will need to set the right seed as requested. Notice that the function only needs the inputs (X) and target (y) arrays.

If you try several options at any point, show the results of the discarded trials too, because what is not visible cannot be evaluated.

The function "automatic_testing" trains the model on the train data and applies it to the test data. Do not change the classification algorithm, its parameters and the scoring choice. Those are fixed and their optimization is out of the scope of this control point.

The deliverable of this control point is this Jupyter Notebook containing the code, plus some short answers in markdown cells if required.

NOTE: Keep in mind that some functions accept both Pandas dataframes and Numpy arrays, while others accept only one of them. In any case, we should know how to convert from one to the other and vice versa.

NOTE: Keep in mind that some functions will take some time to run. You can continue working on other cells during the run to avoid wasting time waiting.

### Exercises:

* (i) Split the data into 4 parts, i.e. train inputs and target (xtr, ytr) and test inputs and target (xte and yte), in such a way that the proportion of the classes is kept constant in train and test parts. The size of the training set must be 70% of the total size of the data, and the random seed to be used must be your ID card number (i.e. DNI without the letter). This random seed must be kept during all the control point in any possible place. [5%] <br>
<br>
* (ii) Checking for missing values and outliers. If any, treat the data however you consider better, arguing your decisions. [20%] <br>
  <br>
    - (a) Is there any missing value? If so, regarding the characteristics of the data, decide what to do arguing your answer. Modify your data according to your answer if necessary.  <br>
      <br>
    - (b) Is there any collective outlier? If so, regarding the characteristics of the data, decide what to do arguing your answer.  Modify your data according to your answer if necessary. <br>
    <br>
    - (c) From now on, this is your basic data. Therefore, it is safe to overwrite the names of the data parts. <br>
<br>

* (iii) The feature selection method SelectPercentile (sklearn.feature_selection.SelectPercentile) uses different scores (f_classif, mutual_info_classif, chi2, f_regression, etc) in order to select the most relevant features. In the Scikit-Learn documentation
(https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html#sklearn.feature_selection.SelectPercentile)
you have the function info and an example of use of chi2 score. Use the feature selection method SelectPercentile with the mutual_info_classif score, and percentile parameter 20. [30%] <br>
<br>
    - (a) Which is the compression ratio you obtained? (Note: Compression ratio is the proportion of variables kept after the selection). <br>
    <br>
    - (b) Compare the performance with and without feature selection with the right scheme and function. Is selecting those variables a good idea? Argue your response. <br>
    <br>
    - (c) Regarding the answer to (b), get your current data in order to continue preprocessing.  <br>
    <br>
    
* (iv) Check the balance of your current dataset. Which is its imbalance ratio? We can understand it both as the number of times the majority class is bigger than the minority class, or the proportion of the samples that are from minority class. If imbalance ratio is higher than 49 to 1 (equivalent to having less than 2% of minority class samples), discuss if it makes sense to apply imbalanced data treatments or not. Consider the size of the data and the performance you have obtained in (iii) (b) for the data you currently have. Act in consequence with total freedom on the sampling method to use if you need any. [20%] <br>
<br>
* (v) Apply principal component analysis to your data for compression, capturing at least 95% of the cumulative variance. How many extracted variables do you have? Which reduction percentage would you get if you apply it? Compare the performance with the one of your current non-compressed data. Would you use the pca compression here? Act consequently with your answer, and keep the data overwriting the names. [15%] <br> 
<br>
* (vi) Once you are here, you have final preprocessed data using the definitive preprocessing scheme you have reasonably chosen. Check now the performance using the test data. Comment on the result you have obtained compared to the one in (v). [10%]

#### Auxiliary functions

In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


seed = 72726275  # Your DNI number without the letter and leading zeros, e.g. 09425400T => 9425400


def automatic_scoring(X, y):
    average_score = cross_val_score(estimator=RandomForestClassifier(n_estimators=100, random_state=seed), X=X, y=y, cv=5, scoring='roc_auc').mean()
    return average_score

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score


def automatic_testing(X_train, y_train, X_test, y_test):
    auc_score = roc_auc_score(y_test, RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_train, y_train).predict_proba(X_test)[:,1])
    return auc_score

### Solution:

(i) **Split the data into 4 parts, i.e. train inputs and target (xtr, ytr) and test inputs and target (xte and yte), in such a way that the proportion of the classes is kept constant in train and test parts. The size of the training set must be 70% of the total size of the data, and the random seed to be used must be your ID card number (i.e. DNI without the letter). This random seed must be kept during all the control point in any possible place. [5%]**

In [3]:
import pandas as pd


hoerchen = pd.read_csv('Hoerchen.csv')
hoerchen.head(2)

Unnamed: 0,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,...,v66,v67,v68,v69,v70,v71,v72,v73,v74,class
0,52.0,32.69,0.3,2.5,20.0,1256.8,-0.89,0.33,11.0,-55.0,...,1595.1,-1.64,2.83,-2.0,-50.0,445.2,-0.35,0.26,0.76,-1.0
1,58.0,33.33,0.0,16.5,9.5,608.1,0.5,0.07,20.5,-52.5,...,762.9,0.29,0.82,-3.0,-35.0,140.3,1.16,0.39,0.73,-1.0


In [4]:
from sklearn.model_selection import train_test_split

X = hoerchen.iloc[:,:-1]
y = hoerchen.iloc[:,-1]

# stratify=y keeps the class proportions equal in the train and test parts
xtr, xte, ytr, yte = train_test_split(X, y, random_state=seed, train_size=0.70, stratify=y)
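
The requirement that the class proportion stays constant in both parts can be checked directly. The following is a minimal sketch on synthetic data, not the Hoerchen file: `X_demo`, `y_demo` and the 2% minority rate are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 3))              # hypothetical features
y_demo = np.where(np.arange(1000) < 20, 1, -1)   # 20 positives, i.e. 2% minority class

# stratify=y_demo asks for the same class proportions in both parts
_, _, ytr_demo, yte_demo = train_test_split(
    X_demo, y_demo, train_size=0.70, random_state=0, stratify=y_demo)

train_rate = np.mean(ytr_demo == 1)
test_rate = np.mean(yte_demo == 1)
print("Minority rate in train:", train_rate)
print("Minority rate in test: ", test_rate)
```

Without `stratify`, the two rates can drift apart by chance, which matters when the minority class is this small.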

(ii) **Checking for missing values and outliers. If any, treat the data however you consider better, arguing your decisions. [20%]**<br>
  <br>
    - (a) Is there any missing value? If so, regarding the characteristics of the data, decide what to do arguing your answer. Modify your data according to your answer if necessary.


In [5]:
print("Is there any NaN value?")
print(hoerchen.isnull().any())
print("Number of rows: "+str(hoerchen.shape[0]))
hoerchen.describe()

Is there any NaN value?
v1        True
v2        True
v3        True
v4        True
v5        True
         ...  
v71       True
v72       True
v73       True
v74       True
class    False
Length: 75, dtype: bool
Number of rows: 145751


Unnamed: 0,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,...,v66,v67,v68,v69,v70,v71,v72,v73,v74,class
count,145750.0,145750.0,145750.0,145750.0,145750.0,145750.0,145750.0,145750.0,145750.0,145750.0,...,145750.0,145750.0,145750.0,145750.0,145750.0,145750.0,145750.0,145750.0,145750.0,145751.0
mean,61.163887,26.494016,0.18113,1.688134,18.291705,1820.221539,-0.004247,0.197883,1.012614,-73.721691,...,1820.51269,0.026845,0.529557,0.263808,-64.705242,472.77818,0.472763,0.260546,0.191865,-0.982216
std,18.992155,4.531692,1.243167,31.923489,80.619876,1404.343739,1.028843,1.422744,20.64254,26.53607,...,1689.147092,1.181415,1.649862,9.65564,36.776184,406.651331,1.073043,0.187474,0.49965,0.187754
min,2.68,12.0,-3.86,-144.0,-1082.0,-718.8,-6.12,-2.86,-85.5,-1082.0,...,-668.0,-7.78,-10.0,-63.0,-322.0,-509.2,-20.0,-0.55,-1.91,-1.0
25%,47.95,23.56,-0.58,-17.5,-13.0,1017.6,-0.66,-0.6,-9.5,-86.0,...,862.4,-0.72,-0.55,-5.0,-82.0,175.3,-0.17,0.14,-0.1,-1.0
50%,62.4,25.77,0.1,1.0,11.0,1530.85,0.04,0.03,0.5,-69.5,...,1435.3,0.085,0.39,0.0,-55.0,377.9,0.56,0.26,0.28,-1.0
75%,75.34,28.57,0.82,19.5,40.0,2295.6,0.7,0.77,10.5,-56.5,...,2334.1,0.83,1.46,5.0,-38.0,669.1,1.21,0.39,0.58,-1.0
max,100.0,100.0,50.38,1059.5,3380.0,52817.9,5.99,72.28,973.5,-23.0,...,64129.4,5.94,18.85,146.0,0.0,4197.9,6.6,1.0,1.0,1.0


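The *describe* table shows a count of 145750 for every feature column against 145751 rows, so each feature has exactly one missing value. With so few gaps, I impute the NaN values with the column mean instead of dropping rows.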

In [6]:
from sklearn.impute import SimpleImputer

imp = SimpleImputer(strategy='mean')

nan_imputed_data = pd.DataFrame(data=imp.fit_transform(hoerchen))
nan_imputed_data.columns = hoerchen.columns
nan_imputed_data.describe()

Unnamed: 0,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,...,v66,v67,v68,v69,v70,v71,v72,v73,v74,class
count,145751.0,145751.0,145751.0,145751.0,145751.0,145751.0,145751.0,145751.0,145751.0,145751.0,...,145751.0,145751.0,145751.0,145751.0,145751.0,145751.0,145751.0,145751.0,145751.0,145751.0
mean,61.163887,26.494016,0.18113,1.688134,18.291705,1820.221539,-0.004247,0.197883,1.012614,-73.721691,...,1820.51269,0.026845,0.529557,0.263808,-64.705242,472.77818,0.472763,0.260546,0.191865,-0.982216
std,18.992089,4.531676,1.243162,31.92338,80.619599,1404.338921,1.02884,1.422739,20.642469,26.535979,...,1689.141298,1.181411,1.649856,9.655607,36.776057,406.649936,1.073039,0.187473,0.499648,0.187754
min,2.68,12.0,-3.86,-144.0,-1082.0,-718.8,-6.12,-2.86,-85.5,-1082.0,...,-668.0,-7.78,-10.0,-63.0,-322.0,-509.2,-20.0,-0.55,-1.91,-1.0
25%,47.95,23.56,-0.58,-17.5,-13.0,1017.6,-0.66,-0.6,-9.5,-86.0,...,862.4,-0.72,-0.55,-5.0,-82.0,175.3,-0.17,0.14,-0.1,-1.0
50%,62.4,25.77,0.1,1.0,11.0,1530.9,0.04,0.03,0.5,-69.5,...,1435.3,0.08,0.39,0.0,-55.0,377.9,0.56,0.26,0.28,-1.0
75%,75.34,28.57,0.82,19.5,40.0,2295.6,0.7,0.77,10.5,-56.5,...,2334.1,0.83,1.46,5.0,-38.0,669.1,1.21,0.39,0.58,-1.0
max,100.0,100.0,50.38,1059.5,3380.0,52817.9,5.99,72.28,973.5,-23.0,...,64129.4,5.94,18.85,146.0,0.0,4197.9,6.6,1.0,1.0,1.0
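
One possible refinement, sketched here under assumptions, is to learn the imputation means from the training part only and reuse them on the test part, so that no test information leaks into preprocessing. The tiny `demo` frame and the names `train_part` and `test_part` are hypothetical and only illustrate the fit/transform order.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data with one missing value per column
demo = pd.DataFrame({"v1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
                     "v2": [10.0, np.nan, 30.0, 40.0, 50.0, 60.0]})
train_part, test_part = train_test_split(demo, train_size=4, random_state=0)

imp_demo = SimpleImputer(strategy="mean")
# fit_transform learns the column means on the training rows only
train_filled = pd.DataFrame(imp_demo.fit_transform(train_part),
                            columns=demo.columns, index=train_part.index)
# transform reuses those training means on the test rows
test_filled = pd.DataFrame(imp_demo.transform(test_part),
                           columns=demo.columns, index=test_part.index)
print(train_filled)
print(test_filled)
```

With a single missing value per column in more than 145K rows, the practical difference is likely negligible, but the order of operations is the same for any fitted preprocessing step.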


    - (b) Is there any collective outlier? If so, regarding the characteristics of the data, decide what to do arguing your answer.  Modify your data according to your answer if necessary. <br>
    <br>


In [7]:
nan_imputed_data["class"].unique()

array([-1.,  1.])

In [8]:
import sklearn
from sklearn.covariance import EllipticEnvelope

X = nan_imputed_data.iloc[:,:-1]
y = nan_imputed_data.iloc[:,-1]

In [9]:
%%time
elip_env = sklearn.covariance.EllipticEnvelope(random_state = seed).fit(X)
detection = elip_env.predict(X)

Wall time: 2min 26s


In [10]:
outlier_positions_mah = []
for x in range(X.shape[0]):
    if detection[x] == -1:
        outlier_positions_mah.append(x)

In [11]:
if not outlier_positions_mah:
    print("There are no outliers in the data.")
else:
    print("The " + str(len(outlier_positions_mah)) + " outliers found are in positions:\n" + str(outlier_positions_mah))
    # Class label of each detected outlier, in the same order as the positions
    classes_location = list(nan_imputed_data["class"].values[outlier_positions_mah])
    print("They correspond respectively to classes:\n"+str(classes_location))

The 14576 outliers found are in positions:
[40, 76, 281, 477, 890, 1352, 1586, 1735, 1751, 1756, 1774, 1777, 1786, 1822, 1825, 1862, 1864, 1871, 1898, 1951, 1955, 1985, 1988, 2002, 2007, 2010, 2026, 2045, 2051, 2077, 2081, 2087, 2111, 2134, 2140, 2197, 2201, 2325, 2366, 2369, 2377, 2387, 2416, 2488, 2492, 2500, 2510, 2515, 2517, 2522, 2554, 2578, 2589, 2594, 2599, 2614, 2650, 2654, 2661, 2676, 2679, 2680, 2699, 2730, 2731, 2757, 2761, 2810, 2814, 2819, 2829, 2861, 2866, 2877, 2910, 2911, 2926, 2946, 2959, 2962, 2963, 2976, 2997, 3002, 3009, 3028, 3031, 3035, 3051, 3055, 3056, 3060, 3065, 3074, 3077, 3111, 3120, 3195, 3199, 3212, 3219, 3222, 3234, 3235, 3252, 3261, 3272, 3277, 3283, 3310, 3327, 3343, 3348, 3355, 3384, 3385, 3408, 3414, 3431, 3434, 3441, 3464, 3466, 3490, 3494, 3499, 3515, 3529, 3539, 3560, 3570, 3573, 3576, 3614, 3615, 3631, 3642, 3648, 3649, 3651, 3652, 3655, 3680, 3686, 3698, 3722, 3727, 3744, 3752, 3780, 3792, 3817, 3827, 3850, 3859, 3890, 3907, 3943, 3987, 4034, 404

The 14576 flagged rows are about 10% of the 145751 rows, which matches the default `contamination=0.1` of `EllipticEnvelope`. The dataset is still large without them, and we need to speed up model training, so I remove the outliers.
If the score needs to improve later, we can retry without removing them.
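
The link between the `contamination` parameter and the number of flagged rows can be illustrated on synthetic data. This is only a sketch: `X_demo` is clean Gaussian noise with no planted outliers, and the two contamination values are arbitrary choices.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(2000, 3))  # hypothetical data without planted outliers

flagged = {}
for contamination in (0.01, 0.1):
    detector = EllipticEnvelope(contamination=contamination, random_state=0).fit(X_demo)
    # predict returns -1 for outliers and 1 for inliers
    flagged[contamination] = float(np.mean(detector.predict(X_demo) == -1))

print(flagged)
```

On the training data, the flagged share follows the chosen contamination closely, so the count of outliers says more about the parameter than about the data unless the parameter is tuned.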

In [12]:
outlier_free_data = X.copy()
outlier_free_data["class"] = y.copy()

outlier_free_data.drop(outlier_free_data.index[outlier_positions_mah], inplace=True)

    - (c) From now on, this is your basic data. Therefore, it is safe to overwrite the names of the data parts. <br>

In [13]:
X = outlier_free_data.iloc[:,:-1]
y = outlier_free_data.iloc[:,-1]

# stratify=y keeps the class proportions equal in the train and test parts
xtr, xte, ytr, yte = train_test_split(X, y, random_state=seed, train_size=0.70, stratify=y)

(iii) **The feature selection method SelectPercentile (sklearn.feature_selection.SelectPercentile) uses different scores (f_classif, mutual_info_classif, chi2, f_regression, etc) in order to select the most relevant features. In the Scikit-Learn documentation
(https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html#sklearn.feature_selection.SelectPercentile)
you have the function info and an example of use of chi2 score. Use the feature selection method SelectPercentile with the mutual_info_classif score, and percentile parameter 20. [30%]** <br>
<br>
    - (a) Which is the compression ratio you obtained? (Note: Compression ratio is the proportion of variables kept after the selection). <br>
    <br>
    - (b) Compare the performance with and without feature selection with the right scheme and function. Is selecting those variables a good idea? Argue your response. <br>
    <br>
    - (c) Regarding the answer to (b), get your current data in order to continue preprocessing.  <br>
    <br>

In [14]:
%%time
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

print("Shape before SelectPercentile: "+str(X.shape[0])+"-"+str(X.shape[1]))
X_new = SelectPercentile(mutual_info_classif, percentile=20).fit_transform(X, y)
print("Shape after SelectPercentile: "+str(X_new.shape[0])+"-"+str(X_new.shape[1]))

Shape before SelectPercentile: 131175-74
Shape after SelectPercentile: 131175-15
Wall time: 1min 43s


In [15]:
# Compression ratio as defined in (a): proportion of variables kept after the selection
print("Compression ratio is "+str(X_new.shape[1])+"/"+str(X.shape[1])+": "+str(round(X_new.shape[1]/X.shape[1], 4)))

Compression ratio is 15/74: 0.2027


In [16]:
%%time

compare_score = automatic_scoring(X, y)
print("Score before feature selections")
print(compare_score)

Score before feature selections
0.9436689758679714
Wall time: 4min 30s


In [17]:
%%time

compare_score = automatic_scoring(X_new, y)
print("Score after feature selections")
print(compare_score)

Score after feature selections
0.9258791154224619
Wall time: 1min 29s


Feature selection reduces the features from 74 to 15, while the AUC drops only from 0.9437 to 0.9259 and the scoring time falls from 4min 30s to 1min 29s. The score stays similar, so keeping this feature selection is a good option to speed up model creation.
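
One way to read "the right scheme" in (b) is to put the selector inside a pipeline, so that it is refit on the training folds of each cross-validation split instead of seeing the whole dataset once. The sketch below assumes synthetic data from `make_classification`; `score_func` and `selection_pipeline` are illustrative names, and the classifier settings only mirror those of `automatic_scoring`.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for the real data
X_demo, y_demo = make_classification(n_samples=400, n_features=20, n_informative=4,
                                     n_redundant=0, random_state=0)

# partial fixes the random state of the mutual information estimate
score_func = partial(mutual_info_classif, random_state=0)
selection_pipeline = make_pipeline(
    SelectPercentile(score_func, percentile=20),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

# The selector is fit again inside every fold, using only that fold's training rows
auc_with_selection = cross_val_score(selection_pipeline, X_demo, y_demo,
                                     cv=5, scoring="roc_auc").mean()
print(auc_with_selection)
```

Selecting features on the full data before cross-validation lets the validation folds influence the selection, which tends to make the reported score optimistic.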

* (iv) **Check the balance of your current dataset. Which is its imbalance ratio? We can understand it both as the number of times the majority class is bigger than the minority class, or the proportion of the samples that are from minority class. If imbalance ratio is higher than 49 to 1 (equivalent to having less than 2% of minority class samples), discuss if it makes sense to apply imbalanced data treatments or not. Consider the size of the data and the performance you have obtained in (iii) (b) for the data you currently have. Act in consequence with total freedom on the sampling method to use if you need any. [20%]**

In [18]:
from collections import Counter

xtr, xte, ytr, yte = train_test_split(X, y, random_state=seed, train_size = 0.70)

print('Training statistics: {}'.format(Counter(ytr)))
print('Testing statistics: {}'.format(Counter(yte)))

82239/327


Training statistics: Counter({-1.0: 91430, 1.0: 392})
Testing statistics: Counter({-1.0: 39199, 1.0: 154})


251.4954128440367
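
The two readings of the imbalance ratio described in (iv) can be computed from the class counts themselves. This sketch copies the training counts from the Counter output above; `train_counts` is a hypothetical name.

```python
from collections import Counter

# Training counts copied from the Counter output above
train_counts = Counter({-1.0: 91430, 1.0: 392})

majority = max(train_counts.values())
minority = min(train_counts.values())

imbalance_ratio = majority / minority                    # majority samples per minority sample
minority_share = minority / sum(train_counts.values())   # fraction of minority samples

print("Imbalance ratio: {:.1f} to 1".format(imbalance_ratio))
print("Minority share: {:.2%}".format(minority_share))
```

For these counts the ratio is roughly 233 to 1, with about 0.43% of the samples in the minority class, well beyond the 49 to 1 threshold stated in the exercise.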

In [19]:
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
from sklearn.svm import LinearSVC

# Under-sample the majority class with NearMiss-2, then fit a linear SVM on the training part
pipeline = make_pipeline(NearMiss(version=2),
                         LinearSVC(random_state=seed, max_iter=10000))
pipeline.fit(xtr, ytr)
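
As a dependency-free illustration of what an under-sampling treatment does, the sketch below uses plain random under-sampling with NumPy. It is not NearMiss: `random_undersample` is a hypothetical helper, and the 5 to 1 target ratio and the synthetic data are arbitrary assumptions.

```python
import numpy as np


def random_undersample(X, y, ratio=1.0, random_state=0):
    """Keep every minority row and a random subset of majority rows.

    ratio is the desired number of majority rows per minority row.
    """
    rng = np.random.RandomState(random_state)
    labels, counts = np.unique(y, return_counts=True)
    minority_label = labels[np.argmin(counts)]
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n_keep = min(len(majority_idx), int(round(ratio * len(minority_idx))))
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([minority_idx, kept_majority]))
    return X[keep], y[keep]


rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 4))                  # hypothetical features
y_demo = np.where(np.arange(1000) < 10, 1.0, -1.0)   # 10 positives, 990 negatives

X_bal, y_bal = random_undersample(X_demo, y_demo, ratio=5.0)
print(X_bal.shape, np.unique(y_bal, return_counts=True))
```

Any resampling of this kind should be applied to the training part only, so that the test part keeps the original class distribution.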

* (v) **Apply principal component analysis to your data for compression, capturing at least 95% of the cumulative variance. How many extracted variables do you have? Which reduction percentage would you get if you apply it? Compare the performance with the one of your current non-compressed data. Would you use the pca compression here? Act consequently with your answer, and keep the data overwriting the names. [15%]**

In [None]:
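from sklearn.decomposition import PCA

# A float n_components keeps the fewest components that explain at least 95% of the variance
pca = PCA(n_components=0.95, svd_solver='full')
X_pca = pca.fit_transform(X)
proj_df = pd.DataFrame(data=X_pca, columns=['PC' + str(i) for i in range(1, X_pca.shape[1] + 1)])
# y still carries the index from before the outlier drop, so reset it to align with proj_df
proj_df = pd.concat([proj_df, y.reset_index(drop=True)], axis=1)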
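
The questions in (v) about the number of extracted variables and the reduction percentage can be answered from the fitted PCA object. The sketch below assumes synthetic data with three underlying factors and standardizes the features first, which is a common choice rather than something the exercise prescribes; all names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
latent = rng.normal(size=(500, 3))       # three hidden factors
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(500, 10))  # ten observed features

# PCA is scale-sensitive, so the features are standardized before fitting
X_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)

n_kept = pca_demo.n_components_
captured = pca_demo.explained_variance_ratio_.sum()
reduction_pct = 100.0 * (1 - n_kept / X_demo.shape[1])

print("Components kept:", n_kept)
print("Variance captured: {:.3f}".format(captured))
print("Reduction: {:.1f}%".format(reduction_pct))
```

The same three quantities, read from a PCA fitted on the real training data, would give the numbers the exercise asks for.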



* (vi) **Once you are here, you have final preprocessed data using the definitive preprocessing scheme you have reasonably chosen. Check now the performance using the test data. Comment on the result you have obtained compared to the one in (v). [10%]**
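
A possible shape for this final check is sketched below on synthetic data: every preprocessing step is fitted on the training part, applied unchanged to the test part, and the result is scored with a function that mirrors `automatic_testing`. The data, the `_demo` names and `automatic_testing_demo` are assumptions for illustration, not the definitive scheme.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile, mutual_info_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def automatic_testing_demo(X_train, y_train, X_test, y_test, seed=0):
    """Mirror of automatic_testing with an explicit seed argument."""
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])


# Hypothetical stand-in for the real data
X_demo, y_demo = make_classification(n_samples=600, n_features=20, n_informative=4,
                                     n_redundant=0, random_state=0)
xtr_demo, xte_demo, ytr_demo, yte_demo = train_test_split(
    X_demo, y_demo, train_size=0.70, random_state=0, stratify=y_demo)

# Fit the selector on the training part only, then reuse it on the test part
selector = SelectPercentile(partial(mutual_info_classif, random_state=0), percentile=20)
xtr_sel = selector.fit_transform(xtr_demo, ytr_demo)
xte_sel = selector.transform(xte_demo)

test_auc = automatic_testing_demo(xtr_sel, ytr_demo, xte_sel, yte_demo)
print(test_auc)
```

Comparing this test AUC with the cross-validated score from the earlier steps shows whether the chosen preprocessing generalizes to unseen data.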