# Control point: Data preprocessing
#### Máster en Análisis de Datos, Ciberseguridad y Computación en la Nube
#### Machine Learning - Control Point 1 (31/10/2019)

#### Name & Surnames: 

### Introduction

The dataset named Hoerchen (Hoerchen.csv) has more than 145K samples and more than 70 features.

The main objective of the control point is to preprocess the train data, designing a complete preprocessing scheme, and test it on test data. 

You must take into account that this is not a toy dataset, and its size could be relevant.

The function "automatic_scoring" provides a way to compare different schemes using a classifier, by means of 10-fold CV with AUC as the metric. You will need to set the right seed as requested. Note that the function only needs the inputs (X) and target (y) arrays.

Whenever you try several options, show the results of the discarded trials as well, because what is not visible cannot be evaluated.

The function "automatic_testing" trains the model on the train data and applies it to the test data. Do not change the classification algorithm, its parameters and the scoring choice. Those are fixed and their optimization is out of the scope of this control point.

The deliverable of this control point is this Jupyter Notebook containing the code, plus some short answers in markdown cells if required.

NOTE: Keep in mind that some functions accept both Pandas dataframes and Numpy arrays, while others accept only one of them. We should therefore know how to convert from one to the other and vice versa.
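
As a minimal sketch of the conversion mentioned in this note, the snippet below moves a small frame to a NumPy array and back. The frame `df_toy` and its column names are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
df_toy = pd.DataFrame({"v1": [1.0, 2.0], "v2": [3.0, 4.0]})

# DataFrame -> NumPy array (column names and index are dropped)
arr = df_toy.to_numpy()

# NumPy array -> DataFrame (column names and index must be supplied again)
df_back = pd.DataFrame(arr, columns=df_toy.columns, index=df_toy.index)

print(type(arr).__name__, arr.shape)
print(df_back.equals(df_toy))
```

Scikit-learn transformers such as `SelectPercentile` or `PCA` return NumPy arrays by default, so the second step is the one typically needed to keep working with column names.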

NOTE: Keep in mind that some functions will take some time to run. You can continue working on other cells during the run to avoid wasting time waiting.

### Exercises:

* (i) Split the data into 4 parts, i.e. train inputs and target (xtr, ytr) and test inputs and target (xte, yte), in such a way that the proportion of the classes is kept constant in the train and test parts. The size of the training set must be 70% of the total size of the data, and the random seed must be your ID card number (i.e. DNI without the letter). This random seed must be used throughout the control point, wherever a seed is accepted. [5%] <br>
<br>
* (ii) Check for missing values and outliers. If there are any, treat the data however you consider best, arguing your decisions. [20%] <br>
  <br>
    - (a) Are there any missing values? If so, considering the characteristics of the data, decide what to do and argue your answer. Modify your data accordingly if necessary.  <br>
      <br>
    - (b) Are there any collective outliers? If so, considering the characteristics of the data, decide what to do and argue your answer. Modify your data accordingly if necessary. <br>
    <br>
    - (c) From now on, this is your basic data. Therefore, it is safe to overwrite the names of the data parts. <br>
<br>

* (iii) The feature selection method SelectPercentile (sklearn.feature_selection.SelectPercentile) uses different scores (f_classif, mutual_info_classif, chi2, f_regression, etc) in order to select the most relevant features. In the Scikit-Learn documentation
(https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html#sklearn.feature_selection.SelectPercentile)
you have the function info and an example of use of chi2 score. Use the feature selection method SelectPercentile with the mutual_info_classif score, and percentile parameter 20. [30%] <br>
<br>
    - (a) Which is the compression ratio you obtained? (Note: Compression ratio is the proportion of variables kept after the selection). <br>
    <br>
    - (b) Compare the performance with and without feature selection with the right scheme and function. Is selecting those variables a good idea? Argue your response. <br>
    <br>
    - (c) Regarding the answer to (b), get your current data in order to continue preprocessing.  <br>
    <br>
    
* (iv) Check the balance of your current dataset. Which is its imbalance ratio? We can understand it both as the number of times the majority class is bigger than the minority class, or the proportion of the samples that are from minority class. If imbalance ratio is higher than 49 to 1 (equivalent to having less than 2% of minority class samples), discuss if it makes sense to apply imbalanced data treatments or not. Consider the size of the data and the performance you have obtained in (iii) (b) for the data you currently have. Act in consequence with total freedom on the sampling method to use if you need any. [20%] <br>
<br>
* (v) Apply principal component analysis to your data for compression, capturing at least 95% of the cumulative variance. How many extracted variables do you have? Which reduction percentage would you get if you apply it? Compare the performance with the one of your current non-compressed data. Would you use the pca compression here? Act consequently with your answer, and keep the data overwriting the names. [15%] <br> 
<br>
* (vi) Once you are here, you have final preprocessed data using the definitive preprocessing scheme you have reasonably chosen. Check now the performance using the test data. Comment on the result you have obtained compared to the one in (v). [10%]

#### Auxiliary functions

In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


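seed = 72841579 # Your DNI number without the letter and leading zeros, e.g. 09425400T => 9425400


def automatic_scoring(X, y):
    # Mean ROC AUC of the fixed random forest under k-fold cross-validation.
    # Note: cv=5 is used here, whereas the introduction describes 10-fold CV.
    average_score = cross_val_score(estimator=RandomForestClassifier(n_estimators=100, random_state=seed), X=X, y=y, cv=5, scoring='roc_auc').mean()
    return average_score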

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score


def automatic_testing(X_train, y_train, X_test, y_test):
    auc_score = roc_auc_score(y_test, RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_train, y_train).predict_proba(X_test)[:,1])
    return auc_score

### Solution:

In [3]:
import pandas as pd

df = pd.read_csv('Hoerchen.csv')
df.head()

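| | v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 | v9 | v10 | ... | v66 | v67 | v68 | v69 | v70 | v71 | v72 | v73 | v74 | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52.0 | 32.69 | 0.3 | 2.5 | 20.0 | 1256.8 | -0.89 | 0.33 | 11.0 | -55.0 | ... | 1595.1 | -1.64 | 2.83 | -2.0 | -50.0 | 445.2 | -0.35 | 0.26 | 0.76 | -1.0 |
| 1 | 58.0 | 33.33 | 0.0 | 16.5 | 9.5 | 608.1 | 0.5 | 0.07 | 20.5 | -52.5 | ... | 762.9 | 0.29 | 0.82 | -3.0 | -35.0 | 140.3 | 1.16 | 0.39 | 0.73 | -1.0 |
| 2 | 77.0 | 27.27 | -0.91 | 6.0 | 58.5 | 1623.6 | -1.4 | 0.02 | -6.5 | -48.0 | ... | 1491.8 | 0.32 | -1.29 | 0.0 | -34.0 | 658.2 | -0.76 | 0.26 | 0.24 | -1.0 |
| 3 | 41.0 | 27.91 | -0.35 | 3.0 | 46.0 | 1921.6 | -1.36 | -0.47 | -32.0 | -51.5 | ... | 2047.7 | -0.98 | 1.53 | 0.0 | -49.0 | 554.2 | -0.83 | 0.39 | 0.73 | -1.0 |
| 4 | 50.0 | 28.0 | -1.32 | -9.0 | 12.0 | 464.8 | 0.88 | 0.19 | 8.0 | -51.5 | ... | 479.5 | 0.68 | -0.59 | 2.0 | -36.0 | -6.9 | 2.02 | 0.14 | -0.23 | -1.0 |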


In [4]:
X = df.iloc[:,:-1]
X.head()

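| | v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 | v9 | v10 | ... | v65 | v66 | v67 | v68 | v69 | v70 | v71 | v72 | v73 | v74 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52.0 | 32.69 | 0.3 | 2.5 | 20.0 | 1256.8 | -0.89 | 0.33 | 11.0 | -55.0 | ... | -8.0 | 1595.1 | -1.64 | 2.83 | -2.0 | -50.0 | 445.2 | -0.35 | 0.26 | 0.76 |
| 1 | 58.0 | 33.33 | 0.0 | 16.5 | 9.5 | 608.1 | 0.5 | 0.07 | 20.5 | -52.5 | ... | -6.0 | 762.9 | 0.29 | 0.82 | -3.0 | -35.0 | 140.3 | 1.16 | 0.39 | 0.73 |
| 2 | 77.0 | 27.27 | -0.91 | 6.0 | 58.5 | 1623.6 | -1.4 | 0.02 | -6.5 | -48.0 | ... | 7.0 | 1491.8 | 0.32 | -1.29 | 0.0 | -34.0 | 658.2 | -0.76 | 0.26 | 0.24 |
| 3 | 41.0 | 27.91 | -0.35 | 3.0 | 46.0 | 1921.6 | -1.36 | -0.47 | -32.0 | -51.5 | ... | 6.0 | 2047.7 | -0.98 | 1.53 | 0.0 | -49.0 | 554.2 | -0.83 | 0.39 | 0.73 |
| 4 | 50.0 | 28.0 | -1.32 | -9.0 | 12.0 | 464.8 | 0.88 | 0.19 | 8.0 | -51.5 | ... | -14.0 | 479.5 | 0.68 | -0.59 | 2.0 | -36.0 | -6.9 | 2.02 | 0.14 | -0.23 |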


In [5]:
y = df.iloc[:,-1]
y.unique()

array([-1.,  1.])

In [14]:
y.value_counts()

-1.0    144455
 1.0      1296
Name: class, dtype: int64

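There is a clear class imbalance: only 1296 of the 145751 samples (about 0.9%) belong to class 1, i.e. roughly 111 majority samples per minority sample.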

### (i) Split the data into 4 parts, i.e. train inputs and target (xtr, ytr) and test inputs and target (xte, yte), in such a way that the proportion of the classes is kept constant in the train and test parts. The size of the training set must be 70% of the total size of the data, and the random seed must be your ID card number (i.e. DNI without the letter). This random seed must be used throughout the control point, wherever a seed is accepted. [5%] 

In [7]:
from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions equal in the train and test parts
xtr, xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=seed, stratify=y)
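
The exercise asks for the class proportions to be kept constant in both parts. As a small sketch, the snippet below checks this on hypothetical toy data (`X_toy`, `y_toy` and the seed `0` are assumptions, not the Hoerchen data) by comparing the minority share in the train and test targets of a split made with `stratify`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)  # hypothetical seed for the toy data
X_toy = rng.normal(size=(1000, 3))
y_toy = np.where(np.arange(1000) < 10, 1.0, -1.0)  # 1% minority class

# Stratified 70/30 split: each part keeps the class proportions of y_toy
_, _, ytr_toy, yte_toy = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

train_share = (ytr_toy == 1.0).mean()
test_share = (yte_toy == 1.0).mean()
print(train_share, test_share)
```

With such a rare minority class, an unstratified split can leave the two parts with noticeably different minority shares, which makes the test AUC less comparable with the training-side estimate.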

### (ii) Check for missing values and outliers. If there are any, treat the data however you consider best, arguing your decisions. [20%] 

In [8]:
df.describe()

| | v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 | v9 | v10 | ... | v66 | v67 | v68 | v69 | v70 | v71 | v72 | v73 | v74 | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 145750.0 | 145750.0 | 145750.0 | 145750.0 | 145750.0 | 145750.0 | 145750.0 | 145750.0 | 145750.0 | 145750.0 | ... | 145750.0 | 145750.0 | 145750.0 | 145750.0 | 145750.0 | 145750.0 | 145750.0 | 145750.0 | 145750.0 | 145751.0 |
| mean | 61.163887 | 26.494016 | 0.18113 | 1.688134 | 18.291705 | 1820.221539 | -0.004247 | 0.197883 | 1.012614 | -73.721691 | ... | 1820.51269 | 0.026845 | 0.529557 | 0.263808 | -64.705242 | 472.77818 | 0.472763 | 0.260546 | 0.191865 | -0.982216 |
| std | 18.992155 | 4.531692 | 1.243167 | 31.923489 | 80.619876 | 1404.343739 | 1.028843 | 1.422744 | 20.64254 | 26.53607 | ... | 1689.147092 | 1.181415 | 1.649862 | 9.65564 | 36.776184 | 406.651331 | 1.073043 | 0.187474 | 0.49965 | 0.187754 |
| min | 2.68 | 12.0 | -3.86 | -144.0 | -1082.0 | -718.8 | -6.12 | -2.86 | -85.5 | -1082.0 | ... | -668.0 | -7.78 | -10.0 | -63.0 | -322.0 | -509.2 | -20.0 | -0.55 | -1.91 | -1.0 |
| 25% | 47.95 | 23.56 | -0.58 | -17.5 | -13.0 | 1017.6 | -0.66 | -0.6 | -9.5 | -86.0 | ... | 862.4 | -0.72 | -0.55 | -5.0 | -82.0 | 175.3 | -0.17 | 0.14 | -0.1 | -1.0 |
| 50% | 62.4 | 25.77 | 0.1 | 1.0 | 11.0 | 1530.85 | 0.04 | 0.03 | 0.5 | -69.5 | ... | 1435.3 | 0.085 | 0.39 | 0.0 | -55.0 | 377.9 | 0.56 | 0.26 | 0.28 | -1.0 |
| 75% | 75.34 | 28.57 | 0.82 | 19.5 | 40.0 | 2295.6 | 0.7 | 0.77 | 10.5 | -56.5 | ... | 2334.1 | 0.83 | 1.46 | 5.0 | -38.0 | 669.1 | 1.21 | 0.39 | 0.58 | -1.0 |
| max | 100.0 | 100.0 | 50.38 | 1059.5 | 3380.0 | 52817.9 | 5.99 | 72.28 | 973.5 | -23.0 | ... | 64129.4 | 5.94 | 18.85 | 146.0 | 0.0 | 4197.9 | 6.6 | 1.0 | 1.0 | 1.0 |


### A

In [15]:
df.isnull().values.any()

True

Yes, there are missing values. First we try dropping the rows that contain them. If that removes too many minority-class samples, we will impute the missing values instead.

In [20]:
nan_removed_data = df.dropna()
nan_removed_data['class'].value_counts()

-1.0    144381
 1.0      1296
Name: class, dtype: int64

We do not lose any minority-class samples (all 1296 remain; only 74 majority-class rows are dropped), so we delete the rows with NaN values.
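
Imputation was the fallback considered above. As a hedged sketch of what it could look like, the snippet below uses scikit-learn's `SimpleImputer` with the median strategy on hypothetical toy frames (`train_toy`, `test_toy`); the medians are learned on the training part only and then reused on the test part.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing values, used only for illustration
train_toy = pd.DataFrame({"v1": [1.0, np.nan, 3.0], "v2": [10.0, 20.0, 60.0]})
test_toy = pd.DataFrame({"v1": [np.nan], "v2": [np.nan]})

# Learn the per-column medians on the training part only
imputer = SimpleImputer(strategy="median")
train_imputed = pd.DataFrame(imputer.fit_transform(train_toy), columns=train_toy.columns)

# Reuse the training medians on the test part
test_imputed = pd.DataFrame(imputer.transform(test_toy), columns=test_toy.columns)

print(test_imputed)
```

The median is chosen here because several features in the `describe()` output have extreme minima and maxima, which would pull a mean-based imputation away from typical values.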

### B

In [21]:
from sklearn.covariance import EllipticEnvelope

# Fit a robust Gaussian envelope; predict() returns -1 for outliers and 1 for inliers
elip_env = EllipticEnvelope().fit(nan_removed_data)
detection = elip_env.predict(nan_removed_data)
# Positional indices of the rows flagged as outliers
outlier_positions = [x for x in range(nan_removed_data.shape[0]) if detection[x] == -1]

print('Outliers positions: ' + str(outlier_positions))

Outliers positions: [40, 76, 281, 477, 890, 1351, 1585, 1734, 1750, 1755, 1773, 1776, 1785, 1821, 1824, 1861, 1863, 1870, 1897, 1950, 1954, 1984, 1987, 2001, 2009, 2025, 2044, 2050, 2076, 2080, 2085, 2086, 2110, 2133, 2139, 2196, 2200, 2324, 2365, 2368, 2376, 2386, 2415, 2491, 2499, 2509, 2514, 2516, 2521, 2553, 2577, 2588, 2593, 2598, 2613, 2649, 2653, 2660, 2675, 2678, 2679, 2698, 2729, 2730, 2756, 2760, 2809, 2813, 2818, 2828, 2860, 2864, 2876, 2909, 2910, 2925, 2945, 2958, 2961, 2962, 2975, 2976, 2996, 3001, 3008, 3027, 3030, 3034, 3050, 3054, 3055, 3059, 3064, 3073, 3076, 3110, 3119, 3194, 3198, 3211, 3218, 3221, 3233, 3234, 3251, 3271, 3282, 3309, 3326, 3342, 3347, 3354, 3383, 3384, 3407, 3413, 3430, 3433, 3440, 3463, 3465, 3489, 3493, 3498, 3514, 3528, 3538, 3559, 3569, 3572, 3575, 3613, 3630, 3640, 3646, 3647, 3649, 3650, 3653, 3684, 3720, 3725, 3742, 3750, 3790, 3815, 3825, 3848, 3857, 3888, 3905, 3941, 3985, 4032, 4041, 4158, 4335, 4350, 4354, 4441, 4461, 4485, 4555, 4578, 46

In [23]:
outlier_free_data = nan_removed_data.drop(nan_removed_data.index[outlier_positions])

outlier_free_data['class'].value_counts()

-1.0    130580
 1.0       529
Name: class, dtype: int64

We are losing many minority-class instances (only 529 of the 1296 remain), so we are going to delete only the majority-class outliers.

In [33]:
outlier_free_data = nan_removed_data.copy()

# Keep only the outlier positions that belong to the majority class (-1.0)
outlier_classes = nan_removed_data['class'].iloc[outlier_positions]
numbers = [n for n, c in zip(outlier_positions, outlier_classes) if c == -1.0]

outlier_free_data.drop(outlier_free_data.index[numbers], inplace=True)
outlier_free_data['class'].value_counts()

-1.0    130580
 1.0      1296
Name: class, dtype: int64
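
The same idea, dropping flagged rows only when they belong to the majority class, can be sketched with boolean masks. This is an illustration on hypothetical toy data, not the notebook's exact procedure: here the detector is fitted on the feature columns only (the target is excluded), and `contamination=0.1` and `random_state=0` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)  # hypothetical seed for the toy data
toy = pd.DataFrame(rng.normal(size=(500, 3)), columns=["v1", "v2", "v3"])
toy["class"] = np.where(np.arange(500) < 25, 1.0, -1.0)  # 5% minority class

# Fit the detector on the features only, leaving the target out
features = toy.drop(columns="class")
detector = EllipticEnvelope(contamination=0.1, random_state=0).fit(features)
is_outlier = detector.predict(features) == -1

# Drop a row only if it is flagged AND belongs to the majority class
drop_mask = is_outlier & (toy["class"] == -1.0).to_numpy()
toy_clean = toy.loc[~drop_mask]

print(toy_clean["class"].value_counts())
```

The mask version avoids a Python loop over row positions, which matters on a dataset of this size.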

### C

In [34]:
outlier_free_data.to_csv('Hoerchen_1.csv')

### (iii) The feature selection method SelectPercentile (sklearn.feature_selection.SelectPercentile) uses different scores (f_classif, mutual_info_classif, chi2, f_regression, etc) in order to select the most relevant features. In the Scikit-Learn documentation (https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html#sklearn.feature_selection.SelectPercentile) you have the function info and an example of use of chi2 score. Use the feature selection method SelectPercentile with the mutual_info_classif score, and percentile parameter 20. [30%] 

In [50]:
import pandas as pd

df = pd.read_csv('Hoerchen_1.csv')
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

X.shape

(131876, 74)

### A

In [51]:
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

X_new = SelectPercentile(mutual_info_classif, percentile=20).fit_transform(X, y)
X_new.shape

(131876, 15)

Only 15 features were selected from a total of 74, so the compression ratio is 15/74 ≈ 0.20.
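
To see which columns survive the selection and compute the ratio directly, a fitted selector exposes `get_support()`. The sketch below uses hypothetical toy data and a fixed `random_state=0` for `mutual_info_classif` (both are assumptions); only the first two toy features carry signal.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

rng = np.random.RandomState(0)  # hypothetical seed for the toy data
X_toy = pd.DataFrame(rng.normal(size=(300, 10)), columns=[f"v{i}" for i in range(1, 11)])
y_toy = np.where(X_toy["v1"] + X_toy["v2"] > 0, 1.0, -1.0)

# Fix the seed of the mutual information estimator so the selection is reproducible
seeded_mi = partial(mutual_info_classif, random_state=0)
selector = SelectPercentile(seeded_mi, percentile=20).fit(X_toy, y_toy)

kept = X_toy.columns[selector.get_support()]  # names of the selected features
ratio = len(kept) / X_toy.shape[1]            # compression ratio

print(list(kept), ratio)
```

Passing a seeded score function is one way to honour the requirement that the same random seed be used wherever one is accepted.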

### B

In [52]:
xtr1, xte1, ytr1, yte1 = train_test_split(X, y, test_size=0.3, random_state=seed)
print("Original score: " + str(automatic_testing(xtr1, ytr1, xte1, yte1)))

xtr2, xte2, ytr2, yte2 = train_test_split(X_new, y, test_size=0.3, random_state=seed)
print("Selected features score: " + str(automatic_testing(xtr2, ytr2, xte2, yte2)))


Original score: 0.9776170428381525
Selected features score: 0.962595332475693


Both scores are very good. 
With a different seed the 15 selected features might score higher. 
Here, however, the original data scores better (AUC 0.978 versus 0.963), so I continue the exercises with the original data.
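
The comparison above relies on a single train/test split. A hedged alternative, in the spirit of the cross-validated `automatic_scoring` function, wraps the selector and the classifier in a pipeline so that the selection is refit inside every fold. The toy data from `make_classification`, `toy_seed` and `n_estimators=50` are assumptions chosen to keep the sketch fast.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

toy_seed = 0  # hypothetical seed, stands in for the DNI-based seed
X_toy, y_toy = make_classification(
    n_samples=400, n_features=20, n_informative=4,
    weights=[0.9, 0.1], random_state=toy_seed,
)

forest = RandomForestClassifier(n_estimators=50, random_state=toy_seed)
selector = SelectPercentile(
    partial(mutual_info_classif, random_state=toy_seed), percentile=20
)

# Cross-validated AUC with all features
auc_all = cross_val_score(forest, X_toy, y_toy, cv=5, scoring="roc_auc").mean()

# Cross-validated AUC with feature selection refit inside each fold
auc_selected = cross_val_score(
    make_pipeline(selector, forest), X_toy, y_toy, cv=5, scoring="roc_auc"
).mean()

print(auc_all, auc_selected)
```

Selecting features inside the pipeline keeps the held-out fold out of the selection step, so the two averages are compared on equal terms.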

### C

In [53]:
df.to_csv('Hoerchen_2.csv')

###  (iv) Check the balance of your current dataset. Which is its imbalance ratio? We can understand it both as the number of times the majority class is bigger than the minority class, or the proportion of the samples that are from minority class. If imbalance ratio is higher than 49 to 1 (equivalent to having less than 2% of minority class samples), discuss if it makes sense to apply imbalanced data treatments or not. Consider the size of the data and the performance you have obtained in (iii) (b) for the data you currently have. Act in consequence with total freedom on the sampling method to use if you need any. [20%]

In [61]:
import pandas as pd
df = pd.read_csv('Hoerchen_2.csv')
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

df.shape

(131876, 75)

In [55]:
df['class'].value_counts()

-1.0    130580
 1.0      1296
Name: class, dtype: int64
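
From these counts, both readings of the imbalance ratio follow directly. The short sketch below recomputes them; the `counts` series simply restates the `value_counts()` output above.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({-1.0: 130580, 1.0: 1296})

imbalance_ratio = counts.max() / counts.min()  # majority samples per minority sample
minority_share = counts.min() / counts.sum()   # fraction of minority samples

print(f"{imbalance_ratio:.1f} to 1, minority share {minority_share:.2%}")
```

This gives roughly 100.8 to 1, i.e. a minority share of about 0.98%, which is beyond the 49 to 1 (2%) threshold named in the exercise.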

In [None]:
from imblearn.over_sampling import SMOTE 

sm = SMOTE(random_state=seed)
X_res, y_res = sm.fit_resample(X, y)
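
The cell above resamples the whole dataset. If a sampling method is used, one option is to apply it to the training part only, so that the test part keeps the original class distribution. The sketch below illustrates this with plain random oversampling via `sklearn.utils.resample`, which is a simpler technique swapped in for SMOTE; the toy data and the seed `0` are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.RandomState(0)  # hypothetical seed for the toy data
X_toy = rng.normal(size=(1000, 4))
y_toy = np.where(np.arange(1000) < 20, 1.0, -1.0)  # 2% minority class

xtr_toy, xte_toy, ytr_toy, yte_toy = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

# Oversample the minority class of the TRAINING part up to the majority size
minority = ytr_toy == 1.0
n_majority = int((~minority).sum())
x_min_up, y_min_up = resample(
    xtr_toy[minority], ytr_toy[minority],
    replace=True, n_samples=n_majority, random_state=0,
)

xtr_bal = np.vstack([xtr_toy[~minority], x_min_up])
ytr_bal = np.concatenate([ytr_toy[~minority], y_min_up])

print((ytr_bal == 1.0).sum(), (ytr_bal == -1.0).sum(), len(yte_toy))
```

The test arrays `xte_toy` and `yte_toy` are left untouched, so any score computed on them still reflects the original imbalance.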

### (v) Apply principal component analysis to your data for compression, capturing at least 95% of the cumulative variance. How many extracted variables do you have? Which reduction percentage would you get if you apply it? Compare the performance with the one of your current non-compressed data. Would you use the pca compression here? Act consequently with your answer, and keep the data overwriting the names. [15%]

In [57]:
import pandas as pd
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)

# Projects the features onto the principal components and returns them with the target
def get_df_pca(df):
    X = df.iloc[:,:-1]
    y = df.iloc[:,-1]
    X_reduced = pca.fit_transform(X)
    # One column per retained principal component
    columns = ["PCA" + str(n) for n in range(X_reduced.shape[1])]
    df_pca = pd.DataFrame(X_reduced, columns=columns, index=df.index)
    df_pca['class'] = y  # keep the target under its original column name
    return df_pca

df = pd.read_csv('Hoerchen_2.csv')
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

df_pca = get_df_pca(df)
df_pca.shape

(131876, 8)

In [59]:
print(str(df_pca.iloc[:,:-1].shape[1]) + " principal components have been selected.")

7 principal components have been selected.


In [60]:
xtr1, xte1, ytr1, yte1 = train_test_split(df.iloc[:,:-1], df.iloc[:,-1], test_size=0.3, random_state=seed)
print("Original score: " + str(automatic_testing(xtr1, ytr1, xte1, yte1)))

xtr2, xte2, ytr2, yte2 = train_test_split(df_pca.iloc[:,:-1], df_pca.iloc[:,-1], test_size=0.3, random_state=seed)
print("PCA score: " + str(automatic_testing(xtr2, ytr2, xte2, yte2)))

Original score: 0.9776170428381525
PCA score: 0.7913444620821342


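The original data scores much better (AUC 0.978) than the 7 principal components (AUC 0.791). PCA would reduce the 74 features to 7, a reduction of about 90.5%, but the loss in AUC is too large, so I would not use PCA compression here and keep the original data.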
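
PCA is sensitive to feature scale, and the `describe()` output shows features on very different scales, so the number of components needed for 95% of the variance may change once the data is standardized. The sketch below illustrates the effect on hypothetical toy data (the data, the seed `0` and the use of `StandardScaler` are assumptions, not part of the notebook's scheme) and computes the reduction percentage.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)  # hypothetical seed for the toy data
latent = rng.normal(size=(500, 3))
X_toy = latent @ rng.normal(size=(3, 12)) + 0.05 * rng.normal(size=(500, 12))
X_toy[:, 0] *= 1000.0  # one feature on a much larger scale than the rest

# PCA on raw data: the large-scale feature dominates the variance
pca_raw = PCA(n_components=0.95).fit(X_toy)

# PCA after standardization: every feature contributes on the same scale
pca_scaled = make_pipeline(StandardScaler(), PCA(n_components=0.95)).fit(X_toy)

n_raw = pca_raw.n_components_
n_scaled = pca_scaled.named_steps["pca"].n_components_
reduction = 1 - n_scaled / X_toy.shape[1]  # fraction of variables removed

print(n_raw, n_scaled, f"{reduction:.1%}")
```

Whether standardizing first would improve the AUC on the Hoerchen data is not shown here; it would have to be checked with the same scoring function as the other comparisons.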