In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session.

#  **PIMA INDIANS DATASET**
## Looking for the highest performance with minimum factors.

<a id="tc"></a>
### **Table of contents**

1. [Introduction: understanding the problem](#introduction)
2. [Exploratory data analysis](#eda)
    - [Data summary](#datasummary)
    - [Duplicated data](#duplicated)
    - [Statistical summary](#statisticalsummary)
    - [Data Cleaning](#datacleaning)
    - [Outlier presence](#outliers)
    - [Relationship among numerical variables: correlation](#correlation)
    - [EDA Conclusions](#edaconclusions)
3. [Metric selection](#metric)
4. [Baseline model](#baselinemodel)
5. [Outlier Impact on performance](#outlierremoval)
    - [Outlier definition](#outlierdef)
    - [Custom function](#functiondef)
    - [Testing the pipeline](#outpipeline)
    - [Comments on outlier removal](#outremovalcomment)
6. [Testing different classification models](#testingmodels)
7. [Feature engineering](#featureengineering)
8. [Recursive feature elimination](#rfe)
9. [Conclusions](#conclu)
10. [Useful links](#links)

<a id="introduction"></a>

## 1. **Introduction: understanding the problem**

Before importing any library or writing any code, we should first understand the problem and the provided dataset. Thus, I will first read the problem context to see clearly what this exercise is about: 

> *The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.*

Thus, we should come up with a model capable of predicting whether a patient -with PIMA Indian heritage- has diabetes or not.

From the previous statement we can say that the target variable will be a 0-1 output (or, equivalently, a probability above or below 0.5). We will confirm this in the following sections. 

Now that we know the nature of the problem and the type of model that we may use to predict the explained variable, I will perform an exploratory data analysis (EDA) to better understand the variables available for the exercise as well as their properties.

*Why should the EDA stage be performed?*

We never know the data beforehand: most of the time we start from a very general description of the problem. Also, the explanatory variables to be used in our models differ from exercise to exercise. In addition, depending on the variables and their properties, some models and techniques work better than others. 

Thus, I will perform an EDA looking for:

* Raw data properties (e.g. number of variables, missing information within columns, nature of the variables such as numerical or categorical).
* Variable properties:
    * Numerical: distribution, outliers, basic statistics, how they interact with each other.
    * Categorical: concentration of values in certain classes. 
* Basic data cleaning tasks.


[Back to Table of Contents](#tc)



<a id="eda"></a>
##  **2. Exploratory Data Analysis**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sbn
sbn.set_style("dark")

In [None]:
dataset = pd.read_csv(r"/kaggle/input/pima-indians-diabetes-database/diabetes.csv")

<a id="datasummary"></a>
### **2.1 Data Summary**

In [None]:
dataset.info()

##### From the previous output, we can know that: 

* The dataset is composed of 8 numerical (either integer or float) explanatory variables (#0-#7) and one explained variable called "Outcome" (#8) that indicates whether the patient has diabetes or not.

* No variables contain null values.

Having ruled out null values, I will check whether there are duplicated rows and, if so, remove them from the dataset:

<a id="duplicated"></a>
### **2.2 Duplicated data**

In [None]:
dataset.duplicated().unique()

The *.duplicated()* method returns a boolean Series with one entry per row of the dataset, indicating whether each row repeats an earlier one. Applying *.unique()* to it returns the distinct values found in that Series. Since the statement above only returns "False", no rows are duplicated.
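
As a minimal sketch of the same check, here is a toy frame (the `toy` DataFrame and its values are made up for illustration, they are not taken from the dataset). It also shows `.any()` and `.sum()`, which arguably answer the question more directly than `.unique()`:

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one.
toy = pd.DataFrame({"Glucose": [148, 85, 148], "BMI": [33.6, 26.6, 33.6]})

mask = toy.duplicated()          # boolean Series, True for rows that repeat an earlier row
print(mask.unique())             # distinct values present in the mask
print(mask.any(), mask.sum())    # is there any duplicate, and how many

# If duplicates were present, they could be dropped like this.
deduplicated = toy.drop_duplicates()
```

On the real dataset the mask contains only `False`, so `drop_duplicates()` would leave it unchanged.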

<a id="statisticalsummary"></a>
### **2.3 Statistical Summary**

Now that we are sure that the dataset has neither duplicated rows nor null values, we can run a statistical summary that provides a general "view" of the information provided. We first need to separate the output variable from the explanatory ones:

In [None]:
X = dataset.drop(["Outcome"], axis = 1)
y = dataset["Outcome"]

In [None]:
X.describe()

**Conclusion:** 

From the previous output we can see that: 
* There is a huge difference between the minimum and maximum values in the columns *Glucose*, *BloodPressure* and *Insulin*.
* Both *Insulin* and *DiabetesPedigreeFunction* may have outliers. We will check this through boxplots later on. 
* Variable standardization will be needed.
* The columns *Glucose*, *BloodPressure*, *SkinThickness*, *Insulin* and *BMI* have a minimum value of 0. This needs further analysis.

<a id="datacleaning"></a>
### **2.4 Data Cleaning**

In [None]:
for col in X.columns:
    if np.min(X[col]) == 0:
        print("Number of rows within {} with zero values {}".format(col, X[X[col] == 0].shape[0]))

Without attempting an in-depth review of how these variables work and their normal levels, I have done some research to get a basic understanding of what I am analysing and a starting point for the treatment of the following variables:

1. [What will happen when the blood pressure goes to zero?](https://www.quora.com/What-will-happen-when-the-blood-pressure-goes-zero) Without claiming medical rigor, we have reasonable arguments to state that zero blood pressure cases should be treated for analysis purposes (removal or substituting the zero values by another statistical measurement).
2. [Normal and Diabetic Blood levels](https://www.diabetes.co.uk/diabetes_care/blood-sugar-level-ranges.html). Once again, and taking into account only this source of information, we can deduce that zero glucose in the bloodstream is not normal. Thus, we should decide how to deal with such values (removal or substituting the zero values by another statistical measurement).
3. [Insulin Basics](https://dtc.ucsf.edu/types-of-diabetes/type2/treatment-of-type-2-diabetes/medications-and-therapies/type-2-insulin-rx/insulin-basics/) Again, since I haven't found a "normal" zero insulin level, we need to make the same decision as above (removal or substituting the zero values by another statistical measurement).
4. [About Adult BMI](https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html) BMI is defined as weight divided by the square of height (kg/m²). A zero BMI would therefore imply zero weight or infinite height, so we can confidently say that such a value is nonsense and should be either substituted or removed.
5. [Skin Thickness](https://www.histology.leeds.ac.uk/skin/skin_layers.php). This article states that skin thickness can vary from 0.5mm to 4.0mm depending on the body part. For that reason, it seems reasonable to make the same decision regarding the zero values.


**Conclusion:** 
There are some cases where we could simply remove the affected rows (such as zero BMI, since there are only 11 cases). Nevertheless, Insulin and SkinThickness have far more zero cases (374 and 227, respectively). For that reason, I consider it better to substitute such values instead of eliminating the rows from the original dataset.

To do so, I will first replace the zero values with NaN:

In [None]:
columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
# Replace zeros with NaN in one vectorized step. This avoids chained assignment
# (X[col].iloc[i] = ...), which may silently fail to modify X.
X[columns] = X[columns].replace(0, np.nan)

Once all zero values have been replaced, it's time to give them a better value. In this case, I will opt for the average. 

If we had categorical variables such as sex, place, race, etc., I would calculate the mean value for the different segments of people [like the Titanic exercise](https://www.kaggle.com/josemaria2/titanic-analysis-with-97-accuracy), but since all variables are numerical I cannot do it in the same way.

Nevertheless, a good exercise could consist of categorizing the numerical variables and imputing the mean or mode of each resulting category. Since all variables here are numeric, we can also use the KNNImputer from the scikit-learn library. This method replaces each NaN with the mean value of that feature among the K nearest rows (weighted by distance in the configuration used here).
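
To make the idea concrete, here is a minimal sketch of `KNNImputer` on a tiny matrix. The `toy` array and the choice of `n_neighbors=2` are assumptions made only for illustration:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix with one missing value in the second column.
toy = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [10.0, 20.0],
])

# With weights="distance", closer neighbours contribute more to the average.
imputer = KNNImputer(n_neighbors=2, weights="distance")
filled = imputer.fit_transform(toy)

# The two nearest rows (first and third) are equally distant here, so the
# missing entry is filled with a value between their second-column values.
print(filled[1, 1])
```

Rows without missing values are returned unchanged. Only the NaN entry is estimated from its neighbours.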

In [None]:
from sklearn.impute import KNNImputer

In [None]:
imputer = KNNImputer(n_neighbors = 10, weights = "distance")
X_trans = pd.DataFrame(data = imputer.fit_transform(X), columns = X.columns)

<a id="outliers"></a>
### **2.5 Outliers presence** 

Outliers within a distribution are values that are much lower or higher than the rest. It must also be stated that there is no single agreed definition of an outlier: whether a value counts as "normal" or as an "outlier" depends on its relative position within the distribution. 

For that reason, it is better to represent the data graphically so we can visually spot the presence of such values.

Boxplots represent outliers with a special mark (such as dots, crosses, etc., depending on the software used). In this example, the seaborn [boxplot function](https://seaborn.pydata.org/generated/seaborn.boxplot.html) treats as outliers those points lying more than 1.5 times the IQR below the first quartile or above the third quartile.
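
The rule behind those marks can be sketched numerically. The `values` array below is invented for illustration, and the fences are computed by hand with NumPy rather than read from seaborn:

```python
import numpy as np

# Hypothetical values with one obvious extreme point.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are the ones a boxplot draws as individual marks.
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence, outliers)
```

Only the value 100 falls outside the fences in this toy sample.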

In [None]:
# One boxplot per variable, drawn on the axes created by plt.subplots.
fig, axes = plt.subplots(2, 4, figsize = (15,10), squeeze = False)
for ax, col in zip(axes.flat, X_trans.columns):
    sbn.boxplot(y = X_trans[col], ax = ax)
    ax.set_ylabel(col)
plt.tight_layout()
plt.show()

**Conclusion:** 

We can see from the previous boxplots that, except for *Glucose*, all variables present extreme values. These outliers will be treated later within a pipeline.

<a id="correlation"></a>
### **2.6 Relationship among numerical variables: correlation**

Since all our data is numerical, a question may arise: "Is there any correlation among our variables?". For that reason, and in order to answer the previous question, we can run a heatmap where the correlation coefficient is calculated for each combination of variables:

In [None]:
sbn.heatmap(X_trans.corr(method = "pearson"), annot = True)

**Conclusion:** 
1. The highest correlation values range from 0.54 to 0.65. One of them is the relationship between Pregnancies and Age (0.54), which seems quite logical. Other examples are the pairs Insulin - Glucose and BMI - SkinThickness, with a value of 0.64.

Since there is no correlation beyond 0.70, I decide not to drop any variable based on the correlation coefficient.
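
The threshold check can also be done programmatically instead of reading the heatmap. The following is a minimal sketch on made-up data: the `toy` frame and the `high_pairs` name are hypothetical, and the 0.70 threshold is the one used above.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data: "b" is almost a multiple of "a", "c" is unrelated.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12.5],
    "c": [5, 1, 4, 2, 6, 3],
})

threshold = 0.70
corr = toy.corr(method="pearson").abs()

# Visit each pair of columns once and keep those above the threshold.
high_pairs = {
    (left, right): corr.loc[left, right]
    for left, right in combinations(corr.columns, 2)
    if corr.loc[left, right] > threshold
}
print(high_pairs)
```

Run on `X_trans`, the same loop should return an empty dictionary if no pair exceeds 0.70.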

<a id="edaconclusions"></a>
### **2.7 EDA Conclusions**

1. Presence of outliers: we have considered as outliers the points lying more than 1.5 * IQR beyond the quartiles, and those will be processed within a pipeline. A different definition may lead to different results. 
2. Implausible zero values for several variables have been treated with a KNN imputer.
3. No variables have been dropped from the dataset based on the Pearson correlation coefficient.

[Back to Table of Contents](#tc)

<a id="metric"></a>
##  **3. Metric selection**

Once the EDA has been performed, it is time to construct and compare different solutions to our classification problem. 
There are different metrics that could be used on this exercise, but it is up to the analyst to use a particular metric depending on the purpose of the exercise and the provided information. 

Before selecting the metric, I first check the amount of cases with and without diabetes.

In [None]:
sbn.countplot(x = y)

As we can see in the previous plot, there are around 250 positive cases, whereas the negative ones are almost double that number. For this reason, I will use the F1-score instead of accuracy. 
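
A small sketch of why accuracy can be misleading with this kind of imbalance. The labels below are invented (8 negatives, 4 positives) and the "model" simply predicts the majority class:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels with roughly the 2:1 imbalance seen in the plot.
y_true = [0] * 8 + [1] * 4

# A useless "model" that always predicts the majority (negative) class.
y_pred = [0] * 12

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positive case is predicted.
f1 = f1_score(y_true, y_pred, zero_division=0)
print(accuracy, f1)
```

Accuracy still looks acceptable (two thirds of the cases are right), while the F1-score of the positive class drops to zero because no diabetic patient is detected.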

[Back to Table of Contents](#tc)

<a id="baselinemodel"></a>
##  **4. Baseline Model**

We have reached this point with a better understanding of the provided information and with a minimal modification of it (zeros substitution). 
Now, it is time to create a baseline model (i.e. the most basic model capable of performing the desired classification).

For that purpose, I create a basic pipeline that standardizes the data (see the conclusion of 2.3 Statistical Summary) and trains the model.

I will first load the necessary libraries, not only for the baseline model but also for the rest of the models:

In [None]:
from sklearn.preprocessing import FunctionTransformer, StandardScaler, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

Pipeline inputs that could be modified: number of splits within cross validation (here I choose 4).

Baseline pipeline and score:

In [None]:
trans_steps = [("ss", StandardScaler(), X_trans.columns)]
col_trans = ColumnTransformer(transformers = trans_steps)
pipeline = Pipeline(steps = [("trans", col_trans),("model", LogisticRegression())])
kfold = KFold(n_splits = 4)
cv_score = cross_val_score(estimator = pipeline, X = X_trans, y = dataset["Outcome"], cv = kfold, scoring = "f1")
print("Baseline model (Logistic Regression) reaches an F1-score of: {} %".format(round(np.mean(cv_score),3)*100))

[Back to Table of Contents](#tc)

<a id="outlierremoval"></a>
##  **5. Outlier removal impact on model**

The conclusion of section 2.5 stated that there are outliers within most of our variables. For that reason, now that we have created and cross-validated the baseline model, I think it's time to test whether an outlier removal step could improve that F1-score of 62.7%.

I will proceed as follows:

* First, I will define what is considered an outlier.
* Second, I will define the function that will be inserted within the pipeline.
* Third, I will run the previous pipeline, but this time inserting the outlier removal step.
* Fourth, I will check and comment on any change in the model's classification score.

<a id="outlierdef"></a>
### **5.1 Definition of outlier**

I will stick to a very basic definition of outlier: any point lying more than 1.5 times the IQR below the first quartile or above the third quartile.

<a id="functiondef"></a>
### **5.2 Function creation**

In [None]:
def outlier_removal(X, n_iqr = 1.5, lower_perc = 0.25, verbose = False):
    copia = X.copy()
    for col in range (copia.shape[1]):
        column = copia[:,col]
        perc_low, perc_up = np.quantile(column, lower_perc), np.quantile(column, (1-lower_perc))
        iqr = perc_up - perc_low
        cutoff = iqr*n_iqr
        lower, upper = perc_low-cutoff , perc_up+cutoff
        ix_lower = np.where(column < lower)[0]
        if len(ix_lower) > 0:
            copia[ix_lower, col]  = perc_low
        ix_upper = np.where(column > upper)[0]
        if len(ix_upper) > 0:
            copia[ix_upper, col] = perc_up
    return copia

<a id="outpipeline"></a>
### **5.3 Testing the new pipeline**

In [None]:
trans_steps = [("ss", StandardScaler(), X_trans.columns)]
col_trans = ColumnTransformer(transformers = trans_steps)
pipeline = Pipeline(steps = [("trans", col_trans),("no_out",FunctionTransformer(outlier_removal)) ,("model", LogisticRegression())])
kfold = KFold(n_splits = 4)
cv_score = cross_val_score(estimator = pipeline, X = X_trans, y = dataset["Outcome"], cv = kfold, scoring = "f1")
print("Baseline model (Logistic Regression) + outlier removal reaches an F1-score of: {} %".format(round(np.mean(cv_score),4)*100))

<a id="outremovalcomment"></a>
### **5.4 Comments on outliers removal**

I have just run a pipeline with a step that treats the extremely low and high values. Rather than dropping rows, the function replaces each outlier with the corresponding percentile boundary (the first or third quartile with the default settings). 
As a result, I have obtained a model that underperforms the first one in terms of F1-score (62.7% vs 62.05%). Nevertheless, there are three levers that can lead us to different results:
* Number of dataset folds. Here, I have used a fairly standard value of four. However, different values may affect the final result. In fact, here I run an example with a final plot where we can see the result obtained for each number of folds:

In [None]:
kfold_results = []
for i in [2,4,5,10,20,50]:
    trans_steps = [("ss", StandardScaler(), X_trans.columns)]
    col_trans = ColumnTransformer(transformers = trans_steps)
    pipeline = Pipeline(steps = [("trans", col_trans),("no_out", FunctionTransformer(outlier_removal)),("model", LogisticRegression())])
    kfold = KFold(n_splits = i)
    cv_score = cross_val_score(estimator = pipeline, X = X_trans, y = dataset["Outcome"], cv = kfold, scoring = "f1")
    kfold_results.append(round(np.mean(cv_score),4))    

In [None]:
sbn.lineplot(x = [2,4,5,10,20,50], y = kfold_results)

* Low proportion of outliers vs non-outliers. In cases where the number of extreme values is low compared to the rest of the records, treating them may not considerably affect the model's classification power.
* Different definition of outlier. Previously, I stated that there is no *official* definition of what can be considered an extreme value. It may depend more on the dataset than on academic definitions. For convenience, I chose a widely used definition (1.5 times the IQR, using the .25 and .75 percentiles to compute that IQR). Nevertheless, different inputs (percentile values and IQR multiplier) could affect the final result.
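
To get a feeling for how much the definition matters, the sketch below counts how many points are flagged under three settings of the percentile and the IQR multiplier. The helper `count_flagged` and the seeded log-normal sample are hypothetical, and the settings mirror the defaults of the three outlier functions in this notebook:

```python
import numpy as np

def count_flagged(column, n_iqr=1.5, lower_perc=0.25):
    """Count the points outside the fences (hypothetical helper for illustration)."""
    low, up = np.quantile(column, [lower_perc, 1 - lower_perc])
    cutoff = (up - low) * n_iqr
    return int(np.sum((column < low - cutoff) | (column > up + cutoff)))

# Hypothetical skewed sample, seeded so the run is repeatable.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=500)

for n_iqr, lower_perc in [(1.5, 0.25), (1.5, 0.20), (2.5, 0.20)]:
    print(n_iqr, lower_perc, count_flagged(sample, n_iqr, lower_perc))
```

Each setting widens the fences compared to the previous one, so the number of flagged points can only stay the same or decrease.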



Modifying percentile values:

In [None]:
def outlier_removal2(X, n_iqr = 1.5, lower_perc = 0.20, verbose = False):
    copia = X.copy()
    for col in range (copia.shape[1]):
        column = copia[:,col]
        perc_low, perc_up = np.quantile(column, lower_perc), np.quantile(column, (1-lower_perc))
        iqr = perc_up - perc_low
        cutoff = iqr*n_iqr
        lower, upper = perc_low-cutoff , perc_up+cutoff
        ix_lower = np.where(column < lower)[0]
        if len(ix_lower) > 0:
            copia[ix_lower, col]  = perc_low
        ix_upper = np.where(column > upper)[0]
        if len(ix_upper) > 0:
            copia[ix_upper, col] = perc_up
    return copia

In [None]:
trans_steps = [("ss", StandardScaler(), X_trans.columns)]
col_trans = ColumnTransformer(transformers = trans_steps)
pipeline = Pipeline(steps = [("trans", col_trans),("no_out",FunctionTransformer(outlier_removal2)) ,("model", LogisticRegression())])
kfold = KFold(n_splits = 4)
cv_score = cross_val_score(estimator = pipeline, X = X_trans, y = dataset["Outcome"], cv = kfold, scoring = "f1")
print("Baseline model (Logistic Regression) + outlier removal reaches an F1-score of: {} %".format(round(np.mean(cv_score),4)*100))

Modifying iqr multiplier:

In [None]:
def outlier_removal3(X, n_iqr = 2.5, lower_perc = 0.20, verbose = False):
    copia = X.copy()
    for col in range (copia.shape[1]):
        column = copia[:,col]
        perc_low, perc_up = np.quantile(column, lower_perc), np.quantile(column, (1-lower_perc))
        iqr = perc_up - perc_low
        cutoff = iqr*n_iqr
        lower, upper = perc_low-cutoff , perc_up+cutoff
        ix_lower = np.where(column < lower)[0]
        if len(ix_lower) > 0:
            copia[ix_lower, col]  = perc_low
        ix_upper = np.where(column > upper)[0]
        if len(ix_upper) > 0:
            copia[ix_upper, col] = perc_up
    return copia

In [None]:
trans_steps = [("ss", StandardScaler(), X_trans.columns)]
col_trans = ColumnTransformer(transformers = trans_steps)
pipeline = Pipeline(steps = [("trans", col_trans),("no_out",FunctionTransformer(outlier_removal3)) ,("model", LogisticRegression())])
kfold = KFold(n_splits = 4)
cv_score = cross_val_score(estimator = pipeline, X = X_trans, y = dataset["Outcome"], cv = kfold, scoring = "f1")
print("Baseline model (Logistic Regression) + outlier removal reaches an F1-score of: {} %".format(round(np.mean(cv_score),4)*100))

**Conclusion:** 
As we can see in this section, I have tested several pipelines with different inputs for the pipeline and different arguments for our custom outlier function. As a result, I have obtained different results depending on the inputs. Thus, I will stick with the top performer for future steps in the analysis. 

[Back to Table of Contents](#tc)

<a id="testingmodels"></a>
## **6. Testing different classification models**

So far I have only tested one classification algorithm. Nevertheless, there are others that could outperform Logistic Regression. 
As an exercise, I will test the following models along with a collection of hyperparameters. The purpose of this step is to obtain the *top performer* algorithm.
To do so, I will first create a dictionary containing two elements for each model:
1. The model itself
2. *Hyperparameters* that can be modified during the Grid Search.

**Note:** the number of parameter combinations can be quite large, since some parameters take a list of values. Extending the grid can make the training phase very long. To avoid this, I have chosen a sample of values for the numeric parameters (either integers or floats). 
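
As a rough sketch of how quickly the search grows, the number of candidate combinations can be counted with scikit-learn's `ParameterGrid` before launching the search. The grid below has the same shape as the RandomForest grid used in this section, and `n_splits = 4` is the number of folds used in this notebook:

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the RandomForest grid used in this notebook.
param_grid = {
    "model__n_estimators": [10, 20, 30, 50, 60, 70, 80, 90, 100],
    "model__criterion": ["gini", "entropy"],
}
n_splits = 4  # folds used by the cross-validation

n_candidates = len(ParameterGrid(param_grid))  # 9 * 2 combinations
n_fits = n_candidates * n_splits               # one fit per combination and fold
print(n_candidates, n_fits)
```

With `refit = True`, the grid search adds one final fit of the best combination on the whole dataset.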

In [None]:
dict_scores = {}
# C (inverse regularization strength) must be strictly positive, so only positive values are kept in the grid.
dict_scores.update({"LogReg":{"model":LogisticRegression(), "param_grid":{"model__solver":["newton-cg", "sag", "saga", "lbfgs"],"model__C":[0.10,0.5,1]}}})
dict_scores.update({"DecTree":{"model": DecisionTreeClassifier(),"param_grid":{"model__criterion":["gini", "entropy"]}}})
dict_scores.update({"RandomForest":{"model": RandomForestClassifier(),"param_grid":{"model__n_estimators":[10,20,30,50,60,70,80,90,100], "model__criterion":["gini", "entropy"]}}})
dict_scores.update({"KNN":{"model": KNeighborsClassifier(), "param_grid":{"model__n_neighbors":[5,10,15,20], "model__p":[1,2]}}})
dict_scores.update({"Naive_bayes":{"model": GaussianNB()}})
# As with LogisticRegression, C must be strictly positive for SVC.
dict_scores.update({"SVM":{"model":SVC(),"param_grid":{"model__C":[0.10,0.5,1]}}})

Once the dictionary has been declared, I will run the following code snippet. In each iteration, it:
1. Selects the model
2. Looks for the best parameters within the grid search
3. Updates the results (parameters and score) of the previous step

In [None]:
for key in dict_scores.keys():
    if "param_grid" in dict_scores[key]:
        trans_steps = [("ss", StandardScaler(), X_trans.columns)]
        col_trans = ColumnTransformer(transformers = trans_steps)
        pipeline = Pipeline(steps = [("trans", col_trans),("no_out", FunctionTransformer(outlier_removal2)),("model", dict_scores[key]["model"])])
        kfold = KFold(n_splits = 4)
        search = GridSearchCV(estimator = pipeline, param_grid = dict_scores[key]["param_grid"], scoring = "f1", n_jobs = -1, refit = True, cv = kfold, verbose = 0)
        search.fit(X_trans, y)
        dict_scores[key].update({"best_params":search.best_params_})
        dict_scores[key].update({"best_score": search.best_score_})
    else:
        trans_steps = [("ss", StandardScaler(), X_trans.columns)]
        col_trans = ColumnTransformer(transformers = trans_steps)
        pipeline = Pipeline(steps = [("trans", col_trans),("no_out", FunctionTransformer(outlier_removal2)),("model", dict_scores[key]["model"])])
        kfold = KFold(n_splits = 4)
        cv_score = cross_val_score(estimator = pipeline, X = X_trans, y = dataset["Outcome"], cv = kfold, n_jobs = -1, verbose = 0, scoring = "f1")
        dict_scores[key].update({"best_params": "N/A"})
        dict_scores[key].update({"best_score": np.mean(cv_score)})
        

In [None]:
for key in dict_scores.keys():
    print("Model: {}. Best score: {}".format(key,dict_scores[key]["best_score"]))

The best available model is the RandomForest with the following parameters:

In [None]:
print(dict_scores["RandomForest"]["best_params"])

[Back to Table of Contents](#tc)

<a id="featureengineering"></a>
## **7. Feature Engineering**

Now that I have the top-performing model along with its selected hyperparameters, I will try to improve it by providing the model with new features built as combinations of the existing ones. 
For that purpose, I will use the PolynomialFeatures() class provided by the scikit-learn library.
Since I am working with sklearn Pipelines, I will include it as an additional step.
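
To see what this step produces, here is a minimal sketch with two made-up input values (2 and 3) expanded to degree 2, followed by the number of columns that 8 inputs generate at degree 3:

```python
from math import comb

from sklearn.preprocessing import PolynomialFeatures

# Two hypothetical inputs, a=2 and b=3, expanded to degree 2.
# The output columns are 1, a, b, a^2, a*b, b^2.
poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform([[2.0, 3.0]])
print(expanded)

# For n inputs and degree d, the number of output columns (bias term
# included) is the binomial coefficient C(n + d, d).
n_inputs, degree = 8, 3
print(comb(n_inputs + degree, degree))
```

The count grows fast with the degree, which is why training gets heavier and a later feature selection step becomes attractive.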

In [None]:
deg_cv_score = {}
for deg in range(1,5):
    trans_steps = [("ss", StandardScaler(), X_trans.columns),("poly", PolynomialFeatures(degree = deg), X_trans.columns)]
    col_trans = ColumnTransformer(transformers = trans_steps)
    pipeline = Pipeline(steps = [("trans", col_trans),("no_out", FunctionTransformer(outlier_removal2)),("model", RandomForestClassifier(criterion = "entropy", n_estimators = 50))])
    kfold = KFold(n_splits = 10)
    deg_cv_score.update({deg:np.mean(cross_val_score(estimator = pipeline, X = X_trans, y = dataset["Outcome"], cv = kfold, scoring = "f1"))})

In [None]:
degrees = list(deg_cv_score.keys())
f1 = list(deg_cv_score.values())
sbn.lineplot(x = degrees, y = f1)

I have tested *PolynomialFeatures()* with a range of degrees. Here, the higher the degree, the more complex the training and the more prone the model is to overfitting.
As we can see in the previous graph, the model reaches the highest F1-score with degree = 3. Beyond that value the score drops, which suggests that the model starts overfitting. Thus, I will select degree = 3 for the next steps.

[Back to Table of Contents](#tc)

<a id="rfe"></a>
## **8. Recursive Feature Elimination**
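
Before running it on the real pipeline, here is a minimal sketch of what scikit-learn's `RFE` (recursive feature elimination) does. The synthetic data from `make_classification` and the use of `LogisticRegression` as the ranking estimator are assumptions chosen to keep the example fast. The notebook itself uses a random forest.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: 8 features, only 3 of them informative.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# RFE fits the estimator, drops the weakest feature, and repeats
# until only n_features_to_select remain.
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X_toy, y_toy)

print(selector.support_)   # True for the features that were kept
print(selector.ranking_)   # 1 for kept features, larger numbers were dropped earlier
```

Inside a pipeline, the fitted selector passes only the kept columns to the next step.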

In [None]:
rfe_f1 = {}
for i in range(1,10):
    trans_steps = [("ss", StandardScaler(), X_trans.columns),("poly", PolynomialFeatures(degree = 3), X_trans.columns)]
    col_trans = ColumnTransformer(transformers = trans_steps)
    pipeline = Pipeline(steps = [("trans", col_trans),("no_out", FunctionTransformer(outlier_removal2)), ("rfe",RFE(estimator = RandomForestClassifier(criterion = "entropy", n_estimators = 50), n_features_to_select = i)),("model", RandomForestClassifier(criterion = "entropy", n_estimators = 50))])
    kfold = KFold(n_splits = 4)
    rfe_f1.update({i:np.mean(cross_val_score(estimator = pipeline, X = X_trans, y = dataset["Outcome"], cv = kfold, scoring = "f1"))})

After running the previous pipeline, we can check the performance of each model. Below, we can see both the numerical results and a graph with the same information:

In [None]:
rfe_f1

In [None]:
num_features = list(rfe_f1.keys())
nfeat_f1 = list(rfe_f1.values())

In [None]:
sbn.lineplot(x = num_features, y = nfeat_f1)

[Back to Table of Contents](#tc)

<a id="conclu"></a>
## **9. Conclusion**

After defining the problem statement and performing an exploratory data analysis, I better understood both the exercise to be done and the provided dataset. In this part of the kernel, I made some decisions regarding outliers and variable removal prior to building a baseline model.

Once I got a baseline, I tested different algorithms along with several parameters for each one in order to find the best possible model. After this, I measured the F1-score for different degrees of Polynomial Features, finding that a degree of three provides the best results. 

Since Polynomial Features produce a huge number of variables, I wanted to see, through recursive feature elimination, whether eliminating some of them would sacrifice performance. I found that 5 factors are enough to achieve the maximum performance given the decisions made, the techniques used and the data provided.
Following the previous steps, I went from an F1-score of 62.7% to 65.56%.

**Further steps**:

I haven't tested a neural network model in this notebook. Since this type of model can be tuned by modifying the number of perceptrons and layers (as well as the activation functions, among others), I will probably start from these results in order to check how well a neural network can perform on this task.

[Back to Table of Contents](#tc)

<a id="links"></a>
## **10. Useful notebooks and links:**

* [Create Table of Contents in a Notebook](https://www.kaggle.com/dcstang/create-table-of-contents-in-a-notebook)
* [Removing Outliers within a Pipeline](https://www.kaggle.com/jonaspalucibarbosa/removing-outliers-within-a-pipeline)
* [How to remove outliers For Machine Learning](https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/)