# Feature Engine - Unit 07 - Drop Features & Smart Correlated Features

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

* Learn how to apply Drop Features transformer & Smart Correlated Features transformer



---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Package for Learning

We load our typical packages.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.pipeline import Pipeline

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png">Drop Features

`DropFeatures()` drops a list of variables indicated by the developer. The function documentation is [here](https://feature-engine.readthedocs.io/en/1.1.x/selection/DropFeatures.html). Its argument, `features_to_drop`, is the list of features you want to drop.

from feature_engine.selection import DropFeatures

We will use the penguins dataset. It has records for 3 different species of penguins, collected from 3 islands in the Palmer Archipelago, Antarctica.

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
df = pd.read_csv(url)
# df = sns.load_dataset('penguins')
df.head()

We will set the pipeline with `DropFeatures()`, and we want to drop the variables 'sex' and 'island'. We chose these arbitrarily, just for the exercise.
* In the workplace, you may consider the context. For example, your variable might be CustomerID, which is typically a combination of letters and numbers with high cardinality, and often you can't get much information out of it. Therefore, you may drop this variable.
* Another use case is when you create variables by combining others. For example, from 'distance' and 'time' you may create a variable 'speed' by dividing one by the other. After that, you may discard 'distance' and 'time'.
* After setting the pipeline, we `.fit_transform()` the data.

pipeline = Pipeline([
    ('drop_features', DropFeatures(features_to_drop=['sex', 'island']))
])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Smart Correlated Features


According to the documentation, this transformer finds groups of correlated features and then selects one feature from each group, following a chosen criterion: the feature with the least missing values, the feature with the most unique values, or the feature with the highest variance. The documentation is found [here](https://feature-engine.readthedocs.io/en/1.1.x/selection/SmartCorrelatedSelection.html).
* The arguments we will use are `variables`, the list of variables to evaluate (if you don't pass anything, it will consider all numerical variables in the dataset); `method`, the correlation method (like `'pearson'` or `'spearman'`); and `threshold`, which according to the documentation is the correlation threshold above which a feature will be deemed correlated with another one and removed from the dataset.
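
To build intuition for the idea of "correlated groups" and "keep the feature with the highest variance", here is a rough sketch using pandas only. It is not the transformer's actual implementation: it assumes the absolute Pearson correlation is compared against the threshold, it handles pairs rather than larger groups, and the toy columns `a`, `b` and `c` are made up.

```python
import pandas as pd

# Made-up toy data: "b" is close to a scaled copy of "a", "c" is unrelated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0, 13.0],
    "c": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
})

threshold = 0.6
corr = toy.corr(method="pearson").abs()

# Collect pairs of distinct columns whose absolute correlation exceeds the threshold.
pairs = [
    (x, y)
    for i, x in enumerate(corr.columns)
    for y in corr.columns[i + 1:]
    if corr.loc[x, y] > threshold
]

# From each correlated pair, mark the lower-variance column for removal.
to_drop = {min(pair, key=lambda col: toy[col].var()) for pair in pairs}

print(pairs)
print(to_drop)
```

In this toy example only `a` and `b` form a correlated pair, and `a` is marked for removal because it has the lower variance.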

from feature_engine.selection import SmartCorrelatedSelection

We will use the tips dataset. It holds records of waiter tips, along with the day of the week, time of day, total bill, gender, whether it is a smoker table or not, and how many people were at the table.

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"
df = pd.read_csv(url)
# df = sns.load_dataset('tips')
df.head()

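When you load the dataset from Seaborn (with `sns.load_dataset('tips')`), the data type of the categorical variables is `'category'`. For ML tasks, and more specifically for this exercise, it should be `'object'`. When the data is read from the CSV file as above, these columns are typically not loaded as `'category'`, so the conversion below leaves them unchanged.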

df.info()

We change the data type to `'object'` by looping over all the variables whose current data type is `'category'`.

for col in df.select_dtypes(include='category').columns:
    df[col] = df[col].astype('object')

df.info()
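
As a small self-contained check of the conversion described above, the sketch below builds a made-up DataFrame with one `'category'` column and converts it in a single `astype()` call instead of a loop. The `toy` data and its column names are assumptions for illustration.

```python
import pandas as pd

# Made-up data with one 'category' column and one numerical column.
toy = pd.DataFrame({
    "day": pd.Categorical(["Sun", "Sat", "Sun"]),
    "size": [2, 3, 4],
})

# Find the 'category' columns and convert them all at once.
cat_cols = toy.select_dtypes(include="category").columns
toy_converted = toy.astype({col: "object" for col in cat_cols})

print(toy.dtypes)
print(toy_converted.dtypes)
```

The loop in the notebook and the dictionary passed to `astype()` do the same job; the dictionary form returns a new DataFrame rather than modifying `toy` in place.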

We check for missing data. 
* There is no missing data

df.isnull().sum()

The `SmartCorrelatedSelection()` transformer works on numerical data, so we have to encode the existing categorical variables. In this exercise, we do that with `OrdinalEncoder()`. Then we add `SmartCorrelatedSelection()` without passing `variables`, meaning we want all numerical variables to be evaluated. We set `method` to `'pearson'`, `threshold` to 0.6 and `selection_method` to `'variance'`. A threshold of 0.6 means that any pair of variables with at least a moderate correlation will be considered correlated and subject to removal.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> **A big warning**: the tips dataset is intended to be used in a regression task, where you are interested in predicting tips. When working on a project, the tips variable wouldn't be a feature, but a target. Here we left it in on purpose as a feature just for the sake of the exercise.

from feature_engine.encoding import OrdinalEncoder

pipeline = Pipeline([
    ('ordinal_encoder', OrdinalEncoder(encoding_method='arbitrary')),
    ('SmartCorrelatedSelection', SmartCorrelatedSelection(method="pearson",
                                                          threshold=0.6,
                                                          selection_method="variance"))
])

df_transformed = pipeline.fit_transform(df)

We can check which sets of features were marked as correlated (using the rules we set in the previous pipeline). We do that by accessing the pipeline step and using the attribute `.correlated_feature_sets_`

pipeline['SmartCorrelatedSelection'].correlated_feature_sets_

We check which variables were removed with the attribute `.features_to_drop_`

pipeline['SmartCorrelatedSelection'].features_to_drop_

Alternatively, we inspect `df_transformed` and, as expected, the variables were removed.

df_transformed.head()

 <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> **Additional warning**: This transformer is applied to the features when you set up the pipeline for your ML task. It is typically one of the last steps of feature engineering, since it needs the data to be pre-processed first.
 
 
 <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Also, the tips dataset is intended to be used in a regression task, where you are interested in predicting tips. When working on a project, the tips variable wouldn't be a feature but a target. Here we left it in on purpose as a feature just for the sake of the exercise.