## Lesson 14 - Feature Engineering

### Warm up

![](https://i.imgur.com/j7po4ZB.gif)

### Expectations


![](https://assets.rebelmouse.io/eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbWFnZSI6Imh0dHBzOi8vbWVkaWEucmJsLm1zL2ltYWdlP3U9JTJGZmlsZXMlMkYyMDE1JTJGMTIlMkYyMCUyRjYzNTg2MjQxNTQ3NjM1Mzk3MC03NDE0Nzc2NzRfdHVtYmxyX2xwZnd6enhPUHQxcXp0d3RlLmdpZiZhbXA7aG89aHR0cCUzQSUyRiUyRmNkbjEudGhlb2R5c3NleW9ubGluZS5jb20mYW1wO3M9OTA0JmFtcDtoPWVlZTI4YmFhOTllZDE0YzFjYTM0YjA2YjAwZGMwYjRlZDllNzNiMjI5MjQ3NzQ3ZTY4N2RiYjg5ZWFlNmNjMGUmYW1wO3NpemU9OTgweCZhbXA7Yz0zNTg0ODI2MDE0IiwiZXhwaXJlc19hdCI6MTU2MDY5MzE4OH0.MQ6bc8mUWAsjqLd2zH53nhFvI3MCuu4mUP4-uyFn43E/img.jpg)


#### Machine Learning Pipeline
![Machine Learning Pipeline](https://cdn-images-1.medium.com/max/1600/1*2T5rbjOBGVFdSvtlhCqlNg.png)


##### Feature Engineering is an art

> "Each problem is domain specific and better features (suited to the problem) is often the **deciding factor** of the performance of your system."
>
> "Data Scientists often spend 70% of their time in the data preparation phase before modeling."
>
> From [Link 1](https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b) in the pre-class reading.

> “Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”
> — [Prof. Andrew Ng.](https://en.wikipedia.org/wiki/Andrew_Ng) (Stanford)

> “Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”  
> — [Prof. Pedro Domingos](https://en.wikipedia.org/wiki/Pedro_Domingos) (University of Washington)


==============================================
#### Dataset: historical employee data
#### Task: predict whether an employee is promoted
==============================================

#### Importing what we will need...

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from plotting import plot_confusion_matrix

In [None]:
data = pd.read_csv('data/Base Analytics.csv')
data.sample(5)

### Numerical Data

#### Filtering numeric columns and dropping nulls...

In [None]:
# Numeric dtypes to keep
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
# Keep employees hired from 2011 onward and rebuild a clean 0..n-1 index
data = data[data['Admissão'] >= '2011-01-01'].reset_index(drop=True)
data_num = data.select_dtypes(include=numerics).copy()
# Drop the numeric columns that are not used as features in this lesson
data_num = data_num.drop(columns=['ADP', 'Cod.Cargo', 'Cod.Cargo Admissão', 'CC',
                                  'Hora Extra 2016', 'Hora Negativa 2016', 'Ad. Noturno 2016',
                                  'Absenteísmo 2016', 'Hora Extra 2017', 'Hora Negativa 2017',
                                  'Ad. Noturno 2017', 'Absenteísmo 2017', 'Banda',
                                  '2012/13 Goal Achievement'])

In [None]:
data.shape

In [None]:
data_old = data_num.dropna()
data_old.shape

In [None]:
X = data_old.drop(columns='PROMOVIDO')
y = data_old['PROMOVIDO']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression(random_state=42, class_weight='balanced')
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

In [None]:
_ = plot_confusion_matrix(y_test, y_pred)
_ = plot_confusion_matrix(y_test, y_pred, normalize='yes')
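
The `plot_confusion_matrix` helper comes from a local `plotting` module that is not shown here, so its exact behavior is unknown. As a rough sketch of what such a plot usually summarizes, the counts and their row-normalized version can be computed with `sklearn.metrics.confusion_matrix`. The toy labels below are made up for illustration and do not come from the employee dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical, not from the employee dataset)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_hat = np.array([0, 1, 0, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)

# Row-normalized version: each row sums to 1,
# so the diagonal holds the recall of each class
cm_norm = cm / cm.sum(axis=1, keepdims=True)

print(cm)
print(cm_norm)
```

With imbalanced classes, the normalized matrix is usually easier to read than the raw counts, because the minority class is not dwarfed by the majority class.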

#### What if, instead of dropping the nulls, we replaced them with the mean?

In [None]:
data_num.describe()

In [None]:
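for col in data_num.columns:
    qtt = data_num[col].isnull().sum()  # number of missing values in this column
    if qtt > 0:
        print(col, qtt)
        # Fill the gaps with the column mean (computed over the non-null values)
        data_num[col] = data_num[col].fillna(data_num[col].mean())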
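
The loop above computes each mean over all rows, before the train/test split, so test rows contribute to the imputed values. A common alternative, sketched here on a made-up toy frame (the column names `salary` and `tenure` are hypothetical), is to learn the means from the training rows only with `SimpleImputer` and reuse them on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame with missing values (hypothetical column names)
toy = pd.DataFrame({
    'salary': [10.0, 12.0, np.nan, 15.0, 11.0, np.nan, 14.0, 13.0],
    'tenure': [1.0, np.nan, 3.0, 4.0, 2.0, 6.0, np.nan, 5.0],
})
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)

# Learn the column means from the training rows only
imputer = SimpleImputer(strategy='mean')
train_filled = pd.DataFrame(imputer.fit_transform(toy_train),
                            columns=toy.columns, index=toy_train.index)

# Reuse the training means to fill the test rows
test_filled = pd.DataFrame(imputer.transform(toy_test),
                           columns=toy.columns, index=toy_test.index)

print(imputer.statistics_)  # the per-column means learned from the training rows
```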

In [None]:
X = data_num.drop(columns=['PROMOVIDO'])
y = data_num['PROMOVIDO']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression(random_state=42, class_weight='balanced')
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

In [None]:
_ = plot_confusion_matrix(y_test, y_pred)
_ = plot_confusion_matrix(y_test, y_pred, normalize='yes')

[StandardScaler](https://scikit-learn.org/0.19/modules/generated/sklearn.preprocessing.StandardScaler.html)

In [None]:
scaler = StandardScaler()
data_num_scaled = pd.DataFrame(scaler.fit_transform(data_num.drop(columns='PROMOVIDO')),
                               columns=list(data_num.drop(columns='PROMOVIDO').columns))
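
To make explicit what `StandardScaler` computes, here is a minimal sketch on a made-up single-column array: each value has the column mean subtracted and is divided by the column standard deviation (the population standard deviation, i.e. `ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (hypothetical values)
values = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

scaled = StandardScaler().fit_transform(values)

# Same result by hand: subtract the mean, divide by the population std
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(scaled.ravel())
print(manual.ravel())
```

After scaling, the column has mean 0 and standard deviation 1, which puts features measured in very different units on a comparable scale for the logistic regression.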

In [None]:
X = data_num_scaled
y = data_num['PROMOVIDO']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression(random_state=42, class_weight='balanced')
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

In [None]:
_ = plot_confusion_matrix(y_test, y_pred)
_ = plot_confusion_matrix(y_test, y_pred, normalize='yes')

[RobustScaler](https://scikit-learn.org/0.19/modules/generated/sklearn.preprocessing.RobustScaler.html)

In [None]:
rscaler = RobustScaler()
data_num_rscaled = pd.DataFrame(rscaler.fit_transform(data_num.drop(columns='PROMOVIDO')),
                                columns=list(data_num.drop(columns='PROMOVIDO').columns))
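
As an illustration of why `RobustScaler` can behave differently from `StandardScaler`, the sketch below uses a made-up column with one extreme value. With default settings, `RobustScaler` subtracts the median and divides by the interquartile range, so a single outlier has little effect on how the remaining values are spread.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy data with one extreme value (hypothetical)
values = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

robust = RobustScaler().fit_transform(values)
standard = StandardScaler().fit_transform(values)

# By hand: subtract the median, divide by the interquartile range (Q3 - Q1)
q1, median, q3 = np.percentile(values, [25, 50, 75])
manual = (values - median) / (q3 - q1)

# Spread of the four "normal" points under each scaler
robust_spread = robust[:4].max() - robust[:4].min()
standard_spread = standard[:4].max() - standard[:4].min()
print(robust_spread, standard_spread)
```

Under `StandardScaler` the outlier inflates the standard deviation, so the four ordinary points end up squeezed into a very narrow range; under `RobustScaler` they keep a usable spread.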

In [None]:
X = data_num_rscaled
y = data_num['PROMOVIDO']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression(random_state=42, class_weight='balanced')
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

In [None]:
_ = plot_confusion_matrix(y_test, y_pred)
_ = plot_confusion_matrix(y_test, y_pred, normalize='yes')

#### Binarization
- Instead of a count, for example, keep only whether the event happened at all.
- Example: how many times each user listened to a song vs. whether they listened to it or not.
- In the context of our dataset: if we had a column that only recorded when the employee's last raise was, we could use just the information of whether there was a raise or not.
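
A minimal sketch of the song example, using a made-up `listen_count` column: the count is reduced to a 0/1 flag, either with a plain pandas comparison or with scikit-learn's `Binarizer`, which maps values strictly above the threshold to 1.

```python
import pandas as pd
from sklearn.preprocessing import Binarizer

# Toy listening counts per user (hypothetical data)
plays = pd.DataFrame({'listen_count': [0, 3, 0, 12, 1]})

# Plain pandas: 1 if the user listened at least once, else 0
plays['listened'] = (plays['listen_count'] > 0).astype(int)

# scikit-learn equivalent: values strictly above the threshold become 1
binarizer = Binarizer(threshold=0)
listened_sk = binarizer.fit_transform(plays[['listen_count']].to_numpy()).ravel()

print(plays)
print(listened_sk)
```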

#### Interaction between features
- Beyond the individual values, we can extract important information from the interactions that may exist between features.
- In the context of our dataset?
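
One simple form of interaction is the product of two columns. The sketch below, on a made-up frame with hypothetical columns `a` and `b`, builds it by hand and then with `PolynomialFeatures(interaction_only=True)`, which returns the original columns followed by their pairwise products.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Toy frame (hypothetical column names)
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

# Manual interaction: element-wise product of the two columns
toy['a_x_b'] = toy['a'] * toy['b']

# Automatic version: output columns are a, b, a*b (no bias column, no squares)
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
expanded = poly.fit_transform(toy[['a', 'b']].to_numpy())

print(toy)
print(expanded)
```

The number of pairwise products grows quadratically with the number of columns, so in practice it is often better to hand-pick the interactions that make sense for the domain.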

#### Binning
- Some values may be very rare yet close to much more frequent ones; is the fine-grained information relevant?
- We may also be more interested in a group, e.g. a target audience by age.
- In the context of our dataset?
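
A minimal sketch of the age example, using made-up ages: `pd.cut` bins with hand-picked edges, while `pd.qcut` chooses the edges so that each bin holds roughly the same number of rows. The edges and labels below are illustrative assumptions.

```python
import pandas as pd

# Toy ages (hypothetical data)
ages = pd.Series([18, 22, 25, 31, 38, 45, 52, 67], name='age')

# Fixed bins with hand-picked edges; intervals are right-closed by default,
# so 25 falls in the first bin
age_group = pd.cut(ages, bins=[0, 25, 40, 60, 120],
                   labels=['<=25', '26-40', '41-60', '60+'])

# Quantile bins: each bin receives roughly the same number of rows
age_quartile = pd.qcut(ages, q=4, labels=False)

print(pd.DataFrame({'age': ages, 'group': age_group, 'quartile': age_quartile}))
```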

#### Rounding
- Same idea as binning.
- Examples?
- And in the context of our dataset?
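
As one possible example, assuming a made-up `salary` column, rounding discards precision that is unlikely to matter to the model: either to the nearest thousand, or down to the start of a fixed-width band.

```python
import numpy as np
import pandas as pd

# Toy salaries (hypothetical data)
salary = pd.Series([1234.56, 1987.10, 2510.00, 2490.99], name='salary')

# Round to the nearest thousand
# (negative decimals round to the left of the decimal point)
salary_k = salary.round(-3)

# Round down to the start of each 500-wide band
salary_band = (np.floor(salary / 500) * 500).astype(int)

print(pd.DataFrame({'salary': salary, 'nearest_k': salary_k, 'band_500': salary_band}))
```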

#### Statistical Transformation
- Log: log transforms are useful for skewed distributions, because they expand values in the lower range of magnitudes and compress values in the higher range.
- Box-Cox
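
A short sketch of both transforms on a synthetic right-skewed sample (the `income` values are generated, not taken from the employee dataset). Box-Cox maps a strictly positive $x$ to $(x^\lambda - 1)/\lambda$ for $\lambda \neq 0$ and to $\log x$ for $\lambda = 0$; `scipy.stats.boxcox` estimates $\lambda$ by maximum likelihood when it is not given.

```python
import numpy as np
from scipy import stats

# Right-skewed, strictly positive sample (synthetic)
rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.0, sigma=1.0, size=1000)

# log1p computes log(1 + x), which also tolerates zeros
income_log = np.log1p(income)

# Box-Cox requires strictly positive input; lambda is estimated from the data
income_boxcox, fitted_lambda = stats.boxcox(income)

print('skew raw     :', stats.skew(income))
print('skew log     :', stats.skew(income_log))
print('skew box-cox :', stats.skew(income_boxcox))
print('fitted lambda:', fitted_lambda)
```

Both transforms should bring the skewness much closer to zero than in the raw sample. As with the scalers, the Box-Cox $\lambda$ would normally be estimated on the training rows and then reused on the test rows.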