<center> <img src="res/ds3000.png"> </center>

<center> <h1> Week 12 - Day 1</h1> </center>

<center> <h2> Part 2: Feature Engineering</h2></center>

## Outline
1. <a href='#1'>One-Hot Encoding</a>
2. <a href='#2'>Feature Discretization</a>
3. <a href='#3'>Polynomial Features</a>
4. <a href='#4'>Univariate Nonlinear Transformation</a>

<a id="1"></a>

## 1. One-Hot Encoding (Dummy Variables)
* A common way to represent categorical variables 
* Logic: replace a categorical variable with one or more new features that take the values 0 and 1.


* Use the **OneHotEncoder** class in scikit-learn's `preprocessing` module
    * `sparse_output=False` (spelled `sparse=False` before scikit-learn 1.2) makes OneHotEncoder return a NumPy array instead of a sparse matrix
    

* Categorical variables encoded with numeric values can also be one-hot encoded
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [None]:
import pandas as pd

data = pd.read_csv("res/ave_grades_quidditch.csv")

In [None]:
data

In [None]:
new_df = data["House"].values.reshape(-1,1)
new_df

In [None]:
from sklearn.preprocessing import OneHotEncoder

#instantiate the OneHotEncoder (scikit-learn 1.2+; earlier versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False)

#fit the encoder and transform the House column
encoded_df = encoder.fit_transform(new_df)

In [None]:
encoded_df

In [None]:
encoder.get_feature_names_out()

In [None]:
features_df = pd.DataFrame(encoded_df, columns = encoder.get_feature_names_out())
features_df

### 1.1. One-Hot Encoding with Multiple Columns

In [None]:
features = data.iloc[:, 1:3]

In [None]:
features

In [None]:
from sklearn.preprocessing import OneHotEncoder

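#sparse_output=False returns a dense array (scikit-learn 1.2+; earlier versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False)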

encoded_df = encoder.fit_transform(features)

In [None]:
encoded_df

In [None]:
encoder.get_feature_names_out()

In [None]:
features_df = pd.DataFrame(encoded_df, columns = encoder.get_feature_names_out())

In [None]:
features_df

<a id="2"></a>

## 2. Feature Discretization (Binning)
* Discretization: splitting continuous variables into multiple features (bins)
* Transforms the continuous variable into a discrete one that represents intervals spanning the range of the variable's values
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html
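
To make the idea of bins concrete, the sketch below discretizes a small made-up array (the `scores` values are hypothetical, not taken from the course data) into two equal-width bins and prints both the bin edges and the one-hot result.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical scores, reshaped to one column because scikit-learn expects 2D input
scores = np.array([50.0, 60.0, 70.0, 80.0, 90.0, 100.0]).reshape(-1, 1)

# Two equal-width bins over [50, 100]: the inner edge falls at 75
binner = KBinsDiscretizer(n_bins=2, encode="onehot-dense", strategy="uniform")
binned = binner.fit_transform(scores)

print(binner.bin_edges_[0])  # edges of the two intervals
print(binned)                # one 0/1 column per bin
```

The continuous score is replaced by two indicator columns, one per interval.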

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

discretizer = KBinsDiscretizer(n_bins=3, encode = "onehot-dense", strategy = "quantile")

potion_discretized = discretizer.fit_transform(data["Potion_Ave"].values.reshape(-1, 1))

In [None]:
potion_discretized

In [None]:
potion_df = pd.DataFrame(potion_discretized, columns = ["potionBin1", "potionBin2", "potionBin3"])
potion_df

In [None]:
discretizer.bin_edges_

### 2.1. KBinsDiscretizer Parameters
* **n_bins**: The number of bins to produce (min is 2)
* **encode:** Method used to encode the transformed result. 
    * `onehot`: Encode the transformed result with one-hot encoding and return a sparse matrix (default).
    * `onehot-dense`: Encode the transformed result with one-hot encoding and return a dense array.
    * `ordinal`: Return the bin identifier encoded as an integer value without one-hot encoding the variable.
* **strategy:** Strategy used to define the widths of the bins.
    * `uniform`: All bins in each feature have identical widths.
    * `quantile`: All bins in each feature have the same number of points.
    * `kmeans`: Values in each bin have the same nearest center of a 1D k-means cluster.
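
The difference between the `uniform` and `quantile` strategies is easiest to see on skewed data. The following is a small sketch using a hypothetical array with one outlier. It uses `encode="ordinal"` so each value is replaced by its bin index.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical skewed values: five small numbers and one large outlier
values = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 10.0]).reshape(-1, 1)

results = {}
for strategy in ("uniform", "quantile"):
    binner = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy=strategy)
    # ravel() flattens the (n_samples, 1) result into a 1D array of bin ids
    results[strategy] = binner.fit_transform(values).ravel()
    print(strategy, binner.bin_edges_[0], results[strategy])
```

With `uniform`, the outlier stretches the range, so only it lands in the second bin. With `quantile`, the edge sits at the median and the points split evenly.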


<a id="3"></a>

## 3. Polynomial Features
* `PolynomialFeatures` generates polynomial features (powers and interaction terms of the inputs) up to a chosen degree
* Using polynomial features together with a linear regression model yields a polynomial regression model
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html

In [None]:
from sklearn.preprocessing import PolynomialFeatures

#includes polynomials up to x**5
#with one input feature and include_bias=False, a degree of 5 yields 5 features
poly = PolynomialFeatures(degree=5, include_bias=False)

In [None]:
potion_poly = poly.fit_transform(data["Potion_Ave"].values.reshape(-1, 1))
potion_poly

In [None]:
poly_df = pd.DataFrame(potion_poly, columns = poly.get_feature_names_out())
poly_df

<a id="4"></a>

## 4. Univariate Nonlinear Transformation
* Nonlinear transformations can be applied to individual features
* Common functions for nonlinear scaling
    * `log()` and `exp()`, both implemented in NumPy
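
As a small sketch of such a transformation, the example below applies a log transform to hypothetical right-skewed counts and then inverts it. It uses NumPy's `log1p` and `expm1`, which compute log(1 + x) and exp(x) - 1. Adding 1 before taking the log is a common choice when the data can contain zeros.

```python
import numpy as np

# Hypothetical right-skewed counts, including a zero
counts = np.array([0.0, 1.0, 9.0, 99.0, 999.0])

# log1p(x) computes log(1 + x), so a zero maps to 0 instead of -inf
counts_log = np.log1p(counts)

# expm1(x) computes exp(x) - 1 and undoes the transform
counts_back = np.expm1(counts_log)

print(counts_log)
print(counts_back)
```

After the transform, values spanning three orders of magnitude are compressed into a much narrower range.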

In [None]:
import numpy as np

X_train_log = np.log(data["Potion_Ave"].values+1)

In [None]:
X_train_log