<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Feature Engineering: Categorical
              
</p>
</div>

DS-NTL-010824
<p>Phase 3</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
    

#### Feature Engineering: Transforming input data
- Categorical data to numeric form


#### A key aspect to making a better prediction machine

#### Categorical data 
- Suspect that the status of a categorical value affects outcome.
- Want to add as a variable to regress on.
- Need to convert to numeric form.

Two types of categorical data:

<center><img src = "Images/ordinalvsnominal.png" width = 900/></center>


**Dealing with ordinal categoricals**

- Clear progression/order of values:

Pizza cheesiness rating:

- E.g., not cheesy, slightly cheesy, cheesy, very cheesy, extremely cheesy, dripping oceans of cheese



Ordinal encoding:

not cheesy: 0, slightly cheesy: 1, cheesy: 2, very cheesy: 3, extremely cheesy: 4, dripping oceans of cheese: 5
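
As a minimal sketch of the mapping above, assuming a small made-up Series of ratings (the `pizzas` data and variable names are hypothetical), the codes can be produced by mapping each label to its position in the ordered list:

```python
import pandas as pd

# Ordered labels taken from the cheesiness example above.
cheesiness_order = [
    "not cheesy", "slightly cheesy", "cheesy",
    "very cheesy", "extremely cheesy", "dripping oceans of cheese",
]
# Build the label -> integer mapping from each label's position in the list.
cheesiness_codes = {label: code for code, label in enumerate(cheesiness_order)}

# Hypothetical ratings, made up for illustration.
pizzas = pd.Series(["cheesy", "not cheesy", "dripping oceans of cheese", "very cheesy"])
pizza_codes = pizzas.map(cheesiness_codes)
print(pizza_codes.tolist())  # [2, 0, 5, 3]
```

The integers only encode rank. Treating the gap between 0 and 1 as equal to the gap between 4 and 5 is a modeling assumption.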


<center><img src = "Images/cheesy_pizza.jpg" /></center>


A real example: housing dataset
- Using pandas categorical coding for ordinal values
- Sklearn OrdinalEncoder

In [None]:
import pandas as pd
housing_df = pd.read_csv('Data/ames_housing.csv')
housing_df.columns

Lots of columns. Let's check out the 'ExterCond' column: present condition of the material on the house exterior.

In [None]:
housing_df['ExterCond'].unique()

This is a column of strings, but these are really categories.
- Pandas has categorical datatype.
- Special methods for categorical datatype.

In [None]:
housing_df['ExterCond'] = housing_df['ExterCond'].astype('category')
housing_df['ExterCond']

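Good, but by default pandas lists the categories in lexical order. We still need to set the condition order (worst to best) and mark the categorical as ordered:

In [None]:
housing_df['ExterCond'] = housing_df['ExterCond'].cat.reorder_categories(
    ['Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered=True)
housing_df['ExterCond']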

Get the integer codes of the ordinal categorical:

In [None]:
housing_df['ExterCond']

In [None]:
housing_df['ExterCond'].cat.codes
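
The same idea can be tried without loading the dataset. This is a small sketch on made-up values that mimic `ExterCond` (the `cond` Series is hypothetical); it shows that the codes follow the declared category order and that comparisons work once the categorical is ordered:

```python
import pandas as pd

# Toy stand-in for the ExterCond column (illustrative values, not read from the dataset).
cond = pd.Series(["TA", "Gd", "Po", "Ex", "TA"])
cond_type = pd.CategoricalDtype(categories=["Po", "Fa", "TA", "Gd", "Ex"], ordered=True)
cond = cond.astype(cond_type)

codes = cond.cat.codes          # position of each value in the category list
better_than_ta = cond > "TA"    # comparisons are allowed because ordered=True

print(codes.tolist())           # [2, 3, 0, 4, 2]
print(better_than_ta.tolist())  # [False, True, False, True, False]
```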

#### Using scikit learn OrdinalEncoder()

In [None]:
from sklearn.preprocessing import OrdinalEncoder

Some objects in scikit-learn are predictive models:
- LinearRegression()
    - .fit() 
    - .predict()

Other objects are transformers:
- OrdinalEncoder(), StandardScaler(), Normalizer(), etc.
    - .fit()
    - .transform()
    - .fit_transform()


Using a transformer:
- Call `.fit()` (or `.fit_transform()`) on the **training set** only, so the mapping is learned from training data.
- Then call `.transform()` with the fitted transformer on the train set and/or the test set.
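
A minimal sketch of this fit-on-train, transform-on-test pattern, assuming two tiny made-up DataFrames (`train` and `test` are hypothetical and only mimic `ExterCond`):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical train/test split; the values mimic ExterCond but are made up.
train = pd.DataFrame({"ExterCond": ["TA", "Gd", "Fa", "TA"]})
test = pd.DataFrame({"ExterCond": ["Gd", "Po", "TA"]})

# The full ordered category list is given up front, so "Po" is known
# to the encoder even though it does not occur in the training rows.
enc = OrdinalEncoder(categories=[["Po", "Fa", "TA", "Gd", "Ex"]])
train_codes = enc.fit_transform(train)  # fit on the training set only
test_codes = enc.transform(test)        # reuse the same mapping on the test set

print(train_codes.ravel().tolist())  # [2.0, 3.0, 1.0, 2.0]
print(test_codes.ravel().tolist())   # [3.0, 0.0, 2.0]
```

Passing the category list explicitly means a value that is missing from the training rows still gets a consistent code at transform time.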

OrdinalEncoder fits and transforms categorical data to numerical.
- Can do many ordinal categorical columns at once.

In [None]:
ord_cat_selector = ['ExterCond', 'LotShape']
cat_subset = housing_df[ord_cat_selector]
cat_subset

`LotShape`: a measure of the irregularity of the lot shape.
- Clearly ordinal.

In [None]:
cat_subset['LotShape'].unique()

Ordinal encoder will do the mapping all at once:
- Define ordinal order for each categorical variable.

In [None]:
extcond_list = ['Po', 'Fa', 'TA', 'Gd', 'Ex'] 
reg_list = ['Reg', 'IR1', 'IR2', 'IR3']

In [None]:
o_enc = OrdinalEncoder(categories = [extcond_list, reg_list])
o_enc.fit(cat_subset)

Now transform the categorical subset

In [None]:
o_enc.transform(cat_subset)

In [None]:
X_ord = pd.DataFrame(o_enc.transform(cat_subset),
                        columns = cat_subset.columns)
X_ord

In [None]:
cat_subset

A nice thing is that the fitted encoder also provides the inverse transform:

In [None]:
X_ord.head()

In [None]:
o_enc.inverse_transform(X_ord)

Some advantages of ordinal encoder:
- Set up the encoding order for many categorical columns at once.
- The same fitted encoder does both the transform and the inverse transform.
- **Integrates into scikit learn pipeline architecture (will see this later)**

**Dealing with nominal categoricals**

Ordinal encoding of nominal categoricals introduces spurious relations:

- The integer codes imply an order and spacing between values (e.g., roof styles) that have no natural order.

In [None]:
housing_df['RoofStyle'].unique()
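
To see the spurious ordering concretely, here is a small sketch on made-up roof styles (the `roofs` DataFrame is hypothetical). With no order supplied, `OrdinalEncoder` sorts the categories, so the codes reflect nothing more than alphabetical order:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up sample of roof styles; there is no natural ranking among them.
roofs = pd.DataFrame({"RoofStyle": ["Gable", "Hip", "Flat", "Mansard"]})

enc = OrdinalEncoder()  # no order given, so categories are sorted alphabetically
roof_codes = enc.fit_transform(roofs).ravel()

print(dict(zip(roofs["RoofStyle"], roof_codes.tolist())))
# {'Gable': 1.0, 'Hip': 2.0, 'Flat': 0.0, 'Mansard': 3.0}
```

A linear model fed these codes would treat "Mansard" as three steps above "Flat", which has no physical meaning.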

Use one-hot (dummy) encoding instead:
- pd.get_dummies()
- sklearn's OneHotEncoder()

Create column for each unique value of nominal categorical:
- Each column takes on 0/1 value.

In [None]:
pd.get_dummies(housing_df['RoofStyle']).tail()

When doing regression, there is an issue with transforming a feature in this way:
- We have accidentally introduced a linear dependence: the dummy columns always sum to 1, which duplicates the intercept.
- E.g., constraint: if 5 of the columns are zero, the last one must be 1.
- For $k$ values of a nominal categorical, only $k-1$ columns carry information.
- **Solution**: Get rid of one of the columns.
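
The dependence described above can be checked numerically. This is a sketch on made-up roof styles (the `roof` Series and the `with_intercept` helper are hypothetical names for illustration): with all dummy columns plus an intercept the design matrix is rank deficient, and dropping one column fixes that.

```python
import numpy as np
import pandas as pd

# Made-up sample with k = 3 roof styles.
roof = pd.Series(["Gable", "Hip", "Flat", "Gable", "Hip", "Flat"])
full = pd.get_dummies(roof, dtype=float)                      # k columns
reduced = pd.get_dummies(roof, drop_first=True, dtype=float)  # k - 1 columns

def with_intercept(dummies):
    """Prepend a column of ones, as a regression with an intercept would."""
    ones = np.ones((len(dummies), 1))
    return np.hstack([ones, dummies.to_numpy()])

full_design = with_intercept(full)        # 4 columns: intercept + 3 dummies
reduced_design = with_intercept(reduced)  # 3 columns: intercept + 2 dummies

print(full.sum(axis=1).unique())              # [1.] -> every row sums to 1
print(np.linalg.matrix_rank(full_design))     # 3, fewer than its 4 columns
print(np.linalg.matrix_rank(reduced_design))  # 3, equal to its 3 columns
```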

In [None]:
X_roof = pd.get_dummies(housing_df['RoofStyle'], drop_first = True)
X_roof.tail()

The `get_dummies()` function is useful for EDA, but when you're building machine learning models and pipelines in Phase 4, it will be important to do any one-hot encoding by using `sklearn`'s tool, the `OneHotEncoder`. The main advantage of this is that it stores information about the columns and creates a persistent function that can be used on future data of the same form. This idea of transforming "future data of the same form" is central to  the predictive statistical work we'll do in later phases. See [this page](https://stackoverflow.com/questions/36631163/pandas-get-dummies-vs-sklearns-onehotencoder-what-are-the-pros-and-cons) for more.
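
A rough sketch of the "future data of the same form" point, using made-up frames (`train` and `new` are hypothetical). `get_dummies()` builds columns from whatever values it sees, while a fitted `OneHotEncoder` keeps the training layout. The `handle_unknown="ignore"` setting is one possible choice for unseen values, not the only one:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"RoofStyle": ["Gable", "Hip", "Flat"]})
new = pd.DataFrame({"RoofStyle": ["Hip", "Shed"]})  # "Shed" never appeared in train

# get_dummies derives its columns from the data at hand, so the layouts differ.
train_cols = list(pd.get_dummies(train["RoofStyle"]).columns)
new_cols = list(pd.get_dummies(new["RoofStyle"]).columns)

# A fitted OneHotEncoder keeps the training columns; with handle_unknown="ignore",
# an unseen value is encoded as a row of zeros instead of raising an error.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)
new_encoded = enc.transform(new).toarray()

print(train_cols)   # ['Flat', 'Gable', 'Hip']
print(new_cols)     # ['Hip', 'Shed']
print(new_encoded)  # row for "Hip" is [0, 0, 1]; row for "Shed" is all zeros
```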

#### Using scikit-learn OneHotEncoder

In [None]:
from sklearn.preprocessing import OneHotEncoder
onehot_enc = OneHotEncoder(drop = 'first') # returns a sparse matrix by default

Notice that by default the `.transform()` method returns a **sparse matrix**. If we want to see the 1's and 0's we can either override this by setting `sparse_output=False` in the encoder instance (`sparse=False` in scikit-learn versions before 1.2) or we can call `toarray()` on the sparse matrix:

In [None]:
nominal_cols = ['RoofStyle','HouseStyle']
# toarray() gives a plain NumPy array; todense() would give an np.matrix
X_nom_trans = onehot_enc.fit_transform(housing_df[nominal_cols]).toarray()
X_nom_trans

In [None]:
onehot_enc.get_feature_names_out()

The fitted encoder can also run the inverse transform:

In [None]:
onehot_enc.inverse_transform(X_nom_trans)

#### Some general advice on encoding many nominal variables

- Watch out for feature size explosion!
- Features are great, but...
    - Lots of features can lead to problems (we will see this later in **great detail**)
    - Can use up tons of memory (will need to encode as sparse matrix)
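
To get a feel for the size problem, here is a sketch with a made-up high-cardinality column (the `zip_code` column and its values are hypothetical). It compares the memory held by the sparse encoding with what a dense array of the same shape would need:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# Hypothetical high-cardinality column: 10,000 rows drawn from up to 1,000 labels.
zip_like = pd.DataFrame({"zip_code": rng.integers(0, 1000, size=10_000).astype(str)})

enc = OneHotEncoder()
encoded = enc.fit_transform(zip_like)  # sparse matrix by default

n_rows, n_cols = encoded.shape  # one new column per distinct label
# Memory held by the sparse (CSR) representation vs. an equivalent dense array.
sparse_bytes = encoded.data.nbytes + encoded.indices.nbytes + encoded.indptr.nbytes
dense_bytes = n_rows * n_cols * encoded.dtype.itemsize

print(n_rows, n_cols)
print(sparse_bytes, dense_bytes)
```

Each row has exactly one nonzero entry, so the sparse form stores roughly one value per row, while the dense form stores one value per row per category.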


In [None]:
X_nom_trans.shape

In [None]:
cols = onehot_enc.get_feature_names_out()
cols

In [None]:
X_nom = pd.DataFrame(X_nom_trans, columns = cols)
X_nom.head()

In [None]:
# combine all categorical variables

cat_df = pd.concat([X_ord,X_nom],axis = 1)
cat_df