# 11. Handling Categorical Data
All of the features we have examined thus far have been numeric. Many other features in the dataset hold string values. We ignored them at the time because all data passed to a Scikit-Learn estimator must be numeric. Let's select a mix of string and numeric columns and attempt to fit a model with them.

In [None]:
import pandas as pd
housing = pd.read_csv('../data/housing.csv')
housing.head()

In [None]:
h = housing[['LotShape', 'LandContour', 'Neighborhood', 'OverallQual', 'WoodDeckSF', 'LotArea']].copy()
h.head()

In [None]:
h.isna().sum()

In [None]:
X = h.values
y = housing['SalePrice'].values

In [None]:
X

In [None]:
y

### Try to fit the model :(

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X, y)
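
The failure can be reproduced on a tiny array. The sketch below uses made-up rows (the `toy_` names and values are illustrative, not taken from the housing file) and catches the error so the message can be read:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one string column and one numeric column
toy_X = np.array([['Reg', 8450], ['IR1', 9600], ['Reg', 11250]], dtype=object)
toy_y = np.array([208500, 181500, 223500])

try:
    LinearRegression().fit(toy_X, toy_y)
    toy_fit_error = None
except ValueError as err:
    # scikit-learn tries to convert every value to a float and fails on 'Reg'
    toy_fit_error = err
    print('Fit failed:', err)
```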

## This was the worst part of scikit-learn
Other languages, such as R, handle string columns internally.

## This got fixed in scikit-learn version 0.20!
A lot of work went into fixing this with the upgraded `OneHotEncoder` class in version 0.20. Let's check that you have version 0.20 or later installed.

In [None]:
import sklearn
sklearn.__version__

# Make variables Categorical
Notice that `OverallQual` is a categorical variable even though its values are numeric.

In [None]:
h['LotShape'] = pd.Categorical(h['LotShape'])
h['LandContour'] = pd.Categorical(h['LandContour'])
h['Neighborhood'] = pd.Categorical(h['Neighborhood'])
h['OverallQual'] = pd.Categorical(h['OverallQual'])

## Old way - use `pd.get_dummies`
The pandas function `pd.get_dummies` performs **one-hot encoding**. Let's see how it works.

In [None]:
h.head()

By default, `get_dummies` encodes all string columns and any columns with the pandas `category` dtype.

In [None]:
h_dummies = pd.get_dummies(h)
h_dummies.head()
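
To make the default behavior easier to see, here is a small sketch on a made-up DataFrame (the `toy` names and values are illustrative only). It has one string column, one numeric column stored as a pandas category, and one plain numeric column:

```python
import pandas as pd

# Hypothetical toy data, not rows from the housing file
toy = pd.DataFrame({
    'LotShape': ['Reg', 'IR1', 'Reg'],           # string column
    'OverallQual': pd.Categorical([7, 6, 7]),    # numeric values, category dtype
    'LotArea': [8450, 9600, 11250],              # plain numeric column
})

toy_dummies = pd.get_dummies(toy)

# One indicator column per distinct value of the string and category columns;
# LotArea is carried over unchanged
print(list(toy_dummies.columns))
```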

### Only the string and categorical columns were encoded
The remaining numeric columns were left alone. You can use the **`nunique`** method to find the number of unique values in each column. This will give you an idea of how wide your DataFrame will become after the encoding.

In [None]:
h.nunique()

In [None]:
h_dummies.shape
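
The width can also be estimated before encoding. The sketch below uses made-up data and assumes no missing values and no unused categories: add up the unique counts of the encoded columns, then add the number of columns that are left alone.

```python
import pandas as pd

# Hypothetical toy data, not rows from the housing file
toy_wide = pd.DataFrame({
    'LotShape': ['Reg', 'IR1', 'Reg', 'IR1'],
    'Neighborhood': ['CollgCr', 'Veenker', 'CollgCr', 'NoRidge'],
    'LotArea': [8450, 9600, 11250, 9550],
})
toy_encoded_cols = ['LotShape', 'Neighborhood']

# Unique values in the encoded columns, plus the untouched numeric columns
toy_expected_width = (toy_wide[toy_encoded_cols].nunique().sum()
                      + toy_wide.shape[1] - len(toy_encoded_cols))
toy_actual_width = pd.get_dummies(toy_wide).shape[1]

print(toy_expected_width, toy_actual_width)
```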

# Use the upgraded `OneHotEncoder`

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
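# Return a dense array instead of a sparse matrix.
# Note: scikit-learn 1.2+ renames this parameter to `sparse_output`.
ohe = OneHotEncoder(sparse=False)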

In [None]:
X = ohe.fit_transform(h)

In [None]:
X.shape

# Wow, that's a lot of features - what happened?
By default, `OneHotEncoder` encodes every single column in our DataFrame, including the numeric ones. We need to encode just the categorical features.
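
A small sketch on made-up data shows the effect. The numeric column below has four distinct values, so it turns into four indicator columns of its own. The encoder is created with default arguments and the sparse result is converted with `toarray()`, which avoids depending on the name of the sparse parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not rows from the housing file
toy_mixed = pd.DataFrame({
    'LotShape': ['Reg', 'IR1', 'Reg', 'IR1'],   # 2 distinct values
    'LotArea': [8450, 9600, 11250, 9550],       # 4 distinct values
})

# The default output is a SciPy sparse matrix; toarray() makes it dense
toy_encoded = OneHotEncoder().fit_transform(toy_mixed).toarray()

# 2 columns for LotShape plus 4 columns for LotArea
print(toy_encoded.shape)
```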

# Introducing `ColumnTransformer`
Version 0.20 also added a transformer called `ColumnTransformer` that allows you to apply different transformations to different columns of your DataFrame.

## Create a list of 3-item tuples 

The `ColumnTransformer` requires a list of 3-item tuples to work. The first value of the tuple is a string called the **name**. It is used to refer to the transformer later on, for example during a grid search. The second value of the tuple is the actual **transformer**. In this example, we are doing one-hot encoding. The last value in the tuple is the list of **columns** that the transformation is applied to.

Let's import the `ColumnTransformer` and create the list of three-item tuples. Here, we just have one transformer, so our list is of length 1.

In [None]:
from sklearn.compose import ColumnTransformer
transformers = [('cat', ohe, ['LotShape', 'LandContour', 'Neighborhood'])]

### What happens to the other columns?
Only the columns explicitly listed get transformed. The other columns are either dropped or kept. The default is to drop them. You can keep them, as we do below, by passing the string `'passthrough'` to the `remainder` parameter.

In [None]:
ct = ColumnTransformer(transformers, remainder='passthrough')
X = ct.fit_transform(h)
X.shape

In [None]:
X
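
The difference between the two `remainder` settings is easiest to see side by side. This is a minimal sketch on made-up data. The `toy_` names are illustrative, and `sparse_threshold=0` is used here only to force a dense NumPy result.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not rows from the housing file
toy_cols = pd.DataFrame({
    'LotShape': ['Reg', 'IR1', 'Reg', 'IR1'],
    'WoodDeckSF': [0, 298, 0, 192],
    'LotArea': [8450, 9600, 11250, 9550],
})

# (name, transformer, columns)
toy_transformers = [('cat', OneHotEncoder(), ['LotShape'])]

# Default remainder='drop': only the encoded LotShape columns survive
toy_dropped = ColumnTransformer(
    toy_transformers, sparse_threshold=0).fit_transform(toy_cols)

# remainder='passthrough': the two numeric columns are appended unchanged
toy_kept = ColumnTransformer(
    toy_transformers, remainder='passthrough',
    sparse_threshold=0).fit_transform(toy_cols)

print(toy_dropped.shape, toy_kept.shape)
```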

### Get new column names - NotImplementedError :(
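Scikit-Learn returns a NumPy array, which carries no column names. As of version 0.20, `get_feature_names` is not implemented when `'passthrough'` is used, so the call below raises a `NotImplementedError`. Later releases replace it with `get_feature_names_out`, which does handle `'passthrough'`.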

In [None]:
ct.get_feature_names()
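
If you are running a much newer scikit-learn (1.0 or later, which is an assumption about your installation), the fitted `ColumnTransformer` has a `get_feature_names_out` method that also covers passed-through columns. The sketch below uses made-up data and checks for the method first, so it does not fail on older releases:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not rows from the housing file
toy_named = pd.DataFrame({
    'LotShape': ['Reg', 'IR1', 'Reg', 'IR1'],
    'LotArea': [8450, 9600, 11250, 9550],
})

toy_ct = ColumnTransformer([('cat', OneHotEncoder(), ['LotShape'])],
                           remainder='passthrough', sparse_threshold=0)
toy_ct.fit(toy_named)

if hasattr(toy_ct, 'get_feature_names_out'):
    # Newer releases: names for the encoded and the passed-through columns
    toy_names = list(toy_ct.get_feature_names_out())
else:
    # Older releases: no equivalent that supports 'passthrough'
    toy_names = None

print(toy_names)
```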

## Scale the numeric variables
We can modify the above `ColumnTransformer` object so that it also transforms the numeric variables. We do this by extending the list with another three-item tuple.

In [None]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

In [None]:
transformers = [('cat', ohe, ['LotShape', 'LandContour', 'Neighborhood', 'OverallQual']),
                ('num', ss, ['WoodDeckSF', 'LotArea'])]

ct = ColumnTransformer(transformers)
X = ct.fit_transform(h)
X.shape
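
As a quick sanity check on made-up data, the sketch below combines an encoder and a scaler in one `ColumnTransformer`. It then confirms that the scaled columns, which come after the one-hot columns in the output, have mean 0 and standard deviation 1. The `toy_` names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, not rows from the housing file
toy_both = pd.DataFrame({
    'LotShape': ['Reg', 'IR1', 'Reg', 'IR1'],
    'WoodDeckSF': [0, 298, 0, 192],
    'LotArea': [8450, 9600, 11250, 9550],
})

toy_ct_both = ColumnTransformer(
    [('cat', OneHotEncoder(), ['LotShape']),
     ('num', StandardScaler(), ['WoodDeckSF', 'LotArea'])],
    sparse_threshold=0,  # always return a dense array
)
toy_X_both = toy_ct_both.fit_transform(toy_both)

# Output columns follow the order of the transformer list:
# 2 one-hot columns first, then the 2 scaled columns
toy_scaled = toy_X_both[:, 2:]
print(toy_X_both.shape)
print(toy_scaled.mean(axis=0).round(6), toy_scaled.std(axis=0).round(6))
```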

### Continue with machine learning

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

In [None]:
lr = LinearRegression()

In [None]:
lr.fit(X, y)
cross_val_score(lr, X, y, cv=10)

# Exercises
* Manually choose some categorical and some numeric features and use `ColumnTransformer` to both encode and scale the appropriate values.
* Can you write a function that iterates through each column of the DataFrame and changes the data type to `Categorical`? Once you have done this, can you build a model that uses all the data?