# 9. Handling Categorical Data
All of the features we have examined so far have been numeric. Many features in the housing dataset hold string values. We ignored them earlier because all data passed to a scikit-learn estimator must be numeric. Let's select a mix of string and numeric columns and try to fit a model on them.

In [2]:
import pandas as pd
housing = pd.read_csv('../data/housing.csv')
housing.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [3]:
h = housing[['LotShape', 'LandContour', 'Neighborhood', 'OverallQual', 'WoodDeckSF']]
h.head()

Unnamed: 0,LotShape,LandContour,Neighborhood,OverallQual,WoodDeckSF
0,Reg,Lvl,CollgCr,7,0
1,Reg,Lvl,Veenker,6,298
2,IR1,Lvl,CollgCr,7,0
3,IR1,Lvl,Crawfor,7,0
4,IR1,Lvl,NoRidge,8,192


In [4]:
h.isna().sum()

LotShape        0
LandContour     0
Neighborhood    0
OverallQual     0
WoodDeckSF      0
dtype: int64

In [5]:
X = h.values
y = housing['SalePrice'].values

In [6]:
X

array([['Reg', 'Lvl', 'CollgCr', 7, 0],
       ['Reg', 'Lvl', 'Veenker', 6, 298],
       ['IR1', 'Lvl', 'CollgCr', 7, 0],
       ...,
       ['Reg', 'Lvl', 'Crawfor', 7, 0],
       ['Reg', 'Lvl', 'NAmes', 5, 366],
       ['Reg', 'Lvl', 'Edwards', 5, 736]], dtype=object)

In [7]:
y

array([208500, 181500, 223500, ..., 266500, 142125, 147500])

### Try to fit the model :(

In [8]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X, y)

ValueError: could not convert string to float: 'Edwards'

## This is/was the worst part of scikit-learn
Other languages, like R, handle string columns internally.

## This is getting fixed in scikit-learn version 0.20
A lot of work has gone into fixing this, and version 0.20 ships an upgraded `OneHotEncoder` class that can encode string columns directly. Let's check that you have version 0.20 installed.

In [10]:
import sklearn
sklearn.__version__

'0.20rc1'

## Old way - use `pd.get_dummies` 
The pandas function `pd.get_dummies` performs **one-hot encoding**: each unique value in a string column becomes its own 0/1 column. Let's see how it works.

In [11]:
h.head()

Unnamed: 0,LotShape,LandContour,Neighborhood,OverallQual,WoodDeckSF
0,Reg,Lvl,CollgCr,7,0
1,Reg,Lvl,Veenker,6,298
2,IR1,Lvl,CollgCr,7,0
3,IR1,Lvl,Crawfor,7,0
4,IR1,Lvl,NoRidge,8,192


In [12]:
h_dummies = pd.get_dummies(h)
h_dummies.head()

Unnamed: 0,OverallQual,WoodDeckSF,LotShape_IR1,LotShape_IR2,LotShape_IR3,LotShape_Reg,LandContour_Bnk,LandContour_HLS,LandContour_Low,LandContour_Lvl,...,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker
0,7,0,0,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,6,298,0,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
2,7,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,7,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,8,192,1,0,0,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,0
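
The same encoding is easier to follow on a tiny frame. The sketch below uses hypothetical values, not the housing data. The `dtype=int` argument is an addition made here so that the dummy columns display as 0/1 on newer pandas versions, which otherwise default to booleans.

```python
import pandas as pd

# Toy frame with one string column and one numeric column (hypothetical values)
toy = pd.DataFrame({
    'LotShape': ['Reg', 'IR1', 'Reg'],
    'OverallQual': [7, 6, 8],
})

# Only the string column is expanded; the numeric column is passed through
toy_dummies = pd.get_dummies(toy, dtype=int)
print(toy_dummies)
```

The numeric column comes first in the result, followed by one indicator column per unique string value (`LotShape_IR1` and `LotShape_Reg`).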


### Only the string columns were encoded
The columns that were numeric were left alone. You can use the **`nunique`** method to find the number of unique values in each column. This will give you an idea of how wide your DataFrame will become after the encoding.

In [13]:
h.nunique()

LotShape          4
LandContour       4
Neighborhood     25
OverallQual      10
WoodDeckSF      274
dtype: int64

In [14]:
h_dummies.shape

(1460, 35)
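
The 35 columns follow directly from the `nunique` output: the two numeric columns are kept as they are, and the three string columns contribute 4 + 4 + 25 indicator columns, so 2 + 4 + 4 + 25 = 35. A rough sketch of the same check on a toy frame (hypothetical values; the names `string_cols` and `numeric_cols` are introduced here for illustration only):

```python
import pandas as pd

# Toy frame with two string columns and one numeric column (hypothetical values)
toy = pd.DataFrame({
    'LotShape': ['Reg', 'IR1', 'IR2', 'Reg'],
    'LandContour': ['Lvl', 'Lvl', 'Bnk', 'HLS'],
    'OverallQual': [7, 6, 8, 7],
})

# Split the columns by type: non-numeric columns get encoded, numeric ones do not
string_cols = toy.select_dtypes(exclude='number').columns
numeric_cols = toy.select_dtypes(include='number').columns

# Predicted width: one column per unique string value plus the untouched numeric columns
expected_width = int(toy[string_cols].nunique().sum()) + len(numeric_cols)
actual_width = pd.get_dummies(toy).shape[1]
print(expected_width, actual_width)
```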

## Use the upgraded `OneHotEncoder`

In [15]:
from sklearn.preprocessing import OneHotEncoder

In [18]:
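# sparse=False returns a dense array (the parameter is named sparse_output from scikit-learn 1.2 on)
ohe = OneHotEncoder(sparse=False)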

In [21]:
X = ohe.fit_transform(h)

In [22]:
X.shape

(1460, 317)

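## Wow, that's a lot of features - what happened?
`OneHotEncoder` encoded every column, including the numeric ones, so each unique value of `OverallQual` and `WoodDeckSF` got its own column: 4 + 4 + 25 + 10 + 274 = 317. We need to encode just the categorical features. A new transformer in scikit-learn 0.20 called `ColumnTransformer` does this. It takes a list of 3-item tuples, each holding a name, a transformer, and the columns to apply it to.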
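
Before moving on to `ColumnTransformer`, the blow-up can be reproduced on a toy frame. This is a minimal sketch with hypothetical values. It uses the encoder's default sparse output and converts it with `.toarray()`, a choice made here so the example does not depend on the name of the dense-output keyword.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame with one string column and one numeric column (hypothetical values)
toy = pd.DataFrame({
    'LotShape': ['Reg', 'IR1', 'Reg', 'IR2'],
    'WoodDeckSF': [0, 298, 0, 192],
})

# OneHotEncoder treats every column as categorical, numeric ones included
ohe_toy = OneHotEncoder()
encoded = ohe_toy.fit_transform(toy).toarray()

print(ohe_toy.categories_)  # one array of categories per input column
print(encoded.shape)        # (4, 6): 3 LotShape values + 3 WoodDeckSF values
```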

In [24]:
from sklearn.compose import ColumnTransformer

In [25]:
transformers = [('cat', ohe, ['LotShape', 'LandContour', 'Neighborhood'])]

In [29]:
ct = ColumnTransformer(transformers, remainder='passthrough')
X = ct.fit_transform(h)
X.shape

(1460, 35)
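
With the string columns encoded, the fit that failed earlier can go through. The following is only a sketch on toy data: the values and the names `ct_toy`, `toy_price`, and `model` are hypothetical, and chaining the steps with `make_pipeline` is one possible arrangement rather than something the steps above require.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy features and target (hypothetical values, not the housing data)
toy = pd.DataFrame({
    'LotShape': ['Reg', 'IR1', 'Reg', 'IR2', 'IR1', 'Reg'],
    'OverallQual': [7, 6, 8, 5, 7, 6],
})
toy_price = [208500, 181500, 250000, 140000, 223500, 175000]

# Encode only the string column; pass the numeric column through unchanged
ct_toy = ColumnTransformer(
    [('cat', OneHotEncoder(), ['LotShape'])],
    remainder='passthrough',
)

# The pipeline applies the encoding and then fits the regression
model = make_pipeline(ct_toy, LinearRegression())
model.fit(toy, toy_price)
preds = model.predict(toy)
print(preds.shape)
```

Keeping the encoder inside the pipeline means the same transformation is applied automatically whenever `predict` is called on new rows.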