# Handling categorical data in `scikit-learn`

`scikit-learn` models typically work with quantitative data. When faced with categorical data, we should transform it before feeding it to the model, so as to avoid strange behavior (or errors).

In [1]:
import pandas
import io  # Only used to show a sample file in the notebook
from sklearn.linear_model import LinearRegression, LogisticRegression

Let us assume we have the following data, read using the `pandas.read_csv` function:

In [2]:
file_content = io.StringIO("""age;sex;weight;height
23;M;70;180
22;M;65;160
31;F;80;190
26;M;80;175
22;F;65;170
""")

df = pandas.read_csv(file_content, sep=";")
print(df)

   age sex  weight  height
0   23   M      70     180
1   22   M      65     160
2   31   F      80     190
3   26   M      80     175
4   22   F      65     170


At this point, we can build models that rely on the variables `age`, `weight` and `height`, as those are quantitative variables. Let us, for example, fit a linear regression to explain `height` using the other two variables:

In [3]:
X = df[["age", "weight"]]
y = df["height"]

model = LinearRegression()
model.fit(X, y)
print(model.predict(X))

[ 170.86538462  168.04487179  189.39102564  178.65384615  168.04487179]


But if we want to add the `sex` variable to the model, we run into an issue:

In [4]:
X = df[["age", "weight", "sex"]]
y = df["height"]

try:
    model = LinearRegression()
    model.fit(X, y)
    print(model.predict(X))
except ValueError:
    print("Could not fit the model because of erroneous type")

Could not fit the model because of erroneous type
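
To see why the fit fails, it can help to look at the exception itself rather than a fixed message. The following is a minimal sketch that rebuilds the same toy data so that it runs on its own; the exact wording of the error message is not guaranteed and may differ between `scikit-learn` versions.

```python
import pandas
from sklearn.linear_model import LinearRegression

# Same toy data as above, rebuilt here so that the snippet is self-contained
df = pandas.DataFrame({
    "age": [23, 22, 31, 26, 22],
    "sex": ["M", "M", "F", "M", "F"],
    "weight": [70, 65, 80, 80, 65],
    "height": [180, 160, 190, 175, 170],
})

X = df[["age", "weight", "sex"]]
y = df["height"]

# The `sex` column holds strings, unlike the numeric columns
print(df.dtypes)

error_message = None
try:
    LinearRegression().fit(X, y)
except ValueError as err:
    # Keep the original message: it usually names the value that could not be converted
    error_message = str(err)

print(error_message)
```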


Note that we do not have the same issue if the categorical variable is the one to be predicted by the model:

In [5]:
X = df[["age", "weight", "height"]]
y = df["sex"]

# If we want to do classification, we should use LogisticRegression instead of LinearRegression
model = LogisticRegression()
model.fit(X, y)
print(model.predict(X))

['M' 'M' 'F' 'M' 'F']
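
The reason this works is that `scikit-learn` classifiers keep track of the target labels themselves. As a small sketch (again rebuilding the toy data, and setting `max_iter=1000` only as a precaution for this unscaled data), the fitted model exposes the labels it has seen through its `classes_` attribute, and the columns of `predict_proba` follow that order:

```python
import pandas
from sklearn.linear_model import LogisticRegression

# Same toy data as above, rebuilt here so that the snippet is self-contained
df = pandas.DataFrame({
    "age": [23, 22, 31, 26, 22],
    "sex": ["M", "M", "F", "M", "F"],
    "weight": [70, 65, 80, 80, 65],
    "height": [180, 160, 190, 175, 170],
})

X = df[["age", "weight", "height"]]
y = df["sex"]

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Labels seen during fit, stored in sorted order
print(model.classes_)

# One probability column per class, in the same order as model.classes_
probabilities = model.predict_proba(X)
print(probabilities.shape)
```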


Now, if our categorical variable is a regressor (an input feature) in the model, we should transform it before using `scikit-learn`. This is true even if we do not get an error at fitting time. For example, if the categories are numbered `1`, `2`, and `3`, it is of prime importance to recode them so that `scikit-learn` does not treat them as numerical data! Otherwise, the model would assume that the categories are ordered and evenly spaced.

To transform a given column corresponding to categorical data, we will use the `pandas.get_dummies` function as follows:

In [6]:
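# dtype=int gives 0/1 indicator columns (recent pandas versions return booleans by default)
df_sklearn = pandas.get_dummies(df, columns=["sex"], dtype=int)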
print(df_sklearn)

   age  weight  height  sex_F  sex_M
0   23      70     180      0      1
1   22      65     160      0      1
2   31      80     190      1      0
3   26      80     175      0      1
4   22      65     170      1      0
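
The same function also covers the case mentioned earlier, where categories are numbered `1`, `2`, and `3`. The sketch below uses a hypothetical `group` column (not part of the data above) to illustrate that such a column is only encoded when it is named explicitly in `columns`:

```python
import pandas

# Hypothetical data: `group` holds category labels that happen to be integers
df_codes = pandas.DataFrame({
    "group": [1, 2, 3, 2, 1],
    "weight": [70, 65, 80, 80, 65],
})

# Without `columns`, only string/categorical columns are converted,
# so the integer-coded `group` column is left as a plain number
untouched = pandas.get_dummies(df_codes)
print(list(untouched.columns))

# Naming the column explicitly forces it to be one-hot encoded
encoded = pandas.get_dummies(df_codes, columns=["group"], dtype=int)
print(encoded)
```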


In [7]:
X = df_sklearn[["age", "weight", "sex_F", "sex_M"]]
y = df_sklearn["height"]

model = LinearRegression()
model.fit(X, y)
print(model.predict(X))

[ 169.97093023  166.01744186  189.33139535  179.01162791  170.66860465]
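
One point worth keeping in mind is that `sex_F` and `sex_M` always sum to 1, so one of the two columns is redundant once the model has an intercept. This does not change the predictions, but it makes the individual coefficients harder to interpret. A possible variant, sketched here on the same toy data, is to pass `drop_first=True` to `pandas.get_dummies` so that only one indicator column is kept:

```python
import pandas
from sklearn.linear_model import LinearRegression

# Same toy data as above, rebuilt here so that the snippet is self-contained
df = pandas.DataFrame({
    "age": [23, 22, 31, 26, 22],
    "sex": ["M", "M", "F", "M", "F"],
    "weight": [70, 65, 80, 80, 65],
    "height": [180, 160, 190, 175, 170],
})
y = df["height"]
features = df[["age", "weight", "sex"]]

# Full encoding: one indicator column per category
X_full = pandas.get_dummies(features, columns=["sex"], dtype=int)

# Reduced encoding: the first category (F) is dropped and becomes the baseline
X_reduced = pandas.get_dummies(features, columns=["sex"], drop_first=True, dtype=int)
print(list(X_reduced.columns))

pred_full = LinearRegression().fit(X_full, y).predict(X_full)
pred_reduced = LinearRegression().fit(X_reduced, y).predict(X_reduced)

# Both encodings carry the same information, so the predictions should agree
print(pred_full)
print(pred_reduced)
```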


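Note, however, that this pre-processing can also be done directly in `scikit-learn` using the `OneHotEncoder` class. Since `OneHotEncoder` encodes every column it is given, we wrap it in a `ColumnTransformer` so that only `sex` is encoded and the quantitative columns are passed through unchanged:

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = df[["age", "weight", "sex"]]
y = df["height"]

# One-hot encode `sex` only; `age` and `weight` are kept as they are
preprocessor = ColumnTransformer(
    [("sex", OneHotEncoder(), ["sex"])],
    remainder="passthrough",
)

X_preprocessed = preprocessor.fit_transform(X)

print(X_preprocessed)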
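
To go from encoded features to a fitted model, one option is to chain the encoding step and the regression in a `Pipeline`. The following is a sketch under a few assumptions rather than the only way to do it: the toy data is rebuilt so that the snippet runs on its own, a `ColumnTransformer` restricts the encoding to `sex`, and the step names `preprocess` and `regress` are arbitrary labels chosen for illustration.

```python
import pandas
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Same toy data as above, rebuilt here so that the snippet is self-contained
df = pandas.DataFrame({
    "age": [23, 22, 31, 26, 22],
    "sex": ["M", "M", "F", "M", "F"],
    "weight": [70, 65, 80, 80, 65],
    "height": [180, 160, 190, 175, 170],
})

X = df[["age", "weight", "sex"]]
y = df["height"]

# Encode `sex` only; categories unseen at fit time are encoded as all zeros
preprocessor = ColumnTransformer(
    [("sex", OneHotEncoder(handle_unknown="ignore"), ["sex"])],
    remainder="passthrough",
)

# The pipeline applies the encoding and then fits the regression on the result
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("regress", LinearRegression()),
])
pipeline.fit(X, y)

predictions = pipeline.predict(X)
print(predictions)
```

With this design, the raw `DataFrame` (including the `sex` column) can be passed to `predict` directly, since the encoding learned during `fit` is reapplied automatically.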