## Using categorical data in mathematical models

You have probably heard of studies that fit models to categorical data. For instance, you might have heard that the incidence of a particular type of cancer is higher among folks who work in nail salons. To reach such a conclusion, it is often necessary to consider several categorical variables (profession, zip code, etc.) along with quantitative data such as age.

But how can a string value like profession be converted into a number for use in statistical modeling?

Let's start by considering a *binary* variable - one that has only two possible values.

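One way to do so is through the use of what's called 'one-hot encoding'. This gives each category its own column, which holds a 1 in the rows that belong to that category and a 0 everywhere else.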
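
To make the idea concrete, here is a minimal sketch in plain Python, with no libraries involved. The `smoker` and `professions` lists are made-up sample data, not values from any study; the point is only to show that a binary variable needs a single 0/1 column, while one-hot encoding uses one 0/1 column per category.

```python
# Hypothetical sample data: a binary categorical variable.
smoker = ["yes", "no", "no", "yes"]

# A binary variable can be stored as a single 0/1 column.
smoker_as_number = [1 if value == "yes" else 0 for value in smoker]
print(smoker_as_number)  # [1, 0, 0, 1]

# Hypothetical sample data: a variable with more than two categories.
professions = ["nurse", "nail technician", "teacher", "nurse"]

# One-hot encoding: one 0/1 column per distinct category (sorted for a
# stable column order).
categories = sorted(set(professions))
one_hot = [[1 if value == category else 0 for category in categories]
           for value in professions]

print(categories)  # ['nail technician', 'nurse', 'teacher']
for row in one_hot:
    print(row)
```

Each row contains exactly one 1, which is where the name 'one-hot' comes from.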

In [None]:
# pd.get_dummies() is a good way to do this; see
# https://stackabuse.com/one-hot-encoding-in-python-with-pandas-and-scikit-learn/
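
As a rough illustration of the `pd.get_dummies()` route mentioned above, the sketch below encodes a small, made-up DataFrame. The column names `sex` and `age` are assumptions chosen for the example; `dtype=int` is passed only so the result shows 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical sample data: one categorical column and one numeric column.
df = pd.DataFrame({"sex": ["Male", "Female", "Female"],
                   "age": [34, 51, 27]})

# Replace the 'sex' column with one 0/1 column per category;
# numeric columns such as 'age' are left untouched.
encoded = pd.get_dummies(df, columns=["sex"], dtype=int)
print(encoded)
```

The new columns are named after the original column and the category (`sex_Female`, `sex_Male`), which keeps the encoded table readable.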

In [5]:
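from sklearn.preprocessing import OneHotEncoder

# Each row is one observation with two categorical features.
data = [['Male', 1], ['Female', 3], ['Female', 2]]

# handle_unknown='ignore' encodes categories not seen during fit as all
# zeros instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(data)

# transform() returns a SciPy sparse matrix; printing it lists the
# (row, column) positions of the nonzero entries.
data_as_binary = encoder.transform(data)
print(data_as_binary)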

  (0, 1)	1.0
  (0, 2)	1.0
  (1, 0)	1.0
  (1, 4)	1.0
  (2, 0)	1.0
  (2, 3)	1.0
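
The printed result lists only the nonzero entries as (row, column) pairs, which can be hard to read. A minimal sketch of one way to inspect it: convert the sparse matrix to a dense array with `.toarray()` and look at `encoder.categories_` to see which category each column stands for. The data is the same as in the cell above.

```python
from sklearn.preprocessing import OneHotEncoder

data = [['Male', 1], ['Female', 3], ['Female', 2]]

encoder = OneHotEncoder(handle_unknown='ignore')
# fit_transform() fits the encoder and encodes the data in one step;
# toarray() turns the sparse result into an ordinary NumPy array.
dense = encoder.fit_transform(data).toarray()

# One array of categories per input feature, in output column order.
print(encoder.categories_)
print(dense)
```

The columns come out in the order Female, Male, 1, 2, 3, so the first row (`['Male', 1]`) has ones in columns 1 and 2, matching the `(0, 1)` and `(0, 2)` entries in the sparse printout.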


## Notes

This [example from the scikit-learn documentation](https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_categorical.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-categorical-py) shows a neat way of processing the categorical columns of a dataframe: a column transformer hooked up to the OneHotEncoder encodes those columns and leaves the rest alone. A similar approach allows for dropping rows with empty values.

In [None]:
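from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode every column with the pandas 'category' dtype and pass the
# remaining columns through unchanged.
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; on older
# versions use sparse=False instead.
one_hot_encoder = make_column_transformer(
    (OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
     make_column_selector(dtype_include='category')),
    remainder='passthrough')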