# Categorical Variables

Many data sets contain categorical (non-numerical) features, which can take on only one of a limited number of possible values. In general, the set of possible values is fixed, as with the `sex`, `smoker`, and `day` features from the `tips` data set.

Categorical features can be nominal or ordinal:

* A nominal feature is a categorical feature with no inherent order among its categories, such as a set of colors.
* An ordinal feature is a categorical feature whose possible values have an intrinsic order, such as the results of a race encoded as *first*, *second*, and *third*.
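
As a minimal sketch of the ordinal case, the race results can be encoded so that the integer codes preserve the ranking. This assumes the `OrdinalEncoder` estimator from the scikit-learn `preprocessing` module; the sample `results` array is hypothetical and exists only for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Category order is stated explicitly so the codes follow the ranking
places = ['first', 'second', 'third']

# Hypothetical race results, shaped as a single feature column
results = np.array(['second', 'first', 'third', 'first']).reshape(-1, 1)

encoder = OrdinalEncoder(categories=[places])
codes = encoder.fit_transform(results).ravel()
print(codes)  # [1. 0. 2. 0.]
```

Passing the order explicitly matters here: without it, the categories would be sorted alphabetically, which only matches the ranking by coincidence.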

To use categorical features to generate a machine learning model with the scikit-learn library, we must convert them into numerical values. This process is generally known as encoding, and the scikit-learn library provides several different encodings in the `preprocessing` module.

In [25]:
%run "S0-init.ipynb"

Samples from the tips dataset:
     total_bill   tip     sex smoker   day    time  size
128       11.38  2.00  Female     No  Thur   Lunch     2
177       14.48  2.00    Male    Yes   Sun  Dinner     2
19        20.65  3.35    Male     No   Sat  Dinner     3
213       13.27  2.50  Female    Yes   Sat  Dinner     2
183       23.17  6.50    Male    Yes   Sun  Dinner     4


To begin, we create a fictitious set of features from four colors given as strings: 'Red', 'Blue', 'Yellow', and 'Green'. We then transform these categorical features into one of four numerical values by using the `LabelEncoder` estimator. We `fit` this estimator to the set of possible categorical values (the four colors), and `transform` the generated set of features so that we can compare each numerical label to its original color.

In [27]:
from sklearn.preprocessing import LabelEncoder

# Define allowed colors
clrs = ['Red', 'Blue', 'Yellow', 'Green']

# Size of sample
num_clrs = 10

# Create random sample of ten colors
tst = np.random.choice(clrs, size=num_clrs, replace=True)
tst

array(['Red', 'Blue', 'Yellow', 'Green', 'Green', 'Blue', 'Green', 'Blue',
       'Yellow', 'Green'], dtype='<U6')

In [28]:
# Create and fit label encoder to the list of allowed colors
le = LabelEncoder()
le.fit(clrs)

LabelEncoder()

In [29]:
# Transform sample data, and reshape vector
# to a 2D matrix (10, 1)
le_data = le.transform(tst).reshape(num_clrs, 1)

In [30]:
# Display encoded label and color
for clr, idx in zip(tst, le_data):
    print(idx, clr)

[2] Red
[0] Blue
[3] Yellow
[1] Green
[1] Green
[0] Blue
[1] Green
[0] Blue
[3] Yellow
[1] Green
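
The integer assigned to each color may look arbitrary, but it follows from how `LabelEncoder` stores its labels. The short sketch below refits an encoder on the same four colors and prints the fitted `classes_` attribute, which holds the labels in sorted order.

```python
from sklearn.preprocessing import LabelEncoder

clrs = ['Red', 'Blue', 'Yellow', 'Green']
le = LabelEncoder().fit(clrs)

# classes_ holds the sorted unique labels; the integer code of a label
# is its position in this array
for idx, clr in enumerate(le.classes_):
    print(idx, clr)
```

This is why 'Red' maps to 2 and 'Blue' to 0: the codes reflect alphabetical order, not the order in which the colors were listed.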


This encoding can be fine if the data are ordinal and the assigned integers follow the natural order of the categories. In this case, however, our colors are nominal: there is no numerical relationship between the different values, yet a model would treat 'Yellow' (3) as greater than 'Blue' (0). Thus, we need an additional transformation to convert our data into a numerical format that a machine learning model can process effectively. A commonly used approach is known as One Hot Encoding. This approach generates a new binary feature for each possible value in our category, so our four colors require four features. A value of zero indicates that the feature is not present for the specific instance, and a value of one indicates that it is present. Furthermore, exactly one of these new features can be present (or on) for a specific instance.
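
To make the idea concrete, here is a minimal sketch of one hot encoding done by hand with NumPy, independent of any scikit-learn estimator. The `labels` array is an assumption chosen for illustration; it stands in for the integer output of a label encoder with four categories.

```python
import numpy as np

# Integer labels for four categories (illustrative values only)
labels = np.array([2, 0, 3, 1])
n_categories = 4

# Row i of the identity matrix is the one hot vector for label i,
# so indexing the identity matrix by label builds the encoding
one_hot = np.eye(n_categories, dtype=int)[labels]
print(one_hot)

# Each row contains exactly one 1, in the column given by its label
print(one_hot.sum(axis=1))
```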

We can leverage this technique by using the `OneHotEncoder` estimator from the scikit-learn preprocessing module.

In [31]:
# Create our one hot encoder
from sklearn.preprocessing import OneHotEncoder
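# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`;
# on recent versions use OneHotEncoder(sparse_output=False)
ohe = OneHotEncoder(sparse=False)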

print('Going from color to encoding:\n')
# Display encoded label and color
for clr, ohed in zip(tst, ohe.fit_transform(le_data)):
    print(ohed, clr)
    
# Go in reverse: undo the one hot encoding to recover the integer
# label, then undo the label encoding to recover the color
print('\nGoing from encoding to color:\n')
enc = np.array([0, 1, 0, 0]).reshape(1, len(clrs))
lbl = ohe.inverse_transform(enc).ravel().astype(int)
print(f"{enc[0]} = {le.inverse_transform(lbl)}")

Going from color to encoding:

[0. 0. 1. 0.] Red
[1. 0. 0. 0.] Blue
[0. 0. 0. 1.] Yellow
[0. 1. 0. 0.] Green
[0. 1. 0. 0.] Green
[1. 0. 0. 0.] Blue
[0. 1. 0. 0.] Green
[1. 0. 0. 0.] Blue
[0. 0. 0. 1.] Yellow
[0. 1. 0. 0.] Green

Going from encoding to color:

[0 1 0 0] = ['Green']


We can perform this process in one step by using the `LabelBinarizer` estimator from the scikit-learn preprocessing module. With this estimator, we can directly convert a set of categorical labels to a one hot encoded matrix: the `fit` method generates the encoder, and the `transform` method applies the encoding to a set of labels.

In [32]:
# Create label binarizer estimator and fit to the colors
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
lb.fit(clrs)

# Transform and display encoded label and color
for clr, idx in zip(tst, lb.transform(tst)):
    print(idx, clr)

# Go in reverse
print('\nGoing from encoding to color:\n')
enc = np.array([0, 1, 0, 0]).reshape(1, len(clrs))
print(f'{enc[0]} = {lb.inverse_transform(enc)}')

[0 0 1 0] Red
[1 0 0 0] Blue
[0 0 0 1] Yellow
[0 1 0 0] Green
[0 1 0 0] Green
[1 0 0 0] Blue
[0 1 0 0] Green
[1 0 0 0] Blue
[0 0 0 1] Yellow
[0 1 0 0] Green

Going from encoding to color:

[0 1 0 0] = ['Green']
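
As an alternative sketch, and assuming scikit-learn 0.20 or later, `OneHotEncoder` can also be fit directly on string categories, which removes the need for a separate label encoding step. The example calls `toarray()` on the default sparse output rather than passing a dense-output argument, since the name of that argument differs between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

clrs = ['Red', 'Blue', 'Yellow', 'Green']

# OneHotEncoder expects a 2D array with one column per feature
ohe = OneHotEncoder()
ohe.fit(np.array(clrs).reshape(-1, 1))

# The default output is a sparse matrix; toarray() makes it dense
encoded = ohe.transform(np.array([['Green'], ['Red']])).toarray()
print(encoded)

# inverse_transform maps one hot rows back to the original strings
decoded = ohe.inverse_transform(np.array([[0, 1, 0, 0]]))
print(decoded)
```

`LabelBinarizer` is intended for target labels, while `OneHotEncoder` is intended for feature columns, so the latter is often the more natural choice when several categorical features must be encoded together.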


# Linear Regression with Categorical Features

We will use the scikit-learn `LabelBinarizer` to convert the `day` feature in the `tips` data set into a set of numerical features.

In [33]:
lb = LabelBinarizer()
ohe_day = lb.fit_transform(df.day.to_numpy())

In [34]:
# Total bill and binary days of the week as independent variables
n_ind_data = np.column_stack((df.total_bill, ohe_day))
# Tips as target
dep_data = df.tip.values.reshape(df.shape[0], 1)

# This is the amount to hold out for 'blind' testing
frac = 0.4

# Create test/train splits for data and labels
# Explicitly set our random seed to enable reproducibility
ind_train, ind_test, dep_train, dep_test \
    = train_test_split(n_ind_data, dep_data, test_size=frac, random_state=23)

In [35]:
pd.DataFrame(n_ind_data, columns=['Total_bill', 'Fri', 'Sat', 'Sun', 'Thur']) 

     Total_bill  Fri  Sat  Sun  Thur
0         16.99  0.0  0.0  1.0   0.0
1         10.34  0.0  0.0  1.0   0.0
2         21.01  0.0  0.0  1.0   0.0
3         23.68  0.0  0.0  1.0   0.0
4         24.59  0.0  0.0  1.0   0.0
...         ...  ...  ...  ...   ...
239       29.03  0.0  1.0  0.0   0.0
240       27.18  0.0  1.0  0.0   0.0
241       22.67  0.0  1.0  0.0   0.0
242       17.82  0.0  1.0  0.0   0.0


In [36]:
# Demonstrate encoding results
days = ['Thur', 'Fri', 'Sat', 'Sun']

# Fit, transform, and display encoded label and day
for day, idx in zip(days, lb.fit(df.day.to_numpy()).transform(days)):
    print(idx, day)

[0 0 0 1] Thur
[1 0 0 0] Fri
[0 1 0 0] Sat
[0 0 1 0] Sun
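
The column order of the encoded matrix follows the sorted labels stored in the binarizer's `classes_` attribute. Rather than typing the column names by hand, they can be read from the fitted estimator, as in the sketch below. The small `days` and `bills` arrays are illustrative stand-ins for the `tips` columns, not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Illustrative stand-ins for df.day and df.total_bill
days = np.array(['Sun', 'Sat', 'Thur', 'Fri', 'Sat'])
bills = np.array([16.99, 20.65, 11.38, 12.46, 13.27])

lb = LabelBinarizer()
ohe_day = lb.fit_transform(days)

# classes_ lists the labels in the same order as the encoded columns
columns = ['total_bill'] + list(lb.classes_)
frame = pd.DataFrame(np.column_stack((bills, ohe_day)), columns=columns)
print(frame)
```

Deriving the names from `classes_` keeps the labels and the columns in sync if the set of categories ever changes.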


Given this new feature matrix, we generate a new linear regression model. In the following code cell, we:
* fit the model,
* display the fit coefficients,
* compute the model performance, and
* finally display the regression model plot and the residual model plot.

In [37]:
# Create and Fit our linear regression model to training data
model = LinearRegression(fit_intercept=True)
model.fit(ind_train, dep_train)

# Display model fit parameters for training data
# Label binarizer sorts the labels, hence the change in display order.
print(f'tip = {model.intercept_[0]:4.2f}\n',
      f'+ {model.coef_[0][0]:4.2f} total_bill\n',
      f'+ {model.coef_[0][4]:4.2f} Day=="Thu"\n',
      f'+ {model.coef_[0][1]:4.2f} Day=="Fri"\n',
      f'+ {model.coef_[0][2]:4.2f} Day=="Sat"\n',
      f'+ {model.coef_[0][3]:4.2f} Day=="Sun"\n')

# Compute model predictions for test data
results = model.predict(ind_test)

# Compute score and display result (Coefficient of Determination)
score = 100.0 * model.score(ind_test, dep_test)
print(f'Multivariate LR Model score = {score:5.1f}%')

tip = 1.01
 + 0.10 total_bill
 + 0.10 Day=="Thu"
 + 0.22 Day=="Fri"
 + -0.05 Day=="Sat"
 + -0.27 Day=="Sun"

Multivariate LR Model score =  38.6%


In this case, our new model performs slightly worse on the test data than the original single variable linear regression model. This suggests that the day of the week adds little information beyond `total_bill` when predicting `tip`.

Creating a baseline model that does something really simple, like the original single variable linear regression, gives us a reference point against which to compare our other models, so we can tell whether a more complex model actually predicts better.
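
A baseline comparison of this kind might be sketched as follows. To keep the example self-contained, it uses synthetic data in place of the `tips` data set: the `total_bill`, `day`, and `tip` arrays are generated here and are assumptions made purely for illustration, so the resulting scores say nothing about the real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tips data (made-up values)
rng = np.random.default_rng(23)
n = 200
total_bill = rng.uniform(5, 50, size=n)
day = rng.integers(0, 4, size=n)
tip = 1.0 + 0.1 * total_bill + rng.normal(0, 1, size=n)

# One hot encode the day codes and build both feature matrices
ohe_day = np.eye(4)[day]
x_base = total_bill.reshape(-1, 1)
x_full = np.column_stack((total_bill, ohe_day))

# Split row indices once so both models see the same train/test rows
idx_train, idx_test = train_test_split(
    np.arange(n), test_size=0.4, random_state=23)

base = LinearRegression().fit(x_base[idx_train], tip[idx_train])
full = LinearRegression().fit(x_full[idx_train], tip[idx_train])

# Coefficient of determination on the held-out rows
base_score = base.score(x_base[idx_test], tip[idx_test])
full_score = full.score(x_full[idx_test], tip[idx_test])
print(f'Baseline model score = {100.0 * base_score:5.1f}%')
print(f'With day model score = {100.0 * full_score:5.1f}%')
```

Using the same split for both models is the key design choice: otherwise, a difference in score could come from the particular rows held out rather than from the added features.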