<a href="https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/t81_558_class_02_2_pandas_cat.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# T81-558: Applications of Deep Neural Networks

**Module 2: Python for Machine Learning**

- Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
- For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).


# Module 2 Material

Main video lecture:

- Part 2.1: Introduction to Pandas [[Video]](https://www.youtube.com/watch?v=wixHCvnvnsU&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_1_python_pandas.ipynb)
- **Part 2.2: Categorical Values** [[Video]](https://www.youtube.com/watch?v=Fm7Ax23hDP0&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_2_pandas_cat.ipynb)
- Part 2.3: Grouping, Sorting, and Shuffling in Python Pandas [[Video]](https://www.youtube.com/watch?v=tUhaD8xWd7k&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_3_pandas_grouping.ipynb)
- Part 2.4: Using Apply and Map in Pandas [[Video]](https://www.youtube.com/watch?v=YNo_mg1RrkM&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_4_pandas_functional.ipynb)
- Part 2.5: Feature Engineering in Pandas for Deep Learning in PyTorch [[Video]](https://www.youtube.com/watch?v=ezaVtM405Qs&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_5_pandas_features.ipynb)


# Google CoLab Instructions

The following code detects whether the notebook is running in Google CoLab.


In [1]:
try:
    import google.colab  # only importable inside Google CoLab

    COLAB = True
    print("Note: using Google CoLab")
except:
    print("Note: not using Google CoLab")
    COLAB = False

Note: using Google CoLab


# Part 2.2: Categorical and Continuous Values

Neural networks require their input to be a fixed number of columns. This input format is very similar to spreadsheet data, except that it must be entirely numeric. It is essential to represent the data so that the neural network can train from it. Before we look at specific ways to preprocess data, it is important to consider four basic types of data, as defined by [[Cite:stevens1946theory]](http://psychology.okstate.edu/faculty/jgrice/psyc3214/Stevens_FourScales_1946.pdf). Statisticians commonly refer to these as the [levels of measure](https://en.wikipedia.org/wiki/Level_of_measurement):

- Character Data (strings)
  - **Nominal** - Individual discrete items, no order. For example, color, zip code, and shape.
  - **Ordinal** - Individual distinct items that have an implied order. For example, grade level, job title, Starbucks(tm) coffee size (tall, grande, venti).
- Numeric Data
  - **Interval** - Numeric values, no defined start. For example, temperature. You would never say, "yesterday was twice as hot as today."
  - **Ratio** - Numeric values, clearly defined start. For example, speed. You could say, "The first car is going twice as fast as the second."
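
The difference between nominal and ordinal data can be made concrete in pandas. The following is a minimal sketch, assuming small made-up example values (the `colors` and `sizes` series are hypothetical and not part of the course data). It uses `pd.Categorical` with `ordered=True` to attach an explicit order to the coffee sizes.

```python
import pandas as pd

# Nominal: discrete items with no order (hypothetical example values)
colors = pd.Series(pd.Categorical(["red", "green", "blue", "green"]))

# Ordinal: discrete items with an explicit order, smallest to largest
sizes = pd.Series(
    pd.Categorical(
        ["grande", "tall", "venti", "tall"],
        categories=["tall", "grande", "venti"],
        ordered=True,
    )
)

print(colors.cat.ordered)  # False: no order is defined
print(sizes.cat.ordered)   # True: an order is defined

# Each value's position in the declared order
size_codes = sizes.cat.codes.tolist()
print(size_codes)

# Comparisons are meaningful only for ordered categoricals
larger_than_tall = (sizes > "tall").tolist()
print(larger_than_tall)
```

Declaring the order explicitly keeps pandas from falling back on alphabetical order, which would not match the real size ranking.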


## Encoding Continuous Values

One common transformation is to normalize the inputs. It is often valuable to put numeric inputs into a standard form so that the program can easily compare values. Consider if a friend told you that they received a 10-dollar discount. Is this a good deal? Maybe. But the discount is not normalized. If your friend purchased a car, the discount is not that good. If your friend bought lunch, this is an excellent discount!

Percentages are a prevalent form of normalization. If your friend tells you they got 10% off, we know that this is a better discount than 5%. It does not matter how much the purchase price was. One widespread machine learning normalization is the Z-Score:

$$ z = \frac{x - \mu}{\sigma} $$

To calculate the Z-Score, you also need to calculate the mean ($\mu$ or $\bar{x}$) and the standard deviation ($\sigma$). You can calculate the mean with this equation:

$$ \mu = \bar{x} = \frac{x_1+x_2+\cdots +x_n}{n} $$

The standard deviation is calculated as follows:

$$ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} $$

The following Python code replaces the mpg with a z-score. Cars with average MPG will be near zero, above zero is above average, and below zero is below average. Z-Scores more than 3 above or below zero are very rare; these are outliers.


If the LaTeX above does not render in your environment, the same formulas in plain text are:

1. **Z-Score Formula:**
   - The Z-Score is calculated using the formula: `z = (x - μ) / σ`.
   - Here, `x` represents an individual data point, `μ` (mu) is the mean of the data points, and `σ` (sigma) is the standard deviation of the data points.

2. **Mean Calculation:**
   - The mean (`μ` or `x̄`) is the average value of all data points.
   - It's calculated as: `μ = (x₁ + x₂ + ... + xₙ) / n`, where `x₁, x₂, ..., xₙ` are the data points and `n` is the number of data points.

3. **Standard Deviation Calculation:**
   - Standard deviation (`σ`) measures the amount of variation or dispersion in a set of values.
   - It's calculated as: `σ = sqrt((1/N) * Σ (xᵢ - μ)²)`, with the sum taken from `i = 1` to `N`.
   - In this formula, `Σ` denotes the sum over all data points, `N` is the total number of data points (the same quantity as `n` above), `xᵢ` is each individual data point, and `μ` is the mean.

4. **Interpreting Z-Scores:**
   - Here, Z-Scores normalize the `mpg` (miles per gallon) values of cars.
   - A Z-Score near zero indicates average MPG.
   - A Z-Score above zero indicates above-average MPG.
   - A Z-Score below zero indicates below-average MPG.
   - Z-Scores above 3 or below -3 are considered outliers (very rare and extreme values).

Normalization with Z-Scores puts different data points on a common scale, making them easier to compare and analyze.

In [1]:
import pandas as pd
from scipy.stats import zscore

In [3]:
url = "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv"
mpg_data = pd.read_csv(url, na_values=["NA", "?"])
mpg_data.sample(10, random_state=43)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
102,26.0,4,97.0,46.0,1950,21.0,73,2,volkswagen super beetle
210,19.0,6,156.0,108.0,2930,15.5,76,3,toyota mark ii
136,16.0,8,302.0,140.0,4141,14.0,74,1,ford gran torino
64,15.0,8,318.0,150.0,4135,13.5,72,1,plymouth fury iii
208,13.0,8,318.0,150.0,3940,13.2,76,1,plymouth volare premier v8
288,18.2,8,318.0,135.0,3830,15.2,79,1,dodge st. regis
20,25.0,4,110.0,87.0,2672,17.5,70,2,peugeot 504
306,28.8,6,173.0,115.0,2595,11.3,79,1,chevrolet citation
147,24.0,4,90.0,75.0,2108,15.5,74,2,fiat 128
138,14.0,8,318.0,150.0,4457,13.5,74,1,dodge coronet custom (sw)


In [4]:
mpg_data= (mpg_data
     .assign(mpg = lambda x: zscore(x["mpg"]))
    )

mpg_data.sample(10, random_state=43)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
102,0.318393,4,97.0,46.0,1950,21.0,73,2,volkswagen super beetle
210,-0.578335,6,156.0,108.0,2930,15.5,76,3,toyota mark ii
136,-0.962647,8,302.0,140.0,4141,14.0,74,1,ford gran torino
64,-1.090751,8,318.0,150.0,4135,13.5,72,1,plymouth fury iii
208,-1.346959,8,318.0,150.0,3940,13.2,76,1,plymouth volare premier v8
288,-0.680818,8,318.0,135.0,3830,15.2,79,1,dodge st. regis
20,0.190289,4,110.0,87.0,2672,17.5,70,2,peugeot 504
306,0.677084,6,173.0,115.0,2595,11.3,79,1,chevrolet citation
147,0.062185,4,90.0,75.0,2108,15.5,74,2,fiat 128
138,-1.218855,8,318.0,150.0,4457,13.5,74,1,dodge coronet custom (sw)
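
As a cross-check on the formulas in this section, the following minimal sketch computes the Z-Score by hand and compares it with `scipy.stats.zscore`. It assumes only a handful of mpg values copied from the sample rows above, not the full dataset, so the numbers differ from the table. The variable names (`mu`, `sigma`, `z_manual`) are illustrative.

```python
import numpy as np
from scipy.stats import zscore

# A few mpg values copied from the sample rows above (not the full dataset)
mpg = np.array([26.0, 19.0, 16.0, 15.0, 13.0])

# Mean: sum of the values divided by the number of values
mu = mpg.mean()

# Standard deviation with the 1/N form used in the formula above
sigma = np.sqrt(((mpg - mu) ** 2).mean())

# Z-Score: distance from the mean in units of standard deviation
z_manual = (mpg - mu) / sigma

# scipy's zscore uses the same 1/N form by default (ddof=0)
z_scipy = zscore(mpg)

print(z_manual)
print(np.allclose(z_manual, z_scipy))
```

Note that `pandas.Series.std()` defaults to the 1/(N-1) form (`ddof=1`), so a Z-Score built from it will differ slightly from the one computed here.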


## Encoding Categorical Values as Dummies

The traditional means of encoding categorical values is to make them dummy variables. This technique is also called one-hot encoding. Consider the following data set.


In [7]:
url02 = "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv"
df = pd.read_csv(url02, na_values=["NA", "?"],)

pd.set_option("display.max_columns", 0)
pd.set_option("display.max_rows", 0)

df.sample(10, random_state=43)

Unnamed: 0,id,job,area,income,aspect,subscriptions,dist_healthy,save_rate,dist_unhealthy,age,pop_dense,retail_dense,crime,product
1480,1481,e2,c,88671.0,42.566667,1,28.031496,69,26.208515,44,0.850394,0.610236,0.361355,d
1312,1313,kd,c,54595.0,17.433333,1,5.644953,45,7.371364,50,0.901575,0.440945,0.302802,b
305,306,pz,a,51088.0,7.25,1,4.448103,5,12.593227,46,0.834646,0.511811,0.211488,b
1545,1546,qp,c,65151.0,40.941667,1,24.196135,100,10.18519,42,0.846457,0.661417,0.3601,c
1716,1717,f8,c,67807.0,32.491667,1,13.778096,84,23.506815,42,0.968504,0.791339,0.231563,c
1420,1421,pz,a,52682.0,38.991667,2,3.822477,27,26.940003,38,0.818898,0.748031,0.680887,b
1314,1315,de,a,,11.583333,0,11.792412,65,23.613601,48,0.905512,0.551181,0.747107,b
795,796,kw,c,52717.0,32.491667,1,4.312097,17,14.734298,42,0.968504,0.799213,0.312282,b
817,818,qp,c,69508.0,33.683333,1,5.264137,12,18.434453,43,0.956693,0.751969,0.118082,c
1486,1487,pz,a,49742.0,22.525,2,3.632069,13,16.272025,48,0.968504,0.559055,0.777081,b


The _area_ column is not numeric, so you must encode it with one-hot encoding. We display the number of areas and individual values. There are just four values in the _area_ categorical variable in this case.


In [8]:
areas = list(df["area"].unique())
print(f"Number of areas: {len(areas)}")
print(f"Areas: {areas}")

Number of areas: 4
Areas: ['c', 'd', 'a', 'b']


There are four unique values in the _area_ column. To encode these as dummy variables, we would use four columns, each representing one of the areas. For each row, one column would have a value of one, the rest zeros. For this reason, this type of encoding is sometimes called one-hot encoding. The following code shows how you might encode the values "a" through "d." The value "a" becomes [1,0,0,0] and the value "b" becomes [0,1,0,0].


In [9]:
dummies = pd.get_dummies(["a", "b", "c", "d"], prefix="area",dtype=int)
print(dummies)

   area_a  area_b  area_c  area_d
0       1       0       0       0
1       0       1       0       0
2       0       0       1       0
3       0       0       0       1


We can now encode the actual column.


In [10]:
dummies = pd.get_dummies(df["area"], prefix="area",dtype=int)
print(dummies[0:10])  # Just show the first 10

   area_a  area_b  area_c  area_d
0       0       0       1       0
1       0       0       1       0
2       0       0       1       0
3       0       0       1       0
4       0       0       0       1
5       0       0       1       0
6       0       0       0       1
7       1       0       0       0
8       0       0       1       0
9       1       0       0       0


For the new dummy (one-hot encoded) values to be of any use, they must be merged back into the data set.


In [11]:
df = pd.concat([df, dummies], axis=1)

The following code displays the original _area_ column next to the dummy columns that were merged into the data frame.


In [13]:
pd.set_option("display.max_columns", 0)
pd.set_option("display.max_rows", 15)
df[["id", "job", "area", "income", "area_a", "area_b", "area_c", "area_d"]]

Unnamed: 0,id,job,area,income,area_a,area_b,area_c,area_d
0,1,vv,c,50876.0,0,0,1,0
1,2,kd,c,60369.0,0,0,1,0
2,3,pe,c,55126.0,0,0,1,0
3,4,11,c,51690.0,0,0,1,0
4,5,kl,d,28347.0,0,0,0,1
...,...,...,...,...,...,...,...,...
1995,1996,vv,c,51017.0,0,0,1,0
1996,1997,kl,d,26576.0,0,0,0,1
1997,1998,kl,d,28595.0,0,0,0,1
1998,1999,qp,c,67949.0,0,0,1,0


Usually, you will remove the original column _area_ because the goal is to get the data frame to be entirely numeric for the neural network.


In [14]:
pd.set_option("display.max_columns", 0)
pd.set_option("display.max_rows", 5)

df.drop("area", axis=1, inplace=True)
display(df[["id", "job", "income", "area_a", "area_b", "area_c", "area_d"]])

Unnamed: 0,id,job,income,area_a,area_b,area_c,area_d
0,1,vv,50876.0,0,0,1,0
1,2,kd,60369.0,0,0,1,0
...,...,...,...,...,...,...,...
1998,1999,qp,67949.0,0,0,1,0
1999,2000,pe,61467.0,0,0,1,0
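
The encode, merge, and drop steps above can also be combined into one call by passing the whole data frame to `pd.get_dummies` with its `columns` parameter. The following is a minimal sketch on a small hypothetical data frame (`toy`, with made-up rows) rather than the course dataset.

```python
import pandas as pd

# Small hypothetical data frame standing in for the course dataset
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "area": ["c", "d", "a", "b"],
        "income": [50876.0, 28347.0, 51088.0, 54595.0],
    }
)

# Encode "area", append the dummy columns, and drop the original column
encoded = pd.get_dummies(toy, columns=["area"], prefix="area", dtype=int)

encoded_columns = encoded.columns.tolist()
print(encoded_columns)
print(encoded)
```

The step-by-step version is still useful when you want to inspect the dummy columns before deciding whether to keep the original column.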


### Removing the First Level

The **pd.get_dummies** function also includes a parameter named _drop_first_, which specifies whether to get k-1 dummies out of k categorical levels by removing the first level. Why would you want to remove the first level, in this case, _area_a_? This technique provides a more efficient encoding by using the ordinarily unused encoding of [0,0,0]. We encode the _area_ to just three columns and map the categorical value of _a_ to [0,0,0]. The following code demonstrates this technique.


In [15]:
dummies = pd.get_dummies(["a", "b", "c", "d"], prefix="area", drop_first=True, dtype=int)
print(dummies)

   area_b  area_c  area_d
0       0       0       0
1       1       0       0
2       0       1       0
3       0       0       1


As you can see from the above data, the _area_a_ column is missing, as **get_dummies** replaced it with the encoding [0,0,0]. The following code shows how to apply this technique to a dataframe.


In [18]:
[col for col in df.columns if col.startswith('area_')]

['area_a', 'area_b', 'area_c', 'area_d']

In [19]:
# Read the dataset
df = pd.read_csv(url02, na_values=["NA", "?"],)

# encode the area column as dummy variables
dummies = pd.get_dummies(df["area"], drop_first=True, prefix="area", dtype=int)
df = pd.concat([df, dummies], axis=1)
cols = [col for col in df.columns if col.startswith('area_')]

# display the encoded dataframe
pd.set_option("display.max_columns", 0)
pd.set_option("display.max_rows", 10)

display(df[cols])

Unnamed: 0,area_b,area_c,area_d
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,0,1
...,...,...,...
1995,0,1,0
1996,0,0,1
1997,0,0,1
1998,0,1,0


Let us create a function that we may later use in our code.

In [23]:
def create_dummies_var(df, col_name):
    """
    Creates dummy variables for a specified column and concatenates them to a copy of the DataFrame.
    The first level is dropped, so k levels produce k-1 dummy columns.

    Args:
    df (pd.DataFrame): The original DataFrame (left unmodified).
    col_name (str): The name of the column to encode; also used as the prefix for the dummy columns.

    Returns:
    pd.DataFrame: A subset of the DataFrame containing only the dummy columns.
    """
    # Create dummy variables as 0/1 integers
    dummies = pd.get_dummies(df[col_name], drop_first=True, prefix=col_name, dtype=int)

    # Concatenate dummy variables to the original DataFrame
    df = pd.concat([df, dummies], axis=1)

    # Select columns that start with the prefix
    cols = [col for col in df.columns if col.startswith(col_name + '_')]

    # Return the DataFrame with only dummy columns
    return df.loc[:, cols]

# Read the dataset
df = pd.read_csv(url02, na_values=["NA", "?"],)

dummy_df = create_dummies_var(df, 'area')
dummy_df.sample(10, random_state=43)

Unnamed: 0,area_b,area_c,area_d
1480,0,1,0
1312,0,1,0
305,0,0,0
1545,0,1,0
1716,0,1,0
1420,0,0,0
1314,0,0,0
795,0,1,0
817,0,1,0
1486,0,0,0


## Target Encoding for Categoricals

Target encoding is a popular technique for Kaggle competitions. Target encoding can sometimes increase the predictive power of a machine learning model. However, it also dramatically increases the risk of overfitting. Because of this risk, you must take care when using this method.

Generally, target encoding can only be used on a categorical feature when the output of the machine learning model is numeric (regression).

The concept of target encoding is straightforward. For each category, we calculate the average target value for that category. Then, to encode, we replace each categorical value with the average target value of its category. Unlike dummy variables, where you have a column for each category, target encoding needs only a single column. In this way, target encoding is more efficient than dummy variables.


A note on random numbers: the example below uses NumPy's newer `np.random.default_rng()` API rather than the older `np.random.rand()`. The two do not generate the same set of numbers, even when initialized with the same seed, because they use different underlying generators.

1. **`np.random.rand(10)`**: This function is part of NumPy's older random number generation module. It uses a global random state, which can be seeded using `np.random.seed()`. Relying on a global state makes reproducibility and parallel processing harder, so this approach is not recommended for new code.

2. **`rng = np.random.default_rng(seed)` and `rng.random(10)`**: The `default_rng` function is part of NumPy's newer approach to random number generation, introduced in version 1.17. It returns a generator instance (`rng` in this case) that carries its own state, which allows for better reproducibility and is suitable for parallel processing.

If you need to reproduce results from older code that uses `np.random.rand()`, continue using `np.random.seed()` and `np.random.rand()`. If you are developing new code, use `np.random.default_rng()`.

In [24]:
import numpy as np

# Create a random number generator with a fixed seed
rng = np.random.default_rng(seed=43)

# Create a small sample dataset using the rng object
df = pd.DataFrame({
    "cont_9": rng.random(10) * 100,  # Using rng to generate random numbers
    "cat_0": ["dog"] * 5 + ["cat"] * 5,
    "cat_1": ["wolf"] * 9 + ["tiger"] * 1,
    "y": [1, 0, 1, 1, 1, 1, 0, 0, 0, 0],
})

pd.set_option("display.max_columns", 0)
pd.set_option("display.max_rows", 0)
display(df)


Unnamed: 0,cont_9,cat_0,cat_1,y
0,65.229926,dog,wolf,1
1,4.377532,dog,wolf,0
2,2.002959,dog,wolf,1
3,83.921258,dog,wolf,1
4,58.714305,dog,wolf,1
5,22.470523,cat,wolf,1
6,75.179227,cat,wolf,0
7,26.36922,cat,wolf,0
8,41.997791,cat,wolf,0
9,45.103139,cat,tiger,0


Rather than creating dummy variables for "dog" and "cat," we would like to change them to a number. We could use 0 for a cat and 1 for a dog. However, we can encode more information than just that. A simple 0 or 1 would also only work when there are exactly two categories. Consider what the mean target value is for cat and dog.


In [28]:
means0 = (df.groupby("cat_0")["y"]
          .mean().to_dict())
means0

{'cat': 0.2, 'dog': 0.8}
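
These per-category means are already a basic, unsmoothed target encoding. The following minimal sketch rebuilds only the `cat_0` and `y` columns of the sample data in a hypothetical data frame (`toy`) and maps each category to its mean target value. The column name `cat_0_naive` is illustrative.

```python
import pandas as pd

# Same cat_0 and y values as the sample dataset above
toy = pd.DataFrame(
    {
        "cat_0": ["dog"] * 5 + ["cat"] * 5,
        "y": [1, 0, 1, 1, 1, 1, 0, 0, 0, 0],
    }
)

# Mean target value per category
category_means = toy.groupby("cat_0")["y"].mean()

# Replace each category with its mean target value (no smoothing yet)
toy["cat_0_naive"] = toy["cat_0"].map(category_means)
print(toy)
```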

The danger is that we are now using the target value ($y$) for training. This technique will potentially lead to overfitting. The possibility of overfitting is even greater if there are only a small number of rows in a particular category. To prevent this from happening, we use a weighting factor. The stronger the weight, the more categories with fewer values will tend towards the overall average of $y$. You can calculate this overall average as follows.


In [14]:
df["y"].mean()

0.5

You can implement target encoding as follows. For more information on Target Encoding, refer to the article ["Target Encoding Done the Right Way"](https://maxhalford.github.io/blog/target-encoding/), on which I based this code.


In [29]:
def calc_smooth_mean(df1, df2, cat_name, target, weight):
    # Compute the global mean from df1, the data frame passed to the function
    mean = df1[target].mean()

    # Compute the number of values and the mean of each group
    agg = df1.groupby(cat_name)[target].agg(["count", "mean"])
    counts = agg["count"]
    means = agg["mean"]

    # Compute the "smoothed" means
    smooth = (counts * means + weight * mean) / (counts + weight)

    # Replace each value by the corresponding smoothed mean
    if df2 is None:
        return df1[cat_name].map(smooth)
    else:
        return df1[cat_name].map(smooth), df2[cat_name].map(smooth.to_dict())

The function `calc_smooth_mean` implements smoothed target (mean) encoding. It converts a categorical variable into numerical values that incorporate information about the target while reducing the risk of overfitting. It works as follows:

- **Parameters:** `df1` is the data frame used to compute the encoding, `df2` is an optional second data frame (or `None`) to encode with the same values, `cat_name` is the categorical column, `target` is the target column, and `weight` controls the amount of smoothing.
- **Global mean:** `df1[target].mean()` is the mean of the target column over all rows of `df1`.
- **Group by and aggregate:** `df1.groupby(cat_name)[target].agg(["count", "mean"])` computes, for each category, the count (number of occurrences) and the mean of the target. These are then extracted into `counts` and `means`.
- **Smoothed means:** `(counts * means + weight * mean) / (counts + weight)` is a weighted average of the category mean and the global mean. The `weight` parameter determines the balance between the two: the larger it is, the more each category is pulled towards the global mean.
- **Mapping:** If `df2` is `None`, the function uses `map` to replace each category in `df1` with its smoothed mean and returns the resulting series. If `df2` is provided, the function does the same for both data frames and returns two series. For `df2`, the smoothed means are converted to a dictionary before mapping.
The following code encodes these two categories.


In [30]:
WEIGHT = 5
df["cat_0_enc"] = calc_smooth_mean(
    df1=df, df2=None, cat_name="cat_0", target="y", weight=WEIGHT
)
df["cat_1_enc"] = calc_smooth_mean(
    df1=df, df2=None, cat_name="cat_1", target="y", weight=WEIGHT
)

pd.set_option("display.max_columns", 0)
pd.set_option("display.max_rows", 0)

display(df)

Unnamed: 0,cont_9,cat_0,cat_1,y,cat_0_enc,cat_1_enc
0,65.229926,dog,wolf,1,0.65,0.535714
1,4.377532,dog,wolf,0,0.65,0.535714
2,2.002959,dog,wolf,1,0.65,0.535714
3,83.921258,dog,wolf,1,0.65,0.535714
4,58.714305,dog,wolf,1,0.65,0.535714
5,22.470523,cat,wolf,1,0.35,0.535714
6,75.179227,cat,wolf,0,0.35,0.535714
7,26.36922,cat,wolf,0,0.35,0.535714
8,41.997791,cat,wolf,0,0.35,0.535714
9,45.103139,cat,tiger,0,0.35,0.416667
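
To see how the weight changes the encoding, the following minimal sketch applies the same smoothing formula to the "tiger" category (1 row, mean target 0) and the "wolf" category (9 rows, mean target 5/9) from the table above, with a global mean of 0.5. The helper `smooth_mean` is a hypothetical stand-alone version of the formula, not part of the original code.

```python
def smooth_mean(count, category_mean, global_mean, weight):
    """Weighted average of a category mean and the global mean."""
    return (count * category_mean + weight * global_mean) / (count + weight)


GLOBAL_MEAN = 0.5  # overall mean of y in the sample data

for weight in (0, 5, 100):
    tiger = smooth_mean(1, 0.0, GLOBAL_MEAN, weight)    # 1 row, mean 0
    wolf = smooth_mean(9, 5 / 9, GLOBAL_MEAN, weight)   # 9 rows, mean 5/9
    print(f"weight={weight:>3}: tiger={tiger:.6f}, wolf={wolf:.6f}")
```

With a weight of 0 each category keeps its raw mean. As the weight grows, both values move towards 0.5, and the single-row "tiger" category moves fastest because it has the least data behind it.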


## Encoding Categorical Values as Ordinal

Typically categoricals will be encoded as dummy variables. However, there might be other techniques to convert categoricals to numeric. Any time there is an order to the categoricals, a number should be used. Consider if you had a categorical that described the current education level of an individual.

- Kindergarten (0)
- First Grade (1)
- Second Grade (2)
- Third Grade (3)
- Fourth Grade (4)
- Fifth Grade (5)
- Sixth Grade (6)
- Seventh Grade (7)
- Eighth Grade (8)
- High School Freshman (9)
- High School Sophomore (10)
- High School Junior (11)
- High School Senior (12)
- College Freshman (13)
- College Sophomore (14)
- College Junior (15)
- College Senior (16)
- Graduate Student (17)
- PhD Candidate (18)
- Doctorate (19)
- Post Doctorate (20)

The above list has 21 levels and would take 21 dummy variables to encode. However, simply encoding this to dummies would lose the order information. Perhaps the most straightforward approach would be to simply number them and assign the category a single number equal to the value in the parentheses above. However, we might be able to do even better. Graduate study likely takes more than a year, so you might increase the value for that level by more than one.
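
The numbering approach can be sketched with a dictionary and `Series.map`. The following example assumes a hypothetical `people` data frame and uses only a subset of the levels listed above, with the numbers taken from the parentheses.

```python
import pandas as pd

# Subset of the levels listed above, with the numbers from the parentheses
education_codes = {
    "Kindergarten": 0,
    "High School Senior": 12,
    "College Senior": 16,
    "Graduate Student": 17,
    "Doctorate": 19,
}

# Hypothetical rows to encode
people = pd.DataFrame(
    {"education": ["College Senior", "Kindergarten", "Doctorate", "High School Senior"]}
)

# Replace each level with its single ordinal number
people["education_code"] = people["education"].map(education_codes)
print(people)
```

To reflect that some steps take longer than a year, you could edit the numbers in the dictionary without changing the rest of the code.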

## High Cardinality Categorical

If there are many categories, perhaps thousands or tens of thousands, then one-hot encoding is no longer a good choice. We call these cases high cardinality categoricals. We generally encode such values with an embedding layer, which we will discuss later when introducing natural language processing (NLP).
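
An embedding layer typically takes one integer index per category instead of a one-hot vector. As a first step in that direction, the following minimal sketch uses `pd.factorize` to turn a categorical column into integer indices. The `jobs` series is a small hypothetical example, and the embedding layer itself is not shown here.

```python
import pandas as pd

# Hypothetical categorical values; a real high-cardinality column
# would have thousands of distinct values
jobs = pd.Series(["vv", "kd", "vv", "pe", "kd"])

# One integer index per distinct value, in order of first appearance
codes, uniques = pd.factorize(jobs)

code_list = codes.tolist()
print(code_list)
print(f"Distinct categories: {len(uniques)}")
```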
