In [1]:
# When building a computational model, most of the effort goes not into writing the model code but into preprocessing and cleaning the input data. Neural networks are no exception. In fact, neural networks often require more preprocessing of input data than many other statistical and machine learning models. Because neural networks are very good at picking up patterns and trends in data, they can latch onto artifacts of raw, unprocessed data. When the data has many categorical values, or large gaps between numerical values, a neural network might treat these variables as less important (or more important) than they really are. As a result, it may ignore other variables that would provide more meaningful information to the model.

# For example, if a bank wanted to build a neural network model to identify whether a company is eligible for a loan, it might look at factors such as the company's net worth. If the bank's input dataset contained information from large Fortune 500 companies, such as Google and Facebook, as well as small mom-and-pop stores, the variability in net worth would be enormous. Without normalizing the input data, a neural network could treat net worth as a strong indicator of loan eligibility and, as a result, ignore all other factors, such as debt-to-income ratio, credit status, or requested loan amount. If instead net worth were normalized by a factor such as number of employees, the neural network would be more likely to weigh the other factors on a comparable footing with net worth. This would result in a model that assesses loan eligibility more fairly, without introducing any additional risk.

# In the next few pages, we'll look at different preprocessing steps that can prepare input data for training neural network models. By the end of this section, we'll no longer need to use dummy input data; we'll be able to apply neural network models to real datasets!

In [2]:
# For a neural network to understand and evaluate a categorical variable, we must preprocess the values using a technique called one-hot encoding. One-hot encoding identifies all unique column values and splits the single categorical column into a series of columns, each containing information about a single unique categorical value.

In [3]:
# For each new column, every data point is checked: if its categorical value matches the column's value, the entry is 1; otherwise it is 0. This binary encoding puts every categorical value on the same 0-to-1 scale, so no value looks larger or more important than another merely because of how it was labeled. As a result, the neural network will interpret each value individually and give each categorical value an independent weight.
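
# The encoding described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the only way to do it: the `countries` column and the `one_hot_encode` helper are hypothetical names invented for this example. In practice, library functions such as pandas `get_dummies` or scikit-learn's `OneHotEncoder` perform the same transformation.

```python
# Hypothetical toy column; the values are made up for illustration.
countries = ["USA", "Canada", "Mexico", "USA"]


def one_hot_encode(values):
    """Return (categories, rows), where each row holds one 0/1 flag per category."""
    categories = sorted(set(values))  # unique values, in a stable order
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


categories, encoded = one_hot_encode(countries)
print(categories)
for row in encoded:
    print(row)
```

# Each row contains exactly one 1, in the position of that data point's category, so the single `countries` column becomes three binary columns.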

# Although one-hot encoding is a very robust solution, it can be very memory-intensive, because every unique value adds a new column. Categorical variables with a large number of unique values (or very long columns with only a few unique values) might therefore become difficult to navigate or filter once encoded. To address this issue, we can reduce the number of unique values in the categorical variables. The process of reducing the number of unique categorical values in a dataset is known as bucketing or binning. Bucketing data typically follows one of two approaches:

# 1. Collapse all of the infrequent and rare categorical values into a single "other" category.
# 2. Create generalized categorical values and reassign all data points to the new corresponding values.

# The first bucketing approach takes advantage of the fact that uncommon categories and "edge cases" are rarely statistically significant. Therefore, regression and classification models are unlikely to be able to use rare categorical values to produce robust models, and instead will ignore the rare events altogether and focus on more informative values.
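
# A minimal sketch of the first bucketing approach, using only the standard library. The `industries` column, the `bucket_rare` helper, and the cutoff of two occurrences are assumptions made up for this example; a sensible cutoff depends on the dataset.

```python
from collections import Counter

# Hypothetical toy column; values and the cutoff are made up for illustration.
industries = ["Retail"] * 5 + ["Tech"] * 4 + ["Mining", "Fishing", "Forestry"]


def bucket_rare(values, min_count, other_label="Other"):
    """Replace every value seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [
        value if counts[value] >= min_count else other_label
        for value in values
    ]


bucketed = bucket_rare(industries, min_count=2)
print(Counter(bucketed))
```

# The three categories that appear only once are merged into a single "Other" value, so one-hot encoding this column would produce three columns instead of five.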

# The second bucketing approach collapses the number of unique categorical values while maintaining their relative order and magnitude, so that the machine learning model can train on the categorical variable with minimal impact on performance. This approach is particularly useful when dealing with a categorical variable whose distribution of unique values is relatively even. Once we have bucketed our categorical variables, we can proceed to transform them using one-hot encoding.
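
# A minimal sketch of the second bucketing approach followed by one-hot encoding, again in plain Python. The company-size bands, the `SIZE_BUCKETS` mapping, and the three generalized labels are all hypothetical, chosen only to illustrate the idea of reassigning data points to broader categories while keeping their relative order.

```python
# Hypothetical mapping from fine-grained company-size bands to broader ones.
# The bands and the grouping are assumptions chosen for illustration.
SIZE_BUCKETS = {
    "1-10": "Small",
    "11-50": "Small",
    "51-200": "Medium",
    "201-500": "Medium",
    "501-1000": "Large",
    "1001+": "Large",
}
BUCKET_ORDER = ["Small", "Medium", "Large"]  # preserves the relative order

company_sizes = ["1-10", "501-1000", "51-200", "11-50", "1001+", "201-500"]

# Reassign every data point to its generalized category.
bucketed = [SIZE_BUCKETS[size] for size in company_sizes]

# One-hot encode the bucketed column: one 0/1 flag per generalized category.
encoded = [
    [1 if bucket == name else 0 for name in BUCKET_ORDER]
    for bucket in bucketed
]

for size, bucket, row in zip(company_sizes, bucketed, encoded):
    print(size, bucket, row)
```

# Six original categories collapse into three, so the encoded data needs three columns rather than six.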