<a href="https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/t81_558_class_03_3_feature_encode.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# T81-558: Applications of Deep Neural Networks

**Module 3: Introduction to PyTorch**

- Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
- For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).


# Module 3 Material

- Part 3.1: Deep Learning and Neural Network Introduction [[Video]](https://www.youtube.com/watch?v=d-rU5IuFqLs&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_03_1_neural_net.ipynb)
- Part 3.2: Introduction to PyTorch [[Video]](https://www.youtube.com/watch?v=Pf-rrhMolm0&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_03_2_pytorch.ipynb)
- **Part 3.3: Encoding a Feature Vector for PyTorch Deep Learning** [[Video]](https://www.youtube.com/watch?v=7SGPm2tIT58&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_03_3_feature_encode.ipynb)
- Part 3.4: Early Stopping and Network Persistence [[Video]](https://www.youtube.com/watch?v=lS0vvIWiahU&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_03_4_early_stop.ipynb)
- Part 3.5: Sequences vs Classes in PyTorch [[Video]](https://www.youtube.com/watch?v=NOu8jMZx3LY&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_03_5_pytorch_class_sequence.ipynb)

# Part 3.3: Encoding a Feature Vector for PyTorch Deep Learning

Neural networks can accept many types of data. We will begin with tabular data, where there are well-defined rows and columns. This data is what you would typically see in Microsoft Excel. Neural networks require numeric input. This numeric form is called a feature vector. Each input neuron receives one feature (or column) from this vector. Each row of training data typically becomes one vector. This section shows how to encode the tabular data below into a feature vector.


In [1]:
import pandas as pd
url = "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv"

In [2]:
df = pd.read_csv(
    url,
    na_values=["NA", "?"],
)
df

Unnamed: 0,id,job,area,income,aspect,subscriptions,dist_healthy,save_rate,dist_unhealthy,age,pop_dense,retail_dense,crime,product
0,1,vv,c,50876.0,13.100000,1,9.017895,35,11.738935,49,0.885827,0.492126,0.071100,b
1,2,kd,c,60369.0,18.625000,2,7.766643,59,6.805396,51,0.874016,0.342520,0.400809,c
2,3,pe,c,55126.0,34.766667,1,3.632069,6,13.671772,44,0.944882,0.724409,0.207723,b
3,4,11,c,51690.0,15.808333,1,5.372942,16,4.333286,50,0.889764,0.444882,0.361216,b
4,5,kl,d,28347.0,40.941667,3,3.822477,20,5.967121,38,0.744094,0.661417,0.068033,a
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,1996,vv,c,51017.0,38.233333,1,5.454545,34,14.013489,41,0.881890,0.744094,0.104838,b
1996,1997,kl,d,26576.0,33.358333,2,3.632069,20,8.380497,38,0.944882,0.877953,0.063851,a
1997,1998,kl,d,28595.0,39.425000,3,7.168218,99,4.626950,36,0.759843,0.744094,0.098703,f
1998,1999,qp,c,67949.0,5.733333,0,8.936292,26,3.281439,46,0.909449,0.598425,0.117803,c


You can make the following observations from the above data:

- The target column is the column that you seek to predict. There are several candidates here. However, we will initially use the column "product". This field specifies what product someone bought.
- There is an ID column. You should exclude this column because it contains no information useful for prediction.
- Many of these fields are numeric and might not require further processing.
- The income column does have some missing values.
- There are categorical values: job, area, and product.
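
These observations can also be checked programmatically. The following is a minimal sketch on a small made-up frame (the `toy` data is hypothetical and only mimics a few columns of the dataset), assuming missing values load as NaN and categorical columns load as text.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; values are made up for illustration.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "job": ["vv", "kd", "pe", "vv"],
        "area": ["c", "c", "d", "a"],
        "income": [50876.0, np.nan, 55126.0, 28347.0],
        "product": ["b", "c", "b", "a"],
    }
)

# Count missing values per column and keep only columns that have any.
missing = toy.isna().sum()
print(missing[missing > 0])

# Non-numeric columns are the categorical candidates that need encoding.
categorical = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
print(categorical)
```

Running the same two checks on the full data frame would confirm which columns need imputation and which need dummy variables.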

To begin with, we will convert the job code into dummy variables.


In [3]:
dummies = pd.get_dummies(df["job"], prefix="job")
print(dummies.shape)

dummies

(2000, 33)


Unnamed: 0,job_11,job_al,job_am,job_ax,job_bf,job_by,job_cv,job_de,job_dz,job_e2,...,job_pe,job_po,job_pq,job_pz,job_qp,job_qw,job_rn,job_sa,job_vv,job_zz
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
3,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
1996,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1997,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1998,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False


Because there are 33 different job codes, there are 33 dummy variables. We also specified a prefix because the job codes (such as "ax") are not that meaningful by themselves. Something such as "job_ax" also tells us the origin of this field.
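
To see the relationship between distinct codes and dummy columns on a smaller scale, here is a minimal sketch using a made-up series of job codes. The `dtype=float` argument is an optional choice not used in this notebook; it makes `get_dummies` emit 0.0/1.0 values instead of booleans.

```python
import pandas as pd

# Hypothetical job codes, made up for illustration.
jobs = pd.Series(["vv", "kd", "vv", "ax"], name="job")

# One column per distinct code; dtype=float yields 0.0/1.0 instead of booleans.
toy_dummies = pd.get_dummies(jobs, prefix="job", dtype=float)

print(list(toy_dummies.columns))
print(toy_dummies.shape)
```

Three distinct codes produce three columns, and each row has exactly one column set.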

Next, we must merge these dummies back into the main data frame. We also drop the original "job" field, as the dummies now represent it.


In [4]:
# Append the dummy columns, then drop the original "job" column they replace
df = pd.concat([df, dummies], axis=1)
df.drop("job", axis=1, inplace=True)

# Limit how much of the wide data frame is displayed
pd.set_option("display.max_columns", 9)
pd.set_option("display.max_rows", 10)

df

Unnamed: 0,id,area,income,aspect,...,job_rn,job_sa,job_vv,job_zz
0,1,c,50876.0,13.100000,...,False,False,True,False
1,2,c,60369.0,18.625000,...,False,False,False,False
2,3,c,55126.0,34.766667,...,False,False,False,False
3,4,c,51690.0,15.808333,...,False,False,False,False
4,5,d,28347.0,40.941667,...,False,False,False,False
...,...,...,...,...,...,...,...,...,...
1995,1996,c,51017.0,38.233333,...,False,False,True,False
1996,1997,d,26576.0,33.358333,...,False,False,False,False
1997,1998,d,28595.0,39.425000,...,False,False,False,False
1998,1999,c,67949.0,5.733333,...,False,False,False,False


We also introduce dummy variables for the area column.


In [5]:
# Append dummy columns for "area", then drop the original column
df = pd.concat([df, pd.get_dummies(df["area"], prefix="area")], axis=1)
df.drop("area", axis=1, inplace=True)

# Limit how much of the wide data frame is displayed
pd.set_option("display.max_columns", 9)
pd.set_option("display.max_rows", 10)
display(df)

Unnamed: 0,id,income,aspect,subscriptions,...,area_a,area_b,area_c,area_d
0,1,50876.0,13.100000,1,...,False,False,True,False
1,2,60369.0,18.625000,2,...,False,False,True,False
2,3,55126.0,34.766667,1,...,False,False,True,False
3,4,51690.0,15.808333,1,...,False,False,True,False
4,5,28347.0,40.941667,3,...,False,False,False,True
...,...,...,...,...,...,...,...,...,...
1995,1996,51017.0,38.233333,1,...,False,False,True,False
1996,1997,26576.0,33.358333,2,...,False,False,False,True
1997,1998,28595.0,39.425000,3,...,False,False,False,True
1998,1999,67949.0,5.733333,0,...,False,False,True,False


The last remaining transformation is to fill in missing income values.


In [6]:
med = df["income"].median()
df["income"] = df["income"].fillna(med)

There are more advanced ways of filling in missing values, but they require more analysis. The idea would be to see if another field might hint at what the income was. For example, it might be beneficial to calculate a median income for each area or job category. This technique is something to keep in mind for the class Kaggle competition.
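
As one possible version of the per-category idea, the sketch below fills each missing income with the median of its own area. The `toy` frame is hypothetical, and the approach assumes the grouping column is still present, so on the real data it would have to run before **area** is replaced by dummy variables.

```python
import numpy as np
import pandas as pd

# Hypothetical data, made up for illustration.
toy = pd.DataFrame(
    {
        "area": ["c", "c", "c", "d", "d", "d"],
        "income": [50000.0, np.nan, 60000.0, 28000.0, 26000.0, np.nan],
    }
)

# Median income within each area, aligned to the original rows.
area_median = toy.groupby("area")["income"].transform("median")

# Use the area median where income is missing; fall back to the overall
# median in case an entire area has no known income.
toy["income"] = toy["income"].fillna(area_median).fillna(toy["income"].median())
print(toy)
```

Whether a group-wise median actually improves the model depends on how strongly the grouping column relates to income, which is the extra analysis mentioned above.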

At this point, the Pandas dataframe is ready to be converted to NumPy for neural network training. We need a list of the columns that will make up _x_ (the predictors or inputs) and _y_ (the target).

The complete list of columns is:


In [7]:
print(list(df.columns))

['id', 'income', 'aspect', 'subscriptions', 'dist_healthy', 'save_rate', 'dist_unhealthy', 'age', 'pop_dense', 'retail_dense', 'crime', 'product', 'job_11', 'job_al', 'job_am', 'job_ax', 'job_bf', 'job_by', 'job_cv', 'job_de', 'job_dz', 'job_e2', 'job_f8', 'job_gj', 'job_gv', 'job_kd', 'job_ke', 'job_kl', 'job_kp', 'job_ks', 'job_kw', 'job_mm', 'job_nb', 'job_nn', 'job_ob', 'job_pe', 'job_po', 'job_pq', 'job_pz', 'job_qp', 'job_qw', 'job_rn', 'job_sa', 'job_vv', 'job_zz', 'area_a', 'area_b', 'area_c', 'area_d']


This data includes both the target and predictors. We need a list with the target removed. We also remove **id** because it is not useful for prediction.


In [8]:
x_columns = df.columns.drop(["product", "id"])
print(list(x_columns))

['income', 'aspect', 'subscriptions', 'dist_healthy', 'save_rate', 'dist_unhealthy', 'age', 'pop_dense', 'retail_dense', 'crime', 'job_11', 'job_al', 'job_am', 'job_ax', 'job_bf', 'job_by', 'job_cv', 'job_de', 'job_dz', 'job_e2', 'job_f8', 'job_gj', 'job_gv', 'job_kd', 'job_ke', 'job_kl', 'job_kp', 'job_ks', 'job_kw', 'job_mm', 'job_nb', 'job_nn', 'job_ob', 'job_pe', 'job_po', 'job_pq', 'job_pz', 'job_qp', 'job_qw', 'job_rn', 'job_sa', 'job_vv', 'job_zz', 'area_a', 'area_b', 'area_c', 'area_d']


## Generate X and Y for a Classification Neural Network

We can now generate _x_ and _y_. Note that this is how we generate _y_ for a classification problem, where the target is categorical and must be encoded. Regression would use the numeric value of the target directly, without encoding.


In [9]:
from sklearn import preprocessing

# Convert to numpy - Classification
x_columns = df.columns.drop("product").drop("id")
x = df[x_columns].values
le = preprocessing.LabelEncoder()
y = le.fit_transform(df["product"])
products = le.classes_
# y holds one integer class index per row; products maps each index back to its label

We can display the _x_ matrix and the _y_ vector.


In [10]:
print(x)
print(y)

[[50876.0 13.1 1 ... False True False]
 [60369.0 18.625 2 ... False True False]
 [55126.0 34.766666666666666 1 ... False True False]
 ...
 [28595.0 39.425 3 ... False False True]
 [67949.0 5.733333333333333 0 ... False True False]
 [61467.0 16.891666666666666 0 ... False True False]]

The first array is _x_, with one row per record. The second `print` shows _y_ as a one-dimensional array with one integer class index per record. Note that the `dummies` variable created earlier holds the job dummy columns, which are predictors, so it must not be used as the target.


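The _x_ and _y_ values are now nearly ready for a neural network. Because _x_ mixes floating-point and boolean columns, cast it to a single numeric type (such as float32) before creating tensors. Make sure that you construct the neural network for a classification problem. Specifically:

- Classification neural networks have an output neuron count equal to the number of classes.
- In PyTorch, classification neural networks typically use **nn.CrossEntropyLoss**. This loss applies **softmax** internally (as log-softmax), so the output layer should emit raw scores (logits) rather than end in a softmax activation.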
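
The two points above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the course's reference network: the arrays are small made-up stand-ins for _x_ and _y_, the hidden layer size of 8 is arbitrary, and the targets are assumed to be integer class indices. In PyTorch, `nn.CrossEntropyLoss` applies log-softmax internally, so the model ends in a plain linear layer.

```python
import numpy as np
import torch
from torch import nn

# Hypothetical stand-ins for x and y: 4 rows, 3 features, 3 classes.
# dtype=object mimics the mixed float/bool array that df[...].values can return.
x = np.array(
    [[50876.0, 13.1, True], [60369.0, 18.6, False],
     [55126.0, 34.8, False], [28347.0, 40.9, True]],
    dtype=object,
)
y = np.array([1, 2, 1, 0])  # integer class indices, as a label encoder would produce

# Tensors need a single numeric dtype: float32 inputs, int64 class targets.
x_t = torch.tensor(x.astype(np.float32))
y_t = torch.tensor(y, dtype=torch.long)

# The output layer has one neuron per class and emits raw logits (no softmax layer).
num_classes = 3
model = nn.Sequential(nn.Linear(x_t.shape[1], 8), nn.ReLU(), nn.Linear(8, num_classes))
loss_fn = nn.CrossEntropyLoss()  # applies log-softmax internally

logits = model(x_t)
loss = loss_fn(logits, y_t)
print(logits.shape, loss.item())
```

The inputs here are left unscaled to keep the sketch short; in practice, columns such as income are usually normalized before training.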

## Generate X and Y for a Regression Neural Network

The program generates the _x_ values the same way for a regression neural network. However, _y_ needs no encoding; it holds the numeric target values directly. Make sure to replace **income** with your actual target, and exclude that column from _x_.


In [11]:
y = df["income"].values
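
A slightly fuller sketch of the regression case follows, using a hypothetical `toy` frame and treating **income** as the target purely for illustration. It assumes a network with a single output neuron, which is why _y_ is reshaped to a column.

```python
import numpy as np
import pandas as pd

# Hypothetical encoded data, made up for illustration.
toy = pd.DataFrame(
    {
        "income": [50876.0, 60369.0, 28347.0],
        "age": [49, 51, 38],
        "area_c": [True, True, False],
        "area_d": [False, False, True],
    }
)

target = "income"  # replace with your actual target column

# Keep the target out of the predictors so the network cannot simply copy it.
x = toy.drop(columns=[target]).values.astype(np.float32)

# Regression target: raw numeric values as a float32 column vector.
y = toy[target].values.astype(np.float32).reshape(-1, 1)

print(x.shape, y.shape)
```

Shaping _y_ as one column per row keeps it aligned with a single-output network's predictions, which avoids accidental broadcasting when a loss such as mean squared error compares the two.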

# Module 3 Assignment

You can find the first assignment here: [assignment 3](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/assignments/assignment_yourname_class3.ipynb)
