# Lab 4: Data Imputation using an Autoencoder

In this lab, you will build and train an autoencoder to impute (or "fill in") missing data.

We will be using the
Adult Data Set provided by the UCI Machine Learning Repository [1], available
at https://archive.ics.uci.edu/ml/datasets/adult.
The data set contains census record files of adults, including their
age, marital status, the type of work they do, and other features.

Normally, people use this data set to build a supervised classification
model to classify whether a person is a high income earner.
We will not use the dataset for this original intended purpose.

Instead, we will perform the task of imputing (or "filling in") missing values in the dataset. For example,
we may be missing one person's marital status, and another person's age, and
a third person's level of education. Our model will predict the missing features
based on the information that we do have about each person.

We will use a variation of a denoising autoencoder to solve this data imputation
problem. Our autoencoder will be trained using inputs that have one categorical feature artificially
removed, and the goal of the autoencoder is to correctly reconstruct all features,
including the one removed from the input.

In the process, you are expected to learn to:

1. Clean and process continuous and categorical data for machine learning.
2. Implement an autoencoder that takes continuous and categorical (one-hot) inputs.
3. Tune the hyperparameters of an autoencoder.
4. Use baseline models to help interpret model performance.

In [None]:
import csv
import numpy as np
import random
import torch
import torch.utils.data

## Part 0

We will be using a package called `pandas` for this assignment.

If you are using Colab, `pandas` should already be available.
If you are using your own computer,
installation instructions for `pandas` are available here:
https://pandas.pydata.org/pandas-docs/stable/install.html

In [None]:
import pandas as pd

## Part 1. Data Cleaning [15 pt]

The adult.data file is available at `https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data`

The function `pd.read_csv` loads the adult.data file into a pandas dataframe.
You can read the pandas documentation for `pd.read_csv` at
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [None]:
header = ['age', 'work', 'fnlwgt', 'edu', 'yredu', 'marriage', 'occupation',
 'relationship', 'race', 'sex', 'capgain', 'caploss', 'workhr', 'country']
df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    names=header,
    index_col=False)

In [None]:
df.shape # there are 32561 rows (records) in the data frame, and 14 columns (features)

(32561, 14)

### Part (a) Continuous Features [3 pt]

For each of the columns `["age", "yredu", "capgain", "caploss", "workhr"]`, report the minimum, maximum, and average value across the dataset.

Then, normalize each of the features `["age", "yredu", "capgain", "caploss", "workhr"]`
so that their values are always between 0 and 1.
Make sure that you are actually modifying the dataframe `df`.
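
As a minimal sketch of min-max normalization, assuming a small hypothetical dataframe `toy` (not the lab data), each column can be shifted by its minimum and divided by its range. Assigning the result back to the same columns is what actually modifies the dataframe.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the formula.
toy = pd.DataFrame({"age": [20.0, 30.0, 60.0], "workhr": [10.0, 40.0, 50.0]})

cols = ["age", "workhr"]
# (x - min) / (max - min) maps the smallest value to 0 and the largest to 1.
toy[cols] = (toy[cols] - toy[cols].min()) / (toy[cols].max() - toy[cols].min())

print(toy)
```

The same expression works on any list of numeric columns, since pandas aligns the per-column minimum and maximum with the matching columns.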

Like numpy arrays and torch tensors,
pandas data frames can be sliced. For example, we can
display the first 3 rows of the data frame (3 records) below.

In [None]:
df[:3] # show the first 3 records

Unnamed: 0,age,work,fnlwgt,edu,yredu,marriage,occupation,relationship,race,sex,capgain,caploss,workhr,country
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States


Alternatively, we can slice based on column names,
for example `df["race"]`, `df["workhr"]`, or even index multiple columns
like below.

In [None]:
subdf = df[["age", "yredu", "capgain", "caploss", "workhr"]]
subdf[:3] # show the first 3 records

Unnamed: 0,age,yredu,capgain,caploss,workhr
0,39,13,2174,0,40
1,50,13,0,0,13
2,38,9,0,0,40


Numpy works nicely with pandas, like below:

In [None]:
np.sum(subdf["caploss"])

2842700

Just like numpy arrays, you can modify
entire columns of data rather than one scalar element at a time.
For example, the code  

`df["age"] = df["age"] + 1`

would increment everyone's age by 1.

In [None]:
print("Before normalizing the features:")
print("The minimum age is", (df["age"]).min(), ", the maximum age is", (df["age"]).max(), ", and the average age is", (df["age"]).mean(),".")
print("The minimum yredu is", (df["yredu"]).min(), ", the maximum yredu is", (df["yredu"]).max(), ", and the average yredu is", (df["yredu"]).mean(),".")
print("The minimum capgain is", (df["capgain"]).min(), ", the maximum capgain is", (df["capgain"]).max(), ", and the average capgain is", (df["capgain"]).mean(),".")
print("The minimum caploss is", (df["caploss"]).min(), ", the maximum caploss is", (df["caploss"]).max(), ", and the average caploss is", (df["caploss"]).mean(),".")
print("The minimum workhr is", (df["workhr"]).min(), ", the maximum workhr is", (df["workhr"]).max(), ", and the average workhr is", (df["workhr"]).mean(),".\n")
print(df[["age", "yredu", "capgain", "caploss", "workhr"]])


Before normalizing the features:
The minimum age is 17 , the maximum age is 90 , and the average age is 38.58164675532078 .
The minimum yredu is 1 , the maximum yredu is 16 , and the average yredu is 10.0806793403151 .
The minimum capgain is 0 , the maximum capgain is 99999 , and the average capgain is 1077.6488437087312 .
The minimum caploss is 0 , the maximum caploss is 4356 , and the average caploss is 87.303829734959 .
The minimum workhr is 1 , the maximum workhr is 99 , and the average workhr is 40.437455852092995 .

       age  yredu  capgain  caploss  workhr
0       39     13     2174        0      40
1       50     13        0        0      13
2       38      9        0        0      40
3       53      7        0        0      40
4       28     13        0        0      40
...    ...    ...      ...      ...     ...
32556   27     12        0        0      38
32557   40      9        0        0      40
32558   58      9        0        0      40
32559   22      9        0      

In [None]:
# Normalizing
df["age"] = (df["age"]-(df["age"]).min()) / ((df["age"]).max() - (df["age"]).min())
df["yredu"] = (df["yredu"]-(df["yredu"]).min()) / ((df["yredu"]).max() - (df["yredu"]).min())
df["capgain"] = (df["capgain"]-(df["capgain"]).min()) / ((df["capgain"]).max() - (df["capgain"]).min())
df["caploss"] = (df["caploss"]-(df["caploss"]).min()) / ((df["caploss"]).max() - (df["caploss"]).min())
df["workhr"] = (df["workhr"]-(df["workhr"]).min()) / ((df["workhr"]).max() - (df["workhr"]).min())

print("After normalizing the features:")
print("The minimum age is", (df["age"]).min(), ", the maximum age is", (df["age"]).max(), ", and the average age is", (df["age"]).mean(),".")
print("The minimum yredu is", (df["yredu"]).min(), ", the maximum yredu is", (df["yredu"]).max(), ", and the average yredu is", (df["yredu"]).mean(),".")
print("The minimum capgain is", (df["capgain"]).min(), ", the maximum capgain is", (df["capgain"]).max(), ", and the average capgain is", (df["capgain"]).mean(),".")
print("The minimum caploss is", (df["caploss"]).min(), ", the maximum caploss is", (df["caploss"]).max(), ", and the average caploss is", (df["caploss"]).mean(),".")
print("The minimum workhr is", (df["workhr"]).min(), ", the maximum workhr is", (df["workhr"]).max(), ", and the average workhr is", (df["workhr"]).mean(),".\n")
print(df[["age", "yredu", "capgain", "caploss", "workhr"]])

After normalizing the features:
The minimum age is 0.0 , the maximum age is 1.0 , and the average age is 0.2956389966482299 .
The minimum yredu is 0.0 , the maximum yredu is 1.0 , and the average yredu is 0.6053786226876733 .
The minimum capgain is 0.0 , the maximum capgain is 1.0 , and the average capgain is 0.010776596203049342 .
The minimum caploss is 0.0 , the maximum caploss is 1.0 , and the average caploss is 0.020042201500220156 .
The minimum workhr is 0.0 , the maximum workhr is 1.0 , and the average workhr is 0.40242301889890814 .

            age     yredu   capgain  caploss    workhr
0      0.301370  0.800000  0.021740      0.0  0.397959
1      0.452055  0.800000  0.000000      0.0  0.122449
2      0.287671  0.533333  0.000000      0.0  0.397959
3      0.493151  0.400000  0.000000      0.0  0.397959
4      0.150685  0.800000  0.000000      0.0  0.397959
...         ...       ...       ...      ...       ...
32556  0.136986  0.733333  0.000000      0.0  0.377551
32557  0.3150

### Part (b) Categorical Features [1 pt]

What percentage of people in our data set are male? Note that the data labels all have an unfortunate space at the beginning, e.g. " Male" instead of "Male".

What percentage of people in our data set are female?
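
One possible way to handle the leading space is to strip it before counting. The sketch below assumes a hypothetical toy column `sex` rather than the lab dataframe, and uses `value_counts(normalize=True)` to get fractions directly.

```python
import pandas as pd

# Hypothetical toy column mimicking the leading-space labels in the lab data.
sex = pd.Series([" Male", " Female", " Male", " Male"])

# Strip surrounding whitespace, then count each label as a fraction of all rows.
shares = sex.str.strip().value_counts(normalize=True) * 100

print(shares)
```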

In [None]:
# hint: you can do something like this in pandas
sum(df["sex"] == " Male")

percent_men = ((sum(df["sex"] == " Male"))*100)/len(df["sex"])
print("The percentage of people in our dataset that are male is:", percent_men, "%.")

percent_women = ((sum(df["sex"] == " Female"))*100)/len(df["sex"])
print("The percentage of people in our dataset that are female is:", percent_women, "%.")

The percentage of people in our dataset that are male is: 66.92054912318417 %.
The percentage of people in our dataset that are female is: 33.07945087681582 %.


### Part (c) [2 pt]

Before proceeding, we will modify our data frame in a couple more ways:

1. We will restrict ourselves to using a subset of the features (to simplify our autoencoder)
2. We will remove any records (rows) already containing missing values, and store them in a second dataframe. We will only use records without missing values to train our autoencoder.

Both of these steps are done for you, below.

How many records contained missing features? What percentage of records were removed?

In [None]:
contcols = ["age", "yredu", "capgain", "caploss", "workhr"]
catcols = ["work", "marriage", "occupation", "edu", "relationship", "sex"]
features = contcols + catcols
df = df[features]

In [None]:
missing = pd.concat([df[c] == " ?" for c in catcols], axis=1).any(axis=1)
df_with_missing = df[missing]
df_not_missing = df[~missing]
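
The mask above can be easier to follow on a toy example. The sketch below assumes a hypothetical two-column dataframe `toy` in which `" ?"` marks a missing value, as it does in the lab data.

```python
import pandas as pd

# Hypothetical toy records; " ?" marks a missing categorical value.
toy = pd.DataFrame({
    "work": [" Private", " ?", " State-gov"],
    "occupation": [" Sales", " ?", " ?"],
})

# One boolean column per feature, then True for any row with at least one " ?".
toy_missing = pd.concat([toy[c] == " ?" for c in toy.columns], axis=1).any(axis=1)

print(toy_missing.tolist())      # which rows contain a missing value
print(100 * toy_missing.mean())  # percentage of rows with a missing value
```

Because `True` counts as 1, the mean of the boolean mask is the fraction of rows that contain at least one missing value.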

In [None]:
print(len(df_with_missing),"records contained missing features.")
print((len(df_with_missing)*100/len(df)),"% of records contained missing features.")

1843 records contained missing features.
5.660145572924664 % of records contained missing features.


### Part (d) One-Hot Encoding [1 pt]

What are all the possible values of the feature "work" in `df_not_missing`? You may find the Python function `set` useful.

In [None]:
set(df_not_missing['work'])

{' Federal-gov',
 ' Local-gov',
 ' Private',
 ' Self-emp-inc',
 ' Self-emp-not-inc',
 ' State-gov',
 ' Without-pay'}

We will be using a one-hot encoding to represent each of the categorical variables.
Our autoencoder will be trained using these one-hot encodings.

We will use the pandas function `get_dummies` to produce one-hot encodings
for all of the categorical variables in `df_not_missing`.
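
As an illustration of what `get_dummies` does, the sketch below uses a hypothetical toy dataframe with one continuous and one categorical column. The explicit `dtype` argument is an addition for this example, since the default dtype of the indicator columns differs across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one continuous column and one categorical column.
toy = pd.DataFrame({"age": [0.3, 0.5, 0.1],
                    "sex": [" Male", " Female", " Male"]})

# Numeric columns pass through unchanged; each distinct string value
# becomes its own indicator column named "<feature>_<value>".
toy_onehot = pd.get_dummies(toy, dtype=np.float32)

print(list(toy_onehot.columns))
```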

In [None]:
data = pd.get_dummies(df_not_missing)

In [None]:
data[:3]

Unnamed: 0,age,yredu,capgain,caploss,workhr,work_ Federal-gov,work_ Local-gov,work_ Private,work_ Self-emp-inc,work_ Self-emp-not-inc,...,edu_ Prof-school,edu_ Some-college,relationship_ Husband,relationship_ Not-in-family,relationship_ Other-relative,relationship_ Own-child,relationship_ Unmarried,relationship_ Wife,sex_ Female,sex_ Male
0,0.30137,0.8,0.02174,0.0,0.397959,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
1,0.452055,0.8,0.0,0.0,0.122449,0,0,0,0,1,...,0,0,1,0,0,0,0,0,0,1
2,0.287671,0.533333,0.0,0.0,0.397959,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1


### Part (e) One-Hot Encoding [2 pt]

The dataframe `data` contains the cleaned and normalized data that we will use to train our denoising autoencoder.

How many **columns** (features) are in the dataframe `data`?

Briefly explain where that number comes from.

In [None]:
print("There are", len(data.columns), "columns (features) in the dataframe data.")
# get_dummies keeps the 5 continuous columns as they are and replaces each
# categorical column with one 0/1 column per possible value of that feature.
# The total is therefore 5 continuous columns plus 52 one-hot columns
# (one for every category value across the 6 categorical features).

There are 57 columns (features) in the dataframe data.


### Part (f) One-Hot Conversion [3 pt]

We will convert the pandas data frame `data` into numpy, so that
it can be further converted into a PyTorch tensor.
However, in doing so, we lose the column label information that
a pandas data frame automatically stores.

Complete the function `get_categorical_value` that will return
the named value of a feature given a one-hot embedding.
You may find the global variables `cat_index` and `cat_values`
useful. (Display them and figure out what they are first.)

We will need this function in the next part of the lab
to interpret our autoencoder outputs. So, the input
to our function `get_categorical_value` might not
actually be "one-hot" -- the input may instead
contain real-valued predictions from our neural network.
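
To illustrate the point about real-valued inputs, here is a minimal sketch that assumes a hypothetical list `values` for a 3-way feature (not the lab's `cat_values`). Since no entry is exactly 1, the largest score is taken as the predicted category.

```python
import numpy as np

# Hypothetical value list for a 3-way feature.
values = ["Low", "Medium", "High"]

# Network outputs are real-valued, so no entry is exactly 1.
scores = np.array([0.15, 0.05, 0.62])

# Pick the name whose score is largest.
decoded = values[int(np.argmax(scores))]
print(decoded)
```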

In [None]:
datanp = data.values.astype(np.float32)

In [None]:
cat_index = {}  # Mapping of feature -> start index of feature in a record
cat_values = {} # Mapping of feature -> list of categorical values the feature can take

# build up the cat_index and cat_values dictionary
for i, header in enumerate(data.keys()):
    if "_" in header: # categorical header
        feature, value = header.split()
        feature = feature[:-1] # remove the last char; it is always an underscore
        if feature not in cat_index:
            cat_index[feature] = i
            cat_values[feature] = [value]
        else:
            cat_values[feature].append(value)

def get_onehot(record, feature):
    """
    Return the portion of `record` that is the one-hot encoding
    of `feature`. For example, since the feature "work" is stored
    in the indices [5:12] in each record, calling `get_onehot(record, "work")`
    is equivalent to accessing `record[5:12]`.

    Args:
        - record: a numpy array representing one record, formatted
                  the same way as a row in `datanp`
        - feature: a string, should be an element of `catcols`
    """
    start_index = cat_index[feature]
    stop_index = cat_index[feature] + len(cat_values[feature])
    return record[start_index:stop_index]

def get_categorical_value(onehot, feature):
    """
    Return the categorical value name of a feature given
    a one-hot vector representing the feature.

    Args:
        - onehot: a numpy array one-hot representation of the feature
        - feature: a string, should be an element of `catcols`

    Examples:

    >>> get_categorical_value(np.array([0., 0., 0., 0., 0., 1., 0.]), "work")
    'State-gov'
    >>> get_categorical_value(np.array([0.1, 0., 1.1, 0.2, 0., 1., 0.]), "work")
    'Private'
    """
    # <----- TODO: WRITE YOUR CODE HERE ----->
    # You may find the variables `cat_index` and `cat_values`
    # (created above) useful.
    return cat_values[feature][np.argmax(onehot)]


In [None]:
# Testing it out
print(get_categorical_value(np.array([0., 0., 0., 0., 0., 0., 0.]), "work"))
print(get_categorical_value(np.array([0., 0., 0., 0., 0., 0., 1.]), "marriage"))
print(get_categorical_value(np.array([0., 0., 0., 0., 0., 0., 0.]), "occupation"))

Federal-gov
Widowed
Adm-clerical


In [None]:
# more useful code, used during training, that depends on the function
# you write above

def get_feature(record, feature):
    """
    Return the categorical feature value of a record
    """
    onehot = get_onehot(record, feature)
    return get_categorical_value(onehot, feature)

def get_features(record):
    """
    Return a dictionary of all categorical feature values of a record
    """
    return { f: get_feature(record, f) for f in catcols }

### Part (g) Train/Test Split [3 pt]

Randomly split the data into approximately 70% training, 15% validation and 15% test.

Report the number of items in your training, validation, and test set.
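
One possible way to do the split without shuffling the array in place is to shuffle a list of row indices instead. The sketch below assumes a hypothetical stand-in array `toy` and uses NumPy's `default_rng`, so the variable names and the seed are illustrative only.

```python
import numpy as np

# Hypothetical stand-in for the lab array: 20 records with 3 features each.
toy = np.arange(60, dtype=np.float32).reshape(20, 3)

rng = np.random.default_rng(50)        # seeded generator for reproducibility
order = rng.permutation(len(toy))      # shuffled row indices; `toy` is not modified

n_train = int(len(toy) * 0.70)         # end of the training portion
n_train_val = int(len(toy) * 0.85)     # end of the validation portion

toy_train = toy[order[:n_train]]
toy_val = toy[order[n_train:n_train_val]]
toy_test = toy[order[n_train_val:]]

print(len(toy_train), len(toy_val), len(toy_test))
```

Indexing with a permutation keeps the original array intact, which can be convenient if the unshuffled order is needed again later.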

In [None]:
# set the numpy seed for reproducibility
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.seed.html
np.random.seed(50)

# todo
np.random.shuffle(datanp)

train_split = int(len(datanp) * 0.70)
val_test_split = int(len(datanp) * 0.85)

train_data = datanp[:train_split]
val_data = datanp[train_split:val_test_split]
test_data = datanp[val_test_split:]

print("There are", len(train_data), "items in my training set.")
print("There are", len(val_data), "items in my validation set.")
print("There are", len(test_data), "items in my test set.")


There are 21502 items in my training set.
There are 4608 items in my validation set.
There are 4608 items in my test set.


## Part 2. Model Setup [5 pt]

### Part (a) [4 pt]

Design a fully-connected autoencoder by modifying the `encoder` and `decoder`
below.

The input to this autoencoder will be the features of the `data`, with
one categorical feature recorded as "missing". The output of the autoencoder
should be the reconstruction of the same features, but with the missing
value filled in.

**Note**: Do not reduce the dimensionality of the input too much!
The output of your embedding is expected to contain information
about ~11 features.

In [None]:
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self):
        super(AutoEncoder, self).__init__()
        self.name = "AutoEncoder"
        self.encoder = nn.Sequential(
            nn.Linear(57, 48), # TODO -- FILL OUT THE CODE HERE!
            nn.ReLU(),
            nn.Linear(48, 32),
            nn.ReLU(),
            nn.Linear(32, 16)
        )
        self.decoder = nn.Sequential(
            nn.Linear(16, 32), # TODO -- FILL OUT THE CODE HERE!
            nn.ReLU(),
            nn.Linear(32, 48),
            nn.ReLU(),
            nn.Linear(48, 57),
            nn.Sigmoid() # get to the range (0, 1)
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

### Part (b) [1 pt]

Explain why there is a sigmoid activation in the last step of the decoder.

(**Note**: the values inside the data frame `data` and the training code in Part 3 might be helpful.)
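
As a quick numerical illustration of what a sigmoid does to arbitrary values, the sketch below defines a hypothetical helper `sigmoid` in NumPy (the model itself uses `nn.Sigmoid`) and applies it to a few inputs.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary pre-activation values, including large magnitudes.
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
out = sigmoid(z)

print(out)  # every entry lies strictly between 0 and 1
```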

In [None]:
# There is a sigmoid activation in the last step of the decoder because we want the output to be between 0 and 1.
# We previously normalized the continuous features to lie between 0 and 1, and the one-hot columns only take the values 0 and 1.
# Hence every target value that the decoder must reconstruct lies between 0 and 1.

## Part 3. Training [18 pt]

### Part (a) [6 pt]

We will train our autoencoder in the following way:

- In each iteration, we will hide one of the categorical features using the `zero_out_random_feature` function
- We will pass the data with one missing feature through the autoencoder, and obtain a reconstruction
- We will check how close the reconstruction is compared to the original data -- including the value of the missing feature
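
The steps above can be sketched on a toy batch. The column layout here (two continuous columns followed by a 3-way one-hot block) is a hypothetical assumption for illustration, not the lab's actual layout.

```python
import numpy as np

# Hypothetical layout: 2 continuous columns followed by a 3-way one-hot block.
start, stop = 2, 5   # assumed column range of the categorical feature

batch = np.array([[0.3, 0.8, 0., 1., 0.],
                  [0.5, 0.4, 1., 0., 0.]], dtype=np.float32)

masked = batch.copy()          # copy so the reconstruction target stays intact
masked[:, start:stop] = 0      # "hide" the feature by zeroing its one-hot block

# The autoencoder would receive `masked` as input and be scored against `batch`.
print(masked)
```

Copying before masking matters: the loss compares the reconstruction with the original record, so the original must still contain the hidden feature.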

Complete the code to train the autoencoder, and plot the training and validation loss every few iterations.
You may also want to plot training and validation "accuracy" every few iterations, as we will define in
part (b). You may also want to checkpoint your model every few iterations or epochs.

Use `nn.MSELoss()` as your loss function. (Side note: you might recognize that this loss function is not
ideal for this problem, but we will use it anyway.)

In [None]:
import matplotlib.pyplot as plt

In [None]:
def zero_out_feature(records, feature):
    """ Set the feature missing in records, by setting the appropriate
    columns of records to 0
    """
    start_index = cat_index[feature]
    stop_index = cat_index[feature] + len(cat_values[feature])
    records[:, start_index:stop_index] = 0
    return records

def zero_out_random_feature(records):
    """ Set one random feature missing in records, by setting the
    appropriate columns of records to 0
    """
    return zero_out_feature(records, random.choice(catcols))

def train(model, train_loader, valid_loader, num_epochs=5, learning_rate=1e-4):
    """ Training loop. You should update this."""
    torch.manual_seed(42)
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    train_accuracy, valid_accuracy, train_losses, valid_losses, iterations = [], [], [], [], []

    for epoch in range(num_epochs):
        # Accumulate per-batch losses so each epoch reports an average loss.
        total_train_loss, num_train_batches = 0.0, 0
        for data in train_loader:
            datam = zero_out_random_feature(data.clone()) # zero out one categorical feature
            recon = model(datam)
            train_loss = criterion(recon, data)
            train_loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            total_train_loss += float(train_loss)
            num_train_batches += 1

        total_valid_loss, num_valid_batches = 0.0, 0
        with torch.no_grad(): # gradients are not needed for validation
            for data in valid_loader:
                datam = zero_out_random_feature(data.clone())
                recon = model(datam)
                total_valid_loss += float(criterion(recon, data))
                num_valid_batches += 1

        iterations.append(epoch)
        train_losses.append(total_train_loss / num_train_batches)
        valid_losses.append(total_valid_loss / num_valid_batches)
        train_accuracy.append(get_accuracy(model, train_loader))
        valid_accuracy.append(get_accuracy(model, valid_loader))

        print(("Epoch {}: Train Loss: {}, Validation Loss: {} | " + "Train Accuracy: {}, Validation Accuracy: {}").format(
                  epoch + 1, train_losses[epoch], valid_losses[epoch], train_accuracy[epoch], valid_accuracy[epoch]))

        # Checkpoint the model after every epoch.
        model_path = "model_{0}_bs{1}_lr{2}_epoch{3}".format(model.name, train_loader.batch_size, learning_rate, epoch)
        torch.save(model.state_dict(), model_path)

    plt.title("Training Curve: Model Loss")
    plt.plot(iterations, train_losses, label="Train")
    plt.plot(iterations, valid_losses, label="Validation")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.legend()
    plt.show()

    plt.title("Training Curve: Model Accuracy")
    plt.plot(iterations, train_accuracy, label="Train")
    plt.plot(iterations, valid_accuracy, label="Validation")
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.show()

    print("Final Training Accuracy: {}".format(train_accuracy[-1]))
    print("Final Validation Accuracy: {}".format(valid_accuracy[-1]))


### Part (b) [3 pt]

While plotting training and validation loss is valuable, loss values are harder to compare
than accuracy percentages. It would be nice to have a measure of "accuracy" in this problem.

Since we will only be imputing missing categorical values, we will define an accuracy measure.
For each record and for each categorical feature, we determine whether
the model can predict the categorical feature given all the other features of the record.

A function `get_accuracy` is written for you. It is up to you to figure out how to
use the function. **You don't need to submit anything in this part.**
To earn the marks, correctly plot the training and validation accuracy every few
iterations as part of your training curve.
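
The accuracy idea can be illustrated on a single feature. The sketch below assumes hypothetical one-hot targets and real-valued reconstructions for a 3-way feature. It is a simplified illustration, not a replacement for `get_accuracy`.

```python
import numpy as np

# Hypothetical one-hot targets and real-valued reconstructions for one
# 3-way categorical feature across four records.
target = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]], dtype=np.float32)
recon = np.array([[0.7, 0.2, 0.1],
                  [0.4, 0.5, 0.1],
                  [0.6, 0.1, 0.3],   # wrong: predicts class 0, target is class 2
                  [0.1, 0.8, 0.1]], dtype=np.float32)

# A prediction counts as correct when the largest output matches the target's hot index.
correct = recon.argmax(axis=1) == target.argmax(axis=1)
toy_accuracy = correct.mean()

print(toy_accuracy)
```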

In [None]:
def get_accuracy(model, data_loader):
    """Return the "accuracy" of the autoencoder model across a data set.
    That is, for each record and for each categorical feature,
    we determine whether the model can successfully predict the value
    of the categorical feature given all the other features of the
    record. The returned "accuracy" measure is the percentage of times
    that our model is successful.

    Args:
       - model: the autoencoder model, an instance of nn.Module
       - data_loader: an instance of torch.utils.data.DataLoader

    Example (to illustrate how get_accuracy is intended to be called.
             Depending on your variable naming this code might require
             modification.)

        >>> model = AutoEncoder()
        >>> vdl = torch.utils.data.DataLoader(data_valid, batch_size=256, shuffle=True)
        >>> get_accuracy(model, vdl)
    """
    total = 0
    acc = 0
    for col in catcols:
        for item in data_loader: # minibatches
            inp = item.detach().numpy()
            out = model(zero_out_feature(item.clone(), col)).detach().numpy()
            for i in range(out.shape[0]): # record in minibatch
                acc += int(get_feature(out[i], col) == get_feature(inp[i], col))
                total += 1
    return acc / total

### Part (c) [4 pt]

Run your updated training code, using reasonable initial hyperparameters.

Include your training curve in your submission.

In [None]:
model_1 = AutoEncoder()

train_loader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_data, batch_size=64, shuffle=True)
train(model_1, train_loader, val_loader, num_epochs=20, learning_rate=0.001)

### Part (d) [5 pt]

Tune your hyperparameters, training at least 4 different models (4 sets of hyperparameters).

Do not include all your training curves. Instead, explain what hyperparameters
you tried, what their effect was, and what your thought process was as you
chose the next set of hyperparameters to try.

In [None]:
# I will tune one hyperparameter at a time and choose the best value for that
# hyperparameter based on the training curve and validation accuracy.

In [None]:
# Tuning 1:
# I will use a high number of epochs to try to make the model overfit.
# I want to see where the performance plateaus.
model_2 = AutoEncoder()
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_data, batch_size=64, shuffle=True)
train(model_2, train_loader, val_loader, num_epochs=75, learning_rate=0.001)

In [None]:
# There was an increase in the accuracy. Loss is already pretty small and
# plateaus around 20-25 epochs. However, accuracy looks like it could keep
# increasing.

In [None]:
# Tuning 2:
# Use a different learning rate
# I will try a larger learning rate to see whether faster learning comes at
# the cost of a higher validation loss.
model_3 = AutoEncoder()
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_data, batch_size=64, shuffle=True)
train(model_3, train_loader, val_loader, num_epochs=75, learning_rate=0.005)

In [None]:
# There was a slight increase in the accuracy.

In [None]:
# Tuning 3:
# Use a different batch size
# I will try a larger batch size to see if I can reduce noise
model_4 = AutoEncoder()
train_loader = torch.utils.data.DataLoader(train_data, batch_size=128, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_data, batch_size=128, shuffle=True)
train(model_4, train_loader, val_loader, num_epochs=75, learning_rate=0.005)

In [None]:
# Not much of a change in terms of accuracy, but it did decrease noise a bit.

In [None]:
# Tuning 4:
# Use a different batch size
# I will try decreasing the batch size just to observe the effects.
model_5 = AutoEncoder()
train_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_data, batch_size=32, shuffle=True)
train(model_5, train_loader, val_loader, num_epochs=75, learning_rate=0.005)

In [None]:
# Slightly higher accuracy but way more noise.
# I want to test the effects of changing the learning rate again.

In [None]:
# Tuning 5:
# Use a larger learning rate
# I know that a smaller learning rate lets the model take more careful steps,
# but a smaller learning rate does not guarantee a higher validation accuracy
model_6 = AutoEncoder()
train_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_data, batch_size=32, shuffle=True)
train(model_6, train_loader, val_loader, num_epochs=75, learning_rate=0.01)

In [None]:
# Much worse accuracy, revert to 0.005 which seemed to be optimal.

In [None]:
# The best model in terms of accuracy was model_5.

## Part 4. Testing [12 pt]

### Part (a) [2 pt]

Compute and report the test accuracy.

In [None]:
test_loader = torch.utils.data.DataLoader(test_data, batch_size=32, shuffle=True)
test_accuracy = get_accuracy(model_5, test_loader)
print("Final Test Accuracy: {} %".format(test_accuracy * 100))

Final Test Accuracy: 69.96889467592592 %


### Part (b) [4 pt]

Based on the test accuracy alone, it is difficult to assess whether our model
is actually performing well. We don't know whether a high accuracy is due to
the simplicity of the problem, or if a poor accuracy is a result of the inherent
difficulty of the problem.

It is therefore very important to be able to compare our model to at least one
alternative. In particular, we consider a simple **baseline**
model that is not very computationally expensive. Our neural network
should at least outperform this baseline model. If our network is not much
better than the baseline, then it is not doing well.

For our data imputation problem, consider the following baseline model:
to predict a missing feature, the baseline model will look at the **most common value** of the feature in the training set.

For example, if the feature "marriage" is missing, then this model's prediction will be the most common value for "marriage" in the training set, which happens to be "Married-civ-spouse".

What would be the test accuracy of this baseline model?
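
As a minimal sketch of this baseline for a single feature, assume two hypothetical toy series `toy_train` and `toy_test`. The most common value is taken from the training split only, and accuracy is measured on the test split.

```python
import pandas as pd

# Hypothetical toy splits for a single categorical feature.
toy_train = pd.Series(["Married", "Married", "Divorced", "Never-married"])
toy_test = pd.Series(["Married", "Divorced", "Married", "Married", "Never-married"])

# The baseline always predicts the most frequent training value...
most_common = toy_train.value_counts().idxmax()

# ...so its test accuracy is the share of test records with that value.
baseline_acc = (toy_test == most_common).mean()

print(most_common, baseline_acc)
```

To compare against `get_accuracy`, which averages over all categorical features, the same computation would be repeated for each feature and the results averaged.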


In [None]:
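# Note: this computes the baseline accuracy for the "marriage" feature only,
# over all complete records in df_not_missing rather than the test split alone.
most_common_val = {}
for col in df_not_missing.columns:
    most_common_val[col] = df_not_missing[col].value_counts().idxmax()

baseline_accuracy = sum(df_not_missing['marriage'] == most_common_val['marriage'])/len(df_not_missing)
print("The test accuracy of this baseline model is:", baseline_accuracy*100, "% .")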

The test accuracy of this baseline model is: 46.67947131974738 % .


### Part (c) [1 pt]

How does your test accuracy from part (a) compare to your baseline test accuracy in part (b)?

In [None]:
# My model's test accuracy is about 70%, which is higher than the baseline
# accuracy of about 47%. My model is trained on all features of the training
# data and tuned using the validation data. The baseline simply predicts the
# most frequent value of the feature and ignores the rest of the record,
# whereas my model makes its predictions based on patterns across the other
# features.

### Part (d) [1 pt]

Look at the first item in your test data.
Do you think it is reasonable for a human
to be able to guess this person's education level
based on their other features? Explain.

In [None]:
get_features(test_data[0])

{'work': 'Private',
 'marriage': 'Divorced',
 'occupation': 'Prof-specialty',
 'edu': 'Bachelors',
 'relationship': 'Not-in-family',
 'sex': 'Male'}

In [None]:
# Yes, I think it is reasonable for a human to guess this person's
# education level based on their other features.
# A person's features, such as their occupation, are typically indicative of
# the education they received. For example, this man's occupation is
# "Prof-specialty" (a professional specialty), which suggests that he has at
# least some university-level education. The type of work is similarly informative.
# Features such as sex or relationship alone would not be a reasonable basis
# for guessing a person's education level, however.

### Part (e) [2 pt]

What is your model's prediction of this person's education
level, given their other features?


In [None]:
test_person = test_data[0].copy() # copy so that test_data itself is not modified
education_start_index = cat_index['edu']
education_stop_index = cat_index['edu'] + len(cat_values['edu'])
test_person[education_start_index: education_stop_index] = 0
test_person = torch.from_numpy(test_person)

prediction = model_5(test_person)
prediction = prediction.detach().cpu().numpy()
pred_edu = get_feature(prediction, "edu")
print("My model predicts this person's education level is", pred_edu, ".")

My model predicts this person's education level is Bachelors .


### Part (f) [2 pt]

What is the baseline model's prediction
of this person's education level?

In [None]:
print("The baseline model's prediction of this person's education level is", most_common_val["edu"],".")

The baseline model's prediction of this person's education level is  HS-grad .
