# Deep Learning with Keras III (student admissions, mlp)

Overview: 
- Data: Student admissions data
- Methodology: Artificial neural network
  - Type: Multilayer perceptron

## 1. Introduction
- Part I:
- Part II: 
In this notebook we apply a multilayer perceptron (**MLP**) to student admissions data. The objective is to build a model that predicts whether a student is admitted, based on three explanatory variables.

## 2. The data
The dataset consists of 400 observations of students applying to university. The **binary** outcome variable is whether the student was admitted (`admit`). The three explanatory variables used to predict the outcome are
- `gre`: standardized test
- `gpa`: academic achievement measure
- `rank`: categorical rank of student

### Preprocessing
Before we can apply the multilayer perceptron, the data needs some preprocessing. The necessary steps are
- Normalization: Scale the numeric feature variables (`gre` and `gpa`) by dividing each by its maximum, so that both lie on a comparable scale between 0 and 1.
- Convert to categorical: `rank` is coded numerically in the data even though it is a categorical variable. We therefore re-encode it as one dummy column per rank instead of the numerical value. This technique is known as [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/).

In addition, we split the data into training and test sets. This lets us evaluate the model fitted on the training data against previously unseen test data, so that the reported accuracy is not simply a consequence of the model "memorizing" the training set.
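
To make the normalization step concrete, here is a minimal sketch that contrasts dividing by the maximum (the approach used in this notebook) with min-max scaling. The helper names `max_scale` and `min_max_scale` are hypothetical, and the five GRE values are the ones shown in the head of the raw data. Only min-max scaling maps the smallest value to exactly 0.

```python
import numpy as np

def max_scale(values):
    """Divide by the maximum; positive inputs end up in (0, 1]."""
    values = np.asarray(values, dtype=float)
    return values / values.max()

def min_max_scale(values):
    """Map the minimum to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    return (values - values.min()) / span

# First five GRE scores from the head of the raw data
gre = [380, 660, 800, 640, 520]

scaled_by_max = max_scale(gre)
scaled_min_max = min_max_scale(gre)

print("Divide by max:", scaled_by_max)
print("Min-max:      ", scaled_min_max)
```

Either variant puts `gre` and `gpa` on a similar scale; which one works better for a given model is an empirical question.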

In [1]:
# Standard imports
import pandas as pd
import numpy as np

# Keras imports
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD
from keras import utils

# set seed
np.random.seed(16)

Using TensorFlow backend.


In [2]:
# Import data from csv 
data = pd.read_csv("data/student_admissions.csv")

# Initial shape and head
print("Shape of data:",data.shape)
print("\nHead of initial data:\n",data.head())

# Normalize to a range between 0 and 1
data["gre"] = data["gre"] / data["gre"].max()
data["gpa"] = data["gpa"] / data["gpa"].max()

# get dummies for rank
dummies = pd.get_dummies(data["rank"], prefix="rank")

# Dropping numerical "rank", concatenating data and dummies
data = pd.concat([data.drop("rank", axis=1), dummies], axis=1, sort=False)

# print new shape and head
print("\nShape of processed data:", data.shape)
print("\nHead of processed data:\n",data.head())

Shape of data: (400, 4)

Head of initial data:
    admit  gre   gpa  rank
0      0  380  3.61     3
1      1  660  3.67     3
2      1  800  4.00     1
3      1  640  3.19     4
4      0  520  2.93     4

Shape of processed data: (400, 7)

Head of processed data:
    admit    gre     gpa  rank_1  rank_2  rank_3  rank_4
0      0  0.475  0.9025       0       0       1       0
1      1  0.825  0.9175       0       0       1       0
2      1  1.000  1.0000       1       0       0       0
3      1  0.800  0.7975       0       0       0       1
4      0  0.650  0.7325       0       0       0       1


**Next step:**

We use a helper function to split the data into a training and a test set and convert both to NumPy arrays. In addition, we re-encode the outcome variable with one dummy column per class (two in this case).

In [3]:
# Train/test split helper
def train_test(df, train_size):
    """Randomly split df; train_size is the fraction of rows used for training."""
    idx = df.index
    train_len = int(len(idx) * train_size)
    # Sample the training row labels without replacement
    sample = np.random.choice(idx, size=train_len, replace=False)
    train = df.loc[sample]
    test = df.drop(sample)
    return train, test

# 80% training data, 20% test data
train_data, test_data = train_test(data, 0.8)

# Split into y_train, x_train
y_train = np.array(utils.to_categorical(train_data["admit"], num_classes=2))
x_train = np.array(train_data.drop(columns=["admit"]))

# Split into y_test, x_test, 
y_test = np.array(utils.to_categorical(test_data["admit"], num_classes=2))
x_test = np.array(test_data.drop(columns=["admit"]))

# print results 
print(y_train[:5])
print("\n",x_train[:5])

[[1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [0. 1.]]

 [[0.925  1.     0.     0.     1.     0.    ]
 [0.825  0.7275 0.     0.     1.     0.    ]
 [0.725  0.865  0.     0.     0.     1.    ]
 [0.85   0.8275 0.     1.     0.     0.    ]
 [0.775  0.9025 1.     0.     0.     0.    ]]
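
To read the encoded outcome above, it helps to see what the one-hot step does to integer labels. The following NumPy sketch mimics the behavior of `utils.to_categorical` for this case; `one_hot` is a hypothetical helper, not part of Keras. Column 0 stands for "not admitted" and column 1 for "admitted".

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return one row per label with a 1 in the column of its class."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(num_classes)[labels]

# First five `admit` values from the head of the raw data
admit = [0, 1, 1, 1, 0]

encoded = one_hot(admit, num_classes=2)
print(encoded)

# argmax along the columns recovers the original integer labels
recovered = encoded.argmax(axis=1)
print(recovered)
```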


## 3. The model
We use a multilayer perceptron for this binary classification problem. Previously, we built the model by passing a list of layers to `Sequential()`. Here, we add the layers one at a time with `.add()`, which is another way to specify the network.

Other options for `compile()`:
- [Loss functions](https://keras.io/losses/): mean_squared_error, binary_crossentropy, categorical_crossentropy, ...
- [Optimizers](https://keras.io/optimizers/): adam, rmsprop, sgd, ...
- [Metrics](https://keras.io/metrics/): accuracy, mae, ...
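
The model in this notebook combines a two-unit softmax output with `binary_crossentropy`, while `categorical_crossentropy` is the more common pairing for one-hot labels. The sketch below suggests why the two agree in this special case: with two outputs that sum to 1, both formulas reduce to the negative log-probability of the true class. The functions are plain NumPy re-implementations written for illustration (no clipping or other numerical safeguards), not the Keras code, and the predictions are made-up numbers.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred):
    """Per-sample loss: minus the log-probability of the true class."""
    return -np.sum(y_true * np.log(y_pred), axis=1)

def binary_crossentropy(y_true, y_pred):
    """Per-sample loss: element-wise binary cross-entropy averaged over outputs."""
    per_output = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return per_output.mean(axis=1)

# One-hot labels and illustrative softmax outputs (each row sums to 1)
y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
y_pred = np.array([[0.7, 0.3], [0.4, 0.6]])

cce = categorical_crossentropy(y_true, y_pred)
bce = binary_crossentropy(y_true, y_pred)

print("categorical:", cce)
print("binary:     ", bce)
```

This equivalence relies on there being exactly two softmax outputs; with more classes the two losses differ.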

In [4]:
# time 
from time import time
start_model = time()

# Initializing model object
model = Sequential()

# Adding layers
model.add(Dense(128, activation = "relu", input_dim = x_train.shape[1]))
model.add(Dropout(0.2))
model.add(Dense(128, activation = "relu"))
model.add(Dropout(0.2))
model.add(Dense(2, activation = "softmax"))

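# Compile the model
# Note: "categorical_crossentropy" is the conventional loss for one-hot labels with softmax;
# with exactly two softmax outputs, "binary_crossentropy" gives the same loss value.
model.compile(loss = "binary_crossentropy",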
              optimizer = "adam",
              metrics = ["accuracy"])

# Fit the model
model.fit(x = x_train, y = y_train,
          epochs = 256,
          batch_size = 128, 
          verbose = 0,
          validation_data = (x_test, y_test))

end_model = time()
time_model = end_model - start_model
print("Total running time:", round(time_model, 2))

# Evaluate model
train_score = model.evaluate(x_train, y_train)
test_score = model.evaluate(x_test, y_test)

# Print results
print("\nTraining Accuracy:", round(train_score[1], 4))
print("Testing Accuracy:", round(test_score[1], 4))

Total running time: 3.6

Training Accuracy: 0.7219
Testing Accuracy: 0.675


**Notes:**

A testing accuracy below 70% is not overwhelming. However, the dataset used to train the model is fairly small (320 training observations), and there is plenty of room left for hyperparameter tuning. It may also be worth comparing the performance with other supervised machine learning algorithms such as a random forest, which may outperform the neural network.
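
One way to put an accuracy figure into perspective is to compare it with a trivial baseline that always predicts the most frequent class. The sketch below computes that baseline from one-hot labels; `majority_class_accuracy` is a hypothetical helper, and the toy labels are the five rows of `y_train` printed earlier, so the result says nothing about the full dataset.

```python
import numpy as np

def majority_class_accuracy(y_one_hot):
    """Accuracy obtained by always predicting the most frequent class."""
    class_counts = np.asarray(y_one_hot).sum(axis=0)
    return class_counts.max() / class_counts.sum()

# The five rows of y_train printed earlier in the notebook
y_sample = np.array([[1, 0], [0, 1], [1, 0], [1, 0], [0, 1]], dtype=float)

baseline = majority_class_accuracy(y_sample)
print("Majority-class baseline:", baseline)
```

In the notebook itself, the same helper could be called as `majority_class_accuracy(y_test)` and the result compared with the testing accuracy.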

**Next steps:**
- Compare to traditional supervised ML algorithms
- Visualize training, testing accuracy over time.
- Summary
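
As a starting point for the visualization step, `model.fit` returns a `History` object whose `.history` attribute is a dictionary of per-epoch metrics. The accuracy keys differ between Keras versions (`acc`/`val_acc` in older releases, `accuracy`/`val_accuracy` in newer ones), so the sketch below checks for both. `accuracy_curves` is a hypothetical helper, and the numbers are made-up placeholders, not results from this notebook.

```python
def accuracy_curves(history_dict):
    """Return (train_acc, val_acc) lists from a Keras-style history dictionary."""
    for train_key, val_key in (("accuracy", "val_accuracy"), ("acc", "val_acc")):
        if train_key in history_dict:
            return history_dict[train_key], history_dict.get(val_key, [])
    raise KeyError("no accuracy metric found in history")

# Made-up values for illustration only
fake_history = {
    "loss": [0.69, 0.65, 0.62],
    "acc": [0.55, 0.63, 0.68],
    "val_acc": [0.50, 0.60, 0.66],
}

train_acc, val_acc = accuracy_curves(fake_history)
for epoch, (tr, va) in enumerate(zip(train_acc, val_acc), start=1):
    print(f"epoch {epoch}: train={tr:.2f} val={va:.2f}")
```

With a real run, the dictionary would come from `history = model.fit(...)` followed by `history.history`, and the two lists could be passed to a plotting library.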