<a href="https://colab.research.google.com/github/sabrinasforza/Portfolio/blob/main/University_project_3_Neural_Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 2: Predicting Student Grades 

## Abstract
The second task of this neural network project consists of building a multi-class classifier to predict secondary school student performance. The data used to train the classifier models was collected from two Portuguese secondary schools in 2014.

Firstly, a random forest classifier was used as a baseline model. A multilayer perceptron was then trained, followed by a 3-layer dense deep neural network, to compare their performance. We then built two wide and deep neural networks with different architectures. 

## Main Findings

It was found that several attributes within the dataset were particularly effective at predicting student performance. These consisted of previous grades, mother's education, aspiration to pursue higher education, study time, number of failures and number of absences.

Using a reduced dataset, both a dense deep neural network and a wide and deep neural network with varying architectures were highly effective at predicting student performance. The most effective model was the wide and deep neural network with (x), as it achieved an F1 score of (X).  


## Team Name: Group 7

## Students: 


* DEWHIRST GREGOR | 202172038 
* GRAY JAMES | 202181792 
* LANCASTER THOMAS | 202191074 
* SFORZA SABRINA | 202173415 
* SMITH PADDY | 202171142 
* THOMAS SAM KURIAKOSE | 202179307 
 



# Model and Method 

## Packages

In [None]:
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import Perceptron
!pip install -U tensorflow
%tensorflow_version 2.x
import tensorflow as tf
from tensorflow import keras
print(tf.__version__)
print(keras.__version__)
from tensorflow.keras.utils import plot_model
from sklearn.neural_network import MLPClassifier

!pip install -q -U keras-tuner
import keras_tuner as kt
import os
import time
import datetime
import numpy as np
import pandas as pd
import sklearn
import seaborn as sns 

# Keras
from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow.keras import callbacks
from tensorflow.keras import backend as K

# Standard ML stuff
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD, FastICA
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

# Oversampling of the minority class
from imblearn.over_sampling import SMOTE

2.8.0
2.8.0


## Data Processing and Feature Selection


In [None]:
#importing test set
from google.colab import drive

drive.mount('/content/drive')
path = "/content/drive/MyDrive/test_task2.csv"
test = pd.read_csv(path)

In [None]:
#importing train set 

path = "/content/drive/MyDrive/train_task2.csv"
train = pd.read_csv(path)

In [None]:
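#choosing predictors and targets 
# copies are taken so that `train` keeps its 'Grade' column for the plots below
x_test = test.copy()
y_train = train['Grade']
x_train = train.drop(columns=['Grade'])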

In [None]:
#dropping id column from test and train sets 
x_train.drop('id', axis=1, inplace=True)
x_test.drop('id', axis=1, inplace=True)

### Describe Dataset

The dataset consists of various information on students from two Portuguese schools, collected through surveys and questionnaires. In total, 33 attributes have been collected for each student, and there are (x) observations within the dataset. 

Some initial analysis through a heatmap shows a correlation of 0.6 between the number of hours studied and the grade received, which is consistent with expectations. 

Furthermore, a negative correlation of -0.2 can be seen with both the number of failures and the number of absences a student has. Both are to be expected, as previous failures are likely to reflect future performance and absences are likely to determine how much content a student covers. 

The grade distribution was then checked, showing a positive skew. This indicates that grades are not normally distributed across the student sample. 
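
The two checks described above (correlation with the target and skewness of the grade distribution) can be sketched as follows. This is a minimal illustration on synthetic data: the values, coefficients and the `df` frame are assumptions made for the example and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the survey data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
n = 200
studytime = rng.integers(1, 5, size=n)
failures = rng.integers(0, 4, size=n)
grade = np.clip(8 + 2 * studytime - 1.5 * failures + rng.normal(0, 2, size=n), 0, 20)
df = pd.DataFrame({"studytime": studytime, "failures": failures, "Grade": grade})

# Pearson correlation of every predictor with the target column
corr_with_grade = df.corr()["Grade"].drop("Grade")
print(corr_with_grade)

# Sample skewness: > 0 means a longer right tail, < 0 a longer left tail
grade_skew = df["Grade"].skew()
print(grade_skew)
```

Reading the correlations as a sorted column is often easier than reading them off a heatmap when there are more than thirty attributes.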

In [None]:
#exploring trends in the data

corr = train.corr()
sns.heatmap(corr) 
plt.show()

In [None]:
# checking if student grades are evenly distributed

plt.style.use("ggplot")
train["Grade"].value_counts().plot(kind="bar", 
                                  figsize = (8,5), color = "darkviolet")
plt.title("Frequency of the classes of our Target variable", size=20)
plt.xlabel("Target Variable", size = 16)
plt.ylabel("Frequency", size = 16)

### Feature Importance

A chi-squared feature selection is performed to ensure that only relevant attributes are used to train the classification model. This is because certain attributes may predict a student's performance far better than others. Removing unwanted attributes, thus reducing the size of the dataset, can make it easier and quicker to train neural networks. However, shrinking the number of input variables too far could discard useful information, so the model may not generalise well to other datasets. 


In [None]:
# All input variables have to be converted into ordinal variables first before feature importance can be performed.
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.feature_selection import SelectKBest, chi2

def prepare_inputs(x_train, x_test):
    # encode every input column as ordinal integers (fitted on the training set only)
    oe = OrdinalEncoder()
    oe.fit(x_train)
    X_train_enc = oe.transform(x_train)
    X_test_enc = oe.transform(x_test)
    return X_train_enc, X_test_enc

def prepare_targets(y_train):
    # encode the target grades as integer class labels
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    return y_train_enc

def select_features(x_train, y_train, x_test):
    # k='all' keeps every feature so that a chi-squared score is available for each one
    fs = SelectKBest(score_func=chi2, k='all')
    fs.fit(x_train, y_train)
    x_train_fs = fs.transform(x_train)
    x_test_fs = fs.transform(x_test)
    return x_train_fs, x_test_fs, fs

X_train_enc, X_test_enc = prepare_inputs(x_train, x_test)
y_train_enc = prepare_targets(y_train)
x_train_fs, x_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)

In [None]:
# Importance scores are then calculated for each variable, and a bar plot is produced
for i in range(len(fs.scores_)):
    print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
plt.figure(figsize = (12,6))
plt.bar([i for i in range(len(fs.scores_))], fs.scores_)
plt.title("Feature Importance Score", size = 20)
plt.xlabel("Features/ Variables", size = 16, color = "black")
plt.ylabel("Importance Score", size = 16, color = "black")
plt.show()

The feature importance analysis reveals large variation between the input variables in terms of how important they are in predicting student performance. In particular, variables G1 and G2 are strong predictors of student performance. This is because these variables are the students' grades from the first and second terms respectively, and so will be effective predictors of a student's final grade. 

However, keeping only G1 and G2 would overly shrink the set of input variables, which could hurt generalisation. Therefore, only input variables that scored higher than 25 were kept in the training set. This left the variables "Medu", "higher", "studytime", "failures", "absences", "G1", "G2". 
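
The thresholding step described above can be sketched programmatically rather than by listing columns by hand. The example below is a minimal illustration on synthetic data: the two features, their names and the generated values are assumptions for the example, and only the cut-off of 25 comes from the text.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical ordinal-encoded features; chi2 requires non-negative values
rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 3, size=n)
informative = y * 5 + rng.integers(0, 2, size=n)  # strongly tied to the class
noise = rng.integers(0, 5, size=n)                # unrelated to the class
X = np.column_stack([informative, noise])
feature_names = np.array(["informative", "noise"])

# chi2 returns one score and one p-value per feature
scores, p_values = chi2(X, y)

# Keep only the features whose score exceeds the chosen threshold
threshold = 25
selected = feature_names[scores > threshold].tolist()
print(dict(zip(feature_names.tolist(), scores.round(1).tolist())))
print(selected)
```

Deriving the kept columns from the scores keeps the threshold and the column list in sync if the scores change after re-running the selection.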

In [None]:
# columns that scored 25 or lower in the chi-squared feature selection
drop_cols = ["school", "sex", "age", "address", "famsize", "Pstatus", "Fedu", "Mjob", "Fjob", "reason", "guardian", "traveltime", "schoolsup", "famsup",
             "paid", "activities", "nursery", "internet", "romantic", "famrel", "freetime", "goout", "Dalc", "Walc", "health"]

x_train = x_train.drop(columns=drop_cols)
x_test = x_test.drop(columns=drop_cols)

### One-Hot Encoding

The data type of each variable is then investigated to ensure it can be supported by machine learning algorithms. Categorical variables stored as text cannot be used directly, so they are often encoded into numerical values using various techniques. Amongst the remaining variables, only "higher" is categorical, whilst the others are either ordinal or numeric. 

In [None]:
x_train.dtypes

In [None]:
x_train = pd.get_dummies(x_train, drop_first=True, columns = ["higher"])
x_test = pd.get_dummies(x_test, drop_first=True, columns = ["higher"])

## Validation Set

A validation set was then generated from the training data to monitor overfitting and to allow an early evaluation of the model (Larsen et al., 1996). The size chosen for the validation set was 20%. 

In [None]:
from sklearn.model_selection import train_test_split
# hold out 20% of the training data for validation
X_train, X_valid, y_train, y_valid = train_test_split(x_train, y_train, test_size=0.2) 

## Scaling Data 

In [None]:
from sklearn.preprocessing import StandardScaler
# fit the scaler on the training split only, then apply it to the validation split
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)

## Baseline Model

### Random Forest Classifier

A baseline model was first fitted on the training set to provide a point of comparison for the neural networks. A random forest classifier was chosen, as it has been shown to be a versatile and powerful model for multiclass classification. 


## Neural Network Models 

### Multilayer Perceptron 

The first model used was a Multilayer Perceptron with default parameters. Despite being a simple model, it provided a point of comparison against which to measure the performance of the Neural Networks. The model received a score of 0.96 when applied to the test set. 

### Feedforward Neural Network 
The second model built was a 3-layer feedforward neural network. The loss function used was sparse categorical cross-entropy, and the Adam optimiser was used for training. The activation function was ReLU for the hidden layers and softmax for the output layer. 
The best hyperparameters were retrieved using KerasTuner. This method proved to be effective given the high number of possible hyperparameter combinations. The search provided the following parameters: 

When applied to the test set, the model scored 0.94. The accuracy score on the training set was 1.00. 
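
To make the softmax output and the sparse categorical cross-entropy loss named above concrete, the following NumPy sketch computes both by hand for two samples and three classes. The logits and labels are made-up values for illustration, and the helper functions are written for this example rather than taken from Keras.

```python
import numpy as np

def softmax(logits):
    # subtract the row maximum for numerical stability before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(y_true, probs):
    # y_true holds integer class indices; pick the probability of the true class per row
    picked = probs[np.arange(len(y_true)), y_true]
    return -np.log(picked)

# Hypothetical raw outputs of the last layer for 2 samples and 3 classes
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
y_true = np.array([0, 1])

probs = softmax(logits)
losses = sparse_categorical_crossentropy(y_true, probs)
print(probs.sum(axis=1))  # each row of probabilities sums to 1
print(losses.mean())      # mean loss over the batch
```

"Sparse" refers only to the label format: the targets are integer class indices rather than one-hot vectors, which is why the grades can be passed to the model without one-hot encoding them first.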

Explain the difference between scores on train and test sets 

### Wide and Deep Neural Network 
The third model built was a Wide and Deep Neural Network with 3 hidden layers. 
Activation function 
Loss function 
Optimizer 
KerasTuner was used to retrieve the best hyperparameters. The search provided a learning rate of 0.001 and 769 neurons per layer. The model scored 0.98 on the test set and 1.00 accuracy on the training set. 
The higher accuracy score on the training set may be a sign of overfitting. For this reason, He initialization and LeakyReLU, a variation of the ReLU activation function, have been added. The small slopes of the LeakyReLU function prevent neurons dying during the training stages whereas He initialization ...

However, these implementations have not improved the score from 0.98. The performance graph also shows the loss for both training and validation data to be highly fluctuating. This could be due to ...
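
The difference between ReLU and the LeakyReLU activation mentioned above can be illustrated with a few lines of NumPy. This is a sketch for intuition only, not the Keras implementation; the slope `alpha = 0.2` mirrors the value passed to `LeakyReLU` in the notebook, and the input values are arbitrary.

```python
import numpy as np

def relu(x):
    # negative inputs are set to zero, so their gradient is zero as well
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.2):
    # negative inputs are scaled by alpha instead of being zeroed,
    # so a small gradient (alpha) still flows for x < 0
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
relu_out = relu(x)
leaky_out = leaky_relu(x)
print(relu_out)
print(leaky_out)
```

For positive inputs the two functions are identical; they differ only in how much signal is passed through when a unit's pre-activation is negative.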

### Wide and Deep Neural Network with an embedding layer 

The final model built was a Wide and Deep Neural Network with an embedding layer. While one-hot encoding represents an effective solution to handle categorical data, embedding can help reduce memory usage and improve performance (Guo and Berkhahn, 2016). 
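
The relationship between one-hot encoding and an embedding layer can be sketched in NumPy: an embedding is a lookup into a trainable matrix, which gives the same result as multiplying a one-hot vector by that matrix without ever materialising the one-hot vector. The sizes and the random matrix below are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_categories, emb_dim = 5, 3  # hypothetical sizes
embedding_matrix = rng.normal(size=(n_categories, emb_dim))

# Integer-encoded values of one categorical feature for three samples
category_ids = np.array([3, 0, 3])

# Embedding lookup: one row of the matrix per category id
embedded = embedding_matrix[category_ids]

# Equivalent one-hot route: build the one-hot matrix, then multiply
one_hot = np.eye(n_categories)[category_ids]
via_one_hot = one_hot @ embedding_matrix

print(embedded.shape, one_hot.shape)
```

The lookup stores one integer per sample, whereas the one-hot route stores `n_categories` values per sample, which is where the memory saving comes from when a feature has many levels.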

# Training and Validation 

Presented below is reproducible code for all four models. The code is annotated to show the approach and decisions taken, along with their justifications, in arriving at our student performance predictions. 

## Baseline Model - Random Forest Classifier

In [None]:
# Random Forest
rnd_clf = RandomForestClassifier(max_depth = None, n_estimators = 550, criterion = 'gini')

rnd_clf.fit(x_train, y_train)

rf_pred = rnd_clf.predict(x_test)
rf_pred = pd.DataFrame(rf_pred)
rf_pred

# scored 0.98 on Kaggle

## Multilayer Perceptron

In [None]:
#sklearn basic MultiLayer Perceptron
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(random_state=1, max_iter=300).fit(X_train, y_train)

## Feedforward Deep Neural Network

In [None]:
# New search with adam optimizer

#Used unscaled data 
#did the validation split through the hyperparameter search 

def model_builder(hp):
  model = keras.Sequential()
  model.add(keras.layers.Flatten(input_shape=(58, )))

  # Tune the number of units in the first Dense layer
  # Choose an optimal value between 1-800
  hp_units = hp.Int('units', min_value=1, max_value=800, step=64)
  model.add(keras.layers.Dense(units=hp_units, activation='relu'))
  model.add(keras.layers.Dense(units= hp_units, activation = 'relu'))
  model.add(keras.layers.Dense(units= hp_units, activation = 'relu'))
  model.add(keras.layers.Dense(21, activation = "softmax"))

  # Tune the learning rate for the optimizer
  # Choose an optimal value from 0.01, 0.001, or 0.0001
  hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])
  
  # the output layer already applies softmax, so the loss receives probabilities, not logits
  model.compile(optimizer=keras.optimizers.Adam(learning_rate=hp_learning_rate),
                loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                metrics=['accuracy'])

  return model
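
The search space implied by `hp.Int('units', min_value=1, max_value=800, step=64)` can be listed with plain Python. This sketch assumes that KerasTuner draws candidate values of the form `min_value + k * step` that do not exceed `max_value`; that assumption is consistent with the 769 units reported for the wide and deep model, but it is not taken from the notebook itself.

```python
# Values assumed to be reachable by hp.Int('units', min_value=1, max_value=800, step=64)
min_value, max_value, step = 1, 800, 64
candidate_units = list(range(min_value, max_value + 1, step))

print(candidate_units)
print(len(candidate_units))
```

Under this assumption the upper bound of 800 is never sampled, and the smallest candidate is a single unit per layer; starting the range at a multiple of the step (for example 64) would give more conventional layer sizes.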

In [None]:
tuner = kt.Hyperband(model_builder,
                     objective='val_accuracy',
                     max_epochs=50,
                     factor=3,
                     directory='dir')

stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

In [None]:
tuner.search(x_train_encoded_df17, y_train, epochs=50, validation_split=0.2, callbacks=[stop_early])

In [None]:
#training model using best hyperparameters 
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]

model = tuner.hypermodel.build(best_hps)
history = model.fit(x_train_encoded_df17, y_train, epochs=50, validation_split = 0.2)

## Wide and Deep Neural Network

In [None]:
#hyperparameter search 
def model_builder(hp):
  # Tune the number of units in the hidden Dense layers
  # Choose a value between 1 and 800 in steps of 64
  hp_units = hp.Int('units', min_value=1, max_value=800, step=64)

  # Functional API: the raw input (wide path) is concatenated with the output of the hidden layers (deep path)
  input_ = keras.layers.Input(shape=(58,))
  hidden1 = keras.layers.Dense(units = hp_units, activation="relu")(input_)
  hidden2 = keras.layers.Dense(units = hp_units, activation="relu")(hidden1)
  concat = keras.layers.Concatenate()([input_, hidden2])
  output = keras.layers.Dense(21, activation = "softmax")(concat)
  model = keras.Model(inputs=[input_], outputs=[output])

  # Tune the learning rate for the optimizer
  # Choose an optimal value from 0.01, 0.001, or 0.0001
  hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])
  
  # the output layer already applies softmax, so the loss receives probabilities, not logits
  model.compile(optimizer=keras.optimizers.Adam(learning_rate=hp_learning_rate),
                loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                metrics=['accuracy'])

  return model

In [None]:
tuner = kt.Hyperband(model_builder,
                     objective='val_accuracy',
                     max_epochs=30,
                     factor=3,
                     directory='dir_1',)

#early stopping function
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

In [None]:
tuner.search(x_train_encoded_df17, y_train, epochs=50, validation_split=0.2, callbacks=[stop_early])

In [None]:
#training model using best hyperparameters 
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]

model = tuner.hypermodel.build(best_hps)
history = model.fit(x_train_encoded_df17, y_train, epochs=50, validation_split = 0.2)

### Additional Hyperparameter Search to Avoid Overfitting

In [None]:
# new hyperparameter search 

## with He initialisation and LeakyReLU activations


def model_builder(hp):
  # Tune the number of units in the hidden Dense layers
  # Choose a value between 1 and 800 in steps of 64
  hp_units = hp.Int('units', min_value=1, max_value=800, step=64)

  # note: unlike the previous builder, no Concatenate layer is used here
  input_ = keras.layers.Input(shape=(58,))
  hidden1 = keras.layers.Dense(units = hp_units, kernel_initializer = "he_normal")(input_)
  hidden1out = keras.layers.LeakyReLU(alpha = 0.2)(hidden1)
  hidden2 = keras.layers.Dense(units = hp_units, kernel_initializer = "he_normal")(hidden1out)
  hidden2out = keras.layers.LeakyReLU(alpha = 0.2)(hidden2)
  output = keras.layers.Dense(21, activation = "softmax")(hidden2out)
  model = keras.Model(inputs=[input_], outputs=[output])

  # Tune the learning rate for the optimizer
  # Choose an optimal value from 0.01, 0.001, or 0.0001
  hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])
  
  # the output layer already applies softmax, so the loss receives probabilities, not logits
  model.compile(optimizer=keras.optimizers.Adam(learning_rate=hp_learning_rate),
                loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                metrics=['accuracy'])

  return model

In [None]:
tuner = kt.Hyperband(model_builder,
                     objective='val_accuracy',
                     max_epochs=40,
                     factor=3,
                     directory='dir_2',)

#early stopping function
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

In [None]:
tuner.search(x_train_encoded_df17, y_train, epochs=70, validation_split=0.2, callbacks=[stop_early])

In [None]:
#training model using best hyperparameters 
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]

model = tuner.hypermodel.build(best_hps)
history = model.fit(x_train_encoded_df17, y_train, epochs=70, validation_split = 0.2)

## Wide and Deep Neural Network with Text Embedding Layers

# Results and Discussion

How did the models perform? 



Insert summary table of train/test performance and parameters for each model 

Table created here: https://docs.google.com/document/d/1Dlgx9VmFhBQXaAt7JKBXxncLi1s_48yVECCXrEAkjZE/edit?usp=sharing


## Random Forest Classifier

## Multilayer Perceptron

In [None]:
from google.colab import files

mlp_preds = clf.predict(x_test_encoded_df17) 
mlp_preds = pd.DataFrame(mlp_preds)
mlp_preds.to_csv('mlp_preds.csv')
files.download('mlp_preds.csv')

#this scored 0.96 on Kaggle 

## Feedforward Deep Neural Network

In [None]:
print(best_hps.get('learning_rate'), best_hps.get('units')) 

In [None]:
pd.DataFrame(history.history).plot(figsize=(8, 5)) 
plt.grid(True)
plt.gca().set_ylim(0, 1)

In [None]:
#fitting model to test set 

hp_preds = model.predict(x_test_encoded_df17)

hp_preds = pd.DataFrame(hp_preds)
hp_preds.to_csv('hp_preds.csv')
files.download('hp_preds.csv')

# results - 0.98 on Kaggle

## Wide and Deep Neural Network

In [None]:
#retrieving the best hyperparameters selected

print(best_hps.get('learning_rate'), best_hps.get('units')) 

In [None]:
#observing performance 

pd.DataFrame(history.history).plot(figsize=(8, 5)) 
plt.grid(True)
plt.gca().set_ylim(0, 1)

In [None]:
#fitting model to test set 

wd_search_preds = model.predict(x_test_encoded_df17)

wd_search_preds = pd.DataFrame(wd_search_preds)
wd_search_preds.to_csv('wd_search_preds.csv')
files.download('wd_search_preds.csv')

#0.98 on Kaggle 

### Additional Hyperparameter Search to Avoid Overfitting

In [None]:
#retrieving the best hyperparameters selected

print(best_hps.get('learning_rate'), best_hps.get('units'))

In [None]:
#observing performance 

pd.DataFrame(history.history).plot(figsize=(8, 5)) 
plt.grid(True)
plt.gca().set_ylim(0, 1)

In [None]:
#fitting model to test set 

wd_search_preds_2 = model.predict(x_test_encoded_df17)

wd_search_preds_2 = pd.DataFrame(wd_search_preds_2)
wd_search_preds_2.to_csv('wd_search_preds_2.csv')
files.download('wd_search_preds_2.csv')

#0.98 on Kaggle

## Wide and Deep Neural Network with text embedding layer 

The final model created was a Wide and Deep Neural Network with an embedding layer. 


In [None]:
import os
import time
import datetime
import numpy as np
import pandas as pd

# Keras
import tensorflow.keras as keras
from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow.keras import callbacks
from tensorflow.keras import backend as K

# Standard ML stuff
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA, TruncatedSVD, FastICA
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

# Oversampling of the minority class
from imblearn.over_sampling import SMOTE

# Plotting
import matplotlib.pyplot as plt

In [None]:
#retrieve train and test sets 
path = "/content/drive/MyDrive/test_task2.csv"
test = pd.read_csv(path)

In [None]:
path = "/content/drive/MyDrive/train_task2.csv"
train = pd.read_csv(path)

In [None]:
train.dtypes


school        float64
sex           float64
age           float64
address       float64
famsize       float64
Pstatus       float64
Medu          float64
Fedu          float64
Mjob          float64
Fjob          float64
reason        float64
guardian      float64
traveltime    float64
studytime     float64
failures      float64
schoolsup     float64
famsup        float64
paid          float64
activities    float64
nursery       float64
higher        float64
internet      float64
romantic      float64
famrel        float64
freetime      float64
goout         float64
Dalc          float64
Walc          float64
health        float64
absences      float64
G1            float64
G2            float64
Grade         float64
dtype: object

In [None]:
numeric_cols = ['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2']
target_col = ['Grade']  # column name(s) of the target, not the column itself
ignored_cols = ['id']
categorical_cols = train.select_dtypes(include='object').columns
categorical_cols = [col for col in categorical_cols if col not in target_col]

In [None]:
from sklearn.preprocessing import LabelEncoder
for col in categorical_cols:
    train[col] = LabelEncoder().fit_transform(train[col])

In [None]:
train[numeric_cols] = StandardScaler().fit_transform(train[numeric_cols])

In [None]:
pca = PCA(n_components=3)
_X = pca.fit_transform(train[numeric_cols + categorical_cols])
pca_data = pd.DataFrame(_X, columns=["PCA1", "PCA2", "PCA3"])
train[["PCA1", "PCA2", "PCA3"]] = pca_data

fica = FastICA(n_components=3)
_X = fica.fit_transform(train[numeric_cols + categorical_cols])
fica_data = pd.DataFrame(_X, columns=["FICA1", "FICA2", "FICA3"])
train[["FICA1", "FICA2", "FICA3"]] = fica_data

tsvd = TruncatedSVD(n_components=3)
_X = tsvd.fit_transform(train[numeric_cols + categorical_cols])
tsvd_data = pd.DataFrame(_X, columns=["TSVD1", "TSVD2", "TSVD3"])
train[["TSVD1", "TSVD2", "TSVD3"]] = tsvd_data

grp = GaussianRandomProjection(n_components=3)
_X = grp.fit_transform(train[numeric_cols + categorical_cols])
grp_data = pd.DataFrame(_X, columns=["GRP1", "GRP2", "GRP3"])
train[["GRP1", "GRP2", "GRP3"]] = grp_data

srp = SparseRandomProjection(n_components=3)
_X = srp.fit_transform(train[numeric_cols + categorical_cols])
srp_data = pd.DataFrame(_X, columns=["SRP1", "SRP2", "SRP3"])
train[["SRP1", "SRP2", "SRP3"]] = srp_data

#tsne = TSNE(n_components=3)
#_X = tsne.fit_transform(train[numeric_cols + categorical_cols])
#tsne_data = pd.DataFrame(_X, columns=["TSNE1", "TSNE2", "TSNE3"])
#train[["TSNE1", "TSNE2", "TSNE3"]] = tsne_data

numeric_cols.extend(pca_data.columns.values)
numeric_cols.extend(fica_data.columns.values)
numeric_cols.extend(tsvd_data.columns.values)
numeric_cols.extend(grp_data.columns.values)
numeric_cols.extend(srp_data.columns.values)

In [None]:
smote = SMOTE(sampling_strategy='minority', random_state=42)
os_smote_X, os_smote_Y = smote.fit_resample(train[numeric_cols + categorical_cols], train[target_col].values.ravel())

train = pd.DataFrame(os_smote_X, columns=numeric_cols + categorical_cols)
train['Grade'] = os_smote_Y
print(train.shape)

KeyError: ignored

In [None]:
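# 'id' may already be absent after the resampling step, so a missing column is ignored
train = train.drop(columns=['id'], errors='ignore')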

In [None]:
K.clear_session()

In [None]:
FEATURE_COLS = numeric_cols + categorical_cols
TARGET_COL = 'Grade'
EPOCHS = 50
BATCH_SIZE = 32
CLASS_WEIGHTS = {0 : 1., 1 : 2.5}

In [None]:
cat_inputs = []
num_inputs = []
embeddings = []
embedding_layer_names = []
emb_n = 10

In [None]:
categorical_cols

['school',
 'sex',
 'address',
 'famsize',
 'Pstatus',
 'Mjob',
 'Fjob',
 'reason',
 'guardian',
 'schoolsup',
 'famsup',
 'paid',
 'activities',
 'nursery',
 'higher',
 'internet',
 'romantic']

In [None]:
# Embedding for categorical features
for col in categorical_cols:
    # each categorical feature is a single integer id per sample
    _input = layers.Input(shape=(1,), name=col)
    _embed = layers.Embedding(int(train[col].max()) + 1, emb_n, name=col+'_emb')(_input)
    cat_inputs.append(_input)
    embeddings.append(_embed)
    embedding_layer_names.append(col+'_emb')
    
# Simple inputs for the numeric features
for col in numeric_cols:
    # each numeric feature is a single value per sample
    numeric_input = layers.Input(shape=(1,), name=col)
    num_inputs.append(numeric_input)
    
# Merge the numeric inputs
merged_num_inputs = layers.concatenate(num_inputs)
#numeric_dense = layers.Dense(31, activation='relu')(merged_num_inputs)

# Merge embeddings and use a Dropout to prevent overfitting
merged_inputs = layers.concatenate(embeddings)
spatial_dropout = layers.SpatialDropout1D(0.2)(merged_inputs)
flat_embed = layers.Flatten()(spatial_dropout)

# Merge embedding and numeric features
all_features = layers.concatenate([flat_embed, merged_num_inputs])

# MLP for classification
x = layers.Dropout(0.2)(layers.Dense(100, activation='relu')(all_features))
x = layers.Dropout(0.2)(layers.Dense(50, activation='relu')(x))
x = layers.Dropout(0.2)(layers.Dense(25, activation='relu')(x))
x = layers.Dropout(0.2)(layers.Dense(15, activation='relu')(x))

# Final model
# softmax gives one probability per class, as sparse categorical cross-entropy expects
output = layers.Dense(21, activation='softmax')(x)
model = models.Model(inputs=cat_inputs + num_inputs, outputs=output)


ValueError: ignored

In [None]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 school (InputLayer)            [(None, 17)]         0           []                               
                                                                                                  
 sex (InputLayer)               [(None, 17)]         0           []                               
                                                                                                  
 address (InputLayer)           [(None, 17)]         0           []                               
                                                                                                  
 famsize (InputLayer)           [(None, 17)]         0           []                               
                                                                                              

In [None]:
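# the model has one named input per feature, so the features are passed as a dict keyed by column name
train_inputs = {col: train[col].values for col in FEATURE_COLS}

# note: validation here reuses the training data, and CLASS_WEIGHTS only covers classes 0 and 1
_hist = model.fit(
    x=train_inputs,
    y=train[TARGET_COL].values,
    validation_data=(train_inputs, train[TARGET_COL].values),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    class_weight=CLASS_WEIGHTS,
    verbose=2
)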

Epoch 1/50


ValueError: ignored

# Summary and Recommendations 

# References 
Guo, C., & Berkhahn, F. (2016). Entity embeddings of categorical variables. arXiv preprint arXiv:1604.06737.

Larsen, J., Hansen, L. K., Svarer, C., & Ohlsson, M. (1996). Design and regularization of neural networks: the optimal use of a validation set. In Neural Networks for Signal Processing VI. Proceedings of the 1996 IEEE Signal Processing Society Workshop (pp. 62-71). IEEE.