# 1. Introduction  

This is a multi-class classification problem, meaning that there are more than two classes to be predicted. In this dataset the target, `Var_1`, takes seven category values. This is an important type of problem on which to practice with neural networks because a multi-class target requires specialized handling, such as one-hot encoding.

# 2. Import Classes and Functions

We can begin by importing all of the classes and functions we will need in this tutorial.

This includes the functionality we require from Keras, as well as data loading from pandas and data preparation and model evaluation from scikit-learn.

In [1]:
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
import pandas as pd 
import numpy as np 

# 3. Load our data  


In [2]:
df = pd.read_csv('data/customertrain.csv')
df.head()

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,462809,Male,No,22,No,Healthcare,1.0,Low,4.0,Cat_4,D
1,462643,Female,Yes,38,Yes,Engineer,,Average,3.0,Cat_4,A
2,466315,Female,Yes,67,Yes,Engineer,1.0,Low,1.0,Cat_6,B
3,461735,Male,Yes,67,Yes,Lawyer,0.0,High,2.0,Cat_6,B
4,462669,Female,Yes,40,Yes,Entertainment,,High,6.0,Cat_6,A


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8068 entries, 0 to 8067
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               8068 non-null   int64  
 1   Gender           8068 non-null   object 
 2   Ever_Married     7928 non-null   object 
 3   Age              8068 non-null   int64  
 4   Graduated        7990 non-null   object 
 5   Profession       7944 non-null   object 
 6   Work_Experience  7239 non-null   float64
 7   Spending_Score   8068 non-null   object 
 8   Family_Size      7733 non-null   float64
 9   Var_1            7992 non-null   object 
 10  Segmentation     8068 non-null   object 
dtypes: float64(2), int64(2), object(7)
memory usage: 693.5+ KB


We can see that we have 8068 training examples, but we do have some things to sort out:  

- We will need to deal with the null values in some of the features; we will fill them in with computed values
- Our output variable 'Y' (`Var_1`) also has nulls; we will remove those rows
- We will one-hot encode our Y  
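
The `df.info()` output above already tells us how much is missing. As a quick check, the sketch below subtracts each non-null count from the 8068 total rows. The numbers are copied from that output; the names `non_null` and `missing` are only illustrative.

```python
# Non-null counts copied from the df.info() output above
total_rows = 8068
non_null = {
    "Ever_Married": 7928,
    "Graduated": 7990,
    "Profession": 7944,
    "Work_Experience": 7239,
    "Family_Size": 7733,
    "Var_1": 7992,
}

# Missing values per column = total rows - non-null count
missing = {col: total_rows - n for col, n in non_null.items()}
for col, n in missing.items():
    print(f"{col}: {n} missing")

# Rows left once the rows with a missing Var_1 are dropped
rows_after_drop = total_rows - missing["Var_1"]
print(rows_after_drop)  # 7992
```

`Work_Experience` has by far the most gaps, so the choice of fill method matters most for that column.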

# 5. Hot Encoding Y

The output variable contains seven different string values.

When modeling multi-class classification problems using neural networks, it is good practice to reshape the output attribute from a vector that holds one class value per instance into a matrix with one boolean column per class, where each row marks which class that instance belongs to.

This is called `one hot encoding` or creating dummy variables from a categorical variable.

For example, in this problem the seven class values are [1,2,3,4,5,6,7]. We can turn this into a one-hot encoded binary matrix for each data instance that would look as follows:    
  
![onehot](./images/y_one_hot.png)
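
To make the idea concrete, here is a minimal NumPy sketch of one-hot encoding. It is not the Keras implementation used in this tutorial, and the four sample labels are made up, so it produces three columns rather than seven.

```python
import numpy as np

# Hypothetical sample of category labels (the real column is Var_1)
labels = np.array(["Cat_4", "Cat_4", "Cat_6", "Cat_1"])

# Map each distinct label to an integer index (sorted alphabetically)
classes, encoded = np.unique(labels, return_inverse=True)

# Each row of the identity matrix is the one-hot vector for one class
one_hot = np.eye(len(classes))[encoded]

print(classes)   # ['Cat_1' 'Cat_4' 'Cat_6']
print(encoded)   # [1 1 2 0]
print(one_hot)
```

Every row contains exactly one 1, in the column that belongs to that row's class.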

In [4]:
# Drop rows with no Output values
df.dropna(subset=['Var_1'], inplace=True)

# extract Y and drop from dataframe
Y = df["Var_1"]
df = df.drop(["Var_1"], axis=1)

# encode class values as integers
yencoder = LabelEncoder()
yencoder.fit(Y)
encoded_Y = yencoder.transform(Y)

# convert integers to one-hot encoded vectors
hot_y = np_utils.to_categorical(encoded_Y)
pd.DataFrame(hot_y).head()

Unnamed: 0,0,1,2,3,4,5,6
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0


# 4. Prepare our features  

Our features which contain strings need to be converted to numbers. After all, we can't calculate equations on strings. We will also drop the Segmentation and ID fields, since they are not needed.

In [5]:
# Encode string features
columns = ["Gender","Ever_Married","Graduated","Profession","Spending_Score"]
for feature in columns:
    le = LabelEncoder()
    df[feature] = le.fit_transform(df[feature])

df = df.drop(["Segmentation","ID"], axis=1)        
df.head()

Unnamed: 0,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size
0,1,0,22,0,5,1.0,2,4.0
1,0,1,38,1,2,,0,3.0
2,0,1,67,1,2,1.0,2,1.0
3,1,1,67,1,7,0.0,1,2.0
4,0,1,40,1,3,,1,6.0
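
Note that label encoding assigns codes in alphabetical order, which is why `Low` became 2 and `Average` became 0 in the output above. The sketch below reproduces that behaviour with NumPy on a made-up sample, and contrasts it with an explicit ordinal mapping. The `order` dictionary is an assumption for illustration and is not used in this tutorial.

```python
import numpy as np

# Hypothetical sample of the Spending_Score column
scores = np.array(["Low", "Average", "Low", "High", "High"])

# Alphabetical coding, which is what LabelEncoder produces
classes, codes = np.unique(scores, return_inverse=True)
print(dict(zip(classes.tolist(), range(len(classes)))))  # {'Average': 0, 'High': 1, 'Low': 2}
print(codes.tolist())  # [2, 0, 2, 1, 1]

# Alternative (an assumption, not used in this tutorial): an explicit
# mapping that keeps the natural order Low < Average < High
order = {"Low": 0, "Average": 1, "High": 2}
ordinal = np.array([order[s] for s in scores])
print(ordinal.tolist())  # [0, 1, 0, 2, 2]
```

Whether the ordinal version helps the network is something to test rather than assume.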


An important part of data preparation is understanding which feature values are missing. We can choose to ignore all rows with missing values, or fill them in with the mode, median or mean.  

- Mode = most common value
- Median = middle value
- Mean = average

Here is a handy function you can call which will fill in the missing values using your chosen method. We will fill in values with the mean.  

After running the cell below, you should see 7992 non-null values for every column.

In [6]:
def fillmissing(df, feature, method):
  if method == "mode":
    df[feature] = df[feature].fillna(df[feature].mode()[0])
  elif method == "median":
    df[feature] = df[feature].fillna(df[feature].median())
  else:
    df[feature] = df[feature].fillna(df[feature].mean())

features_missing = df.columns[df.isna().any()]
for feature in features_missing:
  fillmissing(df, feature= feature, method= "mean")
  
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7992 entries, 0 to 8067
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Gender           7992 non-null   int64  
 1   Ever_Married     7992 non-null   int64  
 2   Age              7992 non-null   int64  
 3   Graduated        7992 non-null   int64  
 4   Profession       7992 non-null   int64  
 5   Work_Experience  7992 non-null   float64
 6   Spending_Score   7992 non-null   int64  
 7   Family_Size      7992 non-null   float64
dtypes: float64(2), int64(6)
memory usage: 561.9 KB
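
The three fill strategies can give quite different values when a column is skewed. Here is a small NumPy sketch on a made-up column; the array `col` and its values are assumptions chosen to make the difference visible.

```python
import numpy as np

# Hypothetical column with two missing entries
col = np.array([1.0, 1.0, 2.0, 8.0, np.nan, np.nan])
known = col[~np.isnan(col)]

# Candidate fill values
mean_fill = known.mean()                     # 3.0
median_fill = np.median(known)               # 1.5
values, counts = np.unique(known, return_counts=True)
mode_fill = values[counts.argmax()]          # 1.0

# Fill the gaps with the mean, as done in the tutorial
filled = np.where(np.isnan(col), mean_fill, col)
print(mean_fill, median_fill, mode_fill)
print(filled)
```

The single large value pulls the mean well above the median and mode, so the mean is the most sensitive of the three to outliers.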


Finally, let's extract our features into X.

# Normalise X  

Now, let's normalise X so that the values lie between -1 and 1. We do this to bring all features into a similar range. We use the following equation  

$X_{(i)} = 2 \cdot \frac{x_{(i)}-min(x)}{max(x)-min(x)} - 1$  
  
The goal of this rescaling is to bring all the features down to a common scale without distorting the relative differences within each feature. Note that this is min-max scaling, not standardization: standardization would instead rescale each feature to have mean 0 and variance 1.


In [7]:
# Alternative (not used here): standardise to mean 0 and standard deviation 1
# X = df.to_numpy()
# mu = X.mean(0)     # per-feature mean
# sigma = X.std(0)   # per-feature standard deviation
# xn = (X - mu) / sigma

# Normalise features to the range -1 (minimum) to 1 (maximum)
scaler = MinMaxScaler(feature_range=(-1, 1))
X = scaler.fit_transform(df)
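
To see what `MinMaxScaler(feature_range=(-1, 1))` does to each column, the scaling can be reproduced by hand. This is a sketch on a made-up matrix `X_raw`, not a replacement for the scaler.

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features
X_raw = np.array([[22.0, 1.0],
                  [38.0, 3.0],
                  [67.0, 9.0]])

# Column-wise minimum and maximum
x_min = X_raw.min(axis=0)
x_max = X_raw.max(axis=0)

# Scale to [0, 1], then stretch and shift to [-1, 1]
X_scaled = 2 * (X_raw - x_min) / (x_max - x_min) - 1
print(X_scaled)
```

Each column's smallest value maps to -1 and its largest to 1. This hand-written version would divide by zero on a constant column, which is one reason to prefer the library scaler in practice.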


# 6. Define The Neural Network Model

There is a KerasClassifier class in Keras that can be used as an Estimator in scikit-learn, the base type of model in the library. The KerasClassifier takes the name of a function as an argument. This function must return the constructed neural network model, ready for training.

Below is a function that will create a baseline neural network for the customer classification problem. It creates a simple fully connected network with two hidden layers that contain 16 neurons each.

The hidden layers use a rectifier (ReLU) activation function, which is good practice. Because we used a one-hot encoding for our customer dataset, the output layer must create 7 output values, one for each class. The output with the largest value will be taken as the class predicted by the model.

So, now you are asking “What are reasonable numbers to set these to?”  

- Input layer = set to the number of features (ie. 8). Keras adds the bias term to each layer for you
- Hidden layers = set to input_layer * 2 (ie. 16)
- Output layer = set to the number of labels in Y. In our case, this is 7 categories

The network topology of this simple neural network can be summarized as:

```
8 inputs -> [16 hidden nodes] -> [16 hidden nodes] -> 7 outputs
```

Note that we use a **softmax** activation function in the output layer. This ensures the output values are in the range of 0 to 1 and sum to 1, so they may be used as predicted probabilities.
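
As an illustration of what softmax computes, here is a minimal NumPy sketch. The `softmax` helper and the seven raw scores are made up for the example; Keras applies the same idea inside the output layer.

```python
import numpy as np

def softmax(z):
    """Turn a vector of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow and does not change the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical raw outputs of the last layer for one customer (7 classes)
scores = np.array([0.5, 1.2, -0.3, 2.0, 0.1, 1.5, -1.0])
probs = softmax(scores)

print(probs.round(3))
print(probs.sum())           # 1.0 up to floating-point rounding
print(int(probs.argmax()))   # 3, the index of the largest score
```

The largest raw score always receives the largest probability, so taking the argmax of the probabilities gives the predicted class.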

Finally, the network uses the efficient **Adam gradient descent optimization algorithm** with a logarithmic loss function, which is called **categorical_crossentropy** in Keras.
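
For intuition, categorical cross-entropy can be sketched in a few lines of NumPy. This follows the textbook formula and is not Keras's exact implementation; the function name, the `eps` clipping value and the sample arrays are assumptions for illustration.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    # Clip to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

# Two hypothetical samples with three classes
y_true = np.array([[0, 0, 1],
                   [1, 0, 0]])
confident = np.array([[0.1, 0.1, 0.8],
                      [0.9, 0.05, 0.05]])
unsure = np.array([[0.3, 0.4, 0.3],
                   [0.4, 0.3, 0.3]])

print(categorical_crossentropy(y_true, confident))  # lower loss
print(categorical_crossentropy(y_true, unsure))     # higher loss
```

Only the probability assigned to the true class contributes to the loss, so confident correct predictions are rewarded with a value close to zero.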

In [8]:
# define baseline model
def baseline_model():
	# create model
	model = Sequential()
	# Rectified Linear Unit Activation Function
	model.add(Dense(16, input_dim=8, activation='relu'))
	model.add(Dense(16, activation = 'relu'))
	# Softmax for multi-class classification
	model.add(Dense(7, activation='softmax'))
	# Compile model
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model
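
As a sanity check on the size of this network, the number of trainable parameters can be worked out by hand. The helper `dense_params` is hypothetical; the layer sizes are taken from `baseline_model` above.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

# Layer sizes taken from baseline_model: 8 -> 16 -> 16 -> 7
sizes = [8, 16, 16, 7]
per_layer = [dense_params(a, b) for a, b in zip(sizes, sizes[1:])]

print(per_layer)       # [144, 272, 119]
print(sum(per_layer))  # 535
```

Calling `model.summary()` on the built model should report the same per-layer counts.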

We can now create our KerasClassifier for use in scikit-learn.

We can also pass arguments in the construction of the KerasClassifier class that will be passed on to the fit() function internally used to train the neural network. Here, we pass the number of epochs as 200 and batch size as 100 to use when training the model. Progress output is also turned off during training by setting verbose to 0.  

Advantages of using a batch size < number of all samples:  

- It requires less memory. Since you train the network using fewer samples, the overall training procedure requires less memory. That's especially important if you are not able to fit the whole dataset in your machine's memory.  
- Typically networks train faster with mini-batches. That's because the weights are updated after every batch rather than once per pass over the whole dataset.  
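
The batch size also determines how many weight updates happen in each epoch. A rough sketch, assuming all 7992 rows are used for training (during cross-validation each fold trains on fewer rows):

```python
import math

n_samples = 7992  # rows left after dropping missing Var_1 values

for batch_size in (5, 100, n_samples):
    # One weight update per batch; the last batch may be smaller
    updates = math.ceil(n_samples / batch_size)
    print(f"batch_size={batch_size}: {updates} updates per epoch")
```

A batch size of 5 gives roughly twenty times as many updates per epoch as a batch size of 100, which helps explain why small batches take much longer in wall-clock time.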

In [9]:
# model = baseline_model()
cmodel = KerasClassifier(build_fn=baseline_model, epochs=200, batch_size=100, verbose=0)

# 6. Evaluate The Model with k-Fold Cross Validation

Now, let's evaluate the neural network model on our training data.

scikit-learn provides various techniques for evaluating models. The gold standard for evaluating machine learning models is **k-fold cross validation**.

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.

It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.

The general procedure is as follows:

1. Shuffle the dataset randomly.  
2. Split the dataset into k groups  
3. For each unique group:  
3.1 Take the group as a hold out or test data set  
3.2 Take the remaining groups as a training data set  
3.3 Fit a model on the training set and evaluate it on the test set  
3.4 Retain the evaluation score and discard the model  
4. Summarize the skill of the model using the sample of model evaluation scores  
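
The steps above can be sketched with plain NumPy. This is an illustration of the procedure, not scikit-learn's implementation; the function `kfold_indices` and the stand-in "score" are assumptions.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)   # step 1: shuffle
    folds = np.array_split(indices, k)     # step 2: split into k groups
    for i in range(k):                     # step 3: each group is the test set once
        test_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx

# Hypothetical dataset of 10 samples, 5 folds
scores = []
for train_idx, test_idx in kfold_indices(10, k=5):
    # A real run would fit on train_idx and score on test_idx;
    # here we record the test fold size as a stand-in score
    scores.append(len(test_idx))

print(scores)            # [2, 2, 2, 2, 2]
print(np.mean(scores))   # step 4: summarise
```

Every sample lands in the test set exactly once, so each one contributes to the final estimate.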

Let's define the model evaluation procedure. Here, we  

- Set the number of folds to 10 (a good default) 
- Shuffle the data before partitioning it. 

In [10]:
kfold = KFold(n_splits=10, shuffle=True)

Now we can evaluate our model (cmodel) on our dataset (X and hot_y) using a 10-fold cross-validation procedure (kfold).

Evaluating the model returns an array with one accuracy score for each of the 10 models constructed, one per split of the dataset. We report the mean and standard deviation of these scores.

In [11]:
result = cross_val_score(cmodel, X, hot_y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (result.mean()*100, result.std()*100))

Baseline: 65.90% (1.73%)


In [12]:
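model = baseline_model()
# baseline_model() already compiles the network; recompile here only to make
# the settings explicit. A softmax output with one-hot targets needs
# categorical_crossentropy, not binary_crossentropy.
model.compile(loss='categorical_crossentropy', 
                  optimizer='adam', 
                  metrics=['accuracy'])
                  
# hold out the last 33% of the rows for validation during training
history = model.fit(X, hot_y, validation_split=0.33, epochs=200, batch_size=100, verbose=0)
# evaluate the keras model (on all rows, including those it was trained on)
_, accuracy = model.evaluate(X, hot_y)
print('Accuracy: %.2f' % (accuracy*100))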

Accuracy: 66.47


In [None]:
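# predicted class probabilities, one column per class
pred = model.predict(X)
pd.DataFrame(pred)
# print(f'Training Set Accuracy: {(pred.argmax(axis=1) == encoded_Y).mean() * 100:f}')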
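
A softmax model outputs one probability per class for each row, so the last step is to map those probabilities back to category names. Here is a small NumPy sketch; the class names, probabilities and true labels are all made up for illustration.

```python
import numpy as np

# Hypothetical class names, in the order the label encoder would store them
class_names = np.array(["Cat_1", "Cat_2", "Cat_3", "Cat_4", "Cat_5", "Cat_6", "Cat_7"])

# Hypothetical softmax outputs for two customers
probs = np.array([
    [0.05, 0.05, 0.10, 0.60, 0.05, 0.10, 0.05],
    [0.02, 0.03, 0.05, 0.10, 0.05, 0.70, 0.05],
])

# The predicted class is the column with the highest probability
pred_idx = probs.argmax(axis=1)
print(class_names[pred_idx].tolist())  # ['Cat_4', 'Cat_6']

# Accuracy against hypothetical true class indices
true_idx = np.array([3, 2])
print((pred_idx == true_idx).mean())   # 0.5
```

In this notebook, the fitted `yencoder` could play the role of `class_names` by passing the predicted indices to its `inverse_transform` method.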

# 7. Conclusion

In this post you discovered how to develop and evaluate a neural network using the Keras Python library for deep learning.
You learned:  
  
- How to load data and make it available to Keras.  
- How to prepare multi-class classification data for modeling using one hot encoding.  
- How to use Keras neural network models with scikit-learn.  
- How to define a neural network using Keras for multi-class classification.  
- How to evaluate a Keras neural network model using scikit-learn with k-fold cross validation.  

Some interesting things to observe:  

With batch size of 5, we end up with 66.10% accuracy:  

- Without normalising, it takes 3200 seconds for cross_val_score  
- With normalising, it takes 1422 seconds for cross_val_score  

With batch size of 100, we end up with an accuracy of 66.38%:  

- Without normalising, it takes 83 seconds for cross_val_score  
- With normalising, it takes 78 seconds for cross_val_score  