# **GENDER CLASSIFICATION**
**By : Garry Ariel**

This notebook walks through processing the given data, building and training the models, and using them to predict gender. The models used here are Logistic Regression (LR) and a Neural Network (NN).

The first step is to import the necessary packages and read in the data.

In [None]:
# Import packages
import numpy as np
import pandas as pd

# Read the data
data_df = pd.read_csv("/kaggle/input/gender-classification/Transformed Data Set - Sheet1.csv")

# Take a look at some data examples
data_df.head(10)

Next, take a look at some summary statistics of the data.

In [None]:
# Describe the data
data_df.describe()
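
As a small illustration of what these summary statistics look like for categorical columns, here is a sketch on a toy dataframe. The values in `toy_df` are made up for the example and are not taken from the real data set. It also shows `value_counts`, which is one way to check how balanced the target classes are.

```python
import pandas as pd

# Toy stand-in for the real data set (values are invented for illustration)
toy_df = pd.DataFrame({
    "Favorite Color": ["Cool", "Warm", "Cool", "Neutral", "Cool"],
    "Gender": ["F", "M", "F", "M", "F"],
})

# For text columns, describe() reports count, unique, top and freq
summary = toy_df.describe()
print(summary)

# Share of each class in the target column
balance = toy_df["Gender"].value_counts(normalize=True)
print(balance)
```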

## **Preprocessing the Data**  
Next, we pre-process the data: the gender labels are mapped to integers and the categorical variables are converted to one-hot encoding.

In [None]:
# Turn male into 1 and female into 0 (any other value becomes NaN)
# Assign the result back so that data_df itself is updated
data_df['Gender'] = data_df['Gender'].map({'F': 0, 'M': 1})

In [None]:
# Create one hot encoding
fav_color_df = pd.get_dummies(data_df[["Favorite Color"]], prefix = "color")
fav_music_df = pd.get_dummies(data_df[["Favorite Music Genre"]], prefix = "music")
fav_beverage_df = pd.get_dummies(data_df[["Favorite Beverage"]], prefix = "beverage")
fav_drink_df = pd.get_dummies(data_df[["Favorite Soft Drink"]], prefix = "drink")
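
To see what one-hot encoding produces, here is a minimal sketch on a toy column. The values are illustrative, and `dtype=int` is an optional choice made here so that the output shows 0/1 instead of booleans.

```python
import pandas as pd

# Toy column with three categories (illustrative values)
toy_df = pd.DataFrame({"Favorite Color": ["Cool", "Warm", "Cool", "Neutral"]})

# One new column per category, named <prefix>_<category>
encoded = pd.get_dummies(toy_df[["Favorite Color"]], prefix="color", dtype=int)
print(encoded)
```

Each row has exactly one 1, placed in the column of its original category.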

In [None]:
# Merging one hot encoding and create new dataframe
# All four dataframes share the same index, so they can be joined column-wise
transformed_df = pd.concat([fav_color_df, fav_music_df, fav_beverage_df, fav_drink_df], axis = 1)

# Take a look at some data examples
transformed_df.head(10)

## **Feature Selection**
Next, we select the features that will be fed to the model. There are three ways to select them.

1. Choose the features manually.
Experiment to find which features give higher accuracy.

In [None]:
# Choose feature (Manual)
feature = [
    "music_Electronic",
    "music_Hip hop",
    "music_Jazz/Blues",
    "music_Pop",
    "music_R&B and soul",
    "beverage_Vodka",
    "drink_Other"
]

2. Use all features without filtering.

In [None]:
# Choose all features
# feature = list(transformed_df.columns)

3. Choose features based on their correlation with the gender variable. Specify a threshold; every feature whose absolute correlation with gender exceeds the threshold is chosen.

In [None]:
# Choose feature (By rule)
# feature = []
# analyze_df = pd.merge(transformed_df, data_df["Gender"], left_index = True, right_index = True)
# for index, row in analyze_df.corr().iterrows():
#     if abs(row["Gender"]) > 0.08 and index != "Gender":
#         feature.append(index)
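
The threshold rule can also be written as a small reusable function. The sketch below is one possible formulation, not the notebook's own code: the helper name `select_by_correlation` and the toy data are assumptions made for illustration.

```python
import pandas as pd

def select_by_correlation(features_df, target, threshold):
    """Return the columns whose absolute Pearson correlation
    with the target is greater than the threshold."""
    corr = features_df.astype(float).corrwith(target.astype(float)).abs()
    # Constant columns yield NaN, which fails the comparison and is dropped
    return corr[corr > threshold].index.tolist()

# Toy one-hot features and target (illustrative values)
toy_features = pd.DataFrame({
    "music_Pop":   [1, 1, 0, 0, 1, 0],
    "music_Rock":  [0, 1, 0, 1, 0, 1],
    "drink_Other": [1, 0, 1, 0, 1, 0],
})
toy_gender = pd.Series([1, 1, 0, 0, 1, 0], name="Gender")

selected = select_by_correlation(toy_features, toy_gender, threshold=0.5)
print(selected)
```

A lower threshold keeps more features; with `threshold=0.08` all three toy columns pass.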

## **Preparing Data**
In the following step, we format the data so that it can be fed into the models. We also split the data into training and test sets with a ratio of 4:1.

In [None]:
# Import packages related to training model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, matthews_corrcoef, accuracy_score

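# Turn into numpy array
# Cast to float so both models receive numeric input (recent pandas versions return boolean dummies)
X = np.asarray(transformed_df[feature], dtype = np.float32)
y = np.asarray(data_df['Gender'])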

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
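
For a small data set, it can help to keep the class proportions equal in both splits. The sketch below shows the same 4:1 split on toy arrays with the optional `stratify` argument of `train_test_split`. The toy arrays and the use of stratification are assumptions added for illustration; the notebook itself splits without it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, perfectly balanced labels
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# test_size = 0.2 gives a 4:1 train/test ratio;
# stratify keeps the 50/50 label balance in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(len(X_tr), len(X_te))
```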

In [None]:
# Preprocess train data
# The column names are the selected feature names
header = list(feature)

x_df = pd.DataFrame(
    X_train,
    columns = header
)

y_df = pd.DataFrame(
    y_train,
    columns = ["gender"]
)

train_df = pd.merge(x_df, y_df, left_index = True, right_index = True)

# Look at the correlation
corr_df = train_df.corr()
corr_df.head(len(feature))

## **Build and Train Logistic Regression Model**

We simply fit the model on the training data using the default parameters. After training, we use the model to predict gender on the test set and evaluate the accuracy. To experiment with the accuracy, change the features selected in the previous steps.

In [None]:
# Create logistic regression
LR = LogisticRegression().fit(X_train, y_train)

In [None]:
# Predict result
y_predict = LR.predict(X_test)

# Evaluate the accuracy (true labels first, then predictions)
score = accuracy_score(y_test, y_predict)

# Print result
print("The accuracy is " + str(score))
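
Accuracy alone can hide which class is being misclassified. The notebook imports `confusion_matrix` and `matthews_corrcoef` but does not use them, so here is a minimal sketch of how they could complement the accuracy. The toy labels below are invented for illustration and are not results from the model.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, matthews_corrcoef

# Toy true labels and predictions (illustrative, not model output)
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 1, 0]

acc = accuracy_score(y_true_toy, y_pred_toy)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)

# Matthews correlation coefficient: 1 is perfect, 0 is chance level
mcc = matthews_corrcoef(y_true_toy, y_pred_toy)

print("Accuracy:", acc)
print("Confusion matrix:\n", cm)
print("MCC:", mcc)
```

On the real data, the same calls would take `y_test` and `y_predict` as arguments.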

## **Build and Train Neural Network**

The model used here consists of the following layers.
1. Fully connected layer with 128 neurons using ReLU activation function.
2. Dropout layer with probability 0.4.
3. Fully connected layer with 2 neurons (as output) using softmax activation function.
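
To make the three layers concrete, here is a minimal NumPy sketch of a forward pass through the same architecture. It is only an illustration: the weights are random placeholders rather than trained values, the function name `forward` and the toy input are assumptions, and the 7 input features correspond to the length of the manual feature list.

```python
import numpy as np

rng = np.random.default_rng(42)
n_features, n_hidden, n_classes = 7, 128, 2

# Random placeholder weights (a trained model would learn these)
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

def forward(x, training=False, drop_rate=0.4):
    # 1. Fully connected layer with ReLU activation
    hidden = np.maximum(0.0, x @ W1 + b1)
    # 2. Dropout: active only during training; zero out units with
    #    probability drop_rate and rescale the survivors
    if training:
        keep_mask = rng.random(hidden.shape) >= drop_rate
        hidden = hidden * keep_mask / (1.0 - drop_rate)
    # 3. Fully connected output layer with softmax activation
    logits = hidden @ W2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Three toy samples of one-hot style inputs
x_toy = rng.integers(0, 2, size=(3, n_features)).astype(float)
probs = forward(x_toy)
print(probs)
```

Each output row holds two class probabilities that sum to 1.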

In [None]:
# Using NN model
import tensorflow as tf
from tensorflow import keras

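# Create callback that stops training once the validation metrics are good enough
class myCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs = None):
        logs = logs or {}
        val_accuracy = logs.get('val_accuracy')
        val_loss = logs.get('val_loss')
        # Skip the check if validation metrics are missing for this epoch
        if val_accuracy is None or val_loss is None:
            return
        if ((val_accuracy > 0.72 and val_loss <= 0.5931) or val_accuracy >= 0.9):
            self.model.stop_training = True
            print("Stop here")
callback = myCallback()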

# Build model
tf.random.set_seed(42)
model = keras.Sequential([
    keras.layers.Dense(128, activation = 'relu', input_shape = [len(feature)]),
    keras.layers.Dropout(0.4),
    keras.layers.Dense(2, activation = 'softmax')
])

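# Compile model
# The labels are integers (0 or 1) and the output layer has 2 softmax units,
# so sparse categorical cross-entropy is the matching loss.
# Note: the val_loss threshold in the callback depends on the loss used here.
model.compile(
    loss = 'sparse_categorical_crossentropy',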
    optimizer = keras.optimizers.Adam(0.001),
    metrics = ['accuracy']
)

# Fit the model
model.fit(
    X_train, y_train,
    epochs = 200,
    batch_size = 1,
    verbose = 1,
    validation_split = 0.2,
    callbacks = [callback]
)
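
The stopping rule used by the callback can be checked in isolation as a plain function, without training a model. The helper name `should_stop` and its parameter names are hypothetical; the threshold values are the ones used in the callback.

```python
def should_stop(logs, acc_floor=0.72, loss_ceiling=0.5931, acc_target=0.9):
    """Return True if training should stop for the given epoch logs."""
    logs = logs or {}
    val_acc = logs.get("val_accuracy")
    val_loss = logs.get("val_loss")
    # Without a validation accuracy there is nothing to decide on
    if val_acc is None:
        return False
    # Condition 1: decent accuracy together with a low enough loss
    good_enough = (
        val_loss is not None and val_acc > acc_floor and val_loss <= loss_ceiling
    )
    # Condition 2: accuracy alone reaches the target
    return good_enough or val_acc >= acc_target

print(should_stop({"val_accuracy": 0.75, "val_loss": 0.55}))
print(should_stop({"val_accuracy": 0.75, "val_loss": 0.70}))
```

Because `batch_size = 1` and the validation split is small, these metrics can fluctuate a lot between epochs, so the epoch at which training stops may vary with the chosen features.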

Then we use the trained NN model to predict gender and evaluate the accuracy. As before, we can experiment with the accuracy by changing the features used or some of the parameters.

In [None]:
# Predict result (if the last layer uses softmax)
y_predict = model.predict(X_test)

# Take the class with the highest probability for each sample
result = np.argmax(y_predict, axis = 1)

# Evaluate the accuracy (true labels first, then predictions)
score = accuracy_score(y_test, result)

# Print result
print("The accuracy is " + str(score))