# Gender Classification
This notebook visualises how stated preferences (favourite colour, music genre, beverage and soft drink) differ between males and females, and then uses that data to train classifiers which predict a person's gender.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from collections import Counter
from xgboost import XGBClassifier
from keras.models import Sequential
from keras.optimizers import Adam, SGD
from keras.callbacks import ReduceLROnPlateau
from keras.layers import Dense, Dropout, Input
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
pd.options.mode.chained_assignment = None 

In [None]:
df = pd.read_csv('../input/gender-classification/Transformed Data Set - Sheet1.csv')

In [None]:
df['Favorite Music Genre'] = np.where(df['Favorite Music Genre']=='Folk/Traditional', 'Folk', df['Favorite Music Genre'])

In [None]:
df

## Data visualisation
Firstly, we will take a look at the data by creating stacked bar charts which show how many men and women chose each value of the features in our dataframe.
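
As a quick cross-check on the charts, the same per-gender counts can be tabulated with `pd.crosstab`. This is a minimal sketch on a hypothetical toy frame (the rows below are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook's dataframe
toy = pd.DataFrame({
    'Gender': ['F', 'F', 'M', 'M', 'M'],
    'Favorite Color': ['Cool', 'Warm', 'Cool', 'Cool', 'Neutral'],
})

# Rows are categories, columns are genders; missing combinations are filled with 0
counts = pd.crosstab(toy['Favorite Color'], toy['Gender'])
print(counts)
```

Calling `counts.plot(kind='bar', stacked=True)` on such a table should give a stacked bar chart comparable to the ones drawn by `bars`.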

In [None]:
def bars(column):
    fig, ax = plt.subplots(1, 1, figsize=(7, 4))
    
    # Boolean masks select the rows for each gender
    fem = df.loc[df['Gender']=='F', column]
    mal = df.loc[df['Gender']=='M', column]

    fcount = Counter(fem)
    mcount = Counter(mal)

    # Use one shared category order so both bar series line up;
    # a Counter returns 0 for categories that one gender never chose
    cats = sorted(df[column].unique())
    fvals = [fcount[c] for c in cats]
    mvals = [mcount[c] for c in cats]

    f = ax.bar(cats, fvals, width=.5, color='lightgreen')
    m = ax.bar(cats, mvals, bottom=fvals, width=.5, color='orange')

    ax.legend((f, m), ('F', 'M'))
    ax.set_title(column + ' per gender')
    ax.set_xlabel(column)
    ax.set_ylabel('Number of people')
    plt.show()

Here we can see that women prefer cool colours, both genders like neutral ones, and men have a slight preference for warm colours.

In [None]:
bars('Favorite Color')

Women favour rock and pop noticeably more than men do, while men favour hip hop, folk and jazz more.

In [None]:
bars('Favorite Music Genre')

Preferences for the different beverages are split roughly evenly between the genders.

In [None]:
bars('Favorite Beverage')

Here we can see that women like 7UP/Sprite and Coca Cola/Pepsi more, whereas men prefer Fanta and other drinks.

In [None]:
bars('Favorite Soft Drink')

## Feature engineering
Now it is time to engineer our features so that they can be fed into our classifiers.

Firstly, we want to convert the categorical features into numerical ones using a LabelEncoder. Then, we split the 'df' dataframe into X and y.
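
To make the encoding step concrete, here is a small sketch of what `LabelEncoder` does to a single column. The values are hypothetical and only illustrate the mapping; note that the integer codes follow the sorted label order, so they imply an ordering that the categories themselves do not have.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column (not taken from the dataset)
colours = pd.Series(['Warm', 'Cool', 'Neutral', 'Cool'])

encoder = LabelEncoder()
codes = encoder.fit_transform(colours)

# classes_ holds the sorted unique labels; a label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

Keeping one fitted encoder per column (rather than refitting a single one) would allow the codes to be mapped back to labels later with `inverse_transform`.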

In [None]:
le = LabelEncoder()

for col in df:
    df[col] = le.fit_transform(df[col])
    
X = df.drop('Gender', axis=1)
y = df['Gender']

Next, we look at how the distribution of each feature changes under various transformation and scaling techniques. The graphs below show the original values alongside the log, Box-Cox, standard scaler and min-max scaler transformations for each of the four features in X. The log and Box-Cox versions are also added to X as new columns.
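
For reference, the four transformations can be sketched on a single hypothetical feature as follows. The input values are made up; 1 is added before the log and Box-Cox transforms because label-encoded values start at 0 and both transforms need strictly positive input.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical label-encoded feature (illustrative values only)
feature = np.array([0, 1, 1, 2, 3, 3, 3], dtype=float)

log = np.log(feature + 1)                  # log(x + 1)
box_cox, lam = stats.boxcox(feature + 1)   # lam is the fitted Box-Cox lambda
ss = StandardScaler().fit_transform(feature.reshape(-1, 1)).ravel()  # zero mean, unit variance
mm = MinMaxScaler().fit_transform(feature.reshape(-1, 1)).ravel()    # rescaled to [0, 1]

print('standard-scaled mean:', round(float(ss.mean()), 6))
print('min-max range:', mm.min(), mm.max())
```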

In [None]:
for col in list(X.columns):  # snapshot the columns, since new ones are added to X inside the loop
    fig, axes = plt.subplots(1, 5, figsize=(15, 3))
    
    feature = X[col]
    log = (feature+1).transform(np.log)
    box_cox = pd.Series(stats.boxcox(feature+1)[0])
    ss = pd.DataFrame(StandardScaler().fit_transform(np.array(feature).reshape(-1, 1)))
    mm = pd.DataFrame(MinMaxScaler().fit_transform(np.array(feature).reshape(-1, 1)))
    
    feature.hist(ax=axes[0], color='blue')
    log.hist(ax=axes[1], color='orange')
    box_cox.hist(ax=axes[2], color='green')
    ss.hist(ax=axes[3], color='purple')
    mm.hist(ax=axes[4], color='pink')
    
    axes[0].set_title('Original')
    axes[1].set_title('Log transform')
    axes[2].set_title('Box Cox')
    axes[3].set_title('Standard Scaler')
    axes[4].set_title('Min Max Scaler')
    
    X[col+'_log'] = log
    X[col+'_boxcox'] = box_cox

We then split X and y into train and test sets.
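
With a small dataset, a random split can leave the classes unevenly represented in the test set. The sketch below shows one possible way to guard against that, using the `stratify` and `random_state` arguments of `train_test_split` on hypothetical data; these settings are a suggestion and are not part of the original split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 rows, 40 of class 0 and 10 of class 1
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.array([0] * 40 + [1] * 10)

# stratify keeps the class proportions equal in both splits;
# random_state makes the split reproducible between runs
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1
)

print('train class counts:', np.bincount(y_tr))
print('test class counts:', np.bincount(y_te))
```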

In [None]:
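# Fixing random_state makes the split, and therefore the scores below, reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)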

Next, we apply the standard and min-max scaling shown in the plots above. Each scaler is fitted on the training set only and then reused to transform the test set.
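
One thing to watch when attaching scaled values to a shuffled split is pandas index alignment. The sketch below uses a hypothetical three-row frame to show the difference between assigning a pandas object with a fresh 0..n-1 index and assigning the raw NumPy array.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical frames whose indices are shuffled, as they are after train_test_split
train = pd.DataFrame({'x': [10.0, 20.0, 30.0]}, index=[7, 2, 5])
test = pd.DataFrame({'x': [40.0]}, index=[1])

scaler = StandardScaler()
scaled_train = scaler.fit_transform(train[['x']])  # statistics learned from train only
scaled_test = scaler.transform(test[['x']])        # train statistics reused on test

# A Series with a fresh 0..n-1 index is aligned on labels, so most rows become NaN
misaligned = train.assign(x_ss=pd.Series(scaled_train.ravel()))

# A plain NumPy array is assigned by position, so every row gets its own value
aligned = train.assign(x_ss=scaled_train.ravel())

print(misaligned)
print(aligned)
```

If NaNs appear after a scaling step like this, filling them with 0 hides the misalignment rather than fixing it.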

In [None]:
def scale(scaler, ending):
    for col in X.columns[:4]:
        # Assign NumPy arrays rather than DataFrames: a new DataFrame has a 0..n-1 index,
        # which pandas would align against the shuffled split index, leaving NaNs
        X_train[col+'_'+ending] = scaler.fit_transform(X_train[[col]]).ravel()
        X_test[col+'_'+ending] = scaler.transform(X_test[[col]]).ravel()

scale(StandardScaler(), 'ss')
scale(MinMaxScaler(), 'mm')
    
X_train = X_train.fillna(0)
X_test = X_test.fillna(0)

The following heatmap shows the pairwise correlations between the features in X.

In [None]:
sns.heatmap(X.corr(), annot=True)
plt.show()

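Next, we use PCA to project the features onto 10 principal components and plot the share of the total variance that each component explains. Note that this measures variance in the inputs, not how much each original feature contributes to predicting gender.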
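
As an illustration of what `explained_variance_ratio_` reports, here is a small sketch on hypothetical data in which two of the four columns are near-copies of the other two. The ratios refer to principal components, come out sorted in descending order, and sum to 1 when all components are kept.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: two independent columns plus two noisy near-copies of them
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
data = np.column_stack([
    base[:, 0],
    base[:, 1],
    2 * base[:, 0] + 0.01 * rng.normal(size=100),
    base[:, 1] + 0.01 * rng.normal(size=100),
])

pca_demo = PCA(n_components=4).fit(data)
evr_demo = pca_demo.explained_variance_ratio_
cumulative = np.cumsum(evr_demo)

print('per component:', np.round(evr_demo, 4))
print('cumulative:', np.round(cumulative, 4))
```

Because the redundant columns add almost no new information, the first two components account for nearly all of the variance in this toy example.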

In [None]:
pca = PCA(n_components=10)
X_train = pca.fit_transform(X_train, y_train)
X_test = pca.transform(X_test)
evr = pca.explained_variance_ratio_

plt.bar(range(len(evr)), evr, color='lightgreen')
plt.title('Explained variance ratio of the 10 principal components')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component')
plt.show()

## Classifying data
The final stage of our notebook is creating a classification model which predicts whether a person is male or female, based on our given data.

We make five models: SVC, Linear SVC, K Nearest Neighbours, Random Forest and XGBoost. Then, we loop over them, fit each one to the training data, and print its test accuracy and mean cross-validation score.
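
The general evaluation pattern can be sketched as below. This is a minimal illustration on synthetic data from `make_classification`, with only two of the model types and a hypothetical `candidates` dictionary; cross-validation is run on the training split so that the test split stays untouched until the final score.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1
)

candidates = {
    'K Nearest Neighbours': KNeighborsClassifier(),
    'Random Forest': RandomForestClassifier(random_state=1),
}

results = {}
for name, model in candidates.items():
    # cross_val_score clones the model and fits it on each of the 5 training folds
    cv_mean = cross_val_score(model, X_tr, y_tr, cv=5).mean()
    # Fit once on the full training split, then score on the held-out test split
    test_acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    results[name] = (cv_mean, test_acc)
    print(f'{name}: cross val {cv_mean:.2%}, test {test_acc:.2%}')
```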

In [None]:
models = [['SVC', SVC(random_state=1)], ['LinearSVC', LinearSVC(random_state=1, max_iter=1342)], 
          ['K Nearest Neighbours', KNeighborsClassifier()], 
          ['Random Forest', RandomForestClassifier(random_state=1)], 
          ['XGBoost', XGBClassifier(random_state=1)]]
scores = []
cross_vals = []

for classifier in models:
    model = classifier[1]
    model.fit(X_train, y_train)

    score = model.score(X_test, y_test)
    # Cross-validate on the training split; the test split is too small and should stay held out
    cross_val = cross_val_score(model, X_train, y_train).mean()

    scores.append(score)
    cross_vals.append(cross_val)

    print(classifier[0], 'score: ' + str(round(score*100, 2)) + '% cross val: ' + str(round(cross_val*100, 2)) + '%')

Finally, we use bar charts to visualise how accurate each classifier has been in predicting our data.

In [None]:
# Take the names directly; wrapping the fitted estimators in np.array can fail on ragged input
labels = [name for name, _ in models]
    
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 5))

ax1.bar(labels, scores, color='skyblue')
ax1.set_title('Model scores')

ax2.bar(labels, cross_vals, color='pink')
ax2.set_title('Cross val score')

plt.show()

### Thank you for reading my notebook.