# GENDER DETECTION PROBLEM

## **ML ROADMAP**

1. Define a measurable objective
    Metrics: Precision, Recall, F1 score
2. EDA (Exploratory Data Analysis): understand the available data as thoroughly as possible
    - Shape analysis:
        - Identify the target
        - Number of rows and columns
        - Variable types
        - Identify missing values
    - Content analysis:
        - Visualize the target (Histogram / Boxplot)
        - Understand the different variables (Internet research)
        - Visualize the features-target relationships (Histogram / Boxplot)
        - Identify outliers
3. Pre-processing
    - Create the Train set / Test set
    - Remove NaN values
    - Encoding
    - Remove outliers that hurt the model (after the first model)
    - Feature Selection
    - Feature Engineering (create new variables) (Polynomial features) (PCA)
    - Feature Scaling
4. Modeling
    - Define an evaluation function
    - Train different models
    - Optimize with GridSearchCV
    - (Optional) Error analysis and return to Pre-processing / EDA
    - Learning Curve and decision making.


## **IMPORTING PACKAGES**

In [None]:
import re
import joblib
import numpy as np
import unicodedata
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (f1_score, 
                             accuracy_score,
                             precision_score,)
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model  import SGDClassifier
from camel_tools.utils.charmap import CharMapper
from sklearn.linear_model import (LogisticRegression, 
                                  RidgeClassifier)
from sklearn.model_selection import (train_test_split, 
                                     learning_curve, 
                                     ShuffleSplit, 
                                     GridSearchCV)

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (confusion_matrix, 
                             ConfusionMatrixDisplay)
from sklearn.ensemble import (AdaBoostClassifier, 
                              ExtraTreesClassifier, 
                              RandomForestClassifier, 
                              GradientBoostingClassifier, 
                              HistGradientBoostingClassifier)

## **EXPLORATORY DATA ANALYSIS**

- Importing Data for EDA

In [None]:
names_df=pd.read_csv('Datasets/full_names.csv')

1. Let's look at some stats and visualizations 😀

In [None]:
# Share of each gender in the data set (proportions instead of raw counts)
names_df.gender.value_counts(normalize=True)

In [None]:
names_df['name'].apply(lambda x: len(x)).max()

In [None]:
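# Take the labels from the counts themselves so that they always match the slices
gender_share = names_df.gender.value_counts(normalize=True)
plt.pie(gender_share, labels=gender_share.index, explode=[0.1, 0], autopct='%1.1f%%',
        shadow=True, startangle=90)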

In [None]:
wc = WordCloud(width=600, height=600, max_words=100, background_color='white').generate_from_frequencies(names_df.name.value_counts())
plt.figure(figsize=(12, 8))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

## **PRE-PROCESSING**

1. Defining the pre-processing functions

- First we do some cleaning, starting with the removal of emojis

In [None]:
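def remove_emojis(data):
    # Character class covering emoji and related symbol blocks
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (regional indicators)
        u"\U00002500-\U00002BEF"  # box drawing, geometric shapes, misc symbols and arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # very broad range: also spans CJK and Hangul characters
        u"\U0001f926-\U0001f937"  # person gestures
        u"\U00010000-\U0010ffff"  # all supplementary planes
        u"\u2640-\u2642"          # gender signs
        u"\u2600-\u2B55"          # misc symbols up to U+2B55
        u"\u200d"                 # zero-width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"                 # variation selector-16
        u"\u3030"                 # wavy dash
                      "]+", re.UNICODE)
    return emoj.sub('', data)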

- then we remove all accents

In [None]:
def strip_accents(text):
    """
    Strip accents from input String.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    # Decompose accented characters (NFD), then drop the combining marks
    # by discarding everything that is not ASCII
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore').decode('ascii')
    return text

def text_to_id(text):
    """
    Convert input text to id.

    :param text: The input string.
    :type text: String.

    :returns: The processed String.
    :rtype: String.
    """
    if not re.search('[ء-ي]', text):
        text = strip_accents(text.lower())
        text = re.sub('[^0-9a-zA-Z_-]', ' ', text)
    return text
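
To make the effect of the two helpers above concrete, here is a minimal, self-contained sketch that redefines compact versions of them and runs them on a few sample names. The sample names are hypothetical and only serve as an illustration.

```python
import re
import unicodedata

def strip_accents(text):
    # NFD splits 'é' into 'e' + a combining accent; the ASCII round trip drops the accent
    decomposed = unicodedata.normalize('NFD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

def text_to_id(text):
    # Names containing Arabic letters are returned untouched
    if not re.search('[ء-ي]', text):
        text = strip_accents(text.lower())
        text = re.sub('[^0-9a-zA-Z_-]', ' ', text)
    return text

samples = ['Élodie', 'José-María', "N'Golo", 'محمد']  # hypothetical names
cleaned = [text_to_id(name) for name in samples]
print(cleaned)
```

Accents are dropped, hyphens are kept, any other punctuation becomes a space, and the Arabic name is left as it is for the transliteration step.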

- Here we create a function that respells the transliterated names, based on tests that I ran myself.\
For example, I found that "|" was supposed to be an "A" and that "p" has to be replaced by an "a"\
whenever the conditions defined in the `arabic_translation` function are met.

In [None]:
def respell(s):
    # Buckwalter symbols mapped to the spelling we want in the final name
    respellings = {
        '|': 'A',
        "'": '',
        'p': 'a',
        'Y': 'a',
        '<': 'I',
        '$': 'sh',
        '>': 'A',
        '*': 'd',
        '~': '',
        '}': '',
        '&': 'a',
        'Z': 'T'
    }
    # Replace every occurrence of each symbol in a single pass
    return s.translate(str.maketrans(respellings))
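
A name can contain the same symbol more than once, so it matters whether a respelling touches only the first occurrence or all of them. The sketch below contrasts the two behaviours. The shortened mapping, the helper names `respell_first` and `respell_all`, and the sample string are assumptions made for illustration only.

```python
# Shortened, hypothetical subset of the respelling table
respellings = {'p': 'a', '$': 'sh', '}': ''}

def respell_first(s):
    # Replaces only the first occurrence of each symbol
    for wrong, right in respellings.items():
        s = s.replace(wrong, right, 1)
    return s

def respell_all(s):
    # Replaces every occurrence of each symbol in one pass
    return s.translate(str.maketrans(respellings))

sample = 'pp$}'  # made-up string with a repeated symbol
first_only = respell_first(sample)
everywhere = respell_all(sample)
print(first_only)
print(everywhere)
```

With first-occurrence replacement the second `p` survives, while the `str.translate` version rewrites both.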

In [None]:
def arabic_translation(names_df, ar2bw = CharMapper.builtin_mapper('ar2bw')):
    for i, row in names_df.iterrows():
        if re.search('[ء-ي]', row['name']):
            # Transliterate first, then respell the transliterated string
            # (row is a copy, so row['name'] still holds the Arabic original)
            transliterated = ar2bw(row['name'])
            names_df.at[i, 'name'] = respell(transliterated).lower()
    return names_df

In [None]:
def Encode(names_df, train=True):

    # Step 1: Pad names with matrix to make all names same dimension
    name_length = 20   # longer names are truncated to this length
    nb_char = 27       # column 0 = any non-letter, columns 1-26 = 'a'-'z'
    names_df['encoded_name'] = [np.zeros((name_length, nb_char)) for name in names_df['name']]

    # Step 2: Encode Characters to Numbers ('a' -> 1, ..., 'z' -> 26, anything else -> 0)
    names_df['alpha_name'] = [
        [
            ord(char) - 96 if 'a' <= char <= 'z' else 0
            for char in name[:name_length]
        ]
        for name in names_df['name']
    ]

    # Step 3: Encode names to matrix of 0 and 1
    # (row i gets a 1 in the column of the i-th character)
    for matrix, codes in zip(names_df['encoded_name'], names_df['alpha_name']):
        for i, j in enumerate(codes):
            matrix[i, j] = 1

    return names_df.drop('alpha_name', axis=1)
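
The encoding above can be easier to follow on a single name. The sketch below is a standalone illustration, not the notebook's own function: `one_hot_name` is a hypothetical helper and the sample name is made up. It assumes the same sizes as `Encode` (20 positions, 27 columns) and sends every character outside `a`-`z` to column 0.

```python
import numpy as np

NAME_LENGTH = 20  # same sizes as in Encode
NB_CHAR = 27

def one_hot_name(name):
    # One row per character position, one column per character code
    matrix = np.zeros((NAME_LENGTH, NB_CHAR))
    for i, char in enumerate(name[:NAME_LENGTH]):
        code = ord(char) - 96 if 'a' <= char <= 'z' else 0
        matrix[i, code] = 1
    return matrix

encoded = one_hot_name('sara')              # hypothetical name
codes = encoded.argmax(axis=1)[:4].tolist()  # column set to 1 for each letter
flat = encoded.reshape(-1)                   # feature vector fed to the models
print(encoded.shape, codes, flat.shape)
```

Each name becomes a 20 x 27 matrix with a single 1 per used row, and flattening it gives the 540 features that `preprocess` passes to the classifiers.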

- In this function we gather the previous pre-processing functions.

In [None]:
def preprocess(names_df, train=True):
    # Step 1: Strip whitespace, lowercase and drop duplicated names
    names_df['name'] = names_df['name'].str.strip()
    names_df['name'] = names_df['name'].str.lower()
    names_df=names_df.drop_duplicates(subset='name').reset_index(drop=True)
    # Step 2: Transliterate Arabic names and normalize everything to plain ASCII
    names_df=arabic_translation(names_df)
    names_df['name']= names_df['name'].apply(text_to_id)
    # Step 3: One-hot encode the names
    names_df=Encode(names_df)
    
    # Step 4: Flatten each (name_length, nb_char) matrix into one feature vector
    X = np.asarray(names_df['encoded_name'].values.tolist())
    X = X.reshape(X.shape[0], X.shape[1]*X.shape[2])
    if train:
        # Step 5: Encode Gender to Numbers
        le = LabelEncoder()
        names_df['gender'] = le.fit_transform(names_df['gender'])
        y = names_df['gender']
        return X, y
    else:
        return X

2. Importing data

In [None]:
names_df = pd.read_csv('Datasets/full_names.csv')

3. Process the data

In [None]:
X, y=preprocess(names_df, train=True)

## **MODELING**

1. Model creation :
    - The main objective is to predict the gender, which is the target variable 'y'. This means that we are dealing with a supervised classification problem where 
        - male : 1
        - female : 0
    - For the classification we considered the following models : 
        - SVM: SGDClassifier
        - ETC: ExtraTreesClassifier
        - DTC: DecisionTreeClassifier 
        - RFC: RandomForestClassifier
        - GBC: GradientBoostingClassifier
        - HGBC: HistGradientBoostingClassifier
        - ADAC: AdaBoostClassifier
        - GNB: GaussianNB
        - KNN: KNeighborsClassifier
        - LR: LogisticRegression
        - RC: RidgeClassifier
    
2. Model evaluation : 
- We decided to evaluate our models on :
    - **Precision** : Precision tells us how many of the cases predicted as positive actually turned out to be positive.\
        Precision is useful in cases where False Positives are a higher concern than False Negatives.
    - **Recall** : Recall tells us how many of the actual positive cases we were able to predict correctly with our model.\
        It is a useful metric in cases where False Negatives are of higher concern than False Positives.
    - **f1_score** : It is the harmonic mean of Precision and Recall, so it is high only when both are high.
- Which one matters most in our case : 
    - **Precision**
- Why ?
    - Because in our case a wrong prediction assigns the wrong gender to a customer, which can lead the customer to churn and harm the business.
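
As a small worked example of the three metrics, the sketch below computes them from confusion-matrix counts. The helper `precision_recall_f1` and the counts are hypothetical and are not results from this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    # tp: true positives, fp: false positives, fn: false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts, for illustration only
p, r, f = precision_recall_f1(tp=40, fp=10, fn=20)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.667 0.727
```

The F1 score sits between precision and recall and is pulled towards the lower of the two.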

First, we split the data into a training set and a hold-out test set (80% / 20%).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

1. Model declaration

In [None]:
classifiers = {
            #    'SVM' : SGDClassifier(), 
               'ETC' : ExtraTreesClassifier(),#The Extra-Trees algorithm builds an ensemble of unpruned decision or regression trees according to the classical top-down procedure. 
                                              #Its two main differences with other tree-based ensemble methods are that it splits nodes by choosing cut-points fully at random and 
                                              # that it uses the whole learning sample (rather than a bootstrap replica) to grow the trees.
            #    'DTC' : DecisionTreeClassifier(), 
               'RFC' : RandomForestClassifier(n_estimators=200),
            #    'GBC' : GradientBoostingClassifier(),
            #    'HGBC' : HistGradientBoostingClassifier(),
            #    'ADAC' : AdaBoostClassifier(),
            #    'GNB' : GaussianNB(),
            #    'KNN' : KNeighborsClassifier(),
            #    'LR' : LogisticRegression(),
            #    'RC' : RidgeClassifier()
               }

2. Model training + scoring

In [None]:
# MODELS LEARNINGS AND TESTING
f, axes = plt.subplots(1, len(classifiers), figsize=(30, 10), sharey='row')
precision = []
f1 = []
names = []
for i, (key, classifier) in enumerate(classifiers.items()):
    # Fit on the training split only, so that X_test stays unseen by the model
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    precision.append(precision_score(y_test, y_pred)) # precision
    f1.append(f1_score(y_test, y_pred))
    names.append(key)
    cf_matrix = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(cf_matrix)
    print(key)
    disp.plot(ax=axes[i], xticks_rotation=45)
    disp.ax_.set_title(key)
    disp.im_.colorbar.remove()
    disp.ax_.set_xlabel('')
    if i!=0:
        disp.ax_.set_ylabel('')

f.text(0.4, 0.1, 'Predicted label', ha='left')
plt.subplots_adjust(wspace=0.40, hspace=0.1)

f.colorbar(disp.im_, ax=axes)
plt.show()

We observed that only 2 models (ETC, RFC) performed well and gave satisfying results for our problem:
the first reached 74% accuracy and the second 75%.\
From there we decided to continue our modeling with these 2 models only and see whether they can be improved further.

3. Visualising the results

In [None]:
colors = ['orange', 'green']
metrics_name = ['Precision', 'F1 score']
metrics = [precision, f1]
positions = np.arange(len(names))  # one bar per model

# One panel per metric
fig, axes = plt.subplots(1, len(metrics), figsize=(30, 10), sharey='row')
for i, metric in enumerate(metrics):
    ax = axes[i]
    values = ax.barh(positions, np.array(metric)*100, label= metrics_name[i], color = colors[i])
    ax.set_title(metrics_name[i], fontsize = 20)
    for s in ['top', 'right']:
        ax.spines[s].set_visible(False)
 
    # Remove x, y Ticks
    ax.xaxis.set_ticks_position('none')
    ax.yaxis.set_ticks_position('none')
    
    # Add padding between axes and labels
    ax.xaxis.set_tick_params(pad = 5)
    ax.yaxis.set_tick_params(pad = 10)
    
    # Add x, y gridlines
    ax.grid(visible = True, color ='grey',
            linestyle ='-.', linewidth = 0.5,
            alpha = 0.2)
    
    # Add annotation to bars
    ax.bar_label(values, fontsize=12)

# Label each bar with its model name and show the first model on top
# (the y axis is shared between the panels, so this is done once)
axes[0].set_yticks(positions)
axes[0].set_yticklabels(names, fontsize = 20)
axes[0].yaxis.set_inverted(True)

# Add Text watermark
fig.text(0.9, 1, 'amine Zidelmal', fontsize = 15,
    color ='black', ha ='right', va ='top',
    alpha = 0.7)

fig.text(0.5, 0.07, 'Metric rate (%)', fontsize = 20, ha='left')
fig.text(0.07, 0.5, 'Models', fontsize = 20, ha='left')
plt.subplots_adjust(wspace=0.40, hspace=0.1)
plt.show()

4. Model optimisation

In [None]:
param_grid = { 
    'n_estimators': [200],
    'max_features': ['sqrt', 'log2'],
    'max_depth' : [8],
}

gs_rfc=GridSearchCV(RandomForestClassifier(), param_grid, cv=5, verbose=1, n_jobs=1)
gs_rfc.fit(X_train, y_train)

In [None]:
gs_rfc.best_params_

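- After running GridSearchCV, the parameters retained for the RFC were : n_estimators = 200 and max_depth = 8 (the only values in the grid for these two parameters); `max_features` is the only parameter the search actually chose between
- For the ETC the predefined params were already the best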
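
To score the estimator selected by a grid search (rather than a model fitted earlier), one option is to call `predict` on the fitted `GridSearchCV` object itself. The sketch below shows this on synthetic data. The data set, the grid `demo_grid` and all variable names are assumptions for illustration; the figures it prints say nothing about the gender model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the encoded names
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Small hypothetical grid with more than one value per parameter
demo_grid = {'max_features': ['sqrt', 'log2'], 'max_depth': [4, 8]}
search = GridSearchCV(RandomForestClassifier(n_estimators=20, random_state=0),
                      demo_grid, cv=3)
search.fit(X_tr, y_tr)

# With refit=True (the default), predict uses the best estimator
# refitted on the whole training split
tuned_pred = search.predict(X_te)
tuned_precision = precision_score(y_te, tuned_pred)
print(search.best_params_, tuned_precision)
```

Giving each parameter several candidate values is what lets the search actually compare settings; a single-value entry just pins that parameter.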

In [None]:
f1_score(y_test, y_pred)

In [None]:
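# y_pred holds the predictions of the last classifier fitted in the loop above (RFC)
print('accuracy :', accuracy_score(y_test, y_pred))
print('precision:', precision_score(y_test, y_pred))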

5. Error analysis

We noticed that some errors remain even when the model is fitted on the whole data set (`classifier.fit(X, y)`) and then tested on `X_test`,\
so we had to find out where the problem came from.\
In fact, we found that some names were not labeled correctly, and we fixed those labels.

In [None]:
names_df['name'] = names_df['name'].str.strip()
names_df['name'] = names_df['name'].str.lower()
names_df=names_df.drop_duplicates(subset='name').reset_index(drop=True)
X_xtrain, X_xtest, y_xtrain, y_xtest = train_test_split(names_df.iloc[:,0], names_df.iloc[:,-1], test_size=0.2, random_state=0)

In [None]:
testdata=pd.DataFrame({'NAME':X_xtest, "Gender":y_xtest, 'pred':y_pred})

In [None]:
testdata['Gender'] = LabelEncoder().fit_transform(testdata['Gender'])

In [None]:
testdata[testdata.Gender!=testdata.pred]

6. Learning Curve

In [None]:
# Learning curve of the random forest (the last model fitted in the loop above)
N, train_score, val_score = learning_curve(classifiers['RFC'], X_train, y_train, train_sizes=np.linspace(0.2, 1.0, 10), cv=5)
print(N)
plt.plot(N,train_score.mean(axis=1), label='train')
plt.plot(N,val_score.mean(axis=1), label='validation')
plt.xlabel('train_sizes')
plt.legend()

**Learning curves show us how the performance of a classifier changes as the training set grows.**\
On this curve you can see both the training score and the cross-validation score. The training score doesn't change much when more examples are added,\
but the cross-validation score definitely does! At 87%, the performance has not yet reached its maximum potential.

So what this tells us is that adding more examples to the ones we currently have is probably required.

Let's pickle the best model 😀

In [None]:
joblib.dump(classifiers['RFC'], 'RFC.pkl')