# Défi IA: solution proposed by team SAAT

SAAT is a team from INSA Toulouse made up of Vu Nam Anh LE, Aimée SIMCIC--MORI, Thanh Tin VO & Sophia YAZZOURH.

This year, we had the opportunity to take part in the Defi-IA 2021 competition on Kaggle, organized by several schools, including INSA Toulouse. The goal of the challenge is to build an algorithm that assigns the correct job category to a job description. This amounts to a multi-class classification problem over 28 job categories.

The data were collected from CommonCrawl, which was also used to train the GPT-3 model. They are therefore representative of what can be found on the Internet in everyday English. As a result, they naturally contain language biases and discrimination. The challenge is therefore to develop an algorithm that is both accurate and fair with respect to male/female classification errors.

Here we present the solution proposed by team SAAT. The presentation is split into two parts: first the pre-processing applied to the data, then the chosen classification algorithm.

# 1. Importing libraries, data & scripts

## 1.1 Libraries

In [60]:
import pandas as pd
import numpy as np
import os
import time
import pickle
import warnings
warnings.filterwarnings("ignore")
import sklearn.metrics as smet
import sklearn.model_selection as sms

## 1.2 Scripts 

In [49]:
import sys
sys.path.append('./scripts')
import Cleaning as ct
import Vectorization as Vecto 
import Learning as RL

## 1.3 Data

In [50]:
DATA_PATH = "./data"
train_df = pd.read_json(DATA_PATH+"/train.json") # Training data 
test_df = pd.read_json(DATA_PATH+"/test.json") # Testing data 
names = pd.read_csv(DATA_PATH+ '/categories_string.csv')['0'].to_dict()
jobs = pd.read_csv(DATA_PATH+'/train_label.csv', index_col='Id')['Category']
jobs = jobs.map(names)
jobs = jobs.rename('job') # Job names for the training data
genders = pd.read_json(DATA_PATH+'/train.json').set_index('Id')['gender']
# Genders of the people in the training data
train_label = pd.read_csv(DATA_PATH+"/train_label.csv") # Job categories as integer labels (28 classes)
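
Since the challenge also evaluates fairness between genders, it can help to look at how jobs are distributed by gender before any pre-processing. The following is a minimal sketch on toy stand-ins for `jobs` and `genders`; the values and the names `jobs_toy` and `genders_toy` are invented for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy stand-ins for the `jobs` and `genders` Series loaded above (invented values).
jobs_toy = pd.Series(["nurse", "surgeon", "nurse", "professor", "surgeon", "nurse"],
                     index=[0, 1, 2, 4, 5, 7], name="job")
genders_toy = pd.Series(["F", "M", "F", "F", "M", "M"],
                        index=[0, 1, 2, 4, 5, 7], name="gender")

# Count descriptions per (job, gender) pair; rows are jobs, columns are genders.
counts = pd.crosstab(jobs_toy, genders_toy)

# Share of each gender within a job; a value far from 0.5 signals an imbalanced class.
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares)
```

In the notebook, the same two calls could be applied directly to `jobs` and `genders`, assuming both Series are indexed by `Id`.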

# 2. Pre-processing 

## 2.1 Lowercasing

In [51]:
train_df["description_lower"] = train_df["description"].str.lower()
test_df["description_lower"] = test_df["description"].str.lower()

## 2.2 Cleaning 

In [52]:
ct.clean_df_column(train_df, "description_lower", "description_cleaned")
train_df[["description_lower", "description_cleaned"]] # Lowercased vs. cleaned descriptions (training data)

100%|██████████| 217197/217197 [05:49<00:00, 622.13it/s]


| | description_lower | description_cleaned |
|---|---|---|
| 0 | she is also a ronald d. asmus policy entrepre... | she is also ronald asmus policy entrepreneur f... |
| 1 | he is a member of the aicpa and wicpa. brent ... | he is memb of the aicp and wicp brent graduate... |
| 2 | dr. aster has held teaching and research posi... | dr aster has held teaching and research posit ... |
| 4 | he runs a boutique design studio attending cl... | he run boutiqu design studio attending client ... |
| 5 | he focuses on cloud security, identity and ac... | he focus cloud security identity and access ma... |
| ... | ... | ... |
| 271492 | a member of the uwa cultural collections boar... | memb of the uwa cultural collect board gary wa... |
| 271493 | kelly has worked globally leading teams of co... | kelly has worked globally leading team of cons... |
| 271494 | he's the lead author of a recent study that f... | he the lead author of recent study that found ... |
| 271495 | she specializes in the theoretical and pedago... | she specializ in the theoretical and pedagogic... |


In [53]:
ct.clean_df_column(test_df, "description_lower", "description_cleaned")
test_df[["description_lower", "description_cleaned"]] # Lowercased vs. cleaned descriptions (test data)

100%|██████████| 54300/54300 [01:23<00:00, 648.34it/s]


| | description_lower | description_cleaned |
|---|---|---|
| 3 | she currently works on cnn’s newest primetime... | she currently work cnn newest primetim show pa... |
| 6 | lavalette’s photographs have been shown widel... | lavalet photograph hav been shown widely and h... |
| 11 | along with his academic and professional deve... | along with his academic and professional devel... |
| 17 | she obtained her ph.d. in islamic studies at ... | she obtained her ph in islamic stud at duk uni... |
| 18 | she studies issues of women and islam and has... | she stud issu of women and islam and has writt... |
| ... | ... | ... |
| 271476 | prior to that, she worked as a research staff... | prior to that she worked research staff memb a... |
| 271477 | the group’s antics began when they switched t... | the group antic began when they switched the v... |
| 271482 | formerly, she was the coordinator for music e... | formerly she was the coordinator for music edu... |
| 271485 | she started her law practice at morris mannin... | she started her law practic at morr manning ma... |
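
The `Cleaning` script itself is not reproduced here. As a rough illustration of what such a step typically involves, the sketch below lowercases the text, strips punctuation and digits, and removes a few stop words. The function `clean_text` and the stop-word list are hypothetical and are not the project's `ct.clean_df_column`.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the project's own list is not shown.
STOP_WORDS = {"a", "an", "on", "in"}

def clean_text(text: str) -> str:
    """Lowercase, drop anything that is not an ASCII letter, remove stop words."""
    text = text.lower()
    # Replace punctuation, digits and other symbols with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Splitting and re-joining also collapses repeated whitespace.
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

print(clean_text("He focuses on cloud security, identity and access management."))
```

The outputs above suggest that the real function also shortens words to a stem (`focuses` becomes `focus`, `member` becomes `memb`), which this sketch does not attempt.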


## 2.3 TF-IDF vectorization

In [54]:
X_test = test_df # Test data
X = train_df # Data used to train and select the best model
y = train_label.Category.values # Response (job category) for the training data
X_train, X_valid, y_train, y_valid = sms.train_test_split(X, y, test_size=0.2, random_state=1)
# Split the training data and response into a training part and a validation part,
# which lets us select the best model

In [55]:
features_parameters = [[None, "count"],
                      [10000, "count"],
                      [None, "tfidf"],
                      [10000, "tfidf"],]

# "count" selects a bag-of-words term-count representation, the simplest vectorization method
# "tfidf" selects TF-IDF, which additionally down-weights terms that appear in many documents

metadata = {}
for nb_hash, vectorizer_type in features_parameters:
    vect_method = Vecto.Vectorizer(vectorizer_type = vectorizer_type, nb_hash = nb_hash )
    ts = time.time()
    vec, feathash, X_train_vec = vect_method.vectorizer_train(X_train, columns = "description_cleaned")
    X_valid_vec = vect_method.apply_vectorizer(X_valid, columns = "description_cleaned", vec = vec, feathash = feathash)
    X_test_vec = vect_method.apply_vectorizer(X_test, columns = "description_cleaned", vec = vec, feathash = feathash)
    
    te = time.time()
    
    metadata.update({(nb_hash, vectorizer_type):te-ts})
    
    print("nb_hash : " + str(nb_hash) + ", vectorizer_type : " + str(vectorizer_type))
    print("Running time for vectorization : %.1f seconds" %( metadata[(nb_hash, vectorizer_type)]))
    print("Test shape : " + str(X_test_vec.shape))
    print("Train shape : " + str(X_train_vec.shape))
    print("Valid shape : " + str(X_valid_vec.shape))

    vect_method.save_dataframe(X_test_vec, "test") # Vectorized X_test
    vect_method.save_dataframe(X_train_vec, "train") # Vectorized X_train
    vect_method.save_dataframe(X_valid_vec, "valid") # Vectorized X_valid
    

nb_hash : None, vectorizer_type : count
Running time for vectorization : 13.0 seconds
Test shape : (54300, 158749)
Train shape : (173757, 158749)
Valid shape : (43440, 158749)
nb_hash : 10000, vectorizer_type : count
Running time for vectorization : 7.4 seconds
Test shape : (54300, 10000)
Train shape : (173757, 10000)
Valid shape : (43440, 10000)
nb_hash : None, vectorizer_type : tfidf
Running time for vectorization : 12.2 seconds
Test shape : (54300, 158749)
Train shape : (173757, 158749)
Valid shape : (43440, 158749)
nb_hash : 10000, vectorizer_type : tfidf
Running time for vectorization : 7.3 seconds
Test shape : (54300, 10000)
Train shape : (173757, 10000)
Valid shape : (43440, 10000)
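
The internals of `Vecto.Vectorizer` are not shown here. The sketch below illustrates the two ideas the shapes above point to, assuming scikit-learn's text vectorizers: a vocabulary learned on the training split and reused on the other splits, and a hashed feature space of fixed width. That `nb_hash` corresponds to such a hashing step is an assumption, and the toy documents are invented.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

train_docs = ["she teaches law", "he practices law", "she paints portraits"]
valid_docs = ["he teaches painting"]

# TF-IDF: the vocabulary and IDF weights are learned from the training split only,
# then reused unchanged on the validation split, so both share the same columns.
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(train_docs)
X_valid_tfidf = tfidf.transform(valid_docs)

# Hashing: tokens are mapped to a fixed number of columns, so the width is chosen
# up front (n_features) instead of growing with the vocabulary.
hasher = HashingVectorizer(n_features=16, alternate_sign=False)
X_train_hash = hasher.transform(train_docs)

print(X_train_tfidf.shape, X_valid_tfidf.shape, X_train_hash.shape)
```

Words of the validation document that were never seen during fitting (here `painting`) are simply ignored by the TF-IDF vectorizer, which is why the column count stays identical across splits.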


# 3. Logistic Regression

## 3.1 Training 

In [56]:
FORCE_TO_RUN = True
features_parameters = [[None, "tfidf"]] # Use TF-IDF features without hashing

model_parameters = [["lr", {"C":[0.1, 1, 10]}]]

if FORCE_TO_RUN:
    metadata = {}
    for nb_hash, vectorizer_type in features_parameters:
        print(nb_hash, vectorizer_type)
        vect_method = Vecto.Vectorizer(vectorizer_type = vectorizer_type, nb_hash = nb_hash )
        X_train = vect_method.load_dataframe("train")
        Y_train = y_train
        X_valid = vect_method.load_dataframe("valid")
        Y_valid = y_valid

        for ml_model_name, param_grid in model_parameters:
            ml_class = RL.MlModel(ml_model_name=ml_model_name, param_grid=param_grid)
            best_model, best_metadata = ml_class.train_all_parameters(X_train, Y_train, X_valid, Y_valid
                                                                      , save_metadata=False)
            # Score the best model on the validation split (predict once, reuse the result).
            # scikit-learn metrics expect (y_true, y_pred) in that order.
            Y_valid_pred = best_model.predict(X_valid)
            accuracy_test = best_model.score(X_valid, Y_valid)
            f1_macro_score_test = smet.f1_score(Y_valid, Y_valid_pred, average='macro')
            balanced_accuracy_test = smet.balanced_accuracy_score(Y_valid, Y_valid_pred)
            best_metadata.update({"balanced_accuracy_test":balanced_accuracy_test,"accuracy_test": accuracy_test, "f1_macro_score_test":f1_macro_score_test})
            metadata.update({(vectorizer_type, str(nb_hash), ml_model_name): best_metadata})

None tfidf


100%|██████████| 3/3 [06:03<00:00, 121.30s/it]


Best model's parameters : {'C': 1, 'n_jobs': -1}
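
`RL.MlModel.train_all_parameters` is not reproduced here; from the progress bar and the printed result, it presumably fits one model per value of `C` and keeps the one that does best on the validation split. A minimal sketch of that idea, assuming scikit-learn's `LogisticRegression` and macro F1 as the selection criterion (both are assumptions), on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the vectorized descriptions.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=1)

scores = {}
for C in [0.1, 1, 10]:
    # C is the inverse regularization strength: smaller C means stronger regularization.
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_tr, y_tr)
    scores[C] = f1_score(y_va, model.predict(X_va), average="macro")

# Keep the value of C with the highest validation score.
best_C = max(scores, key=scores.get)
print(scores, best_C)
```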


## 3.2 Prediction

In [57]:
X_test = vect_method.load_dataframe("test") # Loading the test data
y_pred = best_model.predict(X_test) # Predicting the response
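
The test set has no labels, so the male/female fairness mentioned in the introduction can only be examined on the validation split. One simple way of looking at it, sketched below on invented arrays, is to compare accuracy within each gender group. This is not the official fairness metric of the challenge, and in the notebook the `genders` Series would first have to be aligned with the validation rows.

```python
import numpy as np

# Invented validation labels, predictions and genders for illustration.
y_true = np.array([1, 2, 1, 3, 2, 1, 3, 2])
y_hat = np.array([1, 2, 3, 3, 1, 2, 3, 2])
gender = np.array(["F", "F", "F", "F", "M", "M", "M", "M"])

# Accuracy within each gender group.
acc_by_gender = {
    g: float(np.mean(y_hat[gender == g] == y_true[gender == g]))
    for g in np.unique(gender)
}

# Absolute gap between the two groups: 0 means both are classified equally well.
gap = abs(acc_by_gender["F"] - acc_by_gender["M"])
print(acc_by_gender, gap)
```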

# 4. Generating the results for Kaggle

In [58]:
test_df["Category"] = y_pred
baseline_file = test_df[["Id","Category"]]
os.makedirs("./results", exist_ok=True) # Create the output folder if it does not exist yet

baseline_file.to_csv("./results/baseline.csv", index=False)
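
Before uploading, a quick round trip through the CSV file can confirm that it has the expected layout. The sketch below uses a toy frame with invented values and a temporary folder rather than `./results`; the two column names match `baseline_file`.

```python
import os
import tempfile

import pandas as pd

# Toy submission with the same two columns as `baseline_file` (invented values).
submission = pd.DataFrame({"Id": [3, 6, 11], "Category": [19, 5, 22]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "baseline.csv")
    # index=False keeps the DataFrame index out of the file.
    submission.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The file should hold exactly the two expected columns and one row per Id.
print(list(reloaded.columns), len(reloaded), reloaded["Id"].is_unique)
```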