# Simplifying the method with an sklearn Pipeline

## Introduction

We keep using the same method, but instead of providing Y as 1s and 0s, we provide it as an array of text labels. scikit-learn encodes these labels implicitly during training, since the classifier cannot compute on raw strings.
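To illustrate the implicit handling of text labels, here is a minimal sketch on a toy corpus. The documents, the labels and the name `toy_pipeline` are hypothetical and are not part of the dataset used in this notebook; the classifier is left with its default loss to keep the example short.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Toy corpus and string labels (hypothetical data, for illustration only)
docs = [
    "spiral galaxy rotation curve",
    "galaxy cluster mass",
    "solar flare x-ray emission",
    "sunspot cycle and solar wind",
]
labels = ["galaxies", "galaxies", "the sun", "the sun"]

toy_pipeline = Pipeline(
    [
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier(random_state=0)),
    ]
)
toy_pipeline.fit(docs, labels)

# The classifier stores the distinct label strings it was trained on ...
classes = list(toy_pipeline.named_steps["clf"].classes_)
# ... and predict() returns one of those strings, not an integer code.
prediction = toy_pipeline.predict(["solar wind"])[0]
print(classes, prediction)
```

Each distinct string becomes one class, so two label strings that differ by a single word are treated as unrelated classes.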

We start by creating X and Y, which correspond respectively to the concatenated 'title' and 'abstract' columns and to the 'verified_uat_labels' column.

In [9]:
import numpy as np
import pandas as pd

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, f1_score, accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier

from nltk.stem import WordNetLemmatizer
import nltk


# Load the dataset from a local Parquet file
data = pd.read_parquet('val-00000-of-00001-66ce8665444026dc.parquet')

# Drop rows containing null values
data = data.dropna()

# Preprocessing
# Concatenate title and abstract, separated by a space so that the last word
# of the title is not fused with the first word of the abstract
X = data['title'] + ' ' + data['abstract']

# Extract the labels (target); each entry is a list of label strings
Y_list = data['verified_uat_labels']




We lemmatize the text of X to get better results, which means reducing each word to its dictionary form (its lemma). We also reshape Y so that, instead of a list of lists, it is a list of strings that the classifier can use directly.

In [10]:

nltk.download('wordnet')

# Lemmatization of the text for better results
lemmatizer = WordNetLemmatizer()  # created once and reused for every document

def lemmatize_text(text):
    return ' '.join(lemmatizer.lemmatize(w) for w in text.split())

X = X.apply(lemmatize_text)


# Join each sample's labels into a single space-separated string so that the
# classifier receives one target value per sample
Y = [' '.join(labels) for labels in Y_list]
# y is a column vector of shape (n_samples, 1)
y = pd.DataFrame(Y).to_numpy()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Quent\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
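Joining the labels turns every distinct combination of labels into its own class. As a point of comparison, and not the approach used in this notebook, the sketch below contrasts that with a binary indicator matrix built by `MultiLabelBinarizer`. The `label_lists` values are hypothetical and only mimic the shape of 'verified_uat_labels'.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label lists in the same shape as 'verified_uat_labels'
label_lists = [
    ["galaxies", "spectroscopy"],
    ["relativistic jets", "galaxies"],
    ["spectroscopy"],
]

# Joining the labels yields one class per distinct combination
joined = [" ".join(labels) for labels in label_lists]
n_joined_classes = len(set(joined))

# Binarizing yields one 0/1 column per individual label
mlb = MultiLabelBinarizer()
Y_bin = mlb.fit_transform(label_lists)
individual_labels = list(mlb.classes_)

print(n_joined_classes, individual_labels)
print(Y_bin)
```

With joined strings the three samples above form three unrelated classes, whereas the indicator matrix keeps the information that two of them share the label "galaxies".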


We then define the parameters for our classifier, SGDClassifier, a linear model trained with stochastic gradient descent (SGD). Next we define our Pipeline, which accepts raw text as input because it vectorizes the text (token counts, then TF-IDF weighting) before passing it to the classifier defined as its final step.

In [11]:
# Parameters
sgd_params = dict(alpha=1e-5, penalty="l2", loss="log_loss", n_jobs=-1, verbose=1)
vectorizer_params = dict(ngram_range=(1, 2), min_df=5, max_df=0.8)

# Supervised Pipeline
pipeline = Pipeline(
    [
        ("vect", CountVectorizer(**vectorizer_params)),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier(**sgd_params)),
    ]
)
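As a quick check of what the first two steps of the Pipeline compute, the following sketch compares `CountVectorizer` followed by `TfidfTransformer` with the single-step `TfidfVectorizer` on a toy corpus. The documents are hypothetical, and `min_df`/`max_df` are left out because the corpus is too small for them.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, for illustration only
docs = [
    "active galactic nuclei and relativistic jets",
    "relativistic jets in blazars",
    "spectroscopy of high-redshift galaxies",
]

params = dict(ngram_range=(1, 2))

# Two steps: raw token counts, then TF-IDF weighting
two_step = make_pipeline(CountVectorizer(**params), TfidfTransformer())
# One step: both operations combined
one_step = TfidfVectorizer(**params)

A = two_step.fit_transform(docs)
B = one_step.fit_transform(docs)

same = np.allclose(A.toarray(), B.toarray())
print(A.shape, same)
```

Keeping the two steps separate, as in the Pipeline above, makes it possible to tune or replace the weighting independently of the tokenization.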

## Training the model

In [12]:

def eval_and_print_metrics(clf, X_train, y_train, X_test, y_test):
    """Fit clf, print the micro-averaged F1 score and return the test predictions."""
    print("Number of training samples:", len(X_train))
    print("Unlabeled samples in training set:", sum(1 for x in y_train if x == -1))
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print(
        "Micro-averaged F1 score on test set: %0.3f"
        % f1_score(y_test, y_pred, average="micro")
    )
    print("-" * 10)
    print()
    return y_pred


if __name__ == "__main__":
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    print("Supervised SGDClassifier on 100% of the data:")
    # Keep the predictions at module level: the following cells reuse y_pred
    y_pred = eval_and_print_metrics(pipeline, X_train, y_train, X_test, y_test)

Supervised SGDClassifier on 100% of the data:
Number of training samples: 2265
Unlabeled samples in training set: 0


  y = column_or_1d(y, warn=True)
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.


[... interleaved, truncated per-class SGD epoch logs ("-- Epoch k", "Norm: ..., NNZs: 18374, Bias: ..., T: ..., Avg. loss: ...", "Convergence after 7 epochs took ... seconds") omitted ...]

[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:    2.1s
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed:    3.6s
[Parallel(n_jobs=-1)]: Done 1226 tasks      | elapsed:    5.9s
[Parallel(n_jobs=-1)]: Done 1776 tasks      | elapsed:    7.9s

[Parallel(n_jobs=-1)]: Done 2210 out of 2210 | elapsed:    9.4s finished


Micro-averaged F1 score on test set: 0.013
----------



## Computing the results

In [13]:
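# Caution: y_pred has shape (n,) while y_test has shape (n, 1), so this
# comparison broadcasts to an n x n matrix of all pairwise matches instead of
# comparing element by element; use y_test.ravel() for a per-sample comparison.
print(np.mean(y_pred == y_test))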

# Print the accuracy, precision, recall and f1_score
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision: ", precision_score(y_test, y_pred, average='micro'))
print("Recall: ", recall_score(y_test, y_pred, average='micro'))
print("F1 Score: ", f1_score(y_test, y_pred, average='micro'))

0.00022104293671330205
Accuracy:  0.013245033112582781
Precision:  0.013245033112582781
Recall:  0.013245033112582781
F1 Score:  0.013245033112582781
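The four identical scores above are expected: when each sample has exactly one class, micro-averaged precision, recall and F1 all reduce to accuracy. The sketch below reproduces this on hypothetical toy arrays, and also shows how comparing a `(n,)` array with a `(n, 1)` array changes the result of `np.mean`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical toy labels with the same shapes as in the notebook
y_true = np.array([["a"], ["b"], ["c"], ["a"]])  # shape (4, 1)
y_hat = np.array(["a", "c", "c", "b"])           # shape (4,)

# (4,) == (4, 1) broadcasts to a 4 x 4 matrix of pairwise comparisons
pairwise = y_hat == y_true
# Flattening y_true gives the intended element-wise comparison
elementwise = y_hat == y_true.ravel()

acc = accuracy_score(y_true.ravel(), y_hat)
micro_f1 = f1_score(y_true.ravel(), y_hat, average="micro")

print(pairwise.shape, pairwise.mean())
print(elementwise.mean(), acc, micro_f1)
```

Because the micro average carries no extra information here, a macro or weighted average would be needed to see per-class behaviour.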


## Custom accuracy calculation

In [14]:
# Custom accuracy calculation:
# for each sample, look up every word of the predicted label string in the true
# label string and accumulate the fraction of predicted words that are found.
# Note: `word in y_test[i][0]` is a substring test, so "high" also matches
# "high-redshift".
percentage = 0
for i in range(len(y_pred)):
    for word in y_pred[i].split(' '):
        print(word, y_test[i])
        print(word in y_test[i][0])
        if word in y_test[i][0]:
            percentage += 1/len(y_pred[i].split(' '))

print(f"Accuracy: {percentage/len(y_pred)*100:.2f}%")

active ['bl lacertae objects high-redshift galaxies spectroscopy relativistic jets gamma-ray sources']
False
galactic ['bl lacertae objects high-redshift galaxies spectroscopy relativistic jets gamma-ray sources']
False
nuclei ['bl lacertae objects high-redshift galaxies spectroscopy relativistic jets gamma-ray sources']
False
active ['bl lacertae objects high-redshift galaxies spectroscopy relativistic jets gamma-ray sources']
False
galaxies ['bl lacertae objects high-redshift galaxies spectroscopy relativistic jets gamma-ray sources']
True
blazars ['bl lacertae objects high-redshift galaxies spectroscopy relativistic jets gamma-ray sources']
False
high ['bl lacertae objects high-redshift galaxies spectroscopy relativistic jets gamma-ray sources']
True
energy ['bl lacertae objects high-redshift galaxies spectroscopy relativistic jets gamma-ray sources']
False
astrophysics ['bl lacertae objects high-redshift galaxies spectroscopy relativistic jets gamma-ray sources']
False
high-redshif
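The output above shows "high" counted as found because it is a substring of "high-redshift". A possible variant, given here only as a sketch, compares whole tokens instead. The helper `token_overlap` is hypothetical, and the strings are modelled on the output above rather than taken from the dataset.

```python
def token_overlap(predicted: str, expected: str) -> float:
    """Fraction of predicted tokens that appear as whole tokens in expected."""
    pred_tokens = predicted.split()
    if not pred_tokens:
        return 0.0
    expected_tokens = set(expected.split())
    hits = sum(1 for token in pred_tokens if token in expected_tokens)
    return hits / len(pred_tokens)


# Strings modelled on the notebook output (illustrative only)
expected = (
    "bl lacertae objects high-redshift galaxies spectroscopy "
    "relativistic jets gamma-ray sources"
)
predicted = "active galaxies blazars high energy astrophysics"

# Substring test: "high" is found inside "high-redshift"
substring_hit = "high" in expected
# Whole-token test: only "galaxies" matches, out of 6 predicted tokens
score = token_overlap(predicted, expected)

print(substring_hit, score)
```

This still scores individual words rather than complete labels such as "relativistic jets", so it remains a rough indicator.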