# Simplifying the method with a sklearn Pipeline

## Introduction

We keep using the same method, but instead of providing a Y encoded as 1 or 0, Y now holds text labels. These labels are encoded implicitly inside the Pipeline (by its final classifier), since text cannot be used in the computations as-is.
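To make this implicit translation more concrete, here is a minimal sketch of what encoding text labels amounts to conceptually: each distinct label string is mapped to an integer index, and predictions are mapped back to strings. The helpers `encode_labels` and `decode_labels` and the toy labels are hypothetical and only illustrate the idea; scikit-learn classifiers perform an equivalent mapping internally and expose the sorted labels as `classes_`.

```python
def encode_labels(labels):
    """Map each distinct label string to an integer index (sorted order)."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[label] for label in labels]


def decode_labels(classes, encoded):
    """Map integer indices back to the original label strings."""
    return [classes[i] for i in encoded]


# Toy labels, invented for illustration (not taken from the dataset)
labels = ["galaxy evolution", "mars", "galaxy evolution", "cosmology"]
classes, encoded = encode_labels(labels)

print(classes)                          # sorted distinct labels
print(encoded)                          # one integer per sample
print(decode_labels(classes, encoded))  # round trip back to strings
```

Because every distinct string becomes its own class, two samples share a class only if their label strings are exactly identical.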

We start by creating X and Y, which correspond to the 'title' + 'abstract' columns and to the 'verified_uat_labels' column, respectively.

In [3]:
import numpy as np
import pandas as pd

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, f1_score, accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier

from nltk.stem import WordNetLemmatizer
import nltk


# Load the dataset from the Parquet file
data = pd.read_parquet('val-00000-of-00001-66ce8665444026dc.parquet')

# Drop rows containing null values
data = data.dropna()

# Preprocessing
# Concatenate title and abstract, separated by a space so that the last word
# of the title is not fused with the first word of the abstract
X = data['title'] + ' ' + data['abstract']

# Extract the labels (target); each entry is a list of labels
Y_list = data['verified_uat_labels']




We lemmatize the text of X to get better results, which means reducing each word to its dictionary form (its lemma). We also convert Y from a list of lists into a list of strings so that the classifier can use it directly.
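As a small illustration of the label conversion, the sketch below joins one sample's labels into a single string. The helper `join_labels`, the toy sample and the `" | "` separator are assumptions made for this example. It also shows a side effect worth keeping in mind: with a plain space as separator, multi-word labels can no longer be told apart afterwards.

```python
def join_labels(label_list, sep=" "):
    """Collapse one sample's list of labels into a single string."""
    return sep.join(label_list)


# Toy sample, invented for illustration
sample = ["planetary atmospheres", "mars", "remote sensing"]

joined_space = join_labels(sample)            # space-separated, as in this notebook
joined_pipe = join_labels(sample, sep=" | ")  # hypothetical alternative separator

print(joined_space)
print(joined_space.split(" "))   # only word boundaries survive
print(joined_pipe.split(" | "))  # the original labels are recovered
```

Either way, the classifier treats the whole joined string as one class.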

In [4]:

nltk.download('wordnet')

# Lemmatize the text for better results
# (the lemmatizer is created once and reused for every document)
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    return ' '.join(lemmatizer.lemmatize(w) for w in text.split())

X = X.apply(lemmatize_text)


# Join each sample's labels into a single space-separated string
# so that the classifier sees one target per sample
Y = [' '.join(labels) for labels in Y_list]

# Column array of shape (n_samples, 1)
y = pd.DataFrame(Y).to_numpy()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Quent\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


We then define the parameters of our classifier, SGDClassifier, which fits a linear model with Stochastic Gradient Descent. We then define our Pipeline, which chains the CountVectorizer, the TfidfTransformer and the classifier.

In [None]:
# Parameters
sdg_params = dict(alpha=1e-5, penalty="l2", loss="log_loss", n_jobs=-1, verbose=1)
vectorizer_params = dict(ngram_range=(1, 2), min_df=5, max_df=0.8)

# Supervised Pipeline
pipeline = Pipeline(
    [   
        ("vect", CountVectorizer(**vectorizer_params)),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier(**sdg_params)),
    ]
)


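def eval_and_print_metrics(clf, X_train, y_train, X_test, y_test):
    # Flatten the (n_samples, 1) label columns to 1-D to avoid the
    # column_or_1d DataConversionWarning raised by scikit-learn
    y_train, y_test = np.ravel(y_train), np.ravel(y_test)
    print("Number of training samples:", len(X_train))
    print("Unlabeled samples in training set:", sum(1 for x in y_train if x == -1))
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print(
        "Micro-averaged F1 score on test set: %0.3f"
        % f1_score(y_test, y_pred, average="micro")
    )
    print("-" * 10)
    print()
    # Return the predictions instead of storing them in a global variable
    return y_pred


if __name__ == "__main__":
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    print("Supervised SGDClassifier on 100% of the data:")
    # Keep the predictions at module level for the cells below
    y_pred = eval_and_print_metrics(pipeline, X_train, y_train, X_test, y_test)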

Supervised SGDClassifier on 100% of the data:
Number of training samples: 2265
Unlabeled samples in training set: 0


  y = column_or_1d(y, warn=True)
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    0.1s


-- Epoch 1
...
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:    0.7s
...
Norm: 23.24, NNZs: 18292, Bias: -3.156071, T: 15855, Avg. loss: 0.001208
Total training time: 0.04 seconds.
Convergence after 7 epochs took 0.04 seconds
...
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed:    2.9s
...

In [2]:
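# y_test has shape (n_samples, 1); flatten it so the comparison is element-wise
# instead of broadcasting to an (n_samples, n_samples) matrix
y_true = np.ravel(y_test)
print(np.mean(y_pred == y_true))

# Print the accuracy, precision, recall and f1_score
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision: ", precision_score(y_true, y_pred, average='micro'))
print("Recall: ", recall_score(y_true, y_pred, average='micro'))
print("F1 Score: ", f1_score(y_true, y_pred, average='micro'))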

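NameError: name 'y_pred' is not defined

This error comes from the execution order: the cell was run (In [2]) before the training cell had defined `y_pred`. Running it after the training cell avoids the error.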
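A note on the metrics requested in this cell: when each sample has exactly one class (here, one joined label string), micro-averaged precision, recall and F1 all equal the accuracy, so the four printed values are expected to be identical. The sketch below checks this in plain Python; the helper `micro_scores` and the toy labels are hypothetical and only serve the illustration.

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall and F1 for single-label multiclass data."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    # Sum true positives, false positives and false negatives over all classes
    for c in classes:
        for t, p in zip(y_true, y_pred):
            tp += (p == c and t == c)
            fp += (p == c and t != c)
            fn += (p != c and t == c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels, invented for illustration
y_true = ["mars", "galaxy", "cosmology", "galaxy"]
y_pred = ["mars", "galaxy", "galaxy", "mars"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = micro_scores(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), which is why the three micro-averaged scores collapse onto the accuracy.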

In [11]:
# Custom word-level score: for each sample, the fraction of predicted label
# words that also appear in the true labels, averaged over all test samples
percentage = 0
for i in range(len(y_pred)):
    pred_words = y_pred[i].split(' ')
    # Compare whole words rather than substrings of the label string
    true_words = set(y_test[i][0].split(' '))
    for word in pred_words:
        print(word, y_test[i])
        print(word in true_words)
        if word in true_words:
            percentage += 1 / len(pred_words)

print(f"Accuracy: {percentage/len(y_pred)*100:.2f}%")

mars ['mars planetary atmospheres atmospheric clouds atmospheric variability remote sensing']
True
planetary ['mars planetary atmospheres atmospheric clouds atmospheric variability remote sensing']
True
atmospheres ['mars planetary atmospheres atmospheric clouds atmospheric variability remote sensing']
True
galaxy ['catalogs surveys galaxy evolution photometry high-redshift galaxies observational astronomy astronomical methods']
True
evolution ['catalogs surveys galaxy evolution photometry high-redshift galaxies observational astronomy astronomical methods']
True
galaxy ['catalogs surveys galaxy evolution photometry high-redshift galaxies observational astronomy astronomical methods']
True
formation ['catalogs surveys galaxy evolution photometry high-redshift galaxies observational astronomy astronomical methods']
False
cosmology ['catalogs surveys galaxy evolution photometry high-redshift galaxies observational astronomy astronomical methods']
False
galaxy ['catalogs surveys galaxy ev