# Simplifying the method with a sklearn Pipeline

## Introduction

We keep the same method, but instead of providing Y as 0/1 values, we pass it as text labels. These are encoded implicitly inside the Pipeline (by its final classifier step), since the model cannot be trained on raw strings otherwise.

We start by building X and Y, which correspond respectively to the 'title' + 'abstract' columns and to the 'verified_uat_labels' column.

In [12]:
import numpy as np
import pandas as pd

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, f1_score, accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier

from nltk.stem import WordNetLemmatizer
import nltk

# Load the dataset from a local parquet file
data = pd.read_parquet('val-00000-of-00001-66ce8665444026dc.parquet')

# Drop rows containing null values
data = data.dropna()
data.head()

| Unnamed: 0 | bibcode | title | abstract | verified_uat_ids | verified_uat_labels |
|---|---|---|---|---|---|
| 0 | 2020RNAAS...4..137D | Recommendations for Teaching Introductory Astr... | Colleges and universities around the world wer... | [1529, 1583, 563, 486, 1145, 74] | [solar system astronomy, stellar astronomy, ga... |
| 1 | 2023ApJ...949..109L | The ALMA Survey of 70 μm Dark High-mass Clumps... | We present dynamical properties of 294 cores e... | [787, 1565, 1569, 732, 1302, 844, 847, 1297] | [infrared dark clouds, star forming regions, s... |
| 2 | 2020RNAAS...4....3G | L-band Calibration of the Green Bank Telescope... | Since 2016, the HI-MaNGA survey has been obtai... | [544, 1360, 1671, 690] | [flux calibration, radio telescopes, surveys, ... |
| 3 | 2022RNAAS...6..165V | Search for Extended Sources in the Images from... | We present a convenient tool (ChaSES) which al... | [1858, 2306, 1968, 1861] | [astronomy data analysis, astronomy image proc... |
| 4 | 2021ApJ...910...54K | The Connection between Warm Carbon-chain Chemi... | Some observations of warm carbon-chain chemist... | [75, 267, 329, 838, 849, 1569, 371] | [astrochemistry, collapsing clouds, cosmic ray... |


In [13]:
# Preprocessing
# Build the input text from title and abstract; the space keeps the last
# word of the title from fusing with the first word of the abstract
X = data['title'] + ' ' + data['abstract']

# Extract the labels (target)
Y_list = data['verified_uat_labels']

We lemmatize the text of X, i.e. reduce each word to its dictionary form (lemma), to get better results. We also transform Y so that it is no longer a list of lists but a list of strings, one per document, which the classifier can then consume.

In [14]:

nltk.download('wordnet')

# Lemmatize the text for better results; the lemmatizer is created once
# and reused for every document
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    return ' '.join(lemmatizer.lemmatize(w) for w in text.split())

X = X.apply(lemmatize_text)


# Join each document's labels into a single space-separated string
# so that the classifier receives one target value per document
Y = [' '.join(labels) for labels in Y_list]

# Column vector of shape (n_samples, 1); later cells index it as y_test[i][0]
y = pd.DataFrame(Y).to_numpy()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Quent\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
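
Joining the labels with spaces gives one target string per document, but multi-word labels such as `white dwarf stars` can no longer be separated from their neighbours once joined. The sketch below illustrates this with invented (and deliberately contrived) label lists. The `"; "` separator is an assumption for illustration and is not what the notebook uses.

```python
# Toy label lists (hypothetical, shaped like the entries of 'verified_uat_labels')
label_lists = [
    ["white dwarf stars", "binary stars"],
    ["white dwarf", "stars binary stars"],
]

# Space-joined strings, as in the cell above: two different label sets collide
space_joined = [" ".join(labels) for labels in label_lists]
print(space_joined[0] == space_joined[1])

# With a separator that never occurs inside a label, the join is reversible
SEP = "; "
sep_joined = [SEP.join(labels) for labels in label_lists]
recovered = [s.split(SEP) for s in sep_joined]
print(recovered == label_lists)
```

Whether this matters depends on how the targets are used afterwards; it mostly affects any later step that tries to compare individual labels.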


We then define the parameters for our classifier. The parameter dictionary below targets SGDClassifier (Stochastic Gradient Descent), but note that the Pipeline as written uses OneVsRestClassifier wrapping LinearSVC as its final step, so those parameters are not applied. The Pipeline accepts raw text as input because it vectorizes it (CountVectorizer, then TfidfTransformer) before handing it to the classifier defined as its last step.

In [15]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
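# Parameters
# SGDClassifier settings; note that they are not passed to the pipeline below,
# which uses LinearSVC instead
sgd_params = dict(alpha=1e-5, penalty="l2", loss="log_loss", n_jobs=-1, verbose=1)
vectorizer_params = dict(ngram_range=(1, 2), min_df=5, max_df=0.8)

# Supervised Pipeline: word/bigram counts -> tf-idf weighting -> one linear SVM per class
pipeline = Pipeline(
    [
        ("vect", CountVectorizer(**vectorizer_params)),
        ("tfidf", TfidfTransformer()),
        ("clf", OneVsRestClassifier(LinearSVC(), n_jobs=-1, verbose=1)),
    ]
)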
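
To see what the first two steps do to raw text, here is a minimal sketch on an invented three-sentence corpus. It uses the default vectorizer settings rather than `vectorizer_params`, since `min_df=5` would discard every term of such a tiny corpus. It also checks that `CountVectorizer` followed by `TfidfTransformer` yields the same matrix as scikit-learn's `TfidfVectorizer`.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import Pipeline

# Invented corpus, for illustration only
toy_corpus = [
    "white dwarf stars in binary systems",
    "radio telescopes and flux calibration",
    "dwarf galaxies and dark matter",
]

# Same first two steps as the notebook's pipeline, with default parameters
two_steps = Pipeline([("vect", CountVectorizer()), ("tfidf", TfidfTransformer())])
features = two_steps.fit_transform(toy_corpus)

# One row per document, one column per vocabulary term
print(features.shape)

# TfidfVectorizer combines both steps in a single estimator
one_step = TfidfVectorizer().fit_transform(toy_corpus)
same = np.allclose(features.toarray(), one_step.toarray())
print(same)
```

The result is a sparse numeric matrix, which is what the final classifier step actually receives.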

## Training the model

In [16]:

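def eval_and_print_metrics(clf, X_train, y_train, X_test, y_test):
    """Fit clf, print the micro-averaged F1 score and return the test predictions."""
    print("Number of training samples:", len(X_train))
    # -1 is scikit-learn's marker for unlabeled samples in semi-supervised
    # learning; here every sample is labeled, so this prints 0
    print("Unlabeled samples in training set:", sum(1 for x in y_train if x == -1))
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print(
        "Micro-averaged F1 score on test set: %0.3f"
        % f1_score(y_test, y_pred, average="micro")
    )
    print("-" * 10)
    print()
    return y_pred


if __name__ == "__main__":
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    # The label below still says SGDClassifier, but the pipeline fits
    # OneVsRestClassifier(LinearSVC())
    print("Supervised SGDClassifier on 100% of the data:")
    y_pred = eval_and_print_metrics(pipeline, X_train, y_train, X_test, y_test)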

Supervised SGDClassifier on 100% of the data:
Number of training samples: 2265
Unlabeled samples in training set: 0


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    3.5s
[Parallel(n_jobs=-1)]: Done 281 tasks      | elapsed:    5.4s
[Parallel(n_jobs=-1)]: Done 780 tasks      | elapsed:    8.2s
[Parallel(n_jobs=-1)]: Done 1480 tasks      | elapsed:   12.0s
[Parallel(n_jobs=-1)]: Done 2186 out of 2209 | elapsed:   16.3s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done 2209 out of 2209 | elapsed:   16.3s finished


Micro-averaged F1 score on test set: 0.016
----------
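
The log above reports 2209 one-vs-rest tasks for 2265 training samples. `OneVsRestClassifier` fits one binary classifier per class, so this suggests that almost every joined label string is treated as its own class, which would go a long way toward explaining the low score. The sketch below reproduces the effect on invented label lists and contrasts it with scikit-learn's `MultiLabelBinarizer`, which creates one column per individual label. It illustrates an alternative target encoding, not the approach used in this notebook.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label lists (hypothetical, shaped like 'verified_uat_labels')
label_lists = [
    ["binary stars", "dark matter"],
    ["dark matter", "white dwarf stars"],
    ["binary stars", "white dwarf stars"],
    ["white dwarf stars"],
    ["binary stars", "dark matter", "white dwarf stars"],
]

# Joined strings: every distinct combination becomes a separate class
joined = [" ".join(labels) for labels in label_lists]
n_joined_classes = len(set(joined))

# Multi-label encoding: one binary column per individual label
mlb = MultiLabelBinarizer()
Y_indicator = mlb.fit_transform(label_lists)
n_labels = len(mlb.classes_)

print("classes with joined strings:", n_joined_classes)
print("columns with MultiLabelBinarizer:", n_labels)
print(Y_indicator)
```

Such an indicator matrix can be passed as `y` to `OneVsRestClassifier`, which then trains one classifier per label rather than one per label combination.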



## Computing the results

In [17]:
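# Caution: y_pred has shape (n,) and y_test has shape (n, 1), so == broadcasts
# to an (n, n) matrix; this mean is therefore not the accuracy
print(np.mean(y_pred == y_test))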

# Print the accuracy, precision, recall and f1_score
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision: ", precision_score(y_test, y_pred, average='micro'))
print("Recall: ", recall_score(y_test, y_pred, average='micro'))
print("F1 Score: ", f1_score(y_test, y_pred, average='micro'))

0.00023858602692864347
Accuracy:  0.015894039735099338
Precision:  0.015894039735099338
Recall:  0.015894039735099338
F1 Score:  0.015894039735099338
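
The first number printed above is much smaller than the accuracy because `y_pred` is one-dimensional while `y_test` is a column vector: NumPy broadcasts the comparison to every pair of prediction and target. The four identical values that follow are expected, since with exactly one predicted class per sample the micro-averaged precision, recall and F1 score all reduce to the accuracy. A small sketch with invented arrays shows the broadcasting effect and how `ravel()` restores an element-wise comparison.

```python
import numpy as np

# Invented predictions (1-D) and targets (column vector), mirroring the shapes above
y_pred_toy = np.array(["a", "b", "c", "a"])
y_test_toy = np.array([["a"], ["b"], ["a"], ["c"]])

# (4,) against (4, 1) broadcasts to a (4, 4) matrix of all pairwise comparisons
pairwise = y_pred_toy == y_test_toy
print(pairwise.shape)

# Flattening the targets gives the intended element-wise comparison
elementwise = y_pred_toy == y_test_toy.ravel()
print(elementwise.shape)

print(np.mean(pairwise), np.mean(elementwise))
```

Only the second mean corresponds to the share of correctly predicted samples.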


## Custom accuracy computation

In [18]:
# Custom accuracy: for each test sample, compute the fraction of words in the
# predicted label string that occur in the true label string, then average
# Note: `word in y_test[i][0]` is a substring test, so "star" would match "stars"
percentage = 0
for i in range(len(y_pred)):
    words = y_pred[i].split(' ')
    for word in words:
        print(word, y_test[i])
        print(word in y_test[i][0])
        if word in y_test[i][0]:
            percentage += 1 / len(words)

print(f"Accuracy: {percentage/len(y_pred)*100:.2f}%")

high ['binary stars dwarf galaxies dark matter']
False
resolution ['binary stars dwarf galaxies dark matter']
False
spectroscopy ['binary stars dwarf galaxies dark matter']
False
galaxy ['binary stars dwarf galaxies dark matter']
False
kinematics ['binary stars dwarf galaxies dark matter']
False
stellar ['binary stars dwarf galaxies dark matter']
False
kinematics ['binary stars dwarf galaxies dark matter']
False
close ['white dwarf stars']
False
binary ['white dwarf stars']
False
stars ['white dwarf stars']
True
white ['white dwarf stars']
True
dwarf ['white dwarf stars']
True
stars ['white dwarf stars']
True
binary ['eclipsing binary stars stellar pulsations radial velocity spectroscopy']
True
stars ['eclipsing binary stars stellar pulsations radial velocity spectroscopy']
True
contact ['eclipsing binary stars stellar pulsations radial velocity spectroscopy']
False
binary ['eclipsing binary stars stellar pulsations radial velocity spectroscopy']
True
stars ['eclipsing binary stars ste
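
The loop above tests `word in y_test[i][0]`, which is a substring check on the whole label string, and it counts a repeated predicted word each time it appears. As a hedged variation, the sketch below compares sets of whole words instead. The function name `word_overlap` and the sample strings are invented for illustration (the strings are modelled on the output above).

```python
def word_overlap(predicted: str, expected: str) -> float:
    """Fraction of distinct predicted words that also appear among the expected words."""
    predicted_words = set(predicted.split())
    expected_words = set(expected.split())
    if not predicted_words:
        return 0.0
    return len(predicted_words & expected_words) / len(predicted_words)


# (predicted label string, true label string) pairs, invented for the example
pairs = [
    ("close binary stars white dwarf stars", "white dwarf stars"),
    ("high resolution spectroscopy", "binary stars dwarf galaxies dark matter"),
]
scores = [word_overlap(p, e) for p, e in pairs]
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```

This remains a word-level score: it rewards partial matches between multi-word labels and should not be read as a per-label accuracy.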

# Conclusion

Training on the val-0000... dataset took about a dozen seconds for a total of 3025 rows, which is fairly fast.


(18677 values for the train dataset)