# Classification baseline for German

    - Transformers: summarization: 2 models --> 2 summaries, then a similarity score between these 2 summaries
    Note that the similarity can also be computed on the full texts (as a separate score?) rather than on the summaries
    - Text classification over a set of "press" categories
    - Sentiment analysis: check whether the two texts share the same tone
    - The last 2 kinds of classifier are combined with a dot product: for each category, text1 gets note1 and text2 gets note2,
    so the final score is sum(note1 * note2) over the categories, times 100 (a score out of 100)
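
The dot-product score described in the last bullet can be illustrated on toy numbers. This is a minimal sketch: the function name `category_agreement` and the score values are made up for illustration and do not come from any model run.

```python
def category_agreement(scores1, scores2, categories):
    """Dot product of two per-category score dicts, scaled to a 0-100 range."""
    return 100 * sum(scores1[c] * scores2[c] for c in categories)

# Hypothetical classifier outputs for two texts (each sums to 1).
categories = ["Politik", "Sport", "Wirtschaft"]
text1_scores = {"Politik": 0.7, "Sport": 0.1, "Wirtschaft": 0.2}
text2_scores = {"Politik": 0.6, "Sport": 0.3, "Wirtschaft": 0.1}

score = category_agreement(text1_scores, text2_scores, categories)
print(round(score, 1))  # 47.0
```

The score reaches 100 only when both texts put all of their probability mass on the same single category, so two texts with similar but spread-out distributions score well below 100.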

In [1]:
import pandas as pd
import pke
import spacy
import torch
import stanza
import spacy_stanza
import warnings
import string
import gensim
from gensim.models import KeyedVectors
import enchant    # For spell-checking the synonyms
import numpy as np
import re
from transformers import pipeline
from transformers import AutoModel
from transformers import AutoModelForSequenceClassification
from transformers import AutoModelWithLMHead, AutoTokenizer
from tqdm.notebook import tqdm
from nltk.corpus import stopwords
tqdm.pandas()
warnings.filterwarnings("ignore")



**NLP model selection: here GERMAN**

In [2]:
# Load the spaCy pipeline - German
nlp_de = spacy.load("de_core_news_sm")

In [3]:
dico_spacy = {'de':nlp_de}   # 'en':nlp_en,'de':nlp_de,'es':nlp_es,'pl':nlp_pl  - FOR REFERENCE
langues = ['en','fr','es','de','pl','ar','tr']

In [4]:
# Load the Word2Vec model used to look up synonyms
from gensim.models import Word2Vec
model_gensim = gensim.models.KeyedVectors.load_word2vec_format("D:/Users/STG-SDU/Documents/NLP/german.model", binary=True)

In [5]:
# German stopwords: NLTK + spaCy
stopWords = list(nlp_de.Defaults.stop_words)
stopwords_de = list(stopwords.words('german'))  
stopwords_de = list(set(stopwords_de + stopWords))
stopwds_lg = {'de':stopwords_de}

In [6]:
# spell checker used to validate synonyms: OPTIONAL, not required
# d = enchant.Dict("de") 

**Transformer model selection: Summary - Text Classification - Sentiment Analysis - Similarity**

In [7]:
# Transformer summarization models (NB: remember to set truncation on every model; omitting it can cause errors)
summarizer1 = pipeline("summarization", model="ml6team/mt5-small-german-finetune-mlsum", truncation = "only_first")

In [8]:
# 2nd summarizer
from transformers import BertTokenizerFast, EncoderDecoderModel
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt = 'mrm8488/bert2bert_shared-german-finetuned-summarization'
tokenizer = BertTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_pretrained(ckpt).to(device)
def summarizer2(text):
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)

The following encoder weights were not tied to the decoder ['bert/pooler']


In [9]:
# classes : politics, economy, entertainment, environment,sport,health
text_clf1 = pipeline("zero-shot-classification", model = "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli", truncation = "only_first")

In [10]:
# Zero-shot classification (lets us choose our own topics)
text_clf2 = pipeline('zero-shot-classification', model="Sahajtomar/German_Zeroshot",truncation = "only_first")
# this is a zero-shot classifier: the candidate categories below were chosen by hand (typical press sections)
candidate_labels = ['Wissenschaft','Politik','Bildung','Nachrichten','Gesundheit','Technologie','Gesellschaft','Sport','Wirtschaft','Kultur','International','Umwelt']

In [11]:
# Sentiment Analysis
sentiment1 = pipeline("text-classification", model = 'oliverguhr/german-sentiment-bert', truncation = "only_first")
# CAUTION: MODEL no. 2 USES A 5-LEVEL SCALE (1 to 5 stars)
sentiment2 = pipeline("text-classification", model = 'nlptown/bert-base-multilingual-uncased-sentiment', truncation = "only_first")
# a third classification model is added
sentiment3 = pipeline("text-classification", model = 'symanto/xlm-roberta-base-snli-mnli-anli-xnli', truncation = "only_first")

In [12]:
# ENCODING WITH SENTENCE TRANSFORMERS
from sentence_transformers import SentenceTransformer,util
encoder = SentenceTransformer("Sahajtomar/German-semantic")
encoder2 = SentenceTransformer("symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli")
def score_similarite(sentence1,sentence2):
    # note: sentence1 and sentence2 must each be a plain one-item list so that the torch dimensions line up
    embed1 = encoder.encode(sentence1, convert_to_tensor=True)
    embed2 = encoder.encode(sentence2, convert_to_tensor=True)
    embed3 = encoder2.encode(sentence1, convert_to_tensor=True)
    embed4 = encoder2.encode(sentence2, convert_to_tensor=True)
    sim1 = float(util.pytorch_cos_sim(embed1,embed2))
    sim2 = float(util.pytorch_cos_sim(embed3,embed4))
    # mean of the two cosine similarities, expressed out of 100
    return round((sim1 + sim2)*100/2,2)
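
The intended score appears to be the mean of the two encoders' cosine similarities on a 0-100 scale. The sketch below reproduces that arithmetic with hand-written vectors in place of real embeddings; `cosine` and `averaged_similarity` are illustrative names, not part of the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def averaged_similarity(pairs):
    """pairs holds one (embedding_of_text1, embedding_of_text2) tuple per encoder."""
    sims = [cosine(u, v) for u, v in pairs]
    return round(100 * sum(sims) / len(sims), 2)

# Toy "embeddings": encoder 1 sees identical texts, encoder 2 sees orthogonal ones.
toy_pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
result = averaged_similarity(toy_pairs)
print(result)  # 50.0
```

Note the parentheses around the sum: without them, only the second similarity would be scaled by 100/2.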

**Data selection by language**

In [13]:
data = pd.read_csv('eval_data_prep_v1.csv')

In [14]:
data

Unnamed: 0,pair_id,pair_lang,source_url_1,publish_date_1,source_url_2,publish_date_2,title_1,text_1,meta_description_1,meta_keywords_1,title_2,text_2,meta_description_2,meta_keywords_2,ligne
0,1484189203_1484121193,en_en,https://wsvn.com,,https://wsvn.com,,Police: 2 men stole tools from Lowe’s in Davie,"DAVIE, FLA. (WSVN) - Police need help catching...",,[''],No-swim advisory lifted for Deerfield Beach Pier,"DEERFIELD BEACH, FLA. (WSVN) - A no-swim advis...",,[''],0
1,1484011097_1484011106,en_en,https://www.zdnet.com,,https://securityboulevard.com,Fri Oct 25 11:10:18 2019,"Open database leaked 179GB in customer, US gov...",Govt officials confirm Trump can block US comp...,The US Department of Homeland Security has bec...,[''],Best Western’s Massive Data Leak: 179GB Amazon...,The latest huge unsecured cloud storage find i...,The latest huge unsecured cloud storage find i...,[''],1
2,1484039488_1484261803,en_en,https://www.presstelegram.com,Tue Dec 31 00:00:00 2019,https://boingboing.net,Wed Jan 1 00:00:00 2020,Ducks are own worst enemies in sloppy loss in ...,"Ducks defenseman Erik Gudbranson, left, knocks...",,[''],Woody Guthrie's 1943 New Year's Resolutions ar...,Woody Guthrie's 1943 New Year's Resolutions ar...,"I'd seen this before, but I was reminded of it...",[''],2
3,1484332324_1484796748,en_en,https://www.financialexpress.com,Thu Jan 2 08:28:22 2020,https://www.news18.com,,Another Bengal vs Centre tussle? Govt rejects ...,The West Bengal government’s proposal was reje...,The West Bengal government's proposal was reje...,"['republic day', 'west bengal tableau', 'benga...",'Congress Rejected 7 Times': BJP's Reminder as...,Mumbai: The NCP and Shiv Sena on Thursday targ...,BJP ally and Union minister Ramdas Athawale sa...,"['BJP', 'congress', 'Mamata Banerjee', 'NCP', ...",3
4,1484012256_1484419682,en_en,https://www.birminghammail.co.uk,Wed Jan 1 15:03:04 2020,http://m.fightbacknews.org,Wed Jan 1 00:00:00 2020,Bars and clubs you loved and lost this decade ...,The video will start in 8 Cancel\n\nSign up to...,Nightclubs and bars that have closed in the pa...,"['Birmingham City Centre', 'Digbeth', 'Things ...",Top 20 films of the 2010s,"Jacksonville, FL - I'm not sure how we'll look...","Jacksonville, FL - I'm not sure how we'll look...","['organizing', 'activism', 'socialism', ""Peopl...",4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4948,1553907621_1553488848,es_it,https://www.diariolibre.com,,https://www.basketuniverso.it,Thu Mar 19 23:45:49 2020,Denver Nuggets reporta que “un miembro de la o...,Los Denver Nuggets de la NBA reportaron este j...,Los Denver Nuggets de la NBA reportaron este j...,['NBA'],"Coronavirus, un caso anche fra i Denver Nuggets",Nato ad Alatri (Fr) nel ’93 e qui diplomato al...,"Un altro caso di Coronavirus nella NBA, stavol...",[''],4948
4949,1646957948_1643667075,es_it,https://diario16.com,Sun Jun 28 04:20:00 2020,https://www.laregione.ch,Wed Jun 24 16:41:00 2020,Vivir en España es más barato que en la media ...,El estudio realizado por Eurostat muestra que ...,El estudio realizado por Eurostat muestra que ...,[''],"Coronavirus, in Europa 140mila morti in più in...","Nei mesi di marzo e aprile 2020, dalla decima ...",Il picco di morti aggiuntivi rispetto alla med...,[''],4949
4950,1504063453_1502866628,es_it,https://elaragueno.com.ve,,https://www.greenme.it,Thu Jan 23 12:46:02 2020,Activan sistema de vigilancia epidemiológica e...,Foto: Archivo Foto: Archivo\n\nEl sistema de v...,,[''],Coronavirus in Cina: consumo di serpenti e zup...,La diffusione del mortale Coronavirus potrebbe...,L'infezione da coronavirus potrebbe aver avuto...,"['cina', 'coronavirus', 'pipistrelli', 'serpen...",4950
4951,1647862428_1647712939,es_it,http://www.am.com.mx,Mon Jun 29 00:00:00 2020,https://it.sputniknews.com,Mon Jun 29 13:46:00 2020,Emite Irán orden de arresto contra Tump por as...,CDMX.- Anunció Irán este lunes que ha emitido ...,"Ha emitido Irán una orden de arresto, que ha s...","['DONALD TRUMP', 'ESTADOS UNIDOS', 'IRÁN']",Iran emette mandato di arresto per Trump per l...,Il procuratore di Teheran Ali Alqasi Mehr in u...,"Il procuratore di Teheran, Ali Alqasi Mehr, ha...","['mondo', 'qasem soleimani', ""le tensioni tra ...",4951


**Corrections**

In [15]:
path = 'C:/Users/stg-sdu/Notebooks/NLP/SemEval-2022/Data/'
eval_data_location = path + "semeval-2022_task8_eval_data_202201.csv"
data_eval_location = pd.read_csv(eval_data_location)

In [17]:
# liste_indexes = []
# liste_lignes= []
# for i in range(len(allemand)):
#     b = False
#     if len(allemand.title_1[i]) == 0:
#         print('index ',i,' title 1 : row ',allemand.ligne[i],data_eval_location.ia_link1[allemand.ligne[i]])
#         b=True
#     if len(allemand.text_1[i]) == 0:
#         print('index ',i,' text 1 : row ',allemand.ligne[i],data_eval_location.ia_link1[allemand.ligne[i]])  
#         b=True
#     if len(allemand.title_2[i]) == 0:
#         print('index ',i,' title 2 : row ',allemand.ligne[i],data_eval_location.ia_link2[allemand.ligne[i]])
#         b=True
#     if len(allemand.text_2[i]) == 0:
#         print('index ',i,' text 2 : row ',allemand.ligne[i],data_eval_location.ia_link2[allemand.ligne[i]])
#         b = True
#     if b:
#         liste_indexes.append(i);liste_lignes.append(i)

**Program**

In [18]:
# keep track of the row number - fill in the NaN values
data['ligne'] = data.index
data = data.fillna('')

In [19]:
# split the datasets (the last one would additionally need translating); only the de_de pairs are kept here
allemand = data.loc[data.pair_lang == 'de_de',['ligne','title_1','title_2','text_1','text_2']].reset_index(drop=True)

In [20]:
# Score computation (dot product) for the classification results
def fonction_produit_dotcom(liste_categor, dico_scores1,dico_scores2):
    """dico_scores1 and dico_scores2 hold the scores obtained for each category for texts 1 and 2"""
    result = 0.0
    for cat in liste_categor:
        result += round(dico_scores1[cat] * dico_scores2[cat],4)
    return result * 100

In [21]:
# convert the output of a type-1 transformer (a list of label/score dicts) into a {label: score} dict
def transform_text_clf1(liste_dico):
    res = {}
    for dic in liste_dico:
        res[dic['label']] = dic['score']
    return res

In [22]:
# convert the output of a type-2 transformer (parallel lists of labels and scores) into a {label: score} dict
def transform_text_clf2(liste_cat,liste_sc):
    res = {}
    for i in range(len(liste_cat)):
        res[liste_cat[i]] = liste_sc[i]
    return res
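
The two helpers above normalise two different output shapes into one `{label: score}` dictionary. The sketch below shows both shapes on hand-written samples; the sample values are invented, and the shapes are assumptions inferred from how the notebook indexes the pipeline results (`classes['labels']`, `dic['label']`).

```python
def labels_scores_to_dict(labels, scores):
    """Zero-shot style output: two parallel lists."""
    return dict(zip(labels, scores))

def records_to_dict(records):
    """Text-classification style output: a list of {'label': ..., 'score': ...} dicts."""
    return {record["label"]: record["score"] for record in records}

# Hand-written samples mimicking the two shapes (not real model output).
zero_shot_output = {"sequence": "...", "labels": ["Sport", "Politik"], "scores": [0.9, 0.1]}
sentiment_output = [{"label": "positive", "score": 0.8}, {"label": "negative", "score": 0.2}]

topics = labels_scores_to_dict(zero_shot_output["labels"], zero_shot_output["scores"])
tones = records_to_dict(sentiment_output)
print(topics)  # {'Sport': 0.9, 'Politik': 0.1}
print(tones)   # {'positive': 0.8, 'negative': 0.2}
```

Once both shapes are dictionaries keyed by label, the same dot-product function can score any of the five classifiers.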

In [23]:
# Test on sentiment3: label lists expected from each classifier
liste_categories = ['ENTAILMENT','NEUTRAL','CONTRADICTION']
labels = ['politics', 'economy', 'entertainment', 'environment','sport','health']
liste_labels = ['positive','negative','neutral']
liste_sentiments = ['1 star','2 stars','3 stars','4 stars','5 stars']

In [24]:
# Summarization functions
def summarization(texte):
    return summarizer1(texte)[0]['summary_text'], summarizer2(texte)

In [25]:
dico_classifiers = {'text_clf1': 'score_classif1','text_clf2':'score_classif2','sentiment1':'score_sentiment1',
                    'sentiment2': 'score_sentiment2','sentiment3': 'score_sentiment3'}
dico_categories = {'text_clf1': labels,'text_clf2':candidate_labels,'sentiment1':liste_labels,
                    'sentiment2': liste_sentiments,'sentiment3':liste_categories}

In [26]:
# Classification and sentiment analysis functions
def classification(texte,clf):
    # assumes the classifier names above and the matching transformation method for each
    if clf == "text_clf1":
        try:
            classes = text_clf1(texte,dico_categories['text_clf1'])
        except:
            return 'error'
        else:
            return transform_text_clf2(classes['labels'],classes['scores'])
    elif clf == "text_clf2":                                 
        try:
            classes = text_clf2(texte,dico_categories['text_clf2'])
        except:
            return 'error'
        else:
            return transform_text_clf2(classes['labels'],classes['scores'])                          
    elif clf == "sentiment1":
        try:
            scores = transform_text_clf1(sentiment1(texte,return_all_scores=True)[0])
        except:
            return 'error'
        else:
            return scores
    elif clf == "sentiment2":
        try:
            scores = transform_text_clf1(sentiment2(texte,return_all_scores=True)[0])
        except:
            return 'error'
        else:
            return scores
    elif clf == "sentiment3":
        try:
            scores = transform_text_clf1(sentiment3(texte,return_all_scores=True)[0])
        except:
            return 'error'
        else:
            return scores
    else:
        return 'error'
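
The five branches above all follow the same pattern: call a model, convert its output, and fall back to an error marker on failure. One possible way to express that pattern once is a small wrapper plus a lookup table. This is only a sketch: `make_classifier`, `fake_zero_shot` and `fake_sentiment` are hypothetical names, the fake functions stand in for the real pipelines, and `None` replaces the `'error'` string as the failure marker.

```python
def make_classifier(run_model, to_dict):
    """Wrap a model call so that a failure returns None instead of raising."""
    def classify(text):
        try:
            return to_dict(run_model(text))
        except Exception:  # narrower than a bare except: KeyboardInterrupt still propagates
            return None
    return classify

# Hypothetical stand-ins for the real pipelines, for illustration only.
def fake_zero_shot(text):
    if not text:
        raise ValueError("empty input")
    return {"labels": ["Sport", "Politik"], "scores": [0.9, 0.1]}

def fake_sentiment(text):
    return [[{"label": "positive", "score": 0.8}, {"label": "negative", "score": 0.2}]]

classifiers = {
    "text_clf": make_classifier(fake_zero_shot, lambda out: dict(zip(out["labels"], out["scores"]))),
    "sentiment": make_classifier(fake_sentiment, lambda out: {d["label"]: d["score"] for d in out[0]}),
}

ok = classifiers["text_clf"]("Der FC Bayern gewinnt.")
failed = classifiers["text_clf"]("")
print(ok)      # {'Sport': 0.9, 'Politik': 0.1}
print(failed)  # None
```

Adding a classifier then means adding one entry to the table rather than another `elif` branch.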

In [27]:
# NLP preprocessing for PKE: drop stopwords and words of 2 letters or fewer, unless the word starts with a digit
def supp_moins_2_lettres_stopwords(phrase,stopwd):
    temp = phrase.split(' ')
    res = ''
    for mot in temp:
        if mot not in stopwd and (len(mot)>2 or (len(mot)>0 and mot[0] in ['0','1','2','3','4','5','6','7','8','9'])):
            res += mot + ' '
    return res[:-1]

In [28]:
# NLP preprocessing for PKE: remove hyphens (joining the two parts) / apostrophes / punctuation
def modif(texte,stopmots):
    # simple text clean-up: punctuation, short words, stopwords (to apply to the texts used for entities and pke)
    texte=re.sub('\'',' ',texte)   # replace apostrophes with a space
    texte=re.sub('-','',texte)    # remove hyphens
    regex = re.compile('[%s]' % re.escape(string.punctuation)) # replace all remaining punctuation with a space
    texte=regex.sub(' ',texte)
    texte = supp_moins_2_lettres_stopwords(texte,stopmots)
    return texte
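
To make the effect of this preprocessing concrete, here is a compact, self-contained re-statement of the same steps applied to one sentence. `clean_text` is an illustrative name and the stopword set is a tiny hand-picked sample, not the NLTK/spaCy list used in the notebook.

```python
import re
import string

def clean_text(text, stop_words):
    """Strip apostrophes, hyphens and punctuation, then drop stopwords and short words."""
    text = text.replace("'", " ")   # apostrophes become spaces
    text = text.replace("-", "")    # hyphenated parts are joined
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)
    kept = [
        word for word in text.split(" ")
        if word not in stop_words
        and (len(word) > 2 or (word and word[0] in string.digits))
    ]
    return " ".join(kept)

sample = "Die Bundes-Regierung hat 5 neue Gesetze, so heißt es."
cleaned = clean_text(sample, {"hat", "neue"})
print(cleaned)  # Die BundesRegierung 5 Gesetze heißt
```

The stopword test is case-sensitive, so a capitalised "Die" survives a lowercase stopword list; lowercasing the text first would change that, at the cost of losing capitalisation cues for later entity recognition.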

In [29]:
# Add synonyms (optionally spell-checked) after the pke analysis
def ajout_synonymes(mot, correct_ortho = False):
    # keep the first 10 nearest neighbours; spelling is checked only when correct_ortho is set
    # (that option needs the enchant dictionary `d`, which is commented out above)
    syns = model_gensim.most_similar(mot,topn = 20)
    if correct_ortho:
        res = []
        for m in syns:
            if d.check(m[0]):   # m is a (word, similarity) pair
                res.append(m)
        syns = res
    return syns[:10]
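
The filtering logic can be tried without loading the Word2Vec model. In this sketch the neighbour list is written by hand and `is_valid_word` is a hypothetical predicate standing in for a spell checker; neither comes from gensim or enchant.

```python
def top_synonyms(neighbours, is_valid_word=None, limit=10):
    """Keep the first `limit` (word, similarity) pairs, optionally filtered by a validity check."""
    if is_valid_word is not None:
        neighbours = [(word, sim) for word, sim in neighbours if is_valid_word(word)]
    return neighbours[:limit]

# Hand-written neighbours, sorted by decreasing similarity as most_similar would return them.
neighbours = [("Regierung", 0.81), ("Regierng", 0.78), ("Kabinett", 0.74)]
known_words = {"Regierung", "Kabinett"}

unfiltered = top_synonyms(neighbours, limit=2)
filtered = top_synonyms(neighbours, is_valid_word=lambda w: w in known_words, limit=2)
print(unfiltered)  # [('Regierung', 0.81), ('Regierng', 0.78)]
print(filtered)    # [('Regierung', 0.81), ('Kabinett', 0.74)]
```

Fetching more neighbours than needed (20 in the notebook) before filtering leaves room for the check to discard misspelled entries and still return 10.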

In [30]:
# Choice of the method parameters: to be revisited?
methode1 = {"NOUN", "PROPN", "ADJ","VERB"}
methode2 = {"NOUN", "PROPN", "ADJ"}
nb_mots = {'meth1': 30, 'meth2':50}

In [31]:
# PKE: analysis of the main terms in the texts and titles
# Known issue: see the docstring
def transformation_pke_results(res1,res2, correct_ortho = False):
    """
    Transforms the PKE results. Problem: a bigram may be missing from one of the 2 texts while one of its words is present,
    so we build a list of keys and a dictionary of values in which each word of a bigram inherits the bigram's weight.
    Example: fuite eau:0.05 --> 3 entries in the end: fuite, eau, fuite eau, each at 0.05
    In addition, gensim synonyms are added, weighted by their similarity and optionally spell-checked."""
    
    liste1 = []; liste2 = [] ; dico1 = {}; dico2 = {}
    for elt in res1:
        liste1.append(elt[0])
        dico1[elt[0]] = round(elt[1],3)
        if ' ' in elt[0]:    # bigram: also add its 2 words
            liste = elt[0].split(' ')
            for mot in liste:
                liste1.append(mot)
                dico1[mot] = round(elt[1],3)
                try:
                    synonyms = ajout_synonymes(mot,correct_ortho = correct_ortho)
                except:
                    pass
                else:
                    for syn in synonyms:
                        liste1.append(syn[0])   # add the synonym
                        dico1[syn[0]] = round(elt[1] * syn[1], 3)  # keyphrase weight scaled by the similarity
                    
    for elt in res2:
        liste2.append(elt[0])
        dico2[elt[0]] = round(elt[1],3)
        if ' ' in elt[0]:
            liste = elt[0].split(' ')
            for mot in liste:
                liste2.append(mot)
                dico2[mot] = round(elt[1],3)
                try:
                    synonyms = ajout_synonymes(mot,correct_ortho = correct_ortho)
                except:
                    pass
                else:
                    for syn in synonyms:
                        liste2.append(syn[0])   # add the synonym
                        dico2[syn[0]] = round(elt[1] * syn[1], 3)  # keyphrase weight scaled by the similarity
    
    # weighted similarity between the 2 lists produced by pke
    sim = 0
    for elt in liste1:
        if elt in liste2:
            sim += (dico1[elt] + dico2[elt])/2
    return sim
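
The core of this scoring, leaving synonyms aside, can be sketched in a few lines. The keyphrases and weights below are invented, `expand_keyphrases` and `weighted_overlap` are illustrative names, and unlike the cell above this version counts each shared term once even if it was added several times.

```python
def expand_keyphrases(keyphrases):
    """Map each keyphrase, and each word of a multi-word keyphrase, to the keyphrase weight."""
    weights = {}
    for phrase, weight in keyphrases:
        weights[phrase] = round(weight, 3)
        if " " in phrase:
            for word in phrase.split(" "):
                weights[word] = round(weight, 3)
    return weights

def weighted_overlap(keyphrases1, keyphrases2):
    """Sum, over the shared terms, of the mean of their two weights."""
    w1 = expand_keyphrases(keyphrases1)
    w2 = expand_keyphrases(keyphrases2)
    return sum((w1[term] + w2[term]) / 2 for term in w1 if term in w2)

# Invented (keyphrase, weight) pairs in the shape returned by get_n_best.
kp_text1 = [("wasser leck", 0.05), ("rathaus", 0.02)]
kp_text2 = [("leck", 0.04), ("rathaus", 0.03)]

overlap = weighted_overlap(kp_text1, kp_text2)
print(round(overlap, 3))  # 0.07
```

Here "leck" matches only because the bigram "wasser leck" was expanded into its words, which is the situation the docstring describes.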

In [32]:
def entites_communes(nlp,text1,text2):
    """
    This first function only looks at shared entities: persons, dates, organisations, locations.
    It is applied to both the texts and the titles and the counts are added up: an entity shared in both title and text counts twice!"""
    
    doc1 = nlp(text1)
    doc2 = nlp(text2)
    nb_commun_ent = 0; liste_commun_ent = []
    nb_commun_geo = 0; liste_commun_geo = []
    nb_commun_dat = 0; liste_commun_dat = []
    
    if len(doc1.ents)>0 and len(doc2.ents)>0:
        liste1 = []; dico1 = {}
        for elt in doc1.ents:
            if elt.label_ in ['PERSON','PER']:
                # split person names into words so that "Joe Biden" and "Biden" can match
                mots = elt.text.split(' ')
                for mot in mots:
                    if mot not in liste1:
                        liste1.append(mot)
                        dico1[mot] = elt.label_
            elif elt.label_ in ['LOC','ORG','GPE','DATE','TIME']:
                if elt.text not in liste1:
                    liste1.append(elt.text)
                    dico1[elt.text] = elt.label_
        liste2 = []
        for elt in doc2.ents:
            if elt.label_ in ['PERSON','PER']:
                mots = elt.text.split(' ')
                for mot in mots:
                    if mot not in liste2:
                        liste2.append(mot)
            elif elt.label_ in ['LOC','ORG','GPE','DATE','TIME']:
                if elt.text not in liste2:
                    liste2.append(elt.text)
        
        # items common to both lists        
        for elt in liste1:
            if elt in liste2:
                if dico1[elt] == 'LOC':
                    nb_commun_geo += 1
                    liste_commun_geo.append(elt)
                elif dico1[elt] in ['DATE','TIME']:
                    nb_commun_dat += 1
                    liste_commun_dat.append(elt)
                else:
                    nb_commun_ent += 1
                    liste_commun_ent.append(elt)
                    
    return nb_commun_ent, liste_commun_ent,nb_commun_geo, liste_commun_geo,nb_commun_dat, liste_commun_dat
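
The matching idea can be tested without loading stanza or spaCy by feeding in hand-written `(text, label)` pairs. In this sketch, person names are split into words so that "Joe Biden" and "Biden" share a token; `shared_entities` and the sample entities are illustrative and not taken from any model output.

```python
def shared_entities(ents1, ents2):
    """Group the entities common to two documents into entities, places and dates."""
    def normalise(ents):
        table = {}
        for text, label in ents:
            if label in ("PERSON", "PER"):
                for word in text.split(" "):
                    table.setdefault(word, label)
            elif label in ("LOC", "ORG", "GPE", "DATE", "TIME"):
                table.setdefault(text, label)
        return table

    table1, table2 = normalise(ents1), normalise(ents2)
    common = {"entities": [], "places": [], "dates": []}
    for text, label in table1.items():
        if text in table2:
            if label == "LOC":
                common["places"].append(text)
            elif label in ("DATE", "TIME"):
                common["dates"].append(text)
            else:
                common["entities"].append(text)
    return common

# Hand-written entities standing in for doc.ents.
doc1_ents = [("Joe Biden", "PER"), ("Berlin", "LOC"), ("Montag", "DATE")]
doc2_ents = [("Biden", "PER"), ("Berlin", "LOC"), ("SPD", "ORG")]

common = shared_entities(doc1_ents, doc2_ents)
print(common)  # {'entities': ['Biden'], 'places': ['Berlin'], 'dates': []}
```

As in the notebook, the category of a shared item is taken from the first document's label, and only `LOC` counts as a place.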

In [33]:
def Creation_features_comparaison(df,langue, test_position = [methode1,methode2]):
    """Builds the scores that later serve as features for classification"""
    
    resultats = pd.DataFrame(columns = ['summary1_text1','summary2_text1','summary1_text2','summary2_text2',
            'nb_entites_idem','nb_lieux_idem', 'nb_dates_idem','entites_idem','lieux_idem','dates_idem',
            'score_similarite_titres','score_similarite_resume1','score_similarite_resume2','score_classif1','score_classif2',
            'score_sentiment1','score_sentiment2','score_sentiment3','meth1_similarites','meth2_similarites'])
    
    # initialise the stanza pipeline for the language
    stanza.download(langue)
    nlp_stanza = spacy_stanza.load_pipeline(langue)
    stopmts = stopwds_lg[langue]
    if langue in dico_spacy.keys():
        nlp_spacy = dico_spacy[langue]
    else:
        nlp_spacy = None
    print(nlp_spacy is not None)   # True when a spaCy model is available for this language
    for i in tqdm(range(len(df))):
        dico_res = {}
        
        # Summaries and comparisons
        dico_res['summary1_text1'],dico_res['summary2_text1'] = summarization(df.text_1[i])
        dico_res['summary1_text2'],dico_res['summary2_text2'] = summarization(df.text_2[i])
        dico_res['score_similarite_titres'] = score_similarite([df.title_1[i]],[df.title_2[i]])
        dico_res['score_similarite_resume1'] = score_similarite([dico_res['summary1_text1']],[dico_res['summary1_text2']])
        dico_res['score_similarite_resume2'] = score_similarite([dico_res['summary2_text1']],[dico_res['summary2_text2']])
        
        # text classification and sentiment analysis
        texte1 = df.title_1[i] + ' ' + df.text_1[i]
        texte2 = df.title_2[i] + ' ' + df.text_2[i]
        if len(texte1)>0 and len(texte2)>0:
            for classifier in dico_classifiers.keys():
                scores1 = classification(texte1,classifier)
                scores2 = classification(texte2,classifier)
                if scores1 != 'error' and scores2 != 'error':
                    dico_res[dico_classifiers[classifier]] = fonction_produit_dotcom(dico_categories[classifier], scores1,scores2)
                else:
                    scores1 = classification(df.title_1[i],classifier)
                    scores2 = classification(df.title_2[i],classifier)
                    if scores1 != 'error' and scores2 != 'error':
                        dico_res[dico_classifiers[classifier]] = fonction_produit_dotcom(dico_categories[classifier], scores1,scores2)
                    else:
                        dico_res[dico_classifiers[classifier]] = None
                
        # text preprocessing for PKE
        texte1 = modif(texte1, stopmts)
        texte2 = modif(texte2, stopmts)
        
        # SHARED ENTITIES: multi-word names cause mismatches, e.g. "Joe Biden" vs "Biden", so names are split into words
        # Titles and texts are accumulated, with a double weight for the title
        # Short words should also be removed, hence the text preprocessing
        
        nb_ent1,list_ent1,nb_geo1,list_geo1,nb_dat1,list_dat1 = entites_communes(nlp_stanza,df.title_1[i],df.title_2[i])
        nb_ent2,list_ent2,nb_geo2,list_geo2,nb_dat2,list_dat2 = entites_communes(nlp_stanza,df.text_1[i],df.text_2[i])
        if nlp_spacy is not None:
            nb_ent3,list_ent3,nb_geo3,list_geo3,nb_dat3,list_dat3 = entites_communes(nlp_spacy,df.title_1[i],df.title_2[i])
            nb_ent4,list_ent4,nb_geo4,list_geo4,nb_dat4,list_dat4 = entites_communes(nlp_spacy,df.text_1[i],df.text_2[i])
        else:
            nb_ent3,list_ent3,nb_geo3,list_geo3,nb_dat3,list_dat3 = (0,[],0,[],0,[])
            nb_ent4,list_ent4,nb_geo4,list_geo4,nb_dat4,list_dat4 = (0,[],0,[],0,[])
        dico_res['nb_entites_idem'] = nb_ent1 * 2 + nb_ent2 + nb_ent3 * 2 + nb_ent4
        dico_res['nb_lieux_idem'] = nb_geo1  * 2 + nb_geo2 + nb_geo3  * 2 + nb_geo4
        dico_res['nb_dates_idem'] = nb_dat1 * 2 + nb_dat2 + nb_dat3 * 2 + nb_dat4
        # merge the lists and remove duplicates
        dico_res['entites_idem'] = list(set(list_ent1+list_ent2+ list_ent3+list_ent4))
        dico_res['lieux_idem'] = list(set(list_geo1+list_geo2+list_geo3+list_geo4))
        dico_res['dates_idem'] = list(set(list_dat1+list_dat2+list_dat3+list_dat4))
        
        for j,meth in enumerate(test_position):
            nom ='meth'+str(j+1)
            nb_mots_meth = nb_mots[nom]
            if len(texte1)>0 and len(texte2)>0:
                extractor = pke.unsupervised.TopicRank()
                extractor.load_document(input=texte1,language=langue,normalization="stemming")
                extractor.candidate_selection(pos=meth)
                extractor.candidate_weighting()
                keyphrases3 = extractor.get_n_best(n=nb_mots_meth)
                extractor = pke.unsupervised.TopicRank()
                extractor.load_document(input=texte2,language=langue,normalization="stemming")
                extractor.candidate_selection(pos=meth)
                extractor.candidate_weighting()
                keyphrases4 = extractor.get_n_best(n=nb_mots_meth)
                dico_res[nom+'_similarites'] = round(100*transformation_pke_results(keyphrases3,keyphrases4),1)
            else:
                dico_res[nom+'_similarites'] = 'Error'

        resultats.loc[len(resultats)] = dico_res
        
    newdf = pd.concat([df,resultats],axis=1)
    return newdf

In [95]:
#similarites = Creation_features_comparaison(allemand[:300].reset_index(drop=True),'de')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-01-14 23:14:22 INFO: Downloading default packages for language: de (German)...
2022-01-14 23:14:24 INFO: File exists: C:\Users\stg-sdu\stanza_resources\de\default.zip.
2022-01-14 23:14:28 INFO: Finished downloading models and saved to C:\Users\stg-sdu\stanza_resources.
2022-01-14 23:14:28 INFO: Loading these models for language: de (German):
| Processor | Package |
-----------------------
| tokenize  | gsd     |
| mwt       | gsd     |
| pos       | gsd     |
| lemma     | gsd     |
| depparse  | gsd     |
| sentiment | sb10k   |
| ner       | conll03 |

2022-01-14 23:14:28 INFO: Use device: cpu
2022-01-14 23:14:28 INFO: Loading: tokenize
2022-01-14 23:14:28 INFO: Loading: mwt
2022-01-14 23:14:28 INFO: Loading: pos
2022-01-14 23:14:29 INFO: Loading: lemma
2022-01-14 23:14:29 INFO: Loading: depparse
2022-01-14 23:14:29 INFO: Loading: sentiment
2022-01-14 23:14:29 INFO: Loading: ner
2022-01-14 23:14:30 INFO: Done loading processors!


True


  0%|          | 0/300 [00:00<?, ?it/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [34]:
#similarites.to_csv('eval_de_notes.csv')

In [35]:
#similarites = Creation_features_comparaison(allemand[300:].reset_index(drop=True),'de')

In [36]:
# precedent = pd.read_csv('eval_de_notes.csv',index_col=0)
# similarites2 = pd.concat([precedent,similarites], axis=0)
# similarites2 = similarites2.reset_index(drop=True)
# similarites2.to_csv('eval_de_notes.csv')

In [38]:
compl_allemand = pd.read_csv('eval_compl_allemand.csv')
len(compl_allemand)

151

In [39]:
similarites = Creation_features_comparaison(compl_allemand.reset_index(drop=True),'de')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-01-17 17:52:16 INFO: Downloading default packages for language: de (German)...
2022-01-17 17:52:18 INFO: File exists: C:\Users\stg-sdu\stanza_resources\de\default.zip.
2022-01-17 17:52:23 INFO: Finished downloading models and saved to C:\Users\stg-sdu\stanza_resources.
2022-01-17 17:52:23 INFO: Loading these models for language: de (German):
| Processor | Package |
-----------------------
| tokenize  | gsd     |
| mwt       | gsd     |
| pos       | gsd     |
| lemma     | gsd     |
| depparse  | gsd     |
| sentiment | sb10k   |
| ner       | conll03 |

2022-01-17 17:52:23 INFO: Use device: cpu
2022-01-17 17:52:23 INFO: Loading: tokenize
2022-01-17 17:52:23 INFO: Loading: mwt
2022-01-17 17:52:23 INFO: Loading: pos
2022-01-17 17:52:23 INFO: Loading: lemma
2022-01-17 17:52:24 INFO: Loading: depparse
2022-01-17 17:52:24 INFO: Loading: sentiment
2022-01-17 17:52:25 INFO: Loading: ner
2022-01-17 17:52:26 INFO: Done loading processors!


True


  0%|          | 0/151 [00:00<?, ?it/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Your max_length is set to 20, but you input_length is only 14. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)
Your max_length is set to 20, but you input_length is only 19. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)
Your max_length is set to 20, but you input_length is only 9. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)


In [40]:
similarites.to_csv('eval_compl_de_notes.csv')