# Using models

More and more pretrained models are available.

In [2]:
#!pip install spacy pandas

## Loading the data

In [2]:
import pandas as pd
df = pd.read_csv("../data/CSS_exact_openalex.csv")
df = df[~df["abstract"].isna()]
df["texte"] = df["title"] + " " + df["abstract"]
df.head()

Unnamed: 0,id,doi,title,display_name,relevance_score,publication_year,publication_date,language,type,type_crossref,...,locations.source,sustainable_development_goals.id,sustainable_development_goals.display_name,sustainable_development_goals.score,grants.funder,grants.funder_display_name,grants.award_id,counts_by_year.year,counts_by_year.cited_by_count,texte
0,https://openalex.org/W2159397589,https://doi.org/10.1126/science.1167742,Computational Social Science,Computational Social Science,1318.1996,2009,2009-02-06,en,article,journal-article,...,nan|nan|nan|nan|nan,https://metadata.un.org/sdg/10,Reduced inequalities,0.45,,,,2025|2024|2023|2022|2021|2020|2019|2018|2017|2...,41|114|168|146|219|273|254|250|257|220|277|267...,"Computational Social Science 14,0642,033Metric..."
1,https://openalex.org/W2070907364,https://doi.org/10.1140/epjst/e2012-01697-8,Manifesto of computational social science,Manifesto of computational social science,488.98035,2012,2012-11-01,en,article,journal-article,...,nan|nan|nan|nan|nan|nan|nan|nan|nan|nan|nan|nan,https://metadata.un.org/sdg/3,Good health and well-being,0.47,,,,2025|2024|2023|2022|2021|2020|2019|2018|2017|2...,4|22|41|35|47|33|33|38|24|32|36|30|19|4,Manifesto of computational social science The ...
3,https://openalex.org/W3022499311,https://doi.org/10.1146/annurev-soc-121919-054621,Computational Social Science and Sociology,Computational Social Science and Sociology,365.42535,2020,2020-04-28,en,article,journal-article,...,nan|nan|nan,https://metadata.un.org/sdg/10,Reduced inequalities,0.41,,,,2025|2024|2023|2022|2021|2020,17|51|65|56|34|9,Computational Social Science and Sociology The...
7,https://openalex.org/W4389636360,https://doi.org/10.1162/coli_a_00502,Can Large Language Models Transform Computatio...,Can Large Language Models Transform Computatio...,297.86798,2023,2023-12-12,en,article,journal-article,...,nan|nan,,,,,,,2025|2024|2023,72|114|39,Can Large Language Models Transform Computatio...
9,https://openalex.org/W3124521395,https://doi.org/10.3389/fpsyg.2014.00668,On agent-based modeling and computational soci...,On agent-based modeling and computational soci...,275.2734,2014,2014-07-14,en,article,journal-article,...,nan|nan|nan|nan|nan,https://metadata.un.org/sdg/10,Reduced inequalities,0.47,,,,2025|2024|2023|2022|2021|2020|2019|2018|2017|2...,4|17|16|13|18|22|10|13|12|8|7|2|3,On agent-based modeling and computational soci...


## An integrated approach with spaCy

- `spaCy` provides trained models for named entity recognition (NER)
- For French, for example, [several models are available](https://spacy.io/models/fr)
    - With different architectures
- A library that provides a common framework.


One step: downloading the models

https://github.com/explosion/spacy-models/releases/tag/fr_core_news_md-3.8.0
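Models such as the release linked above are installed as regular Python packages, so checking whether the package can be imported is enough to know if a download is still needed. A minimal sketch, assuming spaCy itself is already installed; `model_is_installed` and `ensure_model` are hypothetical helper names, not part of spaCy.

```python
import importlib.util


def model_is_installed(name: str) -> bool:
    """Return True if a package called `name` can be found by Python."""
    return importlib.util.find_spec(name) is not None


def ensure_model(name: str) -> None:
    """Download a spaCy model only if it is not installed yet."""
    if not model_is_installed(name):
        # Imported lazily so the check above works even without spaCy.
        from spacy.cli import download

        download(name)  # same effect as: python -m spacy download <name>


# Example (commented out because it needs network access):
# ensure_model("en_core_web_md")
```

The same download can be run from a terminal with `python -m spacy download en_core_web_md`.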

In [9]:
import numpy as np

In [10]:
np.__version__

'2.3.0'

In [1]:
#pip install numpy==1.26.4

Using them

In [14]:
import spacy

nlp = spacy.load("en_core_web_md")

In [17]:
doc = nlp(df.loc[3, "texte"])

In [13]:
#from spacy import displacy
#displacy.render(doc, style="ent", jupyter=True)

### Working with text representations

In [26]:
for token in doc[0:20]:
    print(f"{token.text:<15} | lemma: {token.lemma_} | POS: {token.pos_}")

Computational   | lemma: Computational | POS: PROPN
Social          | lemma: Social | POS: PROPN
Science         | lemma: Science | POS: PROPN
and             | lemma: and | POS: CCONJ
Sociology       | lemma: Sociology | POS: PROPN
The             | lemma: the | POS: DET
integration     | lemma: integration | POS: NOUN
of              | lemma: of | POS: ADP
social          | lemma: social | POS: ADJ
science         | lemma: science | POS: NOUN
with            | lemma: with | POS: ADP
computer        | lemma: computer | POS: NOUN
science         | lemma: science | POS: NOUN
and             | lemma: and | POS: CCONJ
engineering     | lemma: engineering | POS: NOUN
fields          | lemma: field | POS: NOUN
has             | lemma: have | POS: AUX
produced        | lemma: produce | POS: VERB
a               | lemma: a | POS: DET
new             | lemma: new | POS: ADJ


In [29]:
doc[12].pos_

'NOUN'

In [31]:
[token for token in doc if token.pos_=="NOUN"]

[integration,
 science,
 computer,
 science,
 engineering,
 fields,
 area,
 study,
 science,
 field,
 methods,
 sources,
 data,
 media,
 records,
 archives,
 theories,
 behavior,
 evolution,
 field,
 sociology,
 analysis,
 depth,
 analysis,
 subfields,
 work]

In [36]:
for ent in nlp(df.loc[7, "texte"]).ents[0:10]:
    print(f"{ent.text:<25} | label: {ent.label_}")

zero                      | label: CARDINAL
zero                      | label: CARDINAL
zero                      | label: CARDINAL
13                        | label: CARDINAL
25                        | label: CARDINAL
English                   | label: LANGUAGE
today                     | label: DATE
CSS                       | label: ORG
two                       | label: CARDINAL
1                         | label: CARDINAL


In [None]:
from spacy import displacy  # the earlier import was commented out
displacy.render(doc[0:100], style="dep", jupyter=True, options={"compact": True})

### One use case: keeping only the verbs

In [37]:
def get_verbs(doc):
    """
    Get the verbs from a spacy doc
    """
    return [token.lemma_ for token in doc if token.pos_ == "VERB"]

tmp = df["texte"][0:100].apply(lambda x: get_verbs(nlp(x)))

In [38]:
from collections import Counter

Counter([j for i in tmp for j in i if j]).most_common(20)

[('use', 82),
 ('base', 54),
 ('provide', 37),
 ('discuss', 34),
 ('make', 29),
 ('identify', 26),
 ('have', 26),
 ('understand', 23),
 ('include', 22),
 ('propose', 21),
 ('present', 21),
 ('generate', 21),
 ('develop', 20),
 ('show', 20),
 ('study', 19),
 ('offer', 19),
 ('explore', 19),
 ('increase', 18),
 ('address', 18),
 ('introduce', 17)]
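The cell above calls `nlp` once per text through `apply`. spaCy also offers `nlp.pipe`, which processes texts in batches and is usually faster on larger samples. A minimal sketch of the same count; `count_verb_lemmas` is a hypothetical helper, and it only assumes that each token exposes `pos_` and `lemma_`, as spaCy tokens do.

```python
from collections import Counter


def count_verb_lemmas(docs):
    """Count verb lemmas over an iterable of spaCy docs."""
    counts = Counter()
    for doc in docs:
        counts.update(token.lemma_ for token in doc if token.pos_ == "VERB")
    return counts


# With the objects defined earlier in the notebook:
# count_verb_lemmas(nlp.pipe(df["texte"][0:100])).most_common(20)
```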

### Another use case: cleaning texts

Rather than tokenizing the raw text naively, clean it by keeping only the lemmas, then compute a TF-IDF.

In [None]:
# exercise
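One possible way to approach this exercise, offered as a sketch rather than a reference solution. It assumes the lemmas come from spaCy tokens (`lemma_`, `is_alpha`, `is_stop`) and uses the plain definition tf × log(N / df), written by hand to keep the example dependency-free. `keep_lemmas` and `tfidf` are hypothetical names, and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def keep_lemmas(doc):
    """Lowercased lemmas of alphabetic tokens that are not stop words."""
    return [t.lemma_.lower() for t in doc if t.is_alpha and not t.is_stop]


def tfidf(docs_lemmas):
    """Return one {lemma: weight} dict per document (plain tf * idf)."""
    n_docs = len(docs_lemmas)
    # Document frequency: in how many documents does each lemma appear?
    doc_freq = Counter(lemma for lemmas in docs_lemmas for lemma in set(lemmas))
    weights = []
    for lemmas in docs_lemmas:
        term_freq = Counter(lemmas)
        weights.append({
            lemma: (count / len(lemmas)) * math.log(n_docs / doc_freq[lemma])
            for lemma, count in term_freq.items()
        })
    return weights


# Toy input: two documents already reduced to lemmas.
toy = [["model", "train", "model"], ["model", "datum"]]
weights = tfidf(toy)

# With the notebook's objects:
# tfidf([keep_lemmas(d) for d in nlp.pipe(df["texte"][0:100])])
```

A lemma that appears in every document (here `model`) gets a weight of zero, which is the point of the IDF term. Libraries such as scikit-learn use a smoothed variant of this formula, so their values will differ.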

## Using other models from Hugging Face

Let's start by taking a look around Hugging Face.

### Predicting named entities

Let's use the [GLiNER model, available on Hugging Face](https://github.com/urchade/GLiNER)

Or its more recent version, [GLiNER multitask](https://huggingface.co/knowledgator/gliner-multitask-large-v0.5)

In [40]:
#!pip install gliner

In [50]:
from gliner import GLiNER

# load the pretrained model
model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")

In [54]:
text = df.loc[9, "texte"]
labels = ["software or algorithm","countries"]

entities = model.predict_entities(text, labels)

for entity in entities:
    print(entity["text"], "=>", entity["label"])

cognitive models => software or algorithm
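When the same labels are applied to many texts, it helps to group the predictions by label and to drop low-confidence ones. A minimal sketch, assuming each entity returned by `predict_entities` is a dict with `text`, `label` and `score` keys; `group_entities` is a hypothetical helper and the score in the example is invented for illustration.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.0):
    """Group entity texts by label, keeping those at or above min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        # Entities without a score are kept (treated as fully confident).
        if entity.get("score", 1.0) >= min_score:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)


# Hypothetical prediction shaped like the output above (score is made up).
example = [
    {"text": "cognitive models", "label": "software or algorithm", "score": 0.8},
]
grouped = group_entities(example, min_score=0.5)

# With the notebook's objects:
# group_entities(model.predict_entities(text, labels), min_score=0.5)
```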


### Many models, and the option to train your own

For example: https://huggingface.co/NousResearch/Minos-v1

### Sentiment analysis

One question: **which statements are the most negative?**

- Plenty of options to choose from
    - E.g. [🚀 distilbert-based Multilingual Sentiment Classification Model](https://huggingface.co/tabularisai/multilingual-sentiment-analysis)
- Understand the model and what it does
- Importance of evaluating its output
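To make the question concrete, here is a hedged sketch of how such a classifier could be used to rank texts. It assumes the `transformers` library is installed and that the model returns one `{"label", "score"}` dict per text. The label names in `negative_labels` are an assumption to check against the model card, and `most_negative` and `classify` are hypothetical helper names.

```python
def most_negative(texts, predictions,
                  negative_labels=("Negative", "Very Negative"), top=5):
    """Return texts predicted as negative, most confident first."""
    scored = [
        (pred["score"], text)
        for text, pred in zip(texts, predictions)
        if pred["label"] in negative_labels
    ]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:top]]


def classify(texts, model_name="tabularisai/multilingual-sentiment-analysis"):
    """Run a Hugging Face text-classification pipeline on a list of texts."""
    # Imported lazily: loading the pipeline downloads the model weights.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_name)
    return classifier(list(texts), truncation=True)


# Usage (commented out because it downloads the model):
# texts = df["texte"][0:100].tolist()
# most_negative(texts, classify(texts))
```

Ranking by the top label's score is a crude choice, since it mixes confidence with intensity. Reading a sample of the ranked texts by hand remains the simplest way to evaluate the result.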

## Using external models

You need an endpoint:

- Ollama
- OpenAI
- ...

In [36]:
import requests

# define the endpoint
endpoint = ""

# Define the classification prompt
prompt = """Est-ce que ce texte est positif ou négatif : Je ne comprends rien au NLP. 
Ne réponds que POSITIF ou NEGATIF."""

# Send the request to Ollama
response = requests.post(
    endpoint,
    json={
        'model': 'llama3.3',
        'prompt': prompt,
        'stream': False  # stream=False returns a single, complete response
    }
)

# Parse the response
result = response.json()
result

{'model': 'llama3.3',
 'created_at': '2025-05-20T15:57:20.053721724Z',
 'response': 'NEGATIF',
 'done': True,
 'done_reason': 'stop',
 'context': [128006, 882, 128007, 271, 14101, 54312, 1744, 3846, 69067, 1826,
  20940, 333, 6033, 308, 19395, 50848, 551, 14465, 841, 60946, 82, 55455,
  8065, 452, 12852, 13, 720, 8989, 75871, 82, 1744, 27592, 964, 2843, 6033,
  85165, 835, 2843, 13, 128009, 128006, 78191, 128007, 271, 98227, 835, 2843],
 'total_duration': 178492331,
 'load_duration': 14576150,
 'prompt_eval_count': 45,
 'prompt_eval_duration': 44405857,
 'eval_count': 4,
 'eval_duration': 118533952}
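Only the `response` field of this output is needed for classification, and a model does not always answer with the exact word requested. A minimal sketch that wraps the call and normalizes the answer, assuming the endpoint accepts the same JSON payload as above (for a local Ollama server this is typically `http://localhost:11434/api/generate`, an assumption to adapt to your setup). `normalize_label` and `classify_text` are hypothetical helper names.

```python
import unicodedata


def normalize_label(raw):
    """Map a free-text answer to 'POSITIF', 'NEGATIF', or None if unclear."""
    # Strip accents so that 'Négatif' and 'NEGATIF' are treated alike.
    decomposed = unicodedata.normalize("NFKD", raw)
    text = "".join(c for c in decomposed if not unicodedata.combining(c)).upper()
    has_positive = "POSITIF" in text
    has_negative = "NEGATIF" in text
    if has_positive == has_negative:  # neither label, or both
        return None
    return "POSITIF" if has_positive else "NEGATIF"


def classify_text(text, endpoint, model="llama3.3", timeout=60):
    """Send one text to the endpoint and return the normalized label."""
    import requests  # imported lazily; third-party dependency

    prompt = (
        f"Est-ce que ce texte est positif ou négatif : {text}\n"
        "Ne réponds que POSITIF ou NEGATIF."
    )
    response = requests.post(
        endpoint,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=timeout,
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    return normalize_label(response.json()["response"])


# Usage, once `endpoint` is set:
# classify_text("Je ne comprends rien au NLP.", endpoint)
```

Returning `None` for ambiguous answers makes it easy to count how often the model strays from the instruction, which is one simple way to evaluate it.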