# Structuring and exploring textual data

Notebook: Introduction to natural language processing - 15/05/2025 - Émilien Schultz

## The data

Titles and abstracts of publications retrieved from OpenAlex.

## The libraries

- `pandas` for data manipulation
- `nltk` for text processing
- `matplotlib` for visualisation
- `scikit-learn` for text vectorisation and modelling


In [2]:
#pip install pandas nltk scikit-learn matplotlib

## Cleaning the data (preprocessing)

- Remove duplicates
- Remove empty rows
- Convert to lowercase
- Keep English texts only
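
The notebook's own loading cell only drops missing abstracts and filters on text length. As a minimal sketch of the other steps in the list, here is one possible version on a made-up DataFrame. The column names mirror the notebook, but the data, the `ENGLISH_HINTS` set and the `looks_english` helper are assumptions for illustration: the language filter is a crude heuristic (share of very common English words), not a real language detector.

```python
import pandas as pd

# toy data (hypothetical), with the same column names as the notebook
toy = pd.DataFrame({
    "title": ["Social Networks", "Social Networks", "Réseaux sociaux", "Empty"],
    "abstract": [
        "We study the structure of online networks and the spread of news.",
        "We study the structure of online networks and the spread of news.",
        "Nous étudions la structure des réseaux en ligne et la diffusion.",
        None,
    ],
})

# a few very common English words, used as a crude language signal (assumption)
ENGLISH_HINTS = {"the", "of", "and", "we", "to", "in", "is"}

def looks_english(text, threshold=0.2):
    """Return True if at least `threshold` of the words are common English words."""
    words = text.split()
    if not words:
        return False
    hits = sum(word in ENGLISH_HINTS for word in words)
    return hits / len(words) >= threshold

clean = (
    toy.drop_duplicates(subset=["title", "abstract"])  # remove duplicates
       .dropna(subset=["abstract"])                    # remove empty rows
       .copy()
)
clean["abstract"] = clean["abstract"].str.lower()      # convert to lowercase
clean = clean[clean["abstract"].apply(looks_english)]  # keep English only
print(clean)
```

On a real corpus, a dedicated language-identification library would probably be more reliable than this word-list heuristic.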

In [19]:
import pandas as pd

# load the data
df = pd.read_csv("../data/CSS_exact_openalex.csv")

# drop the rows without an abstract
print("Number of missing abstracts: ", df["abstract"].isna().sum())
df = df[df["abstract"].notna()].copy()

# column holding all the textual content (title + abstract)
df["texte"] = df["title"] + " " + df["abstract"]

# keep only the texts longer than 100 and shorter than 10,000 characters
longueur = df["texte"].str.len()
df = df[(longueur > 100) & (longueur < 10000)]

Number of missing abstracts:  759


In [20]:
df["texte"].head()

0    Computational Social Science 14,0642,033Metric...
1    Manifesto of computational social science The ...
3    Computational Social Science and Sociology The...
7    Can Large Language Models Transform Computatio...
9    On agent-based modeling and computational soci...
Name: texte, dtype: object

In [21]:
len(df)

683

## Word-level analysis

### Searching for a word

The basics of text mining. Which articles mention artificial intelligence?

In [23]:
filtre = df["texte"].str.contains("artificial intelligence|AI")
filtre.sum()

np.int64(51)

In [32]:
#df[filtre]

In [33]:
df[filtre].loc[58, "texte"]

'Serious Games and AI: Challenges and Opportunities for Computational Social Science The video game industry plays an essential role in the entertainment sphere of our society. However, from Monopoly to Flight Simulators, serious games have also been appealing tools for learning a new language, conveying values, or training skills. Furthermore, the resurgence of Artificial Intelligence (AI) and data science in the last decade has created a unique opportunity since the amount of data collected through a game is immense, as is the amount of data needed to feed such AI algorithms. This paper aims to identify relevant research lines using Serious Games as a novel research tool, especially in Computational Social Sciences. To contextualize, we also conduct a (non-systematic) literature review of this field. We conclude that the synergy between games and data can foster the use of AI for good and open up new strategies to empower humanity and support social research with novel computational 

Find the 4-digit numbers in a text

In [46]:
import re
re.findall(r"\s(\d{4})\s","Ceci est un texte 2323 qds qsds dqs dsq 32 4444444")

['2323']
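
The pattern above only finds numbers surrounded by whitespace, so a year followed by a comma or wrapped in parentheses is missed. A possible alternative, sketched here on a made-up sentence, uses word boundaries (`\b`) instead:

```python
import re

texte = "Published in 2020, revised (2021), id 123456."

# pattern used above: requires whitespace on both sides of the number
print(re.findall(r"\s(\d{4})\s", texte))

# word boundaries: punctuation around the number is accepted,
# but longer digit runs such as 123456 are still rejected
print(re.findall(r"\b\d{4}\b", texte))
```

Which variant is preferable depends on the corpus: word boundaries are more permissive, so it is worth checking a sample of the matches.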

Find the context of a word with a regular expression

In [37]:
import re
re.findall(".{10}Artificial Intelligence.{10}", df[filtre].loc[58, "texte"])

['rgence of Artificial Intelligence (AI) and ']
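
With `.{10}`, an occurrence located fewer than 10 characters from the start or end of the text is not returned. The sketch below wraps the same idea in a small helper (the name `keyword_in_context` and its parameters are hypothetical) that accepts up to `width` characters on each side and ignores case. Note that `.` does not match line breaks by default.

```python
import re

def keyword_in_context(text, term, width=10):
    """Return each occurrence of `term` with up to `width` characters on each side."""
    # re.escape protects any special regex characters contained in the term
    pattern = rf".{{0,{width}}}{re.escape(term)}.{{0,{width}}}"
    return re.findall(pattern, text, flags=re.IGNORECASE)

exemple = "AI matters. The resurgence of Artificial Intelligence (AI) is recent."
print(keyword_in_context(exemple, "artificial intelligence"))
print(keyword_in_context("AI is here", "AI"))
```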

In [50]:
(df["texte"].str.lower()
            .str.contains("artificial intelligence")
            .sum())

np.int64(31)

What if we search for several terms?

In [55]:
termes = [" ai ", "artificial intelligence"]

(df["texte"].str.lower()
            .str.contains("|".join(termes))
            .sum())

np.int64(46)

Search for all the possible variants of AI

- artificial intelligence
- algorithm
- AI
- ...
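
One possible way to approach this, sketched on a few made-up sentences: build a single regular expression from a list of variants, match whole words only, and ignore case. The list `variantes` is an assumption to adapt to the corpus, not an exhaustive vocabulary.

```python
import pandas as pd

# hypothetical list of variants: adapt it to the corpus
variantes = ["artificial intelligence", "ai", "algorithms?", "machine learning", "llms?"]

# \b...\b matches whole words only; (?:...) groups the alternatives without capturing
motif = r"\b(?:" + "|".join(variantes) + r")\b"

textes = pd.Series([
    "AI is everywhere",
    "We train algorithms on survey data",
    "A painting of a sailor",
    "Large language models (LLMs) as annotators",
])

# case=False makes the search case-insensitive
filtre = textes.str.contains(motif, case=False, regex=True)
print(filtre.tolist())
```

The word boundaries matter here: without them, "ai" would also match inside words such as "painting" or "train".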

### Tokenisation

Splitting a text into tokens

#### Using regular expressions

In [59]:
import re
word_pattern = r"\w+"
tokens = re.findall(word_pattern, "Ceci est un test,de.découpage en mots")
tokens

['Ceci', 'est', 'un', 'test', 'de', 'découpage', 'en', 'mots']

In [62]:
%%timeit
df["texte"].apply(lambda x: re.findall(r"\w+",x.lower()))

23.1 ms ± 461 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


#### Using a first library: `nltk`

In [76]:
import nltk
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize

word_tokenize("Ceci est un test, ici-même", language="french")

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/emilien/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


['Ceci', 'est', 'un', 'test', ',', 'ici-même']

In [68]:
#word_tokenize?

In [70]:
%%timeit
df["texte"].apply(word_tokenize)

347 ms ± 3.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [71]:
df["texte_tok"] = df["texte"].apply(word_tokenize)

### What are the most frequent terms?

In [72]:
from collections import Counter

In [74]:
# flatten the token lists of all documents and count each token
compteur = Counter(token for tokens in df["texte_tok"] for token in tokens)

In [75]:
compteur.most_common(20)

[(',', 6972),
 ('the', 5687),
 ('of', 5198),
 ('and', 5063),
 ('.', 4603),
 ('to', 3034),
 ('in', 2659),
 ('a', 2230),
 ('social', 1840),
 ('for', 1399),
 ('on', 1221),
 ('that', 1168),
 ('data', 1065),
 (')', 1058),
 ('(', 1038),
 ('is', 974),
 ('science', 895),
 ('computational', 872),
 ('with', 824),
 ('as', 815)]
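
The ranking above is dominated by punctuation and function words. A minimal sketch of a filtered count, on toy tokens standing in for `df["texte_tok"]`: the `mini_stopwords` set is a tiny illustrative list (an assumption), much shorter than the one shipped with `nltk`.

```python
from collections import Counter

# toy tokens standing in for df["texte_tok"] (hypothetical data)
documents = [
    ["The", "social", "data", ",", "the", "model", "."],
    ["Social", "science", "of", "data", "(", "2020", ")"],
]

# tiny illustrative stop list; nltk's English list is much longer
mini_stopwords = {"the", "of", "and", "to", "in", "a"}

compteur_filtre = Counter(
    token.lower()                      # merge "Social" and "social"
    for tokens in documents
    for token in tokens
    if token.isalpha()                 # drop punctuation and numbers
    and token.lower() not in mini_stopwords
)
print(compteur_filtre.most_common(3))
```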

### Which expressions come up most often?

Let's use bigrams and trigrams

In [81]:
from nltk.util import ngrams
list(ngrams(["Je","suis","content","de","faire","du","Python"],3))

[('Je', 'suis', 'content'),
 ('suis', 'content', 'de'),
 ('content', 'de', 'faire'),
 ('de', 'faire', 'du'),
 ('faire', 'du', 'Python')]
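
To make explicit what an n-gram is, here is a minimal pure-Python sketch that produces the same tuples for this example. The function name `ngrammes` is hypothetical, and this is not how `nltk` implements it internally.

```python
def ngrammes(tokens, n):
    """Return the list of n-grams (tuples of n consecutive tokens)."""
    # zip over n shifted copies of the list; zip stops at the shortest copy
    return list(zip(*(tokens[i:] for i in range(n))))

print(ngrammes(["Je", "suis", "content", "de", "faire", "du", "Python"], 3))
```

A list shorter than `n` simply yields no n-gram.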

In [84]:
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

def generate_trigrams_nltk(text):
    # lowercase, tokenise, then build the sequences of 3 consecutive tokens
    tokens = word_tokenize(text.lower())
    trigrams = list(ngrams(tokens, 3))
    return trigrams

generate_trigrams_nltk(df["texte"].iloc[0])

[('computational', 'social', 'science'),
 ('social', 'science', '14,0642,033metricstotal'),
 ('science', '14,0642,033metricstotal', 'downloads14,064last'),
 ('14,0642,033metricstotal', 'downloads14,064last', '6'),
 ('downloads14,064last', '6', 'months2,037last'),
 ('6', 'months2,037last', '12'),
 ('months2,037last', '12', 'months4,190total'),
 ('12', 'months4,190total', 'citations2,033last'),
 ('months4,190total', 'citations2,033last', '6'),
 ('citations2,033last', '6', 'months1last'),
 ('6', 'months1last', '12'),
 ('months1last', '12', 'months1view'),
 ('12', 'months1view', 'all'),
 ('months1view', 'all', 'metrics')]

#### Removing stop words

In [87]:
nltk.download("stopwords")
from nltk.corpus import stopwords
#stopwords.words("english")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/emilien/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [90]:
nltk.download("stopwords")

from nltk.corpus import stopwords

english_stopwords = list(set(stopwords.words("english")))

def generate_bigrams_nltk(text):
    tokens = word_tokenize(text.lower())
    filtered_tokens = [token for token in tokens if token.isalnum() and token not in english_stopwords]
    bigrams = list(ngrams(filtered_tokens, 2))
    return bigrams

def generate_trigrams_nltk(text):
    tokens = word_tokenize(text.lower())
    filtered_tokens = [token for token in tokens if token.isalnum() and token not in english_stopwords]
    trigrams = list(ngrams(filtered_tokens, 3))
    return trigrams


#generate_trigrams_nltk(df["texte"].iloc[3])


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/emilien/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [92]:
from sklearn.feature_extraction.text import CountVectorizer

# count the 300 most frequent trigrams (3 consecutive words) in the corpus
vectorizer = CountVectorizer(stop_words=english_stopwords, ngram_range=(3, 3), max_features=300)
trigrammes = (
    pd.DataFrame(
        vectorizer.fit_transform(df["texte"]).toarray(),
        columns=vectorizer.get_feature_names_out(),
    )
    .sum(axis=0)  # total count of each trigram over all documents
    .sort_values(ascending=False)
)
trigrammes

computational social science     733
computational social sciences     99
social science research           57
large language models             55
natural language processing       54
                                ... 
social welfare scholars            4
source information societal        4
csv keyword koronawirus            4
stable calibration curve           4
related social behavior            4
Length: 300, dtype: int64

## Representing texts

### Word presence

In [101]:
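df["texte"] = df["texte"].str.lower()
# the text is now lowercase, so search for the whole word "ai":
# an uppercase "AI" pattern can no longer match anything
# (the output recorded below was produced with "AI", hence the all-zero dim1 column)
df["dim1"] = df["texte"].str.contains(r"\bai\b")
df["dim2"] = df["texte"].str.contains("science")
df["dim3"] = df["texte"].str.contains("algorithm")
df["dim4"] = df["texte"].str.contains("llm")
# convert the booleans to 0/1
table = df[["dim1","dim2","dim3","dim4"]].astype(int)
table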



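| | dim1 | dim2 | dim3 | dim4 |
|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 |
| 7 | 0 | 1 | 0 | 1 |
| 9 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... |
| 1427 | 0 | 1 | 0 | 0 |
| 1429 | 0 | 1 | 0 | 1 |
| 1430 | 0 | 1 | 0 | 1 |
| 1436 | 0 | 1 | 0 | 0 |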


In [104]:
#table.sum(axis=1) > 3

## Aside: scikit-learn

- create an object/model with its (default) parameters
- fit it to the data (`fit`): the model's values are computed from the data
- predict/transform other data (or the same data) with the fitted model

### Raw vector: document-term matrix (DTM) / table

In [120]:
from sklearn.feature_extraction.text import CountVectorizer

# create the vectorizer object
vectorizer = CountVectorizer(stop_words=english_stopwords,
                             ngram_range=(1, 1),
                             max_features=500)

# fit it on the data and transform them
X = vectorizer.fit_transform(df["texte"])
X = pd.DataFrame(X.toarray(),columns=list(vectorizer.get_feature_names_out()))
X

Unnamed: 0,19,2020,ability,abstract,academic,access,accounts,accuracy,across,active,...,word,words,work,working,workshop,world,would,www,years,yet
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
678,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
679,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
680,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
681,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,3,1,0,1,0


In [20]:
# X.iloc[2]

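### A slightly more advanced version

- Term Frequency-Inverse Document Frequency (TF-IDF)
    - An improvement on the DTM
- An approach often used to highlight the most specific words
- `scikit-learn` provides `TfidfVectorizer`

$$\text{TF-IDF}(t, d, D) = \left( \frac{f_{t,d}}{n_d} \right) \times \log \left(\frac{N}{\text{df}_t} \right)
$$

where $f_{t,d}$ is the number of occurrences of term $t$ in document $d$, $n_d$ the total number of terms in $d$, $N$ the number of documents in the corpus $D$, and $\text{df}_t$ the number of documents containing $t$.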
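
As a worked illustration of the formula, here is a minimal pure-Python sketch on three made-up tokenised documents (the helper `tf_idf` and the data are assumptions). The values will not match those of `TfidfVectorizer`: with its default settings, scikit-learn uses raw counts for the term frequency, a smoothed IDF, $\ln\frac{1+N}{1+\text{df}_t} + 1$, and then L2-normalises each document vector.

```python
import math

# toy corpus: each document is a list of tokens (hypothetical data)
docs = [
    ["social", "science", "data"],
    ["data", "data", "model"],
    ["social", "network"],
]

def tf_idf(term, doc, corpus):
    """TF-IDF of `term` in `doc`, following the formula above (natural log)."""
    tf = doc.count(term) / len(doc)          # f_{t,d} / n_d
    df_t = sum(term in d for d in corpus)    # number of documents containing the term
    if df_t == 0:
        return 0.0                           # term absent from the whole corpus
    return tf * math.log(len(corpus) / df_t) # tf * log(N / df_t)

# "data" appears twice in the second document but is common in the corpus;
# "model" appears once but only in that document
print(tf_idf("data", docs[1], docs))
print(tf_idf("model", docs[1], docs))
```

In this toy corpus, "model" gets a higher score than "data" in the second document even though it occurs less often there, because it is more specific to that document.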

In [127]:
from sklearn.feature_extraction.text import TfidfVectorizer

# create the vectorizer object
vectorizer = TfidfVectorizer(stop_words=english_stopwords,
                             ngram_range=(1, 1),
                             max_features=500)

# fit it on the data and transform them
X = vectorizer.fit_transform(df["texte"])

# convert to a DataFrame
X = pd.DataFrame(X.toarray(),columns=list(vectorizer.get_feature_names_out()))
X.loc[45].sort_values()

19             0.000000
popularity     0.000000
popular        0.000000
political      0.000000
policy         0.000000
                 ...   
performance    0.211079
tasks          0.216117
nlp            0.253746
strategies     0.343297
data           0.421374
Name: 45, Length: 500, dtype: float64

In [22]:
len(vectorizer.get_feature_names_out())

500

Build the TF-IDF matrix and identify the words with the highest scores

## Distance between two texts

In [131]:
from sklearn.metrics.pairwise import cosine_similarity

X = vectorizer.fit_transform(df["texte"])
cosine_similarity(X[0], X[50])

array([[0.38885635]])
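
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms, and `pairwise_distances(..., metric="cosine")` returns one minus that similarity. A minimal pure-Python sketch on two made-up count vectors (the helper `cosine` is hypothetical and assumes neither vector is all zeros):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length (non-zero vectors assumed)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [1, 0, 1]   # e.g. word counts of a first document
v = [1, 1, 0]   # word counts of a second document

similarity = cosine(u, v)
distance = 1 - similarity   # cosine distance
print(similarity, distance)
```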

In [134]:
from sklearn.metrics import pairwise_distances

distances = pd.DataFrame(pairwise_distances(X, metric="cosine"))
distances

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 673 | 674 | 675 | 676 | 677 | 678 | 679 | 680 | 681 | 682 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.792856 | 0.683709 | 0.893938 | 0.862499 | 0.610080 | 0.882147 | 0.815532 | 0.841434 | 0.842699 | ... | 0.934959 | 0.858356 | 0.981518 | 0.986131 | 0.957205 | 0.965213 | 0.974557 | 0.975362 | 0.959075 | 0.908659 |
| 1 | 0.792856 | 0.000000 | 0.792380 | 0.950557 | 0.906473 | 0.905994 | 0.870481 | 0.927642 | 0.841452 | 0.863517 | ... | 0.866931 | 0.890152 | 0.938361 | 0.945397 | 0.918154 | 0.970727 | 0.965741 | 0.968236 | 0.880703 | 0.840238 |
| 2 | 0.683709 | 0.792380 | 0.000000 | 0.941215 | 0.840601 | 0.599256 | 0.885338 | 0.871111 | 0.940411 | 0.830467 | ... | 0.875743 | 0.806255 | 0.948565 | 0.970533 | 0.779773 | 0.969907 | 0.924779 | 0.933613 | 0.893843 | 0.864202 |
| 3 | 0.893938 | 0.950557 | 0.941215 | 0.000000 | 0.848588 | 0.923259 | 0.936298 | 0.943772 | 0.922710 | 0.954753 | ... | 0.923321 | 0.956766 | 0.648462 | 0.975396 | 0.974159 | 0.989460 | 0.831055 | 0.815939 | 0.946884 | 0.901888 |
| 4 | 0.862499 | 0.906473 | 0.840601 | 0.848588 | 0.000000 | 0.858155 | 0.951050 | 0.942069 | 0.978197 | 0.955526 | ... | 0.958900 | 0.896101 | 0.947750 | 0.982619 | 0.976048 | 0.988140 | 0.951046 | 0.950153 | 0.935287 | 0.895721 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 678 | 0.965213 | 0.970727 | 0.969907 | 0.989460 | 0.988140 | 0.976631 | 0.954114 | 0.978225 | 0.988795 | 0.983512 | ... | 0.993712 | 0.969345 | 0.987226 | 0.986877 | 0.993546 | 0.000000 | 0.993665 | 0.993407 | 0.992462 | 0.985277 |
| 679 | 0.974557 | 0.965741 | 0.924779 | 0.831055 | 0.951046 | 0.948494 | 0.957458 | 0.944670 | 0.992543 | 0.964447 | ... | 0.968872 | 0.967898 | 0.857216 | 0.963013 | 0.975592 | 0.993665 | 0.000000 | 0.022451 | 0.891661 | 0.950135 |
| 680 | 0.975362 | 0.968236 | 0.933613 | 0.815939 | 0.950153 | 0.950489 | 0.962293 | 0.930484 | 0.992779 | 0.971529 | ... | 0.969212 | 0.966520 | 0.847640 | 0.962830 | 0.972397 | 0.993407 | 0.022451 | 0.000000 | 0.890135 | 0.945566 |
| 681 | 0.959075 | 0.880703 | 0.893843 | 0.946884 | 0.935287 | 0.945041 | 0.932284 | 0.976206 | 0.977289 | 0.959670 | ... | 0.903483 | 0.859381 | 0.849099 | 0.921980 | 0.641424 | 0.992462 | 0.891661 | 0.890135 | 0.000000 | 0.864907 |


In [136]:
# position of the document that is farthest from document 10
distances[10].idxmax()

564
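
`idxmax` returns the most distant document. To find the most similar one instead, `idxmin` alone would return the document itself (distance 0), so it has to be excluded first. A minimal sketch on a made-up 3 x 3 distance matrix:

```python
import pandas as pd

# toy symmetric distance matrix between 3 documents (hypothetical values)
distances = pd.DataFrame(
    [[0.0, 0.2, 0.9],
     [0.2, 0.0, 0.7],
     [0.9, 0.7, 0.0]]
)

doc = 0
farthest = distances[doc].idxmax()            # largest distance
nearest = distances[doc].drop(doc).idxmin()   # smallest distance, excluding the document itself
print(farthest, nearest)
```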

## Application: making a word cloud with WordCloud

A quick look at the [documentation](https://amueller.github.io/word_cloud/)
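
As a starting point, here is a minimal sketch under stated assumptions: the helper names, the image size and the output path `nuage.png` are all hypothetical, and the `wordcloud` package must be installed separately (it is not in the install cell at the top of the notebook). Word frequencies are computed with the regex tokenisation seen earlier, then passed to `WordCloud.generate_from_frequencies`.

```python
import re
from collections import Counter

def word_frequencies(texts, stopwords=frozenset()):
    """Count the lowercase words of all texts, ignoring the given stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"\w+", text.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts

def draw_cloud(frequencies, path="nuage.png"):
    """Draw a word cloud from a {word: count} mapping and save it as an image."""
    # imported here so that word_frequencies works without the package installed
    from wordcloud import WordCloud
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)

freqs = word_frequencies(["Social data and social science"], stopwords={"and"})
print(freqs.most_common(2))
# draw_cloud(freqs)  # uncomment once `wordcloud` is installed
```

In the notebook, `word_frequencies(df["texte"], stopwords=set(english_stopwords))` would be one way to feed the whole corpus to the cloud.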