# Basics of automatic text processing

This notebook covers:

 - Text preprocessing: tokenization (splitting a text into a sequence of sentences, words, or other linguistic units); lemmatization (reducing each word to its dictionary form, or lemma); etc.
 - Building co-occurrence networks.
 - Comparing word occurrence frequencies.

We start by importing the packages:

In [329]:
import pandas as pd
import networkx as nx
import nltk

(If importing NLTK fails, close Jupyter, run `pip install nltk` in your terminal, then restart Jupyter.)

The second step is to download a few basic NLTK resources.

In [387]:
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('tagsets_json')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/lucasgautheron/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/lucasgautheron/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/lucasgautheron/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets_json to
[nltk_data]     /Users/lucasgautheron/nltk_data...
[nltk_data]   Unzipping help/tagsets_json.zip.


True

## Text preprocessing: tokenization, lemmatization, and lexical filters

We first look at tokenization, a crucial step in automatic text processing.
It covers, for example, turning a document into a sequence of words.
NLTK can perform this operation:

In [369]:
words = nltk.tokenize.word_tokenize("I love apples. And also strawberries.")
words

['I', 'love', 'apples', '.', 'And', 'also', 'strawberries', '.']

The result is a list of words (plus punctuation marks).

It is often useful to reduce each of these words to its lemma (dictionary form), which carries almost all of the useful semantic information. For example, in practice there is little semantic difference between "strawberry" and "strawberries".

Lemmatization does this:

In [371]:
lemmatizer = nltk.stem.WordNetLemmatizer()
stems = [
    lemmatizer.lemmatize(word) for word in words
]
stems

['I', 'love', 'apple', '.', 'And', 'also', 'strawberry', '.']

In some cases, words such as "I" or "And" are a distraction (they carry little useful semantic information).
One way to get rid of them is to filter words by their lexical category, which can be determined with NLTK's part-of-speech tagging (POS tagging) function:

In [372]:
nltk.tag.pos_tag(words)

[('I', 'PRP'),
 ('love', 'VBP'),
 ('apples', 'NNS'),
 ('.', '.'),
 ('And', 'CC'),
 ('also', 'RB'),
 ('strawberries', 'NNS'),
 ('.', '.')]

To look up what the tags mean, one can consult the following reference:

In [388]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

So, to keep only nouns and adjectives:
 - iterate over each (word, tag) pair [for example: ('apples', 'NNS')]
 - keep every word whose tag starts with NN (nouns) or JJ (adjectives)

Here is how to do it with a list comprehension:

In [334]:
words = [
    word
    for word, tag in nltk.tag.pos_tag(words)
    if tag[:2] == "NN" or tag[:2] == "JJ"
]
words

['apple', 'strawberry']
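
As a quick check that needs neither the tagger nor any download, the same filter can be run on the tagged pairs printed earlier. This is only a sketch: the `tagged` list is copied by hand from that output, and `str.startswith` with a tuple of prefixes stands in for the two `tag[:2]` comparisons.

```python
# (word, tag) pairs copied from the pos_tag output shown earlier
tagged = [
    ("I", "PRP"), ("love", "VBP"), ("apples", "NNS"), (".", "."),
    ("And", "CC"), ("also", "RB"), ("strawberries", "NNS"), (".", "."),
]

# keep words whose tag starts with NN (nouns) or JJ (adjectives)
kept = [word for word, tag in tagged if tag.startswith(("NN", "JJ"))]
print(kept)  # ['apples', 'strawberries']
```

The filter returns the words exactly as they were passed in, so plural forms stay plural unless the words were lemmatized first.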

We can bundle the preceding steps into a function that takes a text as input, splits it into words (tokenization), keeps the nouns and adjectives, and reduces them to their lemma.

In [403]:
def text_to_words(text):
    text = text.lower() # lowercase the text (important! otherwise "strawberry" and "Strawberry" are treated as distinct words)
    
    words = nltk.tokenize.word_tokenize(text)
    
    tags = nltk.tag.pos_tag(words)
    
    words = [
        lemmatizer.lemmatize(word)
        for word, tag in tags
        if tag[:2] == "NN" or tag[:2] == "JJ"
    ]
    return words

In [393]:
text_to_words("I love green apples. And also red strawberries.")

['i', 'green', 'apple', 'red', 'strawberry']

We can now apply this function to our corpus. For efficiency, we apply it to the titles of the climate science articles, but it could also be applied to the abstracts.

In [404]:
df = pd.read_parquet("science/climate/articles.parquet")
df.dropna(subset=["title"], inplace=True) # drop articles whose title is missing
df["words"] = df["title"].map(text_to_words) # new "words" column built from the title with our text_to_words function
df["words"]

article_id
4393546166             [data, et, al, uncertainty, job, seeker]
4393714386             [data, et, al, uncertainty, job, seeker]
4220805390    [urban, risk, europe, available, continental-s...
4220871093    [carbon, footprint, assessment, decarbonisatio...
4224019370        [future, cryosphere, impact, global, warming]
                                    ...                        
4386855500    [accounting, time, greenhouse, gas, emission, ...
4389633312    [green, sukuk, saudi, arabia, challenge, poten...
4389862110                   [esg, green, management, thailand]
4389881210    [credibility, green, bond, market, gbers, case...
4390419499    [development, green, finance, eaeu, country, a...
Name: words, Length: 230400, dtype: object

In [405]:
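# document frequency: number of titles in which each word appears at least once
document_frequency = dict()

for words in df["words"].tolist():
    for word in set(words): # set(): a word repeated within one title counts once
        document_frequency[word] = document_frequency.get(word, 0)+1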

In [406]:
document_frequency["fossil"]

366
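
To make the counting rule concrete, here is a minimal sketch on a made-up corpus of three tokenized titles (the `titles` list is hypothetical, not taken from the dataset). It uses `collections.Counter` instead of a plain dict, which is an equivalent alternative to the loop above.

```python
from collections import Counter

# hypothetical tokenized titles, for illustration only
titles = [
    ["fossil", "fuel", "fossil"],
    ["fuel", "emission"],
    ["fossil", "emission", "policy"],
]

toy_document_frequency = Counter()
for words in titles:
    # set(words) ensures a word repeated within one title counts once
    toy_document_frequency.update(set(words))

print(toy_document_frequency["fossil"])  # 2: found in two titles, although it occurs three times
```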

In [407]:
# keep only the words that appear in at least 100 titles
frequent_words = [
    word for word in document_frequency.keys() if document_frequency[word] >= 100
]

In [408]:
frequent_words[:10]

['al',
 'job',
 'uncertainty',
 'data',
 'et',
 'available',
 'urban',
 'europe',
 'risk',
 'development']

In [409]:
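from itertools import combinations

frequent_set = set(frequent_words) # set membership tests are much faster than list lookups

G = nx.Graph()
for words in df["words"]:
    # distinct frequent words of this title
    words = list({word for word in words if word in frequent_set})
    # each unordered pair of words found in the same title adds 1 to the edge's "coocc" count
    for a, b in combinations(words, 2):
        if G.has_edge(a, b):
            G[a][b]["coocc"] += 1
        else:
            G.add_edge(a, b, coocc=1)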
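
The pair counting can be tried out without networkx on a tiny example. The sketch below uses a hypothetical `titles` list and a `Counter` keyed by word pairs; it illustrates the counting rule only and is not the notebook's graph construction.

```python
from collections import Counter
from itertools import combinations

# hypothetical tokenized titles, for illustration only
titles = [
    ["carbon", "tax", "policy"],
    ["carbon", "tax"],
    ["policy", "carbon", "carbon"],
]

cooccurrence = Counter()
for words in titles:
    # sorted(set(...)) removes repeats and gives each pair a canonical order,
    # so ("carbon", "tax") and ("tax", "carbon") share one key
    for a, b in combinations(sorted(set(words)), 2):
        cooccurrence[(a, b)] += 1

print(cooccurrence[("carbon", "tax")])     # 2
print(cooccurrence[("carbon", "policy")])  # 2
print(cooccurrence[("policy", "tax")])     # 1
```

In the graph version, the undirected `nx.Graph` plays the role of the canonical ordering: `G[a][b]` and `G[b][a]` refer to the same edge.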

In [410]:
nx.write_gexf(G, "output/coocc.gexf")

In [411]:
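import numpy as np

words = list(G.nodes)
N = len(df)
# occ[word]: fraction of titles containing the word
occ = {
    word: document_frequency[word]/N for word in words
}

# edge weight: normalized pointwise mutual information (NPMI), between -1 and 1;
# 0 means the two words co-occur as often as expected if they were independent
weight = dict()
for a, b, attrs in G.edges(data=True):
    weight[(a,b)] = np.log(occ[a]*occ[b])/np.log(attrs["coocc"]/N) - 1

nx.set_edge_attributes(G, weight, "weight")
# drop weak associations (weight < 0.1) and rare pairs (fewer than 15 co-occurrences)
G.remove_edges_from([(a,b) for a, b, attrs in G.edges(data=True) if attrs["weight"]<0.1 or attrs["coocc"]<15])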
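
The edge weight computed in this cell is algebraically the normalized pointwise mutual information (NPMI): with p(a) and p(b) the fractions of titles containing each word and p(a,b) the fraction containing both, log(p(a)p(b))/log(p(a,b)) - 1 equals log(p(a,b)/(p(a)p(b))) divided by -log(p(a,b)). The sketch below, with a hypothetical helper `npmi` and made-up probabilities, checks the two reference values: 0 for independent words and 1 for words that always appear together.

```python
import math

def npmi(p_a, p_b, p_ab):
    """Normalized pointwise mutual information, written as in the weight formula."""
    return math.log(p_a * p_b) / math.log(p_ab) - 1

# independent words: p(a,b) = p(a) * p(b), so the weight is 0
independent = npmi(0.1, 0.2, 0.1 * 0.2)

# words that always appear together: p(a,b) = p(a) = p(b), so the weight is 1
always_together = npmi(0.25, 0.25, 0.25)

print(round(independent, 9), round(always_together, 9))
```

Under this reading, the 0.1 threshold used when pruning edges keeps only pairs that co-occur more often than independence would predict.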

In [412]:
nx.write_gexf(G, "output/coocc.gexf")

In [413]:
authors = pd.read_parquet("science/climate/authors.parquet")
articles_authors = pd.read_parquet("science/climate/articles_authors.parquet")

In [414]:
# attach each author's attributes to the (article, author) pairs
articles_authors = articles_authors.merge(authors, left_on="author_id", right_index=True)

In [415]:
# per article: number of authors labelled "m" and number labelled "f"
article_gender = articles_authors.groupby("article_id").agg(
    male = ("gender", lambda g: np.sum(g=="m")),
    female = ("gender", lambda g: np.sum(g=="f"))
)

In [416]:
# keep only the articles that have author gender counts (inner join on article_id)
df = df.merge(article_gender, how="inner", left_index=True, right_index=True)

In [424]:
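def term_frequency(df):
    # despite its name, this counts documents: the number of titles containing each word
    document_frequency = dict()

    for words in df["words"].tolist():
        for word in set(words):
            document_frequency[word] = document_frequency.get(word, 0)+1

    return document_frequency

# titles with a majority of male authors / a majority of female authors (ties are excluded)
male_docs = df[df["male"]>df["female"]]
female_docs = df[df["male"]<df["female"]]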

male_freq = term_frequency(male_docs)
female_freq = term_frequency(female_docs)

n_male = len(male_docs)
n_female = len(female_docs)

In [425]:
freq = []
for word in frequent_words:
    freq.append({
        "word": word,
        "male": male_freq.get(word, 0),
        "female": female_freq.get(word, 0),
        "male_proportion": male_freq.get(word, 0)/n_male,
        "female_proportion": female_freq.get(word, 0)/n_female
    })

In [426]:
freq = pd.DataFrame(freq)
freq

| | word | male | female | male_proportion | female_proportion |
|---|---|---|---|---|---|
| 0 | al | 398 | 123 | 0.003259 | 0.002256 |
| 1 | job | 88 | 32 | 0.000721 | 0.000587 |
| 2 | uncertainty | 1493 | 407 | 0.012225 | 0.007466 |
| 3 | data | 6464 | 2120 | 0.052928 | 0.038888 |
| 4 | et | 305 | 107 | 0.002497 | 0.001963 |
| ... | ... | ... | ... | ... | ... |
| 2176 | birdlife | 0 | 0 | 0.000000 | 0.000000 |
| 2177 | v0.1 | 142 | 45 | 0.001163 | 0.000825 |
| 2178 | v0.2 | 71 | 22 | 0.000581 | 0.000404 |
| 2179 | cientific | 312 | 0 | 0.002555 | 0.000000 |


In [428]:
freq["ratio"] = freq["female_proportion"]/freq["male_proportion"]
freq = freq[freq["male"]+freq["female"]>20]
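
The ratio compares the share of female-majority titles containing a word with the share of male-majority titles containing it. A minimal sketch with invented counts (the numbers and the helper name `proportion_ratio` are hypothetical) shows how to read it, and why a word absent from male-majority titles yields an infinite ratio, as pandas does when dividing a positive number by zero.

```python
import math

def proportion_ratio(female_count, n_female, male_count, n_male):
    """Share of female-majority titles with the word, divided by the male-majority share."""
    female_proportion = female_count / n_female
    male_proportion = male_count / n_male
    if male_proportion == 0:
        # mirror pandas: positive / 0 gives inf, 0 / 0 gives NaN
        return math.inf if female_proportion > 0 else math.nan
    return female_proportion / male_proportion

# invented group sizes: 1000 female-majority titles, 2000 male-majority titles
print(proportion_ratio(250, 1000, 250, 2000))  # 2.0: same raw count, but twice the share
print(proportion_ratio(5, 1000, 0, 2000))      # inf
```

Dividing proportions rather than raw counts matters because the two groups of titles have different sizes.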

In [430]:
# male dominated

freq.sort_values("ratio", ascending=True).head(10)

| | word | male | female | male_proportion | female_proportion | ratio |
|---|---|---|---|---|---|---|
| 2180 | pati | 312 | 0 | 0.002555 | 0.0 | 0.0 |
| 2179 | cientific | 312 | 0 | 0.002555 | 0.0 | 0.0 |
| 2064 | elite | 111 | 6 | 0.000909 | 0.00011 | 0.121094 |
| 2094 | lidar | 180 | 14 | 0.001474 | 0.000257 | 0.174241 |
| 1787 | plume | 118 | 10 | 0.000966 | 0.000183 | 0.189851 |
| 1909 | cop27 | 93 | 8 | 0.000761 | 0.000147 | 0.192709 |
| 1562 | constant | 129 | 12 | 0.001056 | 0.00022 | 0.208395 |
| 1703 | tropomi | 102 | 10 | 0.000835 | 0.000183 | 0.219632 |
| 950 | saudi | 101 | 10 | 0.000827 | 0.000183 | 0.221806 |
| 1726 | turbulence | 90 | 9 | 0.000737 | 0.000165 | 0.224024 |


In [431]:
# female dominated

freq.sort_values("ratio", ascending=False).head(10)

| | word | male | female | male_proportion | female_proportion | ratio |
|---|---|---|---|---|---|---|
| 2174 | weecology/portaldata | 0 | 242 | 0.0 | 0.004439 | inf |
| 2175 | weecology/forecasts | 0 | 172 | 0.0 | 0.003155 | inf |
| 1931 | nursing | 31 | 116 | 0.000254 | 0.002128 | 8.382839 |
| 1954 | nurse | 32 | 104 | 0.000262 | 0.001908 | 7.280785 |
| 1979 | curriculum | 20 | 61 | 0.000164 | 0.001119 | 6.832736 |
| 1923 | maternal | 24 | 72 | 0.000197 | 0.001321 | 6.720724 |
| 84 | libguides | 29 | 78 | 0.000237 | 0.001431 | 6.025477 |
| 261 | woman | 134 | 355 | 0.001097 | 0.006512 | 5.934968 |
| 1939 | pm2.5 | 25 | 66 | 0.000205 | 0.001211 | 5.914237 |
| 1975 | emotional | 26 | 61 | 0.000213 | 0.001119 | 5.255951 |
