Before I can create a unique word frequency list or translate individual words, I need to know the part of speech of each word in each sentence. For this I can use part-of-speech (POS) tagging with a model, such as an HMM or a transformer, that has been pre-trained on a French corpus. Two options for French are the Stanford POS tagger's pre-trained model, which uses a Maximum Entropy classifier, and a Long Short-Term Memory Conditional Random Field (LSTM-CRF) model hosted on Hugging Face that uses Flair contextual string embeddings. With either tagger, I will add a new column to my CSV database that holds every word in the sentence and its tag in JSON format.

POS French camemBERT Flair tag list: https://huggingface.co/qanastek/pos-french-camembert-flair
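
The plan for the JSON column can be sketched with the standard library alone. This is a minimal sketch, not the final pipeline: the column names `sentence` and `pos_tags`, the helper names, and `demo_tagger` with its made-up lookup table are all assumptions for illustration. A real tagger would be passed in the same way, as any function that returns `(word, tag)` pairs.

```python
import csv
import io
import json


def tags_to_json(tagged_tokens):
    """Serialize (word, tag) pairs as a JSON list of objects."""
    records = [{"word": word, "pos": tag} for word, tag in tagged_tokens]
    # ensure_ascii=False keeps accented French characters readable in the file
    return json.dumps(records, ensure_ascii=False)


def add_pos_column(rows, tagger, text_field="sentence", pos_field="pos_tags"):
    """Return copies of the rows with an extra JSON column of tagged tokens."""
    return [{**row, pos_field: tags_to_json(tagger(row[text_field]))} for row in rows]


def demo_tagger(text):
    """Hypothetical stand-in for a real POS tagger (illustrative tags only)."""
    lookup = {"je": "PPER1S", "suis": "AUX", "libre": "ADJMS"}
    return [(word, lookup.get(word, "<unk>")) for word in text.split()]


rows = [{"id": "1", "sentence": "je suis libre"}]
tagged_rows = add_pos_column(rows, demo_tagger)

# Written to an in-memory buffer here; for a real file, use
# open(path, "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "sentence", "pos_tags"])
writer.writeheader()
writer.writerows(tagged_rows)
print(buffer.getvalue())
```

The `csv` module quotes the JSON string automatically, so the commas and double quotes inside it do not break the row. Reading the column back only needs `json.loads` on the cell value.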

In [1]:
import flair



In [None]:
'''
from nltk.tag import StanfordPOSTagger

# Set the paths to the Stanford POS Tagger jar and model files
jar = '<path_to>/stanford-postagger-3.7.0.jar'
model = '<path_to>/models/french.tagger'

# Set the JAVAHOME environment variable to point to your JDK installation
import os
java_path = "<path_to>/jdk1.8.0_121/bin/java.exe"
os.environ['JAVAHOME'] = java_path

# Initialize the tagger
pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')

# Tag a sentence
res = pos_tagger.tag('je suis libre'.split())
print(res)
'''

In [2]:
from flair.data import Sentence
from flair.models import SequenceTagger

# Load the LSTM-CRF model pre-trained for French from Hugging Face
model = SequenceTagger.load("qanastek/pos-french")

# Create a sentence
sentence = Sentence("George Washington est allé à Washington")

# Predict tags
model.predict(sentence)

# Print predicted pos tags
print(sentence.to_tagged_string())



2023-11-04 14:03:59,617 SequenceTagger predicts: Dictionary with 69 tags: <unk>, O, DET, NFP, ADJFP, AUX, VPPMS, ADV, PREP, PDEMMS, NMS, COSUB, PINDMS, PPOBJMS, VERB, DETFS, NFS, YPFOR, VPPFS, PUNCT, DETMS, PROPN, ADJMS, PPER3FS, ADJFS, COCO, NMP, PREL, PPER1S, ADJMP, VPPMP, DINTMS, PPER3MS, PPER3MP, PREF, ADJ, DINTFS, CHIF, XFAMIL, PRELFS, SYM, NOUN, MOTINC, PINDFS, PPOBJMP, NUM, PREFP, PDEMFS, VPPFP, PPER3FP
Sentence[6]: "George Washington est allé à Washington" → ["George"/PROPN, "Washington"/XFAMIL, "est"/AUX, "allé"/VPPMS, "à"/PREP, "Washington"/PROPN]


In [3]:
# Create a sentence with deliberate spelling mistakes and a made-up word
sentence = Sentence("C'est la premiere fois que j'utilise cette methode et il y a des fautes de grammaire et un moirfas qui n'existe pas!")

# Predict tags
model.predict(sentence)

# Print predicted pos tags
print(sentence.to_tagged_string())

Sentence[23]: "C'est la premiere fois que j'utilise cette methode et il y a des fautes de grammaire et un moirfas qui n'existe pas!" → ["C'est"/PREP, "la"/DETFS, "premiere"/ADJFS, "fois"/NFS, "que"/COSUB, "j'utilise"/PREP, "cette"/PDEMFS, "methode"/NFS, "et"/COCO, "il"/PPER3MS, "y"/PPOBJMS, "a"/VERB, "des"/DET, "fautes"/NFP, "de"/PREP, "grammaire"/NFS, "et"/COCO, "un"/DINTMS, "moirfas"/NMS, "qui"/PREL, "n'existe"/VERB, "pas"/ADV, "!"/PUNCT]


This is pretty accurate (overall accuracy is around 98%, though it is far lower for certain word types). However, I need to make sure the labels aren't too specific. Otherwise, the counts for a single word could be split across separate categories whenever it appears in different contexts with similar but not identical meanings.
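
One way to keep the labels from splitting the counts is to collapse the fine-grained tags into coarser categories before counting. The sketch below assumes the gender and number suffixes mean what their names suggest (for example, `NFS` as a feminine singular noun and `VPPMS` as a masculine singular past participle); that reading should be checked against the tag list linked above. The mapping is deliberately partial, the names `COARSE_TAGS`, `coarsen` and `count_words` are hypothetical, and the tagged pairs are made up rather than real model output.

```python
from collections import Counter

# Partial, assumed mapping from fine-grained tags to coarse categories.
# Tags that are not listed are left unchanged.
COARSE_TAGS = {
    "NMS": "NOUN", "NFS": "NOUN", "NMP": "NOUN", "NFP": "NOUN",
    "ADJMS": "ADJ", "ADJFS": "ADJ", "ADJMP": "ADJ", "ADJFP": "ADJ",
    "DETMS": "DET", "DETFS": "DET",
    "VPPMS": "VERB", "VPPFS": "VERB", "VPPMP": "VERB", "VPPFP": "VERB",
}


def coarsen(tag):
    """Map a fine-grained tag to its coarse category, if one is defined."""
    return COARSE_TAGS.get(tag, tag)


def count_words(tagged_tokens):
    """Count (lowercased word, coarse tag) pairs."""
    return Counter((word.lower(), coarsen(tag)) for word, tag in tagged_tokens)


# Illustrative input: the same adjective tagged with two different genders
tagged = [("libre", "ADJMS"), ("libre", "ADJFS"), ("fois", "NFS"), ("de", "PREP")]
counts = count_words(tagged)
print(counts.most_common())
```

With the fine-grained tags, "libre" would be counted once under `ADJMS` and once under `ADJFS`. After coarsening, both occurrences fall under `("libre", "ADJ")`. Whether past participles belong under `VERB` is a design choice that depends on how the frequency list will be used.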