# nlptextdoc library source code

## 1. Prepare the Python environment

Check which Python virtual environment is in use:

In [1]:
import sys
print(sys.executable)

/workspace/nlptextdoc/.venv/bin/python


Install all the required libraries with pip:

In [None]:
pip install -r requirements.txt

### 1.1 Install pandas with pyarrow feather file format support

Make sure that your version of pandas is >= 2.0.0 and that pyarrow >= 12.0.0 is installed:

In [2]:
import pandas as pd
pd.__version__

'2.0.2'

In [3]:
import pyarrow
pyarrow.__version__

'12.0.0'
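
The two version checks above can also be automated. The following is only a minimal sketch, assuming plain `major.minor.patch` version strings; `version_tuple` and `meets_minimum` are hypothetical helper names, not functions of pandas or pyarrow.

```python
def version_tuple(version):
    """Convert a version string such as '2.0.2' to a tuple of ints (2, 0, 2).

    Assumption: each dot-separated part starts with digits; a suffix such
    as 'rc1' is ignored.
    """
    parts = []
    for part in version.split("."):
        digits = ""
        for char in part:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    a, b = version_tuple(installed), version_tuple(minimum)
    # Pad the shorter tuple with zeros so that '2.0' compares equal to '2.0.0'
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


print(meets_minimum("2.0.2", "2.0.0"))    # pandas version shown above
print(meets_minimum("12.0.0", "12.0.0"))  # pyarrow version shown above
```

In the notebook, the same helper could be called with `pd.__version__` and `pyarrow.__version__`.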

### 1.2 Install spaCy with French language support

Make sure that your version of spaCy is >= 3.5 and that French models are installed:

In [4]:
import spacy
spacy.__version__


'3.5.3'

In [5]:
spacy.util.get_installed_models()

['fr_dep_news_trf', 'fr_core_news_md']

In [6]:
def spacy_InitWithFrenchTokenizerOnly():
    nlp = spacy.blank('fr')
    return nlp

In [7]:
nlp = spacy_InitWithFrenchTokenizerOnly()
doc = nlp("Ce chapitre va te montrer tout ce qu'il y a à savoir à propos du pipeline de traitement de spaCy.")
for word in doc:
    print(word)

Ce
chapitre
va
te
montrer
tout
ce
qu'
il
y
a
à
savoir
à
propos
du
pipeline
de
traitement
de
spaCy
.


### 1.3 Install FastText and its language identification model

Make sure that fasttext is properly installed:

In [8]:
import fasttext

Download the fasttext language detection model:

In [None]:
!wget -P /models/fasttext https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

Check if the language detection works :

In [9]:
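from fasttext import FastText

langmodel = FastText.load_model("/models/fasttext/lid.176.bin")
# Labels look like "__label__fr": strip the prefix instead of keeping the last
# two characters, because some language codes have three letters (e.g. "ceb").
LABEL_PREFIX = "__label__"
langcodes = {val: val[len(LABEL_PREFIX):] for val in langmodel.get_labels()}

def detect_lang(text):
    # predict returns (labels, probabilities), best prediction first
    pred = langmodel.predict(text)
    return (langcodes[pred[0][0]], pred[1][0])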



In [10]:
%time detect_lang("quel temps fait-il ?")

CPU times: user 39 µs, sys: 31 µs, total: 70 µs
Wall time: 72.2 µs


('fr', 0.9988894462585449)

In [11]:
detect_lang("what time is it ?")

('en', 0.9735504388809204)
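
The fastText `predict` method handles one line of text at a time, and short or ambiguous inputs can get a low confidence score. The sketch below shows one possible way to guard against both cases. `detect_lang_safe` and `min_confidence` are hypothetical names, the 0.5 threshold is an arbitrary assumption, and a stand-in detector is used so that the snippet runs without the model file.

```python
def detect_lang_safe(text, detect, min_confidence=0.5):
    """Return (language, confidence), or (None, confidence) when unsure.

    `detect` is any callable returning (language_code, confidence),
    for example the detect_lang function defined above.
    """
    # Collapse line breaks and repeated spaces into single spaces
    single_line = " ".join(text.split())
    if not single_line:
        return (None, 0.0)
    lang, confidence = detect(single_line)
    if confidence < min_confidence:
        return (None, confidence)
    return (lang, confidence)


# Stand-in detector (hypothetical) so that the sketch runs without the model
def fake_detect(text):
    return ("fr", 0.99) if "temps" in text else ("en", 0.30)


print(detect_lang_safe("quel temps\nfait-il ?", fake_detect))
print(detect_lang_safe("ok", fake_detect))
```

With the real model, `detect_lang` would be passed in place of `fake_detect`.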

### 1.4 Download publicly available French dictionaries

Create a local subdirectory to store the French dictionaries:

In [5]:
from pathlib import Path

dictdir = Path("_dictionaries")
dictdir.mkdir(exist_ok=True)

**Dictionary 1 : Dicollecte** - Open source French dictionary for LibreOffice/OpenOffice

Website : https://grammalecte.net/

Licence : Mozilla Public License version 2.0 (MPL) - http://www.mozilla.org/MPL/2.0/

Download the file : http://www.lexique.org/databases/Dicollecte/lexique-dicollecte-fr-v6.4.1.txt

In [None]:
!wget -P {dictdir} http://www.lexique.org/databases/Dicollecte/lexique-dicollecte-fr-v6.4.1.txt

In [13]:
dicollectefile = dictdir / "lexique-dicollecte-fr-v6.4.1.txt"
dicollectefile.exists()

True

In [14]:
def buildDicollecteTags(dicollectefile):
    dictionarydf = pd.read_csv(dicollectefile, sep="\t")
    dictionarytags = {}
    for index, row in dictionarydf.iterrows():
        token = row["Flexion"]
        tag = _convertDicollecteTagsToUnivDepTags(row["Étiquettes"])
        if(not (token in dictionarytags)):
            dictionarytags[token] = tag
        elif(not (tag in dictionarytags[token])):
            dictionarytags[token] = dictionarytags[token] + "|" + tag
    return dictionarytags

def _convertDicollecteTagsToUnivDepTags(text):
    if(("adj" in text) or ("loc.adj" in text)):
        return "ADJ"
    elif("prep" in text):
        return "ADP"
    elif(("adv" in text) or ("loc.adv" in text)):
        return "ADV"
    elif(("v0a" in text) or ("v0e" in text) or ("ppas" in text)):
        return "AUX"
    elif("cjco" in text):
        return "CCONJ"
    elif("det" in text):
        return "DET"
    elif("interj" in text):
        return "INTJ"
    elif("nom" in text):
        return "NOUN"
    elif(("nb" in text) or ("ord" in text)):
        return "NUM"
    elif("pro" in text):
        return "PRON"
    elif(("prn" in text) or ("patr" in text) or ("npr" in text)):
        return "PROPN"
    elif("cjsub" in text):
        return "SCONJ"
    elif("symb" in text):
        return "SYM"
    elif(("v1" in text) or ("v2" in text) or ("v3" in text) or ("loc.verb" in text)):
        return "VERB"
    else:
        return text

In [15]:
dicollecteTags = buildDicollecteTags(dicollectefile)

**Dictionary 2 : UDLexicons Lefff** - Research resource from INRIA for the [Universal Dependencies](https://universaldependencies.org/) project

Citation : Benoît Sagot. A multilingual collection of CoNLL-U-compatible morphological lexicons. Eleventh International Conference on Language Resources and Evaluation (LREC 2018), May 2018, Miyazaki, Japan. hal-01798798v2

Paper : https://hal.inria.fr/hal-01798798v2/document

Download the latest "UDLexicons" on [Benoît Sagot's resources page](http://alpage.inria.fr/~sagot/)

In [None]:
!wget -P {dictdir} http://atoll.inria.fr/~sagot/UDLexicons.0.2.zip

In [None]:
!unzip {dictdir}/UDLexicons.0.2.zip 'UDLexicons.0.2/UDLex_French-Lefff.conllul' -d {dictdir}

In [24]:
!rm {dictdir}/UDLexicons.0.2.zip

In [16]:
leffffile = dictdir / "UDLexicons.0.2/UDLex_French-Lefff.conllul"
leffffile.exists()

True

In [17]:
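def buildLefffTags(leffffile):
    # The lexicon file seems to have no header row, so pandas takes its first
    # line as column names: "!" then designates the word form column and
    # "PUNCT" the universal part-of-speech tag column.
    lexicondf = pd.read_csv(leffffile, sep="\t", quoting=3, on_bad_lines='skip')
    lexicontags = {}
    for index, row in lexicondf.iterrows():
        token = row["!"]
        tag = row["PUNCT"]
        if token not in lexicontags:
            lexicontags[token] = tag
        elif tag not in lexicontags[token]:
            lexicontags[token] = lexicontags[token] + "|" + tag
    return lexicontags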

In [18]:
lefffTags = buildLefffTags(leffffile)
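
Both dictionaries map a token to one or more tags joined with "|". The sketch below shows one possible way to query and combine such dictionaries. `possible_tags` and `merge_tag_dictionaries` are hypothetical helpers, and the toy entries are made up for illustration, not taken from Dicollecte or Lefff.

```python
def possible_tags(token, tags_by_token):
    """Return the set of tags recorded for a token (empty set if unknown)."""
    joined = tags_by_token.get(token)
    if joined is None:
        return set()
    return set(joined.split("|"))


def merge_tag_dictionaries(first, second):
    """Combine two token -> "TAG1|TAG2" dictionaries without duplicate tags."""
    merged = {}
    for source in (first, second):
        for token, joined in source.items():
            merged.setdefault(token, set()).update(joined.split("|"))
    # Sort the tags so that the result does not depend on insertion order
    return {token: "|".join(sorted(tags)) for token, tags in merged.items()}


# Toy dictionaries with the same shape as dicollecteTags and lefffTags
toy_first = {"ferme": "ADJ|NOUN", "de": "ADP"}
toy_second = {"ferme": "NOUN|VERB", "et": "CCONJ"}

merged = merge_tag_dictionaries(toy_first, toy_second)
print(merged["ferme"])
print(possible_tags("inconnu", merged))
```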

### 1.5 Define technical utility functions

In [6]:
import os

def _memory_size(obj, seen=None):
    size = sys.getsizeof(obj)
    if seen is None:
        seen = set()
    obj_id = id(obj)
    if obj_id in seen:
        return 0
    seen.add(obj_id)
    if isinstance(obj, dict):
        size += sum([_memory_size(v, seen) for v in obj.values()])
        size += sum([_memory_size(k, seen) for k in obj.keys()])
    elif hasattr(obj, '__dict__'):
        size += _memory_size(obj.__dict__, seen)
    elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):
        size += sum([_memory_size(i, seen) for i in obj])
    return size

# OTHER OPTION specific to pandas dataframes
# https://www.dataquest.io/blog/pandas-big-data/
# df.info(memory_usage="deep")

def _file_size(filepath):
    statinfo = os.stat(filepath)
    return statinfo.st_size

def _format_size_mb(size):
    # bytes -> megabytes, truncated to one decimal (1024 * 102.4 bytes = 0.1 MB)
    return int(size / 1024.0 / 102.4) / 10.0
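
The arithmetic in `_format_size_mb` is compact: dividing by 1024 and then by 102.4 gives the size in tenths of a megabyte, which is truncated and divided by 10. The sketch below spells out the same idea step by step. `format_size_mb` is a hypothetical name, and this variant is an assumption about the intent rather than a drop-in replacement.

```python
import math


def format_size_mb(size_in_bytes):
    """Convert a byte count to megabytes, truncated to one decimal place."""
    tenths_of_mb = math.floor(size_in_bytes / (1024 * 1024) * 10)
    return tenths_of_mb / 10


print(format_size_mb(1024 * 1024))  # 1.0
print(format_size_mb(1536 * 1024))  # 1.5
print(format_size_mb(1_000_000))    # 0.9
```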

## 2. Extract nlp text documents from a list of websites in a local directory

### 2.1 Download the nlptextdoc web scraper

Test the **nlptextdoc** command line tool and learn its syntax:

```
nlptexdoc extractor v1.0

Crawls all the Html pages of a website and converts them to .nlp.txt structured text documents.
All the extracted text documents are stored under a single directory named like the website.
The .nlp.txt file format is described here : https://www.cognitivefactory.org/nlptextdocument/

Features an advanced Html to text conversion algorithm :
- tries to recover the logical structure of the document from the Html layout
- interprets Css properties of the Html nodes to make this operation more reliable
- preserves document / section / list / table grouping and nesting information

Usage : nlptextdoc [scope] [rootUrl] [storageDir] [key=value optional params]
 - scope            : domain | subdomain | path
                      > decide what part of the rootUrl should be used to limit the extraction
 - rootUrl          : root Url of the website (or subfolder of a website) you want to crawl
 - storageDir       : path to the disk directory where the text documents will be extracted
Optional stopping conditions (the first to be met will stop the crawl, 0 means no limit) :
 - maxDuration=2     : maximum duration of the extraction in minutes
 - maxPageCount=500  : maximum number of pages extracted from the website
 - maxErrorsCount=100  : maximum number of errors during the extraction
 - minUniqueText=10  : minimum percentage of unique text blocks extracted
 - maxSizeOnDisk=0   : maximum size of the extracted text files on disk in Mb
Optional parameters :
 - minCrawlDelay=100 : delay in milliseconds between two requests sent to the website

Recommended process :
0. Navigate to the rootUrl in your browser and check the links on the page to select a scope for the extraction
1. Run the the tool once with the default params (maximum 2 minutes/500 pages, small crawl delay)
2. Open the log file "_nlptextdoc/httprequests.log.csv" created in the storageDirectory for the website
3. Check for Http "Forbidden" answers or connection errors, and test if the url was accessible when tested from your browser
4. Try again with a bigger minCrawlDelay, and continue to increase it until "Forbidden" errors disappear
5. Open the log file "_nlptextdoc/exceptions.log.txt" created in the storageDirectory for the website
6. Try to find the root cause and to fix any exception message you see there
7. Start the extraction again with bigger maxPageCount and maxDuration
8. Open the log file "_nlptextdoc/exceptions.log.txt" and find the Urls you want to exclude
9. Add urlPatternsToExclude and continue or restart the crawl with a bigger maxPageCount and maxDuration

The extraction can take a while :
- your system can go to hibernation mode and resume without interrupting the crawl
- your can even stop the crawl (Ctrl-C or shutdown) and continue it later where you left it
- the continue command will use checkpoint and config files found in the "_nlptextdoc" subfolder
- the restart command will ignore any checkpoint, start again at the root url, and overwrite everything

Specific syntax to continue or restart after a first try :
nlptextdoc [continue|restart] [storageDirectory/rootUrlSubdir] [key=value optional params to override]
```
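
The usage line above can be assembled from Python before launching a crawl. This is only a sketch: `build_nlptextdoc_command` is a hypothetical helper, and it assumes the tool is available on the PATH under the name `nlptextdoc`. Only the positional arguments and `key=value` parameters listed in the help text are used.

```python
import shlex


def build_nlptextdoc_command(scope, root_url, storage_dir, **params):
    """Build the argument list: nlptextdoc [scope] [rootUrl] [storageDir] [key=value ...]."""
    if scope not in ("domain", "subdomain", "path"):
        raise ValueError(f"Unknown scope: {scope}")
    # Assumption: the executable is on the PATH and is called "nlptextdoc"
    command = ["nlptextdoc", scope, root_url, str(storage_dir)]
    command += [f"{key}={value}" for key, value in params.items()]
    return command


command = build_nlptextdoc_command(
    "subdomain", "https://www.cic.fr/", "nlptextdoc-dataset-2023_05",
    maxDuration=2, maxPageCount=500, minCrawlDelay=100)
print(shlex.join(command))
```

The resulting list could then be passed to `subprocess.run(command, check=True)`.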

### 2.2 Identify popular websites to build your specific language model

List the public and open websites you would like to scrape to build a language model.

PLEASE MAKE SURE THIS IS LEGAL in your country.

For example in Europe : https://ec.europa.eu/digital-single-market/en/modernisation-eu-copyright-rules. 

> "The mandatory exceptions that the proposed directive announced are related to: ... Text and data mining ..."

### 2.3 Extract raw text from these websites in a local directory

Create a local directory to store the extracted nlp text documents : be careful, this directory may contain several gigabytes of data at the end of the process.

In [7]:
rootdir = Path("nlptextdoc-dataset-2023_05")
rootdir.mkdir(exist_ok=True)

In the local root directory, the extraction program creates one subdirectory per website.

Each website subdirectory contains:
- a **_nlptextdoc** subdirectory, with:
  - a config file called **config.txt**
  - a log file called **requests.log.csv**
- subdirectories reproducing the website tree structure
  - with one **nlp.txt text document** for each html page or pdf file extracted

See the following page for a **description of the nlptextdoc format** : https://github.com/laurentprudhon/nlptextdoc/blob/master/README.md
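
The directory layout described above can be checked with a few lines of `pathlib`. The sketch below is an illustration only: `describe_website_dir` is a hypothetical helper, and the temporary directory it inspects contains made-up files.

```python
from pathlib import Path
import tempfile


def describe_website_dir(websitedir):
    """Report which expected files are present in a website directory."""
    websitedir = Path(websitedir)
    metadir = websitedir / "_nlptextdoc"
    return {
        "has_config": (metadir / "config.txt").is_file(),
        "has_requests_log": (metadir / "requests.log.csv").is_file(),
        # Count the extracted documents in the whole directory tree
        "documents": sum(1 for _ in websitedir.glob("**/*.nlp.txt")),
    }


# Build a tiny fake extraction directory to try the function
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp) / "www.example.org"
    (site / "_nlptextdoc").mkdir(parents=True)
    (site / "_nlptextdoc" / "config.txt").write_text("scope=subdomain\n", encoding="utf-8")
    (site / "products").mkdir()
    (site / "products" / "index.nlp.txt").write_text("", encoding="utf-8")
    summary = describe_website_dir(site)

print(summary)
```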

## 3. Consolidate the extracted nlp text documents in one dataframe for each website 

### 3.1 Build a list of the extracted websites

In [8]:
extractiondir = rootdir / "_extraction"
extractiondir.mkdir(exist_ok=True)

In [22]:
import os
from pathlib import Path
import numpy as np
import pandas as pd

def list_websites(rootdir):
    scopes = []
    urls = []
    dirs = []
    SCOPE_KEY = "scope="
    URL_KEY="rootUrl="
    for entry in os.scandir(rootdir):
        if entry.is_dir():
            websitedir = Path(entry)
            configfile = websitedir / "_nlptextdoc" / "config.txt"
            scope = ""
            if configfile.exists():
                with configfile.open(mode="r", encoding="utf-8-sig") as f:   
                    for lineidx,line in enumerate(f):
                        line = line.strip()
                        if (line.startswith(SCOPE_KEY)):
                            scope = line[len(SCOPE_KEY):]
                        if (line.startswith(URL_KEY)):
                            url = line[len(URL_KEY):]
                            scopes.append(scope)
                            urls.append(url)
                            dirs.append(str(websitedir.name))                            
                            break
    websitesdf = pd.DataFrame({"Url": urls, "Scope" : scopes, "Directory" : dirs})
    websitesdf = websitesdf.astype({"Scope": "category"},copy=False)
    websitesdf.sort_values(by="Url", ignore_index=True, inplace=True)
    return websitesdf

In [25]:
websitesdf = list_websites(rootdir)

In [26]:
websitesdf.head()

|   | Url | Scope | Directory |
|---|-----|-------|-----------|
| 0 | https://www.acm.fr/ | subdomain | www.acm.fr |
| 1 | https://www.afedim.fr/ | subdomain | www.afedim.fr |
| 2 | https://www.banquedeluxembourg.com/ | subdomain | www.banquedeluxembourg.com |
| 3 | https://www.banquetransatlantique.com/ | subdomain | www.banquetransatlantique.com |
| 4 | https://www.becm.fr/ | subdomain | www.becm.fr |


Check if all the websites were correctly extracted:

In [27]:
import os
import fnmatch
import pandas as pd

def loadExtractionLogs(websitedir):
    return pd.read_csv(websitedir / "_nlptextdoc/requests.log.csv",delimiter=";")

def count_files_with_extension(directory, extension):
    count = 0
    for root, dirs, files in os.walk(directory):
        for file in files:
            if fnmatch.fnmatch(file, f'*.{extension}'):
                count += 1
    return count

def getExtractionStats(rootdir, websitesdf):
    websiteNames = []
    requestsCount = []
    extractedCount = []
    htmlCount = []
    pdfCount = []
    statusCounts = []    
    errorTypes = ["OK","NotFound","Forbidden"]
    for errorType in errorTypes:
        statusCounts.append([])
    for websiterow in websitesdf.itertuples():
        website = websiterow.Url        
        websitedir = rootdir / websiterow.Directory
        print(f"Checking extraction logs for website {website} ...")
        logsdf = loadExtractionLogs(websitedir)
        logsstatus = logsdf["Status code"].value_counts()
        websiteNames.append(website)
        requestsCount.append(len(logsdf))
        for idx,errorType in enumerate(errorTypes):
            statusCounts[idx].append(logsstatus[errorType] if errorType in logsstatus else 0)
        extractedtotal = count_files_with_extension(websitedir, "nlp.txt")
        extractedpdf = count_files_with_extension(websitedir, "pdf.nlp.txt")
        extractedCount.append(extractedtotal)
        htmlCount.append(extractedtotal-extractedpdf)
        pdfCount.append(extractedpdf)
    websitesdf["Requests"] = requestsCount
    for idx,errorType in enumerate(errorTypes):
        websitesdf[errorType] = statusCounts[idx]    
    websitesdf["Extracted"] = extractedCount
    websitesdf["(html)"] = htmlCount
    websitesdf["(pdf)"] = pdfCount
    websitesdf.sort_values(by="Extracted", ascending=False, ignore_index=True, inplace=True)

In [26]:
getExtractionStats(rootdir, websitesdf)

Checking extraction logs for website https://www.acm.fr/ ...
Checking extraction logs for website https://www.afedim.fr/ ...
Checking extraction logs for website https://www.banquedeluxembourg.com/ ...
Checking extraction logs for website https://www.banquetransatlantique.com/ ...
Checking extraction logs for website https://www.becm.fr/ ...
Checking extraction logs for website https://www.beobank.be/ ...
Checking extraction logs for website https://www.bfcm.creditmutuel.fr/ ...
Checking extraction logs for website https://www.cic-marketsolutions.eu/ ...
Checking extraction logs for website https://www.cic.ch/ ...
Checking extraction logs for website https://www.cic.fr/ ...
Checking extraction logs for website https://www.cic.fr/banqueprivee/ ...
Checking extraction logs for website https://www.cofidis-group.com/ ...
Checking extraction logs for website https://www.cofidis.be/ ...
Checking extraction logs for website https://www.cofidis.cz/ ...
Checking extraction logs for website http

In [27]:
websitesdf.to_csv(extractiondir / "websites.csv", sep=";", index=False)

In [23]:
websitesdf = pd.read_csv(extractiondir / "websites.csv", sep=";")

In [18]:
websitesdf

|   | Url | Scope | Directory | Requests | OK | NotFound | Forbidden | Extracted | (html) | (pdf) |
|---|-----|-------|-----------|----------|----|----------|-----------|-----------|--------|-------|
| 0 | https://www.creditmutuel.fr/ | subdomain | www.creditmutuel.fr | 24047 | 23672 | 230 | 0 | 11857 | 10912 | 945 |
| 1 | https://www.cic.fr/ | subdomain | www.cic.fr | 17534 | 17127 | 176 | 0 | 7838 | 6807 | 1031 |
| 2 | https://www.becm.fr/ | subdomain | www.becm.fr | 9029 | 8925 | 26 | 0 | 7647 | 7597 | 50 |
| 3 | https://www.la-francaise.com/ | subdomain | www.la-francaise.com | 14618 | 14607 | 3 | 0 | 6330 | 4600 | 1730 |
| 4 | https://www.beobank.be/ | subdomain | www.beobank.be | 2651 | 2631 | 17 | 0 | 3710 | 2916 | 794 |
| 5 | https://www.cic.fr/banqueprivee/ | path | www.cic.fr_banqueprivee_ | 2434 | 2423 | 7 | 0 | 1991 | 1991 | 0 |
| 6 | https://www.creditmutuel-am.eu/ | subdomain | www.creditmutuel-am.eu | 3711 | 3539 | 171 | 0 | 1651 | 1284 | 367 |
| 7 | https://www.tomamosimpulso.com/ | subdomain | www.tomamosimpulso.com | 2534 | 2488 | 8 | 0 | 1536 | 1171 | 365 |
| 8 | https://www.creditmutuel.com/ | subdomain | www.creditmutuel.com | 2681 | 2652 | 29 | 0 | 1481 | 1122 | 359 |
| 9 | https://www.banquedeluxembourg.com/ | subdomain | www.banquedeluxembourg.com | 2710 | 2538 | 145 | 0 | 1394 | 1394 | 0 |
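
One quick way to read these statistics is to compare the number of `OK` answers with the total number of requests for each website. The sketch below does this for the first three rows of the table. `ok_ratio` is a hypothetical helper, and what counts as an acceptable ratio is left to the reader.

```python
def ok_ratio(requests, ok):
    """Share of requests answered with an OK status (0.0 when there were none)."""
    return ok / requests if requests else 0.0


# (Url, Requests, OK) values copied from the first rows of the table above
rows = [
    ("https://www.creditmutuel.fr/", 24047, 23672),
    ("https://www.cic.fr/", 17534, 17127),
    ("https://www.becm.fr/", 9029, 8925),
]

for url, requests, ok in rows:
    print(f"{url}: {ok_ratio(requests, ok):.1%} of requests succeeded")
```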


Prepare a list of websites directories for the next steps:

In [24]:
websitesdirs = list(websitesdf["Directory"])

### 3.2 Generate an efficient DataFrame for each website

The following class can be used to **parse the .nlp.txt text files extracted from a website into a DataFrame**.

See the following page for a **description of the nlptextdoc format** : https://github.com/laurentprudhon/nlptextdoc/blob/master/README.md

In [25]:
def get_website_dataframe_files(websitename, outputdir, i=0):
    nameprefix = websitename + "."
    idxprefix = "" if i == 0 else str(i)+"."    
    urlsdffile = outputdir / (nameprefix + idxprefix + "urls.feather")
    textdffile = outputdir / (nameprefix + idxprefix + "nlptextdocs.feather")
    return (urlsdffile, textdffile)
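
To make the naming rule concrete, the sketch below shows the file names produced for one website, with and without a batch index. The function body is copied under a different, hypothetical name (`website_dataframe_files`) only so that the example runs on its own.

```python
from pathlib import Path


def website_dataframe_files(websitename, outputdir, i=0):
    # Same naming rule as get_website_dataframe_files above
    nameprefix = websitename + "."
    idxprefix = "" if i == 0 else str(i) + "."
    urlsdffile = outputdir / (nameprefix + idxprefix + "urls.feather")
    textdffile = outputdir / (nameprefix + idxprefix + "nlptextdocs.feather")
    return (urlsdffile, textdffile)


outputdir = Path("nlptextdoc-dataset-2023_05") / "_extraction"

urlsfile, textfile = website_dataframe_files("www.cic.fr", outputdir)
print(urlsfile.name)      # www.cic.fr.urls.feather
print(textfile.name)      # www.cic.fr.nlptextdocs.feather

# A non-zero index is inserted between the website name and the file type
partial_urls, _ = website_dataframe_files("www.cic.fr", outputdir, i=100000)
print(partial_urls.name)  # www.cic.fr.100000.urls.feather
```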

In [20]:
import pandas as pd
import pyarrow as pa
import re

class NLPTextDocumentReader:
    """Read output files of a website extraction in pandas DataFrames.
    
    Sample usage :
    
    textreader = NLPTextDocumentReader(websitedir)
    textdf = textreader.load_dataframe()
    """    
    def __init__(self, websitedir):
        self.websitedir = websitedir
        
        self.documentCount = 0 
        self.nestingLevel = 1
        
        self.listDocUrls = []
        
        self.listDocId = []
        self.listType = []
        self.listCmd = []
        self.listLevel = []
        self.listText = []
                
        self.DOCUMENT_ELEMENT_LINE_MARKER = "##"
        self.DOCUMENT_ELEMENT_START = "Start"
        self.DOCUMENT_ELEMENT_END = "End"
        self.DOCUMENT_ELEMENT_ITEMS = "Items"
        self.DOCUMENT_ELEMENT_ITEMS_START = ">>"
        self.DOCUMENT_ELEMENT_ITEMS_SEPARATOR = "||"
        
        self.TEXT_DOCUMENT_PROPERTY_PREFIX = self.DOCUMENT_ELEMENT_LINE_MARKER + " NLPTextDocument "
        self.TEXT_DOCUMENT_TITLE = "Title"
        self.TEXT_DOCUMENT_URI = "Uri"
        
        self.DOCUMENT_ELEMENT_LINE_REGEX = re.compile(
            self.DOCUMENT_ELEMENT_LINE_MARKER + " "
            + "(?P<NestingLevel>[0-9]+)" + " "
            + "(?P<ElementName>[A-Za-z]+)" + " "
            + "(?P<Command>" + self.DOCUMENT_ELEMENT_START + "|" + self.DOCUMENT_ELEMENT_END + "|" + self.DOCUMENT_ELEMENT_ITEMS + ")" + " ?")
        
    def generate_dataframes(self, outputdir):        
        i = 0
        for textfile in self.websitedir.glob("**/*.nlp.txt"):
            with textfile.open(mode="r", encoding="utf-8-sig", newline='\n') as f:   
                self.textfile = textfile
                self.documentCount = self.documentCount+1
                self.onDocumentStart(str(self.documentCount))
                self.isreadingproperties = True
                for lineidx,line in enumerate(f):
                    line = line.strip()
                    if(not line): continue
                    self.lineidx = lineidx
                    self.readline(line)
                self.onDocumentEnd(str(self.documentCount))
            i = i+1
            if(i%100000 == 0):
                self.write_dataframes(outputdir, i)
        return self.write_dataframes(outputdir)
  
    def write_dataframes(self, outputdir, i=0):
        urlsdffile, textdffile = get_website_dataframe_files(self.websitedir.name, outputdir, i)
        urlsdf = pd.DataFrame({"DocId" : pd.Series(data = range(1,len(self.listDocUrls)+1), dtype="uint32[pyarrow]"), 
                               "DocUrl" : pd.Series(data = self.listDocUrls, dtype="string[pyarrow]")})        
        urlsdf.to_feather(urlsdffile)      
        textdf = pd.DataFrame({"DocId" : pd.Series(data = self.listDocId, dtype="uint32[pyarrow]"), 
                               "DocEltType" : pd.Series(data = self.listType, dtype="category"), 
                               "DocEltCmd" : pd.Series(data = self.listCmd, dtype="category"), 
                               "NestingLevel" : pd.Series(data = self.listLevel, dtype="uint8[pyarrow]"), 
                               "Text" : pd.Series(data = self.listText, dtype="string[pyarrow]")})
        textdf.to_feather(textdffile)
        self.__init__(self.websitedir)
        return (urlsdf,textdf)

    def readline(self,line):
        if (self.isreadingproperties):
            if (line.startswith(self.TEXT_DOCUMENT_PROPERTY_PREFIX)):
                self.readproperty(line[len(self.TEXT_DOCUMENT_PROPERTY_PREFIX):])
            else:
                self.isreadingproperties = False
        if (not self.isreadingproperties):
            self.readelement(line)
                
    def readproperty(self,propstr):
        firstspaceindex = propstr.find(" ")
        if (firstspaceindex > 0):
            propertyname = propstr[:firstspaceindex]            
            propertyvalue = propstr[firstspaceindex + 1:].strip()
            if(propertyname == self.TEXT_DOCUMENT_TITLE):
                self.onDocumentTitle(propertyvalue)
            elif(propertyname == self.TEXT_DOCUMENT_URI):
                self.onDocumentUri(propertyvalue)       
    
    def readelement(self,line):
        if (line.startswith(self.DOCUMENT_ELEMENT_LINE_MARKER)):
            self.readcommand(line)
        else:
            self.onTextBlock(line)
    
    def readcommand(self,line):
        match = self.DOCUMENT_ELEMENT_LINE_REGEX.match(line)
        if(match): 
            self.nestingLevel = int(match.group("NestingLevel"))
            elementName = match.group("ElementName")
            command = match.group("Command")
            if (command == self.DOCUMENT_ELEMENT_START):
                title = line[match.end():].strip()
                if (len(title) == 0): title = None
                if(elementName == "Section"):
                    self.onSectionStart(title)
                elif(elementName == "NavigationList"):
                    self.onNavigationListStart(title)
                elif(elementName == "List"):
                    self.onListStart(title)
                elif(elementName == "ListItem"):
                    self.onListItemStart()
                elif(elementName == "Table"):
                    self.onTableStart(title)
                elif(elementName == "TableHeader"):
                    self.onTableHeaderStart()           
                elif(elementName == "TableCell"):
                    self.onTableCellStart()
            elif (command == self.DOCUMENT_ELEMENT_END):
                if(elementName == "Section"):
                    self.onSectionEnd()
                elif(elementName == "NavigationList"):
                    self.onNavigationListEnd()
                elif(elementName == "List"):
                    self.onListEnd()
                elif(elementName == "ListItem"):
                    self.onListItemEnd()
                elif(elementName == "Table"):
                    self.onTableEnd()
                elif(elementName == "TableHeader"):
                    self.onTableHeaderEnd()                 
                elif(elementName == "TableCell"):
                    self.onTableCellEnd()
            elif (command == self.DOCUMENT_ELEMENT_ITEMS):
                startOfItems = line.find(self.DOCUMENT_ELEMENT_ITEMS_START)
                itemsText = line[startOfItems+len(self.DOCUMENT_ELEMENT_ITEMS_START):].strip()
                # START Temporary bug fix: newline missing in many documents here, for example:
                # "## 5 List Items >> Découvrez nos solutions pour épargner || Nos conseils## 4 ListItem End"
                match2 = self.DOCUMENT_ELEMENT_LINE_REGEX.search(itemsText)
                nextLine = None
                if(match2):
                    # Split off the command that was glued to the end of the items
                    nextLineStartIndex = match2.start()
                    nextLine = itemsText[nextLineStartIndex:]
                    itemsText = itemsText[:nextLineStartIndex]
                    line = line[:startOfItems+len(self.DOCUMENT_ELEMENT_ITEMS_START)+nextLineStartIndex+1]
                # END Temporary bug fix                  
                title = line[match.end():startOfItems].strip()
                if (len(title) == 0): title = None
                if (elementName == "NavigationList"):
                    self.onNavigationListStart(title)
                elif (elementName == "List"):
                    self.onListStart(title)             
                self.nestingLevel = self.nestingLevel+1
                items = itemsText.split(self.DOCUMENT_ELEMENT_ITEMS_SEPARATOR)
                for item in items:
                    item = item.strip()
                    if (len(item) > 0):
                        self.onInlineListItem(item)
                self.nestingLevel = self.nestingLevel-1
                if (elementName == "NavigationList"):
                    self.onNavigationListEnd()
                elif (elementName == "List"):
                    self.onListEnd()
                # START Temporary bug fix
                if nextLine is not None:
                    self.readcommand(nextLine)
                # END Temporary bug fix                
            else:
                raise Exception(f"File format error in file {self.textfile} on line {self.lineidx} : {line[:50]}")
        else:
            raise Exception(f"File format error in file {self.textfile} on line {self.lineidx} : {line[:50]}")
    
    def onDocumentStart(self,docId):
        self.appendrow("Document","Start",docId)
    
    def onDocumentTitle(self,title):
        self.appendrow("Document","Title",title)
            
    def onDocumentUri(self,uri):
        self.listDocUrls.append(uri)
        self.appendrow("Document","Uri",uri)
    
    def onDocumentEnd(self,docId):
        self.appendrow("Document","End",docId)
    
    def onTextBlock(self,text):
        self.appendrow("TextBlock","Text",text)
            
    def onSectionStart(self,title):
        self.appendrow("Section","Start",title)
        
    def onSectionEnd(self): 
        self.appendrow("Section","End")
        
    def onNavigationListStart(self,title):
        self.appendrow("NavigationList","Start",title)
        
    def onNavigationListEnd(self):
        self.appendrow("NavigationList","End")
        
    def onListStart(self,title):
        self.appendrow("List","Start",title)
        
    def onListEnd(self):
        self.appendrow("List","End")
        
    def onInlineListItem(self,item):
        self.appendrow("ListItem","Text",item)
            
    def onListItemStart(self):
        self.appendrow("ListItem","Start")
        
    def onListItemEnd(self):
        self.appendrow("ListItem","End")
        
    def onTableStart(self,title):
        self.appendrow("Table","Start",title)
    
    def onTableEnd(self):
        self.appendrow("Table","End")
        
    def onTableHeaderStart(self):
        self.appendrow("TableHeader","Start")
        
    def onTableHeaderEnd(self): 
        self.appendrow("TableHeader","End")
        
    def onTableCellStart(self):
        self.appendrow("TableCell","Start")
        
    def onTableCellEnd(self): 
        self.appendrow("TableCell","End")
            
    def appendrow(self,docEltType,docEltCmd,text=None):
        self.listDocId.append(self.documentCount)
        self.listType.append(docEltType)
        self.listCmd.append(docEltCmd)
        self.listLevel.append(self.nestingLevel)
        if text is not None:
            # Escaped "\n" sequences in the file stand for real line breaks
            text = text.replace("\\n","\n")
        self.listText.append(text)
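
The core of the parser is the regular expression that recognizes element lines. The sketch below isolates it in a standalone function so that it can be tried on a few lines. `parse_element_line` is a hypothetical helper, not a method of the class. The first sample line comes from the comment in `readcommand`, and the others are assumed examples in the same format.

```python
import re

# Same pattern as DOCUMENT_ELEMENT_LINE_REGEX in the class above
ELEMENT_LINE = re.compile(
    r"## (?P<NestingLevel>[0-9]+) (?P<ElementName>[A-Za-z]+) (?P<Command>Start|End|Items) ?")


def parse_element_line(line):
    """Split an element line into level, element, command, title and items."""
    match = ELEMENT_LINE.match(line)
    if match is None:
        return None
    parsed = {
        "level": int(match.group("NestingLevel")),
        "element": match.group("ElementName"),
        "command": match.group("Command"),
    }
    rest = line[match.end():]
    if parsed["command"] == "Items":
        # Inline lists: optional title, then ">>", then items separated by "||"
        title, _, items = rest.partition(">>")
        parsed["title"] = title.strip() or None
        parsed["items"] = [item.strip() for item in items.split("||") if item.strip()]
    else:
        parsed["title"] = rest.strip() or None
    return parsed


print(parse_element_line("## 5 List Items >> Découvrez nos solutions pour épargner || Nos conseils"))
print(parse_element_line("## 1 Section Start Cofidis"))
print(parse_element_line("## 4 ListItem End"))
```

Unlike the class, this sketch does not handle the case where the next command is glued to the end of the items.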

Test the document reader on one of the websites:

In [32]:
websitedir = websitesdirs[30]
docreader = NLPTextDocumentReader(rootdir / websitedir)
urlsdf, textdf = docreader.generate_dataframes(extractiondir)

In [33]:
len(urlsdf),len(textdf)

(88, 22150)

In [26]:
def load_website_dataframes(websitename, outputdir, i=0):
    urlsdffile, textdffile = get_website_dataframe_files(websitename, outputdir, i)  
    if textdffile.exists():
        urlsdf = pd.read_feather(urlsdffile, dtype_backend="pyarrow")
        textdf = pd.read_feather(textdffile, dtype_backend="pyarrow")
        return (urlsdf, textdf)
    else:
        return (None, None)

In [219]:
urlsdf, textdf = load_website_dataframes(websitedir, extractiondir)

In [220]:
urlsdf.head(), urlsdf.dtypes

(   DocId                                             DocUrl
 0      1                            https://www.cofidis.sk/
 1      2  https://www.cofidis.sk/dolezite-informacie-suv...
 2      3  https://www.cofidis.sk/poistenie-schopnosti-sp...
 3      4               https://www.cofidis.sk/uver-na-auto/
 4      5  https://www.cofidis.sk/autouver-vyhodnejsi-ako...,
 DocId     uint32[pyarrow]
 DocUrl    string[pyarrow]
 dtype: object)

In [221]:
textdf.head(), textdf.dtypes

(   DocId      DocEltType DocEltCmd  NestingLevel  \
 0      1        Document     Start             1   
 1      1        Document     Title             1   
 2      1        Document       Uri             1   
 3      1         Section     Start             1   
 4      1  NavigationList     Start             2   
 
                               Text  
 0                                1  
 1  Seriózne financovanie | Cofidis  
 2          https://www.cofidis.sk/  
 3                          Cofidis  
 4                             <NA>  ,
 DocId                                             uint32[pyarrow]
 DocEltType      dictionary<values=string, indices=int8, ordere...
 DocEltCmd       dictionary<values=string, indices=int8, ordere...
 NestingLevel                                       uint8[pyarrow]
 Text                                              string[pyarrow]
 dtype: object)

Use the function below to prepare DataFrames for all the extracted websites at once:

In [22]:
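def prepareDataFramesForWebsites(rootdir, websitesdirs, outputdir):
    """Load all individual text blocks extracted from the pages of each website into a dataframe, and save them efficiently on disk.

    Websites whose dataframes already exist in outputdir are loaded from disk instead of being parsed again.

    Parameters:
    rootdir - Path to the directory that contains the websites extraction directories
    websitesdirs - List of the websites extraction directories, relative to rootdir
    outputdir - Path to the directory where the dataframes are saved
    """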
    for websitedir in websitesdirs:
        print(f"Preparing dataframe for website dir {websitedir} ...")   
        try:
            urlsdf, textdf = load_website_dataframes(websitedir, outputdir)
            if urlsdf is None:
                reader = NLPTextDocumentReader(rootdir / websitedir)
                urlsdf, textdf = reader.generate_dataframes(outputdir)
            print(f"- {len(urlsdf)} documents")
            print(f"- {len(textdf)} document elements")
            print(f"- dataframe size in memory : {_format_size_mb(textdf.memory_usage().sum())} MB")
            urlsdffile, textdffile = get_website_dataframe_files(websitedir, outputdir)            
            print(f"- dataframe size on disk : {_format_size_mb(_file_size(textdffile))} MB")
        except Exception as e:
            print(f"ERROR: {str(e)}")            

In [414]:
prepareDataFramesForWebsites(rootdir, websitesdirs, extractiondir)

Preparing dataframe for website dir www.creditmutuel.fr ...
- 11857 documents
- 6969671 document elements
- dataframe size in memory : 194.3 MB
- dataframe size on disk : 51.5 MB
Preparing dataframe for website dir www.cic.fr ...
- 7838 documents
- 4606990 document elements
- dataframe size in memory : 172.2 MB
- dataframe size on disk : 58.3 MB
Preparing dataframe for website dir www.becm.fr ...
- 7647 documents
- 7461285 document elements
- dataframe size in memory : 138.4 MB
- dataframe size on disk : 17.9 MB
Preparing dataframe for website dir www.la-francaise.com ...
- 6330 documents
- 1084800 document elements
- dataframe size in memory : 62.9 MB
- dataframe size on disk : 18.7 MB
Preparing dataframe for website dir www.beobank.be ...
- 3710 documents
- 558463 document elements
- dataframe size in memory : 60.0 MB
- dataframe size on disk : 21.5 MB
Preparing dataframe for website dir www.cic.fr_banqueprivee_ ...
- 1991 documents
- 578643 document elements
- dataframe size in memo

If you encounter a parsing error while reading one of the text files, delete the corrupted file and rerun the function above.

The rerun is fast for websites that were already processed: their dataframes are loaded from disk instead of being generated again.
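
The reason a rerun is cheap is the load-or-generate pattern used by `prepareDataFramesForWebsites`: a website is parsed only when its saved file is missing. Below is a minimal standard-library sketch of that pattern. The names `load_or_generate` and `expensive_parse` are hypothetical, and a JSON file stands in for the feather files used in this notebook.

```python
import json
import tempfile
from pathlib import Path

def load_or_generate(cache_file, generate):
    """Return (data, from_cache): reuse the cached file if present, else compute and save."""
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8")), True
    data = generate()
    cache_file.write_text(json.dumps(data), encoding="utf-8")
    return data, False

calls = []

def expensive_parse():
    # Hypothetical stand-in for the slow parsing step of one website
    calls.append(1)
    return {"documents": 88, "elements": 22150}

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "website.json"
    first, first_cached = load_or_generate(cache, expensive_parse)
    second, second_cached = load_or_generate(cache, expensive_parse)
```

The second call returns the same data without invoking `expensive_parse` again, which is why deleting only the corrupted file and relaunching costs little.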

## 4. Create a dataset with the text gathered from all websites

In [9]:
datasetdir = rootdir / "_dataset"
datasetdir.mkdir(exist_ok = True)

### 4.1 Filter and aggregate all interesting text blocks in a single DataFrame

While filtering and aggregating all the interesting text blocks into a single DataFrame, we also generate the following summaries of the text data for later use:

1. Information about the character set used in the extracted dataset:

In [28]:
from collections import defaultdict
from unicodedata import name as unicodename

def charname(char):
    return unicodename(char,f"CHAR {ord(char)}")

def saveCharset(datasetdir, vocabdf):
    charcounts = defaultdict(lambda:0)
    for idx,row in vocabdf.iterrows():
        token = row["Word"]
        count = row["Count"]
        for char in token:
            charcode = ord(char)
            charcounts[charcode] = charcounts[charcode] + count
    charsetdf = pd.DataFrame({"Code" : [*charcounts.keys()], "Count" : [*charcounts.values()]})
    charsetdf.sort_values("Count", ascending=False, inplace=True)
    charsetdf.reset_index(inplace=True)
    charsetdf.drop('index', axis=1, inplace=True)
    charsetdf["Char"] = charsetdf["Code"].map(lambda x:chr(x))
    charsetdf["CharName"] = charsetdf["Char"].map(lambda c:charname(c))
    charsetdf["isAlpha"] = charsetdf["Char"].map(lambda x:x.isalpha())
    charsetdf["isDigit"] = charsetdf["Char"].map(lambda x:x.isdigit())
    charsetdf["isSpace"] = charsetdf["Char"].map(lambda x:x.isspace())
    charsetdf["Percent"] = 100*charsetdf["Count"].cumsum()/charsetdf["Count"].sum()
    charsetdf.to_feather(datasetdir / "charset.feather")
    charsetdf.to_csv(datasetdir / "charset.csv", sep=';', escapechar='\\')
    return charsetdf
               
def loadCharset(datasetdir):
    charsetfile = datasetdir / "charset.feather"
    return pd.read_feather(charsetfile, dtype_backend="pyarrow")
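
To make the counting rule in `saveCharset` concrete: each character of a word is counted once per occurrence of that word, and `Percent` is a cumulative share over characters sorted by decreasing count. The sketch below reproduces this with the standard library only. The function name `charset_from_vocabulary` and the toy vocabulary are illustrative assumptions, and it returns a list of dicts rather than a DataFrame.

```python
from collections import Counter
from unicodedata import name as unicodename

def charset_from_vocabulary(vocab):
    """Count characters, weighting each word's characters by the word's frequency."""
    counts = Counter()
    for word, count in vocab.items():
        for char in word:
            counts[char] += count
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for char, count in counts.most_common():
        cumulative += count
        rows.append({
            "Char": char,
            "Count": count,
            "CharName": unicodename(char, f"CHAR {ord(char)}"),
            "Percent": 100 * cumulative / total,  # cumulative share, as in saveCharset
        })
    return rows

# Toy vocabulary: word -> number of occurrences (invented values)
toy_vocab = {
    "de": 5,
    "\u00e9t\u00e9": 2,  # the French word "été"
}
rows = charset_from_vocabulary(toy_vocab)
```

Reading the `Percent` column from the top tells you how many distinct characters are needed to cover a given share of the text.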

2. Information about the vocabulary (distinct words) used in the extracted dataset:

In [29]:
def saveVocabulary(datasetdir, vocabdict):
    vocabdf = pd.DataFrame({"Word" : [*vocabdict.keys()], "Count" : [*vocabdict.values()]})	
    vocabdf.sort_values("Count", ascending=False, inplace=True)
    vocabdf.reset_index(inplace=True)    
    vocabdf.drop('index', axis=1, inplace=True)
    vocabdf["LefffTags"] = vocabdf["Word"].apply(lambda word: _getTokenTags(str(word),lefffTags))
    vocabdf["DicollecteTags"] = vocabdf["Word"].apply(lambda word: _getTokenTags(str(word),dicollecteTags))
    vocabdf["CommonTags"] = vocabdf.apply(lambda row: _mergeTokenTags(str(row["LefffTags"]),str(row["DicollecteTags"])),axis=1)
    vocabdf["Percent"] = 100*vocabdf["Count"].cumsum()/vocabdf["Count"].sum()
    vocabdf.to_feather(datasetdir / "vocabulary.feather")
    vocabdf.to_csv(datasetdir / "vocabulary.csv", sep=';', escapechar='\\')
    return vocabdf

def _getTokenTags(token,tags):
    annot = tags.get(token)
    if(annot is None):
        annot = tags.get(token.lower())
    if(annot is None):
        try:
            float(token.replace(",","."))
        except ValueError:
            return None
        return "Number"
    return annot

def _mergeTokenTags(annot1,annot2):
    if(annot1 == annot2):
        return annot1
    elif((annot1 != "None") and (annot2 == "None")):
        return annot1
    elif((annot1 == "None") and (annot2 != "None")):
        return annot2
    else:
        tags1 = set(annot1.split("|"))
        tags2 = set(annot2.split("|"))
        mergedtags = tags1 | tags2
        return "|".join(mergedtags)
    
def loadVocabulary(rootdir):
    vocabfile = rootdir / "vocabulary.feather"
    return pd.read_feather(vocabfile, dtype_backend="pyarrow")
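
The tag lookup and merge rules above can be tried on a toy lexicon. This is a simplified sketch, not the notebook's code: it uses `None` instead of the string `"None"` for a missing annotation, sorts the merged tags so the result is reproducible, and the lexicon contents are invented for illustration.

```python
def lookup_tags(token, tags):
    """Exact match, then lowercase match, then a numeric fallback (comma as decimal separator)."""
    annot = tags.get(token)
    if annot is None:
        annot = tags.get(token.lower())
    if annot is None:
        try:
            float(token.replace(",", "."))
        except ValueError:
            return None
        return "Number"
    return annot

def merge_tags(annot1, annot2):
    """Union of two '|'-separated tag strings; None means 'no annotation'."""
    if annot1 is None:
        return annot2
    if annot2 is None:
        return annot1
    merged = set(annot1.split("|")) | set(annot2.split("|"))
    return "|".join(sorted(merged))  # sorted for a reproducible order

# Toy lexicons; the tag values are invented for illustration
lexicon_a = {"banque": "NOUN", "ferme": "ADJ|NOUN"}
lexicon_b = {"ferme": "NOUN|VERB"}

common = merge_tags(lookup_tags("ferme", lexicon_a), lookup_tags("ferme", lexicon_b))
```

Here `common` holds the union of the tags found in both lexicons, while a token such as `"3,14"` that appears in neither is tagged `"Number"`.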

Combine the textblocks from all websites in a single dataframe, while adding metadata for each textblock:
- tokenize and count the number of words
- try to detect the language of the textblock 
- check if the textblock is new and unique in the scope of its website
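
The uniqueness check in the last bullet relies on hashing each text block and remembering the hashes already seen for the current website. A minimal sketch of that idea follows. The function name `flag_unique_blocks` and the sample blocks are assumptions, and it keeps one set per website in a dict, whereas the notebook resets a single set whenever the website changes.

```python
from hashlib import md5

def flag_unique_blocks(blocks):
    """blocks: iterable of (site_index, text).

    Returns one flag per block: True the first time a text is seen
    on its website, False for repeats on the same website.
    """
    seen_by_site = {}
    flags = []
    for site_index, text in blocks:
        seen = seen_by_site.setdefault(site_index, set())
        digest = md5(text.encode()).digest()
        if digest in seen:
            flags.append(False)
        else:
            seen.add(digest)
            flags.append(True)
    return flags

blocks = [
    (0, "Legal notice"),
    (0, "Contact"),
    (0, "Legal notice"),   # repeated on the same website
    (1, "Legal notice"),   # first occurrence on another website
]
flags = flag_unique_blocks(blocks)
```

Storing 16-byte digests instead of the texts themselves keeps the memory footprint small even for websites with millions of text blocks.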

In [26]:
class Stack:
    def __init__(self):
        self.stack = []

    def push(self, item):
        self.stack.append(item)

    def pop(self):
        if not self.is_empty():
            return self.stack.pop()
        else:
            raise Exception("Stack is empty")

    def peek(self):
        if not self.is_empty():
            return self.stack[-1]
        else:
            raise Exception("Stack is empty")

    def is_empty(self):
        return len(self.stack) == 0

    def size(self):
        return len(self.stack)

In [27]:
class DocEltStatus:
    def __init__(self, rowIndex, unique):
        self.rowIndex = rowIndex
        self.unique = unique

In [28]:
from collections import defaultdict
from hashlib import md5

def createDatasetFromWebsites(extractiondir, websitesdirs, outputdir):
    """Combine all textblocks from each website in a single dataframe, while adding metadata to enhance the dataset quality:
     - tokenize and count the number of words
     - try to detect the language of the textblock 
     - check if the textblock is new and unique in the scope of its website

    Create at the same time 3 additional dataframes:
    - a dictionary of all distinct words encountered in the dataset by decreasing frequency
    - a dictionary of all distinct characters encountered in the dataset by decreasing frequency
    - a table of the dataset statistics

    Parameters:
    extractiondir - Path to the directory where the websites were extracted
    websitesdirs - List of strings with the websites directories inside extractiondir
    outputdir - Path to the directory where the dataset will be created
    """
    urlsdfs = {}
    textdfs = {}
    print(f"Loading dataframes for all websites ...")
    for idx,websitedir in enumerate(websitesdirs):
        urlsdffile, textdffile = get_website_dataframe_files(websitedir, extractiondir)  
        urlsdf = pd.read_feather(urlsdffile)
        textdf = pd.read_feather(textdffile)
        print(f"- loaded {len(textdf)} document elements from {websitedir}")
        urlsdfs[idx] = urlsdf
        textdfs[idx] = textdf
    dataseturlsdf,datasettextdf = mergeDataframes(urlsdfs, textdfs)
    urlsdfs = None
    textdfs = None
    createDatasetFromDataframes(dataseturlsdf,datasettextdf, websitesdirs, outputdir)
        
def mergeDataframes(urlsdfs, textdfs):
    print("Merging all dataframes in one ...")
    dataseturlsdf=pd.concat(urlsdfs, names=["SiteIndex", "RowIndex"], axis=0, copy=False)    
    dataseturlsdf.reset_index(inplace=True)
    datasettextdf=pd.concat(textdfs, names=["SiteIndex", "RowIndex"], axis=0, copy=False)    
    datasettextdf.reset_index(inplace=True)
    print("- done")
    return dataseturlsdf,datasettextdf
    
def createDatasetFromDataframes(dataseturlsdf,datasettextdf, websitesdirs, outputdir):
    print(f"Analyzing {len(datasettextdf)} document elements ...")   
    # dataset stats
    datasetCharsCount = 0
    datasetWordsCount = 0
    datasetVocabDict = defaultdict(lambda:0)
    datasetLanguagesDict = defaultdict(lambda:0)
    listWordCounts = []    
    listLanguages = []
    listUniqueFlags = []
    # website stats
    websiteIndex = 0
    print(f"- tokenizing text from website {websitesdirs[websiteIndex]} ...")
    websiteHashes = set()
    websiteWordsCount = 0    
    websiteLanguagesDict = defaultdict(lambda:0)
    listWebsitesWordCounts = []
    listWebsitesLanguages = []
    # Document stats
    documentStartRowIndex = 0
    documentLanguagesDict = defaultdict(lambda:0)
    eltStack = Stack()
    eltStatus = None 
    rowIndex = 0
    for row in datasettextdf.itertuples():
        # reset website stats
        if(row.SiteIndex != websiteIndex):
            listWebsitesWordCounts.append(websiteWordsCount)
            listWebsitesLanguages.append(websiteLanguagesDict)
            print(f"- {websitesdirs[websiteIndex]} contributed {websiteWordsCount} words to the dataset")
            websiteIndex = row.SiteIndex
            websiteHashes = set()
            websiteWordsCount = 0    
            websiteLanguagesDict = defaultdict(lambda:0)            
            print(f"- tokenizing text from website {websitesdirs[websiteIndex]} ...")
        # analyze text row
        if (((row.DocEltType != "Document") or (row.DocEltCmd == "Title")) and (row.DocEltCmd != "End") and pd.notnull(row.Text)):
            # tokenizing
            text = row.Text
            doc = nlp(text)
            # words count            
            wordsCount = len(doc)
            listWordCounts.append(wordsCount)
            # language
            lang,prob = detect_lang(text.replace('\n',' '))
            language = lang
            if(prob < 0.6):
                language = "?"
            listLanguages.append(language)            
            # unique
            hval = md5(text.encode()).digest()
            if not (hval in websiteHashes):         
                websiteHashes.add(hval)
                listUniqueFlags.append(True)
                # dataset stats
                datasetCharsCount = datasetCharsCount + len(text)
                datasetWordsCount = datasetWordsCount + wordsCount
                for token in doc:
                    datasetVocabDict[token.text] = datasetVocabDict[token.text] + 1
                datasetLanguagesDict[language] = datasetLanguagesDict[language] + wordsCount
                # website stats
                websiteWordsCount = websiteWordsCount + wordsCount
                websiteLanguagesDict[language] = websiteLanguagesDict[language] + wordsCount
                # document stats
                documentLanguagesDict[language] = documentLanguagesDict[language] + wordsCount
            else:
                listUniqueFlags.append(False)
        # row without text
        else:
            listWordCounts.append(0)
            listLanguages.append(None)         
            listUniqueFlags.append(None)
         
        # Propagate the Lang property to Document Start line
        if row.DocEltType == "Document" and row.DocEltCmd == "Start":
            eltStack = Stack()
            eltStatus = None 
            documentStartRowIndex = rowIndex
            documentLanguagesDict = defaultdict(lambda:0)
        elif row.DocEltType == "Document" and row.DocEltCmd == "End":
            documentLang = "?"
            if len(documentLanguagesDict) > 0:
                # Rank the languages found in this document by their word counts in this document
                documentLang = sorted(documentLanguagesDict, key=documentLanguagesDict.get, reverse=True)[0]
            listLanguages[documentStartRowIndex] = documentLang
            # Safety check for 'unique' status stack
            if eltStatus.rowIndex != documentStartRowIndex:
                print(f"DOCUMENT FORMAT ERROR for SiteIndex {row.SiteIndex} and DocId {row.DocId}: the numbers of element Start and End commands don't match!")
        
        # Propagate the Unique property to the DocElt Start line
        if row.DocEltCmd == "Start":
            eltStatus = DocEltStatus(rowIndex, False)
            eltStack.push(eltStatus)
        elif row.DocEltCmd == "Text" or row.DocEltCmd == "Title":
            eltStatus.unique = eltStatus.unique or listUniqueFlags[rowIndex]
        elif row.DocEltCmd == "End":
            listUniqueFlags[eltStatus.rowIndex] = eltStatus.unique
            try:
                eltStack.pop()
            except Exception as e:
                print(f"DOCUMENT FORMAT ERROR for SiteIndex {row.SiteIndex} and DocId {row.DocId}: the numbers of element Start and End commands don't match!")
                raise e
            if not eltStack.is_empty():
                nestedEltStatus = eltStatus.unique
                eltStatus = eltStack.peek()
                eltStatus.unique = eltStatus.unique or nestedEltStatus
            
        rowIndex += 1
        
    print(f"- {websitesdirs[websiteIndex]} contributed {websiteWordsCount} words to the dataset")
    # website stats
    listWebsitesWordCounts.append(websiteWordsCount)
    listWebsitesLanguages.append(websiteLanguagesDict)
    # dataset stats
    print("Adding statistics to the dataset ...")
    datasettextdf["Words"] = pd.Series(data = listWordCounts, dtype="uint32[pyarrow]")
    datasettextdf["Lang"] = pd.Series(data = listLanguages, dtype="category")
    datasettextdf["Unique"] = pd.Series(data = listUniqueFlags, dtype="boolean[pyarrow]")
        
    print("Saving the complete dataset ...")
    print(f"- {datasetCharsCount} characters, {datasetWordsCount} words, {len(datasettextdf)} text blocks")
    print(f"- dataset size in memory : {_format_size_mb(datasettextdf.memory_usage().sum())} MB")
    datasettextfile = outputdir / "dataset.nlptextdocs.feather"
    datasettextdf.to_feather(datasettextfile)
    print(f"- dataset size on disk : {_format_size_mb(_file_size(datasettextfile))} MB")
    dataseturlsdf.to_feather(outputdir / "dataset.urls.feather")
    print("Saving the dataset charset and vocabulary ...")
    vocabdf = saveVocabulary(outputdir, datasetVocabDict)
    print(f"- vocabulary size: {len(vocabdf)} distinct words")
    charsetdf = saveCharset(outputdir, vocabdf)
    print(f"- charset size: {len(charsetdf)} distinct chars")
    print("Saving the stats by website ...")
    # websitesdf is not a parameter: it must already exist as a notebook-level variable
    websitesdf["Words"] = pd.Series(data = listWebsitesWordCounts, dtype="uint32[pyarrow]")
    sortedLangCodes = sorted(datasetLanguagesDict, key=datasetLanguagesDict.get, reverse=True)[:10]
    for langCode in sortedLangCodes:
        if datasetLanguagesDict[langCode] < 100: continue
        langCodeCounts = [langDict[langCode] for langDict in listWebsitesLanguages]
        websitesdf[langCode] = pd.Series(data = langCodeCounts, dtype="uint32[pyarrow]")
    websitesdf.to_csv(outputdir / "websites.csv", sep=";", index=False)    
    print("- done")
    
def loadDataset(rootdir):
    urlsfile = rootdir / "_dataset" / "dataset.urls.feather"
    datasetfile = rootdir / "_dataset" / "dataset.nlptextdocs.feather"
    return pd.read_feather(urlsfile), pd.read_feather(datasetfile)
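
The least obvious part of `createDatasetFromDataframes` is how the `Unique` flag is propagated: a container (document, section, list) is marked unique as soon as any text nested below it, at any depth, is unique. The sketch below isolates that stack logic on a simplified row format. The function name `propagate_unique` and the `(cmd, flag)` tuples are assumptions, and it ignores the text that `Start` rows can carry in the real dataset.

```python
def propagate_unique(rows):
    """rows: list of (cmd, flag) with cmd in {"Start", "Text", "End"}.

    Returns the flags with every Start row set to True if any text
    nested below it (at any depth) is unique, else False.
    """
    flags = [flag for _, flag in rows]
    stack = []  # each entry is [start_row_index, unique_so_far]
    for index, (cmd, flag) in enumerate(rows):
        if cmd == "Start":
            stack.append([index, False])
        elif cmd == "Text":
            stack[-1][1] = stack[-1][1] or bool(flag)
        elif cmd == "End":
            start_index, unique = stack.pop()
            flags[start_index] = unique
            if stack:
                # a unique child makes its parent unique too
                stack[-1][1] = stack[-1][1] or unique
    return flags

rows = [
    ("Start", None),   # 0: document
    ("Start", None),   # 1: section A
    ("Text", False),   # 2: text already seen on this website
    ("End", None),     # 3: end of section A
    ("Start", None),   # 4: section B
    ("Text", True),    # 5: new text
    ("End", None),     # 6: end of section B
    ("End", None),     # 7: end of document
]
flags = propagate_unique(rows)
```

Section A ends up flagged as not unique, so a consumer can skip it entirely, while section B and the enclosing document are kept.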

Let's use all these functions to create our dataset. Depending on the amount of data, this could take several hours.

In [41]:
createDatasetFromWebsites(extractiondir, websitesdirs, datasetdir)

Loading dataframes for all websites ...
- loaded 6969671 document elements from www.creditmutuel.fr
- loaded 4606990 document elements from www.cic.fr
- loaded 7461285 document elements from www.becm.fr
- loaded 1084800 document elements from www.la-francaise.com
- loaded 558463 document elements from www.beobank.be
- loaded 578643 document elements from www.cic.fr_banqueprivee_
- loaded 381961 document elements from www.creditmutuel-am.eu
- loaded 1070323 document elements from www.tomamosimpulso.com
- loaded 504415 document elements from www.creditmutuel.com
- loaded 315550 document elements from www.banquedeluxembourg.com
- loaded 145395 document elements from www.banquetransatlantique.com
- loaded 81403 document elements from www.creditmutuel-equity.eu
- loaded 90626 document elements from www.creditmutuel-im.eu
- loaded 225848 document elements from www.cofidis.hu
- loaded 164616 document elements from www.afedim.fr
- loaded 615422 document elements from www.bfcm.creditmutuel.fr
-

In [33]:
websitesdf = pd.read_csv(datasetdir / "websites.csv", sep=";")

In [42]:
websitesdf.head(10)

Unnamed: 0,Url,Scope,Directory,Requests,OK,NotFound,Forbidden,Extracted,(html),(pdf),...,fr,en,?,hu,es,de,nl,ca,it,sk
0,https://www.creditmutuel.fr/,subdomain,www.creditmutuel.fr,24047,23672,230,0,11857,10912,945,...,8146606,495643,1221965,13,79367,107732,14100,336,1738,26
1,https://www.cic.fr/,subdomain,www.cic.fr,17534,17127,176,0,7838,6807,1031,...,6221729,4149380,1445611,236,192637,172894,5351,164,9692,19
2,https://www.becm.fr/,subdomain,www.becm.fr,9029,8925,26,0,7647,7597,50,...,394984,42890,436427,0,115,22661,3,18,1332,5
3,https://www.la-francaise.com/,subdomain,www.la-francaise.com,14618,14607,3,0,6330,4600,1730,...,1848906,1272768,213783,38,249252,201369,91779,44,182358,0
4,https://www.beobank.be/,subdomain,www.beobank.be,2651,2631,17,0,3710,2916,794,...,1460884,3878339,380192,2,894,2161,932469,38,885,1
5,https://www.cic.fr/banqueprivee/,path,www.cic.fr_banqueprivee_,2434,2423,7,0,1991,1991,0,...,145032,8218,170270,0,27,85,0,6,8,0
6,https://www.creditmutuel-am.eu/,subdomain,www.creditmutuel-am.eu,3711,3539,171,0,1651,1284,367,...,1253891,347867,214360,4,124169,106997,133330,28,309,16
7,https://www.tomamosimpulso.com/,subdomain,www.tomamosimpulso.com,2534,2488,8,0,1536,1171,365,...,142545,9151,221945,16,1614730,455,18,562961,324,0
8,https://www.creditmutuel.com/,subdomain,www.creditmutuel.com,2681,2652,29,0,1481,1122,359,...,1974308,558885,364531,9,105826,95604,22,41,2034,5
9,https://www.banquedeluxembourg.com/,subdomain,www.banquedeluxembourg.com,2710,2538,145,0,1394,1394,0,...,487086,603275,160753,0,2983,238297,91949,58,276,0
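
The per-language columns at the right of this table (`fr`, `en`, `?`, ...) are built at the end of `createDatasetFromDataframes`: the ten most frequent languages of the whole dataset are kept, languages with fewer than 100 words are skipped, and `?` collects the text blocks whose language detection confidence was below 0.6. A standalone sketch of that column-building step, with a hypothetical function name and invented counts:

```python
from collections import defaultdict

def language_columns(dataset_totals, website_counts, top=10, min_words=100):
    """Return {lang: [words for each website]} for the most frequent languages."""
    ranked = sorted(dataset_totals, key=dataset_totals.get, reverse=True)[:top]
    columns = {}
    for lang in ranked:
        if dataset_totals[lang] < min_words:
            continue  # too rare to deserve its own column
        columns[lang] = [counts.get(lang, 0) for counts in website_counts]
    return columns

# One dict of word counts per website (invented values)
website_counts = [
    {"fr": 900, "en": 50, "?": 120},
    {"fr": 300, "en": 400, "sk": 20},
]
dataset_totals = defaultdict(int)
for counts in website_counts:
    for lang, words in counts.items():
        dataset_totals[lang] += words

columns = language_columns(dataset_totals, website_counts)
```

In this toy example `sk` is dropped because it totals only 20 words across the dataset.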


In [47]:
urlsdf, textdf = loadDataset(rootdir)

In [48]:
textdf[(textdf["SiteIndex"]==0) & (textdf["DocEltCmd"]=="Start")].head(50)

Unnamed: 0,SiteIndex,RowIndex,DocId,DocEltType,DocEltCmd,NestingLevel,Text,Words,Lang,Unique
0,0,0,1,Document,Start,1,1,0,fr,True
3,0,3,1,NavigationList,Start,1,,0,,True
12,0,12,1,Section,Start,1,Mentions légales,2,fr,False
14,0,14,1,Section,Start,1,Editeur,1,fr,True
16,0,16,1,List,Start,2,,0,,True
17,0,17,1,ListItem,Start,3,,0,,True
20,0,20,1,ListItem,Start,3,,0,,True
23,0,23,1,ListItem,Start,3,,0,,True
26,0,26,1,ListItem,Start,3,,0,,True
31,0,31,1,Section,Start,1,Hébergeur,1,fr,True


### 4.2 Create a Hugging Face dataset

In [1]:
beforeTitleMarker = "******"
afterTitleMarker = "======"
sectionMarker = "#"
listItemMarker = "- "
tableCellMarker = "| "
tableHeaderSeparatorMarker = "| ---"
lineSeparator = "\n"

def addTitle(row, document, documentLanguagesDict):
    document.append(beforeTitleMarker)
    document.append(lineSeparator)
    addText(row, document, documentLanguagesDict, addBlockSeparator=False, removeNewLines=True)
    document.append(afterTitleMarker)
    document.append(lineSeparator*2)

def startSection(row, document, documentLanguagesDict):
    document.append(sectionMarker*row.NestingLevel + " ")
    addText(row, document, documentLanguagesDict, addBlockSeparator=True)
    
def startListItem(row, document, nestingLevel):
    document.append(" "*((nestingLevel-1)*2) + listItemMarker)
    
def endNestedListItems(document):
    document.append(lineSeparator)

def startTableCellOrHeader(document):
    document.append(tableCellMarker)

def endTableLine(document):
    document.append(lineSeparator)
    
def writeTableHeaderSeparators(headersCount, document):
    document.append(tableHeaderSeparatorMarker * headersCount)
    document.append(lineSeparator)
    
def endTable(document):
    document.append(lineSeparator)
    
def addText(row, document, documentLanguagesDict, addLineSeparator=True, addBlockSeparator=True, removeNewLines=False):
    if not pd.isna(row.Text) and len(row.Text) > 0:
        text = row.Text
        if removeNewLines:
            text = text.replace('\n',' ').replace('\r',' ')
        document.append(text)
        documentLanguagesDict[row.Lang] = documentLanguagesDict[row.Lang] + row.Words
        if addLineSeparator:
            document.append(lineSeparator)
        if addBlockSeparator:
            document.append(lineSeparator)

In [2]:
def datasetDocumentsGenerator(nlptextdocdf, lang):
    currentContainer = Stack()
    document = []
    documentUri = ""
    documentLanguagesDict = defaultdict(lambda:0)
    listNestingLevel = 0
    elementToIgnoreNestingLevel = -1
    insidePdfDocument = False
    insideTableHeadersCount = 0
    insideTableCellsCount = 0
    for row in nlptextdocdf.itertuples():
        if elementToIgnoreNestingLevel >= 0:
            if row.DocEltCmd == "End" and ((elementToIgnoreNestingLevel == 0 and row.DocEltType == "Document") or row.NestingLevel == elementToIgnoreNestingLevel):                
                #print(f"{row.RowIndex} -> stop ignoring {elementToIgnoreNestingLevel}")
                elementToIgnoreNestingLevel = -1
            continue
        elif row.DocEltCmd == "Start" and row.Unique == False:
            if row.DocEltType == "Document":
                elementToIgnoreNestingLevel = 0
            else:
                elementToIgnoreNestingLevel = row.NestingLevel
            #print(f"{row.RowIndex} -> ignore {elementToIgnoreNestingLevel}")
            continue
        elif row.DocEltCmd == "Start" and row.DocEltType == "Document" and row.Lang != lang:
            elementToIgnoreNestingLevel = 0
        elif row.NestingLevel == 1 and not pd.isna(row.Unique) and row.Unique == False:
            continue
        # Documents
        if row.DocEltType == "Document" and row.DocEltCmd == "Start":
            document = []      
            documentUri = ""
            documentLanguagesDict = defaultdict(lambda:0)
        elif row.DocEltType == "Document" and row.DocEltCmd == "Uri":     
            currentContainer.push(row)
            documentUri = row.Text
            if documentUri.endswith(".pdf"):
                insidePdfDocument = True
            else:
                insidePdfDocument = False
        elif row.DocEltType == "Document" and row.DocEltCmd == "Title":     
            addTitle(row, document, documentLanguagesDict)
        elif row.DocEltType == "Document" and row.DocEltCmd == "End":    
            currentContainer.pop()
            if len(documentLanguagesDict)>0:
                documentLang = sorted(documentLanguagesDict, key=documentLanguagesDict.get, reverse=True)[0]
                documentWordsCount = sum(documentLanguagesDict.values())
            else:
                documentLang = "?"
                documentWordsCount = 0
            datasetDocument = { "SiteIndex" : row.SiteIndex, "DocId" : row.DocId, "Url" : documentUri, "Text" : ''.join(document), "Words" : documentWordsCount, "Lang" : documentLang }
            yield datasetDocument
        # Sections
        elif row.DocEltType == "Section" and row.DocEltCmd == "Start":  
            currentContainer.push(row)
            if not insidePdfDocument:
                startSection(row, document, documentLanguagesDict)
        elif row.DocEltType == "Section" and row.DocEltCmd == "End":  
            currentContainer.pop()
        # Lists
        elif row.DocEltType == "List" and row.DocEltCmd == "Start":
            currentContainer.push(row)
            listNestingLevel += 1
        elif row.DocEltType == "ListItem" and row.DocEltCmd == "Start":  
            startListItem(row, document, listNestingLevel)
        elif row.DocEltType == "ListItem" and row.DocEltCmd == "Text": 
            addText(row, document, documentLanguagesDict, addBlockSeparator=False)
        elif row.DocEltType == "List" and row.DocEltCmd == "End":
            currentContainer.pop()            
            listNestingLevel -= 1
            endNestedListItems(document)
        # NavigationLists => ignore
        elif row.DocEltType == "NavigationList" and row.DocEltCmd == "Start":
            elementToIgnoreNestingLevel = row.NestingLevel
            #print(f"{row.RowIndex} -> ignore {elementToIgnoreNestingLevel}")
        # Tables
        elif row.DocEltType == "Table" and row.DocEltCmd == "Start": 
            currentContainer.push(row)  
            insideTableHeadersCount = 0
            insideTableCellsCount = 0
        elif row.DocEltType == "TableHeader" and row.DocEltCmd == "Start":
            startTableCellOrHeader(document)
            insideTableHeadersCount += 1
        elif row.DocEltType == "TableCell" and row.DocEltCmd == "Start":  
            if insideTableHeadersCount > 0 and insideTableCellsCount == 0:
                # Close the header line before writing the separator line
                endTableLine(document)
                writeTableHeaderSeparators(insideTableHeadersCount, document)
            insideTableCellsCount += 1
            if insideTableCellsCount > insideTableHeadersCount:
                endTableLine(document)
                insideTableCellsCount = 1
            startTableCellOrHeader(document)
        elif row.DocEltType == "Table" and row.DocEltCmd == "End":         
            currentContainer.pop()
            endTable(document)
        # Text blocks
        elif row.DocEltType == "TextBlock" and row.DocEltCmd == "Text":
            container = currentContainer.peek()
            if container.DocEltType == "List" or container.DocEltType == "NavigationList":
                addText(row, document, documentLanguagesDict, addBlockSeparator=False)
            elif container.DocEltType == "Table":
                addText(row, document, documentLanguagesDict, addLineSeparator=False, addBlockSeparator=False, removeNewLines=True)
            elif row.NestingLevel > 1 or row.Words >= 5:
                addText(row, document, documentLanguagesDict)
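
The table branch of the generator receives cells as a flat stream and wraps them into rows by counting cells against the number of headers. The sketch below shows the layout this aims for, on plain lists rather than dataframe rows. The function name `render_table` and the sample values are illustrative assumptions.

```python
def render_table(headers, cells):
    """Flatten a table: a header line, a separator line, then rows of len(headers) cells."""
    parts = []
    for header in headers:
        parts += ["| ", header]
    if headers:
        parts.append("\n")
        parts.append("| ---" * len(headers))
        parts.append("\n")
    width = max(len(headers), 1)
    for position, cell in enumerate(cells):
        if position > 0 and position % width == 0:
            parts.append("\n")  # wrap to a new row once the previous one is full
        # newlines inside a cell would break the row layout
        parts += ["| ", cell.replace("\n", " ")]
    parts.append("\n")
    return "".join(parts)

table_text = render_table(["A", "B"], ["1", "2", "3", "4"])
```

Without headers the width falls back to one cell per line, which matches what the generator does when `insideTableHeadersCount` is zero.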

In [3]:
from IPython.display import Markdown

In [52]:
docgen = datasetDocumentsGenerator(nlptextdocdf=textdf, lang="fr")
docgen

<generator object datasetDocumentsGenerator at 0x7f809db099a0>

In [53]:
for i in range(24):
    next(docgen)

In [57]:
datasetDoc = next(docgen)
print(datasetDoc["DocId"])
print(datasetDoc["Url"])
Markdown(datasetDoc["Text"])

29
https://www.creditmutuel.fr/cmmabn/fr/vitrine/medias/docs/association-ce/lpce_0323.pdf


E
R
P A R T E N A I
É C O N O M I Q U E S
S O C I A U X
C O M I T É S
L A L E T T R E D U S E R V I C E P A R T E N A I R E C O M I T É S S O C I A U X E T É C O N O M I Q U E S D U C R É D I T M U T U E L

2023 :
L’ANNÉE DU RENOUVELLEMENT

La mise en place des CSE en 2018 et 2019 provoque, depuis
l’année dernière, un raz de marée d’élections professionnelles.
Pour les élus, ces élections sont synonymes de bilans, de
campagne électorale, de passation et enfin de prise de fonctions.

Les élections professionnelles permettent aux salariés d’exercer leur
droit de vote pour élire leurs représentants au sein du Comité social
et économique (CSE). Le CSE a pour rôle de représenter les intérêts
des salariés dans les entreprises et de débattre avec les employeurs
sur des sujets tels que les conditions de travail, les salaires et les
avantages sociaux.
Les élections sont organisées tous les 4 ans et sont obligatoires dès
11 salariés.

Quelles sont les informations à donner pour une bonne passation ?
En plus de mettre de l’ordre dans les archives, de récupérer et
ranger tout le matériel du CSE, les élus sortants vont compiler les
informations essentielles aux nouveaux élus.

Par exemple :
le bilan des actions menées,
les actions en cours : juridiques, consultations, expertises,
les activités et règles d’attribution,
les contrats existants,
les biens et stocks,
les logiciels utilisés et leurs codes d’accès.

La fin de mandat
Un CSE existe et fonctionne, quels que soient les élus qui le
composent. Une passation doit être réalisée entre les élus de
1
l’ancien et du nouveau mandat (CT R2315-39). Cette passation peut
prendre la forme d’un rapport de fin de mandat.

Ils fourniront également l’accès aux documents essentiels comme :
le règlement intérieur,
les procès-verbaux,
les rapports d’expertises,
les documents comptables,
le rapport de gestion du CSE.

Essentiel pour les nouveaux élus, il leur permet de prendre
connaissance de l’existant et de disposer de toutes les informations
pour démarrer la gestion du nouveau mandat. Pour les anciens élus,
c’est le moment de dresser leur bilan et de tout mettre en ordre avant
les élections.

Les ASC : la vitrine du CSE
Qu’on le veuille ou non, les salariés ne voient bien souvent qu’une
partie du travail des élus : les cadeaux, aides financières, bons
d’achat qui forment les activités sociales et culturelles (ASC). Mais
restez prudent : elles sont encadrées ! (CT L2312-78 et suivants)
Si votre CSE a décidé de se faire accompagner par un ou des
professionnels, alors en expliquer les raisons pour chacun et surtout
les bénéfices pour les salariés (nouvelles activités, etc…).

Ne pas rendre compte aux nouveaux élus pourrait être qualifié
d’entrave au bon fonctionnement de l’instance.

1 CT : Code du travail

PAGE 1 La fin de mandat
PAGE 2 La mise en place du CSE
PAGE 3 Les droits du CSE en matière d’expertise
PAGE 4 Actualité juridique et sociale

P A R T E N A I R E
C O M I T É S S O C I A U X E T É C O N O M I Q U E S

LA MISE EN PLACE DU CSE

De plus, il a des obligations vis-à-vis de l’Urssaf qui vérifie les règles
d’attributions des activités proposées et les conditions d’attributions
des activités. Un guide rédigé par l’Urssaf et téléchargeable sur leur
site permet aux élus de pouvoir se mettre en conformité.

Le CSE, ce n’est pas que les activités sociales, c’est aussi une mini
PME avec des responsabilités parfois jusqu’au pénal surtout pour
les postes exposés comme le secrétaire et le trésorier.

Le CSE a deux principales missions
Une mission économique : le comité est obligatoirement informé et
consulté sur les questions intéressant l’organisation, la gestion et la
marche générales de l’entreprise.
Une mission sociale : le comité est obligatoirement informé
et consulté sur les mesures de nature à affecter le volume ou la
structure des effectifs, la durée du travail, l’égalité professionnelle,
les conditions d’emploi, de travail et de formation professionnelle
du personnel.

Se poser les bonnes questions pour une mise en
place efficace

Quels sont le budget ou les ressources dont dispose le CSE ?
Identifier le montant prévisionnel des subventions accordées et
toute autre ressource
Établir un budget prévisionnel qui tient compte de ces ressources

Comment s’organiser et se répartir les tâches ?
Se former
Identifier les compétences de chacun
Identifier des axes d’actions prioritaires

Le CSE a des obligations comptables
Depuis la loi du 5 mars 2014 et les ordonnances de 2017, le CSE
respecte diverses obligations visant à la transparence de ses
comptes qui varient selon la taille de celui-ci.
Établir un rapport de gestion des activités contenant des
informations qualitatives et quantitatives sur le fonctionnement
et les activités du CSE,
Établir des comptes annuels avec des obligations en fonction
du seuil auquel il appartient défini selon le montant des
subventions :
- plus de 153 k€ subventions ASC* et AEP** : obligations
d’avoir un Expert-comptable pour attester les comptes,
- ressources supérieures à 3,1 M€ : obligation d’avoir un
commissaire aux comptes,
Établir un règlement intérieur précisant les règles d’arrêté des
comptes et d’établissement du rapport de gestion (CT L2315-
68 et L2315-69).

Comment organiser les activités ?
Établir un calendrier annuel des prestations dans le prévisionnel
Répartir les tâches entre les élus
Communiquer avec les salariés

Quelles sont les formalités prioritaires au démarrage du nouveau
mandat ?
Étudier le rapport de fin de mandat et la documentation initiale
remise par l’employeur (CT L2312-57)
Contacter la banque afin d’accéder aux comptes bancaires

Pour aller plus loin, consultez nos lettres Partenaire sur
l’arrêté des comptes des CSE et les missions du trésorier.

LA FORMATION DES ÉLUS DE CSE

La formation est définie comme le processus d’acquisition des
connaissances et des compétences requises pour un métier
spécifique.

Le nouvel élu endosse de nouvelles et réelles responsabilités
professionnelles nécessitant de se former et c’est pourquoi des
dispositions légales vous permettent de vous former. Entre moment
de cohésion et apport de compétences indispensables, se former
doit être la priorité de tout nouvel élu.

Les membres titulaires bénéficient d’un stage de formation
économique d’une durée maximale de 5 jours. Ces formations
sont renouvelées lorsque les représentants ont exercé leur mandat
pendant quatre ans, consécutifs ou non (CT L2315-63 et L2315-17).

l’entreprise (CT L2315-63). En cas de renouvellement de ce mandat,
la formation est d’une durée minimale de :
2
5 jours pour les membres d’une CSSCT dans les entreprises
d’au moins 300 salariés ;
3 jours pour les autres élus, quelle que soit la taille de l’entreprise.
Ces dispositions ne sont pas limitatives et toute formation pertinente
peut être engagée sur le budget de fonctionnement.

De plus, l’ensemble des membres bénéficie d’une formation en
santé, sécurité et conditions de travail dont la durée minimale est de
5 jours, lors du premier mandat, sans distinction selon l’effectif de

* Activités sociales et culturelles
**Attributions économiques et professionnelles

Commissions santé, sécurité et conditions de travail

QUELS SONT LES DROITS DU CSE EN MATIÈRE D’EXPERTISE ?

Les cas de recours à un expert
Le CSE peut avoir recours à un expert afin de l’aider à préparer ses
travaux ; on parle alors souvent « d’expertise légale ».
Ces expertises sont rémunérées en tout ou partie par l’employeur
dans les situations suivantes :

Expertises prises en charge à 100 % par l’employeur :
En vue de la consultation sur la situation économique et
financière de l’entreprise (expert-comptable ; CT L2315-88) ;
Dans le cadre de la consultation récurrente sur la politique
sociale de l’entreprise, les conditions de travail et l’emploi
mentionnée au 3° de l’article L. 2312-17 (expert-comptable ;
CT L2315-91) ;
En cas de licenciements collectifs pour motif économique
(expert-comptable ; CT L1233-34) ;
Lorsqu’un risque grave, identifié et actuel, révélé ou non par
un accident du travail, une maladie professionnelle ou à
caractère professionnel est constaté dans l’établissement
(expert agréé ; CT L2315-96 1°).

préparer les négociations prévues aux articles L2254-2 (accord de
compétitivité) et L1233-24-1 (accord en cas de licenciement collectif
avec PSE 3 ).

Expertises prises en charge à 80 % par l’employeur et à 20 %
par le CSE sur son budget de fonctionnement (principales
situations) :
En vue de l’examen des orientations stratégiques de
l’entreprise (CT L2315-87) ;
Lorsqu’une entreprise est partie à une opération de
concentration (expert-comptable ; CT L2312-41) ;
Lorsque l’entreprise est l’objet d’une offre publique
d’acquisition (expert-comptable ; CT L2312-42) ;
Lorsque le CSE déclenche un droit d’alerte économique
(expert-comptable ; CT L2312-64) ;
En cas de projet important modifiant les conditions de santé
et de sécurité ou les conditions de travail (expert agréé ; CT
L2315-96 2°).

L’article D3323-14 du Code du travail accorde au CSE la possibilité
de se faire assister de l’expert-comptable prévu à l’article L2325-35,
rémunéré en totalité par l’employeur, pour analyser le rapport relatif
à l’application de l’accord de participation.

La désignation de l’expert
La désignation doit être faite lors d’une réunion plénière. Soit la
question de la désignation est inscrite à l’ordre du jour, soit la
désignation a un lien suffisant avec un sujet de l’ordre du jour.

En général le CSE va délibérer d’abord sur le principe du recours à un
expert, puis sur le choix de cet expert.

Lorsque le budget du CSE n’est pas suffisant pour prendre en charge
les 20 % du coût de l’expertise, l’employeur prend intégralement en
charge le coût de l’expertise à la condition que sur les 3 dernières
années le CSE n’ait pas opéré un transfert de l’excédent annuel du
budget de fonctionnement (AEP) au budget des activités sociales
(ASC). Cette prise en charge de 100 % de l’expertise par l’employeur
interdira au CSE d’opérer ce transfert pendant les 3 années suivantes.

Parfois, le Code du travail impose que la désignation nominative ait
lieu à un moment précis. C’est le cas en matière de licenciement
collectif avec obligation d’établir un PSE. La désignation doit
intervenir lors de la première réunion. C’est aussi le cas en matière
d’opération de concentration : la désignation a lieu au cours de
la réunion qui doit être convoquée dans les 3 jours qui suivent la
publication du communiqué relatif à la notification du projet de
concentration, émanant soit de l’autorité administrative française,
soit de la Commission européenne. C’est également le cas en
matière d’offre publique d’acquisition : la désignation intervient lors
de la réunion du CSE de l’entreprise qui est l’objet de l’offre et qui doit
suivre le dépôt de l’offre.

Par ailleurs, le CSE peut faire appel à tout type d’expertise rémunérée
par ses soins pour la préparation de ses travaux dite « d’expertise
libre ».

Dans les entreprises d’au moins 300 salariés, le CSE peut décider
de recourir à un expert technique de son choix en vue de préparer la
« négociation » sur l’égalité professionnelle. Dès lors que l’absence
de tout indicateur relatif à l’égalité professionnelle tel que prévu
à l’article L2312-18 est constatée, l’employeur prend en charge la
totalité du coût de cette expertise.

À compter de la désignation de l’expert par le CSE, les membres du
comité établissent si besoin et notifient à l’employeur un cahier des
charges. L’expert notifie à l’employeur le coût prévisionnel, l’étendue
et la durée d’expertise, selon le délai fixé par décret en Conseil d’État.

Enfin, le CSE peut également mandater un expert-comptable afin
qu’il apporte toute analyse utile aux organisations syndicales pour

Pour aller plus loin, consultez notre lettre Partenaire sur
le rôle de l’expert-comptable dans le CSE.

Un dossier rédigé par Ecodia Marquant pour le Crédit Mutuel.

Plan de sauvegarde de l’emploi

POUR VOUS AIDER À SUIVRE CHAQUE TRIMESTRE L’ ACTUALITÉ JURIDIQUE ET
SOCIALE, NOUS AVONS RELEVÉ POUR VOUS LES INFORMATIONS SUIVANTES

Bons d’achat et cadeaux en nature
Le montant conditionnant l’octroi des bons d’achat et des
cadeaux en nature du CSE est fixé à 183 € pour 2023 (5 %
du PMSS).
Mesure exceptionnelle « Coupe du monde de rugby 2023
et Jeux olympiques 2024 » : le ministère de l’Économie,
dans un communiqué de presse du 11 janvier 2023 élargit
les possibilités d’attributions pour ces occasions et, en
conséquence, de faire bénéficier les salariés de l’exclusion
de l’assiette des cotisations et contributions sociales.
Les CSE pourront attribuer des places pour assister aux
épreuves sportives de 2023 et de 2024 sous la forme
de billets ou de bons d’achat et cadeaux en nature ainsi
que les prestations associées, transport, hébergement,
cadeaux divers jusqu’à 917 € en 2023, soit 5 fois plus
qu’habituellement.
CP – Ministère éco, finances JO 2024 et Coupe du monde de Rugby

Élection du CSE
On ne ferme pas à clé la salle de vote pour procéder au
dépouillement du scrutin.
Fermer à clé la salle de vote pour dépouiller porte atteinte
à la sincérité du scrutin même si une baie vitrée permet
d’assister au dépouillement des élections. Le lieu de
dépouillement des votes doit rester accessible jusqu’à
la proclamation des résultats de l’élection sous peine de
nullité. Dans cette affaire, la cour a retenu « qu’il n’était pas
possible pour les parties prenantes de circuler entre les
tables de dépouillement pour s’assurer de la sincérité du
scrutin ».
Cass. soc., 21 sept. 2022, n° 21-14.123

Bons d’achat
limite d’exonération
de cotisations sociales

Titres restaurant
limite d’exonération
de cotisations sociales

Primes de crèche, nourrice,
garde d’enfants
limite d’exonération
de cotisations sociales

Élections partielles
Elles n’échappent pas à la règle de représentation
proportionnée des femmes et des hommes.
Les listes de candidats présentés par une organisation
syndicale à l’occasion d’élections partielles du comité
social et économique (CSE) doivent respecter la proportion
de femmes et d’hommes du collège électoral.
Pour rappel, les listes qui comportent plusieurs candidats
à l’élection du CSE doivent être composées d’un nombre de
femmes et d’hommes correspondant à la part de femmes
et d’hommes inscrits sur la liste électorale (CT L2314-30).
Cass. soc., 9 nov. 2022, n°21-60.183

Plafond mensuel
de la sécurité sociale

Nouvelle définition du corps électoral
Les salariés assimilés à l’employeur sont inclus dans
les listes électorales (CT L2314-18) mais toujours non
éligibles.
Ceci fait suite à la décision n° 2021-947 QPC du 19
novembre 2021. Le Conseil constitutionnel s’est fondé sur
le principe de participation pour invalider le texte précédent
et le déclarer contraire à la Constitution.
Décision n° 2021-947 QPC du 19 novembre 2021

Valeur du point Agirc-Arrco
1,3498 €

Actualités fournies par Ecodia Marquant,
expert au côté des CSE.

G U I D E C S E

Urssaf, les cotisations applicables sur les

prestations fournies par le CSE.

Retrouvez toutes les informations et

les principes applicables en matière de

cotisations dans ce guide.

Retrouvez 3 fois par an un dossier d’informations pratiques sur le
fonctionnement, les rôles et missions du CSE... rédigé par un expert du sujet.
Un service exclusif pour les CSE !

Consultez toutes les lettres sur www.creditmutuel.com - Nos actions -
Associations et CSE

Nous sommes présents pour vos formations,

votre comptabilité, vos expertises et assurons

un service de conseil au quotidien.

La lettre du Service Partenaire Comités Sociaux et Économiques est éditée par la Confédération Nationale
du Crédit Mutuel - 46 rue du Bastion - 75017 PARIS - Tél. 01 53 48 88 03
• Directeur de la publication : Martine Gendre (martine.gendre@creditmutuel.fr)
• Rédactrice en chef : Laurence Arnaud (laurence.arnaud@creditmutuel.fr)
• Comité de rédaction : Chantal Béato, Nathalie Boudet-Tionck, Christel Clargé, Christophe Cornet,
Hervé Frioud-Chatrieux, Stéphanie Guimard, Yves Gourtay, Marie-Anne Lafaye, Benjamin Le Clec’h,
Sandrine Letertre Chardin, Delphine Spanhove, Carine Vanbecelaere, Amaury Vienne, Lucie You.
• Réalisation : Zest en plus - Tél. 01 60 45 94 07
• Imprimeur : Technicom Paris - 32 av. Pierre Grenier 92100 Boulogne-Billancourt
• ISSN : 1637 - 6110
•Dépôt légal : Mars 2023

Notre équipe pluridisciplinaire, à taille humaine,

est indépendante et dédiée aux CSE.

Faites décoller votre CSE avec Ecodia Marquant !



In [133]:
textdf[(textdf["SiteIndex"]==0) & (textdf["DocId"]==25)]

Unnamed: 0,SiteIndex,RowIndex,DocId,DocEltType,DocEltCmd,NestingLevel,Text,Words,Lang,Unique
3901,0,3901,25,Document,Start,1,25,0,?,False
3902,0,3902,25,Document,Title,1,Commander votre carte Crédit Mutuel,5,fr,False
3903,0,3903,25,Document,Uri,1,https://www.creditmutuel.fr/cmmabn/fr/souscrip...,0,,
3904,0,3904,25,NavigationList,Start,1,,0,,False
3905,0,3905,25,ListItem,Text,2,Contenu principal,2,?,False
...,...,...,...,...,...,...,...,...,...,...
4092,0,4092,25,NavigationList,Start,1,,0,,False
4093,0,4093,25,ListItem,Text,2,EN,1,de,False
4094,0,4094,25,NavigationList,End,1,,0,,
4095,0,4095,25,TextBlock,Text,1,Haut de page,3,?,False


In [58]:
from datasets import Dataset

# config_name: Optional[str] = None,
# hash: Optional[str] = None,
# base_path: Optional[str] = None,
# info: Optional[DatasetInfo] = None,
# features: Optional[Features] = None,
# use_auth_token: Optional[Union[bool, str]] = None,
# repo_id: Optional[str] = None,
# writer_batch_size: Optional[int] = None,

dataset = Dataset.from_generator(
    datasetDocumentsGenerator,
    gen_kwargs={"nlptextdocdf": textdf, "lang": "fr"},
    config_name="fr",
    version="1.0.0",
    description="Text extracted from all Crédit Mutuel Alliance Fédérale websites with nlptextdoc v3 in May 2023",
)

Downloading and preparing dataset generator/fr to /models/huggingface/datasets/generator/fr-8d8a97c72b3ae17a/1.0.0...


Generating train split: 0 examples [00:00, ? examples/s]

Dataset generator downloaded and prepared to /models/huggingface/datasets/generator/fr-8d8a97c72b3ae17a/1.0.0. Subsequent calls will reuse this data.


In [63]:
from huggingface_hub import login
login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [64]:
dataset.push_to_hub("frenchtext/creditmutuel-052023")

Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/35 [00:00<?, ?ba/s]

Upload 1 LFS files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading metadata:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

### 4.3 Test the Hugging Face dataset

In [66]:
from datasets import get_dataset_split_names

get_dataset_split_names("frenchtext/creditmutuel-052023")

['train']

In [68]:
from datasets import get_dataset_config_names

get_dataset_config_names("frenchtext/creditmutuel-052023")

['frenchtext--creditmutuel-052023']

In [71]:
from datasets import load_dataset

datasetdict = load_dataset("frenchtext/creditmutuel-052023")
datasetdict

Found cached dataset parquet (/models/huggingface/datasets/frenchtext___parquet/frenchtext--creditmutuel-052023-6d7000711d056f50/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


  0%|          | 0/1 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['SiteIndex', 'DocId', 'Url', 'Text', 'Words', 'Lang'],
        num_rows: 34364
    })
})

In [72]:
dataset = datasetdict["train"]
dataset

Dataset({
    features: ['SiteIndex', 'DocId', 'Url', 'Text', 'Words', 'Lang'],
    num_rows: 34364
})

In [75]:
# Print the first four examples only
for idx, example in enumerate(dataset):
    if idx > 3:
        break
    print(example)



### 4.4 Modernize the 2019 dataset

In [4]:
from pathlib import Path

oldrootdir = Path("nlptextdoc-dataset-2019_09")
oldrootdir

PosixPath('nlptextdoc-dataset-2019_09')

In [5]:
oldextractiondir = oldrootdir / "_extraction"
oldextractiondir

PosixPath('nlptextdoc-dataset-2019_09/_extraction')

In [6]:
olddatasetsdir = oldrootdir / "_datasets"
olddatasetsdir

PosixPath('nlptextdoc-dataset-2019_09/_datasets')

In [7]:
import pandas as pd

oldwebsitesdf = pd.read_csv(oldextractiondir / "websites.csv", sep=';')
oldwebsitesdf.head()

Unnamed: 0,Website,Url,Scope,ExtractionDir,ExtractionFile,Dataset
0,1,https://www.10meilleuresbanques.fr/,domain,10meilleuresbanques.fr,10meilleuresbanques,Comparateur
1,2,https://www.abcbourse.com/,domain,abcbourse.com,abcbourse,Bourse
2,3,https://acpr.banque-france.fr/,subdomain,acpr.banque-france.fr,acpr-banque-france,Institution
3,4,https://www.afer.fr/,domain,afer.fr,afer,Assurance
4,5,https://www.ag2rlamondiale.fr/,domain,ag2rlamondiale.fr,ag2rlamondiale,Assurance


In [17]:
oldtextdf = pd.read_feather(oldextractiondir / "afer.nlptextdocs.dataframe.feather")
oldtextdf.head()

Unnamed: 0,DocId,DocEltType,DocEltCmd,NestingLevel,Text,Lang,Words,Unique
0,1,Document,Start,1,1,,0,
1,1,Document,Title,1,Association Française d’épargne & de retraite ...,fr,9,True
2,1,Document,Uri,1,https://www.afer.fr/,,0,
3,1,NavigationList,Start,1,,,0,
4,1,ListItem,Text,2,Acces direct au contenu,fr,4,True


In [18]:
oldurlsdf = pd.read_feather(oldextractiondir / "afer.urls.dataframe.feather")
oldurlsdf.head()

Unnamed: 0,DocId,DocUrl,fr,?,en,de,Words,%fr,%de,%en,%?
0,1,https://www.afer.fr/,524.0,46.0,3.0,0.0,573.0,0.914485,0.0,0.005236,0.080279
1,2,https://www.afer.fr/afer/adhesion/,74.0,0.0,0.0,0.0,74.0,1.0,0.0,0.0,0.0
2,3,https://www.afer.fr/afer/adhesion/adherent-ass...,457.0,13.0,5.0,0.0,475.0,0.962105,0.0,0.010526,0.027368
3,4,https://www.afer.fr/afer/adhesion/adherer-assu...,519.0,0.0,0.0,0.0,519.0,1.0,0.0,0.0,0.0
4,5,https://www.afer.fr/afer/adhesion/parrainage-a...,345.0,10.0,0.0,0.0,355.0,0.971831,0.0,0.0,0.028169
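
The percentage columns in the table above appear to be each language's word count divided by the document's total `Words`. The notebook does not state this; it is an inference from the displayed values. A minimal check in plain Python, using the numbers from the first row:

```python
# Values copied from the first row of the table above (https://www.afer.fr/)
counts = {"fr": 524.0, "?": 46.0, "en": 3.0, "de": 0.0}
total_words = sum(counts.values())

# Assumption: "%<lang>" is that language's share of the document's words
shares = {lang: round(n / total_words, 6) for lang, n in counts.items()}
print(total_words, shares)
```

If the assumption holds, the rounded shares match the `%fr`, `%?`, `%en` and `%de` cells of that row.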


In [8]:
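def get_old_datasets():
    """Return the sorted, distinct dataset names listed in websites.csv."""
    oldwebsitesdf = pd.read_csv(oldextractiondir / "websites.csv", sep=';')
    return oldwebsitesdf["Dataset"].sort_values().unique()

def get_old_extraction_files(dataset):
    """Yield (url, urls path, urls dataframe, text path, text dataframe) for each
    website of the dataset; a dataframe is None when its feather file cannot be read."""
    oldwebsitesdf = pd.read_csv(oldextractiondir / "websites.csv", sep=';')
    datasetwebsites = oldwebsitesdf[oldwebsitesdf["Dataset"]==dataset].sort_values(by="ExtractionFile")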
    for idx,row in datasetwebsites.iterrows():
        url = row["Url"]
        file = row["ExtractionFile"]  
        urlsfilepath = oldextractiondir / (file + ".urls.dataframe.feather")
        textfilepath = oldextractiondir / (file + ".nlptextdocs.dataframe.feather")
        urlsdffile = None
        textdffile = None
        try:
            urlsdffile = pd.read_feather(urlsfilepath)        
        except Exception as e:
            print(f"{urlsfilepath} --> {e}")
        try:
            textdffile = pd.read_feather(textfilepath)
        except Exception as e:
            print(f"{textfilepath} --> {e}") 
        yield (url, urlsfilepath, urlsdffile, textfilepath, textdffile)

In [9]:
olddatasets = get_old_datasets()
olddatasets

array(['Assurance', 'Banque', 'Bourse', 'Comparateur', 'Crédit', 'Forum',
       'Institution', 'Presse1', 'Presse2', 'SiteInfo', 'Wikipedia'],
      dtype=object)

In [44]:
oldextractionfiles = get_old_extraction_files(olddatasets[0])

In [42]:
next(oldextractionfiles)

('https://www.afer.fr/',
 PosixPath('nlptextdoc-dataset-2019_09/_extraction/afer.urls.dataframe.feather'),
      DocId                                             DocUrl      fr     ?  \
 0        1                               https://www.afer.fr/   524.0  46.0   
 1        2                 https://www.afer.fr/afer/adhesion/    74.0   0.0   
 2        3  https://www.afer.fr/afer/adhesion/adherent-ass...   457.0  13.0   
 3        4  https://www.afer.fr/afer/adhesion/adherer-assu...   519.0   0.0   
 4        5  https://www.afer.fr/afer/adhesion/parrainage-a...   345.0  10.0   
 ..     ...                                                ...     ...   ...   
 151    152          https://www.afer.fr/support/afer-premium/  1101.0  24.0   
 152    153             https://www.afer.fr/support/afer-sfer/   429.0  33.0   
 153    154             https://www.afer.fr/support/fond-euro/   404.0  29.0   
 154    155  https://www.afer.fr/type-presse/communique-de-...     6.0   6.0   
 155    156  

In [43]:
testfilepath = oldextractiondir / "cbanque-forum.nlptextdocs.dataframe.feather"
oldtextdf = pd.read_feather(testfilepath, dtype_backend="pyarrow")
oldtextdf.head()

Unnamed: 0,DocId,DocEltType,DocEltCmd,NestingLevel,Text,Lang,Words,Unique
0,1,Document,Start,1,1,,0,
1,1,Document,Title,1,Forum banque et argent,fr,4,True
2,1,Document,Uri,1,https://www.cbanque.com/forums/,,0,
3,1,Section,Start,1,Rechercher,de,1,True
4,1,List,Start,2,,,0,


In [207]:
oldtextdf[(oldtextdf["DocEltType"]=="ListItem") & (oldtextdf["DocEltCmd"]=="Text") & (oldtextdf["Text"].str.contains("## "))]

Unnamed: 0,DocId,DocEltType,DocEltCmd,NestingLevel,Text,Lang,Words,Unique
2698436,10018,ListItem,Text,3,## 1 Section End <<Tout (1) J'aime J'aime (1)>>,fr,20,True
2698590,10019,ListItem,Text,3,## 1 Section End <<Tout (1) J'aime J'aime (1)>>,fr,20,False
2698733,10020,ListItem,Text,3,## 1 Section End <<Tout (1) J'aime J'aime (1)>>,fr,20,False
2698876,10021,ListItem,Text,3,## 1 Section End <<Tout (1) J'aime J'aime (1)>>,fr,20,False
2699019,10022,ListItem,Text,3,## 1 Section End <<Tout (1) J'aime J'aime (1)>>,fr,20,False
...,...,...,...,...,...,...,...,...
2933318,11578,ListItem,Text,3,## 1 Section End <<Tout (1) J'aime J'aime (1)>>,fr,20,False
2933461,11579,ListItem,Text,3,## 1 Section End <<Tout (1) J'aime J'aime (1)>>,fr,20,False
2933604,11580,ListItem,Text,3,## 1 Section End <<Tout (1) J'aime J'aime (1)>>,fr,20,False
2933747,11581,ListItem,Text,3,## 1 Section End <<Tout (1) J'aime J'aime (1)>>,fr,20,False


In [195]:
oldtextdf.iloc[2698434:2698442]

Unnamed: 0,DocId,DocEltType,DocEltCmd,NestingLevel,Text,Lang,Words,Unique
2698434,10018,Section,Start,1,Tout (1) J'aime J'aime (1),fr,11,True
2698435,10018,List,Start,2,,,0,
2698436,10018,ListItem,Text,3,## 1 Section End <<Tout (1) J'aime J'aime (1)>>,fr,20,True
2698437,10018,List,End,2,,,0,
2698438,10018,List,Start,1,,,0,
2698439,10018,ListItem,Start,2,,,0,
2698440,10018,TextBlock,Text,2,Forums,en,1,False
2698441,10018,ListItem,End,2,,,0,
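
The rows above show an element command line (`## 1 Section End ...`) stored as the text of a `ListItem`. As a rough sketch of how such a text can be split back into list item text and command, the snippet below applies a pattern equivalent to the element-line regex of `OldTextDataFrameBugFixer` to the text of row 2698436. The names `ELEMENT_LINE`, `item_text`, `command_line` and `parsed` are illustrative only.

```python
import re

# Equivalent to the pattern the notebook assembles from its marker and command constants
ELEMENT_LINE = re.compile(
    r"## (?P<NestingLevel>[0-9]+) (?P<ElementName>[A-Za-z]+) (?P<Command>Start|End|Items) ?"
)

# Text of row 2698436 as displayed above
text = "## 1 Section End <<Tout (1) J'aime J'aime (1)>>"
match = ELEMENT_LINE.search(text)

item_text = text[:match.start()]     # what should remain in the ListItem row
command_line = text[match.start():]  # what should become a row of its own
parsed = (
    int(match.group("NestingLevel")),
    match.group("ElementName"),
    match.group("Command"),
)
print(repr(item_text), parsed)
```

For this row the whole text is a command line, so nothing remains for the list item itself.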


In [53]:
import pandas as pd
import pyarrow as pa
import re

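class OldTextDataFrameBugFixer:
    """Repair rows where an element command line such as "## 1 Section End"
    was left inside the text of a ListItem instead of becoming its own row."""

    def __init__(self):
        # Filled by onDocumentUri; initialized here so that handler cannot fail
        self.listDocUrls = []
        self.DOCUMENT_ELEMENT_LINE_MARKER = "##"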
        self.DOCUMENT_ELEMENT_START = "Start"
        self.DOCUMENT_ELEMENT_END = "End"
        self.DOCUMENT_ELEMENT_ITEMS = "Items"
        self.DOCUMENT_ELEMENT_ITEMS_START = ">>"
        self.DOCUMENT_ELEMENT_ITEMS_SEPARATOR = "||"
        
        self.TEXT_DOCUMENT_PROPERTY_PREFIX = self.DOCUMENT_ELEMENT_LINE_MARKER + " NLPTextDocument "
        self.TEXT_DOCUMENT_TITLE = "Title"
        self.TEXT_DOCUMENT_URI = "Uri"
        
        self.DOCUMENT_ELEMENT_LINE_REGEX = re.compile(
            self.DOCUMENT_ELEMENT_LINE_MARKER + " "
            + "(?P<NestingLevel>[0-9]+)" + " "
            + "(?P<ElementName>[A-Za-z]+)" + " "
            + "(?P<Command>" + self.DOCUMENT_ELEMENT_START + "|" + self.DOCUMENT_ELEMENT_END + "|" + self.DOCUMENT_ELEMENT_ITEMS + ")" + " ?")
        
    def fix_dataframe(self, oldtextdf):
        # Rows to repair: list items whose text still contains a "## " command line
        rowstofix = oldtextdf[(oldtextdf["DocEltType"]=="ListItem") & (oldtextdf["DocEltCmd"]=="Text") & (oldtextdf["Text"].str.contains("## "))]
        bugscount = rowstofix["Text"].count()
        if bugscount > 0:
            print(f" -> fixing {bugscount} errors")
            self.oldtextdf = oldtextdf
            for idx,row in rowstofix.iterrows():
                self.fix_row(idx, row)
            # New rows carry fractional labels: sort them into place, then renumber
            oldtextdf.sort_index(inplace=True)
            oldtextdf.reset_index(inplace=True, drop=True)
        else:
            print(" -> nothing to fix")

    def fix_row(self, idx, row):
        itemsText = row["Text"]
        match2 = self.DOCUMENT_ELEMENT_LINE_REGEX.search(itemsText)
        if match2:
            # Split the text at the start of the embedded command line
            nextLineStartIndex = match2.start()
            nextLine = itemsText[nextLineStartIndex:]

            # Keep only the genuine list item text in the original row
            itemsText = itemsText[:nextLineStartIndex]
            self.oldtextdf.loc[idx, "Text"] = itemsText

            # idx + 1.5 sorts between rows idx+1 and idx+2 of the integer index
            self.docId = row["DocId"]
            self.insertIndex = idx + 1.5
            self.readcommand(nextLine)
    
    def readcommand(self,line):
        match = self.DOCUMENT_ELEMENT_LINE_REGEX.match(line)
        if(match): 
            self.nestingLevel = int(match.group("NestingLevel"))
            elementName = match.group("ElementName")
            command = match.group("Command")
            if (command == self.DOCUMENT_ELEMENT_START):
                title = line[match.end():].strip()
                if (len(title) == 0): title = None
                if(elementName == "Section"):
                    self.onSectionStart(title)
                elif(elementName == "NavigationList"):
                    self.onNavigationListStart(title)
                elif(elementName == "List"):
                    self.onListStart(title)
                elif(elementName == "ListItem"):
                    self.onListItemStart()
                elif(elementName == "Table"):
                    self.onTableStart(title)
                elif(elementName == "TableHeader"):
                    self.onTableHeaderStart()           
                elif(elementName == "TableCell"):
                    self.onTableCellStart()
            elif (command == self.DOCUMENT_ELEMENT_END):
                if(elementName == "Section"):
                    self.onSectionEnd()
                elif(elementName == "NavigationList"):
                    self.onNavigationListEnd()
                elif(elementName == "List"):
                    self.onListEnd()
                elif(elementName == "ListItem"):
                    self.onListItemEnd()
                elif(elementName == "Table"):
                    self.onTableEnd()
                elif(elementName == "TableHeader"):
                    self.onTableHeaderEnd()                 
                elif(elementName == "TableCell"):
                    self.onTableCellEnd()
            elif (command == self.DOCUMENT_ELEMENT_ITEMS):
                startOfItems = line.find(self.DOCUMENT_ELEMENT_ITEMS_START)
                itemsText = line[startOfItems+len(self.DOCUMENT_ELEMENT_ITEMS_START):].strip()
                
                title = line[match.end():startOfItems].strip()
                if (len(title) == 0): title = None
                if (elementName == "NavigationList"):
                    self.onNavigationListStart(title)
                elif (elementName == "List"):
                    self.onListStart(title)             
                self.nestingLevel = self.nestingLevel+1
                items = itemsText.split(self.DOCUMENT_ELEMENT_ITEMS_SEPARATOR)
                for item in items:
                    item = item.strip()
                    if (len(item) > 0):
                        self.onInlineListItem(item)
                self.nestingLevel = self.nestingLevel-1
                if (elementName == "NavigationList"):
                    self.onNavigationListEnd()
                elif (elementName == "List"):
                    self.onListEnd()        
            else:
                raise Exception(f"File format error on line: {line}")
        else:
            raise Exception(f"File format error on line: {line}")
    
    def onDocumentStart(self,docId):
        self.appendrow("Document","Start",docId)
    
    def onDocumentTitle(self,title):
        self.appendrow("Document","Title",title)
            
    def onDocumentUri(self,uri):
        self.listDocUrls.append(uri)
        self.appendrow("Document","Uri",uri)
    
    def onDocumentEnd(self,docId):
        self.appendrow("Document","End",docId)
    
    def onTextBlock(self,text):
        self.appendrow("TextBlock","Text",text)
            
    def onSectionStart(self,title):
        self.appendrow("Section","Start",title)
        
    def onSectionEnd(self): 
        self.appendrow("Section","End")
        
    def onNavigationListStart(self,title):
        self.appendrow("NavigationList","Start",title)
        
    def onNavigationListEnd(self):
        self.appendrow("NavigationList","End")
        
    def onListStart(self,title):
        self.appendrow("List","Start",title)
        
    def onListEnd(self):
        self.appendrow("List","End")
        
    def onInlineListItem(self,item):
        self.appendrow("ListItem","Text",item)
            
    def onListItemStart(self):
        self.appendrow("ListItem","Start")
        
    def onListItemEnd(self):
        self.appendrow("ListItem","End")
        
    def onTableStart(self,title):
        self.appendrow("Table","Start",title)
    
    def onTableEnd(self):
        self.appendrow("Table","End")
        
    def onTableHeaderStart(self):
        self.appendrow("TableHeader","Start")
        
    def onTableHeaderEnd(self): 
        self.appendrow("TableHeader","End")
        
    def onTableCellStart(self):
        self.appendrow("TableCell","Start")
        
    def onTableCellEnd(self): 
        self.appendrow("TableCell","End")
            
    def appendrow(self,docEltType,docEltCmd,text=None):
        if text is not None:
            text = text.replace("\\n","\n")
        # Column order: DocId, DocEltType, DocEltCmd, NestingLevel, Text, Lang, Words, Unique
        self.oldtextdf.loc[self.insertIndex] = (self.docId, docEltType, docEltCmd, self.nestingLevel, text, None, 0,  None)
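
The class above inserts a row in the middle of the dataframe by assigning to a fractional index label (`idx + 1.5`), then sorting and resetting the index. A minimal sketch of that idea on a toy dataframe follows; the four rows and the names `rows` and `fixed` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with a default integer index 0..3
rows = pd.DataFrame({
    "DocEltType": ["List", "ListItem", "List", "List"],
    "DocEltCmd": ["Start", "Text", "End", "Start"],
})

# Assigning to a label that does not exist yet appends a row with that label
idx = 1  # label of the ListItem row
rows.loc[idx + 1.5] = ["Section", "End"]

# Sorting by label moves the new row between labels 2 and 3; then renumber
fixed = rows.sort_index().reset_index(drop=True)
print(fixed)
```

One consequence of this design is that two rows written with the same fractional label overwrite each other, so the trick suits cases where a single row is inserted per repaired item.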

In [197]:
bugfixer = OldTextDataFrameBugFixer()
bugfixer.fix_dataframe(oldtextdf)

 -> fixing 949 errors


In [198]:
oldtextdf.iloc[2698434:2698442]

Unnamed: 0,DocId,DocEltType,DocEltCmd,NestingLevel,Text,Lang,Words,Unique
2698434,10018,Section,Start,1,Tout (1) J'aime J'aime (1),fr,11,True
2698435,10018,List,Start,2,,,0,
2698436,10018,ListItem,Text,3,,fr,20,True
2698437,10018,List,End,2,,,0,
2698438,10018,Section,End,1,,,0,
2698439,10018,List,Start,1,,,0,
2698440,10018,ListItem,Start,2,,,0,
2698441,10018,TextBlock,Text,2,Forums,en,1,False


In [217]:
for dataset in get_old_datasets():
    print(f"--- {dataset} ---")    
    for elements in get_old_extraction_files(dataset):
        print(elements[0])
        oldtextfilepath = elements[3]
        oldtextdf = elements[4]
        if oldtextdf is not None:
            linestofix = oldtextdf[(oldtextdf["DocEltType"]=="ListItem") & (oldtextdf["DocEltCmd"]=="Text") & (oldtextdf["Text"].str.contains("## "))]["Text"].count()
            if linestofix > 0:
                bugfixer = OldTextDataFrameBugFixer()
                bugfixer.fix_dataframe(oldtextdf)
                oldtextdf.to_feather(oldtextfilepath)                

--- Assurance ---
https://www.afer.fr/
https://www.ag2rlamondiale.fr/
https://www.agpm.fr/
https://www.allianz.fr/
https://www.amaguiz.com/
https://www.ameli.fr/
https://www.aviva.fr/
https://www.axa.fr/
https://www.cnp.fr/
https://www.direct-assurance.fr/
https://www.eurofil.com/
https://www.gan.fr/
http://www.generali.fr/
https://www.groupama.fr/
https://www.lolivier.fr/
https://www.mae.fr/
https://www.maif.fr/
http://www.malakoffmederic.com/
https://www.matmut.fr/
https://www.mma.fr/
https://www.probtp.com/
--- Banque ---
https://www.allianzbanque.fr/a/z/b/jm_6268/fr/navigations
https://www.arkea.com/
https://www.banque-edel.fr/
https://www.banquepopulaire.fr/
https://www.bforbank.com/
https://www.boursorama-banque.com/
https://www.bred.fr/
https://www.ca-alsace-vosges.fr/
https://www.caisse-epargne.fr/
https://www.cic.fr/
 -> fixing 1 errors
https://compte-nickel.fr/
https://www.credit-agricole.fr/
https://www.credit-cooperatif.coop/
https://www.credit-du-nord.fr/
 -> fixing 5 erro

In [None]:
import gc

def createDatasetFromOldWebsites(dataset, oldextractiondir, olddatasetsdir):
    print(f"--- DATASET: {dataset} ---")
    websites = []
    urlsdfs = {}
    textdfs = {}
    print("Loading dataframes for all websites ...")
    for idx,elements in enumerate(get_old_extraction_files(dataset)):
        website = elements[0]
        websites.append(website)
        if idx<0:
            continue
        urlsdf = elements[2][["DocId","DocUrl"]]
        textdf = elements[4][["DocId","DocEltType","DocEltCmd","NestingLevel","Text"]]
        if dataset=="Assurance":
            if idx==8:
                textdf = textdf[textdf["DocId"]!=220]
            elif idx==9:
                textdf = textdf[(textdf["DocId"]!=211) & (textdf["DocId"]!=230)]
            elif idx==17:
                textdf = textdf[textdf["DocId"]!=120]
            elif idx==18:
                textdf = textdf[(textdf["DocId"]!=1262) & (textdf["DocId"]!=1500) & (textdf["DocId"]!=1650)]
            elif idx==20:
                textdf = textdf[(textdf["DocId"]!=815) & (textdf["DocId"]!=1166) & (textdf["DocId"]!=1244)]
        elif dataset=="Banque":
            if idx==3:
                textdf = textdf[textdf["DocId"]!=5829]
            elif idx==6:
                textdf = textdf[textdf["DocId"]!=1160]
            elif idx==20:
                textdf = textdf[(textdf["DocId"]<347) | (textdf["DocId"]>400)]  # keep only DocIds outside 347..400
            elif idx==23:
                textdf = textdf[(textdf["DocId"]<189) | (textdf["DocId"]>191)]  # keep only DocIds outside 189..191
        elif dataset=="Bourse":
            if idx==9:
                textdf = textdf[(textdf["DocId"]<25469) | (textdf["DocId"]>25470)]  # keep only DocIds outside 25469..25470
        elif dataset=="Comparateur":
            if idx==10:
                textdf = textdf[(textdf["DocId"]!=1097) & (textdf["DocId"]!=1207)]
            elif idx==12:
                textdf = textdf[textdf["DocId"]!=2124]
        elif dataset=="Crédit":
            if idx==4:
                textdf = textdf[textdf["DocId"]!=1068]
        elif dataset=="Presse1":
            if idx==7:
                textdf = textdf[textdf["DocId"]!=60025]
        elif dataset=="SiteInfo":
            if idx==8:
                textdf = textdf[textdf["DocId"]!=19749]
        print(f"- loaded {len(textdf)} document elements from {website}")
        urlsdfs[idx] = urlsdf
        textdfs[idx] = textdf
    outputdir = olddatasetsdir / dataset
    outputdir.mkdir(exist_ok=True)
    dataseturlsdf,datasettextdf = mergeDataframes(urlsdfs, textdfs)
    urlsdfs = None
    textdfs = None
    gc.collect()
    createDatasetFromDataframes(dataseturlsdf, datasettextdf, websites, outputdir)

In [None]:
createDatasetFromOldWebsites(olddatasets[10], oldextractiondir, olddatasetsdir)

In [10]:
class Stack:
    def __init__(self):
        self.stack = []

    def push(self, item):
        self.stack.append(item)

    def pop(self):
        if not self.is_empty():
            return self.stack.pop()
        else:
            raise Exception("Stack is empty")

    def peek(self):
        if not self.is_empty():
            return self.stack[-1]
        else:
            raise Exception("Stack is empty")

    def is_empty(self):
        return len(self.stack) == 0

    def size(self):
        return len(self.stack)

In [11]:
from huggingface_hub import login
login()

In [12]:
datasetname = olddatasets[10]
datasetname

'Wikipedia'

In [15]:
olddatasetdf = pd.read_feather(olddatasetsdir / datasetname / "dataset.nlptextdocs.feather")
olddatasetdf

Unnamed: 0,SiteIndex,RowIndex,DocId,DocEltType,DocEltCmd,NestingLevel,Text,Words,Lang,Unique
0,0,0,1,Document,Start,1,1,0,fr,True
1,0,1,1,Document,Uri,1,007_Legends.txt,0,,
2,0,2,1,Document,Title,1,007 Legends,2,?,True
3,0,3,1,TextBlock,Text,1,007 Legends est un jeu vidéo de tir à la premi...,41,fr,True
4,0,4,1,TextBlock,Text,1,"James Bond s'est fait tirer dessus. Peu à peu,...",78,fr,True
...,...,...,...,...,...,...,...,...,...,...
41517122,6,30717730,55453,NavigationList,Start,1,,0,,False
41517123,6,30717731,55453,ListItem,Text,2,Wikimedia Foundation,2,?,False
41517124,6,30717732,55453,ListItem,Text,2,Powered by MediaWiki,3,en,False
41517125,6,30717733,55453,NavigationList,End,1,,0,,


In [13]:
from collections import defaultdict

In [19]:
olddocgen = datasetDocumentsGenerator(nlptextdocdf=olddatasetdf, lang="fr")

In [198]:
for i in range(1000):
    next(olddocgen)

In [None]:
datasetDoc = next(olddocgen)
print(datasetDoc["DocId"])
print(datasetDoc["Url"])
Markdown(datasetDoc["Text"])

In [16]:
from datasets import Dataset

dataset = Dataset.from_generator(datasetDocumentsGenerator, gen_kwargs={"nlptextdocdf":olddatasetdf, "lang":"fr"}, config_name="fr", version="1.0.0", description="Text extracted from french wikipedia with nlptextdoc in september 2019")

Downloading and preparing dataset generator/fr to /models/huggingface/datasets/generator/fr-913abdfddddc8f60/1.0.0...


Generating train split: 0 examples [00:00, ? examples/s]

Dataset generator downloaded and prepared to /models/huggingface/datasets/generator/fr-913abdfddddc8f60/1.0.0. Subsequent calls will reuse this data.


In [17]:
dataset.push_to_hub("frenchtext/nlptextdoc-2019_09-wikipedia", private=True)

Pushing dataset shards to the dataset hub:   0%|          | 0/8 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/72 [00:00<?, ?ba/s]

Upload 1 LFS files:   0%|          | 0/1 [00:00<?, ?it/s]

--- TEST datasets ---

In [38]:
datasetname = olddatasets[10]
datasetname

'Wikipedia'

In [39]:
from datasets import load_dataset

olddatasetdict = load_dataset("frenchtext/nlptextdoc-2019_09-wikipedia")
olddataset = olddatasetdict["train"]

Downloading readme:   0%|          | 0.00/549 [00:00<?, ?B/s]

Downloading and preparing dataset None/None to /models/huggingface/datasets/frenchtext___parquet/frenchtext--nlptextdoc-2019_09-wikipedia-562b378b41da449a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/259M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/249M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/276M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/243M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/254M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/268M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/254M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/470M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/568393 [00:00<?, ? examples/s]

Dataset parquet downloaded and prepared to /models/huggingface/datasets/frenchtext___parquet/frenchtext--nlptextdoc-2019_09-wikipedia-562b378b41da449a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [158]:
from itertools import islice

# Print the examples at positions 198 to 200 (inclusive)
for example in islice(olddataset, 198, 201):
    print(example)



## 5. Study vocabulary and tokenizer performance

Notes from https://aclanthology.org/2020.findings-emnlp.352.pdf:

Our recommendation for Transformer NMT is to use the largest possible BPE vocabulary such that at least 95% of classes have 100 or more examples in training.

Frequency at 95th% Class Rank (F95%) is defined as the least frequency in the 95th percentile of most frequent classes.

More generally, FP% is a simple way of quantifying the minimum number of training examples for at least the Pth percentile of classes.
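
As a rough illustration of the F95% statistic described above, the sketch below computes the frequency at a given class-rank percentile from a plain list of class (token) frequencies. The function name `frequency_at_class_rank`, the rounding of the rank with `ceil`, and the toy frequencies are assumptions made for this example; they are not taken from the paper or from nlptextdoc.

```python
import math


def frequency_at_class_rank(frequencies, percentile=95):
    """Lowest frequency among the top `percentile`% most frequent classes."""
    if not frequencies:
        raise ValueError("frequencies must not be empty")
    # Most frequent classes first
    ranked = sorted(frequencies, reverse=True)
    # 1-based rank of the last class inside the top `percentile`% (assumed rounding: ceil)
    rank = max(math.ceil(len(ranked) * percentile / 100), 1)
    return ranked[rank - 1]


# Toy vocabulary with 20 classes (made-up counts)
toy_frequencies = [900, 400, 250, 180, 150, 130, 120, 110, 105, 101,
                   100, 100, 100, 100, 100, 100, 100, 100, 100, 3]

f95 = frequency_at_class_rank(toy_frequencies)
print(f95, f95 >= 100)
```

With this toy vocabulary, the recommendation "at least 95% of classes have 100 or more examples" holds when the returned value is at least 100.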

Notes from https://arxiv.org/pdf/2204.08832.pdf:

The ratio of the number of vocabulary parameters to the total number of model parameters can be empirically chosen as 20% for de facto tokenizers.

WordPiece statistically significantly outperforms other tokenizers in most of the tasks in our experiments.
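
To make the 20% rule of thumb noted above concrete, here is a minimal sketch of the arithmetic. It assumes that the vocabulary parameters are the input embedding matrix only (vocabulary size times embedding dimension), which is a simplification. The function names and the example numbers are hypothetical.

```python
def vocabulary_parameter_ratio(vocab_size, embedding_dim, total_params):
    """Share of the model parameters spent on the embedding matrix."""
    return vocab_size * embedding_dim / total_params


def vocab_size_for_ratio(total_params, embedding_dim, ratio=0.20):
    """Approximate vocabulary size whose embedding matrix uses `ratio` of the parameters."""
    return int(total_params * ratio / embedding_dim)


# Hypothetical model: 100M parameters in total, 768-dimensional embeddings
print(vocabulary_parameter_ratio(32_000, 768, 100_000_000))
print(vocab_size_for_ratio(100_000_000, 768))
```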



### 5.1 Unknown words and proper nouns

In [329]:
def listUnknownWordsAndProperNouns(vocabdf):
    uwords = vocabdf[(vocabdf["CommonTags"] == "None") | (vocabdf["CommonTags"].str.contains("PROPN"))].copy()
    uwords["Length"] = uwords["Word"].apply(lambda w: len(w))
    uwords["isSpace"] = uwords["Word"].apply(lambda w: w.isspace())
    uwords["CharName"] = uwords["Word"].apply(lambda w: charname(w) if len(w)==1 else "")
    uwords.to_csv(rootdir / "specificwords.csv",sep=";")
    return uwords

In [330]:
vocabdf = loadVocabulary(rootdir)
specificwords = listUnknownWordsAndProperNouns(vocabdf)
specificwords.head(30)


Unnamed: 0,Word,Count,LefffTags,DicollecteTags,CommonTags,Percent,Length,isSpace,CharName
13,,52198,,,,26.22981,1,True,NO-BREAK SPACE
51,€,11834,,,,43.56146,1,False,EURO SIGN
69,…,7794,,,,46.390409,1,False,HORIZONTAL ELLIPSIS
70,!,7773,,,,46.517784,1,False,EXCLAMATION MARK
81,France,6961,PROPN,NOUN,NOUN|PROPN,47.844507,6,False,
155,–,3940,,,,54.040798,1,False,EN DASH
173,,3675,,,,55.15367,1,True,SPACE
178,Paris,3578,PROPN,PROPN,PROPN,55.451192,5,False,
239,\n,2580,,,,58.419299,1,True,CHAR 10
251,Epargne,2443,,,,58.9114,7,False,


In [338]:
import re

def getContextAroundWord(text,word,ctxsize=20):
    # Window of ctxsize chars before and after the start of the first occurrence of word
    idx = text.index(word)
    start = max(idx-ctxsize,0)
    end = min(idx+ctxsize,len(text))
    return text[start:end+1]

def sampleTextBlocksWithChar(textdf,char,count=100,ctxsize=20):
    textsWithWord = textdf[textdf["Text"].str.contains(char,regex=False)]
    # Never ask for more rows than available, and copy before adding a column
    textsWithWord = textsWithWord.sample(min(count,len(textsWithWord))).copy()
    textsWithWord["Context"] = textsWithWord["Text"].apply(lambda t: getContextAroundWord(t,char,ctxsize))
    return textsWithWord

def sampleTextBlocksWithWord(textdf,word,count=100,ctxsize=20):
    # Escape the word so that regex metacharacters are matched literally
    textsWithWord = textdf[textdf["Text"].str.contains("\\b"+re.escape(word)+"\\b")]
    textsWithWord = textsWithWord.sample(min(count,len(textsWithWord))).copy()
    textsWithWord["Context"] = textsWithWord["Text"].apply(lambda t: getContextAroundWord(t,word,ctxsize))
    return textsWithWord

In [253]:
dataset = loadDataset(rootdir)


In [259]:
textsWithWord = sampleTextBlocksWithWord(dataset,"Mutuel")
textsWithWord.head(20)

Unnamed: 0,SiteIndex,RowIndex,Text,WordsCount,Context
78777,51,23427,Crédit Mutuel Arkéa - Nos métiers - ABEI - Pag...,12,Crédit Mutuel Arkéa - Nos mé
205845,115,6655,Livret Bienvenue de Crédit Mutuel,5,Bienvenue de Crédit Mutuel
133173,79,5677,Mettre de l'argent de côté permet de faire fac...,47,"ng terme. Au Crédit Mutuel, plusieurs sol"
77897,51,4291,"Jean-Pierre Denis, Président du Crédit Mutuel ...",98,Président du Crédit Mutuel Arkéa et du Cr
132852,79,2811,1 er réseau bancaire né de la volonté des Prof...,51,"ur elles, le Crédit Mutuel des Profession"
42475,26,3033,"En 1984, suite à une demande déposée par les a...",47,ricole et le Crédit Mutuel le Groupement
58770,38,7560,2 nouveaux contrats complètent la gamme ! mes-...,51,(filiale du Crédit Mutuel Arkéa) mes-pla
130608,77,9913,Frais bancaires Crédit Mutuel Normandie,5,is bancaires Crédit Mutuel Normandie
78064,51,7639,Crédit Mutuel Arkéa - Présentation de la direc...,14,Crédit Mutuel Arkéa - Présen
78593,51,17742,Assurances - Crédit Mutuel Arkéa 1,6,Assurances - Crédit Mutuel Arkéa 1


### 5.2 Most common nouns

In [262]:
def listMostCommonNouns(vocabdf,count):
    cwords = vocabdf[vocabdf["CommonTags"].str.contains("NOUN")]
    cwords = cwords[:count].copy()
    cwords.to_csv(rootdir / "commonwords.csv",sep=";")
    return cwords

In [263]:
nouns = listMostCommonNouns(vocabdf,5000)
nouns.head(20)

Unnamed: 0,index,Word,Count,LefffTags,DicollecteTags,CommonTags,Percent
4,2,la,109538,PRON|NOUN|DET,DET|PRON|NOUN,NOUN|DET|PRON,15.034524
10,183,un,66670,DET|NOUN|PRON|NUM,DET|NOUN,NOUN|DET|PRON|NUM,23.454135
15,42,pour,49910,NOUN|ADP,ADP,NOUN|ADP,27.900134
17,40,une,48932,DET|PRON|NOUN|NUM,DET|NOUN,NOUN|DET|PRON|NUM,29.518757
18,64,est,45859,ADJ|NOUN|AUX|VERB,NOUN|AUX,NOUN|ADJ|AUX|VERB,30.270247
27,61,par,35796,NOUN|ADP,ADP,NOUN|ADP,36.230518
30,256,dans,28412,ADP,ADP|NOUN,NOUN|ADP,37.749787
32,91,plus,21789,VERB|ADV|CCONJ|NOUN,ADV|NOUN|VERB,NOUN|ADV|CCONJ|VERB,38.51047
37,2681,assurance,18147,NOUN,NOUN,NOUN,40.125341
41,176,%,16426,NOUN,,NOUN,41.250995


In [265]:
textsWithWord = sampleTextBlocksWithWord(dataset,"compte")
textsWithWord.head(20)

Unnamed: 0,SiteIndex,RowIndex,Text,WordsCount,Context
93995,60,9623,La gamme d’épargne de Boursorama Banque n’est ...,50,"t A, LDD, PEL, CEL, compte sur livret et"
35560,22,15286,"Si vous le souhaitez, l ' agence bancaire qui ...",68,e de vous ouvrir un compte vous fait remp
43995,26,6892,Une autre menace dont il faut tenir compte con...,51,dont il faut tenir compte concerne cette
85478,54,9201,L’heure des résultats et de la délivrance est ...,101,emière ouverture de compte incluant une c
108570,67,4777,"La fidélité compte, nous la reconnaissons.",8,"La fidélité compte, nous la recon"
144570,88,69272,Comparatif des rendements des OPCI. Ces placem...,101,"u en direct, via un compte-titres, ou via"
130391,77,9111,Si la banque du Crédit Mutuel ( où le compte d...,104,édit Mutuel ( où le compte de cette dame
112199,70,12258,Pour ouvrir un compte bancaire en ligne chez M...,90,Pour ouvrir un compte bancaire en li
149630,91,24,« Quelles sont les conditions pour ouvrir un c...,14,ions pour ouvrir un compte bancaire ? »
111879,70,8552,« Tout va bien quand il n'y a pas de problème ...,24,de problème sur le compte. J'ai un compt


### 5.3 Separator chars and tokenizer rules

Candidate separator chars:

In [336]:
def listSeparatorChars(charsetdf):
    sepchars = charsetdf[(charsetdf["isAlpha"] == False) & (charsetdf["isDigit"] == False)].copy()
    sepchars["Name"] = sepchars["Char"].apply(lambda c:c.isspace()) 
    return sepchars

In [337]:
charset = loadCharset(rootdir)
separatorChars = listSeparatorChars(charset)
separatorChars[:30]

Unnamed: 0,Code,Count,Char,CharName,isAlpha,isDigit,isSpace,Percent,Name
18,44,227594,",",COMMA,False,False,False,87.11154,False
20,46,209020,.,FULL STOP,False,False,False,88.665295,False
23,8217,151057,’,RIGHT SINGLE QUOTATION MARK,False,False,False,90.437958,False
27,39,99110,',APOSTROPHE,False,False,False,92.088362,False
35,45,70047,-,HYPHEN-MINUS,False,False,False,94.560573,False
40,32,59143,,SPACE,False,False,True,95.668293,True
41,160,55700,,NO-BREAK SPACE,False,False,True,95.869548,True
42,58,55099,:,COLON,False,False,False,96.068632,False
48,41,42072,),RIGHT PARENTHESIS,False,False,False,97.058604,False
49,40,42034,(,LEFT PARENTHESIS,False,False,False,97.210481,False


Test the tokenizer behavior with each separator char:

In [350]:
nlp = spacy_InitWithTokenizer()

In [436]:
def searchCharInTokens(dataset,nlp,char,count):
    dataset4c = sampleTextBlocksWithChar(dataset,char,count)    
    listSplits = []
    listBefore = []
    listAfter = []
    for rowidx,text in dataset4c["Text"].items():
        doc = nlp(text)
        splits = True
        before = ""
        after = ""
        for idx,token in enumerate(doc):
            if token.text == char:
                before = "" if idx==0 else doc[idx-1].text
                after = "" if idx==(len(doc)-1) else doc[idx+1].text
                break
            elif char in token.text:
                parts = token.text.split(char)
                before = parts[0]
                if before == "":
                    before = "" if idx==0 else doc[idx-1].text + "<<"
                after = parts[1]
                if after == "":
                    after = "" if idx==(len(doc)-1) else ">>" + doc[idx+1].text
                splits = False
                break        
        listSplits.append(splits)
        listBefore.append(before)
        listAfter.append(after)        
    dataset4c["Splits"] = listSplits
    dataset4c["Before"] = listBefore
    dataset4c["After"] = listAfter
    return dataset4c

In [437]:
char = "-"
dataset4c = searchCharInTokens(dataset,nlp,char,10000)
dataset4c.head(20)

Unnamed: 0,SiteIndex,RowIndex,Text,WordsCount,Context,Splits,Before,After
146797,89,16624,Documents d'informations : Document d'Informat...,63,rmations Clés (DIC) - Gan Performance Ret,True,),Gan
200382,113,6006,"Le tremblement de terre de magnitude 5,2, surv...",41,"ire de Chinon (Indre-et-Loire), a déclaré",False,Indre,et
206538,115,10611,Pour trouver l'assurance-vie qui correspond à ...,182,trouver l'assurance-vie qui correspond à,True,assurance,vie
196314,109,8575,Monte Paschi Banque - Livret de développement ...,10,Monte Paschi Banque - Livret de développe,True,Banque,Livret
207225,115,26098,- Label d’Excellence 2014 pour le compte coura...,20,- Label d’Excellence,True,,Label
44948,26,9943,Vous souhaitez effectuer une demande de micro-...,32,une demande de micro-crédit en ligne aupr,False,micro,crédit
209761,117,6680,L’assurance-vie (fonds en euros ou unités de c...,14,L’assurance-vie (fonds en euros,True,assurance,vie
68992,44,24865,"*les dispositions s'appliquent également, comp...",48,territoires d'outre-mer ou à l'étranger*,True,outre,mer
26291,19,4356,"Cofondateur du Palais de Tokyo, ancien directe...",63,"directeur des Beaux-Arts de Paris, ce th",False,Beaux,Arts
158501,95,10283,ING peut percevoir des rétrocessions de la par...,112,ment à l’article 314-76 du règlement géné,True,314,76


Most frequent tokens before, after, and around a separator, when the tokenizer splits the text into three tokens or does not split at all:

In [438]:
def exploreBeforeAfterSeparator(dataset4c,splits,columns):
    return dataset4c[dataset4c["Splits"] == splits].groupby(columns).agg({'Text':['count']})["Text"].sort_values("count",ascending=False)

In [439]:
beforeSplits = exploreBeforeAfterSeparator(dataset4c,True,["Before"])
beforeSplits.head(30)

Unnamed: 0_level_0,count
Before,Unnamed: 1_level_1
,300
assurance,219
Etats,121
Jean,97
),91
Saint,89
:,63
faut,53
Assurance,47
peut,46


In [440]:
afterSplits = exploreBeforeAfterSeparator(dataset4c,True,["After"])
afterSplits.head(30)

Unnamed: 0_level_0,count
After,Unnamed: 1_level_1
Mis,425
vous,341
vie,210
Unis,161
il,147
nous,117
on,73
t,67
je,65
ils,62


In [441]:
aroundSplits = exploreBeforeAfterSeparator(dataset4c,True,["Before","After"])
aroundSplits.head(30)

Unnamed: 0_level_0,Unnamed: 1_level_0,count
Before,After,Unnamed: 2_level_1
assurance,vie,205
Etats,Unis,121
faut,il,53
États,Unis,39
Peut,on,39
Assurance,Vie,36
PEA,PME,33
mes,placements,33
Faut,il,29
dois,je,24


In [442]:
beforeNoSplit = exploreBeforeAfterSeparator(dataset4c,False,["Before"])
beforeNoSplit.head(30)

Unnamed: 0_level_0,count
Before,Unnamed: 1_level_1
est<<,197
ci,160
non,124
e,122
au,118
rendez,114
start,107
sous,107
plus,100
celui,93


In [443]:
afterNoSplit = exploreBeforeAfterSeparator(dataset4c,False,["After"])
afterNoSplit.head(30)

Unnamed: 0_level_0,count
After,Unnamed: 1_level_1
ci,260
ce,213
vous,154
vie,151
delà,133
up,114
à,112
même,100
dessous,95
values,82


In [449]:
aroundNoSplit = exploreBeforeAfterSeparator(dataset4c,False,["Before","After"])
aroundNoSplit.head(30)

Unnamed: 0_level_0,Unnamed: 1_level_0,count
Before,After,Unnamed: 2_level_1
est<<,ce,192
rendez,vous,114
start,up,107
celui,ci,93
ci,dessous,81
au,delà,81
assurance,vie,79
plus,values,74
Assurance,vie,61
celle,ci,60


Find examples in context:

In [452]:
dataset4c[(dataset4c["Before"]=="est<<") & (dataset4c["After"]=="ce")]["Context"].sample(30)

123583          onsommation : qu’est-ce que c’est ?
185271                  Qu’est-ce qu’un contrat d'a
164688                  Qu'est-ce que la défiscalis
177974    ébut d’année. Qu’est-ce que le fichier de
8761                      Qu'est-ce qu'un artisan ?
151572            Pourquoi est-ce important : lors 
17323                   Qu'est-ce qu'un crédit reno
188674                  Qu’est-ce que le tiers paya
43241     r. Mais alors qu’est-ce qu’un CFD et comm
132609                  Qu'est-ce que la banque à d
1327                    Qu’est-ce qu’une résidence 
186802                  Qu'est-ce qu'un conducteur 
132018     » PTZ 2019 | Qu’est-ce que le Prêt à Tau
178253               7. Qu’est-ce que le Transfert 
72771                   Qu’est-ce que la prévoyance
5504                       iant Cthulhu, qu’est-ce…
138234                  Qu’est-ce que le bonus malu
159029                  Qu’est-ce qui change avec m
17453                   Qu'est-ce que l'assurance p
123736      