# nlptextdoc library source code

## 1. Prepare the Python environment

### 1.1 Install Anaconda and create a virtual environment

Download and install Anaconda for your platform : [Anaconda - Python 3.7](https://www.anaconda.com/distribution/#download-section).

Launch Anaconda Prompt.

> conda create --name nlptextenv

> conda activate nlptextenv

### 1.2 Install pandas with pyarrow.feather file format support

> conda install pandas

> conda install pyarrow

Make sure your version of pandas is 0.24 or later and that pyarrow is installed :

In [165]:
import pandas as pd
pd.show_versions()


INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: 4.4.1
pip: 19.0.3
setuptools: 41.0.0
Cython: 0.29.7
numpy: 1.16.2
scipy: None
pyarrow: 0.11.1
xarray: None
IPython: 7.4.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
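
The same check can be scripted instead of read off the report. The sketch below is only an illustration: `version_tuple` and `version_at_least` are hypothetical helper names, not part of pandas or of this project, and the comparison simply ignores non-numeric suffixes such as `rc1`.

```python
import re

def version_tuple(version):
    """Turn '0.24.2' into (0, 24, 2); parsing stops at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

def version_at_least(installed, required):
    """Return True when the installed version is at least the required one."""
    a, b = version_tuple(installed), version_tuple(required)
    width = max(len(a), len(b))
    # Pad with zeros so that '0.24' and '0.24.0' compare as equal
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

print(version_at_least("0.24.2", "0.24"))  # the pandas version reported above
print(version_at_least("0.23.4", "0.24"))  # an older version, for contrast
```

In the notebook, the installed version would be passed as `pd.__version__`.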


### 1.3 Install spaCy with French language support

> conda install -c conda-forge spacy

> python -m spacy download fr

Make sure your version of spaCy is 2.1 or later and that the fr model is installed :

In [166]:
!python -m spacy info



spaCy version    2.1.3                         
Location         C:\Users\laure\Anaconda3\envs\spacy\lib\site-packages\spacy
Platform         Windows-10-10.0.18362-SP0     
Python version   3.7.3                         
Models           fr                            



Install a spaCy language detector extension :

> pip install spacy-langdetect

Check if the language detection works :

In [344]:
import spacy
from spacy_langdetect import LanguageDetector

def spacy_InitWithTokenizer():
    nlp = spacy.load("fr_core_news_sm",disable=["tagger","ner","parser"])
    return nlp

def spacy_InitWithTokenizerAndLanguageDetector():
    nlp = spacy_InitWithTokenizer()
    nlp.add_pipe(nlp.create_pipe('sentencizer'), name="sentencizer", last=True)
    nlp.add_pipe(LanguageDetector(), name="language_detector", last=True)
    return nlp

def spacy_DetectLanguage(doc):
    return doc._.language["language"]

In [347]:
nlp = spacy_InitWithTokenizerAndLanguageDetector()
doc = nlp("Est-ce que le détecteur fonctionne ?")
%time spacy_DetectLanguage(doc)

Wall time: 3.99 ms


'fr'

### 1.4 How to run a Jupyter notebook in the context of a conda environment

> conda activate nlptextenv

> conda install ipykernel

> python -m ipykernel install --user --name nlptextenv --display-name "Python (nlptextenv)"

> jupyter notebook

=> menu Kernel / Change kernel / Python (nlptextenv)

Check : locate the Jupyter data directory. Kernels are configured in its 'kernels' subdirectory, in 'kernel.json' files.

In [169]:
from jupyter_core.paths import jupyter_data_dir
print(jupyter_data_dir())

C:\Users\laure\AppData\Roaming\jupyter
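
To see which kernels are registered without browsing the directory by hand, the `kernel.json` files can be read directly. This is a minimal sketch: `list_kernel_specs` is a hypothetical helper, and it assumes the layout described above (one subdirectory per kernel under `kernels`, each holding a `kernel.json` file).

```python
import json
from pathlib import Path

def list_kernel_specs(data_dir):
    """Map each kernel directory name to the display name stored in its kernel.json."""
    specs = {}
    kernels_dir = Path(data_dir) / "kernels"
    if not kernels_dir.is_dir():
        return specs
    for spec_file in sorted(kernels_dir.glob("*/kernel.json")):
        with spec_file.open(encoding="utf-8") as f:
            spec = json.load(f)
        # "display_name" is the label shown in the Kernel / Change kernel menu
        specs[spec_file.parent.name] = spec.get("display_name")
    return specs

# In the notebook : list_kernel_specs(jupyter_data_dir())
```

The environment registered in this section should appear with the display name "Python (nlptextenv)".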


Check : locate the python environment in use

In [170]:
import sys
print(sys.executable)

C:\Users\laure\Anaconda3\envs\spacy\python.exe


### 1.5 Define technical utility functions

In [171]:
import os
import sys

def _memory_size(obj, seen=None):
    size = sys.getsizeof(obj)
    if seen is None:
        seen = set()
    obj_id = id(obj)
    if obj_id in seen:
        return 0
    seen.add(obj_id)
    if isinstance(obj, dict):
        size += sum([_memory_size(v, seen) for v in obj.values()])
        size += sum([_memory_size(k, seen) for k in obj.keys()])
    elif hasattr(obj, '__dict__'):
        size += _memory_size(obj.__dict__, seen)
    elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):
        size += sum([_memory_size(i, seen) for i in obj])
    return size

# OTHER OPTION specific to pandas dataframes
# https://www.dataquest.io/blog/pandas-big-data/
# df.info(memory_usage="deep")

def _file_size(filepath):
    statinfo = os.stat(filepath)
    return statinfo.st_size

def _format_size_mb(size):
    return int(size / 1024.0 / 102.4) / 10.0
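
The arithmetic in `_format_size_mb` is compact, so here is a small worked example. The function below restates the same formula under a different, illustrative name (`format_size_mb`) so that the snippet runs on its own: it converts bytes to tenths of a mebibyte (1024 x 1024 bytes), truncates, then divides by ten, which yields one decimal place without rounding up.

```python
def format_size_mb(size_in_bytes):
    # size / 1024 / 102.4 is the size in tenths of a mebibyte;
    # int() truncates, and the final division restores the unit.
    return int(size_in_bytes / 1024.0 / 102.4) / 10.0

# 5 000 000 bytes is about 4.77 MiB : the function truncates rather than rounds
print(format_size_mb(5_000_000))
```

Sizes reported this way therefore never overstate the real size by rounding.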

## 2. Prepare the .NET environment

### 2.1 Install Visual Studio 2019 community

Download and install [Microsoft Visual Studio 2019](https://visualstudio.microsoft.com/fr/downloads/) community edition.

Install only the following workload : .NET Core cross-platform development.

### 2.2 Clone and compile nlptextdoc

Launch Visual Studio 2019.

Clone code
- Repository URL : https://github.com/laurentprudhon/nlptextdoc.git
- Choose a local directory for the solution

Double click on the solution file : nlptextdoc.sln

Select the "Release" configuration in the top toolbar.

In the Solution Explorer :
- right-click on the solution root => Build Solution
- right-click on the project "nlptextdoc.cli" => Open Folder in File Explorer

Navigate to the bin\Release\netcoreapp2.1 subdirectory :
- this directory should contain 7 .dll files, including : nlptextdoc.cli.dll
- copy the full path of this directory in the variable below

In [172]:
nlptextdocExecPath = r"C:\Users\laure\source\repos\nlptextdoctemp\nlptextdoc.cli\bin\Release\netcoreapp2.1"

Test the command line client and learn its syntax :

In [173]:
!dotnet "{nlptextdocExecPath}/nlptextdoc.cli.dll"

nlptexdoc extractor v1.0

Crawls all the Html pages of a website and converts them to .nlp.txt structured text documents.
All the extracted text documents are stored under a single directory named like the website.
The .nlp.txt file format is described here : https://www.cognitivefactory.org/nlptextdocument/

Features an advanced Html to text conversion algorithm :
- tries to recover the logical structure of the document from the Html layout
- interprets Css properties of the Html nodes to make this operation more reliable
- preserves document / section / list / table grouping and nesting information

Usage : nlptextdoc [rootUrl] [storageDirectory] [maxPagesCount=0] [minCrawlDelay=0]
 - rootUrl          : root Url of the website (or subfolder of a website) you want to crawl
 - storageDirectory : path to the disk directory where the website folder
 - maxPagesCount    : maximum number of pages extracted from the website (optional, default:100 000)
 - minCrawlDelay    : delay in milliseco

## 3. Extract nlp text documents from websites

### 3.1 Identify popular websites to build your specific language model

List the public and open websites you would like to read to build a language model.

PLEASE MAKE SURE THIS IS LEGAL in your country.

For example in Europe : https://ec.europa.eu/digital-single-market/en/modernisation-eu-copyright-rules. 

> "The mandatory exceptions that the proposed directive announced are related to: ... Text and data mining ..."

In [174]:
websites = ["http://bourse.latribune.fr/",
            "http://cercledelepargne.com/",
            "http://finance.lelynx.fr/banques/",
            "http://labourseauquotidien.fr/",
            "http://lafourmiz.fr/",
            "http://www.assurances.com/",
            "http://www.banque.org/",
            "http://www.banque-info.com/",
            "http://www.bourse.fr/",
            "http://www.boursedirect.fr/",
            "http://www.capitaine-epargne.com/",
            "http://www.cnp.fr/",
            "http://www.cofinoga.fr/",
            "http://www.comparabanques.fr/",
            "http://www.comparalivrets.fr/",
            "http://www.fbf.fr/",
            "http://www.financo.fr/",
            "http://www.generali.fr/",
            "http://www.guide-epargne.com/",
            "http://www.lemonde.fr/epargne/",
            "http://www.leparisien.fr/economie/votre-argent/",
            "http://www.lesaffaires.com/bourse",
            "http://www.lesclesdelabanque.com",
            "http://www.msn.com/fr-fr/finance",
            "http://www.retraiteepargne.fr/",
            "http://www.revue-banque.fr/",
            "http://www.strategie-bourse.com/",
            "http://www.zonebourse.com/",
            "https://acpr.banque-france.fr/",
            "https://banque.meilleurtaux.com/",
            "https://bourse.lefigaro.fr/",
            "https://compte-nickel.fr/",
            "https://eko-by-ca.fr/",
            "https://epargne.ooreka.fr/",
            "https://ffa-assurance.fr/",
            "https://fr.finance.yahoo.com/",
            "https://humanis.com/",
            "https://mabanque.bnpparibas/",
            "https://mes-placements.fr/",
            "https://n26.com/fr-fr/",
            "https://particulier.apicil.com/",
            "https://www.10meilleuresbanques.fr/",
            "https://www.abcbourse.com/",
            "https://www.afer.fr/",
            "https://www.ag2rlamondiale.fr/",
            "https://www.agpm.fr/",
            "https://www.allianz.fr/",
            "https://www.allianzbanque.fr/",
            "https://www.amaguiz.com/",
            "https://www.ameli.fr/",
            "https://www.amundi.fr/fr_part",
            "https://www.arkea.com/",
            "https://www.assurland.com/",
            "https://www.aviva.fr/",
            "https://www.axa.fr/",
            "https://www.banque.fr/",
            "https://www.banque-casino.fr/",
            "https://www.banque-edel.fr/",
            "https://www.banque-france.fr/",
            "https://www.banquepopulaire.fr/",
            "https://www.banquesenligne.org/",
            "https://www.bforbank.com/",
            "https://www.boursedeparis.fr/",
            "https://www.boursier.com/",
            "https://www.boursorama.com/",
            "https://www.boursorama-banque.com/",
            "https://www.bred.fr/",
            "https://www.ca-alsace-vosges.fr/",
            "https://www.caisse-epargne.fr/",
            "https://www.carrefour-banque.fr/",
            "https://www.cbanque.com/",
            "https://www.cetelem.fr/",
            "https://www.challenges.fr/tag_theme/banque_876/",
            "https://www.cic.fr/",
            "https://www.cofidis.fr/",
            "https://www.credit-cooperatif.coop/",
            "https://www.credit-du-nord.fr/",
            "https://www.credit-et-banque.com/",
            "https://www.creditfoncier.fr/",
            "https://www.creditmutuel.fr/",
            "https://www.culturebanque.com/",
            "https://www.diac.fr/",
            "https://www.direct-assurance.fr/",
            "https://www.economie.gouv.fr/",
            "https://www.empruntis.com/epargne/",
            "https://www.en-bourse.fr/",
            "https://www.eurofil.com/",
            "https://www.fortuneo.fr/",
            "https://www.francetransactions.com/",
            "https://www.gan.fr/",
            "https://www.groupama.fr/",
            "https://www.hellobank.fr/",
            "https://www.home.saxo/fr-fr/",
            "https://www.hsbc.fr/",
            "https://www.impots.gouv.fr/portail/",
            "https://www.ing.fr/banque-en-ligne/",
            "https://www.labanquepostale.fr/",
            "https://www.lcl.fr/",
            "https://www.lerevenu.com/",
            "https://www.lesechos.fr/finance-marches/",
            "https://www.lesfurets.com/",
            "https://www.lolivier.fr/",
            "https://www.mae.fr/",
            "https://www.maif.fr/",
            "https://www.matmut.fr/",
            "https://www.mma.fr/",
            "https://www.monabanq.com/fr/index.html",
            "https://www.mon-epargne.com/",
            "https://www.montepaschi-banque.fr/fr/",
            "https://www.natixis.com/",
            "https://www.oney.fr/",
            "https://www.orangebank.fr/",
            "https://www.ouest-france.fr/economie/banques-finance/",
            "https://www.palatine.fr/",
            "https://www.panorabanques.com/",
            "https://www.probtp.com/",
            "https://www.psabanque.fr/",
            "https://www.quechoisir.org/thematique-banque-credit-t111/",
            "https://www.revolut.com/fr-FR/",
            "https://www.service-public.fr/particuliers/vosdroits/N19803",
            "https://www.smc.fr/",
            "https://www.societegenerale.fr/",
            "https://www.sofinco.fr/",
            "https://www.toutsurmesfinances.com/",
            "https://www.tradingsat.com/",
            "https://www.usine-digitale.fr/banque/",
            "https://www.younited-credit.com/"]

len(websites)

128

### 3.2 Extract raw text from these websites in a local directory

Create a local directory to store the extracted nlp text documents : be careful, this directory may contain several gigabytes of data at the end of the process.

IMPORTANT : **the "magic" \\\\?\ prefix in the root path is mandatory on Windows** to enable long file names support in Python.

In [175]:
from pathlib import Path

rootdir = Path(r"\\?\C:\Users\laure\Desktop\nlptextdoc-data-201907")
rootdir.mkdir(exist_ok=True)
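
If the root path is built from a variable rather than typed by hand, the prefix can be added programmatically. The helper below is a hypothetical sketch (`with_long_path_prefix` is not part of the project). It only covers absolute drive-letter paths such as `C:\...` and does not handle network shares, which use a different prefix form.

```python
def with_long_path_prefix(path):
    r"""Prepend the Windows \\?\ prefix to an absolute drive path, unless already present."""
    prefix = "\\\\?\\"  # the four characters \\?\
    if path.startswith(prefix):
        return path
    return prefix + path

print(with_long_path_prefix(r"C:\Users\me\nlptextdoc-data"))
```

The result can then be passed to `Path(...)` exactly as in the cell above.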

Start by extracting only a few documents (for example 100) from each website, to test if they are accessible and if everything works as expected :

In [None]:
maxPagesCount = 100

for websiteUrl in websites:
    !dotnet "{nlptextdocExecPath}/nlptextdoc.cli.dll" {websiteUrl} {str(rootdir)} {maxPagesCount}
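
Outside a notebook, the `!dotnet ...` shell escape is not available. A possible alternative, sketched here with the standard `subprocess` module, builds the same command line as a list of arguments. The function names are hypothetical. The argument order follows the usage message printed by the tool (rootUrl, storageDirectory, maxPagesCount, minCrawlDelay).

```python
import subprocess
from pathlib import Path

def build_extract_command(exec_dir, root_url, storage_dir, max_pages=None, min_crawl_delay=None):
    """Build the argument list in the order shown in the usage message."""
    command = ["dotnet", str(Path(exec_dir) / "nlptextdoc.cli.dll"), root_url, str(storage_dir)]
    if max_pages is not None:
        command.append(str(max_pages))
        # Arguments are positional : a crawl delay can only follow a page count
        if min_crawl_delay is not None:
            command.append(str(min_crawl_delay))
    return command

def run_extraction(exec_dir, root_url, storage_dir, max_pages=None, min_crawl_delay=None):
    """Run one extraction and return the exit code of the dotnet process."""
    command = build_extract_command(exec_dir, root_url, storage_dir, max_pages, min_crawl_delay)
    return subprocess.run(command).returncode
```

Passing a list to `subprocess.run` avoids quoting problems when the storage directory contains spaces.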

In the local root directory, the extraction program creates one subdirectory per website.

Each website subdirectory contains :
- one log file called **httprequests.log.csv**
- subdirectories reproducing the website tree structure
- one **nlp.txt text document** for each extracted html page in this tree structure
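
A quick way to confirm this layout after a test run is to count the extracted documents per website directory. The sketch below assumes the structure just described. `count_extracted_documents` is a hypothetical helper name.

```python
from pathlib import Path

def count_extracted_documents(rootdir):
    """Return {website directory name: number of .nlp.txt files found below it}."""
    counts = {}
    for websitedir in sorted(Path(rootdir).iterdir()):
        if websitedir.is_dir():
            # "**/*.nlp.txt" walks the whole tree reproduced under the website directory
            counts[websitedir.name] = sum(1 for _ in websitedir.glob("**/*.nlp.txt"))
    return counts
```

A website whose count stays at 0 was probably not reachable and deserves a look at its log file.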

See the following page for a **description of the nlptextdoc format** : https://github.com/laurentprudhon/nlptextdoc/blob/master/README.md

Check if all the websites were correctly extracted :

In [176]:
import pandas as pd
from urllib.parse import urlparse

def getWebsiteName(websiteurl):
    url = urlparse(websiteurl)
    websitename = url.netloc
    return websitename

def getWebsiteDir(rootdir, websitename):
    websitedir = rootdir / websitename
    return websitedir

def loadExtractionLogs(websitedir):
    return pd.read_csv(websitedir / "httprequests.log.csv",delimiter=";")

def getExtractionStats(websites):
    websiteNames = []
    requestsCount = []
    statusCounts = []    
    errorTypes = ["OK","NotFound","Redirect","NoContent","Forbidden","BadRequest","Moved"]
    for errorType in errorTypes:
        statusCounts.append([])
    for websiteurl in websites:
        website = getWebsiteName(websiteurl)
        print(f"Checking extraction logs for website {website} ...")
        websitedir = getWebsiteDir(rootdir, website)
        logsdf = loadExtractionLogs(websitedir)
        logsstatus = logsdf["Status code"].value_counts()
        websiteNames.append(website)
        requestsCount.append(len(logsdf))
        for idx,errorType in enumerate(errorTypes):
            statusCounts[idx].append(logsstatus[errorType] if errorType in logsstatus else 0)
    dictResult = {}
    dictResult["Website"] = websiteNames
    dictResult["Requests"] = requestsCount
    for idx,errorType in enumerate(errorTypes):
        dictResult[errorType] = statusCounts[idx]
    return pd.DataFrame(dictResult)    
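
Note that `getWebsiteName` keeps only the host part of the URL, so the path of a root URL plays no role in locating the website directory. The short example below illustrates this with three URLs from the list. `website_name` is just an illustrative restatement of the helper above.

```python
from urllib.parse import urlparse

def website_name(website_url):
    # Same idea as getWebsiteName : keep only the network location (host)
    return urlparse(website_url).netloc

for url in ["http://finance.lelynx.fr/banques/",
            "https://www.amundi.fr/fr_part",
            "https://www.service-public.fr/particuliers/vosdroits/N19803"]:
    print(website_name(url))
```

Two root URLs on the same host would therefore point to the same directory.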

In [None]:
extractionStats = getExtractionStats(websites)

In [None]:
extractionStats[extractionStats["Requests"] != extractionStats["OK"]]

For each website with http error codes, open the log file **httprequests.log.csv** and see if something needs to be fixed.

Use the code below to test if the errors :
- were temporary, a consequence of the high request rate of the extraction program : in that case, relaunch the extraction of the website with a larger minCrawlDelay
- are a real problem in the source website : just ignore them and continue

In [177]:
from urllib.request import urlopen
from urllib.error import HTTPError

def checkExtractionLogsByErrorType(logsdf):
    errorTypes = ["NotFound","Redirect","NoContent","Forbidden","BadRequest","Moved"]
    urls = []
    extractionStatus = []
    checkedStatus = []
    for errorType in errorTypes:
        urlsWithError = logsdf[logsdf["Status code"] == errorType]["Url"]
        print(f"Testing {len(urlsWithError)} URLs with error type {errorType} ...")
        for url in urlsWithError:
            urls.append(url)
            extractionStatus.append(errorType)
            try:
                resp = urlopen(url)
                checkedStatus.append(resp.getcode())
            except HTTPError as he:
                checkedStatus.append(he.code)
    checksdf = pd.DataFrame({"Urls" : urls, "ExtractionStatus" : extractionStatus, "CheckedStatus" : checkedStatus})
    return checksdf

In [53]:
websiteIndex = 9
websitename = getWebsiteName(websites[websiteIndex])
print(websitename)

websitedir = getWebsiteDir(rootdir, websitename)
logsdf = loadExtractionLogs(websitedir)
checkExtractionLogsByErrorType(logsdf)

www.boursedirect.fr
Testing 0 URLs with error type NotFound ...
Testing 1 URLs with error type Redirect ...
Testing 0 URLs with error type NoContent ...
Testing 2 URLs with error type Forbidden ...
Testing 0 URLs with error type BadRequest ...
Testing 0 URLs with error type Moved ...


| | Urls | ExtractionStatus | CheckedStatus |
|---|---|---|---|
| 0 | http://www.boursedirect.fr/priv/logoutPriv.php | Redirect | 200 |
| 1 | http://www.boursedirect.fr/fr/profil | Forbidden | 403 |
| 2 | http://www.boursedirect.fr/fr/messagerie | Forbidden | 403 |
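
One possible way to read the CheckedStatus column is sketched below. The labels and the `classify_recheck` helper are illustrative assumptions, not part of the project. Keep in mind that `urlopen` follows redirects, so a URL logged as Redirect can legitimately answer 200 when checked again.

```python
def classify_recheck(checked_status):
    """Label a URL from the HTTP status code obtained when requesting it again."""
    if 200 <= checked_status < 300:
        return "reachable now"   # the error may have been temporary, or a redirect was followed
    return "still failing"       # most likely a real problem on the source website

# (logged status, re-checked code) pairs taken from the results above
rechecks = [("Redirect", 200), ("Forbidden", 403)]
for logged, checked in rechecks:
    print(logged, checked, classify_recheck(checked))
```

URLs that are still failing, like the two Forbidden pages here, can usually be ignored.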


When everything seems OK, relaunch the extraction code above with a much bigger maxPagesCount (for example 100 000).

### 3.3 Download publicly available French dictionaries

Create a local subdirectory to store the French dictionaries :

In [178]:
dictdir = rootdir / "_dictionaries"
dictdir.mkdir(exist_ok=True)

**Dictionary 1 : Dicollecte** - Open Source French dictionary for LibreOffice/OpenOffice

Website : https://grammalecte.net/home.php?prj=fr.

Licence : MPL : Mozilla Public License version 2.0  -  http://www.mozilla.org/MPL/2.0/.

Download the latest "Lexique" on the [Grammalecte downloads page](https://grammalecte.net/download.php?prj=fr) :
- open the zip file
- copy only the "lexique-dicollecte-fr-v*.txt" file (for example : lexique-dicollecte-fr-v6.4.1.txt) in the local subdirectory

Open the file in a text editor to see its self-descriptive format and contents.

Store the file name in the variable below :

In [179]:
dicollectefile = dictdir / "lexique-dicollecte-fr-v6.4.1.txt"
dicollectefile.exists()

True

In [180]:
def buildDicollecteTags(dicollectefile):
    # Skip the descriptive header lines at the top of the lexicon file
    dictionarydf = pd.read_csv(dicollectefile, sep="\t", skiprows=15)
    dictionarytags = {}
    for index, row in dictionarydf.iterrows():
        token = row["Flexion"]
        tag = _convertDicollecteTagsToUnivDepTags(row["Étiquettes"])
        if(not (token in dictionarytags)):
            dictionarytags[token] = tag
        elif(not (tag in dictionarytags[token])):
            dictionarytags[token] = dictionarytags[token] + "|" + tag
    return dictionarytags

def _convertDicollecteTagsToUnivDepTags(text):
    if(("adj" in text) or ("loc.adj" in text)):
        return "ADJ"
    elif("prep" in text):
        return "ADP"
    elif(("adv" in text) or ("loc.adv" in text)):
        return "ADV"
    elif(("v0a" in text) or ("v0e" in text) or ("ppas" in text)):
        return "AUX"
    elif("cjco" in text):
        return "CCONJ"
    elif("det" in text):
        return "DET"
    elif("interj" in text):
        return "INTJ"
    elif("nom" in text):
        return "NOUN"
    elif(("nb" in text) or ("ord" in text)):
        return "NUM"
    elif("pro" in text):
        return "PRON"
    elif(("prn" in text) or ("patr" in text) or ("npr" in text)):
        return "PROPN"
    elif("cjsub" in text):
        return "SCONJ"
    elif("symb" in text):
        return "SYM"
    elif(("v1" in text) or ("v2" in text) or ("v3" in text) or ("loc.verb" in text)):
        return "VERB"
    else:
        return text

In [None]:
dicollecteTags = buildDicollecteTags(dicollectefile)
dicollecteTags
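
The dictionary built above stores, for each token, its distinct tags joined with "|". The sketch below shows the same accumulation on a few made-up entries (they are illustrative, not taken from the lexicon). `merge_tags` is a hypothetical variant that compares whole tags through a list instead of searching inside the joined string.

```python
def merge_tags(entries):
    """Collect the distinct tags of each token, joined with "|" in first-seen order."""
    tags_by_token = {}
    for token, tag in entries:
        tags = tags_by_token.setdefault(token, [])
        if tag not in tags:  # exact match on whole tags, not a substring test
            tags.append(tag)
    return {token: "|".join(tags) for token, tags in tags_by_token.items()}

sample = [("ferme", "ADJ"), ("ferme", "NOUN"), ("ferme", "VERB"), ("ferme", "NOUN"), ("le", "DET")]
print(merge_tags(sample))  # {'ferme': 'ADJ|NOUN|VERB', 'le': 'DET'}
```

Comparing whole tags matters mainly for the raw Dicollecte labels that `_convertDicollecteTagsToUnivDepTags` returns unchanged, since one such label may be contained in another.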

**Dictionary 2 : UDLexicons Lefff** - Research resource from INRIA for the [Universal Dependencies](https://universaldependencies.org/) project

Citation : Benoît Sagot. A multilingual collection of CoNLL-U-compatible morphological lexicons. Eleventh International Conference on Language Resources and Evaluation (LREC 2018), May 2018, Miyazaki, Japan. hal-01798798v2

Paper : https://hal.inria.fr/hal-01798798v2/document

Download the latest "UDLexicons" on [Benoît Sagot's resources page](http://alpage.inria.fr/~sagot/) :
- open the zip file
- copy only the "UDLex_French-Lefff.conllul" in the local directory
- add a .txt extension to the file name

Open the file in a text editor to see its self-descriptive format and contents.

Store the file name in the variable below :

In [182]:
leffffile = dictdir / "UDLex_French-Lefff.conllul.txt"
leffffile.exists()

True

In [183]:
def buildLefffTags(leffffile):
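    # The file is read without an explicit header : pandas takes the first entry as column names,
    # which is why the word form column is called "!" and the tag column "PUNCT" below
    lexicondf = pd.read_csv(leffffile, sep="\t", quoting=3, error_bad_lines=False)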
    lexicontags = {}
    for index, row in lexicondf.iterrows():
        token = row["!"]
        tag = row["PUNCT"]
        if(not (token in lexicontags)):
            lexicontags[token] = tag
        elif(not (tag in lexicontags[token])):
            lexicontags[token] = lexicontags[token] + "|" + tag
    return lexicontags

In [None]:
lefffTags = buildLefffTags(leffffile)
lefffTags

## 4. Generate text dataset, statistics, dictionaries from websites extraction directory

### 4.1 Load the extracted text files in an efficient DataFrame for each website

The following class can be used to **parse and load the .nlp.txt text files extracted from a website in a DataFrame**.

See the following page for a **description of the nlptextdoc format** : https://github.com/laurentprudhon/nlptextdoc/blob/master/README.md

In [185]:
import numpy as np
import re

class NLPTextDocumentReader:
    """Read output files of a website extraction in pandas DataFrames.
    
    Sample usage :
    
    textreader = NLPTextDocumentReader(websitedir)
    textdf = textreader.load_dataframe()
    """    
    def __init__(self, websitedir):
        self.websitedir = websitedir
        
        self.documentCount = 0 
        self.nestingLevel = 1
        self.listType = []
        self.listCmd = []
        self.listLevel = []
        self.listText = []
                
        self.DOCUMENT_ELEMENT_LINE_MARKER = "##"
        self.DOCUMENT_ELEMENT_START = "Start"
        self.DOCUMENT_ELEMENT_END = "End"
        self.DOCUMENT_ELEMENT_ITEMS = "Items"
        self.DOCUMENT_ELEMENT_ITEMS_START = ">>"
        self.DOCUMENT_ELEMENT_ITEMS_SEPARATOR = "||"
        
        self.TEXT_DOCUMENT_PROPERTY_PREFIX = self.DOCUMENT_ELEMENT_LINE_MARKER + " NLPTextDocument "
        self.TEXT_DOCUMENT_TITLE = "Title"
        self.TEXT_DOCUMENT_URI = "Uri"
        
        self.DOCUMENT_ELEMENT_LINE_REGEX = re.compile(
            self.DOCUMENT_ELEMENT_LINE_MARKER + " "
            + "(?P<NestingLevel>[0-9]+)" + " "
            + "(?P<ElementName>[A-Za-z]+)" + " "
            + "(?P<Command>" + self.DOCUMENT_ELEMENT_START + "|" + self.DOCUMENT_ELEMENT_END + "|" + self.DOCUMENT_ELEMENT_ITEMS + ")" + " ?")
        
    def load_dataframe(self):
        textdffile = self.websitedir / "nlptextdocs.dataframe.feather"
        if(textdffile.exists()):
            return pd.read_feather(textdffile)
        else:
            for textfile in self.websitedir.glob("**/*.nlp.txt"):
                with textfile.open(mode="r", encoding="utf-8-sig") as f:   
                    self.textfile = textfile
                    self.documentCount = self.documentCount+1
                    self.onDocumentStart(str(self.documentCount))
                    self.isreadingproperties = True
                    for lineidx,line in enumerate(f):
                        line = line.strip()
                        if(not line): continue
                        self.lineidx = lineidx
                        self.readline(line)
                    self.onDocumentEnd(str(self.documentCount))
            textdf = pd.DataFrame({"DocEltType": self.listType, "DocEltCmd" : self.listCmd, "NestingLevel": self.listLevel, "Text":self.listText})
            textdf = textdf.astype({"DocEltType": "category", "DocEltCmd": "category", "NestingLevel": np.uint8},copy=False)
            self.__init__(self.websitedir)
            textdf.to_feather(textdffile)
            return textdf

    def readline(self,line):
        if (self.isreadingproperties):
            if (line.startswith(self.TEXT_DOCUMENT_PROPERTY_PREFIX)):
                self.readproperty(line[len(self.TEXT_DOCUMENT_PROPERTY_PREFIX):])
            else:
                self.isreadingproperties = False
        if (not self.isreadingproperties):
            self.readelement(line)
                
    def readproperty(self,propstr):
        firstspaceindex = propstr.find(" ")
        if (firstspaceindex > 0):
            propertyname = propstr[:firstspaceindex]            
            propertyvalue = propstr[firstspaceindex + 1:].strip()
            if(propertyname == self.TEXT_DOCUMENT_TITLE):
                self.onDocumentTitle(propertyvalue)
            elif(propertyname == self.TEXT_DOCUMENT_URI):
                self.onDocumentUri(propertyvalue)       
    
    def readelement(self,line):
        if (line.startswith(self.DOCUMENT_ELEMENT_LINE_MARKER)):
            self.readcommand(line)
        else:
            self.onTextBlock(line)
    
    def readcommand(self,line):
        match = self.DOCUMENT_ELEMENT_LINE_REGEX.match(line)
        if(match): 
            self.nestingLevel = int(match.group("NestingLevel"))
            elementName = match.group("ElementName")
            command = match.group("Command")
            if (command == self.DOCUMENT_ELEMENT_START):
                title = line[match.end():].strip()
                if (len(title) == 0): title = None
                if(elementName == "Section"):
                    self.onSectionStart(title)
                elif(elementName == "NavigationList"):
                    self.onNavigationListStart(title)
                elif(elementName == "List"):
                    self.onListStart(title)
                elif(elementName == "ListItem"):
                    self.onListItemStart()
                elif(elementName == "Table"):
                    self.onTableStart(title)
                elif(elementName == "TableHeader"):
                    self.onTableHeaderStart()           
                elif(elementName == "TableCell"):
                    self.onTableCellStart()
            elif (command == self.DOCUMENT_ELEMENT_END):
                if(elementName == "Section"):
                    self.onSectionEnd()
                elif(elementName == "NavigationList"):
                    self.onNavigationListEnd()
                elif(elementName == "List"):
                    self.onListEnd()
                elif(elementName == "ListItem"):
                    self.onListItemEnd()
                elif(elementName == "Table"):
                    self.onTableEnd()
                elif(elementName == "TableHeader"):
                    self.onTableHeaderEnd()                 
                elif(elementName == "TableCell"):
                    self.onTableCellEnd()
            elif (command == self.DOCUMENT_ELEMENT_ITEMS):
                startOfItems = line.find(self.DOCUMENT_ELEMENT_ITEMS_START)
                title = line[match.end():startOfItems].strip()
                if (len(title) == 0): title = None
                if (elementName == "NavigationList"):
                    self.onNavigationListStart(title)
                elif (elementName == "List"):
                    self.onListStart(title)             
                self.nestingLevel = self.nestingLevel+1
                items = line[startOfItems+len(self.DOCUMENT_ELEMENT_ITEMS_START):].split(self.DOCUMENT_ELEMENT_ITEMS_SEPARATOR)
                for item in items:
                    item = item.strip()
                    if (len(item) > 0):
                        self.onInlineListItem(item)
                self.nestingLevel = self.nestingLevel-1
                if (elementName == "NavigationList"):
                    self.onNavigationListEnd()
                elif (elementName == "List"):
                    self.onListEnd()
            else:
                raise Exception(f"File format error in file {self.textfile} on line {self.lineidx} : {line[:min(len(line), 50)]}")
        else:
            raise Exception(f"File format error in file {self.textfile} on line {self.lineidx} : {line[:min(len(line), 50)]}")
    
    def onDocumentStart(self,docId):
        self.appendrow("Document","Start",docId)
    
    def onDocumentTitle(self,title):
        self.appendrow("Document","Title",title)
            
    def onDocumentUri(self,uri):
        self.appendrow("Document","Uri",uri)
    
    def onDocumentEnd(self,docId):
        self.appendrow("Document","End",docId)
    
    def onTextBlock(self,text):
        self.appendrow("TextBlock","Text",text)
            
    def onSectionStart(self,title):
        self.appendrow("Section","Start",title)
        
    def onSectionEnd(self): 
        self.appendrow("Section","End")
        
    def onNavigationListStart(self,title):
        self.appendrow("NavigationList","Start",title)
        
    def onNavigationListEnd(self):
        self.appendrow("NavigationList","End")
        
    def onListStart(self,title):
        self.appendrow("List","Start",title)
        
    def onListEnd(self):
        self.appendrow("List","End")
        
    def onInlineListItem(self,item):
        self.appendrow("ListItem","Text",item)
            
    def onListItemStart(self):
        self.appendrow("ListItem","Start")
        
    def onListItemEnd(self):
        self.appendrow("ListItem","End")
        
    def onTableStart(self,title):
        self.appendrow("Table","Start",title)
    
    def onTableEnd(self):
        self.appendrow("Table","End")
        
    def onTableHeaderStart(self):
        self.appendrow("TableHeader","Start")
        
    def onTableHeaderEnd(self): 
        self.appendrow("TableHeader","End")
        
    def onTableCellStart(self):
        self.appendrow("TableCell","Start")
        
    def onTableCellEnd(self): 
        self.appendrow("TableCell","End")
            
    def appendrow(self,docEltType,docEltCmd,text=None):
        self.listType.append(docEltType)
        self.listCmd.append(docEltCmd)
        self.listLevel.append(self.nestingLevel)
        if(text is not None):
            text = text.replace("\\n","\n")
        self.listText.append(text)
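
To make the element-line syntax handled by `readcommand` more concrete, the sketch below applies the same regular expression to a few sample lines. The sample lines are hypothetical, written only from the pattern used in the class; the README linked above remains the reference for the format. `parse_element_line` is an illustrative helper, not part of the reader.

```python
import re

# Same pattern as DOCUMENT_ELEMENT_LINE_REGEX in the class above
ELEMENT_LINE = re.compile(
    r"## (?P<NestingLevel>[0-9]+) (?P<ElementName>[A-Za-z]+) (?P<Command>Start|End|Items) ?")

def parse_element_line(line):
    """Return (nesting level, element name, command, remainder), or None for a plain text line."""
    match = ELEMENT_LINE.match(line)
    if match is None:
        return None
    rest = line[match.end():].strip()
    return int(match.group("NestingLevel")), match.group("ElementName"), match.group("Command"), rest

print(parse_element_line("## 1 Section Start Livret A"))
print(parse_element_line("## 2 List Items >> Taux || Plafond"))
print(parse_element_line("Plain text block"))
```

For an Items command, the remainder still contains the `>>` marker and the `||` separators, which the reader then splits into individual list items.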

Use the function below to prepare DataFrames for all the extracted websites at once :

In [186]:
def prepareDataFramesForWebsites(rootdir, websites):
    """Loads all individual text blocks extracted from the pages of each website in a dataframe, and saves them efficiently on disk.

    Parameters:
    rootdir - Path to the directory where the websites were extracted
    websites - List of strings with the websites root URLs
    """
    for websiteurl in websites:
        website = getWebsiteName(websiteurl)
        print(f"Preparing dataframe for website {website} ...")        
        websitedir = getWebsiteDir(rootdir,website)
        reader = NLPTextDocumentReader(websitedir)
        textdf = reader.load_dataframe()
        docsCount = len(textdf[(textdf["DocEltType"]=="Document") & (textdf["DocEltCmd"]=="Start")])
        logsdf = loadExtractionLogs(websitedir)
        print(f"- {len(logsdf)} website extraction logs")
        print(f"- {docsCount} documents")
        print(f"- {len(textdf)} document elements")
        print(f"- dataframe size in memory : {_format_size_mb(_memory_size(textdf))} MB")
        websitefile = websitedir / "nlptextdocs.dataframe.feather"
        print(f"- dataframe size on disk : {_format_size_mb(_file_size(websitefile))} MB")

In [None]:
prepareDataFramesForWebsites(rootdir, websites)

If a parsing error occurs in one of the text files, delete the corrupted file and rerun the function above.

The rerun is fast for all the websites that were already processed.
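
A rerun is cheap presumably because each website's DataFrame is cached on disk after the first parse. The sketch below illustrates that pattern with the standard library only. `load_or_build`, the JSON cache file and the `parse_text_files` callback are hypothetical stand-ins, not the actual `NLPTextDocumentReader` code, which stores a Feather file.

```python
import json
import tempfile
from pathlib import Path

def load_or_build(cachefile, build):
    """Return the cached rows if the cache file exists, otherwise build and cache them."""
    if cachefile.exists():
        return json.loads(cachefile.read_text(encoding="utf-8"))
    rows = build()
    cachefile.write_text(json.dumps(rows), encoding="utf-8")
    return rows

calls = []

def parse_text_files():
    # Hypothetical expensive step: parsing every extracted text file
    calls.append(1)
    return [["Document", "Start", "Accueil"], ["TextBlock", "Text", "Bienvenue"]]

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "website.cache.json"
    first = load_or_build(cache, parse_text_files)   # parses, then writes the cache
    second = load_or_build(cache, parse_text_files)  # only reads the cache

print(len(calls), first == second)
```

With such a cache, deleting one corrupted input file only costs a new parse for the website that failed.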

### 4.2 Filter and aggregate all interesting text blocks in a single DataFrame

While we filter and aggregate all the interesting text blocks in a single DataFrame, we also generate the following summaries of the text data for later use :

1. Information about the character set used in the extracted dataset :

In [324]:
from collections import defaultdict
from unicodedata import name as unicodename

def charname(char):
    return unicodename(char,f"CHAR {ord(char)}")

def saveCharset(rootdir, vocabdf):
    print("Saving the character set ...")
    charcounts = defaultdict(lambda:0)
    for idx,row in vocabdf.iterrows():
        token = row["Word"]
        count = row["Count"]
        for char in token:
            charcode = ord(char)
            charcounts[charcode] = charcounts[charcode] + count
    charsetdf = pd.DataFrame({"Code" : [*charcounts.keys()], "Count" : [*charcounts.values()]})
    charsetdf.sort_values("Count", ascending=False, inplace=True)
    charsetdf.reset_index(inplace=True)
    charsetdf.drop('index', axis=1, inplace=True)
    charsetdf["Char"] = charsetdf["Code"].map(lambda x:chr(x))
    charsetdf["CharName"] = charsetdf["Char"].map(lambda c:charname(c))
    charsetdf["isAlpha"] = charsetdf["Char"].map(lambda x:x.isalpha())
    charsetdf["isDigit"] = charsetdf["Char"].map(lambda x:x.isdigit())
    charsetdf["isSpace"] = charsetdf["Char"].map(lambda x:x.isspace())
    charsetdf["Percent"] = 100*charsetdf["Count"].cumsum()/charsetdf["Count"].sum()
    charsetfile = rootdir / "charset.dataframe.feather"
    charsetdf.to_feather(charsetfile)
    charsetdf.to_csv(rootdir / "charset.csv",sep=";")
    print(f"- {len(charsetdf)} distinct characters")
    return charsetdf
               
def loadCharset(rootdir):
    charsetfile = rootdir / "charset.dataframe.feather"
    return pd.read_feather(charsetfile)
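
To make the character statistics concrete, here is a minimal sketch of the same computation in plain Python, without pandas. Each character is weighted by the count of the words it appears in, and the `Percent` value is cumulative over characters sorted by decreasing frequency. The `char_counts` helper and the tiny vocabulary are illustrative assumptions, not part of the library.

```python
from collections import Counter
from unicodedata import name as unicodename

def char_counts(vocab):
    """vocab maps word -> occurrences. Returns (char, name, count, cumulative %) rows."""
    counts = Counter()
    for word, count in vocab.items():
        for char in word:
            counts[char] += count
    total = sum(counts.values())
    rows, running = [], 0
    for char, count in counts.most_common():
        running += count
        # Characters without a Unicode name fall back to "CHAR <code>", as in charname()
        rows.append((char, unicodename(char, f"CHAR {ord(char)}"), count, 100 * running / total))
    return rows

rows = char_counts({"la": 3, "a": 2, "€": 1})
for row in rows:
    print(row)
```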

2. Information about the vocabulary (distinct words) used in the extracted dataset :

In [327]:
def saveVocabulary(rootdir, vocabdict):
    print("Saving the vocabulary ...")
    vocabdf = pd.DataFrame({"Word" : [*vocabdict.keys()], "Count" : [*vocabdict.values()]})	
    vocabdf.sort_values("Count", ascending=False, inplace=True)
    vocabdf.reset_index(inplace=True)    
    vocabdf.drop('index', axis=1, inplace=True)
    vocabdf["LefffTags"] = vocabdf["Word"].apply(lambda word: _getTokenTags(str(word),lefffTags))
    vocabdf["DicollecteTags"] = vocabdf["Word"].apply(lambda word: _getTokenTags(str(word),dicollecteTags))
    vocabdf["CommonTags"] = vocabdf.apply(lambda row: _mergeTokenTags(str(row["LefffTags"]),str(row["DicollecteTags"])),axis=1)
    vocabdf["Percent"] = 100*vocabdf["Count"].cumsum()/vocabdf["Count"].sum()
    vocabfile = rootdir / "vocabulary.dataframe.feather"
    vocabdf.to_feather(vocabfile)
    vocabdf.to_csv(rootdir / "vocabulary.csv",sep=";")
    print(f"- {len(vocabdf)} distinct words")
    return vocabdf

def _getTokenTags(token,tags):
    annot = tags.get(token)
    if(annot is None):
        annot = tags.get(token.lower())
    if(annot is None):
        try:
            float(token.replace(",","."))
        except ValueError:
            return None
        return "Number"
    return annot

def _mergeTokenTags(annot1,annot2):
    if(annot1 == annot2):
        return annot1
    elif((annot1 != "None") and (annot2 == "None")):
        return annot1
    elif((annot1 == "None") and (annot2 != "None")):
        return annot2
    else:
        tags1 = set(annot1.split("|"))
        tags2 = set(annot2.split("|"))
        mergedtags = tags1 | tags2
        return "|".join(mergedtags)
    
def loadVocabulary(rootdir):
    vocabfile = rootdir / "vocabulary.dataframe.feather"
    return pd.read_feather(vocabfile)
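
The tag merging and number detection above can be tried in isolation. The sketch below is a simplified variant under two assumptions: missing tags are represented by the string "None" (as produced by `str(None)` in `saveVocabulary`), and merged tags are sorted to make the result deterministic, whereas `_mergeTokenTags` joins a set and may return the tags in any order. The names `merge_tags` and `looks_like_number` are hypothetical.

```python
def merge_tags(annot1, annot2):
    """Union of two '|'-separated tag strings; the string "None" means no tags."""
    if annot1 == annot2:
        return annot1
    if annot2 == "None":
        return annot1
    if annot1 == "None":
        return annot2
    return "|".join(sorted(set(annot1.split("|")) | set(annot2.split("|"))))

def looks_like_number(token):
    """True for tokens such as '5,2' or '12.5' (the French decimal comma is accepted)."""
    try:
        float(token.replace(",", "."))
    except ValueError:
        return False
    return True

print(merge_tags("PROPN", "NOUN"))
print(merge_tags("NOUN|ADP", "None"))
print(looks_like_number("5,2"), looks_like_number("assurance"))
```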

Combine all text blocks from each website in a single dataframe, while applying several filters to enhance the dataset quality:
- keep only distinct text blocks for each website
- keep only text blocks with at least 5 words
- keep only text blocks in French

In [189]:
from collections import defaultdict
from hashlib import md5

def createDatasetFromWebsites(rootdir, websites, minWordsCount=5, filterLanguage="fr"):
    """Combine all text blocks from each website in a single dataframe, while applying several filters to enhance the dataset quality:
    - keep only distinct text blocks for each website
    - keep only text blocks with at least minWordsCount words (5 by default)
    - keep only text blocks detected as filterLanguage (French by default)

    Create at the same time 3 additional tables:
    - a dictionary of all distinct words encountered in the dataset by decreasing frequency
    - a dictionary of all distinct characters encountered in the dataset by decreasing frequency
    - a table of the dataset statistics (words contributed by each website)

    Parameters:
    rootdir - Path to the directory where the websites were extracted
    websites - List of strings with the websites root URLs
    minWordsCount - Minimum number of words (spaCy tokens) a text block must contain to be kept
    filterLanguage - Language code a text block must be detected as to be kept
    """
    charsCount = 0
    wordsCount = 0
    vocabdict = defaultdict(lambda:0)
    listSiteIndex = []
    listRowIndex = []
    listText = []
    listWordCounts = []    
    listWebsites = []
    listWebsitesWordCounts = []
    nlp = spacy_InitWithTokenizerAndLanguageDetector()
    for idx,websiteUrl in enumerate(websites):
        website = getWebsiteName(websiteUrl)
        websitedir = getWebsiteDir(rootdir,website)
        hashes = set()
        print(f"Loading dataframe for website {website} ...")
        reader = NLPTextDocumentReader(websitedir)
        textdf = reader.load_dataframe()
        print(f"- filtering and tokenizing {len(textdf)} text blocks ...")
        websitetexts = textdf[((textdf["DocEltType"] != "Document") | (textdf["DocEltCmd"] == "Title")) & (textdf["DocEltCmd"] != "End") & ~textdf["Text"].isnull()]["Text"]
        localWordsCount = 0
        for rowidx,text in websitetexts.items():
            hval = md5(text.encode()).digest()
            if not (hval in hashes):         
                hashes.add(hval)
                doc = nlp(text)
                rowWordsCount = len(doc)
                rowLanguage = spacy_DetectLanguage(doc)
                if (rowWordsCount >= minWordsCount) and (rowLanguage == filterLanguage):
                    charsCount = charsCount + len(text)
                    localWordsCount = localWordsCount + rowWordsCount
                    for token in doc:
                        vocabdict[token.text] = vocabdict[token.text] + 1
                    listSiteIndex.append(idx)
                    listRowIndex.append(rowidx)
                    listText.append(text)
                    listWordCounts.append(rowWordsCount)
        listWebsites.append(website)
        listWebsitesWordCounts.append(localWordsCount)
        print(f"- this website contributed {localWordsCount} words to the dataset")
        wordsCount = wordsCount + localWordsCount

    print("Saving the complete dataset ...")
    datasetdf = pd.DataFrame({"SiteIndex": listSiteIndex, "RowIndex" : listRowIndex, "Text":listText, "WordsCount":listWordCounts})
    print(f"- {charsCount} characters, {wordsCount} words, {len(datasetdf)} text blocks")
    print(f"- dataset size in memory : {_format_size_mb(_memory_size(datasetdf))} MB")
    datasetfile = rootdir / "dataset.dataframe.feather"
    datasetdf.to_feather(datasetfile)
    print(f"- dataset size on disk : {_format_size_mb(_file_size(datasetfile))} MB")
    
    vocabdf = saveVocabulary(rootdir, vocabdict)
    charsetdf = saveCharset(rootdir, vocabdf)

    statsdf = pd.DataFrame({"Website": listWebsites,"WordsCount":listWebsitesWordCounts})    
    statsdf["Percent"] = statsdf["WordsCount"].apply(lambda w:w/wordsCount*100)
    statsdf.sort_values("Percent",ascending=False,inplace=True)
    statsfile = rootdir / "stats.csv"
    statsdf.to_csv(statsfile)    
    
def loadDataset(rootdir):
    datasetfile = rootdir / "dataset.dataframe.feather"
    return pd.read_feather(datasetfile)
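
The deduplication and length filters can be illustrated on a handful of strings. This is a minimal sketch, assuming whitespace splitting as a stand-in for the spaCy tokenizer and leaving out language detection; `filter_text_blocks` and the sample texts are hypothetical.

```python
from hashlib import md5

def filter_text_blocks(texts, min_words=5):
    """Keep the first occurrence of each text that has at least min_words words."""
    seen = set()
    kept = []
    for text in texts:
        digest = md5(text.encode()).digest()
        if digest in seen:
            continue  # exact duplicate already seen for this website
        seen.add(digest)
        # Assumption: whitespace splitting stands in for the spaCy tokenizer
        if len(text.split()) >= min_words:
            kept.append(text)
    return kept

blocks = [
    "Ouvrir un compte bancaire en ligne",
    "Nos offres",
    "Ouvrir un compte bancaire en ligne",
    "Qu'est-ce que la banque à distance ?",
]
kept = filter_text_blocks(blocks)
print(kept)
```

Storing the 16-byte digest instead of the text itself keeps the set of already seen blocks small, which matters when a website contributes many long text blocks.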

Let's use all these functions to create our dataset. Depending on the amount of data, this can take SEVERAL HOURS.

In [None]:
createDatasetFromWebsites(rootdir,websites)

## 5. Study vocabulary and tokenizer performance

### 5.1 Unknown words and proper nouns

In [329]:
def listUnknownWordsAndProperNouns(vocabdf):
    uwords = vocabdf[(vocabdf["CommonTags"] == "None") | (vocabdf["CommonTags"].str.contains("PROPN"))].copy()
    uwords["Length"] = uwords["Word"].apply(lambda w: len(w))
    uwords["isSpace"] = uwords["Word"].apply(lambda w: w.isspace())
    uwords["CharName"] = uwords["Word"].apply(lambda w: charname(w) if len(w)==1 else "")
    uwords.to_csv(rootdir / "specificwords.csv",sep=";")
    return uwords

In [330]:
vocabdf = loadVocabulary(rootdir)
specificwords = listUnknownWordsAndProperNouns(vocabdf)
specificwords.head(30)



Unnamed: 0,Word,Count,LefffTags,DicollecteTags,CommonTags,Percent,Length,isSpace,CharName
13,,52198,,,,26.22981,1,True,NO-BREAK SPACE
51,€,11834,,,,43.56146,1,False,EURO SIGN
69,…,7794,,,,46.390409,1,False,HORIZONTAL ELLIPSIS
70,!,7773,,,,46.517784,1,False,EXCLAMATION MARK
81,France,6961,PROPN,NOUN,NOUN|PROPN,47.844507,6,False,
155,–,3940,,,,54.040798,1,False,EN DASH
173,,3675,,,,55.15367,1,True,SPACE
178,Paris,3578,PROPN,PROPN,PROPN,55.451192,5,False,
239,\n,2580,,,,58.419299,1,True,CHAR 10
251,Epargne,2443,,,,58.9114,7,False,


In [338]:
import re

def getContextAroundWord(text,word,ctxsize=20):
    # text.index returns the position of the first occurrence of the substring
    idx = text.index(word)
    start = max(idx-ctxsize,0)
    end = min(idx+ctxsize,len(text))
    return text[start:end+1]

def sampleTextBlocksWithChar(textdf,char,count=100,ctxsize=20):
    textsWithWord = textdf[textdf["Text"].str.contains(char,regex=False)]
    # Sample at most count rows, and copy before adding a column
    textsWithWord = textsWithWord.sample(min(count,len(textsWithWord))).copy()
    textsWithWord["Context"] = textsWithWord["Text"].apply(lambda t: getContextAroundWord(t,char,ctxsize))
    return textsWithWord

def sampleTextBlocksWithWord(textdf,word,count=100,ctxsize=20):
    # Escape the word so that regex metacharacters in it are matched literally
    textsWithWord = textdf[textdf["Text"].str.contains("\\b"+re.escape(word)+"\\b")]
    # Sample at most count rows, and copy before adding a column
    textsWithWord = textsWithWord.sample(min(count,len(textsWithWord))).copy()
    textsWithWord["Context"] = textsWithWord["Text"].apply(lambda t: getContextAroundWord(t,word,ctxsize))
    return textsWithWord
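
Note that `getContextAroundWord` locates the first occurrence of the substring, which may sit inside a longer word even when the text block was selected with a whole-word match. A possible alternative, sketched here as an assumption rather than as the notebook's method, centers the context on the first whole-word match. `context_around_match` is a hypothetical helper.

```python
import re

def context_around_match(text, word, ctxsize=20):
    """Return the first whole-word match of word with ctxsize chars on each side, or None."""
    match = re.search(r"\b" + re.escape(word) + r"\b", text)
    if match is None:
        return None
    start = max(match.start() - ctxsize, 0)
    end = min(match.end() + ctxsize, len(text))
    return text[start:end]

sample = "Un décompte précis, puis un compte bancaire en ligne."
ctx = context_around_match(sample, "compte", ctxsize=10)
print(repr(ctx))
# The plain substring search stops inside "décompte" instead
print(sample.index("compte"))
```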

In [253]:
dataset = loadDataset(rootdir)



In [259]:
textsWithWord = sampleTextBlocksWithWord(dataset,"Mutuel")
textsWithWord.head(20)

Unnamed: 0,SiteIndex,RowIndex,Text,WordsCount,Context
78777,51,23427,Crédit Mutuel Arkéa - Nos métiers - ABEI - Pag...,12,Crédit Mutuel Arkéa - Nos mé
205845,115,6655,Livret Bienvenue de Crédit Mutuel,5,Bienvenue de Crédit Mutuel
133173,79,5677,Mettre de l'argent de côté permet de faire fac...,47,"ng terme. Au Crédit Mutuel, plusieurs sol"
77897,51,4291,"Jean-Pierre Denis, Président du Crédit Mutuel ...",98,Président du Crédit Mutuel Arkéa et du Cr
132852,79,2811,1 er réseau bancaire né de la volonté des Prof...,51,"ur elles, le Crédit Mutuel des Profession"
42475,26,3033,"En 1984, suite à une demande déposée par les a...",47,ricole et le Crédit Mutuel le Groupement
58770,38,7560,2 nouveaux contrats complètent la gamme ! mes-...,51,(filiale du Crédit Mutuel Arkéa) mes-pla
130608,77,9913,Frais bancaires Crédit Mutuel Normandie,5,is bancaires Crédit Mutuel Normandie
78064,51,7639,Crédit Mutuel Arkéa - Présentation de la direc...,14,Crédit Mutuel Arkéa - Présen
78593,51,17742,Assurances - Crédit Mutuel Arkéa 1,6,Assurances - Crédit Mutuel Arkéa 1


### 5.2 Most common nouns

In [262]:
def listMostCommonNouns(vocabdf,count):
    cwords = vocabdf[vocabdf["CommonTags"].str.contains("NOUN")]
    cwords = cwords[:count].copy()
    cwords.to_csv(rootdir / "commonwords.csv",sep=";")
    return cwords

In [263]:
nouns = listMostCommonNouns(vocabdf,5000)
nouns.head(20)

Unnamed: 0,index,Word,Count,LefffTags,DicollecteTags,CommonTags,Percent
4,2,la,109538,PRON|NOUN|DET,DET|PRON|NOUN,NOUN|DET|PRON,15.034524
10,183,un,66670,DET|NOUN|PRON|NUM,DET|NOUN,NOUN|DET|PRON|NUM,23.454135
15,42,pour,49910,NOUN|ADP,ADP,NOUN|ADP,27.900134
17,40,une,48932,DET|PRON|NOUN|NUM,DET|NOUN,NOUN|DET|PRON|NUM,29.518757
18,64,est,45859,ADJ|NOUN|AUX|VERB,NOUN|AUX,NOUN|ADJ|AUX|VERB,30.270247
27,61,par,35796,NOUN|ADP,ADP,NOUN|ADP,36.230518
30,256,dans,28412,ADP,ADP|NOUN,NOUN|ADP,37.749787
32,91,plus,21789,VERB|ADV|CCONJ|NOUN,ADV|NOUN|VERB,NOUN|ADV|CCONJ|VERB,38.51047
37,2681,assurance,18147,NOUN,NOUN,NOUN,40.125341
41,176,%,16426,NOUN,,NOUN,41.250995


In [265]:
textsWithWord = sampleTextBlocksWithWord(dataset,"compte")
textsWithWord.head(20)

Unnamed: 0,SiteIndex,RowIndex,Text,WordsCount,Context
93995,60,9623,La gamme d’épargne de Boursorama Banque n’est ...,50,"t A, LDD, PEL, CEL, compte sur livret et"
35560,22,15286,"Si vous le souhaitez, l ' agence bancaire qui ...",68,e de vous ouvrir un compte vous fait remp
43995,26,6892,Une autre menace dont il faut tenir compte con...,51,dont il faut tenir compte concerne cette
85478,54,9201,L’heure des résultats et de la délivrance est ...,101,emière ouverture de compte incluant une c
108570,67,4777,"La fidélité compte, nous la reconnaissons.",8,"La fidélité compte, nous la recon"
144570,88,69272,Comparatif des rendements des OPCI. Ces placem...,101,"u en direct, via un compte-titres, ou via"
130391,77,9111,Si la banque du Crédit Mutuel ( où le compte d...,104,édit Mutuel ( où le compte de cette dame
112199,70,12258,Pour ouvrir un compte bancaire en ligne chez M...,90,Pour ouvrir un compte bancaire en li
149630,91,24,« Quelles sont les conditions pour ouvrir un c...,14,ions pour ouvrir un compte bancaire ? »
111879,70,8552,« Tout va bien quand il n'y a pas de problème ...,24,de problème sur le compte. J'ai un compt


### 5.3 Separator chars and tokenizer rules

Candidate separator chars :

In [336]:
def listSeparatorChars(charsetdf):
    sepchars = charsetdf[(charsetdf["isAlpha"] == False) & (charsetdf["isDigit"] == False)].copy()
    sepchars["Name"] = sepchars["Char"].apply(lambda c:c.isspace()) 
    return sepchars

In [337]:
charset = loadCharset(rootdir)
separatorChars = listSeparatorChars(charset)
separatorChars[:30]

Unnamed: 0,Code,Count,Char,CharName,isAlpha,isDigit,isSpace,Percent,Name
18,44,227594,",",COMMA,False,False,False,87.11154,False
20,46,209020,.,FULL STOP,False,False,False,88.665295,False
23,8217,151057,’,RIGHT SINGLE QUOTATION MARK,False,False,False,90.437958,False
27,39,99110,',APOSTROPHE,False,False,False,92.088362,False
35,45,70047,-,HYPHEN-MINUS,False,False,False,94.560573,False
40,32,59143,,SPACE,False,False,True,95.668293,True
41,160,55700,,NO-BREAK SPACE,False,False,True,95.869548,True
42,58,55099,:,COLON,False,False,False,96.068632,False
48,41,42072,),RIGHT PARENTHESIS,False,False,False,97.058604,False
49,40,42034,(,LEFT PARENTHESIS,False,False,False,97.210481,False
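
The properties shown in this table come straight from the Python standard library, so a single character can be inspected without loading the charset file. A minimal sketch, where `describe_char` and `is_candidate_separator` are hypothetical helper names:

```python
from unicodedata import name as unicodename

def describe_char(char):
    """Return the properties used to spot candidate separator characters."""
    return {
        "Code": ord(char),
        "CharName": unicodename(char, f"CHAR {ord(char)}"),
        "isAlpha": char.isalpha(),
        "isDigit": char.isdigit(),
        "isSpace": char.isspace(),
    }

def is_candidate_separator(char):
    # Same criterion as listSeparatorChars: neither a letter nor a digit
    return not char.isalpha() and not char.isdigit()

for char in ["-", "\u00a0", "’", "é", "7"]:
    print(describe_char(char), is_candidate_separator(char))
```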


Test the tokenizer behavior with each separator char :

In [350]:
nlp = spacy_InitWithTokenizer()

In [436]:
def searchCharInTokens(dataset,nlp,char,count):
    dataset4c = sampleTextBlocksWithChar(dataset,char,count)    
    listSplits = []
    listBefore = []
    listAfter = []
    for rowidx,text in dataset4c["Text"].items():
        doc = nlp(text)
        splits = True
        before = ""
        after = ""
        for idx,token in enumerate(doc):
            if token.text == char:
                before = "" if idx==0 else doc[idx-1].text
                after = "" if idx==(len(doc)-1) else doc[idx+1].text
                break
            elif char in token.text:
                parts = token.text.split(char)
                before = parts[0]
                if before == "":
                    before = "" if idx==0 else doc[idx-1].text + "<<"
                after = parts[1]
                if after == "":
                    after = "" if idx==(len(doc)-1) else ">>" + doc[idx+1].text
                splits = False
                break        
        listSplits.append(splits)
        listBefore.append(before)
        listAfter.append(after)        
    dataset4c["Splits"] = listSplits
    dataset4c["Before"] = listBefore
    dataset4c["After"] = listAfter
    return dataset4c

In [437]:
char = "-"
dataset4c = searchCharInTokens(dataset,nlp,char,10000)
dataset4c.head(20)

Unnamed: 0,SiteIndex,RowIndex,Text,WordsCount,Context,Splits,Before,After
146797,89,16624,Documents d'informations : Document d'Informat...,63,rmations Clés (DIC) - Gan Performance Ret,True,),Gan
200382,113,6006,"Le tremblement de terre de magnitude 5,2, surv...",41,"ire de Chinon (Indre-et-Loire), a déclaré",False,Indre,et
206538,115,10611,Pour trouver l'assurance-vie qui correspond à ...,182,trouver l'assurance-vie qui correspond à,True,assurance,vie
196314,109,8575,Monte Paschi Banque - Livret de développement ...,10,Monte Paschi Banque - Livret de développe,True,Banque,Livret
207225,115,26098,- Label d’Excellence 2014 pour le compte coura...,20,- Label d’Excellence,True,,Label
44948,26,9943,Vous souhaitez effectuer une demande de micro-...,32,une demande de micro-crédit en ligne aupr,False,micro,crédit
209761,117,6680,L’assurance-vie (fonds en euros ou unités de c...,14,L’assurance-vie (fonds en euros,True,assurance,vie
68992,44,24865,"*les dispositions s'appliquent également, comp...",48,territoires d'outre-mer ou à l'étranger*,True,outre,mer
26291,19,4356,"Cofondateur du Palais de Tokyo, ancien directe...",63,"directeur des Beaux-Arts de Paris, ce th",False,Beaux,Arts
158501,95,10283,ING peut percevoir des rétrocessions de la par...,112,ment à l’article 314-76 du règlement géné,True,314,76
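
To read the `Splits`, `Before` and `After` columns, it helps to run the same classification on plain lists of token strings. The sketch below mirrors the logic of `searchCharInTokens` without spaCy. The `locate_char` helper is hypothetical and the token lists are hand-written examples, not actual tokenizer output.

```python
def locate_char(tokens, char):
    """Return (splits, before, after) for the first token containing char.

    splits is True when char is a token of its own, False when it is
    embedded in a longer token. If char is absent, returns (True, "", "").
    """
    for idx, token in enumerate(tokens):
        prev_tok = tokens[idx - 1] if idx > 0 else ""
        next_tok = tokens[idx + 1] if idx < len(tokens) - 1 else ""
        if token == char:
            return True, prev_tok, next_tok
        if char in token:
            parts = token.split(char)
            # "<<" and ">>" mark a neighbour taken from the adjacent token
            before = parts[0] or (prev_tok + "<<" if prev_tok else "")
            after = parts[1] or (">>" + next_tok if next_tok else "")
            return False, before, after
    return True, "", ""

print(locate_char(["l'", "assurance", "-", "vie"], "-"))
print(locate_char(["un", "rendez-vous"], "-"))
print(locate_char(["Qu’", "est", "-ce", "que"], "-"))
```

The last call shows where a value such as `est<<` comes from: the separator stayed glued to the following token, so the text before it is taken from the previous token.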


Most frequent tokens before, after and around the separator, first when the tokenizer isolates it as a separate token (three tokens), then when it leaves it inside a longer token (no split) :

In [438]:
def exploreBeforeAfterSeparator(dataset4c,splits,columns):
    return dataset4c[dataset4c["Splits"] == splits].groupby(columns).agg({'Text':['count']})["Text"].sort_values("count",ascending=False)
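
The aggregation performed by `exploreBeforeAfterSeparator` amounts to counting distinct values among the rows that share a given `Splits` flag. A minimal sketch with `collections.Counter`, assuming hypothetical `(Splits, Before, After)` tuples in place of the DataFrame rows:

```python
from collections import Counter

# Hypothetical (Splits, Before, After) rows, shaped like the searchCharInTokens output
rows = [
    (True, "assurance", "vie"),
    (True, "Etats", "Unis"),
    (True, "assurance", "vie"),
    (False, "rendez", "vous"),
    (False, "assurance", "vie"),
]

def count_contexts(rows, splits, fields):
    """Count Before, After or (Before, After) values among rows with the given Splits flag."""
    columns = {"Before": 1, "After": 2}
    counter = Counter(
        tuple(row[columns[f]] for f in fields) for row in rows if row[0] == splits
    )
    return counter.most_common()

around_splits = count_contexts(rows, True, ["Before", "After"])
before_nosplit = count_contexts(rows, False, ["Before"])
print(around_splits)
print(before_nosplit)
```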

In [439]:
beforeSplits = exploreBeforeAfterSeparator(dataset4c,True,["Before"])
beforeSplits.head(30)

Before,count
,300
assurance,219
Etats,121
Jean,97
),91
Saint,89
:,63
faut,53
Assurance,47
peut,46


In [440]:
afterSplits = exploreBeforeAfterSeparator(dataset4c,True,["After"])
afterSplits.head(30)

After,count
Mis,425
vous,341
vie,210
Unis,161
il,147
nous,117
on,73
t,67
je,65
ils,62


In [441]:
aroundSplits = exploreBeforeAfterSeparator(dataset4c,True,["Before","After"])
aroundSplits.head(30)

Before,After,count
assurance,vie,205
Etats,Unis,121
faut,il,53
États,Unis,39
Peut,on,39
Assurance,Vie,36
PEA,PME,33
mes,placements,33
Faut,il,29
dois,je,24


In [442]:
beforeNoSplit = exploreBeforeAfterSeparator(dataset4c,False,["Before"])
beforeNoSplit.head(30)

Before,count
est<<,197
ci,160
non,124
e,122
au,118
rendez,114
start,107
sous,107
plus,100
celui,93


In [443]:
afterNoSplit = exploreBeforeAfterSeparator(dataset4c,False,["After"])
afterNoSplit.head(30)

After,count
ci,260
ce,213
vous,154
vie,151
delà,133
up,114
à,112
même,100
dessous,95
values,82


In [449]:
aroundNoSplit = exploreBeforeAfterSeparator(dataset4c,False,["Before","After"])
aroundNoSplit.head(30)

Before,After,count
est<<,ce,192
rendez,vous,114
start,up,107
celui,ci,93
ci,dessous,81
au,delà,81
assurance,vie,79
plus,values,74
Assurance,vie,61
celle,ci,60


Find examples in context :

In [452]:
dataset4c[(dataset4c["Before"]=="est<<") & (dataset4c["After"]=="ce")]["Context"].sample(30)

123583          onsommation : qu’est-ce que c’est ?
185271                  Qu’est-ce qu’un contrat d'a
164688                  Qu'est-ce que la défiscalis
177974    ébut d’année. Qu’est-ce que le fichier de
8761                      Qu'est-ce qu'un artisan ?
151572            Pourquoi est-ce important : lors 
17323                   Qu'est-ce qu'un crédit reno
188674                  Qu’est-ce que le tiers paya
43241     r. Mais alors qu’est-ce qu’un CFD et comm
132609                  Qu'est-ce que la banque à d
1327                    Qu’est-ce qu’une résidence 
186802                  Qu'est-ce qu'un conducteur 
132018     » PTZ 2019 | Qu’est-ce que le Prêt à Tau
178253               7. Qu’est-ce que le Transfert 
72771                   Qu’est-ce que la prévoyance
5504                       iant Cthulhu, qu’est-ce…
138234                  Qu’est-ce que le bonus malu
159029                  Qu’est-ce qui change avec m
17453                   Qu'est-ce que l'assurance p
123736      