This script automates the processing of text files in three steps:

Text file cleaning: collapses repeated whitespace and removes page numbers and line breaks.

Sentence segmentation: uses NLP tools (stanza and spacy) to split the text into sentences.

CSV file creation: collects the segmented sentences, labeled by language, into a structured CSV file.

### Imports

os for system operations (for example, walking through folders and creating directories).

re for regular-expression operations (for example, cleaning the text).

pandas for data manipulation and analysis, in particular building the CSV file.

stanza and spacy for natural language processing (NLP), in particular sentence segmentation.
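Before running the cells below, it can help to confirm that the third-party packages are importable. The following is a minimal sketch and not part of the original notebook: the helper name `missing_packages` is hypothetical, and it relies only on the standard library's `importlib`.

```python
from importlib.util import find_spec

# Packages the notebook imports; os and re ship with Python itself.
REQUIRED = ["os", "re", "pandas", "stanza", "spacy"]


def missing_packages(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

This check covers the packages only. The language models named later (ja_core_news_sm, zh_core_web_sm, en_core_web_sm, and the stanza Arabic model) are separate downloads and are not part of this list.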

In [3]:
import os
import re

import pandas as pd
import spacy
import stanza



### Text Cleaning

In [4]:
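# Clean raw text extracted from a document
def clean_text(text):
    # Remove page markers first so they do not leave doubled spaces behind
    text = re.sub(r"Page \d+", "", text)
    # Collapse every whitespace run (spaces, tabs, line breaks) into one space
    text = re.sub(r"\s+", " ", text)
    return text.strip()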

This function normalizes the text: it collapses every run of whitespace, line breaks included, into a single space, removes page markers of the form "Page N", and strips leading and trailing whitespace.
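The order of the two substitutions matters. The sketch below compares both orders on a made-up sample string; the function names `clean_collapse_first` and `clean_markers_first` are hypothetical and exist only for this comparison.

```python
import re


def clean_collapse_first(text):
    # Collapse whitespace first, then drop page markers.
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"Page \d+", "", text)
    return text.strip()


def clean_markers_first(text):
    # Drop page markers first, then collapse whitespace.
    text = re.sub(r"Page \d+", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


sample = "First line.\nPage 12\nSecond   line.  "
collapse_first = clean_collapse_first(sample)
markers_first = clean_markers_first(sample)
print(repr(collapse_first))  # 'First line.  Second line.' (two spaces remain)
print(repr(markers_first))   # 'First line. Second line.'
```

Collapsing whitespace before removing the marker leaves the spaces that surrounded "Page 12" side by side, so removing markers first gives the cleaner result.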

### Cleaning the Text Files

In [5]:
# Clean every .txt file found under input_folder
def clean_files(input_folder, output_folder):
    # All cleaned files go into a single flat "fichiers_clean" subfolder
    clean_dir = os.path.join(output_folder, "fichiers_clean")
    os.makedirs(clean_dir, exist_ok=True)
    for root, _, files in os.walk(input_folder):
        for file in files:
            if file.endswith(".txt"):
                input_file_path = os.path.join(root, file)
                output_file_path = os.path.join(clean_dir, file)
                with open(input_file_path, "r", encoding="utf-8") as f:
                    text = f.read()
                clean_content = clean_text(text)
                with open(output_file_path, "w", encoding="utf-8") as f:
                    f.write(clean_content)
    print("Tous les fichiers ont été nettoyés et enregistrés dans le répertoire : '{}'.".format(clean_dir))

Walks through every text file in the input folder (input_folder), including its subfolders.

Cleans each file's content with the clean_text function.

Saves the cleaned files in a "fichiers_clean" subfolder of the output folder (output_folder).

Creates the "fichiers_clean" folder if it does not already exist.
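To see the resulting layout without touching real data, the steps above can be run against a temporary directory. This is a minimal sketch under stated assumptions: `clean_tree` is a hypothetical, compact stand-in for clean_files that returns the written file names, and the file names doc_en.txt and notes.md are invented for the demonstration.

```python
import os
import re
import tempfile


def clean_text(text):
    text = re.sub(r"Page \d+", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def clean_tree(input_folder, clean_dir):
    """Clean every .txt file under input_folder into the flat folder clean_dir."""
    os.makedirs(clean_dir, exist_ok=True)
    written = []
    for root, _, files in os.walk(input_folder):
        for name in files:
            if name.endswith(".txt"):
                with open(os.path.join(root, name), "r", encoding="utf-8") as f:
                    text = f.read()
                with open(os.path.join(clean_dir, name), "w", encoding="utf-8") as f:
                    f.write(clean_text(text))
                written.append(name)
    return sorted(written)


with tempfile.TemporaryDirectory() as tmp:
    # Build a small nested input tree: one .txt file and one file to ignore.
    raw = os.path.join(tmp, "raw", "sub")
    os.makedirs(raw)
    with open(os.path.join(raw, "doc_en.txt"), "w", encoding="utf-8") as f:
        f.write("Hello\nPage 1\nworld.")
    with open(os.path.join(raw, "notes.md"), "w", encoding="utf-8") as f:
        f.write("ignored")

    clean_dir = os.path.join(tmp, "results", "fichiers_clean")
    written = clean_tree(os.path.join(tmp, "raw"), clean_dir)
    with open(os.path.join(clean_dir, "doc_en.txt"), encoding="utf-8") as f:
        cleaned = f.read()
    output_listing = sorted(os.listdir(clean_dir))

print(written, output_listing, repr(cleaned))
```

Because the output folder is flat, two input files with the same name in different subfolders would be written to the same path, and the later one would overwrite the earlier one.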

### Sentence Segmentation

In [6]:
# Split a text into sentences with the tool suited to its language
def segment_sentences(text, language):
    if language == "ar":
        # tokenize_no_ssplit is left at its default (False): setting it to True
        # disables stanza's sentence segmentation and only splits on blank lines
        nlp = stanza.Pipeline(lang="ar", processors="tokenize")
        doc = nlp(text)
        sentences = [" ".join([token.text for token in sentence.tokens]) for sentence in doc.sentences]
    elif language == "ja":
        nlp = spacy.load("ja_core_news_sm")
        doc = nlp(text)
        sentences = [sent.text for sent in doc.sents]
    elif language == "zh":
        nlp = spacy.load("zh_core_web_sm")
        doc = nlp(text)
        sentences = [sent.text for sent in doc.sents]
    else:
        # Fallback for every other language code: the English model
        nlp = spacy.load("en_core_web_sm")
        doc = nlp(text)
        sentences = [sent.text for sent in doc.sents]
    return sentences

Dedicated models are used for Arabic, Japanese, and Chinese because of their distinct linguistic structures. The English model is used as the default for every other language; this is a pragmatic fallback, and since that model was trained on English, its sentence boundaries for other languages should be treated as approximate.

This function segments the text into sentences using stanza or spacy, depending on the language:

Uses stanza for Arabic (ar).

Uses spacy for Japanese (ja), Chinese (zh), and every other language (the English model by default).
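The function above loads a model on every call, which is costly when many files share a language. One possible way to avoid that is to separate the choice of model from its loading and cache the loaded pipeline. This is a sketch, not the notebook's implementation: the names `pick_backend`, `load_backend`, and `get_pipeline` are hypothetical, while the model names are the ones used above.

```python
from functools import lru_cache

# Language code -> spacy model name, as used in segment_sentences.
SPACY_MODELS = {"ja": "ja_core_news_sm", "zh": "zh_core_web_sm"}
DEFAULT_SPACY_MODEL = "en_core_web_sm"


def pick_backend(language):
    """Return (library, model name or language code) for a language code."""
    if language == "ar":
        return ("stanza", "ar")
    return ("spacy", SPACY_MODELS.get(language, DEFAULT_SPACY_MODEL))


@lru_cache(maxsize=None)
def load_backend(library, name):
    """Load a pipeline once; later calls with the same arguments reuse it."""
    if library == "stanza":
        import stanza  # imported lazily so pick_backend works without it
        return stanza.Pipeline(lang=name, processors="tokenize")
    import spacy
    return spacy.load(name)


def get_pipeline(language):
    """Return the cached pipeline that handles the given language code."""
    return load_backend(*pick_backend(language))


print(pick_backend("ar"), pick_backend("ja"), pick_backend("fr"))
```

Caching on the (library, name) pair rather than on the language code means that all fallback languages share a single loaded English model.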

### Creating a CSV File

In [7]:
# Build a CSV file of (language, sentence) rows from the cleaned files
def create_csv(input_folder, output_csv):
    data = []
    for file_name in os.listdir(input_folder):
        if file_name.endswith(".txt"):
            # The language code is the second "_"-separated part of the file name
            parts = file_name.split("_")
            if len(parts) >= 2:
                language = parts[1].split(".")[0]
                with open(os.path.join(input_folder, file_name), "r", encoding="utf-8") as file:
                    text = file.read()
                sentences = segment_sentences(text, language)
                data.extend([(language, sentence) for sentence in sentences])
    df = pd.DataFrame(data, columns=["labels", "text"])
    # Shuffle the rows so the languages are interleaved
    df = df.sample(frac=1).reset_index(drop=True)
    # Create the parent folder only when the path actually contains one
    output_dir = os.path.dirname(output_csv)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
    df.to_csv(output_csv, index=False, encoding="utf-8")
    print(f"Le fichier de sortie CSV est bien généré : {output_csv}")

This function:

Reads every cleaned text file in the input folder (input_folder).

Segments the text into sentences with segment_sentences.

Creates a DataFrame with two columns: "labels" (the language) and "text" (the segmented sentences).

Shuffles the rows of the DataFrame.

Saves the DataFrame as a CSV file at the given path (output_csv).

Creates the required directories if they do not already exist.
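Two of the steps above can be illustrated in isolation: reading the language code from a file name, and writing shuffled rows under the "labels" and "text" headers. The sketch below deliberately swaps pandas for the standard library's csv and random modules so that it runs anywhere. The names `language_from_filename` and `rows_to_csv`, the sample file names, the sample sentences, and the fixed seed are all assumptions made for the illustration.

```python
import csv
import io
import random


def language_from_filename(file_name):
    """Return the language code from a name like 'corpus_fr.txt', else None."""
    if not file_name.endswith(".txt"):
        return None
    parts = file_name.split("_")
    if len(parts) < 2:
        return None
    return parts[1].split(".")[0]


def rows_to_csv(rows, seed=0):
    """Shuffle (language, sentence) rows and return them as CSV text."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["labels", "text"])
    writer.writerows(shuffled)
    return buffer.getvalue()


rows = [("fr", "Bonjour."), ("en", "Hello, world."), ("ja", "こんにちは。")]
csv_text = rows_to_csv(rows)
print(language_from_filename("corpus_fr.txt"))
print(csv_text)
```

A file name without an underscore yields no language and is skipped, which mirrors the `len(parts) >= 2` check in create_csv. The fixed seed makes the shuffle repeatable here; the notebook's `df.sample(frac=1)` call sets no seed, so its row order changes from run to run.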

In [8]:
def main():
    input_folder = "../raw/results/"
    output_folder = "./results"
    clean_files(input_folder, output_folder)
    create_csv(os.path.join(output_folder, "fichiers_clean"), os.path.join(output_folder, "CSV", "result.csv"))

if __name__ == "__main__":
    main()

Tous les fichiers ont été nettoyés et enregistrés dans le répertoire : './results/fichiers_clean'.


2024-05-19 00:05:32 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

2024-05-19 00:05:33 INFO: Downloaded file to /home/zia/stanza_resources/resources.json
2024-05-19 00:05:33 INFO: Loading these models for language: ar (Arabic):
| Processor | Package |
-----------------------
| tokenize  | padt    |
| mwt       | padt    |

2024-05-19 00:05:33 INFO: Using device: cpu
2024-05-19 00:05:33 INFO: Loading: tokenize
2024-05-19 00:05:33 INFO: Loading: mwt
2024-05-19 00:05:33 INFO: Done loading processors!
[The same loading messages repeat 19 more times, with timestamps from 00:05:41 to 00:07:59.]


Le fichier de sortie CSV est bien généré : ./results/CSV/result.csv


The script's main function:

Sets the paths of the input (input_folder) and output (output_folder) folders.

Calls clean_files to clean the text files.

Calls create_csv to segment the sentences and create the CSV file.

Conditional execution: when the script is run directly, main is called, which launches the full cleaning, segmentation, and CSV creation process.
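The folder paths are hard-coded in main. If the script were run from the command line, they could instead be exposed as options. The following is a sketch only: the flag names `--input-folder` and `--output-folder` and the helper `build_parser` are hypothetical, and the defaults are the two paths used in main.

```python
import argparse


def build_parser():
    """Build a parser whose defaults match the paths hard-coded in main."""
    parser = argparse.ArgumentParser(
        description="Clean text files, segment sentences, and build a CSV."
    )
    parser.add_argument(
        "--input-folder",
        default="../raw/results/",
        help="folder containing the raw .txt files",
    )
    parser.add_argument(
        "--output-folder",
        default="./results",
        help="folder that receives fichiers_clean/ and CSV/result.csv",
    )
    return parser


# Parsing an empty argument list returns the defaults.
args = build_parser().parse_args([])
print(args.input_folder, args.output_folder)
```

With such a parser, main would pass `args.input_folder` and `args.output_folder` to clean_files and create_csv instead of the literal strings.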