Wiklassify - Detection of Textual Incoherences project
This repository gathers the work related to the detection of textual incoherences in text samples retrieved from French Wikipedia dumps. It contains:
- Jupyter notebooks (with Python 3)
- `XX-meta-historyX.xml-pXXXXpXXXX.tsv` files obtained from `.xml` French Wikipedia dumps
- `XX_XX_sample.tsv` annotated datasets with labels
- Description of the models and of the data
Wikipedia dump `.xml` files are downloaded from https://dumps.wikimedia.org/frwiki/20170801/ (obsolete link; similar content is available at https://dumps.wikimedia.org/frwiki/20171201/, and all history versions are listed at https://dumps.wikimedia.org/frwiki/). Each compressed part of the edit history can be obtained and extracted with the following lines of code:
```shell
wget https://dumps.wikimedia.org/frwiki/20171001/frwiki-20171001-pages-meta-history1.xml-p3p3581.7z
7z x frwiki-20171001-pages-meta-history1.xml-p3p3581.7z
```
A file with the current version of all Wikipedia articles can be obtained with the commands below. It serves both as the training text for the skip-gram model and as ground truth.
```shell
wget https://dumps.wikimedia.org/frwiki/20171001/frwiki-20171001-pages-meta-current.xml.bz2
bzip2 -dk frwiki-20171001-pages-meta-current.xml.bz2
```
Extracts modifications between two consecutive edits from `.xml` files and turns all samples into a pandas dataframe saved as a `XX-meta-historyX.xml-pXXXXpXXXX.tsv` file stored in the `/tsv_output` folder. Each output file contains around 1.5 million pairs of text versions. An input `.xml` file has an average compressed size of 250 MB, which rises to around 28 GB when decompressed. The average time for extracting version pairs is 5 hours (the median is 3 hours).
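The core of this step, comparing two consecutive revision texts and keeping only the fragments that changed, can be sketched with Python's standard `difflib`. This is a simplified illustration; the actual notebooks may use a different diff strategy:

```python
import difflib

def extract_modifications(old_text, new_text):
    """Return (old_fragment, new_fragment) pairs for the word-level
    regions that differ between two consecutive revisions."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # 'replace', 'delete' or 'insert'
            pairs.append((" ".join(old_words[i1:i2]),
                          " ".join(new_words[j1:j2])))
    return pairs

old = "La tour Eiffel mesure 300 metres de haut"
new = "La tour Eiffel mesure 324 metres de haut"
print(extract_modifications(old, new))  # [('300', '324')]
```

Word-level matching keeps the extracted fragments short, which is what later steps need when treating each modified fragment as an observation.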
Reads each `.tsv` file, filters irrelevant edits, and randomly draws samples from a few files to generate small datasets of 100 text version pairs that are stored in two ways: `XX_XX_pre_annot.csv` for human annotation tasks and `XX_XX_sample.tsv` for experiments in Jupyter notebooks. They are saved in
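The filtering and sampling step can be sketched with pandas as below. The column names and the filtering criterion are hypothetical (the real `.tsv` schema may differ), and a toy dataframe stands in for a real extracted file:

```python
import pandas as pd

# Toy stand-in for an extracted .tsv file; column names are hypothetical.
df = pd.DataFrame({
    "old_text": [f"version {i}" for i in range(10)],
    "new_text": [f"version {i} edited" for i in range(10)],
    "comment":  ["typo"] * 5 + ["revert"] * 5,
})

# Filter out irrelevant edits (here: reverts), then draw a fixed-size
# random sample; the pipeline draws 100 pairs per dataset.
relevant = df[df["comment"] != "revert"]
sample = relevant.sample(n=3, random_state=42)  # n=100 in the real pipeline
sample.to_csv("XX_XX_sample.tsv", sep="\t", index=False)
```

Fixing `random_state` makes the drawn dataset reproducible across runs.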
Visualizes the content of each set of 100 text version pairs and assists the annotator in labelling each sample with `InterfaceAnnotation.exe`. To use that software, load `/path/classes.csv` as the Classification plan and `/path/XX_XX_pre_annot.csv` as the Corpus CSV file. Then it suffices to tick the boxes corresponding to the relevant labels for each observation in the left panel. The right panel displays the `sample_id` to ensure that the annotator is labelling the right observation. After each session, the labelling software returns a new file renamed `XX_XX_post_annot.csv` and stored in the `annotations` folder. One observation can be tagged with one or more of 14 labels.
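Since one observation can carry several labels, a post-annotation file can be summarized with a simple multi-label count. The CSV layout below (one 0/1 column per label) is an assumption for illustration; the real file holds 14 label columns:

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of a XX_XX_post_annot.csv file: one sample per
# row, one 0/1 column per label (the real file has 14 label columns).
raw = """sample_id,incoherence,vandalism,typo
1,1,0,0
2,0,1,1
3,1,0,1
"""

counts = Counter()
for row in csv.DictReader(io.StringIO(raw)):
    for label, value in row.items():
        if label != "sample_id" and value == "1":
            counts[label] += 1

print(counts)  # Counter({'incoherence': 2, 'typo': 2, 'vandalism': 1})
```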
Attempts to identify edits related to semantic modifications. The goal is to increase the number of semantics-related edits through a semi-supervised approach.
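One round of such a semi-supervised loop can be sketched as pseudo-labelling with scikit-learn: train on the hand-labelled pairs, then keep only the confident predictions on unlabelled edits as new training data. The toy corpus, features, and threshold below are illustrative assumptions, not the method actually used in the thesis:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy edit comments: 1 = semantics-related edit, 0 = not (assumed labels).
labeled   = ["changed the date of birth", "fixed population figure",
             "corrected a typo", "removed extra whitespace"]
y         = [1, 1, 0, 0]
unlabeled = ["updated the capital city", "fixed double spaces"]

vec = TfidfVectorizer()
X = vec.fit_transform(labeled + unlabeled)
X_lab, X_unlab = X[:4], X[4:]

clf = LogisticRegression().fit(X_lab, y)
proba = clf.predict_proba(X_unlab)

# Keep only confident predictions as pseudo-labels for the next round.
confident = np.max(proba, axis=1) > 0.6  # threshold is an assumption
pseudo_labels = clf.predict(X_unlab)[confident]
```

Iterating this train/predict/absorb cycle is what grows the pool of semantics-related edits beyond the hand-labelled set.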
Within hand-labelled semantics-related text version pairs, each text fragment is set apart as a single observation. The goal is to build a model that can identify an incoherence from a single text sample. The word embeddings are obtained from a skip-gram model trained on the current version of French Wikipedia articles. This training text can be obtained with this Wikipedia extractor.
- Gensim - Python library for language modelling
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
The master thesis based on this repository is available upon request.
This project is licensed under the MIT License - see the LICENSE.md file for details.