Wiklassify - Detection of Textual Incoherences project

This repository gathers the work on detecting textual incoherences in text samples retrieved from French Wikipedia dumps.

Repository content

  • Jupyter notebooks (with Python 3)
  • Datasets
    XX-meta-historyX.xml-pXXXXpXXXX.tsv files extracted from the .xml files of French Wikipedia dumps
    XX_XX_sample.tsv annotated datasets, with their labels in XX_XX_post_annot.csv
  • Description of the models and of the data

Data

Wikipedia dump .xml files were downloaded from https://dumps.wikimedia.org/frwiki/20170801/ (this link is now obsolete; similar content is available at https://dumps.wikimedia.org/frwiki/20171201/, and all dump versions are listed at https://dumps.wikimedia.org/frwiki/). Each compressed part of the edit history can be downloaded and extracted with the following commands:

wget https://dumps.wikimedia.org/frwiki/20171001/frwiki-20171001-pages-meta-history1.xml-p3p3581.7z
7z x frwiki-20171001-pages-meta-history1.xml-p3p3581.7z

A file with the current version of all Wikipedia articles can be obtained with the commands below. It serves both as the training text for the skip-gram model and as ground truth.

wget https://dumps.wikimedia.org/frwiki/20171001/frwiki-20171001-pages-meta-current.xml.bz2
bzip2 -dk frwiki-20171001-pages-meta-current.xml.bz2
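Since the decompressed dumps are large, it can be convenient to read the .bz2 file as a stream instead of extracting it first. The following is a minimal sketch using only the Python standard library; the function names `iter_dump_lines` and `count_pages` are hypothetical and not part of this repository, and the page count assumes that each `<page>` tag sits on its own line.

```python
import bz2


def iter_dump_lines(path, encoding="utf-8"):
    """Yield decoded lines from a .bz2 file without extracting it to disk."""
    with bz2.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")


def count_pages(path):
    """Count lines that consist of a <page> opening tag (assumed one per line)."""
    return sum(1 for line in iter_dump_lines(path) if line.strip() == "<page>")
```

For instance, `count_pages("frwiki-20171001-pages-meta-current.xml.bz2")` would walk through the whole archive once while keeping only one line in memory at a time.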

Notebooks

  • nb1_xml_extractor.ipynb

Extracts the modifications between two consecutive edits from the .xml files and turns all samples into a pandas dataframe, saved as an XX-meta-historyX.xml-pXXXXpXXXX.tsv file in the /tsv_output folder. Each output file contains around 1.5 million pairs of text versions. An input .xml file takes about 250 MB on average when compressed and around 28 GB once decompressed. Extracting the version pairs takes 5 hours on average (median: 3 hours).
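The pairing of consecutive revisions can be sketched as follows. This is not the notebook's code: it is a minimal illustration that uses only the standard library (`xml.etree.ElementTree.iterparse` and `csv`) instead of pandas, and the function names and the column names `title`, `old_text` and `new_text` are assumptions made for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the XML namespace prefix that MediaWiki exports put on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_version_pairs(xml_source):
    """Yield (title, old_text, new_text) for consecutive revisions of a page."""
    title, previous = None, None
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            current = elem.text or ""
            if previous is not None:
                yield (title, previous, current)
            previous = current
        elif name == "revision":
            elem.clear()  # free the revision body once its text is captured
        elif name == "page":
            title, previous = None, None  # never pair across two pages
            elem.clear()


def write_pairs_tsv(pairs, out_handle):
    """Write the pairs as tab-separated rows and return the number written."""
    writer = csv.writer(out_handle, delimiter="\t")
    writer.writerow(["title", "old_text", "new_text"])  # hypothetical columns
    count = 0
    for row in pairs:
        writer.writerow(row)
        count += 1
    return count


# Tiny in-memory example standing in for a real dump file.
demo_xml = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>A</title>"
    b"<revision><text>v1</text></revision>"
    b"<revision><text>v2</text></revision>"
    b"</page></mediawiki>"
)
demo_out = io.StringIO()
write_pairs_tsv(iter_version_pairs(io.BytesIO(demo_xml)), demo_out)
```

Parsing incrementally and clearing each element after use matters here because a decompressed input file is far too large to load as a single tree.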

  • nb2_tsv_sampling.ipynb

For each .tsv file, filters out irrelevant edits, then randomly draws samples from a few files to generate small datasets of 100 text version pairs. Each dataset is stored in two forms: XX_XX_pre_annot.csv for human annotation and XX_XX_sample.tsv for experiments in Jupyter notebooks. Both are saved in the annotations folder.
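The filter-then-sample step could look like the sketch below. The relevance criterion used here (the text actually changed and neither version is very short) is an assumption made for illustration, as are the function names and the `min_chars` and `seed` parameters; the notebook's own filtering rules may differ.

```python
import random


def is_relevant(old_text, new_text, min_chars=20):
    """Hypothetical filter: keep edits that change the text and are not tiny."""
    if old_text == new_text:
        return False
    return min(len(old_text), len(new_text)) >= min_chars


def draw_sample(pairs, size=100, seed=0):
    """Filter (title, old_text, new_text) pairs, then draw `size` at random.

    A fixed seed makes the draw reproducible; if fewer than `size` pairs
    survive the filter, all of them are returned.
    """
    kept = [pair for pair in pairs if is_relevant(pair[1], pair[2])]
    if len(kept) <= size:
        return kept
    return random.Random(seed).sample(kept, size)
```

Fixing the seed means that the same 100 pairs can be regenerated later, which helps keep the annotation files and the experiment files aligned.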

  • nb3_visualizer_annotator.ipynb

Visualizes the content of each set of 100 text version pairs and assists the annotator in labelling each sample with InterfaceAnnotation.exe. To use that software, load /path/classes.csv as the Classification plan and /path/XX_XX_pre_annot.csv as the Corpus CSV file, then tick the boxes for the relevant labels of each observation in the left panel. The right panel displays the sample_id so that the annotator can check that the right observation is being labelled. After each session, the labelling software returns a new file, which is renamed XX_XX_post_annot.csv and stored in the annotations folder. An observation can be tagged with one or more of 14 labels.

  • nb4_multilabelling_classification.ipynb

Attempts to identify edits related to semantic modifications. The goal is to increase the number of semantics-related edits through a semi-supervised approach.

  • nb5_detection_semantic_incoherences.ipynb

Within the hand-labeled semantics-related text version pairs, each text fragment is treated as a single observation. The goal is to build a model that can identify an incoherence from a single text sample. The word embeddings are obtained from a skip-gram model trained on the current version of French Wikipedia articles. This training text can be obtained with this Wikipedia extractor.
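As a reminder of what a skip-gram model learns from, the sketch below generates the (center word, context word) pairs within a fixed window. It is only an illustration of the idea, not the Gensim training used in the notebook, which adds further details that are not shown here; the function name and the `window` default are assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Example with a short tokenized French sentence.
example_pairs = skipgram_pairs(["le", "chat", "dort"], window=1)
```

The model is trained to predict the second element of each pair from the first, so words that share contexts end up with similar embeddings.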

Built With

  • Gensim - Python library for language modelling

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

Author

Olivier Salaün

The master's thesis based on this repository is available upon request.

License

This project is licensed under the MIT License; see the LICENSE.md file for details.