Text Reuse Pipeline

To build the text reuse pipeline Python module, run the following command:

make build

This generates a dist folder; all of the tasks below should be run from inside that folder.

Wikipedia dump extraction

Using the open-source WikiExtractor tool, run the following command to extract plain text from the Wikipedia dump:

python wiki_extractor.py -b 30G -s -ns 0 --filter_disambig_pages -o wiki_no_lists  enwiki-20160501-pages-articles.xml.bz2 &

The output should then be copied to HDFS using the following command:

hdfs dfs -put ./path_of_the_extracted_dump ./text-reuse/wiki_00
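For reference, WikiExtractor writes each article wrapped in <doc ...> tags inside files named wiki_00, wiki_01, and so on. The following is a minimal sketch of how the extracted articles could be iterated over locally, assuming the default (non-JSON) output format; the directory name, glob pattern, and regex are illustrative and not part of the pipeline code:

import re
from pathlib import Path

# Matches one <doc id="..." url="..." title="...">...</doc> block (default WikiExtractor format).
DOC_RE = re.compile(r'<doc id="(\d+)" url="[^"]*" title="([^"]*)">\n(.*?)</doc>', re.S)

def iter_articles(dump_dir):
    """Yield (id, title, text) tuples from WikiExtractor output files."""
    for path in sorted(Path(dump_dir).glob("*/wiki_*")):
        content = path.read_text(encoding="utf-8")
        for doc_id, title, text in DOC_RE.findall(content):
            yield doc_id, title, text.strip()

# Example usage: print the first extracted article's id, title, and text length.
for doc_id, title, text in iter_articles("wiki_no_lists"):
    print(doc_id, title, len(text))
    break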

Wikipedia text preprocessing

To perform text preprocessing on Wikipedia, run the following two commands inside the webis Docker image, assuming that the extracted Wikipedia dump is located under the HDFS path text-reuse/wiki_00:

## To perform paragraph re-balancing and cleaning up text
PYSPARK_DRIVER_PYTHON=python3 spark-submit --master yarn --deploy-mode cluster --num-executors 100 --executor-cores 10 --executor-memory 25g --driver-memory 25g --conf spark.driver.maxResultSize=15g --conf spark.yarn.executor.memoryOverhead=25000 --conf spark.yarn.driver.memoryOverhead=25000 --packages com.databricks:spark-xml_2.11:0.4.1,com.databricks:spark-csv_2.10:1.5.0 --py-files ./text_reuse_pipeline.zip main.py --job wiki_preprocess
## The output is written to the following hdfs path: text-reuse/pipeline/wiki_preprocessed

## To extract TFIDF vectors for each article
PYSPARK_DRIVER_PYTHON=python3 spark-submit --master yarn --deploy-mode cluster --num-executors 100 --executor-cores 10 --executor-memory 25g --driver-memory 25g --conf spark.driver.maxResultSize=15g --conf spark.yarn.executor.memoryOverhead=25000 --conf spark.yarn.driver.memoryOverhead=25000 --packages com.databricks:spark-xml_2.11:0.4.1,com.databricks:spark-csv_2.10:1.5.0 --py-files ./text_reuse_pipeline.zip main.py --job wiki_represent --job_args tfidf
## The output is written to the following hdfs path: text-reuse/pipeline/wiki_rep_tfidf
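To make the tfidf representation job concrete, here is a hedged PySpark sketch of computing TF-IDF vectors per article with HashingTF and IDF. The actual implementation lives in text_reuse_pipeline.zip; the input format (JSON), column names, and numFeatures below are assumptions for illustration only:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.appName("tfidf_sketch").getOrCreate()

# Assumed schema: one preprocessed article per row with columns (id, text).
articles = spark.read.json("text-reuse/pipeline/wiki_preprocessed")

tokens = Tokenizer(inputCol="text", outputCol="tokens").transform(articles)
tf = HashingTF(inputCol="tokens", outputCol="tf", numFeatures=1 << 18).transform(tokens)
idf_model = IDF(inputCol="tf", outputCol="tfidf").fit(tf)
vectors = idf_model.transform(tf).select("id", "tfidf")

# Illustrative output location, mirroring the path mentioned above.
vectors.write.mode("overwrite").parquet("text-reuse/pipeline/wiki_rep_tfidf")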

Wikipedia candidate elimination

To extract candidate articles from Wikipedia to be examined in the final text alignment step, run the following command:

PYSPARK_DRIVER_PYTHON=python3 spark-submit --master yarn --deploy-mode cluster --num-executors 200 --executor-cores 10 --executor-memory 25g --driver-memory 25g --conf spark.driver.maxResultSize=15g --conf spark.yarn.executor.memoryOverhead=25000 --conf spark.yarn.driver.memoryOverhead=25000 --packages com.databricks:spark-xml_2.11:0.4.1,com.databricks:spark-csv_2.10:1.5.0 --py-files ./text_reuse_pipeline.zip main.py --job wiki_candidate_elemination --job_args hdfs://betaweb020:8020/user/sile2804/cython_utils.so 0-10 0.025

Candidate elimination is divided into 100 batches. In the command above, 0-10 means run candidate elimination on batches 0 through 10. The last argument, 0.025, is the minimum similarity two documents must have to be considered similar.
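The idea behind the 0.025 threshold can be sketched in plain Python: compute a similarity between the TF-IDF vectors of two documents and keep only pairs at or above the threshold. This is an illustrative sketch assuming cosine similarity over sparse term-weight dicts, not the batched Spark/Cython implementation used by the pipeline:

import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse TF-IDF vectors given as {term: weight} dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def candidate_pairs(vectors, threshold=0.025):
    """Yield (id1, id2, similarity) for document pairs above the similarity threshold."""
    ids = list(vectors)
    for i, id1 in enumerate(ids):
        for id2 in ids[i + 1:]:
            sim = cosine_similarity(vectors[id1], vectors[id2])
            if sim >= threshold:
                yield id1, id2, sim

# Toy example with made-up TF-IDF weights.
docs = {
    "a": {"reuse": 0.7, "text": 0.3},
    "b": {"reuse": 0.6, "pipeline": 0.4},
    "c": {"unrelated": 1.0},
}
print(list(candidate_pairs(docs)))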

Wikipedia text alignments

To run the detailed alignments on the candidate elimination output, run the following command:

PYSPARK_DRIVER_PYTHON=python3 spark-submit --master yarn --deploy-mode cluster --num-executors 200 --executor-cores 2 --executor-memory 45g --driver-memory 45g --conf spark.driver.maxResultSize=15g --packages com.databricks:spark-xml_2.11:0.4.1,com.databricks:spark-csv_2.10:1.5.0 --jars /home/sile2804/picapica.jar --py-files ./text_reuse_pipeline.zip main.py --job wiki_text_alignment --job_args text-reuse/pipeline/candidates/[0-2] text-reuse/pipeline/alignments/output_name threshold k &

Arguments for this command:

  • text-reuse/pipeline/candidates/[0-2]: HDFS input path (here meaning run on parts 0 to 2 of the whole dataset)
  • text-reuse/pipeline/alignments/output_name: HDFS output path
  • threshold: the minimum similarity threshold
  • k: the number-of-misses threshold (see the sketch below for how threshold and k are used)
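The actual alignment is performed by picapica.jar on Spark. To make the roles of threshold and k concrete, here is a simplified, hedged Python sketch of a seed-and-extend alignment, where k bounds the number of consecutive token mismatches tolerated while extending a match and threshold is the minimum match ratio an alignment must reach. The seeding strategy and function names are assumptions for illustration, not the pipeline's implementation:

def align(src_tokens, tgt_tokens, threshold=0.6, k=3, seed_len=5):
    """Greedy alignment sketch: find exact seed matches of length seed_len, extend them
    token by token while allowing up to k consecutive mismatches, and keep extensions
    whose overall match ratio is at least threshold. Overlapping seeds may yield
    overlapping alignments in this simplified version."""
    seeds = {}
    for i in range(len(src_tokens) - seed_len + 1):
        seeds.setdefault(tuple(src_tokens[i:i + seed_len]), i)
    alignments = []
    for j in range(len(tgt_tokens) - seed_len + 1):
        i = seeds.get(tuple(tgt_tokens[j:j + seed_len]))
        if i is None:
            continue
        matches, length, misses = seed_len, seed_len, 0
        while i + length < len(src_tokens) and j + length < len(tgt_tokens) and misses <= k:
            if src_tokens[i + length] == tgt_tokens[j + length]:
                matches += 1
                misses = 0
            else:
                misses += 1
            length += 1
        if matches / length >= threshold:
            alignments.append((i, j, length, matches / length))
    return alignments

# Toy example: two near-duplicate sentences.
src = "the quick brown fox jumps over the lazy dog near the river bank".split()
tgt = "a quick brown fox jumps over a lazy dog near a river".split()
print(align(src, tgt, threshold=0.6, k=2, seed_len=3))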