R package to get a duplicate content score between 2 web pages

duplicateContentR

Compute a duplicate content score between two web pages. duplicateContentR takes two URLs as input and returns a score that indicates how much content the pages share, which can help detect plagiarism.
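To give an intuition for what a duplicate content score can look like, here is a minimal sketch in base R that compares two texts with Jaccard similarity over word shingles. This is an illustration only: the helper names (`tokenize_words`, `word_shingles`, `jaccard_score`) are hypothetical, and the README does not say which method duplicateContentR actually uses, so do not read this as the package's implementation.

```r
# Illustrative only: Jaccard similarity over word shingles.
# This is NOT necessarily the method duplicateContentR uses internally.

# Hypothetical helper: split text into lowercase word tokens.
tokenize_words <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^[:alnum:]]+"))
  words[nzchar(words)]
}

# Hypothetical helper: build the set of k-word shingles
# (overlapping sequences of k consecutive words).
word_shingles <- function(text, k = 3) {
  words <- tokenize_words(text)
  if (length(words) < k) {
    return(character(0))
  }
  starts <- seq_len(length(words) - k + 1)
  unique(vapply(
    starts,
    function(i) paste(words[i:(i + k - 1)], collapse = " "),
    character(1)
  ))
}

# Jaccard similarity: size of the intersection divided by size of the union.
# Returns a value between 0 (nothing shared) and 1 (identical shingle sets).
jaccard_score <- function(text_a, text_b, k = 3) {
  a <- word_shingles(text_a, k)
  b <- word_shingles(text_b, k)
  total <- length(union(a, b))
  if (total == 0) {
    return(0)
  }
  length(intersect(a, b)) / total
}

jaccard_score(
  "Quantum physics describes matter at the atomic scale",
  "Quantum physics describes matter at very small scales"
)
```

Shingles of several words are used rather than single words so that shared phrasing counts for more than shared vocabulary; the shingle size `k = 3` is an arbitrary choice for this sketch.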

Install

devtools::install_github("remibacha/duplicateContentR")

Getting started

Run the following lines to load the required packages:

packages <- c("XML", "httr", "textrank", "duplicateContentR")
lapply(packages, library, character.only = TRUE)

Call the duplicate_content_score function with 3 arguments (URL a, URL b, your user agent*) and enjoy!

duplicate_content_score("https://fr.wikipedia.org/wiki/Physique_quantique", "https://www.futura-sciences.com/sciences/definitions/physique-physique-quantique-13197/","Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0")

NB: You can get your user agent by asking Google "What is my user agent?"
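If you want to check one page against several others, the call above can be wrapped in a small loop. The sketch below is a hypothetical helper, not part of duplicateContentR: `score_candidates` and its `score_fn` argument are names invented here, and it assumes only the three-argument call shown above. Because the README does not document what `duplicate_content_score` returns, the results are kept in a plain list rather than coerced to numbers.

```r
# Hypothetical helper (not part of duplicateContentR): score one reference
# page against several candidate URLs, continuing past pages that fail.
score_candidates <- function(reference_url, candidate_urls, user_agent,
                             score_fn = duplicateContentR::duplicate_content_score) {
  results <- lapply(candidate_urls, function(url) {
    tryCatch(
      score_fn(reference_url, url, user_agent),
      # Assumption: record NA when a page cannot be fetched or scored,
      # instead of stopping the whole run.
      error = function(e) NA
    )
  })
  # Name each result after the candidate URL it belongs to.
  names(results) <- candidate_urls
  results
}
```

Calling `score_candidates(url_a, c(url_b, url_c), my_user_agent)` would then return one entry per candidate URL. The `score_fn` argument exists only so the loop can be tried with a stand-in function when the package or a network connection is unavailable.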

Feedback

Questions and feedback are welcome!

Want to contribute? Open a pull request ;-) If you encounter a bug or want to suggest an enhancement, please open an issue.