`dupree` helps identify code blocks that have a high level of similarity in a set of R files

dupree

The goal of dupree is to identify chunks / blocks of highly duplicated code within a set of R scripts.

A very lightweight approach is used:

  • The user provides a set of *.R and/or *.Rmd files;

  • All R-code in the user-provided files is read and code-blocks are identified;

  • The non-trivial symbols from each code-block are retained (for instance, really common symbols like <-, ,, +, ( are dropped);

  • Similarity between blocks is calculated with stringdist::seq_sim using the longest-common-subsequence method. Symbols are compared as whole words, so “my_data”, “my_Data”, “my.data” and “myData” are all treated as different symbols, and every non-trivial symbol carries equal weight in the calculation;

  • Code-block pairs (both between and within files) are returned, ordered from highest to lowest similarity.
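
The steps in the list above can be sketched in base R. This is an illustration only, not dupree's implementation: the helpers `block_symbols()`, `lcs_length()` and `lcs_similarity()` and the `trivial` set are hypothetical names, and the score is assumed to be scaled as 2 × (LCS length) / (total number of symbols in both blocks). The package itself delegates the similarity step to stringdist::seq_sim.

```r
# Hypothetical set of "trivial" symbols; dupree's real list may differ.
trivial <- c("<-", ",", "+", "(", ")")

# Return the non-trivial symbols of a piece of R code, in source order.
block_symbols <- function(code) {
  parsed <- parse(text = code, keep.source = TRUE)
  tokens <- utils::getParseData(parsed)
  tokens <- tokens[tokens$terminal, ]
  tokens <- tokens[order(tokens$line1, tokens$col1), ]
  tokens$text[!tokens$text %in% trivial]
}

# Length of the longest common subsequence of two symbol vectors
# (standard dynamic-programming table; symbols match only if identical).
lcs_length <- function(a, b) {
  n <- length(a)
  m <- length(b)
  tab <- matrix(0L, n + 1L, m + 1L)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      tab[i + 1L, j + 1L] <- if (a[i] == b[j]) {
        tab[i, j] + 1L
      } else {
        max(tab[i, j + 1L], tab[i + 1L, j])
      }
    }
  }
  tab[n + 1L, m + 1L]
}

# Similarity in [0, 1]; the scaling is an assumption (see the text above).
lcs_similarity <- function(a, b) {
  total <- length(a) + length(b)
  if (total == 0L) {
    return(1)
  }
  2 * lcs_length(a, b) / total
}

a <- block_symbols("x <- mean(my_data)") # "x" "mean" "my_data"
b <- block_symbols("y <- mean(my.data)") # "y" "mean" "my.data"

# Only "mean" is shared: "my_data" and "my.data" count as different symbols,
# so the score is 2 * 1 / (3 + 3).
lcs_similarity(a, b)
```

Because symbols are compared as whole words, renaming a variable lowers the score even when the surrounding structure is unchanged.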

To prevent the results being dominated by high-identity blocks that contain very few symbols (e.g., library(dplyr)), the user can specify a min_block_size. Only code-blocks containing at least this many non-trivial symbols are kept.
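
As a rough illustration of this size filter, assume each block has already been reduced to a character vector of its non-trivial symbols. The `blocks` list and the `keep_large_blocks()` helper below are hypothetical and are not part of dupree's API.

```r
# Two made-up blocks, each represented by its non-trivial symbols.
blocks <- list(
  load_pkg = c("library", "dplyr"),
  summarise = c(
    "my_data", "group_by", "species",
    "summarise", "mean_len", "mean", "length"
  )
)

# Keep only the blocks with at least `min_block_size` symbols.
keep_large_blocks <- function(blocks, min_block_size) {
  blocks[lengths(blocks) >= min_block_size]
}

kept <- keep_large_blocks(blocks, min_block_size = 5)
names(kept) # only "summarise" remains; the two-symbol block is dropped
```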

Installation

You can install dupree from GitHub with:

# install.packages("devtools")
devtools::install_github("russHyde/dupree")

Example

This basic example looks for duplicated code blocks across all *.R and *.Rmd files under the current directory:

## basic example code
library(dupree)
# `pattern` is a regular expression, not a glob: match names ending in .R or .Rmd
files <- dir(pattern = "\\.R(md)?$", recursive = TRUE)
dupree(files, min_block_size = 20)
#> # A tibble: 15 x 7
#>    file_a             file_b            block_a block_b line_a line_b score
#>    <chr>              <chr>               <int>   <int>  <int>  <int> <dbl>
#>  1 tests/testthat/te… tests/testthat/t…       2       4      7     95 0.36 
#>  2 R/dupree_classes.R tests/testthat/t…       4       3     50     22 0.327
#>  3 R/dupree_code_enu… tests/testthat/t…       1       5     14    119 0.283
#>  4 R/dupree_number_o… R/dupree_number_…       2       3     24     42 0.265
#>  5 R/dupree_classes.R R/dupree_classes…       4       8     50    169 0.219
#>  6 R/dupree_classes.R R/dupree_classes…       4       6     50    107 0.218
#>  7 tests/testthat/te… tests/testthat/t…       3       5     22     70 0.216
#>  8 R/dupree_code_enu… R/dupree_code_en…       6      12    124    218 0.213
#>  9 tests/testthat/te… tests/testthat/t…       3       4     25     95 0.212
#> 10 R/dupree_code_enu… R/dupree.R             12       2    218     69 0.200
#> 11 R/dupree_code_enu… R/dupree_code_en…       4       6     89    124 0.174
#> 12 R/dupree_classes.R R/dupree_classes…       7       8    141    169 0.173
#> 13 R/dupree_classes.R R/dupree_code_en…       8       3    169     62 0.172
#> 14 R/dupree_classes.R R/dupree_data_va…       2       5     19     45 0.163
#> 15 R/dupree_classes.R tests/testthat/t…       4       2     50      7 0.110

Note that you can do something similar using the functions dupree_dir and (if you are analysing a package) dupree_package.

# Analyse all R files except those in the tests directory:
dupree_dir(".", min_block_size = 20, filter = "tests", invert = TRUE)
#> # A tibble: 10 x 7
#>    file_a             file_b            block_a block_b line_a line_b score
#>    <chr>              <chr>               <int>   <int>  <int>  <int> <dbl>
#>  1 ./R/dupree_number… ./R/dupree_numbe…       2       3     24     42 0.265
#>  2 ./R/dupree_classe… ./R/dupree_class…       4       8     50    169 0.219
#>  3 ./R/dupree_classe… ./R/dupree_class…       4       6     50    107 0.218
#>  4 ./R/dupree_code_e… ./R/dupree_code_…       6      12    124    218 0.213
#>  5 ./R/dupree_code_e… ./R/dupree.R           12       2    218     69 0.200
#>  6 ./R/dupree_code_e… ./R/dupree_code_…       4       6     89    124 0.174
#>  7 ./R/dupree_classe… ./R/dupree_class…       7       8    141    169 0.173
#>  8 ./R/dupree_classe… ./R/dupree_code_…       8       3    169     62 0.172
#>  9 ./R/dupree_classe… ./R/dupree_data_…       2       5     19     45 0.163
#> 10 ./R/dupree_classe… ./R/dupree_code_…       4       1     50     14 0.141

# Analyse all R source code in the package (ignoring the tests directory):
dupree_package(".", min_block_size = 20)
#> # A tibble: 10 x 7
#>    file_a             file_b            block_a block_b line_a line_b score
#>    <chr>              <chr>               <int>   <int>  <int>  <int> <dbl>
#>  1 ./R/dupree_number… ./R/dupree_numbe…       2       3     24     42 0.265
#>  2 ./R/dupree_classe… ./R/dupree_class…       4       8     50    169 0.219
#>  3 ./R/dupree_classe… ./R/dupree_class…       4       6     50    107 0.218
#>  4 ./R/dupree_code_e… ./R/dupree_code_…       6      12    124    218 0.213
#>  5 ./R/dupree_code_e… ./R/dupree.R           12       2    218     69 0.200
#>  6 ./R/dupree_code_e… ./R/dupree_code_…       4       6     89    124 0.174
#>  7 ./R/dupree_classe… ./R/dupree_class…       7       8    141    169 0.173
#>  8 ./R/dupree_classe… ./R/dupree_code_…       8       3    169     62 0.172
#>  9 ./R/dupree_classe… ./R/dupree_data_…       2       5     19     45 0.163
#> 10 ./R/dupree_classe… ./R/dupree_code_…       4       1     50     14 0.141