Multilingual Stopword Lists in R
stopwords: the R package

R package providing “one-stop shopping” (or should that be “one-shop stopping”?) for stopword lists in R, for multiple languages and sources. No longer should text analysis or NLP packages bake in their own stopword lists or functions, since this package can accommodate them all, and is easily extended.

Created by David Muhr, and extended in cooperation with Kenneth Benoit and Kohei Watanabe.

Installation

# from CRAN
install.packages("stopwords")

# Or get the development version from GitHub:
# install.packages("devtools")
devtools::install_github("quanteda/stopwords")

Usage

head(stopwords::stopwords("de", source = "snowball"), 20)
##  [1] "aber"    "alle"    "allem"   "allen"   "aller"   "alles"   "als"    
##  [8] "also"    "am"      "an"      "ander"   "andere"  "anderem" "anderen"
## [15] "anderer" "anderes" "anderm"  "andern"  "anderr"  "anders"

head(stopwords::stopwords("de", source = "stopwords-iso"), 20)
##  [1] "a"           "ab"          "aber"        "ach"         "acht"       
##  [6] "achte"       "achten"      "achter"      "achtes"      "ag"         
## [11] "alle"        "allein"      "allem"       "allen"       "aller"      
## [16] "allerdings"  "alles"       "allgemeinen" "als"         "also"

For compatibility with the former quanteda::stopwords():

head(stopwords::stopwords("german"), 20)
##  [1] "aber"    "alle"    "allem"   "allen"   "aller"   "alles"   "als"    
##  [8] "also"    "am"      "an"      "ander"   "andere"  "anderem" "anderen"
## [15] "anderer" "anderes" "anderm"  "andern"  "anderr"  "anders"
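
A common use of these lists is dropping stopwords from a vector of tokens. The following is a minimal sketch, assuming the stopwords package is installed; the token vector and the variable names tokens, stops and kept are made up for illustration.

```r
# Illustrative tokens (not from the package); assumes stopwords is installed.
tokens <- c("the", "quick", "brown", "fox", "is", "on", "the", "run")

# English stopwords from the default "snowball" source.
stops <- stopwords::stopwords("en", source = "snowball")

# Keep only tokens that are not literal (lower-cased) matches of a stopword.
kept <- tokens[!tolower(tokens) %in% stops]
print(kept)
```

Matching is done on whole tokens, so any tokenisation and case folding has to happen before the lookup.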

Explore sources and languages:

# list all sources
stopwords::stopwords_getsources()
## [1] "snowball"      "stopwords-iso" "misc"          "smart"

# list languages for a specific source
stopwords::stopwords_getlanguages("snowball")
##  [1] "da" "de" "en" "es" "fi" "fr" "hu" "ir" "it" "nl" "no" "pt" "ro" "ru"
## [15] "sv"

Languages available

Language coverage varies by source. The inclusiveness of the stopword lists also varies by source, and a source that covers more languages is not necessarily better than one with more limited coverage. (There may be many reasons to prefer the default “snowball” source over the “stopwords-iso” source, for instance.) To check which languages a given source covers, call stopwords_getlanguages() with the source name.

The following languages are currently available:

Language ISO-639-1 Code
Afrikaans af
Arabic ar
Armenian hy
Basque eu
Bengali bn
Breton br
Bulgarian bg
Catalan ca
Chinese zh
Croatian hr
Czech cs
Danish da
Dutch nl
English en
Esperanto eo
Estonian et
Finnish fi
French fr
Galician gl
German de
Greek el
Gujarati gu
Hausa ha
Hebrew he
Hindi hi
Hungarian hu
Indonesian id
Irish ga
Italian it
Japanese ja
Korean ko
Kurdish ku
Latin la
Lithuanian lt
Latvian lv
Malay ms
Marathi mr
Norwegian no
Persian fa
Polish pl
Portuguese pt
Romanian ro
Russian ru
Slovak sk
Slovenian sl
Somali so
Southern Sotho st
Spanish es
Swahili sw
Swedish sv
Thai th
Tagalog tl
Turkish tr
Ukrainian uk
Urdu ur
Vietnamese vi
Yoruba yo
Zulu zu
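
Per-source coverage can also be queried from the package itself. The following is a minimal sketch, assuming the stopwords package is installed; it relies only on stopwords_getsources() and stopwords_getlanguages(), and the variable names sources and coverage are illustrative.

```r
# Assumes the stopwords package is installed.
sources <- stopwords::stopwords_getsources()

# Named list: one character vector of language codes per source.
coverage <- lapply(
  stats::setNames(sources, sources),
  stopwords::stopwords_getlanguages
)

# Number of languages each source covers.
print(lengths(coverage))

# Which sources provide a German ("de") list?
print(names(Filter(function(langs) "de" %in% langs, coverage)))
```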

Contributing

Additional sources can be defined and contributed by adding new data objects, as follows:

  1. Data object. Create a named list of character vectors, encoded in UTF-8, containing the stopwords for each language. Each list element is named by its ISO-639-1 language code, and its value is the character vector of stopwords, which are matched literally. The data object should follow the package naming convention and be called data_stopwords_newsource, where newsource is replaced by the name of the new source.

  2. Documentation. The new source should be clearly documented, especially the origin from which the stopwords were taken.
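
The data object described in step 1 might look like the sketch below. It uses base R only; the words are placeholders, newsource is the hypothetical source name from the text, and the sanity checks are an assumption about what is worth verifying, not the package's own validation.

```r
# Hypothetical new source: the words below are placeholders for illustration.
data_stopwords_newsource <- list(
  en = c("a", "an", "the"),
  de = c("der", "die", "das")
)

# Make sure every element is a UTF-8 encoded character vector.
data_stopwords_newsource <- lapply(data_stopwords_newsource, enc2utf8)

# Sanity checks (assumed, not prescribed by the package).
stopifnot(
  is.list(data_stopwords_newsource),
  # element names look like two-letter ISO-639-1 codes
  all(grepl("^[a-z]{2}$", names(data_stopwords_newsource))),
  # every element is a character vector
  all(vapply(data_stopwords_newsource, is.character, logical(1))),
  # every string is valid UTF-8
  all(vapply(data_stopwords_newsource,
             function(x) all(validUTF8(x)), logical(1))),
  # no duplicated stopwords within a language
  !any(vapply(data_stopwords_newsource, anyDuplicated, integer(1)) > 0)
)
```

Saving the object into the package is left out here, since the steps above do not describe it.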

License

This package, as well as the source repositories, is licensed under the MIT license.
