A general purpose R interface to Solr
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
.github
R
inst
man-roxygen
man
revdep
tests
vignettes
.Rbuildignore
.gitignore
.travis.yml
CODE_OF_CONDUCT.md
DESCRIPTION
LICENSE
Makefile
NAMESPACE
NEWS.md
README.Rmd
README.md
codemeta.json
cran-comments.md
solrium.Rproj

README.md

solrium

Project Status: Active – The project has reached a stable, usable state and is being actively developed. cran checks rstudio mirror downloads cran version

A general purpose R interface to Solr

Development is now following Solr v7 and greater - which introduced many changes, which means many functions here may not work with your Solr installation older than v7.

Be aware that currently some functions will only work in certain Solr modes, e.g, collection_create() won't work when you are not in Solrcloud mode. But, you should get an error message stating that you aren't.

Currently developing against Solr v7.4.0

Solr info

Package API and ways of using the package

The first thing to look at is SolrClient to instantiate a client connection to your Solr instance. ping and schema are helpful functions to look at after instantiating your client.

There are two ways to use solrium:

  1. Call functions on the SolrClient object
  2. Pass the SolrClient object to functions

For example, if we instantiate a client like conn <- SolrClient$new(), then to use the first way we can do conn$search(...), and the second way by doing solr_search(conn, ...). These two ways of using the package hopefully make the package more user friendly for more people, those that prefer a more object oriented approach, and those that prefer more of a functional approach.

Collections

Functions that start with collection work with Solr collections when in cloud mode. Note that these functions won't work when in Solr standard mode

Cores

Functions that start with core work with Solr cores when in standard Solr mode. Note that these functions won't work when in Solr cloud mode

Documents

The following functions work with documents in Solr

#>  - add
#>  - delete_by_id
#>  - delete_by_query
#>  - update_atomic_json
#>  - update_atomic_xml
#>  - update_csv
#>  - update_json
#>  - update_xml

Search

Search functions, including solr_parse for parsing results from different functions appropriately

#>  - solr_all
#>  - solr_facet
#>  - solr_get
#>  - solr_group
#>  - solr_highlight
#>  - solr_mlt
#>  - solr_parse
#>  - solr_search
#>  - solr_stats

Install

Stable version from CRAN

install.packages("solrium")

Or development version from GitHub

devtools::install_github("ropensci/solrium")
library("solrium")

Setup

Use SolrClient$new() to initialize your connection. These examples use a remote Solr server, but work on any local Solr server.

(cli <- SolrClient$new(host = "api.plos.org", path = "search", port = NULL))
#> <Solr Client>
#>   host: api.plos.org
#>   path: search
#>   port: 
#>   scheme: http
#>   errors: simple
#>   proxy:

You can also set whether you want simple or detailed error messages (via errors), and whether you want URLs used in each function call or not (via verbose), and your proxy settings (via proxy) if needed. For example:

SolrClient$new(errors = "complete")

Your settings are printed in the print method for the connection object

cli
#> <Solr Client>
#>   host: api.plos.org
#>   path: search
#>   port: 
#>   scheme: http
#>   errors: simple
#>   proxy:

For local Solr server setup:

bin/solr start -e cloud -noprompt
bin/post -c gettingstarted example/exampledocs/*.xml

Search

cli$search(params = list(q='*:*', rows=2, fl='id'))
#> # A tibble: 2 x 1
#>   id                                                
#>   <chr>                                             
#> 1 10.1371/journal.pone.0058099/materials_and_methods
#> 2 10.1371/journal.pone.0030394/introduction

Search grouped data

Most recent publication by journal

cli$group(params = list(q='*:*', group.field='journal', rows=5, group.limit=1,
                        group.sort='publication_date desc',
                        fl='publication_date, score'))
#>       groupValue numFound start     publication_date score
#> 1       plos one  1754116     0 2018-12-12T00:00:00Z     1
#> 2           none    65774     0 2012-10-23T00:00:00Z     1
#> 3  plos genetics    65482     0 2018-12-12T00:00:00Z     1
#> 4   plos biology    36181     0 2018-12-12T00:00:00Z     1
#> 5 plos pathogens    58879     0 2018-12-10T00:00:00Z     1

First publication by journal

cli$group(params = list(q = '*:*', group.field = 'journal', group.limit = 1,
                        group.sort = 'publication_date asc',
                        fl = c('publication_date', 'score'),
                        fq = "publication_date:[1900-01-01T00:00:00Z TO *]"))
#>                          groupValue numFound start     publication_date
#> 1                          plos one  1754116     0 2006-12-20T00:00:00Z
#> 2                              none    57532     0 2005-08-23T00:00:00Z
#> 3                     plos genetics    65482     0 2005-06-17T00:00:00Z
#> 4                      plos biology    36181     0 2003-08-18T00:00:00Z
#> 5                    plos pathogens    58879     0 2005-07-22T00:00:00Z
#> 6        plos computational biology    51118     0 2005-06-24T00:00:00Z
#> 7                     plos medicine    25913     0 2004-09-07T00:00:00Z
#> 8  plos neglected tropical diseases    55287     0 2007-08-30T00:00:00Z
#> 9              plos clinical trials      521     0 2006-04-21T00:00:00Z
#> 10                     plos medicin        9     0 2012-04-17T00:00:00Z
#>    score
#> 1      1
#> 2      1
#> 3      1
#> 4      1
#> 5      1
#> 6      1
#> 7      1
#> 8      1
#> 9      1
#> 10     1

Search group query : Last 3 publications of 2013.

gq <- 'publication_date:[2013-01-01T00:00:00Z TO 2013-12-31T00:00:00Z]'
cli$group(
  params = list(q='*:*', group.query = gq,
                group.limit = 3, group.sort = 'publication_date desc',
                fl = 'publication_date'))
#>   numFound start     publication_date
#> 1   307076     0 2013-12-31T00:00:00Z
#> 2   307076     0 2013-12-31T00:00:00Z
#> 3   307076     0 2013-12-31T00:00:00Z

Search group with format simple

cli$group(params = list(q='*:*', group.field='journal', rows=5,
                        group.limit=3, group.sort='publication_date desc',
                        group.format='simple', fl='journal, publication_date'))
#>   numFound start  journal     publication_date
#> 1  2113280     0 PLOS ONE 2018-12-12T00:00:00Z
#> 2  2113280     0 PLOS ONE 2018-12-12T00:00:00Z
#> 3  2113280     0 PLOS ONE 2018-12-12T00:00:00Z
#> 4  2113280     0     <NA> 2012-10-23T00:00:00Z
#> 5  2113280     0     <NA> 2012-10-23T00:00:00Z

Facet

cli$facet(params = list(q='*:*', facet.field='journal', facet.query=c('cell', 'bird')))
#> $facet_queries
#> # A tibble: 2 x 2
#>   term   value
#>   <chr>  <int>
#> 1 cell  171895
#> 2 bird   18193
#> 
#> $facet_fields
#> $facet_fields$journal
#> # A tibble: 9 x 2
#>   term                             value  
#>   <fct>                            <fct>  
#> 1 plos one                         1754116
#> 2 plos genetics                    65482  
#> 3 plos pathogens                   58879  
#> 4 plos neglected tropical diseases 55287  
#> 5 plos computational biology       51118  
#> 6 plos biology                     36181  
#> 7 plos medicine                    25913  
#> 8 plos clinical trials             521    
#> 9 plos medicin                     9      
#> 
#> 
#> $facet_pivot
#> NULL
#> 
#> $facet_dates
#> NULL
#> 
#> $facet_ranges
#> NULL

Highlight

cli$highlight(params = list(q='alcohol', hl.fl = 'abstract', rows=2))
#> # A tibble: 2 x 2
#>   names                 abstract                                           
#>   <chr>                 <chr>                                              
#> 1 10.1371/journal.pone… "\nAcute <em>alcohol</em> administration can lead …
#> 2 10.1371/journal.pone… Objectives: <em>Alcohol</em>-related morbidity and…

Stats

out <- cli$stats(params = list(q='ecology', stats.field=c('counter_total_all','alm_twitterCount'), stats.facet='journal'))
out$data
#>                   min     max count missing       sum sumOfSquares
#> counter_total_all   0 1113107 45873       0 244364194  9.57388e+12
#> alm_twitterCount    0    3437 45873       0    297821  7.89253e+07
#>                          mean      stddev
#> counter_total_all 5326.972162 13428.74994
#> alm_twitterCount     6.492294    40.96833

More like this

solr_mlt is a function to return similar documents to the one

out <- cli$mlt(params = list(q='title:"ecology" AND body:"cell"', mlt.fl='title', mlt.mindf=1, mlt.mintf=1, fl='counter_total_all', rows=5))
out$docs
#> # A tibble: 5 x 2
#>   id                           counter_total_all
#>   <chr>                                    <int>
#> 1 10.1371/journal.pbio.1001805             22378
#> 2 10.1371/journal.pbio.0020440             25560
#> 3 10.1371/journal.pbio.1002559             11150
#> 4 10.1371/journal.pone.0087217             11502
#> 5 10.1371/journal.pbio.1002191             24483
out$mlt
#> $`10.1371/journal.pbio.1001805`
#> # A tibble: 5 x 4
#>   numFound start id                           counter_total_all
#>      <int> <int> <chr>                                    <int>
#> 1     4319     0 10.1371/journal.pone.0098876              3857
#> 2     4319     0 10.1371/journal.pone.0082578              3140
#> 3     4319     0 10.1371/journal.pone.0102159              2329
#> 4     4319     0 10.1371/journal.pcbi.1002915             11767
#> 5     4319     0 10.1371/journal.pcbi.1003408             10700
#> 
#> $`10.1371/journal.pbio.0020440`
#> # A tibble: 5 x 4
#>   numFound start id                           counter_total_all
#>      <int> <int> <chr>                                    <int>
#> 1     1254     0 10.1371/journal.pone.0162651              3311
#> 2     1254     0 10.1371/journal.pone.0003259              3310
#> 3     1254     0 10.1371/journal.pntd.0003377              4406
#> 4     1254     0 10.1371/journal.pone.0068814              9416
#> 5     1254     0 10.1371/journal.pone.0101568              5600
#> 
#> $`10.1371/journal.pbio.1002559`
#> # A tibble: 5 x 4
#>   numFound start id                           counter_total_all
#>      <int> <int> <chr>                                    <int>
#> 1     5962     0 10.1371/journal.pone.0023086              8442
#> 2     5962     0 10.1371/journal.pone.0041684             24475
#> 3     5962     0 10.1371/journal.pone.0155028              2662
#> 4     5962     0 10.1371/journal.pone.0155989              2519
#> 5     5962     0 10.1371/journal.pone.0129394              2111
#> 
#> $`10.1371/journal.pone.0087217`
#> # A tibble: 5 x 4
#>   numFound start id                           counter_total_all
#>      <int> <int> <chr>                                    <int>
#> 1     5111     0 10.1371/journal.pone.0204743                 0
#> 2     5111     0 10.1371/journal.pone.0175497              1088
#> 3     5111     0 10.1371/journal.pone.0159131              4937
#> 4     5111     0 10.1371/journal.pcbi.0020092             25551
#> 5     5111     0 10.1371/journal.pone.0133941              1336
#> 
#> $`10.1371/journal.pbio.1002191`
#> # A tibble: 5 x 4
#>   numFound start id                           counter_total_all
#>      <int> <int> <chr>                                    <int>
#> 1    13747     0 10.1371/journal.pbio.1002232              3055
#> 2    13747     0 10.1371/journal.pone.0070448              2203
#> 3    13747     0 10.1371/journal.pone.0191705               800
#> 4    13747     0 10.1371/journal.pone.0131700              3051
#> 5    13747     0 10.1371/journal.pone.0121680              4980

Parsing

solr_parse is a general purpose parser function with extension methods solr_parse.sr_search, solr_parse.sr_facet, and solr_parse.sr_high, for parsing solr_search, solr_facet, and solr_highlight function output, respectively. solr_parse is used internally within those three functions (solr_search, solr_facet, solr_highlight) to do parsing. You can optionally get back raw json or xml from solr_search, solr_facet, and solr_highlight setting parameter raw=TRUE, and then parsing after the fact with solr_parse. All you need to know is solr_parse can parse

For example:

(out <- cli$highlight(params = list(q='alcohol', hl.fl = 'abstract', rows=2),
                      raw=TRUE))
#> [1] "{\n  \"response\":{\"numFound\":29215,\"start\":0,\"maxScore\":4.6769786,\"docs\":[\n      {\n        \"id\":\"10.1371/journal.pone.0201042\",\n        \"journal\":\"PLOS ONE\",\n        \"eissn\":\"1932-6203\",\n        \"publication_date\":\"2018-07-26T00:00:00Z\",\n        \"article_type\":\"Research Article\",\n        \"author_display\":[\"Graeme Knibb\",\n          \"Carl. A. Roberts\",\n          \"Eric Robinson\",\n          \"Abi Rose\",\n          \"Paul Christiansen\"],\n        \"abstract\":[\"\\nAcute alcohol administration can lead to a loss of control over drinking. Several models argue that this ‘alcohol priming effect’ is mediated by the effect of alcohol on inhibitory control. Alternatively, beliefs about how alcohol affects behavioural regulation may also underlie alcohol priming and alcohol-induced inhibitory impairments. Here two studies examine the extent to which the alcohol priming effect and inhibitory impairments are moderated by beliefs regarding the effects of alcohol on the ability to control behaviour. In study 1, following a priming drink (placebo or .5g/kg of alcohol), participants were provided with bogus feedback regarding their performance on a measure of inhibitory control (stop-signal task; SST) suggesting that they had high or average self-control. However, the bogus feedback manipulation was not successful. In study 2, before a SST, participants were exposed to a neutral or experimental message suggesting acute doses of alcohol reduce the urge to drink and consumed a priming drink and this manipulation was successful. In both studies craving was assessed throughout and a bogus taste test which measured ad libitum drinking was completed. Results suggest no effect of beliefs on craving or ad lib consumption within either study. However, within study 2, participants exposed to the experimental message displayed evidence of alcohol-induced impairments of inhibitory control, while those exposed to the neutral message did not. These findings do not suggest beliefs about the effects of alcohol moderate the alcohol priming effect but do suggest beliefs may, in part, underlie the effect of alcohol on inhibitory control.\\n\"],\n        \"title_display\":\"The effect of beliefs about alcohol’s acute effects on alcohol priming and alcohol-induced impairments of inhibitory control\",\n        \"score\":4.6769786},\n      {\n        \"id\":\"10.1371/journal.pone.0185457\",\n        \"journal\":\"PLOS ONE\",\n        \"eissn\":\"1932-6203\",\n        \"publication_date\":\"2017-09-28T00:00:00Z\",\n        \"article_type\":\"Research Article\",\n        \"author_display\":[\"Jacqueline Willmore\",\n          \"Terry-Lynne Marko\",\n          \"Darcie Taing\",\n          \"Hugues Sampasa-Kanyinga\"],\n        \"abstract\":[\"Objectives: Alcohol-related morbidity and mortality are significant public health issues. The purpose of this study was to describe the prevalence and trends over time of alcohol consumption and alcohol-related morbidity and mortality; and public attitudes of alcohol use impacts on families and the community in Ottawa, Canada. Methods: Prevalence (2013–2014) and trends (2000–2001 to 2013–2014) of alcohol use were obtained from the Canadian Community Health Survey. Data on paramedic responses (2015), emergency department (ED) visits (2013–2015), hospitalizations (2013–2015) and deaths (2007–2011) were used to quantify the acute and chronic health effects of alcohol in Ottawa. Qualitative data were obtained from the “Have Your Say” alcohol survey, an online survey of public attitudes on alcohol conducted in 2016. Results: In 2013–2014, an estimated 595,300 (83%) Ottawa adults 19 years and older drank alcohol, 42% reported binge drinking in the past year. Heavy drinking increased from 15% in 2000–2001 to 20% in 2013–2014. In 2015, the Ottawa Paramedic Service responded to 2,060 calls directly attributable to alcohol. Between 2013 and 2015, there were an average of 6,100 ED visits and 1,270 hospitalizations per year due to alcohol. Annually, alcohol use results in at least 140 deaths in Ottawa. Men have higher rates of alcohol-attributable paramedic responses, ED visits, hospitalizations and deaths than women, and young adults have higher rates of alcohol-attributable paramedic responses. Qualitative data of public attitudes indicate that alcohol misuse has greater repercussions not only on those who drink, but also on the family and community. Conclusions: Results highlight the need for healthy public policy intended to encourage a culture of drinking in moderation in Ottawa to support lower risk alcohol use, particularly among men and young adults. \"],\n        \"title_display\":\"The burden of alcohol-related morbidity and mortality in Ottawa, Canada\",\n        \"score\":4.676675}]\n  },\n  \"highlighting\":{\n    \"10.1371/journal.pone.0201042\":{\n      \"abstract\":[\"\\nAcute <em>alcohol</em> administration can lead to a loss of control over drinking. Several models argue\"]},\n    \"10.1371/journal.pone.0185457\":{\n      \"abstract\":[\"Objectives: <em>Alcohol</em>-related morbidity and mortality are significant public health issues\"]}}}\n"
#> attr(,"class")
#> [1] "sr_high"
#> attr(,"wt")
#> [1] "json"

Then parse

solr_parse(out, 'df')
#> # A tibble: 2 x 2
#>   names                 abstract                                           
#>   <chr>                 <chr>                                              
#> 1 10.1371/journal.pone… "\nAcute <em>alcohol</em> administration can lead …
#> 2 10.1371/journal.pone… Objectives: <em>Alcohol</em>-related morbidity and…

Progress bars

only supported in the core search methods: search, facet, group, mlt, stats, high, all

library(httr)
invisible(cli$search(params = list(q='*:*', rows=100, fl='id'), progress = httr::progress()))
|==============================================| 100%

Advanced: Function Queries

Function Queries allow you to query on actual numeric fields in the SOLR database, and do addition, multiplication, etc on one or many fields to stort results. For example, here, we search on the product of counter_total_all and alm_twitterCount, using a new temporary field "val"

cli$search(params = list(q='_val_:"product(counter_total_all,alm_twitterCount)"',
  rows=5, fl='id,title', fq='doc_type:full'))
#> # A tibble: 5 x 2
#>   id                    title                                              
#>   <chr>                 <chr>                                              
#> 1 10.1371/journal.pmed… Why Most Published Research Findings Are False     
#> 2 10.1371/journal.pone… A Multi-Level Bayesian Analysis of Racial Bias in …
#> 3 10.1371/journal.pcbi… Ten simple rules for structuring papers            
#> 4 10.1371/journal.pone… More than 75 percent decline over 27 years in tota…
#> 5 10.1371/journal.pone… Long-Term Follow-Up of Transsexual Persons Undergo…

Here, we search for the papers with the most citations

cli$search(params = list(q='_val_:"max(counter_total_all)"',
    rows=5, fl='id,counter_total_all', fq='doc_type:full'))
#> # A tibble: 5 x 2
#>   id                                                      counter_total_all
#>   <chr>                                                               <int>
#> 1 10.1371/journal.pmed.0020124                                      2597075
#> 2 10.1371/annotation/80bd7285-9d2d-403a-8e6f-9c375bf977ca           1235195
#> 3 10.1371/journal.pcbi.1003149                                      1113107
#> 4 10.1371/journal.pone.0141854                                       878333
#> 5 10.1371/journal.pcbi.0030102                                       776907

Or with the most tweets

cli$search(params = list(q='_val_:"max(alm_twitterCount)"',
    rows=5, fl='id,alm_twitterCount', fq='doc_type:full'))
#> # A tibble: 5 x 2
#>   id                           alm_twitterCount
#>   <chr>                                   <int>
#> 1 10.1371/journal.pmed.0020124             3468
#> 2 10.1371/journal.pone.0141854             3437
#> 3 10.1371/journal.pcbi.1005619             3096
#> 4 10.1371/journal.pone.0115069             3027
#> 5 10.1371/journal.pmed.1001953             2825

Using specific data sources

USGS BISON service

The occurrences service

conn <- SolrClient$new(scheme = "https", host = "bison.usgs.gov", path = "solr/occurrences/select", port = NULL)
conn$search(params = list(q = '*:*', fl = c('decimalLatitude','decimalLongitude','scientificName'), rows = 2))
#> # A tibble: 2 x 3
#>   decimalLongitude scientificName              decimalLatitude
#>              <dbl> <chr>                                 <dbl>
#> 1            -75.4 Setophaga coronata coronata            37.9
#> 2            -75.4 Setophaga coronata coronata            37.9

The species names service

conn <- SolrClient$new(scheme = "https", host = "bison.usgs.gov", path = "solr/scientificName/select", port = NULL)
conn$search(params = list(q = '*:*'))
#> # A tibble: 10 x 2
#>    scientificName             `_version_`
#>    <chr>                            <dbl>
#>  1 Dictyopteris polypodioides     1.57e18
#>  2 Lonicera iberica               1.57e18
#>  3 Epuraea ambigua                1.57e18
#>  4 Pseudopomala brachyptera       1.57e18
#>  5 Didymosphaeria populina        1.57e18
#>  6 Sanoarca                       1.57e18
#>  7 Celleporina ventricosa         1.57e18
#>  8 Trigonurus crotchi             1.57e18
#>  9 Ceraticelus laticeps           1.57e18
#> 10 Micraster acutus               1.57e18

PLOS Search API

Most of the examples above use the PLOS search API... :)

Solr server management

This isn't as complete as searching functions show above, but we're getting there.

Cores

conn <- SolrClient$new()

Many functions, e.g.:

  • core_create()
  • core_rename()
  • core_status()
  • ...

Create a core

conn$core_create(name = "foo_bar")

Collections

Many functions, e.g.:

  • collection_create()
  • collection_list()
  • collection_addrole()
  • ...

Create a collection

conn$collection_create(name = "hello_world")

Add documents

Add documents, supports adding from files (json, xml, or csv format), and from R objects (including data.frame and list types so far)

df <- data.frame(id = c(67, 68), price = c(1000, 500000000))
conn$add(df, name = "books")

Delete documents, by id

conn$delete_by_id(name = "books", ids = c(3, 4))

Or by query

conn$delete_by_query(name = "books", query = "manu:bank")

Meta

  • Please report any issues or bugs
  • License: MIT
  • Get citation information for solrium in R doing citation(package = 'solrium')
  • Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

ropensci_footer