recoreco

Fast item-to-item recommendations on the command line.


Installation

Currently, the only convenient way to install recoreco is via Rust's package manager cargo:

$ cargo install recoreco

Quickstart

Recoreco computes highly associated pairs of items (in the sense of 'people who are interested in X are also interested in Y') from interactions between users and items.
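
To illustrate what "associated pairs of items" means, here is a naive Rust sketch that counts, for every pair of items, how many users interacted with both. It is only an illustration of the idea under simple assumptions (everything fits in memory, every user is processed in full); the function name `count_cooccurrences` is hypothetical and this is not recoreco's actual implementation.

```rust
use std::collections::{BTreeSet, HashMap};

/// Counts, for every unordered pair of distinct items, the number of users
/// who interacted with both items. Repeated interactions are counted once.
pub fn count_cooccurrences(interactions: &[(&str, &str)]) -> HashMap<(String, String), u64> {
    // Collect the distinct items of each user; a BTreeSet keeps them sorted.
    let mut items_by_user: HashMap<&str, BTreeSet<&str>> = HashMap::new();
    for &(user, item) in interactions {
        items_by_user.entry(user).or_default().insert(item);
    }

    let mut counts: HashMap<(String, String), u64> = HashMap::new();
    for items in items_by_user.values() {
        let items: Vec<&str> = items.iter().copied().collect();
        for (i, a) in items.iter().enumerate() {
            for b in &items[i + 1..] {
                // Items are sorted, so each pair is stored as (smaller, larger).
                *counts
                    .entry(((*a).to_owned(), (*b).to_owned()))
                    .or_insert(0) += 1;
            }
        }
    }
    counts
}
```

Pairs with a high count relative to the popularity of the individual items are candidates for "people who are interested in X are also interested in Y".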

It is a command line tool that expects a CSV file as input, where each line denotes an interaction between a user and an item and consists of a user identifier and an item identifier separated by a tab character. Recoreco by default outputs 10 associated items per item (with no particular ranking) in JSON format.
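
To make the input format concrete, here is a minimal Rust sketch that splits one input line into its user and item fields. It is not recoreco's own parser; the function name `parse_interaction` and the choice to reject lines with empty fields or additional tabs are assumptions made for illustration.

```rust
/// Splits one input line of the form `<user>\t<item>` into its two fields.
/// Returns `None` if the line has no tab, an empty field, or extra tabs.
pub fn parse_interaction(line: &str) -> Option<(&str, &str)> {
    // Drop a trailing line ending, if any.
    let line = line.trim_end_matches(|c: char| c == '\n' || c == '\r');
    // Split at the first tab character.
    let (user, item) = line.split_once('\t')?;
    if user.is_empty() || item.is_empty() || item.contains('\t') {
        return None;
    }
    Some((user, item))
}
```

Item identifiers may contain spaces (for example artist names), which is why only the tab character is treated as a separator here.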

If you would like to learn a bit more about the math behind the approach that recoreco is built on, check out the book Practical Machine Learning: Innovations in Recommendation and the talk Real-time Puppies and Ponies from my friend Ted Dunning.
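
This README does not spell out the scoring function. Ted Dunning's work on cooccurrence-based recommendation is commonly associated with the log-likelihood ratio (LLR) test, so the following Rust sketch assumes an LLR-style score for a pair of items A and B. The function names are hypothetical and the code is an illustration of the test, not recoreco's implementation. The four counts form a 2x2 table: `k11` users interacted with both A and B, `k12` with A but not B, `k21` with B but not A, and `k22` with neither.

```rust
/// x * ln(x), with the convention that 0 * ln(0) = 0.
fn x_log_x(x: u64) -> f64 {
    if x == 0 {
        0.0
    } else {
        let x = x as f64;
        x * x.ln()
    }
}

/// Unnormalized Shannon entropy of a list of counts.
fn entropy(counts: &[u64]) -> f64 {
    let total: u64 = counts.iter().sum();
    x_log_x(total) - counts.iter().map(|&c| x_log_x(c)).sum::<f64>()
}

/// Log-likelihood ratio score of a 2x2 contingency table.
/// The score is near zero when A and B occur independently of each other.
pub fn log_likelihood_ratio(k11: u64, k12: u64, k21: u64, k22: u64) -> f64 {
    let row_entropy = entropy(&[k11 + k12, k21 + k22]);
    let column_entropy = entropy(&[k11 + k21, k12 + k22]);
    let matrix_entropy = entropy(&[k11, k12, k21, k22]);
    // Clamp tiny negative values caused by floating point rounding.
    (2.0 * (row_entropy + column_entropy - matrix_entropy)).max(0.0)
}
```

A table in which A and B are independent, such as `(10, 10, 10, 10)`, scores approximately zero, while a table with strong overlap, such as `(10, 0, 0, 10)`, scores much higher.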

Example: Finding related music artists with recoreco

As an example, we will compute related artists from a music dataset crawled from last.fm. The data contains 17,535,655 interactions between 358,868 users and 292,365 bands.

As a first step, we download the data, uncompress it, and have a look at the format:

$ wget http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-360K.tar.gz
$ tar xvfz lastfm-dataset-360K.tar.gz

$ head lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv
00000c289a1829a808ac09c00daf10bc3c4e223b	3bd73256-3905-4f3a-97e2-8b341527f805	betty blowtorch	2137
00000c289a1829a808ac09c00daf10bc3c4e223b	f2fb0ff0-5679-42ec-a55c-15109ce6e320	die Ärzte	1099
00000c289a1829a808ac09c00daf10bc3c4e223b	b3ae82c2-e60b-4551-a76d-6620f1b456aa	melissa etheridge	897
00000c289a1829a808ac09c00daf10bc3c4e223b	3d6bbeb7-f90e-4d10-b440-e153c0d10b53	elvenking	717
00000c289a1829a808ac09c00daf10bc3c4e223b	bbd2ffd7-17f4-4506-8572-c1ea58c3f9a8	juliette & the licks	706

The input must contain only user and item identifiers, so we create a new CSV file that contains only the first column (the hashed user ID) and the third column (the artist name) from the original data:

$ cut -f1,3 lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv > plays.csv

Now the CSV file is in the correct format:

$ head plays.csv 
00000c289a1829a808ac09c00daf10bc3c4e223b	betty blowtorch
00000c289a1829a808ac09c00daf10bc3c4e223b	die Ärzte
00000c289a1829a808ac09c00daf10bc3c4e223b	melissa etheridge
00000c289a1829a808ac09c00daf10bc3c4e223b	elvenking
00000c289a1829a808ac09c00daf10bc3c4e223b	juliette & the licks

Next, we invoke recoreco, point it to the CSV file as input, and ask it to write the output to a file called artists.json. It reads the CSV file twice: once to compute some statistics of the data, and a second time to compute the actual item-to-item recommendations. recoreco is pretty fast; the computation takes less than a minute on my machine.

$ recoreco --inputfile=plays.csv --outputfile=artists.json

Reading plays.csv to compute data statistics (pass 1/2)
Found 17535655 interactions between 358868 users and 292365 items.
Reading plays.csv to compute 10 item indicators per item (pass 2/2)
194996130 cooccurrences observed, 34015ms training time, 292365 items rescored
Writing indicators...
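
The first pass in the log above reports the number of interactions, users, and items. A statistics pass of this kind could be sketched in Rust as follows. The names `DataStats` and `first_pass` are hypothetical, and counting every well-formed line as one interaction (including repeated user-item pairs) is an assumption; recoreco's own first pass may differ.

```rust
use std::collections::HashSet;
use std::io::{self, BufRead};

/// Summary statistics of an interaction file.
#[derive(Debug, PartialEq)]
pub struct DataStats {
    pub interactions: u64,
    pub users: usize,
    pub items: usize,
}

/// Streams over tab-separated `<user>\t<item>` lines and counts the
/// interactions as well as the distinct users and items.
pub fn first_pass<R: BufRead>(reader: R) -> io::Result<DataStats> {
    let mut users = HashSet::new();
    let mut items = HashSet::new();
    let mut interactions = 0u64;

    for line in reader.lines() {
        let line = line?;
        // Lines without a tab character are skipped in this sketch.
        if let Some((user, item)) = line.split_once('\t') {
            users.insert(user.to_owned());
            items.insert(item.to_owned());
            interactions += 1;
        }
    }

    Ok(DataStats {
        interactions,
        users: users.len(),
        items: items.len(),
    })
}
```

To run it on a file, wrap a `std::fs::File` in a `std::io::BufReader` and pass it to `first_pass`.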

The file artists.json now contains the results of the computation. Let's have a look at some artist recommendations using the JSON processor jq.

Who is strongly associated with Michael Jackson?

$ jq 'select(.for_item=="michael jackson")' artists.json

{
  "for_item": "michael jackson",
  "indicated_items": [
    "justin timberlake",
    "queen",
    "kanye west",
    "amy winehouse",
    "britney spears",
    "madonna",
    "rihanna",
    "beyoncé",
    "daft punk",
    "u2"
  ]
}

One of my favorite bands is Hot Water Music, so let's see which bands people associate with them:

$ jq 'select(.for_item=="hot water music")' artists.json

{
  "for_item": "hot water music",
  "indicated_items": [
    "lifetime",
    "the get up kids",
    "the lawrence arms",
    "the gaslight anthem",
    "dillinger four",
    "propagandhi",
    "the bouncing souls",
    "strike anywhere",
    "jawbreaker",
    "chuck ragan"
  ]
}

And finally, we look for artists similar to Paco de Lucia in homage to Ted's days of building search engines for Veoh :)

$ jq 'select(.for_item=="paco de lucia")' artists.json

{
  "for_item": "paco de lucia",
  "indicated_items": [
    "miguel poveda",
    "cserhati zsuzsa",
    "ramón veloz",
    "szarka tamás",
    "camaron de la isla",
    "cseh tamás - másik jános",
    "duquende",
    "amr diab",
    "chuck brown & eva cassidy",
    "keympa"
  ]
}

Programmatic Usage

recoreco can also be included as a library in your Rust program. We provide a basic example of how to do this. Be sure to check out the documentation for further details.