Formatting
wdenton committed Dec 7, 2011
1 parent bd96f24 commit d7a8c07
Showing 2 changed files with 193 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
*~

191 changes: 191 additions & 0 deletions splurge.markdown
@@ -0,0 +1,191 @@
# Goal

Collect usage data from OCUL members and build a recommendation engine
that can be integrated into any member's catalogue. Make the anonymized
data available under an open license so members and others can better
assess and understand collection usage in Ontario, and make the software
available under the GNU General Public License so anyone can use it.

# Background

This project is based on a British project at JISC called [MOSAIC (Making Our Shared Activity Information Count)](http://sero.co.uk/jisc-mosaic-documents.html). The documents there include:

* [MOSAIC Data Collection: A Guide](http://sero.co.uk/assets/090514%20MOSAIC%20data%20collection%20-%20A%20guide%20v01.pdf)
* [MOSAIC Final Report](http://sero.co.uk/mosaic/100322_MOSAIC_Final_Report_v7_FINAL.pdf) (and [Appendices](http://sero.co.uk/mosaic/100212%20MOSAIC%20Final%20Report%20Appendices%20FINAL.pdf))
* [MOSAIC Demonstration Links](http://sero.co.uk/mosaic/091012-MOSAIC-Demonstration-Links.doc), from a software contest they ran to find new, interesting uses for their data. (The examples here go beyond
the Recommendation Engine idea, but are worth looking at to see other
possible future directions.)

The JISC project grew out of work done by Dave Pattern and others at the
University of Huddersfield. They made usage data available under an Open
Data Commons License.

* [Data](http://library.hud.ac.uk/data/usagedata/)
* [README](http://library.hud.ac.uk/data/usagedata/_readme.html)
* Dave Pattern, Library Systems Manager at Huddersfield, explains things in [Free book usage data from the University of Huddersfield](http://www.daveyp.com/blog/archives/528)

# Data gathering

## Data levels

MOSAIC set out three levels of usage data in the [Final Report](http://sero.co.uk/mosaic/100322_MOSAIC_Final_Report_v7_FINAL.pdf) (p 40):

> We refer to library circulation (loan & renewal) information as use data. Use
> data contains one use record per item borrowed. Sets of use records may
> have different amounts of information in each record, according to the
> data level that applies to all the records in the set.

<table>
<thead>
<tr>
<th>Level</th>
<th>Description</th>
<th>Use</th>
</tr>
</thead>
<tbody>
<tr>
<td>Level 0</td>
<td>Level 0 use records contain where and when the loan was made and the item borrowed.</td>
<td>Level 0 use data can be used to indicate popular loan items in the participating library.</td>
</tr>
<tr>
<td>Level 1</td>
<td>Level 1 records are as for level 0, but also with borrower context information, indicating borrower type (staff or student), and course and progression level (for students).</td>
<td>Level 1 use data can be used to see, via facets, for a given search, what was borrowed in one or more of: a particular institution, a particular course, a particular progression level (or by staff), and in a particular academic year.</td>
</tr>
<tr>
<td>Level 2</td>
<td>Level 2 records are as for level 0, but also with an anonymised user ID.</td>
<td>Level 2 use data enables recommendations like "borrowers of this item also borrowed", and "borrowers of this item previously borrowed / went on to borrow".</td>
</tr>
</tbody>
</table>

This project would collect use data at Level 0.

## Data extraction

Scholars Portal will give template XML files, with instructions, to member
libraries, who will pull the necessary data from their systems. Because
there are several different ILSes involved, the necessary database or
report commands will vary, but once done for one ILS they can be shared
with other users of the same system.

!!! TODO Expand with actual examples

## Data formats

Following the MOSAIC lead (as described in the README from their script
repository), we will collect an item file and yearly transaction files
from each library.

Item file: items.txt

FIELDS:

* item ID
* ISBN(s)
* title
* author(s)
* publisher
* publication year
* persistent URL

SAMPLE:

    123 0415972531 Music & copyright L. Marshall Wiley 2004 http://libcat.hud.ac.uk/123
    234 0415969298 Songwriting tips N. Skilbeck Phaidon 1997 http://libcat.hud.ac.uk/234

* The item ID is whatever ID you want to use to identify a library book. It
must match the item ID used in the transaction files.
* The ISBN(s) are one (or more) ISBNs, separated by a | pipe character where
more than one ISBN is linked to the item (e.g. 0415966744|0415966752).
* The title is the title of the book.
* The author(s) are one (or more) names, separated by a | pipe character
where more than one name is present (e.g. John Smith|Julie Johnson).
* The publisher and publication year are the name of the publishing company
and the year of publication.
* The persistent URL is the web address the item can be found at (e.g. on
your library catalogue).
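
The field rules above can be sketched as a small parser. This is only an illustration: the tab delimiter is an assumption (the sample does not show which separator is used), and `Item` and `parse_item_line` are hypothetical names, not part of the MOSAIC scripts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    item_id: str
    isbns: List[str]
    title: str
    authors: List[str]
    publisher: str
    year: str
    url: str


def parse_item_line(line: str, delimiter: str = "\t") -> Item:
    """Parse one items.txt record (delimiter is an assumption, default tab)."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    item_id, isbns, title, authors, publisher, year, url = fields
    return Item(
        item_id=item_id,
        # ISBNs and authors may hold several values separated by a pipe.
        isbns=[isbn for isbn in isbns.split("|") if isbn],
        title=title,
        authors=[name for name in authors.split("|") if name],
        publisher=publisher,
        year=year,
        url=url,
    )
```

Records with a missing or extra field raise an error rather than being silently misaligned, which seems the safer choice when files come from several different ILSes.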

Transaction files: transaction.YYYY.txt

FIELDS:

* timestamp
* item ID
* user ID

SAMPLE:

    1222646400 114784 67890
    1225756800 103828 67890
    1225756800 62580 76543

* The timestamp is in Unix time format (i.e. the number of seconds since 1st
Jan 1970 UTC). It is used to calculate the day the transaction occurred
on.
* The user ID is whatever ID you want to use to identify an individual
library user. It will be converted to an MD5 hash value before the data is
submitted to MOSAIC. It must match the user ID contained in the user file.
* The item ID is whatever ID you want to use to identify a library book. It
must match the item ID contained in the item file.
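
As a rough sketch of how one transaction record could be processed, the following converts the Unix timestamp to a UTC calendar day and replaces the user ID with its MD5 hash. It assumes whitespace-separated fields as in the sample; `Transaction` and `parse_transaction_line` are hypothetical names.

```python
import hashlib
from datetime import date, datetime, timezone
from typing import NamedTuple


class Transaction(NamedTuple):
    day: date
    item_id: str
    user_hash: str


def parse_transaction_line(line: str) -> Transaction:
    """Parse 'timestamp item_id user_id' (whitespace-separated, an assumption)."""
    timestamp, item_id, user_id = line.split()
    # Unix time is seconds since 1970-01-01 UTC; keep only the calendar day.
    day = datetime.fromtimestamp(int(timestamp), tz=timezone.utc).date()
    # Replace the raw user ID with its MD5 hex digest before the data leaves
    # the library.
    user_hash = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return Transaction(day, item_id, user_hash)
```

For the first sample record, `parse_transaction_line("1222646400 114784 67890").day` is 29 September 2008.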

The basic usage data to be gathered is:

* Item title
* ISBN
* Number of copies
* URL of item in catalogue
* Loan history, giving number of initial circulations per year over
the last 10 years (or fewer, if 10 years of data is not available)
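
The per-year loan history could be derived from the transaction files along these lines. This is a sketch under two assumptions: records are whitespace-separated as in the sample, and every record is an initial loan rather than a renewal. `circulations_per_year` is a hypothetical name.

```python
from collections import Counter
from datetime import datetime, timezone


def circulations_per_year(lines):
    """Count loans per (item ID, calendar year) from transaction records."""
    counts = Counter()
    for line in lines:
        timestamp, item_id, _user_id = line.split()
        year = datetime.fromtimestamp(int(timestamp), tz=timezone.utc).year
        counts[(item_id, year)] += 1
    return counts
```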

The basic also-borrowed data to be gathered for each item (A) is a list of
other items (B) that shows:

* how many times A was borrowed before B
* how many times A and B were borrowed together
* how many times A was borrowed after B
* how many times B was borrowed in total
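
One possible way to compute these four counts is sketched below. The definitions are assumptions, not something the MOSAIC documents specify here: two loans are related when the same user made both, "together" means on the same day, and every pair of loan events counts once. `also_borrowed` is a hypothetical name.

```python
from collections import Counter, defaultdict


def also_borrowed(transactions):
    """Build also-borrowed counts from (day, item_id, user_id) tuples.

    Returns (relations, totals): relations[(A, B)] holds the before /
    together / after counts for A relative to B, and totals[B] is the
    number of times B was borrowed overall.
    """
    loans_by_user = defaultdict(list)
    totals = Counter()
    for day, item_id, user_id in transactions:
        loans_by_user[user_id].append((day, item_id))
        totals[item_id] += 1

    relations = defaultdict(lambda: {"before": 0, "together": 0, "after": 0})
    for loans in loans_by_user.values():
        for day_a, item_a in loans:
            for day_b, item_b in loans:
                if item_a == item_b:
                    continue  # an item is not related to itself
                if day_a < day_b:
                    key = "before"
                elif day_a == day_b:
                    key = "together"  # assumption: same day means together
                else:
                    key = "after"
                relations[(item_a, item_b)][key] += 1
    return relations, totals
```

The nested loop is quadratic in the number of loans per user, which is acceptable for a sketch but would need bounding for heavy borrowers.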

Scholars Portal will aggregate the data from the different libraries, and
make the data openly available.

# Privacy

No identifying information will be connected to the usage data. It is
completely anonymous.

## Data storage

The data will be stored using the same format as Huddersfield used in
their data release (see
http://library.hud.ac.uk/data/usagedata/_readme.html):

* circulation_data.xml contains aggregate usage information for
individual titles
* suggestion_data.xml contains "people who borrowed X also borrowed Y"
relations
* schools.xml is a lookup file listing OCUL members and ID numbers
* courses.xml is a lookup file listing course codes and ID numbers

!!!! TODO Expand

## Recommendation Engine

!!! TODO Write up what is known about how this can work, from MOSAIC and
what Tim Spalding said

When the Recommendation Engine is given an ISBN or other ID number it will
suggest a list of related items, using an algorithm based on the
also-borrowed data and the usage data.
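
Since the algorithm is still undecided, the following is only a placeholder sketch of the simplest possible ranking: suggest the items most often borrowed by the same users, with ties broken by item ID so the output is stable. The `co_borrows` structure and the `recommend` function are hypothetical.

```python
def recommend(item_id, co_borrows, limit=5):
    """Return up to `limit` related item IDs for `item_id`.

    co_borrows maps an item ID to {other item ID: number of shared loans}.
    """
    candidates = co_borrows.get(item_id, {})
    # Highest shared-loan count first; item ID breaks ties deterministically.
    ranked = sorted(candidates.items(), key=lambda pair: (-pair[1], pair[0]))
    return [other for other, _count in ranked[:limit]]
```

A real engine would likely also weigh the total loans of each candidate, so that very popular items do not dominate every list.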

[suggest algorithm? Also use LibraryThing data? We can get it from Tim
Spalding.]

# Implementation as a web service

The Recommendation Engine will have a web-based API available at Scholars
Portal. Ideally a library will be able to insert one line of JavaScript
into its HTML template to make the recommendations appear.
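
To make the idea concrete, here is a minimal sketch of what such an API endpoint might look like as a WSGI application using only the Python standard library. Everything specific in it is an assumption: the `isbn` query parameter, the JSON shape, and the in-memory `RECOMMENDATIONS` table standing in for the aggregated data.

```python
import json
from urllib.parse import parse_qs

# Hypothetical stand-in for the aggregated also-borrowed data.
RECOMMENDATIONS = {"0415972531": ["0415969298"]}


def application(environ, start_response):
    """Answer GET /?isbn=... with a JSON list of related ISBNs."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    isbn = query.get("isbn", [""])[0]
    payload = {"isbn": isbn, "related": RECOMMENDATIONS.get(isbn, [])}
    body = json.dumps(payload).encode("utf-8")
    start_response(
        "200 OK",
        [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]
```

For local experimentation this can be served with `wsgiref.simple_server.make_server("", 8000, application).serve_forever()`; the one-line JavaScript include would then fetch this JSON and render the list in the catalogue page.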
