Formatting

wdenton · Dec 7, 2011 · d7a8c07 · d7a8c07
1 parent bd96f24
commit d7a8c07
Show file tree

Hide file tree

Showing 2 changed files with 193 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,2 @@
+*~
+
diff --git a/splurge.markdown b/splurge.markdown
@@ -0,0 +1,191 @@
+# Goal
+
+Collect usage data from OCUL members and build a recommendation engine 
+that can be integrated into any members catalogue. Make the anonymized 
+data available under an open license so members and others can better 
+assess and understand collection usage in Ontario, and make the software 
+available under the GNU Public License so anyone can use it.
+
+# Background
+
+This project is based on a British project at JISC called [MOSAIC (Making Our Shared Activity Information Count)](http://sero.co.uk/jisc-mosaic-documents.html). The documents there include:
+
+* [MOSAIC Data Collection: A Guide](http://sero.co.uk/assets/090514%20MOSAIC%20data%20collection%20-%20A%20guide%20v01.pdf)
+* [MOSAIC Final Report](http://sero.co.uk/mosaic/100322_MOSAIC_Final_Report_v7_FINAL.pdf) (and [Appendices](http://sero.co.uk/mosaic/100212%20MOSAIC%20Final%20Report%20Appendices%20FINAL.pdf))
+* Also [MOSAIC Demonstration Links](http://sero.co.uk/mosaic/091012-MOSAIC-Demonstration-Links.doc), from a software contest they ran to find new, interesting uses for their data. The examples here go beyond 
+the Recommendation Engine idea, but are worth looking at to see other 
+possible future directions.)
+
+The JISC project grew out of work done by Dave Pattern and others at the 
+University of Huddersfield. They made usage data available under an Open 
+Data Commons License.
+
+* [Data](http://library.hud.ac.uk/data/usagedata/)
+* [README](http://library.hud.ac.uk/data/usagedata/_readme.html)
+* Dave Pattern, Library Systems Manager at Huddersfield, explains things in [Free book usage data from the University of Huddersfield](http://www.daveyp.com/blog/archives/528)
+
+# Data gathering
+
+## Data levels
+
+MOSAIC set out three levels of usage data in the [Final Report](http://sero.co.uk/mosaic/100322_MOSAIC_Final_Report_v7_FINAL.pdf) (p 40):
+
+> We refer to library circulation (loan & renewal) information as use data. Use 
+> data contains one use record per item borrowed. Sets of use records may 
+> have different amounts of information in each record, according to the 
+> data level that applies to all the records in the set.
+
+<table>
+<thead>
+<tr>
+<th>Level</th>
+<th>Description</th>
+<th>Use</th>
+</tr>
+</thead>
+<tbody>
+</tbody>
+</table>
+
+Level 0	Level 0 use records contain where and when the loan was made and 
+the item borrowed.	Level 0 use data can be used to indicate popular 
+loan items in the participating library.
+Level 1	Level 1 records are as for level 0, but also with borrower context 
+information, indicating borrower type (staff or student), and course and 
+progression level (for students).	Level 1 use data can be used to 
+see, via facets, for a given search, what was borrowed in one or more of: 
+a particular institution, a particular course, a particular progression 
+level (or by staff), and in a particular academic year.
+Level 2	Level 2 records are as for level 0, but also with an anonymised 
+user ID	Level 2 use data enables recommendations like borrowers of this 
+item also borrowed, and borrowers of this item previously borrowed /went 
+on to borrow.
+
+This project would collect use data at Level 0.
+
+## Data extraction
+
+Scholars Portal will give template XML files, with instructions, to member 
+libraries, who will pull the necessary data from their systems. Because 
+there are several different ILSes involved, the necessary database or 
+report commands will vary, but once done for one ILS they can be shared 
+with other users of the same system.
+
+!!! TODO Expand with actual examples
+
+## Data formats
+
+Following the MOSAIC lead (as described in the README from their script 
+repository), we will collect item file and yearly transaction files from 
+libraries.
+
+Item file: items.txt:
+
+FIELDS:
+
+   * item ID
+   * ISBN(s)
+   * title
+     author(s)
+     publisher
+     publication year
+     persistent URL
+
+SAMPLE:
+
+     123  0415972531  Music & copyright  L. Marshall  Wiley    2004 
+http://libcat.hud.ac.uk/123
+     234  0415969298  Songwriting tips   N. Skilbeck  Phaidon  1997 
+http://libcat.hud.ac.uk/234
+The item ID is whatever ID you want to use to identify a library book. It 
+must match the item ID contained in the item file.
+The ISBN(s) are one (or more) ISBNs, separated by a | pipe character where 
+more than one ISBN is linked to the item (e.g. 0415966744|0415966752).
+The title is the title of the book.
+The author(s) are one (or more) names, separated by a | pipe character 
+where more than one name is present (e.g. John Smith|Julie Johnson).
+The publisher and publication year are the name of the publishing company 
+and the year of publication.
+The persistent URL is the web address the item can be found at (e.g. on 
+your library catalogue).
+
+Transaction files: transaction.YYYY.txt
+
+FIELDS:
+
+   * timestamp
+   * item ID
+   * user ID
+
+SAMPLE:
+
+     1222646400    114784    67890
+     1225756800    103828    67890
+     1225756800    62580     76543
+The timestamp is in Unix time format (i.e. the number of seconds since 1st 
+Jan 1970 UTC). It is used to calculate the day the transaction occurred 
+on.
+The user ID is whatever ID you want to use to identify an individual 
+library user. It will be converted to a MD5 hash value before the data is 
+submitted to MOSAIC. It must match the user ID contained in the user file.
+The item ID is whatever ID you want to use to identify a library book. It 
+must match the item ID contained in the item file.
+
+The basic usage data to be gathered is:
+
+ 	Item title
+ 	ISBN
+ 	Number of copies
+ 	URL of item in catalogue
+ 	Loan history, giving number of initial circulations per year over 
+the last 10 years (or fewer, if 10 years of data is not available)
+
+The basic also-borrowed data to be gathered for each item (A) is a list of 
+other items (B) that shows:
+
+ 	how many times A was borrowed before B
+ 	how many times A and B were borrowed together
+ 	how many times A was borrowed after B
+ 	how many times B was borrowed in total
+
+Scholars Portal will aggregate the data from the different libraries, and 
+make the data openly available.
+
+# Privacy
+
+No identifying information will be connected to the usage data. It is 
+completely anonymous.
+
+## Data storage
+
+The data will be stored using the same format as Huddersfield used in 
+their data release (see 
+http://library.hud.ac.uk/data/usagedata/_readme.html):
+
+ 	circulation_data.xml contains aggregate usage information for 
+individual titles
+ 	suggestion_data.xml contains people who borrowed X also borrowed Y 
+relations
+ 	schools.xml is a lookup file listing OCUL members and ID numbers
+ 	courses.xml is a lookup file listing course codes and ID numbers
+
+!!!! TODO Expand
+
+## Recommendation Engine
+
+!!! TODO Write up what is known about how this can work, from MOSAIC and 
+what Tim Spalding said
+
+When the Recommendation Engine is given an ISBN or other ID number it will 
+suggest a list of related items, using an algorithm based on the 
+also-borrowed data and the usage data.
+
+[suggest algorithm? Also use LibraryThing data? We can get it from Tim 
+Spalding.]
+
+# Implementation as a web service
+
+The Recommendation Engine will have web-based API available at Scholars 
+Portal. Ideally a library will be able to insert one line of Javascript 
+into its HTML template to make the recommendations appear.
+