by T. Bertin-Mahieux (2011) Columbia University
This folder contains code to deal with the musiXmatch dataset,
the official collection of lyrics for the Million Song Dataset.
The mXm dataset comes as two text files: a train set and a test set.
To simplify its usage, we also provide a SQLite database with the
same data. The code to recreate this database is:
More details on the database:
- it contains two tables, 'words' and 'lyrics'
- table 'words' has one column: 'word'. Words are inserted in order of
  popularity, so a word's ROWID gives its position in that ranking.
  ROWID is an implicit column in SQLite; it starts at 1.
- table 'lyrics' contains 5 columns:
   - column 'track_id' -> as usual, track ID from the MSD
   - column 'mxm_tid' -> track ID from musiXmatch
   - column 'word' -> a word that is also in the 'words' table
   - column 'count' -> number of occurrences of that word in the track
   - column 'is_test' -> 0 if this example is from the train set, 1 if test
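
As an illustration of the schema described above, here is a minimal Python
sketch using the standard 'sqlite3' module. It builds a tiny in-memory
database with the same two tables so that it runs on its own; the column
types, the track IDs, the words and the counts are all made up for the
example, and the helper names (build_demo_db, word_rank, track_bag_of_words,
count_tracks) are hypothetical, not part of the provided code. To query the
real data, you would connect to the provided SQLite file instead of ':memory:'.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database mimicking the two tables (toy data only)."""
    conn = sqlite3.connect(":memory:")
    # Column types are an assumption; only the column names come from the README.
    conn.execute("CREATE TABLE words (word TEXT)")
    conn.execute(
        'CREATE TABLE lyrics (track_id TEXT, mxm_tid INT, word TEXT, '
        '"count" INT, is_test INT)'
    )
    # Insertion order defines ROWID, which starts at 1.
    conn.executemany("INSERT INTO words VALUES (?)", [("i",), ("the",), ("you",)])
    # Made-up rows: (track_id, mxm_tid, word, count, is_test)
    conn.executemany(
        "INSERT INTO lyrics VALUES (?, ?, ?, ?, ?)",
        [
            ("TRDEMO1", 111, "i", 5, 0),
            ("TRDEMO1", 111, "you", 2, 0),
            ("TRDEMO2", 222, "the", 3, 1),
        ],
    )
    conn.commit()
    return conn


def word_rank(conn, word):
    """Return the ROWID of a word in 'words', or None if the word is absent."""
    row = conn.execute("SELECT ROWID FROM words WHERE word = ?", (word,)).fetchone()
    return row[0] if row else None


def track_bag_of_words(conn, track_id):
    """Return (word, count) pairs for one MSD track, most frequent first."""
    query = 'SELECT word, "count" FROM lyrics WHERE track_id = ? ORDER BY "count" DESC'
    return conn.execute(query, (track_id,)).fetchall()


def count_tracks(conn, is_test):
    """Count distinct tracks in the train set (is_test=0) or test set (is_test=1)."""
    query = "SELECT COUNT(DISTINCT track_id) FROM lyrics WHERE is_test = ?"
    return conn.execute(query, (is_test,)).fetchone()[0]


if __name__ == "__main__":
    conn = build_demo_db()
    print(word_rank(conn, "the"))
    print(track_bag_of_words(conn, "TRDEMO1"))
    print(count_tracks(conn, 0), count_tracks(conn, 1))
    conn.close()
```

The column name 'count' is quoted in the queries because it is also the name
of a SQL function; the quoting is a precaution, not a requirement stated by
the dataset.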
If you want to know exactly how we created the bag-of-words, look at:
Note that it requires the following Python package:
Please enjoy, and don't hesitate to give us feedback!