Predicting a song's genre from its lyrical content

Lyrics, Pt. 1: Genre Classification

Musical artists cross genre boundaries more than ever. This project builds a model that predicts a song's genre based solely on its lyrical content.

A full description of the project can be found at

Getting started

Prerequisite software

  • Python (suggested install through Anaconda)

  • R

Prerequisite libraries

  • Python:

    • bs4, numpy, pandas, re, requests, sklearn, string, warnings (all installed with Anaconda)
    • json (part of the Python standard library; no installation needed)
    • nltk (!pip install nltk)
    • xgboost (!pip install xgboost)
  • R:

lib <- c('dplyr', 'geniusR', 'jsonlite', 'lubridate', 'stringr')
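Before running the notebooks, it can help to confirm that the Python libraries listed above are importable. The following is a minimal sketch, not part of the repository: the helper name `missing_modules` is hypothetical, and the module names are taken from the prerequisite list.

```python
import importlib.util

# Module names copied from the prerequisite list in this README.
REQUIRED = [
    "bs4", "numpy", "pandas", "re", "requests", "sklearn",
    "string", "warnings", "json", "nltk", "xgboost",
]


def missing_modules(names):
    """Return the names that cannot be located on the current Python path.

    Hypothetical helper: find_spec only locates a module, it does not import it.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All prerequisite libraries were found.")
```

Any name reported as missing can then be installed with pip or conda before opening the notebooks.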

Instructions for use

1. Run the code contained in /python/artist_collection.ipynb

This code scrapes Billboard, Ranker, and TheTopTens for artists of different genres. Any duplicate artists are removed as appropriate.

The output of /python/artist_collection.ipynb can also be found at /data/json_genres.json.
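The duplicate-removal step could look roughly like the sketch below. It is an illustration under stated assumptions, not the notebook's code: it assumes the scraped artists are held as a dictionary mapping each genre to a list of names, and that an artist who appears more than once is kept only under the first genre encountered. The function name `dedupe_artists` and the sample data are hypothetical.

```python
def dedupe_artists(genres):
    """Remove repeated artist names, keeping the first occurrence.

    Assumption: `genres` maps a genre name to a list of artist names.
    Names are compared case-insensitively after trimming whitespace.
    """
    seen = set()
    cleaned = {}
    for genre, artists in genres.items():
        kept = []
        for name in artists:
            stripped = name.strip()
            key = stripped.casefold()
            if key and key not in seen:
                seen.add(key)
                kept.append(stripped)
        cleaned[genre] = kept
    return cleaned


# Hypothetical sample input with placeholder names.
sample = {
    "country": ["Artist A", "artist a ", "Artist B"],
    "pop": ["Artist B", "Artist C"],
}
deduped = dedupe_artists(sample)
print(deduped)
```

Whether a cross-genre artist should instead be dropped entirely, or kept under several genres, is a design choice this README does not specify.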

2. Run /r/genius_scraper.R

This program scrapes and cleans lyrics from Genius, categorizing results by genre. Visit Genius to view or obtain a Genius client access token.

The output of /r/genius_scraper.R can also be found at /data/lyrics.csv.

3. Run the code contained in /python/lyrics_classifier.ipynb

This code preprocesses all lyrics for modeling and then fits Naïve Bayes, support vector machine, and gradient boosting models to predict a song's genre from its lyrics.
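To make the preprocess-then-classify idea concrete, here is a small self-contained sketch of a multinomial Naïve Bayes classifier over bag-of-words counts, written with the standard library only. It is an illustration, not the notebook's implementation (which presumably relies on sklearn and xgboost); the `tokenize` function, the `NaiveBayes` class, and the toy lyrics are all hypothetical.

```python
import math
import string
from collections import Counter, defaultdict

# Translation table that deletes ASCII punctuation (curly quotes are not covered).
_PUNCT = str.maketrans("", "", string.punctuation)


def tokenize(lyrics):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    return lyrics.lower().translate(_PUNCT).split()


class NaiveBayes:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per genre
        self.word_counts = defaultdict(Counter)  # word counts per genre
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Words never seen during training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + self.alpha) / denom)  # log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented purely for illustration.
texts = [
    "Truck, road, whiskey!",
    "Whiskey river truck",
    "Club dance tonight",
    "Dance floor club",
]
labels = ["country", "country", "pop", "pop"]

model = NaiveBayes().fit(texts, labels)
print(model.predict("whiskey on the road"))
print(model.predict("dance club"))
```

On real lyrics, a library implementation with TF-IDF features and a held-out test set would be the more practical route; the sketch only shows the mechanics of scoring each genre by its smoothed word frequencies.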



This project is licensed under the MIT License - see the file for details.

