# An analysis of tracks played by the Legacy Music Hour

The Legacy Music Hour is a podcast devoted to tracks from the 8-bit and 16-bit consoles. It is hosted by two comedians, Rob F. Switch and Brent Weinbach, who show real enthusiasm for the music, and I have a lot of affection for the show. The podcast has over 150 episodes, showcasing well over 2000 tracks.

Before every episode, the hosts upload its full track listing, with game, system, and composer information. I decided to write some code to pull all of this information and see what the data says.

First, we will pull all of the track data from the LMH Blogger website.

In [5]:
import re
def lmh_standardize_whitespace(text):
    """ 
    The line breaks in the episode descriptions are not uniform,
    so this routine standardizes them to plain newlines.
    While this breaks the HTML for display, it helps with parsing.
    """
    # first replace <br />\n -> <br />
    text = re.sub(r'<br\s*?/>\n', '<br />', text)
    # then replace <br /> -> \n
    return re.sub(r'<br\s*?/>', '\n', text)
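
As a quick illustration of the two-step substitution, here is a minimal, self-contained sketch. The helper name `standardize_breaks` and the sample string are made up for this example and are not taken from the LMH site.

```python
import re

def standardize_breaks(text):
    # Step 1: collapse "<br />" followed by a newline into a bare "<br />"
    text = re.sub(r'<br\s*?/>\n', '<br />', text)
    # Step 2: turn every remaining "<br />" (or "<br/>") into a newline
    return re.sub(r'<br\s*?/>', '\n', text)

# Hypothetical snippet mixing both line-break styles
sample = 'Game A - Composer A<br />\nGame B - Composer B<br/>Game C - Composer C'
result = standardize_breaks(sample)
lines = result.split('\n')
```

Doing the substitution in two passes means a `<br />` that is already followed by a newline does not produce a spurious empty line.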


In [8]:
import datetime
def blogger_url_gen(baseurl,start_date,end_date):
    """
    Given a start and end datetime, generate a blogger page query.
    """
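    if not end_date:
        end_date = datetime.datetime.now()
    # %m is the month; %M would insert the minute instead
    start_str = start_date.strftime("%Y-%m-%d")
    end_str = end_date.strftime("%Y-%m-%d")
    # strip any trailing slash so the path does not start with '//'
    return baseurl.rstrip('/')+'/search?updated-min='+start_str+'T00:00:00-08:00&updated-max='+end_str+'T00:00:00-08:00'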

def lmh_url_gen(start_date,end_date=None):
    return blogger_url_gen('http://legacymusichour.blogspot.ca/',start_date,end_date)
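
To make the shape of the generated query concrete, here is a small sketch that builds the same kind of URL for a fixed date range. The function name `date_range_query` and the end date are arbitrary choices for illustration. Note that in `strftime`, `%m` is the zero-padded month while `%M` is the minute.

```python
import datetime

def date_range_query(base_url, start_date, end_date):
    # Format both dates as year-month-day
    fmt = "%Y-%m-%d"
    return (base_url.rstrip('/')
            + '/search?updated-min=' + start_date.strftime(fmt) + 'T00:00:00-08:00'
            + '&updated-max=' + end_date.strftime(fmt) + 'T00:00:00-08:00')

# The end date here is an arbitrary example value
url = date_range_query('http://legacymusichour.blogspot.ca/',
                       datetime.datetime(2011, 1, 1),
                       datetime.datetime(2014, 6, 30))
```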

In [21]:
from bs4 import BeautifulSoup
import urllib2
import json
import contextlib

def normalize_track(episode_number,track_fields):
    """
    Given a track which has (potentially) many composers, 
    generate a data point per composer.
    """
    for composer in track_fields[1].split(','):
        yield {'episode_number':episode_number,
               'game':track_fields[0],
               'composer':composer.strip(),  # drop the space left after each comma
               'title':track_fields[2],
               'producer':track_fields[3],
               'console':track_fields[4],
               'year':track_fields[5]}

def parse_tracks(soup):
    """
    Parse the soup of an LMH blogger page, pulling out the tracks
    """
    for outer_p in soup.find_all("div",class_="post-outer"):
        title_text = outer_p.find("h3",class_="post-title entry-title")
        if title_text is not None and "Episode" in title_text.text:
            # the episode number is the digits before the first colon of the title
            episode_number = ''.join([c
                                      for c in title_text.text.strip().split(':')[0]
                                      if c.isdigit()])
            # keep only blocks that split into at least six ' - ' separated fields
            rawlist = [
                i.strip().split(' - ')
                for i in outer_p.text.strip().split('\n\n')
                if len(i.split(' - '))>=6]
            for j in rawlist:
                # skip the header row of each listing
                if j[1] != 'Composer':
                    yield episode_number+';'+(';'.join(j))+'\n'

def compoyors():
    """
    Generator for all the tracks on LMH
    """    
    data_url = lmh_url_gen(datetime.datetime(year=2011,month=1,day=1))
    has_older_posts = True
    while has_older_posts:
        with contextlib.closing(urllib2.urlopen(data_url)) as response:        
            html = response.read().decode('utf-8')
            soup = BeautifulSoup(lmh_standardize_whitespace(html))
            for track in parse_tracks(soup):
                yield track                
            older_posts = soup.find("a",text="Older Posts")
            if older_posts:
                # follow the pagination link to the next page of posts
                data_url = older_posts.get('href')
            else:
                has_older_posts = False
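
The parsing and per-composer steps can be tried without touching the network. The sketch below runs them on a made-up listing; the function names, the header wording, and all field values are placeholders for illustration, not real LMH data.

```python
def listing_to_rows(episode_number, block):
    """Split a listing into rows of ' - ' separated fields, skipping the header."""
    rows = []
    for line in block.strip().split('\n\n'):
        fields = line.strip().split(' - ')
        # keep lines with at least six fields that are not the header row
        if len(fields) >= 6 and fields[1] != 'Composer':
            rows.append([episode_number] + fields)
    return rows

def one_record_per_composer(row):
    """Expand a row into one dict per comma-separated composer."""
    episode_number, game, composers, title = row[0], row[1], row[2], row[3]
    return [{'episode_number': episode_number,
             'game': game,
             'composer': composer.strip(),
             'title': title}
            for composer in composers.split(',')]

# Placeholder listing: a header, one track with two composers, one stray sentence
block = (
    'Game - Composer - Song - Company - Console - Year\n\n'
    'Example Game - Composer A, Composer B - Example Title - '
    'Example Company - Example Console - 1990\n\n'
    'A sentence without any track fields.'
)
rows = listing_to_rows('12', block)
records = [r for row in rows for r in one_record_per_composer(row)]
```

One caveat of splitting on `' - '`: a game or track title that itself contains `' - '` would produce more than six fields, so the positions of the later fields would shift.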

In [23]:
tracks = [track for track in compoyors()]

### Cleaning and Normalizing the data

The script above pulls the data with some errors, so we first need to clean the entries.
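
One possible first pass, sketched here under the assumption that each record is a `;`-joined string with an episode number followed by six fields, is to flag records that do not fit that shape. The helper name `find_suspect_records` and the sample records are hypothetical.

```python
def find_suspect_records(records, expected_fields=7):
    """Return records with the wrong field count or a non-numeric episode number."""
    suspects = []
    for record in records:
        fields = record.rstrip('\n').split(';')
        if len(fields) != expected_fields or not fields[0].isdigit():
            suspects.append(record)
    return suspects

# Placeholder records: one well formed, one missing its episode number,
# one with an extra field
sample_records = [
    '12;Game;Composer;Title;Company;Console;1990\n',
    ';Game;Composer;Title;Company;Console;1990\n',
    '13;Game;Composer;Title;Extra;Company;Console;1990\n',
]
suspects = find_suspect_records(sample_records)
```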

In [None]:
# first let's look at the episode numbers



