# A crash course on IPython (Jupyter) notebooks

This file contains a tutorial on analyzing the music data using Spark.
But first, a primer on how to use the notebook.

## A notebook is divided into cells

The boxes labeled with `In [ ]:` are called cells.  They contain Python code.
To run the code in a cell, click on the cell and press **Shift + Enter**.

Try it now on the cell below:

In [1]:
1 + 1

2

You should see an output like `Out[1]: 2`.

Next, edit the above cell to say `1 + 2` and run it.

## Check that the SparkContext is loaded

Since you launched the notebook with PySpark, an object called `sc` is automatically loaded.
This object is the connection to the Spark cluster.
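Run the cell below.
You *should* see output of the form

`<pyspark.context.SparkContext at 0x...>`

where `0x...` is a memory address that varies between sessions.
If you get an error instead (for example a `NameError`), `sc` was not created and the rest of the tutorial will not work.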

In [2]:
sc

<pyspark.context.SparkContext at 0x7f210455f950>

# Music recommendations with PySpark

## 1. Load some modules (libraries)

The following cell imports the modules we need.

In [3]:
import os
import numpy as np

## 2. Load the artist info

We will load the data in `/root/data/artist_data.txt` into Python itself as a *dictionary*
(a hash table of key-value pairs).
This allows us to make sense of the numeric IDs we will get in the analysis.

In [4]:
# Load the lines of the text into a list of strings.
# The with-block closes the file automatically when it ends.
with open('/root/data/artist_data.txt', 'r') as f:
    txt = f.read().split('\n')
txt[0:5]

['1134999\t06Crazy Life',
 '6821360\tPang Nakarin',
 '10113088\tTerfel, Bartoli- Mozart: Don',
 '10151459\tThe Flaming Sidebur',
 '6826647\tBodenstandig 3000']

In [6]:
# Process each line to a key-value pair
artist_ids = dict()
for line in txt:
    split = line.split('\t')
    if len(split) > 1:
        artist_ids[int(split[0])] = split[1]
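
The parsing step above can also be written as a small reusable function. The following is a minimal sketch, not the notebook's own code: the name `parse_artist_lines` and the `sample_lines` list are hypothetical, and the two valid records are copied from the output shown earlier. Unlike the loop above, it splits only on the first tab (so a name containing a tab is kept whole) and skips lines whose ID is not an integer.

```python
# Hypothetical sample input: two records from the output above,
# plus two malformed lines to show what gets skipped.
sample_lines = [
    '1134999\t06Crazy Life',
    '6821360\tPang Nakarin',
    '',                     # blank line, e.g. after a trailing newline
    'not a valid record',   # no tab separator
]


def parse_artist_lines(lines):
    """Build {artist_id: name} from tab-separated lines, skipping malformed ones."""
    result = {}
    for line in lines:
        parts = line.split('\t', 1)  # split on the first tab only
        if len(parts) != 2:
            continue                 # no tab: not an "id<TAB>name" record
        raw_id, name = parts
        try:
            result[int(raw_id)] = name
        except ValueError:
            continue                 # ID field is not an integer
    return result


sample_ids = parse_artist_lines(sample_lines)
print(sample_ids)
```

Calling `parse_artist_lines(txt)` on the list loaded earlier should give a dictionary equivalent to `artist_ids` for well-formed input.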

In [12]:
# Check some IDs
artist_ids[1291], artist_ids[1989]

('Staple Singers', 'Hi-Tek feat. Buckshot')

Chances are that some of your favorite artists are in the list.
To find them, build a second dictionary that maps artist names back to their IDs (a reverse lookup).

In [14]:
# Construct the reverse search dict (name -> ID).
# .items() works in both Python 2 and Python 3; .iteritems() exists only in Python 2.
artist_to_id = dict((v, k) for k, v in artist_ids.items())

In [16]:
# Enter your favorite artist below!
artist_to_id['The Beatles']

1000113
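
An exact-key lookup like the one above raises a `KeyError` when the name is spelled differently from the entry in the file. As a hedged sketch of one way around this, the hypothetical helper `find_artists` below does a case-insensitive substring search. The `sample_artist_ids` dictionary is a stand-in built from the three ID/name pairs shown in this notebook; with the real data you would pass `artist_to_id` instead. Note also that if two IDs share the same name, a reverse dictionary keeps only one of them.

```python
# Stand-in data: the three ID/name pairs that appear in this notebook.
sample_artist_ids = {
    1000113: 'The Beatles',
    1291: 'Staple Singers',
    1989: 'Hi-Tek feat. Buckshot',
}
sample_artist_to_id = dict(
    (name, artist_id) for artist_id, name in sample_artist_ids.items()
)


def find_artists(fragment, name_to_id):
    """Return sorted (name, id) pairs whose name contains `fragment`, ignoring case."""
    fragment = fragment.lower()
    return sorted(
        (name, artist_id)
        for name, artist_id in name_to_id.items()
        if fragment in name.lower()
    )


matches = find_artists('beatles', sample_artist_to_id)

# dict.get returns None instead of raising KeyError for an unknown name
missing = sample_artist_to_id.get('No Such Artist')
```

On the full dataset a substring search may return many near-duplicate names, so it can help to inspect the matches before picking an ID.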