# A crash course on IPython (Jupyter) notebooks

This file contains a tutorial on analyzing the music data using Spark.
But first, a primer on how to use the notebook.

## A notebook is divided into cells

The boxes labeled with `In [ ]:` are called cells.  They contain Python code.
To run the code in a cell, click on the cell and press **Shift + Enter**.

Try it now on the cell below:

In [1]:
1 + 1

2

You should see an output like `Out[1]: 2`.

Next, edit the above cell to say `1 + 2` and run it.

## Check that the SparkContext is loaded

Since you launched the notebook with PySpark, an object called `sc` is automatically loaded.
This object is the connection to the Spark cluster.
Run the cell below.
You *should* see an output like

`<pyspark.context.SparkContext at  `  *some numbers*  `  >`

Otherwise, it won't be possible to proceed.

In [2]:
sc

<pyspark.context.SparkContext at 0x7f210455f950>

### Note: How to reset the Spark cluster

If you run into strange errors, you might have to reset the cluster.
 1. In the ssh session, run `screen -r nb` and press Control + C, then Y and Enter, to stop the notebook
 2. Follow the instructions in the setup to restart the Spark cluster and launch the notebook
 3. Refresh this page and run everything from the beginning

# Music recommendations with PySpark

## 1. Load some modules (libraries)

The following cell imports the modules we need.

In [23]:
import os
import numpy as np
import time

## 2. Load the artist info

We will load the data in `/root/data/artist_data.txt` into Python itself as a *dictionary*
(a hash table of key-value pairs).
This allows us to make sense of the numeric IDs we will get in the analysis.

In [4]:
# Load the lines of the text into a list of strings
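# (the with-block closes the file automatically; the original `f.close` lacked parentheses and never ran)
with open('/root/data/artist_data.txt', 'r') as f:
    txt = f.read().split('\n')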
txt[0:5]

['1134999\t06Crazy Life',
 '6821360\tPang Nakarin',
 '10113088\tTerfel, Bartoli- Mozart: Don',
 '10151459\tThe Flaming Sidebur',
 '6826647\tBodenstandig 3000']

In [6]:
# Process each line to a key-value pair
artist_ids = dict()
for line in txt:
    split = line.split('\t')
    if len(split) > 1:
        artist_ids[int(split[0])] = split[1]

In [12]:
# Check some IDs
artist_ids[1291], artist_ids[1989]

('Staple Singers', 'Hi-Tek feat. Buckshot')

Chances are that some of your favorite artists are in the list.
To find them, build a second `dict` that maps artist names back to IDs (a reverse lookup).

In [14]:
# Construct the reverse search dict
# (.items() works in both Python 2 and Python 3; .iteritems() exists only in Python 2)
artist_to_id = dict((v, k) for k, v in artist_ids.items())

In [16]:
# Enter your favorite artist below!
artist_to_id['The Beatles']

1000113
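
An exact lookup fails if the spelling or capitalization differs from the data file. A minimal sketch of a case-insensitive substring search follows. `find_artists` is a hypothetical helper that is not part of the original notebook, and the small `sample_ids` dict stands in for the full `artist_ids` dictionary.

```python
def find_artists(artist_ids, query):
    """Return sorted (id, name) pairs whose name contains `query`, ignoring case."""
    query = query.lower()
    return sorted((k, v) for k, v in artist_ids.items() if query in v.lower())

# Stand-in for the full artist_ids dict (these three entries appear in the cells above)
sample_ids = {1291: 'Staple Singers', 1989: 'Hi-Tek feat. Buckshot', 1000113: 'The Beatles'}

matches = find_artists(sample_ids, 'beatles')
print(matches)
```

In the notebook you would pass the real `artist_ids` dictionary instead of `sample_ids`.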

## 3. Load the data into Hadoop (in ssh)

Go back to your ssh session.  We need to copy `user_artist_data.txt` into HDFS (the Hadoop Distributed File System).

Make sure you are detached from any screen with Control + A, D.  Then create a new screen and go to the Hadoop folder:
```
screen -S hdfs
cd /root/ephemeral-hdfs/bin
```

Copy the local file into Hadoop:
```
./hadoop fs -put /root/data/user_artist_data.txt data.txt
```

Check that you copied the file:
```
./hadoop fs -ls
```
You should see the file listed as `/user/root/data.txt`.

## 4. Load the file into Spark

The following code creates an RDD (resilient distributed dataset) with the raw strings from `data.txt`.
The number `4` is the minimum number of partitions (chunks) the file is split into.
You can change the number of partitions and see if you get a performance improvement.

In [17]:
rawdata = sc.textFile('data.txt', 4)

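Let's take a look at what was loaded.  The `takeSample` command allows you to peek at random contents of an RDD object.  Its first argument (`True`) means sampling with replacement, and its second is the number of records to return.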

This might take a while!  To occupy the time, switch to the tab `XX.XX.XX.XXX:8080` which shows the cluster status.  There should be one running application: PySparkShell.  Click on the name to see what is going on inside Spark.

In [27]:
rawlines = rawdata.takeSample(True, 5)

Why does it take so long?  Spark uses lazy evaluation: the text data was not actually read until you called `takeSample`.  Let's run it again and note the time:

In [24]:
t1 = time.time()
rawlines = rawdata.takeSample(True, 5)
time.time() - t1

247.87838888168335
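
Lazy evaluation can be illustrated in plain Python with a generator, which also does no work until a result is requested. This is only an analogy, not how Spark is implemented. `load_lines` and the `log` list are made up for the illustration.

```python
log = []  # records the moment the "expensive" work actually happens

def load_lines():
    """Hypothetical stand-in for an expensive load; the body runs only when iterated."""
    log.append('loading')
    # Illustrative records in the 'user artist count' format
    for line in ['2258857 1404 4', '1000002 1 55']:
        yield line

lazy = load_lines()   # nothing has run yet (comparable to sc.textFile)
before = list(log)    # snapshot of the log: still empty
first = next(lazy)    # asking for a result triggers the work (comparable to takeSample)
after = list(log)     # the log now shows that loading happened
print(before, first, after)
```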

Spark includes a `cache()` method that keeps the result of a computation in memory once it has been computed.  Run the code below, then time `takeSample` again.  Does the speed improve?

In [21]:
# Mark the RDD from textFile for caching; it is stored the first time an action computes it
rawdata.cache()

data.txt MappedRDD[1] at textFile at NativeMethodAccessorImpl.java:-2
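
To compare timings before and after caching, the repeated `time.time()` pattern can be wrapped in a small helper. This is a sketch, assuming the action you want to time can be expressed as a zero-argument function. `timed` is a hypothetical name introduced here.

```python
import time

def timed(action):
    """Run a zero-argument callable and return (result, elapsed seconds)."""
    start = time.time()
    result = action()
    return result, time.time() - start

# Local demonstration with a cheap computation. In the notebook you might
# instead pass: lambda: rawdata.takeSample(True, 5)
result, seconds = timed(lambda: sum(range(1000)))
print(result, seconds)
```

Calling the helper twice on the same Spark action, once before and once after the RDD has been cached and computed, gives two directly comparable numbers.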

Let's get the number of records in the RDD.  The `one()` function maps every record to the number 1.  The `add()` function then sums up all of the ones.

In [25]:
def one(x):
    return 1

def add(x, y):
    return x + y

rawdata.map(one).reduce(add)

24296858
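
The same map-then-reduce pattern can be tried on a plain Python list, which makes it easier to see what each step produces. This is a local sketch using `functools.reduce`, not Spark, and the sample lines are illustrative.

```python
from functools import reduce

def one(x):
    return 1

def add(x, y):
    return x + y

# Illustrative records in the 'user artist count' format
lines = ['2258857 1404 4', '1000002 1 55', '1000002 1000006 33']

ones = list(map(one, lines))  # every record becomes the number 1
total = reduce(add, ones)     # the ones are added up pairwise
print(ones, total)
```

PySpark RDDs also provide a `count()` action, so `rawdata.count()` should return the same number as the map/reduce version above.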

#### Exercise:

Write and run a command in the empty line below to find the total number of characters in the raw data.

*Hint:* `len(x)` returns the number of characters in `x`.

In [None]:
# Your code below:


Scroll down for the answer
<br/><br/><br/><br/><br/><br/>
...
<br/><br/><br/><br/><br/><br/>
...
<br/><br/><br/><br/><br/><br/>
...
<br/><br/><br/><br/><br/><br/>
...
<br/><br/><br/><br/><br/><br/>
...
<br/><br/><br/><br/><br/><br/>
...
<br/><br/><br/><br/><br/><br/>
...

In [None]:
# ANSWER TO EXERCISE:
rawdata.map(len).reduce(add)


## 5. Process the raw data

The raw text data is not very useful to us.
We need to process the data into relevant (key, value) pairs.
What should we set as the key, and what should we set as the value?
This depends on the application at hand.

For now, let's narrow our focus to the artist and the counts while ignoring the users.
The following function will convert a raw line into a (key, value) pair with the key being an integer artist ID and the value being the integer count.

In [28]:
def raw_to_artist_count(line):
    line = str(line)
    parts = line.split(' ')
    if len(parts) != 3:
        return (-1, 0) # a (k, v) pair indicating error
    key = int(parts[1]) # the artist ID
    value = int(parts[2]) # the count
    return (key, value) # return the (k,v) pair

Check that it works correctly using the sampled text lines.

In [31]:
print(rawlines[0])
raw_to_artist_count(rawlines[0])

2258857 1404 4


(1404, 4)
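
Note that `raw_to_artist_count` only checks the number of fields, so a three-field line whose artist ID or count is not an integer would raise a `ValueError` inside `int()`. A hedged variant that also maps such lines to the error pair is sketched below. `safe_raw_to_artist_count` is a hypothetical name, and the notebook does not establish whether the data actually contains such lines.

```python
def safe_raw_to_artist_count(line):
    """Parse 'user artist count' into (artist_id, count); return (-1, 0) for malformed lines."""
    parts = str(line).split(' ')
    if len(parts) != 3:
        return (-1, 0)  # wrong number of fields
    try:
        return (int(parts[1]), int(parts[2]))
    except ValueError:
        return (-1, 0)  # a field that is not an integer

good = safe_raw_to_artist_count('2258857 1404 4')
short = safe_raw_to_artist_count('2258857 1404')
bad = safe_raw_to_artist_count('2258857 abc 4')
print(good, short, bad)
```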

Using this function, we will make a new RDD of (key, value) pairs.

In [32]:
count_rdd = rawdata.map(raw_to_artist_count).cache()

Let's check the results and time the operation:

In [33]:
t1 = time.time()
sample_counts = count_rdd.takeSample(True, 5)
time.time() - t1

228.10321593284607

In [34]:
sample_counts

[(2281323, 12), (1273010, 1), (10698564, 1), (3114, 1), (1158516, 1)]

Now, can we find the *most popular* artist?
First we need to combine the counts for each artist.
The `combineByKey` function is perfect for this task.
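
`combineByKey` takes three functions: one that starts an accumulator from the first value seen for a key, one that folds further values into that accumulator within a partition, and one that merges accumulators coming from different partitions. The plain-Python sketch below simulates that flow on made-up partitions. `local_combine_by_key` is a hypothetical helper for illustration only, not Spark's implementation.

```python
def local_combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Simulate combineByKey over a list of partitions of (key, value) pairs."""
    # Step 1: combine values independently inside each partition
    per_partition = []
    for part in partitions:
        acc = {}
        for key, value in part:
            if key in acc:
                acc[key] = merge_value(acc[key], value)
            else:
                acc[key] = create_combiner(value)
        per_partition.append(acc)
    # Step 2: merge the per-partition accumulators key by key
    merged = {}
    for acc in per_partition:
        for key, comb in acc.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], comb)
            else:
                merged[key] = comb
    return merged

# Two made-up partitions of (artist ID, count) pairs
partitions = [[(1404, 4), (3114, 1), (1404, 2)], [(3114, 5), (1404, 1)]]
totals = local_combine_by_key(
    partitions,
    lambda v: v,         # the first count for an artist starts the sum
    lambda c, v: c + v,  # add another count within the same partition
    lambda a, b: a + b,  # add partial sums from different partitions
)
print(totals)
```

With the real RDD, the corresponding call would presumably pass the same three functions, as in `count_rdd.combineByKey(lambda v: v, lambda c, v: c + v, lambda a, b: a + b)`.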