## Installing pyspark

The following cell installs the latest `pyspark` package.

In [None]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=08f06391d209e91ba51f7fb529e1384c5f7c108e16b8e95bafca410e02334ef8
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


## Mounting Google Drive

The following cell mounts your Google Drive in the virtual machine running the notebook. You will be asked to authenticate your account to access Google Drive. Once authenticated, your Google Drive is mounted at `/content/drive`, and its contents can be accessed under `/content/drive/MyDrive`.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


The following cell lists the contents of your Google Drive. We assume you have created a folder called `comp5349` in your Google Drive and have uploaded the data files there.

In [None]:
!ls /content/drive/MyDrive

'Colab Notebooks'   COMP5216   comp5349   ELEC5517   INFO5990   INFO6007


### Initializing Spark


In [None]:
from pyspark import SparkConf, SparkContext

spark_conf = SparkConf()\
        .setAppName("Week 5 Lecture Sample Code")
sc=SparkContext.getOrCreate(spark_conf) 


### Word Count Program ###

This is the word count program used in the week 5 lecture to illustrate basic Spark program structure. It reads a text file from local disk and counts the occurrences of each word in the text. For simplicity, words are treated as separated by a single space character only.

**Each run of this cell creates an output directory called `1984_wordcount`. Because `saveAsTextFile` fails if the output path already exists, you need to remove that directory from your Google Drive before re-running the cell.**
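One way to clear the old output before a re-run is a small helper like the one below. This is only a sketch: `remove_output_dir` is a hypothetical name, and it assumes the output lives on a filesystem reachable from the notebook (as the mounted Drive is), with an optional `file://` prefix on the path.

```python
import os
import shutil

def remove_output_dir(path):
    """Delete a previous output directory if it exists.

    Returns True if a directory was removed, False otherwise.
    """
    # Spark-style paths may carry a 'file://' scheme; local filesystem calls need it stripped.
    local_path = path[len("file://"):] if path.startswith("file://") else path
    if os.path.isdir(local_path):
        shutil.rmtree(local_path)
        return True
    return False

# Example call (path taken from the cell below):
# remove_output_dir('file:///content/drive/MyDrive/comp5349/1984_wordcount')
```

Note that this deletes the directory and everything in it, so double-check the path before running it against your Drive.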


In [None]:
input_file = 'file:///content/drive/MyDrive/comp5349/1984_processed.txt'
output_path = 'file:///content/drive/MyDrive/comp5349/1984_wordcount'

text_file = sc.textFile(input_file)

counts = text_file.flatMap(lambda line: line.strip().split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile(output_path)
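To see what the `flatMap` → `map` → `reduceByKey` chain computes, here is a plain-Python sketch of the same logic that runs without Spark. The `word_count` function and the sample lines are made up for illustration and are not part of the lecture code.

```python
from collections import Counter

def word_count(lines):
    """Plain-Python equivalent of flatMap -> map -> reduceByKey over an iterable of lines."""
    counts = Counter()
    for line in lines:
        # flatMap step: one line becomes many words
        for word in line.strip().split(" "):
            # map + reduceByKey steps: emit (word, 1) and sum the ones per word
            counts[word] += 1
    return dict(counts)

sample = ["war is peace", "freedom is slavery"]
print(word_count(sample))
```

One detail worth knowing: `"".split(" ")` returns `['']`, so a blank line contributes an empty-string "word" to the counts. Calling `split()` with no argument would avoid that, at the cost of slightly different splitting behaviour.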

## map style transformations

 `map` vs. `mapValues`

In [None]:
d = [('a',1),('b',2),('c',3),('d',4),('e',5)]
distRDD = sc.parallelize(d)

# convert each distance to kilometres (conversion factor 1.6)
kvmap= distRDD.map(lambda rec: (rec[0],rec[1] * 1.6)).collect()
kvmapvalues = distRDD.mapValues(lambda dist: dist * 1.6).collect()

In [None]:
print(kvmap)
print(kvmapvalues)

[('a', 1.6), ('b', 3.2), ('c', 4.800000000000001), ('d', 6.4), ('e', 8.0)]
[('a', 1.6), ('b', 3.2), ('c', 4.800000000000001), ('d', 6.4), ('e', 8.0)]
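The two results are identical. The difference lies in what the function receives. The following plain-Python sketch mimics both styles on an ordinary list; `map_values` is a hypothetical stand-in for the RDD method, not Spark code.

```python
pairs = [('a', 1), ('b', 2), ('c', 3)]

# map-style: the function receives the whole (key, value) record and must rebuild the pair
via_map = [(rec[0], rec[1] * 1.6) for rec in pairs]

# mapValues-style: the function receives only the value; the key is carried over unchanged
def map_values(records, f):
    return [(k, f(v)) for k, v in records]

via_map_values = map_values(pairs, lambda dist: dist * 1.6)

print(via_map == via_map_values)
```

The value `4.800000000000001` for key `'c'` is ordinary binary floating-point rounding of `3 * 1.6`, not a Spark artefact. In Spark, `mapValues` also keeps the parent RDD's partitioning, which `map` does not, so it is usually the better choice when the keys do not change.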


## map style transformation 

`filter`

In [None]:
longDist = distRDD.filter(lambda rec: rec[1] > 2)
longDist.collect()

[('c', 3), ('d', 4), ('e', 5)]

### Movie Rating Computing ###

This is a sample notebook showing basic spark RDD operations. The program has two input data sources: *ratings.csv* and *movies.csv*.

The *movies.csv* file contains movie information. Each row represents one movie, and has the following format:

`movieId,title,genres`

The *ratings.csv* file contains rating information. Each row represents one rating of one movie by one user, and has the following format:

`userId,movieId,rating,timestamp`
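Titles in *movies.csv* may contain commas, in which case the field is quoted. That is presumably why the helper functions below parse movie rows with `csv.reader` rather than `split(",")`. A minimal sketch of the difference, using an illustrative row written in the same layout (not copied from the data file):

```python
import csv

# Illustrative row in the movieId,title,genres layout; the title contains a comma, so it is quoted
row_text = '11,"American President, The (1995)",Comedy|Drama|Romance'

# Naive splitting breaks the title into two pieces
naive = row_text.split(",")

# csv.reader understands the quoting and yields exactly three fields
parsed = next(csv.reader([row_text]))

print(len(naive), len(parsed))
print(parsed[1])
```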


#### The following cell defines a number of functions to be used in the computation ####

In [None]:
import csv
"""
This module includes a few functions used in computing average rating per genre
"""
def pairMovieToGenre(record):
    """This function converts entries of movies.csv into key,value pair of the following format
    (movieID, genre)
    since there may be multiple genre per movie, this function returns a list of tuples
    Args:
        record (str): A row of CSV file, with three columns separated by comma
    Returns:
        The return value is a list of tuples, each tuple contains (movieID, genre)
    """
    for row in csv.reader([record]):
        if len(row) != 3:
            continue
        movieID, genreList = row[0],row[2]
        return [(movieID, genre) for genre in genreList.split("|")]
    # malformed row: return an empty list so that flatMap emits nothing for it
    return []

def extractRating(record):
    """ This function converts entries of ratings.csv into key,value pair of the following format
    (movieID, rating)
    Args:
        record (str): A row of CSV file, with four columns separated by comma
    Returns:
        The return value is a tuple (movieID, rating), or an empty tuple if the row cannot be parsed
    """
    try:
        userID, movieID, rating, timestamp = record.split(",")
        rating = float(rating)
        return (movieID, rating)
    except ValueError:
        # wrong number of columns or a non-numeric rating (e.g. a header row)
        return ()

def mapToPair(line):
    """ This function converts tuples of (genre, rating) into key,value pair of the following format
    (genre,rating)
    
    Args:
        line (tuple): A tuple of (genre, rating)
    Returns:
        The return value is a tuple  (genre, rating) 
    """
    genre, rating = line
    return (genre, rating)

def avg(values):
    #convert the iterable into a list
    vlist = list(values) 
    # the average is the sum of the list divided by the length of the list
    return sum(vlist)/len(vlist)

#### This cell defines the Spark computation skeleton (i.e. the computation graph) ####

To facilitate inspection of each intermediate RDD, we write each transformation in a separate statement. This is not necessary in production code. 

In [None]:
input_path = 'file:///content/drive/MyDrive/comp5349/'

# read each input file line by line into an RDD of strings
ratingData = sc.textFile(input_path + "ratings.csv")
movieData = sc.textFile(input_path + "movies.csv")

movieRatings = ratingData.map(extractRating)
# we use flatMap because a movie may have multiple genres
movieGenre = movieData.flatMap(pairMovieToGenre)

# join  the two RDDs
joined = movieGenre.join(movieRatings)
# throw away the movieID which is useless for subsequent computation
joined_gk = joined.values()
# group ratings by genre
grouped = joined_gk.groupByKey()
genreRatingsAvg = grouped.mapValues(avg).collect()

''' The short hand version
genreRatingsAvg = movieGenre \
    .join(movieRatings) \
    .values() \
    .groupByKey() \
    .mapValues(avg) \
    .collect()
'''
genreRatingsAvg
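`groupByKey` gathers every rating for a genre into one collection before `avg` is applied. A common alternative is to keep a running (sum, count) pair per key and divide at the end, which in Spark could be expressed with `reduceByKey` or `aggregateByKey`. The sketch below shows only the idea in plain Python; `average_by_key` and the sample data are made up for illustration and are not part of the lecture code.

```python
def average_by_key(pairs):
    """Average the values per key by accumulating (sum, count) instead of full lists."""
    acc = {}
    for key, value in pairs:
        total, count = acc.get(key, (0.0, 0))
        # combine step: only two numbers are kept per key, however many values arrive
        acc[key] = (total + value, count + 1)
    return {key: total / count for key, (total, count) in acc.items()}

sample = [('Comedy', 3.5), ('Comedy', 3.0), ('Drama', 4.0)]
print(average_by_key(sample))
```

Because only two numbers are carried per key, this style lets partial results be combined before data is moved between partitions, whereas grouping first has to move every individual rating.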

#### Check RDD element ####

In [None]:
#What does movieData look like
#Each row is a string
movieData.take(2)

['1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy',
 '2,Jumanji (1995),Adventure|Children|Fantasy']

In [None]:
# What does movieRatings RDD look like
# Each row is a tuple of String, float
movieRatings.take(2)

[('16', 4.0), ('24', 1.5)]

In [None]:
#How many elements are there in movieRatings
movieRatings.count()

105339

In [None]:
#what does movieGenre RDD look like
#Each row is a tuple of string, string
movieGenre.take(2)

[('1', 'Adventure'), ('1', 'Animation')]

In [None]:
#How many elements are there in movieGenre
movieGenre.count()

23114

In [None]:
#what does joined look like 
# we are joining (mid, genre) with (mid, rating)
# the result is (mid, (genre, rating))
joined.take(2)

[('4', ('Comedy', 3.5)), ('4', ('Comedy', 3.0))]
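The shape of the joined records can be reproduced in plain Python. The sketch below mimics an inner join on the key: every value on the left is paired with every value on the right that shares the same key, and keys present on only one side are dropped. `inner_join` and the sample records are hypothetical and stand in for the RDD operation.

```python
from collections import defaultdict

def inner_join(left, right):
    """Pair up values from two lists of (key, value) records that share a key."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    # every left value is combined with every right value for the same key
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]

# made-up sample records in the (mid, genre) and (mid, rating) shapes
movie_genre = [('4', 'Comedy'), ('4', 'Drama'), ('7', 'Romance')]
movie_ratings = [('4', 3.5), ('4', 3.0), ('9', 5.0)]

joined_sample = inner_join(movie_genre, movie_ratings)
print(joined_sample)
```

A movie with two genres and two ratings therefore produces four joined records, which is why the joined RDD can be much larger than either input.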

In [None]:
# What does joined_gk look like
# a tuple of (string, float) representing (genre, rating)
joined_gk.take(2)

[('Comedy', 3.5), ('Comedy', 3.0)]

In [None]:
# When we run groupByKey on joined_gk, all rating values 
# for the same genre will be grouped into a single sequence as an iterable object
grouped.take(2)

[('Drama', <pyspark.resultiterable.ResultIterable at 0x7f067a9fe990>),
 ('Romance', <pyspark.resultiterable.ResultIterable at 0x7f067a9fedd0>)]

# Lab6 code


In [68]:
from pyspark import SparkConf, SparkContext

spark_conf = SparkConf()\
        .setAppName("Week 6 Lab Code")
sc = SparkContext.getOrCreate(spark_conf)

input_file = 'file:///content/drive/MyDrive/comp5349/1984_processed.txt'
output_path = 'file:///content/drive/MyDrive/comp5349/1984_wordcount'

text_file = sc.textFile(input_file)

bigrams = text_file.map(lambda line: line.strip().split(" "))\
                 .flatMap(lambda xs: (tuple(x) for x in zip(xs, xs[1:])))

result = bigrams.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).sortBy(lambda r: r[1],ascending=False)

result.take(5)

# bigrams = zip(words[:-1], words[1:])

# result = bigrams.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b).sortBy(lambda r: r[1],ascending=False) \

[(('big', 'brother'), 67),
 (('said', 'winston'), 43),
 (('old', 'man'), 38),
 (('thought', 'police'), 38),
 (('said', "o'brien"), 37)]
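The bigram step relies on `zip(xs, xs[1:])`, which pairs each word with its successor. A small plain-Python sketch of that step on a single line follows; `line_bigrams` is a hypothetical helper name and the sample sentence is made up.

```python
def line_bigrams(line):
    """Return adjacent word pairs from one line, mirroring zip(xs, xs[1:])."""
    xs = line.strip().split(" ")
    # zip stops at the shorter sequence, so the last word has no partner
    return list(zip(xs, xs[1:]))

print(line_bigrams("big brother is watching"))
print(line_bigrams("winston"))
```

Since each line is processed on its own, a pair made of the last word of one line and the first word of the next is never counted.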

In [110]:
from pyspark import SparkConf, SparkContext
import csv

spark_conf = SparkConf()\
        .setAppName("Week 6 Lab Code")
sc = SparkContext.getOrCreate(spark_conf)

input_path = 'file:///content/drive/MyDrive/comp5349/'

movieData = sc.textFile(input_path + "movies.csv")

genre = 'Sci-Fi'

def filterMovieInGenre(record):
    for row in csv.reader([record]):
        if len(row) != 3:
            continue
        year = row[1][-5:-1]
        genreList = row[2]
        genres = genreList.split('|')
        if genre in genres:
            return [year]
        else:
            return []
    # malformed row: return an empty list so that flatMap emits nothing for it
    return []

genres = movieData.flatMap(filterMovieInGenre)\
                  .map(lambda x: (x, 1))\
                  .reduceByKey(lambda x, y : x + y)\
                  .sortBy(lambda r: r[1],ascending=False)

genres.take(5)

[('2009', 49), ('2011', 33), ('2008', 32), ('2013', 30), ('2007', 27)]
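The year is taken with the slice `row[1][-5:-1]`, which works when the title ends exactly with `(YYYY)`. The sketch below compares that slice with a slightly more defensive regular-expression variant. Both helper names are hypothetical, and the regex is one possible approach, not the lab's method.

```python
import re

# a four-digit year in parentheses at the very end of the title, allowing trailing whitespace
YEAR_AT_END = re.compile(r"\((\d{4})\)\s*$")

def year_by_slice(title):
    """Same slicing as in the cell above: the four characters before the final ')'."""
    return title[-5:-1]

def year_by_regex(title):
    """Return the year as a string, or None when the title has no trailing (YYYY)."""
    match = YEAR_AT_END.search(title)
    return match.group(1) if match else None

print(year_by_slice("Toy Story (1995)"), year_by_regex("Toy Story (1995)"))
```

With a trailing space in the title the slice silently returns a wrong string, while the regex variant still finds the year and returns `None` when there is none, which makes bad rows easier to filter out.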