# COMP7095 - Big Data Management

## Spark Lab 1: Introduction to Spark

### Introduction
Spark is a general framework for distributed computing that offers high performance for both batch and interactive processing. It supports higher-level tools for SQL and structured data processing, machine learning, graph processing, and stream processing. It exposes APIs for Java, Python, and Scala. In our labs, we mainly use it with Python.

## Hands-on
We are going to use PySpark to load data from a file in TSV format and perform some simple analysis.

The file, named "moview_reviews.tsv", can also be downloaded from the course Moodle.

Use the following code segments to understand how PySpark works.

The data file is around 65 MB. Part of its layout is shown below:
```
review\tsentiment
One of the other reviewer...\tpositive
A wonderful little produc...\tpositive
I thought this was a wond...\tpositive
Basically there's a famil...\tnegative
...
```

Note that the values are separated by a tab ('\t').

Import the required packages and get the instance of the Spark context:

In [None]:
# Install pyspark
!pip install pyspark

from pyspark import SparkContext
from operator import add

sc = SparkContext.getOrCreate()

Define a function to split the values (review and sentiment) of each line.

In [None]:
def preprocess(line):
    review, sentiment = line.split('\t')
    return sentiment, review
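As a quick check outside Spark, the function can be applied to a single line by hand. This is only a sketch: the sample line below is made up for illustration and is not taken from the data file.

```python
def preprocess(line):
    # Split one TSV line into its two fields and swap their order
    review, sentiment = line.split('\t')
    return sentiment, review

# Hypothetical sample line in the same layout as the data file
sample = "A wonderful little production\tpositive"
pair = preprocess(sample)
print(pair)  # ('positive', 'A wonderful little production')
```

The sentiment comes first in the returned tuple, so later steps can read it as `x[0]` and the review text as `x[1]`.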

Upload your local file to Colab.

This step requires you to first import the files module from the google.colab library:

In [None]:
from google.colab import files

Upload the file from your local file system with the `upload` method of the `files` object:

In [None]:
uploaded = files.upload()

Once the upload is complete, you can read it as a file in Colab.

#### Task 1: Load the data file and create a resilient distributed dataset (RDD) object. Please complete your code as follows.

First load the uploaded file into an RDD object named `rdd`, one element per line (for example, with `sc.textFile`). Then use the `filter` function to skip the header row and the `map` function to pass each remaining line to the `preprocess` function. This creates a new RDD object.
The `count` function returns the number of rows stored in the RDD object.

In [None]:
reviews = rdd.filter(lambda x: x != 'review\tsentiment').map(preprocess)
reviews.count()

We can also check what is stored in the RDD object by using the `take` function. Its parameter specifies how many items to retrieve from the RDD object; for example, `reviews.take(1)` returns a list containing the first item.

Next, we use the `filter` function to retrieve all rows with positive sentiment and the `map` function to keep only the review text. This creates a new RDD object that stores the reviews without the sentiments.
Let's also check how many positive reviews there are!

In [None]:
posReviews = reviews.filter(lambda x: x[0] == 'positive').map(lambda x: x[1])
posReviews.count()
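To see what the two lambdas select, here is a plain-Python sketch of the same logic on a small list. It does not use Spark, and the three pairs are invented for illustration; they only mimic the `(sentiment, review)` shape that `preprocess` returns.

```python
# Hypothetical (sentiment, review) pairs in the shape that preprocess returns
pairs = [
    ('positive', 'Loved it'),
    ('negative', 'Too long'),
    ('positive', 'Great cast'),
]

# filter: keep pairs whose first element (x[0]) is 'positive'
# map: keep only the second element (x[1]), the review text
pos = [x[1] for x in pairs if x[0] == 'positive']
print(pos)       # ['Loved it', 'Great cast']
print(len(pos))  # 2, the value count() would report for this list
```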

#### Task 2: Please take the first row of positive reviews. Please complete your code as follows.

#### Task 3: Please create a new RDD object for negative reviews. Please complete your code as follows.

We can also compute the word frequency with the following simple steps:
1. Define a function that splits each line and returns a list of words.

In [None]:
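def splitWords(line):
    # Treat commas and periods as spaces, drop double quotes, then split on spaces
    values = line.replace(',', ' ').replace('.', ' ').replace('"', '').split(' ')
    # Discard the empty strings left by consecutive spaces
    return [v for v in values if len(v) > 0]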

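2. Create a new RDD object from the `reviews` RDD object: keep only the review text, split it into words with `flatMap` and the `splitWords` function, pair each word with a count of 1, and sum the counts per word with `reduceByKey`.

In [None]:
# Use the review text only, so the sentiment label is not counted as a word
wordcounts = reviews.map(lambda x: x[1]).flatMap(splitWords).map(lambda w: (w, 1)).reduceByKey(add)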

3. Sort the pairs by frequency (`x[1]`, the element at index 1) in descending order and retrieve the top 10 items.

In [None]:
wordcounts.takeOrdered(10, key=lambda x: -x[1])
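The three steps above can be traced on a tiny input without Spark. The following is a plain-Python sketch, not the Spark implementation: the two sample lines are invented for illustration, `Counter` stands in for `map` plus `reduceByKey(add)`, and `sorted` with a slice stands in for `takeOrdered`.

```python
from collections import Counter

def splitWords(line):
    values = line.replace(',', ' ').replace('.', ' ').replace('"', '').split(' ')
    return [v for v in values if len(v) > 0]

# Hypothetical mini-corpus standing in for the review texts
lines = [
    'The film was good, the cast was good.',
    'the plot was "not" good',
]

# flatMap: each line yields several words, flattened into one list
words = [w for line in lines for w in splitWords(line)]

# map + reduceByKey(add): one count per occurrence, summed per word
counts = Counter(words)

# takeOrdered(3, key=lambda x: -x[1]): the 3 pairs with the highest counts
top3 = sorted(counts.items(), key=lambda x: -x[1])[:3]
print(top3)  # [('was', 3), ('good', 3), ('the', 2)]
```

Note that `splitWords` is case-sensitive, so `The` and `the` are counted as different words here. When several words share the same count, the order in which Spark returns them may differ from this sketch.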

## After using Spark
In the end, we should stop the Spark context by calling the `stop` function.

In [None]:
sc.stop()