## Transforming _Hamlet_ into a Dataset

In this project, we used PySpark's RDD transformations and actions, which follow the MapReduce paradigm, along with basic data cleanup to transform the text of _Hamlet_ into a format that's more useful for data analysis.

The file `hamlet.txt` contains the entire text of _Hamlet_, one of Shakespeare's most popular plays.

In [1]:
# finding path to PySpark.
import findspark
findspark.init()

# importing PySpark and initializing SparkContext object.
import pyspark
sc = pyspark.SparkContext()

- Read the text file into an RDD named `raw_hamlet` using the `textFile()` method of the SparkContext object (instantiated above as `sc`).
- Display the first five elements of the RDD.

In [2]:
raw_hamlet = sc.textFile('hamlet.txt')

raw_hamlet.take(5)

['hamlet@0\t\tHAMLET',
 'hamlet@8',
 'hamlet@9',
 'hamlet@10\t\tDRAMATIS PERSONAE',
 'hamlet@29']
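
`textFile()` and `map()` are transformations, which Spark evaluates lazily, while `take()` is an action that triggers the computation. The sketch below is a loose plain-Python analogy (no Spark involved): Python's built-in `map` is also lazy, and `itertools.islice` stands in for `take()`. The helper `traced_length` and the list `touched` are hypothetical names used only for this illustration.

```python
from itertools import islice

# Lines copied from the raw_hamlet preview above.
lines = [
    'hamlet@0\t\tHAMLET',
    'hamlet@8',
    'hamlet@9',
    'hamlet@10\t\tDRAMATIS PERSONAE',
    'hamlet@29',
]

touched = []  # records which lines were actually processed

def traced_length(line):
    """Hypothetical transformation: note the line, return its length."""
    touched.append(line)
    return len(line)

# Defining the "transformation" does no work yet.
lazy_lengths = map(traced_length, lines)
print(len(touched))  # 0

# Pulling two results (the stand-in for an action) processes two lines only.
first_two = list(islice(lazy_lengths, 2))
print(first_two)
print(len(touched))  # 2
```

This is only an analogy for the idea that nothing runs until a result is requested; Spark's actual scheduling across partitions is more involved.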

- The text file uses the tab character (`\t`) as a delimiter. We'll need to split the file on the tab delimiter and convert the results into an RDD that's more manageable.
- Name the resulting RDD `split_hamlet`.

In [3]:
split_hamlet = raw_hamlet.map(lambda line: line.split('\t'))
split_hamlet.take(5)

[['hamlet@0', '', 'HAMLET'],
 ['hamlet@8'],
 ['hamlet@9'],
 ['hamlet@10', '', 'DRAMATIS PERSONAE'],
 ['hamlet@29']]
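
The empty strings in the output come from two tab characters in a row: `split('\t')` returns an empty string for the gap between them. A minimal plain-Python sketch of the same per-line function, applied to lines copied from the preview above (a list comprehension stands in for the RDD `map()`):

```python
# Lines copied from the raw_hamlet preview above.
sample_lines = [
    'hamlet@0\t\tHAMLET',
    'hamlet@8',
    'hamlet@10\t\tDRAMATIS PERSONAE',
]

# The same function the notebook passes to map(), applied element by element.
split_sample = [line.split('\t') for line in sample_lines]

for fields in split_sample:
    print(fields)

# Two consecutive tabs yield an empty string between them.
print('a\t\tb'.split('\t'))
```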

Transform the RDD `split_hamlet` into a new RDD `hamlet_with_ids` that contains the clean version of the line ID for each element.

- For example, we want to transform `hamlet@0` to `0`, and leave the rest of the values in that element untouched.
    - Recall that `map()` applies a function to each element in the RDD, where each element is a list that we can access using regular Python mechanics.

In [4]:
def format_id(x):
    # Strip the 'hamlet@' prefix from the first value to get the bare line ID.
    line_id = x[0].split('hamlet@')[1]
    # Keep the remaining values (if any) unchanged.
    return [line_id] + x[1:]

hamlet_with_ids = split_hamlet.map(format_id)
hamlet_with_ids.take(10)

[['0', '', 'HAMLET'],
 ['8'],
 ['9'],
 ['10', '', 'DRAMATIS PERSONAE'],
 ['29'],
 ['30'],
 ['31', 'CLAUDIUS', 'king of Denmark. (KING CLAUDIUS:)'],
 ['74'],
 ['75', 'HAMLET', 'son to the late, and nephew to the present king.'],
 ['131']]
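
As a possible variation, the sketch below cleans the ID with a prefix check instead of `split()`, so an element whose first value lacks the `hamlet@` prefix passes through unchanged rather than raising an `IndexError`. It is plain Python with no Spark dependency. The name `strip_id_prefix` is hypothetical, and the third sample element is reconstructed from the previews above rather than read from `hamlet.txt`.

```python
def strip_id_prefix(fields, prefix='hamlet@'):
    """Return a copy of fields with the prefix removed from the first value."""
    line_id = fields[0]
    if line_id.startswith(prefix):
        line_id = line_id[len(prefix):]
    # All values after the ID are kept as they are.
    return [line_id] + fields[1:]

sample = [
    ['hamlet@0', '', 'HAMLET'],
    ['hamlet@8'],
    ['hamlet@31', 'CLAUDIUS', 'king of Denmark. (KING CLAUDIUS:)'],
]

cleaned = [strip_id_prefix(fields) for fields in sample]
for fields in cleaned:
    print(fields)
```

With an RDD, the same function could be passed to `map()` in place of the list comprehension.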

- Next, we want to get rid of elements that don't contain any actual words (and just have an ID as the first value). These typically represent blank lines between paragraphs or sections in the play. We also want to remove any blank values (`''`) within elements, which don't contain any useful information for our analysis.
- Clean up the RDD and store the result as a new RDD `hamlet_text_only`.

In [5]:
def filter_text_only(line):
    # Keep only elements that have at least one value besides the line ID.
    return len(line) > 1

real_text = hamlet_with_ids.filter(filter_text_only)

# Drop empty strings from the remaining elements.
hamlet_text_only = real_text.map(lambda line: [l for l in line if l != ''])

hamlet_text_only.take(5)

[['0', 'HAMLET'],
 ['10', 'DRAMATIS PERSONAE'],
 ['31', 'CLAUDIUS', 'king of Denmark. (KING CLAUDIUS:)'],
 ['75', 'HAMLET', 'son to the late, and nephew to the present king.'],
 ['132', 'POLONIUS', 'lord chamberlain. (LORD POLONIUS:)']]
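
The two cleanup steps can be sketched in plain Python on elements copied from the earlier preview. The helper names `has_text` and `drop_blanks` are hypothetical, and the `edge_case` element is made up to show how the order of the two steps could matter; we have not checked whether such an element occurs in `hamlet.txt`.

```python
# Elements copied from the hamlet_with_ids preview above.
sample = [
    ['0', '', 'HAMLET'],
    ['8'],
    ['9'],
    ['10', '', 'DRAMATIS PERSONAE'],
    ['29'],
]

def has_text(fields):
    """True when the element holds something besides the line ID."""
    return len(fields) > 1

def drop_blanks(fields):
    """Remove empty-string values, keeping everything else in order."""
    return [value for value in fields if value != '']

# Filter first, then strip blanks: the same order as the notebook.
text_only = [drop_blanks(fields) for fields in sample if has_text(fields)]
print(text_only)

# Hypothetical element: an ID followed only by a blank value.
edge_case = ['5', '']
filter_then_strip = [drop_blanks(f) for f in [edge_case] if has_text(f)]
strip_then_filter = [f for f in [drop_blanks(edge_case)] if has_text(f)]
print(filter_then_strip)  # the ID-only element survives
print(strip_then_filter)  # the element is dropped
```

If elements like the hypothetical one did appear, stripping blanks before filtering would be the safer order.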

- Using `take()` to preview the RDD after each task, we noticed there are some pipe characters (`|`) in odd places that add no value for us. The pipe character may appear as a standalone value in an element, or as part of an otherwise useful string value.
- Remove any list items that only contain the pipe character (`|`), and replace any pipe characters that appear within strings with an empty string.
    - Assign the resulting RDD to `clean_hamlet`.

In [6]:
def fix_pipe(line):
    results = list()
    for l in line:
        if l == '|':
            # Drop values that consist solely of a pipe character.
            continue
        # Remove any pipe characters embedded in the value.
        results.append(l.replace('|', ''))
    return results

clean_hamlet = hamlet_text_only.map(fix_pipe)
clean_hamlet.take(3)

[['0', 'HAMLET'],
 ['10', 'DRAMATIS PERSONAE'],
 ['31', 'CLAUDIUS', 'king of Denmark. (KING CLAUDIUS:)']]
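
The first three elements shown above happen to contain no pipes, so the effect of this step is not visible in the preview. The plain-Python sketch below shows both cases on made-up elements; the sample values and the helper name `remove_pipes` are hypothetical and are not taken from `hamlet.txt`.

```python
def remove_pipes(fields):
    """Drop standalone '|' values and strip '|' from the remaining strings."""
    return [value.replace('|', '') for value in fields if value != '|']

# Made-up elements covering both cases described above.
sample = [
    ['1', '|', 'some text'],               # pipe as a standalone value
    ['2', 'SPEAKER', '|text after pipe'],  # pipe embedded in a string
    ['3', 'no pipes here'],                # left unchanged
]

cleaned = [remove_pipes(fields) for fields in sample]
for fields in cleaned:
    print(fields)
```

Calling `replace()` on a string without a pipe returns it unchanged, so no separate branch is needed for that case.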