# CS246 - Colab 1
## Wordcount in Spark

### Setup

Let's set up Spark in your Colab environment. Run the cell below!

In [1]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/45/b0/9d6860891ab14a39d4bddf80ba26ce51c2f9dc4805e5c6978ac0472c120a/pyspark-3.1.1.tar.gz (212.3MB)
     |████████████████████████████████| 212.3MB 65kB/s 
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
     |████████████████████████████████| 204kB 40.9MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.1-py2.py3-none-any.whl size=212767604 sha256=aa1550d0031ba8e968af869ad534fcca413389fb6c85fde4cd68d3eb470494b0
  Stored in directory: /root/.cache/pip/wheels/0b/90/c0/01de724414ef122bd05f056541fb6a0ecf47c7ca655f8b3c0f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1.1

Now we authenticate a Google Drive client to download the file we will be processing in our Spark job.

**Make sure to follow the interactive instructions.**

In [2]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [3]:
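# Google Drive ID of the input file (named file_id to avoid shadowing the built-in id())
file_id = '1SE6k_0YukzGd5wK-E4i6mG83nydlfvSa'
downloaded = drive.CreateFile({'id': file_id})
# Save the file's contents locally as pg100.txt
downloaded.GetContentFile('pg100.txt')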

If you executed the cells above, you should be able to see the file *pg100.txt* under the "Files" tab on the left panel.

### Your task

If you ran the setup stage successfully, you are ready to work on the *pg100.txt* file, which contains a copy of the complete works of Shakespeare.

Write a Spark application that outputs, for each letter, the total number of (non-unique) words that start with that letter. In your implementation **ignore the letter case**, i.e., treat all words as lowercase. You can also ignore all words **starting** with a non-alphabetic character.
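Before writing the Spark version, it can help to pin down the expected behavior on a tiny input. The following is only a plain-Python sketch of the rules stated above, not the Spark solution; the function name `first_letter_counts` and the sample lines are made up for illustration.

```python
from collections import Counter

def first_letter_counts(lines):
    """Count words by lowercased first letter, skipping non-alphabetic starts."""
    counts = Counter()
    for line in lines:
        for word in line.split():          # split on any run of whitespace
            first = word[0]
            if first.isalpha():            # ignore words like "'Complete'" or "1994"
                counts[first.lower()] += 1
    return counts

# Hypothetical sample input, not taken from pg100.txt
sample = ["The Project Gutenberg EBook", "of the 'Complete' Works, 1994"]
result = first_letter_counts(sample)
print(sorted(result.items()))
# [('e', 1), ('g', 1), ('o', 1), ('p', 1), ('t', 2), ('w', 1)]
```

Note that "The" and "the" both count toward `t`, while `'Complete'` and `1994` are dropped because their first character is not a letter.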

In [175]:
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf
import pandas as pd


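# Initialize the Spark configuration and context.
# getOrCreate reuses an already-running context, so this cell can be re-run safely.
conf = SparkConf()
sc = SparkContext.getOrCreate(conf=conf)

# Create the Spark session
spark = SparkSession.builder.getOrCreate()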

In [176]:
# YOUR CODE HERE
txt = spark.read.text("pg100.txt")

In [177]:
lines = sc.textFile("pg100.txt")
lines.take(5)

['The Project Gutenberg EBook of The Complete Works of William Shakespeare, by',
 'William Shakespeare',
 '',
 'This eBook is for the use of anyone anywhere at no cost and with',
 'almost no restrictions whatsoever.  You may copy it, give it away or']

In [178]:
words = lines.flatMap(lambda line: line.split())
words.take(15)

['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'The',
 'Complete',
 'Works',
 'of',
 'William',
 'Shakespeare,',
 'by',
 'William',
 'Shakespeare',
 'This']
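A small aside on why `split()` is called without arguments here. The sample line below is adapted from the `lines.take(5)` output, where a double space follows "whatsoever."; this sketch only illustrates standard Python string behavior.

```python
line = "almost no restrictions whatsoever.  You may copy it"

# split() with no argument splits on runs of whitespace and never yields empty strings
default_split = line.split()
print(default_split)
# ['almost', 'no', 'restrictions', 'whatsoever.', 'You', 'may', 'copy', 'it']

# split(" ") splits on every single space, so a double space produces an empty string
space_split = line.split(" ")
print(space_split)
# ['almost', 'no', 'restrictions', 'whatsoever.', '', 'You', 'may', 'copy', 'it']

# Empty lines produce no words at all with split()
print("".split())
# []
```

An empty string in the word list would make a later `word[0]` lookup raise an `IndexError`, so the no-argument form is the safer choice for this task.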

In [179]:
# Keep only the words whose first character is alphabetic,
# then map each word to its lowercased first letter: word[0].lower()
letters = words.filter(lambda word: word[0].isalpha()) \
               .map(lambda word: word[0].lower())
letters.take(15)

['t', 'p', 'g', 'e', 'o', 't', 'c', 'w', 'o', 'w', 's', 'b', 'w', 's', 't']
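One caveat about the filter above: `str.isalpha()` also accepts non-ASCII letters such as "É". The collected results in this notebook contain only the keys a to z, so it does not seem to matter for this file, but a stricter check may be useful on other inputs. The sketch below is one possible variant; the helper name `starts_with_ascii_letter` and the sample words are hypothetical.

```python
import string

ASCII_LETTERS = set(string.ascii_letters)   # 'a'-'z' and 'A'-'Z' only

def starts_with_ascii_letter(word):
    """True if the word is non-empty and begins with an ASCII letter."""
    return bool(word) and word[0] in ASCII_LETTERS

# Hypothetical sample words, not taken from pg100.txt
sample_words = ["Étude", "apple", "'tis", "42nd", "Zed"]

unicode_kept = [w for w in sample_words if w[0].isalpha()]
ascii_kept = [w for w in sample_words if starts_with_ascii_letter(w)]

print(unicode_kept)   # ['Étude', 'apple', 'Zed']
print(ascii_kept)     # ['apple', 'Zed']
```

In the Spark pipeline, such a helper could be passed to `words.filter(...)` in place of the `isalpha()` lambda.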

In [180]:
# MAP: turn each first letter into a (letter, 1) pair
pairs = letters.map(lambda w: (w, 1))

In [181]:
# REDUCE: sum the 1s of each letter to get its total count
counts = pairs.reduceByKey(lambda n1, n2: n1 + n2)
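To make the reduce step concrete, here is a local, plain-Python stand-in for what `reduceByKey` computes. It is only a sketch of the semantics: the helper `reduce_by_key` is made up for illustration, and real Spark merges partial results per partition, which is why the function passed to `reduceByKey` should be associative and commutative (addition is both).

```python
def reduce_by_key(pairs, func):
    """Local illustration of reduceByKey: merge all values of a key with func."""
    merged = {}
    for key, value in pairs:
        if key in merged:
            merged[key] = func(merged[key], value)   # combine with the running value
        else:
            merged[key] = value                      # first value seen for this key
    return merged

# The first six letters from the letters.take(15) output, as (letter, 1) pairs
sample_pairs = [(letter, 1) for letter in ['t', 'p', 'g', 'e', 'o', 't']]

counts_local = reduce_by_key(sample_pairs, lambda n1, n2: n1 + n2)
print(sorted(counts_local.items()))
# [('e', 1), ('g', 1), ('o', 1), ('p', 1), ('t', 2)]
```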

In [188]:
# Sort by key so the letters come out in alphabetical order
counts = counts.sortByKey()

In [189]:
counts.collect()

[('a', 84836),
 ('b', 45455),
 ('c', 34567),
 ('d', 29713),
 ('e', 18697),
 ('f', 36814),
 ('g', 20782),
 ('h', 60563),
 ('i', 62167),
 ('j', 3339),
 ('k', 9418),
 ('l', 29569),
 ('m', 55676),
 ('n', 26759),
 ('o', 43494),
 ('p', 27759),
 ('q', 2377),
 ('r', 14265),
 ('s', 65705),
 ('t', 123602),
 ('u', 9170),
 ('v', 5728),
 ('w', 59597),
 ('x', 14),
 ('y', 25855),
 ('z', 71)]

In [187]:
df = pd.DataFrame(counts.collect(), columns=['Letter', 'Occurrence'])
print(df)

   Letter  Occurrence
0       a       84836
1       b       45455
2       c       34567
3       d       29713
4       e       18697
5       f       36814
6       g       20782
7       h       60563
8       i       62167
9       j        3339
10      k        9418
11      l       29569
12      m       55676
13      n       26759
14      o       43494
15      p       27759
16      q        2377
17      r       14265
18      s       65705
19      t      123602
20      u        9170
21      v        5728
22      w       59597
23      x          14
24      y       25855
25      z          71


In [190]:
sc.stop()

Once you have obtained the desired results, **head over to Gradescope and submit your solution for this Colab**!