## Preparation for the Moodle exercise in Spark

In this Jupyter notebook we preprocess the dataset used in this week's graded exercise.
1. Change directory to `exercise08` 

2. Start Docker:<br>
`docker-compose up -d`

3. Get the data:
    1. Download the data:<br> ```wget https://f003.backblazeb2.com/file/larsyencken-eu-public/greatlanguagegame/confusion-2014-03-02.tbz2```
    2. Extract the data:<br> ```tar -jxvf confusion-2014-03-02.tbz2```

4. Change directory to `confusion-2014-03-02`

5. Extract the part of the dataset that we will work with in this exercise (the first 3,000,000 lines): ```head -n 3000000 confusion-2014-03-02.json > confusion-part.json```

6. Change back to the parent directory (`cd ..`) and copy the data into the `jupyter` container: `docker cp confusion-2014-03-02/ jupyter:/home/jovyan/`

For more information about the dataset, you can refer to https://lars.yencken.org/datasets/great-language-game

## Preprocessing commands
In your newly created notebook, run these commands to load the dataset into an RDD:

In [7]:
import json
from pyspark.sql import SparkSession

# Run Spark in local mode inside the Jupyter container
spark = SparkSession.builder.master('local').getOrCreate()
sc = spark.sparkContext

# Each line of the file is one JSON object: parse it into a dict and cache the RDD
path = "confusion-2014-03-02/confusion-part.json"
raw_data = sc.textFile(path)
dataset = raw_data.map(json.loads).cache()

After that you will be able to run the queries for this week's Moodle question. Run your queries on the ```dataset``` RDD. For example, the following command returns one element of the dataset:

In [8]:
dataset.take(1)

[{'guess': 'Norwegian',
  'target': 'Norwegian',
  'country': 'AU',
  'choices': ['Maori', 'Mandarin', 'Norwegian', 'Tongan'],
  'sample': '48f9c924e0d98c959d8a6f1862b3ce9a',
  'date': '2013-08-19'}]
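
Each line of `confusion-part.json` holds one JSON object, which is why `json.loads` can be mapped over the lines directly. Here is a minimal plain-Python sketch of that step, using a hand-written line modelled on the record above (no Spark needed):

```python
import json

# One line in the layout the file is read with (one JSON object per line);
# the values are copied from the record shown above.
line = ('{"guess": "Norwegian", "target": "Norwegian", "country": "AU", '
        '"choices": ["Maori", "Mandarin", "Norwegian", "Tongan"], '
        '"sample": "48f9c924e0d98c959d8a6f1862b3ce9a", "date": "2013-08-19"}')

record = json.loads(line)  # the same function the RDD applies to every line
is_correct = record["guess"] == record["target"]
num_choices = len(record["choices"])
print(record["country"], is_correct, num_choices)
```

The fields used in the questions below are `guess`, `target`, `country`, `choices`, `sample`, and `date`.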

In [9]:
# Example query 1: count the number of games played on the latest day
latest_day = dataset \
    .map(lambda o: (o['date'], 0)) \
    .sortByKey(False) \
    .take(1)[0][0]
print("Latest day:", latest_day)
dataset \
    .filter(lambda o: o['date'] == latest_day) \
    .count()

Latest day: 2013-09-09


87672
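
Because the dates are ISO-formatted strings (`YYYY-MM-DD`), lexicographic order matches chronological order, so the latest day is simply the largest string. The sketch below checks this on a few made-up records in plain Python. On the RDD, `dataset.map(lambda o: o['date']).max()` should give the same day without sorting the whole dataset.

```python
# Hypothetical mini-sample; only the 'date' field matters here.
toy_games = [
    {"date": "2013-08-19"},
    {"date": "2013-09-09"},
    {"date": "2013-09-01"},
    {"date": "2013-09-09"},
]

# String comparison is enough because the format is zero-padded YYYY-MM-DD.
latest_day = max(g["date"] for g in toy_games)
games_on_latest_day = sum(1 for g in toy_games if g["date"] == latest_day)
print(latest_day, games_on_latest_day)
```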

## Question 1
Return the number of countries in the dataset where at least one game has a guessed language of "Somali".

In [None]:
num_countries = dataset \
    .filter(lambda o: o['guess'] == 'Somali') \
    .map(lambda o: o['country']) \
    .distinct() \
    .count()
num_countries

143

## Question 2
Return the sample ID of the earliest game in Switzerland (CH) where the guessed language is correct. 
If multiple games have correct guesses on that date, return the one with the alphabetically earliest guessed language.

In [24]:
swiss_games = dataset \
    .filter(lambda o: o['country'] == 'CH') \
    .filter(lambda o: o['guess'] == o['target'])
earliest_swiss_date = swiss_games \
    .sortBy(lambda o: o['date']) \
    .take(1)[0]['date']
print(earliest_swiss_date)
earliest_swiss_sample = swiss_games \
    .filter(lambda o: o['date'] == earliest_swiss_date) \
    .sortBy(lambda o: o['guess']) \
    .take(1)[0]['sample']
earliest_swiss_sample

2013-09-01


'9b1340b8343bb267783e1bfb2dc55bf1'
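
The two-step tie-break above can also be expressed as a single ordering on the pair `(date, guess)`. Here is a plain-Python sketch on made-up records (the sample IDs are placeholders). On the RDD, `swiss_games.takeOrdered(1, key=lambda o: (o['date'], o['guess']))` would be the analogous call.

```python
# Hypothetical records that already passed the CH / correct-guess filters.
toy_swiss_games = [
    {"date": "2013-09-02", "guess": "Arabic", "sample": "id-a"},
    {"date": "2013-09-01", "guess": "Turkish", "sample": "id-b"},
    {"date": "2013-09-01", "guess": "Dutch", "sample": "id-c"},
]

# Tuples compare element by element: earliest date first, then guess A-Z.
earliest = min(toy_swiss_games, key=lambda o: (o["date"], o["guess"]))
print(earliest["sample"])
```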

## Question 3
Find the average number of language choices per game (rounded to three decimal places).

In [19]:
num_games = dataset.count()
num_lang_choices = dataset \
    .map(lambda o: len(o['choices'])) \
    .reduce(lambda x, y: x + y)
avg_num_lang_choices = num_lang_choices / num_games
# Shown unrounded; round(avg_num_lang_choices, 3) gives the three-decimal answer
avg_num_lang_choices


3.2654316666666667
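
The question asks for three decimal places, so the average still needs a rounding step before it is submitted. Here is a plain-Python sketch of the same sum-over-count computation with rounding, on made-up records. On the RDD, `dataset.map(lambda o: len(o['choices'])).mean()` is an alternative way to get the unrounded average.

```python
# Hypothetical mini-sample; only the 'choices' field matters here.
toy_games = [
    {"choices": ["Maori", "Norwegian"]},
    {"choices": ["Maori", "Mandarin", "Norwegian"]},
    {"choices": ["Maori", "Mandarin", "Norwegian", "Tongan"]},
    {"choices": ["Maori", "Mandarin"]},
    {"choices": ["Tongan", "Mandarin"]},
    {"choices": ["Tongan", "Mandarin", "Maori"]},
]

total_choices = sum(len(g["choices"]) for g in toy_games)  # 2+3+4+2+2+3 = 16
avg_choices = round(total_choices / len(toy_games), 3)     # 16 / 6, rounded
print(avg_choices)
```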

## Question 4
Find the three languages easiest to guess (i.e., with the highest overall percentage of correct guesses). 
Please write the answers separated by commas and without any spaces between them: language1,language2,language3

In [21]:
# Per target language, accumulate (games, correct guesses), then rank by accuracy
easiest_languages = dataset \
    .map(lambda o: (o['target'], (1, 1 if o['target'] == o['guess'] else 0))) \
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])) \
    .mapValues(lambda v: v[1] / v[0]) \
    .sortBy(lambda kv: kv[1], ascending=False) \
    .map(lambda kv: kv[0]) \
    .take(3)
easiest_languages

['French', 'German', 'Italian']
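
The `reduceByKey` step above carries a `(games, correct)` pair per target language and divides only at the end. Here is a plain-Python sketch of the same aggregation on made-up records, which can help when checking the logic without Spark:

```python
from collections import defaultdict

# Hypothetical mini-sample; only 'target' and 'guess' matter here.
toy_games = [
    {"target": "French", "guess": "French"},
    {"target": "French", "guess": "Italian"},
    {"target": "German", "guess": "German"},
    {"target": "Tongan", "guess": "Maori"},
    {"target": "German", "guess": "German"},
]

# language -> [games, correct guesses], mirroring the pair used in reduceByKey
totals = defaultdict(lambda: [0, 0])
for g in toy_games:
    totals[g["target"]][0] += 1
    totals[g["target"]][1] += int(g["target"] == g["guess"])

accuracy = {lang: correct / games for lang, (games, correct) in totals.items()}
ranking = sorted(accuracy, key=accuracy.get, reverse=True)
print(ranking)
```

Keeping both counts until the final division matters: ratios cannot be averaged pairwise inside `reduceByKey`, whereas sums can.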

## Question 5
Return the country (as a two-character code) with the highest percentage of games where the guessed language is "Korean".

In [27]:
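# Naive attempt: ranks countries by the absolute number of "Korean" guesses,
# not by their percentage, so it favours countries with many games.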
korean_country_bad = dataset \
    .filter(lambda o: o['guess'] == 'Korean') \
    .map(lambda o: (o['country'], 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .sortBy(lambda kv: kv[1], ascending=False) \
    .take(1)[0][0]
korean_country_bad

'US'

In [30]:
# Per country, accumulate (games, "Korean" guesses), then rank by the share of Korean guesses
korean_country = dataset \
    .map(lambda o: (o['country'], (1, 1 if o['guess'] == 'Korean' else 0))) \
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])) \
    .mapValues(lambda v: v[1] / v[0]) \
    .sortBy(lambda kv: kv[1], ascending=False) \
    .take(1)[0][0]
korean_country

'OM'
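
The two cells above disagree because the first ranks countries by the raw number of "Korean" guesses, while the question asks for the share of each country's games. Here is a plain-Python sketch with made-up tallies (the numbers are invented for illustration, not taken from the dataset) showing how the two rankings can diverge:

```python
# Hypothetical per-country tallies: (total games, games guessed as Korean).
tallies = {
    "US": (1000, 50),  # many games, 5% Korean guesses
    "OM": (10, 2),     # few games, 20% Korean guesses
    "CH": (200, 8),    # 4% Korean guesses
}

by_count = max(tallies, key=lambda c: tallies[c][1])
by_share = max(tallies, key=lambda c: tallies[c][1] / tallies[c][0])
print(by_count, by_share)
```

A country with very few games can top a percentage ranking, so it may be worth checking the game count behind the winning share.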

## Question 6
Find the number of games played on the earliest date present in the dataset.

In [25]:
earliest_date = dataset \
    .map(lambda o: o['date']) \
    .distinct() \
    .sortBy(lambda x: x) \
    .take(1)[0]
num_early_games = dataset \
    .filter(lambda o: o['date'] == earliest_date) \
    .count()
num_early_games

163