# Exam 2:  CWL Analysis using Spark DataFrames

First, let us load the DataFrames that we prepared in the week 9 HW.  If you aren't confident that your versions are correct, please use the solutions that I provided instead.

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F  # will be used a LOT
from pyspark import Row  # Row will be used in some of the assertions

ss = (
    SparkSession.builder
    .master('spark://spark-master:7077')
    .appName('cwlanalysis')
    .getOrCreate()
)

In [None]:
matches_df = ss.read.parquet("hdfs://namenode/Users/vagrant/matches_df.parquet")
teammatches_df = ss.read.parquet("hdfs://namenode/Users/vagrant/teammatches_df.parquet")
playermatches_df = ss.read.parquet("hdfs://namenode/Users/vagrant/playermatches_df.parquet")
matchevents_df = ss.read.parquet("hdfs://namenode/Users/vagrant/matchevents_df.parquet")

In [None]:
matches_df.printSchema()

In [None]:
teammatches_df.printSchema()

In [None]:
playermatches_df.printSchema()

In [None]:
matchevents_df.printSchema()

## Problem 1

Players tend to be very interested in the performance characteristics of various weapons.  One easy question to answer is:

Which weapon was responsible for the most kills over the entire tournament?

Store your result in a `str` variable named `weapon_most_kills`, e.g.
```
weapon_most_kills = 'KICKBOOTY_3000'
```

WE WILL RUN LOW ON MEMORY, so clean up after yourself.  Here is how to delete any DataFrames or RDDs you are done with:
```
del my_df
ss.catalog.clearCache()
```

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(weapon_most_kills, str)


## Problem 2:  Winning-est Team

Please rank the teams by the number of *matches* won (winning-est team first).  Break any ties by sorting alphabetically by team name.

Note that because each team-vs-team competition is the best out of 3 matches, the "winning-est" team might not be the same team that won the whole tournament.

Store your result in a variable named `winningest_teams` that is a `list` of `tuple`s where each tuple contains the team name and the total number of matches won, e.g.
```
[('AWESOME_TEAM', 55), ('OK_TEAM', 43), ...]
```
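The ordering convention described above (most wins first, ties broken alphabetically) can be illustrated in plain Python. This is only a sketch of the expected ordering on made-up data, not a Spark solution; the team names and counts are hypothetical.

```python
# Hypothetical (team, wins) pairs; these are NOT real tournament results.
example_wins = [('TEAM_B', 43), ('TEAM_C', 55), ('TEAM_A', 43)]

# Sort by wins descending, then by team name ascending to break ties.
example_ranked = sorted(example_wins, key=lambda pair: (-pair[1], pair[0]))

print(example_ranked)
# [('TEAM_C', 55), ('TEAM_A', 43), ('TEAM_B', 43)]
```

With DataFrames, a comparable ordering can be requested by passing a descending count column followed by an ascending name column to `orderBy`.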

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(winningest_teams, list)
assert isinstance(winningest_teams[0], tuple)
assert isinstance(winningest_teams[0][0], str)
assert isinstance(winningest_teams[0][1], int)


## Problem 3:  Deadliest map

It is interesting to know which maps are "deadliest" (i.e. have the most kills over the entire tournament) because this is where the exciting action was happening.

Similar to Problem 2, provide a ranking of maps (deadliest first).  Your `list` of `tuple`s should look like:
```
[('Super Deadly Map', 1053), ('Second deadliest map', 997), ...]
```
where each pair contains the map name and the total number of deaths that occurred on that map over the entire tournament.

Break any ties by sorting alphabetically by map name.  Name your variable `deadliest_maps`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(deadliest_maps, list)
assert isinstance(deadliest_maps[0], tuple)
assert isinstance(deadliest_maps[0][0], str)
assert isinstance(deadliest_maps[0][1], int)


## Problem 4:  Time spent per map

Let's figure out how much time was spent (for the entire tournament) on each map.

Produce a DataFrame named `map_durations_df` that contains two columns:  `map` (the name of the map) and `tot_duration_s` (total *seconds* played on the map for the entire tournament over all matches).

It should be sorted by `tot_duration_s` in descending order, so that the map played longest comes first.

HINT: I used a UDF to solve this

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
from pyspark.sql import DataFrame
assert isinstance(map_durations_df, DataFrame)
assert map_durations_df.columns == ['map', 'tot_duration_s']


## Problem 5: Deadliest map per unit time

The analysis in Problem 3 is not really fair.  Since some maps were played longer than others, we should really produce a DataFrame that details the deadliest maps PER SECOND.

Your resulting DataFrame should be named `deadliest_maps_per_second_df` and have the columns `map` (the map name) and `deaths_per_second` (which will be doubles).

Your DataFrame should be sorted according to `deaths_per_second` (deadliest first) and, in case of tie, alphabetically.

HINT: I used a UDF to solve this
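To make the "per second" idea concrete, here is a small plain-Python sketch on made-up numbers. It is not a Spark solution, and the map names and totals are hypothetical. Note how the map with the most total deaths ends up last once play time is taken into account.

```python
# Hypothetical per-map totals; these are NOT real tournament data.
example_deaths = {'Map X': 1000, 'Map Y': 900, 'Map Z': 450}
example_seconds = {'Map X': 5000, 'Map Y': 3000, 'Map Z': 1500}

# Rate = total deaths / total seconds played (true division gives a float).
example_rates = [
    (name, example_deaths[name] / example_seconds[name])
    for name in example_deaths
]

# Deadliest per second first; ties broken alphabetically by map name.
example_rates.sort(key=lambda pair: (-pair[1], pair[0]))

for name, rate in example_rates:
    print(name, rate)
```

Here `Map Y` and `Map Z` tie on rate, so they appear in alphabetical order ahead of `Map X`.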

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
import numpy as np

assert isinstance(deadliest_maps_per_second_df, DataFrame)
assert deadliest_maps_per_second_df.columns == ['map', 'deaths_per_second']
assert deadliest_maps_per_second_df.select('deaths_per_second').dtypes[0][1] == 'double'


## Problem 6: Cumulative time a team played

Create a DataFrame named `genius_matches` that contains the matches played by the team `EVIL GENIUSES` (one row per match).  It should contain 4 columns:

`mode`, `start_time_s`, `end_time_s`, and `cumulative_time_s`, where `cumulative_time_s` is the cumulative seconds played UP TO AND INCLUDING THAT MATCH.  The cumulative match time should be tracked separately for each game mode.

The matches should be sorted first by `mode`, then chronologically (earliest matches first).

HINT:  You will need some UDFs and window functions
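The meaning of `cumulative_time_s` can be sketched in plain Python on made-up rows. This is not a Spark solution; the mode names and timestamps are hypothetical, and the sketch assumes a match's duration is `end_time_s - start_time_s`.

```python
from itertools import groupby

# Hypothetical (mode, start_time_s, end_time_s) rows; NOT real match data.
example_matches = [
    ('MODE_A', 100, 400),
    ('MODE_B', 500, 900),
    ('MODE_A', 1000, 1250),
    ('MODE_B', 1300, 1600),
]

# Sort by mode, then by start time (earliest matches first).
example_matches.sort(key=lambda row: (row[0], row[1]))

example_cumulative = []
for mode, rows in groupby(example_matches, key=lambda row: row[0]):
    running_total = 0  # the running total restarts for every game mode
    for _, start_s, end_s in rows:
        running_total += end_s - start_s  # includes the current match
        example_cumulative.append((mode, start_s, end_s, running_total))

for row in example_cumulative:
    print(row)
```

In Spark, a per-mode running total like this is typically expressed as a sum over a window partitioned by `mode` and ordered by start time.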

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(genius_matches, DataFrame)
assert genius_matches.columns == \
    ['mode', 'start_time_s', 'end_time_s', 'cumulative_time_s']
