# Lab Exercise Solution:
## Washingtons and Marthas

##![Spark Logo Tiny](https://files.training.databricks.com/images/wiki-book/general/logo_spark_tiny.png) Instructions

This data was captured in the August before the 2016 US presidential election.

As a result, articles about the candidates were very popular.

For this exercise, you will...
0. Filter the result to the **en** Wikipedia project.
0. Find all the articles where the name of the article **ends** with **_Washington** (presumably "George Washington", "Martha Washington", etc.).
0. Return all records as an array to the Driver.
0. Assign your array of Washingtons (the return value of your action) to the variable `washingtons`.
0. Calculate the sum of requests for the Washingtons and assign it to the variable `totalWashingtons`. <br/>
<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** We've not yet covered `DataFrame` aggregation techniques, so for this exercise use the array of records you have just obtained.

**Bonus**

Repeat the exercise for the Marthas:
0. Filter the result to the **en** Wikipedia project.
0. Find all the articles where the name of the article **starts** with **Martha_** (presumably "Martha Washington", "Martha Graham", etc.).
0. Return all records as an array to the Driver.
0. Assign your array of Marthas (the return value of your action) to the variable `marthas`.
0. Calculate the sum of requests for the Marthas and assign it to the variable `totalMarthas`.<br/>
<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** We've not yet covered `DataFrame` aggregation techniques, so for this exercise use the array of records you have just obtained.
0. But you cannot do it the same way twice:
   * In the filter, don't use the same conditional method as the one used for the Washingtons.
   * Don't use the same action as used for the Washingtons.

**Testing**

Run the last cell to verify that your results are correct.

**Hints**
* <img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Make sure to include the underscore in the condition.
* The actions we've explored for extracting data include:
  * `first()`
  * `collect()`
  * `head()`
  * `take(n)`
* The conditional methods used with a `filter(..)` include:
  * equals
  * not-equals
  * starts-with
  * and there are others. Remember, the `DataFrame` API is built upon an SQL engine.
* There shouldn't be more than 1000 records for either the Washingtons or the Marthas.
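
To see why the underscore matters, here is a small pure-Python sketch that needs no Spark session. It assumes that Spark's `startswith(..)` and `endswith(..)` column methods compare literal prefixes and suffixes much as Python's string methods do. The titles are hypothetical examples, not rows from the dataset.

```python
# Hypothetical article titles, written the way Wikipedia stores them
# (spaces replaced by underscores). These are not taken from the dataset.
titles = [
    "George_Washington",
    "Martha_Washington",
    "Washington",
    "Washington_Monument",
    "Martha_Graham",
    "Marthas_Vineyard",
]

# With the underscore, only "<something>_Washington" titles qualify.
with_underscore = [t for t in titles if t.endswith("_Washington")]

# Without it, the bare "Washington" article is matched as well.
without_underscore = [t for t in titles if t.endswith("Washington")]

# The same applies to the prefix test for the Marthas.
marthas_strict = [t for t in titles if t.startswith("Martha_")]
marthas_loose = [t for t in titles if t.startswith("Martha")]

print(with_underscore)
print(without_underscore)
print(marthas_strict)
print(marthas_loose)
```

Dropping the underscore would quietly change the record counts, so the assertions in the verification cells would likely fail.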

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cell to configure our "classroom."

In [0]:
%run "../Includes/Classroom-Setup"

In [0]:
# If "/mnt/training" cannot be listed after "../Includes/Classroom-Setup" has run, the mount is broken.
# Unmount it so that the next cell ("../Includes/Dataset-Mounts-New") can mount it again.
try:
    files = dbutils.fs.ls("/mnt/training")
except Exception:
    dbutils.fs.unmount('/mnt/training/')


/mnt/training/ has been unmounted.


In [0]:
%run "../Includes/Dataset-Mounts-New"

##![Spark Logo Tiny](https://files.training.databricks.com/images/wiki-book/general/logo_spark_tiny.png) Show Your Work

In [0]:
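# Register the SAS token so that Spark can read from the Azure data source.
(source, sasEntity, sasToken) = getAzureDataSource()
spark.conf.set(sasEntity, sasToken)

# Read through the mount point rather than the source returned above.
source = '/mnt/training'
parquetDir = source + "/wikipedia/pagecounts/staging_parquet_en_only_clean/"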

In [0]:
# ANSWER

from pyspark.sql.functions import *

parquetDir = "/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/"

washingtons = (spark.read
  .parquet(parquetDir)
  .filter( col("project") == "en")
  .filter( col("article").endswith("_Washington") )
  #.filter( col("article").like("%\\_Washington") )
  .collect()
  #.take(1000)
)
totalWashingtons = 0

for washington in washingtons:
  totalWashingtons += washington["requests"]

print("Total Washingtons: {0:,}".format( len(washingtons) ))
print("Total Washington Requests: {0:,}".format( totalWashingtons ))

Total Washingtons: 466
Total Washington Requests: 3,266
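
The loop over the collected records can also be written as a single expression with Python's built-in `sum`. One caveat: a wildcard import from `pyspark.sql.functions` rebinds the name `sum` to Spark's column aggregate, so the sketch below calls `builtins.sum` explicitly. The records are hypothetical stand-ins (plain dictionaries with made-up request counts); a collected Spark `Row` supports the same lookup by column name.

```python
import builtins

# Hypothetical stand-ins for the collected rows. The request counts are
# made up for illustration and do not come from the dataset.
washingtons = [
    {"article": "George_Washington", "requests": 5},
    {"article": "Martha_Washington", "requests": 3},
    {"article": "Booker_T._Washington", "requests": 2},
]

# builtins.sum is Python's own sum, even when the bare name `sum` has been
# rebound by `from pyspark.sql.functions import *`.
totalWashingtons = builtins.sum(row["requests"] for row in washingtons)

print("Total Washingtons: {0:,}".format(len(washingtons)))
print("Total Washington Requests: {0:,}".format(totalWashingtons))
```

The explicit loop in the answer cell avoids the name clash altogether, which may be why it was chosen for this exercise.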


In [0]:
# ANSWER
# BEST ANSWER - this is how you would do it in production

from pyspark.sql.functions import *  # sum(), count()

parquetDir = "/mnt/training/wikipedia/pagecounts/staging_parquet_en_only_clean/"

stats = (spark.read
  .parquet(parquetDir)
  .filter((col("project") == "en") & col("article").endswith("_Washington"))
  .select(sum("requests"), count("*"))
  .first())

totalWashingtons = stats[0]
washingtonCount = stats[1]

print("Total Washingtons: {}".format(washingtonCount) )
print("Total Washingtons Requests: {}".format(totalWashingtons))

Total Washingtons: 466
Total Washingtons Requests: 3266


In [0]:
# ANSWER

from pyspark.sql.functions import *

marthas = (spark.read
  .parquet(parquetDir)
  .filter( col("project") == "en")
  #.filter( col("article").startswith("Martha_") )
  .filter( col("article").like("Martha\\_%") )
  #.collect()
  .take(1000)
)
totalMarthas = 0

for martha in marthas:
  totalMarthas += martha["requests"]

print("Total Marthas: {0:,}".format( len(marthas) ))
print("Total Marthas Requests: {0:,}".format( totalMarthas ))

Total Marthas: 146
Total Marthas Requests: 708
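
The `like(..)` pattern above escapes the underscore because, in SQL `LIKE` patterns, `_` matches any single character and `%` matches any run of characters. The following pure-Python sketch emulates that behavior with regular expressions so the difference can be seen without a Spark session. The helper `like_to_regex` and the titles are hypothetical, and the backslash escape character is assumed to match Spark's default.

```python
import re

def like_to_regex(pattern, escape="\\"):
    """Translate a SQL LIKE pattern into a compiled regular expression.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern):
            # An escaped character is matched literally.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            parts.append(".*")   # any run of characters, including none
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)

# Same Python string literals as in the notebook: "Martha\\_%" is Martha\_%
escaped = like_to_regex("Martha\\_%")
unescaped = like_to_regex("Martha_%")

# Hypothetical titles, not taken from the dataset.
titles = ["Martha_Washington", "Martha_Graham", "Marthas_Vineyard", "Martha"]

escaped_hits = [t for t in titles if escaped.fullmatch(t)]
unescaped_hits = [t for t in titles if unescaped.fullmatch(t)]

print(escaped_hits)
print(unescaped_hits)
```

With the escape, only titles that literally begin with `Martha_` match. Without it, `Marthas_Vineyard` also matches, because the wildcard `_` accepts the `s`.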


##![Spark Logo Tiny](https://files.training.databricks.com/images/wiki-book/general/logo_spark_tiny.png) Verify Your Work
Run the following cells to verify that your record counts and request totals are correct.

In [0]:
print("Total Washingtons: {0:,}".format( len(washingtons) ))
print("Total Washington Requests: {0:,}".format( totalWashingtons ))

expectedCount = 466
assert len(washingtons) == expectedCount, "Expected " + str(expectedCount) + " articles but found " + str( len(washingtons) )

expectedTotal = 3266
assert totalWashingtons == expectedTotal, "Expected " + str(expectedTotal) + " requests but found " + str(totalWashingtons)

Total Washingtons: 466
Total Washington Requests: 3,266


In [0]:
print("Total Marthas: {0:,}".format( len(marthas) ))
print("Total Marthas Requests: {0:,}".format( totalMarthas ))

expectedCount = 146
assert len(marthas) == expectedCount, "Expected " + str(expectedCount) + " articles but found " + str( len(marthas) )

expectedTotal = 708
assert totalMarthas == expectedTotal, "Expected " + str(expectedTotal) + " requests but found " + str(totalMarthas)


Total Marthas: 146
Total Marthas Requests: 708
