# Social characteristics of the Marvel Universe

In [None]:
sc

In [None]:
spark

## Read in the data.

Run the right cell depending on the platform being used.

For AWS EMR use:

In [None]:
file = sc.textFile("s3://bigdatateaching/marvel/porgat.txt")

For Databricks on Azure use:

In [None]:
file = sc.textFile("wasbs://marvel@bigdatateaching.blob.core.windows.net/porgat.txt")

## Clean The Data

The data comes in three parts, all stored in a single file:

* Marvel Characters
* Publications
* Relationships between the two

We need to pre-process the file before we can use it. The following operations are all RDD operations.

Let's look at the file.

Count the number of records in the file.

Define a new RDD that removes the headers from the file. The headers are the lines that begin with an asterisk (`*`).

In [None]:
noHeaders = file.
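As a rough illustration of the header filter, here is a plain-Python sketch that runs without Spark. The sample lines are hypothetical, modeled only on the description above, and are not taken from the real file. On an RDD, the same predicate would be passed to `filter`.

```python
# Hypothetical sample lines (assumption: invented for illustration only).
sample_lines = [
    "*Vertices 4 2",
    '1 "CHARACTER A"',
    '2 "CHARACTER B"',
    '3 "PUBLICATION X"',
    '4 "PUBLICATION Y"',
    "*Edgeslist",
    "1 3 4",
    "2 3",
]


def is_header(line):
    """A header is any line that begins with an asterisk."""
    return line.startswith("*")


# Keep every line that is not a header.
no_headers = [line for line in sample_lines if not is_header(line)]
print(no_headers[:5])
```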

Look at the first 5 records of noHeaders

In [None]:
noHeaders.

Extract a pair from each line: the leading integer and a string holding the rest of the line.

In [None]:
paired = noHeaders.
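One possible way to build the pair, sketched in plain Python: split each line at the first space and convert the first piece to an integer. The helper name `to_pair` and the sample lines are assumptions made for this example. With an RDD, a function like this could be passed to `map`.

```python
def to_pair(line):
    """Split a line into (leading integer, rest of the line)."""
    head, _, rest = line.partition(" ")
    return int(head), rest


# Hypothetical sample lines: one quoted name and one integer list.
pairs = [to_pair(line) for line in ['1 "CHARACTER A"', "1 3 4"]]
print(pairs)
```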

Keep only the relationship lines, which do not start with a quote, then split the remainder of each line into a list of integers.

In [None]:
scatteredRelationships = paired.
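The filter-then-split step could look like the following plain-Python sketch. The input pairs are hypothetical. In Spark the two stages would typically be a `filter` followed by a `map`.

```python
# Hypothetical (id, rest-of-line) pairs: two names and two relationship lines.
paired = [
    (1, '"CHARACTER A"'),
    (3, '"PUBLICATION X"'),
    (1, "3 4"),
    (2, "3"),
]


def is_relationship(pair):
    """Relationship lines hold integers, so they never start with a quote."""
    return not pair[1].startswith('"')


# Keep relationship lines and turn "3 4" into [3, 4].
scattered = [
    (char_id, [int(token) for token in rest.split()])
    for char_id, rest in paired
    if is_relationship((char_id, rest))
]
print(scattered)
```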

Relationships for the same character id sometimes span more than one line in the file, so let's group them together.

In [None]:
relationships = scatteredRelationships.
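Grouping amounts to concatenating the publication lists that share a character id. The sketch below does this in plain Python with a dictionary, using hypothetical data. One Spark counterpart would be `reduceByKey` with list concatenation as the combining function.

```python
from collections import defaultdict

# Hypothetical input: character 1 is spread over two lines.
scattered = [(1, [3, 4]), (2, [3]), (1, [5])]

grouped = defaultdict(list)
for char_id, pub_ids in scattered:
    # Append this line's publications to the list for the character.
    grouped[char_id].extend(pub_ids)

relationships = sorted(grouped.items())
print(relationships)
```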

Keep only the non-relationship lines, which start with a quote, and remove the quotes.

In [None]:
nonRelationships = paired.
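A minimal plain-Python sketch of this step, again on hypothetical pairs: keep entries whose text starts with a quote, then strip the surrounding quote characters.

```python
# Hypothetical (id, rest-of-line) pairs.
paired = [
    (1, '"CHARACTER A"'),
    (3, '"PUBLICATION X"'),
    (1, "3 4"),
]

# Names are quoted; strip('"') drops the leading and trailing quotes.
non_relationships = [
    (item_id, rest.strip('"'))
    for item_id, rest in paired
    if rest.startswith('"')
]
print(non_relationships)
```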

Characters stop at a certain line. That line number is given in the initial header; we hardcode it here.

In [None]:
characters = nonRelationships.

Publications start after the characters.

In [None]:
publications = nonRelationships.
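The split between characters and publications is a comparison of each id against the hardcoded boundary. In the sketch below, `LAST_CHARACTER_ID = 2` is a made-up value for the toy data only; the real boundary has to be read from the header of the actual file.

```python
# Assumption: toy boundary for the hypothetical data below, not the real value.
LAST_CHARACTER_ID = 2

non_relationships = [
    (1, "CHARACTER A"),
    (2, "CHARACTER B"),
    (3, "PUBLICATION X"),
    (4, "PUBLICATION Y"),
]

# Ids up to the boundary are characters; everything after is a publication.
characters = [p for p in non_relationships if p[0] <= LAST_CHARACTER_ID]
publications = [p for p in non_relationships if p[0] > LAST_CHARACTER_ID]
print(characters)
print(publications)
```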

The following cells begin to use Spark SQL.

Spark SQL works with DataFrames, which are a kind of “structured” RDD or an “RDD with schema”.

The integration between the two works by creating an RDD of `Row` objects (a type from `pyspark.sql`) and then creating a DataFrame from it.

The DataFrames can then be registered as views. It is those views we’ll query using Spark SQL.

In [None]:
from pyspark.sql import Row

Let's create DataFrames out of the RDDs and register them as temporary views for SQL to use.


In [None]:
#  Relationships has a list as a component, let's flatten that.
#  Python 3 does not allow tuple parameters in a lambda, so index into the pair instead.
flatRelationships = relationships.flatMap(lambda t: [(t[0], pubId) for pubId in t[1]])

In [None]:
#  Let's map the relationships to an RDD of rows in order to create a data frame out of it
relationshipsDf = spark.createDataFrame(flatRelationships.map(lambda t: Row(charId=t[0], pubId=t[1])))


In [None]:
#  Register relationships as a temporary view
relationshipsDf.createOrReplaceTempView("relationships")


In [None]:
#  Let's do the same for characters
charactersDf = spark.
charactersDf.

In [None]:
#  and for publications
publicationsDf = spark.
publicationsDf.

In [None]:
relationshipsDf.show(10)

The following cell shows the standard way to run a SQL query on Spark. The query ranks pairs of Marvel characters by the number of publications in which they appear together.

In [None]:
df1 = spark.sql("""SELECT c1.name AS name1, c2.name AS name2, sub.charId1, sub.charId2, sub.pubCount
FROM
(
  SELECT r1.charId AS charId1, r2.charId AS charId2, COUNT(r1.pubId, r2.pubId) AS pubCount
  FROM relationships AS r1
  CROSS JOIN relationships AS r2
  WHERE r1.charId < r2.charId
  AND r1.pubId=r2.pubId
  GROUP BY r1.charId, r2.charId
) AS sub
INNER JOIN characters c1 ON c1.charId=sub.charId1
INNER JOIN characters c2 ON c2.charId=sub.charId2
ORDER BY sub.pubCount DESC
LIMIT 10""").cache()


In [None]:
df1.take(10)
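To see what the self-join computes, the sketch below runs a simplified version of the pair query on a tiny hypothetical table. It uses Python's built-in `sqlite3` module rather than Spark SQL so that it runs anywhere, and it leaves out the joins to the `characters` view; the table contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relationships (charId INTEGER, pubId INTEGER)")

# Hypothetical data: characters 1, 2, 3 share publication 10;
# characters 1 and 2 also share publication 11.
conn.executemany(
    "INSERT INTO relationships VALUES (?, ?)",
    [(1, 10), (1, 11), (2, 10), (2, 11), (3, 10)],
)

# Pair each row with every other row for the same publication.
# The condition r1.charId < r2.charId keeps each pair once and
# excludes a character being paired with itself.
rows = conn.execute(
    """
    SELECT r1.charId, r2.charId, COUNT(*) AS pubCount
    FROM relationships AS r1
    JOIN relationships AS r2
      ON r1.pubId = r2.pubId AND r1.charId < r2.charId
    GROUP BY r1.charId, r2.charId
    ORDER BY pubCount DESC, r1.charId, r2.charId
    """
).fetchall()
conn.close()

for char_id_1, char_id_2, pub_count in rows:
    print(char_id_1, char_id_2, pub_count)
```

Characters 1 and 2 come first because they share two publications; every other pair shares one.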

In [None]:
df2 = spark.sql("""
SELECT c1.name AS name1, c2.name AS name2, c3.name AS name3, sub.charId1, sub.charId2, sub.charId3, sub.pubCount
FROM
(
  SELECT r1.charId AS charId1, r2.charId AS charId2, r3.charId AS charId3, COUNT(r1.pubId, r2.pubId, r3.pubId) AS pubCount
  FROM relationships AS r1
  CROSS JOIN relationships AS r2
  CROSS JOIN relationships AS r3
  WHERE r1.charId < r2.charId
  AND r2.charId < r3.charId
  AND r1.pubId=r2.pubId
  AND r2.pubId=r3.pubId
  GROUP BY r1.charId, r2.charId, r3.charId
) AS sub
INNER JOIN characters c1 ON c1.charId=sub.charId1
INNER JOIN characters c2 ON c2.charId=sub.charId2
INNER JOIN characters c3 ON c3.charId=sub.charId3
ORDER BY sub.pubCount DESC
LIMIT 10
""").cache()


In [None]:
df2.show(10)

In [None]:
sc.stop()

This lab was adapted from [https://vincentlauzon.com/2018/01/24/azure-databricks-spark-sql-data-frames/](https://vincentlauzon.com/2018/01/24/azure-databricks-spark-sql-data-frames/)

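Saving a DataFrame to CSV (Spark 2.0 and later include a built-in `csv` format, so the external `com.databricks.spark.csv` package name is not needed):
```
publicationsDf.write\
    .format("csv")\
    .option("header", "true")\
    .save("s3://bigdatateaching/marvel/publication")
```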