# 5. SQL and DataFrames

References:

* Spark-SQL, <https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes>


# 5.1 Example Walkthrough
Work through the Spark SQL and DataFrame examples below.

### Initialize PySpark

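First, we initialize PySpark. If PySpark is not on the Python path, the `findspark` package can locate it; the two corresponding lines are commented out in the cell below.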

In [1]:
# Initialize PySpark
APP_NAME = "PySpark Lecture"
SPARK_MASTER = "spark://mpp3r03c04s04.cos.lrz.de:7077"

# If there is no SparkSession yet, create the environment
try:
    sc and spark
except NameError:
    # Uncomment if PySpark is not on the Python path:
    # import findspark
    # findspark.init()
    import pyspark
    import pyspark.sql
    conf = pyspark.SparkConf().setAppName(APP_NAME).set("spark.cores.max", "8")
    sc = pyspark.SparkContext(master=SPARK_MASTER, conf=conf)
    # getOrCreate() reuses the SparkContext created above
    spark = pyspark.sql.SparkSession.builder.getOrCreate()

# Imported outside the try block so that Row is available even if a session already existed
from pyspark.sql import Row

print("PySpark initiated...")

PySpark initiated...


### Hello, World!

Load the data, map each line to a list of fields, and collect the records into local RAM.

In [2]:
# Load the text file using the SparkContext
csv_lines = sc.textFile("../data/example.csv")

# Map the data to split the lines into a list
data = csv_lines.map(lambda line: line.split(","))

# Collect the dataset into local RAM
data.collect()

[[u'Russell Jurney', u'Relato', u'CEO'],
 [u'Florian Liebert', u'Mesosphere', u'CEO'],
 [u'Don Brown', u'Rocana', u'CIO'],
 [u'Steve Jobs', u'Apple', u'CEO'],
 [u'Donald Trump', u'The Trump Organization', u'CEO'],
 [u'Russell Jurney', u'Data Syndrome', u'Principal Consultant']]
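
Splitting on `","` works for this file because no field contains a comma. As a minimal sketch of a more defensive variant, the standard-library `csv` module can parse a single line while honoring double-quoted fields. The helper name `parse_csv_line` and the quoted sample line are hypothetical and only serve as illustration.

```python
import csv


def parse_csv_line(line):
    """Parse one CSV line into a list of fields, honoring double-quoted fields."""
    # csv.reader expects an iterable of lines, so wrap the single line in a list.
    # An empty line yields no usable row; fall back to an empty list in that case.
    return next(csv.reader([line]), [])


# A line from the example file parses the same way as with str.split(",")
print(parse_csv_line("Steve Jobs,Apple,CEO"))

# Hypothetical line with a comma inside a quoted field
print(parse_csv_line('"Jurney, Russell",Relato,CEO'))
```

With the session from above, such a helper could be passed to the RDD in place of the lambda, for example `csv_lines.map(parse_csv_line)`.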

### Creating Rows

Create `pyspark.sql.Row` objects from the data so that the RDD can be converted into a DataFrame.

In [3]:
# Convert one CSV line into a pyspark.sql.Row
def csv_to_row(line):
    parts = line.split(",")
    row = Row(
        name=parts[0],
        company=parts[1],
        title=parts[2]
    )
    return row

# Apply the function to get rows in an RDD
rows = csv_lines.map(csv_to_row)

### Creating DataFrames from RDDs

Use the `RDD.toDF()` method to create a `DataFrame`, register it as a temporary view with Spark SQL, and count the jobs per person with a SQL query.

In [4]:
# Convert to a pyspark.sql.DataFrame
rows_df = rows.toDF()

# Register the DataFrame for Spark SQL
# (createOrReplaceTempView replaces the deprecated registerTempTable)
rows_df.createOrReplaceTempView("executives")

# Generate a new DataFrame with SQL using the SparkSession
job_counts = spark.sql("""
SELECT
  name,
  COUNT(*) AS total
  FROM executives
  GROUP BY name
""")
job_counts.show()

# Go back to an RDD
job_counts.rdd.collect()

+---------------+-----+
|           name|total|
+---------------+-----+
|   Donald Trump|    1|
|Florian Liebert|    1|
|      Don Brown|    1|
| Russell Jurney|    2|
|     Steve Jobs|    1|
+---------------+-----+



[Row(name=u'Donald Trump', total=1),
 Row(name=u'Florian Liebert', total=1),
 Row(name=u'Don Brown', total=1),
 Row(name=u'Russell Jurney', total=2),
 Row(name=u'Steve Jobs', total=1)]
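
To make explicit what the `GROUP BY name` query computes, here is a plain-Python sketch of the same aggregation on the six records collected in the "Hello, World!" step. It uses `collections.Counter` instead of Spark and is meant as an illustration of the semantics, not as a replacement for the SQL query.

```python
from collections import Counter

# The records as collected from example.csv in the "Hello, World!" step
records = [
    ["Russell Jurney", "Relato", "CEO"],
    ["Florian Liebert", "Mesosphere", "CEO"],
    ["Don Brown", "Rocana", "CIO"],
    ["Steve Jobs", "Apple", "CEO"],
    ["Donald Trump", "The Trump Organization", "CEO"],
    ["Russell Jurney", "Data Syndrome", "Principal Consultant"],
]

# Equivalent of: SELECT name, COUNT(*) AS total FROM executives GROUP BY name
job_counts_local = Counter(name for name, company, title in records)

for name, total in job_counts_local.items():
    print(name, total)
```

As in the Spark output, only Russell Jurney appears with two jobs; every other name is counted once.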

# 5.2-5.6 NASA Dataset

5.2 Create a Spark SQL table with fields for IP/host and response code from the NASA log file!

In [3]:
%%time
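from pyspark.sql import Row

def log_line_to_row(line):
    # Split once; the response code is the second-to-last field of a log line
    parts = line.split()
    return Row(
        host=parts[0],
        response_code=parts[-2] if len(parts) > 2 else "No Value"
    )

nasa_lines = sc.textFile("../data/nasa/NASA_access_log_Jul95")
spark_dataframe = nasa_lines.map(log_line_to_row).toDF()
spark_dataframe.createOrReplaceTempView("nasa")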

CPU times: user 40.2 ms, sys: 22.1 ms, total: 62.3 ms
Wall time: 1.09 s
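
The cell above relies on whitespace splitting and takes the second-to-last field as the response code. As a sketch of a stricter alternative, the following pure-Python parser matches each line against a regular expression, assuming the log follows the Common Log Format. The names `LOG_PATTERN` and `parse_log_line` are hypothetical, and the sample line is only an illustration of that format.

```python
import re

# Assumed layout (Common Log Format):
# host ident user [timestamp] "request" status bytes
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(.*)" (\d{3}) (\S+)\s*$')


def parse_log_line(line):
    """Return (host, response_code); use "No Value" where a field cannot be parsed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        # Fall back to the first whitespace-separated field as host, if there is one
        fields = line.split()
        host = fields[0] if fields else "No Value"
        return host, "No Value"
    return match.group(1), match.group(4)


# Illustrative line in the assumed format
sample = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
print(parse_log_line(sample))
```

Because the function returns a plain tuple, it could be wrapped in a `Row` inside the `map` call, for example `Row(host=h, response_code=c)` for each `(h, c)` pair.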


5.3 Run an SQL query that outputs the number of occurrences of each HTTP response code!

In [4]:
%%time
results = spark.sql("""select response_code, count(*) as count from nasa group by response_code""").toPandas()

CPU times: user 82.4 ms, sys: 38 ms, total: 120 ms
Wall time: 40.4 s


5.4 Cache the DataFrame and run the same query again! Measure the time taken for caching and the execution time of the query!

In [5]:
%%time
spark_dataframe.cache()
spark_dataframe.count()

CPU times: user 16.8 ms, sys: 2.96 ms, total: 19.7 ms
Wall time: 29.8 s


In [8]:
%%time
results = spark.sql("""select response_code, count(*) as count from nasa group by response_code""").toPandas()

CPU times: user 98.5 ms, sys: 29.6 ms, total: 128 ms
Wall time: 1.48 s
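
The measured wall times can be put into relation with a little arithmetic. The sketch below uses only the three wall times reported above; since each is a single measurement, the resulting numbers are rough indications rather than benchmarks. The function name `total_time` is hypothetical.

```python
# Wall times in seconds, copied from the %%time outputs above
uncached_query = 40.4   # query in 5.3, before caching
cache_fill = 29.8       # cache() followed by count() in 5.4
cached_query = 1.48     # same query after caching

# Per-query speedup once the DataFrame is cached
speedup = uncached_query / cached_query


def total_time(n_queries, cached):
    """Total wall time for n_queries runs, including the one-off cost of filling the cache."""
    if cached:
        return cache_fill + n_queries * cached_query
    return n_queries * uncached_query


print(f"speedup per query: {speedup:.1f}x")
for n in (1, 2, 5):
    print(n, round(total_time(n, cached=False), 2), round(total_time(n, cached=True), 2))
```

With these measurements, filling the cache and running one cached query already takes less time than a single uncached query, so caching pays off from the first query onward.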


5.5 Run the same query with and without the cache, using 8 and 16 cores! Document and explain the results!
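
One possible way to collect the measurements for this exercise is a small timing helper that runs a callable several times and records each wall time, as an alternative to reading the `%%time` output by hand. This is only a sketch; the helper name `time_call` and the `repeats` parameter are hypothetical.

```python
import time


def time_call(fn, repeats=3):
    """Call fn() `repeats` times; return the last result and the list of wall times in seconds."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        durations.append(time.perf_counter() - start)
    return result, durations


# Stand-in workload so the sketch runs without a Spark cluster
result, durations = time_call(lambda: sum(range(1000)), repeats=2)
print(result, durations)
```

In the notebook, the callable could wrap the query from 5.3, for example `time_call(lambda: spark.sql(query).toPandas())` with `query` holding the SQL string. The number of cores is set through `spark.cores.max` before the SparkContext is created, so each core setting needs a fresh context.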

5.6 Convert the output to a Pandas dataframe and calculate the percentage of total for each response code!

In [17]:
results["counts_pct"] = results["count"] / results["count"].sum() * 100
results

  response_code    count  counts_pct
0           200  1701534   89.946636
1           302    46573    2.461946
2           501       14    0.000740
3           404    10845    0.573289
4           403       54    0.002855
5           500       62    0.003277
6           304   132627    7.010940
7      No Value        1    0.000053
8           400        5    0.000264
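
As a cross-check of the percentages, the same calculation can be sketched in plain Python, using the counts from the output above. The variable names are hypothetical; the formula is the one from the pandas cell, each count divided by the sum of all counts and multiplied by 100.

```python
# Counts per response code, copied from the output above
counts = {
    "200": 1701534,
    "302": 46573,
    "501": 14,
    "404": 10845,
    "403": 54,
    "500": 62,
    "304": 132627,
    "No Value": 1,
    "400": 5,
}

total = sum(counts.values())

# Percentage of total for each response code
counts_pct = {code: 100 * n / total for code, n in counts.items()}

# Print the codes from most to least frequent
for code, pct in sorted(counts_pct.items(), key=lambda item: item[1], reverse=True):
    print(f"{code:>8}  {counts[code]:>8}  {pct:.6f}")
```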
