# Apache Spark for complex queries
**Data management, homework 7**

In this assignment we will use [Apache Spark](https://spark.apache.org/), a popular framework for distributed processing of large amounts of data.
The objective is to use Apache Spark to translate and execute some queries of the TPCx-BB big data benchmark.  
TPCx-BB, or simply "Big Bench", is a common benchmark suite for evaluating system performance on big data analytics and machine learning algorithms. We will focus on the big data analytical queries, which are expressed in SQL.

Spark is available in multiple languages: Scala, Java, Python, and R. In this exercise, we will use Python.

## Get started
### Jupyter Lab
If you are not familiar with the Jupyter Lab environment, check out these resources from the official website: [example notebook](https://jupyter.org/try), [docs](https://jupyterlab.readthedocs.io/en/stable/getting_started/overview.html).  
Quick reference:
- This is a cell. A cell can contain either Markdown text (such as this one) or code. Everything in a Jupyter notebook is a cell.
    - You can double-click a text cell to edit it using Markdown
    - You can run a cell either with the "play" button in the top bar or with the "shift + enter" key combination
    - Running a code cell executes it
    - Running a text cell formats the text
- Once you run a cell, its results stay in memory! Code therefore runs in the order in which you execute the cells, not the order in which they appear in the notebook
- General rule #1: try to arrange cells step by step from top to bottom. If anything breaks, try executing every cell again from the top
- General rule #2: if you are stuck or a cell is blocked during execution, restart the kernel from the top bar menu
### Contents
You can navigate through the contents of this exercise with the file explorer on the left.  
The contents are "extracted" from the [TPCx-BB](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp) benchmark source folder. Please refer to the link if you want a broader overview and/or additional information about TPCx-BB. Since this exercise differs from the actual benchmark, only a subset of its content is included here:
- `queries/` contains 30 SQL/Spark queries, some of which are to be ported to Spark in this exercise. Every query folder `qxx/` (`xx` = query number) contains
    - `engineLocalSettings.conf`: TPC related, disregard
    - `engineLocalSettings.sql`: TPC related, disregard
    - `explain_qxx.sql`: *query content* in "explanatory" format
    - `qxx.sql`: *query content* in TPC exec format
    - `run.sh`: TPC related, disregard
    - `results/qxx-result`: contains the expected result in plain text. You should compare this with your query output (example provided later)
- `spark_table_schemas`: contains schema information for every table in the dataset. Not relevant for the implementation
- `TPCx-BB-dataset`: contains all the tables in separate folders. Refer to it for table names

**Do not modify** the contents of `spark_table_schemas` or `TPCx-BB-dataset`, as this may compromise your solution.
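
The folder layout described above can also be turned into paths programmatically. The following is only an illustrative sketch: `query_paths` is a hypothetical helper that is not part of the benchmark, and it assumes two-digit, zero-padded query numbers (as in `q00`) and that the notebook runs from the root folder of the exercise.

```python
from pathlib import Path


def query_paths(number, root="queries"):
    """Return the explanatory SQL file and the expected-result file of one query."""
    name = f"q{number:02d}"  # e.g. 5 -> "q05" (zero-padding is an assumption)
    folder = Path(root) / name
    return {
        "sql": folder / f"explain_{name}.sql",
        "result": folder / "results" / f"{name}-result",
    }


paths = query_paths(5)
print(paths["sql"].as_posix())     # queries/q05/explain_q05.sql
print(paths["result"].as_posix())  # queries/q05/results/q05-result
```

Calling `paths["sql"].read_text()` would then return the query text, provided the file exists at that location.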



Things to keep in mind:

- We use the Spark SQL module: https://spark.apache.org/docs/latest/sql-programming-guide.html
- Refer to the DataFrame API reference: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html
- JupyterLab must be installed to run this notebook

In [102]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import os

spark = SparkSession.builder \
        .master("local") \
        .appName("Homework 07") \
        .getOrCreate()

In [104]:
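def get_table(name):
    # Load the data of table `name` from the dataset folder
    df = spark.read.parquet(f"TPCx-BB-dataset/{name}.ptxt")

    # Mark a column as non-nullable when its schema line has more than two tokens
    with open(f"spark_table_schemas/{name}.schema", "r") as f:
        for line in f:
            tokens = line.split()
            if len(tokens) > 2:
                df.schema[tokens[0]].nullable = False

    return df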

In [105]:
customer = get_table("customer")

In [108]:
customer.show(1) # show the 1st row of the customer table

+-------------+----------------+------------------+------------------+-----------------+----------------------+---------------------+------------+------------+-----------+---------------------+-----------+-------------+------------+--------------------+------------+--------------------+------------------+
|c_customer_sk|   c_customer_id|c_current_cdemo_sk|c_current_hdemo_sk|c_current_addr_sk|c_first_shipto_date_sk|c_first_sales_date_sk|c_salutation|c_first_name|c_last_name|c_preferred_cust_flag|c_birth_day|c_birth_month|c_birth_year|     c_birth_country|     c_login|     c_email_address|c_last_review_date|
+-------------+----------------+------------------+------------------+-----------------+----------------------+---------------------+------------+------------+-----------+---------------------+-----------+-------------+------------+--------------------+------------+--------------------+------------------+
|            0|AAAAAAAAAAAAAAAA|           1824793|              3203|         

In [101]:
## query1

s = get_table("store_sales")
i = get_table("item")

itemArray = s.join(i, s.ss_item_sk == i.i_item_sk) \
                .filter(i.i_category_id < 3) \
                .filter(s.ss_store_sk.isin([10, 20, 33, 40, 50])) \
                .groupBy("i_category") \
                .count()


itemArray.show()      

+--------------+-----+
|    i_category|count|
+--------------+-----+
|Home & Kitchen| 1975|
|         Music|25060|
+--------------+-----+
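
Since the benchmark queries are written in SQL, it can help to see how the DataFrame query above might look when sent through the Spark SQL module instead. This is a hedged sketch, not the benchmark's own query text: `CATEGORY_COUNT_SQL` and `run_category_count_sql` are hypothetical names, the view names `store_sales` and `item` are chosen here for illustration, and the count column is aliased as `cnt` rather than `count`.

```python
# Hypothetical SQL formulation of the join / filter / group-by shown above
CATEGORY_COUNT_SQL = """
SELECT i.i_category, COUNT(*) AS cnt
FROM store_sales s
JOIN item i ON s.ss_item_sk = i.i_item_sk
WHERE i.i_category_id < 3
  AND s.ss_store_sk IN (10, 20, 33, 40, 50)
GROUP BY i.i_category
"""


def run_category_count_sql(spark, store_sales_df, item_df):
    """Register both DataFrames as temporary views and run the SQL query."""
    store_sales_df.createOrReplaceTempView("store_sales")
    item_df.createOrReplaceTempView("item")
    return spark.sql(CATEGORY_COUNT_SQL)
```

In this notebook it could be called as `run_category_count_sql(spark, get_table("store_sales"), get_table("item")).show()`. Both formulations describe the same join, filters and grouping, so which one to use is mostly a matter of readability.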



In [107]:
## check the result
!cat queries/q00/results/q00-result

Home & Kitchen, 1975
Music, 25060
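
Comparing the two outputs by eye works for small results, but the check can also be done in code. The sketch below is one possible approach under stated assumptions: `parse_expected` and `same_rows` are hypothetical helpers, and the parser assumes the two-column `label, count` format shown above, which may not hold for the result files of other queries.

```python
def parse_expected(text):
    """Parse lines such as 'Music, 25060' into (label, count) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        label, count = line.rsplit(",", 1)  # split on the last comma only
        rows.append((label.strip(), int(count)))
    return rows


def same_rows(actual, expected):
    """Compare two collections of rows while ignoring row order."""
    return sorted(actual) == sorted(expected)


expected = parse_expected("Home & Kitchen, 1975\nMusic, 25060\n")
actual = [("Music", 25060), ("Home & Kitchen", 1975)]
print(same_rows(actual, expected))  # True
```

In the notebook, `actual` could be built with `[tuple(r) for r in itemArray.collect()]` and the expected text read from `queries/q00/results/q00-result`.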