### Name : < Put your name in this section >

# Data Access using SparkSQL and Dataframe

## Activity : Aggregation and Sorting Queries

In this module, you will practice writing code to retrieve data using the Spark SQL and DataFrame APIs.

The complete list of DataFrame functions is available in the [Java API reference (1.6.1)](https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrame.html), the [Python API reference (2.1.0)](http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join) and the [Scala functions reference (2.2.0)](https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.functions$).


In this activity, we will use the HR schema shown below.
![hr](HR.gif)

### INITIALIZATION
The first section of this script is the initialization section.
In this section, we prepare the Spark environment to recognize and process SQL statements.

In [1]:
from pyspark import SparkContext, SparkConf # Spark
from pyspark.sql import SparkSession # Spark SQL
from pyspark.sql.types import *

#additional 
from pyspark.sql.functions import *

# `master`: `local[*]` runs Spark locally with as many worker threads as there are logical cores
# on your machine; `local[k]` would run it with exactly k worker threads.
# `appName`: the name shown on the Spark cluster UI.
conf = SparkConf().setMaster("local[*]").setAppName("Week 2 - Join Query")

# Reuse the running SparkContext if there is one; otherwise create a new one from `conf`
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sparkContext=sc)


### DATA STRUCTURE DEFINITION

In this section, we define one schema per table so that the DataFrames match the data files provided as data sources.

In [2]:
#COUNTRIES TABLE
scCountries = StructType([StructField("country_id",StringType()),StructField("country_name",StringType()),StructField("region_id",IntegerType())])

#DEPARTMENTS TABLE
scDepartments = StructType([StructField("department_id",IntegerType()),
StructField("department_name",StringType()),
StructField("manager_id",IntegerType()),
StructField("location_id",IntegerType())
])

#EMPLOYEES TABLE
scEmployees = StructType([
StructField("employee_id",IntegerType()),
StructField("first_name",StringType()),
StructField("last_name",StringType()),
StructField("email",StringType()),
StructField("phone_number",StringType()),
StructField("hire_date",StringType()),
StructField("job_id",StringType()),
StructField("salary",IntegerType()),
StructField("commission_pct",FloatType()),
StructField("manager_id",IntegerType()),
StructField("department_id",IntegerType())
])

#JOBS TABLE
scJobs = StructType([
StructField("job_id",StringType()),
StructField("job_title",StringType()),
StructField("min_salary",IntegerType()),
StructField("max_salary",IntegerType())
])

#JOB_HISTORY TABLE
scJob_history = StructType([
StructField("employee_id",IntegerType()),
StructField("start_date",StringType()),
StructField("end_date",StringType()),
StructField("job_id",StringType()),
StructField("department_id",IntegerType())
])

#LOCATIONS TABLE
scLocations = StructType([
StructField("location_id",IntegerType()),
StructField("street_address",StringType()),
StructField("postal_code",StringType()),
StructField("city",StringType()),
StructField("state_province",StringType()),
StructField("country_id",StringType())
])

#REGIONS TABLE
scRegions = StructType([
StructField("region_id",IntegerType()),
StructField("region_name",StringType())
])

### DATA LOADING

In [3]:
#COUNTRIES DATA
dataCountries = sc.textFile('COUNTRIES.csv')
dataCountries = dataCountries.map(lambda x: x.split(','))
dataCountries = dataCountries.map(lambda x: [x[0],x[1], int(x[2])])

#DEPARTMENTS DATA
dataDepartments = sc.textFile('DEPARTMENTS.csv')
dataDepartments = dataDepartments.map(lambda x: x.split(','))
dataDepartments = dataDepartments.map(lambda x: [int(x[0]),x[1], int(x[2]), int(x[3])])

#EMPLOYEES DATA
dataEmployees = sc.textFile('EMPLOYEES.csv')
dataEmployees = dataEmployees.map(lambda x: x.split(','))
dataEmployees = dataEmployees.map(lambda x: [int(x[0]),x[1], x[2], \
                                             x[3],x[4], x[5], x[6], \
                                             int(x[7]),float(x[8]), int(x[9]), int(x[10])\
                                            ])

#JOBS_DATA
dataJobs = sc.textFile('JOBS.csv')
dataJobs = dataJobs.map(lambda x: x.split(','))
dataJobs = dataJobs.map(lambda x: [x[0],x[1], \
                                   int(x[2]),int(x[3])\
                                   ])

#JOB_HISTORY_DATA
dataJob_history = sc.textFile('JOB_HISTORY.csv')
dataJob_history = dataJob_history.map(lambda x: x.split(','))
dataJob_history = dataJob_history.map(lambda x: [int(x[0]),x[1], \
                                   x[2],x[3],int(x[4])\
                                   ])

#LOCATION_DATA
dataLocations = sc.textFile('LOCATIONS.csv')
dataLocations = dataLocations.map(lambda x: x.split(','))
dataLocations = dataLocations.map(lambda x: [int(x[0]),x[1], \
                                   x[2],x[3],x[4],x[5]\
                                   ])
#REGIONS DATA
dataRegions = sc.textFile('REGIONS.csv')
dataRegions = dataRegions.map(lambda x: x.split(','))
dataRegions = dataRegions.map(lambda x: [int(x[0]),x[1] ])
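
The loaders above split each line with `x.split(',')` and convert fields with `int(...)` directly. That works as long as no field contains a comma and no numeric field is empty. The sketch below shows one possible way to make the parsing more tolerant, using Python's built-in `csv` module. The helper names `parse_csv_line` and `to_int` and the sample line are hypothetical, and whether the provided files actually quote fields is an assumption.

```python
import csv

def parse_csv_line(line):
    """Split one CSV line into fields, keeping commas inside double-quoted fields."""
    return next(csv.reader([line]))

def to_int(value):
    """Convert a field to int; an empty field becomes None instead of raising an error."""
    return int(value) if value.strip() else None

# Hypothetical sample line whose street address contains a comma
sample = '2500,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,Oxford,UK'

naive_fields = sample.split(',')       # the quoted address is cut in two
safe_fields = parse_csv_line(sample)   # the quoted address stays in one field

print(len(naive_fields), len(safe_fields))
print(to_int(safe_fields[0]), to_int(''))
```

Both helpers are plain functions, so they could be passed to `map` in place of the lambdas, for example `dataLocations.map(parse_csv_line)`.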


### PREPARING DATAFRAMES

In [4]:
dfCountries = spark.createDataFrame(dataCountries,schema=scCountries) 
dfCountries.createOrReplaceTempView("dataCountries")

dfDepartments = spark.createDataFrame(dataDepartments,schema=scDepartments) 
dfDepartments.createOrReplaceTempView("dataDepartments")

dfEmployees = spark.createDataFrame(dataEmployees,schema=scEmployees) 
dfEmployees.createOrReplaceTempView("dataEmployees")

dfJobs = spark.createDataFrame(dataJobs,schema=scJobs) 
dfJobs.createOrReplaceTempView("dataJobs")

dfJob_history = spark.createDataFrame(dataJob_history,schema=scJob_history) 
dfJob_history.createOrReplaceTempView("dataJob_history")

dfLocations = spark.createDataFrame(dataLocations,schema=scLocations) 
dfLocations.createOrReplaceTempView("dataLocations")

dfRegions = spark.createDataFrame(dataRegions,schema=scRegions) 
dfRegions.createOrReplaceTempView("dataRegions")


#### Question 1
Display every department name together with its number of employees. Sort the result by department name in descending order.

Ensure your output has the same format as the expected output below.

![picture](lab4_q1.png)

In [5]:
#spark.sql()
# COUNT(employee_id) ignores the NULLs that the LEFT JOIN produces for departments without employees
sqlQry=spark.sql("SELECT department_name, COUNT(dataEmployees.employee_id) as total "+
          "FROM dataDepartments LEFT JOIN dataEmployees " +
          "ON dataEmployees.department_id=dataDepartments.department_id " +
          "GROUP BY department_name")
sqlQry.show()

+--------------------+-----+
|     department_name|total|
+--------------------+-----+
|       Corporate Tax|    0|
|               Sales|   34|
|          Accounting|    2|
|    Government Sales|    0|
|             Payroll|    0|
|             Finance|    6|
|    Public Relations|    1|
|           Executive|    3|
|          Recruiting|    0|
|          Purchasing|    6|
|        Construction|    0|
|                 NOC|    0|
|            Treasury|    0|
|Shareholder Services|    0|
|        Retail Sales|    0|
|           Marketing|    2|
|                  IT|    5|
|      Administration|    1|
|         Contracting|    0|
|          IT Support|    0|
+--------------------+-----+
only showing top 20 rows



In [6]:
#DataFrame functions
# count('employee_id') skips the NULLs produced by the left join, so departments
# without employees get 0; groupBy(...).count() would count the joined row itself.
q1_1 = dfDepartments.join(dfEmployees, dfDepartments.department_id==dfEmployees.department_id, 'left')
q1_2 = q1_1.groupBy('department_name').agg(count('employee_id').alias('total'))
q1_2.show()

+--------------------+-----+
|     department_name|total|
+--------------------+-----+
|       Corporate Tax|    0|
|               Sales|   34|
|          Accounting|    2|
|    Government Sales|    0|
|             Payroll|    0|
|             Finance|    6|
|    Public Relations|    1|
|           Executive|    3|
|          Recruiting|    0|
|          Purchasing|    6|
|        Construction|    0|
|                 NOC|    0|
|            Treasury|    0|
|Shareholder Services|    0|
|        Retail Sales|    0|
|           Marketing|    2|
|                  IT|    5|
|      Administration|    1|
|         Contracting|    0|
|          IT Support|    0|
+--------------------+-----+
only showing top 20 rows
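
With a LEFT JOIN, a department without employees still produces one joined row, so counting rows and counting `employee_id` values give different totals for it. The question also asks for descending order by department name, which an `ORDER BY ... DESC` clause provides. The sketch below illustrates both points with Python's built-in `sqlite3` as a stand-in for Spark SQL, so it runs without a cluster. The three departments and three employees are hypothetical toy rows; only the table and column names mirror the temp views used in this activity.

```python
import sqlite3

# In-memory database with hypothetical toy rows
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataDepartments (department_id INTEGER, department_name TEXT);
    CREATE TABLE dataEmployees (employee_id INTEGER, department_id INTEGER);
    INSERT INTO dataDepartments VALUES (10, 'Administration'), (20, 'Marketing'), (30, 'Payroll');
    INSERT INTO dataEmployees VALUES (200, 10), (201, 20), (202, 20);
""")

query = """
    SELECT d.department_name,
           COUNT(e.employee_id) AS total,     -- NULLs from unmatched rows are skipped
           COUNT(*)             AS row_count  -- every joined row is counted
    FROM dataDepartments d
    LEFT JOIN dataEmployees e ON e.department_id = d.department_id
    GROUP BY d.department_name
    ORDER BY d.department_name DESC
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

In this toy data, `Payroll` has no employees, so its `total` and `row_count` columns differ.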



#### Question 2
Using the result from Question 1, display the departments that have more than 20 employees. 
Order the result according to the number of employees in descending order.

![figure](lab4_q2.png)

In [9]:
#spark.sql
# HAVING keeps only the groups with more than 20 employees; the result is ordered by the count
q2=spark.sql("select department_name, count(employee_id) as total" +
          " from dataEmployees as de INNER JOIN dataDepartments as dd on de.department_id=dd.department_id" +
          " group by department_name having count(*) > 20 order by total DESC")
q2.show()

+---------------+-----+
|department_name|total|
+---------------+-----+
|       Shipping|   45|
|          Sales|   34|
+---------------+-----+



In [8]:
#dataframe function
# Aggregate first, then filter and sort on the aliased count column
dfEmployees.join(dfDepartments, dfEmployees.department_id == dfDepartments.department_id)\
    .groupBy("department_name")\
    .agg(count("employee_id").alias("total"))\
    .filter(col("total") > 20)\
    .orderBy(desc("total"))\
    .show()

+---------------+-----+
|department_name|total|
+---------------+-----+
|       Shipping|   45|
|          Sales|   34|
+---------------+-----+
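
The filter on the employee count has to run after the grouping, which is what `HAVING` does in SQL; a `WHERE` clause is evaluated before grouping and cannot see the aggregated count. The sketch below shows the pattern with Python's built-in `sqlite3` as a stand-in for Spark SQL. The table `emp_dept` is a hypothetical, already-joined toy table, and the threshold is lowered from 20 to 2 to fit the toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table: one row per employee, already joined to the department name
    CREATE TABLE emp_dept (employee_id INTEGER, department_name TEXT);
    INSERT INTO emp_dept VALUES
        (1, 'Sales'), (2, 'Sales'), (3, 'Sales'),
        (4, 'Shipping'), (5, 'Shipping'), (6, 'Shipping'), (7, 'Shipping'),
        (8, 'IT');
""")

query = """
    SELECT department_name, COUNT(employee_id) AS total
    FROM emp_dept
    GROUP BY department_name
    HAVING COUNT(employee_id) > 2   -- filter groups, not individual rows
    ORDER BY total DESC             -- largest department first
"""
big_departments = conn.execute(query).fetchall()
print(big_departments)
```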



#### Question 3
Find the name and salary of the employee who earns the lowest salary in the company.

![figure](lab4_q3.png)

In [11]:
#spark.sql()
# Sort by salary in ascending order and keep the first row, i.e. the lowest salary
q3 = spark.sql("select first_name, salary from dataEmployees order by salary LIMIT 1")
q3.show()

+----------+------+
|first_name|salary|
+----------+------+
|        TJ|  2100|
+----------+------+



In [12]:
#dataframe functions
dfEmployees.select("first_name", "salary").orderBy("salary").limit(1).show()

+----------+------+
|first_name|salary|
+----------+------+
|        TJ|  2100|
+----------+------+
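
Sorting by salary and keeping one row returns a single employee even when several share the lowest salary. If ties matter, one option is to compare each salary with the minimum computed in a scalar subquery. The sketch below contrasts the two approaches using Python's built-in `sqlite3` as a stand-in for Spark SQL; the three employees are hypothetical toy rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataEmployees (first_name TEXT, salary INTEGER);
    INSERT INTO dataEmployees VALUES ('Ana', 2100), ('Ben', 2100), ('Cal', 2500);
""")

# Approach 1: sort and keep the first row (drops any tied employees)
limit_rows = conn.execute(
    "SELECT first_name, salary FROM dataEmployees ORDER BY salary LIMIT 1"
).fetchall()

# Approach 2: keep every row whose salary equals the overall minimum
min_rows = conn.execute(
    "SELECT first_name, salary FROM dataEmployees "
    "WHERE salary = (SELECT MIN(salary) FROM dataEmployees) "
    "ORDER BY first_name"
).fetchall()

print(limit_rows)
print(min_rows)
```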



#### Question 4

Find the first employee who joined the company. Display the employee's name, job and hire date.

![figure](lab4_q4.png)


In [15]:
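#spark.sql
# hire_date is a string such as '13-Jan-01'; parse it so that rows sort chronologically
# (the 'd-MMM-yy' pattern is an assumption inferred from the displayed values)
hiredate = spark.sql("select first_name, last_name, job_id, hire_date as recruit " +
                 "from dataEmployees order by to_date(hire_date, 'd-MMM-yy')").limit(1)
hiredate.show()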

+----------+---------+------+---------+
|first_name|last_name|job_id|  recruit|
+----------+---------+------+---------+
|       Lex|  De Haan| AD_VP|13-Jan-01|
+----------+---------+------+---------+



In [17]:
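#dataframe
# Sort on the parsed date first, then select and rename the output columns
# (the 'd-MMM-yy' pattern is an assumption inferred from the displayed values)
dfEmployees.orderBy(to_date("hire_date", "d-MMM-yy"))\
    .select("first_name", "last_name", "job_id", col("hire_date").alias("recruit"))\
    .limit(1).show()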

+----------+---------+------+---------+
|first_name|last_name|job_id|  recruit|
+----------+---------+------+---------+
|       Lex|  De Haan| AD_VP|13-Jan-01|
+----------+---------+------+---------+
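
Because `hire_date` is stored as a string, ordering it as text compares the day digits first and is not chronological. The sketch below shows the difference in plain Python with `datetime.strptime`; in Spark, the counterpart would be parsing the column with `to_date` and a matching pattern. The four dates are hypothetical values in the same day-month-year style as the column, and the example assumes English month abbreviations.

```python
from datetime import datetime

# Hypothetical hire dates written like the values in the hire_date column
hire_dates = ["17-Jun-03", "13-Jan-01", "21-Sep-05", "03-Jan-06"]

# Text comparison starts with the day digits, so the result is not the earliest date
earliest_as_text = min(hire_dates)

# Parsing first yields real dates; %y maps 00-68 to the years 2000-2068
earliest_as_date = min(hire_dates, key=lambda s: datetime.strptime(s, "%d-%b-%y"))

print(earliest_as_text)
print(earliest_as_date)
```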



#### Question 5

Display the name of the department that has the highest total salary in the company.

![figure](lab4_q5.png)

In [18]:
#spark.sql()
highSalary=spark.sql("select department_name, sum(salary) as total from dataDepartments LEFT JOIN dataEmployees "+
                     "ON dataEmployees.department_id=dataDepartments.department_id "+
                     "group by department_name order by sum(salary) DESC LIMIT 1")
highSalary.show()

+---------------+------+
|department_name| total|
+---------------+------+
|          Sales|304500|
+---------------+------+



In [19]:
#dataframe function
# Alias the aggregate once, then sort on the alias and keep the top row
q5_1 = dfDepartments.join(dfEmployees, dfDepartments.department_id==dfEmployees.department_id, 'left')
q5_2 = q5_1.groupBy('department_name').agg(sum('salary').alias('total'))
q5_2.orderBy(desc('total')).limit(1).show()

+---------------+------+
|department_name| total|
+---------------+------+
|          Sales|304500|
+---------------+------+

