## Question

We have an employees table containing employee details, including each employee's salary and department id. A second table maps each department id to a department name.
Write the following queries:

1. Using both tables, list all employees working in the Marketing department, ordered from highest to lowest salary.
2. Provide the count of employees in each department, along with the department name.

## PySpark

### Setup

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("challenge").getOrCreate()
sqlContext = spark  # reuse the session: SparkSession.sql() covers what the legacy SQLContext did
spark.sparkContext.setLogLevel("ERROR")

### Solution

In [21]:
employee_df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("data/employee_salary.csv")
employee_df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- department_id: integer (nullable = true)



In [3]:
employee_df.limit(5).show()

+---+----------+---------+------+-------------+
| id|first_name|last_name|salary|department_id|
+---+----------+---------+------+-------------+
| 45|     Kevin|   Duncan| 45210|         1003|
| 25|    Pamela| Matthews| 57944|         1005|
| 48|    Robert|    Lynch|117960|         1004|
| 34|    Justin|     Dunn| 67992|         1003|
| 62|      Dale|    Hayes| 97662|         1005|
+---+----------+---------+------+-------------+



In [4]:
employee_df.createOrReplaceTempView("tmpEmployee")
sqlContext.sql("SELECT * FROM tmpEmployee LIMIT 5").show()

+---+----------+---------+------+-------------+
| id|first_name|last_name|salary|department_id|
+---+----------+---------+------+-------------+
| 45|     Kevin|   Duncan| 45210|         1003|
| 25|    Pamela| Matthews| 57944|         1005|
| 48|    Robert|    Lynch|117960|         1004|
| 34|    Justin|     Dunn| 67992|         1003|
| 62|      Dale|    Hayes| 97662|         1005|
+---+----------+---------+------+-------------+



In [22]:
department_df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("data/department.csv")
department_df.printSchema()

root
 |-- department_id: integer (nullable = true)
 |-- department_name: string (nullable = true)



In [6]:
department_df.limit(5).show()

+-------------+---------------+
|department_id|department_name|
+-------------+---------------+
|         1005|          Sales|
|         1002|       Finanace|
|         1004|       Purchase|
|         1001|     Operations|
|         1006|      Marketing|
+-------------+---------------+



In [7]:
department_df.createOrReplaceTempView("tmpDepartment")
sqlContext.sql("SELECT * FROM tmpDepartment LIMIT 5").show()

+-------------+---------------+
|department_id|department_name|
+-------------+---------------+
|         1005|          Sales|
|         1002|       Finanace|
|         1004|       Purchase|
|         1001|     Operations|
|         1006|      Marketing|
+-------------+---------------+



Q - Using both tables, list all employees working in the Marketing department, ordered from highest to lowest salary

In [10]:
from pyspark.sql.functions import desc

result = employee_df.join(department_df, employee_df.department_id == department_df.department_id,"left")

result = result.select("first_name","last_name","salary").where("department_name='Marketing'").orderBy(desc("salary"))

result.show(n=5)

+----------+---------+------+
|first_name|last_name|salary|
+----------+---------+------+
|      Sean| Crawford|190000|
|  Danielle| Williams|120000|
|      Todd|   Wilson|110000|
|     Julia|    Ramos|105000|
|      Eric|Zimmerman| 83093|
+----------+---------+------+
only showing top 5 rows



In [11]:
sqlContext.sql("SELECT first_name, last_name, salary \
                       FROM tmpEmployee as emp \
                       LEFT OUTER JOIN tmpDepartment as department \
                       ON emp.department_id = department.department_id \
                       WHERE department.department_name = 'Marketing' \
                       ORDER BY salary DESC").show(n=5)

+----------+---------+------+
|first_name|last_name|salary|
+----------+---------+------+
|      Sean| Crawford|190000|
|  Danielle| Williams|120000|
|      Todd|   Wilson|110000|
|     Julia|    Ramos|105000|
|      Eric|Zimmerman| 83093|
+----------+---------+------+
only showing top 5 rows
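
Both versions above use a LEFT OUTER JOIN and then filter on `department_name`. Because unmatched employees get a NULL `department_name` and the `WHERE` clause drops them, the result should be the same as with an INNER JOIN. The following is a minimal sketch of that equivalence, using Python's built-in `sqlite3` as a stand-in for Spark SQL or Postgres; the rows and names are made-up toy data, not the contents of the CSV files.

```python
import sqlite3

# In-memory database with hypothetical toy rows (not the real CSV data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee_salary (
        id INTEGER, first_name TEXT, last_name TEXT,
        salary INTEGER, department_id INTEGER
    );
    CREATE TABLE department (department_id INTEGER, department_name TEXT);
    INSERT INTO department VALUES (1005, 'Sales'), (1006, 'Marketing');
    INSERT INTO employee_salary VALUES
        (1, 'Ann', 'Lee', 90000, 1006),
        (2, 'Bob', 'Ray', 120000, 1006),
        (3, 'Cy',  'Fox', 70000, 1005),
        (4, 'Di',  'Orr', 50000, NULL);  -- employee without a department
""")

# Same query text; only the join type is swapped in.
QUERY = """
    SELECT first_name, last_name, salary
    FROM employee_salary AS emp
    {join} department AS dept ON emp.department_id = dept.department_id
    WHERE dept.department_name = 'Marketing'
    ORDER BY salary DESC
"""

left_rows = conn.execute(QUERY.format(join="LEFT OUTER JOIN")).fetchall()
inner_rows = conn.execute(QUERY.format(join="INNER JOIN")).fetchall()

print(left_rows)                # [('Bob', 'Ray', 120000), ('Ann', 'Lee', 90000)]
print(left_rows == inner_rows)  # True
```

The employee with a NULL `department_id` is dropped in both cases, so the join type here is a matter of intent rather than of results.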



Q - Provide the count of employees in each department, along with the department name

In [23]:
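from pyspark.sql.functions import count

result = department_df.join(employee_df, employee_df.department_id == department_df.department_id,"left")

# Count employee ids so a department with no employees reports 0 instead of 1
result = result.groupBy("department_name").agg(count(employee_df["id"]).alias("count"))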

result.show(n=5)

+---------------+-----+
|department_name|count|
+---------------+-----+
|       Purchase|   12|
|          Sales|   15|
|       Finanace|   15|
|      Technoogy|   14|
|      Marketing|    8|
+---------------+-----+
only showing top 5 rows



In [24]:
sqlContext.sql("SELECT department.department_name, count(emp.id) as count_of_employee \
                    FROM tmpDepartment as department \
                    LEFT OUTER JOIN tmpEmployee as emp \
                    ON emp.department_id = department.department_id \
                    GROUP BY department.department_name").show(n=5)

+---------------+-----------------+
|department_name|count_of_employee|
+---------------+-----------------+
|       Purchase|               12|
|          Sales|               15|
|       Finanace|               15|
|      Technoogy|               14|
|      Marketing|                8|
+---------------+-----------------+
only showing top 5 rows
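
When counting over a LEFT OUTER JOIN, the choice of aggregate matters for departments that have no employees: `count(*)` counts joined rows, including the single NULL-padded row such a department produces, while `count(emp.id)` counts only matched employees. A minimal sketch of the difference, using Python's built-in `sqlite3` as a stand-in engine; the `Legal` department and all rows are hypothetical toy data.

```python
import sqlite3

# In-memory database with hypothetical toy rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (department_id INTEGER, department_name TEXT);
    CREATE TABLE employee_salary (id INTEGER, department_id INTEGER);
    INSERT INTO department VALUES (1, 'Marketing'), (2, 'Legal');
    INSERT INTO employee_salary VALUES (10, 1), (11, 1);  -- nobody in Legal
""")

counts = conn.execute("""
    SELECT dept.department_name,
           count(*)      AS row_count,       -- counts the NULL-padded row too
           count(emp.id) AS employee_count   -- ignores NULL ids
    FROM department AS dept
    LEFT OUTER JOIN employee_salary AS emp
        ON emp.department_id = dept.department_id
    GROUP BY dept.department_name
    ORDER BY dept.department_name
""").fetchall()

print(counts)  # [('Legal', 1, 0), ('Marketing', 2, 2)]
```

In the data used here every department has at least one employee, so both aggregates give the same numbers; the distinction only shows up once an empty department exists.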



## Postgres

### Setup

In [12]:
import os
import pandas as pd

In [13]:
nb_path = os.path.join(os.getcwd(), 'utils/connect-postgres.ipynb')
%run {nb_path}

In [14]:
employee_salary_df = pd.read_csv("data/employee_salary.csv")
employee_salary_df.head()

   id first_name last_name  salary  department_id
0  45      Kevin    Duncan   45210           1003
1  25     Pamela  Matthews   57944           1005
2  48     Robert     Lynch  117960           1004
3  34     Justin      Dunn   67992           1003
4  62       Dale     Hayes   97662           1005


In [15]:
employee_salary_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             75 non-null     int64 
 1   first_name     75 non-null     object
 2   last_name      75 non-null     object
 3   salary         75 non-null     int64 
 4   department_id  75 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 3.1+ KB


In [35]:
employee_salary_df.to_sql('employee_salary', con=conn_str, index=False, schema='public')

In [17]:
department_df = pd.read_csv("data/department.csv")
department_df.head()

   department_id department_name
0           1005           Sales
1           1002        Finanace
2           1004        Purchase
3           1001      Operations
4           1006       Marketing


In [18]:
department_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   department_id    6 non-null      int64 
 1   department_name  6 non-null      object
dtypes: int64(1), object(1)
memory usage: 224.0+ bytes


In [19]:
department_df.to_sql('department', con=conn_str, index=False, schema='public')

### Solution

Using both tables, list all employees working in the Marketing department, ordered from highest to lowest salary

In [25]:
%%sql
SELECT first_name,
    last_name,
    salary
FROM public.employee_salary as emp
    LEFT OUTER JOIN public.department as department ON emp.department_id = department.department_id
WHERE department.department_name = 'Marketing'
ORDER BY salary DESC
LIMIT 5

  first_name  last_name  salary
0       Sean   Crawford  190000
1   Danielle   Williams  120000
2       Todd     Wilson  110000
3      Julia      Ramos  105000
4       Eric  Zimmerman   83093


Provide the count of employees in each department, along with the department name

In [26]:
%%sql
SELECT department.department_name,
    count(emp.id) as count_of_employee
FROM public.department as department
    LEFT OUTER JOIN public.employee_salary as emp ON emp.department_id = department.department_id
GROUP BY department.department_name
LIMIT 5

  department_name  count_of_employee
0        Purchase                 12
1       Marketing                  8
2      Operations                 11
3       Technoogy                 14
4           Sales                 15
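
The Spark and Postgres outputs for this query list different sets of five departments. That is expected: without an `ORDER BY`, neither `LIMIT 5` nor `show(n=5)` guarantees which groups come back. A minimal sketch of a deterministic variant, again using Python's built-in `sqlite3` as a stand-in engine with hypothetical toy rows; the tie-break on `department_name` is an assumption about the desired ordering, not something the question specifies.

```python
import sqlite3

# In-memory database with hypothetical toy rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (department_id INTEGER, department_name TEXT);
    CREATE TABLE employee_salary (id INTEGER, department_id INTEGER);
    INSERT INTO department VALUES (1, 'Sales'), (2, 'Marketing'), (3, 'Purchase');
    INSERT INTO employee_salary VALUES (10, 1), (11, 1), (12, 2), (13, 3), (14, 3);
""")

top_departments = conn.execute("""
    SELECT dept.department_name, count(emp.id) AS count_of_employee
    FROM department AS dept
    LEFT OUTER JOIN employee_salary AS emp
        ON emp.department_id = dept.department_id
    GROUP BY dept.department_name
    -- Largest departments first; the name breaks ties so LIMIT is repeatable
    ORDER BY count_of_employee DESC, dept.department_name
    LIMIT 2
""").fetchall()

print(top_departments)  # [('Purchase', 2), ('Sales', 2)]
```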
