## Question - Salaries

We have a table of employees and their salaries. Write queries to solve the following problems:
1. List all the employees whose salary is more than 100K
2. Provide the distinct department ids
3. Provide the first and last name of each employee
4. Provide all the details of the employees whose last name is 'Johnson'

## PySpark

### Setup

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("challenge").getOrCreate()
# SparkSession already exposes .sql(), so reuse it instead of wrapping it in a second session
sqlContext = spark
spark.sparkContext.setLogLevel("ERROR")

### Solution

In [4]:
employee_df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("data/employee_salary.csv")
employee_df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- department_id: integer (nullable = true)



In [7]:
employee_df.limit(5).show()

+---+----------+---------+------+-------------+
| id|first_name|last_name|salary|department_id|
+---+----------+---------+------+-------------+
| 45|     Kevin|   Duncan| 45210|         1003|
| 25|    Pamela| Matthews| 57944|         1005|
| 48|    Robert|    Lynch|117960|         1004|
| 34|    Justin|     Dunn| 67992|         1003|
| 62|      Dale|    Hayes| 97662|         1005|
+---+----------+---------+------+-------------+



In [13]:
employee_df.createOrReplaceTempView("tmpEmployee")
sqlContext.sql("SELECT * FROM tmpEmployee LIMIT 5").show()

+---+----------+---------+------+-------------+
| id|first_name|last_name|salary|department_id|
+---+----------+---------+------+-------------+
| 45|     Kevin|   Duncan| 45210|         1003|
| 25|    Pamela| Matthews| 57944|         1005|
| 48|    Robert|    Lynch|117960|         1004|
| 34|    Justin|     Dunn| 67992|         1003|
| 62|      Dale|    Hayes| 97662|         1005|
+---+----------+---------+------+-------------+



Q - List all the employees whose salary is more than 100K

In [9]:
result = employee_df.filter("salary > 100000")
print("Total records: {}".format(result.count()))
result.show(n=5)

Total records: 39
+---+----------+---------+------+-------------+
| id|first_name|last_name|salary|department_id|
+---+----------+---------+------+-------------+
| 48|    Robert|    Lynch|117960|         1004|
|  1|      Todd|   Wilson|110000|         1006|
| 61|      Ryan|    Brown|120000|         1003|
| 21|   Stephen|    Berry|123617|         1002|
| 13|     Julie|  Sanchez|210000|         1001|
+---+----------+---------+------+-------------+
only showing top 5 rows



In [14]:
sqlContext.sql("SELECT * FROM tmpEmployee where salary > 100000 LIMIT 5").show()

+---+----------+---------+------+-------------+
| id|first_name|last_name|salary|department_id|
+---+----------+---------+------+-------------+
| 48|    Robert|    Lynch|117960|         1004|
|  1|      Todd|   Wilson|110000|         1006|
| 61|      Ryan|    Brown|120000|         1003|
| 21|   Stephen|    Berry|123617|         1002|
| 13|     Julie|  Sanchez|210000|         1001|
+---+----------+---------+------+-------------+



Q - Provide distinct department id 

In [10]:
employee_df.select("department_id").distinct().show()

+-------------+
|department_id|
+-------------+
|         1005|
|         1002|
|         1001|
|         1006|
|         1003|
|         1004|
+-------------+



In [16]:
sqlContext.sql("SELECT DISTINCT department_id FROM tmpEmployee").show()

+-------------+
|department_id|
+-------------+
|         1005|
|         1002|
|         1001|
|         1006|
|         1003|
|         1004|
+-------------+



Q - Provide first and last name of employees 

In [11]:
employee_df.select("first_name", "last_name").show(n=5)

+----------+---------+
|first_name|last_name|
+----------+---------+
|     Kevin|   Duncan|
|    Pamela| Matthews|
|    Robert|    Lynch|
|    Justin|     Dunn|
|      Dale|    Hayes|
+----------+---------+
only showing top 5 rows



In [17]:
sqlContext.sql("SELECT first_name, last_name FROM tmpEmployee").show(n=5)

+----------+---------+
|first_name|last_name|
+----------+---------+
|     Kevin|   Duncan|
|    Pamela| Matthews|
|    Robert|    Lynch|
|    Justin|     Dunn|
|      Dale|    Hayes|
+----------+---------+
only showing top 5 rows



Q - Provide all the details of the employees whose last name is 'Johnson'

In [12]:
employee_df.filter("last_name == 'Johnson'").show()

+---+----------+---------+------+-------------+
| id|first_name|last_name|salary|department_id|
+---+----------+---------+------+-------------+
| 26|   Allison|  Johnson|128782|         1001|
| 12|    Joshua|  Johnson|123082|         1004|
+---+----------+---------+------+-------------+



In [18]:
sqlContext.sql("SELECT * FROM tmpEmployee where last_name = 'Johnson'").show()

+---+----------+---------+------+-------------+
| id|first_name|last_name|salary|department_id|
+---+----------+---------+------+-------------+
| 26|   Allison|  Johnson|128782|         1001|
| 12|    Joshua|  Johnson|123082|         1004|
+---+----------+---------+------+-------------+



## Postgres

### Setup

In [30]:
import os
import pandas as pd

In [28]:
nb_path = os.path.join(os.getcwd(), 'utils/connect-postgres.ipynb')
%run {nb_path}

In [31]:
employee_salary_df = pd.read_csv("data/employee_salary.csv")
employee_salary_df.head()

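|   | id | first_name | last_name | salary | department_id |
|---|----|------------|-----------|--------|---------------|
| 0 | 45 | Kevin | Duncan | 45210 | 1003 |
| 1 | 25 | Pamela | Matthews | 57944 | 1005 |
| 2 | 48 | Robert | Lynch | 117960 | 1004 |
| 3 | 34 | Justin | Dunn | 67992 | 1003 |
| 4 | 62 | Dale | Hayes | 97662 | 1005 |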


In [32]:
employee_salary_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             75 non-null     int64 
 1   first_name     75 non-null     object
 2   last_name      75 non-null     object
 3   salary         75 non-null     int64 
 4   department_id  75 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 3.1+ KB


In [35]:
employee_salary_df.to_sql('employee_salary', con=conn_str, index=False, schema='public')

In [37]:
%%sql
SELECT * FROM public.employee_salary LIMIT 5

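|   | id | first_name | last_name | salary | department_id |
|---|----|------------|-----------|--------|---------------|
| 0 | 45 | Kevin | Duncan | 45210 | 1003 |
| 1 | 25 | Pamela | Matthews | 57944 | 1005 |
| 2 | 48 | Robert | Lynch | 117960 | 1004 |
| 3 | 34 | Justin | Dunn | 67992 | 1003 |
| 4 | 62 | Dale | Hayes | 97662 | 1005 |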


In [38]:
%%sql
SELECT COUNT(*) FROM public.employee_salary

|   | count |
|---|-------|
| 0 | 75 |


### Solution

List all the employees whose salary is more than 100K

In [39]:
%%sql
SELECT id,
    first_name,
    last_name,
    salary,
    department_id
FROM public.employee_salary
WHERE salary > 100000
LIMIT 5

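|   | id | first_name | last_name | salary | department_id |
|---|----|------------|-----------|--------|---------------|
| 0 | 48 | Robert | Lynch | 117960 | 1004 |
| 1 | 1 | Todd | Wilson | 110000 | 1006 |
| 2 | 61 | Ryan | Brown | 120000 | 1003 |
| 3 | 21 | Stephen | Berry | 123617 | 1002 |
| 4 | 13 | Julie | Sanchez | 210000 | 1001 |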
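
The query above uses `LIMIT 5` without an `ORDER BY`, so which five matching rows come back is up to the engine. The following is a minimal sketch of a deterministic variant that also passes the threshold as a bound parameter. It uses Python's built-in `sqlite3` as a stand-in for Postgres (an assumption made so the example runs without a database server), and the table holds only a handful of rows copied from the outputs in this notebook, not the full 75-row file. The helper name `high_earners` is hypothetical.

```python
import sqlite3

# Sample rows copied from the outputs shown in this notebook (not the full file).
ROWS = [
    (45, "Kevin", "Duncan", 45210, 1003),
    (48, "Robert", "Lynch", 117960, 1004),
    (1, "Todd", "Wilson", 110000, 1006),
    (61, "Ryan", "Brown", 120000, 1003),
    (21, "Stephen", "Berry", 123617, 1002),
    (13, "Julie", "Sanchez", 210000, 1001),
    (26, "Allison", "Johnson", 128782, 1001),
    (12, "Joshua", "Johnson", 123082, 1004),
]


def high_earners(conn, threshold, limit):
    """Return employees earning more than `threshold`, highest salary first."""
    query = """
        SELECT id, first_name, last_name, salary
        FROM employee_salary
        WHERE salary > ?
        ORDER BY salary DESC, id
        LIMIT ?
    """
    return conn.execute(query, (threshold, limit)).fetchall()


# In-memory SQLite database standing in for the Postgres table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee_salary ("
    "id INTEGER, first_name TEXT, last_name TEXT, "
    "salary INTEGER, department_id INTEGER)"
)
conn.executemany("INSERT INTO employee_salary VALUES (?, ?, ?, ?, ?)", ROWS)

top = high_earners(conn, 100000, 3)
for row in top:
    print(row)
```

The `ORDER BY salary DESC, id` clause is the part that carries over to Postgres unchanged; the `?` placeholder style is specific to `sqlite3` (psycopg2, for example, uses `%s`).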


Provide distinct department id 

In [40]:
%%sql
SELECT DISTINCT department_id
FROM public.employee_salary

|   | department_id |
|---|---------------|
| 0 | 1005 |
| 1 | 1002 |
| 2 | 1004 |
| 3 | 1001 |
| 4 | 1006 |
| 5 | 1003 |
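
The Spark and Postgres outputs list the same six department ids in different orders, because `DISTINCT` on its own does not promise any ordering. If a stable order matters, adding `ORDER BY` is one way to get it. A minimal sketch, again using Python's built-in `sqlite3` as a stand-in for Postgres (an assumption for portability) and a few `(id, department_id)` pairs copied from the sample outputs:

```python
import sqlite3

# (id, department_id) pairs copied from the sample outputs in this notebook.
PAIRS = [
    (45, 1003),
    (25, 1005),
    (48, 1004),
    (34, 1003),
    (62, 1005),
    (1, 1006),
    (21, 1002),
    (13, 1001),
]

# In-memory SQLite database standing in for the Postgres table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee_salary (id INTEGER, department_id INTEGER)")
conn.executemany("INSERT INTO employee_salary VALUES (?, ?)", PAIRS)

# ORDER BY makes the result order explicit instead of engine-dependent.
cursor = conn.execute(
    "SELECT DISTINCT department_id FROM employee_salary ORDER BY department_id"
)
departments = [row[0] for row in cursor]
print(departments)
```

The same `ORDER BY department_id` clause works in the `%%sql` cell above, and the DataFrame equivalent in PySpark is to chain `.orderBy("department_id")` after `.distinct()`.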


Provide first and last name of employees 

In [42]:
%%sql
SELECT first_name,
    last_name
FROM public.employee_salary
LIMIT 5

|   | first_name | last_name |
|---|------------|-----------|
| 0 | Kevin | Duncan |
| 1 | Pamela | Matthews |
| 2 | Robert | Lynch |
| 3 | Justin | Dunn |
| 4 | Dale | Hayes |


Provide all the details of the employees whose last name is 'Johnson'

In [43]:
%%sql
SELECT id,
    first_name,
    last_name,
    salary,
    department_id
FROM public.employee_salary
WHERE last_name = 'Johnson'

|   | id | first_name | last_name | salary | department_id |
|---|----|------------|-----------|--------|---------------|
| 0 | 26 | Allison | Johnson | 128782 | 1001 |
| 1 | 12 | Joshua | Johnson | 123082 | 1004 |
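
When the last name comes from user input rather than a hard-coded literal, it is safer to bind it as a parameter than to splice it into the SQL string. A minimal sketch of that idea, using Python's built-in `sqlite3` as a stand-in for Postgres (an assumption for portability) and a few rows copied from the outputs above; the helper name `employees_by_last_name` is hypothetical.

```python
import sqlite3

# Sample rows copied from the outputs shown in this notebook (not the full file).
ROWS = [
    (26, "Allison", "Johnson", 128782, 1001),
    (12, "Joshua", "Johnson", 123082, 1004),
    (48, "Robert", "Lynch", 117960, 1004),
    (1, "Todd", "Wilson", 110000, 1006),
]


def employees_by_last_name(conn, last_name):
    """Return full rows for employees whose last name equals `last_name` exactly."""
    query = """
        SELECT id, first_name, last_name, salary, department_id
        FROM employee_salary
        WHERE last_name = ?
        ORDER BY id
    """
    # The value is bound by the driver, never concatenated into the SQL text.
    return conn.execute(query, (last_name,)).fetchall()


# In-memory SQLite database standing in for the Postgres table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee_salary ("
    "id INTEGER, first_name TEXT, last_name TEXT, "
    "salary INTEGER, department_id INTEGER)"
)
conn.executemany("INSERT INTO employee_salary VALUES (?, ?, ?, ?, ?)", ROWS)

matches = employees_by_last_name(conn, "Johnson")
for row in matches:
    print(row)
```

Because the value is bound, an input such as `x' OR '1'='1` is compared as a plain string and matches nothing.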
