## Question - Salaries

We have a table of employees and their salaries, but some of the records are old and contain outdated salary information. Find the current salary of each employee, assuming that salaries increase each year. Output each employee's id, first name, last name, department ID, and current salary. Order the list by employee ID in ascending order.

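1. We need the latest salary of each employee. Because salaries only increase, this is the maximum salary recorded for that employee.
2. We also need their id, first name, last name, and department ID alongside the latest salary.
3. The result must be ordered by id in ascending order.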
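
The steps above can be sketched in plain Python before turning to Spark or SQL. This is only an illustration and not part of the original notebook: the `records` sample copies the first five rows of the data, and the helper name `latest_salaries` is hypothetical.

```python
# Sample rows as (id, first_name, last_name, salary, department_id),
# copied from the first five rows of the employee data.
records = [
    (1, "Todd", "Wilson", 110000, 1006),
    (1, "Todd", "Wilson", 106119, 1006),
    (2, "Justin", "Simon", 128922, 1005),
    (2, "Justin", "Simon", 130000, 1005),
    (3, "Kelly", "Rosario", 42689, 1002),
]


def latest_salaries(rows):
    """Return one row per employee with the highest salary, ordered by id."""
    best = {}
    for emp_id, first, last, salary, dept in rows:
        key = (emp_id, first, last, dept)
        # Salaries only increase, so the maximum is the current salary.
        if key not in best or salary > best[key]:
            best[key] = salary
    # Keys are tuples starting with the id, so sorting orders by id first.
    return [key + (salary,) for key, salary in sorted(best.items())]


result = latest_salaries(records)
for row in result:
    print(row)
```

Grouping on all four descriptive columns mirrors the `GROUP BY` used in the solutions; it assumes an employee's name and department do not change between records.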

## PySpark

### Setup

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("challenge").getOrCreate()
sqlContext = spark  # a SparkSession already provides .sql(); its constructor expects a SparkContext, not a session
spark.sparkContext.setLogLevel("ERROR")

### Solution

In [2]:
employee_df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("data/employee.csv")
employee_df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- salary: integer (nullable = true)
 |-- department_id: integer (nullable = true)



In [3]:
employee_df.limit(5).show()

+---+----------+---------+------+-------------+
| id|first_name|last_name|salary|department_id|
+---+----------+---------+------+-------------+
|  1|      Todd|   Wilson|110000|         1006|
|  1|      Todd|   Wilson|106119|         1006|
|  2|    Justin|    Simon|128922|         1005|
|  2|    Justin|    Simon|130000|         1005|
|  3|     Kelly|  Rosario| 42689|         1002|
+---+----------+---------+------+-------------+



In [4]:
employee_df.createOrReplaceTempView("tmpEmployee")
sqlContext.sql("SELECT * FROM tmpEmployee LIMIT 5").show()

+---+----------+---------+------+-------------+
| id|first_name|last_name|salary|department_id|
+---+----------+---------+------+-------------+
|  1|      Todd|   Wilson|110000|         1006|
|  1|      Todd|   Wilson|106119|         1006|
|  2|    Justin|    Simon|128922|         1005|
|  2|    Justin|    Simon|130000|         1005|
|  3|     Kelly|  Rosario| 42689|         1002|
+---+----------+---------+------+-------------+



In [8]:
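from pyspark.sql import functions as F  # avoids shadowing Python's built-in max

# Salaries only increase, so the per-employee maximum is the current salary.
latest_df = employee_df.groupBy("id", "first_name", "last_name", "department_id").agg(F.max("salary").alias("latest_salary"))
latest_df.orderBy("id").show(n=5)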

+---+----------+---------+-------------+-------------+
| id|first_name|last_name|department_id|latest_salary|
+---+----------+---------+-------------+-------------+
|  1|      Todd|   Wilson|         1006|       110000|
|  2|    Justin|    Simon|         1005|       130000|
|  3|     Kelly|  Rosario|         1002|        42689|
|  4|  Patricia|   Powell|         1004|       170000|
|  5|    Sherry|   Golden|         1002|        44101|
+---+----------+---------+-------------+-------------+
only showing top 5 rows



In [10]:
sqlContext.sql("""
    SELECT id, first_name, last_name, MAX(salary) AS latest_salary, department_id
    FROM tmpEmployee
    GROUP BY id, first_name, last_name, department_id
    ORDER BY id
""").show(n=5)

+---+----------+---------+-------------+-------------+
| id|first_name|last_name|latest_salary|department_id|
+---+----------+---------+-------------+-------------+
|  1|      Todd|   Wilson|       110000|         1006|
|  2|    Justin|    Simon|       130000|         1005|
|  3|     Kelly|  Rosario|        42689|         1002|
|  4|  Patricia|   Powell|       170000|         1004|
|  5|    Sherry|   Golden|        44101|         1002|
+---+----------+---------+-------------+-------------+
only showing top 5 rows



## Postgres

### Setup

In [11]:
import os
import pandas as pd

In [12]:
nb_path = os.path.join(os.getcwd(), 'utils/connect-postgres.ipynb')
%run {nb_path}

In [13]:
employee_salary_df = pd.read_csv("data/employee.csv")
employee_salary_df.head()

   id  first_name  last_name  salary  department_id
0   1        Todd     Wilson  110000           1006
1   1        Todd     Wilson  106119           1006
2   2      Justin      Simon  128922           1005
3   2      Justin      Simon  130000           1005
4   3       Kelly    Rosario   42689           1002


In [14]:
employee_salary_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95 entries, 0 to 94
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             95 non-null     int64 
 1   first_name     95 non-null     object
 2   last_name      95 non-null     object
 3   salary         95 non-null     int64 
 4   department_id  95 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 3.8+ KB


In [15]:
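# if_exists='replace' lets this cell be re-run; the default ('fail') raises if the table already exists
employee_salary_df.to_sql('employee', con=conn_str, index=False, schema='public', if_exists='replace')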

In [16]:
%%sql
SELECT * FROM public.employee LIMIT 5

   id  first_name  last_name  salary  department_id
0   1        Todd     Wilson  110000           1006
1   1        Todd     Wilson  106119           1006
2   2      Justin      Simon  128922           1005
3   2      Justin      Simon  130000           1005
4   3       Kelly    Rosario   42689           1002


In [18]:
%%sql
SELECT COUNT(*) FROM public.employee

   count
0     95


### Solution

In [20]:
%%sql
SELECT id,
    first_name,
    last_name,
    MAX(salary) AS max_salary,
    department_id
FROM public.employee
GROUP BY id,
    first_name,
    last_name,
    department_id
ORDER BY id
LIMIT 10

   id  first_name  last_name  max_salary  department_id
0   1        Todd     Wilson      110000           1006
1   2      Justin      Simon      130000           1005
2   3       Kelly    Rosario       42689           1002
3   4    Patricia     Powell      170000           1004
4   5      Sherry     Golden       44101           1002
5   6     Natasha    Swanson       90000           1005
6   7       Diane     Gordon       74591           1002
7   8    Mercedes  Rodriguez       61048           1005
8   9     Christy   Mitchell      150000           1001
9  10        Sean   Crawford      190000           1006
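
To check the query logic without a running Postgres instance, the same `GROUP BY` can be tried against an in-memory SQLite database from the Python standard library. This is a rough sketch under stated assumptions: SQLite stands in for Postgres, the table holds only the five sample rows shown earlier rather than the full 95-row file, and the names `sample_rows`, `query`, and `rows` are illustrative.

```python
import sqlite3

# Five sample rows: (id, first_name, last_name, salary, department_id).
sample_rows = [
    (1, "Todd", "Wilson", 110000, 1006),
    (1, "Todd", "Wilson", 106119, 1006),
    (2, "Justin", "Simon", 128922, 1005),
    (2, "Justin", "Simon", 130000, 1005),
    (3, "Kelly", "Rosario", 42689, 1002),
]

# In-memory database used as a stand-in for the Postgres table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER, first_name TEXT, last_name TEXT, "
    "salary INTEGER, department_id INTEGER)"
)
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?, ?)", sample_rows)

# Same shape as the Postgres solution, minus the schema prefix and LIMIT.
query = """
SELECT id, first_name, last_name, MAX(salary) AS max_salary, department_id
FROM employee
GROUP BY id, first_name, last_name, department_id
ORDER BY id
"""
rows = conn.execute(query).fetchall()
conn.close()

for row in rows:
    print(row)
```

Each employee appears once with the larger of their recorded salaries, which matches the first three rows of the Postgres result.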
