# Second Highest Value

**QUESTION**

Get the second highest salary from the Employee table.

**EXAMPLE**

Table: Employee

    +----+--------+
    | Id | Salary |
    +----+--------+
    | 1  | 100    |
    | 2  | 200    |
    | 3  | 300    |
    +----+--------+

Expected result: 

    +---------------------+
    | SecondHighestSalary |
    +---------------------+
    | 200                 |
    +---------------------+
    
**TECHNIQUES**

- **sqlalchemy**: creates the database connection.
- pandas functions
  - **DataFrame**: creates a dataframe.
  - **to_sql**: inserts data into a database table.
  - **read_sql**: reads from a database.
  - **to_csv**: writes to a CSV file.
  - **read_csv**: reads from a CSV file.
  
**VARIATIONS**

- Get nth highest salary.


## Prepare the test data

- Install `sqlalchemy` and an appropriate driver (e.g. `mysqlclient` for MySQL).
- Define the connection URL. Here, it is read from the `TEST_URL` environment variable.

In [1]:
import pandas as pd
import numpy as np
import os
import sqlalchemy as db

CONN_URL = os.environ['TEST_URL']
engine = db.create_engine(CONN_URL)

In [2]:
# Populate the test data
def populate_data(engine, table, data, path):
    df_tmp = pd.DataFrame(data)
    # Save it to a table
    if engine:
        df_tmp.to_sql(table, con=engine, index=False, if_exists='replace')
    # Save it to a file
    if path:
        df_tmp.to_csv(path, index=False, sep="\t")

employee_data = {"Id": [1, 2, 3], "Salary": [100, 200, 300]}
employee_path = "/tmp/employee.tsv"
populate_data(engine, 'Employee', employee_data, employee_path)

## Pandas Solutions

### Read Data

In [3]:
# Read the data from the database table
df = pd.read_sql("SELECT * FROM Employee", engine)
df.head()

Unnamed: 0,Id,Salary
0,1,100
1,2,200
2,3,300


In [4]:
# Or, read it from a TSV file
df = pd.read_csv(employee_path, sep="\t")
df.head()

Unnamed: 0,Id,Salary
0,1,100
1,2,200
2,3,300


### Pandas Series & NumPy Array

- Extract the salary column as a pandas Series.
- The **unique** method returns the distinct values of the Series as a NumPy array.
- Sort the array in place with the built-in **sort()** method and take the second highest value from the tail.
- Finally, wrap the result in a DataFrame for display.

In [5]:
# Get the unique salaries as a NumPy array and sort it in place
salaries = df.Salary.unique()
salaries.sort()
print("Unique salaries =", salaries)

# Get the 2nd highest value, or None if there are fewer than two distinct salaries
second_highest = salaries[-2] if len(salaries) >= 2 else None

# Express it as a DataFrame
pd.DataFrame([second_highest], columns=['Second Highest Salary'])

Unique salaries = [100 200 300]


Unnamed: 0,Second Highest Salary
0,200
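
To see why the values are de-duplicated before sorting, the sketch below wraps the same unique-then-sort steps in a helper and runs it on made-up inputs. The name `second_highest_salary` and the sample Series are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd


def second_highest_salary(salary_series):
    """Return the second highest distinct value, or None if there is none.

    Hypothetical helper; it mirrors the unique + sort steps used above.
    """
    # unique() returns a NumPy array of distinct values; np.sort returns a sorted copy
    salaries = np.sort(salary_series.unique())
    return salaries[-2] if len(salaries) >= 2 else None


# Made-up input: the top salary is tied, so the second highest is 100, not 300
tied = pd.Series([100, 300, 300])
print(second_highest_salary(tied))

# Made-up input: only one distinct salary, so there is no second highest
single = pd.Series([500, 500])
print(second_highest_salary(single))
```

Without the de-duplication step, the tied input would report 300 as the second highest value.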


### The N-th Highest

In [6]:
for n in range(2, len(salaries) + 2):
    salary = salaries[-n] if len(salaries) >= n else None
    display(pd.DataFrame([salary], columns=['N-th Highest Salary ({})'.format(n)]))

Unnamed: 0,N-th Highest Salary (2)
0,200


Unnamed: 0,N-th Highest Salary (3)
0,100


Unnamed: 0,N-th Highest Salary (4)
0,
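
As an alternative sketch, the n-th highest lookup can stay entirely in pandas. This swaps the NumPy sort for `drop_duplicates` followed by `nlargest`; the helper name `nth_highest_salary` and the sample data are assumptions for illustration.

```python
import pandas as pd


def nth_highest_salary(salary_series, n):
    """Return the n-th highest distinct value, or None if fewer than n exist.

    Hypothetical helper, not part of the original notebook.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Drop missing values and duplicates, then keep the n largest (sorted descending)
    top = salary_series.dropna().drop_duplicates().nlargest(n)
    # The last of the n largest is the n-th highest; fewer than n means no answer
    return top.iloc[-1] if len(top) == n else None


# Made-up data with a tie at the top
sample = pd.Series([100, 200, 300, 300])
for n in range(1, 5):
    print(n, nth_highest_salary(sample, n))
```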


## SQL Solutions

### Use MAX Twice

In [7]:
SQL = """
SELECT MAX(Salary) AS SecondHighestSalary
    FROM Employee
    WHERE Salary < (SELECT MAX(Salary) FROM Employee)
"""
pd.read_sql(SQL, engine)

Unnamed: 0,SecondHighestSalary
0,200
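
The query above needs no special handling for small tables, because `MAX` over an empty set of rows yields `NULL`. The sketch below checks this with the same SQL text against an in-memory SQLite database from the standard library. SQLite is an assumption made here so the example is self-contained (the notebook connects through `TEST_URL`), and `second_highest` is a hypothetical helper.

```python
import sqlite3

SQL = """
SELECT MAX(Salary) AS SecondHighestSalary
    FROM Employee
    WHERE Salary < (SELECT MAX(Salary) FROM Employee)
"""


def second_highest(salaries):
    """Load the given salaries into a throwaway table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Employee (Id INTEGER, Salary INTEGER)")
    # enumerate() yields (Id, Salary) pairs starting from Id = 1
    conn.executemany("INSERT INTO Employee VALUES (?, ?)",
                     list(enumerate(salaries, start=1)))
    result = conn.execute(SQL).fetchone()[0]
    conn.close()
    return result


print(second_highest([100, 200, 300]))  # three distinct salaries
print(second_highest([300, 300, 100]))  # tie at the top
print(second_highest([100]))            # NULL comes back as None
print(second_highest([]))               # empty table, also None
```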


### MAX & LIMIT

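- The outer `SELECT` wrapper makes the query return a single `NULL` row, rather than no rows, when there is no second highest salary (for example, with an empty table). **IFNULL** with `NULL` as its fallback spells that result out explicitly.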


In [8]:
SQL = """
SELECT IFNULL ( (
    SELECT e1.Salary
        FROM Employee e1
        WHERE e1.Salary < (SELECT MAX(e2.Salary) FROM Employee e2)
        ORDER BY e1.Salary DESC
        LIMIT 1
    ), NULL)  AS 'SecondHighestSalary'
"""
pd.read_sql(SQL, engine)

Unnamed: 0,SecondHighestSalary
0,200
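
The effect of wrapping the `LIMIT` query in an outer `SELECT` can be seen by running both forms on a table with a single row. This is a minimal sketch that uses an in-memory SQLite database instead of the notebook's `TEST_URL` connection, which is an assumption made for self-containment; the names `INNER` and `WRAPPED` are hypothetical.

```python
import sqlite3

# The inner query on its own: returns no rows when there is no second highest salary
INNER = """
SELECT e1.Salary
    FROM Employee e1
    WHERE e1.Salary < (SELECT MAX(e2.Salary) FROM Employee e2)
    ORDER BY e1.Salary DESC
    LIMIT 1
"""
# The same query used as a scalar subquery: always returns exactly one row
WRAPPED = "SELECT ({}) AS SecondHighestSalary".format(INNER)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Id INTEGER, Salary INTEGER)")
conn.execute("INSERT INTO Employee VALUES (1, 100)")  # a single made-up row

bare_rows = conn.execute(INNER).fetchall()
wrapped_rows = conn.execute(WRAPPED).fetchall()
conn.close()

print(bare_rows)     # []
print(wrapped_rows)  # [(None,)]
```

The wrapped form is the one that produces a `NULL` row for the result table.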


### The N-th Highest

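Use **COUNT(DISTINCT ...)** to count the distinct salaries greater than each row's salary. A salary is the n-th highest when exactly n - 1 distinct salaries exceed it, so the same query handles any n.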

In [9]:
SQL_TEMPLATE = """
SELECT IFNULL ((
    SELECT e1.Salary
        FROM Employee e1
        WHERE (SELECT COUNT(DISTINCT e2.Salary)
                   FROM Employee e2
                   WHERE e2.Salary > e1.Salary) = {n} - 1
        LIMIT 1
    ), NULL)
    AS `NthHighestSalary ({n})`
"""
for n in range(2,5):
    SQL = SQL_TEMPLATE.format(n=n)
    display(pd.read_sql(SQL, engine))

Unnamed: 0,NthHighestSalary (2)
0,200


Unnamed: 0,NthHighestSalary (3)
0,100


Unnamed: 0,NthHighestSalary (4)
0,
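
The template above builds the SQL with string formatting. As a hedged variant, the same COUNT DISTINCT query can take the rank as a bound parameter instead. The sketch assumes an in-memory SQLite database and its `?` placeholder style (other drivers use different placeholders), and `nth_highest` is a hypothetical helper.

```python
import sqlite3

SQL = """
SELECT (
    SELECT e1.Salary
        FROM Employee e1
        WHERE (SELECT COUNT(DISTINCT e2.Salary)
                   FROM Employee e2
                   WHERE e2.Salary > e1.Salary) = ?
        LIMIT 1
    ) AS NthHighestSalary
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Id INTEGER, Salary INTEGER)")
conn.executemany("INSERT INTO Employee VALUES (?, ?)",
                 [(1, 100), (2, 200), (3, 300)])


def nth_highest(conn, n):
    """Return the n-th highest distinct salary, or None if it does not exist."""
    # Exactly n - 1 distinct salaries must be greater than the n-th highest one
    return conn.execute(SQL, (n - 1,)).fetchone()[0]


for n in range(2, 5):
    print(n, nth_highest(conn, n))
```

Binding the value keeps the query text constant, although a column alias that includes n, as in the template, cannot be bound this way.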
