# Transformations in Spark

## Setup
### Configure Spark Environment
Configure the environment variables. Make sure you provide the correct Spark installation path.

In [1]:
## Set up the Python-Spark environment: point Python at the PySpark and Py4J libraries shipped with Spark.
import os
import sys

os.environ["SPARK_HOME"] = "/usr/hdp/current/spark2-client"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.6-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
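The Py4J archive name above (`py4j-0.10.6-src.zip`) is tied to one Spark build, and other installations usually ship a different version. Here is a minimal sketch of looking the file up by pattern instead. `find_py4j_zip` is a hypothetical helper, not part of PySpark, and the demo uses a throwaway directory in place of a real `SPARK_HOME`.

```python
import glob
import os
import tempfile


def find_py4j_zip(pylib_dir):
    """Return the single py4j-*-src.zip found in pylib_dir (hypothetical helper)."""
    matches = sorted(glob.glob(os.path.join(pylib_dir, "py4j-*-src.zip")))
    if len(matches) != 1:
        # Fail loudly instead of guessing between zero or several candidates.
        raise IOError("expected one py4j-*-src.zip in %s, found %d"
                      % (pylib_dir, len(matches)))
    return matches[0]


# Demo: a temporary directory stands in for $SPARK_HOME/python/lib.
demo_pylib = tempfile.mkdtemp()
open(os.path.join(demo_pylib, "py4j-0.10.6-src.zip"), "w").close()
found_zip = find_py4j_zip(demo_pylib)
print(os.path.basename(found_zip))
```

In the setup cell, the hardcoded line could then become `sys.path.insert(0, find_py4j_zip(os.environ["PYLIB"]))`.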

### Starting the Spark Session
#### Create and Initialize Spark Driver
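Create a Spark application. The builder below does not set a master, so the master comes from the environment's Spark configuration. To run locally with as many threads as there are cores, add `.master("local[*]")` to the builder.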

In [2]:
## Create SparkContext, SparkSession
from os.path import expanduser, join, abspath

from pyspark.sql import SparkSession
from pyspark.sql import Row

# warehouse_location points to the default location for managed databases and tables
warehouse_location = 'hdfs:///apps/hive/warehouse/'

spark = SparkSession \
    .builder \
    .appName("Machine Learning Example using Spark ML") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

#### Verify Spark Driver -  Spark Session

In [3]:
## Verify Spark Session
spark

#### Verify Python version

In [4]:
print(sys.version)
print(sys.version_info)

2.7.5 (default, Nov 20 2015, 02:00:19) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)]
sys.version_info(major=2, minor=7, micro=5, releaselevel='final', serial=0)


### Existing Database Schema
#### Columns

| **employees**     | **departments**   | **dept_emp**      | **dept_manager**  | **salaries**      | **titles**        |
|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
| **emp_no**        | **dept_no**       | **seq_no**        | **seq_no**        | **seq_no**        | **seq_no**        |
| birth_date        | **dept_name**     | **emp_no**        | **dept_no**       | **emp_no**        | **emp_no**        |
| first_name        | last_modified     | **dept_no**       | **emp_no**        | salary            | **title**         |
| last_name         |                   | from_date         | from_date         | **from_date**     | **from_date**     |
| gender            |                   | to_date           | to_date           | to_date           | to_date           |
| hire_date         |                   | last_modified     | last_modified     | last_modified     | last_modified     |
| last_modified     |                   |                   |                   |                   |                   |

### HIVE Metastore
#### Verify Hive Metastore - Database and Table(s)

#### Verify previously created hive database - insofe_empdb_10064

In [5]:
# insofe_empdb_10064
spark.sql("USE insofe_empdb_10064")

DataFrame[]

#### Verify/list all the table(s) in the above database 

In [6]:
spark.sql("SHOW TABLES").show()

+------------------+------------+-----------+
|          database|   tableName|isTemporary|
+------------------+------------+-----------+
|insofe_empdb_10064| departments|      false|
|insofe_empdb_10064|    dept_emp|      false|
|insofe_empdb_10064|dept_manager|      false|
|insofe_empdb_10064|   employees|      false|
|insofe_empdb_10064|    salaries|      false|
|insofe_empdb_10064|      titles|      false|
+------------------+------------+-----------+



There are now 6 tables available in Hive, in the database **insofe_empdb_10064**. Access these tables and process the data by creating DataFrames.

## Transformations
#### 1. Create a DataFrame for each of the underlying tables.
#### 2. Remove unnecessary rows (retain only active rows from the tables where necessary) and columns.
#### 3. Cache DataFrames as necessary.

### Load and Filter DataFrames.

#### Departments Table
Create DataFrame for departments data from departments table in HIVE

In [7]:
deptDF = spark.sql("SELECT * FROM departments")

Verify the above DataFrame

In [8]:
deptDF.show()

+-------+------------------+-------------------+
|dept_no|         dept_name|      last_modified|
+-------+------------------+-------------------+
|   d001|         Marketing|2013-01-28 23:59:59|
|   d002|           Finance|2013-01-28 23:59:59|
|   d003|   Human Resources|2013-01-28 23:59:59|
|   d004|        Production|2013-01-28 23:59:59|
|   d005|       Development|2013-01-28 23:59:59|
|   d006|Quality Management|2013-01-28 23:59:59|
|   d007|             Sales|2013-01-28 23:59:59|
|   d008|          Research|2013-01-28 23:59:59|
|   d009|  Customer Service|2013-01-28 23:59:59|
|   d010|         Analytics|2019-01-17 12:47:16|
+-------+------------------+-------------------+



Verify the schema for the above DataFrame

In [9]:
deptDF.printSchema()

root
 |-- dept_no: string (nullable = true)
 |-- dept_name: string (nullable = true)
 |-- last_modified: timestamp (nullable = true)



#### Departments and Employees Table
Create a DataFrame for department assignments from the dept_emp table in Hive
- An employee might work in different departments during their tenure.
- At any time, an employee is active in only one department.
- So there can be multiple records for an employee, each with a different department.
- The department in which an employee is currently active is identified by the to_date value '9999-01-01'.

In [10]:
dept_empDF = spark.sql("SELECT * FROM dept_emp")

Verify the above DataFrame

In [11]:
dept_empDF.show()

+------+------+-------+----------+----------+-------------------+
|seq_no|emp_no|dept_no| from_date|   to_date|      last_modified|
+------+------+-------+----------+----------+-------------------+
|     1|     1|   d001|1986-01-01|9999-01-01|2013-01-28 23:59:59|
|    10|    10|   d002|1986-01-14|9999-01-01|2013-01-28 23:59:59|
|   100|    94|   d004|1994-05-26|1999-04-29|2013-01-28 23:59:59|
|  1000|   909|   d009|1988-09-12|9999-01-01|2013-01-28 23:59:59|
| 10000|  9059|   d008|1995-04-21|1996-02-20|2013-01-28 23:59:59|
|100000| 90419|   d004|1998-05-23|9999-01-01|2013-01-28 23:59:59|
|100001| 90420|   d008|1991-07-23|2005-12-15|2013-01-28 23:59:59|
|100002| 90421|   d005|1991-07-23|9999-01-01|2013-01-28 23:59:59|
|100003| 90422|   d002|1992-03-25|9999-01-01|2013-01-28 23:59:59|
|100004| 90422|   d003|1991-07-23|1992-03-25|2013-01-28 23:59:59|
|100005| 90423|   d004|1991-07-23|2005-10-14|2013-01-28 23:59:59|
|100006| 90424|   d007|2003-06-25|9999-01-01|2013-01-28 23:59:59|
|100007| 9

Verify the schema for the above DataFrame

In [12]:
dept_empDF.printSchema()

root
 |-- seq_no: integer (nullable = true)
 |-- emp_no: integer (nullable = true)
 |-- dept_no: string (nullable = true)
 |-- from_date: string (nullable = true)
 |-- to_date: string (nullable = true)
 |-- last_modified: timestamp (nullable = true)



Verify the record counts in the above DataFrame

In [13]:
dept_empDF.count()

341603

Keep only the active records from the above DataFrame. Active records are those whose to_date is '9999-01-01'.

In [14]:
# from pyspark.sql.functions import col
# active_dept_empDF = dept_empDF.filter(col("to_date") == '9999-01-01')
active_dept_empDF = dept_empDF[dept_empDF.to_date == '9999-01-01']
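The sentinel rule can be tried out without a cluster. Below is a plain-Python sketch (not Spark code) that applies the same `to_date == '9999-01-01'` test to four rows copied from the `dept_empDF.show()` output above. It also checks that no employee ends up with more than one active department. The names `rows`, `active_rows` and `multi_active` are illustrative only.

```python
from collections import Counter

ACTIVE_TO_DATE = "9999-01-01"  # sentinel that marks a current assignment

# Four rows taken from the dept_emp sample output shown earlier.
rows = [
    {"seq_no": 1, "emp_no": 1, "dept_no": "d001",
     "from_date": "1986-01-01", "to_date": "9999-01-01"},
    {"seq_no": 100, "emp_no": 94, "dept_no": "d004",
     "from_date": "1994-05-26", "to_date": "1999-04-29"},
    {"seq_no": 100003, "emp_no": 90422, "dept_no": "d002",
     "from_date": "1992-03-25", "to_date": "9999-01-01"},
    {"seq_no": 100004, "emp_no": 90422, "dept_no": "d003",
     "from_date": "1991-07-23", "to_date": "1992-03-25"},
]

# Same predicate as the DataFrame filter, applied row by row.
active_rows = [r for r in rows if r["to_date"] == ACTIVE_TO_DATE]

# Employees with more than one active row would violate the stated rule.
active_per_emp = Counter(r["emp_no"] for r in active_rows)
multi_active = [emp for emp, n in active_per_emp.items() if n > 1]

print([r["seq_no"] for r in active_rows])
print(multi_active)
```

Employee 90422 appears twice in the sample, but only the d002 row survives the filter.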

Verify the active employees count from the above DataFrame

In [15]:
active_dept_empDF.count()

250124

Verify the Data types from the above DataFrame

In [16]:
active_dept_empDF.dtypes

[('seq_no', 'int'),
 ('emp_no', 'int'),
 ('dept_no', 'string'),
 ('from_date', 'string'),
 ('to_date', 'string'),
 ('last_modified', 'timestamp')]
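Since from_date and to_date are strings, the filter on '9999-01-01' is a string comparison. The following plain-Python check needs no Spark. It shows that zero-padded `yyyy-MM-dd` strings sort in the same order as the dates they represent, and that the sentinel parses as a valid date. The sample values are taken from the output above, and `parse_iso` is a hypothetical helper name.

```python
from datetime import datetime

# Date strings in the same yyyy-MM-dd layout as the dept_emp columns.
date_strings = ["1994-05-26", "9999-01-01", "1986-01-14", "1999-04-29"]


def parse_iso(text):
    """Parse a yyyy-MM-dd string into a datetime.date."""
    return datetime.strptime(text, "%Y-%m-%d").date()


sorted_as_text = sorted(date_strings)                  # lexicographic order
sorted_as_dates = sorted(date_strings, key=parse_iso)  # chronological order
sentinel = parse_iso("9999-01-01")

print(sorted_as_text == sorted_as_dates)
print(sentinel.year)
```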

In [17]:
active_dept_empDF.count()

250124

Remove unnecessary columns. The **last_modified** column is not needed.

In [18]:
active_dept_empDF = active_dept_empDF.select('seq_no', 'emp_no', 'dept_no', 'from_date', 'to_date')

Verify above DataFrame

In [19]:
active_dept_empDF.show()

+------+------+-------+----------+----------+
|seq_no|emp_no|dept_no| from_date|   to_date|
+------+------+-------+----------+----------+
|     1|     1|   d001|1986-01-01|9999-01-01|
|    10|    10|   d002|1986-01-14|9999-01-01|
|  1000|   909|   d009|1988-09-12|9999-01-01|
|100000| 90419|   d004|1998-05-23|9999-01-01|
|100002| 90421|   d005|1991-07-23|9999-01-01|
|100003| 90422|   d002|1992-03-25|9999-01-01|
|100006| 90424|   d007|2003-06-25|9999-01-01|
|100007| 90425|   d007|1996-02-26|9999-01-01|
|100009| 90427|   d008|1991-07-23|9999-01-01|
| 10001|  9060|   d003|1991-09-30|9999-01-01|
|100010| 90428|   d007|2003-12-03|9999-01-01|
|100012| 90430|   d004|2000-06-21|9999-01-01|
|100014| 90431|   d004|1995-08-26|9999-01-01|
|100018| 90433|   d009|2005-11-15|9999-01-01|
|100019| 90434|   d005|1991-07-23|9999-01-01|
| 10002|  9061|   d001|1986-10-24|9999-01-01|
|100022| 90437|   d005|1991-07-23|9999-01-01|
|100023| 90438|   d004|2003-04-25|9999-01-01|
|100025| 90440|   d006|1998-10-18|

Cache the DataFrame in memory. Note that cache() is lazy: the data is stored the first time an action runs on the DataFrame.

In [20]:
active_dept_empDF.cache()

DataFrame[seq_no: int, emp_no: int, dept_no: string, from_date: string, to_date: string]

#### Departments and Managers Table
Create DataFrame for departments and managers data from dept_manager table in HIVE

- For each department, the manager's employee number is available in the dept_manager table.
- A department may have had multiple managers.
- So there can be multiple records for a department (with different employee numbers for different time periods).
- At any time there is only one active manager for each department, identified by the to_date value '9999-01-01'.

In [21]:
dept_managerDF = spark.sql("SELECT * FROM dept_manager")

Verify the departments and managers DataFrame

In [22]:
dept_managerDF.show()

+------+-------+------+----------+----------+-------------------+
|seq_no|dept_no|emp_no| from_date|   to_date|      last_modified|
+------+-------+------+----------+----------+-------------------+
|     1|   d001|     1|1986-01-01|1992-10-01|2013-01-28 23:59:59|
|    10|   d002|    10|1990-12-17|9999-01-01|2013-01-28 23:59:59|
|    11|   d003| 19827|1993-03-21|9999-01-01|2013-01-28 23:59:59|
|    12|   d004| 31345|1990-09-09|1994-08-02|2013-01-28 23:59:59|
|    13|   d001| 45502|1993-10-01|9999-01-01|2013-01-28 23:59:59|
|    14|   d006| 57739|1994-09-12|1997-06-28|2013-01-28 23:59:59|
|    15|   d005| 64439|1995-04-25|9999-01-01|2013-01-28 23:59:59|
|    16|   d007| 71341|1994-03-07|9999-01-01|2013-01-28 23:59:59|
|    17|   d008|107706|1996-04-08|9999-01-01|2013-01-28 23:59:59|
|    18|   d009|108801|1993-10-17|1997-09-08|2013-01-28 23:59:59|
|    19|   d004|129808|1998-08-02|2002-08-30|2013-01-28 23:59:59|
|     2|   d002|     2|1986-01-01|1990-12-17|2013-01-28 23:59:59|
|    20|  

Verify the schema for the departments and managers DataFrame

In [23]:
dept_managerDF.printSchema()

root
 |-- seq_no: integer (nullable = true)
 |-- dept_no: string (nullable = true)
 |-- emp_no: integer (nullable = true)
 |-- from_date: string (nullable = true)
 |-- to_date: string (nullable = true)
 |-- last_modified: timestamp (nullable = true)



Total record count in the departments and managers DataFrame

In [24]:
dept_managerDF.count()

25

Keep only the active records from the above DataFrame
<br>Although there are only 10 departments, there are 25 manager records.
<br>Remove the inactive records. Active records are those whose to_date is '9999-01-01'.

In [25]:
# from pyspark.sql.functions import col
# active_dept_managerDF = dept_managerDF.filter(col("to_date") == '9999-01-01')
active_dept_managerDF = dept_managerDF[dept_managerDF.to_date == '9999-01-01']

Verify the records count from the above DataFrame

In [26]:
active_dept_managerDF.count()

10

Verify the columns

In [27]:
active_dept_managerDF.dtypes

[('seq_no', 'int'),
 ('dept_no', 'string'),
 ('emp_no', 'int'),
 ('from_date', 'string'),
 ('to_date', 'string'),
 ('last_modified', 'timestamp')]

Remove unwanted columns and rename the remaining columns as necessary. **seq_no**, **last_modified** and **to_date** are not needed.

In [28]:
from pyspark.sql.functions import expr
active_dept_managerDF = active_dept_managerDF.select('dept_no', expr('emp_no AS mgr_emp_no'), expr('from_date AS mgr_from_date'))

Verify above DataFrame

In [29]:
active_dept_managerDF.show()

+-------+----------+-------------+
|dept_no|mgr_emp_no|mgr_from_date|
+-------+----------+-------------+
|   d002|        10|   1990-12-17|
|   d003|     19827|   1993-03-21|
|   d001|     45502|   1993-10-01|
|   d005|     64439|   1995-04-25|
|   d007|     71341|   1994-03-07|
|   d008|    107706|   1996-04-08|
|   d006|    149081|   2000-06-28|
|   d009|    151543|   2003-01-03|
|   d004|    215054|   2005-08-30|
|   d010|    300030|   2013-01-29|
+-------+----------+-------------+



#### Employees Table
Create DataFrame for employees data from employees table in HIVE

In [30]:
employeesDF = spark.sql("SELECT * FROM employees")

Verify employees DataFrame

In [31]:
employeesDF.show(4)

+------+----------+----------+------------+------+----------+-------------------+
|emp_no|birth_date|first_name|   last_name|gender| hire_date|      last_modified|
+------+----------+----------+------------+------+----------+-------------------+
|     1|1958-09-12| Margareta|  Markovitch|     M|1986-01-01|2013-01-28 23:59:59|
|     2|1961-10-28|      Ebru|       Alpin|     M|1986-01-01|2013-01-28 23:59:59|
|     3|1955-06-24|   Shirish|Ossenbruggen|     F|1986-01-01|2013-01-28 23:59:59|
|     4|1958-06-08| Krassimir|     Wegerle|     F|1986-01-01|2013-01-28 23:59:59|
+------+----------+----------+------------+------+----------+-------------------+
only showing top 4 rows



Verify schema of employees DataFrame

In [32]:
employeesDF.printSchema()

root
 |-- emp_no: integer (nullable = true)
 |-- birth_date: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- hire_date: string (nullable = true)
 |-- last_modified: timestamp (nullable = true)



Remove unwanted columns

In [33]:
employeesDF = employeesDF.drop('last_modified')

Verify above DataFrame

In [34]:
employeesDF.show(4)

+------+----------+----------+------------+------+----------+
|emp_no|birth_date|first_name|   last_name|gender| hire_date|
+------+----------+----------+------------+------+----------+
|     1|1958-09-12| Margareta|  Markovitch|     M|1986-01-01|
|     2|1961-10-28|      Ebru|       Alpin|     M|1986-01-01|
|     3|1955-06-24|   Shirish|Ossenbruggen|     F|1986-01-01|
|     4|1958-06-08| Krassimir|     Wegerle|     F|1986-01-01|
+------+----------+----------+------------+------+----------+
only showing top 4 rows



#### Salaries Table
Create DataFrame for salaries data from salaries table in HIVE

In [35]:
salariesDF = spark.sql("SELECT * FROM salaries")

Verify salaries DataFrame

In [36]:
salariesDF.show(4)

+------+------+------+----------+----------+-------------------+
|seq_no|emp_no|salary| from_date|   to_date|      last_modified|
+------+------+------+----------+----------+-------------------+
|     1|     1| 70166|1986-01-01|1987-01-01|2013-01-28 23:59:59|
|    10|     1| 91165|1994-12-30|1995-12-30|2013-01-28 23:59:59|
|   100|     6| 84203|1994-12-30|1995-12-30|2013-01-28 23:59:59|
|  1000|    73| 39000|1997-11-25|1998-11-25|2013-01-28 23:59:59|
+------+------+------+----------+----------+-------------------+
only showing top 4 rows



Verify schema of salaries DataFrame

In [37]:
salariesDF.printSchema()

root
 |-- seq_no: integer (nullable = true)
 |-- emp_no: integer (nullable = true)
 |-- salary: integer (nullable = true)
 |-- from_date: string (nullable = true)
 |-- to_date: string (nullable = true)
 |-- last_modified: timestamp (nullable = true)



Verify count of records in salaries DataFrame

In [38]:
salariesDF.count()

2854047

Filter only the active records from above DataFrame

In [39]:
active_salariesDF = salariesDF[salariesDF.to_date=='9999-01-01']

Verify the record count

In [40]:
active_salariesDF.count()

250124

In [41]:
active_salariesDF.dtypes

[('seq_no', 'int'),
 ('emp_no', 'int'),
 ('salary', 'int'),
 ('from_date', 'string'),
 ('to_date', 'string'),
 ('last_modified', 'timestamp')]

Remove unnecessary columns and rename the remaining columns as necessary

In [42]:
from pyspark.sql.functions import col, expr, column
active_salariesDF = active_salariesDF.select("emp_no", "salary", expr("from_date as sal_from_date"))

In [43]:
active_salariesDF.dtypes

[('emp_no', 'int'), ('salary', 'int'), ('sal_from_date', 'string')]

#### Titles Table
Create titles DataFrame for the titles table in HIVE

In [44]:
titlesDF = spark.sql("SELECT * FROM titles")

Verify titles DataFrame

In [45]:
titlesDF.show(4)

+------+------+----------------+----------+----------+-------------------+
|seq_no|emp_no|           title| from_date|   to_date|      last_modified|
+------+------+----------------+----------+----------+-------------------+
|     1|     1|         Manager|1986-01-01|1992-10-01|2013-01-28 23:59:59|
|    10|     5|Technique Leader|1993-04-25|9999-01-01|2013-01-28 23:59:59|
|   100|    60|           Staff|1997-11-02|9999-01-01|2013-01-28 23:59:59|
|  1000|   620|        Engineer|1996-09-15|2003-03-20|2013-01-28 23:59:59|
+------+------+----------------+----------+----------+-------------------+
only showing top 4 rows



Verify titles DataFrame schema

In [46]:
titlesDF.printSchema()

root
 |-- seq_no: integer (nullable = true)
 |-- emp_no: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- from_date: string (nullable = true)
 |-- to_date: string (nullable = true)
 |-- last_modified: timestamp (nullable = true)



Verify records count in the above DataFrame

In [47]:
titlesDF.count()

453308

Filter active records from the above DataFrame

In [48]:
active_titlesDF = titlesDF[titlesDF.to_date=='9999-01-01']

In [49]:
active_titlesDF.dtypes

[('seq_no', 'int'),
 ('emp_no', 'int'),
 ('title', 'string'),
 ('from_date', 'string'),
 ('to_date', 'string'),
 ('last_modified', 'timestamp')]

Remove and rename the columns as necessary

In [50]:
from pyspark.sql.functions import col, expr, column
active_titlesDF = active_titlesDF.select('emp_no', 'title', expr('from_date AS title_from_date'))

In [51]:
active_titlesDF.dtypes

[('emp_no', 'int'), ('title', 'string'), ('title_from_date', 'string')]


Join the departments DataFrame with the active department managers DataFrame

The result has each department and the employee number of its current manager


In [52]:
dept_curr_mgrDF = deptDF.join(active_dept_managerDF, 'dept_no', 'inner')
dept_curr_mgrDF.show(50)

+-------+------------------+-------------------+----------+-------------+
|dept_no|         dept_name|      last_modified|mgr_emp_no|mgr_from_date|
+-------+------------------+-------------------+----------+-------------+
|   d002|           Finance|2013-01-28 23:59:59|        10|   1990-12-17|
|   d003|   Human Resources|2013-01-28 23:59:59|     19827|   1993-03-21|
|   d001|         Marketing|2013-01-28 23:59:59|     45502|   1993-10-01|
|   d005|       Development|2013-01-28 23:59:59|     64439|   1995-04-25|
|   d007|             Sales|2013-01-28 23:59:59|     71341|   1994-03-07|
|   d008|          Research|2013-01-28 23:59:59|    107706|   1996-04-08|
|   d006|Quality Management|2013-01-28 23:59:59|    149081|   2000-06-28|
|   d009|  Customer Service|2013-01-28 23:59:59|    151543|   2003-01-03|
|   d004|        Production|2013-01-28 23:59:59|    215054|   2005-08-30|
|   d010|         Analytics|2019-01-17 12:47:16|    300030|   2013-01-29|
+-------+------------------+----------

Find each manager's details by joining the above DataFrame with the employees DataFrame, matching mgr_emp_no to emp_no

In [53]:
dept_curr_mgrDF.dtypes

[('dept_no', 'string'),
 ('dept_name', 'string'),
 ('last_modified', 'timestamp'),
 ('mgr_emp_no', 'int'),
 ('mgr_from_date', 'string')]

In [54]:
employeesDF.dtypes

[('emp_no', 'int'),
 ('birth_date', 'string'),
 ('first_name', 'string'),
 ('last_name', 'string'),
 ('gender', 'string'),
 ('hire_date', 'string')]

In [55]:
join_expr = dept_curr_mgrDF["mgr_emp_no"] == employeesDF["emp_no"]

In [56]:
dept_curr_mgr_detailsDF = dept_curr_mgrDF.join(employeesDF,join_expr,'inner')

In [57]:
dept_curr_mgr_detailsDF.dtypes

[('dept_no', 'string'),
 ('dept_name', 'string'),
 ('last_modified', 'timestamp'),
 ('mgr_emp_no', 'int'),
 ('mgr_from_date', 'string'),
 ('emp_no', 'int'),
 ('birth_date', 'string'),
 ('first_name', 'string'),
 ('last_name', 'string'),
 ('gender', 'string'),
 ('hire_date', 'string')]
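The dtypes above contain both mgr_emp_no and emp_no, because the join condition was a column expression. In the joins in this notebook that pass the key as a column name, the output shows the key only once, at the front. The plain-Python sketch below models that column ordering, based only on the outputs shown in this notebook. `joined_columns` is a hypothetical helper for illustration, not a Spark API.

```python
def joined_columns(left_cols, right_cols, on=None):
    """Model the output column order of an inner join as seen in this notebook.

    on=None models a join on a column expression (both sides kept in full).
    on="name" models a join on a shared column name (key kept once, first).
    """
    if on is None:
        return list(left_cols) + list(right_cols)
    rest_left = [c for c in left_cols if c != on]
    rest_right = [c for c in right_cols if c != on]
    return [on] + rest_left + rest_right


mgr_cols = ["dept_no", "dept_name", "last_modified", "mgr_emp_no", "mgr_from_date"]
emp_cols = ["emp_no", "birth_date", "first_name", "last_name", "gender", "hire_date"]
dept_emp_cols = ["seq_no", "emp_no", "dept_no", "from_date", "to_date"]

# Expression join: mgr_emp_no and emp_no both survive.
by_expression = joined_columns(mgr_cols, emp_cols)
# Name join: emp_no appears once and moves to the front.
by_name = joined_columns(dept_emp_cols, emp_cols, on="emp_no")

print(by_expression)
print(by_name)
```

This is why the expression join needs a later `drop('emp_no')`, while the name-based joins do not.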

In [58]:
dept_curr_mgr_detailsDF.show(4)

+-------+---------------+-------------------+----------+-------------+------+----------+----------+----------+------+----------+
|dept_no|      dept_name|      last_modified|mgr_emp_no|mgr_from_date|emp_no|birth_date|first_name| last_name|gender| hire_date|
+-------+---------------+-------------------+----------+-------------+------+----------+----------+----------+------+----------+
|   d002|        Finance|2013-01-28 23:59:59|        10|   1990-12-17|    10|1959-03-28|     Isamu|Legleitner|     F|1986-01-14|
|   d003|Human Resources|2013-01-28 23:59:59|     19827|   1993-03-21| 19827|1960-12-02|   Karsten|   Sigstam|     F|1986-08-04|
|   d001|      Marketing|2013-01-28 23:59:59|     45502|   1993-10-01| 45502|1967-06-21|  Vishwani|  Minakawa|     M|1988-04-12|
|   d005|    Development|2013-01-28 23:59:59|     64439|   1995-04-25| 64439|1970-04-25|      Leon|  DasSarma|     F|1989-10-21|
+-------+---------------+-------------------+----------+-------------+------+----------+---------

Rename columns as necessary

In [59]:
from pyspark.sql.functions import col

replacements = {'birth_date' : 'mgr_birth_date', 
                'first_name' : 'mgr_first_name',
                'last_name' : 'mgr_last_name',
                'gender' : 'mgr_gender',
                'hire_date' : 'mgr_hire_date'
               }

dept_curr_mgr_detailsDF = dept_curr_mgr_detailsDF.select([col(c).alias(replacements.get(c, c)) for c in dept_curr_mgr_detailsDF.columns])
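The rename relies on `replacements.get(c, c)`, which returns the new name when a column is listed in the dictionary and the original name otherwise. The plain-Python sketch below shows that lookup on its own, using the column names from the dtypes output above. No Spark session is needed, and the variable names `columns` and `new_names` are illustrative.

```python
replacements = {
    "birth_date": "mgr_birth_date",
    "first_name": "mgr_first_name",
    "last_name": "mgr_last_name",
    "gender": "mgr_gender",
    "hire_date": "mgr_hire_date",
}

# Column names of dept_curr_mgr_detailsDF before the rename.
columns = ["dept_no", "dept_name", "last_modified", "mgr_emp_no",
           "mgr_from_date", "emp_no", "birth_date", "first_name",
           "last_name", "gender", "hire_date"]

# dict.get(key, default): unmapped columns fall back to their own name.
new_names = [replacements.get(c, c) for c in columns]

for old, new in zip(columns, new_names):
    print("{0:>13} -> {1}".format(old, new))
```

Passing the column itself as the default lets one `select` handle both renamed and untouched columns.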

Verify above DataFrame

In [60]:
dept_curr_mgr_detailsDF.show(4)

+-------+---------------+-------------------+----------+-------------+------+--------------+--------------+-------------+----------+-------------+
|dept_no|      dept_name|      last_modified|mgr_emp_no|mgr_from_date|emp_no|mgr_birth_date|mgr_first_name|mgr_last_name|mgr_gender|mgr_hire_date|
+-------+---------------+-------------------+----------+-------------+------+--------------+--------------+-------------+----------+-------------+
|   d002|        Finance|2013-01-28 23:59:59|        10|   1990-12-17|    10|    1959-03-28|         Isamu|   Legleitner|         F|   1986-01-14|
|   d003|Human Resources|2013-01-28 23:59:59|     19827|   1993-03-21| 19827|    1960-12-02|       Karsten|      Sigstam|         F|   1986-08-04|
|   d001|      Marketing|2013-01-28 23:59:59|     45502|   1993-10-01| 45502|    1967-06-21|      Vishwani|     Minakawa|         M|   1988-04-12|
|   d005|    Development|2013-01-28 23:59:59|     64439|   1995-04-25| 64439|    1970-04-25|          Leon|     DasSar

Remove unwanted columns

In [61]:
dept_curr_mgr_detailsDF = dept_curr_mgr_detailsDF.drop('last_modified', 'emp_no')

In [62]:
dept_curr_mgr_detailsDF.dtypes

[('dept_no', 'string'),
 ('dept_name', 'string'),
 ('mgr_emp_no', 'int'),
 ('mgr_from_date', 'string'),
 ('mgr_birth_date', 'string'),
 ('mgr_first_name', 'string'),
 ('mgr_last_name', 'string'),
 ('mgr_gender', 'string'),
 ('mgr_hire_date', 'string')]

In [63]:
dept_curr_mgr_detailsDF.show(4)

+-------+---------------+----------+-------------+--------------+--------------+-------------+----------+-------------+
|dept_no|      dept_name|mgr_emp_no|mgr_from_date|mgr_birth_date|mgr_first_name|mgr_last_name|mgr_gender|mgr_hire_date|
+-------+---------------+----------+-------------+--------------+--------------+-------------+----------+-------------+
|   d002|        Finance|        10|   1990-12-17|    1959-03-28|         Isamu|   Legleitner|         F|   1986-01-14|
|   d003|Human Resources|     19827|   1993-03-21|    1960-12-02|       Karsten|      Sigstam|         F|   1986-08-04|
|   d001|      Marketing|     45502|   1993-10-01|    1967-06-21|      Vishwani|     Minakawa|         M|   1988-04-12|
|   d005|    Development|     64439|   1995-04-25|    1970-04-25|          Leon|     DasSarma|         F|   1989-10-21|
+-------+---------------+----------+-------------+--------------+--------------+-------------+----------+-------------+
only showing top 4 rows



Join the employees DataFrame with the active department assignments DataFrame
<br>Join employeesDF and active_dept_empDF on emp_no
<br>The resulting DataFrame has each employee and their current department

In [64]:
employeesDF.dtypes

[('emp_no', 'int'),
 ('birth_date', 'string'),
 ('first_name', 'string'),
 ('last_name', 'string'),
 ('gender', 'string'),
 ('hire_date', 'string')]

In [65]:
active_dept_empDF.dtypes

[('seq_no', 'int'),
 ('emp_no', 'int'),
 ('dept_no', 'string'),
 ('from_date', 'string'),
 ('to_date', 'string')]

In [66]:
emp_deptDF = active_dept_empDF.join(employeesDF,'emp_no','inner')

Verify the above result DataFrame

In [67]:
emp_deptDF.show(4)

+------+------+-------+----------+----------+----------+----------+---------+------+----------+
|emp_no|seq_no|dept_no| from_date|   to_date|birth_date|first_name|last_name|gender| hire_date|
+------+------+-------+----------+----------+----------+----------+---------+------+----------+
|   148|   156|   d005|1986-02-03|9999-01-01|1960-03-11|    Feipei| Nollmann|     M|1986-02-03|
|   463|   507|   d008|1986-02-06|9999-01-01|1955-04-15|Dharmaraja| Sadowsky|     M|1986-02-06|
|   496|   543|   d006|2001-09-02|9999-01-01|1964-03-29|      Mari|    Rotem|     M|1986-02-06|
|   833|   914|   d005|1994-03-25|9999-01-01|1961-09-14|      Huan|  Preusig|     M|1986-02-09|
+------+------+-------+----------+----------+----------+----------+---------+------+----------+
only showing top 4 rows



Rename the columns

In [68]:
from pyspark.sql.functions import col

replacements = {
    'from_date' : 'dept_from_date',
    'birth_date' : 'emp_birth_date',
    'first_name' : 'emp_first_name',
    'last_name' : 'emp_last_name',
    'gender' : 'emp_gender',
    'hire_date' : 'emp_hire_date'
}

emp_deptDF = emp_deptDF.select([col(c).alias(replacements.get(c, c)) for c in emp_deptDF.columns])

Verify DataFrame

In [69]:
emp_deptDF.show(4)

+------+------+-------+--------------+----------+--------------+--------------+-------------+----------+-------------+
|emp_no|seq_no|dept_no|dept_from_date|   to_date|emp_birth_date|emp_first_name|emp_last_name|emp_gender|emp_hire_date|
+------+------+-------+--------------+----------+--------------+--------------+-------------+----------+-------------+
|   148|   156|   d005|    1986-02-03|9999-01-01|    1960-03-11|        Feipei|     Nollmann|         M|   1986-02-03|
|   463|   507|   d008|    1986-02-06|9999-01-01|    1955-04-15|    Dharmaraja|     Sadowsky|         M|   1986-02-06|
|   496|   543|   d006|    2001-09-02|9999-01-01|    1964-03-29|          Mari|        Rotem|         M|   1986-02-06|
|   833|   914|   d005|    1994-03-25|9999-01-01|    1961-09-14|          Huan|      Preusig|         M|   1986-02-09|
+------+------+-------+--------------+----------+--------------+--------------+-------------+----------+-------------+
only showing top 4 rows



Verify active records count

In [70]:
emp_deptDF.count()

250124

Create a DataFrame with employees and their respective managers' details<br>
<br>DataFrame **emp_deptDF** contains all the active employees along with their department
<br>DataFrame **dept_curr_mgr_detailsDF** contains all the departments and their managers' details
<br>Join these two DataFrames on **dept_no** to get a DataFrame of employees along with their managers' details.

In [71]:
emp_deptDF.dtypes

[('emp_no', 'int'),
 ('seq_no', 'int'),
 ('dept_no', 'string'),
 ('dept_from_date', 'string'),
 ('to_date', 'string'),
 ('emp_birth_date', 'string'),
 ('emp_first_name', 'string'),
 ('emp_last_name', 'string'),
 ('emp_gender', 'string'),
 ('emp_hire_date', 'string')]

In [72]:
emp_deptDF.count()

250124

In [73]:
dept_curr_mgr_detailsDF.dtypes

[('dept_no', 'string'),
 ('dept_name', 'string'),
 ('mgr_emp_no', 'int'),
 ('mgr_from_date', 'string'),
 ('mgr_birth_date', 'string'),
 ('mgr_first_name', 'string'),
 ('mgr_last_name', 'string'),
 ('mgr_gender', 'string'),
 ('mgr_hire_date', 'string')]

In [74]:
dept_curr_mgr_detailsDF.count()

10

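Join by broadcasting the smaller table. dept_curr_mgr_detailsDF has only 10 rows, so Spark can send a copy of it to every executor and join without shuffling the much larger emp_deptDF.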

In [75]:
from pyspark.sql.functions import broadcast
active_emp_dept_mgrDF = emp_deptDF.join(broadcast(dept_curr_mgr_detailsDF), 'dept_no', 'inner')
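The idea behind a broadcast join can be sketched in plain Python: index the small table by key once, then stream the large table past that index. This is a conceptual model only, not Spark's implementation. `broadcast_hash_join` is a hypothetical name, and the sample values are copied from outputs shown earlier in this notebook.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join two lists of dicts by building a lookup on the small side."""
    # Build phase: index the small table by key. In Spark, this small
    # table is what gets copied to every executor.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Probe phase: each large-side row finds its matches locally.
    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            merged = dict(row)
            merged.update(match)
            joined.append(merged)
    return joined


employees = [
    {"emp_no": 148, "dept_no": "d005"},
    {"emp_no": 463, "dept_no": "d008"},
    {"emp_no": 833, "dept_no": "d005"},
]
departments = [
    {"dept_no": "d005", "dept_name": "Development", "mgr_emp_no": 64439},
    {"dept_no": "d008", "dept_name": "Research", "mgr_emp_no": 107706},
]

joined = broadcast_hash_join(employees, departments, "dept_no")
for row in joined:
    print(row["emp_no"], row["dept_name"], row["mgr_emp_no"])
```

Because each department key matches exactly one row on the small side, the output has one row per employee.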

Verify the DataFrame

In [76]:
active_emp_dept_mgrDF.show(4)

+-------+------+------+--------------+----------+--------------+--------------+-------------+----------+-------------+------------------+----------+-------------+--------------+--------------+-------------+----------+-------------+
|dept_no|emp_no|seq_no|dept_from_date|   to_date|emp_birth_date|emp_first_name|emp_last_name|emp_gender|emp_hire_date|         dept_name|mgr_emp_no|mgr_from_date|mgr_birth_date|mgr_first_name|mgr_last_name|mgr_gender|mgr_hire_date|
+-------+------+------+--------------+----------+--------------+--------------+-------------+----------+-------------+------------------+----------+-------------+--------------+--------------+-------------+----------+-------------+
|   d005|   148|   156|    1986-02-03|9999-01-01|    1960-03-11|        Feipei|     Nollmann|         M|   1986-02-03|       Development|     64439|   1995-04-25|    1970-04-25|          Leon|     DasSarma|         F|   1989-10-21|
|   d008|   463|   507|    1986-02-06|9999-01-01|    1955-04-15|    Dhar

Verify the counts

In [77]:
active_emp_dept_mgrDF.count()

250124

Cache the DataFrame in memory

In [78]:
active_emp_dept_mgrDF.cache()

DataFrame[dept_no: string, emp_no: int, seq_no: int, dept_from_date: string, to_date: string, emp_birth_date: string, emp_first_name: string, emp_last_name: string, emp_gender: string, emp_hire_date: string, dept_name: string, mgr_emp_no: int, mgr_from_date: string, mgr_birth_date: string, mgr_first_name: string, mgr_last_name: string, mgr_gender: string, mgr_hire_date: string]

Join the salaries and titles DataFrames<br>
<br>**active_salariesDF** contains the current salary details of active employees
<br>**active_titlesDF** contains the current title/designation details of active employees
<br>Join these two DataFrames on emp_no
<br>The result is a DataFrame of all active employees along with their salary and title details

In [79]:
active_salariesDF.dtypes

[('emp_no', 'int'), ('salary', 'int'), ('sal_from_date', 'string')]

In [80]:
active_titlesDF.dtypes

[('emp_no', 'int'), ('title', 'string'), ('title_from_date', 'string')]

In [81]:
active_salariesDF.cache()
active_titlesDF.cache()

DataFrame[emp_no: int, title: string, title_from_date: string]

In [82]:
emp_sal_titlesDF = active_salariesDF.join(active_titlesDF, 'emp_no', 'inner')

Verify the DataFrame

In [83]:
emp_sal_titlesDF.show(4)

+------+------+-------------+---------------+---------------+
|emp_no|salary|sal_from_date|          title|title_from_date|
+------+------+-------------+---------------+---------------+
|   148|121640|   2003-01-30|Senior Engineer|     1993-02-04|
|   463| 63130|   2003-02-02|   Senior Staff|     1986-02-06|
|   496| 50281|   2002-08-16|       Engineer|     2000-08-17|
|   833| 52747|   2003-03-23|Senior Engineer|     1994-03-25|
+------+------+-------------+---------------+---------------+
only showing top 4 rows



Verify record count

In [84]:
emp_sal_titlesDF.count()

250124
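The joined count equals the 250124 active salary rows counted earlier. That is what one would expect if every active employee has exactly one active salary row and one active title row. The plain-Python sketch below shows how key multiplicity determines inner-join row counts, and how a duplicate key on either side would inflate the result. `inner_join_row_count` is a hypothetical helper, and the key lists are small made-up samples using emp_no values from the output above.

```python
from collections import Counter


def inner_join_row_count(left_keys, right_keys):
    """Rows produced by an inner equi-join: sum over keys of left * right counts."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Counter returns 0 for keys missing on the right, so unmatched keys add nothing.
    return sum(n * right_counts[k] for k, n in left_counts.items())


salary_keys = [148, 463, 496, 833]           # one active salary per employee
title_keys = [148, 463, 496, 833]            # one active title per employee
title_keys_dup = [148, 148, 463, 496, 833]   # hypothetical duplicate for 148

same_count = inner_join_row_count(salary_keys, title_keys)
fanned_count = inner_join_row_count(salary_keys, title_keys_dup)

print(same_count)
print(fanned_count)
```

If a joined count comes out higher than its inputs, duplicate keys on one side are a likely cause.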

In [85]:
emp_sal_titlesDF.cache()

DataFrame[emp_no: int, salary: int, sal_from_date: string, title: string, title_from_date: string]

Final Join<br>
<br>By now there are 2 DataFrames
<br>**active_emp_dept_mgrDF** contains all active employees along with their managers' details.
<br>**emp_sal_titlesDF** contains the current salary and title details for all active employees.
<br>Join these two DataFrames to get a DataFrame with all the details for all active employees

In [86]:
active_emp_dept_mgrDF.count()

250124

In [87]:
active_emp_dept_mgrDF.dtypes

[('dept_no', 'string'),
 ('emp_no', 'int'),
 ('seq_no', 'int'),
 ('dept_from_date', 'string'),
 ('to_date', 'string'),
 ('emp_birth_date', 'string'),
 ('emp_first_name', 'string'),
 ('emp_last_name', 'string'),
 ('emp_gender', 'string'),
 ('emp_hire_date', 'string'),
 ('dept_name', 'string'),
 ('mgr_emp_no', 'int'),
 ('mgr_from_date', 'string'),
 ('mgr_birth_date', 'string'),
 ('mgr_first_name', 'string'),
 ('mgr_last_name', 'string'),
 ('mgr_gender', 'string'),
 ('mgr_hire_date', 'string')]

In [88]:
emp_sal_titlesDF.count()

250124

In [89]:
emp_sal_titlesDF.dtypes

[('emp_no', 'int'),
 ('salary', 'int'),
 ('sal_from_date', 'string'),
 ('title', 'string'),
 ('title_from_date', 'string')]

In [90]:
emp_sal_titlesDF.cache()
active_emp_dept_mgrDF.cache()

DataFrame[dept_no: string, emp_no: int, seq_no: int, dept_from_date: string, to_date: string, emp_birth_date: string, emp_first_name: string, emp_last_name: string, emp_gender: string, emp_hire_date: string, dept_name: string, mgr_emp_no: int, mgr_from_date: string, mgr_birth_date: string, mgr_first_name: string, mgr_last_name: string, mgr_gender: string, mgr_hire_date: string]

In [91]:
active_emp_detailsDF = active_emp_dept_mgrDF.join(emp_sal_titlesDF, 'emp_no', 'inner')

In [92]:
active_emp_detailsDF.cache()

DataFrame[emp_no: int, dept_no: string, seq_no: int, dept_from_date: string, to_date: string, emp_birth_date: string, emp_first_name: string, emp_last_name: string, emp_gender: string, emp_hire_date: string, dept_name: string, mgr_emp_no: int, mgr_from_date: string, mgr_birth_date: string, mgr_first_name: string, mgr_last_name: string, mgr_gender: string, mgr_hire_date: string, salary: int, sal_from_date: string, title: string, title_from_date: string]

Verify DataFrame

In [93]:
active_emp_detailsDF.show(4)

+------+-------+------+--------------+----------+--------------+--------------+-------------+----------+-------------+------------------+----------+-------------+--------------+--------------+-------------+----------+-------------+------+-------------+---------------+---------------+
|emp_no|dept_no|seq_no|dept_from_date|   to_date|emp_birth_date|emp_first_name|emp_last_name|emp_gender|emp_hire_date|         dept_name|mgr_emp_no|mgr_from_date|mgr_birth_date|mgr_first_name|mgr_last_name|mgr_gender|mgr_hire_date|salary|sal_from_date|          title|title_from_date|
+------+-------+------+--------------+----------+--------------+--------------+-------------+----------+-------------+------------------+----------+-------------+--------------+--------------+-------------+----------+-------------+------+-------------+---------------+---------------+
|   148|   d005|   156|    1986-02-03|9999-01-01|    1960-03-11|        Feipei|     Nollmann|         M|   1986-02-03|       Development|     644

Derive additional columns, each expressed in years:<br>
- emp_age = current_date - emp_birth_date
- emp_tenure = current_date - emp_hire_date
- mgr_age = current_date - mgr_birth_date
- mgr_tenure = current_date - mgr_hire_date
- salary_since = current_date - sal_from_date
- role_since = current_date - title_from_date
- emp_dept_tenure = current_date - dept_from_date
- mgr_dept_tenure = current_date - mgr_from_date
<br>
<br>Create a temporary view so the derivations can be written as SQL queries

In [94]:
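# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is its replacement
active_emp_detailsDF.createOrReplaceTempView("active_emp_details_sqlTBL")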

In [95]:
active_emp_detailsDF.dtypes

[('emp_no', 'int'),
 ('dept_no', 'string'),
 ('seq_no', 'int'),
 ('dept_from_date', 'string'),
 ('to_date', 'string'),
 ('emp_birth_date', 'string'),
 ('emp_first_name', 'string'),
 ('emp_last_name', 'string'),
 ('emp_gender', 'string'),
 ('emp_hire_date', 'string'),
 ('dept_name', 'string'),
 ('mgr_emp_no', 'int'),
 ('mgr_from_date', 'string'),
 ('mgr_birth_date', 'string'),
 ('mgr_first_name', 'string'),
 ('mgr_last_name', 'string'),
 ('mgr_gender', 'string'),
 ('mgr_hire_date', 'string'),
 ('salary', 'int'),
 ('sal_from_date', 'string'),
 ('title', 'string'),
 ('title_from_date', 'string')]

Reorder the columns in the SELECT and derive the additional columns listed above

In [96]:
active_employees_data = spark.sql("""
SELECT emp_no, emp_first_name, emp_last_name, emp_gender, emp_birth_date, emp_hire_date,
       round(datediff(current_date,to_date(emp_birth_date))/365) as emp_age,
       round(datediff(current_date,to_date(emp_hire_date))/365) as emp_tenure,
       salary, sal_from_date, 
       round(datediff(current_date,to_date(sal_from_date))/365) as salary_since,
       title, title_from_date,
       round(datediff(current_date,to_date(title_from_date))/365) as role_since,
       dept_no, dept_name, dept_from_date,
       round(datediff(current_date,to_date(dept_from_date))/365) as emp_dept_tenure,
       mgr_emp_no, mgr_first_name, mgr_last_name, mgr_gender, mgr_birth_date, mgr_hire_date, mgr_from_date,
       round(datediff(current_date,to_date(mgr_birth_date))/365) as mgr_age,
       round(datediff(current_date,to_date(mgr_hire_date))/365) as mgr_tenure,
       round(datediff(current_date,to_date(mgr_from_date))/365) as mgr_dept_tenure
FROM active_emp_details_sqlTBL""")
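
The eight derived columns in the query all follow one pattern, so the fragments could be generated rather than typed by hand. The sketch below is one possible way to do that in plain Python; `DERIVED_COLUMNS`, `years_since_expr` and `derived_sql` are hypothetical names introduced here, not part of the original notebook.

```python
# (alias, source date column) pairs, as used in the query above.
DERIVED_COLUMNS = [
    ("emp_age", "emp_birth_date"),
    ("emp_tenure", "emp_hire_date"),
    ("salary_since", "sal_from_date"),
    ("role_since", "title_from_date"),
    ("emp_dept_tenure", "dept_from_date"),
    ("mgr_age", "mgr_birth_date"),
    ("mgr_tenure", "mgr_hire_date"),
    ("mgr_dept_tenure", "mgr_from_date"),
]


def years_since_expr(date_col, alias):
    """Return the SQL fragment this notebook uses for an elapsed-years column."""
    return "round(datediff(current_date,to_date({0}))/365) as {1}".format(date_col, alias)


# One fragment per derived column, joined so it can be dropped into a SELECT list.
derived_sql = ",\n       ".join(
    years_since_expr(col, alias) for alias, col in DERIVED_COLUMNS
)
```

The resulting string could be interpolated into the SELECT list passed to `spark.sql`. Doing so would group the derived columns together instead of interleaving them with their source columns as the query above does.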

Verify the DataFrame

In [97]:
active_employees_data.show(4)

+------+--------------+-------------+----------+--------------+-------------+-------+----------+------+-------------+------------+---------------+---------------+----------+-------+------------------+--------------+---------------+----------+--------------+-------------+----------+--------------+-------------+-------------+-------+----------+---------------+
|emp_no|emp_first_name|emp_last_name|emp_gender|emp_birth_date|emp_hire_date|emp_age|emp_tenure|salary|sal_from_date|salary_since|          title|title_from_date|role_since|dept_no|         dept_name|dept_from_date|emp_dept_tenure|mgr_emp_no|mgr_first_name|mgr_last_name|mgr_gender|mgr_birth_date|mgr_hire_date|mgr_from_date|mgr_age|mgr_tenure|mgr_dept_tenure|
+------+--------------+-------------+----------+--------------+-------------+-------+----------+------+-------------+------------+---------------+---------------+----------+-------+------------------+--------------+---------------+----------+--------------+-------------+-------
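
The age and tenure columns divide a day count by 365 and round to the nearest whole number, so they can differ by one from the number of completed years. The sketch below reproduces that arithmetic in plain Python for a single birth date taken from the output above. The `as_of` date is a hypothetical stand-in for `current_date`, fixed so the result is reproducible, and both function names are illustrative.

```python
from datetime import date


def _parse_iso(iso_date):
    """Turn 'YYYY-MM-DD' into a datetime.date."""
    year, month, day = (int(part) for part in iso_date.split("-"))
    return date(year, month, day)


def years_rounded(start_iso, as_of):
    """Mimic round(datediff(as_of, start)/365): nearest number of 365-day years."""
    days = (as_of - _parse_iso(start_iso)).days
    return float(round(days / 365.0))


def years_completed(start_iso, as_of):
    """Whole years completed by as_of (the usual definition of age)."""
    start = _parse_iso(start_iso)
    not_yet_reached = (as_of.month, as_of.day) < (start.month, start.day)
    return as_of.year - start.year - int(not_yet_reached)


as_of = date(2019, 1, 20)  # hypothetical run date (assumption)
rounded = years_rounded("1960-03-11", as_of)      # 21499 days / 365 is about 58.9
completed = years_completed("1960-03-11", as_of)  # birthday not yet reached in 2019
```

Here `rounded` is 59.0 while `completed` is 58. Whether the rounded value is acceptable depends on how the age and tenure columns are used downstream.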

Write the DataFrame to persistent storage (HDFS)

In [98]:
active_employees_data.repartition(1).write.option("header", "false").csv("/user/rameshm/Batch49/employeesdb/results/active_employees_data")

In [99]:
active_employees_data.cache()

DataFrame[emp_no: int, emp_first_name: string, emp_last_name: string, emp_gender: string, emp_birth_date: string, emp_hire_date: string, emp_age: double, emp_tenure: double, salary: int, sal_from_date: string, salary_since: double, title: string, title_from_date: string, role_since: double, dept_no: string, dept_name: string, dept_from_date: string, emp_dept_tenure: double, mgr_emp_no: int, mgr_first_name: string, mgr_last_name: string, mgr_gender: string, mgr_birth_date: string, mgr_hire_date: string, mgr_from_date: string, mgr_age: double, mgr_tenure: double, mgr_dept_tenure: double]

In [100]:
active_employees_data.count()

250124

#### Aggregated Data
<br>Create the aggregations:
- Based on department
- Based on department and gender

Aggregate based on department

In [101]:
from pyspark.sql import functions as F

dept_aggrDF = active_employees_data.groupBy('dept_no').agg(
    F.min('salary').alias('Min_Salary'),
    F.max('salary').alias('Max_Salary'),
    F.mean('salary').alias('Mean_Salary'),
    F.count('salary').alias('Total_Employees'),
    F.stddev('salary').alias('StdDev_Salary'),
    F.sum('salary').alias('Total_salary'),
    F.min('emp_age').alias('Min_Age'),
    F.max('emp_age').alias('Max_Age'),
    F.mean('emp_age').alias('Mean_Age'),
    F.min('emp_tenure').alias('Min_Tenure'),
    F.max('emp_tenure').alias('Max_Tenure'),
    F.mean('emp_tenure').alias('Mean_Tenure'),
    F.mean('salary_since').alias('Mean_Salary_Since'),
    F.mean('role_since').alias('Mean_Role_Since')
)
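
To make the meaning of these statistics concrete, the sketch below computes a subset of them on a few toy rows in plain Python (Python 3 standard library, which is an assumption since this notebook runs on 2.7). It is not a replacement for the Spark aggregation; the rows and the function name `aggregate_salaries` are hypothetical. Two details it mirrors: `F.count('salary')` counts non-null salary values, and `F.stddev` is the sample standard deviation.

```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate_salaries(rows):
    """Group rows by dept_no and compute a subset of the statistics used above."""
    by_dept = defaultdict(list)
    for row in rows:
        if row["salary"] is not None:  # count() and the other aggregates skip nulls
            by_dept[row["dept_no"]].append(row["salary"])

    result = {}
    for dept_no, salaries in by_dept.items():
        result[dept_no] = {
            "Min_Salary": min(salaries),
            "Max_Salary": max(salaries),
            "Mean_Salary": mean(salaries),
            "Total_Employees": len(salaries),
            # Sample standard deviation (n - 1 denominator); undefined for one value.
            "StdDev_Salary": stdev(salaries) if len(salaries) > 1 else None,
            "Total_salary": sum(salaries),
        }
    return result


rows = [  # hypothetical toy rows
    {"dept_no": "d005", "salary": 50000},
    {"dept_no": "d005", "salary": 70000},
    {"dept_no": "d009", "salary": 60000},
]
dept_stats = aggregate_salaries(rows)
```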

In [102]:
dept_aggrDF.cache()

DataFrame[dept_no: string, Min_Salary: int, Max_Salary: int, Mean_Salary: double, Total_Employees: bigint, StdDev_Salary: double, Total_salary: bigint, Min_Age: double, Max_Age: double, Mean_Age: double, Min_Tenure: double, Max_Tenure: double, Mean_Tenure: double, Mean_Salary_Since: double, Mean_Role_Since: double]

In [103]:
dept_aggrDF.show()

+-------+----------+----------+------------------+---------------+------------------+------------+-------+-------+------------------+----------+----------+------------------+------------------+------------------+
|dept_no|Min_Salary|Max_Salary|       Mean_Salary|Total_Employees|     StdDev_Salary|Total_salary|Min_Age|Max_Age|          Mean_Age|Min_Tenure|Max_Tenure|       Mean_Tenure| Mean_Salary_Since|   Mean_Role_Since|
+-------+----------+----------+------------------+---------------+------------------+------------+-------+-------+------------------+----------+----------+------------------+------------------+------------------+
|   d005|     12582|    142434|61388.935342615165|          62344|16393.484361748524|  3827231785|   24.0|   67.0|47.371711792634414|       2.0|      33.0|22.235692287950727|10.399797895547286|16.192416271012448|
|   d009|     12505|    142950| 61567.09564047363|          18580|19054.432158113068|  1143916637|   24.0|   67.0| 46.94913885898816|       2.0|    

Aggregation based on Department and Gender

In [104]:
dept_gender_aggrDF = active_employees_data.groupBy('dept_no', 'emp_gender').agg(
    F.min('salary').alias('Min_Salary'),
    F.max('salary').alias('Max_Salary'),
    F.mean('salary').alias('Mean_Salary'),
    F.count('salary').alias('Total_Employees'),
    F.stddev('salary').alias('StdDev_Salary'),
    F.sum('salary').alias('Total_salary'),
    F.min('emp_age').alias('Min_Age'),
    F.max('emp_age').alias('Max_Age'),
    F.mean('emp_age').alias('Mean_Age'),
    F.min('emp_tenure').alias('Min_Tenure'),
    F.max('emp_tenure').alias('Max_Tenure'),
    F.mean('emp_tenure').alias('Mean_Tenure'),
    F.mean('salary_since').alias('Mean_Salary_Since'),
    F.mean('role_since').alias('Mean_Role_Since')
)

In [105]:
dept_gender_aggrDF.show()

+-------+----------+----------+----------+------------------+---------------+------------------+------------+-------+-------+------------------+----------+----------+------------------+------------------+------------------+
|dept_no|emp_gender|Min_Salary|Max_Salary|       Mean_Salary|Total_Employees|     StdDev_Salary|Total_salary|Min_Age|Max_Age|          Mean_Age|Min_Tenure|Max_Tenure|       Mean_Tenure| Mean_Salary_Since|   Mean_Role_Since|
+-------+----------+----------+----------+------------------+---------------+------------------+------------+-------+-------+------------------+----------+----------+------------------+------------------+------------------+
|   d006|         M|     12596|    137308| 59807.84364820847|           9210|17371.741914549362|   550830240|   24.0|   67.0|46.915526601520085|       2.0|      33.0|21.419978284473398|10.110749185667752|15.670901194353963|
|   d006|         F|     12776|    137294| 60621.83104034627|           6469|18632.929352115545|   39216

Write both aggregate DataFrames to HDFS

In [106]:
dept_aggrDF.repartition(1).write.option("header", "false").csv("/user/rameshm/Batch49/employeesdb/results/aggr_dept/")
dept_gender_aggrDF.repartition(1).write.option("header", "false").csv("/user/rameshm/Batch49/employeesdb/results/aggr_dept_gender/")
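
Because these files are written with `header` set to `false`, the column names are not stored with the data, and any consumer has to supply them. The sketch below shows one way to do that with Python's `csv` module. The column list is copied from the dept_aggrDF schema and the sample row from its `show()` output; `read_headerless_csv` is a hypothetical helper, and reading from an in-memory string stands in for reading the actual part file.

```python
import csv
import io

# Column order as reported by dept_aggrDF's schema.
DEPT_AGGR_COLUMNS = [
    "dept_no", "Min_Salary", "Max_Salary", "Mean_Salary", "Total_Employees",
    "StdDev_Salary", "Total_salary", "Min_Age", "Max_Age", "Mean_Age",
    "Min_Tenure", "Max_Tenure", "Mean_Tenure", "Mean_Salary_Since", "Mean_Role_Since",
]


def read_headerless_csv(text_stream, columns):
    """Parse headerless CSV text into dicts keyed by the supplied column names."""
    return list(csv.DictReader(text_stream, fieldnames=columns))


# One row copied from the dept_aggrDF.show() output.
sample = io.StringIO(
    "d005,12582,142434,61388.935342615165,62344,16393.484361748524,"
    "3827231785,24.0,67.0,47.371711792634414,2.0,33.0,"
    "22.235692287950727,10.399797895547286,16.192416271012448\n"
)
records = read_headerless_csv(sample, DEPT_AGGR_COLUMNS)
```

All values come back as strings, so numeric columns still need converting. Writing with `header` set to `true` would avoid the need for a separate column list.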

In [108]:
active_employees_data.printSchema()

root
 |-- emp_no: integer (nullable = true)
 |-- emp_first_name: string (nullable = true)
 |-- emp_last_name: string (nullable = true)
 |-- emp_gender: string (nullable = true)
 |-- emp_birth_date: string (nullable = true)
 |-- emp_hire_date: string (nullable = true)
 |-- emp_age: double (nullable = true)
 |-- emp_tenure: double (nullable = true)
 |-- salary: integer (nullable = true)
 |-- sal_from_date: string (nullable = true)
 |-- salary_since: double (nullable = true)
 |-- title: string (nullable = true)
 |-- title_from_date: string (nullable = true)
 |-- role_since: double (nullable = true)
 |-- dept_no: string (nullable = true)
 |-- dept_name: string (nullable = true)
 |-- dept_from_date: string (nullable = true)
 |-- emp_dept_tenure: double (nullable = true)
 |-- mgr_emp_no: integer (nullable = true)
 |-- mgr_first_name: string (nullable = true)
 |-- mgr_last_name: string (nullable = true)
 |-- mgr_gender: string (nullable = true)
 |-- mgr_birth_date: string (nullable = true)
 |

Store the DataFrames as tables in the database

In [109]:
active_employees_data.write.mode("overwrite").saveAsTable("active_employees_data")

In [110]:
dept_aggrDF.write.mode("overwrite").saveAsTable("dept_aggr")

In [111]:
dept_gender_aggrDF.write.mode("overwrite").saveAsTable("dept_gender_aggr")

In [86]:
# active_employees_data.write.format('jdbc').options(
#       url='jdbc:mysql://172.16.0.241:3306/insofe_results_10064',
#       driver='com.mysql.jdbc.Driver',
#       dbtable='active_emp_details',
#       user='insofeadmin',
#       password='insofe_password').mode('append').save()

In [113]:
spark.sql("SHOW TABLES").show(truncate=False)

+------------------+-------------------------+-----------+
|database          |tableName                |isTemporary|
+------------------+-------------------------+-----------+
|insofe_empdb_10064|active_employees_data    |false      |
|insofe_empdb_10064|departments              |false      |
|insofe_empdb_10064|dept_aggr                |false      |
|insofe_empdb_10064|dept_emp                 |false      |
|insofe_empdb_10064|dept_gender_aggr         |false      |
|insofe_empdb_10064|dept_manager             |false      |
|insofe_empdb_10064|employees                |false      |
|insofe_empdb_10064|salaries                 |false      |
|insofe_empdb_10064|titles                   |false      |
|                  |active_emp_details_sqltbl|true       |
+------------------+-------------------------+-----------+



In [87]:
spark.sql("DESCRIBE dept_aggr").show(truncate=False)

+-----------------+---------+-------+
|col_name         |data_type|comment|
+-----------------+---------+-------+
|dept_no          |string   |null   |
|Min_Salary       |int      |null   |
|Max_Salary       |int      |null   |
|Mean_Salary      |double   |null   |
|Total_Employees  |bigint   |null   |
|StdDev_Salary    |double   |null   |
|Total_salary     |bigint   |null   |
|Min_Age          |double   |null   |
|Max_Age          |double   |null   |
|Mean_Age         |double   |null   |
|Min_Tenure       |double   |null   |
|Max_Tenure       |double   |null   |
|Mean_Tenure      |double   |null   |
|Mean_Salary_Since|double   |null   |
|Mean_Role_Since  |double   |null   |
+-----------------+---------+-------+



In [89]:
spark.sql("SHOW CREATE TABLE dept_aggr").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt                                                                                                                                                                                                                                                                                                                                                                                                     |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [92]:
!hdfs dfs -ls /apps/hive/warehouse/

Found 785 items
drwxrwxrwx   - 1355B30     hdfs          0 2017-11-11 12:08 /apps/hive/warehouse/.hive-staging_hive_2017-11-11_12-08-32_725_406234144602136858-1
drwxrwxrwx   - 1305B40     hdfs          0 2018-07-15 12:21 /apps/hive/warehouse/1305b40.db
drwxrwxrwx   - 1803B40     hdfs          0 2018-07-15 12:56 /apps/hive/warehouse/1803b40_hive_data.db
drwxrwxrwx   - 1806B40     hdfs          0 2018-07-15 12:47 /apps/hive/warehouse/1806_testing.db
drwxrwxrwx   - 1886B39     hdfs          0 2018-07-07 14:54 /apps/hive/warehouse/1886_batch39.db
drwxrwxrwx   - 1895B40     hdfs          0 2018-07-29 15:10 /apps/hive/warehouse/1895_cute.db
drwxrwxrwx   - 1896B40     hdfs          0 2018-07-15 12:04 /apps/hive/warehouse/1896b40.db
drwxrwxrwx   - 1915B40     hdfs          0 2018-07-15 12:48 /apps/hive/warehouse/1915_testing.db
drwxrwxrwx   - 1915B40     hdfs          0 2018-07-29 15:03 /apps/hive/warehouse/1915b40_cute.db
drwxrwxrwx   - 1923B40     hdfs          0 2018-07-15 12:26 /

drwxrwxrwx   - 1867B39     hdfs          0 2018-07-07 15:44 /apps/hive/warehouse/advanced_shashank2.db
drwxrwxrwx   - 1878B39     hdfs          0 2018-07-07 17:04 /apps/hive/warehouse/advanced_shijith.db
drwxrwxrwx   - 1882B39     hdfs          0 2018-07-07 15:44 /apps/hive/warehouse/advanced_sumeet.db
drwxrwxrwx   - 1887B39     hdfs          0 2018-07-07 15:02 /apps/hive/warehouse/advanced_uma.db
drwxrwxrwx   - 1887B39     hdfs          0 2018-07-07 16:47 /apps/hive/warehouse/advanced_uma1.db
drwxrwxrwx   - 1890B39     hdfs          0 2018-07-07 16:43 /apps/hive/warehouse/advanced_vikas.db
drwxrwxrwx   - 1724B39     hdfs          0 2018-07-07 16:44 /apps/hive/warehouse/advanced_vikass.db
drwxrwxrwx   - 1879B39     hdfs          0 2018-07-07 16:22 /apps/hive/warehouse/advancedsri.db
drwxrwxrwx   - 1834B39     hdfs          0 2018-07-07 16:05 /apps/hive/warehouse/ajar.db
drwxrwxrwx   - 1836B39     hdfs          0 2018-07-07 17:00 /apps/hive/warehouse/amy.db
drwxrwxrwx   - 1900

drwxrwxrwx   - 1985B41     hdfs          0 2018-07-26 10:49 /apps/hive/warehouse/insofe_empdb_1985.db
drwxrwxrwx   - 2152B49     hdfs          0 2019-01-20 10:25 /apps/hive/warehouse/insofe_empdb_2152b49.db
drwxrwxrwx   - 2184B45     hdfs          0 2019-01-05 18:13 /apps/hive/warehouse/insofe_empdb_2184.db
drwxrwxrwx   - 2184B45     hdfs          0 2019-01-06 09:55 /apps/hive/warehouse/insofe_empdb_2184b45.db
drwxrwxrwx   - 2187B45     hdfs          0 2019-01-06 11:45 /apps/hive/warehouse/insofe_empdb_2187.db
drwxrwxrwx   - 2187B45     hdfs          0 2019-01-06 14:48 /apps/hive/warehouse/insofe_empdb_2187b45.db
drwxrwxrwx   - 2210B46     hdfs          0 2018-11-24 08:05 /apps/hive/warehouse/insofe_empdb_2210.db
drwxrwxrwx   - 2245B49     hdfs          0 2019-01-19 17:44 /apps/hive/warehouse/insofe_empdb_2245.db
drwxrwxrwx   - 2308B48     hdfs          0 2019-01-05 17:53 /apps/hive/warehouse/insofe_empdb_2308.db
drwxrwxrwx   - 2308B48     hdfs          0 2019-01-05 17:54 /app

In [93]:
!hdfs dfs -ls /apps/hive/warehouse/insofe_empdb_10064.db/

Found 3 items
drwxr-xr-x   - rameshm hdfs          0 2019-01-20 10:31 /apps/hive/warehouse/insofe_empdb_10064.db/active_employees_data
drwxr-xr-x   - rameshm hdfs          0 2019-01-20 10:31 /apps/hive/warehouse/insofe_empdb_10064.db/dept_aggr
drwxr-xr-x   - rameshm hdfs          0 2019-01-20 10:31 /apps/hive/warehouse/insofe_empdb_10064.db/dept_gender_aggr
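
The listing suggests that each table saved with `saveAsTable` lands in a directory named after the table, inside a `<database>.db` directory under the warehouse location. The helper below sketches that layout as observed here; `managed_table_path` is a hypothetical name, and the pattern is an assumption drawn from this listing rather than a guarantee for every configuration.

```python
import posixpath


def managed_table_path(warehouse_dir, database, table):
    """Build <warehouse_dir>/<database>.db/<table>, the layout seen in the listing."""
    return posixpath.join(warehouse_dir, database + ".db", table)


path = managed_table_path("/apps/hive/warehouse", "insofe_empdb_10064", "dept_aggr")
```

A path built this way can be passed to `hdfs dfs -ls` to inspect the files behind a single table.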
