# Data Incubator Data Science
> This is Sungryong Hong. 

>I have my own Spark (2.3.2) / Hadoop (2.8.3) cluster with 48 logical cores and 150 GB of memory. This notebook demonstrates how I solved my data-science problem.

## 1. Import Basic Packages

In [1]:
# Basic Libraries 
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial import cKDTree
import gc

# plot settings
plt.rc('font', family='serif') 
plt.rc('font', serif='Times New Roman') 
plt.rcParams.update({'font.size': 16})
plt.rcParams['mathtext.fontset'] = 'stix'

In [2]:
# Basic PySpark Libraries

# Old Style : SparkContext
#from pyspark import SparkContext

# New Style : Spark Session
# Shell-Mode: the Spark Session is already available as `spark`
from pyspark.sql import SQLContext  # explicit import, so the cell does not rely on shell preloads

sc = spark.sparkContext
sqlsc = SQLContext(sc)
sc.setCheckpointDir("hdfs://master:54310/tmp/spark/checkpoints")

import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark import Row
from pyspark.sql.window import Window as W

In [3]:
# Enable Arrow to speed up Spark-to-pandas conversions such as toPandas()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

## 2. Read the *job-listing* data

### 2.1 Read the file to a spark dataframe

In [4]:
rawdf_schema = T.StructType([\
                          T.StructField('dataset_id', T.IntegerType(), True),\
                          T.StructField('listing_id', T.StringType(), True),\
                          T.StructField('domain', T.StringType(), True),\
                          T.StructField('as_of_date', T.TimestampType(), True),\
                          T.StructField('title', T.StringType(), True),\
                          T.StructField('url', T.StringType(), True),\
                          T.StructField('brand', T.StringType(), True),\
                          T.StructField('category', T.StringType(), True),\
                          T.StructField('locality', T.StringType(), True),\
                          T.StructField('region', T.StringType(), True),\
                          T.StructField('country', T.StringType(), True),\
                          T.StructField('number_of_openings', T.FloatType(), True),\
                          T.StructField('date_added', T.TimestampType(), True),\
                          T.StructField('date_updated', T.TimestampType(), True),\
                          T.StructField('posted_date', T.TimestampType(), True),\
                          T.StructField('location_string', T.StringType(), True),\
                          T.StructField('description', T.StringType(), True),\
                          T.StructField('entity_id', T.LongType(), True),\
                          T.StructField('city_lat', T.StringType(), True),\
                          T.StructField('city_lng', T.StringType(), True),\
                          T.StructField('cusip', T.StringType(), True),\
                          T.StructField('isin', T.StringType(), True)
                         ])

In [5]:
rawdf = sqlsc.read.csv("hdfs://master:54310/data/spark/tdi/temp_datalab_records_job_listings4.csv",\
                        header=True, schema = rawdf_schema)

In [6]:
total_csv_size_GB = 25.16 + 28.97 + 12.94 + 18.86 + 17.13 + 13.73
print total_csv_size_GB

116.79


### 2.2 Browse the raw data

In [7]:
rawdf.dropna(how='any',subset=['number_of_openings','posted_date'])\
        .select('title','category','number_of_openings','posted_date',
             'region','location_string')\
        .show(10,truncate=True)

+--------------------+--------------------+------------------+-------------------+------+--------------------+
|               title|            category|number_of_openings|        posted_date|region|     location_string|
+--------------------+--------------------+------------------+-------------------+------+--------------------+
|Quick Lube Techni...|Quick Lube Techni...|               1.0|2017-07-19 00:00:00|    GA|      US-GA-Lithonia|
|Service Porter / ...|Customer Service/...|               1.0|2017-09-22 00:00:00|    TX|    US-TX-Fort Worth|
|Service Valet / M...|Customer Service/...|               3.0|2017-08-21 00:00:00|    TX|        US-TX-Frisco|
|Director, Technic...|          Accounting|               1.0|2017-07-31 00:00:00|  null|US-CA-San Francis...|
|   Financial Analyst|  Finance & Treasury|               1.0|2017-08-30 00:00:00|    CA| US-CA-San Francisco|
|Finance Administr...|  Finance & Treasury|               1.0|2017-08-30 00:00:00|    CA| US-CA-San Francisco|

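The `dropna(how='any', subset=[...])` call above keeps a row only when both `number_of_openings` and `posted_date` are present. pandas offers `dropna` with the same `how` and `subset` parameters, so the behaviour can be sketched on a small frame without a cluster. The rows below are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy rows (hypothetical, not from the job-listing data).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "number_of_openings": [1.0, None, 3.0, None],
    "posted_date": pd.to_datetime(["2017-07-19", "2017-09-22", None, None]),
})

# how='any' drops a row if ANY of the listed columns is null.
kept = toy.dropna(how="any", subset=["number_of_openings", "posted_date"])

# how='all' would drop a row only if ALL of the listed columns are null.
kept_all = toy.dropna(how="all", subset=["number_of_openings", "posted_date"])

print(kept["title"].tolist())
print(kept_all["title"].tolist())
```

With `how='any'`, only the fully populated row survives, which is the stricter filter used in this notebook.
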
#### Let's find out what job categories exist in the `category` column

In [8]:
%%time
rawdf.groupby('category').count().show()

+--------------------+-----+
|            category|count|
+--------------------+-----+
|        Intern/Co-Op|   91|
|Environmental/Hea...|  501|
|Data Center Opera...| 5832|
|Business Operatio...|   40|
|Credit & Portfoli...|  102|
|Surgery Hand Surgery|   31|
|              SAFETY|  189|
|    Employee Success|   63|
|Global Sustainabi...|   82|
|Physician  - Psyc...|  286|
|           Physician|   12|
| Outbound Engagement|   26|
|    Recursos Humanos|    2|
|Operations (Gener...|    6|
|Engineering - Pro...|  277|
|       :Ops Projects|  561|
|  Repair and Service|  122|
|Product Developme...|   83|
|    Streaming Client|  901|
|Orthopaedic Sport...|  301|
+--------------------+-----+
only showing top 20 rows

CPU times: user 5.58 ms, sys: 2.87 ms, total: 8.45 ms
Wall time: 20.4 s


In [9]:
%%time
pdcategory = rawdf.groupby('category').count().toPandas()

CPU times: user 28.7 ms, sys: 9.51 ms, total: 38.3 ms
Wall time: 21.5 s


In [10]:
len(pdcategory.index)

2840

In [11]:
pdcategory = pdcategory.sort_values(by='count',ascending=False)

In [12]:
pdcategory.head(10)

| | category | count |
|---|---|---|
| 606 | | 27364011 |
| 2649 | Stores | 652998 |
| 183 | Sales | 515872 |
| 2533 | Store: Sales and Support Associate | 417703 |
| 1122 | Store Hourly | 310282 |
| 2598 | Restaurant Team Members | 237945 |
| 2044 | Support | 217924 |
| 2469 | Front of House Opportunities | 200226 |
| 310 | Engineering | 155760 |
| 42 | Kitchen Opportunities | 152681 |

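The `head(10)` output shows that the null category (the row with no category name) is far larger than any named one. A quick arithmetic sketch, using only the ten counts printed above, gives a sense of the imbalance. The share is relative to these ten rows only, not to all 2840 categories.

```python
# Counts copied from the `pdcategory.head(10)` output; the first entry is the null category.
top10_counts = [27364011, 652998, 515872, 417703, 310282,
                237945, 217924, 200226, 155760, 152681]

null_count = top10_counts[0]
# float() keeps the division correct under Python 2 as well.
null_share = null_count / float(sum(top10_counts))

print("null share among the ten largest categories: {:.1%}".format(null_share))
```

Roughly nine in ten of these rows have no category, which is why the null rows are inspected separately next.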

In [13]:
# Save the category histogram as a CSV file so it can be printed and inspected offline
pdcategory.to_csv('histogram-category.csv', encoding='utf-8', index=False)

#### Let's check what kinds of jobs have `category == null`

In [14]:
rawdf.filter(rawdf.category.isNull()).select('title','category','location_string').show(100,truncate=True)

+--------------------+--------+--------------------+
|               title|category|     location_string|
+--------------------+--------+--------------------+
|Neonatal Nurse Pr...|    null|        US-TX-Conroe|
|Sales Manager - I...|    null|AMER_North Amer-U...|
|            280-6921|    null|3812 Cook Blvd, C...|
|Banking Center Ma...|    null|Mineral Wells, Mi...|
|            258-6464|    null|141 Knobbs Creek ...|
|          Packager I|    null|       United States|
|Warehouse Supervisor|    null|       United States|
|Production Mainte...|    null|       United States|
|  Director Marketing|    null|       United States|
|Production Mainte...|    null|       United States|
|Commodities Accou...|    null|       United States|
|Production Superv...|    null|       United States|
|  Sanitation Manager|    null|       United States|
|        Sanitation I|    null|       United States|
|Commodities Accou...|    null|       United States|
|Commodities Accou...|    null|       United States|

> At least in this sample, the `null`-category jobs do not appear to be related to **Information Technology**.

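The remark above rests on eyeballing the first rows. One hedged way to make it checkable is to count how many null-category titles contain IT-related keywords. In the sketch below, the keyword list is an assumption, and the titles are a handful copied from the (truncated) output above, so this only illustrates the technique.

```python
import re
import pandas as pd

# A few titles as they appear (truncated) in the output above.
null_titles = pd.Series([
    "Neonatal Nurse Pr...", "Sales Manager - I...", "Banking Center Ma...",
    "Packager I", "Warehouse Supervisor", "Director Marketing",
    "Sanitation Manager",
])

# Hypothetical keyword list; adjust it to whatever "IT job" should mean.
it_keywords = ["software", "developer", "data", "network", "systems", "programmer"]
pattern = "|".join(re.escape(k) for k in it_keywords)

# Case-insensitive regex match against each title.
is_it = null_titles.str.contains(pattern, case=False, regex=True)
print("IT-like titles: {} of {}".format(int(is_it.sum()), len(null_titles)))
```

On the full Spark DataFrame, a comparable filter could be written with `F.lower(F.col('title')).rlike(pattern)`, again assuming this keyword list is a reasonable proxy.
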
### 2.3 Extract only useful pieces of information from the raw data

#### Let's check the size and schema after converting to a pandas DataFrame

In [15]:
alljobsdf = \
rawdf.dropna(how='any',subset=['number_of_openings','posted_date'])\
        .select('title','category','number_of_openings','posted_date','region','location_string')

In [16]:
alljobsdf.cache()

DataFrame[title: string, category: string, number_of_openings: float, posted_date: timestamp, region: string, location_string: string]

In [17]:
%%time
alljobsdf.show(5,truncate=True)

+--------------------+--------------------+------------------+-------------------+------+--------------------+
|               title|            category|number_of_openings|        posted_date|region|     location_string|
+--------------------+--------------------+------------------+-------------------+------+--------------------+
|Quick Lube Techni...|Quick Lube Techni...|               1.0|2017-07-19 00:00:00|    GA|      US-GA-Lithonia|
|Service Porter / ...|Customer Service/...|               1.0|2017-09-22 00:00:00|    TX|    US-TX-Fort Worth|
|Service Valet / M...|Customer Service/...|               3.0|2017-08-21 00:00:00|    TX|        US-TX-Frisco|
|Director, Technic...|          Accounting|               1.0|2017-07-31 00:00:00|  null|US-CA-San Francis...|
|   Financial Analyst|  Finance & Treasury|               1.0|2017-08-30 00:00:00|    CA| US-CA-San Francisco|
+--------------------+--------------------+------------------+-------------------+------+--------------------+
only showing top 5 rows

In [18]:
%%time
alljobsdf.count()

CPU times: user 4.91 ms, sys: 2.73 ms, total: 7.63 ms
Wall time: 24.8 s


11242

In [19]:
jobpart4 = alljobsdf.toPandas()

In [20]:
jobpart4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11242 entries, 0 to 11241
Data columns (total 6 columns):
title                 11242 non-null object
category              8327 non-null object
number_of_openings    11242 non-null float32
posted_date           11242 non-null datetime64[ns]
region                8443 non-null object
location_string       8568 non-null object
dtypes: datetime64[ns](1), float32(1), object(4)
memory usage: 483.1+ KB

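The `+` in `memory usage: 483.1+ KB` means pandas reports a lower bound: for `object` columns it counts only the pointers, not the Python strings they refer to. A fuller estimate comes from `memory_usage(deep=True)`. The sketch below uses an invented toy frame with a similar dtype mix.

```python
import pandas as pd

# Toy frame mixing an object column and a float32 column (values are illustrative).
toy = pd.DataFrame({
    "title": pd.Series(["Financial Analyst", "Service Porter / McDavid Ford"] * 1000,
                       dtype=object),
    "number_of_openings": pd.Series([1.0, 3.0] * 1000, dtype="float32"),
})

shallow_bytes = toy.memory_usage(deep=False).sum()  # pointers only for object columns
deep_bytes = toy.memory_usage(deep=True).sum()      # also counts the Python strings

print("shallow: {:.1f} KB".format(shallow_bytes / 1024.0))
print("deep:    {:.1f} KB".format(deep_bytes / 1024.0))
```

In the notebook, `jobpart4.info(memory_usage='deep')` would print the deep figure directly.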

In [21]:
jobpart4.head()

| | title | category | number_of_openings | posted_date | region | location_string |
|---|---|---|---|---|---|---|
| 0 | Quick Lube Technician / Nalley Toyota Stonecrest | Quick Lube Technicians | 1.0 | 2017-07-19 | GA | US-GA-Lithonia |
| 1 | Service Porter / McDavid Ford | Customer Service/Support | 1.0 | 2017-09-22 | TX | US-TX-Fort Worth |
| 2 | Service Valet / McDavid Honda Frisco | Customer Service/Support | 3.0 | 2017-08-21 | TX | US-TX-Frisco |
| 3 | Director, Technical Accounting & Reporting | Accounting | 1.0 | 2017-07-31 | | US-CA-San FranciscoUS-TX-Houston |
| 4 | Financial Analyst | Finance & Treasury | 1.0 | 2017-08-30 | CA | US-CA-San Francisco |


> The filtered data is small enough to fit in memory. Let's read all the part files and save them as pandas DataFrames.

In [22]:
rawdf.unpersist()
alljobsdf.unpersist()

DataFrame[title: string, category: string, number_of_openings: float, posted_date: timestamp, region: string, location_string: string]

#### Trim all CSV files to pandas DataFrames

In [23]:
%%time
rawdf2 = sqlsc.read.csv("hdfs://master:54310/data/spark/tdi/temp_datalab_records_job_listings2.csv",\
                        header=True, schema = rawdf_schema)
jobpart2 = rawdf2.dropna(how='any',subset=['number_of_openings','posted_date'])\
        .select('title','category','number_of_openings','posted_date','region','location_string').toPandas()

CPU times: user 25.1 ms, sys: 8.96 ms, total: 34.1 ms
Wall time: 33.4 s


In [24]:
jobpart2.head()

| | title | category | number_of_openings | posted_date | region | location_string |
|---|---|---|---|---|---|---|
| 0 | OPER NEGÓCIOS ITC - | Júnior/Trainee | 1.0 | 2018-04-23 | SP | |
| 1 | Estágio Private | Estágio | 1.0 | 2018-04-02 | RJ | |
| 2 | CAIXA | Júnior/Trainee | 1.0 | 2018-04-11 | SE | |
| 3 | ATENDENTE COMERCIAL | Auxiliar/Operacional | 1.0 | 2018-04-27 | MT | |
| 4 | ESTAGIÁRIO AGÊNCIA OPERACIONAL 6H | Estágio | 1.0 | 2018-04-23 | SP | |


In [31]:
jobpart2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31268 entries, 0 to 31267
Data columns (total 6 columns):
title                 31268 non-null object
category              31268 non-null object
number_of_openings    31268 non-null float32
posted_date           31268 non-null datetime64[ns]
region                30576 non-null object
location_string       0 non-null object
dtypes: datetime64[ns](1), float32(1), object(4)
memory usage: 1.3+ MB


In [32]:
%%time
rawdf3 = sqlsc.read.csv("hdfs://master:54310/data/spark/tdi/temp_datalab_records_job_listings3.csv",\
                        header=True, schema = rawdf_schema)
jobpart3 = rawdf3.dropna(how='any',subset=['number_of_openings','posted_date'])\
        .select('title','category','number_of_openings','posted_date','region','location_string').toPandas()

rawdf5 = sqlsc.read.csv("hdfs://master:54310/data/spark/tdi/temp_datalab_records_job_listings5.csv",\
                        header=True, schema = rawdf_schema)
jobpart5 = rawdf5.dropna(how='any',subset=['number_of_openings','posted_date'])\
        .select('title','category','number_of_openings','posted_date','region','location_string').toPandas()


rawdf6 = sqlsc.read.csv("hdfs://master:54310/data/spark/tdi/temp_datalab_records_job_listings6.csv",\
                        header=True, schema = rawdf_schema)
jobpart6 = rawdf6.dropna(how='any',subset=['number_of_openings','posted_date'])\
        .select('title','category','number_of_openings','posted_date','region','location_string').toPandas()

rawdf7 = sqlsc.read.csv("hdfs://master:54310/data/spark/tdi/temp_datalab_records_job_listings7.csv",\
                        header=True, schema = rawdf_schema)
jobpart7 = rawdf7.dropna(how='any',subset=['number_of_openings','posted_date'])\
        .select('title','category','number_of_openings','posted_date','region','location_string').toPandas()

CPU times: user 1.48 s, sys: 191 ms, total: 1.67 s
Wall time: 1min 40s

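The reads above repeat one pattern per part file. A minimal sketch of how the repetition could be folded into a helper follows. `part_path` and `load_part` are hypothetical names, not part of the original notebook. The helper chains the same `read.csv`, `dropna`, `select` and `toPandas` calls used in the cells above and expects the `sqlsc` and `rawdf_schema` objects defined earlier.

```python
# Columns kept from every part file, as in the cells above.
KEEP_COLS = ['title', 'category', 'number_of_openings',
             'posted_date', 'region', 'location_string']

HDFS_TEMPLATE = ("hdfs://master:54310/data/spark/tdi/"
                 "temp_datalab_records_job_listings{}.csv")


def part_path(part_number):
    """Return the HDFS path of one CSV part file (hypothetical helper)."""
    return HDFS_TEMPLATE.format(part_number)


def load_part(sql_context, schema, part_number):
    """Read one part file, drop incomplete rows, return a pandas DataFrame.

    Hypothetical helper that chains the calls already used in this notebook.
    """
    sdf = sql_context.read.csv(part_path(part_number), header=True, schema=schema)
    return (sdf.dropna(how='any', subset=['number_of_openings', 'posted_date'])
               .select(*KEEP_COLS)
               .toPandas())


# On the cluster, with `sqlsc` and `rawdf_schema` from the earlier cells:
# jobparts = {n: load_part(sqlsc, rawdf_schema, n) for n in [2, 3, 4, 5, 6, 7]}
print(part_path(3))
```

Keeping the column list and path template in one place means a change to either applies to all six parts at once.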

#### Check the pandas DataFrames

In [33]:
jobpart3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4531 entries, 0 to 4530
Data columns (total 6 columns):
title                 4531 non-null object
category              4531 non-null object
number_of_openings    4531 non-null float32
posted_date           4531 non-null datetime64[ns]
region                4479 non-null object
location_string       0 non-null object
dtypes: datetime64[ns](1), float32(1), object(4)
memory usage: 194.8+ KB


In [34]:
jobpart3.head(3)

| | title | category | number_of_openings | posted_date | region | location_string |
|---|---|---|---|---|---|---|
| 0 | ESTAGIÁRIO AG OP 6H | Estágio | 1.0 | 2018-04-02 | RN | |
| 1 | ESTAGIÁRIO AG OP 6H | Estágio | 1.0 | 2018-04-16 | GO | |
| 2 | ESTAGIARIO AG OP 6H | Estágio | 1.0 | 2018-04-19 | MG | |


In [35]:
jobpart5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 446679 entries, 0 to 446678
Data columns (total 6 columns):
title                 446679 non-null object
category              388997 non-null object
number_of_openings    446679 non-null float32
posted_date           446679 non-null datetime64[ns]
region                355665 non-null object
location_string       276860 non-null object
dtypes: datetime64[ns](1), float32(1), object(4)
memory usage: 18.7+ MB


In [36]:
jobpart5.head(3)

| | title | category | number_of_openings | posted_date | region | location_string |
|---|---|---|---|---|---|---|
| 0 | Medical Claims Collector | Insurance Verification/Billing | 1.0 | 2017-06-23 | OH | {'country': 'USA', 'region': u'OH', 'locality'... |
| 1 | MRI Technologist | Healthcare - Radiology/Imaging | 1.0 | 2017-06-27 | NY | {'country': 'USA', 'region': u'NY', 'locality'... |
| 2 | Radiation Therapist, Junior | Healthcare - Oncology | 1.0 | 2017-06-29 | MA | {'country': 'USA', 'region': u'MA', 'locality'... |


In [37]:
jobpart6.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426310 entries, 0 to 426309
Data columns (total 6 columns):
title                 426310 non-null object
category              379788 non-null object
number_of_openings    426310 non-null float32
posted_date           426310 non-null datetime64[ns]
region                99783 non-null object
location_string       17494 non-null object
dtypes: datetime64[ns](1), float32(1), object(4)
memory usage: 17.9+ MB


In [38]:
jobpart6.head(3)

| | title | category | number_of_openings | posted_date | region | location_string |
|---|---|---|---|---|---|---|
| 0 | Cryptologic Instructor (Unix) | Training | 1.0 | 2017-03-28 | GA | Augusta, GA, USA |
| 1 | Cryptologic Instructor (Windows) | Training | 3.0 | 2017-03-28 | GA | Augusta, GA, USA |
| 2 | Cryptologic Instuctor (Unix) | Training | 1.0 | 2017-03-28 | GA | Augusta, GA, USA |


In [39]:
jobpart7.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 347751 entries, 0 to 347750
Data columns (total 6 columns):
title                 347751 non-null object
category              335379 non-null object
number_of_openings    347751 non-null float32
posted_date           347751 non-null datetime64[ns]
region                307010 non-null object
location_string       59106 non-null object
dtypes: datetime64[ns](1), float32(1), object(4)
memory usage: 14.6+ MB


In [40]:
jobpart7.head(3)

| | title | category | number_of_openings | posted_date | region | location_string |
|---|---|---|---|---|---|---|
| 0 | Chrysler Dodge Jeep Service Technician | Service Technicians | 2.0 | 2016-05-13 | NC | |
| 1 | Quick Lube Technician / Coggin Honda of Orlando | Quick Lube Technicians | 1.0 | 2016-12-10 | FL | |
| 2 | Service Advisor / Lexus of Greenville | Service Advisors | 1.0 | 2016-12-07 | SC | |


#### Finally, concatenate and save 

In [41]:
jobdf = pd.concat([jobpart2,jobpart3,jobpart4,jobpart5,jobpart6,jobpart7])\
            .sort_values('posted_date').reset_index(drop=True)

In [42]:
jobdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1267781 entries, 0 to 1267780
Data columns (total 6 columns):
title                 1267781 non-null object
category              1148290 non-null object
number_of_openings    1267781 non-null float32
posted_date           1267781 non-null datetime64[ns]
region                805956 non-null object
location_string       362028 non-null object
dtypes: datetime64[ns](1), float32(1), object(4)
memory usage: 53.2+ MB

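As a sanity check on the concatenation, the row counts reported by the individual `.info()` calls should add up to the 1267781 entries of `jobdf`. The arithmetic below uses only the numbers printed in this notebook.

```python
# Row counts reported by the `.info()` calls above for each part.
part_rows = {
    'jobpart2': 31268,
    'jobpart3': 4531,
    'jobpart4': 11242,
    'jobpart5': 446679,
    'jobpart6': 426310,
    'jobpart7': 347751,
}

total_rows = sum(part_rows.values())
print(total_rows)

# `jobdf.info()` reports 1267781 entries, so no rows were lost in pd.concat.
assert total_rows == 1267781
```

Inside the notebook, the same check could be written against the live objects, for example by comparing the sum of `len(part)` over the six frames with `len(jobdf)`.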

In [43]:
jobdf.head(10)

| | title | category | number_of_openings | posted_date | region | location_string |
|---|---|---|---|---|---|---|
| 0 | SCA - LABORER | Facilities | 1.0 | 2013-11-21 | MD | |
| 1 | SCA - LABORER | Facilities | 1.0 | 2013-11-21 | MD | |
| 2 | SCA - LABORER | Facilities | 1.0 | 2013-11-21 | MD | |
| 3 | Toyota of Greenville Body Shop Estimator | Other | 1.0 | 2014-06-12 | SC | US-SC-Greenville |
| 4 | Toyota of Greenville Body Shop Estimator | Other | 1.0 | 2014-06-12 | SC | US-SC-Greenville |
| 5 | Toyota of Greenville Body Shop Estimator | Other | 1.0 | 2014-06-12 | SC | US-SC-Greenville |
| 6 | Toyota of Greenville Body Shop Estimator | Other | 1.0 | 2014-06-12 | SC | US-SC-Greenville |
| 7 | Toyota of Greenville Body Shop Estimator | Other | 1.0 | 2014-06-12 | SC | US-SC-Greenville |
| 8 | Toyota of Greenville Body Shop Estimator | Other | 1.0 | 2014-06-12 | SC | US-SC-Greenville |
| 9 | Toyota of Greenville Body Shop Estimator | Other | 1.0 | 2014-06-12 | SC | US-SC-Greenville |


In [44]:
jobdf.tail(10)

| | title | category | number_of_openings | posted_date | region | location_string |
|---|---|---|---|---|---|---|
| 1267771 | ADVOGADO SR | Sênior | 1.0 | 2018-07-19 | SP | |
| 1267772 | Analista Operações de Previdência | Júnior/Trainee | 26.0 | 2018-07-19 | SP | |
| 1267773 | PROGRAMA DE ESTÁGIO - ITAÚ UNIBANCO | Estágio | 1.0 | 2018-07-19 | PR | |
| 1267774 | Promotor de vendas -ICARROS | Júnior/Trainee | 1.0 | 2018-07-19 | SP | |
| 1267775 | ESTAGIARIO AG OP 6H | Estágio | 1.0 | 2018-07-19 | MG | |
| 1267776 | ESTAGIÁRIO AG OP 6H | Estágio | 1.0 | 2018-07-19 | MG | |
| 1267777 | PROGRAMA DE ESTÁGIO - ITAÚ UNIBANCO | Estágio | 1.0 | 2018-07-19 | PR | |
| 1267778 | ADVOGADO SR | Sênior | 1.0 | 2018-07-19 | SP | |
| 1267779 | Analista de Projetos de Infraestrutura Pleno | Pleno | 1.0 | 2018-07-20 | SP | |
| 1267780 | ATENDENTE COMERCIAL | Auxiliar/Operacional | 1.0 | 2018-07-20 | SP | |


In [45]:
import pyarrow as pa
import pyarrow.parquet as pq

In [46]:
pq.write_table(pa.Table.from_pandas(jobdf), 'jobdf.parquet.snappy', compression='snappy')

> The file size of `jobdf.parquet.snappy` is 10 MB, compared with 116.79 GB of raw CSV.