In [None]:
import os
import pyspark
conf = pyspark.SparkConf()
conf.set("spark.ui.proxyBase", "/user/" + os.environ["JUPYTERHUB_USER"] + "/proxy/4041")
conf.set("spark.driver.memory", "16g")

sc = pyspark.SparkContext(conf = conf)
spark = pyspark.SQLContext.getOrCreate(sc)

In [66]:
import re
import statistics
from pyspark.sql.functions import udf, col
import pyspark.sql.types as types

# **Dataset**

Download the dataset from https://www.kaggle.com/datasets/ravindrasinghrana/job-description-dataset/data and place the unzipped `.csv` in the project root directory (the notebook loads it as `../job_descriptions.csv`).

Keep in mind that this dataset is synthetically generated, so the aggregates and insights we extract from it are not realistic.

Example:
| Job Id | Experience | Qualifications | Salary Range | location | Country | latitude | longitude | Work Type | Company Size | Job Posting Date | Preference | Contact Person | Contact | Job Title | Role | Job Portal | Job Description | Benefits | skills | Responsibilities | Company | Company Profile |
| - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 398454096642776 | 2 to 12 Years | BCA | $56K-$116K | Ashgabat | Turkmenistan | 38.9697 | 59.5563 | Intern | 100340 | 2022-12-19 | Female | Francisco Larsen | 461-509-4216 | Web Developer | Frontend Web Developer | Idealist | Frontend Web Developers design and implement user interfaces for websites, ensuring they are visuall... | {'Health Insurance, Retirement Plans, Paid Time Off (PTO), Flexible Work Arrangements, Employee Assi... | HTML, CSS, JavaScript Frontend frameworks (e.g., React, Angular) User experience (UX) | Design and code user interfaces for websites, ensuring a seamless and visually appealing user experi... | PNC Financial Services Group | {"Sector":"Financial Services","Industry":"Commercial Banks","City":"Pittsburgh","State":"Pennsylvan... |

In [5]:
jobs = spark.read.option("header", True).option("inferSchema", True).csv("../job_descriptions.csv")
jobs.count()

1615940

In [14]:
jobs.groupBy("Role").count().sort("Role").toPandas()

|  | Role | count |
| - | - | - |
| 0 | API Developer | 3483 |
| 1 | Accessibility Developer | 3513 |
| 2 | Account Executive | 7063 |
| 3 | Account Manager | 3474 |
| 4 | Account Strategist | 3460 |
| ... | ... | ... |
| 371 | Wedding Consultant | 3492 |
| 372 | Wedding Coordinator | 3552 |
| 373 | Wedding Designer | 3353 |
| 374 | Wedding Planner | 6902 |


# **Feature 1**

Compare a job posting to other postings for the same position.

Example: I am looking for a new "software engineer" position.  When I look at a job posting, I'd like to see how this posting compares to other "software engineer" positions.  For example, "this posting has a lower salary than other postings of similar positions, and requires more qualifications."

Input:
```json
{
    "id": 398454096642776
}
```

Output:
```json
{
    "role": "Frontend Web Developer",
    "salary": "less than average",
    "education": "less than average",
    "experience": "less than average",
    "skills": "greater than average",
    "responsibilities": "greater than average",
    "benefits": "greater than average"
}
```
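
One way to produce labels like those in the output above is to compare each numeric value for a posting against the mean for its role. The following is a minimal sketch in plain Python, not the notebook's implementation. The function names, the `"average"` label for ties, and the sample numbers are all assumptions introduced for illustration.

```python
def compare_to_average(value, average):
    """Label a posting's value relative to the mean for its role."""
    if value < average:
        return "less than average"
    if value > average:
        return "greater than average"
    return "average"  # hypothetical label; the example output does not show a tie


def compare_posting(posting, role_averages):
    """Compare every metric that appears in both dictionaries."""
    return {
        metric: compare_to_average(posting[metric], role_averages[metric])
        for metric in role_averages
        if metric in posting
    }


# Hypothetical numbers for illustration only, not taken from the dataset
posting = {"salary": 86000.0, "experience": 7.0}
role_averages = {"salary": 82500.0, "experience": 7.5}
print(compare_posting(posting, role_averages))
```

In the notebook, the per-role averages would come from aggregates such as the `salaries` and `exps` frames computed in the cells that follow.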

In [71]:
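def get_avg_salary(salary_str):
    # Parse a range such as "$56K-$116K" and return its midpoint in dollars.
    bounds = [int(n) for n in re.findall(r"\d+", salary_str)]
    # statistics.mean can return an int for integer inputs; cast so the value matches FloatType.
    return float(statistics.mean(bounds)) * 1000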

udf_avg_salary = udf(get_avg_salary, types.FloatType())

salaries = jobs.withColumn("Salary", udf_avg_salary(col("Salary Range")))\
.groupBy("Role").mean("Salary").sort("Role").toPandas()
salaries

|  | Role | avg(Salary) |
| - | - | - |
| 0 | API Developer | 82636.602452 |
| 1 | Accessibility Developer | 82385.600455 |
| 2 | Account Executive | 82449.157303 |
| 3 | Account Manager | 82809.915014 |
| 4 | Account Strategist | 82440.628638 |
| ... | ... | ... |
| 371 | Wedding Consultant | 82349.743882 |
| 372 | Wedding Coordinator | 82566.101695 |
| 373 | Wedding Designer | 82452.353283 |
| 374 | Wedding Planner | 82525.977817 |


In [73]:
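def get_avg_exp(exp_str):
    # Parse a range such as "2 to 12 Years" and return its midpoint in years.
    bounds = [int(n) for n in re.findall(r"\d+", exp_str)]
    # statistics.mean can return an int for integer inputs; cast so the value matches FloatType.
    return float(statistics.mean(bounds))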

udf_avg_exp = udf(get_avg_exp, types.FloatType())

exps = jobs.withColumn("Years Experience", udf_avg_exp(col("experience")))\
.groupBy("Role").mean("Years Experience").sort("Role").toPandas()
exps

|  | Role | avg(Years Experience) |
| - | - | - |
| 0 | API Developer | 7.003116 |
| 1 | Accessibility Developer | 7.026810 |
| 2 | Account Executive | 6.961190 |
| 3 | Account Manager | 6.964162 |
| 4 | Account Strategist | 7.009029 |
| ... | ... | ... |
| 371 | Wedding Consultant | 6.962441 |
| 372 | Wedding Coordinator | 6.971982 |
| 373 | Wedding Designer | 7.065859 |
| 374 | Wedding Planner | 7.018487 |


In [77]:
skills_str = jobs.select("skills").head()["skills"]
[b.strip() for b in skills_str.split(",")]

['Social media platforms (e.g.',
 'Facebook',
 'Twitter',
 'Instagram) Content creation and scheduling Social media analytics and insights Community engagement Paid social advertising']
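
The output above shows that a plain comma split breaks the parenthesised example list apart and leaves the remaining skills glued together. A possible heuristic, sketched below, is to ignore separators inside parentheses and to also split where a lowercase letter or `)` is followed by a space and a capital letter. `split_skills` is a hypothetical helper, and the rule is only a guess at how this field is laid out: it would wrongly split multi-word proper names such as "Google Analytics".

```python
def split_skills(skills_str):
    """Heuristically split a skills string into individual skills."""
    skills, current, depth = [], [], 0
    for i, ch in enumerate(skills_str):
        # Track parenthesis depth so "(e.g., A, B)" stays in one piece.
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        prev_ch = skills_str[i - 1] if i > 0 else ""
        next_ch = skills_str[i + 1] if i + 1 < len(skills_str) else ""
        at_comma = ch == ","
        # Assumed boundary: "...scheduling Social..." starts a new skill.
        at_case_boundary = (
            ch == " " and (prev_ch.islower() or prev_ch == ")") and next_ch.isupper()
        )
        if depth == 0 and (at_comma or at_case_boundary):
            skills.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    skills.append("".join(current).strip())
    return [s for s in skills if s]


# String reconstructed from the cell output above
example = (
    "Social media platforms (e.g., Facebook, Twitter, Instagram) "
    "Content creation and scheduling Social media analytics and insights "
    "Community engagement Paid social advertising"
)
for skill in split_skills(example):
    print(skill)
```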

In [80]:
responsibilities_str = jobs.select("responsibilities").head()["responsibilities"]
[b.strip() for b in responsibilities_str.split(".")]

['Manage and grow social media accounts, create engaging content, and interact with the online community',
 'Develop social media content calendars and strategies',
 'Monitor social media trends and engagement metrics',
 '']

In [52]:
benefits_str = jobs.select("benefits").head()["benefits"]
[b.strip() for b in benefits_str[2:-2].split(",")]

['Flexible Spending Accounts (FSAs)',
 'Relocation Assistance',
 'Legal Assistance',
 'Employee Recognition Programs',
 'Financial Counseling']
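
The two splits above can be wrapped in small helpers that also drop empty pieces, such as the trailing `''` in the responsibilities output. This is a sketch under the assumption that item counts are a usable proxy for the "greater than average" comparisons in Feature 1. The helper names are hypothetical, and splitting on `.` would break on abbreviations such as "e.g." inside a sentence.

```python
def parse_benefits(benefits_str):
    """Strip the surrounding {' and '} if present, then split on commas."""
    inner = benefits_str.strip()
    if inner.startswith("{'") and inner.endswith("'}"):
        inner = inner[2:-2]
    return [b.strip() for b in inner.split(",") if b.strip()]


def parse_responsibilities(responsibilities_str):
    """Split on periods and drop empty pieces."""
    return [r.strip() for r in responsibilities_str.split(".") if r.strip()]


# Strings reconstructed from the cell outputs above
benefits_example = (
    "{'Flexible Spending Accounts (FSAs), Relocation Assistance, "
    "Legal Assistance, Employee Recognition Programs, Financial Counseling'}"
)
responsibilities_example = (
    "Manage and grow social media accounts, create engaging content, "
    "and interact with the online community. "
    "Develop social media content calendars and strategies. "
    "Monitor social media trends and engagement metrics."
)
print(len(parse_benefits(benefits_example)))
print(len(parse_responsibilities(responsibilities_example)))
```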

# **Feature 2**

What skills an applicant needs for a specific job position.

Example: I am looking for a new "software engineer" position.  When I search for "software engineer," I'd like to see a list of the most common qualifications for this position, e.g., "Python proficiency, AWS CDK, and GitHub CI/CD."

Input:
```json
{
    "role": "Frontend Web Developer"
}
```

Output:
```json
{
    "skills": [
        "HTML",
        "CSS",
        "JavaScript Frontend frameworks (e.g., React, Angular)",
        "User experience (UX)"
    ]
}
```
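
Once each posting's skills are available as a list, the most common skills for a role can be found with a frequency count. The sketch below uses `collections.Counter` on a few hand-written postings. The `top_skills` helper and the sample data are hypothetical and only illustrate the idea; at the scale of this dataset the same count would more likely be done in Spark.

```python
from collections import Counter


def top_skills(postings, role, n=5):
    """Return the n most frequent skills across postings with the given role."""
    counts = Counter(
        skill
        for posting in postings
        if posting["role"] == role
        for skill in posting["skills"]
    )
    return [skill for skill, _ in counts.most_common(n)]


# Hypothetical, hand-written postings for illustration only
postings = [
    {"role": "Frontend Web Developer", "skills": ["HTML", "CSS", "JavaScript"]},
    {"role": "Frontend Web Developer", "skills": ["HTML", "CSS"]},
    {"role": "Frontend Web Developer", "skills": ["HTML"]},
    {"role": "Backend Developer", "skills": ["SQL", "Python"]},
]
print(top_skills(postings, "Frontend Web Developer", n=2))  # ['HTML', 'CSS']
```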

# **Feature 3**

Job recommendation based on user profile.

Example: I enter a list of qualifications I have (and maybe some other information such as desired salary).  I'd like to see a list of job positions that best fit my profile.

Input:
```json
{
    "skills": [
        "HTML",
        "CSS",
        "JavaScript Frontend frameworks (e.g., React, Angular)",
        "User experience (UX)"
    ],
    "salary": 100000
}
```

Output:
```json
{
    "roles": [
        "Frontend Web Developer",
        "User Interface Designer",
        "Backend Developer"
    ]
}
```
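
One simple way to rank roles for a profile is by set overlap between the user's skills and each role's typical skills. The sketch below uses Jaccard similarity and treats the desired salary as a hard lower bound on the role's average salary. Both choices are assumptions, as are the `recommend_roles` helper, the `role_profiles` structure, and the sample data. Because the average salary in this dataset is close to 82.5K for every role shown above, a hard bound of 100000 would filter out everything on the real data, so a softer penalty may work better there.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend_roles(profile, role_profiles, n=3):
    """Rank roles by skill overlap, keeping roles that meet the desired salary."""
    desired = profile.get("salary", 0)
    scored = [
        (jaccard(profile["skills"], info["skills"]), role)
        for role, info in role_profiles.items()
        if info["avg_salary"] >= desired
    ]
    # Highest score first; ties are broken alphabetically by role name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    ranked = [role for score, role in scored if score > 0]
    return ranked[:n]


# Hypothetical role profiles for illustration only, not derived from the dataset
role_profiles = {
    "Frontend Web Developer": {"skills": ["HTML", "CSS", "JavaScript"], "avg_salary": 105000.0},
    "User Interface Designer": {"skills": ["HTML", "CSS", "Figma", "Sketch"], "avg_salary": 101000.0},
    "Backend Developer": {"skills": ["SQL", "Python", "JavaScript"], "avg_salary": 110000.0},
    "Wedding Planner": {"skills": ["Budgeting"], "avg_salary": 120000.0},
    "API Developer": {"skills": ["HTML", "CSS", "JavaScript"], "avg_salary": 90000.0},
}
profile = {"skills": ["HTML", "CSS", "JavaScript"], "salary": 100000}
print(recommend_roles(profile, role_profiles))
```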