# Using cuDF.pandas on string-based ETL

Pandas is the preeminent CPU library for data ETL in Python and a core part of what is known as the PyData stack.  The RAPIDS library ecosystem uses NVIDIA CUDA-enabled GPUs to accelerate the PyData stack.  The RAPIDS library cuDF is a pandas-like GPU library that accelerates ETL, even on dataframes whose rows hold string sequences as long as a short paragraph.  RAPIDS recently introduced `cudf.pandas`, which accelerates your pandas-based workflow with zero code changes.  It does the heavy lifting of deciding which operations can run on the GPU, automatically translating your CPU code to GPU code and falling back to the CPU otherwise, so you get the best of both worlds.

In this notebook, we will mostly use `cudf.pandas` to preprocess portions of a skills and job postings dataset.  But data science isn't clean...and this won't be the cleanest, nicest notebook.  Real data science is messy and mistakes are made.  Fixing those mistakes can be costly, especially in time.  RAPIDS makes it faster and easier to get through the laborious effort of cleaning data on the GPU, and now `cudf.pandas` makes it almost convenient.

We'll start off with a minor performance comparison of CPU pandas code versus the zero-code-change, GPU+CPU hybrid code of `cudf.pandas`.  Then we'll give you the choice to stay in `cudf.pandas` to complete the workflow, or go back to CPU-only pandas.  You're smart people.  You know where this is going.

Let's begin!

In [1]:
%load_ext cudf.pandas
import pandas as pd

# Download your data:

Data can be found here: https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024/data?select=linkedin_job_postings.csv

In [2]:
!if [ ! -f "job_skills.csv" ]; then kaggle datasets download asaniczka/1-3m-linkedin-jobs-and-skills-2024; else echo "unzipped job data found"; fi
!if [ ! -f "job_skills.csv" ]; then unzip 1-3m-linkedin-jobs-and-skills-2024; else echo "no need to unzip data"; fi

unzipped job data found
no need to unzip data


# Read your data

In [3]:
skills = pd.read_csv("job_skills.csv")

In [4]:
postings = pd.read_csv("linkedin_job_postings.csv")

# See if the jobs correlate

In [5]:
postings.head()

Unnamed: 0,job_link,last_processed_time,got_summary,got_ner,is_being_worked,job_title,company,job_location,first_seen,search_city,search_country,search_position,job_level,job_type
0,https://www.linkedin.com/jobs/view/account-exe...,2024-01-21 07:12:29.00256+00,t,t,f,Account Executive - Dispensing (NorCal/Norther...,BD,"San Diego, CA",2024-01-15,Coronado,United States,Color Maker,Mid senior,Onsite
1,https://www.linkedin.com/jobs/view/registered-...,2024-01-21 07:39:58.88137+00,t,t,f,Registered Nurse - RN Care Manager,Trinity Health MI,"Norton Shores, MI",2024-01-14,Grand Haven,United States,Director Nursing Service,Mid senior,Onsite
2,https://www.linkedin.com/jobs/view/restaurant-...,2024-01-21 07:40:00.251126+00,t,t,f,RESTAURANT SUPERVISOR - THE FORKLIFT,Wasatch Adaptive Sports,"Sandy, UT",2024-01-14,Tooele,United States,Stand-In,Mid senior,Onsite
3,https://www.linkedin.com/jobs/view/independent...,2024-01-21 07:40:00.308133+00,t,t,f,Independent Real Estate Agent,Howard Hanna | Rand Realty,"Englewood Cliffs, NJ",2024-01-16,Pinehurst,United States,Real-Estate Clerk,Mid senior,Onsite
4,https://www.linkedin.com/jobs/view/group-unit-...,2024-01-19 09:45:09.215838+00,f,f,f,Group/Unit Supervisor (Systems Support Manager...,"IRS, Office of Chief Counsel","Chamblee, GA",2024-01-17,Gadsden,United States,Supervisor Travel-Information Center,Mid senior,Onsite


In [6]:
skills.head()

Unnamed: 0,job_link,job_skills
0,https://www.linkedin.com/jobs/view/housekeeper...,"Building Custodial Services, Cleaning, Janitor..."
1,https://www.linkedin.com/jobs/view/assistant-g...,"Customer service, Restaurant management, Food ..."
2,https://www.linkedin.com/jobs/view/school-base...,"Applied Behavior Analysis (ABA), Data analysis..."
3,https://www.linkedin.com/jobs/view/electrical-...,"Electrical Engineering, Project Controls, Sche..."
4,https://www.linkedin.com/jobs/view/electrical-...,"Electrical Assembly, Point to point wiring, St..."


In [7]:
skills.count()

job_link      1296381
job_skills    1294346
dtype: int64

Since we're interested in skills, we won't need jobs without explicit skills.

In [8]:
skills = skills.dropna()
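
To make the effect of `dropna()` concrete, here is a minimal sketch on a made-up three-row frame. The links and skills below are invented for illustration and are not rows from the dataset.

```python
import pandas as pd

# Toy stand-in for job_skills.csv; the links and skills are made up.
toy_skills = pd.DataFrame(
    {
        "job_link": ["link-a", "link-b", "link-c"],
        "job_skills": ["Nursing, BSN", None, "Sales"],
    }
)

# count() reports non-null values per column, so the gap shows up here.
before = toy_skills.count()

# dropna() removes every row that has at least one missing value.
toy_clean = toy_skills.dropna()
after = toy_clean.count()

print(before.to_dict())
print(after.to_dict())
```

This is the same pattern as the cell above: a lower `count()` in `job_skills` than in `job_link` points at postings with no skills listed, and `dropna()` removes exactly those rows.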

In [9]:
postings.count()

job_link               1348454
last_processed_time    1348454
got_summary            1348454
got_ner                1348454
is_being_worked        1348454
job_title              1348454
company                1348443
job_location           1348435
first_seen             1348454
search_city            1348454
search_country         1348454
search_position        1348454
job_level              1348454
job_type               1348454
dtype: int64

In [10]:
postings = postings.merge(skills, on="job_link")

In [11]:
postings.count()

job_link               1294346
last_processed_time    1294346
got_summary            1294346
got_ner                1294346
is_being_worked        1294346
job_title              1294346
company                1294337
job_location           1294327
first_seen             1294346
search_city            1294346
search_country         1294346
search_position        1294346
job_level              1294346
job_type               1294346
job_skills             1294346
dtype: int64
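
The row count drops after the merge because `merge()` defaults to an inner join. A small sketch of that behavior, using invented miniature tables, also shows one possible way to audit which postings had no match (the `audit` and `unmatched` names are only for this illustration):

```python
import pandas as pd

# Made-up miniature versions of the two tables.
toy_postings = pd.DataFrame(
    {"job_link": ["link-a", "link-b", "link-c"], "job_title": ["Nurse", "Chef", "Agent"]}
)
toy_skills = pd.DataFrame(
    {"job_link": ["link-a", "link-c"], "job_skills": ["Nursing, BSN", "Sales"]}
)

# merge() defaults to how="inner": only links present in both frames survive.
inner = toy_postings.merge(toy_skills, on="job_link")

# A left merge with indicator=True shows which postings had no skills row.
audit = toy_postings.merge(toy_skills, on="job_link", how="left", indicator=True)
unmatched = audit.loc[audit["_merge"] == "left_only", "job_link"].tolist()

print(len(inner), unmatched)
```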

As the workflow may require more RAM than the free Colab instance has, we can reduce the dataset size to the first 1 million posts with complete data.

In [12]:
postings = postings.dropna()
postings = postings[:1000000]
postings.count()

job_link               1000000
last_processed_time    1000000
got_summary            1000000
got_ner                1000000
is_being_worked        1000000
job_title              1000000
company                1000000
job_location           1000000
first_seen             1000000
search_city            1000000
search_country         1000000
search_position        1000000
job_level              1000000
job_type               1000000
job_skills             1000000
dtype: int64

# Assessing the Skills

In [13]:
%%time
b = postings["job_skills"].str.split(",", expand=True)

CPU times: user 135 ms, sys: 63.5 ms, total: 199 ms
Wall time: 198 ms


Alright, we're already way faster.
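
For readers new to `str.split(..., expand=True)`, here is a small sketch on made-up strings of what the call returns. It assumes nothing beyond plain pandas string methods.

```python
import pandas as pd

# Made-up skills strings of different lengths.
toy = pd.Series(["Nursing, BSN, Triage", "Sales", "Cooking, Cleaning"])

# expand=True returns a DataFrame with one column per split position.
wide = toy.str.split(",", expand=True)

# The longest row sets the width; shorter rows are padded with missing values.
print(wide.shape)
print(int(wide.isna().sum().sum()))

# Splitting on "," alone keeps the space that followed each comma.
print(repr(wide.loc[0, 1]))
```

Two things are worth noticing: a single long row widens the whole result, and every skill after the first keeps its leading space.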


In [14]:
b.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,453,454,455,456,457,458,459,460,461,462
0,Medical equipment sales,Key competitors,Terminology,Technology,Trends,Challenges,Reimbursement,Government regulation,BD offerings,Pipeline management,...,,,,,,,,,,
1,Nursing,Bachelor of Science in Nursing,Masters Degree in Nursing,Care management experience,Clinical experience in nursing,Licensure to practice nursing in Michigan,Population management,Selfmanagement,Education,Oversight of registries,...,,,,,,,,,,
2,Restaurant Operations Management,Inventory Management,Food and Beverage Ordering,Profit Optimization,Guest Service,Front and Back of House Coordination,Employee Performance Management,Discipline and Rewards Management,Safety and Sanitation Maintenance,Directional Flow Management,...,,,,,,,,,,
3,Real Estate,Customer Service,Sales,Negotiation,Communication,Home Listings,Local Real Estate Market,Representation Contracts,Purchase Agreements,Closing Statements,...,,,,,,,,,,
4,Nursing,BSN,Medical License,Virtual RN,Nursing Support,Diversity,Equity,Inclusion,Equal Opportunity Employer,,...,,,,,,,,,,


Wow, someone actually created a role with 463 required skills.  I hope they found exactly the person they were looking for.  However, this makes for a really wide notebook.  Let's find out more about this exploded data.  This may take a while, and it uses the CPU.

In [15]:
desc_b = b.describe()

In [16]:
desc_b

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,453,454,455,456,457,458,459,460,461,462
count,1000000,998031,995977,991890,985843,976992,965045,948902,928235,902560,...,1,1,1,1,1,1,1,1,1,1
unique,124052,181060,209961,226632,235485,241099,245143,246635,245312,244162,...,1,1,1,1,1,1,1,1,1,1
top,Customer service,Customer service,Communication,Communication,Communication,Communication,Communication,Communication,Communication,Communication,...,WorkLife Balance,Stress Management,Conflict Resolution,Negotiation,Cultural Competence,Ethical DecisionMaking,Patient Safety,Quality Improvement,Risk Management,Compliance
freq,47154,15811,18112,19062,20763,21335,20774,19649,18103,16039,...,1,1,1,1,1,1,1,1,1,1
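
As a reading aid for the table above, this sketch runs `describe()` on a tiny made-up frame of strings. For text columns, pandas reports `count`, `unique`, `top`, and `freq` instead of numeric statistics.

```python
import pandas as pd

# Toy wide frame: column 0 is always filled, column 1 has one gap.
toy_wide = pd.DataFrame(
    {0: ["Nursing", "Sales", "Nursing"], 1: [" BSN", None, " Triage"]}
)

summary = toy_wide.describe()
print(summary)
```

Here `count` is the number of non-missing entries in each column, `top` is the most frequent value, and `freq` is how often that value appears.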


Well, the top skills here are `Customer service` and then a whole lot of `Communication`.  Let's see what the aggregates are.  We're going to keep this pandas-heavy and, instead of using a `set`, we'll use `unique` and `concat`.

In [17]:
%%time
stacked_skills = b.stack().dropna()  # needs a large-memory GPU such as a V100 or A100; see the note below

CPU times: user 17.8 s, sys: 2.87 s, total: 20.7 s
Wall time: 20.6 s


In [18]:
%%time 
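concat_skills = pd.Series(dtype="object")  # explicit dtype for the empty starting Series
for i in range(463):  # one pass per column of b
    # append only the non-null skills from column i
    concat_skills = pd.concat([concat_skills, b[i].dropna()])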

CPU times: user 4min 9s, sys: 9.06 s, total: 4min 18s
Wall time: 4min 16s


What's going on here?  Why two versions, and what's with the A100 comment?  One method takes around 20 seconds and the other takes over 4 minutes.  Well, it all comes down to the RAM available on both your CPU and GPU.

If you happen to have Colab Pro and pick a V100 or A100, or have a memory-extended instance, you can just use `stack()` and then drop the NA values. If your GPU is a T4, as it will be for most of you, you need a far more memory-friendly way to do this, on GPU or CPU.  Without explicitly dropping NA values column by column, we run out of memory very quickly on both CPU and GPU with more traditional methods like `.stack()`.  The loop is slower, but it works and lets you keep going.  Here is a screenshot of the counts, so you know that the numbers are the same.

26,925,304 skill mentions
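
The loop shown earlier copies the growing Series on every pass. One possible variation, sketched here on toy data and not benchmarked on the real dataset or on a T4, is to collect the per-column pieces in a list and call `pd.concat` once. The names `toy_b`, `pieces`, and `concat_once` are only for this illustration.

```python
import pandas as pd

# Toy wide matrix like `b`: one row per posting, one column per skill slot.
toy_b = pd.DataFrame(
    {
        0: ["Nursing", "Sales", "Cooking"],
        1: [" BSN", None, " Cleaning"],
        2: [" Triage", None, None],
    }
)

# Drop the padding column by column, then concatenate once at the end.
pieces = [toy_b[col].dropna() for col in toy_b.columns]
concat_once = pd.concat(pieces)

# Same values as stack(), which orders them row by row instead.
stacked = toy_b.stack().dropna()

print(len(concat_once), len(stacked))
```

Whether this fits in memory on a small GPU is an open question for your own setup; the sketch only shows that the two routes produce the same set of skill mentions.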

The other cool thing to consider is "How do I know this?"  Because this is the cleaned-up version of the notebook.  With the speed of cuDF.pandas, there is something really great about finding out VERY quickly that something won't work, instead of waiting and wondering until it just fails.  Plus, if and when the kernel crashes, you can just restart and quickly get back to your last working point.  Imagine having your workflow crash after it took you 5-10 minutes (or far longer) to get there, instead of about a minute.

Let's just say I know your frustration, so I switched over to doing this entirely in cuDF.pandas first, then did the timings on CPU.  I got to hang out with my wife quite a bit more on Mother's Day because I used cuDF.pandas.  What are you missing out on in life?  While you ponder that, let's get that count and then get the list of unique skills.

In [19]:
stacked_skills.count()

20832232

In [20]:
concat_skills.count()

20832232
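
Another route to the same long format skips the wide matrix entirely: split each string into a list and call `Series.explode()`. This is a sketch on made-up strings, assuming a pandas version that provides `explode`. How it behaves on the full dataset under `cudf.pandas` is not something this notebook measured.

```python
import pandas as pd

# Made-up skills strings.
toy = pd.Series(["Nursing, BSN, Triage", "Sales", "Cooking, Cleaning"])

# Route 1: wide matrix, then stack (as in the notebook).
via_stack = toy.str.split(",", expand=True).stack().dropna()

# Route 2: keep each row's skills as a list, then emit one row per list element.
via_explode = toy.str.split(",").explode()

print(via_stack.count(), via_explode.count())
```

Comparing counts between two routes, as the cells above do, is a cheap sanity check before building on either result.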

In [21]:
%%time
skills_count = concat_skills.value_counts()


CPU times: user 29.1 ms, sys: 8.07 ms, total: 37.1 ms
Wall time: 36.8 ms


In [22]:
len(skills_count)

2838860

Just over 2.8 million unique skills.  Now let's take a look at the top 49 of these unique skills!

In [23]:
print(skills_count.head(49))

 Communication             275700
 Teamwork                  172715
 Leadership                124831
 Communication skills       81239
 Customer service           80031
 Problem Solving            79578
 Customer Service           72575
 Problemsolving             71653
 Collaboration              66373
 Training                   63247
 Communication Skills       60638
 Attention to detail        59413
 Time management            56844
 Microsoft Office Suite     55071
 Time Management            54761
 Project Management         52865
 Sales                      50199
 Scheduling                 48606
 Multitasking               47200
Customer service            47154
 Adaptability               46337
 Attention to Detail        45677
 Microsoft Office           44103
Nursing                     44066
 Flexibility                43940
 Patient Care               42475
 Interpersonal skills       42317
 Documentation              39453
 Organization               37170
 Data Analysis

So, as expected, `Communication` reigns supreme, followed by `Teamwork`, `Leadership`, `Communication skills`, `Customer service`... as we skim down, then... hey, wait a sec... is that another "Customer Service"?!  Wait, what?  Is that a capital S?

Oh no!  The skill labels have character-level variations for the same skill.  Three versions of "Problem Solving" need a bit of their own problem solving.  Looks like we have to do it again...
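
One way to confirm the suspicion is to group the raw labels by a normalized key and look for keys with more than one raw spelling. This is a minimal sketch on invented labels. The `key`, `spellings`, and `variants` names are hypothetical helpers, not part of the notebook.

```python
import pandas as pd

# Made-up labels with the kinds of variation seen in the output above.
labels = pd.Series(
    [
        " Customer service",
        " Customer Service",
        "Customer service",
        " Problem Solving",
        " Problemsolving",
        " Teamwork",
    ]
)

# A normalized key: trimmed, lowercased, internal spaces removed.
key = labels.str.strip().str.lower().str.replace(" ", "", regex=False)

# Count distinct raw spellings per key; more than one means duplicates in disguise.
spellings = labels.groupby(key).nunique()
variants = spellings[spellings > 1]

print(variants.to_dict())
```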

In [24]:
concat_skills.head()

0             Medical equipment sales
1                             Nursing
2    Restaurant Operations Management
3                         Real Estate
4                             Nursing
dtype: object

Just to show off the speed of `cudf.pandas` a bit, we're going to show you how each of these actions cleans up some of the duplicate skills.

First, we'll strip all the leading and trailing whitespace, then run a count.

Next, we'll set the skills to lowercase using `lower()` and run another count.

Then, we'll remove all whitespace, since capitalized characters could be used to add it back in later, and run yet another count.

Finally, we'll remove the word "skills"...and then do a final unique count.

In [25]:
%%time
# 1. strip leading and trailing whitespace
stacked_skills = stacked_skills.str.strip()
skills_count2 = stacked_skills.value_counts()
print(skills_count2.count())
# 2. fold everything to lowercase
stacked_skills = stacked_skills.str.lower()
skills_count2 = stacked_skills.value_counts()
print(skills_count2.count())
# 3. remove the remaining spaces inside each label
stacked_skills = stacked_skills.str.replace(" ", "")
skills_count2 = stacked_skills.value_counts()
print(skills_count2.count())
# 4. drop the literal word "skills"
stacked_skills = stacked_skills.str.replace("skills", "")
skills_count2 = stacked_skills.value_counts()
print(skills_count2.count())

2763922
2329212
2292115
CPU times: user 284 ms, sys: 16.2 ms, total: 300 ms
Wall time: 296 ms


Wow!  We merged away more than 540K duplicate skill labels, or roughly 19% of the unique labels, with those string manipulations...and we did it in milliseconds.  We even had spare time to recount everything and print the answer after each step.

Of course, if we had done this in a single step and hadn't printed out the counts, it would have been faster, but you were already pretty impressed before we said anything.
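
For reference, the single-step version could look something like the sketch below. `normalize_skills` is a hypothetical helper name, the sample labels are made up, and `regex=False` is passed so that both replacements are treated as literal text.

```python
import pandas as pd


def normalize_skills(labels: pd.Series) -> pd.Series:
    """Apply the four cleanup steps from the cell above in one chain."""
    return (
        labels.str.strip()                       # trim leading/trailing whitespace
        .str.lower()                             # fold case
        .str.replace(" ", "", regex=False)       # remove internal spaces
        .str.replace("skills", "", regex=False)  # drop the literal word "skills"
    )


# Made-up labels; not taken from the dataset.
raw = pd.Series(
    [
        " Communication skills",
        "Communication",
        " Communication Skills",
        " Time management",
        " Time Management",
    ]
)

clean = normalize_skills(raw)
print(raw.nunique(), clean.nunique())
```

Note that the last step removes "skills" wherever it appears inside a label, so a label that merely contains those letters would be altered too.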

In [29]:
skills_count2.head(49)

communication           439194
problemsolving          253154
customerservice         225405
teamwork                197364
leadership              158987
timemanagement          123352
attentiontodetail       106110
projectmanagement        97003
interpersonal            79084
sales                    76144
patientcare              74725
collaboration            69581
nursing                  65272
training                 65043
dataanalysis             65042
organizational           60361
microsoftofficesuite     59952
inventorymanagement      54515
highschooldiploma        51515
multitasking             51489
decisionmaking           50182
bachelor'sdegree         50024
analytical               49875
scheduling               49742
microsoftoffice          48392
adaptability             46760
flexibility              44262
criticalthinking         44244
organization             41778
documentation            41064
writtencommunication     37369
troubleshooting          37228
budgetin

Just too fast and too easy.  The final result is also dramatically different from the original `skills_count`.

Okay, let's prepare our output to be useful for graph analytics.  We can also use this moment to create checkpoints by saving our cleaned dataframes as a `csv`.

In [73]:
# uniq_b.to_csv("unique_job_skills.csv")

In [74]:
# b.to_csv("skills_matrix.csv")

In [75]:
# b2.to_csv("b2.csv")
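
A checkpoint is only useful if it reloads cleanly. The sketch below round-trips a toy frame through CSV using an in-memory buffer as a stand-in for a file path. The frame and its column names are invented for the example.

```python
import io

import pandas as pd

# Toy cleaned frame; in the notebook this would be one of the cleaned dataframes.
toy_clean = pd.DataFrame({"job_link": ["link-a", "link-b"], "skill": ["nursing", "sales"]})

# index=False keeps the row index out of the file, so reloading adds no extra column.
buffer = io.StringIO()  # stands in for a path on disk
toy_clean.to_csv(buffer, index=False)

# Rewind and read the checkpoint back.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())
```

With the default `index=True`, the saved index comes back as an additional unnamed column unless you pass `index_col=0` when reading.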

In [31]:
stacked_skills.head()

0  0    medicalequipmentsales
   1           keycompetitors
   2              terminology
   3               technology
   4                   trends
dtype: object

In [32]:
stacked_skills = stacked_skills.reset_index()

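# NOTE: `jobTitles` is not defined in the cells shown. The output below suggests one label per
# posting built from job_type, job_level, and search_position; the next line is that assumption.
jobTitles = postings["job_type"] + " " + postings["job_level"] + " " + postings["search_position"]
jobTitles = jobTitles.reset_index()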
i2sDict = dict(zip(jobTitles['index'], jobTitles[0]))
stacked_skills['jobs'] = stacked_skills['level_0'].map(i2sDict)
stacked_skills["str_num"] = stacked_skills["level_0"].astype(str)  # element-wise cast; str(...) would stringify the whole Series
stacked_skills["score"] = 463-stacked_skills["level_1"]

stacked_skills.head(50)

Unnamed: 0,level_0,level_1,0,jobs
0,0,0,medicalequipmentsales,Onsite Mid senior Color Maker
1,0,1,keycompetitors,Onsite Mid senior Color Maker
2,0,2,terminology,Onsite Mid senior Color Maker
3,0,3,technology,Onsite Mid senior Color Maker
4,0,4,trends,Onsite Mid senior Color Maker
5,0,5,challenges,Onsite Mid senior Color Maker
6,0,6,reimbursement,Onsite Mid senior Color Maker
7,0,7,governmentregulation,Onsite Mid senior Color Maker
8,0,8,bdofferings,Onsite Mid senior Color Maker
9,0,9,pipelinemanagement,Onsite Mid senior Color Maker
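
The same reshaping can be followed end to end on a toy example. Everything here is invented for illustration: the two postings, the job labels in `toy_jobs`, and the `skills` column name chosen for the renamed column.

```python
import pandas as pd

# Toy stacked Series: the MultiIndex is (posting row, skill position).
toy_b = pd.DataFrame({0: ["nursing", "sales"], 1: ["bsn", None]}, index=[10, 11])
toy_stacked = toy_b.stack().dropna()

# Hypothetical job labels keyed by the same row index as the postings.
toy_jobs = pd.Series({10: "Onsite Mid senior Nurse", 11: "Remote Associate Seller"})

# reset_index() turns the two index levels into columns: level_0, level_1, 0.
edges = toy_stacked.reset_index()
edges = edges.rename(columns={0: "skills"})

# Look up each posting's label, and score earlier-listed skills higher.
edges["jobs"] = edges["level_0"].map(toy_jobs)
edges["score"] = toy_b.shape[1] - edges["level_1"]

print(edges[["jobs", "skills", "score"]])
```

Each row is now one (job, skill) pair with a weight, which is the shape an edge list needs.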


At this point, the data is formed into an edge list and is ready to be ingested into your favorite graph analytics platform.  Today, we're going to use `NetworkX`...just with a twist.  RAPIDS also has a package called `nx-cugraph`.  It works like `cuDF.pandas`, but for `NetworkX`.  All you have to do is pip install `nx-cugraph` on a system with a CUDA-enabled GPU and then import `networkx`, and it will do all the CPU+GPU hybrid heavy lifting in the background for you with zero code changes.

In [None]:
!pip install nx-cugraph

In [16]:
import networkx as nx

In [64]:
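# the skill labels sit in column 0 after reset_index(); rename it so target='skills' resolves
G = nx.from_pandas_edgelist(stacked_skills.rename(columns={0: "skills"}), source='jobs', target='skills', edge_attr='score')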

Now you have your data formed into a graph and ready for any analytics that you may want to do.
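
As one example of such analytics, the sketch below builds a tiny graph from a made-up edge list and finds the most connected node with degree centrality. It uses plain `networkx` calls and makes no claim about which of them `nx-cugraph` accelerates.

```python
import networkx as nx
import pandas as pd

# Toy edge list with the same three columns as the notebook's graph input.
toy_edges = pd.DataFrame(
    {
        "jobs": ["Nurse", "Nurse", "Seller", "Seller", "Chef", "Chef"],
        "skills": ["nursing", "communication", "sales", "communication", "cooking", "communication"],
        "score": [2, 1, 2, 1, 2, 1],
    }
)

G_toy = nx.from_pandas_edgelist(toy_edges, source="jobs", target="skills", edge_attr="score")

# Degree centrality: the share of other nodes each node is connected to.
centrality = nx.degree_centrality(G_toy)
most_connected = max(centrality, key=centrality.get)

print(G_toy.number_of_nodes(), G_toy.number_of_edges(), most_connected)
```

On the real graph, the same call would surface the skills shared across the most job labels.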