# Using cuDF.pandas on string-based ETL

Pandas is the preeminent CPU library for data preparation and transformation in Python data science, also known as the PyData stack.  cuDF, part of the RAPIDS library ecosystem which uses NVIDIA CUDA-enabled GPUs to accelerate the PyData stack, is a pandas-like GPU library that accelerates data preparation on dataframes, including thin dataframes - even those with string sequences in the rows as long as a short paragraph.  RAPIDS just introduced `cudf.pandas`, which allows zero-code-change acceleration of your pandas-based workflow.  It does the heavy lifting of running the operations it supports on the GPU and falling back to CPU pandas for the rest, allowing you to get the best of both worlds.  

In this notebook, we will mostly be using `cudf.pandas` to preprocess portions of a skills and job posting dataset.  But data science isn't clean...and this won't be the cleanest, nicest notebook.  Real data science is messy and mistakes are made.  Fixing those mistakes can be costly - especially in time.  

We'll start off by doing a minor performance comparison of pandas CPU code versus the zero-code-change, GPU+CPU hybrid code in `cudf.pandas`.  Then we'll give you the choice to stay in `cudf.pandas` to complete the workflow, or revert to CPU-only pandas.  You're smart people.  You know where this is going.

Let's begin!

In [1]:
%load_ext cudf.pandas
import pandas as pd

# Download your data:

Data can be found here: https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024/data?select=linkedin_job_postings.csv

In [2]:
# !kaggle datasets download asaniczka/1-3m-linkedin-jobs-and-skills-2024

/usr/bin/sh: 1: kaggle: not found


In [3]:
# !unzip 1-3m-linkedin-jobs-and-skills-2024

/usr/bin/sh: 1: unzip: not found


In [4]:
# !wget https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024/download?datasetVersionNumber=2

--2024-05-13 18:08:11--  https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024/download?datasetVersionNumber=2
Resolving www.kaggle.com (www.kaggle.com)... 35.244.233.98
Connecting to www.kaggle.com (www.kaggle.com)|35.244.233.98|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /account/login?titleType=dataset-downloads&showDatasetDownloadSkip=False&messageId=datasetsWelcome&returnUrl=%2Fdatasets%2Fasaniczka%2F1-3m-linkedin-jobs-and-skills-2024%2Fversions%2F2%3Fresource%3Ddownload [following]
--2024-05-13 18:08:11--  https://www.kaggle.com/account/login?titleType=dataset-downloads&showDatasetDownloadSkip=False&messageId=datasetsWelcome&returnUrl=%2Fdatasets%2Fasaniczka%2F1-3m-linkedin-jobs-and-skills-2024%2Fversions%2F2%3Fresource%3Ddownload
Reusing existing connection to www.kaggle.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘download?datasetVersionNumber=2’

download?datasetVer

# Read your data

In [15]:
skills = pd.read_csv("job_skills.csv")

In [16]:
postings = pd.read_csv("linkedin_job_postings.csv")

# See if the jobs correlate

In [17]:
postings.head()

Unnamed: 0,job_link,last_processed_time,got_summary,got_ner,is_being_worked,job_title,company,job_location,first_seen,search_city,search_country,search_position,job_level,job_type
0,https://www.linkedin.com/jobs/view/account-exe...,2024-01-21 07:12:29.00256+00,t,t,f,Account Executive - Dispensing (NorCal/Norther...,BD,"San Diego, CA",2024-01-15,Coronado,United States,Color Maker,Mid senior,Onsite
1,https://www.linkedin.com/jobs/view/registered-...,2024-01-21 07:39:58.88137+00,t,t,f,Registered Nurse - RN Care Manager,Trinity Health MI,"Norton Shores, MI",2024-01-14,Grand Haven,United States,Director Nursing Service,Mid senior,Onsite
2,https://www.linkedin.com/jobs/view/restaurant-...,2024-01-21 07:40:00.251126+00,t,t,f,RESTAURANT SUPERVISOR - THE FORKLIFT,Wasatch Adaptive Sports,"Sandy, UT",2024-01-14,Tooele,United States,Stand-In,Mid senior,Onsite
3,https://www.linkedin.com/jobs/view/independent...,2024-01-21 07:40:00.308133+00,t,t,f,Independent Real Estate Agent,Howard Hanna | Rand Realty,"Englewood Cliffs, NJ",2024-01-16,Pinehurst,United States,Real-Estate Clerk,Mid senior,Onsite
4,https://www.linkedin.com/jobs/view/group-unit-...,2024-01-19 09:45:09.215838+00,f,f,f,Group/Unit Supervisor (Systems Support Manager...,"IRS, Office of Chief Counsel","Chamblee, GA",2024-01-17,Gadsden,United States,Supervisor Travel-Information Center,Mid senior,Onsite


In [19]:
skills.head()

Unnamed: 0,job_link,job_skills
0,https://www.linkedin.com/jobs/view/housekeeper...,"Building Custodial Services, Cleaning, Janitor..."
1,https://www.linkedin.com/jobs/view/assistant-g...,"Customer service, Restaurant management, Food ..."
2,https://www.linkedin.com/jobs/view/school-base...,"Applied Behavior Analysis (ABA), Data analysis..."
3,https://www.linkedin.com/jobs/view/electrical-...,"Electrical Engineering, Project Controls, Sche..."
4,https://www.linkedin.com/jobs/view/electrical-...,"Electrical Assembly, Point to point wiring, St..."


In [20]:
skills.count()

job_link      1296381
job_skills    1294346
dtype: int64

In [13]:
postings.count()

job_link               1348454
last_processed_time    1348454
got_summary            1348454
got_ner                1348454
is_being_worked        1348454
job_title              1348454
company                1348443
job_location           1348435
first_seen             1348454
search_city            1348454
search_country         1348454
search_position        1348454
job_level              1348454
job_type               1348454
dtype: int64

In [22]:
postings = postings.merge(skills, on="job_link")

In [23]:
postings.count()

job_link               1296381
last_processed_time    1296381
got_summary            1296381
got_ner                1296381
is_being_worked        1296381
job_title              1296381
company                1296372
job_location           1296362
first_seen             1296381
search_city            1296381
search_country         1296381
search_position        1296381
job_level              1296381
job_type               1296381
job_skills             1294346
dtype: int64

In [10]:
# ap = postings["job_link"].str.split("/", expand=True)

In [9]:
# a = skills["job_link"].str.split("/", expand=True)

In [None]:
print(a[5])

In [None]:
a2 = a[5].str.split("-")

In [None]:
a2.head()

In [None]:
a2[1][-1]

In [None]:
a[0][-1].split()

# Assessing the Skills

In [24]:
%%time
b = postings["job_skills"].str.split(",", expand=True)

CPU times: user 271 ms, sys: 63.9 ms, total: 335 ms
Wall time: 332 ms


Alright, we're already way faster.


In [25]:
b.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,453,454,455,456,457,458,459,460,461,462
0,Medical equipment sales,Key competitors,Terminology,Technology,Trends,Challenges,Reimbursement,Government regulation,BD offerings,Pipeline management,...,,,,,,,,,,
1,Nursing,Bachelor of Science in Nursing,Masters Degree in Nursing,Care management experience,Clinical experience in nursing,Licensure to practice nursing in Michigan,Population management,Selfmanagement,Education,Oversight of registries,...,,,,,,,,,,
2,Restaurant Operations Management,Inventory Management,Food and Beverage Ordering,Profit Optimization,Guest Service,Front and Back of House Coordination,Employee Performance Management,Discipline and Rewards Management,Safety and Sanitation Maintenance,Directional Flow Management,...,,,,,,,,,,
3,Real Estate,Customer Service,Sales,Negotiation,Communication,Home Listings,Local Real Estate Market,Representation Contracts,Purchase Agreements,Closing Statements,...,,,,,,,,,,
4,Nursing,BSN,Medical License,Virtual RN,Nursing Support,Diversity,Equity,Inclusion,Equal Opportunity Employer,,...,,,,,,,,,,


1. graph.renumber()
1. show edge list
1. make communities

look at dining set dataframe





Wow, someone actually created a role with 463 required skills.  I hope they found exactly the person that they were looking for.  However, this makes for a really wide notebook.  Let's find out more about this exploded data.  This may take a while, and uses the CPU.

In [26]:
desc_b = b.describe()

In [27]:
desc_b

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,453,454,455,456,457,458,459,460,461,462
count,1294346,1291805,1289206,1283856,1275959,1264390,1248930,1227944,1200780,1167436,...,1,1,1,1,1,1,1,1,1,1
unique,146834,216425,252188,272425,284092,290911,296160,298267,297268,295673,...,1,1,1,1,1,1,1,1,1,1
top,Customer service,Customer service,Communication,Communication,Communication,Communication,Communication,Communication,Communication,Communication,...,WorkLife Balance,Stress Management,Conflict Resolution,Negotiation,Cultural Competence,Ethical DecisionMaking,Patient Safety,Quality Improvement,Risk Management,Compliance
freq,62073,21053,23772,24781,26945,27592,27051,25530,23450,20812,...,1,1,1,1,1,1,1,1,1,1
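
As a side note on that 463-skill posting: if all you want is the number of skills per posting, you may not need the wide matrix at all. The following is a minimal sketch on a tiny, made-up `job_skills` Series (the sample strings are illustrative, not rows from the dataset) that counts skills by splitting without `expand=True`.

```python
import pandas as pd

# Hypothetical sample standing in for postings["job_skills"].
job_skills = pd.Series(
    ["Sales, Negotiation, Communication", "Nursing", None]
)

# Split into lists (no expand), then take the length of each list.
# Missing values stay missing instead of becoming a row of NaN columns.
n_skills = job_skills.str.split(",").str.len()

print(n_skills.max())     # largest number of skills on one posting
print(n_skills.idxmax())  # index label of that posting
```

On the real data, `n_skills.idxmax()` would point at the posting responsible for all those extra columns.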


Well, the top skills here are `Customer service` and then a whole lot of `Communication`.  Let's see what the aggregates are.  We're going to keep this pandas heavy and, instead of using a `set`, we'll do `unique` and `concat`.

In [33]:
%%time
b2 = b.stack().dropna()

CPU times: user 25.7 s, sys: 3.76 s, total: 29.5 s
Wall time: 29.3 s


In [29]:
%%time 
uniq_b = pd.Series()
for i in range(0,463):
  #print(i)
  uniq_b = pd.concat([uniq_b, b[i].dropna()])

CPU times: user 2min 21s, sys: 2.12 s, total: 2min 23s
Wall time: 2min 23s


What's going on here?  What's with the `for` loop?  And why does one method take around 30 seconds while the other takes nearly 2.5 minutes?  Well, it all comes down to the RAM available both on your CPU and GPU.

If you happen to have Colab Pro and pick a V100 or A100, or have a memory-extended instance, you can just use `stack()` and then drop NA. If your GPU is a T4, like most of you will have, you need a far more memory-friendly way to do this - GPU or CPU.  Without explicitly dropping NA values, we run out of memory very quickly on both CPU and GPU using more traditional methods like `.stack()`.  The loop is slower, but it works and lets you keep going.  Here is a screenshot of the counts, so you know that the numbers are the same.

26,925,304 skill mentions
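
A possible variation on the column-by-column approach is to drop NA values per column, collect the pieces in a list, and call `concat` once at the end instead of once per column. This is only a sketch on a tiny, made-up frame (the helper name `collect_skills` is hypothetical), and whether it is faster or more memory friendly on a T4 is something you would need to measure yourself.

```python
import pandas as pd

def collect_skills(wide: pd.DataFrame) -> pd.Series:
    """Gather all non-null cells of a wide frame into one long Series."""
    # Drop NA per column so no full-size NaN-padded copy is ever built.
    parts = [wide[col].dropna() for col in wide.columns]
    # A single concat avoids re-copying the growing result on every pass.
    return pd.concat(parts)

# Hypothetical stand-in for the exploded skills matrix `b`.
wide = pd.DataFrame(
    {0: ["Sales", "Nursing", "Cooking"], 1: ["Negotiation", None, "Safety"]}
)
long = collect_skills(wide)
print(long.count())
```

The count should match the sum of the per-column non-null counts, which is the same check the notebook does with `b2.count()` and `uniq_b.count()`.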

The other cool thing to consider is "How do I know this?".  Because this is the cleaned-up version of the notebook.  When using the speed of cuDF.pandas, there is something really great about finding out VERY quickly that something won't work, instead of waiting and wondering until it just fails.  Plus, if and when the kernel crashes, you can just restart and quickly get back to your last working point.  Imagine having your workflow crash when it took you 5-10 minutes (or far longer) to get there instead of about a minute.  

Let's just say I know your frustration and switched over to doing this entirely on cuDF.pandas first, then did the timings on CPU.  I got to hang with my wife quite a bit more on Mother's Day because I used cuDF.pandas.  What are you missing out on in life?  While you ponder that, let's get that count and then get the list of unique skills.

In [36]:
b2.count()

26925304

In [37]:
uniq_b.count()

26925304

In [39]:
%%time
skills_count = uniq_b.value_counts()


CPU times: user 43.8 ms, sys: 3.98 ms, total: 47.8 ms
Wall time: 45.4 ms


In [40]:
%%time
ub = uniq_b.unique()

CPU times: user 11.3 s, sys: 450 ms, total: 11.8 s
Wall time: 11.7 s


In [41]:
len(ub)

3390874

Just under 3.4 million unique skills.  Now let's take a look at the top 49 of these unique skills!

In [42]:
print(skills_count.head(49))

 Communication             356422
 Teamwork                  223058
 Leadership                162949
 Communication skills      104735
 Customer service          104136
 Problem Solving           101751
 Customer Service           93468
 Problemsolving             92228
 Collaboration              86650
 Training                   82882
 Communication Skills       77287
 Attention to detail        75175
 Time management            72214
 Time Management            69654
 Microsoft Office Suite     69573
 Project Management         67238
 Sales                      65511
 Scheduling                 63271
Customer service            62073
Nursing                     60313
 Multitasking               59075
 Adaptability               58674
 Attention to Detail        57602
 Patient Care               57107
 Flexibility                56489
 Microsoft Office           55550
 Interpersonal skills       54896
 Documentation              51417
 Organization               46746
 Problem solvi

So, as expected, `Communication` reigns supreme, followed by `Teamwork`, `Leadership`, and `Communication skills`. As we skim down then... hey wait a sec...is that another "Communication Skills"?!  Wait, what?  Is that a capital S?

Oh no!  The skill label text has variations in characters for the same skills.  3 versions of "Problem Solving" need a bit of their own problem solving.  Looks like we have to do it again...

In [12]:
uniq_b.head()

0        Building Custodial Services
1                   Customer service
2    Applied Behavior Analysis (ABA)
3             Electrical Engineering
4                Electrical Assembly
dtype: object

Just to show off a bit, we're going to show you how each of these actions will clean up some of the duplicate skills.  

First, we'll strip all the leading and trailing whitespace, then run a count.  
Next, we'll set the skills to lowercase using `lower()` and run another count.
Finally, we'll remove all remaining whitespace and do a final count.  Since the labels are already lowercased at that point, keep the original labels around if you want to restore readable spacing later.

In [65]:
%%time
ub2 = b2.str.strip()
skills_count2 = ub2.value_counts()
print(skills_count2.count())
ub2 = ub2.str.lower()
skills_count2 = ub2.value_counts()
print(skills_count2.count())
ub2 = ub2.str.replace(" ", "")
skills_count2 = ub2.value_counts()
print(skills_count2.count())

3301513
2773203
2727432
CPU times: user 275 ms, sys: 43.9 ms, total: 319 ms
Wall time: 315 ms


In [69]:
3390874-2727432

663442

Wow!  We collapsed over 660K duplicate skill labels, or roughly 20% of the unique labels, with those 3 different string manipulations...and we did it in milliseconds.  We even had spare time to count everything 3 times and print out the answer 3 times.  

Of course, if we had done this in a single step, and didn't print out the intermediate counts, it would be faster, but you were already pretty impressed before we said anything.
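
For reference, the single-step version could look something like the sketch below. It chains the same three string operations and counts only once. The sample labels are made up for illustration, and `regex=False` is passed explicitly so the space is treated as a literal character.

```python
import pandas as pd

# Hypothetical sample of raw skill labels with the kinds of variations seen above.
raw = pd.Series(
    [" Communication", "Communication skills", "communication",
     " Problem Solving", "Problemsolving"]
)

# Strip, lowercase, and remove inner spaces in one chained expression.
normalized = (
    raw.str.strip()
       .str.lower()
       .str.replace(" ", "", regex=False)
)

# Count once, at the end.
counts = normalized.value_counts()
print(counts)
```

Five raw labels collapse to three distinct normalized labels here.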

In [70]:
skills_count2.head(49)

communication           370143
problemsolving          278365
customerservice         278138
teamwork                245258
communicationskills     195956
leadership              185188
timemanagement          143293
attentiontodetail       134017
projectmanagement       121578
interpersonalskills     100273
patientcare              99940
sales                    93031
nursing                  88015
collaboration            87116
training                 83656
dataanalysis             81973
microsoftofficesuite     75567
organizationalskills     75278
inventorymanagement      71913
highschooldiploma        67380
scheduling               64461
bachelor'sdegree         63491
multitasking             62117
analyticalskills         60769
microsoftoffice          60608
decisionmaking           59334
adaptability             59121
flexibility              56896
criticalthinking         53239
documentation            51875
organization             47045
problemsolvingskills     46093
safety  

So there are `problemsolving` and `problemsolvingskills` still...but...we'll pretend that you mean something else, right?  No, of course we won't.  There is almost no time penalty for fixing it.

In [71]:
ub2 = ub2.str.replace("skills", "")
skills_count2 = ub2.value_counts()
print(skills_count2.count())

2715253
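
One thing to keep in mind is that replacing `"skills"` this way removes it wherever it appears in a label, not only at the end. If you only want to drop a trailing `skills`, an anchored regular expression is one option. This is a sketch on made-up labels, not a change to the counts shown in this notebook.

```python
import pandas as pd

# Hypothetical normalized labels; the second one starts with "skills".
labels = pd.Series(["communicationskills", "skillsassessment", "problemsolving"])

# `$` anchors the match to the end of the string, so only a suffix is removed.
trimmed = labels.str.replace(r"skills$", "", regex=True)
print(trimmed.tolist())
```

Whether the anchored or unanchored version is the better choice depends on how you want labels like the second one to be grouped.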


In [72]:
skills_count2.head(49)

communication           566099
problemsolving          324458
customerservice         292679
teamwork                254481
leadership              207856
timemanagement          156758
attentiontodetail       134036
projectmanagement       124272
interpersonal           102111
patientcare             100196
sales                    98676
collaboration            90712
nursing                  89540
training                 85271
dataanalysis             82592
organizational           76346
microsoftofficesuite     75581
inventorymanagement      72251
highschooldiploma        67380
decisionmaking           65056
scheduling               64746
multitasking             64516
bachelor'sdegree         63491
analytical               63421
microsoftoffice          60889
adaptability             59218
criticalthinking         58253
flexibility              56918
documentation            53495
organization             52582
writtencommunication     47473
safety                   46061
troubles

Just too fast and too easy.  The final result is also dramatically different than the original `skills_count`.  

Okay, let's save our output to `csv` as a checkpoint and then move on to processing this data.

In [73]:
uniq_b.to_csv("unique_job_skills.csv")

In [74]:
b.to_csv("skills_matrix.csv")

In [75]:
b2.to_csv("b2.csv")

In [None]:
#!pip install nx-cugraph

In [80]:
import networkx as nx

In [78]:
nx.__version__

'3.3'

nx-cugraph and its dispatcher run with networkx 3.3+.  Once nx-cugraph is pip installed, it just works with networkx.  

In [83]:
# import cudf
import cugraph
from cugraph.structure import number_map

def renumbered_edgelist(df):
    renumbered_df, num_map = number_map.NumberMap.renumber(df, "src", "dst")
    new_df = renumbered_df[["renumbered_src", "renumbered_dst", "wgt"]]
    column_names = {"renumbered_src": "src", "renumbered_dst": "dst"}
    new_df = new_df.rename(columns=column_names)
    return new_df


if "string" in graph_file.metadata["col_types"]:
        df = renumbered_edgelist(graph_file.get_edgelist(download=True))
        M = get_coo_array(df)

NameError: name 'graph_file' is not defined

In [84]:
postings.head()

Unnamed: 0,job_link,last_processed_time,got_summary,got_ner,is_being_worked,job_title,company,job_location,first_seen,search_city,search_country,search_position,job_level,job_type,job_skills
0,https://www.linkedin.com/jobs/view/account-exe...,2024-01-21 07:12:29.00256+00,t,t,f,Account Executive - Dispensing (NorCal/Norther...,BD,"San Diego, CA",2024-01-15,Coronado,United States,Color Maker,Mid senior,Onsite,"Medical equipment sales, Key competitors, Term..."
1,https://www.linkedin.com/jobs/view/registered-...,2024-01-21 07:39:58.88137+00,t,t,f,Registered Nurse - RN Care Manager,Trinity Health MI,"Norton Shores, MI",2024-01-14,Grand Haven,United States,Director Nursing Service,Mid senior,Onsite,"Nursing, Bachelor of Science in Nursing, Maste..."
2,https://www.linkedin.com/jobs/view/restaurant-...,2024-01-21 07:40:00.251126+00,t,t,f,RESTAURANT SUPERVISOR - THE FORKLIFT,Wasatch Adaptive Sports,"Sandy, UT",2024-01-14,Tooele,United States,Stand-In,Mid senior,Onsite,"Restaurant Operations Management, Inventory Ma..."
3,https://www.linkedin.com/jobs/view/independent...,2024-01-21 07:40:00.308133+00,t,t,f,Independent Real Estate Agent,Howard Hanna | Rand Realty,"Englewood Cliffs, NJ",2024-01-16,Pinehurst,United States,Real-Estate Clerk,Mid senior,Onsite,"Real Estate, Customer Service, Sales, Negotiat..."
4,https://www.linkedin.com/jobs/view/registered-...,2024-01-21 08:08:19.663033+00,t,t,f,Registered Nurse (RN),Trinity Health MI,"Muskegon, MI",2024-01-14,Muskegon,United States,Nurse Practitioner,Mid senior,Onsite,"Nursing, BSN, Medical License, Virtual RN, Nur..."


In [89]:
g = postings["job_type"] + " " + postings["job_level"] + " " + postings["search_position"]

In [90]:
g.head()

0                 Onsite Mid senior Color Maker
1    Onsite Mid senior Director Nursing Service
2                    Onsite Mid senior Stand-In
3           Onsite Mid senior Real-Estate Clerk
4          Onsite Mid senior Nurse Practitioner
dtype: object

In [None]:
# Build a two-column frame (one row per posting) rather than a two-row frame.
gnx = pd.DataFrame({"job": g, "job_skills": postings["job_skills"]})

In [95]:
type(ub2)

pandas.core.series.Series

In [96]:
ub2

0        0                          medicalequipmentsales
         1                                 keycompetitors
         2                                    terminology
         3                                     technology
         4                                         trends
                                  ...                    
1296380  17    abilitytoworkbothindependentlyandwithateam
         18          workinginafunandenergeticenvironment
         19                      providingservicetoguests
         20                         interactingwithpeople
         21        abilitytoworknightsweekendsandholidays
Length: 26925304, dtype: object

In [99]:
g.value_counts()

Onsite Mid senior Christian Science Nurse           12911
Onsite Mid senior Consultant Education               9163
Onsite Mid senior Dermatologist                      8716
Onsite Mid senior Account Executive                  8383
Onsite Mid senior Car Inspector                      8283
                                                    ...  
Onsite Mid senior Supervisor Component Assembler        1
Onsite Mid senior Supervisor Wheel Shop                 1
Onsite Mid senior Tool-Machine Set-Up Operator          1
Onsite Mid senior Town Clerk                            1
Onsite Mid senior Treatment-Plant Mechanic              1
Name: count, Length: 3479, dtype: int64

Color me surprised.  That's not what I think anyone was expecting to have the most job postings...at all.

In [100]:
postings["job_level"].value_counts()

job_level
Mid senior    1155276
Associate      141105
Name: count, dtype: int64

In [101]:
postings["job_type"].value_counts()

job_type
Onsite    1285565
Hybrid       6560
Remote       4256
Name: count, dtype: int64

In [103]:
postings["search_position"].value_counts()

search_position
Account Executive                     19465
Christian Science Nurse               16038
Consultant Education                  12133
Change Person                         12021
Circulation-Sales Representative      10972
                                      ...  
Stationary-Engineer Supervisor            1
Still Operator Batch Or Continuous        1
Supervisor Wheel Shop                     1
Ticketing Clerk                           1
Town Clerk                                1
Name: count, Length: 1923, dtype: int64

In [116]:
mer=pd.concat([g,ub2], axis=1)

KeyboardInterrupt: 

In [117]:
mer2 = ub2.to_frame().join(g)

ValueError: Other Series must have a name
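
Both attempts above trip over the same two details: `g` has no name, and `ub2` carries a two-level index (posting row, skill position) while `g` is indexed by posting row only. One way to line them up, sketched here on tiny made-up data with hypothetical column names `skill` and `job`, is to name both Series and drop the second index level before joining.

```python
import pandas as pd

# Hypothetical stand-in for ub2: one skill per (posting row, skill position).
skills_long = pd.Series(
    ["sales", "nursing", "sales"],
    index=pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 0)]),
)
# Hypothetical stand-in for g: one job label per posting row.
job = pd.Series(["Onsite A", "Onsite B"])

# Name the skills Series and keep only the posting-row level of its index.
flat = skills_long.rename("skill").reset_index(level=1, drop=True)

# join() needs a named Series on the right; it aligns on the posting row.
merged = flat.to_frame().join(job.rename("job"))
print(merged)
```

Each skill row now carries the job label of the posting it came from.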

In [None]:
mer.head()

In [97]:
h = ub2.reset_index()

In [98]:
h.head()

Unnamed: 0,level_0,level_1,0
0,0,0,medicalequipmentsales
1,0,1,keycompetitors
2,0,2,terminology
3,0,3,technology
4,0,4,trends


In [119]:
h.head()

Unnamed: 0,level_0,level_1,0
0,0,0,medicalequipmentsales
1,0,1,keycompetitors
2,0,2,terminology
3,0,3,technology
4,0,4,trends


In [None]:
h.rename

In [122]:
h.dtypes

level_0     int64
level_1     int64
0          object
dtype: object

In [123]:
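# Convert each posting id to its own string (str() on the whole Series would
# assign one giant string to every row).
h["level_0"] = h["level_0"].astype(str)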

In [None]:
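# Build a NetworkX graph from the edge list; the skill column is labeled 0 (an int).
G_nx = nx.from_pandas_edgelist(h, source="level_0", target=0)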

In [None]:
# Skill labels are strings, so map each distinct label to an integer code.
h["int_skills"] = pd.factorize(h[0])[0]

In [124]:
G = cugraph.Graph()
G.from_pandas_edgelist(h, source='level_0', destination=0)

OverflowError: value too large to convert to int
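
The traceback alone doesn't say what triggers this error. One thing worth trying, on the assumption that the vertex ids are the problem, is to map postings and skills into a single integer id space before handing the edge list to a graph library. The sketch below uses only pandas, and the column names `src`, `dst`, `src_id`, and `dst_id` are hypothetical.

```python
import pandas as pd

# Hypothetical edge list: posting id -> skill label.
edges = pd.DataFrame(
    {"src": ["0", "0", "1"], "dst": ["sales", "nursing", "sales"]}
)

# Prefix each side so a posting id can never collide with a skill label.
labels = pd.concat(["job:" + edges["src"], "skill:" + edges["dst"]])

# factorize assigns one integer code per distinct label, in order of appearance.
codes, uniques = pd.factorize(labels)

n = len(edges)
edges["src_id"] = codes[:n]   # first half of codes belongs to the sources
edges["dst_id"] = codes[n:]   # second half belongs to the destinations
print(edges)
```

`uniques` keeps the mapping from integer id back to the original label, so results computed on the integer graph can be translated back afterwards.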

In [None]:
eig = cugraph.eigenvector_centrality(G, max_iter=1000, tol=1.0e-3)

In [None]:
c = b.count()

In [None]:
cDict = dict(c)