The following is a quick research project on data from "ai-jobs.net", a popular data science job listing site.
I use simple "classical" (non-LLM) machine learning methods to glean insights into which tools are currently trending in the data science field.
All data was obtained via web scraping with the "jobscryer" repo: https://github.com/tyleryou/jobscryer.

In [2]:
import os
import re
import nltk
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer 
from nltk.stem import WordNetLemmatizer 

nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tyler\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\tyler\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [5]:
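# Path to the scraped CSV, read from an environment variable.
# os.environ[...] raises a KeyError if the variable is unset, instead of passing None to read_csv.
path = os.environ['jobscryer_data_path']
df_init = pd.read_csv(path)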

In [5]:
# Work on a copy so the raw data in df_init stays untouched.
df = df_init.copy()

# Fill NaN values with an empty string; null values throw errors when stemming or lemmatizing the words.
df['description'] = df['description'].fillna('')

# Strip non-alphabetical text and lowercase everything. Digits and punctuation add little
# when counting words. Note that hyphenated words get merged ("hands-on" becomes "handson").
df['description'] = df['description'].str.replace(r'[^a-zA-Z ]', '', regex=True).str.lower()

# Initialize the WordNet lemmatizer. A lemmatizer is similar to a stemmer (e.g. PorterStemmer in nltk)
# in that both reduce words to a base form. Suffixes and inflections are removed, for example:
# 'runs' and 'running' become 'run'. This is the stem (or base) of the word.
# The difference is that stemming strips suffixes by rule without checking that the result is a real word.
# Lemmatizing maps the word to its dictionary form, known as the 'lemma'.
# Stemming is faster and simpler, but some words get chopped off; for example, 'experiencing' becomes 'experienc'.
# Lemmatizing is slower, which is fine for smaller datasets, and it keeps words readable.
# Since the dataset we're using is small, we can just lemmatize it.
# Note: lemmatize() treats every word as a noun unless a part-of-speech tag is passed, so verb forms
# such as 'running' are left unchanged here.

lm = WordNetLemmatizer() 

# Create a set of stopwords with non-alphabetic characters removed (so "don't" becomes "dont", matching the
# cleaned text). The nltk stopwords corpus is a list of words that are generally useless for NLP, such as
# 'a', 'i', 'the', 'has'. These words give no insight into the actual meaning of the text.
# However, 'not' is included in the stopwords, and its presence can completely change
# the meaning, e.g. 'not good' vs 'good'.

word_set = set([re.sub(r'[^a-zA-Z]', '', word) for word in stopwords.words('english')])
word_set.remove('not') # taking out "not" because this is a useful word

def check_words(row): 
    lemmatized_words = [lm.lemmatize(word) for word in row.split() if word not in word_set]
    return ' '.join(lemmatized_words)

# Apply the function to the 'description' column because this is what we're using to ultimately glean insights.
df['description'] = df['description'].apply(check_words)
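
The cleaning regex above deletes every character that is not a letter or a space, which is why merged tokens such as "crossfunctional" and "handson" show up in the phrase counts later on. The following is a minimal sketch of that effect on a made-up sentence. `clean_drop` mirrors the notebook's regex, while `clean_space` is a hypothetical alternative (not what this notebook uses) that substitutes a space instead, so words separated only by punctuation or a newline stay apart.

```python
import re


def clean_drop(text: str) -> str:
    """Mirror the notebook's regex: delete anything that is not a letter or a space."""
    return re.sub(r'[^a-zA-Z ]', '', text).lower()


def clean_space(text: str) -> str:
    """Hypothetical alternative: turn runs of non-letters into one space, then trim."""
    spaced = re.sub(r'[^a-zA-Z]+', ' ', text)
    return ' '.join(spaced.lower().split())


# Made-up sample text, not taken from the scraped data.
sample = "Hands-on experience with cross-functional teams.\nSQL/Python required"

print(clean_drop(sample))   # words joined only by punctuation or a newline are fused
print(clean_space(sample))  # the same words stay separate
```

Whether the second variant is preferable depends on the goal: it keeps "sql" and "python" apart, but it also splits "hands-on" into two tokens.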

In [100]:
### Prep data for model ingestion ###
# CountVectorizer turns the corpus into a matrix of word counts (what we would actually feed into a model).
# X has one row per row of the original dataset and n columns (n = max_features).
# Each column is one word. Setting max_features = 2000 keeps the 2000 most frequent words in the corpus.
# This is a tuning parameter, as too many words risks overfitting. If a model overfits, it memorizes
# the training data instead of generalizing. However, too few words leads to underfitting,
# which gives inaccurate predictions. Larger values of max_features also require more memory and compute.
# The size of the dataset gives an indication of how large max_features should be. A good practice
# is to start with 1000 - 5000 words and tune based on the output of the model.


# A corpus is a collection of text documents, here one description per job posting. This is what we'll fit the vectorizer on.

corpus = df['description'].values

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features = 2000) 
X = cv.fit_transform(corpus).toarray()
#y = df['salary'].values # For now we won't be using a Y value because we aren't actually trying to predict any values
# at this point, we just want to find the most frequent words and most frequent phrases/word distribution.

In [101]:
# The top 10 most frequent words are shown here, and they make sense: AI jobs will need data, experience, etc.
# None of this is surprising or, more importantly, useful. The top words are simply buzzwords, so if we look at
# the less frequent words (positions 30 and up in the sorted list), we see more interesting items.

words = cv.get_feature_names_out()

word_frequencies = X.sum(axis=0)

word_freq_dict = {word: freq for word, freq in zip(words, word_frequencies.flat)}

sorted_words = sorted(word_freq_dict.items(), key=lambda x: x[1], reverse=True)

print("Most frequent words:")
for word, freq in sorted_words[:10]:
    print(f"{word}: {freq}")

Most frequent words:
data: 45807
experience: 26604
team: 13932
business: 10614
work: 10481
skill: 8710
de: 8103
year: 7739
development: 7223
model: 7178


In [102]:
# More useful but not surprising: Python is the most frequently mentioned skill in data jobs. SQL is only narrowly
# behind it, which also makes sense as SQL dominates the analytics and engineering domains.
# 'machine' is also present, which is probably from "machine learning." These are total mentions, not a count
# of postings, so this shows machine learning is widely mentioned rather than that most jobs use it.

print("Most frequent words:")
for word, freq in sorted_words[30:40]:
    print(f"{word}: {freq}")

Most frequent words:
including: 4911
python: 4599
machine: 4577
sql: 4468
develop: 4358
degree: 4174
new: 4095
environment: 4094
using: 3989
platform: 3961


In [116]:
# Next we'll use the ngram_range parameter in CountVectorizer to generate n-grams (sequences of n tokens) instead
# of single words. Then we'll analyze the distribution of these n-grams to identify frequent phrases or combinations of words.

# ngram_range gives the minimum and maximum number of words to include in each phrase. It is another parameter
# that needs tuning, similar to max_features.
vectorizer_ngrams = CountVectorizer(ngram_range=(2, 5))

X_ngrams = vectorizer_ngrams.fit_transform(corpus)

ngram_features = vectorizer_ngrams.get_feature_names_out()

ngram_frequencies = X_ngrams.sum(axis=0)

ngram_freq_dict = {ngram: freq for ngram, freq in zip(ngram_features, ngram_frequencies.flat)}

sorted_ngrams = sorted(ngram_freq_dict.items(), key=lambda x: x[1], reverse=True)

# Some interesting items here. The most commonly mentioned degree is computer science (as opposed to data science).
# Also, a bachelor's degree is mentioned more than a master's degree: "bachelor degree" occurs 1626 times while
# "master degree" occurs 854 times. Usually, we also see master's degrees listed under "preferred" rather than "required."
# I'm considering pursuing a master's degree, so seeing this gives an interesting perspective.
# It's important to consider the actual jobs being shown here, as not all jobs are the same. Some analyst positions
# will have an abundance of engineering, and some data engineering jobs will have quite a bit of analysis work.
# This data pertains to all job categories listed on the site.
#
#
# Understandably, machine learning was the most mentioned phrase. However, we can spot some key phrases here such as:
# data engineering at 1422, data analytics at 1303, data scientist at 1008. We can also see the lemmatizer doesn't merge
# every related form, as both "data engineering" and "data engineer" are present.
#
# It may be useful to pull the actual job titles to see how many jobs there are per category instead of parsing it from
# the description of the job. I ultimately cared more about the descriptions than the titles for this project.
#
# I tuned the vectorizer (CountVectorizer above) with different ranges,
# but an ngram_range of (2, 5) yielded the most relevant results.
#
# A next step may be to go back and grab all text after "Qualifications",
# then predict salary ranges based on job, experience, region, and degree level.

print("\nMost frequent n-grams:")
for ngram, freq in sorted_ngrams[:100]:
    print(f"{ngram}: {freq}")


Most frequent n-grams:
machine learning: 4366
year experience: 3357
data science: 2712
computer science: 2500
experience working: 2215
communication skill: 2064
experience data: 1864
best practice: 1762
data pipeline: 1633
bachelor degree: 1626
data analysis: 1527
related field: 1425
data engineering: 1422
data quality: 1383
skill ability: 1327
data analytics: 1303
big data: 1262
power bi: 1215
ability work: 1136
team member: 1101
data management: 1051
degree computer: 1047
data visualization: 1022
data model: 1017
data scientist: 1008
degree computer science: 994
data set: 874
data warehouse: 872
master degree: 854
programming language: 854
experience building: 851
deep learning: 817
crossfunctional team: 806
learning model: 805
handson experience: 789
data source: 769
data governance: 741
software development: 740
business intelligence: 729
excellent communication: 728
track record: 722
work closely: 706
data engineer: 693
machine learning model: 690
work experience: 688
data modeli