## Data Scientist Jobs from Europe:

Relevant URLs:

1. https://methodmatters.github.io/data-jobs-europe/
2. https://github.com/methodmatters/data-jobs-europe (contains a notebook for studying jobs.)
3. https://www.r-bloggers.com/2022/04/text-analysis-of-job-descriptions-for-data-scientists-data-engineers-machine-learning-engineers-and-data-analysts/

These are the columns in the dataset:

| **Attribute**                    | **Description**                                                                                      |
|----------------------------------|------------------------------------------------------------------------------------------------------|
| employer_id                      | Unique identifier for the employer                                                                   |
| company_name                     | Name of the company posting the job                                                                   |
| job_id                           | Unique identifier for the job description                                                            |
| job_title                        | Job title as input by the employer                                                                   |
| job_function                     | Harmonized job title managed by job listing site (values: data scientist, data engineer, data analyst, machine learning engineer) |
| job_description_text             | Text of the job advertisement (not all in English)                                                   |
| job_skills                       | List of employer-determined skill keywords (in English)                                              |
| education_desired                | List of employer-requested educational attainments                                                   |
| job_location                     | City where the job is located                                                                        |
| company_hq_location              | City and country of the company's headquarters                                                       |
| company_sector_name              | Sector in which the company is active                                                                |
| company_industry                 | Industry in which the company is active                                                              |
| company_type                     | Type of company (e.g., government, private company, etc.)                                            |
| company_size                     | Number of employees that the company has                                                             |
| company_revenue                  | Annual company revenue in USD                                                                        |
| company_year_founded             | Year the company was founded                                                                         |
| company_website                  | Company website                                                                                      |
| rating_global                    | Site users' overall rating of company                                                                |
| rating_comp_ben                  | Site users' rating of compensation & benefits (pay, bonus, etc.)                                     |
| rating_culture_values            | Site users' rating of culture and values                                                             |
| rating_career_opportunities      | Site users' rating of career opportunities                                                           |
| rating_w_life_balance            | Site users' rating of work-life balance                                                              |
| rating_sr_mgt                    | Site users' rating of senior management                                                              |
| query_country                    | Country used in the query to scrape the job ad                                                       |
| date_job_posted                  | Date the job advertisement was posted                                                                |
| date_job_expires                 | Date the job advertisement expires                                                                   |
| age_job_posting_days             | Age of the job ad (in days) on the date that it was scraped                                          |
| scraping_date                    | Date the job was scraped                                                                             |
| language                         | Language of the job description text (determined via the langid package in Python)                   |


In [59]:
import numpy as np
import pandas as pd
import pickle

from itables import show
import pprint
import ipywidgets as widgets
from IPython.display import display


import gensim
from gensim.parsing.preprocessing import *
import gensim.downloader as api
from nltk.stem import WordNetLemmatizer

In [18]:
in_dir = '../data/europe_data/'
with open(in_dir + 'omnibus_jobs_df.pickle', 'rb') as handle:
    omnibus_jobs_df = pickle.load(handle)

rng = np.random.default_rng(2913)
pp = pprint.PrettyPrinter(indent=4, compact=True,)

In [175]:
pp.pprint(rng.choice(omnibus_jobs_df.job_title, size=10))
#pp.pprint(rng.choice(omnibus_jobs_df.language, size=3))
#omnibus_jobs_df.language.value_counts()

array(['Data Scientist (m/w/d)',
       'Consultant, Data Engineer / Data Analyst, Process Bionics, Intelligent Automation, Consulting, London',
       'Data Analyst H/F', 'Data Analyst / Data Presenter (Home-based)',
       'Data Engineer Python / Freelance', 'DATA SCIENTIST (D/F/M)',
       '(Junior) Data Analytics Engineer (m/w/d)',
       'Lead Data Engineer (m/w/d)', 'Data Analyst',
       'Data Analyst - People Analytics'], dtype=object)


Let us focus on only the English jobs that have been posted.

In [176]:
en_jobs_df = omnibus_jobs_df[omnibus_jobs_df.language == 'en']

### Querying data science jobs from Europe

#### Using tf-idf

In [None]:
wn = WordNetLemmatizer()
CUSTOM_FILTER = [lambda x: x.lower(), strip_punctuation, 
                 strip_multiple_whitespaces, strip_numeric, 
                 remove_stopwords, strip_short]

all_job_strings = en_jobs_df.job_description_text.values
#all_job_strings[:3]
all_jobs_tokenized = [preprocess_string(x, CUSTOM_FILTER) for x in all_job_strings]

In [None]:
dct = gensim.corpora.Dictionary(all_jobs_tokenized)
bow_corpus = [dct.doc2bow(text) for text in all_jobs_tokenized]
tfidf = gensim.models.TfidfModel(dictionary=dct)

index = gensim.similarities.Similarity(None, corpus=tfidf[bow_corpus], num_features=len(dct))

In [196]:
#q1 = [wn.lemmatize(x) for x in preprocess_string('resourceful machine learning engineer', CUSTOM_FILTER)]
#sims = index[tfidf[dct.doc2bow(q1)]]

#np.flip(np.sort(sims))

#q1_results = np.argsort(-sims)[:10]
#q1_results
#en_jobs_df.job_description_text.values[q1_results[0]]

In [150]:
# Create a Text widget
text_widget = widgets.Text(
    value='',
    placeholder='Type something',
    description='Input:',
    disabled=False
)

return_no = widgets.IntText(
    value=10,
    description='No. to return:',
    disabled=False
)

# Create a Button widget
button = widgets.Button(
    description='Submit',
    disabled=False,
    button_style='',  # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Click me',
    icon='check'  # (FontAwesome names without the `fa-` prefix)
)

out2 = widgets.Textarea(
    value='Retrieved descriptions:',
    placeholder='Type something',
    description='String:',
    disabled=False
    #layout = widgets.Layout('1000px')
)
#out2.layout = widgets.Layout('1000px')

In [167]:
# Function to handle button click
def on_button_click(b):
    #out.clear_output()
    q1 = [wn.lemmatize(x) for x in preprocess_string(text_widget.value, CUSTOM_FILTER)]
    sims = index[tfidf[dct.doc2bow(q1)]]
    flipped_sims = np.flip(np.sort(sims))

    q1_results = np.argsort(-sims)[:return_no.value]
    out_string = ''
    for i,qq in enumerate(q1_results):
        out_string += "---\n"
        out_string += f"Rank: {i+1}, Similarity with query: {flipped_sims[i]:.3f}\n"
        out_string += f"{en_jobs_df.job_description_text.values[qq]} \n\n"

    out2.value = out_string
        
    #q1_results
    #np.flip(np.sort(sims))
    #q1_results = np.argsort(-sims)[:10]
    #out2.value = "Boo3333!\n" * 19
    
# Attach the click event handler to the button
button.on_click(on_button_click)

In [172]:
# Display the text widget and button
display(widgets.VBox([widgets.HBox([text_widget, return_no]), button, out2]))
text_widget.layout=widgets.Layout(width='800px')
out2.layout = widgets.Layout(width='1000px', height='500px')

VBox(children=(HBox(children=(Text(value='research engineer,  SQL, entry level', description='Input:', layout=…

#### Using Word2vec

In [183]:
train_corpus = [gensim.models.doc2vec.TaggedDocument(x, [i]) for i,x in enumerate(all_jobs_tokenized)]
model = gensim.models.doc2vec.Doc2Vec(vector_size=256, min_count=2, epochs=40)
model.build_vocab(train_corpus)

In [191]:
query_vec = model.infer_vector('research engineer SQL entry level'.split())

In [193]:
most_similar = model.dv.most_similar([query_vec], topn=2)

for x,y in most_similar:
    pp.pprint(all_job_strings[x])
    #print(y)

('IN SHORT At Wunderflats we believe that everyone should have the freedom to '
 'live and work wherever they want. On our platform, we offer fully furnished '
 'apartments from a rental period of one month, which can be easily booked '
 "online. Our mission: Shaping the future of housing! If you're looking for a "
 'progressive and rewarding career, supported by a company that really cares '
 'about its people, then this might be the perfect role for you. The Data and '
 'Advanced Analytics team at Wunderflats is growing and we are now looking for '
 'a talented Senior Software Engineer (f/m/d*) with a passion for Machine '
 'Learning to join us. This role will cover a broad spectrum of projects '
 'including developing automated infrastructure that builds and trains models, '
 'data pipelines and scalable endpoints. This role will also be responsible '
 'for refining, automating, and deploying to production and lastly building '
 'user interfaces for some data products. This is a per

In [187]:
# Create a Text widget
text_widget_w2v = widgets.Text(
    value='',
    placeholder='Type something',
    description='Input:',
    disabled=False
)

return_no_w2v = widgets.IntText(
    value=10,
    description='No. to return:',
    disabled=False
)

# Create a Button widget
button_w2v = widgets.Button(
    description='Submit',
    disabled=False,
    button_style='',  # 'success', 'info', 'warning', 'danger' or ''
    tooltip='Click me',
    icon='check'  # (FontAwesome names without the `fa-` prefix)
)

out2_w2v = widgets.Textarea(
    value='Retrieved descriptions:',
    placeholder='Type something',
    description='String:',
    disabled=False
    #layout = widgets.Layout('1000px')
)

In [199]:
# Function to handle button click
def on_button_click_w2v(b):
    #out.clear_output()
    query_vec = model.infer_vector(text_widget_w2v.value.split())
    most_similar = model.dv.most_similar([query_vec], topn=return_no_w2v.value)

    out_string = ''
    for i,z in enumerate(most_similar):
        out_string += "---\n"
        out_string += f"Rank: {i+1}, Similarity with query: {z[1]:.3f}\n"
        out_string += f"{all_job_strings[z[0]]} \n\n"

    out2_w2v.value = out_string
    
# Attach the click event handler to the button
button_w2v.on_click(on_button_click_w2v)

In [200]:
# Display the text widget and button
display(widgets.VBox([widgets.HBox([text_widget_w2v, return_no_w2v]), button_w2v, out2_w2v]))
text_widget_w2v.layout=widgets.Layout(width='800px')
out2_w2v.layout = widgets.Layout(width='1000px', height='500px')

VBox(children=(HBox(children=(Text(value='research engineer  SQL sklearn', description='Input:', layout=Layout…

# Other Job description datasets:

1. https://www.kaggle.com/datasets/andrewmvd/data-analyst-jobs (synthetic)
2. https://www.preprints.org/manuscript/202206.0346/v1
3. https://huggingface.co/datasets/jacob-hugging-face/job-descriptions
4. https://www.kaggle.com/datasets/ravindrasinghrana/job-description-dataset (synthetic)
5. https://github.com/duyet/skill2vec-dataset (skill2vec)