- Library imports

In [45]:
import pandas as pd
import string
import requests
from bs4 import BeautifulSoup
import numpy as np

- Define Functions

In [46]:

filters = ['job_title', 'job_description']
def get_and_clean_data():
    # Load the dataset
    data = pd.read_csv('resource/software_developer_united_states_1971_20191023_1.csv')
    
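    # Select the columns from filters; .copy() returns an independent frame so the
    # column assignments below do not trigger pandas' SettingWithCopyWarning
    filtered_data = data[filters].copy()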
    
    # Clean all columns from filters
    for column in filters:
        filtered_data[column] = filtered_data[column].fillna('')  # Handle missing values
        # Avoid removing characters like # and + (important for C#, C++)
        filtered_data[column] = filtered_data[column].apply(lambda s: s.translate(str.maketrans('', '', string.punctuation.replace('#', '').replace('+', ''))))
        filtered_data[column] = filtered_data[column].apply(lambda s: s.lower())
        filtered_data[column] = filtered_data[column].apply(lambda s: s.translate(str.maketrans(string.whitespace, ' '*len(string.whitespace), '')))
    
    # Remove duplicates
    filtered_data = filtered_data.drop_duplicates()
    
    return filtered_data

def simple_tokenize(data):
    # Tokenize each column of text
    for column in filters:
        data[column] = data[column].apply(lambda s: [x.strip() for x in s.split()])
    return data
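
To make the cleaning steps concrete, here is a minimal sketch that applies the same three transformations (strip punctuation except `#` and `+`, lowercase, normalize whitespace) and the same split to a single string. The helper name `clean_and_tokenize` and the sample text are hypothetical and are not part of the notebook's pipeline.

```python
import string

# Punctuation to strip, keeping '#' and '+' so that "c#" and "c++" survive.
PUNCT_TABLE = str.maketrans('', '', string.punctuation.replace('#', '').replace('+', ''))
# Map every whitespace character (tabs, newlines, ...) to a plain space.
WHITESPACE_TABLE = str.maketrans(string.whitespace, ' ' * len(string.whitespace))

def clean_and_tokenize(text):
    """Hypothetical helper mirroring the notebook's cleaning and tokenizing steps."""
    cleaned = text.translate(PUNCT_TABLE).lower().translate(WHITESPACE_TABLE)
    return cleaned.split()

sample = "Jr. Developer (C#, C++)\nPython/SQL required!"
tokens = clean_and_tokenize(sample)
print(tokens)
# ['jr', 'developer', 'c#', 'c++', 'pythonsql', 'required']
```

One side effect worth noting: because `/` is removed rather than replaced by a space, slash-joined terms such as `Python/SQL` collapse into a single token and will not be counted as either word later on.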


- Read the data

In [47]:
# Read and store the data
data = get_and_clean_data()
# Tokenize each text column into a list of words
data = simple_tokenize(data)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_data[column] = filtered_data[column].fillna('')  # Handle missing values

In [48]:
# Print
print(data.head(10).to_markdown())

|    | job_title | job_description |

---
**Q1 :** Identify the database and programming language proficiencies essential for a junior-level role `(2pts, should be achievable by everyone)`

- First, we keep only the jobs whose title contains **junior** (or the abbreviation **jr**)


In [49]:
# Filter data for 'junior' or similar terms
junior_roles = data[data['job_title'].apply(lambda tokens: 'jr' in tokens or 'junior' in tokens)]
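
Because the titles are already tokenized, the filter above tests list membership rather than substring containment. The sketch below illustrates the difference on plain Python lists. The titles are made up for illustration, and `is_junior` is a hypothetical helper that repeats the lambda's condition.

```python
# Hypothetical tokenized titles, in the same shape simple_tokenize produces.
titles = [
    ['jr', 'software', 'developer'],
    ['junior', 'java', 'engineer'],
    ['senior', 'jruby', 'developer'],
]

def is_junior(tokens):
    """Same condition as the notebook's lambda: exact token match."""
    return 'jr' in tokens or 'junior' in tokens

token_flags = [is_junior(t) for t in titles]

# For comparison: a substring test on the re-joined title also matches "jruby".
substring_flags = ['jr' in ' '.join(t) for t in titles]

print(token_flags)      # [True, True, False]
print(substring_flags)  # [True, False, True]
```

Exact token matching avoids false positives such as `jruby`, at the cost of missing spellings that are not in the keyword list.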

In [50]:
# Print
print(junior_roles.head(30).to_markdown())

|     | job_title | job_description |

- Now we can count how often each programming language is mentioned in the job descriptions of the filtered data
    
    We check every description for each language and store the number of matching postings in a dictionary that acts as a counter, like
    
    ```
    {
        lang1 : 20,
        lang2 : 21
        ...
    }
    ```

    After that, we sort the counts in descending order and take the most frequent language as the most essential one.

In [51]:
# Map languages
program_languages = ['c','c#','c++','java','python','kotlin','swift','rust','ruby','scala','julia','lua']
languages_counter = {}
for pl in program_languages:
    # Count the postings whose tokenized description contains this language as an exact token
    count = junior_roles['job_description'].apply(lambda tokens: pl in tokens).sum()
    languages_counter[pl] = count

# Sort ranking
languages_counter = sorted(languages_counter.items(), key=lambda x: x[1], reverse=True)
# Display the ranking
print("For programming Languages: ")
for lang, count in languages_counter:
    print(f' - {lang}: {count}')

# Save the most essential language as a (name, count) tuple
top_lang = languages_counter[0]

For programming Languages: 
 - java: 270
 - c#: 194
 - c++: 129
 - python: 88
 - c: 75
 - ruby: 73
 - swift: 10
 - scala: 5
 - kotlin: 0
 - rust: 0
 - julia: 0
 - lua: 0
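
The numbers above count postings, not individual mentions, and a language is only matched as a whole token. The following sketch reproduces that counting logic on a few made-up token lists. The data and the names `postings`, `counts`, and `ranking` are illustrative assumptions.

```python
# Hypothetical tokenized descriptions.
postings = [
    ['java', 'c++', 'java'],   # 'java' appears twice but the posting counts once
    ['java', 'c', 'c#'],
    ['javascript', 'c#'],      # 'javascript' is not the token 'java'
    ['java'],
]
languages = ['c', 'c#', 'c++', 'java']

# For each language, count the postings that contain it as an exact token.
counts = {lang: sum(lang in tokens for tokens in postings) for lang in languages}

# Sort by count, highest first; ties keep their original order.
ranking = sorted(counts.items(), key=lambda item: item[1], reverse=True)
print(ranking)
# [('java', 3), ('c#', 2), ('c', 1), ('c++', 1)]
```

This is why `c` does not inherit the counts of `c#` or `c++`: the comparison is against whole tokens in a list, not against characters in a string.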


In [52]:
# List of database keywords ('aws' and 'azure' are cloud platforms rather than databases)
databases = ['mysql', 'postgresql', 'mssql', 'oracle', 'sqlite', 'mongodb', 'cassandra', 'redis', 'dynamodb', 'firebase', 'neo4j', 'cloudsql', 'aws', 'azure', 'bigquery']
databases_counter = {}
for db in databases:
    count = junior_roles['job_description'].apply(lambda s: db in s).sum()
    databases_counter[db] = count

# Sort ranking
databases_counter = sorted(databases_counter.items(), key=lambda x: x[1], reverse=True)
# Display the ranking
print("For databases: ")
for db, count in databases_counter:
    print(f' - {db}: {count}')

# Save the most essential database as a (name, count) tuple
top_database = databases_counter[0]

For databases: 
 - oracle: 101
 - mysql: 86
 - aws: 34
 - azure: 19
 - mssql: 11
 - postgresql: 10
 - mongodb: 8
 - sqlite: 3
 - cassandra: 3
 - redis: 3
 - dynamodb: 2
 - neo4j: 1
 - firebase: 0
 - cloudsql: 0
 - bigquery: 0


- The most essential skills are now stored in:

        top_lang for programming languages (java, found in 270 postings)
        top_database for databases (oracle, found in 101 postings)
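
Since `sorted(...items())` returns a list of `(name, count)` tuples, both saved values are pairs rather than plain strings. A small sketch of unpacking them, using the first entries of the rankings printed above; the `*_name` and `*_count` variable names are assumptions added for illustration.

```python
# First entries of the two rankings, copied from the printed output.
languages_counter = [('java', 270), ('c#', 194), ('c++', 129)]
databases_counter = [('oracle', 101), ('mysql', 86), ('aws', 34)]

top_lang = languages_counter[0]
top_database = databases_counter[0]

# Unpack each (name, count) pair so the name can be reused on its own.
top_lang_name, top_lang_count = top_lang
top_database_name, top_database_count = top_database

print(f'Language: {top_lang_name} ({top_lang_count} postings)')
print(f'Database: {top_database_name} ({top_database_count} postings)')
```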

---
 **Q2 :** For a long-term skill development plan, determine an additional programming language to complement your initial choice for a first job.
<br>`(2 pts – achievable with extra effort)`

Hints:<br> 
    - Analyze effective pairings between your selected language and one other programming language at a time.<br>
    - Focus only on senior-level roles.<br>
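
One possible reading of these hints is a co-occurrence count: among senior postings that mention the selected language, count how often each other language appears alongside it. The sketch below shows that idea on made-up data. The function `pair_counts`, the toy postings, and the choice of `senior` and `sr` as title keywords are all assumptions, not the notebook's actual solution.

```python
from collections import Counter

def pair_counts(postings, base_language, candidates):
    """Count senior postings that mention base_language together with each candidate."""
    counts = Counter()
    for title_tokens, description_tokens in postings:
        # Assumption: a senior role has 'senior' or 'sr' as a title token.
        if 'senior' not in title_tokens and 'sr' not in title_tokens:
            continue
        mentioned = set(description_tokens)
        if base_language not in mentioned:
            continue
        for candidate in candidates:
            if candidate != base_language and candidate in mentioned:
                counts[candidate] += 1
    return counts

# Hypothetical (title tokens, description tokens) pairs.
toy_postings = [
    (['senior', 'developer'], ['java', 'python', 'sql']),
    (['sr', 'engineer'], ['java', 'python', 'c#']),
    (['senior', 'engineer'], ['python', 'ruby']),   # no java: skipped
    (['junior', 'developer'], ['java', 'ruby']),    # not senior: skipped
]

pairs = pair_counts(toy_postings, 'java', ['java', 'python', 'c#', 'ruby'])
print(pairs.most_common())
# [('python', 2), ('c#', 1)]
```

With the real data, the tokenized `job_title` and `job_description` columns could be fed in as the pairs, and the highest-ranked candidate would be the suggested complementary language.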