**Convert Job Titles & Keywords into Vectors**

*TF-IDF*

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

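# Convert job titles into TF-IDF vectors
title_vectorizer = TfidfVectorizer()
X_job_titles = title_vectorizer.fit_transform(df["job_title"])

# Convert skill keywords into TF-IDF vectors.
# A separate vectorizer keeps the job-title vocabulary from being overwritten.
keyword_vectorizer = TfidfVectorizer()
X_keywords = keyword_vectorizer.fit_transform(df["keyword_name"])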

**Silhouette score for number of clusters (TF-IDF)**

In [None]:
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

range_clusters = list(range(2, 11))  # Test clusters from 2 to 10
silhouette_scores = []

for k in range_clusters:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X_keywords)
    score = silhouette_score(X_keywords, labels)
    silhouette_scores.append(score)

In [None]:
best_score = max(silhouette_scores)
best_k = range_clusters[silhouette_scores.index(best_score)]
print(f" Optimal number of clusters: {best_k}")
print(f" Highest silhouette score: {best_score:.4f}")

 Optimal number of clusters: 10
 Highest silhouette score: 0.0256
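
The best score reported above falls at k = 10, the upper end of the tested range, and is close to zero, so plotting the scores against k can show whether the curve is still rising. A minimal sketch, assuming scikit-learn and matplotlib are installed; the `make_blobs` data is a synthetic stand-in for `X_keywords`, and the output file name is hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the TF-IDF matrix (assumption, not the notebook's data)
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

range_clusters = list(range(2, 11))
silhouette_scores = []
for k in range_clusters:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    silhouette_scores.append(silhouette_score(X_demo, labels))

# Select the k with the highest silhouette score
best_score = max(silhouette_scores)
best_k = range_clusters[silhouette_scores.index(best_score)]

# Plot the score for every k and mark the best one
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(range_clusters, silhouette_scores, marker="o")
ax.axvline(best_k, linestyle="--", color="gray")
ax.set_xlabel("Number of clusters (k)")
ax.set_ylabel("Silhouette score")
ax.set_title("Silhouette score by number of clusters")
fig.savefig("silhouette_scores.png")
```

If the curve is still climbing at the largest k, widening the range is usually more informative than accepting the boundary value.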


*Sentence transformers vectorization*

In [None]:
from sentence_transformers import SentenceTransformer

# Load pre-trained sentence embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Convert skills into dense vectors
X_keywords1 = model.encode(df["keyword_name"].tolist(), convert_to_numpy=True)
X_job_titles1 = model.encode(df["job_title"].tolist(), convert_to_numpy=True)


**PCA**

In [None]:
from sklearn.decomposition import PCA
import numpy as np

# Fit PCA on SBERT embeddings
pca = PCA()
X_keywords_pca_full = pca.fit_transform(X_keywords1)

# Find the number of components that explain at least 90% variance
explained_variance_ratio = np.cumsum(pca.explained_variance_ratio_)
optimal_components = np.argmax(explained_variance_ratio >= 0.90) + 1  # First component reaching 90%

print(f"Optimal number of components to retain 90% variance: {optimal_components}")


Optimal number of components to retain 90% variance: 161
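
The same selection can also be written in one step: scikit-learn's `PCA` accepts a float between 0 and 1 as `n_components` (with `svd_solver="full"`) and keeps the smallest number of components whose cumulative explained variance passes that fraction. A minimal sketch on random data standing in for `X_keywords1`; the array shape is an assumption chosen to mimic 384-dimensional MiniLM embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Correlated synthetic features (assumption), so fewer components than dimensions suffice
X_demo = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 384))
X_demo += 0.01 * rng.normal(size=X_demo.shape)

# Let PCA choose the component count for a 90% variance target
pca_90 = PCA(n_components=0.90, svd_solver="full")
X_demo_pca = pca_90.fit_transform(X_demo)

# Cross-check against the manual cumulative-variance approach
full_pca = PCA(svd_solver="full").fit(X_demo)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
manual_components = int(np.argmax(cumulative >= 0.90) + 1)

print(f"Components chosen by PCA: {pca_90.n_components_}")
print(f"Components from manual search: {manual_components}")
print(f"Explained variance: {pca_90.explained_variance_ratio_.sum():.4f}")
```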


In [None]:
# Now apply PCA using the optimal number of components
pca = PCA(n_components=optimal_components, random_state=42)
X_keywords_pca = pca.fit_transform(X_keywords1)

# Print final explained variance
final_variance = sum(pca.explained_variance_ratio_)
print(f"Final Explained Variance: {final_variance:.4f}")


Final Explained Variance: 0.8998


**Silhouette score after PCA**

In [None]:
range_clusters = list(range(2, 31))  # Test clusters from 2 to 30
silhouette_scores = []

for k in range_clusters:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X_keywords_pca) # using pca transformed data
    score = silhouette_score(X_keywords_pca, labels)
    silhouette_scores.append(score)

In [None]:
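best_score_pca = max(silhouette_scores)
best_k_pca = range_clusters[silhouette_scores.index(best_score_pca)]
print(f" Optimal number of clusters: {best_k_pca}")
print(f" Highest silhouette score: {best_score_pca:.4f}")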

 Optimal number of clusters: 30
 Highest silhouette score: 0.0624


**Apply Kmeans with more clusters for better output**
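
The inspection cell in this section reads a `skill_cluster` column, but the cell that creates it is not shown. A minimal sketch of how such a column could be produced, assuming K-Means is fitted on the PCA-reduced embeddings; the number of clusters, the toy DataFrame, and the random vectors standing in for `X_keywords_pca` are all assumptions for illustration, not the notebook's actual values:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-ins (assumptions): a small keyword table and random "embeddings"
df_demo = pd.DataFrame({"keyword_name": [f"skill_{i}" for i in range(200)]})
rng = np.random.default_rng(42)
X_demo_pca = rng.normal(size=(len(df_demo), 16))

n_clusters = 10  # hypothetical value; the notebook's actual choice is not shown
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)

# Store one cluster label per row so rows can be filtered by cluster later
df_demo["skill_cluster"] = kmeans.fit_predict(X_demo_pca)

# Cluster sizes give a quick sanity check before inspecting keywords
print(df_demo["skill_cluster"].value_counts().sort_index())
```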

In [None]:
# Inspect clusters manually for K-Means
for cluster in range(10):  # Checking first 10 clusters
    print(f"\nCluster {cluster}:")
    print(df[df["skill_cluster"] == cluster]["keyword_name"].head(10))  # Display 10 sample keywords per cluster



Cluster 0:
1                        JAVA
2                 Objective-C
4                        java
7                         C++
43                         C#
72                     Python
102                  software
164                      Java
255                     Linux
279    BS in Computer Science
Name: keyword_name, dtype: object

Cluster 1:
10                            sales
16        GST AND VAT AND Sales Tax
39                demand generation
55                         Presales
57                     Inside sales
59                            Sales
81                    Reimbursement
125                         payment
148    Sales / Business Development
154                    retail sales
Name: keyword_name, dtype: object

Cluster 2:
18                 wireframes
25              visual design
29          Industrial Design
31                     design
58             digital design
104                 Architect
184    engineering background
190            Concept Art

**Skill gap analysis**

In [None]:
# Take user input for job title
desired_job_title = input("Enter the job title you are interested in: ").strip()

# Take user input for skills (comma-separated)
student_skills_input = input("Enter your skills (comma-separated): ").strip().lower()

# Convert input to a list, dropping empty entries (e.g. from a trailing comma)
student_skills = [skill.strip() for skill in student_skills_input.split(",") if skill.strip()]

print("\n Your Selected Job Title:", desired_job_title)
print("Your Skills:", student_skills)


Enter the job title you are interested in: business analyst
Enter your skills (comma-separated): excel, statistics

 Your Selected Job Title: business analyst
Your Skills: ['excel', 'statistics']


In [None]:
# Filter dataset for the selected job title (literal, case-insensitive substring match)
job_skills = df[df["job_title"].str.contains(desired_job_title, case=False, na=False, regex=False)]["keyword_name"]
job_skills = set(job_skills.dropna())  # Convert to a set for comparison

In [None]:
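# Compare student skills with job skills, ignoring case and surrounding whitespace
# (student skills were lowercased on input; dataset keywords keep their original casing)
student_skill_set = set(student_skills)
missing_skills = {skill for skill in job_skills if str(skill).strip().lower() not in student_skill_set}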

# Output missing skills
if missing_skills:
    print(f"\n Skills you need to learn for {desired_job_title}:")
    print(missing_skills)
else:
    print(f"\n You have all required skills for {desired_job_title}!")


 Skills you need to learn for business analyst:
{'System Analyst', 'business analyst', 'analyst', 'Digital', 'POCs', 'Data Modelling'}
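
Because the student's skills are lowercased on input while the dataset keywords keep their original casing, a case-sensitive set difference can report a skill as missing when only the capitalization differs. A small self-contained sketch of the difference; the helper name `find_missing_skills` and the sample values are hypothetical:

```python
def find_missing_skills(job_skills, student_skills):
    """Return job skills not covered by the student, ignoring case and whitespace."""
    known = {s.strip().lower() for s in student_skills if s.strip()}
    return sorted(s for s in set(job_skills) if s.strip().lower() not in known)


job_skills_demo = {"Excel", "SQL", "Data Modelling"}
student_skills_demo = ["excel", " statistics "]

# Case-sensitive difference: "Excel" is wrongly reported as missing
naive = sorted(job_skills_demo - set(student_skills_demo))

# Case-insensitive comparison: "Excel" is recognized as already known
missing = find_missing_skills(job_skills_demo, student_skills_demo)

print(naive)    # ['Data Modelling', 'Excel', 'SQL']
print(missing)  # ['Data Modelling', 'SQL']
```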


In [None]:
df.to_csv("processed_skill_gap_data.csv", index=False)
from google.colab import files
files.download("processed_skill_gap_data.csv")  # Downloads file to your computer
