# 🔢 Job Clustering - Vectorization & Clustering

This notebook covers five steps:
1. **Load cleaned data** from previous step
2. **Vectorize text** using TF-IDF
3. **Find optimal clusters** using silhouette analysis
4. **Apply K-Means clustering**
5. **Evaluate clustering quality**

---

In [82]:
# Import required libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Import custom modules
from modules.cleaning import get_stopwords
from modules.vectorization import TextVectorizer, get_vocabulary_stats
from modules.clustering import ClusterOptimizer, JobClusterer

# Visualization
import plotly.express as px
import plotly.graph_objects as go

print("✅ Modules loaded successfully!")

✅ Modules loaded successfully!


In [83]:
# Setup: Download NLTK data if needed
import nltk
try:
    from nltk.corpus import stopwords
    stopwords.words('french')
    print("✅ NLTK stopwords already available")
except LookupError:
    print("📥 Downloading NLTK stopwords...")
    nltk.download('stopwords', quiet=True)
    print("✅ NLTK stopwords downloaded")

✅ NLTK stopwords already available


## 1️⃣ Load Cleaned Data

In [84]:
# Load cleaned data from previous notebook
df = pd.read_csv('data_cleaned.csv')

print(f"📊 Loaded {len(df)} job offers")
print(f"   Columns: {list(df.columns)}")
print(f"\n   Sample titles:")
print(df['title_cleaned'].head(10).to_string(index=False))

📊 Loaded 1000 job offers
   Columns: ['mission_cleaned', 'profil_cleaned', 'title_cleaned']

   Sample titles:
                       work force management rh
                       work force management rh
                       work force management rh
wordpress graphiste community manager notion ia
     webmaster charge marketing digital dovelec
 webmaster designer developpeur application web
                                      webmaster
                           webmarketing manager
                  webmarketing campaign manager
                                webmarketer seo


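## 2️⃣ Text Vectorization (TF-IDF)

Convert the cleaned job titles into numerical vectors using TF-IDF (term frequency-inverse document frequency), which weights a term by how often it occurs in a title and down-weights terms that occur in many titles.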

In [85]:
# Initialize vectorizer
# Named custom_stopwords so it does not shadow nltk.corpus.stopwords imported above
custom_stopwords = get_stopwords(include_locations=True)

vectorizer = TextVectorizer(
    max_features=50,       # Maximum vocabulary size
    min_df=2,              # Min document frequency
    max_df=0.8,            # Max document frequency
    ngram_range=(2, 3),    # Bigrams and trigrams (no unigrams)
    stop_words=custom_stopwords,
    use_svd=False          # Set True for dimensionality reduction
)

# Clean data: remove NaN values and empty strings
df_clean = df[df['title_cleaned'].notna() & (df['title_cleaned'] != '')].copy()
print(f"📊 Filtered data: {len(df)} → {len(df_clean)} job offers (removed {len(df) - len(df_clean)} invalid entries)")

# Vectorize job titles
X = vectorizer.fit_transform(df_clean['title_cleaned'])

print(f"\n✅ Vectorization complete!")
print(f"   Matrix shape: {X.shape}")
print(f"   Sparsity: {(1 - X.nnz / (X.shape[0] * X.shape[1])) * 100:.1f}%")

# Update df to use cleaned version
df = df_clean

📊 Filtered data: 1000 → 1000 job offers (removed 0 invalid entries)
🔄 Vectorizing 1000 documents...
✅ TF-IDF matrix shape: (1000, 50)
   Features: 50

✅ Vectorization complete!
   Matrix shape: (1000, 50)
   Sparsity: 99.1%
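
With `ngram_range=(2, 3)`, each feature is a run of two or three consecutive words, assuming `TextVectorizer` treats the parameter the way scikit-learn does. The sketch below illustrates word n-gram extraction with plain whitespace tokenization. `word_ngrams` is a hypothetical helper for illustration and is not part of the notebook's modules.

```python
def word_ngrams(text, min_n=2, max_n=3):
    """Return all word n-grams of length min_n..max_n, ordered by n, then position."""
    tokens = text.split()
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


# A three-word title yields two bigrams and one trigram
print(word_ngrams("redacteur web polyvalent"))
# A one-word title yields no n-grams at all under this setting
print(word_ngrams("webmaster"))
```

Under this setting a single-word title such as `webmaster` produces no features, so its row in the matrix would be entirely zero.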


In [86]:
# Examine top TF-IDF features
vocab_stats = get_vocabulary_stats(vectorizer)

print("📊 Top 20 TF-IDF Features:\n")
print(vocab_stats.head(20).to_string(index=False))

📊 Top 20 TF-IDF Features:

                                   feature      idf  doc_frequency
                         conseil materiaux 6.116995              5
                      charge communication 6.116995              5
                         stagiaire qualite 6.116995              5
                      stagiaire logistique 6.116995              5
            conseil materiaux construction 6.116995              5
                             redacteur web 6.116995              5
                  redacteur web polyvalent 6.116995              5
                           traffic manager 6.116995              5
                                testeur qa 6.116995              5
                          testeur logiciel 6.116995              5
                            web polyvalent 6.116995              5
teleoperateurs teleoperatrices anglophones 6.116995              5
                 teleoperatrices bilingues 6.116995              5
                     technicien 
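
The `idf` values in the table match the smoothed inverse document frequency that scikit-learn's `TfidfVectorizer` applies by default, `idf = ln((1 + n) / (1 + df)) + 1`. This is inferred from the numbers rather than stated in the notebook. A short check with n = 1000 documents and df = 5 (`smoothed_idf` is a hypothetical helper):

```python
import math


def smoothed_idf(n_documents, doc_frequency):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_documents) / (1 + doc_frequency)) + 1


# 1000 job offers, feature present in 5 of them (values from the table above)
print(f"{smoothed_idf(1000, 5):.6f}")  # 6.116995
```

Every row visible in the table has `doc_frequency` 5, so they all share the same `idf`.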

In [87]:
# Visualize feature distribution
fig = px.bar(vocab_stats.head(30), 
             x='feature', 
             y='doc_frequency',
             title='Top 30 Features by Document Frequency',
             labels={'doc_frequency': 'Number of Documents', 'feature': 'Feature'})
fig.update_layout(xaxis_tickangle=-45, height=500)
fig.show()

## 3️⃣ Find Optimal Number of Clusters

Evaluate K-Means for K = 2 to 30 with three metrics (silhouette, Calinski-Harabasz, Davies-Bouldin), then select K by silhouette score.
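
For reference, the silhouette coefficient of a point is `(b - a) / max(a, b)`, where `a` is its mean distance to the other points in its own cluster and `b` is its mean distance to the points of the nearest other cluster. The score reported for a clustering is the mean over all points. The sketch below is a plain-Python illustration on toy data, assuming Euclidean distance and at least two clusters. `mean_silhouette` is a hypothetical name, and the notebook's `ClusterOptimizer` may compute the score differently.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette coefficient (Euclidean distance, requires >= 2 clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    def mean_dist(i, members):
        # Mean distance from point i to the given members, excluding i itself
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        if len(clusters[label]) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = mean_dist(i, clusters[label])
        b = min(mean_dist(i, members)
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight, well-separated groups
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
score = mean_silhouette(points, labels)
print(f"{score:.3f}")  # 0.900
```

Values near 1 indicate compact, well-separated clusters. Values near 0 indicate overlapping clusters.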

In [88]:
# Initialize optimizer
optimizer = ClusterOptimizer(min_clusters=2, max_clusters=30)

# Evaluate multiple cluster numbers
metrics_df = optimizer.evaluate_clustering(X, max_k=30)

print("\n📊 Clustering Evaluation Results:")
print(metrics_df.to_string(index=False))

📊 Comprehensive evaluation from K=2 to K=30
   K=2: silhouette=0.540, CH=95.7, DB=0.467
   K=3: silhouette=0.566, CH=86.7, DB=0.415
   K=4: silhouette=0.595, CH=89.7, DB=0.399
   K=5: silhouette=0.621, CH=91.0, DB=0.368
   K=6: silhouette=0.648, CH=96.4, DB=0.353
   K=7: silhouette=0.663, CH=92.6, DB=0.335
   K=8: silhouette=0.689, CH=100.0, DB=0.309
   K=9: silhouette=0.699, CH=95.1, DB=0.298
   K=10: silhouette=0.719, CH=101.8, DB=0.337
   K=11: silhouette=0.733, CH=104.1, DB=0.443
   K=12: silhouette=0.745, CH=105.8, DB=0.452
   K=13: silhouette=0.758, CH=107.8, DB=0.441
   K=14: silhouette=0.769, CH=108.1, DB=0.431
   K=15: silhouette=0.780, CH=110.6, DB=0.423
   K=16: silhouette=0.787, CH=108.8, DB=0.422
   K=17: silhouette=0.797, CH=110.2, DB=0.535
   K=18: silhouette=0.807, CH=114.1, DB=0.404
   K=19: silhouette=0.816, CH=116.1, DB=0.401
   K=20: silhouette=0.825, CH=120.2, DB=0.406
   K=21: silhouette=0.833, CH=123.0, DB=0.406
   K=22: silhouette=0.840, CH=126.0, DB=0.401
  

In [89]:
# Visualize clustering metrics
from modules.visualization import plot_metrics_evolution

fig = plot_metrics_evolution(metrics_df)
fig.show()

In [90]:
# Get optimal K based on silhouette score
optimal_k = optimizer.get_best_k(metric='silhouette_score')
print(f"\n🎯 Optimal number of clusters: {optimal_k}")


🎯 Optimal number of clusters: 30
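
The selected K equals the upper bound of the search (`max_clusters=30`), and the silhouette scores visible in the log rise at every step. When the best score sits at the edge of the range, the range itself may be deciding the answer. The sketch below shows that check using only scores copied from the visible part of the log. `pick_best_k` is a hypothetical helper, not the `get_best_k` method of `ClusterOptimizer`.

```python
def pick_best_k(scores):
    """Return (best_k, at_upper_bound) for a {k: silhouette} mapping."""
    best_k = max(scores, key=scores.get)
    return best_k, best_k == max(scores)


# Silhouette scores copied from the visible part of the log (K=18..22)
visible_scores = {18: 0.807, 19: 0.816, 20: 0.825, 21: 0.833, 22: 0.840}

best_k, at_upper_bound = pick_best_k(visible_scores)
if at_upper_bound:
    print(f"Best K={best_k} is the largest K tried; consider widening the range "
          "or inspecting cluster sizes before trusting it.")
```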


## 4️⃣ Apply K-Means Clustering

In [91]:
# Cluster with optimal K
clusterer = JobClusterer(n_clusters=optimal_k)
labels = clusterer.fit_predict(X, optimize=False)

# Add cluster labels to DataFrame
df['cluster'] = labels

print(f"\n✅ Clustering complete with {optimal_k} clusters!")

🎯 Fitting K-Means with 30 clusters...
✅ Clustering complete!

📊 Cluster distribution:
   Cluster 0: 697 items (69.7%)
   Cluster 1: 32 items (3.2%)
   Cluster 2: 23 items (2.3%)
   Cluster 3: 18 items (1.8%)
   Cluster 4: 21 items (2.1%)
   Cluster 5: 11 items (1.1%)
   Cluster 6: 19 items (1.9%)
   Cluster 7: 8 items (0.8%)
   Cluster 8: 8 items (0.8%)
   Cluster 9: 21 items (2.1%)
   Cluster 10: 11 items (1.1%)
   Cluster 11: 14 items (1.4%)
   Cluster 12: 19 items (1.9%)
   Cluster 13: 7 items (0.7%)
   Cluster 14: 6 items (0.6%)
   Cluster 15: 7 items (0.7%)
   Cluster 16: 8 items (0.8%)
   Cluster 17: 6 items (0.6%)
   Cluster 18: 7 items (0.7%)
   Cluster 19: 5 items (0.5%)
   Cluster 20: 5 items (0.5%)
   Cluster 21: 5 items (0.5%)
   Cluster 22: 6 items (0.6%)
   Cluster 23: 5 items (0.5%)
   Cluster 24: 5 items (0.5%)
   Cluster 25: 5 items (0.5%)
   Cluster 26: 5 items (0.5%)
   Cluster 27: 6 items (0.6%)
   Cluster 28: 5 items (0.5%)
   Cluster 29: 5 items (0.5%)
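
Cluster 0 holds 69.7% of the offers. One possible explanation, which the notebook does not verify, is that many titles contain none of the 50 features and therefore map to an all-zero vector. The reported sparsity of 99.1% on a 1000 x 50 matrix leaves fewer non-zero entries than there are rows, so at least roughly half the rows must be empty. The sketch below computes that lower bound. `min_all_zero_rows` is a hypothetical helper.

```python
import math


def min_all_zero_rows(n_rows, n_cols, sparsity_pct, rounding=0.05):
    """Lower bound on rows without any non-zero entry, given a sparsity
    percentage that was rounded to one decimal place."""
    # The lowest sparsity compatible with the rounded figure gives the most non-zeros
    lowest_sparsity = (sparsity_pct - rounding) / 100
    max_nonzeros = math.ceil((1 - lowest_sparsity) * n_rows * n_cols)
    # Every non-empty row needs at least one non-zero entry
    return max(0, n_rows - max_nonzeros)


# Shape and sparsity as printed by the vectorization cell
print(min_all_zero_rows(1000, 50, 99.1))

# Exact count in the notebook, assuming X is a SciPy sparse matrix:
# empty_rows = int((X.getnnz(axis=1) == 0).sum())
```

If the exact count is close to the size of cluster 0, that cluster is mostly a catch-all for titles with no features in the vocabulary. Identical zero vectors would also push the silhouette score up.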


In [92]:
# Evaluate clustering quality
metrics = clusterer.evaluate(X)


📈 Clustering Metrics:
   Silhouette Score: 0.905 (closer to 1 is better)
   Calinski-Harabasz: 186.9 (higher is better)
   Davies-Bouldin: 0.385 (lower is better)
   Inertia: 51.3
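
Inertia is the quantity K-Means minimizes: the sum of squared Euclidean distances from each point to its assigned centroid. It tends to fall as K grows, which is why it is useful for comparing runs at the same K rather than across different K. The sketch below uses toy data, not the notebook's matrix. `inertia` is a hypothetical helper.

```python
def inertia(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total


# Toy data: each point lies exactly 1 unit from its cluster centroid
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
centroids = {0: (0.0, 1.0), 1: (10.0, 1.0)}  # mean of each cluster
print(inertia(points, labels, centroids))  # 4.0
```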


## 5️⃣ Visualize Clusters

In [93]:
# Cluster distribution
from modules.visualization import plot_cluster_distribution

fig = plot_cluster_distribution(df, cluster_column='cluster')
fig.show()

In [94]:
# 2D visualization using PCA
from modules.visualization import plot_cluster_scatter_2d

fig = plot_cluster_scatter_2d(
    df, 
    X, 
    method='pca',
    hover_data=['title_cleaned']
)
fig.show()

🎨 Creating 2D visualization using PCA...
✅ Visualization created!


In [95]:
# Alternative: t-SNE visualization (slower but better for complex structures)
# Uncomment to use:

# fig_tsne = plot_cluster_scatter_2d(
#     df, 
#     X, 
#     method='tsne',
#     hover_data=['title_cleaned']
# )
# fig_tsne.show()

## 6️⃣ Examine Individual Clusters

In [96]:
# Display sample jobs from each cluster
for cluster_id in range(min(5, optimal_k)):  # Show first 5 clusters
    # Select this cluster's titles once, then reuse for the count and the sample
    cluster_titles = df.loc[df['cluster'] == cluster_id, 'title_cleaned']
    print(f"\n🔹 Cluster {cluster_id} ({len(cluster_titles)} jobs):")
    print(cluster_titles.head(10).to_string(index=False))


🔹 Cluster 0 (697 jobs):
                       work force management rh
                       work force management rh
                       work force management rh
wordpress graphiste community manager notion ia
     webmaster charge marketing digital dovelec
 webmaster designer developpeur application web
                                      webmaster
                           webmarketing manager
                  webmarketing campaign manager
                                webmarketer seo

🔹 Cluster 1 (32 jobs):
                               technico commercial
                   technico commercial electricite
                               technico commercial
                               technico commercial
                           technico commercial b b
                       technico commercial karenjy
technico commercial reseau grands comptes autom...
                            technico commercial it
                     technico commercial industrie
      

## 7️⃣ Save Clustered Data

In [97]:
# Save clustered data
output_file = 'data_clustered.csv'
df.to_csv(output_file, index=False)

# Save vectorizer for future use
vectorizer.save('vectorizer.pkl')

print(f"💾 Clustered data saved to: {output_file}")
print(f"💾 Vectorizer saved to: vectorizer.pkl")
print(f"\n   Total jobs: {len(df)}")
print(f"   Number of clusters: {optimal_k}")

💾 Vectorizer saved to vectorizer.pkl
💾 Clustered data saved to: data_clustered.csv
💾 Vectorizer saved to: vectorizer.pkl

   Total jobs: 1000
   Number of clusters: 30


## ✅ Summary

**Vectorization & Clustering Complete!**

- ✅ Vectorized text using TF-IDF
- ✅ Found optimal clusters using silhouette analysis
- ✅ Applied K-Means clustering
- ✅ Evaluated clustering quality
- ✅ Visualized cluster distribution and structure
- ✅ Saved clustered data to `data_clustered.csv`

**Next Steps:**
- Open `03_label_extract_visualize.ipynb` to label clusters and extract skills