# **Introduction**  

In this notebook, we perform **Feature Engineering** on the cleaned job market dataset (`cleaned_jobs.csv`). The goal is to **enhance the dataset** by creating new features that improve insights and make the data more useful for analysis and modeling.  

### **Objectives:**  
✅ Convert the **Experience** column into a **numeric format** (years of experience).  
✅ Categorize **Salaries** into `"Low"`, `"Medium"`, and `"High"`.  
✅ Classify job postings into **Seniority Levels** (`Junior`, `Mid-Level`, `Senior`).  
✅ Split the **Skills** column into individual skills and apply **one-hot encoding**.  
✅ Save the processed dataset as **`processed_jobs.csv`** for further analysis.  

By engineering these features, we make the dataset more structured, which will help in **exploratory analysis, visualization, and predictive modeling**.  

---

# **Feature Engineering**

## **Step 1: Load & Prepare Data**

***Importing Libraries:***

In [31]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

***Load the cleaned dataset***

In [32]:
df = pd.read_csv("../data/cleaned_jobs.csv")
print("Dataset Loaded Successfully!")

Dataset Loaded Successfully!


In [33]:
# Display first few rows
df.head()

|   | Job Title | Company | Location | Skills | Experience | Salary | Date Posted |
|---|---|---|---|---|---|---|---|
| 0 | Data Scientist | Amazon | Mumbai | Tableau, Excel, R | 6 | 1 | Posted 9 days ago |
| 1 | Data Scientist | Google | Chennai | Data Wrangling, Pandas, Numpy | 6 | 1 | Posted 13 days ago |
| 2 | Data Scientist | Flipkart | Chennai | Machine Learning, Deep Learning | 9 | 9 | Posted 7 days ago |
| 3 | Machine Learning Engineer | Infosys | Pune | Machine Learning, Deep Learning | 4 | 1 | Posted 5 days ago |
| 4 | Machine Learning Engineer | Deloitte | Pune | Python, Sql, Power Bi | 3 | 6 | Posted 9 days ago |


## **Step 2: Convert Experience to Numeric Format**

In [34]:
# Extract the first number from 'Experience'; entries without digits stay NaN
df["Experience Years"] = df["Experience"].astype(str).str.extract(r'(\d+)', expand=False).astype(float)

# Fill missing values with the median experience (assign back instead of using inplace=True)
df["Experience Years"] = df["Experience Years"].fillna(df["Experience Years"].median())

print("Experience column transformed successfully!")
df[["Experience", "Experience Years"]].head()


Experience column transformed successfully!


|   | Experience | Experience Years |
|---|---|---|
| 0 | 6 | 6.0 |
| 1 | 6 | 6.0 |
| 2 | 9 | 9.0 |
| 3 | 4 | 4.0 |
| 4 | 3 | 3.0 |
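
To see how the digit extraction and median fill behave on messier inputs than the integers shown above, here is a minimal sketch on a toy Series. The sample strings (such as `"3-5 years"` or `"Fresher"`) are assumptions made for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw values; the real column may already be purely numeric.
raw = pd.Series(["6", "3-5 years", "10+ yrs", None, "Fresher"])

# Keep the first run of digits in each entry; entries without digits become NaN.
years = raw.astype(str).str.extract(r"(\d+)", expand=False).astype(float)

# Replace the missing values with the median of the values that were parsed.
years = years.fillna(years.median())

print(years.tolist())
```

Note that a range such as `"3-5 years"` is reduced to its lower bound, because only the first number is captured. If ranges occur in the real data, averaging both ends may be a better choice.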


## **Step 3: Categorize Salaries (Low, Medium, High)**

In [35]:
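# Compute the salary thresholds at the 33rd and 66th percentiles
low_thresh = df["Salary"].quantile(0.33)
high_thresh = df["Salary"].quantile(0.66)

# Assign categories; include_lowest=True keeps a salary equal to the lowest bin edge (0) in "Low"
df["Salary Category"] = pd.cut(
    df["Salary"],
    bins=[0, low_thresh, high_thresh, np.inf],
    labels=["Low", "Medium", "High"],
    include_lowest=True,
)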

print("Salary categories assigned successfully!")
df[["Salary", "Salary Category"]].head()

Salary categories assigned successfully!


|   | Salary | Salary Category |
|---|---|---|
| 0 | 1 | Low |
| 1 | 1 | Low |
| 2 | 9 | High |
| 3 | 1 | Low |
| 4 | 6 | High |
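
The quantile-based binning can be checked on a small example. The sketch below uses made-up salary values (an assumption, not the real data) and open-ended outer edges so that every value lands in a bin. It also shows that `pd.cut` builds right-closed intervals by default, so a value equal to the lowest edge is left uncategorized unless `include_lowest=True` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical salary values on the same 1-9 scale as the sample rows.
salary = pd.Series([2, 5, 9, 1, 6, 4, 7, 3, 8])

low_thresh = salary.quantile(0.33)
high_thresh = salary.quantile(0.66)

# -inf and +inf as outer edges guarantee that no value falls outside the bins.
categories = pd.cut(
    salary,
    bins=[-np.inf, low_thresh, high_thresh, np.inf],
    labels=["Low", "Medium", "High"],
)
print(categories.value_counts().sort_index())

# Default intervals are (left, right], so a value equal to the lowest edge gets NaN.
edge_case = pd.cut(pd.Series([0, 5]), bins=[0, 3, 6, np.inf], labels=["Low", "Medium", "High"])
print(edge_case.isna().tolist())
```

If roughly equal-sized groups are the goal, `pd.qcut(salary, q=3, labels=[...])` is a possible alternative, although it raises an error when many identical values make the bin edges collapse.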


## **Step 4: Create Job Seniority Level**

In [36]:
# Define seniority levels based on years of experience
def classify_seniority(exp):
    if exp <= 3:
        return "Junior"
    elif exp <= 7:  # also covers fractional values such as 3.5
        return "Mid-Level"
    else:
        return "Senior"

df['Seniority'] = df['Experience Years'].apply(classify_seniority)
df[['Experience Years', 'Seniority']].head()

|   | Experience Years | Seniority |
|---|---|---|
| 0 | 6.0 | Mid-Level |
| 1 | 6.0 | Mid-Level |
| 2 | 9.0 | Senior |
| 3 | 4.0 | Mid-Level |
| 4 | 3.0 | Junior |
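
The same mapping can also be expressed without a Python-level function. The following is a sketch of a vectorized alternative using `pd.cut`, assuming that fractional values between 3 and 4 (which can appear after a median fill) should count as `Mid-Level`. The sample values are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative experience values, including a fractional one.
exp = pd.Series([6.0, 9.0, 4.0, 3.0, 3.5, 0.0])

# Right-closed bins: (-inf, 3] -> Junior, (3, 7] -> Mid-Level, (7, inf) -> Senior.
seniority = pd.cut(
    exp,
    bins=[-np.inf, 3, 7, np.inf],
    labels=["Junior", "Mid-Level", "Senior"],
)
print(seniority.tolist())
```

The result is a categorical column with an explicit order, which can be convenient for sorting and plotting.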


## **Step 5: Convert Skills into Binary Columns**

In [37]:
from sklearn.preprocessing import MultiLabelBinarizer

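# Split the comma-separated Skills string into a list of skills
df["Skills"] = df["Skills"].str.split(", ")

# One-hot encode the skills; reuse df's index so the concat below aligns row by row
mlb = MultiLabelBinarizer()
skills_encoded = pd.DataFrame(mlb.fit_transform(df["Skills"]), columns=mlb.classes_, index=df.index)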

# Merge with the main dataset
df = pd.concat([df, skills_encoded], axis=1)

print("Skills encoded successfully!")
df.head()

Skills encoded successfully!


|   | Job Title | Company | Location | Skills | Experience | Salary | Date Posted | Experience Years | Salary Category | Seniority | ... | Deep Learning | Excel | Machine Learning | Numpy | Pandas | Power Bi | Python | R | Sql | Tableau |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Data Scientist | Amazon | Mumbai | [Tableau, Excel, R] | 6 | 1 | Posted 9 days ago | 6.0 | Low | Mid-Level | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | Data Scientist | Google | Chennai | [Data Wrangling, Pandas, Numpy] | 6 | 1 | Posted 13 days ago | 6.0 | Low | Mid-Level | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | Data Scientist | Flipkart | Chennai | [Machine Learning, Deep Learning] | 9 | 9 | Posted 7 days ago | 9.0 | High | Senior | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Machine Learning Engineer | Infosys | Pune | [Machine Learning, Deep Learning] | 4 | 1 | Posted 5 days ago | 4.0 | Low | Mid-Level | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Machine Learning Engineer | Deloitte | Pune | [Python, Sql, Power Bi] | 3 | 6 | Posted 9 days ago | 3.0 | High | Junior | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
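
For comparison, pandas can produce the same kind of indicator columns directly from the comma-separated strings. The sketch below swaps in `Series.str.get_dummies` in place of `MultiLabelBinarizer`; it is an alternative technique, not the one used in this notebook, and the rows (including the missing entry) are illustrative assumptions.

```python
import pandas as pd

# Illustrative skill strings; the None entry stands in for a posting with no skills listed.
skills = pd.Series(["Tableau, Excel, R", "Python, Sql, Power Bi", None])

# Treat missing entries as "no skills", then split on ", " and build 0/1 columns.
encoded = skills.fillna("").str.get_dummies(sep=", ")

print(encoded)
```

A row with no listed skills ends up with zeros in every column, whereas `MultiLabelBinarizer` would need the missing entry replaced with an empty list before fitting.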


## **Step 6: Save the Processed Data**

In [38]:
df.to_csv("../data/processed_jobs.csv", index=False)
print("✅ Feature Engineering Complete! Data saved as 'processed_jobs.csv'.")


✅ Feature Engineering Complete! Data saved as 'processed_jobs.csv'.
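
One thing to keep in mind when reloading the file: a column holding Python lists, such as `Skills` after the split, is written to CSV as text. The sketch below shows one way to turn it back into lists with `ast.literal_eval`. It uses an in-memory buffer and a two-row toy frame, both of which are assumptions for illustration.

```python
import ast
import io

import pandas as pd

# Toy frame with a list-valued column, mimicking 'Skills' after the split.
toy = pd.DataFrame({"Skills": [["Tableau", "Excel", "R"], ["Python", "Sql"]]})

# Write to an in-memory buffer instead of a file on disk, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Each cell is now a string such as "['Tableau', 'Excel', 'R']".
print(type(reloaded.loc[0, "Skills"]))

# Parse the string representation back into a real list.
reloaded["Skills"] = reloaded["Skills"].apply(ast.literal_eval)
print(reloaded.loc[0, "Skills"])
```

Since the one-hot columns already carry the skill information, dropping the list column before saving is another reasonable option.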
