
**TASK 1: Load and Understand the Dataset**

1. Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report


2. Load Dataset

In [3]:
import pandas as pd
df = pd.read_csv("/content/AI-Based Hiring Prediction System.csv")

3. Display Data

In [4]:
df.head()
df.tail()
df.sample(5)


|     | Resume_ID | Name | Skills | Experience (Years) | Education | Certifications | Job Role | Recruiter Decision | Salary Expectation ($) | Projects Count | AI Score (0-100) |
| --- | --------- | ---- | ------ | ------------------ | --------- | -------------- | -------- | ------------------ | ---------------------- | -------------- | ---------------- |
| 700 | 701 | Sarah Robinson | Networking, Cybersecurity, Ethical Hacking, Linux | 4 | B.Tech | NaN | Cybersecurity Analyst | Hire | 113477 | 2 | 70 |
| 222 | 223 | James Myers | Python, TensorFlow, Pytorch, NLP | 6 | B.Sc | NaN | AI Researcher | Hire | 78872 | 4 | 100 |
| 568 | 569 | Zachary Ross | Linux, Cybersecurity, Ethical Hacking | 8 | PhD | Deep Learning Specialization | Cybersecurity Analyst | Hire | 71624 | 8 | 100 |
| 918 | 919 | Shannon Frazier | Deep Learning, Python | 5 | MBA | AWS Certified | Data Scientist | Hire | 100800 | 3 | 85 |
| 303 | 304 | Carlos Goodwin | Python, NLP | 8 | B.Sc | Google ML | AI Researcher | Hire | 89808 | 10 | 100 |


4. Explanation

What type of data is present?

The dataset contains:

Numerical data (Experience, Salary, Projects Count, AI Score)

Categorical data (Education, Recruiter Decision)

Text data (Skills, Certifications, Job Role)

Which column will be used as the target?

The target is `Recruiter Decision`. We will convert it to a binary label:

Hire → 1

Reject → 0

**TASK 2: Basic Data Inspection**

In [5]:
df.shape
df.columns
df.dtypes
df["Recruiter Decision"].value_counts()
df.describe()


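|       | Resume_ID  | Experience (Years) | Salary Expectation ($) | Projects Count | AI Score (0-100) |
| ----- | ---------- | ------------------ | ---------------------- | -------------- | ---------------- |
| count | 1000.0     | 1000.0             | 1000.0                 | 1000.0         | 1000.0           |
| mean  | 500.5      | 4.896              | 79994.486              | 5.133          | 83.95            |
| std   | 288.819436 | 3.112695           | 23048.472549           | 3.23137        | 20.983036        |
| min   | 1.0        | 0.0                | 40085.0                | 0.0            | 15.0             |
| 25%   | 250.75     | 2.0                | 60415.75               | 2.0            | 70.0             |
| 50%   | 500.5      | 5.0                | 79834.5                | 5.0            | 100.0            |
| 75%   | 750.25     | 8.0                | 99583.25               | 8.0            | 100.0            |
| max   | 1000.0     | 10.0               | 119901.0               | 10.0           | 100.0            |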


Explain why data inspection is important before model training.

*   To understand data structure
*   To check imbalance
*   To detect wrong data types
*   To identify missing values
*   To prevent errors during training
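As a small illustration of these checks, the sketch below runs them on a made-up three-row frame. The column names mirror the dataset, but the values and the `toy` variable are assumptions for demonstration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up, not taken from the dataset.
toy = pd.DataFrame({
    "Experience (Years)": ["4", "6", "8"],            # numbers stored as text
    "Certifications": ["AWS Certified", np.nan, np.nan],
    "Recruiter Decision": ["Hire", "Hire", "Reject"],
})

# Wrong data type: the column is not numeric until it is converted.
dtype_before = toy["Experience (Years)"].dtype
toy["Experience (Years)"] = pd.to_numeric(toy["Experience (Years)"])

# Missing values per column.
missing = toy.isnull().sum()

# Class balance of the target, as fractions of all rows.
balance = toy["Recruiter Decision"].value_counts(normalize=True)

print(dtype_before, "->", toy["Experience (Years)"].dtype)
print(missing)
print(balance)
```

Each check maps to one bullet above: the dtype comparison catches wrong types, `isnull().sum()` finds gaps, and `value_counts(normalize=True)` exposes class imbalance.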

**TASK 3: Data Cleaning & Preprocessing**

1. Drop Columns

In [7]:
df.drop(["Resume_ID", "Name", "AI Score (0-100)"], axis=1, inplace=True)

2. Convert Target Variable

In [8]:
df["Recruiter Decision"] = df["Recruiter Decision"].map({
    "Hire": 1,
    "Reject": 0
})


3. Check Missing Values

In [9]:
df.isnull().sum()


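| Column                 | Missing values |
| ---------------------- | -------------- |
| Skills                 | 0              |
| Experience (Years)     | 0              |
| Education              | 0              |
| Certifications         | 274            |
| Job Role               | 0              |
| Recruiter Decision     | 0              |
| Salary Expectation ($) | 0              |
| Projects Count         | 0              |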


4. Ensure Numeric Columns

In [10]:
df["Experience (Years)"] = pd.to_numeric(df["Experience (Years)"])
df["Salary Expectation ($)"] = pd.to_numeric(df["Salary Expectation ($)"])
df["Projects Count"] = pd.to_numeric(df["Projects Count"])


**TASK 4: Text Feature Engineering**

1. Combine Text Columns

In [11]:
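df["combined_text"] = (
    df["Skills"] + " " +
    # Fill missing certifications first; otherwise NaN propagates through the
    # concatenation and the row's Skills and Job Role text is lost as well.
    df["Certifications"].fillna("") + " " +
    df["Job Role"]
)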
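
One thing to watch when combining these columns: string concatenation in pandas propagates missing values, and `Certifications` has 274 of them. The sketch below shows the effect on a made-up two-row frame (`toy`, `naive`, and `safe` are illustrative names, not part of the notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame; the second row has no certification.
toy = pd.DataFrame({
    "Skills": ["Python, NLP", "Linux, Cybersecurity"],
    "Certifications": ["Google ML", np.nan],
    "Job Role": ["AI Researcher", "Cybersecurity Analyst"],
})

# Plain concatenation: one missing piece makes the whole result missing.
naive = toy["Skills"] + " " + toy["Certifications"] + " " + toy["Job Role"]

# Filling the gap with an empty string keeps the other two columns.
safe = toy["Skills"] + " " + toy["Certifications"].fillna("") + " " + toy["Job Role"]

print(naive.isna().tolist())  # [False, True]
print(safe.tolist())
```

Without the fill, every row lacking a certification would end up with an empty `combined_text` after cleaning, even though its skills and job role are known.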


2. Clean Text

In [13]:
import re
import pandas as pd # Ensure pandas is imported for pd.isna

def clean_text(text):
    # Handle non-string types, specifically float NaN values
    if pd.isna(text):
        text = ""  # Convert NaN to an empty string
    elif not isinstance(text, str):
        text = str(text) # Convert other non-string types to string

    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text

df["combined_text"] = df["combined_text"].apply(clean_text)

Explain why text cleaning is necessary before vectorization.

*   Removes noise
*   Improves model accuracy
*   Reduces duplicate word representations
*   Makes vectorization effective
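
To make the "duplicate word representations" point concrete, here is a minimal sketch using a hypothetical helper, `normalize`, that applies the same three steps as `clean_text` (lowercase, keep letters and whitespace, collapse spaces). The input strings are made up.

```python
import re

def normalize(text):
    """Lowercase, keep letters and whitespace only, collapse repeated spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

# Three spellings of the same skills collapse to one normalized form.
variants = ["Deep  Learning, Python", "deep learning python!", "DEEP LEARNING; PYTHON"]
cleaned = {normalize(v) for v in variants}
print(cleaned)  # {'deep learning python'}

# Caveat: dropping every non-letter also changes some skill names.
print(normalize("C++"))  # 'c'
```

The caveat matters for resume data: with a letters-only rule, skills such as "C++" and "C#" both reduce to "c", so a looser pattern may be worth considering if such skills appear.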

**TASK 5: Convert Text to Numerical Features**

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=3000)
text_features = vectorizer.fit_transform(df["combined_text"])

Why can't ML models use text directly?

Machine learning models operate on numeric arrays, so raw strings cannot be passed to them.
Text must first be converted into a numerical representation.

What is TF-IDF?

TF-IDF = Term Frequency × Inverse Document Frequency
It weights a term higher when it appears often in a document but in few documents overall, and lower when it appears in most documents.
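
A minimal sketch of this weighting on a made-up three-document corpus, using scikit-learn's default `TfidfVectorizer` settings. The documents and the names `docs`, `vec`, and `row0` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "python" occurs in every document, "nlp" in only one.
docs = [
    "python nlp",
    "python tensorflow",
    "python cybersecurity linux",
]

vec = TfidfVectorizer()
dense = vec.fit_transform(docs).toarray()   # small corpus, so a dense copy is fine

vocab = vec.vocabulary_                     # term -> column index
row0 = dense[0]                             # weights for the first document

print(dense.shape)  # (3, 5): three documents, five distinct terms
print("python:", row0[vocab["python"]], "nlp:", row0[vocab["nlp"]])
```

In the first document both terms occur once, yet "nlp" receives the larger weight because it is rarer across the corpus.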

**TASK 6: Encode Categorical Variable (Education)**

In [17]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
df["Education"] = encoder.fit_transform(df["Education"])

Difference:

Label Encoding:

*   Converts each category into an integer (0, 1, 2, ...), which implies an order among the categories

One-Hot Encoding:

*   Creates a separate 0/1 column for each category, with no implied order
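
The two encodings can be compared side by side. This sketch uses a made-up `education` series and `pd.get_dummies` for the one-hot variant; it is an illustration, not the notebook's preprocessing.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the Education column.
education = pd.Series(["B.Tech", "PhD", "B.Sc", "MBA", "B.Sc"], name="Education")

# Label encoding: one integer per category, assigned in sorted order.
le = LabelEncoder()
codes = le.fit_transform(education)
print(list(le.classes_))
print(codes.tolist())  # [1, 3, 0, 2, 0]

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(education, prefix="Education")
print(one_hot.columns.tolist())
```

The integer codes follow alphabetical order, so a linear or distance-based model would treat "MBA" (2) as lying between "B.Tech" (1) and "PhD" (3). One-hot columns avoid that implied ordering at the cost of more features.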

**TASK 7: Feature & Target Separation**

1. Separate features (X) and target (y)

In [18]:
X_numeric = df[["Experience (Years)", "Salary Expectation ($)",
                "Projects Count", "Education"]]

y = df["Recruiter Decision"]


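2. Explain why Recruiter Decision should not be included in X

Because it is the value the model learns to predict.
Including it in the features causes data leakage: the model would see the answer during training and report misleadingly high accuracy.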
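
Note that `X_numeric` holds only the four numeric columns; the TF-IDF matrix from Task 5 is not part of it. If the text features were to be included, one common option (an assumption here, not something this notebook does) is to stack the two blocks with `scipy.sparse.hstack`. The data below is made up.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical rows: cleaned text plus two numeric features per candidate.
texts = ["python nlp ai researcher", "linux cybersecurity analyst", "python deep learning"]
numeric = np.array([[6, 4], [8, 8], [5, 3]])   # e.g. experience, projects (made up)

text_part = TfidfVectorizer().fit_transform(texts)   # sparse, shape (3, n_terms)

# Stack numeric columns and TF-IDF columns side by side, keeping the result sparse.
X_combined = hstack([csr_matrix(numeric), text_part]).tocsr()

print(text_part.shape, X_combined.shape)
```

In a real workflow the vectorizer would be fit on the training rows only, for the same leakage reason discussed above.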

**TASK 8: Train–Test Split**

In [20]:
from sklearn.model_selection import train_test_split

X_train_num, X_test_num, y_train, y_test = train_test_split(
    X_numeric, y, test_size=0.2, random_state=42
)

Explain:

*   Why train–test split is necessary

To evaluate the model on data it did not see during training.

*   What overfitting means

The model performs well on the training data but poorly on new data.
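
A minimal sketch of overfitting on synthetic data (not the hiring dataset): an unrestricted decision tree memorizes noisy training labels and scores lower on the held-out rows. All names and parameters below are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 assigns random labels to about 20% of the samples,
# so memorizing the training labels cannot generalize.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

# A tree with no depth limit can fit every training row exactly.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print(f"train={train_acc:.2f} test={test_acc:.2f}")
```

The gap between the two scores is only visible because some rows were held out, which is the purpose of the split.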

**TASK 9: Feature Scaling**

In [22]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train_num)
X_test_scaled = scaler.transform(X_test_num)

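Explain:

• Why scaling is important.

SVM and KNN rely on distances between samples, so a large-range feature such as salary would dominate a small-range feature such as years of experience. Logistic Regression is not distance-based, but its regularization and solver convergence are sensitive to feature scale.

• Which models require scaling.

SVM, KNN, and Logistic Regression benefit from scaling. Random Forest does NOT require scaling, because tree splits depend only on the ordering of values within each feature.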
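
The effect of standardization can be checked on a few made-up rows. The arrays and the name `toy_scaler` are assumptions, chosen so the salary column dwarfs the experience column before scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: [experience in years, salary expectation in $]
X_train_toy = np.array([[2, 60000], [5, 80000], [8, 100000]], dtype=float)
X_test_toy = np.array([[5, 120000]], dtype=float)

toy_scaler = StandardScaler()
X_train_s = toy_scaler.fit_transform(X_train_toy)  # statistics from training rows only
X_test_s = toy_scaler.transform(X_test_toy)        # reused on the test row, never refit

print(X_train_s.mean(axis=0), X_train_s.std(axis=0))  # about 0 and 1 per column
print(X_test_s)
```

After scaling, both columns have mean 0 and standard deviation 1 on the training rows, so neither dominates a distance calculation. The test row is expressed in training-set units, which is why it can fall outside the training range.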

**TASK 10: Model Training**

1. Logistic Regression

In [23]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train_scaled, y_train)
lr_pred = lr.predict(X_test_scaled)


2. Random Forest

In [24]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train_num, y_train)
rf_pred = rf.predict(X_test_num)


3. Support Vector Machine

In [25]:
from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train_scaled, y_train)
svm_pred = svm.predict(X_test_scaled)


4. KNN

In [26]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)
knn_pred = knn.predict(X_test_scaled)


**TASK 11: Model Evaluation**

In [28]:
from sklearn.metrics import accuracy_score, classification_report

print("LR Accuracy:", accuracy_score(y_test, lr_pred))
print("RF Accuracy:", accuracy_score(y_test, rf_pred))
print("SVM Accuracy:", accuracy_score(y_test, svm_pred))
print("KNN Accuracy:", accuracy_score(y_test, knn_pred))

print(classification_report(y_test, rf_pred))

LR Accuracy: 0.965
RF Accuracy: 0.96
SVM Accuracy: 0.95
KNN Accuracy: 0.94
              precision    recall  f1-score   support

           0       0.90      0.93      0.91        46
           1       0.98      0.97      0.97       154

    accuracy                           0.96       200
   macro avg       0.94      0.95      0.94       200
weighted avg       0.96      0.96      0.96       200
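
The report's numbers can be traced back to a confusion matrix. The counts below are reconstructed by arithmetic from the rounded report and the 0.96 accuracy (an inference, not notebook output); the sketch rebuilds matching label vectors and recomputes the metrics.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score, recall_score
)

# Reconstructed counts (assumption): rows = true class, columns = predicted class.
cm = np.array([[43, 3],
               [5, 149]])

# Label vectors that produce exactly this confusion matrix.
y_true = np.repeat([0, 0, 1, 1], [43, 3, 5, 149])
y_hat = np.repeat([0, 1, 0, 1], [43, 3, 5, 149])
assert (confusion_matrix(y_true, y_hat) == cm).all()

acc = accuracy_score(y_true, y_hat)                         # (43 + 149) / 200
prec_reject = precision_score(y_true, y_hat, pos_label=0)   # 43 / (43 + 5)
rec_reject = recall_score(y_true, y_hat, pos_label=0)       # 43 / (43 + 3)
print(round(acc, 2), round(prec_reject, 2), round(rec_reject, 2))  # 0.96 0.9 0.93
```

Under these counts the model makes 8 errors in 200 predictions: 3 rejected candidates predicted as hires and 5 hires predicted as rejections.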



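Comparison Table

| Model               | Accuracy |
| ------------------- | -------- |
| Logistic Regression | 0.965    |
| Random Forest       | 0.96     |
| SVM                 | 0.95     |
| KNN                 | 0.94     |

Best Model:

In this run Logistic Regression scored highest (0.965), one test sample ahead of Random Forest (0.96) on the 200-row test set, so the two are effectively tied. Random Forest is often a strong choice for this kind of data because it:

Handles mixed features well

Reduces overfitting by averaging many trees

Works well with structured data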

**TASK 12: Pipeline + GridSearch (Advanced)**

In [29]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression())
])

param_grid = {
    "classifier__C": [0.01, 0.1, 1, 10]
}

grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_numeric, y)

print("Best Params:", grid.best_params_)
print("Best Score:", grid.best_score_)


Best Params: {'classifier__C': 10}
Best Score: 0.9639999999999999


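Explain why pipelines are used in production systems.

Prevent data leakage: preprocessing is fit only on the training folds

Clean workflow: preprocessing and model are a single object

Consistency: the same transformations are applied at training and prediction time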
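
A minimal sketch of these points on synthetic data, assuming the same two-step pipeline as above. The names `demo_pipe`, `X_demo`, and `y_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four numeric features.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# In each fold the scaler is fit on the training part only,
# then applied to the held-out part.
scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=5)

# After fitting, one object takes raw inputs and handles scaling and prediction.
demo_pipe.fit(X_demo, y_demo)
preds = demo_pipe.predict(X_demo[:3])
print(scores.mean(), preds)
```

Because scaling lives inside the pipeline, callers cannot forget it or apply a scaler fitted on different data.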

**TASK 13: Hiring Prediction Function**

In [30]:
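def predict_hiring(skills, experience, education, certifications, projects, salary):
    # The text is cleaned and vectorized, but `lr` was trained on the numeric
    # features only, so `text_vector` does not affect the prediction below.
    text = clean_text(skills + " " + certifications)
    text_vector = vectorizer.transform([text])

    # Encode education with the encoder fitted in Task 6
    # (raises ValueError for a label not seen during fitting).
    edu_encoded = encoder.transform([education])[0]

    # One-row frame with the same column names and order used to fit `scaler`.
    row = pd.DataFrame(
        [[experience, salary, projects, edu_encoded]],
        columns=["Experience (Years)", "Salary Expectation ($)",
                 "Projects Count", "Education"],
    )
    num_data = scaler.transform(row)

    prediction = lr.predict(num_data)[0]
    # Probability of class 1 ("Hire"), returned for either decision.
    probability = lr.predict_proba(num_data)[0][1]

    if prediction == 1:
        return "Hire", probability
    else:
        return "Reject", probability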


**TASK 14: Final Conclusion**

Write a conclusion covering:

*   Dataset Understanding

The dataset contained structured, categorical, and text resume information.


*   Key Preprocessing

Dropped unnecessary columns

Converted target to binary

Cleaned text

Applied TF-IDF to the combined text (the models above were trained on the numeric features only)

Scaled numeric features

*   Best Model

Logistic Regression achieved the highest test accuracy (0.965), narrowly ahead of Random Forest (0.96).

*   What I Learned

Text feature engineering

Vectorization techniques

Importance of preprocessing

Avoiding data leakage

Model comparison

*   Real-World Connection

This project simulates HR AI resume screening systems used by companies to:

Reduce manual screening

Improve hiring speed

Minimize bias

Automate candidate filtering