## **1. Data Loading**

In [3]:
import pandas as pd

# Load dataset
# Source: https://www.kaggle.com/datasets/rakeshrau/social-network-ads
df = pd.read_csv("Social_Network_Ads.csv")

# Display first few rows
df.head()

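|   | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---------|--------|-----|-----------------|-----------|
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |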


In [4]:
# Check dataset shape
df.shape

(400, 5)

## **2. Data Preprocessing**

In [5]:
# Drop the identifier column: User ID labels a row but carries no predictive signal
df = df.drop(columns=["User ID"])

# Encode categorical feature (Gender) as integers.
# LabelEncoder assigns codes in sorted label order, so Female -> 0 and Male -> 1.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df["Gender"] = le.fit_transform(df["Gender"])

# Separate features and target
X = df.drop(columns=["Purchased"])
y = df["Purchased"]

# Train-Test Split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

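# Feature Scaling
# Scaling is applied inside the pipeline (Section 3), so the scaler is fit
# on training data only. No standalone scaler object is needed here.
from sklearn.preprocessing import StandardScaler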
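The split above uses `test_size=0.2` and `random_state=42` without stratification. As a possible variation (not what this notebook does), passing `stratify=y` keeps the class proportions equal in the train and test parts. The sketch below illustrates this on a small synthetic target; `X_demo` and `y_demo` are made-up stand-ins, not the ads data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ads data (assumption): 100 rows, 30% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo preserves the 30% positive rate in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```

Whether stratification changes the results on the real dataset would need to be checked by rerunning the notebook.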

## **3. Pipeline Creation**

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression())
])

## **4. Model Selection**

**Logistic Regression:** Logistic regression suits binary classification problems such as this one, where the target `Purchased` is either 0 or 1, and it performs well on small-to-medium-sized datasets like these 400 rows. It is efficient to train, its coefficients are directly interpretable, and it is commonly used in marketing and ad click prediction tasks.
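To make the interpretability point concrete, here is a minimal sketch of what a fitted logistic regression computes: a weighted sum of the features passed through the sigmoid function. The weights and bias below are hypothetical values chosen for illustration, not coefficients learned from this dataset.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one row of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for standardized [Gender, Age, EstimatedSalary].
weights = [0.0, 2.0, 1.0]
bias = -1.0

# One standardized row: average gender code, age 1 std above mean,
# salary 0.5 std above mean.
p = predict_proba([0.0, 1.0, 0.5], weights, bias)

# exp(weight) is the factor by which the odds change when the
# corresponding standardized feature increases by one unit.
odds_ratios = [math.exp(w) for w in weights]

print("Predicted probability:", round(p, 3))
print("Odds ratios:", [round(r, 2) for r in odds_ratios])
```

For the fitted pipeline, the learned values should be available as `pipeline.named_steps["model"].coef_` and `.intercept_` once `fit` has been called.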

## **5. Model Training**

In [7]:
pipeline.fit(X_train, y_train)

## **6. Cross-Validation**

In [8]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(
    pipeline, X_train, y_train, cv=5, scoring="accuracy"
)

print("Cross-Validation Accuracy:", cv_scores.mean())
print("Standard Deviation:", cv_scores.std())

Cross-Validation Accuracy: 0.815625
Standard Deviation: 0.026882266459508208
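For a classifier, passing an integer such as `cv=5` to `cross_val_score` uses stratified folds without shuffling. The sketch below makes that choice explicit with `StratifiedKFold` and checks that both calls give the same scores. It runs on synthetic data from `make_classification` (an assumption for the sake of a self-contained example), so the numbers will differ from the notebook output above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), three features like the notebook's X.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=2, n_redundant=0, random_state=0
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# Integer cv: scikit-learn picks stratified folds for classifiers.
int_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring="accuracy")

# Same splitter, spelled out explicitly.
explicit_scores = cross_val_score(
    demo_pipeline, X_demo, y_demo, cv=StratifiedKFold(n_splits=5), scoring="accuracy"
)

print("Integer cv scores: ", int_scores)
print("Explicit cv scores:", explicit_scores)
```

Because the scaler sits inside the pipeline, it is refit on the training portion of every fold, so no statistics from the held-out fold leak into training.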


## **7. Hyperparameter Tuning**

In [9]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "model__C": [0.01, 0.1, 1, 10],
    "model__solver": ["liblinear", "lbfgs"]
}

grid = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="accuracy"
)

grid.fit(X_train, y_train)

In [10]:
print("Best Parameters:", grid.best_params_)
print("Best CV Score:", grid.best_score_)

Best Parameters: {'model__C': 0.01, 'model__solver': 'liblinear'}
Best CV Score: 0.81875
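Beyond the single best score, a fitted `GridSearchCV` keeps per-candidate results in `cv_results_`, which helps judge how much the tuned setting actually differs from the others. The sketch below lists them for a grid over `model__C` only, using synthetic data (an assumption), so its numbers are unrelated to the output above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=2, n_redundant=0, random_state=0
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

c_values = [0.01, 0.1, 1, 10]
demo_grid = GridSearchCV(
    demo_pipeline, {"model__C": c_values}, cv=5, scoring="accuracy"
)
demo_grid.fit(X_demo, y_demo)

# One entry per candidate: parameters, mean and spread of the fold scores.
results = demo_grid.cv_results_
for params, mean, std, rank in zip(
    results["params"],
    results["mean_test_score"],
    results["std_test_score"],
    results["rank_test_score"],
):
    print(f"rank {rank}: {params} mean={mean:.4f} std={std:.4f}")
```

If the gap between candidates is smaller than the fold-to-fold standard deviation, the choice of best parameters is not strongly supported by the data.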


## **8. Best Model Selection**

In [11]:
best_model = grid.best_estimator_

## **9. Model Evaluation**

In [12]:
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix
)

y_pred = best_model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.8625

Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.96      0.90        52
           1       0.90      0.68      0.78        28

    accuracy                           0.86        80
   macro avg       0.88      0.82      0.84        80
weighted avg       0.87      0.86      0.86        80


Confusion Matrix:
 [[50  2]
 [ 9 19]]
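The figures in the classification report follow directly from the confusion matrix. In scikit-learn's layout, rows are true classes and columns are predicted classes, so the matrix above reads as 50 true negatives, 2 false positives, 9 false negatives, and 19 true positives. A short check of the class-1 metrics:

```python
# Confusion matrix copied from the output above:
# rows are true labels, columns are predicted labels.
cm = [[50, 2],
      [9, 19]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_1 = tp / (tp + fp)   # of predicted purchasers, how many bought
recall_1 = tp / (tp + fn)      # of actual purchasers, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print("Accuracy:", round(accuracy, 4))
print("Precision (class 1):", round(precision_1, 2))
print("Recall (class 1):", round(recall_1, 2))
print("F1 (class 1):", round(f1_1, 2))
```

The recall of 0.68 for class 1 means the model misses 9 of the 28 actual purchasers in the test set, which matters more than overall accuracy if the goal is to find buyers.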


## **10. Save Model (Required for Deployment)**

In [13]:
import joblib

joblib.dump(best_model, "model.pkl")

['model.pkl']
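On the deployment side, the saved file is read back with `joblib.load`. Since the whole pipeline was saved, the restored object applies scaling and prediction in one call. The sketch below shows the round trip with a small stand-in pipeline trained on synthetic data and written to a temporary file; the names `demo_model` and `demo_model.pkl` are hypothetical, and the deployed app may load the model differently.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): three features, binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = (X_demo[:, 1] + X_demo[:, 2] > 0).astype(int)

demo_model = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
]).fit(X_demo, y_demo)

# Save and reload through a temporary directory to avoid leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_model, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same = np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same)
```

Input rows passed to the restored model need the same column order and the same Gender encoding used during training. Pickled models are also best loaded with the same scikit-learn version that saved them.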

## **Live Link**

https://huggingface.co/spaces/shehjaddev/Social-Network-Ads-Click-Prediction-System