# Assignment-1: Linear Regression Model for Employee Attrition Prediction (From Scratch)

**Dataset:** HR_Employee.csv  
**Target Variable:** Attrition  
**Input Variables (8):** Age, MonthlyIncome, TotalWorkingYears, YearsAtCompany, DistanceFromHome, JobSatisfaction, EnvironmentSatisfaction, YearsWithCurrManager  
**Note:** Linear Regression is implemented from scratch using NumPy (Normal Equation), without using scikit-learn.

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
print("Libraries imported successfully! (No scikit-learn used)")

## 2. Load the Dataset

In [None]:
df = pd.read_csv('HR_Employee.csv')
print(f"Dataset Shape: {df.shape}")
print(f"Number of Rows: {df.shape[0]}")
print(f"Number of Columns: {df.shape[1]}")
df.head()

## 3. Dataset Overview

In [None]:
print("=" * 60)
print("DATASET INFO")
print("=" * 60)
df.info()

In [None]:
print("=" * 60)
print("STATISTICAL SUMMARY")
print("=" * 60)
df.describe()

In [None]:
print("=" * 60)
print("MISSING VALUES")
print("=" * 60)
print(df.isnull().sum())
print(f"\nTotal Missing Values: {df.isnull().sum().sum()}")

In [None]:
print("=" * 60)
print("TARGET VARIABLE DISTRIBUTION")
print("=" * 60)
print(df['Attrition'].value_counts())
print(f"\nAttrition Rate: {df['Attrition'].value_counts(normalize=True)['Yes']*100:.2f}%")

## 4. Exploratory Data Analysis (EDA)

In [None]:
# Target variable distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

colors = ['#2ecc71', '#e74c3c']
df['Attrition'].value_counts().plot(kind='bar', ax=axes[0], color=colors, edgecolor='black')
axes[0].set_title('Attrition Count', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Attrition')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=0)

df['Attrition'].value_counts().plot(kind='pie', ax=axes[1], autopct='%1.1f%%', colors=colors, startangle=90)
axes[1].set_title('Attrition Distribution', fontsize=14, fontweight='bold')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()

In [None]:
# Distribution of selected input features
input_features = ['Age', 'MonthlyIncome', 'TotalWorkingYears', 'YearsAtCompany',
                   'DistanceFromHome', 'JobSatisfaction', 'EnvironmentSatisfaction', 'YearsWithCurrManager']

fig, axes = plt.subplots(2, 4, figsize=(20, 10))
axes = axes.flatten()

for i, col in enumerate(input_features):
    axes[i].hist(df[col], bins=20, color='#3498db', edgecolor='black', alpha=0.7)
    axes[i].set_title(f'Distribution of {col}', fontsize=11, fontweight='bold')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
# Box plots - Features vs Attrition
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
axes = axes.flatten()

for i, col in enumerate(input_features):
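    # Fix the category order so the palette maps consistently: No = green, Yes = red
    sns.boxplot(x='Attrition', y=col, data=df, ax=axes[i], order=['No', 'Yes'], palette=colors)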
    axes[i].set_title(f'{col} vs Attrition', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Encode Attrition: Yes = 1, No = 0
df['Attrition_Encoded'] = df['Attrition'].map({'Yes': 1, 'No': 0})

# Correlation heatmap for selected features
selected_cols = input_features + ['Attrition_Encoded']
corr_matrix = df[selected_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='RdYlBu_r', center=0, fmt='.2f',
            linewidths=0.5, square=True)
plt.title('Correlation Heatmap - Selected Features', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 5. Data Preprocessing

In [None]:
# Define input (X) and output (y)
X = df[input_features].values
y = df['Attrition_Encoded'].values

print("Input Features (X):")
print(input_features)
print(f"\nShape of X: {X.shape}")
print(f"Shape of y: {y.shape}")
print(f"\nTarget Encoding: No -> 0, Yes -> 1")

In [None]:
# Train-Test Split from scratch (80-20 split)
np.random.seed(42)
indices = np.random.permutation(len(X))
split_point = int(0.8 * len(X))

train_idx = indices[:split_point]
test_idx = indices[split_point:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size:  {X_test.shape[0]} samples")
print(f"\nTraining set shape: X_train={X_train.shape}, y_train={y_train.shape}")
print(f"Testing set shape:  X_test={X_test.shape}, y_test={y_test.shape}")
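
The permutation-based split above can be wrapped in a reusable function. The following is a minimal sketch, not part of the original notebook: the name `train_test_split_scratch` and its `test_size` and `seed` parameters are hypothetical, and because it uses `np.random.default_rng` it will not reproduce the exact permutation obtained with `np.random.seed(42)`.

```python
import numpy as np

def train_test_split_scratch(X, y, test_size=0.2, seed=42):
    """Shuffle row indices once and split them into train and test partitions."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    split_point = int(round((1 - test_size) * len(X)))
    train_idx, test_idx = indices[:split_point], indices[split_point:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy arrays for illustration: row i of X_toy is [2*i, 2*i + 1], and y_toy[i] = i
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split_scratch(X_toy, y_toy)
print(f"Train: {X_tr.shape}, Test: {X_te.shape}")
```

Indexing `X` and `y` with the same shuffled indices keeps each feature row aligned with its target value.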

## 6. Linear Regression from Scratch

Using the **Normal Equation**:  
$$\theta = (X^T X)^{-1} X^T y$$

Where:
- $X$ = input feature matrix (with bias column of 1s)
- $y$ = target vector
- $\theta$ = model parameters (weights + bias)
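
For reference, the Normal Equation follows from minimizing the sum of squared errors. A brief derivation, using the same symbols as above (the cost function name $J$ is introduced here for illustration):

$$J(\theta) = \lVert X\theta - y \rVert^2 = (X\theta - y)^T (X\theta - y)$$

$$\nabla_\theta J(\theta) = 2 X^T (X\theta - y) = 0 \;\Longrightarrow\; X^T X \, \theta = X^T y \;\Longrightarrow\; \theta = (X^T X)^{-1} X^T y$$

The last step assumes $X^T X$ is invertible, which requires the columns of $X$ (including the bias column) to be linearly independent.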

In [None]:
class LinearRegressionScratch:
    """
    Linear Regression implemented from scratch using the Normal Equation.
    No scikit-learn dependency.
    """
    
    def __init__(self):
        self.weights = None   # model weights (coefficients)
        self.bias = None      # intercept
    
    def fit(self, X, y):
        """Train the model using the Normal Equation: theta = (X^T X)^(-1) X^T y"""
        n_samples = X.shape[0]
        
        # Add bias column (column of 1s) to X
        X_bias = np.c_[np.ones(n_samples), X]
        
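        # Normal Equation: solve (X^T X) theta = X^T y for theta.
        # np.linalg.solve gives the same theta as (X^T X)^(-1) X^T y,
        # but is numerically more stable than forming the inverse explicitly.
        XtX = X_bias.T.dot(X_bias)
        Xty = X_bias.T.dot(y)
        theta = np.linalg.solve(XtX, Xty)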
        
        # Extract bias and weights
        self.bias = theta[0]
        self.weights = theta[1:]
        
        return self
    
    def predict(self, X):
        """Make predictions: y_pred = X * weights + bias"""
        return X.dot(self.weights) + self.bias

print("Linear Regression class defined from scratch (Normal Equation).")
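
One way to gain confidence in a Normal Equation implementation is to check it on synthetic data with known parameters and compare it against NumPy's least-squares solver. The sketch below is self-contained and illustrative only: the data, `true_weights`, and `true_bias` are made-up values, and the system is solved with `np.linalg.solve` rather than by forming the inverse.

```python
import numpy as np

# Synthetic data with known parameters (illustrative values, not from HR_Employee.csv)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_weights = np.array([0.5, -1.0, 2.0])
true_bias = 0.25
y_demo = X_demo @ true_weights + true_bias + rng.normal(scale=0.01, size=200)

# Normal Equation on the bias-augmented matrix: (X^T X) theta = X^T y
X_demo_bias = np.c_[np.ones(len(X_demo)), X_demo]
theta_normal = np.linalg.solve(X_demo_bias.T @ X_demo_bias, X_demo_bias.T @ y_demo)

# Reference solution from NumPy's least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(X_demo_bias, y_demo, rcond=None)

print("Normal Equation:", np.round(theta_normal, 3))
print("np.linalg.lstsq:", np.round(theta_lstsq, 3))
```

Both solutions should agree closely and recover the bias followed by the three weights, up to the small noise added to `y_demo`.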

In [None]:
# Train the model
model = LinearRegressionScratch()
model.fit(X_train, y_train)

print("Model trained successfully!")
print(f"\nIntercept (bias): {model.bias:.4f}")
print(f"\nCoefficients (weights):")
for feature, coef in zip(input_features, model.weights):
    print(f"  {feature:30s}: {coef:.6f}")

## 7. Make Predictions

In [None]:
# Predict on test set
y_pred = model.predict(X_test)

# Display first 10 predictions
results_df = pd.DataFrame({
    'Actual': y_test[:10],
    'Predicted (Raw)': np.round(y_pred[:10], 4),
    'Predicted (Class)': (y_pred[:10] >= 0.5).astype(int)  # thresholded at 0.5
})
print("First 10 Predictions:")
print(results_df.to_string(index=False))

## 8. Evaluation Metrics (From Scratch)

In [None]:
# ---- Regression Metrics (From Scratch) ----

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def root_mean_squared_error(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

def mean_absolute_error(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - (ss_res / ss_tot)

mse = mean_squared_error(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("=" * 50)
print("REGRESSION EVALUATION METRICS")
print("=" * 50)
print(f"Mean Squared Error (MSE):       {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"Mean Absolute Error (MAE):      {mae:.4f}")
print(f"R-squared (R²) Score:           {r2:.4f}")
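
A small worked example can help when interpreting these metrics on a 0/1 target. The sketch below uses hand-picked values (not model output) and compares the predictions against a baseline that always predicts the mean of the target, which yields an R² of 0 by construction.

```python
import numpy as np

# Hand-picked values for illustration only
y_true_demo = np.array([0, 0, 1, 1])
y_pred_demo = np.array([0.1, 0.4, 0.6, 0.9])

errors = y_true_demo - y_pred_demo
mse_demo = np.mean(errors ** 2)        # (0.01 + 0.16 + 0.16 + 0.01) / 4
rmse_demo = np.sqrt(mse_demo)
mae_demo = np.mean(np.abs(errors))     # (0.1 + 0.4 + 0.4 + 0.1) / 4

ss_res_demo = np.sum(errors ** 2)
ss_tot_demo = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1 - ss_res_demo / ss_tot_demo

# Baseline: always predict the mean of y_true
baseline_pred = np.full(len(y_true_demo), y_true_demo.mean())
r2_baseline = 1 - np.sum((y_true_demo - baseline_pred) ** 2) / ss_tot_demo

print(f"MSE={mse_demo:.3f}  RMSE={rmse_demo:.3f}  MAE={mae_demo:.3f}  R2={r2_demo:.2f}")
print(f"Baseline R2={r2_baseline:.2f}")
```

With these values the MSE is 0.085, the MAE is 0.25, and R² is 0.66. An R² near 0 on the test set would mean the model does little better than predicting the average attrition rate for every employee.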

## 9. Classification Analysis (Threshold = 0.5)

In [None]:
# ---- Classification Metrics (From Scratch) ----

# Convert predictions to binary
y_pred_class = (y_pred >= 0.5).astype(int)

# Accuracy
def accuracy_score(y_true, y_pred):
    return np.sum(y_true == y_pred) / len(y_true)

# Confusion Matrix
def confusion_matrix(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return np.array([[tn, fp], [fn, tp]])

# Precision, Recall, F1-Score
def precision_score(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fp) if (tp + fp) > 0 else 0

def recall_score(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn) if (tp + fn) > 0 else 0

def f1_score(y_true, y_pred):
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    return 2 * (p * r) / (p + r) if (p + r) > 0 else 0

acc = accuracy_score(y_test, y_pred_class)
prec = precision_score(y_test, y_pred_class)
rec = recall_score(y_test, y_pred_class)
f1 = f1_score(y_test, y_pred_class)
cm = confusion_matrix(y_test, y_pred_class)

print("=" * 50)
print("CLASSIFICATION METRICS (Threshold = 0.5)")
print("=" * 50)
print(f"Accuracy:  {acc:.4f} ({acc*100:.2f}%)")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1-Score:  {f1:.4f}")
print(f"\nConfusion Matrix:")
print(f"  TN={cm[0][0]}  FP={cm[0][1]}")
print(f"  FN={cm[1][0]}  TP={cm[1][1]}")
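
The 0.5 threshold is a convention rather than a requirement, and precision and recall trade off as it moves. The sketch below illustrates this on hand-made labels and scores (not model output); the helper name `precision_recall_at` is hypothetical.

```python
import numpy as np

# Hand-made labels and scores for illustration only (not model output)
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores_demo = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.55, 0.70])

def precision_recall_at(y_true, scores, threshold):
    """Binarize scores at the given threshold and return (precision, recall)."""
    y_hat = (scores >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_hat == 1))
    fp = np.sum((y_true == 0) & (y_hat == 1))
    fn = np.sum((y_true == 1) & (y_hat == 0))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(precision), float(recall)

for threshold in (0.3, 0.4, 0.5):
    p, r = precision_recall_at(y_true_demo, scores_demo, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.4 raises recall from 0.67 to 1.00 while precision drops from 1.00 to 0.75. The same sweep could be applied to `y_pred` and `y_test` to see how sensitive the reported metrics are to the threshold.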

In [None]:
# Confusion Matrix Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['No Attrition', 'Attrition'],
            yticklabels=['No Attrition', 'Attrition'])
plt.title('Confusion Matrix', fontsize=14, fontweight='bold')
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('Actual', fontsize=12)
plt.tight_layout()
plt.show()

## 10. Visualizations

In [None]:
# Actual vs Predicted
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5, color='#3498db', edgecolors='black', linewidth=0.5)
plt.plot([0, 1], [0, 1], 'r--', linewidth=2, label='Ideal Prediction Line')
plt.xlabel('Actual Values', fontsize=12)
plt.ylabel('Predicted Values', fontsize=12)
plt.title('Actual vs Predicted Values', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.tight_layout()
plt.show()

In [None]:
# Residual Plot
residuals = y_test - y_pred

plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals, alpha=0.5, color='#e74c3c', edgecolors='black', linewidth=0.5)
plt.axhline(y=0, color='black', linestyle='--', linewidth=1.5)
plt.xlabel('Predicted Values', fontsize=12)
plt.ylabel('Residuals', fontsize=12)
plt.title('Residual Plot', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Distribution of Residuals
plt.figure(figsize=(10, 6))
plt.hist(residuals, bins=30, color='#9b59b6', edgecolor='black', alpha=0.7)
plt.axvline(x=0, color='red', linestyle='--', linewidth=1.5)
plt.xlabel('Residuals', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Residuals', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
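# Feature coefficients (features are unscaled, so magnitudes reflect each feature's units,
# not its relative importance; the sign still shows the direction of the association)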
coef_df = pd.DataFrame({
    'Feature': input_features,
    'Coefficient': model.weights
}).sort_values(by='Coefficient', key=abs, ascending=True)

plt.figure(figsize=(10, 6))
colors_bar = ['#e74c3c' if c < 0 else '#2ecc71' for c in coef_df['Coefficient']]
plt.barh(coef_df['Feature'], coef_df['Coefficient'], color=colors_bar, edgecolor='black')
plt.xlabel('Coefficient Value', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Feature Importance (Linear Regression Coefficients)', fontsize=14, fontweight='bold')
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.8)
plt.tight_layout()
plt.show()
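
Raw coefficients depend on the units of each feature, so a feature measured in thousands (such as a monthly income) tends to get a very small coefficient even when its effect is substantial. A common remedy is to standardize the features before fitting. The sketch below demonstrates the idea on synthetic data; the feature names, distributions, and effect sizes are made up, and `fit_least_squares` is a hypothetical helper that uses `np.linalg.lstsq` rather than the notebook's class.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows = 500

# Two synthetic features on very different scales (hypothetical stand-ins, not real data)
income_demo = rng.normal(6000, 2000, size=n_rows)
satisfaction_demo = rng.normal(2.5, 1.0, size=n_rows)

# Both features have the same effect per standard deviation (0.1 each)
y_synth = (0.1 * (income_demo - 6000) / 2000
           + 0.1 * (satisfaction_demo - 2.5) / 1.0
           + rng.normal(scale=0.01, size=n_rows))

def fit_least_squares(X, y):
    """Return [bias, w1, w2, ...] for an ordinary least-squares fit."""
    X_bias = np.c_[np.ones(len(X)), X]
    theta, *_ = np.linalg.lstsq(X_bias, y, rcond=None)
    return theta

X_raw = np.c_[income_demo, satisfaction_demo]
mu, sigma = X_raw.mean(axis=0), X_raw.std(axis=0)
X_scaled = (X_raw - mu) / sigma   # zero mean, unit variance per column

theta_raw = fit_least_squares(X_raw, y_synth)
theta_scaled = fit_least_squares(X_scaled, y_synth)

print("Raw coefficients:         ", theta_raw[1:])
print("Standardized coefficients:", theta_scaled[1:])
```

The raw coefficients differ by orders of magnitude, while the standardized ones are of similar size, matching how the data were generated. In the notebook, the mean and standard deviation would be computed from the training set only and then applied to the test set.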

## 11. Summary

### Model Details:
- **Algorithm:** Linear Regression (implemented from scratch using Normal Equation)
- **No scikit-learn used** — all computations done with NumPy
- **Target Variable:** Attrition (encoded as 0 = No, 1 = Yes)
- **Input Features (8):** Age, MonthlyIncome, TotalWorkingYears, YearsAtCompany, DistanceFromHome, JobSatisfaction, EnvironmentSatisfaction, YearsWithCurrManager
- **Train-Test Split:** 80% Training, 20% Testing (manual random split)

### What was implemented from scratch:
1. **Linear Regression Model** — using the Normal Equation: θ = (XᵀX)⁻¹Xᵀy
2. **Train-Test Split** — using numpy random permutation
3. **Regression Metrics** — MSE, RMSE, MAE, R² Score
4. **Classification Metrics** — Accuracy, Precision, Recall, F1-Score, Confusion Matrix

### Key Observations:
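- Linear Regression produces continuous scores for this binary classification task. The scores are not constrained to the range [0, 1]; they are converted to class labels here by thresholding at 0.5.
- The sign of each coefficient indicates the direction of that feature's association with attrition. Because the features are not standardized, coefficient magnitudes depend on each feature's units and cannot be compared directly as a measure of relative importance.
- With the other features held fixed, features with negative coefficients are associated with lower predicted attrition, while positive coefficients are associated with higher predicted attrition.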
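
Because a linear model is unbounded, its raw outputs on a 0/1 target are not guaranteed to stay within [0, 1]. The sketch below shows this on a tiny made-up dataset with one feature, and uses `np.clip` as one simple way to bound the scores if they need to be read as probabilities. This is an illustration only, not a step from the notebook.

```python
import numpy as np

# Tiny made-up dataset: one feature, binary target
x_demo = np.arange(10, dtype=float)
y_demo_binary = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=float)

# Fit a line via the Normal Equation on the bias-augmented matrix
X_demo_b = np.c_[np.ones(len(x_demo)), x_demo]
theta_demo = np.linalg.solve(X_demo_b.T @ X_demo_b, X_demo_b.T @ y_demo_binary)
raw_scores = X_demo_b @ theta_demo

# Clipping bounds the scores but does not change labels at a 0.5 threshold
clipped_scores = np.clip(raw_scores, 0.0, 1.0)
labels_raw = (raw_scores >= 0.5).astype(int)
labels_clipped = (clipped_scores >= 0.5).astype(int)

print(f"Raw score range:     {raw_scores.min():.3f} to {raw_scores.max():.3f}")
print(f"Clipped score range: {clipped_scores.min():.3f} to {clipped_scores.max():.3f}")
print("Labels unchanged by clipping:", np.array_equal(labels_raw, labels_clipped))
```

Here the fitted line goes above 1 at the low end of the feature and below 0 at the high end, so the raw scores are better read as scores to threshold than as probabilities.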