# Classification in Machine Learning

## Introduction 

Classification is one of the fundamental tasks in supervised machine learning, where the goal is to predict categorical labels for data points based on their features. Unlike regression, which predicts continuous values, classification assigns discrete class labels to instances.

### Common Classification Problems

Classification problems are everywhere in real-world applications:
- **Email filtering systems** classify messages as spam or legitimate
- **Medical diagnosis tools** predict whether a patient has a particular disease
- **Credit card companies** detect fraudulent transactions
- **E-commerce platforms** predict whether customers will churn or remain active
- **Image recognition systems** classify objects in photographs

**Note:** In business, "churn" refers to customers leaving a company; the churn rate is the proportion of customers or subscribers lost over a specific period.

Each of these scenarios involves learning patterns from historical labeled data to make predictions on new, unseen instances.

### Binary vs Multiclass Classification

**Binary classification** involves predicting one of two possible classes, such as yes/no, true/false, or positive/negative. This is the most common type of classification problem. Examples include:
- Predicting customer churn (will churn or won't churn)
- Spam detection (spam or not spam)
- Disease diagnosis (disease present or absent)

**Multiclass classification** extends this to three or more classes. For instance:
- Classifying images of animals into categories like cat, dog, bird, or fish
- Recognizing handwritten digits 0 through 9
- Categorizing e-commerce products into dozens or hundreds of categories

### The Classification Workflow


A typical classification project follows several key stages:
1. Prepare and clean the data
2. Establish a validation framework to properly evaluate models
3. Perform exploratory data analysis (EDA) to understand patterns
4. Assess feature importance using various techniques
5. Engineer new features and encode categorical variables
6. Train classification models
7. Interpret model behavior
8. Deploy for predictions


## Data Preparation

### Handling Missing Values
- Identify which features contain missing values and assess the extent of missingness
- Decide on an appropriate strategy: remove rows with missing values, impute with mean/median/mode, or use advanced imputation techniques
- Document which approach was used for reproducibility
- Consider the impact of missing data patterns on model performance
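
The strategies above can be sketched with pandas. This is a minimal illustration on a made-up table (the column names `age` and `plan` are assumptions, not part of any real dataset); in a real project the imputation statistics would be computed on the training split only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numerical and a categorical column
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "plan": ["basic", "premium", None, "basic"],
})

# Extent of missingness: fraction of missing values per column
missing_share = df.isna().mean()

# Simple imputation: median for numbers, most frequent value for categories
df_imputed = df.copy()
df_imputed["age"] = df["age"].fillna(df["age"].median())
df_imputed["plan"] = df["plan"].fillna(df["plan"].mode()[0])

print(missing_share)
print(df_imputed)
```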

### Data Cleaning
- Remove duplicate records that could bias the model
- Fix inconsistencies in categorical values (e.g., "yes", "Yes", "YES" should be standardized)
- Correct obvious data entry errors and anomalies
- Ensure data types are appropriate for each feature
- Validate that values fall within expected ranges

### Data Formatting
- Convert string representations to appropriate types (dates, numbers, categories)
- Parse complex fields into usable components
- Standardize text fields (lowercase, remove special characters)
- Ensure consistent units of measurement across all records
- Handle special characters and encoding issues

### Outlier Treatment
- Identify outliers using statistical methods (IQR, z-scores) or visualization
- Determine if outliers are errors (remove) or genuine extreme values (keep)
- Consider capping extreme values or transforming features to reduce outlier impact
- Document decisions about outlier handling
- Be cautious not to remove legitimate data points
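
As a rough sketch of the IQR method mentioned above, the snippet below flags values outside 1.5 × IQR of the quartiles and shows capping as an alternative to removal. The data and the 1.5 multiplier are illustrative assumptions; the right threshold depends on the domain.

```python
import numpy as np

# Hypothetical measurements with one suspicious value
values = np.array([12, 14, 15, 15, 16, 17, 18, 95], dtype=float)

# Interquartile range and the conventional 1.5 * IQR fences
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag points outside the fences
is_outlier = (values < lower) | (values > upper)

# Capping keeps every row but limits the influence of extremes
capped = np.clip(values, lower, upper)

print("fences:", lower, upper)
print("outliers:", values[is_outlier])
```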

### Feature Scaling
- Some algorithms (like logistic regression) benefit from normalized or standardized features
- **Standardization** transforms features to have mean 0 and standard deviation 1
- **Normalization** scales features to a fixed range, typically [0, 1]
- Note: Tree-based models don't require feature scaling
- Always fit scalers on training data only, then transform test data
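
The last point is easy to get wrong, so here is a small sketch using scikit-learn's `StandardScaler` on made-up arrays. The scaler learns the mean and standard deviation from the training data and reuses them for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5], [10.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from training data
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

print(scaler.mean_, scaler.scale_)
print(X_test_scaled.ravel())
```

Note that the scaled test values are not guaranteed to have mean 0 and standard deviation 1; they are expressed relative to the training distribution.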


## Setting Up Validation Framework

A proper validation framework is crucial for honestly assessing model performance and avoiding overfitting.

### Data Splitting Strategy
- Divide the dataset into three distinct subsets: **training, validation, and test sets**
- A common choice is a 60/20/20 split (train/validation/test)
- Ensure the split is random to avoid introducing bias
- For time-series data, use temporal splits to respect time ordering
- Use `train_test_split` from scikit-learn with a fixed random_state for reproducibility
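
One way to obtain a 60/20/20 split is to call `train_test_split` twice, as sketched below on synthetic data. The `stratify` argument is an optional addition that keeps class proportions equal across subsets; the `random_state` value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 25% positive class
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 75 + [1] * 25)

# First hold out 20% of the data as the test set
X_full_train, X_test, y_full_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# 25% of the remaining 80% equals 20% of the original data
X_train, X_val, y_train, y_val = train_test_split(
    X_full_train, y_full_train, test_size=0.25, random_state=1, stratify=y_full_train
)

print(len(X_train), len(X_val), len(X_test))
```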

### Purpose of Each Set

**Training Set**
- Used to fit the model and learn feature-target relationships
- The model sees this data during the learning process
- Typically the largest subset (60-80% of data)

**Validation Set**
- Used during model development to tune hyperparameters
- Compare different models and select the best one
- Never used for training the model
- Helps detect overfitting

**Test Set**
- Held out until the very end to provide an unbiased estimate of final model performance
- Never used for any decision-making during model development
- Provides the final performance metric you report
- Simulates real-world performance on completely unseen data

### Cross-Validation

**K-Fold Cross-Validation**
- Splits training data into k subsets (folds), typically k=5 or k=10
- The model is trained k times, each time using k-1 folds for training and 1 fold for validation
- Results are averaged across all folds to get a more robust performance estimate
- Particularly useful when you have limited data
- Provides a more reliable estimate of model performance than a single train-validation split

**Stratified K-Fold**
- Maintains class proportions in each fold
- Important for imbalanced datasets
- Ensures each fold is representative of the overall class distribution
- Reduces variance in performance estimates
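
A minimal sketch of stratified 5-fold cross-validation with scikit-learn follows. The data is synthetic and imbalanced (80/20), and logistic regression is used only as a placeholder model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 80 negatives around 0, 20 positives around 2
X = np.vstack([rng.normal(0, 1, size=(80, 2)), rng.normal(2, 1, size=(20, 2))])
y = np.array([0] * 80 + [1] * 20)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy score per fold; the mean is the cross-validated estimate
scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="accuracy")

# Each validation fold keeps the overall 20% positive share
fold_pos_share = [y[val_idx].mean() for _, val_idx in cv.split(X, y)]

print("mean accuracy:", scores.mean())
print("positive share per fold:", fold_pos_share)
```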

### Handling Class Imbalance
- Check if your target classes are balanced or imbalanced
- Use stratified splitting to ensure each subset has representative class proportions
- Consider oversampling the minority class, undersampling the majority class, or using SMOTE
- Adjust evaluation metrics to account for imbalance (use precision, recall, F1 instead of just accuracy)
- Use the `class_weight` parameter in models to give more importance to the minority class


## Exploratory Data Analysis (EDA)

EDA is the process of investigating your dataset to discover patterns, relationships, and potential issues before modeling.

### Target Variable Analysis
- Examine the distribution of the target variable (class balance)
- Calculate the proportion of each class
- Identify severe class imbalance that might require special handling
- Visualize class distribution with bar charts or pie charts
- Establish a baseline: what accuracy would you get by always predicting the majority class?
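
The class proportions and the majority-class baseline can be read off directly with pandas. The target values below are made up for illustration.

```python
import pandas as pd

# Hypothetical binary target: 1 = churned, 0 = stayed
y = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="churn")

# Proportion of each class
class_share = y.value_counts(normalize=True)

# Accuracy of a model that always predicts the most frequent class
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(class_share)
print("baseline accuracy:", baseline_accuracy)
```

Any trained model should beat this baseline before its accuracy is considered meaningful.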

### Feature Distributions

**For Numerical Features**
- Create histograms to understand their distributions
- Use box plots to identify outliers and compare distributions across classes
- Generate density plots to see smoothed distribution shapes
- Check for skewness that might benefit from transformation
- Look for bimodal or multimodal distributions

**For Categorical Features**
- Count unique values and their frequencies
- Identify rare categories that might need grouping
- Check for high cardinality features
- Look for missing or unknown categories

### Univariate Analysis
- Examine each feature individually before looking at relationships
- Calculate summary statistics (mean, median, std, min, max, quartiles)
- Identify features with low variance that might not be useful
- Look for features with excessive missing values
- Understand the range and distribution of each feature

### Bivariate Analysis
- Plot features against the target variable to identify predictive relationships
- Use grouped bar charts for categorical features vs target
- Create box plots showing feature distributions for each class
- Generate scatter plots for numerical features colored by target class
- Look for clear separation between classes

### Multivariate Analysis
- Examine correlations between features using correlation matrices and heatmaps
- Identify highly correlated features that might cause multicollinearity
- Use pair plots to visualize multiple feature relationships simultaneously
- Consider dimensionality reduction techniques (PCA) for high-dimensional data visualization
- Look for feature interactions that might be important

### Key Questions to Answer
- Which features show clear separation between classes?
- Are there any obvious patterns or trends?
- Do any features have non-linear relationships with the target?
- Are there interactions between features that might be important?
- Is the data quality sufficient for modeling?
- Are there any data quality issues that need addressing?


## Feature Importance: Churn Rate and Risk Ratio

Risk ratio is an intuitive way to measure how different values of a categorical feature relate to the target outcome, particularly useful in churn prediction problems.

### Understanding Churn Rate
- **Churn rate** is the proportion of customers who left (churned) within a specific group
- Calculate it by dividing the number of churned customers by total customers in that group
- Formula: `Churn Rate = (Number who churned) / (Total in group)`
- Example: If 30 out of 100 customers with feature value A churned, the churn rate is 0.30 or 30%
- This gives us a baseline understanding of churn within specific segments

### Calculating Risk Ratio
- **Risk ratio** compares a group's churn rate to the global average churn rate
- Formula: `Risk Ratio = (Group Churn Rate) / (Global Churn Rate)`
- Example: If global churn is 20% and a group has 40% churn, risk ratio = 40/20 = 2.0
- This means the group is twice as likely to churn compared to average
- Normalizes churn rates for easy comparison across different features

### Interpreting Risk Ratio
- **Risk ratio = 1**: The group has average risk (same as global churn rate)
- **Risk ratio > 1**: Higher risk (feature value associated with more churn)
- **Risk ratio < 1**: Lower risk (feature value associated with less churn)
- **Risk ratio = 2**: Twice the risk
- **Risk ratio = 0.5**: Half the risk

### Practical Application
- Calculate risk ratio for each category within categorical features
- Identify high-risk segments that need intervention or special attention
- Use risk ratios to create new binary features (high_risk vs low_risk)
- Helps in feature selection by identifying which categorical features are most predictive
- Provides business-friendly interpretation of feature importance

### Example: Contract Type Analysis

Suppose we're predicting customer churn and examining the "contract_type" feature:
- Overall churn rate: 25%
- Month-to-month contracts: 45% churn rate → Risk ratio = 45/25 = **1.8**
- One-year contracts: 15% churn rate → Risk ratio = 15/25 = **0.6**
- Two-year contracts: 5% churn rate → Risk ratio = 5/25 = **0.2**

**Interpretation**: This clearly shows contract type is highly predictive, with month-to-month being high risk (1.8x normal risk), one-year being lower risk (0.6x), and two-year being very low risk (0.2x).
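
The same calculation can be sketched in pandas. The group sizes below (200, 200, and 100 customers) are hypothetical, chosen only so that the group and overall churn rates match the example above.

```python
import pandas as pd

# Hypothetical customer table reproducing the rates in the example
df = pd.DataFrame(
    [("month-to-month", 1)] * 90 + [("month-to-month", 0)] * 110
    + [("one-year", 1)] * 30 + [("one-year", 0)] * 170
    + [("two-year", 1)] * 5 + [("two-year", 0)] * 95,
    columns=["contract_type", "churn"],
)

# Global churn rate across all customers
global_churn = df["churn"].mean()

# Churn rate and size per group, then the ratio to the global rate
summary = df.groupby("contract_type")["churn"].agg(churn_rate="mean", customers="count")
summary["risk_ratio"] = summary["churn_rate"] / global_churn

print("global churn:", global_churn)
print(summary)
```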


## Feature Importance: Mutual Information

Mutual information is a statistical measure that quantifies the dependency between a feature and the target variable.

### What is Mutual Information?
- Measures how much knowing the value of one variable reduces uncertainty about another variable
- Based on information theory concepts (entropy)
- Captures both **linear and non-linear** relationships between features and target
- Higher mutual information scores indicate stronger relationships with the target
- A score of 0 means the feature and target are completely independent

### How It Works
- Measures the reduction in entropy (uncertainty) of the target when we know the feature value
- Unlike correlation, it can detect complex, non-linear dependencies
- Works for both categorical and numerical features
- No assumptions about the relationship between variables
- Symmetric: MI(X, Y) = MI(Y, X)
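
To make the entropy-reduction idea concrete, here is a small from-scratch sketch for discrete variables, using MI(X, Y) = H(Y) - H(Y | X) measured in bits. The helper names and the toy data are illustrative; scikit-learn's `mutual_info_score` computes the same quantity using natural logarithms.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a discrete sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mutual_information(x, y):
    """MI(X, Y) = H(Y) - H(Y | X) for two discrete sequences of equal length."""
    n = len(x)
    h_y_given_x = 0.0
    for value, count in Counter(x).items():
        # Entropy of the target within the subset where x == value
        subset = [yi for xi, yi in zip(x, y) if xi == value]
        h_y_given_x += (count / n) * entropy(subset)
    return entropy(y) - h_y_given_x

# Hypothetical data: contract type carries some signal, "noise" carries none
contract = ["monthly"] * 4 + ["yearly"] * 4
churn = [1, 1, 1, 0, 0, 0, 0, 1]
noise = ["a", "b", "a", "b", "a", "b", "a", "b"]

mi_contract = mutual_information(contract, churn)
mi_noise = mutual_information(noise, churn)

print("MI(contract, churn):", mi_contract)
print("MI(noise, churn):", mi_noise)
```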



### Advantages of Mutual Information
- Detects **non-linear relationships** that correlation might miss
- Works with categorical features without requiring encoding
- Provides a universal measure applicable to any feature type
- Not sensitive to monotonic transformations of features
- No assumptions about distribution of variables
- Can handle mixed types of features

### Limitations
- Requires sufficient data to estimate accurately
- Sensitive to feature discretization for continuous variables
- Computationally more expensive than correlation
- Doesn't indicate direction of relationship (positive or negative)

### Practical Use
- Rank features by mutual information scores for feature selection
- Remove features with very low MI scores to reduce dimensionality
- Compare MI scores across different feature sets
- Use in combination with other importance measures for robust feature selection
- Particularly useful when relationships are non-linear


## Feature Importance: Correlation

Correlation measures the linear relationship between numerical features and the target variable.

### Understanding Correlation
- **Correlation coefficient** quantifies the strength and direction of linear relationships
- Values range from **-1** (perfect negative correlation) to **+1** (perfect positive correlation)
- A value of **0** indicates no linear relationship
- Most commonly uses **Pearson correlation coefficient**
- Formula captures how two variables move together

### Correlation with Binary Target
- For binary classification (target encoded as 0 and 1), we can calculate correlation between features and target
- **Positive correlation** means higher feature values are associated with the positive class (1)
- **Negative correlation** means higher feature values are associated with the negative class (0)
- The **magnitude** (absolute value) indicates strength of relationship
- Can be calculated directly with pandas `.corr()` method
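
As an illustration, the sketch below correlates two numerical features with a 0/1 target using `DataFrame.corrwith`. The column names and values are invented for the example.

```python
import pandas as pd

# Hypothetical customer data with a binary churn target
df = pd.DataFrame({
    "tenure": [1, 3, 5, 12, 24, 36, 48, 60],
    "monthly_charges": [90, 85, 80, 70, 65, 50, 45, 30],
    "churn": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Pearson correlation of each feature with the target
corr_with_target = df[["tenure", "monthly_charges"]].corrwith(df["churn"])

# Rank by magnitude, since sign only indicates direction
ranked = corr_with_target.abs().sort_values(ascending=False)

print(corr_with_target)
print(ranked)
```

In this toy data, longer tenure goes with less churn (negative sign) and higher charges go with more churn (positive sign).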

### Interpreting Correlation Values

**Strength Guidelines**
- **|r| > 0.7**: Strong correlation
- **0.3 < |r| ≤ 0.7**: Moderate correlation
- **|r| ≤ 0.3**: Weak correlation

**Direction**
- **Positive (r > 0)**: Variables move in the same direction
- **Negative (r < 0)**: Variables move in opposite directions
- **r = 0**: No linear relationship

### Limitations of Correlation
- **Only captures linear relationships**: Misses non-linear patterns
- **Sensitive to outliers**: Outliers can artificially inflate or deflate correlation
- **Correlation ≠ causation**: Strong correlation doesn't imply one causes the other
- **May be misleading** for categorical or ordinal features
- Can't detect complex interactions between variables
- Assumes both variables are continuous

### Feature-Feature Correlation (Multicollinearity)
- Examine correlations between features themselves
- **Highly correlated features** (|r| > 0.8-0.9) provide redundant information
- **Multicollinearity** can make model interpretation difficult
- Consider removing one feature from highly correlated pairs
- Helps reduce dimensionality with little loss of information
- Use correlation heatmaps to visualize the correlation matrix

### Using Correlation for Feature Selection
- Select features with high absolute correlation with target
- Remove redundant features with high inter-feature correlation
- Use as one of several feature importance metrics
- Combine with domain knowledge for better decisions
- Be cautious about removing features that might have non-linear effects

### Comparison with Other Methods
- **Correlation**: Fast, simple, only linear relationships
- **Mutual Information**: Captures non-linear, more computationally expensive
- **Risk Ratio**: Intuitive for categorical features, business-friendly
- Use multiple methods for robust feature selection

---

## Feature Engineering

Feature engineering is the process of creating new features or transforming existing ones to improve model performance.

### Why Feature Engineering Matters
- Good features can dramatically improve model accuracy
- Domain knowledge is crucial for creating meaningful features
- Often makes the difference between mediocre and excellent models
- Can reveal hidden patterns not apparent in raw features
- Sometimes more impactful than algorithm choice

### Common Feature Engineering Techniques

#### Creating Interaction Features
- Combine two or more features through multiplication or other operations
- Example: `total_price = quantity × unit_price`
- Useful when the combination is more predictive than individual features
- Can capture synergistic effects between variables
- Example: `price_per_sqft = price / square_footage`

#### Polynomial Features
- Create squared, cubed, or other polynomial terms
- Helps capture non-linear relationships
- Example: `age²` might better capture age-related patterns
- Be cautious of overfitting with high-degree polynomials
- Example: `income²` for income effects that accelerate

#### Binning Continuous Variables
- Convert numerical features into categorical bins
- Example: `age → age_group (young, middle-aged, senior)`
- Can make relationships easier to interpret
- Useful when relationships are non-linear or have thresholds
- Example: `credit_score → credit_rating (poor, fair, good, excellent)`
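
Binning can be done with `pandas.cut`. The cut points below (30 and 60) are arbitrary assumptions for illustration; sensible thresholds depend on the problem.

```python
import pandas as pd

ages = pd.Series([19, 34, 52, 67, 45])

# Intervals are right-inclusive by default: (0, 30], (30, 60], (60, 120]
age_group = pd.cut(
    ages,
    bins=[0, 30, 60, 120],
    labels=["young", "middle-aged", "senior"],
)

print(age_group.tolist())
```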

#### Date and Time Features
- Extract components: `year, month, day, day_of_week, hour`
- Calculate time differences: `days_since_last_purchase`
- Create cyclical features for periodic patterns (day of week, month)
- Identify special periods: `is_weekend, is_holiday, is_month_end`
- Calculate duration: `account_age_days = current_date - registration_date`
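
A short sketch of extracting such features with the pandas `.dt` accessor is shown below. The column names and dates are hypothetical.

```python
import pandas as pd

# Hypothetical customer dates
df = pd.DataFrame({
    "registration_date": pd.to_datetime(["2023-01-15", "2023-06-30"]),
    "last_purchase": pd.to_datetime(["2023-03-04", "2023-07-05"]),
})

# Calendar components of the registration date
df["reg_month"] = df["registration_date"].dt.month
df["reg_day_of_week"] = df["registration_date"].dt.dayofweek  # Monday=0, Sunday=6

# Special-period flag: Saturday (5) or Sunday (6)
df["is_weekend"] = df["last_purchase"].dt.dayofweek >= 5

# Time difference in whole days
df["days_to_purchase"] = (df["last_purchase"] - df["registration_date"]).dt.days

print(df)
```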

#### Aggregation Features
- Calculate statistics across groups: `average_transaction_per_customer`
- Count-based features: `number_of_purchases, number_of_logins`
- Rolling window statistics: `30_day_average, moving_standard_deviation`
- Useful in transactional or time-series data
- Example: `total_spent_last_3_months, purchase_frequency`

#### Text Features
- Extract length of text: `comment_length, title_word_count`
- Count specific characters or patterns: `num_exclamation_marks, num_urls`
- Extract sentiment scores or keywords
- Create TF-IDF or word embeddings for text classification
- Boolean features: `contains_email, contains_phone_number`

#### Ratio and Rate Features
- Create ratios between related features
- Example: `debt_to_income_ratio = total_debt / annual_income`
- Example: `conversion_rate = conversions / visits`
- Example: `average_order_value = total_revenue / number_orders`
- Often more meaningful than absolute values

#### Domain-Specific Features
- Apply business or domain knowledge to create meaningful features
- Example in credit scoring: `debt-to-income ratio, credit_utilization`
- Example in e-commerce: `average_order_value, purchase_frequency, days_since_last_order`
- Example in healthcare: `BMI = weight / height²`
- These often become the most powerful predictors

#### Feature Transformation
- **Log transformation** for right-skewed features
- **Square root** or **box-cox transformations**
- **Normalize** or **standardize** features
- Convert to categorical through binning or encoding
- Apply domain-specific transformations
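
For example, a log transformation compresses a right-skewed feature so that a few very large values no longer dominate. The income figures below are made up; `np.log1p` computes log(1 + x), which stays defined when a value is 0.

```python
import numpy as np

# Hypothetical right-skewed feature with one very large value
income = np.array([20_000, 35_000, 50_000, 80_000, 1_200_000], dtype=float)

log_income = np.log1p(income)      # log(1 + x)
recovered = np.expm1(log_income)   # inverse transform, exp(x) - 1

print("raw max/min ratio:", income.max() / income.min())
print("log max/min ratio:", log_income.max() / log_income.min())
```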


## One-Hot Encoding

Most machine learning algorithms require numerical inputs, but real-world data often contains categorical variables. One-hot encoding is the standard technique for converting categorical features into numerical format.

### What is One-Hot Encoding?
- Converts categorical variables into binary (0/1) dummy variables
- Each unique category becomes a separate binary column
- In each row, exactly one of these columns has value 1 and all others have 0 (hence "one-hot")
- Preserves categorical information without imposing ordinal relationships
- Treats all categories as equally different (no inherent ordering)

### How It Works

Suppose we have a feature "color" with values: red, blue, green

**Original Data:**
```
Sample 1: color = "red"
Sample 2: color = "blue"
Sample 3: color = "red"
Sample 4: color = "green"
```

**After One-Hot Encoding:**
```
Sample 1: color_red=1, color_blue=0, color_green=0
Sample 2: color_red=0, color_blue=1, color_green=0
Sample 3: color_red=1, color_blue=0, color_green=0
Sample 4: color_red=0, color_blue=0, color_green=1
```

### Why One-Hot Encoding?

**Problem with Label Encoding**
- Simply converting categories to numbers (red=1, blue=2, green=3) implies ordering
- Model might interpret blue (2) as "between" red (1) and green (3)
- Creates artificial magnitude relationships
- Not appropriate for nominal categories

**One-Hot Encoding Solution**
- No ordinal relationship is implied
- Each category is treated independently
- Model learns separate weights for each category
- Appropriate for nominal categorical variables

### Using DictVectorizer in Scikit-Learn

DictVectorizer is an efficient way to implement one-hot encoding, particularly useful when data is in dictionary format.

**Advantages of DictVectorizer:**
- Converts dictionaries of feature-value pairs into numerical feature vectors
- Automatically handles both categorical and numerical features
- Efficient implementation with sparse matrices
- Remembers feature names for interpretation
- Single tool for mixed data types

**Basic Usage:**
```python
from sklearn.feature_extraction import DictVectorizer

# Original data as list of dictionaries
data = [
    {'color': 'red', 'size': 'large', 'price': 100},
    {'color': 'blue', 'size': 'small', 'price': 50},
    {'color': 'green', 'size': 'medium', 'price': 75}
]

# Create and fit DictVectorizer
dv = DictVectorizer(sparse=False)
X = dv.fit_transform(data)

# Get feature names
feature_names = dv.get_feature_names_out()
print(feature_names)
# Output: ['color=blue', 'color=green', 'color=red', 'price', 'size=large', 'size=medium', 'size=small']

# X now contains one-hot encoded categorical features and original numerical features
print(X)
```

**With Pandas DataFrames:**
```python
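# Assumes df_train and df_test are existing DataFrames with the same columns
from sklearn.feature_extraction import DictVectorizer

# Convert DataFrame rows to dictionaries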
train_dict = df_train.to_dict(orient='records')

# Fit and transform
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dict)

# Transform test data (use transform, not fit_transform)
test_dict = df_test.to_dict(orient='records')
X_test = dv.transform(test_dict)
```

### DictVectorizer vs OneHotEncoder

**DictVectorizer:**
- Works with dictionaries
- Handles mixed types (categorical and numerical) automatically
- Single step for all preprocessing
- Convenient when data is already in dictionary format
- Good for Kaggle-style tabular data

**OneHotEncoder:**
- Works with arrays
- Requires separate preprocessing for different feature types
- More control over encoding parameters
- Part of sklearn.preprocessing
- Better for pipelines with ColumnTransformer

### Important Considerations

**Dummy Variable Trap**
- If you one-hot encode n categories, you only need n-1 columns
- The nth column is redundant (can be inferred from others)
- Can cause perfect multicollinearity in linear models
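- DictVectorizer keeps all n columns; regularized models (such as scikit-learn's `LogisticRegression`, which applies L2 regularization by default) tolerate the redundancy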
- Use `drop='first'` parameter in OneHotEncoder to avoid this

**High Cardinality Features**
- Features with many unique categories create many columns
- Increases dimensionality significantly
- Can lead to sparse data and overfitting
- Consider alternative encodings for high-cardinality features

**Unknown Categories at Test Time**
- What if test data contains categories not seen in training?
- DictVectorizer ignores unknown categories by default
- OneHotEncoder can be set to handle unknown categories
- Set `handle_unknown='ignore'` in OneHotEncoder

**Memory Considerations**
- One-hot encoding increases feature dimensionality
- Use sparse matrices (`sparse=True`, the default in DictVectorizer) to save memory
- Sparse matrices store only non-zero values
- Important for datasets with many categorical features

### When to Use One-Hot Encoding
- For nominal categorical features (no inherent order)
- With tree-based models and linear models
- When you have relatively few categories per feature (< 50)
- Essential preprocessing step for most ML algorithms
- Default choice for categorical encoding

### Alternatives for High-Cardinality Features

**Target Encoding (Mean Encoding)**
- Replace category with mean target value for that category
- Risk of overfitting; use cross-validation techniques
- Reduces dimensionality significantly

**Frequency Encoding**
- Replace category with its frequency in the dataset
- Simple and effective for some problems

**Binary Encoding**
- Convert categories to binary numbers
- Fewer columns than one-hot encoding
- Useful for high-cardinality features

**Embedding Layers**
- Learn dense representations for categories
- Used in deep learning models
- Captures relationships between categories

### Complete Example

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
import pandas as pd

# Sample data
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['L', 'M', 'S', 'L', 'M'],
    'price': [100, 50, 75, 120, 60],
    'sold': [1, 0, 1, 1, 0]
})

# Split data
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

# Prepare features and target
X_train_dict = df_train[['color', 'size', 'price']].to_dict(orient='records')
y_train = df_train['sold'].values

X_test_dict = df_test[['color', 'size', 'price']].to_dict(orient='records')
y_test = df_test['sold'].values

# One-hot encode with DictVectorizer
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(X_train_dict)
X_test = dv.transform(X_test_dict)

print("Feature names:", dv.get_feature_names_out())
print("Training data shape:", X_train.shape)
print("First sample encoded:", X_train[0])
```

---

## Machine Learning for Classification

With properly prepared and encoded data, we're ready to train classification models.

### What is a Classification Model?
- A classification model learns patterns from labeled training data
- It creates a **decision boundary** that separates different classes
- Once trained, it can predict class labels for new, unseen data
- The model learns a function that maps input features to output classes
- Training involves finding optimal parameters that minimize prediction errors

### The Learning Process
1. Feed the model training data with both features (X) and labels (y)
2. The model adjusts its internal parameters to minimize prediction errors
3. Different algorithms use different strategies to find optimal parameters
4. The goal is to **generalize** well to new data, not just memorize training data
5. Validation data helps ensure the model isn't overfitting

### Types of Classification Algorithms

**Linear Models**
- Logistic Regression
- Linear Discriminant Analysis (LDA)
- Simple, interpretable, fast
- Work well when classes are linearly separable

**Tree-Based Models**
- Decision Trees: Simple, interpretable, handle non-linear relationships
- Random Forests: Ensemble of trees, robust, high accuracy
- Gradient Boosting (XGBoost, LightGBM, CatBoost): State-of-the-art performance
- Handle non-linear relationships and interactions naturally

**Instance-Based Learning**
- K-Nearest Neighbors (KNN): Classifies based on similarity to nearby examples
- Simple concept but can be slow with large datasets
- No training phase, just stores data

**Probabilistic Models**
- Naive Bayes: Fast, works well with text data
- Based on Bayes' theorem and conditional probability
- Assumes feature independence (naive assumption)

**Support Vector Machines (SVM)**
- Finds optimal hyperplane to separate classes
- Works well in high-dimensional spaces
- Can use kernel trick for non-linear boundaries

**Neural Networks and Deep Learning**
- Can learn complex, hierarchical patterns
- Requires more data and computational resources
- Excellent for image, text, and complex structured data

### Choosing a Classification Algorithm

**Consider Dataset Size**
- Small datasets (< 1000 samples): Logistic Regression, Naive Bayes, SVM
- Medium datasets (1000-100K): Random Forest, Gradient Boosting
- Large datasets (> 100K): Logistic Regression, Neural Networks, LightGBM

**Feature Types**
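- Mixed categorical and numerical: Tree-based models (some implementations, such as CatBoost and LightGBM, handle categorical features natively; scikit-learn's decision trees and random forests need encoded inputs)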
- Mostly numerical: Any algorithm, but consider scaling for linear models
- High-dimensional: Logistic Regression, SVM, Neural Networks

**Interpretability Needs**
- High interpretability: Logistic Regression, Decision Trees
- Moderate interpretability: Random Forest (feature importance)
- Black box acceptable: Gradient Boosting, Neural Networks

**Performance Requirements**
- Fast training: Logistic Regression, Naive Bayes
- Fast prediction: Logistic Regression, Linear models
- Highest accuracy: Gradient Boosting, Neural Networks (with enough data)

**General Strategy**
- Start simple: Begin with Logistic Regression as baseline
- Try tree-based: Random Forest or XGBoost for better performance
- Experiment: Test multiple algorithms and compare
- Ensemble: Combine multiple models for best results

---

## Logistic Regression

Logistic regression is one of the most fundamental and widely used classification algorithms, despite having "regression" in its name.

### What is Logistic Regression?
- A linear model designed specifically for classification tasks
- Similar to linear regression but outputs probabilities instead of continuous values
- Uses the logistic (sigmoid) function to transform linear combinations into probabilities
- Despite the name, it's a **classification algorithm**, not a regression algorithm
- Works by finding a linear decision boundary between classes

### How It Works

**Step 1: Linear Combination**
- Like linear regression, computes a weighted sum of input features
- Formula: `z = w₀ + w₁x₁ + w₂x₂ + ... + wₙxₙ`
- w₀ is the intercept (bias term)
- w₁, w₂, ..., wₙ are the weights (coefficients) for each feature
- This gives us a score that can range from -∞ to +∞

**Step 2: Sigmoid Transformation**
- Instead of outputting z directly, passes it through the sigmoid function
- Sigmoid function: `σ(z) = 1 / (1 + e^(-z))`
- This transforms any real number into a value between 0 and 1
- The output represents a probability

**Step 3: Making Predictions**
- Use a threshold (typically 0.5) to convert probability to class
- If σ(z) ≥ 0.5, predict class 1
- If σ(z) < 0.5, predict class 0
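
The three steps above can be sketched in plain Python. The weights and bias are hypothetical values picked for illustration, not learned from data, and the sigmoid is written in a branch form that avoids overflow for large negative z.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(features, weights, bias, threshold=0.5):
    # Step 1: linear combination z = w0 + w1*x1 + ... + wn*xn
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Step 2: squash the score into a probability
    p = sigmoid(z)
    # Step 3: apply the decision threshold
    return p, int(p >= threshold)

# Hypothetical parameters for a two-feature model
weights, bias = [0.8, -0.5], -1.0
p, label = predict([3.0, 1.0], weights, bias)

print("probability:", p, "predicted class:", label)
```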

### The Sigmoid Function

**Mathematical Properties**
- Domain: All real numbers (-∞ to +∞)
- Range: (0, 1) - perfect for probabilities
- When z = 0, sigmoid outputs 0.5
- When z is large positive, sigmoid approaches 1
- When z is large negative, sigmoid approaches 0
- Creates an S-shaped curve

**Visual Understanding**
```
σ(z)
1.0 |           ________
    |         /
0.5 |       /
    |     /
0.0 |____/
    |________________
       -∞    0    +∞  (z)
```

**Why Sigmoid?**
- Smoothly maps any real number to probability
- Differentiable everywhere (important for gradient descent)
- Has nice mathematical properties for optimization
- Natural interpretation as probability

### Output as Probability
- The output represents **P(y=1|x)** - probability of positive class given features
- For binary classification: P(y=1|x) = σ(z)
- Probability of negative class: P(y=0|x) = 1 - σ(z)
- These probabilities always sum to 1
- Can use these probabilities for ranking or confidence scoring

### Decision Boundary
- Logistic regression creates a **linear decision boundary**
- The boundary is where the probability equals 0.5 (where z = 0)
- In 2D: boundary is a line; in 3D: a plane; in higher dimensions: a hyperplane
- Points on one side → class 1; points on other side → class 0
- The boundary equation: w₀ + w₁x₁ + w₂x₂ + ... + wₙxₙ = 0

**Example in 2D:**
```
   x₂
    |   Class 1 (σ > 0.5)
    |  o  o  o
    | o  o  o
    |______________ Decision Boundary
    |    x  x
    | x  x  x
    |  x  x    Class 0 (σ < 0.5)
    |________________ x₁
```
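
As a rough numerical check of the boundary equation, the sketch below solves it for x₂ in two dimensions and confirms that a point on the line gets probability 0.5. The three weights are hypothetical values picked for easy arithmetic.

```python
import numpy as np

# Hypothetical 2D model: w0 + w1*x1 + w2*x2 (illustrative values only)
w0, w1, w2 = -3.0, 1.0, 2.0

def boundary_x2(x1):
    """Solve w0 + w1*x1 + w2*x2 = 0 for x2."""
    return -(w0 + w1 * x1) / w2

def predict_proba(x1, x2):
    """Probability of class 1 for the point (x1, x2)."""
    z = w0 + w1 * x1 + w2 * x2
    return 1.0 / (1.0 + np.exp(-z))

x1 = 1.0
x2_on_line = boundary_x2(x1)

print(predict_proba(x1, x2_on_line))        # on the boundary
print(predict_proba(x1, x2_on_line + 1.0))  # one side of the boundary
print(predict_proba(x1, x2_on_line - 1.0))  # the other side
```

Which side maps to class 1 depends on the signs of the weights; here w₂ is positive, so moving up in x₂ raises the probability.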

### Comparison to Linear Regression

**Similarities**
- Both are linear models
- Both use weighted sum of features: w₀ + w₁x₁ + ...
- Both have interpretable weights
- Both are simple and fast

**Differences**
- **Linear Regression**: Predicts continuous values directly (y = w₀ + w₁x₁ + ...)
- **Logistic Regression**: Predicts probabilities using sigmoid(w₀ + w₁x₁ + ...)
- **Linear Regression**: Output can be any value
- **Logistic Regression**: Output is bounded to the open interval (0, 1)
- **Linear Regression**: Used for regression tasks
- **Logistic Regression**: Used for classification tasks
- **Loss Function**: Linear uses MSE, Logistic uses log loss (cross-entropy)

### Learning the Weights

**Optimization Goal**
- Find weights that maximize the likelihood of the observed class labels
- Equivalent to minimizing the log loss (cross-entropy loss)
- Log loss penalizes confident wrong predictions heavily

**Log Loss (Cross-Entropy)**
```
Loss = -[y·log(p) + (1-y)·log(1-p)]
```
- When y=1 and p is close to 1: loss is small (good)
- When y=1 and p is close to 0: loss is large (bad)
- When y=0 and p is close to 0: loss is small (good)
- When y=0 and p is close to 1: loss is large (bad)
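
To make the four cases concrete, here is a small sketch that evaluates the formula for a single example. The clipping constant `eps` is an assumption added to avoid `log(0)`; it is not part of the formula itself.

```python
import numpy as np

def log_loss_single(y, p, eps=1e-15):
    """Log loss for one example with true label y (0 or 1) and predicted probability p."""
    p = np.clip(p, eps, 1 - eps)   # guard against log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# The four cases from the list above
for y, p in [(1, 0.99), (1, 0.01), (0, 0.01), (0, 0.99)]:
    print(f"y={y}, p={p:.2f} -> loss={log_loss_single(y, p):.3f}")
```

A confident correct prediction costs about 0.01, while an equally confident wrong one costs about 4.6, which is what "penalizes confident wrong predictions heavily" means in practice.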

**Optimization Algorithms**
- Gradient Descent: Iteratively adjust weights in direction that reduces loss
- Stochastic Gradient Descent (SGD): Update weights from a single example or a small mini-batch at a time for efficiency
- L-BFGS: Quasi-Newton method, efficient for medium-sized datasets
- Newton-CG, SAG, SAGA: Other optimization algorithms available
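
For intuition, plain batch gradient descent on the mean log loss can be written from scratch. This is a teaching sketch on synthetic data, not how scikit-learn's solvers are implemented; the learning rate and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-feature data (assumption for illustration)
n = 200
X = rng.normal(size=(n, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

Xb = np.hstack([np.ones((n, 1)), X])   # column of ones so w[0] acts as the bias
w = np.zeros(Xb.shape[1])
learning_rate = 0.5

def mean_log_loss(weights):
    p = 1.0 / (1.0 + np.exp(-Xb @ weights))
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

initial_loss = mean_log_loss(w)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    gradient = Xb.T @ (p - y) / n      # gradient of the mean log loss
    w -= learning_rate * gradient      # step against the gradient
final_loss = mean_log_loss(w)

accuracy = np.mean((Xb @ w > 0) == (y == 1))
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}, training accuracy: {accuracy:.2f}")
```

With all weights at zero every prediction is 0.5, so the starting loss equals log(2); each step then moves the weights in the direction that lowers it.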

**Regularization**
- L2 regularization (Ridge): Penalizes sum of squared weights
- L1 regularization (Lasso): Penalizes sum of absolute weights, can produce sparse models
- Prevents overfitting by discouraging large weights
- Controlled by hyperparameter C (inverse regularization strength)

### Advantages of Logistic Regression
- **Simple and interpretable**: Easy to understand what the model learned
- **Fast to train and predict**: Efficient even on large datasets
- **Provides probabilities**: Not just class labels, but confidence scores
- **Works well with linearly separable classes**: Strong baseline
- **Requires less training data**: Compared to more complex models
- **No hyperparameter tuning required** for basic usage
- **Industry standard**: Well-tested and widely used
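- **Good for online learning**: The model can be updated incrementally with new data (in scikit-learn this requires `SGDClassifier` with log loss, since `LogisticRegression` has no `partial_fit`)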

### Limitations
- **Linear decision boundary**: Can't model complex non-linear relationships
- **May underperform** with highly non-linear data
- **Requires feature engineering** for non-linear patterns
- **Sensitive to outliers**: Extreme values can influence weights
- **Sensitive to feature scale**: Benefits from standardization
- **Assumes features are relatively independent**: Multicollinearity can be problematic
- **Can underfit** if relationships are too complex

### When to Use Logistic Regression
- As a **baseline model** for any classification problem
- When **interpretability** is important
- When you have **linearly separable** or nearly separable classes
- When you need **probability estimates**
- For **high-dimensional data** (works well even with many features)
- When training **speed** is important
- For **real-time predictions** (very fast inference)
- In **regulated industries** where model transparency is required

---

## Training Logistic Regression with Scikit-Learn

Scikit-learn makes it straightforward to train logistic regression models with just a few lines of code.
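
A minimal training sketch follows. It uses `make_classification` to generate synthetic data as a stand-in for a real dataset; the sample sizes and the `max_iter` value are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset (assumption for illustration)
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3,
    n_redundant=0, random_state=42
)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

val_pred = model.predict(X_val)               # hard class labels
val_proba = model.predict_proba(X_val)[:, 1]  # probability of class 1

print(f"Validation accuracy: {model.score(X_val, y_val):.3f}")
```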



### Important Hyperparameters

#### C (Regularization Strength)
- **Inverse of regularization strength**: Smaller C = stronger regularization
- **Default**: C=1.0
- **Lower C** (e.g., 0.01, 0.1): Stronger regularization, simpler model, helps prevent overfitting
- **Higher C** (e.g., 10, 100): Weaker regularization, more complex model, fits training data more closely
- **Tuning**: Use cross-validation to find optimal value
- **Range to try**: [0.001, 0.01, 0.1, 1, 10, 100, 1000]
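
One way to see what C does is to compare the size of the learned weights at several values. The sketch below uses synthetic data and three arbitrary C values; the exact norms will differ on real data, but stronger regularization should shrink the weights.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption for illustration)
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    n_redundant=2, random_state=0
)

norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))   # overall weight magnitude
    print(f"C={C:<6} ||w|| = {norms[C]:.3f}")
```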


#### penalty (Regularization Type)
- **'l2'** (default): Ridge regularization - penalizes sum of squared weights
- **'l1'**: Lasso regularization - penalizes sum of absolute weights, can produce sparse models
- **'elasticnet'**: Combination of L1 and L2 (requires 'saga' solver)
- **'none'**: No regularization (not recommended, can overfit); recent scikit-learn versions spell this `penalty=None`, and the string form was deprecated in 1.2
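
The sparsity claim for L1 can be checked by counting coefficients that are exactly zero. This is a sketch on synthetic data where most features are noise; the value `C=0.05` is an arbitrary choice to make the regularization strong enough to matter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 3 informative (assumption for illustration)
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=3,
    n_redundant=0, random_state=0
)

l2_model = LogisticRegression(penalty='l2', C=0.05, solver='liblinear').fit(X, y)
l1_model = LogisticRegression(penalty='l1', C=0.05, solver='liblinear').fit(X, y)

n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))

print(f"Zero weights with L2: {n_zero_l2} of {l2_model.coef_.size}")
print(f"Zero weights with L1: {n_zero_l1} of {l1_model.coef_.size}")
```

L2 shrinks weights toward zero without usually reaching it, whereas L1 can set weights to exactly zero, which acts as a form of feature selection.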



#### solver (Optimization Algorithm)
- **'lbfgs'** (default): Good for most problems, supports L2 and no penalty
- **'liblinear'**: Good for small datasets, supports L1 and L2
- **'saga'**: Fast for large datasets, supports all penalties including elasticnet
- **'sag'**: Stochastic Average Gradient, faster for large datasets
- **'newton-cg'**: Good for small datasets with many features

**Choosing a solver:**
- Small dataset: 'liblinear' or 'lbfgs'
- Large dataset: 'saga' or 'sag'
- Need L1 regularization: 'liblinear' or 'saga'
- Need elasticnet: 'saga'



#### max_iter
- **Maximum number of iterations** for optimization algorithm
- **Default**: 100 (often insufficient)
- If you get convergence warnings, increase this value
- **Common values**: 1000, 5000, 10000
- Higher values increase training time


#### random_state
- Sets random seed for reproducibility
- Important when using stochastic solvers ('sag', 'saga')
- Use any integer for consistent results



## Model Interpretation

Understanding what your model has learned is crucial for trust, debugging, and gaining business insights.

### Examining Feature Coefficients (Weights)


### Understanding Weight Values

**Sign of Weights**
- **Positive weights**: Increase the probability of the positive class
- **Negative weights**: Decrease the probability of the positive class (increase negative class probability)
- **Weight ≈ 0**: Feature doesn't affect predictions much

**Magnitude of Weights**
- **Larger absolute values**: Stronger influence on predictions
- **Smaller absolute values**: Weaker influence
- Compare magnitudes to identify most important features

### Mathematical Interpretation

**Log-Odds Interpretation**
- Weights directly affect the log-odds of the positive class
- Log-odds = log(p/(1-p)) where p is probability of positive class
- One unit increase in feature changes log-odds by the weight value
- This is before applying the sigmoid transformation

**Example:**
- Weight for `tenure` = -0.03
- One month increase in tenure decreases log-odds of churn by 0.03
- This translates to slightly lower probability of churning
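
The tenure example can be worked through numerically. The baseline log-odds of 0 (a 50% churn probability) is a hypothetical starting point chosen to keep the arithmetic simple.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

weight_tenure = -0.03      # weight from the example above
baseline_log_odds = 0.0    # hypothetical customer at 50% churn probability

p_before = sigmoid(baseline_log_odds)
p_after = sigmoid(baseline_log_odds + weight_tenure)   # one extra month of tenure
odds_multiplier = np.exp(weight_tenure)

print(f"Probability before: {p_before:.4f}")
print(f"Probability after:  {p_after:.4f}")
print(f"Odds multiplied by: {odds_multiplier:.4f}")
```

The change in log-odds is always the same 0.03 per month, but the change in probability depends on where the customer starts, because the sigmoid is steepest around 0.5.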



**Understanding Odds Ratios**
- **Odds ratio = exp(weight)**
- **Odds ratio > 1**: Feature increases odds of positive class
- **Odds ratio < 1**: Feature decreases odds of positive class
- **Odds ratio = 1**: Feature has no effect

**Examples:**
- Odds ratio = 2.0: Doubling the odds (100% increase)
- Odds ratio = 1.5: 50% increase in odds
- Odds ratio = 0.5: Halving the odds (50% decrease)
- Odds ratio = 0.75: 25% decrease in odds
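
Converting weights to odds ratios is a single call to `np.exp`. The four weights below are hypothetical values chosen so that the results land close to the examples listed above.

```python
import numpy as np

# Hypothetical weights (illustrative values only)
weights = np.array([0.693, 0.405, -0.693, -0.288])

odds_ratios = np.exp(weights)
pct_change_in_odds = (odds_ratios - 1) * 100

for w, ratio, pct in zip(weights, odds_ratios, pct_change_in_odds):
    print(f"weight={w:+.3f}  odds ratio={ratio:.2f}  change in odds={pct:+.0f}%")
```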



### One-Hot Encoded Features

**Interpreting Categorical Features**
- Each category gets its own coefficient
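- When one category is omitted as a reference, each coefficient is interpreted relative to that category
- `DictVectorizer` keeps every category, so compare coefficients within the same original feature rather than reading each one in isolation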

**Example:**
```
contract_month-to-month: +1.2
contract_one-year: -0.3
contract_two-year: -0.9

Interpretation:
- Month-to-month: Highest churn risk (positive, large weight)
- One-year: Lower than month-to-month
- Two-year: Lowest churn risk (negative, large magnitude)
```

### Limitations of Weight Interpretation

**Scale Dependence**
- Weights are affected by feature scale
- With everything else equal, a feature measured on a larger scale tends to get a smaller weight
- **Solution**: Standardize features before training for fair comparison
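
Scale dependence is easy to demonstrate by fitting the same relationship with one feature expressed in two units. The data below is simulated, and the large `C` is an assumption used to make regularization negligible so that only the unit change affects the weight.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Simulated churn that becomes less likely as tenure grows (illustrative only)
tenure_months = rng.uniform(0, 72, size=2000)
p_churn = 1.0 / (1.0 + np.exp(-(1.5 - 0.05 * tenure_months)))
churn = (rng.uniform(size=2000) < p_churn).astype(int)

tenure_years = tenure_months / 12.0   # same information, different unit

fit_months = LogisticRegression(C=1e6, max_iter=10000).fit(
    tenure_months.reshape(-1, 1), churn
)
fit_years = LogisticRegression(C=1e6, max_iter=10000).fit(
    tenure_years.reshape(-1, 1), churn
)

w_months = fit_months.coef_[0, 0]
w_years = fit_years.coef_[0, 0]
print(f"weight per month: {w_months:.4f}")
print(f"weight per year:  {w_years:.4f}")
```

Both models describe the same pattern, yet the per-year weight is roughly twelve times larger in magnitude, so raw weights cannot be compared across features with different units.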

**Multicollinearity**
- Highly correlated features can have unstable, misleading weights
- Weight might be split between correlated features
- **Solution**: Remove one feature from highly correlated pairs

**Linear Effects Only**
- Weights show linear effects, not interactions
- Don't capture how features work together
- **Solution**: Create interaction features or use tree-based models for interpretation

**Reference Category**
- One-hot encoded features interpreted relative to omitted category
- Need to know which category was the reference

### Business Insights from Interpretation

**Identifying Key Drivers**
- Which factors most strongly predict the outcome?
- What characteristics define high-risk vs low-risk customers?

**Validating Model**
- Do the patterns make business sense?
- Are the relationships aligned with domain knowledge?
- Any surprising or suspicious patterns?

**Actionable Insights**
- Which factors can the business influence?
- Where should interventions be focused?
- What customer segments need attention?


**Why Use Probabilities?**
- Provides confidence level in predictions
- Allows for ranking customers by risk
- Enables custom business rules and thresholds
- Better for decision-making than binary predictions

### Using Custom Thresholds

The default threshold is 0.5, but you can adjust based on business needs.

**Why Adjust Thresholds?**
- **Lower threshold** (e.g., 0.3): More sensitive, catches more at-risk customers, but more false alarms
- **Higher threshold** (e.g., 0.7): More conservative, fewer false alarms, but might miss some at-risk customers
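
The trade-off can be made concrete by computing precision and recall at several thresholds. The labels and probabilities below are hand-made for illustration, not model output.

```python
import numpy as np

# Hand-made labels and predicted churn probabilities (illustrative only)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
proba = np.array([0.9, 0.8, 0.6, 0.35, 0.65, 0.4, 0.3, 0.2, 0.1, 0.05])

def precision_recall_at(threshold):
    """Precision and recall when flagging every probability >= threshold."""
    pred = (proba >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)

for threshold in [0.3, 0.5, 0.7]:
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, lowering the threshold to 0.3 catches every churner at the cost of more false alarms, while raising it to 0.7 removes the false alarms but misses half of the churners.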

## Summary

### Key Takeaways

**Feature Importance Methods**
- **Risk Ratio**: Intuitive measure for categorical features comparing group churn rates to global rates
- **Mutual Information**: Captures both linear and non-linear relationships between features and target
- **Correlation**: Measures linear relationships; fast and simple but limited to linear patterns
- Use multiple methods together for robust feature selection

**Data Preparation and Validation**
- Proper data cleaning and handling missing values is crucial
- Split data into training, validation, and test sets
- Use stratified splitting for imbalanced datasets
- Cross-validation provides robust performance estimates

**Exploratory Data Analysis**
- Understand target distribution and check for class imbalance
- Examine feature distributions and relationships with target
- Identify patterns, outliers, and data quality issues
- Guide feature engineering decisions with EDA insights

**Feature Engineering**
- Create interaction features, polynomial terms, and ratios
- Extract components from dates and aggregate statistics
- Apply domain knowledge for meaningful derived features
- Significantly impacts model performance, often more than algorithm choice

**One-Hot Encoding**
- Essential for converting categorical variables to numerical format
- **DictVectorizer** efficiently implements one-hot encoding
- Works seamlessly with both categorical and numerical features in dictionary format
- Each category becomes a separate binary column
- Critical preprocessing step for most machine learning algorithms

**Logistic Regression**
- Linear model that outputs probabilities for classification
- Uses the sigmoid function to transform a linear combination of features into a probability in (0, 1)
- Despite "regression" in name, it's a classification algorithm
- Similar to linear regression but designed for categorical outcomes
- Creates linear decision boundary between classes

**Training with Scikit-Learn**
- Simple API: `model.fit(X_train, y_train)`
- Key hyperparameters: C (regularization), penalty (L1/L2), solver, max_iter
- Use `predict()` for class labels, `predict_proba()` for probabilities
- Handle class imbalance with `class_weight='balanced'`
- Cross-validation for hyperparameter tuning

**Model Interpretation**
- Examine feature weights to understand what model learned
- Positive weights increase probability of positive class
- Negative weights decrease probability of positive class
- **Odds ratios** (exp(weight)) provide intuitive interpretation
- Validate that learned patterns align with domain knowledge

**Using the Model**
- Always use same preprocessing on new data (transform, not fit_transform)
- Get probability scores for risk-based decision making
- Adjust prediction threshold based on business requirements
- Lower threshold for sensitivity, higher threshold for precision
- Create risk segments for targeted interventions

**Best Practices**
- Start with logistic regression as baseline
- Feature engineering often more impactful than algorithm choice
- Monitor model performance in production
- Retrain periodically as data distribution changes
- Document preprocessing and modeling decisions
- Balance model complexity with interpretability needs

### Workflow Recap

1. **Data Preparation**: Clean data, handle missing values, split into train/validation/test
2. **EDA**: Understand distributions, relationships, and identify issues
3. **Feature Importance**: Use risk ratio, mutual information, and correlation to identify predictive features
4. **Feature Engineering**: Create new features based on domain knowledge and EDA insights
5. **Encoding**: Apply one-hot encoding to categorical features using DictVectorizer
6. **Modeling**: Train logistic regression, tune hyperparameters
7. **Interpretation**: Analyze feature weights and validate learned patterns
8. **Deployment**: Use model for predictions with appropriate thresholds
9. **Monitoring**: Track performance and update as needed

### Evaluation Metrics

**For Balanced Datasets**
- **Accuracy**: Proportion of correct predictions
- Use when classes are roughly balanced

**For Imbalanced Datasets**
- **Precision**: Of predicted positives, how many are actually positive
- **Recall**: Of actual positives, how many did we correctly predict as positive
- **F1-Score**: Harmonic mean of precision and recall
- **AUC-ROC**: Area under ROC curve, threshold-independent metric

**Business Metrics**
- Cost of false positives vs false negatives
- Coverage (what % of customers can we reach)
- Lift (how much better than random)
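
Lift can be computed directly from scores and labels. The sketch below uses one common definition, the positive rate among the k highest-scored customers divided by the overall positive rate; the helper name `lift_at_k` and the data are hypothetical.

```python
import numpy as np

# Hand-made labels and model scores (illustrative only)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.6, 0.35, 0.65, 0.4, 0.3, 0.2, 0.1, 0.05])

def lift_at_k(y_true, scores, k):
    """Positive rate in the top-k scored rows divided by the overall positive rate."""
    top_k = np.argsort(scores)[::-1][:k]   # indices of the k highest scores
    return y_true[top_k].mean() / y_true.mean()

lift_top3 = lift_at_k(y_true, scores, k=3)
print(f"Lift in the top 3: {lift_top3:.2f}")
```

A lift above 1 means that targeting the highest-scored customers finds churners more often than picking customers at random.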


Classification with logistic regression provides a strong foundation for machine learning. The combination of proper data preparation, thoughtful feature engineering, and careful model interpretation creates models that are both accurate and trustworthy. The techniques covered here - from risk ratios to one-hot encoding to probability calibration - form the essential toolkit for tackling real-world classification problems.



"""
Classification in Machine Learning - Code Examples
All code examples extracted from the classification notes
"""

# ============================================================================
# FEATURE ENGINEERING EXAMPLES
# ============================================================================

import pandas as pd
import numpy as np
from datetime import datetime

# Example: Customer Churn Feature Engineering

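# Time-based features (parse the date column in case it was loaded as strings)
registration = pd.to_datetime(df['registration_date'])
df['account_age_months'] = (pd.to_datetime('today') - registration).dt.days / 30
df['is_new_customer'] = (df['account_age_months'] < 6).astype(int)

# Ratio features (treat zero tenure as missing to avoid division by zero)
df['charge_per_month'] = df['total_charges'] / df['tenure'].replace(0, np.nan)
df['service_ratio'] = df['num_services'] / df['max_possible_services']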

# Interaction features
df['tech_support_senior'] = df['tech_support'] * df['is_senior_citizen']

# Binning (include_lowest=True keeps customers with tenure 0 in the first bin)
df['tenure_group'] = pd.cut(df['tenure'],
                              bins=[0, 12, 24, 48, 100],
                              labels=['0-1yr', '1-2yr', '2-4yr', '4+yr'],
                              include_lowest=True)

# Aggregation
df['avg_monthly_charges'] = df.groupby('customer_id')['monthly_charges'].transform('mean')


# ============================================================================
# CORRELATION ANALYSIS
# ============================================================================

import seaborn as sns
import matplotlib.pyplot as plt

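# Calculate correlation with target
# (numeric_only=True is needed in pandas 2.x when the frame also holds string columns)
correlation_with_target = df.corr(numeric_only=True)['target'].sort_values(ascending=False)
print(correlation_with_target)

# Full correlation matrix
correlation_matrix = df.corr(numeric_only=True)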

# Visualize with heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.show()

# Identify Highly Correlated Features
def get_correlated_features(df, threshold=0.8):
    corr_matrix = df.corr(numeric_only=True).abs()
    upper_triangle = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    
    # Find features with correlation > threshold
    correlated_features = [
        column for column in upper_triangle.columns 
        if any(upper_triangle[column] > threshold)
    ]
    
    return correlated_features

# Usage
high_corr_features = get_correlated_features(df, threshold=0.85)
print(f"Highly correlated features: {high_corr_features}")


# ============================================================================
# MUTUAL INFORMATION
# ============================================================================

from sklearn.feature_selection import mutual_info_classif

# Calculate mutual information scores
mi_scores = mutual_info_classif(X_train, y_train)

# Create a dataframe for better visualization
mi_df = pd.DataFrame({
    'feature': X_train.columns,
    'mi_score': mi_scores
}).sort_values('mi_score', ascending=False)

print(mi_df)

# Calculate MI scores with visualization
mi_scores = mutual_info_classif(X_train, y_train, random_state=42)

# Create feature importance dataframe
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'mi_score': mi_scores
}).sort_values('mi_score', ascending=False)

# Visualize top features
top_n = 15
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'][:top_n], 
         feature_importance['mi_score'][:top_n])
plt.xlabel('Mutual Information Score')
plt.title('Top Features by Mutual Information')
plt.tight_layout()
plt.show()


# ============================================================================
# RISK RATIO CALCULATION
# ============================================================================

def calculate_risk_ratio(df, feature, target):
    """Calculate risk ratio for a categorical feature"""
    # Global churn rate
    global_churn = df[target].mean()
    
    # Group churn rates
    group_churn = df.groupby(feature)[target].mean()
    
    # Calculate risk ratios
    risk_ratios = group_churn / global_churn
    
    return risk_ratios

# Usage
risk_ratios = calculate_risk_ratio(df, 'contract_type', 'churn')
print(risk_ratios)


# ============================================================================
# DATA PREPARATION AND SPLITTING
# ============================================================================

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

# Load your data
df = pd.read_csv('data.csv')

# Split features and target
X = df.drop('target', axis=1)
y = df['target']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Convert to dictionaries if needed
train_dict = X_train.to_dict(orient='records')
test_dict = X_test.to_dict(orient='records')

# Apply DictVectorizer for one-hot encoding
dv = DictVectorizer(sparse=False)
X_train_encoded = dv.fit_transform(train_dict)
X_test_encoded = dv.transform(test_dict)


# ============================================================================
# ONE-HOT ENCODING WITH DICTVECTORIZER
# ============================================================================

from sklearn.feature_extraction import DictVectorizer

# Original data as list of dictionaries
data = [
    {'color': 'red', 'size': 'large', 'price': 100},
    {'color': 'blue', 'size': 'small', 'price': 50},
    {'color': 'green', 'size': 'medium', 'price': 75}
]

# Create and fit DictVectorizer
dv = DictVectorizer(sparse=False)
X = dv.fit_transform(data)

# Get feature names
feature_names = dv.get_feature_names_out()
print(feature_names)

# X now contains one-hot encoded categorical features and original numerical features
print(X)

# With Pandas DataFrames (df_train and df_test are assumed to exist already)
train_dict = df_train.to_dict(orient='records')

# Fit and transform
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dict)

# Transform test data (use transform, not fit_transform)
test_dict = df_test.to_dict(orient='records')
X_test = dv.transform(test_dict)

# Complete Example
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['L', 'M', 'S', 'L', 'M'],
    'price': [100, 50, 75, 120, 60],
    'sold': [1, 0, 1, 1, 0]
})

# Split data
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

# Prepare features and target
X_train_dict = df_train[['color', 'size', 'price']].to_dict(orient='records')
y_train = df_train['sold'].values

X_test_dict = df_test[['color', 'size', 'price']].to_dict(orient='records')
y_test = df_test['sold'].values

# One-hot encode with DictVectorizer
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(X_train_dict)
X_test = dv.transform(X_test_dict)

print("Feature names:", dv.get_feature_names_out())
print("Training data shape:", X_train.shape)
print("First sample encoded:", X_train[0])


# ============================================================================
# LOGISTIC REGRESSION - BASIC TRAINING
# ============================================================================

from sklearn.linear_model import LogisticRegression

# Basic instantiation
model = LogisticRegression()

# Or with specific parameters
model = LogisticRegression(
    C=1.0,              # Inverse regularization strength
    max_iter=1000,      # Maximum iterations for convergence
    solver='lbfgs',     # Optimization algorithm
    random_state=42     # For reproducibility
)

# Fit the model
model.fit(X_train_encoded, y_train)

# Predict class labels
y_pred = model.predict(X_test_encoded)

# Predict probabilities
y_pred_proba = model.predict_proba(X_test_encoded)
# Returns array with [prob_class_0, prob_class_1] for each sample

# Get probability for positive class only
y_pred_proba_positive = y_pred_proba[:, 1]


# ============================================================================
# HYPERPARAMETER TUNING
# ============================================================================

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid_search.fit(X_train_encoded, y_train)
best_C = grid_search.best_params_['C']

# L1 regularization (sparse model)
model = LogisticRegression(penalty='l1', solver='liblinear')

# L2 regularization (default)
model = LogisticRegression(penalty='l2', solver='lbfgs')

# For large datasets with L1 penalty
model = LogisticRegression(solver='saga', penalty='l1', max_iter=5000)

# Increase if you see convergence warnings
model = LogisticRegression(max_iter=5000)

# Automatic balancing
model = LogisticRegression(class_weight='balanced')

# Custom weights
model = LogisticRegression(class_weight={0: 1, 1: 3})  # Give 3x weight to class 1


# ============================================================================
# MAKING PREDICTIONS
# ============================================================================

# Get hard class predictions (0 or 1)
predictions = model.predict(X_test_encoded)

# Get probability estimates
probabilities = model.predict_proba(X_test_encoded)

# Extract probability of positive class
prob_positive = probabilities[:, 1]

# Using Custom Thresholds
proba_positive = model.predict_proba(X_test_encoded)[:, 1]

# Apply custom threshold (e.g., 0.7 for higher precision)
threshold = 0.7
custom_predictions = (proba_positive >= threshold).astype(int)

# Or for lower threshold (higher recall)
threshold = 0.3
sensitive_predictions = (proba_positive >= threshold).astype(int)


# ============================================================================
# MODEL EVALUATION
# ============================================================================

from sklearn.metrics import (
    accuracy_score, 
    precision_score, 
    recall_score, 
    f1_score,
    classification_report,
    confusion_matrix,
    roc_auc_score
)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred_proba[:, 1])

print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1-Score: {f1:.3f}")
print(f"AUC-ROC: {auc:.3f}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)


# ============================================================================
# COMPLETE EXAMPLE - CUSTOMER CHURN
# ============================================================================

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load and prepare data
df = pd.read_csv('customer_churn.csv')

# Split features and target
X = df.drop('churn', axis=1)
y = df['churn']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Convert to dictionaries
train_dict = X_train.to_dict(orient='records')
test_dict = X_test.to_dict(orient='records')

# One-hot encode
dv = DictVectorizer(sparse=False)
X_train_encoded = dv.fit_transform(train_dict)
X_test_encoded = dv.transform(test_dict)

# Train model
model = LogisticRegression(
    C=1.0,
    max_iter=1000,
    solver='lbfgs',
    random_state=42,
    class_weight='balanced'  # Handle class imbalance
)
model.fit(X_train_encoded, y_train)

# Make predictions
y_pred = model.predict(X_test_encoded)
y_pred_proba = model.predict_proba(X_test_encoded)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Feature importance
feature_names = dv.get_feature_names_out()
weights = model.coef_[0]
feature_importance = pd.DataFrame({
    'feature': feature_names,
    'weight': weights
}).sort_values('weight', key=abs, ascending=False)
print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))


# ============================================================================
# MODEL PERSISTENCE (SAVING AND LOADING)
# ============================================================================

import pickle

# Save model and vectorizer together
with open('churn_model.pkl', 'wb') as f:
    pickle.dump((dv, model), f)

# Load model
with open('churn_model.pkl', 'rb') as f:
    dv_loaded, model_loaded = pickle.load(f)

# Use loaded model for predictions
new_customer = {'contract': 'month-to-month', 'tenure': 12}
new_customer_encoded = dv_loaded.transform([new_customer])
prediction = model_loaded.predict(new_customer_encoded)
probability = model_loaded.predict_proba(new_customer_encoded)[0, 1]


# ============================================================================
# CROSS-VALIDATION
# ============================================================================

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
cv_scores = cross_val_score(
    model, 
    X_train_encoded, 
    y_train, 
    cv=5, 
    scoring='accuracy'
)

print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")


# ============================================================================
# MODEL INTERPRETATION
# ============================================================================

# Get weights (coefficients)
weights = model.coef_[0]  # For binary classification

# Get feature names
feature_names = dv.get_feature_names_out()

# Create interpretation dataframe
importance_df = pd.DataFrame({
    'feature': feature_names,
    'weight': weights,
    'abs_weight': abs(weights)
}).sort_values('abs_weight', ascending=False)

print(importance_df.head(15))

# Calculate odds ratios
importance_df['odds_ratio'] = np.exp(importance_df['weight'])

print(importance_df[['feature', 'weight', 'odds_ratio']].head(10))

# Rank by absolute weight
top_features = importance_df.sort_values('abs_weight', ascending=False).head(20)
print(top_features[['feature', 'weight', 'odds_ratio']])

# Positive influencers (increase churn)
positive_features = importance_df[importance_df['weight'] > 0].sort_values('weight', ascending=False)
print("\nTop features increasing churn probability:")
print(positive_features.head(10))

# Negative influencers (decrease churn)
negative_features = importance_df[importance_df['weight'] < 0].sort_values('weight')
print("\nTop features decreasing churn probability:")
print(negative_features.head(10))


# ============================================================================
# VISUALIZING FEATURE IMPORTANCE
# ============================================================================

# Plot top features by absolute weight
top_n = 15
top_features = importance_df.sort_values('abs_weight', ascending=False).head(top_n)

plt.figure(figsize=(10, 8))
colors = ['red' if w < 0 else 'green' for w in top_features['weight']]
plt.barh(range(len(top_features)), top_features['weight'], color=colors)
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Weight (Coefficient)')
plt.title(f'Top {top_n} Most Important Features')
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
plt.tight_layout()
plt.show()


# ============================================================================
# COMPLETE INTERPRETATION EXAMPLE
# ============================================================================

# Full feature interpretation workflow
weights = model.coef_[0]
feature_names = dv.get_feature_names_out()

# Create comprehensive interpretation dataframe
interpretation = pd.DataFrame({
    'feature': feature_names,
    'weight': weights,
    'abs_weight': np.abs(weights),
    'odds_ratio': np.exp(weights),
    'pct_change_odds': (np.exp(weights) - 1) * 100
}).sort_values('abs_weight', ascending=False)

# Display top influential features
print("=" * 80)
print("TOP 20 MOST INFLUENTIAL FEATURES")
print("=" * 80)
print(interpretation.head(20).to_string(index=False))

# Separate positive and negative influencers
print("\n" + "=" * 80)
print("TOP FACTORS INCREASING CHURN RISK")
print("=" * 80)
positive = interpretation[interpretation['weight'] > 0].head(10)
print(positive[['feature', 'odds_ratio', 'pct_change_odds']].to_string(index=False))

print("\n" + "=" * 80)
print("TOP FACTORS DECREASING CHURN RISK")
print("=" * 80)
negative = interpretation[interpretation['weight'] < 0].head(10)
print(negative[['feature', 'odds_ratio', 'pct_change_odds']].to_string(index=False))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Plot 1: Top features by absolute weight
top_15 = interpretation.head(15)
colors = ['red' if w < 0 else 'green' for w in top_15['weight']]
axes[0].barh(range(len(top_15)), top_15['weight'], color=colors)
axes[0].set_yticks(range(len(top_15)))
axes[0].set_yticklabels(top_15['feature'])
axes[0].set_xlabel('Weight')
axes[0].set_title('Top 15 Features by Importance')
axes[0].axvline(x=0, color='black', linestyle='--', linewidth=0.8)

# Plot 2: Odds ratios
axes[1].barh(range(len(top_15)), top_15['odds_ratio'], color=colors)
axes[1].set_yticks(range(len(top_15)))
axes[1].set_yticklabels(top_15['feature'])
axes[1].set_xlabel('Odds Ratio')
axes[1].set_title('Odds Ratios for Top Features')
axes[1].axvline(x=1, color='black', linestyle='--', linewidth=0.8)

plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()


# ============================================================================
# USING THE MODEL - PREPROCESSING NEW DATA
# ============================================================================

# New customer data (single customer)
new_customer = {
    'contract': 'month-to-month',
    'tenure': 12,
    'monthly_charges': 70.0,
    'total_charges': 840.0,
    'internet_service': 'fiber_optic',
    'tech_support': 'no'
}

# IMPORTANT: Use transform(), NOT fit_transform()
new_customer_encoded = dv.transform([new_customer])

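# For multiple customers
# Note: features missing from a record are encoded as 0, so real records
# should include every feature the model was trained on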
new_customers = [
    {'contract': 'month-to-month', 'tenure': 12},
    {'contract': 'one-year', 'tenure': 24},
    {'contract': 'two-year', 'tenure': 36}
]
new_customers_encoded = dv.transform(new_customers)


# ============================================================================
# MAKING PREDICTIONS ON NEW DATA
# ============================================================================

# Predict class label (0 or 1)
prediction = model.predict(new_customer_encoded)
print(f"Prediction: {prediction[0]}")

# Interpret result
if prediction[0] == 1:
    print("Customer is likely to churn")
    print("Action: Initiate retention campaign")
else:
    print("Customer is likely to stay")
    print("Action: Continue normal engagement")

# Get probability estimates
probabilities = model.predict_proba(new_customer_encoded)
prob_no_churn = probabilities[0][0]
prob_churn = probabilities[0][1]

print(f"Probability of staying: {prob_no_churn:.2%}")
print(f"Probability of churning: {prob_churn:.2%}")


# ============================================================================
# USING CUSTOM THRESHOLDS
# ============================================================================

# Get probability of churning
prob_churn = model.predict_proba(new_customer_encoded)[0, 1]

# Conservative approach - only flag very high risk
threshold_conservative = 0.7
if prob_churn >= threshold_conservative:
    action = "HIGH PRIORITY: Immediate intervention required"
elif prob_churn >= 0.5:
    action = "MEDIUM PRIORITY: Monitor and engage"
else:
    action = "LOW RISK: Standard engagement"

print(f"Churn Probability: {prob_churn:.2%}")
print(f"Action: {action}")

# Sensitive approach - catch more potential churners
threshold_sensitive = 0.3
if prob_churn >= threshold_sensitive:
    action = "Include in retention campaign"
else:
    action = "No action needed"

print(f"Action (sensitive threshold): {action}")


# ============================================================================
# RISK-BASED SEGMENTATION
# ============================================================================

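# Predict probabilities for all customers
# (all_customers_dict is a list of feature dicts and customer_ids the matching
# IDs; both are assumed to have been prepared earlier)
all_customers_encoded = dv.transform(all_customers_dict)
churn_probabilities = model.predict_proba(all_customers_encoded)[:, 1]

# Create risk segments (conditions are checked in order; the first match wins)
# A string default replaces np.select's integer default of 0, which newer
# NumPy versions refuse to combine with string choices
risk_segments = np.select(
    [
        churn_probabilities >= 0.7,
        churn_probabilities >= 0.5,
        churn_probabilities >= 0.3
    ],
    [
        'Very High Risk',
        'High Risk',
        'Medium Risk'
    ],
    default='Low Risk'
)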

# Create results dataframe
results = pd.DataFrame({
    'customer_id': customer_ids,
    'churn_probability': churn_probabilities,
    'risk_segment': risk_segments
}).sort_values('churn_probability', ascending=False)

print(results.head(20))


# ============================================================================
# BATCH PREDICTIONS
# ============================================================================

# Predict for many customers at once
customers_df = pd.read_csv('customers_to_score.csv')
customers_dict = customers_df.to_dict(orient='records')

# Encode
customers_encoded = dv.transform(customers_dict)

# Predict
predictions = model.predict(customers_encoded)
probabilities = model.predict_proba(customers_encoded)[:, 1]

# Add predictions to dataframe
customers_df['churn_prediction'] = predictions
customers_df['churn_probability'] = probabilities

# Save results
customers_df.to_csv('customer_predictions.csv', index=False)

# Summary statistics
print(f"Total customers scored: {len(customers_df)}")
print(f"Predicted churners: {predictions.sum()}")
print(f"Average churn probability: {probabilities.mean():.2%}")
print("\nRisk Distribution:")
print(customers_df['churn_probability'].describe())


# ============================================================================
# REAL-WORLD DEPLOYMENT EXAMPLE
# ============================================================================

def predict_churn_risk(customer_data, model, vectorizer):
    """
    Predict churn risk for a customer
    
    Parameters:
    -----------
    customer_data : dict
        Customer features as dictionary
    model : LogisticRegression
        Trained model
    vectorizer : DictVectorizer
        Fitted vectorizer
    
    Returns:
    --------
    dict : Prediction results with probability and risk level
    """
    # Encode customer data
    customer_encoded = vectorizer.transform([customer_data])
    
    # Get prediction and probability
    prediction = model.predict(customer_encoded)[0]
    probability = model.predict_proba(customer_encoded)[0, 1]
    
    # Determine risk level
    if probability >= 0.7:
        risk_level = "Very High"
        action = "Immediate retention intervention required"
        priority = 1
    elif probability >= 0.5:
        risk_level = "High"
        action = "Proactive outreach recommended"
        priority = 2
    elif probability >= 0.3:
        risk_level = "Medium"
        action = "Include in next retention campaign"
        priority = 3
    else:
        risk_level = "Low"
        action = "Continue standard engagement"
        priority = 4
    
    return {
        'will_churn': bool(prediction),
        'churn_probability': float(probability),
        'risk_level': risk_level,
        'recommended_action': action,
        'priority': priority
    }

# Usage
customer = {
    'contract': 'month-to-month',
    'tenure': 3,
    'monthly_charges': 85.0,
    'internet_service': 'fiber_optic',
    'tech_support': 'no'
}

result = predict_churn_risk(customer, model, dv)
print(f"Churn Prediction: {result['will_churn']}")
print(f"Risk Probability: {result['churn_probability']:.2%}")
print(f"Risk Level: {result['risk_level']}")
print(f"Recommended Action: {result['recommended_action']}")


# ============================================================================
# CREATING A SCORING PIPELINE
# ============================================================================

class ChurnPredictionPipeline:
    """Complete pipeline for churn prediction"""
    
    def __init__(self, model, vectorizer):
        self.model = model
        self.vectorizer = vectorizer
    
    def preprocess(self, customer_data):
        """Preprocess customer data"""
        # Add any data validation or cleaning here
        return customer_data
    
    def predict(self, customer_data):
        """Make prediction for single customer"""
        processed = self.preprocess(customer_data)
        encoded = self.vectorizer.transform([processed])
        prediction = self.model.predict(encoded)[0]
        probability = self.model.predict_proba(encoded)[0, 1]
        return prediction, probability
    
    def predict_batch(self, customers_list):
        """Make predictions for multiple customers"""
        processed = [self.preprocess(c) for c in customers_list]
        encoded = self.vectorizer.transform(processed)
        predictions = self.model.predict(encoded)
        probabilities = self.model.predict_proba(encoded)[:, 1]
        return predictions, probabilities
    
    def get_risk_segment(self, probability):
        """Assign risk segment based on probability"""
        if probability >= 0.7:
            return 'Very High Risk'
        elif probability >= 0.5:
            return 'High Risk'
        elif probability >= 0.3:
            return 'Medium Risk'
        else:
            return 'Low Risk'
    
    def score_customer(self, customer_data):
        """Complete scoring with all details"""
        prediction, probability = self.predict(customer_data)
        risk_segment = self.get_risk_segment(probability)
        
        return {
            'prediction': int(prediction),
            'probability': float(probability),
            'risk_segment': risk_segment,
            'will_churn': bool(prediction)
        }

# Create pipeline
pipeline = ChurnPredictionPipeline(model, dv)

# Score a customer
customer = {'contract': 'month-to-month', 'tenure': 5}
score = pipeline.score_customer(customer)
print(score)


# ============================================================================
# HANDLING EDGE CASES
# ============================================================================

def safe_predict(customer_data, model, vectorizer):
    """Prediction with error handling"""
    try:
        # Validate required fields
        required_fields = ['contract', 'tenure', 'monthly_charges']
        missing_fields = [f for f in required_fields if f not in customer_data]
        
        if missing_fields:
            return {
                'error': f"Missing required fields: {missing_fields}",
                'prediction': None
            }
        
        # Make prediction
        encoded = vectorizer.transform([customer_data])
        prediction = model.predict(encoded)[0]
        probability = model.predict_proba(encoded)[0, 1]
        
        return {
            'error': None,
            'prediction': int(prediction),
            'probability': float(probability)
        }
    
    except Exception as e:
        return {
            'error': f"Prediction failed: {str(e)}",
            'prediction': None
        }

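# Usage
# (the `customer` dict from the pipeline example above has no 'monthly_charges',
# so this call returns the missing-fields error)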
result = safe_predict(customer, model, dv)
if result['error']:
    print(f"Error: {result['error']}")
else:
    print(f"Prediction: {result['prediction']}")
    print(f"Probability: {result['probability']:.2%}")


# ============================================================================
# MONITORING PREDICTIONS IN PRODUCTION
# ============================================================================

import datetime
import os

def log_prediction(customer_id, prediction, probability, timestamp=None):
    """Log predictions for monitoring"""
    if timestamp is None:
        timestamp = datetime.datetime.now()
    
    log_entry = {
        'timestamp': timestamp,
        'customer_id': customer_id,
        'prediction': int(prediction),        # convert NumPy scalar to plain int
        'probability': float(probability),    # convert NumPy scalar to plain float
        'model_version': 'v1.0'
    }
    
    # Save to database or file
    # For demo, append to CSV; write the header row only when the file is new
    log_path = 'prediction_log.csv'
    log_df = pd.DataFrame([log_entry])
    log_df.to_csv(log_path, mode='a', header=not os.path.exists(log_path), index=False)
    
    return log_entry

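# Make and log prediction
customer_id = 'CUST_12345'
# Illustrative feature values; an empty dict would be encoded as all zeros
customer_data = {
    'contract': 'month-to-month',
    'tenure': 12,
    'monthly_charges': 70.0
}
prediction, probability = pipeline.predict(customer_data)
log_prediction(customer_id, prediction, probability)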

