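# Handling Missing Values

**Detection**
 * Measure the percentage of missing values per column (e.g., with a heatmap).
 * Check whether missing values co-occur across columns (e.g., with the nullity matrix or heatmap plots of the `missingno` library).

**Numeric Data**
 * Deletion: drop the rows or columns that contain missing values.
 * Imputation
   * Arbitrary value
   * Mean
   * Median
   * Random sample
 * Model-based imputation, e.g., KNN imputation, which fills each gap from the most similar rows.
 * Treating the feature with missing values as the dependent variable and predicting it from the other features.

**Categorical Data**
 * Deletion
 * Most frequent category (mode)
 * Deep-learning-based imputation, e.g., with the Datawig library.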
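
The detection and imputation options above can be sketched with pandas and scikit-learn. This is a minimal illustration, not a recommended pipeline: the toy DataFrame and its column names are hypothetical, and `n_neighbors=2` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy data with gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0, np.nan],
    "income": [30.0, 42.0, 58.0, 50.0, 61.0],
    "city": ["Pune", "Delhi", np.nan, "Delhi", "Delhi"],
})

# Detection: percentage of missing values per column.
missing_pct = df.isna().mean() * 100

# Numeric imputation with the median of the observed values.
age_median = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()

# KNN imputation: fill each gap from the k most similar rows.
numeric_knn = KNNImputer(n_neighbors=2).fit_transform(df[["age", "income"]])

# Categorical imputation with the most frequent category.
city_filled = df["city"].fillna(df["city"].mode()[0])
```

Fit imputers on the training split only and reuse them on validation and test data, so that statistics from unseen rows do not leak into training.
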



---

# Noise and Outliers


- **Noise**: Unwanted or wrong data.
- **Outliers**: Data points that are out of range or significantly different from other observations.




| **Algorithm**     | **Robust to Outliers**                                 | **Sensitive to Outliers**                           |
|-------------------|--------------------------------------------------------|-----------------------------------------------------|
| **Naive Bayes**   | ✔                                                      |                                                     |
| **SVM**           | ✔                                                      |                                                     |
| **Decision Tree** | ✔                                                      |                                                     |
| **Ensemble Methods** (Random Forest, XGBoost, Gradient Boosting) | ✔ |                                                     |
| **KNN**           | ✔                                                      |                                                     |
| **Linear Regression** |                                                        | ✔                                                   |
| **Logistic Regression** |                                                        | ✔                                                   |
| **k-Means Clustering** |                                                        | ✔                                                   |
| **Hierarchical Clustering** |                                                        | ✔                                                   |
| **PCA**           |                                                        | ✔                                                   |
| **Neural Networks** |                                                        | ✔                                                   |





### **Detection Methods for Outliers**
- **Z-score**
- **Standard Deviation & Interquartile Range (IQR)**
- **Boxplot**
- **DBSCAN Clustering**: Identifies points that don’t belong to any cluster.
- **Isolation Forest**: Effective for high-dimensional data.
- **Robust Random Cut Forest**
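
As a rough sketch of the two simplest rules in this list, the z-score and IQR checks can be written with NumPy. The sample values and the z-score cutoff of 2.0 are assumptions chosen for this tiny example; a cutoff of 3 is more common on larger samples.

```python
import numpy as np

values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])

# Z-score rule: flag points far from the mean, measured in standard deviations.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 2.0]

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]
```

Both rules flag only the value 95 here. The z-score rule is itself distorted by extreme values, because they inflate the mean and standard deviation, which is why the IQR rule is often preferred for skewed data.
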



### **Treating Outliers**
- **Transformation**: Use log or square root transformations to reduce the impact of outliers.
- **Feature Transformation Techniques**:
  - **Standardization**
  - **Robust Scaler**: Based on interquartile range.
- **Winsorization**: Cap extreme values at chosen percentiles (e.g., the 1st and 99th) instead of removing them.
- **Trying Different Models**: E.g., Random Forest, XGBoost.
- **Imputation**: Replace outliers with mean or median values.
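
Two of these treatments, winsorization and a log transform, can be sketched in a few lines of NumPy. The data and the 5th/95th percentile limits are illustrative assumptions.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0])

# Winsorization: cap values at the 5th and 95th percentiles.
low, high = np.percentile(values, [5, 95])
winsorized = np.clip(values, low, high)

# Log transform: compresses large values (log1p also handles zeros safely).
logged = np.log1p(values)
```

Winsorization keeps every row but limits how far any single value can pull a model, while the log transform preserves the ordering of all values.
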

---



### Encoding Categorical Variables

1. **Algorithms:**
   - **Naive Bayes:** Works well with categorical data.
   - **Random Forest:** Handles both categorical and numerical data effectively.

2. **Nominal Encoding:**
   - **One-Hot Encoding (OHE):** Creates binary columns for each category.
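   - **OHE with Many Categories:** When a feature has many categories, one-hot encoding every one of them creates too many columns.
   - **Top-10 Categories:** One-hot encodes only the 10 most frequent categories and treats the rest as a single "other" group.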

3. **Ordinal Encoding:**
   - **Label (Ordinal) Encoding:** Assigns integers that follow the category order (e.g., Low = 0, Medium = 1, High = 2).
   - **Target-Guided Ordinal Encoding:** Ranks categories by their mean target value and assigns integers in that order.

4. **Additional Methods:**
   - **Count (or Frequency) Encoding:** Replaces each category with its count or relative frequency in the dataset.
   - **Cardinality Encoding:** Encodes categories based on the number of unique values.
   - **Binary Encoding:** Converts category indices to binary digits, using far fewer columns than one-hot encoding.
   - **Hashing Encoding:** Uses a hash function to map categories to a fixed number of buckets.
   - **Mean (Target) Encoding:** Replaces each category with the mean of the target variable for that category.
   - **Leave-One-Out Encoding:** Uses the mean target value of the category, excluding the current observation, to reduce target leakage.
   - **Polynomial Encoding:** A contrast-coding scheme for ordinal categories that captures linear, quadratic, and higher-order trends across the levels.
   - **Embeddings:** Uses learned vector representations for categories, often from neural networks.
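
A minimal pandas sketch of four of the encodings listed above. The DataFrame, its column names, and the `size_order` mapping are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "red", "blue"],
    "size": ["Low", "High", "Medium", "Low", "High", "Medium"],
    "target": [1, 0, 1, 0, 1, 1],
})

# One-hot encoding: one binary column per category (nominal data).
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding: integers that respect the category order.
size_order = {"Low": 0, "Medium": 1, "High": 2}
df["size_encoded"] = df["size"].map(size_order)

# Frequency encoding: replace each category with its relative frequency.
df["color_freq"] = df["color"].map(df["color"].value_counts(normalize=True))

# Mean (target) encoding: replace each category with its mean target value.
# In practice, compute the means on the training split only to avoid leakage.
df["color_target_mean"] = df["color"].map(df.groupby("color")["target"].mean())
```
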

---

# Feature Engineering

#### 1. **Data Cleaning**
   - **Dropping:** 
     - **Description:** Removing unnecessary or irrelevant data to improve the quality of the dataset.

#### 2. **Feature Scaling**
   - **Normalization and Standardization:**
     - **Normalization:**
       - **Min-Max Scaling:**
         - **Description:** Scales data to a specific range (e.g., 0 to 1 or -1 to 1).
         - **Usage:** Suitable for algorithms that are sensitive to the scale of data, such as neural networks.
     - **Standardization:**
       - **Z-score Normalization:**
         - **Description:** Scales data to have a mean of 0 and a standard deviation of 1.
         - **Usage:** Helpful for distance-based and gradient-based algorithms. It rescales the data but does not change the shape of its distribution, so it does not make skewed data normal.
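
Both scalings reduce to one line of arithmetic each. The following NumPy sketch uses an arbitrary five-value array; scikit-learn's `MinMaxScaler` and `StandardScaler` apply the same formulas column by column.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max scaling to the range [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0, standard deviation 1.
x_standard = (x - x.mean()) / x.std()
```
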

#### 3. **Feature Transformation**
   - **Transforming Features:**
     - **Description:** Converting features from one domain to another to meet the assumptions of the model.
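     - **Assumption for Linear Regression:** The residuals, not the raw features, are assumed to be normally distributed. Logistic regression makes no normality assumption, although reducing heavy skew can still help both models.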
   
   - **Detection of Normal Distribution:**
     - **Histogram:** Visualizes the distribution of data.
     - **Density Plot:** Shows the data's probability density function.
     - **Boxplot:** Highlights data spread and outliers.
     - **Q-Q Plot:** Compares data distribution to a normal distribution.
     - **Shapiro-Wilk Test:** Statistical test for normality.
     - **Many Other Methods:** Various other statistical or graphical techniques.

   - **Feature Transformation Methods:**
     - **Gaussian Transformation:** Converts data to follow a Gaussian distribution.
     - **Logarithmic Transformation:** Useful for skewed distributions.
     - **Reciprocal Transformation:** Inverts values to reduce skew.
     - **Square Root Transformation:** Reduces the impact of outliers.
     - **Exponential Transformation:** Increases the impact of large values.
     - **Box-Cox Transformation:** Finds the optimal power transformation.
     - **Robust Scaler:** Scales with the median and interquartile range, so outliers affect it less than standard scaling.
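
The normality check and two of the transformations can be sketched with SciPy. The log-normal sample below is synthetic and only stands in for a right-skewed feature; it assumes strictly positive values, which both the log and Box-Cox transforms require.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # right-skewed sample

# Shapiro-Wilk: a small p-value suggests the data is not normally distributed.
_, p_raw = stats.shapiro(skewed)

# Logarithmic transformation.
logged = np.log(skewed)

# Box-Cox: estimates the power parameter lambda from the data.
boxcox_values, fitted_lambda = stats.boxcox(skewed)

# Skewness close to 0 indicates a more symmetric distribution.
skew_before = stats.skew(skewed)
skew_after = stats.skew(logged)
```
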

#### 4. **Algorithms and Their Requirements**
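   - **Z-score Normalization:**
     - **Used By:** Distance-based and gradient-based algorithms (e.g., KNN, SVM, k-means, PCA, linear models).
     - **Scales and Distributions:** Works best when features are roughly Gaussian, although this is not a strict requirement; it is sensitive to outliers.
   - **Min-Max Scaling:**
     - **Used By:** Algorithms that expect bounded inputs, such as neural networks. Tree-based algorithms split on thresholds and generally need no scaling.
     - **Scales and Distributions:** No specific distribution requirement, but sensitive to outliers.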

---

### Feature Selection

#### 1. **Feature Selection**
   - **Lasso Regression:**
     - **Description:** A regularization technique that selects features by shrinking some coefficients to zero, thus performing automatic feature selection.
   - **Dropping Constant Features:**
     - **Description:** Removing features that have the same value across all observations, as they do not contribute to the model.
   - **Correlation Checking:**
     - **Description:** Identifying and removing features that are highly correlated with each other to avoid multicollinearity.
   - **Chi-Square Test:**
     - **Description:** A statistical test of independence between a categorical feature and a categorical target; features that appear independent of the target are candidates for removal.
   - **ANOVA (Analysis of Variance):**
     - **Description:** A statistical test for numerical features with a categorical target; it compares the variance between class groups with the variance within them (F-test).
   - **Recursive Feature Elimination (RFE):**
     - **Description:** A greedy algorithm that repeatedly fits a model and removes the least important features until the desired number remains.
   - **Feature Importance:**
     - **Description:** Using the importance scores exposed by fitted models in libraries like scikit-learn (e.g., `feature_importances_` of tree ensembles) to rank features.
   - **Variance Inflation Factor (VIF):**
     - **Description:** Measures the degree of multicollinearity among features, identifying features that are highly correlated.
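
Two of the techniques above, dropping constant features and Lasso-based selection, can be sketched with scikit-learn. The synthetic data (one informative column, one noise column, one constant column) and `alpha=0.1` are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
informative = rng.normal(size=n)
noise = rng.normal(size=n)
constant = np.ones(n)
X = np.column_stack([informative, noise, constant])
y = 3.0 * informative + rng.normal(scale=0.1, size=n)

# Drop constant features (zero variance).
selector = VarianceThreshold(threshold=0.0)
X_var = selector.fit_transform(X)
kept = selector.get_support()  # boolean mask of the columns that survive

# Lasso shrinks the coefficients of uninformative features to exactly zero.
lasso = Lasso(alpha=0.1).fit(X_var, y)
coefs = lasso.coef_
```

Features whose Lasso coefficient is zero can be dropped; a larger `alpha` removes more features.
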

#### 2. **Dimensionality Reduction**
   - **Purpose:**
     - **Description:** Reducing the number of features while preserving the essential information in the data.
   - **Techniques:**
     - **PCA (Principal Component Analysis):**
       - **Description:** Transforms features into a set of orthogonal components that capture the maximum variance.
     - **LDA (Linear Discriminant Analysis):**
       - **Description:** A supervised method that focuses on maximizing class separation.
     - **t-SNE (t-Distributed Stochastic Neighbor Embedding):**
       - **Description:** A technique for visualizing high-dimensional data by mapping it to a lower-dimensional space.



---

### Dimensionality Reduction

#### 1. **Covariance, Collinearity, and Multicollinearity**
   - **Covariance:**
     - **Description:** Measures how two variables vary together; its sign gives the direction of the relationship, while its magnitude depends on the variables' units.
   - **Collinearity:**
     - **Description:** Occurs when two independent variables are highly correlated.
   - **Multicollinearity:**
     - **Description:** Occurs when three or more independent variables are linearly related, so that one can be closely predicted from the others.

   - **Detection:**
     - **Pearson Correlation:**
       - **Description:** Uses a heatmap to visualize the correlation between variables.
     - **Variance Inflation Factor (VIF):**
       - **Description:** Measures the degree of multicollinearity in regression models; higher VIF indicates more multicollinearity.

   - **Fixing:**
     - **Dropping One of the Features:**
       - **Description:** Removes one of the correlated features to reduce multicollinearity.
     - **PCA (Principal Component Analysis):**
       - **Description:** Creates new, uncorrelated features (principal components) to address multicollinearity.
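
The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below implements this directly in NumPy; the helper name `vif` and the synthetic features are hypothetical, and libraries such as statsmodels provide an equivalent function.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns (with an intercept).
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)  # nearly a copy of a
c = rng.normal(size=300)                 # independent feature
vif_values = vif(np.column_stack([a, b, c]))
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic multicollinearity; here `a` and `b` score far above that, while `c` stays close to 1.
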

#### 2. **Dimensionality Reduction**
   - **Purpose:**
     - **Description:** Reduces the number of features while preserving essential information.
     - **Benefits:**
       - **Reduces Computational Cost:** Less data to process.
       - **Improves Model Performance:** Can enhance accuracy by reducing noise.
       - **Reduces Overfitting:** Simplifies the model to avoid fitting noise.

   - **Methods:**
     - **PCA (Principal Component Analysis):**
       - **Description:** Creates new, uncorrelated features called principal components.
       - **Visualization:** Projecting the data onto the first two principal components gives a 2D scatter plot in which clusters and correlated structure become visible.
     - **Kernel PCA:**
       - **Description:** A non-linear extension of PCA, useful for capturing non-linear relationships.
     - **Linear Discriminant Analysis (LDA):**
       - **Description:** A supervised method that focuses on maximizing class separation.
     - **Autoencoders:**
       - **Description:** Neural networks that learn to compress and reconstruct data, effectively reducing dimensionality.
     - **Feature Selection:**
       - **Description:** Choosing a subset of features based on their relevance to the target variable.
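
A short PCA sketch with scikit-learn. The data is synthetic: five observed features are built as noisy linear mixes of two hidden factors, and the `mixing` matrix is an arbitrary assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))  # two hidden factors
mixing = np.array([[1.0, 0.5, -1.0, 2.0, 0.0],
                   [0.0, 1.0, 1.0, -0.5, 1.5]])
X = base @ mixing + rng.normal(scale=0.05, size=(200, 5))

# PCA is scale-sensitive, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# Share of the total variance kept by the two components.
explained = pca.explained_variance_ratio_.sum()
```

Because only two factors drive the five features, two components retain almost all of the variance.
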

---

### Imbalanced Datasets

#### 1. **Characteristics of Imbalanced Datasets**
   - **Description:** Datasets where one class significantly outnumbers the other.
   - **Algorithm Suitability:**
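     - **Tree-Based Algorithms:** Often cope with imbalance better than linear models, but they are not immune to it.
     - **Decision Trees and Random Forests:** Without class weights or resampling, they can still favor the majority class.
     - **Boosting Algorithms:** AdaBoost, XGBoost, and Gradient Boosting Machines can handle imbalance effectively, since each round concentrates on previously misclassified samples.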

#### 2. **Fixing Imbalance**
   - **Over-sampling:**
     - **Description:** Increasing the number of instances in the minority class.
     - **Methods:**
       - **Duplicating Examples:** Replicating existing minority class samples.
       - **Generating Synthetic Examples:** Creating new samples using techniques like SMOTE (Synthetic Minority Over-sampling Technique).

   - **Under-sampling:**
     - **Description:** Reducing the number of instances in the majority class.
     - **Methods:**
       - **Random Removal:** Removing a subset of majority class samples to balance the dataset.

   - **Data Augmentation:**
     - **Description:** Generating modified copies of existing minority-class samples (e.g., transformed images or slightly perturbed records) so the model sees more varied examples of that class.

   - **Anomaly Detection:**
     - **Description:** Treating the minority class as anomalies and applying anomaly detection techniques to identify and classify them.

   - **Cost-sensitive Learning:**
     - **Description:** Assigning different costs to misclassification errors based on class, to penalize errors on the minority class more heavily.
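
Random over-sampling and cost-sensitive learning can be sketched with scikit-learn alone. The 95/5 synthetic dataset is an assumption; SMOTE itself lives in the separate imbalanced-learn package and is not shown here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_major = rng.normal(loc=0.0, size=(95, 2))
X_minor = rng.normal(loc=2.0, size=(5, 2))
X = np.vstack([X_major, X_minor])
y = np.array([0] * 95 + [1] * 5)

# Over-sampling: duplicate minority rows (with replacement) until balanced.
X_minor_up = resample(X_minor, replace=True, n_samples=len(X_major), random_state=0)
X_balanced = np.vstack([X_major, X_minor_up])
y_balanced = np.array([0] * len(X_major) + [1] * len(X_minor_up))

# Cost-sensitive learning: weight errors inversely to class frequency.
weighted_model = LogisticRegression(class_weight="balanced").fit(X, y)
```

Resample only the training split; the validation and test sets should keep the original class distribution.
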

#### 3. **Evaluation Metrics**
   - **Precision, Recall, and F1-score:**
     - **Description:** Metrics more appropriate for evaluating performance on imbalanced datasets than accuracy.
       - **Precision:** The ratio of true positives to the sum of true positives and false positives.
       - **Recall:** The ratio of true positives to the sum of true positives and false negatives.
       - **F1-score:** The harmonic mean of precision and recall.
   - **AUC (Area Under the ROC Curve):**
     - **Description:** Measures the overall performance of a model, providing a summary of the model's ability to discriminate between classes.

---


### Training, Testing, and Validation Splits in Datasets

#### 1. **Purpose of Splits**
   - **Training Set:**
     - **Description:** Used to train the model by adjusting its parameters based on the data.
   - **Validation Set:**
     - **Description:** Used to tune hyperparameters and make decisions about model architecture.
   - **Testing Set:**
     - **Description:** Used to assess the final performance of the model after training and validation.

#### 2. **Common Splitting Methods**

   - **Simple Split:**
     - **Description:** Fixed proportions are used to divide the data into training, validation, and test sets.
     - **Process:**
       - **Shuffle Data:** Randomly shuffle the data.
       - **Split Data:** Allocate data into training (e.g., 70%), validation (e.g., 15%), and testing sets (e.g., 15%).

   - **k-Fold Cross-Validation:**
     - **Description:** Data is divided into \( k \) folds. The model is trained \( k \) times, using \( k-1 \) folds for training and 1 fold for validation.
     - **Process:**
       - **Divide Data:** Split data into \( k \) folds.
       - **Train and Validate:** Train on \( k-1 \) folds and validate on the remaining fold.
       - **Average Results:** Compute average performance metrics across all \( k \) folds.

   - **Stratified Split:**
     - **Description:** Maintains the proportion of classes in each split, useful for imbalanced datasets.
     - **Process:**
       - **Divide Data:** Split data while preserving class proportions in training, validation, and test sets.

   - **Time Series Split:**
     - **Description:** Splits data based on time, ensuring the order of data is respected.
     - **Process:**
       - **Chronological Split:** Use past data for training and future data for validation/testing.

   - **Leave-One-Out Cross-Validation (LOOCV):**
     - **Description:** Each sample is used as a validation set once, while the remaining samples are used for training.
     - **Process:**
       - **Train and Validate:** Train on all but one sample and validate on the single sample.

   - **Holdout Method:**
     - **Description:** Randomly splits data into training and test sets, optionally performing further validation on the training set.
     - **Process:**
       - **Shuffle and Split:** Split data into training and test sets, with optional additional validation.
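
The splitting methods above map onto scikit-learn utilities roughly as follows. The array sizes and the 70/15/15 proportions are illustrative; integer `test_size` values are used so the split sizes are exact.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, train_test_split

X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# Stratified 70/15/15 split: carve out the test set first, then the validation set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=15, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=15, stratify=y_temp, random_state=0)

# Stratified k-fold: every fold keeps the 80/20 class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(val_idx) for _, val_idx in skf.split(X, y)]

# Time series split: each validation fold comes strictly after its training data.
tscv = TimeSeriesSplit(n_splits=3)
ordered = all(train_idx.max() < val_idx.min() for train_idx, val_idx in tscv.split(X))
```
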

### Summary
- **Training Set:** For fitting the model's parameters.
- **Validation Set:** For tuning model parameters and selecting the best model.
- **Testing Set:** For evaluating final model performance.

**Splitting Methods:**
- **Simple Split:** Fixed proportions for training, validation, and test sets.
- **k-Fold Cross-Validation:** Multiple training and validation rounds with \( k \) folds.
- **Stratified Split:** Preserves class proportions in each split.
- **Time Series Split:** Ensures data is split chronologically.
- **LOOCV:** Uses each sample as a validation set once.
- **Holdout Method:** Simple train-test split with optional further validation.

---

### Handling Overfitting and Underfitting

#### 1. **Overfitting**
   - **Description:** Occurs when a model learns the noise in the training data rather than the underlying pattern, leading to poor performance on new data.

   - **Techniques to Handle Overfitting:**
     - **Cross-Validation:**
       - **Description:** Evaluates the model's performance on different subsets of the data to ensure it generalizes well.
     - **Regularization:**
       - **Description:** Adds a penalty to the loss function to constrain the model's complexity, e.g., L1 (Lasso) and L2 (Ridge) regularization.
     - **Early Stopping:**
       - **Description:** Stops training when the model’s performance on a validation set starts to degrade.
     - **Pruning (for Tree-Based Models):**
       - **Description:** Removes branches from decision trees to prevent them from becoming too complex.
     - **Simplifying the Model:**
       - **Description:** Reduces the complexity of the model by using fewer features or parameters.
     - **Data Augmentation:**
       - **Description:** Increases the diversity of the training data by applying transformations like rotations, flips, or scaling, particularly in image data.
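
Regularization and cross-validation can be sketched together. The example below fits a deliberately over-flexible polynomial model with and without an L2 penalty; the synthetic sine data, the degree of 12, and `alpha=1.0` are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=30).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=30)

def make_model(regressor):
    # A degree-12 polynomial is deliberately too flexible for 30 points.
    return make_pipeline(
        PolynomialFeatures(degree=12, include_bias=False),
        StandardScaler(),
        regressor,
    )

unregularized = make_model(LinearRegression()).fit(X, y)
regularized = make_model(Ridge(alpha=1.0)).fit(X, y)

# The L2 penalty shrinks the coefficients toward zero.
ols_norm = np.linalg.norm(unregularized[-1].coef_)
ridge_norm = np.linalg.norm(regularized[-1].coef_)

# Cross-validation estimates how each model generalizes to unseen data.
cv_unregularized = cross_val_score(make_model(LinearRegression()), X, y, cv=5).mean()
cv_regularized = cross_val_score(make_model(Ridge(alpha=1.0)), X, y, cv=5).mean()
```

Comparing the two cross-validated scores, rather than the training fit, is what reveals whether the penalty actually improved generalization.
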

#### 2. **Underfitting**
   - **Description:** Occurs when a model is too simple to capture the underlying pattern in the data, leading to poor performance on both training and new data.

   - **Techniques to Handle Underfitting:**
     - **Get More Data:**
       - **Description:** Gives the model more examples to learn from. More data alone rarely fixes underfitting, because the model lacks capacity; it helps mainly when combined with a more expressive model or more informative features.
     - **Increase Model Complexity:**
       - **Description:** Uses a more complex model or algorithm that can capture more intricate patterns in the data.
     - **Increase Number of Parameters:**
       - **Description:** Adds more parameters to the model to allow it to learn more complex relationships.
     - **Train for More Iterations:**
       - **Description:** Extends the training process to give the model more opportunity to learn from the data.
     - **Use Ensembling Techniques:**
       - **Description:** Combines predictions from multiple models to improve performance and capture more complex patterns.
     - **Data Augmentation:**
       - **Description:** Applies transformations to enlarge the training set. This mainly combats overfitting; it helps underfitting only when the added variation exposes patterns the model could not otherwise see.

---



### Model Evaluation Metrics
#### 1. **Regression Metrics**
   - **R-squared (R²):**
     - **Description:** Measures the proportion of variance in the dependent variable explained by the model.
     - **Purpose:** Assesses model fit and explanatory power.
   - **Adjusted R²:**
     - **Description:** Adjusts R² for the number of predictors in the model.
     - **Purpose:** Penalizes the inclusion of irrelevant features, providing a more accurate measure of model performance.
   - **Mean Squared Error (MSE):**
     - **Description:** Measures the average squared difference between predicted and actual values.
     - **Purpose:** Evaluates model accuracy; lower values indicate better performance.
   - **Mean Absolute Error (MAE):**
     - **Description:** Measures the average absolute difference between predicted and actual values.
     - **Purpose:** Provides a direct measure of prediction error; less sensitive to outliers than MSE.
   - **Root Mean Squared Error (RMSE):**
     - **Description:** The square root of the MSE, giving a measure in the same units as the target variable.
     - **Purpose:** Provides a more interpretable measure of prediction error compared to MSE.
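
The regression metrics above can be computed by hand on a tiny example. The four true and predicted values are made up, and `n_features = 1` is an assumed number of predictors for the adjusted R² formula.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])
n_features = 1  # assumed number of predictors in the model

errors = y_true - y_pred
mse = np.mean(errors ** 2)
mae = np.mean(np.abs(errors))
rmse = np.sqrt(mse)

# R² = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Adjusted R² penalizes additional predictors.
n = len(y_true)
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
```

Here MSE = 1.5, MAE = 1.0, R² = 0.7, and adjusted R² = 0.55; scikit-learn's `mean_squared_error`, `mean_absolute_error`, and `r2_score` return the same MSE, MAE, and R².
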
#### 2. **Classification Metrics**
   - **Confusion Matrix:**
     - **Description:** A table summarizing the performance of a classification model by showing true positives, true negatives, false positives, and false negatives.
   - **Precision:**
     - **Description:** The proportion of positive predictions that are actually positive.
     - **Formula:** ```Precision = True Positives / (True Positives + False Positives)```
   - **Recall:**
     - **Description:** The proportion of actual positive instances that are correctly predicted as positive.
     - **Formula:** ```Recall = True Positives / (True Positives + False Negatives)```
   - **Accuracy:**
     - **Description:** The overall proportion of correct predictions (both positive and negative).
     - **Formula:** ```Accuracy = (True Positives + True Negatives) / Total Predictions```
   - **F-score:**
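     - **Description:** The weighted harmonic mean of precision and recall. β = 1 gives the F1-score, β > 1 weights recall more heavily, and β < 1 weights precision more heavily.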
     - **Formula:** ```F_β = (1 + β²) * (Precision * Recall) / (β² * Precision + Recall)```
   - **Type I Error (False Positive Rate, FPR):**
     - **Description:** The rate of incorrectly predicting the positive class when the true class is negative.
     - **Formula:** ```FPR = False Positives / (True Negatives + False Positives)```
   - **Type II Error (False Negative Rate, FNR):**
     - **Description:** The rate of incorrectly predicting the negative class when the true class is positive.
     - **Formula:** ```FNR = False Negatives / (True Positives + False Negatives)```
   - **ROC Curve:**
     - **Description:** A plot showing the trade-off between the true positive rate (sensitivity) and the false positive rate.
   - **AUC-ROC:**
     - **Description:** The area under the ROC curve, representing the model's overall ability to distinguish between classes.
     - **Purpose:** Higher AUC indicates better model performance.
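
The formulas above translate directly into code. The confusion-matrix counts below are hypothetical, and `f_beta` is a helper name introduced only for this sketch.

```python
# Hypothetical confusion-matrix counts.
tp, fp, fn, tn = 40, 10, 20, 30

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
fpr = fp / (tn + fp)  # Type I error rate
fnr = fn / (tp + fn)  # Type II error rate

def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_beta(precision, recall)            # balanced
f2 = f_beta(precision, recall, beta=2.0)  # favors recall
```

With these counts, precision is 0.8, recall is about 0.667, and accuracy is 0.7. F2 comes out lower than F1 because recall is the weaker of the two metrics here.
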
#### 3. **Clustering Metrics**
   - **Silhouette Coefficient:**
     - **Description:** Measures how similar a data point is to its own cluster compared to other clusters.
     - **Purpose:** Helps assess the quality of clustering.
   - **Davies-Bouldin Index:**
     - **Description:** Measures the average similarity between each cluster and its nearest cluster.
     - **Purpose:** Lower values indicate better clustering.
   - **Calinski-Harabasz Index:**
     - **Description:** Measures the ratio of between-cluster variance to within-cluster variance.
     - **Purpose:** Higher values indicate more distinct clusters.
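
All three clustering metrics are available in scikit-learn. The sketch below scores a k-means result on two synthetic, well-separated blobs, so the silhouette should be high and the Davies-Bouldin index low; the blob positions are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

silhouette = silhouette_score(X, labels)                # closer to 1 is better
davies_bouldin = davies_bouldin_score(X, labels)        # lower is better
calinski_harabasz = calinski_harabasz_score(X, labels)  # higher is better
```
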
---


### Important Terminology from Algorithms

#### 1. **Bias and Variance**
   - **Bias:**
     - **Definition:** Error due to model assumptions; systematic error.
   - **Variance:**
     - **Definition:** Error due to variability in model predictions across different training sets.

#### 2. **Cost Function**
   - **Definition:** Measures the model's error; guides model training by minimizing error.

#### 3. **Overfitting and Underfitting**
   - **Overfitting:**
     - **Definition:** Model performs well on training data but poorly on unseen data.
   - **Underfitting:**
     - **Definition:** Model performs poorly on both training and test data.

#### 4. **Precision and Recall**
   - **Precision:**
     - **Definition:** Proportion of true positives among all positive predictions.
   - **Recall:**
     - **Definition:** Proportion of true positives among all actual positives.

#### 5. **Regularization**
   - **Purpose:** Prevents overfitting by penalizing complex models.
   - **L1 Regularization (Lasso):** Encourages sparsity by setting some coefficients to zero.
   - **L2 Regularization (Ridge):** Reduces the magnitude of coefficients.

#### 6. **Interquartile Range (IQR)**
   - **Definition:** Measure of dispersion; distance between the first and third quartiles.
   - **Purpose:** Describes the spread of the central 50% of the data and is largely unaffected by outliers.

### Box Plots, Percentiles, Entropy, and Information Gain

#### 1. **Box Plots**
   - **IQR:** Distance between the first (Q1) and third quartiles (Q3).
   - **Outliers:** Data points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
   - **Whiskers:** Extend from the box to the smallest and largest values that still lie within 1.5 × IQR of the quartiles.

#### 2. **Percentiles**
   - **Definition:** The p-th percentile is the value below which p% of the observations fall.
   - **Percentile Rank of a Value:** \( \frac{\text{Number of values below the value}}{\text{Total number of values}} \times 100 \)
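
A small NumPy sketch of both directions: the percentile rank of a given value, and the value at a given percentile. The scores are made up, and `percentile_rank` is a helper name introduced for this example.

```python
import numpy as np

scores = np.array([55, 60, 65, 70, 75, 80, 85, 90, 95, 100])

def percentile_rank(data, value):
    """Percentage of observations strictly below the given value."""
    return np.sum(data < value) / len(data) * 100

# Five of the ten scores are below 80.
rank_of_80 = percentile_rank(scores, 80)

# Value at the 90th percentile (NumPy interpolates linearly by default).
p90_value = np.percentile(scores, 90)
```

The score 80 has a percentile rank of 50, and the 90th percentile falls at 95.5, between the two highest scores.
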

#### 3. **Entropy**
   - **Definition:** Measure of impurity or uncertainty in a dataset.
   - **Higher Entropy:** More uncertainty.
   - **Lower Entropy:** More homogeneity.

#### 4. **Information Gain**
   - **Definition:** Reduction in entropy after splitting the dataset based on a feature.
   - **Higher Information Gain:** More informative feature for classification.
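
Entropy and information gain can be computed with the standard library alone. In this sketch, entropy is measured in bits (log base 2), and the labels and feature values are toy data chosen so that one feature separates the classes perfectly and the other not at all.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy reduction obtained by splitting the labels on a feature."""
    total = len(labels)
    weighted = 0.0
    for value in set(feature_values):
        subset = [lab for lab, feat in zip(labels, feature_values) if feat == value]
        weighted += len(subset) / total * entropy(subset)
    return entropy(labels) - weighted

labels = ["yes", "yes", "no", "no"]
perfect_feature = ["a", "a", "b", "b"]  # separates the classes completely
useless_feature = ["a", "b", "a", "b"]  # tells us nothing about the class

gain_perfect = information_gain(labels, perfect_feature)
gain_useless = information_gain(labels, useless_feature)
```

The balanced labels have an entropy of 1 bit; the perfect feature removes all of it (gain 1.0), and the useless feature removes none (gain 0.0).
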


---

# Feature Importance

Feature importance techniques help identify which features have the most impact on your model's predictions. This is crucial for understanding your model, improving it, and gaining business insights. Here are several methods to determine feature importance:

1. Tree-based Methods:

   a) Random Forest Feature Importance:
   ```python
   from sklearn.ensemble import RandomForestClassifier
   
   rf = RandomForestClassifier()
   rf.fit(X_train, y_train)
   importances = rf.feature_importances_
   ```

   b) XGBoost Feature Importance:
   ```python
   import xgboost as xgb
   
   xgb_model = xgb.XGBClassifier()
   xgb_model.fit(X_train, y_train)
   importances = xgb_model.feature_importances_
   ```

2. Permutation Importance:
   Works with any model, not just tree-based ones.
   ```python
   from sklearn.inspection import permutation_importance
   
   result = permutation_importance(model, X_test, y_test, n_repeats=10)
   importances = result.importances_mean
   ```

3. SHAP (SHapley Additive exPlanations):
   Provides both global and local feature importance.
   ```python
   import shap
   
   explainer = shap.TreeExplainer(model)
   shap_values = explainer.shap_values(X)
   shap.summary_plot(shap_values, X)
   ```

4. Recursive Feature Elimination (RFE):
   ```python
   from sklearn.feature_selection import RFE
   
   rfe = RFE(estimator=model, n_features_to_select=10)
   rfe.fit(X, y)
   ```

5. Correlation Analysis:
   For linear relationships between features and target.
   ```python
   correlation = X.corrwith(y)
   ```

6. Mutual Information:
   Captures non-linear relationships.
   ```python
   from sklearn.feature_selection import mutual_info_classif
   
   mi_scores = mutual_info_classif(X, y)
   ```


After calculating importances, you can visualize them:

```python
import pandas as pd
import matplotlib.pyplot as plt

feature_importance = pd.DataFrame({'feature': X.columns, 'importance': importances})
feature_importance = feature_importance.sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.bar(feature_importance['feature'], feature_importance['importance'])
plt.xticks(rotation=90)
plt.title('Feature Importance')
plt.tight_layout()
plt.show()
```

A practical starting point is Random Forest or XGBoost feature importance, since it comes for free when those models are already in use. SHAP values can provide more detailed insights if needed.
