# Machine Learning Project Report

## 1. Introduction
### Problem Statement
The goal of this project is to classify e-commerce records into predefined categories using machine learning algorithms. Accurate classification helps businesses identify customer behavior patterns and improve recommendations.

### Dataset Description
The dataset `ecommerce_dataset_updated.csv` contains information on customer attributes and behaviors, with the target variable being the category of e-commerce activity. The dataset includes both numerical and categorical features. Missing values and class imbalance were addressed during preprocessing.

---

## 2. Methodology
### Preprocessing Steps
1. **Missing Values**: Imputed missing values with the mean for numerical features.
2. **Encoding**: Applied Label Encoding to convert categorical variables into numerical format.
3. **Balancing the Dataset**: Used SMOTE (Synthetic Minority Oversampling Technique) to address class imbalance.
4. **Feature Scaling**: Standardized features using `StandardScaler`.
5. **Train-Test Split**: Split the dataset into training (70%) and testing (30%) sets using stratified sampling.
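
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `preprocess` helper, the toy frame, and its column names are hypothetical stand-ins for the real CSV. SMOTE is left out because it lives in the separate `imbalanced-learn` package, and fitting the scaler on the training split only is an assumption made here to keep test-set statistics out of the fit.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler


def preprocess(df: pd.DataFrame, target: str, seed: int = 42):
    """Impute, encode, split 70/30 with stratification, then standardize."""
    df = df.copy()
    features = df.drop(columns=[target])

    # Mean-impute the numerical feature columns.
    num_cols = features.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

    # Label-encode the categorical feature columns.
    cat_cols = features.select_dtypes(exclude="number").columns
    for col in cat_cols:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))

    X = df.drop(columns=[target]).to_numpy(dtype=float)
    y = LabelEncoder().fit_transform(df[target])

    # Stratified 70/30 split keeps class proportions similar in both parts.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed
    )

    # Fit the scaler on the training split only, then apply it to both splits.
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test


# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29, 35, 52, 46, 38, 27],
    "device": ["mobile", "desktop", "mobile", "tablet", "desktop",
               "mobile", "desktop", "tablet", "mobile", "desktop"],
    "category": ["buy", "browse"] * 5,
})
X_train, X_test, y_train, y_test = preprocess(toy, target="category")
print(X_train.shape, X_test.shape)
```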

### Algorithms Applied
1. **Random Forest Classifier**:
   - Hyperparameter tuning with GridSearchCV.
   - Parameters tuned: `n_estimators`, `max_depth`.
2. **Support Vector Machine (SVM)**:
   - Used RBF and linear kernels.
   - Parameters tuned: `C`, `gamma`.
3. **Logistic Regression**:
   - Regularization techniques: L1 and L2 penalties.
   - Parameters tuned: `penalty`, `C`.
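
One way to write down the three search spaces is shown below. The candidate values are illustrative assumptions, since the report lists only the parameter names, and the `search_spaces` dictionary is a hypothetical container. The SVM grid is split in two because `gamma` has no effect on a linear kernel.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVC

# Candidate values are assumptions for illustration, not the project's grids.
search_spaces = {
    "Random Forest": (
        RandomForestClassifier(random_state=42),
        {"n_estimators": [100, 200], "max_depth": [None, 10, 20]},
    ),
    "Support Vector Machine": (
        SVC(),
        [
            {"kernel": ["linear"], "C": [0.1, 1, 10]},
            {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
        ],
    ),
    "Logistic Regression": (
        # The liblinear solver accepts both L1 and L2 penalties.
        LogisticRegression(solver="liblinear", max_iter=1000),
        {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]},
    ),
}

# Number of hyperparameter combinations each grid expands to.
n_candidates = {
    name: len(ParameterGrid(grid)) for name, (_, grid) in search_spaces.items()
}
print(n_candidates)
```

Counting the combinations up front gives a quick estimate of tuning cost: each candidate is fitted once per cross-validation fold.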

### Optimization Techniques
- `GridSearchCV` was applied to all three algorithms to select hyperparameters by cross-validation.
- ROC-AUC was among the metrics used to score candidates during cross-validation.
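
A minimal sketch of a grid search scored by ROC-AUC, assuming a binary target and using synthetic data in place of the real dataset. The grid values and the three-fold setup are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic binary data standing in for the real dataset (an assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, None]},
    scoring="roc_auc",  # "roc_auc_ovr" is the one-vs-rest variant for multiclass targets
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)

# Best combination and its mean cross-validated ROC-AUC.
print(search.best_params_, round(search.best_score_, 3))
```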

---

## 3. Results
### Performance Metrics
| Model                 | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|-----------------------|----------|-----------|--------|----------|---------|
| Random Forest         | 0.93     | 0.94      | 0.93   | 0.93     | 0.95    |
| Support Vector Machine| 0.89     | 0.90      | 0.89   | 0.89     | 0.91    |
| Logistic Regression   | 0.88     | 0.89      | 0.88   | 0.88     | 0.90    |
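
For reference, the five metrics in the table can be computed with `sklearn.metrics` as sketched below. The labels and scores are a made-up binary toy example, not project data. For a multiclass target, precision, recall, and F1 need an `average` argument (for example `"weighted"`), and the report does not state which averaging was used.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy binary example: true labels, hard predictions, and predicted scores.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_score = [0.1, 0.6, 0.8, 0.7, 0.4, 0.2]

metrics = {
    "Accuracy": accuracy_score(y_true, y_pred),      # 4 of 6 predictions correct
    "Precision": precision_score(y_true, y_pred),    # 2 of 3 predicted positives are true
    "Recall": recall_score(y_true, y_pred),          # 2 of 3 true positives are found
    "F1-Score": f1_score(y_true, y_pred),            # harmonic mean of the two above
    "ROC-AUC": roc_auc_score(y_true, y_score),       # uses scores, not hard labels
}
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```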

### Visualizations
1. **Model Comparison**: A bar plot comparing Accuracy, Precision, Recall, F1-Score, and ROC-AUC.
2. **Confusion Matrices**: Heatmaps for each model showing predicted versus actual classes.
3. **Feature Importance**: Visualizations for Random Forest and Logistic Regression.
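
The model comparison plot could be reproduced from the table values with a grouped bar chart along these lines. The layout choices (bar width, y-axis range, figure size) are assumptions, not a description of the original figure.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

metric_names = ["Accuracy", "Precision", "Recall", "F1-Score", "ROC-AUC"]
# Scores copied from the Performance Metrics table.
scores = {
    "Random Forest": [0.93, 0.94, 0.93, 0.93, 0.95],
    "Support Vector Machine": [0.89, 0.90, 0.89, 0.89, 0.91],
    "Logistic Regression": [0.88, 0.89, 0.88, 0.88, 0.90],
}

x = np.arange(len(metric_names))
width = 0.25
fig, ax = plt.subplots(figsize=(8, 4))

# One group of three bars per metric, one bar per model.
for i, (model, values) in enumerate(scores.items()):
    ax.bar(x + (i - 1) * width, values, width, label=model)

ax.set_xticks(x)
ax.set_xticklabels(metric_names)
ax.set_ylim(0.8, 1.0)  # zoomed in so small differences stay visible
ax.set_ylabel("Score")
ax.set_title("Model comparison")
ax.legend()
fig.tight_layout()
```

Calling `fig.savefig(...)` with a file name of your choice writes the chart to disk.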

---

## 4. Analysis
### Insights
- **Random Forest**: Achieved the highest score on every metric (accuracy 0.93, ROC-AUC 0.95), likely because an ensemble of decision trees can capture non-linear relationships between features.
- **SVM**: Performed well with tuned hyperparameters (accuracy 0.89, ROC-AUC 0.91) but was computationally expensive to tune.
- **Logistic Regression**: Stayed within 0.01 of SVM on every metric (accuracy 0.88, ROC-AUC 0.90) while remaining simple and interpretable.

### Algorithm Comparison
- Random Forest showed the best results on the SMOTE-balanced data, including the highest ROC-AUC (0.95).
- SVM required careful tuning of `C` and `gamma` (the latter applies only to the RBF kernel) and delivered robust results once tuned; the linear kernel is the better fit when classes are close to linearly separable.
- Logistic Regression was the least complex model, efficient and interpretable, but as a linear model it was less effective at capturing non-linear relationships.
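
The gap between linear and tree-based models on non-linear data can be illustrated with a small synthetic experiment. This sketch uses scikit-learn's `make_circles` toy data, which is an assumption chosen for illustration and unrelated to the project dataset.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Two concentric rings: no straight line separates the classes.
X, y = make_circles(n_samples=600, noise=0.1, factor=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Test accuracy of a linear model versus a tree ensemble.
lr_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)
rf_acc = RandomForestClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)

print(f"Logistic Regression: {lr_acc:.2f}")
print(f"Random Forest:       {rf_acc:.2f}")
```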

### Challenges Faced
1. **Imbalanced Data**: Resolved using SMOTE to oversample the minority class.
2. **Hyperparameter Tuning**: Computational cost was high, particularly for SVM.
3. **Feature Scaling**: Essential for SVM and Logistic Regression, which are sensitive to feature magnitudes, but it added a preprocessing step that Random Forest does not need.
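
To make the oversampling idea concrete, here is a simplified NumPy sketch of SMOTE-style interpolation. It is not the `imbalanced-learn` implementation, and the report does not say which implementation was used. The function name `smote_like_oversample` and the toy data are hypothetical.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise squared distances within the minority class.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                        # random base samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]    # one neighbour per base
    gap = rng.random((n_new, 1))                                 # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority class: 10 samples with 3 features.
X_minority = np.random.default_rng(1).normal(size=(10, 3))
X_synth = smote_like_oversample(X_minority, n_new=40)
print(X_synth.shape)
```

Because every synthetic point lies on a segment between two real minority samples, oversampling fills in the minority region rather than duplicating rows.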

---

## 5. Conclusion
This project applied Random Forest, SVM, and Logistic Regression to e-commerce data classification. Random Forest was the best performer (accuracy 0.93, ROC-AUC 0.95), making it the strongest candidate for deployment in similar tasks. Future improvements could include feature engineering, broader hyperparameter searches, and deep learning models.

