# Detailed Summary of `model.py`

This notebook walks through the `model.py` script: which libraries it uses, what each one is for, and where each appears in the code. **Pandas**, **NumPy**, and **Scikit-learn** get the most attention.

## Libraries Used

### 1. **Pandas**
**Purpose**: Pandas is used for data manipulation and analysis. It provides data structures like `DataFrame` and `Series` for handling structured data.

**Where It Is Used**:
- **File Reading**:
  - `pd.read_csv()` is used to load datasets from CSV files.
  ```python
  df = pd.read_csv('csv_files/transaction_dataset.csv', index_col=0)
  ```
- **Data Cleaning**:
  - Dropping categorical columns:
    ```python
    categories = df.select_dtypes('O').columns
    df.drop(columns=categories, inplace=True)  # pass the column labels, not a sub-DataFrame
    ```
  - Filling missing values with the median:
    ```python
    df.fillna(df.median(), inplace=True)
    ```
  - Removing zero-variance features:
    ```python
    no_var = df.var() == 0  # boolean mask: True where a column's variance is zero
    df.drop(columns=no_var[no_var].index, inplace=True)
    ```
- **Feature Engineering**:
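  - Dropping redundant or less important features (`drop` is a collection of column names defined elsewhere in `model.py`; only the names still present in `df` are removed):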
    ```python
    drop_existing = [col for col in drop if col in df.columns]
    df.drop(drop_existing, axis=1, inplace=True)
    ```
- **DataFrame Operations**:
  - Splitting features (`X`) and target (`y`):
    ```python
    y = df['FLAG']
    X = df.drop('FLAG', axis=1)
    ```

**Why It Is Important**:
- Simplifies data preprocessing, cleaning, and manipulation.
- Provides intuitive syntax for handling structured data, making it easier to prepare datasets for training and testing.
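
The cleaning steps above can be tried end to end on a small toy frame. This is a minimal sketch, not the script itself: the column names `Address`, `avg_val_sent`, and `constant_col` are made up for illustration, and only `FLAG` comes from the original code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the transaction dataset; every column except FLAG is hypothetical.
df = pd.DataFrame({
    # Explicit object dtype so select_dtypes("object") finds it on any pandas version.
    "Address": pd.Series(["0xa", "0xb", "0xc", "0xd"], dtype="object"),
    "FLAG": [0, 1, 0, 0],                      # target column
    "avg_val_sent": [1.0, np.nan, 3.0, 8.0],   # contains a missing value
    "constant_col": [0, 0, 0, 0],              # zero variance
})

# 1. Drop object-dtype (categorical) columns.
categories = df.select_dtypes(include="object").columns
df = df.drop(columns=categories)

# 2. Fill missing values with each column's median.
df = df.fillna(df.median())

# 3. Drop columns whose variance is zero.
variances = df.var()
df = df.drop(columns=variances[variances == 0].index)

# 4. Split the target from the features.
y = df["FLAG"]
X = df.drop(columns="FLAG")

print(X)
```

The sketch reassigns `df` instead of using `inplace=True`; both approaches give the same result here, and reassignment makes each step easier to inspect in a notebook.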

### 2. **NumPy**
**Purpose**: NumPy is used for numerical computing and array operations.

**Where It Is Used**:
- **Array Operations**:
  - Working with the NumPy arrays that `PowerTransformer` returns by default, so the model is trained on arrays rather than DataFrames:
    ```python
    norm_train_f = norm.fit_transform(X_train)
    norm_test_f = norm.transform(X_test)
    ```
- **Statistical Operations**:
  - Calculating the median for missing value imputation (`df.median()` is a Pandas method; Pandas is built on NumPy and relies on it for the numeric work):
    ```python
    df.fillna(df.median(), inplace=True)
    ```
  - Counting class distributions:
    ```python
    np.bincount(y_train)
    np.bincount(y_tr_resample)
    ```

**Why It Is Important**:
- Provides the foundation for numerical operations in Python.
- Enables efficient handling of large datasets and is tightly integrated with Scikit-learn.
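
To make the class-count step concrete, here is a small sketch of `np.bincount` on a hypothetical label array (the labels are invented; in the script they come from `y_train`). `np.bincount` expects non-negative integers and returns one count per integer value.

```python
import numpy as np

# Hypothetical binary labels: 0 = non-fraud, 1 = fraud.
y_train = np.array([0, 0, 0, 1, 0, 1, 0, 0])

counts = np.bincount(y_train)       # counts[i] is the number of labels equal to i
fractions = counts / counts.sum()   # share of each class

print(counts)     # [6 2]
print(fractions)  # [0.75 0.25]
```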

### 3. **Scikit-learn**
**Purpose**: Scikit-learn is used for machine learning tasks, including preprocessing, model training, and evaluation.

**Where It Is Used**:
- **Data Preprocessing**:
  - Splitting the dataset into training and testing sets:
    ```python
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=123)
    ```
  - Normalizing features using `PowerTransformer`:
    ```python
    norm = PowerTransformer()
    norm_train_f = norm.fit_transform(X_train)
    norm_test_f = norm.transform(X_test)
    ```
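- **Handling Class Imbalance**:
  - Applying SMOTE to balance the training set. SMOTE comes from Imbalanced-learn (see section 7), not from Scikit-learn itself, but it is designed to be compatible with Scikit-learn: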
    ```python
    oversample = SMOTE()
    x_tr_resample, y_tr_resample = oversample.fit_resample(norm_train_f, y_train)
    ```
- **Model Training**:
  - Training a Random Forest classifier:
    ```python
    RF = RandomForestClassifier(random_state=42, n_estimators=100)
    RF.fit(x_tr_resample, y_tr_resample)
    ```
- **Model Evaluation**:
  - Generating classification reports and confusion matrices:
    ```python
    print(classification_report(y_test, preds_RF))
    print(confusion_matrix(y_test, preds_RF))
    ```
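  - Calculating the ROC AUC score. Assuming `preds_RF` holds hard class labels from `RF.predict`, this score reflects a single decision threshold; positive-class probabilities from `RF.predict_proba` would let the metric rank samples across all thresholds: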
    ```python
    print(roc_auc_score(y_test, preds_RF))
    ```
- **Feature Importance**:
  - Extracting and visualizing feature importance:
    ```python
    feature_importance = pd.DataFrame({
        'Feature': X.columns,
        'Importance': RF.feature_importances_
    }).sort_values(by='Importance', ascending=False)
    ```

**Why It Is Important**:
- Provides a comprehensive set of tools for building and evaluating machine learning models.
- Integrates seamlessly with Pandas and NumPy, enabling efficient data preprocessing and model training.
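
The split, transform, train, and evaluate steps can be sketched on synthetic data as follows. This is an illustration under stated assumptions: the data comes from `make_classification` rather than the real dataset, SMOTE is left out to keep the example Scikit-learn only, and the `probs_RF` variable is an addition that is not in the original script.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

# Synthetic, imbalanced stand-in for the fraud data (not the real dataset).
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=123
)

# Fit the transformer on the training split only, then reuse it on the test split.
norm = PowerTransformer()
norm_train_f = norm.fit_transform(X_train)
norm_test_f = norm.transform(X_test)

RF = RandomForestClassifier(n_estimators=100, random_state=42)
RF.fit(norm_train_f, y_train)

preds_RF = RF.predict(norm_test_f)               # hard 0/1 labels
probs_RF = RF.predict_proba(norm_test_f)[:, 1]   # probability of class 1

# ROC AUC from hard labels uses one threshold; probabilities use the full ranking.
auc_labels = roc_auc_score(y_test, preds_RF)
auc_probs = roc_auc_score(y_test, probs_RF)
print(f"AUC from labels: {auc_labels:.3f}, AUC from probabilities: {auc_probs:.3f}")
```

Fitting `PowerTransformer` on the training split alone, as the script does, keeps information from the test set out of the preprocessing step.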

### 4. **Matplotlib**
**Purpose**: Matplotlib is used for creating static, interactive, and animated visualizations.

**Where It Is Used**:
- **Pie Charts**:
  - Visualizing class distribution before and after SMOTE:
    ```python
    plt.pie(fraud_nonfraud_before, labels=['Non-Fraud', 'Fraud'], autopct='%1.1f%%', startangle=90, colors=['skyblue', 'orange'])
    plt.pie(fraud_nonfraud_after, labels=['Non-Fraud', 'Fraud'], autopct='%1.1f%%', startangle=90, colors=['skyblue', 'orange'])
    ```
- **Bar Charts**:
  - Visualizing feature importance (the chart is drawn with Seaborn's `barplot`, which renders onto a Matplotlib figure):
    ```python
    sns.barplot(x='Importance', y='Feature', data=feature_importance.head(10))
    ```
- **Saving Figures**:
  - Saving visualizations to the `images` directory:
    ```python
    plt.savefig('images/fraud_distribution_before_smote.png')
    plt.savefig('images/fraud_distribution_after_smote.png')
    plt.savefig('images/feature_importance.png')
    ```
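
Because `plt.savefig` writes whatever figure is currently active, each chart is best created, saved, and closed before the next one starts. The sketch below shows that pattern for one pie chart. The label counts are invented, the `Agg` backend is chosen so the code runs without a display, and the output goes to a temporary directory rather than `images/`; all three are assumptions for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical labels: 8 non-fraud, 2 fraud.
y_train = np.array([0] * 8 + [1] * 2)
fraud_nonfraud_before = np.bincount(y_train)

# One figure per chart, so saved files never overlap.
fig, ax = plt.subplots(figsize=(5, 5))
ax.pie(
    fraud_nonfraud_before,
    labels=["Non-Fraud", "Fraud"],
    autopct="%1.1f%%",
    startangle=90,
    colors=["skyblue", "orange"],
)
ax.set_title("Class distribution before SMOTE")

out_path = os.path.join(tempfile.mkdtemp(), "fraud_distribution_before_smote.png")
fig.savefig(out_path)
plt.close(fig)  # release the figure before drawing the next chart
```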

### 5. **Seaborn**
**Purpose**: Seaborn is used for statistical data visualization.

**Where It Is Used**:
- **Bar Charts**:
  - Used for plotting feature importance:
    ```python
    sns.barplot(x='Importance', y='Feature', data=feature_importance.head(10))
    ```

### 6. **Pickle**
**Purpose**: Pickle is used for serializing and deserializing Python objects.

**Where It Is Used**:
- **Saving Models**:
  - Saving the trained Random Forest model:
    ```python
    with open('models/ethereum_fraud_model.pkl', 'wb') as file:
        pickle.dump(RF, file)
    ```
- **Saving Scalers**:
  - Saving the `PowerTransformer` scaler:
    ```python
    with open('models/ethereum_fraud_scaler.pkl', 'wb') as file:
        pickle.dump(norm, file)
    ```
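
A saved model is only useful together with the transformer that was fitted alongside it, so both need to be loaded at prediction time. The following sketch shows a full save and load round trip. The data is random, the file names `model.pkl` and `scaler.pkl` and the temporary directory are placeholders, and the loading half is an assumption about how the files would be consumed rather than something taken from `model.py`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import PowerTransformer

# Random stand-in data; the label depends on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)

norm = PowerTransformer().fit(X)
RF = RandomForestClassifier(n_estimators=10, random_state=42).fit(norm.transform(X), y)

# Placeholder paths in a temporary directory.
model_dir = tempfile.mkdtemp()
model_path = os.path.join(model_dir, "model.pkl")
scaler_path = os.path.join(model_dir, "scaler.pkl")

with open(model_path, "wb") as file:
    pickle.dump(RF, file)
with open(scaler_path, "wb") as file:
    pickle.dump(norm, file)

# Later, for example in a separate prediction script: load both objects back.
with open(model_path, "rb") as file:
    loaded_RF = pickle.load(file)
with open(scaler_path, "rb") as file:
    loaded_norm = pickle.load(file)

# New rows must go through the same transformer before prediction.
new_rows = rng.normal(size=(5, 3))
same = np.array_equal(
    RF.predict(norm.transform(new_rows)),
    loaded_RF.predict(loaded_norm.transform(new_rows)),
)
print(same)  # True
```

Unpickling can execute arbitrary code, so `pickle.load` should only be used on files from a trusted source.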

### 7. **Imbalanced-learn (SMOTE)**
**Purpose**: Imbalanced-learn is used to handle imbalanced datasets by oversampling the minority class.

**Where It Is Used**:
- **Balancing Classes**:
  - Applying SMOTE to the training data:
    ```python
    oversample = SMOTE()
    x_tr_resample, y_tr_resample = oversample.fit_resample(norm_train_f, y_train)
    ```

## Summary of Key Libraries
| **Library**      | **Purpose**                                                                                     | **Where It Is Used**                                                                                     |
|-------------------|-----------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------|
| **Pandas**        | Data manipulation and analysis.                                                               | Reading CSV files, cleaning data, feature engineering, creating DataFrames for results.                 |
| **NumPy**         | Numerical computing and array operations.                                                     | Handling numerical data, statistical operations, and integration with Scikit-learn.                     |
| **Scikit-learn**  | Machine learning tools for preprocessing, training, and evaluation.                           | Splitting data, normalizing features, training models, evaluating performance, and extracting feature importance. |
| **Matplotlib**    | Creating visualizations.                                                                      | Pie charts for class distribution, rendering the Seaborn feature-importance chart, and saving figures.   |
| **Seaborn**       | Statistical data visualization.                                                               | Plotting feature importance.                                                                             |
| **Pickle**        | Serializing and deserializing Python objects.                                                 | Saving the trained model and the fitted scaler to disk.                                                  |
| **Imbalanced-learn** | Handling imbalanced datasets.                                                              | Applying SMOTE to balance the training data.                                                             |