# Machine Learning Pipeline 

## Introduction
In this script, I implement a machine learning pipeline for a binary classification problem, covering data preprocessing, model training, and evaluation. The data has two classes, normal (76,593 rows) and anomalous (2,599 rows), so only about 3% of the rows are anomalous.

## Libraries and Data Loading
First, I import the necessary libraries:

- `numpy` for numerical operations.
- `pandas` for data manipulation.
- `sklearn` for machine learning models and functions.
- `matplotlib.pyplot` for visualization.

I set a random seed with `np.random.seed(42)` for reproducibility. Then I load two datasets, `normalData` and `anomalousData`, from `normal.csv` and `anomalous.csv`. Each has 10 columns: the features, followed by the target variable in the last column.

## Data Preprocessing
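The function `myTrainTestSplit` splits both datasets into training and testing sets. For each dataset it separates the features (`X`, every column but the last) from the labels (`y`, the last column) and applies `train_test_split` with a 70-30 split and `random_state=42`. It then concatenates the normal and anomalous pieces into combined training and testing sets. Splitting each class separately keeps the 70-30 ratio within both classes.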
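The split-then-merge step can be tried on a small synthetic stand-in for the two CSV files. This is only a sketch: the frames `normal_df` and `anomalous_df`, their column names, and the 0/1 label encoding are assumptions made for illustration, and the real files may encode the target differently.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical stand-ins for normal.csv and anomalous.csv:
# two feature columns followed by the label in the last column.
normal_df = pd.DataFrame({
    "f0": rng.normal(0, 1, 200),
    "f1": rng.normal(0, 1, 200),
    "label": 0,
})
anomalous_df = pd.DataFrame({
    "f0": rng.normal(3, 1, 20),
    "f1": rng.normal(3, 1, 20),
    "label": 1,
})

def split_one(df):
    """Split one class 70-30, taking the last column as the label."""
    return train_test_split(df.iloc[:, :-1], df.iloc[:, -1],
                            test_size=0.3, random_state=42)

Xn_tr, Xn_te, yn_tr, yn_te = split_one(normal_df)
Xa_tr, Xa_te, ya_tr, ya_te = split_one(anomalous_df)

# Merge the per-class pieces into combined training and testing sets.
X_train = pd.concat([Xn_tr, Xa_tr])
X_test = pd.concat([Xn_te, Xa_te])
y_train = pd.concat([yn_tr, ya_tr])
y_test = pd.concat([yn_te, ya_te])

# Both classes should show up on each side of the split.
print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```

Because each class is split on its own, the rare class cannot end up entirely in the training set or entirely in the test set.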

## Model Training Functions
I define functions for training different models:

1. **Decision Tree (`myDecisionTree`)**: Trains a decision tree classifier with default settings.
2. **Bagging (`myBagging`)**: Trains a bagging classifier that uses 10 decision trees as base estimators.
3. **Random Forest (`myRandomForest`)**: Trains a random forest classifier with 10 trees.

Each function fits the model to the training data and returns the trained model.
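As a rough illustration of the three model types side by side, the sketch below fits them on synthetic data from `make_classification`. The dataset, the `models` dictionary, and the extra `random_state` on the single tree are assumptions for this example, not part of the script.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (roughly 90% class 0); not the script's dataset.
X, y = make_classification(n_samples=500, n_features=9,
                           weights=[0.9], random_state=42)

models = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    # The base tree is passed positionally because the keyword name for it
    # differs between scikit-learn releases.
    "bagging": BaggingClassifier(DecisionTreeClassifier(),
                                 n_estimators=10, random_state=42),
    "random forest": RandomForestClassifier(n_estimators=10, random_state=42),
}

for name, model in models.items():
    model.fit(X, y)
    # Accuracy on the training data, shown only to confirm the fit ran.
    print(name, model.score(X, y))
```

Bagging trains each tree on a bootstrap sample of the rows, while a random forest additionally restricts each split to a random subset of the features, which tends to decorrelate the trees.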

## Feature Importance Visualization
`plot_feature_importance` draws a horizontal bar chart of a fitted model's `feature_importances_`, sorted so that the most important feature appears at the top. It is particularly useful for understanding random forests.
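The ordering used by the plot comes from `np.argsort`, which returns indices in ascending order of importance. A minimal sketch with made-up importances and feature names (both hypothetical) shows the resulting order without drawing anything:

```python
import numpy as np

# Hypothetical importances for four hypothetical features.
importances = np.array([0.10, 0.55, 0.05, 0.30])
feature_names = ["f0", "f1", "f2", "f3"]

# Ascending order: least important first, most important last.
indices = np.argsort(importances)
ordered_names = [feature_names[i] for i in indices]

print(ordered_names)  # ['f2', 'f0', 'f3', 'f1']
```

Since a horizontal bar chart places position 0 at the bottom, the last name in this list ends up as the top bar.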

## Model Evaluation
The function `myEvaluateSupervisedModelPerformance` trains the three supervised models (Decision Tree, Bagging, Random Forest) and evaluates each one on the test set using recall, precision, and F1 score. It also prints the training time and confusion matrix for each model.
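The three metrics can be read directly off a confusion matrix. The sketch below uses a made-up matrix in scikit-learn's layout (rows are true classes, columns are predicted classes) and assumes the anomalous class is labelled 1, which the source does not state.

```python
import numpy as np

# Hypothetical confusion matrix for labels [0, 1]:
# [[TN, FP],
#  [FN, TP]]
cm = np.array([[90, 10],
               [5, 15]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the predicted anomalies, how many were real
recall = tp / (tp + fn)      # of the real anomalies, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
```

With classes as imbalanced as these, accuracy can stay high even when many anomalies are missed, which is one reason to report recall, precision, and F1 instead.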

## PCA with Random Forest Pipeline
I also define `myPCARF`, which creates a pipeline combining PCA (Principal Component Analysis) for dimensionality reduction with a random forest classifier. PCA projects the features onto 2 principal components, and a random forest of 50 trees is trained on the projected data. This is an example of an unsupervised approach to feature extraction followed by supervised learning.
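To see what the PCA step keeps, the fitted pipeline can be inspected through `named_steps`. This is a sketch on synthetic data from `make_classification`, so the variance figures it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in with 9 features; not the script's data.
X, y = make_classification(n_samples=300, n_features=9, random_state=42)

pipeline = Pipeline([
    ("pca", PCA(n_components=2)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipeline.fit(X, y)

# Fraction of the total variance captured by each of the 2 components.
pca = pipeline.named_steps["pca"]
explained = pca.explained_variance_ratio_
print(explained, explained.sum())

# The forest only ever sees the 2 projected columns.
reduced = pca.transform(X)
print(reduced.shape)
```

If the two components explain only a small share of the variance, the forest is working with much less information than the models trained on all features.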

## Unsupervised Model Evaluation
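Finally, `myEvaluateUnsupervisedModelPerformance` evaluates the PCA-Random Forest pipeline using the same metrics as the supervised models. Despite the function's name, only the PCA step is unsupervised; the pipeline as a whole is still trained with labels.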

## Summary
This script demonstrates an end-to-end approach to a binary classification task. It combines supervised classifiers with an unsupervised dimensionality-reduction step and emphasizes model evaluation and interpretation.


In [1]:
import numpy as np
np.random.seed(42)

In [2]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier,BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score, f1_score, recall_score, precision_score, confusion_matrix
import time
import zipfile
import matplotlib.pyplot as plt


In [3]:

normalData = pd.read_csv('normal.csv')
print(f'Shape of normal dataset: {normalData.shape}')

Shape of normal dataset: (76593, 10)


In [4]:

anomalousData = pd.read_csv('anomalous.csv')
print(f'Shape of anomalous dataset: {anomalousData.shape}')

Shape of anomalous dataset: (2599, 10)


In [5]:


def myTrainTestSplit(normalData, anomalousData):
    # Features are every column except the last; the last column is the label.
    X_normal = normalData.iloc[:, :-1]
    y_normal = normalData.iloc[:, -1]

    X_anomalous = anomalousData.iloc[:, :-1]
    y_anomalous = anomalousData.iloc[:, -1]

    # Split each class 70-30 on its own so both classes appear in train and test.
    X_train_normal, X_test_normal, y_train_normal, y_test_normal = train_test_split(X_normal, y_normal, test_size=0.3, random_state=42)
    X_train_anomalous, X_test_anomalous, y_train_anomalous, y_test_anomalous = train_test_split(X_anomalous, y_anomalous, test_size=0.3, random_state=42)

    # Merge the per-class splits into combined training and testing sets.
    X_train = pd.concat([X_train_normal, X_train_anomalous])
    X_test = pd.concat([X_test_normal, X_test_anomalous])
    y_train = pd.concat([y_train_normal, y_train_anomalous])
    y_test = pd.concat([y_test_normal, y_test_anomalous])

    return X_train, X_test, y_train, y_test

In [6]:



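def myDecisionTree(X_train, y_train):
    # Decision tree with scikit-learn's default hyperparameters.
    clf = DecisionTreeClassifier()

    # Fit on the combined training set and return the fitted model.
    clf.fit(X_train, y_train)

    return clf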

In [7]:



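def myBagging(X_train, y_train):
    # Bagging ensemble of 10 decision trees, each fit on a bootstrap sample.
    base_clf = DecisionTreeClassifier()
    # Pass the base tree positionally: the `base_estimator` keyword was renamed
    # to `estimator` in scikit-learn 1.2 and removed in 1.4.
    bagging_clf = BaggingClassifier(base_clf, n_estimators=10, random_state=42)
    bagging_clf.fit(X_train, y_train)

    return bagging_clf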

In [8]:



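def myRandomForest(X_train, y_train):
    # Random forest of 10 trees with a fixed seed for reproducibility.
    clf = RandomForestClassifier(n_estimators=10, random_state=42)

    # Fit on the combined training set and return the fitted model.
    clf.fit(X_train, y_train)

    return clf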

In [9]:



def plot_feature_importance(model, feature_names):
    # Requires a fitted model that exposes `feature_importances_`.
    importances = model.feature_importances_

    # Sort ascending so the most important feature ends up as the top bar.
    indices = np.argsort(importances)

    # Plot the feature importances
    plt.figure(figsize=(10, 8))
    plt.title('Feature Importances')
    plt.barh(range(len(indices)), importances[indices], color='b', align='center')
    plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
    plt.xlabel('Relative Importance')
    plt.show()




In [10]:

import time

def myEvaluateSupervisedModelPerformance(X_train, y_train, X_test, y_test):
    # One slot per model: decision tree, bagging, random forest.
    recalls = np.zeros(3)
    precisions = np.zeros(3)
    f1_scores = np.zeros(3)

    # List of model functions
    model_funcs = [myDecisionTree, myBagging, myRandomForest]

    for i, model_func in enumerate(model_funcs):
        # Time only the training call.
        start = time.time()
        model = model_func(X_train, y_train)
        end = time.time()
        print(f'Training time for model {i+1}: {end - start} seconds')

        # Predict on the held-out test set.
        y_pred = model.predict(X_test)

        # These scorers default to pos_label=1, so they describe the class labelled 1.
        recalls[i] = recall_score(y_test, y_pred)
        precisions[i] = precision_score(y_test, y_pred)
        f1_scores[i] = f1_score(y_test, y_pred)

        # Rows are true classes, columns are predicted classes.
        print(f'Confusion matrix for model {i+1}:\n{confusion_matrix(y_test, y_pred)}\n')

    # Summary across the three models, in the order of model_funcs.
    print(f'Recalls: {recalls}')
    print(f'Precisions: {precisions}')
    print(f'F1 scores: {f1_scores}')

    return recalls, precisions, f1_scores

In [11]:

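def myPCARF(X_train, y_train):
    # Project the features onto 2 principal components, then classify them
    # with a random forest of 50 trees.
    pipeline = Pipeline([
        ('pca', PCA(n_components=2)),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=42))
    ])

    # Fitting the pipeline fits PCA first, then the forest on the projected data.
    pipeline.fit(X_train, y_train)

    return pipeline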

In [12]:



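def myEvaluateUnsupervisedModelPerformance(X_train, y_train, X_test, y_test):
    # Single-element arrays keep the return shape consistent with the
    # supervised evaluation function.
    recalls = np.zeros(1)
    precisions = np.zeros(1)
    f1_scores = np.zeros(1)

    # Time only the training of the PCA + random forest pipeline.
    start = time.time()
    model = myPCARF(X_train, y_train)
    end = time.time()
    print(f'Training time for model: {end - start} seconds')

    # Predict on the held-out test set.
    y_pred = model.predict(X_test)

    # These scorers default to pos_label=1, so they describe the class labelled 1.
    recalls[0] = recall_score(y_test, y_pred)
    precisions[0] = precision_score(y_test, y_pred)
    f1_scores[0] = f1_score(y_test, y_pred)

    # Rows are true classes, columns are predicted classes.
    print(f'Confusion matrix for model:\n{confusion_matrix(y_test, y_pred)}\n')

    print(f'Recalls: {recalls}')
    print(f'Precisions: {precisions}')
    print(f'F1 scores: {f1_scores}')

    return recalls, precisions, f1_scores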