<a href="https://colab.research.google.com/github/tarakantaacharya/Data_Analyst_Internship_Capx/blob/main/model_training_py.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# Model Training

## Instructions


### **Run the Code for Model Training**

1. **Install Dependencies**:
   - Before running the code, ensure that you have installed all the required libraries. You can do this by installing the necessary packages via the `requirements.txt` file:
   
   ```bash
   pip install -r requirements.txt
   ```

2. **Prepare the Data**:
   - Make sure you have completed the **data scraping** and **data preprocessing** steps to generate the `reddit_stock_data_posts_cleaned.csv` file. This file will be used as input for model training.
   
   If you haven’t already done this, refer to the earlier sections where you scraped and preprocessed the data.

3. **Run the Model Training Script**:
   - Once the dependencies are installed and the data is prepared, run the model training script:
   
   ```bash
   python model_training.py
   ```

4. **Monitor Model Training**:
   - As the script runs, it will train a machine learning model (e.g., Logistic Regression) using the preprocessed data.
   - You will see evaluation metrics such as **accuracy**, **precision**, **recall**, and **F1-score** printed to the terminal.

5. **Review Output**:
   - After the model is trained, review the printed output for evaluation metrics to determine how well the model is performing.
   - If the results are satisfactory, you can proceed to save the model or use it for predictions. If needed, you may want to fine-tune the model.

---

### **Demonstration Steps for Model Training**


#### **1. Set up Environment**
   - **Install Dependencies**: First, ensure all required libraries are installed by running the following command:

   ```bash
   pip install -r requirements.txt
   ```

#### **2. Data Preprocessing**
   - The model training script relies on preprocessed data. If you haven't already run the data scraping and preprocessing scripts, ensure that you do so to generate the cleaned data (`reddit_stock_data_posts_cleaned.csv`).

#### **3. Model Training Script Execution**
   - Once the data is prepared, navigate to the folder containing the model training script (e.g., `model_training.py`).
   - Open a terminal or command prompt in that folder and run the following command:

   ```bash
   python model_training.py
   ```

   - This will start the execution of the script, which will load the preprocessed data, split it into training and testing sets, and train the machine learning model.

#### **4. Monitoring Model Training**
   - As the script runs, you will see output in the terminal. This output will show information about:
     - The data being loaded and preprocessed.
     - The splitting of data into **features (X)** and **target variable (y)**.
     - The training process, where the model learns from the training data.
   
   - After the model is trained, the script will evaluate its performance using metrics such as **accuracy**, **precision**, **recall**, and **F1-score**. These metrics will give you insight into how well the model is performing.

#### **5. Review the Results**
   - Once the script has finished, it will output the evaluation results to the terminal, something like:

   ```
   Accuracy: 0.85
   Precision: 0.87
   Recall: 0.83
   F1-score: 0.85
   ```

   These metrics show how well the model predicts stock price direction from the sentiment of the Reddit posts and the related stock features.

   When you use the trained model for predictions, replace `new_data` with the data you want to predict on.

#### **8. Final Steps**
   - After the demonstration, ensure the results are saved (e.g., in the form of evaluation metrics and trained model) so that you can proceed with further analysis or use the trained model for future predictions.
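
As a minimal sketch of that saving step, assuming `joblib` (installed alongside scikit-learn) is used for persistence. The file name `trained_model.joblib`, the synthetic training data, and the `new_data` array are illustrative placeholders, not part of the original script.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the preprocessed features and labels (assumption).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# Persist the fitted model; the file name is a hypothetical choice.
model_path = os.path.join(tempfile.mkdtemp(), "trained_model.joblib")
joblib.dump(model, model_path)

# Later: reload the model and predict on unseen rows with the same columns.
reloaded = joblib.load(model_path)
new_data = rng.normal(size=(5, 3))
predictions = reloaded.predict(new_data)
print(predictions)
```

The reloaded model expects the same feature columns, in the same order, as the data it was trained on.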

---

These steps guide you through a demonstration of running the **Model Training** script. You'll first set up the environment, run the training, monitor the progress, and review the results to evaluate how well the model performs.

In [None]:
# Install the TensorFlow library, which is essential for building and training deep learning models.
!pip install tensorflow
# Install the Scikeras library, which provides an interface to use Keras models within scikit-learn pipelines for easier machine learning integration.
!pip install scikeras



###### Importing Libraries

In [None]:
# Importing the pandas library for data manipulation and analysis
import pandas as pd

# Importing various classification algorithms from scikit-learn
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier  # Ensemble learning models
from sklearn.linear_model import LogisticRegression  # Logistic regression model
from sklearn.svm import SVC  # Support Vector Classifier
from sklearn.neighbors import KNeighborsClassifier  # K-Nearest Neighbors Classifier

# Importing evaluation metrics to assess model performance
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Importing train_test_split for splitting the dataset into training and testing sets
from sklearn.model_selection import train_test_split

# Importing StandardScaler to standardize features by removing the mean and scaling to unit variance
from sklearn.preprocessing import StandardScaler

# Importing TensorFlow's Keras library for building deep learning models
from tensorflow.keras.models import Sequential  # Sequential API for building models layer by layer

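# Importing layers from the same tensorflow.keras package as Sequential and Adam
# (mixing `keras` and `tensorflow.keras` imports can load two different Keras implementations)
from tensorflow.keras.layers import Dense, Dropout, Input  # Dense: fully connected layers, Dropout: reduces overfitting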

# Importing Adam optimizer for training deep learning models
from tensorflow.keras.optimizers import Adam

# Importing KerasClassifier wrapper to use Keras models in scikit-learn pipelines
from scikeras.wrappers import KerasClassifier

# Importing StratifiedKFold and cross_val_score for cross-validation
from sklearn.model_selection import StratifiedKFold, cross_val_score  # StratifiedKFold ensures balanced class distribution in splits

# Importing visualization libraries
from matplotlib import pyplot as plt  # For creating visualizations
import seaborn as sns  # Advanced visualization library with aesthetic options

# Importing make_pipeline to create machine learning pipelines
from sklearn.pipeline import make_pipeline

# Creating an instance of StandardScaler for feature scaling
scaler = StandardScaler()  # Standardizes features to have mean 0 and standard deviation 1

In [None]:
# Features and target variable
X_features = ['score', 'num_comments', 'upvote_ratio', 'title_sentiment_score',
       'content_sentiment_score', 'Close_AAPL', 'Price_Change', 'Prev_Price_Change']

X = model_df[X_features]  # Adjust features based on your data
y = model_df['stock_direction']

# Train-test split (60-40 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Print the shapes of the resulting datasets
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")

X_train shape: (4502, 8)
X_test shape: (3002, 8)
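
The shapes above follow from the split arithmetic: the dataset has 4502 + 3002 = 7504 rows, and with `test_size=0.4` scikit-learn rounds the test share up to ceil(0.4 × 7504) = 3002, leaving 4502 rows for training. The sketch below reproduces this on placeholder arrays and also shows the optional `stratify` argument, which keeps the class ratio the same in both parts. The placeholder features and labels are assumptions for illustration only.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 4502 + 3002                 # total rows implied by the shapes printed above
X_demo = np.zeros((n_rows, 8))       # placeholder feature matrix with 8 columns
y_demo = np.arange(n_rows) % 2       # placeholder binary labels, half 0 and half 1

# stratify=y_demo preserves the class ratio in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo
)

expected_test = math.ceil(0.4 * n_rows)
print(X_tr.shape, X_te.shape, expected_test)
```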


In [None]:
results_df_1 = pd.DataFrame()  # Empty DataFrame that will hold the metric results

##### Explanation of the models:
1. Random Forest: A robust ensemble method that combines multiple decision trees to improve classification accuracy and reduce overfitting. class_weight='balanced' automatically adjusts class weights inversely proportional to their frequencies in the data.

2. Gradient Boosting: Builds models sequentially, optimizing for residual errors. It's often used for competitive performance in structured data.

3. AdaBoost: Combines weak classifiers iteratively to focus on misclassified instances. The SAMME algorithm supports multi-class outputs.

4. Logistic Regression: A linear model used for binary/multi-class classification. Here, it's combined with StandardScaler for preprocessing, and saga is chosen for its efficiency on large datasets.

5. Support Vector Machine (SVC): Useful for high-dimensional spaces and non-linear decision boundaries. The class_weight='balanced' adjusts for class imbalance.

6. K-Nearest Neighbors: Simple and intuitive, relying on the proximity of data points. The choice of n_neighbors=5 is a common default, but it can be tuned.

---

##### Additional Notes:
1. Class weights: Models like Random Forest, SVC, and Logistic Regression are set with class_weight='balanced' to handle datasets with imbalanced target distributions effectively.
2. Random State: Ensures reproducibility for models that involve randomness.
3. Pipelines: Used for Logistic Regression to combine preprocessing (scaling) and modeling into a single step.
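
To make the class-weight note concrete: `class_weight='balanced'` assigns each class the weight `n_samples / (n_classes * count_of_that_class)`. The sketch below checks this on toy labels (75 samples of class 0 and 25 of class 1, an illustrative assumption), giving 100 / (2 × 75) ≈ 0.667 and 100 / (2 × 25) = 2.0.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 75 samples of class 0 and 25 samples of class 1 (illustrative).
y_demo = np.array([0] * 75 + [1] * 25)
classes = np.unique(y_demo)

# Formula behind the "balanced" mode, written out by hand.
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

# scikit-learn's helper computes the same values.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
print(dict(zip(classes.tolist(), weights.round(4).tolist())))
```

When the classes are close to balanced, as the test-set supports of 1485 and 1517 in this notebook suggest, both weights are near 1 and the setting has little effect.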

In [None]:
# Initializing a dictionary of machine learning models with specific hyperparameters
models = {
    # Random Forest: Ensemble model using multiple decision trees, with 100 trees and balanced class weights
    "Random Forest": RandomForestClassifier(
        n_estimators=100,  # Number of decision trees
        class_weight='balanced',  # Adjust weights for imbalanced classes
        random_state=42  # Ensures reproducibility
    ),

    # Gradient Boosting: Ensemble model where trees are built sequentially to minimize errors
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=100,  # Number of boosting stages
        random_state=42  # Ensures reproducibility
    ),

    # AdaBoost: Boosting algorithm with SAMME (for multi-class classification)
    "AdaBoost": AdaBoostClassifier(
        algorithm='SAMME'  # Algorithm type, SAMME is suitable for multi-class problems
    ),

    # Logistic Regression: Linear model wrapped in a pipeline with scaling and custom parameters
    'Logistic Regression': make_pipeline(
        StandardScaler(),  # Standardizes features
        LogisticRegression(
            max_iter=3000,  # Maximum number of iterations for optimization
            solver='saga',  # Solver suitable for large datasets and supports L1/L2 regularization
            class_weight='balanced',  # Adjust weights for imbalanced classes
            random_state=42  # Ensures reproducibility
        )
    ),

    # Support Vector Machine: Non-linear classifier with kernel tricks
    "Support Vector Machine": SVC(
        class_weight='balanced',  # Adjust weights for imbalanced classes
        random_state=42  # Ensures reproducibility
    ),

    # K-Nearest Neighbors: Distance-based algorithm, finding the 5 nearest neighbors
    "K-Nearest Neighbors": KNeighborsClassifier(
        n_neighbors=5  # Number of neighbors to consider
    )
}

##### Explanation of the DNN:
1. Input Layer: The Input layer specifies the shape of input data, which corresponds to the number of features in the dataset (X.shape[1]).

2. Hidden Layers:

    2.1 First hidden layer: 64 neurons, ReLU activation for non-linearity, followed by a Dropout layer to mitigate overfitting.

    2.2 Second hidden layer: 32 neurons, ReLU activation, with another Dropout layer.

3. Output Layer:
A single neuron with a sigmoid activation function to output probabilities, suitable for binary classification.

4. Model Compilation:

    4.1 Optimizer: Adam is chosen for its adaptive learning rate and efficiency.

    4.2 Loss function: Binary cross-entropy is appropriate for binary classification tasks.

    4.3 Metrics: Accuracy is used to evaluate the model during training.

##### Integration with scikit-learn:
The KerasClassifier wrapper enables the DNN to integrate seamlessly into scikit-learn pipelines, making it compatible with functions like cross-validation and hyperparameter tuning.

##### Adding to the models dictionary:
The DNN model is added under the key "Deep Neural Network" to be evaluated alongside other machine learning models.
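
To illustrate the architecture described above without requiring TensorFlow, here is a rough NumPy sketch of the same layer layout (8 inputs, 64 and 32 hidden units, 1 output). The weights are random placeholders and the `forward` function is a hypothetical stand-in for inference only; it is not the Keras model itself. It also shows the parameter count: (8 × 64 + 64) + (64 × 32 + 32) + (32 × 1 + 1) = 2689.

```python
import numpy as np

n_features = 8  # matches the eight columns listed in X_features
layer_sizes = [n_features, 64, 32, 1]

rng = np.random.default_rng(0)
# One (weights, bias) pair per Dense layer, randomly initialised for illustration.
params = [
    (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]


def forward(x):
    """Inference-time forward pass; Dropout is inactive at inference."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)        # ReLU on the hidden layers
        else:
            x = 1.0 / (1.0 + np.exp(-x))  # sigmoid on the output layer
    return x


n_params = sum(w.size + b.size for w, b in params)
probs = forward(rng.normal(size=(4, n_features)))
print(n_params, probs.shape)
```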

In [None]:
# Define a function to build the Deep Neural Network (DNN) model
def build_dnn():
    # Define the model structure using Keras Sequential API
    model = Sequential([
        # Input layer: Automatically adjusts to the number of features in the dataset
        Input(shape=(X.shape[1],)),  # Input layer with shape matching the number of features in X

        # First hidden layer: 64 neurons with ReLU activation for non-linearity
        Dense(64, activation='relu'),
        Dropout(0.2),  # Dropout layer to reduce overfitting by randomly dropping 20% of neurons

        # Second hidden layer: 32 neurons with ReLU activation
        Dense(32, activation='relu'),
        Dropout(0.2),  # Dropout for further regularization

        # Output layer: 1 neuron with sigmoid activation for binary classification
        Dense(1, activation='sigmoid')  # Outputs probability of the positive class
    ])

    # Compile the model with the Adam optimizer and binary cross-entropy loss
    model.compile(
        optimizer=Adam(learning_rate=0.001),  # Optimizer with a learning rate of 0.001
        loss='binary_crossentropy',  # Loss function for binary classification
        metrics=['accuracy']  # Evaluation metric to track during training
    )
    return model  # Return the constructed model

# Wrap the DNN model with KerasClassifier for compatibility with scikit-learn workflows
dnn_model = KerasClassifier(
    model=build_dnn,  # The function that defines the DNN architecture
    epochs=25,  # Number of training epochs
    batch_size=32,  # Mini-batch size for gradient updates
    verbose=0  # Suppress training output
)

# Add the DNN model to the dictionary of models for evaluation
models['Deep Neural Network'] = dnn_model

With all the models defined, the next step trains each of them on the refined `model_df` dataset and evaluates the results.

#### Training and Performance Metrics

In [None]:
from sklearn.metrics import (
    accuracy_score,  # Calculates the ratio of correctly predicted instances to total instances
    precision_score,  # Measures the proportion of true positive predictions out of all positive predictions
    recall_score,  # Measures the proportion of true positives identified out of all actual positives
    f1_score,  # Harmonic mean of precision and recall, balancing the two metrics
    confusion_matrix,  # Summarizes prediction results as a matrix of True Positives, False Positives, etc.
    classification_report,  # Generates a detailed report including precision, recall, f1-score, and support
    roc_auc_score,  # Computes the Area Under the Receiver Operating Characteristic Curve (ROC AUC)
    roc_curve,  # Calculates the Receiver Operating Characteristic curve data (TPR vs. FPR)
    matthews_corrcoef  # Measures the quality of binary classifications with a balanced metric
)

#### Explanation of Metrics:

1. Accuracy:

    Represents the overall correctness of the model's predictions.
    Best suited when the dataset is balanced.

2. Precision:

    The share of predicted positives that are actually positive; high precision means few false positives among the positive predictions.
    Useful when false positives are more costly than false negatives.

3. Recall:

    Also known as sensitivity or true positive rate.
    Important in scenarios where missing a positive case is costly (e.g., medical diagnoses).

4. F1-Score:

    Combines precision and recall into a single metric, particularly useful for imbalanced datasets.
    A high F1-score indicates a good balance between precision and recall.

5. Confusion Matrix:

    A matrix summarizing true positives, true negatives, false positives, and false negatives.
    Provides a comprehensive view of prediction errors.

6. Classification Report:

    Includes precision, recall, F1-score, and support (number of true instances for each class).
    Useful for understanding model performance across all classes.

7. ROC AUC:

    Measures the ability of the classifier to distinguish between classes.
    A value of 0.5 corresponds to random ranking; a value closer to 1 indicates better performance.

8. ROC Curve:

    Plots the true positive rate (TPR) against the false positive rate (FPR) at various thresholds.
    Visual representation of classifier performance.

9. Matthews Correlation Coefficient (MCC):

    A balanced metric even for imbalanced datasets.
    Values range from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction).

##### When to Use:
1. Balanced datasets: Accuracy and F1-score.
2. Imbalanced datasets: Precision, recall, ROC AUC, and MCC.
3. Detailed analysis: Classification report and confusion matrix.
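
As a worked example of these definitions, the sketch below recomputes the metrics by hand from the Logistic Regression confusion matrix reported in this notebook, `[[1472, 13], [20, 1497]]`. The variable names are illustrative; the results can be compared against the Logistic Regression row of the results summary.

```python
import math

# Logistic Regression confusion matrix as reported in this notebook:
# rows are actual classes, columns are predicted classes.
tn, fp = 1472, 13
fn, tp = 20, 1497

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)            # correct share of predicted positives
recall = tp / (tp + fn)               # found share of actual positives
specificity = tn / (tn + fp)          # found share of actual negatives
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for label, value in [
    ("Accuracy", accuracy),
    ("Precision", precision),
    ("Recall", recall),
    ("Specificity", specificity),
    ("F1-Score", f1),
    ("MCC", mcc),
]:
    print(f"{label}: {value:.6f}")
```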

---

1. Feature Scaling:

    -> Only applied to models like Logistic Regression, SVM, and KNN because they are sensitive to the scale of input features.

    -> Ensemble models like Random Forest and Gradient Boosting do not require feature scaling.

2. Metrics:

    A wide range of metrics is calculated to provide a comprehensive evaluation of each model’s performance.

    ROC AUC is computed only for models that expose predict_proba; for models that lack it, the value is stored as None.

3. Confusion Matrix:

    Provides a granular view of model predictions in terms of True Positives, False Positives, True Negatives, and False Negatives.

4. Classification Report:

    Includes precision, recall, F1-score, and support for each class.

5. Results Dictionary:

    Each model's metrics are stored in a nested dictionary for easy conversion into a DataFrame for better readability.

6. DataFrame Summary:

    The results dictionary is converted into a DataFrame to provide a tabular summary of all models’ performances.
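
For models without `predict_proba`, such as `SVC` with its default `probability=False`, ROC AUC can still be computed from `decision_function` scores, because `roc_auc_score` only needs a ranking of the samples. The sketch below shows one way to do this on synthetic data; `positive_class_scores` is a hypothetical helper, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def positive_class_scores(model, X):
    """Hypothetical helper: return scores usable by roc_auc_score, or None."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    if hasattr(model, "decision_function"):
        return model.decision_function(X)
    return None


# Synthetic stand-in for the real features and labels (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42
)

svc = SVC(class_weight="balanced", random_state=42).fit(X_tr, y_tr)
scores = positive_class_scores(svc, X_te)
svc_auc = roc_auc_score(y_te, scores)
print(f"SVC ROC AUC from decision_function: {svc_auc:.4f}")
```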

In [None]:
# Dictionary to store results of all models
results = {}

# Train each model and evaluate performance
for name, model in models.items():
    # Apply feature scaling for specific models that are sensitive to scale
    if name == "Logistic Regression" or name == "Support Vector Machine" or name == "K-Nearest Neighbors":
        X_train_ = scaler.fit_transform(X_train)  # Fit and transform the training data
        X_test_ = scaler.transform(X_test)  # Transform the test data
    else:
        X_train_ = X_train  # Use raw data for other models
        X_test_ = X_test

    # Train the model on the training data
    model.fit(X_train_, y_train)

    # Predict labels on the test data
    y_pred = model.predict(X_test_)

    # Predict probabilities if the model supports it
    y_pred_proba = model.predict_proba(X_test_)[:, 1] if hasattr(model, "predict_proba") else None

    # Calculate the confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    tn, fp, fn, tp = cm.ravel()  # Extract true negatives, false positives, false negatives, and true positives

    # Calculate various performance metrics
    accuracy = accuracy_score(y_test, y_pred)  # Overall accuracy
    precision = precision_score(y_test, y_pred, zero_division=1)  # Precision (with zero-division handling)
    recall = recall_score(y_test, y_pred)  # Sensitivity/Recall
    f1 = f1_score(y_test, y_pred)  # F1-Score (harmonic mean of precision and recall)
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0  # Specificity: True Negative Rate
    roc_auc = roc_auc_score(y_test, y_pred_proba) if y_pred_proba is not None else None  # ROC AUC Score
    mcc = matthews_corrcoef(y_test, y_pred)  # Matthews Correlation Coefficient

    # Store all metrics in the results dictionary
    results[name] = {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall (Sensitivity)': recall,
        'Specificity': specificity,
        'F1-Score': f1,
        'ROC AUC': roc_auc,
        'MCC': mcc,
        'Confusion Matrix': cm
    }

    # Print detailed metrics and reports for each model
    print(f"\n{name} Results:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall (Sensitivity): {recall:.4f}")
    print(f"Specificity: {specificity:.4f}")
    print(f"F1-Score: {f1:.4f}")
    if roc_auc is not None:
        print(f"ROC AUC: {roc_auc:.4f}")
    print(f"MCC: {mcc:.4f}")
    print("Confusion Matrix:")
    print(cm)
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, zero_division=1))

    print("-" * 80)

# Convert the results dictionary into a DataFrame once all models are evaluated
results_df_1 = pd.DataFrame(results).T

# Display the consolidated DataFrame of results
print("\nSummary of Results 1:")
results_df_1


Random Forest Results:
Accuracy: 1.0000
Precision: 1.0000
Recall (Sensitivity): 1.0000
Specificity: 1.0000
F1-Score: 1.0000
ROC AUC: 1.0000
MCC: 1.0000
Confusion Matrix:
[[1485    0]
 [   0 1517]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1485
           1       1.00      1.00      1.00      1517

    accuracy                           1.00      3002
   macro avg       1.00      1.00      1.00      3002
weighted avg       1.00      1.00      1.00      3002

--------------------------------------------------------------------------------

Gradient Boosting Results:
Accuracy: 1.0000
Precision: 1.0000
Recall (Sensitivity): 1.0000
Specificity: 1.0000
F1-Score: 1.0000
ROC AUC: 1.0000
MCC: 1.0000
Confusion Matrix:
[[1485    0]
 [   0 1517]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1485
           1       1.00      1.00  

| Model | Accuracy | Precision | Recall (Sensitivity) | Specificity | F1-Score | ROC AUC | MCC | Confusion Matrix |
|---|---|---|---|---|---|---|---|---|
| Random Forest | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | [[1485, 0], [0, 1517]] |
| Gradient Boosting | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | [[1485, 0], [0, 1517]] |
| AdaBoost | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | [[1485, 0], [0, 1517]] |
| Logistic Regression | 0.989007 | 0.991391 | 0.986816 | 0.991246 | 0.989098 | 0.999629 | 0.978024 | [[1472, 13], [20, 1497]] |
| Support Vector Machine | 0.920053 | 0.92652 | 0.914305 | 0.925926 | 0.920372 | N/A | 0.840186 | [[1375, 110], [130, 1387]] |
| K-Nearest Neighbors | 0.796802 | 0.800132 | 0.796968 | 0.796633 | 0.798547 | 0.878995 | 0.59358 | [[1183, 302], [308, 1209]] |
| Deep Neural Network | 0.49467 | 1.0 | 0.0 | 1.0 | 0.0 | 0.5 | 0.0 | [[1485, 0], [1517, 0]] |


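##### Observations:

1. Random Forest, Gradient Boosting, and AdaBoost score a perfect 1.0 on every metric. Perfect scores on held-out data are unusual and worth checking for data leakage.
2. Logistic Regression comes close (accuracy 0.989), misclassifying 33 of the 3002 test samples (13 false positives and 20 false negatives).
3. Support Vector Machine reaches about 92% accuracy, with 240 misclassified samples (110 false positives and 130 false negatives).
4. K-Nearest Neighbors reaches about 80% accuracy, with 610 misclassified samples, more than the SVM.
5. The Deep Neural Network has the worst metrics of all models: it predicts class 0 for every test sample (recall 0.0, accuracy 0.495). Its reported precision of 1.0 is an artifact of `zero_division=1`, since no positive predictions were made.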
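
One way to probe whether perfect scores like those of the ensemble models come from leakage is to test each feature on its own: if a one-split decision tree on a single column already predicts the target almost perfectly, that column probably encodes the target. The sketch below is a hedged illustration on synthetic data. `single_feature_scores` is a hypothetical helper, and the assumption that `stock_direction` is the sign of `Price_Change` is made only for the demo; it is not confirmed by this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def single_feature_scores(df, feature_cols, target_col, cv=5):
    """Hypothetical helper: cross-validated accuracy of a one-split tree per feature."""
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    scores = {
        col: cross_val_score(stump, df[[col]], df[target_col], cv=cv).mean()
        for col in feature_cols
    }
    return pd.Series(scores).sort_values(ascending=False)


# Synthetic stand-in for model_df (assumption): the target is the sign of Price_Change.
rng = np.random.default_rng(42)
demo_df = pd.DataFrame({
    "Price_Change": rng.normal(size=500),
    "title_sentiment_score": rng.uniform(-1, 1, size=500),
})
demo_df["stock_direction"] = (demo_df["Price_Change"] > 0).astype(int)

leak_report = single_feature_scores(
    demo_df, ["Price_Change", "title_sentiment_score"], "stock_direction"
)
print(leak_report)
```

On the real data, the same call with `X_features` and `'stock_direction'` would show whether any single column reaches near-perfect accuracy by itself.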

---

### **Explanation Steps for Model Training**


#### **1. Import Required Libraries**
   - **Purpose**: Import necessary libraries for machine learning and data processing.
   - **Key Libraries**:
     - `pandas`: For loading and manipulating the data.
     - `scikit-learn`: For training the machine learning model (e.g., Logistic Regression).
     - `train_test_split`: For splitting the data into training and testing sets.
     - `accuracy_score`, `precision_score`, `recall_score`, `f1_score`: For evaluating model performance.

   ```python
   import pandas as pd
   from sklearn.model_selection import train_test_split
   from sklearn.linear_model import LogisticRegression
   from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
   ```

#### **2. Load the Preprocessed Data**
   - **Purpose**: Load the preprocessed data (cleaned data) that was generated earlier from the **Data Preprocessing** step. This data contains both the Reddit posts and stock-related features.
   - **Explanation**: This data will be used for training the model to predict stock market movements based on Reddit post sentiments.

   ```python
   df = pd.read_csv('reddit_stock_data_posts_cleaned.csv')
   ```

#### **3. Define Features and Target Variable**
   - **Purpose**: Identify the independent (feature) variables and the dependent (target) variable.
   - **Explanation**:
     - The features (`X`) consist of various text-based features (like word count, sentiment scores) and stock-related data (like previous price changes, moving averages).
     - The target variable (`y`) could be the **Price_Change** (whether the stock price changed positively or negatively).

   ```python
   X = df[['title_word_count', 'content_word_count', 'title_sentiment_score', 'content_sentiment_score', 'Prev_Price_Change']]
   y = (df['Price_Change'] > 0).astype(int)  # 1 if the price rose, 0 otherwise (assumes Price_Change is numeric)
   ```

#### **4. Split Data into Training and Testing Sets**
   - **Purpose**: Divide the dataset into training and testing sets to evaluate the model's performance on unseen data.
   - **Explanation**: The `train_test_split` function splits the data, typically using 80% for training and 20% for testing.

   ```python
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
   ```

#### **5. Initialize the Model**
   - **Purpose**: Create the machine learning model that will be trained using the training data.
   - **Explanation**: The model in this case is a **Logistic Regression** model, which is commonly used for binary classification tasks. It predicts whether a stock's price will go up or down based on the features.

   ```python
   model = LogisticRegression()
   ```

#### **6. Train the Model**
   - **Purpose**: Train the model using the training data (`X_train` and `y_train`).
   - **Explanation**: The model learns patterns from the training data and adjusts its internal parameters to minimize errors in predictions. This is the core of the machine learning process.

   ```python
   model.fit(X_train, y_train)
   ```

#### **7. Make Predictions**
   - **Purpose**: Use the trained model to predict the target variable (`Price_Change`) for the testing data (`X_test`).
   - **Explanation**: The model will output predictions based on the features from the test set, which were not seen during training.

   ```python
   y_pred = model.predict(X_test)
   ```

#### **8. Evaluate the Model**
   - **Purpose**: Assess the performance of the trained model by comparing its predictions with the actual target values from the testing set (`y_test`).
   - **Explanation**: Evaluation metrics are calculated to gauge how well the model performed. These include:
     - **Accuracy**: The proportion of correct predictions.
     - **Precision**: The proportion of positive predictions that are actually positive.
     - **Recall**: The proportion of actual positives that were correctly identified.
     - **F1-score**: The harmonic mean of precision and recall.

   ```python
   accuracy = accuracy_score(y_test, y_pred)
   precision = precision_score(y_test, y_pred)
   recall = recall_score(y_test, y_pred)
   f1 = f1_score(y_test, y_pred)

   print(f"Accuracy: {accuracy}")
   print(f"Precision: {precision}")
   print(f"Recall: {recall}")
   print(f"F1-score: {f1}")
   ```


---

These explanation steps describe how the **Model Training** script works, from importing libraries to training and evaluating the model. They show the script's flow and how each section contributes to the overall goal.

---

### Checking Whether the Models Are Overfitting

1. Imports:

    -> StratifiedKFold for splitting the dataset into stratified folds.

    -> cross_val_score for performing cross-validation.

    -> Metrics from sklearn.metrics for scoring functions.

2. Cross-Validation Setup:

    -> StratifiedKFold ensures the proportion of each class is consistent across all folds.

    -> n_splits=5 divides the dataset into 5 folds.

    -> shuffle=True ensures random shuffling of data before splitting.
    
    -> random_state=42 makes the process reproducible.

3. Scoring Functions:

    -> The scoring functions (accuracy, precision, recall, and F1-score) are pre-defined using make_scorer for compatibility with cross_val_score.

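    -> The zero_division=1 argument sets a metric to 1 when its denominator is zero (for example, precision when no positive predictions are made) instead of emitting a warning. This can inflate the scores of a model that never predicts the positive class.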

4. Cross-Validation for Each Model:

    -> Loop over models in the models dictionary.

    -> For each model, compute cross-validation scores for all metrics defined in scoring_functions.

    -> Calculate the mean and standard deviation of the scores across the 5 folds.

5. Results Storage:

    Metrics are stored in a nested dictionary cv_results, where each model's results include the mean and standard deviation for all metrics.

6. Summary Table:

    The cv_results dictionary is converted into a Pandas DataFrame (cv_results_df) for easier viewing of results.
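
The steps above can also be sketched with `cross_validate`, which evaluates several scorers in a single pass so each model is fitted once per fold rather than once per fold and metric. This is an alternative to the notebook's approach, shown on synthetic data with a plain `LogisticRegression`; the data and the `summary` table are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, f1_score, make_scorer, precision_score, recall_score
)
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for X and y (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = {
    "Accuracy": make_scorer(accuracy_score),
    "Precision": make_scorer(precision_score, zero_division=1),
    "Recall": make_scorer(recall_score, zero_division=1),
    "F1-Score": make_scorer(f1_score, zero_division=1),
}

# One call returns per-fold scores under the keys "test_<metric name>".
cv_out = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=kf, scoring=scoring
)

summary = pd.DataFrame({
    metric: {
        "Mean": cv_out[f"test_{metric}"].mean(),
        "Std": cv_out[f"test_{metric}"].std(),
    }
    for metric in scoring
}).T
print(summary)
```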


In [None]:
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
import numpy as np
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score

# Define Stratified K-Fold Cross-Validation (5 folds)
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Dictionary to store the mean and standard deviation of each model's metrics
cv_results = {}

# Define scoring functions using make_scorer outside the loop
scoring_functions = {
    'Accuracy': make_scorer(accuracy_score),
    'Precision': make_scorer(precision_score, zero_division=1),
    'Recall': make_scorer(recall_score, zero_division=1),
    'F1-Score': make_scorer(f1_score, zero_division=1)
}


# Evaluate each model using cross-validation
for name, model in models.items():
    print(f"\n{name} Cross-Validation Results:")
    cv_results[name] = {}

    # Calculate and display metrics
    for metric, scorer in scoring_functions.items():  # Use pre-defined scorers

        # Perform cross-validation and calculate scores
        scores = cross_val_score(model, X, y, cv=kf, scoring=scorer)
        mean_score = np.mean(scores)
        std_score = np.std(scores)

        # Store results in the dictionary
        cv_results[name][f"{metric} Mean"] = mean_score
        cv_results[name][f"{metric} Std"] = std_score

        # Print results for each metric
        print(f"{metric}: Mean = {mean_score:.4f}, Std = {std_score:.4f}")
        print("-" * 80)

# Convert the results dictionary into a DataFrame
cv_results_df = pd.DataFrame(cv_results).T

# Display the DataFrame
print("\nCross-Validation Summary:")
cv_results_df


Random Forest Cross-Validation Results:
Accuracy: Mean = 1.0000, Std = 0.0000
--------------------------------------------------------------------------------
Precision: Mean = 1.0000, Std = 0.0000
--------------------------------------------------------------------------------
Recall: Mean = 1.0000, Std = 0.0000
--------------------------------------------------------------------------------
F1-Score: Mean = 1.0000, Std = 0.0000
--------------------------------------------------------------------------------

Gradient Boosting Cross-Validation Results:
Accuracy: Mean = 1.0000, Std = 0.0000
--------------------------------------------------------------------------------
Precision: Mean = 1.0000, Std = 0.0000
--------------------------------------------------------------------------------
Recall: Mean = 1.0000, Std = 0.0000
--------------------------------------------------------------------------------
F1-Score: Mean = 1.0000, Std = 0.0000
---------------------------------------------

| Model | Accuracy Mean | Accuracy Std | Precision Mean | Precision Std | Recall Mean | Recall Std | F1-Score Mean | F1-Score Std |
|---|---|---|---|---|---|---|---|---|
| Random Forest | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| Gradient Boosting | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| AdaBoost | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| Logistic Regression | 0.988007 | 0.002793 | 0.98698 | 0.00404 | 0.989072 | 0.002295 | 0.988022 | 0.002784 |
| Support Vector Machine | 0.559165 | 0.015078 | 0.577496 | 0.02194 | 0.444559 | 0.011803 | 0.502158 | 0.012214 |
| K-Nearest Neighbors | 0.655919 | 0.009736 | 0.64548 | 0.009154 | 0.691902 | 0.014445 | 0.667832 | 0.01019 |
| Deep Neural Network | 0.5 | 0.000298 | 0.701337 | 0.243873 | 0.074933 | 0.149867 | 0.133333 | 0.266667 |


#### Observations from the Cross-Validation Summary:

1. **Ensemble Models Perform Perfectly**:
   - **Random Forest**, **Gradient Boosting**, and **AdaBoost** achieve a perfect score (1.000) across all metrics.
   - Observations:
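     - Perfect scores with zero variance across folds point to data leakage more often than to excellent model fitting. Note that `Price_Change` is among the input features; if `stock_direction` is derived from it, the target is effectively visible to the models.
     - Double-check preprocessing, feature selection, cross-validation setup, and data splitting to ensure the models aren't being exposed to target information or test data during training.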

2. **Logistic Regression Performs Very Well**:
   - **Accuracy Mean**: 0.988, with a low standard deviation (0.0028), indicating consistent performance across folds.
   - **Precision Mean**: 0.987, **Recall Mean**: 0.989, and **F1-Score Mean**: 0.988.
   - Observations:
     - Logistic Regression demonstrates reliable and balanced performance.
     - It's slightly behind the ensemble models but still very effective, likely due to feature scaling and regularization.

3. **Support Vector Machine (SVM) Struggles**:
   - **Accuracy Mean**: 0.559, **Precision Mean**: 0.577, **Recall Mean**: 0.445, and **F1-Score Mean**: 0.502.
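   - The standard deviations are small (e.g., **Precision Std**: 0.0219), so the weak performance is consistent across folds rather than the result of an unlucky split.
   - Observations:
     - These scores are far below the 0.92 accuracy from the train-test evaluation. The cross-validation loop passes the raw `X` to `SVC`, whereas the earlier evaluation scaled the features first, so the missing scaling is the most likely cause of the drop.
     - Wrapping `SVC` in a pipeline with `StandardScaler`, and then tuning hyperparameters (e.g., kernel type, C, and gamma values), should be tried before concluding that SVM is unsuitable for this dataset.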

4. **K-Nearest Neighbors (KNN) Shows Moderate Performance**:
   - **Accuracy Mean**: 0.656, **F1-Score Mean**: 0.668.
   - Consistent performance with low standard deviations across metrics (e.g., **Accuracy Std**: 0.0097).
   - Observations:
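     - KNN is also evaluated on unscaled features here, which likely explains the drop from 0.80 accuracy in the train-test evaluation. Beyond scaling, it could benefit from tuning the number of neighbors (`n_neighbors`) and the distance metric.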
     - It performs slightly better than SVM but is not competitive with ensemble or logistic models.

5. **Deep Neural Network (DNN) Performs Poorly**:
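   - **Accuracy Mean**: 0.500 (no better than always predicting a single class).
   - **Precision Mean**: 0.701, but with an extremely high standard deviation (0.2439), indicating unreliable predictions. Folds in which the model makes no positive predictions are scored as 1.0 because of `zero_division=1`, which inflates this mean.
   - **Recall Mean**: 0.075, **F1-Score Mean**: 0.133.
   - Observations:
     - The DNN fails to learn a useful decision rule, possibly due to:
       - Unscaled inputs: the network receives the raw features in both the train-test evaluation and the cross-validation.
       - Insufficient training epochs.
       - Suboptimal architecture (e.g., layer sizes, dropout rates, activation functions).
       - The model might be underfitting or not learning effectively with the given dataset.
     - Consider scaling the inputs, fine-tuning hyperparameters, or increasing the size and quality of the dataset.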
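
Because the cross-validation loop passes the raw `X` to every model, scale-sensitive models are evaluated without scaling. A common remedy is to put `StandardScaler` and the model into one pipeline, so the scaler is refitted on the training part of each fold. The sketch below demonstrates the effect with K-Nearest Neighbors on synthetic data, where one informative small-scale feature sits next to a large-scale noise feature. The data is an illustrative assumption, not the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): an informative feature near 0/1 and a noise feature in the thousands.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=400)
informative = y_demo + rng.normal(scale=0.3, size=400)
noise = rng.normal(scale=1000.0, size=400)
X_demo = np.column_stack([informative, noise])

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

raw_knn = KNeighborsClassifier(n_neighbors=5)
# The pipeline fits the scaler on each training fold only, avoiding leakage from the validation fold.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

raw_acc = cross_val_score(raw_knn, X_demo, y_demo, cv=kf).mean()
scaled_acc = cross_val_score(scaled_knn, X_demo, y_demo, cv=kf).mean()
print(f"Unscaled KNN accuracy: {raw_acc:.3f}")
print(f"Scaled KNN accuracy:   {scaled_acc:.3f}")
```

The same wrapping could be applied to the `"Support Vector Machine"`, `"K-Nearest Neighbors"`, and `"Deep Neural Network"` entries of `models` before cross-validation.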



The metrics and cross-validation results are saved to CSV files for visualization.

In [None]:
# Save the train-test evaluation metrics to a CSV file
results_df_1.to_csv("metrics_results.csv", index=True)
# Save the cross-validation results to a CSV file
cv_results_df.to_csv("cross_validation_results.csv", index=True)