<a href="https://colab.research.google.com/github/tarakantaacharya/Stock_Movement_Analysis/blob/main/Model_Building.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Building

#####Instructions:
1. Run the Data Scrapping, Data_Preprocessing_Cleaning, Feature Engineering, and Dataset for Model Building files before this notebook so that the required data exists.
2. If you do not want to rebuild the dataset, load the required datasets directly.
3. Install the required libraries from requirements.txt.

In [None]:
# Install the TensorFlow library, which is essential for building and training deep learning models.
!pip install tensorflow
# Install the Scikeras library, which provides an interface to use Keras models within scikit-learn pipelines for easier machine learning integration.
!pip install scikeras



######Importing Libraries

In [None]:
# Importing the pandas library for data manipulation and analysis
import pandas as pd

# Importing various classification algorithms from scikit-learn
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier  # Ensemble learning models
from sklearn.linear_model import LogisticRegression  # Logistic regression model
from sklearn.svm import SVC  # Support Vector Classifier
from sklearn.neighbors import KNeighborsClassifier  # K-Nearest Neighbors Classifier

# Importing evaluation metrics to assess model performance
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Importing train_test_split for splitting the dataset into training and testing sets
from sklearn.model_selection import train_test_split

# Importing StandardScaler to standardize features by removing the mean and scaling to unit variance
from sklearn.preprocessing import StandardScaler

# Importing TensorFlow's Keras library for building deep learning models
from tensorflow.keras.models import Sequential  # Sequential API for building models layer by layer

# Importing layers from Keras for building neural network architectures
from tensorflow.keras.layers import Dense, Dropout, Input  # Dense: fully connected layers, Dropout: reduces overfitting, Input: declares the input shape

# Importing Adam optimizer for training deep learning models
from tensorflow.keras.optimizers import Adam

# Importing KerasClassifier wrapper to use Keras models in scikit-learn pipelines
from scikeras.wrappers import KerasClassifier

# Importing StratifiedKFold and cross_val_score for cross-validation
from sklearn.model_selection import StratifiedKFold, cross_val_score  # StratifiedKFold ensures balanced class distribution in splits

# Importing visualization libraries
from matplotlib import pyplot as plt  # For creating visualizations
import seaborn as sns  # Advanced visualization library with aesthetic options

# Importing make_pipeline to create machine learning pipelines
from sklearn.pipeline import make_pipeline

# Creating an instance of StandardScaler for feature scaling
scaler = StandardScaler()  # Standardizes features to have mean 0 and standard deviation 1

In [None]:
# Features and target variable
X_features = ['score', 'num_comments', 'upvote_ratio', 'title_sentiment_score',
       'content_sentiment_score', 'Close_AAPL', 'Price_Change', 'Prev_Price_Change']

X = model_df[X_features]  # model_df must already be built or loaded (see the instructions above); adjust the features to your data
y = model_df['stock_direction']

# Train-test split (60-40 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Print the shapes of the resulting datasets
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")

X_train shape: (4502, 8)
X_test shape: (3002, 8)


In [None]:
results_df_1 = pd.DataFrame()  # Empty DataFrame that will store the metric results

#####Explanation of the models:
1. Random Forest: A robust ensemble method that combines multiple decision trees to improve classification accuracy and reduce overfitting. Setting class_weight='balanced' weights each class inversely proportional to its frequency in the training data.

2. Gradient Boosting: Builds trees sequentially, with each new tree fitted to the errors left by the trees before it. It often performs strongly on structured (tabular) data.

3. AdaBoost: Trains weak classifiers iteratively, increasing the weight of misclassified instances at each round so that later classifiers focus on them. The SAMME algorithm supports multi-class outputs.

4. Logistic Regression: A linear model used for binary and multi-class classification. Here, it is combined with StandardScaler for preprocessing, and the saga solver is chosen for its efficiency on large datasets.

5. Support Vector Machine (SVC): Useful for high-dimensional spaces and non-linear decision boundaries. The class_weight='balanced' adjusts for class imbalance.

6. K-Nearest Neighbors: Simple and intuitive, relying on the proximity of data points. The choice of n_neighbors=5 is a common default, but it can be tuned.
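
To make the effect of class_weight='balanced' concrete, the sketch below reproduces the weighting formula that scikit-learn documents for this setting, n_samples / (n_classes * class_count), in plain Python. The label vector `y_example` and the helper `balanced_class_weights` are hypothetical and only serve the illustration; they are not part of this notebook's pipeline.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * count_of_class)}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical, imbalanced label vector: 75 samples of class 1, 25 of class 0
y_example = [1] * 75 + [0] * 25

weights = balanced_class_weights(y_example)
print(weights)  # the minority class (0) receives the larger weight
```

In scikit-learn itself, `sklearn.utils.class_weight.compute_class_weight('balanced', classes=..., y=...)` computes the same quantities.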

---------------------------------------------------------------------------

#####Additional Notes:
1. Class weights: Random Forest, SVC, and Logistic Regression are configured with class_weight='balanced' to handle imbalanced target distributions.
2. Random State: Ensures reproducibility for models that involve randomness.
3. Pipelines: Used for Logistic Regression to combine preprocessing (scaling) and modeling into a single estimator.
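
One practical reason to bundle scaling and modeling in a pipeline is that, under cross-validation, the scaler is then refitted on each training fold only, so no statistics from the validation fold leak into preprocessing. The following is a minimal sketch of that idea, assuming a synthetic dataset from `make_classification` as a stand-in for model_df (the names `X_demo`, `y_demo`, `pipeline`, and `cv` are illustrative, not the notebook's own):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 500 rows, 8 features, mildly imbalanced classes
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.7, 0.3], random_state=42
)

# Same structure as the 'Logistic Regression' entry in the models dictionary
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=3000, solver='saga',
                       class_weight='balanced', random_state=42),
)

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring='accuracy')
print(f"Mean CV accuracy: {scores.mean():.3f}")
```

Because the whole pipeline is passed to `cross_val_score`, each fold calls `fit` on the scaler and the classifier together.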

In [None]:
# Initializing a dictionary of machine learning models with specific hyperparameters
models = {
    # Random Forest: Ensemble model using multiple decision trees, with 100 trees and balanced class weights
    "Random Forest": RandomForestClassifier(
        n_estimators=100,  # Number of decision trees
        class_weight='balanced',  # Adjust weights for imbalanced classes
        random_state=42  # Ensures reproducibility
    ),

    # Gradient Boosting: Ensemble model where trees are built sequentially to minimize errors
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=100,  # Number of boosting stages
        random_state=42  # Ensures reproducibility
    ),

    # AdaBoost: Boosting algorithm with SAMME (for multi-class classification)
    "AdaBoost": AdaBoostClassifier(
        algorithm='SAMME'  # Algorithm type, SAMME is suitable for multi-class problems
    ),

    # Logistic Regression: Linear model wrapped in a pipeline with scaling and custom parameters
    "Logistic Regression": make_pipeline(
        StandardScaler(),  # Standardizes features
        LogisticRegression(
            max_iter=3000,  # Maximum number of iterations for optimization
            solver='saga',  # Solver suitable for large datasets and supports L1/L2 regularization
            class_weight='balanced',  # Adjust weights for imbalanced classes
            random_state=42  # Ensures reproducibility
        )
    ),

    # Support Vector Machine: Non-linear classifier with kernel tricks
    "Support Vector Machine": SVC(
        class_weight='balanced',  # Adjust weights for imbalanced classes
        random_state=42  # Ensures reproducibility
    ),

    # K-Nearest Neighbors: Distance-based algorithm, finding the 5 nearest neighbors
    "K-Nearest Neighbors": KNeighborsClassifier(
        n_neighbors=5  # Number of neighbors to consider
    )
}

#####Explanation of the DNN:
1. Input Layer: The Input layer specifies the shape of input data, which corresponds to the number of features in the dataset (X.shape[1]).

2. Hidden Layers:

    2.1 First hidden layer: 64 neurons, ReLU activation for non-linearity, followed by a Dropout layer to mitigate overfitting.

    2.2 Second hidden layer: 32 neurons, ReLU activation, with another Dropout layer.

3. Output Layer:
A single neuron with a sigmoid activation function to output probabilities, suitable for binary classification.

4. Model Compilation:

    4.1 Optimizer: Adam is chosen for its adaptive learning rate and efficiency.

    4.2 Loss function: Binary cross-entropy is appropriate for binary classification tasks.

    4.3 Metrics: Accuracy is used to evaluate the model during training.
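
As a rough illustration of the architecture described above, the NumPy sketch below counts the trainable parameters of an 8 -> 64 -> 32 -> 1 network and runs one forward pass with ReLU hidden layers, a sigmoid output, and binary cross-entropy. It is not the Keras model itself: the weights are random and untrained, dropout is omitted (it is only active during training), and the helper names (`sigmoid`, `binary_cross_entropy`, `forward`) and the input batch are assumptions made for this example.

```python
import numpy as np

# Layer widths from the description: 8 input features -> 64 -> 32 -> 1
layer_sizes = [8, 64, 32, 1]

# Each Dense layer has (inputs * outputs) weights plus one bias per output
n_params = sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    # Clip probabilities so log() never receives exactly 0 or 1
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1.0 - y_true) * np.log(1.0 - y_prob)))

# Random, untrained weights (illustrative only)
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)  # ReLU hidden layers
    return sigmoid(h @ weights[-1] + biases[-1]).ravel()  # sigmoid output

x_batch = rng.normal(size=(4, 8))      # 4 hypothetical samples, 8 features each
probs = forward(x_batch)               # predicted probability of the positive class
loss = binary_cross_entropy([1, 0, 1, 0], probs)
print(n_params, probs.shape, loss)
```

The parameter count works out to 8*64 + 64 = 576, 64*32 + 32 = 2080, and 32*1 + 1 = 33, for a total of 2689.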

#####Integration with scikit-learn:
The KerasClassifier wrapper enables the DNN to integrate seamlessly into scikit-learn pipelines, making it compatible with functions like cross-validation and hyperparameter tuning.

#####Adding to the models dictionary:
The DNN model is added under the key "Deep Neural Network" to be evaluated alongside other machine learning models.

In [None]:
# Define a function to build the Deep Neural Network (DNN) model
def build_dnn():
    # Define the model structure using Keras Sequential API
    model = Sequential([
        # Input layer: shape matches the number of feature columns in X
        Input(shape=(X.shape[1],)),

        # First hidden layer: 64 neurons with ReLU activation for non-linearity
        Dense(64, activation='relu'),
        Dropout(0.2),  # Dropout: randomly zeroes 20% of this layer's outputs during training to reduce overfitting

        # Second hidden layer: 32 neurons with ReLU activation
        Dense(32, activation='relu'),
        Dropout(0.2),  # Dropout for further regularization

        # Output layer: 1 neuron with sigmoid activation for binary classification
        Dense(1, activation='sigmoid')  # Outputs probability of the positive class
    ])

    # Compile the model with the Adam optimizer and binary cross-entropy loss
    model.compile(
        optimizer=Adam(learning_rate=0.001),  # Optimizer with a learning rate of 0.001
        loss='binary_crossentropy',  # Loss function for binary classification
        metrics=['accuracy']  # Evaluation metric to track during training
    )
    return model  # Return the constructed model

# Wrap the DNN model with KerasClassifier for compatibility with scikit-learn workflows
dnn_model = KerasClassifier(
    model=build_dnn,  # The function that defines the DNN architecture
    epochs=25,  # Number of training epochs
    batch_size=32,  # Mini-batch size for gradient updates
    verbose=0  # Suppress training output
)

# Add the DNN model to the dictionary of models for evaluation
models['Deep Neural Network'] = dnn_model

With the models defined, the next step is to train them on the refined model_df dataset.