Summary of the Entire Code in 5 Steps
Data Loading and Preprocessing:

Load the dataset: The dataset is loaded from a CSV file using the pandas library, which reads the data into a DataFrame.
Drop null values: Rows with null values in the 'comment' column are removed to ensure that the data is clean and ready for processing.
Data Splitting:

Separate features and target: The comments are extracted as features (X), and the corresponding labels are extracted as the target (y).
Train-test split: The data is split into training and testing sets using an 80-20 ratio: 80% of the data trains the model, and 20% is held out to evaluate it. A fixed random state (random_state=42) makes the split reproducible across runs.
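To illustrate the reproducibility point, here is a minimal sketch on made-up data (the `comments` and `labels` lists and the `split` helper are hypothetical stand-ins, not part of the notebook). Calling `train_test_split` twice with the same `random_state` should yield identical partitions.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real comments and labels.
comments = [f"comment {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

def split(seed):
    """Run an 80-20 split with the given seed (illustrative helper)."""
    return train_test_split(comments, labels, test_size=0.2, random_state=seed)

X_train_a, X_test_a, y_train_a, y_test_a = split(42)
X_train_b, X_test_b, y_train_b, y_test_b = split(42)

print(len(X_train_a), len(X_test_a))   # sizes of the two partitions
print(X_test_a == X_test_b)            # same seed, same held-out rows
```

If the class proportions must be preserved in both partitions, `train_test_split` also accepts a `stratify` argument; the notebook does not use it.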
Text Vectorization:

TF-IDF Vectorization: The comments are converted into numerical vectors using the TF-IDF (Term Frequency-Inverse Document Frequency) method. Each comment becomes a vector in which a word's weight rises with how often it appears in that comment and falls with how many comments in the corpus contain it. The vocabulary is capped at 1,000 terms (max_features=1000), and the vectorizer is fitted on the training set only, then applied unchanged to the test set so that no test information leaks into training.
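The weighting can be sketched by hand on a tiny made-up corpus. The sketch below assumes the smoothed IDF formula and L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults; it simplifies tokenization to a whitespace split, so it is an illustration of the idea rather than a replica of the library. The `docs` list and `tfidf_vector` helper are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the notebook uses the 'comment' column instead.
docs = ["good movie", "bad movie", "good good plot"]

tokenized = [d.split() for d in docs]               # simplified tokenizer
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Raw term count times IDF, then scaled to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, dict(zip(vocab, (round(v, 3) for v in vec))))
```

Terms that occur in fewer documents ("bad", "plot") receive a higher IDF than terms shared by several documents ("good", "movie"), which is the sense in which TF-IDF reflects a word's importance.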
Model Training:

Initialize the Random Forest model: A Random Forest classifier is created with 100 trees. Random Forest is an ensemble learning method that trains many decision trees and combines their predictions, which is usually more accurate and stable than any single tree.
Train the model: The Random Forest model is trained on the TF-IDF vectorized training data. The model learns to map the input text vectors to the corresponding labels.
Model Evaluation and Visualization:

Make predictions: The trained model is used to make predictions on the test data.
Calculate accuracy: The accuracy of the model is calculated, which measures the proportion of correctly predicted instances out of the total instances.
Print classification report: A detailed classification report is printed, which includes metrics such as precision, recall, F1-score, and support for each class.
Compute confusion matrix: The confusion matrix is computed to provide a detailed breakdown of the model's performance, showing the counts of true positive, true negative, false positive, and false negative predictions.
Visualize confusion matrix: The confusion matrix is visualized using a heatmap, which provides an intuitive understanding of the model's performance in distinguishing between the classes.
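The evaluation metrics above all derive from the four confusion-matrix counts. As a minimal sketch on made-up binary labels (the `y_true` and `y_pred` lists are hypothetical, with 1 meaning positive), they can be computed by hand like this:

```python
# Hypothetical true and predicted labels for eight comments (1 = positive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

# Same layout as sklearn's confusion_matrix: rows = actual, columns = predicted.
matrix = [[tn, fp],
          [fn, tp]]

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(matrix)
print(accuracy, precision, recall, f1)
```

Precision, recall, and F1 here are for the positive class only; the classification report repeats the same calculation for each class.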

Random Forest Model
Improved Accuracy and Robustness:

Random Forest combines the predictions of multiple decision trees, each trained on a bootstrap sample of the data with a random subset of features considered at each split. Averaging over many such trees reduces variance, which lowers the risk of overfitting compared with a single decision tree and produces more reliable predictions.
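The averaging step can be illustrated without training anything. In the sketch below, the per-tree probabilities are invented numbers for a single comment; scikit-learn documents its `RandomForestClassifier` as predicting the class with the highest mean probability estimate across the trees, and this mimics only that final step.

```python
# Hypothetical class-probability outputs from five trees for one comment.
# Each pair is (P(negative), P(positive)); the numbers are made up.
tree_probs = [
    (0.90, 0.10),
    (0.40, 0.60),
    (0.30, 0.70),
    (0.20, 0.80),
    (0.45, 0.55),
]

n_trees = len(tree_probs)
n_classes = 2

# Average each class's probability over all trees.
mean_probs = [sum(p[c] for p in tree_probs) / n_trees for c in range(n_classes)]

# The forest predicts the class with the highest mean probability.
predicted_class = max(range(n_classes), key=lambda c: mean_probs[c])

print(mean_probs, predicted_class)
```

One tree is confidently wrong here (0.90 for negative), yet the ensemble still favors the positive class, which is the kind of robustness the averaging is meant to provide.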
Handling of High-Dimensional Data:

Random Forest can handle high-dimensional data and model complex interactions between features. That suits this task, where TF-IDF produces up to 1,000 features per comment.
Feature Importance:

Random Forest provides an importance score for each input feature, helping to identify which features are most influential in making predictions. Here each feature is a TF-IDF term, so the scores point to the words that drive the classification, which is useful for understanding patterns in the data and for feature selection.

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
file_path = 'cleaned_balanced_dataset_FINAL.csv'
data = pd.read_csv(file_path)
print("Dataset loaded successfully.")

# Drop rows with null values in the 'comment' column
data = data.dropna(subset=['comment'])
print("Null values dropped from the dataset.")

# Split the dataset into features and target
X = data['comment']
y = data['label']

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Data split into training and testing sets.")

# Vectorize the text data using TF-IDF (vocabulary capped at 1000 terms).
# Fit on the training set only, then reuse the fitted vocabulary on the test set.
vectorizer = TfidfVectorizer(max_features=1000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
print("Text data vectorized using TF-IDF.")

# Create a Random Forest Classifier (100 trees, fixed seed for reproducibility)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train_tfidf, y_train)
print("Model trained successfully.")

# Make predictions on the test set
y_pred = model.predict(X_test_tfidf)
print("Predictions made on the test set.")

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

print('Classification Report:')
print(classification_report(y_test, y_pred))

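# Confusion Matrix (rows: actual classes, columns: predicted classes)
conf_matrix = confusion_matrix(y_test, y_pred)

# Visualize the Confusion Matrix
# Note: the tick labels assume a binary 'label' column whose sorted values
# map to Negative first, then Positive (e.g. 0 and 1).
class_names = ['Negative', 'Positive']
plt.figure(figsize=(10, 7))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()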


Dataset loaded successfully.
Null values dropped from the dataset.
Data split into training and testing sets.
Text data vectorized using TF-IDF.
