### Load the Data

In [None]:
pip install imbalanced-learn




In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/content/preprocessed_dataset.csv')

# Define the text and target columns
text_column = 'cleaned_comment'
target_column = 'labels'


### Import Necessary Libraries

- **nltk**: Natural Language Toolkit, used for various text processing tasks.
- **re**: Regular expressions, used for string manipulation.

### Download NLTK Resources

- **stopwords**: A list of common words that are typically filtered out in text processing.
- **punkt**: Tokenizer models, used for breaking text into words and sentences.

### Define the preprocess_text Function

1. Convert text to lowercase.
2. Remove URLs using regular expressions.
3. Remove punctuation and digits using regular expressions.
4. Tokenize the text using NLTK's `word_tokenize` function.
5. Filter out stopwords from the tokens using the NLTK English stopwords list.
6. Join the filtered tokens back into a single string.

### Apply Text Preprocessing

- Apply the `preprocess_text` function to a specified column (`text_column`) in a DataFrame (`df`).

### Output

The output of this code is a DataFrame column where each text entry has been cleaned and preprocessed according to the steps defined in the `preprocess_text` function. The text in this column will be:

- Converted to lowercase.
- Stripped of URLs, punctuation, and digits.
- Filtered to remove common stopwords.
- Tokenized and then rejoined into a single string of meaningful words.
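
The cleaning steps can be traced on a single string. The sketch below is a minimal, standard-library-only illustration: the sample sentence, the helper name `clean_text_demo`, and the tiny `DEMO_STOPWORDS` set are assumptions made for this example, and `str.split` stands in for NLTK's `word_tokenize` so that no downloads are needed.

```python
import re

# Hypothetical stand-in for the NLTK English stopword list (assumption for this demo)
DEMO_STOPWORDS = {"a", "an", "the", "is", "this", "out", "and", "of", "to"}

def clean_text_demo(text):
    text = text.lower()                          # lowercase
    text = re.sub(r'http\S+|www\S+', '', text)   # remove URLs
    text = re.sub(r'[^\w\s]', '', text)          # remove punctuation
    text = re.sub(r'\d+', '', text)              # remove digits
    tokens = text.split()                        # whitespace split (not word_tokenize)
    tokens = [t for t in tokens if t not in DEMO_STOPWORDS]  # drop stopwords
    return ' '.join(tokens)                      # rejoin into one string

sample = "Check THIS out: https://example.com is the 2nd best site!!!"
print(clean_text_demo(sample))  # check nd best site
```

Note that stripping digits leaves the fragment `nd` from `2nd`; short residues like this are a side effect of the digit-removal step.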


In [None]:
import nltk

# Download the stopwords resource
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the stopwords resource
nltk.download('stopwords')

# Build the stopword set once so it is not recreated for every row
stop_words = set(stopwords.words('english'))

# Define a function to clean and preprocess the text data
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove URLs (http\S+ also covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove punctuation, then digits
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    # Tokenize and drop stopwords
    tokens = word_tokenize(text)
    filtered_tokens = [word for word in tokens if word not in stop_words]
    # Join tokens back into a single string
    return ' '.join(filtered_tokens)

# Apply text preprocessing
df[text_column] = df[text_column].apply(preprocess_text)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Vectorize the Text Data Using TF-IDF

### Import Necessary Library

- **sklearn.feature_extraction.text.TfidfVectorizer**: A tool from the `scikit-learn` library used for transforming text data into TF-IDF (Term Frequency-Inverse Document Frequency) features.

### Vectorize the Text Data Using TF-IDF

1. **Initialize TF-IDF Vectorizer**:
    - Create an instance of `TfidfVectorizer` with a maximum feature limit of 5000. This means the vocabulary is capped at the 5000 terms with the highest term frequency across the corpus; less frequent terms are dropped.

2. **Fit and Transform the Text Data**:
    - Apply the `fit_transform` method on the text data (`df[text_column]`). This step involves learning the vocabulary from the text data and then transforming the text into a TF-IDF feature matrix.
    - The result, `X`, is a sparse matrix where each row represents a document (or text entry) and each column represents a term (word) from the vocabulary. The values in the matrix are the TF-IDF scores.

3. **Assign Target Variable**:
    - Assign the target variable column from the DataFrame (`df[target_column]`) to `y`.

### Output

The output of this code is:

- `X`: A sparse matrix of shape `(n_samples, n_features)`, where `n_samples` is the number of documents (text entries) in the DataFrame and `n_features` is the vocabulary size, at most 5000 (`max_features`). Each element in the matrix represents the TF-IDF score of a term in a document.
- `y`: A series or array containing the target variable values from the specified column in the DataFrame.

For example, if the text data in `df[text_column]` contains three documents:
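
A minimal sketch of that case, using three made-up documents (the strings and the `demo_` names are assumptions for illustration, not rows from the dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up documents standing in for df[text_column]
demo_docs = ["good movie", "bad movie", "good good plot"]

demo_vectorizer = TfidfVectorizer(max_features=5000)
X_demo = demo_vectorizer.fit_transform(demo_docs)

print(X_demo.shape)                          # (3, 4): 3 documents, 4 distinct terms
print(sorted(demo_vectorizer.vocabulary_))   # ['bad', 'good', 'movie', 'plot']

# With a smaller cap, only the most frequent terms across the corpus are kept
small_vectorizer = TfidfVectorizer(max_features=2)
small_vectorizer.fit(demo_docs)
print(sorted(small_vectorizer.vocabulary_))  # ['good', 'movie']
```

Here the vocabulary has only four terms, so `X_demo` has four columns even though `max_features` is 5000.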



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the text data using TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X = tfidf_vectorizer.fit_transform(df[text_column])
y = df[target_column]


## Handling Imbalanced Data for Machine Learning Models

### Apply SMOTE to the TF-IDF Vectorized Data

1. **Initialize SMOTE**:
    - Create an instance of the `SMOTE` class with a specified `random_state` for reproducibility.

2. **Resample the Data**:
    - Apply the `fit_resample` method to the TF-IDF matrix from the vectorization step and the target variable (`y`). SMOTE generates synthetic minority-class samples by interpolating between existing minority samples and their nearest neighbours, which balances the class distribution.
    - The result is two new arrays: `X_resampled` (the resampled feature matrix) and `y_resampled` (the resampled target variable).

### Split the Data into Training and Testing Sets

1. **Split the Data**:
    - Use the `train_test_split` function to split the resampled data into training and testing sets.
    - Specify the `test_size` parameter to determine the proportion of the dataset to include in the test split (20% in this case).
    - Set the `random_state` for reproducibility.

### Output

The output of this code is:

- **X_resampled, y_resampled**: The feature matrix and target variable after applying SMOTE to balance the class distribution.
- **X_train, X_test, y_train, y_test**: The training and testing sets derived from the resampled data.

For example, if the original dataset has a severe class imbalance, SMOTE generates synthetic minority-class samples until the classes are balanced, giving `X_resampled` and `y_resampled`. `train_test_split` then divides this balanced data so that the training set (`X_train`, `y_train`) is used to fit a model and the testing set (`X_test`, `y_test`) to evaluate it. Because resampling happens before the split, the testing set also contains synthetic samples, so the scores describe performance on balanced, partly synthetic data rather than on the original class distribution.
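
To make "generating synthetic samples" concrete, the sketch below shows the interpolation idea behind SMOTE with NumPy only. It is a simplified illustration, not the `imblearn` implementation: the function name `smote_like_sample`, the toy points, and the choice of `k=2` are assumptions for this example.

```python
import numpy as np

def smote_like_sample(X_minority, k=2, seed=None):
    """Return one synthetic point on the segment between a random minority
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    i = rng.integers(len(X_minority))               # pick a minority sample
    x = X_minority[i]
    dists = np.linalg.norm(X_minority - x, axis=1)  # distances to all minority samples
    dists[i] = np.inf                               # exclude the sample itself
    neighbours = np.argsort(dists)[:k]              # indices of the k nearest neighbours
    j = rng.choice(neighbours)
    lam = rng.random()                              # interpolation factor in [0, 1)
    return x + lam * (X_minority[j] - x)

# Toy minority class with three 2-D points
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
new_point = smote_like_sample(X_min, k=2, seed=42)
print(new_point)  # lies on a segment joining two of the toy points
```

Every synthetic point is a blend of two existing minority samples, which is why the resampled data contains rows that never appeared in the original dataset.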


In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Apply SMOTE to the TF-IDF vectorized data
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)  # X is the TF-IDF matrix built in the vectorization cell

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)


## Function to Build and Evaluate Machine Learning Models



### Define the build_and_evaluate_ml_model Function

This function builds and evaluates a machine learning model based on the specified type.

1. **Select Model Type**:
    - `model_type` can be one of the following:
      - `'random_forest'`: Uses `RandomForestClassifier`.
      - `'logistic_regression'`: Uses `LogisticRegression`.
      - `'svm'`: Uses `SVC`.
    - An error is raised if an unsupported model type is provided.

2. **Train the Model**:
    - The model is instantiated with `random_state=42` for reproducibility.
    - The model is trained using the `fit` method with the training data (`X_train`, `y_train`).

3. **Make Predictions**:
    - The trained model makes predictions on the test data (`X_test`) using the `predict` method.

4. **Evaluate the Model**:
    - The model's performance is evaluated using the `accuracy_score` and `classification_report` functions.
    - The accuracy score and classification report are printed, providing a detailed assessment of the model's performance.

5. **Return the Model**:
    - The trained model is returned for further use if needed.

### Example Usage

To use the function, you can call it with different model types as shown in the example:

```python
# Example usage
rf_model = build_and_evaluate_ml_model(X_train, X_test, y_train, y_test, model_type='random_forest')
lr_model = build_and_evaluate_ml_model(X_train, X_test, y_train, y_test, model_type='logistic_regression')
svm_model = build_and_evaluate_ml_model(X_train, X_test, y_train, y_test, model_type='svm')
```



In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

def build_and_evaluate_ml_model(X_train, X_test, y_train, y_test, model_type='random_forest'):
    if model_type == 'random_forest':
        model = RandomForestClassifier(random_state=42)
    elif model_type == 'logistic_regression':
        model = LogisticRegression(max_iter=1000, random_state=42)
    elif model_type == 'svm':
        model = SVC(random_state=42)
    else:
        raise ValueError("Unsupported model type. Choose from 'random_forest', 'logistic_regression', or 'svm'.")

    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Evaluate the model
    print(f"Model: {model_type}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print(classification_report(y_test, y_pred))

    return model

# # Example usage
# rf_model = build_and_evaluate_ml_model(X_train_ml, X_test_ml, y_train_ml, y_test_ml, model_type='random_forest')
# lr_model = build_and_evaluate_ml_model(X_train_ml, X_test_ml, y_train_ml, y_test_ml, model_type='logistic_regression')
# svm_model = build_and_evaluate_ml_model(X_train_ml, X_test_ml, y_train_ml, y_test_ml, model_type='svm')


## Handling Imbalanced Data for Deep Learning Models


### Tokenization, Padding, and Handling Imbalanced Data for Deep Learning

Here, we are preparing text data for deep learning models by tokenizing, padding sequences, and handling class imbalance using SMOTE.

### Steps

1. **Tokenization**:
   - Using `Tokenizer` from `tensorflow.keras.preprocessing.text`, we convert text data (`df[text_column]`) into sequences of integers. We limit the vocabulary to the most frequent words with `num_words=5000`.

2. **Padding Sequences**:
   - `pad_sequences` from `tensorflow.keras.preprocessing.sequence` is used to ensure all sequences have the same length (`maxlen=100`). Shorter sequences are padded with zeros at the end (`padding='post'`); longer sequences are truncated to 100 tokens.

3. **Handling Imbalance with SMOTE**:
   - `SMOTE` from `imblearn.over_sampling` is applied to the padded sequences (`X_padded`) and target variable (`df[target_column]`) to balance the class distribution. Because SMOTE interpolates between feature vectors, the synthetic rows are blends of token indices rather than encodings of real sentences.

4. **Splitting into Training and Testing Sets**:
   - `train_test_split` from `sklearn.model_selection` is used to split the balanced data (`X_resampled_dl`, `y_resampled_dl`) into training (`X_train_dl`, `y_train_dl`) and testing (`X_test_dl`, `y_test_dl`) sets.

### Output

After executing this code, you will have:
- `X_train_dl`, `X_test_dl`: Padded sequences ready for training and testing deep learning models.
- `y_train_dl`, `y_test_dl`: Corresponding target variables (labels) for training and testing.
  
This preprocessing prepares the text data by converting it into numerical sequences, ensuring uniform sequence lengths, and addressing class imbalance for use in deep learning models.
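
To show what the integer sequences and padding look like, here is a pure-Python sketch that mimics the conventions described above without importing TensorFlow. The helper names `fit_word_index` and `to_padded` and the two sample texts are assumptions for this illustration; the sketch follows the Keras conventions that index 0 is reserved for padding, words are ranked by frequency, and over-long sequences keep their last `maxlen` tokens by default.

```python
from collections import Counter

def fit_word_index(texts, num_words):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for t in texts for word in t.split())
    ranked = [w for w, _ in counts.most_common()]  # ties keep first-seen order
    # Index 0 is reserved for padding; keep only indices below num_words
    return {w: i for i, w in enumerate(ranked, start=1) if i < num_words}

def to_padded(texts, word_index, maxlen):
    """Map texts to integer sequences, truncate from the front, pad at the end."""
    rows = []
    for t in texts:
        seq = [word_index[w] for w in t.split() if w in word_index]
        seq = seq[-maxlen:]                            # keep the last maxlen tokens
        rows.append(seq + [0] * (maxlen - len(seq)))   # 'post' padding with zeros
    return rows

demo_texts = ["good movie", "bad movie plot twist"]
demo_index = fit_word_index(demo_texts, num_words=10)
demo_padded = to_padded(demo_texts, demo_index, maxlen=3)
print(demo_index)   # {'movie': 1, 'good': 2, 'bad': 3, 'plot': 4, 'twist': 5}
print(demo_padded)  # [[2, 1, 0], [1, 4, 5]]
```

The second text has four tokens, so with `maxlen=3` its first token (`bad`) is dropped, while the first text is padded with a trailing zero.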


In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from imblearn.over_sampling import SMOTE

# Tokenize the text data
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(df[text_column])
X_tokenized = tokenizer.texts_to_sequences(df[text_column])

# Pad the sequences
maxlen = 100
X_padded = pad_sequences(X_tokenized, padding='post', maxlen=maxlen)

# Apply SMOTE to the padded sequences
smote = SMOTE(random_state=42)
X_resampled_dl, y_resampled_dl = smote.fit_resample(X_padded, df[target_column])

# Split the data into training and testing sets
X_train_dl, X_test_dl, y_train_dl, y_test_dl = train_test_split(X_resampled_dl, y_resampled_dl, test_size=0.2, random_state=42)


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Vectorize the text data using TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf_vectorizer.fit_transform(df[text_column])
y = df[target_column]

# Apply SMOTE to the data
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_tfidf, y)

# Split the data into training and testing sets
X_train_ml, X_test_ml, y_train_ml, y_test_ml = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)


## Function to Build and Evaluate Deep Learning Models

### Building and Evaluating a Deep Learning Model

This code snippet demonstrates how to build and evaluate a simple deep learning model using TensorFlow and Keras for text classification.

### Steps

1. **Label Encoding**:
   - Convert the target variable (`y_train` and `y_test`) to integer class labels using `LabelEncoder` from `sklearn.preprocessing`. The encoder is fitted on `y_train` and reused for `y_test`. This step is necessary if the target variable is not already encoded.

2. **Define the Model Architecture**:
   - Use `Sequential` from `tensorflow.keras.models` to define a sequential model.
   - Add an `Embedding` layer with an input dimension of 5000 (vocabulary size), output dimension of 16 (embedding dimension), and input length (`maxlen`) determined during padding.
   - Add a `GlobalAveragePooling1D` layer to average the embeddings across the sequence dimension.
   - Add a `Dense` layer with 24 units and ReLU activation, followed by a final `Dense` layer with 1 unit and sigmoid activation for binary classification.

3. **Compile the Model**:
   - Compile the model using `'adam'` optimizer and `'binary_crossentropy'` loss function for binary classification. Metrics are set to `'accuracy'` for evaluation.

4. **Train the Model**:
   - Train the model on `X_train` and `y_train` with 10 epochs, a batch size of 32, and a validation split of 0.2 (20% of training data used for validation).

5. **Evaluate the Model**:
   - Evaluate the trained model on `X_test` and `y_test` to measure its performance in terms of accuracy.

### Output

After executing `build_and_evaluate_dl_model`, you will see the following output:
- The accuracy of the deep learning model on the test set (`X_test`, `y_test`).

This function encapsulates the process of building, training, and evaluating a deep learning model for text classification tasks.
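
To make the data flow through this architecture easier to follow, the sketch below reproduces the forward pass with NumPy and random weights. It is only an illustration of the tensor shapes, not the Keras model itself: the weight names (`E`, `W1`, `W2`), the random initialisation, and the batch of random token IDs are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, maxlen_demo, hidden_units = 5000, 16, 100, 24

# Randomly initialised parameters standing in for trained weights
E = rng.normal(size=(vocab_size, embed_dim))                      # embedding table
W1, b1 = rng.normal(size=(embed_dim, hidden_units)), np.zeros(hidden_units)
W2, b2 = rng.normal(size=(hidden_units, 1)), np.zeros(1)

def forward(token_ids):
    embedded = E[token_ids]                  # (batch, maxlen, embed_dim)
    pooled = embedded.mean(axis=1)           # (batch, embed_dim): average over positions
    hidden = np.maximum(0, pooled @ W1 + b1) # (batch, hidden_units): ReLU
    logits = hidden @ W2 + b2                # (batch, 1)
    return 1 / (1 + np.exp(-logits))         # sigmoid gives a probability per row

demo_batch = rng.integers(0, vocab_size, size=(4, maxlen_demo))
probs = forward(demo_batch)
print(probs.shape)  # (4, 1)
```

The pooling step averages over all 100 positions, so padded positions (index 0) contribute their embedding to the average as well.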


In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from sklearn.preprocessing import LabelEncoder

def build_and_evaluate_dl_model(X_train, X_test, y_train, y_test):
    # Convert target to categorical if necessary
    label_encoder = LabelEncoder()
    y_train = label_encoder.fit_transform(y_train)
    y_test = label_encoder.transform(y_test)

    # Define the model
    model = Sequential([
        Embedding(input_dim=5000, output_dim=16, input_length=maxlen),
        GlobalAveragePooling1D(),
        Dense(24, activation='relu'),
        Dense(1, activation='sigmoid')
    ])

    # Compile the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    # Train the model
    model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2, verbose=1)

    # Evaluate the model
    loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
    print(f"Deep Learning Model Accuracy: {accuracy}")

    return model

# Example usage
#dl_model = build_and_evaluate_dl_model(X_train_dl, X_test_dl, y_train_dl, y_test_dl)


### Function Calls to Generate Models

### Conclusion

- The SVM model outperforms both the Random Forest and Logistic Regression models, reaching an accuracy of about 0.81 compared with about 0.77 for the other two.
- SVM shows balanced performance across both classes (0 and 1), with precision of 0.79 and 0.84 and recall of 0.85 and 0.78.
- Random Forest and Logistic Regression perform almost identically to each other, with precision, recall, and F1-score between 0.75 and 0.78 for both classes.

In summary, based on this evaluation, the SVM model is recommended for this classification task because it scores roughly four to five percentage points higher in accuracy and F1-score than the Random Forest and Logistic Regression models.







In [None]:
# For Machine Learning models
rf_model = build_and_evaluate_ml_model(X_train, X_test, y_train, y_test, model_type='random_forest')
lr_model = build_and_evaluate_ml_model(X_train, X_test, y_train, y_test, model_type='logistic_regression')
svm_model = build_and_evaluate_ml_model(X_train, X_test, y_train, y_test, model_type='svm')



Model: random_forest
Accuracy: 0.7660924750679964
              precision    recall  f1-score   support

           0       0.75      0.78      0.77      2189
           1       0.78      0.75      0.76      2223

    accuracy                           0.77      4412
   macro avg       0.77      0.77      0.77      4412
weighted avg       0.77      0.77      0.77      4412

Model: logistic_regression
Accuracy: 0.772438803263826
              precision    recall  f1-score   support

           0       0.77      0.77      0.77      2189
           1       0.78      0.77      0.77      2223

    accuracy                           0.77      4412
   macro avg       0.77      0.77      0.77      4412
weighted avg       0.77      0.77      0.77      4412

Model: svm
Accuracy: 0.8125566636446057
              precision    recall  f1-score   support

           0       0.79      0.85      0.82      2189
           1       0.84      0.78      0.81      2223

    accuracy                         

### Hyperparameter Tuning and Model Evaluation

The code below performs hyperparameter tuning for the SVM model using GridSearchCV and evaluates its performance on the test set.

#### Steps:

1. **Hyperparameter Tuning with GridSearchCV**:
   - GridSearchCV is used to search for the best combination of hyperparameters (`C`, `gamma`, `kernel`) for the SVM model (`SVC(random_state=42)`).
   - The parameter grid (`svm_param_grid`) specifies the values of `C`, `gamma`, and `kernel` to be evaluated. Note that `gamma` has no effect when `kernel='linear'`.
   - Each parameter combination is scored with 3-fold cross-validation (`cv=3`) on the training set.
   - The model with the best mean cross-validated accuracy is selected and refit on the full training set (`best_model = grid_search.best_estimator_`).

2. **Model Evaluation**:
   - The best model obtained from GridSearchCV is evaluated on the test set (`X_test`, `y_test`).
   - Accuracy, precision, recall, and F1-score are computed and printed using `classification_report`.

#### Output:

The output shows the evaluation metrics of the best SVM model on the test set:

- **Accuracy:** 0.811
- **Precision (0/1):** 0.81 / 0.81
- **Recall (0/1):** 0.81 / 0.81
- **F1-score (0/1):** 0.81 / 0.81

### Conclusion:

- The tuned SVM model achieves accuracy, precision, recall, and F1-score of around 0.81 for both classes (0 and 1).
- This is essentially the same as the untuned SVM (accuracy of about 0.81), so the search did not find a setting that clearly improves on the defaults.
- Hyperparameter tuning for the XGBoost and Gradient Boosting models is performed in the same way in the cells that follow, so their performance can be compared against the SVM model.
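
The search-then-evaluate pattern can be tried quickly on synthetic data. The sketch below is a small, hedged example rather than the notebook's actual run: the dataset from `make_classification`, the `demo_` names, and the reduced grid are assumptions chosen so it finishes in seconds. It also uses a list of two dictionaries, an alternative to a single grid, so that `gamma` is only searched for the RBF kernel.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Small synthetic binary classification problem (stand-in for the TF-IDF features)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# One sub-grid per kernel, so gamma is not searched for the linear kernel
demo_param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': [0.1, 0.01]},
]

demo_search = GridSearchCV(SVC(random_state=42), demo_param_grid, cv=3, scoring='accuracy')
demo_search.fit(X_tr, y_tr)

n_candidates = len(demo_search.cv_results_['params'])  # 3 linear + 3 * 2 rbf = 9
demo_test_accuracy = accuracy_score(y_te, demo_search.best_estimator_.predict(X_te))

print(n_candidates)
print(demo_search.best_params_)
print(demo_test_accuracy)
```

Cross-validation scores come only from the training portion; the held-out test set is touched once, after the best estimator has been selected.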


In [None]:
import joblib
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

# Function to perform hyperparameter tuning and evaluate models
def tune_and_evaluate_model(X_train, X_test, y_train, y_test, model, param_grid):
    with joblib.parallel_backend('threading'):
        grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2, scoring='accuracy')
        grid_search.fit(X_train, y_train)

    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test)

    print(f"Best Model: {best_model}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print(classification_report(y_test, y_pred))

    return best_model

# SVM Hyperparameter Tuning
svm_param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf', 'linear']
}
svm_best_model = tune_and_evaluate_model(X_train, X_test, y_train, y_test, SVC(random_state=42), svm_param_grid)



Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=  33.1s
[CV] END ......................C=0.1, gamma=1, kernel=linear; total time=  40.4s
[CV] END C=37.55401188473625, gamma=0.9517143064099162, kernel=rbf; total time= 1.2min
[CV] END C=37.55401188473625, gamma=0.9517143064099162, kernel=rbf; total time= 1.2min
[CV] END C=37.55401188473625, gamma=0.9517143064099162, kernel=rbf; total time=  55.3s
[CV] END C=78.06910002727692, gamma=0.597850157946487, kernel=linear; total time= 2.0min
[CV] END C=78.06910002727692, gamma=0.597850157946487, kernel=linear; total time= 1.9min
[CV] END C=15.699452033620265, gamma=0.05908361216819946, kernel=linear; total time=  39.2s
[CV] END C=78.06910002727692, gamma=0.597850157946487, kernel=linear; total time= 1.9min
[CV] END C=15.699452033620265, gamma=0.05908361216819946, kernel=linear; total time=  35.4s
[CV] END C=15.699452033620265, gamma=0.05908361216819946, kernel=

In [None]:
# XGBoost Hyperparameter Tuning
xgb_param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5]
}
xgb_best_model = tune_and_evaluate_model(X_train, X_test, y_train, y_test, XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='mlogloss'), xgb_param_grid)


Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=100; total time=   7.7s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=100; total time=   7.8s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=100; total time=   4.9s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=200; total time=   8.9s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=200; total time=  11.3s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=200; total time=  11.9s
[CV] END ...learning_rate=0.1, max_depth=4, n_estimators=100; total time=   6.4s
[CV] END ...learning_rate=0.1, max_depth=4, n_estimators=100; total time=   7.0s
[CV] END ...learning_rate=0.1, max_depth=4, n_estimators=100; total time=   7.1s
[CV] END ..learning_rate=0.01, max_depth=5, n_estimators=200; total time=  17.0s
[CV] END ..learning_rate=0.01, max_depth=5, n_estimators=200; total time=  16.5s
[CV] END ..learning_rate=0.01, max_depth=4, n_es

In [None]:
# Gradient Boosting Hyperparameter Tuning
gb_param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5]
}
gb_best_model = tune_and_evaluate_model(X_train, X_test, y_train, y_test, GradientBoostingClassifier(random_state=42), gb_param_grid)


Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=100; total time=  10.7s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=100; total time=  11.7s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=100; total time=   7.5s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=200; total time=  16.5s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=200; total time=  14.2s
[CV] END ..learning_rate=0.01, max_depth=3, n_estimators=200; total time=  12.4s
[CV] END ...learning_rate=0.1, max_depth=4, n_estimators=100; total time=   8.5s
[CV] END ...learning_rate=0.1, max_depth=4, n_estimators=100; total time=   6.8s
[CV] END ...learning_rate=0.1, max_depth=4, n_estimators=100; total time=   6.7s
[CV] END ..learning_rate=0.01, max_depth=5, n_estimators=200; total time=  19.6s
[CV] END ..learning_rate=0.01, max_depth=5, n_estimators=200; total time=  19.6s
[CV] END ..learning_rate=0.01, max_depth=4, n_es

### Deep Learning Model Training and Evaluation

The provided output shows the training progress and evaluation metrics of a deep learning model trained over 10 epochs.

#### Training Progress:

- **Epochs:** The model is trained over 10 epochs.
- **Loss:** The loss decreases from 0.6915 to 0.3594, indicating improvement in model performance.
- **Accuracy:** The accuracy improves from 0.5304 to 0.8388 on the training set.
- **Validation Accuracy:** The validation accuracy increases from 0.5971 to 0.7121. It stays well below the final training accuracy of 0.8388, so the model fits the training data better than unseen data.

#### Model Evaluation:

- After training, the model achieves an accuracy of 0.7246 on the test set (`X_test_dl`, `y_test_dl`), as reported by `model.evaluate`.
- The `classification_report` can provide more insights into precision, recall, and F1-score for each class (not shown here).

### Conclusion:

- The deep learning model shows a steady improvement in accuracy and loss during training.
- However, the validation accuracy (0.7121) is about 13 percentage points lower than the training accuracy (0.8388), which suggests overfitting.
- Further tuning of hyperparameters or model architecture adjustments may be beneficial to improve validation performance and overall model robustness.


In [None]:
# For Deep Learning model
dl_model = build_and_evaluate_dl_model(X_train_dl, X_test_dl, y_train_dl, y_test_dl)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Deep Learning Model Accuracy: 0.724614679813385


Performing Various Practices

In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

def build_and_evaluate_ml_model(X_train, X_test, y_train, y_test, model_type='random_forest'):
    if model_type == 'random_forest':
        model = RandomForestClassifier(random_state=42)
    elif model_type == 'logistic_regression':
        model = LogisticRegression(max_iter=1000, random_state=42)
    elif model_type == 'svm':
        model = SVC(random_state=42)
    elif model_type == 'gradient_boosting':
        model = GradientBoostingClassifier(random_state=42)
    else:
        raise ValueError("Unsupported model type. Choose from 'random_forest', 'logistic_regression', 'svm', or 'gradient_boosting'.")

    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Evaluate the model
    print(f"Model: {model_type}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print(classification_report(y_test, y_pred))

    return model

# Example usage
rf_model = build_and_evaluate_ml_model(X_train_ml, X_test_ml, y_train_ml, y_test_ml, model_type='random_forest')
lr_model = build_and_evaluate_ml_model(X_train_ml, X_test_ml, y_train_ml, y_test_ml, model_type='logistic_regression')
svm_model = build_and_evaluate_ml_model(X_train_ml, X_test_ml, y_train_ml, y_test_ml, model_type='svm')
gb_model = build_and_evaluate_ml_model(X_train_ml, X_test_ml, y_train_ml, y_test_ml, model_type='gradient_boosting')


Model: random_forest
Accuracy: 0.7660924750679964
              precision    recall  f1-score   support

           0       0.75      0.78      0.77      2189
           1       0.78      0.75      0.76      2223

    accuracy                           0.77      4412
   macro avg       0.77      0.77      0.77      4412
weighted avg       0.77      0.77      0.77      4412

Model: logistic_regression
Accuracy: 0.772438803263826
              precision    recall  f1-score   support

           0       0.77      0.77      0.77      2189
           1       0.78      0.77      0.77      2223

    accuracy                           0.77      4412
   macro avg       0.77      0.77      0.77      4412
weighted avg       0.77      0.77      0.77      4412

Model: svm
Accuracy: 0.8125566636446057
              precision    recall  f1-score   support

           0       0.79      0.85      0.82      2189
           1       0.84      0.78      0.81      2223

    accuracy                         

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import class_weight
import numpy as np

def build_and_evaluate_dl_model(X_train, X_test, y_train, y_test):
    # Convert target to categorical if necessary
    label_encoder = LabelEncoder()
    y_train = label_encoder.fit_transform(y_train)
    y_test = label_encoder.transform(y_test)

    # Calculate class weights
    class_weights = class_weight.compute_class_weight(
        class_weight='balanced',
        classes=np.unique(y_train),
        y=y_train
    )
    class_weights = {i: weight for i, weight in enumerate(class_weights)}

    # Define the model
    model = Sequential([
        Embedding(input_dim=5000, output_dim=16, input_length=maxlen),
        GlobalAveragePooling1D(),
        Dense(24, activation='relu'),
        Dense(1, activation='sigmoid')
    ])

    # Compile the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    # Train the model with class weights
    model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2, class_weight=class_weights, verbose=1)

    # Evaluate the model
    loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
    print(f"Deep Learning Model Accuracy: {accuracy}")

    return model

# Example usage
dl_model = build_and_evaluate_dl_model(X_train_dl, X_test_dl, y_train_dl, y_test_dl)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Deep Learning Model Accuracy: 0.7441070079803467


In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import class_weight
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BertTokenizer

# Function for WordPiece tokenization
def wordpiece_tokenize(texts, tokenizer, max_length):
    tokenized_texts = tokenizer(
        texts,
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='np'
    )
    return tokenized_texts['input_ids']

def build_and_evaluate_dl_model(X_train, X_test, y_train, y_test, maxlen=100):
    # Convert target to categorical if necessary
    label_encoder = LabelEncoder()
    y_train = label_encoder.fit_transform(y_train)
    y_test = label_encoder.transform(y_test)

    # Ensure input data is in list of strings format
    X_train = [str(doc) for doc in X_train]
    X_test = [str(doc) for doc in X_test]

    # TF-IDF Vectorization
    tfidf_vectorizer = TfidfVectorizer(max_features=5000)
    X_train_tfidf = tfidf_vectorizer.fit_transform(X_train).toarray()
    X_test_tfidf = tfidf_vectorizer.transform(X_test).toarray()

    # WordPiece Tokenization
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    X_train_tokenized = wordpiece_tokenize(X_train, tokenizer, max_length=maxlen)
    X_test_tokenized = wordpiece_tokenize(X_test, tokenizer, max_length=maxlen)

    # Combine TF-IDF and Tokenized features (you may choose to use one or both)
    X_train_combined = np.hstack((X_train_tfidf, X_train_tokenized))
    X_test_combined = np.hstack((X_test_tfidf, X_test_tokenized))

    # Calculate class weights
    class_weights = class_weight.compute_class_weight(
        class_weight='balanced',
        classes=np.unique(y_train),
        y=y_train
    )
    class_weights = {i: weight for i, weight in enumerate(class_weights)}

    # Define the model
    model = Sequential([
        Dense(512, activation='relu', input_shape=(X_train_combined.shape[1],)),
        Dense(256, activation='relu'),
        Dense(128, activation='relu'),
        Dense(1, activation='sigmoid')
    ])

    # Compile the model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    # Train the model with class weights
    model.fit(X_train_combined, y_train, epochs=10, batch_size=32, validation_split=0.2, class_weight=class_weights, verbose=1)

    # Evaluate the model
    y_pred = (model.predict(X_test_combined) > 0.5).astype("int32")
    accuracy = tf.keras.metrics.BinaryAccuracy()
    precision = tf.keras.metrics.Precision()
    recall = tf.keras.metrics.Recall()

    accuracy.update_state(y_test, y_pred)
    precision.update_state(y_test, y_pred)
    recall.update_state(y_test, y_pred)

    # F1 is the harmonic mean of precision and recall; guard against a zero denominator
    p = float(precision.result().numpy())
    r = float(recall.result().numpy())
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0

    print(f"Deep Learning Model Accuracy: {accuracy.result().numpy()}")
    print(f"Deep Learning Model Precision: {p}")
    print(f"Deep Learning Model Recall: {r}")
    print(f"Deep Learning Model F1 Score: {f1}")

    return model

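# Example usage
# Note: this version expects X_train and X_test to be raw text strings. X_train_dl and
# X_test_dl as built earlier are padded integer sequences, so str(doc) turns each row
# into a string of token indices rather than a sentence.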
dl_model = build_and_evaluate_dl_model(X_train_dl, X_test_dl, y_train_dl, y_test_dl)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Deep Learning Model Accuracy: 0.5
Deep Learning Model Precision: 0.5
Deep Learning Model Recall: 1.0
Deep Learning Model F1 Score: 0.6666666865348816
