### **Code Description**:  
1.   This code performs text classification on a dataset of comments, where the goal is to predict if a comment is toxic or not. The code imports necessary libraries, loads the data, preprocesses it, and balances the target feature. Then it splits the data into training and testing sets, vectorizes the text data using CountVectorizer and TfidfVectorizer, trains different models including Logistic Regression and Multinomial Naive Bayes, and evaluates them using classification reports. Finally, it selects the best model and evaluates it on the testing set using the F1 score.
2.   The preprocessing step applies lowercasing, removal of stop words, and lemmatization to the comment text. The target feature imbalance is handled by downsampling the non-toxic comments to match the number of toxic comments.

3. The vectorization step uses both CountVectorizer and TfidfVectorizer with n-gram range of 1 to 2 and maximum features of 50,000.

4. The code trains two models, Logistic Regression and Multinomial Naive Bayes, and evaluates them using classification reports for both CountVectorizer and TfidfVectorizer. The results show that Logistic Regression with TfidfVectorizer has the best performance with an F1 score of 0.89.

5. Finally, the code selects the best model (Logistic Regression with TfidfVectorizer) and evaluates it on the testing set using the classification report and F1 score.

6. At last, for testing a sample comment, it applies the same preprocessing function used in the data preparation step. Then it uses the pipeline object, which represents the best trained model, to predict whether the comment is toxic or not. The predict method takes a list of comments as input, so we pass a list with only one element [sample_comment]. Finally, the code uses a conditional statement to check whether the predicted label is equal to 1 or 0, and prints the corresponding message.








### **Importing necessary libraries:**

In [6]:
# Import necessary libraries
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score, classification_report
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

### **Loading the data**

In [7]:
data = pd.read_csv("/content/drive/MyDrive/zummit/P1 Data/train.csv", error_bad_lines=False) 



  data = pd.read_csv("/content/drive/MyDrive/zummit/P1 Data/train.csv", error_bad_lines=False)


### **Preprocessing**

In [8]:
lemmatizer = WordNetLemmatizer()
stop_words = stopwords.words('english')
def preprocess(text):
    text = text.lower()
    text = ' '.join([word for word in text.split() if word not in stop_words])
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])
    return text


In [9]:
data['comment_text'] = data['comment_text'].apply(preprocess)

### **Target feature imbalance**

In [10]:
toxic_comments = data[data['toxic']==1]
non_toxic_comments = data[data['toxic']==0].sample(n=len(toxic_comments), random_state=42)
balanced_data = pd.concat([toxic_comments, non_toxic_comments])

### **Splitting data into training and testing sets**

In [11]:
X = balanced_data['comment_text']
y = balanced_data['toxic']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### **Vectorization and model training**

In [12]:
cv = CountVectorizer(ngram_range=(1,2), max_features=50000)
tfidf = TfidfVectorizer(ngram_range=(1,2), max_features=50000)
models = [
    ('Logistic Regression', LogisticRegression(random_state=42)),
    ('Multinomial Naive Bayes', MultinomialNB())
]
for model_name, model in models:
    pipeline_cv = Pipeline([('cv', cv), ('model', model)])
    pipeline_tfidf = Pipeline([('tfidf', tfidf), ('model', model)])
    pipeline_cv.fit(X_train, y_train)
    pipeline_tfidf.fit(X_train, y_train)
    y_pred_cv = pipeline_cv.predict(X_test)
    y_pred_tfidf = pipeline_tfidf.predict(X_test)
    print(model_name + " with CountVectorizer\n" + classification_report(y_test, y_pred_cv))
    print(model_name + " with TfidfVectorizer\n" + classification_report(y_test, y_pred_tfidf))


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regression with CountVectorizer
              precision    recall  f1-score   support

           0       0.89      0.88      0.89      2997
           1       0.89      0.89      0.89      3121

    accuracy                           0.89      6118
   macro avg       0.89      0.89      0.89      6118
weighted avg       0.89      0.89      0.89      6118

Logistic Regression with TfidfVectorizer
              precision    recall  f1-score   support

           0       0.87      0.92      0.90      2997
           1       0.92      0.87      0.89      3121

    accuracy                           0.89      6118
   macro avg       0.90      0.89      0.89      6118
weighted avg       0.90      0.89      0.89      6118

Multinomial Naive Bayes with CountVectorizer
              precision    recall  f1-score   support

           0       0.87      0.90      0.88      2997
           1       0.90      0.87      0.88      3121

    accuracy                           0.88      6118
 

### **Testing the best model**

In [13]:
pipeline = Pipeline([('tfidf', tfidf), ('model', LogisticRegression(random_state=42))])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print("Logistic Regression with TfidfVectorizer\n" + classification_report(y_test, y_pred))
print("F1-score: {:.2f}".format(f1_score(y_test, y_pred)))


Logistic Regression with TfidfVectorizer
              precision    recall  f1-score   support

           0       0.87      0.92      0.90      2997
           1       0.92      0.87      0.89      3121

    accuracy                           0.89      6118
   macro avg       0.90      0.89      0.89      6118
weighted avg       0.90      0.89      0.89      6118

F1-score: 0.89


### **Testing the model on a sample comment**

In [33]:

sample_comment = "Sample Comment"
sample_comment = preprocess(sample_comment)
y_pred = pipeline.predict([sample_comment])
y_pred = pipeline.predict([sample_comment])
if y_pred[0] == 1:
    print("The comment is toxic.")
else:
    print("The comment is not toxic.")


The comment is not toxic.
