# Sentiment Analysis: Food Review Classification
# Group Members: Teh Wei Zhang, Chai Ee Yuan, Tang Yik Hong

Sentiment analysis is a natural language processing (NLP) method used to detect the emotional tone of written text. In food reviews, it helps identify whether customer feedback is positive, negative, or neutral. For instance, praise for taste or service reflects positive sentiment, while complaints about price or delays indicate negative sentiment. Neutral reviews may simply describe portion size or menu variety without strong emotion.

By turning qualitative opinions into structured data, sentiment analysis reveals broader patterns in customer satisfaction. These insights allow restaurants and food platforms to track strengths, address weaknesses, and improve overall dining experiences.

# Benefits of sentiment analysis

1. **Understanding Customer Opinions**: It helps businesses quickly identify whether customer feedback is positive, negative, or neutral, giving a clear picture of satisfaction levels.

2. **Improving Decision-Making**: By converting subjective comments into measurable data, organizations can make data-driven choices about product quality, pricing, or service improvements.

3. **Real-Time Analysis**: By monitoring reviews as they arrive, sentiment analysis allows businesses to detect emerging patterns in customer reception soon after a product launch or menu change. This enables restaurants and food sellers to quickly identify pain points and respond to customer feedback.


In [2]:
import joblib
import numpy as np
import pandas as pd

# Dataset Description

This project utilizes the Amazon Fine Food Reviews Dataset available on Kaggle, a large-scale collection widely used in sentiment analysis research. The dataset contains over 500,000 food-related reviews from Amazon, including detailed ratings, review text, and metadata.

Dataset structure:
- Reviews: More than 500,000 textual evaluations of food products on Amazon
- Ratings: 1–5 star scale, which can be mapped into negative, neutral, and positive sentiment classes
- Balance: For experimental purposes, subsets can be sampled to create equal distributions across sentiment categories
- Source: https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews

Our research objective is to systematically evaluate multiple machine learning approaches for sentiment classification, determining which methods most effectively predict customer sentiment from review text. This analysis will highlight the relative performance of classification strategies when applied to the domain-specific language patterns found in food product reviews.

In [4]:
file = 'Reviews.csv'
# latin1 = ISO-8859-1 encoding; low_memory=False parses the file in one pass
# (instead of in chunks) so that column dtypes are inferred consistently
df_review = pd.read_csv(file, encoding="latin1", low_memory=False)
# Drop any unnamed index columns left over from a previous export
df_review = df_review.loc[:, ~df_review.columns.str.contains(r'^Unnamed')]
# Keep only rows whose Rating is between 1 and 5
df_review = df_review[df_review['Rating'].between(1, 5)]
print(df_review.head())
print(df_review['Rating'].value_counts())  # number of reviews per star rating

   Id   ProductId  Rating                                               Text
0   1  B001E4KFG0       5  I have bought several of the Vitality canned d...
1   2  B00813GRG4       1  Product arrived labeled as Jumbo Salted Peanut...
2   3  B000LQOCH0       4  This is a confection that has been around a fe...
3   4  B000UA0QIQ       2  If you are looking for the secret ingredient i...
4   5  B006K2ZZ7K       5  Great taffy at a great price.  There was a wid...
Rating
5    363103
4     80650
1     52265
3     42605
2     29768
Name: count, dtype: int64


# Creating three separate datasets for positive, negative, and neutral reviews

We use the Amazon Fine Food Reviews dataset (500k+ rows). Ratings are mapped to sentiment as:
- Negative = 1–2 stars
- Neutral = 3 stars
- Positive = 4–5 stars

Using simple filtering in Pandas, we created three DataFrames: df_positive, df_negative, and df_neutral. Each contains only the reviews belonging to its sentiment class. Printing the head of each subset confirms that the split worked correctly and shows sample reviews from every category.

In [8]:
def map_sentiment(rating):
    if rating < 3:
        return "negative"
    elif rating == 3:
        return "neutral"
    else: 
        return "positive"

df_review['sentiment'] = df_review['Rating'].apply(map_sentiment)

print(df_review['sentiment'].value_counts())



sentiment
positive    443753
negative     82033
neutral      42605
Name: count, dtype: int64


In [9]:
df_positive = df_review[df_review['sentiment']=='positive']
df_negative = df_review[df_review['sentiment']=='negative']
df_neutral  = df_review[df_review['sentiment']=='neutral']

print("Positive examples:\n", df_positive.head())
print("Negative examples:\n", df_negative.head())
print("Neutral examples:\n", df_neutral.head())


Positive examples:
    Id   ProductId  Rating                                               Text  \
0   1  B001E4KFG0       5  I have bought several of the Vitality canned d...   
2   3  B000LQOCH0       4  This is a confection that has been around a fe...   
4   5  B006K2ZZ7K       5  Great taffy at a great price.  There was a wid...   
5   6  B006K2ZZ7K       4  I got a wild hair for taffy and ordered this f...   
6   7  B006K2ZZ7K       5  This saltwater taffy had great flavors and was...   

  sentiment  
0  positive  
2  positive  
4  positive  
5  positive  
6  positive  
Negative examples:
     Id   ProductId  Rating                                               Text  \
1    2  B00813GRG4       1  Product arrived labeled as Jumbo Salted Peanut...   
3    4  B000UA0QIQ       2  If you are looking for the secret ingredient i...   
12  13  B0009XLVG0       1  My cats have been happily eating Felidae Plati...   
16  17  B001GVISJM       2  I love eating them and they are good for wa

# Random sampling to generate balanced dataset

The Amazon Fine Food Reviews dataset is highly imbalanced, with many more positive (4- and 5-star) reviews than neutral or negative ones. To address this, we used random sampling to create a dataset with a controlled number of reviews from each sentiment category:

- 100,000 positive reviews

- 80,000 negative reviews

- 40,000 neutral reviews

We then combined these subsets into a single DataFrame and shuffled the rows to mix the classes evenly.

***dataset_name.sample(n = no_of_rows)***

This call randomly selects a specified number of rows from each sentiment category. The resulting 100,000 / 80,000 / 40,000 split is not perfectly equal, but it is much closer to balance than the original data, which is important for fair training and evaluation of sentiment analysis models.

In [10]:
# Sample a fixed number of reviews from each sentiment class
pos_review = df_positive.sample(n=100000, random_state=42)
neg_review = df_negative.sample(n=80000, random_state=42)  
neu_review = df_neutral.sample(n=40000, random_state=42)

# Concatenate them
df_review_bal = pd.concat([pos_review, neg_review, neu_review])

# Shuffle the dataset
df_review_bal = df_review_bal.sample(frac=1, random_state=42).reset_index(drop=True)

print("Balanced dataset shape:", df_review_bal.shape)
print("Class distribution:")
print(df_review_bal['sentiment'].value_counts())

Balanced dataset shape: (220000, 5)
Class distribution:
sentiment
positive    100000
negative     80000
neutral      40000
Name: count, dtype: int64


# Splitting dataset into training and test set

Before we work with our data, we need to split it into a train and test set. The train dataset will be used to fit the model, while the test dataset will be used to provide an unbiased evaluation of a final model fit on the training dataset.

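We'll use ***sklearn's train_test_split*** to do the job. In this case, we reserve 20% of the data for the test set and stratify on the sentiment label so that both sets keep the same class proportions.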

In [11]:
from sklearn.model_selection import train_test_split
df_review_bal['Review'] = df_review_bal['Text'].astype(str)
# Split dataset (80% train, 20% test), stratified by sentiment
train, test = train_test_split(df_review_bal, test_size=0.2, random_state=1, stratify=df_review_bal['sentiment'])

# Separate features (X) and labels (y)
train_x, train_y = train['Review'], train['sentiment']
test_x, test_y   = test['Review'], test['sentiment']
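
To see what stratifying on the label guarantees, here is a minimal hand-rolled sketch of a stratified split. It is not how scikit-learn implements `train_test_split`; the function name `stratified_split` and the toy labels are assumptions made purely for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=1):
    """Return (train_idx, test_idx) so that every class is split in the same ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # same fraction for every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels with the same 100k : 80k : 40k class ratio, scaled down by 1000
labels = ["positive"] * 100 + ["negative"] * 80 + ["neutral"] * 40
train_idx, test_idx = stratified_split(labels, test_size=0.2)

test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Each class contributes 20% of its rows to the test set, so the test set keeps the same 5 : 4 : 2 ratio as the full data.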

# Natural language processing pipeline:

1. Tokenizing sentences to break text down into sentences, words, or other units. This helps the computer process text piece by piece.

2. Removing stop words like "if," "but," "or," and so on. These common words usually don't help determine sentiment.

3. Normalizing words: Condensing all forms of a word into a single form. For example, "running," "runs," and "ran" all become "run."

4. Vectorizing text: Turning the text into a numerical representation for consumption by your classifier. This converts text data into numbers that machine learning models can understand.
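
The steps above can be sketched in a few lines of plain Python. This is only an illustration: the tiny `STOP_WORDS` set and the `NORMAL_FORMS` lookup are hypothetical stand-ins for a full stop word list and a real stemmer or lemmatizer, and the vectorizer here is a simple word count rather than TF-IDF.

```python
import re
from collections import Counter

# Hypothetical miniature resources, for illustration only
STOP_WORDS = {"if", "but", "or", "the", "and", "was", "were", "a"}
NORMAL_FORMS = {"running": "run", "runs": "run", "ran": "run"}

def tokenize(text):
    """Step 1: lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Step 2: drop common words that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]

def normalize(tokens):
    """Step 3: map word variants to a single form via a lookup table."""
    return [NORMAL_FORMS.get(t, t) for t in tokens]

def vectorize(tokens, vocabulary):
    """Step 4: count how often each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

text = "The waiter was running late, but the soup runs out fast"
tokens = normalize(remove_stop_words(tokenize(text)))
vocabulary = sorted(set(tokens))
vector = vectorize(tokens, vocabulary)

print(tokens)
print(vocabulary)
print(vector)
```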

# TF-IDF Vectorizer
To transform review text into numerical features suitable for machine learning, we apply the Term Frequency–Inverse Document Frequency (TF-IDF) vectorizer. TF-IDF converts unstructured text into vectors that reflect the importance of words within a collection of documents.

- Term Frequency (TF): Counts how often a term appears in a single document.

- Document Frequency (DF): Measures in how many documents a term appears.

- Inverse Document Frequency (IDF): Reduces the weight of terms that occur across many documents, since these provide little discriminatory power.

By multiplying TF and IDF, the TF-IDF score highlights words that are both frequent in a given review and distinctive across the corpus. Common but uninformative words (e.g., “good,” “food,” “product”) are filtered out through stopword removal.
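
As a worked illustration of the weighting, the sketch below computes TF-IDF by hand on three toy documents. It assumes the smoothed IDF formula documented for scikit-learn, `ln((1 + n) / (1 + df)) + 1`, and the sublinear term frequency `1 + ln(tf)`; the final per-document L2 normalization that scikit-learn applies by default is omitted for brevity. The function names and the toy corpus are hypothetical.

```python
import math

# Three toy "reviews", already tokenized
docs = [
    ["cookies", "fresh", "crispy"],
    ["cookies", "stale"],
    ["cookies", "fresh", "fresh", "delicious"],
]

def smoothed_idf(term, docs):
    """Inverse document frequency with add-one smoothing."""
    n_docs = len(docs)
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

def sublinear_tf(term, doc):
    """Dampened term frequency: 1 + ln(count), or 0 if the term is absent."""
    count = doc.count(term)
    return 1 + math.log(count) if count > 0 else 0.0

def tfidf_weight(term, doc, docs):
    """Unnormalized TF-IDF weight of a term in one document."""
    return sublinear_tf(term, doc) * smoothed_idf(term, docs)

for term in ["cookies", "fresh", "stale"]:
    print(term, round(smoothed_idf(term, docs), 3))
```

A word that appears in every document ("cookies") gets the minimum IDF of 1, while a word found in only one document ("stale") is weighted more heavily.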

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
import re #for regex text cleaning
import nltk
from nltk.corpus import stopwords

# Download NLTK resources (no-op if already present)
nltk.download('stopwords')

# Base stopwords
default_stopwords = set(stopwords.words('english'))

# Domain stopwords (food product reviews)
domain_stopwords = {
    'like','good','great','just','really','best','better','love',
    'taste','flavor','food','eat','tastes',
    'coffee','tea','chocolate','cup','sugar','water','bag','box',
    'product','amazon','price','order','buy','bought','store','free',
    'dog'
}


all_stopwords = default_stopwords.union(domain_stopwords)

# Simple cleaner with basic negation handling
def clean_text(text: str) -> str:
    if not isinstance(text, str):
        return ""
    text = re.sub(r'<.*?>', ' ', text)            # remove HTML tag
    text = re.sub(r"n['’]t", " not", text)        # convert n't -> not
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)      # keep letters/space
    text = text.lower()                           # lowercase
    text = re.sub(r'\s+', ' ', text).strip()      # remove extra spaces
    return text

# Preprocess
train_x_cleaned = train_x.apply(clean_text)
test_x_cleaned  = test_x.apply(clean_text)

# TF-IDF
tfidf = TfidfVectorizer(
    stop_words=list(all_stopwords),
    ngram_range=(1, 2),
    max_features=10000,
    min_df=5,
    max_df=0.7,
    sublinear_tf=True
)

train_x_vector = tfidf.fit_transform(train_x_cleaned)
test_x_vector  = tfidf.transform(test_x_cleaned)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yikho\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
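
One caveat with the cleaner above: it rewrites "n't" as "not", but a stop word list that itself contains "not" will remove the word again before vectorization. NLTK's English list appears to include negation words such as "not" and "no" (worth checking in your installed version). The sketch below uses a small hypothetical stop word set to show the effect, and one possible way to keep negations; it is not the configuration used to produce the results in this notebook.

```python
import re

def clean_text(text):
    """Compact copy of the notebook's cleaning steps."""
    text = re.sub(r'<.*?>', ' ', text)        # remove HTML tags
    text = re.sub(r"n['’]t", " not", text)    # convert n't -> not
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)  # keep letters and spaces
    return re.sub(r'\s+', ' ', text.lower()).strip()

# Hypothetical miniature stop word list that, like many general-purpose lists,
# contains negation words
stop_words_with_negation = {"the", "was", "is", "not", "no"}
negations = {"not", "no", "nor", "never"}
stop_words_keeping_negation = stop_words_with_negation - negations

def content_tokens(text, stop_words):
    """Clean the text and drop every token found in stop_words."""
    return [t for t in clean_text(text).split() if t not in stop_words]

review = "The soup wasn't fresh"
print(content_tokens(review, stop_words_with_negation))     # negation is lost
print(content_tokens(review, stop_words_keeping_negation))  # negation survives
```

With bigrams enabled (`ngram_range=(1, 2)`), keeping "not" would let the vectorizer form features such as "not fresh", which may help separate negative from positive reviews.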


# Training and Classification using Linear SVC
## Group Member: Tang Yik Hong

The Linear SVC algorithm works by:
- Representing each document (review) as a vector of features (e.g., TF-IDF scores)
- Finding a linear hyperplane that separates classes (positive, neutral, negative) in this high-dimensional space
- Maximizing the margin, which is the distance between the hyperplane and the nearest data points (support vectors)
- Using only these support vectors to define the decision boundary while ignoring less critical points
- For multi-class problems, applying a one-vs-rest strategy, training one classifier per class and choosing the class with the highest score
- Producing a final decision boundary that generalizes well to new, unseen reviews

Each decision boundary in the Linear SVC divides the data using a linear hyperplane, separating classes while maximizing the margin between them until the optimal boundary is found.
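
The one-vs-rest decision rule can be illustrated with a few lines of plain Python: each class has its own weight vector and bias, and the predicted class is the one whose hyperplane gives the highest score. The weights, biases, and three-feature vocabulary below are hypothetical values chosen for illustration, not parameters learned by the model in this notebook.

```python
def decision_scores(x, weights, biases):
    """Compute w . x + b for every class (one linear classifier per class)."""
    return {
        label: sum(w_i * x_i for w_i, x_i in zip(w, x)) + biases[label]
        for label, w in weights.items()
    }

def predict(x, weights, biases):
    """One-vs-rest prediction: pick the class with the highest score."""
    scores = decision_scores(x, weights, biases)
    return max(scores, key=scores.get)

# Hypothetical weights over three features: ["delicious", "stale", "okay"]
weights = {
    "positive": [1.5, -1.0, -0.2],
    "negative": [-1.2, 1.4, -0.3],
    "neutral": [-0.4, -0.3, 0.9],
}
biases = {"positive": 0.0, "negative": -0.1, "neutral": -0.2}

x = [0.0, 0.8, 0.1]  # a review dominated by the "stale" feature
print(decision_scores(x, weights, biases))
print(predict(x, weights, biases))
```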

In [13]:
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from tqdm import tqdm
import numpy as np
import time

# Base model (more iters helps convergence)
base_svc = LinearSVC(random_state=42, max_iter=5000)

# Calibrated for predict_proba
svm_model = CalibratedClassifierCV(base_svc, cv=5, method='isotonic')  # isotonic suits large datasets; method='sigmoid' is the usual choice for smaller ones

print("Training LinearSVC + Calibrated probabilities model...")
with tqdm(total=100) as pbar:
    svm_model.fit(train_x_vector, train_y)
    for _ in range(100):  # cosmetic progress bar
        time.sleep(0.01)
        pbar.update(1)

def predict_with_confidence(text, model, vectorizer):
    cleaned = clean_text(text)
    X = vectorizer.transform([cleaned])
    probs = model.predict_proba(X)[0]          # (n_classes,)
    classes = model.classes_                   # e.g. ['negative','neutral','positive']
    pred_idx = int(np.argmax(probs))
    return classes[pred_idx], float(probs[pred_idx])

# Food-domain examples
test_reviews = [
    "The pasta was incredible and the sauce tasted fresh.",
    "Shipping was slow and the snacks arrived stale.",
    "It’s okay—nothing special but not terrible either.",
    "Absolutely delicious cookies, will buy again!",
    "The seasoning was bland and the texture was mushy."
]

print("\nPredictions for example reviews:")
for review in test_reviews:
    sentiment, confidence = predict_with_confidence(review, svm_model, tfidf)
    print(f"Review: '{review}'")
    print(f"Prediction: {sentiment} (Confidence: {confidence:.4f})")
    print("-" * 50)




Training LinearSVC + Calibrated probabilities model...


100%|██████████| 100/100 [00:52<00:00,  1.90it/s]


Predictions for example reviews:
Review: 'The pasta was incredible and the sauce tasted fresh.'
Prediction: positive (Confidence: 0.8651)
--------------------------------------------------
Review: 'Shipping was slow and the snacks arrived stale.'
Prediction: negative (Confidence: 0.6962)
--------------------------------------------------
Review: 'It’s okay—nothing special but not terrible either.'
Prediction: negative (Confidence: 0.5214)
--------------------------------------------------
Review: 'Absolutely delicious cookies, will buy again!'
Prediction: positive (Confidence: 0.9747)
--------------------------------------------------
Review: 'The seasoning was bland and the texture was mushy.'
Prediction: negative (Confidence: 0.8757)
--------------------------------------------------





# Training and Classification using Multinomial Naive Bayes Classifier
## Group Member: Chai Ee Yuan 
The Naïve Bayes algorithm is a supervised learning method based on Bayes’ Theorem. It is widely used for text classification tasks due to its simplicity and efficiency in handling high-dimensional data like word features.

The Multinomial Naive Bayes classifier works by:
- Representing each document as a vector of word counts or TF-IDF features
- Assuming that features (words) are conditionally independent given the class label
- Estimating the probability of a review belonging to each class using Bayes’ Theorem
- Selecting the class with the highest posterior probability as the predicted label
- Performing well even with relatively small amounts of training data, because the independence assumption keeps the number of parameters to estimate small
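
The steps above can be sketched as a tiny from-scratch classifier. This is a simplified illustration using raw word counts and Laplace (add-one) smoothing; the function names and toy documents are hypothetical, and it is not the scikit-learn implementation used in this notebook.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized docs."""
    vocab = sorted({w for doc in docs for w in doc})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Pick the class with the highest posterior; unseen words are skipped."""
    scores = {
        c: log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
        for c in log_prior
    }
    return max(scores, key=scores.get)

docs = [
    ["delicious", "fresh"],
    ["delicious", "crispy"],
    ["stale", "soggy"],
    ["stale", "bland"],
]
labels = ["positive", "positive", "negative", "negative"]

log_prior, log_likelihood = train_nb(docs, labels)
print(predict_nb(["delicious", "cookies"], log_prior, log_likelihood))
print(predict_nb(["stale", "bland"], log_prior, log_likelihood))
```

Working in log space turns the product of per-word probabilities into a sum, which avoids numerical underflow on long reviews.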

In [14]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from tqdm import tqdm
import time
import numpy as np

# ---- Train Multinomial Naive Bayes with a tqdm bar ----
# NOTE: MultinomialNB accepts sparse input, so the TF-IDF matrix is used directly
# (a dense copy of ~176,000 x 10,000 features would need many GB of RAM).

mnb = MultinomialNB()

print("Training Multinomial Naive Bayes model...")
with tqdm(total=100) as pbar:
    mnb.fit(train_x_vector, train_y)  # actual training
    # show a 100-step bar (cosmetic; fit itself doesn't report progress)
    for _ in range(100):
        time.sleep(0.02)  # adjust to taste for the bar speed
        pbar.update(1)

# ---- Prediction helper (multiclass-safe) ----
def predict_with_confidence(text, model, vectorizer):
    cleaned = clean_text(text)
    X = vectorizer.transform([cleaned]).toarray()
    probs = model.predict_proba(X)[0]            # shape (n_classes,)
    classes = model.classes_                     # e.g., ['negative','neutral','positive']
    pred_idx = int(np.argmax(probs))
    return classes[pred_idx], float(probs[pred_idx])

# ---- Example reviews (use any list you like) ----
test_reviews = [
    "The cookies were amazing, fresh and crispy!",
    "The chips arrived stale and the package was damaged.",
    "It was okay, nothing special but not too bad either.",
    "Absolutely delicious chocolate. I will order again.",
    "The soup had no flavor and the noodles were soggy."
]


# ---- Print predictions in your requested format ----
print("\nPredictions for example reviews:")
for review in test_reviews:
    sentiment, confidence = predict_with_confidence(review, mnb, tfidf)
    print(f"Review: '{review}'")
    print(f"Prediction: {sentiment} (Confidence: {confidence:.4f})")
    print("-" * 50)


Training Multinomial Naive Bayes model...


100%|██████████| 100/100 [00:06<00:00, 15.35it/s]


Predictions for example reviews:
Review: 'The cookies were amazing, fresh and crispy!'
Prediction: positive (Confidence: 0.7846)
--------------------------------------------------
Review: 'The chips arrived stale and the package was damaged.'
Prediction: negative (Confidence: 0.6761)
--------------------------------------------------
Review: 'It was okay, nothing special but not too bad either.'
Prediction: neutral (Confidence: 0.7201)
--------------------------------------------------
Review: 'Absolutely delicious chocolate. I will order again.'
Prediction: positive (Confidence: 0.9326)
--------------------------------------------------
Review: 'The soup had no flavor and the noodles were soggy.'
Prediction: negative (Confidence: 0.4319)
--------------------------------------------------





# Training and Classification using Logistic Regression
## Group Member: Teh Wei Zhang

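Logistic regression is a statistical method that predicts class probabilities from a linear combination of features. In the binary case it applies the sigmoid function to map the linear score to a probability between 0 and 1. For the three sentiment classes used here, the model produces one score per class and normalizes the scores into probabilities (scikit-learn's default multinomial formulation uses the softmax function). The parameters are fitted by iteratively minimizing the log loss with a gradient-based optimizer (L-BFGS by default in scikit-learn), and the class with the highest probability becomes the prediction.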
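
Since this task has three classes rather than two, it may help to see how the sigmoid generalizes. The sketch below, using hypothetical per-class scores, turns linear scores into probabilities with the softmax function and checks that softmax over two classes reduces to the sigmoid of the score difference.

```python
import math

def sigmoid(z):
    """Binary case: map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Multiclass case: normalize per-class scores into probabilities."""
    shift = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - shift) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical linear scores (w . x + b) for one review
scores = {"negative": -0.5, "neutral": 0.2, "positive": 1.7}
probs = softmax(scores)
prediction = max(probs, key=probs.get)

print({label: round(p, 3) for label, p in probs.items()})
print(prediction)
```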


In [15]:
from sklearn.linear_model import LogisticRegression
from tqdm import tqdm
import time
import numpy as np

# (Optional) make sure the solver converges comfortably
log_reg = LogisticRegression(max_iter=200)

print("Training Logistic Regression model...")
with tqdm(total=100) as pbar:
    log_reg.fit(train_x_vector, train_y)
    for _ in range(100):  # cosmetic bar like your example
        time.sleep(0.01)
        pbar.update(1)

# Multiclass-safe prediction; confidence is the predicted class probability
def predict_with_confidence(text, model, vectorizer):
    # clean with the SAME function you used for training
    cleaned = clean_text(text)
    X = vectorizer.transform([cleaned])
    probs = model.predict_proba(X)[0]           # shape: (n_classes,)
    classes = model.classes_                    # e.g. ['negative','neutral','positive']
    pred_idx = int(np.argmax(probs))
    pred_label = classes[pred_idx]
    return pred_label, float(probs[pred_idx]), dict(zip(classes, probs))

# Pretty printer to match your formatting
def show_predictions(reviews, model, vectorizer):
    print("\nPredictions for example reviews:")
    for r in reviews:
        label, conf, _ = predict_with_confidence(r, model, vectorizer)
        print(f"Review: '{r}'")
        print(f"Prediction: {label} (Confidence: {conf:.4f})")
        print("-" * 50)

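# Out-of-domain examples (movie reviews): the model was trained on food reviews,
# so these probe how well it transfers; replace with your own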
test_reviews = [
    "Just saw the new Marvel movie. OMG it was amazing!! Can't wait for the sequel!",
    "What a waste of money. Boring plot, terrible acting, fell asleep halfway through.",
    "This movie literally changed my life. I've seen it 3 times already!",
    "The director should be embarrassed. Worst movie I've seen this year by far.",
    "I laughed so hard my sides hurt. This comedy is an instant classic.",
    "I wanted to walk out after 30 minutes. How did this movie even get made?"
]

show_predictions(test_reviews, log_reg, tfidf)


Training Logistic Regression model...


100%|██████████| 100/100 [00:07<00:00, 13.19it/s]


Predictions for example reviews:
Review: 'Just saw the new Marvel movie. OMG it was amazing!! Can't wait for the sequel!'
Prediction: positive (Confidence: 0.9945)
--------------------------------------------------
Review: 'What a waste of money. Boring plot, terrible acting, fell asleep halfway through.'
Prediction: negative (Confidence: 0.9911)
--------------------------------------------------
Review: 'This movie literally changed my life. I've seen it 3 times already!'
Prediction: negative (Confidence: 0.6490)
--------------------------------------------------
Review: 'The director should be embarrassed. Worst movie I've seen this year by far.'
Prediction: negative (Confidence: 0.9622)
--------------------------------------------------
Review: 'I laughed so hard my sides hurt. This comedy is an instant classic.'
Prediction: negative (Confidence: 0.4901)
--------------------------------------------------
Review: 'I wanted to walk out after 30 minutes. How did this movie even get ma




# Comparing models' performance
Using the scikit-learn library, we can evaluate the accuracy of each machine learning model on the test dataset. By printing the accuracy scores of Linear SVC, Multinomial Naive Bayes, and Logistic Regression, we can directly compare their performance on the sentiment classification task.

- The model with the highest accuracy will be considered the most effective for predicting whether a food review is positive, neutral, or negative.
- This comparison ensures we select a classifier that balances efficiency with predictive reliability for large-scale text data such as the Amazon Fine Food Reviews dataset.
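
When reading the accuracy scores, a useful reference point is the accuracy of a trivial classifier that always predicts the largest class. The sketch below computes that baseline from the sampled class counts given earlier (100,000 / 80,000 / 40,000); because the split is stratified, the test set has the same proportions. The helper name is hypothetical.

```python
# Class counts of the sampled dataset used in this notebook
class_counts = {"positive": 100_000, "negative": 80_000, "neutral": 40_000}

def majority_baseline_accuracy(counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())

baseline = majority_baseline_accuracy(class_counts)
print(f"Majority-class baseline accuracy: {baseline:.4f}")
```

Any model worth keeping should score well above this baseline of roughly 45%.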

In [16]:
print("SVM accuracy:", svm_model.score(test_x_vector, test_y))
print("Multinomial Naive Bayes accuracy:", mnb.score(test_x_vector.toarray(), test_y))
print("Logistic Regression accuracy:", log_reg.score(test_x_vector, test_y))

SVM accuracy: 0.7827727272727273
Multinomial Naive Bayes accuracy: 0.7357727272727272
Logistic Regression accuracy: 0.7808636363636363


# F1 scores for all models
The F1 score is a robust evaluation metric for classification tasks, especially when dealing with imbalanced datasets. Unlike accuracy, which only measures the percentage of correct predictions, the F1 score combines both precision and recall into a single measure:

- Precision: Of the reviews predicted as a given class (positive, negative, or neutral), how many were actually correct?

- Recall: Of all the reviews that truly belong to a class, how many did the model correctly identify?

- F1 Score: The harmonic mean of precision and recall, balancing both metrics.

By generating classification reports for SVM, Multinomial Naive Bayes, and Logistic Regression, we can directly compare not just accuracy, but also how well each model balances false positives and false negatives.

This evaluation provides a clearer picture of performance for food review sentiment analysis, ensuring the chosen model performs reliably across all three classes (positive, neutral, negative), rather than being biased toward the majority class.
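
The F1 score and the two averages that appear in a classification report can be reproduced by hand. The sketch below uses the rounded per-class F1 values and supports from the SVM report in this notebook; the helper names are hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_average(values):
    """Unweighted mean: every class counts equally."""
    return sum(values) / len(values)

def weighted_average(values, supports):
    """Mean weighted by the number of true samples in each class."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)

# SVM per-class F1 and support, in the order positive, negative, neutral
f1_values = [0.85, 0.81, 0.50]
supports = [20_000, 16_000, 8_000]

print("neutral F1 from P=0.63, R=0.41:", round(f1_from(0.63, 0.41), 2))
print("macro avg F1:", round(macro_average(f1_values), 2))
print("weighted avg F1:", round(weighted_average(f1_values, supports), 2))
```

The macro average is pulled down by the weak neutral class, whereas the weighted average is dominated by the larger positive and negative classes.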

In [17]:
from sklearn.metrics import classification_report
# SVM classification report
print("SVM Classification Report:")
print(classification_report(test_y, 
                           svm_model.predict(test_x_vector), 
                           labels=['positive', 'negative','neutral']))

# Multinomial Naive Bayes classification report
print("Multinomial Naive Bayes accuracy Classification Report:")
print(classification_report(test_y, 
                           mnb.predict(test_x_vector.toarray()), 
                           labels=['positive', 'negative','neutral']))

# Logistic Regression classification report
print("Logistic Regression Classification Report:")
print(classification_report(test_y, 
                           log_reg.predict(test_x_vector),
                           labels=['positive', 'negative','neutral']))

SVM Classification Report:
              precision    recall  f1-score   support

    positive       0.82      0.89      0.85     20000
    negative       0.78      0.84      0.81     16000
     neutral       0.63      0.41      0.50      8000

    accuracy                           0.78     44000
   macro avg       0.74      0.71      0.72     44000
weighted avg       0.77      0.78      0.77     44000

Multinomial Naive Bayes accuracy Classification Report:
              precision    recall  f1-score   support

    positive       0.74      0.90      0.82     20000
    negative       0.75      0.79      0.77     16000
     neutral       0.61      0.22      0.32      8000

    accuracy                           0.74     44000
   macro avg       0.70      0.64      0.63     44000
weighted avg       0.72      0.74      0.71     44000

Logistic Regression Classification Report:
              precision    recall  f1-score   support

    positive       0.82      0.89      0.85     20000
   

# Confusion Matrix for all models

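**A confusion matrix is a table that visualizes the performance of a classifier. For a binary problem it has two rows and two columns; with three sentiment classes it is a 3×3 table in which each row is a true class and each column is a predicted class.**

For any one class, the four counts are read as follows:

- TP (True Positives): Reviews of that class predicted as that class (the diagonal cell)
- FP (False Positives): Reviews of other classes predicted as that class (the rest of its column)
- FN (False Negatives): Reviews of that class predicted as another class (the rest of its row)
- TN (True Negatives): Reviews of other classes not predicted as that class (all remaining cells)

Array represents (true class → predicted class, with rows and columns in the order positive, negative, neutral):

pos→pos  pos→neg  pos→neu

neg→pos  neg→neg  neg→neu

neu→pos  neu→neg  neu→neu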

In [18]:
from sklearn.metrics import confusion_matrix
# SVM confusion matrix
print("SVM Confusion Matrix:")
print(confusion_matrix(test_y, 
                      svm_model.predict(test_x_vector),
                      labels=['positive', 'negative','neutral']))

# For Multinomial Naive Bayes confusion matrix  
print("Multinomial Naive Bayes Confusion Matrix:")
print(confusion_matrix(test_y, 
                      mnb.predict(test_x_vector.toarray()),
                      labels=['positive', 'negative','neutral']))

# For Logistic Regression confusion matrix
print("Logistic Regression Confusion Matrix:")
print(confusion_matrix(test_y, 
                       log_reg.predict(test_x_vector), 
                       labels=['positive', 'negative','neutral']))

SVM Confusion Matrix:
[[17739  1383   878]
 [ 1564 13429  1007]
 [ 2298  2428  3274]]
Multinomial Naive Bayes Confusion Matrix:
[[18034  1514   452]
 [ 2743 12607   650]
 [ 3472  2795  1733]]
Logistic Regression Confusion Matrix:
[[17736  1310   954]
 [ 1609 13207  1184]
 [ 2250  2335  3415]]
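
The per-class precision and recall, and the overall accuracy, can be recovered directly from these matrices. The sketch below does this for the SVM matrix printed above (rows are true classes, columns are predicted classes); the helper names are hypothetical.

```python
LABELS = ["positive", "negative", "neutral"]
SVM_MATRIX = [
    [17739, 1383, 878],
    [1564, 13429, 1007],
    [2298, 2428, 3274],
]

def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall)} rounded to two decimals.

    Assumes every class is predicted at least once, so no division by zero.
    """
    metrics = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]                         # diagonal cell
        fn = sum(matrix[k]) - tp                  # rest of the row
        fp = sum(row[k] for row in matrix) - tp   # rest of the column
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        metrics[label] = (round(precision, 2), round(recall, 2))
    return metrics

def accuracy(matrix):
    """Fraction of all reviews that lie on the diagonal."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

svm_metrics = per_class_metrics(SVM_MATRIX, LABELS)
svm_accuracy = accuracy(SVM_MATRIX)
print(svm_metrics)
print(svm_accuracy)
```

The neutral row shows where the models struggle most: more neutral reviews are predicted as positive or negative than are predicted correctly.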


In [19]:
joblib.dump(svm_model, "svm_model.pkl")
joblib.dump(mnb, "mnb_model.pkl")
joblib.dump(log_reg, "log_reg_model.pkl")
joblib.dump(tfidf, "vectorizer.pkl")

['vectorizer.pkl']