IMDB Dataset of 50K Movie Reviews (Kaggle): https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Step 1: Install Required Libraries
First, ensure you have the necessary libraries installed:

In [None]:
%pip install nltk scikit-learn



Step 2: Import Libraries and Load Dataset
We'll use the movie_reviews corpus from NLTK, which contains 2,000 reviews labeled pos or neg:

In [None]:
import nltk
from nltk.corpus import movie_reviews
import random
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report

# Download necessary NLTK data files
nltk.download('movie_reviews')
nltk.download('punkt')

# Load movie reviews dataset
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle so the reviews are no longer grouped by category
random.shuffle(documents)


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Step 3: Preprocess Data
Join each review's tokens back into a single string, separate the labels, and split the data into training and test sets:

In [None]:
# Separate review texts (tokens rejoined into one string) and labels
texts = [" ".join(words) for words, _ in documents]
labels = [category for _, category in documents]

# Split data into training and test sets
texts_train, texts_test, labels_train, labels_test = train_test_split(texts, labels, test_size=0.2, random_state=42)
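
The split above does not pass `stratify`, so the class balance of the test set is left to chance. As a minimal sketch, the toy data below (hypothetical stand-ins for `texts` and `labels`) shows how `stratify` keeps the pos/neg ratio identical in both splits:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real texts and labels.
toy_texts = [f"review {i}" for i in range(20)]
toy_labels = ["pos"] * 10 + ["neg"] * 10

# stratify=toy_labels preserves the class proportions in both splits.
toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

train_counts = Counter(toy_train_y)
test_counts = Counter(toy_test_y)
print("train:", dict(train_counts))
print("test:", dict(test_counts))
```

In the notebook itself, the equivalent change would be adding `stratify=labels` to the existing `train_test_split` call.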


Step 4: Build the Model Pipeline
We'll use a pipeline to streamline the process of transforming data and training the model:

In [None]:
# Create a pipeline with a TfidfVectorizer and a MultinomialNB classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])

# Train the model
pipeline.fit(texts_train, labels_train)


Step 5: Evaluate the Model
Assess the model's performance using the test set:

In [None]:
# Make predictions
labels_pred = pipeline.predict(texts_test)

# Calculate accuracy
accuracy = accuracy_score(labels_test, labels_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

# Print detailed classification report
print(classification_report(labels_test, labels_pred))


Accuracy: 77.50%
              precision    recall  f1-score   support

         neg       0.73      0.88      0.80       200
         pos       0.84      0.68      0.75       200

    accuracy                           0.78       400
   macro avg       0.79      0.78      0.77       400
weighted avg       0.79      0.78      0.77       400
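
The report shows a noticeably higher recall for neg than for pos. A confusion matrix makes that kind of imbalance easier to read as raw counts. The sketch below uses small hypothetical label lists rather than the notebook's actual predictions:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_true = ["neg", "neg", "neg", "pos", "pos", "pos"]
y_pred = ["neg", "neg", "pos", "pos", "neg", "neg"]
class_names = ["neg", "pos"]

# Rows are true classes, columns are predicted classes, in class_names order.
cm = confusion_matrix(y_true, y_pred, labels=class_names)
print(cm)

# Recall per class = correct predictions for that class / all true members of it.
recall_per_class = {
    name: cm[i, i] / cm[i].sum() for i, name in enumerate(class_names)
}
print(recall_per_class)
```

In the notebook, the same call would be `confusion_matrix(labels_test, labels_pred, labels=["neg", "pos"])`.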



Step 6: Use the Model to Classify New Text
You can now use the trained model to classify new text:

In [None]:
new_text = "I absolutely loved this movie! The acting was superb and the plot was thrilling."
# The pipeline applies the TF-IDF transform and the classifier in one call
predicted_sentiment = pipeline.predict([new_text])
print(f'Sentiment: {predicted_sentiment[0]}')


Sentiment: pos
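
Beyond the predicted label, `MultinomialNB` can also report class probabilities through `predict_proba`. The sketch below trains a small stand-in pipeline on hypothetical sentences so that it runs on its own; the names `toy_texts`, `toy_labels`, and `toy_pipeline` are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training sentences standing in for the review corpus.
toy_texts = [
    "loved it, superb acting",
    "thrilling and wonderful plot",
    "boring and awful",
    "terrible plot, hated it",
]
toy_labels = ["pos", "pos", "neg", "neg"]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
])
toy_pipeline.fit(toy_texts, toy_labels)

# predict_proba returns one row per input text, one column per class.
probabilities = toy_pipeline.predict_proba(["superb and thrilling"])[0]
for label, probability in zip(toy_pipeline.classes_, probabilities):
    print(f"{label}: {probability:.2f}")
```

With the trained `pipeline` from Step 4, the equivalent call would be `pipeline.predict_proba([new_text])`.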


Optional Enhancements
To enhance this project, we can:

Use a larger, more diverse dataset: The NLTK corpus has only 2,000 reviews. The IMDB dataset of 50K movie reviews linked at the top is a natural next step.

Try different machine learning algorithms: Experiment with algorithms like Support Vector Machines (SVM), Random Forest, or even deep learning models.

Optimize hyperparameters: Use techniques like GridSearchCV to find the best parameters for your model.
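
As a rough sketch of the last two suggestions combined, the example below swaps the classifier for a linear SVM (`LinearSVC`) and tunes it with `GridSearchCV`. The toy sentences, the parameter grid, and `cv=2` are assumptions chosen so the example runs quickly; real values would need to be picked for the actual dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; in the notebook, texts_train and labels_train would be used.
toy_texts = [
    "a wonderful and moving film",
    "superb acting and a great plot",
    "loved every minute of it",
    "a thrilling and fun ride",
    "a boring and awful film",
    "terrible acting and a weak plot",
    "hated every minute of it",
    "a dull and tedious mess",
]
toy_labels = ["pos"] * 4 + ["neg"] * 4

svm_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", LinearSVC()),
])

# Parameter names follow the "<step name>__<parameter>" convention.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "classifier__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(svm_pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
```

After fitting, `search` behaves like the best pipeline found, so `search.predict([...])` can replace `pipeline.predict([...])` in Step 6.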