# NLP Sentiment Analysis

Perform sentiment analysis using Python's NLTK (Natural Language Toolkit) library.

Use the `movie_reviews` corpus, which contains 2,000 movie reviews pre-labeled as either "positive" or "negative."

Build a **Naive Bayes classifier** - a common and effective model for text classification.

### Step 1: Download NLTK Data

In [4]:
# Import the nltk module
import nltk

# Download data
nltk.download('movie_reviews') # corpus of reviews
nltk.download('stopwords') # corpus of stopwords
nltk.download('punkt') # tokenizer model

[nltk_data] Error loading movie_reviews: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1028)>
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1028)>
[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1028)>


False
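
The output above shows all three downloads failing with an SSL certificate error, and the final `False` is the return value of the last `nltk.download` call. None of the later steps will work until the data is actually present, so it can help to fail loudly instead of continuing. Below is a minimal sketch, assuming `nltk.download` returns a falsy value on failure as it does in the log above. `download_or_fail` and its `download` parameter are hypothetical names introduced here, not part of NLTK.

```python
def download_or_fail(packages, download=None):
    """Hypothetical helper: download each package and raise if any download fails."""
    if download is None:
        # Imported lazily so the helper itself has no hard dependency on NLTK
        import nltk
        download = nltk.download
    # Keep the names of the packages whose download reported failure
    failed = [name for name in packages if not download(name)]
    if failed:
        raise RuntimeError(f"Could not download NLTK data: {', '.join(failed)}")
    return list(packages)

# Usage in the notebook (needs a working network and certificate setup):
# download_or_fail(['movie_reviews', 'stopwords', 'punkt'])
```

Fixing the certificate problem itself depends on the local Python installation and is outside the scope of this sketch.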

### Step 2: Loading and Preparing the Data

We will load the `movie_reviews` corpus. It is organized as a list of file IDs, each filed under one of two categories, `pos` or `neg`. From these we'll create a list of `(review, sentiment)` tuples, where each review is a list of word tokens.

```python
import random
from nltk.corpus import movie_reviews

# Build a list of (list_of_words, category) tuples, one per review
documents = []
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append((list(movie_reviews.words(fileid)), category))

# Shuffle so positive and negative reviews are mixed before the train/test split
random.shuffle(documents)

# Display results
print(f"Loaded {len(documents)} documents.")
print("First document (first 20 words):", documents[0][0][:20])
print("Sentiment:", documents[0][1])
```

-----

### Step 3: Text Preprocessing and Feature Extraction

A machine learning model can't understand raw text. We need to convert each review into a numerical format. We will use a **"Bag-of-Words"** model.

Our "features" will be a dictionary for each review, where the keys are the most common words in the *entire* dataset, and the values are `True` or `False` depending on whether that word is in the review.
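
To make the feature dictionary concrete, here is a minimal sketch on a toy vocabulary. The three-word vocabulary, the sample tokens, and the helper name `presence_features` are made up for illustration. The tutorial itself uses a much larger vocabulary.

```python
# Hypothetical three-word vocabulary standing in for the most common words
toy_vocabulary = ["great", "boring", "plot"]

def presence_features(tokens, vocabulary):
    """Map each vocabulary word to True/False depending on whether it occurs in tokens."""
    token_set = set(tokens)
    return {word: word in token_set for word in vocabulary}

review_tokens = ["a", "great", "film", "with", "a", "great", "plot"]
print(presence_features(review_tokens, toy_vocabulary))
# {'great': True, 'boring': False, 'plot': True}
```

Note that "great" appears twice in the review but is still recorded only as `True`: this representation tracks presence, not counts.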

**1. Clean and Tokenize Words**
First, let's get a clean list of all words from all reviews. We'll convert them to lowercase and filter out **stopwords** (common words like "the", "is", "a") and punctuation, as they don't carry much sentiment.

```python
from nltk.corpus import stopwords
import string

# Get English stopwords
stop_words = set(stopwords.words('english'))

# Get all words from all reviews, make them lowercase, and remove stopwords/punctuation
all_words = []
for w_list, _ in documents:
    for w in w_list:
        w = w.lower()  # lowercase once, then reuse
        if w not in stop_words and w not in string.punctuation:
            all_words.append(w)

# Get the frequency distribution of all words
all_words_freq = nltk.FreqDist(all_words)

# Print the 20 most common words
print("Most common words:", all_words_freq.most_common(20))

# We will use the top 3000 most common words as our features
word_features = [item[0] for item in all_words_freq.most_common(3000)]
```

**2. Create the Feature Sets**
Now, we create a function that will take a review and create the feature dictionary we described.

```python
def find_features(document_words):
    """
    Takes a list of words from a review and returns a dictionary
    of features indicating which of the top 3000 words are present.
    """
    words_in_doc = set(document_words)
    features = {}
    for w in word_features:
        features[w] = (w in words_in_doc)
    return features

# Create feature sets for all our documents
featuresets = [(find_features(rev), category) for (rev, category) in documents]

# Example of one feature set
print("\n--- Example Feature Set ---")
print(featuresets[0][0])
print("Sentiment:", featuresets[0][1])
```

-----

### Step 4: Training the Naive Bayes Classifier

With our data correctly formatted into `featuresets`, we can now train our classifier. We'll split the data into a **training set** (to teach the model) and a **test set** (to evaluate its performance on unseen data).

A common split is 80% for training and 20% for testing. Our 2,000 documents will be split into 1,600 for training and 400 for testing.
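
The 1,600/400 figures follow directly from the 80% ratio. As a small sketch, the cut point can be computed from the ratio rather than hardcoded, which keeps the split correct if the number of documents changes. `split_by_ratio` is a hypothetical helper, and it assumes the list has already been shuffled.

```python
def split_by_ratio(items, train_ratio=0.8):
    """Hypothetical helper: split a pre-shuffled list into (train, test) slices."""
    if not 0 < train_ratio < 1:
        raise ValueError("train_ratio must be between 0 and 1")
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]

# Stand-in for the 2,000 shuffled feature sets
toy_items = list(range(2000))
train_part, test_part = split_by_ratio(toy_items)
print(len(train_part), len(test_part))  # 1600 400
```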

```python
# Split the data into training and testing sets
training_set = featuresets[:1600]
testing_set = featuresets[1600:]

# Train the Naive Bayes classifier
print("\nTraining the classifier...")
classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Classifier trained successfully.")
```
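
To give a feel for what training estimates, here is a toy Naive Bayes classifier over boolean features, written with the standard library only. It is a simplified sketch, not NLTK's implementation: the function names, the add-0.5 smoothing, and the four-item training set are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_toy_nb(labeled_featuresets):
    """Count labels and, per label, how often each boolean feature is True."""
    label_counts = Counter()
    true_counts = defaultdict(Counter)
    feature_names = set()
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, present in features.items():
            feature_names.add(name)
            if present:
                true_counts[label][name] += 1
    return label_counts, true_counts, sorted(feature_names)

def classify_toy_nb(model, features, smoothing=0.5):
    """Return the label with the highest log prior plus summed log likelihoods."""
    label_counts, true_counts, feature_names = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior P(label)
        for name in feature_names:
            # Smoothed estimate of P(feature is True | label); smoothing avoids log(0)
            p_true = (true_counts[label][name] + smoothing) / (n + 2 * smoothing)
            score += math.log(p_true if features.get(name, False) else 1 - p_true)
        scores[label] = score
    return max(scores, key=scores.get)

# Invented training data in the same (features, label) shape as `featuresets`
toy_training_set = [
    ({"great": True, "boring": False}, "pos"),
    ({"great": True, "boring": False}, "pos"),
    ({"great": False, "boring": True}, "neg"),
    ({"great": False, "boring": True}, "neg"),
]
toy_model = train_toy_nb(toy_training_set)
print(classify_toy_nb(toy_model, {"great": True, "boring": False}))  # pos
print(classify_toy_nb(toy_model, {"great": False, "boring": True}))  # neg
```

The "naive" part is the inner loop: each feature contributes its own log likelihood independently of the others.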

-----

### Step 5: Evaluating the Model

Now that the classifier is trained, let's see how well it performs on the `testing_set` it has never seen before.

```python
# Evaluate the classifier
accuracy = nltk.classify.accuracy(classifier, testing_set) * 100
print(f"\nClassifier Accuracy: {accuracy:.2f}%")

# Show the most informative features
# These are the words the classifier found most indicative of a 'pos' or 'neg' review
print("\n--- Most Informative Features ---")
classifier.show_most_informative_features(20)
```

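You will likely see an accuracy between 75% and 85%; the exact figure varies from run to run because `random.shuffle` changes which reviews end up in the test set. The "most informative features" output shows which words most strongly predict a positive or negative review. For example, `outstanding = True` might have a high `pos:neg` ratio, meaning the word appears far more often in positive reviews than in negative ones.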
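
To make the `pos:neg` ratio concrete, here is a small sketch of how such a ratio can be computed from counts. The counts are invented for illustration, `likelihood_ratio` is a hypothetical helper, and the add-0.5 smoothing is an assumption about how the estimates are smoothed, so the numbers NLTK prints may differ.

```python
def likelihood_ratio(pos_with, pos_total, neg_with, neg_total, smoothing=0.5):
    """Ratio of smoothed P(word present | pos) to smoothed P(word present | neg)."""
    p_pos = (pos_with + smoothing) / (pos_total + 2 * smoothing)
    p_neg = (neg_with + smoothing) / (neg_total + 2 * smoothing)
    return p_pos / p_neg

# Invented counts: the word appears in 40 of 800 positive and 4 of 800 negative reviews
ratio = likelihood_ratio(pos_with=40, pos_total=800, neg_with=4, neg_total=800)
print(f"pos : neg = {ratio:.1f} : 1.0")  # pos : neg = 9.0 : 1.0
```

A ratio well above 1 means the feature is much more likely under the first label than under the second, which is why such words rank as informative.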

-----

### Step 6: Using Your Trained Model to Classify New Text

Finally, let's use our new classifier to predict the sentiment of any new sentence. We must remember to apply the **exact same preprocessing** (tokenizing, lowercasing, etc.) and feature extraction as we did for our training data.

```python
def classify_sentiment(text):
    """
    Classifies a new piece of text.
    """
    # 1. Tokenize the text
    words = nltk.word_tokenize(text)
    
    # 2. Clean the words (lowercase, remove stopwords/punctuation)
    clean_words = []
    for w in words:
        w = w.lower()  # lowercase once, then reuse
        if w not in stop_words and w not in string.punctuation:
            clean_words.append(w)
    
    # 3. Extract features using the same function as before
    features = find_features(clean_words)
    
    # 4. Classify
    return classifier.classify(features)

# --- Test with new sentences ---
test_sentence_1 = "This was an amazing movie! The acting was superb and the plot was thrilling."
print(f"'{test_sentence_1}' -> {classify_sentiment(test_sentence_1)}")

test_sentence_2 = "I was really bored. The entire film felt slow and predictable."
print(f"'{test_sentence_2}' -> {classify_sentiment(test_sentence_2)}")

test_sentence_3 = "The movie was okay, not great but not terrible."
print(f"'{test_sentence_3}' -> {classify_sentiment(test_sentence_3)}")
```