# Task for Course: DLBAIPNLP01 – Project: NLP

Import all the libraries we are going to use. The imports could be placed where each library is actually needed, but I prefer to keep them together so it is easier to spot missing or unused ones.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

Load all the data we need to work with. This includes the stopword list and the training dataset.

In [2]:
# download the stopword set
nltk.download('stopwords')

# Load the dataset
df = pd.read_csv('movie_comment_data_labeled/IMDB Dataset.csv')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Thomas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Do some initial investigation and compute descriptive statistics on the data we are working with.

In [3]:
df['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [4]:
# count words in the text
def count_words(text):
    return len(text.split())

df['Word_Counts'] = df['review'].apply(count_words)

# Calculate minimum, maximum, and average number of words
min_words = df['Word_Counts'].min()
max_words = df['Word_Counts'].max()
average_words = df['Word_Counts'].mean()

print(f"Minimum number of words: {min_words}")
print(f"Maximum number of words: {max_words}")
print(f"Average number of words: {average_words:.2f}")

Minimum number of words: 4
Maximum number of words: 2470
Average number of words: 231.16
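
With a maximum this far above the average, the mean alone may not describe a typical review well. The following is a minimal sketch of how the median and quartiles could complement it, using only the standard library. The `word_counts` list is made up for illustration and is not taken from the dataset.

```python
import statistics

# Hypothetical word counts (illustration only, not real dataset values)
word_counts = [4, 120, 150, 170, 200, 230, 310, 2470]

mean_words = statistics.mean(word_counts)
median_words = statistics.median(word_counts)
# Quartiles split the sorted counts into four equally sized groups
q1, q2, q3 = statistics.quantiles(word_counts, n=4)

print(f"Mean:   {mean_words:.2f}")
print(f"Median: {median_words:.2f}")
print(f"Q1/Q3:  {q1:.2f} / {q3:.2f}")
```

A single very long review pulls the mean well above the median here. On the real data, `df['Word_Counts'].describe()` reports count, mean, standard deviation, minimum, quartiles, and maximum in one call.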


To work with the text later, we need a stemmer. Earlier model builds showed that the Porter stemmer worked best for this project.

In [5]:
stemmer = PorterStemmer()
# Build the stopword set once instead of on every function call
stop_words = set(stopwords.words('english'))

# Final function after evaluation. It prepares ("normalizes") the text for later use:
# lowercase, remove punctuation, split into words, remove stopwords, and stem.
def preprocess_text_stem(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Split into words and remove stopwords
    words = text.split()
    filtered_words = [word for word in words if word not in stop_words]
    # Stemming
    stemmed_words = [stemmer.stem(word) for word in filtered_words]
    return ' '.join(stemmed_words)
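
To see what the normalization steps do to a single sentence, here is a small sketch that mirrors the lowercase, punctuation, and stopword steps. Stemming is left out to keep it dependency-free. `STOP_WORDS` is a tiny hypothetical stand-in for the NLTK list, and `normalize` is an illustrative helper, not part of the notebook.

```python
import string

# Hypothetical stand-in for NLTK's English stopword list (assumption: tiny subset,
# including a contraction in its apostrophe form)
STOP_WORDS = {"the", "was", "i", "it", "and", "a", "don't", "don"}

def normalize(text):
    # Same order as in preprocess_text_stem: lowercase, strip punctuation, filter
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in STOP_WORDS]

tokens = normalize("The movie was great, and I don't regret it!")
print(tokens)  # ['movie', 'great', 'dont', 'regret']
```

Because punctuation is removed before the stopword filter runs, "don't" becomes "dont" and no longer matches the stopword entry. Whether that matters for the model is an open question; it is just worth knowing that the order of the steps has this side effect.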

Apply our text preprocessing and split the dataset 80/20 (test_size=0.2). A fixed random_state makes the results reproducible on every run. It can be removed if we want some variation between runs, which may occasionally yield a slightly better model.

Prepare the vectorization, train the model, and report the metrics.

In [6]:
df['review'] = df['review'].apply(preprocess_text_stem)

# Prepare the common training split
X = df['review']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize our vectorization we want to stick with
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Model Training
model = LogisticRegression()
model.fit(X_train_vec, y_train)

# Model Evaluation
y_pred = model.predict(X_test_vec)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print('Classification Report:')
print(report)



Accuracy: 0.8863
Classification Report:
              precision    recall  f1-score   support

    negative       0.90      0.87      0.88      4961
    positive       0.87      0.90      0.89      5039

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000
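
To make the vectorization step less of a black box, here is a hand-rolled sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document. The three toy documents and the `tfidf` helper are assumptions for illustration. The real vectorizer additionally tokenizes with a regex and, with `max_features=5000`, keeps only the most frequent terms.

```python
import math
from collections import Counter

# Toy documents, assumed to be already preprocessed (illustration only)
docs = ["great movi great cast", "bad movi", "great fun"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc_tokens):
    counts = Counter(doc_tokens)
    # Term count times smoothed inverse document frequency
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    # L2-normalize so every document vector has length 1
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "great" gets the highest weight because it occurs twice, while "cast" outranks "movi" because it appears in fewer documents.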



Create another function that returns the prediction, so we don't need to call the preprocessing and vectorization ourselves every time we want to test the model manually.

In [7]:
def predict_sentiment(text):
    # Preprocess the text
    preprocessed_text = preprocess_text_stem(text)
    # Transform the text
    text_vector = vectorizer.transform([preprocessed_text])
    # Predict sentiment
    prediction = model.predict(text_vector)
    
    return prediction[0]
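
An alternative to a hand-written helper is to bundle preprocessing, vectorization, and the classifier into one scikit-learn pipeline, so that raw text goes in and a label comes out. The sketch below shows the idea on a toy training set. `simple_preprocess`, `train_texts`, and `train_labels` are hypothetical placeholders, and unlike the notebook, the pipeline would be fitted on the raw, not yet preprocessed reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def simple_preprocess(text):
    # Hypothetical stand-in for preprocess_text_stem (lowercasing only)
    return text.lower()

# Toy training data (illustration only)
train_texts = [
    "great film, loved it",
    "wonderful acting and story",
    "terrible plot, boring",
    "awful and dull movie",
]
train_labels = ["positive", "positive", "negative", "negative"]

# The vectorizer calls the preprocessor on every document before tokenizing
sentiment_pipeline = make_pipeline(
    TfidfVectorizer(preprocessor=simple_preprocess),
    LogisticRegression(),
)
sentiment_pipeline.fit(train_texts, train_labels)

prediction = sentiment_pipeline.predict(["A wonderful film"])[0]
print(prediction)
```

The design trade-off is that the fitted pipeline is a single object that can be saved and reused, at the cost of rerunning the preprocessing whenever the pipeline is refitted.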

Let's run some predictions of our own to test the model on unseen data. We can rate these texts manually and compare them with the model's output. We will go through four examples.

First: prediction should be negative

In [8]:
text = " This movie disappointed me. The actors should all be retired! The laughs were flat. I was expecting big laughs. Nope."
res = predict_sentiment(text)
res

'negative'

Second: prediction should be positive

In [9]:
text = " This was an overall exceptional movie, that reignites the franchise all over for me. Seeing Eddie and the cast that made the orginal movies years later still going at it was great. The throwbacks, the comedy and the connection between the actors makes this film work and very enjoyable.  "
res = predict_sentiment(text)
res

'positive'

Third: prediction should be negative

In [10]:
text = " inception was dumb. no dreamlike phantasy world but a standard action flick with lots of shooting and explosions instead "
res = predict_sentiment(text)
res

'negative'

Fourth: prediction should be negative. This is a longer, more complex review to test the model more thoroughly. It also contains contradictory statements.

In [11]:
text = " Nolan has done it again. He has made a one-dimensional film that people only like because of its false, high standards. Most Nolan films always seem to the casual film viewer like a kind of sophisticated and complex niche film. However, the complexity of a Nolan film is usually affected. And that is the case here too. Nolan tells a simple, one-dimensional story and deliberately leaves some questions open that he only answers later. This has nothing to do with complexity, but only with cheap card tricks. This makes the plot seem complex and confusing. Nolan's films also do not live from the story or the development of the characters, which in this film only takes place as an alibi, but from action sequences, baroque and cheap music and explosions, so they are just as cheap as, for example, the Marvel films. However, some wannabe intellectual part-time film fans sit on their couch and think they have just seen the fourth 3 Colors film, ergo they think they have just seen a sophisticated masterpiece and are surprised at how much they enjoyed this oh-so-complex effect. Over-motivated, they then tell their friends and relatives about the brilliantly complex Inception and how sophisticated their taste is. So: Inception is absolutely overrated, the film is in no way sophisticated and no better than any other contemporary action film. It remains a simple action film full of temporal causality strips, explosions and card tricks. A film for people who think that Hans Zimmer is the most brilliant composer of our time because his pieces have extreme crescendos. The man can't even read music. Well, Inception is still NOT terrible. DiCaprio, Page, Kaine and co. deliver decent performances, the film is not bad visually and the action sequences are well choreographed. However, it is not the film for the history books that it is too often sold as. "

res = predict_sentiment(text)
res

'positive'
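
A bare label does not show how close a decision was. Since the model only sees which words occur, a long review that mixes praise and criticism can plausibly land near the decision boundary. The sketch below shows how class probabilities could be inspected with `predict_proba`. It uses a toy model (`toy_vectorizer`, `toy_model`, and the training texts are assumptions), not the model trained above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (illustration only)
train_texts = [
    "brilliant cast great story",
    "wonderful and brilliant film",
    "boring plot cheap effects",
    "cheap boring terrible film",
]
train_labels = ["positive", "positive", "negative", "negative"]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(train_texts), train_labels)

# A text that mixes positive and negative vocabulary
mixed = "brilliant cast but a boring and cheap plot"
proba = toy_model.predict_proba(toy_vectorizer.transform([mixed]))[0]

# classes_ gives the label order of the probability columns
for label, p in zip(toy_model.classes_, proba):
    print(f"{label}: {p:.3f}")
```

For the notebook's own model, the equivalent call would be `model.predict_proba(vectorizer.transform([preprocess_text_stem(text)]))`.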