# 2 - Data Preparation - IMDb Dataset
In this notebook, we will prepare the [IMDb Spoilers Dataset](https://www.kaggle.com/rmisra/imdb-spoiler-dataset) for modelling.

- Google's BERT model will be used for natural language processing.

In [1]:
# import important modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from string import punctuation 

# sklearn modules
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    ConfusionMatrixDisplay,  # replaces plot_confusion_matrix, removed in scikit-learn 1.2
    f1_score,
    roc_auc_score,
)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, RandomizedSearchCV

# text preprocessing modules
from nltk.tokenize import word_tokenize
# from cleantext import clean

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
import re #regular expression


# from wordcloud import WordCloud, STOPWORDS

# Download dependency
for dependency in (
    "brown",
    "names",
    "wordnet",
    "averaged_perceptron_tagger",
    "universal_tagset",
    "stopwords"
):
    nltk.download(dependency)

#nltk.download('stopwords')

import warnings
warnings.filterwarnings("ignore")
# seeding
np.random.seed(123)

[nltk_data] Downloading package brown to /Users/raj/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package names to /Users/raj/nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/raj/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/raj/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     /Users/raj/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/raj/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Load Data

In [3]:
import pathlib

# locate data files
data_dir = pathlib.Path('../../Capstone/algorithm/data/org_dataset')
filename = 'imdb_full_dataset.csv'

# Read the CSV file.
imdb_df = pd.read_csv(data_dir / filename)

In [4]:
imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 573913 entries, 0 to 573912
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   review_date       573913 non-null  object 
 1   movie_id          573913 non-null  object 
 2   user_id           573913 non-null  object 
 3   is_spoiler        573913 non-null  bool   
 4   review_text       573913 non-null  object 
 5   review_summary    573911 non-null  object 
 6   rating_by_user    573913 non-null  int64  
 7   plot_summary      573906 non-null  object 
 8   duration          573906 non-null  object 
 9   genre             573906 non-null  object 
 10  release_date      573906 non-null  object 
 11  plot_synopsis     538828 non-null  object 
 12  avg_movie_rating  573906 non-null  float64
dtypes: bool(1), float64(1), int64(1), object(10)
memory usage: 53.1+ MB


To begin with, we will use only three columns to train our base algorithm:
1. `is_spoiler`: target variable (labels)
2. `review_text`: features
3. `review_summary`: features

Columns to consider later: `movie_id`, `plot_summary`, `movie_name` (the last is not among the 13 columns loaded above).

In [5]:
columns = ['is_spoiler', 'review_text', 'review_summary', ]

imdb_df2 = imdb_df[columns]

imdb_df2.head()

Unnamed: 0,is_spoiler,review_text,review_summary
0,True,"In its Oscar year, Shawshank Redemption (writt...",A classic piece of unforgettable film-making.
1,True,The Shawshank Redemption is without a doubt on...,Simply amazing. The best film of the 90's.
2,True,I believe that this film is the best story eve...,The best story ever told on film
3,True,"**Yes, there are SPOILERS here**This film has ...",Busy dying or busy living?
4,True,At the heart of this extraordinary movie is a ...,"Great story, wondrously told and acted"


In [6]:
# Check missing values
imdb_df2.isna().sum()

is_spoiler        0
review_text       0
review_summary    2
dtype: int64

As only two values are missing, we can simply drop the rows that contain them.

In [7]:
# Remove rows with missing values.
# Reassigning (instead of inplace=True on a column subset) avoids SettingWithCopyWarning.
imdb_df2 = imdb_df2.dropna()

imdb_df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 573911 entries, 0 to 573912
Data columns (total 3 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   is_spoiler      573911 non-null  bool  
 1   review_text     573911 non-null  object
 2   review_summary  573911 non-null  object
dtypes: bool(1), object(2)
memory usage: 13.7+ MB


## Step 1: Convert labels from bool to numerical data

In [8]:
imdb_df2['is_spoiler'].value_counts()

False    422987
True     150924
Name: is_spoiler, dtype: int64

To speed up data processing, two datasets are created.
1. `full_processed`: contains the processed data for every sample in the dataset.
2. `sample_processed`: contains a random subset of processed reviews (the size is set by `sample_size` in the next cell; 500,000 in this run).
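
A plain random sample keeps the spoiler ratio only approximately. If the exact ratio matters, a stratified draw is one alternative. The sketch below is not what this notebook does; it uses a toy frame (`toy_df`, hypothetical data) in place of `imdb_df2` and relies on `DataFrameGroupBy.sample`, available in pandas 1.1 and later.

```python
import pandas as pd

# Toy stand-in for imdb_df2 (hypothetical data: 25 spoilers, 75 non-spoilers)
toy_df = pd.DataFrame({
    "is_spoiler": [True] * 25 + [False] * 75,
    "review_text": [f"review {i}" for i in range(100)],
})

# Draw 20% of each class so the spoiler share is preserved
stratified = toy_df.groupby("is_spoiler").sample(frac=0.2, random_state=42)

full_share = toy_df["is_spoiler"].mean()        # share of spoilers in the full frame
sample_share = stratified["is_spoiler"].mean()  # share of spoilers in the sample
print(len(stratified), full_share, sample_share)
```

With `frac` applied per group, each class contributes the same fraction of its rows, so the class share in the sample matches the full frame.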

In [22]:
# creating a sample dataset
sample_size = 500_000

sample_df = imdb_df2.sample(n=sample_size, random_state=42)
sample_df.head()

Unnamed: 0,is_spoiler,review_text,review_summary
287092,False,Seeing this movie is the duty and pleasure of ...,"A ""Destined""Movie"
385947,False,When I kept seeing this getting compared to th...,I wish Matthew McConaughey would get struck by...
107091,False,Dark is a Netflix German TV show that talks ab...,Incredible Just Incredible
162711,False,I'll start by state a cliché (like those in th...,"Wachowski ""Matrix"" brothers... really?"
543174,False,*** This comment may contain spoilers ***I'd s...,"Scary at times, but suffers from a bad script."


In [23]:
processed_df1 = pd.DataFrame()

# Convert the boolean labels to integers (True -> 1, False -> 0)
processed_df1['is_spoiler'] = sample_df['is_spoiler'].astype(int)

## Step 2: Clean text (use function)
> _**Note**: the code in the below cell is taken from an article published on [freeCodeCamp.com](https://www.freecodecamp.org/news/deploy-ml-model-to-production-as-api/) by [Davis David](https://www.freecodecamp.org/news/author/davis/)_

The function removes punctuation, stop words, and other characters that act as noise in the text. It then reduces words to their lemma (dictionary form), for example `awards` to `award`; by default `WordNetLemmatizer` treats every word as a noun, so verb forms such as `running` are left unchanged. It also lowercases the text for consistency.

> **Note to self**: Modify the code to keep digits in the text. They might carry information about whether a review contains a spoiler or not.
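
As a starting point for that note, here is a minimal sketch of how the number-removal step could be made optional. The pattern is the one used in the cleaning function; the helper `strip_numbers` and its `keep_digits` flag are hypothetical names introduced only for illustration.

```python
import re

# Same pattern as the "remove numbers" step in text_cleaning
NUMBER_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s+")

def strip_numbers(text, keep_digits=False):
    """Hypothetical helper: drop standalone numbers unless keep_digits is True."""
    if keep_digits:
        return text
    return NUMBER_PATTERN.sub("", text)

sample = "opened on 33 screens in 1994"
removed = strip_numbers(sample)                 # "33 " is removed
kept = strip_numbers(sample, keep_digits=True)  # text is returned unchanged
print(removed)
print(kept)
```

Because the pattern requires trailing whitespace, a number at the very end of the text (here `1994`) survives even when removal is enabled.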

In [24]:
# A set gives constant-time membership checks, which matters across many reviews
stop_words = set(stopwords.words('english'))

def text_cleaning(text, remove_stop_words=True, lemmatize_words=True):
    # Clean the text, with options to remove stop words and to lemmatize words

    # Replace every non-alphanumeric character (apostrophes included) with a space.
    # Note: because this runs first, the apostrophe patterns below (\'s, n't, I'm, \'d, \'ll)
    # can no longer match, and URLs are already split apart before the http pattern runs.
    text = re.sub(r"[^A-Za-z0-9]", " ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"I'm", "I am", text)
    # Note: no word boundaries, so "ur" is also replaced inside words ("picture" -> "pict your e")
    text = re.sub(r"ur", " your ", text)
    text = re.sub(r" nd "," and ",text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r" tkts "," tickets ",text)
    text = re.sub(r" c "," can ",text)
    text = re.sub(r" e g ", " eg ", text)
    text =  re.sub(r'http\S+',' link ', text)
    text = re.sub(r'\b\d+(?:\.\d+)?\s+', '', text) # remove numbers followed by whitespace
    text = re.sub(r" u "," you ",text)
    text = text.lower()  # set in lowercase 
        
    # Remove punctuation from text
    text = ''.join([c for c in text if c not in punctuation])
    
    # Optionally, remove stop words
    if remove_stop_words:
        text = text.split()
        text = [w for w in text if w not in stop_words]
        text = " ".join(text)
    
    # Optionally, lemmatize each word (WordNetLemmatizer treats words as nouns by default)
    if lemmatize_words:
        text = text.split()
        lemmatizer = WordNetLemmatizer() 
        lemmatized_words = [lemmatizer.lemmatize(word) for word in text]
        text = " ".join(lemmatized_words)
    
    # Return the cleaned text as a single string
    return text
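
Fragments such as `pict e` and `fail e` in the test output below come from the `ur` substitution matching inside words like "picture" and "failure". The sketch below is one possible reordering of the regex stage only, not the original article's code: it expands contractions and replaces URLs before stripping symbols, and matches `ur` as a whole word. The function name `normalize_text` is hypothetical, and stop-word removal and lemmatization are left out so the example needs only the standard library.

```python
import re

def normalize_text(text):
    """Hypothetical variant of the regex stage of text_cleaning."""
    text = re.sub(r"http\S+", " link ", text)   # replace URLs while ':' and '/' are still present
    text = re.sub(r"\bI'm\b", "I am", text)     # expand contractions before apostrophes are stripped
    text = re.sub(r"n't\b", " not", text)
    text = re.sub(r"'ll\b", " will", text)
    text = re.sub(r"'d\b", " would", text)
    text = re.sub(r"'s\b", " ", text)
    text = re.sub(r"\bur\b", "your", text)      # whole-word match only, so "picture" is untouched
    text = re.sub(r"[^A-Za-z0-9]", " ", text)   # now strip the remaining symbols
    text = text.lower()
    return " ".join(text.split())               # collapse repeated whitespace

cleaned = normalize_text("Best Picture didn't win, see http://example.com/x")
print(cleaned)
```

Swapping this in would change the processed output, so the saved dataset would need to be regenerated before comparing results.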

### Testing the Function
---

In [25]:
imdb_df['review_text'].iloc[:3].apply(text_cleaning)[0]

'oscar year shawshank redemption written directed frank darabont novella rita hayworth shawshank redemption stephen king nominated seven academy award walked away zero best pict e went forrest gump shawshank pulp fiction happy nominated co se hindsight history look back gump good film pulp redemption remembered time best pulp however success word go making huge splash cannes making writer director american master two film andy dufresne co success come easy fortunately fail e life sentence opening screen take 25m film fell fast theatre finished mere 3m reason fail e many firstly title clunker iconic fan today people knew cared shawshank dvd tim robbins laugh recounting fan congratulating rickshaw movie marketing wise film nightmare prison drama tough sell woman story love two best friend spell winner men worst movie slow molasses desson thomson writes washington post wanders subplots every opportunity ignores abundance narrative exit point settling finale weakness make film strong first

In [26]:
imdb_df['review_text'].iloc[0]

'In its Oscar year, Shawshank Redemption (written and directed by Frank Darabont, after the novella Rita Hayworth and the Shawshank Redemption, by Stephen King) was nominated for seven Academy Awards, and walked away with zero. Best Picture went to Forrest Gump, while Shawshank and Pulp Fiction were "just happy to be nominated." Of course hindsight is 20/20, but while history looks back on Gump as a good film, Pulp and Redemption are remembered as some of the all-time best. Pulp, however, was a success from the word "go," making a huge splash at Cannes and making its writer-director an American master after only two films. For Andy Dufresne and Co., success didn\'t come easy. Fortunately, failure wasn\'t a life sentence.After opening on 33 screens with take of $727,327, the $25M film fell fast from theatres and finished with a mere $28.3M. The reasons for failure are many. Firstly, the title is a clunker. While iconic to fans today, in 1994, people knew not and cared not what a \'Shaws

In [27]:
# Applying function to sample dataset
processed_df1['review_text'] = sample_df['review_text'].apply(text_cleaning)

In [28]:
processed_df1['review_summary'] = sample_df['review_summary'].apply(text_cleaning)

In [29]:
processed_df1.head()

Unnamed: 0,is_spoiler,review_text,review_summary
287092,0,seeing movie duty plea e every serious sci fan...,destined movie
385947,0,kept seeing getting compared indiana jones mov...,wish matthew mcconaughey would get struck ligh...
107091,0,dark netflix german tv show talk mysterious di...,incredible incredible
162711,0,start state clich like movie pretentious film ...,wachowski matrix brother really
543174,0,comment may contain spoiler seen came mixed fe...,scary time suffers bad script


# Save sample dataset

In [30]:
data_dir = pathlib.Path('../../Capstone/algorithm/data/processed_dataset')
filename = 'processed_sample.csv'
path = data_dir / filename

processed_df1.to_csv(path, index=False)


if path.exists():
    print(f'File saved to {path}')
else:
    print('File not saved.')

File saved to ../../Capstone/algorithm/data/processed_dataset/processed_sample.csv
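
One thing to watch when reloading the saved file: a review or summary that consists only of stop words or symbols may come out of cleaning as an empty string, and `pd.read_csv` parses empty fields as `NaN` by default. The sketch below shows this with a toy frame (hypothetical data) and an in-memory buffer instead of the real path.

```python
import io
import pandas as pd

# Toy stand-in for processed_df1; the second summary is empty after cleaning
toy = pd.DataFrame({
    "is_spoiler": [1, 0],
    "review_summary": ["destined movie", ""],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: the empty field comes back as NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps it as an empty string
buffer.seek(0)
safe_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["review_summary"].isna().sum())
print(safe_read["review_summary"].tolist())
```

Either reading with `keep_default_na=False` or calling `fillna("")` on the text columns after loading would keep later string operations from failing on these rows.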
