In [2]:
import PyPDF2 as pdf

In [3]:
myfile=open(r"NLTK.pdf",'rb')
content=pdf.PdfReader(myfile)
print(len(content.pages))

504


In [4]:
page=content.pages[11]
page11=page.extract_text()
print(page11)

Audience
NLP is important for scientific, economic, social, and cultural reasons. NLP is experi-
encing rapid 
growth as its theories and methods are deployed in a variety of new lan-
guage technologies. For this reason it is important for a wide range of people to have a
working knowledge of NLP. Within industry, this includes people in human-computer
interaction, business information analysis, and web software development. Within
academia, it includes people in areas from humanities computing and corpus linguistics
through to computer science and artificial intelligence. (To many people in academia,
NLP is known by the name of “Computational Linguistics.”)
This book is intended for a diverse range of people who want to learn how to write
programs that analyze written language, regardless of previous programming
experience:
New to programming?
The early chapters of the book are suitable for readers with no prior knowledge of
programming, so long as you aren’t afraid to tackle new conc

## **Tokenization**

**Tokenization** in NLP (Natural Language Processing) is the process of breaking text into smaller units called tokens, which can be sentences, words, subwords, or individual characters. It is usually the first preprocessing step, because downstream tasks such as text classification, sentiment analysis, and machine translation operate on these tokens rather than on the raw string.
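
As a rough illustration of the idea, the sketch below splits text into sentences, words, and characters using only regular expressions. The helper names `simple_sent_tokenize` and `simple_word_tokenize` are made up for this example; they are deliberately naive and are not how NLTK's tokenizers work (for instance, the sentence splitter would wrongly break after an abbreviation such as "Mr.").

```python
import re

def simple_sent_tokenize(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

def simple_word_tokenize(text: str) -> list[str]:
    """Return runs of word characters and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

text = "Tokens can be words. They can also be punctuation!"
print(simple_sent_tokenize(text))   # sentence-level tokens
print(simple_word_tokenize(text))   # word-level tokens
print(list("NLP"))                  # character-level tokens
```

The NLTK functions used in the cells that follow handle many cases these regular expressions miss, so the sketch is only meant to show what "breaking text into tokens" means at each granularity.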

In [5]:
import nltk
# nltk.download('punkt_tab')

In [6]:
mypara="The early chapters are organized in order of conceptual difficulty, starting with a practical introduction to language processing that shows how to explore interesting bodies of text using tiny Python programs (Chapters 1–3). This is followed by a chapter on structured programming (Chapter 4) that consolidates the programming topics scattered across the preceding chapters. After this, the pace picks up, and we move on to a series of chapters covering fundamental topics in language processing: tagging, classification, and information extraction (Chapters 5–7)."
sent_tokenised=nltk.sent_tokenize(mypara)
sent_tokenised

['The early chapters are organized in order of conceptual difficulty, starting with a practical introduction to language processing that shows how to explore interesting bodies of text using tiny Python programs (Chapters 1–3).',
 'This is followed by a chapter on structured programming (Chapter 4) that consolidates the programming topics scattered across the preceding chapters.',
 'After this, the pace picks up, and we move on to a series of chapters covering fundamental topics in language processing: tagging, classification, and information extraction (Chapters 5–7).']

In [7]:
word_tokenized=nltk.word_tokenize(mypara)
word_tokenized

['The',
 'early',
 'chapters',
 'are',
 'organized',
 'in',
 'order',
 'of',
 'conceptual',
 'difficulty',
 ',',
 'starting',
 'with',
 'a',
 'practical',
 'introduction',
 'to',
 'language',
 'processing',
 'that',
 'shows',
 'how',
 'to',
 'explore',
 'interesting',
 'bodies',
 'of',
 'text',
 'using',
 'tiny',
 'Python',
 'programs',
 '(',
 'Chapters',
 '1–3',
 ')',
 '.',
 'This',
 'is',
 'followed',
 'by',
 'a',
 'chapter',
 'on',
 'structured',
 'programming',
 '(',
 'Chapter',
 '4',
 ')',
 'that',
 'consolidates',
 'the',
 'programming',
 'topics',
 'scattered',
 'across',
 'the',
 'preceding',
 'chapters',
 '.',
 'After',
 'this',
 ',',
 'the',
 'pace',
 'picks',
 'up',
 ',',
 'and',
 'we',
 'move',
 'on',
 'to',
 'a',
 'series',
 'of',
 'chapters',
 'covering',
 'fundamental',
 'topics',
 'in',
 'language',
 'processing',
 ':',
 'tagging',
 ',',
 'classification',
 ',',
 'and',
 'information',
 'extraction',
 '(',
 'Chapters',
 '5–7',
 ')',
 '.']

## **Stemming and Lemmatization**
Stemming and lemmatization are techniques in NLP used to reduce words to their base or root forms.

* **Stemming** is the process of stripping affixes (usually suffixes) from words to obtain a base or "stem" form, which is not necessarily a valid word. For example, "running" and "runs" are both reduced to "run." Because stemmers apply simple rewrite rules rather than consulting a dictionary, they can produce non-words (e.g., "studies" becomes "studi") and they do not resolve irregular forms such as "ran."

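* **Lemmatization** is the process of reducing words to their "lemma," the form under which the word is listed in a dictionary. Unlike stemming, lemmatization relies on a vocabulary and on the word's part of speech, so the result is normally a real word. For example, "running" (as a verb) is lemmatized to "run," and "better" (as an adjective) is lemmatized to "good." NLTK's `WordNetLemmatizer` treats every word as a noun unless a `pos` argument is supplied, which is why "organized" is left unchanged in the comparison table further down.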

Both techniques help standardize words in text processing tasks, reducing redundancy and improving the efficiency of NLP models.
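
To make the contrast concrete, here is a toy sketch of the two approaches. It is neither the Porter algorithm nor WordNet: `SUFFIX_RULES`, `LEMMA_TABLE`, `toy_stem`, and `toy_lemmatize` are hypothetical names invented for this illustration, and the rule list and dictionary are intentionally tiny.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# Hypothetical mini-dictionary keyed by (word, part of speech).
LEMMA_TABLE = {
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("better", "a"): "good",
    ("ran", "v"): "run",
}

def toy_stem(word: str) -> str:
    """Apply the first matching suffix rule; no dictionary is consulted."""
    lowered = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if lowered.endswith(suffix) and len(lowered) > len(suffix) + 2:
            return lowered[: -len(suffix)] + replacement
    return lowered

def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Look the (word, pos) pair up; unknown pairs are returned unchanged."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

for w, pos in [("studies", "n"), ("better", "a"), ("ran", "v")]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w, pos))
```

The rule-based function turns "studies" into the non-word "studi" and leaves "ran" alone, while the lookup returns real words but only when it is given the right part of speech. With NLTK, the part of speech is passed the same way, for example `lemm.lemmatize("better", pos="a")`.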

In [8]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
import pandas as pd
# nltk.download("wordnet")

In [9]:
stemmer=PorterStemmer()

# Stem every token produced by word_tokenize
stemmed=[stemmer.stem(i) for i in word_tokenized]


In [10]:
lemm=WordNetLemmatizer()

print("The base word of main word using Stemming:",stemmer.stem("studies"))
print("The base word of main word using Lemmatization:",lemm.lemmatize("studies"))

The base word of main word using Stemming: studi
The base word of main word using Lemmatization: study


In [11]:
# Lemmatize every token; without a pos argument each token is treated as a noun
lemmitized=[lemm.lemmatize(i) for i in word_tokenized]

In [12]:
df=pd.DataFrame(zip(word_tokenized,stemmed,lemmitized),columns=['orig','stemmed','lemmetized'])
df

Unnamed: 0,orig,stemmed,lemmetized
0,The,the,The
1,early,earli,early
2,chapters,chapter,chapter
3,are,are,are
4,organized,organ,organized
...,...,...,...
92,(,(,(
93,Chapters,chapter,Chapters
94,5–7,5–7,5–7
95,),),)


## **Stop Words**

A **stop word** in NLP (Natural Language Processing) is a common word that is typically filtered out or removed from text during preprocessing because it carries little meaningful information on its own. Examples of stop words include **"the," "is," "in," "and," and "of."** These words are usually so frequent that they do not contribute significantly to the understanding or analysis of the text. By removing stop words, NLP models can focus on more meaningful words, improving the efficiency and performance of tasks like text classification, sentiment analysis, and information retrieval.

In [13]:
from nltk.corpus import stopwords
import numpy as np
# nltk.download('stopwords')

In [14]:
english_stop_words=stopwords.words('english')

In [15]:
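# Keep only the tokens that are not in the stop-word list.
# The comparison is case-sensitive and the NLTK list is lowercase.
baseWords=[i for i in lemmitized if i not in english_stop_words]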

In [16]:
baseWords

['The',
 'early',
 'chapter',
 'organized',
 'order',
 'conceptual',
 'difficulty',
 ',',
 'starting',
 'practical',
 'introduction',
 'language',
 'processing',
 'show',
 'explore',
 'interesting',
 'body',
 'text',
 'using',
 'tiny',
 'Python',
 'program',
 '(',
 'Chapters',
 '1–3',
 ')',
 '.',
 'This',
 'followed',
 'chapter',
 'structured',
 'programming',
 '(',
 'Chapter',
 '4',
 ')',
 'consolidates',
 'programming',
 'topic',
 'scattered',
 'across',
 'preceding',
 'chapter',
 '.',
 'After',
 ',',
 'pace',
 'pick',
 ',',
 'move',
 'series',
 'chapter',
 'covering',
 'fundamental',
 'topic',
 'language',
 'processing',
 ':',
 'tagging',
 ',',
 'classification',
 ',',
 'information',
 'extraction',
 '(',
 'Chapters',
 '5–7',
 ')',
 '.']
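
Note that 'The', 'This', and 'After' survive in the list above: the NLTK stop-word list is lowercase and the membership test is case-sensitive. Punctuation tokens are kept as well. A minimal sketch of a case-insensitive filter follows; the `stop_words` set here is a small hand-picked subset used only for illustration, not the NLTK list, and the sample tokens are made up.

```python
# Illustrative subset only; the notebook uses NLTK's full English list instead.
stop_words = {"the", "this", "after", "is", "by", "a", "on", "and", "we", "up"}

tokens = ["The", "pace", "picks", "up", ",", "and", "This", "is", "followed"]

# Case-sensitive test, as in the cell above: capitalized stop words slip through.
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercase each token only for the membership test, and drop pure punctuation.
case_insensitive = [
    t for t in tokens
    if t.lower() not in stop_words and any(ch.isalnum() for ch in t)
]

print(case_sensitive)
print(case_insensitive)
```

Storing the stop words in a `set` rather than a list also makes each membership test constant-time, which matters once the filter is applied to tens of thousands of reviews.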

In [17]:
df['baseWords']=df['lemmetized'].apply(lambda x: np.nan if x in english_stop_words else x)

In [18]:
df

Unnamed: 0,orig,stemmed,lemmetized,baseWords
0,The,the,The,The
1,early,earli,early,early
2,chapters,chapter,chapter,chapter
3,are,are,are,
4,organized,organ,organized,organized
...,...,...,...,...
92,(,(,(,(
93,Chapters,chapter,Chapters,Chapters
94,5–7,5–7,5–7,5–7
95,),),),)


## **Bag of Words**

The **Bag of Words (BoW)** model is a simple and commonly used technique in NLP for representing text data. A text is represented as an unordered collection (or "bag") of its words, disregarding grammar and word order. What the model keeps is which vocabulary words occur in a document and, in the usual count-based variant, how many times each occurs; a binary variant records only presence or absence.

To create a Bag of Words representation, each unique word in the text corpus is considered a feature, and the text is converted into a vector based on the count or frequency of these words. For example, if the vocabulary consists of the words ["cat," "dog," "fish"], the text "cat and dog" would be represented as the vector [1, 1, 0]. BoW is useful for tasks like **text classification** and **sentiment analysis**, though it can be limited by its inability to capture the context or semantics of words.
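
The "cat and dog" example can be reproduced with a few lines of plain Python. This is a minimal sketch with a fixed, hand-written vocabulary; the function name `bag_of_words` is an assumption made for this illustration.

```python
from collections import Counter

def bag_of_words(text: str, vocabulary: list[str]) -> list[int]:
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    # Words outside the vocabulary (here "and") are simply ignored.
    return [counts[word] for word in vocabulary]

vocabulary = ["cat", "dog", "fish"]
print(bag_of_words("cat and dog", vocabulary))     # [1, 1, 0]
print(bag_of_words("dog chases dog", vocabulary))  # [0, 2, 0]
```

In practice the vocabulary is learned from the whole corpus rather than written by hand. scikit-learn's `CountVectorizer`, which this notebook imports for the IMDB data, does that and returns the counts as a sparse matrix.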

---
### **Sentiment Analysis - IMDB Review Dataset**

In [19]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

pd.set_option('display.max_colwidth',0)

In [20]:
df=pd.read_csv('D:/Projects/NLP/imdb_master.csv',encoding='ISO-8859-1').drop(['file','Unnamed: 0'],axis=1)
display(df.shape)
display(df.head())
display(df.type.value_counts())
display(df.label.value_counts())

(100000, 3)

Unnamed: 0,type,review,label
0,test,"Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.",neg
1,test,"This is an example of why the majority of action films are the same. Generic and boring, there's really nothing worth watching here. A complete waste of the then barely-tapped talents of Ice-T and Ice Cube, who've each proven many times over that they are capable of acting, and acting well. Don't bother with this one, go see New Jack City, Ricochet or watch New York Undercover for Ice-T, or Boyz n the Hood, Higher Learning or Friday for Ice Cube and see the real deal. Ice-T's horribly cliched dialogue alone makes this film grate at the teeth, and I'm still wondering what the heck Bill Paxton was doing in this film? And why the heck does he always play the exact same character? From Aliens onward, every film I've seen with Bill Paxton has him playing the exact same irritating character, and at least in Aliens his character died, which made it somewhat gratifying...<br /><br />Overall, this is second-rate action trash. There are countless better films to see, and if you really want to see this one, watch Judgement Night, which is practically a carbon copy but has better acting and a better script. The only thing that made this at all worth watching was a decent hand on the camera - the cinematography was almost refreshing, which comes close to making up for the horrible film itself - but not quite. 4/10.",neg
2,test,"First of all I hate those moronic rappers, who could'nt act if they had a gun pressed against their foreheads. All they do is curse and shoot each other and acting like clichÃ©'e version of gangsters.<br /><br />The movie doesn't take more than five minutes to explain what is going on before we're already at the warehouse There is not a single sympathetic character in this movie, except for the homeless guy, who is also the only one with half a brain.<br /><br />Bill Paxton and William Sadler are both hill billies and Sadlers character is just as much a villain as the gangsters. I did'nt like him right from the start.<br /><br />The movie is filled with pointless violence and Walter Hills specialty: people falling through windows with glass flying everywhere. There is pretty much no plot and it is a big problem when you root for no-one. Everybody dies, except from Paxton and the homeless guy and everybody get what they deserve.<br /><br />The only two black people that can act is the homeless guy and the junkie but they're actors by profession, not annoying ugly brain dead rappers.<br /><br />Stay away from this crap and watch 48 hours 1 and 2 instead. At lest they have characters you care about, a sense of humor and nothing but real actors in the cast.",neg
3,test,"Not even the Beatles could write songs everyone liked, and although Walter Hill is no mop-top he's second to none when it comes to thought provoking action movies. The nineties came and social platforms were changing in music and film, the emergence of the Rapper turned movie star was in full swing, the acting took a back seat to each man's overpowering regional accent and transparent acting. This was one of the many ice-t movies i saw as a kid and loved, only to watch them later and cringe. Bill Paxton and William Sadler are firemen with basic lives until a burning building tenant about to go up in flames hands over a map with gold implications. I hand it to Walter for quickly and neatly setting up the main characters and location. But i fault everyone involved for turning out Lame-o performances. Ice-t and cube must have been red hot at this time, and while I've enjoyed both their careers as rappers, in my opinion they fell flat in this movie. It's about ninety minutes of one guy ridiculously turning his back on the other guy to the point you find yourself locked in multiple states of disbelief. Now this is a movie, its not a documentary so i wont waste my time recounting all the stupid plot twists in this movie, but there were many, and they led nowhere. I got the feeling watching this that everyone on set was sord of confused and just playing things off the cuff. There are two things i still enjoy about it, one involves a scene with a needle and the other is Sadler's huge 45 pistol. Bottom line this movie is like domino's pizza. Yeah ill eat it if I'm hungry and i don't feel like cooking, But I'm well aware it tastes like crap. 3 stars, meh.",neg
4,test,"Brass pictures (movies is not a fitting word for them) really are somewhat brassy. Their alluring visual qualities are reminiscent of expensive high class TV commercials. But unfortunately Brass pictures are feature films with the pretense of wanting to entertain viewers for over two hours! In this they fail miserably, their undeniable, but rather soft and flabby than steamy, erotic qualities non withstanding.<br /><br />Senso '45 is a remake of a film by Luchino Visconti with the same title and Alida Valli and Farley Granger in the lead. The original tells a story of senseless love and lust in and around Venice during the Italian wars of independence. Brass moved the action from the 19th into the 20th century, 1945 to be exact, so there are Mussolini murals, men in black shirts, German uniforms or the tattered garb of the partisans. But it is just window dressing, the historic context is completely negligible.<br /><br />Anna Galiena plays the attractive aristocratic woman who falls for the amoral SS guy who always puts on too much lipstick. She is an attractive, versatile, well trained Italian actress and clearly above the material. Her wide range of facial expressions (signalling boredom, loathing, delight, fear, hate ... and ecstasy) are the best reason to watch this picture and worth two stars. She endures this basically trashy stuff with an astonishing amount of dignity. I wish some really good parts come along for her. She really deserves it.",neg


type
train    75000
test     25000
Name: count, dtype: int64

label
unsup    50000
neg      25000
pos      25000
Name: count, dtype: int64

In [21]:
filtered_df=df.query("label!='unsup'")
filtered_df.label.value_counts()

label
neg    25000
pos    25000
Name: count, dtype: int64

In [22]:
filtered_df['label']=filtered_df['label'].apply(lambda x: 0 if x == 'neg' else 1)
display(filtered_df.head())
display(filtered_df.label.value_counts())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df['label']=filtered_df['label'].apply(lambda x: 0 if x == 'neg' else 1)


Unnamed: 0,type,review,label
0,test,"Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.",0
1,test,"This is an example of why the majority of action films are the same. Generic and boring, there's really nothing worth watching here. A complete waste of the then barely-tapped talents of Ice-T and Ice Cube, who've each proven many times over that they are capable of acting, and acting well. Don't bother with this one, go see New Jack City, Ricochet or watch New York Undercover for Ice-T, or Boyz n the Hood, Higher Learning or Friday for Ice Cube and see the real deal. Ice-T's horribly cliched dialogue alone makes this film grate at the teeth, and I'm still wondering what the heck Bill Paxton was doing in this film? And why the heck does he always play the exact same character? From Aliens onward, every film I've seen with Bill Paxton has him playing the exact same irritating character, and at least in Aliens his character died, which made it somewhat gratifying...<br /><br />Overall, this is second-rate action trash. There are countless better films to see, and if you really want to see this one, watch Judgement Night, which is practically a carbon copy but has better acting and a better script. The only thing that made this at all worth watching was a decent hand on the camera - the cinematography was almost refreshing, which comes close to making up for the horrible film itself - but not quite. 4/10.",0
2,test,"First of all I hate those moronic rappers, who could'nt act if they had a gun pressed against their foreheads. All they do is curse and shoot each other and acting like clichÃ©'e version of gangsters.<br /><br />The movie doesn't take more than five minutes to explain what is going on before we're already at the warehouse There is not a single sympathetic character in this movie, except for the homeless guy, who is also the only one with half a brain.<br /><br />Bill Paxton and William Sadler are both hill billies and Sadlers character is just as much a villain as the gangsters. I did'nt like him right from the start.<br /><br />The movie is filled with pointless violence and Walter Hills specialty: people falling through windows with glass flying everywhere. There is pretty much no plot and it is a big problem when you root for no-one. Everybody dies, except from Paxton and the homeless guy and everybody get what they deserve.<br /><br />The only two black people that can act is the homeless guy and the junkie but they're actors by profession, not annoying ugly brain dead rappers.<br /><br />Stay away from this crap and watch 48 hours 1 and 2 instead. At lest they have characters you care about, a sense of humor and nothing but real actors in the cast.",0
3,test,"Not even the Beatles could write songs everyone liked, and although Walter Hill is no mop-top he's second to none when it comes to thought provoking action movies. The nineties came and social platforms were changing in music and film, the emergence of the Rapper turned movie star was in full swing, the acting took a back seat to each man's overpowering regional accent and transparent acting. This was one of the many ice-t movies i saw as a kid and loved, only to watch them later and cringe. Bill Paxton and William Sadler are firemen with basic lives until a burning building tenant about to go up in flames hands over a map with gold implications. I hand it to Walter for quickly and neatly setting up the main characters and location. But i fault everyone involved for turning out Lame-o performances. Ice-t and cube must have been red hot at this time, and while I've enjoyed both their careers as rappers, in my opinion they fell flat in this movie. It's about ninety minutes of one guy ridiculously turning his back on the other guy to the point you find yourself locked in multiple states of disbelief. Now this is a movie, its not a documentary so i wont waste my time recounting all the stupid plot twists in this movie, but there were many, and they led nowhere. I got the feeling watching this that everyone on set was sord of confused and just playing things off the cuff. There are two things i still enjoy about it, one involves a scene with a needle and the other is Sadler's huge 45 pistol. Bottom line this movie is like domino's pizza. Yeah ill eat it if I'm hungry and i don't feel like cooking, But I'm well aware it tastes like crap. 3 stars, meh.",0
4,test,"Brass pictures (movies is not a fitting word for them) really are somewhat brassy. Their alluring visual qualities are reminiscent of expensive high class TV commercials. But unfortunately Brass pictures are feature films with the pretense of wanting to entertain viewers for over two hours! In this they fail miserably, their undeniable, but rather soft and flabby than steamy, erotic qualities non withstanding.<br /><br />Senso '45 is a remake of a film by Luchino Visconti with the same title and Alida Valli and Farley Granger in the lead. The original tells a story of senseless love and lust in and around Venice during the Italian wars of independence. Brass moved the action from the 19th into the 20th century, 1945 to be exact, so there are Mussolini murals, men in black shirts, German uniforms or the tattered garb of the partisans. But it is just window dressing, the historic context is completely negligible.<br /><br />Anna Galiena plays the attractive aristocratic woman who falls for the amoral SS guy who always puts on too much lipstick. She is an attractive, versatile, well trained Italian actress and clearly above the material. Her wide range of facial expressions (signalling boredom, loathing, delight, fear, hate ... and ecstasy) are the best reason to watch this picture and worth two stars. She endures this basically trashy stuff with an astonishing amount of dignity. I wish some really good parts come along for her. She really deserves it.",0


label
0    25000
1    25000
Name: count, dtype: int64
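
The `SettingWithCopyWarning` printed by this cell appears because `filtered_df` was derived from `df` by filtering, so pandas cannot tell whether the assignment is meant to modify the original frame. The labels are still converted correctly here, as the counts show. One common way to avoid the ambiguity is to take an explicit copy right after filtering. The sketch below uses a made-up three-row frame rather than the IMDB file.

```python
import pandas as pd

# Hypothetical miniature of the review table (made-up rows, for illustration only).
df = pd.DataFrame({
    "review": ["dull and slow", "a real delight", "no label here"],
    "label": ["neg", "pos", "unsup"],
})

# .copy() gives an independent DataFrame, so adding or overwriting columns
# afterwards is unambiguous and leaves the original frame untouched.
filtered = df.query("label != 'unsup'").copy()

# Map the string labels to integers with an explicit dictionary.
filtered["label"] = filtered["label"].map({"neg": 0, "pos": 1})

print(filtered)
```

Using `map` with a dictionary also makes the encoding explicit: a value other than 'neg' or 'pos' would become `NaN` instead of being silently counted as positive.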

#### **Cleaning Starts**

In [23]:
filtered_df['review_lower']=filtered_df['review'].str.lower()



In [24]:
# Removing stop words (tokens are split on whitespace only)
filtered_df['review_without_stopwords']=filtered_df['review_lower'].apply(lambda x:" ".join(word for word in x.split() if word not in english_stop_words))



In [25]:
filtered_df.head(2)

Unnamed: 0,type,review,label,review_lower,review_without_stopwords
0,test,"Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.",0,"once again mr. costner has dragged out a movie for far longer than necessary. aside from the terrific sea rescue sequences, of which there are very few i just did not care about any of the characters. most of us have ghosts in the closet, and costner's character are realized early on, and then forgotten until much later, by which time i did not care. the character we should really care about is a very cocky, overconfident ashton kutcher. the problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. his only obstacle appears to be winning over costner. finally when we are well past the half way point of this stinker, costner tells us all about kutcher's ghosts. we are told why kutcher is driven to be the best with no prior inkling or foreshadowing. no magic here, it was all i could do to keep from turning it off an hour in.","mr. costner dragged movie far longer necessary. aside terrific sea rescue sequences, care characters. us ghosts closet, costner's character realized early on, forgotten much later, time care. character really care cocky, overconfident ashton kutcher. problem comes kid thinks he's better anyone else around shows signs cluttered closet. obstacle appears winning costner. finally well past half way point stinker, costner tells us kutcher's ghosts. told kutcher driven best prior inkling foreshadowing. magic here, could keep turning hour in."
1,test,"This is an example of why the majority of action films are the same. Generic and boring, there's really nothing worth watching here. A complete waste of the then barely-tapped talents of Ice-T and Ice Cube, who've each proven many times over that they are capable of acting, and acting well. Don't bother with this one, go see New Jack City, Ricochet or watch New York Undercover for Ice-T, or Boyz n the Hood, Higher Learning or Friday for Ice Cube and see the real deal. Ice-T's horribly cliched dialogue alone makes this film grate at the teeth, and I'm still wondering what the heck Bill Paxton was doing in this film? And why the heck does he always play the exact same character? From Aliens onward, every film I've seen with Bill Paxton has him playing the exact same irritating character, and at least in Aliens his character died, which made it somewhat gratifying...<br /><br />Overall, this is second-rate action trash. There are countless better films to see, and if you really want to see this one, watch Judgement Night, which is practically a carbon copy but has better acting and a better script. The only thing that made this at all worth watching was a decent hand on the camera - the cinematography was almost refreshing, which comes close to making up for the horrible film itself - but not quite. 4/10.",0,"this is an example of why the majority of action films are the same. generic and boring, there's really nothing worth watching here. a complete waste of the then barely-tapped talents of ice-t and ice cube, who've each proven many times over that they are capable of acting, and acting well. don't bother with this one, go see new jack city, ricochet or watch new york undercover for ice-t, or boyz n the hood, higher learning or friday for ice cube and see the real deal. ice-t's horribly cliched dialogue alone makes this film grate at the teeth, and i'm still wondering what the heck bill paxton was doing in this film? and why the heck does he always play the exact same character? from aliens onward, every film i've seen with bill paxton has him playing the exact same irritating character, and at least in aliens his character died, which made it somewhat gratifying...<br /><br />overall, this is second-rate action trash. there are countless better films to see, and if you really want to see this one, watch judgement night, which is practically a carbon copy but has better acting and a better script. the only thing that made this at all worth watching was a decent hand on the camera - the cinematography was almost refreshing, which comes close to making up for the horrible film itself - but not quite. 4/10.","example majority action films same. generic boring, there's really nothing worth watching here. complete waste barely-tapped talents ice-t ice cube, who've proven many times capable acting, acting well. bother one, go see new jack city, ricochet watch new york undercover ice-t, boyz n hood, higher learning friday ice cube see real deal. ice-t's horribly cliched dialogue alone makes film grate teeth, i'm still wondering heck bill paxton film? heck always play exact character? aliens onward, every film i've seen bill paxton playing exact irritating character, least aliens character died, made somewhat gratifying...<br /><br />overall, second-rate action trash. countless better films see, really want see one, watch judgement night, practically carbon copy better acting better script. thing made worth watching decent hand camera - cinematography almost refreshing, comes close making horrible film - quite. 4/10."
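
The `review_without_stopwords` column above still contains HTML remnants (`<br /><br />`) and tokens with punctuation attached (`here,`, `in.`), because the reviews are split on whitespace only. A token such as `here,` is therefore never matched against the stop-word list. A minimal cleaning sketch that could run before stop-word removal is shown below; `clean_review` is a hypothetical helper, and the character class assumes plain ASCII English text, so accented letters would be dropped.

```python
import re

def clean_review(text: str) -> str:
    """Lowercase, strip HTML tags, and replace punctuation with spaces."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)       # "<br />" -> " "
    text = re.sub(r"[^a-z0-9'\s]", " ", text)  # keep letters, digits, apostrophes
    return " ".join(text.split())              # collapse repeated whitespace

sample = "Somewhat gratifying...<br /><br />Overall, this is second-rate action trash."
print(clean_review(sample))
```

Apostrophes are kept so that contractions such as "he's" stay in one piece; whether to keep them is a judgment call that depends on the stop-word list in use.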


In [26]:
lemm=WordNetLemmatizer()
filtered_df['lemmatized_review']=filtered_df['review_without_stopwords'].apply(lambda x: " ".join(lemm.lemmatize(word) for word in x.split()))



In [27]:
filtered_df.head(2)

Unnamed: 0,type,review,label,review_lower,review_without_stopwords,lemmatized_review
0,test,"Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.",0,"once again mr. costner has dragged out a movie for far longer than necessary. aside from the terrific sea rescue sequences, of which there are very few i just did not care about any of the characters. most of us have ghosts in the closet, and costner's character are realized early on, and then forgotten until much later, by which time i did not care. the character we should really care about is a very cocky, overconfident ashton kutcher. the problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. his only obstacle appears to be winning over costner. finally when we are well past the half way point of this stinker, costner tells us all about kutcher's ghosts. we are told why kutcher is driven to be the best with no prior inkling or foreshadowing. no magic here, it was all i could do to keep from turning it off an hour in.","mr. costner dragged movie far longer necessary. aside terrific sea rescue sequences, care characters. us ghosts closet, costner's character realized early on, forgotten much later, time care. character really care cocky, overconfident ashton kutcher. problem comes kid thinks he's better anyone else around shows signs cluttered closet. obstacle appears winning costner. finally well past half way point stinker, costner tells us kutcher's ghosts. told kutcher driven best prior inkling foreshadowing. magic here, could keep turning hour in.","mr. costner dragged movie far longer necessary. aside terrific sea rescue sequences, care characters. u ghost closet, costner's character realized early on, forgotten much later, time care. character really care cocky, overconfident ashton kutcher. problem come kid think he's better anyone else around show sign cluttered closet. obstacle appears winning costner. finally well past half way point stinker, costner tell u kutcher's ghosts. told kutcher driven best prior inkling foreshadowing. magic here, could keep turning hour in."
1,test,"This is an example of why the majority of action films are the same. Generic and boring, there's really nothing worth watching here. A complete waste of the then barely-tapped talents of Ice-T and Ice Cube, who've each proven many times over that they are capable of acting, and acting well. Don't bother with this one, go see New Jack City, Ricochet or watch New York Undercover for Ice-T, or Boyz n the Hood, Higher Learning or Friday for Ice Cube and see the real deal. Ice-T's horribly cliched dialogue alone makes this film grate at the teeth, and I'm still wondering what the heck Bill Paxton was doing in this film? And why the heck does he always play the exact same character? From Aliens onward, every film I've seen with Bill Paxton has him playing the exact same irritating character, and at least in Aliens his character died, which made it somewhat gratifying...<br /><br />Overall, this is second-rate action trash. There are countless better films to see, and if you really want to see this one, watch Judgement Night, which is practically a carbon copy but has better acting and a better script. The only thing that made this at all worth watching was a decent hand on the camera - the cinematography was almost refreshing, which comes close to making up for the horrible film itself - but not quite. 4/10.",0,"this is an example of why the majority of action films are the same. generic and boring, there's really nothing worth watching here. a complete waste of the then barely-tapped talents of ice-t and ice cube, who've each proven many times over that they are capable of acting, and acting well. don't bother with this one, go see new jack city, ricochet or watch new york undercover for ice-t, or boyz n the hood, higher learning or friday for ice cube and see the real deal. ice-t's horribly cliched dialogue alone makes this film grate at the teeth, and i'm still wondering what the heck bill paxton was doing in this film? 
and why the heck does he always play the exact same character? from aliens onward, every film i've seen with bill paxton has him playing the exact same irritating character, and at least in aliens his character died, which made it somewhat gratifying...<br /><br />overall, this is second-rate action trash. there are countless better films to see, and if you really want to see this one, watch judgement night, which is practically a carbon copy but has better acting and a better script. the only thing that made this at all worth watching was a decent hand on the camera - the cinematography was almost refreshing, which comes close to making up for the horrible film itself - but not quite. 4/10.","example majority action films same. generic boring, there's really nothing worth watching here. complete waste barely-tapped talents ice-t ice cube, who've proven many times capable acting, acting well. bother one, go see new jack city, ricochet watch new york undercover ice-t, boyz n hood, higher learning friday ice cube see real deal. ice-t's horribly cliched dialogue alone makes film grate teeth, i'm still wondering heck bill paxton film? heck always play exact character? aliens onward, every film i've seen bill paxton playing exact irritating character, least aliens character died, made somewhat gratifying...<br /><br />overall, second-rate action trash. countless better films see, really want see one, watch judgement night, practically carbon copy better acting better script. thing made worth watching decent hand camera - cinematography almost refreshing, comes close making horrible film - quite. 4/10.","example majority action film same. generic boring, there's really nothing worth watching here. complete waste barely-tapped talent ice-t ice cube, who've proven many time capable acting, acting well. bother one, go see new jack city, ricochet watch new york undercover ice-t, boyz n hood, higher learning friday ice cube see real deal. 
ice-t's horribly cliched dialogue alone make film grate teeth, i'm still wondering heck bill paxton film? heck always play exact character? alien onward, every film i've seen bill paxton playing exact irritating character, least alien character died, made somewhat gratifying...<br /><br />overall, second-rate action trash. countless better film see, really want see one, watch judgement night, practically carbon copy better acting better script. thing made worth watching decent hand camera - cinematography almost refreshing, come close making horrible film - quite. 4/10."


In [28]:
'''
    Preprocessing steps applied so far:
        - First, converted every review to lower case; otherwise "Debut" and "debut" would be counted as two different tokens even though they are the same word
        - Second, removed the stop words from the text
        - Last, lemmatized the text
    With lemmatization done, drop the intermediate text columns and rename the lemmatized_review column to review
'''

filtered_df=filtered_df.drop(['review','review_lower','review_without_stopwords'],axis=1)
final_filtered_df=filtered_df.rename({'lemmatized_review':'review'},axis=1)
final_filtered_df.head(2)

Unnamed: 0,type,label,review
0,test,0,"mr. costner dragged movie far longer necessary. aside terrific sea rescue sequences, care characters. u ghost closet, costner's character realized early on, forgotten much later, time care. character really care cocky, overconfident ashton kutcher. problem come kid think he's better anyone else around show sign cluttered closet. obstacle appears winning costner. finally well past half way point stinker, costner tell u kutcher's ghosts. told kutcher driven best prior inkling foreshadowing. magic here, could keep turning hour in."
1,test,0,"example majority action film same. generic boring, there's really nothing worth watching here. complete waste barely-tapped talent ice-t ice cube, who've proven many time capable acting, acting well. bother one, go see new jack city, ricochet watch new york undercover ice-t, boyz n hood, higher learning friday ice cube see real deal. ice-t's horribly cliched dialogue alone make film grate teeth, i'm still wondering heck bill paxton film? heck always play exact character? alien onward, every film i've seen bill paxton playing exact irritating character, least alien character died, made somewhat gratifying...<br /><br />overall, second-rate action trash. countless better film see, really want see one, watch judgement night, practically carbon copy better acting better script. thing made worth watching decent hand camera - cinematography almost refreshing, come close making horrible film - quite. 4/10."
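
The three cleaning steps can be sketched on a single sentence. This is a minimal illustration only: `STOP_WORDS`, `LEMMAS` and `clean_review` are hypothetical stand-ins invented for this example, not the stop-word list or lemmatizer the notebook actually used.

```python
# Toy stand-ins (assumptions): a tiny stop-word set and a lemma lookup table.
STOP_WORDS = {"the", "a", "is", "of", "and", "was"}
LEMMAS = {"films": "film", "makes": "make", "times": "time"}

def clean_review(text):
    # Step 1: lower-case so "Debut" and "debut" map to the same token.
    lowered = text.lower()
    # Step 2: drop stop words.
    kept = [word for word in lowered.split() if word not in STOP_WORDS]
    # Step 3: replace each word with its lemma when the table knows one.
    return " ".join(LEMMAS.get(word, word) for word in kept)

cleaned = clean_review("The Debut of the director makes great films")
print(cleaned)
```

A real lemmatizer works from a dictionary rather than a fixed table, which is also why it can produce surprises such as `us` becoming `u` in the output above.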


#### **Cleaning Ends**

In [29]:
train=final_filtered_df.query("type=='train'").drop(['type'],axis=1)
test=final_filtered_df.query("type=='test'").drop(['type'],axis=1)

In [30]:
x_train=train['review'].values
y_train=train['label'].values

x_test=test['review'].values
y_test=test['label'].values

In [31]:
x_train[0]

'story man unnatural feeling pig. start opening scene terrific example absurd comedy. formal orchestra audience turned insane, violent mob crazy chanting singers. unfortunately stay absurd whole time general narrative eventually making putting. even era turned off. cryptic dialogue would make shakespeare seem easy third grader. technical level better might think good cinematography future great vilmos zsigmond. future star sally kirkland frederic forrest seen briefly.'

In [32]:
vector=CountVectorizer()
trained_vector=vector.fit_transform(x_train)

In [33]:
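# dense view of row 50 (shape: 1 x vocabulary size); the trailing [:50] slices rows, so the full row is still returned
trained_vector[50].toarray()[:50]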

array([[0, 0, 0, ..., 0, 0, 0]])

In [34]:
vector.get_feature_names_out()[500:504]

array(['20perr', '20s', '20th', '20ties'], dtype=object)

In [35]:
display(train[train.review.str.contains('20minutes')])
display(train.shape)

Unnamed: 0,label,review
47493,1,"""life hit u face........we must try stay beautiful""<br /><br />debut movie one belgian's best artist (he sings songs), tom barman. long awaited movie and---happy happy joy joy flemish filmmaking---really worth watching, promising piece work! take u life 8 main character live friday- night. title say lot way spend time them: float friday night's party kinda' meet.<br /><br />it's rhytmic style 'thought off'. superb use music. sometimes take upperhand image feel power. gainsbourg! qotsa! party scene (20minutes???) thrilling visual experience cause way shot. keep really set small place lot people big party.....so hard shoot.<br /><br />thank menijã¨r barman making daring movie these, already year going, poor time flemish filmmaking. made day!"


(25000, 2)

In [36]:
from sklearn.naive_bayes import MultinomialNB

model=MultinomialNB()
model.fit(trained_vector,y_train)

In [37]:

# reuse the vocabulary learned from the training set; do not refit on the test data
test_vector=vector.transform(x_test)

In [38]:
prediction=model.predict(test_vector)

In [39]:
from sklearn.metrics import classification_report

print(classification_report(y_test,prediction))

              precision    recall  f1-score   support

           0       0.79      0.88      0.83     12500
           1       0.86      0.76      0.81     12500

    accuracy                           0.82     25000
   macro avg       0.83      0.82      0.82     25000
weighted avg       0.83      0.82      0.82     25000
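
The precision and recall columns in the report can be reproduced by hand from a confusion matrix. The sketch below uses made-up labels (`y_true` and `y_pred` are toy values, not the notebook's predictions) and computes both metrics for class 1.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for eight reviews (1 = positive, 0 = negative).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_pos = tp / (tp + fp)  # of the reviews predicted positive, how many really are
recall_pos = tp / (tp + fn)     # of the truly positive reviews, how many were found

print(precision_pos, recall_pos)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```

Read this way, the report above says the model misses about a quarter of the positive reviews (recall 0.76 for class 1) while most of the reviews it labels as positive really are positive (precision 0.86).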



In [40]:
final_sentiment=pd.DataFrame(zip(x_test,y_test,prediction),columns=['review','actual_sent','prediction_sent'])
final_sentiment.head(2)

Unnamed: 0,review,actual_sent,prediction_sent
0,"mr. costner dragged movie far longer necessary. aside terrific sea rescue sequences, care characters. u ghost closet, costner's character realized early on, forgotten much later, time care. character really care cocky, overconfident ashton kutcher. problem come kid think he's better anyone else around show sign cluttered closet. obstacle appears winning costner. finally well past half way point stinker, costner tell u kutcher's ghosts. told kutcher driven best prior inkling foreshadowing. magic here, could keep turning hour in.",0,0
1,"example majority action film same. generic boring, there's really nothing worth watching here. complete waste barely-tapped talent ice-t ice cube, who've proven many time capable acting, acting well. bother one, go see new jack city, ricochet watch new york undercover ice-t, boyz n hood, higher learning friday ice cube see real deal. ice-t's horribly cliched dialogue alone make film grate teeth, i'm still wondering heck bill paxton film? heck always play exact character? alien onward, every film i've seen bill paxton playing exact irritating character, least alien character died, made somewhat gratifying...<br /><br />overall, second-rate action trash. countless better film see, really want see one, watch judgement night, practically carbon copy better acting better script. thing made worth watching decent hand camera - cinematography almost refreshing, come close making horrible film - quite. 4/10.",0,0


In [41]:
final_sentiment.query("actual_sent != prediction_sent").head(2)

Unnamed: 0,review,actual_sent,prediction_sent
4,"brass picture (movies fitting word them) really somewhat brassy. alluring visual quality reminiscent expensive high class tv commercials. unfortunately brass picture feature film pretense wanting entertain viewer two hours! fail miserably, undeniable, rather soft flabby steamy, erotic quality non withstanding.<br /><br />senso '45 remake film luchino visconti title alida valli farley granger lead. original tell story senseless love lust around venice italian war independence. brass moved action 19th 20th century, 1945 exact, mussolini murals, men black shirts, german uniform tattered garb partisans. window dressing, historic context completely negligible.<br /><br />anna galiena play attractive aristocratic woman fall amoral s guy always put much lipstick. attractive, versatile, well trained italian actress clearly material. wide range facial expression (signalling boredom, loathing, delight, fear, hate ... ecstasy) best reason watch picture worth two stars. endures basically trashy stuff astonishing amount dignity. wish really good part come along her. really deserves it.",0,1
9,"wealthy horse rancher buenos aire long-standing no-trading policy crawford manhattan, happens mustachioed latin son fall certain crawford bright eyes, blonde hair, perky move dance floor? 20th century-fox musical glossy veneer yet seems bit tatty around edges. heavy frenetic, gymnastic-like dancing, exceedingly thin story. betty grable (an eleventh hour replacement alice faye) give boost, even though paired leaden ameche (in tan make-up slick hair). also good: charlotte greenwood betty's pithy aunt, limousine driver who's constantly asleep job, carmen miranda playing (who else?). stock shot argentina far outclass action filmed fox backlot, supporting performance quite awful. time big horserace finale, viewer enough. *1/2 ****",0,1


In [58]:
test_review='that movie was awesome.'

def sentiment_analysis(text):
    # vectorize the raw text with the vocabulary fitted on the training set
    review=np.array([text])
    text_to_predict=vector.transform(review)
    # predict returns an array with one label per input text
    predicted_value=model.predict(text_to_predict)
    if predicted_value[0] == 1:
        print('The review is positive.')
    else:
        print('The review is negative.')

sentiment_analysis(test_review)

The review is positive.
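
As an alternative arrangement (not what the notebook does), the vectorizer and the classifier can be bundled with scikit-learn's `make_pipeline`, so new text is always transformed with the same fitted vocabulary before prediction. The training sentences below are a made-up toy set used only to keep the sketch self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data (1 = positive, 0 = negative).
toy_texts = [
    "great awesome movie",
    "awesome acting great fun",
    "boring awful movie",
    "awful plot boring waste",
]
toy_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together.
toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_texts, toy_labels)

# Raw strings go straight in; the pipeline applies transform() before predict().
predictions = toy_pipeline.predict(["awesome fun", "boring waste"])
print(predictions)
```

Bundling the two steps removes the risk of calling `fit_transform` on new text by mistake, which would rebuild the vocabulary and break the alignment with the trained model.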
