## **Term Frequency - Inverse Document Frequency**

**TF-IDF (Term Frequency-Inverse Document Frequency)** is a numerical statistic used in NLP to evaluate the importance of a word in a document relative to a collection or corpus of documents. It combines two key measures:

### **Term Frequency (TF)**
How often a word appears in a specific document, relative to the document's length. A higher TF indicates that the word is more prominent within that document.

$$
\text{TF} = \frac{\text{Number of times the word appears in the document}}{\text{Total number of words in the document}}
$$

### **Inverse Document Frequency (IDF)**
A measure of how unique or rare a word is across all documents in the corpus. Common words that appear in many documents get lower IDF scores, while rare words that appear in fewer documents get higher scores.

$$
\text{IDF} = \log \left(\frac{\text{Total number of documents}}{\text{Number of documents containing the word}}\right)
$$

By multiplying TF and IDF, TF-IDF highlights words that are important in a specific document but not too common across all documents. It is commonly used in tasks like information retrieval, document ranking, and text mining, as it helps prioritize significant words for better text analysis.
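
The two formulas above can be sketched in plain Python. This is a minimal illustration rather than the implementation used later in the notebook: the three-sentence corpus and the helper names `term_frequency`, `inverse_document_frequency`, and `tf_idf` are made up for the example, and the natural logarithm is assumed.

```python
import math

def term_frequency(term, doc_tokens):
    # Occurrences of the term divided by the total number of words in the document.
    return doc_tokens.count(term) / len(doc_tokens)

def inverse_document_frequency(term, corpus_tokens):
    # log(total documents / documents containing the term).
    # Assumes the term occurs in at least one document (otherwise this divides by zero).
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing)

def tf_idf(term, doc_tokens, corpus_tokens):
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus_tokens)

# Hypothetical toy corpus, already lowercased and split on whitespace.
corpus = [
    "the model predicts churn".split(),
    "the model uses a random forest".split(),
    "the schema uses a star layout".split(),
]

print(tf_idf("the", corpus[0], corpus))
print(tf_idf("churn", corpus[0], corpus))
```

Here `the` appears in every document, so its IDF is log(3/3) = 0 and its TF-IDF vanishes, while `churn` appears in only one of the three documents and scores 0.25 × log 3 ≈ 0.275.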


In [7]:
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

pd.set_option('display.max_colwidth',0)

In [8]:
lemma=WordNetLemmatizer()
engStops=stopwords.words('english')

df=pd.read_excel(r"D:/Projects/NLP/temp_tfidf.xlsx")
df

Unnamed: 0,Contents
0,Solved defaulter classification problem using XGBOOST model. Built Churn Risk Model.
1,Built a segmentation model using Random Forest. Built a decision tree model for classification.
2,Built ETL and RDBMS system. Designed and Built Star or Snowflake or Galaxy schemas
3,Experienced in OBIEE. Designed OLAP Datawarehouse.


In [9]:
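# Lowercase, lemmatize each whitespace-separated token, then drop English stop words.
# Tokens keep any attached punctuation here (e.g. "model."); TfidfVectorizer's tokenizer drops it later.
df['Contents']=df['Contents'].str.lower()
df['Lemmatized']=df['Contents'].apply(lambda x:" ".join(lemma.lemmatize(word) for word in x.split()))
df['Finallist']=df['Lemmatized'].apply(lambda x:" ".join(word for word in x.split() if word not in engStops))
df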

Unnamed: 0,Contents,Lemmatized,Finallist
0,solved defaulter classification problem using xgboost model. built churn risk model.,solved defaulter classification problem using xgboost model. built churn risk model.,solved defaulter classification problem using xgboost model. built churn risk model.
1,built a segmentation model using random forest. built a decision tree model for classification.,built a segmentation model using random forest. built a decision tree model for classification.,built segmentation model using random forest. built decision tree model classification.
2,built etl and rdbms system. designed and built star or snowflake or galaxy schemas,built etl and rdbms system. designed and built star or snowflake or galaxy schema,built etl rdbms system. designed built star snowflake galaxy schema
3,experienced in obiee. designed olap datawarehouse.,experienced in obiee. designed olap datawarehouse.,experienced obiee. designed olap datawarehouse.
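
In the output above, tokens such as `model.` and `obiee.` keep their trailing period because the text is split on whitespace only, so a stop word or plural followed by punctuation would slip past the filter and the lemmatizer. One possible alternative is to tokenize with a regular expression first. The sketch below assumes a hypothetical `clean` helper and a deliberately tiny `STOP_WORDS` set; the notebook itself uses NLTK's English stop-word list.

```python
import re

# Hypothetical, deliberately small stop list for illustration only.
STOP_WORDS = {"a", "and", "for", "in", "or", "the"}

def clean(text):
    """Lowercase, split into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(clean("Built a decision tree model for classification."))
```

Because punctuation never becomes part of a token, each word can be compared against the stop list (or passed to a lemmatizer) in its bare form.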


In [10]:
my_doc=df['Finallist'].values
type(my_doc)

numpy.ndarray

In [11]:
model=TfidfVectorizer()
transformed_doc=model.fit_transform(my_doc)

In [12]:
print(model.vocabulary_)

{'solved': 21, 'defaulter': 5, 'classification': 2, 'problem': 14, 'using': 25, 'xgboost': 26, 'model': 11, 'built': 0, 'churn': 1, 'risk': 17, 'segmentation': 19, 'random': 15, 'forest': 9, 'decision': 4, 'tree': 24, 'etl': 7, 'rdbms': 16, 'system': 23, 'designed': 6, 'star': 22, 'snowflake': 20, 'galaxy': 10, 'schema': 18, 'experienced': 8, 'obiee': 12, 'olap': 13, 'datawarehouse': 3}


In [13]:
print(model.get_feature_names_out())

['built' 'churn' 'classification' 'datawarehouse' 'decision' 'defaulter'
 'designed' 'etl' 'experienced' 'forest' 'galaxy' 'model' 'obiee' 'olap'
 'problem' 'random' 'rdbms' 'risk' 'schema' 'segmentation' 'snowflake'
 'solved' 'star' 'system' 'tree' 'using' 'xgboost']


In [14]:
model.idf_[0]

1.2231435513142097

In [15]:
features=model.get_feature_names_out()
for feature in features:
    idex=model.vocabulary_.get(feature)
    print(feature,"==>",model.idf_[idex])

built ==> 1.2231435513142097
churn ==> 1.916290731874155
classification ==> 1.5108256237659907
datawarehouse ==> 1.916290731874155
decision ==> 1.916290731874155
defaulter ==> 1.916290731874155
designed ==> 1.5108256237659907
etl ==> 1.916290731874155
experienced ==> 1.916290731874155
forest ==> 1.916290731874155
galaxy ==> 1.916290731874155
model ==> 1.5108256237659907
obiee ==> 1.916290731874155
olap ==> 1.916290731874155
problem ==> 1.916290731874155
random ==> 1.916290731874155
rdbms ==> 1.916290731874155
risk ==> 1.916290731874155
schema ==> 1.916290731874155
segmentation ==> 1.916290731874155
snowflake ==> 1.916290731874155
solved ==> 1.916290731874155
star ==> 1.916290731874155
system ==> 1.916290731874155
tree ==> 1.916290731874155
using ==> 1.5108256237659907
xgboost ==> 1.916290731874155
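
These values do not come from the plain log(N / df) formula given earlier. They are consistent with scikit-learn's default `smooth_idf=True` setting, which computes idf = ln((1 + n) / (1 + df)) + 1. The sketch below reproduces a few of them by hand; the `smoothed_idf` helper is hypothetical, and the regular expression mirrors `TfidfVectorizer`'s default `token_pattern`.

```python
import math
import re

# The four cleaned documents from the Finallist column above.
docs = [
    "solved defaulter classification problem using xgboost model. built churn risk model.",
    "built segmentation model using random forest. built decision tree model classification.",
    "built etl rdbms system. designed built star snowflake galaxy schema",
    "experienced obiee. designed olap datawarehouse.",
]

# Words of two or more characters, as in TfidfVectorizer's default token_pattern.
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def smoothed_idf(term, documents):
    """idf = ln((1 + n) / (1 + df)) + 1, the smooth_idf=True variant."""
    n = len(documents)
    df = sum(1 for doc in documents if term in set(TOKEN.findall(doc)))
    return math.log((1 + n) / (1 + df)) + 1

for term in ("built", "model", "xgboost"):
    print(term, "==>", smoothed_idf(term, docs))
```

For `built`, n = 4 and df = 3, so idf = ln(5/4) + 1 ≈ 1.2231, which matches `model.idf_[0]`.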


## **Project - IMDB Review with TF-IDF**

In [16]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

pd.set_option('display.max_colwidth',0)

In [17]:
df=pd.read_csv('D:/Projects/NLP/imdb_master.csv',encoding='ISO-8859-1').drop(['file','Unnamed: 0'],axis=1)
display(df.shape)
display(df.head())
display(df.type.value_counts())
display(df.label.value_counts())

(100000, 3)

Unnamed: 0,type,review,label
0,test,"Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.",neg
1,test,"This is an example of why the majority of action films are the same. Generic and boring, there's really nothing worth watching here. A complete waste of the then barely-tapped talents of Ice-T and Ice Cube, who've each proven many times over that they are capable of acting, and acting well. Don't bother with this one, go see New Jack City, Ricochet or watch New York Undercover for Ice-T, or Boyz n the Hood, Higher Learning or Friday for Ice Cube and see the real deal. Ice-T's horribly cliched dialogue alone makes this film grate at the teeth, and I'm still wondering what the heck Bill Paxton was doing in this film? And why the heck does he always play the exact same character? From Aliens onward, every film I've seen with Bill Paxton has him playing the exact same irritating character, and at least in Aliens his character died, which made it somewhat gratifying...<br /><br />Overall, this is second-rate action trash. There are countless better films to see, and if you really want to see this one, watch Judgement Night, which is practically a carbon copy but has better acting and a better script. The only thing that made this at all worth watching was a decent hand on the camera - the cinematography was almost refreshing, which comes close to making up for the horrible film itself - but not quite. 4/10.",neg
2,test,"First of all I hate those moronic rappers, who could'nt act if they had a gun pressed against their foreheads. All they do is curse and shoot each other and acting like clichÃ©'e version of gangsters.<br /><br />The movie doesn't take more than five minutes to explain what is going on before we're already at the warehouse There is not a single sympathetic character in this movie, except for the homeless guy, who is also the only one with half a brain.<br /><br />Bill Paxton and William Sadler are both hill billies and Sadlers character is just as much a villain as the gangsters. I did'nt like him right from the start.<br /><br />The movie is filled with pointless violence and Walter Hills specialty: people falling through windows with glass flying everywhere. There is pretty much no plot and it is a big problem when you root for no-one. Everybody dies, except from Paxton and the homeless guy and everybody get what they deserve.<br /><br />The only two black people that can act is the homeless guy and the junkie but they're actors by profession, not annoying ugly brain dead rappers.<br /><br />Stay away from this crap and watch 48 hours 1 and 2 instead. At lest they have characters you care about, a sense of humor and nothing but real actors in the cast.",neg
3,test,"Not even the Beatles could write songs everyone liked, and although Walter Hill is no mop-top he's second to none when it comes to thought provoking action movies. The nineties came and social platforms were changing in music and film, the emergence of the Rapper turned movie star was in full swing, the acting took a back seat to each man's overpowering regional accent and transparent acting. This was one of the many ice-t movies i saw as a kid and loved, only to watch them later and cringe. Bill Paxton and William Sadler are firemen with basic lives until a burning building tenant about to go up in flames hands over a map with gold implications. I hand it to Walter for quickly and neatly setting up the main characters and location. But i fault everyone involved for turning out Lame-o performances. Ice-t and cube must have been red hot at this time, and while I've enjoyed both their careers as rappers, in my opinion they fell flat in this movie. It's about ninety minutes of one guy ridiculously turning his back on the other guy to the point you find yourself locked in multiple states of disbelief. Now this is a movie, its not a documentary so i wont waste my time recounting all the stupid plot twists in this movie, but there were many, and they led nowhere. I got the feeling watching this that everyone on set was sord of confused and just playing things off the cuff. There are two things i still enjoy about it, one involves a scene with a needle and the other is Sadler's huge 45 pistol. Bottom line this movie is like domino's pizza. Yeah ill eat it if I'm hungry and i don't feel like cooking, But I'm well aware it tastes like crap. 3 stars, meh.",neg
4,test,"Brass pictures (movies is not a fitting word for them) really are somewhat brassy. Their alluring visual qualities are reminiscent of expensive high class TV commercials. But unfortunately Brass pictures are feature films with the pretense of wanting to entertain viewers for over two hours! In this they fail miserably, their undeniable, but rather soft and flabby than steamy, erotic qualities non withstanding.<br /><br />Senso '45 is a remake of a film by Luchino Visconti with the same title and Alida Valli and Farley Granger in the lead. The original tells a story of senseless love and lust in and around Venice during the Italian wars of independence. Brass moved the action from the 19th into the 20th century, 1945 to be exact, so there are Mussolini murals, men in black shirts, German uniforms or the tattered garb of the partisans. But it is just window dressing, the historic context is completely negligible.<br /><br />Anna Galiena plays the attractive aristocratic woman who falls for the amoral SS guy who always puts on too much lipstick. She is an attractive, versatile, well trained Italian actress and clearly above the material. Her wide range of facial expressions (signalling boredom, loathing, delight, fear, hate ... and ecstasy) are the best reason to watch this picture and worth two stars. She endures this basically trashy stuff with an astonishing amount of dignity. I wish some really good parts come along for her. She really deserves it.",neg


type
train    75000
test     25000
Name: count, dtype: int64

label
unsup    50000
neg      25000
pos      25000
Name: count, dtype: int64

In [18]:
filtered_df=df.query("label!='unsup'")
filtered_df.label.value_counts()

label
neg    25000
pos    25000
Name: count, dtype: int64

In [19]:
# Take an explicit copy so the column assignments below do not trigger SettingWithCopyWarning.
filtered_df=filtered_df.copy()
filtered_df['label']=filtered_df['label'].apply(lambda x: 0 if x == 'neg' else 1)
display(filtered_df.head())
display(filtered_df.label.value_counts())


Unnamed: 0,type,review,label
0,test,"Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.",0
1,test,"This is an example of why the majority of action films are the same. Generic and boring, there's really nothing worth watching here. A complete waste of the then barely-tapped talents of Ice-T and Ice Cube, who've each proven many times over that they are capable of acting, and acting well. Don't bother with this one, go see New Jack City, Ricochet or watch New York Undercover for Ice-T, or Boyz n the Hood, Higher Learning or Friday for Ice Cube and see the real deal. Ice-T's horribly cliched dialogue alone makes this film grate at the teeth, and I'm still wondering what the heck Bill Paxton was doing in this film? And why the heck does he always play the exact same character? From Aliens onward, every film I've seen with Bill Paxton has him playing the exact same irritating character, and at least in Aliens his character died, which made it somewhat gratifying...<br /><br />Overall, this is second-rate action trash. There are countless better films to see, and if you really want to see this one, watch Judgement Night, which is practically a carbon copy but has better acting and a better script. The only thing that made this at all worth watching was a decent hand on the camera - the cinematography was almost refreshing, which comes close to making up for the horrible film itself - but not quite. 4/10.",0
2,test,"First of all I hate those moronic rappers, who could'nt act if they had a gun pressed against their foreheads. All they do is curse and shoot each other and acting like clichÃ©'e version of gangsters.<br /><br />The movie doesn't take more than five minutes to explain what is going on before we're already at the warehouse There is not a single sympathetic character in this movie, except for the homeless guy, who is also the only one with half a brain.<br /><br />Bill Paxton and William Sadler are both hill billies and Sadlers character is just as much a villain as the gangsters. I did'nt like him right from the start.<br /><br />The movie is filled with pointless violence and Walter Hills specialty: people falling through windows with glass flying everywhere. There is pretty much no plot and it is a big problem when you root for no-one. Everybody dies, except from Paxton and the homeless guy and everybody get what they deserve.<br /><br />The only two black people that can act is the homeless guy and the junkie but they're actors by profession, not annoying ugly brain dead rappers.<br /><br />Stay away from this crap and watch 48 hours 1 and 2 instead. At lest they have characters you care about, a sense of humor and nothing but real actors in the cast.",0
3,test,"Not even the Beatles could write songs everyone liked, and although Walter Hill is no mop-top he's second to none when it comes to thought provoking action movies. The nineties came and social platforms were changing in music and film, the emergence of the Rapper turned movie star was in full swing, the acting took a back seat to each man's overpowering regional accent and transparent acting. This was one of the many ice-t movies i saw as a kid and loved, only to watch them later and cringe. Bill Paxton and William Sadler are firemen with basic lives until a burning building tenant about to go up in flames hands over a map with gold implications. I hand it to Walter for quickly and neatly setting up the main characters and location. But i fault everyone involved for turning out Lame-o performances. Ice-t and cube must have been red hot at this time, and while I've enjoyed both their careers as rappers, in my opinion they fell flat in this movie. It's about ninety minutes of one guy ridiculously turning his back on the other guy to the point you find yourself locked in multiple states of disbelief. Now this is a movie, its not a documentary so i wont waste my time recounting all the stupid plot twists in this movie, but there were many, and they led nowhere. I got the feeling watching this that everyone on set was sord of confused and just playing things off the cuff. There are two things i still enjoy about it, one involves a scene with a needle and the other is Sadler's huge 45 pistol. Bottom line this movie is like domino's pizza. Yeah ill eat it if I'm hungry and i don't feel like cooking, But I'm well aware it tastes like crap. 3 stars, meh.",0
4,test,"Brass pictures (movies is not a fitting word for them) really are somewhat brassy. Their alluring visual qualities are reminiscent of expensive high class TV commercials. But unfortunately Brass pictures are feature films with the pretense of wanting to entertain viewers for over two hours! In this they fail miserably, their undeniable, but rather soft and flabby than steamy, erotic qualities non withstanding.<br /><br />Senso '45 is a remake of a film by Luchino Visconti with the same title and Alida Valli and Farley Granger in the lead. The original tells a story of senseless love and lust in and around Venice during the Italian wars of independence. Brass moved the action from the 19th into the 20th century, 1945 to be exact, so there are Mussolini murals, men in black shirts, German uniforms or the tattered garb of the partisans. But it is just window dressing, the historic context is completely negligible.<br /><br />Anna Galiena plays the attractive aristocratic woman who falls for the amoral SS guy who always puts on too much lipstick. She is an attractive, versatile, well trained Italian actress and clearly above the material. Her wide range of facial expressions (signalling boredom, loathing, delight, fear, hate ... and ecstasy) are the best reason to watch this picture and worth two stars. She endures this basically trashy stuff with an astonishing amount of dignity. I wish some really good parts come along for her. She really deserves it.",0


label
0    25000
1    25000
Name: count, dtype: int64

#### **Cleaning Starts**

In [20]:
filtered_df['review_lower']=filtered_df['review'].str.lower()


In [21]:
# Removing stop words (engStops was defined in the TF-IDF section above)
filtered_df['review_without_stopwords']=filtered_df['review_lower'].apply(lambda x:" ".join(word for word in x.split() if word not in engStops))


In [22]:
lemm=WordNetLemmatizer()
filtered_df['lemmatized_review']=filtered_df['review_without_stopwords'].apply(lambda x: " ".join(lemm.lemmatize(word) for word in x.split()))


In [23]:
filtered_df.head(2)

Unnamed: 0,type,review,label,review_lower,review_without_stopwords,lemmatized_review
0,test,"Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.",0,"once again mr. costner has dragged out a movie for far longer than necessary. aside from the terrific sea rescue sequences, of which there are very few i just did not care about any of the characters. most of us have ghosts in the closet, and costner's character are realized early on, and then forgotten until much later, by which time i did not care. the character we should really care about is a very cocky, overconfident ashton kutcher. the problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. his only obstacle appears to be winning over costner. finally when we are well past the half way point of this stinker, costner tells us all about kutcher's ghosts. we are told why kutcher is driven to be the best with no prior inkling or foreshadowing. no magic here, it was all i could do to keep from turning it off an hour in.","mr. costner dragged movie far longer necessary. aside terrific sea rescue sequences, care characters. us ghosts closet, costner's character realized early on, forgotten much later, time care. character really care cocky, overconfident ashton kutcher. problem comes kid thinks he's better anyone else around shows signs cluttered closet. obstacle appears winning costner. finally well past half way point stinker, costner tells us kutcher's ghosts. told kutcher driven best prior inkling foreshadowing. magic here, could keep turning hour in.","mr. costner dragged movie far longer necessary. aside terrific sea rescue sequences, care characters. u ghost closet, costner's character realized early on, forgotten much later, time care. character really care cocky, overconfident ashton kutcher. problem come kid think he's better anyone else around show sign cluttered closet. obstacle appears winning costner. finally well past half way point stinker, costner tell u kutcher's ghosts. told kutcher driven best prior inkling foreshadowing. magic here, could keep turning hour in."
1,test,"This is an example of why the majority of action films are the same. Generic and boring, there's really nothing worth watching here. A complete waste of the then barely-tapped talents of Ice-T and Ice Cube, who've each proven many times over that they are capable of acting, and acting well. Don't bother with this one, go see New Jack City, Ricochet or watch New York Undercover for Ice-T, or Boyz n the Hood, Higher Learning or Friday for Ice Cube and see the real deal. Ice-T's horribly cliched dialogue alone makes this film grate at the teeth, and I'm still wondering what the heck Bill Paxton was doing in this film? And why the heck does he always play the exact same character? From Aliens onward, every film I've seen with Bill Paxton has him playing the exact same irritating character, and at least in Aliens his character died, which made it somewhat gratifying...<br /><br />Overall, this is second-rate action trash. There are countless better films to see, and if you really want to see this one, watch Judgement Night, which is practically a carbon copy but has better acting and a better script. The only thing that made this at all worth watching was a decent hand on the camera - the cinematography was almost refreshing, which comes close to making up for the horrible film itself - but not quite. 4/10.",0,"this is an example of why the majority of action films are the same. generic and boring, there's really nothing worth watching here. a complete waste of the then barely-tapped talents of ice-t and ice cube, who've each proven many times over that they are capable of acting, and acting well. don't bother with this one, go see new jack city, ricochet or watch new york undercover for ice-t, or boyz n the hood, higher learning or friday for ice cube and see the real deal. ice-t's horribly cliched dialogue alone makes this film grate at the teeth, and i'm still wondering what the heck bill paxton was doing in this film? and why the heck does he always play the exact same character? from aliens onward, every film i've seen with bill paxton has him playing the exact same irritating character, and at least in aliens his character died, which made it somewhat gratifying...<br /><br />overall, this is second-rate action trash. there are countless better films to see, and if you really want to see this one, watch judgement night, which is practically a carbon copy but has better acting and a better script. the only thing that made this at all worth watching was a decent hand on the camera - the cinematography was almost refreshing, which comes close to making up for the horrible film itself - but not quite. 4/10.","example majority action films same. generic boring, there's really nothing worth watching here. complete waste barely-tapped talents ice-t ice cube, who've proven many times capable acting, acting well. bother one, go see new jack city, ricochet watch new york undercover ice-t, boyz n hood, higher learning friday ice cube see real deal. ice-t's horribly cliched dialogue alone makes film grate teeth, i'm still wondering heck bill paxton film? heck always play exact character? aliens onward, every film i've seen bill paxton playing exact irritating character, least aliens character died, made somewhat gratifying...<br /><br />overall, second-rate action trash. countless better films see, really want see one, watch judgement night, practically carbon copy better acting better script. thing made worth watching decent hand camera - cinematography almost refreshing, comes close making horrible film - quite. 4/10.","example majority action film same. generic boring, there's really nothing worth watching here. complete waste barely-tapped talent ice-t ice cube, who've proven many time capable acting, acting well. bother one, go see new jack city, ricochet watch new york undercover ice-t, boyz n hood, higher learning friday ice cube see real deal. ice-t's horribly cliched dialogue alone make film grate teeth, i'm still wondering heck bill paxton film? heck always play exact character? alien onward, every film i've seen bill paxton playing exact irritating character, least alien character died, made somewhat gratifying...<br /><br />overall, second-rate action trash. countless better film see, really want see one, watch judgement night, practically carbon copy better acting better script. thing made worth watching decent hand camera - cinematography almost refreshing, come close making horrible film - quite. 4/10."


In [24]:
'''
    Cleaning steps performed above:
        - First, converted every review to lower case; otherwise "Debut" and "debut" would be treated as two different tokens even though they are the same word
        - Second, removed the stop words from the text
        - Lastly, lemmatized the text
    With the text lemmatized, drop the intermediate columns and rename the lemmatized_review column to review
'''

filtered_df=filtered_df.drop(['review','review_lower','review_without_stopwords'],axis=1)
final_filtered_df=filtered_df.rename({'lemmatized_review':'review'},axis=1)
final_filtered_df.head(2)

Unnamed: 0,type,label,review
0,test,0,"mr. costner dragged movie far longer necessary. aside terrific sea rescue sequences, care characters. u ghost closet, costner's character realized early on, forgotten much later, time care. character really care cocky, overconfident ashton kutcher. problem come kid think he's better anyone else around show sign cluttered closet. obstacle appears winning costner. finally well past half way point stinker, costner tell u kutcher's ghosts. told kutcher driven best prior inkling foreshadowing. magic here, could keep turning hour in."
1,test,0,"example majority action film same. generic boring, there's really nothing worth watching here. complete waste barely-tapped talent ice-t ice cube, who've proven many time capable acting, acting well. bother one, go see new jack city, ricochet watch new york undercover ice-t, boyz n hood, higher learning friday ice cube see real deal. ice-t's horribly cliched dialogue alone make film grate teeth, i'm still wondering heck bill paxton film? heck always play exact character? alien onward, every film i've seen bill paxton playing exact irritating character, least alien character died, made somewhat gratifying...<br /><br />overall, second-rate action trash. countless better film see, really want see one, watch judgement night, practically carbon copy better acting better script. thing made worth watching decent hand camera - cinematography almost refreshing, come close making horrible film - quite. 4/10."
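
The cleaned reviews above still contain `<br />` markup from the original HTML, which `TfidfVectorizer`'s default tokenizer would turn into a `br` token. One possible extra cleaning step, sketched here with a hypothetical `strip_br` helper, is to replace those tags with spaces before lowercasing.

```python
import re

# Matches <br>, <br/>, and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

def strip_br(text):
    """Replace each HTML line-break tag with a single space."""
    return BR_TAG.sub(" ", text)

sample = "gratifying...<br /><br />overall, second-rate action trash."
print(strip_br(sample))
```

Applied to the DataFrame, this could run as `filtered_df['review'].apply(strip_br)` ahead of the lowercasing step, assuming the column names used in this notebook.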


#### **Cleaning Ends**

In [25]:
train=final_filtered_df.query("type=='train'").drop(['type'],axis=1)
test=final_filtered_df.query("type=='test'").drop(['type'],axis=1)

In [26]:
x_train=train['review'].values
y_train=train['label'].values

x_test=test['review'].values
y_test=test['label'].values

In [27]:
vec=TfidfVectorizer()
trained_vector=vec.fit_transform(x_train)

In [28]:
model=MultinomialNB()
model.fit(trained_vector,y_train)

In [29]:
test_vector=vec.transform(x_test)
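
Note the asymmetry between the two calls: the vectorizer is fitted on the training reviews only (`fit_transform`), and the test reviews are then mapped onto that same vocabulary with `transform`, so no information from the test set leaks into the features. A toy sketch of that behaviour follows; the sentences and the `toy_*` names are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy data, not taken from the IMDB set.
toy_train = ["good movie", "bad movie"]
toy_test = ["good unseen film"]

toy_vec = TfidfVectorizer()
toy_train_matrix = toy_vec.fit_transform(toy_train)  # learns the vocabulary and IDF weights
toy_test_matrix = toy_vec.transform(toy_test)        # reuses them; nothing is re-learned

print(sorted(toy_vec.vocabulary_))  # terms learned from the training data only
print(toy_test_matrix.shape)        # same number of columns as the training matrix
```

Words that appear only in the test data (`unseen`, `film`) have no column and are silently ignored, which is why the test matrix has exactly as many columns as the training matrix.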

In [30]:
prediction=model.predict(test_vector)

In [31]:
from sklearn.metrics import classification_report

In [33]:
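# Note: classification_report expects (y_true, y_pred). With the arguments in this order,
# the precision and recall columns are swapped and support counts predictions, not true labels.
print(classification_report(prediction,y_test))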

              precision    recall  f1-score   support

           0       0.88      0.80      0.84     13804
           1       0.78      0.87      0.82     11196

    accuracy                           0.83     25000
   macro avg       0.83      0.83      0.83     25000
weighted avg       0.84      0.83      0.83     25000
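
When reading this report, keep in mind that scikit-learn documents the signature as `classification_report(y_true, y_pred)`, and that the order matters: swapping the two arguments exchanges precision and recall for every class. The plain-Python sketch below illustrates this with made-up labels; the `precision` and `recall` helpers are hypothetical and written out only to make the definitions explicit.

```python
def precision(y_true, y_pred, label):
    # Of everything predicted as `label`, the fraction that truly is `label`.
    predicted = sum(1 for p in y_pred if p == label)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return correct / predicted

def recall(y_true, y_pred, label):
    # Of everything that truly is `label`, the fraction predicted as `label`.
    actual = sum(1 for t in y_true if t == label)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return correct / actual

# Made-up labels: the classifier over-predicts class 0.
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 0, 1, 1]

print("correct order :", precision(toy_true, toy_pred, 0), recall(toy_true, toy_pred, 0))
print("swapped order :", precision(toy_pred, toy_true, 0), recall(toy_pred, toy_true, 0))
```

In the conventional order, the call for the cells above would be `classification_report(y_test, prediction)`.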



In [34]:
final_sentiment=pd.DataFrame(zip(x_test,y_test,prediction),columns=['review','actual_sent','prediction_sent'])
final_sentiment.head(2)

Unnamed: 0,review,actual_sent,prediction_sent
0,"mr. costner dragged movie far longer necessary. aside terrific sea rescue sequences, care characters. u ghost closet, costner's character realized early on, forgotten much later, time care. character really care cocky, overconfident ashton kutcher. problem come kid think he's better anyone else around show sign cluttered closet. obstacle appears winning costner. finally well past half way point stinker, costner tell u kutcher's ghosts. told kutcher driven best prior inkling foreshadowing. magic here, could keep turning hour in.",0,0
1,"example majority action film same. generic boring, there's really nothing worth watching here. complete waste barely-tapped talent ice-t ice cube, who've proven many time capable acting, acting well. bother one, go see new jack city, ricochet watch new york undercover ice-t, boyz n hood, higher learning friday ice cube see real deal. ice-t's horribly cliched dialogue alone make film grate teeth, i'm still wondering heck bill paxton film? heck always play exact character? alien onward, every film i've seen bill paxton playing exact irritating character, least alien character died, made somewhat gratifying...<br /><br />overall, second-rate action trash. countless better film see, really want see one, watch judgement night, practically carbon copy better acting better script. thing made worth watching decent hand camera - cinematography almost refreshing, come close making horrible film - quite. 4/10.",0,0
