<center><font size=8>Hands-on - Analyzing Text Data</font></center>

# **Problem Statement**

## **Business Context**

In the rapidly evolving landscape of the entertainment industry, understanding audience feedback through movie reviews is essential for refining content and shaping marketing strategies. However, the sheer volume of reviews presents challenges in efficiently processing and analyzing this information. To remain competitive, entertainment companies must find effective ways to clean and structure this data, enabling them to derive valuable insights for enhancing viewer experiences and making informed decisions.

## **Objective**

As a data scientist, your objective is to develop an efficient text preprocessing pipeline that will clean and structure a dataset of movie reviews. This preprocessing step will ensure that the data is standardized and ready for further analysis, ultimately supporting the identification of trends and insights that can drive content and marketing strategies in the entertainment industry.

## **Data Dictionary**

- **review**: review of a movie

# **Importing Necessary Libraries**

In [None]:
!pip install nltk==3.8.1 scikit-learn==1.5.2 -q

In [None]:
# to read and manipulate the data
import pandas as pd
import numpy as np

# displaying the full content of each column without truncation
pd.set_option('max_colwidth', None)

# **Importing the dataset**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# loading data into a pandas dataframe
reviews = pd.read_csv("movie_reviews.csv")

In [None]:
# creating a copy of the data
data = reviews.copy()

# **Data Overview**

## **Checking the first 5 rows**

In [None]:
data.head(5)

Unnamed: 0,review
0,"Okay, I know this does'nt project India in a good light. But the overall theme of the movie is not India, it's Shakti. The power of a warlord, and the power of a mother. The relationship between Nandini and her husband and son swallow you up in their warmth. Then things go terribly wrong. The interaction between Nandini and her father in law - the power of their dysfunctional relationship - and the lives changed by it are the strengths of this movie. Shah Rukh Khan's performance seems to be a mere cameo compared to the believable desperation of Karisma Kapoor. It is easy to get caught up in the love, violence and redemption of lives in this film, and find yourself heaving a sigh of relief and sadness at the climax. The musical interludes are strengths, believable and well done."
1,"Despite John Travolta's statements in interviews that this was his favorite role of his career, ""Be Cool"" proves to be a disappointing sequel to 1995's witty and clever ""Get Shorty.""<br /><br />Travolta delivers a pleasant enough performance in this mildly entertaining film, but ultimately the movie falls flat due to an underdeveloped plot, unlikeable characters, and a surprising lack of chemistry between leads Travolta and Uma Thurman. Although there are some laughs, this unfunny dialog example (which appeared frequently in the trailers) kind of says it all: Thurman: Do you dance? Travolta: Hey, I'm from Brooklyn.<br /><br />The film suggests that everyone in the entertainment business is a gangster or aspires to be one, likening it to organized crime. In ""Get Shorty,"" the premise of a gangster ""going legitimate"" by getting into movies was a clever fish-out-of water idea, but in ""Be Cool,"" it seems the biz has entirely gone crooked since then.<br /><br />The film is interestingly casted and the absolute highlight is a ""monolgue"" delivered by The Rock, whose character is an aspiring actor as well as a goon, where he reenacts a scene between Gabrielle Union and Kirsten Dunst from ""Bring It On."" Vince Vaughan's character thinks he's black and he's often seen dressed as a pimp-- this was quite funny in the first scene that introduces him and gets tired and embarrassing almost immediately afterward.<br /><br />Overall, ""Be Cool"" may be worth a rental for John Travolta die-hards (of which I am one), but you may want to keep your finger close to the fast forward button to get through it without feeling that you wasted too much time. Fans of ""Get Shorty"" may actually wish to avoid this, as the sequel is devoid of most things that made that one a winner. I rate this movie an admittedly harsh 4/10."
2,"I am a kung fu fan, but not a Woo fan. I have no interest in gangster movies filled with over-the-top gun-play. Now, martial arts; *that's* beautiful! And John Woo surprised me here by producing a highly entertaining kung fu movie, which almost has *too much* fighting, if such a thing is possible! This is good stuff.<br /><br />Many of the fight scenes are very good (and some of them are less good), and the main characters are amusing and likable. The bad guys are a bit too unbelievably evil, but entertaining none the less. You gotta see the Sleeping Wizard!! He can only fight when he's asleep - it's hysterical!<br /><br />Upon repeated viewings, however, Last Hurrah For Chivalry can tend to get a little boring and long-winded, also especially because many of the fight scenes are actually not that good. Hence, I rate it ""only"" a 7 out of 10. But it really is almost an ""8"".<br /><br />All in all one of the better kung fu movies, made smack-dab in the heart of kung fu cinema's prime. All the really good kung fu movies are from the mid- to late 1970ies, with some notable exceptions from the late '60ies and early '70ies (and early '80ies, to be fair)."
3,"He seems to be a control freak. I have heard him comment on ""losing control of the show"" and tell another guest who brought live animals that he had one rule-""no snakes."" He needs to hire a comedy writer because his jokes are lame. The only reason I watch him is because he some some great guests and bands. <br /><br />I watched the Craig Ferguson show for a while but his show is even worse. He likes to bull sh** to burn time.I don't think either man has much of a future in late night talk shows.<br /><br />Daily also has the annoying habit of sticking his tongue out to lick his lips. He must do this at least 10 times a show. I do like the Joe Firstman band. Carson Daily needs to lighten up before it is too late."
4,"Admittedly, there are some scenes in this movie that seem a little unrealistic. The ravishing woman first panics and then, only a few minutes later, she starts kissing the young lad while the old guy is right next to her. But as the film goes along we learn that she is a little volatile girl (or slut) and that partly explains her behavior. The cinematography of this movie is well done. We get to see the elevator from almost every angle and perspective, and some of those images and scenes really raise the tension. Götz George plays his character well, a wannabe hot-shot getting old and being overpowered by young men like the Jaennicke character. Wolfgang Kieling who I admired in Hitchcock's THE TORN CURTAIN delivers a great performance that, although he doesn't say much, he is by far the best actor in this play. One critic complained about how unrealistic the film was and that in a real case of emergency nothing would really happen. But then again, how realistic are films such as Mission impossible or Phone Booth. Given the fact that we are talking about a movie here, and that in a movie you always have to deal with some scenes that aren't very likely to occur in real life, you can still enjoy this movie. It's a lot better than many things that I see on German TV these days and I think that the vintage 80's style added something to this film."


## **Checking the shape of the data**

In [None]:
data.shape

(10000, 1)

* The dataset has 10000 rows and 1 column.

## **Checking for missing values**

In [None]:
data.isnull().sum()

Unnamed: 0,0
review,0


* There are no missing values in the data

## **Checking for duplicate values**

In [None]:
# checking for duplicate values
data.duplicated().sum()

18

In [None]:
# keeping only the first occurrence of duplicate values and dropping the rest
data = data.drop_duplicates(keep = 'first')

In [None]:
# resetting the index of the dataframe
data = data.reset_index(drop = True)

* The duplicate values have been removed.

# **Text Preprocessing**

### **Removing special characters from the text**

<h2>Why Remove Special Characters in Text Preprocessing?</h2>

- Special characters can introduce unnecessary elements, complicating text analysis.

- Clean text is simpler to work with and understand.

<h2>How to Implement Special Character Removal in Text Preprocessing?</h2>

- We can start by manually replacing unwanted characters like @, #, or punctuation marks with spaces.

- String functions like `replace()` help with simple replacements across the text.

In [None]:
text = "office.aiml.utaustin@mygreatlearning.com"

In [None]:
upd_text1 = text.replace("@", " ")
upd_text1

'office.aiml.utaustin mygreatlearning.com'

- We can observe that the '@' character has been replaced with a whitespace.
- Additionally, there are some punctuation marks in the email address. Let's replace those as well.

In [None]:
upd_text2 = upd_text1.replace(".", " ")
upd_text2

'office aiml utaustin mygreatlearning com'

- The punctuation marks ('.') are also replaced.


- However, as texts become larger and contain various patterns of special characters, manually handling them becomes tedious.

- In such cases, we need more flexible methods to efficiently identify and remove multiple patterns at once - this is where advanced tools like regular expressions come in handy.
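
To see why the manual approach becomes tedious, here is a minimal sketch that chains `replace()` over a hand-picked list of characters. The sample string and the `unwanted` list are illustrative assumptions and are not taken from the dataset.

```python
# hypothetical sample text with several kinds of special characters
sample = "Loved it!!! 10/10 #mustwatch @director"

# every unwanted character has to be listed by hand
unwanted = ["!", "/", "#", "@"]

cleaned = sample
for ch in unwanted:
    # replace each listed character with a whitespace
    cleaned = cleaned.replace(ch, " ")

print(cleaned)
```

Any character missing from the list is left untouched, so the list has to grow with every new kind of noise in the text.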

<h2>Regular Expressions</h2>

- A regular expression, or regex for short, is a sequence of symbols and characters expressing a string or pattern to be searched for within a longer piece of text.

- They enable complex text processing tasks, such as finding, replacing, or removing patterns, making data cleaning more efficient.

- Example:
    - **Pattern:** `a`
    - **Description:** This regex pattern matches the letter "a" in any text.
    - **Usage:** In the text "apple pie", the pattern `a` will match the occurrence of "a," allowing for identification or manipulation of that letter.  

- Regular expressions are implemented in Python using the `re` library.

- It is imported using the statement `import re`
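
The "apple pie" example above can be sketched with `re.findall`, which returns every match of a pattern as a list. Using `re.findall` here is a choice made for illustration; the notebook itself relies on `re.sub` later on.

```python
import re

text = "apple pie"

# re.findall returns all non-overlapping matches of the pattern as a list
matches = re.findall(r'a', text)

print(matches)
```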

In [None]:
import re

- Suppose you have a string, and you are interested in adding a character between each of its characters.

In [None]:
string = "Regular expressions"

- This is where the `join()` method comes to the rescue.

  - The `join()` method takes all the items in an iterable (here, the characters of a string) and joins them into one string, placing a separator between them.

  - A string must be specified as the separator.

  - **Syntax:** `separator.join(iterable)`

In [None]:
'space'.join(string)

'Rspaceespacegspaceuspacelspaceaspacerspace spaceespacexspacepspacerspaceespacesspacesspaceispaceospacenspaces'

- As we can observe, the string 'space' has been added between each character in the `string`

In [None]:
' '.join(string)

'R e g u l a r   e x p r e s s i o n s'

- As we can observe, a whitespace has been added between each character in the `string`.
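
`join()` works on any iterable of strings, not only on the characters of a single string. A minimal sketch, assuming a hypothetical list of words:

```python
# hypothetical list of words to be joined
words = ["Regular", "expressions", "in", "Python"]

# the separator string is placed between consecutive items
sentence = ' '.join(words)
hyphenated = '-'.join(words)

print(sentence)
print(hyphenated)
```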

<h2>1. Pattern: [a-z]</h2>
<p><strong>Use Case:</strong> Extracting lowercase letters from usernames.</p>
<p><strong>Description:</strong> This pattern matches any single lowercase letter from 'a' to 'z'. It can be used to ensure usernames contain only lowercase letters.</p>
<pre>
pattern = r'[a-z]'
# Finding the specified pattern and replacing lowercase characters with a whitespace
new_text = ''.join(re.sub(pattern, ' ', text))
</pre>


- The `r` prefix creates a raw string, preventing Python from interpreting backslashes as escape characters, which is essential when working with regular expressions that often use backslashes.

- [] define a character set, which matches any one character from a specified set of characters inside the brackets.

- For example, [abc] will match either a, b, or c in the input text.

- If you put a hyphen between characters like [a-z], it matches any character in that range, so [a-z] matches any lowercase letter from a to z.

- The `re` library in Python includes many built-in functions, one of which is `re.sub`, used for substituting occurrences of a specified regex pattern in a string.

**```re.sub(pattern, replacement, string)```**

This function is used to replace occurrences of a pattern in a string.

- **`pattern`:** The regular expression that defines the sequence of characters you want to search for in the string.

- **`replacement`:** The string that will replace each occurrence of the pattern found in the original string.

- **`string`:** The input string where the search and replacement will occur
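
The difference between an explicit character set and a range can be sketched with `re.sub` as follows. The sample string and the `*` replacement are assumptions chosen only to make the matches visible.

```python
import re

string = "abcde ABCDE"

# [abc] matches only the characters a, b, or c
only_abc = re.sub(r'[abc]', '*', string)

# [a-z] matches every lowercase letter from a to z
lower_range = re.sub(r'[a-z]', '*', string)

print(only_abc)
print(lower_range)
```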

In [None]:
pattern = r'[a-z]'
replacement = ' '
string = 'Python Modules'

output = re.sub(pattern,replacement,string)

output

'P      M      '

- As expected, the characters [a-z] in the string are replaced with a whitespace.

- Let's check a simple example of how the above pattern can be used.

In [None]:
string = "Artificial Intelligence and Machine Learning"
pattern = r'[a-z]'

cleaned_string = ''.join(re.sub(pattern,' ',string))

print(string)
print(cleaned_string)

Artificial Intelligence and Machine Learning
A          I                M       L       


- As we can observe, the lowercase characters in the string contained in the variable `string` have each been replaced with a whitespace (and not removed).

- Let's verify the same.

In [None]:
print(len(string))
print(len(cleaned_string))

44
44


- Since the lower case characters are replaced with a whitespace and whitespace is also a character, the length of the `cleaned_string` remains unchanged

- We can also replace the lowercase characters in the string with an empty string, i.e., remove those characters.

In [None]:
string = "Artificial Intelligence and Machine Learning"
pattern = r'[a-z]'

cleaned_string = ''.join(re.sub(pattern,'',string))

print(string)
print(cleaned_string)

Artificial Intelligence and Machine Learning
A I  M L


In [None]:
print(len(string))
print(len(cleaned_string))

44
8


- Since the lowercase characters are replaced with an empty string (which has zero length), they are effectively removed, and the length of the `cleaned_string` is reduced to the number of remaining characters.

- Now, let's try to use this pattern to manipulate one of the reviews in the data.

In [None]:
review = data['review'][0]
pattern = r'[a-z]'

cleaned_review = ''.join(re.sub(pattern,' ',review))

print(review)
print(cleaned_review)

Okay, I know this does'nt project India in a good light. But the overall theme of the movie is not India, it's Shakti. The power of a warlord, and the power of a mother. The relationship between Nandini and her husband and son swallow you up in their warmth. Then things go terribly wrong. The interaction between Nandini and her father in law - the power of their dysfunctional relationship - and the lives changed by it are the strengths of this movie. Shah Rukh Khan's performance seems to be a mere cameo compared to the believable desperation of Karisma Kapoor. It is easy to get caught up in the love, violence and redemption of lives in this film, and find yourself heaving a sigh of relief and sadness at the climax. The musical interludes are strengths, believable and well done.
O   , I               '           I                    . B                                         I    ,   '  S     . T                     ,                          . T                        N               

- As we can observe, the lowercase letters have been replaced with a whitespace.

<h2>2. Pattern: [A-Z]</h2>
<p><strong>Use Case:</strong> Validating employee IDs that start with uppercase letters.</p>
<p><strong>Description:</strong> This pattern matches any single uppercase letter from 'A' to 'Z'. It can be used to check that employee IDs start with a capital letter.</p>
<pre>
pattern = r'[A-Z]'
# Finding the specified pattern and replacing uppercase characters with a whitespace
new_text = ''.join(re.sub(pattern, ' ', text))
</pre>

In [None]:
string = "ID1230"
pattern = r'[A-Z]'

cleaned_string = ''.join(re.sub(pattern,' ',string))

print(string)
print(cleaned_string)

ID1230
  1230


- As we can observe, the uppercase letters have been replaced with a whitespace
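
The use case mentioned above, checking that an employee ID starts with a capital letter, could be sketched with `re.match`, which only looks for the pattern at the beginning of the string. The helper name `starts_with_uppercase` and the sample IDs are hypothetical.

```python
import re

def starts_with_uppercase(employee_id):
    # re.match succeeds only if the pattern matches at the start of the string
    return re.match(r'[A-Z]', employee_id) is not None

# hypothetical employee IDs
ids = ["ID1230", "id1230", "1230ID"]
results = [starts_with_uppercase(i) for i in ids]

print(results)
```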

In [None]:
import re

review = data['review'][4]
pattern = r'[A-Z]'

cleaned_review = ''.join(re.sub(pattern,' ',review))

print(review)
print(cleaned_review)

Admittedly, there are some scenes in this movie that seem a little unrealistic. The ravishing woman first panics and then, only a few minutes later, she starts kissing the young lad while the old guy is right next to her. But as the film goes along we learn that she is a little volatile girl (or slut) and that partly explains her behavior. The cinematography of this movie is well done. We get to see the elevator from almost every angle and perspective, and some of those images and scenes really raise the tension. Götz George plays his character well, a wannabe hot-shot getting old and being overpowered by young men like the Jaennicke character. Wolfgang Kieling who I admired in Hitchcock's THE TORN CURTAIN delivers a great performance that, although he doesn't say much, he is by far the best actor in this play. One critic complained about how unrealistic the film was and that in a real case of emergency nothing would really happen. But then again, how realistic are films such as Missio

- Every sentence starts with an uppercase letter.
- As we can observe, these sentence-initial letters have been replaced with a whitespace, along with all other uppercase characters.

<h2>3. Pattern: [0-9]</h2>
<p><strong>Use Case:</strong> Extracting the template from a string.</p>
<p><strong>Description:</strong> This pattern matches any single digit from '0' to '9'.</p>
<pre>
pattern = r'[0-9]'
# Finding the specified pattern and replacing digit characters with a white space
new_text = ''.join(re.sub(pattern, ' ', text))
</pre>

In [None]:
import re

string = "OrderNumber: 12345, TotalAmount: $67.89"
pattern = r'[0-9]'

cleaned_string = ''.join(re.sub(pattern,' ',string))

print(string)
print(cleaned_string)

OrderNumber: 12345, TotalAmount: $67.89
OrderNumber:      , TotalAmount: $  .  


- There is a particular template in the string:
  - It contains an Order Number and Total Amount with some values.
  - If we replace the values with whitespace, we can retrieve the template.

In [None]:
import re

string = "The temperature today is 75 degrees Fahrenheit."
pattern = r'[0-9]'

cleaned_string = ''.join(re.sub(pattern,' ',string))

print(string)
print(cleaned_string)

The temperature today is 75 degrees Fahrenheit.
The temperature today is    degrees Fahrenheit.


- Imagine there is a simple website that displays the daily temperature in a specific format.
  - In the above string, only the temperature value changes while the format remains the same.
  - If we replace that value with a whitespace or any other value, we can update the temperature.
  - For now, we will retrieve the template alone.
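
As a sketch of the "update the temperature" idea, two character sets written one after the other match exactly two consecutive digits, which can then be swapped for a new value. The replacement value `80` is an assumption for illustration, and this only works when the original value has two digits.

```python
import re

string = "The temperature today is 75 degrees Fahrenheit."

# [0-9][0-9] matches two digits in a row, here the value 75
updated = re.sub(r'[0-9][0-9]', '80', string)

print(updated)
```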


In [None]:
import re

review = data['review'][905]
pattern = r'[0-9]'

cleaned_review = ''.join(re.sub(pattern,' ',review))

print(review)
print(cleaned_review)

This one is a real bomb. We are supposed to believe that Merle Oberon is the sequestered daughter of an ambitious politician who must prove to the Tom DeLay of the 1930s that he is worth supporting as a presidential candidate. Poor Merle can't go anywhere, but is surrounded by politicians and their quacking, quaking wives and supported only by kindly uncle Harry Davenport. She joins her two maids on a blind date and Gary Cooper happens to show up. Some shots of rodeo might have enlivened things, a la "Misfits," but no such luck with this one. Gary later breaks in to a formal dinner, at which Merle is presiding, and, though invited to sit down and join the group, reads them a lecture on their snobbery. Where did this diffident cowboy's sudden eloquence come from? The most excruciating scene in the film is a phantom party that Gary holds in his unfinished house for his absent wife, Merle. Will it never end? One to avoid.
This one is a real bomb. We are supposed to believe that Merle Ober

- As expected, each digit of '1930' has been replaced with a whitespace.

<h2>4. Pattern: [^]</h2>

- When the ^ character is the first character inside square brackets, it acts as a negation operator.
- The character set then matches any character that is not listed after the ^ character.

<strong>Use Case:</strong> Removing all non-numeric characters from a string of digits.

<strong>Description:</strong> The pattern `[^0-9]` matches any character that is not a digit. It can be used to clean up strings that should only contain numbers.

```
pattern = r'[^0-9]'
# Finding the specified pattern and replacing non-digit characters with a whitespace
new_text = ''.join(re.sub(pattern, ' ', text))
```
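
To actually remove the non-numeric characters, as the use case describes, the replacement can be an empty string instead of a whitespace. A minimal sketch, assuming a hypothetical phone number as input:

```python
import re

# hypothetical input that should contain only digits
raw_phone = "(555) 123-4567"

# every character that is not a digit is replaced with an empty string
digits_only = re.sub(r'[^0-9]', '', raw_phone)

print(digits_only)
```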

In [None]:
import re

string = "OrderNumber: 12345, TotalAmount: $67.89"
pattern = r'[^0-9]'

cleaned_string = ''.join(re.sub(pattern,' ',string))

print(string)
print(cleaned_string)

OrderNumber: 12345, TotalAmount: $67.89
             12345                67 89


- In one of the previous examples, we have retrieved the template.
  - To retrieve anything apart from the template, we can use the ^ character.

In [None]:
import re

string = "The temperature today is 75 degrees Fahrenheit."
pattern = r'[^0-9]'

cleaned_string = ''.join(re.sub(pattern,' ',string))

print(string)
print(cleaned_string)

The temperature today is 75 degrees Fahrenheit.
                         75                    


- Here, only the temperature value '75' is retained.
  - Every character belonging to the template has been replaced with a whitespace.

In [None]:
import re

review = data['review'][13]
pattern = r'[^0-9]'

cleaned_review = ''.join(re.sub(pattern,' ',review))

print(review)
print(cleaned_review)

One of the most underrated movies I've seen in a long time, Bill & Ted's Bogus Journey is the second hilarious adventure of Bill S. Preston Esq. and Ted Theodore Logan, aka Wyld Stallyns. There are two ways to look at this film: First, you see dumb dialogue, far fetched plot, juvenile idea. OR.. You see brilliantly downplayed idiots who yet again find themselves in a situation too big for their brains. Throw a Bruce Willis or a Arnold Schwarzeneggar into this plot and it becomes a big blockbuster movie. Bill and Ted go into the story with the same level of sincerity, only it's Bill and Ted. This is a tricky fence to balance on, but when you watch the movie not as a throwaway screwball comedy, but as an adventure featuring two guys who have no business being in an adventure, it becomes so much more.
                                                                                                                                                                                              

- As the review contains no digits, all characters have been replaced with whitespace.

<h2>5. Pattern: [^A-Za-z]</h2>
<p><strong>Use Case:</strong> Sanitizing input fields to allow only letters.</p>
<p><strong>Description:</strong> This pattern matches any character that is not a letter (either uppercase or lowercase). It can be used to sanitize user input in forms to ensure it only contains alphabetic characters.</p>
<pre>
pattern = r'[^A-Za-z]'
# Finding the specified pattern and replacing non-letter characters with a whitespace
new_text = ''.join(re.sub(pattern, ' ', text))
</pre>

In [None]:
import re

string = "johan123@gmail.com"
pattern = r'[^A-Za-z]'

cleaned_string = ''.join(re.sub(pattern,' ',string))

print(string)
print(cleaned_string)

johan123@gmail.com
johan    gmail com


- As expected, all characters outside of [A-Za-z] have been replaced with whitespace.
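
For the input-sanitizing use case, the same pattern can also be used to check whether a value contains anything other than letters. This is a sketch using `re.search`; the helper name `is_alphabetic` and the sample values are hypothetical.

```python
import re

def is_alphabetic(value):
    # re.search returns None when no non-letter character is found;
    # note that an empty string also passes this check
    return re.search(r'[^A-Za-z]', value) is None

# hypothetical form inputs
checks = [is_alphabetic("johan"), is_alphabetic("johan123"), is_alphabetic("johan@gmail")]

print(checks)
```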

In [None]:
import re

string = "OrderNumber: 12345, TotalAmount: $67.89"
pattern = r'[^A-Za-z]'

cleaned_string = ''.join(re.sub(pattern,' ',string))

print(string)
print(cleaned_string)

OrderNumber: 12345, TotalAmount: $67.89
OrderNumber         TotalAmount        


- As expected, all characters outside of [A-Za-z] have been replaced with whitespace.
- Additionally, this is another way to retrieve the template, which was discussed in one of the previous examples.

In [None]:
import re

review = data['review'][7]
pattern = r'[^A-Za-z]'

cleaned_review = ''.join(re.sub(pattern,' ',review))

print(review)
print(cleaned_review)

Having heard so much about the 1990s Cracker series without seeing any of them, I looked forward to this eagerly. Surely the combination of Jimmie McGovern and Robbie Coltrane could not go wrong. How wrong I was! <br /><br />The polemics, backed by frequent, repetitive and violent flashbacks, were overpowering. The production tried to be super-modern, but the flashing boxes and even the childish font irritated. Robbie Coltrane sleep-walked through the two hours, coming up with unexplained and unlikely "insights", and the police were portrayed as one-dimensional bumbling idiots. As a result, the tension never built up and the next-to-final scene (no details for fear of spoilers) was as laughably bad a piece of TV drama as I have seen for a long time.<br /><br />No, I don't want to see any more of these, but I will go back to the DVDs of the 1990s series to see if they match their reputation.
Having heard so much about the     s Cracker series without seeing any of them  I looked forward

- As expected, the digits in both occurrences of '1990s', along with punctuation and all other non-letter characters, have been replaced with whitespace.

<h2>6. Pattern: []+</h2>

- The + character is used to indicate that the preceding element must occur one or more times.
- Any pattern mentioned before the + character will be matched as long as it appears at least once.
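
The effect of + is easiest to see by comparing the same substitution with and without it. The sample string and the `#` replacement below are assumptions for illustration.

```python
import re

string = "a1b22c333"

# without +, every single digit is replaced separately
per_digit = re.sub(r'[0-9]', '#', string)

# with +, each run of consecutive digits is replaced once
per_run = re.sub(r'[0-9]+', '#', string)

print(per_digit)
print(per_run)
```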

**Use Case**: Validating hexadecimal color codes.

**Description**: The pattern `[a-fA-F0-9]+` matches one or more characters in the range 'a' to 'f' (lowercase) or 'A' to 'F' (uppercase) or 0-9. It can be used to extract or validate parts of hexadecimal color codes.

```
pattern = r'[a-fA-F0-9]+'
# Finding the specified pattern and replacing each run of hexadecimal characters with a whitespace
new_text = ''.join(re.sub(pattern, ' ', text))
```


In [None]:
import re

string = "#ff0000 #6a5acd"
pattern = r'[a-fA-F0-9]+'

cleaned_string = ''.join(re.sub(pattern,' ',string))

print(string)
print(cleaned_string)

#ff0000 #6a5acd
#  # 


- `#ff0000` and `#6a5acd` represent colors in hexadecimal format.

- The entire string 'ff0000' is matched as a single pattern since + matches one or more of the specified characters.

- Similarly, the string '6a5acd' is matched as a single pattern.

- Hence, the strings 'ff0000' and '6a5acd' are each replaced with a single whitespace.

In [None]:
print(len(cleaned_string))

5


- The '#' characters and the whitespace between '#ff0000' and '#6a5acd' are retained since they don't match the given pattern.

- Hence, the length of `cleaned_string` is 5.
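
For the validation use case, one possible sketch uses `re.fullmatch`, which succeeds only when the whole string matches the pattern. The helper name `looks_like_hex_color` and the sample codes are hypothetical, and the check is deliberately loose: it does not verify that the code has exactly six hexadecimal characters.

```python
import re

def looks_like_hex_color(code):
    # a '#' followed by one or more hexadecimal characters, and nothing else
    return re.fullmatch(r'#[a-fA-F0-9]+', code) is not None

# hypothetical color codes
codes = ["#ff0000", "#6a5acd", "#ggg000", "ff0000"]
validity = [looks_like_hex_color(c) for c in codes]

print(validity)
```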

In [None]:
import re

review = data['review'][21]
pattern = r'[a-fA-F0-9]+'

cleaned_review = ''.join(re.sub(pattern,' ',review))

print(review)
print(cleaned_review)

Given the chance to write, direct and star in my own movie, I would probably choose something about robot women with guns. Anthony Hopkins, however, decided to make possibly the strangest movie anyone has ever seen. "Slipstream" is a movie that is so strange that even David Lynch would probably look at the person next to him and say 'What's going on?'.<br /><br />This is a movie where, in one scene, a man crosses the road towards a yellow car facing to the right which suddenly changes into a pink car facing to the left. This is a movie where two characters have a conversation interspersed with shots of random people laughing and insects climbing up walls. This is a movie where a man starts talking about "Invasion Of The Bodysnatchers" only for the actor of that particular movie to suddenly show up as himself (and then disappear into thin air). <br /><br />This is a movie that decides to throw the need for a coherent plot straight out of the window and use fifteen different edits whilst

- The above review doesn't contain any digits.
- Hence, only the runs of characters in the ranges a-f and A-F have been replaced, each with a single whitespace.

- Until now, we have been cleaning individual reviews one by one. While this is helpful, it can be time-consuming for a larger number of reviews.

- To make our process faster and easier, we want to clean all reviews in the DataFrame at once instead of handling them individually.

- By using the ***apply*** function in Pandas, we can quickly apply our cleaning function, ***remove_special_characters***, to every review in the DataFrame in one go.

- We will define the ***remove_special_characters*** function to remove special characters from the text. Then, we will apply this function to the review column, ensuring all reviews are clean and ready for analysis.

In [None]:
# defining a function to remove special characters
def remove_special_characters(text):
    # Defining the regex pattern to match one or more consecutive non-alphanumeric characters
    pattern = r'[^A-Za-z0-9]+'

    # Replacing each run of non-alphanumeric characters with a single whitespace
    new_text = re.sub(pattern, ' ', text)

    return new_text

In [None]:
# Applying the function to remove special characters
data['cleaned_text'] = data['review'].apply(remove_special_characters)
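
Several reviews in this dataset contain HTML line breaks written as `<br />`. With the pattern above, only the angle brackets and the slash are stripped, so the letters `br` remain in the cleaned text. One possible way to handle this, sketched here as an assumption and not as part of the original pipeline, is to drop the tag with `replace()` before applying the regex. The helper name `remove_br_and_special_characters` and the sample review are hypothetical.

```python
import re

def remove_br_and_special_characters(text):
    # first drop the literal HTML line-break tag
    text = text.replace('<br />', ' ')
    # then replace each run of non-alphanumeric characters with a whitespace
    text = re.sub(r'[^A-Za-z0-9]+', ' ', text)
    # strip the whitespace left at either end
    return text.strip()

# hypothetical review containing line-break tags
sample_review = "Great film.<br /><br />Loved it!"

without_tag_handling = re.sub(r'[^A-Za-z0-9]+', ' ', sample_review).strip()
with_tag_handling = remove_br_and_special_characters(sample_review)

print(without_tag_handling)
print(with_tag_handling)
```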

In [None]:
# checking a couple of instances of cleaned data
data.loc[0:3, ['review','cleaned_text']]

Unnamed: 0,review,cleaned_text
0,"Okay, I know this does'nt project India in a good light. But the overall theme of the movie is not India, it's Shakti. The power of a warlord, and the power of a mother. The relationship between Nandini and her husband and son swallow you up in their warmth. Then things go terribly wrong. The interaction between Nandini and her father in law - the power of their dysfunctional relationship - and the lives changed by it are the strengths of this movie. Shah Rukh Khan's performance seems to be a mere cameo compared to the believable desperation of Karisma Kapoor. It is easy to get caught up in the love, violence and redemption of lives in this film, and find yourself heaving a sigh of relief and sadness at the climax. The musical interludes are strengths, believable and well done.",Okay I know this does nt project India in a good light But the overall theme of the movie is not India it s Shakti The power of a warlord and the power of a mother The relationship between Nandini and her husband and son swallow you up in their warmth Then things go terribly wrong The interaction between Nandini and her father in law the power of their dysfunctional relationship and the lives changed by it are the strengths of this movie Shah Rukh Khan s performance seems to be a mere cameo compared to the believable desperation of Karisma Kapoor It is easy to get caught up in the love violence and redemption of lives in this film and find yourself heaving a sigh of relief and sadness at the climax The musical interludes are strengths believable and well done
1,"Despite John Travolta's statements in interviews that this was his favorite role of his career, ""Be Cool"" proves to be a disappointing sequel to 1995's witty and clever ""Get Shorty.""<br /><br />Travolta delivers a pleasant enough performance in this mildly entertaining film, but ultimately the movie falls flat due to an underdeveloped plot, unlikeable characters, and a surprising lack of chemistry between leads Travolta and Uma Thurman. Although there are some laughs, this unfunny dialog example (which appeared frequently in the trailers) kind of says it all: Thurman: Do you dance? Travolta: Hey, I'm from Brooklyn.<br /><br />The film suggests that everyone in the entertainment business is a gangster or aspires to be one, likening it to organized crime. In ""Get Shorty,"" the premise of a gangster ""going legitimate"" by getting into movies was a clever fish-out-of water idea, but in ""Be Cool,"" it seems the biz has entirely gone crooked since then.<br /><br />The film is interestingly casted and the absolute highlight is a ""monolgue"" delivered by The Rock, whose character is an aspiring actor as well as a goon, where he reenacts a scene between Gabrielle Union and Kirsten Dunst from ""Bring It On."" Vince Vaughan's character thinks he's black and he's often seen dressed as a pimp-- this was quite funny in the first scene that introduces him and gets tired and embarrassing almost immediately afterward.<br /><br />Overall, ""Be Cool"" may be worth a rental for John Travolta die-hards (of which I am one), but you may want to keep your finger close to the fast forward button to get through it without feeling that you wasted too much time. Fans of ""Get Shorty"" may actually wish to avoid this, as the sequel is devoid of most things that made that one a winner. 
I rate this movie an admittedly harsh 4/10.",Despite John Travolta s statements in interviews that this was his favorite role of his career Be Cool proves to be a disappointing sequel to 1995 s witty and clever Get Shorty br br Travolta delivers a pleasant enough performance in this mildly entertaining film but ultimately the movie falls flat due to an underdeveloped plot unlikeable characters and a surprising lack of chemistry between leads Travolta and Uma Thurman Although there are some laughs this unfunny dialog example which appeared frequently in the trailers kind of says it all Thurman Do you dance Travolta Hey I m from Brooklyn br br The film suggests that everyone in the entertainment business is a gangster or aspires to be one likening it to organized crime In Get Shorty the premise of a gangster going legitimate by getting into movies was a clever fish out of water idea but in Be Cool it seems the biz has entirely gone crooked since then br br The film is interestingly casted and the absolute highlight is a monolgue delivered by The Rock whose character is an aspiring actor as well as a goon where he reenacts a scene between Gabrielle Union and Kirsten Dunst from Bring It On Vince Vaughan s character thinks he s black and he s often seen dressed as a pimp this was quite funny in the first scene that introduces him and gets tired and embarrassing almost immediately afterward br br Overall Be Cool may be worth a rental for John Travolta die hards of which I am one but you may want to keep your finger close to the fast forward button to get through it without feeling that you wasted too much time Fans of Get Shorty may actually wish to avoid this as the sequel is devoid of most things that made that one a winner I rate this movie an admittedly harsh 4 10
2,"I am a kung fu fan, but not a Woo fan. I have no interest in gangster movies filled with over-the-top gun-play. Now, martial arts; *that's* beautiful! And John Woo surprised me here by producing a highly entertaining kung fu movie, which almost has *too much* fighting, if such a thing is possible! This is good stuff.<br /><br />Many of the fight scenes are very good (and some of them are less good), and the main characters are amusing and likable. The bad guys are a bit too unbelievably evil, but entertaining none the less. You gotta see the Sleeping Wizard!! He can only fight when he's asleep - it's hysterical!<br /><br />Upon repeated viewings, however, Last Hurrah For Chivalry can tend to get a little boring and long-winded, also especially because many of the fight scenes are actually not that good. Hence, I rate it ""only"" a 7 out of 10. But it really is almost an ""8"".<br /><br />All in all one of the better kung fu movies, made smack-dab in the heart of kung fu cinema's prime. 
All the really good kung fu movies are from the mid- to late 1970ies, with some notable exceptions from the late '60ies and early '70ies (and early '80ies, to be fair).",I am a kung fu fan but not a Woo fan I have no interest in gangster movies filled with over the top gun play Now martial arts that s beautiful And John Woo surprised me here by producing a highly entertaining kung fu movie which almost has too much fighting if such a thing is possible This is good stuff br br Many of the fight scenes are very good and some of them are less good and the main characters are amusing and likable The bad guys are a bit too unbelievably evil but entertaining none the less You gotta see the Sleeping Wizard He can only fight when he s asleep it s hysterical br br Upon repeated viewings however Last Hurrah For Chivalry can tend to get a little boring and long winded also especially because many of the fight scenes are actually not that good Hence I rate it only a 7 out of 10 But it really is almost an 8 br br All in all one of the better kung fu movies made smack dab in the heart of kung fu cinema s prime All the really good kung fu movies are from the mid to late 1970ies with some notable exceptions from the late 60ies and early 70ies and early 80ies to be fair
3,"He seems to be a control freak. I have heard him comment on ""losing control of the show"" and tell another guest who brought live animals that he had one rule-""no snakes."" He needs to hire a comedy writer because his jokes are lame. The only reason I watch him is because he some some great guests and bands. <br /><br />I watched the Craig Ferguson show for a while but his show is even worse. He likes to bull sh** to burn time.I don't think either man has much of a future in late night talk shows.<br /><br />Daily also has the annoying habit of sticking his tongue out to lick his lips. He must do this at least 10 times a show. I do like the Joe Firstman band. Carson Daily needs to lighten up before it is too late.",He seems to be a control freak I have heard him comment on losing control of the show and tell another guest who brought live animals that he had one rule no snakes He needs to hire a comedy writer because his jokes are lame The only reason I watch him is because he some some great guests and bands br br I watched the Craig Ferguson show for a while but his show is even worse He likes to bull sh to burn time I don t think either man has much of a future in late night talk shows br br Daily also has the annoying habit of sticking his tongue out to lick his lips He must do this at least 10 times a show I do like the Joe Firstman band Carson Daily needs to lighten up before it is too late


- We can observe that the regex replaced the special characters with spaces and retained the letters and digits.
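The behaviour observed above can be reproduced on a single sentence. This is a minimal sketch: the pattern `[^A-Za-z0-9]+` and the name `SPECIAL_CHARS` are assumptions chosen to match the output shown, not necessarily the exact pattern used earlier in the notebook.

```python
import re

# Assumed pattern: any run of characters that is neither a letter nor a digit
SPECIAL_CHARS = re.compile(r"[^A-Za-z0-9]+")

sample = "Okay, I know this does'nt project India in a good light."

# each run of special characters is replaced by a single space
cleaned = SPECIAL_CHARS.sub(" ", sample)

# repr() makes the trailing space left by the final period visible
print(repr(cleaned))
```

Note that the final period also becomes a space, so the cleaned string can end with trailing whitespace.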

### **Lowercasing**

<h2>Why is Lowercasing Important in Text Preprocessing?</h2>

- It ensures that all words are in the same format, which helps maintain consistency across the dataset.
  - This means "Dog" and "dog" are treated as the same word.

- Converting text to lowercase simplifies the data, making it easier to process and analyze.
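To make the consistency argument concrete, here is a small sketch that counts words with and without lowercasing. The sentence and the variable names are made up for illustration.

```python
from collections import Counter

# made-up sentence in which the same words appear with different capitalization
text = "Dog bites man. Man bites dog."
tokens = text.replace(".", "").split()

# counting as-is treats "Dog" and "dog" as different words
case_sensitive = Counter(tokens)

# lowercasing first merges them into a single entry
case_insensitive = Counter(token.lower() for token in tokens)

print(len(case_sensitive), "distinct words before lowercasing")
print(len(case_insensitive), "distinct words after lowercasing")
```

Fewer distinct words means a smaller vocabulary for any later counting or vectorization step.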

<h2>How to Implement Lowercasing?</h2>

- The `lower()` method is a built-in string method in Python that converts all uppercase characters in a string to lowercase.
- This is useful for standardizing text data.

<p><strong>Example:</strong></p>

<pre>
input_string = "Hello, World!"

lowercased_string = input_string.lower()

print(lowercased_string)
</pre>

> **Output :** hello, world!


In [None]:
string = "The Quick Brown Fox"

lowercased_string = string.lower()

print(lowercased_string)

the quick brown fox


- As we can see, the characters in the `string` have been converted to lowercase.

- Now, let's apply the same method to one of the reviews in the data.

In [None]:
review = data['review'][10]

lowercased_review = review.lower()

print(lowercased_review)

wow! fantastic film in my opinion, i wasn't expecting it to be this good! i was captivated from start to finish- it's a very well made and educational film that really gives us a fascinating insight into the trials darwin had to go through in order to convey his ideas to the world, chronicling his life as he writes "origin of the species"; fighting both personal demons as well as the ignorant society of the time in order to do so. he struggles hard with his mind, body and soul as personal matters get to breaking point and even his family seems to slip away...whilst the rest of the world stand against him as he knows that his findings literally shake the very foundations of their lives, culture and meaning of existence. it's a subtle movie (not over-exaggerated in any way in that typical hollywood way, this is a bbc produced british film) yet thankfully very powerful in meaning and this is thanks to the amazing well directed scenes as well as the superb acting by bettany. connelly acts 

- As we can see, the characters in the `review` have been converted to lowercase.

- Until now, we have been converting text to lowercase for individual reviews one by one. While this method works, it can be inefficient when dealing with a larger number of reviews.

- To streamline our process, we want to convert all reviews in the DataFrame to lowercase at once instead of handling each one separately.

- By using the `str.lower()` method in Pandas, we can efficiently apply the lowercase transformation to every review in the DataFrame in a single operation.

- We will apply the `str.lower()` method to the `cleaned_text` column (the output of the previous cleaning step), ensuring that all reviews are standardized and ready for analysis.

In [None]:
# changing the case of the text data to lower case
data['cleaned_text'] = data['cleaned_text'].str.lower()

In [None]:
# checking a couple of instances of cleaned data
data.loc[0:3, ['review','cleaned_text']]

Unnamed: 0,review,cleaned_text
0,"Okay, I know this does'nt project India in a good light. But the overall theme of the movie is not India, it's Shakti. The power of a warlord, and the power of a mother. The relationship between Nandini and her husband and son swallow you up in their warmth. Then things go terribly wrong. The interaction between Nandini and her father in law - the power of their dysfunctional relationship - and the lives changed by it are the strengths of this movie. Shah Rukh Khan's performance seems to be a mere cameo compared to the believable desperation of Karisma Kapoor. It is easy to get caught up in the love, violence and redemption of lives in this film, and find yourself heaving a sigh of relief and sadness at the climax. The musical interludes are strengths, believable and well done.",okay i know this does nt project india in a good light but the overall theme of the movie is not india it s shakti the power of a warlord and the power of a mother the relationship between nandini and her husband and son swallow you up in their warmth then things go terribly wrong the interaction between nandini and her father in law the power of their dysfunctional relationship and the lives changed by it are the strengths of this movie shah rukh khan s performance seems to be a mere cameo compared to the believable desperation of karisma kapoor it is easy to get caught up in the love violence and redemption of lives in this film and find yourself heaving a sigh of relief and sadness at the climax the musical interludes are strengths believable and well done
1,"Despite John Travolta's statements in interviews that this was his favorite role of his career, ""Be Cool"" proves to be a disappointing sequel to 1995's witty and clever ""Get Shorty.""<br /><br />Travolta delivers a pleasant enough performance in this mildly entertaining film, but ultimately the movie falls flat due to an underdeveloped plot, unlikeable characters, and a surprising lack of chemistry between leads Travolta and Uma Thurman. Although there are some laughs, this unfunny dialog example (which appeared frequently in the trailers) kind of says it all: Thurman: Do you dance? Travolta: Hey, I'm from Brooklyn.<br /><br />The film suggests that everyone in the entertainment business is a gangster or aspires to be one, likening it to organized crime. In ""Get Shorty,"" the premise of a gangster ""going legitimate"" by getting into movies was a clever fish-out-of water idea, but in ""Be Cool,"" it seems the biz has entirely gone crooked since then.<br /><br />The film is interestingly casted and the absolute highlight is a ""monolgue"" delivered by The Rock, whose character is an aspiring actor as well as a goon, where he reenacts a scene between Gabrielle Union and Kirsten Dunst from ""Bring It On."" Vince Vaughan's character thinks he's black and he's often seen dressed as a pimp-- this was quite funny in the first scene that introduces him and gets tired and embarrassing almost immediately afterward.<br /><br />Overall, ""Be Cool"" may be worth a rental for John Travolta die-hards (of which I am one), but you may want to keep your finger close to the fast forward button to get through it without feeling that you wasted too much time. Fans of ""Get Shorty"" may actually wish to avoid this, as the sequel is devoid of most things that made that one a winner. 
I rate this movie an admittedly harsh 4/10.",despite john travolta s statements in interviews that this was his favorite role of his career be cool proves to be a disappointing sequel to 1995 s witty and clever get shorty br br travolta delivers a pleasant enough performance in this mildly entertaining film but ultimately the movie falls flat due to an underdeveloped plot unlikeable characters and a surprising lack of chemistry between leads travolta and uma thurman although there are some laughs this unfunny dialog example which appeared frequently in the trailers kind of says it all thurman do you dance travolta hey i m from brooklyn br br the film suggests that everyone in the entertainment business is a gangster or aspires to be one likening it to organized crime in get shorty the premise of a gangster going legitimate by getting into movies was a clever fish out of water idea but in be cool it seems the biz has entirely gone crooked since then br br the film is interestingly casted and the absolute highlight is a monolgue delivered by the rock whose character is an aspiring actor as well as a goon where he reenacts a scene between gabrielle union and kirsten dunst from bring it on vince vaughan s character thinks he s black and he s often seen dressed as a pimp this was quite funny in the first scene that introduces him and gets tired and embarrassing almost immediately afterward br br overall be cool may be worth a rental for john travolta die hards of which i am one but you may want to keep your finger close to the fast forward button to get through it without feeling that you wasted too much time fans of get shorty may actually wish to avoid this as the sequel is devoid of most things that made that one a winner i rate this movie an admittedly harsh 4 10
2,"I am a kung fu fan, but not a Woo fan. I have no interest in gangster movies filled with over-the-top gun-play. Now, martial arts; *that's* beautiful! And John Woo surprised me here by producing a highly entertaining kung fu movie, which almost has *too much* fighting, if such a thing is possible! This is good stuff.<br /><br />Many of the fight scenes are very good (and some of them are less good), and the main characters are amusing and likable. The bad guys are a bit too unbelievably evil, but entertaining none the less. You gotta see the Sleeping Wizard!! He can only fight when he's asleep - it's hysterical!<br /><br />Upon repeated viewings, however, Last Hurrah For Chivalry can tend to get a little boring and long-winded, also especially because many of the fight scenes are actually not that good. Hence, I rate it ""only"" a 7 out of 10. But it really is almost an ""8"".<br /><br />All in all one of the better kung fu movies, made smack-dab in the heart of kung fu cinema's prime. 
All the really good kung fu movies are from the mid- to late 1970ies, with some notable exceptions from the late '60ies and early '70ies (and early '80ies, to be fair).",i am a kung fu fan but not a woo fan i have no interest in gangster movies filled with over the top gun play now martial arts that s beautiful and john woo surprised me here by producing a highly entertaining kung fu movie which almost has too much fighting if such a thing is possible this is good stuff br br many of the fight scenes are very good and some of them are less good and the main characters are amusing and likable the bad guys are a bit too unbelievably evil but entertaining none the less you gotta see the sleeping wizard he can only fight when he s asleep it s hysterical br br upon repeated viewings however last hurrah for chivalry can tend to get a little boring and long winded also especially because many of the fight scenes are actually not that good hence i rate it only a 7 out of 10 but it really is almost an 8 br br all in all one of the better kung fu movies made smack dab in the heart of kung fu cinema s prime all the really good kung fu movies are from the mid to late 1970ies with some notable exceptions from the late 60ies and early 70ies and early 80ies to be fair
3,"He seems to be a control freak. I have heard him comment on ""losing control of the show"" and tell another guest who brought live animals that he had one rule-""no snakes."" He needs to hire a comedy writer because his jokes are lame. The only reason I watch him is because he some some great guests and bands. <br /><br />I watched the Craig Ferguson show for a while but his show is even worse. He likes to bull sh** to burn time.I don't think either man has much of a future in late night talk shows.<br /><br />Daily also has the annoying habit of sticking his tongue out to lick his lips. He must do this at least 10 times a show. I do like the Joe Firstman band. Carson Daily needs to lighten up before it is too late.",he seems to be a control freak i have heard him comment on losing control of the show and tell another guest who brought live animals that he had one rule no snakes he needs to hire a comedy writer because his jokes are lame the only reason i watch him is because he some some great guests and bands br br i watched the craig ferguson show for a while but his show is even worse he likes to bull sh to burn time i don t think either man has much of a future in late night talk shows br br daily also has the annoying habit of sticking his tongue out to lick his lips he must do this at least 10 times a show i do like the joe firstman band carson daily needs to lighten up before it is too late


- We can observe that all the text has now successfully been converted to lower case.
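The column-wise lowercasing can also be tried on a tiny DataFrame. This is a sketch on made-up data: only the column name `cleaned_text` comes from this notebook, while `toy` and its rows are hypothetical.

```python
import pandas as pd

# hypothetical three-row frame, including one missing review
toy = pd.DataFrame({"cleaned_text": ["Great Film", None, "NOT Good"]})

# .str.lower() is applied to every row in one operation;
# the missing value stays missing instead of raising an error
toy["cleaned_text"] = toy["cleaned_text"].str.lower()

print(toy)
```

One reason to prefer the `.str` accessor over calling `lower()` in a loop is that missing values pass through untouched, whereas calling `lower()` on `None` would fail.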

### **Removing extra whitespace**

<h2>Why is Removing Extra Spaces Important in Text Preprocessing?</h2>

- Removing extra spaces ensures uniformity in the text, making it easier to analyze and process.

- Extra spaces can unnecessarily increase the size of the text data. By eliminating them, the overall storage requirements are reduced, leading to more efficient data handling and processing.

<h2>How to Implement Whitespace Removal?</h2>

- The `strip()` method is a built-in string method in Python that removes leading and trailing whitespace from a string.
- This is useful for cleaning up user input and ensuring consistent formatting in text data.

**Example:**
<pre>
input_string = "Hello, World! "

stripped_string = input_string.strip()

print(stripped_string)
</pre>

> **Output :** Hello, World!


In [None]:
string = " johan@gmail.com "

stripped_string = string.strip()

print(stripped_string)

johan@gmail.com


- As expected, the trailing and leading whitespace has been removed.

- Now, let's apply the same method to one of the reviews in the data.

In [None]:
review = data['review'][178]

stripped_review = review.strip()

print(review)
print(stripped_review)

Night Crossing' is about an enormous barrier designed not to keep enemies out but to keep its own people in<br /><br />'Night Crossing' is about a very long border fencer equipped with silent alarms and automatic firing systems<br /><br />'Night Crossing' is about the denial of the basic human rights of life, liberty, and the pursuit of happiness<br /><br />'Night Crossing' is about the fear and pain that afflict so many families<br /><br />'Night Crossing' is about one attempt to risk a crossing through the border zone<br /><br />'Night Crossing' is about a loving father whose only desire is to give his boys what should never have been taken away from them<br /><br />'Night Crossing' is about a disturbed mother who wants her babies and her husband alive<br /><br />'Night Crossing' is about a caring husband who wants his family to be together but in a better place<br /><br />'Night Crossing' is about children who want to be free to reach at anytime the sky<br /><br />'Night Cr

- As expected, the leading whitespace has been removed.

- Until now, we have been removing leading and trailing whitespace from individual reviews one by one. While this method is effective, it can be time-consuming when working with a large number of reviews.

- To make our process more efficient, we want to remove extra spaces from all reviews in the DataFrame simultaneously instead of handling each one separately.

- By utilizing the `str.strip()` method in Pandas, we can easily eliminate leading and trailing whitespace from every review in the DataFrame in a single operation.

- We will apply the `str.strip()` method to the `cleaned_text` column, ensuring that all reviews are cleaned and formatted consistently for analysis.

In [None]:
# removing leading and trailing whitespace from the text
data['cleaned_text'] = data['cleaned_text'].str.strip()

In [None]:
# checking a couple of instances of cleaned data
data.loc[0:3, ['review','cleaned_text']]

Unnamed: 0,review,cleaned_text
0,"Okay, I know this does'nt project India in a good light. But the overall theme of the movie is not India, it's Shakti. The power of a warlord, and the power of a mother. The relationship between Nandini and her husband and son swallow you up in their warmth. Then things go terribly wrong. The interaction between Nandini and her father in law - the power of their dysfunctional relationship - and the lives changed by it are the strengths of this movie. Shah Rukh Khan's performance seems to be a mere cameo compared to the believable desperation of Karisma Kapoor. It is easy to get caught up in the love, violence and redemption of lives in this film, and find yourself heaving a sigh of relief and sadness at the climax. The musical interludes are strengths, believable and well done.",okay i know this does nt project india in a good light but the overall theme of the movie is not india it s shakti the power of a warlord and the power of a mother the relationship between nandini and her husband and son swallow you up in their warmth then things go terribly wrong the interaction between nandini and her father in law the power of their dysfunctional relationship and the lives changed by it are the strengths of this movie shah rukh khan s performance seems to be a mere cameo compared to the believable desperation of karisma kapoor it is easy to get caught up in the love violence and redemption of lives in this film and find yourself heaving a sigh of relief and sadness at the climax the musical interludes are strengths believable and well done
1,"Despite John Travolta's statements in interviews that this was his favorite role of his career, ""Be Cool"" proves to be a disappointing sequel to 1995's witty and clever ""Get Shorty.""<br /><br />Travolta delivers a pleasant enough performance in this mildly entertaining film, but ultimately the movie falls flat due to an underdeveloped plot, unlikeable characters, and a surprising lack of chemistry between leads Travolta and Uma Thurman. Although there are some laughs, this unfunny dialog example (which appeared frequently in the trailers) kind of says it all: Thurman: Do you dance? Travolta: Hey, I'm from Brooklyn.<br /><br />The film suggests that everyone in the entertainment business is a gangster or aspires to be one, likening it to organized crime. In ""Get Shorty,"" the premise of a gangster ""going legitimate"" by getting into movies was a clever fish-out-of water idea, but in ""Be Cool,"" it seems the biz has entirely gone crooked since then.<br /><br />The film is interestingly casted and the absolute highlight is a ""monolgue"" delivered by The Rock, whose character is an aspiring actor as well as a goon, where he reenacts a scene between Gabrielle Union and Kirsten Dunst from ""Bring It On."" Vince Vaughan's character thinks he's black and he's often seen dressed as a pimp-- this was quite funny in the first scene that introduces him and gets tired and embarrassing almost immediately afterward.<br /><br />Overall, ""Be Cool"" may be worth a rental for John Travolta die-hards (of which I am one), but you may want to keep your finger close to the fast forward button to get through it without feeling that you wasted too much time. Fans of ""Get Shorty"" may actually wish to avoid this, as the sequel is devoid of most things that made that one a winner. 
I rate this movie an admittedly harsh 4/10.",despite john travolta s statements in interviews that this was his favorite role of his career be cool proves to be a disappointing sequel to 1995 s witty and clever get shorty br br travolta delivers a pleasant enough performance in this mildly entertaining film but ultimately the movie falls flat due to an underdeveloped plot unlikeable characters and a surprising lack of chemistry between leads travolta and uma thurman although there are some laughs this unfunny dialog example which appeared frequently in the trailers kind of says it all thurman do you dance travolta hey i m from brooklyn br br the film suggests that everyone in the entertainment business is a gangster or aspires to be one likening it to organized crime in get shorty the premise of a gangster going legitimate by getting into movies was a clever fish out of water idea but in be cool it seems the biz has entirely gone crooked since then br br the film is interestingly casted and the absolute highlight is a monolgue delivered by the rock whose character is an aspiring actor as well as a goon where he reenacts a scene between gabrielle union and kirsten dunst from bring it on vince vaughan s character thinks he s black and he s often seen dressed as a pimp this was quite funny in the first scene that introduces him and gets tired and embarrassing almost immediately afterward br br overall be cool may be worth a rental for john travolta die hards of which i am one but you may want to keep your finger close to the fast forward button to get through it without feeling that you wasted too much time fans of get shorty may actually wish to avoid this as the sequel is devoid of most things that made that one a winner i rate this movie an admittedly harsh 4 10
2,"I am a kung fu fan, but not a Woo fan. I have no interest in gangster movies filled with over-the-top gun-play. Now, martial arts; *that's* beautiful! And John Woo surprised me here by producing a highly entertaining kung fu movie, which almost has *too much* fighting, if such a thing is possible! This is good stuff.<br /><br />Many of the fight scenes are very good (and some of them are less good), and the main characters are amusing and likable. The bad guys are a bit too unbelievably evil, but entertaining none the less. You gotta see the Sleeping Wizard!! He can only fight when he's asleep - it's hysterical!<br /><br />Upon repeated viewings, however, Last Hurrah For Chivalry can tend to get a little boring and long-winded, also especially because many of the fight scenes are actually not that good. Hence, I rate it ""only"" a 7 out of 10. But it really is almost an ""8"".<br /><br />All in all one of the better kung fu movies, made smack-dab in the heart of kung fu cinema's prime. 
All the really good kung fu movies are from the mid- to late 1970ies, with some notable exceptions from the late '60ies and early '70ies (and early '80ies, to be fair).",i am a kung fu fan but not a woo fan i have no interest in gangster movies filled with over the top gun play now martial arts that s beautiful and john woo surprised me here by producing a highly entertaining kung fu movie which almost has too much fighting if such a thing is possible this is good stuff br br many of the fight scenes are very good and some of them are less good and the main characters are amusing and likable the bad guys are a bit too unbelievably evil but entertaining none the less you gotta see the sleeping wizard he can only fight when he s asleep it s hysterical br br upon repeated viewings however last hurrah for chivalry can tend to get a little boring and long winded also especially because many of the fight scenes are actually not that good hence i rate it only a 7 out of 10 but it really is almost an 8 br br all in all one of the better kung fu movies made smack dab in the heart of kung fu cinema s prime all the really good kung fu movies are from the mid to late 1970ies with some notable exceptions from the late 60ies and early 70ies and early 80ies to be fair
3,"He seems to be a control freak. I have heard him comment on ""losing control of the show"" and tell another guest who brought live animals that he had one rule-""no snakes."" He needs to hire a comedy writer because his jokes are lame. The only reason I watch him is because he some some great guests and bands. <br /><br />I watched the Craig Ferguson show for a while but his show is even worse. He likes to bull sh** to burn time.I don't think either man has much of a future in late night talk shows.<br /><br />Daily also has the annoying habit of sticking his tongue out to lick his lips. He must do this at least 10 times a show. I do like the Joe Firstman band. Carson Daily needs to lighten up before it is too late.",he seems to be a control freak i have heard him comment on losing control of the show and tell another guest who brought live animals that he had one rule no snakes he needs to hire a comedy writer because his jokes are lame the only reason i watch him is because he some some great guests and bands br br i watched the craig ferguson show for a while but his show is even worse he likes to bull sh to burn time i don t think either man has much of a future in late night talk shows br br daily also has the annoying habit of sticking his tongue out to lick his lips he must do this at least 10 times a show i do like the joe firstman band carson daily needs to lighten up before it is too late


- The code above removes leading and trailing whitespace from each review; whitespace between words is left unchanged.
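Because `strip()` only trims the two ends of a string, repeated spaces inside the text survive it. The sketch below contrasts `strip()` with a `split()`/`join()` round trip that also collapses internal runs of whitespace. The sample string is made up, and whether this extra step is needed depends on how the earlier cleaning was done.

```python
# made-up review with leading, trailing and repeated internal whitespace
text = "  great   movie \n loved  it  "

# strip() trims only the two ends of the string
stripped = text.strip()

# split() with no arguments breaks on any run of whitespace,
# so joining with a single space also collapses the internal runs
collapsed = " ".join(text.split())

print(repr(stripped))
print(repr(collapsed))
```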

### **Removing stopwords**

<h2>Why is Removing Stop Words Important?</h2>

- Stop words are common words (like "and," "the," and "is") that are often excluded from text analysis because they add little meaning to the content.

- Excluding frequently occurring words helps emphasize the more significant terms in the text, improving analysis quality.

- Removing pronouns and articles, commonly categorized as stop words, minimizes irrelevant information, allowing algorithms to better identify patterns.

<h2>How to implement stop word removal?</h2>

- Start by using a pre-defined list of stop words to identify common words that can be excluded from our analysis.

- Downloading a stop words dataset ensures we have an updated collection of these words for accurate filtering.

- As the text data grows, manually removing stop words can become tedious and inefficient.

- Implementing an automated method to filter out stop words saves time.

- NLTK (Natural Language Toolkit) is a Python library designed for working with human language data (text) and provides tools for various natural language processing tasks.

- NLTK is widely used in academic and research settings, supported by an active community that contributes to its development and documentation, making it a valuable resource for learning and experimentation in NLP.

- It is imported using the statement `import nltk`

In [None]:
import nltk

- NLTK has a module called stopwords that provides a list of common stop words based on the English language.

- This list includes frequently used words that typically do not add significant meaning to text analysis.

- We can download this stop words list using the statement `nltk.download('stopwords')`

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

- Once we have downloaded the list, we can use the `corpus` module from `nltk` to load the stopwords

In [None]:
from nltk.corpus import stopwords

- To access the stopwords in the English language, we can call the `words()` method on `stopwords` and pass `'english'` as the argument.

- We will list out the first 10 stopwords

In [None]:
stopwords.words('english')[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Using the NLTK stop words list, we can filter reviews by keeping only the words not in this list, allowing us to focus on meaningful content and enhance the quality of our analysis.

- Suppose you have a list of words, and you are interested in converting it into a sentence as a string.

In [None]:
list_of_words = ["I","love","Text","Preprocessing"]

- This is where the `join()` method comes to the rescue.

  - The `join()` method takes all items in a list and joins them into one string with a separator.

  - A string must be specified as the separator.

  - **Syntax :** string.join(list)

In [None]:
'space'.join(list_of_words)

'IspacelovespaceTextspacePreprocessing'

- As we can observe, the string 'space' has been inserted between each pair of elements in the list, producing a single string.

In [None]:
' '.join(list_of_words)

'I love Text Preprocessing'

- As we can observe, a single space has been inserted between each pair of elements in the list, thus forming a sentence.
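`split()` and `join()` undo each other, which is what the filtering pattern in the following steps relies on. The short sketch below uses a made-up sentence (an assumption for illustration) to show one detail worth knowing: `split()` with no argument splits on any run of whitespace, while `split(' ')` splits at every single space and can leave empty strings behind.

```python
# Illustrative sentence with a double space between "love" and "Text"
sentence = "I love  Text Preprocessing"

# split() with no argument treats any run of whitespace as one separator
words_default = sentence.split()

# split(' ') splits at every single space, so the double space yields an empty string
words_single_space = sentence.split(' ')

# Joining the words from split() rebuilds the sentence with single spaces
rebuilt = ' '.join(words_default)

print(words_default)
print(words_single_space)
print(rebuilt)
```

For free-form text such as reviews, `split()` without an argument is usually the safer choice because it never produces empty tokens.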


<h2>1. Splitting the Text</h2>
<p>
    The first step involves splitting the input <code>text</code> into individual words using the <code>split()</code> method. Called without arguments, this method separates the text at every run of whitespace (spaces, tabs, newlines) and creates a list of words.
</p>
<pre><code>words = text.split()</code></pre>

<h2>2. Removing Stop Words</h2>
<p>
    Next, we remove the English stop words from the list of words. We use a list comprehension to iterate through each word in the <code>words</code> list and check if it is not present in the stop words list obtained from <code>stopwords.words('english')</code>.
</p>
<pre><code>[word for word in words if word not in stopwords.words('english')]</code></pre>

<h2>3. Creating the New Text</h2>
<p>
    Finally, the remaining words (those not in the stop words list) are joined back together into a single string using the <code>join()</code> method. This creates a new text string that contains only the meaningful words.
</p>
<pre><code>new_text = ' '.join([word for word in words if word not in stopwords.words('english')])</code></pre>

<h2>4. Result</h2>
<p>
    The resulting <code>new_text</code> variable now holds the filtered text, free of common English stop words, which allows for a more relevant analysis of the content.
</p>


In [None]:
string = "The quick brown fox jumps over the lazy dog."

words = string.split(' ')

new_text = ' '.join([word for word in words if word not in stopwords.words('english')])

print(string)

print(new_text)

The quick brown fox jumps over the lazy dog.
The quick brown fox jumps lazy dog.


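- As we can observe, the stopwords have been removed.
- Additionally, we can note that the stopwords removed are 'over' and 'the', as they are not present in `new_text`.
- The capitalized 'The' at the start of the sentence is kept because the NLTK list is all lowercase and the comparison is case-sensitive.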

In [None]:
string = "In order to succeed, you must first believe that you can."

words = string.split(' ')

new_text = ' '.join([word for word in words if word not in stopwords.words('english')])

print(string)

print(new_text)

In order to succeed, you must first believe that you can.
In order succeed, must first believe can.


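- As we can observe, the stopwords have been removed.
- Additionally, we can note that the stopwords removed are 'to', 'you' and 'that', as they are not present in `new_text`.
- 'In' and 'can.' are kept even though 'in' and 'can' are in the stop word list, because the capital letter and the trailing period prevent an exact match.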
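In the two outputs above, 'The', 'In' and 'can.' survive the filter because the comparison is an exact, case-sensitive string match. A minimal sketch of one way to handle this is shown below. It assumes a small hand-written `STOP_WORDS` set standing in for `stopwords.words('english')` so that it runs without any download, and the helper name `strip_and_filter` is hypothetical.

```python
import string

# Hypothetical hand-picked subset standing in for stopwords.words('english')
STOP_WORDS = {"the", "over", "in", "to", "you", "that", "can"}

def strip_and_filter(text, stop_words=STOP_WORDS):
    """Drop words that are stop words once case and edge punctuation are ignored."""
    kept = []
    for word in text.split():
        # Compare a lowercased copy with leading/trailing punctuation removed
        key = word.strip(string.punctuation).lower()
        if key not in stop_words:
            kept.append(word)
    return ' '.join(kept)

fox = strip_and_filter("The quick brown fox jumps over the lazy dog.")
quote = strip_and_filter("In order to succeed, you must first believe that you can.")

print(fox)
print(quote)
```

In this notebook the same effect is obtained by applying the filter to text that has already been lowercased and stripped of punctuation, which is why the function defined later is applied to the `cleaned_text` column.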

- Now, let's apply the same split, filter, and join pattern to one of the reviews in the data.

In [None]:
review = data['review'][25]

words = review.split(' ')

new_text = ' '.join([word for word in words if word not in stopwords.words('english')])


print(words)
print(new_text)

['I', 'think', 'the', 'movie', 'was', 'one', 'sided', 'I', 'watched', 'it', 'recently', 'and', 'find', 'the', 'documentary', 'typical', 'of', 'western', 'movie', 'makers', 'that', 'was', 'biased', 'without', 'substance.', 'The', 'fact', 'is', 'prostitution', 'do', 'exist', 'everywhere', 'in', 'the', 'world', 'not', 'in', 'Tanzania', 'alone', 'and', 'not', 'because', 'of', 'this', 'fish', 'business,', 'there', 'prostitutes', 'were', 'there', 'way', 'before', 'the', 'Russian', 'and', 'other', 'business', 'people', 'arrived', 'in', 'Mwanza.', 'Poverty', 'is', 'indeed', 'endemic', 'in', 'Africa', 'let', 'alone', 'Tanzania', 'and', 'this', 'is', 'not', 'because', 'of', 'fish', 'fillet', 'business,', 'in', 'fact', 'the', 'fish', 'industry', 'has', 'helped', 'millions', 'to', 'support', 'their', 'families', 'on', 'their', 'daily', 'life.', 'This', 'movie', 'just', 'tarnish', 'the', 'good', 'image', 'of', 'this', 'peace', 'loving', 'country.', 'As', 'for', 'the', 'arms', 'trade', 'the', 'film'

- As we can observe, the stopwords have been removed.


- Until now, we have been removing stop words from individual reviews one by one. While this method is effective, it can be inefficient when processing a larger number of reviews.

- To streamline our process, we want to remove stop words from all reviews in the DataFrame at once instead of handling each one separately.

- By defining a function called ***remove_stopwords*** using the NLTK library, we can efficiently apply the stop word removal to every review in the DataFrame in a single operation.

  - This function splits the text into separate words, removes English language stop words, and then joins the remaining words back into a single string, ensuring that all reviews are cleaned and ready for analysis.

- To implement this, we will apply the ***remove_stopwords*** function to the review column in our DataFrame using the Pandas ***apply()*** method, allowing us to filter out common stop words from each review and enhance the quality of our text analysis.

In [None]:
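# defining a function to remove stop words using the NLTK library
def remove_stopwords(text):
    # Load the English stop words once per call, as a set for fast membership checks
    stop_words = set(stopwords.words('english'))

    # Split text into separate words
    words = text.split()

    # Keep only the words that are not stop words and join them back into a string
    new_text = ' '.join([word for word in words if word not in stop_words])

    return new_text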

In [None]:
# Applying the function to remove stop words using the NLTK library
data['cleaned_text_without_stopwords'] = data['cleaned_text'].apply(remove_stopwords)

In [None]:
# checking a couple of instances of cleaned data
data.loc[0:3,['cleaned_text','cleaned_text_without_stopwords']]

Unnamed: 0,cleaned_text,cleaned_text_without_stopwords
0,okay i know this does nt project india in a good light but the overall theme of the movie is not india it s shakti the power of a warlord and the power of a mother the relationship between nandini and her husband and son swallow you up in their warmth then things go terribly wrong the interaction between nandini and her father in law the power of their dysfunctional relationship and the lives changed by it are the strengths of this movie shah rukh khan s performance seems to be a mere cameo compared to the believable desperation of karisma kapoor it is easy to get caught up in the love violence and redemption of lives in this film and find yourself heaving a sigh of relief and sadness at the climax the musical interludes are strengths believable and well done,okay know nt project india good light overall theme movie india shakti power warlord power mother relationship nandini husband son swallow warmth things go terribly wrong interaction nandini father law power dysfunctional relationship lives changed strengths movie shah rukh khan performance seems mere cameo compared believable desperation karisma kapoor easy get caught love violence redemption lives film find heaving sigh relief sadness climax musical interludes strengths believable well done
1,despite john travolta s statements in interviews that this was his favorite role of his career be cool proves to be a disappointing sequel to 1995 s witty and clever get shorty br br travolta delivers a pleasant enough performance in this mildly entertaining film but ultimately the movie falls flat due to an underdeveloped plot unlikeable characters and a surprising lack of chemistry between leads travolta and uma thurman although there are some laughs this unfunny dialog example which appeared frequently in the trailers kind of says it all thurman do you dance travolta hey i m from brooklyn br br the film suggests that everyone in the entertainment business is a gangster or aspires to be one likening it to organized crime in get shorty the premise of a gangster going legitimate by getting into movies was a clever fish out of water idea but in be cool it seems the biz has entirely gone crooked since then br br the film is interestingly casted and the absolute highlight is a monolgue delivered by the rock whose character is an aspiring actor as well as a goon where he reenacts a scene between gabrielle union and kirsten dunst from bring it on vince vaughan s character thinks he s black and he s often seen dressed as a pimp this was quite funny in the first scene that introduces him and gets tired and embarrassing almost immediately afterward br br overall be cool may be worth a rental for john travolta die hards of which i am one but you may want to keep your finger close to the fast forward button to get through it without feeling that you wasted too much time fans of get shorty may actually wish to avoid this as the sequel is devoid of most things that made that one a winner i rate this movie an admittedly harsh 4 10,despite john travolta statements interviews favorite role career cool proves disappointing sequel 1995 witty clever get shorty br br travolta delivers pleasant enough performance mildly entertaining film ultimately movie falls flat due 
underdeveloped plot unlikeable characters surprising lack chemistry leads travolta uma thurman although laughs unfunny dialog example appeared frequently trailers kind says thurman dance travolta hey brooklyn br br film suggests everyone entertainment business gangster aspires one likening organized crime get shorty premise gangster going legitimate getting movies clever fish water idea cool seems biz entirely gone crooked since br br film interestingly casted absolute highlight monolgue delivered rock whose character aspiring actor well goon reenacts scene gabrielle union kirsten dunst bring vince vaughan character thinks black often seen dressed pimp quite funny first scene introduces gets tired embarrassing almost immediately afterward br br overall cool may worth rental john travolta die hards one may want keep finger close fast forward button get without feeling wasted much time fans get shorty may actually wish avoid sequel devoid things made one winner rate movie admittedly harsh 4 10
2,i am a kung fu fan but not a woo fan i have no interest in gangster movies filled with over the top gun play now martial arts that s beautiful and john woo surprised me here by producing a highly entertaining kung fu movie which almost has too much fighting if such a thing is possible this is good stuff br br many of the fight scenes are very good and some of them are less good and the main characters are amusing and likable the bad guys are a bit too unbelievably evil but entertaining none the less you gotta see the sleeping wizard he can only fight when he s asleep it s hysterical br br upon repeated viewings however last hurrah for chivalry can tend to get a little boring and long winded also especially because many of the fight scenes are actually not that good hence i rate it only a 7 out of 10 but it really is almost an 8 br br all in all one of the better kung fu movies made smack dab in the heart of kung fu cinema s prime all the really good kung fu movies are from the mid to late 1970ies with some notable exceptions from the late 60ies and early 70ies and early 80ies to be fair,kung fu fan woo fan interest gangster movies filled top gun play martial arts beautiful john woo surprised producing highly entertaining kung fu movie almost much fighting thing possible good stuff br br many fight scenes good less good main characters amusing likable bad guys bit unbelievably evil entertaining none less gotta see sleeping wizard fight asleep hysterical br br upon repeated viewings however last hurrah chivalry tend get little boring long winded also especially many fight scenes actually good hence rate 7 10 really almost 8 br br one better kung fu movies made smack dab heart kung fu cinema prime really good kung fu movies mid late 1970ies notable exceptions late 60ies early 70ies early 80ies fair
3,he seems to be a control freak i have heard him comment on losing control of the show and tell another guest who brought live animals that he had one rule no snakes he needs to hire a comedy writer because his jokes are lame the only reason i watch him is because he some some great guests and bands br br i watched the craig ferguson show for a while but his show is even worse he likes to bull sh to burn time i don t think either man has much of a future in late night talk shows br br daily also has the annoying habit of sticking his tongue out to lick his lips he must do this at least 10 times a show i do like the joe firstman band carson daily needs to lighten up before it is too late,seems control freak heard comment losing control show tell another guest brought live animals one rule snakes needs hire comedy writer jokes lame reason watch great guests bands br br watched craig ferguson show show even worse likes bull sh burn time think either man much future late night talk shows br br daily also annoying habit sticking tongue lick lips must least 10 times show like joe firstman band carson daily needs lighten late


### **Stemming**

<h2>Why is Stemming Important?</h2>

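- Stemming transforms different forms of a word (e.g., "running," "runs") into a single root (e.g., "run"), making data more uniform. Irregular forms such as "ran" are generally left unchanged, because stemmers work by stripping suffixes rather than looking words up in a dictionary.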

- Fewer unique words mean simpler data, which helps in processing and analyzing text more efficiently.

- It helps in identifying key topics and sentiments by focusing on the core meaning of words rather than their variations.

<h2>How to implement?</h2>

The Porter Stemmer is one of the most widely used algorithms for stemming, and it shortens words to their root form by removing suffixes.
 - This is particularly useful in NLP tasks where you want to analyze the underlying meaning of text without being misled by different grammatical forms of the same word.

The NLTK library provides an implementation of the Porter Stemmer algorithm.

We also download the `wordnet` database at this point.
 - WordNet is a large lexical database of English words that groups words into sets of synonyms called synsets, providing definitions and semantic relationships between them.
 - By using `nltk.download('wordnet')`, we can download the WordNet dataset directly into our NLTK environment, enabling easy access to its rich linguistic resources.
 - Note that the Porter Stemmer itself is purely rule-based and does not need WordNet; the dataset is used by other NLTK tools, such as the WordNet lemmatizer.

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

Next, we can import the `PorterStemmer` class from the `stem.porter` module available in the `nltk` library.

In [None]:
from nltk.stem.porter import PorterStemmer

Creating an instance of the Porter Stemmer class, which is used to stem words by reducing them to their root form.

In [None]:
ps = PorterStemmer()

Specifying the input word that needs to be stemmed.

In [None]:
word = 'Running'

Applying the stemming process to the input word (in this case, 'Running') and assigning the resulting stemmed form to the variable `stemmed_word`.

In [None]:
stemmed_word = ps.stem(word)

Displaying the input word and its stemmed form.

In [None]:
print(word)
print(stemmed_word)

Running
run


In [None]:
word = 'Analzying'

stemmed_word = ps.stem(word)

print(stemmed_word)

analzi


- Here, the stemmed word is 'analzi', which is not a real English word; the dictionary root would be 'analyze'.

- Note that the input was typed as 'Analzying', a misspelling of 'Analyzing'. The Porter Stemmer does not check spelling; it simply applies a fixed set of suffix-stripping rules to whatever string it receives.
  - Its algorithm may not always produce linguistically correct stems, focusing instead on reducing words to a common form based on pattern matching.

- The specific rules for removing suffixes explain the result.
  - The word is lowercased, the suffix "-ing" is removed to give "analzy", and the final "y" after a consonant is changed to "i", giving "analzi".
  - Even the correctly spelled "Analyzing" would be reduced to "analyz" rather than "analyze".
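To make the rule-based behaviour concrete, here is a toy sketch with only two rules: strip "-ing", then turn a final "y" after a consonant into "i". The function `toy_stem` is hypothetical and is not the Porter algorithm, which has many more steps and conditions; it only illustrates how suffix rules applied mechanically turn the input 'Analzying' from the cell above into 'analzi'.

```python
VOWELS = set("aeiou")

def toy_stem(word):
    """Toy two-rule stemmer for illustration only; NOT the full Porter algorithm."""
    word = word.lower()

    # Rule 1: strip "-ing" when what remains still contains a vowel
    if word.endswith("ing") and any(ch in VOWELS for ch in word[:-3]):
        word = word[:-3]

    # Rule 2: turn a final "y" into "i" when it follows a consonant
    if len(word) > 2 and word.endswith("y") and word[-2] not in VOWELS:
        word = word[:-1] + "i"

    return word

for w in ["Analzying", "Analyzing", "happy"]:
    print(w, "->", toy_stem(w))
```

Neither rule consults a dictionary, so the quality of the stem depends entirely on how well the input fits the patterns the rules expect.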

- Now, let's apply the Porter Stemmer to each word of one of the reviews in the data.

In [None]:
review = data['review'][21]

words = review.split(' ')

stemmed_text = ' '.join([ps.stem(word) for word in words])


print(review)

print(stemmed_text)

Given the chance to write, direct and star in my own movie, I would probably choose something about robot women with guns. Anthony Hopkins, however, decided to make possibly the strangest movie anyone has ever seen. "Slipstream" is a movie that is so strange that even David Lynch would probably look at the person next to him and say 'What's going on?'.<br /><br />This is a movie where, in one scene, a man crosses the road towards a yellow car facing to the right which suddenly changes into a pink car facing to the left. This is a movie where two characters have a conversation interspersed with shots of random people laughing and insects climbing up walls. This is a movie where a man starts talking about "Invasion Of The Bodysnatchers" only for the actor of that particular movie to suddenly show up as himself (and then disappear into thin air). <br /><br />This is a movie that decides to throw the need for a coherent plot straight out of the window and use fifteen different edits whilst

- As we can observe, the words in the review have been transformed to their root forms.

- Until now, we have been applying stemming to individual words one by one. While this method is effective, it can be inefficient when processing larger texts or multiple reviews.

- To streamline our process, we want to apply stemming to all words in a given text at once instead of handling each word separately.

- By defining a function called ***apply_porter_stemmer*** using the NLTK library, we can efficiently apply the Porter stemming algorithm to every word in the text in a single operation.

  - This function splits the input text into separate words, applies the Porter Stemmer to each word, and then joins the stemmed words back into a single string, ensuring that the text is uniformly stemmed and ready for further analysis.
- To implement this, we will call the ***apply_porter_stemmer*** function for our text data, allowing us to reduce words to their root forms and enhance the quality of our text analysis.

In [None]:
# Loading the Porter Stemmer
ps = PorterStemmer()

# defining a function to perform stemming
def apply_porter_stemmer(text):
    # Split text into separate words
    words = text.split()

    # Applying the Porter Stemmer to every word of the text and joining the stemmed words back into a single string
    new_text = ' '.join([ps.stem(word) for word in words])

    return new_text

In [None]:
# Applying the function to perform stemming
data['final_cleaned_text'] = data['cleaned_text_without_stopwords'].apply(apply_porter_stemmer)

In [None]:
# checking a couple of instances of cleaned data
data.loc[0:2,['cleaned_text_without_stopwords','final_cleaned_text']]

Unnamed: 0,cleaned_text_without_stopwords,final_cleaned_text
0,okay know nt project india good light overall theme movie india shakti power warlord power mother relationship nandini husband son swallow warmth things go terribly wrong interaction nandini father law power dysfunctional relationship lives changed strengths movie shah rukh khan performance seems mere cameo compared believable desperation karisma kapoor easy get caught love violence redemption lives film find heaving sigh relief sadness climax musical interludes strengths believable well done,okay know nt project india good light overal theme movi india shakti power warlord power mother relationship nandini husband son swallow warmth thing go terribl wrong interact nandini father law power dysfunct relationship live chang strength movi shah rukh khan perform seem mere cameo compar believ desper karisma kapoor easi get caught love violenc redempt live film find heav sigh relief sad climax music interlud strength believ well done
1,despite john travolta statements interviews favorite role career cool proves disappointing sequel 1995 witty clever get shorty br br travolta delivers pleasant enough performance mildly entertaining film ultimately movie falls flat due underdeveloped plot unlikeable characters surprising lack chemistry leads travolta uma thurman although laughs unfunny dialog example appeared frequently trailers kind says thurman dance travolta hey brooklyn br br film suggests everyone entertainment business gangster aspires one likening organized crime get shorty premise gangster going legitimate getting movies clever fish water idea cool seems biz entirely gone crooked since br br film interestingly casted absolute highlight monolgue delivered rock whose character aspiring actor well goon reenacts scene gabrielle union kirsten dunst bring vince vaughan character thinks black often seen dressed pimp quite funny first scene introduces gets tired embarrassing almost immediately afterward br br overall cool may worth rental john travolta die hards one may want keep finger close fast forward button get without feeling wasted much time fans get shorty may actually wish avoid sequel devoid things made one winner rate movie admittedly harsh 4 10,despit john travolta statement interview favorit role career cool prove disappoint sequel 1995 witti clever get shorti br br travolta deliv pleasant enough perform mildli entertain film ultim movi fall flat due underdevelop plot unlik charact surpris lack chemistri lead travolta uma thurman although laugh unfunni dialog exampl appear frequent trailer kind say thurman danc travolta hey brooklyn br br film suggest everyon entertain busi gangster aspir one liken organ crime get shorti premis gangster go legitim get movi clever fish water idea cool seem biz entir gone crook sinc br br film interestingli cast absolut highlight monolgu deliv rock whose charact aspir actor well goon reenact scene gabriel union kirsten dunst bring vinc vaughan charact 
think black often seen dress pimp quit funni first scene introduc get tire embarrass almost immedi afterward br br overal cool may worth rental john travolta die hard one may want keep finger close fast forward button get without feel wast much time fan get shorti may actual wish avoid sequel devoid thing made one winner rate movi admittedli harsh 4 10
2,kung fu fan woo fan interest gangster movies filled top gun play martial arts beautiful john woo surprised producing highly entertaining kung fu movie almost much fighting thing possible good stuff br br many fight scenes good less good main characters amusing likable bad guys bit unbelievably evil entertaining none less gotta see sleeping wizard fight asleep hysterical br br upon repeated viewings however last hurrah chivalry tend get little boring long winded also especially many fight scenes actually good hence rate 7 10 really almost 8 br br one better kung fu movies made smack dab heart kung fu cinema prime really good kung fu movies mid late 1970ies notable exceptions late 60ies early 70ies early 80ies fair,kung fu fan woo fan interest gangster movi fill top gun play martial art beauti john woo surpris produc highli entertain kung fu movi almost much fight thing possibl good stuff br br mani fight scene good less good main charact amus likabl bad guy bit unbeliev evil entertain none less gotta see sleep wizard fight asleep hyster br br upon repeat view howev last hurrah chivalri tend get littl bore long wind also especi mani fight scene actual good henc rate 7 10 realli almost 8 br br one better kung fu movi made smack dab heart kung fu cinema prime realli good kung fu movi mid late 1970i notabl except late 60i earli 70i earli 80i fair


# **Text Vectorization**

<h2>Why is Text Vectorization Important?</h2>

- It transforms words and sentences into numerical formats that mathematical algorithms can understand, making it essential for processing text data.

- By representing text as vectors, it helps capture important patterns and relationships in the data, leading to better insights and analysis in text-related tasks.

<p><h3>Definition:</h3> A document-term matrix is a mathematical representation of a collection of documents, where each document is represented as a row, and each unique word (or term) across all documents is represented as a column. The cells in the matrix contain values that indicate the frequency or presence of the corresponding word in the corresponding document.</p>

<p><h3>Usage in Text Preprocessing:</h3> The document-term matrix is essential in text preprocessing as it converts unstructured text data into a structured numerical format. This transformation allows for efficient analysis and processing of textual information. By representing documents as matrices, it facilitates the extraction of important features and patterns, making it easier to prepare text data for further analysis and understanding.</p>

<p><h3>Example:</h3></p>
<p>Consider three short documents:</p>
<ul>
<li>Document 1: "I love programming"</li>
<li>Document 2: "Programming is fun"</li>
<li>Document 3: "I love fun programming"</li>
</ul>

<p><h3>Steps to Create the Document-Term Matrix:</h3></p>

<li>Convert all text to lowercase:
<ul>
<li>Document 1: "i love programming"</li>
<li>Document 2: "programming is fun"</li>
<li>Document 3: "i love fun programming"</li>
</ul>
</li>

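<li>Identify all unique words across the documents and sort them alphabetically. The single-character word "i" is left out here to match the default behavior of <code>CountVectorizer</code>, which only keeps tokens of two or more characters:
<ul>
<li>fun , is , love , programming</li>
</ul>
</li>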
    
<li>Count the occurrences of each word in each document:
<ul>
<li>Document 1:
<ul>
<li>"fun": 0</li>
<li>"is": 0</li>
<li>"love": 1</li>
<li>"programming": 1</li>
</ul>
</li>
<li>Document 2:
<ul>
<li>"fun": 1</li>
<li>"is": 1</li>
<li>"love": 0</li>
<li>"programming": 1</li>
</ul>
</li>
<li>Document 3:
<ul>
<li>"fun": 1</li>
<li>"is": 0</li>
<li>"love": 1</li>
<li>"programming": 1</li>
</ul>
</li>
</ul>
</li>

<li>Form a matrix using the counts, where each row represents a document and each column represents a unique word:</li><br>

<table border="1" cellspacing="0" cellpadding="12" align="center" rules="all" frame="box" width="80%" style="font-size: 20px; text-align: center;">
<thead>
<tr bgcolor="#ADD8E6">
<th style="color: white;">Document</th>
<th style="color: white;">fun</th>
<th style="color: white;">is</th>
<th style="color: white;">love</th>
<th style="color: white;">programming</th>
</tr>
</thead>
<tbody>
<tr bgcolor="#f2f2f2">
<td>Document 1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>

</tr>
<tr>
<td>Document 2</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr bgcolor="#f2f2f2">
<td>Document 3</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>
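The steps above can be sketched in plain Python before turning to a library. This is a minimal illustration under stated assumptions: the regular expression keeps only tokens of two or more word characters, which is why "i" does not appear among the columns, and the variable names are chosen for this sketch only.

```python
import re
from collections import Counter

docs = ["I love programming", "Programming is fun", "I love fun programming"]

# Step 1: lowercase each document and keep tokens of 2+ word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]

# Step 2: collect the unique words across all documents and sort them
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Steps 3 and 4: count each vocabulary word per document, one row per document
matrix = [[Counter(tokens)[term] for term in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

Each printed row corresponds to one row of the table above, with the columns in the order given by `vocabulary`.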

Creating the documents

In [None]:
# The documents as stated in the example; CountVectorizer lowercases text by default
Document_1 = "I love programming"
Document_2 = "Programming is fun"
Document_3 = "I love fun programming"

Appending all the documents to a list

In [None]:
list_of_docs = [Document_1,Document_2,Document_3]

- To implement this Bag of Words (BoW) representation, `sklearn` offers a class called `CountVectorizer`

- It is available in the `feature_extraction.text` module

- It is imported using the statement `from sklearn.feature_extraction.text import CountVectorizer`

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

**`CountVectorizer`**:

- This is a tool from `sklearn.feature_extraction.text` used to convert text documents into a matrix of token counts.
- It creates a Bag of Words (BoW) representation of the text, where:
 - Each row represents a document.
 - Each column corresponds to a unique word (token) in the entire corpus.
 - The values in the matrix are the counts of each word's occurrence in the respective document

**`fit_transform()`**:

- The `fit_transform()` method performs both `fit()` and `transform()` in a single operation.
- The `fit()` method learns the vocabulary from the provided documents (i.e., identifies the unique words).
- The `transform()` method converts the documents into a sparse matrix, where:
 - The matrix rows correspond to documents.
 - The columns represent the words from the learned vocabulary.
 - The values in the matrix are the word frequencies.
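The split between `fit()` and `transform()` matters once new documents arrive: the vocabulary is fixed at fit time, and words never seen during fitting have no column. The sketch below uses a hypothetical toy class, `TinyCountVectorizer`, written only to illustrate this idea in plain Python; it is not scikit-learn's implementation and returns nested lists instead of a sparse matrix.

```python
import re
from collections import Counter

class TinyCountVectorizer:
    """Hypothetical toy stand-in for illustration; not scikit-learn's class."""

    @staticmethod
    def _tokenize(doc):
        # Lowercase and keep tokens of 2+ word characters
        return re.findall(r"\b\w\w+\b", doc.lower())

    def fit(self, docs):
        # Learn the sorted vocabulary and remember each word's column index
        tokens = {t for doc in docs for t in self._tokenize(doc)}
        self.vocabulary_ = {term: i for i, term in enumerate(sorted(tokens))}
        return self

    def transform(self, docs):
        # Count only the words already in the learned vocabulary
        rows = []
        for doc in docs:
            counts = Counter(self._tokenize(doc))
            rows.append([counts[term] for term in self.vocabulary_])
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)

vec = TinyCountVectorizer()
train = vec.fit_transform(["I love programming", "Programming is fun", "I love fun programming"])

# "movies" was not seen during fit, so it is ignored here
new_rows = vec.transform(["fun fun movies"])

print(train)
print(new_rows)
```

The same principle is why, in a modelling workflow, the vectorizer is typically fitted on the training text and then only `transform()` is applied to unseen text.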

In [None]:
bow_vec = CountVectorizer()

doc_matrix = bow_vec.fit_transform(list_of_docs)

**Compressed Sparse Row (CSR) format**:

- It’s a way to store large matrices efficiently when most of the elements are zero.
- Instead of storing every element, including the zeros, CSR compresses the data by only storing non-zero values and their positions.
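As a rough illustration of what "only storing non-zero values and their positions" means, the sketch below builds the three arrays behind a CSR matrix by hand for the small example matrix. The names `data`, `indices` and `indptr` follow the attribute names SciPy uses for CSR matrices; the loop itself is a simplified sketch, not SciPy's code.

```python
# Dense document-term matrix from the three-document example
dense = [
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
]

data = []      # the non-zero values, row by row
indices = []   # the column index of each stored value
indptr = [0]   # where each row starts and ends inside `data`

for row in dense:
    for col, value in enumerate(row):
        if value != 0:
            data.append(value)
            indices.append(col)
    # After finishing a row, record how many values have been stored so far
    indptr.append(len(data))

print(data)
print(indices)
print(indptr)
```

Row `i` is recovered from `data[indptr[i]:indptr[i + 1]]` and the matching slice of `indices`. For this tiny matrix the saving is negligible, but for thousands of reviews and a large vocabulary most entries are zero and are never stored.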

In [None]:
doc_matrix

<3x4 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [None]:
# Converting CSR to a numpy array
doc_matrix.toarray()

array([[0, 0, 1, 1],
       [1, 1, 0, 1],
       [1, 0, 1, 1]])

- As expected, the BoW representation matches with the previous example.

In [None]:
# returns the list of unique words (tokens) extracted from the documents.
bow_vec.get_feature_names_out()

array(['fun', 'is', 'love', 'programming'], dtype=object)

In [None]:
# Converting to a pandas dataframe
pd.DataFrame(doc_matrix.toarray(),columns=bow_vec.get_feature_names_out())

Unnamed: 0,fun,is,love,programming
0,0,0,1,1
1,1,1,0,1
2,1,0,1,1


In [None]:
# Extracting the reviews
review_1 = data['review'][0]
review_2 = data['review'][1]

# Creating the list of reviews or documents
list_of_docs = [review_1,review_2]

#Initializing the CountVectorizer
bow_vec = CountVectorizer()

#Creating the BoW matrix.
doc_matrix = bow_vec.fit_transform(list_of_docs)

In [None]:
doc_matrix.toarray()

array([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  2,  0,
         0,  0,  1,  0,  1,  2,  2,  0,  0,  0,  0,  0,  0,  1,  0,  1,
         1,  0,  0,  1,  1,  0,  0,  0,  0,  1,  0,  1,  0,  0,  0,  0,
         0,  0,  1,  0,  0,  0,  0,  0,  0,  1,  1,  0,  0,  0,  1,  1,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  1,  1,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  1,  0,  0,
         1,  0,  0,  0,  0,  0,  1,  2,  0,  0,  0,  0,  1,  0,  0,  5,
         2,  1,  0,  1,  0,  0,  0,  2,  3,  0,  1,  1,  0,  1,  0,  0,
         1,  0,  0,  1,  0,  0,  1,  0,  2,  1,  0,  0,  1,  0,  0,  0,
         1,  2,  0,  0,  1,  2,  1,  1,  8,  0,  1,  0,  0,  0,  0,  0,
         1,  1,  0,  0,  0,  3,  0,  1,  0,  0,  0,  1,  0,  2,  1,  0,
         0,  0,  1,  1,  0,  0,  1,  0,  0,  1,  1,  0,  1,  0,  0,  1,
         0,  2,  0,  0,  1,  1,  0, 13,  2,  1,  1,  0,  1,  0,  3,  0,
         0,  0,  0,  3,  0,  0,  0,  0,  0,  0,  0,  0,  0,  2, 

- Here, we can also observe values greater than 1.
  - This is because some of the words occur more than once in a review.

In [None]:
bow_vec.get_feature_names_out()

array(['10', '1995', 'absolute', 'actor', 'actually', 'admittedly',
       'afterward', 'all', 'almost', 'although', 'am', 'an', 'and',
       'appeared', 'are', 'as', 'aspires', 'aspiring', 'at', 'avoid',
       'be', 'believable', 'between', 'biz', 'black', 'br', 'bring',
       'brooklyn', 'business', 'but', 'button', 'by', 'cameo', 'career',
       'casted', 'caught', 'changed', 'character', 'characters',
       'chemistry', 'clever', 'climax', 'close', 'compared', 'cool',
       'crime', 'crooked', 'dance', 'delivered', 'delivers',
       'desperation', 'despite', 'devoid', 'dialog', 'die',
       'disappointing', 'do', 'does', 'done', 'dressed', 'due', 'dunst',
       'dysfunctional', 'easy', 'embarrassing', 'enough', 'entertaining',
       'entertainment', 'entirely', 'everyone', 'example', 'falls',
       'fans', 'fast', 'father', 'favorite', 'feeling', 'film', 'find',
       'finger', 'first', 'fish', 'flat', 'for', 'forward', 'frequently',
       'from', 'funny', 'gabrielle',

In [None]:
# Converting to a pandas dataframe
pd.DataFrame(doc_matrix.toarray(),columns=bow_vec.get_feature_names_out())

Unnamed: 0,10,1995,absolute,actor,actually,admittedly,afterward,all,almost,although,...,whose,winner,wish,without,witty,worth,wrong,you,your,yourself
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,1
1,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,0,3,1,0


- The *`max_features`* parameter in CountVectorizer caps the size of the vocabulary: only the `max_features` most frequent words (or tokens) across the whole corpus are kept.
- This helps when dealing with large datasets, because it keeps the feature matrix to a manageable number of columns while dropping the rarest words.

In [None]:
# Initializing CountVectorizer to keep only the 1000 most frequent words
bow_vec = CountVectorizer(max_features = 1000)

# Applying CountVectorizer on the cleaned text
data_features_BOW = bow_vec.fit_transform(data['final_cleaned_text'])

# Convert the data features to array
data_features_BOW = data_features_BOW.toarray()

# Shape of the feature vector
data_features_BOW.shape

(9982, 1000)

In [None]:
# Getting the 1000 words considered by the BoW model
words = bow_vec.get_feature_names_out()

In [None]:
# Checking the words considered by BoW model
words

array(['10', '100', '20', '30', '50', '60', '70', '80', '90', 'abil',
       'abl', 'absolut', 'accent', 'accept', 'achiev', 'across', 'act',
       'action', 'actor', 'actress', 'actual', 'ad', 'adapt', 'add',
       'admit', 'adult', 'adventur', 'age', 'ago', 'agre', 'air', 'alien',
       'aliv', 'allow', 'almost', 'alon', 'along', 'alreadi', 'also',
       'although', 'alway', 'amaz', 'america', 'american', 'among',
       'amount', 'amus', 'anim', 'ann', 'annoy', 'anoth', 'answer',
       'anti', 'anyon', 'anyth', 'anyway', 'apart', 'appar', 'appeal',
       'appear', 'appreci', 'approach', 'armi', 'around', 'arriv', 'art',
       'artist', 'ask', 'aspect', 'atmospher', 'attack', 'attempt',
       'attent', 'attract', 'audienc', 'averag', 'avoid', 'aw', 'award',
       'away', 'awesom', 'babi', 'back', 'background', 'bad', 'badli',
       'band', 'bare', 'base', 'basic', 'battl', 'beat', 'beauti',
       'becam', 'becom', 'begin', 'behind', 'believ', 'best', 'better',
       'beyo

In [None]:
# Creating a DataFrame from the data features
df_BOW = pd.DataFrame(data_features_BOW, columns=bow_vec.get_feature_names_out())
df_BOW.head()

Unnamed: 0,10,100,20,30,50,60,70,80,90,abil,...,writer,written,wrong,wrote,ye,year,yet,york,young,zombi
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,2,0


- From the above dataframe, we can observe that the word *wrong* occurs once in the first document (row 0), and the word *young* occurs twice in the fifth document (row 4).

# **Conclusion**

- We used different text processing techniques to clean the raw text data.

- We then vectorized the cleaned text data using Bag of Words.

<font size=6 color='blue'>Power Ahead</font>
___