### Introduction to NLP

Welcome to the first module of natural language processing. Natural language processing (NLP), also referred to as text analytics, plays a vital role today because of the sheer volume of text data that users around the world generate on digital channels such as social media apps, e-commerce websites and blog posts. The first session of this module will take you through the following lectures:

- Industry applications of text analytics
- Understanding textual data
- Regular expressions

### a. Understanding Text

#### Lexical Processing (Converting Text to a Group of Words)
First, you will just convert the raw text into words and, depending on your application's needs, into sentences or paragraphs as well.

1. For example, if an email contains words such as lottery, prize and luck, then the email is represented by these words, and it is likely to be a spam email.

2. Hence, in general, the group of words contained in a sentence gives us a pretty good idea of what that sentence means. Many more processing steps are usually undertaken in order to make this group more representative of the sentence. For example, cat and cats are considered to be the same word; more generally, all plural words can be treated as equivalent to their singular form.

3. For a simple application like spam detection, lexical processing works just fine, but it is usually not enough in more complex applications such as machine translation. For example, the sentences “My cat ate its third meal” and “My third cat ate its meal” have very different meanings. However, lexical processing will treat the two sentences as equal, as the “group of words” in both sentences is the same. Hence, we clearly need a more advanced system of analysis.
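
The limitation described above can be seen in a minimal sketch. It assumes the simplest possible lexical representation, a sorted list of lower-cased, whitespace-separated words; the helper name bag_of_words is illustrative and not part of any library.

```python
def bag_of_words(sentence):
    # Lower-case the sentence and split on whitespace, discarding word order
    return sorted(sentence.lower().split())

s1 = "My cat ate its third meal"
s2 = "My third cat ate its meal"

# Both sentences reduce to the same group of words,
# so a purely lexical representation cannot tell them apart.
print(bag_of_words(s1))
print(bag_of_words(s1) == bag_of_words(s2))
```

The comparison prints True, which is exactly why word order (syntax) has to be analysed separately.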

#### Syntactic Processing (Extracting Meaning from a Sentence Using Grammar)
The next step after lexical analysis is where we try to extract more meaning from the sentence, by using its syntax this time. Instead of only looking at the words, we look at the syntactic structures, i.e., the grammar of the language to understand what the meaning is.

- One example is differentiating between the subject and the object of the sentence, i.e., identifying who is performing the action and who is the person affected by it. For example, “Ram thanked Shyam” and “Shyam thanked Ram” are sentences with different meanings from each other because in the first instance, the action of ‘thanking’ is done by Ram and affects Shyam, whereas, in the other one, it is done by Shyam and affects Ram. Hence, a syntactic analysis that is based on a sentence’s subjects and objects, will be able to make this distinction.

- There are various other ways in which these syntactic analyses can help us enhance our understanding. For example, a question answering system that is asked the question “Who is the Prime Minister of India?”, will perform much better, if it can understand that the words “Prime Minister” are related to “India”. It can then look up in its database, and provide the answer.


#### Semantic Processing (Understanding the Meaning of Words and the Relationships between Them)

Lexical and syntactic processing don't suffice when it comes to building advanced NLP applications such as language translation, chatbots, etc. The machine, after the two steps given above, will still be incapable of actually understanding the meaning of the text. Such an incapability can be a problem for, say, a question answering system, as it may be unable to understand that PM and Prime Minister mean the same thing. Hence, when somebody asks it the question, “Who is the PM of India?”, it may not even be able to give an answer unless it has a separate database for PMs, as it won’t understand that the words PM and Prime Minister are the same. You could store the answer separately for both variants (PM and Prime Minister), but how many such variants are you going to store manually? At some point, your machine should be able to identify synonyms, antonyms, etc. on its own.

- This is typically done by inferring the word’s meaning from the collection of words that usually occur around it. So, if the words PM and Prime Minister occur very frequently around similar words, then you can assume that the meanings of the two words are similar as well.

- In fact, this way, the machine should also be able to understand other semantic relations. For example, it should be able to understand that the words “King” and “Queen” are related to each other and that the word “Queen” is simply the female version of the word “King”. Also, both of these words can be clubbed under the word “Monarch”. You can probably save these relations manually, but it will help you a lot more, if you can train your machine to look for the relations on its own, and learn them. Exactly how that training can be done, is something we’ll explore in the third module.
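
The idea of inferring meaning from neighbouring words can be sketched with a toy example. Everything here is an assumption made for illustration: the five sentences are invented, the helper names context_sets and jaccard are hypothetical, and real systems learn from far larger corpora with more sophisticated models. The sketch only shows that words used in similar contexts end up with overlapping sets of neighbours.

```python
from collections import defaultdict

def context_sets(sentences, window=2):
    # Map each word to the set of words seen within `window` positions of it
    contexts = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + window + 1]
            contexts[token].update(left + right)
    return contexts

def jaccard(a, b):
    # Overlap between two sets: size of intersection divided by size of union
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Invented toy corpus, used only for illustration
sentences = [
    "the king rules the country",
    "the queen rules the country",
    "the king wears a crown",
    "the queen wears a crown",
    "the cat chased a mouse",
]

contexts = context_sets(sentences)
print("king vs queen:", jaccard(contexts["king"], contexts["queen"]))
print("king vs cat:  ", jaccard(contexts["king"], contexts["cat"]))
```

In this toy corpus, “king” and “queen” share more of their neighbours than “king” and “cat” do, which is the intuition the paragraph above describes.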

Once you have the meaning of the words, obtained via semantic analysis, you can use it for a variety of applications. Machine translation, chatbots and many other applications require a complete understanding of the text, right from the lexical level to the understanding of syntax to that of meaning. Hence, in most of these applications, lexical and syntactic processing simply form the “pre-processing” layer of the overall process. In some simpler applications, lexical processing alone is enough as the pre-processing step.

### b. Text Processing

Computers handle numbers directly and store them in registers (small, fast storage locations inside the processor). However, they cannot store non-numeric characters as is. Letters and special characters have to be converted to numeric values before they can be stored.

Hence, the concept of encoding came into existence. All the non-numeric characters were encoded to a number using a code. The encoding techniques also had to be standardised so that different computer manufacturers would not use different encoding techniques.

One of the earliest widely adopted encoding standards was ASCII (American Standard Code for Information Interchange), first published in 1963. The ASCII standard assigned a unique code, known as the ASCII code, to each character on the keyboard. For example, the ASCII code of the letter ‘A’ is 65 and that of the digit zero is 48. Since then, there have been several revisions to the codes to incorporate new characters that came into use after the initial encoding.

When ASCII was built, the English alphabet was the only alphabet present on the keyboard. With time, new languages began to show up on keyboards, bringing new characters with them. ASCII became outdated, as it could not incorporate so many languages. A newer standard, Unicode (first published in 1991), was created to address this. It aims to support the characters of all the world’s writing systems, both modern and historic.

For someone working on text processing, knowing how to handle encodings becomes crucial. Before even beginning with any text processing, you need to know what kind of encoding the text has and if required, modify it to another encoding format.
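
As a minimal sketch of what changing the encoding looks like in Python, the snippet below assumes some text arrived as Latin-1 bytes and needs to become UTF-8. The word “café” and the choice of Latin-1 are assumptions made for illustration; with real data you would first need to find out which encoding was actually used.

```python
# Bytes as they might arrive from a legacy system (assumed to be Latin-1 encoded)
raw_bytes = "café".encode("latin-1")

# Decoding with the wrong encoding fails here, because these bytes are not valid UTF-8
try:
    raw_bytes.decode("utf-8")
except UnicodeDecodeError as error:
    print("Not valid UTF-8:", error.reason)

# Decode with the correct encoding, then re-encode as UTF-8
text = raw_bytes.decode("latin-1")
utf8_bytes = text.encode("utf-8")
print(text, utf8_bytes)
```

The same decode-then-encode pattern applies to any pair of encodings that Python supports.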

To summarise, the two most popular encoding standards are:
- American Standard Code for Information Interchange (ASCII)
- Unicode
    - UTF-8
    - UTF-16
    
UTF-8 offers a big advantage when the character is an English character or any other character from the ASCII character set: its UTF-8 encoding is identical to its ASCII code. Also, UTF-8 uses only 8 bits to store such a character, while UTF-16 (BE) uses 16 bits, which looks like a waste of memory.

However, consider a second case, in which the symbol doesn’t appear in the ASCII character set, for example the euro sign ‘€’. For this case, UTF-8 uses 24 bits, whereas UTF-16 (BE) only uses 16. Hence, the storage advantage offered by UTF-8 is reversed and actually becomes a disadvantage here. The advantage UTF-8 offered previously by being the same as the ASCII code is also of no use here, as no ASCII code exists for this symbol. In Python 3, strings are sequences of Unicode characters, and UTF-8 is the default encoding for source files and for str.encode().
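
The storage comparison can be checked directly in Python. This is a small sketch; the euro sign ‘€’ is an assumed example of a symbol outside the ASCII set, chosen here purely for illustration.

```python
for char in ["A", "€"]:
    utf8 = char.encode("utf-8")
    utf16 = char.encode("utf-16-be")
    # len() of a bytes object counts bytes, so multiply by 8 to get bits
    print(char, "| code point:", ord(char),
          "| UTF-8 bits:", len(utf8) * 8,
          "| UTF-16 (BE) bits:", len(utf16) * 8)

# For ASCII characters, the UTF-8 bytes are identical to the ASCII bytes
print("A".encode("utf-8") == "A".encode("ascii"))
```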

### c. Regular expressions: Quantifiers - I

From this section onwards, you’ll learn about regular expressions. Regular expressions, also called regex, are very powerful programming tools that are used for a variety of purposes such as feature extraction from text, string replacement and other string manipulations. For anyone who wants to master text analytics, proficiency with regular expressions is a must-have skill.

A regular expression is a set of characters, or a pattern, which is used to find substrings in a given string. 

Let’s say you want to extract all the hashtags from a tweet. A hashtag has a fixed pattern to it, i.e. a pound (‘#’) character followed by a string. Some example hashtags are - #mumbai, #bangalore, #upgrad. You could easily achieve this task by providing this pattern and the tweet that you want to extract the pattern from (in this case, the pattern is - any string starting with #). Another example is to extract all the phone numbers from a large piece of textual data.
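
The hashtag example can be sketched as follows. The tweet text is made up for illustration, and the pattern assumes that a hashtag is a ‘#’ followed by one or more letters, digits or underscores.

```python
import re

# Hypothetical tweet, invented for this example
tweet = "Loving the weather in #mumbai today, unlike #bangalore! #upgrad"

# '#' followed by one or more word characters (letters, digits, underscore)
hashtags = re.findall(r"#\w+", tweet)
print(hashtags)
```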

In short, if there’s a pattern in any string, you can easily extract, substitute and do all kinds of other string manipulation operations using regular expressions.

Learning regular expressions basically means learning how to identify and define these patterns.

Regular expressions are a language in their own right, with their own syntax and their own compilers. Almost all popular programming languages, including Python, support working with regexes.

For example, suppose you have a dataset of customer reviews about your restaurant. Say you want to extract the emojis from the reviews because they are a good predictor of the sentiment of the review.

As another example, artificial assistants such as Siri and Google Now use information retrieval to give you better results. When you ask them a query or ask them to search for something interesting on the screen, they look for common patterns such as emails, phone numbers, place names, dates and times, and so on. With these patterns identified, the assistant can automatically make a booking or prompt you to call the restaurant to make one.

Regular expressions are a very powerful tool in text processing. They will help you clean and handle your text in a much better way.

In [2]:
#Let's import the regular expression library in python.
import re

In [41]:
#Let's do a quick search using a pattern.
re.search('Ravi', 'Ravi is an exceptional student!')

In [50]:
# print output of re.search()
match = re.search('Ravi', 'Ravi is an exceptional student!')
print(match.group())

Ravi


In [5]:
#Let's define a function to match regular expression patterns
def find_pattern(text, patterns):
    # Search once and reuse the match object instead of searching twice
    match = re.search(patterns, text)
    return match if match else 'Not Found!'

In [6]:
### I. Quantifiers
# '*': Zero or more 
print(find_pattern("ac", "ab*"))
print(find_pattern("abc", "ab*"))
print(find_pattern("abbc", "ab*"))

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 2), match='ab'>
<re.Match object; span=(0, 3), match='abb'>


In [7]:
# '?': Zero or one (tells whether a pattern is absent or present)
print(find_pattern("ac", "ab?"))
print(find_pattern("abc", "ab?"))
print(find_pattern("abbc", "ab?"))

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 2), match='ab'>
<re.Match object; span=(0, 2), match='ab'>


In [8]:
# '+': One or more
print(find_pattern("ac", "ab+"))
print(find_pattern("abc", "ab+"))
print(find_pattern("abbc", "ab+"))

Not Found!
<re.Match object; span=(0, 2), match='ab'>
<re.Match object; span=(0, 3), match='abb'>


In [9]:
# {n}: Matches if a character is present exactly n number of times
print(find_pattern("abbc", "ab{2}"))

<re.Match object; span=(0, 3), match='abb'>


In [10]:
# {m,n}: Matches if a character is present from m to n number of times
print(find_pattern("aabbbbbbc", "ab{3,5}"))   # matches if 'a' is followed by 3 to 5 'b's
print(find_pattern("aabbbbbbc", "ab{7,10}"))  # matches if 'a' is followed by 7 to 10 'b's
print(find_pattern("aabbbbbbc", "ab{,10}"))   # matches if 'a' is followed by at most 10 'b's (including none)
print(find_pattern("aabbbbbbc", "ab{10,}"))   # matches if 'a' is followed by at least 10 'b's

<re.Match object; span=(1, 7), match='abbbbb'>
Not Found!
<re.Match object; span=(0, 1), match='a'>
Not Found!


In [11]:
### II. Anchors

# '^': Indicates start of a string
# '$': Indicates end of string

print(find_pattern("James", "^J"))   # return true if string starts with 'J' 
print(find_pattern("Pramod", "^J"))  # return true if string starts with 'J' 
print(find_pattern("India", "a$"))   # return true if string ends with 'a'
print(find_pattern("Japan", "a$"))   # return true if string ends with 'a'

<re.Match object; span=(0, 1), match='J'>
Not Found!
<re.Match object; span=(4, 5), match='a'>
Not Found!


In [12]:
### III. Wildcard

# '.': Matches any single character (except a newline, unless re.DOTALL is set)
print(find_pattern("a", "."))
print(find_pattern("#", "."))

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='#'>


In [13]:
### IV. Character sets

# Now we will look at '[' and ']'.
# They're used for specifying a character class, which is a set of characters that you wish to match.
# Characters can be listed individually as follows
print(find_pattern("a", "[abc]"))

# Or a range of characters can be indicated by giving two characters and separating them by a '-'.
print(find_pattern("c", "[a-c]"))  # same as above

<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 1), match='c'>


In [14]:
# '^' is used inside a character set to indicate the complementary set
print(find_pattern("a", "[^abc]"))  # matches any single character other than a, b or c

Not Found!


### Character sets
| Pattern  | Matches                                                                                    |
|----------|--------------------------------------------------------------------------------------------|
| [abc]    | Matches either an a, b or c character                                                      |
| [abcABC] | Matches either an a, A, b, B, c or C character                                             |
| [a-z]    | Matches any characters between a and z, including a and z                                  |
| [A-Z]    | Matches any characters between A and Z, including A and Z                                  |
| [a-zA-Z] | Matches any characters between a and z, including a and z ignoring cases of the characters |
| [0-9]    | Matches any character which is a number between 0 and 9                                    |

### Meta sequences

| Pattern  | Equivalent to (with the re.ASCII flag) |
|----------|----------------------------------------|
| \s       | [ \t\n\r\f\v]    |
| \S       | [^ \t\n\r\f\v]   |
| \d       | [0-9]            |
| \D       | [^0-9]           |
| \w       | [a-zA-Z0-9_]     |
| \W       | [^a-zA-Z0-9_]    |
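
A short sketch of the meta sequences in use is given below. The sample strings are invented for illustration. Note that in Python 3, \d, \w and \s also match non-ASCII digits, letters and whitespace in str patterns unless the re.ASCII flag is passed.

```python
import re

# Hypothetical sample text containing a tab character
text = "Order 66 shipped on 2017-10-28\tto Ravi_K"

digits = re.findall(r"\d+", text)       # runs of digits
words = re.findall(r"\w+", text)        # runs of letters, digits and underscores
fields = re.split(r"\s+", text)         # split on any run of whitespace
non_digits = re.findall(r"\D+", "a1b22c")  # runs of non-digit characters

print(digits)
print(words)
print(fields)
print(non_digits)
```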

In [15]:
### Greedy vs non-greedy regex
print(find_pattern("aabbbbbb", "ab{3,5}")) # return if a is followed by b 3-5 times GREEDY

<re.Match object; span=(1, 7), match='abbbbb'>


In [16]:
print(find_pattern("aabbbbbb", "ab{3,5}?")) # return if a is followed by b 3-5 times NON-GREEDY

<re.Match object; span=(1, 5), match='abbb'>


In [17]:
# Greedy match on HTML code: '.*' consumes as much as possible, so the match runs to the last '>'
print(re.search("<.*>","<HTML><TITLE>My Page</TITLE></HTML>"))

<re.Match object; span=(0, 35), match='<HTML><TITLE>My Page</TITLE></HTML>'>


In [18]:
# Non-greedy match on HTML code: '.*?' consumes as little as possible, so the match stops at the first '>'
print(re.search("<.*?>","<HTML><TITLE>My Page</TITLE></HTML>"))

<re.Match object; span=(0, 6), match='<HTML>'>


### The five most important re functions

These are the functions you will use most of the time:

- match(): Determine if the RE matches at the beginning of the string
- search(): Scan through a string, looking for any location where this RE matches
- findall(): Find all the substrings where the RE matches, and return them as a list
- finditer(): Find all the substrings where the RE matches, and return them as an iterator
- sub(): Find all the substrings where the RE matches, and substitute them with the given string

In [19]:
# This function uses re.match(); let's see how it differs from re.search()
def match_pattern(text, patterns):
    # Match once and reuse the match object instead of matching twice
    match = re.match(patterns, text)
    return match if match else 'Not found!'

In [20]:
print(find_pattern("abbc", "b+"))

<re.Match object; span=(1, 3), match='bb'>


In [21]:
print(match_pattern("abbc", "b+"))

Not found!


In [22]:
## Example usage of the sub() function. Replace Road with Rd.

street = '21 Ramakrishna Road'
print(re.sub('Road', 'Rd', street))

21 Ramakrishna Rd


In [23]:
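# A raw string avoids the invalid escape sequence warning for '\w'; every word starting with 'R' is replaced
print(re.sub(r'R\w+', 'Rd', street))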

21 Rd Rd


In [24]:
## Example usage of finditer(). Find all occurrences of the word 'festival' in the given sentence

text = 'Diwali is a festival of lights, Holi is a festival of colors!'
pattern = 'festival'
for match in re.finditer(pattern, text):
    print('START -', match.start(), 'END -', match.end())

START - 12 END - 20
START - 42 END - 50


In [25]:
# Example usage of findall(). In the given URL find all dates
# Note: only dates enclosed in '/' on both sides match, so the trailing '2017/05/12' is not returned
url = "http://www.telegraph.co.uk/formula-1/2017/10/28/mexican-grand-prix-2017-time-does-start-tv-channel-odds-lewisl/2017/05/12"
date_regex = r'/(\d{4})/(\d{1,2})/(\d{1,2})/'
print(re.findall(date_regex, url))

[('2017', '10', '28')]


In [26]:
## Exploring Groups
m1 = re.search(date_regex, url)
print(m1.group())  ## print the matched group

/2017/10/28/


In [27]:
print(m1.group(1)) # - Print first group

2017


In [28]:
print(m1.group(2)) # - Print second group

10


In [29]:
print(m1.group(3)) # - Print third group

28


In [30]:
print(m1.group(0)) # - Print zero or the default group

/2017/10/28/


### Regular Expressions: Grouping

Sometimes you need to extract sub-patterns out of a larger pattern. This can be done by using grouping. Suppose you have textual data with dates in it and you want to extract only the year from the dates. You can use a regular expression pattern with grouping to match dates, and then extract component elements such as the day, month or year from the date.

Grouping is achieved using parentheses. Let’s understand grouping using an example.

Let’s say the source string is: “Kartik’s birthday is on 15/03/1995”. To extract the date from this string you can use the pattern - “\d{1,2}/\d{1,2}/\d{4}”.

Now to extract the year, you can put parentheses around the year part of the pattern. The pattern is: “\d{1,2}/\d{1,2}/(\d{4})”. Note that the pattern is not anchored with ‘^’ and ‘$’, because the date appears in the middle of the source string rather than spanning the whole of it.
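
A minimal sketch of this grouping example is shown below. As an extension that is not in the text above, the day and month are wrapped in parentheses as well, so that all three components can be read from the match.

```python
import re

source = "Kartik's birthday is on 15/03/1995"

# Three groups: day, month and year
pattern = r"(\d{1,2})/(\d{1,2})/(\d{4})"

match = re.search(pattern, source)
if match:
    print(match.group(0))          # the whole matched date
    day, month, year = match.groups()
    print(year)                    # only the year component
```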

In [31]:
# items contains all the files and folders of current directory
items = ['photos', 'documents', 'videos', 'image001.jpg','image002.jpg','image005.jpg', 'wallpaper.jpg',
         'flower.jpg', 'earth.jpg', 'monkey.jpg', 'image002.png']

# create an empty list to store resultant files
images = []

# regex pattern to extract files that end with '.jpg' (raw string so that '\.' is passed to the regex engine as is)
pattern = r".*\.jpg$"

for item in items:
    if re.search(pattern, item):
        images.append(item)

# print result
print(images)

['image001.jpg', 'image002.jpg', 'image005.jpg', 'wallpaper.jpg', 'flower.jpg', 'earth.jpg', 'monkey.jpg']


In [32]:
# items contains all the files and folders of current directory
items = ['photos', 'documents', 'videos', 'image001.jpg','image002.jpg','image005.jpg', 'wallpaper.jpg',
         'flower.jpg', 'earth.jpg', 'monkey.jpg', 'image002.png']

# create an empty list to store resultant files
images = []

# regex pattern to extract files that start with 'image' and end with '.jpg'
# ('^' anchors the match to the start of the file name; raw string keeps '\.' intact)
pattern = r"^image.*\.jpg$"

for item in items:
    if re.search(pattern, item):
        images.append(item)

# print result
print(images)

['image001.jpg', 'image002.jpg', 'image005.jpg']


### Q1.
Write a regular expression to match all the files that have either .exe, .xml or .jar extensions. A valid file name can contain any letter, digit or underscore, followed by the extension.

In [33]:
files = ['employees.xml', 'calculator.jar', 'nfsmw.exe', 'bkgrnd001.jpg', 'sales_report.ppt']

# \w matches letters, digits and underscores, as the question requires
pattern = r"^\w+\.(xml|jar|exe)$"

result = []

for file in files:
    match = re.search(pattern, file)
    if match is not None:
        result.append(file)

# print result - result should only contain the items that match the pattern
print(result)

['employees.xml', 'calculator.jar', 'nfsmw.exe']


### Q2
Write a regular expression to match all the addresses that have Koramangala embedded in them.

Strings that should match:
* 466, 5th block, Koramangala, Bangalore
* 4th BLOCK, KORAMANGALA - 560034

Strings that shouldn't match:
* 999, St. Marks Road, Bangalore

In [35]:
addresses = ['466, 5th block, Koramangala, Bangalore', '4th BLOCK, KORAMANGALA - 560034', '999, St. Marks Road, Bangalore']

# \w already covers digits, so [\w\s,-] is enough; re.I below makes the match case-insensitive
pattern = r"^[\w\s,-]*koramangala[\w\s,-]*$"

result = []

for address in addresses:
    match = re.search(pattern, address, re.I)
    if match is not None:
        result.append(address)

# print result - result should only contain the items that match the pattern
print(result)

['466, 5th block, Koramangala, Bangalore', '4th BLOCK, KORAMANGALA - 560034']


### Q3. 
Write a regular expression that matches either integers or floats with up to 2 decimal places.

Strings that should match: 
* 2
* 2.3
* 4.56
* .61

Strings that shouldn't match:
* 4.567
* 75.8792
* abc

In [36]:
numbers = ['2', '2.3', '4.56', '.61', '4.567', '75.8792', 'abc']

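# Digits with an optional 1-2 digit fraction, or a bare fraction such as '.61'.
# Unlike "^[0-9]*(\.[0-9]{,2})?$", this rejects the empty string and strings such as '2.'
pattern = r"^([0-9]+(\.[0-9]{1,2})?|\.[0-9]{1,2})$"

result = []

for number in numbers:
    match = re.search(pattern, number)
    if match is not None:
        result.append(number)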

# print result - result should only contain the items that match the pattern
print(result)

['2', '2.3', '4.56', '.61']


### Q4. 
Write a regular expression to match the model names of smartphones which follow the following pattern: 

mobile company name followed by underscore followed by model name followed by underscore followed by model number

Strings that should match:
* apple_iphone_6
* samsung_note_4
* google_pixel_2

Strings that shouldn’t match:
* apple_6
* iphone_6
* google_pixel_


In [37]:
phones = ['apple_iphone_6', 'samsung_note_4', 'google_pixel_2', 'apple_6', 'iphone_6', 'google_pixel_']

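# Assumption: company and model names are alphabetic, and the model number is one or more digits.
# This is stricter than "^.*_.*_\d$", which would also accept empty names such as '__6'.
pattern = r"^[a-zA-Z]+_[a-zA-Z]+_\d+$"

result = []

for phone in phones:
    match = re.search(pattern, phone)
    if match is not None:
        result.append(phone)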

# print result - result should only contain the items that match the pattern
print(result)

['apple_iphone_6', 'samsung_note_4', 'google_pixel_2']


### Q5. 
Write a regular expression that can be used to match the emails present in a database. 

The pattern of a valid email address is defined as follows:
The '@' character can be preceded by alphanumeric characters, period characters or underscore characters. The length of the part that precedes the '@' character should be between 4 and 20 characters.

The '@' character should be followed by a domain name (e.g. gmail.com). The domain name has three parts - a prefix (e.g. 'gmail'), the period character and a suffix (e.g. 'com'). The prefix can have a length between 3 and 15 characters, followed by a period character, followed by one of these suffixes - 'com', 'in' or 'org'.


Emails that should match:
* random.guy123@gmail.com
* mr_x_in_bombay@gov.in

Emails that shouldn’t match:
* 1@ued.org
* @gmail.com
* abc!@yahoo.in
* sam_12@gov.us
* neeraj@

In [51]:
emails = ['random.guy123@gmail.com', 'mr_x_in_bombay@gov.in', '1@ued.org',
          '@gmail.com', 'abc!@yahoo.in', 'sam_12@gov.us', 'neeraj@']

# Raw string keeps '\.' intact; re.I below makes the match case-insensitive
pattern = r"^[a-z_.0-9]{4,20}@[a-z]{3,15}\.(com|in|org)$"

result = []

for email in emails:
    match = re.search(pattern, email, re.I)
    if match is not None:
        result.append(email)

# print result - result should only contain the items that match the pattern
print(result)

['random.guy123@gmail.com', 'mr_x_in_bombay@gov.in']


# Basic Lexical Processing

In this session you will learn basic lexical processing. You will get to know the various preprocessing steps you need to apply before you can do any kind of text analytics, such as applying machine learning to text, building language models, building chatbots, building sentiment analysis systems and so on. These steps are used in almost all applications that work with textual data. We will also build a spam-ham detector side by side on a very unclean corpus of text. In NLP jargon, a corpus is simply a collection of textual data.

Now, you have already built a spam detector while learning about the Naive Bayes classifier. Here, you will learn all the preprocessing steps that one needs to do before using a machine learning algorithm on the spam messages dataset. Note that the preprocessing steps taught here are not limited to building a spam detector.


Specifically, you will learn:

- How to preprocess text using techniques such as
    - Tokenisation
    - Stop words removal
    - Stemming
    - Lemmatization

- How to build a spam detector using one of the following models:
    - Bag-of-words model
    - TF-IDF model

#### Word Frequencies and Stop Words

While working with any kind of data, the first step that you usually do is to explore and understand it better. In order to explore text data, you need to do some basic preprocessing steps. In the next few segments, you will learn some basic preprocessing and exploratory steps applicable to almost all types of textual data.

Now, a text is made of characters, words, sentences and paragraphs. The most basic statistical analysis you can do is to look at the word frequency distribution, i.e. visualising the word frequencies of a given text corpus.

It turns out that there is a common pattern you see when you plot word frequencies in a fairly large corpus of text, such as a corpus of news articles, user reviews, Wikipedia articles, etc. In the following lecture, professor Srinath will demonstrate some interesting insights from word frequency distributions. You will also learn what stopwords are and why they are less relevant than other words.

Zipf's law (discovered by the linguist-statistician George Zipf) states that the frequency of a word is inversely proportional to the rank of the word, where rank 1 is given to the most frequent word, 2 to the second most frequent and so on. Such a relationship is an example of a power law distribution.

Zipf's law helps us form the basic intuition for stopwords - these are the words having the highest frequencies (or lowest ranks) in the text, and are typically of limited 'importance'.
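
The rank and frequency relationship can be explored with a few lines of Python. This is only a sketch on an invented toy sentence, which is far too small to follow Zipf's law closely; on a large corpus, the product of rank and frequency would be expected to stay roughly constant.

```python
from collections import Counter

# Invented toy text, used only to show the mechanics of ranking words
text = ("the cat sat on the mat and the dog sat on the log "
        "and the cat saw the dog")

counts = Counter(text.split())

# Rank 1 is the most frequent word, rank 2 the second most frequent, and so on
for rank, (word, freq) in enumerate(counts.most_common(5), start=1):
    print(rank, word, freq, "rank x frequency =", rank * freq)
```

Even in this tiny example, the top-ranked word is ‘the’, a typical stop word.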

Broadly, there are three kinds of words present in any text corpus:
- Highly frequent words, called stop words, such as ‘is’, ‘an’, ‘the’, etc.
- Significant words, which are typically more important to understand the text
- Rarely occurring words, which are again less important than significant words


Generally speaking, stopwords are removed from the text for two reasons:
- They provide little useful information, especially in applications such as a spam detector or a search engine. Therefore, you’re going to remove stopwords from the spam dataset.
- Since the frequency of stopwords is very high, removing them results in a much smaller dataset. The reduced size results in faster computation on text data. There’s also the advantage of having fewer features to deal with once stopwords are removed.

However, there are exceptions when these words should not be removed. In the next module, you’ll learn concepts such as POS (parts of speech) tagging and parsing where stopwords are preserved because they provide meaningful (grammatical) information in those applications. Generally, stopwords are removed unless they prove to be very helpful in your application or analysis.

On the other hand, you’re not going to remove the rarely occurring words because they might provide useful information in spam detection. Also, removing them provides no added efficiency in computation since their frequency is so low.

## Plotting word frequencies

In [52]:
import requests
from nltk import FreqDist
from nltk.corpus import stopwords
import seaborn as sns
%matplotlib inline

Download text of 'Alice in Wonderland' ebook from https://www.gutenberg.org/

In [81]:
url = "https://www.gutenberg.org/files/16/16-0.txt"
# Keep TLS certificate verification enabled (the default) when downloading
alice = requests.get(url).text



Define a function to plot word frequencies

In [82]:
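def plot_word_frequency(words, top_n=10):
    word_freq = FreqDist(words)
    # Compute the most common words once and unpack them into labels and counts
    most_common = word_freq.most_common(top_n)
    labels = [word for word, _ in most_common]
    counts = [count for _, count in most_common]
    # Recent seaborn versions require x and y to be passed as keyword arguments
    plot = sns.barplot(x=labels, y=counts)
    return plot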

Plot the word frequencies of the downloaded Gutenberg text

In [83]:
# alice is already a string (the .text of the response), so split it directly
alice_words = alice.split()
plot_word_frequency(alice_words, 15)

## Stopwords

Import stopwords from nltk

In [75]:
from nltk.corpus import stopwords

Look at the list of stopwords

In [56]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

Let's remove stopwords from the following piece of text.

In [58]:
sample_text = "the great aim of education is not knowledge but action"

Break text into words

In [60]:
sample_words = sample_text.split()
print(sample_words)

['the', 'great', 'aim', 'of', 'education', 'is', 'not', 'knowledge', 'but', 'action']


Remove stopwords

In [62]:
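# Build the stop word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
sample_words = [word for word in sample_words if word not in stop_words]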
print(sample_words)

['great', 'aim', 'education', 'knowledge', 'action']


Join words back to sentence

In [63]:
sample_text = " ".join(sample_words)
print(sample_text)

great aim education knowledge action


## Removing stopwords in the genesis corpus

Some other things that can be done
* Need to change tokens to lower case
* Need to get rid of punctuations

All the preprocessing steps will be covered while creating the classifier
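
The two steps listed above can be sketched with the standard library alone. This is a minimal illustration rather than the pipeline used later for the classifier; the sample sentence is reused from the finditer() example, and string.punctuation covers ASCII punctuation only.

```python
import string

raw = "Diwali is a festival of lights, Holi is a festival of colors!"

# Step 1: change the text to lower case
lowered = raw.lower()

# Step 2: remove punctuation by deleting every character in string.punctuation
no_punct = lowered.translate(str.maketrans("", "", string.punctuation))

tokens = no_punct.split()
print(tokens)
```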

### Tokenisation

You already know that you’re going to build a spam detector by the end of this module. In the spam detector application, you’re going to use word tokenisation, i.e. break the text into different words, so that each word can be used as a feature to detect whether the given message is spam or not.
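
As a rough sketch of word tokenisation, the snippet below compares a plain whitespace split with a simple regex-based tokeniser. The message is invented for illustration, and the regex (lower-case letters, digits and apostrophes) is an assumed, simplistic definition of a word; dedicated tokenisers, such as those in NLTK, handle many more cases.

```python
import re

# Hypothetical spam-like message, invented for this example
message = "Congratulations! You've won a FREE prize. Call 0800-123-456 now!!!"

# Splitting on whitespace leaves punctuation attached to the words
whitespace_tokens = message.split()

# Lower-casing and keeping only runs of letters, digits and apostrophes
regex_tokens = re.findall(r"[a-z0-9']+", message.lower())

print(whitespace_tokens)
print(regex_tokens)
```

With the regex tokeniser, ‘prize.’ and ‘prize’ become the same feature, which is usually what a word-based classifier needs.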

Now, let’s take a look at the spam messages dataset to get a better understanding of how to approach the problem of building a spam detector.