In [1]:
sample_text = "'I am a student from the University of Alabama. I was born in Ontario, Canada and I am a huge fan of the United States. I am going to get a degree in Philosophy to improve my chances of becoming a Philosophy professor. I have been working towards this goal for 4 years. I am currently enrolled in a PhD program. It is very difficult, but I am confident that it will be a good decision'"

In [2]:
sample_text

"'I am a student from the University of Alabama. I was born in Ontario, Canada and I am a huge fan of the United States. I am going to get a degree in Philosophy to improve my chances of becoming a Philosophy professor. I have been working towards this goal for 4 years. I am currently enrolled in a PhD program. It is very difficult, but I am confident that it will be a good decision'"

* Notice that Python treats a body of text, however it is punctuated, as a single string object.
* To work with individual words, we therefore need to split this single string so that each word becomes its own string object.
* This brings us to word tokenization: the process of splitting a single string, usually a body of text of arbitrary length, into individual tokens that represent words or characters.
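
To make the idea concrete, here is a minimal sketch that uses only the standard library. The `naive_tokenize` helper and its regular expression are illustrative assumptions, not the way NLTK tokenizes; the sketch only shows one string object turning into a list of word and punctuation tokens.

```python
import re

text = "I was born in Ontario, Canada."

# Before tokenization, the whole sentence is one string object.
print(type(text).__name__)

def naive_tokenize(s):
    """Hypothetical helper: return runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", s)

tokens = naive_tokenize(text)
print(tokens)
# ['I', 'was', 'born', 'in', 'Ontario', ',', 'Canada', '.']
```

A rule this simple mishandles contractions, abbreviations, and many other cases, which is one reason to rely on a library tokenizer in practice.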

#### Natural Language Toolkit (NLTK) module


NLTK provides basic NLP functionality as well as pretrained models for a variety of tasks.
My goal is for you to train your own models, so we will not work with any of the pretrained models in NLTK.
You should still read through the NLTK documentation to become familiar with the functions and algorithms that speed up text preprocessing.
Returning to our example, let's tokenize the sample text with the following code:


In [3]:
from nltk.tokenize import word_tokenize, sent_tokenize

#### sent_tokenize()

In [4]:
sent_tokens = sent_tokenize(sample_text)
print(sent_tokens)

["'I am a student from the University of Alabama.", 'I was born in Ontario, Canada and I am a huge fan of the United States.', 'I am going to get a degree in Philosophy to improve my chances of becoming a Philosophy professor.', 'I have been working towards this goal for 4 years.', 'I am currently enrolled in a PhD program.', "It is very difficult, but I am confident that it will be a good decision'"]


#### word_tokenize() 

In [5]:
word_tokens = word_tokenize(sample_text)
print(word_tokens)

["'", 'I', 'am', 'a', 'student', 'from', 'the', 'University', 'of', 'Alabama', '.', 'I', 'was', 'born', 'in', 'Ontario', ',', 'Canada', 'and', 'I', 'am', 'a', 'huge', 'fan', 'of', 'the', 'United', 'States', '.', 'I', 'am', 'going', 'to', 'get', 'a', 'degree', 'in', 'Philosophy', 'to', 'improve', 'my', 'chances', 'of', 'becoming', 'a', 'Philosophy', 'professor', '.', 'I', 'have', 'been', 'working', 'towards', 'this', 'goal', 'for', '4', 'years', '.', 'I', 'am', 'currently', 'enrolled', 'in', 'a', 'PhD', 'program', '.', 'It', 'is', 'very', 'difficult', ',', 'but', 'I', 'am', 'confident', 'that', 'it', 'will', 'be', 'a', 'good', 'decision', "'"]


#### Tokenize Words

We split a sentence or paragraph into a list of words. Each word in the list is called a token.

Now we have individual tokens that we can preprocess!

From this step forward, we can remove the text that we do not want to extract features from.

Typically, the first things to remove are stop words, which are usually defined as the very common words of a given language. The stop word lists that we build ourselves or take from software packages consist mostly of function words, which express a grammatical relationship rather than carrying meaning of their own.

Examples of function words include "the", "and", "for", and "of".
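
As a quick illustration, the sketch below drops function words from a short token list. The `FUNCTION_WORDS` set and the example tokens are hand-written assumptions made for this sketch; a real stop word list is far larger.

```python
# A tiny hand-written stop list (an assumption for illustration only).
FUNCTION_WORDS = {"the", "and", "for", "of", "a", "in", "to"}

tokens = ["A", "degree", "in", "Philosophy", "for", "the", "love", "of", "wisdom"]

# Lowercase each token before the lookup so that "A" matches "a".
content_tokens = [t for t in tokens if t.lower() not in FUNCTION_WORDS]
print(content_tokens)
# ['degree', 'Philosophy', 'love', 'wisdom']
```

The tokens that remain are the ones most likely to carry information about the topic of the text.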

#### English Stopwords

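import nltk
# Resources used in this notebook (newer NLTK releases may also need 'punkt_tab')
for resource in ('punkt', 'stopwords', 'reuters'):
    nltk.download(resource)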

In [6]:
# Load the stop word corpus
from nltk.corpus import stopwords

#### Check the list of stop words

In [7]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [8]:
print(stopwords.fileids())

['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']


In [9]:
print(stopwords.words('nepali'))

['छ', 'र', 'पनि', 'छन्', 'लागि', 'भएको', 'गरेको', 'भने', 'गर्न', 'गर्ने', 'हो', 'तथा', 'यो', 'रहेको', 'उनले', 'थियो', 'हुने', 'गरेका', 'थिए', 'गर्दै', 'तर', 'नै', 'को', 'मा', 'हुन्', 'भन्ने', 'हुन', 'गरी', 'त', 'हुन्छ', 'अब', 'के', 'रहेका', 'गरेर', 'छैन', 'दिए', 'भए', 'यस', 'ले', 'गर्नु', 'औं', 'सो', 'त्यो', 'कि', 'जुन', 'यी', 'का', 'गरि', 'ती', 'न', 'छु', 'छौं', 'लाई', 'नि', 'उप', 'अक्सर', 'आदि', 'कसरी', 'क्रमशः', 'चाले', 'अगाडी', 'अझै', 'अनुसार', 'अन्तर्गत', 'अन्य', 'अन्यत्र', 'अन्यथा', 'अरु', 'अरुलाई', 'अर्को', 'अर्थात', 'अर्थात्', 'अलग', 'आए', 'आजको', 'ओठ', 'आत्म', 'आफू', 'आफूलाई', 'आफ्नै', 'आफ्नो', 'आयो', 'उदाहरण', 'उनको', 'उहालाई', 'एउटै', 'एक', 'एकदम', 'कतै', 'कम से कम', 'कसै', 'कसैले', 'कहाँबाट', 'कहिलेकाहीं', 'का', 'किन', 'किनभने', 'कुनै', 'कुरा', 'कृपया', 'केही', 'कोही', 'गए', 'गरौं', 'गर्छ', 'गर्छु', 'गर्नुपर्छ', 'गयौ', 'गैर', 'चार', 'चाहनुहुन्छ', 'चाहन्छु', 'चाहिए', 'छू', 'जताततै', 'जब', 'जबकि', 'जसको', 'जसबाट', 'जसमा', 'जसलाई', 'जसले', 'जस्तै', 'जस्तो', 'जस्तोसुकै', 'जहाँ'

In [22]:
stopWords = set(stopwords.words('english'))
print(stopWords, '\n')
print('len(stopWords): ', len(stopWords))

{'herself', 'or', 'you', 'couldn', 'because', 'then', 'myself', 'on', 'she', 'during', 'a', 'such', 'needn', 'shan', 'but', 'these', 'who', 'will', 'yourselves', 'your', 'while', "hasn't", 'an', "weren't", 'not', 'does', 'some', 'what', 'be', 'until', 'doing', 's', 'both', 'own', "isn't", 'wouldn', 'after', "couldn't", 'very', 'himself', 'were', 't', 'itself', 'too', "don't", 'have', 'don', 'and', "shan't", 'there', 'm', 'now', 'can', "hadn't", 'hasn', 'ma', "didn't", 'isn', 'below', 'had', 'll', "doesn't", 'haven', 'his', 'by', 'between', 'hadn', "mightn't", 'with', 'in', 'y', 'whom', 'my', 'once', "haven't", 'themselves', 'before', 'doesn', "that'll", 'further', 'was', 'yourself', 'did', 'having', 'about', 'him', 'over', 'shouldn', "you've", 'same', 'off', 'where', 'so', 'those', 'their', "wouldn't", 'any', "shouldn't", 'them', 'theirs', "she's", "aren't", 'when', 'i', 'why', 'our', 'above', 'is', 'from', 'won', 'at', 'ours', 'are', "you're", 'ain', 'for', 'mustn', 'each', 'into', 'm
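
Converting the list to a set removes duplicates and makes membership tests fast, but the lookup stays case-sensitive. The sketch below uses a small hypothetical list rather than NLTK's data to show both effects.

```python
# Hypothetical stop list containing a duplicate entry.
stop_list = ["the", "of", "and", "the"]
stop_set = set(stop_list)

# The set keeps each word once.
print(len(stop_list), len(stop_set))   # 4 3

# Membership is an exact, case-sensitive match.
print("of" in stop_set)                # True
print("Of" in stop_set)                # False
print("Of".lower() in stop_set)        # True
```

Since the stop words are stored in lowercase, tokens such as "I" or "It" only match if they are lowercased before the lookup.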

In [11]:
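# Keep only the tokens that are not stop words. Lowercase each token first,
# because the English stop word list is lowercase (so 'I' matches 'i').
wordsFiltered = [w for w in word_tokens if w.lower() not in stopWords]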

Exercise 4

Let's define a function to compute what fraction of words in a text are not in the stopwords list:

In [12]:
from nltk.corpus import reuters

In [20]:
def content_fraction(word_lst):
    # Use a set so that each membership test does not scan the whole list
    stop_words = set(stopwords.words('english'))
    content = [w for w in word_lst if w.lower() not in stop_words]
    return len(content) / len(word_lst)

In [23]:
print(reuters.words(), '\n', len(reuters.words()))

['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', ...] 
 1720901


In [21]:
content_fraction(reuters.words())

0.735240435097661
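
When reading this number, keep in mind that `content_fraction` counts every token that is not a stop word as content, including punctuation and numbers. The sketch below is a hedged illustration of that effect: the `content_fraction_demo` function, the `alpha_only` flag, the token list, and the small stop list are all assumptions made for this example, not the Reuters data.

```python
# Tiny hand-written stop list (illustrative assumption).
STOP_WORDS = {"the", "of", "in", "a", "and", "to"}

def content_fraction_demo(tokens, stop_words, alpha_only=False):
    """Fraction of tokens that are not stop words.

    With alpha_only=True, punctuation and numbers are dropped before counting.
    """
    if alpha_only:
        tokens = [t for t in tokens if t.isalpha()]
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    content = [t for t in tokens if t.lower() not in stop_words]
    return len(content) / len(tokens)

tokens = ["The", "price", "of", "oil", "rose", "4", "%", "in", "March", "."]

print(content_fraction_demo(tokens, STOP_WORDS))                   # 7 of 10 tokens
print(content_fraction_demo(tokens, STOP_WORDS, alpha_only=True))  # 4 of 7 tokens
```

In this toy example the fraction drops once the non-alphabetic tokens are excluded, so the figure for a real corpus depends on how punctuation and numbers are handled.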