# Text Analytics with NLTK

### What is NLTK?
The Natural Language Toolkit (NLTK) is a Python package for natural language processing.
NLTK comes with many corpora, toy grammars, trained models, etc. https://www.nltk.org/data.html

Let's start by downloading the NLTK data packages used below (`punkt` and `stopwords`). See https://www.nltk.org/data.html

In [1]:
import nltk, re, json, io #https://pypi.org/project/nltk/ --- pip install nltk 
from collections import Counter
import pandas as pd

In [2]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Error loading punkt: <urlopen error [Errno 61] Connection
[nltk_data]     refused>
[nltk_data] Error loading stopwords: <urlopen error [Errno 61]
[nltk_data]     Connection refused>


False

## Tokenization
Read the documentation: https://www.nltk.org/api/nltk.tokenize.html

* sentence tokenization
* word tokenization

In [3]:
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Netanyahu's visit was cut short by reports late Sunday that a rocket was fired from Gaza into central Israel, wounding at least seven people. Following criticism from political opponents over what they consider the prime minister's unclear stance toward the militant political group, Israel responded with a series of strikes into Gaza against Hamas, which largely governs the contested strip. President Donald Trump tacitly endorsed the strike following his meetings with Netanyahu, calling the Hamas attack \"despicable.\""

# tokenize text into words
words = word_tokenize(text)
words

['Netanyahu',
 "'s",
 'visit',
 'was',
 'cut',
 'short',
 'by',
 'reports',
 'late',
 'Sunday',
 'that',
 'a',
 'rocket',
 'was',
 'fired',
 'from',
 'Gaza',
 'into',
 'central',
 'Israel',
 ',',
 'wounding',
 'at',
 'least',
 'seven',
 'people',
 '.',
 'Following',
 'criticism',
 'from',
 'political',
 'opponents',
 'over',
 'what',
 'they',
 'consider',
 'the',
 'prime',
 'minister',
 "'s",
 'unclear',
 'stance',
 'toward',
 'the',
 'militant',
 'political',
 'group',
 ',',
 'Israel',
 'responded',
 'with',
 'a',
 'series',
 'of',
 'strikes',
 'into',
 'Gaza',
 'against',
 'Hamas',
 ',',
 'which',
 'largely',
 'governs',
 'the',
 'contested',
 'strip',
 '.',
 'President',
 'Donald',
 'Trump',
 'tacitly',
 'endorsed',
 'the',
 'strike',
 'following',
 'his',
 'meetings',
 'with',
 'Netanyahu',
 ',',
 'calling',
 'the',
 'Hamas',
 'attack',
 '``',
 'despicable',
 '.',
 "''"]

**sentence tokenizer**

https://www.nltk.org/_modules/nltk/tokenize/punkt.html

Punkt Sentence Tokenizer

The NLTK data package includes a pre-trained Punkt tokenizer for English. 
This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences.

In [4]:
# split text into sentences
sents = sent_tokenize(text)
sents

["Netanyahu's visit was cut short by reports late Sunday that a rocket was fired from Gaza into central Israel, wounding at least seven people.",
 "Following criticism from political opponents over what they consider the prime minister's unclear stance toward the militant political group, Israel responded with a series of strikes into Gaza against Hamas, which largely governs the contested strip.",
 'President Donald Trump tacitly endorsed the strike following his meetings with Netanyahu, calling the Hamas attack "despicable."']

`sent_tokenize` copes with abbreviations, initials, lowercase sentence starts, and nested quotes or brackets, as the examples below show.

In [5]:
text2 = "Punkt knows that the periods in Mr. Smith and Johann S. Bach do not mark sentence boundaries.  And sometimes sentences can start with non-capitalized words.  i is a good variable name."

sents2 = sent_tokenize(text2)
sents2

['Punkt knows that the periods in Mr. Smith and Johann S. Bach do not mark sentence boundaries.',
 'And sometimes sentences can start with non-capitalized words.',
 'i is a good variable name.']
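For contrast, here is a naive splitter that breaks after every `.`, `!`, or `?` followed by whitespace. It only illustrates the problem Punkt solves and is not how Punkt works internally; the regular expression is an assumption chosen for this sketch.

```python
import re

text2 = ("Punkt knows that the periods in Mr. Smith and Johann S. Bach "
         "do not mark sentence boundaries.  And sometimes sentences can "
         "start with non-capitalized words.  i is a good variable name.")

# Naive rule: a sentence ends at ".", "!" or "?" followed by whitespace.
naive_sents = re.split(r"(?<=[.!?])\s+", text2)

for s in naive_sents:
    print(repr(s))
```

The naive rule yields five pieces because it also cuts after `Mr.` and `S.`, whereas `sent_tokenize` returns the three sentences shown above.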

In [6]:
text3 = '''
... (How does it deal with this parenthesis?)  "It should be part of the
... previous sentence." "(And the same with this one.)" ('And this one!')
... "('(And (this)) '?)" [(and this. )]
... '''


sents3 = sent_tokenize( text3 )
sents3

['\n(How does it deal with this parenthesis?)',
 '"It should be part of the\nprevious sentence."',
 '"(And the same with this one.)"',
 "('And this one!')",
 '"(\'(And (this)) \'?)"',
 '[(and this. )]']

### RegexpTokenizer
Use a regular-expression tokenizer to split text according to your own rules.

In [42]:
from nltk.tokenize import RegexpTokenizer

compare_list = ['https://t.co/9z2J3P33Uc',
               'laugh/cry',
               '😬😭😓🤢🙄😱',
               "world's problems",
               "@datageneral",
                "It's interesting",
               "don't spell my name right",
               'all-nighter']

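# By default the pattern describes the tokens, so tokenizing works like re.findall()
match_tokenizer = RegexpTokenizer(r"[\w']+")  # [\w']+ matches runs of word characters (letters, digits, underscore) and apostrophes
# With gaps=True the pattern describes the separators instead, so tokenizing works like re.split()
match_tokenizer_gap = RegexpTokenizer(r"[\w']+", gaps=True)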

match_tokens = []
match_tokens_gap = []
for sent in compare_list:   
    print(match_tokenizer.tokenize(sent))
    print(match_tokenizer_gap.tokenize(sent))
    match_tokens.append(match_tokenizer.tokenize(sent))
    match_tokens_gap.append(match_tokenizer_gap.tokenize(sent))

['https', 't', 'co', '9z2J3P33Uc']
['://', '.', '/']
['laugh', 'cry']
['/']
[]
['😬😭😓🤢🙄😱']
["world's", 'problems']
[' ']
['datageneral']
['@']
["It's", 'interesting']
[' ']
["don't", 'spell', 'my', 'name', 'right']
[' ', ' ', ' ', ' ']
['all', 'nighter']
['-']
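The two modes can be approximated with the standard library alone. This sketch assumes that the default mode behaves like `re.findall` and that `gaps=True` behaves like `re.split` with empty strings discarded, which is consistent with the output above; it is not NLTK's actual implementation, and the helper names `findall_tokens` and `split_tokens` are hypothetical.

```python
import re

PATTERN = r"[\w']+"

def findall_tokens(text):
    # the pattern describes the tokens themselves
    return re.findall(PATTERN, text)

def split_tokens(text):
    # the pattern describes the separators; drop empty pieces
    return [piece for piece in re.split(PATTERN, text) if piece]

for sent in ["laugh/cry", "world's problems", "@datageneral", "all-nighter"]:
    print(findall_tokens(sent), split_tokens(sent))
```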


#### WhitespaceTokenizer

```WhitespaceTokenizer```
Tokenize a string on whitespace (space, tab, newline). In general, users should use the string split() method instead.


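```WhitespaceTokenizer``` is a subclass of ```RegexpTokenizer``` that is preset to split on the regular expression ```r'\s+'``` (with ```gaps=True```). The pattern ```r'\s*\n\s*\n\s*'``` belongs to ```BlanklineTokenizer```, which splits on blank lines instead.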
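Whitespace tokenization needs no NLTK at all. A minimal sketch using only the standard library, with the same sample string as the next cell:

```python
import re

s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

# str.split() with no argument splits on any run of whitespace
tokens_split = s.split()

# the same result via an explicit regular expression, dropping empty strings
tokens_regex = [t for t in re.split(r"\s+", s) if t]

print(tokens_split == tokens_regex)
print(tokens_split)
```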

In [12]:
from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer
s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
print(s)

print('\nWhitespaceTokenizer:')
WhitespaceTokenizer().tokenize(s)

Good muffins cost $3.88
in New York.  Please buy me
two of them.

Thanks.

WhitespaceTokenizer:


['Good',
 'muffins',
 'cost',
 '$3.88',
 'in',
 'New',
 'York.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them.',
 'Thanks.']

#### WordPunctTokenizer
```WordPunctTokenizer``` Tokenize a text into a sequence of alphabetic and non-alphabetic characters, using the regexp ```\w+|[^\w\s]+```

In [13]:
print('WordPunctTokenizer:')
WordPunctTokenizer().tokenize(s)

WordPunctTokenizer:


['Good',
 'muffins',
 'cost',
 '$',
 '3',
 '.',
 '88',
 'in',
 'New',
 'York',
 '.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them',
 '.',
 'Thanks',
 '.']
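The regular expression quoted for ```WordPunctTokenizer``` can be tried directly with `re.findall`. This is a sketch using the standard library only, not the NLTK class itself:

```python
import re

s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

# \w+      : a run of word characters (letters, digits, underscore)
# [^\w\s]+ : a run of characters that are neither word characters nor whitespace
word_punct_tokens = re.findall(r"\w+|[^\w\s]+", s)
print(word_punct_tokens)
```

Because `.` is not a word character, `$3.88` falls apart into `$`, `3`, `.`, and `88`, which is why this tokenizer is a poor fit for prices and decimals.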

**<span class="mark">TODO</span>**

Try to tokenize tweets with `TweetTokenizer`:
https://www.nltk.org/api/nltk.tokenize.html?highlight=word_tokenize

`TweetTokenizer`: a tokenizer specifically suited for tweets.

You can start with any sample tweets.

Here is one example: "@remy This is waaaaayyyy too much for you!!!!!!"

In [50]:
# Your code below
from nltk.tokenize import TweetTokenizer
tweettoken = TweetTokenizer()
tweet = "@remy This is waaaaayyyy too much for you!!!!!!"
print('TweetTokenizer:')
print(tweettoken.tokenize(tweet),'\n')

# comparing with whitespace tokenizer
print('WhitespaceTokenizer:')
print(WhitespaceTokenizer().tokenize(tweet))

TweetTokenizer:
['@remy', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!'] 

WhitespaceTokenizer:
['@remy', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you!!!!!!']


**<span class="mark">TODO</span>**

Test to see how two different tokenizers would function when you pass the same text. 

 * Test with the same tweet that you picked as your sample data previously
 * Test with 2 tokenizers from NLTK: ```word_tokenize``` and ```casual_tokenize```

In [16]:
# Your code below
test_tweet = "@john lol that was #awesome :)"
print(nltk.word_tokenize(test_tweet))
print(nltk.casual_tokenize(test_tweet))

['@', 'john', 'lol', 'that', 'was', '#', 'awesome', ':', ')']
['@john', 'lol', 'that', 'was', '#awesome', ':)']


### Comparing tokenizers. 
Refer to this great resource: https://towardsdatascience.com/an-introduction-to-tweettokenizer-for-processing-tweets-9879389f8fe7

Instead of inspecting the output of each tokenizer separately, we can put everything in one pandas `DataFrame` and compare the results side by side. How would you do it?

**<span class="mark">TODO for later</span>**

In [51]:
from nltk.tokenize import word_tokenize,RegexpTokenizer, WordPunctTokenizer, WhitespaceTokenizer,TweetTokenizer

compare_list = ['https://t.co/9z2J3P33Uc',
               'laugh/cry',
               '😬😭😓🤢🙄😱',
               "world's problems",
               "@datageneral",
                "It's interesting",
               "don't spell my name right",
               'all-nighter']

word_tokens = []
for sent in compare_list:
    print(word_tokenize(sent))
    word_tokens.append(word_tokenize(sent))

match_tokenizer = RegexpTokenizer(r"[\w']+")  # raw string avoids an invalid-escape warning for \w
match_tokens = []
for sent in compare_list:   
    print(match_tokenizer.tokenize(sent))
    match_tokens.append(match_tokenizer.tokenize(sent))

space_tokenizer=WhitespaceTokenizer()
space_tokens=[]
for sent in compare_list:
    print(space_tokenizer.tokenize(sent))
    space_tokens.append(space_tokenizer.tokenize(sent))

punct_tokenizer = WordPunctTokenizer()
punct_tokens = []
for sent in compare_list:
    print(punct_tokenizer.tokenize(sent))
    punct_tokens.append(punct_tokenizer.tokenize(sent))

tweet_tokenizer = TweetTokenizer()
tweet_tokens = []
for sent in compare_list:
    print(tweet_tokenizer.tokenize(sent))
    tweet_tokens.append(tweet_tokenizer.tokenize(sent))


['https', ':', '//t.co/9z2J3P33Uc']
['laugh/cry']
['😬😭😓🤢🙄😱']
['world', "'s", 'problems']
['@', 'datageneral']
['It', "'s", 'interesting']
['do', "n't", 'spell', 'my', 'name', 'right']
['all-nighter']
['https://t.co/9z2J3P33Uc']
['laugh/cry']
['😬😭😓🤢🙄😱']
["world's", 'problems']
['@datageneral']
["It's", 'interesting']
["don't", 'spell', 'my', 'name', 'right']
['all-nighter']
['https', '://', 't', '.', 'co', '/', '9z2J3P33Uc']
['laugh', '/', 'cry']
['😬😭😓🤢🙄😱']
['world', "'", 's', 'problems']
['@', 'datageneral']
['It', "'", 's', 'interesting']
['don', "'", 't', 'spell', 'my', 'name', 'right']
['all', '-', 'nighter']
['https://t.co/9z2J3P33Uc']
['laugh', '/', 'cry']
['😬', '😭', '😓', '🤢', '🙄', '😱']
["world's", 'problems']
['@datageneral']
["It's", 'interesting']
["don't", 'spell', 'my', 'name', 'right']
['all-nighter']


In [53]:
import pandas as pd
tokenizers = {'word_tokenize': word_tokens,
              'RegexpTokenizer': match_tokens,
              'WhitespaceTokenizer': space_tokens,
              'WordPunctTokenizer': punct_tokens,
              'TweetTokenizer': tweet_tokens}
df = pd.DataFrame.from_dict(tokenizers)
df

Unnamed: 0,word_tokenize,RegexpTokenizer,WhitespaceTokenizer,WordPunctTokenizer,TweetTokenizer
0,"[https, :, //t.co/9z2J3P33Uc]","[https, t, co, 9z2J3P33Uc]",[https://t.co/9z2J3P33Uc],"[https, ://, t, ., co, /, 9z2J3P33Uc]",[https://t.co/9z2J3P33Uc]
1,[laugh/cry],"[laugh, cry]",[laugh/cry],"[laugh, /, cry]","[laugh, /, cry]"
2,[😬😭😓🤢🙄😱],[],[😬😭😓🤢🙄😱],[😬😭😓🤢🙄😱],"[😬, 😭, 😓, 🤢, 🙄, 😱]"
3,"[world, 's, problems]","[world's, problems]","[world's, problems]","[world, ', s, problems]","[world's, problems]"
4,"[@, datageneral]",[datageneral],[@datageneral],"[@, datageneral]",[@datageneral]
5,"[It, 's, interesting]","[It's, interesting]","[It's, interesting]","[It, ', s, interesting]","[It's, interesting]"
6,"[do, n't, spell, my, name, right]","[don't, spell, my, name, right]","[don't, spell, my, name, right]","[don, ', t, spell, my, name, right]","[don't, spell, my, name, right]"
7,[all-nighter],"[all, nighter]",[all-nighter],"[all, -, nighter]",[all-nighter]


## Stopwords

In [24]:
from nltk.corpus import stopwords #nltk.download('stopwords')

In [25]:
stopeng = set(stopwords.words('english'))
stopeng

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [26]:
# remove stopwords from text

#text = "Netanyahu's visit was cut short by reports late Sunday that a rocket was fired from Gaza into central Israel, wounding at least seven people. Following criticism from political opponents over what they consider the prime minister's unclear stance toward the militant political group, Israel responded with a series of strikes into Gaza against Hamas, which largely governs the contested strip. President Donald Trump tacitly endorsed the strike following his meetings with Netanyahu, calling the Hamas attack \"despicable.\""
text = "NASA Mars Rover Is Bringing 10.9 Million Names to the Red Planet"

# first tokenize text into words
tokens = word_tokenize( text.lower() )
print(tokens)
tokens_nostop = [w for w in tokens if w not in stopeng]

print('\n', tokens_nostop)

['nasa', 'mars', 'rover', 'is', 'bringing', '10.9', 'million', 'names', 'to', 'the', 'red', 'planet']

 ['nasa', 'mars', 'rover', 'bringing', '10.9', 'million', 'names', 'red', 'planet']


## Stemming

http://www.nltk.org/howto/stem.html

In [54]:
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()
tokens_porter = [(w, ps.stem(w)) for w in tokens_nostop if w != ps.stem(w)]
tokens_porter

[('mars', 'mar'), ('bringing', 'bring'), ('names', 'name')]

# Text cleaning

In [1]:
## Download an example tweets file  
## https://raw.githubusercontent.com/fivethirtyeight/data/master/trump-twitter/realDonaldTrump_poll_tweets.csv

import urllib.request

print('Beginning file download with urllib.request...')

# URL where the file is available for direct download
url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/trump-twitter/realDonaldTrump_poll_tweets.csv'

# download the file and save it locally as ./noisy_twitter.csv
urllib.request.urlretrieve(url, './noisy_twitter.csv')

Beginning file download with urllib.request...


('./noisy_twitter.csv', <http.client.HTTPMessage at 0x7f8c68448970>)

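<span class="mark">**What is `urllib.request.urlretrieve` doing?**</span>

`urlretrieve(url, filename)` downloads the resource at `url` and stores it under the local path given as the second argument. It returns a `(filename, headers)` tuple, which is the output shown above.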
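The behaviour of `urlretrieve` can be checked offline. This sketch assumes a throwaway CSV in a temporary directory (the file names `source.csv` and `copy.csv` are made up for illustration) and retrieves it through a `file://` URL instead of a web address.

```python
import tempfile
import urllib.request
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # create a small local file that plays the role of the remote CSV
    src = Path(tmp) / "source.csv"
    src.write_text("id,text\n1,hello\n", encoding="utf-8")

    # retrieve it and store the copy under a name of our choice
    dest = Path(tmp) / "copy.csv"
    saved_path, headers = urllib.request.urlretrieve(src.as_uri(), str(dest))

    copied = dest.read_text(encoding="utf-8")

print(saved_path)  # the local path passed as the second argument
print(copied)
```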

### Reading the csv as pandas dataframe

In [13]:
# first install pandas library and then import it
import pandas as pd

#read the file into dataframe. "header=0" means the first row will be considered as a header
frame = pd.read_csv("./noisy_twitter.csv", header=0, dtype={'id':str,"created_at":str,'text':str})

#print first 5 rows
frame.head()

Unnamed: 0,id,created_at,text
0,765629939811020802,8/16/2016 19:22:57,"It's just a 2-point race, Clinton 38%, Trump 3..."
1,758731880183193601,7/28/2016 18:32:31,"""@LallyRay: Poll: Donald Trump Sees 17-Point P..."
2,758350470402408449,7/27/2016 17:16:56,Great new poll - thank you!\n#MakeAmericaGreat...
3,757577508346888192,7/25/2016 14:05:27,Great POLL numbers are coming out all over. Pe...
4,753603401854881793,7/14/2016 14:53:46,Another new poll. Thank you for your support! ...


In [14]:
#print last 5 rows
frame.tail()

Unnamed: 0,id,created_at,text
443,628255460467077120,8/3/2015 17:25:48,RT @JonScottFNC: .@realDonaldTrump Surges in N...
444,628224939041136640,8/3/2015 15:24:31,RT @foxandfriends: Days before the first Repub...
445,628115657431891968,8/3/2015 8:10:17,"""@FoxNews: .@ericbolling: ""Polls show [@realDo..."
446,628044625962512384,8/3/2015 3:28:01,"""@CoachZachCooper: Congratulations on leading..."
447,627526914169806849,8/1/2015 17:10:49,"RT @FoxNews: .@ericbolling: ""Polls show [@real..."


In [15]:
# create a list from the 'text' column (one string per tweet)

textlist = frame['text'].tolist() # here, the order of rows is preserved
textlist

["It's just a 2-point race, Clinton 38%, Trump 36%' https://t.co/EzDzJ4EzIN",
 '"@LallyRay: Poll: Donald Trump Sees 17-Point Positive Swing in Two Weeks - Breitbart https://t.co/bVAj52fA3Y @realdonaldtrump"  Great!',
 'Great new poll - thank you!\n#MakeAmericaGreatAgain https://t.co/mXovx0TLPC',
 "Great POLL numbers are coming out all over. People don't want another four years of Obama, and Crooked Hillary would be even worse. #MAGA",
 'Another new poll. Thank you for your support! Join the MOVEMENT today! \n#ImWithYou https://t.co/3KWOl2ibaW https://t.co/miT4atHxQz',
 'Great new poll- thank you America!\n#Trump2016 #ImWithYou https://t.co/aVH9c5QRwc',
 'Despite spending $500k a day on TV ads alone #CrookedHillary falls flat in nationwide @QuinnipiacPoll. Having ZERO impact. Sad!!',
 'Great poll- Florida! Thank you! https://t.co/4FuPpL5WOM',
 'New poll - thank you! #Trump2016\nhttps://t.co/Mi87Vmw06H https://t.co/WmqvcYG4r3',
 'New Q poll out- we are going to win the whole deal- and MA

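### Function for calculating the 20 most frequent items in a list

The function below uses `Counter.most_common` from `collections`. NLTK's `FreqDist` is a `Counter` subclass and offers the same method: https://tedboy.github.io/nlps/generated/generated/nltk.FreqDist.most_common.html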

In [16]:
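from collections import Counter
from collections import defaultdict

def top20(thislist):
    # Join the whole list into one string, then count the space-separated pieces.
    # Splitting on a single space (rather than on any whitespace) yields empty
    # strings for consecutive spaces and keeps newlines inside tokens,
    # which is why '' shows up in the counts below.
    big_str = " ".join(thislist)
    wordlist = big_str.split(" ")
    wordcount = Counter(wordlist)
    return wordcount.most_common(20)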

In [17]:
print(top20(textlist))

[('the', 222), ('in', 165), ('', 129), ('poll', 122), ('I', 112), ('to', 101), ('Trump', 91), ('and', 89), ('is', 83), ('a', 78), ('of', 74), ('on', 64), ('New', 64), ('Poll', 64), ('@realDonaldTrump', 62), ('just', 56), ('new', 56), ('polls', 55), ('at', 55), ('for', 50)]
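The `('', 129)` entry above comes from splitting on a single space. A small sketch with an invented two-item list shows the difference between `split(" ")` and `split()`:

```python
from collections import Counter

# made-up sample: one newline inside a tweet, one double space
sample = ["Great new poll - thank you!\n#MAGA", "New poll  out"]
big_str = " ".join(sample)

by_space = Counter(big_str.split(" "))    # what top20 does
by_whitespace = Counter(big_str.split())  # splits on any run of whitespace

print(by_space)
print(by_whitespace)
```

With `split(" ")` the double space produces an empty token and `you!\n#MAGA` stays glued together; `split()` avoids both effects.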


### What are the most frequent mentions?

Steps:
* Extract mentions using regular expressions
* Count the most common

In [7]:
import re

def extractmentions(row):
    row = row.lower()
    # "@" followed by 1-25 word characters, not preceded by "@" or a word
    # character (so e-mail addresses are skipped)
    result = re.findall(r"(?<![@\w])@(\w{1,25})", row)
    return result

all_mentions = []

for t in textlist:
    all_mentions.extend(extractmentions(t))

print(top20(all_mentions))

[('realdonaldtrump', 87), ('cnn', 30), ('foxnews', 25), ('jebbush', 15), ('danscavino', 15), ('megynkelly', 10), ('abc', 8), ('drudge_report', 8), ('cnbc', 8), ('wsj', 7), ('oann', 7), ('cbsnews', 7), ('todayshow', 6), ('morning_joe', 5), ('nbcnews', 4), ('foxbusiness', 4), ('gstephanopoulos', 4), ('frankluntz', 4), ('realbencarson', 4), ('washingtonpost', 3)]
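To see what the negative lookbehind `(?<![@\w])` in the mention pattern filters out, here is a sketch on an invented sample string:

```python
import re

mention_re = re.compile(r"(?<![@\w])@(\w{1,25})")

# made-up sample: an e-mail address, two real mentions, and a doubled "@"
sample = "mail bob@example.com, thanks @alice and .@foxnews, ignore @@spam"
found = mention_re.findall(sample)
print(found)
```

The `@` in the e-mail address is preceded by a word character and the second `@` in `@@spam` is preceded by `@`, so neither counts as a mention.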


### What are the most frequent hashtags?

<span class="mark">**TODO**</span>

Write a similar function to extract hashtags, then print the 20 most frequent hashtags.

In [8]:
# Your code below

def extracthashtags(row):
    row = row.lower()
    # "#" followed by 1-25 word characters, not preceded by "#" or a word character
    result = re.findall(r"(?<![#\w])#(\w{1,25})", row)
    return result

all_hashtags = []

for t in textlist:
    all_hashtags.extend(extracthashtags(t))

print(top20(all_hashtags))


[('trump2016', 50), ('makeamericagreatagain', 34), ('1', 17), ('gopdebate', 8), ('2', 5), ('gop', 5), ('fitn', 3), ('iacaucus', 3), ('trump', 3), ('tcot', 3), ('3', 3), ('imwithyou', 2), ('votetrump', 2), ('iowa', 2), ('iowacaucus', 2), ('rubio', 2), ('votetrump2016', 2), ('cnn', 2), ('wakeupamerica', 2), ('cnbcgopdebate', 2)]
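The counts above include purely numeric tags such as `1`, `2`, and `3` (from text like `#1`). If those should not count as hashtags, one option is a negative lookahead. This variant is a suggestion rather than part of the original exercise, and the sample string is invented.

```python
import re

# (?!\d+\b) rejects tags that consist of digits only
hashtag_re = re.compile(r"(?<![#\w])#(?!\d+\b)(\w{1,25})")

sample = "We are #1 in #Iowa! #Trump2016 #2016election"
tags = hashtag_re.findall(sample.lower())
print(tags)
```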


### Top 20 words without text cleaning

In [9]:
# what are the top words from the entire text corpus
top20(textlist)

[('the', 222),
 ('in', 165),
 ('', 129),
 ('poll', 122),
 ('I', 112),
 ('to', 101),
 ('Trump', 91),
 ('and', 89),
 ('is', 83),
 ('a', 78),
 ('of', 74),
 ('on', 64),
 ('New', 64),
 ('Poll', 64),
 ('@realDonaldTrump', 62),
 ('just', 56),
 ('new', 56),
 ('polls', 55),
 ('at', 55),
 ('for', 50)]

#### Let's do some text cleaning

In [10]:
def textcleaner(row):
    row = row.lower()
    # remove URLs
    row = re.sub(r'http\S+', '', row)
    # remove mentions
    row = re.sub(r"(?<![@\w])@(\w{1,25})", '', row)
    # remove hashtags
    row = re.sub(r"(?<![#\w])#(\w{1,25})", '', row)
    # remove everything except letters, spaces, periods, and hyphens
    row = re.sub(r'[^A-Za-z .-]+', '', row)
    # remove digits (already covered by the previous step; kept as a safeguard)
    row = re.sub(r'\d+', '', row)
    # collapse runs of whitespace into a single space
    row = re.sub(r'\s+', ' ', row)
    row = row.strip(" ")
    return row

cleaned_textlist = []

for t in textlist:
    cleaned_textlist.append(textcleaner(t))
    
top20(cleaned_textlist)

[('the', 261),
 ('poll', 258),
 ('in', 176),
 ('trump', 131),
 ('new', 124),
 ('.', 120),
 ('i', 112),
 ('to', 104),
 ('polls', 95),
 ('and', 93),
 ('a', 90),
 ('is', 88),
 ('you', 86),
 ('of', 81),
 ('just', 73),
 ('great', 72),
 ('on', 67),
 ('thank', 59),
 ('-', 58),
 ('at', 57)]

**TODO for later** A few things could still be cleaned up, such as the stray `.` and `-` tokens that appear in the list above.
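One way to handle the leftover `.` and `-` tokens is to strip those characters from the edges of every token and drop tokens that become empty. A minimal sketch; the helper name `strip_edge_punct` is hypothetical, and the sample strings follow the style of the cleaned tweets.

```python
def strip_edge_punct(row):
    # split on whitespace, strip leading/trailing "." and "-" from each token,
    # and keep only tokens that still contain something
    tokens = (tok.strip(".-") for tok in row.split())
    return " ".join(tok for tok in tokens if tok)

print(strip_edge_punct("great new poll - thank you"))
print(strip_edge_punct("its just a -point race clinton trump"))
print(strip_edge_punct("very dishonest - why would they do that."))
```

Hyphens inside a word (for example `all-nighter`) are left alone because only the token edges are stripped.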

# Stopwords

**<span class="mark">TODO</span>**:
1. Fetch the English stopwords
2. Write code to remove stopwords from the text that you are working with: `cleaned_textlist`
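A minimal sketch of step 2, filtering word by word. So that it runs without the NLTK data download, it uses a small hand-written stand-in set (`demo_stopwords`, an assumption for this sketch); in the notebook you would pass `set(stopwords.words('english'))` instead. The helper name `remove_stopwords` is also hypothetical.

```python
# stand-in for set(stopwords.words('english')); deliberately tiny
demo_stopwords = {"a", "its", "just", "in", "the", "you", "is"}

def remove_stopwords(row, stop_set):
    # split the tweet into words and keep those not in the stopword set
    return " ".join(w for w in row.split() if w not in stop_set)

sample_tweets = ["its just a -point race clinton trump",
                 "great new poll - thank you"]
nostop_sample = [remove_stopwords(t, demo_stopwords) for t in sample_tweets]
print(nostop_sample)
```

The key point is that the membership test is applied to each word of a tweet, not to the tweet as a whole.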

In [11]:
from nltk.corpus import stopwords

In [12]:
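stopWords = set(stopwords.words('english'))

# Note: each element of cleaned_textlist is a whole tweet, not a single word,
# so this membership test compares entire tweets with the stopword set.
# Practically no tweet equals a stopword, so nothing is removed (see the output);
# word-level filtering requires splitting each tweet into words first.
nostop_list = []
for tweet in cleaned_textlist:
    if tweet not in stopWords:
        nostop_list.append(tweet)

print(nostop_list)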

['its just a -point race clinton trump', 'poll donald trump sees -point positive swing in two weeks - breitbart great', 'great new poll - thank you', 'great poll numbers are coming out all over. people dont want another four years of obama and crooked hillary would be even worse.', 'another new poll. thank you for your support join the movement today', 'great new poll- thank you america', 'despite spending k a day on tv ads alone falls flat in nationwide . having zero impact. sad', 'great poll- florida thank you', 'new poll - thank you', 'new q poll out- we are going to win the whole deal- and make america great again', 'the dirty poll done by is a disgrace. even they admit that many more democrats were polled. other polls were good.', 'hillary clinton is not a change agent just the same old status quo she is spending a fortune i am spending very little. close in polls', 'the poll sample is heavy on democrats. very dishonest - why would they do that other polls good', 'so many great th