# Natural Language Processing

### Part 1: NLP intro 

##### NLP is the task we give computers to read and understand (process) written text (natural language). By far, the most popular toolkit or API to do natural language processing is the Natural Language Toolkit for the Python programming language. 

##### The NLTK module comes packed full of everything from trained algorithms to identify parts of speech to unsupervised machine learning algorithms to help you train your own machine to understand a specific bit of text. 


##### Great site for python documentation/tutorials/help:
https://pythonprogramming.net/
##### Walkthrough/tutorial from Sentdex on Youtube


In [2]:
pip install nltk

Collecting nltk
[?25l  Downloading https://files.pythonhosted.org/packages/92/75/ce35194d8e3022203cca0d2f896dbb88689f9b3fce8e9f9cff942913519d/nltk-3.5.zip (1.4MB)
[K     |████████████████████████████████| 1.4MB 3.3MB/s eta 0:00:01
[?25hCollecting click (from nltk)
[?25l  Downloading https://files.pythonhosted.org/packages/d2/3d/fa76db83bf75c4f8d338c2fd15c8d33fdd7ad23a9b5e57eb6c5de26b430e/click-7.1.2-py2.py3-none-any.whl (82kB)
[K     |████████████████████████████████| 92kB 25.4MB/s eta 0:00:01
[?25hCollecting joblib (from nltk)
[?25l  Downloading https://files.pythonhosted.org/packages/b8/a6/d1a816b89aa1e9e96bcb298eb1ee1854f21662ebc6d55ffa3d7b3b50122b/joblib-0.15.1-py3-none-any.whl (298kB)
[K     |████████████████████████████████| 307kB 25.3MB/s eta 0:00:01
[?25hCollecting regex (from nltk)
[?25l  Downloading https://files.pythonhosted.org/packages/60/7c/0d46b10a87b3087e8e303fac923beb19ec839d7c5ea34971a12fafb22b52/regex-2020.5.14-cp36-cp36m-manylinux2010_x86_64.whl (675kB)
[

In [3]:
import nltk

In [8]:
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------


Downloader>  L



Packages:
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] alpino.............. Alpino Dutch Treebank
  [ ] averaged_perceptron_tagger Averaged Perceptron Tagger
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] basque_grammars..... Grammars for Basque
  [ ] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
                           Extraction Systems in Biology)
  [ ] bllip_wsj_no_aux.... BLLIP Parser: WSJ Model
  [ ] book_grammars....... Grammars from NLTK Book
  [ ] brown............... Brown Corpus
  [ ] brown_tei........... Brown Corpus (TEI XML Version)
  [ ] cess_cat............ CESS-CAT Treebank
  [ ] cess_esp............ CESS-ESP Treebank
  [ ] chat80.............. Chat-80 Data Files
  [ ] city_database....... City Database
  [ ] cmudict............. The Carnegie Mellon Pronouncing Dictionary (0.6)
  [ ] comparative_sentences Comparative Sentence Dataset
  [ ] comtrans............ ComTrans Corpus Sample


Hit Enter to continue:  q



---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------


Downloader>  d



Download which package (l=list; x=cancel)?


  Identifier>  all


    Downloading collection 'all'
       | 
       | Downloading package abc to /home/jupyterlab/nltk_data...
       |   Unzipping corpora/abc.zip.
       | Downloading package alpino to /home/jupyterlab/nltk_data...
       |   Unzipping corpora/alpino.zip.
       | Downloading package biocreative_ppi to
       |     /home/jupyterlab/nltk_data...
       |   Unzipping corpora/biocreative_ppi.zip.
       | Downloading package brown to /home/jupyterlab/nltk_data...
       |   Unzipping corpora/brown.zip.
       | Downloading package brown_tei to
       |     /home/jupyterlab/nltk_data...
       |   Unzipping corpora/brown_tei.zip.
       | Downloading package cess_cat to /home/jupyterlab/nltk_data...
       |   Unzipping corpora/cess_cat.zip.
       | Downloading package cess_esp to /home/jupyterlab/nltk_data...
       |   Unzipping corpora/cess_esp.zip.
       | Downloading package chat80 to /home/jupyterlab/nltk_data...
       |   Unzipping corpora/chat80.zip.
       | Downloading packag


---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------


Downloader>  q


True

### Part 1 continued

##### Word and sentence tokenizers: break up by word or sentence

In [9]:
# word and sentence tokenizers: break up by word or sentence
# lexicoon and corporas
# corpora: body of text 
# lexicon: dictionary (words and their meanings)

In [10]:
# word and sent tokenizers 
from nltk.tokenize import sent_tokenize, word_tokenize

In [11]:
example_text = 'Hello Mr. Smith, how are you doing today? The weather is great and Python is awesome.'
print(sent_tokenize(example_text))
print(word_tokenize(example_text))

['Hello Mr. Smith, how are you doing today?', 'The weather is great and Python is awesome.']
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', 'and', 'Python', 'is', 'awesome', '.']


### Part 3: Stopwords: garbage words (i.e. not informative) in english language 

##### One of the largest elements to any data analysis, natural language processing included, is pre-processing. This is the methodology used to "clean up" and prepare your data for analysis. 

##### One of the first steps to pre-processing is to utilize stop-words. Stop words are words that you want to filter out of any analysis. These are words that carry no meaning, or carry conflicting meanings that you simply do not want to deal with. 

In [12]:
# Stopwords: garbage words in english language 
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

ex = 'This is an example showing off stop word filtration.'
stop_words = set(stopwords.words('english'))

filtered_list = []
word_toke = word_tokenize(ex)
for word in word_toke:
    if word not in stop_words:
        filtered_list.append(word)
print(filtered_list)
# one liner
filtered_sentence = [w for w in word_toke if w not in stop_words]
print(filtered_sentence)

['This', 'example', 'showing', 'stop', 'word', 'filtration', '.']
['This', 'example', 'showing', 'stop', 'word', 'filtration', '.']


### Part 4: Stemming

##### Another form of data pre-processing with natural language processing is called "stemming." This is the process where we remove word affixes from the end of words. The reason we would do this is so that we do not need to store the meaning of every single tense of a word. For example:

##### Reader, Reading, Read

##### Aside from tense, and even one of these is a noun, they all have the same meaning for their "root" stem (read).

##### This way, we store one single value for the root stem of "read." Then, when we wish to learn more, we can look into the affixes that were on the end, like "ing" is an active word, or in the past, then you have reader as someone who reads... then just plain read as either past tense or current. 

In [13]:
## stemming 
from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize 
ps = PorterStemmer()
ex = ['python','pythoner', 'pythoning', 'pythoned', 'pythonly']

for w in ex:
    print(ps.stem(w))
new_text = 'It is very important to be pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once.'

words = word_tokenize(new_text)
#for w in words: 
     #print(ps.stem(w))

python
python
python
python
pythonli


### Part 5: Part of Speech Tagging 

##### Part of Speech tagging does exactly what it sounds like, it tags each word in a sentence with the part of speech for that word. This means it labels words as noun, adjective, verb, etc. PoS tagging also covers tenses of the parts of speech. 

In [14]:
# part of speech tagging 
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer 

train_text = state_union.raw('2005-GWBush.txt')
sample_text = state_union.raw('2006-GWBush.txt')

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try: 
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))
        
#process_content()

### Part 6: Chunking 

##### Chunking in Natural Language Processing (NLP) is the process by which we group various words together by their part of speech tags. 

##### One of the most popular uses of this is to group things by what are called "noun phrases." We do this to find the main subjects and descriptive words around them, but chunking can be used for any combination of parts of speech.

In [15]:
## Chunking: grouping words into phrases

train_text = state_union.raw('2005-GWBush.txt')
sample_text = state_union.raw('2006-GWBush.txt')

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try: 
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            # reg expression doc :https://pythonprogramming.net/regular-expressions-regex-tutorial-python-3/
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""  
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            #print(chunked)
    except Exception as e:
        print(str(e))
        
process_content()

### Part 7: Chinking

##### Chinking is a part of the chunking process with natural language processing with NLTK. A chink is what we wish to remove from the chunk. We define a chink in a very similar fashion compared to how we defined the chunk. 

##### The reason why you may want to use a chink is when your chunker is getting almost everything you want, but is also picking up some things you don't want. You could keep adding chunker rules, but it may be far easier to just specify a chink to remove from the chunk.


In [16]:
### Chinking is a part of the chunking process with natural language processing with NLTK. 

def process_content():
    try: 
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            # reg expression doc :https://pythonprogramming.net/regular-expressions-regex-tutorial-python-3/
            chunkGram = r"""Chunk: {<.*>+} 
                                    }<VB.?|IN|DT>+{"""  
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            #print(chunked)
    except Exception as e:
        print(str(e))
        
process_content()


### Part 8: Named Entity 

##### Named entity recognition is useful to quickly find out what the subjects of discussion are. NLTK comes packed full of options for us. We can find just about any named entity, or we can look for specific ones.

##### NLTK can either recognize a general named entity, or it can even recognize locations, names, monetary amounts, dates, and more. 

In [17]:
# Named Entity Recognition 

def process_content():
    try: 
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            namedEnt = nltk.ne_chunk(tagged, binary = True)
    
            #print(namedEnt)
    except Exception as e:
        print(str(e))
        
process_content()

### Part 9: Lemmatizing 

##### A very similar operation to stemming is called lemmatizing. The major difference between these is, as you saw earlier, stemming can often create non-existent words.

##### So, your root stem, meaning the word you end up with, is not something you can just look up in a dictionary.

##### A root lemma, on the other hand, is a real word. Many times, you will wind up with a very similar word, but sometimes, you will wind up with a completely different word.

In [18]:
# Lemmatizing: finding root stems of original words in text data

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()


print(lemmatizer.lemmatize("better", pos='a'))
print(lemmatizer.lemmatize("best", pos='a'))
print(lemmatizer.lemmatize("ran", pos='v'))
print(lemmatizer.lemmatize("interested", pos='v'))

good
best
run
interest


### Part 10: Corpora 

##### Remember from the beginning, we talked about this term, "corpora."

##### Again, corpora is just a body of texts. Generally, corpora are grouped by some sort of defining characteristic.

##### NLTK is a massive toolkit for you. part of what they give you is a ton of highly valuable corpora to learn with, train against, and some of them are even capable of using in production.

In [19]:
# Corpora: accessing amd viewing 

print(nltk.__file__)

from nltk.corpus import gutenberg 
from nltk.tokenize import sent_tokenize

sample = gutenberg.raw('bible-kjv.txt')
sentence = sent_tokenize(sample)
print(sentence[5:15])

/home/jupyterlab/conda/envs/python/lib/python3.6/site-packages/nltk/__init__.py
['1:5 And God called the light Day, and the darkness he called Night.', 'And the evening and the morning were the first day.', '1:6 And God said, Let there be a firmament in the midst of the waters,\nand let it divide the waters from the waters.', '1:7 And God made the firmament, and divided the waters which were\nunder the firmament from the waters which were above the firmament:\nand it was so.', '1:8 And God called the firmament Heaven.', 'And the evening and the\nmorning were the second day.', '1:9 And God said, Let the waters under the heaven be gathered together\nunto one place, and let the dry land appear: and it was so.', '1:10 And God called the dry land Earth; and the gathering together of\nthe waters called he Seas: and God saw that it was good.', '1:11 And God said, Let the earth bring forth grass, the herb yielding\nseed, and the fruit tree yielding fruit after his kind, whose seed is\nin itsel

### Part 11: WordNet

##### Part of the NLTK Corpora is WordNet. I wouldn't totally classify WordNet as a Corpora, if anything it is really a giant Lexicon, but, either way, it is super useful. With WordNet we can do things like look up words and their meaning according to their parts of speech, we can find synonyms, antonyms, and even examples of the word in use. 

In [20]:
# WordNet 
from nltk.corpus import wordnet


### finding synonyms, defintion and examples
## ex = program
syns = wordnet.synsets('program') 
print(syns[0].name()) 
# lemma gets root word (similar to lemmatizer)
print(syns[0].lemmas()[0].name()) # word
print(syns[0].definition()) # definition
print(syns[0].examples()) # examples

# finding antonyms 
synonyms = []
antonyms = []


# finding synonyms and antonyms 
for syn in wordnet.synsets('good'):
    # since using lemma, gonna give synonyms for root word
    # good could mean an object or an adjective etc. 
    for l in syn.lemmas(): 
        #print("l:",l) # looking at the different lemmas 
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
            
print(set(synonyms))
print(set(antonyms))

plan.n.01
plan
a series of steps to be carried out or goals to be accomplished
['they drew up a six-step plan', 'they discussed plans for a new bond issue']
{'sound', 'unspoilt', 'honest', 'adept', 'salutary', 'goodness', 'undecomposed', 'dependable', 'just', 'secure', 'safe', 'serious', 'upright', 'dear', 'skilful', 'honorable', 'practiced', 'full', 'trade_good', 'right', 'ripe', 'in_force', 'good', 'thoroughly', 'soundly', 'well', 'expert', 'proficient', 'respectable', 'estimable', 'near', 'unspoiled', 'commodity', 'skillful', 'effective', 'in_effect', 'beneficial'}
{'ill', 'evil', 'badness', 'bad', 'evilness'}


In [21]:
# wordnet continued 

# how similar are two words 
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('boat.n.01')
print(w1.wup_similarity(w2))

# how similar are two words 
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('car.n.01')
print(w1.wup_similarity(w2))

# how similar are two words 
w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('cat.n.01')
print(w1.wup_similarity(w2))

0.9090909090909091
0.6956521739130435
0.32


### Part 11: Binary Text Classification 

##### Now that we understand some of the basics of of natural language processing with the Python NLTK module, we're ready to try out text classification. This is where we attempt to identify a body of text with some sort of label. 

##### To start, we're going to use some sort of binary label. Examples of this could be identifying text as spam or not, or, like what we'll be doing, positive sentiment or negative sentiment. 

In [39]:
# Text Classification 
import random
from nltk.corpus import movie_reviews

# there are 2000 movie reviews

# one liner 
documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories() 
            for fileid in movie_reviews.fileids(category)]

# multiple lines 
documents = []
for category in movie_reviews.categories ():
    for fileid in movie_reviews.fileids(category):
        documents.append((list(movie_reviews.words(fileid)), category))

neg
pos


In [38]:
#print(documents[0])

In [28]:
# Text classification continues

# randomizing in preparation for training and testing 
random.shuffle(documents)

#print(documents[0])

# all words from all movies 
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())
    
# frequency distribution     
# freq distribution is a dictionary of words and frequencies ordered by frequency 
all_words = nltk.FreqDist(all_words) 
print(all_words.most_common(15))
print(all_words['stupid'])
print(all_words)

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822), ('s', 18513), ('"', 17612), ('it', 16107), ('that', 15924), ('-', 15595)]
253
<FreqDist with 39768 samples and 1583820 outcomes>


### Part 12: Words as Features for Learning

##### For our text classification, we have to find some way to "describe" bits of data, which are labeled as either positive or negative for machine learning training purposes. 

##### These descriptions are called "features" in machine learning. For our project, we're just going to simply classify each word within a positive or negative review as a "feature" of that review. 

##### Then, as we go on, we can train a classifier by showing it all of the features of positive and negative reviews (all the words), and let it try to figure out the more meaningful differences between a positive review and a negative review, by simply looking for common negative review words and common positive review words. 

In [35]:
### get the top 3000 occurring words and the context in which they are used (either positive or negative) 
### train on this data and then determine if a review is positive or negative based on words used 

word_features = list(all_words.keys())[:3000]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        # dictionary of booleans determing whether or not top 3000 words 
        # across all movie reviews is in document/single review 
        features[w] = (w in words) 
    return features

#print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))

featuresets = [ (find_features(review), category) for (review, category) in documents]

In [70]:
#featuresets[0]

### Part 13: Naives Bayes

##### The algorithm of choice, at least at a basic level, for text analysis is often the Naive Bayes classifier. Part of the reason for this is that text data is almost always massive in size. The Naive Bayes algorithm is so simple that it can be used at scale very easily with minimal process requirements.

In [41]:
# categorizing as negative or positive sentiment 

# prepping training and testing sets 
training_set = featuresets[:1900]
testing_set = featuresets[1900:]

# NB algorithm: posterior = prior occurences x liklihood / evidence 

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Naive Bayes Algo Accuracy:", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

Naive Bayes Algo Accuracy: 77.0
Most Informative Features
                 idiotic = True              neg : pos    =     12.1 : 1.0
                  annual = True              pos : neg    =     10.7 : 1.0
               atrocious = True              neg : pos    =     10.5 : 1.0
                   sucks = True              neg : pos    =      9.5 : 1.0
                 frances = True              pos : neg    =      9.3 : 1.0
           unimaginative = True              neg : pos    =      7.5 : 1.0
                 cunning = True              pos : neg    =      7.0 : 1.0
                  sexist = True              neg : pos    =      6.9 : 1.0
             silverstone = True              neg : pos    =      6.9 : 1.0
                  regard = True              pos : neg    =      6.9 : 1.0
              schumacher = True              neg : pos    =      6.7 : 1.0
                    mena = True              neg : pos    =      6.3 : 1.0
                  shoddy = True           

### Part 14: Save classifier with Pickle

##### As you will likely find with any form of data analysis, there is going to be some sort of processing bottleneck, that you repeat over and over, often yielding the same object in Python memory. 

##### Examples of this might be loading a massive dataset into memory, some basic pre-processing of a static dataset, or, like in our case, the training of a classifier. 

##### In our case, we spend much time on training our classifier, and soon we may add more. It is a wise choice to go ahead and pickle the trained classifer. This way, we can load in the trained classifier in a matter of milliseconds, rather than waiting 3-5+ minutes for the classifier to be trained. 

##### To do this, we use the standard library's "pickle" module. What pickle does is serialize, or de-serialize, python objects. This could be lists, dictionaries, or even things like our trained classifier!

In [47]:
# for saving a trained algorithm
import pickle 

# saving trained classifier 

#save_classifier = open("naivebayes.pickle",'wb')
#pickle.dump(classifier, save_classifier)
#save_classifier.close()

# open to read 
classifier_f = open('naivebayes.pickle', 'rb')
classifier = pickle.load(classifier_f)
classifier_f.close()

print("Naive Bayes Algo Accuracy:", (nltk.classify.accuracy(classifier, testing_set))*100)


Naive Bayes Algo Accuracy: 77.0


### Part 15: Scikit-Learn Incorporation 
    
##### Despite coming packed with some classifiers, NLTK is mainly a toolkit focused on natural language processing, and not machine learning specifically. 

##### A module that is focused on machine learning is scikit-learn, which is packed with a large array of machine learning algorithms which are optimized in C. 

##### Luckily NLTK has recognized this and comes packaged with a special classifier that wraps around scikit learn. In NLTK, this is: nltk.classify.scikitlearn, specifically the class:  SklearnClassifier is what we're interested in.

In [56]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
# support vector machines
from sklearn.svm import SVC, LinearSVC, NuSVC

In [57]:
# converting sklearn classifier to nltk classifier using SklearnClassifier 

# dont forget to customize the parameters of the different classifying algos

# Multinomial 
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("Multinomial Naive Bayes Algo Accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set))*100)

# gaussian
#GNB_classifier = SklearnClassifier(GaussianNB())
#GNB_classifier.train(training_set)
#print("Gaussian Naive Bayes Algo Accuracy:", (nltk.classify.accuracy(GNB_classifier, testing_set))*100)

# bernoulli
BNB_classifier = SklearnClassifier(BernoulliNB())
BNB_classifier.train(training_set)
print("Bernoulli Naive Bayes Algo Accuracy:", (nltk.classify.accuracy(BNB_classifier, testing_set))*100)

# logistic
LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression Algo Accuracy:", (nltk.classify.accuracy(LogisticRegression_classifier, testing_set))*100)

# SGDClassifier
SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
print("SGDClassifier Algo Accuracy:", (nltk.classify.accuracy(SGDClassifier_classifier, testing_set))*100)

# SVC
# much more innaccurate than others so removing
#SVC_classifier = SklearnClassifier(SVC())
#SVC_classifier.train(training_set)
#print("SVC Algo Accuracy:", (nltk.classify.accuracy(SVC_classifier, testing_set))*100)


# Linear SVC
LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("Linear SVC Algo Accuracy:", (nltk.classify.accuracy(LinearSVC_classifier, testing_set))*100)

# Linear SVC
NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
print("Number SVC Algo Accuracy:", (nltk.classify.accuracy(NuSVC_classifier, testing_set))*100)


Multinomial Naive Bayes Algo Accuracy: 77.0
Bernoulli Naive Bayes Algo Accuracy: 77.0




LogisticRegression Algo Accuracy: 80.0




SGDClassifier Naive Bayes Algo Accuracy: 65.0




SVC Algo Accuracy: 57.99999999999999
Linear SVC Algo Accuracy: 81.0




Number SVC Algo Accuracy: 80.0


### Part 16: Combining Algos with a Vote

##### Now that we have many classifiers, what if we created a new classifier, which combined the votes of all of the classifiers, and then classified the text whatever the majority vote was? 

##### Turns out, doing this is super easy. NLTK has considered this in advance, allowing us to inherit from their ClassifierI class from nltk.classify, which will give us the attributes of a classifier, yet allow us to write our own custom classifier code. 

In [68]:
# Voting system: each classifier gets one vote and category is chosen based on most votes
from nltk.classify import ClassifierI
from statistics import mode 

# building class 

class VoteClassifier (ClassifierI):
    def __init__(self, *classifiers):
        self.classifiers = classifiers
    
    def classify(self, features):
        votes=[]
        for c in self.classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)
    
    def confidence(self, features):
        votes = [] 
        for c in self.classifiers:
            v = c.classify(features)
            votes.append(v)
        choice_votes = votes.count(mode(votes))
        conf = choice_votes/len(votes)
        return conf
    
voted_classifier = VoteClassifier(classifier, MNB_classifier, BNB_classifier, 
                                  LogisticRegression_classifier, SGDClassifier_classifier,
                                  LinearSVC_classifier, NuSVC_classifier)

print("Voted Classifier Algo Accuracy:", (nltk.classify.accuracy(voted_classifier, testing_set))*100)

print("Classification:", voted_classifier.classify(testing_set[0][0]), ", Confidence %:", voted_classifier.confidence(testing_set[0][0]))
print("Classification:", voted_classifier.classify(testing_set[1][0]), ", Confidence %:", voted_classifier.confidence(testing_set[1][0]))
print("Classification:", voted_classifier.classify(testing_set[2][0]), ", Confidence %:", voted_classifier.confidence(testing_set[2][0]))
print("Classification:", voted_classifier.classify(testing_set[3][0]), ", Confidence %:", voted_classifier.confidence(testing_set[3][0]))
print("Classification:", voted_classifier.classify(testing_set[4][0]), ", Confidence %:", voted_classifier.confidence(testing_set[4][0]))
print("Classification:", voted_classifier.classify(testing_set[5][0]), ", Confidence %:", voted_classifier.confidence(testing_set[5][0]))


Voted Classifier Algo Accuracy: 76.0
Classification: pos , Confidence %: 1.0
Classification: pos , Confidence %: 1.0
Classification: pos , Confidence %: 1.0
Classification: pos , Confidence %: 1.0
Classification: pos , Confidence %: 0.5714285714285714
Classification: pos , Confidence %: 1.0


### Part 17: Investigating Bias 

##### At this point in our project, we're interested in moving on to a real dataset, but we're concerned still about our volatility in accuracy. 

##### In this video, we peak into the classifiers to see if we have any bias leans towards positive or negative, and we wind up finding out that not only do we have a bias, we have a bug!

In [72]:
voted_classifier = VoteClassifier(MNB_classifier, BNB_classifier, 
                                  LogisticRegression_classifier,
                                  LinearSVC_classifier, NuSVC_classifier)

print("Voted Classifier Algo Accuracy:", (nltk.classify.accuracy(voted_classifier, testing_set))*100)

Voted Classifier Algo Accuracy: 80.0


### Part 18: Better Training data 

##### After some consideration it became clear that a new dataset would solve a lot of problems. This tutorial covers employing a new dataset, and what is involved in this process. 

##### This time, we're using a movie reviews data set that contains much shorter movie reviews. 

##### You can get this data set from: http://pythonprogramming.net/static/d...

##### This one yields us a far more reliable reading across the board, and is far more fitting for the tweets we intend to read from the Twitter API soon. 


### Part 19: Sentiment Analysis 
 
##### Now that we've got a more reliable classifier, we're ready to push forward. Here, we cover how we can convert our classifier training script to an actual sentiment analysis module. 

##### We pickle everything, and create a new sentiment function, which, with a parameter of "Text" will perform a classification and return the result. 

##### By pickling everything, we find that we can load this module in seconds, rather than the prior 3-5 minutes. After this, we're ready to apply this module to a live Twitter stream. 

### Part 20: Twitter Sentiment Analysis

##### Finally, the moment we've all been waiting for and building up to. A live test! We've decided to employ this classifier to the live Twitter stream, using Twitter's API. 

##### We've already covered how to do live Twitter API streaming, if you missed it, you can catch up here: http://pythonprogramming.net/twitter-...After this, we output the findings to a text file, which we intend to graph!

### Part 21: Graphing Live Twitter Sentiment
    
##### For a current conclusion to this series, we go ahead and graph our basic sentiment analysis results to a live Matplotlib graph.