<h1>NLTK</h1>

NLTK, the Natural Language Toolkit, is a Python library for working with human language data. It is useful for tasks such as sentiment analysis, parsing, and tokenization.

In [None]:
import nltk
# Download the corpora and models used in this notebook
# with the NLTK downloader. Later cells raise a LookupError
# if these resources are not present.
nltk.download('popular')
nltk.download('nps_chat')
nltk.download('subjectivity')
nltk.download('webtext')
nltk.download('vader_lexicon')

To start, let's split a sentence into tokens (words and punctuation marks) with NLTK.

In [None]:
sentence = "This is a sentence! Isn't that cool?"
tokens = nltk.word_tokenize(sentence)
tokens
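
To make the idea of tokenization concrete, here is a minimal sketch of a regular-expression tokenizer that uses only the standard library. The name `simple_tokenize` and its pattern are illustrative assumptions and not part of NLTK. `nltk.word_tokenize` follows Treebank-style conventions and, for example, splits contractions such as "Isn't" into separate tokens, so its output will differ from this sketch.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks.

    Hypothetical helper for illustration only; this is not how
    nltk.word_tokenize is implemented.
    """
    # A word (optionally with an apostrophe suffix such as "n't" or "'s"),
    # or any single character that is neither a word character nor whitespace.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

sentence = "This is a sentence! Isn't that cool?"
print(simple_tokenize(sentence))
```

Unlike the NLTK tokenizer, this sketch keeps "Isn't" as a single token.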

NLTK includes several packages we can use. One of them, nltk.book, provides sample texts from well-known books; text1 is Moby Dick.

In [None]:
from nltk.book import text1

Any time we want to find out about these texts, we just have to enter their names at the Python prompt:

In [None]:
text1

We can now search the texts that NLTK provides for a keyword and see each occurrence in its surrounding context.

In [None]:
text1.concordance("climate")
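
A concordance lists every occurrence of a word together with the words around it. The following is a minimal standard-library sketch of that idea, assuming the text is already a list of tokens. The function name `concordance_lines` and the `window` parameter are hypothetical and do not mirror NLTK's own implementation or output format.

```python
def concordance_lines(tokens, word, window=2):
    """Return each occurrence of `word` with `window` tokens on either side.

    Hypothetical helper for illustration; matching is case-insensitive.
    """
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append(" ".join(left + [token] + right))
    return lines

tokens = "the warm climate of the south and the cold climate of the north".split()
for line in concordance_lines(tokens, "climate"):
    print(line)
```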

In [None]:
# We can also use NLTK to find words that appear in contexts
# similar to a given word. These are often, but not always, synonyms.
text1.similar("monstrous")
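
The idea behind this kind of search is distributional: two words are considered similar when they occur between the same neighbouring words. Below is a simplified standard-library sketch of that idea, assuming a context is the pair (previous word, next word). The function name `similar_words` and its ranking by number of shared contexts are assumptions made for illustration and may not match NLTK's scoring exactly.

```python
from collections import Counter, defaultdict

def similar_words(tokens, target, top_n=5):
    """Rank words by how many (left, right) contexts they share with target.

    Hypothetical helper for illustration only.
    """
    tokens = [t.lower() for t in tokens]
    target = target.lower()

    # Map each word to the set of (previous word, next word) pairs around it.
    contexts = defaultdict(set)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((left, right))

    target_contexts = contexts.get(target, set())
    scores = Counter()
    for word, word_contexts in contexts.items():
        shared = len(word_contexts & target_contexts)
        if word != target and shared:
            scores[word] = shared
    return [word for word, _ in scores.most_common(top_n)]

tokens = "a monstrous whale and a curious whale and a small boat".split()
print(similar_words(tokens, "monstrous"))
```

In this toy text, "curious" shares the context ("a", "whale") with "monstrous", while "small" does not.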

We can also obtain some statistics about the input text, such as its length in tokens and a sample of its vocabulary.

In [None]:
print("text length: {}, vocabulary sample: {}".format(len(text1), sorted(set(text1))[500:600]))
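
A statistic that follows directly from these two quantities is lexical diversity, the ratio of distinct tokens to total tokens. The sketch below assumes the text is a plain list of tokens; the function name `lexical_diversity` is chosen here for illustration.

```python
def lexical_diversity(tokens):
    """Return the fraction of tokens that are distinct (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

tokens = "the cat sat on the mat the end".split()
print(len(tokens), len(set(tokens)), lexical_diversity(tokens))
```

Because an NLTK text behaves like a sequence of tokens, the same function could be called as `lexical_diversity(text1)` once the text is loaded.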

Using simple functions from NLTK, we can explore and manipulate text extensively.

<h1> Sentiment Analysis </h1>

NLTK includes classifiers and sentiment analysis tools. Here we use VADER, a lexicon- and rule-based sentiment analyzer, to score a set of example sentences.

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sentences = ["VADER is smart, handsome, and funny.", # positive sentence example
    "VADER is smart, handsome, and funny!", # punctuation emphasis handled correctly (sentiment intensity adjusted)
    "VADER is very smart, handsome, and funny.",  # booster words handled correctly (sentiment intensity adjusted)
    "VADER is VERY SMART, handsome, and FUNNY.",  # emphasis for ALLCAPS handled
    "VADER is VERY SMART, handsome, and FUNNY!!!",# combination of signals - VADER appropriately adjusts intensity
    "VADER is VERY SMART, really handsome, and INCREDIBLY FUNNY!!!",# booster words & punctuation make this close to ceiling for score
    "The book was good.",         # positive sentence
    "The book was kind of good.", # qualified positive sentence is handled correctly (intensity adjusted)
    "The plot was good, but the characters are uncompelling and the dialog is not great.", # mixed negation sentence
    "A really bad, horrible book.",       # negative sentence with booster words
    "At least it isn't a horrible book.", # negated negative sentence with contraction
    ":) and :D",     # emoticons handled
    "",              # an empty string is correctly handled
    "Today sux",     # negative slang handled
    "Today sux!",    # negative slang with punctuation emphasis handled
    "Today SUX!",    # negative slang with capitalization emphasis
    "Today kinda sux! But I'll get by, lol" # mixed sentiment example with slang and contrastive conjunction "but"
]

In [None]:
# Here, we can predict the polarity of sentiment 
# for the sentences we were given
sid = SentimentIntensityAnalyzer()
for sentence in sentences:
    print(sentence)
    ss = sid.polarity_scores(sentence)
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]), end='')
    # End the line once all scores for this sentence have been printed
    print()
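
Each call to `polarity_scores` returns a dictionary with the keys `neg`, `neu`, `pos`, and `compound`, where `compound` is a normalized score between -1 and 1. A common way to turn that score into a label is a pair of symmetric thresholds. The sketch below assumes a cutoff of 0.05, a convention often used with VADER; both the cutoff and the function name `label_from_compound` are assumptions and can be tuned for a given dataset.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label.

    The 0.05 threshold is an assumed convention, not a fixed rule.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Illustrative scores only; these are not real VADER outputs.
for score in (0.8, 0.0, -0.6):
    print(score, label_from_compound(score))
```

With the analyzer from the cell above, this could be applied as `label_from_compound(sid.polarity_scores(sentence)['compound'])`.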

<h1> Training a model </h1>
In this section, we will load data from a tab-separated (TSV) file and train a model on it using scikit-learn and NLTK.

In [None]:
import pandas as pd

There are many more features of NLTK, such as stopword removal and lexicon normalization. Check them out at https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk

In [None]:
# The dataset is a tab-separated file with four columns:
# PhraseId, SentenceId, Phrase, and Sentiment.

# The data has 5 sentiment labels:
# 0 - negative, 1 - somewhat negative, 2 - neutral,
# 3 - somewhat positive, 4 - positive
data = pd.read_csv('train.tsv', sep='\t')
data.head()

In [None]:
data.info()

In [None]:
data.Sentiment.value_counts()


In [None]:
# In a text classification problem, we have a set of texts
# and their respective labels, but a model cannot use raw
# text directly. The text must first be converted into
# numbers or vectors of numbers.

# The bag-of-words (BoW) model is the simplest way of
# extracting features from text. It converts the texts into
# a matrix that records how often each word occurs in each
# document, ignoring word order.
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
# Tokenizer that keeps runs of letters and digits and drops symbols and punctuation
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)
text_counts = cv.fit_transform(data['Phrase'])
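
To show what a bag-of-words matrix looks like, here is a minimal standard-library sketch that builds one for three tiny documents. It assumes whitespace tokenization and no stop-word removal, so it is a simplification of what `CountVectorizer` does; the function name `bag_of_words` is hypothetical.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, matrix), where matrix[i][j] is the number of
    times vocabulary[j] occurs in documents[i].

    Hypothetical helper: lowercases and splits on whitespace only.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

vocabulary, matrix = bag_of_words(["good movie", "bad movie", "good good plot"])
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column is one vocabulary word. scikit-learn stores the same kind of matrix in sparse form because most entries are zero.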

In [None]:
# split train and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    text_counts, data['Sentiment'], test_size=0.3, random_state=1)
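
The split holds out a fraction of the rows for evaluation so that accuracy is measured on data the model has not seen. The following standard-library sketch illustrates the idea with a seeded shuffle. The function name `simple_train_test_split` is hypothetical, and details such as rounding and shuffling order are assumptions that will not reproduce scikit-learn's exact split.

```python
import random

def simple_train_test_split(items, test_size=0.3, seed=1):
    """Shuffle indices with a fixed seed and hold out a fraction for testing.

    Hypothetical helper for illustration only.
    """
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # same seed gives the same split
    n_test = round(len(items) * test_size)
    test = [items[i] for i in indices[:n_test]]
    train = [items[i] for i in indices[n_test:]]
    return train, test

train, test = simple_train_test_split(list(range(10)))
print(len(train), len(test))
```

Fixing the seed plays the same role as `random_state=1` in the cell above: repeated runs produce the same partition.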

In [None]:
from sklearn.naive_bayes import MultinomialNB
# Import the scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model generation using Multinomial Naive Bayes
clf = MultinomialNB().fit(X_train, y_train)
predicted = clf.predict(X_test)
print("MultinomialNB Accuracy:", metrics.accuracy_score(y_test, predicted))
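
Accuracy is the fraction of test examples whose predicted label equals the true label. The sketch below computes it by hand and also computes the accuracy of always predicting the most frequent training label, which is a useful baseline when classes are imbalanced. Both function names are hypothetical and the labels are made-up illustrative values, not results from this dataset.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    return accuracy(y_test, [majority_label] * len(y_test))

# Made-up labels for illustration only.
y_true = [2, 2, 3, 1, 2]
y_pred = [2, 3, 3, 1, 0]
print(accuracy(y_true, y_pred))
print(majority_baseline_accuracy(y_true, y_true))
```

A classifier is only informative if its accuracy is clearly above this baseline.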
