# Text Analysis

In this module, we will use the Natural Language Toolkit (NLTK) library to look at individual words and sentences in a text and clean unnecessary features from the text data to prepare it for sentiment analysis. Then, using the textblob library, we will analyze the sentiment of opinionated text to produce a numerical value for use in a predictive model.

#### Tokenizing Words and Sentences

Recall that in the "Python Dictionaries and String Manipulation" notebook, we used the .split() method to break a sentence apart.

In [None]:
text = "My favorite color is purple"
text.split()

However, because .split() only splits on whitespace by default, it does not work well with sentences that contain punctuation.

In [None]:
#the comma is attached to the previous word
text2 = "My favorite foods are french fries, bacon, and cheese"
text2.split()

The NLTK library was built to separate punctuation from words when tokenizing (splitting into parts).
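To get a feel for what separating punctuation involves, here is a minimal sketch that uses only Python's built-in `re` module. The `simple_tokenize` helper is a hypothetical name for illustration. It is not how NLTK's tokenizer works internally, and it handles far fewer cases (contractions, abbreviations, and so on).

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters/digits; [^\w\s] matches a single punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

text2 = "My favorite foods are french fries, bacon, and cheese"
tokens = simple_tokenize(text2)
print(tokens)
```

Each comma comes out as its own token instead of staying attached to the word before it.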

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.tokenize import TweetTokenizer
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

#this is sample data
from nltk.corpus import names  

from string import punctuation

#if the next cell does not work
#remove number symbol on following lines and re-run this cell
#(newer NLTK versions may ask for 'punkt_tab' instead of 'punkt')
#nltk.download('punkt')
#nltk.download('wordnet')
#nltk.download('names')
#nltk.download('stopwords')

In [None]:
#the comma is now its own token when the sentence is split
word_tokenize(text2)

In [None]:
#multi-sentence texts can be tokenized by sentence
#each sentence is an item in the list
text3 = "My name is Kenisha Priester. My favorite color is purple. My favorite foods are french fries, bacon, and cheese."

sent_tokenize(text3)

A lot of social media data is in the form of tweets. A tweet can have an @ sign to tag another user or a # sign to tag a particular subject. When analyzing data, you may want to keep these symbols attached to the word or phrase they belong to.

In [None]:
#tweets are more difficult to use the word tokenize function on
#using word_tokenize, common social media signs get separated
tweet = "@animegurl OMG you are so #funny :P"

word_tokenize(tweet)

In [None]:
#keeps the @ sign, # sign, and the tongue-out emoji
TweetTokenizer().tokenize(tweet)

#### Remember the Word Count exercise text?

Let's clean it up using NLTK and do a basic analysis.

In [None]:
#word count exercise paragraph
wctext = '''
A CerTaiN kING HaD a bEaUtIFul gaRDEn, ANd IN THe GarDen StooD A trEe
whICH bORe GoLDeN ApPlEs. THesE aPples WerE alwAyS CoUntEd, aNd abOuT
tHE TiMe WheN tHEY BEgAn tO grow RipE IT wAs foUND THat EVeRY NIgHT ONE
OF THeM Was gOne. thE kiNg bECAMe veRy ANGRy at thiS, aND ORDEred the
GarDEneR TO KEeP WAtch ALL NIgHT uNDER the tREE. tHE gardener sEt hIs
ELdEsT SoN to WATCH; buT ABout TweLve O'clOCK He fELL ASlEEp, And in
the morNIng aNOTheR Of thE aPPLes Was mIssinG. tHEn THE sECONd Son waS
oRdERED to waTch; aNd AT mIDniGhT he tOO FELl ASleEP, aND iN thE mOrNIng
ANoThER AppLE WaS gOne . TheN THe thIrd Son oFfeREd tO KeEp wATCh; buT
thE garDENer At First WoULd NoT LET Him, FOr fEaR sOMe HArM ShOuLD cOME
To hIM: hoWeveR, at lAST hE coNSEnteD, AND tHE YouNg MAN laID HimSELf
uNDER tHE tREe TO wAtch. AS tHE clocK STRuCk tweLvE He Heard A rustlinG
NoISe IN ThE aIr, And a biRd CAME FlYing ThAt was Of PUre gOLd; AND as
IT WAs SNApPING At onE oF ThE aPpleS wIth iTS BeaK, tHE GArDEner’S son
jUMpED UP And SHOT AN aRrOw at iT. But THE arrOw DID thE BiRD nO HaRm;
ONlY iT dRoPPEd a GoLDEn FEather FROM iTS tAiL, aND THEN FLEw AwaY.
the gOLdEN FEAthER WAS bRoUght to THe KinG IN THE MOrNING, AnD aLL ThE
cOunCil WAs called TogETHER. EVERYoNE aGREed ThAt it wAS wORth MoRe THAn
aLl The weAltH Of tHE kIngDOm: But THE KiNg sAID, ‘One FeatHeR Is Of NO
use tO me, I MusT HaVE ThE wHOLE BIRD .’
'''

In [None]:
#first, change all the words to lowercase
wctext = wctext.lower()

#then tokenize each part of the text
tknz_wct = word_tokenize(wctext)

In [None]:
tknz_wct[:5]

In [None]:
#the NLTK FreqDist gives a count for how often each part of the text occurs
fd_wct = FreqDist(tknz_wct)
fd_wct

In [None]:
#shows the top 10 words in the text
fd_wct.most_common(10)
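NLTK's `FreqDist` behaves much like Python's built-in `collections.Counter`. As a rough sketch of what the counting does, assuming a short hand-made token list (`sample_tokens` is made up for illustration):

```python
from collections import Counter

# a small made-up token list, not the exercise text
sample_tokens = ["the", "king", "had", "a", "garden", ",", "and",
                 "the", "garden", "had", "the", "tree", "."]

counts = Counter(sample_tokens)   # maps each token to how often it appears
top3 = counts.most_common(3)      # the three most frequent tokens with their counts
print(top3)
```

Punctuation and filler words are counted like any other token, which is why they tend to crowd the top of the list.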

The most common parts of this text seem to be filler words and punctuation. We need to remove them to get a better understanding of what the text is about.

In [None]:
#number of tokens in list before punctuation removal
len(tknz_wct)

In [None]:
#remove the punctuation tokens from the list
#(build a new list; removing items while looping over a list can skip tokens)
tknz_wct = [word for word in tknz_wct if word not in punctuation]

In [None]:
#number of tokens in list after punctuation removal
len(tknz_wct)
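Be careful when removing items from a list while looping over that same list: each removal shifts the remaining items, so the token right after a removed one is never checked. A small sketch with a made-up token list shows the difference between removing inside the loop and building a new list:

```python
from string import punctuation

tokens = ["a", ",", ",", "b", "."]

# removing inside the loop: the list shrinks, so the item after each removal is skipped
looped = list(tokens)
for tok in looped:
    if tok in punctuation:
        looped.remove(tok)

# building a new list checks every token exactly once
filtered = [tok for tok in tokens if tok not in punctuation]

print(looped)    # one comma is left behind
print(filtered)  # all punctuation tokens are gone
```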

In [None]:
punctuation

In [None]:
#list of english stopwords
eng_stopwords = stopwords.words('english')
eng_stopwords

In [None]:
rm_count = 0
new_words = []  #list to hold new words

for word in tknz_wct:
    if word not in eng_stopwords:
        new_words.append(word)
    else:
        rm_count += 1

In [None]:
rm_count

In [None]:
len(new_words)
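Checking membership in a list gets slower as the list grows, so converting the stopwords to a `set` first is a common alternative. A minimal sketch, assuming a tiny hand-written stopword set (`small_stopwords`) rather than NLTK's full English list:

```python
# hypothetical mini stopword set for illustration only
small_stopwords = {"the", "a", "and", "was", "in"}

tokens = ["the", "king", "was", "angry", "and", "the", "gardener", "kept", "watch"]

# keep only the tokens that are not stopwords
content_words = [tok for tok in tokens if tok not in small_stopwords]
removed = len(tokens) - len(content_words)

print(content_words)
print(removed)
```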

Now let's see the new top 10 words in this text.

In [None]:
fd_nw = FreqDist(new_words)
fd_nw.most_common(10)

Words in English change form depending on whether they are in the past, present, or future tense. We can reduce these words to their dictionary form so that they are easier for the computer to interpret. This is called **lemmatization**.

In [None]:
#put the word lemmatization function into a variable
wnl = WordNetLemmatizer()

In [None]:
#this sentence contains words with different tenses
sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."

In [None]:
#tokenize the sentence into a list
#this is before we lemmatize it
non_lem = word_tokenize(sentence)
non_lem

In [None]:
#empty list to hold the new lemmatized words
lemm = []

for word in non_lem:
    lemm.append(wnl.lemmatize(word, pos="v"))  #lemmatize using 'verb' part-of-speech

In [None]:
#this is the list of tokens after being lemmatized
lemm

#### Visual analysis of last letter of male and female names

In [None]:
#there are two text files within the Names sample data
names.fileids()

In [None]:
#tokenize the words in the text to lists
m_names = names.words('male.txt')
f_names = names.words('female.txt')

In [None]:
#sample the first 5 items in m_names
m_names[:5]

In [None]:
#make a frequency distribution of names that end with a particular letter (by gender)
cfd = nltk.ConditionalFreqDist(
            (fileid, name[-1])
            for fileid in names.fileids()
            for name in names.words(fileid))

cfd.plot()
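A `ConditionalFreqDist` keeps one frequency count per condition (here, one per file). The same idea can be sketched with the standard library, assuming two short made-up name lists in place of the NLTK corpus:

```python
from collections import Counter, defaultdict

# made-up sample names keyed by a file-like label (not the NLTK names corpus)
sample_names = {
    "male.txt": ["Aaron", "Ben", "Carlos", "Ethan"],
    "female.txt": ["Anna", "Bella", "Carmen", "Dana"],
}

# one Counter of last letters per label
last_letters = defaultdict(Counter)
for fileid, name_list in sample_names.items():
    for name in name_list:
        last_letters[fileid][name[-1]] += 1

print(dict(last_letters))
```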

### Sentiment Analysis

In order to understand how people feel about something, we need to do sentiment analysis on text data that contains their opinion.

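You will need to [install the textblob library](https://anaconda.org/conda-forge/textblob). Its `sentiment` property returns a polarity score from -1.0 (most negative) to 1.0 (most positive) and a subjectivity score from 0.0 (objective) to 1.0 (subjective).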

In [None]:
from textblob import TextBlob

In [None]:
myday = "Today is a great day, but it is boring"

In [None]:
tb = TextBlob(myday)

tb.sentiment

In [None]:
tb.sentiment.polarity

In [None]:
tb_pol = tb.sentiment.polarity
type(tb_pol)

#### Make a sentiment value column in a dataframe

Using the [Amazon Book Reviews dataset on Kaggle](https://www.kaggle.com/shrutimehta/amazon-book-reviews-webscraped), we add a new column to the dataset that will have a numerical value for the sentiment of each review.

In [None]:
import pandas as pd

#load the data from the Reviews.csv file
filepath = "Reviews.csv"
df = pd.read_csv(filepath, encoding = "latin-1") #this file is encoded differently

df.head()

In [None]:
#create a function to clean up each review
#then it will analyze and assign a sentiment polarity
def reviewSentiment(review):
    
    #make text lowercase
    review = review.lower()
    
    #tokenize the review
    tknz_review = word_tokenize(review)
    
    #remove punctuation
    #(build a new list; removing items while looping over a list can skip tokens)
    tknz_review = [token for token in tknz_review if token not in punctuation]
    
    clean_tokens = []
    #remove filler words
    for token in tknz_review:
        if token not in eng_stopwords:
            clean_tokens.append(token)
            
    #put sentence back together with remaining clean words
    clean_review = ' '.join(clean_tokens)
    
    #turn into textblob
    blob_rev = TextBlob(clean_review)
    
    #get sentiment polarity
    r_pol = blob_rev.sentiment.polarity
    
    return r_pol

In [None]:
#create a new column to hold sentiment value from function
df['review_sentiment'] = df['ReviewContent'].apply(reviewSentiment)

In [None]:
#verify sentiment values in new column
df.head()

In [None]:
#create a function to assign a polarity category to the sentiment
def sentimentCategory(sent_num):
    if sent_num >= 0.2:
        return "positive"
    elif sent_num <= -0.2:
        return "negative"
    else:
        return "neutral"
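The 0.2 and -0.2 cutoffs are inclusive, so a score of exactly 0.2 counts as positive and a score of exactly -0.2 counts as negative. A quick check on a few hand-picked scores, using a standalone copy of the same logic (`categorize` is a hypothetical name used only here):

```python
def categorize(score):
    # same thresholds as the notebook's category function
    if score >= 0.2:
        return "positive"
    if score <= -0.2:
        return "negative"
    return "neutral"

sample_scores = [0.5, 0.2, 0.19, 0.0, -0.2, -0.75]
labels = [categorize(s) for s in sample_scores]
print(labels)
```

Moving the cutoffs closer to zero would label more reviews as positive or negative and fewer as neutral.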

In [None]:
#create a new column to hold sentiment category
df['sentiment_category'] = df['review_sentiment'].apply(sentimentCategory)

In [None]:
df.head()

In [None]:
#compare frequency of positive, negative, and neutral reviews
df['sentiment_category'].value_counts()

Overall, it seems that most readers feel so-so about the book (maybe some good parts and some bad parts) and some readers really like the book.