# Text Analysis:  fuzzy matching, sentiment analysis

<U>Notes if you are using Jupyter Notebook</U>:  to call <B>exit()</B> from a notebook, please use <B>sys.exit()</B> (requires <B>import sys</B>); if a strange error occurs, it may be because Jupyter retains variables from all executed cells.  To reset the notebook's variables, click 'Restart Kernel' (the circular arrow) -- this will not undo any text changes.  

FUZZY MATCHING

(Note:  these examples were taken from a <A HREF="https://medium.com/@categitau/fuzzy-string-matching-in-python-68f240d910fe">blog tutorial by Catherine Gitau</A>)

<pre>
pip install fuzzywuzzy
pip install python-Levenshtein
</pre>

In [None]:
from fuzzywuzzy import fuzz, process

In [None]:
fuzz.ratio("Catherine M Gitau", "Catherine Gitau")

fuzz.partial_ratio("Catherine M. Gitau", "Catherine Gitau")  #100

In [None]:
fuzz.ratio("Catherine M Gitau", "Gitau Catherine")           #55

fuzz.partial_ratio("Catherine M. Gitau", "Gitau Catherine")  #60

In [None]:
str1 = "Los Angeles Lakers"
str2 = "Lakers"

ratio = fuzz.ratio(str1.lower(), str2.lower())

partial_ratio = fuzz.partial_ratio(str1.lower(), str2.lower())

print(ratio)
print(partial_ratio)

In [None]:
fuzz.token_sort_ratio("Catherine Gitau M.", "Gitau Catherine") #94

str1 = "united states v. nixon"
str2 = "Nixon v. United States"

ratio = fuzz.ratio(str1.lower(), str2.lower())
partial_ratio = fuzz.partial_ratio(str1.lower(), str2.lower())

token_sort_ratio = fuzz.token_sort_ratio(str1, str2)

print(ratio)
print(partial_ratio)
print(token_sort_ratio)
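
To see roughly where these scores come from, here is a minimal standard-library sketch. It assumes the behavior fuzzywuzzy falls back to when python-Levenshtein is not installed, namely difflib's SequenceMatcher. The helper names `simple_ratio` and `token_sort` are hypothetical, and the real `token_sort_ratio` also strips punctuation before comparing, which this sketch skips.

```python
from difflib import SequenceMatcher

def simple_ratio(s1, s2):
    """0-100 similarity: 2 * matching chars / total chars, rounded."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))

def token_sort(s1, s2):
    """Lowercase, sort the whitespace-separated tokens, then compare."""
    norm1 = ' '.join(sorted(s1.lower().split()))
    norm2 = ' '.join(sorted(s2.lower().split()))
    return simple_ratio(norm1, norm2)

# 'bcd' is the longest shared block: 2 * 3 / (4 + 4) = 0.75
print(simple_ratio('abcd', 'bcde'))     # 75

# word order no longer matters once the tokens are sorted
print(token_sort('united states v. nixon', 'Nixon v. United States'))  # 100
```

Sorting the tokens first is what lets reordered names and case titles score as identical, while the plain ratio penalizes the reordering.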

In [None]:
query = 'geeks for geeks'
choices = ['geek for geek', 'geeks geeks', 'g. for geeks']

# get a list of matches ordered by score (limit defaults to 5)
process.extract(query, choices)
# [('geeks geeks', 95), ('g. for geeks', 95), ('geek for geek', 93)]

# if we want only the top one
process.extractOne(query, choices)
# ('geeks geeks', 95)

In [None]:
fuzz.WRatio('geeks for geeks', 'Geeks For Geeks')        # 100
fuzz.WRatio('geeks for geeks!!!','geeks for geeks')      # 100

# whereas the simple ratio for the same pair is lower
fuzz.ratio('geeks for geeks!!!','geeks for geeks')       # 91

NATURAL LANGUAGE ANALYSIS

Note:  these examples are taken from <A HREF="https://www.pearson.com/us/higher-education/program/Deitel-Intro-to-Python-for-Computer-Science-and-Data-Science-Learning-to-Program-with-AI-Big-Data-and-The-Cloud/PGM2392788.html">Intro to Python for Computer Science and Data Science</A> from Pearson Publishing.

<B>10.1:  Install textblob and supporting libraries</B>

<pre>
# TextBlob:  simple interface to NLTK
<B>conda install -c conda-forge textblob</B>

# wordcloud - to create word cloud graphics
<B>conda install -c conda-forge wordcloud</B>

# spaCy - additional analysis
<B>conda install -c conda-forge spacy</B>
</pre>

<B>10.2:  Download NLTK corpora</B>

<pre>
python -m textblob.download_corpora

# download NLTK stop words
python -c "import nltk; nltk.download('stopwords')"

# spaCy English components (small model; the 'en' shortcut no longer works in spaCy 3)
python -m spacy download en_core_web_sm

python -m spacy download en_core_web_lg       # 827MB!
</pre>

<B>10.3:  Create a TextBlob object.</B>

In [None]:
from textblob import TextBlob

text = 'Today is a beautiful day.  Tomorrow looks like bad weather.'

blob = TextBlob(text)   # TextBlob object

print(blob)

In [None]:
blob_from_file = TextBlob(open('data/shakespeare/hamlet.txt').read())

print(len(blob_from_file))       # number of chars

<B>10.4:  Tokenizing text:  separating into words, sentences, phrases</B>

In [None]:
text = 'Today is a beautiful day.  Tomorrow looks like bad weather.'

blob = TextBlob(text)      # TextBlob object

print('words:')
print(blob.words)          # WordList object
print()

print('sentences:')
print(blob.sentences)      # list of Sentence objects
print()

print('noun phrases:')
print(blob.noun_phrases)   # WordList object, e.g. "beautiful day", "bad weather"
print()

<B>10.5:  POS ("parts of speech") tagging</B>

In [None]:
blob.tags

<pre>
nouns
pronouns
verbs
adjectives
adverbs
prepositions
conjunctions
interjections
</pre>

<pre>
NN:   singular noun, mass noun
VBZ:  third person singular present verb
DT:   determiner
JJ:   adjective
NNP:  proper singular noun
IN:   subordinating conjunction or preposition
</pre>

https://www.clips.uantwerpen.be/pattern
https://www.clips.uantwerpen.be/pages/MBSP-tags

<B>10.6:  Determine sentiment.</B>

In [None]:
text = ('Today is a beautiful day.  '          # single string (implicit concatenation)
        'Tomorrow looks like bad weather.  '
        'The food was not good.  '
        'The movie was not bad.  ')

blob = TextBlob(text)

# Sentiment object, includes 'polarity' and 'subjectivity' attributes
smt = blob.sentiment

# smt.polarity     (-1 to 1, positive/negative)
# smt.subjectivity (0 to 1, objective to subjective)

for sentence in blob.sentences:
    print(f'"{sentence}"')
    print(f'Polarity (-1 to 1):  {sentence.sentiment.polarity}')
    print(f'Subjectivity:        {sentence.sentiment.subjectivity}')
    print()

<B>10.7:  Analyze sentiment using <I>Naive Bayes</I> Analysis</B>

<B>Naive Bayes</B> is a classification technique used in machine learning that assumes each predictor (feature) is independent of the other predictors when evaluating a candidate.

For example, an orange may be identified by its roundness, its orangeness and its size.  A naive Bayes classifier takes each of these features into account when evaluating a candidate to see if it is an orange, but it does not assume they are related to one another (thus "naive").

The TextBlob NaiveBayesAnalyzer was trained on movie reviews.
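
The orange example can be sketched in a few lines of plain Python. This is only a toy illustration of the independence assumption, not how NaiveBayesAnalyzer is implemented: the labels, feature names and probabilities below are all hypothetical values picked by hand.

```python
from math import prod

# Hypothetical, hand-picked probabilities for illustration only
priors = {'orange': 0.5, 'lemon': 0.5}
likelihoods = {
    'orange': {'round': 0.9, 'orange_colored': 0.9, 'medium_size': 0.8},
    'lemon':  {'round': 0.2, 'orange_colored': 0.1, 'medium_size': 0.5},
}

def posterior(features):
    """Return P(label | features), treating features as independent."""
    # the "naive" step: multiply the per-feature likelihoods together
    scores = {
        label: priors[label] * prod(likelihoods[label][f] for f in features)
        for label in priors
    }
    total = sum(scores.values())            # normalize so the results sum to 1
    return {label: score / total for label, score in scores.items()}

result = posterior(['round', 'orange_colored', 'medium_size'])
print(max(result, key=result.get))          # orange
print(result)
```

The p_pos and p_neg values in the next cell play the same role as the normalized scores here: two probabilities that sum to 1, with the larger one deciding the classification.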


In [None]:
from textblob.sentiments import NaiveBayesAnalyzer

text = ('Today is a beautiful day.  '          # single string (implicit concatenation)
        'Tomorrow looks like bad weather.  '
        'The food was not good.  '
        'The movie was not bad.  ')

blob = TextBlob(text, analyzer=NaiveBayesAnalyzer())

for sentence in blob.sentences:
    print(f'"{sentence}"')
    print(f'classification:   {sentence.sentiment.classification}')
    print(f'p_pos:            {sentence.sentiment.p_pos}')
    print(f'p_neg:            {sentence.sentiment.p_neg}')
    print()

<B>10.8:  Work with inflections.</B>

Inflections are alternate forms of the same word, such as its singular and plural.


In [None]:
from textblob import TextBlob, Word

windex = Word('index')

print(windex.pluralize())          # 'indices'


wcacti = Word('cacti')

print(wcacti.singularize())        # 'cactus'


animals = TextBlob('dog cat fish bird').words

print(animals.pluralize())         # WordList(['dogs', 'cats', 'fish', 'birds'])

<B>10.9:  Work with stemming and lemmatization.</B>

When preparing a text for analysis we may wish to normalize the text.  'Normalizing' in this context means collapsing word variations into a single form:
<UL>
  <LI>lowercasing</LI>
  <LI>stemming:  stripping prefixes and suffixes to leave a stem shared by a word's variations (the stem may not be a real word)</LI>
  <LI>lemmatizing:  reducing a word to its dictionary base form (lemma), for example 'variety' for 'varieties'</LI>
</UL>

In [None]:
word = Word('varieties')

# stem():  common stem to all word variations (may not be a word)
print(word.stem())        # varieti

# lemmatize():  create "base" word
print(word.lemmatize())   # variety

<B>10.10:  Delete stop words.</B>

"Stop words" are very common words like 'a', 'an', 'as', etc. that are usually excluded from analysis.

In [None]:
import nltk

#nltk.download('stopwords')           # one time operation, loads words
#                                     # into module for future use

from nltk.corpus import stopwords

stops = stopwords.words('english')

print(f'{len(stops)} words in stop words collection')

for word in stops:
    print(word)

In [None]:
# quick and easy exclusion of stop words from any text

blob = TextBlob('Today is a beautiful day.')

words = [ word for word in blob.words if word.lower() not in stops ]   # stop list is lowercase

print(words)       # ['Today', 'beautiful', 'day']

<B>10.11:  Analyze "n-grams"</B>

An <B>n-gram</B> is a sequence of n text items (letters or words) that appear consecutively in a text.


In [None]:
text = 'Today is a beautiful day.  Tomorrow looks like bad weather.'

blob = TextBlob(text)

for wordlist in blob.ngrams():     # default n=3 (trigrams)
    print(wordlist)



In [None]:
for wordlist in blob.ngrams(n=5):
    print(wordlist)
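
The sliding window behind word n-grams can be sketched without any library. The function name `word_ngrams` is hypothetical, and unlike TextBlob this simple version splits on whitespace only, so punctuation stays attached to words.

```python
def word_ngrams(text, n=3):
    """Return every run of n consecutive words as a list of lists."""
    words = text.split()
    # one window per starting position; no windows if the text is too short
    return [words[i:i + n] for i in range(len(words) - n + 1)]

for gram in word_ngrams('Today is a beautiful day', n=3):
    print(gram)
```

A text of k words yields k - n + 1 n-grams, so larger values of n produce fewer, longer sequences.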

<B>10.12:  Count word and phrase frequency.</B>

In [None]:
fname = 'data/shakespeare/romeo_and_juliet.txt'

romeotext = open(fname).read()

blob = TextBlob(romeotext)


print(blob.word_counts['juliet'])

print(blob.words.count('juliet'))   # blob.words is a WordList; count() is case-insensitive by default

print(blob.noun_phrases.count('lady capulet'))
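
For comparison, a similar word count can be sketched with the standard library alone. This assumes a deliberately simple tokenizer (lowercase runs of letters and apostrophes), so its totals may differ slightly from TextBlob's. The function name `count_words` and the sample sentence are hypothetical.

```python
import re
from collections import Counter

def count_words(text):
    """Lowercase the text and count runs of letters/apostrophes."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

counts = count_words('Juliet! O Juliet, wherefore art thou?')

print(counts['juliet'])          # 2
print(counts.most_common(1))     # [('juliet', 2)]
```

Counter.most_common(n) returns the n most frequent items already sorted by count, which is handy when only the top words are of interest.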

<B>10.13:  Use spaCy to identify proper nouns:  names, dates, places, etc.</B>

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')   # the 'en' shortcut no longer works in spaCy 3

document = nlp('In 1994, Tim Berners-Lee founded the '
               'World Wide Web Consortium (W3C), '
               'devoted to developing web technologies.')

for entity in document.ents:
    print(f'{entity.text}:  {entity.label_}')

        # 1994:  DATE
        # Tim Berners-Lee:  PERSON
        # the World Wide Web Consortium:  ORG

<B>10.14:  Use spaCy to detect textual similarities.</B>

In [None]:
import spacy

# nlp = spacy.load('en_core_web_sm')    # small model; has to be downloaded:
#                                       # python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_lg')      # large model; has to be downloaded:
                                        # python -m spacy download en_core_web_lg
                                        # 827MB!

# ws_:  William Shakespeare
# cm_:  Christopher Marlowe
# bf_:  Beaumont and Fletcher
# tw_:  Tennessee Williams

ws_romeo_juliet = open('data/shakespeare/romeo_and_juliet.txt').read()
ws_hamlet =       open('data/shakespeare/hamlet.txt').read()
cm_dr_faustus =   open('data/not_shakespeare/marlowe_dr_faustus.txt').read()
bf_maids =        open('data/not_shakespeare/beaumont_fletcher_the_maids_tragedy.txt').read()
tw_streetcar =    open('data/not_shakespeare/williams_streetcar_named_desire.txt').read()

ws_romeo_text = nlp(ws_romeo_juliet)
ws_hamlet_text = nlp(ws_hamlet)

cm_faustus_text = nlp(cm_dr_faustus)

bf_maids_text = nlp(bf_maids)
tw_streetcar_text = nlp(tw_streetcar)


print(f'romeo to hamlet:     {ws_romeo_text.similarity(ws_hamlet_text)}')        # Shakespeare -> Shakespeare
print(f'romeo to faustus:    {ws_romeo_text.similarity(cm_faustus_text)}')       # Shakespeare -> Marlowe
print(f"romeo to maid's:     {ws_romeo_text.similarity(bf_maids_text)}")         # Shakespeare -> Beaumont & Fletcher

print(f"faustus to maid's:   {cm_faustus_text.similarity(bf_maids_text)}")      # Marlowe -> Beaumont & Fletcher
print(f'romeo to streetcar:  {ws_romeo_text.similarity(tw_streetcar_text)}')  # Shakespeare -> Williams

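<B>10.15:  Detect language and translate</B>

TextBlob has relied on Google Translate to detect language and perform translations.  Note that <B>detect_language()</B> and <B>translate()</B> were deprecated and later removed in newer TextBlob releases, so the cell below requires an older version.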

In [None]:
from textblob import TextBlob

text = ('Today is a beautiful day.  '          # single string (implicit concatenation)
        'Tomorrow looks like bad weather.  '
        'The food was not good.  '
        'The movie was not bad.  ')

blob = TextBlob(text)

print(blob.detect_language())           # 'en' (English)

spanish_blob = blob.translate(from_lang='en', to='es')  # default from_lang='auto'
print(spanish_blob)

language codes:  https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

supported langs:  https://cloud.google.com/translate/docs/languages

<B>10.16:  Perform a spell check.</B>

In [None]:
from textblob import Word

word = Word('theyr')

# suggest corrections
print(word.spellcheck())     # [('they', 0.57), ('their', 0.43)]

# choose most common correction
print(word.correct())        # 'they'


sentence = TextBlob('Ths sentense has svral mispellings.')

blob = sentence.correct()

print(blob)                  # The sentence has several misspellings.

<B>10.17:  Visualize word frequency.</B>

In [None]:
from textblob import TextBlob
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords


blob = TextBlob(open('data/shakespeare/the_merry_wives_of_windsor.txt').read())

# get list of stop words ('a', 'an', etc.)
stop_words = stopwords.words('english')

# tuples of (word, count)
items = blob.word_counts.items()

# exclude stop words
items = [ item for item in items if item[0] not in stop_words ]

# sort by 2nd item (the count)
sorted_items = sorted(items, key=lambda x: x[1], reverse=True)

# slice 20 items starting at index 1 (this skips the single most frequent entry at index 0)
top20 = sorted_items[1:21]

df = pd.DataFrame(top20, columns=['word', 'count'])

axes = df.plot.bar(x='word', y='count', legend=False)

plt.gcf().tight_layout()

plt.show()