<div style="text-align: right">INFO 6105 Data Sci Engineering Methods and Tools, midterm hints</div>
<div style="text-align: right">Dino Konstantopoulos, 28 March 2022</div>

# Shakespeare and a quick intro to NLP
The take-home is an introduction to googling and using libraries, with pandas manipulations and lots of plotting. 

Does NLP intrigue you? I teach an advanced class in NLP, but you need to take an ML class first!

# Text2Emotion
```
pip install text2emotion
```

## Forty Winters sonnet

In [1]:
import text2emotion as te

forty_winters = """
When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty’s field,
Thy youth’s proud livery, so gazed on now,
Will be a tatter’d weed, of small worth held:
Then being ask’d where all thy beauty lies,
Where all the treasure of thy lusty days,
To say, within thine own deep-sunken eyes,
Were an all-eating shame and thriftless praise.
How much more praise deserved thy beauty’s use,
If thou couldst answer ‘This fair child of mine
Shall sum my count and make my old excuse,’
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel’st it cold."""

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sachi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sachi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sachi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
te.get_emotion(forty_winters)

{'Happy': 0.29, 'Angry': 0.0, 'Surprise': 0.14, 'Sad': 0.24, 'Fear': 0.33}
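The result is a plain dictionary, so the dominant emotion can be read off with `max`. A minimal sketch, reusing the scores printed above as a literal so that it runs without the library (the variable names are mine, not part of text2emotion):

```python
# Scores copied from the get_emotion() output above.
scores = {'Happy': 0.29, 'Angry': 0.0, 'Surprise': 0.14, 'Sad': 0.24, 'Fear': 0.33}

# Pick the emotion with the largest score.
dominant = max(scores, key=scores.get)

# Rank all emotions from strongest to weakest.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(dominant)
print(ranked[:2])
```

For this sonnet the strongest signal is `Fear`, with `Happy` close behind.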

## Modern English explanation
When forty winters have attacked your brow and wrinkled your beautiful skin, the pride and impressiveness of your youth, so much admired by everyone now, will have become a worthless, tattered weed. Then, when you are asked where your beauty’s gone and what’s happened to all the treasures you had during your youth, you will have to say only within your own eyes, now sunk deep in their sockets, where there is only a shameful confession of greed and self-obsession. How much more praise you would have deserved if you could have answered, ‘This beautiful child of mine shall give an account of my life and show that I made no misuse of my time on earth,’ proving that his beauty, because he is your son, was once yours! This child would be new-made when you are old and you would see your own blood warm when you are cold.

`text2emotion` can also pick up emotion from emojis!

In [3]:
text = "What an amazing day😃😃"
te.get_emotion(text)

{'Happy': 0.0, 'Angry': 0.0, 'Surprise': 1.0, 'Sad': 0.0, 'Fear': 0.0}

# NLTK
```
pip install nltk
```

NLTK is arguably the #1 NLP library. It already has a built-in, pretrained sentiment analyzer called `VADER` (Valence Aware Dictionary and sEntiment Reasoner).

Since VADER is pretrained, you can get results more quickly than with many other analyzers. However, VADER is best suited for language used in social media, like short sentences with some slang and abbreviations. It’s less accurate when rating longer, structured sentences, but it’s often a good launching point.

In [12]:
import nltk

nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [16]:
import nltk
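# nltk.download('shakespeare')
# The Shakespeare corpus is stored as one XML file per play, and .words()
# raises "TypeError: Expected a single file identifier string" without one.
w = nltk.corpus.shakespeare.words("hamlet.xml")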

In [None]:
stopwords = nltk.corpus.stopwords.words("english")

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer

# nltk.download('vader_lexicon')  # needed once if the lexicon is missing
sia = SentimentIntensityAnalyzer()
sia.polarity_scores(forty_winters)

You’ll get back a dictionary of different scores. The `neg`ative, `neu`tral, and `pos`itive scores are related: They all add up to 1 and can’t be negative. The `compound` score is calculated differently. It’s not just an average, and it can range from -1 to 1.
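To make "calculated differently" a little more concrete: as far as I recall from VADER's reference implementation, the compound score is the sum of the per-word valence scores, squashed into the range -1 to 1 by a normalization with a constant `alpha = 15`. The sketch below reproduces only that normalization step. `normalize_valence` and the example sums are illustrative, and the real analyzer also adjusts for punctuation, capitalization, and negation before this step.

```python
import math

def normalize_valence(total: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the range -1 to 1."""
    return total / math.sqrt(total * total + alpha)

# Hypothetical valence sums for three sentences.
for total in (-6.0, 0.0, 2.5):
    print(f"{total:+.1f} -> {normalize_valence(total):+.3f}")
```

Because of the square root, doubling the raw sum does not double the compound score, which is why it behaves differently from a plain average.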

## A little bit of fun with built-in datasets
`.fileids()` exists in most, if not all, corpora. In the case of `movie_reviews`, each file corresponds to a single review. Note also that you’re able to filter the list of file IDs by specifying categories. This categorization is a feature specific to this corpus and others of the same type.

The corpus `movie_reviews` is a collection of movie reviews included in NLTK. The special thing about this corpus is that it’s already been classified. Therefore, we can use it to judge the accuracy of the algorithms we choose when rating similar texts.

In [10]:
import nltk
positive_review_ids = nltk.corpus.movie_reviews.fileids(categories=["pos"])
negative_review_ids = nltk.corpus.movie_reviews.fileids(categories=["neg"])
all_review_ids = positive_review_ids + negative_review_ids

Next, we define `is_positive()` to work on an entire review. We’ll need to obtain that specific review using its file ID and then split it into sentences before rating.

In [11]:
from statistics import mean

def is_positive(review_id: str) -> bool:
    """True if the average of all sentence compound scores is positive."""
    text = nltk.corpus.movie_reviews.raw(review_id)
    scores = [
        sia.polarity_scores(sentence)["compound"]
        for sentence in nltk.sent_tokenize(text)
    ]
    return mean(scores) > 0

Let's rate all the reviews and see how accurate VADER is with this setup:

In [17]:
all_review_ids[0]

'neg/cv590_20712.txt'

In [18]:
nltk.corpus.movie_reviews.raw(all_review_ids[0])

'if you\'re debating whether or not to see _breakfast_of_champions_ , ask yourself one simple question : do you want to see nick nolte in lingerie ? \nthe only people who would get much enjoyment from alan rudolph\'s chaotic adaptation of the kurt vonnegut novel is the cross-section of the population with the unhealthy urge to see that unpleasant sight . \neveryone else--and i\'m hoping that\'s most people--would be wise to steer clear of this excrutiatingly unfunny mess . \nactually , though , the sight of nolte in high heels is one of the more amusing things about this muddle , which focuses dwayne hoover ( bruce willis ) , the owner of dwayne hoover\'s exit 11 motor village in midland city . \nnot only is he a huge success as a businessman , he\'s also something of a celebrity , his face made recognizable by an ongoing series of television commercials . \nwith a nice home and family to boot , dwayne appears to have it all the ingredients to be happy--yet he\'s not . \nhis wife celia

In [13]:
from random import shuffle

shuffle(all_review_ids)
correct = 0
for review_id in all_review_ids:
    if is_positive(review_id):
        if review_id in positive_review_ids:
            correct += 1
    else:
        if review_id in negative_review_ids:
            correct += 1

print(f"{correct / len(all_review_ids):.2%} correct")

64.00% correct


After rating all reviews, we can see that only 64% were correctly classified by VADER using the logic defined in `is_positive()`.

A 64% accuracy rating isn’t great, but it’s a start.
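A single accuracy figure hides which class is being misclassified. Here is a small sketch, using made-up predictions rather than the real corpus, of how the calls could be tallied per label. `tally_by_label` is a hypothetical helper, not an NLTK function.

```python
from collections import Counter

def tally_by_label(predicted, actual):
    """Count (actual, predicted) pairs; ('neg', 'pos') is a false positive."""
    return Counter(zip(actual, predicted))

# Made-up labels for six reviews, for illustration only.
actual    = ["pos", "pos", "pos", "neg", "neg", "neg"]
predicted = ["pos", "pos", "neg", "pos", "pos", "neg"]

counts = tally_by_label(predicted, actual)
accuracy = (counts[("pos", "pos")] + counts[("neg", "neg")]) / len(actual)

print(counts)
print(f"{accuracy:.2%} correct")
```

With the real data, each entry of `predicted` would be `"pos" if is_positive(review_id) else "neg"`.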

In order to train and evaluate an improved classifier, we’ll need to build a list of features for each text we’ll analyze.

By using the predefined categories in the `movie_reviews` corpus, let's create sets of positive and negative words, then determine which ones occur most frequently across each set. Let's begin by excluding unwanted words and building the initial category groups:

In [23]:
import nltk
nltk.download('names')

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\Dino\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\names.zip.


True

In [24]:
unwanted = nltk.corpus.stopwords.words("english")
unwanted.extend([w.lower() for w in nltk.corpus.names.words()])

def skip_unwanted(pos_tuple):
    word, tag = pos_tuple
    if not word.isalpha() or word in unwanted:
        return False
    if tag.startswith("NN"):
        return False
    return True

positive_words = [word for word, tag in filter(
    skip_unwanted,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["pos"]))
)]
negative_words = [word for word, tag in filter(
    skip_unwanted,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["neg"]))
)]

Now we’re ready to create frequency distributions for our custom feature. Since some words are present in both positive and negative sets, let's begin by finding the common set so we can remove it from the distribution objects:

In [25]:
positive_fd = nltk.FreqDist(positive_words)
negative_fd = nltk.FreqDist(negative_words)

common_set = set(positive_fd).intersection(negative_fd)

for word in common_set:
    del positive_fd[word]
    del negative_fd[word]

top_100_positive = {word for word, count in positive_fd.most_common(100)}
top_100_negative = {word for word, count in negative_fd.most_common(100)}

In [46]:
import random
# sample() needs a sequence; passing a set is rejected since Python 3.11
random.sample(sorted(common_set), 10)

['soft',
 'morosely',
 'extravaganzas',
 'sense',
 'trick',
 'chased',
 'opening',
 'betrays',
 'old',
 'wiser']

In [47]:
random.sample(sorted(top_100_positive), 10)

['hanks',
 'vividly',
 'kimble',
 'profile',
 'attentive',
 'soviet',
 'ghost',
 'spacey',
 'societal',
 'fa']

In [49]:
random.sample(sorted(top_100_negative), 10)

['manchurian',
 'putrid',
 'flubber',
 'snipes',
 'tediously',
 'abysmal',
 'consecutive',
 'weighed',
 'droppingly',
 'fetch']

Once we’re left with unique positive and negative words in each frequency distribution object, we can finally build sets from the most common words in each distribution. 

The number of words in each set is something we could tweak in order to determine its effect on sentiment analysis.
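One way to make that number easy to tweak is to wrap the steps above in a function that takes it as a parameter. The sketch below uses `collections.Counter` (which `nltk.FreqDist` subclasses) and toy word lists. `top_unique_words` and its parameter `n` are hypothetical names introduced here.

```python
from collections import Counter

def top_unique_words(pos_words, neg_words, n):
    """Return the n most common words unique to each list, as two sets."""
    pos_counts, neg_counts = Counter(pos_words), Counter(neg_words)
    # Drop every word that appears in both categories.
    for word in set(pos_counts) & set(neg_counts):
        del pos_counts[word]
        del neg_counts[word]
    top_pos = {word for word, _ in pos_counts.most_common(n)}
    top_neg = {word for word, _ in neg_counts.most_common(n)}
    return top_pos, top_neg

# Toy word lists standing in for the tagged corpus words.
pos = ["vividly", "vividly", "old", "attentive", "sense"]
neg = ["putrid", "putrid", "old", "abysmal", "sense", "tediously"]

top_pos, top_neg = top_unique_words(pos, neg, n=2)
print(top_pos, top_neg)
```

Calling it with several values of `n` and retraining each time would show how sensitive the classifier is to the size of the word sets.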

With our new feature set ready to use, the first prerequisite for training a classifier is to define a function that will extract features from a given piece of data.

Since we’re looking for positive movie reviews, let's focus on the features that indicate positivity, including VADER scores:

In [50]:
def extract_pos_features(text):
    features = dict()
    wordcount = 0
    compound_scores = list()
    positive_scores = list()

    for sentence in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sentence):
            if word.lower() in top_100_positive:
                wordcount += 1
        compound_scores.append(sia.polarity_scores(sentence)["compound"])
        positive_scores.append(sia.polarity_scores(sentence)["pos"])

    # Adding 1 to the final compound score to always have positive numbers
    # since some classifiers you'll use later don't work with negative numbers.
    features["mean_compound"] = mean(compound_scores) + 1
    features["mean_positive"] = mean(positive_scores)
    features["wordcount"] = wordcount

    return features

In [52]:
def extract_neg_features(text):
    features = dict()
    wordcount = 0
    negative_scores = list()

    for sentence in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sentence):
            if word.lower() in top_100_negative:
                wordcount += 1
        negative_scores.append(sia.polarity_scores(sentence)["neg"])

    # No +1 offset is needed here: VADER's "neg" score is never negative.
    features["mean_negative"] = mean(negative_scores)
    features["wordcount2"] = wordcount

    return features

In [53]:
features = [
    (extract_pos_features(nltk.corpus.movie_reviews.raw(review)), "pos")
    for review in nltk.corpus.movie_reviews.fileids(categories=["pos"])
]
features.extend([
    (extract_neg_features(nltk.corpus.movie_reviews.raw(review)), "neg")
    for review in nltk.corpus.movie_reviews.fileids(categories=["neg"])
])

Training the classifier involves splitting the feature set so that one portion can be used for training and the other for evaluation, then calling `.train()`.
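A minimal sketch of such a split on placeholder data. The fixed `seed` is my addition so that the split is repeatable between runs, and `split_features` is a hypothetical helper, not part of NLTK.

```python
import random

def split_features(features, train_fraction=0.25, seed=0):
    """Shuffle a copy of the list, then cut it into (train, test)."""
    shuffled = list(features)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder (feature dict, label) pairs.
toy = [({"wordcount": i}, "pos" if i % 2 else "neg") for i in range(8)]
train, test = split_features(toy)
print(len(train), len(test))
```

The `train` list is what would be handed to `.train()`, and `test` is kept aside for the accuracy check.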

We can use `classifier.show_most_informative_features()` to determine which features are most indicative of a specific property.

In [54]:
# Use 1/4 of the set for training
train_count = len(features) // 4
shuffle(features)

classifier = nltk.NaiveBayesClassifier.train(features[:train_count])
classifier.show_most_informative_features(10)

Most Informative Features


In [55]:
nltk.classify.accuracy(classifier, features[train_count:])

0.9886666666666667

With these features, the Naive Bayes classifier scores 99% on the held-out reviews, compared with 64% for the VADER-only rule in `is_positive()`!
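That figure deserves a sanity check. In the `features` list built above, positive reviews go through `extract_pos_features()` and negative reviews through `extract_neg_features()`, so the two classes carry different dictionary keys. The sketch below, on toy dictionaries with made-up values, shows that a rule looking only at which keys are present already separates the labels perfectly. `guess_from_keys` is a hypothetical helper.

```python
# Toy feature dicts with the same keys the two extractors produce.
toy_features = [
    ({"mean_compound": 1.1, "mean_positive": 0.12, "wordcount": 3}, "pos"),
    ({"mean_compound": 0.9, "mean_positive": 0.08, "wordcount": 0}, "pos"),
    ({"mean_negative": 0.10, "wordcount2": 2}, "neg"),
    ({"mean_negative": 0.05, "wordcount2": 0}, "neg"),
]

def guess_from_keys(feature_dict):
    """Ignore every value; look only at which keys exist."""
    return "pos" if "mean_compound" in feature_dict else "neg"

hits = sum(guess_from_keys(f) == label for f, label in toy_features)
print(f"{hits / len(toy_features):.0%} correct from key names alone")
```

One possible remedy is to run both extractors on every review and merge the two dictionaries, so that each review has the same keys and the classifier has to rely on the values.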

## Comparing Additional Classifiers
NLTK provides a class that can use most classifiers from the popular machine learning framework `scikit-learn`.

Many of the classifiers that `scikit-learn` provides can be instantiated quickly since they have defaults that often work well. Let's integrate them within NLTK to classify text data.

In [30]:
from sklearn.naive_bayes import (
    BernoulliNB,
    ComplementNB,
    MultinomialNB,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

With these classifiers imported, we’ll first have to instantiate each one. Thankfully, all of these have pretty good defaults and don’t require much tweaking.

To aid in accuracy evaluation, it’s helpful to have a mapping of classifier names and their instances:

In [31]:
classifiers = {
    "BernoulliNB": BernoulliNB(),
    "ComplementNB": ComplementNB(),
    "MultinomialNB": MultinomialNB(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
    "RandomForestClassifier": RandomForestClassifier(),
    "LogisticRegression": LogisticRegression(),
    "MLPClassifier": MLPClassifier(max_iter=1000),
    "AdaBoostClassifier": AdaBoostClassifier(),
}

Now we can use these instances for training and accuracy evaluation.

Since NLTK allows us to integrate scikit-learn classifiers directly into its own classifier class, the training and classification processes will use the same methods we’ve already seen, `.train()` and `.classify()`.

We’ll also be able to leverage the same features list we built earlier by means of `extract_pos_features()` and `extract_neg_features()`. To jog our memory, here’s how we built the features list:
```
features = [
    (extract_pos_features(nltk.corpus.movie_reviews.raw(review)), "pos")
    for review in nltk.corpus.movie_reviews.fileids(categories=["pos"])
]
features.extend([
    (extract_neg_features(nltk.corpus.movie_reviews.raw(review)), "neg")
    for review in nltk.corpus.movie_reviews.fileids(categories=["neg"])
])
```

The features list contains tuples whose first item is a dictionary of features returned by `extract_pos_features()` or `extract_neg_features()`, and whose second item is the classification label from the preclassified data in the `movie_reviews` corpus.
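scikit-learn estimators expect a numeric matrix rather than dictionaries. As far as I know, `SklearnClassifier` bridges that gap with scikit-learn's `DictVectorizer`. The pure-Python sketch below imitates the idea on toy data and is not the library's code; `vectorize` is a hypothetical helper.

```python
def vectorize(feature_dicts):
    """Turn a list of feature dicts into (column names, rows of floats)."""
    columns = sorted({key for d in feature_dicts for key in d})
    rows = [[float(d.get(col, 0.0)) for col in columns] for d in feature_dicts]
    return columns, rows

# Toy dictionaries with made-up values.
toy = [
    {"mean_compound": 1.1, "wordcount": 3},
    {"mean_negative": 0.1, "wordcount2": 2},
]
columns, rows = vectorize(toy)
print(columns)
for row in rows:
    print(row)
```

A key that is missing from a dictionary becomes a zero in that column, so every row ends up with the same length.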

Since the first half of the list contains only positive reviews, begin by shuffling it, then iterate over all classifiers to train and evaluate each one:

In [32]:
# Use 1/4 of the set for training
train_count = len(features) // 4
shuffle(features)

for name, sklearn_classifier in classifiers.items():
    classifier = nltk.classify.SklearnClassifier(sklearn_classifier)
    classifier.train(features[:train_count])
    accuracy = nltk.classify.accuracy(classifier, features[train_count:])
    print(f"{accuracy:.2%} - {name}")

66.60% - BernoulliNB
66.60% - ComplementNB
66.60% - MultinomialNB
69.00% - KNeighborsClassifier
64.07% - DecisionTreeClassifier
68.67% - RandomForestClassifier
71.33% - LogisticRegression
72.73% - MLPClassifier
71.27% - AdaBoostClassifier


While this doesn’t mean that the `MLPClassifier` will continue to be the best one as you engineer new features, having additional classification algorithms at our disposal is clearly advantageous.

NLTK lets us process text into objects that we can filter and manipulate, which in turn lets us analyze text data to learn about its properties. We can also use different classifiers to perform sentiment analysis on our data.

# TextBlob
TextBlob is a simple Python library for NLP tasks such as tokenization, noun-phrase extraction, POS tagging, word inflection and lemmatization, n-grams, and sentiment analysis.
```
pip install textblob
```

In [34]:
from textblob import TextBlob
TextBlob(forty_winters).sentiment

Sentiment(polarity=0.22491258741258743, subjectivity=0.5234265734265734)
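`polarity` runs from -1.0 (negative) to 1.0 (positive), and `subjectivity` from 0.0 (objective) to 1.0 (subjective). A small sketch of turning the polarity into a label follows. The 0.05 cut-off and the `label_polarity` helper are assumptions for illustration, not TextBlob defaults.

```python
def label_polarity(polarity: float, threshold: float = 0.05) -> str:
    """Map a polarity in [-1, 1] to a coarse sentiment label."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Polarity copied from the sonnet's output above.
print(label_polarity(0.22491258741258743))
```

By this rule TextBlob reads the sonnet as mildly positive.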

In [35]:
from textblob import TextBlob   
text = '''                                       
Sentiment analysis is the interpretation and classification of emotions (positive, negative and neutral) within text 
data using text analysis techniques. Sentiment analysis allows 
businesses to identify customer sentiment toward products, brands or services in online conversations and feedback.
'''
blob = TextBlob(text) 
blob.tags  

[('Sentiment', 'NN'),
 ('analysis', 'NN'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('interpretation', 'NN'),
 ('and', 'CC'),
 ('classification', 'NN'),
 ('of', 'IN'),
 ('emotions', 'NNS'),
 ('positive', 'JJ'),
 ('negative', 'JJ'),
 ('and', 'CC'),
 ('neutral', 'JJ'),
 ('within', 'IN'),
 ('text', 'NN'),
 ('data', 'NNS'),
 ('using', 'VBG'),
 ('text', 'JJ'),
 ('analysis', 'NN'),
 ('techniques', 'NNS'),
 ('Sentiment', 'NN'),
 ('analysis', 'NN'),
 ('allows', 'VBZ'),
 ('businesses', 'NNS'),
 ('to', 'TO'),
 ('identify', 'VB'),
 ('customer', 'NN'),
 ('sentiment', 'NN'),
 ('toward', 'IN'),
 ('products', 'NNS'),
 ('brands', 'NNS'),
 ('or', 'CC'),
 ('services', 'NNS'),
 ('in', 'IN'),
 ('online', 'JJ'),
 ('conversations', 'NNS'),
 ('and', 'CC'),
 ('feedback', 'NN')]
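The tags follow the Penn Treebank convention, so every noun tag starts with `NN`, the same prefix that `skip_unwanted()` tested earlier. A short sketch on an excerpt of the output above, copied as a literal so that it runs without TextBlob:

```python
from collections import Counter

# Excerpt of blob.tags from the output above.
tags = [
    ('Sentiment', 'NN'), ('analysis', 'NN'), ('is', 'VBZ'), ('the', 'DT'),
    ('interpretation', 'NN'), ('and', 'CC'), ('classification', 'NN'),
    ('of', 'IN'), ('emotions', 'NNS'), ('positive', 'JJ'),
]

# Keep only nouns (NN, NNS, NNP, and NNPS all share the NN prefix).
nouns = [word for word, tag in tags if tag.startswith("NN")]

# Count how often each tag occurs in the excerpt.
tag_counts = Counter(tag for _, tag in tags)

print(nouns)
print(tag_counts.most_common(1))
```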

In [36]:
sentence = TextBlob("This is really good !") 
sentence.translate(to="zh")

TextBlob("这真的很好！")

In [37]:
sentence.translate(to="hi")

TextBlob("यह सचमुच अच्छा है !")

In [39]:
sentence.translate(to="el")

TextBlob("Αυτό είναι πραγματικά καλό!")

# Flair
Another NLP library, [Flair](https://github.com/flairNLP/flair), lets us apply its state-of-the-art natural language processing (NLP) models to our text, including named entity recognition (NER), part-of-speech (PoS) tagging, sense disambiguation, and classification, with special support for biomedical data and for a rapidly growing number of languages.