# AI Trainee Journey - Practical part in NLP

In this session, we will continue our AI Trainee Journey with NLP in practice.

More specifically, we will be working with a freely available digital book, The Call of Cthulhu by H. P. Lovecraft. The book can be downloaded from [manybooks.net](https://manybooks.net/).

For your own quest/practice, you should download your own book, create a copy of this notebook and compare your results with the results from this analysis.

# 0. Reading in the libraries

Because python without libraries is like a snake without legs.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import string # special operations on strings
import spacy # language models

from matplotlib.pyplot import imread
from matplotlib import pyplot as plt
from wordcloud import WordCloud
%matplotlib inline

An additional module is needed that is not part of the default Kaggle environment, so we need to download and install it first.

In [None]:
!python -m spacy download en_core_web_md

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


# 1. Reading in the input file and inspecting it

In [None]:
filename = '/kaggle/input/poirot3/Poirot-Investigates.txt' 
with open(filename) as f:
    book = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
book[0:10]

In [None]:
len(book)

# 2. Cleaning the data

Before fitting a machine learning or statistical model, we always have to clean the data. No model creates meaningful results from messy data. Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

Luckily, a book is usually already a clean document that has gone through spellchecking, editing and proofreading, so the words and sentences we get are usually readable and free of errors. This is unlike answers in a questionnaire, where people can write anything and make many unintentional mistakes. But there are still some redundant parts of the text that we don't need in our analysis, so let's get to it.

## Empty lines

We don't need empty lines or empty strings in our text, as they don't contain any information. It is easiest to remove them at the beginning.

In [None]:
book = [x.strip() for x in book] # removes line breaks
book = [x for x in book if x] # removes empty strings, because they are considered in Python as False
book[0:10]

## Removing redundant parts of the text from the book

What we need to analyse is the text of the book, not the author, the title or the year of publication. Therefore we will remove these redundant parts before the analysis.

In [None]:
# we see that we don't need first 43 lines
core_book = book[43:]
core_book[0:10]

In [None]:
# Joining the list into one string/text
text = ' '.join(core_book)
len(text)

## Punctuation String Translate
Punctuation does not help us when we inspect words and their meaning, so let's get rid of it too.

In [None]:
no_punc_text = text.translate(str.maketrans('', '', string.punctuation))
no_punc_text[0:550]
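
To see what the translation table does, here is a small sketch on a made-up sentence (the sample string is an assumption, not a line from the book). Note that `string.punctuation` only lists ASCII punctuation, so typographic quotes, which are common in e-books, are left untouched.

```python
import string

# Made-up sample with both ASCII and typographic (curly) punctuation
sample = "\u201cHello,\u201d said Poirot; it's a fine day!"

# Translation table that deletes every ASCII punctuation character
table = str.maketrans('', '', string.punctuation)
cleaned = sample.translate(table)

print(cleaned)
# The curly quotes survive because they are not in string.punctuation
print('\u201c' in string.punctuation)
```

If curly quotes matter for your book, one option is to add them to the characters passed to `str.maketrans`.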

In [None]:
# Length of text without punctuation
len(no_punc_text)

In [None]:
len(text) - len(no_punc_text)
#removed 9920 punctuation signs

## Stopwords
Stopwords are a special case of words that work as filler and usually don't hold any special meaning. We will remove them later, once we compare how often words occur in the text, so that only meaningful words remain. But let's first take a look at what stopwords are.
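
To get a feeling for stopwords before we remove them, here is a minimal sketch with a tiny hand-written stopword set. The set `toy_stop_words` and the sample sentence are assumptions made for illustration; the real list used later comes from NLTK and is much longer.

```python
# Tiny hand-written stopword set, for illustration only
toy_stop_words = {'the', 'a', 'an', 'of', 'in', 'and', 'was', 'to', 'it'}

sample = "The detective was in the garden and it was raining"
tokens = sample.split()

# Compare in lowercase so that 'The' is treated like 'the'
fillers = [t for t in tokens if t.lower() in toy_stop_words]
meaningful = [t for t in tokens if t.lower() not in toy_stop_words]

print(fillers)
print(meaningful)
```

Most of the tokens in this sentence are filler; only a few words carry the content.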

# 3. Feature engineering

**Feature engineering** is the process of creating new variables for a given dataset, with the aim of improving a model's prediction accuracy or describing the dataset better.

Features can be:
- numerical (number of words in a sentence)
- categorical (what kind of sentence is it?)
- boolean (Is the sentence longer than 50 characters? True/False)
- ordinal (is the sentence short, medium or long?)
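
As a rough sketch, the four feature types above could be derived from a single sentence like this. The sample sentence, the cut-off values and the names used here are arbitrary assumptions chosen for illustration.

```python
sentence = "It was a dark and stormy night."

# numerical: number of words in the sentence
word_count = len(sentence.split())

# categorical: kind of sentence, guessed from the final character
kinds = {'?': 'question', '!': 'exclamation'}
sentence_kind = kinds.get(sentence[-1], 'statement')

# boolean: is the sentence longer than 50 characters?
is_long = len(sentence) > 50

# ordinal: short < medium < long, with arbitrary cut-offs
if word_count < 5:
    length_class = 'short'
elif word_count < 15:
    length_class = 'medium'
else:
    length_class = 'long'

features = {
    'word_count': word_count,
    'sentence_kind': sentence_kind,
    'is_long': is_long,
    'length_class': length_class,
}
print(features)
```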

## Tokenisation
Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

In [None]:
from nltk.tokenize import word_tokenize
text_tokens = word_tokenize(no_punc_text)
print(text_tokens[0:50])

In [None]:
len(text_tokens)

After we tokenised the text, we can remove stopwords from it.

In [None]:
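from nltk.corpus import stopwords
# a set makes the membership test below fast
my_stop_words = set(stopwords.words('english'))
# compare in lowercase so that capitalised stopwords such as 'The' are removed too
no_stop_tokens = [word for word in text_tokens if word.lower() not in my_stop_words]
print(no_stop_tokens[0:40])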

In [None]:
len(no_stop_tokens)

## Lowercase
Something as simple as lowercasing all words helps a lot, because the first word of every sentence starts with an uppercase letter, as do names of people and things. Lowercasing standardizes the tokens so that 'House' and 'house' are counted as the same word.

In [None]:
lower_words = [x.lower() for x in no_stop_tokens]
print(lower_words[0:25])

## Stemming

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in lower_words]
print(stemmed_tokens[0:40])

## Lemmatisation
In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatisation depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document. As a result, developing efficient lemmatisation algorithms is an open area of research. [2][3][4] (Wikipedia)

In [None]:
# English language model of the spaCy library (the one downloaded above)
nlp = spacy.load('en_core_web_md')

In [None]:
# convert text into words with language properties, lemmas being one of them, but mostly POS, which will follow later
doc = nlp(' '.join(no_stop_tokens))
print(doc[0:40])

In [None]:
lemmas = [token.lemma_ for token in doc]
print(lemmas[0:25])

## Counting words
`CountVectorizer` converts a collection of text documents to a matrix of token counts. Since lemmas are more precise than stems, let's use those as the tokens for counting.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(lemmas)
X

In [None]:
# get_feature_names() was removed in newer scikit-learn versions
print(vectorizer.get_feature_names_out()[40:90])

In [None]:
print(X.toarray())

In [None]:
sum_words = X.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
words_freq[0:25]
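
Because every lemma is passed to the vectorizer as its own one-word document, the column sums are plain word frequencies. As a sketch, the same idea can be expressed with `collections.Counter` from the standard library. The list `toy_lemmas` is a made-up stand-in for the `lemmas` list, and the results on the real data can differ slightly, since `CountVectorizer` by default lowercases its input and ignores single-character tokens.

```python
from collections import Counter

# Hypothetical list of lemmas standing in for the `lemmas` list above
toy_lemmas = ['man', 'see', 'house', 'man', 'go', 'see', 'man']

counts = Counter(toy_lemmas)

# most_common returns (word, count) pairs sorted by descending count
top = counts.most_common(2)
print(top)
```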

## Term Frequency - Inverse Document Frequency

Term frequency–inverse document frequency (tf–idf) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use tf–idf.[2] (Wikipedia)
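
The definition above can be worked through by hand on a toy corpus. This is a minimal sketch of the textbook variant, assuming tf = count / document length and idf = log(N / df). The corpus and the helper names `tf`, `idf` and `tf_idf` are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalises each row by default, so its numbers will differ.

```python
import math

# Toy corpus: each document is a list of tokens (made up for illustration)
docs = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'sat'],
    ['the', 'cat', 'ran', 'away'],
]

def tf(term, doc):
    # share of the document's tokens that are `term`
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log of (number of documents / number of documents containing `term`);
    # assumes `term` occurs in at least one document
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ['the', 'cat', 'sat']:
    print(term, round(tf_idf(term, docs[0], docs), 3))
```

A word such as 'the' that appears in every document gets a score of zero, while words that appear in fewer documents are weighted higher.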

In [None]:
# Look this up yourself and fill in the code :) This was not part of the theory, but it's a bonus task for special research

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(max_features=4000, stop_words='english')
x = vec.fit_transform(lemmas)
x.shape

In [None]:
# use the TF-IDF vectorizer `vec` from the previous cell, not the plain count vectorizer
X2 = vec.fit_transform(lemmas)
features = vec.get_feature_names_out()

# Getting top ranking features
sums2 = X2.sum(axis=0)
ranking_results = []
for col, term in enumerate(features):
    ranking_results.append((term, sums2[0, col]))
ranking = pd.DataFrame(ranking_results, columns=['term', 'rank'])

ranking_dict = {}
for item in ranking.iterrows():
    ranking_dict[item[1]['term']] = item[1]['rank']

words = (ranking.sort_values('rank', ascending=False))
print("\n\nWords head : \n", words.head(50))


## Part Of Speech Tagging

In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech,[1] based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc. (Wikipedia)

In [None]:
one_block = book[94]
doc_block = nlp(one_block)
spacy.displacy.render(doc_block, style='ent', jupyter=True)

In [None]:
for token in doc_block[0:20]:
    print(token, token.pos_)

## Filtering for nouns and verbs only

There are many POS tags. The exact list can be found in the [spacy documentation](https://spacy.io/api/annotation#pos-universal). The ones we will look at here, nouns and verbs, are quite standard and basic.

In [None]:
nouns_verbs = [token.text for token in doc if token.pos_ in ('NOUN', 'VERB')]
print(nouns_verbs[5:25])

## Counting tokens again

In [None]:
cv = CountVectorizer()

# fit the new vectorizer created above on nouns and verbs only
X = cv.fit_transform(nouns_verbs)
sum_words = X.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in cv.vocabulary_.items()]
words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
wf_df = pd.DataFrame(words_freq)
wf_df.columns = ['word', 'count']
wf_df[0:10]

# 4. Sentiment analysis

Sentiment analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. In its simplest form, it tries to identify whether a sentence is positive or negative.

## Lexicon (rule) based approach

In [None]:
afinn = pd.read_csv('/kaggle/input/bing-nrc-afinn-lexicons/Afinn.csv', sep=',', encoding='latin-1')
afinn.shape

In [None]:
afinn.head()

In [None]:
from itertools import islice

def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

In [None]:
affinity_scores = afinn.set_index('word')['value'].to_dict()
take(20, affinity_scores.items())

In the emotions lexicon we have lemmatized forms, but in the results we want to show the original sentences with the original forms of the words. How do we do this?

In the following steps:
1. put a unique id on each sentence (row)
2. make a column for sentence
3. calculate score for each sentence (row) by converting a word to lemmatized form only for the comparison and save it to new column 
4. order sentences by score to show top 10 and bottom 10
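
The steps above can be sketched on toy data with the standard library only. Everything here is an assumption made for illustration: `toy_lexicon` stands in for the AFINN lexicon, the sentences are made up, and lowercasing plus stripping punctuation stands in for real lemmatisation.

```python
# Toy lexicon standing in for AFINN; the words and scores are made up
toy_lexicon = {'good': 3, 'amazing': 4, 'bad': -3, 'terrible': -3}

toy_sentences = [
    'What an amazing, good day!',
    'The weather was terrible.',
    'He opened the door.',
]

def score(sentence):
    # Stand-in for lemmatisation: lowercase and strip surrounding punctuation
    words = [w.strip('.,!?').lower() for w in sentence.split()]
    return sum(toy_lexicon.get(w, 0) for w in words)

# Steps 1-3: unique id, original sentence, score of the sentence
rows = [(i, s, score(s)) for i, s in enumerate(toy_sentences)]

# Step 4: order by score, most positive first
ranked = sorted(rows, key=lambda row: row[2], reverse=True)
for row in ranked:
    print(row)
```

The original sentence is kept untouched in each row; the normalised words are used only for the lexicon lookup.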

In [None]:
from nltk import tokenize
sentences = tokenize.sent_tokenize(" ".join(core_book))
sentences[5:15]

In [None]:
sent_df = pd.DataFrame(sentences, columns=['sentence'])
sent_df

## Custom function
Sometimes there is no predefined function that does everything that we want it to. Therefore, we define our own function that is specific for our use case. In this case, we want to score each word in a sentence in lemmatised form, but calculate the score for the whole original sentence.

In [None]:
nlp = spacy.load('en_core_web_md')
sentiment_lexicon = affinity_scores

def calculate_sentiment(text: str = None) -> float:
    sent_score = 0
    if text:
        sentence = nlp(text)
        for word in sentence:
            sent_score += sentiment_lexicon.get(word.lemma_, 0)
    return sent_score

In [None]:
# test that it works
calculate_sentiment(text = 'Amazing boys, very good!')

In [None]:
sent_df['sentiment_value'] = sent_df['sentence'].apply(calculate_sentiment)

In [None]:
# how many words are in the sentence?
sent_df['word_count'] = sent_df['sentence'].str.split().apply(len)
sent_df['word_count'].head(10)

In [None]:
sent_df.sort_values(by='sentiment_value').tail(10)

In [None]:
# Sentiment score of the whole book
sent_df['sentiment_value'].sum()

# 5. Visualizing results

## Barchart for top 10 nouns + verbs

In [None]:
wf_df[0:10].plot.bar(x='word', figsize=(12,8), title='Top verbs and nouns')

## Wordcloud

Word Clouds are visual representations of words that give greater prominence to words that appear more frequently.

In [None]:
wordcloud = WordCloud(background_color ='black', 
                       min_font_size = 10).generate(text)
plt.figure(figsize = (12, 10), facecolor = None) 
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

## Scatterplot
A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. If the points are coded (color/shape/size), one additional variable can be displayed. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.

In [None]:
sent_df.plot.scatter(x='word_count', y='sentiment_value', figsize=(12,8), title='Sentence sentiment value to sentence word count')

## Correlation
Is there a correlation between word count in sentence and sentiment?

In [None]:
from scipy.stats import pearsonr
corr, _ = pearsonr(sent_df['word_count'], sent_df['sentiment_value'])
corr

A: No, there isn't
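
For reference, the Pearson coefficient can also be computed by hand. This is a minimal sketch on made-up numbers; the helper name `pearson` and the two toy lists are assumptions, not values from the book. A result close to 0 means there is no linear relationship, while values close to 1 or -1 indicate a strong one.

```python
import math

def pearson(xs, ys):
    # covariance of xs and ys divided by the product of their spreads
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up word counts and sentiment values for five sentences
word_counts = [5, 12, 7, 20, 9]
sentiments = [1, -2, 0, 3, -1]

r = pearson(word_counts, sentiments)
print(round(r, 3))
```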

# 6. Summary

This concludes our NLP practical session with the most frequent tasks to be done in text analytics. Very often we want to know what is going on in a text without reading it all, especially when we have a lot of text with some similar features. 

There are always some cleaning, preprocessing, feature engineering and visualisation steps in NLP analysis, no matter what the input dataset is.

Sentiment analysis, wordclouds and scatter plots are just some of the ways to look into the dataset. Another very popular technique that we didn't look at here is topic modeling, or topic classification in texts. https://www.kaggle.com/arthurtok/spooky-nlp-and-topic-modelling-tutorial

# 7. Quest

Your task in this quest is to use the book you downloaded from manybooks.net and 

**1. Do a sentiment analysis on the book to find 10 most positive sentences (with score) and 10 most negative sentences (with score). Calculate the final sentiment score for the whole book and print it somewhere.**

and/or

**2. Visualize the most frequent nouns + verbs (or all words from the original text) in a wordcloud and plot a bar chart with the counts of the top 10 words. Stemmed or lemmatized or without any modification. Change the background color to match your style.**

BONUS

**3. Research TF-IDF and calculate these frequencies.**