<a href="https://colab.research.google.com/github/shstreuber/AdvDM/blob/main/Module14_PythonCats_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Module 14: Analyzing Unstructured Data**

So far we have been dealing with **structured data**. Structured data is ... well ... structured. This means that an instance of our data has nice attributes that can be represented in a DataFrame or a table. But the majority of data in the world is **unstructured**.

In this module you will learn how to:
* Analyze unstructured text
* Clean text with nltk tools
* Build a word cloud
* Find a source to help you build a sentiment analysis





# **Analyzing Text**
Suppose I have a corpus of Twitter posts about cats, and my goal is to match the query "Healthy cat food" to the most appropriate tweet in that corpus.

A common way to represent text is to treat the text as an unordered set of words, which is called the **bag of words** approach.

## Bag of Words

With the bag of words approach, we count word occurrences, and the words become the features (what we might think of as columns). Once the text is in this form, we can apply any classification method we like later. But first, we need to cover some extensive preprocessing.
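
To make the idea concrete, here is a minimal sketch of a bag of words built by hand with the standard library. The two toy documents are hypothetical examples (already lowercased, no punctuation), not part of our tweet corpus.

```python
from collections import Counter

# Two toy documents (hypothetical examples, already lowercased and without punctuation)
docs = ["the cat likes cat food", "the dog likes dog food"]

# The vocabulary is every distinct word across all documents; each word becomes a column
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Each document becomes one row of counts, with one entry per vocabulary word
bag_of_words = [[Counter(doc.split())[word] for word in vocabulary] for doc in docs]

print(vocabulary)
for row in bag_of_words:
    print(row)
```

Word order is lost in this representation: only the counts per document remain.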

## **0. Our Data**
Let's assume that we have downloaded a number of tweets from Twitter. Note that each tweet is treated as a string and is stored in a variable.

In [None]:
doc1 = "Stray cats are running all over the place. I see 10 a day!"
doc2 = "Cats are killers. They kill billions of animals a year."
doc3 = "The best food in Columbus, OH is the North Market."
doc4 = "Brand A is the best tasting cat food around. Your cat will love it."
doc5 = "Buy Brand C cat food for your cat. Brand C makes healthy and happy cats."
doc6 = "The Arnold Classic came to town this weekend. It reminds us to be healthy."
doc7 = "I have nothing to say. In summary, I have told you nothing."
doc8 = "Healthy cat food."

## **1. Transform our text into an Array of Text Vectors**
Converting **unstructured** text to something **structured** is a multistep process. Let's learn the individual pieces before putting them together. We will start with the last step first: creating the bag of words.

A CountVectorizer converts a collection of text documents to a matrix of token counts. Follow the link below to see the methods and parameters that CountVectorizer offers to help us preprocess our text:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html



In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

Now we assemble our separate documents into a list (i.e., a corpus of text) and call it tinyCorpus.

In [None]:
tinyCorpus = [doc1,doc2,doc3,doc4,doc5,doc6,doc7,doc8]
tinyCorpus

['Stray cats are running all over the place. I see 10 a day!',
 'Cats are killers. They kill billions of animals a year.',
 'The best food in Columbus, OH is the North Market.',
 'Brand A is the best tasting cat food around. Your cat will love it.',
 'Buy Brand C cat food for your cat. Brand C makes healthy and happy cats.',
 'The Arnold Classic came to town this weekend. It reminds us to be healthy.',
 'I have nothing to say. In summary, I have told you nothing.',
 'Healthy cat food.']
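
With a corpus in hand, a CountVectorizer can turn it into a count matrix. The sketch below is one possible way to approach the goal stated at the start of this section. It makes two assumptions: it treats the last document as the query and keeps it out of the corpus, and it uses cosine similarity to pick the closest tweet. The names `tweets`, `query`, and `best_index` are chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# The first seven documents form the corpus; the eighth is used as the query (an assumption)
tweets = [
    "Stray cats are running all over the place. I see 10 a day!",
    "Cats are killers. They kill billions of animals a year.",
    "The best food in Columbus, OH is the North Market.",
    "Brand A is the best tasting cat food around. Your cat will love it.",
    "Buy Brand C cat food for your cat. Brand C makes healthy and happy cats.",
    "The Arnold Classic came to town this weekend. It reminds us to be healthy.",
    "I have nothing to say. In summary, I have told you nothing.",
]
query = "Healthy cat food."

vectorizer = CountVectorizer()

# Learn the vocabulary from the tweets and count each word per tweet (rows = tweets, columns = words)
tweet_counts = vectorizer.fit_transform(tweets)

# Count the query's words using the same vocabulary
query_counts = vectorizer.transform([query])

# Compare the query row with every tweet row and keep the most similar tweet
similarities = cosine_similarity(query_counts, tweet_counts)[0]
best_index = int(similarities.argmax())

print(tweet_counts.shape)
print(tweets[best_index])
```

By default, CountVectorizer lowercases the text and ignores single-character tokens, so it already performs part of the cleaning covered in the next section.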

## **2. Clean and Transform Text**
For much of this we use the Natural Language Toolkit (NLTK). The complete toolkit for all NLP techniques [is here](https://www.nltk.org/).



### **2.1 Transform tinyCorpus to String to Clean and Transform the Data**

In [None]:
tinyCorpusStr=str(tinyCorpus)
tinyCorpusStr

"['Stray cats are running all over the place. I see 10 a day!', 'Cats are killers. They kill billions of animals a year.', 'The best food in Columbus, OH is the North Market.', 'Brand A is the best tasting cat food around. Your cat will love it.', 'Buy Brand C cat food for your cat. Brand C makes healthy and happy cats.', 'The Arnold Classic came to town this weekend. It reminds us to be healthy.', 'I have nothing to say. In summary, I have told you nothing.', 'Healthy cat food.']"

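Note that `str()` applied to a list returns the list's printed form, including the brackets, quotes, and commas between items. The sketch below compares that with joining the documents with spaces, which is a possible alternative and not the approach used in this module. The variable names are chosen for illustration.

```python
# Two sample documents from our corpus
docs = [
    "Stray cats are running all over the place. I see 10 a day!",
    "Healthy cat food.",
]

# str() keeps the list syntax: brackets, quotes, and separating commas
as_repr = str(docs)

# " ".join() concatenates only the document text, separated by single spaces
as_text = " ".join(docs)

print(as_repr)
print(as_text)
```

Either way, all documents end up in one string, so the boundaries between individual tweets are no longer available after this step.
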
### **2.2 Remove any HTML tags**
This is useful for webscraped data. We don't have any HTML tags here, but the code is still important.

In [None]:
import re
clean = re.compile('<.*?>')
tinyCorpusStrHTML = re.sub(clean, '', tinyCorpusStr)
tinyCorpusStrHTML

"['Stray cats are running all over the place. I see 10 a day!', 'Cats are killers. They kill billions of animals a year.', 'The best food in Columbus, OH is the North Market.', 'Brand A is the best tasting cat food around. Your cat will love it.', 'Buy Brand C cat food for your cat. Brand C makes healthy and happy cats.', 'The Arnold Classic came to town this weekend. It reminds us to be healthy.', 'I have nothing to say. In summary, I have told you nothing.', 'Healthy cat food.']"

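Because our tweets contain no tags, the effect of the pattern is easier to see on a string that does. The snippet below uses a made-up scraped string (a hypothetical example) with the same regular expression.

```python
import re

# A made-up piece of scraped text containing HTML tags (hypothetical example)
scraped = "<p>Brand A is the <b>best</b> tasting cat food.</p>"

# Non-greedy match: each "<...>" tag is matched separately, up to its first ">"
tag_pattern = re.compile(r"<.*?>")
without_tags = tag_pattern.sub("", scraped)

print(without_tags)
```

This simple pattern is fine for stray tags. For full web pages, a dedicated HTML parser is usually the safer choice.
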
### **2.3 Make All Characters Lowercase**

In [None]:
tinyCorpusStrLower = tinyCorpusStrHTML.lower()  # continue from the tag-free string produced in 2.2
tinyCorpusStrLower

"['stray cats are running all over the place. i see 10 a day!', 'cats are killers. they kill billions of animals a year.', 'the best food in columbus, oh is the north market.', 'brand a is the best tasting cat food around. your cat will love it.', 'buy brand c cat food for your cat. brand c makes healthy and happy cats.', 'the arnold classic came to town this weekend. it reminds us to be healthy.', 'i have nothing to say. in summary, i have told you nothing.', 'healthy cat food.']"

### **2.4 Remove any Punctuation**
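You can do this in different ways. One way is to set up a punctuation variable and then eliminate the contents of this variable from the string. Another is to use regular expressions. We will use the second option because it is more effective. For that, we need the regular expression library called re, documented at https://docs.python.org/3/library/re.html. The pattern below removes every character that is neither a word character nor whitespace, which also strips the brackets, quotes, and commas left over from converting the list to a string.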

In [None]:
import re

tinyCorpusStrLowerPunct = re.sub(r'[^\w\s]', '', tinyCorpusStrLower)
tinyCorpusStrLowerPunct

'stray cats are running all over the place i see 10 a day cats are killers they kill billions of animals a year the best food in columbus oh is the north market brand a is the best tasting cat food around your cat will love it buy brand c cat food for your cat brand c makes healthy and happy cats the arnold classic came to town this weekend it reminds us to be healthy i have nothing to say in summary i have told you nothing healthy cat food'
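
For comparison, here is a minimal sketch of the first option mentioned above: a punctuation variable whose contents are deleted from the string. It assumes the standard library's `string.punctuation` as that variable and uses a single lowercased tweet as input.

```python
import string

text = "stray cats are running all over the place. i see 10 a day!"

# Translation table that maps every ASCII punctuation character to "deleted"
remove_punct = str.maketrans("", "", string.punctuation)
no_punct = text.translate(remove_punct)

print(no_punct)
```

The two options are not identical. `string.punctuation` covers ASCII punctuation only, while the pattern `[^\w\s]` also removes non-ASCII symbols such as curly quotes but keeps the underscore, because `\w` counts it as a word character.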

### **2.5 Remove any Numbers**
You may or may not want to do this, depending on whether the numbers in your text are important.

NOTE that `translate` is applied to the whole string at once. Looping over the string in a list comprehension would visit it one character at a time and return a list of single letters instead of text.

In [None]:
from string import digits

# Translation table that deletes every digit 0-9
remove_digits = str.maketrans('', '', digits)

tinyCorpusStrLowerPunctNum1 = tinyCorpusStrLowerPunct.translate(remove_digits)

print(tinyCorpusStrLowerPunctNum1)

## **3. Eliminate Low-Information words**
For some applications, some words provide less information than others. For example, the word *this* may be informative for some tasks. But for other tasks, like deciding whether a text is about pianos or motorcycles, the word is considered uninformative. Other examples of low-information words might be *the, a, this, that, on, of*.

Some data scientists believe that these low-information, high-frequency words constitute noise, and they remove them in a preprocessing step. The words we remove are called **stop words**.

For example, if the stop words are *a, and, be, the, will* and we have the sentence

         be a kind and compassionate person
         
we will end up with

              kind    compassionate  person
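
The same example can be written as a short filter. This is a minimal sketch using exactly the stop words and sentence from the example above.

```python
# Stop words and sentence taken from the example above
stop_words = {"a", "and", "be", "the", "will"}
sentence = "be a kind and compassionate person"

# Keep every word that is not a stop word, then rebuild the sentence
kept_words = [word for word in sentence.split() if word not in stop_words]
result = " ".join(kept_words)

print(result)
```

Storing the stop words in a set keeps each membership check fast, which matters once the stop word list and the corpus grow.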

### **3.0 Remove Noise**
BEFORE we apply any canned stopword lists, we can also decide to make additional noise lists ourselves. For example, the tokens "c", "us", and "oh" show up in our text. We will remove them.

In [None]:
noise_list = ["oh", "c", "us"]
def _remove_noise(input_text):
    words = input_text.split()
    noise_free_words = [word for word in words if word not in noise_list]
    noise_free_text = " ".join(noise_free_words)
    return noise_free_text

# _remove_noise("oh i c us here")
tinyCorpusStrLowerPunctNoise = _remove_noise(tinyCorpusStrLowerPunct)
tinyCorpusStrLowerPunctNoise

'stray cats are running all over the place i see 10 a day cats are killers they kill billions of animals a year the best food in columbus is the north market brand a is the best tasting cat food around your cat will love it buy brand cat food for your cat brand makes healthy and happy cats the arnold classic came to town this weekend it reminds to be healthy i have nothing to say in summary i have told you nothing healthy cat food'

### **3.1 Install Stopword Lists**
Stopword lists contain low-information words. We feed them into a filter in order to remove those words from the text that we want to analyze.

In [None]:
# ONLY DO THIS ONCE
# import nltk
# nltk.download('all', halt_on_error=False)

Now that we have downloaded the lists, let's take a look at the English stopwords. We will be using these shortly.

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

swords = stopwords.words('english')
len(swords)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


179

In [None]:
print(swords)

### **3.2 Remove Stopwords**

In [None]:
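import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt models ('punkt_tab' in newer NLTK releases)
nltk.download('punkt')
nltk.download('punkt_tab')
word_tokens = word_tokenize(tinyCorpusStrLowerPunctNoise)
print(word_tokens)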

In [None]:
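# Keep only the tokens that do not appear in the stopword list
tinyCorpusStrLowerPunctNoiseStop = [w for w in word_tokens if w not in swords]

# The same filter written as an explicit loop, for comparison
tinyCorpusStrLowerPunctNoiseStop = []
for w in word_tokens:
    if w not in swords:
        tinyCorpusStrLowerPunctNoiseStop.append(w)

print(tinyCorpusStrLowerPunctNoiseStop)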

## **4. Stem the Document (= Morphological Analysis)**
Words have internal structure. So *dogs* is really dog+PLURAL and *chased* is chase+PAST. This structure is called morphology, and the analysis step is called morphological analysis. For many classification tasks, we don't care whether the person wrote *cats* or *cat*, or *running* or *runs* instead of *run*. We might want to count all those variants of *run* simply as *run*. So instead of having separate attributes for *run*, *running*, and *runs*, we reduce them to one.

There are a number of stemming algorithms available to us. Here is how to use the Snowball Stemmer and the Porter Stemmer:

In [None]:
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.porter import PorterStemmer

stemmer1 = SnowballStemmer('english')
stemmer2 = PorterStemmer()

print(stemmer1.stem('cats'))
print(stemmer2.stem('cats'))

In [None]:
## NOTE that stemming does not recognize irregular verbs!

print(stemmer1.stem('running'))
print(stemmer1.stem('runned'))
print(stemmer1.stem('ran'))
print(stemmer1.stem('runs'))
print("")
print(stemmer2.stem('running'))
print(stemmer2.stem('runned'))
print(stemmer2.stem('ran'))
print(stemmer2.stem('runs'))

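Let's try this with our cats now. NOTE that a stemmer expects one word at a time, so we apply it to each token in the stopword-filtered list rather than to the whole string.

In [None]:
# Stem every token separately; passing the full string would treat it as a single word
tinyCorpusStemmed = [stemmer1.stem(w) for w in tinyCorpusStrLowerPunctNoiseStop]
print(tinyCorpusStemmed)
# Porter alternative:
# [stemmer2.stem(w) for w in tinyCorpusStrLowerPunctNoiseStop]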

A great summary of all preprocessing techniques in a text cleaning script is here: https://gist.github.com/jiahao87/d57a2535c2ed7315390920ea9296d79f
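
To show how the individual steps fit together, here is a minimal sketch that chains sections 2.2 through 3 into one function. It makes several assumptions: `clean_text` is a hypothetical helper name, `STOP_WORDS` is a tiny hand-picked stand-in for the NLTK stopword list, tokens are split on whitespace instead of using `word_tokenize`, and stemming is left out so that the sketch needs only the standard library.

```python
import re

# Hypothetical stand-ins: a tiny stop word set (not the NLTK list) and the noise list from 3.0
STOP_WORDS = {"a", "are", "the", "i", "all", "over", "is", "it", "your", "will"}
NOISE = {"oh", "c", "us"}

def clean_text(text):
    """Apply the cleaning steps in order and return a list of tokens."""
    text = re.sub(r"<.*?>", "", text)      # 2.2 remove HTML tags
    text = text.lower()                    # 2.3 lowercase
    text = re.sub(r"[^\w\s]", "", text)    # 2.4 remove punctuation
    text = re.sub(r"\d+", "", text)        # 2.5 remove numbers
    tokens = text.split()                  # split on whitespace
    # 3.0 and 3.2: drop noise tokens and stop words
    return [t for t in tokens if t not in STOP_WORDS and t not in NOISE]

cleaned = clean_text("Stray cats are running all over the place. I see 10 a day!")
print(cleaned)
```

Because the function takes one document at a time, it can be applied to each tweet in the corpus separately, which keeps the documents distinct for later steps such as vectorizing.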

# **5. Word Cloud**
Before we do anything else, let's make a word cloud that shows how the words are distributed within our corpus.

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wordcloud = WordCloud(width = 800, height = 800,
                background_color ='white',
                min_font_size = 10).generate(tinyCorpusStrLowerPunctNoise)

# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

plt.show()

# **6. Sentiment Analysis**
To see a sentiment analysis in action, follow [these instructions](https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk).