<img src="Images/PoweredTechGirlz.png" width="15%" align="right">

# Activity 3: Text Mining Harry Potter - Most Popular Words

We will be using data provided by [Bradley Boehmke](https://github.com/bradleyboehmke/harrypotter).

The goal of this class is to do a textual analysis of the seven Harry Potter books. We will use Python to discover some interesting insights that maybe nobody else in the world has realized about the Harry Potter books! In this activity we will find the most popular words and combinations of words in book 1.

<img src="Images/book_covers.png" width="60%" align="left">

In [None]:
import Helpers
from Helpers.load_data import *
from Helpers.plot_data import *
from Helpers.clean_data import *
from collections import Counter
#from wordcloud import WordCloud, STOPWORDS

# Most popular words

## Clean up

We will try to find out which words are the most popular in each book.

We already know how to split each book into words, but this time we will need to do two additional steps.

First, look at the following piece of text:

In [None]:
text = (
    'Hagrid: "You are a wizard, Harry" '
    'Harry: "I am a what?" '
    'Hagrid: "A wizard, Harry" '
)
print(text)

How can we find the most popular words in this piece of text?

As it turns out, Python has another neat tool which we can use to do that. It's called `Counter`. You can give `Counter` a list of words and it will count how many times each word appears.

To do that, we first need a list of words. In the cell below, split the text into words and assign the result to a new variable called `words`:

In [None]:
words = text.split()
print(words)

Now that we have separate words, we can pass those to a `Counter`:

In [None]:
Counter(words)

What happened? `Counter` counted `Harry"` and `Harry:` as two separate words, because the punctuation is still attached to them. It also counted `"A` and `a` as two separate words.

To fix this, we will need to remove all characters from the text which are not letters, and convert the text to lowercase.

First, let's remove special characters. We prepared a helper function that you can use to remove those special characters. It's called `remove_special_characters` :) This is how it works:

In [None]:
my_word = '"Harry!"'

print("Original word:", my_word)

clean_word = remove_special_characters(my_word)

print("Clean word:", clean_word)
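
If you are curious how a helper like this could work, here is a minimal sketch. It is only a guess at the idea, not the actual code inside `Helpers`. The name `strip_non_letters` is made up for this example, and it assumes that only the letters a-z and A-Z should be kept.

```python
import re

def strip_non_letters(word):
    # Hypothetical stand-in for the helper: delete every character
    # that is not a letter between a-z or A-Z.
    return re.sub(r"[^a-zA-Z]", "", word)

print(strip_non_letters('"Harry!"'))
print(strip_non_letters("wizard,"))
```

The real helper may treat some characters differently (for example apostrophes or accented letters), so use this sketch only to understand the idea.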

Can you do this for all words in our text? Try to do that in the cell below:

In [None]:
clean_words = []

for word in words:
    clean_words.append(remove_special_characters(word))
    
print(clean_words)

Let's try the counter again:

In [None]:
Counter(clean_words)

We still have some uppercase and lowercase words that make Python think 'A' and 'a' are two separate words. We need to convert everything to lowercase.

In Python that can be done using a string method called `lower`:

In [None]:
print("HELLO")
print("HELLO".lower())

Let's do this for the words in our text:

In [None]:
lowercase_words = []

for word in clean_words:
    lowercase_words.append(word.lower())
    
print(lowercase_words)

Let's try the counter again:

In [None]:
Counter(lowercase_words)

# It worked!

## Most popular words

Let's try to apply this to the first book. First, let's load the book:

In [None]:
book = load_book_1()

print(book[0:500])

Let's do what we just learned: split the book into words, turn all words to lowercase and remove special characters. To avoid writing the same code again, we prepared another helper function for you that does all the cleanup we need. It's called `make_clean_words`:

In [None]:
clean_words = make_clean_words(book)
    
print(clean_words[0:26])
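
To see how the three clean-up steps fit together, here is a minimal sketch that combines them in one function. It is an assumption about how such a helper could be written, not the real `make_clean_words`. The name `sketch_clean_words` is made up, and the sketch assumes that words which end up empty after cleaning should be dropped.

```python
import re

def sketch_clean_words(text):
    # Hypothetical combination of the steps from this activity:
    # split into words, remove non-letters, convert to lowercase.
    clean = []
    for word in text.split():
        word = re.sub(r"[^a-zA-Z]", "", word).lower()
        # Skip words that were made only of special characters.
        if word:
            clean.append(word)
    return clean

print(sketch_clean_words('Hagrid: "A wizard, Harry"'))
# prints ['hagrid', 'a', 'wizard', 'harry']
```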

We just cleaned the whole of book 1!

We can use `Counter` to count occurrences of all words in the book!

The list would be veeeeeeeeeeery long! So let's ask `Counter` to only give us the 10 most common words:

In [None]:
Counter(clean_words).most_common(10)

# That's it! The 10 most common words in book 1!

Something happened though. Are all of the words useful to us? Is it useful knowing that "the" is the most common word?

In English, words like "the", "a", and "I" appear very often, but don't give us any useful information. These frequently appearing words are called "stopwords". They are important because they help to structure our sentences, but they don't tell us anything about the meaning of the text.

That's why in Text Mining we usually remove those words.

We prepared another helper function that you can use -- this function removes all stopwords from a list of words. It's called `remove_stopwords`:

In [None]:
print("Before:", clean_words[0:26])
print()

no_stopwords = remove_stopwords(clean_words)

print("After:", no_stopwords[0:26])
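
Here is a minimal sketch of the idea behind removing stopwords. It is not the real `remove_stopwords` helper. The function name `drop_stopwords` and the tiny list `SAMPLE_STOPWORDS` are made up for this example; real stopword lists contain many more words.

```python
# A tiny, made-up stopword list, only for illustration.
SAMPLE_STOPWORDS = {"the", "a", "i", "you", "are", "am", "what"}

def drop_stopwords(words):
    # Keep only the words that are NOT in the stopword list.
    kept = []
    for word in words:
        if word not in SAMPLE_STOPWORDS:
            kept.append(word)
    return kept

sample = ["you", "are", "a", "wizard", "harry"]
print(drop_stopwords(sample))
# prints ['wizard', 'harry']
```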

Let's try the `Counter` again:

In [None]:
Counter(no_stopwords).most_common(10)

# Yay! That worked!

Can you tell who the most frequently mentioned students in book 1 are?

Let's try to do this for book 2!

We prepared a bit of code to help you. Fill in the missing lines:

In [None]:
book = load_book_2()

clean_words = make_clean_words(book)

no_stopwords = remove_stopwords(clean_words)

Counter(no_stopwords).most_common(10)

Remember that in the previous activity we used a function called `plot` to visualize the results? Let's try that again:

In [None]:
top_words = Counter(no_stopwords).most_common(10)

plot_words(top_words)
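
`most_common` returns a list of (word, count) pairs. As a rough illustration of that structure, this sketch prints a simple text chart for a short list of sample words. It is not how `plot_words` is written; the list `sample_words` is made up for this example.

```python
from collections import Counter

# A short, made-up list of words, only for illustration.
sample_words = ["a", "wizard", "harry", "a", "wizard", "a"]
top_words = Counter(sample_words).most_common(2)

# Each entry is a (word, count) pair; draw one '#' per occurrence.
for word, count in top_words:
    print(word.ljust(8), "#" * count)
```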

Try plotting the top words of other books by changing the cells above.

## Word pairs

What about combinations of words?

So far we have been looking at a single word at a time. What if we want to find how often two words appear in the text together? For example "professor lockhart".

Remember our text from the start of this activity?

In [None]:
print(text)

We prepared a function which takes a list of words and turns those words into pairs:

In [None]:
words = text.split()

pairs = get_word_pairs(words)

Counter(pairs).most_common(10)
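
One possible way to build word pairs is to walk through the list and join each word with the word that follows it. The sketch below shows that idea. It is an assumption, not the real `get_word_pairs`: the name `make_pairs` is made up, and the real helper may return the pairs in a different form.

```python
from collections import Counter

def make_pairs(words):
    # Hypothetical sketch: zip the list with itself shifted by one,
    # so every word is paired with the word right after it.
    pairs = []
    for first, second in zip(words, words[1:]):
        pairs.append(first + " " + second)
    return pairs

sample = ["a", "wizard", "harry", "a", "wizard"]
sample_pairs = make_pairs(sample)
print(sample_pairs)
print(Counter(sample_pairs).most_common(1))
# prints [('a wizard', 2)]
```

A list of n words always gives n - 1 pairs, because the last word has no word after it.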

Let's do this for book 1.

In [None]:
book = load_book_1()

clean_words = make_clean_words(book)
    
no_stopwords = remove_stopwords(clean_words)

word_pairs = get_word_pairs(no_stopwords)

Counter(word_pairs).most_common(10)

In [None]:
plot_words(Counter(word_pairs).most_common(10))

# Word Clouds

You might remember word clouds from the last class. Since all the books have already been read into the list `books`, we can easily create word clouds for every book.

Which book is used in the code below?

Run it for the different books or copy the code, so you can have a word cloud for all seven books in this one notebook.

In [None]:
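%matplotlib inline
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# `books` is the list of book texts that was loaded earlier.
wc = WordCloud(width=1000, height=1000, background_color="white", stopwords=STOPWORDS).generate(books[5])

# Show the word cloud as an image, without axes.
plt.figure(figsize=(10, 10))
plt.axis("off")
plt.imshow(wc);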

## Yeah, you made it through Activity 3!

Feel free to experiment with the notebook to learn even more about the Harry Potter books. For example, you can try other word clouds or find the most common word pairs in the other books.