# Edgar Allan Poe Word Cloud

This is a simple exercise in getting familiar with the wordcloud library.  Some code snippets were borrowed from Ken Jee's YouTube video "Data Science Project From Scratch Part 4 - (Exploratory Data Analysis)": https://www.youtube.com/watch?v=QWgg4w1SpJ8&pbjreload=101

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from wordcloud import WordCloud, ImageColorGenerator, STOPWORDS
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
##Load the data in a dataframe
Poe = pd.read_csv('../input/poe-short-stories-corpuscsv/preprocessed_data.csv')

##Load the image into a numpy array
##This will give us a mask for the word cloud and a color scheme
Portrait = np.array(Image.open('../input/poesilhouette/PoeSilhouette.jpg'))

##Visualize some data
Poe.head()

The text column contains the full text of each short story.  We'll combine all of those stories into a single string for processing.  The preprocess function then tokenizes that string, drops punctuation and common English stop words, and lowercases the tokens that remain.

In [None]:
##create a single variable with all text from all short stories
words = ' '.join(Poe.text)

In [None]:
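def preprocess(text):
    ##remove punctuation and stop words, and lowercase the remaining tokens
    result = []
    stops = set(stopwords.words('english'))
    tokens = word_tokenize(text)
    for t in tokens:
        ##lowercase before the lookup: the NLTK stop list is all lowercase,
        ##so a capitalized 'The' would otherwise slip through
        t = t.lower()
        if t not in stops and t.isalpha():
            result.append(t)
    return result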

In [None]:
filtered_words = preprocess(words)
##rejoin all filtered words into a single string
filtered_text = ' '.join(filtered_words)
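
The order of lowercasing and the stop-word lookup matters, because the NLTK English stop list is all lowercase.  The sketch below illustrates this on one sentence.  To stay dependency-free it makes two assumptions that are not part of the notebook: a tiny hand-written stop list (`STOPS`) stands in for `stopwords.words('english')`, and a whitespace split stands in for `word_tokenize`.  The helper `filter_tokens` is hypothetical as well.

```python
# Hypothetical mini stop list, standing in for stopwords.words('english')
STOPS = {"the", "a", "of", "and", "was"}

def filter_tokens(tokens, lowercase_first):
    """Keep alphabetic tokens that are not stop words; return them lowercased."""
    kept = []
    for t in tokens:
        # Either compare the token as-is or compare its lowercase form
        candidate = t.lower() if lowercase_first else t
        if candidate not in STOPS and t.isalpha():
            kept.append(t.lower())
    return kept

# Whitespace split stands in for word_tokenize in this sketch
tokens = "The raven sat above the door and The bust of Pallas".split()

case_sensitive = filter_tokens(tokens, lowercase_first=False)
case_insensitive = filter_tokens(tokens, lowercase_first=True)

print(case_sensitive)    # capitalized 'The' survives the lookup
print(case_insensitive)  # every form of 'the' is removed
```

With a case-sensitive lookup, capitalized stop words at the start of sentences end up in the output; lowercasing first removes them.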

### Wordcloud
The WordCloud object is configured with a background color and a mask.  wordcloud also has its own list of stop words (STOPWORDS) to further filter out unimportant words.  The mask limits the area where text can appear: pure white pixels are masked out, and everything else is free to draw on.  We're using a black and white silhouette here, so the words should appear only where the Portrait image is black.  When a mask is given, the cloud takes its size from the mask, and the width and height arguments are ignored.
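
To make the mask idea concrete, here is a minimal sketch that uses a synthetic array in place of the silhouette.  It relies on the wordcloud documentation's description of pure white (255) entries as masked out; the 6x6 array and the names `mask` and `drawable` are made up for illustration and are not part of the notebook.

```python
import numpy as np

# Hypothetical 6x6 greyscale mask: 255 = white (masked out), 0 = black (drawable)
mask = np.full((6, 6), 255, dtype=np.uint8)
mask[1:5, 2:4] = 0  # a black 4x2 rectangle stands in for the silhouette

# Pixels that are not pure white are the ones a word cloud may draw on
drawable = mask != 255

print(drawable.sum())   # number of drawable pixels
print(drawable.mean())  # fraction of the canvas that is drawable
```

A similar check on a real image gives a rough idea of how much room the words have.  A color image has a third axis for the channels, so the comparison would need to be reduced across that axis first.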

In [None]:
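##create Word Cloud object
##note: when a mask is supplied, wordcloud takes the canvas size from the mask and ignores width and height
wc = WordCloud(background_color='white', mask=Portrait, stopwords=STOPWORDS,
               max_words=500, width=1080, height=720)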
wc.generate(filtered_text)

In [None]:
##create color_imprint
color_imprint = ImageColorGenerator(Portrait)
##Recolor the word cloud using the colors in the imprint
wc.recolor(color_func=color_imprint)
##export the wordcloud to a file
wc.to_file('PoeCloud.png')

In [None]:
##Visualize the wordcloud
plt.figure(figsize=(10, 10))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()