### Table of Contents

1. [Getting Started](#Getting-Started)
  - [File descriptions](#File-descriptions)
2. [Background of the problem domain](#Background)
3. [Objective](#Objective)
4. [Datasets](#Datasets)
  - [About authors](#About-authors)
  - [Import libraries](#Import-libraries)
  - [Load datasets](#Load-datasets)
5. [Data Exploration](#Data-Exploration)
  - [Data fields](#Data-fields)
  - [Quick statistics by grouping authors](#Quick-statistics-by-grouping-authors)
  - [Data distribution by each author](#Data-distribution-by-each-author)
  - [Distribution of semantics](#Distribution-of-semantics)
  - [Wordcloud](#Wordcloud)
6. [Data Preprocessing](#Data-Preprocessing)
  - [Binarizing target variable](#Binarizing-target-variable)
  - [Scrubbing of data](#Scrubbing-of-data)
7. [Shuffle and split data](#Shuffle-and-split-data)
8. [Word vectors](#Word-vectors)
9. [Evaluating model performance](#Evaluating-model-performance)
  - [Multinomial Naive Bayes](#Multinomial-Naive-Bayes)
  - [Logistic Regression](#Logistic-Regression)
  - [SVM](#SVM)
  - [Stochastic Gradient Descent Classifier](#Stochastic-Gradient-Descent-Classifier)
  - [XGBoost](#XGBoost)
  - [Model performance conclusion](#Model-performance-conclusion)
10. [Parameter tuning using GridSearchCV](#Parameter-tuning)
11. [Model training](#Model-training)
  - [Fit and predict the data](#Fit-and-predict-the-data)
  - [Accuracy score](#Accuracy-score)
  - [Classification report](#Classification-report)
  - [ROC and AUC curves](#ROC-and-AUC-curves)
12. [Predict unexplored data](#Predict-unexplored-data)
  - [Predict](#Predict)
  - [Reflect author names](#Reflect-author-names)
13. [Conclusion](#Conclusion)
  - [Compare heatmaps from the prediction](#Compare-heatmaps-from-the-prediction)
  - [Compare Wordclouds from the prediction](#Compare-Wordclouds-from-the-prediction)

### Getting Started

The dataset contains text from works of fiction written by spooky authors of the public domain: Edgar Allan Poe, HP Lovecraft and Mary Shelley. The data was prepared by chunking larger texts into sentences using CoreNLP's MaxEnt sentence tokenizer, so you may notice the odd non-sentence here and there. Your objective is to accurately identify the author of the sentences in the test set.

##### File descriptions
`train.csv` - the training set

`test.csv` - the test set

`sample_submission.csv` - a sample submission file in the correct format

### Background

Since ancient times, discoveries, patents and analyses have been documented in one form or another, whether as wall paintings, engravings or written scripts.

Older documents are still consulted today to continue research in fields such as literature, experimental science and mythology.

However, anyone relying on these documents needs to be sure that they are authentic, and many anonymous or unclassified documents still exist.

So, how were these documents verified? In the past, verification was entirely manual. Patterns such as writing style and language were used to work out where a document came from and who might have written it, and whether similar documents had been identified before.

This is broadly known as stylometry, or authorship attribution. Stylometry is the study of linguistic style and is applied in academic and literary work, as well as to music and the fine arts.

Today, the anonymous nature of online messaging also makes identity tracing a critical problem in a society where so much happens online.

### Objective

Scan through the test documents and identify the potential author of each document.

### Datasets

Authorship attribution is a large field with applications across many domains, including forensics, mythology, online messages and ancient artefacts.

For now, we shall limit our practice to a small dataset that covers only three authors:
- Edgar Allan Poe [EAP]
- Howard Phillips Lovecraft [HPL]
- Mary Wollstonecraft Shelley [MWS]

##### About authors

_Edgar Allan Poe_ was an American writer, editor, and literary critic. He is best known for his poetry and short stories, particularly his tales of mystery and the macabre.

_Howard Phillips Lovecraft_ was an American writer who achieved posthumous fame through his influential works of horror fiction.

_Mary Wollstonecraft Shelley_ was an English novelist, short story writer, dramatist, essayist, biographer, and travel writer, best known for her Gothic novel Frankenstein: or, The Modern Prometheus.

##### Import libraries

In [2]:
# !pip install wordcloud
# !pip install nltk

# Load libraries #

import csv
import json

import numpy
import pandas
from time import time

from matplotlib import pyplot
import seaborn as sns

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer

##### Load datasets

In [3]:
# Load all data #

authored_contents = pandas.read_csv("../input/train.csv")

unauthored_contents = pandas.read_csv("../input/test.csv")

In [4]:
authored_contents.head()

In [5]:
unauthored_contents.head()

In [6]:
authored_contents.shape

### Data Exploration

##### Data fields

`id` - a unique identifier for each sentence

`text` - some text written by one of the authors and is our feature.

`author` - the author of the sentence (EAP: Edgar Allan Poe, HPL: HP Lovecraft; MWS: Mary Wollstonecraft Shelley). Hence, this is our target variable.

##### Quick statistics by grouping authors

In [7]:
training_records = len(authored_contents)

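# Look counts up by label instead of relying on the sort order of value_counts()
author_counts = authored_contents.author.value_counts()
author_eap, author_mws, author_hpl = author_counts['EAP'], author_counts['MWS'], author_counts['HPL']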

print("Total number of authored contents: ", training_records)
print("Total number of authored contents by EAP: ", author_eap)
print("Total number of authored contents by MWS: ", author_mws)
print("Total number of authored contents by HPL: ", author_hpl)

Below, we shall study the text length in the training dataset.

In [8]:
# grab text length of each contents

authored_contents['text_length'] = authored_contents['text'].str.len()

In [9]:
authored_contents.head()

##### Data distribution by each author

In [10]:
pyplot.figure(figsize=(14,5))
sns.countplot(x=authored_contents['author'])
pyplot.xlabel('Author')
pyplot.title('Target variable distribution')
pyplot.show()

Now, we group the documents by author to see the exact counts.

In [11]:
authored_contents.groupby('author').size()

Below, we shall study the text length in the testing dataset.

In [12]:
# examine the same in test data

testing_records = len(unauthored_contents)

unauthored_contents['text_length'] = unauthored_contents['text'].str.len()

In [13]:
unauthored_contents.head()

##### Distribution of semantics

Let's study the patterns in the documents.

In the code cell below, we shall gather a few statistics to understand the writing patterns of the authors. We shall evaluate the following parameters for each document:
- Number of words
- Number of unique words
- Number of characters
- Number of stopwords
- Number of punctuations
- Number of words in uppercase
- Number of words in titlecase
- Mean length of each word

This is to study the pattern of writing of each given author.
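
To make the listed statistics concrete, here is a minimal sketch that computes them for a single made-up sentence. The helper name `sentence_stats` and the tiny `SAMPLE_STOPWORDS` set are assumptions introduced only for this illustration; the notebook itself uses NLTK's English stop-word list.

```python
import string

# Hypothetical, tiny stop-word set used only to keep this example self-contained;
# the notebook itself relies on NLTK's English stop-word list.
SAMPLE_STOPWORDS = {"it", "was", "a", "the", "of", "and"}

def sentence_stats(text):
    """Return surface-level style statistics for a single sentence."""
    words = text.split()
    return {
        "num_words": len(words),
        "num_uniq_words": len(set(words)),
        "num_chars": len(text),
        "num_stopwords": sum(1 for w in words if w.lower() in SAMPLE_STOPWORDS),
        "num_punctuations": sum(1 for ch in text if ch in string.punctuation),
        "num_words_upper": sum(1 for w in words if w.isupper()),
        "num_words_title": sum(1 for w in words if w.istitle()),
        "mean_word_len": sum(len(w) for w in words) / len(words),
    }

stats = sentence_stats("It was a dark, dark night.")
print(stats)
```

Because the sentence is split on whitespace, `dark,` and `dark` count as two different words, which is also how the notebook's `num_uniq_words` behaves.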

In [14]:
def text_len(df):
    # Build the stop-word set once instead of once per word
    english_stopwords = set(stopwords.words('english'))

    df['num_words'] = df['text'].apply(lambda x: len(str(x).split()))
    df['num_uniq_words'] = df['text'].apply(lambda x: len(set(str(x).split())))
    df['num_chars'] = df['text'].apply(lambda x: len(str(x)))
    df['num_stopwords'] = df['text'].apply(lambda x: len([w for w in str(x).lower().split()
                                                          if w in english_stopwords]))
    df['num_punctuations'] = df['text'].apply(lambda x: len([w for w in str(x) if w in string.punctuation]))
    df['num_words_upper'] = df['text'].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
    df['num_words_title'] = df['text'].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
    df['mean_word_len'] = df['text'].apply(lambda x: numpy.mean([len(w) for w in str(x).split()]))
    # Same value as num_chars, kept under its original column name
    df['num_character_len'] = df['text'].apply(lambda x: len(str(x)))

In [15]:
text_len(authored_contents)
text_len(unauthored_contents)

In [16]:
def plot_heatmap(df):
    # One figure per statistic: (column, x-axis label, colour map, show x tick labels)
    panels = [
        ('num_words', 'Original text word count', 'gist_earth', True),
        ('num_uniq_words', 'Unique text word count', 'gist_heat', True),
        ('num_punctuations', 'Punctuations', 'gist_heat', True),
        ('mean_word_len', 'Mean word length', 'gist_heat', False),
    ]

    for column, label, cmap, show_ticks in panels:
        # Start a fresh figure each time so the plots do not overwrite one another
        pyplot.figure(figsize=(14, 3))
        sns.heatmap(pandas.crosstab(df['author'], df[column]), cmap=cmap, xticklabels=show_ticks)
        pyplot.xlabel(label)
        pyplot.ylabel('Author')
        pyplot.tight_layout()
        pyplot.show()

In [17]:
plot_heatmap(authored_contents)

These heatmaps show how word count, unique word count, punctuation use and mean word length are distributed for each author. They do not tell us which words each author prefers.

To get a better sense of each author's literary interests, we look at their word choices next.

##### Wordcloud

In [18]:
eap_documents = authored_contents[authored_contents.author == 'EAP']['text'].values
hpl_documents = authored_contents[authored_contents.author == 'HPL']['text'].values
mws_documents = authored_contents[authored_contents.author == 'MWS']['text'].values

In [19]:
eap_words = " ".join(eap_documents)
hpl_words = " ".join(hpl_documents)
mws_words = " ".join(mws_documents)

In [20]:
from wordcloud import WordCloud, STOPWORDS

In [21]:
pyplot.figure(figsize=(16,13))

wordcloud = WordCloud(relative_scaling = 1.0, stopwords = STOPWORDS, max_font_size= 35)
wordcloud.generate(eap_words)
pyplot.imshow(wordcloud.recolor(colormap= 'Pastel2' , random_state=17), alpha=0.98)
pyplot.axis('off')
pyplot.show()

In [22]:
pyplot.figure(figsize=(16,13))

wordcloud = WordCloud(relative_scaling = 1.0, stopwords = STOPWORDS, max_font_size= 35)
wordcloud.generate(hpl_words)
pyplot.imshow(wordcloud.recolor(colormap= 'Pastel2' , random_state=17), alpha=0.98)
pyplot.axis('off')
pyplot.show()

In [23]:
pyplot.figure(figsize=(16,13))

wordcloud = WordCloud(relative_scaling = 1.0, stopwords = STOPWORDS, max_font_size= 35)
wordcloud.generate(mws_words)
pyplot.imshow(wordcloud.recolor(colormap= 'Pastel2' , random_state=17), alpha=0.98)
pyplot.axis('off')
pyplot.show()

These wordclouds tell us more about the authors and the themes they favour.

__Edgar Allan Poe__ was particularly known for tales of mystery and the macabre, so words like _death, soul, life, corpse, spirit and shadow_ reflect his interests well.

__Howard Phillips Lovecraft__ was famous for horror fiction, so words like _fear, horror and body_ reflect his interests.

__Mary Wollstonecraft Shelley__ was a novelist, short story writer, dramatist, essayist, biographer and travel writer, so words like _love, affection, happiness, misery, despair, beauty and pleasure_ reflect her interests.

Since these three authors each have their own largely distinct areas of interest, we now understand the dataset better.

We can later check the unlabelled dataset against these themes once its authors have been predicted.

### Data Preprocessing

Typically, learning algorithms expect numeric input, so non-numeric features (called categorical variables) must be converted.

The same applies to the non-numeric target label, `author`, which must be converted to numerical values for the learning algorithm to work. Since there are only three possible categories for this label (`EAP`, `HPL` and `MWS`), we can avoid one-hot encoding and simply encode these three categories as 0, 1 and 2 respectively.

In the code cell below, we implement the following:

- Convert the target label `author` to numerical entries. Set records with "EAP" to 0, records with "HPL" to 1 and records with "MWS" to 2, and store the result in the `numerical_author` column.
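
As a small illustration of this encoding, the sketch below applies the mapping to a few made-up labels and then inverts it, which is the same idea used later to turn numeric predictions back into author names. The names `AUTHOR_TO_ID` and `ID_TO_AUTHOR` are assumptions for this example only.

```python
# Mapping used to encode the three author labels as integers.
AUTHOR_TO_ID = {"EAP": 0, "HPL": 1, "MWS": 2}

# Inverse mapping, useful for turning numeric predictions back into labels.
ID_TO_AUTHOR = {idx: name for name, idx in AUTHOR_TO_ID.items()}

labels = ["EAP", "MWS", "HPL", "EAP"]  # made-up sample labels
encoded = [AUTHOR_TO_ID[a] for a in labels]
decoded = [ID_TO_AUTHOR[i] for i in encoded]

print(encoded)
print(decoded)
```

The integers act only as class identifiers here; the classifiers used in this notebook do not treat them as ordered quantities.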

##### Binarizing target variable

In [24]:
authored_contents['numerical_author'] = authored_contents.author.map({ 'EAP': 0, 'HPL': 1, 'MWS': 2 })

In [25]:
# Quick view of preprocessing

authored_contents[['text', 'author', 'numerical_author']].head()

In [26]:
authored_contents.head()

We will now clean the texts by removing punctuation, tokenizing, removing stop words and stemming. This will reduce the feature set.

##### Scrubbing of data

In [27]:
# Use a set for fast stop-word lookups
all_stopwords = set(stopwords.words('english'))
ps = PorterStemmer()

# Translation table that deletes every punctuation character
punctuation_table = str.maketrans('', '', string.punctuation)

def scrub_text(data_frame):
    sentences = []
    for sentence in data_frame['text']:
        # remove all punctuation
        sentence = sentence.translate(punctuation_table)

        # break sentence into words
        array_of_words = word_tokenize(sentence)

        # remove all English stopwords
        array_of_words = [word for word in array_of_words if word.lower() not in all_stopwords]

        # stem each word in array_of_words
        array_of_words = [ps.stem(word) for word in array_of_words]
        cleaned_sentence = ' '.join(array_of_words)

        sentences.append(cleaned_sentence)

    return sentences
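
In Python 3, `str.translate` expects a translation table rather than a plain string of characters to drop. The punctuation-removal step can be sketched on its own as follows; `strip_punctuation` and `PUNCT_TABLE` are names made up for this example, and the sample sentence is only illustrative.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(sentence):
    """Return the sentence with ASCII punctuation removed."""
    return sentence.translate(PUNCT_TABLE)

sample = "Once upon a midnight dreary, while I pondered, weak and weary..."
print(strip_punctuation(sample))
```

Only the characters in `string.punctuation` are removed, so typographic quotes or dashes outside ASCII would survive this step.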

In [28]:
# Run the #scrub_text over the text in the training and testing datasets.

training_cleaned_texts = scrub_text(authored_contents)
testing_cleaned_texts = scrub_text(unauthored_contents)

We save the cleaned text into a `scrubbed_text` column for both `authored_contents` and `unauthored_contents`.

However, we will not use it for modelling, since cleaning would defeat the purpose of analysing the style of the texts as they were written.

Later, we compare classifier performance on both versions of the data, i.e. `scrubbed_text` and `text`.

In [29]:
authored_contents['scrubbed_text'] = training_cleaned_texts

unauthored_contents['scrubbed_text'] = testing_cleaned_texts

In [30]:
# Define labels and features set

X = authored_contents['text']
Y = authored_contents['numerical_author']

### Shuffle and split data

We will now split the data (both features and their labels) into training and test sets. 80% of the data will be used for training and 20% for testing.

In [31]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,
                                                    Y,
                                                    test_size = 0.2,
                                                    random_state = 0)

In [32]:
# Show the results of the split

# Features
print("Training set has " + str(X_train.shape[0]) + " features.")
print("Testing set has " + str(X_test.shape[0]) + " features.")

# Labels
print("Training set has " + str(Y_train.shape[0]) + " labels.")
print("Testing set has " + str(Y_test.shape[0]) + " labels.")

print("\nPrinting labels set...")
print(Y_train.value_counts())

print(Y_test.value_counts())

###### Stratified split:

We will now split the data a second time, using the `stratify` option of `train_test_split`, which relies on `sklearn.model_selection.StratifiedShuffleSplit` internally.

Unlike the plain shuffle above, a stratified split keeps the proportion of each author the same in the training and testing sets, which matters when the authors contribute different numbers of sentences.

The later sections use this stratified split (`X2_train`, `X2_test`, `Y2_train`, `Y2_test`).
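
A minimal sketch of what stratification does, using made-up imbalanced labels rather than the real dataset: the class proportions in the held-out part mirror those of the full data. All variable names below are assumptions for this example.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 50 x class 0, 30 x class 1, 20 x class 2.
labels = [0] * 50 + [1] * 30 + [2] * 20
samples = [f"sentence {i}" for i in range(len(labels))]

train_x, test_x, train_y, test_y = train_test_split(
    samples, labels, test_size=0.2, stratify=labels, random_state=42
)

# Both parts keep the 50/30/20 class ratio of the full data.
print(Counter(train_y))
print(Counter(test_y))
```

Without `stratify`, the per-class counts in the test part would vary with the random seed.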

In [33]:
X2_train, X2_test, Y2_train, Y2_test = train_test_split(X,
                                                        Y,
                                                        stratify = Y,
                                                        test_size = 0.2,
                                                        random_state = 42)

In [34]:
# Show the results of the split

# Features
print("Training set has " + str(X2_train.shape[0]) + " features.")
print("Testing set has " + str(X2_test.shape[0]) + " features.")

# Labels
print("\nTraining set has " + str(Y2_train.shape[0]) + " labels.")
print("Testing set has " + str(Y2_test.shape[0]) + " labels.")

print("\nPrinting labels set...")
print(Y2_train.value_counts())

print(Y2_test.value_counts())

### Word vectors

We shall now tokenize the words in each document and build a vocabulary from them.

We are still working on the original texts, which contain punctuation and un-stemmed words.
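
To show what the vectorizer produces, here is a small sketch on two made-up documents. With its default settings, `CountVectorizer` lowercases the text and keeps only tokens of two or more alphanumeric characters, so punctuation and the single-letter word "A" are dropped even though the input is raw text. The names `docs` and `toy_vectorizer` are assumptions for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, used only for illustration.
docs = ["The raven, the raven!", "A dark dark night."]

toy_vectorizer = CountVectorizer()
counts = toy_vectorizer.fit_transform(docs)

# The vocabulary maps each token to its column index (alphabetical order).
print(sorted(toy_vectorizer.vocabulary_.items(), key=lambda kv: kv[1]))

# One row per document, one column per vocabulary entry.
print(counts.toarray())
```

Each row of the resulting matrix is the bag-of-words representation of one document.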

In [35]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(X2_train)

training_vectorizer = vectorizer.transform(X2_train)

In [36]:
# Quick view of vectors from the texts

training_vectorizer.toarray()

In [37]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(len(vectorizer.get_feature_names_out()))

In [38]:
# run vectorizer for X2_test

testing_vectorizer = vectorizer.transform(X2_test)

As per the above transformation, `training_vectorizer` is our feature set.

### Evaluating model performance

The following supervised learning models will be evaluated on this problem and dataset.

- Multinomial Naive Bayes
- Logistic Regression
- XGBoost from the ensemble methods
- SVM
- Stochastic Gradient Descent Classifier (SGDC)
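
The evaluation cells that follow all repeat the same pattern: build a k-fold splitter, time `cross_val_score`, and report the mean accuracy. As a sketch only, that pattern could be wrapped in a helper such as the hypothetical `evaluate` below. The synthetic count matrix stands in for the document-term matrix and is not the notebook's data.

```python
from time import time

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

def evaluate(model, features, labels, n_splits=5, seed=7):
    """Return (mean accuracy, elapsed seconds) from k-fold cross-validation."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    start = time()
    scores = cross_val_score(model, features, labels, cv=folds)
    return scores.mean(), time() - start

# Synthetic count data standing in for the document-term matrix.
rng = np.random.default_rng(0)
toy_labels = rng.integers(0, 3, size=150)
toy_features = rng.poisson(lam=1.0, size=(150, 20))
# Make the first three columns informative about the class.
for cls in range(3):
    toy_features[toy_labels == cls, cls] += 3

for name, model in [("MultinomialNB", MultinomialNB()),
                    ("LogisticRegression", LogisticRegression(max_iter=1000))]:
    accuracy, seconds = evaluate(model, toy_features, toy_labels)
    print(f"{name}: accuracy={accuracy:.3f}, time={seconds:.3f}s")
```

Timings from such a helper depend on the machine, so they are best read as relative comparisons between models.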

In [39]:
# Include libraries to evaluate performances on the attached dataset

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

##### Multinomial Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB

kfold = KFold(n_splits = 10, shuffle = True, random_state = 7)  # random_state only takes effect with shuffle=True

model = MultinomialNB()
start = time()
results = cross_val_score(model, training_vectorizer, Y2_train, cv=kfold)
end = time()

print("Mean value: " + str(results.mean()))
print("Training time: " + str((end - start)))

##### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

kfold = KFold(n_splits = 10, shuffle = True, random_state = 7)  # random_state only takes effect with shuffle=True

model = LogisticRegression()
start = time()
results = cross_val_score(model, training_vectorizer, Y2_train, cv=kfold)
end = time()

print("Mean value: " + str(results.mean()))
print("Training time: " + str(end - start))

##### SVM

In [None]:
from sklearn.svm import SVC

kfold = KFold(n_splits = 10, shuffle = True, random_state = 7)  # random_state only takes effect with shuffle=True

model = SVC()
start = time()
results = cross_val_score(model, training_vectorizer, Y2_train, cv=kfold)
end = time()

print("Mean value: " + str(results.mean()))
print("Training time: " + str(end - start))

##### Stochastic Gradient Descent Classifier

In [None]:
from sklearn.linear_model import SGDClassifier

kfold = KFold(n_splits = 10, shuffle = True, random_state = 7)  # random_state only takes effect with shuffle=True

model = SGDClassifier()
start = time()
results = cross_val_score(model, training_vectorizer, Y2_train, cv=kfold)
end = time()

print("Mean value: " + str(results.mean()))
print("Training time: " + str(end - start))

##### XGBoost

In [None]:
from xgboost import XGBClassifier

model = XGBClassifier()
start = time()
results = cross_val_score(model, training_vectorizer, Y2_train, cv=kfold)
end = time()

print("Mean value: " + str(results.mean()))
print("Training time: " + str(end - start))

#### Statistics:

| Algorithm | Mean accuracy | Training time (s) | Observation |
|---|---|---|---|
| Multinomial Naive Bayes | 0.836 | 0.141 | Highest accuracy with the least training time |
| Logistic Regression | 0.821 | 9.789 | Second-highest accuracy with moderate training time |
| SGDC | 0.799 | 0.432 | Slightly lower accuracy with low training time |
| XGBoost | 0.635 | 84.103 | Low accuracy with unacceptable training time |
| SVM | 0.403 | 528.707 | Lowest accuracy with the highest training time |

#### Classifier performances over scrubbed_text and text

Now, let's look at the performance of the classifiers when the texts are stripped of punctuation, lowercased and stemmed (the `scrubbed_text` column).

This is the approach I tried first.

| Algorithm | Mean accuracy | Training time (s) | Observation |
|---|---|---|---|
| Multinomial Naive Bayes | 0.823 | 0.093 | Highest accuracy with the least training time |
| Logistic Regression | 0.805 | 2.654 | Second-highest accuracy with moderate training time |
| SGDC | 0.794 | 0.322 | Slightly lower accuracy with low training time |
| XGBoost | 0.612 | 52.510 | Low accuracy with unacceptable training time |
| SVM | 0.403 | 299.708 | Lowest accuracy with the highest training time |

##### Model performance conclusion

Comparing the two sets of statistics, Multinomial Naive Bayes is more accurate on the raw text (0.836) than on the cleaned text (0.823).

We shall proceed with the Multinomial Naive Bayes classifier.

### Parameter tuning

Next, we fine-tune the chosen model using grid search (`GridSearchCV`), tuning at least one important parameter with at least 3 different values. We use the entire training set for this.

In the code cell below, we shall implement the following:

- Import `sklearn.model_selection.GridSearchCV` and `sklearn.metrics.make_scorer`.
- Initialize the chosen classifier and store it in `model`.
- Create a dictionary of parameters to tune for the chosen model.
  Example: parameters = {'parameter' : [list of values]}.
- Use `make_scorer` to create an `fbeta_score` scoring object (with $\beta = 0.5$). Because there are three classes, `fbeta_score` also needs an `average` setting.
- Perform grid search on the classifier `model` using the `scorer`, and store it in `grid_obj`.
- Fit the grid search object to the training data (`training_vectorizer, Y2_train`), and store it in `grid_fit`.
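
The steps above can be sketched end to end on synthetic data. This is an illustration under stated assumptions, not the notebook's actual run: the count matrix is randomly generated, the `toy_` names are made up, and `average="weighted"` is one possible choice for the multi-class F-beta score.

```python
import numpy as np
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic counts standing in for the document-term matrix.
rng = np.random.default_rng(0)
toy_labels = rng.integers(0, 3, size=150)
toy_features = rng.poisson(lam=1.0, size=(150, 20))
for cls in range(3):
    toy_features[toy_labels == cls, cls] += 3

alphas = [0.01, 0.1, 0.5, 1.0, 5.0, 10.0]

# average="weighted" is an assumption: fbeta_score needs some averaging
# strategy when there are more than two classes.
toy_scorer = make_scorer(fbeta_score, beta=0.5, average="weighted")

toy_grid = GridSearchCV(MultinomialNB(), {"alpha": alphas}, scoring=toy_scorer, cv=5)
toy_grid.fit(toy_features, toy_labels)

print(toy_grid.best_params_)
print(round(toy_grid.best_score_, 3))
```

If no `scoring` argument is passed, `GridSearchCV` falls back to the estimator's default score, which is accuracy for classifiers.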

In [42]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.naive_bayes import MultinomialNB

In [43]:
model = MultinomialNB()

parameters = { 'alpha': [0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0] }

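# fbeta_score needs an averaging strategy for multi-class targets; 'weighted' is one reasonable choice
scorer = make_scorer(fbeta_score, beta=0.5, average='weighted')

grid_obj = GridSearchCV(model, parameters, scoring=scorer)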

grid_fit = grid_obj.fit(training_vectorizer, Y2_train)

# Get the estimator
best_clf = grid_fit.best_estimator_

grid_fit.best_params_

### Model training

##### Fit and predict the data

In [44]:
mnb = MultinomialNB(alpha = 0.5)

In [45]:
mnb.fit(training_vectorizer, Y2_train)

In [46]:
mnb.predict_proba(training_vectorizer)

In [47]:
mnb.feature_log_prob_

In [48]:
Y_train_prediction = mnb.predict(training_vectorizer)

In [49]:
Y_test_prediction = mnb.predict(testing_vectorizer)

##### Accuracy score

In [50]:
# calculate score for predicted data against testing data
from sklearn import metrics

# compare predicted resultset with the test set
metrics.accuracy_score(Y2_test, Y_test_prediction)

In [51]:
# calculate score for predicted data against training data

metrics.accuracy_score(Y2_train, Y_train_prediction)

In [52]:
# Calculate confusion matrix

metrics.confusion_matrix(Y2_test, Y_test_prediction)

In [53]:
# calculate predicted probabilities for the held-out test split
y_pred_prob = mnb.predict_proba(testing_vectorizer)
y_pred_prob[:10]

##### Classification report

In [54]:
print(classification_report(Y2_train, Y_train_prediction, target_names=['EAP', 'HPL', 'MWS']))

##### ROC and AUC curves

In [56]:
fpr, tpr, thresholds = metrics.roc_curve(Y2_train, Y_train_prediction, pos_label = 1)

print("Multinomial naive bayes AUC: " + str(metrics.auc(fpr, tpr)))

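Precision, recall and F1 score are reported in the classification report above.

The AUC metric was reported to be 0.535. Note that this figure treats HPL (label 1) as the positive class and uses the hard predicted labels 0, 1 and 2 as scores, so it is not a reliable measure for this three-class problem. An AUC computed from the predicted probabilities would be more informative.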
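
For a three-class problem, one common alternative is a one-vs-rest AUC computed from class probabilities. The sketch below shows the idea on synthetic data. The `toy_` names and the generated counts are assumptions for illustration, and the score it prints says nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

# Synthetic counts standing in for the document-term matrix.
rng = np.random.default_rng(0)
toy_labels = rng.integers(0, 3, size=150)
toy_features = rng.poisson(lam=1.0, size=(150, 20))
for cls in range(3):
    toy_features[toy_labels == cls, cls] += 3

toy_model = MultinomialNB().fit(toy_features, toy_labels)
toy_proba = toy_model.predict_proba(toy_features)  # shape (150, 3)

# One-vs-rest AUC, macro-averaged over the three classes.
toy_auc = roc_auc_score(toy_labels, toy_proba, multi_class="ovr")
print(round(toy_auc, 3))
```

In the notebook, the same call would take `Y2_test` and `mnb.predict_proba(testing_vectorizer)`, so that the score reflects held-out data rather than the training set.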

### Predict unexplored data

##### Predict

In [58]:
# vectorise the unauthored_contents

unpredicted_texts = unauthored_contents['text']

unpredicted_texts_vectorizer = vectorizer.transform(unpredicted_texts)

In [59]:
unpredicted_texts_vectorizer

In [60]:
unpredicted_texts_prediction = mnb.predict(unpredicted_texts_vectorizer)

In [61]:
# calculate predicted probabilities for the unlabelled texts
predicted_prob = mnb.predict_proba(unpredicted_texts_vectorizer)
predicted_prob[:10]

In [62]:
unpredicted_texts_prediction

In [63]:
len(unauthored_contents)

In [64]:
numerical_authors = pandas.DataFrame(unpredicted_texts_prediction, columns=['num_author'])

predicted_unauthored_contents = pandas.concat([unauthored_contents, numerical_authors], axis=1)

##### Reflect author names

In [65]:
predicted_unauthored_contents['author'] = predicted_unauthored_contents.num_author.map({ 0: 'EAP', 1: 'HPL', 2: 'MWS' })

In [66]:
predicted_unauthored_contents.groupby('author').size()

In [67]:
predicted_unauthored_contents.head()

### Conclusion

##### Compare heatmaps from the prediction

In [69]:
# Plotting over the test.csv that we ran the prediction above to study the analysis.

plot_heatmap(predicted_unauthored_contents)

In [71]:
# Comparing the heatmap that we plotted above using `authored_contents`

plot_heatmap(authored_contents)

Comparing the two sets of heatmaps:

1. Punctuation: the plots look the same for both datasets.

2. Unique words used: the distributions also look the same for both datasets.

##### Compare Wordclouds from the prediction

In [72]:
predicted_eap_documents = predicted_unauthored_contents[predicted_unauthored_contents.author == 'EAP']['text'].values
predicted_hpl_documents = predicted_unauthored_contents[predicted_unauthored_contents.author == 'HPL']['text'].values
predicted_mws_documents = predicted_unauthored_contents[predicted_unauthored_contents.author == 'MWS']['text'].values

In [73]:
predicted_eap_words = " ".join(predicted_eap_documents)
predicted_hpl_words = " ".join(predicted_hpl_documents)
predicted_mws_words = " ".join(predicted_mws_documents)

In [74]:
pyplot.figure(figsize=(16,13))

wordcloud = WordCloud(relative_scaling = 1.0, stopwords = STOPWORDS, max_font_size= 35)
wordcloud.generate(predicted_eap_words)
pyplot.imshow(wordcloud.recolor(colormap= 'Pastel2' , random_state=17), alpha=0.98)
pyplot.axis('off')
pyplot.show()

In [75]:
pyplot.figure(figsize=(16,13))

wordcloud = WordCloud(relative_scaling = 1.0, stopwords = STOPWORDS, max_font_size= 35)
wordcloud.generate(predicted_hpl_words)
pyplot.imshow(wordcloud.recolor(colormap= 'Pastel2' , random_state=17), alpha=0.98)
pyplot.axis('off')
pyplot.show()

In [76]:
pyplot.figure(figsize=(16,13))

wordcloud = WordCloud(relative_scaling = 1.0, stopwords = STOPWORDS, max_font_size= 35)
wordcloud.generate(predicted_mws_words)
pyplot.imshow(wordcloud.recolor(colormap= 'Pastel2' , random_state=17), alpha=0.98)
pyplot.axis('off')
pyplot.show()

In [77]:
## Closing notes to come...