# Standdown Exercise

The cell below loads the raw text of a set of famous books from NLTK's Gutenberg corpus and stores it in the variable `nltk_books`.

In [4]:
# Run cell with no changes

import nltk
import pandas as pd
nltk.download('gutenberg')

# store raw text of books in a list
nltk_books = [nltk.corpus.gutenberg.raw(title) 
                 for title in nltk.corpus.gutenberg.fileids()]

# convert list to dataframe with titles as the index.
nltk_books = pd.DataFrame(nltk_books, 
                          index=nltk.corpus.gutenberg.fileids(),
                          columns=['raw_text'] )

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\soohy\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\gutenberg.zip.


In [7]:
nltk_books.head()

| | raw_text |
|---|---|
| austen-emma.txt | [Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAP... |
| austen-persuasion.txt | [Persuasion by Jane Austen 1818]\n\n\nChapter ... |
| austen-sense.txt | [Sense and Sensibility by Jane Austen 1811]\n\... |
| bible-kjv.txt | [The King James Bible]\n\nThe Old Testament of... |
| blake-poems.txt | [Poems by William Blake 1789]\n\n \nSONGS OF I... |


The next cell splits the books into training and test sets. The split itself is arbitrary; it is here to remind you that a vectorizer is fit only on the training set.

In [5]:
# Run cell with no changes
from sklearn.model_selection import train_test_split

train, test = train_test_split(nltk_books, random_state=42)
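
To make the "fit only on the training set" reminder concrete, here is a minimal sketch using toy sentences rather than the actual books. The names `train_docs` and `test_docs` are hypothetical stand-ins for the `raw_text` columns of `train` and `test`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the training and test texts (hypothetical data).
train_docs = ["the whale swam in the sea", "the ship sailed the sea"]
test_docs = ["the whale saw a submarine"]

vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and idf weights from the training texts only.
train_matrix = vectorizer.fit_transform(train_docs)

# transform reuses that vocabulary; words never seen in training are ignored.
test_matrix = vectorizer.transform(test_docs)

print(train_matrix.shape)
print(test_matrix.shape)
print("submarine" in vectorizer.vocabulary_)
```

Both matrices end up with the same number of columns because the vocabulary is fixed at fit time, which is what keeps the test set from leaking into the features.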


In [6]:
# Here are the books whose full texts compose the training set
train.index

Index(['milton-paradise.txt', 'shakespeare-macbeth.txt',
       'shakespeare-hamlet.txt', 'edgeworth-parents.txt', 'austen-sense.txt',
       'chesterton-brown.txt', 'whitman-leaves.txt', 'blake-poems.txt',
       'melville-moby_dick.txt', 'carroll-alice.txt',
       'chesterton-thursday.txt', 'shakespeare-caesar.txt',
       'burgess-busterbrown.txt'],
      dtype='object')

Your task is to fit a `TfidfVectorizer` to the training set with the following specifications: set `max_features` to 50 and remove stop words using the NLTK English stop word list. Leave all other parameters at their defaults.
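
For intuition about what the default settings compute, the sketch below reproduces the weighting by hand on a toy corpus. It assumes scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`, raw term counts), under which idf is `ln((1 + n) / (1 + df)) + 1`. The corpus and the helper `tfidf_row` are hypothetical, and the sketch tokenizes on whitespace, which is simpler than the vectorizer's own tokenizer.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each string stands in for one book.
docs = [
    "whale whale sea ship",
    "sea ship ship captain",
    "sea garden garden tea",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
vocab = sorted({t for doc in tokenized for t in doc})

# Document frequency: number of documents containing each term.
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed idf, as used when smooth_idf=True.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Return L2-normalized tf-idf weights for one document."""
    counts = Counter(tokens)
    raw = {t: counts[t] * idf[t] for t in vocab}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

rows = [tfidf_row(doc) for doc in tokenized]
top_term = max(rows[0], key=rows[0].get)
print(top_term, round(rows[0][top_term], 3))
```

A term that is frequent in one document but rare across the others gets the largest weight, while a term that appears in every document keeps the minimum idf of 1.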

**After fitting the vectorizer, find the word with the highest tf-idf score in Moby Dick. Post both the word and its tf-idf score in Slack, along with a link to your forked repo showing your work.**

> Hint: Converting the vectorized text into a DataFrame with column names and an index will make your life easier. Use the following hints to make that happen:  
>> 1. The TF-IDF vectorizer returns a sparse matrix. Call the `toarray()` method on that matrix, then convert the resulting array into a DataFrame.  

>> 2. The fitted TF-IDF object has a method called `get_feature_names_out()` (named `get_feature_names()` in scikit-learn versions before 1.0). Pass the result of that method as the `columns` argument of `DataFrame`.  

>> 3. Pass `train.index` as the `index` argument of `DataFrame`.   
    



In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

sw = stopwords.words('english')

tfidf = TfidfVectorizer(max_features=50, stop_words=sw)

In [11]:
train_vectorized = tfidf.fit_transform(train['raw_text'])

In [12]:
train_vectorized

<13x50 sparse matrix of type '<class 'numpy.float64'>'
	with 594 stored elements in Compressed Sparse Row format>

In [15]:
train.index

Index(['milton-paradise.txt', 'shakespeare-macbeth.txt',
       'shakespeare-hamlet.txt', 'edgeworth-parents.txt', 'austen-sense.txt',
       'chesterton-brown.txt', 'whitman-leaves.txt', 'blake-poems.txt',
       'melville-moby_dick.txt', 'carroll-alice.txt',
       'chesterton-thursday.txt', 'shakespeare-caesar.txt',
       'burgess-busterbrown.txt'],
      dtype='object')

In [17]:
import pandas as pd

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
train_vectorized_df = pd.DataFrame(train_vectorized.toarray(),
                                   columns=tfidf.get_feature_names_out(),
                                   index=train.index)
train_vectorized_df.head()

| | come | could | day | every | first | go | good | great | head | know | ... | thy | time | two | upon | us | way | well | whale | would | yet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| milton-paradise.txt | 0.03229 | 0.056932 | 0.101968 | 0.033749 | 0.159724 | 0.019544 | 0.089222 | 0.104517 | 0.049285 | 0.075626 | ... | 0.436628 | 0.034839 | 0.039938 | 0.034069 | 0.197221 | 0.068828 | 0.060331 | 0.0 | 0.049285 | 0.193739 |
| shakespeare-macbeth.txt | 0.211892 | 0.062783 | 0.074555 | 0.0 | 0.050577 | 0.043163 | 0.192273 | 0.121642 | 0.035315 | 0.137338 | ... | 0.272733 | 0.180501 | 0.043163 | 0.0 | 0.0 | 0.066707 | 0.12949 | 0.0 | 0.207968 | 0.223664 |
| shakespeare-hamlet.txt | 0.276094 | 0.082297 | 0.066369 | 0.0 | 0.048475 | 0.127428 | 0.260165 | 0.05044 | 0.069023 | 0.188487 | ... | 0.296547 | 0.116809 | 0.058404 | 0.0 | 0.0 | 0.029202 | 0.188487 | 0.011961 | 0.193797 | 0.098226 |
| edgeworth-parents.txt | 0.119129 | 0.167489 | 0.106941 | 0.038551 | 0.067991 | 0.112839 | 0.179677 | 0.095539 | 0.042069 | 0.165916 | ... | 0.001952 | 0.124241 | 0.053864 | 0.250642 | 0.10394 | 0.058189 | 0.154514 | 0.0 | 0.200908 | 0.044428 |
| austen-sense.txt | 0.057571 | 0.326235 | 0.084663 | 0.264103 | 0.097 | 0.055313 | 0.099902 | 0.084099 | 0.020884 | 0.130946 | ... | 0.0 | 0.134897 | 0.086356 | 0.073925 | 0.063048 | 0.041767 | 0.135461 | 0.0 | 0.290677 | 0.044025 |


In [28]:
# Select the Moby Dick row, then sort its tf-idf scores from highest to lowest
train_vectorized_df.loc['melville-moby_dick.txt'].sort_values(ascending=False)

whale      0.767525
one        0.255945
upon       0.210215
like       0.179800
man        0.146453
sea        0.145935
old        0.125054
would      0.120052
though     0.106713
thou       0.100650
head       0.095875
yet        0.095875
time       0.092818
long       0.092540
still      0.086704
great      0.085037
said       0.084481
two        0.082814
last       0.082683
every      0.080021
must       0.078645
us         0.078641
see        0.075588
way        0.075311
never      0.071053
first      0.070146
little     0.069197
say        0.067807
men        0.067807
may        0.066696
much       0.066564
well       0.063917
could      0.060026
good       0.060026
go         0.053912
thee       0.052490
thing      0.052245
might      0.050855
come       0.049744
made       0.049466
day        0.048910
let        0.043908
know       0.042241
thought    0.041685
thy        0.038976
think      0.033904
make       0.031125
mr         0.029643
shall      0.028954
mrs        0.006674
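
If only the top word and its score are needed, sorting the whole row is optional. The sketch below shows one way to read them off directly with `idxmax()` and `max()`; the small `scores` frame is hypothetical data standing in for `train_vectorized_df`.

```python
import pandas as pd

# Hypothetical tf-idf scores: one row per book, one column per term.
scores = pd.DataFrame(
    {"sea": [0.15, 0.40], "whale": [0.77, 0.00], "would": [0.12, 0.29]},
    index=["book_a.txt", "book_b.txt"],
)

row = scores.loc["book_a.txt"]  # Series of term -> score for one book
top_word = row.idxmax()         # label (term) of the largest score
top_score = row.max()           # the largest score itself

print(top_word, top_score)
```

`idxmax()` returns the first label in case of a tie, so a full sort remains useful when several terms score equally.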
