 # Who are the Bossy Words?

 In this activity you will use TF-IDF to find the most relevant words in news articles about money from the [Reuters Corpus](https://www.nltk.org/book/ch02.html#reuters-corpus) bundled with `NLTK`. Once you have found the most relevant words, you will create a word cloud from them.

In [1]:
# initial imports
import nltk
from nltk.corpus import reuters
import numpy as np
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

plt.style.use("seaborn-whitegrid")
mpl.rcParams["figure.figsize"] = [20.0, 10.0]


 ## Loading the Reuters Corpus

 The first step is to load the Reuters Corpus.

In [2]:
# Download/update the Reuters dataset
nltk.download("reuters")


[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\themi\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!


True

 ## Getting the News About Money

 You will analyze only the news articles that talk about _money_. Two categories in the Reuters Corpus cover money: `money-fx` and `money-supply`. In this section, you will filter the news by these categories.

 Take a look at the [Reuters Corpus documentation](https://www.nltk.org/book/ch02.html#reuters-corpus) to see how to retrieve the categories of a document with the `reuters.categories()` method, then write a few lines of code to retrieve all the news articles under the `money-fx` or `money-supply` categories.

 **Hint:**
 You can use a list comprehension or a for-loop to accomplish this task.
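
 As a minimal sketch of the filtering logic in the hint, the snippet below uses a small hand-made dictionary in place of the corpus. The document ids and the `doc_categories` mapping are hypothetical stand-ins for what `reuters.fileids()` and `reuters.categories(doc_id)` would return, so treat it as an illustration of the pattern rather than the solution itself.

```python
# Hypothetical stand-in for the corpus: document id -> list of categories.
# With NLTK, reuters.categories(doc_id) would supply each list instead.
doc_categories = {
    "test/001": ["money-fx"],
    "test/002": ["grain", "wheat"],
    "test/003": ["money-supply", "interest"],
    "training/004": ["crude"],
    "training/005": ["money-fx", "money-supply"],
}

target_categories = ["money-fx", "money-supply"]

# Keep a document when at least one of its categories is a target category.
money_doc_ids = [
    doc_id
    for doc_id, doc_cats in doc_categories.items()
    if any(category in target_categories for category in doc_cats)
]

print(money_doc_ids)
```

 NLTK can also do this filtering for you: `reuters.fileids()` accepts a list of categories and returns only the matching document ids.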

In [7]:
# Getting all documents ids under the money-fx and money-supply categories
categories = ["money-fx", "money-supply"]
all_docs_id = reuters.fileids(categories)



In [11]:
reuters.fileids(["money-fx", "money-supply"])

['test/14849',
 'test/14861',
 'test/14890',
 'test/14913',
 'test/14919',
 'test/14931',
 'test/14964',
 'test/14987',
 'test/15048',
 'test/15212',
 'test/15234',
 'test/15242',
 'test/15246',
 'test/15253',
 'test/15364',
 'test/15375',
 'test/15378',
 'test/15431',
 'test/15436',
 'test/15442',
 'test/15444',
 'test/15448',
 'test/15449',
 'test/15450',
 'test/15452',
 'test/15453',
 'test/15460',
 'test/15510',
 'test/15522',
 'test/15523',
 'test/15527',
 'test/15539',
 'test/15552',
 'test/15615',
 'test/15621',
 'test/15625',
 'test/15636',
 'test/15656',
 'test/15677',
 'test/15689',
 'test/15694',
 'test/15950',
 'test/15951',
 'test/15976',
 'test/15977',
 'test/15978',
 'test/15987',
 'test/15989',
 'test/15996',
 'test/16004',
 'test/16006',
 'test/16009',
 'test/16053',
 'test/16063',
 'test/16066',
 'test/16067',
 'test/16068',
 'test/16069',
 'test/16072',
 'test/16083',
 'test/16106',
 'test/16111',
 'test/16133',
 'test/16143',
 'test/16161',
 'test/16177',
 'test/161

In [9]:
# Creating the working corpus containing the text from all the news articles about money
money_news_ids = reuters.fileids(["money-fx", "money-supply"])
corpus = [reuters.raw(doc_id) for doc_id in money_news_ids]
# Printing a sample article
doc_id = "test/14849"
doc_text = reuters.raw(doc_id)
print(doc_text)

BUNDESBANK ALLOCATES 6.1 BILLION MARKS IN TENDER
  The Bundesbank accepted bids for 6.1
  billion marks at today's tender for a 28-day securities
  repurchase pact at a fixed rate of 3.80 pct, a central bank
  spokesman said.
      Banks, which bid for a total 12.2 billion marks liquidity,
  will be credited with the funds allocated today and must buy
  back securities pledged on May 6.
      Some 14.9 billion marks will drain from the market today as
  an earlier pact expires, so the Bundesbank is effectively
  withdrawing a net 8.1 billion marks from the market with
  today's allocation.
      A Bundesbank spokesman said in answer to enquiries that the
  withdrawal of funds did not reflect a tightening of credit
  policy, but was to be seen in the context of plentiful
  liquidity in the banking system.
      Banks held an average 59.3 billion marks at the Bundesbank
  over the first six days of the month, well clear of the likely
  April minimum reserve requirement of 51 billion mark

 ## Calculating the TF-IDF Weights

 Calculate the TF-IDF weight for each word in the working corpus using the `TfidfVectorizer()` class. Remember to include the `stop_words='english'` parameter.
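
 To make the weighting less of a black box, here is a rough, hand-rolled sketch of TF-IDF on a made-up toy corpus. It assumes raw term counts, the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document, which are the defaults documented for `TfidfVectorizer`. It skips tokenization and stop-word removal, so it is an illustration and not a replacement for the library class.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration), already lower-cased and tokenized.
docs = [
    ["bank", "rate", "yen"],
    ["bank", "rate", "rate", "dollar"],
    ["bank", "money", "supply"],
]

n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Return {term: weight} for one document, L2-normalized."""
    counts = Counter(doc)
    # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


weights = [tfidf(doc) for doc in docs]

# Sum each term's weight over all documents to get one score per term.
totals = Counter()
for doc_weights in weights:
    totals.update(doc_weights)

for term, score in totals.most_common():
    print(f"{term}: {score:.3f}")
```

 Within a single document, a term that appears in every article (`bank`) ends up with a lower weight than an equally frequent term that appears in only one (`yen`), which is the behaviour TF-IDF is designed to give.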

In [12]:
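# Calculating TF-IDF weights with TfidfVectorizer.
vectorizer = TfidfVectorizer(stop_words="english")
# Only the sample article is vectorized here; pass the texts of all the money
# articles instead to weight the terms across the whole working corpus.
X = vectorizer.fit_transform([doc_text])
# get_feature_names_out() returns a NumPy array; convert it to a plain list
words = vectorizer.get_feature_names_out().tolist()
print(words)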

['12', '14', '28', '51', '59', '80', 'accepted', 'accruing', 'agreement', 'allocated', 'allocates', 'allocation', 'answer', 'april', 'average', 'bank', 'banking', 'banks', 'bid', 'bidding', 'bids', 'billion', 'blunt', 'bundesbank', 'buy', 'central', 'clear', 'context', 'credit', 'credited', 'currently', 'day', 'days', 'dealers', 'did', 'drain', 'earlier', 'effectively', 'effectiveness', 'enquiries', 'expires', 'felt', 'fixed', 'fluctuations', 'funds', 'held', 'instrument', 'keen', 'likely', 'liquidity', 'main', 'market', 'marks', 'minimum', 'money', 'month', 'net', 'noted', 'open', 'outgoing', 'outside', 'pact', 'pacts', 'pct', 'pledged', 'plentiful', 'plenty', 'policy', 'possible', 'prevent', 'range', 'rate', 'rates', 'reflect', 'repurchase', 'requirement', 'reserve', 'said', 'securities', 'security', 'seen', 'short', 'shown', 'spokesman', 'steering', 'tender', 'term', 'tightening', 'today', 'total', 'weeks', 'withdrawal', 'withdrawing']




 Create a DataFrame representation of the TF-IDF weights of each term in the working corpus. Use the `sum(axis=0)` method to calculate a measure similar to the term frequency based on the TF-IDF weights; this value will be used to rank the terms when creating the word cloud.

In [6]:
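# Creating a DataFrame Representation of the TF-IDF results
money_news_df = pd.DataFrame(
    {"Word": words, "Frequency": np.asarray(X.sum(axis=0)).ravel()}
)
# Order the DataFrame by word frequency in descending order
money_news_df = money_news_df.sort_values(by="Frequency", ascending=False)
# Print the top 10 words
money_news_df.head(10)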

 ## Retrieving the Top Words

 To create the word cloud you first need to get the top words. Here we will use a rule of thumb, empirically tested by some NLP experts, which states that words with a frequency between 10 and 30 might be the most relevant in a corpus.

 Following this rule, create a new DataFrame containing only the words whose frequency falls in that range.

In [None]:
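# Top words will be those with a frequency between 10 and 30 (rule of thumb)
# Assumes money_news_df has "Word" and "Frequency" columns
top_words = money_news_df[money_news_df["Frequency"].between(10, 30)]

top_words.head(10)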


 ## Creating Word Cloud

 Now you have all the pieces needed to create a word cloud based on TF-IDF weights, so use the `WordCloud` library to create it.

In [None]:
# Create a string list of terms to generate the word cloud
terms_list = str(top_words["Word"].tolist())

# Create the word cloud
wordcloud = WordCloud().generate(terms_list)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()


 ## Challenge: Looking for Documents that Contain Top Words

 Finally, you might find it interesting to search for the articles that contain the most relevant words. Create a function called `retrieve_docs(terms)` that receives a list of terms as a parameter and extracts from the working corpus all the news articles that contain the search terms. In this function you should use the `reuters.words()` method to retrieve the tokenized version of each article, as shown in the [Reuters Corpus documentation](https://www.nltk.org/book/ch02.html#reuters-corpus).

 **Hint:** To find any occurrence of the search terms, you might find [this post on StackOverflow](https://stackoverflow.com/a/25102099/4325668) quite useful. You should also lowercase all the words to make the term search easier.
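
 The matching idea from the hint can be sketched on toy data before touching the corpus. The `toy_articles` dictionary and the `find_docs` helper below are hypothetical names made up for this illustration; in the real function each token list would come from `reuters.words(doc_id)`.

```python
# Hypothetical tokenized articles: document id -> list of tokens.
# With NLTK, reuters.words(doc_id) would supply each token list instead.
toy_articles = {
    "doc-1": ["The", "Yen", "rose", "against", "the", "dollar"],
    "doc-2": ["Japanese", "banks", "cut", "rates"],
    "doc-3": ["Wheat", "exports", "fell"],
}


def find_docs(terms, articles):
    """Return ids of articles where any search term occurs inside any token."""
    terms = [term.lower() for term in terms]
    return [
        doc_id
        for doc_id, tokens in articles.items()
        if any(term in token.lower() for token in tokens for term in terms)
    ]


print(find_docs(["yen"], toy_articles))
print(find_docs(["japan", "wheat"], toy_articles))
```

 Because the check is `term in token.lower()`, it is a substring match, so `"japan"` also matches the token `"Japanese"`. If you only want whole-word matches, compare with `token.lower() == term` instead.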

In [None]:
def retrieve_docs(terms):



 ### Question 1: How many articles talk about Yen?

In [None]:
len(retrieve_docs(["yen"]))


### Question 2: How many articles talk about Japan or Banks?

In [None]:
len(retrieve_docs(["japan", "banks"]))


 ### Question 3: How many articles talk about England or Dealers?

In [None]:
len(retrieve_docs(["england", "dealers"]))


In [None]:
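def retrieve_docs(terms, all_docs_id=None):
    # Default to the working corpus so the function also works with a single argument
    if all_docs_id is None:
        all_docs_id = reuters.fileids(["money-fx", "money-supply"])
    result_docs = []
    for doc_id in all_docs_id:  # ids are passed in instead of read from a global
        # Substring match on lower-cased tokens, so "japan" also matches "japanese"
        if any(term in word.lower() for word in reuters.words(doc_id) for term in terms):
            result_docs.append(doc_id)
    return result_docs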