 # Who are the Bossy Words?

 In this activity, you will use TF-IDF to find the most relevant words in news articles about money from the [Reuters Corpus](https://www.nltk.org/book/ch02.html#reuters-corpus) bundled with `NLTK`. Once you have found the most relevant words, you will create a word cloud.

In [None]:
# initial imports
import nltk
from nltk.corpus import reuters
import numpy as np
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

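# "seaborn-whitegrid" was renamed to "seaborn-v0_8-whitegrid" in matplotlib 3.6
whitegrid = "seaborn-v0_8-whitegrid"
plt.style.use(whitegrid if whitegrid in plt.style.available else "seaborn-whitegrid")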
mpl.rcParams["figure.figsize"] = [20.0, 10.0]


 ## Loading the Reuters Corpus

 The first step is to load the Reuters Corpus.

In [None]:
# Download/update the Reuters dataset
nltk.download("reuters")


 ## Getting the News About Money

 You will analyze only the news articles that talk about _money_. Two categories in the Reuters Corpus cover money: `money-fx` and `money-supply`. In this section, you will filter the news by these categories.

 Take a look at the [Reuters Corpus documentation](https://www.nltk.org/book/ch02.html#reuters-corpus) and check how to retrieve the categories of a document using the `reuters.categories()` method. Then write a few lines of code to retrieve all the news articles that fall under the `money-fx` or `money-supply` categories.

 **Hint:**
 You can use a list comprehension or a for-loop to accomplish this task.
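
As a hedged illustration of the hint, the sketch below applies a list comprehension to a small, made-up mapping of document ids to categories. The name `toy_categories` is hypothetical and only stands in for what `reuters.categories(doc_id)` would return for each id; it is not the solution on the real corpus.

```python
# Hypothetical stand-in for the corpus: document id -> list of categories.
toy_categories = {
    "doc/1": ["money-fx"],
    "doc/2": ["grain", "wheat"],
    "doc/3": ["money-supply", "interest"],
    "doc/4": ["trade"],
}

categories = ["money-fx", "money-supply"]

# Keep an id when at least one of its categories is in the target list.
money_ids = [
    doc_id
    for doc_id, doc_categories in toy_categories.items()
    if any(category in categories for category in doc_categories)
]

print(money_ids)
```

On the real corpus, the same pattern should carry over by iterating over `reuters.fileids()` and asking for the categories of each id.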

In [None]:
# Getting all document ids under the money-fx and money-supply categories
categories = ["money-fx", "money-supply"]
all_docs_id = reuters.fileids()



In [None]:
# Creating the working corpus containing the text from all the news articles about money

# Printing a sample article


 ## Calculating the TF-IDF Weights

 Calculate the TF-IDF weight for each word in the working corpus using the `TfidfVectorizer()` class. Remember to include the `stop_words='english'` parameter.
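
To make the mechanics concrete, here is a minimal sketch on a made-up three-sentence corpus (`toy_corpus` is hypothetical, not the Reuters data). It assumes scikit-learn is installed and only shows the shape of the result and the effect of the stop-word list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up miniature corpus standing in for the money news articles.
toy_corpus = [
    "The central bank raised rates.",
    "The yen fell against the dollar.",
    "The bank bought dollars to support the yen.",
]

vectorizer = TfidfVectorizer(stop_words="english")

# fit_transform learns the vocabulary and returns a sparse matrix
# with one row per document and one column per term.
tfidf_matrix = vectorizer.fit_transform(toy_corpus)

print(tfidf_matrix.shape)
# vocabulary_ maps each learned term to its column index.
print(sorted(vectorizer.vocabulary_))
```

Common words such as "the" are dropped by the stop-word list, so they never get a column in the matrix.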

In [None]:
# Calculating TF-IDF for the working corpus.



 Create a DataFrame representation of the TF-IDF weights of each term in the working corpus. Use the `sum(axis=0)` method to add up the weights of each term across all documents. The result is a measure similar to term frequency, but based on the TF-IDF weight, and it will be used to rank the terms for the word cloud.
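
The column-wise sum can be sketched on a tiny, made-up weight matrix. All numbers and the names `weights`, `ranking_df`, and the `Frequency` column are assumptions for illustration only; the real values come from the vectorizer.

```python
import numpy as np
import pandas as pd

# Made-up TF-IDF weights: 3 documents (rows) x 4 terms (columns).
terms = ["bank", "dollar", "rate", "yen"]
weights = np.array([
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.7, 0.0, 0.6],
    [0.4, 0.6, 0.0, 0.5],
])

# One column per term, then add up each column across all documents.
weights_df = pd.DataFrame(weights, columns=terms)
term_totals = weights_df.sum(axis=0)

# Reshape into a two-column table and rank from highest to lowest total.
ranking_df = (
    term_totals.rename_axis("Word")
    .reset_index(name="Frequency")
    .sort_values(by="Frequency", ascending=False, ignore_index=True)
)

print(ranking_df)
```

If the weights come from `TfidfVectorizer`, `fit_transform` returns a SciPy sparse matrix, so one option is to convert it with `.toarray()` before building the DataFrame.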

In [None]:
# Creating a DataFrame Representation of the TF-IDF results

# Order the DataFrame by word frequency in descending order

# Print the top 10 words
money_news_df.head(10)


 ## Retrieving the Top Words

 To create the word cloud, you need to pick the top words. Here we will use a rule of thumb, which some NLP practitioners report works well empirically: words with a frequency between 10 and 30 tend to be the most relevant in a corpus. In this activity, "frequency" means the summed TF-IDF weight of each term.

 Following this rule, create a new DataFrame containing only the words whose frequency falls in that range.
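
One possible way to apply such a range filter is `Series.between`, sketched here on a made-up ranking table. The words, the numbers, and the `Frequency` column name are assumptions for illustration.

```python
import pandas as pd

# Made-up ranking table; "Frequency" holds hypothetical summed TF-IDF weights.
ranking_df = pd.DataFrame({
    "Word": ["said", "bank", "yen", "dollar", "rate"],
    "Frequency": [45.2, 28.7, 14.1, 10.0, 6.3],
})

# Keep rows whose value lies between 10 and 30.
# By default, `between` includes both bounds.
top_words_demo = ranking_df[ranking_df["Frequency"].between(10, 30)]

print(top_words_demo)
```

Because both bounds are inclusive, a word with a value of exactly 10 or 30 is kept.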

In [None]:
# Top words will be those with a frequency between 10 and 30 (rule of thumb)


top_words.head(10)


 ## Creating Word Cloud

 Now you have all the pieces needed to create a word cloud based on TF-IDF weights, so use the `WordCloud` library to create it.

In [None]:
# Join the terms into a single space-separated string to generate the word cloud
terms_list = " ".join(top_words["Word"].tolist())

# Create the word cloud



 ## Challenge: Looking for Documents that Contain Top Words

 Finally, you might find it interesting to search for the articles that contain the most relevant words. Create a function called `retrieve_docs(terms)` that receives a list of terms as a parameter and extracts from the working corpus all the news articles that contain any of the search terms. In this function, use the `reuters.words()` method to retrieve the tokenized version of each article, as shown in the [Reuters Corpus documentation](https://www.nltk.org/book/ch02.html#reuters-corpus).

 **Hint:** To find any occurrence of the search terms, you might find [this post on StackOverflow](https://stackoverflow.com/a/25102099/4325668) quite useful. You should also lowercase all the words to make the search easier.
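
The search pattern from the hint can be sketched on made-up tokenized articles. Both `toy_docs` and `find_docs` are hypothetical names; `toy_docs` stands in for the token lists that `reuters.words(doc_id)` would return, and the function takes the documents as an explicit argument, unlike the `retrieve_docs(terms)` you are asked to write.

```python
# Made-up tokenized articles standing in for the real corpus.
toy_docs = {
    "doc/1": ["The", "Yen", "rose", "against", "the", "dollar", "."],
    "doc/2": ["Banks", "in", "Japan", "cut", "rates", "."],
    "doc/3": ["Wheat", "exports", "fell", "."],
}


def find_docs(terms, docs):
    """Return the ids of documents containing at least one of the terms."""
    wanted = [term.lower() for term in terms]
    matches = []
    for doc_id, tokens in docs.items():
        # Lowercase every token so the comparison ignores case.
        lowered = {token.lower() for token in tokens}
        # any() stops at the first search term found in the document.
        if any(term in lowered for term in wanted):
            matches.append(doc_id)
    return matches


print(find_docs(["yen"], toy_docs))
print(find_docs(["japan", "banks"], toy_docs))
```

Building a set of lowercased tokens per document keeps each membership check cheap, which may matter when the same function runs over the whole working corpus.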

In [None]:
def retrieve_docs(terms):



 ### Question 1: How many articles talk about Yen?

In [None]:
len(retrieve_docs(["yen"]))


### Question 2: How many articles talk about Japan or Banks?

In [None]:
len(retrieve_docs(["japan", "banks"]))


 ### Question 3: How many articles talk about England or Dealers?

In [None]:
len(retrieve_docs(["england", "dealers"]))
