# CIS 545 - Big Data Analytics - Fall 2019

# Homework 1: Data Wrangling and Cleaning
# Due Date: September 25, 2019 at 10pm

We all know that cryptocurrencies are all the rage today.  Could we train an algorithm to tell the difference between a webpage about cryptocurrency and a webpage about something else?

This initial assignment goes over some of the basic steps in (1) acquiring data from the web, (2) acquiring tabular data, (3) cleaning and linking data, and (4) training a simple machine learning classifier.  Along the way you'll learn a few of the basic tools, and get a very basic understanding of one way to represent documents.

**Note: You do not need to connect your local runtime to do this assignment!**

In [0]:
# Standard pip install...  Put all of your to-install packages here.
# Depending on your configuration, you may need to change pip3 to pip
#!pip install scrapy
#!pip install lxml
#!pip install scikit-learn
#!pip install swifter
#!pip install nltk

In [0]:
# Standard imports; it's cleaner to put them here so they can be used
# throughout the notebook

import pandas as pd
import numpy as np
from lxml import etree
import sqlite3
import swifter
import urllib
import urllib.request
import re

import nltk
from nltk import classify
from nltk import NaiveBayesClassifier
from nltk.stem import PorterStemmer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

## Task 1: Acquiring data for training our system

First let's get some information about what's a cryptocurrency.  For that -- there's always [Wikipedia](https://en.wikipedia.org/wiki/List_of_cryptocurrencies)!

But of course it won't give us the data exactly the way we want it, so we'll need to do a bit of information extraction and data wrangling. We will also try to get current price levels from [Yahoo](https://finance.yahoo.com/cryptocurrencies).

### Task 1.1: Fetch the list of pages from Wikipedia and put it into a dataframe

First we'll get the master table of "known" cryptocurrencies. Use the `read_html()` function from `pandas`. 

In [0]:
# TODO:
# (1) Fetch files from Wikipedia:  https://en.wikipedia.org/wiki/List_of_cryptocurrencies
# (2) Parse into a dataframe called cryptocurrency_df

# YOUR CODE HERE

cryptocurrency_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_cryptocurrencies')[0]

#raise NotImplementedError()

display(cryptocurrency_df)

Next, do the same for the following two sites. Yahoo returns at most 100 prices per request, which is why we need two queries.

In [0]:
# TODO: Make two price dataframes from
# price_1_df: https://finance.yahoo.com/cryptocurrencies/?count=100&offset=0
# price_2_df: https://finance.yahoo.com/cryptocurrencies/?count=100&offset=100

# YOUR CODE HERE

price_1_df = pd.read_html('https://finance.yahoo.com/cryptocurrencies/?count=100&offset=0')[0]
price_2_df = pd.read_html('https://finance.yahoo.com/cryptocurrencies/?count=100&offset=100')[0]

#raise NotImplementedError()

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
price_df = pd.concat([price_1_df, price_2_df])

display(price_df)

In [0]:
# Quick sanity check 1.1 for cryptocurrency_df: does it have the columns from the Wikipedia table?

if not 'Currency' in cryptocurrency_df:
    raise AssertionError('Expected column called "Currency"')
    
if not 'Founder(s)' in cryptocurrency_df:
    raise AssertionError('Expected column called "Founder(s)"')

display(cryptocurrency_df)

In [0]:
# Hidden tests 1.1 for autograding cryptocurrency_df - don't delete this cell please!


### Task 1.2 First bit of data Cleaning:  Clean up the schema names.

It turns out that SQL databases often don't like parentheses and spaces in the column names.  Change the column names for the appropriate columns, by 

1. removing the parts in parentheses
2. trimming any blank spaces before or after the names
3. inserting underscores for spaces.  

Hint: there are functions called `trim`, `strip`, `find`, `replace`.

In [0]:
# TODO:
# For all column names in cryptocurrency_df, 
# (1) remove anything in parentheses, 
# (2) remove leading and trailing spaces, 
# (3) replace remaining spaces with underscores

# YOUR CODE HERE

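##1. remove the parts in parentheses

# regex=True is required on recent pandas versions, where str.replace defaults to literal matching
cryptocurrency_df.columns = cryptocurrency_df.columns.str.replace(r"\(.*\)", "", regex=True)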

##2. remove leading and trailing spaces

cryptocurrency_df.columns = cryptocurrency_df.columns.str.strip()

##3. replace remaining spaces with underscores

cryptocurrency_df.columns = cryptocurrency_df.columns.str.replace(" ","_")

#raise NotImplementedError()
#cryptocurrency_df

In [0]:
# Sanity check 1.2 for cryptocurrency_df

for column in cryptocurrency_df.keys():
    if column.find(' ') >= 0:
        raise AssertionError('Forgot to remove a space in "%s"'%column)
    elif column.find('(') >= 0 or column.find(')') >= 0:
        raise AssertionError('Forgot to remove a paren in %s'%column)
        
display(cryptocurrency_df)

In [0]:
# Hidden tests 1.2 for autograding cryptocurrency_df - please don't delete


### Task 1.3: Joining the tables

We are now going to try to put these two sources of information into one table. The requirement is that we want to make sure that we have an entry for every currency in the Wikipedia list, but not necessarily for every currency in the Yahoo price list. Of the four types of join, two can achieve this requirement. For extra practice, see if you can figure out both correct answers.
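
To see how the join types differ before touching the real tables, here is a minimal sketch on two toy frames. The frames `wiki` and `prices`, the currency names "Obscurecoin" and "Newcoin", and the prices are all invented for illustration; only `merge` and its `how` parameter come from `pandas`.

```python
import pandas as pd

# Toy stand-ins for the Wikipedia table (left) and the Yahoo price table (right).
# All values are made up for illustration.
wiki = pd.DataFrame({"Currency": ["Bitcoin", "Litecoin", "Obscurecoin"]})
prices = pd.DataFrame({"Name": ["Bitcoin", "Litecoin", "Newcoin"],
                       "Price": [10000.0, 70.0, 1.5]})

# For each join type, record which Wikipedia currencies survive the join.
kept_by_join = {}
for how in ["inner", "left", "right", "outer"]:
    joined = wiki.merge(prices, how=how, left_on="Currency", right_on="Name")
    kept_by_join[how] = sorted(joined["Currency"].dropna().tolist())
    print(how, "keeps", kept_by_join[how])
```

Comparing the printed lists against the three currencies in `wiki` shows which join types satisfy the requirement above.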

#### Task 1.3.1 Attempt #1

In the cell below, join `cryptocurrency_df` and `price_df` using "Name" as the join index of `price_df` and "Currency" as the join index of `cryptocurrency_df`. The result should be named `joined_on_name_df`. Do not make any changes to the data frames yet, even though you may see a problem with joining them now.

In [0]:
# TODO: Join cryptocurrency_df and price_df

# YOUR CODE HERE

# how="left" keeps every Wikipedia currency, even those with no Yahoo price
joined_on_name_df = cryptocurrency_df.merge(price_df, how = "left", left_on = "Currency", right_on = "Name")

#raise NotImplementedError()

display(joined_on_name_df)

In [0]:
# Sanity check 1.3.1 for joined_on_name_df

if len(joined_on_name_df.columns) != 20:
    raise AssertionError('Your joined table has %d columns, an unexpected number.'%len(joined_on_name_df.columns))

In [0]:
# Hidden tests 1.3.1 for autograding joined_on_name_df - please don't delete


#### Task 1.3.2 Cleaning up the names

You may have noticed a mismatch for how the currencies are named between the two data frames. Use the `apply` function to replace the values in the `price_df["Name"]` column so they better match the values in `cryptocurrency_df["Currency"]`.

Then rerun your join from 1.3.1 and name it the same way.

In [0]:
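# TODO: Fix the Name column in price_df and redo the join

# YOUR CODE HERE

# Yahoo names look like "Bitcoin USD"; drop the currency suffix
price_df["Name"] = price_df["Name"].apply(lambda x: x.replace(" USD",""))

# Redo the join from 1.3.1 with the cleaned names
joined_on_name_df = cryptocurrency_df.merge(price_df, how = "left", left_on = "Currency", right_on = "Name")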

#raise NotImplementedError()

display(joined_on_name_df)

In [0]:
# Sanity check 1.3.2 for joined_on_name_df

if len(joined_on_name_df[joined_on_name_df["Name"].notna()]) == 0:
    raise AssertionError('Your join did not find any matches. Maybe you did something wrong?')

In [0]:
# Hidden tests 1.3.2 for autograding joined_on_name_df cleaned - please don't delete


#### Task 1.3.3: Clean the citations out of the content.

As we saw in lecture, the html processing function converts Wikipedia citations to normal text. You may have noticed that this is keeping at least one of the cryptocurrencies from matching during the join. In the cell below, use `applymap` to remove these citations from the entire `cryptocurrency_df` table. Assume that every instance of "`[`" begins a citation. In this case only, it is okay if you delete everything after the "`[`", including the stuff after "`]`".

Then rerun your join from 1.3.2 and name it the same way. Did you get more matches?
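
As a small illustration of the rule "every `[` begins a citation", the sketch below cleans a few sample strings with `re.sub`. The helper name `strip_citation` and the sample values are assumptions made for this example; a function like this could be passed to `applymap`.

```python
import re

def strip_citation(value):
    """Drop everything from the first '[' onward; leave non-strings untouched."""
    if not isinstance(value, str):
        return value
    return re.sub(r"\[.*", "", value).strip()

# Sample cell values, invented for illustration
examples = ["Bitcoin[5][6]", "Litecoin", "Dogecoin [12]", None]
cleaned = [strip_citation(v) for v in examples]
print(cleaned)
```

Leaving non-string cells untouched is a design choice: it keeps missing values as missing instead of turning them into text.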

In [0]:
# TODO: Remove citations

# YOUR CODE HERE

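# Strip bracketed citations such as "[5]" from every cell (cells are converted to strings)
cryptocurrency_df = cryptocurrency_df.applymap(lambda y: re.sub(r"\[.*?\]", "", str(y)))

# Redo the join from 1.3.2 with the cleaned table
joined_on_name_df = cryptocurrency_df.merge(price_df, how = "left", left_on = "Currency", right_on = "Name")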

#raise NotImplementedError()

display(joined_on_name_df)

In [0]:
# Sanity check 1.3.3 for joined_on_name_df

print("%d matches found"%len(joined_on_name_df[joined_on_name_df["Name"].notna()]))
if len(joined_on_name_df[joined_on_name_df["Name"].notna()]) == 0:
    raise AssertionError('Your join did not find any matches. Maybe you did something wrong?')

In [0]:
# Hidden tests 1.3.3 for autograding citation deletion - please don't delete


#### Task 1.3.4 A Better Column

Look again at `cryptocurrency_df` and `price_df` and select better columns for indexing the join. Consider an `apply` function for the relevant column in `cryptocurrency_df` and for the relevant column in `price_df` that you select. 

Name this table `joined_df`. To get the points for this section, you need to match at least as many currencies as our solution.

In [0]:
# TODO: Improve the join by switching to different columns

# We will try to use attribute "Symbol"

# YOUR CODE HERE

cryptocurrency_df["Symbol"] = cryptocurrency_df["Symbol"].apply(lambda z: z.split(",")[0])
price_df["Symbol"] = price_df["Symbol"].apply(lambda k: k.split("-")[0])

# how="left" keeps every Wikipedia currency, matched or not
joined_df = cryptocurrency_df.merge(price_df, how="left", on="Symbol")

#raise NotImplementedError()

display(joined_df)

In [0]:
# Sanity check 1.3.4 for joined_df

print("%d matches found"%len(joined_df[joined_df["Name"].notna()]))
if len(joined_df[joined_df["Name"].notna()]) <= len(joined_on_name_df[joined_on_name_df["Name"].notna()]):
    raise AssertionError('Your new join is not better than the old one. Maybe you did something wrong?')

In [0]:
# Hidden tests 1.3.4 for autograding joined_df  - please don't delete


### Task 1.4: Save the cryptocurrency list in a database table

We don't want to continue to hit Wikipedia.org every time we want to consult the list of cryptocurrencies.  Save your `cryptocurrency_df` to sqlite, in a table called `cryptocurrency`.  

**The Dataframe `index` has no particular meaning, so don't save it!**
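
To see what happens when the index is saved by accident, here is a small sketch using a throwaway in-memory database. The names `demo_df`, `demo_conn`, and the two table names are invented for this example.

```python
import sqlite3
import pandas as pd

demo_df = pd.DataFrame({"Currency": ["Bitcoin", "Litecoin"], "Symbol": ["BTC", "LTC"]})
demo_conn = sqlite3.connect(":memory:")   # throwaway in-memory database

# Save the same frame twice: once with the index, once without
demo_df.to_sql("with_index", demo_conn, index=True)
demo_df.to_sql("without_index", demo_conn, index=False)

cols_with = pd.read_sql_query("select * from with_index", demo_conn).columns.tolist()
cols_without = pd.read_sql_query("select * from without_index", demo_conn).columns.tolist()
print(cols_with)
print(cols_without)
demo_conn.close()
```

The first table gains an extra `index` column, which is exactly what the sanity check for this task looks for.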

In [0]:
# TODO: convert cryptocurrency_df to sqlite

conn = sqlite3.connect('local.db')

# YOUR CODE HERE
cryptocurrency_df.to_sql("cryptocurrency", conn, if_exists="replace", index = False)

#raise NotImplementedError()

In [0]:
# Sanity check 1.4 for sqlite databases

crypto2 = pd.read_sql_query('select * from cryptocurrency', conn)

if 'index' in crypto2:
    raise AssertionError('Please disable the index, since it isn\'t important information')
    
display(crypto2)

### Task 1.5: Read the cryptocurrency pages

Now let's take each of the cryptocurrency names and find the associated URL. The names of the currencies were originally clickable links on the [webpage](https://en.wikipedia.org/wiki/List_of_cryptocurrencies) that we made the table from, but unfortunately, `pandas` automatically deleted the URLs. So we have to regenerate them. Feel free to look at that page to see what the correct URL is for each currency.

In the cell below, complete the function `crawl`. The function name, inputs, first line, and last line are provided for you. 

`list_of_urls` should contain the URLs of interest as a list, column of a pandas DataFrame, or some other iterable over strings. 

`prefix` contains a common string that should be added to the beginning of every URL in `list_of_urls` before each URL is queried. 

The line `pages = {}` creates an empty dictionary. After running your part of the function `crawl`, `pages` should have currency names as its keys and the corresponding Wikipedia page contents as its values. This is what the function returns.

You have two options for completing this cell:

1. If you want to use `urllib.request.urlopen`, you should then use `read()` and `decode('utf-8')`.

2. If you want to use `scrapy`, follow the process in [this notebook from class](https://www.google.com/url?q=https://drive.google.com/file/d/1VfnlGr_VofdcEqACM2jRu2BwYm0QyTSh/view?usp%3Dsharing&sa=D&ust=1567968915286000&usg=AFQjCNG5iEWgUoA3DrRLhV1TKiT2OXHD1A).

For now, use a `try` statement to catch the errors and print a message that the URL could not be crawled. That is, in this cell we will have a **single rule** and not do any manual cleaning.  If you were doing this at web scale, you would be reluctant to invest a lot of manual effort...

In [0]:
# TODO: Crawl the pages.  
# Trap the errors and figure out what you need to fix (in the cleaning step below)

def crawl(list_of_urls, prefix=""):
    pages = {}
# YOUR CODE HERE
    for url in list_of_urls:
        try:
            # read() returns bytes; decode('utf-8') turns them into a string
            pages[url] = urllib.request.urlopen(prefix + url).read().decode('utf-8')
            print(prefix + url)  # show the fetched URL to spot wrong pages (e.g. Bitcoin Cash)
        except Exception:
            print(url + "'s URL doesn't work")
    return pages

The following cell passes the currencies in our table to the `crawl` function. You do not need to modify the cell.

In [0]:
# Sanity check 1.5.1 for initial crawl
# I modified it to print the keys; otherwise we can't see whether something went wrong inside my error-handling function

pages = crawl(cryptocurrency_df['Currency'], 'https://en.wikipedia.org/wiki/')
for page in pages:
    print (page)
print ('Total crawl: %d cryptocurrencies'%len(pages))

In [0]:
# Hidden tests 1.5.1 for autograding pages  - please don't delete


Did you get any errors? Did you ever get the wrong URL (and therefore the content from the wrong page)? Fix those two problems in the function `crawl_better` below. This function has the same inputs and outputs as `crawl`, but this time, it is okay if your fixes are specific to these sites. For example, you can try attaching `_(disambiguation)`, pull up that page's `etree.HTML(content)` and look for a link that has the name of the currency plus `' (cryptocurrency)'`.
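
The disambiguation idea can be tried offline first. The sketch below parses a hand-written HTML snippet, not a real Wikipedia page, and looks for a link whose `title` is the currency name plus `' (cryptocurrency)'`. The helper name `find_crypto_link` and the snippet are assumptions for illustration; real disambiguation pages have richer markup and may need a different XPath.

```python
from lxml import etree

# A hand-written stand-in for a disambiguation page
html = """
<html><body><ul>
  <li><a href="/wiki/Dash_(punctuation)" title="Dash (punctuation)">Dash</a>, a punctuation mark</li>
  <li><a href="/wiki/Dash_(cryptocurrency)" title="Dash (cryptocurrency)">Dash (cryptocurrency)</a></li>
</ul></body></html>
"""

def find_crypto_link(content, currency):
    """Return the href of the link titled '<currency> (cryptocurrency)', or None."""
    tree = etree.HTML(content)
    wanted = currency + " (cryptocurrency)"
    # $title is an XPath variable; lxml fills it from the keyword argument
    hrefs = tree.xpath("//a[@title=$title]/@href", title=wanted)
    return hrefs[0] if hrefs else None

link = find_crypto_link(html, "Dash")
print(link)
```

Returning `None` when no link is found lets the caller fall back to the original URL.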

In [0]:
# TODO: Re-run the crawl, fixing the issues

# Crawl the pages.  You may use urllib.request.urlopen or scrapy
# Assemble the results in the dictionary pages.
# Trap the errors and figure out what you need to fix (in the cleaning step below)

def crawl_better(list_of_urls, prefix=""):
    pages = {}
# YOUR CODE HERE
    # Page titles that differ from the currency name in the table.
    # These fixes are specific to this list and were found by inspecting the first crawl.
    overrides = {
        "Ripple": "Ripple_(payment_protocol)",
        "Dash": "Dash_(cryptocurrency)",
        "Monero": "Monero_(cryptocurrency)",
        "NEM": "NEM_(cryptocurrency)",
        "Verge": "Verge_(cryptocurrency)",
        "Stellar": "Stellar_(payment_network)",
        "Tether": "Tether_(cryptocurrency)",
        "Ether_or_Ethereum": "Ethereum",
    }
    for url in list_of_urls:
        # Wikipedia titles use underscores instead of spaces and contain no quotes
        url = url.replace(" ", "_").replace("\"", "")
        url = overrides.get(url, url)
        try:
            pages[url] = urllib.request.urlopen(prefix + url).read().decode('utf-8')
            print(prefix + url)
        except Exception:
            print(url + "'s URL doesn't work!!")
    return pages

As before, the cell below just runs your function and does not need to be modified.

In [0]:
# Sanity check 1.5.2 for better crawl
pages = crawl_better(cryptocurrency_df['Currency'], 'https://en.wikipedia.org/wiki/')

In [0]:
# Hidden tests 1.5.2 for autograding pages  - please don't delete


### Task 1.6: Sanity-check and fix

Note that sometimes terms in Wikipedia are **ambiguous**, so just following the page doesn't always get what you want.  The Wikipedia page for [Tether](https://en.wikipedia.org/wiki/Tether) does not describe a cryptocurrency.

We can add a data-cleaning rule to check this: every cryptocurrency should mention the term "blockchain".  Here's a sanity check you can use.  If there are any disambiguation pages, you need to go back to Task 1.5 and update your process to crawl the right page.

You do not need to modify this cell.

In [0]:
count_wrong = 0

for page,content in pages.items():
    if isinstance(content, bytes):
        raise AssertionError('Please run decode(\'utf-8\') on the content to decode to a string')
        content = content.decode('utf-8')
        
    if 'blockchain' not in content:
        print(page + ': ' + ' -- did not find blockchain!')
        count_wrong = count_wrong + 1

        
print ('Total crawl: %d cryptocurrencies'%len(pages))

if count_wrong > 0:
    raise AssertionError('Need to follow Wikipedia disambiguation pages on %d items!'%count_wrong)

### Task 1.7: Clean the articles

So far, we have captured HTML content for each Wikipedia article, but HTML is not very easy to read and process. So the next step is to clean up the text in each article. To do that, you need to complete the function definition below. The function name and input are provided for you. 

The first step is to get a list of paragraphs of content. See our [slides](https://www.google.com/url?q=https://drive.google.com/a/seas.upenn.edu/file/d/163sCi0h5RJAXynE1Vo37bAQtOvcwW_wv/view?usp%3Dsharing&sa=D&ust=1567968915286000&usg=AFQjCNGDBY3SNFEJIh3m5k7GyYmhK2Q52w) on xpath for hints. Then, for each word (string between whitespace characters):

1. Remove the leading and trailing whitespace using `strip()`
2. Remove the word entirely if it is only white space.
3. Remove the word entirely if it is only numerics (you may use `isnumeric()` to test for this).

Finally join the words together into one string with spaces in between using `' '.join()`. The function should return that string (output).
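
The word-level rules can be tested on a plain string before involving any HTML. This is a minimal sketch; the helper name `filter_words` and the sample sentence are made up for illustration.

```python
def filter_words(text):
    """Keep whitespace-separated tokens that are non-empty and not purely numeric."""
    kept = []
    for word in text.split():      # split() with no argument splits on any whitespace run
        word = word.strip()
        if word and not word.isnumeric():
            kept.append(word)
    return " ".join(kept)

sample = "  Bitcoin \n was released in 2009\tby Satoshi  "
print(filter_words(sample))
```

Note that `isnumeric()` only removes tokens made up entirely of numeric characters, so a token such as `2009,` would be kept.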

In [0]:
# TODO: Complete the clean_article function, as described above.

def clean_article(content):
  # Collect every text node that sits inside a <p> element
  text_nodes = etree.HTML(content).xpath("//p//text()")
  word_list = []
  for node in text_nodes:
    # split() with no argument splits on any run of whitespace and never yields empty strings
    for word in node.split():
      # Remove the leading and trailing whitespace using strip()
      word = word.strip()
      # Drop the word if it is empty or purely numeric
      if word and not word.isnumeric():
        word_list.append(word)
  return ' '.join(word_list)
#raise NotImplementedError()

The following cell assembles our cleaned articles into a DataFrame. You do not need to modify the cell.

In [0]:
pages2 = []
for currency_name, content in pages.items():
    article = clean_article(content)
    pages2.append({'currency': currency_name, 'text': article})
pages_df = pd.DataFrame(pages2)

display(pages_df)

In [0]:
# Hidden tests 1.7 for autograding clean_article  - please don't delete


# Task 2: Build and run the classifier

Now that we have the cryptocurrency articles processed, it is time to return to the original task of building a classifier that can identify cryptocurrency articles.

## Task 2.1: Get the negative examples.

If we want to build a (supervised) machine learning algorithm to detect content, we need both *positive* and *negative* examples.  In fact we want each successive training example to have an equal probability of being positive or negative.

The following cell runs your `crawl` function from Task 1.5 and your `clean_article` function from Task 1.7. Note: We are using `crawl` not `crawl_better` because you may have included data-specific choices in `crawl_better` that are no longer true.

You do not need to modify this cell.

In [0]:
training = [
    'https://en.wikipedia.org/wiki/Tim_Cook',
    'https://en.wikipedia.org/wiki/The_Great_British_Bake_Off',
    'https://en.wikipedia.org/wiki/Google',
    'https://en.wikipedia.org/wiki/Chan_Zuckerberg_Initiative',
    'https://en.wikipedia.org/wiki/Politics',
    'https://en.wikipedia.org/wiki/Fake_news',
    'https://www.snopes.com/fact-check/social-media-hacker-warning/',
    'https://www.cnn.com/2019/08/31/us/dorian-animals-foster-release-wxc/index.html',
    'https://www.foxnews.com/us/indiana-dispatcher-helps-boy-who-called-911-with-fractions-homework',
    'https://www.usatoday.com/story/tech/talkingtech/2019/08/31/hello-iphone-11-new-features-we-want-apple-next-models/2153565001/',
    'http://theconversation.com/bury-fc-the-economics-of-an-english-football-clubs-collapse-122727',
    'https://fivethirtyeight.com/features/economists-are-bad-at-predicting-recessions/'
]

negative = crawl(training)
negative2 = []
for site, content in negative.items():
    article = clean_article(content)
    negative2.append({'site': site, 'text': article})

negative_df = pd.DataFrame(negative2)
display(negative_df)

## Task 2.2: Process Document Text

Right now, each Wikipedia article is a single string. This means we only have one "feature" for the classifier, which is not enough. Tokenization (splitting up the article into words) would transform the data so that we have one feature per word. This would probably give us enough features to train a classifier.

Complete the `get_words` function in the cell below. This function should take a string as input (the raw article).

1. Create an empty list to store the good words.

2. Break the article into sentences using the NLTK sentence tokenizer.

3. Tokenize and part-of-speech tag each sentence.

4. Run the provided `clean_word` function and Porter stemmer on each word.

5. Finally, append the word stem to the list of good words if all of the following are true:
    1. The word stem is of nonzero length.
    2. The word stem has a length less than 20.
    3. The word stem is not a stopword.
    4. The word is a noun.
    5. The word stem is in `vocabulary`. Only apply this rule if `vocabulary` has nonzero length. It has zero length by default.

6. Return the list of good words.

To match our solution, it is important that you do these steps in the given order.

In [0]:
# TODO: Complete the get_words function
from nltk.tokenize import TweetTokenizer
sw = set(stopwords.words("english"))
sw.add("'s")
stemmer = PorterStemmer()
# Penn Treebank noun tags: singular, plural, proper singular, proper plural
noun_tag = ["NN","NNS","NNP","NNPS"]
def clean_word(word):
    word = word.lower()
    word2 = ''
    for w in word:
        if w.isalpha() or (len(word2) > 0 and w.isnumeric()):
            word2 = word2 + w
    return word2

def get_words(article, vocabulary=[]):
  #Create an empty list to store the good words.
  good_words = []
  sentences = nltk.sent_tokenize(article)
  # Build the tokenizer once rather than once per sentence
  tweet_tokenizer = TweetTokenizer(preserve_case = False, strip_handles = True)
  for sentence in sentences:
    tagged = nltk.pos_tag(tweet_tokenizer.tokenize(sentence))
    for token, tag in tagged:
      stem_word = stemmer.stem(clean_word(token))
      if (0 < len(stem_word) < 20) and (stem_word not in sw) and (tag in noun_tag):
        # Apply the vocabulary filter only when a vocabulary was supplied.
        # Repeated stems are kept so that a frequency distribution can count them.
        if len(vocabulary) == 0 or stem_word in vocabulary:
          good_words.append(stem_word)
  return good_words
#raise NotImplementedError()

In [0]:
# Sanity check 2.2 for getting the word stems from articles

print(get_words("to be or not to be"))
print(get_words("He wants to test the functionality of xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx in article 091019."))

In [0]:
# Hidden tests 2.2 for autograding get_words  - please don't delete


## Task 2.3 Train the classifier

Adapt the code from the NLTK lecture notebook to complete the `build_classifier` function. This function takes as input the two column dataframes `positive_df` and `negative_df`, and also an optional vocabulary list. It should run `get_words` on each article in each dataframe, get a frequency distribution from NLTK for each article, assemble the training set for a Naive Bayes classifier in the correct format, train the classifier, and return the trained classifier.
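
The training format expected by NLTK's Naive Bayes classifier is a list of `(featureset, label)` pairs, where each featureset behaves like a dictionary. Here is a toy sketch of that format; the word lists are invented and stand in for the output of `get_words`.

```python
import nltk
from nltk import NaiveBayesClassifier

# Each training example pairs a frequency distribution with its label.
# The word lists are invented stand-ins for get_words output.
toy_training = [
    (nltk.FreqDist(["coin", "blockchain", "coin", "wallet"]), "positive"),
    (nltk.FreqDist(["blockchain", "miner", "coin"]), "positive"),
    (nltk.FreqDist(["cake", "oven", "bake"]), "negative"),
    (nltk.FreqDist(["election", "vote", "party"]), "negative"),
]

toy_classifier = NaiveBayesClassifier.train(toy_training)

# Classify a new "document" represented the same way
label = toy_classifier.classify(nltk.FreqDist(["blockchain", "coin"]))
print(label)
```

No NLTK data downloads are needed for this sketch, because it skips tokenization and tagging.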

In [0]:
# TODO: Complete the build_classifier function

from nltk import classify
from nltk import NaiveBayesClassifier

def build_classifier(positive_df, negative_df, vocabulary=[]):
# YOUR CODE HERE
  positive_df_set = []
  negative_df_set = []
  for article_pos in positive_df['text']:
    words = get_words(article_pos, vocabulary)
    positive_df_set.append((nltk.FreqDist(words),'positive'))
  for article_neg in negative_df['text']:
    words = get_words(article_neg, vocabulary)
    negative_df_set.append((nltk.FreqDist(words), 'negative'))
  return NaiveBayesClassifier.train(positive_df_set + negative_df_set)
#raise NotImplementedError()

In [0]:
# Sanity check 2.3 for training the classifier

classifier = build_classifier(pages_df, negative_df)
print(type(classifier))

# This should print <class 'nltk.classify.naivebayes.NaiveBayesClassifier'>

## Task 2.4: Run the classifier

Below are some sample pages.  Let's see if you can run the model on them.

### Task 2.4.1 Load the test set

Adapt the code from Task 2.1 for the new dataset. Call the final dataframe `inference_df`.

In [0]:
# TODO: Create inference_df

test = [
    'https://fried.com/history-of-bitcoin/',
    'https://news.wharton.upenn.edu/press-releases/2018/06/penn-launches-strategic-collaboration-ripple-accelerate-innovation-blockchain-cryptocurrency/',
    'https://en.wikipedia.org/wiki/Euro',
    'https://ew.com/movies/star-wars-rise-of-skywalker-footage-d23-expo/',
    'https://en.wikipedia.org/wiki/Donald_Trump'
]

# YOUR CODE HERE
inference = crawl(test)
inference2 = []
for site, content in inference.items():
    article = clean_article(content)
    inference2.append({'site': site, 'text': article})
inference_df = pd.DataFrame(inference2)
#raise NotImplementedError()

In [0]:
# Sanity check 2.4.1 loading the test set

display(inference_df)

### Task 2.4.2: Inference

Now let's run your classifier over your individual documents. Adapt the code from the NLTK lecture notebook. The function `classify` should take as input a two-column dataframe like the ones we made previously, the trained classifier, and an optional vocabulary list. It should return a list of booleans. For example, a perfect classifier should return

`classify(inference_df, classifier) = [True, True, False, False, False]`.

Note that you will need to run `get_words` (passing the vocabulary) and then generate an NLTK frequency distribution for each test article.

In [0]:
# TODO: Complete the classify function

def classify(df, classifier, vocabulary=[]):
  result_list=[]
# YOUR CODE HERE
  for text in df['text']:
    text_set = nltk.FreqDist(get_words(text, vocabulary))
    prob_result = classifier.prob_classify(text_set)
    result_list.append(prob_result.max() == 'positive')
  return result_list
#raise NotImplementedError()

results = classify(inference_df, classifier)
display(results)

In [0]:
# Sanity check 2.4.2 classifier results

if len(results) != 5:
    raise AssertionError('We do not have a classification for each item.')

In [0]:
# Hidden tests 2.4.2 for autograding results  - please don't delete


## Task 2.5: Make the vocabulary and re-classify

So far, our classifier is not very good. This is because it is trying to consider too many words, many of which did or did not occur in the training articles purely by chance. If we restrict the "attention" of the classifier to the most frequent words, it is much more likely to pick up real patterns rather than memorize accidents. We do this by making a vocabulary.

Complete the `make_vocabulary` function below. This function should take as input the two column dataframes `positive_df` and `negative_df`, and also an integer `num`. For the positive dataframe, run `get_words` on each article (without vocabulary), concatenate all of these lists of words together, create an NLTK frequency distribution, and then finally store a list of the `num` most frequent words. Do the same for the negative dataframe. The function should return the `num` most frequent positive words and the `num` most frequent negative words concatenated into one list (2 times `num` words in all).
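
The "most frequent words" step can be sketched with the standard library's `Counter`, which provides the same `most_common` method that NLTK's `FreqDist` has. The helper name `top_words` and the toy word lists are assumptions for illustration.

```python
from collections import Counter

def top_words(word_lists, num):
    """Concatenate several word lists and return the num most frequent words."""
    counts = Counter()
    for words in word_lists:
        counts.update(words)
    return [word for word, _ in counts.most_common(num)]

# Invented word lists, one inner list per article
positive_lists = [["coin", "blockchain", "coin"], ["blockchain", "coin", "wallet"]]
negative_lists = [["cake", "oven", "cake"], ["oven", "cake", "vote"]]

toy_vocabulary = top_words(positive_lists, 2) + top_words(negative_lists, 2)
print(toy_vocabulary)
```

The result contains 2 times `num` words, matching the shape described above.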



In [0]:
# TODO: Complete the make_vocabulary function

def make_vocabulary(positive_df, negative_df, num):
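  positive_words = []
  negative_words = []
# YOUR CODE HERE
  # Concatenate the word lists of all articles in each class
  for article_pos in positive_df['text']:
    positive_words.extend(get_words(article_pos))
  for article_neg in negative_df['text']:
    negative_words.extend(get_words(article_neg))
  # most_common(num) returns (word, count) pairs sorted by descending count
  top_positive = [word for word, _ in nltk.FreqDist(positive_words).most_common(num)]
  top_negative = [word for word, _ in nltk.FreqDist(negative_words).most_common(num)]
  return top_positive + top_negative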
#raise NotImplementedError()

In [0]:
# Sanity check 2.5.1 see final vocabulary size

vocabulary = make_vocabulary(pages_df, negative_df, 30)
print(len(vocabulary))
#print(vocabulary)

In [0]:
# Sanity check 2.5 improved classifier results

classifier_with_vocab = build_classifier(pages_df, negative_df, vocabulary)
results = classify(inference_df, classifier_with_vocab, vocabulary)
display(results)

In [0]:
# Hidden tests 2.5 for autograding results  - please don't delete


# Task 3: Submitting Your Homework

1. When you are done, select “Edit” at the top of the window, **under the filename, not the one that may appear above it**. Then, select “Clear all outputs”. Please do this just before turning in your homework because it reduces the size of your file.


2. In the same menu **under the filename**, select “File” and then “Download .ipynb”. It is very important that you do not change the file name of this downloaded notebook. Make sure that something like “(1)” did not get added to the filename and also that you did not download the .py version. Our autograder can only handle .ipynb files with the correct file name.

3. Compress the ipynb file into a Zip file **hw1.zip**.

4. Go to the [submission site](http://submit.dataanalytics.education), and click on the Google icon.  Log in using your Google@SEAS (if at all possible!) or (if you aren’t an Engineering student) Gmail account.  

5. Click on the **Courses** icon at the top, then select **CIS 545** and **Save**. Select **cis545-2019c-hw1** and upload **hw1.zip**.

6. You should see a message on the submission site notifying you about whether your submission passed validation.  You may resubmit as necessary, but may have to withdraw your previous submission in OpenSubmit in order to do so.

**If you have not already, please go to Settings and set your Student ID to your PennID (all numbers)**.