<a href="https://colab.research.google.com/github/sri-go/BigDataAnalytics/blob/master/Web_Scraping_Data_Wrangling_Naive_Bayes_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CIS 545 - Big Data Analytics - Fall 2019

# Homework 1: Data Wrangling and Cleaning
# Due Date: September 25, 2019 at 10pm

We all know that cryptocurrencies are all the rage today.  Could we train an algorithm to tell the difference between a webpage about cryptocurrency and a webpage about something else?

This initial assignment goes over some of the basic steps in (1) acquiring data from the web, (2) acquiring tabular data, (3) cleaning and linking data, and (4) training a simple machine learning classifier.  Along the way you'll learn a few of the basic tools, and get a very basic understanding of one way to represent documents.

**Note: You do not need to connect your local runtime to do this assignment!**

In [0]:
# Standard pip install...  Put all of your to-install packages here.
# Depending on your configuration, you may need to change pip3 to pip
!pip3 install scrapy
!pip3 install lxml
!pip3 install scikit-learn
!pip3 install swifter

In [0]:
# Standard imports; it's cleaner to put them here so they can be used
# throughout the notebook

import pandas as pd
import numpy as np
from lxml import etree
import sqlite3
import swifter
import urllib.request  # a bare `import urllib` is not guaranteed to load the request submodule
import re
import scrapy
from scrapy.crawler import CrawlerProcess

import nltk

from nltk import classify
from nltk import NaiveBayesClassifier
from nltk.stem import PorterStemmer
from nltk import sent_tokenize
from nltk import word_tokenize
from nltk import tokenize
from nltk.tokenize import TweetTokenizer


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Task 1: Acquiring data for training our system

First, let's get some information about what a cryptocurrency is.  For that, there's always [Wikipedia](https://en.wikipedia.org/wiki/List_of_cryptocurrencies)!

But of course it won't give us the data exactly the way we want it, so we'll need to do a bit of information extraction and data wrangling. We will also try to get current price levels from [Yahoo](https://finance.yahoo.com/cryptocurrencies).

### Task 1.1: Fetch the list of pages from Wikipedia and put it into a dataframe

First we'll get the master table of "known" cryptocurrencies. Use the `read_html()` function from `pandas`. 

In [0]:
# TODO:
# (1) Fetch files from Wikipedia:  https://en.wikipedia.org/wiki/List_of_cryptocurrencies
# (2) Parse into a dataframe called cryptocurrency_df

# YOUR CODE HERE
cryptocurrency_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_cryptocurrencies')[0]

#raise NotImplementedError()
display(cryptocurrency_df)

Unnamed: 0,Release,Currency,Symbol,Founder(s),Hash algorithm,Programming language of implementation,"Cryptocurrency blockchain (PoS, PoW, or other)",Notes
0,2009,Bitcoin,"BTC,[4] XBT, ₿",Satoshi Nakamoto[nt 1],SHA-256d[5][6],C++[7],PoW[6][8],The first and most widely used decentralized l...
1,2011,Litecoin,"LTC, Ł",Charlie Lee,Scrypt,C++[11],PoW,One of the first cryptocurrencies to use Scryp...
2,2011,Namecoin,NMC,Vincent Durham[12][13],SHA-256d,C++[14],PoW,"Also acts as an alternative, decentralized DNS."
3,2012,Peercoin,PPC,Sunny King(pseudonym)[citation needed],SHA-256d[citation needed],C++[15],PoW & PoS,The first cryptocurrency to use POW and POS fu...
4,2013,Dogecoin,"DOGE, XDG, Ð",Jackson Palmer& Billy Markus[16],Scrypt[17],C++[18],PoW,Based on the Doge internet meme.
5,2013[citation needed],Gridcoin,GRC,Rob Hälford[citation needed],Scrypt,C++[19],Decentralized PoS,Linked to citizen science through the Berkeley...
6,2013,Primecoin,XPM,Sunny King(pseudonym)[citation needed],1CC/2CC/TWN[21],"TypeScript, C++[22]",PoW[21],Uses the finding of prime chains composed of C...
7,2013,Ripple[23][24],XRP,Chris Larsen &Jed McCaleb[25],ECDSA[26],C++[27],"""Consensus""",Designed for peer to peer debt transfer. Not b...
8,2013,Nxt,NXT,BCNext(pseudonym),SHA-256d[28],Java[29],PoS,Specifically designed as a flexible platform t...
9,2014,Auroracoin,AUR,Baldur Odinsson(pseudonym)[30],Scrypt,C++[31],PoW,Created as an alternative currency for Iceland...


Next, do the same for the following two URLs. Yahoo returns at most 100 prices per request, which is why we need two queries.
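
The two URLs differ only in their `offset` query parameter, so they can be generated rather than typed out. The sketch below assumes `count` and `offset` behave the way those two URLs suggest; `page_urls` is a hypothetical helper, not part of the assignment.

```python
from urllib.parse import urlencode

BASE_URL = 'https://finance.yahoo.com/cryptocurrencies/'

def page_urls(total, page_size=100):
    """Build one URL per page of `page_size` rows, covering `total` rows."""
    return [
        BASE_URL + '?' + urlencode({'count': page_size, 'offset': offset})
        for offset in range(0, total, page_size)
    ]

urls = page_urls(200)
for url in urls:
    print(url)
```

Each URL could then be passed to `pd.read_html` in turn, and the resulting tables combined into one dataframe.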

In [0]:
# TODO: Make two price dataframes from
# price_1_df: https://finance.yahoo.com/cryptocurrencies/?count=100&offset=0
# price_2_df: https://finance.yahoo.com/cryptocurrencies/?count=100&offset=100

# YOUR CODE HERE
price_1_df = pd.read_html('https://finance.yahoo.com/cryptocurrencies/?count=100&offset=0')[0]
price_2_df = pd.read_html('https://finance.yahoo.com/cryptocurrencies/?count=100&offset=100')[0]
#raise NotImplementedError()

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
price_df = pd.concat([price_1_df, price_2_df])

display(price_df)

Unnamed: 0,Symbol,Name,Price (Intraday),Change,% Change,Market Cap,Volume in Currency (Since 0:00 UTC),Volume in Currency (24Hr),Total Volume All Currencies (24Hr),Circulating Supply,52 Week Range,1 Day Chart
0,BTC-USD,Bitcoin USD,8429.310000,-45.420000,-0.54%,151.372B,20.057B,20.057B,20.057B,17.958M,,
1,ETH-USD,Ethereum USD,169.330000,-1.470000,-0.86%,18.27B,8.736B,8.736B,8.736B,107.896M,,
2,XRP-USD,XRP USD,0.243800,-0.002400,-0.97%,10.505B,1.61B,1.61B,1.61B,43.08B,,
3,USDT-USD,Tether USD,1.006500,-0.000600,-0.06%,4.135B,24.506B,24.506B,24.506B,4.108B,,
4,BCH-USD,Bitcoin Cash USD,225.550000,-3.240000,-1.41%,4.065B,2.342B,2.342B,2.342B,18.023M,,
5,LTC-USD,Litecoin USD,56.560000,-1.120000,-1.94%,3.582B,3.054B,3.054B,3.054B,63.337M,,
6,EOS-USD,EOS USD,2.858500,-0.043300,-1.49%,2.668B,2.275B,2.275B,2.275B,933.287M,,
7,BNB-USD,Binance Coin USD,16.100000,-0.160000,-1.00%,2.504B,163.039M,163.039M,163.039M,155.537M,,
8,BSV-USD,Bitcoin SV USD,86.370000,-0.270000,-0.31%,1.542B,398.464M,398.464M,398.464M,17.855M,,
9,XLM-USD,Stellar USD,0.058600,0.001500,+2.65%,1.178B,247.433M,247.433M,247.433M,20.105B,,


In [0]:
# Quick sanity check 1.1 for cryptocurrency_df: does it have the columns from the Wikipedia table?

if not 'Currency' in cryptocurrency_df:
    raise AssertionError('Expected column called "Currency"')
    
if not 'Founder(s)' in cryptocurrency_df:
    raise AssertionError('Expected column called "Founder(s)"')

display(cryptocurrency_df)

Unnamed: 0,Release,Currency,Symbol,Founder(s),Hash algorithm,Programming language of implementation,"Cryptocurrency blockchain (PoS, PoW, or other)",Notes
0,2009,Bitcoin,"BTC,[4] XBT, ₿",Satoshi Nakamoto[nt 1],SHA-256d[5][6],C++[7],PoW[6][8],The first and most widely used decentralized l...
1,2011,Litecoin,"LTC, Ł",Charlie Lee,Scrypt,C++[11],PoW,One of the first cryptocurrencies to use Scryp...
2,2011,Namecoin,NMC,Vincent Durham[12][13],SHA-256d,C++[14],PoW,"Also acts as an alternative, decentralized DNS."
3,2012,Peercoin,PPC,Sunny King(pseudonym)[citation needed],SHA-256d[citation needed],C++[15],PoW & PoS,The first cryptocurrency to use POW and POS fu...
4,2013,Dogecoin,"DOGE, XDG, Ð",Jackson Palmer& Billy Markus[16],Scrypt[17],C++[18],PoW,Based on the Doge internet meme.
5,2013[citation needed],Gridcoin,GRC,Rob Hälford[citation needed],Scrypt,C++[19],Decentralized PoS,Linked to citizen science through the Berkeley...
6,2013,Primecoin,XPM,Sunny King(pseudonym)[citation needed],1CC/2CC/TWN[21],"TypeScript, C++[22]",PoW[21],Uses the finding of prime chains composed of C...
7,2013,Ripple[23][24],XRP,Chris Larsen &Jed McCaleb[25],ECDSA[26],C++[27],"""Consensus""",Designed for peer to peer debt transfer. Not b...
8,2013,Nxt,NXT,BCNext(pseudonym),SHA-256d[28],Java[29],PoS,Specifically designed as a flexible platform t...
9,2014,Auroracoin,AUR,Baldur Odinsson(pseudonym)[30],Scrypt,C++[31],PoW,Created as an alternative currency for Iceland...


### Task 1.2: First bit of data cleaning: clean up the schema names

It turns out that SQL databases often don't like parentheses and spaces in the column names.  Change the column names for the appropriate columns, by 

1. removing the parts in parentheses
2. trimming any blank spaces before or after the names
3. inserting underscores for spaces.  

Hint: there are functions called `trim`, `strip`, `find`, `replace`.
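
The three steps can be tried on plain strings before touching the dataframe. This is a minimal sketch; `clean_column_name` is a hypothetical helper name, and the sample names are taken from the Wikipedia table shown earlier.

```python
import re

def clean_column_name(name):
    """Apply the three cleaning steps to a single column name."""
    name = re.sub(r'\([^)]*\)', '', name)  # (1) drop anything in parentheses
    name = name.strip()                    # (2) trim surrounding whitespace
    return name.replace(' ', '_')          # (3) remaining spaces -> underscores

examples = ['Founder(s)', 'Hash algorithm',
            'Cryptocurrency blockchain (PoS, PoW, or other)']
cleaned = [clean_column_name(name) for name in examples]
print(cleaned)
```

The order matters: removing the parenthesized part of the last name leaves a trailing space, so stripping has to happen before spaces are turned into underscores. Note that Python strings have `strip` rather than `trim`.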

In [0]:
# TODO:
# For all column names in cryptocurrency_df, 
# (1) remove anything in parentheses, 
import re
cryptocurrency_df.columns = [re.sub(r'\([^)]*\)',"", x) for x in cryptocurrency_df.columns]

# (2) remove leading and trailing spaces, 
cryptocurrency_df.columns = cryptocurrency_df.columns.str.strip()

# (3) replace remaining spaces with underscores
cryptocurrency_df.columns = cryptocurrency_df.columns.str.replace(' ', '_')

display(cryptocurrency_df)
# YOUR CODE HERE
#raise NotImplementedError()

In [0]:
# Sanity check 1.2 for cryptocurrency_df

for column in cryptocurrency_df.keys():
    if column.find(' ') >= 0:
        raise AssertionError('Forgot to remove a space in "%s"'%column)
    elif column.find('(') >= 0 or column.find(')') >= 0:
        raise AssertionError('Forgot to remove a paren in %s'%column)
        
display(cryptocurrency_df)

### Task 1.3: Joining the tables

We are now going to try to put these two sources of information into one table. The requirement is that we want to make sure that we have an entry for every currency in the Wikipedia list, but not necessarily for every currency in the Yahoo price list. Of the four types of join, two can achieve this requirement. For extra practice, see if you can figure out both correct answers.
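
To see how the four join types differ, it can help to run them on toy data first. The sketch below uses two tiny made-up tables (the prices are placeholders, not real data) and counts the rows each join keeps.

```python
import pandas as pd

# Toy stand-ins for the two tables; the values are made up for illustration.
wiki = pd.DataFrame({'Currency': ['Bitcoin', 'Namecoin', 'Dogecoin']})
prices = pd.DataFrame({'Name': ['Bitcoin', 'Dogecoin', 'Tether'],
                       'Price': [1.0, 2.0, 3.0]})

row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    joined = wiki.merge(prices, how=how, left_on='Currency', right_on='Name')
    row_counts[how] = len(joined)
    # Currencies from the left table that survive this join
    print(how, len(joined), joined['Currency'].dropna().tolist())
```

Comparing the printed currency lists against the three rows of `wiki` shows which join types keep every row of the left table.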

#### Task 1.3.1 Attempt #1

In the cell below, join `cryptocurrency_df` and `price_df` using "Name" as the join index of `price_df` and "Currency" as the join index of `cryptocurrency_df`. The result should be named `joined_on_name_df`. Do not make any changes to the data frames yet, even though you may see a problem with joining them now.

In [0]:
# TODO: Join cryptocurrency_df and price_df

# YOUR CODE HERE
# Left join: a left outer join keeps every record from table A, along with the
# matching records (where available) from table B. Where there is no match,
# the right-hand columns contain null.
joined_on_name_df = cryptocurrency_df.merge(price_df, how = 'left', left_on="Currency", right_on="Name").drop_duplicates()

#raise NotImplementedError()
#display(price_df)
display(joined_on_name_df)

Unnamed: 0,Release,Currency,Symbol_x,Founder,Hash_algorithm,Programming_language_of_implementation,Cryptocurrency_blockchain,Notes,Symbol_y,Name,Price (Intraday),Change,% Change,Market Cap,Volume in Currency (Since 0:00 UTC),Volume in Currency (24Hr),Total Volume All Currencies (24Hr),Circulating Supply,52 Week Range,1 Day Chart
0,2009,Bitcoin,BTC,Satoshi Nakamoto,SHA-256d,C++,PoW,The first and most widely used decentralized l...,BTC,Bitcoin,8429.31,-45.42,-0.54%,151.372B,20.057B,20.057B,20.057B,17.958M,,
1,2011,Litecoin,LTC,Charlie Lee,Scrypt,C++,PoW,One of the first cryptocurrencies to use Scryp...,LTC,Litecoin,56.56,-1.12,-1.94%,3.582B,3.054B,3.054B,3.054B,63.337M,,
2,2011,Namecoin,NMC,Vincent Durham,SHA-256d,C++,PoW,"Also acts as an alternative, decentralized DNS.",,,,,,,,,,,,
3,2012,Peercoin,PPC,Sunny King(pseudonym),SHA-256d,C++,PoW & PoS,The first cryptocurrency to use POW and POS fu...,,,,,,,,,,,,
4,2013,Dogecoin,DOGE,Jackson Palmer& Billy Markus,Scrypt,C++,PoW,Based on the Doge internet meme.,DOGE,Dogecoin,0.0022,-0.0,-0.26%,265.941M,55.406M,55.406M,55.406M,121.356B,,
5,2013,Gridcoin,GRC,Rob Hälford,Scrypt,C++,Decentralized PoS,Linked to citizen science through the Berkeley...,,,,,,,,,,,,
6,2013,Primecoin,XPM,Sunny King(pseudonym),1CC/2CC/TWN,"TypeScript, C++",PoW,Uses the finding of prime chains composed of C...,,,,,,,,,,,,
7,2013,Ripple,XRP,Chris Larsen &Jed McCaleb,ECDSA,C++,"""Consensus""",Designed for peer to peer debt transfer. Not b...,,,,,,,,,,,,
8,2013,Nxt,NXT,BCNext(pseudonym),SHA-256d,Java,PoS,Specifically designed as a flexible platform t...,,,,,,,,,,,,
9,2014,Auroracoin,AUR,Baldur Odinsson(pseudonym),Scrypt,C++,PoW,Created as an alternative currency for Iceland...,,,,,,,,,,,,


In [0]:
# Sanity check 1.3.1 for joined_on_name_df

if len(joined_on_name_df.columns) != 20:
    raise AssertionError('Your joined table has %d columns, an unexpected number.'%len(joined_on_name_df.columns))

#### Task 1.3.2 Cleaning up the names

You may have noticed a mismatch in how the currencies are named in the two data frames. Use the `apply` function to replace the values in the `price_df["Name"]` column so they better match the values in `cryptocurrency_df["Currency"]`.

Then rerun your join from 1.3.1 and name it the same way.
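
One detail worth checking when cleaning the names: `str.replace` removes a substring wherever it occurs, while a regular expression anchored with `$` removes it only at the end. A small sketch, in which `strip_usd_suffix` is a hypothetical helper and the last sample name is made up to show the difference:

```python
import re

def strip_usd_suffix(name):
    """Remove a trailing ' USD' only; leave the rest of the name alone."""
    return re.sub(r' USD$', '', name)

names = ['Bitcoin USD', 'Ethereum USD', 'Example USD Token USD']  # last one is made up
stripped = [strip_usd_suffix(name) for name in names]
print(stripped)

# For comparison, plain replace also removes the ' USD' in the middle.
replaced = [name.replace(' USD', '') for name in names]
print(replaced)
```

A function like this can be passed directly to `apply`, for example `price_df['Name'].apply(strip_usd_suffix)`.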

In [0]:
# TODO: Fix the Name column in price_df and redo the join

# YOUR CODE HERE
#strip USD from names
price_df['Name'] = price_df['Name'].apply(lambda x: x.replace(' USD', ''))

#join new dataframe together
joined_on_name_df = cryptocurrency_df.merge(price_df, how='left', left_on="Currency", right_on="Name").drop_duplicates()

#raise NotImplementedError()
display(joined_on_name_df)

Unnamed: 0,Release,Currency,Symbol_x,Founder,Hash_algorithm,Programming_language_of_implementation,Cryptocurrency_blockchain,Notes,Symbol_y,Name,Price (Intraday),Change,% Change,Market Cap,Volume in Currency (Since 0:00 UTC),Volume in Currency (24Hr),Total Volume All Currencies (24Hr),Circulating Supply,52 Week Range,1 Day Chart
0,2009,Bitcoin,BTC,Satoshi Nakamoto,SHA-256d,C++,PoW,The first and most widely used decentralized l...,BTC,Bitcoin,8429.31,-45.42,-0.54%,151.372B,20.057B,20.057B,20.057B,17.958M,,
1,2011,Litecoin,LTC,Charlie Lee,Scrypt,C++,PoW,One of the first cryptocurrencies to use Scryp...,LTC,Litecoin,56.56,-1.12,-1.94%,3.582B,3.054B,3.054B,3.054B,63.337M,,
2,2011,Namecoin,NMC,Vincent Durham,SHA-256d,C++,PoW,"Also acts as an alternative, decentralized DNS.",,,,,,,,,,,,
3,2012,Peercoin,PPC,Sunny King(pseudonym),SHA-256d,C++,PoW & PoS,The first cryptocurrency to use POW and POS fu...,,,,,,,,,,,,
4,2013,Dogecoin,DOGE,Jackson Palmer& Billy Markus,Scrypt,C++,PoW,Based on the Doge internet meme.,DOGE,Dogecoin,0.0022,-0.0,-0.26%,265.941M,55.406M,55.406M,55.406M,121.356B,,
5,2013,Gridcoin,GRC,Rob Hälford,Scrypt,C++,Decentralized PoS,Linked to citizen science through the Berkeley...,,,,,,,,,,,,
6,2013,Primecoin,XPM,Sunny King(pseudonym),1CC/2CC/TWN,"TypeScript, C++",PoW,Uses the finding of prime chains composed of C...,,,,,,,,,,,,
7,2013,Ripple,XRP,Chris Larsen &Jed McCaleb,ECDSA,C++,"""Consensus""",Designed for peer to peer debt transfer. Not b...,,,,,,,,,,,,
8,2013,Nxt,NXT,BCNext(pseudonym),SHA-256d,Java,PoS,Specifically designed as a flexible platform t...,,,,,,,,,,,,
9,2014,Auroracoin,AUR,Baldur Odinsson(pseudonym),Scrypt,C++,PoW,Created as an alternative currency for Iceland...,,,,,,,,,,,,


In [0]:
# Sanity check 1.3.2 for joined_on_name_df

if len(joined_on_name_df[joined_on_name_df["Name"].notna()]) == 0:
    raise AssertionError('Your join did not find any matches. Maybe you did something wrong?')

#### Task 1.3.3: Clean the citations out of the content.

As we saw in lecture, the html processing function converts Wikipedia citations to normal text. You may have noticed that this is keeping at least one of the cryptocurrencies from matching during the join. In the cell below, use `applymap` to remove these citations from the entire `cryptocurrency_df` table. Assume that every instance of "`[`" begins a citation. In this case only, it is okay if you delete everything after the "`[`", including the stuff after "`]`".

Then rerun your join from 1.3.2 and name it the same way. Did you get more matches?
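
Because `applymap` visits every cell, the cleaning function will also see numbers and missing values, not only strings. The sketch below (with a hypothetical helper name, `strip_citation`) cuts strings at the first `[` and passes everything else through unchanged.

```python
import re

def strip_citation(value):
    """Cut a string at its first '['; pass non-strings through untouched."""
    if isinstance(value, str):
        return re.sub(r'\[.*', '', value).strip()
    return value

samples = ['Ripple[23][24]', 'SHA-256d[citation needed]', 'Charlie Lee', 2013, None]
cleaned = [strip_citation(value) for value in samples]
print(cleaned)
```

Wrapping every value in `str(...)` instead would also work for strings, but it turns missing values into the literal text `'nan'` or `'None'`.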

In [0]:
# TODO: Remove citations

# YOUR CODE HERE
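import re
# Only strings can hold citations; leave NaN and numbers as they are
# (str(x) would turn NaN into the literal text 'nan').
cryptocurrency_df = cryptocurrency_df.applymap(lambda x: re.sub(r'\[.*', '', x) if isinstance(x, str) else x)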

joined_on_name_df = cryptocurrency_df.merge(price_df, how = 'left', left_on = 'Currency', right_on = 'Name').drop_duplicates()
#raise NotImplementedError()
display(joined_on_name_df)

Unnamed: 0,Release,Currency,Symbol_x,Founder,Hash_algorithm,Programming_language_of_implementation,Cryptocurrency_blockchain,Notes,Symbol_y,Name,Price (Intraday),Change,% Change,Market Cap,Volume in Currency (Since 0:00 UTC),Volume in Currency (24Hr),Total Volume All Currencies (24Hr),Circulating Supply,52 Week Range,1 Day Chart
0,2009,Bitcoin,BTC,Satoshi Nakamoto,SHA-256d,C++,PoW,The first and most widely used decentralized l...,BTC,Bitcoin,8429.31,-45.42,-0.54%,151.372B,20.057B,20.057B,20.057B,17.958M,,
1,2011,Litecoin,LTC,Charlie Lee,Scrypt,C++,PoW,One of the first cryptocurrencies to use Scryp...,LTC,Litecoin,56.56,-1.12,-1.94%,3.582B,3.054B,3.054B,3.054B,63.337M,,
2,2011,Namecoin,NMC,Vincent Durham,SHA-256d,C++,PoW,"Also acts as an alternative, decentralized DNS.",,,,,,,,,,,,
3,2012,Peercoin,PPC,Sunny King(pseudonym),SHA-256d,C++,PoW & PoS,The first cryptocurrency to use POW and POS fu...,,,,,,,,,,,,
4,2013,Dogecoin,DOGE,Jackson Palmer& Billy Markus,Scrypt,C++,PoW,Based on the Doge internet meme.,DOGE,Dogecoin,0.0022,-0.0,-0.26%,265.941M,55.406M,55.406M,55.406M,121.356B,,
5,2013,Gridcoin,GRC,Rob Hälford,Scrypt,C++,Decentralized PoS,Linked to citizen science through the Berkeley...,,,,,,,,,,,,
6,2013,Primecoin,XPM,Sunny King(pseudonym),1CC/2CC/TWN,"TypeScript, C++",PoW,Uses the finding of prime chains composed of C...,,,,,,,,,,,,
7,2013,Ripple,XRP,Chris Larsen &Jed McCaleb,ECDSA,C++,"""Consensus""",Designed for peer to peer debt transfer. Not b...,,,,,,,,,,,,
8,2013,Nxt,NXT,BCNext(pseudonym),SHA-256d,Java,PoS,Specifically designed as a flexible platform t...,,,,,,,,,,,,
9,2014,Auroracoin,AUR,Baldur Odinsson(pseudonym),Scrypt,C++,PoW,Created as an alternative currency for Iceland...,,,,,,,,,,,,


In [0]:
# Sanity check 1.3.3 for joined_on_name_df

print("%d matches found"%len(joined_on_name_df[joined_on_name_df["Name"].notna()]))
if len(joined_on_name_df[joined_on_name_df["Name"].notna()]) == 0:
    raise AssertionError('Your join did not find any matches. Maybe you did something wrong?')

#### Task 1.3.4 A Better Column

Look again at `cryptocurrency_df` and `price_df` and select better columns for indexing the join. Consider an `apply` function for the relevant column in `cryptocurrency_df` and for the relevant column in `price_df` that you select.

Name this table `joined_df`. To get the points for this section, you need to match at least as many currencies as our solution.
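
If the ticker symbol is the column you pick, the two sources format it differently: the Wikipedia cells can list several symbols (for example `BTC,[4] XBT, ₿`), while Yahoo appends the quote currency (for example `BTC-USD`). A rough sketch of normalizing both, assuming the first listed symbol is the one Yahoo uses; `wiki_symbol` and `yahoo_symbol` are hypothetical helper names.

```python
import re

def wiki_symbol(raw):
    """Keep the first symbol from a Wikipedia cell such as 'BTC,[4] XBT, ₿'."""
    first = re.split(r'[,\[]', str(raw), maxsplit=1)[0]
    return first.strip()

def yahoo_symbol(raw):
    """Drop the '-USD' quote-currency suffix from a Yahoo ticker."""
    return re.sub(r'-USD$', '', str(raw))

print(wiki_symbol('BTC,[4] XBT, ₿'), yahoo_symbol('BTC-USD'))
print(wiki_symbol('DOGE, XDG, Ð'), yahoo_symbol('DOGE-USD'))
```

Both helpers can be used with `apply` on the respective columns before merging.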

In [0]:
# TODO: Improve the join by switching to different columns

# YOUR CODE HERE

price_df['Symbol'] = price_df['Symbol'].apply(lambda x: x.replace('-USD', ''))
cryptocurrency_df['Symbol'] = cryptocurrency_df['Symbol'].apply(lambda x: re.sub(r'\,.*', '', str(x)))
joined_df = cryptocurrency_df.merge(price_df, left_on='Symbol', right_on='Symbol', how='left').drop_duplicates()

#raise NotImplementedError()

display(joined_df)

In [0]:
# Sanity check 1.3.4 for joined_df

print("%d matches found"%len(joined_df[joined_df["Name"].notna()]))
if len(joined_df[joined_df["Name"].notna()]) <= len(joined_on_name_df[joined_on_name_df["Name"].notna()]):
    raise AssertionError('Your new join is not better than the old one. Maybe you did something wrong?')

16 matches found


### Task 1.4: Save the cryptocurrency list in a database table

We don't want to continue to hit Wikipedia.org every time we want to consult the list of cryptocurrencies.  Save your `cryptocurrency_df` to sqlite, in a table called `cryptocurrency`.  

**The Dataframe `index` has no particular meaning, so don't save it!**
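
The effect of the `index` argument is easy to check on a throwaway in-memory database. This sketch uses a two-row toy dataframe and made-up table names (`with_index`, `without_index`) purely for illustration.

```python
import sqlite3
import pandas as pd

conn_demo = sqlite3.connect(':memory:')  # throwaway in-memory database

toy_df = pd.DataFrame({'Currency': ['Bitcoin', 'Litecoin'],
                       'Symbol': ['BTC', 'LTC']})

toy_df.to_sql('with_index', conn_demo, index=True)
toy_df.to_sql('without_index', conn_demo, index=False)

def column_names(table):
    """List a table's column names using SQLite's table_info pragma."""
    rows = conn_demo.execute('PRAGMA table_info(%s)' % table).fetchall()
    return [row[1] for row in rows]

print(column_names('with_index'))
print(column_names('without_index'))
```

With `index=True`, the unnamed dataframe index is written as an extra column called `index`, which is the column the sanity check for this task looks for.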

In [0]:
# TODO: convert cryptocurrency_df to sqlite

conn = sqlite3.connect('local.db')

cryptocurrency_df.to_sql("cryptocurrency", conn, if_exists="replace", index=False)

pd.read_sql_query('select * from cryptocurrency', conn)
# YOUR CODE HERE
#raise NotImplementedError()

In [0]:
# Sanity check 1.4 for sqlite databases

crypto2 = pd.read_sql_query('select * from cryptocurrency', conn)

if 'index' in crypto2:
    raise AssertionError('Please disable the index, since it isn\'t important information')
    
display(crypto2)

### Task 1.5: Read the cryptocurrency pages

Now let's take each of the cryptocurrency names and find the associated URL. The names of the currencies were originally clickable links on the [webpage](https://en.wikipedia.org/wiki/List_of_cryptocurrencies) that we made the table from, but unfortunately, `pandas` automatically deleted the URLs. So we have to regenerate them. Feel free to look at that page to see what the correct URL is for each currency.

In the cell below, complete the function `crawl`. The function name, inputs, first line, and last line are provided for you. 

`list_of_urls` should contain the URLs of interest as a list, column of a pandas DataFrame, or some other iterable over strings. 

`prefix` contains a common string that should be added to the beginning of every URL in `list_of_urls` before each URL is queried. 

The line `pages = {}` creates an empty dictionary. After running your part of the function `crawl`, `pages` should have currency names as its keys and the corresponding Wikipedia page contents as its values. This is what the function returns.

You have two options for completing this cell:

1. If you want to use `urllib.request.urlopen`, you should then use `read()` and `decode('utf-8')`.

2. If you want to use `scrapy`, follow the process in [this notebook from class](https://www.google.com/url?q=https://drive.google.com/file/d/1VfnlGr_VofdcEqACM2jRu2BwYm0QyTSh/view?usp%3Dsharing&sa=D&ust=1567968915286000&usg=AFQjCNG5iEWgUoA3DrRLhV1TKiT2OXHD1A).

For now, use a `try` statement to catch the errors and print a message that the URL could not be crawled. That is, in this cell we will have a **single rule** and not do any manual cleaning.  If you were doing this at web scale, you would be reluctant to invest a lot of manual effort...
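
The error-trapping pattern can be tested without touching the network by passing the download step in as a function. In the sketch below, `crawl_with` and `fake_fetch` are hypothetical names, and the rule that makes `fake_fetch` fail is invented for the demonstration; a real crawl would call `urllib.request.urlopen` instead.

```python
def crawl_with(fetch, list_of_urls, prefix=""):
    """Same shape as `crawl`, but the download step is passed in as `fetch`."""
    pages = {}
    for url in list_of_urls:
        try:
            pages[url] = fetch(prefix + url)
        except Exception as error:
            # Single rule: report the failure and keep going.
            print(url, 'was not crawled:', error)
    return pages

def fake_fetch(full_url):
    """Stand-in for urlopen(...).read().decode('utf-8'); no network needed."""
    if '"' in full_url:  # made-up failure rule, for illustration only
        raise ValueError('URL contains a quote character')
    return '<html>' + full_url + '</html>'

demo_pages = crawl_with(fake_fetch, ['Bitcoin', 'Ether or "Ethereum"'],
                        'https://en.wikipedia.org/wiki/')
print(sorted(demo_pages))
```

A failed URL is reported and left out of the dictionary, so one bad entry does not stop the rest of the crawl.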

In [0]:
# TODO: Crawl the pages.  
# Trap the errors and figure out what you need to fix (in the cleaning step below)

def crawl(list_of_urls, prefix=""):
  pages = {}
  for url in list_of_urls:
    try:
      pages[url] = urllib.request.urlopen(prefix+url).read().decode('utf-8')
    except Exception:
      # Single rule: report the failure and move on; no manual cleaning here.
      print(url,'was not crawled')
  return pages

# YOUR CODE HERE

The following cell passes the currencies in our table to the `crawl` function. 

You do not need to modify the cell.

In [0]:
# Sanity check 1.5.1 for initial crawl
pages = crawl(cryptocurrency_df['Currency'], 'https://en.wikipedia.org/wiki/')
for page in pages:
    print(page)
    
print ('Total crawl: %d cryptocurrencies'%len(pages))

Ether or "Ethereum" was not crawled
Bitcoin
Litecoin
Namecoin
Peercoin
Dogecoin
Gridcoin
Primecoin
Ripple
Nxt
Auroracoin
Dash
NEO
MazaCoin
Monero
NEM
PotCoin
Titcoin
Verge
Stellar
Vertcoin
Ethereum Classic
Tether
Zcash
Bitcoin Cash
EOS.IO
Total crawl: 25 cryptocurrencies


Did you get any errors? Did you ever get the wrong URL (and therefore the content from the wrong page)? Fix those two problems in the function `crawl_better` below. This function has the same inputs and outputs as `crawl`, but this time, it is okay if your fixes are specific to these sites. For example, you can try attaching `_(disambiguation)`, pull up that page's `etree.HTML(content)` and look for a link that has the name of the currency plus `' (cryptocurrency)'`.
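
The link lookup described above can be prototyped on a small hand-written page before running it against Wikipedia. The HTML below is a made-up stand-in, not real Wikipedia markup, and `find_currency_link` is a hypothetical helper name.

```python
from lxml import etree

# Hand-written stand-in for a disambiguation page; not real Wikipedia markup.
sample_html = '''
<html><body><ul>
  <li><a href="/wiki/Dash_(spacecraft)">Dash (spacecraft)</a></li>
  <li><a href="/wiki/Dash_(cryptocurrency)">Dash (cryptocurrency)</a></li>
</ul></body></html>
'''

def find_currency_link(content, currency):
    """Return the href of the link whose text is '<currency> (cryptocurrency)'."""
    dom = etree.HTML(content)
    wanted = currency + ' (cryptocurrency)'
    # $wanted is an XPath variable, filled in from the keyword argument
    hrefs = dom.xpath('//a[text()=$wanted]/@href', wanted=wanted)
    return hrefs[0] if hrefs else None

print(find_currency_link(sample_html, 'Dash'))
print(find_currency_link(sample_html, 'Monero'))
```

Returning `None` when no link matches lets the caller decide what to do next, rather than failing with an `IndexError`.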

In [0]:
# TODO: Re-run the crawl, fixing the issues

# Crawl the pages.  You may use urllib.request.urlopen or scrapy
# Assemble the list of results in the list pages.
# Trap the errors and figure out what you need to fix (in the cleaning step below)


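def fetch_page(url, prefix=""):
    """Fetch prefix+url. If the page does not mention 'blockchain', follow the
    first link whose text looks like a cryptocurrency article."""
    print('url: ' + str(url))

    try:
        info = urllib.request.urlopen(prefix+url).read().decode('utf-8')
        print('retrieved info', len(info))
    except Exception:
        print(url,'was not crawled')
        return None  # nothing was fetched, so there is nothing to return

    if 'blockchain' in info:
        return info

    # Probably a disambiguation page: look for a link to the currency's article.
    dom = etree.HTML(info)
    url_link = dom.xpath('//a[contains(text(),"cryptocurrency") or contains(text(), "payment protocol") or contains(text(), "payment network")]/@href')
    url_link = [item.replace('/wiki/', '') for item in url_link]
    if not url_link:
        return info  # no better candidate found; keep what we have
    info = urllib.request.urlopen(prefix+str(url_link[0])).read().decode('utf-8')
    print('retrieved new info', len(info))
    return info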

def crawl_better(list_of_urls, prefix=""):
    pages = {}
    count = 0
    not_blockchain_count = 0

    for url in list_of_urls:
        print('url: ' + str(url))

        # The table lists this currency as 'Ether or "Ethereum"', which is not
        # an article title; the article lives at /wiki/Ethereum.
        if url == 'Ether or "Ethereum"':
            url = 'Ethereum'

        try:
            pages[url] = urllib.request.urlopen(prefix+url).read().decode('utf-8')
        except Exception:
            print(url,'was not crawled')
            continue

        if 'blockchain' in pages[url]:
            count += 1
            print("contains blockchain", count)
        else:
            not_blockchain_count += 1
            print("does not contain blockhain: ", not_blockchain_count)

            # Wrong or ambiguous page: look for a link to the currency's own
            # article or, failing that, to a disambiguation page.
            dom = etree.HTML(pages[url])
            print(dom)
            url_link = dom.xpath('//a[contains(text(),"cryptocurrency") or contains(text(), "payment protocol") or contains(text(), "payment network") or contains(text(), "(disambiguation)")]/@href')
            print(url_link)
            url_link = [item.replace('/wiki/', '') for item in url_link]
            print(url_link)

            if url_link:
                better = fetch_page(url_link[0], prefix)
                if better is not None:
                    pages[url] = better

    return pages
# YOUR CODE HERE
#raise NotImplementedError()


As before, the cell below just runs your function and does not need to be modified.

In [0]:
# Sanity check 1.5.2 for better crawl

pages = crawl_better(cryptocurrency_df['Currency'], 'https://en.wikipedia.org/wiki/')


url: Bitcoin
contains blockchain 1
url: Litecoin
contains blockchain 2
url: Namecoin
contains blockchain 3
url: Peercoin
contains blockchain 4
url: Dogecoin
contains blockchain 5
url: Gridcoin
contains blockchain 6
url: Primecoin
contains blockchain 7
url: Ripple
does not contain blockhain:  1
<Element html at 0x7f7f23b5b9c8>
['/wiki/Ripple_(payment_protocol)', '/wiki/Ripple_Island_(disambiguation)']
['Ripple_(payment_protocol)', 'Ripple_Island_(disambiguation)']
url: Ripple_(payment_protocol)
retrieved info 82142
url: Nxt
contains blockchain 8
url: Auroracoin
contains blockchain 9
url: Dash
does not contain blockhain:  2
<Element html at 0x7f7f2668fa08>
['/wiki/Dash_(disambiguation)']
['Dash_(disambiguation)']
url: Dash_(disambiguation)
retrieved info 48445
retrieved new info 73820
url: NEO
contains blockchain 10
url: MazaCoin
contains blockchain 11
url: Monero
does not contain blockhain:  3
<Element html at 0x7f7f2668f048>
['/wiki/Monero_(cryptocurrency)']
['Monero_(cryptocurrency)']

### Task 1.6: Sanity-check and fix

Note that sometimes terms in Wikipedia are **ambiguous**, so just following the page doesn't always get what you want.  The Wikipedia page for [Tether](https://en.wikipedia.org/wiki/Tether) does not describe a cryptocurrency.

We can add a data-cleaning rule to check this: every cryptocurrency should mention the term "blockchain".  Here's a sanity check you can use.  If there are any disambiguation pages, you need to go back to Task 1.5 and update your process to crawl the right page.

You do not need to modify this cell.

In [0]:
count_wrong = 0

for page,content in pages.items():
    if isinstance(content, bytes):
        raise AssertionError('Please run decode(\'utf-8\') on the content to decode to a string')
        content = content.decode('utf-8')
        
    if 'blockchain' not in content:
        print(page + ': ' + ' -- did not find blockchain!')
        count_wrong = count_wrong + 1

        
print ('Total crawl: %d cryptocurrencies'%len(pages))

if count_wrong > 0:
    raise AssertionError('Need to follow Wikipedia disambiguation pages on %d items!'%count_wrong)

Total crawl: 26 cryptocurrencies


### Task 1.7: Clean the articles

So far, we have captured HTML content for each Wikipedia article, but HTML is not very easy to read and process. So the next step is to clean up the text in each article. To do that, you need to complete the function definition below. The function name and input are provided for you. 

The first step is to get a list of paragraphs of content. See our [slides](https://www.google.com/url?q=https://drive.google.com/a/seas.upenn.edu/file/d/163sCi0h5RJAXynE1Vo37bAQtOvcwW_wv/view?usp%3Dsharing&sa=D&ust=1567968915286000&usg=AFQjCNGDBY3SNFEJIh3m5k7GyYmhK2Q52w) on xpath for hints. Then, for each paragraph:

1. Remove the leading and trailing whitespace using `strip()`
2. Remove the paragraph entirely if it is only white space.
3. Remove the paragraph entirely if it is only numerics (you may use `isnumeric()` to test for this).

Finally join the paragraphs together into one string with spaces in between using `' '.join()`. The function should return that string (output).
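
The filtering steps can be checked on a plain list of strings, independent of any HTML parsing. In this sketch, `clean_paragraphs` is a hypothetical helper and the sample text is made up.

```python
def clean_paragraphs(paragraphs):
    """Strip each piece of text, drop empty and numeric-only pieces, then join."""
    stripped = [text.strip() for text in paragraphs]
    # After strip(), whitespace-only text becomes the empty string, and
    # ''.isspace() is False, so test truthiness rather than isspace().
    non_empty = [text for text in stripped if text]
    kept = [text for text in non_empty if not text.isnumeric()]
    return ' '.join(kept)

sample = ['  Bitcoin is a cryptocurrency.\n', '   ', '2009', '\n', 'It uses a blockchain.']
result = clean_paragraphs(sample)
print(result)
```

The order of the steps matters here: once `strip()` has run, a whitespace-only paragraph is empty, so the emptiness test has to account for that.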

In [0]:
# TODO: Complete the clean_article function, as described above.

def clean_article(content):
  dom = etree.HTML(content)
  # Text nodes under every <p> element
  paragraphs = dom.xpath('//p//text()')
  # (1) Strip leading and trailing whitespace
  paragraphs = [paragraph.strip() for paragraph in paragraphs]
  # (2) Drop whitespace-only entries. After strip() these are empty strings,
  #     and ''.isspace() is False, so test for emptiness instead.
  paragraphs = [paragraph for paragraph in paragraphs if paragraph]
  # (3) Drop purely numeric entries
  paragraphs = [paragraph for paragraph in paragraphs if not paragraph.isnumeric()]
  return ' '.join(paragraphs)
# YOUR CODE HERE
#raise NotImplementedError()

The following cell assembles our cleaned articles into a DataFrame. You do not need to modify the cell.

In [0]:
pages2 = []
for currency_name, content in pages.items():
    article = clean_article(content)
    pages2.append({'currency': currency_name, 'text': article})

pages_df = pd.DataFrame(pages2)

display(pages_df)

Unnamed: 0,currency,text
0,Bitcoin,Bitcoin [a] ( ₿ ) is a cryptocurrency . It i...
1,Litecoin,Litecoin ( LTC or Ł ) is a peer-to-peer crypt...
2,Namecoin,Namecoin ( Symbol : ℕ or NMC ) is a cryptocurr...
3,Peercoin,"Peercoin , also known as PPCoin or PPC , is a ..."
4,Dogecoin,"Dogecoin ( / ˈ d oʊ dʒ k ɔɪ n / DOHJ -koyn ,..."
5,Gridcoin,"Gridcoin implements a ""Proof-of-Research"" (POR..."
6,Primecoin,Primecoin ( sign : Ψ ; code: XPM ) is a crypto...
7,Ripple,Ripple is a real-time gross settlement syste...
8,Nxt,Nxt is an open source cryptocurrency and paym...
9,Auroracoin,"Auroracoin (code: AUR, symbol: ᚠ ) is a peer-..."


## Task 2: Build and run the classifier

Now that we have the cryptocurrency articles processed, it is time to return to the original task of building a classifier that can identify cryptocurrency articles.

### Task 2.1: Get the negative examples

If we want to build a (supervised) machine learning algorithm to detect content, we need both *positive* and *negative* examples.  In fact we want each successive training example to have an equal probability of being positive or negative.
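
One way to get a roughly even mix is to downsample the larger class and shuffle the result. This is only a sketch of that idea, not necessarily how the assignment balances its data; `balance` is a hypothetical helper, and the labels `'crypto'` and `'other'` are made up.

```python
import random

def balance(positive, negative, seed=0):
    """Downsample the larger class to the size of the smaller one, then shuffle."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    size = min(len(positive), len(negative))
    examples = ([(text, 'crypto') for text in rng.sample(positive, size)] +
                [(text, 'other') for text in rng.sample(negative, size)])
    rng.shuffle(examples)
    return examples

positive_texts = ['crypto article %d' % i for i in range(5)]
negative_texts = ['other article %d' % i for i in range(3)]
balanced = balance(positive_texts, negative_texts)
labels = [label for _, label in balanced]
print(labels.count('crypto'), labels.count('other'))
```

After balancing, both labels appear equally often, so a randomly drawn example is as likely to be positive as negative.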

The following cell runs your `crawl` function from Task 1.5 and your `clean_article` function from Task 1.7. Note: We are using `crawl` not `crawl_better` because you may have included data-specific choices in `crawl_better` that are no longer true.

You do not need to modify this cell.

In [0]:
training = [
    'https://en.wikipedia.org/wiki/Tim_Cook',
    'https://en.wikipedia.org/wiki/The_Great_British_Bake_Off',
    'https://en.wikipedia.org/wiki/Google',
    'https://en.wikipedia.org/wiki/Chan_Zuckerberg_Initiative',
    'https://en.wikipedia.org/wiki/Politics',
    'https://en.wikipedia.org/wiki/Fake_news',
    'https://www.snopes.com/fact-check/social-media-hacker-warning/',
    'https://www.cnn.com/2019/08/31/us/dorian-animals-foster-release-wxc/index.html',
    'https://www.foxnews.com/us/indiana-dispatcher-helps-boy-who-called-911-with-fractions-homework',
    'https://www.usatoday.com/story/tech/talkingtech/2019/08/31/hello-iphone-11-new-features-we-want-apple-next-models/2153565001/',
    'http://theconversation.com/bury-fc-the-economics-of-an-english-football-clubs-collapse-122727',
    'https://fivethirtyeight.com/features/economists-are-bad-at-predicting-recessions/'
]

negative = crawl(training)
negative2 = []
for site, content in negative.items():
    article = clean_article(content)
    negative2.append({'site': site, 'text': article})

negative_df = pd.DataFrame(negative2)
display(negative_df)

Unnamed: 0,site,text
0,https://en.wikipedia.org/wiki/Tim_Cook,"Timothy Donald Cook (born November 1, 1960) [..."
1,https://en.wikipedia.org/wiki/The_Great_Britis...,The Great British Bake Off (often abbreviated...
2,https://en.wikipedia.org/wiki/Google,Google LLC [5] is an American multinational ...
3,https://en.wikipedia.org/wiki/Chan_Zuckerberg_...,The Chan Zuckerberg Initiative ( CZI ) is a l...
4,https://en.wikipedia.org/wiki/Politics,Politics is a set of activities associated w...
5,https://en.wikipedia.org/wiki/Fake_news,"Fake news (also known as junk news , pseudo-..."
6,https://www.snopes.com/fact-check/social-media...,Snopes needs your help! Learn more . Accept...
7,https://www.cnn.com/2019/08/31/us/dorian-anima...,"By Madeline Holcombe , CNN Updated 3:20 AM ET,..."
8,https://www.foxnews.com/us/indiana-dispatcher-...,"This material may not be published, broadcast,..."
9,https://www.usatoday.com/story/tech/talkingtec...,Settings Cancel Set Have an existing account? ...


## Task 2.2: Process Document Text

Right now, each article is a single string, which means we have only one "feature" for the classifier. That is not enough. Tokenization (splitting the article into words) transforms the data so that we have one feature per word, which should give us enough features to train a classifier.
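
As a rough illustration of what "one feature per word" means, the sketch below splits a short string on whitespace and counts each token. The sample sentence and the `bag_of_words` helper are hypothetical; the assignment itself uses the NLTK tokenizers rather than `str.split`.

```python
from collections import Counter

def bag_of_words(text):
    """Split text on whitespace and count how often each lowercase token occurs."""
    return Counter(text.lower().split())

features = bag_of_words("Bitcoin is a cryptocurrency and Bitcoin is decentralized")
print(features["bitcoin"])   # 2
print(len(features))         # 6 distinct tokens, so 6 features
```

Each distinct token becomes a feature name and its count becomes the feature value.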

Complete the `get_words` function in the cell below. This function should take a string as input (the raw article).

1. Create an empty list to store the good words.

1. Break the article into sentences using the NLTK sentence tokenizer.

1. Tokenize and part-of-speech tag each sentence.

1. Run the provided `clean_word` function and Porter stemmer on each word.

1. Finally, append the word stem to the list of good words if all of the following are true:
    1. The word stem is of nonzero length.
    2. The word stem has a length less than 20.
    3. The word stem is not a stopword.
    4. The word is a noun.
    5. The word stem is in `vocabulary`. Only apply this rule if `vocabulary` has nonzero length. It has zero length by default.

6. Return the list of good words.

To match our solution, it is important that you do these steps in the given order.

In [0]:
# TODO: Complete the get_words function
sw = set(stopwords.words("english"))
sw.add("'s")
stemmer = PorterStemmer()

def clean_word(word):
    word = word.lower()
    word2 = ''
    for w in word:
        if w.isalpha() or (len(word2) > 0 and w.isnumeric()):
            word2 = word2 + w
    return word2

def get_words(article, vocabulary=[]):
# YOUR CODE HERE
  good_words = []
  sentences = nltk.sent_tokenize(article)

  for sentence in sentences:
    # Tag each token with its part of speech so that nouns can be selected.
    word_tokens = nltk.word_tokenize(sentence)
    tagged_words = nltk.pos_tag(word_tokens)
    for (word, tag) in tagged_words:
      cleaned_word = clean_word(word)
      stemmed_word = stemmer.stem(cleaned_word)
      # Keep short, non-stopword noun stems; apply the vocabulary
      # filter only when a vocabulary was supplied.
      if (0 < len(stemmed_word) < 20
          and stemmed_word not in sw
          and tag.startswith('N')
          and (len(vocabulary) == 0 or stemmed_word in vocabulary)):
        good_words.append(stemmed_word)
  return good_words

#raise NotImplementedError()

In [0]:
# Sanity check 2.2 for getting the word stems from articles
print(get_words("He wants to test the functionality of this sentence in article 091019. To be or not to be"))



['function', 'sentenc', 'articl']
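
The number `091019` is absent from the output because `clean_word` keeps a digit only when it follows at least one letter, so a purely numeric token is reduced to the empty string and then fails the nonzero-length check. The sketch below repeats the provided `clean_word` logic so that it runs on its own; the sample inputs are illustrative.

```python
def clean_word(word):
    """Lowercase the word and keep letters, plus digits that follow a letter."""
    word = word.lower()
    cleaned = ''
    for ch in word:
        if ch.isalpha() or (len(cleaned) > 0 and ch.isnumeric()):
            cleaned += ch
    return cleaned

print(repr(clean_word("091019")))    # ''
print(clean_word("SHA256"))          # sha256
print(clean_word("Proof-of-Stake"))  # proofofstake
```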


## Task 2.3: Train the classifier

Adapt the code from the NLTK lecture notebook to complete the `build_classifier` function. This function takes as input the two-column DataFrames `positive_df` and `negative_df`, plus an optional vocabulary list. It should run `get_words` on each article in each DataFrame, build an NLTK frequency distribution for each article, assemble the training set for a Naive Bayes classifier in the correct format, train the classifier, and return it.
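
The "correct format" for NLTK's Naive Bayes trainer is a list of `(featureset, label)` pairs, where each featureset is a dictionary-like mapping from feature name to value. A rough sketch of that structure follows. It uses `collections.Counter` (the class that `nltk.FreqDist` extends) so that it runs without NLTK, and the word lists and the `label_articles` helper are made up for illustration.

```python
from collections import Counter

def label_articles(word_lists, label):
    """Pair each article's word counts with a class label."""
    return [(Counter(words), label) for words in word_lists]

# Hypothetical word stems, standing in for the output of get_words.
positive_words = [["bitcoin", "block", "bitcoin"], ["coin", "network"]]
negative_words = [["polit", "govern"]]

training_set = (label_articles(positive_words, "positive")
                + label_articles(negative_words, "negative"))

for featureset, label in training_set:
    print(label, dict(featureset))
```

With NLTK installed, a list shaped like `training_set` is what `NaiveBayesClassifier.train` takes as its argument.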

In [0]:
# TODO: Complete the build_classifier function
import matplotlib.pyplot as plt

def build_classifier(positive_df, negative_df, vocabulary=[]):
    # Build one (FreqDist, label) pair per article, the format NLTK expects.
    pos_list_set = []
    for article in positive_df['text']:
        pos_list_set.append((nltk.FreqDist(get_words(article, vocabulary)), 'positive'))
    print(pos_list_set)

    neg_list_set = []
    for article in negative_df['text']:
        neg_list_set.append((nltk.FreqDist(get_words(article, vocabulary)), 'negative'))
    print(neg_list_set)

    # NaiveBayesClassifier is already imported at the top of the notebook.
    classifier = NaiveBayesClassifier.train(pos_list_set + neg_list_set)

    return classifier

# YOUR CODE HERE
#raise NotImplementedError()

In [0]:
# Sanity check 2.3 for training the classifier
classifier = build_classifier(pages_df, negative_df)
print(type(classifier))

# This should print <class 'nltk.classify.naivebayes.NaiveBayesClassifier'>

[(FreqDist({'bitcoin': 229, 'transact': 58, 'price': 40, 'block': 37, 'wallet': 29, 'exchang': 28, 'network': 26, 'blockchain': 24, 'cryptocurr': 22, 'us': 22, ...}), 'positive'), (FreqDist({'litecoin': 11, 'bitcoin': 5, 'octob': 3, 'ltc': 2, 'cryptocurr': 2, 'coin': 2, 'client': 2, 'network': 2, 'algorithm': 2, 'sha256': 2, ...}), 'positive'), (FreqDist({'namecoin': 21, 'domain': 11, 'name': 10, 'bitcoin': 8, 'record': 8, 'bit': 6, 'system': 5, 'block': 5, 'blockchain': 4, 'use': 4, ...}), 'positive'), (FreqDist({'peercoin': 3, 'system': 2, 'king': 2, 'coin': 2, 'monopoli': 2, 'ppcoin': 1, 'ppc': 1, 'cryptocurr': 1, 'august': 1, 'paper': 1, ...}), 'positive'), (FreqDist({'dogecoin': 42, 'coin': 14, 'commun': 11, 'cryptocurr': 10, 'time': 8, 'user': 6, 'bitcoin': 6, 'wise': 6, 'doge': 5, 'palmer': 5, ...}), 'positive'), (FreqDist({'gridcoin': 4, 'comput': 3, 'energi': 3, 'proofofstak': 2, 'por': 1, 'scheme': 1, 'user': 1, 'berkeley': 1, 'open': 1, 'infrastructur': 1, ...}), 'positive')

## Task 2.4: Run the classifier

Below are some sample pages.  Let's see if you can run the model on them.

### Task 2.4.1: Load the test set

Adapt the code from Task 2.1 for the new dataset. Call the final dataframe `inference_df`.

In [0]:
# TODO: Create inference_df
test = [
    'https://fried.com/history-of-bitcoin/',
    'https://news.wharton.upenn.edu/press-releases/2018/06/penn-launches-strategic-collaboration-ripple-accelerate-innovation-blockchain-cryptocurrency/',
    'https://en.wikipedia.org/wiki/Euro',
    'https://ew.com/movies/star-wars-rise-of-skywalker-footage-d23-expo/',
    'https://en.wikipedia.org/wiki/Donald_Trump'
]

inference = crawl(test)
inference2 = []

for site, content in inference.items():
    article = clean_article(content)
    inference2.append({'site': site, 'text': article})

inference_df = pd.DataFrame(inference2)
display(inference_df)

# YOUR CODE HERE
#raise NotImplementedError()

Unnamed: 0,site,text
0,https://fried.com/history-of-bitcoin/,"Follow us! Last updated: August 5th, 2019 Bitc..."
1,https://news.wharton.upenn.edu/press-releases/...,"PHILADELPHIA, PA, June 4, 2018 — The Wharton S..."
2,https://en.wikipedia.org/wiki/Euro,The euro ( sign : € ; co...
3,https://ew.com/movies/star-wars-rise-of-skywal...,With the new Star Wars: The Rise of Skywalker ...
4,https://en.wikipedia.org/wiki/Donald_Trump,"Donald John Trump (born June 14, 1946) is th..."


In [0]:
# Sanity check 2.4.1 loading the test set
display(inference_df)

Unnamed: 0,site,text
0,https://fried.com/history-of-bitcoin/,"Follow us! Last updated: August 5th, 2019 Bitc..."
1,https://news.wharton.upenn.edu/press-releases/...,"PHILADELPHIA, PA, June 4, 2018 — The Wharton S..."
2,https://en.wikipedia.org/wiki/Euro,The euro ( sign : € ; co...
3,https://ew.com/movies/star-wars-rise-of-skywal...,With the new Star Wars: The Rise of Skywalker ...
4,https://en.wikipedia.org/wiki/Donald_Trump,"Donald John Trump (born June 14, 1946) is th..."


### Task 2.4.2: Inference

Now let's run your classifier over the individual documents. Adapt the code from the NLTK lecture notebook. The function `classify` should take as input a two-column DataFrame like the ones we made previously, the trained classifier, and an optional vocabulary list. It should return a list of booleans, one per document, where `True` means the document was classified as a cryptocurrency article. For example, a perfect classifier should return

`classify(inference_df, classifier) = [True, True, False, False, False]`.

Note that you will need to run `get_words` (passing the vocabulary) and then generate an NLTK frequency distribution for each test article.

In [0]:
# TODO: Complete the classify function
def classify(df, classifier, vocabulary=[]):
# YOUR CODE HERE
    final_result = []

    for each_article in df['text']:
        # Use the same vocabulary filter that was applied during training.
        features = nltk.FreqDist(get_words(each_article, vocabulary))
        label = classifier.prob_classify(features).max()
        # Report True for a cryptocurrency ('positive') article.
        final_result.append(label == 'positive')

    return final_result
#raise NotImplementedError()

results = classify(inference_df, classifier)
display(results)

[False, False, False, False, False]

In [0]:
# Sanity check 2.4.2 classifier results
if len(results) != 5:
    raise AssertionError('We do not have a classification for each item.')

## Task 2.5: Make the vocabulary and re-classify

So far, our classifier is not very good. This is because it is trying to consider too many words, many of which did or did not occur in the training articles purely by chance. If we restrict the "attention" of the classifier to the most frequent words, it is much more likely to pick up real patterns rather than memorize accidents. We do this by making a vocabulary.
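
As a rough sketch of the idea, the snippet below keeps the `num` most frequent words of a small corpus and discards everything else from a new document. The corpus, the `top_words` and `restrict` helpers, and the cutoff are hypothetical; the assignment's own `make_vocabulary` works on the stems returned by `get_words`.

```python
from collections import Counter

def top_words(words, num):
    """Return the num most frequent words, most frequent first."""
    return [word for word, _ in Counter(words).most_common(num)]

def restrict(words, vocabulary):
    """Keep only the words that appear in the vocabulary."""
    allowed = set(vocabulary)
    return [word for word in words if word in allowed]

corpus = ["coin", "block", "coin", "wallet", "coin", "block", "palmer"]
vocabulary = top_words(corpus, 2)
print(vocabulary)                                       # ['coin', 'block']
print(restrict(["coin", "king", "block"], vocabulary))  # ['coin', 'block']
```

Rare words such as `palmer` or `king` never reach the classifier, so they cannot be memorized as accidental signals.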



In [0]:
# TODO: Complete the make_vocabulary function
def make_vocabulary(positive_df, negative_df, num):
# YOUR CODE HERE
    # Pool the word stems from all positive and all negative articles.
    pos_words = []
    for article in positive_df['text']:
        pos_words.extend(get_words(article))

    neg_words = []
    for article in negative_df['text']:
        neg_words.extend(get_words(article))

    # Keep the `num` most frequent stems from each class.
    pos_vocab = [word for word, _ in nltk.FreqDist(pos_words).most_common(num)]
    neg_vocab = [word for word, _ in nltk.FreqDist(neg_words).most_common(num)]

    # A stem that is frequent in both classes appears twice in the result.
    return pos_vocab + neg_vocab
#raise NotImplementedError()

In [0]:
# Sanity check 2.5.1 see final vocabulary size
vocabulary = make_vocabulary(pages_df, negative_df, 30)
print(len(vocabulary))

In [0]:
# Sanity check 2.5 improved classifier results
classifier_with_vocab = build_classifier(pages_df, negative_df, vocabulary)
results = classify(inference_df, classifier_with_vocab, vocabulary)
display(results)