## Scraping, Entropy and ICML papers.

ICML - the International Conference on Machine Learning - is a top research conference in machine learning. Scrape the PDFs of all ICML 2019 papers from http://proceedings.mlr.press/v97/.
1. What are the 10 most common words in the ICML papers?
2. Let $Z$ be a randomly selected word in a randomly selected ICML paper. Estimate the entropy of $Z$.
3. Synthesize a random paragraph using the marginal distribution over words.

In [1]:
from urllib import request
from bs4 import BeautifulSoup
import re
import os
import urllib

### Connect to the website, get the links of the pdfs and download the files

In [2]:
# connect to website and get list of all pdfs
url='http://proceedings.mlr.press/v97/'
response = request.urlopen(url).read()
soup = BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'\.pdf$'))  # find the links whose href ends in .pdf (the dot is escaped)

In [3]:
# collect the PDF URLs; each link's URL is stored in its 'href' attribute
url_list = [el['href'] for el in links]
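
A note on matching PDF links: in a regular expression an unescaped `.` matches any character, so a pattern such as `(.pdf)` also accepts strings that merely contain `pdf` after some other character. The sketch below compares that loose pattern with an escaped, end-anchored one. The hrefs are made up for illustration and are not taken from the proceedings page.

```python
import re

loose = re.compile(r'(.pdf)')    # '.' matches any character
strict = re.compile(r'\.pdf$')   # literal '.pdf' at the end of the string

# Hypothetical hrefs, for illustration only
hrefs = [
    'http://proceedings.mlr.press/v97/example19a/example19a.pdf',
    'http://example.org/xpdf-viewer.html',
]

loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]
print(len(loose_hits), len(strict_hits))  # 2 1
```

On a page where every matching link really is a PDF the two patterns select the same links; the stricter form only matters when unrelated links are present.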

In [4]:
folder_location = r'C:\Users\yixup\Desktop\ee_460j\lab5\webscraping'
os.makedirs(folder_location, exist_ok=True)  # create the folder if it does not exist

In [None]:
# download the pdfs to a specified location
for url in url_list:
    #Name the pdfs according to the link 
    fullfilename = os.path.join(folder_location, url.replace('http://proceedings.mlr.press/v97/', '').replace('/', '_'))
    request.urlretrieve(url, fullfilename)
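
Downloading every PDF takes a while, so it can help to skip files that are already on disk when the cell is re-run. The following is a minimal sketch of that idea. `local_pdf_name` and `needs_download` are hypothetical helpers, and the naming rule assumes each URL has the form `.../<paper>/<paper>.pdf`, which the example URL below only illustrates.

```python
import os
from urllib.parse import urlparse

def local_pdf_name(pdf_url):
    """Build a flat file name from the last two path components of the URL.

    Assumes URLs shaped like '.../<paper>/<paper>.pdf' (an assumption).
    """
    parts = [p for p in urlparse(pdf_url).path.split('/') if p]
    return '_'.join(parts[-2:])

def needs_download(pdf_url, folder):
    """Return True if the target file is not yet present in `folder`."""
    return not os.path.exists(os.path.join(folder, local_pdf_name(pdf_url)))

# Made-up URL, for illustration only
name = local_pdf_name('http://proceedings.mlr.press/v97/example19a/example19a.pdf')
print(name)  # example19a_example19a.pdf
```

In the download loop, a check such as `if needs_download(url, folder_location):` placed before `request.urlretrieve` would avoid fetching the same file twice.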

### Extract all the text from PDFs

In [16]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import glob
from tika import parser
import re

In [41]:
keywords = []
regex = re.compile(r'[a-zA-Z]')  # a token must contain at least one ASCII letter
for filename in glob.glob(os.path.join(folder_location, '*.pdf')):  # every .pdf file in the folder
    raw = parser.from_file(filename)  # Tika returns the text content plus metadata of the PDF
    if raw['content'] is None:  # skip PDFs with no extractable text
        continue

    # word_tokenize() splits the text into individual word tokens
    tokens = word_tokenize(raw['content'])
    # Keep tokens that are longer than one character and contain at least one letter
    keyword = [word for word in tokens if len(word) > 1 and regex.search(word)]
    keywords.extend(keyword)  # one long list with the words from all files
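
To see what the token filter lets through, here is a small sketch that applies the same rule (length greater than 1, at least one ASCII letter) to a hand-written token list. `keep_token` is a hypothetical helper and the sample tokens are invented, not drawn from the corpus. Tokens such as `i=1` or `al.` pass the rule, which is why entries like these later need to be added to the stop-word list by hand.

```python
import re

HAS_LETTER = re.compile(r'[a-zA-Z]')

def keep_token(token):
    """Keep tokens longer than one character that contain an ASCII letter."""
    return len(token) > 1 and HAS_LETTER.search(token) is not None

# Hand-written sample tokens, for illustration only
sample = ['the', 'a', '42', 'i=1', 'al.', '∑', 'x1', '.']
kept = [t for t in sample if keep_token(t)]
print(kept)  # ['the', 'i=1', 'al.', 'x1']
```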

### 1. Get the 10 most common words

In [44]:
from collections import Counter #A way to get the top N common elements in the list fast

### Without filtering

In [45]:
c0 = Counter(keywords)
print('Without filtering all the common English words')
c0.most_common(10)

Without filtering all the common English words


[('the', 369109),
 ('of', 201161),
 ('and', 179808),
 ('to', 121539),
 ('is', 106814),
 ('in', 102706),
 ('for', 89356),
 ('that', 71473),
 ('we', 68803),
 ('with', 60573)]

### The words above carry no useful information, so we filter out common words below

In [61]:
# Filter out all the common words, e.g. 'a', 'an', 'we', etc.
stop_words = stopwords.words('english')
update1 = ['et', 'The', 'In', 'We', 'al.', 'For', 'using', 'M.', 'A.', 'S.', 'Figure', 'J.', 'use', 'pp', 'also', 'This',
          'D.', 'al', 'one', 'two', 'first', 'R.', 'Let', 'i=1', 'To', 'used', 'C.', 'P.', 'T.', 'K.', 'Then', 'based', 'L.',
          'ing', 'G.', 'Our', 'Proof', 'Section', 'Y.', 'Table', 'As', 'xi', 'via', 'H.', 'It', 'However', 'example', 'https',
          'graph', 'may', 'since', 'E.', 'N.', 'B.', 'xt', 'i.e.', 'Since', 'By', 'On', 'Thus', 'let', 'Eq', 'even', 'F.',
          'Appendix', 'us', 't=1', 'I.', 'Rd', 'three', 'Li', 'V.', 'W.', 'respectively', 'n∑', 'thus', 'j=1', 'fact', 'An',
          'e.g.', 'known', 'note', 'xk', 'ln', 'st', 'Here', 'x1', 'Z.', 'Finally', 'graphs', 'These', 'Fig', 'Pr', 'tions',
          'instead', '/latexit', 'latexit', 're-', 'sha1_base64=', 'First', 'RL', 'like', 'real', 'com-', 'while', 'e.g.', 'T∑',
          'often', 'X.', 'yi', 'gives', 'within', 'x*', 'At', 'If', 'Given', 'still', 'per', 'e.g', 'in-', 'x∗', 'x0', 's′',
          'One', 't+1', 'th', 'de-', 'O.', 'wt']
stop_words.extend(update1)
stop_set = set(stop_words)  # set membership is much faster than scanning a list
keywords_filtered = [word for word in keywords if word not in stop_set]

In [63]:
c1 = Counter(keywords_filtered)
print('Meaningful words after filtering')
c1.most_common(10)

Meaningful words after filtering


[('learning', 18821),
 ('model', 17814),
 ('data', 15188),
 ('set', 14700),
 ('function', 14483),
 ('log', 13214),
 ('Learning', 12412),
 ('training', 11866),
 ('algorithm', 11459),
 ('distribution', 9601)]
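
The filtered counts list `learning` and `Learning` as separate words because the tokens are compared case-sensitively. One possible refinement, sketched here as an optional step rather than as part of the original analysis, is to fold case before ranking. `fold_case` is a hypothetical helper, and the demo reuses three of the counts reported above.

```python
from collections import Counter

def fold_case(counter):
    """Merge the counts of tokens that differ only in capitalization."""
    folded = Counter()
    for word, n in counter.items():
        folded[word.lower()] += n
    return folded

# Three of the counts reported above
demo = Counter({'learning': 18821, 'Learning': 12412, 'model': 17814})
print(fold_case(demo).most_common(2))  # [('learning', 31233), ('model', 17814)]
```

Applying this to the full counter would change the ranking, so the top-10 list would need to be recomputed; folding also merges acronyms with ordinary words, which may or may not be desirable.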

### 2. Estimate the entropy
Use only the 10,000 most frequent words: each term $-p \log_2 p$ tends to 0 as $p$ approaches 0, so an individual rare word contributes very little to the sum.
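
For reference, the estimate computed here can be written out as follows, under the assumption (matching the code in this notebook) that $Z$ is drawn from the pooled word counts of all papers. With $n_w$ the number of occurrences of word $w$ and $N$ the total number of tokens:

$$\hat{H}(Z) = -\sum_{w} \hat{p}(w) \log_2 \hat{p}(w), \qquad \hat{p}(w) = \frac{n_w}{N}.$$

Every term in the sum is non-negative, so restricting it to the $k$ most frequent words $w_1, \dots, w_k$ gives a lower bound:

$$-\sum_{i=1}^{k} \hat{p}(w_i) \log_2 \hat{p}(w_i) \;\le\; \hat{H}(Z).$$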

In [149]:
import numpy as np

In [150]:
# Define a function to calculate a single term of the entropy sum
def entropy(p):
    return -p * np.log2(p)

In [164]:
# Define a function to calculate the estimated entropy of a randomly selected word Z
def entropy_tot(key, n_estimate=10000):
    c = Counter(key)
    n_tot = len(key)

    # n_estimate is the number of most frequent words used in the estimate
    # (the larger it is, the closer the sum gets to the value over the full vocabulary)
    top = c.most_common(n_estimate)

    # Empirical probability of each of the top n_estimate words
    counts = np.array([count for _, count in top])
    p = counts / n_tot

    # entropy() works element-wise on NumPy arrays, so np.vectorize is not needed
    return np.sum(entropy(p))

In [165]:
print('Estimated entropy is: ', entropy_tot(keywords))

Estimated entropy is:  9.140278196305813
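
As a sanity check on the estimator, the sketch below computes the same kind of sum on a toy corpus where the answer is known: four equally likely words have an entropy of exactly 2 bits. `plugin_entropy` is a hypothetical stand-in written for this check, not the function used above, and `top_k=None` means the full vocabulary is used.

```python
from collections import Counter
import numpy as np

def plugin_entropy(words, top_k=None):
    """Sum of -p*log2(p) over the top_k most frequent words (all words if None)."""
    counts = np.array([n for _, n in Counter(words).most_common(top_k)], dtype=float)
    p = counts / len(words)
    return float(-(p * np.log2(p)).sum())

toy = ['a', 'b', 'c', 'd'] * 25        # uniform over four words
print(plugin_entropy(toy))             # 2.0
print(plugin_entropy(toy, top_k=2))    # 1.0, truncation can only lower the sum
```

Calling the same function on `keywords` with `top_k=None` would show how much of the sum the 10,000-word cut-off leaves out.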


### 3. Synthesize a random paragraph using the marginal distribution over words.

In [179]:
top = c0.most_common(100)
# Convert top to a NumPy array to get the words and their probabilities as 1-D arrays
top_result = np.array(top)
counts = top_result[:, 1].astype(int)  # np.array() stored the counts as strings
p_values = counts / np.sum(counts)  # normalize so the probabilities sum to 1
word_list = top_result[:, 0]  # 1-D NumPy array of the words

In [193]:
# Generate a random paragraph (up to 300 words) by sampling words in proportion to their frequency
previous = ''
for i in range(300):
    word = np.random.choice(word_list, p=p_values)  # draw a single word
    w = re.sub(r'\W', '', str(word))  # strip punctuation such as the dot in 'al.'
    
    if previous==w or len(w)==1:
        continue
        
    if (i%15 == 0) and (i > 15):
        print('.')
        
    print(w, end=' ')
    
    previous = w

in the same time not that in algorithm In data Theorem and if the and the al the in the that of problem of with is For to of .
as log of the an Figure the of are gradient that for .
the et from bound which where that and the to for the .
in be and The and as on be the with the network in if .
as are by problem Figure be is of same not in method an the .
in also case which the different when loss in the where The with by data are different in use then by time is networks on for to .
be which of the of the with of The of data The and it .
have are of case In to al in of our Let set if we .
et the also the of learning is for of loss it of to an .
used number for the using we Learning We and of and In in for .
with in has is al is of is the on and of .
problem only we that to we the as log the from using results for .
The that are with the in that has log the pp and the of for .
of for The of and the to algorithm of for then the for .
which not our of or Conference is also to the is of ove
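
The same sampling can also be done in a single vectorized call. The sketch below is an alternative under stated assumptions, not the notebook's method: it uses NumPy's `default_rng` with a fixed seed for reproducibility, draws all words at once, and ends a sentence every 15 words. Unlike the loop above, it does not drop repeated or single-letter words. `sample_paragraph` and the toy vocabulary are hypothetical; in the notebook, `word_list` and `p_values` would be passed instead.

```python
import numpy as np

def sample_paragraph(words, probs, n_words=300, per_sentence=15, seed=0):
    """Draw n_words i.i.d. from the marginal distribution and join them."""
    rng = np.random.default_rng(seed)
    drawn = rng.choice(words, size=n_words, p=probs)
    sentences = [
        ' '.join(drawn[i:i + per_sentence]) + '.'
        for i in range(0, n_words, per_sentence)
    ]
    return ' '.join(sentences)

# Toy vocabulary standing in for word_list and p_values
toy_words = np.array(['the', 'of', 'learning', 'model'])
toy_probs = np.array([0.4, 0.3, 0.2, 0.1])
text = sample_paragraph(toy_words, toy_probs, n_words=30)
print(text)
```

Because each word is drawn independently of its neighbours, the result has realistic word frequencies but no grammar, which is the expected behaviour of a marginal (unigram) model.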