# Scraping web pages

In this notebook we examine how to extract the text from one or more web pages. Web pages are written in HTML, which you can view from your web browser. Unfortunately, web page authors can structure the information in their pages however they like, so there is no universal rule for where a given piece of information lives. To determine which HTML tags and attributes are associated with the information of interest, we need to examine the HTML underlying the page.

To see the HTML of a web page:

* **Google Chrome** - from the upper right menu (three vertical dots), select "More Tools" > "Developer Tools"
* **Microsoft Internet Explorer** - press Ctrl+U to view the page source, or press F12 and click the Debugger tab
* **Microsoft Edge** - press Ctrl+U to view the page source, or press F12 and select the Elements tab
* **Mozilla Firefox** - press Ctrl+U
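
Once the relevant tags have been spotted in the browser, they can be looked up with BeautifulSoup. The snippet below is a minimal sketch on a made-up HTML fragment; the names `SAMPLE_HTML`, "Jane Doe", and the class names are illustrative assumptions, not the markup of any real page.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML fragment standing in for a real page (assumption).
SAMPLE_HTML = """
<div class="row title">
  <h2>Jane Doe</h2>
  <a href="/faculty/faculty-directory/jane-doe">Profile</a>
</div>
<div class="bio"><p>Jane studies pricing.</p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching tag; .text gives the text inside it.
name = soup.find("h2").text
# Attributes are read with .get(), which returns None if the attribute is absent.
link = soup.find("a").get("href")
# The attrs dictionary restricts the match to tags with the given attribute value.
bio = soup.find("div", attrs={"class": "bio"}).text

print(name)
print(link)
print(bio)
```

The same three operations (find a tag, read an attribute, read the text) are all that the scraping code in this notebook relies on.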



# Tuck Faculty Page

In [0]:
import re
import urllib.request

from bs4 import BeautifulSoup

# Directory page that links to each professor's profile page
faculty_url = "https://www.tuck.dartmouth.edu/faculty/faculty-directory"


In [0]:

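def convert_relative_link_to_full_url(link):
  """Return the full URL of a professor page, given an <a> tag from the directory page."""
  link_text = link.get('href')
  # The href already begins with the directory path, so strip that prefix
  # before appending the remainder to faculty_url.
  link_text = link_text.replace("/faculty/faculty-directory", "")
  link_to_professor_page = faculty_url + link_text
  return link_to_professor_page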

def get_professor_name(soup_of_professor_page):
  """Return the professor's name, taken from the <h2> in the title row of a professor page."""
  title_items = soup_of_professor_page.find('div', attrs={'class': re.compile("row title fullWidth")})
  name_column = title_items.find('div', attrs={'class': "large-6 medium-6 columns"})
  prof_name_element = name_column.find('h2')
  prof_name = prof_name_element.text
  return prof_name
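
As an alternative to stripping the directory prefix by hand, the standard library's `urllib.parse.urljoin` can resolve an href against a base URL. This is a sketch of a swapped-in technique, not the approach used in the cell above; the function name `to_full_url` and the slug `jane-doe` are hypothetical.

```python
from urllib.parse import urljoin

faculty_url = "https://www.tuck.dartmouth.edu/faculty/faculty-directory"

def to_full_url(href, base_url=faculty_url):
    """Resolve href against base_url (hypothetical helper)."""
    # An href that starts with "/" replaces the path of base_url;
    # an href that is already a full URL is returned unchanged.
    return urljoin(base_url, href)

full = to_full_url("/faculty/faculty-directory/jane-doe")
print(full)
```

One reason to prefer this form is that it also handles hrefs that are already absolute, which the prefix-stripping approach would turn into a malformed URL.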

In [0]:
# The list professor_stats will contain a triplet for each professor. 
# The triplet for a professor contains the three strings:
#     professor's name, the hyperlink to the professor's web page, and the text on the bio section of the professor's web page 
professor_stats = []

results_page = urllib.request.urlopen(faculty_url)
faculty_page_parse = BeautifulSoup(results_page, 'html.parser')

# Iterate through all professor links on the faculty page.
# A link to a professor's web page is one whose href contains "/faculty/faculty-directory/".
# (find_all is the current BeautifulSoup name for the older findAll alias.)
for relative_link in faculty_page_parse.find_all('a', attrs={'href': re.compile("/faculty/faculty-directory/")}):
  link_to_professor_page = convert_relative_link_to_full_url(relative_link)

  professor_page_contents = urllib.request.urlopen(link_to_professor_page)
  professor_page_parse = BeautifulSoup(professor_page_contents, 'html.parser')

  professor_name = get_professor_name(professor_page_parse)
  
  # Get all text from the bio section. find() returns None when nothing
  # matches, so skip any page that has no bio section.
  bio_items = professor_page_parse.find('div', attrs={'class': re.compile("bio")})
  if bio_items is None:
    continue
  bio_text = bio_items.text
  
  professor_triplet = (professor_name, link_to_professor_page, bio_text)
  professor_stats.append(professor_triplet)
  if len(professor_stats) % 10 == 0:
    print('Found faculty pages for',len(professor_stats),'professors so far.')  


  
print('Found faculty pages for',len(professor_stats),'professors in total.') 
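
A directory page may link to the same profile more than once, for example from both a photo and a name. Whether the Tuck page does so is an assumption here, but if it does, each professor would be fetched and counted twice. The sketch below collects each matching href only once, in page order; `SAMPLE_DIRECTORY_HTML` and `unique_profile_hrefs` are made-up names for illustration.

```python
import re
from bs4 import BeautifulSoup

# Made-up directory fragment: two links to the same profile, one to another
# profile, and one unrelated link (assumption, not the real page).
SAMPLE_DIRECTORY_HTML = """
<a href="/faculty/faculty-directory/jane-doe"><img src="jane.jpg"></a>
<a href="/faculty/faculty-directory/jane-doe">Jane Doe</a>
<a href="/faculty/faculty-directory/john-roe">John Roe</a>
<a href="/about">About</a>
"""

def unique_profile_hrefs(soup, pattern="/faculty/faculty-directory/"):
    """Return matching hrefs in page order, dropping repeats (hypothetical helper)."""
    seen = set()
    hrefs = []
    for link in soup.find_all("a", attrs={"href": re.compile(pattern)}):
        href = link.get("href")
        if href not in seen:
            seen.add(href)
            hrefs.append(href)
    return hrefs

sample_soup = BeautifulSoup(SAMPLE_DIRECTORY_HTML, "html.parser")
profile_hrefs = unique_profile_hrefs(sample_soup)
print(profile_hrefs)
```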

# Make a word cloud from the text of all professor bios

In [0]:
from wordcloud import WordCloud, STOPWORDS
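import matplotlib.pyplot as plt

# Concatenate all bios into one lowercase string with punctuation removed.
# A space is added after each bio so that the last word of one bio does not
# merge with the first word of the next.
allText = ""
for (_, _, bio_text) in professor_stats:
  bio_text = re.sub(r'[^\w\s]', '', bio_text)
  allText += bio_text.lower() + " "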

# generate word cloud from the string
wc = WordCloud().generate(allText)
plt.imshow(wc, interpolation='bilinear')
plt.axis("off") 
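
When several bios are combined into one string, a separator between them keeps the last word of one bio from fusing with the first word of the next. The sketch below shows the clean-up step on two made-up bios; `normalize_bios` and `sample_bios` are hypothetical names.

```python
import re

def normalize_bios(bios):
    """Strip punctuation, lowercase, and join bios with single spaces (hypothetical helper)."""
    cleaned = []
    for bio in bios:
        # Remove every character that is neither a word character nor whitespace.
        no_punctuation = re.sub(r"[^\w\s]", "", bio)
        cleaned.append(no_punctuation.lower())
    return " ".join(cleaned)

sample_bios = ["Studies pricing.", "Teaches finance!"]
normalized = normalize_bios(sample_bios)
print(normalized)  # studies pricing teaches finance
```

Joining with an empty string instead would produce the token "pricingteaches", which would then show up as a spurious rare word.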



# Create a word frequency table

In [0]:
from collections import Counter

def get_word_frequencies_from_string(text):
  """Count how often each whitespace-separated token occurs in text."""
  return Counter(text.split())

word_frequencies = get_word_frequencies_from_string(allText)

print("Most common words:")
print("Word\tFrequency:")
for word, frequency in word_frequencies.most_common(20):
  print("{}\t{:,}".format(word, frequency))
  
  
print()
print("Least common words:")
print("Word\tFrequency:")
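# most_common() with no argument returns all words, most frequent first;
# the slice [:-11:-1] walks backward from the end to take the 10 least frequent.
for word, frequency in word_frequencies.most_common()[:-11:-1]:
  print("{}\t{:,}".format(word, frequency))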
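
The two `most_common` calls are easier to follow on a small example. This sketch uses a made-up seven-token string; the variable names are illustrative only.

```python
from collections import Counter

# Counts: a=3, b=2, c=1, d=1
counts = Counter("a b a c b a d".split())

# most_common(n) returns the n most frequent (word, count) pairs.
top_two = counts.most_common(2)

# most_common() returns every pair, most frequent first. The slice [:-3:-1]
# starts at the last element and steps backward, stopping before index -3,
# so it yields the two least frequent pairs, rarest first.
bottom_two = counts.most_common()[:-3:-1]

print(top_two)     # [('a', 3), ('b', 2)]
print(bottom_two)  # [('d', 1), ('c', 1)]
```

In general, `most_common()[:-(n + 1):-1]` gives the `n` least frequent entries, which is why the cell above uses `-11` to list ten words.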

# Removing our own stop words from the bio text

In [0]:
# generate word cloud from the string using our own stopwords

stop_words = STOPWORDS.union({"tuck", "school", "business", "professor", "academic", "university", "award", "awards", "research", "journal", "publications"})

wc = WordCloud(stopwords=stop_words).generate(allText)
plt.imshow(wc, interpolation='bilinear')
plt.axis("off") 

From the word frequency table, remove the stop words we defined above.

In [0]:
# Delete each stop word from the counter; pop() with a default does nothing
# for words that were never counted.
for word in stop_words:
  word_frequencies.pop(word, None)

print("Most common words:")
print("Word\tFrequency:")
for word, frequency in word_frequencies.most_common(20):
  print("{}\t{:,}".format(word, frequency))
  
  
print()
print("Least common words:")
print("Word\tFrequency:")
for word, frequency in word_frequencies.most_common()[:-11:-1] :
  print("{}\t{:,}".format(word, frequency))  
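
Instead of deleting stop words from an existing counter, the tokens can be filtered before they are counted. The following is a minimal sketch of that alternative on a made-up sentence; `count_words_excluding`, `sample_text`, and `sample_stop_words` are hypothetical names.

```python
from collections import Counter

def count_words_excluding(text, stop_words):
    """Count whitespace-separated tokens, skipping any in stop_words (hypothetical helper)."""
    return Counter(token for token in text.split() if token not in stop_words)

sample_text = "tuck professor studies pricing and pricing strategy"
sample_stop_words = {"tuck", "professor", "and"}

filtered_counts = count_words_excluding(sample_text, sample_stop_words)
print(filtered_counts.most_common(1))  # [('pricing', 2)]
```

Filtering up front leaves the original text untouched, so the same string can be counted again later with a different stop word list.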