# <font color=blue>_Scraping Kellogg Faculty Pages_.</font>

## Task - Extract data from Kellogg Faculty Index
From the Kellogg Faculty Directory, we will extract each faculty member's name, website, and position using the __Beautiful Soup__ and __Selenium__ libraries in Python.

## <font color=blue>Step 1: Inspect the Webpage</font>

First let's take a closer look at the webpage we are scraping.

In [None]:
from IPython.display import Image
Image(filename= "empirical-workshop-2020/5-harvest/image1.png", width=5000, height=5000)

In a Chrome browser, we can highlight the element we want (a professor's name), right-click, and select __Inspect Element__ or __Inspect__.

In [None]:
from IPython.display import Image
Image(filename= "empirical-workshop-2020/5-harvest/image2.png", width=5000, height=5000)

This will take you to Chrome's Developer Tools, where you can inspect the HTML __tags__ and __attributes__ of the element you highlighted.

In [None]:
from IPython.display import Image
Image(filename= "empirical-workshop-2020/5-harvest/image3.png", width=5000, height=5000)

In the Developer Tools, you can see that Professor Abdallah's webpage ("/faculty/directory/abdallah_tarek") can be found within the __href__ attribute of the __&lt;a>__ tag. 

The __&lt;a>__ tag is a 'child' of the 'parent' __&lt;h2>__ tag, whose __id__ attribute has the value "facName".
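
To make the parent/child relationship concrete, here is a minimal sketch that parses a hand-written HTML fragment with Beautiful Soup. The fragment is an assumption modeled on the description above (an `<h2 id="facName">` wrapping an `<a>` tag). The real page's markup may contain more attributes, and the link text is a placeholder.

```python
import bs4 as bs

# Hypothetical fragment modeled on the structure described above;
# the live page may differ in its details.
sample_html = """
<h2 id="facName">
  <a href="/faculty/directory/abdallah_tarek">Placeholder Name</a>
</h2>
"""

sample_soup = bs.BeautifulSoup(sample_html, "html.parser")

# The parent tag: an <h2> whose id attribute has the value "facName"
parent = sample_soup.find("h2", {"id": "facName"})

# The child tag: the <a> nested inside that <h2>
child = parent.find("a")

print(child["href"])       # value of the href attribute
print(child.parent.name)   # name of the tag that encloses the <a>
```
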

Note that if you hover your mouse over an HTML tag, the corresponding element is highlighted on the webpage.

In [None]:
from IPython.display import Image
Image(filename= "empirical-workshop-2020/5-harvest/image4.png", width=5000, height=5000)

To find the HTML table that contains information for all faculty members, we can follow the same steps: the table sits inside a __&lt;div>__ tag whose __id__ attribute is "bindFaculty" (shown above).

For more information on understanding the HTML code, see: https://www.tutorialrepublic.com/html-tutorial/


If you inspect this page further, you will notice that it initially lists only 36 faculty members.

In [None]:
from IPython.display import Image
Image(filename= "empirical-workshop-2020/5-harvest/image5.png", width=5000, height=5000)

In order to see all the faculty members at Kellogg, we need to click on the __MORE FACULTY__ button.  This is where _selenium_ comes in.

## <font color=blue>Step 2: Using Selenium to Control a Web Browser</font>

### Import Packages
Before we can use Selenium, we need to import the necessary libraries and set the browser options. Note that Selenium controls Firefox through the __geckodriver__ executable.

In [None]:
# import library for sleep times
import time

# import BeautifulSoup
import bs4 as bs



# import selenium libraries and options
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
# run Firefox in headless mode (no visible browser window)
options = Options()
options.add_argument("-headless")

Let's load the webpage we are using.

In [None]:
url = 'https://www.kellogg.northwestern.edu/faculty/faculty_directory.aspx'
print(url)

### Start a Browser Session

Next, we will create a Firefox session. Selenium must be able to locate the geckodriver executable, for example on your system PATH.

In [None]:
# create a new Firefox Session
driver = webdriver.Firefox(options=options)
driver.implicitly_wait(3)

# load the website
driver.get(url)

### Button Clicks

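Because we set the headless option above, Firefox runs in the background without opening a window (remove that option if you want to watch the browser work). Let's tell Selenium to click on the __MORE FACULTY__ button to load more faculty profiles.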

In [None]:
# click the "MORE FACULTY" button once
more_button = driver.find_element(By.LINK_TEXT, 'MORE FACULTY')
more_button.click()

We can repeat this step until the full directory is loaded.

In [None]:
# keep clicking until the button is no longer clickable (the wait times out)
while True:
    try:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.LINK_TEXT, 'MORE FACULTY'))).click()
        time.sleep(3)
    except TimeoutException:
        break
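
The loop above stops only when the wait times out. If the button never disappears, for example because the page layout changes, the loop would run forever. One way to guard against that is to cap the number of clicks. The helper below is a hypothetical sketch, not part of Selenium: `click_until_gone` and its parameters are names introduced here, and `click_once` stands for any function that performs one click and returns `False` once the button can no longer be clicked.

```python
import time


def click_until_gone(click_once, max_clicks=200, pause=0.0):
    """Call click_once() until it returns False or max_clicks is reached.

    Returns the number of successful clicks.
    """
    clicks = 0
    while clicks < max_clicks:
        if not click_once():
            break              # button is gone: nothing more to load
        clicks += 1
        time.sleep(pause)      # give the page time to load new rows
    return clicks


# Stand-in for the browser so the sketch runs on its own:
# pretend the button can be clicked three times before it disappears.
remaining = [True, True, True]


def fake_click():
    return remaining.pop() if remaining else False


print(click_until_gone(fake_click))   # 3
```

With Selenium, `click_once` could wrap the `WebDriverWait(...).click()` call from the cell above in a `try`/`except TimeoutException` block, returning `True` after a click and `False` on a timeout.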

## <font color=blue>Step 3: Use Beautiful Soup to parse the elements</font>

Now let's extract the elements with _Beautiful Soup_.

### Save Source Code
First let's save the source code from _selenium_.
 

In [None]:
# save the source code from Selenium
page_source = driver.page_source

Now let's transfer that source code to a _soup_.

In [None]:
soup = bs.BeautifulSoup(page_source, 'html.parser')
print(soup)

## Identify an HTML Tag

Using the tags, attributes, and values we identified for the HTML table and the faculty webpages, we can save this data to objects.

In [None]:
# find the <div> that holds the faculty profiles
faculty = soup.find('div', {'id': "bindFaculty"})
# collect every <h2 id="facName"> heading inside it (one per professor)
profs = faculty.find_all('h2', {'id': "facName"})
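
`find` returns the first match (or `None` when nothing matches), while `find_all` (spelled `findAll` in older code) returns a list of every match. The sketch below illustrates the difference on a small hand-written fragment. The markup, links, and names in it are placeholders, not the real directory.

```python
import bs4 as bs

# Hypothetical fragment with two faculty entries; placeholder values only.
sample_html = """
<div id="bindFaculty">
  <h2 id="facName"><a href="/placeholder/one">Person One</a></h2>
  <h2 id="facName"><a href="/placeholder/two">Person Two</a></h2>
</div>
"""
sample_soup = bs.BeautifulSoup(sample_html, "html.parser")

# find: a single Tag object (the first match)
table = sample_soup.find("div", {"id": "bindFaculty"})

# find_all: a list-like collection of every match inside that tag
headings = table.find_all("h2", {"id": "facName"})
print(len(headings))

# find returns None when nothing matches, so check before chaining calls
missing = sample_soup.find("div", {"id": "noSuchId"})
if missing is None:
    print("no element with that id")
```

Checking for `None` before calling another method on the result avoids an `AttributeError` if the page structure ever changes.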

## Inspecting a Beautiful Soup object

Let's look at the attributes of the faculty table.

In [None]:
print(faculty.attrs)

Let's see how many faculty profiles we found in the Kellogg index.

In [None]:
print(len(profs))

Let's inspect the first item in the profs list. What does it look like? What are its associated attributes?

In [None]:
print(profs[0])
print(profs[0].attrs)

We can see it contains more information than we need. Let's extract just the link to the professor's webpage.

In [None]:
# extract the website for this faculty member
website = profs[0].find('a', href=True)
website = website['href']
print(website)

## <font color=red>Exercise 1</font>

<font color=red>Extract the first professor's name and title.</font>

## Save results to a List Object

In [None]:
# save the full url for the first professor into an empty list object
website = 'https://www.kellogg.northwestern.edu' + str(website)
prof_sites = []
prof_sites.append(website)
print(prof_sites)
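
String concatenation works when the `href` starts with `/`, as in the example from Step 1. As an alternative, the standard library's `urllib.parse.urljoin` resolves a relative link against the page URL. A minimal sketch, using the example path quoted in Step 1:

```python
from urllib.parse import urljoin

# the page the links were scraped from
base_url = 'https://www.kellogg.northwestern.edu/faculty/faculty_directory.aspx'

# a relative link, as found in the href attribute
relative_link = '/faculty/directory/abdallah_tarek'

# urljoin keeps the scheme and host of base_url and swaps in the new path
full_url = urljoin(base_url, relative_link)
print(full_url)
```

`urljoin` also leaves links that are already absolute unchanged, which can help if a page mixes both kinds.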

## <font color=red>Exercise 2</font>

<font color=red>Save all of the professors' names and websites into a list object.</font>

## <font color=red>Exercise 3</font>

<font color=red>Save all of the professors' titles to a list object.</font>

## Export Results to csv File

In [None]:
# save the list of faculty URLs to a csv file, one URL per row
import csv
# newline='' lets the csv module handle line endings itself
with open('faculty_pages.csv', "w", newline='') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for site in prof_sites:
        wr.writerow([site])
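
Once you have completed the exercises, you may want names and websites side by side in one file. The sketch below is one way to do that with `zip` and a header row. The two lists, the file name `faculty_sample.csv`, the temporary-directory location, and the column names are all placeholders chosen for illustration.

```python
import csv
import os
import tempfile

# Placeholder data standing in for the lists built in the exercises
sample_names = ['Person One', 'Person Two']
sample_sites = ['https://example.com/one', 'https://example.com/two']

# write to the system's temporary directory so the sketch leaves no clutter
out_path = os.path.join(tempfile.gettempdir(), 'faculty_sample.csv')

# newline='' lets the csv module control line endings itself
with open(out_path, 'w', newline='') as outfile:
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    writer.writerow(['name', 'website'])            # header row
    for name, site in zip(sample_names, sample_sites):
        writer.writerow([name, site])               # one professor per row

# read the file back to confirm what was written
with open(out_path, newline='') as infile:
    rows = list(csv.reader(infile))
print(rows)
```

Note that `zip` stops at the shorter list, so it is worth confirming that both lists have the same length before writing.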

## Close the Firefox Browser

In [None]:
# end browser session
driver.quit()

## <font color=red>Answer Key</font>

In [None]:
# Exercise 1
name = profs[0].text
name = name.strip()
print(name)
titles = faculty.find_all('div', {'id': "Faculty-Guide-content"})
title1 = titles[0].text
title1 = title1.strip()
print(title1)

In [None]:
# Exercise 2
prof_names = []
prof_sites = []

for i in profs:
    name = i.text
    name = name.strip()
    print(name)
    prof_names.append(name)
    
    website = i.find('a', href=True)
    website = website['href']
    website = 'https://www.kellogg.northwestern.edu' + str(website)
    print(website)
    prof_sites.append(website)

In [None]:
# Exercise 3
prof_titles = []
for i in titles:
    title = i.text
    title = title.strip()
    print(title)
    prof_titles.append(title)