# Please don't edit this document directly. Create your own copy first.

# Exercise 3: IUPUI List of Professors

# [Web Scraping]

# Extract
First, run the cell below to import the necessary libraries. Most commonly used Python libraries are pre-installed; additional ones can be installed with !pip install [package name] or !apt-get install [package name].

## 1. Libraries

*   [requests](https://github.com/psf/requests): an elegant and simple HTTP library for Python.
*   [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/): a Python library for pulling data out of HTML and XML files.

In [0]:
# import
import requests
from bs4 import BeautifulSoup

Second, set the URL of the website we'd like to extract data from, and connect to it with the requests library imported in the first step. If the request succeeded, the output is <Response [200]>.

## 2. Set the URL

In [0]:
# Set the URL you want to scrape from
url='https://et.iupui.edu/people/?type=Faculty'

# Connect to the URL
response = requests.get(url)
response

<Response [200]>
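A bare requests.get(url) has no timeout by default, and a failed request (for example a 404) goes unnoticed until the parsing step breaks. A minimal sketch of a more defensive fetch follows; the helper name fetch_page and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_page(url, timeout=10):
    """Return the response for url, raising if the request failed.

    fetch_page and the 10-second timeout are illustrative choices,
    not part of the original notebook.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response
```

With this helper, response = fetch_page(url) could stand in for the requests.get(url) line in the cell above.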

Third, parse the HTML with BeautifulSoup.

## 3. Parse HTML file

In [0]:
# Parse HTML and save to BeautifulSoup object
# Use the 'html5lib' parser instead of Python's built-in html.parser, and pass response.content (bytes), not response.text
# The html5lib parser uses techniques that are part of the HTML5 standard, so it has the best claim on being the “correct” way.
soup = BeautifulSoup(response.content, 'html5lib')
soup

<!DOCTYPE html>
<html class="no-js ie9" itemscope="itemscope" itemtype="http://schema.org/Webpage" lang="en-US"><head prefix="og: http://ogp.me/ns# profile: http://ogp.me/ns/profile# article: http://ogp.me/ns/article#"><meta charset="utf-8"/><meta content="width=device-width, initial-scale=1.0" name="viewport"/><title>People: Purdue School of Engineering &amp; Technology: IUPUI</title><meta content="People" name="description"/><meta content="IE=edge" http-equiv="X-UA-Compatible"/><link href="https://assets.iu.edu/favicon.ico" rel="shortcut icon" type="image/x-icon"/><!-- Canonical URL --><link href="https://et.iupui.edu/people/index.html" itemprop="url" rel="canonical"/><meta content="People" property="og:title"/><meta content="People" property="og:description"/><meta content="https://et.iupui.edu/people/index.html" property="og:url"/><meta content="Purdue School of Engineering &amp; Technology" property="og:site_name"/><meta content="en_US" property="og:locale"/><meta content="website

Fourth, find an element by its tag name and attributes.

### Syntax: find_all(name, attrs)
find(name, attrs) returns the first matching element; find_all(name, attrs) takes the same arguments and returns every matching element.
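As a minimal illustration of the difference between find() and find_all(), the sketch below parses a tiny hand-written HTML snippet. The snippet and the names in it are made up, and html.parser is used only so the example runs without installing an extra parser.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML snippet (illustrative, not the real page)
html = """
<div id="filter-results">
  <article class="profile"><h1>Ada Example</h1></article>
  <article class="profile"><h1>Bob Example</h1></article>
</div>
"""

# html.parser ships with Python, so this sketch needs no extra parser package
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match only (or None when nothing matches)
first = soup.find("article", attrs={"class": "profile"})

# find_all() takes the same arguments and returns every match
all_articles = soup.find_all("article", attrs={"class": "profile"})

print(first.h1.text)
print([article.h1.text for article in all_articles])
```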

## 4. Extract professor name and title.

In [0]:
# Extract a specific part of the HTML document - find div whose id is filter-results and save it as table
table = soup.find('div', attrs = {'id':'filter-results'}) 
table

<div id="filter-results"><article class="profile feed-item" itemscope="itemscope" itemtype="http://schema.org/Person">
		    		<figure class="media image circle" itemscope="itemscope" itemtype="http://schema.org/ImageObject">
		            <img alt="D A" src="/people/_images/dacheson.png"/>
		            </figure><div class="content"><h1 class="no-margin title"><a href="/people/dacheson.html" itemprop="url"><span itemprop="name">Douglas Acheson</span></a></h1><p class="sub-title">Associate Professor of Mechanical Engineering Technology<br/><span style="font-style: italic; font-size: 14px;">Engineering Technology<br <="" span=""/><br/><span inline"="" meta="" style="font-style: italic; font-size: 14px;&lt;/span&gt;&lt;/p&gt;&lt;dl class="></span></span></p><dt>Phone: </dt><dd itemprop="telephone">317-274-4186</dd><dt>Email: </dt><dd itemprop="email"><span ery="absbyybj" uers="qnpurfba@vhchv.rqh">dacheson@iupui.edu</span></dd></div></article><article class="profile feed-item" itemscope="

In [0]:
# Create an empty list called professors (example: list=[])
professors=[]

# Extract name and title 
for row in table.find_all('article', attrs = {'class':'profile feed-item'}):
    # Create a dictionary to save all information
    professor = {} 
    professor['name'] = row.h1.text 
    professor['title'] = row.p.text 
    professors.append(professor) 

professors

[{'name': 'Douglas Acheson',
  'title': 'Associate Professor of Mechanical Engineering TechnologyEngineering Technology'},
 {'name': 'Eric Adams',
  'title': 'Senior Lecturer of Mechanical EngineeringMechanical and Energy Engineering'},
 {'name': 'Mangilal Agarwal',
  'title': 'Professor of Mechanical and Energy EngineeringMechanical and Energy Engineering'},
 {'name': 'Randy Albright',
  'title': 'Lecturer of Music and Arts TechnologyMusic and Arts Technology'},
 {'name': 'Karen Alfrey',
  'title': 'Associate Dean of Undergraduate Academic Affairs and ProgramsBiomedical EngineeringUndergraduate ProgramsSchool of E & T Administration'},
 {'name': 'John Alvarado',
  'title': 'Senior Lecturer of Music and Arts TechnologyMusic and Arts Technology'},
 {'name': 'Babak Anasori',
  'title': 'Assistant ProfessorMechanical and Energy Engineering'},
 {'name': 'Sohel Anwar',
  'title': 'Associate Professor of Mechanical EngineeringMechanical and Energy Engineering'},
 {'name': 'Darrell Bailey',
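
In the output above, the title and the department run together (for example 'TechnologyEngineering Technology') because the two pieces are separated only by a br tag, which .text drops. One way to keep them apart is Beautiful Soup's stripped_strings, sketched here on a simplified copy of one profile. The 'department' key and the simplified markup are assumptions for illustration, not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Simplified copy of one profile from the page (illustrative markup)
html = """
<article class="profile feed-item">
  <h1><a href="/people/dacheson.html"><span>Douglas Acheson</span></a></h1>
  <p class="sub-title">Associate Professor of Mechanical Engineering Technology<br/>
     <span>Engineering Technology</span></p>
</article>
"""

row = BeautifulSoup(html, "html.parser").article

# stripped_strings yields each text fragment separately, with whitespace trimmed
parts = list(row.p.stripped_strings)

professor = {
    "name": row.h1.get_text(strip=True),
    "title": parts[0],
    # 'department' is a hypothetical extra key: everything after the first fragment
    "department": "; ".join(parts[1:]),
}
print(professor)
```

If a single string is preferred, row.p.get_text(separator=' | ', strip=True) would join the same fragments with a visible separator.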
 

# Export to CSV
Import the necessary libraries. The file will be saved on the virtual machine, so to download the CSV file to your local computer you need to import *files* from google.colab.

## 1. Libraries

*   [csv](https://docs.python.org/3/library/csv.html): Python's built-in module for reading and writing CSV files, the most common import and export format for spreadsheets and databases.


In [0]:
# import
import csv
from google.colab import files

In [0]:
# Save it from list to csv

filename = 'iupui_list.csv'

# newline='' lets the csv module handle line endings itself
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f,['name','title']) 
    w.writeheader() 
    for professor in professors: 
        w.writerow(professor) 

# Download the file to your computer
files.download("iupui_list.csv")
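
The export can also be checked without Colab. The sketch below writes two made-up rows to a temporary directory and reads them back with csv.DictReader. The sample rows and the file location are assumptions for illustration. Opening the file with newline='' follows the csv module documentation, and encoding='utf-8' is an added choice to keep non-ASCII names intact.

```python
import csv
import os
import tempfile

# Made-up rows standing in for the scraped professors list
sample_professors = [
    {"name": "Ada Example", "title": "Professor of Examples"},
    {"name": "Bob Example", "title": "Lecturer, Samples"},
]

# Write into a throwaway directory so nothing in the notebook is overwritten
sample_path = os.path.join(tempfile.mkdtemp(), "iupui_list.csv")

with open(sample_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "title"])
    writer.writeheader()
    writer.writerows(sample_professors)

# Read the file back to confirm the rows survived the round trip
with open(sample_path, newline="", encoding="utf-8") as f:
    rows_read_back = list(csv.DictReader(f))

print(rows_read_back == sample_professors)
```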

# Save in a data frame

Import the pandas library to convert the list to a data frame. You could keep working with the list as well.

## 1. Library

*   [pandas](https://pandas.pydata.org/): open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools.

In [0]:
# import
import pandas as pd

In [0]:
# Convert the list of professors to a data frame
df = pd.DataFrame(professors)
df

Unnamed: 0,name,title
0,Douglas Acheson,Associate Professor of Mechanical Engineering ...
1,Eric Adams,Senior Lecturer of Mechanical EngineeringMecha...
2,Mangilal Agarwal,Professor of Mechanical and Energy Engineering...
3,Randy Albright,Lecturer of Music and Arts TechnologyMusic and...
4,Karen Alfrey,Associate Dean of Undergraduate Academic Affai...
...,...,...
130,Ken Yoshida,Associate Professor of Biomedical EngineeringB...
131,Huidan (Whitney) Yu,Associate Professor of Mechanical EngineeringM...
132,Jing Zhang,Associate Professor of Mechanical EngineeringM...
133,Qingxue Zhang,Assistant Professor Electrical and Computer En...


# [Wikidata: Verify if a professor exists in Wikidata]

Several Python libraries (e.g. [Pywikibot](https://pypi.org/project/pywikibot/), [WikidataIntegrator](https://github.com/SuLab/WikidataIntegrator), and [Wikidata](https://github.com/dahlia/wikidata)) work with the Wikidata API. Initially, I wanted to use Pywikibot, a great tool for automating work on MediaWiki sites such as Wikipedia and Wikidata. However, it requires a configuration file saved in a designated directory, which makes it awkward to demonstrate on the Google virtual machine. Instead, I chose [WikidataR](https://cran.r-project.org/web/packages/WikidataR/WikidataR.pdf), which is an R package.

## 1. Working with both R and Python in Colab

R and Python are different languages, each with its own pros and cons. Some communities want to use both in the same working environment and have developed tools that make this possible. As with any other Python library, you need to import the necessary libraries first.

*   [rpy2](https://rpy2.readthedocs.io/en/version_2.8.x/overview.html): Python interface to the R language
*   [WikidataR](https://cran.r-project.org/web/packages/WikidataR/WikidataR.pdf): R package for the API Client Library for Wikidata


In [0]:
# import
from rpy2.robjects import r, pandas2ri
from rpy2.robjects.packages import importr
import rpy2.robjects.packages as rpackages
pandas2ri.activate()

# import R's "base" package
base = importr('base')

# import R's utility package
utils = rpackages.importr('utils')

# install the WikidataR R package before importing it
utils.install_packages("WikidataR")

# import R's "WikidataR" package
wikidataR = importr('WikidataR')

(as ‘lib’ is unspecified)

	‘/tmp/RtmpTwc0rn/downloaded_packages’


## 2. Converting Python data frame into R object (R data.frame)

In [0]:
# convert the pandas data frame to an R data.frame using pandas2ri.py2ri
# (py2ri is the rpy2 2.x name; rpy2 3.x renamed it to py2rpy)
r_df = pandas2ri.py2ri(df)
r_df

name,title
'Douglas ...,'Associat...
'Eric Ada...,'Senior L...
'Mangilal...,'Professo...
'Randy Al...,'Lecturer...
...,...
'Huidan (...,'Associat...
'Jing Zha...,'Associat...
'Qingxue ...,'Assistan...
'Likun Zhu','Associat...


## 3. Create a function to identify if a professor exists in the Wikidata

Using find_item() from the WikidataR package, create your own function to check whether a professor exists in Wikidata. If s/he exists, the function returns "Yes"; if s/he doesn't exist, it returns "No".

In [0]:
# Create function called professor_wikidata
def professor_wikidata(x): 
    professor = wikidataR.find_item(x)
    if len(professor) > 0:
        available = "Yes" 
    else:
        available = "No"
    return available
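
Note that find_item() runs a text search, so a non-empty result means only that some Wikidata item matches the search string, not necessarily that the item is this professor. As a hedged, Python-only sketch of a stricter check, the code below builds a request for the wbsearchentities action of the Wikidata API and answers "Yes" only when a hit's label equals the name. The helper names, the User-Agent string, and the exact-label rule are assumptions for illustration, not part of the original notebook or of WikidataR.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"


def build_search_url(name, language="en"):
    """Build a wbsearchentities URL for the given name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": language,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def has_exact_label(payload, name):
    """True if any search hit carries a label equal to name (case-insensitive)."""
    wanted = name.casefold()
    hits = payload.get("search", [])
    return any(hit.get("label", "").casefold() == wanted for hit in hits)


def professor_in_wikidata(name):
    """Hypothetical stricter variant of professor_wikidata (needs network access)."""
    request = urllib.request.Request(
        build_search_url(name),
        # Illustrative User-Agent; replace with one that identifies your script
        headers={"User-Agent": "iupui-exercise-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=10) as reply:
        payload = json.load(reply)
    return "Yes" if has_exact_label(payload, name) else "No"
```

Even an exact label match can belong to a different person with the same name, so the result is still a hint to verify by hand rather than a confirmation.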

To apply the function to more than one professor, I wrapped it in a lambda and applied it to every name with map().

In [0]:
# in order to apply the professor_wikidata function to more than one professor
IUPUI_professor = lambda x: professor_wikidata(x)

# r_df[0] is the first column (name) of the r_df data frame
work = list(map(IUPUI_professor, r_df[0]))
work

['No',
 'Yes',
 'No',
 'No',
 'No',
 'No',
 'Yes',
 'No',
 'Yes',
 'Yes',
 'No',
 'Yes',
 'No',
 'No',
 'Yes',
 'No',
 'Yes',
 'No',
 'No',
 'Yes',
 'Yes',
 'Yes',
 'No',
 'No',
 'No',
 'No',
 'No',
 'Yes',
 'No',
 'No',
 'No',
 'Yes',
 'No',
 'Yes',
 'Yes',
 'No',
 'No',
 'Yes',
 'Yes',
 'No',
 'No',
 'Yes',
 'No',
 'Yes',
 'Yes',
 'No',
 'No',
 'No',
 'No',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'No',
 'No',
 'No',
 'Yes',
 'No',
 'Yes',
 'Yes',
 'No',
 'No',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'No',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'No',
 'Yes',
 'Yes',
 'Yes',
 'No',
 'Yes',
 'Yes',
 'No',
 'Yes',
 'No',
 'No',
 'No',
 'No',
 'No',
 'No',
 'No',
 'No',
 'No',
 'No',
 'No',
 'No',
 'No',
 'No',
 'No',
 'No',
 'No',
 'Yes',
 'Yes',
 'No',
 'No',
 'No',
 'No',
 'Yes',
 'No',
 'No',
 'Yes',
 'No',
 'No',
 'No',
 'Yes',
 'No',
 'Yes',
 'No',
 'No',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'No',
 'Yes',
 'Yes',
 'Yes',
 'Yes',
 'No',
 'No',
 'No',
 'No',
 'Yes',
 'Yes',
 'Yes',
 'No',
 'N
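
The lambda wrapper is optional: map() accepts the function itself, and a list comprehension gives the same result. A small sketch with a stand-in checker follows; fake_lookup, known_names, and the names in the list are made up so the example runs without R or network access.

```python
# Stand-in for professor_wikidata so the sketch runs offline (hypothetical data)
known_names = {"Bob Example"}


def fake_lookup(name):
    return "Yes" if name in known_names else "No"


names = ["Ada Example", "Bob Example", "Cy Example"]

# Three equivalent ways to apply the function to every name
via_lambda = list(map(lambda x: fake_lookup(x), names))
via_function = list(map(fake_lookup, names))
via_comprehension = [fake_lookup(name) for name in names]

print(via_function)
```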

## 4. A user-friendly way to see in the data frame who exists in Wikidata

You can add the **work** list created in the previous step to the **df** dataframe.

In [0]:
# assign the work list
df['wikidata'] = work
df

Unnamed: 0,name,title,wikidata
0,Douglas Acheson,Associate Professor of Mechanical Engineering ...,No
1,Eric Adams,Senior Lecturer of Mechanical EngineeringMecha...,Yes
2,Mangilal Agarwal,Professor of Mechanical and Energy Engineering...,No
3,Randy Albright,Lecturer of Music and Arts TechnologyMusic and...,No
4,Karen Alfrey,Associate Dean of Undergraduate Academic Affai...,No
...,...,...,...
130,Ken Yoshida,Associate Professor of Biomedical EngineeringB...,Yes
131,Huidan (Whitney) Yu,Associate Professor of Mechanical EngineeringM...,Yes
132,Jing Zhang,Associate Professor of Mechanical EngineeringM...,Yes
133,Qingxue Zhang,Assistant Professor Electrical and Computer En...,No


## 5. How many professors exist in Wikidata

You can count the "Yes" and "No" values using value_counts() from pandas.

In [0]:
# count
pd.value_counts(df['wikidata'].values)

No     76
Yes    59
dtype: int64
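
The same tally can be taken directly from the column with Series.value_counts(), and normalize=True turns the counts into proportions. A sketch on a made-up column follows; the values are illustrative, not the real results.

```python
import pandas as pd

# Made-up answers standing in for df['wikidata']
sample = pd.DataFrame({"wikidata": ["No", "Yes", "No", "No", "Yes"]})

# Count each distinct value in the column
counts = sample["wikidata"].value_counts()

# normalize=True returns the share of each value instead of the raw count
shares = sample["wikidata"].value_counts(normalize=True)

print(counts["No"], counts["Yes"])
print(shares["Yes"])
```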