# Scraping JNU CS faculty homepages

## System setup 

Before we start, install the required libraries:
    
    pip install bs4
    pip install selenium

Since JNU's website has some JavaScript-rendered HTML content, we'll use Selenium to scrape the content that is loaded dynamically by JavaScript. For this, you also need to download a browser WebDriver that Selenium supports.

For example, for Chrome, download the appropriate WebDriver from http://chromedriver.chromium.org/downloads, unzip it, and save it in the current directory.

In [114]:
from bs4 import BeautifulSoup
from selenium import webdriver 
from selenium.webdriver.chrome.options import Options
import re 
import urllib
import time

In [115]:
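#create a webdriver object and set options for headless browsing
#(written for Selenium 4, where the driver path is passed through a Service object)
from selenium.webdriver.chrome.service import Service
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(service=Service('C:/MCS-DS/410/MP/test1/chromedriver'),options=options)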

If you visit JNU's CS Faculty Directory Listing (https://www.jnu.ac.in/scss-faculty), you'll see that all the faculty are listed there. 
Clicking on a faculty member's name takes you to that person's Faculty Profile page. 
The profile page has all the information about the faculty member.

Before we start scraping, we'll define some helper functions.

In [116]:
#uses webdriver object to execute javascript code and get dynamically loaded webcontent
def get_js_soup(url,driver):
    driver.get(url)
    res_html = driver.execute_script('return document.body.innerHTML')
    soup = BeautifulSoup(res_html,'html.parser') #beautiful soup object to be used for parsing html content
    return soup

#tidies extracted text 
def process_bio(bio):
    bio = bio.encode('ascii',errors='ignore').decode('utf-8')       #removes non-ascii characters
    bio = re.sub(r'\s+',' ',bio)       #replaces repeated whitespace characters with a single space
    return bio

''' More tidying
Sometimes the text extracted from an HTML webpage may contain javascript code and some style elements. 
This function removes script and style tags from the HTML so that the extracted text does not contain them.
'''
def remove_script(soup):
    for script in soup(["script", "style"]):
        script.decompose()
    return soup
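To make the effect of `process_bio` concrete, here is a small standalone sketch. The function body mirrors the helper above, and the sample string is made up for illustration. Note that non-ASCII characters are dropped rather than transliterated, and that newlines and tabs count as whitespace, so every cleaned bio ends up on a single line.

```python
import re

def process_bio(bio):
    # Drop non-ASCII characters, then collapse each whitespace run into one space.
    bio = bio.encode('ascii', errors='ignore').decode('utf-8')
    return re.sub(r'\s+', ' ', bio)

# Made-up input: an accented letter, a newline, a tab and repeated spaces.
sample = 'Caf\u00e9  \n Bio\ttext'
cleaned = process_bio(sample)
print(cleaned)
```

Leading and trailing whitespace is collapsed to a single space but not removed, so call `.strip()` on the result if that matters for your output.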



We will now start scraping.

First, let's get links to all Faculty Profile pages by scraping the Directory Listing. You can use your browser's developer tools to find the required links within the HTML content. In Chrome, this can be done by right-clicking on the webpage and choosing Inspect. Some basic knowledge of HTML and CSS is required. After a bit of digging, you should notice that the link can be found under the <a\> tag of the <div\> with the class "name".
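As a minimal sketch of this kind of lookup, the snippet below runs BeautifulSoup on a small hand-written HTML fragment. The markup, class names and paths are hypothetical stand-ins and not the real page structure; on a live site you would substitute whatever tag and class the developer tools show.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a directory listing; real pages will differ.
html = '''
<div class="name"><a href="/faculty/alice">Alice</a></div>
<div class="name"><a href="/faculty/bob">Bob</a></div>
<div class="other"><a href="/about">About</a></div>
'''

soup = BeautifulSoup(html, 'html.parser')
links = []
for holder in soup.find_all('div', class_='name'):
    anchor = holder.find('a')  # first <a> inside this <div>
    if anchor is not None and anchor.has_attr('href'):
        links.append(anchor['href'])
print(links)
```

Only the anchors inside the matching `<div>` elements are collected; the link in the `other` block is ignored.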

Now we can specify exactly what needs to be extracted from the directory listing page.

In [117]:
#extracts one record per doctor ('index|specialty|name|profile url') from the Directory Listing Page
def scrape_dir_page(dir_url,driver):
    print ('-'*20,'Scraping directory page','-'*20)
    faculty_links = []
    
    faculty_base_url = 'https://nyulangone.org'
    #execute js on webpage to load the listings and get ready to parse the loaded HTML 
    soup = get_js_soup(dir_url,driver)   
    count = 0
    for link_holder in soup.find_all('ul',class_='our-subspecialties__list'): #get every <ul> of class 'our-subspecialties__list'
        if link_holder is not None:
            #print(link_holder)
            print('end')
            for rel_link in link_holder.find_all('a'):
                if rel_link is not None:
                    rel_link = rel_link['href'] #get url
                    #print(faculty_base_url+rel_link)
                    #url returned is relative, so we need to add base url
                    #faculty_links.append(faculty_base_url+rel_link) 
                    specialist_url = faculty_base_url+rel_link
                    soup_specialist = get_js_soup(specialist_url,driver)
                    
                    for doclink in soup_specialist.find_all('div',class_='provider-tile__details-header'):
                        doctor_url = ''
                        if doclink is not None:
                            plink = doclink.find('p')
                            if plink is not None:
                                doctor_url = str(count) + '|' +  plink.text
                            alink = doclink.find('a')
                            if alink is not None:
                                link_href = alink['href']
                                doctor_url = doctor_url + '|' + alink.text + '|' + faculty_base_url + link_href
                                #print(doctor_url)
                            faculty_links.append(doctor_url)
                        count = count + 1
    #print ('-'*20,'Found {} doctor profile urls'.format(len(faculty_links)),'-'*20)
    print(faculty_links)
    return faculty_links
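The function above builds absolute URLs by concatenating the base URL and the relative `href`. As an alternative sketch, the standard library's `urllib.parse.urljoin` does the same job and also leaves an `href` untouched when it is already absolute. The two example paths are taken from the output shown further down; the variable names are only illustrative.

```python
from urllib.parse import urljoin

base_url = 'https://nyulangone.org'

# A relative href, as found on the listing pages.
relative = '/doctors/1912965435/nathanael-horne'
# An href that is already absolute; plain concatenation would corrupt this one.
absolute = 'https://nyulangone.org/doctors/1053392514/wang-y-mak'

joined_relative = urljoin(base_url, relative)
joined_absolute = urljoin(base_url, absolute)
print(joined_relative)
print(joined_absolute)
```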

It might take a few minutes to get all the URLs.

In [118]:
dir_url = 'https://nyulangone.org/doctors' #url of the directory listing page to scrape
faculty_links = scrape_dir_page(dir_url,driver)

-------------------- Scraping directory page --------------------
end
end
['0|Allergy|Nathanael Horne, MD|https://nyulangone.org/doctors/1912965435/nathanael-horne', '1|Allergy|Wang Y. Mak, MD|https://nyulangone.org/doctors/1053392514/wang-y-mak', '2|Allergy, Pediatric Allergy|Samuel D. Grubman, MD|https://nyulangone.org/doctors/1598761819/samuel-d-grubman', '3|Allergy|Stephanie L. Mawhirt, D.O.|https://nyulangone.org/doctors/1992141212/stephanie-l-mawhirt', '4|Allergy|Erin M. Banta, MD|https://nyulangone.org/doctors/1144433830/erin-m-banta', '5|Allergy, Pediatric Allergy|Sujan Patel, MD|https://nyulangone.org/doctors/1174783633/sujan-patel', '6|Internal Medicine, Allergy|Roger I. Emert, MD|https://nyulangone.org/doctors/1184623134/roger-i-emert', '7|Allergy|Amina Abdeldaim, MD, MPH|https://nyulangone.org/doctors/1770740490/amina-abdeldaim', '8|Allergy|Mark Davis-Lorton, MD|https://nyulangone.org/doctors/1396727665/mark-davis-lorton', '9|Allergy|Niha Qamar, MD|https://nyulangone.org/do
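Each entry in this list packs four fields into one string separated by `|`: a running index, the specialty, the doctor's name and the profile URL. The sketch below shows one way to unpack such a record defensively. `DoctorRecord` and `parse_record` are hypothetical names introduced here, not part of the notebook, and the sample record is copied from the output above.

```python
from collections import namedtuple

# Hypothetical container for the four '|'-separated fields.
DoctorRecord = namedtuple('DoctorRecord', ['index', 'specialty', 'name', 'url'])

def parse_record(record):
    # Return a DoctorRecord, or None when the record does not have four fields
    # (for example when the specialty or the link was missing on the page).
    parts = record.split('|')
    if len(parts) != 4:
        return None
    return DoctorRecord(int(parts[0]), parts[1], parts[2], parts[3])

rec = parse_record('2|Allergy, Pediatric Allergy|Samuel D. Grubman, MD|'
                   'https://nyulangone.org/doctors/1598761819/samuel-d-grubman')
print(rec.name, rec.url)
```

Checking the field count first avoids an `IndexError` on incomplete records, which can occur when a tile on the page has no `<p>` or no `<a>` element.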

Now let's scrape the faculty profile pages. 



In [119]:
def scrape_faculty_page(fac_url,driver):
    #each record has the form 'index|specialty|name|profile url'
    lst = fac_url.split('|')
    soup = get_js_soup(lst[3],driver)
    bio_url = ''
    bio = ''
    faculty_bio = soup.find('article',class_='doctor-sections rail')
    if faculty_bio is not None:
        bio_soup = remove_script(faculty_bio)
        bio = process_bio(bio_soup.get_text(separator=' ')) 
        bio_url = lst[0] #record index, used later as the name of the output file
   
    return bio_url,bio

It takes a few minutes to scrape all the URLs.

In [None]:
#Save the list of records, then scrape every profile page and write each bio to its own file
bio_urls, bios = [],[]
tot_urls = len(faculty_links)
print(tot_urls)
with open('doctorlist.txt','w') as f:
    for l in faculty_links:
        f.write(l)
        f.write('\n')
        
for i,link in enumerate(faculty_links):
    print ('-'*20,'Scraping faculty url {}/{}'.format(i+1,tot_urls),'-'*20)
    bio_url,bio = scrape_faculty_page(link,driver)
    if bio.strip()!= '' and bio_url.strip()!='':
        #bio_urls.append(bio_url.strip())
        #bios.append(bio)
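        with open(bio_url + '.txt','w') as f: #file is named after the record index, e.g. '0.txt'
            f.write(bio)
driver.quit() #quit() ends the whole browser session; close() only closes the current window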

2084
-------------------- Scraping faculty url 1/2084 --------------------
-------------------- Scraping faculty url 2/2084 --------------------
-------------------- Scraping faculty url 3/2084 --------------------
-------------------- Scraping faculty url 4/2084 --------------------
...
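The loop above writes one file per doctor into the current working directory. With about two thousand files, it can be tidier to collect them in a dedicated folder and to set the encoding explicitly. The following is only a sketch: `save_bio` and the `bios` folder name are assumptions introduced here, and a temporary directory is used so the example leaves nothing behind.

```python
import tempfile
from pathlib import Path

def save_bio(out_dir, name, bio):
    # Write one bio to <out_dir>/<name>.txt, creating the folder if needed.
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / (name + '.txt')
    path.write_text(bio, encoding='utf-8')
    return path

with tempfile.TemporaryDirectory() as tmp:
    saved = save_bio(Path(tmp) / 'bios', '0', 'Example bio text')
    content = saved.read_text(encoding='utf-8')
    print(saved.name, content)
```

In the scraping loop, the `name` argument would be the `bio_url` value returned by `scrape_faculty_page`, which holds the record index.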

Finally, write the URLs and the extracted bios to text files.

In [8]:
def write_lst(lst,file_):
    with open(file_,'w') as f:
        for l in lst:
            f.write(l)
            f.write('\n')

In [9]:
#bio_urls_file = 'bio_urls.txt'
#bios_file = 'bios.txt'
#write_lst(bio_urls,bio_urls_file)
#write_lst(bios,bios_file)
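As a quick check of the one-item-per-line format that `write_lst` produces, the sketch below writes a short made-up list to a temporary file and reads it back. The function body mirrors the helper above; the file name and list contents are placeholders.

```python
import os
import tempfile

def write_lst(lst, file_):
    # Write each item on its own line.
    with open(file_, 'w') as f:
        for item in lst:
            f.write(item)
            f.write('\n')

rows = ['first bio', 'second bio']  # made-up values
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'bios.txt')
    write_lst(rows, path)
    with open(path) as f:
        read_back = f.read().splitlines()
print(read_back)
```

The round trip only works when no item contains a newline. Bios cleaned with `process_bio` satisfy this, because every whitespace run, including newlines, is collapsed into a single space.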