# Scrape a Login-Required Website with Selenium
#### CSCE 670 - Ying Lyu
## Introduction

In software development, Selenium is widely used as a portable framework for testing web applications.

The Selenium Python bindings provide a simple API for writing functional and acceptance tests with Selenium WebDriver. Through this API you can access all WebDriver functionality in an intuitive way and drive browsers such as Firefox, Internet Explorer, and Chrome, as well as remote WebDriver servers. At the time of writing, the supported Python versions are 2.7 and 3.5 and above.

In the information retrieval field, we can use Selenium with Python to overcome obstacles in the scraping process, such as pages that require a login or that build their content with JavaScript.
## Installation
Type the following command in a Unix/Linux terminal or the Windows command prompt.

In [45]:
pip install selenium  

Note: you may need to restart the kernel to use updated packages.


Or download from https://pypi.org/project/selenium/#downloads

In [41]:
import selenium

## Drivers for Browser
Selenium requires a driver to interface with the chosen browser. Firefox, for example, requires geckodriver, which must be installed before the examples below can be run. Make sure it is in your PATH, e.g., place it in /usr/bin or /usr/local/bin.

If you skip this step, you will get the error selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.
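
Before launching a browser, it can help to confirm that the driver executable is actually discoverable. The sketch below uses only the Python standard library; the helper name `find_driver` is an illustration and not part of Selenium.

```python
import shutil

def find_driver(name):
    """Return the full path of a driver executable on PATH, or None."""
    return shutil.which(name)

# Check the two drivers mentioned in this section.
for name in ("geckodriver", "chromedriver"):
    path = find_driver(name)
    print(name, "->", path if path else "not found on PATH")
```

If a driver is reported as not found, add its directory to PATH or move the executable into a directory that is already listed there.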

Other supported browsers will have their own drivers available. Links to some of the more popular browser drivers follow.

Chrome:	https://sites.google.com/a/chromium.org/chromedriver/downloads  
Edge:	https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/  
Firefox:	https://github.com/mozilla/geckodriver/releases  
Safari:	https://webkit.org/blog/6900/webdriver-support-in-safari-10/  

In [42]:
from selenium import webdriver

We use Chrome as an example. Note that the browser application itself must be installed on your computer. Download the driver version that matches your browser from the links above.

In [43]:
driver = webdriver.Chrome()
driver.get("https://www.ventureradar.com/")

You will see that a new Chrome window is launched and the target web page is opened.

In [44]:
driver.close()

## Example 

To scrape content from some websites, we first need to log in. For pages that use Ajax techniques, we need to mimic user actions on the page, such as logging in or clicking a button. In this spotlight, I will use VentureRadar, an informative website, as an example and scrape content from it.
### VentureRadar 
VentureRadar provides a ranking of companies for a given keyword.
According to its own description, it works for some of the world's largest corporations, removing the barriers they traditionally face in discovering and tracking companies that can impact their business. From landscaping emerging technology themes to locating specialist skills that can solve internal business challenges, its clients rely on it to help them find partners, competitors, acquisition targets, and investment targets, and to understand markets.

In [50]:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ventureradar.com/database/'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')

Use the requests and BeautifulSoup packages to inspect the login elements and see how the login works.

In [51]:
login_div = soup.find("div", {"id": "i_d_LoginButtons"})
print(login_div)

<div class="navbar-form navbar-right col-md-3" id="i_d_LoginButtons" role="search">
<input id="i_h_UserID" name="i_h_UserID" type="hidden"/>
<input id="i_h_SearchKeyword" name="i_h_SearchKeyword" type="hidden"/>
<div id="i_d_loginregistercontainerparent" style="">
<button class="btn btn-default" id="i_b_Login" onclick="toggleLogin();" type="button">login</button>
<button class="btn btn-default" id="i_b_Register" onclick="toggleRegister();" type="button">free sign-up</button>
</div>
<div id="i_d_logincontainer">
<div id="i_d_login" style="display: none;">
<div id="loginCookiesEnabled">
<p>
</p>
<div style="margin-bottom: 10px">
<div id="i_d_Login_Message"></div>
<div>
<input id="UserName" name="UserName" placeholder="Email address" type="text"/>
<div class="formErrorsContainer">
</div>
</div>
</div>
<div style="margin-bottom: 10px">
<div>
<input id="Password" name="Password" placeholder="Password" type="password"/>
<div class="formErrorsContainer">
</div>
<div class="help-block text-rig

The login action calls a JavaScript function, so it cannot be reproduced simply by passing key-value pairs in a POST request.
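
To make the JavaScript dependency visible, one can list which elements in the login block carry an `onclick` handler. The sketch below runs the standard-library `HTMLParser` over a shortened copy of the markup printed above; the class name `OnclickFinder` is hypothetical.

```python
from html.parser import HTMLParser

# Shortened copy of the login markup printed above.
LOGIN_HTML = '''
<div id="i_d_loginregistercontainerparent">
<button class="btn btn-default" id="i_b_Login" onclick="toggleLogin();" type="button">login</button>
<button class="btn btn-default" id="i_b_Register" onclick="toggleRegister();" type="button">free sign-up</button>
</div>
<input id="UserName" name="UserName" placeholder="Email address" type="text"/>
<input id="Password" name="Password" placeholder="Password" type="password"/>
'''

class OnclickFinder(HTMLParser):
    """Collect onclick handlers by element id, plus the ids of input fields."""
    def __init__(self):
        super().__init__()
        self.handlers = {}   # element id -> onclick JavaScript
        self.inputs = []     # ids of <input> elements, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "onclick" in attrs:
            self.handlers[attrs.get("id")] = attrs["onclick"]
        if tag == "input":
            self.inputs.append(attrs.get("id"))

finder = OnclickFinder()
finder.feed(LOGIN_HTML)
print(finder.handlers)
print(finder.inputs)
```

The login button only triggers `toggleLogin()`, which is why a browser that executes JavaScript is needed to reach the form.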

We can use Selenium IDE, an integrated development environment that records your interactions with websites to help you generate and maintain site automation and tests, and removes the need to manually step through repetitive tasks.  
The Chrome extension version:  
https://chrome.google.com/webstore/detail/selenium-ide/mooikfkahbdckldjjndioackbalphokd?hl=en  
The Firefox add-on version:  
https://addons.mozilla.org/en-US/firefox/addon/selenium-ide/

Run Selenium IDE and record the login interaction.

![title](img/login_export.png)

The interaction can be exported as a Python script that uses the Selenium Python API, as shown in the following snippet.

In [None]:
# Generated by Selenium IDE
import pytest
import time
import json
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

class TestLogin():
    def setup_method(self, method):
        self.driver = webdriver.Chrome()
        self.vars = {}

    def teardown_method(self, method):
        self.driver.quit()

    def test_login(self):
        # Test name: login
        # Step # | name | target | value
        # 1 | open | /database/?override=true | 
        self.driver.get("https://www.ventureradar.com/database/?override=true")
        # 2 | setWindowSize | 1326x801 | 
        self.driver.set_window_size(1326, 801)
        # 3 | click | id=i_b_Login | 
        self.driver.find_element(By.ID, "i_b_Login").click()
        # 4 | click | id=UserName | 
        self.driver.find_element(By.ID, "UserName").click()
        # 5 | type | id=UserName | lyuying@bupt.edu.cn
        self.driver.find_element(By.ID, "UserName").send_keys("lyuying@bupt.edu.cn")
        # 6 | type | id=Password | ventureradar
        self.driver.find_element(By.ID, "Password").send_keys("ventureradar")
        # 7 | click | name=ctl01 | 
        self.driver.find_element(By.NAME, "ctl01").click()

Extract and run the code that we need.

In [52]:
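from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.ventureradar.com/database/")
# find_element(By.ID, ...) matches the exported script above; the older
# find_element_by_* helpers are no longer available in recent Selenium 4 releases
driver.find_element(By.ID, "i_b_Login").click()   # open the login form
driver.find_element(By.ID, "UserName").click()
driver.find_element(By.ID, "UserName").clear()
driver.find_element(By.ID, "UserName").send_keys('lyuying@bupt.edu.cn')
driver.find_element(By.ID, "Password").send_keys('ventureradar')
driver.find_element(By.NAME, "ctl01").click()     # submit the form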

The username and password were typed in the fields and the login button was clicked automatically.
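
Hard-coding credentials in a notebook makes them easy to leak. One common alternative is to read them from environment variables; the sketch below assumes hypothetical variable names `VR_USER` and `VR_PASSWORD` and a hypothetical helper `load_credentials`, and sets demo values only so that it runs on its own.

```python
import os

def load_credentials(user_var="VR_USER", password_var="VR_PASSWORD"):
    """Read login credentials from environment variables.

    The variable names are hypothetical; pick whatever suits your setup.
    """
    try:
        return os.environ[user_var], os.environ[password_var]
    except KeyError as missing:
        raise RuntimeError(f"environment variable {missing} is not set") from None

# Demo values only, so the snippet runs without any prior setup.
os.environ.setdefault("VR_USER", "user@example.com")
os.environ.setdefault("VR_PASSWORD", "not-a-real-password")

username, password = load_credentials()
print(username)
```

The returned values can then be passed to `send_keys(username)` and `send_keys(password)` in place of the string literals.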

In [53]:
print(driver.current_url)

https://www.ventureradar.com/home


The browser has been redirected to the home page, which confirms that we are logged in.

In [56]:
search_key = "animation"
driver.get("https://www.ventureradar.com/search/ranked/" + search_key + '/')

The web page shows a ranked list of organizations for the keyword "animation". Each organization has a link to a page with more detailed information, so we extract all the links from the HTML.

In [62]:
import re

html = driver.page_source
# non-greedy match of every href="..." attribute value in the page source
url_all = re.findall(r'href="(.*?)"', html, re.S)
print(url_all)

['/styles/v-637151127063009095/core-bundle.min.css', '/styles/v-637182206846152958/s_search.min.css', 'https://use.fontawesome.com/4c66a74dd9.css', 'https://www.ventureradar.com/home', '/pricing?utm_source=menu&amp;utm_medium=web&amp;utm_content=Upgrade', '/account/manage.aspx', '#', '/account/forgotpassword.aspx', '#', '/termsconditions.aspx', '/privacypolicy.aspx', '/cookiepolicy.aspx', '/organisation/Fiverr/6ffeccbb-8b37-442b-828d-c1cd136ac73b/', '{lnk}', '{lnk}', "/source/Wired - Europe's 100 Hottest Startups/7168ecdf-ffff-4e86-871d-f3c26dbf8725", '/organisation/Wideo/4f791a19-3986-4b87-bf3a-61027a68e384/', '{lnk}', '{lnk}', '/source/500startups/4aa7c8b6-3efc-4ee3-83d3-55bb7c5038da', '/organisation/PowToon/b4d469d1-9d12-430f-b934-fabc59e251cd/', '{lnk}', '{lnk}', '/organisation/Autodesk/072387ba-01f3-47bd-901d-d74086a034f5/', '{lnk}', '{lnk}', '/source/3D Printing Industry Directory/0ac53536-74ee-46d2-b082-7b7b9f36589e', '/organisation/Animaker/c4434cab-5e03-4946-a1f4-8d090d116cb4/
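
The regular expression above also returns stylesheet links and leaves HTML entities such as `&amp;` escaped. As an alternative sketch, the standard-library `HTMLParser` can restrict the search to `<a>` tags and unescape attribute values; the class name `LinkCollector` and the sample markup are illustrations, not taken from the real page.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# Small hand-written sample; in the notebook you would feed driver.page_source.
sample = (
    '<link href="/styles/core-bundle.min.css" rel="stylesheet">'
    '<a href="/pricing?utm_source=menu&amp;utm_medium=web">Upgrade</a>'
    '<a href="/organisation/Fiverr/6ffeccbb-8b37-442b-828d-c1cd136ac73b/">Fiverr</a>'
)
collector = LinkCollector()
collector.feed(sample)
print(collector.links)
```

The `<link>` stylesheet is skipped because only anchor tags are inspected, and `&amp;` is returned as a plain `&`.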

From all the URLs, select the organization URLs that we want.

In [64]:
url_orgs = []
num = 0
for url in url_all:
    if "organisation" in url and "{cne}" not in url:
        num = num + 1
        if num > 25:
            break
        url_org = 'https://www.ventureradar.com' + url
        url_orgs.append(url_org)
        print (url_org)

https://www.ventureradar.com/organisation/Fiverr/6ffeccbb-8b37-442b-828d-c1cd136ac73b/
https://www.ventureradar.com/organisation/Wideo/4f791a19-3986-4b87-bf3a-61027a68e384/
https://www.ventureradar.com/organisation/PowToon/b4d469d1-9d12-430f-b934-fabc59e251cd/
https://www.ventureradar.com/organisation/Autodesk/072387ba-01f3-47bd-901d-d74086a034f5/
https://www.ventureradar.com/organisation/Animaker/c4434cab-5e03-4946-a1f4-8d090d116cb4/
https://www.ventureradar.com/organisation/Blender/12e24263-2cc3-4f7f-8c57-5630d14f1d31/
https://www.ventureradar.com/organisation/Furhat Robotics/c010cc3a-fcd1-4e89-9515-8df0392dc419/
https://www.ventureradar.com/organisation/Moovly/4ae56488-3387-4415-a3bc-bdae72651d6c/
https://www.ventureradar.com/organisation/Anthropics/789d722b-2e96-4a83-aefa-ee77f1daf1c8/
https://www.ventureradar.com/organisation/YellowDog /439ac226-7b55-4850-af1c-324185da12f7/
https://www.ventureradar.com/organisation/Renderforest/72b83dd4-027f-44e0-805d-2d4c46d2b80a/
https://www.ven
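
Some of the organization paths above contain spaces (for example "Furhat Robotics"). A slightly more defensive way to build the absolute URLs is to percent-encode the path and join it to the base with `urllib.parse`. This is a minimal sketch; the function name `organisation_urls` is hypothetical, and it also drops duplicates, which the loop above does not do.

```python
from urllib.parse import quote, urljoin

BASE = "https://www.ventureradar.com"

def organisation_urls(hrefs, limit=25):
    """Return up to `limit` absolute, percent-encoded organisation URLs."""
    seen = []
    for href in hrefs:
        if not href.startswith("/organisation/"):
            continue                      # skip menus, placeholders, sources
        url = urljoin(BASE, quote(href))  # quote() keeps "/" and encodes spaces
        if url not in seen:
            seen.append(url)
        if len(seen) == limit:
            break
    return seen

hrefs = [
    "/account/manage.aspx",
    "{lnk}",
    "/organisation/Fiverr/6ffeccbb-8b37-442b-828d-c1cd136ac73b/",
    "/organisation/Furhat Robotics/c010cc3a-fcd1-4e89-9515-8df0392dc419/",
    "/organisation/Fiverr/6ffeccbb-8b37-442b-828d-c1cd136ac73b/",
]
for url in organisation_urls(hrefs):
    print(url)
```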

In [75]:
from time import sleep
htmls = []
for url_org in url_orgs:  # iterate over the URLs actually collected (at most 25)
    driver.get(url_org)
    sleep(1) # wait for the page to be generated
    htmls.append(driver.page_source)
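
A fixed one-second sleep is either wasted time or too short on a slow connection. Polling for a condition is the usual alternative, and Selenium offers it through `WebDriverWait(driver, timeout).until(...)`. The sketch below shows the underlying idea in plain Python; the helper `wait_until` and the stand-in condition `page_ready` are hypothetical and replace a real browser so that the snippet runs on its own.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.2):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)

# Stand-in for a page that becomes ready on the third poll.
state = {"polls": 0}

def page_ready():
    state["polls"] += 1
    return state["polls"] >= 3

wait_until(page_ready, timeout=2.0, interval=0.01)
print("polls needed:", state["polls"])
```

With a real driver, the condition could be something like `lambda: "i_d_CompanyName" in driver.page_source`, assuming that element id marks a fully generated page.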

The view of the page is as follows.
![title](img/venture.png)

Predefine the element ids, CSS classes, and display names used to locate and label the fields in the HTML.

In [79]:
id_data = ['i_d_CompanyName','main_i_s_CompanyCountry','main_i_s_CompanyFounded',
    'main_i_s_CompanyType','i_d_CompanyDescription',
    'i_d_CompanyWebsiteLink','i_l_CompanyEmailLink','i_d_OverallScore_Detail',
    'main_i_s_SocialProofScore','main_i_s_WebsitePopularityScore',
    'main_i_s_WebsiteAutoAnalystScore','main_i_s_VRPopularityScore',
    'i_d_EmployeeSatisfaction_Overall_Bar','i_d_EmployeeSatisfaction_Recommend_Bar',
    'i_d_Glassdoor_CEO_Bar',#'i_d_WebsiteHealth_ScoreBox_Child',
    'i_d_VRPopularity_HeatMapRow_Low',
    'i_d_VRPopularity_HeatMapRow_Med','i_d_VRPopularity_HeatMapRow_High']  

id_herf = ['i_d_CompanyLinkedInLink','i_l_CompanyTeamLink',
    'i_d_CompanyTwitterLink','i_l_WikipediaLink']
    
mapHit_class = {"map1 mapHit":"1","map2 mapHit":"2","map3 mapHit":"3",
    "map4 mapHit":"4","map5 mapHit":"5","map6 mapHit":"6","map7 mapHit":"7",
    "map8 mapHit":"8","map9 mapHit":"9","map10 mapHit":"10"}#Website Popularity

updates = ["c_d_InsightCard_Inner c_d_InsightCard_Conference", "c_d_InsightCard_Inner c_d_InsightCard_Award", "c_d_InsightCard_Inner c_d_InsightCard_Equity-Funding", "c_d_InsightCard_Inner c_d_InsightCard_Research-Report", "c_d_InsightCard_Inner c_d_InsightCard_Startup-PitchingCompetition", "c_d_InsightCard_Inner c_d_InsightCard_Business-Insight", "c_d_InsightCard_Inner c_d_InsightCard_Grant"]
    
fund_class = {"sprite-profile sprite-profile-crowd-funded-30":"EQUITY CROWDFUNDED",
    "sprite-profile sprite-profile-government-grant-30":"GOVERNMENT GRANT",
    "sprite-profile sprite-profile-spin-off-30":"SPIN-OFF",
    "sprite-profile sprite-profile-venture-funded-30":"VENTURE FUNDED"
    }

target_website = {'i_d_CompanyLinkedInLink':'https://www.linkedin.com/company',
    'i_l_CompanyTeamLink':'www.linkedin.com/search/',
    'i_d_CompanyTwitterLink':'www.twitter.com/',
    'i_l_WikipediaLink':'en.wikipedia.org/'}


id_name = {'i_d_CompanyName':'Name',
    'main_i_s_CompanyCountry':'Location',
    'main_i_s_CompanyFounded':'Founded year',
    'main_i_s_CompanyType':'Type',
    'i_d_CompanyDescription':'Description',
    'i_d_CompanyWebsiteLink':'Website',
    'i_l_CompanyEmailLink':'Email',
    'i_d_OverallScore_Detail':'VentureRadar Score',
    'main_i_s_SocialProofScore':'Social Proof',#Scores
    'main_i_s_WebsitePopularityScore':'Website Traffic',#Scores
    'main_i_s_WebsiteAutoAnalystScore':'Auto Analyst Score',#Scores
    'main_i_s_VRPopularityScore':'Popularity on VentureRadar',#Scores
    #Website Popularity Trend
    'i_d_EmployeeSatisfaction_Overall_Bar':'Overall Employee Rating',
    'i_d_EmployeeSatisfaction_Recommend_Bar':'Recommend to a Friend',
    'i_d_Glassdoor_CEO_Bar':'Approval of CEO ',
    #<div id="i_d_VRPopularity_HeatMapRow_High" class="mapHit map10">High</div>
    'main_alexaHeatMapRow':'Website Popularity',
    #'i_d_WebsiteHealth_ScoreBox_Child':'Auto Analyst Score 2', 
    #1/10 of 'main_i_s_WebsiteAutoAnalystScore'
    'i_d_VRPopularity_HeatMapRow_Low':'VentureRadar Popularity',
    'i_d_VRPopularity_HeatMapRow_Med':'VentureRadar Popularity',
    'i_d_VRPopularity_HeatMapRow_High':'VentureRadar Popularity',
    'i_d_CompanyLinkedInLink':'LinkedIn Profile',
    'i_l_CompanyTeamLink':'View Team on LinkedIn',
    'i_d_CompanyTwitterLink':'Twitter',
    'i_l_WikipediaLink':'Wikipedia Page'
    
    }

Build an HTMLParser subclass for scraping.

In [87]:
import re  # used in handle_data to skip whitespace-only text
from html.parser import HTMLParser
class TargetHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        # copy the dict: keys are deleted while parsing, and the shared
        # target_website must stay intact for the next page
        self.website = dict(target_website)
        self.flag = 0               # self.flag = 1 target id, self.flag = 2 keyword id
                                    #self.flag = 3 description self.flag = 4 funding
                                    #self.flag = 5 updates 
        self.gap = 1                # self.gap = 0 in target, self.gap = 1 not in target
        self.id_data = {} #key:tag_id value:data
        self.current_id = ''
        self.keyword_dict = {}
        self.fund_type = ''
        self.href = ''
        self.fund_dict = {} # key:fund_organization value:fund_type, href 
        self.updates_list = [] 
        self.updates_e = {}
        # value:c_d_InsightCard_Title, c_d_InsightCard_Date, c_d_InsightCard_Content, c_d_InsightCard_Source, href 
    def handle_starttag(self, tag, attrs):
        
        self.gap = 0
        for name,values in attrs:
            
            if name =='class' and values == "c_d_TagSources":
                self.flag = 4
            elif name == 'id' and values == 'i_d_InsightCardsContents':
                self.flag = 5
                    
            elif name == 'id' and values == "i_d_CompanyDescription":
                
                self.flag = 3
                self.current_id = values
            elif name =='id' and values in id_data:
                self.flag = 1
                self.current_id = values
            elif tag == 'span'and name =='id' and values == 'i_s_CompanySectors':
                self.flag = 2
                self.current_id = values
            elif name == 'class' and values in fund_class:
                self.fund_type = fund_class[values]
            elif name == 'class' and values in mapHit_class:#target_class appears in 1-in-3 and 1-in-10, 1-in-10 here
                self.id_data['main_alexaHeatMapRow'] = mapHit_class[values]
            elif tag == 'div' and name == 'id' and self.flag == 5:
                self.current_id = values
            elif tag == 'a' and name == 'href' and self.flag == 4:
                self.href = values
            elif tag == 'a' and name == 'href' and self.flag == 2:
                self.href = values
            elif tag == 'a' and name == 'href' and self.flag == 5:
                self.href = values  
            elif tag == 'a' and name == 'href':
                for ws_id in self.website:
                    if values.count(self.website[ws_id]) > 0:
                        self.id_data[ws_id] = values
                        del self.website[ws_id]
                        break

    def handle_endtag(self,tag):
        
        self.gap = 1
        
        if (self.flag==2 or self.flag==1) and (tag == 'span' or tag == 'div'):#clear flag with id
            self.flag = 0
            self.current_id = ''
            self.href = ''
        elif self.flag == 4 and tag == 'div':
            self.flag = 0
            self.href = ''
            self.current_id = ''
        elif self.flag == 5 and tag == 'a':
            self.flag = 0
            self.href = ''
            self.current_id = ''
        
    def handle_data(self, data):
        if self.flag == 1:
            empty_line = re.search(r'^\n\s*$', data)  # whitespace-only text node
            if empty_line is None:
                self.id_data[self.current_id] = data
        elif self.flag == 2 and self.gap == 0:
            self.keyword_dict[data] = self.href
        elif self.flag == 3 and self.gap == 1:
            self.id_data[self.current_id] = data
            self.flag = 0
            self.current_id = ''
        elif self.flag == 4:
            self.fund_dict[data] = (self.fund_type,self.href)
        elif self.flag == 5:
            pass
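
The parser above relies on one idea: remember which element is currently open, then store the text that arrives while it is open. The stripped-down sketch below shows only that mechanism. The class name `TextById` and the sample markup are hypothetical; the real pages are more deeply nested, which is why the full parser needs extra flags.

```python
from html.parser import HTMLParser

class TextById(HTMLParser):
    """Record the first non-blank text found inside elements with wanted ids."""
    def __init__(self, wanted_ids):
        super().__init__()
        self.wanted_ids = set(wanted_ids)
        self.current_id = None   # id of the wanted element being read, if any
        self.found = {}          # element id -> text

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id in self.wanted_ids:
            self.current_id = element_id

    def handle_endtag(self, tag):
        self.current_id = None   # any closing tag ends the capture

    def handle_data(self, data):
        text = data.strip()
        if self.current_id and text and self.current_id not in self.found:
            self.found[self.current_id] = text

# Hand-written sample markup, assumed for illustration only.
snippet = (
    '<div id="i_d_CompanyName">\n  Fiverr\n</div>'
    '<span id="main_i_s_CompanyCountry">USA</span>'
    '<span id="unrelated">ignored</span>'
)
parser = TextById(["i_d_CompanyName", "main_i_s_CompanyCountry"])
parser.feed(snippet)
print(parser.found)
```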

Scrape the information from the HTML pages and save it to an Excel file.

In [88]:
import xlwt
file = xlwt.Workbook()
table = file.add_sheet('sheet 1',cell_overwrite_ok=True)
xls_x = 0
xls_y = 0
table.write(xls_x,xls_y,'Rank')
xls_y = xls_y + 1

for element in id_name:

    if element not in ['i_d_VRPopularity_HeatMapRow_Low',                        
                                          'i_d_VRPopularity_HeatMapRow_Med']:
        table.write(xls_x,xls_y,id_name[element])
        xls_y = xls_y + 1
#xls_y = xls_y + 1#error delete
table.write(xls_x,xls_y,'Keyword') 
xls_y = xls_y + 1
table.write(xls_x,xls_y,'Funding Signals') 
num = 1
for i in range(len(htmls)):  # one row per downloaded page

    hp = TargetHTMLParser()
    hp.feed(htmls[i])
    fund_dict = hp.fund_dict
    xls_x = xls_x + 1
    xls_y = 0
    table.write(xls_x,xls_y,str(i+1))
    print (hp.id_data)
    for element in id_name:
        if element in hp.id_data:
            xls_y = xls_y + 1
            table.write(xls_x,xls_y,hp.id_data[element]) 
        else:
            if id_name[element] != "VentureRadar Popularity":
                xls_y = xls_y + 1
                table.write(xls_x,xls_y,'NO DATA') 

    #Keyword: 
    keyword_str = ''
    for key in hp.keyword_dict:
        keyword_str = keyword_str + key + '; '
    xls_y = xls_y + 1
    table.write(xls_x,xls_y,keyword_str) 
    #Funding Signals
    fund_type = ''
    fund_str = ''
    for org in hp.fund_dict:
        if hp.fund_dict[org][0] != fund_type:
            fund_type = hp.fund_dict[org][0]
            fund_str = fund_str + '['+fund_type + ']:'
        fund_str = fund_str + org + '; '
    xls_y = xls_y + 1
    table.write(xls_x,xls_y,fund_str)

fo_name = search_key + '.xls'  
file.save(fo_name)

{'i_d_CompanyName': 'Fiverr', 'main_i_s_CompanyCountry': 'USA', 'main_i_s_CompanyFounded': '2010', 'main_i_s_CompanyType': 'Private Company', 'i_d_CompanyDescription': "Browse. Buy. Done. Fiverr gives you instant access to millions of Gigs from people who love what they do. It's the easiest way for individuals and businesses to get everything done, at unbeatable value. Fiverr is the world's largest marketplace for creative and professional services, currently listing over 3 million Gigs in more than 100 different categories across 196 countries. Fiverr is one of the top 130 websites in the world according to Alexa.com. With a team of 130 people, Fiverr has primary offices in New York, Miami, and Tel Aviv.", 'i_d_CompanyWebsiteLink': 'http://Fiverr.com', 'i_l_CompanyEmailLink': 'n/a', 'i_d_OverallScore_Detail': '926', 'main_i_s_SocialProofScore': '830', 'main_i_s_WebsitePopularityScore': '997', 'main_i_s_WebsiteAutoAnalystScore': '890', 'main_i_s_VRPopularityScore': '989', 'main_alexaHe

{'i_d_CompanyName': 'Reallusion', 'main_i_s_CompanyCountry': 'USA', 'main_i_s_CompanyFounded': '1993', 'main_i_s_CompanyType': 'Private Company', 'i_d_CompanyDescription': "Reallusion Inc. is a 2D and 3D animation software and content developer. Headquartered in Silicon Valley, with R&D centers in Taiwan, and offices and training centers in Germany and Japan. Reallusion specializes in the development of real-time 2D and 3D cinematic animation tools for consumers, students and professionals. The company provides users with easy-to-use avatar animation, facial morphing and voice lip-sync solutions for real-time 3D filmmaking, and previsualization for professional post-production. Reallusion's core technologies are widely used by trainers, educators, gamers and filmmakers providing them with stand-alone movie studio packages.", 'i_d_CompanyWebsiteLink': 'http://www.reallusion.com/', 'i_l_CompanyEmailLink': 'more@reallusion.com', 'i_d_OverallScore_Detail': '815', 'main_i_s_SocialProofScore
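
`xlwt` writes only the legacy `.xls` format. If a spreadsheet-specific format is not required, the standard-library `csv` module is a lighter alternative, and `DictWriter` can fill in missing fields much like the 'NO DATA' branch above. This is a sketch with a hypothetical helper `rows_to_csv` and a reduced set of columns; the two sample rows reuse values shown in this notebook.

```python
import csv
import io

def rows_to_csv(rows, columns, out):
    """Write one CSV row per dict, filling absent fields with 'NO DATA'."""
    writer = csv.DictWriter(out, fieldnames=columns, restval="NO DATA",
                            extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)

columns = ["Rank", "Name", "Location", "Founded year"]
rows = [
    {"Rank": 1, "Name": "Fiverr", "Location": "USA", "Founded year": "2010"},
    {"Rank": 11, "Name": "Renderforest", "Location": "Armenia"},  # no founding year
]
buffer = io.StringIO()   # use open(fo_name, "w", newline="") for a real file
rows_to_csv(rows, columns, buffer)
print(buffer.getvalue())
```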

Read from Excel.

In [89]:
import pandas as pd
df = pd.read_excel(fo_name)  # the file saved above, i.e. search_key + '.xls'
print (df)

    Rank                                Name        Location Founded year  \
0      1                              Fiverr             USA         2010   
1      2                               Wideo       Argentina         2012   
2      3                             PowToon  United Kingdom         2012   
3      4                            Autodesk             USA         1982   
4      5                            Animaker             USA         2014   
5      6                             Blender          Israel         2014   
6      7                     Furhat Robotics          Sweden         2014   
7      8                              Moovly          Canada         2012   
8      9                          Anthropics  United Kingdom         1998   
9     10                          YellowDog   United Kingdom         2015   
10    11                        Renderforest         Armenia      NO DATA   
11    12                           Artomatix         Ireland         2014   

The scraping succeeded.

In [90]:
driver.close()

##  Tips
You can run the browser in the background by starting it in headless mode.

In [93]:
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.ventureradar.com/")

In headless mode, the Chrome icon may show up briefly in the Dock (on macOS) and then disappear, and no Chrome window appears. Remember to close the driver when you are done.

In [94]:
driver.close()
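
Because a headless browser is invisible, it is easy to leave one running. Note that `close()` closes the current window, while `quit()` ends the whole session. One way to guarantee cleanup is a small context manager, sketched below; `browser_session` and `FakeDriver` are hypothetical names, and the fake driver only stands in for Selenium so that the snippet runs without a browser.

```python
from contextlib import contextmanager

@contextmanager
def browser_session(make_driver):
    """Create a driver, hand it to the caller, and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()   # runs even if the body raises

class FakeDriver:
    """Stand-in for a Selenium driver so the sketch runs without a browser."""
    def __init__(self):
        self.quit_called = False
        self.url = None

    def get(self, url):
        self.url = url

    def quit(self):
        self.quit_called = True

with browser_session(FakeDriver) as driver:
    driver.get("https://www.ventureradar.com/")
print(driver.quit_called)
```

With Selenium, the factory could be `lambda: webdriver.Chrome(options=chrome_options)`.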

##  Resources
Home page: http://www.seleniumhq.org   
Official documentation: https://www.selenium.dev/selenium/docs/api/py/index.html

Reference:  
https://pypi.org/project/selenium  
https://stackoverflow.com/questions/5041008/how-to-find-elements-by-class  
https://chrome.google.com/webstore/detail/selenium-ide/mooikfkahbdckldjjndioackbalphokd?hl=en  
https://addons.mozilla.org/en-US/firefox/addon/selenium-ide/  
https://intellipaat.com/community/30397/headless-chrome-selenium-running-selenium-with-headless-chrome-webdriver