# Web Scraping HondaWeb to Obtain Member Skills

In [1]:
import requests, lxml.html
from getpass import getpass
from bs4 import BeautifulSoup

s = requests.session()

The default login page requires a user name and password.  The login request may also need additional fields beyond those two.  Most often, these are defined as **hidden inputs** inside the HTML ```form``` tag.

**We can programmatically obtain hidden input fields in the log-in page:**

In [12]:
login_url = 'https://hondasites.com/auth/default.html'

login = s.get(login_url)
login_html = lxml.html.fromstring(login.text)
hidden_inputs = login_html.xpath(r'//form//input[@type="hidden"]')

# Create Python dictionary containing key-value pairs of hidden inputs
form = {x.attrib["name"]: x.attrib["value"] for x in hidden_inputs}
print(form)

{'login_referrer': '', 'login': 'Y'}


From above, we see that there are 2 hidden inputs: ```login_referrer``` and ```login```.
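
The comprehension above indexes ```x.attrib["value"]``` directly, which would raise a ```KeyError``` if a hidden input had no ```value``` attribute. As a minimal sketch, assuming a made-up login form (the ```SAMPLE_LOGIN_HTML``` string and the ```HiddenInputCollector``` class are hypothetical and not taken from HondaWeb), the same extraction can be tried offline with the standard library's ```html.parser```, defaulting missing values to an empty string:

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a login page; the real page's markup may differ.
SAMPLE_LOGIN_HTML = """
<form action="/auth/default.html" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="hidden" name="login_referrer" value="">
  <input type="hidden" name="login" value="Y">
  <input type="hidden" name="no_value_field">
</form>
"""


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of hidden <input> tags found inside a <form>."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        elif tag == "input" and self.form_depth > 0:
            attributes = dict(attrs)
            input_type = (attributes.get("type") or "").lower()
            if input_type == "hidden" and attributes.get("name"):
                # A missing or valueless attribute is stored as an empty string.
                self.fields[attributes["name"]] = attributes.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth > 0:
            self.form_depth -= 1


collector = HiddenInputCollector()
collector.feed(SAMPLE_LOGIN_HTML)
hidden_fields = collector.fields
print(hidden_fields)
```

The visible ```username``` and ```password``` inputs are skipped because only inputs with ```type="hidden"``` are kept.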

**Alternatively, we can inspect the log-in page's source to find the same 2 hidden inputs.**

**Using the Network panel of a browser's inspector tools, I determined that HondaWeb uses 3 stages of authentication.  Below, the first URL is the default log-in page and URLs 2 through 4 are the 3 stages of authentication.  The last URL (login_url5) is just a test URL of an actual person's profile page.  To be fully authenticated, we must be able to request the first 4 URLs below:**

In [3]:
s = requests.session()
login_url = 'https://hondasites.com/auth/default.aspx'
login_url2 = 'https://myhonda.hondasites.com/_layouts/Authenticate.aspx?Source=/'
login_url3 = 'https://myhonda.hondasites.com/_layouts/accessmanagersignin.aspx?ReturnUrl=/_layouts/Authenticate.aspx?Source=%2F&Source=/'
login_url4 = 'https://myhonda.hondasites.com/_layouts/15/Authenticate.aspx?Source=/'
login_url5 = 'https://myhonda.hondasites.com/Person.aspx?accountname=i:0%23.f|AccessManagerMembershipProvider|17151'

**To log into the default login page, we have all the pieces of information we need: user name, password, login_referrer, and login.**

We will create a Python dictionary that will contain our credentials.

In [4]:
username = getpass('User Name:')
password = getpass('Password:')

credentials = {
    'username': username,
    'password': password,
    'login_referrer': '',
    'login': 'Y'
}

User Name:········
Password:········
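
The hidden fields in the ```credentials``` dictionary above are typed in by hand. A small sketch of building the same payload from whatever hidden inputs were scraped, so that the two stay in sync, might look like this. The helper name ```build_login_payload``` is hypothetical, and the field names ```username``` and ```password``` are taken from the dictionary above:

```python
def build_login_payload(hidden_fields, username, password):
    """Return form data combining scraped hidden inputs with the credentials."""
    payload = dict(hidden_fields)  # copy so the caller's dictionary is untouched
    payload["username"] = username
    payload["password"] = password
    return payload


# Hidden-input values as printed earlier in this notebook.
example_hidden = {"login_referrer": "", "login": "Y"}
# Placeholder credentials for illustration only.
example_payload = build_login_payload(example_hidden, "my_user", "my_password")
print(sorted(example_payload))
```

In the notebook, the returned dictionary could be passed as ```data=``` to ```s.post``` in place of the hand-written one.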


### To test things, we will attempt to request the 5 URLs defined above:

A status code of 200 means the server returned the page successfully; here we take that as a sign that we were granted access.

In [6]:
request1 = s.post(login_url, data=credentials)
print('request1:', request1.status_code)
request2 = s.get(login_url2)
print('request2:', request2.status_code)
request3 = s.get(login_url3)
print('request3:', request3.status_code)
request4 = s.get(login_url4)
print('request4:', request4.status_code)
request5 = s.get(login_url5)
print('request5:', request5.status_code)

request1: 200
request2: 200
request3: 200
request4: 200
request5: 200
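
A 200 status on its own does not prove the login worked, since many sites also answer 200 when they redisplay the login form. As a hedged sketch, one extra check for pages requested after login (such as request5) is whether the final URL still points at the login page. Whether HondaWeb actually sends a failed session back to the login page is an assumption. The helper ```looks_authenticated``` and the ```FakeResponse``` stand-in are hypothetical; real ```requests``` responses expose the same ```status_code``` and ```url``` attributes:

```python
from collections import namedtuple
from urllib.parse import urlsplit

# Stand-in with the two attributes of requests.Response that the check reads.
FakeResponse = namedtuple("FakeResponse", ["status_code", "url"])


def looks_authenticated(response, login_url):
    """Heuristic: status is 200 and we did not end up back on the login page."""
    if response.status_code != 200:
        return False
    final = urlsplit(response.url)
    login = urlsplit(login_url)
    landed_on_login = (final.netloc.lower(), final.path.lower()) == (
        login.netloc.lower(),
        login.path.lower(),
    )
    return not landed_on_login


login_page = "https://hondasites.com/auth/default.aspx"
profile = FakeResponse(
    200,
    "https://myhonda.hondasites.com/Person.aspx"
    "?accountname=i:0%23.f|AccessManagerMembershipProvider|17151",
)
# Made-up example of a response that was sent back to the login page.
bounced = FakeResponse(200, "https://hondasites.com/auth/default.aspx?retry=1")

print(looks_authenticated(profile, login_page))
print(looks_authenticated(bounced, login_page))
```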


### Now that we know we were able to request all 5 pages, let's look at the first 500 characters of a user's profile page (request5):

**NOTE:** For confidentiality, only the first 500 characters are shown.

In [7]:
request5.content[:500]

b'\r\n\r\n<!DOCTYPE html >\r\n<html lang="en" dir="ltr" class="ms-isBot">\r\n    <head id="ctl00_Head1"><meta http-equiv="X-UA-Compatible" content="IE=Edge" /><meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="description" /><meta name="author" /><meta http-equiv="Content-type" content="text/html; charset=utf-8" /><meta http-equiv="Cache-control" content="NO-CACHE" /><meta http-equiv="Expires" content="0" /><title>\r\n\t\r\n  Daniel Somebody\r\n  \r\n  \r\n\r\n</title><link rel="shortcut ic'

**From above, it appears we have the data we want.**

**Now we can proceed with actually scraping the profile page.**

In [8]:
soup = BeautifulSoup(request5.content, 'html.parser')

In [9]:
divs = soup.find_all('div', id='blah_blah_ProfileViewer_SPS-Skills')

In [10]:
if divs:
    print('User Skills:', divs[0].span.next.next.text)
else:
    print('User did not enter skills.')

User Skills: failure forecasting, SQL, programming, Python, R
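
The skills come back as one comma-separated string. A minimal sketch of turning that string into a Python list for later analysis follows; the helper name ```split_skills``` is hypothetical, and it assumes commas are the only separator, as in the output above:

```python
def split_skills(raw_skills):
    """Split a comma-separated skills string into a list of trimmed entries."""
    return [skill.strip() for skill in raw_skills.split(",") if skill.strip()]


skills = split_skills("failure forecasting, SQL, programming, Python, R")
print(skills)
```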


### Web Scraping Multiple Profiles:

Given a list of 2 or more members, we can scrape them all using a ```for``` loop:

In [11]:
base_profile_url = 'https://myhonda.hondasites.com/Person.aspx?accountname=i:0%23.f|AccessManagerMembershipProvider|'

members = ['17151', '38623', '10770']
for member in members:
    member_url = base_profile_url + member
    request = s.get(member_url)
    soup = BeautifulSoup(request.content, 'html.parser')
    skills_div = soup.find_all('div', id='ctl00_SPWebPartManager1_g_402dacf0_24c9_49f7_b128_9a852fc0ae8a_ProfileViewer_SPS-Skills')
    if skills_div:
        print('User(', member, ') Skills:', skills_div[0].span.next.next.text)
    else:
        print('User(', member, ') did not enter skills.')

User( 17151 ) Skills: failure forecasting, SQL, programming, Python, R
User( 38623 ) did not enter skills.
User( 10770 ) did not enter skills.
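
The loop above prints its results, which makes them hard to reuse. A sketch of collecting them into a dictionary instead is shown below. The names ```collect_skills``` and ```fake_lookup``` are hypothetical, and the lookup function is injected so the example runs without network access. In the notebook, the lookup would wrap the ```s.get``` and ```BeautifulSoup``` steps shown above and return ```None``` when the skills div is missing:

```python
import time


def collect_skills(member_ids, lookup, delay_seconds=0.0):
    """Map each member ID to the skills string returned by lookup, or None."""
    results = {}
    for index, member_id in enumerate(member_ids):
        if index > 0 and delay_seconds > 0:
            time.sleep(delay_seconds)  # optional pause between requests
        results[member_id] = lookup(member_id)
    return results


# Canned data mirroring the printed output above; not fetched from the site.
CANNED = {"17151": "failure forecasting, SQL, programming, Python, R"}


def fake_lookup(member_id):
    """Return the canned skills string, or None if the member has none."""
    return CANNED.get(member_id)


all_skills = collect_skills(["17151", "38623", "10770"], fake_lookup)
for member_id, member_skills in all_skills.items():
    shown = member_skills if member_skills is not None else "no skills entered"
    print(member_id, "->", shown)
```

The optional ```delay_seconds``` argument is a design choice for being gentle on the server when the member list grows; it defaults to no delay.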


### Complete Standalone Script:

In [13]:
import requests, lxml.html
from getpass import getpass
from bs4 import BeautifulSoup

s = requests.session()

login_url = 'https://hondasites.com/auth/default.html'
login_url2 = 'https://myhonda.hondasites.com/_layouts/15/Authenticate.aspx?Source=/'
login_url3 = 'https://myhonda.hondasites.com/_layouts/accessmanagersignin.aspx?ReturnUrl=/_layouts/15/Authenticate.aspx?Source=%2F&Source=/'
login_url4 = 'https://myhonda.hondasites.com/_layouts/15/Authenticate.aspx?Source=/'

base_profile_url = 'https://myhonda.hondasites.com/Person.aspx?accountname=i:0%23.f|AccessManagerMembershipProvider|'

username = getpass('User Name:')
password = getpass('Password:')

credentials = {
    'username': username,
    'password': password,
    'login_referrer': '',
    'login': 'Y'
}

request1 = s.post(login_url, data=credentials)
print('Submitted login')
request2 = s.get(login_url2)
print('Passed authentication #1')
request3 = s.get(login_url3)
print('Passed authentication #2')
request4 = s.get(login_url4)
print('Passed authentication #3')

members = ['17151', '38623', '10770']
for member in members:
    member_url = base_profile_url + member
    request = s.get(member_url)
    soup = BeautifulSoup(request.content, 'html.parser')
    skills_div = soup.find_all('div', id='blah_blah_ProfileViewer_SPS-Skills')
    if skills_div:
        print('User(', member, ') Skills:', skills_div[0].span.next.next.text)
    else:
        print('User(', member, ') did not enter skills.')

User Name:········
Password:········
Submitted login
Passed authentication #1
Passed authentication #2
Passed authentication #3
User( 17151 ) Skills: failure forecasting, SQL, programming, Python, R
User( 38623 ) did not enter skills.
User( 10770 ) did not enter skills.
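
The script above prints each "Passed authentication" message regardless of what the server answered. As a hedged sketch, the login sequence could be wrapped in a function that stops at the first HTTP error. The function name ```run_login_sequence``` is hypothetical; it relies only on ```post```, ```get```, and ```raise_for_status```, which ```requests.Session``` and ```requests.Response``` provide. The stub classes exist only so the example runs offline:

```python
class StubResponse:
    """Offline stand-in for requests.Response with a fixed status code."""

    def __init__(self, status_code):
        self.status_code = status_code

    def raise_for_status(self):
        # requests raises requests.HTTPError here; the stub uses RuntimeError.
        if self.status_code >= 400:
            raise RuntimeError("HTTP error %d" % self.status_code)


class StubSession:
    """Offline stand-in for requests.Session that records requested URLs."""

    def __init__(self, failing_url=None):
        self.failing_url = failing_url
        self.requested = []

    def _respond(self, url):
        self.requested.append(url)
        return StubResponse(403 if url == self.failing_url else 200)

    def post(self, url, data=None):
        return self._respond(url)

    def get(self, url):
        return self._respond(url)


def run_login_sequence(session, login_url, credentials, stage_urls):
    """POST the credentials, then GET each stage URL, raising on HTTP errors."""
    response = session.post(login_url, data=credentials)
    response.raise_for_status()
    print("Submitted login")
    for stage_number, url in enumerate(stage_urls, start=1):
        response = session.get(url)
        response.raise_for_status()
        print("Passed authentication #%d" % stage_number)
    return response


LOGIN_URL = "https://hondasites.com/auth/default.html"
STAGE_URLS = [
    "https://myhonda.hondasites.com/_layouts/15/Authenticate.aspx?Source=/",
    "https://myhonda.hondasites.com/_layouts/accessmanagersignin.aspx"
    "?ReturnUrl=/_layouts/15/Authenticate.aspx?Source=%2F&Source=/",
    "https://myhonda.hondasites.com/_layouts/15/Authenticate.aspx?Source=/",
]

stub = StubSession()
placeholder_credentials = {"username": "u", "password": "p", "login_referrer": "", "login": "Y"}
final_response = run_login_sequence(stub, LOGIN_URL, placeholder_credentials, STAGE_URLS)
```

In the real script, the session created by ```requests.session()``` would be passed in place of ```StubSession()```. Note that ```raise_for_status``` only catches 4xx and 5xx responses, so it would not detect a login page returned with status 200.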
