### BACKGROUND:

Currently, HondaWeb is the only known source of basic information for associates at nearly every Honda company: company name, division, department, location, email, and so on.  To see what information a HondaWeb profile page exposes, look at your own profile page.  Several attempts and inquiries have been made to find a single source of profile information for all Honda associates, regardless of Honda company.  So far, HondaWeb appears to be the only good source.  To web scrape an associate's profile information from HondaWeb, all that is needed besides the Python libraries is an identifier for the associate.  For non-American Honda associates, that is their Windows login user ID.  For American Honda (AHM) associates, it is their "FirstName LastName".

For example, if you are a non-AHM associate, copy this URL:

```https://somesite.com/REDACTED|AccessManagerMembershipProvider|```

Paste it into your browser, type your Windows user ID at the end, right after the final "|" symbol, and press the ENTER key.  You should then see your HondaWeb profile page.  For an AHM associate, type their first name, a space, and their last name instead, then press the ENTER key.

With the knowledge above, if you belong to an internal organization whose members can come from any Honda company, all you need is a compiled list of their Windows user IDs (or first and last names for AHM associates).  With that list, you can obtain their basic profile information programmatically using this web scraping technique.
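The URL construction described above can be sketched as a small helper. This is only an illustration: the base URL is the redacted one from the example, `build_profile_url` and the sample identifiers are hypothetical, and percent-encoding the identifier is an assumption (a browser's address bar normally does this for you).

```python
from urllib.parse import quote

# Base URL copied from the example above (host and path are redacted in the source)
BASE_PROFILE_URL = "https://somesite.com/REDACTED|AccessManagerMembershipProvider|"


def build_profile_url(identifier, base=BASE_PROFILE_URL):
    """Append a Windows user ID, or "FirstName LastName" for AHM, to the base URL."""
    # Assumption: spaces and other unsafe characters should be percent-encoded
    return base + quote(identifier.strip())


print(build_profile_url("ab12345"))   # hypothetical Windows user ID
print(build_profile_url("Jane Doe"))  # hypothetical AHM associate name
```

Keeping the base URL in one constant means only one line changes if the profile page address ever moves.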

### Python libraries used that are not part of the standard library:

- lxml
- Selenium
- tqdm
- pandas

### Import necessary Python libraries

In [1]:
from getpass import getpass                           # built-in Python library to enable hiding sensitive info such as password
from lxml import html                                 # Library to web scrape HTML pages                   
from selenium import webdriver                        # Needed to automate or simulate browser activity
from selenium.webdriver.common.keys import Keys       # Import Keys to send key strokes or key inputs to Chrome
from selenium.webdriver.chrome.options import Options # Needed if you want to use Chrome in "headless" mode
from tqdm import tqdm_notebook                        # library to embed progress bar
import pandas as pd                                   # Library for working with heterogeneous tabular data
import sqlite3                                        # Members' Windows user IDs are saved in a sqlite3 database
pd.options.display.max_colwidth=500

### Obtain a list of BRAIN BRG members' Windows user IDs

In [2]:
conn = sqlite3.connect(r'\\some_site.honda.com\REDACTED\database.db')

sql = """
SELECT
    RTRIM(OPRID) as OPRID

FROM
    members

WHERE
    Member = 'X'
"""

members = pd.read_sql_query(sql, conn)
conn.close()
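If pandas is not needed at this step, the same query can be run with `sqlite3` alone. The sketch below is an assumption-laden illustration, not the notebook's method: it uses an in-memory database with made-up rows in place of the real `database.db`, a hypothetical helper named `fetch_member_ids`, and a bound parameter instead of a literal `'X'`.

```python
import sqlite3
from contextlib import closing


def fetch_member_ids(conn):
    """Return the trimmed OPRID values of rows flagged as members."""
    sql = "SELECT RTRIM(OPRID) AS OPRID FROM members WHERE Member = ?"
    with closing(conn.cursor()) as cur:
        cur.execute(sql, ("X",))
        return [row[0] for row in cur.fetchall()]


# Demo against an in-memory database; the rows below are invented for illustration
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute("CREATE TABLE members (OPRID TEXT, Member TEXT)")
    demo_conn.executemany(
        "INSERT INTO members VALUES (?, ?)",
        [("ab12345   ", "X"), ("cd67890", ""), ("ef24680 ", "X")],
    )
    member_ids = fetch_member_ids(demo_conn)

print(member_ids)
```

`contextlib.closing` guarantees the connection is closed even if the query raises, which a bare `conn.close()` at the end of the cell does not.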

### Let's look at our list of Windows user IDs of BRAIN members

In [None]:
members.OPRID.values

### HondaWeb is a secured site, so you need to provide your credentials

In [None]:
username = input('Enter your username: ')
password = getpass('Enter your password: ')
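For unattended runs, one possible variation is to read the credentials from environment variables and prompt only for what is missing. This is a sketch under assumptions: `get_credentials` and the variable names `HONDAWEB_USER` and `HONDAWEB_PASSWORD` are hypothetical, and whether storing a password in the environment is acceptable depends on your security policy.

```python
import os
from getpass import getpass


def get_credentials(env=os.environ, ask_user=input, ask_password=getpass):
    """Read credentials from the environment, prompting only for missing values."""
    # HONDAWEB_USER and HONDAWEB_PASSWORD are hypothetical variable names
    username = env.get("HONDAWEB_USER") or ask_user("Enter your username: ")
    password = env.get("HONDAWEB_PASSWORD") or ask_password("Enter your password: ")
    return username, password


# Example with a fake environment, so nothing is prompted
demo_user, demo_password = get_credentials(
    env={"HONDAWEB_USER": "demo", "HONDAWEB_PASSWORD": "not-a-real-password"}
)
```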

### We will use the Chrome browser in this example, so we need to load the Chrome driver

In [5]:
# First, set Chrome into "headless mode" for quicker page navigation
options = Options()
options.headless = True
browser = webdriver.Chrome(r'C:\Users\user\Downloads\chromedriver_win32\chromedriver.exe', options=options)

### Instruct the Chrome browser to visit the HondaWeb login page and then:

- Enter user name and then
- Enter password and then
- Hit Enter key to submit the user name and password

In [6]:
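from selenium.webdriver.common.by import By  # locator strategies; find_element(By...) works in Selenium 3 and 4

browser.get('https://some_site.com/auth/default.aspx')

elem_username = browser.find_element(By.NAME, 'username')  # find the username text box
elem_username.send_keys(username)

elem_password = browser.find_element(By.NAME, 'password')  # find the password text box
elem_password.send_keys(password + Keys.RETURN)            # type the password, then press Enter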

### Loop through the members list and for each member, extract the data

In [7]:
%%time

# Initialize Python lists to contain the data we want to capture
first_last_name_list = []
company_list = []
division_list = []
department_list = []
office_location_list = []
email_list = []
skills_list = []
interests_list = []
profile_url_list = []

# This is the "base" URL to which each member's Windows user ID is appended.
# browser.get() needs a full URL, including the scheme.
base_profile_url = 'https://somesite.com/REDACTED|AccessManagerMembershipProvider|'

# Now loop through the list of members' Windows user IDs and visit their HondaWeb profile page
# and extract their data with lxml's XPath query language
print("Running Chrome in headless mode...")
for member in tqdm_notebook(members.OPRID, desc='Looping thru members...'):
    member_url = base_profile_url + member
    browser.get(member_url)
    profile_html = html.fromstring(browser.page_source)

    first_last_name_div = profile_html.xpath('//div[@id="ctl00_SPWebPartManager1_g_402dacf0_24c9_49f7_b128_9a852fc0ae8a_ProfileViewer_PreferredName"] \
                                        /span[@class="ms-tableCell ms-profile-detailsValue"]/text()')
    company_div = profile_html.xpath('//div[@id="ctl00_SPWebPartManager1_g_402dacf0_24c9_49f7_b128_9a852fc0ae8a_ProfileViewer_HondaCompanyName"] \
                                    /span[@class="ms-tableCell ms-profile-detailsValue"]/text()')
    division_div = profile_html.xpath('//div[@id="ctl00_SPWebPartManager1_g_402dacf0_24c9_49f7_b128_9a852fc0ae8a_ProfileViewer_HondaDivisionName"] \
                                    /span[@class="ms-tableCell ms-profile-detailsValue"]/text()')
    department_div = profile_html.xpath('//div[@id="ctl00_SPWebPartManager1_g_402dacf0_24c9_49f7_b128_9a852fc0ae8a_ProfileViewer_HondaDepartmentName"] \
                                    /span[@class="ms-tableCell ms-profile-detailsValue"]/text()')
    office_loc_div = profile_html.xpath('//div[@id="ctl00_SPWebPartManager1_g_402dacf0_24c9_49f7_b128_9a852fc0ae8a_ProfileViewer_SPS-Location"] \
                                    /span[@class="ms-tableCell ms-profile-detailsValue"]/text()')
    email_span = profile_html.xpath('//span[@id="ProfileViewer_ValueWorkEmail"]/text()')
    skills_div = profile_html.xpath('//div[@id="ctl00_SPWebPartManager1_g_402dacf0_24c9_49f7_b128_9a852fc0ae8a_ProfileViewer_SPS-Skills"] \
                                    /span[@class="ms-tableCell ms-profile-detailsValue"]/text()')
    interests_div = profile_html.xpath('//div[@id="ctl00_SPWebPartManager1_g_402dacf0_24c9_49f7_b128_9a852fc0ae8a_ProfileViewer_SPS-Interests"] \
                                    /span[@class="ms-tableCell ms-profile-detailsValue"]/text()')
    
    # With each member's data, we will add them/append to their respective Python list
    if first_last_name_div:
        first_last_name_list.append(first_last_name_div[0])
    else:
        first_last_name_list.append('')
    
    if company_div:
        company_list.append(company_div[0])
    else:
        company_list.append('')
        
    if division_div:
        division_list.append(division_div[0])
    else:
        division_list.append('')
        
    if department_div:
        department_list.append(department_div[0])
    else:
        department_list.append('')
        
    if office_loc_div:
        office_location_list.append(office_loc_div[0])
    else:
        office_location_list.append('')
    
    if email_span:
        email_list.append(email_span[0].lower())  # Some emails are stored in mixed case, so normalize to lowercase
    else:
        email_list.append('')
    
    if skills_div:
        skills_list.append(skills_div[0])
    else:
        skills_list.append('')
        
    if interests_div:
        interests_list.append(interests_div[0])
    else:
        interests_list.append('')
        
    profile_url_list.append(member_url)

# Close/quit the Chrome browser
print("Web scraping complete.  Quitting Chrome browser...")
browser.quit()

Running Chrome in headless mode...


Web scraping complete.  Quitting Chrome browser...
Wall time: 2min 23s
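The loop above repeats the same XPath pattern and the same "first match or empty string" check for every field. A possible refactoring is sketched below. The id prefix and suffixes are copied from the XPath strings in the loop, while `field_xpath`, `first_or_blank`, and `extract_fields` are hypothetical helper names. The email field uses a different XPath (`//span[@id="ProfileViewer_ValueWorkEmail"]`), so it is left out here.

```python
# Prefix shared by the <div> ids in the loop above
WEBPART_PREFIX = "ctl00_SPWebPartManager1_g_402dacf0_24c9_49f7_b128_9a852fc0ae8a_ProfileViewer_"

# Output column name -> suffix of the <div> id
FIELD_SUFFIXES = {
    "First_Last_Name": "PreferredName",
    "Company": "HondaCompanyName",
    "Division": "HondaDivisionName",
    "Department": "HondaDepartmentName",
    "Office_Location": "SPS-Location",
    "Skills": "SPS-Skills",
    "Interests": "SPS-Interests",
}


def field_xpath(suffix):
    """Build the XPath that selects the text value of one profile field."""
    return (
        f'//div[@id="{WEBPART_PREFIX}{suffix}"]'
        '/span[@class="ms-tableCell ms-profile-detailsValue"]/text()'
    )


def first_or_blank(matches):
    """Return the first XPath match, or '' when the field is missing."""
    return matches[0] if matches else ""


def extract_fields(query):
    """Run every field XPath through `query` (for example profile_html.xpath)."""
    return {
        column: first_or_blank(query(field_xpath(suffix)))
        for column, suffix in FIELD_SUFFIXES.items()
    }


# Stand-in for profile_html.xpath so the sketch runs without a browser
def fake_query(xpath):
    return ["Example Company"] if "HondaCompanyName" in xpath else []


record = extract_fields(fake_query)
print(record)
```

In the real loop, `extract_fields(profile_html.xpath)` would return one dictionary per member, and a list of such dictionaries can be passed directly to `pd.DataFrame`.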


### Let's take a peek (first 5 records) at our Python lists to see if they have the data we wanted

In [8]:
first_last_name_list[:5]

['Nick Allen', 'Jonathan Alvarez', 'Greta Backus', 'Steve Baker', 'Mark Bar']

In [9]:
company_list[:5]

['Honda of America Mfg., Inc.',
 'Honda of America Mfg., Inc.',
 'Honda of America Mfg., Inc.',
 'Honda of America Mfg., Inc.',
 'Honda of America Mfg., Inc.']

In [10]:
division_list[:5]

['Manufacturing Tech Division',
 'NA Quality Division',
 'NA Quality Division',
 'Human Resource ＆ Corp Services',
 'NA Quality Division']

In [11]:
department_list[:5]

['Discrete Simulation', 'MQ INFO', 'MQ Warranty Cost', 'HAM MFG IT', 'MQ INFO']

In [12]:
office_location_list[:5]

['Marysville, OH', 'Raymond, OH', 'Raymond, OH', 'Anna, OH', 'Raymond, OH']

In [None]:
email_list[:5]

In [14]:
skills_list[:5]

['',
 'SQL, VBA, programming, Forecasting, Excel, Excel Macros, Access, data analysis, Sharepoint',
 '',
 '',
 '']

In [15]:
interests_list[:5]

['', '', '', '', '']

In [None]:
profile_url_list[:5]

### Basic data check: make sure each Python list has as many entries as there are BRAIN BRG members

In [17]:
assert len(first_last_name_list) == members.shape[0]
assert len(company_list) == members.shape[0]
assert len(division_list) == members.shape[0]
assert len(department_list) == members.shape[0]
assert len(office_location_list) == members.shape[0]
assert len(email_list) == members.shape[0]
assert len(skills_list) == members.shape[0]
assert len(interests_list) == members.shape[0]
assert len(profile_url_list) == members.shape[0]
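The same length check can be written once and report every mismatched list by name. A minimal sketch, assuming a hypothetical helper `check_lengths` and made-up sample lists:

```python
def check_lengths(expected, **named_lists):
    """Raise AssertionError naming every list whose length differs from `expected`."""
    wrong = {
        name: len(values)
        for name, values in named_lists.items()
        if len(values) != expected
    }
    if wrong:
        raise AssertionError(f"Expected {expected} items per list, got: {wrong}")


# Made-up data: both lists have two entries, so this passes silently
check_lengths(2, names=["a", "b"], emails=["x@example.com", "y@example.com"])
```

With the notebook's data, the call would look like `check_lengths(members.shape[0], company=company_list, email=email_list)`, with one keyword argument per list.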

For more comprehensive data validation, check out the [great_expectations library](http://docs.greatexpectations.io/en/latest/core_concepts/expectations.html).

### If our data check passed, then let's go ahead and make a pandas dataframe from our Python lists

In [18]:
members_df = pd.DataFrame({'First_Last_Name': first_last_name_list, 'Company': company_list, 
                          'Division': division_list, 'Department': department_list,
                          'Office_Location': office_location_list, 'Email': email_list,
                          'Skills': skills_list, 'Interests': interests_list,
                          'Profile_Url': profile_url_list})

In [None]:
members_df.head()

In [None]:
members_df.tail()

### Now, we can save our dataframe as Excel, csv, to a database, email it, etc...

In [21]:
# members_df.to_excel(r'path_to_where_you_want_to_save\filename.xlsx')
# members_df.to_csv(r'path_to_where_you_want_to_save\filename.csv')

### Make HTML table from pandas dataframe

But first, we need to create a column containing HTML ```<a>``` tags whose ```href``` attribute points to each member's profile page URL.

In [22]:
def makeHyperlink(row):
    """ Function to convert a string URL to HTML <a> tag """
    
    value = '<a href="' + str(row['Profile_Url']) + '"' + ">Profile Page</a>"
    
    return value

#### Apply the function above to create new ```URL_Hyperlink``` column:

In [23]:
members_df['URL_Hyperlink'] = members_df.apply(makeHyperlink, axis='columns')
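`makeHyperlink` concatenates the URL into the tag as-is. If a profile URL could ever contain characters that are special in HTML, such as `&` or `"`, a variant that escapes them may be safer. This is a sketch, not the notebook's method; `make_hyperlink` is a hypothetical name and the URL is an example.

```python
from html import escape  # standard library; not the lxml `html` module imported earlier


def make_hyperlink(url, label="Profile Page"):
    """Return an HTML <a> tag for `url`, escaping characters special to HTML."""
    return f'<a href="{escape(str(url), quote=True)}">{escape(label)}</a>'


print(make_hyperlink("https://example.com/profile?id=1&view=full"))
# prints: <a href="https://example.com/profile?id=1&amp;view=full">Profile Page</a>
```

Because this version takes a single URL instead of a row, it could be applied with `members_df['Profile_Url'].map(make_hyperlink)`.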

### Now display dataframe as HTML table

In [None]:
from ipywidgets import HTML

HTML(members_df.drop(columns='Profile_Url').to_html(escape=False, index=False))