# Scraping Maryland Business Licenses with Selenium

Maryland has a [great portal](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp) for searching business licenses. The only problem is that you have to check a box before it lets you in.

1. Try to visit [the public search page](https://jportal.mdcourts.gov/license/pbPublicSearch.jsp)
2. Get redirected to an "I agree to this" page. Check the box saying you've read the disclaimer, then click "Enter the Site".
3. Click "Search License Records" down at the bottom of the page
4. You're now on the search page! From the "Jurisdiction" dropdown, select "Statewide"
5. In the "Trade Name" field, type "Vap%" to try to find vape shops
6. Click "Next" in the bottom right-hand corner to go to the next page
7. Click "Click for detail" to see the details for a specific business license.

That's a lot of stuff! **Let's get to work.**

## Preparation

### When you search for a business license, what URL should Selenium try to visit first?

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select

In [27]:
from bs4 import BeautifulSoup

In [81]:
import pandas as pd

In [2]:
driver = webdriver.Chrome()

In [51]:
url = "https://jportal.mdcourts.gov/license/index_disclaimer.jsp"

In [52]:
driver.get(url)

**Visiting the search page directly isn't going to work, though! It's going to redirect to that intro page**, which is why the cells above start from the disclaimer URL. You can use *Incognito mode* to go back through the "Check the box, etc" series of pages.

### How will you identify the checkbox to check it?

Selenium can find elements by:

- Name
- Class
- ID
- CSS selector (**ASK ME WHAT THIS IS** if you don't know)
- XPath (**ASK ME WHAT THIS IS** because you definitely don't know)
- Link text
- Partial link text

So in other words, what's unique about this element?

- *TIP: I have a secret awesome way to do this, but you have to ask.*

### How will you identify the button to select it, or the form to submit it?

Selenium can submit forms by either

- Selecting the form and using `.submit()`, or
- Selecting the button and using `.click()`

You only need to be able to get **one, not both.**

In [53]:
driver.find_element_by_id("checkbox").click()

In [54]:
driver.find_element_by_xpath('/html/body/table/tbody/tr[7]/td/form/div/input[2]').click()
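
The `find_element_by_*` helpers used in these cells come from older Selenium releases. Selenium 4 deprecated them, and later 4.x versions dropped them in favour of `find_element(by, value)`. Below is a minimal sketch of the same two clicks in that style, not the notebook's original code. `accept_disclaimer` and `ENTER_BUTTON_XPATH` are hypothetical names, and the plain strings `"id"` and `"xpath"` are the values behind `By.ID` and `By.XPATH` in `selenium.webdriver.common.by`, written out here so the function needs no extra imports.

```python
# Hypothetical helper: repeats the checkbox and button clicks from the cells
# above using the generic find_element(by, value) call.
ENTER_BUTTON_XPATH = "/html/body/table/tbody/tr[7]/td/form/div/input[2]"


def accept_disclaimer(driver):
    """Tick the disclaimer checkbox, then click the button that enters the site."""
    # "id" is the string behind By.ID
    driver.find_element("id", "checkbox").click()
    # "xpath" is the string behind By.XPATH
    driver.find_element("xpath", ENTER_BUTTON_XPATH).click()
```

With a live browser this would be called as `accept_disclaimer(driver)` right after `driver.get(url)`.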

### Once you're on the next page, how will you click the "SEARCH LICENSE RECORDS" link or follow its URL?

In [55]:
driver.find_element_by_link_text('SEARCH LICENSE RECORDS').click()

### Now you're on the form page. How do you pick "Statewide" for the jurisdiction dropdown?

In [56]:
Select(driver.find_element_by_name('slcJurisdiction')).select_by_visible_text('Statewide')

### How do you type "vap%" into the Trade Name field?

In [57]:
driver.find_element_by_id('txtTradeName').send_keys("vap%")

### How do you click the submit button or submit the form?

In [58]:
driver.find_element_by_tag_name('form').submit()

### How can you find and click the 'Next' button on the search results page?

In [25]:
driver.find_element_by_link_text('Next »').click()

# Okay, let's scrape!

### Use Selenium to search for vape shops statewide, and then click through until it reaches the last page.

Don't scrape yet!

In [59]:
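from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        driver.find_element_by_link_text('Next »').click()
    except NoSuchElementException:
        # no "Next »" link left: we are on the last page
        break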

### Use Selenium to scrape the first page of search results for vape shops statewide (well, we'll try).

- *TIP: You can find elements/text using Selenium, or use BeautifulSoup with `doc = BeautifulSoup(driver.page_source, "html.parser")`*

I've included some BeautifulSoup code that might be helpful. If you use it, **ask me how I made it.** It's important.

In [60]:
doc = BeautifulSoup(driver.page_source, "html.parser")

In [66]:
# IF YOU ARE USING BEAUTIFULSOUP, HERE IS SOME SAMPLE CODE
business_headers = doc.find_all('tr',class_='searchfieldtitle')

<tr class="searchfieldtitle">
<td class="searchlistnumber">1.</td>
<td class="searchlistitem"><span class="copybold">VAPE IT STORE II</span></td>
<td><a href="pbLicenseDetail.jsp?owi=6rq%2BeY63IN0%3D"><img alt="Click for Detail of VAPE IT STORE II" src="images/link_click-detail.gif"/></a></td>
</tr>
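
To see how the sample code maps onto this markup, the row can be parsed on its own. This is a small sketch for experimenting, not the scraper itself: `sample_row` is simply the HTML shown above pasted into a string (wrapped in a `<table>` tag for tidiness), and it assumes `bs4` is installed.

```python
from bs4 import BeautifulSoup

# The header row shown above, pasted in as a string for experimentation
sample_row = """
<table>
<tr class="searchfieldtitle">
<td class="searchlistnumber">1.</td>
<td class="searchlistitem"><span class="copybold">VAPE IT STORE II</span></td>
<td><a href="pbLicenseDetail.jsp?owi=6rq%2BeY63IN0%3D"><img alt="Click for Detail of VAPE IT STORE II" src="images/link_click-detail.gif"/></a></td>
</tr>
</table>
"""

row_doc = BeautifulSoup(sample_row, "html.parser")
header = row_doc.find("tr", class_="searchfieldtitle")
cells = header.find_all("td")

name = cells[1].span.text.strip()        # second cell holds the trade name
anchor = cells[2].a                      # third cell holds the detail link, if any
link = anchor["href"] if anchor else ""  # fall back to "" when there is no link

print(name)
print(link)
```

Checking for the `<a>` tag before reading `href` matters because some results have no detail link at all.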

In [80]:
#SCRAPING ONE PAGE

businesses = []

for header in business_headers:
    
    current = {}
    
    name = header.find_all('td')[1].span.text.strip()
    current["Name"] = name
    print ("Name:", name)
    
    link = ""
    if header.find_all('td')[2].a:
        link = header.find_all('td')[2].a["href"]
    current["Link"] = link
    print ("Link:", link)
    
    rows = header.find_next_siblings('tr')
    
    company = rows[0].find_all('td')[1].text.strip()
    current["Company"] = company
    print ("Company:", company)
    
    adr_1 = rows[1].find_all('td')[1].text.strip()
    adr_2 = rows[2].find_all('td')[1].text.strip()
    current["Adressline_1"] = adr_1
    current["Adressline_2"] = adr_2
    print ("Adress:", adr_1, "|", adr_2)
    
    county = rows[3].find_all('td')[1].text.strip()
    current["County"] = county
    print ("County:", county)
    
    lic_status = rows[0].find_all('td')[2].span.next_sibling.strip()
    current["Lic. Status"] = lic_status
    print ("Lic. Status:", lic_status)
    
    license = rows[1].find_all('td')[2].text
    if license != "":
        license = rows[1].find_all('td')[2].span.next_sibling.strip()
    current["License"] = license    
    print ("License:", license)
    
    issued = rows[2].find_all('td')[2].text
    if issued != "":
        issued = rows[2].find_all('td')[2].span.next_sibling.strip()
    current["Issued"] = issued
    print ("Issued:", issued)
    
    businesses.append(current)
    
    print("----")

Name: VAPE IT STORE II
Link: pbLicenseDetail.jsp?owi=6rq%2BeY63IN0%3D
Company: AMIN NARGIS
Adress: 1015 S SALISBURY BLVD | SALISBURY, MD 21801
County: Wicomico County
Lic. Status: Issued
License: 22173808
Issued: 4/27/2017
----
Name: VAPE IT STORE I
Link: pbLicenseDetail.jsp?owi=Z2yBcwi5H0A%3D
Company: AMIN NARGIS
Adress: 1724 N SALISBURY BLVD UNIT 2 | SALISBURY, MD 21801
County: Wicomico County
Lic. Status: Issued
License: 22173807
Issued: 4/27/2017
----
Name: VAPEPAD THE
Link: pbLicenseDetail.jsp?owi=aoxGKTEKPq4%3D
Company: ANJ DISTRIBUTIONS LLC
Adress: 2299 JOHNS HOPKINS ROAD | GAMBRILLS, MD 21054
County: Anne Arundel County
Lic. Status: Issued
License: 02104436
Issued: 4/05/2017
----
Name: VAPE FROG
Link: pbLicenseDetail.jsp?owi=cAocC89gtTI%3D
Company: COX TRADING COMPANY L L C
Adress: 110 S. PINEY RD | CHESTER, MD 21619
County: Queen Anne's County
Lic. Status: Issued
License: 17165957
Issued: 5/31/2017
----
Name: VAPE FROG
Link: 
Company: COX TRADING LLC
Adress: 346 RITCHIE HIGHWA
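
The `Link` values are relative URLs, so they need a base before they can be opened for step 7 (the detail page). Here is a minimal sketch using the standard library's `urljoin`. It assumes the detail pages sit under the same `/license/` path as the search page, which this notebook does not confirm, and `detail_url` and `SEARCH_PAGE` are hypothetical names.

```python
from urllib.parse import urljoin

# Assumption: detail pages resolve relative to the public search page
SEARCH_PAGE = "https://jportal.mdcourts.gov/license/pbPublicSearch.jsp"


def detail_url(link):
    """Return an absolute URL for a scraped Link, or "" when there is no link."""
    if not link:
        return ""  # some results were scraped with an empty Link
    return urljoin(SEARCH_PAGE, link)


print(detail_url("pbLicenseDetail.jsp?owi=6rq%2BeY63IN0%3D"))
```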

### Save these into `vape-results.csv`

In [87]:
df = pd.DataFrame(businesses, columns=['Name', 'Link', 'License', 'Lic. Status', 'Issued', 'Company', 'Adressline_1', 'Adressline_2', 'County'])

In [89]:
df.to_csv("vape-results.csv", index=False)

### Open `vape-results.csv` to make sure there aren't any extra weird columns

In [90]:
df_in = pd.read_csv("vape-results.csv")

In [91]:
df_in #No extra columns, but note that License was read back as a number

| | Name | Link | License | Lic. Status | Issued | Company | Adressline_1 | Adressline_2 | County |
|---|---|---|---|---|---|---|---|---|---|
| 0 | VAPE IT STORE II | pbLicenseDetail.jsp?owi=6rq%2BeY63IN0%3D | 22173808.0 | Issued | 4/27/2017 | AMIN NARGIS | 1015 S SALISBURY BLVD | SALISBURY, MD 21801 | Wicomico County |
| 1 | VAPE IT STORE I | pbLicenseDetail.jsp?owi=Z2yBcwi5H0A%3D | 22173807.0 | Issued | 4/27/2017 | AMIN NARGIS | 1724 N SALISBURY BLVD UNIT 2 | SALISBURY, MD 21801 | Wicomico County |
| 2 | VAPEPAD THE | pbLicenseDetail.jsp?owi=aoxGKTEKPq4%3D | 2104436.0 | Issued | 4/05/2017 | ANJ DISTRIBUTIONS LLC | 2299 JOHNS HOPKINS ROAD | GAMBRILLS, MD 21054 | Anne Arundel County |
| 3 | VAPE FROG | pbLicenseDetail.jsp?owi=cAocC89gtTI%3D | 17165957.0 | Issued | 5/31/2017 | COX TRADING COMPANY L L C | 110 S. PINEY RD | CHESTER, MD 21619 | Queen Anne's County |
| 4 | VAPE FROG | | | Pending | | COX TRADING LLC | 346 RITCHIE HIGHWAY | SEVERNA PARK, MD 21146 | Anne Arundel County |
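
One thing the round trip through the CSV does change: `read_csv` guesses a numeric type for `License`, which is why `02104436` comes back as `2104436.0`. Below is a hedged sketch of one way to keep license numbers as text. It uses a small in-memory CSV (`sample_csv`, made up from two of the rows above and trimmed to three columns) instead of the real file.

```python
import io

import pandas as pd

# Two rows borrowed from the results above, trimmed to three columns
sample_csv = io.StringIO(
    "Name,License,Issued\n"
    "VAPEPAD THE,02104436,4/05/2017\n"
    "VAPE FROG,,\n"
)

# dtype=str for License stops pandas from converting it to a float,
# so the leading zero survives
df_text = pd.read_csv(sample_csv, dtype={"License": str})
print(df_text["License"].tolist())
```

The same `dtype={"License": str}` argument should work when reading the saved file by name.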


## Use Selenium to scrape ALL pages of results, save the results into `vape-results-all.csv`.

In [92]:
#bring browser back to the 1st page:
driver.get(url)

In [93]:
driver.find_element_by_id("checkbox").click()
driver.find_element_by_xpath('/html/body/table/tbody/tr[7]/td/form/div/input[2]').click()

In [94]:
driver.find_element_by_link_text('SEARCH LICENSE RECORDS').click()

In [95]:
Select(driver.find_element_by_name('slcJurisdiction')).select_by_visible_text('Statewide')
driver.find_element_by_id('txtTradeName').send_keys("vap%")
driver.find_element_by_tag_name('form').submit()

In [96]:
#set up the main list container that we will scrape into
businesses = []

In [98]:
#define a function we will use to scrape data
def scrape():
    
    #get page content into BeautifulSoup
    doc = BeautifulSoup(driver.page_source, "html.parser")
    
    #Create array of data entries
    business_headers = doc.find_all('tr',class_='searchfieldtitle')
    
    #Get the data fields and store in a dictionary
    for header in business_headers:

        current = {}

        name = header.find_all('td')[1].span.text.strip()
        current["Name"] = name

        link = ""
        if header.find_all('td')[2].a:
            link = header.find_all('td')[2].a["href"]
        current["Link"] = link

        rows = header.find_next_siblings('tr')

        company = rows[0].find_all('td')[1].text.strip()
        current["Company"] = company

        adr_1 = rows[1].find_all('td')[1].text.strip()
        adr_2 = rows[2].find_all('td')[1].text.strip()
        current["Adressline_1"] = adr_1
        current["Adressline_2"] = adr_2

        county = rows[3].find_all('td')[1].text.strip()
        current["County"] = county

        lic_status = rows[0].find_all('td')[2].span.next_sibling.strip()
        current["Lic. Status"] = lic_status

        license = rows[1].find_all('td')[2].text
        if license != "":
            license = rows[1].find_all('td')[2].span.next_sibling.strip()
        current["License"] = license    

        issued = rows[2].find_all('td')[2].text
        if issued != "":
            issued = rows[2].find_all('td')[2].span.next_sibling.strip()
        current["Issued"] = issued
        
        businesses.append(current)

In [99]:
#set up the loop to bring browser through all the pages
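from selenium.common.exceptions import NoSuchElementException

while True:
    scrape()  #kept outside the try so scraping errors are not silently swallowed
    try:
        driver.find_element_by_link_text('Next »').click()
    except NoSuchElementException:
        #no "Next »" link left: this was the last page
        break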
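
The loop above has no upper bound and no pause between requests. As a sketch only, the same scrape-then-click pattern can be written as a small function with both. `crawl_pages`, `scrape_page`, `go_to_next`, `max_pages`, and `pause_seconds` are hypothetical names, and the two callables stand in for `scrape()` and the "Next »" click.

```python
import time


def crawl_pages(scrape_page, go_to_next, max_pages=500, pause_seconds=1.0):
    """Scrape the current page, then advance, until there is no next page.

    scrape_page: callable that scrapes whatever page the browser is on.
    go_to_next:  callable that clicks "Next" and returns False when it can't.
    Returns the number of pages scraped.
    """
    pages = 0
    while pages < max_pages:
        scrape_page()
        pages += 1
        if not go_to_next():
            break  # no further page to visit
        time.sleep(pause_seconds)  # be gentle with the server between pages
    return pages
```

With Selenium, `go_to_next` could wrap the `find_element_by_link_text('Next »').click()` call in a `try`/`except` that returns `False` when the link is missing, and `scrape_page` could simply be `scrape`.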

In [100]:
businesses #looks good!

[{'Adressline_1': '1015 S SALISBURY BLVD',
  'Adressline_2': 'SALISBURY, MD 21801',
  'Company': 'AMIN NARGIS',
  'County': 'Wicomico County',
  'Issued': '4/27/2017',
  'Lic. Status': 'Issued',
  'License': '22173808',
  'Link': 'pbLicenseDetail.jsp?owi=6rq%2BeY63IN0%3D',
  'Name': 'VAPE IT STORE II'},
 {'Adressline_1': '1724 N SALISBURY BLVD UNIT 2',
  'Adressline_2': 'SALISBURY, MD 21801',
  'Company': 'AMIN NARGIS',
  'County': 'Wicomico County',
  'Issued': '4/27/2017',
  'Lic. Status': 'Issued',
  'License': '22173807',
  'Link': 'pbLicenseDetail.jsp?owi=Z2yBcwi5H0A%3D',
  'Name': 'VAPE IT STORE I'},
 {'Adressline_1': '2299 JOHNS HOPKINS ROAD',
  'Adressline_2': 'GAMBRILLS, MD 21054',
  'Company': 'ANJ DISTRIBUTIONS LLC',
  'County': 'Anne Arundel County',
  'Issued': '4/05/2017',
  'Lic. Status': 'Issued',
  'License': '02104436',
  'Link': 'pbLicenseDetail.jsp?owi=aoxGKTEKPq4%3D',
  'Name': 'VAPEPAD THE'},
 {'Adressline_1': '110 S. PINEY RD',
  'Adressline_2': 'CHESTER, MD 2161

In [101]:
#store the list of dictionaries in a DataFrame
df = pd.DataFrame(businesses, columns=['Name', 'Link', 'License', 'Lic. Status', 'Issued', 'Company', 'Adressline_1', 'Adressline_2', 'County'])

In [102]:
#write the DataFrame into a CSV file
df.to_csv("vape-results-all.csv", index=False)

In [103]:
#Open the CSV file to check
df_in = pd.read_csv("vape-results-all.csv")
df_in #looks good!

| | Name | Link | License | Lic. Status | Issued | Company | Adressline_1 | Adressline_2 | County |
|---|---|---|---|---|---|---|---|---|---|
| 0 | VAPE IT STORE II | pbLicenseDetail.jsp?owi=6rq%2BeY63IN0%3D | 22173808.0 | Issued | 4/27/2017 | AMIN NARGIS | 1015 S SALISBURY BLVD | SALISBURY, MD 21801 | Wicomico County |
| 1 | VAPE IT STORE I | pbLicenseDetail.jsp?owi=Z2yBcwi5H0A%3D | 22173807.0 | Issued | 4/27/2017 | AMIN NARGIS | 1724 N SALISBURY BLVD UNIT 2 | SALISBURY, MD 21801 | Wicomico County |
| 2 | VAPEPAD THE | pbLicenseDetail.jsp?owi=aoxGKTEKPq4%3D | 2104436.0 | Issued | 4/05/2017 | ANJ DISTRIBUTIONS LLC | 2299 JOHNS HOPKINS ROAD | GAMBRILLS, MD 21054 | Anne Arundel County |
| 3 | VAPE FROG | pbLicenseDetail.jsp?owi=cAocC89gtTI%3D | 17165957.0 | Issued | 5/31/2017 | COX TRADING COMPANY L L C | 110 S. PINEY RD | CHESTER, MD 21619 | Queen Anne's County |
| 4 | VAPE FROG | | | Pending | | COX TRADING LLC | 346 RITCHIE HIGHWAY | SEVERNA PARK, MD 21146 | Anne Arundel County |
| 5 | VAPE LOFT (THE) | pbLicenseDetail.jsp?owi=OkOPFZ5CCNo%3D | 2102408.0 | Issued | 4/13/2017 | DISBROW II EMERSON HARRINGTON | 185 MITCHELLS CHANCE RD | EDGEWATER, MD 21037 | Anne Arundel County |
| 6 | VAPE N CIGAR | pbLicenseDetail.jsp?owi=CX8%2BVN1zzbs%3D | 13141786.0 | Issued | 5/19/2017 | DISCOUNT TOBACCO ESSEX LLC | 7104 MINSTREL UNIT #7 | COLUMBIA, MD 21045 | Howard County |
| 7 | VAPE DOJO | pbLicenseDetail.jsp?owi=GOHsZYgZMmw%3D | 6126253.0 | Issued | 4/21/2017 | FAIRGROUND VILLAGE LLC | 330 ONE FORTY VILLAGE ROAD | WESTMINSTER, MD 21157 | Carroll County |
| 8 | VAPE HAVEN | | | Pending | | GRIMM JENNIFER | 29890 THREE NOTCH ROAD | CHARLOTTE HALL, MD 20622 | St. Mary's County |
| 9 | VAPE BIRD | pbLicenseDetail.jsp?owi=JaBhKUOOJMM%3D | 17166688.0 | Issued | 4/13/2017 | HUTCH VAPES LLC | 356 ROMANCOKE ROAD | STEVENSVILLE, MD 21666 | Queen Anne's County |
