This notebook builds a database of cosmetic dentists in the US from an online [directory](https://cosmeticdentistdir.com/index.html). **Selenium** automates loading the site and opening each link, and **BeautifulSoup** extracts the details that fill the dataset columns.

The main difficulty of this web scraping project is that each database entry sits about two sublinks deep in the main directory, in an HTML structure that is not tagged or uniformly labelled.

## Importing Relevant Libraries

In [13]:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import re

## Setting Up Selenium and BeautifulSoup

Selenium drives Google Chrome to open the main URL, and BeautifulSoup builds a parse tree from the page's HTML so that data can be extracted from it.

In [14]:
url = 'https://cosmeticdentistdir.com/index.html'

# content = requests.get(url)
# content.status_code
# content.text
# soup = BeautifulSoup(content.text, 'html.parser')

In [15]:
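from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
# driver = webdriver.Chrome(ChromeDriverManager())
# Selenium 4 expects the driver path wrapped in a Service object;
# passing the path positionally is deprecated
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')  # name the parser explicitly
# soup.prettify()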


## Setting Up Selenium and Beautiful Soup for Sublinks

The cosmetic dentists are listed on sublinks of the main URL. In the next code blocks, Selenium opens each sublink in Google Chrome, and BeautifulSoup builds a parse tree from the page's HTML and extracts the data.

In [16]:
### extract all sublinks from the main URL and store them in a text file
# open the file once, in write mode, so re-running the cell does not append duplicates
with open("h_links.txt", "w") as linkfile:
    for a_href in soup.find_all('a', href=True):
        linkfile.write(a_href["href"] + "\n")

In [17]:
# the context manager closes the file once the lines are read
with open('h_links.txt') as hlinks:
    hyplinks = hlinks.readlines()

In [18]:
urlpart = 'https://cosmeticdentistdir.com/'
full_link = urlpart + hyplinks[4].strip('\n')  ### full URL for one example sublink
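
Plain concatenation assumes every href is a relative path with no leading slash. As a minimal sketch of a more forgiving alternative, `urllib.parse.urljoin` resolves relative, root-relative and absolute hrefs against the base URL. The sample hrefs and the helper name `build_full_link` below are hypothetical, made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://cosmeticdentistdir.com/"

# Hypothetical href values of the kind read from h_links.txt
# (each line still carries its trailing newline).
sample_hrefs = [
    "alabama.html\n",
    "/alaska.html\n",
    "https://example.com/other.html\n",
]


def build_full_link(href, base=BASE_URL):
    """Resolve an href against the directory's base URL."""
    return urljoin(base, href.strip())


full_links = [build_full_link(h) for h in sample_hrefs]
for link in full_links:
    print(link)
```

An absolute href is returned unchanged, which makes it easy to spot links that point outside the directory.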

In [19]:
driver.get(full_link)
contenteg = driver.page_source
soupeg = BeautifulSoup(contenteg, 'html.parser')
# soupeg

## Using regex to extract State, Phone Number and Location
The code parses the text displayed at each sublink and extracts the state, phone number and location of each cosmetic dentist. String matching is a good fit here because these fields are not tagged in the HTML structure of the webpage.

This is repeated for every sublink, and the extracted information is collected into a dataset called **dataframe**.
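
To illustrate the string-matching idea in isolation, here is a minimal sketch that pulls the text following the `Phone : ` and `Location : ` labels out of a paragraph using a lookbehind and a lookahead. The layout of `sample_text` is an assumption about how a listing paragraph reads, and the helper `extract_field` is hypothetical.

```python
import re

# A lookbehind anchors on the label; a lookahead stops at the end of the line.
PHONE_RE = re.compile(r"(?<=Phone : )(.*)(?=\n)")
LOCATION_RE = re.compile(r"(?<=Location : )(.*)(?=\n)")


def extract_field(regex, text, default="n/a"):
    """Return the first match of regex in text, or a default when absent."""
    match = regex.search(text)
    return match.group(0).strip() if match else default


# Assumed layout of one listing paragraph (illustrative only).
sample_text = "Phone : 256-536-6860\nLocation : Huntsville, AL\n"

phone = extract_field(PHONE_RE, sample_text)
location = extract_field(LOCATION_RE, sample_text)
# A paragraph without a phone label falls back to the default.
missing = extract_field(PHONE_RE, "Location : Huntsville, AL\n")

print(phone, "|", location, "|", missing)
```

Both patterns require a newline after the value, so a label on the last line of a paragraph with no trailing newline would not match.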

In [20]:
pattern = '(?<=<h3><strong>)(.*)(?=</strong></h3>)'
canadian_states = ['Alberta', 'British Columbia', 'Manitoba', 'New Brunswick', 'Newfoundland', 'Newfoundland and Labrador', 'Northwest Territories', 'Nova Scotia', 'Nunavut', 'Ontario', 'Prince Edward Island', 'Quebec', 'Saskatchewan', 'Yukon']
zeebs = {}

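# map each sublink href to the state heading it appears under
for a_href in soup.find_all('a', href=True):
    hrefs = a_href["href"]
    # the nearest preceding <h3> holds the state (or province) name
    state = a_href.find_previous('h3')
    state = re.findall(pattern, str(state))
    if len(state) > 0:
        state = state[0]
        # keep US entries only by skipping Canadian provinces and territories
        if state not in canadian_states:
            zeebs[hrefs] = state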

# zeebs

In [21]:
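line_format = '{}, {} \n'  # renamed from `format` to avoid shadowing the built-in
# write mode, so re-running the cell does not append duplicate lines
with open("h_links_states.txt", "w") as linkfile:
    for link in zeebs:
        linkfile.write(line_format.format(link, zeebs[link]))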
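
The hand-rolled `link, state` lines are later split on commas, which would break if an href ever contained a comma. As a hedged alternative, the sketch below round-trips the same kind of mapping through the standard `csv` module, which quotes such values automatically. The mapping `link_to_state` is hypothetical sample data, and an in-memory buffer stands in for the file.

```python
import csv
import io

# Hypothetical link-to-state mapping in the shape of `zeebs`;
# the second key contains a comma on purpose.
link_to_state = {"alabama.html": "Alabama", "new,york.html": "New York"}

# An in-memory buffer stands in for a file opened with newline="".
buffer = io.StringIO()
writer = csv.writer(buffer)
for link, state in link_to_state.items():
    writer.writerow([link, state])

# Read the rows back; csv.reader undoes the quoting.
buffer.seek(0)
round_trip = {link: state for link, state in csv.reader(buffer)}
print(round_trip)
```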

In [22]:
with open('h_links_states.txt') as hlinks:
    hyplinks = hlinks.readlines()
main_dict = {}

urlpart = 'https://cosmeticdentistdir.com/'

pattern3 = r'(?<=Phone : )(.*)(?=\n)'
pattern4 = r'(?<=Location : )(.*)(?=\n)'
for line in hyplinks:
    # each line has the form "<href>, <state>"
    link, state = line.split(',')[:2]
    state = state.strip()  # drop the padding spaces and the trailing newline
    full_link = urlpart + link
    driver.get(full_link)
    contenteg = driver.page_source
    soupeg = BeautifulSoup(contenteg, 'html.parser')
    cos_dentists = []
    locations = []
    phone_numbers = []

    # link texts are the dentist names; skip the first 4 and last 6 anchors on the page
    for href in soupeg.find_all('a', href=True):
        cos_dentists.append(href.contents[0])
    cos_dentists = cos_dentists[4:-6]

    if soupeg.find(id="content"):
        for p in soupeg.find(id="content").find_all("p"):
            text = p.get_text()
            phnbrs = re.search(pattern3, text)
            if phnbrs:
                phone_numbers.append(phnbrs[0].strip())
            else:
                phone_numbers.append("n/a")
            lctns = re.search(pattern4, text)
            if lctns:
                locations.append(lctns[0].strip())
            else:
                locations.append("n/a")

    # separate index name, so the outer loop variable is not reused
    for i in range(len(cos_dentists)):
        main_dict[cos_dentists[i]] = [state, phone_numbers[i], locations[i]]

dataframe = pd.DataFrame.from_dict(main_dict, orient='index', columns=['state', 'phone', 'location'])
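
Because `main_dict` is keyed by the dentist's name, two entries with the same name overwrite each other, and indexing the three lists in parallel assumes they have equal length. The following is a minimal sketch of one alternative, not the notebook's method: collect one record per entry and pad missing fields. The helper `build_records` and the sample values are hypothetical.

```python
from itertools import zip_longest


def build_records(state, names, phones, locations, fill="n/a"):
    """Build one dict per entry, padding any shorter list with a fill value."""
    records = []
    for name, phone, location in zip_longest(
        names, phones, locations, fillvalue=fill
    ):
        records.append(
            {"name": name, "state": state, "phone": phone, "location": location}
        )
    return records


# Hypothetical page: two entries share a name and one phone number is missing.
names = ["Bryant Dental", "Bryant Dental"]
phones = ["256-217-4121"]
locations = ["Huntsville, AL", "Madison, AL"]

records = build_records("Alabama", names, phones, locations)
for record in records:
    print(record)
```

Passing such a list to `pd.DataFrame(records)` yields one row per record, so entries with the same name are kept. Padding hides misalignment between the lists rather than fixing it, so a length check before building the records may still be worthwhile.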

In [23]:
dataframe.head(5)

| | state | phone | location |
|---|---|---|---|
| Victoria L. Vest, D.M.D. | Alabama | 256-536-6860 | Huntsville, AL |
| Yaritza Wright, DMD | Alabama | 256-533-1611 | Huntsville, AL |
| Bryant Dental | Alabama | 256-217-4121 | Huntsville, AL |
| Doug Booth, DDS | Alabama | 256.533.4770 | Huntsville, AL |
| Steve W. Murphree, DMD | Alabama | 256-852-9878 | Huntsville, AL |


## Tentative Number of Cosmetic Dentists

In [24]:
print('Number of cosmetic dentists in the US: ', len(dataframe))

Number of cosmetic dentists in the US:  1060
