# Texas Cosmetologist Violations

Texas has a system for [searching for license violations](https://www.tdlr.texas.gov/cimsfo/fosearch.asp). You're going to search for cosmetologists!

## Setup: Import what you'll need to scrape the page

We'll be using Selenium for this, *not* BeautifulSoup and requests.

In [1]:
from selenium import webdriver

In [2]:
driver = webdriver.Chrome()

In [3]:
driver.get("https://www.tdlr.texas.gov/cimsfo/fosearch.asp")

In [4]:
driver.page_source

'<html lang="en" class=" supports csstransforms3d"><head>\n    <title>Administrative Orders - Search</title>\n    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n\t<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">\n\t<link rel="shortcut icon" href="/images/favicon.png" type="image/x-icon">\n\n    <!-- Google Tag Manager -->\n    <script type="text/javascript" async="" src="https://www.google-analytics.com/analytics.js"></script><script async="" src="https://www.googletagmanager.com/gtm.js?id=GTM-MPP5WF5"></script><script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({\'gtm.start\':\n    new Date().getTime(),event:\'gtm.js\'});var f=d.getElementsByTagName(s)[0],\n    j=d.createElement(s),dl=l!=\'dataLayer\'?\'&l=\'+l:\'\';j.async=true;j.src=\n    \'https://www.googletagmanager.com/gtm.js?id=\'+i+dl;f.parentNode.insertBefore(j,f);\n    })(window,document,\'script\',\'dataLayer\',\'GTM-MPP5WF5\');</script>\n    <!-- End Google Tag Ma

## Starting your search

Starting from [here](https://www.tdlr.texas.gov/cimsfo/fosearch.asp), search for **cosmetologist violations** for people with the last name **Nguyen**.

In [5]:
driver.find_element_by_xpath("/html/body/div[1]/div/div[2]/div/div/section/div/div/table/tbody/tr/td/form/table/tbody/tr[3]/td/select/option[10]").click()

In [6]:
textbox = driver.find_element_by_xpath('//*[@id="pht_lnm"]')

In [7]:
textbox.send_keys("NGUYEN")

In [8]:
submit_button = driver.find_element_by_xpath('//*[@id="dat-menu"]/div/div[2]/div[1]/div/section/div/div/table/tbody/tr/td/form/table/tbody/tr[18]/td/input[1]')

In [9]:
submit_button.click()

## Scraping

Once you are on the results page, work through the steps below.

### Loop through each result and print the entire row

Okay wait, that's a heck of a lot. Use `[:10]` to only do the first ten (`listname[:10]` gives you the first ten).
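As a quick illustration of the slice (the list here is made up for the example), `[:10]` returns at most ten items and does not raise an error when the list is shorter:

```python
# Hypothetical stand-in for a long list of results
results = [f"row {n}" for n in range(157)]

first_ten = results[:10]
print(len(first_ten))

# Slicing a short list is safe: you simply get everything it has
short = ["row 0", "row 1", "row 2"]
print(short[:10])
```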

In [10]:
import pandas as pd



In [11]:
driver.page_source

'<html lang="en" class=" supports csstransforms3d"><head>\n\t\t<!-- Meta Tags -->\n\t\t<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n\t\t<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">\n\t\t<!-- Favicon -->\n\t\t<link rel="shortcut icon" href="/images/favicon.png" type="image/x-icon">\n\t\t<!-- Stylesheets -->\n\t\t<link type="text/css" rel="stylesheet" href="/css/reset.css">\n\t\t<link type="text/css" rel="stylesheet" href="/css/font-awesome.min.css">\n\t\t<link type="text/css" rel="stylesheet" href="/css/animate.css">\n\t\t<link type="text/css" rel="stylesheet" href="/css/main-stylesheet.css">\n\t\t<link type="text/css" rel="stylesheet" href="/css/lightbox.css">\n\t\t<link type="text/css" rel="stylesheet" href="/css/shortcodes.css">\n\t\t<link type="text/css" rel="stylesheet" href="/css/custom-fonts.css">\n\t\t<link type="text/css" rel="stylesheet" href="/css/custom-colors.css">\n\t\t<link type="text/css" rel="stylesheet" h

In [12]:
tables = pd.read_html(driver.page_source)

In [13]:
nguyen = tables[0]

In [14]:
nguyen[:10]
    

| | Name and Location | Order | Basis for Order |
|---|---|---|---|
| 0 | NGUYEN, MIMI PHAM City: KATY County: HARRIS Zi... | Date: 11/12/2020Respondent is assessed an admi... | Respondent failed properly clean and sanitize ... |
| 1 | NGUYEN, HA City: ARLINGTON County: TARRANT Zip... | Date: 11/12/2020Respondent is assessed an admi... | Respondent failed to clean and sanitize four (... |
| 2 | NGUYEN, THAO HONG City: SAN ANTONIO County: BE... | Date: 11/12/2020Respondent is assessed an admi... | Respondent failed to clean, disinfect, and ste... |
| 3 | NGUYEN, CINDY City: CORPUS CHRISTI County: NUE... | Date: 10/29/2020Respondent is assessed an admi... | Respondent failed to clean and disinfect all w... |
| 4 | NGUYEN, CHAU KHANH LINH City: MONTGOMERY Count... | Date: 10/26/2020The Respondent's Cosmetology M... | Respondent engaged in fraud or deceit in obtai... |
| 5 | NGUYEN, TRANG T City: SEGUIN County: GUADALUPE... | Date: 10/26/2020Respondent is assessed an admi... | Respondent failed properly clean and sanitize ... |
| 6 | NGUYEN, DUNG MINH City: HOUSTON County: HARRIS... | Date: 10/19/2020Respondent is assessed an admi... | Respondent failed properly clean and sanitize ... |
| 7 | NGUYEN, YEN NHI THI City: AUSTIN County: TRAVI... | Date: 10/14/2020Respondent is assessed an admi... | Respondent failed to clean and disinfect all w... |
| 8 | NGUYEN, JOHNNY DAT City: MISSION County: HIDAL... | Date: 10/14/2020Respondent is assessed an admi... | Respondent failed to clean and sanitize whirlp... |
| 9 | NGUYEN, KELLY PHUONG N City: CORPUS CHRISTI Co... | Date: 9/29/2020Respondent is assessed an admin... | Respondent failed properly clean and sanitize ... |


### Loop through each result and print each person's name

You'll get an error because the first one doesn't have a name. How do you make that not happen?! If you want to ignore an error, you use code like this:

```python
try:
   # try to do something
except:
   # Instead of stopping on an error, it'll jump down here instead
   print("It didn't work")
```

It should help you out. If you don't want to print anything, you can type `pass` instead of the `print` statement. Most people use `pass`, but it's also nice to print out debug statements so you know when/where it's running into errors.
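Here is a small sketch of the same idea that runs without a browser. The nested lists are a hypothetical stand-in for the elements found inside each table row; the first one is empty, the way a header row would be. Catching `IndexError` specifically, rather than every error, is a choice made for this example so that unrelated mistakes still show up:

```python
# Hypothetical stand-in for the name/location elements found in each <tr>
rows = [
    [],                                # header row: nothing to pick from
    ["NGUYEN, MIMI PHAM", "KATY"],
    ["NGUYEN, HA", "ARLINGTON"],
]

names = []
for cells in rows:
    try:
        names.append(cells[0])
    except IndexError:
        # Asking for item 0 of an empty list fails; record None and keep going
        names.append(None)

print(names)
```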

**Why doesn't the first one have a name?**

In [15]:
for names in tables:
    try:
        print(names['Name and Location'])
    except:
        pass

0      NGUYEN, MIMI PHAM City: KATY County: HARRIS Zi...
1      NGUYEN, HA City: ARLINGTON County: TARRANT Zip...
2      NGUYEN, THAO HONG City: SAN ANTONIO County: BE...
3      NGUYEN, CINDY City: CORPUS CHRISTI County: NUE...
4      NGUYEN, CHAU KHANH LINH City: MONTGOMERY Count...
                             ...                        
152    NGUYEN, SHARON City: BASTROP County: BASTROP Z...
153    NGUYEN, BINH THANH City: LAREDO County: WEBB Z...
154    NGUYEN, SAMANTHA TRAN City: MCKINNEY County: C...
155    NGUYEN, THU LE City: SAN ANTONIO County: BEXAR...
156    NGUYEN, KIM NGAN THI City: HOUSTON County: HAR...
Name: Name and Location, Length: 157, dtype: object


In [16]:
names = driver.find_elements_by_tag_name("tr")
for name in names:
    try:
        full_name = name.find_elements_by_class_name("results_text")[0]
        print(full_name.text)
    except:
        print("None")

    
    
#     try:
#         print(names)
#     except:
#         pass

None
NGUYEN, MIMI PHAM
NGUYEN, HA
NGUYEN, THAO HONG
NGUYEN, CINDY
NGUYEN, CHAU KHANH LINH
NGUYEN, TRANG T
NGUYEN, DUNG MINH
NGUYEN, YEN NHI THI
NGUYEN, JOHNNY DAT
NGUYEN, KELLY PHUONG N
NGUYEN, NGA THU
NGUYEN, IVY
NGUYEN, DIEMTRINH T
NGUYEN, HUAN CAO
NGUYEN, THOA KIM
NGUYEN, TONY
NGUYEN, HIEN
NGUYEN, NGOC TRAM
NGUYEN, TRAN NAM
NGUYEN, PHILLIP
NGUYEN, THUY T
NGUYEN, TRACY
NGUYEN, LE PHUC
NGUYEN, HAI MINH
NGUYEN, TUYET
NGUYEN, BA VAN
NGUYEN, LAN THANH
NGUYEN, PHUOC BA
NGUYEN, LINH THUY KIEU
NGUYEN, TONY MINH
NGUYEN, THANH
NGUYEN, HIEN THI MINH
NGUYEN, VAN NGOC THAO
NGUYEN, TAM THANH
NGUYEN, ANDY HUU
NGUYEN, CUONG HULL
NGUYEN, NANCY HOA
NGUYEN, THOA THI KIM
NGUYEN, TRANG
NGUYEN, THONG VAN
NGUYEN, TUAN
NGUYEN, LYNDA
NGUYEN, CATHY H
NGUYEN, PAMELA DAN
NGUYEN, SON QUOC
NGUYEN, KIM
NGUYEN, THAI VAN
NGUYEN, HANH THI
NGUYEN, DIEP THI NGOC
NGUYEN, CASEY
NGUYEN, ANTHONY VAN
NGUYEN, CHRISTINA D
NGUYEN, PHUONG T
NGUYEN, KENNY
NGUYEN, THAO (MINH THAO
NGUYEN, KENNY
NGUYEN, BA VAN
NGUYEN, KHUYEN THI
N

In [17]:
# for names in name:
#     print(names.find_element_by_class_name("results_text")[0]

In [18]:
# https://stackoverflow.com/questions/24795198/get-all-child-elements
# print(driver.find_elements_by_css_selector("*")[10].text)


### Loop through each result, printing each violation description ("Basis for order")

> - *Tip: You'll get an error even if you're ALMOST right - which row is causing the problem?*
> - *Tip: You can get the HTML of something by doing `.get_attribute('innerHTML')` - it might help you diagnose your issue.*
> - *Tip: Or I guess you could just skip the one with the problem...*

In [19]:
# for violation in tables:
#     try:
#         print(violation['Basis for Order'])
#     except:
#         pass

In [20]:
violations = driver.find_elements_by_tag_name("tr")
for violation in violations:
    try:
        order_basis = violation.find_elements_by_tag_name("td")[2]
        print(order_basis.text)
    except:
        pass
    

Respondent failed properly clean and sanitize the metal implements used at the Salon; Respondent failed to disinfect tools, implements, and supplies with an EPA-registered disinfectant solution.
Respondent failed to clean and sanitize four (4) whirlpool foot spas as required at the end of each day, constituting two (2) violations; Respondent failed to keep a record of the date and time of four (4) foot spas daily or bi-weekly cleaning and if the foot spas were not used, constituting two (2) violations.
Respondent failed to clean, disinfect, and sterilize manicure and pedicure implements after each use.
Respondent failed to clean and disinfect all wax pots; Respondent failed to properly clean multi-use items prior to each service.
Respondent engaged in fraud or deceit in obtaining a certificate, license, or permit.
Respondent failed properly clean and sanitize the metal implements used at the Salon; Respondent failed to wipe clean and disinfect electrical equipment that cannot be immers

### Loop through each result, printing the complaint number

- TIP: Think about the order of the elements
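Picking the complaint number by position only works when every row has its elements in the same order. A hedged alternative is to pick it by what it looks like. The sketch below assumes a complaint number is always `COS` followed by 11 digits, which matches the complete numbers printed in this notebook but is not a documented rule. `find_complaint` and the second sample row are hypothetical:

```python
import re

# Assumed shape of a complaint number: "COS" plus 11 digits
COMPLAINT_RE = re.compile(r"COS\d{11}")

def find_complaint(cell_texts):
    """Return the first text that looks like a complaint number, else None."""
    for text in cell_texts:
        if COMPLAINT_RE.fullmatch(text.strip()):
            return text.strip()
    return None

regular = ["NGUYEN, MIMI PHAM", "KATY", "HARRIS", "77449", "784210", "COS20190010072"]

# Hypothetical row with one extra element, so position 5 is no longer the complaint
shifted = ["EXAMPLE, PERSON", "EXTRA FIELD", "CITY", "COUNTY", "00000", "000000", "COS20190000000"]

print(find_complaint(regular))
print(shifted[5], "vs", find_complaint(shifted))
```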

In [21]:
complaints = driver.find_elements_by_tag_name("tr")
for complaint in complaints:
    try:
        complaint_number = complaint.find_elements_by_class_name("results_text")[5]
        print(complaint_number.text)
    except:
        print("None")

None
COS20190010072
COS20190016762
COS20200010387
COS20200010502
COS20190008104
COS20200010511
COS20200004202
COS20190004199
COS20200000101
COS20200011664
COS20200010961
COS20200008858
COS20200008859
COS20200009732
COS20200006548
COS20200009605
COS20190016479
COS20190012148
COS20190010318
COS20190014688
COS20190004016
COS20190016499
COS20200006146
COS20190016549
COS20190017217
COS20190016854
COS20190017175
COS20190013017
COS20190016161
COS20200000652
COS20190007906
COS20190003737
COS20190006118
COS20200002642
COS20190009211
COS20190010611
COS20190010665
COS20190001932
COS20190014576
COS20190001897
COS20190009239
COS20190009484
ROUND ROCK
COS20190011923
COS20190005539
COS20190010176
COS20190010397
COS20190011044
COS20190003373
COS20190010299
COS20190013988
COS20190012652
COS20180015395
COS20190010057
COS20190010254
COS20190000183
COS20190010606
COS20190010961
COS20190010635
COS20190014566
COS20190010869
COS20190007921
COS20180009800
COS20180011968
COS20180014900
COS20190008675
COS201900

## Saving the results

### Loop through each result to create a list of dictionaries

Each dictionary must contain:

- Person's name
- Violation description
- Violation number
- License Numbers
- Zip Code
- County
- City

Create a new dictionary for each result (except the header).

> *Tip: If you want to ask for the "next sibling," you can't use `find_next_sibling` in Selenium, you need to use `element.find_element_by_xpath("following-sibling::div")` to find the next div, or `element.find_element_by_xpath("following-sibling::*")` to find the next anything.*

In [31]:
items = driver.find_elements_by_tag_name("tr")

rows = []
for item in items:
    row = {}
    try:
        row['name'] = item.find_elements_by_class_name("results_text")[0].text
    except:
        pass
    try:
        row['violations'] = item.find_elements_by_tag_name("td")[2].text
    except:
        pass
    try:
        row['complaint'] = item.find_elements_by_class_name("results_text")[5].text
    except:
        pass
    try:
        row['license'] = item.find_elements_by_class_name("results_text")[4].text
    except:
        pass
    try:
        row['zipcode'] = item.find_elements_by_class_name("results_text")[3].text
    except:
        pass
    try:
        row['county'] = item.find_elements_by_class_name("results_text")[2].text
    except:
        pass
    try:
        row['city'] = item.find_elements_by_class_name("results_text")[1].text
    except:
        pass
    print(row)
    rows.append(row)
    print('--------')
    
    
    

{}
--------
{'name': 'NGUYEN, MIMI PHAM', 'violations': 'Respondent failed properly clean and sanitize the metal implements used at the Salon; Respondent failed to disinfect tools, implements, and supplies with an EPA-registered disinfectant solution.', 'complaint': 'COS20190010072', 'license': '784210', 'zipcode': '77449', 'county': 'HARRIS', 'city': 'KATY'}
--------
{'name': 'NGUYEN, HA', 'violations': 'Respondent failed to clean and sanitize four (4) whirlpool foot spas as required at the end of each day, constituting two (2) violations; Respondent failed to keep a record of the date and time of four (4) foot spas daily or bi-weekly cleaning and if the foot spas were not used, constituting two (2) violations.', 'complaint': 'COS20190016762', 'license': '764888', 'zipcode': '76017', 'county': 'TARRANT', 'city': 'ARLINGTON'}
--------
{'name': 'NGUYEN, THAO HONG', 'violations': 'Respondent failed to clean, disinfect, and sterilize manicure and pedicure implements after each use.', 'com
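The repeated `try`/`except` blocks in the cell above can be folded into one small helper. This is only a sketch: `text_at` and `row_to_dict` are hypothetical names, and `texts` is assumed to be a plain list holding the `.text` of each `results_text` element in a row, in the same positions the cell above uses:

```python
# Position of each field among a row's "results_text" elements,
# taken from the indexes used in the cell above
FIELDS = {"name": 0, "city": 1, "county": 2, "zipcode": 3, "license": 4, "complaint": 5}

def text_at(texts, index):
    """Return texts[index], or None when the row has too few elements."""
    try:
        return texts[index]
    except IndexError:
        return None

def row_to_dict(texts):
    """Build one dictionary per row, filling missing fields with None."""
    return {field: text_at(texts, position) for field, position in FIELDS.items()}

header = row_to_dict([])  # the header row has no matching elements
first = row_to_dict(["NGUYEN, MIMI PHAM", "KATY", "HARRIS", "77449", "784210", "COS20190010072"])

print(header)
print(first)
```

Unlike the cell above, a row with missing elements ends up with `None` values instead of missing keys, which makes the header row easy to spot and filter out before saving.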

### Save that to a CSV

- Tip: Use `pd.DataFrame` to create a dataframe, and then save it to a CSV.

In [43]:
df = pd.DataFrame(rows)
df.head()

| | name | violations | complaint | license | zipcode | county | city |
|---|---|---|---|---|---|---|---|
| 0 | | | | | | | |
| 1 | NGUYEN, MIMI PHAM | Respondent failed properly clean and sanitize ... | COS20190010072 | 784210 | 77449.0 | HARRIS | KATY |
| 2 | NGUYEN, HA | Respondent failed to clean and sanitize four (... | COS20190016762 | 764888 | 76017.0 | TARRANT | ARLINGTON |
| 3 | NGUYEN, THAO HONG | Respondent failed to clean, disinfect, and ste... | COS20200010387 | 799926, 1753491 | 78238.0 | BEXAR | SAN ANTONIO |
| 4 | NGUYEN, CINDY | Respondent failed to clean and disinfect all w... | COS20200010502 | 806232, 1260359, 1280071 | 78414.0 | NUECES | CORPUS CHRISTI |


In [45]:
df.to_csv("nguyen_licenses1.csv", index=False, header=True)
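The header row produced an empty dictionary, which ends up as an all-blank first record in the saved file. The sketch below shows one way to drop it before saving. It swaps in the standard library's `csv` module instead of pandas so it can run anywhere, writes to an in-memory buffer rather than a real file, and uses a shortened, hypothetical `rows` list built from values printed above:

```python
import csv
import io

rows = [
    {},  # what the header row produced
    {"name": "NGUYEN, MIMI PHAM", "complaint": "COS20190010072", "zipcode": "77449"},
    {"name": "NGUYEN, HA", "complaint": "COS20190016762", "zipcode": "76017"},
]

# Keep only dictionaries that actually hold data
clean_rows = [row for row in rows if row]

buffer = io.StringIO()  # stands in for a file opened with open(..., "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=["name", "complaint", "zipcode"])
writer.writeheader()
writer.writerows(clean_rows)

# Read it back: every value is returned as a string, so zip codes are unchanged
buffer.seek(0)
saved = list(csv.DictReader(buffer))
print(saved[0])
```

The same filter, `[row for row in rows if row]`, can be applied before calling `pd.DataFrame(rows)`.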

### Open the CSV file and examine the first few. Make sure you didn't save an extra weird unnamed column.

In [46]:
df=pd.read_csv("nguyen_licenses1.csv")
df.head()

| | name | violations | complaint | license | zipcode | county | city |
|---|---|---|---|---|---|---|---|
| 0 | | | | | | | |
| 1 | NGUYEN, MIMI PHAM | Respondent failed properly clean and sanitize ... | COS20190010072 | 784210 | 77449.0 | HARRIS | KATY |
| 2 | NGUYEN, HA | Respondent failed to clean and sanitize four (... | COS20190016762 | 764888 | 76017.0 | TARRANT | ARLINGTON |
| 3 | NGUYEN, THAO HONG | Respondent failed to clean, disinfect, and ste... | COS20200010387 | 799926, 1753491 | 78238.0 | BEXAR | SAN ANTONIO |
| 4 | NGUYEN, CINDY | Respondent failed to clean and disinfect all w... | COS20200010502 | 806232, 1260359, 1280071 | 78414.0 | NUECES | CORPUS CHRISTI |


## Let's do this an easier way

Use Selenium and `pd.read_html` to get the table as a dataframe.

In [1]:
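import pandas as pd  # needed in this session: without it the cell fails with NameError: name 'pd' is not defined

# read_html returns a list of dataframes; the results table is the first one.
# This assumes `driver` is still open on the results page.
df2 = pd.read_html(driver.page_source)[0]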
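With `pd.read_html`, each "Name and Location" cell arrives as one run-together string, so the name, city, and county still need to be pulled apart. A minimal sketch with a regular expression follows. The `City:` and `County:` labels are visible in the output earlier in this notebook; the `Zip Code:` label is an assumption, because the output is cut off after `Zi...`. `split_name_and_location` is a hypothetical helper name:

```python
import re

# "City:" and "County:" come from the output above; "Zip Code:" is assumed
PATTERN = re.compile(
    r"^(?P<name>.+?) City: (?P<city>.+?) County: (?P<county>.+?) Zip Code: (?P<zipcode>\d{5})"
)

def split_name_and_location(text):
    """Return a dict of the labelled pieces, or None if the text doesn't fit."""
    match = PATTERN.search(text)
    return match.groupdict() if match else None

# Sample built from values printed earlier, joined with the assumed labels
sample = "NGUYEN, MIMI PHAM City: KATY County: HARRIS Zip Code: 77449"
print(split_name_and_location(sample))
```

If the real labels differ, only `PATTERN` needs to change. On a dataframe, the same pattern could be passed to `Series.str.extract`, which turns named groups into columns.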