# Texas Cosmetologist Violations

Texas has a system for [searching for license violations](https://www.tdlr.texas.gov/cimsfo/fosearch.asp). You're going to search for cosmetologists!

> You can use the classwork notebook and also [my Everything scraping reference](https://jonathansoma.com/everything/scraping)

I shouldn't have to say this, but **do not use ChatGPT for this assignment.**

## Setup: Import what you'll need to scrape the page

We'll be using Playwright for this, *not* requests.

In [1]:
!pip install playwright
!playwright install
from playwright.async_api import async_playwright
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()




## Starting your search

Starting from [here](https://www.tdlr.texas.gov/cimsfo/fosearch.asp), search for cosmetologist violations for people with the last name **Nguyen**.

In [2]:
await page.goto("https://www.tdlr.texas.gov/cimsfo/fosearch.asp") 

<Response url='https://www.tdlr.texas.gov/cimsfo/fosearch.asp' request=<Request url='https://www.tdlr.texas.gov/cimsfo/fosearch.asp' method='GET'>>

In [3]:
await page.get_by_label("Search by Last Name").fill('Nguyen')

In [None]:
# Click the search button
# await page.locator("input").click()
await page.locator("input[name=\"B1\"]").click()

In [6]:
import io
import pandas as pd
html = await page.content()

In [8]:
tables = pd.read_html(io.StringIO(html))

# Display the first table
print(tables[0])

                                    Name and Location  \
0   NGUYEN, ELLIE UYEN City: HUMBLE  County: HARRI...   
1   NGUYEN, PHUC City: HOUSTON  County: HARRIS  Zi...   
2   NGUYEN, TRONG City: FORT WORTH  County: TARRAN...   
3   NGUYEN, SA City: AUSTIN  County: TRAVIS  Zip C...   
4   NGUYEN, LYNH City: RICHARDSON  County: DALLAS ...   
..                                                ...   
78  NGUYEN, KELLY PHUONG N  City: GARDEN GROVE  Co...   
79  NGUYEN, QUOC BAO City: ARLINGTON  County: TARR...   
80  NGUYEN, DANIEL THAO City: PLANO  County: COLLI...   
81  NGUYEN, THANH THU City: EL PASO  County: EL PA...   
82  NGUYEN, CINDY City: TEMPLE  County: BELL  Zip ...   

                                                Order  \
0   Date: 10/30/2024 Respondents Ellie Uyen Nguyen...   
1   Date: 10/9/2024 Respondent is assessed an admi...   
2   Date: 9/18/2024 Respondents Hanh Ngo, Linh Ngo...   
3   Date: 8/26/2024 Respondent is assessed an admi...   
4   Date: 8/8/2024 Respondent 

## Scraping

Once you are on the results page, do this. **I step you through things bit by bit, so it's going to be a little different than we did in class.** Also, no `pd.read_html` allowed because this isn't reaaallly tabular data (or there's just too much to clean up).

Once you've loaded the page: **what selector are you going to use to find the data?** In class we used things like `.article` or `.head-list`.

> You honestly can do this all with Playwright, no BeautifulSoup involved! But we didn't cover that in class so you should stick to this method:
> 
> ```python
> from bs4 import BeautifulSoup
>
> html = await page.content()
> doc = BeautifulSoup(html, "html.parser")
> ```

In [13]:
from bs4 import BeautifulSoup

# Parse the HTML content using BeautifulSoup
doc = BeautifulSoup(html, 'html.parser')

# Find the results table by locating the tbody
results_table = doc.find('tbody')

# Extract all rows from the table
rows = results_table.find_all('tr')

# Loop through the rows and extract data
for row in rows:
    cells = row.find_all('td')
    if len(cells) == 3:
        # Extract and format the data from each cell
        name_location = cells[0].get_text(separator='\n').strip()
        order_details = cells[1].get_text(separator='\n').strip()
        basis_for_order = cells[2].get_text(separator='\n').strip()
        
        # Print the formatted data
        print("------")
        print(f"Name and Location:\n{name_location}\n\nOrder:\n{order_details}\n\nBasis for Order:\n{basis_for_order}")

------
Name and Location:
NGUYEN, ELLIE UYEN 
 
City:
 
HUMBLE
 
County:
 
HARRIS
 
Zip Code:
 
77346
 License #:
 
798636
Complaint #
 
COS20240003068

Order:
Date:
 
10/30/2024
Respondents Ellie Uyen Nguyen and Bao Q. Truong are assessed an administrative penalty in the amount of $3,750.

Basis for Order:
Respondents failed to clean, disinfect, and sterilize metal instruments with a Department-approved sterilizer in accordance with the sterilizer or sanitizers manufacturer's instructions; Respondents failed to keep a record of the date and time of each foot spa daily or bi-weekly cleaning and if the foot spa was not used; Respondents failed to clean and disinfect all wax pots; Respondents failed to clean diamond, carbide, natural and metal bits after each use with a brush or ultrasonic cleaner, or by immersing in acetone; Respondents failed to dispose of single use items after each use.
------
Name and Location:
NGUYEN, PHUC 
 
City:
 
HOUSTON
 
County:
 
HARRIS
 
Zip Code:
 
77083
 

### Grab the rows, loop through each result and print the entire row

It's probably too many rows to do all at once: once you have a list, use `[:10]` to only show the first ten! For example, if you saved the table rows into `results` you might do something like this:

```python
for result in results[:10]:
    print("------")
    print(result)
```

Although you'd want to print out the text from the row (I give example output below).

The result should look something like this:

```
Name and Location Order Basis for Order
NGUYEN, THANH
City: FRISCO
County: COLLIN
Zip Code: 75034


License #: 790672

Complaint # COS20210004784 Date: 11/16/2021

Respondent is assessed an administrative penalty in the amount of $1,875. Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day, the Department is charging 2 violations; Respondent operated a cosmetology salon without the appropriate license.
NGUYEN, LONG D
City: SAN SABA
County: SAN SABA
Zip Code: 76877
```

### Loop through each result and print each person's name

You'll probably get an error because the first one doesn't have a name. How do you make that not happen?! If you want to ignore an error, you use code like this:

```python
try:
   # try to do something
except:
   print("It didn't work")
```

It should help you out. If you don't want to print anything when there's an error, you can type `pass` instead of the `print` statement.
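
Here is a minimal sketch of how that pattern behaves inside a loop. The `rows` list below is hypothetical stand-in data (plain dictionaries, not real BeautifulSoup rows), and catching `KeyError` specifically is my own choice for the sketch; with BeautifulSoup you would more likely run into an `AttributeError`.

```python
# Hypothetical stand-in data: the first item has no name, like a header row
rows = [
    {"header": "Name and Location"},
    {"name": "NGUYEN, THANH"},
    {"name": "NGUYEN, LONG D"},
]

names = []
for row in rows:
    try:
        # Raises KeyError when the row has no "name" key
        names.append(row["name"])
    except KeyError:
        # The loop keeps going instead of crashing on the bad row
        names.append("Doesn't have a name")

for name in names:
    print(name)
```

Naming the exception you expect (instead of a bare `except:`) means other, unexpected bugs still show up as errors rather than being silently skipped.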

**Why doesn't the first one have a name?**

Output should look like this:

```
Doesn't have a name
NGUYEN, THANH
NGUYEN, LONG D
NGUYEN, LUCIE HUONG
NGUYEN, CHINH
NGUYEN, JIMMY
```

* *Tip: The name has a class you can use. The class name is reused in a lot of places, but because it's the first one you don't have to worry about that!*
* *Tip: Instead of searching across the entire page – like `doc.select_one` – you should be doing your searching just inside of each **row** (I used this technique in the beginning of class with BeautifulSoup when we were scraping the books page)* 
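
To illustrate the second tip, here is a small sketch of searching inside each row rather than across the whole page. The HTML and the `label` class name are made up for the example; the real page's markup will differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the real page uses different markup and classes
html = """
<table>
  <tr><th>Title</th></tr>
  <tr><td><span class="label">First</span><span class="label">extra</span></td></tr>
  <tr><td><span class="label">Second</span></td></tr>
</table>
"""
doc = BeautifulSoup(html, "html.parser")

found = []
for row in doc.select("tr"):
    # select_one only looks inside this row and returns the first match, or None
    match = row.select_one(".label")
    if match is None:
        found.append(None)
    else:
        found.append(match.get_text(strip=True))

print(found)
```

Because `select_one` stops at the first match inside the row, the second `label` span in the middle row is ignored.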

In [22]:
# Loop through the rows again and print each person's name
print("\nNames List:")
for row in rows:
    try:
        name_span = row.find('span', class_='results_text')
        if name_span:
            name = name_span.get_text().strip()
            print(name)
        else:
            print("Doesn't have a name")
    except:
        pass


Names List:
Doesn't have a name
NGUYEN, ELLIE UYEN
NGUYEN, PHUC
NGUYEN, TRONG
NGUYEN, SA
NGUYEN, LYNH
NGUYEN, LISA
NGUYEN, CHAU MINH
NGUYEN, KEVIN C
NGUYEN, THANH D
NGUYEN, TAM THI
NGUYEN, DUNG THANH
NGUYEN, AJ
NGUYEN, LAM THE
NGUYEN, QUY KIM
NGUYEN, THI THIBICH
NGUYEN, TONY MINH
NGUYEN, JIMMY
NGUYEN, CECILIA L
NGUYEN, JENNIFER HUE
NGUYEN, ANH DUC PHONG
NGUYEN, HUNG
NGUYEN, THAO THANH
NGUYEN, JOHN T
NGUYEN, THUY THU THI
NGUYEN, STEVEN T
NGUYEN, DANIEL THAO
NGUYEN, THI THUY
NGUYEN, JOHN
NGUYEN, STEVEN QUOC
NGUYEN, THU THI
NGUYEN, TUONG
NGUYEN, HUONG T
NGUYEN, STEVEN QUOC
NGUYEN, HUONG THI NGOC
NGUYEN, TAMMY
NGUYEN, TOAN DUC
NGUYEN, NHAN TRONG
NGUYEN, TAM PHUONG
NGUYEN, NHUNG NGOC THI
NGUYEN, ANH NGOC
NGUYEN, HUYEN THI
NGUYEN, DANIEL
NGUYEN, BANG
NGUYEN, HONG N
NGUYEN, TUYET
NGUYEN, TUYET MINH
NGUYEN, JOHN THANH NHAN
NGUYEN, THE THI
NGUYEN, THAO HONG
NGUYEN, NGOC THI
NGUYEN, TUYEN THANH
NGUYEN, MINH THU
NGUYEN, LOAN THI NGOC
NGUYEN, JACK
NGUYEN, HOANG DRAGON
NGUYEN, KELLY
NGUYEN, NIEN
N

## Loop through each result, printing each violation description ("Basis for order")

Your results should look something like:

```
Doesn't have a violation
Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day, the Department is charging 2 violations; Respondent operated a cosmetology salon without the appropriate license.
Respondent failed to keep a record of the date and time of each foot spa daily or bi-weekly cleaning and if the foot spa was not used, the Department is charging 2 violations; Respondent failed to clean, disinfect, and sterilize manicure and pedicure implements after each use; Respondent failed to clean and disinfect manicure tables prior to use for each client.
...
```

> - *Tip: You'll get an error even if you're ALMOST right - which row is causing the problem?*
> - *Tip: Or I guess you could just skip the one with the problem using try/except...*

In [16]:
# Loop through each result and print each violation description ("Basis for Order")
print("\nViolation Descriptions:")
for row in rows:
    try:
        cells = row.find_all('td')
        if len(cells) == 3:
            basis_for_order = cells[2].get_text(separator='\n').strip()
            if basis_for_order:
                print(basis_for_order)
            else:
                print("Doesn't have a violation")
    except:
        print("Doesn't have a violation")


Violation Descriptions:
Respondents failed to clean, disinfect, and sterilize metal instruments with a Department-approved sterilizer in accordance with the sterilizer or sanitizers manufacturer's instructions; Respondents failed to keep a record of the date and time of each foot spa daily or bi-weekly cleaning and if the foot spa was not used; Respondents failed to clean and disinfect all wax pots; Respondents failed to clean diamond, carbide, natural and metal bits after each use with a brush or ultrasonic cleaner, or by immersing in acetone; Respondents failed to dispose of single use items after each use.
Respondent failed to follow whirlpool foot spas cleaning and sanitization procedures as required daily, bi-weekly, and before use by each patron.
Respondents failed to follow whirlpool foot spas cleaning and sanitization procedures as required; Respondents failed to keep a record of the date and time of each foot spa daily or bi-weekly cleaning and if the foot spa was not used; R

## Loop through each result, printing the complaint number

Output should look similar to this:

```
Doesn't have a complaint number
COS20210004784
COS20210009745
COS20210011484
...
```

- *Tip: Think about the order of the elements. Can you count from the opposite direction than you normally do?*
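
As a reminder of what counting from the opposite direction looks like in Python, here is a sketch using negative indexes. The `pieces` list is hypothetical; whether the complaint number really ends up last on the live page is something to check yourself.

```python
# Hypothetical list of text pieces pulled from one row, in page order
pieces = ["NGUYEN, THANH", "FRISCO", "COLLIN", "75034", "790672", "COS20210004784"]

first_piece = pieces[0]      # counting from the start
last_piece = pieces[-1]      # counting from the end
second_to_last = pieces[-2]  # one step further back from the end

print(last_piece)
```

Negative indexes are handy when the number of items before the one you want can vary from row to row.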

In [20]:
# Loop through each result and print the complaint number
print("\nComplaint Numbers:")
for row in rows:
    try:
        cells = row.find_all('td', recursive=False)
        if len(cells) >= 1:
            complaint_number_span = cells[0].find_all('span', class_='results_text')
            for span in complaint_number_span:
                if 'COS' in span.get_text():
                    complaint_number = span.get_text().strip()
                    print(complaint_number)
                    break
            else:
                print("Doesn't have a complaint number")
    except:
        print("Doesn't have a complaint number")


Complaint Numbers:
COS20240003068
COS20230005653
COS20240011518
COS20240008640
COS20230004635
COS20230010686
COS20220007736
COS20230020006
COS20240000594
COS20240008070
COS20240001947
COS20230009966
COS20220016043
COS20230015854
COS20220002410
COS20230002569
COS20230013709
COS20230020593
COS20240005064
COS20230018810
Doesn't have a complaint number
COS20230000401
COS20220009661
COS20230013043
COS20230001042
COS20230002419
COS20240002426
COS20220015806
COS20230007454
COS20230009980
COS20230001960
COS20220017901
COS20220018135
COS20220016107
COS20220013443
COS20220015714
COS20220012646
COS20220012645
COS20230010084
COS20230010404
COS20220015853
COS20220008852
COS20230009558
COS20230010492
COS20230000043
COS20220006122
COS20220009835
COS20210007272
COS20210012244
COS20210016437
COS20220007001
COS20220009623
COS20220014041
COS20220016162
COS20210009264
COS20210012950
COS20210013902
COS20220005426
Doesn't have a complaint number
COS20210012469
COS20220004591
COS20220005795
COS20230000013
C

## Saving the results

### Loop through each result to create a list of dictionaries

Each dictionary must contain

- Person's name
- Violation description
- Violation number
- License Numbers
- Zip Code
- County
- City

Create a new dictionary for each result (except the header).

Based on what you print out, the output might look something like:

```
This row is broken: Name and Location Order Basis for Order
{'name': 'NGUYEN, THANH', 'city': 'FRISCO', 'county': 'COLLIN', 'zip_code': '75034', 'complaint_no': 'COS20210004784', 'license_numbers': '790672', 'complaint': 'Respondent failed to clean and sanitize whirlpool foot spas as required at the end of each day, the Department is charging 2 violations; Respondent operated a cosmetology salon without the appropriate license.'}
{'name': 'NGUYEN, LONG D', 'city': 'SAN SABA', 'county': 'SAN SABA', 'zip_code': '76877', 'complaint_no': 'COS20210009745', 'license_numbers': '760420, 1620583', 'complaint': 'Respondent failed to keep a record of the date and time of each foot spa daily or bi-weekly cleaning and if the foot spa was not used, the Department is charging 2 violations; Respondent failed to clean, disinfect, and sterilize manicure and pedicure implements after each use; Respondent failed to clean and disinfect manicure tables prior to use for each client.'}
```

> _**Tip:** Depending on how you do this, you might want to figure out how to search for "next siblings" or "following siblings"_
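
If you go the sibling route, a minimal sketch with BeautifulSoup's `find_next_sibling` might look like the following. The markup and class names are invented for the example and are not copied from the real results page.

```python
from bs4 import BeautifulSoup

# Made-up markup: a label element followed by a value element
html = '<td><span class="label">City:</span> <span class="value">FRISCO</span></td>'
doc = BeautifulSoup(html, "html.parser")

# Find the label by its exact text
label = doc.find("span", string="City:")

# find_next_sibling("span") skips the whitespace text node between the tags
# and returns the next <span> at the same level
value = label.find_next_sibling("span")
city = value.get_text(strip=True)

print(city)
```

The idea is to anchor on something stable (the label text) and then step sideways to the value next to it.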

### Save that to a CSV named `output.csv`

The dataframe should look something like...

|index|name|city|county|zip_code|complaint_no|license_numbers|complaint|
|---|---|---|---|---|---|---|---|
|0|NGUYEN, THANH|FRISCO|COLLIN|75034|COS20210004784|790672|Respondent failed to clean and sanitize whirlp...|
|1|NGUYEN, LONG D|SAN SABA|SAN SABA|76877|COS20210009745|760420, 1620583|Respondent failed to keep a record of the date...|


- *Tip: If you send a list of dictionaries to `pd.DataFrame(...)`, it will create a dataframe out of that list!*
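
A quick sketch of that tip, using two shortened dictionaries based on the example output above (the variable names `results` and `df` are just placeholders):

```python
import pandas as pd

# Shortened example rows; the real ones would come from your scraping loop
results = [
    {"name": "NGUYEN, THANH", "city": "FRISCO", "zip_code": "75034"},
    {"name": "NGUYEN, LONG D", "city": "SAN SABA", "zip_code": "76877"},
]

# Each dictionary becomes a row, and the keys become the column names
df = pd.DataFrame(results)

print(df)
```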

### Open the CSV file and examine the first few rows. Make sure you didn't save an extra weird unnamed column.
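
The extra column usually comes from pandas writing the row index into the file. The sketch below shows the difference, writing to in-memory buffers instead of `output.csv` so it doesn't touch any files; the one-row dataframe is placeholder data.

```python
import io
import pandas as pd

# Placeholder one-row dataframe
df = pd.DataFrame([{"name": "NGUYEN, THANH", "zip_code": "75034"}])

# Default behavior: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False leaves the index out of the file
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

With a real file, the same `index=False` argument works with a filename in place of the buffer.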