# Texas Tow Trucks

We're going to scrape some [tow trucks in Texas](https://www.tdlr.texas.gov/tools_search/). 

# Part One: Building a company list

Search for businesses with the word **WRECK** in their names.

* **Tip:** Start by scraping the first page to a dataframe, then expand to a loop that collects every page. Finally, combine all of the dataframes with `pd.concat`.
* **Tip:** There are a lot of ways to do this, although raw `pd.read_html` with a URL won't work! Some approaches are Playwright-driven, some use [curlconverter](https://curlconverter.com/), and so on. I recommend using requests and BeautifulSoup.
* **Tip:** You can't just do a `try`/`except`, because even if you ask for page 99999 it will always give you the last page again! Watch out that you don't get stuck in an infinite loop!
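
One way to avoid the infinite loop described in the last tip is to stop as soon as a page comes back identical to the one before it. The sketch below is only an illustration: `collect_pages` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for whatever real code downloads one page of results.

```python
import pandas as pd

def collect_pages(fetch_page, max_pages=500):
    """Collect one dataframe per page, stopping when a page repeats."""
    frames = []
    previous = None
    for page_number in range(1, max_pages + 1):
        current = fetch_page(page_number)
        # The site keeps returning the last page, so a repeat means we're done
        if previous is not None and current.equals(previous):
            break
        frames.append(current)
        previous = current
    return pd.concat(frames, ignore_index=True)

def fake_fetch(page_number):
    # Hypothetical stand-in: pretend the site has 3 pages and
    # repeats page 3 for any larger page number
    n = min(page_number, 3)
    return pd.DataFrame({"Customer": [f"COMPANY {n}A", f"COMPANY {n}B"]})

combined = collect_pages(fake_fetch)
print(len(combined))
```

The `max_pages` cap is a second safety net in case two neighboring pages are never identical.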

In [185]:
from playwright.async_api import async_playwright
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless = False)
page = await browser.new_page()
url = "https://www.tdlr.texas.gov/tools_search/"
await page.goto(url)

<Response url='https://www.tdlr.texas.gov/tools_search/' request=<Request url='https://www.tdlr.texas.gov/tools_search/' method='GET'>>

In [186]:
await page.locator("id=namedata").type('WRECK')
await page.locator("id=submit3").click()

In [187]:
from io import StringIO
import pandas as pd

# Parse every <table> in the rendered page; the search results are the third one
tables = pd.read_html(StringIO(await page.content()))
df = tables[2]
df.head()

Unnamed: 0,0,1,2,3,4,5
0,Customer,DBA Name,TDLR Number,City,State,Zip code
1,1ST CHOICE WRECKER SERVICE,"1ST CHOICE PAINT & BODY, INC",006096604C (Insurance not applied !),TERRELL,TX,75160
2,1ST CHOICE WRECKER SERVICE,"1ST CHOICE PAINT & BODY, INC.",0612137VSF (Insurance not applied !),TERRELL,TX,75160
3,1ST CHOICE WRECKER SERVICE LLC,,006529369C,SILSBEE,TX,77656
4,1ST CHOICE WRECKER SERVICE LLC,,0652937VSF,SILSBEE,TX,77656


In [181]:
await page.get_by_text("Next >>").click()
await page.get_by_text("<< Prev").click()

In [182]:
#await page.get_by_text("<< Prev").click()

In [188]:
from io import StringIO

dataframes = []
while True:
    try:
        # Grab the results table from the page we're currently on
        tables = pd.read_html(StringIO(await page.content()))
        df = tables[2]
        dataframes.append(df)

        print("Trying to click the button")
        await page.get_by_text("Next >>").click(timeout=2000)
    except Exception:
        # The click times out when there is no "Next >>" button,
        # which means this was the last page, so exit the loop
        print("Could not click the button")
        break

print("Now we have", len(dataframes), "dataframes")

Trying to click the button

In [189]:
df = pd.concat(dataframes, ignore_index=True)
df.shape
df.head()

Unnamed: 0,0,1,2,3,4,5
0,Customer,DBA Name,TDLR Number,City,State,Zip code
1,1ST CHOICE WRECKER SERVICE,"1ST CHOICE PAINT & BODY, INC",006096604C (Insurance not applied !),TERRELL,TX,75160
2,1ST CHOICE WRECKER SERVICE,"1ST CHOICE PAINT & BODY, INC.",0612137VSF (Insurance not applied !),TERRELL,TX,75160
3,1ST CHOICE WRECKER SERVICE LLC,,006529369C,SILSBEE,TX,77656
4,1ST CHOICE WRECKER SERVICE LLC,,0652937VSF,SILSBEE,TX,77656


### Cleanup

If you haven't already, rename the columns to be:

* Customer
* DBA Name
* TDLR Number
* City
* State
* Zip code

and remove all of the rows where the customer name is `Customer`.
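
As a small illustration of both cleanup steps, here is a sketch on a hand-built dataframe. The `raw` frame is an assumption that imitates what `pd.concat` produces: numbered columns, with the site's header row repeated once per page. The company rows are copied from the output above.

```python
import pandas as pd

header = ["Customer", "DBA Name", "TDLR Number", "City", "State", "Zip code"]

# Imitation of the concatenated pages: the header row shows up on every page
raw = pd.DataFrame([
    header,
    ["1ST CHOICE WRECKER SERVICE LLC", None, "006529369C", "SILSBEE", "TX", "77656"],
    header,
    ["24/7 WRECKER SERVICE LLC", None, "006551087C", "HARLINGEN", "TX", "78550"],
])

# Step 1: give the numbered columns real names
raw.columns = header

# Step 2: keep only the rows that are not a repeated header
cleaned = raw[raw["Customer"] != "Customer"].reset_index(drop=True)
print(cleaned)
```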

In [190]:
df.columns = ['Customer','DBA Name','TDLR Number', 'City','State','Zip Code']

In [191]:
df = df[df['Customer'] != 'Customer']

In [192]:
df.head(50)

Unnamed: 0,Customer,DBA Name,TDLR Number,City,State,Zip Code
1,1ST CHOICE WRECKER SERVICE,"1ST CHOICE PAINT & BODY, INC",006096604C (Insurance not applied !),TERRELL,TX,75160
2,1ST CHOICE WRECKER SERVICE,"1ST CHOICE PAINT & BODY, INC.",0612137VSF (Insurance not applied !),TERRELL,TX,75160
3,1ST CHOICE WRECKER SERVICE LLC,,006529369C,SILSBEE,TX,77656
4,1ST CHOICE WRECKER SERVICE LLC,,0652937VSF,SILSBEE,TX,77656
5,1ST CHOICE WRECKER SERVICE LLC,,0655000VSF,SILSBEE,TX,77656
6,1STCHOICEWRECKERSERVICELLC,,0654581VSF,SILSBEE,TX,77656
7,24 HOUR WRECKER SERVICE INC.,,005052021C,LANCASTER,TX,75146
8,"24 HOUR WRECKER SERVICE, INC",,0514204VSF,LANCASTER,TX,75146
9,24/7 WRECKER SERVICE LLC,,006551087C,HARLINGEN,TX,78550
10,290 WRECKER SERVICE INC,HWY 290 WRECKER SERVICE,0656637VSF,HOCKLEY,TX,77447


## Save as `wreckers.csv`

In [100]:
df.to_csv("wreckers.csv", index=False)

# Part Two: Company info

> You can use whatever tool you'd like for this, but remember that form submission doesn't necessarily mean Selenium/Playwright! If you want to go the `requests` route instead, it might mean anything from adding user-agent headers to using [curlconverter](https://curlconverter.com/) to steal the whole headers/cookies/form details. And the only way to know is trial-and-error!
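
If you do try the `requests` route, it helps to see that the details page is addressed by a single query parameter. The sketch below only builds the URL and makes no request. The endpoint and the `mcrnumber` parameter name come from the curlconverter output in Step 1; `build_details_url` is a hypothetical helper, and whether the server answers without the original cookies and headers is something you would have to test.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the curlconverter output in Step 1
DETAILS_URL = "https://www.tdlr.texas.gov/tools_search/mccs_display.asp"

def build_details_url(tdlr_number):
    """Return the details-page URL for one TDLR number (no request is made)."""
    return f"{DETAILS_URL}?{urlencode({'mcrnumber': tdlr_number})}"

print(build_details_url("006556161C"))
```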

## Step 1: Scraping one page

Try searching from the [tools page](https://www.tdlr.texas.gov/tools_search/) for the TDLR Number `006556161C`. From the results page, scrape the:

* Business name
* Phone number
* License status
* Physical address

Save the results into a dictionary. It's best if each item has its own key, but **it's fine to pull "larger" sections of the page and split them up in pandas later on**.

In [193]:
await page.get_by_text("New Search").click()

In [194]:
await page.locator("id=mcrdata").type('006556161C')
await page.locator("id=submit3").click()

In [195]:
import requests

cookies = {
    '_gid': 'GA1.2.57332458.1670099910',
    'ASPSESSIONIDSCWTTBRD': 'ECAIIBCBLJOEGJLDKMDFKNFJ',
    '_ga': 'GA1.1.1395365643.1670099910',
    '_ga_8C39DM76B2': 'GS1.1.1670101849.2.0.1670101854.0.0.0',
}

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    # 'Cookie': '_gid=GA1.2.57332458.1670099910; ASPSESSIONIDSCWTTBRD=ECAIIBCBLJOEGJLDKMDFKNFJ; _ga=GA1.1.1395365643.1670099910; _ga_8C39DM76B2=GS1.1.1670101849.2.0.1670101854.0.0.0',
    'Referer': 'https://www.tdlr.texas.gov/tools_search/mccs_search.asp',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'sec-ch-ua': '"Not?A_Brand";v="8", "Chromium";v="108"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"macOS"',
}

params = {
    'mcrnumber': '006556161C',
}

response = requests.get(
    'https://www.tdlr.texas.gov/tools_search/mccs_display.asp',
    params=params,
    cookies=cookies,
    headers=headers,
)

In [196]:
response.text

'\r\n\r\n<html>\r\n<head>\r\n\t<title>TDLR Tow Truck and Vehicle Storage Facility Inquiry</title>\r\n\r\n<meta\tcontent="text/html; charset=windows-1252" http-equiv="Content-Type" />\r\n<meta\tNAME="GENERATOR" Content="Microsoft Visual Studio" /> \r\n\r\n<meta\tHTTP-EQUIV="Content-Type" content="text/html; charset=UTF-8" />\r\n<meta\tname="description" \r\n\t\tcontent="Welcome to the Tow Truck and Vehicle Storage Facility Inquiry Information Page. \r\n\t\t\t\tThis web application allows users to obtain information on companies that have \r\n\t\t\t\tobtained registration through TDLR. This includes addresses, insurance records, \r\n\t\t\t\trecent activities, and vehicle data." />\r\n<meta\tname="keywords" \r\n\t\tcontent="Tow Trucks, Vehicle Storage Facility, registration, insurance, Permit \r\n\t\t\t\tRestrictions, Texas Department of Licensing and Regulation, TDLR" />\r\n<meta\tname="subject" content="Transportation" />\r\n<meta\tname="type" content="Programs and services" />\r\n<meta

In [197]:
# `params` already supplies mcrnumber, so the URL should not repeat it in its query string
response = requests.post('https://www.tdlr.texas.gov/tools_search/mccs_display.asp', params=params, cookies=cookies, headers=headers)

In [198]:
# `xpath=..` selects the parent element of the matched label,
# the same way `cd ..` moves up one directory on the command line
await page.get_by_text("NAME: ").locator("xpath=..").text_content()
await page.get_by_text("PHONE: ").locator("xpath=..").text_content()
await page.get_by_text("Status: ").locator("xpath=..").text_content()
await page.get_by_text("Physical: ").locator("xpath=..").text_content()

'Carrier Type:\xa0\xa0Tow Truck Company\n    \n    \n\t    Number of Active Tow Trucks: \xa0 2\n            \n            Address Information\n            Mailing:\n\n17110  LITTLE CYPRESS DR\n\t\t\t    CYPRESS,\xa0TX.\xa077429\n            \n\n            Physical:\n\n11053 LORETTA LN\n\t\t\t    PLANTERSVILLE,\xa0TX.\xa077363\n        '

In [199]:
dirty_south = {}

dirty_south['name'] = await page.get_by_text("NAME: ").locator("xpath=..").text_content()
dirty_south['number'] = await page.get_by_text("PHONE: ").locator("xpath=..").text_content()
dirty_south['license'] = await page.get_by_text("Status: ").locator("xpath=..").text_content()
dirty_south['address'] = await page.get_by_text("Physical: ").locator("xpath=..").text_content()

dirty_south

{'name': 'Name:\xa0\xa0\xa0DIRTY SOUTH TRANSPORT AND RECOVERY, LLC ',
 'number': 'Phone:\xa0\xa0\xa0713-259-5445',
 'license': 'Status:\xa0\xa0Active',
 'address': 'Carrier Type:\xa0\xa0Tow Truck Company\n    \n    \n\t    Number of Active Tow Trucks: \xa0 2\n            \n            Address Information\n            Mailing:\n\n17110  LITTLE CYPRESS DR\n\t\t\t    CYPRESS,\xa0TX.\xa077429\n            \n\n            Physical:\n\n11053 LORETTA LN\n\t\t\t    PLANTERSVILLE,\xa0TX.\xa077363\n        '}

## Step 2: Converting to a function

Write a function called `get_tdlr_info` that is given a TDLR number and returns a dictionary with the business name, phone number, license status, and physical address. You'll mostly be able to use the same content as above.

Test with `0654479VSF`, and confirm that the information is in there. Did it not work out? Go back and edit your selectors, or be a little broader in the parts of the page you sweep up.

In [None]:
async def get_tdlr_info(tdlr):
    """Search for one TDLR number and return the scraped fields as a Series."""
    try:
        # Start a fresh search and submit the TDLR number
        await page.get_by_text("New Search").click()
        await page.locator("id=mcrdata").type(str(tdlr))
        await page.locator("id=submit3").click()

        # Each label's parent element holds the label plus its value
        tow_trucks = {}
        tow_trucks['name'] = await page.get_by_text("NAME: ").locator("xpath=..").text_content()
        tow_trucks['number'] = await page.get_by_text("PHONE: ").locator("xpath=..").text_content()
        tow_trucks['license'] = await page.get_by_text("Status: ").locator("xpath=..").text_content()
        tow_trucks['address'] = await page.get_by_text("Physical: ").locator("xpath=..").text_content()
        return pd.Series(tow_trucks)
    except Exception:
        # If the search or a selector fails, return an empty row
        # instead of stopping the whole scrape
        return pd.Series({}, dtype=object)

await get_tdlr_info('0654479VSF')

## Step 3: Scraping many pages

Using pandas, read in `trucks-subset.csv`.

In [222]:
# Read in the file (this assumes it has a `tdlr_number` column)
df = pd.read_csv("trucks-subset.csv")

# Convert the column into a list
codes = df.tdlr_number.tolist()

# Loop through each TDLR number, scraping one page per number.
# `.apply` can't await an async function, so use a plain loop instead.
results = []

for code in codes:
    result = await get_tdlr_info(code)
    results.append(result)

tow_trucks_details_df = pd.DataFrame(results)
tow_trucks_details_df.head()

## Scrape every single row, creating a new dataframe from the scraped data.

You probably want to refer to the classwork about using `.apply`.

Right now, the results from `.apply` will be a list of dictionaries. You can either change your function to `return pd.Series(data)` to make it become a dataframe automatically, or convert the list of dictionaries you end up with to a dataframe using `pd.DataFrame`.

* **Tip:** If you're using Playwright to navigate pages... it's going to be a bit more difficult.
* **Tip:** Remember to use `join` and not `merge` to combine your dataframes
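
The Playwright difficulty is that `.apply` calls the function but never awaits it, so an `async` scraper only hands back coroutine objects. A minimal sketch of one workaround is below, assuming a plain loop is acceptable. `fake_get_tdlr_info` and `scrape_all` are hypothetical stand-ins, not the real scraper. Inside a notebook you would write `await scrape_all(numbers)` instead of `asyncio.run(...)`, because the notebook already has an event loop running.

```python
import asyncio
import pandas as pd

async def fake_get_tdlr_info(tdlr):
    # Hypothetical stand-in for the real Playwright-based function
    await asyncio.sleep(0)
    return {"name": f"COMPANY {tdlr}", "license": "Active"}

async def scrape_all(numbers):
    """Await the scraper once per TDLR number, keeping the original index."""
    results = []
    for number in numbers:
        results.append(await fake_get_tdlr_info(number))
    # Reusing the index is what lets .join line the rows up later
    return pd.DataFrame(results, index=numbers.index)

numbers = pd.Series(["006529369C", "0652937VSF"], name="TDLR Number")
details = asyncio.run(scrape_all(numbers))

# join matches rows on the index, so no shared key column is needed
merged = numbers.to_frame().join(details)
print(merged)
```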

## Save your dataframe as `data-uncleaned.csv`

# Cleaning your data

## Re-open the `data-uncleaned.csv` file

You probably want to set `pd.options.display.max_colwidth`

## Clean it up!

Make sure there are columns for

- Business name
- Phone number
- License status
- Physical address

And drop all of the other columns (The easiest way is to use `df.drop(columns=[...])`)
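
Here is a sketch of that cleanup, assuming the uncleaned file has the `name`, `number`, `license`, and `address` columns produced in Step 1. The one sample row is copied from the Step 1 output (with the address text shortened to start at "Address Information"), and `strip_label` is a hypothetical helper.

```python
import pandas as pd

scraped = pd.DataFrame([{
    "name": "Name:\xa0\xa0\xa0DIRTY SOUTH TRANSPORT AND RECOVERY, LLC ",
    "number": "Phone:\xa0\xa0\xa0713-259-5445",
    "license": "Status:\xa0\xa0Active",
    "address": (
        "Address Information\n            Mailing:\n\n17110  LITTLE CYPRESS DR\n"
        "\t\t\t    CYPRESS,\xa0TX.\xa077429\n            \n\n            Physical:\n\n"
        "11053 LORETTA LN\n\t\t\t    PLANTERSVILLE,\xa0TX.\xa077363\n        "
    ),
}])

def strip_label(column):
    # Keep only what follows the first colon, then trim whitespace
    # (strip also removes the non-breaking spaces, \xa0)
    return column.str.split(":", n=1).str[1].str.strip()

scraped["Business name"] = strip_label(scraped["name"])
scraped["Phone number"] = strip_label(scraped["number"])
scraped["License status"] = strip_label(scraped["license"])

# Keep the text after "Physical:" and collapse runs of whitespace to one space
physical = scraped["address"].str.split("Physical:", n=1).str[1]
scraped["Physical address"] = physical.str.replace(r"\s+", " ", regex=True).str.strip()

cleaned = scraped.drop(columns=["name", "number", "license", "address"])
print(cleaned.iloc[0].to_dict())
```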

# Putting it together

## Open up `wreckers.csv` from Part 1

In [212]:
# The file was saved as lowercase "wreckers.csv"; the name is case-sensitive on some systems
df = pd.read_csv("wreckers.csv")
df.head()

Unnamed: 0,Customer,DBA Name,TDLR Number,City,State,Zip Code
0,1ST CHOICE WRECKER SERVICE,"1ST CHOICE PAINT & BODY, INC",006096604C (Insurance not applied !),TERRELL,TX,75160.0
1,1ST CHOICE WRECKER SERVICE,"1ST CHOICE PAINT & BODY, INC.",0612137VSF (Insurance not applied !),TERRELL,TX,75160.0
2,1ST CHOICE WRECKER SERVICE LLC,,006529369C,SILSBEE,TX,77656.0
3,1ST CHOICE WRECKER SERVICE LLC,,0652937VSF,SILSBEE,TX,77656.0
4,1ST CHOICE WRECKER SERVICE LLC,,0655000VSF,SILSBEE,TX,77656.0


## Clean up the TDLR Number column to *just* the TDLR number

In [217]:
# Remove any parenthetical note such as "(Insurance not applied !)", then trim leftover spaces
df['TDLR Number'] = df['TDLR Number'].str.replace(r'\([^)]*\)', '', regex=True).str.strip()
df.head()


Unnamed: 0,Customer,DBA Name,TDLR Number,City,State,Zip Code
0,1ST CHOICE WRECKER SERVICE,"1ST CHOICE PAINT & BODY, INC",006096604C,TERRELL,TX,75160.0
1,1ST CHOICE WRECKER SERVICE,"1ST CHOICE PAINT & BODY, INC.",0612137VSF,TERRELL,TX,75160.0
2,1ST CHOICE WRECKER SERVICE LLC,,006529369C,SILSBEE,TX,77656.0
3,1ST CHOICE WRECKER SERVICE LLC,,0652937VSF,SILSBEE,TX,77656.0
4,1ST CHOICE WRECKER SERVICE LLC,,0655000VSF,SILSBEE,TX,77656.0


## Applying 

Use `.apply` to run your scraping script on all of your TDLR numbers. Save the results into a new dataframe.

* **Tip:** You can also just do this for the first 20 or so if you don't like waiting around.

In [219]:
# `.apply` can't await an async function (it only returns un-run coroutine objects),
# so loop over the TDLR numbers and await each call instead
numbers = df['TDLR Number'].head()

results = []
for number in numbers:
    results.append(await get_tdlr_info(number))

# Reuse the original index so `.join` can line the rows up
tdlr_df = pd.DataFrame(results)
tdlr_df.index = numbers.index

In [220]:
tdlr_df

### Use `.join` to combine it with the original dataframe

In [None]:
merged = df.join(tdlr_df)
merged.head()

### Save to a CSV

In [None]:
merged.to_csv("tow_trucks_details.csv", index=False)