These packages allow us to turn web pages into objects, convert those objects into text, search the text, and save the results in a file.

In [65]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from time import sleep

Craigslist works to prevent web scraping. If you go to the "Garage & Moving Sales" page on the Phoenix Craigslist site and enter keywords, you will get a list of results. But if you inspect the HTML of that results page, you will not find any classes that contain the metadata about your search results. Instead, you have to open each result individually as its own page. Only then do the divs, sections, and classes become apparent. I used this manual method to obtain "a list of URLs," per the assignment.

In [66]:
list_of_urls = ['https://phoenix.craigslist.org/wvl/gms/d/glendale-estate-sale/7588379037.html',
                'https://phoenix.craigslist.org/evl/gms/d/tempe-everything-in-my-1020-storage/7578328538.html',
                'https://tucson.craigslist.org/gms/d/tucson-mini-estate-sale-starting-sunday/7588604990.html',
                'https://tucson.craigslist.org/gms/d/tucson-moving-sales/7586073631.html',
                'https://phoenix.craigslist.org/evl/gms/d/mesa-moving-feb-20-priced-to-sell/7582745739.html',
                'https://phoenix.craigslist.org/nph/gms/d/scottsdale-estate-sale-final-days-of/7566676009.html',
                'https://yuma.craigslist.org/gms/d/quartzsite-huge-rock-auction-feb-16th/7588705227.html']

This section uses requests to retrieve the HTML for each page, BeautifulSoup to parse the page and find the posting body, plain Python string tests to check whether that text contains any of the keywords, and pandas to load a DataFrame with the links and keywords.
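
The keyword test in this step can be tried on its own, without any network access. The following is a minimal sketch rather than the notebook's actual code: the names `KEYWORDS` and `first_keyword` are hypothetical, and the case-insensitive comparison is an assumption that the original `in` tests do not make.

```python
# Hypothetical helper, not part of the original notebook.
# Keywords are listed in the same priority order as the notebook's if/elif chain.
KEYWORDS = ['mattress', 'cabinet', 'wrench']

def first_keyword(text, keywords=KEYWORDS):
    """Return the first keyword found in text, or None if none match."""
    lowered = text.lower()  # assumption: ignore case, unlike the original test
    for word in keywords:
        if word in lowered:
            return word
    return None

print(first_keyword('Oak Cabinet and a queen mattress'))  # mattress
print(first_keyword('Books and records only'))            # None
```

Returning None when nothing matches makes "no keyword found" an explicit result, so one page's outcome cannot leak into the next.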

In [67]:
df = pd.DataFrame(columns=['link', 'keyword'])

In [68]:
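rows = []
for link in list_of_urls:
    thispage = requests.get(link, timeout=30)
    bs = BeautifulSoup(thispage.text, 'html.parser')
    # The posting body holds the free-text description of the sale
    match = bs.find("section", {"id": "postingbody"})
    # Deleted or expired posts have no posting body, so fall back to an empty string
    text = match.text if match is not None else ''
    # Reset on every pass so a page with no match does not inherit the previous keyword
    keyword = None
    if 'mattress' in text:
        keyword = 'mattress'
    elif 'cabinet' in text:
        keyword = 'cabinet'
    elif 'wrench' in text:
        keyword = 'wrench'
    rows.append({'link': link, 'keyword': keyword})
    sleep(15)  # pause between requests to keep the load on the site low
# DataFrame.append was removed in pandas 2.0, so build the DataFrame from the collected rows
df = pd.DataFrame(rows, columns=['link', 'keyword'])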

This section displays the contents of the DataFrame and saves it as a CSV file.

In [69]:
df

                                                link   keyword
0  https://phoenix.craigslist.org/wvl/gms/d/glend...  mattress
1  https://phoenix.craigslist.org/evl/gms/d/tempe...  mattress
2  https://tucson.craigslist.org/gms/d/tucson-min...  mattress
3  https://tucson.craigslist.org/gms/d/tucson-mov...  mattress
4  https://phoenix.craigslist.org/evl/gms/d/mesa-...   cabinet
5  https://phoenix.craigslist.org/nph/gms/d/scott...   cabinet
6  https://yuma.craigslist.org/gms/d/quartzsite-h...   cabinet


In [70]:
df.to_csv('module_4_basics.csv', index=False)