## Part 1: Scraping and Saving HTML Content

### 1. Interact with the Page-Sorting

Initially, the listings are sorted "newest" first, and the URL is: https://sfbay.craigslist.org/search/zip#search=1~gallery~0~0

After switching to "oldest" first, the URL becomes: https://sfbay.craigslist.org/search/zip?sort=dateoldest#search=1~gallery~0~0

The only difference is the "?sort=dateoldest" part between "zip" and "#search". So I can change the sort order directly by editing this part of the URL. When I change "?sort=dateoldest" to "?sort=datenewest", the page is automatically sorted "newest" first.

Thus, sorting is controlled by the "sort" query parameter, written as "?sort=" in the URL. Because the parameter travels in the URL's query string, it is sent with a GET request.
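As a minimal sketch of the URL edit described above, the snippet below sets the "sort" query parameter with Python's standard `urllib.parse` module while leaving the "#search=..." fragment untouched. The helper name `with_sort` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_sort(url, order):
    """Return `url` with its `sort` query parameter set to `order` (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep any existing query parameters and overwrite only `sort`
    query = dict(parse_qsl(parts.query))
    query['sort'] = order
    return urlunsplit(parts._replace(query=urlencode(query)))

newest_url = 'https://sfbay.craigslist.org/search/zip#search=1~gallery~0~0'
oldest_url = with_sort(newest_url, 'dateoldest')
print(oldest_url)
```

Editing the query through `urlsplit` rather than string concatenation keeps the parameter in front of the "#" fragment, which matches the URL observed in the browser.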

### 2. Interact with the Page-Pagination

I navigated to the second and third pages; their URLs are as follows.

The second page: https://sfbay.craigslist.org/search/zip#search=1~gallery~1~0

The third page: https://sfbay.craigslist.org/search/zip#search=1~gallery~2~0

As the URLs show, the second-to-last number in the part after "#" is the page index, counting from 0 (0 for the first page, 1 for the second, 2 for the third), so we can change pages by editing that number between the "~" characters. In this case, that number is the variable associated with pagination. On other websites, there may instead be a variable named "page".
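The pattern above can be sketched as a small helper that builds the URL for a given page. `page_url` and `BASE_URL` are hypothetical names, and the other values in the fragment ("1", "gallery", and the trailing "0") are copied verbatim from the observed URLs without assuming what they mean.

```python
from urllib.parse import urlsplit

BASE_URL = 'https://sfbay.craigslist.org/search/zip'

def page_url(page_number):
    """Return the gallery URL for a 1-based page number (hypothetical helper)."""
    if page_number < 1:
        raise ValueError('page_number starts at 1')
    # The fragment stores a 0-based page index, so page 2 becomes index 1
    return f'{BASE_URL}#search=1~gallery~{page_number - 1}~0'

for n in (1, 2, 3):
    print(page_url(n))

# The page index lives in the URL fragment (the part after "#")
print(urlsplit(page_url(2)).fragment)
```

Note that a URL fragment is handled by the browser and is not sent to the server in an HTTP request, so fetching these URLs with `requests` may return the same static HTML regardless of the page index.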

### 3. Fetch Listing URLs

In [1]:
# Import packages
from bs4 import BeautifulSoup
import requests
import time

In [2]:
# Use `requests` to access the first page of the “free” section, ordered “newest” first
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://sfbay.craigslist.org/search/zip#search=1~gallery~0~0'
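# Pass the headers by keyword; the second positional argument of `requests.get` is `params`
page = requests.get(url, headers=headers)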
soup = BeautifulSoup(page.content, 'html.parser')

# Print the html structure
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <meta content="craigslist" property="og:site_name"/>
  <meta content="preview" name="twitter:card"/>
  <meta content="SF bay area free stuff - craigslist" property="og:title"/>
  <meta content="SF bay area free stuff - craigslist" name="description"/>
  <meta content="SF bay area free stuff - craigslist" property="og:description"/>
  <meta content="https://sfbay.craigslist.org/search/zip" property="og:url"/>
  <title>
   SF bay area free stuff - craigslist
  </title>
  <link href="https://sfbay.craigslist.org/search/zip" rel="canonical"/>
  <link href="https://sfbay.craigslist.org/search/zip" hreflang="x-default" rel="alternate"/>
  <link href="/favicon.ico" id="favicon" rel="icon">
   <script id="ld_searchpage_data" type="application/ld+json">
    {"breadcrumb":{"itemListElement":[{"@type":"ListItem",

I identified the "li" elements whose class is "cl-static-search-result"; each contains an "a" tag as a direct child. So I use the selector "li.cl-static-search-result > a" to locate each link and extract its "href" attribute.

In [3]:
# Select the <a> tag that is a direct child of each <li class="cl-static-search-result">
links_1 = soup.select('li.cl-static-search-result > a')
print(len(links_1))

360


In [4]:
# Print all links in the list
for link in links_1:
    print(link['href'])

https://sfbay.craigslist.org/sfc/zip/d/san-francisco-ant-moats-for-jewel-box/7715575775.html
https://sfbay.craigslist.org/sby/zip/d/los-gatos-kids-slide-free/7715575332.html
https://sfbay.craigslist.org/sfc/zip/d/san-francisco-carboys-2/7715575080.html
https://sfbay.craigslist.org/sby/zip/d/los-gatos-high-chair-free/7715575033.html
https://sfbay.craigslist.org/sfc/zip/d/san-francisco-italian-glass-table/7708984924.html
https://sfbay.craigslist.org/sfc/zip/d/san-francisco-leather-couch-and-love/7710283773.html
https://sfbay.craigslist.org/eby/zip/d/oakland-free-dresser/7713921109.html
https://sfbay.craigslist.org/eby/zip/d/walnut-creek-leather-sofa/7715573752.html
https://sfbay.craigslist.org/nby/zip/d/mill-valley-free-packing-materials/7715573507.html
https://sfbay.craigslist.org/nby/zip/d/mill-valley-dirt-devil-vacuum-works/7715573405.html
https://sfbay.craigslist.org/sfc/zip/d/san-francisco-kenmore-canister-vacuum/7715573196.html
https://sfbay.craigslist.org/eby/zip/d/hayward-pc-moni

#### Extract the first 250 unique listing URLs and save them to a list

In [7]:
# Create a new list
href_list = []

# Add the first 250 links to the new list
for link in links_1[:250]:
    href_link = link['href']
    href_list.append(href_link)

# Check the length of the new list
print(len(href_list))

250
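The loop above takes the first 250 links in page order and does not itself check for duplicates. If repeated URLs are a concern, one option is an order-preserving filter like the sketch below. `first_unique` is a hypothetical helper, and the sample list is made up for illustration.

```python
def first_unique(urls, limit=250):
    """Return up to `limit` URLs in their original order, skipping repeats."""
    seen = set()
    unique = []
    for url in urls:
        if len(unique) >= limit:
            break
        if url in seen:
            continue
        seen.add(url)
        unique.append(url)
    return unique

# Made-up sample data, not real listing URLs
sample = ['a.html', 'b.html', 'a.html', 'c.html']
print(first_unique(sample, limit=2))
print(first_unique(sample))
```

In the notebook this could be applied as `first_unique(link['href'] for link in links_1)`, assuming `links_1` holds the selected tags as above.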


In [8]:
# Print the new list
for i in href_list:
    print(i)

https://sfbay.craigslist.org/sfc/zip/d/san-francisco-ant-moats-for-jewel-box/7715575775.html
https://sfbay.craigslist.org/sby/zip/d/los-gatos-kids-slide-free/7715575332.html
https://sfbay.craigslist.org/sfc/zip/d/san-francisco-carboys-2/7715575080.html
https://sfbay.craigslist.org/sby/zip/d/los-gatos-high-chair-free/7715575033.html
https://sfbay.craigslist.org/sfc/zip/d/san-francisco-italian-glass-table/7708984924.html
https://sfbay.craigslist.org/sfc/zip/d/san-francisco-leather-couch-and-love/7710283773.html
https://sfbay.craigslist.org/eby/zip/d/oakland-free-dresser/7713921109.html
https://sfbay.craigslist.org/eby/zip/d/walnut-creek-leather-sofa/7715573752.html
https://sfbay.craigslist.org/nby/zip/d/mill-valley-free-packing-materials/7715573507.html
https://sfbay.craigslist.org/nby/zip/d/mill-valley-dirt-devil-vacuum-works/7715573405.html
https://sfbay.craigslist.org/sfc/zip/d/san-francisco-kenmore-canister-vacuum/7715573196.html
https://sfbay.craigslist.org/eby/zip/d/hayward-pc-moni

### 4. Save HTML Pages

In [9]:
# Use a loop to save every listing page as an HTML file
headers = {'User-Agent': 'Mozilla/5.0'}

for link in href_list:

    # Pause between requests
    time.sleep(5)

    # Use `requests` to fetch the listing page
    response = requests.get(link, headers=headers)

    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Take the last URL segment (post ID plus ".html") as the file name
    last_number_html = link.split('/')[-1]

    # Build the output path
    file_name = f"../Individual Project 1/{last_number_html}"

    # Write the prettified HTML to the file
    with open(file_name, 'w', encoding='utf-8') as file:
        file.write(soup.prettify())
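The file name above comes from splitting the link on "/". A slightly more defensive sketch, assuming the same output directory, parses the URL first so that any query string or fragment cannot leak into the file name. `html_path_for` is a hypothetical helper.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlsplit

def html_path_for(link, directory='../Individual Project 1'):
    """Return the local path for a listing URL (hypothetical helper)."""
    # urlsplit separates the path from any "?query" or "#fragment" part
    url_path = PurePosixPath(urlsplit(link).path)
    # .name keeps only the last path segment, for example "7715575775.html"
    return Path(directory) / url_path.name

link = 'https://sfbay.craigslist.org/sfc/zip/d/san-francisco-ant-moats-for-jewel-box/7715575775.html'
target = html_path_for(link)
print(target.name)  # file name with extension
print(target.stem)  # post ID without ".html"
```

The returned path can be passed directly to `open(target, 'w', encoding='utf-8')` in place of the f-string.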

## Part 2: Parsing and Displaying Information from Saved HTML

In [34]:
# Import package
import os

# Set work directory
directory = '../Individual Project 1'

In [37]:
# Initialize the item counter
n = 0

# Loop through each file in the directory
for filename in os.listdir(directory):

    # Check if the file ends with .html
    if filename.endswith(".html"):

        # Construct the full file path
        filepath = os.path.join(directory, filename)

        # Read file to string
        with open(filepath, 'r', encoding='utf-8') as file:
            html = file.read()
        
        # Use BeautifulSoup to parse the file content
        soup = BeautifulSoup(html, 'html.parser')
    
        # Item number
        n = n + 1
        print('item number: '+ str(n))
        
        # Title
        titles = soup.select('span#titletextonly')
        for title in titles:
            print('title: ' + title.text.strip())
    
        # URL of the first image
        images = soup.select('img[title="1"]')
        for image in images:
            print('url of first image: ' + image['src'])
    
        # Description
        description = soup.select('section#postingbody')
        for des in description:
            print('description: ' + des.text.strip())
        
        # Post ID
        post_id = soup.select('p.postinginfo')
        if len(post_id)>1:
            print(post_id[1].text.strip())
        
        # Posted date
        date = soup.select('time')
        if len(date)>1:
            print('posted date: ' + date[1].text.strip())
        else:
            print('posted date: NA')
        
        # Updated date
        if len(date)>2:
            print('updated date: ' + date[2].text.strip())
        else: 
            print('updated date: NA')
        
        
        # Dividing line between items
        print('---------------------------------------------------------------------------------------')

item number: 1
title: 3 drawer lateral file cabinet
url of first image: https://images.craigslist.org/00Q0Q_5aDc88w6ZRt_0t20CI_600x450.jpg
description: QR Code Link to This Post
       



      Giving away a lateral file cabinet in excellent condition. I dont have the keys, so it wont lock. other than that it works great. I am having a bunch of stuff hauled away on Saturday morning, so you need to pick it up before the haulers get there. Please dont ask if it is available, and don't tell me you want it if you are not going to show up.
post id: 7715526266
posted date: 2024-02-07 16:39
updated date: NA
---------------------------------------------------------------------------------------
item number: 2
title: FREE TEDDY BEARS & GALAXY ROSES FOR VALENTINE'S DAY TO SPREAD LOVE
url of first image: https://images.craigslist.org/00Y0Y_8bQG1tUwXSq_0cI0oc_600x450.jpg
description: QR Code Link to This Post
       



      FREE TEDDY BEARS, VALENTINE PLUSH TOYS & GALAXY ROSES FOR VALENTINE'S D

item number: 54
title: Free canned cat food
url of first image: https://images.craigslist.org/00R0R_9OJGBUVPlrT_0t20CI_600x450.jpg
description: QR Code Link to This Post
       



      Some primo stuff that we can’t use…first come first served. 1626 Myrtle street Oakland. Will remove ad when I see it gone
post id: 7715572085
posted date: 2024-02-07 20:45
updated date: 2024-02-07 20:46
---------------------------------------------------------------------------------------
item number: 55
title: FREE Razer Abyssus V2 Gaming Mouse
url of first image: https://images.craigslist.org/00m0m_fAOx73OSy74_0lM0t2_600x450.jpg
description: QR Code Link to This Post
       



      FREE Razer Abyssus V2 Gaming Mouse
post id: 7715572590
posted date: 2024-02-07 20:49
updated date: NA
---------------------------------------------------------------------------------------
item number: 56
title: Books - A Beginners Guide to Creative Effects for Your Model Railroad
url of first image: https://images.cra

item number: 108
title: FREE Large Filing Cabinet
url of first image: https://images.craigslist.org/01010_2zfzp43NRQY_0t20CI_600x450.jpg
description: QR Code Link to This Post
       



      Large metal Filing Cabinet
      
      It looks brand new, stored in the garage
      

      I have an industrial dolly you may use to assist you, but you will need 2 people to move this and a truck
      


      42”wide x 20”deep x 68”hi
      
      5 Drawers, 13” deep each
      

      It will accommodate Legal  or Letter size files
post id: 7715571634
posted date: 2024-02-07 20:41
updated date: 2024-02-07 20:45
---------------------------------------------------------------------------------------
item number: 109
title: Receiver
url of first image: https://images.craigslist.org/00W0W_5mLCInwmLCH_0CI0t2_600x450.jpg
description: QR Code Link to This Post
       



      Yamaha NaturalSound Av Receiver RXV473
post id: 7715524228
posted date: 2024-02-07 16:31
updated date: 2024-02-07 16:38


item number: 151
title: Re-home Little Free Library
description: QR Code Link to This Post
       



      I have a little free library that I cannot maintain because I am moving.
      

      It is free standing and movable. Made out of a pot filled with cement with a metal ikea cupboard on top.
      

      It weighs around 70 lbs and is around 4.5 feet x 2.5 feet x 2.5 feet. It should fit into a van/suv with the seats folded down and lifted by 2 people.
      

      All books inside are included! Feel free to just grab it from the corner of Lakewood and Royalvale by Northwood park.
post id: 7707650061
posted date: 2024-01-15 03:33
updated date: 2024-02-07 17:41
---------------------------------------------------------------------------------------
item number: 152
title: Glass Shower Doors
url of first image: https://images.craigslist.org/00V0V_bK77QMdcxSF_0t20CI_600x450.jpg
description: QR Code Link to This Post
       



      Two doors 24" x 67 7/8" with obscure texture. Don

item number: 206
title: Storage bin (for yard waste)
description: QR Code Link to This Post
       





      This is NOT suitable for storage since the lid doesn't lock securely and water has the ability to get inside.
      

      Best used to pick up leaves, yard waste, etc.
post id: 7712840017
posted date: 2024-01-30 14:37
updated date: 2024-02-07 10:39
---------------------------------------------------------------------------------------
item number: 207
title: 2013 Highlander Rear Area Mat
url of first image: https://images.craigslist.org/00G0G_ldmvuikd2fk_0CI0t2_600x450.jpg
description: QR Code Link to This Post
       



      We no longer have this vehicle, and had pulled this out when we purchased a heavy duty rubber mat to put in the back.
post id: 7715450519
posted date: 2024-02-07 12:52
updated date: NA
---------------------------------------------------------------------------------------
item number: 208
title: Free! 2 FADE-OUT Calculation and Sketch Pads
url of firs
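
Each description printed above begins with the label "QR Code Link to This Post" followed by several blank lines, because that label sits inside the same `section#postingbody` element as the description. A minimal text-only sketch for tidying this up is shown below. `clean_description` is a hypothetical helper, and it assumes the label always appears at the very start of the extracted text, as it does in the output shown here.

```python
QR_LABEL = 'QR Code Link to This Post'

def clean_description(text):
    """Strip the leading QR label and blank lines from a description."""
    text = text.strip()
    if text.startswith(QR_LABEL):
        text = text[len(QR_LABEL):]
    # Trim each line and drop the ones that are empty
    lines = [line.strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)

raw = 'QR Code Link to This Post\n       \n\n\n\n      FREE Razer Abyssus V2 Gaming Mouse'
print(clean_description(raw))
```

In the parsing loop, this could wrap the existing call as `clean_description(des.text)` before printing.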