Rishikesan "Rishi" Ravichandran

In [1]:
from bs4 import BeautifulSoup 
import time 
import requests 
import re 
import os

<br>

Let us determine how the URL changes when we change the page. For this we will use our URL sorted by oldest items, "https://sfbay.craigslist.org/search/zip?sort=dateoldest#search=1~gallery~0~0". When we move between pages, the value after the second tilde changes, i.e. "1\~gallery\~**page_number**\~0", where 'page_number' is the part that changes.

The variable associated with the page change is 'search', as seen in "search=1\~gallery\~**page_number**\~0". Note that 'search', which follows the first variable 'sort', controls more than the page: it also holds other options such as the 'Gallery'/'Preview' view. Only the number immediately after the second tilde selects the page.

Interestingly, when we load the original URL "https://sfbay.craigslist.org/search/zip", the site appends "#search=1\~gallery\~0\~0" to it, and when we add or change the sorting, a new variable, 'sort', is placed before it.
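The URL pattern described above can be sketched as a small helper. This is only an illustration: `BASE_URL` and `page_url` are hypothetical names, and the zero-based page index is an assumption based on the "\~0\~0" suffix seen on the first page. Note also that everything after `#` is a URL fragment, which a browser keeps on the client side and does not send to the server, so this describes the address shown in the browser.

```python
# Hypothetical helper: builds the browser URL for a given results page,
# following the "#search=1~gallery~<page_number>~0" pattern observed above.
BASE_URL = "https://sfbay.craigslist.org/search/zip?sort=dateoldest"


def page_url(page_number, view="gallery"):
    """Return the URL for a zero-based page number (assumed indexing)."""
    if page_number < 0:
        raise ValueError("page_number must be non-negative")
    # Only the value after the second tilde changes between pages.
    return f"{BASE_URL}#search=1~{view}~{page_number}~0"


for n in range(3):
    print(page_url(n))
```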

In [228]:

def fetch_top_250_url():
    """
    Function to fetch the top 250 listing URLs.
    """

    headers = {'User-agent': 'Mozilla/5.0'}

    #From our previous exploration, we know the URL that gives us the "free" section, ordered "newest" first.
    url = "https://sfbay.craigslist.org/search/zip?sort=date"

    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')

    #Collect every list item on the page. select() filters by its CSS selector string only,
    #so the result also contains <li> elements that are not listings.
    results_list = soup.select('li')

    print("\n*** Function Output Start ****\n")
    print("A sample of listing items,\n\n",results_list[0:3])

    #The sample above shows that the first element is the "see also" item,
    #so the listings start at the second element of 'results_list'

    print(f"\n\nThe total number of listing items fetched at this time using the requests is {len(results_list)}\n\n")

    #During exploration the website showed only the first 120 items per page, but the response
    #to requests contains more than 300, so we don't have to request page 2, page 3 etc.
    #Since our objective is only to fetch the top 250 items, we'll just use this single response.

    #Take elements 1 to 250 to store the URLs of the top 250 listed items
    items_url = []
    for item in results_list[1:251]:
        items_url.append(item.select_one('a').get('href'))
    print("\n*** Function Output End ****\n")
    return items_url

top_250_urls=fetch_top_250_url()

print("The top 250 urls are shown below: \n\n")
top_250_urls


*** Function Output Start ****

A sample of listing items,

 [<li class="cl-static-hub-links">
<div>see also</div>
</li>, <li class="cl-static-search-result" title="Evenflo  PIVOT XPAND TRAVEL SYSTEM">
<a href="https://sfbay.craigslist.org/sby/zip/d/campbell-evenflo-pivot-xpand-travel/7714053414.html">
<div class="title">Evenflo  PIVOT XPAND TRAVEL SYSTEM</div>
<div class="details">
<div class="price">$0</div>
<div class="location">
                        campbell
                    </div>
</div>
</a>
</li>, <li class="cl-static-search-result" title="Treadmill">
<a href="https://sfbay.craigslist.org/eby/zip/d/lafayette-treadmill/7715509943.html">
<div class="title">Treadmill</div>
<div class="details">
<div class="price">$0</div>
<div class="location">
                        lafayette / orinda / moraga
                    </div>
</div>
</a>
</li>]


The total number of listing items fetched at this time using the requests is 361



*** Function Output End ****

The top 250 urls are

['https://sfbay.craigslist.org/sby/zip/d/campbell-evenflo-pivot-xpand-travel/7714053414.html',
 'https://sfbay.craigslist.org/eby/zip/d/lafayette-treadmill/7715509943.html',
 'https://sfbay.craigslist.org/sby/zip/d/san-jose-bath-light-bar-light/7715509713.html',
 'https://sfbay.craigslist.org/nby/zip/d/sebastopol-vinyl-floor-tiles/7715509520.html',
 'https://sfbay.craigslist.org/sfc/zip/d/san-francisco-scooter-nanrobot/7715509412.html',
 'https://sfbay.craigslist.org/sfc/zip/d/san-francisco-two-guitar-stands/7715509163.html',
 'https://sfbay.craigslist.org/sfc/zip/d/san-francisco-free-binders/7715508675.html',
 'https://sfbay.craigslist.org/sby/zip/d/san-jose-delonghi-space-heater/7715508529.html',
 'https://sfbay.craigslist.org/sfc/zip/d/san-francisco-free-stuff-see-pics/7713265862.html',
 'https://sfbay.craigslist.org/sby/zip/d/san-jose-large-sicker-dog-bed-and-pad/7715507916.html',
 'https://sfbay.craigslist.org/sfc/zip/d/san-francisco-free-outdoor-candle-holder/7712855788.html',
 '

Here, after parsing the response with 'html.parser' into a soup object, we use BeautifulSoup's select() to find the list items on the results page. At a high level, the structure is
**requests->class='cl-results-page'->li->(a->href)**, where (a->href) applies to each item in the list. As the sample output shows, the first 'li' returned is the "see also" element, so the listings themselves start at the second element.

Alternate method: instead of using select('a'), we can first select the 'li' elements, then use findChildren() to get the children of each 'li'; the first child is the \<a> element that holds our URL. We can then use this logic in a loop similar to the one above to collect all the listed items. The approach looks like "soup.select('li')[i].findChildren()[0].get('href')" to get the URL of an item.

The top 250 URLs were printed because our first request to the sorted URL returned more than 250 listing items. We validated this against the website by checking the top 250 items there.
We then selected each 'li', selected the 'a' inside it, and read the URL from its 'href' for the top 250 items under \<li> \</li>.
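As another possible variation, a CSS class selector can restrict the match to listing items, so the "see also" item does not have to be skipped by index. The sketch below runs on a small inline HTML sample modelled on the output shown earlier. `SAMPLE_HTML`, the surrounding `<ul>`, and the function name `listing_urls` are made up for illustration, and the class name `cl-static-search-result` is taken from that sample output and may change on the live site.

```python
from bs4 import BeautifulSoup

# Made-up sample that mirrors the structure of the printed listing items.
SAMPLE_HTML = """
<ul>
  <li class="cl-static-hub-links"><div>see also</div></li>
  <li class="cl-static-search-result" title="Evenflo  PIVOT XPAND TRAVEL SYSTEM">
    <a href="https://sfbay.craigslist.org/sby/zip/d/campbell-evenflo-pivot-xpand-travel/7714053414.html">
      <div class="title">Evenflo  PIVOT XPAND TRAVEL SYSTEM</div>
    </a>
  </li>
  <li class="cl-static-search-result" title="Treadmill">
    <a href="https://sfbay.craigslist.org/eby/zip/d/lafayette-treadmill/7715509943.html">
      <div class="title">Treadmill</div>
    </a>
  </li>
</ul>
"""


def listing_urls(html, limit=250):
    """Return up to `limit` listing URLs found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # Match only <a> tags that are direct children of listing <li> items.
    anchors = soup.select("li.cl-static-search-result > a[href]")
    return [a["href"] for a in anchors[:limit]]


for url in listing_urls(SAMPLE_HTML):
    print(url)
```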


We didn't use the pagination feature of Craigslist to navigate through pages here, although this could also be done. From our exploration we know that the website lists only 120 items per page, so we could run a request for each page and take items 1-120 from the first two pages and items 1-10 from the last page. While doing this, we should also add a time.sleep() between requests so that we do not bombard the website with many requests in a very short time.
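The split of 250 items across pages of 120 can be worked out with a short helper. This is a sketch of the arithmetic only and makes no requests; `page_quotas` is a hypothetical name, and the page size of 120 is the value observed during exploration.

```python
def page_quotas(total_items=250, items_per_page=120):
    """Return how many items to take from each successive page."""
    if items_per_page <= 0:
        raise ValueError("items_per_page must be positive")
    quotas = []
    remaining = total_items
    while remaining > 0:
        # Take a full page, or whatever is still needed on the last page.
        take = min(items_per_page, remaining)
        quotas.append(take)
        remaining -= take
    return quotas


print(page_quotas())  # [120, 120, 10]
```

In a paginated version, each entry of the returned list would correspond to one request, with a pause such as time.sleep() between them.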

Additional note: This part of the code (4.) was executed more recently than the upcoming part, i.e. '5. Saving HTML Pages'. Hence, there might be some differences between the listing above and the locally downloaded files, which will be used further in Part 2 of this project.

### Save HTML Pages

In [229]:
def save_url(urls):
    """
    Function to download each listing URL and save the page as a local HTML file.
    """
    headers = {'User-agent': 'Mozilla/5.0'}

    #Make sure the output directory exists before writing to it
    os.makedirs('./urls', exist_ok=True)

    for url in urls:
        #Get the id of the listing (the file name at the end of the URL, without '.html')
        listing_id = re.findall(r'.*/(.*)\.html', url)[0]

        #Sleep for 8 sec before sending each request
        time.sleep(8)

        page = requests.get(url, headers=headers)

        #Save the raw bytes as a binary file to avoid any encoding errors while writing
        with open(f'./urls/{listing_id}.html', 'wb') as file:
            file.write(page.content)

#Save each listing page as an individual html file
save_url(top_250_urls)
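The listing id used for the file name can also be taken from the URL path without a regular expression. The following is a minimal sketch, assuming that listing URLs end in a numeric id followed by ".html", as in the URLs printed earlier; `listing_id_from_url` is a hypothetical helper name.

```python
import posixpath
from urllib.parse import urlparse


def listing_id_from_url(url):
    """Return the numeric listing id at the end of a listing URL."""
    # URL paths always use forward slashes, so posixpath is appropriate here.
    name = posixpath.basename(urlparse(url).path)
    stem, ext = posixpath.splitext(name)
    # Assumption: ids are all digits and pages end in ".html".
    if ext != ".html" or not stem.isdigit():
        raise ValueError(f"unexpected listing URL: {url}")
    return stem


print(listing_id_from_url(
    "https://sfbay.craigslist.org/eby/zip/d/lafayette-treadmill/7715509943.html"
))
```

Raising an error on an unexpected URL makes a malformed entry visible before any file is written.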

<br>
<br>

In [249]:
### Read Saved HTML Files

#Relative directory of all the saved HTML files
directory = "./urls"

#Read each file and get information
for filename in os.listdir(directory):

    # Checking if the file ends with .html
    if filename.endswith(".html"):

        print(filename)
        #Constructing the full relative file path
        filepath = os.path.join(directory, filename)

        #Getting the id from the file name (the name without the .html extension)
        file_id = os.path.splitext(filename)[0]

        #Reading file to string
        with open(filepath, 'r', encoding='utf-8') as file:
            html = file.read()

        #Call the function to get information
        extract_information(html, file_id)


### 2. Extract Information

def extract_information(page, file_id):
    """Function to extract information from individual listing pages."""

    soup = BeautifulSoup(page, 'html.parser')

    #Title of the listing
    title = soup.select_one('#titletextonly').text

    #URL of the first image; if there is no first image, use a "Not Available" string
    first_slide = soup.find('div', class_='slide first visible')
    img_url = first_slide.find('img').get('src') if first_slide is not None else "Not Available"

    #Description of the listing (the first line of text after the QR code label)
    des = soup.select_one('#postingbody').text
    des = re.findall('QR Code Link to This Post\n\n\n(.*)', des)[0]

    #The 'postinginfos' block holds the post id and the dates
    info_paragraphs = soup.find('div', class_='postinginfos').select('p')

    #Post id
    post_id = re.findall('.*: (.*)', info_paragraphs[0].text)[0]

    #Posted Date (keep only the YYYY-MM-DD part of the datetime attribute)
    date_pattern = r'(\d{4}-\d{2}-\d{2})T.*'
    posted_date = info_paragraphs[1].select('time')[0].get('datetime')
    posted_date = re.findall(date_pattern, posted_date)[0]

    #Last Updated Date; if no 'updated' entry exists under the class postinginfos, use a fallback message
    if len(info_paragraphs) > 2 and 'updated' in info_paragraphs[2].text:
        last_updated_date = re.findall(date_pattern, info_paragraphs[2].select('time')[0].get('datetime'))[0]
    else:
        last_updated_date = "Date not available!"

    print(f"For HTML File ID: {file_id}\n")
    print(f"Title of the listing: {title}\n")
    print(f"The URL of the first image: {img_url}\n")
    print(f"The description of the listing: {des}\n")
    print(f"The Post ID: {post_id}\n")
    print(f"The posted date of the listing: {posted_date}\n")
    print(f"The last updated date of the listing: {last_updated_date}\n")
    print("\n\n")
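The description regex in extract_information keeps only the first line after the "QR Code Link to This Post" label, because `.` does not match newlines by default. If the full multi-line description is wanted, one option is to split on the label instead. This is a sketch under the assumption that the posting body begins with that label text, as the regex implies; `MARKER` and `clean_description` are hypothetical names.

```python
# Label text that precedes the description in the posting body (from the regex above).
MARKER = "QR Code Link to This Post"


def clean_description(raw_text):
    """Return the description text after the QR code label, whitespace-trimmed."""
    _before, marker, after = raw_text.partition(MARKER)
    # If the label is missing, fall back to the whole text.
    body = after if marker else raw_text
    return body.strip()


sample = "\n\nQR Code Link to This Post\n\n\nportable crib\nmust pick up today\n"
print(clean_description(sample))
```

Falling back to the whole text avoids the IndexError that an empty `re.findall` result would cause when the label is absent.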

7715222808.html
For HTML File ID: 7715222808:

Title of the listing: Free 1 queen duvet insert and 6 pillows, must take all right now

The URL of the first image: https://images.craigslist.org/00101_5LmfxxIj6tG_0t20CI_600x450.jpg

The description of the listing: PLEASE DO NOT ASK IF AVAILABLE.

The Post ID: 7715222808

The posted date of the listing: 2024-02-06

The last updated date of the listing: 2024-02-06




7712180575.html
For HTML File ID: 7712180575:

Title of the listing: portable crib (graco?)

The URL of the first image: https://images.craigslist.org/00C0C_jdpU8Utkg0Y_0t20CI_600x450.jpg

The description of the listing: portable crib

The Post ID: 7712180575

The posted date of the listing: 2024-01-28

The last updated date of the listing: 2024-02-06




7715106513.html
For HTML File ID: 7715106513:

Title of the listing: Retro TV - Good for Retro Gaming

The URL of the first image: https://images.craigslist.org/00F0F_h3d0GBuVExe_0CI0t2_600x450.jpg

The description of the li