# Part 1: Scraping and Saving HTML Content

## 1. Identify the target

In [35]:
from bs4 import BeautifulSoup
import requests
import time

#Following the class's codes
headers = {'User-Agent': 'Mozilla/5.0'}

url = 'https://sfbay.craigslist.org/search/zip'
page = requests.get(url, headers)
soup = BeautifulSoup(page.content, 'html.parser')

print(soup.prettify())

time.sleep(5)

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <meta content="craigslist" property="og:site_name"/>
  <meta content="preview" name="twitter:card"/>
  <meta content="SF bay area free stuff - craigslist" property="og:title"/>
  <meta content="SF bay area free stuff - craigslist" name="description"/>
  <meta content="SF bay area free stuff - craigslist" property="og:description"/>
  <meta content="https://sfbay.craigslist.org/search/zip" property="og:url"/>
  <title>
   SF bay area free stuff - craigslist
  </title>
  <link href="https://sfbay.craigslist.org/search/zip" rel="canonical"/>
  <link href="https://sfbay.craigslist.org/search/zip" hreflang="x-default" rel="alternate"/>
  <link href="/favicon.ico" id="favicon" rel="icon">
   <script id="ld_searchpage_data" type="application/ld+json">
    {"breadcrumb":{"itemListElement":[{"name":"sfbay.craig

## 2. Interact with the Page-Sorting

**Observe any changes in the URL after you change the sorting order back and forth:**  
Based on my observation, when I interact with the sorting feature on Craiglist page, it appears showing the newest version automatically. 

**Can you trigger the sorting change directly by modifying only the URL in your browser’s address bar?**
If I want to trigger sorting change programmatically, I would just use URLs and choose based on what I need in my project. 
For example, creating two different pages:
urlnewest = 'https://sfbay.craigslist.org/search/zip'
page = requests.get(url, headers)
soup1 = BeautifulSoup(page.content, 'html.parser')

urloldest = 'https://sfbay.craigslist.org/search/zip?sort=dateoldest#search=1~gallery~0~0'
page = requests.get(url, headers)
soup2 = BeautifulSoup(page.content, 'html.parser')

**Explain what type of request is made when you change the sort order (GET or POST)!**
This is GET request since whenever I change the parameter in the URL, it will request data from a specified content page.

**What is the variable in the URL associated with sorting?**
As we can see from the URL change here:
- newest: https://sfbay.craigslist.org/search/zip?sort=date#search=1~gallery~0~0
- oldest: https://sfbay.craigslist.org/search/zip?sort=dateoldest#search=1~gallery~0~0

The variable that is associated with sorting above:
sort = date
sort = dateoldest

## 3. Interact with the Page-Pagination:

**Determine how to move between pages by only changing the URL.  What part of the URL changes as you navigate through different pages?**

When I observed changes between the first page and the second page, the part of the URL that is changed is: gallery~0~0. It becomes gallery~1~0, when I navigate through the second page, and third:= gallery~2~0. It is located after '#'. As I explore on the internet, one source W3C.org  explains that fragment identifier is the string after URI, after hash identify something specific, usually as part or view. Exactly same in this case for craig. 
First page: https://sfbay.craigslist.org/search/zip?sort=date#search=1~gallery~0~0  

**This task will help you understand how pagination works on Craigslist and how you can programmatically access different pages of listings?**  
First, I need to identify the pagination parameter, look for the changes in the query parameter that change when I click on a different page as I mentioned in prior answer. Then I can construct URLs manually in python to access each page. Finally I can use loop through pages in my script to increment the pagination parameter and then fetch each page's content.  

**Identify the variable associated with page changes.  How does altering this variable in the URL affect the page you’re viewing?**  
The variable associated with change in page is the ~gallery~. By modifying this variable it can affect the page I'm viewing. The web uses the increase and decrease value for this variable to navigate through different pages for listing.

## 4. Fetch Listing URLs:

**Use `requests` to access the first page of the “free” section, ordered “newest” first**  
**Deploy `BeautifulSoup` to parse the HTML content**

In [37]:
from bs4 import BeautifulSoup
import requests
import time

#Following the class's codes
headers = {'User-Agent': 'Mozilla/5.0'}

url = 'https://sfbay.craigslist.org/search/zip'
page = requests.get(url, headers)
soup = BeautifulSoup(page.content, 'html.parser')


time.sleep(5)

**Identify the structure that holds the links to individual listing pages.  What selector do you choose to grab the link?**  
If we take a look on the structure, the individual listing page is inside the element `<li>` with class `cl-static-search-results` that filed with the hyperlink for individual listing page.
    
  

**Can you identify one more possible selection method to retrieve the link to the individual listing?  Explain!**  
Another solution that I can think of is to using soup.select and use `a` tag directly because when I observe on each listing a tags have href attribute.

**Extract the first 250 unique listing URLs and save them to a list.  Consider the pagination feature of Craigslist to navigate through pages.  Explain your strategy**

In [78]:
from bs4 import BeautifulSoup
import requests
import time

#Following the class's codes
headers = {'User-Agent': 'Mozilla/5.0'}

url = 'https://sfbay.craigslist.org/search/zip'
page = requests.get(url, headers)
soup = BeautifulSoup(page.content, 'html.parser')

#Setting 5 seconds delay
time.sleep(5)

In [88]:
#Find listing for all URLs on the main page using soup.find_all
find_all_listings = soup.find_all('li',class_='cl-static-search-result')

In [89]:
print(find_all_listings)

[<li class="cl-static-search-result" title="Free Outdoor Patio Wood Burning Oven Fireplace Heater">
<a href="https://sfbay.craigslist.org/scz/zip/d/santa-cruz-free-outdoor-patio-wood/7712235825.html">
<div class="title">Free Outdoor Patio Wood Burning Oven Fireplace Heater</div>
<div class="details">
<div class="price">$0</div>
<div class="location">
                        santa cruz
                    </div>
</div>
</a>
</li>, <li class="cl-static-search-result" title="Set of 4 road flares, 20 minute, red">
<a href="https://sfbay.craigslist.org/eby/zip/d/fremont-set-of-road-flares-20-minute-red/7715224622.html">
<div class="title">Set of 4 road flares, 20 minute, red</div>
<div class="details">
<div class="price">$0</div>
<div class="location">
                        fremont / union city / newark
                    </div>
</div>
</a>
</li>, <li class="cl-static-search-result" title="Free 1 queen duvet insert and 6 pillows, must take all right now">
<a href="https://sfbay.craigslis

**Print the list to screen & Explain your strategy**

In [94]:
unique_urls = []
url_sets = set() #I'm using set to ensure uniqueness of URLs
current_url = 'https://sfbay.craigslist.org/search/zip'

#In here I use while loop so my code keep running until we find 250 unique URLs
#inside the loop, I use GET request for the first page, then iterates over these listing, cheking href attribute of the class selected. 

while len(unique_urls) < 250:
    response = requests.get(current_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    listings = soup.find_all('li', class_='cl-static-search-result')

    for listing in listings:
        a_tag = listing.find('a')
        if a_tag and a_tag['href'] not in url_sets:
            url_sets.add(a_tag['href'])
            unique_urls.append(a_tag['href'])
            if len(unique_urls) == 250:
                break
                
    #Then since I can't rely only on the first page to have 250 listings, this code will check for next page button and extract href attribute to update current url.
    next_button = soup.select_one('.cl-next-page')
    if next_button:
        next_page_url = next_button['href']
        current_url = f"https://sfbay.craigslist.org{next_page_url}"
    else:
        break
    time.sleep(5) #always use 5 second pause after processing each page

for url in unique_urls:
    print(url)
time.sleep(5) 

https://sfbay.craigslist.org/eby/zip/d/fairfield-electric-mower/7715234400.html
https://sfbay.craigslist.org/nby/zip/d/santa-rosa-free-treadmill/7715234306.html
https://sfbay.craigslist.org/sby/zip/d/san-jose-free-42-inch-round-glass-table/7715233115.html
https://sfbay.craigslist.org/pen/zip/d/los-altos-two-black-metal-drafting-lamps/7715232158.html
https://sfbay.craigslist.org/pen/zip/d/san-carlos-power-wheelchair-invacare/7715232044.html
https://sfbay.craigslist.org/pen/zip/d/los-altos-flagpole-mount-for-tow/7715231843.html
https://sfbay.craigslist.org/sfc/zip/d/san-francisco-large-holywood-regency/7715231346.html
https://sfbay.craigslist.org/eby/zip/d/richmond-fish-tanks-and-brine-shrimp/7715231073.html
https://sfbay.craigslist.org/eby/zip/d/vallejo-free-windows-again/7715231009.html
https://sfbay.craigslist.org/nby/zip/d/san-rafael-truck-top-for-pickup/7715230177.html
https://sfbay.craigslist.org/eby/zip/d/danville-baby-changing-table/7715230167.html
https://sfbay.craigslist.org/pe

## 5. Save HTML Pages

**For each of the 250 listing URLs, use `requests` to fetch the listing page & Save each HTML content to a separate file on disk**

In [95]:
import requests
import time
import os

In [96]:
# I create a new directory name for storing HTML files in my computer
directory_name = "craigslistVania"

# Loop over the listing URLs
for url in unique_urls:
    # Extract the listing ID from the URL
    listing_id = url.split('/')[-1].split('.html')[0]
    
    # Fetch the content of the listing
    response = requests.get(url)
    
    # If the response was successful, save the content to a file
    if response.status_code == 200:
        # Ensure the directory exists
        if not os.path.exists(directory_name):
            os.makedirs(directory_name)
        
        # Create the full path for the new file
        file_path = f"{directory_name}/{listing_id}.html"
        
        # Write the content to the file
        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(response.text)
        print(f"Saved: {file_path}")
    else:
        print(f"Failed to fetch {url} with status code: {response.status_code}") #Here i just follow content in stackoverflow mentioning for error handling we can use else block showuing the server respond 
    
    # Delay between request for three seconds 
    time.sleep(3)


Saved: craigslistVania/7715234400.html
Saved: craigslistVania/7715234306.html
Saved: craigslistVania/7715233115.html
Saved: craigslistVania/7715232158.html
Saved: craigslistVania/7715232044.html
Saved: craigslistVania/7715231843.html
Saved: craigslistVania/7715231346.html
Saved: craigslistVania/7715231073.html
Saved: craigslistVania/7715231009.html
Saved: craigslistVania/7715230177.html
Saved: craigslistVania/7715230167.html
Saved: craigslistVania/7712529387.html
Saved: craigslistVania/7715228387.html
Saved: craigslistVania/7715227930.html
Saved: craigslistVania/7713797530.html
Saved: craigslistVania/7712881922.html
Saved: craigslistVania/7715226780.html
Failed to fetch https://sfbay.craigslist.org/sby/zip/d/san-jose-free-upright-yamaha-piano/7715210611.html with status code: 410
Saved: craigslistVania/7715225326.html
Saved: craigslistVania/7712235825.html
Saved: craigslistVania/7715224622.html
Saved: craigslistVania/7715222808.html
Saved: craigslistVania/7713596495.html
Saved: craigsl

**Thoughts:** After it is successfully saved, I open each file with a text editor to test whether the file is already stored or not and it appears that the HTML file has been saved in my computer as the content is displayed. Both the image and the description from the Craiglist are visible.

# Part 2: Parsing and Displaying Information from Saved HTML

## 1. Read Saved HTML Files

**Following the code in the instruction from the assignmentm,this script will work to:**

Loop through the files in the directory I created on my laptop.
Open and read each .html file.
Parse the content with BeautifulSoup.
Extract the information I want to have from each html file which are their title, image URL, description, post ID, posted date, and last updated date.
Print these details with the format I created.

## 2. Extract Information:

**For each HTML file, use `BeautifulSoup` to parse the file content.
Extract and print the following details**

In [97]:
from bs4 import BeautifulSoup
import requests
import time


#Set my directory where I save the html page before in my laptop
directory = '/Users/vania/Downloads/DDR/craigslistVania'

# Loop through each file in the directory
for filename in os.listdir(directory):
    # By chcking if the file ends with .html
    if filename.endswith(".html"):
        # Create the complete file path
        filepath = os.path.join(directory, filename) #Following the code from assignment and modify a bit
        # Read file to string
        with open(filepath, 'r', encoding='utf-8') as file:
            html_content = file.read()
            
            # Using BeautifulSoup to parse the HTML content
            soup = BeautifulSoup(html_content, 'html.parser')
            
            # Extract the title of the listing
            title = soup.find('title').text if soup.find('title') else 'Not found'
            
            # Extract the URL of the first image
            image_url = soup.find('img')['src'] if soup.find('img') else 'Not found'
            
            # Extract the full dscription text of the listing
            description = soup.find(id='postingbody').text if soup.find(id='postingbody') else 'Not found'
            
            # Extract the Post ID
            post_id = soup.find('p', class_='postinginfo').text if soup.find('p', class_='postinginfo') else 'Not found'
            
            # Extract Posted Date and Last Updted Date
            posted_date = soup.find(text='posted: ').find_next('time')['datetime'] if soup.find(text='posted: ') else 'Not found'
            last_updated_date = soup.find(text='updated: ').find_next('time')['datetime'] if soup.find(text='updated: ') else 'Not found'
            
            # Print
            print(f"Title: {title}")
            print(f"Image URL: {image_url}")
            print(f"Description: {description.strip()}")
            print(f"Post ID: {post_id}")
            print(f"Posted Date: {posted_date}")
            print(f"Last Updated Date: {last_updated_date}")
            print("\n----------\n")
            
time.sleep(5)

Title: Free 1 queen duvet insert and 6 pillows, must take all right now - free stuff - craigslist
Image URL: https://images.craigslist.org/00101_5LmfxxIj6tG_0t20CI_600x450.jpg
Description: QR Code Link to This Post


PLEASE DO NOT ASK IF AVAILABLE.

IF IT IS STILL POSTED, IT IS STILL AVAILABLE.

PLEASE LET ME KNOW IF YOU CAN PICK UP RIGHT NOW FROM ALBANY CA AND PROVIDE YOUR NUMBER TO TEXT. THANK YOU.

1 queen duvet insert and 6 pillows, must take all right now
Post ID: 
                    Posted
                    
                        2024-02-06 19:30
                    

Posted Date: 2024-02-06T19:30:58-0800
Last Updated Date: 2024-02-06T19:33:33-0800

----------

Title: Retro TV - Good for Retro Gaming - free stuff - craigslist
Image URL: https://images.craigslist.org/00F0F_h3d0GBuVExe_0CI0t2_600x450.jpg
Description: QR Code Link to This Post


FREE
Older  TV  -  Sony Trinitron XBR  
From the 1990's
A perfect monitor for retro gaming - LARGE screen
Works well with old gaming s

# Part 3: Automating Login on The Old Reader

## 1. Creating and Verifying a The Old Reader Account

I have created an account using my UC Davis email through manual login, and successfully entered the home page, confirming that my account is already registered and active.

## 2.	Exploring the Login Mechanism

**Use your browser’s developer tools to inspect the page, focusing on the form tag involved in the login process.**

In [99]:
#Following steps from neopets case
from bs4 import BeautifulSoup
import requests
import time

headers = {'User-agent': 'Mozilla/5.0'}

url = 'https://theoldreader.com/users/sign_in'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

# Print the unmodified page content
print(soup.prettify())
time.sleep(5)

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
  <link href="https://fonts.googleapis.com/css?family=Montserrat:400,600" rel="stylesheet" type="text/css"/>
  <!-- Latest compiled and minified JavaScript -->
  <script src="https://code.jquery.com/jquery.js">
  </script>
  <script src="//netdna.bootstrapcdn.com/bootstrap/3.0.1/js/bootstrap.min.js">
  </script>
  <link href="https://fonts.googleapis.com/css?family=Source+Code+Pro" rel="stylesheet" type="text/css"/>
  <link href="https://fonts.googleapis.com/css?family=Open+Sans:400,800" rel="stylesheet"/>
  <link href="//s.theoldreader.com/assets/reader/public-c7869a909c7b119a27fb646003828344.css" media="screen" rel="stylesheet" type="text/css">
   <link href="//s.theoldreader.com/assets/

**Document all `<input>` fields within the login form, paying special attention to their name attributes. These fields are crucial for submitting the login request programmatically**

In [100]:
input_form = soup.find_all('input')
print(input_form)
time.sleep(5)

[<input name="utf8" type="hidden" value="✓"/>, <input name="authenticity_token" type="hidden" value="6/J89xAkn3cVNy0JUMJF8Lba3mr+m8Aw0rxsHp5saFY="/>, <input autocapitalize="off" autocorrect="off" autofocus="autofocus" class="form-control" id="user_login" name="user[login]" placeholder="Username/Email" size="30" spellcheck="false" type="text"/>, <input class="form-control" id="user_password" name="user[password]" placeholder="Password" size="30" type="password"/>, <input class="btn btn-primary btn-block" name="commit" type="submit" value="Sign In"/>]


## 3.	Analyzing Network Traffic for Login Request  &  4. Automating the Login Process

**Identify the network request made when you submit the login form. Explain why this method was chosen.**  
It is POST because it will request send the data in the body of request, and not displaying the privacy information in the URL, usually this method is adding layer of secuirty on transmitting sensitive information, therefore my creddential is not exposed in browser history.

**Carefully examine the payload that was submitted to the server during login.  Compare this payload to the `<form>` / `<input>` fields you previously analyzed.  Explain your observation.**  

In my observation, I found that the POST request should include a key-value pair where the'authenticity_token' and the value in the string. The authenticity token here is similar with the website on neopets where it was the '_ref_ck' in this case I have the authenticity token who works sort of like cookies and allow me to make post request.  I will extract the authenticity token I have for my login to help me automating the login process.

## 4. Automating the Login Process

**Using Python and appropriate libraries like requests, simulate the login process**

In [101]:
#<input autocapitalize="off" autocorrect="off" autofocus="autofocus" class="form-control" id="user_login" name="user[login]" placeholder="Username/Email" size="30" spellcheck="false" type="text">

In [102]:
headers = {'User-agent': 'Mozilla/5.0'}

url = 'https://theoldreader.com/'
page = requests.get(url, headers=headers)
soup2 = BeautifulSoup(page.content, 'html.parser')

# Print the unmodified page content
print(soup2)
# Always pause between two requests.
time.sleep(5) # 5s

<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<link href="https://fonts.googleapis.com/css?family=Montserrat:400,600" rel="stylesheet" type="text/css"/>
<!-- Latest compiled and minified JavaScript -->
<script src="https://code.jquery.com/jquery.js"></script>
<script src="//netdna.bootstrapcdn.com/bootstrap/3.0.1/js/bootstrap.min.js"></script>
<link href="https://fonts.googleapis.com/css?family=Source+Code+Pro" rel="stylesheet" type="text/css"/>
<link href="https://fonts.googleapis.com/css?family=Open+Sans:400,800" rel="stylesheet"/>
<link href="//s.theoldreader.com/assets/reader/public-c7869a909c7b119a27fb646003828344.css" media="screen" rel="stylesheet" type="text/css">
<link href="//s.theoldreader.com/assets/apple-touch-icon-57x57-86fe1176

In [103]:
#<input class="form-control" id="user_password" name="user[password]" placeholder="Password" size="30" type="password">

In [104]:
input_element = soup.select_one('input[name="authenticity_token"]')
authenticity_token = input_element.get('value') 
print(authenticity_token)

# Always pause between two requests.
time.sleep(5) # 5s

6/J89xAkn3cVNy0JUMJF8Lba3mr+m8Aw0rxsHp5saFY=


**Create a session object to maintain your login state across multiple requests & Send a POST request to the login form’s action URL to log in, using the session object**

In [105]:
time.sleep(5)
# An open session carries the cookies and allows me to make post requests
session = requests.session()

res = session.post('https://theoldreader.com/users/sign_in', 
                        data = {#'dob-check' : '',
                                #'destination' : '',
                                'authenticity_token' : authenticity_token,
                                'user[login]' : 'fvsantosa@ucdavis.edu', # your username here
                                'user[password]' : 'blessed999'}, # your password here
                        # headers = dict(referer = 'https://www.neopets.com/login/'),
                        timeout = 20)

# This will get me the Cookies.
cookies = session.cookies.get_dict()
print(cookies)

# Always pause between two requests.
time.sleep(5) # 5s

{'_new_reader_session': 'BAh7CkkiD3Nlc3Npb25faWQGOgZFVEkiJWJmNGNjNmQ4MzI0YTU1MmQ2YzJmOTRlOTg4YWVjNzA0BjsAVEkiGXdhcmRlbi51c2VyLnVzZXIua2V5BjsAVFsHWwZVOhpNb3BlZDo6QlNPTjo6T2JqZWN0SWQiEd93EwRXuCeEt4m8n0kiIiQyYSQwNSRuSGpoVXQ1WlMwU1l2YWRDcDkxekNlBjsAVEkiDWxhbmd1YWdlBjsARjoHZW5JIhByZWRpcmVjdF90bwY7AEZJIgYvBjsARkkiEF9jc3JmX3Rva2VuBjsARkkiMXJnSHZXOU82Z1gzUTk2eTBMMzUvc2dlQVh4a3hvWHcxV2ZEZXFpSjZtbkU9BjsARg%3D%3D--f1d06db1a7eba51c80cdc699901f7d500dbcb515', 'i_know_you': 'vania', 'remember_user_token': 'BAhbB1sGVToaTW9wZWQ6OkJTT046Ok9iamVjdElkIhHfdxMEV7gnhLeJvJ9JIiIkMmEkMDUkbkhqaFV0NVpTMFNZdmFkQ3A5MXpDZQY6BkVU--7d9e4c2fa1283eac3a12ae2e559b1def7a6e479c', 'signed_at': '1707287412'}


## 5.	Verifying Successful Login

In [106]:
# Always pause between two requests.
time.sleep(5) # 5s

# This is the easiest way to remain in-session.
page2 = session.get('https://theoldreader.com/') # use session.xyz
soup2 = BeautifulSoup(page2.content, 'html.parser')

# print(soup2);
# print();
elements = soup2.find_all(title = 'vania') 
for element in elements:
    print(str(element.parent))

<li class="dropdown">
<a class="dropdown-toggle" data-hover="dropdown" data-toggle="dropdown" href="#" title="vania">vania  <i class="fa fa-caret-down"></i></a>
<ul class="dropdown-menu">
<li class="dropdown-header">Settings</li>
<li><a data-pjax="" href="/users/edit">Manage Settings</a></li>
<li><a href="https://theoldreader.com/accounts/manage">Manage Account</a></li>
<li><a data-pjax="" href="/subscriptions">Manage Subscriptions</a></li>
<li><a data-pjax="" href="/profile/df77130457b82784b789bc9f">View Profile</a></li>
<li class="divider"></li>
<li class="dropdown-header">Help</li>
<li><a data-pjax="" href="/pages/tour">Product Tour</a></li>
<li><a href="mailto:support@theoldreader.com">Support</a></li>
<li class="divider"></li>
<li><a href="/users/sign_out">Sign Out</a></li>
</ul>
</li>


**Thoughts** :  I have verified my login is successful by seeing the presence of my user information (vania) that is only available when logged in.