# Scraping sites with pulldown menus

- <a href="https://automobiles.honda.com/">Honda</a>
- <a href="https://www.ryanair.com/us/en">Ryanair</a>
- <a href="https://www.rebgv.org/content/rebgv-org/market-watch/monthly-market-report/">Real Estate Board Monthly Market Report</a>

We want to <a href="https://sandeepmj.github.io/scrape-example-page/pulldown-site/">scrape this site</a>. 

Note that the site is non-functional: choosing items from the menu won't take you anywhere, but we can still use it to practice.

## Buried within are pages like:

- <a href="https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/january-2020.html">January 2020</a>
- <a href="https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/february-2020.html">February 2020</a>
- <a href="https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/august-2020.html">August 2020</a>
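The three report links above appear to share one pattern: a lowercase month name, a hyphen, a four-digit year, and `.html`, all under a `reports/` folder. A minimal sketch of that pattern, assuming it holds for every month and year (the `report_url` helper is hypothetical, not part of the site or any library):

```python
# Folder that holds the individual report pages (taken from the links above)
base_url = "https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/"


def report_url(month, year):
    """Hypothetical helper: build a report URL from a month name and a year."""
    return f"{base_url}{month.lower()}-{year}.html"


print(report_url("January", 2020))
print(report_url("august", 2020))
```

If the pattern holds, all we need from the landing page is the list of valid months and years, which is what the pulldown menus give us.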

In [1]:
pip install icecream

Note: you may need to restart the kernel to use updated packages.


In [2]:
## Let's import all the libraries we are likely to need
import requests ## to capture content from web pages
from bs4 import BeautifulSoup ## to parse our scraped data
import pandas as pd ## to easily export our data to dataframes/CSVs
from icecream import ic ## easily debug
from pprint import pprint as pp ## to prettify our printouts
import itertools ## to flatten lists
from random import randrange ## to pick a random integer from a range (e.g. for randomized pauses)
import time ## for timers and pauses
import re ## for regular expressions

In [3]:
url = "https://sandeepmj.github.io/scrape-example-page/pulldown-site/"

In [4]:
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
soup

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<!-- set the character set -->
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- first step to a responsive site -->
<title>Pulldown menu demo site</title>
<!-- FONT AWESOME -->
<link href="https://maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css" rel="stylesheet"/>
<!-- Your styles should be called here -->
<link href="styles.css" rel="stylesheet" type="text/css"/>
</head>
<body>
<!-- YOUR CONTENT GOES HERE -->
<div class="main">
<h1>Pulldown menu demo site</h1>
<p>A site to learn and practice how to scrape pulldown menu sites</p>
<p>Select a month and a year. There is no <strong>View Results</strong> button and this is not a fully operational site. Selecting a month and year and clicking on <strong>View Results</strong> would take you to a new page.</p>
<div class="dropdown">
<label for="month">Month</label>
<div class="se

In [12]:
years_elements = soup.find("select", id="year").find_all("option")
years_elements

[<option selected="" value="2021">2021</option>,
 <option value="2020">2020</option>,
 <option value="2019">2019</option>,
 <option value="2018">2018</option>,
 <option value="2017">2017</option>,
 <option value="2016">2016</option>,
 <option value="2015">2015</option>,
 <option value="2014">2014</option>,
 <option value="2013">2013</option>,
 <option value="2012">2012</option>,
 <option value="2011">2011</option>,
 <option value="2010">2010</option>,
 <option value="2009">2009</option>,
 <option value="2008">2008</option>,
 <option value="2007">2007</option>,
 <option value="2006">2006</option>,
 <option value="2005">2005</option>,
 <option value="2004">2004</option>,
 <option value="2003">2003</option>,
 <option value="2002">2002</option>,
 <option value="2001">2001</option>,
 <option value="2000">2000</option>,
 <option value="1999">1999</option>]

In [13]:
type(years_elements)

bs4.element.ResultSet
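A `ResultSet` can be indexed, sliced, and looped over like a list, and each item in it is a tag whose attributes can be read in several ways. The sketch below uses a made-up two-option menu (an assumption standing in for the real page) to compare three ways of reading `value`:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature menu standing in for the real page
html = """
<select id="year">
  <option selected value="2021">2021</option>
  <option value="2020">2020</option>
</select>
"""
demo_soup = BeautifulSoup(html, "html.parser")
options = demo_soup.find("select", id="year").find_all("option")

first = options[0]  # indexing works as it does on a list

by_key = first["value"]  # raises KeyError if the attribute is missing
by_get = first.get("value")  # returns None if the attribute is missing
by_list = first.get_attribute_list("value")[0]  # always returns a list, so take item 0
label = first.get_text()  # the visible text between the tags

print(len(options), by_key, by_get, by_list, label)
```

Square brackets are the shortest form; `.get()` is the safer choice when some tags may lack the attribute.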

In [33]:
years = []
for year in years_elements:
    # ic(year)
    target_year = int(year.get_attribute_list("value")[0])  ## the value is a string, so convert it
    ## keep only 2017 through 2020
    if 2017 <= target_year < 2021:
        # ic(target_year)
        years.append(target_year)

years

[2020, 2019, 2018, 2017]
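The same filter can be written as a list comprehension. This sketch uses plain strings in place of the `<option>` tags (an assumption, so it runs without the page); with the real tags, `year["value"]` would supply each string.

```python
# Stand-in for the option values scraped above: "2021" down to "1999"
year_values = [str(y) for y in range(2021, 1998, -1)]

# Convert each value to an int and keep only 2017 through 2020
years = [int(value) for value in year_values if 2017 <= int(value) < 2021]

print(years)
```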

In [19]:
months_elements = soup.find("select", id = "month").find_all("option")
months_elements

[<option selected="" value="january">January</option>,
 <option value="february">February</option>,
 <option value="march">March</option>,
 <option value="april">April</option>,
 <option value="may">May</option>,
 <option value="june">June</option>,
 <option value="july">July</option>,
 <option value="august">August</option>,
 <option value="september">September</option>,
 <option value="october">October</option>,
 <option value="november">November</option>,
 <option value="december">December</option>]

In [20]:
months = []
for month in months_elements:
    target_month = month.get_attribute_list("value")[0]
    # ic(target_month)
    months.append(target_month)
months

['january',
 'february',
 'march',
 'april',
 'may',
 'june',
 'july',
 'august',
 'september',
 'october',
 'november',
 'december']

In [36]:
## the same list as a one-line comprehension, reading the attribute with ["value"]
months = [target_month["value"] for target_month in months_elements]
months

['january',
 'february',
 'march',
 'april',
 'may',
 'june',
 'july',
 'august',
 'september',
 'october',
 'november',
 'december']

In [22]:
## equivalent comprehension using get_attribute_list(), as in the loop version
months = [month.get_attribute_list("value")[0]
          for month in months_elements]
months

['january',
 'february',
 'march',
 'april',
 'may',
 'june',
 'july',
 'august',
 'september',
 'october',
 'november',
 'december']

In [23]:
base_url = "https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/"

In [34]:
links = []
for year in years:
    for month in months:
        # ic(year)
        # ic(month)
        links.append(f"{base_url}{month}-{year}.html")

links

['https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/january-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/february-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/march-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/april-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/may-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/june-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/july-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/august-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/september-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/october-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/november-2020.html',
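The nested loop pairs every year with every month. One alternative, offered as a sketch rather than the notebook's own method, is `itertools.product`, which yields the same pairs in the same order (the first argument varies slowest). The `years` and `months` lists are retyped here so the example runs on its own.

```python
from itertools import product

base_url = "https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/"
years = [2020, 2019, 2018, 2017]
months = ["january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"]

# product(years, months) yields (2020, "january"), (2020, "february"), ...
links = [f"{base_url}{month}-{year}.html" for year, month in product(years, months)]

print(len(links))
print(links[0])
print(links[-1])
```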

In [27]:
links[:12]  ## the first 12 links: every month of the first year in the list (2020)

['https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/january-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/february-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/march-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/april-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/may-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/june-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/july-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/august-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/september-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/october-2020.html',
 'https://sandeepmj.github.io/scrape-example-page/pulldown-site/reports/november-2020.html',
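With the links built, a natural next step is to request each page, pausing between requests so we don't hammer the server. The sketch below is one possible approach, not a definitive one: `fetch_pages`, its `get` parameter, and the wait bounds are all hypothetical names chosen for illustration, and it assumes a missing report answers with a non-200 status code.

```python
import time
from random import randrange


def fetch_pages(links, get=None, min_wait=1, max_wait=4):
    """Download each link, pausing a random whole number of seconds after each request."""
    if get is None:
        import requests  # imported here so a stand-in `get` can be passed for offline testing
        get = requests.get

    pages = {}
    for link in links:
        response = get(link, timeout=10)
        if response.status_code == 200:
            pages[link] = response.text  # raw HTML, ready for BeautifulSoup
        else:
            print(f"Skipping {link}: status {response.status_code}")
        # randrange excludes the upper bound, so this waits min_wait to max_wait - 1 seconds
        time.sleep(randrange(min_wait, max_wait))
    return pages
```

Calling `fetch_pages(links[:3])` on a small slice first is a cheap way to confirm the URLs resolve before requesting all of them.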