# Scrape sale price documents for Brooklyn homes

## Build a list of documents we would like to download

Visit https://www.nyc.gov/site/finance/taxes/property-annualized-sales-update.page and peek under "Detailed Annual Sales Reports by Borough." We want to build a list of links to all of the Excel files for **one borough**. It's your choice: Manhattan, Brooklyn, Staten Island, etc.

* _**Tip:** You can basically cut and paste from the end of class on this one_
* _**Tip:** 2017 and earlier files are `.xls`, not `.xlsx`_

In [23]:
import requests
from bs4 import BeautifulSoup

In [24]:
response = requests.get("https://www.nyc.gov/site/finance/taxes/property-annualized-sales-update.page")
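# Name the parser explicitly so BeautifulSoup doesn't have to guess (and warn about it)
doc = BeautifulSoup(response.text, "html.parser")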

In [25]:
links = doc.select("a[href*='_brooklyn.xls']")
links

[<a href="/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2021/2021_brooklyn.xlsx" target="_blank">Download</a>,
 <a href="/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2020/2020_brooklyn.xlsx" target="_blank">Download</a>,
 <a href="/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2019/2019_brooklyn.xlsx" target="_blank">Download</a>,
 <a href="/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2018/2018_brooklyn.xlsx" target="_blank">Download</a>,
 <a href="/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2017/2017_brooklyn.xls" target="_blank">Download</a>,
 <a href="/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2016/2016_brooklyn.xls" target="_blank">Download</a>,
 <a href="/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2015/2015_brooklyn.xls" target="_blank">Download</a>,
 <a href="/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2014/2014_brooklyn.xls" target="_blank">Download</

## Use Python to make a list of the URLs to be downloaded, and save them to a file.

The format is a _little_ different than what we did in class, because a `/` at the beginning of a URL means "start from the top of the domain" instead of "start relative to the page you're on now." Just examine your URLs and you'll notice it.
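To see the difference between the two kinds of links, here is a small sketch using `urljoin` from Python's standard library. It isn't what we did in class, just another way to look at it. The first `href` is copied from the page; `example.xlsx` is a made-up filename that's only there for contrast.

```python
from urllib.parse import urljoin

# The page the links were scraped from
page_url = "https://www.nyc.gov/site/finance/taxes/property-annualized-sales-update.page"

# An href that starts with "/" is resolved from the top of the domain
root_relative = "/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2021/2021_brooklyn.xlsx"
full_url = urljoin(page_url, root_relative)
print(full_url)

# An href with no leading "/" is resolved relative to the folder the page is in
# ("example.xlsx" is a made-up filename, not a real file on the site)
page_relative = urljoin(page_url, "example.xlsx")
print(page_relative)
```

The first result starts right after `https://www.nyc.gov`, while the second one stays inside `/site/finance/taxes/`.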

_**Tip:** If you want to google around for other ways to do this, the `'\n'.join(urls)` method might be an interesting one to look at._
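Here is a minimal sketch of how the `join` approach could look. The two URLs are copied from the page just to keep the example short, and `urls_joined.txt` is a made-up filename so it doesn't clobber your real list.

```python
from pathlib import Path

# Two sample URLs (your real list will have one per year)
urls = [
    "https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2021/2021_brooklyn.xlsx",
    "https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2020/2020_brooklyn.xlsx",
]

# Put a newline between each URL, plus one at the very end of the file
text = "\n".join(urls) + "\n"

# "urls_joined.txt" is a hypothetical filename used only for this example
Path("urls_joined.txt").write_text(text)
```

The result is the same one-URL-per-line file you'd get from writing each URL in a loop, just built as a single string first.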

In [27]:
urls = ['https://www.nyc.gov' + link['href'] for link in links]
urls[:3]

['https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2021/2021_brooklyn.xlsx',
 'https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2020/2020_brooklyn.xlsx',
 'https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2019/2019_brooklyn.xlsx']

In [28]:
with open("urls.txt", "w") as fp:
    for url in urls:
        fp.write(url + "\n")

## Download the Excel files with `wget` or `curl`

You can see what I did in class, but `wget` has an option that takes the name of a file containing a list of URLs to download.
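If `wget` isn't installed on your machine, a rough Python stand-in might look like the sketch below. This is a swapped-in alternative using only the standard library, not the `wget` approach from class. `filename_from_url` and `download_all` are names made up for this example, and some servers reject Python's default user agent, so treat it as a starting point.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def filename_from_url(url):
    """Return the last piece of the URL's path, e.g. '2021_brooklyn.xlsx'."""
    return Path(urlparse(url).path).name


def download_all(list_path):
    """Download every URL listed (one per line) in the file at list_path."""
    for line in Path(list_path).read_text().splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        filename = filename_from_url(url)
        print("Saving", url, "to", filename)
        urlretrieve(url, filename)


# Nothing is downloaded until you call the function yourself, for example:
# download_all("urls.txt")
```

Each file is saved in the current folder under the last part of its URL, which matches the filenames `wget` picks.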

In [29]:
!wget -i urls.txt

--2022-11-16 13:38:44--  https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2021/2021_brooklyn.xlsx
Resolving www.nyc.gov (www.nyc.gov)... 104.70.72.36
Connecting to www.nyc.gov (www.nyc.gov)|104.70.72.36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3212511 (3.1M) [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘2021_brooklyn.xlsx’


2022-11-16 13:38:45 (6.16 MB/s) - ‘2021_brooklyn.xlsx’ saved [3212511/3212511]

--2022-11-16 13:38:45--  https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2020/2020_brooklyn.xlsx
Reusing existing connection to www.nyc.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 2277851 (2.2M) [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: ‘2020_brooklyn.xlsx’


2022-11-16 13:38:45 (5.57 MB/s) - ‘2020_brooklyn.xlsx’ saved [2277851/2277851]

--2022-11-16 13:38:45--  https://www.nyc.gov/assets/finance/downlo