# Scraping Files from Websites 

### You need to create a data set that tracks how many companies the SEC suspended between 1999 and 2019. You find the data at:

```https://www.sec.gov/litigation/suspensions.shtml```



### Write a scraper that aggregates:

* Date of suspension
* Company name
* Order
* Release (the PDFs in the XX-YYYYY format)

# The Challenge?

### Details are actually in PDFs!

# Demo downloading files from websites 

There are ```txt``` and ```pdf``` files on:

```https://sandeepmj.github.io/scrape-example-page/pages.html```

Do the following:

1. Download all ```txt``` files.
2. Download all ```pdf``` files.
3. Download all files as one.

In [18]:
# import libraries
from bs4 import BeautifulSoup  ## scrape info from web pages
import requests ## get web pages from server
import time # time is required. we will use its sleep function
from random import randrange # generate random numbers
import wget # can pull down documents, files from websites
# from google.colab import files ## code for downloading in google colab

In [19]:
# url to scrape
url = "https://sandeepmj.github.io/scrape-example-page/pages.html"

## Turn page into soup

In [20]:
## request the page and parse it; the raw printout is hard to read (soup.prettify() gives an indented version)
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
print(soup)

<html lang="en">
<head>
<!-- Makes the page responsive and scaled to be read easily -->
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- Links to stylesheet -->
<link href="style.css" rel="stylesheet" type="text/css"/>
<!-- Remember to update page title -->
<title>List of Documents</title>
</head>
<body>
<!-- All content goes here -->
<div class="container">
<h1>Documents to Download</h1>
<ul class="txts downloadable">
<p class="pages">Download this list of text documents</p>
<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
<li>Text Document <a href="files/text_doc_03.txt">3</a></li>
<li>Text Document <a href="files/text_doc_04.txt">4</a></li>
<li>Text Document <a href="files/text_doc_05.txt">5</a></li>
<li>Text Document <a href="files/text_doc_06.txt">6</a></li>
<li>Text Document <a href="files/text_doc_07.txt">7</a></li>
<li>Text Document <a href="files/text_doc_08.txt">8</a></li>
<li>

## Find all txt files

In [21]:
txt_holder = soup.find_all("ul", class_="txts")
print(txt_holder)

[<ul class="txts downloadable">
<p class="pages">Download this list of text documents</p>
<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
<li>Text Document <a href="files/text_doc_03.txt">3</a></li>
<li>Text Document <a href="files/text_doc_04.txt">4</a></li>
<li>Text Document <a href="files/text_doc_05.txt">5</a></li>
<li>Text Document <a href="files/text_doc_06.txt">6</a></li>
<li>Text Document <a href="files/text_doc_07.txt">7</a></li>
<li>Text Document <a href="files/text_doc_08.txt">8</a></li>
<li>Text Document <a href="files/text_doc_09.txt">9</a></li>
<li>Text Document <a href="files/text_doc_10.txt">10</a></li>
</ul>]


## Find all the ```a``` tags 

In [22]:
for txt_files in txt_holder:
    txt_file_links = txt_files.find_all("a")
    print(type(txt_file_links))
    print(txt_file_links)

<class 'bs4.element.ResultSet'>
[<a href="files/text_doc_01.txt">1</a>, <a href="files/text_doc_02.txt">2</a>, <a href="files/text_doc_03.txt">3</a>, <a href="files/text_doc_04.txt">4</a>, <a href="files/text_doc_05.txt">5</a>, <a href="files/text_doc_06.txt">6</a>, <a href="files/text_doc_07.txt">7</a>, <a href="files/text_doc_08.txt">8</a>, <a href="files/text_doc_09.txt">9</a>, <a href="files/text_doc_10.txt">10</a>]


## What is missing from the URLs?

In [23]:
base_url = "https://sandeepmj.github.io/scrape-example-page/"

## Create a list of the full URLs

Without all the ```html```

In [24]:
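all_txt_links = [base_url + txt_file_link.get("href") for txt_file_link in txt_file_links]
print(all_txt_links)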

['https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt', 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_02.txt', 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_03.txt', 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_04.txt', 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_05.txt', 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_06.txt', 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_07.txt', 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_08.txt', 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_09.txt', 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_10.txt']
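Gluing `base_url` onto each `href` works here because every link on the page is a plain relative path. As a hedged alternative, the standard library's `urljoin` resolves an `href` against the page URL the way a browser would, so it should also cope with root-relative or already-absolute links. The names `page_url` and `relative_hrefs`, and the last two sample links, are made up for illustration.

```python
from urllib.parse import urljoin

# The page the links were scraped from (same URL as in the demo).
page_url = "https://sandeepmj.github.io/scrape-example-page/pages.html"

# Sample href values like the ones found in the soup.
relative_hrefs = ["files/text_doc_01.txt", "files/pdf_1.pdf"]

# urljoin drops "pages.html" and appends the relative path.
absolute_urls = [urljoin(page_url, href) for href in relative_hrefs]
print(absolute_urls)

# Hypothetical edge cases that plain string concatenation would get wrong.
root_relative = urljoin(page_url, "/other/report.pdf")
already_absolute = urljoin(page_url, "https://example.com/file.txt")
print(root_relative)
print(already_absolute)
```

Note that `urljoin` takes the URL of the page itself, not a hand-trimmed `base_url`.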


## Download all the ```txt``` documents

In [25]:
links_number = len(all_txt_links)
link_count = 1
for link in all_txt_links:
    print(f"Downloaded link {link_count} of {links_number}")
#     files.download(wget.download(link,"")) ## needed in colab instead of next line
    wget.download(link, "")
    link_count += 1
    snooze = randrange(3,6)
    print(f"snoozing for {snooze} seconds before scraping next link.")
    time.sleep(snooze)

Downloaded link 1 of 10
snoozing for 5 seconds before scraping next link.
Downloaded link 2 of 10
snoozing for 3 seconds before scraping next link.
Downloaded link 3 of 10
snoozing for 3 seconds before scraping next link.
Downloaded link 4 of 10
snoozing for 5 seconds before scraping next link.
Downloaded link 5 of 10
snoozing for 4 seconds before scraping next link.
Downloaded link 6 of 10
snoozing for 4 seconds before scraping next link.
Downloaded link 7 of 10
snoozing for 3 seconds before scraping next link.
Downloaded link 8 of 10
snoozing for 3 seconds before scraping next link.
Downloaded link 9 of 10
snoozing for 3 seconds before scraping next link.
Downloaded link 10 of 10
snoozing for 4 seconds before scraping next link.
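If `wget` is not installed, a similar download step can be sketched with only the standard library. This is an alternative to the `wget.download` call above, not what the notebook itself runs. The helper names `filename_from_url` and `save_file`, and the `downloads` folder, are assumptions made for this example.

```python
import os
import shutil
from urllib.parse import unquote, urlparse
from urllib.request import urlopen


def filename_from_url(file_url):
    """Return the last path segment of a URL, e.g. 'pdf_1.pdf'."""
    return os.path.basename(unquote(urlparse(file_url).path))


def save_file(file_url, dest_dir="downloads", timeout=30):
    """Save file_url into dest_dir unless it is already there; return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    local_path = os.path.join(dest_dir, filename_from_url(file_url))
    if os.path.exists(local_path):
        # Skip files fetched on an earlier run so we do not hit the server twice.
        return local_path
    with urlopen(file_url, timeout=timeout) as response, open(local_path, "wb") as out_file:
        # Copy the response to disk in chunks instead of loading it all into memory.
        shutil.copyfileobj(response, out_file)
    return local_path
```

Inside the loop, `save_file(link)` would take the place of `wget.download(link, "")`. Skipping files that already exist makes it safe to rerun the cell after an interrupted scrape.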


# Find all ```pdf``` files

In [9]:
pdf_holder = soup.find_all("ul", class_= "pdfs")
print(pdf_holder)

[<ul class="pdfs downloadable">
<p class="pages">Download this list of PDFs</p>
<li>PDF Document <a href="files/pdf_1.pdf">1</a> </li>
<li>PDF Document <a href="files/pdf_2.pdf">2</a></li>
<li>PDF Document <a href="files/pdf_3.pdf">3</a></li>
<li>PDF Document <a href="files/pdf_4.pdf">4</a></li>
<li>PDF Document <a href="files/pdf_5.pdf">5</a></li>
<li>PDF Document <a href="files/pdf_6.pdf">6</a></li>
<li>PDF Document <a href="files/pdf_7.pdf">7</a></li>
<li>PDF Document <a href="files/pdf_8.pdf">8</a></li>
<li>PDF Document <a href="files/pdf_9.pdf">9</a></li>
<li>PDF Document <a href="files/pdf_10.pdf">10</a></li>
</ul>]


## Find all the ```a``` tags 

In [10]:
for pdf_files in pdf_holder:
    pdf_file_links = pdf_files.find_all("a")
    print(type(pdf_file_links))
    print(pdf_file_links)

<class 'bs4.element.ResultSet'>
[<a href="files/pdf_1.pdf">1</a>, <a href="files/pdf_2.pdf">2</a>, <a href="files/pdf_3.pdf">3</a>, <a href="files/pdf_4.pdf">4</a>, <a href="files/pdf_5.pdf">5</a>, <a href="files/pdf_6.pdf">6</a>, <a href="files/pdf_7.pdf">7</a>, <a href="files/pdf_8.pdf">8</a>, <a href="files/pdf_9.pdf">9</a>, <a href="files/pdf_10.pdf">10</a>]


## Create a list of the full URLs

Without all the ```html```

In [11]:
all_pdf_links = [base_url + pdf_file_link.get("href") for pdf_file_link in pdf_file_links]
print(all_pdf_links)

['https://sandeepmj.github.io/scrape-example-page/files/pdf_1.pdf', 'https://sandeepmj.github.io/scrape-example-page/files/pdf_2.pdf', 'https://sandeepmj.github.io/scrape-example-page/files/pdf_3.pdf', 'https://sandeepmj.github.io/scrape-example-page/files/pdf_4.pdf', 'https://sandeepmj.github.io/scrape-example-page/files/pdf_5.pdf', 'https://sandeepmj.github.io/scrape-example-page/files/pdf_6.pdf', 'https://sandeepmj.github.io/scrape-example-page/files/pdf_7.pdf', 'https://sandeepmj.github.io/scrape-example-page/files/pdf_8.pdf', 'https://sandeepmj.github.io/scrape-example-page/files/pdf_9.pdf', 'https://sandeepmj.github.io/scrape-example-page/files/pdf_10.pdf']


## Download all the ```pdf``` documents

In [12]:
links_number = len(all_pdf_links)
link_count = 1
for link in all_pdf_links:
    print(f"Downloaded link {link_count} of {links_number}")
    #files.download(wget.download(link,"")) ## needed in colab instead of next line
    wget.download(link, "")
    link_count += 1
    snooze = randrange(3,6)
    print(f"snoozing for {snooze} seconds before scraping next link.")
    time.sleep(snooze)

Downloaded link 1 of 10
snoozing for 4 seconds before scraping next link.
Downloaded link 2 of 10
snoozing for 5 seconds before scraping next link.
Downloaded link 3 of 10
snoozing for 3 seconds before scraping next link.
Downloaded link 4 of 10
snoozing for 3 seconds before scraping next link.
Downloaded link 5 of 10
snoozing for 4 seconds before scraping next link.
Downloaded link 6 of 10
snoozing for 4 seconds before scraping next link.
Downloaded link 7 of 10
snoozing for 4 seconds before scraping next link.
Downloaded link 8 of 10
snoozing for 4 seconds before scraping next link.
Downloaded link 9 of 10
snoozing for 4 seconds before scraping next link.
Downloaded link 10 of 10
snoozing for 5 seconds before scraping next link.
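Both download loops repeat the same counting and snoozing code, and both snooze once more after the final file. One possible way to tidy this up is a small generator that hands out each link with its position and pauses only between links. The name `paced` and its parameters are hypothetical, and it uses `uniform` rather than `randrange` so the pause need not be a whole number of seconds.

```python
import time
from random import uniform


def paced(items, low=3.0, high=6.0, sleep=time.sleep):
    """Yield (position, total, item), pausing between items but not after the last one."""
    items = list(items)
    total = len(items)
    for position, item in enumerate(items, start=1):
        yield position, total, item
        if position < total:
            # Random pause so requests are not fired at the server back to back.
            sleep(uniform(low, high))


# Tiny demo with near-zero pauses; with real links, keep the default 3 to 6 seconds.
for position, total, name in paced(["a.txt", "b.txt"], low=0, high=0.01):
    print(f"Downloading link {position} of {total}: {name}")
```

The `sleep` parameter exists only so the pause can be swapped out when testing.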


## Find all the files and download in one go

In [13]:
## find all files in our soup
all_holder = soup.find_all("li")
all_holder

[<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>,
 <li>Text Document <a href="files/text_doc_02.txt">2</a></li>,
 <li>Text Document <a href="files/text_doc_03.txt">3</a></li>,
 <li>Text Document <a href="files/text_doc_04.txt">4</a></li>,
 <li>Text Document <a href="files/text_doc_05.txt">5</a></li>,
 <li>Text Document <a href="files/text_doc_06.txt">6</a></li>,
 <li>Text Document <a href="files/text_doc_07.txt">7</a></li>,
 <li>Text Document <a href="files/text_doc_08.txt">8</a></li>,
 <li>Text Document <a href="files/text_doc_09.txt">9</a></li>,
 <li>Text Document <a href="files/text_doc_10.txt">10</a></li>,
 <li>PDF Document <a href="files/pdf_1.pdf">1</a> </li>,
 <li>PDF Document <a href="files/pdf_2.pdf">2</a></li>,
 <li>PDF Document <a href="files/pdf_3.pdf">3</a></li>,
 <li>PDF Document <a href="files/pdf_4.pdf">4</a></li>,
 <li>PDF Document <a href="files/pdf_5.pdf">5</a></li>,
 <li>PDF Document <a href="files/pdf_6.pdf">6</a></li>,
 <li>PDF Document <a href="file

## Isolate the urls

In [None]:
all_files = [base_url + link.find("a").get("href") for link in all_holder]
all_files
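Once every URL sits in one list, it can be handy to check how many files of each type were found before downloading them. A minimal sketch, assuming the file type can be read from the extension at the end of the URL path; `group_by_extension` and `sample_urls` are names invented for this example.

```python
import os
from collections import defaultdict
from urllib.parse import urlparse


def group_by_extension(urls):
    """Map each lowercase file extension (without the dot) to the URLs that end in it."""
    grouped = defaultdict(list)
    for file_url in urls:
        extension = os.path.splitext(urlparse(file_url).path)[1].lower().lstrip(".")
        grouped[extension].append(file_url)
    return dict(grouped)


sample_urls = [
    "https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt",
    "https://sandeepmj.github.io/scrape-example-page/files/pdf_1.pdf",
    "https://sandeepmj.github.io/scrape-example-page/files/pdf_2.pdf",
]

grouped = group_by_extension(sample_urls)
for extension, urls in grouped.items():
    print(extension, len(urls))
```

With the notebook's own data, `group_by_extension(all_files)` would give the same split as the separate `txt` and `pdf` steps earlier.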

## Download all files

In [None]:
links_number = len(all_files)
link_count = 1
for link in all_files:
    print(f"Downloaded link {link_count} of {links_number}")
    # files.download(wget.download(link,"")) ## needed in colab instead of next line
    wget.download(link, "")
    link_count += 1
    snooze = randrange(3,6)
    print(f"snoozing for {snooze} seconds before scraping next link.")
    time.sleep(snooze)

# Go for the ```downloadable``` class

In [None]:
## find all files in our soup
docs_holder = soup.find_all("ul", class_ = "downloadable")
docs_holder

In [None]:
type(docs_holder)

In [None]:
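all_urls = []
for doc_list in docs_holder:  # each item is a <ul> with the downloadable class
    for a_tag in doc_list.find_all("a"):
        href = a_tag.get("href")  # separate name so the page's url variable is not overwritten
        print(href)
        all_urls.append(base_url + href)

all_urls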
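The nested loop can also be written as a single CSS selector. The sketch below assumes a recent Beautiful Soup 4, where `select` accepts selectors such as `ul.downloadable a[href]`. It runs on a small inline HTML sample rather than the live page, and `sample_html`, `sample_soup` and `selected_urls` are names chosen for this example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Cut-down copy of the demo page's structure, plus one list we do NOT want.
sample_html = """
<ul class="txts downloadable">
  <li>Text Document <a href="files/text_doc_01.txt">1</a></li>
</ul>
<ul class="pdfs downloadable">
  <li>PDF Document <a href="files/pdf_1.pdf">1</a></li>
</ul>
<ul class="other">
  <li><a href="about.html">About</a></li>
</ul>
"""

page_url = "https://sandeepmj.github.io/scrape-example-page/pages.html"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Every <a> with an href that sits anywhere inside a <ul> carrying the downloadable class.
selected_urls = [
    urljoin(page_url, a_tag["href"])
    for a_tag in sample_soup.select("ul.downloadable a[href]")
]
print(selected_urls)
```

The link inside the `other` list is left out because its `ul` does not carry the `downloadable` class.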

In [None]:
links_number = len(all_urls)
link_count = 1
for link in all_urls:
    print(f"Downloaded link {link_count} of {links_number}")
    wget.download(link, "")
    link_count += 1
    snooze = randrange(3,6)
    print(f"snoozing for {snooze} seconds before scraping next link.")
    time.sleep(snooze)

## Misc.

If you target the ```li``` tags, you can skip the separate ```find_all``` step for the ```a``` tags and pull each link with ```find("a")``` instead, as in the ```all_files``` list above.

In [None]:
for pdf_files in pdf_holder:
    pdf_file_links = pdf_files.find_all("li")
    print(type(pdf_file_links))
    print(pdf_file_links)