# Scraping Files from Websites 

### You need to create a data set that tracks how many companies the <a href="https://www.sec.gov/litigation/suspensions.shtml">SEC suspended</a> between 1999 and 2019. You find the data at:

```https://www.sec.gov/litigation/suspensions.shtml```



### We want to write a scraper that aggregates:

* Date of suspension
* Company name
* Order
* Release (the PDFs in the XX-YYYYY format)

# The Challenge?

### Details are actually in PDFs!

# Demo: downloading files from websites

There are ```txt``` and ```pdf``` files on:

```https://sandeepmj.github.io/scrape-example-page/pages.html```

Do the following:

1. Download all ```txt``` files.
2. Download all ```pdf``` files.
3. Download all files in one go.

In [1]:
# import libraries
from bs4 import BeautifulSoup  ## scrape info from web pages
import requests ## get web pages from server
import time # we will use its sleep function to pause between downloads
from random import randrange # generate random numbers

# from google.colab import files ## code for downloading in google colab

In [2]:
# url to scrape
url = "https://sandeepmj.github.io/scrape-example-page/pages.html"

## Turn page into soup

In [3]:
## request the page and parse it; the raw print is hard to read, so prettify comes next
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
print(soup)

<html lang="en">
<head>
<!-- Makes the page responsive and scaled to be read easily -->
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- Links to stylesheet -->
<link href="style.css" rel="stylesheet" type="text/css"/>
<!-- Remember to update page title -->
<title>List of Documents</title>
</head>
<body>
<!-- All content goes here -->
<div class="container">
<h1>Documents to Download</h1>
<li>Junk Li tag</li>
<li>Junk Li tag</li>
<ul class="txts downloadable">
<p class="pages">Download this list of text documents</p>
<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
<li>Text Document <a href="files/text_doc_03.txt">3</a></li>
<li>Text Document <a href="files/text_doc_04.txt">4</a></li>
<li>Text Document <a href="files/text_doc_05.txt">5</a></li>
<li>Text Document <a href="files/text_doc_06.txt">6</a></li>
<li>Text Document <a href="files/text_doc_07.txt">7</a></li>
<li>Text Document <a hr
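
The raw print above runs tags together. As a minimal sketch of the prettify step mentioned in the cell comment, the snippet below calls `prettify()` on a tiny stand-in document. `sample_html` and `sample_soup` are hypothetical names used so the example runs without a network request; in the notebook you would call `soup.prettify()` instead.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the downloaded page (hypothetical snippet, not the real page)
sample_html = (
    '<ul class="txts downloadable">'
    '<li>Text Document <a href="files/text_doc_01.txt">1</a></li>'
    "</ul>"
)

sample_soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns a string with each tag on its own line,
# indented according to how deeply it is nested
pretty = sample_soup.prettify()
print(pretty)
```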

## Find all txt files

In [4]:
## save in list called txt_holder
txt_holder = soup.find_all("ul", class_="txts")
print(txt_holder)

[<ul class="txts downloadable">
<p class="pages">Download this list of text documents</p>
<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
<li>Text Document <a href="files/text_doc_03.txt">3</a></li>
<li>Text Document <a href="files/text_doc_04.txt">4</a></li>
<li>Text Document <a href="files/text_doc_05.txt">5</a></li>
<li>Text Document <a href="files/text_doc_06.txt">6</a></li>
<li>Text Document <a href="files/text_doc_07.txt">7</a></li>
<li>Text Document <a href="files/text_doc_08.txt">8</a></li>
<li>Text Document <a href="files/text_doc_09.txt">9</a></li>
<li>Text Document <a href="files/text_doc_10.txt">10</a></li>
</ul>]
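
Note that `class_="txts"` matched a tag whose class attribute is `txts downloadable`. The sketch below, built on a small stand-in document (`sample_html` is an assumption, not the real page), illustrates that `class_` matches any single class token on a tag.

```python
from bs4 import BeautifulSoup

# Stand-in markup with the same class names as the example page
sample_html = """
<ul class="txts downloadable"><li><a href="files/text_doc_01.txt">1</a></li></ul>
<ul class="pdfs downloadable"><li><a href="files/pdf_1.pdf">1</a></li></ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# "txts" appears on the first list only, so one ul is returned
only_txts = sample_soup.find_all("ul", class_="txts")

# "downloadable" is shared by both lists, so both are returned
both_lists = sample_soup.find_all("ul", class_="downloadable")

print(len(only_txts), len(both_lists))
```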


## Find all the ```a``` tags 

In [7]:
## for loop
for txt_files in txt_holder:
    txt_file_links = txt_files.find_all("a")
    print(txt_file_links)
    print(type(txt_file_links))

[<a href="files/text_doc_01.txt">1</a>, <a href="files/text_doc_02.txt">2</a>, <a href="files/text_doc_03.txt">3</a>, <a href="files/text_doc_04.txt">4</a>, <a href="files/text_doc_05.txt">5</a>, <a href="files/text_doc_06.txt">6</a>, <a href="files/text_doc_07.txt">7</a>, <a href="files/text_doc_08.txt">8</a>, <a href="files/text_doc_09.txt">9</a>, <a href="files/text_doc_10.txt">10</a>]
<class 'bs4.element.ResultSet'>


In [8]:
## look at the links
txt_file_links


[<a href="files/text_doc_01.txt">1</a>,
 <a href="files/text_doc_02.txt">2</a>,
 <a href="files/text_doc_03.txt">3</a>,
 <a href="files/text_doc_04.txt">4</a>,
 <a href="files/text_doc_05.txt">5</a>,
 <a href="files/text_doc_06.txt">6</a>,
 <a href="files/text_doc_07.txt">7</a>,
 <a href="files/text_doc_08.txt">8</a>,
 <a href="files/text_doc_09.txt">9</a>,
 <a href="files/text_doc_10.txt">10</a>]
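
The loop above reassigns `txt_file_links` on every pass, so it keeps only the links from the last `ul` it visits. That works here because `txt_holder` holds a single list. A sketch of a more direct route, assuming exactly one matching `ul` (the stand-in `sample_html` is hypothetical):

```python
from bs4 import BeautifulSoup

sample_html = """
<ul class="txts downloadable">
<li>Text Document <a href="files/text_doc_01.txt">1</a></li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag (or None), not a ResultSet
txt_list = sample_soup.find("ul", class_="txts")

# find_all() on that single tag collects its links without an outer loop
sample_links = txt_list.find_all("a")
hrefs = [link.get("href") for link in sample_links]
print(hrefs)
```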

## What is missing from the URLs?

In [10]:
base_url = "https://sandeepmj.github.io/scrape-example-page/"

## Create a list of the full URLs

Without all the ```html```

In [12]:
## list comprehension: prepend base_url to each href
all_txt_links = [base_url + txt_link.get("href")
                 for txt_link in txt_file_links]
all_txt_links

['https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_02.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_03.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_04.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_05.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_06.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_07.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_08.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_09.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_10.txt']
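
String concatenation works here because every `href` is relative to the same folder. As an alternative sketch, the standard library's `urljoin` resolves each `href` against the page it came from, which also handles links that start with `/` or are already absolute. The names `page_url` and `hrefs` below are stand-ins for illustration.

```python
from urllib.parse import urljoin

page_url = "https://sandeepmj.github.io/scrape-example-page/pages.html"
hrefs = ["files/text_doc_01.txt", "files/text_doc_02.txt"]

# urljoin resolves each relative href against the URL of the page it was found on
full_links = [urljoin(page_url, href) for href in hrefs]
print(full_links)
```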

## Download all the ```txt``` documents

In [16]:
pip install wget

Note: you may need to restart the kernel to use updated packages.


In [17]:

import wget # can pull down documents and files from websites

In [25]:
## download with timer
links_to_get = all_txt_links[3:7]  # only a slice of the list for this demo
link_numbers = len(links_to_get)   # count the slice, not the full list

link_count = 1

for link in links_to_get:
    wget.download(link)
    print(f"Downloaded link {link_count} of {link_numbers}")
    link_count += 1
    snooze = randrange(3,6)
    print(f"Snoozing for {snooze} seconds before scraping next link")
    time.sleep(snooze)

Downloaded link 1 of 4
Snoozing for 4 seconds before scraping next link
Downloaded link 2 of 4
Snoozing for 4 seconds before scraping next link
Downloaded link 3 of 4
Snoozing for 3 seconds before scraping next link
Downloaded link 4 of 4
Snoozing for 5 seconds before scraping next link
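
The download loop is repeated several times in this notebook, so it may be worth wrapping in a function. The sketch below is one possible shape, not the notebook's own code: `download_all`, `fetch`, `sleep`, `low`, and `high` are hypothetical names. The downloader is passed in as an argument so the loop can be tried without touching the network, and the pause is skipped after the final link.

```python
import time
from random import randrange


def download_all(links, fetch, sleep=time.sleep, low=3, high=6):
    """Call fetch(link) for every link, pausing between downloads.

    fetch is any function that downloads one URL (for example wget.download).
    Returns a list of whatever fetch returned for each link.
    """
    results = []
    total = len(links)
    for count, link in enumerate(links, start=1):
        results.append(fetch(link))
        print(f"Downloaded link {count} of {total}")
        if count < total:  # no need to wait after the last file
            snooze = randrange(low, high)
            print(f"Snoozing for {snooze} seconds before scraping next link")
            sleep(snooze)
    return results
```

In the notebook this could be called as `download_all(all_txt_links[3:7], wget.download)`.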


# Find all ```pdf``` files

In [26]:
## grab pdfs
pdf_holder = soup.find_all("ul", class_="pdfs")
pdf_holder

[<ul class="pdfs downloadable">
 <p class="pages">Download this list of PDFs</p>
 <li>PDF Document <a href="files/pdf_1.pdf">1</a> </li>
 <li>PDF Document <a href="files/pdf_2.pdf">2</a></li>
 <li>PDF Document <a href="files/pdf_3.pdf">3</a></li>
 <li>PDF Document <a href="files/pdf_4.pdf">4</a></li>
 <li>PDF Document <a href="files/pdf_5.pdf">5</a></li>
 <li>PDF Document <a href="files/pdf_6.pdf">6</a></li>
 <li>PDF Document <a href="files/pdf_7.pdf">7</a></li>
 <li>PDF Document <a href="files/pdf_8.pdf">8</a></li>
 <li>PDF Document <a href="files/pdf_9.pdf">9</a></li>
 <li>PDF Document <a href="files/pdf_10.pdf">10</a></li>
 </ul>]

## Find all the ```a``` tags 

In [28]:
## for loop: store the a tags in pdf_file_links
for pdf_files in pdf_holder:
    pdf_file_links = pdf_files.find_all("a")
    print(type(pdf_file_links))
    
pdf_file_links

<class 'bs4.element.ResultSet'>


[<a href="files/pdf_1.pdf">1</a>,
 <a href="files/pdf_2.pdf">2</a>,
 <a href="files/pdf_3.pdf">3</a>,
 <a href="files/pdf_4.pdf">4</a>,
 <a href="files/pdf_5.pdf">5</a>,
 <a href="files/pdf_6.pdf">6</a>,
 <a href="files/pdf_7.pdf">7</a>,
 <a href="files/pdf_8.pdf">8</a>,
 <a href="files/pdf_9.pdf">9</a>,
 <a href="files/pdf_10.pdf">10</a>]

## Create a list of the full URLs

Without all the ```html```

In [30]:
## list comprehension: store the full URLs in all_pdf_links
all_pdf_links = [base_url + pdf_file_link.get("href")
                 for pdf_file_link in pdf_file_links]
all_pdf_links

['https://sandeepmj.github.io/scrape-example-page/files/pdf_1.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_2.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_3.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_4.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_5.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_6.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_7.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_8.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_9.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_10.pdf']

## Download all the ```pdf``` documents

In [31]:
links_number = len(all_pdf_links)
link_count = 1
for link in all_pdf_links:
    #files.download(wget.download(link)) ## needed in colab instead of next line
    wget.download(link)
    print(f"Downloaded link {link_count} of {links_number}")  # report after the file is saved
    link_count += 1
    snooze = randrange(3,6)
    print(f"snoozing for {snooze} seconds before scraping next link.")
    time.sleep(snooze)

Downloaded link 1 of 10
snoozing for 3 seconds before scraping next link.
Downloaded link 2 of 10
snoozing for 3 seconds before scraping next link.
Downloaded link 3 of 10
snoozing for 4 seconds before scraping next link.
Downloaded link 4 of 10
snoozing for 3 seconds before scraping next link.
Downloaded link 5 of 10
snoozing for 4 seconds before scraping next link.
Downloaded link 6 of 10
snoozing for 4 seconds before scraping next link.
Downloaded link 7 of 10
snoozing for 5 seconds before scraping next link.
Downloaded link 8 of 10
snoozing for 5 seconds before scraping next link.
Downloaded link 9 of 10
snoozing for 4 seconds before scraping next link.
Downloaded link 10 of 10
snoozing for 5 seconds before scraping next link.
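
By default `wget.download` saves into the current working directory. As an alternative sketch that uses only the standard library instead of the `wget` package, the helper below writes each file into a separate folder. `local_name`, `save_file`, and the `downloads` folder name are assumptions made for illustration.

```python
import shutil
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def local_name(link):
    """Return the file name at the end of a URL path, e.g. 'pdf_1.pdf'."""
    return Path(urlparse(link).path).name


def save_file(link, folder="downloads"):
    """Download one link into folder and return the path it was saved to."""
    target = Path(folder) / local_name(link)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response body straight to disk
    with urlopen(link, timeout=30) as response, open(target, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return target
```

Inside the loop, `save_file(link)` would take the place of `wget.download(link)`.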


# Find all the files and download in one go

In [32]:
## find all files in our soup
all_holder = soup.find_all("li")
all_holder

[<li>Junk Li tag</li>,
 <li>Junk Li tag</li>,
 <li>Text Document <a href="files/text_doc_01.txt">1</a> </li>,
 <li>Text Document <a href="files/text_doc_02.txt">2</a></li>,
 <li>Text Document <a href="files/text_doc_03.txt">3</a></li>,
 <li>Text Document <a href="files/text_doc_04.txt">4</a></li>,
 <li>Text Document <a href="files/text_doc_05.txt">5</a></li>,
 <li>Text Document <a href="files/text_doc_06.txt">6</a></li>,
 <li>Text Document <a href="files/text_doc_07.txt">7</a></li>,
 <li>Text Document <a href="files/text_doc_08.txt">8</a></li>,
 <li>Text Document <a href="files/text_doc_09.txt">9</a></li>,
 <li>Text Document <a href="files/text_doc_10.txt">10</a></li>,
 <li>Junk Li tag</li>,
 <li>Junk Li tag</li>,
 <li>PDF Document <a href="files/pdf_1.pdf">1</a> </li>,
 <li>PDF Document <a href="files/pdf_2.pdf">2</a></li>,
 <li>PDF Document <a href="files/pdf_3.pdf">3</a></li>,
 <li>PDF Document <a href="files/pdf_4.pdf">4</a></li>,
 <li>PDF Document <a href="files/pdf_5.pdf">5</a></

## Stop...we can't throw such a wide net!
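
To see why the wide net is a problem, consider what happens when a junk `li` has no link inside it. The sketch below uses a small stand-in document (`sample_html` is hypothetical) to show that `find("a")` returns `None` for such a tag, so calling `.get("href")` on it would raise an `AttributeError`.

```python
from bs4 import BeautifulSoup

sample_html = """
<li>Junk Li tag</li>
<ul class="txts downloadable">
<li>Text Document <a href="files/text_doc_01.txt">1</a></li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_li = sample_soup.find_all("li")

# find("a") gives None for the junk li because it contains no link
links_found = [li.find("a") for li in sample_li]
print(links_found)

# Skipping the None values keeps only the li tags that actually hold a link
hrefs = [a.get("href") for a in links_found if a is not None]
print(hrefs)
```

Filtering out `None` works, but it would still pick up any unrelated link that happens to sit inside an `li`, which is why narrowing the search by class is the safer choice.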

# Target the class ```downloadable```

In [36]:
## find all files in our soup
docs_holder = soup.find_all("ul", class_="downloadable")
docs_holder

[<ul class="txts downloadable">
 <p class="pages">Download this list of text documents</p>
 <li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
 <li>Text Document <a href="files/text_doc_02.txt">2</a></li>
 <li>Text Document <a href="files/text_doc_03.txt">3</a></li>
 <li>Text Document <a href="files/text_doc_04.txt">4</a></li>
 <li>Text Document <a href="files/text_doc_05.txt">5</a></li>
 <li>Text Document <a href="files/text_doc_06.txt">6</a></li>
 <li>Text Document <a href="files/text_doc_07.txt">7</a></li>
 <li>Text Document <a href="files/text_doc_08.txt">8</a></li>
 <li>Text Document <a href="files/text_doc_09.txt">9</a></li>
 <li>Text Document <a href="files/text_doc_10.txt">10</a></li>
 </ul>,
 <ul class="pdfs downloadable">
 <p class="pages">Download this list of PDFs</p>
 <li>PDF Document <a href="files/pdf_1.pdf">1</a> </li>
 <li>PDF Document <a href="files/pdf_2.pdf">2</a></li>
 <li>PDF Document <a href="files/pdf_3.pdf">3</a></li>
 <li>PDF Document <a href="files

In [35]:
## type?
type(docs_holder)

bs4.element.ResultSet

In [38]:
for item in docs_holder:
    print(item)
    print("************************")

<ul class="txts downloadable">
<p class="pages">Download this list of text documents</p>
<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
<li>Text Document <a href="files/text_doc_03.txt">3</a></li>
<li>Text Document <a href="files/text_doc_04.txt">4</a></li>
<li>Text Document <a href="files/text_doc_05.txt">5</a></li>
<li>Text Document <a href="files/text_doc_06.txt">6</a></li>
<li>Text Document <a href="files/text_doc_07.txt">7</a></li>
<li>Text Document <a href="files/text_doc_08.txt">8</a></li>
<li>Text Document <a href="files/text_doc_09.txt">9</a></li>
<li>Text Document <a href="files/text_doc_10.txt">10</a></li>
</ul>
************************
<ul class="pdfs downloadable">
<p class="pages">Download this list of PDFs</p>
<li>PDF Document <a href="files/pdf_1.pdf">1</a> </li>
<li>PDF Document <a href="files/pdf_2.pdf">2</a></li>
<li>PDF Document <a href="files/pdf_3.pdf">3</a></li>
<li>PDF Document <a href="

### We run into problems because pulling the tags out of each ```ul``` gives us a list of lists

#### Quick detour: a lesson on flattening lists

In [41]:
## docs_holder also holds p tags, newlines, etc., so narrow it down to the li tags
all_li = [myLi.find_all("li") for myLi in docs_holder]
all_li

[[<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>,
  <li>Text Document <a href="files/text_doc_02.txt">2</a></li>,
  <li>Text Document <a href="files/text_doc_03.txt">3</a></li>,
  <li>Text Document <a href="files/text_doc_04.txt">4</a></li>,
  <li>Text Document <a href="files/text_doc_05.txt">5</a></li>,
  <li>Text Document <a href="files/text_doc_06.txt">6</a></li>,
  <li>Text Document <a href="files/text_doc_07.txt">7</a></li>,
  <li>Text Document <a href="files/text_doc_08.txt">8</a></li>,
  <li>Text Document <a href="files/text_doc_09.txt">9</a></li>,
  <li>Text Document <a href="files/text_doc_10.txt">10</a></li>],
 [<li>PDF Document <a href="files/pdf_1.pdf">1</a> </li>,
  <li>PDF Document <a href="files/pdf_2.pdf">2</a></li>,
  <li>PDF Document <a href="files/pdf_3.pdf">3</a></li>,
  <li>PDF Document <a href="files/pdf_4.pdf">4</a></li>,
  <li>PDF Document <a href="files/pdf_5.pdf">5</a></li>,
  <li>PDF Document <a href="files/pdf_6.pdf">6</a></li>,
  <li>PDF Docu

## itertools

In [42]:
## let's use itertools to flatten the list
import itertools

list(itertools.chain(*all_li))

[<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>,
 <li>Text Document <a href="files/text_doc_02.txt">2</a></li>,
 <li>Text Document <a href="files/text_doc_03.txt">3</a></li>,
 <li>Text Document <a href="files/text_doc_04.txt">4</a></li>,
 <li>Text Document <a href="files/text_doc_05.txt">5</a></li>,
 <li>Text Document <a href="files/text_doc_06.txt">6</a></li>,
 <li>Text Document <a href="files/text_doc_07.txt">7</a></li>,
 <li>Text Document <a href="files/text_doc_08.txt">8</a></li>,
 <li>Text Document <a href="files/text_doc_09.txt">9</a></li>,
 <li>Text Document <a href="files/text_doc_10.txt">10</a></li>,
 <li>PDF Document <a href="files/pdf_1.pdf">1</a> </li>,
 <li>PDF Document <a href="files/pdf_2.pdf">2</a></li>,
 <li>PDF Document <a href="files/pdf_3.pdf">3</a></li>,
 <li>PDF Document <a href="files/pdf_4.pdf">4</a></li>,
 <li>PDF Document <a href="files/pdf_5.pdf">5</a></li>,
 <li>PDF Document <a href="files/pdf_6.pdf">6</a></li>,
 <li>PDF Document <a href="file
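
`itertools.chain(*all_li)` unpacks the outer list into separate arguments. A closely related call, `itertools.chain.from_iterable`, takes the outer list as a single argument. The sketch below compares the two on plain strings; `nested` is a stand-in for `all_li`.

```python
import itertools

# Stand-in for a list of lists such as all_li
nested = [["text_doc_01.txt", "text_doc_02.txt"], ["pdf_1.pdf", "pdf_2.pdf"]]

# chain(*nested) unpacks the outer list into separate arguments
flat_star = list(itertools.chain(*nested))

# chain.from_iterable(nested) walks the outer list itself, no unpacking needed
flat_iter = list(itertools.chain.from_iterable(nested))

print(flat_star)
print(flat_star == flat_iter)
```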

In [50]:
## let's blend BeautifulSoup and itertools
## each item in the flattened list is an li tag, so name the loop variable accordingly
all_links = [base_url + li_tag.find("a").get("href")
            for li_tag in itertools.chain(*all_li)]

all_links

['https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_02.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_03.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_04.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_05.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_06.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_07.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_08.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_09.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_10.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_1.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_2.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_3.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/
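
Flattening is only needed because `find_all` was called once per `ul`. As an alternative sketch, BeautifulSoup's `select` accepts a CSS selector and returns one flat list of matching tags. The stand-in `sample_html` and the name `base` below are assumptions for illustration.

```python
from bs4 import BeautifulSoup

base = "https://sandeepmj.github.io/scrape-example-page/"
sample_html = """
<li>Junk Li tag</li>
<ul class="txts downloadable"><li>Text Document <a href="files/text_doc_01.txt">1</a></li></ul>
<ul class="pdfs downloadable"><li>PDF Document <a href="files/pdf_1.pdf">1</a></li></ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# "ul.downloadable a" means: every a tag nested inside a ul with class "downloadable"
anchors = sample_soup.select("ul.downloadable a")

sample_links = [base + a.get("href") for a in anchors]
print(sample_links)
```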

## For Loop

In [None]:
## Flatten via for loop
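
The cell above is left empty. One way it might be filled in is with two nested loops: the outer loop visits each inner list and the inner loop appends each item to a new flat list. The sketch uses plain strings; `nested` and `flat` are hypothetical names, and in the notebook `nested` would be something like `all_li`.

```python
# Stand-in for a list of lists such as all_li
nested = [["files/text_doc_01.txt", "files/text_doc_02.txt"], ["files/pdf_1.pdf"]]

flat = []
for sub_list in nested:      # visit each inner list
    for item in sub_list:    # visit each item inside it
        flat.append(item)

print(flat)
```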


## List Comprehension

In [55]:
# step 1
docs_holder_all_a = [item.find_all("a") for item in docs_holder]
docs_holder_all_a

[[<a href="files/text_doc_01.txt">1</a>,
  <a href="files/text_doc_02.txt">2</a>,
  <a href="files/text_doc_03.txt">3</a>,
  <a href="files/text_doc_04.txt">4</a>,
  <a href="files/text_doc_05.txt">5</a>,
  <a href="files/text_doc_06.txt">6</a>,
  <a href="files/text_doc_07.txt">7</a>,
  <a href="files/text_doc_08.txt">8</a>,
  <a href="files/text_doc_09.txt">9</a>,
  <a href="files/text_doc_10.txt">10</a>],
 [<a href="files/pdf_1.pdf">1</a>,
  <a href="files/pdf_2.pdf">2</a>,
  <a href="files/pdf_3.pdf">3</a>,
  <a href="files/pdf_4.pdf">4</a>,
  <a href="files/pdf_5.pdf">5</a>,
  <a href="files/pdf_6.pdf">6</a>,
  <a href="files/pdf_7.pdf">7</a>,
  <a href="files/pdf_8.pdf">8</a>,
  <a href="files/pdf_9.pdf">9</a>,
  <a href="files/pdf_10.pdf">10</a>]]

In [53]:
# step 2
all_urls_lc = [base_url+item.get("href") for sub_list in docs_holder_all_a for item in sub_list]
all_urls_lc

['https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_02.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_03.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_04.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_05.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_06.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_07.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_08.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_09.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_10.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_1.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_2.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_3.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/

## Download all documents

In [49]:
## use one of the flattened lists we just built (list comprehension, for loop or itertools)
links_number = len(all_links)
link_count = 1
for link in all_links:
    wget.download(link)
    print(f"Downloaded link {link_count} of {links_number}")  # report after the file is saved
    link_count += 1
    snooze = randrange(3,6)
    print(f"snoozing for {snooze} seconds before scraping next link.")
    time.sleep(snooze)

Downloaded link 1 of 20
snoozing for 4 seconds before scraping next link.
Downloaded link 2 of 20
snoozing for 3 seconds before scraping next link.
Downloaded link 3 of 20
snoozing for 4 seconds before scraping next link.
Downloaded link 4 of 20
snoozing for 5 seconds before scraping next link.
Downloaded link 5 of 20
snoozing for 5 seconds before scraping next link.
Downloaded link 6 of 20
snoozing for 3 seconds before scraping next link.
Downloaded link 7 of 20
snoozing for 3 seconds before scraping next link.
Downloaded link 8 of 20
snoozing for 5 seconds before scraping next link.
Downloaded link 9 of 20
snoozing for 3 seconds before scraping next link.
Downloaded link 10 of 20
snoozing for 4 seconds before scraping next link.
Downloaded link 11 of 20
snoozing for 3 seconds before scraping next link.
Downloaded link 12 of 20
snoozing for 4 seconds before scraping next link.
Downloaded link 13 of 20
snoozing for 3 seconds before scraping next link.
Downloaded link 14 of 20
snoozing 