# Scraping Files from Websites

### You need to create a data set that tracks how many companies the <a href="https://www.sec.gov/litigation/suspensions.shtml">SEC suspended</a> between 1999 and 2019. You find the data at:

```https://www.sec.gov/litigation/suspensions.shtml```



### We want to write a scraper that aggregates:

* Date of suspension
* Company name
* Order
* Release (the PDFs in the XX-YYYYY format)

# The Challenge?

### Details are actually in PDFs!

# Demo downloading files from websites

There are ```txt``` and ```pdf``` files on:

```https://sandeepmj.github.io/scrape-example-page/pages.html```

Do the following:

1. Download all ```txt``` files.
2. Download all ```pdf``` files.
3. Download the ```txt``` files from both sets at one time.

In [1]:
# import libraries
from bs4 import BeautifulSoup  ## scrape info from web pages
import requests ## get web pages from server
import time # time is required. we will use its sleep function
from random import randrange # generate random numbers

# from google.colab import files ## code for downloading in google colab

In [2]:
# url to scrape
url = "https://sandeepmj.github.io/scrape-example-page/pages.html"

In [3]:
## request site data
response = requests.get(url)
response.status_code

200

## Turn page into soup

In [4]:
## parse the response text into a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

## Find all txt files in the first set of txt files

In [5]:
## too wide: this grabs every li on the page, including the junk and pdf items
soup.find_all("li")

[<li>Junk Li <a href="">tag 1</a></li>,
 <li>Junk Li <a href="">tag 2</a></li>,
 <li>Text Document <a href="files/text_doc_01.txt">1</a> </li>,
 <li>Text Document <a href="files/text_doc_02.txt">2</a></li>,
 <li>Text Document <a href="files/text_doc_03.txt">3</a></li>,
 <li>Text Document <a href="files/text_doc_04.txt">4</a></li>,
 <li>Text Document <a href="files/text_doc_05.txt">5</a></li>,
 <li>Text Document <a href="files/text_doc_06.txt">6</a></li>,
 <li>Text Document <a href="files/text_doc_07.txt">7</a></li>,
 <li>Text Document <a href="files/text_doc_08.txt">8</a></li>,
 <li>Text Document <a href="files/text_doc_09.txt">9</a></li>,
 <li>Text Document <a href="files/text_doc_10.txt">10</a></li>,
 <li>Junk Li <a href="">tag 3</a></li>,
 <li>Junk Li <a href="">tag 4</a></li>,
 <li>PDF Document <a href="files/pdf_1.pdf">1</a> </li>,
 <li>PDF Document <a href="files/pdf_2.pdf">2</a></li>,
 <li>PDF Document <a href="files/pdf_3.pdf">3</a></li>,
 <li>PDF Document <a href="files/pdf_4.
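One way to narrow an overly wide result is to filter the `href` values by file extension. The sketch below is only an illustration: the `hrefs` list is a hypothetical sample shaped like the links on the demo page, and `links_with_extension` is a helper name made up for this example.

```python
# Hypothetical sample of href values, shaped like the links on the demo page.
hrefs = ["", "files/text_doc_01.txt", "files/text_doc_02.txt", "", "files/pdf_1.pdf"]


def links_with_extension(hrefs, extension):
    """Return only the hrefs that end with the given extension (case-insensitive)."""
    return [href for href in hrefs if href.lower().endswith(extension.lower())]


txt_hrefs = links_with_extension(hrefs, ".txt")
pdf_hrefs = links_with_extension(hrefs, ".pdf")
print(txt_hrefs)
print(pdf_hrefs)
```

Filtering on the extension also drops the empty `href` values from the junk items, but it cannot tell two sets of `txt` files apart. Targeting the enclosing `ul` by class handles that case.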

In [6]:
## save the first ul with class txts in txt_holder (a single Tag, not a list)
txt_holder = soup.find("ul", class_="txts")
txt_holder

<ul class="txts downloadable">
<p class="pages">Download this first set of text documents</p>
<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
<li>Text Document <a href="files/text_doc_03.txt">3</a></li>
<li>Text Document <a href="files/text_doc_04.txt">4</a></li>
<li>Text Document <a href="files/text_doc_05.txt">5</a></li>
<li>Text Document <a href="files/text_doc_06.txt">6</a></li>
<li>Text Document <a href="files/text_doc_07.txt">7</a></li>
<li>Text Document <a href="files/text_doc_08.txt">8</a></li>
<li>Text Document <a href="files/text_doc_09.txt">9</a></li>
<li>Text Document <a href="files/text_doc_10.txt">10</a></li>
</ul>

In [7]:
## type
type(txt_holder)

bs4.element.Tag

## Find all the ```a``` tags

In [8]:
## find a tags
link_a_tags = txt_holder.find_all("a")
link_a_tags

[<a href="files/text_doc_01.txt">1</a>,
 <a href="files/text_doc_02.txt">2</a>,
 <a href="files/text_doc_03.txt">3</a>,
 <a href="files/text_doc_04.txt">4</a>,
 <a href="files/text_doc_05.txt">5</a>,
 <a href="files/text_doc_06.txt">6</a>,
 <a href="files/text_doc_07.txt">7</a>,
 <a href="files/text_doc_08.txt">8</a>,
 <a href="files/text_doc_09.txt">9</a>,
 <a href="files/text_doc_10.txt">10</a>]

In [9]:
## find a_tags in one step

a_list = soup.find("ul", class_="txts").find_all("a")
a_list

[<a href="files/text_doc_01.txt">1</a>,
 <a href="files/text_doc_02.txt">2</a>,
 <a href="files/text_doc_03.txt">3</a>,
 <a href="files/text_doc_04.txt">4</a>,
 <a href="files/text_doc_05.txt">5</a>,
 <a href="files/text_doc_06.txt">6</a>,
 <a href="files/text_doc_07.txt">7</a>,
 <a href="files/text_doc_08.txt">8</a>,
 <a href="files/text_doc_09.txt">9</a>,
 <a href="files/text_doc_10.txt">10</a>]

In [10]:
## for loop: pull the href value out of each a tag
links = []
for a_tag in a_list:
  # print(a_tag)
  links.append(a_tag.get("href"))

In [11]:
## look at the links
links

['files/text_doc_01.txt',
 'files/text_doc_02.txt',
 'files/text_doc_03.txt',
 'files/text_doc_04.txt',
 'files/text_doc_05.txt',
 'files/text_doc_06.txt',
 'files/text_doc_07.txt',
 'files/text_doc_08.txt',
 'files/text_doc_09.txt',
 'files/text_doc_10.txt']

## What is missing from the URLs?

In [12]:
## base url
base_url = "https://sandeepmj.github.io/scrape-example-page/"

## Create a list of the full URLs

The links are now plain strings, without the surrounding ```html``` tags.

In [13]:
## list comprehension: prepend the base url to each relative link
all_links = [base_url + link for link in links]
all_links

['https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_02.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_03.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_04.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_05.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_06.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_07.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_08.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_09.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_10.txt']
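String concatenation works here because every link is relative to the same folder. As a hedged alternative, the standard library's `urljoin` resolves a relative link against the page URL itself, so no separate base URL has to be typed by hand. The variable names below are chosen for this sketch only.

```python
from urllib.parse import urljoin

# The page we scraped and two of the relative links found on it.
page_url = "https://sandeepmj.github.io/scrape-example-page/pages.html"
relative_links = ["files/text_doc_01.txt", "files/text_doc_02.txt"]

# urljoin swaps the last path segment (pages.html) for the relative link.
full_links = [urljoin(page_url, link) for link in relative_links]
print(full_links)
```

`urljoin` also copes with links that start with `/` or that are already absolute, which plain concatenation would get wrong.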

## Download all the ```txt``` documents

In [14]:
pip install wget

Note: you may need to restart the kernel to use updated packages.


In [15]:
import wget # can pull down documents and files from websites

In [16]:
## download with timer
links_total = len(all_links)
link_count = 1

for link in all_links:
  print(f"Downloading link {link_count} of {links_total}")
  link_count += 1
  wget.download(link) ## non-colab notebook code
#   files.download(wget.download(link)) ## this has files.download colab command
  snoozer = randrange(3, 7)
  print(f"Snoozing for {snoozer} seconds before next link")
  time.sleep(snoozer)


Downloading link 1 of 10



# Find all ```pdf``` files

In [None]:
## grab pdfs links
pdf_a = soup.find("ul", class_="pdfs").find_all("a")
pdf_a

## Build the full ```pdf``` URLs

In [None]:
## for loop: store the full URLs in pdf_links_fl
pdf_links_fl = []
for pdf_link in pdf_a:
  pdf_links_fl.append(base_url + pdf_link.get("href"))

pdf_links_fl

In [None]:
## list comprehension

pdf_links = [base_url + a_tag.get("href") for a_tag in pdf_a]
pdf_links

## Download all the ```pdf``` documents

In [None]:
## download with timer
links_total = len(pdf_links)
link_count = 1

for link in pdf_links:
  print(f"Downloading link {link_count} of {links_total}")
  link_count += 1
  wget.download(link) ## non-colab notebook code
#   files.download(wget.download(link)) ## colab only: needs "from google.colab import files"
  snoozer = randrange(3, 7)
  print(f"Snoozing for {snoozer} seconds before next link")
  time.sleep(snoozer)

# Find all the txt files in both sets and download at one go

# Target the class ```txts```

In [17]:
## find every ul with the class txts in our soup
docs_holder = soup.find_all("ul", class_="txts")
docs_holder

[<ul class="txts downloadable">
 <p class="pages">Download this first set of text documents</p>
 <li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
 <li>Text Document <a href="files/text_doc_02.txt">2</a></li>
 <li>Text Document <a href="files/text_doc_03.txt">3</a></li>
 <li>Text Document <a href="files/text_doc_04.txt">4</a></li>
 <li>Text Document <a href="files/text_doc_05.txt">5</a></li>
 <li>Text Document <a href="files/text_doc_06.txt">6</a></li>
 <li>Text Document <a href="files/text_doc_07.txt">7</a></li>
 <li>Text Document <a href="files/text_doc_08.txt">8</a></li>
 <li>Text Document <a href="files/text_doc_09.txt">9</a></li>
 <li>Text Document <a href="files/text_doc_10.txt">10</a></li>
 </ul>,
 <ul class="txts downloadable">
 <p class="pages">Download this second set of text documents</p>
 <li>Text Document <a href="files/text_doc_A.txt">1</a> </li>
 <li>Text Document <a href="files/text_doc_B.txt">2</a></li>
 <li>Text Document <a href="files/text_doc_C.txt">3</a

In [18]:
## type?
type(docs_holder)

bs4.element.ResultSet

In [19]:
## length
len(docs_holder)

2
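The two types seen so far come from two different calls: `find` returns the first match as a single `Tag`, while `find_all` returns a list-like `ResultSet` of every match. The sketch below shows this on a small inline HTML string, which is a hypothetical stand-in for the demo page rather than its real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the demo page: two ul blocks sharing the class txts.
sample_html = """
<ul class="txts downloadable">
  <li>Text Document <a href="files/text_doc_01.txt">1</a></li>
</ul>
<ul class="txts downloadable">
  <li>Text Document <a href="files/text_doc_A.txt">1</a></li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

first_only = sample_soup.find("ul", class_="txts")      # one Tag
every_one = sample_soup.find_all("ul", class_="txts")   # ResultSet of Tags

print(type(first_only).__name__)
print(type(every_one).__name__, len(every_one))
```

Because a `ResultSet` behaves like a list, it has to be looped over (or indexed) before calling `find_all("a")` on its members.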

In [20]:
## print to see each new one
for section in docs_holder:
  print(section)
  print("**************")

<ul class="txts downloadable">
<p class="pages">Download this first set of text documents</p>
<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
<li>Text Document <a href="files/text_doc_03.txt">3</a></li>
<li>Text Document <a href="files/text_doc_04.txt">4</a></li>
<li>Text Document <a href="files/text_doc_05.txt">5</a></li>
<li>Text Document <a href="files/text_doc_06.txt">6</a></li>
<li>Text Document <a href="files/text_doc_07.txt">7</a></li>
<li>Text Document <a href="files/text_doc_08.txt">8</a></li>
<li>Text Document <a href="files/text_doc_09.txt">9</a></li>
<li>Text Document <a href="files/text_doc_10.txt">10</a></li>
</ul>
**************
<ul class="txts downloadable">
<p class="pages">Download this second set of text documents</p>
<li>Text Document <a href="files/text_doc_A.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_B.txt">2</a></li>
<li>Text Document <a href="files/text_doc_C.txt">3</a></l

In [21]:
## print each ul section (these are still ul tags, not a tags)
all_links = []
for section in docs_holder:
  print(section)

<ul class="txts downloadable">
<p class="pages">Download this first set of text documents</p>
<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
<li>Text Document <a href="files/text_doc_03.txt">3</a></li>
<li>Text Document <a href="files/text_doc_04.txt">4</a></li>
<li>Text Document <a href="files/text_doc_05.txt">5</a></li>
<li>Text Document <a href="files/text_doc_06.txt">6</a></li>
<li>Text Document <a href="files/text_doc_07.txt">7</a></li>
<li>Text Document <a href="files/text_doc_08.txt">8</a></li>
<li>Text Document <a href="files/text_doc_09.txt">9</a></li>
<li>Text Document <a href="files/text_doc_10.txt">10</a></li>
</ul>
<ul class="txts downloadable">
<p class="pages">Download this second set of text documents</p>
<li>Text Document <a href="files/text_doc_A.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_B.txt">2</a></li>
<li>Text Document <a href="files/text_doc_C.txt">3</a></li>
<li>Text Doc

In [22]:
## store each section's a tags in a list (this produces a list of lists)
all_a = [section.find_all("a") for section in docs_holder]
all_a

[[<a href="files/text_doc_01.txt">1</a>,
  <a href="files/text_doc_02.txt">2</a>,
  <a href="files/text_doc_03.txt">3</a>,
  <a href="files/text_doc_04.txt">4</a>,
  <a href="files/text_doc_05.txt">5</a>,
  <a href="files/text_doc_06.txt">6</a>,
  <a href="files/text_doc_07.txt">7</a>,
  <a href="files/text_doc_08.txt">8</a>,
  <a href="files/text_doc_09.txt">9</a>,
  <a href="files/text_doc_10.txt">10</a>],
 [<a href="files/text_doc_A.txt">1</a>,
  <a href="files/text_doc_B.txt">2</a>,
  <a href="files/text_doc_C.txt">3</a>,
  <a href="files/text_doc_D.txt">4</a>,
  <a href="files/text_doc_E.txt">5</a>,
  <a href="files/text_doc_F.txt">6</a>,
  <a href="files/text_doc_G.txt">7</a>,
  <a href="files/text_doc_H.txt">8</a>,
  <a href="files/text_doc_I.txt">9</a>,
  <a href="files/text_doc_J.txt">10</a>]]

### flat is better than nested!

In [None]:
import this

### We run into problems because we have a list of lists

#### Quick detour to flatten list lesson

## itertools

In [23]:
## let's use itertools to flatten the list
import itertools

In [24]:
## chain unpacks the nested lists into one flat list of a tags
target_a = list(itertools.chain(*all_a))
target_a

[<a href="files/text_doc_01.txt">1</a>,
 <a href="files/text_doc_02.txt">2</a>,
 <a href="files/text_doc_03.txt">3</a>,
 <a href="files/text_doc_04.txt">4</a>,
 <a href="files/text_doc_05.txt">5</a>,
 <a href="files/text_doc_06.txt">6</a>,
 <a href="files/text_doc_07.txt">7</a>,
 <a href="files/text_doc_08.txt">8</a>,
 <a href="files/text_doc_09.txt">9</a>,
 <a href="files/text_doc_10.txt">10</a>,
 <a href="files/text_doc_A.txt">1</a>,
 <a href="files/text_doc_B.txt">2</a>,
 <a href="files/text_doc_C.txt">3</a>,
 <a href="files/text_doc_D.txt">4</a>,
 <a href="files/text_doc_E.txt">5</a>,
 <a href="files/text_doc_F.txt">6</a>,
 <a href="files/text_doc_G.txt">7</a>,
 <a href="files/text_doc_H.txt">8</a>,
 <a href="files/text_doc_I.txt">9</a>,
 <a href="files/text_doc_J.txt">10</a>]
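For the flatten detour, here is a minimal sketch on plain strings instead of `a` tags, so it runs without any scraping. It compares `itertools.chain.from_iterable`, which takes the nested list directly, with an equivalent nested list comprehension. The `nested` list is made-up sample data.

```python
import itertools

# Made-up nested list, shaped like the list of lists that find_all produced.
nested = [
    ["files/text_doc_01.txt", "files/text_doc_02.txt"],
    ["files/text_doc_A.txt", "files/text_doc_B.txt"],
]

# Option 1: chain.from_iterable takes the outer list itself, no * unpacking needed.
flat_chain = list(itertools.chain.from_iterable(nested))

# Option 2: a nested list comprehension, read left to right like two for loops.
flat_lc = [item for inner in nested for item in inner]

print(flat_chain)
print(flat_chain == flat_lc)
```

Both options keep the original order, and both flatten only one level of nesting.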

## List Comprehension

In [25]:
# build the full URLs from the flattened list of a tags
all_links = [base_url + atag.get("href") for atag in target_a]
all_links

['https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_02.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_03.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_04.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_05.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_06.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_07.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_08.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_09.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_10.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_A.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_B.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_C.txt',
 'https://sandeepmj.github.io/scrape-exam
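The nested list can also be avoided altogether. As a sketch, assuming the page structure matches the small hypothetical HTML below, BeautifulSoup's `select` takes a CSS selector and returns every matching `a` tag in one flat result.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the demo page; the real page has more items.
sample_html = """
<ul class="txts downloadable">
  <li>Text Document <a href="files/text_doc_01.txt">1</a></li>
  <li>Text Document <a href="files/text_doc_02.txt">2</a></li>
</ul>
<ul class="pdfs">
  <li>PDF Document <a href="files/pdf_1.pdf">1</a></li>
</ul>
<ul class="txts downloadable">
  <li>Text Document <a href="files/text_doc_A.txt">1</a></li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# "ul.txts a" means: every a tag that sits inside a ul with the class txts.
flat_hrefs = [a_tag.get("href") for a_tag in sample_soup.select("ul.txts a")]
print(flat_hrefs)
```

The result is already flat, so no `itertools` step is needed. The `find_all` plus flatten route is still worth knowing for pages where a single selector cannot describe the target.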

## Download all documents

In [26]:
## careful to put in a list name we just processed (via lc, fl, itertools)
## download with timer
links_total = len(all_links)
link_count = 1

for link in all_links:
  print(f"Downloading link {link_count} of {links_total}")
  link_count += 1
  wget.download(link) ## non-colab notebook code
#   files.download(wget.download(link)) ## this has files.download colab command
  snoozer = randrange(3, 7)
  print(f"Snoozing for {snoozer} seconds before next link")
  time.sleep(snoozer)

Downloading link 1 of 20
100% [................................................................] 76 / 76Snoozing for 3 seconds before next link
Downloading link 2 of 20
100% [................................................................] 66 / 66Snoozing for 4 seconds before next link
Downloading link 3 of 20
100% [................................................................] 70 / 70Snoozing for 4 seconds before next link
Downloading link 4 of 20
100% [................................................................] 63 / 63Snoozing for 5 seconds before next link
Downloading link 5 of 20
100% [................................................................] 66 / 66Snoozing for 5 seconds before next link
Downloading link 6 of 20
100% [................................................................] 66 / 66Snoozing for 5 seconds before next link
Downloading link 7 of 20
100% [................................................................] 69 / 69Snoozing for 3 seconds before ne
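The manual `link_count` counter in the download loop can be replaced by `enumerate`. The sketch below only builds the progress messages and the local file names, so it runs without touching the network. `local_name` and `progress_lines` are helper names invented for this example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_name(url):
    """Return the file name at the end of a URL path, e.g. text_doc_01.txt."""
    return PurePosixPath(urlparse(url).path).name


def progress_lines(urls):
    """Build one progress message per URL, counting from 1."""
    total = len(urls)
    return [
        f"Downloading link {count} of {total}: {local_name(url)}"
        for count, url in enumerate(urls, start=1)
    ]


sample_links = [
    "https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt",
    "https://sandeepmj.github.io/scrape-example-page/files/text_doc_02.txt",
]
for line in progress_lines(sample_links):
    print(line)
```

In a real loop, the same `enumerate(urls, start=1)` pattern would wrap the download call and the `time.sleep` pause.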

In [None]:
## glob
import glob

In [None]:
## put only the txt files in the current folder into a list
txt_files = glob.glob("*.txt")
txt_files
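To see what the `*.txt` pattern matches without relying on earlier downloads, the sketch below creates a few placeholder files in a temporary folder and lists them with `pathlib`, which offers the same pattern matching as `glob.glob`. The file names are placeholders modeled on the demo page.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    folder = Path(tmp_dir)

    # Create placeholder files: two txt files and one pdf.
    for name in ["text_doc_01.txt", "text_doc_02.txt", "pdf_1.pdf"]:
        (folder / name).write_text("placeholder")

    # Only the names ending in .txt match the pattern; sort for a stable order.
    txt_names = sorted(path.name for path in folder.glob("*.txt"))

print(txt_names)
```

Sorting matters because neither `glob.glob` nor `Path.glob` promises any particular order.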

In [None]:
## download txt files to your computer (colab only: needs "from google.colab import files")
for txt_file in txt_files:
  files.download(txt_file)