# Scraping Files from Websites 

### You need to create a data set that tracks how many companies the <a href="https://www.sec.gov/litigation/suspensions.shtml">SEC suspended</a> between 1999 and 2019. The data is at:

```https://www.sec.gov/litigation/suspensions.shtml```



### We want to write a scraper that aggregates:

* Date of suspension
* Company name
* Order
* Release (the PDFs in the XX-YYYYY format)
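
As a rough illustration of the release format listed above, the sketch below pulls release-style numbers out of a string with a regular expression. It assumes that "XX-YYYYY" means two digits, a hyphen, and five digits; the source does not spell that out. The sample text and the names `RELEASE_PATTERN` and `find_release_numbers` are made up for this example.

```python
import re

# Assumption: "XX-YYYYY" means two digits, a hyphen, then five digits.
RELEASE_PATTERN = re.compile(r"\b\d{2}-\d{5}\b")


def find_release_numbers(text):
    """Return every substring of text that looks like XX-YYYYY."""
    return RELEASE_PATTERN.findall(text)


# Hypothetical text, invented for illustration only
sample = "Release 34-12345 (pdf) and Release 34-67890 (pdf)"
print(find_release_numbers(sample))
```

If the real release numbers turn out to use a different number of digits, only the pattern needs to change.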

# The Challenge?

### Details are actually in PDFs!

# Demo: downloading files from websites

There are ```txt``` and ```pdf``` files on:

```https://sandeepmj.github.io/scrape-example-page/pages.html```

Do the following:

1. Download all ```txt``` files.
2. Download all ```pdf``` files.
3. Download all files in one pass.

In [1]:
# import libraries
from bs4 import BeautifulSoup  ## scrape info from web pages
import requests ## get web pages from server
import time # time is required. we will use its sleep function
from random import randrange # generate random numbers

# from google.colab import files ## code for downloading in google colab

### Create function to handle our initial requests

In [7]:
## write function here
def mk_request(url):
    '''
    Takes a provided url and returns requested response
    '''
    response = requests.get(url)
    if 200 <= response.status_code < 400:
        return response
    else:
        print(f"request returned {response.status_code} error")


In [15]:
# url to scrape
myurl = "https://sandeepmj.github.io/scrape-example-page/pages.html"

In [9]:
## call the function
response = mk_request(myurl)


## Turn page into soup

In [11]:
## create function to create soup
def mk_soup(response):
    '''
    Make soup
    '''
    return BeautifulSoup(response.text, "html.parser")


In [17]:
## MC function
def scraper(url):
    '''
    enter url, to return soup of page
    '''
    return mk_soup(mk_request(url))


In [16]:
soup = scraper(myurl)
soup

<html lang="en">
<head>
<!-- Makes the page responsive and scaled to be read easily -->
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- Links to stylesheet -->
<link href="style.css" rel="stylesheet" type="text/css"/>
<!-- Remember to update page title -->
<title>List of Documents</title>
</head>
<body>
<!-- All content goes here -->
<div class="container">
<h1>Documents to Download</h1>
<li>Junk Li <a href="">tag 1</a></li>
<li>Junk Li <a href="">tag 2</a></li>
<ul class="txts downloadable">
<p class="pages">Download this first set of text documents</p>
<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
<li>Text Document <a href="files/text_doc_03.txt">3</a></li>
<li>Text Document <a href="files/text_doc_04.txt">4</a></li>
<li>Text Document <a href="files/text_doc_05.txt">5</a></li>
<li>Text Document <a href="files/text_doc_06.txt">6</a></li>
<li>Text Document <a href="files/text_doc_07.

In [13]:
## call the function
soup = mk_soup(response)
soup


## Find all txt files

In [20]:
## save the a tags from the first "txts" list in aTags
aTags = soup.find("ul", class_="txts").find_all("a")
aTags


[<a href="files/text_doc_01.txt">1</a>,
 <a href="files/text_doc_02.txt">2</a>,
 <a href="files/text_doc_03.txt">3</a>,
 <a href="files/text_doc_04.txt">4</a>,
 <a href="files/text_doc_05.txt">5</a>,
 <a href="files/text_doc_06.txt">6</a>,
 <a href="files/text_doc_07.txt">7</a>,
 <a href="files/text_doc_08.txt">8</a>,
 <a href="files/text_doc_09.txt">9</a>,
 <a href="files/text_doc_10.txt">10</a>]

In [None]:
## what type


## Find all the ```a``` tags 

In [None]:
## target a tags


In [24]:
## save without html using list comprehension (base_url is defined in the "base url" cell below)
links = [base_url + atag.get("href") for atag in aTags]
links

['https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_02.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_03.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_04.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_05.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_06.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_07.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_08.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_09.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_10.txt']

In [None]:
## save without html using list comprehension

## What is missing from the URLs?

In [21]:
## base url
base_url = "https://sandeepmj.github.io/scrape-example-page/"
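
Instead of gluing a hand-typed base URL onto each href, the standard library can resolve relative links against the page they came from. This is a minimal sketch of that alternative, not the notebook's own method; `page_url`, `hrefs` and `full_links` are names chosen for this example, and the two hrefs are copied from the output above.

```python
from urllib.parse import urljoin

# URL of the page that was scraped (same value as myurl in the notebook)
page_url = "https://sandeepmj.github.io/scrape-example-page/pages.html"

# Relative hrefs as they appear in the a tags
hrefs = ["files/text_doc_01.txt", "files/text_doc_02.txt"]

# urljoin resolves each relative href against the page URL
full_links = [urljoin(page_url, href) for href in hrefs]
print(full_links)
```

Because `urljoin` works from the page URL, it should also cope with hrefs that start with `/` or that are already absolute, which plain string concatenation would get wrong.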

## Create a list of the full URLs

Without all the ```html```

In [None]:
## lc


## Download all the ```txt``` documents

Install wget, a handy utility for downloading files from links:

In [26]:
pip install wget

Note: you may need to restart the kernel to use updated packages.


In [28]:
## import wget
import wget

In [36]:
## make timer function
def snoozer(start_range, end_range):
    '''
    Sleep for a random whole number of seconds,
    from start_range up to (but not including) end_range
    '''
    snooze_time = randrange(start_range, end_range)
    print(f"\n Snoozing for {snooze_time} seconds")
    time.sleep(snooze_time)
    
    

In [35]:
snoozer(10, 15)


 Snoozing for 13 seconds
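
One detail of the timer is easy to miss: `randrange(10, 15)` can return 10 through 14 but never 15. The sketch below is a hypothetical variant, `snoozer_inclusive`, that uses `random.randint` so both ends are included. Its `sleep` parameter is an extra added for this example so that the pause can be swapped out while experimenting.

```python
import random
import time


def snoozer_inclusive(start, end, sleep=time.sleep):
    """Pause for a whole number of seconds between start and end, both included.

    The sleep argument is a hypothetical hook so the pause can be replaced.
    """
    # randint includes end; randrange(start, end) stops one short of it
    snooze_time = random.randint(start, end)
    print(f"Snoozing for {snooze_time} seconds")
    sleep(snooze_time)
    return snooze_time


# Demo with a stand-in that does not actually wait
waited = snoozer_inclusive(10, 15, sleep=lambda seconds: None)
```

Calling `snoozer_inclusive(10, 15)` without the `sleep` argument pauses for real, like the notebook's `snoozer`.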


In [42]:
## download with timer
start_range, end_range = 10, 21
for link_count, link in enumerate(links, start=1):
    print(f"Downloading link {link_count} of {len(links)}")
    wget.download(link)
    snoozer(start_range, end_range)
    

Downloading link 1 of 10
100% [................................................................] 76 / 76
 Snoozing for 16 seconds
Downloading link 2 of 10
100% [................................................................] 66 / 66
 Snoozing for 13 seconds
Downloading link 3 of 10
100% [................................................................] 70 / 70
 Snoozing for 12 seconds
Downloading link 4 of 10
100% [................................................................] 63 / 63
 Snoozing for 19 seconds
Downloading link 5 of 10
100% [................................................................] 66 / 66
 Snoozing for 12 seconds
Downloading link 6 of 10
100% [................................................................] 66 / 66
 Snoozing for 17 seconds
Downloading link 7 of 10
100% [................................................................] 69 / 69
 Snoozing for 19 seconds
Downloading link 8 of 10
100% [...........................................................
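
With a 10 to 20 second pause per file, rerunning the loop after an interruption gets slow. One possible refinement, sketched here with only the standard library, is to work out where each file would be saved and skip links that are already on disk. The helper names `local_path_for` and `pending_links` and the folder name `downloads` are hypothetical.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlsplit


def local_path_for(url, out_dir):
    """Build the local path for a URL from its last path segment."""
    filename = PurePosixPath(urlsplit(url).path).name
    return Path(out_dir) / filename


def pending_links(links, out_dir):
    """Return only the links whose target file is not already in out_dir."""
    return [link for link in links if not local_path_for(link, out_dir).exists()]


# "downloads" is a hypothetical folder name used for illustration
example_links = [
    "https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt",
    "https://sandeepmj.github.io/scrape-example-page/files/text_doc_02.txt",
]
for link in pending_links(example_links, "downloads"):
    print(link, "->", local_path_for(link, "downloads"))
```

The download loop could then iterate over `pending_links(links, "downloads")` and pass the destination to `wget.download` through its `out` argument, assuming the folder has been created first.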

In [44]:
## find all text docs on page
all_text = soup.find_all("ul", class_="txts")
all_text

[<ul class="txts downloadable">
 <p class="pages">Download this first set of text documents</p>
 <li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
 <li>Text Document <a href="files/text_doc_02.txt">2</a></li>
 <li>Text Document <a href="files/text_doc_03.txt">3</a></li>
 <li>Text Document <a href="files/text_doc_04.txt">4</a></li>
 <li>Text Document <a href="files/text_doc_05.txt">5</a></li>
 <li>Text Document <a href="files/text_doc_06.txt">6</a></li>
 <li>Text Document <a href="files/text_doc_07.txt">7</a></li>
 <li>Text Document <a href="files/text_doc_08.txt">8</a></li>
 <li>Text Document <a href="files/text_doc_09.txt">9</a></li>
 <li>Text Document <a href="files/text_doc_10.txt">10</a></li>
 </ul>,
 <ul class="txts downloadable">
 <p class="pages">Download this second set of text documents</p>
 <li>Text Document <a href="files/text_doc_A.txt">1</a> </li>
 <li>Text Document <a href="files/text_doc_B.txt">2</a></li>
 <li>Text Document <a href="files/text_doc_C.txt">3</a

In [48]:
## collect the a tags from each ul into a list of lists
atag_list = []
for ul in all_text:
    atags = ul.find_all("a")
    atag_list.append(atags)
    print(atags)
    print("******")

[<a href="files/text_doc_01.txt">1</a>, <a href="files/text_doc_02.txt">2</a>, <a href="files/text_doc_03.txt">3</a>, <a href="files/text_doc_04.txt">4</a>, <a href="files/text_doc_05.txt">5</a>, <a href="files/text_doc_06.txt">6</a>, <a href="files/text_doc_07.txt">7</a>, <a href="files/text_doc_08.txt">8</a>, <a href="files/text_doc_09.txt">9</a>, <a href="files/text_doc_10.txt">10</a>]
******
[<a href="files/text_doc_A.txt">1</a>, <a href="files/text_doc_B.txt">2</a>, <a href="files/text_doc_C.txt">3</a>, <a href="files/text_doc_D.txt">4</a>, <a href="files/text_doc_E.txt">5</a>, <a href="files/text_doc_F.txt">6</a>, <a href="files/text_doc_G.txt">7</a>, <a href="files/text_doc_H.txt">8</a>, <a href="files/text_doc_I.txt">9</a>, <a href="files/text_doc_J.txt">10</a>]
******


In [49]:
atag_list

[[<a href="files/text_doc_01.txt">1</a>,
  <a href="files/text_doc_02.txt">2</a>,
  <a href="files/text_doc_03.txt">3</a>,
  <a href="files/text_doc_04.txt">4</a>,
  <a href="files/text_doc_05.txt">5</a>,
  <a href="files/text_doc_06.txt">6</a>,
  <a href="files/text_doc_07.txt">7</a>,
  <a href="files/text_doc_08.txt">8</a>,
  <a href="files/text_doc_09.txt">9</a>,
  <a href="files/text_doc_10.txt">10</a>],
 [<a href="files/text_doc_A.txt">1</a>,
  <a href="files/text_doc_B.txt">2</a>,
  <a href="files/text_doc_C.txt">3</a>,
  <a href="files/text_doc_D.txt">4</a>,
  <a href="files/text_doc_E.txt">5</a>,
  <a href="files/text_doc_F.txt">6</a>,
  <a href="files/text_doc_G.txt">7</a>,
  <a href="files/text_doc_H.txt">8</a>,
  <a href="files/text_doc_I.txt">9</a>,
  <a href="files/text_doc_J.txt">10</a>]]

In [50]:
import itertools

In [51]:
html_links = list(itertools.chain(*atag_list))
html_links

[<a href="files/text_doc_01.txt">1</a>,
 <a href="files/text_doc_02.txt">2</a>,
 <a href="files/text_doc_03.txt">3</a>,
 <a href="files/text_doc_04.txt">4</a>,
 <a href="files/text_doc_05.txt">5</a>,
 <a href="files/text_doc_06.txt">6</a>,
 <a href="files/text_doc_07.txt">7</a>,
 <a href="files/text_doc_08.txt">8</a>,
 <a href="files/text_doc_09.txt">9</a>,
 <a href="files/text_doc_10.txt">10</a>,
 <a href="files/text_doc_A.txt">1</a>,
 <a href="files/text_doc_B.txt">2</a>,
 <a href="files/text_doc_C.txt">3</a>,
 <a href="files/text_doc_D.txt">4</a>,
 <a href="files/text_doc_E.txt">5</a>,
 <a href="files/text_doc_F.txt">6</a>,
 <a href="files/text_doc_G.txt">7</a>,
 <a href="files/text_doc_H.txt">8</a>,
 <a href="files/text_doc_I.txt">9</a>,
 <a href="files/text_doc_J.txt">10</a>]
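
The flat list above can also be produced in one step with a CSS selector, and the same idea extends to picking files by extension, which is one way to approach the pdf task. The sketch below runs on a small made-up HTML string rather than the live page. The pdf link in it is invented, since the pdf markup of the demo page is not shown here, and `links_by_extension` is a hypothetical helper.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://sandeepmj.github.io/scrape-example-page/"

# Small hypothetical stand-in for the demo page; the pdf entry is invented
sample_html = """
<li>Junk Li <a href="">tag 1</a></li>
<ul class="txts downloadable">
  <li>Text Document <a href="files/text_doc_01.txt">1</a></li>
</ul>
<ul class="txts downloadable">
  <li>Text Document <a href="files/text_doc_A.txt">1</a></li>
</ul>
<li>Report <a href="files/report.pdf">pdf</a></li>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Every a tag inside any ul with class "txts", already flat
flat_tags = sample_soup.select("ul.txts a")


def links_by_extension(soup, base, extension):
    """Return full URLs for every a tag whose href ends with extension."""
    hrefs = [a["href"] for a in soup.select("a[href]")]
    return [urljoin(base, h) for h in hrefs if h.lower().endswith(extension)]


txt_links = links_by_extension(sample_soup, base_url, ".txt")
pdf_links = links_by_extension(sample_soup, base_url, ".pdf")
print(txt_links)
print(pdf_links)
```

Filtering on the href rather than on a class name also drops the junk links with empty hrefs, at the cost of relying on the file extension being present in every link.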