# Scraping Files from Websites 

### You need to create a data set that tracks how many companies the SEC suspended between 1999 and 2019. You find the data at:

```https://www.sec.gov/litigation/suspensions.shtml```



### Write a scraper that aggregates:

* Date of suspension
* Company name
* Order
* Release (the PDFs in the XX-YYYYY format)

# The Challenge?

### Details are actually in PDFs!

# Demo downloading files from websites 

There are ```txt``` and ```pdf``` files on:

```https://sandeepmj.github.io/scrape-example-page/pages.html```

Do the following:

1. Download all ```txt``` files.
2. Download all ```pdf``` files.
3. Download all files in one go.

In [None]:
# import libraries
from bs4 import BeautifulSoup  ## scrape info from web pages
import requests ## get web pages from server
import time # time is required. we will use its sleep function
from random import randrange # generate random numbers

# from google.colab import files ## code for downloading in google colab

In [None]:
# url to scrape
url = "https://sandeepmj.github.io/scrape-example-page/pages.html"

## Turn page into soup

In [None]:
## get the page and print the soup; the raw output is hard to read (soup.prettify() gives an indented view)
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
print(soup)

## Find all txt files

In [None]:
txt_holder = soup.find_all("ul", class_="txts")
print(txt_holder)

## Find all the ```a``` tags 

In [None]:
for txt_files in txt_holder:
    txt_file_links = txt_files.find_all("a")
    print(type(txt_file_links))
    print(txt_file_links)

In [None]:
txt_file_links

## What is missing from the URLs?

In [None]:
base_url = "https://sandeepmj.github.io/scrape-example-page/"

## Create a list of the full URLs

Without all the ```html```

In [None]:
all_txt_links = [base_url + txt_link.get("href") for txt_link in txt_file_links]
all_txt_links
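
Joining `base_url` and each `href` with `+` assumes every `href` is relative to the same folder. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves an `href` against the page URL and leaves links that are already absolute untouched. The `href` values below are made-up stand-ins for illustration, not the actual links on the demo page.

```python
from urllib.parse import urljoin

page_url = "https://sandeepmj.github.io/scrape-example-page/pages.html"

# hypothetical href values, for illustration only
sample_hrefs = [
    "files/example-1.txt",                        # relative to the page's folder
    "/scrape-example-page/files/example-2.txt",   # relative to the site root
    "https://example.com/other.pdf",              # already absolute
]

# urljoin drops "pages.html" and resolves each href against what is left
full_urls = [urljoin(page_url, href) for href in sample_hrefs]
for full_url in full_urls:
    print(full_url)
```

With this approach there is no separate `base_url` to keep in sync with the page being scraped.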

## Download all the ```txt``` documents

In [None]:
import wget # can pull down documents, files from websites

In [None]:
links_number = len(all_txt_links)
link_count = 1
for link in all_txt_links:
    print(f"Downloading link {link_count} of {links_number}")
#     files.download(wget.download(link)) ## needed in colab instead of next line
    wget.download(link)
    link_count += 1
    snooze = randrange(3,6)
    print(f"snoozing for {snooze} seconds before scraping next link.")
    time.sleep(snooze)
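
The same loop is needed again for the other file types, so it can be wrapped in a function. What follows is a minimal sketch, not part of the original notebook: `download_all` and its `fetch`, `min_wait` and `max_wait` parameters are hypothetical names, and `fake_fetch` is a stand-in so the example runs without touching the network.

```python
import time
from random import randrange

def download_all(links, fetch, min_wait=3, max_wait=6):
    """Call fetch(link) for each link, pausing a random time between calls."""
    results = []
    total = len(links)
    # enumerate replaces the manual link_count counter
    for count, link in enumerate(links, start=1):
        print(f"Downloading link {count} of {total}")
        results.append(fetch(link))
        if count < total:  # no need to wait after the last link
            snooze = randrange(min_wait, max_wait)
            print(f"snoozing for {snooze} seconds before scraping next link.")
            time.sleep(snooze)
    return results

# stand-in for a real downloader: returns the file name without fetching anything
def fake_fetch(link):
    return link.rsplit("/", 1)[-1]

saved = download_all(
    ["https://example.com/a.txt", "https://example.com/b.txt"],
    fake_fetch,
    min_wait=0,
    max_wait=1,
)
print(saved)
```

In the notebook, the call would be `download_all(all_txt_links, wget.download)`, which keeps the default pause of 3 to 5 seconds.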

## Find all ```pdf``` files

In [None]:
pdf_holder = soup.find_all("ul", class_= "pdfs")
print(pdf_holder)

## Find all the ```a``` tags 

In [None]:
for pdf_files in pdf_holder:
    pdf_file_links = pdf_files.find_all("a")
    print(type(pdf_file_links))
    print(pdf_file_links)

## Create a list of the full URLs

Without all the ```html```

In [None]:
all_pdf_links = [base_url + pdf_file_link.get("href") for pdf_file_link in pdf_file_links]
print(all_pdf_links)

## Download all the ```pdf``` documents

In [None]:
links_number = len(all_pdf_links)
link_count = 1
for link in all_pdf_links:
    print(f"Downloading link {link_count} of {links_number}")
    #files.download(wget.download(link)) ## needed in colab instead of next line
    wget.download(link)
    link_count += 1
    snooze = randrange(3,6)
    print(f"snoozing for {snooze} seconds before scraping next link.")
    time.sleep(snooze)
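
Re-running a download cell fetches every file again. A possible guard, sketched here under the assumption that each file is saved in the working folder under the last segment of its URL, is to skip links whose file already exists. `filename_from_url` and `links_still_needed` are hypothetical helper names, not part of any library.

```python
from pathlib import Path
from posixpath import basename
from urllib.parse import urlparse, unquote

def filename_from_url(link):
    """Return the last path segment of a URL, ignoring any query string."""
    return unquote(basename(urlparse(link).path))

def links_still_needed(links, folder="."):
    """Keep only the links whose file name is not already present in folder."""
    return [
        link for link in links
        if not (Path(folder) / filename_from_url(link)).exists()
    ]

# made-up URL, for illustration only
print(filename_from_url("https://example.com/files/report-1.pdf?download=1"))
```

The download loop could then iterate over `links_still_needed(all_pdf_links)` instead of `all_pdf_links`.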

## Find all the files and download in one go

In [None]:
## find all files in our soup
all_holder = soup.find_all("li")
all_holder

## Stop...we can't throw such a wide net!

# Target the class ```downloadable```

In [None]:
## find all files in our soup
docs_holder = soup.find_all("ul", class_ = "downloadable")
docs_holder

In [None]:
type(docs_holder)

### We run into problems because we have a list of lists

#### Quick detour to flatten list lesson

In [None]:
## because docs_holder has p tags, newlines, etc. we need to focus it
all_li = [myLi.find_all("li") for myLi in docs_holder ]
all_li

## itertools

In [None]:
## let's use itertools to flatten the list
import itertools

list(itertools.chain(*all_li))
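
To see what the flattening does without any HTML in the way, here is a small sketch on plain lists. The `nested` list is a made-up stand-in for `all_li`. It also shows `itertools.chain.from_iterable`, which gives the same result as `chain(*...)` without unpacking the outer list into separate arguments.

```python
import itertools

# stand-in for all_li: one inner list per <ul>, one of them empty
nested = [["a.txt", "b.txt"], ["c.pdf"], []]

flat_star = list(itertools.chain(*nested))               # unpack, then chain
flat_iter = list(itertools.chain.from_iterable(nested))  # chain without unpacking

print(flat_star)
print(flat_iter)
```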

In [None]:
## let's blend BeautifulSoup and itertools
all_links = [base_url + li.find("a").get("href")
             for li in itertools.chain(*all_li)]
all_links

## For Loop

In [None]:
## Flatten via for loop
all_urls = []
for ul in docs_holder:
    # use names that do not overwrite the page-level url variable
    for a_tag in ul.find_all("a"):
        href = a_tag.get("href")
        all_urls.append(base_url + href)

all_urls

## List Comprehension

In [None]:
docs_holder_all_a = [item.find_all("a") for item in docs_holder]
docs_holder_all_a

In [None]:
all_urls_lc = [base_url+item.get("href") for sub_list in docs_holder_all_a for item in sub_list]
all_urls_lc
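
The two `for` clauses in a flattening list comprehension read in the same order as the equivalent nested loops: outer loop first, inner loop second. A minimal sketch on plain lists, where `nested` and `prefix` are made-up stand-ins for `docs_holder_all_a` and `base_url`:

```python
nested = [["a.txt", "b.txt"], ["c.pdf"]]
prefix = "https://example.com/"

# nested loops: outer loop over sub-lists, inner loop over items
flat_loop = []
for sub_list in nested:
    for item in sub_list:
        flat_loop.append(prefix + item)

# same order of clauses, written as one expression
flat_lc = [prefix + item for sub_list in nested for item in sub_list]

print(flat_loop == flat_lc)
```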

## Download all documents

In [None]:
## careful to put in a list name we just processed (via lc, fl, itertools)
links_number = len(all_urls)
link_count = 1
for link in all_urls:
    print(f"Downloading link {link_count} of {links_number}")
    wget.download(link)
    link_count += 1
    snooze = randrange(3,6)
    print(f"snoozing for {snooze} seconds before scraping next link.")
    time.sleep(snooze)