# 1. Demo downloading files from websites 

There are ```txt``` and ```pdf``` files on:

```https://sandeepmj.github.io/scrape-example-page/pages.html```

Do the following:

1. Download all ```pdf``` files.
2. Download all files at one time.

In [1]:
## create new cells as necessary

In [2]:
url = "https://sandeepmj.github.io/scrape-example-page/pages.html"

In [3]:
from bs4 import BeautifulSoup  ## scrape info from web pages
import requests ## get web pages from server
import time # time is required. we will use its sleep function
from random import randrange # generate random numbers

In [4]:
response = requests.get(url)
response.status_code

200

In [6]:
## Now we soup
soup = BeautifulSoup(response.text, "html.parser")
soup

<html lang="en">
<head>
<!-- Makes the page responsive and scaled to be read easily -->
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- Links to stylesheet -->
<link href="style.css" rel="stylesheet" type="text/css"/>
<!-- Remember to update page title -->
<title>List of Documents</title>
</head>
<body>
<!-- All content goes here -->
<div class="container">
<h1>Documents to Download</h1>
<li>Junk Li <a href="">tag 1</a></li>
<li>Junk Li <a href="">tag 2</a></li>
<ul class="txts downloadable">
<p class="pages">Download this list of text documents</p>
<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
<li>Text Document <a href="files/text_doc_03.txt">3</a></li>
<li>Text Document <a href="files/text_doc_04.txt">4</a></li>
<li>Text Document <a href="files/text_doc_05.txt">5</a></li>
<li>Text Document <a href="files/text_doc_06.txt">6</a></li>
<li>Text Document <a href="files/text_doc_07.txt">

In [8]:
## It looks like there are a bunch of things we can target.
## I'm not confident that I can write one big scrape that does everything, and I don't have a ton of time,
## so we are going to target each type of file separately.

txt_group = soup.find("ul", class_ = "txts")
txt_group

<ul class="txts downloadable">
<p class="pages">Download this list of text documents</p>
<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
<li>Text Document <a href="files/text_doc_03.txt">3</a></li>
<li>Text Document <a href="files/text_doc_04.txt">4</a></li>
<li>Text Document <a href="files/text_doc_05.txt">5</a></li>
<li>Text Document <a href="files/text_doc_06.txt">6</a></li>
<li>Text Document <a href="files/text_doc_07.txt">7</a></li>
<li>Text Document <a href="files/text_doc_08.txt">8</a></li>
<li>Text Document <a href="files/text_doc_09.txt">9</a></li>
<li>Text Document <a href="files/text_doc_10.txt">10</a></li>
</ul>

In [9]:
## BOOM, that worked. Okay, so let's get all the a tags.
links_txt = txt_group.find_all("a")
links_txt


[<a href="files/text_doc_01.txt">1</a>,
 <a href="files/text_doc_02.txt">2</a>,
 <a href="files/text_doc_03.txt">3</a>,
 <a href="files/text_doc_04.txt">4</a>,
 <a href="files/text_doc_05.txt">5</a>,
 <a href="files/text_doc_06.txt">6</a>,
 <a href="files/text_doc_07.txt">7</a>,
 <a href="files/text_doc_08.txt">8</a>,
 <a href="files/text_doc_09.txt">9</a>,
 <a href="files/text_doc_10.txt">10</a>]

In [10]:
links = []
for tag in links_txt:
    links.append(tag.get("href"))
links

['files/text_doc_01.txt',
 'files/text_doc_02.txt',
 'files/text_doc_03.txt',
 'files/text_doc_04.txt',
 'files/text_doc_05.txt',
 'files/text_doc_06.txt',
 'files/text_doc_07.txt',
 'files/text_doc_08.txt',
 'files/text_doc_09.txt',
 'files/text_doc_10.txt']

In [14]:
## now let's make the URLs whole by prepending the site's base URL
baseURL = "https://sandeepmj.github.io/scrape-example-page/"
full_txt_link = []
for link in links:
    full_txt_link.append(baseURL + link)

full_txt_link

['https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_02.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_03.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_04.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_05.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_06.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_07.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_08.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_09.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_10.txt']
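String concatenation works here because every `href` is relative to the same folder. As a hedged alternative, the standard library's `urljoin` resolves a relative link against the page URL itself, so no separate base URL has to be typed by hand. The names `page_url`, `relative_links` and `full_links` below are illustrative, not from the notebook.

```python
from urllib.parse import urljoin

# The page we scraped (same value as `url` in the notebook)
page_url = "https://sandeepmj.github.io/scrape-example-page/pages.html"

# A couple of the relative hrefs pulled out of the a tags
relative_links = ["files/text_doc_01.txt", "files/text_doc_02.txt"]

# urljoin swaps the last path segment (pages.html) for the relative link
full_links = [urljoin(page_url, href) for href in relative_links]
print(full_links)
```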

In [15]:
import wget

In [16]:
## so now we can download all the links using a for loop
links_total = len(full_txt_link)
link_count = 1
for link in full_txt_link:
    print(f"downloading link {link_count} of {links_total}")
    link_count += 1
    wget.download(link)
    snoozer = randrange(5, 7)
    print(f"snoozing for {snoozer} seconds next link")
    time.sleep(snoozer)

downloading link 1 of 10
100% [................................................................] 76 / 76snoozing for 6 seconds next link
downloading link 2 of 10
100% [................................................................] 66 / 66snoozing for 6 seconds next link
downloading link 3 of 10
100% [................................................................] 70 / 70snoozing for 6 seconds next link
downloading link 4 of 10
100% [................................................................] 63 / 63snoozing for 6 seconds next link
downloading link 5 of 10
100% [................................................................] 66 / 66snoozing for 6 seconds next link
downloading link 6 of 10
100% [................................................................] 66 / 66snoozing for 5 seconds next link
downloading link 7 of 10
100% [................................................................] 69 / 69snoozing for 5 seconds next link
downloading link 8 of 10
100% [..........
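The loop above keeps its own counter and also snoozes after the final file. A minimal sketch of the same idea as a reusable helper, assuming we let `enumerate` do the counting and skip the last pause. The function name `download_all` and its `fetch` and `pause` parameters are hypothetical; they are there so the loop can be dry-run without touching the network.

```python
import time
from random import randrange

def download_all(urls, fetch, pause=time.sleep, low=5, high=7):
    """Call fetch(url) for each url, pausing a random time between downloads."""
    total = len(urls)
    for count, url in enumerate(urls, start=1):
        print(f"downloading link {count} of {total}")
        fetch(url)
        # No need to wait once the last file is done
        if count < total:
            snoozer = randrange(low, high)  # 5 or 6 with the defaults
            print(f"snoozing for {snoozer} seconds before the next link")
            pause(snoozer)

# Dry run: "fetch" just prints the name and "pause" does nothing
download_all(["a.txt", "b.txt"], fetch=print, pause=lambda seconds: None)
```

With the names used in this notebook, the real call would look like `download_all(full_txt_link, wget.download)`.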

In [None]:
## okay, let's see if we can speed up the process of downloading all the files.

In [17]:
## I forgot what the soup looks like
soup

<html lang="en">
<head>
<!-- Makes the page responsive and scaled to be read easily -->
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- Links to stylesheet -->
<link href="style.css" rel="stylesheet" type="text/css"/>
<!-- Remember to update page title -->
<title>List of Documents</title>
</head>
<body>
<!-- All content goes here -->
<div class="container">
<h1>Documents to Download</h1>
<li>Junk Li <a href="">tag 1</a></li>
<li>Junk Li <a href="">tag 2</a></li>
<ul class="txts downloadable">
<p class="pages">Download this list of text documents</p>
<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
<li>Text Document <a href="files/text_doc_03.txt">3</a></li>
<li>Text Document <a href="files/text_doc_04.txt">4</a></li>
<li>Text Document <a href="files/text_doc_05.txt">5</a></li>
<li>Text Document <a href="files/text_doc_06.txt">6</a></li>
<li>Text Document <a href="files/text_doc_07.txt">

In [38]:
## It looks like we can target "ul" tags with the "downloadable" class.
## It also looks like you threw in some broken links just to be cheeky.
## don't worry, we got this... I think
docs_holder = soup.find_all("ul", class_="downloadable")

In [39]:
## Tried a more direct approach (left commented out below).
## It picks up some broken links, but it's a single flat list...
## docs_holder = soup.find_all("a")
docs_holder

[<ul class="txts downloadable">
 <p class="pages">Download this list of text documents</p>
 <li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
 <li>Text Document <a href="files/text_doc_02.txt">2</a></li>
 <li>Text Document <a href="files/text_doc_03.txt">3</a></li>
 <li>Text Document <a href="files/text_doc_04.txt">4</a></li>
 <li>Text Document <a href="files/text_doc_05.txt">5</a></li>
 <li>Text Document <a href="files/text_doc_06.txt">6</a></li>
 <li>Text Document <a href="files/text_doc_07.txt">7</a></li>
 <li>Text Document <a href="files/text_doc_08.txt">8</a></li>
 <li>Text Document <a href="files/text_doc_09.txt">9</a></li>
 <li>Text Document <a href="files/text_doc_10.txt">10</a></li>
 </ul>,
 <ul class="pdfs downloadable">
 <p class="pages">Download this list of PDFs</p>
 <li>PDF Document <a href="files/pdf_1.pdf">1</a> </li>
 <li>PDF Document <a href="files/pdf_2.pdf">2</a></li>
 <li>PDF Document <a href="files/pdf_3.pdf">3</a></li>
 <li>PDF Document <a href="files

In [40]:
big_list = [item.find_all("a") for item in docs_holder]
big_list

[[<a href="files/text_doc_01.txt">1</a>,
  <a href="files/text_doc_02.txt">2</a>,
  <a href="files/text_doc_03.txt">3</a>,
  <a href="files/text_doc_04.txt">4</a>,
  <a href="files/text_doc_05.txt">5</a>,
  <a href="files/text_doc_06.txt">6</a>,
  <a href="files/text_doc_07.txt">7</a>,
  <a href="files/text_doc_08.txt">8</a>,
  <a href="files/text_doc_09.txt">9</a>,
  <a href="files/text_doc_10.txt">10</a>],
 [<a href="files/pdf_1.pdf">1</a>,
  <a href="files/pdf_2.pdf">2</a>,
  <a href="files/pdf_3.pdf">3</a>,
  <a href="files/pdf_4.pdf">4</a>,
  <a href="files/pdf_5.pdf">5</a>,
  <a href="files/pdf_6.pdf">6</a>,
  <a href="files/pdf_7.pdf">7</a>,
  <a href="files/pdf_8.pdf">8</a>,
  <a href="files/pdf_9.pdf">9</a>,
  <a href="files/pdf_10.pdf">10</a>]]

In [42]:
## okay, so this is a list of lists (one inner list per ul), which means we need to flatten it.
flat_docs = []
for sub_list in big_list:
    for item in sub_list:
        flat_docs.append(baseURL + item.get("href"))

flat_docs

['https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_02.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_03.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_04.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_05.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_06.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_07.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_08.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_09.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_10.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_1.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_2.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_3.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/
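The first task asks for the `pdf` files only. One way to get there from a flat list like the one above is to filter on the file extension. This is a sketch under assumptions: the helper name `keep_extension` and the three-item `sample_docs` list are made up for illustration, and the check only looks at the path part of each URL.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def keep_extension(urls, extension):
    """Return only the urls whose path ends in the given extension, e.g. '.pdf'."""
    wanted = extension.lower()
    return [
        link for link in urls
        if PurePosixPath(urlparse(link).path).suffix.lower() == wanted
    ]

# A short stand-in for the full flat_docs list
sample_docs = [
    "https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt",
    "https://sandeepmj.github.io/scrape-example-page/files/pdf_1.pdf",
    "https://sandeepmj.github.io/scrape-example-page/files/pdf_2.pdf",
]

pdf_only = keep_extension(sample_docs, ".pdf")
print(pdf_only)
```

In the notebook this would be `keep_extension(flat_docs, ".pdf")`, and the result can go through the same download loop.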

In [43]:
links_total = len(flat_docs)
link_count = 1
for link in flat_docs:
    print(f"downloading link {link_count} of {links_total}")
    link_count += 1
    wget.download(link)
    snoozer = randrange(5, 7)
    print(f"snoozing for {snoozer} seconds next link")
    time.sleep(snoozer)

downloading link 1 of 20
100% [................................................................] 76 / 76snoozing for 6 seconds next link
downloading link 2 of 20
100% [................................................................] 66 / 66snoozing for 5 seconds next link
downloading link 3 of 20
100% [................................................................] 70 / 70snoozing for 5 seconds next link
downloading link 4 of 20
100% [................................................................] 63 / 63snoozing for 6 seconds next link
downloading link 5 of 20
100% [................................................................] 66 / 66snoozing for 6 seconds next link
downloading link 6 of 20
100% [................................................................] 66 / 66snoozing for 6 seconds next link
downloading link 7 of 20
100% [................................................................] 69 / 69snoozing for 6 seconds next link
downloading link 8 of 20
100% [..........

# 2. Universal conversion function
Rewrite your function from last week so it can do both:

- take individual string values like ```$12.24267```, ```10,201``` and ```$12,501``` and convert them into floating point numbers like 12.24, 10201.0 and 12501.0

- take string values in lists and convert them to floating point numbers. (reminder: you use a zip function).

Test it on the numbers above and in this list:

In [48]:
## list of string numbers
string_numbers = ["$12.24267", "10,201", "$12,501", "42,901", "$902,091"]

In [50]:
def convert(string):
    ## strip the dollar sign and the thousands separators, then round to 2 decimals
    string = string.replace("$", "").replace(",", "")
    return round(float(string), 2)

In [51]:
converted_string = list(map(convert, string_numbers))

In [52]:
converted_string

[12.24, 10201.0, 12501.0, 42901.0, 902091.0]
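The assignment asks for one function that handles both a single string and a list, while the cells above handle the list case with `map` from the outside. A minimal sketch of a single entry point, assuming an `isinstance` check is an acceptable way to tell the two cases apart. The name `convert_any` is hypothetical.

```python
def convert_any(value):
    """Convert one string like '$12,501', or a list of them, to rounded floats."""
    if isinstance(value, str):
        cleaned = value.replace("$", "").replace(",", "")
        return round(float(cleaned), 2)
    # Anything else is treated as a list (or other iterable) of strings
    return [convert_any(item) for item in value]

string_numbers = ["$12.24267", "10,201", "$12,501", "42,901", "$902,091"]

single = convert_any("$12.24267")
many = convert_any(string_numbers)
print(single)
print(many)
```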

In [None]:
## I honestly have no idea where you're going with "use a zip function". 
## Like, I feel like I was paying attention and at this point I'm just drawing a blank
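One possible reading of the `zip` hint, offered as a guess rather than the intended answer: `zip` can pair each original string with its converted value, which makes it easy to eyeball that every conversion came out right. The names `converted` and `pairs` are illustrative.

```python
def convert(string):
    # Same cleaning steps as the notebook's convert function
    string = string.replace("$", "").replace(",", "")
    return round(float(string), 2)

string_numbers = ["$12.24267", "10,201", "$12,501", "42,901", "$902,091"]
converted = [convert(item) for item in string_numbers]

# zip walks both lists in step and yields (original, converted) tuples
pairs = list(zip(string_numbers, converted))
for original, number in pairs:
    print(f"{original} -> {number}")
```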