# 1. Demo: downloading files from websites

There are ```txt``` and ```pdf``` files on:

```https://sandeepmj.github.io/scrape-example-page/pages.html```

Do the following:

1. Download all ```pdf``` files.
2. Download all files at one time.

In [12]:
## import libraries
from bs4 import BeautifulSoup  ## scrape info from web pages
import requests  ## get web pages from server
import time  ## we use its sleep function to pause between downloads
from random import randrange  ## generate random numbers

In [13]:
# url to scrape
url = "https://sandeepmj.github.io/scrape-example-page/pages.html"

response = requests.get(url)
response.status_code

200

In [14]:
## make soup
soup = BeautifulSoup(response.text, "html.parser")

In [15]:
## find every ul with the class "downloadable"; these lists hold the file links
all_files = soup.find_all("ul", class_="downloadable")
all_files

[<ul class="txts downloadable">
 <p class="pages">Download this list of text documents</p>
 <li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
 <li>Text Document <a href="files/text_doc_02.txt">2</a></li>
 <li>Text Document <a href="files/text_doc_03.txt">3</a></li>
 <li>Text Document <a href="files/text_doc_04.txt">4</a></li>
 <li>Text Document <a href="files/text_doc_05.txt">5</a></li>
 <li>Text Document <a href="files/text_doc_06.txt">6</a></li>
 <li>Text Document <a href="files/text_doc_07.txt">7</a></li>
 <li>Text Document <a href="files/text_doc_08.txt">8</a></li>
 <li>Text Document <a href="files/text_doc_09.txt">9</a></li>
 <li>Text Document <a href="files/text_doc_10.txt">10</a></li>
 </ul>,
 <ul class="pdfs downloadable">
 <p class="pages">Download this list of PDFs</p>
 <li>PDF Document <a href="files/pdf_1.pdf">1</a> </li>
 <li>PDF Document <a href="files/pdf_2.pdf">2</a></li>
 <li>PDF Document <a href="files/pdf_3.pdf">3</a></li>
 <li>PDF Document <a href="files

In [16]:
## iterate through our list
## to find just the a tags 
all_a_tags = [file.find_all("a") for file in all_files]
all_a_tags

[[<a href="files/text_doc_01.txt">1</a>,
  <a href="files/text_doc_02.txt">2</a>,
  <a href="files/text_doc_03.txt">3</a>,
  <a href="files/text_doc_04.txt">4</a>,
  <a href="files/text_doc_05.txt">5</a>,
  <a href="files/text_doc_06.txt">6</a>,
  <a href="files/text_doc_07.txt">7</a>,
  <a href="files/text_doc_08.txt">8</a>,
  <a href="files/text_doc_09.txt">9</a>,
  <a href="files/text_doc_10.txt">10</a>],
 [<a href="files/pdf_1.pdf">1</a>,
  <a href="files/pdf_2.pdf">2</a>,
  <a href="files/pdf_3.pdf">3</a>,
  <a href="files/pdf_4.pdf">4</a>,
  <a href="files/pdf_5.pdf">5</a>,
  <a href="files/pdf_6.pdf">6</a>,
  <a href="files/pdf_7.pdf">7</a>,
  <a href="files/pdf_8.pdf">8</a>,
  <a href="files/pdf_9.pdf">9</a>,
  <a href="files/pdf_10.pdf">10</a>]]

### Notice that we have two lists nested inside a list.

In [17]:
## pull out the first item from our list
## just to see how an individual item looks
all_a_tags[0]


[<a href="files/text_doc_01.txt">1</a>,
 <a href="files/text_doc_02.txt">2</a>,
 <a href="files/text_doc_03.txt">3</a>,
 <a href="files/text_doc_04.txt">4</a>,
 <a href="files/text_doc_05.txt">5</a>,
 <a href="files/text_doc_06.txt">6</a>,
 <a href="files/text_doc_07.txt">7</a>,
 <a href="files/text_doc_08.txt">8</a>,
 <a href="files/text_doc_09.txt">9</a>,
 <a href="files/text_doc_10.txt">10</a>]

## The following will break!

We try to pull the ```href``` out of each item in ```all_a_tags```:

In [18]:
href_list = [href.get("href") for href in all_a_tags]
href_list

AttributeError: ResultSet object has no attribute 'get'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

## Why did that break?

It broke because ```.get()``` works only on a single tag, but each item in ```all_a_tags``` is itself a list of tags.

This for loop shows that each item we pull out of the big list is a list, or ```<class 'bs4.element.ResultSet'>```, rather than a single tag:

In [20]:
for href in all_a_tags:
  print(type(href))

<class 'bs4.element.ResultSet'>
<class 'bs4.element.ResultSet'>


### We need to flatten all_a_tags so it contains individual items rather than multiple lists.

We'll use ```itertools``` to accomplish that.

In [21]:
## import itertools
import itertools

In [22]:
## use itertools to flatten the list

flat_list = list(itertools.chain(*all_a_tags))
flat_list

[<a href="files/text_doc_01.txt">1</a>,
 <a href="files/text_doc_02.txt">2</a>,
 <a href="files/text_doc_03.txt">3</a>,
 <a href="files/text_doc_04.txt">4</a>,
 <a href="files/text_doc_05.txt">5</a>,
 <a href="files/text_doc_06.txt">6</a>,
 <a href="files/text_doc_07.txt">7</a>,
 <a href="files/text_doc_08.txt">8</a>,
 <a href="files/text_doc_09.txt">9</a>,
 <a href="files/text_doc_10.txt">10</a>,
 <a href="files/pdf_1.pdf">1</a>,
 <a href="files/pdf_2.pdf">2</a>,
 <a href="files/pdf_3.pdf">3</a>,
 <a href="files/pdf_4.pdf">4</a>,
 <a href="files/pdf_5.pdf">5</a>,
 <a href="files/pdf_6.pdf">6</a>,
 <a href="files/pdf_7.pdf">7</a>,
 <a href="files/pdf_8.pdf">8</a>,
 <a href="files/pdf_9.pdf">9</a>,
 <a href="files/pdf_10.pdf">10</a>]
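As a side note, here is a minimal sketch of two other common ways to flatten a list of lists. It uses plain strings as a stand-in for the a tags (the name ```nested``` is made up for this example), so it runs without scraping anything.

```python
import itertools

# Stand-in for all_a_tags: two inner lists inside one outer list.
nested = [
    ["files/text_doc_01.txt", "files/text_doc_02.txt"],
    ["files/pdf_1.pdf", "files/pdf_2.pdf"],
]

# Option 1: chain.from_iterable takes the outer list directly (no * unpacking).
flat_chain = list(itertools.chain.from_iterable(nested))

# Option 2: a nested list comprehension, read left to right like two for loops.
flat_comp = [item for inner in nested for item in inner]

print(flat_chain)
print(flat_comp)
```

Both options should give the same flat list as ```itertools.chain(*all_a_tags)```; which one to use is mostly a matter of taste.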

In [23]:
## now we can target the a tags

href_list = [a_tag.get("href") for a_tag in flat_list]
href_list

['files/text_doc_01.txt',
 'files/text_doc_02.txt',
 'files/text_doc_03.txt',
 'files/text_doc_04.txt',
 'files/text_doc_05.txt',
 'files/text_doc_06.txt',
 'files/text_doc_07.txt',
 'files/text_doc_08.txt',
 'files/text_doc_09.txt',
 'files/text_doc_10.txt',
 'files/pdf_1.pdf',
 'files/pdf_2.pdf',
 'files/pdf_3.pdf',
 'files/pdf_4.pdf',
 'files/pdf_5.pdf',
 'files/pdf_6.pdf',
 'files/pdf_7.pdf',
 'files/pdf_8.pdf',
 'files/pdf_9.pdf',
 'files/pdf_10.pdf']
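The first task asks for only the ```pdf``` files. One possible way to get them, sketched here on a short made-up sample (```href_sample``` and ```pdf_hrefs``` are names invented for this example), is to keep only the links that end in ```.pdf```:

```python
# A short sample that mimics the relative links in href_list.
href_sample = [
    "files/text_doc_01.txt",
    "files/text_doc_02.txt",
    "files/pdf_1.pdf",
    "files/pdf_2.pdf",
]

# Keep only links whose extension is .pdf (lower() also catches ".PDF").
pdf_hrefs = [href for href in href_sample if href.lower().endswith(".pdf")]

print(pdf_hrefs)
```

Running the same comprehension on the real ```href_list``` should leave just the ten PDF links, assuming every PDF link on the page ends in ```.pdf```.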

In [24]:
## base url
base_url = "https://sandeepmj.github.io/scrape-example-page/"

In [25]:
## iterate and join the base url to the relative url
full_link_list = [base_url + href for href in href_list]
full_link_list

['https://sandeepmj.github.io/scrape-example-page/files/text_doc_01.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_02.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_03.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_04.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_05.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_06.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_07.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_08.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_09.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/text_doc_10.txt',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_1.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_2.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/pdf_3.pdf',
 'https://sandeepmj.github.io/scrape-example-page/files/
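Adding strings together works here because every link is relative to the same folder. A slightly more forgiving alternative, shown as a sketch, is ```urljoin``` from the standard library, which resolves a relative link against the page it came from. The names ```page_url``` and ```sample_hrefs``` are made up for this example.

```python
from urllib.parse import urljoin

# The page the links were scraped from.
page_url = "https://sandeepmj.github.io/scrape-example-page/pages.html"

# Two relative links like the ones in href_list.
sample_hrefs = ["files/text_doc_01.txt", "files/pdf_1.pdf"]

# urljoin drops "pages.html" and appends the relative path to the folder.
full_links = [urljoin(page_url, href) for href in sample_hrefs]

print(full_links)
```

With ```urljoin``` there is no need to define a separate base URL, and a link that is already a full URL passes through unchanged.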

# Downloading the documents
We need the ```wget``` package. Let's pip install it:

In [26]:
pip install wget

Note: you may need to restart the kernel to use updated packages.


In [27]:
## import that package
import wget

In [29]:
## full scrape of the documents

links_number = len(full_link_list)
for link_count, link in enumerate(full_link_list, start=1):
  print(f"Downloading link {link_count} of {links_number}")
  wget.download(link)  ## wget saves the file to the current directory
  snooze = randrange(3, 6)  ## random pause of 3 to 5 seconds
  print(f"Snoozing for {snooze} seconds from next link")
  time.sleep(snooze)

Downloading link 1 of 20
100% [................................................................] 76 / 76Snoozing for 5 seconds from next link
Downloading link 2 of 20
100% [................................................................] 66 / 66Snoozing for 3 seconds from next link
Downloading link 3 of 20
100% [................................................................] 70 / 70Snoozing for 3 seconds from next link
Downloading link 4 of 20
100% [................................................................] 63 / 63Snoozing for 4 seconds from next link
Downloading link 5 of 20
100% [................................................................] 66 / 66Snoozing for 5 seconds from next link
Downloading link 6 of 20
100% [................................................................] 66 / 66Snoozing for 3 seconds from next link
Downloading link 7 of 20
100% [................................................................] 69 / 69Snoozing for 5 seconds from next link
Downlo
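If installing ```wget``` is not an option, a similar download can be sketched with only the standard library. This is an alternative to the ```wget``` approach above, not the same method. The helper names ```filename_from_url``` and ```download_file``` and the ```downloads``` folder are assumptions made for this example.

```python
import os
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last part of the URL path, such as 'pdf_1.pdf'."""
    return os.path.basename(urlparse(url).path)


def download_file(url, folder="downloads"):
    """Hypothetical helper: save one URL into folder and return the saved path."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    target = os.path.join(folder, filename_from_url(url))
    # urlopen returns a file-like response; read() gives the raw bytes
    with urllib.request.urlopen(url) as response, open(target, "wb") as out_file:
        out_file.write(response.read())
    return target


# The filename step can be checked without touching the network.
print(filename_from_url("https://sandeepmj.github.io/scrape-example-page/files/pdf_1.pdf"))
```

In the download loop, ```download_file(link)``` could take the place of ```wget.download(link)```; the snooze between requests would stay the same.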

# 2. Universal conversion function
Rewrite your function from last week so it can do both:

- take individual string values like ```$12.24267```, ```10,201``` and ```$12,501``` and convert them into floating point numbers like 12.24, 10201.0 and 12501.0

- take string values in lists and convert them to floating point numbers. (Reminder: use the ```map``` function.)

Test it on the numbers above and in this list:

In [6]:
## list of string numbers
string_numbers = ["$12.24267", "10,201", "$12,501", "42,901", "$902,091"]

In [7]:
## FUNCTION
def string2float(a_string):
    '''
    Remove commas and $ from a string number and return a float rounded to 2 decimals
    param a_string: a string such as "$12,501"
    '''
    a_string = a_string.replace("$", "").replace(",", "")
    return round(float(a_string), 2)

In [8]:
## still works on individual string number?
string2float("$42,901")

42901.0

In [9]:
list(map(string2float, string_numbers))

[12.24, 10201.0, 12501.0, 42901.0, 902091.0]
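Another way to read "do both" is a single function that accepts either one string or a list of strings. The sketch below is one possible take on that; ```strings2floats``` is a made-up name, and it assumes the input is always a string or a list of strings.

```python
def strings2floats(value):
    """Hypothetical helper: convert one string number or a list of them to floats."""

    def convert(a_string):
        # Same cleanup as string2float: drop $ and commas, then round to 2 decimals.
        return round(float(a_string.replace("$", "").replace(",", "")), 2)

    # A single string is converted directly.
    if isinstance(value, str):
        return convert(value)
    # Anything else is treated as a list of strings and mapped item by item.
    return list(map(convert, value))


print(strings2floats("$12,501"))
print(strings2floats(["$12.24267", "10,201", "$902,091"]))
```

The trade-off is that the caller no longer has to remember ```map```, but the function now returns two different types depending on its input.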