# Functions & Downloads

Today we'll learn how to write more efficient code using ```functions``` and then apply them toward scraping/downloading documents from the Web.

## Functions: Making Your Code Work Smarter

A **function** is a reusable block of code that performs a specific task. Instead of writing the same code over and over, you write it once in a function and call it whenever you need it.

Functions help you:
* **Stay DRY** (Don't Repeat Yourself) – write code once, use it many times
* **Organize your code** – break complex problems into smaller, manageable pieces
* **Make debugging easier** – fix issues in one place instead of hunting through repeated code
* **Build modular programs** – combine simple functions to accomplish complex tasks

A function runs only when something "invokes" or "calls" it, giving you control over when and how your code executes.


In [3]:
# you've been using functions like print


In [4]:
# define a function that says hello world 
def sayHello():
    print("Hello World")

In [5]:
# invoke or call this function
sayHello()

Hello World


In [6]:
# let's create a function that says hello to someone 
def sayHello():
    print("Hello Shuchi, hope you're having a good day!")

In [7]:
# call it
sayHello()

Hello Shuchi, hope you're having a good day!


In [8]:
# what if we want a different name?
def sayHello2():
    print("Hello Frank, hope you're having a good day!")

In [9]:
# instead of writing a different function for differen people, we add a parameter.

def sayHello(name):
    print(f"Hello {name}, hope you're having a good day!")

In [10]:
# call updated function
sayHello("Sami")

Hello Sami, hope you're having a good day!


In [11]:
# build a function called addFunction

def addNumbers(number1, number2):
    total = number1 + number2
    print(f"The sum of {number1} and {number2} is {total}")
    

In [12]:
#call 
addNumbers(3,4)

The sum of 3 and 4 is 7


In [13]:
# create a function that 

def aging (name, age):
    print(f"{name} is {age} years old")

aging("Sandeep", 59)

Sandeep is 59 years old


In [14]:
# note that the order of the parameters matter because they are getting mapped to the different parts of the function.

aging(59, "Sandeep")

59 is Sandeep years old


In [15]:
# going back to the addNumber func
addNumbers(10,20)

The sum of 10 and 20 is 30


In [16]:
# lets try to save it as a variable
my_result = addNumbers(10,30)

The sum of 10 and 30 is 40


In [17]:
# print or call this variable. What does it hold?
print(my_result)

None


In [18]:
type(my_result)

NoneType

In [19]:
# tweak function by adding a return statement 

def addNumbers(number1, number2):
    # total = number1 + number2
    # print(f"The sum of {number1} and {number2} is {total}")
    return number1 + number2

In [20]:
my_result = addNumbers (10,30)
my_result

40

In [21]:
### Docstrings 
def pctChg(old_value, new_value):
    '''
    This function takes two numbers and returns the percent change
    parameter1 = old number
    parameter2 = newer number
    '''
    return(new_value - old_value)/ old_value * 100


In [22]:
# run function -- click shift tab to get popup
pctChg(50, 100)

100.0

In [23]:
type(pctChg)

function

### Build a credit check function

Recall that we built a conditional expression to evaluate credit rating based on these values:

- 300-579: Poor
- 580-669: Fair
- 670-739: Good
- 740-799: Very good
- 800-850: Excellent

Now build a function that evaluates a score to return a rating, but also prints "Your credit rating is **whatever**!"

## here is the expression we previously built:

```python
if credit <= 579:
  print(f"Your credit of {credit} is poor")
elif 579 < credit <= 669:
  print(f"Your credit of {credit} is fair")
elif 670 < credit <= 739:
  print(f"Your credit of {credit} is good")
elif 740 <= credit <= 799:
  print(f"Your credit of {credit} is very good")
else:
  print(f"Your credit of {credit} is excellent")

```

Can you think of a way to make it more efficient and DRY.

In [26]:
## Code your function here

## Why functions are awesome

For functions I use regularly, I can simply import them and reduce time spent copy-n-pasting, or rewriting same functions.

#### (The next few steps are demo only)

In [30]:
pip install git+https://github.com/sandeepmj/my_functions.git

Collecting git+https://github.com/sandeepmj/my_functions.git
  Cloning https://github.com/sandeepmj/my_functions.git to /private/var/folders/7k/dkrw1dkn70xb44njcjrldx2w0000gn/T/pip-req-build-1mmnrlov
  [1;31merror[0m: [1msubprocess-exited-with-error[0m
  
  [31m×[0m [32mgit version[0m did not run successfully.
  [31m│[0m exit code: [1;36m1[0m
  [31m╰─>[0m [31m[2 lines of output][0m
  [31m   [0m xcode-select: note: no developer tools were found at '/Applications/Xcode.app', requesting install. Choose an option in the dialog to download the command line developer tools.
  [31m   [0m [31m[end of output][0m
  
  [1;35mnote[0m: This error originates from a subprocess, and is likely not a problem with pip.
[1;31merror[0m: [1msubprocess-exited-with-error[0m

[31m×[0m [32mgit version[0m did not run successfully.
[31m│[0m exit code: [1;36m1[0m
[31m╰─>[0m See above for output.

[1;35mnote[0m: This error originates from a subprocess, and is likely not a pro

In [32]:
## import my_fuctions
from my_functions import timer

ModuleNotFoundError: No module named 'my_functions'

In [34]:
# write a loop that counts to 5 and prints the number
# add a timer 
timer(10,20)

NameError: name 'timer' is not defined

### Your Challenge

Write a function that makes a request to a website and returns the content as soup.

In [40]:
## dependencies
import requests
from bs4 import BeautifulSoup

## function
def makeSoup(url):

    '''
    Makes soup of any html site
    para1 = url of site to be scraped
    Returns soup
    '''
    
    headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}
    response = requests.get(url, headers = headers)
    if response.status_code == 200:
        return BeautifulSoup(response.text, "html.parser")

    else:
        print(f"Your request returned {response.status_code}")
    

In [46]:
# place url into variable
# call function

url = "https://bestsellingalbums.org/decade/2010"
makeSoup(url)


<!DOCTYPE html>

<html class="no-js" lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width" name="viewport"/>
<link href="https://gmpg.org/xfn/11" rel="profile"/>
<link href="https://bestsellingalbums.org/xmlrpc.php" rel="pingback"/>
<!--[if lt IE 9]>
    <script src="https://bestsellingalbums.org/wp-content/themes/twentyfifteen/js/html5.js?ver=3.7.0"></script>
    <![endif]-->
<script>(function(html){html.className = html.className.replace(/\bno-js\b/,'js')})(document.documentElement);</script>
<!-- This site is optimized with the Yoast SEO plugin v14.5 - https://yoast.com/wordpress/plugins/seo/ -->
<title>Best-selling albums of 2010's</title>
<meta content="Best-selling albums of 2010's" name="description">
<meta content="index, follow" name="robots">
<meta content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1" name="googlebot"/>
<meta content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1" name=

In [48]:
base_url = "https://bestsellingalbums.org/decade/2010-"

for url_number in range(2,4):
    url = f"{base_url}{url_number}"
    print(makeSoup(url))
    timer(5,10)

print


<!DOCTYPE html>

<html class="no-js" lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width" name="viewport"/>
<link href="https://gmpg.org/xfn/11" rel="profile"/>
<link href="https://bestsellingalbums.org/xmlrpc.php" rel="pingback"/>
<!--[if lt IE 9]>
    <script src="https://bestsellingalbums.org/wp-content/themes/twentyfifteen/js/html5.js?ver=3.7.0"></script>
    <![endif]-->
<script>(function(html){html.className = html.className.replace(/\bno-js\b/,'js')})(document.documentElement);</script>
<!-- This site is optimized with the Yoast SEO plugin v14.5 - https://yoast.com/wordpress/plugins/seo/ -->
<title>Best-selling albums of 2010's : places 51 - 100</title>
<meta content="Best-selling albums of 2010's : places 51 - 100" name="description">
<meta content="index, follow" name="robots">
<meta content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1" name="googlebot"/>
<meta content="index, follow, max-snippet:-1, max-image-previe

NameError: name 'timer' is not defined

## Scraping/Downloading Web Documents

You want to create a dataset that tracks how many companies the <a href="https://www.sec.gov/litigation/suspensions.shtml">SEC suspended</a> between 2024 and 2004 (and for what reasons).

We want to write a scraper that aggregates:

* Date of suspension
* Company name
* Order
* Release (the PDFs in the XX-YYYYY format

The challenge? All that info is held in the PDFs.

We will need to download all the PDFs before we can analyze the info.

## Practice Site

We'll practice the required techniques <a href="https://sandeepmj.github.io/scrape-example-page/pages.html">on this demo site</a> by:

1. downloading all ```txt``` files.
2. downloading all ```pdf``` files.
3. Perhaps all files at one time.

In [52]:
import time
from random import uniform

In [58]:
url = "https://sandeepmj.github.io/scrape-example-page/pages.html"
soup = makeSoup(url)
soup

<html lang="en">
<head>
<!-- Makes the page responsive and scaled to be read easily -->
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- Links to stylesheet -->
<link href="style.css" rel="stylesheet" type="text/css"/>
<!-- Remember to update page title -->
<title>List of Documents</title>
</head>
<body>
<!-- All content goes here -->
<div class="container">
<h1>Documents to Download</h1>
<li>Junk Li <a href="">tag 1</a></li>
<li>Junk Li <a href="">tag 2</a></li>
<ul class="txts downloadable">
<p class="pages">Download this first set of text documents</p>
<li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
<li>Text Document <a href="files/text_doc_02.txt">2</a></li>
<li>Text Document <a href="files/text_doc_03.txt">3</a></li>
<li>Text Document <a href="files/text_doc_04.txt">4</a></li>
<li>Text Document <a href="files/text_doc_05.txt">5</a></li>
<li>Text Document <a href="files/text_doc_06.txt">6</a></li>
<li>Text Document <a href="files/text_doc_07.

In [64]:
# narrow to class
targets = soup.find_all("ul", class_= "txts") # remember- txts downloadable means they are two different classes. 
targets

[<ul class="txts downloadable">
 <p class="pages">Download this first set of text documents</p>
 <li>Text Document <a href="files/text_doc_01.txt">1</a> </li>
 <li>Text Document <a href="files/text_doc_02.txt">2</a></li>
 <li>Text Document <a href="files/text_doc_03.txt">3</a></li>
 <li>Text Document <a href="files/text_doc_04.txt">4</a></li>
 <li>Text Document <a href="files/text_doc_05.txt">5</a></li>
 <li>Text Document <a href="files/text_doc_06.txt">6</a></li>
 <li>Text Document <a href="files/text_doc_07.txt">7</a></li>
 <li>Text Document <a href="files/text_doc_08.txt">8</a></li>
 <li>Text Document <a href="files/text_doc_09.txt">9</a></li>
 <li>Text Document <a href="files/text_doc_10.txt">10</a></li>
 </ul>,
 <ul class="txts downloadable">
 <p class="pages">Download this second set of text documents</p>
 <li>Text Document <a href="files/text_doc_A.txt">1</a> </li>
 <li>Text Document <a href="files/text_doc_B.txt">2</a></li>
 <li>Text Document <a href="files/text_doc_C.txt">3</a

In [92]:
# just trying something out - from 

target_list = []
for target in targets:
 a_tag = target.find("a")
href = a_tag.get("href")
target_list.append(href)

target_list

['files/text_doc_A.txt']

In [107]:
atags = [atag.find_all("a") for atag in targets]
atags

[[<a href="files/text_doc_01.txt">1</a>,
  <a href="files/text_doc_02.txt">2</a>,
  <a href="files/text_doc_03.txt">3</a>,
  <a href="files/text_doc_04.txt">4</a>,
  <a href="files/text_doc_05.txt">5</a>,
  <a href="files/text_doc_06.txt">6</a>,
  <a href="files/text_doc_07.txt">7</a>,
  <a href="files/text_doc_08.txt">8</a>,
  <a href="files/text_doc_09.txt">9</a>,
  <a href="files/text_doc_10.txt">10</a>],
 [<a href="files/text_doc_A.txt">1</a>,
  <a href="files/text_doc_B.txt">2</a>,
  <a href="files/text_doc_C.txt">3</a>,
  <a href="files/text_doc_D.txt">4</a>,
  <a href="files/text_doc_E.txt">5</a>,
  <a href="files/text_doc_F.txt">6</a>,
  <a href="files/text_doc_G.txt">7</a>,
  <a href="files/text_doc_H.txt">8</a>,
  <a href="files/text_doc_I.txt">9</a>,
  <a href="files/text_doc_J.txt">10</a>]]

In [109]:
# to flatten list- import itertool
from itertools import chain

In [76]:
flat_targets = chain.from_iterable(atags)

NameError: name 'atags' is not defined

In [94]:
# capture 
base_url = "https://sandeepmj.github.io/scrape-example-page/pages.html"

urls = [base_url + item.find]

NameError: name 'item' is not defined

In [97]:
pip install wget

Note: you may need to restart the kernel to use updated packages.


In [101]:
import wget

In [103]:
# download all - this will downlaod all the text documents
for i, link in enumerate(urls, start =1):
    print(f"Downloading link {i} of {len(urls)}")
    wget.download(link)
    timer(10,20)

NameError: name 'urls' is not defined