# <center>Session 4 - Functions </center>

<hr>

## Table of Contents

<b>

* [Functions](#Functions)
* [Project: Creating an NPR Scraper](#Scraper)
</b>

<hr>
<a id="Functions"></a>
# Functions

A function is a block of organized, reusable code that performs a single, well-defined action. Functions make your application more modular and let you reuse code instead of repeating it.

Any piece of code that you might want to reuse, or that is complicated enough to deserve a name of its own, is a good candidate for becoming a function.  Giving functions good names and moving specific pieces of logic into them can make your code much more readable (and easier to edit in the future). 

    def functionname( parameters ):
       <function behavior>
       return <return value>

In [3]:
## Example function definition
def sum_numbers(num_1, num_2):
    summed_value = num_1 + num_2
    return summed_value

What happens when you run the code block above?  Call the new function you've created.  Now you can treat it like the built-in functions we've seen before. 

In [4]:
## Call the function above.  Your code here

Before running the next code cell, what do you think it will print out and in what order?

In [5]:
def happyBirthdaySong(name):
    print("Happy Birthday to you!")
    print("Happy Birthday to you!")
    print("Happy Birthday, dear", name + ".")
    print("Happy Birthday to you!")
    
print("When will this evaluate?")
happyBirthdaySong("Cynthia")

When will this evaluate?
Happy Birthday to you!
Happy Birthday to you!
Happy Birthday, dear Cynthia.
Happy Birthday to you!
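
The order of the output follows from how `def` works: defining a function only creates it, and the body runs when the function is called. A minimal sketch of this, using a hypothetical `greet` function and an `events` list (neither is part of this session's code) to record the order in which things run:

```python
# Hypothetical example: record the order in which lines actually run.
events = []

def greet(name):
    # This body runs only when greet() is called, not when it is defined.
    events.append("inside greet")
    return "Hello, " + name + "."

events.append("after def")      # runs first: the def above only created the function
message = greet("Cynthia")      # now the body runs
events.append("after call")

print(events)
print(message)
```

The `"inside greet"` entry lands between the other two, which mirrors why "When will this evaluate?" is printed before the song.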


### Variable Scope

Variables have a scope in which they can be referenced.  It's inevitable that you will have variables with the same name, so knowing WHERE they can be accessed is important. 

In [7]:
def area_of_a_circle(radius):
    pi = 3.14159
    return pi * radius**2

pi = 3.14
print(area_of_a_circle(2))

print(pi)

12.56636
3.14


Will these next two examples work?

In [11]:
def area_of_a_circle(radius):
    return pi_ex2 * radius**2

pi_ex2 = 3.14
print(area_of_a_circle(2))


12.56


In [10]:
def area_of_a_circle(radius):
    pi_val = 3.14159
    return pi_val * radius**2

print(area_of_a_circle(2))

print(pi_val)

12.56636


NameError: name 'pi_val' is not defined
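
These results follow from Python's scoping rules: a name assigned inside a function is local to that function, a function can read a global name that it does not assign, and local names are not visible outside the function. A small sketch that pulls the three cases together (the names `scale`, `double_local`, `double_global`, `make_hidden`, and `hidden` are made up for illustration):

```python
scale = 2  # global variable

def double_local(x):
    scale = 10          # local variable: shadows the global inside this function only
    return x * scale

def double_global(x):
    return x * scale    # no local 'scale', so Python reads the global one

def make_hidden():
    hidden = "only visible inside make_hidden"
    return len(hidden)

print(double_local(3))   # uses the local scale
print(double_global(3))  # uses the global scale
print(scale)             # the global is unchanged by double_local

make_hidden()
try:
    print(hidden)        # 'hidden' was local, so it does not exist here
except NameError as err:
    print("NameError:", err)
```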

<hr>

<a id="Scraper"></a>
# Creating an NPR Web Scraper

We'll work through this problem step by step, writing some logic and then turning it into a series of functions as we work out each piece. 

The goal is:

* Start with the webpage 'http://text.npr.org/'
* Identify all article links
    * Create a queue of articles to process, add all these links to the queue
* Download each article
    * Pop a link off the queue
    * Identify the article text
    * Write the article text to a file with the article id as the filename
    * Find any other articles linked to in the article
        * Make sure the new links haven't already been processed
        * Add them to the queue of articles to process if they're new
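
The steps above can be sketched without touching the network. This is only an illustration of the queue-and-visited logic: `FAKE_SITE` is a hypothetical stand-in for the website, `crawl` is a made-up name, and a `deque` is one possible choice of queue (the walkthrough below uses a set instead).

```python
from collections import deque

# Hypothetical stand-in for the website: page id -> (text, ids of linked pages).
FAKE_SITE = {
    "home": ("front page", ["a1", "a2"]),
    "a1": ("first article", ["a2", "a3"]),
    "a2": ("second article", ["a1"]),
    "a3": ("third article", []),
}

def crawl(start):
    """Visit every page reachable from the start page exactly once."""
    queue = deque(FAKE_SITE[start][1])   # links found on the starting page
    seen = set(queue)                    # everything ever added to the queue
    articles = {}                        # stands in for writing files to disk

    while queue:
        page = queue.popleft()           # pop a link off the queue
        text, links = FAKE_SITE[page]    # "download" the article
        articles[page] = text            # "write" it under its id
        for link in links:
            if link not in seen:         # skip links already processed or queued
                seen.add(link)
                queue.append(link)
    return articles

print(crawl("home"))
```

Tracking `seen` is what keeps the loop from running forever when two articles link to each other.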
    

#### Part 1

Use the Requests library to get the content of 'http://text.npr.org/'

Create a loop and use string functions to identify every link.  Add each link to a set.

*Hint:* The beginning of any link in an HTML file will be '<a href="'  
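
As a warm-up for the hint, here is a minimal sketch of pulling a single link out of a string with `find` and slicing. The `sample_html` string is a made-up snippet, not the real page.

```python
# Hypothetical HTML snippet; the real page is much longer.
sample_html = '<p><a href="/s.php?sId=1">One</a> and <a href="https://www.npr.org">NPR</a></p>'

marker = '<a href="'
start = sample_html.find(marker)                 # index where the marker begins, or -1
link_start = start + len(marker)                 # first character of the URL
link_end = sample_html.find('"', link_start)     # closing quote of the URL
first_link = sample_html[link_start:link_end]

print(first_link)
print(sample_html.find('not in the string'))     # find() returns -1 when nothing matches
```

Repeating this on the remainder of the string, until `find` returns -1, is one way to collect every link.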


In [13]:
import requests

In [34]:
path = 'http://text.npr.org'
r = requests.get(path)

text_to_process = r.text

link_queue = set()
while text_to_process.find('href="') >= 0:
    text_to_process = text_to_process[text_to_process.find('href="') + 6:]
    link = text_to_process[:text_to_process.find('"')]
    if not link.startswith('http'):
        link = path + link
    link_queue.add(link)
    text_to_process = text_to_process[text_to_process.find('"') + 1:]
    

#### Part 2

Take a look at all the links.  You'll notice that some are relative (i.e., they don't start with http).  Modify your code above to add 'http://text.npr.org' to the beginning of every relative link. 

Note that strings have a method called 'startswith', which returns True if the string begins with the text you pass in.
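
A short sketch of the idea, using a hypothetical list of links in the shapes seen on the page. The second half shows `urljoin` from the standard library's `urllib.parse`, offered here only as an alternative to the string approach this session uses.

```python
from urllib.parse import urljoin

base = 'http://text.npr.org'
# Hypothetical links, in the shapes seen on the page.
raw_links = ['/s.php?sId=575392333', 'https://www.npr.org', '/t.php?tid=1001']

absolute = []
for link in raw_links:
    if not link.startswith('http'):   # relative link: prepend the site address
        link = base + link
    absolute.append(link)

print(absolute)

# urljoin resolves relative links against the base and leaves absolute URLs alone.
print([urljoin(base, link) for link in raw_links])
```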

In [27]:
link_queue

{'http://text.npr.org/s.php?sId=575392333',
 'http://text.npr.org/s.php?sId=575906151',
 'http://text.npr.org/s.php?sId=575916244',
 'http://text.npr.org/s.php?sId=575954398',
 'http://text.npr.org/s.php?sId=575975670',
 'http://text.npr.org/s.php?sId=576050497',
 'http://text.npr.org/s.php?sId=576087864',
 'http://text.npr.org/s.php?sId=576096900',
 'http://text.npr.org/s.php?sId=576123068',
 'http://text.npr.org/s.php?sId=576139303',
 'http://text.npr.org/t.php?tid=1001',
 'http://text.npr.org/t.php?tid=1008',
 'http://text.npr.org/t.php?tid=1039',
 'https://about.npr.org',
 'https://help.npr.org/customer/portal/emails/new',
 'https://www.npr.org',
 'https://www.npr.org/about-npr/179876898/terms-of-use',
 'https://www.npr.org/about-npr/179878450/privacy-policy',
 'https://www.npr.org/about-npr/179881519/rights-and-permissions-information'}

#### Part 3

Once you have a set of links, remove any that aren't for npr.org, or don't fit the format of the npr articles or topics.  

*Hint:* The article link format is 'http://text.npr.org/s.php?sId=< Article ID >&rid=< Topic ID >', although not all articles will have a topic id.  
*Hint 2:* The topic link format is 'http://text.npr.org/t.php?tid= < Topic ID >'
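
The two formats differ only in their query-string keys, so one way to tell them apart is to parse the URL rather than search the raw string. The sketch below uses `urlparse` and `parse_qs` from the standard library; the `link_kind` helper and the `candidates` list are made up for illustration, and the plain string checks used in this session work just as well.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical candidates, following the formats in the hints.
candidates = [
    'http://text.npr.org/s.php?sId=575392333&rid=1001',
    'http://text.npr.org/t.php?tid=1008',
    'https://www.npr.org/about-npr/179878450/privacy-policy',
]

def link_kind(url):
    """Return 'article', 'topic', or None, based on the query-string keys."""
    parts = urlparse(url)
    if not parts.netloc.endswith('npr.org'):
        return None
    keys = {key.lower() for key in parse_qs(parts.query)}
    if 'sid' in keys:
        return 'article'
    if 'tid' in keys:
        return 'topic'
    return None

for url in candidates:
    print(link_kind(url), url)
```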

In [35]:
queue_to_return = set()

for link in link_queue:
    if 'npr.org' in link:
        if  'sid=' in link.lower() or 'tid=' in link.lower():
            queue_to_return.add(link)
        
queue_to_return

{'http://text.npr.org/s.php?sId=575392333',
 'http://text.npr.org/s.php?sId=575906151',
 'http://text.npr.org/s.php?sId=575916244',
 'http://text.npr.org/s.php?sId=575954398',
 'http://text.npr.org/s.php?sId=575975670',
 'http://text.npr.org/s.php?sId=576050497',
 'http://text.npr.org/s.php?sId=576087864',
 'http://text.npr.org/s.php?sId=576096900',
 'http://text.npr.org/s.php?sId=576123068',
 'http://text.npr.org/s.php?sId=576139303',
 'http://text.npr.org/t.php?tid=1001',
 'http://text.npr.org/t.php?tid=1008',
 'http://text.npr.org/t.php?tid=1039'}

#### Part 3 (continued)

Take the logic you've created for identifying links in an HTML string, and create a function. 

It stands to reason the input for the function should be a string (HTML), and the output should be a set where each element is a link to a topic or an article. 

You are welcome to re-configure the code a bit to be shorter. 

In [36]:
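def get_links(html_text, base_url='http://text.npr.org'):
    """Return the set of NPR article and topic links found in html_text."""
    link_queue = set()
    while html_text.find('href="') >= 0:
        # Skip past the next 'href="' (6 characters) and read up to the closing quote
        html_text = html_text[html_text.find('href="') + 6:]
        link = html_text[:html_text.find('"')]
        
        # Make relative links absolute (base_url is a parameter, not a global)
        if not link.startswith('http'):
            link = base_url + link
        
        # Keep only NPR article (sId=) and topic (tid=) links
        if 'npr.org' in link:
            if 'sid=' in link.lower() or 'tid=' in link.lower():
                link_queue.add(link)
        
        html_text = html_text[html_text.find('"') + 1:]
        
    return link_queue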

link_queue = get_links(r.text)
link_queue

{'http://text.npr.org/s.php?sId=575392333',
 'http://text.npr.org/s.php?sId=575906151',
 'http://text.npr.org/s.php?sId=575916244',
 'http://text.npr.org/s.php?sId=575954398',
 'http://text.npr.org/s.php?sId=575975670',
 'http://text.npr.org/s.php?sId=576050497',
 'http://text.npr.org/s.php?sId=576087864',
 'http://text.npr.org/s.php?sId=576096900',
 'http://text.npr.org/s.php?sId=576123068',
 'http://text.npr.org/s.php?sId=576139303',
 'http://text.npr.org/t.php?tid=1001',
 'http://text.npr.org/t.php?tid=1008',
 'http://text.npr.org/t.php?tid=1039'}

#### Part 4

Take one of the article links you found and use the requests library to get its content.  Inspect it, and try to identify what defines the content of the article itself. 

In [54]:
article_html = requests.get("http://text.npr.org/s.php?sId=575392333").text
article_html



#### Part 5

Use a similar loop to the link finding code to identify every paragraph in the text.  Create a list of each paragraph. Note that right now we're just trying to pull the text between the '&lt;p&gt;' and '&lt;/p&gt;' 

In [55]:
paragraphs = []
while article_html.find('<p>') >= 0:
    article_html = article_html[article_html.find('<p>') + 3:]
    paragraph = article_html[:article_html.find('</p>')]
    
    print(paragraph)
    paragraphs.append(paragraph)
    
    article_html = article_html[article_html.find('</p>') + 4:]

Text-Only NPR.org (go to <a href="https://www.npr.org">full version</a>)
<a href="/">Home</a> &gt; <a href="/p.php?pid=2"> Program: All Things Considered</a>
Nixon's Manhunt For The High Priest Of LSD In 'The Most Dangerous Man In America'
By Ari Shapiro
All Things Considered,  &middot; In the early 1970s, with a counter-cultural revolution in full swing, an unlikely figure became the No. 1 enemy of the state — Timothy Leary, the so called "High Priest of LSD." Leary was a former Harvard psychologist who left the ivory tower behind to spread the gospel of psychedelics. After breaking out of a California prison, he went on the run, sparking a madcap manhunt for a bumbling fugitive.  
Steven L. Davis and Bill Minutaglio's new book asks if Leary really was "the most dangerous man in America," as President Richard Nixon claimed. The story follows Leary as he hops from country to country, trying to stay one step ahead of the Nixon administration.
"He's kind of, you know, a Mr. Magoo on acid

Note that we still have a lot of HTML in our text.  Luckily all HTML tags start with an opening angle bracket ('<') and end with a closing angle bracket ('>'), so we can identify and remove them pretty easily. 

Loop over your list of paragraphs, identify any HTML in them, and remove it.  Ensure your list now contains HTML-free strings. 

In [67]:
cleaned_paragraphs = []
for paragraph in paragraphs:
    new_paragraph = ""
    
    tag = False
    for char in paragraph:
        
        if char == "<":
            tag = True
            
        if not tag:
            new_paragraph += char
            
        if char == ">":
            tag = False
        
        
    if len(new_paragraph.strip()) > 0:
        cleaned_paragraphs.append(new_paragraph)

In [69]:
for paragraph in cleaned_paragraphs:
    print(paragraph + '\n')

Text-Only NPR.org (go to full version)

Home &gt;  Program: All Things Considered

Nixon's Manhunt For The High Priest Of LSD In 'The Most Dangerous Man In America'

By Ari Shapiro

All Things Considered,  &middot; In the early 1970s, with a counter-cultural revolution in full swing, an unlikely figure became the No. 1 enemy of the state — Timothy Leary, the so called "High Priest of LSD." Leary was a former Harvard psychologist who left the ivory tower behind to spread the gospel of psychedelics. After breaking out of a California prison, he went on the run, sparking a madcap manhunt for a bumbling fugitive.  

Steven L. Davis and Bill Minutaglio's new book asks if Leary really was "the most dangerous man in America," as President Richard Nixon claimed. The story follows Leary as he hops from country to country, trying to stay one step ahead of the Nixon administration.

"He's kind of, you know, a Mr. Magoo on acid, if you will," Minutaglio says. "He's just tripping his way through li

#### Part 6

Now that you have HTML-free text, merge the list back together, with each paragraph separated by two new line characters ('\n\n').  

Turn the ability to take the HTML from an article's webpage and create a cleaned article into a function. 

The function should take the HTML string as input and return a string of the article's content.  

In [72]:
def parse_text(article_html):
    paragraphs = []
    while article_html.find('<p>') >= 0:
        article_html = article_html[article_html.find('<p>') + 3:]
        paragraph = article_html[:article_html.find('</p>')]

        new_paragraph = ""
    
        tag = False
        for char in paragraph:

            if char == "<":
                tag = True
            if not tag:
                new_paragraph += char
            if char == ">":
                tag = False

        if len(new_paragraph.strip()) > 0:
            paragraphs.append(new_paragraph)

        article_html = article_html[article_html.find('</p>') + 4:]
        
    return "\n\n".join(paragraphs)

In [None]:
article_html = requests.get("http://text.npr.org/s.php?sId=575392333").text

print(parse_text(article_html))
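
Stripping tags does not decode HTML entities, so the cleaned text can still contain sequences such as `&gt;` and `&middot;`, as seen in the paragraph output earlier. The standard library's `html.unescape` converts them back to characters. A minimal sketch on two sample strings modeled on that output (this step is optional and not part of `parse_text` as written):

```python
import html

# Two strings in the shapes seen in the paragraph output.
breadcrumb = 'Home &gt;  Program: All Things Considered'
byline = 'All Things Considered,  &middot; In the early 1970s'

print(html.unescape(breadcrumb))
print(html.unescape(byline))
```

If you want decoded text in your files, one option is to call `html.unescape` on the string that `parse_text` returns before writing it out.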

#### Part 7 - Put it all together

We have two functions - one that will get all the relevant links out of an article, and another that will get the actual content. 

Now we need to write the logic that controls the actual flow of our web scraper. 

In [76]:
def get_id(url):
    url = url[url.lower().find('sid=') + 4:]
    if '&' in url:
        url = url[:url.find('&')]
        
    return url
    

'575943681'

In [78]:
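import os

starting_url = 'http://text.npr.org'
folder_out   = 'data/scraper_test/'

# Create the output folder if it doesn't already exist
os.makedirs(folder_out, exist_ok=True)

links_visited = set()
links_queue   = get_links(requests.get(starting_url).text)

# Keep going until there are no unvisited links left
while len(links_queue) > 0:
    
    link = links_queue.pop()
    links_visited.add(link)
    
    html = requests.get(link).text
    
    # Queue any links on this page that we haven't processed yet
    urls = get_links(html)
    for url in urls:
        if url not in links_visited:
            links_queue.add(url)
            
    # Only article pages (sId=) have text to save; topic pages just supply links
    if 'sid=' in link.lower():
        text = parse_text(html)
        article_id = get_id(link)
        
        # Write as UTF-8 so non-ASCII characters in the article don't cause errors
        with open(folder_out + article_id + ".txt", "w", encoding="utf-8") as ft_hdl:
            ft_hdl.write(text)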