# <center>Session 4 - Functions </center>

<hr>

## Table of Contents

<b>

* [Functions](#Functions)
* [Project: Creating an NPR Scraper](#Scraper)
</b>

<hr>
<a id="Functions"></a>
# Functions

A function is a block of organized, reusable code that performs a single, related action. Functions make your application more modular and let you reuse code instead of repeating it.

Any piece of code that you might want to reuse, or that might be overly complicated, is a good candidate for becoming a function. Good function names, with specific pieces of logic pushed into them, can make your code far more readable (and easier to edit in the future).

    def functionname( parameters ):
       <function behavior>
       return <return value>

In [None]:
## Example function definition
def sum_numbers(num_1, num_2):
    summed_value = num_1 + num_2
    return summed_value

What happens when you run the code block above?  Call the new function you've created.  Now you can treat it like the built-in functions we've seen before. 

In [None]:
## Call the function above.  Your code here

Before running the next code cell, what do you think it will print out and in what order?

In [None]:
def happyBirthdaySong(name):
    print("Happy Birthday to you!")
    print("Happy Birthday to you!")
    print("Happy Birthday, dear", name + ".")
    print("Happy Birthday to you!")
    
print("When will this evaluate?")
happyBirthdaySong("Cynthia")

There's no return value for the function above.  What do you think it returns?

In [None]:
print(happyBirthdaySong("Cynthia"))
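
The behavior in the cell above can be isolated in a smaller sketch. The function `greet` below is a hypothetical stand-in, not part of the exercise; it only prints, so the value the call hands back is Python's `None`.

```python
def greet(name):
    # This function prints a greeting but has no return statement.
    print("Hello,", name)

result = greet("Cynthia")   # the greeting is printed as a side effect
print(result)               # a function without a return statement gives back None
print(result is None)
```

Printing inside a function and returning a value from it are different things: only a `return` statement hands a value back to the caller.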

### Optional Arguments

Not all arguments are required.  You can set *optional* or *default* arguments that have a default behavior if they're not explicitly passed in. 

Say we write a function that reads CSVs. We start with `csv.reader`, which reads each row as a list. We later realize we want the option to read rows as dictionaries, but don't want to break any code already calling our function. We can add an optional argument and set its default value so that existing calls behave the same.

In [None]:
import csv 

def read_csv(filename):
    lines = []
    # The csv module documentation recommends opening files with newline=''
    with open(filename, 'r', newline='') as file_hdl:
        csv_reader = csv.reader(file_hdl)
        for row in csv_reader:
            lines.append(row)
    
    return lines

In [None]:
results = read_csv('data/articles/article_log.csv')
results[1]

In [None]:
import csv 

def read_csv(filename, read_dicts=False):
    
    lines = []
    # The csv module documentation recommends opening files with newline=''
    with open(filename, 'r', newline='') as file_hdl:
        if read_dicts:
            csv_reader = csv.DictReader(file_hdl)
        else:
            csv_reader = csv.reader(file_hdl)
        
        for row in csv_reader:
            lines.append(row)
    return lines
    

In [None]:
results = read_csv('data/articles/article_log.csv')
results[1]

In [None]:
results = read_csv('data/articles/article_log.csv', read_dicts=True)
results[0]
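
If the data file is not available, the same idea can be tried on in-memory text. This is a minimal sketch: `parse_csv_text` and the sample columns are made up for illustration and are not the contents of `article_log.csv`.

```python
import csv
import io

# Hypothetical sample data; the real article_log.csv columns may differ.
SAMPLE_CSV = "article_id,title\n101,First story\n102,Second story\n"

def parse_csv_text(text, read_dicts=False):
    """Parse CSV text into rows: lists by default, dictionaries on request."""
    file_hdl = io.StringIO(text)          # behaves like an open file
    if read_dicts:
        csv_reader = csv.DictReader(file_hdl)
    else:
        csv_reader = csv.reader(file_hdl)
    return [row for row in csv_reader]

as_lists = parse_csv_text(SAMPLE_CSV)                    # default behavior
as_dicts = parse_csv_text(SAMPLE_CSV, read_dicts=True)   # optional behavior

print(as_lists[0])   # the header row, as a list
print(as_lists[1])   # the first data row, as a list
print(as_dicts[0])   # the first data row, keyed by column name
```

Note that `csv.DictReader` uses the header row for its keys, so the first data row sits at index 0 in the dictionary version and at index 1 in the list version.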

### Variable Scope

Variables have a scope in which they can be referenced.  It's inevitable that you will have variables with the same name, so knowing WHERE they can be accessed is important. 

In [None]:
def area_of_a_circle(radius):
    pi = 3.14159
    return pi * radius**2

pi = 3.14
print(area_of_a_circle(2))

print(pi)

Will these next two examples work?

In [None]:
def area_of_a_circle(radius):
    return pi_ex2 * radius**2

pi_ex2 = 3.14
print(area_of_a_circle(2))


In [None]:
def area_of_a_circle(radius):
    pi_val = 3.14159
    return pi_val * radius**2

print(area_of_a_circle(2))

print(pi_val)
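
To see the scope rule without stopping the notebook on an error, the lookup can be wrapped in `try`/`except`. This is a sketch with made-up names (`pi_local`, `lookup_failed`); it assumes `pi_local` has not been defined anywhere else in the session.

```python
def area_of_a_circle(radius):
    pi_local = 3.14159              # local: exists only while the function runs
    return pi_local * radius**2

print(area_of_a_circle(2))

lookup_failed = False
try:
    print(pi_local)                 # no variable with this name exists out here
except NameError as err:
    lookup_failed = True
    print("NameError:", err)
```

Names assigned inside a function are local to it. Code inside a function can read a name defined outside it, but code outside cannot reach a name defined inside.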

<hr>

<a id="Scraper"></a>
# Creating an NPR Web Scraper

We'll work through this problem step by step, creating some logic and turning it into a series of functions as we work out each piece.

The goal is:

* Start with the webpage 'http://text.npr.org/'
* Identify all article links
    * Create a queue of articles to process, add all these links to the queue
* Download each article
    * Pop a link off the queue
    * Identify the article text
    * Write the article text to a file with the article id as the filename
    * Find any other articles linked to in the article
        * Make sure the new links haven't already been processed
        * Add them to the queue of articles to process if they're new
    

#### Part 1

Use the Requests library to get the content of 'http://text.npr.org/'

Create a loop and use string functions to identify every link.  Add each link to a set.

*Hint:* The beginning of any link in an HTML file will be '<a href="'  
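
One possible shape for the search loop, shown on a made-up HTML snippet rather than the live page so it runs without a network request. The variable names and the sample links are assumptions for illustration only.

```python
# Made-up HTML standing in for the downloaded page.
sample_html = (
    '<ul>'
    '<li><a href="/s.php?sId=111&rid=1001">Story one</a></li>'
    '<li><a href="http://text.npr.org/t.php?tid=1001">News</a></li>'
    '<li><a href="/s.php?sId=111&rid=1001">Story one again</a></li>'
    '</ul>'
)

link_start = '<a href="'
links = set()

position = sample_html.find(link_start)
while position != -1:                              # find returns -1 when nothing is left
    url_start = position + len(link_start)
    url_end = sample_html.find('"', url_start)     # closing quote of the href value
    if url_end == -1:
        break                                      # malformed tag, stop scanning
    links.add(sample_html[url_start:url_end])
    position = sample_html.find(link_start, url_end)

print(sorted(links))
```

Because `links` is a set, the repeated link is stored only once.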


In [None]:
## Your code here

In [None]:
## Your code here

#### Part 2

Take a look at all the links. You'll notice that some are relative (i.e., they don't start with http). Modify your code above to add 'http://text.npr.org' to the beginning of every relative link.

Note that strings have a method named 'startswith'.
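
A minimal sketch of that check, using a hand-written set of links (the values are made up) in place of the ones found on the page:

```python
base_url = 'http://text.npr.org'

# Hypothetical links; in the exercise these come from the page you downloaded.
raw_links = {
    '/s.php?sId=111&rid=1001',
    'http://text.npr.org/t.php?tid=1001',
    'https://example.com/about',
}

absolute_links = set()
for link in raw_links:
    if link.startswith('http'):
        absolute_links.add(link)               # already absolute, keep as-is
    else:
        absolute_links.add(base_url + link)    # relative, so prepend the site address

print(sorted(absolute_links))
```

This assumes every relative link begins with a slash. The standard library's `urllib.parse.urljoin` handles more cases, but plain string concatenation keeps the exercise focused on string methods.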

In [None]:
## Your code here

#### Part 3

Once you have a set of links, remove any that aren't for npr.org, or don't fit the format of the npr articles or topics.  

*Hint:* The article link format is 'http://text.npr.org/s.php?sId=< Article ID >&rid=< Topic ID >', although not all articles will have a topic id.  
*Hint 2:* The topic link format is 'http://text.npr.org/t.php?tid=< Topic ID >'
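
One way to apply the two hints is to keep only links that begin with either format. This is a sketch on made-up links; `startswith` accepts a tuple of prefixes, which keeps the test to one line.

```python
article_prefix = 'http://text.npr.org/s.php?sId='
topic_prefix = 'http://text.npr.org/t.php?tid='

# Hypothetical candidate links for illustration.
candidates = {
    'http://text.npr.org/s.php?sId=111&rid=1001',
    'http://text.npr.org/t.php?tid=1001',
    'http://text.npr.org/about.php',
    'https://example.com/about',
}

# Keep a link only if it matches the article or the topic format.
npr_links = {
    link for link in candidates
    if link.startswith((article_prefix, topic_prefix))
}

print(sorted(npr_links))
```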

In [None]:
## Your code here

#### Part 4

Take the logic you've created for identifying links in an HTML string, and create a function. 

It stands to reason the input for the function should be a string (HTML), and the output should be a set where each element is a link to a topic or an article. 

You are welcome to re-configure the code a bit to be shorter. 

In [None]:
## Your code here

#### Part 5

Take one of the article links you found and use the requests library to get its content.  Inspect it, and try to identify what defines the content of the article itself. 

In [None]:
## Your code here

#### Part 6

Use a loop similar to the link-finding code to identify every paragraph in the text. Create a list of the paragraphs. Note that right now we're just trying to pull the text between the '&lt;p&gt;' and '&lt;/p&gt;' tags.
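
The paragraph search can follow the same pattern as the link search. The sketch below runs on a made-up article string and assumes the paragraph tags carry no attributes; real pages may not be that tidy.

```python
# Made-up article HTML for illustration.
article_html = (
    '<html><body>'
    '<p>First <b>bold</b> paragraph.</p>'
    '<p>Second paragraph with a <a href="/s.php?sId=222">link</a>.</p>'
    '</body></html>'
)

open_tag, close_tag = '<p>', '</p>'
paragraphs = []

start = article_html.find(open_tag)
while start != -1:
    text_start = start + len(open_tag)
    end = article_html.find(close_tag, text_start)
    if end == -1:
        break                                   # unclosed paragraph, stop scanning
    paragraphs.append(article_html[text_start:end])
    start = article_html.find(open_tag, end)

for paragraph in paragraphs:
    print(paragraph)
```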

In [None]:
## Your code here

Note that we still have a lot of HTML in our text. Luckily, every HTML tag starts with an opening angle bracket ('<') and ends with a closing angle bracket ('>'), so we can identify and remove them pretty easily.

Loop over your list of paragraphs, identify any HTML in them, and remove it. Ensure your list now contains HTML-free strings.
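
A character-by-character sketch of the tag removal, with a hypothetical helper named `strip_tags`. It assumes angle brackets only ever appear as part of tags, so a literal '<' or '>' in the article text would be mishandled.

```python
def strip_tags(text):
    """Return text with everything between '<' and '>' removed."""
    cleaned = ''
    inside_tag = False
    for char in text:
        if char == '<':
            inside_tag = True        # a tag starts, stop copying characters
        elif char == '>':
            inside_tag = False       # the tag ends, resume copying after it
        elif not inside_tag:
            cleaned += char
    return cleaned

# Hypothetical paragraphs for illustration.
raw_paragraphs = [
    'First <b>bold</b> paragraph.',
    'Second paragraph with a <a href="/s.php?sId=222">link</a>.',
]
clean_paragraphs = [strip_tags(paragraph) for paragraph in raw_paragraphs]
print(clean_paragraphs)
```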

In [None]:
## Your code here

In [None]:
## Your code here

#### Part 7

Now that you have HTML-free text, merge the list back together, with each paragraph separated by two newline characters ('\n\n').

Turn the ability to take the HTML from an article's webpage and create a cleaned article into a function.

The function should take the HTML string as input and return a string of the article's content.
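
The merge step on its own is short. A sketch with made-up paragraphs, using the string method `join`:

```python
# Hypothetical cleaned paragraphs for illustration.
clean_paragraphs = ['First bold paragraph.', 'Second paragraph with a link.']

# join places the separator between the items, not after the last one.
article_text = '\n\n'.join(clean_paragraphs)
print(article_text)
```

The function you write would run the paragraph search and the tag removal first, then return the joined string.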

In [None]:
## Your code here

In [None]:
## Your code here

#### Part 8 - Put it all together

We have two functions - one that will get all the relevant links out of an article, and another that will get the actual content. 

Now we need to write the logic that controls the actual flow of our web scraper. 

Write a function that, given the url string, extracts the id ("sId=") of the article.
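
One possible sketch of the id extraction, using only string methods. The name `get_article_id` and the choice to return `None` for links without an article id are assumptions, not requirements of the exercise.

```python
def get_article_id(url):
    """Return the text after 'sId=' up to the next '&', or None if absent."""
    marker = 'sId='
    start = url.find(marker)
    if start == -1:
        return None                  # not an article link (for example, a topic page)
    start += len(marker)
    end = url.find('&', start)       # the id stops at the next parameter, if any
    if end == -1:
        end = len(url)
    return url[start:end]

print(get_article_id('http://text.npr.org/s.php?sId=111&rid=1001'))
print(get_article_id('http://text.npr.org/s.php?sId=222'))
print(get_article_id('http://text.npr.org/t.php?tid=1001'))
```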

Once you have that, write code that starts with the url 'http://text.npr.org' and the output folder 'data/scraper_test', and creates a queue of links to process.

For each link, download the html, pull out any npr links and add them to the processing queue.  Clean up the text, and pull out its id.  Write out the cleaned text to the output folder, using its id as the filename.

**See the steps at the beginning of the exercise for the full logic**
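
The control flow can be tried out before any real downloading is involved. In the sketch below, `FAKE_SITE` is a made-up dictionary standing in for the web (each url maps to the links found on that page), the file contents are placeholders, and a temporary folder replaces 'data/scraper_test'. Your version would call your own link and text functions instead.

```python
from collections import deque
import os
import tempfile

# Hypothetical stand-in for the web: url -> links found on that page.
FAKE_SITE = {
    'http://text.npr.org': ['http://text.npr.org/s.php?sId=1',
                            'http://text.npr.org/s.php?sId=2'],
    'http://text.npr.org/s.php?sId=1': ['http://text.npr.org/s.php?sId=2'],
    'http://text.npr.org/s.php?sId=2': ['http://text.npr.org/s.php?sId=1',
                                        'http://text.npr.org/s.php?sId=3'],
    'http://text.npr.org/s.php?sId=3': [],
}

def crawl(start_url, output_folder):
    queue = deque([start_url])       # links waiting to be processed
    seen = {start_url}               # every link ever added to the queue
    written = []
    while queue:
        url = queue.popleft()        # pop a link off the queue
        for link in FAKE_SITE.get(url, []):
            if link not in seen:     # only queue links we have not met before
                seen.add(link)
                queue.append(link)
        if 'sId=' in url:            # only article pages get written out
            article_id = url.split('sId=')[1].split('&')[0]
            path = os.path.join(output_folder, article_id + '.txt')
            with open(path, 'w') as out_file:
                out_file.write('placeholder text for article ' + article_id)
            written.append(article_id)
    return written

with tempfile.TemporaryDirectory() as folder:
    processed = crawl('http://text.npr.org', folder)
    saved_files = sorted(os.listdir(folder))

print(processed)
print(saved_files)
```

Tracking `seen` separately from the queue is what stops two articles that link to each other from being processed forever.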

In [None]:
## Your code here

In [None]:
## Your code here

In [None]:
## Your code here