# Intro to Python and Web Scraping

## Info
- Scott Bailey (CIDR), *scottbailey@stanford.edu*
- Javier de la Rosa (CIDR), *versae@stanford.edu*
- Ashley Jester (CIDR/SSDS), *ajester@stanford.edu*

## Goal

By the end of today's workshop, we hope you'll understand basic Python syntax for variables, functions, and control flow. We also hope you'll know enough about the web scraping process and some standard Python packages to scrape information from a basic, well-formatted website.

## Topics
- Imports
- Variables and types/structures (String, Int, List)
- Functions
- Control flow
- Web scraping with Requests and BeautifulSoup
- Writing text to a file

## Packages we need in our environment
- requests
- beautifulsoup4
- lxml (the parser we pass to BeautifulSoup below)

## Imports
- Put your imports at the top of your script/file.
- Import whole module
- Import part of a module

In [None]:
from bs4 import BeautifulSoup
import os
import requests
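
As a rough illustration of the two import styles listed above, the sketch below uses only the standard library's `os` module (chosen here just for illustration). Both styles end up pointing at the same function.

```python
# Import the whole module: names are reached through the module prefix
import os

# Import part of a module: the name is available directly
from os.path import join

# Both spellings refer to the same function object
same_function = os.path.join is join

# Build a path both ways; the separator depends on the operating system
path_a = os.path.join("chapters", "preface.txt")
path_b = join("chapters", "preface.txt")
print(same_function, path_a == path_b)
```

Importing only what you need keeps later lines shorter, while importing the whole module makes it clearer where a name comes from.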

## Types and variables

In [None]:
# Strings
greeting = "Hello, I'm Scott. It's a pleasure to meet you."
# After you run this cell, note the difference between what print() displays and
# the output Jupyter shows for the last line of the cell
print(greeting)
greeting

In [None]:
# Find a letter by index
greeting[0]

In [None]:
# Get the length of a string
len(greeting)

In [None]:
# Count spaces in the string
greeting.count(' ')

In [None]:
# Slice to get the first 3 characters
greeting[0:3]

In [None]:
# Get the last three characters
greeting[-3:]

In [None]:
# Replace hello with goodbye
greeting.replace("Hello", "Goodbye")

In [None]:
# String concatenation
"Hello" + "World"

In [None]:
# Numbers
# Integer and floats
first_num = 10
second_num = 5.467
print(type(first_num), type(second_num))

In [None]:
# Addition
1 + 5

In [None]:
# Division
10 / 2

In [None]:
# Multiplication
5 * 2
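
One detail that may be worth noticing in the arithmetic cells above: in Python 3, `/` always returns a float, even when the result is a whole number. Here is a small sketch of the related operators (the variable names are only for illustration):

```python
true_division = 10 / 2      # always a float
floor_division = 10 // 3    # whole-number part of the division
remainder = 10 % 3          # what is left over after floor division

print(true_division, type(true_division))
print(floor_division, remainder)
```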

In [None]:
# Lists
drinks = ['coffee', 'tea', 'water']
drinks

In [None]:
# Python allows you to create lists of different types
mixed = [2, 'hello', 10.5, 'here is a sentence']
mixed

In [None]:
# Get item by index
drinks[2]

In [None]:
# Add an item to the end of the list
drinks.append('juice')
drinks

In [None]:
# Splitting a string - note the type of the output
greeting_words = greeting.split(' ')
greeting_words

In [None]:
# Joining a list of strings 
' '.join(greeting_words)

There are plenty of other data types and structures that we aren't going to use today, such as sets, dictionaries, and tuples.
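
For reference, here is a brief sketch of what those three structures look like. The values are made up for illustration and are not used elsewhere in the workshop.

```python
# A tuple is an ordered collection that cannot be changed after creation
point = (3, 4)

# A set keeps only unique items and has no guaranteed order
unique_drinks = {'coffee', 'tea', 'coffee'}

# A dictionary maps keys to values
ages = {'Ada': 36, 'Grace': 45}
ages['Alan'] = 41  # add a new key/value pair

print(point[0], len(unique_drinks), ages['Alan'])
```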

## Functions

At the most basic level, functions are chunks of reusable code.

In [None]:
# Define a function
def add(x, y):
    return x + y

add(1, 2)


In [None]:
# Python calls these sequences lists, so the names below say "list" rather than "array"
def combine_lists(list1, list2):
    new_list = list1 + list2
    return new_list

first = ['hello', 2]
second = ['1', 10]
new = combine_lists(first, second)
new

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
In the cell below, experiment with the add function defined above. What happens if you put in two strings? A string and an integer? A list and a string?
</p>
</div>

In [None]:
# Experiment with using different and mixed variable types with add(x, y)

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">

Pig latin is a language game where you take the first letter of a word, move it to the back of the word, then add '-ay' at the end. For example, 'pig latin' would be 'igpay atinlay' and 'python' would turn into 'ythonpay'.

In the cell below, write a function that takes a string, lowercases it, and returns the pig latin translation of the word. You'll need to use slicing and string concatenation to make this work. 
</p>
</div>

In [None]:
def pig_latinize(word):
    ...
    return ...


## Control flow 

In [None]:
# IF
name = "Bob"

if name == "Scott":
    print("Hi Scott!")
else:
    print("Who are you?")

In [None]:
# You can use control flow with functions
# Also, you can use if, elif, and else to specify more than one condition
name = "John"

def say_hello(name):
    return "Hello " + name + "!"

if name == "Bob":
    message = say_hello("Bob")
    print(message)
elif name == "Scott":
    message = say_hello("Scott")
    print(message)
else:
    print("Who are you?")

In [None]:
# FOR loops let you iterate over a list or other iterable object
names = ["Stu", "Scott", "Javier", "Ashley"]
for name in names:
    print(name, len(name))

In [None]:
# You can combine types of control flow
for name in names[:3]:
    if len(name) > 5:
        print(name)

In [None]:
def add_one(num):
    return num + 1

nums = [1, 2, 3, 4]
plus = []
for num in nums:
    plus.append(add_one(num))
plus

In [None]:
# ADVANCED: List Comprehensions
# List comprehensions are a "pythonic" way of building lists in a compact manner

added = [add_one(num) for num in nums]
added

In [None]:
long_names = [name.lower() for name in names[:3] if len(name) > 5]
long_names

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
In the cell below, write a function that loops over a list and returns a new list where all the strings have been replaced with their pig latin translations. 

For example, if your list is `['hello', 5, 'world']` your output should be `['ellohay', 5, 'orldway']`.

Feel free to reuse the pig latinizer you wrote above. You'll also need to think about checking the type of each item in the list. 
</p>
</div>
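
As a hint for the type check, one option is the built-in `isinstance` function. The sketch below uses a different transformation (uppercasing) so that the activity is still yours to solve. The function name `shout_strings` is hypothetical.

```python
def shout_strings(items):
    """Return a new list with every string uppercased and other items unchanged."""
    result = []
    for item in items:
        if isinstance(item, str):
            # Only strings have an upper() method, so check the type first
            result.append(item.upper())
        else:
            result.append(item)
    return result

shout_strings(['hello', 5, 'world'])
```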

In [None]:
def pig_latinize_list(items):
    ...
    return ...

## Web scraping with Requests and Beautiful Soup

### Scraping text

In [None]:
# We'll use the requests library to carry out an HTTP request on the url
# Then use BeautifulSoup to parse the HTML
url = "https://en.wikipedia.org/wiki/Stanford_University"
page = requests.get(url)
soup = BeautifulSoup(page.text, "lxml")
soup

In [None]:
# We can use the find method to specify an HTML element to find,
# and pass attributes such as class or id to find specific elements
# Find only returns the first element found
hatnote = soup.find('div', {'class': 'hatnote'})
hatnote

In [None]:
# The get_text method pulls just the text from a chunk of HTML
hat_text = hatnote.get_text()
hat_text

In [None]:
# Within a chunk of HTML we've found, we can use find again to find another HTML element
main_text_area = soup.find('div', {'class': 'mw-content-ltr'})
main_text = main_text_area.find('p')
main_text.get_text()

In [None]:
# We can use find_all to find every instance of an HTML element
# find_all returns an object we can iterate over
paragraphs = soup.find_all('p')
type(paragraphs)

In [None]:
for para in paragraphs:
    print(para.get_text())
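
To experiment with `find`, `find_all`, and `get_text` without making a network request, you can parse a small HTML string directly. The snippet below is invented for illustration, and it uses Python's built-in `"html.parser"` instead of `"lxml"` so that it needs no extra parser package.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a downloaded page
html = """
<div class="hatnote">See also: Example</div>
<div class="content">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns the first matching element
note_text = soup.find("div", {"class": "hatnote"}).get_text()

# find_all returns every match, which we can loop over
paragraph_texts = [p.get_text() for p in soup.find_all("p")]

# find returns None when nothing matches, so check before calling get_text
missing = soup.find("div", {"class": "does-not-exist"})

print(note_text, paragraph_texts, missing)
```

Because `find` returns `None` when there is no match, chaining a method onto a failed `find` raises an `AttributeError`. That is a common source of errors when a site's markup differs from what you expected.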

### Another text scraping example

Let's create a list of URLs for the chapters of A Byte of Python, iterate over them, and get the content of each page.

A Byte of Python is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, allowing us to copy the book, distribute it, transmit it, remix it and so forth. 

In [None]:
url = "https://python.swaroopch.com/"
page = requests.get(url)
soup = BeautifulSoup(page.text, "lxml")

In [None]:
# We can chain together methods to find one element, then find all instances of another
# element within that HTML block
chapters = soup.find('nav').find_all('a')
chapters

In [None]:
# We can use square brackets to access the value of an attribute, such as the href of a link
for a in chapters:
    print(a['href'])

In [None]:
# Since the href didn't give us a full url, we use a function to build one
def create_url(url):
    return 'https://python.swaroopch.com/' + url

# Then use a list comprehension to create a list of full urls of chapters
chapter_links = [create_url(a['href']) for a in chapters[2:-1]]
chapter_links
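
String concatenation works here because every href is relative to the same base URL. As an alternative sketch, the standard library's `urllib.parse.urljoin` handles both relative and absolute hrefs. The second href below is a made-up example, not a value taken from the site.

```python
from urllib.parse import urljoin

base = 'https://python.swaroopch.com/'

# A relative href is resolved against the base
relative = urljoin(base, 'dedication.html')

# An href that is already a full URL is returned unchanged
absolute = urljoin(base, 'https://example.com/other.html')

print(relative)
print(absolute)
```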

In [None]:
# We've used this chunk of code several times, so let's make it a function that specifically
# gets the text from a chapter page
def get_page_text(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "lxml")
    return soup.find('section', {'class': 'markdown-section'}).get_text()

for url in chapter_links:
    print(get_page_text(url))


### Writing text to a file
Using whatever method you prefer, create a directory (folder) named 'chapters' in the same directory as this notebook.

In [None]:
# In the functions below, I've added docstrings, which let you document the purpose of a
# function, its parameters, and what it returns

# The first function breaks apart a filename, builds a path including a directory,
# then puts the right file extension at the end
def create_filename(name, dirname):
    """
    Builds a filename
    
    Args:
        name (string) - the name of the file to be written
        dirname (string) - the name of the directory to contain the files
        
    Returns:
        filename (string) - path to the file
    """
    chunks = name.split('.')
    filename = os.path.join(dirname, chunks[0] + '.txt')
    return filename

def create_url(url):
    """
    Takes a final chunk of a url and creates a full url
    
    Args:
        url (string) - the url with file extension, e.g. 'dedication.html'
        
    Returns a full url (string)
    """
    return 'https://python.swaroopch.com/' + url

def get_page_text(url):
    """
    Pulls html from the url, creates a beautiful soup object, and gets the text from the page
    
    Args:
        url (string) - the url for the page from which you want text
        
    Returns the text (string) from the page
    """
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "lxml")
    return soup.find('section', {'class': 'markdown-section'}).get_text()

# Iterate over the chapter links, create a filename for each, get the text for each, 
# then write it to a local file in the chapters directory
for a in chapters[2:-1]:
    filename = create_filename(a['href'], 'chapters')
    text = get_page_text(create_url(a['href']))
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(text)
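
The loop above assumes the `chapters` directory already exists. Here is a minimal sketch of creating the directory from Python and writing with an explicit encoding. It uses a temporary directory and placeholder text so that nothing is scraped or left behind. The helper name `build_txt_path` is hypothetical, and it uses `os.path.splitext` as one possible alternative to splitting on `'.'`.

```python
import os
import tempfile

def build_txt_path(name, dirname):
    """Swap the extension of name for .txt and place the file inside dirname."""
    stem, _extension = os.path.splitext(name)
    return os.path.join(dirname, stem + '.txt')

with tempfile.TemporaryDirectory() as workdir:
    chapters_dir = os.path.join(workdir, 'chapters')
    # exist_ok=True means no error is raised if the directory is already there
    os.makedirs(chapters_dir, exist_ok=True)

    filename = build_txt_path('dedication.html', chapters_dir)
    with open(filename, 'w', encoding='utf-8') as f:
        f.write('Placeholder chapter text')

    # Read the file back to confirm what was written
    with open(filename, encoding='utf-8') as f:
        saved_text = f.read()

print(os.path.basename(filename), saved_text)
```

Passing `encoding='utf-8'` avoids depending on the operating system's default encoding, which can matter when scraped text contains non-ASCII characters.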

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
Activity
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
One common use for web scraping is gathering content for analysis, such as sentiment analysis. You may have seen data-driven journalists offer sentiment analysis of political content. We won't do the analysis, but let's practice scraping headlines from a news site.</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">In the cell below, scrape the article titles from ProPublica's page on the current presidential administration - https://www.propublica.org/trump-administration/. You'll need to look at the html code of the page to locate the right markup to find. Think about first finding all the articles, then iterating over those to find each title. You can either just print out each title, or put it into a list. </p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">Always check any given site's terms of use, content policy, and robots.txt file before scraping it. In this case, content has a Creative Commons license and the robots.txt file seems to allow a robot to hit pages for categories of articles. </p>
</div>
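
One way to check robots.txt rules from Python is the standard library's `urllib.robotparser`. The sketch below parses a made-up set of rules for a placeholder domain, so it does not reflect any real site's policy. For a live site you would instead point the parser at the site's actual robots.txt with `set_url` and `read`.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration only
example_rules = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(example_rules)

# can_fetch answers whether the given user agent may request the URL
allowed = parser.can_fetch('*', 'https://example.com/articles/')
blocked = parser.can_fetch('*', 'https://example.com/private/notes.html')
print(allowed, blocked)
```

A robots.txt check covers only part of the picture. The site's terms of use and content license still need to be read by a person.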