<h3>Let's have fun with web scraping</h3>

If you'd like to follow along, the notebooks and instructions are at: *https://github.com/wsuen/Webscraping-Workshop-ACTW*

<img src="imgs/catscatscats.png" width="60" height="60">

<b>How to pull data</b>

* Find out which URL(s) you want data from.

* Find out how the data is stored on the webpage.

* Call requests.get(), then let the fun begin!



In [2]:
import requests

my_example_domain = "http://placekitten.com/"

r = requests.get(my_example_domain)

r.status_code

200

In [None]:
# the raw response body isn't very nice or human-readable
r.content

Beautiful Soup to the rescue!

In [3]:
from bs4 import BeautifulSoup

my_soup = BeautifulSoup(r.content, "html.parser")

In [None]:
print(my_soup.prettify())

In [None]:
# print the src attribute of every img element in the html
for cat in my_soup.findAll("img"):
    # print(cat)  # uncomment to see the whole tag
    print(cat.get("src"))
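The `src` values printed above are not guaranteed to be absolute URLs; pages often use relative paths, and some `img` tags have no `src` at all. Here is a minimal sketch (Python 3, standard library only) of resolving them against the page URL with `urljoin`. The sample `src` values and the helper name `absolute_urls` are made up for illustration and are not taken from placekitten.com.

```python
from urllib.parse import urljoin

def absolute_urls(page_url, srcs):
    """Resolve each src against page_url, skipping missing or empty values."""
    return [urljoin(page_url, src) for src in srcs if src]

# hypothetical src values, shaped like what cat.get("src") might return
sample_srcs = ["/200/300", "images/cat.png", "http://example.com/cat.jpg", None]

resolved = absolute_urls("http://placekitten.com/", sample_srcs)
for url in resolved:
    print(url)
```

Already-absolute URLs pass through unchanged, so it should be safe to run every `src` through the helper before requesting it.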

In [None]:
my_soup.findAll?

<b>How to save data

<img src="imgs/cats_img.png">

Use case: I need to save all cat pictures to a folder, so I can look at them even if the internet is down.

In [None]:
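import os

# make sure the output folder exists before writing into it
if not os.path.isdir("saved_cats"):
    os.makedirs("saved_cats")

cat_index = 0
for cat in my_soup.findAll("img"):
    img_url = cat.get("src")
    img_request = requests.get(img_url)
    # image data is binary, so open the file in "wb" mode
    with open("saved_cats/cat_%d.png" % cat_index, "wb") as wf:
        wf.write(img_request.content)
    cat_index += 1

print("I saved %d cat pictures!" % cat_index)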
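The loop above names every file `.png`, whatever format the server actually sent. One possible refinement, sketched here for Python 3 with only the standard library, is to take the extension from the URL path and fall back to `.png` when there is none. The helper name `cat_filename` and the example URLs are hypothetical.

```python
import os
from urllib.parse import urlparse

def cat_filename(img_url, index, default_ext=".png"):
    """Build 'cat_<index><ext>', using the extension in the URL path if present."""
    path = urlparse(img_url).path          # drops the query string and fragment
    ext = os.path.splitext(path)[1].lower()
    return "cat_%d%s" % (index, ext or default_ext)

# hypothetical URLs: one with an extension, one without
name_with_ext = cat_filename("http://example.com/pics/fluffy.JPG?size=big", 0)
name_without_ext = cat_filename("http://placekitten.com/200/300", 1)
print(name_with_ext)
print(name_without_ext)
```

The URL is only a hint; the `Content-Type` response header would be a more reliable source for the format, at the cost of a little more code.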

<b>How to play nicely</b>

* Some sites get mad if you scrape their data.

* Some sites are OK with it, under certain conditions.

* When in doubt, check the site's Terms of Service (ToS) page; most sites have one.
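Besides the ToS page, many sites publish machine-readable crawling rules in a `robots.txt` file. The sketch below (Python 3.6+, standard library) parses a made-up `robots.txt` with `urllib.robotparser` and asks whether a bot may fetch two hypothetical URLs. The rules, the URLs, and the `iambot` user agent are assumptions for illustration, not any real site's policy.

```python
from urllib.robotparser import RobotFileParser

# made-up robots.txt content, for illustration only
sample_robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

allowed_public = parser.can_fetch("iambot", "http://example.com/cats/")
allowed_private = parser.can_fetch("iambot", "http://example.com/private/cats/")
delay = parser.crawl_delay("iambot")  # seconds between requests, or None if unset

print(allowed_public, allowed_private, delay)
```

Against a live site you would call `parser.set_url(...)` with the site's `robots.txt` address and then `parser.read()` instead of `parse()`, and pause with `time.sleep(delay)` between requests when a delay is given.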


<b>Additional Examples</b>

What about a different, more complex site, with many types of elements and more elaborate formatting? Let's also pass in a User-Agent header saying we're a robot, just to be good citizens.

Headlines change all the time, so news sites, classifieds info, and other sites with frequently updating content are good candidates for automation.

In [19]:
headers = {"User-agent" : "iambot"}
nyt_r = requests.get("https://www.nytimes.com/", headers=headers)

In [20]:
nyt_r.status_code

200

In [21]:
nyt_soup = BeautifulSoup(nyt_r.content, "html.parser")

In [34]:
# There's a lot here. Let's just find the keywords meta tag for trending topics.
for item in nyt_soup.findAll("meta"):
    if item.get("name") == "keywords":
        keywords_string = item.get("content")
        print(keywords_string)

United States Politics and Government,Trump, Donald J,Mueller, Robert S III,United States Politics and Government,Special Prosecutors (Independent Counsel),United States Politics and Government,Mueller, Robert S III,Federal Bureau of Investigation,Special Prosecutors (Independent Counsel),Rosenstein, Rod J,Mueller, Robert S III,United States Politics and Government,Justice Department,United States Politics and Government,Comey, James B,Trump, Donald J,Rosenstein, Rod J,United States Politics and Government,United States Politics and Government,Mueller, Robert S III,Justice Department,Federal Bureau of Investigation,Trump, Donald J,Rosenstein, Rod J,Russia,Presidential Election of 2016,Comey, James B,United States Politics and Government,Federal Bureau of Investigation,Comey, James B,Republican Party,Trump, Donald J,Flynn, Michael T,House of Representatives,Senate,United States Politics and Government,Trump, Donald J,Impeachment,Special Prosecutors (Independent Counsel),Flynn, Michael T
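The loop above walks every `meta` tag and checks its `name` by hand. As an alternative sketch, Beautiful Soup can filter on the attribute directly. The HTML below is a tiny made-up page standing in for a real response body, so the snippet runs without a network connection (Python 3, with `bs4` installed).

```python
from bs4 import BeautifulSoup

# tiny made-up page standing in for a real response body
sample_html = """
<html><head>
<meta name="description" content="A sample page">
<meta name="keywords" content="Cats,Dogs, Big,Cats">
</head><body></body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find()'s own `name` argument is the tag name, so the HTML attribute
# called "name" has to be passed through attrs instead of as a keyword
tag = soup.find("meta", attrs={"name": "keywords"})

# fall back to an empty string if the page has no keywords meta tag
sample_keywords_string = tag.get("content") if tag is not None else ""
print(sample_keywords_string)
```

The `None` check matters on real pages: not every site includes a keywords tag, and calling `.get()` on `None` would raise an `AttributeError`.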

In [46]:
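# clean and dedupe: drop the ", " inside "Last, First" names so that the
# remaining commas are pure separators, split on them, and let set() remove repeats
keywords = set(keywords_string.replace(", ", " ").split(","))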

keywords

{u'Abbas Mahmoud',
 u'Airlines and Airplanes',
 u'Alien: Covenant (Movie)',
 u'Amri Anis (1992- )',
 u'Antitrust Laws and Competition Issues',
 u'Assad Bashar al-',
 u'Audioslave',
 u'Australia',
 u'Berlin (Germany)',
 u'Books and Literature',
 u'Bridgeport (Conn)',
 u'Business Travel',
 u'California Today',
 u'Carbon Caps and Emissions Trading Programs',
 u'Central African Republic',
 u'Coal',
 u'Comey James B',
 u'Compton Jr. Posse',
 u'Computers and the Internet',
 u'Congo Democratic Republic of (Congo-Kinshasa)',
 u'Connecticut',
 u'Cornell Chris (1964- )',
 u'Corruption (Institutional)',
 u'Crime and Criminals',
 u'Deaths (Obituaries)',
 u'Defense Department',
 u'Disabilities',
 u'Drug Abuse and Traffic',
 u'Ebola Virus',
 u'Elections Governors',
 u'Elections Mayors',
 u'European Commission',
 u'European Union',
 u'Ex-Convicts',
 u'Executive Orders and Memorandums',
 u'Facebook Inc',
 u'Federal Bureau of Investigation',
 u'Fines (Penalties)',
 u'Flynn Michael T',
 u'Ganim Joseph P
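To make the clean-and-dedupe step easier to follow, here it is split into two steps on a shortened, made-up string in the same shape as the keywords above. The sketch assumes that a comma followed by a space only ever appears inside a "Last, First" name, and that keywords themselves are separated by a bare comma.

```python
# shortened, made-up sample in the same shape as the keywords string
sample = "Trump, Donald J,Russia,Comey, James B,Russia,Trump, Donald J"

# step 1: remove the comma inside "Last, First" names, leaving "," only
# as the separator between keywords
flattened = sample.replace(", ", " ")

# step 2: split on the remaining commas; set() drops the duplicates
sample_keywords = set(flattened.split(","))

print(sorted(sample_keywords))
```

If a feed ever put a space after its separators, this trick would merge neighbouring keywords, so it is worth eyeballing the raw string first.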

In [59]:
#save to csv along with name of newspaper and the timestamp
from datetime import datetime

current_time = str(datetime.now())

In [62]:
# the with-block closes the file even if a write fails
with open("nytimes_test.csv", "w") as wf:
    for keyword in keywords:
        wf.write("nytimes.com,%s,%s\n" % (current_time, keyword))
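Writing rows by hand works here because the cleaned keywords no longer contain commas. If they did, the columns would shift. A hedged alternative is Python's `csv` module, which quotes such values automatically. The sketch below targets Python 3 and writes to an in-memory buffer so it runs without touching disk; the helper name `write_keyword_rows`, the fixed timestamp, and the sample keywords are made up for illustration.

```python
import csv
import io
from datetime import datetime

def write_keyword_rows(out_file, source, timestamp, keywords):
    """Write one (source, timestamp, keyword) row per keyword; return the row count."""
    writer = csv.writer(out_file)
    count = 0
    for keyword in sorted(keywords):
        writer.writerow([source, timestamp, keyword])
        count += 1
    return count

# in-memory stand-in for a file opened for writing
buffer = io.StringIO()
stamp = str(datetime(2017, 5, 18, 9, 30))  # fixed, made-up timestamp

# one sample keyword deliberately contains a comma
n = write_keyword_rows(buffer, "nytimes.com", stamp, {"Russia", "Trump, Donald J"})
print(buffer.getvalue())
```

With a real file in Python 3, open it as `open("nytimes_test.csv", "w", newline="")` so the `csv` module controls the line endings itself.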