# Introduction to Web Scraping

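To begin, we will examine the Reddit page dealing with Machine Learning. Our goal is to scrape the basic information for each post. Before turning to Reddit, we practice the core `requests` and `BeautifulSoup` calls on a Wikipedia page listing the episodes of *21 Jump Street*.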

![](images/reddit.png)

In [1]:
%%HTML
<h1>This is a header</h1>
<p class = 'super-paragraph'>This would be a paragraph. <strong>Strong Words</strong> here. </p>


In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [9]:
url = 'https://en.wikipedia.org/wiki/List_of_21_Jump_Street_episodes'

In [10]:
response = requests.get(url)

In [11]:
response

<Response [200]>

In [12]:
response.text[:1000]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of 21 Jump Street episodes - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_21_Jump_Street_episodes","wgTitle":"List of 21 Jump Street episodes","wgCurRevisionId":844038329,"wgRevisionId":844038329,"wgArticleId":35403829,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles needing additional references from May 2012","All articles needing additional references","21 Jump Street","Lists of American crime television series episodes"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparat

In [13]:
soup = BeautifulSoup(response.text, 'html.parser')

In [15]:
soup.find('h2')

<h2>Contents</h2>

In [16]:
all_h2 = soup.find_all('h2')
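
The two calls above differ in what they return: `find` gives back only the first matching tag (or `None` when nothing matches), while `find_all` gives back a list of every match. A minimal sketch of that difference, run on a small inline snippet adapted from the first cell rather than on the live page; the name `demo_soup` is chosen for this example only:

```python
from bs4 import BeautifulSoup

# Small HTML snippet adapted from the first cell of this notebook
html = """
<h1>This is a header</h1>
<p class='super-paragraph'>This would be a paragraph. <strong>Strong Words</strong> here. </p>
"""

demo_soup = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag
first_p = demo_soup.find('p', {'class': 'super-paragraph'})
# find_all() returns a list of every matching tag
all_strong = demo_soup.find_all('strong')
# find() returns None when the tag is absent (no <h2> in this snippet)
missing = demo_soup.find('h2')

print(first_p.text)
print(len(all_strong))
print(missing)
```

Checking for `None` before reading `.text` avoids an `AttributeError` on pages where the expected tag is absent.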

In [21]:
for header in all_h2[2:7]:
    print(header.text)

Season 1 (1987)[edit]
Season 2 (1987-88)[edit]
Season 3 (1988-89)[edit]
Season 4 (1989-90)[edit]
Season 5 (1990-91)[edit]
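
The `[edit]` suffix in the output above comes from Wikipedia's section-edit links, which `header.text` picks up along with the heading itself. A minimal sketch of stripping it, assuming the marker always appears literally as `[edit]` at the end of the string; `clean_heading` is a hypothetical helper, not part of BeautifulSoup:

```python
def clean_heading(text):
    """Return heading text without a trailing '[edit]' marker."""
    text = text.strip()
    suffix = '[edit]'
    if text.endswith(suffix):
        text = text[:-len(suffix)].rstrip()
    return text


# Heading strings copied from the output above
raw_headings = ['Season 1 (1987)[edit]', 'Season 2 (1987-88)[edit]']
cleaned = [clean_heading(h) for h in raw_headings]
print(cleaned)
```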


In [26]:
table_1 = soup.find('table', {'class': 'wikitable plainrowheaders'})

In [29]:
season_1_titles = table_1.find_all('td', {'class': 'summary'})

In [31]:
for title in season_1_titles:
    print(title.text)

"Pilot"
"America, What a Town"
"Don't Pet the Teacher"
"My Future's So Bright, I Gotta Wear Shades"
"The Worst Night of Your Life"
"Gotta Finish the Riff"
"Bad Influence"
"Blindsided"
"Next Generation"
"Low and Away""Running on Ice"
"16 Blown to 35"
"Mean Streets and Pastel Houses"
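
One line of the output above holds two titles run together (`"Low and Away""Running on Ice"`), presumably because a single table cell lists both. A small sketch of pulling each quoted title out separately with a regular expression, assuming every title is wrapped in straight double quotes and contains none itself; `split_titles` is a hypothetical helper:

```python
import re


def split_titles(cell_text):
    """Return each double-quoted title found in a cell's text."""
    return re.findall(r'"([^"]+)"', cell_text)


print(split_titles('"Low and Away""Running on Ice"'))
print(split_titles('"Pilot"'))
```

Applied to `title.text` inside the loop, this would yield one clean string per episode rather than one per table cell.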


In [32]:
soup.find('p')

<p><i><a href="/wiki/21_Jump_Street" title="21 Jump Street">21 Jump Street</a></i> is an American <a href="/wiki/Police_procedural" title="Police procedural">police procedural</a> <a class="mw-redirect" href="/wiki/Crime_drama" title="Crime drama">crime drama</a> <a class="mw-redirect" href="/wiki/Television_series" title="Television series">television series</a> that aired on the <a href="/wiki/Fox_Broadcasting_Company" title="Fox Broadcasting Company">Fox Network</a> and in first run syndication from April 12, 1987, to April 27, 1991, with a total of 103 <a href="/wiki/Episode" title="Episode">episodes</a>. The series focuses on a squad of youthful-looking undercover police officers investigating crimes in high schools, colleges, and other teenage venues.<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup>
</p>

In [33]:
%%HTML
<a href = 'https://www.reddit.com/r/MachineLearning/'> The Reddit Page </a>

In [None]:
# Point `soup` at the Reddit page; until now it held the Wikipedia page.
# Reddit may throttle the default requests User-Agent; if so, pass a custom one via headers=.
url = 'https://www.reddit.com/r/MachineLearning/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
len(soup.find_all('p'))

In [None]:
len(soup.find_all('h2'))

In [None]:
soup.find('a', {'data-click-id': 'body'})['href']

In [None]:
links = []
for i in soup.find_all('a', {'data-click-id': 'body'}):
    url_link = 'https://www.reddit.com' + i['href']
    links.append(url_link)

In [None]:
links
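
The loop above builds absolute URLs by string concatenation, which works as long as every `href` is a site-relative path. A hedged alternative is `urllib.parse.urljoin` from the standard library, which also leaves already-absolute links untouched. The `href` values below are made-up placeholders, not real Reddit posts:

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.reddit.com'

# Hypothetical href values standing in for i['href'] in the loop above
hrefs = [
    '/r/MachineLearning/comments/abc123/example_post/',
    'https://www.reddit.com/r/MachineLearning/comments/def456/other_post/',
]

# urljoin resolves relative paths against BASE_URL and keeps absolute URLs as-is
links = [urljoin(BASE_URL, href) for href in hrefs]
for link in links:
    print(link)
```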

In [None]:
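links = []
titles = []
bodys = []
for i in soup.find_all('a', {'data-click-id': 'body'}):
    url_link = 'https://www.reddit.com' + i['href']
    links.append(url_link)
    # Fetch and parse each post's own page
    response = requests.get(url_link)
    soup2 = BeautifulSoup(response.text, 'html.parser')
    title = soup2.find('h2')
    body = soup2.find_all('p')
    # Store plain text rather than Tag objects; find() returns None if there is no <h2>
    titles.append(title.text if title is not None else None)
    bodys.append(' '.join(p.text for p in body))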

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame({'links': links, 'title': titles, 'body': bodys})

In [None]:
df.head()

### Wikipedia Exercise

Scraping Wikipedia tables and adding information found through links.

![](images/wiki_table.png)

Problem:

1. Create a dataframe that contains the information displayed on the Wikipedia page "List of 2018 Albums".
2. What is Sub Pop releasing in 2018?
3. Did Drake put anything out?
4. What label is putting out the most music?  Visualize this.

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_2018_albums'

In [None]:
response = requests.get(url)

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
soup.find('table', {'class':'wikitable'})
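
As a starting point for the exercise, one way to turn a table tag into a DataFrame is to walk its rows and collect the text of each cell. The sketch below runs on a tiny inline table rather than the live page; its column names and placeholder values are invented for illustration, and the real Wikipedia tables may use row spans or extra columns that need more handling:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Placeholder HTML standing in for one 'wikitable'; the values are not real data
html = """
<table class="wikitable">
  <tr><th>Artist</th><th>Album</th><th>Label</th></tr>
  <tr><td>Artist A</td><td>Album One</td><td>Label X</td></tr>
  <tr><td>Artist B</td><td>Album Two</td><td>Label Y</td></tr>
  <tr><td>Artist C</td><td>Album Three</td><td>Label X</td></tr>
</table>
"""

demo_soup = BeautifulSoup(html, 'html.parser')
table = demo_soup.find('table', {'class': 'wikitable'})

rows = table.find_all('tr')
# The first row holds the column headers
columns = [th.get_text(strip=True) for th in rows[0].find_all('th')]
# Every later row holds one record
records = [
    [td.get_text(strip=True) for td in row.find_all('td')]
    for row in rows[1:]
]

albums = pd.DataFrame(records, columns=columns)
# Count how many rows each label has, most frequent first
label_counts = albums['Label'].value_counts()
print(albums)
print(label_counts)
```

Calling `label_counts.plot(kind='bar')` on a result like this would give a quick bar chart for the last question.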

### Tweepy

- Sign in to Twitter Apps (https://apps.twitter.com/).
- Create an application and retrieve your `consumer_key`, `consumer_secret`, `access_token`, and `access_token_secret`.
- Follow the example below, filling in your own credentials. For more information, see the [Tweepy documentation](http://tweepy.readthedocs.io/en/v3.5.0/getting_started.html#introduction).
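
Rather than pasting the four credentials into the notebook, one option is to read them from environment variables so they stay out of shared files. This is a sketch, not something Tweepy requires; the variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_twitter_credentials` are assumptions chosen for this example:

```python
import os

# Environment variable names are an assumed convention for this example
CREDENTIAL_VARS = {
    'consumer_key': 'TWITTER_CONSUMER_KEY',
    'consumer_secret': 'TWITTER_CONSUMER_SECRET',
    'access_token': 'TWITTER_ACCESS_TOKEN',
    'access_token_secret': 'TWITTER_ACCESS_TOKEN_SECRET',
}


def load_twitter_credentials(environ=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    environ = os.environ if environ is None else environ
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}
```

The returned values can then be assigned to `consumer_key`, `consumer_secret`, `access_token`, and `access_token_secret` before running the authentication cell.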

In [35]:
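import tweepy

# consumer_key, consumer_secret, access_token, and access_token_secret
# must already be defined with the values from your Twitter application
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)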

In [36]:
# Passing screen_name by keyword makes the lookup explicit
user = api.get_user(screen_name='thrashermag')

In [37]:
for tweet in user.timeline():
    print(tweet.text)

Double the handcuffs equals triple the fun. Or pain. Probably pain. Watch the latest episode Tuesday night at 9 pm… https://t.co/CcyguAD6wC
On Friday, attendees pulled up to the historic Max Fish bar in the Lower East Side of Manhattan from all over the E… https://t.co/hoZvHDxdKW
With a nod to past, but firmly planted in the future, this offering from @dcshoes is everything you could ever want… https://t.co/w4nxVWK4lh
With Chico and Carroll as their guides, watch Real tackle some '90s moves and f--ked up Goofy Boy fashion.… https://t.co/BoaQSv0QSD
It all kicks off with a beautiful block-to-block line and just keeps getting better. Resounding pop, majestic style… https://t.co/hl76dds6tU
Element attacks Sacto with help from Cardiel, Carroll and Chico gives Real an early-90s makeover and the F Troop ba… https://t.co/qWHqhBL2MW
T-Funk blows the doors off an NYC hotspot with a rail combo bordering on the absurd. Hell YES.… https://t.co/PbmnhhaqCO
A solid crew went over to Omar Hassan’s back

In [38]:
print(user.followers_count)

415866


In [39]:
tweets = []
# Request up to 200 recent tweets and keep only their text
for tweet in user.timeline(count = 200):
    tweets.append(tweet.text)

In [40]:
tweets[:5]

['Double the handcuffs equals triple the fun. Or pain. Probably pain. Watch the latest episode Tuesday night at 9 pm… https://t.co/CcyguAD6wC',
 'On Friday, attendees pulled up to the historic Max Fish bar in the Lower East Side of Manhattan from all over the E… https://t.co/hoZvHDxdKW',
 'With a nod to past, but firmly planted in the future, this offering from @dcshoes is everything you could ever want… https://t.co/w4nxVWK4lh',
 "With Chico and Carroll as their guides, watch Real tackle some '90s moves and f--ked up Goofy Boy fashion.… https://t.co/BoaQSv0QSD",
 'It all kicks off with a beautiful block-to-block line and just keeps getting better. Resounding pop, majestic style… https://t.co/hl76dds6tU']
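
Each tweet above is cut off with an ellipsis and ends in a shortened `t.co` link. A small sketch of separating the link from the text with a regular expression, assuming links always take the `https://t.co/...` form seen in this output; `split_tweet` is a hypothetical helper:

```python
import re

# Matches the shortened links seen in the output above
TCO_LINK = re.compile(r'https://t\.co/\w+')


def split_tweet(text):
    """Return (text without t.co links, list of t.co links)."""
    links = TCO_LINK.findall(text)
    cleaned = TCO_LINK.sub('', text).strip()
    return cleaned, links


# Tweet text copied from the output above
sample = ('It all kicks off with a beautiful block-to-block line and just keeps '
          'getting better. Resounding pop, majestic style… https://t.co/hl76dds6tU')
cleaned, links = split_tweet(sample)
print(cleaned)
print(links)
```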

### Open Table

![](images/open_table.png)

Finding restaurants in New York City (https://www.opentable.com/new-york-restaurant-listings). Is there good Indian food on the Upper West Side? Where? What are people saying is good?