# Introduction to Web Scraping

To begin, we will examine the Reddit page for Machine Learning (r/MachineLearning). Our goal is to scrape the basic information for each post: its link, title, and body text.

![](images/reddit.png)

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [2]:
url = 'https://www.reddit.com/r/MachineLearning/'

In [3]:
response = requests.get(url)

In [4]:
response

<Response [429]>

In [5]:
response.text[:100]

'\n<!doctype html>\n<html>\n  <head>\n    <title>Too Many Requests</title>\n    <style>\n      body {\n     '
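The `429` status above means the server is rate limiting the request, so the HTML that came back is an error page rather than the post listing. Below is a minimal sketch of one way to cope, assuming that a descriptive `User-Agent` header and a short back-off are enough for the site to answer (this is not guaranteed). The `fetch` helper, its parameters, and the header value are hypothetical names introduced for illustration.

```python
import time


def fetch(session, url, retries=3, wait=2.0):
    """GET url through session, pausing and retrying while the status is 429.

    Hypothetical helper: session is expected to be a requests.Session
    (any object with a compatible .get method works).
    """
    response = None
    for attempt in range(retries):
        response = session.get(url, timeout=10)
        if response.status_code != 429:
            return response
        # Still rate limited: wait a little longer before each new attempt.
        if attempt < retries - 1:
            time.sleep(wait * (attempt + 1))
    return response


# Usage (not run here, since it needs network access):
# import requests
# session = requests.Session()
# session.headers.update({'User-Agent': 'scraping-tutorial/0.1'})  # assumed value
# response = fetch(session, 'https://www.reddit.com/r/MachineLearning/')
# response.status_code
```

Passing the session in keeps the retry logic separate from the HTTP client, so the same helper can be reused for the per-post requests later in the notebook.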

In [6]:
soup = BeautifulSoup(response.text, 'html.parser')

In [7]:
soup.find('h2')

In [8]:
soup.find('h2').text

AttributeError: 'NoneType' object has no attribute 'text'
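The `AttributeError` arises because `soup.find` returns `None` when no matching tag exists, which is the case for the error page fetched above. Here is a small sketch of guarding against that. `text_or_none` is a hypothetical helper, and the HTML strings are made-up stand-ins for real pages.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, name):
    """Return the stripped text of the first <name> tag, or None if absent."""
    tag = soup.find(name)
    return tag.get_text(strip=True) if tag is not None else None


# Made-up HTML: one page with an <h2>, one without.
with_heading = BeautifulSoup('<html><h2> A post title </h2></html>', 'html.parser')
without_heading = BeautifulSoup('<html><title>Too Many Requests</title></html>', 'html.parser')

print(text_or_none(with_heading, 'h2'))
print(text_or_none(without_heading, 'h2'))
```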

In [None]:
soup.find('p')

In [None]:
len(soup.find_all('p'))

In [None]:
len(soup.find_all('h2'))

In [None]:
soup.find('a', {'data-click-id': 'body'})['href']

In [None]:
links = []
for i in soup.find_all('a', {'data-click-id': 'body'}):
    url_link = 'https://www.reddit.com' + i['href']
    links.append(url_link)

In [None]:
links

In [None]:
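links = []
titles = []
bodys = []
for i in soup.find_all('a', {'data-click-id': 'body'}):
    url_link = 'https://www.reddit.com' + i['href']
    links.append(url_link)
    # Fetch the post's own page and parse it
    response = requests.get(url_link)
    soup2 = BeautifulSoup(response.text, 'html.parser')
    title = soup2.find('h2')
    body = soup2.find_all('p')
    # Store plain text rather than Tag objects; find() returns None when no <h2> exists
    titles.append(title.text if title is not None else None)
    bodys.append(' '.join(p.text for p in body))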

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame({'links': links, 'title': titles, 'body': bodys})

In [None]:
df.head()

### Wikipedia Exercise

Scraping Wikipedia tables and adding information found through links.

![](images/wiki_table.png)

Problem:

1. Create a DataFrame that contains the information displayed on the Wikipedia page "List of 2018 albums".
2. What is Sub Pop releasing in 2018?
3. Did Drake put anything out?
4. Which label is putting out the most music? Visualize this.

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_2018_albums'

In [None]:
response = requests.get(url)

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
soup.find('table', {'class':'wikitable'})
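`soup.find` returns only the first `wikitable` on the page, as a block of HTML. One way to turn such a table into a DataFrame is to walk its rows and cells, sketched here on a made-up table so the example runs offline. The column names and rows (`Artist A`, `Label X`, and so on) are placeholders, not data from the Wikipedia page, and `table_to_frame` is a hypothetical helper. The sketch assumes a single header row and no `rowspan` or `colspan` cells, which real Wikipedia tables often do contain.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Placeholder HTML standing in for one table from the page.
html = """
<table class="wikitable">
  <tr><th>Release date</th><th>Artist</th><th>Album</th><th>Label</th></tr>
  <tr><td>January 1</td><td>Artist A</td><td>Album One</td><td>Label X</td></tr>
  <tr><td>January 2</td><td>Artist B</td><td>Album Two</td><td>Label Y</td></tr>
  <tr><td>January 3</td><td>Artist C</td><td>Album Three</td><td>Label X</td></tr>
</table>
"""


def table_to_frame(table):
    """Convert a simple <table> Tag (one header row, no spans) to a DataFrame."""
    rows = table.find_all('tr')
    header = [th.get_text(strip=True) for th in rows[0].find_all('th')]
    records = [
        [td.get_text(strip=True) for td in row.find_all('td')]
        for row in rows[1:]
    ]
    return pd.DataFrame(records, columns=header)


sample_soup = BeautifulSoup(html, 'html.parser')
albums = table_to_frame(sample_soup.find('table', {'class': 'wikitable'}))

# Releases per label, most frequent first.
label_counts = albums['Label'].value_counts()
print(albums)
print(label_counts)
```

On the real page, `soup.find_all('table', {'class': 'wikitable'})` returns every matching table, so the helper could be applied to each one and the results combined with `pd.concat`. Calling `label_counts.plot(kind='bar')` would then draw the comparison asked for in question 4.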

### Tweepy

- Sign in to Twitter apps (https://apps.twitter.com/).
- Create an application and retrieve its `consumer_key`, `consumer_secret`, `access_token`, and `access_token_secret`.
- Follow the example below, filling in your own credentials. For more information, see the [Tweepy documentation](http://tweepy.readthedocs.io/en/v3.5.0/getting_started.html#introduction).

In [39]:
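# Replace the placeholder strings with the values from your Twitter application
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'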
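Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. A sketch of one alternative is to read the four values from environment variables. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the `load_credentials` helper are hypothetical; nothing in Tweepy requires these names.

```python
import os

# Hypothetical environment variable names; set them in your shell beforehand.
CREDENTIAL_VARS = {
    'consumer_key': 'TWITTER_CONSUMER_KEY',
    'consumer_secret': 'TWITTER_CONSUMER_SECRET',
    'access_token': 'TWITTER_ACCESS_TOKEN',
    'access_token_secret': 'TWITTER_ACCESS_TOKEN_SECRET',
}


def load_credentials(environ=os.environ):
    """Return a dict of the four credentials, raising KeyError if any is missing."""
    missing = [var for var in CREDENTIAL_VARS.values() if var not in environ]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}


# Usage (requires the variables to be set):
# creds = load_credentials()
# consumer_key = creds['consumer_key']
# consumer_secret = creds['consumer_secret']
# access_token = creds['access_token']
# access_token_secret = creds['access_token_secret']
```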

In [None]:
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

In [None]:
user = api.get_user('thrashermag')

In [None]:
for tweet in user.timeline():
    print(tweet.text)

In [None]:
print(user.followers_count)

In [None]:
tweets = []
for tweet in user.timeline(count=200):
    tweets.append(tweet.text)

In [None]:
tweets[:5]

### Open Table

![](images/open_table.png)

Finding restaurants in New York City (https://www.opentable.com/new-york-restaurant-listings): Is there good Indian food on the Upper West Side? Where? What do people say is good?