### MACNM Computational Workshop No. 3 (Jan. 6 Morning) <br>Web Scraping<br> by Lu Guan


### NOTES
#### Three ways to obtain large-scale data on social media platforms
- log-file from back-end (cooperating with company)
- collecting data through application programming interfaces (APIs)
- direct web scraping (HTML & Selenium web browser automation)

Problems:
- difficult to find cooperation opportunities
- data ethics, data privacy (Facebook - Cambridge Analytica event )
- anti-spider (道高一尺魔高一丈)

#### Api vs scraping

<img src="images/WeChat%20Screenshot_20181219134328.png" width="60%" style="float: left">

#### Via API
 Automatic method to search, read, send tweets<br/>
 Allows third-party applications to provide services<br/>
 Allows us to automatically scrape tweets easily<br/>

#### Steps:
- Get authentication token (log in and get token key)
- Send HTTP request (submit data requests)
- Parsing JSON data (save the responses in structured data tables) <br/>

### OAuth: Open Authentication

- for developers to gain API access
- for Twitter to monitor and interact with third-party platform developers as needed<br/>
<br/>

- Why OAuth? <br/>
Allow access tokens to be issued to third-party clients by an authorization server, with the approval of the resource owner. <br/>
Similar to pin code for credit card.

<p>** Creating an Twitter Developer App**</p>

Twitter Developer App:https://developer.twitter.com/en/apps

- Step 1: apply the app<br/>
<img src="images/api%20apply.png" width="60%" style="float: left">

- Step 2: filling the information requirement<br/>
<img src="images/five%20steps%20in%20twitter%20developer%20app%20application.png" width="40%" style="float: left"><br/>


- Step 3: waiting for approval<br/>
<img src="images/apiunderreview.png" width="30%" style="float: left"><br/>

- Application approved <br/>
<img src="images/apiapp.png" width="60%" style="float: left"><br/>


#### CASE: Twitter API

Three types of Twitter API
- Twitter’s Search API<br/>
Pulling Twitter’s data through a search (keywords, usernames, hashtags, locations, named places, etc.)<br/>
For an individual user, the maximum number of tweets you can receive is the last 3,200 tweets.<br/>
With a specific keyword, you can typically only retrieve the last 5,000 tweets per keyword.<br/>
Request limits: 180 requests in 15 minutes for keyword search; 15 requests in 15 minutes for user profile.<br/>
Please check the request limits here: https://developer.twitter.com/en/docs/basics/rate-limits <br/>
Notes: No guarantee for a representative sample<br/>
<br/>
- Twitter’s Streaming API<br/>
A push of data as tweets happen in near real-time <br/>
Users register a set of criteria (keywords, usernames, locations, named places, etc.) As tweets match the criteria, they are pushed directly to the user.<br/>
<br/>
- Twitter’s Firehose<br/>
Guarantee delivery of 100% of the tweets that match your criteria<br/>
Handled by data providers: GNIP, DataSift, Crimson Hexagon, etc.<br/>
It could be every expensive.<br/>

#### Using Tweepy library to submit data requests

- install tweepy (pip install tweepy) : http://docs.tweepy.org/en/v3.5.0/install.html
- tweepy documentation: http://docs.tweepy.org/en/v3.5.0/

### CODE

#### 1. Log in OAuth tokens and secrets:

In [1]:
import tweepy

In [2]:
with open('API1.txt') as h:
    line=h.readlines()[0]
    lis=line.split()
    consumer_key = lis[0]
    consumer_secret = lis[1]
    access_token = lis[2]
    access_token_secret = lis[3]    

In [3]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

In [4]:
api = tweepy.API(auth)

#### 2. Get user profile:

In [6]:
##get user profile of realDonaldTrump and HillaryClinton
user_name='@HillaryClinton'
user=api.get_user(user_name)
print ("User's screen name is", user.screen_name)
print ("User's location is", user.location)
print ("User has", user.followers_count, 'followers.')
print ("User has", user.friends_count, 'followees')
print ("User posted", user.statuses_count, 'posts')

User's screen name is HillaryClinton
User's location is New York, NY
User has 23814817 followers.
User has 784 followees
User posted 10622 posts


#### 3. Get tweets by keywords:

In [7]:
#documentation: https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html
searchQuery = 'china' # Keyword
new_tweets = api.search(q=searchQuery,count=100,
                        result_type = "recent",
                        lang = "en") 

In [9]:
for tweet in new_tweets:
    print (tweet.id, tweet.user.screen_name, tweet.created_at, tweet.text)

1080845847271624709 S_A_R_A_N_17 2019-01-03 15:18:25 RT @RabiaBaluch: This is a Chinese officer beating a Muslim Uyghur for having a copy of the Quran in his house! Everyone send this out so t…
1080845846801838081 do_minchen 2019-01-03 15:18:25 RT @MOFA_Taiwan: Forced unification is a folly. #Taiwan is a free &amp; democratic country where #HumanRights are protected. Only its 23 millio…
1080845846017687552 abaehr4 2019-01-03 15:18:25 China lands a rover on the dark side of  the moon … and the mind wanders https://t.co/pmiqZs7Mb6
1080845844570624000 Ar_ch_ie 2019-01-03 15:18:25 China isn't playing.
Now the Chinese are building their own Space Station after Uncle Sam barred them doe International Space Station.
1080845844323028992 scruffchick 2019-01-03 15:18:25 RT @realDonaldTrump: The United States Treasury has taken in MANY billions of dollars from the Tariffs we are charging China and other coun…
1080845843543019520 BeHappyandCivil 2019-01-03 15:18:24 RT @washingtonpost: Dow drops 50

#### 4. Save tweets into csv formats:
- Problem 1: CSV format defaultly uses "," to separate columns (e.g., columnA, columnB, ). There might be comma in tweets too.
- Problem 2: There might be line break(换行符) in the text we want to scrap. The line break will challenge the dataframe in csv formats.

<img src="images/tweet_test.png" width="70%" style="float: left"><br/>

In [8]:
import re
with open('tweets_test.csv','a', encoding='utf-8') as f:
    f.write('tid\001user_screen_name\001created_time\001text'+'\n')
    for tweet in new_tweets:
        f.write(str(tweet.id)+'\001')
        f.write(str(tweet.user.screen_name)+'\001')
        f.write(str(tweet.created_at)+'\001')
        text = re.sub('\r','',str(tweet.text))
        text = re.sub('\n','',text)
        f.write(text)
        f.write('\n')

In [30]:
import pandas as pd
df_tweets = pd.read_csv('tweets_test.csv', sep='\001')
df_tweets.head()

Unnamed: 0,tid,user_screen_name,created_time,text
0,1080323675293835265,lauhoho,2019-01-02 04:43:30,RT @hongkong_news: SCMP: University of Hong Ko...
1,1080321999619014662,BB2BHKG,2019-01-02 04:36:50,HKU students and staff want city’s leader remo...
2,1080321240911364097,johnqgoh,2019-01-02 04:33:49,RT @hongkong_news: SCMP: University of Hong Ko...
3,1080320551250345985,hongkong_news,2019-01-02 04:31:05,SCMP: University of Hong Kong students and sta...
4,1080300071684427776,jonrstanley,2019-01-02 03:09:42,RT @SCMPNews: HKU students and staff want city...


#### In-class exercise 1:

- Search the recent 100 tweets about "China" on twitter API
- Extract and save their author, tweet id, tweet time, number of retweets, number of favorite, and text into csv format
- Then calculate the mean value of number of retweets/favorite for these tweets

### NOTES
#### Via web scraping
- HTML & CSS
- Selenium web browser automation

#### Introduction to HTML (HyperText Markup Language)
- HTML is a markup language for describing web documents (web pages).
- w3schools: https://www.w3schools.com/html/default.asp

#### HTML as a tree

<img src="images/html.png" width="30%" style="float: left"><br/>


### Code
#### Hand-on 2: Your first web scrapper!
- BeautifulSoup Library<br/>
- webpage: http://pythonscraping.com/pages/page1.html


In [2]:
from urllib.request import urlopen
html=urlopen("http://pythonscraping.com/pages/page1.html")
html.read()

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'

In [13]:
from bs4 import BeautifulSoup
html=urlopen("http://pythonscraping.com/pages/page1.html")
bsobj=BeautifulSoup(html.read(), 'html.parser')
print (bsobj.html.body.h1)
print (bsobj.body.h1)
print (bsobj.html.body.div)

<h1>An Interesting Title</h1>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>


### Notes

### CSS and HTML

- HTML has limited set of tags
- Cascading Style Sheets (CSS) 
 * Specify style (font, color, border, placement)
 * Based on structure, tags, HTML classses
- HTML specifies structure, CSS layout

### CSS Structure

- Selectors select groups of nodes:
 * Tag name: p {..}
 * Class (share same attributes): .main, p.main {..}
 * ID (share same ID name): #story, p#story {..}


###  Let's play a game!
http://flukeout.github.io/

### Hand-on 3: Advanced web scrapper!

webpage: http://www.pythonscraping.com/pages/warandpeace.html

Requirment: extract and print all text in green

In [15]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html=urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsobj=BeautifulSoup(html.read(),"html.parser")

- bsobj.find(tagName,TagAttribute) returns the first tag with the attribute on the page
- name.get_text() separate the content from the page

In [16]:
name=bsobj.find("span",{"class":"green"})
print (name.get_text())

Anna
Pavlovna Scherer


- bsobj.findAll(tagName,TagAttribute) returns a list of all the tags with the attribute on the page

In [17]:
nameList=bsobj.findAll("span",{"class":"green"})
for name in nameList:
    print (name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


### In-class exercise 2:

For the html page at: http://www.pythonscraping.com/pages/page3.html<br/>
Write a program which use methods in BeautifulSoup to extract and print all text in the first column (Item Title) and third column (Cost) in the table (header excluded), with format as follows:<br/>
<img src="images/exercise1.png" width="30%" style="float: left"><br/>

### In-class exercise 3:

- Scrap the recent posts shown on @realDonaldTrump profile page on Twitter (https://twitter.com/realDonaldTrump)
- Extract the number of comments, number of retweets, number of favorites, and text of his posts
- save them in the csv format that you can analyze them later.

<img src="images/horse.jpg" width="25%" style="float: left"><br/>