# <center>Web Scraping II</center>

References: 
https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/

## 1. Different ways to access data on the web
 - Scrape HTML web pages (covered in Web Scraping I)
 - Download data file directly 
    * data files such as csv, txt
    * pdf files
 - Access data through Application Programming Interface (API), e.g. The Movie DB, Twitter
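
As a minimal sketch of the "download data file directly" route, the snippet below parses CSV text into dictionaries using only the standard library. The function names and the sample rows are made up for illustration, and `download_text` is defined but not called so that the example runs without network access.

```python
import csv
import io
import urllib.request


def download_text(url, encoding="utf-8"):
    """Fetch a URL and return its body decoded as text (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding)


def parse_csv_text(text):
    """Parse CSV text into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical stand-in for downloaded content, so the example runs offline.
sample_text = "city,aqi\nA,42\nB,57\n"
rows = parse_csv_text(sample_text)
print(rows[0]["city"], rows[0]["aqi"])
```

With a real file, one would pass the result of `download_text(some_url)` to `parse_csv_text`. Note that every value comes back as a string and may need converting.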

## 3. Scrape data through API (e.g. tweets)
- Online content providers usually provide APIs for you to access data. Two types of APIs:
   * Python packages: e.g. the tweepy package for the Twitter API
   * REST APIs: e.g. TMDB APIs (https://developers.themoviedb.org/3/getting-started)
- You need to read documentation of APIs to figure out how to access data



### 3.1. Access tweet stream through tweepy package
- **Stream**: transmitting or receiving data as a steady, continuous flow (the opposite is **batch**)

- Event **Listener** (or Event Handler):
  - A procedure or function that waits for an event to occur.
  - Event examples: a user clicking or moving the mouse, pressing a key on the keyboard, an internal timer, or a tweet arriving.
  - A listener is in effect a loop that is programmed to react to an input or signal.
  
- Twitter Terminology (https://support.twitter.com/articles/166337)
  - **@{username}**: mentioning an account {username} in a tweet
  - **\#{topic}**: a hashtag indicates a keyword or topic.
  - **follow**: Subscribing to a Twitter account 
  - **reply**: A response to another person’s Tweet
  - **Retweet (n.)**: A tweet that you forward to your followers
  - **like (n.)**: indicates appreciating a tweet. 
  - **timeline**: A timeline is a real-time stream of tweets. Your Home timeline, for instance, is where you see all the Tweets shared by your friends and other people you follow.
  - **Twitter emoji**: A Twitter emoji is a specific series of letters immediately preceded by the # sign which generates an icon on Twitter such as a national flag or another small image.
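
The listener idea above can be sketched without any Twitter dependency. This is an illustration only, not how tweepy is implemented: the class `CollectingListener` and the function `dispatch` are hypothetical names, and a plain list stands in for the stream. The listener reacts to each event and returns `False` once its time limit has passed, which tells the dispatching loop to stop.

```python
import time


class CollectingListener:
    """Hypothetical listener: stores each event until a time limit passes."""

    def __init__(self, time_limit_seconds):
        self.start_time = time.monotonic()
        self.time_limit_seconds = time_limit_seconds
        self.received = []

    def on_data(self, data):
        # Returning False tells the dispatcher to stop sending events.
        if time.monotonic() - self.start_time >= self.time_limit_seconds:
            return False
        self.received.append(data)
        return True


def dispatch(events, listener):
    """Feed events to the listener one at a time, as a stream would."""
    for event in events:
        if not listener.on_data(event):
            break


listener = CollectingListener(time_limit_seconds=60)
dispatch(["tweet 1", "tweet 2", "tweet 3"], listener)
print(listener.received)
```

The same convention (return `True` to keep listening, `False` to stop) is what the tweepy listener in Exercise 3.1.1 relies on.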


In [None]:
# Exercise 3.1.1 define a listener which listens to tweets in real time


import tweepy
# to install tweepy, use: pip install tweepy

# import twitter authentication module
from tweepy import OAuthHandler

# import tweepy stream module
from tweepy import Stream

# import stream listener
# (this code targets tweepy 3.x; StreamListener was merged into Stream in tweepy 4.0)
from tweepy.streaming import StreamListener

# import the python package to handle datetime
import datetime

# set your keys to access tweets 
consumer_key = 'your key here'
consumer_secret = 'your key here'
access_token = 'your key here'
access_secret = 'your key here'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
 
# Customize a tweet event listener 
# inherited from StreamListener provided by tweepy
# This listener reacts when a tweet arrives or an error happens

class MyListener(StreamListener):
    
    # constructor
    def __init__(self, output_file, time_limit):
        
        # attribute to get listener start time
        self.start_time=datetime.datetime.now()
        
        # attribute to set time limit for listening (in minutes)
        self.time_limit=time_limit
        
        # attribute to set the output file
        self.output_file=output_file
        
        # call the superclass's constructor
        StreamListener.__init__(self)
    
    # on_data is invoked when a tweet comes in
    # override this method inherited from the superclass
    # when a tweet comes in, the tweet is passed as "data"
    def on_data(self, data):
        
        # get running time
        running_time=datetime.datetime.now()-self.start_time
        print(running_time)
        
        # continue only while running time (in minutes) is under time_limit
        # total_seconds() is used because .seconds ignores whole days
        if running_time.total_seconds()/60.0<self.time_limit:
            
            # ***Exception handling*** 
            # If an error is encountered, 
            # a try block code execution is stopped and transferred
            # down to the except block. 
            # If there is no error, "except" block is ignored
            try:
                # open file in "append" mode
                with open(self.output_file, 'a') as f:
                    # Write tweet string (in JSON format) into a file
                    f.write(data)
                    
                    # continue listening
                    return True
                
            # if an error is encountered
            # print out the error message and continue listening
            # (Exception, not BaseException, so Ctrl-C can still stop the listener)
            
            except Exception as e:
                print("Error on_data:", str(e))
                
                # if return "True", the listener continues
                return True
            
        else:  # timeout, return False to stop the listener
            print("time out")
            return False
 
    # on_error is invoked if there is anything wrong with the listener
    # error status is passed to this method
    def on_error(self, status):
        print(status)
        # continue listening by "return True"
        return True

In [None]:
# Exercise 3.1.2 Collect tweets with specific topics within 1 minute

# initiate an instance of MyListener 
tweet_listener=MyListener(output_file="python.txt",time_limit=1)

# start a stream instance using authentication and the listener
twitter_stream = Stream(auth, tweet_listener)
# filtering tweets by topics
twitter_stream.filter(track=['#blockchain', '#bitcoin','#cryptocurrency','#smartcontract'])

In [None]:
# Exercise 3.1.3. Collect 1% sample of all tweets within 30 seconds

tweet_listener=MyListener(output_file="tweets.txt",time_limit=0.5)
twitter_stream = Stream(auth, tweet_listener)
twitter_stream.sample()


In [None]:
# Exercise 3.1.4. Collect historical tweets for a topic

searched_tweets = []
tweets=[]
max_tweets=500
last_id = -1

query='#blockchain'

api = tweepy.API(auth)


while len(searched_tweets) < max_tweets:
    count = max_tweets - len(searched_tweets)
    try:
        # each search returns at most 100 results, even if
        # count is set larger than 100
        # max_id restricts results to tweets with ids at or below that value,
        # so each request picks up where the previous batch ended
        # query can be a list of hashtags
        # search api returns tweets sorted by time in descending order
        new_tweets = api.search(q=query, count=count, max_id=str(last_id - 1))

        if not new_tweets:
            break
        # append new batch into list    
        searched_tweets.extend(new_tweets)
        # only store a list of (date, tweet text) 
        tweets+=[(item.created_at, item.text) for item in new_tweets]
        
        # remember the id of the last (oldest) tweet in the batch
        last_id = new_tweets[-1].id

    except tweepy.TweepError as e:
        # depending on TweepError.code, one may want to retry or wait
        # to keep things simple, we will give up on an error
        break

In [None]:
tweets
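
The `max_id` paging used in Exercise 3.1.4 can be sketched offline. In the snippet below, `fake_search` is a hypothetical stand-in for a search API (it is not a tweepy call): it returns integer ids newest first and caps each page at 100, which mirrors the behavior described in the exercise's comments. The ids and the total of 250 are made up.

```python
def fake_search(max_id=None, count=100):
    """Hypothetical search API: ids at or below max_id, newest first."""
    all_ids = list(range(250, 0, -1))  # ids 250, 249, ..., 1
    if max_id is not None:
        all_ids = [i for i in all_ids if i <= max_id]
    return all_ids[:min(count, 100)]  # the server caps each page at 100


def collect(max_tweets):
    """Page backwards through the results until max_tweets are collected."""
    collected = []
    max_id = None
    while len(collected) < max_tweets:
        batch = fake_search(max_id=max_id, count=max_tweets - len(collected))
        if not batch:
            break  # nothing older is available
        collected.extend(batch)
        # The last item is the oldest; ask only for strictly older ones next.
        max_id = batch[-1] - 1
    return collected


ids = collect(220)
print(len(ids), ids[0], ids[-1])
```

Subtracting 1 from the oldest id is what prevents the boundary tweet from being returned twice.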

## 4. JSON (JavaScript Object Notation)

### What is JSON
- A lightweight data-interchange format
- "self-describing" and easy to understand
- the JSON format is text only 
- Language independent: can be read and used as a data format by any programming language

###  JSON Syntax Rules
JSON syntax is derived from JavaScript object notation syntax:
- Data is in name/value pairs
- Data is separated by commas
- Curly braces hold objects
- Square brackets hold arrays

### A JSON file can be easily loaded into a dictionary or a list of dictionaries
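
A small sketch of the syntax rules and of how JSON maps onto Python types, using the standard `json` module. The sample strings are made up for illustration, and an in-memory `io.StringIO` buffer stands in for a file.

```python
import io
import json

# A JSON object (curly braces) with name/value pairs and an array (square brackets).
text = '{"user": "alice", "hashtags": ["python", "json"], "retweets": 3}'

record = json.loads(text)      # loads: parse from a string
print(type(record).__name__)
print(record["hashtags"][0])

# A JSON array of objects becomes a list of dictionaries.
records = json.loads('[{"id": 1}, {"id": 2}]')

# dump/load work on file-like objects; dumps/loads work on strings.
buffer = io.StringIO()
json.dump(records, buffer)
buffer.seek(0)
same_records = json.load(buffer)
```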

In [None]:
# Exercise 4.1. Read/write JSON 
import json
tweets=[]

with open('python1.txt', 'r') as f:
    # each line is one tweet string in JSON format
    for line in f: 
        
        # load a string in JSON format as Python dictionary
        tweet = json.loads(line) 
              
        tweets.append(tweet)

# write the whole list back to JSON
with open("all_tweets.json", 'w') as f:
    json.dump(tweets, f)

# to load the whole list
# note: json.load reads from a file object, json.loads parses a string
with open("all_tweets.json", 'r') as f:
    tweets=json.load(f)

# open "all_tweets.json" and "python.txt" to see the difference

In [None]:
# Exercise 4.2. Investigating a tweet

# A tweet is a dictionary
# Some values are dictionaries too!
# for details, check https://dev.twitter.com/overview/api/tweets

print("# of tweets:", len(tweets))
first_tweet=tweets[0]

print("\nprint out first tweet nicely:")
print(json.dumps(first_tweet, indent=4))   

# note the difference between "json.dumps()" and "json.dump()"


In [None]:
# Exercise 4.3. Investigating attributes of a tweet

print("tweet text:", first_tweet["text"] )
# get all hashtags (i.e. topics) in this tweet
      
topics=[hashtag["text"] for hashtag in first_tweet["entities"]["hashtags"]]
print("\ntopics:", topics)

# get all user_mentions in this tweet
user_mentions=[user_mention["screen_name"] for user_mention in first_tweet["entities"]["user_mentions"]]
print("\nusers mentioned:", user_mentions)

In [None]:
# Exercise 4.4. count tweets per topic

# get the number of tweets for each topic as a dictionary
count_per_topic={}

# loop through each tweet in the list
for t in tweets:
    # check if "entities" exist and "hashtags" exist in "entities"
    if "entities" in t and "hashtags" in t["entities"]:
        # get all topics as a set (unique topics, lowercased)
        topics={hashtag["text"].lower() for hashtag in t["entities"]["hashtags"]}
        
        for topic in topics:
            if topic in count_per_topic:
                count_per_topic[topic]+=1
            else:
                count_per_topic[topic]=1
        
print(count_per_topic)

# convert the dictionary into a list of tuples (topic, count)
topic_count_list=count_per_topic.items()

# sort the list by count in descending order
sorted_topics=sorted(topic_count_list, key=lambda item:-item[1])
print(sorted_topics)

# get top 20 topics
top_20_topics=sorted_topics[0:20]
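
As an alternative to the explicit dictionary loop, the same per-topic counts can be sketched with `collections.Counter`. The `sample_tweets` list below is hypothetical and contains only the fields the counting logic reads.

```python
from collections import Counter

# Hypothetical tweets, reduced to the fields the counting loop reads.
sample_tweets = [
    {"entities": {"hashtags": [{"text": "Bitcoin"}, {"text": "bitcoin"}]}},
    {"entities": {"hashtags": [{"text": "Blockchain"}, {"text": "Bitcoin"}]}},
    {"text": "a tweet without entities"},
]

topic_counter = Counter()
for t in sample_tweets:
    hashtags = t.get("entities", {}).get("hashtags", [])
    # A set counts each topic at most once per tweet.
    topic_counter.update({h["text"].lower() for h in hashtags})

# most_common sorts by count in descending order.
top_topics = topic_counter.most_common(20)
print(top_topics)
```

`Counter` is a dictionary subclass, so it can be passed wherever `count_per_topic` is used.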



In [None]:
# Exercise 4.5. Visualize data

import pandas as pd
from matplotlib import pyplot as plt

df=pd.DataFrame.from_dict(count_per_topic, orient='index')
df.columns=['count']
df

df.sort_values(by='count', ascending=False).iloc[0:10].plot.bar();
plt.show()

In [None]:
# Exercise 4.6. Word Cloud

import matplotlib.pyplot as plt
from wordcloud import WordCloud

wordcloud = WordCloud(background_color="white")
wordcloud.generate_from_frequencies(frequencies=count_per_topic)
plt.figure(figsize=(10,6))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()


## 5. Scrape data by REST APIs (TMDB)
- A REST API is a web service that uses HTTP requests to GET, PUT, POST and DELETE data
- requests package can be used for REST API calls
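
For a GET request, parameters travel in the URL as a query string. The sketch below builds such a URL with the standard library's `urllib` rather than `requests`, so it runs without extra packages. `build_url` and `get_json` are hypothetical helper names, `YOUR_API_KEY` is a placeholder, and `get_json` is defined but not called because it needs network access and a valid key.

```python
import json
import urllib.parse
import urllib.request


def build_url(base_url, params):
    """Attach GET parameters after "?", separated by "&", with URL encoding."""
    return base_url + "?" + urllib.parse.urlencode(params)


def get_json(url):
    """Send a GET request and parse the JSON body (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


url = build_url(
    "http://api.themoviedb.org/3/search/movie",
    {"query": "finding dory", "api_key": "YOUR_API_KEY"},
)
print(url)
```

Encoding matters here because the title contains a space. With `requests`, passing the same dictionary as `params=` to `requests.get` handles the encoding.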

In [None]:
# Exercise 5.1. search movies by name

import requests
import json

title='finding dory'

# Search API: http://api.themoviedb.org/3/search/movie
# has two parameters: query string and api_key
# For the get methods, parameters are attached to API URL after a "?"
# Parameters are separated by "&"

# to test, apply for an api key and use the key here
url="http://api.themoviedb.org/3/search/movie?query="+title+"&api_key=<your api key>"

# invoke the API 
r = requests.get(url)

# if the API call returns a successful response
if r.status_code==200:
    
    # This API call returns a json object
    # r.json() gives the json object
    
    if "results" in r.json():
        results=r.json()["results"]
        print (json.dumps(results, indent=4))


## 6. Scrape pdf files
- A number of Python libraries can handle PDFs (https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167)
- Some popular libraries:
  * PyPDF2: supports both Python 2 and Python 3
    * To install, issue: pip install pypdf2
  * PDFMiner: supports Python 2 only (the pdfminer.six fork supports Python 3)
  * PDFQuery
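
Text extracted from a PDF usually needs some clean-up, whichever library produced it. The sketch below shows one possible clean-up pass using only regular expressions. The function name `clean_page_text` and the sample string are made up, and the rules are assumptions that will not suit every document.

```python
import re


def clean_page_text(text):
    """Tidy raw text extracted from one PDF page."""
    # Rejoin words hyphenated across a line break: "pollu-\ntion" -> "pollution".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse any run of whitespace (including newlines) into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


raw = "Air  quality\nindex and pollu-\ntion levels \n"
print(clean_page_text(raw))
```

One caveat: a genuinely hyphenated compound that happens to break at a line end would also be joined, so the first rule may need adjusting per document.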


In [None]:
# Exercise 6.1. Download and parse a pdf file

import requests
from PyPDF2 import  PdfFileReader

# First download the pdf file
pages=[]
r=requests.get("http://ciese.org/media/live/curriculum/airproj/docs/aqiworksheet.pdf")
if r.status_code==200:
    # write the content to a local file
    with open("some_pdf.pdf","wb") as f:
        f.write(r.content)

# Parse the pdf content. It may need further clean-up depending on the content
# Note: these names are from PyPDF2 1.x/2.x; PyPDF2 3.0 renamed them to
# PdfReader, reader.pages and page.extract_text()
with open("some_pdf.pdf", "rb") as f:
    pdfreader = PdfFileReader(f)

    # loop through each page of the pdf file
    for i in range(pdfreader.getNumPages()):
        # get each page
        page=pdfreader.getPage(i)
        # extract text
        page_content=page.extractText()
        
        # append the text to the list
        pages.append(page_content)
    
print(pages)