# <center>Web Scraping III -- API </center>

References: 
https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/

## 1. Scrape data through API (e.g. tweets)
- Online content providers usually provide APIs for you to access data. Two types of APIs:
   * Python packages: e.g. the tweepy package for Twitter
   * REST APIs: e.g. the OMDb API (http://www.omdbapi.com/) or the TMDb API (https://developers.themoviedb.org/3/getting-started)
- Read each API's documentation to figure out how to access its data

## 2. Access tweet stream (e.g. real-time tweets) through tweepy package
- **Stream**: transmitting or receiving data as a steady, continuous flow (the opposite is **batch**)

- Event **Listener** (or Event Handler): 
  - A procedure or function that waits for an event to occur.
  - Event examples: a user clicking or moving the mouse, pressing a key on the keyboard, an internal timer, or a tweet arriving.
  - A listener is in effect a loop that is programmed to react to an input or signal.
  
- Twitter Terminology (https://support.twitter.com/articles/166337)
  - **@{username}**: mentions the account {username} in a tweet
  - **\#{topic}**: a hashtag indicates a keyword or topic.
  - **follow**: Subscribing to a Twitter account 
  - **reply**: A response to another person’s Tweet
  - **Retweet (n.)**: A tweet that you forward to your followers
  - **like (n.)**: indicates appreciating a tweet. 
  - **timeline**: A timeline is a real-time stream of tweets. Your Home timeline, for instance, is where you see all the Tweets shared by your friends and other people you follow.
  - **Twitter emoji**: A Twitter emoji is a specific series of letters immediately preceded by the # sign which generates an icon on Twitter such as a national flag or another small image.
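
The listener idea above can be sketched without any Twitter dependency. This is a minimal illustration, not tweepy's implementation: `CollectingListener`, `run_stream`, and the injectable `clock` argument are hypothetical names introduced here. A loop feeds events to the listener one by one, and the listener returns `False` once its time limit is reached.

```python
import datetime


class CollectingListener:
    """Hypothetical listener: stores events until a time limit (minutes) passes."""

    def __init__(self, time_limit, clock=datetime.datetime.now):
        self.clock = clock              # injectable clock (an assumption, for testing)
        self.start_time = clock()
        self.time_limit = time_limit    # in minutes
        self.received = []

    def on_data(self, data):
        """React to one event; return False to stop listening."""
        elapsed = self.clock() - self.start_time
        # total_seconds() also counts whole days, unlike the .seconds attribute
        if elapsed.total_seconds() / 60.0 >= self.time_limit:
            return False
        self.received.append(data)
        return True


def run_stream(events, listener):
    """The 'loop' behind a listener: pass each event on until told to stop."""
    for event in events:
        if not listener.on_data(event):
            break


listener = CollectingListener(time_limit=1)
run_stream(["tweet 1", "tweet 2"], listener)
print(listener.received)
```

A real stream never ends by itself, which is why the listener, not the data source, decides when to stop.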


In [None]:
# Exercise 2.1 define a listener which listens to tweets in real time


import tweepy
# to install tweepy, use: pip install tweepy

# import twitter authentication module
from tweepy import OAuthHandler

# import tweepy stream module
from tweepy import Stream

# import stream listener
# (StreamListener exists in tweepy 3.x; tweepy 4.0 merged it into Stream)
from tweepy.streaming import StreamListener

# import the python package to handle datetime
import datetime

# set your keys to access tweets 

consumer_key = 'your consumer key'
consumer_secret = 'consumer secret'
access_token = 'your access token'
access_secret = 'your access secret'
 
    
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
 
# Customize a tweet event listener 
# inherited from StreamListener provided by tweepy
# This listener reacts when a tweet arrives or an error happens
# for details of class StreamListener, see https://github.com/tweepy/tweepy/blob/master/tweepy/streaming.py

class MyListener(StreamListener):
    
    # constructor
    def __init__(self, output_file, time_limit):
        
            # attribute to get listener start time
            self.start_time=datetime.datetime.now()
            
            # attribute to set time limit for listening
            self.time_limit=time_limit
            
            # attribute to set the output file
            self.output_file=output_file
            
            # initiate superclass's constructor
            StreamListener.__init__(self)
    
    # on_data is invoked when a tweet comes in
    # override this method inherited from the superclass
    # when a tweet comes in, the tweet is passed as "data"
    def on_data(self, data):
        
        # get running time
        running_time=datetime.datetime.now()-self.start_time
        print(running_time)
        
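        # check if running time (in minutes) is still under time_limit
        # total_seconds() also counts whole days, unlike the .seconds attribute
        if running_time.total_seconds()/60.0<self.time_limit: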
            
            # ***Exception handling*** 
            # If an error is encountered, 
            # a try block code execution is stopped and transferred
            # down to the except block. 
            # If there is no error, "except" block is ignored
            try:
                # open file in "append" mode
                with open(self.output_file, 'a') as f:
                    # Write tweet string (in JSON format) into a file
                    f.write(data)
                    
                    # continue listening
                    return True
                
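            # if an error is encountered
            # print out the error message and continue listening
            # (catch Exception rather than BaseException so that
            # Ctrl+C / KeyboardInterrupt can still stop the listener)
            except Exception as e:
                print("Error on_data:" , str(e))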
                
                # if return "True", the listener continues
                return True
            
        else:  # timeout, return False to stop the listener
            print("time out")
            return False
 
    # on_error is invoked if there is anything wrong with the listener
    # error status is passed to this method
    def on_error(self, status):
        print(status)
        # continue listening by "return True"
        return True

In [None]:
# Exercise 2.2 Collect tweets with specific topics within 1 minute

# initiate an instance of MyListener 
tweet_listener=MyListener(output_file="python.txt",\
                          time_limit=1)

# start a stream instance using authentication and the listener
twitter_stream = Stream(auth, tweet_listener)
# filtering tweets by topics
twitter_stream.filter(\
track=['#blockchain', '#bitcoin','#cryptocurrency','#smartcontract'])

In [None]:
# Exercise 2.3. Collect 1% sample of all tweets within 30 seconds

tweet_listener=MyListener(output_file="tweets.txt",\
                          time_limit=0.5)
twitter_stream = Stream(auth, tweet_listener)
twitter_stream.sample()


In [None]:
print(str(datetime.datetime.now().date()))

In [None]:
# Exercise 2.4. Collect historical tweets 
# (i.e. tweets posted in the last week) for a topic

tweets=[]
max_tweets=500
last_id = -1

query='#blockchain'

api = tweepy.API(auth)


while len(tweets) < max_tweets:
    count = max_tweets - len(tweets)
    try:
        # each search returns at most 100 results,
        # even if count is set larger than 100
        # max_id restricts results to tweets with id <= max_id
        # query may contain several hashtags (e.g. '#blockchain OR #bitcoin')
        # search api returns tweets sorted by time in descending order
        new_tweets = api.search(q=query, count=count, max_id=str(last_id - 1))
        
        if not new_tweets:
            break

        # extract (id, date, tweet text) of new tweets 
        tweets+=[(item.id, item.created_at, item.text) for item in new_tweets]
        
        # remember the id of the last (oldest) tweet in the batch
        last_id = new_tweets[-1].id

    except tweepy.TweepError as e:
        # depending on TweepError.code, one may want to retry or wait
        # to keep things simple, we will give up on an error
        break

tweets

# you can restrict the search to dates within the last week using
# new_tweets = api.search(q=query, count=count, since="2018-10-01", until="2018-10-02", max_id=str(last_id - 1))
# but it cannot retrieve tweets older than about a week
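
The `max_id` paging used in Exercise 2.4 can be tried offline. The sketch below assumes a hypothetical `fetch_page` function standing in for the search call: it returns ids newest first, keeps only ids less than or equal to `max_id`, and caps each page at `count`. The loop then moves `max_id` just below the oldest id seen so far.

```python
def fetch_page(all_ids, count, max_id=None):
    """Hypothetical stand-in for a search call (not a real API)."""
    eligible = [i for i in sorted(all_ids, reverse=True)
                if max_id is None or i <= max_id]
    return eligible[:count]


def collect(all_ids, max_items, page_size=100):
    """Page backwards through ids until max_items are collected or none are left."""
    collected = []
    max_id = None
    while len(collected) < max_items:
        count = min(page_size, max_items - len(collected))
        page = fetch_page(all_ids, count, max_id)
        if not page:
            break                      # nothing older is available
        collected.extend(page)
        max_id = page[-1] - 1          # next page starts below the oldest id seen
    return collected


ids = collect(range(1, 251), max_items=230)
print(len(ids), ids[0], ids[-1])
```

Subtracting 1 from the oldest id avoids fetching the same tweet twice, since `max_id` is inclusive.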

### 2.5. How to get all past tweets within a period of time? 
- You can always search tweets at https://twitter.com/search and then scrape the results returned
- Note that there is **no authentication needed**!
- Check the GitHub project https://github.com/Jefferson-Henrique/GetOldTweets-python
- Motivated by this project, let's try the following code

In [None]:
# 2.5.1. Scrape past tweets using API 

import requests
from bs4 import BeautifulSoup

# A User-Agent must be set in the HTTP request header
# a user agent is software acting on behalf of a user;
# the header usually identifies the browser being used.
headers = { 'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.'
                              '86 Safari/537.36'}

# specify parameters as a dictionary
payload={"f":"tweets",  # retrieve tweets
         "q":"blockchain since:2017-09-10 until:2017-09-12", # query string
         "max_position":''} # max_position of results (paging purpose)

# send a request with parameters and headers
r=requests.get("https://twitter.com/i/search/timeline",\
              params=payload, headers=headers)

if r.status_code==200:
    result=r.json()
    print(result)
    
    # retrieve the position of last tweet
    min_position = result['min_position']
    
    # get html source code of tweets
    tweets_html = result['items_html']

In [None]:
# 2.5.2. define a function to parse tweets html 
# using BeautifulSoup

def getTweets(tweets_html):
    
    result=[]
    
    soup=BeautifulSoup(tweets_html, "html.parser")

    tweets=soup.select('div.js-stream-tweet')

    for t in tweets:
        username, text, timestamp, tweet_id = '','','',''
        select_user = t.select("span.username.u-dir b")
        if select_user!=[]:
            username=select_user[0].get_text()
    
        select_text = t.select("p.js-tweet-text")
        if select_text!=[]:
            text=select_text[0].get_text()
    
        time_select = t.select("small.time span.js-short-timestamp")
        if time_select!=[]:
            timestamp=int(time_select[0]["data-time"])
            timestamp=datetime.datetime.fromtimestamp(timestamp)
    
        tweet_id = t["data-tweet-id"]
    
        #print(username, text, timestamp, tweet_id, "\n")
        
        result.append((username, text, timestamp, tweet_id))
        
    return result

In [None]:
# 2.5.3. Parse tweets using the function

tweets=getTweets(tweets_html)
print("total tweets:", len(tweets))
print("first tweet: ",tweets[0])

In [None]:
# 2.5.4. What if we want to return more?

# set the max_position to 
# the min_position of last search
# keep "since:" and "until:" inside the query string, as in 2.5.1
payload={"f":"tweets",\
         "q":"blockchain since:2017-09-10 until:2017-09-12",\
         "max_position":min_position} 

# search again
r=requests.get("https://twitter.com/i/search/timeline",\
              params=payload, headers=headers)
#https://twitter.com/i/search/timeline?f=tweets&q=%20blockchain%20since%3A2017-09-10%20until%3A2017-09-12&src=typd&max_position=
if r.status_code==200:
    result=r.json()
    min_position = result['min_position']
    tweets_html = result['items_html']
    
    print(min_position)
    
    tweets=getTweets(tweets_html)
    print("total tweets:", len(tweets))
    print("first tweet: ",tweets[0])
    
# You can use a loop to keep sending requests
# until all tweets satisfying your criteria
# have been fetched.
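
The loop mentioned above might look like the following sketch. It assumes a hypothetical `fake_fetch(cursor)` in place of the HTTP request, returning a page of items together with the position to use for the next request (the role played by `min_position`). A request cap is an added safeguard, not part of the original exercise.

```python
def fake_fetch(cursor):
    """Hypothetical stand-in for one request: returns (items, next_cursor)."""
    pages = {
        "": (["t1", "t2"], "c1"),
        "c1": (["t3"], "c2"),
        "c2": ([], "c2"),          # no more results
    }
    return pages[cursor]


def fetch_all(fetch_page, max_requests=50):
    """Follow the cursor until a page is empty or the cursor stops moving."""
    cursor = ""
    items = []
    for _ in range(max_requests):
        page, next_cursor = fetch_page(cursor)
        items.extend(page)
        if not page or next_cursor == cursor:
            break
        cursor = next_cursor
    return items


print(fetch_all(fake_fetch))
```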

## 3. JSON (JavaScript Object Notation)

### What is JSON
- A lightweight data-interchange format
- "self-describing" and easy to understand
- the JSON format is text only 
- Language independent: can be read and used as a data format by any programming language

###  JSON Syntax Rules
JSON syntax is derived from JavaScript object notation syntax:
- Data is in name/value pairs
- Data is separated by commas
- Curly braces hold objects
- Square brackets hold arrays

### A JSON file can be easily loaded into 
- **a dictionary** or 
- a **list of dictionaries**
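
As a small illustration of the rules above (the sample strings are made up for this sketch), a JSON object loads as a dictionary and a JSON array of objects loads as a list of dictionaries. JSON `false` and `null` become Python `False` and `None`.

```python
import json

# an object: name/value pairs inside curly braces
obj_text = '{"user": "alice", "retweets": 3, "verified": false, "place": null}'

# an array of objects: square brackets hold the list
arr_text = '[{"id": 1, "tags": ["python", "api"]}, {"id": 2, "tags": []}]'

record = json.loads(obj_text)     # -> dict
records = json.loads(arr_text)    # -> list of dicts

print(type(record).__name__, type(records).__name__)
print(record["verified"], record["place"])
print(records[0]["tags"][1])
```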

In [None]:
# Exercise 3.1. Read/write JSON 
import json
tweets=[]

with open('python.txt', 'r') as f:
    # each line is one tweet string in JSON format
    for line in f: 
        
        # load a string in JSON format as Python dictionary
        tweet = json.loads(line) 
              
        tweets.append(tweet)

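# write the whole list back to JSON
# (a "with" block closes the file, so all data is flushed to disk)
with open("all_tweets.json",'w') as f:
    json.dump(tweets, f)

# to load the whole list
# pay attention to json.load (reads a file) and json.loads (reads a string)
with open("all_tweets.json",'r') as f:
    tweets=json.load(f)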

# open "all_tweets.json" and "python.txt" to see the difference
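
The difference between the two files can also be seen in memory. This sketch uses made-up tweets and an `io.StringIO` buffer in place of real files: one layout stores one JSON document per line (like `python.txt`), the other stores a single JSON array (like `all_tweets.json`).

```python
import io
import json

tweets = [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]

# Layout 1: one JSON document per line, parsed line by line with json.loads
lines_text = "\n".join(json.dumps(t) for t in tweets) + "\n"
from_lines = [json.loads(line) for line in lines_text.splitlines() if line.strip()]

# Layout 2: a single JSON array, written with json.dump and read with json.load
buffer = io.StringIO()
json.dump(tweets, buffer)
buffer.seek(0)
from_array = json.load(buffer)

print(from_lines == from_array)

# Parsing the line-per-tweet text as ONE document fails
try:
    json.loads(lines_text)
except json.JSONDecodeError as e:
    print("not a single JSON document:", e.msg)
```

The functions ending in `s` work on strings; the ones without it work on file-like objects.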

In [None]:
# Exercise 3.2. Investigating a tweet

# A tweet is a dictionary
# Some values are dictionaries too!
# for details, check https://dev.twitter.com/overview/api/tweets

print("# of tweets:", len(tweets))
first_tweet=tweets[0]

print("\nprint out first tweet nicely:")
print(json.dumps(first_tweet, indent=4))   

# note the difference: json.dumps() returns a string, json.dump() writes to a file


In [None]:
# Exercise 3.3. Investigating attributes of a tweet

print("tweet text:", first_tweet["text"] )
# get all hashtags (i.e. topics) in this tweet
      
topics=[hashtag["text"] for hashtag in first_tweet["entities"]["hashtags"]]
print("\ntopics:", topics)

# get all user_mentions in this tweet
user_mentions=[user_mention["screen_name"] for user_mention in first_tweet["entities"]["user_mentions"]]
print("\nusers mentioned:", user_mentions)

In [None]:
# Exercise 3.4. count tweets per topic

import pandas as pd

# get the number of tweets for each topic as a dictionary
count_per_topic={}

# loop through each tweet in the list
for t in tweets:
    # check if "entities" exist and "hashtags" exist in "entities"
    if "entities" in t and "hashtags" in t["entities"]:
        
        # get all topics as a set (unique, lower-cased topics)
        topics=set([hashtag["text"].lower() for hashtag in t["entities"]["hashtags"]])
        
        for topic in topics:
            if topic in count_per_topic:
                count_per_topic[topic]+=1
            else:
                count_per_topic[topic]=1
        
print(count_per_topic)


In [None]:
# Exercise 3.5. Visualize data

import pandas as pd
from matplotlib import pyplot as plt

# convert dictionary to dataframe
# argument orient: {‘columns’, ‘index’} indicates 
# whether the keys are used as rows or columns

df=pd.DataFrame.from_dict(count_per_topic, orient="index" )
df.columns=["count"]

df
# how to get top 10 topics?

# how to plot top 10 topics?
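
One possible answer to the top-10 question, sketched with the standard library only. The `sample_tweets` list and the `count_topics` helper are made up for illustration; `collections.Counter` does the counting and `most_common(10)` returns the ten most frequent topics.

```python
from collections import Counter

# made-up tweets with the same nesting as real ones
sample_tweets = [
    {"entities": {"hashtags": [{"text": "Bitcoin"}, {"text": "bitcoin"}]}},
    {"entities": {"hashtags": [{"text": "blockchain"}, {"text": "BITCOIN"}]}},
    {"text": "no entities here"},
]


def count_topics(tweets):
    """Count each lower-cased hashtag at most once per tweet."""
    counts = Counter()
    for t in tweets:
        hashtags = t.get("entities", {}).get("hashtags", [])
        counts.update({h["text"].lower() for h in hashtags})
    return counts


top = count_topics(sample_tweets).most_common(10)
print(top)
```

With the DataFrame from this exercise, `df.sort_values("count", ascending=False).head(10)` gives the same ranking, and adding `.plot(kind="bar")` would be one way to plot it.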


In [None]:
# Exercise 3.6. Word Cloud

import matplotlib.pyplot as plt
from wordcloud import WordCloud

wordcloud = WordCloud(background_color="white")
wordcloud.generate_from_frequencies(frequencies=count_per_topic)
plt.figure(figsize=(8,8))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()


## 4. Scrape data by REST APIs (e.g. OMDB API)
- A REST API is a web service that uses HTTP requests to GET, PUT, POST and DELETE data
- Experiment:
    - Get an API key here: http://www.omdbapi.com/apikey.aspx and follow the instructions to activate the key
    - Use API, e.g. **http://www.omdbapi.com/<font color="blue"><b>?</b></font>t=Rogue+One<font color="blue"><b>&</b></font>plot=full<font color="blue"><b>&</b></font>r=json<font color="blue"><b>&</b></font>apikey={your api key}**, where
        - t=Rogue+One: specify the movie title 
        - plot=full: return full plot
        - r=json: result is in json format
        - apikey: use your api key 
    - Note the format of URL:
        - API endpoint: http://www.omdbapi.com/ 
        - parameters appear in the URL after the question mark (<font color="blue"><b>?</b></font>) after the endpoint
        - all parameters are concatenated by <font color="blue"><b>"&"</b></font>  
    - You can directly paste the above API to your browser
    - Or issue API calls using requests
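
The URL format described above can be checked without calling the service. This sketch uses only `urllib.parse` from the standard library; `YOUR_KEY` is a placeholder, not a working key. `urlencode` joins the parameters with `&` and turns the space in the title into `+`.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

endpoint = "http://www.omdbapi.com/"
params = {"t": "Rogue One", "plot": "full", "r": "json", "apikey": "YOUR_KEY"}

# endpoint + "?" + parameters joined by "&"
url = endpoint + "?" + urlencode(params)
print(url)

# going the other way: split the URL and decode the query string
query = parse_qs(urlsplit(url).query)
print(query["t"])
```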

In [None]:
# Exercise 4.1. search movies by name

import requests
import json

title='Rogue+One'

# Search API: http://www.omdbapi.com/
# has four parameters: title, full plot, result format, and api_key
# For the get methods, parameters are attached to API URL after a "?"
# Parameters are separated by "&"

# to test, apply for an api key and use the key here
url="http://www.omdbapi.com/?t="+title+\
    "&plot=full&r=json&apikey={your key here}"

# invoke the API 
r = requests.get(url)

# Another way to pass parameters
# (requests encodes the values, so write the title with a plain space)
# payload = {'t': 'Rogue One', 'plot': 'full', 'apikey':"your key here"}
# r=requests.get('http://www.omdbapi.com/', params=payload)
# in case authentication is needed, use
# r = requests.get('https://api.github.com/user', 
# auth=('user', 'pass'))

# if the API call returns a successful response
if r.status_code==200:
    
    # This API call returns a json object
    # r.json() gives the json object
    print (json.dumps(r.json(), indent=4))
