# <span style="color:#0b486b">SIT 112 - Data Science Concepts</span>

---
Lecturer: Sergiy Shelyag | sergiy.shelyag@deakin.edu.au<br />


School of Information Technology, <br />
Deakin University, VIC 3215, Australia.

---

## <span style="color:#0b486b">Practical Session : Twitter API</span>

**The purpose of this session is to teach you:**

1. How to use Twitter API
2. Distances
3. Term-by-Document matrix


---
## <span style="color:#0b486b">1. Twitter API</span>

To work with the Twitter API, we use a package called `TwitterAPI`. If it is not already installed on your machine, you can install it by executing the cell below.

In [None]:
!pip install TwitterAPI

To collect data from the Twitter API you need an access token and secret. For now we have provided you with them, but to obtain your own you can go to https://apps.twitter.com/, click on 'Create New App', fill in the form and then click on 'Create your Twitter Application'.

In [None]:
from TwitterAPI import TwitterAPI

CONSUMER_KEY = "HLnOFAqRgkrJaaqMoy41a0nxj"
CONSUMER_SECRET = "tz4GKwFn9dsqYv6Mnl9mzctSqV5LeOVGhAuvr1KW0c2LpLuprZ"
OAUTH_TOKEN = "763929928219766784-AjLCTTzhDWx87EJxHpfeC7IPvFnQB1t"
OAUTH_TOKEN_SECRET = "MTJThLdK6SkVzqnHEGvja2yUhIDSCZ8XCfNzjDTUgiXAu"

# Authenticating with your application credentials
api = TwitterAPI(CONSUMER_KEY,
                 CONSUMER_SECRET,
                 OAUTH_TOKEN,
                 OAUTH_TOKEN_SECRET)

Now we have access to the API. For a complete reference on what the API offers, see the [Twitter API documentation](https://dev.twitter.com/overview/api). For example, we can [search for tweets](https://dev.twitter.com/rest/reference/get/search/tweets) that contain a specific keyword, or collect tweets from the Twitter stream. Twitter responses are in JSON format, which we can easily parse into a Python dictionary.
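As a minimal sketch of what parsing JSON into a dictionary looks like, the snippet below uses Python's standard `json` module on a small hand-written string. The string `sample_json` only mimics the general shape of a search response; it is an assumption for illustration, not real Twitter output.

```python
import json

# Hand-written sample that only mimics the shape of a search response (hypothetical data)
sample_json = '{"statuses": [{"text": "Hello from Geelong", "lang": "en"}]}'

# json.loads turns a JSON string into nested Python dictionaries and lists
parsed = json.loads(sample_json)

print(type(parsed))
print(parsed['statuses'][0]['text'])
```

JSON objects become dictionaries and JSON arrays become lists, so the usual indexing syntax applies.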

### <span style="color:#0b486b">1.1 Search Tweets</span>

You can query Twitter with a keyword:

In [None]:
resp = api.request('search/tweets', {'q':'SpaceX'})

In [None]:
resp

Iterate over the response to print the text of each tweet:

In [None]:
for r in resp:
    print(r['text'])

**Exercise 1:**

1. Select a keyword and crawl some tweets from Twitter containing that keyword and then print them.
2. Crawl 100 tweets containing this keyword and print them. You may want to check the Twitter API documentation first.

In [None]:
# your code here
resp = api.request('search/tweets', {'q':'Deakin', 'count': 100})
for r in resp:
    print(r['text'])

There are other parameters that you can set to restrict the response, for example the language of the tweets or a geographical location.

In [None]:
# result_type: popular, recent, mixed
# geocode: lat,long,radius

# geographic coordinates of the desired place
my_lat = 51.5
my_long = 0.12

resp = api.request('search/tweets', {'q':'house', 
                                     'count':'100', 
                                     'lang':'en', 
                                     'result_type':'recent',
                                     'geocode':'{},{},100mi'.format(my_lat, my_long)})
for r in resp:
    print(r['text'])

**Exercise 2:**

Check out the API documentation and narrow down your search results for Exercise 1 using parameters other than the keyword.

In [None]:
# your code here
my_resp2 = api.request('search/tweets', {'q': 'music',
                                    'count': '100',
                                    'lang': 'en',
                                    'result_type': 'recent'})

for i, tweet in enumerate(my_resp2):
    print(i, tweet['text'])

Apart from the tweet text, you can retrieve other metadata from the Twitter response, for example the user who sent the tweet, whether the tweet is a reply to another user or a retweet, how many times it has been retweeted, and so on. Since the response is parsed into a dictionary, use the `keys()` method to see the fields that are available:

In [None]:
response = resp.json()

In [None]:
response['statuses'][0]['user']

In [None]:
response['statuses'][0].keys()
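Some fields can be empty or missing for a given tweet, so it can help to read them defensively. The sketch below works on a hand-written dictionary, `sample_status`, that stands in for one tweet. The values are made up, and the choice of fields is an assumption for illustration.

```python
# Hand-written dictionary standing in for one tweet (hypothetical values)
sample_status = {
    'text': 'Launch day!',
    'user': {'screen_name': 'example_user', 'followers_count': 42},
    'retweet_count': 3,
    'place': None,
}

# Nested fields are reached by chaining keys
screen_name = sample_status['user']['screen_name']

# dict.get returns None (or a chosen default) instead of raising KeyError
retweets = sample_status.get('retweet_count', 0)
geo = sample_status.get('geo')  # this key is absent in the sample

print(screen_name, retweets, sample_status['place'], geo)
```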

**Exercise 3:**

Print the user, place, and geo location of the tweets you have collected.

In [None]:
# Your code here
#my_resp2
for i, tweet in enumerate(my_resp2):
    print('Information for tweet {}'.format(i))
    print('user:', tweet['user'])
    print('place:', tweet['place'], ', geo:', tweet['geo'], '\n')

---
## <span style="color:#0b486b">2. Distances</span>

`Distance` is a numerical description of how far apart objects are. It is a concrete way of describing what it means for elements of some space to be close to or far away from each other, for example the distance between two vectors in a 2-dimensional space.

Now that you know how to represent an n-dimensional vector in Python with NumPy arrays, we will write a function that acts as a metric to measure the distance between two vectors. There are multiple ways to measure the distance between two vectors. We will discuss Euclidean distance and cosine distance.

### <span style="color:#0b486b">2.1 Euclidean Distance</span>

Euclidean distance comes from Geometry. If we assume $\mathbf{x}_{1}=\left[x_{11},x_{12},\ldots,x_{1n}\right]$ and $\mathbf{x}_{2}=\left[x_{21},x_{22},\ldots,x_{2n}\right]$, then the Euclidean distance between $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$ is defined as:

$$d\left(\mathbf{x}_{1},\mathbf{x}_{2}\right)=\sqrt{\left(x_{11}-x_{21}\right)^{2}+\left(x_{12}-x_{22}\right)^{2}+\ldots+\left(x_{1n}-x_{2n}\right)^{2}}
$$

We can use array operators for this task.

In [None]:
import numpy as np

In [None]:
x1 = np.array([2, 5, 4, 6, 8])
x2 = np.array([3, 5, 6, 8, 6])

print(x1 - x2)
print((x1 - x2) ** 2)
print(np.sqrt(np.sum((x1 - x2) ** 2)))

In [None]:
def euclidean_distance1(x1, x2):
    d = x1 - x2
    d = d ** 2
    return np.sqrt(d.sum())

In [None]:
x1 = np.array([-1, 2, 0, 5])
x2 = np.array([4, 2, 1, 0])

print(euclidean_distance1(x1, x2))
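As a quick cross-check, the result can be compared with NumPy's built-in `np.linalg.norm` applied to the difference vector. The sketch below repeats `euclidean_distance1` so that it runs on its own; the variable names `manual` and `builtin` are only illustrative.

```python
import numpy as np

def euclidean_distance1(x1, x2):
    d = x1 - x2
    d = d ** 2
    return np.sqrt(d.sum())

x1 = np.array([-1, 2, 0, 5])
x2 = np.array([4, 2, 1, 0])

# Squared differences: 25 + 0 + 1 + 25 = 51, so the distance is sqrt(51)
manual = euclidean_distance1(x1, x2)

# np.linalg.norm of the difference vector computes the same quantity
builtin = np.linalg.norm(x1 - x2)

print(manual, builtin)
```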

Since the two vectors passed to the function must be the same size, it is better to perform a sanity check before applying the subtraction; otherwise NumPy will raise an error. We can do this with an `if - else` statement or, as a better practice, with `try - except`.

In [None]:
import sys

def euclidean_distance2(x1, x2):
    if x1.shape[0] != x2.shape[0]:
        sys.exit('x1 and x2 are not the same size')
    else:
        d = x1 - x2
        d = d ** 2
        return np.sqrt(d.sum())

**Exercise 4:** Run the cell below, then fix the code so that the two arrays are the same size.

In [None]:
# After running this cell, fix it

x1 = np.array([-1, 2, 0, 5, 9])
x2 = np.array([4, 2, 1, 0])
euclidean_distance2(x1, x2)

In [None]:
def euclidean_distance3(x1, x2):
    try:
        d = x1 - x2
        d = np.power(d, 2)
        return np.sqrt(d.sum())
    except ValueError as e:
        print("Vectors passed to the function are not the same size")
        # you can return a default value
        return None
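One caveat when relying on `try - except` alone: NumPy raises `ValueError` only when the two shapes cannot be broadcast together. A length-1 array is broadcast against an array of any length, so no exception is raised in that case. The sketch below shows a stricter variant that compares shapes explicitly; the name `euclidean_distance_strict` is hypothetical.

```python
import numpy as np

def euclidean_distance_strict(x1, x2):
    # Compare shapes explicitly so that broadcasting cannot hide a size mismatch
    if x1.shape != x2.shape:
        raise ValueError("x1 and x2 are not the same size")
    return np.sqrt(((x1 - x2) ** 2).sum())

a = np.array([1, 2, 3])
b = np.array([1])

# No error here: b is broadcast to [1, 1, 1]
print(a - b)

try:
    euclidean_distance_strict(a, b)
except ValueError as err:
    print("caught:", err)
```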

**Exercise 5:** Run the cell below and then fix it to prevent an exception.

In [None]:
# fix this cell

x1 = np.array([-1, 2, 0, 5, 9])
x2 = np.array([4, 2, 1, 2])
a = euclidean_distance3(x1, x2)

### <span style="color:#0b486b">2.2 Cosine Similarity and Distance</span>

Cosine similarity is a measure of similarity between two vectors based on the angle between them. Cosine similarity is widely used in information retrieval and text mining as a measure of similarity between documents and is defined as:

$$S_{c}\left(\mathbf{x}_{1},\mathbf{x}_{2}\right) = \frac{\mathbf{x}_{1}\cdot\mathbf{x}_{2}}{\parallel\mathbf{x}_{1}\parallel \parallel\mathbf{x}_{2}\parallel}$$


Cosine similarity is particularly used in positive space, where the outcome is bounded in [0, 1]. The cosine distance is defined as the complement of cosine similarity in positive space, that is $D_{c}\left(\mathbf{x}_{1},\mathbf{x}_{2}\right)=1-S_{c}\left(\mathbf{x}_{1},\mathbf{x}_{2}\right)$, where $D_c$ is the cosine distance and $S_c$ is the cosine similarity.

In [1]:
x1 = np.array([1,2,3])
x2 = np.array([3,4,6])

x1 * x1


In [2]:
def cosine_distance(x1, x2):
    try:
        num = (x1 * x2).sum()
        denom = np.sqrt((x1 * x1).sum()) * np.sqrt((x2 * x2).sum())
        # cosine similarity; division in Python 3 already returns a float
        similarity = num / denom
        # cosine distance is the complement of cosine similarity
        return 1.0 - similarity
    except ValueError as e:
        print("Vectors passed to the function are not the same size")
        return None
    

In [None]:
x1 = np.array([2, 0, 5, 9])
x2 = np.array([4, 2, 1, 0])
cosine_distance(x1, x2)
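To make the relationship between the two quantities concrete, here is a small sketch that computes the similarity as the normalised dot product and the distance as its complement. The function names `cosine_similarity` and `cosine_distance_sketch` are hypothetical, and the code assumes neither vector is all zeros.

```python
import numpy as np

def cosine_similarity(x1, x2):
    # Dot product divided by the product of the two vector lengths
    return np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))

def cosine_distance_sketch(x1, x2):
    # Complement of the similarity
    return 1.0 - cosine_similarity(x1, x2)

# Vectors pointing in the same direction, and vectors at a right angle
parallel = (np.array([1, 2, 3]), np.array([2, 4, 6]))
orthogonal = (np.array([1, 0]), np.array([0, 1]))

print(cosine_similarity(*parallel), cosine_distance_sketch(*parallel))
print(cosine_similarity(*orthogonal), cosine_distance_sketch(*orthogonal))
```

Vectors pointing in the same direction have similarity close to 1 and distance close to 0, regardless of their lengths; vectors at a right angle have similarity 0 and distance 1.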

### <span style="color:#0b486b">2.3 Term-by-Document matrix</span>

A term-by-document matrix is a mathematical representation of a text corpus. It describes the frequency of the terms that occur in the document collection. Each row corresponds to a document and each column corresponds to a term. Thus the value in row $j$ and column $i$ is the number of times term $i$ appears in document $j$.
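A tiny illustration may help before moving to real data. The sketch below builds the matrix for two made-up documents using `collections.Counter`; the names `toy_docs`, `toy_vocab` and `toy_matrix` are assumptions for this example only.

```python
from collections import Counter
import numpy as np

# Two toy documents (made up for illustration)
toy_docs = ["apple releases new phone", "new phone new price"]

# Count the terms of each document
counts = [Counter(doc.split()) for doc in toy_docs]

# Sorted vocabulary over the whole corpus
toy_vocab = sorted(set(term for c in counts for term in c))

# Rows are documents, columns are terms; a Counter returns 0 for absent terms
toy_matrix = np.array([[c[term] for term in toy_vocab] for c in counts])

print(toy_vocab)
print(toy_matrix)
```

The second row has a 2 in the column for "new" because that term occurs twice in the second document.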

We will represent two datasets with a term-by-document matrix:

* a collection of 100 Twitter messages about Geelong
* a collection of 6 news articles (5 about Apple and 1 about politics)

The data is already collected and stored in text files. Thus you will need to:

* read the text files
    * using file object
* perform pre-processing
    * using string methods
    * using re package
* construct the term-by-document matrix
    * using numpy arrays and operations
    

### Twitter dataset
First read the data:

In [None]:
import os

# get current working directory
cwd = os.getcwd()   

# join the subdirectory of the data and data file name
file_path = os.path.join(cwd, "data/prac06/tweets.txt")

# read the contents of the file and store it in a list, one tweet per line
# (the with statement closes the file once reading is done)
with open(file_path, encoding='utf8') as fp:
    tweets = fp.readlines()

for tweet in tweets:
    print(tweet)

When dealing with data, we usually have to perform some sort of pre-processing. Data collection is often loosely controlled, resulting in out-of-range values, missing values, and so on. The quality of the data is therefore the first concern before running an analysis. This step is specific to the nature of the data. For text data, for example, it may consist of cleaning, normalization, tokenization, and so on.

In this case, our pre-processing consists of:

* converting all the words to lower case to remove the effect of letter case
* replacing URLs with a simple string such as 'url'. From the output of the previous cell, you should be able to see that many of the tweets contain a URL. Since we are not using them now, we can remove or replace them.
* removing punctuation
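The three steps can be sketched one at a time on a made-up tweet to see the intermediate results. The URL and punctuation patterns here are deliberately simplified assumptions and are not the ones used in the function below.

```python
import re

# A made-up tweet for illustration
sample = "Loving the Geelong waterfront! https://t.co/abc123 #Geelong"

# Step 1: lower case
step1 = sample.lower()

# Step 2: replace URLs with the placeholder 'url' (simplified pattern, an assumption)
step2 = re.sub(r"https?://\S+", "url", step1)

# Step 3: replace anything that is not a word character or whitespace with a space
step3 = re.sub(r"[^\w\s]", " ", step2)

# Splitting on whitespace gives the tokens
tokens = step3.split()
print(tokens)
```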

In [None]:
import numpy as np
import re
import string
from collections import Counter

def pre_process(doc):
    """
    pre-processes a doc
      * Converts the tweet into lower case,
      * removes the URLs,
      * removes the punctuations
      * tokenizes the tweet
    """
    
    doc = doc.lower()
    # getting rid of non-ASCII characters
    doc = doc.encode('ascii', 'ignore').decode('ascii')
    
    # replacing URLs (raw string, with literal dots escaped)
    url_pattern = r"http://[^\s]+|https://[^\s]+|www\.[^\s]+|[^\s]+\.com|bit\.ly/[^\s]+"
    doc = re.sub(url_pattern, 'url', doc) 

    punctuation = r"\(|\)|#|\'|\"|-|:|\\|\/|!|_|,|=|;|>|<|\."
    doc = re.sub(punctuation, ' ', doc)
    
    return doc.split()

In [None]:
print(r['text'])
pre_process(r['text'])

**Exercise:**

Use the function provided to pre-process one of the tweets.

In [None]:
# code here
print(tweets[2])
print(pre_process(tweets[2]))

In [None]:
def termdoc(docs):
    """
    returns the term-by-document matrix and the vocabulary of the passed corpus
    """
    
    vocab = set()   
    termdoc_sparse = []

    for doc in docs:
        # pre-process the doc
        doc_tokens = pre_process(doc)
        # computes the frequencies for doc
        doc_sparse = Counter(doc_tokens)    

        termdoc_sparse.append(doc_sparse)

        # update the vocab
        vocab.update(doc_sparse.keys())  

    vocab = sorted(vocab)
    # map each term to its column index for fast lookup
    term_index = {term: i for i, term in enumerate(vocab)}

    n_docs = len(docs)
    n_vocab = len(vocab)
    termdoc_dense = np.zeros((n_docs, n_vocab), dtype=int)

    for j, doc_sparse in enumerate(termdoc_sparse):
        for term, freq in doc_sparse.items():
            termdoc_dense[j, term_index[term]] = freq
            
    return termdoc_dense, vocab

In [None]:
tweets_termdoc, tweets_vocab = termdoc(tweets)
tweets_termdoc.shape

Let's look at the vocabulary:

In [None]:
tweets_vocab

Let's look at one of the tweets:

In [None]:
j = 0
print(tweets[j])
print(tweets_termdoc[j])

In [None]:
tweets_vocab.index('beyond')

In [None]:
# look up the column by term rather than hard-coding its index
tweets_termdoc[j][tweets_vocab.index('beyond')]

Each tweet is now represented by a vector of size `len(tweets_vocab)`.
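Because most entries of such a vector are zero, it is often easier to inspect only the terms that actually occur. The sketch below does this with `np.nonzero` on a made-up vocabulary and vector; `toy_vocab` and `toy_vector` are assumptions for illustration.

```python
import numpy as np

# Toy vocabulary and one document vector (made up for illustration)
toy_vocab = ['apple', 'new', 'phone', 'price', 'releases']
toy_vector = np.array([0, 2, 1, 1, 0])

# Indices of the columns with a non-zero count
nonzero_idx = np.nonzero(toy_vector)[0]

# Map each index back to its term and frequency
term_freqs = {toy_vocab[i]: int(toy_vector[i]) for i in nonzero_idx}
print(term_freqs)
```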

### News dataset
As in the previous section, the data is stored in text files named 'news1.txt', ..., 'news6.txt'. All we have to do is read the files, construct the corpus and pass it to the `termdoc()` function.

In [None]:
n_docs = 6
cwd = os.getcwd()   
news = []

for j in range(1, n_docs+1):
    filename = "news{}.txt".format(j)
    file_path = os.path.join(cwd, "data/prac06/{}".format(filename))
    # the with statement closes each file after it has been read
    with open(file_path, encoding='utf8') as fp:
        news.append(fp.read())

news_termdoc, news_vocab = termdoc(news)

Each news article is now represented by a large vector of size `len(news_vocab)`. We can do many things with this representation, for example measure the distance between two documents. The first 5 news articles are tech news about Apple, while the 6th is about politics. We expect the tech articles to be more similar to each other than to the politics article. In other words, the distance between two tech articles should be less than the distance between a tech article and the politics article. This is shown below:

In [None]:
print(cosine_distance(news_termdoc[1], news_termdoc[2]))   # both about Apple
print(cosine_distance(news_termdoc[1], news_termdoc[5]))   # one from tech world, the other from politics
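To compare every pair of documents at once, the rows of a term-by-document matrix can be normalised to unit length and multiplied together. The sketch below does this on a small made-up matrix; the function name `pairwise_cosine_distance` is hypothetical, and the code assumes that no row is all zeros.

```python
import numpy as np

def pairwise_cosine_distance(matrix):
    # Assumes every row has at least one non-zero entry
    matrix = matrix.astype(float)
    # Length of each row vector, kept as a column for broadcasting
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / norms
    # Dot products of unit vectors are cosine similarities
    return 1.0 - unit @ unit.T

# Toy matrix: rows 0 and 1 use the same terms, row 2 uses a different one
toy = np.array([[1, 1, 0],
                [2, 2, 0],
                [0, 0, 3]])

dist = pairwise_cosine_distance(toy)
print(np.round(dist, 3))
```

Entry `dist[a, b]` is the cosine distance between documents `a` and `b`, so the diagonal is (numerically close to) zero.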