#### CSCE 676 :: Data Mining and Analysis :: Fall 2019


# Data Collection

*Notebook overview:* In this notebook, we go over the basics of getting data through several handy methods:

* Reading from a CSV file
* Reading from a JSON file
* Scraping from the web
* Using an API

## Part A. Reading from a CSV file

#### Read data from a csv file as a pandas data frame.

In [None]:
import pandas as pd
data_path='./data/births.csv'
data = pd.read_csv(data_path, sep=',')
data_sample = data[:5]
print(data_sample)

In [None]:
# rather than slicing the data, we can use head() to see the top few rows
data.head()

#### Show the data type of columns

In [None]:
print(data.dtypes)

In [None]:
# we can group data by different attributes
grouped = data.groupby(['year', 'month', 'gender']).sum()
grouped.head()
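
To see what a grouped sum produces without the course data file, here is a minimal sketch on a tiny table. The rows are invented for illustration; only the column names are assumed to match `births.csv`. Selecting the `births` column before calling `sum()` keeps the other columns out of the aggregation.

```python
from io import StringIO

import pandas as pd

# Invented sample rows (not taken from births.csv); only the column names match.
csv_text = """year,month,day,gender,births
1969,1,1,F,10
1969,1,1,M,12
1969,1,2,F,11
1969,2,1,F,9
"""

# read_csv accepts any file-like object, so StringIO stands in for a file on disk.
toy = pd.read_csv(StringIO(csv_text))

# Sum only the 'births' column within each (year, month, gender) group.
toy_grouped = toy.groupby(['year', 'month', 'gender'])['births'].sum()
print(toy_grouped)
```

The result is indexed by the three grouping keys, so a single group can be looked up with a tuple such as `toy_grouped.loc[(1969, 1, 'F')]`.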

#### Data selections

In [None]:
#Create a data frame with female records
data_f = data[data.gender=='F'].reset_index(drop=True)
data_f.head()

In [None]:
#Create a data frame with number of births greater than 5500
data_large_birth = data[data.births>5500].reset_index(drop=True)
print(data_large_birth)

In [None]:
#Create a new data frame with only two columns
data_less_column = data[['year','gender']]
print(data_less_column)

In [None]:
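# Select records with a birthday of Feb 28th or Feb 29th; isin() takes a list, which allows matching multiple values.
# The day column is converted to numbers first, so the comparison works whether days were read as strings or as numbers.
data_birth_selection = data[(pd.to_numeric(data.day, errors='coerce').isin([28, 29])) & (data.month.isin([2]))].reset_index(drop=True)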
print(data_birth_selection[:10])
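
The selections above all rely on boolean masks. As a rough sketch of how masks combine, the example below uses a small invented table (not the course data) in which `day` is stored as integers. Note that `isin()` compares values as they are stored, so the integer `28` and the string `'28'` are not interchangeable.

```python
import pandas as pd

# Invented rows for illustration; 'day' is stored as integers here.
toy = pd.DataFrame({
    'month': [2, 2, 3, 2],
    'day': [28, 29, 28, 1],
    'births': [100, 20, 90, 95],
})

# Each condition yields a boolean Series; '&' combines them element-wise.
# When written inline, wrap each comparison in parentheses, because
# '&' binds more tightly than '=='.
is_feb = toy['month'] == 2
is_late = toy['day'].isin([28, 29])

# Keep the rows where both conditions hold and renumber the index from 0.
late_feb = toy[is_feb & is_late].reset_index(drop=True)
print(late_feb)
```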

In [None]:
# reading from a csv file from the web

import pandas as pd

url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/bob-ross/elements-by-episode.csv'

bob_ross_data = pd.read_csv(url)
bob_ross_data.head()

## Part B. Reading from a JSON file

#### Now let's read data in JSON format into a Python dictionary

In [None]:
import json
json_data_path='./data/tweets.json'
with open(json_data_path,'r') as tweets_file:
    for line in tweets_file:
        line=json.loads(line)
        print(line.keys())
        break
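
The cell above stops after the first line. A minimal sketch of reading every record is shown below, assuming the file holds one JSON object per line. The two records are invented and read from an in-memory `StringIO` rather than from `tweets.json`.

```python
import json
from io import StringIO

# Two invented records standing in for a line-delimited JSON file.
fake_file = StringIO('{"id": 1, "text": "hello"}\n\n{"id": 2, "text": "world"}\n')

records = []
for line in fake_file:
    line = line.strip()
    if not line:
        # Skip blank lines, which json.loads would reject.
        continue
    # Each non-empty line is parsed into one Python dictionary.
    records.append(json.loads(line))

print(records[0].keys())
print(len(records))
```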

In [None]:
# code adapted from: http://stackoverflow.com/questions/30088006/cant-figure-out-how-to-fix-the-error-in-the-following-code
from io import StringIO

with open('./data/tweets.json', 'r') as f:
    data = f.readlines()

# remove the trailing "\n" from each line and skip blank lines
data = [line.rstrip() for line in data if line.strip()]

# each element of 'data' is an individual JSON object.
# we want to convert them into an *array* of JSON objects,
# which, in and of itself, is one large JSON value:
# add square brackets to the beginning and end, and
# separate the individual JSON objects with commas
data_json_str = "[" + ','.join(data) + "]"

# now, load it into pandas (wrapped in StringIO, since recent pandas
# versions deprecate passing a literal JSON string to read_json)
data_df = pd.read_json(StringIO(data_json_str))
data_df.head()
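
As an alternative to joining the lines by hand, pandas can parse line-delimited JSON directly with `lines=True`. The sketch below assumes one JSON object per line and uses invented records in place of `tweets.json`.

```python
from io import StringIO

import pandas as pd

# Invented line-delimited JSON; the real file is assumed to follow the same layout.
json_lines = '{"id": 1, "text": "hello"}\n{"id": 2, "text": "world"}\n'

# lines=True tells pandas to treat each line as a separate JSON object (one row each).
toy_df = pd.read_json(StringIO(json_lines), lines=True)
print(toy_df)
```

With a file on disk, the same call would take the path instead of the `StringIO` object.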


### More material on file I/O.

* [Python Fundamentals Tutorial: Working with Files](https://newcircle.com/bookshelf/python_fundamentals_tutorial/working_with_files)
* [Reading and Writing Files in Python](http://www.pythonforbeginners.com/files/reading-and-writing-files-in-python)
* [JSON encoding and decoding with Python](https://pythonspot.com/en/json-encoding-and-decoding-with-python/)
* [Text file (including csv) handling](http://nbviewer.jupyter.org/github/pydata/pydata-book/blob/master/ch06.ipynb)
* [Getting started with pandas 1](http://nbviewer.jupyter.org/github/pydata/pydata-book/blob/master/ch05.ipynb)
* [Getting started with pandas 2](http://nbviewer.jupyter.org/github/pydata/pydata-book/blob/master/ch06.ipynb)


## Part C. Scraping from the web

This example shows how to crawl a webpage and extract the text of all h1 headers and the target of every link, using requests to fetch the page and Beautiful Soup to parse the HTML.

More info on Beautiful Soup:
* [Web Scraping with Beautiful Soup](http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html)
* [Intro to Beautiful Soup](http://programminghistorian.org/lessons/intro-to-beautiful-soup)


In [None]:
from bs4 import BeautifulSoup
import requests
url='http://www.caverlee.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

print(soup)

In [None]:
for a in soup.find_all("h1"):
    print(a.get_text())

In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))

Another package for crawling websites is **scrapy**. For parsing, other widely used tools include **lxml, xpath** and **HTMLParser**.
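
As a rough illustration of the **HTMLParser** option, the sketch below uses the standard library's `html.parser` module to collect h1 text and link targets, mirroring the two Beautiful Soup loops above. The class name `HeaderAndLinkParser` and the HTML snippet are made up for this example, and no network request is made.

```python
from html.parser import HTMLParser


class HeaderAndLinkParser(HTMLParser):
    """Collect the text of <h1> elements and the href of <a> elements."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headers = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            # Start a new header; its text arrives later via handle_data.
            self.in_h1 = True
            self.headers.append('')
        elif tag == 'a':
            # attrs is a list of (name, value) pairs.
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.in_h1 = False

    def handle_data(self, data):
        # Text nested inside tags within the h1 (such as <i>) is appended too.
        if self.in_h1:
            self.headers[-1] += data


# Invented HTML snippet, used instead of a live page.
page = ('<html><body><h1>Data <i>Mining</i></h1>'
        '<a href="/syllabus">Syllabus</a><a>no link</a></body></html>')

parser = HeaderAndLinkParser()
parser.feed(page)
print(parser.headers)
print(parser.links)
```

This event-driven style needs more code than Beautiful Soup, but it has no third-party dependencies.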

## Part D. Using an API

In this example, we show how to crawl Twitter users' timelines. Before accessing the API, there is an authentication process. OAuth is an authentication protocol that allows users to approve an application to act on their behalf without sharing their password. Twitter's application-only authentication is based on the Client Credentials Grant flow of the OAuth 2 specification. You need to register your application at [apps.twitter.com](https://apps.twitter.com/) in order to get the credentials (consumer key, consumer secret, access key, access secret). The user timeline API endpoint is [https://api.twitter.com/1.1/statuses/user_timeline.json](https://api.twitter.com/1.1/statuses/user_timeline.json?).

In [None]:
import oauth2 as oauth
import json
"""Fill in the blanks here for your own Twitter app."""
consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""
consumer = oauth.Consumer(consumer_key, consumer_secret)
token = oauth.Token(key=access_key, secret=access_secret)
userlist = ['ev','CSE_at_TAMU']
addr = 'https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=%s'
client = oauth.Client(consumer, token)
for uid in userlist:
    resp, content = client.request(
        addr%uid,
        method='GET',
        )
    print(json.loads(content)[1]['text'])
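
Indexing the parsed response with `[1]['text']` fails when a user has fewer than two tweets or when the API returns an error object instead of a list. A more defensive way to read the response might look like the sketch below. The helper name `extract_texts` is hypothetical, and both response bodies are invented for illustration.

```python
import json


def extract_texts(content):
    """Return the 'text' of every tweet in a user_timeline response body.

    Returns an empty list when the body is not a list of tweets,
    for example when it is an error object.
    """
    payload = json.loads(content)
    if not isinstance(payload, list):
        return []
    return [tweet['text'] for tweet in payload if 'text' in tweet]


# Invented response bodies, given as bytes the way an HTTP client typically returns them.
ok_body = b'[{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]'
error_body = b'{"errors": [{"code": 88, "message": "Rate limit exceeded"}]}'

print(extract_texts(ok_body))
print(extract_texts(error_body))
```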

There are many other libraries we can make use of. For Twitter, a nice wrapper named **tweepy** is readily available.

To install: 

**pip install tweepy**

Here is a sample crawler with three methods: one crawls a user profile, one crawls a user's tweets, and one checks the API rate limit.

In [None]:
import tweepy
import time
import sys

class TwitterCrawler():
    '''Fill in the blanks here for your own Twitter app.'''
    consumer_key = ""
    consumer_secret = ""
    access_key = ""
    access_secret = ""
    auth = None
    api = None

    def __init__(self):
        self.auth = tweepy.OAuthHandler(self.consumer_key, self.consumer_secret)
        self.auth.set_access_token(self.access_key, self.access_secret)
        self.api = tweepy.API(self.auth, parser=tweepy.parsers.JSONParser())
        # print(self.api.rate_limit_status())

    def check_api_rate_limit(self, sleep_time):
        try:
            rate_limit_status = self.api.rate_limit_status()
            print('------------check rate limit------------')
            # print(rate_limit_status['resources']['statuses'])
        except Exception as error:
            print(error)
            # Twitter error code 88 means "rate limit exceeded". Reading it from an
            # `api_code` attribute assumes a tweepy 3.x-style error object.
            if getattr(error, 'api_code', None) == 88:
                print("Sleeping for %d seconds." % sleep_time)
                time.sleep(sleep_time)

    def crawl_user_profile(self, user_id):
        try:
            user_profile = self.api.get_user(user_id)
        except Exception:  # e.g. unknown user or a network failure
            return None
        return user_profile

    def crawl_user_tweets(self, user_id, count):
        self.check_api_rate_limit(900)
        page_cnt = 0
        tried_count = 0
        tweets= []
        tweets_api_call=[]
        while tweets_api_call is not None and len(tweets) < count:
            try:
                page_cnt += 1
                tweets_api_call = self.api.user_timeline(user_id, count=count, page=page_cnt)
                tweets.extend(tweets_api_call)
            except Exception as error:
                # report the failure instead of silently ignoring it
                print(error)
            tried_count += 1
            if tried_count == 5:
                break
        return tweets


def main():
    tc = TwitterCrawler()
    user = tc.crawl_user_profile('TheRealCaverlee')
    print(user)
    tweets = tc.crawl_user_tweets('TheRealCaverlee', 500)
    print(len(tweets))


if __name__ == "__main__":
    main()
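
The paging loop in `crawl_user_tweets` can be tried out without Twitter credentials. The sketch below captures the same idea (request page after page, tolerate failures, stop after a fixed number of calls) using a fake fetch function. `collect_pages` and `fake_fetch` are hypothetical names, and the data is invented. Unlike the crawler above, this version also stops as soon as a page comes back empty.

```python
def collect_pages(fetch_page, wanted, max_calls=5):
    """Call fetch_page(1), fetch_page(2), ... until `wanted` items are collected,
    a page comes back empty, or `max_calls` calls have been made."""
    items = []
    for page in range(1, max_calls + 1):
        try:
            batch = fetch_page(page)
        except Exception as error:
            # A real crawler would catch narrower, library-specific errors.
            print('page %d failed: %s' % (page, error))
            continue
        if not batch:
            # An empty page means there is nothing left to fetch.
            break
        items.extend(batch)
        if len(items) >= wanted:
            break
    return items[:wanted]


def fake_fetch(page):
    """Stand-in for an API call: pages 1, 3 and 4 hold two items each; page 2 fails."""
    if page == 2:
        raise RuntimeError('simulated network error')
    pages = {1: ['a', 'b'], 3: ['c', 'd'], 4: ['e', 'f']}
    return pages.get(page, [])


result = collect_pages(fake_fetch, wanted=5)
print(result)
```

Passing the fetch function as an argument keeps the loop independent of any particular API client, which makes it easy to test.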

### More on crawlers
* [Example Twitter crawler](http://www.benkhalifa.com/twitter-crawler-python)
* [Mining Twitter Data with Python (Part 1: Collecting data)](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)