# Data Collection

<i>Data collection</i> is the process of gathering a large amount of data.
Many tools and software packages can be used for <i>data collection</i>, for example a web data collection tool such as [import.io](https://www.import.io/).
<br><br>
In this lab session we use Python to collect the data.

### 1. Twitter
Tweets are retrieved through Twitter's <i>Application Programming Interface</i> (API), which is publicly available to registered developers.

#### a. Register App
Register an app [here](http://apps.twitter.com).
Make sure you obtain the <i>Consumer Key (API Key), Consumer Secret (API Secret), Access Token,</i> and <i>Access Token Secret</i>.

#### b. Install Tweepy (Python Library for Accessing Twitter-API)
To install it, run <u>pip install tweepy</u>.
The Tweepy documentation is available [here](http://docs.tweepy.org/en/v3.5.0/).


#### c. OAuth
Initialize OAuth authentication against the Twitter API.

In [None]:
import tweepy
from tweepy import OAuthHandler
 
consumer_key = 'CONSUMER-KEY'
consumer_secret = 'CONSUMER-SECRET'
access_token = 'ACCESS-TOKEN'
access_secret = 'ACCESS-SECRET'
     
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
 
api = tweepy.API(auth)

The <u>api</u> variable is the gateway for all access to the Twitter API.
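
Hardcoding the four OAuth values in a notebook makes them easy to leak. As a minimal sketch, they could instead be read from environment variables. The variable names and the `load_twitter_credentials` helper below are hypothetical, not part of Tweepy.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_twitter_credentials(env=None):
    """Return the four OAuth values as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

The returned values can then be passed to `OAuthHandler` and `set_access_token` in place of the string literals.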

- The following example displays 10 tweets from the <u>home timeline</u>.

In [None]:
from tweepy import Cursor

i = 1
for status in Cursor(api.home_timeline).items(10):
    # Process a single status
    print(i, status.text)
    i+=1

- Displaying the attributes of a single tweet

In [None]:
for status in Cursor(api.home_timeline).items(1):
    # Process a single status
    print(status)

The <i>crawling</i> results can also be saved to a file in a format such as <i>JSON, CSV, or TXT</i>.

- JSON

In [None]:
import json

with open('home_timeline.json', 'w') as f:
    for status in Cursor(api.home_timeline).items(2):
        # Process a single status
        f.write(json.dumps(status._json)+"\n")
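
The cell above writes one JSON document per line. To load such a file back for analysis, something like the following sketch could be used. The `read_json_lines` helper is a hypothetical name introduced here for illustration.

```python
import json


def read_json_lines(path):
    """Load a file with one JSON document per line into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

For example, `read_json_lines('home_timeline.json')` would return the saved tweets as a list of dictionaries.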

- CSV

In [None]:
import csv

# Text mode with newline='' is what the csv module expects on Python 3
with open('home_timeline.csv', 'w', newline='', encoding='utf-8') as csvFile:
    csvWriter = csv.writer(csvFile)

    # Header row
    csvWriter.writerow(["user", "text"])

    for tweet in tweepy.Cursor(api.home_timeline).items(2):
        # Write one row per tweet; the file handle takes care of UTF-8 encoding
        csvWriter.writerow([tweet.user.screen_name, tweet.text])

- TXT

In [None]:
# Open in text mode with an explicit encoding so non-ASCII tweets are written correctly
with open('home_timeline.txt', 'w', encoding='utf-8') as txt:
    for status in Cursor(api.home_timeline).items(2):
        # Process a single status
        txt.write(status.user.screen_name + " = " + status.text + "\n")

#### d. Twitter Streaming
In this step we obtain data by streaming tweets live as they are posted.


In [None]:
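from tweepy import Stream

class MyStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        # on_status receives a parsed Status object (on_data would receive the raw JSON string)
        print(status.text)

    def on_error(self, status_code):
        if status_code == 420:
            # Returning False in on_error disconnects the stream
            return False

myStream = tweepy.Stream(auth=api.auth, listener=MyStreamListener())

# 'async' is a reserved word since Python 3.7; Tweepy 3.7+ names this parameter is_async
myStream.filter(track=['python'], is_async=True)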
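
In Tweepy 3.x the listener's `on_data` callback is handed the raw JSON string of each stream message rather than a parsed object. As a rough sketch, the text could be pulled out of such a payload as follows. The `extract_tweet_text` helper and the sample payload are made up for illustration.

```python
import json


def extract_tweet_text(raw_data):
    """Return the tweet text from a raw JSON payload, or None if absent."""
    payload = json.loads(raw_data)
    # Non-tweet messages (for example limit notices) carry no "text" key.
    return payload.get("text")


# Invented sample payload, shaped like a simplified tweet.
sample = '{"id": 1, "text": "belajar python", "user": {"screen_name": "contoh"}}'
print(extract_tweet_text(sample))  # prints: belajar python
```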

### 2. Wikipedia
Wikipedia also provides an API for accessing and retrieving the data we need.

#### a. Install the Wikipedia Library
To install it, run <u>pip install wikipedia</u>. The documentation is available [here](https://wikipedia.readthedocs.io/en/latest/quickstart.html).

In [None]:
import wikipedia

wikipedia.set_lang("id")

- Search

In [None]:
print(wikipedia.search("Semarang"))

- Suggest

In [None]:
print(wikipedia.suggest("Semarng"))

- Summary

In [None]:
print(wikipedia.summary("Semarang", sentences=1))

- Page

In [None]:
page = wikipedia.page("Semarang")

print(page.title)
print(page.url)
print(page.content)

- Images & Links

In [None]:
print(page.images[0])
print(page.links[0])
print(page.categories)

- Geosearch

In [None]:
print(wikipedia.geosearch(-6.966667, 110.41666699999996))

### 3. Foursquare (Still Has Issues)

#### a. Register app
Register an app [here](https://id.foursquare.com/developers/). Make sure you obtain the <i>Client ID</i> and <i>Client Secret</i>.

#### b. Install Foursquare
Install it with <u>pip install foursquare</u>. The GitHub repository is [here](https://github.com/mLewisLogic/foursquare) and the API documentation can be read [here](https://developer.foursquare.com/docs/).

In [None]:
import foursquare

# Placeholders; never commit real credentials to a shared notebook
client_id = "CLIENT-ID"
client_secret = "CLIENT-SECRET"

client = foursquare.Foursquare(client_id, client_secret)

In [None]:
# Construct the client object
client = foursquare.Foursquare(client_id=client_id, client_secret=client_secret, redirect_uri='http://localhost')

# Build the authorization url for your app
auth_uri = client.oauth.auth_url()

In [None]:
# Interrogate foursquare's servers to get the user's access_token
access_token = client.oauth.get_token('XX_CODE_RETURNED_IN_REDIRECT_XX')

# Apply the returned access token to the client
client.set_access_token(access_token)

# Get the user's data
user = client.users()

### 4. Facebook
Facebook also provides an API for retrieving data, although the API has limitations.

#### a. Register App
Register an app [here](https://developers.facebook.com). An access token can be obtained from the Graph API Explorer, found [here](https://developers.facebook.com/tools-and-support/).
#### b. Install facebook library
To install it, run <u>pip install facebook-sdk</u>. The library documentation is available [here](https://pypi.python.org/pypi/facebook-sdk).

#### c. Initialization

In [None]:
import facebook
import json

token = 'TOKEN'

graph = facebook.GraphAPI(token, version='2.4')

- Getting a Facebook post

In [None]:
post = graph.get_object(id='POST_ID')
print(post['message'])

- Getting the attending count of an event

In [None]:
event = graph.get_object(id='EVENT_ID', fields='attending_count,declined_count')
print(event['attending_count'])
print(event['declined_count'])

- Getting the list of friends

In [None]:
# Get all of the authenticated user's friends
friends = graph.get_connections(id='me', connection_name='friends')

print(json.dumps(friends, indent=4))

- Commenting on a Facebook post

In [None]:
graph.put_object(parent_object='POST_ID', connection_name='comments',message='Tes API!')

- Posting a status update to the wall

In [None]:
# put_wall_post
attachment =  {
    'name': 'Link name',
    'link': 'https://www.example.com/',
    'caption': 'Check out this example',
    'description': 'This is a longer description of the attachment',
    'picture': 'https://www.example.com/thumbnail.jpg'
}

graph.put_wall_post(message='Check this out...', attachment=attachment)

- Commenting on a Facebook post with <u>put_comment</u>

In [None]:
# put_comment
graph.put_comment(object_id='POST_ID', message='Great post...')

- Liking a specific object, for example a comment on a Facebook post

In [None]:
# put_like
graph.put_like(object_id='COMMENT_ID')

- Getting our own Facebook posts with specific attributes

In [None]:
import requests

posts = graph.get_connections('me', 'posts', fields='message,created_time,description,caption,link,place,status_type,shares')
while True:  # keep paginating
    try:
        with open('my_posts.json', 'a') as f:
            for post in posts['data']:
                f.write(json.dumps(post)+"\n")
            # get next page
            posts = requests.get(posts['paging']['next']).json()
    except KeyError:
        # no more pages, break the loop
        break
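
The loop above keeps following `paging.next` until the key is missing. The same idea can be sketched as a reusable generator that is easy to try offline. Here `iter_paged`, `fetch_next`, and the fake pages are hypothetical names and data, not part of facebook-sdk.

```python
def iter_paged(first_page, fetch_next):
    """Yield every record from a chain of pages shaped like Graph API responses.

    fetch_next is a hypothetical callable that takes a 'next' URL and
    returns the decoded JSON of that page.
    """
    page = first_page
    while page:
        for record in page.get("data", []):
            yield record
        next_url = page.get("paging", {}).get("next")
        page = fetch_next(next_url) if next_url else None


# Fake two-page response used only to exercise the generator offline.
fake_pages = {"page2": {"data": [{"message": "c"}]}}
first = {"data": [{"message": "a"}, {"message": "b"}], "paging": {"next": "page2"}}
messages = [post["message"] for post in iter_paged(first, fake_pages.get)]
print(messages)  # prints: ['a', 'b', 'c']
```

Against the real API, `fetch_next` could be something like `lambda url: requests.get(url).json()`.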

- Getting the Facebook profile

In [None]:
profile = graph.get_object("me", fields='name,location{general_info,location},languages{name,description}')
print(json.dumps(profile, indent=4))

### 5. Instagram


Instagram also provides an API for retrieving data, although the API has limitations.

#### a. Register App
Register an app [here](https://www.instagram.com/developer/). Obtain an access token through [this link](https://instagram.com/oauth/authorize/?client_id=[CLIENT_ID_HERE]&redirect_uri=http://localhost&response_type=token), after first defining the redirect URI in the app client settings.
<br><br>
Also make sure the redirect URI configured in the app matches the one in the link above.

#### b. Install instagram library
To install it, run <u>pip install python-instagram</u>. The library documentation is available on GitHub [here](https://github.com/facebookarchive/python-instagram), although the library no longer appears to be maintained.

- Initialization

In [None]:
from instagram.client import InstagramAPI

access_token = 'ACCESS-TOKEN'
client_secret = 'CLIENT-SECRET'
api = InstagramAPI(access_token=access_token, client_secret=client_secret)

- Getting captions

In [None]:
recent_media, next_ = api.user_recent_media(user_id="USER-ID", count=10)
for media in recent_media:
    print(media.caption.text)

- Generating a signature

In [None]:
import hmac
from hashlib import sha256

def generate_sig(endpoint, params, secret):
    # Build the string "endpoint|key=value|..." with the keys in sorted order
    sig = endpoint
    for key in sorted(params.keys()):
        sig += '|%s=%s' % (key, params[key])
    # hmac requires bytes on Python 3, so encode both the secret and the message
    return hmac.new(secret.encode('utf-8'), sig.encode('utf-8'), sha256).hexdigest()

endpoint = '/media/657988443280050001_25025320'
params = {
    'access_token': '1397096926.dc637bd.b8b3dc8571914eb09e95772139086196',
    'count': 10,
}
secret = 'ee7d8da7b1bb45dd9f2b66e75950682f'

sig = generate_sig(endpoint, params, secret)
print(sig)

- Getting a user registered in the app

In [None]:
api.user(USER_ID)

- Getting liked media

In [None]:
api.user_liked_media()

- Getting user details

In [None]:
user = api.user_search('yasirabdr')
my_usr = user[0]
print('User id is', my_usr.id, 'and name is', my_usr.username)

- Getting incoming requests

In [None]:
api.user_incoming_requests()

- Following a user (only works for users registered in the app)

In [None]:
api.follow_user(user_id=USER_ID)

- Unfollowing a user (only works for users registered in the app)

In [None]:
api.unfollow_user(user_id=USER_ID)

- Other available commands

In [None]:
api.block_user(user_id)
api.unblock_user(user_id)
api.approve_user_request(user_id)
api.ignore_user_request(user_id)
api.user_relationship(user_id)

- Showing the comments on a post

In [None]:
api.media_comments(media_id='POST-ID')

- Writing a comment

In [None]:
api.create_media_comment(media_id='POST-ID', text='this is my comment!')

- Deleting a comment

In [None]:
api.delete_comment(media_id='POST-ID', comment_id='1')

- Liking a post

In [None]:
api.like_media(media_id='POST-ID')

- Unliking a post

In [None]:
api.unlike_media(media_id='POST-ID')

### 6. Scraping an E-commerce Website (Bukalapak)
Scraping means extracting information from a website. The scraping here uses the Python library [Scrapy](https://scrapy.org/), a framework for extracting data from the web.

#### a. Install Scrapy
The easiest way to install it is <u>pip install Scrapy</u>.

#### b. Get Started with Scrapy
To get started, first create a new project with
<br><br><b><pre>scrapy startproject bukalapak</pre></b><br>
This automatically generates the following structure:
<br><b>
<pre>bukalapak/
    scrapy.cfg
    bukalapak/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
</pre>
</b><br>
Scrapy takes an <b>object-oriented</b> approach to defining items and spiders. Each entry in the project is explained below.
- <b>scrapy.cfg</b><br> 
It is a project configuration file which contains information for setting module for the project along with its deployment information.<br><br>
- <b>bukalapak</b> : application directory
<br><br>
- <b>items.py</b><br>
Items are containers that will be loaded with the scraped data; they work like simple <b>Python dicts</b>. While one can use plain Python dicts with Scrapy, Items provide additional protection against populating undeclared fields, preventing typos. They are declared by creating a <b>scrapy.Item</b> class and defining its attributes as <b>scrapy.Field</b> objects.
<br><br>
- <b>pipelines.py</b><br>
After an item has been scraped by a spider, it is sent to the Item <b>Pipeline</b>, which processes it through several components that are executed sequentially. Each item pipeline component is a Python class that has to implement a method called <b>process_item</b> to process scraped items. It receives an item and performs an action on it, and also decides whether the item should continue through the pipeline or be dropped and no longer processed. To drop an item, it raises the <b>DropItem</b> exception.
<br><br>
- <b>settings.py</b> : <br>
It allows one to customise the behaviour of all Scrapy <b>components</b>, including the core, extensions, pipelines and spiders themselves. It provides a global namespace of key-value mappings that the code can use to pull configuration values from.
<br><br>
- <b>spiders</b> : <br>
Spiders is a directory which contains all <b>spiders/crawlers</b> as Python classes. Whenever one runs a spider, Scrapy looks into this directory and tries to find the spider with the name provided by the user. Spiders define how a certain site or a group of sites will be scraped, including how to perform the crawl and how to extract data from their pages. In other words, Spiders are the place where one defines the custom behavior for crawling and parsing pages for a particular site. Spiders have to define three major attributes: <b>start_urls</b>, which lists the URLs to be scraped; <b>allowed_domains</b>, which restricts the crawl to the listed domain names; and <b>parse</b>, a method that is called whenever a response arrives for a request that was sent. These attributes are important because they constitute the base of Spider definitions.

#### c. Let's Scrape
Scraping involves three steps:
1. Update <b>items.py</b> with the fields to be extracted.
2. Create a new <b>Spider</b> that defines <b>allowed_domains, start_urls,</b> and the <b>parse</b> method.
3. Update <b>pipelines.py</b> for any further data processing.

Now let's start scraping.

- Update <b>items.py</b>

In [None]:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class BukalapakItem(scrapy.Item):
    # define the fields for your item here like:
    product_name = scrapy.Field()
    product_category = scrapy.Field()
    product_currency_price = scrapy.Field()
    product_price = scrapy.Field()

- Create a new <b>Spider</b>
To create one, we can use the built-in <b>genspider</b> utility
<br><br><b><pre>
scrapy genspider BukalapakProductSpider bukalapak.com
</pre></b><br>

- Update <b>BukalapakProductSpider.py</b>

In [None]:
import scrapy
from bukalapak.items import BukalapakItem

class BukalapakProductSpider(scrapy.Spider):
    name = "BukalapakDeals"
    allowed_domains = ["bukalapak.com"]

    #Use working product URL below
    start_urls = [
        "https://www.bukalapak.com/p/komputer/laptop/43mglk-jual-laptop-hp-8560w-elitebook-workstation-core-i7", 
        "https://www.bukalapak.com/p/komputer/laptop/3sz46v-jual-hp-1000",
        "https://www.bukalapak.com/p/komputer/laptop/43ohlq-jual-dell-e6420-core-i5", 
        "https://www.bukalapak.com/p/komputer/laptop/45uqjz-jual-macbook-white-core-2-duo-mulus-normal-lancar"
    ]

    def parse(self, response):
        items = BukalapakItem()
        name = response.xpath('//h1[@itemprop="name"]/text()').extract()
        category = response.xpath('//dd[@itemprop="category"]/text()').extract()
        currency_price = response.xpath('//span[@itemprop="priceCurrency"]/text()').extract()
        price = response.xpath('//span[@itemprop="price"]/text()').extract()

        items['product_name'] = ''.join(name).strip()
        items['product_category'] = ''.join(category).strip()
        items['product_currency_price'] = ''.join(currency_price).strip()
        items['product_price'] = ''.join(price).strip()
        
        yield items
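
To experiment with XPath expressions like those in `parse` without fetching a live page, a small offline sketch can help. This one deliberately swaps Scrapy's selectors for the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath. The HTML snippet and the `first_text` helper are invented for illustration and are not real Bukalapak markup.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed snippet that mimics the itemprop attributes used above.
SNIPPET = """
<div>
  <h1 itemprop="name"> HP 1000 </h1>
  <span itemprop="priceCurrency">Rp</span>
  <span itemprop="price">2.400.000</span>
</div>
"""


def first_text(root, tag, itemprop):
    """Return the stripped text of the first matching element, or ''."""
    node = root.find(".//{}[@itemprop='{}']".format(tag, itemprop))
    if node is None or node.text is None:
        return ""
    return node.text.strip()


root = ET.fromstring(SNIPPET.strip())
item = {
    "product_name": first_text(root, "h1", "name"),
    "product_currency_price": first_text(root, "span", "priceCurrency"),
    "product_price": first_text(root, "span", "price"),
}
print(item)
```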


- Pipeline<br>
Pipeline classes implement a <b>process_item</b> method, which is called every time a Spider yields an item. It takes the <b>item</b> and the <b>spider</b> as arguments and returns the item (a <b>dict</b>-like object). In this example we simply return the item as it is.
<br><br>
To use a <b>pipeline</b>, first enable it in <b>settings.py</b>

In [None]:
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'bukalapak.pipelines.BukalapakPipeline': 300,
}
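
For reference, a pass-through pipeline of the kind described above might look like the following sketch in <b>pipelines.py</b>. The class name `BukalapakPipeline` is an assumption. It has to match whatever path is registered in `ITEM_PIPELINES`.

```python
class BukalapakPipeline:
    """Pass-through pipeline: hands every item on unchanged."""

    def process_item(self, item, spider):
        # A real pipeline could clean or validate fields here.
        return item


# Quick offline check with a plain dict standing in for a scraped item.
sample_item = {"product_name": "HP 1000", "product_price": "2.400.000"}
print(BukalapakPipeline().process_item(sample_item, spider=None))
```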

- Running the <b>crawl</b>
To run the crawl and save the results to a JSON file, use the following command
<br><br><b>scrapy crawl BukalapakDeals -o items.json</b><br><br>
The following output is produced in <b>items.json</b>

In [None]:
[
{"product_category": "Laptop", "product_price": "6.000.000", "product_name": "Laptop HP 8560W EliteBook Workstation Core i7", "product_currency_price": "Rp"},
{"product_category": "Laptop", "product_price": "2.400.000", "product_name": "HP 1000", "product_currency_price": "Rp"},
{"product_category": "Laptop", "product_price": "3.200.000", "product_name": "DELL E6420 CORE I5", "product_currency_price": "Rp"},
{"product_category": "Laptop", "product_price": "2.950.000", "product_name": "Macbook White Core 2 Duo Mulus , Normal,  Lancar", "product_currency_price": "Rp"}
]
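
The prices in this output are strings that use a dot as the thousands separator. Before doing any arithmetic they would need to be converted to numbers, for example with a small helper like this sketch. The `parse_rupiah` name is hypothetical, and the helper assumes the price text contains only digits and dots.

```python
def parse_rupiah(price_text):
    """Convert a price string such as '6.000.000' to an integer."""
    digits = price_text.replace(".", "").strip()
    return int(digits)


# The four prices from the output above.
prices = ["6.000.000", "2.400.000", "3.200.000", "2.950.000"]
print(sum(parse_rupiah(p) for p in prices))  # prints: 14550000
```

A conversion like this would fit naturally inside `process_item` in a pipeline.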