# Collect StockTwits messages

The objective of this notebook is to:

- Iteratively collect stock-related messages using the StockTwits API
- Store the data in MongoDB

## Resources for the project

The slides are available here: [Project](https://drive.google.com/open?id=0B0rdK44Elj9RZzdYMnJzaTREaHlhOUlmNGd2Qzg3RFJTSDBn)

The code is available online [here](https://repl.it/@trenault/StockTwits101)

## Note about the StockTwits API

The API returns 30 messages at a time.

To get older messages, specify the `max` argument: https://api.stocktwits.com/developers/docs/api#streams-symbol-docs

Example

- Base URL: https://api.stocktwits.com/api/2/streams/symbol/

Parameters:

- 1) ticker (for example `BTC.X`)
- 2) max ID (optional)

- Only the latest 30 messages:
https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json
- Messages up to a given ID:
https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=177617138
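The two request forms above can be built with a small helper. This is only a sketch: `build_stream_url` and `BASE_URL` are hypothetical names, not part of the StockTwits API, and the helper assumes that `max` is passed as a query-string parameter.

```python
BASE_URL = "https://api.stocktwits.com/api/2/streams/symbol/"


def build_stream_url(ticker, max_id=None):
    """Return the stream URL for `ticker` (hypothetical helper).

    Without `max_id` the URL points at the latest messages; with it,
    the `max` query parameter is appended to ask for older messages.
    """
    url = "{0}{1}.json".format(BASE_URL, ticker)
    if max_id is not None:
        url += "?max={0}".format(max_id)
    return url


print(build_stream_url("BTC.X"))
print(build_stream_url("BTC.X", max_id=177617138))
```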




We need the following information in a dataframe:

- `messages`
  - `id`
  - `body`
  - `created_at`
  - `user`
    - `id`
    - `username`
    - `name`
    - `avatar_url`
    - `join_date` 
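One possible way to get these fields into a flat, one-row-per-message shape is sketched below. `flatten_message` is a hypothetical helper, the column names for the nested `user` fields are a choice made here, and the sample message is made up to match the field list above; it is not real API output.

```python
def flatten_message(message):
    """Flatten one message dict into a single-level row (hypothetical helper)."""
    user = message.get("user") or {}
    return {
        "id": message.get("id"),
        "body": message.get("body"),
        "created_at": message.get("created_at"),
        "user_id": user.get("id"),
        "username": user.get("username"),
        "name": user.get("name"),
        "avatar_url": user.get("avatar_url"),
        "join_date": user.get("join_date"),
    }


# Made-up sample shaped like the field list above
sample = {
    "messages": [
        {
            "id": 1,
            "body": "Example message about $BTC.X",
            "created_at": "2020-01-01T12:00:00Z",
            "user": {
                "id": 7,
                "username": "example_user",
                "name": "Example User",
                "avatar_url": "https://example.com/avatar.png",
                "join_date": "2019-01-01",
            },
        }
    ]
}

rows = [flatten_message(m) for m in sample["messages"]]
print(rows[0])
```

The resulting list of rows can then be passed to `pandas.DataFrame(rows)` to obtain the dataframe.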

## StockTwits API

[API](https://api.stocktwits.com/developers/docs)

For next week, scrape the data for Bitcoin and the messages about Bitcoin.

## MongoDB Documentation (macOS only)

Installation process: https://docs.mongodb.com/manual/tutorial/install-mongodb-on-os-x/

To start the database, first open a terminal and run the following command:

```
mongod --config /usr/local/etc/mongod.conf
```

Then open another terminal and run `mongo` to work with MongoDB from the command line.

You are all set!

## Sentiment

If the sentiment is Bullish, the label is +1; if the sentiment is Bearish, the label is -1; otherwise the label is 0.
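The labelling rule can be sketched as a small function. `sentiment_label` is a hypothetical name, and the location of the sentiment inside a message (`message["entities"]["sentiment"]["basic"]`, missing or null when the author did not tag the message) is an assumption to check against an actual API response.

```python
def sentiment_label(message):
    """Map a message to +1 (Bullish), -1 (Bearish) or 0 (anything else).

    Assumption: the sentiment string sits at
    message["entities"]["sentiment"]["basic"] and may be absent or None.
    """
    entities = message.get("entities") or {}
    sentiment = entities.get("sentiment") or {}
    return {"Bullish": 1, "Bearish": -1}.get(sentiment.get("basic"), 0)


# Made-up messages covering the three cases
bullish = {"entities": {"sentiment": {"basic": "Bullish"}}}
bearish = {"entities": {"sentiment": {"basic": "Bearish"}}}
untagged = {"entities": {"sentiment": None}}

print(sentiment_label(bullish), sentiment_label(bearish), sentiment_label(untagged))
```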


# Data Collection

We define a function to collect the data. The function performs the following steps:

- Step 1: Send a GET request to the API
- Step 2: Check the status code. If the request is refused, wait `n` seconds before retrying
- Step 3: Extract the following information from the JSON response:
    - ID, Body, UserName, Created At, Sentiment
- Step 4: Insert into MongoDB
- Step 5: Get the latest ID from the JSON response to allow iterative data collection
In [None]:
#import urllib3
import json
import pandas as pd
#from pymongo import MongoClient
import datetime
from dateutil.relativedelta import relativedelta
import time
from tqdm import tqdm
import requests
import warnings
warnings.filterwarnings('ignore')

In [None]:
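def extract_ticker(ticker, first_id=None):
    """Page backwards through the messages for `ticker` until three months ago.

    Batches are written to data/ and request logs to data/logs/; both
    directories must exist before the function is called.
    """
    base_url = "https://api.stocktwits.com/api/2/streams/symbol/"
    date_test = True  # stays True while the batch is less than three months old
    timesecond = 0    # seconds to sleep before the next request
    begin = datetime.datetime.now()  # time of the last successful request
    # first_id = "210206060"  # example of a starting ID

    while date_test:
        time.sleep(timesecond)
        timesecond = 0  # reset so that a past rate-limit wait is not repeated
        if first_id is not None:
            # Later iterations: ask for messages at or below `first_id`
            url = "{0}{1}.json?max={2}".format(base_url, ticker, first_id)
        else:
            # First iteration: ask for the latest 30 messages
            url = "{0}{1}.json".format(base_url, ticker)

        print(url)

        content = requests.get(url)

        if content.status_code == 200:
            begin = datetime.datetime.now()
            data = content.json()

            if not data['messages']:
                # Nothing left to collect
                break

            # cursor['max'] should be the oldest ID of the batch, which is the
            # value to pass as `max` to page backwards (cursor['since'] is the
            # newest ID and would return the same batch again)
            first_id = data['cursor']['max']

            # Continue while the newest message of the batch is less than
            # three months old
            newest = datetime.datetime.strptime(
                data['messages'][0]['created_at'], '%Y-%m-%dT%H:%M:%SZ')
            date_test = newest > (datetime.datetime.today()
                                  + relativedelta(months=-3))

            name = "data/{}_{}.json".format(ticker, first_id)
            with open(name, 'w') as outfile:
                json.dump(data['messages'], outfile)
        else:
            dic_last_id = {
                'url': url,
                'last_id': first_id
            }

            if content.status_code == 429:
                # Rate limit reached: wait until one hour after the last
                # successful request
                end = datetime.datetime.now()
                time_next_batch = begin + datetime.timedelta(hours=1)
                timesecond = max((time_next_batch - end).total_seconds(), 0)

                dic_last_id['time_stop'] = end.strftime('%Y-%m-%dT%H:%M:%SZ')

                print("Next batch in {:.1f} minutes. It will happen at {}".format(
                    timesecond / 60, time_next_batch.strftime("%H:%M:%S")))
            else:
                # Any other error: pause briefly instead of retrying at once
                # (60 seconds is an arbitrary choice)
                timesecond = 60

            # Log the failed request so the collection can be resumed later
            name = "data/logs/{}_{}_log.json".format(ticker, first_id)
            with open(name, 'w') as outfile:
                json.dump(dic_last_id, outfile)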

Collect all messages from the last three months

In [None]:
ticker = "BTC.X"
extract_ticker(ticker = ticker)