# Project Introduction
#1 Our goal is to explore how the bitcoin market price trends in response to market news and social media news.
#2 The first raw dataset was crawled mainly from cryptocurrencynews.com. The crawl was already done in the data_collect.ipynb notebook in the same directory of the GitHub repository, so here we directly use the data stored in the <b>cryptocurrencynews_com_bitcoin_news.csv</b> file.

#### Changes to the project
1> Our goal remains the same, but the data source has changed.

Previously we planned to get data from the mediacloud.com website. We later found that this data does not fit our goal of using article content to judge market price variation. So we switched to re-crawling the data from the main cryptocurrency market news websites and from social media, such as tweets. This crawling tends to be slow.

2> to be continued.

In [1]:
# Pre import
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt

## Read the data from raw data .csv files

In [11]:
# Read the daily Bitcoin market price data
priceOfBitcoin = pd.read_csv('bitcoin-24hv0.csv')
priceOfBitcoin.rename(columns={'Series 1':'price', 'DateTime':'date'}, inplace=True)
print(priceOfBitcoin.head(), priceOfBitcoin.describe())
priceOfBitcoin['date'] = priceOfBitcoin['date'].str[:10]  # keep only the YYYY-MM-DD part
priceOfBitcoin.head()
# we add a new column 'Rising' to flag whether the price rose compared with the previous day's price
# 1: rising, 0: not rising
# priceOfBitcoin['Rising']

                  date   price
0  2013-04-28 00:00:00  135.30
1  2013-04-29 00:00:00  134.44
2  2013-04-30 00:00:00  144.00
3  2013-05-01 00:00:00  139.00
4  2013-05-02 00:00:00  116.38               price
count   2139.000000
mean    2349.127901
std     3371.049914
min       68.500000
25%      334.485000
50%      616.310000
75%     3603.660150
max    19497.400000


Unnamed: 0,date,price
0,2013-04-28,135.3
1,2013-04-29,134.44
2,2013-04-30,144.0
3,2013-05-01,139.0
4,2013-05-02,116.38
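
The 'Rising' label described in the comments above is left commented out in the cell. One possible way to build it is sketched below, assuming "rising" means strictly higher than the previous row's price and that rows are already sorted by date. The `prices` frame is a toy stand-in for `priceOfBitcoin`, using the five values shown in the output above.

```python
import pandas as pd

# Toy stand-in for priceOfBitcoin (values copied from the head() output above)
prices = pd.DataFrame({
    'date': ['2013-04-28', '2013-04-29', '2013-04-30', '2013-05-01', '2013-05-02'],
    'price': [135.30, 134.44, 144.00, 139.00, 116.38],
})

# diff() computes price[t] - price[t-1]. The first row has no previous day,
# so its diff is NaN; NaN > 0 evaluates to False and the row is labelled 0.
prices['Rising'] = (prices['price'].diff() > 0).astype(int)
print(prices)
```

Whether the first day should be labelled 0 or dropped is a design choice the notebook does not settle.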


In [27]:
# Read from cryptocurrencynews_com_bitcoin_news.csv
newsOfBitcoin = pd.read_csv('tmp/cryptocurrencynews_com_bitcoin_news.csv')
newsOfBitcoin.head()

Unnamed: 0,title,date,author,number_of_views,url
0,A Potential Reason Behind Bitcoin's Surge,"Jun 6, 2017",Josh Li,6506,https://cryptocurrencynews.com/daily-news/bitc...
1,Bitcoin Just A Bubble? Credit Suisse CEO's Pre...,"Nov 2, 2017",Chelsea Roh,3399,https://cryptocurrencynews.com/daily-news/bitc...
2,"BREAKING: Bitcoin Soars 6% to $7,400 in just 2...","Nov 3, 2017",Chelsea Roh,763,https://cryptocurrencynews.com/daily-news/bitc...
3,Can You Make a Fortune out of Bitcoin? Richard...,"Nov 6, 2017",Caroline Harris,3684,https://cryptocurrencynews.com/daily-news/bitc...
4,"Hard Fork News: Bitcoin SegWit2x Suspended, Pr...","Nov 8, 2017",Chelsea Roh,3189,https://cryptocurrencynews.com/daily-news/bitc...


# Preprocessing the Raw data
### 1 format transformation

### 2 duplicate removal and aggregate same date articles

### 3 create labels

### 4 TF-IDF
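
The TF-IDF step listed above has no code in this part of the notebook. The sketch below shows one common way to do it, assuming scikit-learn's `TfidfVectorizer` is acceptable; the notebook does not say which library is intended. The three strings in `docs` are made-up placeholders, not real headlines from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical per-day text, one string per date (placeholder data)
docs = [
    "bitcoin price surges",
    "bitcoin hard fork suspended",
    "altcoins with potential",
]

# Lowercases and tokenizes the text, drops English stop words,
# then weights each remaining term by TF-IDF.
vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

print(tfidf.shape)
print(sorted(vectorizer.vocabulary_))
```

Each row of `tfidf` could then serve as the feature vector for one date, paired with that date's label.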

In [15]:
# Helper to convert the date column in newsOfBitcoin to the format used in priceOfBitcoin
def date_transformer(date_str):
    """
        Converts a date string like "Jun 6, 2018" to the YYYY-MM-DD format.
        args:
            str(date_str): like "Jun 6, 2018"
        returns:
            str(date): like 2018-06-06, or 'null' if the input does not match
    """
    result = ''
    month_dict = {'Jan':'01', 'Feb':'02', 'Mar':'03', 'Apr':'04', 'May':'05', 'Jun':'06', 'Jul':'07', 'Aug':'08',
                 'Sep':'09', 'Oct':'10', 'Nov':'11', 'Dec':'12'}
    splits = date_str.split(',')
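    # malformed input: expect exactly one comma ("Mon D, YYYY")
    if len(splits) != 2:
        return 'null'
    year = splits[1].strip()
    month_day = splits[0].split()
    if len(month_day) != 2:
        return 'null'
    # unknown month abbreviations also count as malformed
    month = month_dict.get(month_day[0])
    if month is None:
        return 'null'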
    day = month_day[1].zfill(2)
    
    result = year + '-' + month + '-' + day
    return result

In [22]:
date_transformer("Jun 6, 2018"), date_transformer("Jun, 6, 2018"), date_transformer("Jun6, 2018"), \
date_transformer("Jul 06, 2018")

('2018-06-06', 'null', 'null', '2018-07-06')
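
As an alternative to the hand-written parser, pandas can parse this date format directly. This is a sketch rather than the notebook's method: it assumes every valid date follows the "Mon D, YYYY" pattern, and unlike `date_transformer` it marks unparseable values as missing (NaN) instead of the string 'null'. The `raw` series is toy input.

```python
import pandas as pd

# Toy input: two valid dates and one value that cannot be parsed
raw = pd.Series(["Jun 6, 2017", "Nov 2, 2017", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw, format="%b %d, %Y", errors="coerce")

# Back to YYYY-MM-DD strings so the column can be matched against priceOfBitcoin['date']
iso = parsed.dt.strftime("%Y-%m-%d")
print(iso)
```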

In [28]:
newsOfBitcoin['date'] = newsOfBitcoin['date'].apply(date_transformer)
newsOfBitcoin.head()
# newsOfBitcoin[newsOfBitcoin['date'] == 'null']

Unnamed: 0,title,date,author,number_of_views,url
0,A Potential Reason Behind Bitcoin's Surge,2017-06-06,Josh Li,6506,https://cryptocurrencynews.com/daily-news/bitc...
1,Bitcoin Just A Bubble? Credit Suisse CEO's Pre...,2017-11-02,Chelsea Roh,3399,https://cryptocurrencynews.com/daily-news/bitc...
2,"BREAKING: Bitcoin Soars 6% to $7,400 in just 2...",2017-11-03,Chelsea Roh,763,https://cryptocurrencynews.com/daily-news/bitc...
3,Can You Make a Fortune out of Bitcoin? Richard...,2017-11-06,Caroline Harris,3684,https://cryptocurrencynews.com/daily-news/bitc...
4,"Hard Fork News: Bitcoin SegWit2x Suspended, Pr...",2017-11-08,Chelsea Roh,3189,https://cryptocurrencynews.com/daily-news/bitc...


#### Merge the price and news datasets

In [45]:
df = priceOfBitcoin.merge(newsOfBitcoin, how='inner', on='date')
df.head()

Unnamed: 0,date,price,title,author,number_of_views,url
0,2017-06-06,2690.84,A Potential Reason Behind Bitcoin's Surge,Josh Li,6506,https://cryptocurrencynews.com/daily-news/bitc...
1,2017-07-10,2525.25,3 Altcoins That Have Potential To Be The Next ...,Josh Li,6957,https://cryptocurrencynews.com/daily-news/cryp...
2,2017-07-10,2525.25,3 Altcoins That Have Potential To Be The Next ...,Josh Li,6979,https://cryptocurrencynews.com/daily-news/cryp...
3,2017-07-12,2332.77,A Crash in the Cryptocurrency Industry Might A...,Caroline Harris,671,https://cryptocurrencynews.com/daily-news/cryp...
4,2017-07-12,2332.77,A Crash in the Cryptocurrency Industry Might A...,Caroline Harris,672,https://cryptocurrencynews.com/daily-news/cryp...
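
An inner merge silently drops news rows whose date has no matching price, including rows where the date came back as 'null'. The sketch below shows one way to inspect such rows before dropping them, using toy frames; the names `prices`, `news`, `checked` and `unmatched` and the titles 'A', 'B', 'C' are illustrative only.

```python
import pandas as pd

# Toy stand-ins for priceOfBitcoin and newsOfBitcoin
prices = pd.DataFrame({'date': ['2017-06-06', '2017-07-10'],
                       'price': [2690.84, 2525.25]})
news = pd.DataFrame({'date': ['2017-06-06', '2017-07-10', '2017-07-10', 'null'],
                     'title': ['A', 'B', 'B', 'C']})

# how='left' keeps every article; indicator=True adds a '_merge' column;
# validate='many_to_one' raises if a date appears twice in the price table.
checked = news.merge(prices, how='left', on='date',
                     indicator=True, validate='many_to_one')

# Articles without a price for their date
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched)
```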


#### Remove duplicates and aggregate same-date articles

In [46]:
df = df.drop_duplicates(subset=['title'])
df['title'] = df.groupby('date')['title'].apply(lambda x: '{%s}'.join(x))
df.head()

Unnamed: 0,date,price,title,author,number_of_views,url
0,2017-06-06,2690.84,,Josh Li,6506,https://cryptocurrencynews.com/daily-news/bitc...
1,2017-07-10,2525.25,,Josh Li,6957,https://cryptocurrencynews.com/daily-news/cryp...
3,2017-07-12,2332.77,,Caroline Harris,671,https://cryptocurrencynews.com/daily-news/cryp...
5,2017-10-20,6011.45,,Jen Jiang,431,https://cryptocurrencynews.com/daily-news/cryp...
6,2017-10-20,6011.45,,Jen Jiang,463,https://cryptocurrencynews.com/daily-news/cryp...
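
The empty title column in the output above is consistent with an index mismatch: `groupby('date')['title'].apply(...)` returns a Series indexed by date, while `df` has an integer index, so the assignment finds no matching labels and fills the column with NaN. The separator `'{%s}'` is also inserted literally between titles. The sketch below reproduces the effect on toy data and shows one possible fix, assuming the intent is one row per date with all titles joined by a space.

```python
import pandas as pd

# Toy data: two articles on one date, one on another (placeholder titles)
toy = pd.DataFrame({
    'date': ['2017-07-10', '2017-07-10', '2017-07-12'],
    'price': [2525.25, 2525.25, 2332.77],
    'title': ['first headline', 'second headline', 'third headline'],
})

# The grouped result is indexed by date strings, so assigning it to a frame
# with an integer index aligns on nothing and produces NaN everywhere.
misaligned = toy.groupby('date')['title'].apply(lambda s: ' '.join(s))
broken = toy.assign(title=misaligned)

# Possible fix: aggregate to one row per date, keeping the (identical) price
daily = toy.groupby('date', as_index=False).agg(
    price=('price', 'first'),
    title=('title', lambda s: ' '.join(s)),
)
print(daily)
```

Aggregating this way also removes the repeated dates that remain after `drop_duplicates(subset=['title'])`, such as the two 2017-10-20 rows in the output above.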
