## In this notebook, we conduct the following analysis
### The general idea is to find patterns in how accurate people's bearish / bullish calls are
- Take part of the data by choosing from the 100 packed files produced in the pack data notebook.
- Plug in the symbol within a conversation. This covers the case where a sentiment-labeled message is posted in a conversation but does not contain the symbol (ticker), because the ticker appears only in the parent message. For example, in a conversation about Apple, someone replies "I think it will go down soon" without tagging the ticker directly. We find the parent message and plug its ticker ("Apple") into the untagged reply, so that more of the data can be used.
- Find the most popular tickers, so that the accuracy is not biased by tickers mentioned in only a few messages.
- Fetch the daily price history for those tickers.
- For each message, check whether the bullish / bearish prediction is correct over the coming 1, 3, 7, 14, and 28 days.
- Compute the forecast accuracy by user ID.

In [1]:
import numpy as np
import pandas as pd
import os
import glob
import yfinance as yf
import re
import ast
import matplotlib as mlt
from utils import *
import sys

### Read a part of the data

In [2]:
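def read_clean(end_file, start_file = 1):
    # Raw string so the backslashes in the Windows path are not read as escapes
    path = r"e:\csv_clean_2\pack_csv_"
    # Read pack_csv_<start_file>.csv through pack_csv_<end_file>.csv, inclusive
    dflis = []
    for i in range(start_file,end_file+1):
        sub = pd.read_csv(path + str(i) + '.csv')
        dflis.append(sub)
    df = pd.concat(dflis)
    return df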

In [16]:
raw = read_clean(99,76)

### Plug in symbols from parent messages

In [17]:
def plug_in_reply_symbol(df):
    # use1: the reply messages in df
    use1 = df.loc[df['parent_message_id'].notna()]

    # use2: the non-reply messages in df
    use2 = df.loc[df['parent_message_id'].isna()]

    # use3: message-to-symbol map used for the merge
    use3 = df[['message_id', 'symbol']].rename(columns = {'message_id':'reply_to_id',
                                                          'symbol':'reply_to_symbol'})

    # use4: reply messages joined with the symbol of their parent message
    use4 = use1.merge(use3, how = 'left', left_on = 'parent_message_id', right_on = 'reply_to_id')

    # use5: where a reply has no symbol of its own, take its parent's symbol, if there is one
    use5 = use4.copy()
    use5['symbol'] = use5['symbol'].fillna(use5['reply_to_symbol'])
    use5.drop(columns = ['reply_to_id','reply_to_symbol'], inplace = True)

    # use6: concatenate the symbol-filled reply messages and the non-reply messages
    use6 = pd.concat([use5,use2])

    return use6


In [18]:
use = plug_in_reply_symbol(raw)

In [19]:
print(raw[['symbol']].isnull().sum() / len(raw))
print(use[['symbol']].isnull().sum() / len(use))

symbol    0.586402
dtype: float64
symbol    0.038934
dtype: float64
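
The effect of the symbol plug-in can be seen on a toy frame. This is a minimal sketch, not the notebook's function: it uses `Series.map` instead of a merge, and it assumes `message_id` is unique. The toy data and the names `parent_symbol`, `inherited`, and `filled` are made up for illustration.

```python
import pandas as pd

# Toy data: message 2 replies to message 1 and carries no ticker of its own.
toy = pd.DataFrame({
    "message_id": [1, 2, 3],
    "parent_message_id": [None, 1, None],
    "symbol": ["['AAPL']", None, "['BTC.X']"],
    "sentiment": ["Bullish", "Bearish", "Bullish"],
})

# Map each message_id to its symbol, then look up every message's parent.
# Assumption: message_id is unique, so the lookup is unambiguous.
parent_symbol = toy.set_index("message_id")["symbol"]
inherited = toy["parent_message_id"].map(parent_symbol)

# Keep a message's own symbol when present; otherwise inherit the parent's.
filled = toy.assign(symbol=toy["symbol"].fillna(inherited))
print(filled[["message_id", "symbol"]])
```

In this toy case the reply (message 2) ends up with the parent's `['AAPL']` symbol, which is the same mechanism that lowers the share of missing symbols in the real data.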


### After plugging in the symbols, remove the messages that do not contain sentiment or tickers

In [20]:
use1 = use.loc[(use['sentiment'].notna()) & (use['symbol'].notna())]
use1.shape

(26701357, 8)

In [21]:
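# Work on explicit copies so the column assignments do not trigger SettingWithCopyWarning
use1 = use1.copy()
# The symbol column stores list literals as strings; ast.literal_eval parses them without executing code
use1['symbol'] = [ast.literal_eval(i) for i in use1['symbol']]
use1['symbol_count'] = [len(i) for i in use1['symbol']]
# Keep messages tagged with at most 5 symbols, then keep only the first symbol of each
use2 = use1.loc[use1['symbol_count'] <=5].copy()
use2['symbol'] = [i[0] for i in use2['symbol']]
use2.shape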
use2.shape

(26592579, 9)

### Find popular symbols 

In [22]:
sl = get_nth_popular_symbol(use2, 2000)
symlis = [i for i in sl if '.X' in i] # Keep only the crypto tickers (symbols containing '.X')
symlis

['SHIB.X',
 'BTC.X',
 'DOGE.X',
 'SAFEMOON.X',
 'ADA.X',
 'ETH.X',
 'JASMY.X',
 'VGX.X',
 'ALGO.X',
 'SOL.X',
 'ACH.X',
 'XRP.X',
 'AMP.X',
 'MANA.X',
 'HBAR.X',
 'SAITAMA.X',
 'LTC.X',
 'BTT.X',
 'ETC.X',
 'VET.X',
 'ONE.X',
 'ELON.X',
 'MATIC.X',
 'ATOM.X',
 'LUNC.X',
 'XYO.X',
 'QNT.X',
 'LRC.X',
 'COTI.X',
 'XTZ.X',
 'CKB.X',
 'LINK.X',
 'FTM.X',
 'DOT.X',
 'DGB.X',
 'XLM.X',
 'NU.X',
 'BABYDOGE.X',
 'ANKR.X',
 'VRA.X',
 'STMX.X',
 'HEX.X',
 'MONONOKE.X',
 'OMG.X',
 'POLY.X',
 'CRO.X',
 'ASM.X',
 'KEEP.X',
 'KISHU.X',
 'BCH.X',
 'NEO.X',
 'CELR.X',
 'AUCTION.X',
 'THETA.X',
 'AIDI.X',
 'ICP.X',
 'FEG.X',
 'SC.X',
 'FLOKI.X',
 'BAT.X',
 'VAULT.X',
 'TRX.X',
 'EGC.X',
 'TRB.X',
 'CHZ.X',
 'STORJ.X',
 'REQ.X',
 'FET.X',
 'ARPA.X',
 'OXT.X',
 'SAND.X',
 'INDC.X',
 'GRT.X',
 'BNB.X',
 'AVAX.X',
 'CATGIRL.X',
 'NKN.X',
 'RVN.X',
 'ENJ.X',
 'HOT.X',
 'RYOSHI.X',
 'GTC.X',
 'HNT.X',
 'PLAIR.X',
 'ICX.X',
 'REN.X',
 'KUMA.X',
 'ZEC.X',
 'EGLD.X',
 'CTSI.X',
 'DGEM.X',
 'MXS.X',
 'OMI.X',
 '
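
`get_nth_popular_symbol` comes from `utils` and is not shown in this notebook. A minimal sketch of what such a helper might do, assuming it returns the `n` most frequently mentioned symbols in descending order of message count (the name `top_symbols` and the toy data are hypothetical):

```python
import pandas as pd

def top_symbols(df: pd.DataFrame, n: int) -> list:
    """Return the n most-mentioned symbols, most frequent first (hypothetical helper)."""
    return df["symbol"].value_counts().head(n).index.tolist()

toy = pd.DataFrame({"symbol": ["BTC.X", "AAPL", "BTC.X", "ETH.X", "BTC.X", "AAPL"]})

popular = top_symbols(toy, 2)

# Same idea as the notebook's filter, but anchored to the end of the symbol.
crypto_only = [s for s in popular if s.endswith(".X")]
print(popular, crypto_only)
```

Restricting to symbols with many messages keeps the accuracy figures from being driven by tickers that only a handful of messages mention.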

### Get ticker price history data from yfinance

In [23]:
symdf = get_interval(use2, symlis, pre_margin = 10, post_margin = 40)
stock_df = get_stock(symdf)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['date'] = pd.to_datetime(df['created_at'])


[*********************100%***********************]  1 of 1 completed

In [24]:
stock_df['ticker'] = [i.replace('-USD','.X') for i in stock_df['ticker']]
stock_df.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,ticker
0,2021-08-08,8e-06,9e-06,7e-06,7e-06,7e-06,923406924.0,SHIB.X
1,2021-08-09,7e-06,8e-06,7e-06,8e-06,8e-06,513264306.0,SHIB.X
2,2021-08-10,8e-06,8e-06,7e-06,8e-06,8e-06,358181887.0,SHIB.X
3,2021-08-11,8e-06,8e-06,8e-06,8e-06,8e-06,567005556.0,SHIB.X
4,2021-08-12,8e-06,8e-06,7e-06,8e-06,8e-06,447258857.0,SHIB.X
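
`get_interval` and `get_stock` also live in `utils`. The sketch below shows one way the download window per symbol could be derived, assuming `pre_margin` and `post_margin` are day counts added before the first and after the last message, and assuming `.X` crypto symbols correspond to `-USD` tickers on the price source (the reverse of the `replace('-USD','.X')` step above). The names `to_yahoo` and `message_windows` are hypothetical.

```python
import pandas as pd

def to_yahoo(symbol: str) -> str:
    """Map a '.X' crypto symbol to a '-USD' ticker (assumed naming convention)."""
    return symbol[:-2] + "-USD" if symbol.endswith(".X") else symbol

def message_windows(df, symbols, pre_margin=10, post_margin=40):
    """Per symbol, the date span of its messages widened by the two margins (in days)."""
    sub = df.loc[df["symbol"].isin(symbols)].copy()
    # Drop the timezone and the time of day so that only calendar dates remain.
    sub["date"] = pd.to_datetime(sub["created_at"]).dt.tz_localize(None).dt.normalize()
    span = sub.groupby("symbol")["date"].agg(start="min", end="max").reset_index()
    span["start"] = span["start"] - pd.Timedelta(days=pre_margin)
    span["end"] = span["end"] + pd.Timedelta(days=post_margin)
    span["ticker"] = span["symbol"].map(to_yahoo)
    return span

toy = pd.DataFrame({
    "symbol": ["SHIB.X", "SHIB.X", "AAPL"],
    "created_at": ["2021-08-18T10:00:00Z", "2021-08-20T23:59:00Z", "2021-08-19T12:00:00Z"],
})
windows = message_windows(toy, ["SHIB.X"])
print(windows)
```

The post margin has to be at least as long as the longest forecast horizon (28 days here), otherwise the last messages have no future price to compare against.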


### Check the forecast accuracy and compute the FA files

In [25]:
pre_df = get_predict_df(use2,stock_df, pre_lis = [1,3,7,14,28])
check2 = check_pre(pre_df)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['date'] =  pd.to_datetime(pd.to_datetime(df['created_at']).dt.date)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[col] = (df['date'] + pd.DateOffset(days=i))

In [26]:
check2.head()

Unnamed: 0.1,Unnamed: 0,message_id,user_id,message_body,created_at,sentiment,parent_message_id,symbol,symbol_count,+1,+3,+7,+14,+28
0,893,37001966,45603,$BTCUSD - It&#39;s decision time! - http://stk...,2015-05-16T20:40:06Z,Bearish,,BTC.X,1,0.0,1.0,0.0,1.0,1.0
1,307,37000355,45603,$BTCUSD - Bitcoin in the short term - http://s...,2015-05-16T18:10:06Z,Bearish,,BTC.X,1,0.0,1.0,0.0,1.0,1.0
2,58,37002062,45603,$BTCUSD - It&#39;s decision time! - http://stk...,2015-05-16T20:50:06Z,Bearish,,BTC.X,1,0.0,1.0,0.0,1.0,1.0
3,125,37007127,45603,$BTCUSD - BTCUSD daily and 240 - http://stks.c...,2015-05-17T07:50:05Z,Bullish,,BTC.X,1,0.0,0.0,1.0,0.0,0.0
4,229,37005260,45603,$BTCUSD - Head &amp; Shoulder pattern : long t...,2015-05-17T02:20:06Z,Bullish,,BTC.X,1,0.0,0.0,1.0,0.0,0.0
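
The `+1` ... `+28` columns hold 1.0 where the call matched the price move and 0.0 where it did not. `get_predict_df` and `check_pre` are not shown, so the following is only a sketch of one plausible rule: a bullish call counts as correct if the close `h` days later is above the close on the message date, a bearish call if it is below, and a flat price counts as wrong. The function name `check_direction` and the toy prices are assumptions.

```python
import numpy as np
import pandas as pd

def check_direction(messages, prices, horizons=(1, 3, 7, 14, 28)):
    """Add one column per horizon: 1.0 if right, 0.0 if wrong, NaN if a price is missing."""
    # Assumption: each (ticker, Date) pair appears at most once in prices.
    close = prices.set_index(["ticker", "Date"])["Close"]
    out = messages.copy()
    sign = out["sentiment"].map({"Bullish": 1.0, "Bearish": -1.0}).to_numpy()

    def close_on(dates):
        # Look up the close for each message's symbol on the given dates.
        keys = pd.MultiIndex.from_arrays([out["symbol"], dates])
        return close.reindex(keys).to_numpy()

    base = close_on(out["date"])
    for h in horizons:
        # Positive when the price moved in the direction the message called.
        move = (close_on(out["date"] + pd.Timedelta(days=h)) - base) * sign
        out[f"+{h}"] = np.where(np.isnan(move), np.nan, (move > 0).astype(float))
    return out

prices = pd.DataFrame({
    "Date": pd.date_range("2021-01-01", periods=5),
    "Close": [100.0, 110.0, 105.0, 90.0, 95.0],
    "ticker": "BTC.X",
})
messages = pd.DataFrame({
    "symbol": ["BTC.X"] * 3,
    "sentiment": ["Bullish", "Bearish", "Bullish"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-04"]),
})
checked = check_direction(messages, prices, horizons=(1, 3))
print(checked)
```

The third toy message has no price three days ahead, so its `+3` value stays missing rather than being counted as a wrong call.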


In [27]:
check2.to_csv(r"e:\mid\cpt_4qt.csv")

In [28]:
FA1_user = compute_fa(check2, by_col = ['user_id'])
FA1_user.to_csv(r"e:\FA_tracking\Four_part\cpt_qt_4_by_user.csv")
FA1_user.head()

Unnamed: 0,user_id,+1,+3,+7,+14,+28,message_count
0,5684602,0.526091,0.516627,0.510639,0.482678,0.579609,9352
1,4388963,0.543914,0.535529,0.621369,0.703492,0.73833,5784
2,1122346,0.50326,0.423474,0.512045,0.507698,0.441587,5521
3,5478610,0.542612,0.593518,0.697426,0.746616,0.661582,5245
4,3881057,0.329961,0.387466,0.353913,0.376316,0.454264,4843
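
`compute_fa` is defined in `utils`. Judging by the output above, it appears to average the 0/1 correctness columns per group and report the number of messages, sorted by message count. A minimal sketch under that assumption (the name `accuracy_by` and the toy data are hypothetical):

```python
import pandas as pd

def accuracy_by(checked, by_col, horizons=("+1", "+3", "+7", "+14", "+28")):
    """Mean of the 0/1 correctness columns per group, plus the group's message count."""
    cols = [c for c in horizons if c in checked.columns]
    grouped = checked.groupby(by_col)
    fa = grouped[cols].mean()  # missing values are skipped by mean()
    fa["message_count"] = grouped.size()
    return fa.sort_values("message_count", ascending=False).reset_index()

toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "+1": [1.0, 0.0, 1.0, 0.0],
    "+3": [0.0, 0.0, 1.0, 1.0],
})
fa = accuracy_by(toy, ["user_id"])
print(fa)
```

Accuracy for users with very few messages is noisy, so it may be worth filtering on `message_count` before comparing users.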
