## BTC/USD Dataset Exploration

##### Short dataset description
The dataset was sourced from Kaggle: https://www.kaggle.com/datasets/prasoonkottarathil/btcinusd?resource=download&select=BTC-Hourly.csv. It contains hourly historical data for the BTC/USD pair, retrieved from Gemini (a centralized crypto exchange). The columns are:

1) **open** price of the hourly candle
2) **close** price of the hourly candle
3) **low** price (min) of the hourly candle
4) **high** price (max) of the hourly candle
5) **volume** traded during the hourly candle, denominated in the respective currency
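
The column definitions above imply simple consistency rules for each candle. The following is a minimal sketch, assuming lower-case column names matching the list above and a hypothetical two-row sample rather than the real file. It checks that `low` and `high` bound the candle body and derives two common features.

```python
import pandas as pd

# Hypothetical sample; column names are assumed to follow the list above.
candles = pd.DataFrame(
    {
        "open": [100.0, 105.0],
        "close": [105.0, 103.0],
        "low": [99.0, 102.0],
        "high": [106.0, 107.0],
        "volume": [12.5, 8.0],
    }
)

# A well-formed candle has low <= min(open, close) and high >= max(open, close).
body_low = candles[["open", "close"]].min(axis=1)
body_high = candles[["open", "close"]].max(axis=1)
is_valid = (candles["low"] <= body_low) & (candles["high"] >= body_high)

# Per-candle price range and open-to-close return.
candles["range"] = candles["high"] - candles["low"]
candles["return"] = candles["close"] / candles["open"] - 1
```

Rows where `is_valid` is `False` would be worth inspecting before any further analysis.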

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('Bitcoin_tweets.csv', engine="python")
df_2 = pd.read_csv('Bitcoin_tweets_dataset_2.csv', engine="python")

In [4]:
df = pd.concat([df, df_2]).drop_duplicates()

In [5]:
df.head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,DeSota Wilson,"Atlanta, GA","Biz Consultant, real estate, fintech, startups...",2009-04-26 20:05:09,8534.0,7605,4838,False,2021-02-10 23:59:04,Blue Ridge Bank shares halted by NYSE after #b...,['bitcoin'],Twitter Web App,False
1,CryptoND,,😎 BITCOINLIVE is a Dutch platform aimed at inf...,2019-10-17 20:12:10,6769.0,1532,25483,False,2021-02-10 23:58:48,"😎 Today, that's this #Thursday, we will do a ""...","['Thursday', 'Btc', 'wallet', 'security']",Twitter for Android,False
2,Tdlmatias,"London, England","IM Academy : The best #forex, #SelfEducation, ...",2014-11-10 10:50:37,128.0,332,924,False,2021-02-10 23:54:48,"Guys evening, I have read this article about B...",,Twitter Web App,False
3,Crypto is the future,,I will post a lot of buying signals for BTC tr...,2019-09-28 16:48:12,625.0,129,14,False,2021-02-10 23:54:33,$BTC A big chance in a billion! Price: \487264...,"['Bitcoin', 'FX', 'BTC', 'crypto']",dlvr.it,False
4,Alex Kirchmaier 🇦🇹🇸🇪 #FactsSuperspreader,Europa,Co-founder @RENJERJerky | Forbes 30Under30 | I...,2016-02-03 13:15:55,1249.0,1472,10482,False,2021-02-10 23:54:06,This network is secured by 9 508 nodes as of t...,['BTC'],Twitter Web App,False


In [6]:
df.shape

(4865604, 13)

In [7]:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df = df.dropna(subset=['date'])
df = df.reset_index(drop=True)
df = df.sort_values(by='date')
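
As a small illustration of what `errors='coerce'` does in the cell above, the sketch below uses a hypothetical three-row sample: unparseable values become `NaT` and are then dropped. It also resets the index after sorting, which is an assumption about the desired result; in the cell above the reset happens before the sort, so the original row labels remain visible in later output.

```python
import pandas as pd

# Hypothetical sample with one malformed date string.
sample = pd.DataFrame(
    {"date": ["2021-02-10 23:59:04", "not a date", "2021-02-05 10:52:04"]}
)

# Unparseable strings are converted to NaT instead of raising an error.
sample["date"] = pd.to_datetime(sample["date"], errors="coerce")
n_bad = int(sample["date"].isna().sum())

# Drop the NaT rows, sort chronologically, then renumber the rows.
cleaned = (
    sample.dropna(subset=["date"])
    .sort_values(by="date")
    .reset_index(drop=True)
)
```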

In [8]:
df.head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
21523,Iconic Holding,"Frankfurt am Main, Germany",Professional Crypto Asset Ventures \nhttps://t...,2021-01-05 13:22:24,301.0,1075,361,False,2021-02-05 10:52:04,2⃣ Debunking 9 #Bitcoin Myths by @Patrick_Lo...,"['Bitcoin', 'cryptocurrency', 'bitcoin', 'cryp...",Twitter Web App,False
21524,Iconic Holding,"Frankfurt am Main, Germany",Professional Crypto Asset Ventures \nhttps://t...,2021-01-05 13:22:24,301.0,1075,361,False,2021-02-05 10:52:04,📖 Weekend Read 📖\n\nKeen to learn about #cryp...,['crypto'],Twitter Web App,False
21522,Iconic Holding,"Frankfurt am Main, Germany",Professional Crypto Asset Ventures \nhttps://t...,2021-01-05 13:22:24,301.0,1075,361,False,2021-02-05 10:52:06,4⃣ 🎙️ Bloomberg LP #CryptoOutlook 2021 with @...,"['CryptoOutlook', 'cryptocurrency', 'bitcoin',...",Twitter Web App,False
21521,Iconic Holding,"Frankfurt am Main, Germany",Professional Crypto Asset Ventures \nhttps://t...,2021-01-05 13:22:24,301.0,1075,361,False,2021-02-05 10:52:07,"5⃣ #Blockchain 50 2021 by @DelRayMan, @Forbe...","['Blockchain', 'cryptocurrency', 'bitcoin', 'c...",Twitter Web App,False
21520,Nick Doevendans,"Edam-Volendam, Nederland","Amateur historicus m.n. WW2, schrijver, muziek...",2020-06-12 16:50:07,37.0,123,410,False,2021-02-05 10:52:26,#reddcoin #rdd @reddcoin to the moon #altcoin ...,"['reddcoin', 'rdd', 'altcoin', 'turnreddcoinin...",Twitter for iPhone,False


In [9]:
df.shape

(4859004, 13)

In [10]:
print("Starting date is :",df['date'].min())
print("Ending date is :",df['date'].max())

Starting date is : 2021-02-05 10:52:04
Ending date is : 2023-03-05 23:59:56


In [11]:
df2 = df.copy(deep=True)

In [12]:
df2.columns

Index(['user_name', 'user_location', 'user_description', 'user_created',
       'user_followers', 'user_friends', 'user_favourites', 'user_verified',
       'date', 'text', 'hashtags', 'source', 'is_retweet'],
      dtype='object')

In [30]:
# Convert 'user_followers' column to numeric
df2['user_followers'] = pd.to_numeric(df2['user_followers'], errors='coerce')

In [43]:
df2[(df2['is_retweet'] == 'False') & (df2['user_followers'] > 100)].shape

(3180923, 13)
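
The filter above compares `is_retweet` with the string `'False'`, which suggests the column was read as text rather than as booleans. A minimal sketch of one way to normalize such a column before filtering, using a hypothetical sample with mixed value types:

```python
import pandas as pd

# Hypothetical sample mixing strings, booleans, and missing values.
sample = pd.DataFrame(
    {
        "is_retweet": ["False", False, "True", None],
        "user_followers": ["8534.0", "128", "abc", "625.0"],
    }
)

# Map textual flags to real booleans; anything unrecognized becomes NaN.
flag = (
    sample["is_retweet"]
    .astype(str)
    .str.strip()
    .str.lower()
    .map({"true": True, "false": False})
)

# Non-numeric follower counts become NaN and fail the comparison below.
followers = pd.to_numeric(sample["user_followers"], errors="coerce")

# Keep rows that are explicitly not retweets and have more than 100 followers.
mask = flag.eq(False) & (followers > 100)
```

Rows with a missing or unrecognized flag are excluded here, which mirrors the behavior of the string comparison.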

In [74]:
df3 = df2[
    df2['source'].isin(['Twitter for iPhone', 'Twitter for Android'])
    & (df2['is_retweet'] == 'False')
    & (df2['user_followers'] > 100)
    & df2['user_description'].notnull()
].copy(deep=True)

In [75]:
isinstance(df3['hashtags'].values[0], list)

False

In [76]:
# The hashtags are stored as strings (see the check above), so lower-case the whole string
df3['hashtags'] = df3['hashtags'].apply(lambda hashtags: str(hashtags).lower())

# Keep only rows whose hashtag string contains 'bitcoin' or 'btc' (substring match)
df3 = df3[df3['hashtags'].apply(lambda hashtags: 'bitcoin' in hashtags or 'btc' in hashtags)]
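
Because the filter above works on the string form of each hashtag list, it is a substring match: a tag such as `btcusdt` also passes. Whether that is desirable depends on the goal. The sketch below, on a hypothetical sample, parses the strings with `ast.literal_eval` and compares exact tag matching with substring matching. The helper `parse_tags` is an illustrative name, not part of the notebook.

```python
import ast

import pandas as pd

# Hypothetical sample of stringified hashtag lists.
sample = pd.DataFrame(
    {"hashtags": ["['Bitcoin', 'FX']", "['btcusdt']", None, "['BTC']"]}
)


def parse_tags(raw):
    """Return a set of lower-cased hashtags, or an empty set if unparseable."""
    if not isinstance(raw, str):
        return set()
    try:
        tags = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return set()
    if not isinstance(tags, (list, tuple, set)):
        return set()
    return {str(tag).lower() for tag in tags}


tag_sets = sample["hashtags"].apply(parse_tags)

# Exact match: the tag itself must be 'bitcoin' or 'btc'.
exact = tag_sets.apply(lambda tags: bool(tags & {"bitcoin", "btc"}))

# Substring match, equivalent in spirit to the filter in the cell above.
substring = (
    sample["hashtags"].astype(str).str.lower().str.contains("bitcoin|btc")
)
```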

In [77]:
df3.head()

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
21516,Youssef Fedda,Australia,"love interesting startups, stocks and innovati...",2021-01-13 07:16:35,2520.0,1085,191,False,2021-02-05 10:58:03,@JulSwap $juld $bnb #Binance #BSC #BinanceSmar...,"['binance', 'bsc', 'binancesmartchain', 'btc',...",Twitter for Android,False
21502,Zapumal,Colombo,"NUS PhD, Lecturer at CSE, Consultant Xeptagon....",2009-06-01 09:06:54,290.0,58,569,False,2021-02-05 11:05:17,We are gaining pace with more and more institu...,['bitcoin'],Twitter for Android,False
21499,EM_CryPT0,Nederland,▪️@CryptoBrothers5 Team ▪️💯% #Crypto▪️#BTC ▪️N...,2010-07-12 17:04:23,16100.0,602,1014,False,2021-02-05 11:08:30,To-do or not To-do. #crypto #btc #Bitcoin #E...,"['crypto', 'btc', 'bitcoin', 'ethereum']",Twitter for iPhone,False
21496,Mr Fulcanelli,Argentina,"be decentralized, be a smart contract",2010-08-23 20:41:38,157.0,96,8570,False,2021-02-05 11:10:39,Node for #Bitcoin \n\n#blockchain #BTC https:/...,"['bitcoin', 'blockchain', 'btc']",Twitter for iPhone,False
21487,Cardano Apps,,CardanoApps - Discover new decentralised apps ...,2020-07-25 16:10:17,331.0,274,1540,False,2021-02-05 11:17:53,#bitcoin #ATH is ~2x since last #ATH. #CARDANO...,"['bitcoin', 'ath', 'ath', 'cardano', 'ada', 'a...",Twitter for iPhone,False


In [78]:
df3.shape

(1140258, 13)

In [79]:
import re
from tqdm import tqdm

In [80]:
df3 = df3.reset_index(drop=True)

In [81]:
# Strip '#' symbols, the scheme and host part of URLs, and @mentions from each tweet
for i, _ in enumerate(tqdm(df3['text'], position=0, leave=True)):
    text = df3.loc[i, 'text']
    text = text.replace("#", "")
    text = re.sub(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', '', text)
    text = re.sub(r'@\w+ *', '', text)
    df3.loc[i, 'text'] = text

100%|██████████| 1140258/1140258 [00:54<00:00, 20859.57it/s]
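
The same cleaning can be expressed with vectorized string methods instead of a row-by-row loop. This is a minimal sketch on a hypothetical two-tweet sample. Note one deliberate difference, which is an assumption about the intended behavior: the pattern `https?://\S+` removes the whole URL including its path, whereas the pattern in the loop above leaves the path behind.

```python
import pandas as pd

# Hypothetical sample tweets.
sample = pd.Series(
    [
        "Node for #Bitcoin \n\n#blockchain #BTC https://t.co/Oz4QGx769e",
        "@JulSwap $juld #BSC",
    ]
)

cleaned = (
    sample.str.replace("#", "", regex=False)       # drop hashtag symbols
    .str.replace(r"https?://\S+", "", regex=True)  # drop full URLs (assumed pattern)
    .str.replace(r"@\w+ *", "", regex=True)        # drop @mentions
    .str.strip()                                   # trim leftover whitespace
)
```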


#### Get sentiment prediction

In [82]:
from transformers import AutoConfig

# Load the configuration of the model
config = AutoConfig.from_pretrained("ProsusAI/finbert")

# Get the labels
labels = [label for key, label in config.id2label.items()]

print(labels)

['positive', 'negative', 'neutral']


In [83]:
# Trial run on the first 2,000 rows; the outputs below reflect this subset
df4 = df3[:2000]

In [84]:
df4

Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,Youssef Fedda,Australia,"love interesting startups, stocks and innovati...",2021-01-13 07:16:35,2520.0,1085,191,False,2021-02-05 10:58:03,$juld $bnb Binance BSC BinanceSmartChain BTC B...,"['binance', 'bsc', 'binancesmartchain', 'btc',...",Twitter for Android,False
1,Zapumal,Colombo,"NUS PhD, Lecturer at CSE, Consultant Xeptagon....",2009-06-01 09:06:54,290.0,58,569,False,2021-02-05 11:05:17,We are gaining pace with more and more institu...,['bitcoin'],Twitter for Android,False
2,EM_CryPT0,Nederland,▪️@CryptoBrothers5 Team ▪️💯% #Crypto▪️#BTC ▪️N...,2010-07-12 17:04:23,16100.0,602,1014,False,2021-02-05 11:08:30,To-do or not To-do. crypto btc Bitcoin Ether...,"['crypto', 'btc', 'bitcoin', 'ethereum']",Twitter for iPhone,False
3,Mr Fulcanelli,Argentina,"be decentralized, be a smart contract",2010-08-23 20:41:38,157.0,96,8570,False,2021-02-05 11:10:39,Node for Bitcoin \n\nblockchain BTC /Oz4QGx769e,"['bitcoin', 'blockchain', 'btc']",Twitter for iPhone,False
4,Cardano Apps,,CardanoApps - Discover new decentralised apps ...,2020-07-25 16:10:17,331.0,274,1540,False,2021-02-05 11:17:53,bitcoin ATH is ~2x since last ATH. CARDANO ADA...,"['bitcoin', 'ath', 'ath', 'cardano', 'ada', 'a...",Twitter for iPhone,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,Free Bitcoin,,#FreeBitcoin #Bitcoin #BTC #Bitcoinmining #El...,2018-08-10 06:29:19,33038.0,1024,8348,False,2021-02-08 18:33:57,FreeBitcoin Bitcoin BTC Bitcoinmining ElonMus...,"['freebitcoin', 'bitcoin', 'btc', 'bitcoinmini...",Twitter for Android,False
1996,Free Bitcoin,,#FreeBitcoin #Bitcoin #BTC #Bitcoinmining #El...,2018-08-10 06:29:19,33038.0,1024,8348,False,2021-02-08 18:34:04,FreeBitcoin Bitcoin BTC Bitcoinmining ElonMus...,"['freebitcoin', 'bitcoin', 'btc', 'bitcoinmini...",Twitter for Android,False
1997,xbt_blvrg,,I trade crypto futures and flip #BTC.,2011-11-09 16:38:56,634.0,192,514,False,2021-02-08 18:34:10,TOTAL $BTC $ALT CRYPTOCAP \n\nI really nailed ...,"['cryptocap', 'btc', 'bitcoin', 'crypto']",Twitter for iPhone,False
1998,Free Bitcoin,,#FreeBitcoin #Bitcoin #BTC #Bitcoinmining #El...,2018-08-10 06:29:19,33038.0,1024,8348,False,2021-02-08 18:34:10,FreeBitcoin Bitcoin BTC Bitcoinmining ElonMus...,"['freebitcoin', 'bitcoin', 'btc', 'bitcoinmini...",Twitter for Android,False


In [92]:
from transformers import pipeline

# Create a text-classification pipeline backed by FinBERT
sentiment_analyzer = pipeline("text-classification", model="ProsusAI/finbert")

# Return the top predicted label and its score for a single text
def get_sentiment(text):
    sentiment = sentiment_analyzer(text)[0]
    return sentiment['label'], sentiment['score']

# Register tqdm with pandas so that progress_apply shows a progress bar
tqdm.pandas()

# Add new columns 'sentiment_label' and 'sentiment_score' to the DataFrame
df4['sentiment_label'], df4['sentiment_score'] = zip(*df4['text'].progress_apply(get_sentiment))

print(df4)

100%|██████████| 2000/2000 [01:04<00:00, 30.90it/s]

          user_name user_location  \
0     Youssef Fedda     Australia   
1           Zapumal       Colombo   
2         EM_CryPT0     Nederland   
3     Mr Fulcanelli     Argentina   
4      Cardano Apps           NaN   
...             ...           ...   
1995   Free Bitcoin           NaN   
1996   Free Bitcoin           NaN   
1997      xbt_blvrg           NaN   
1998   Free Bitcoin           NaN   
1999   Free Bitcoin           NaN   

                                       user_description         user_created  \
0     love interesting startups, stocks and innovati...  2021-01-13 07:16:35   
1     NUS PhD, Lecturer at CSE, Consultant Xeptagon....  2009-06-01 09:06:54   
2     ▪️@CryptoBrothers5 Team ▪️💯% #Crypto▪️#BTC ▪️N...  2010-07-12 17:04:23   
3                 be decentralized, be a smart contract  2010-08-23 20:41:38   
4     CardanoApps - Discover new decentralised apps ...  2020-07-25 16:10:17   
...                                                 ...                  ..


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df4['sentiment_label'], df4['sentiment_score'] = zip(*df4['text'].progress_apply(get_sentiment))
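
The `SettingWithCopyWarning` above typically appears when new columns are assigned to a DataFrame that was itself created as a slice of another one. A minimal sketch of one common way to avoid it is to take an explicit copy of the subset first. The function `fake_sentiment` is a hypothetical stand-in for the FinBERT call so that the example runs without downloading a model.

```python
import pandas as pd

source = pd.DataFrame({"text": ["btc up", "btc down", "btc flat"]})


def fake_sentiment(text):
    """Hypothetical stand-in for the model call; returns (label, score)."""
    if "up" in text:
        return "positive", 0.9
    if "down" in text:
        return "negative", 0.8
    return "neutral", 0.7


# An explicit copy makes the subset independent of `source`.
subset = source.iloc[:2].copy()

# Apply the classifier, then unpack the (label, score) tuples into two columns.
results = subset["text"].apply(fake_sentiment)
labels, scores = zip(*results)
subset["sentiment_label"] = list(labels)
subset["sentiment_score"] = list(scores)
```

The original frame is left untouched, and the assignment targets an object that pandas knows is not a view.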



In [None]:
df4.to_pickle('sentiment.pkl')