# Bitcoin & Twitter

Project on Data Mining by Bontenakel Lenny & Bels Senne.  
 *[Link to our github repository](https://github.com/snenenenenenene/btc-twitter-data-mining)*

## Research question

Does activity around Bitcoin on Twitter influence the price of Bitcoin? If so, how large is this impact?

To answer this, we will analyse the sentiment of each tweet:
	- overall sentiment
	- sentiment within a given period (per month, perhaps?)

We will also look at tweet activity within a given period.
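
One possible way to put a number on "how large" is a Pearson correlation between daily tweet counts and the daily Bitcoin price. The sketch below is only an illustration: the helper `pearson_correlation` is hypothetical and the input values are made up, not taken from our data. A high correlation would also not prove that tweets cause price changes.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up illustration values, NOT taken from the dataset
daily_tweet_counts = [100, 150, 200, 250, 300]
daily_btc_price = [10.0, 15.0, 20.0, 25.0, 30.0]

r = pearson_correlation(daily_tweet_counts, daily_btc_price)
print(r)
```

A value close to 1 or -1 indicates a strong linear relationship, a value close to 0 a weak one.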
	

In [None]:
import datetime
import re
from typing import List

import folium
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import yfinance as yf
from geopy.geocoders import Nominatim
from matplotlib.pyplot import rc
from pattern.en import sentiment
from PIL import Image
from pyspark import SparkContext
from pyspark.sql import *
from pyspark.sql import functions as F
from pyspark.sql.functions import col, desc, asc, udf, max as max_
from pyspark.sql.types import *
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Reuse the running SparkContext if there is one, otherwise start a new one
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
spark = SparkSession(sc)
plt.style.use('ggplot')

## Data Exploration

### BITCOIN

 - Unix Timestamp - Date represented as epoch value
 - Date - date and time when the data point was collected
 - Symbol - Symbol of the currency
 - Open - Open value of the currency
 - High - Highest value of currency in the given minute
 - Low - Lowest value of currency in the given minute
 - Close - Close value of the currency in the given minute
 - Volume - Volume of the currency transacted in the given minute.

### TWITTER

 - user_name - The name of the user, as they've defined it.
 - user_location - The user-defined location for this account's profile.
 - user_description - The user-defined UTF-8 string describing their account.
 - user_created - Time and date when the account was created.
 - user_followers - The number of followers an account currently has.
 - user_friends - The number of friends an account currently has.
 - user_favourites - The number of favourites an account currently has.
 - user_verified - When true, indicates that the user has a verified account.
 - date - UTC time and date when the Tweet was created.
 - text - The actual UTF-8 text of the Tweet.
 - hashtags - All the other hashtags posted in the tweet along with #Bitcoin & #btc.
 - source - Utility used to post the Tweet; Tweets from the Twitter website have a source value of web.
 - is_retweet - Indicates whether this Tweet has been Retweeted by the authenticating user.

# Tweets

We load the tweet data from a dataset found on Kaggle. A link to this dataset can be found in the README of our *[GitHub repository](https://github.com/snenenenenenene/btc-twitter-data-mining)*.

In [None]:
tweets_schema = StructType([
    StructField('user_name', StringType(), True),
    StructField('user_location', StringType(), True),
    StructField('user_description', StringType(), True),
    StructField('user_created', StringType(), True),
    StructField('user_followers', FloatType(), True),
    StructField('user_friends', FloatType(), True),
    StructField('user_favourites', FloatType(), True),
    StructField('user_verified', BooleanType(), True),
    StructField('date', StringType(), True),
    StructField('text', StringType(), True),
    StructField('hashtags', StringType(), True),
    StructField('source', StringType(), True),
    StructField('is_retweet', BooleanType(), True),
])

tweets_df = spark.read.csv(
    "./data/tweets.csv",
    header=True,
    sep=',',
    multiLine=True,
    unescapedQuoteHandling="STOP_AT_CLOSING_QUOTE",
    schema=tweets_schema
)


In [None]:
tweets_df.show()

### Yahoo finance data

Next, we load the Bitcoin financial data from Yahoo Finance using the `yfinance` Python library.

In [None]:
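btc_stock = yf.Ticker("BTC-USD")
end = datetime.datetime(2021, 11, 26)
start = datetime.datetime(2021, 2, 5)

btc_stock = btc_stock.history(start=start, end=end)
# The dates live in the pandas index, which createDataFrame drops,
# so turn the index into a regular column first
btc_df = spark.createDataFrame(btc_stock.reset_index())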


## TWITTER

### Null values

Our dataset, now correctly loaded, contains few null values, except in the `user_location` column. This is not a flaw in the dataset: these users simply chose not to share their location, which should of course be respected. We therefore take no action to fill in this data.  
The red line indicates the total number of rows in our dataset.

In [None]:
fig, ax = plt.subplots(figsize=(16, 8))

ax.axhline(y=tweets_df.count(), color="red", label="Total number of rows")
ax.bar_label(
    ax.bar(
        tweets_df.columns,
        [tweets_df.where(col(c).isNull()).count() for c in tweets_df.columns]
    )
)
ax.legend()

ax.set_xlabel("Columns")
ax.set_ylabel("# Null values")

fig.tight_layout()
plt.show()

## Preparing Data

In [None]:
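def conv_to_int(val):
    # Strings and missing values cannot be converted, so they fall back to 0
    if val is None or isinstance(val, str):
        return 0
    else:
        # Return an int so the result matches the IntegerType declared below
        return int(val)


conv_to_int_udf = udf(conv_to_int, IntegerType())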

# tweets_df = tweets_df.withColumn("user_followers", conv_to_int_udf(col("user_followers")))\
#     .withColumn("user_friends", conv_to_int_udf(col("user_friends")))\
#     .withColumn("user_favourites", conv_to_int_udf(col("user_favourites")))

### Text cleaning

In [None]:
def clean_text(text):
    if isinstance(text, str):
        # Drop the '#' but keep the hashtag word itself
        text = text.replace("#", "")
        # Replace line breaks with a space so adjacent words are not glued together
        text = re.sub(r'\n', ' ', text)
        # Remove links (both http and https)
        text = re.sub(r'https?://\S+', '', text)
        return text
    else:
        return ""


clean_text_udf = udf(clean_text, StringType())

tweets_df = tweets_df.withColumn("text", clean_text_udf(col("text"))).dropna(subset=["user_name"])

## Generating impact score

### Fixing the hashtags array column 
Before we can generate the impact score, we need an array of strings representing the hashtags present in each tweet.
The CSV format we read the data from cannot store arrays, so PySpark loads the column as a plain string. We therefore convert it to an array after reading it in.

In [None]:
def fix_hashtags_array(hashtags_arr_string):
    try:
        closing_bracket = hashtags_arr_string.index(']', 1)
        subject = hashtags_arr_string[1 :closing_bracket]

        result = subject.split(', ') if closing_bracket > 1 else []
        result = ' '.join(result).replace("'", "").split()

        return result
        
    except ValueError:
        return []
    
fix_hashtags_array_udf = udf(lambda x: fix_hashtags_array(x), ArrayType(StringType()))

tweets_df = tweets_df.fillna("[]", subset="hashtags").withColumn("hashtags", fix_hashtags_array_udf(col("hashtags")))
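
The string-to-list step can also be tried outside Spark. The sketch below is an alternative, not the function used above: it assumes the column holds a Python-style list literal such as `['Bitcoin', 'btc']` and parses it with `ast.literal_eval` from the standard library. The name `parse_hashtags` is hypothetical.

```python
import ast


def parse_hashtags(raw):
    """Parse a string like "['Bitcoin', 'btc']" into a list of strings.

    Anything that is missing or is not a list literal yields an empty list.
    """
    if not raw:
        return []
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(value, list):
        return []
    return [str(tag) for tag in value]


print(parse_hashtags("['Bitcoin', 'btc']"))
print(parse_hashtags("not a list"))
```

Unlike manual splitting, this keeps hashtags intact when they contain spaces or apostrophes, but it only works if the assumption about the literal format holds.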

With that done, we can use this array to generate an impact score for every tweet.

In [None]:
from pyspark.sql.functions import struct


def generate_impact_score(tweet):
    # Missing values count as 0 (or no hashtags) so the UDF does not fail on nulls
    followers = tweet.user_followers or 0
    friends = tweet.user_friends or 0
    hashtags = tweet.hashtags or []

    # Verified accounts get a 10% bonus, every hashtag adds another 5%
    coef_verified = 1.1 if tweet.user_verified else 1
    coef_hashtags = 1 + (len(hashtags) / 20)
    return float(((followers + (friends / 4)) * coef_verified * coef_hashtags) / 100)

generate_impact_score_udf = udf(generate_impact_score, FloatType())

tweets_df = tweets_df.withColumn("impact_score", generate_impact_score_udf(struct([tweets_df[x] for x in tweets_df.columns])))
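
To make the formula concrete, here is a small worked example in plain Python. It mirrors the calculation above; the `Tweet` namedtuple and the example account are made up for illustration and do not come from the dataset.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark Row, holding only the fields the score needs
Tweet = namedtuple("Tweet", ["user_followers", "user_friends", "user_verified", "hashtags"])


def impact_score(tweet):
    coef_verified = 1.1 if tweet.user_verified else 1.0
    coef_hashtags = 1 + len(tweet.hashtags) / 20
    reach = tweet.user_followers + tweet.user_friends / 4
    return reach * coef_verified * coef_hashtags / 100


# Made-up account: 1000 followers, 400 friends, verified, two hashtags
example = Tweet(user_followers=1000, user_friends=400, user_verified=True, hashtags=["Bitcoin", "btc"])
print(round(impact_score(example), 2))
```

Here the reach is 1000 + 400 / 4 = 1100, both coefficients are 1.1, and the score comes out at about 13.31.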

# Generating Date Dataframe


In [None]:
date_df = tweets_df.withColumn("date", F.to_date(F.col("date")))
date_df = date_df.groupby("date").count().dropna().filter(
    (col("date") > datetime.datetime(2020, 3, 20)) & (col("date") < datetime.datetime.today())
).sort(asc("date"))

# Tweet counts per day, before the missing days are filled in
counts_df = date_df

def _get_next_dates(start_date: datetime.date, diff: int) -> List[datetime.date]:
    # All days strictly between start_date and the next date that has tweets
    return [start_date + datetime.timedelta(days=days) for days in range(1, diff)]

def _get_fill_dates_df(df: DataFrame, date_column: str, group_columns: List[str], fill_column: str) -> DataFrame:
    get_next_dates_udf = udf(_get_next_dates, ArrayType(DateType()))

    window = Window.orderBy(*group_columns, date_column)

    return (
        df.withColumn("_diff", F.datediff(F.lead(date_column, 1).over(window), date_column))
        .filter(col("_diff") > 1)
        .withColumn("_next_dates", get_next_dates_udf(date_column, "_diff"))
        # Days without tweets get a count of 0, which keeps the column numeric after the union
        .withColumn(fill_column, F.lit(0))
        .withColumn(date_column, F.explode("_next_dates"))
        .drop("_diff", "_next_dates")
    )

fill_df = _get_fill_dates_df(date_df, "date", [], "count")
date_df = date_df.union(fill_df).sort(asc(col("date")))
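
The idea behind the gap filling can be sketched without Spark. The example below is a simplified stand-in under the assumption that the daily counts fit in a dictionary; the helper `fill_missing_days` and the sample counts are hypothetical.

```python
import datetime


def fill_missing_days(counts):
    """Return a sorted list of (date, count) pairs with missing days added as 0.

    counts maps datetime.date to the number of tweets on that day.
    """
    if not counts:
        return []
    first, last = min(counts), max(counts)
    n_days = (last - first).days + 1
    days = [first + datetime.timedelta(days=i) for i in range(n_days)]
    return [(day, counts.get(day, 0)) for day in days]


# Made-up counts with a two-day gap between 5 and 8 February
sample = {datetime.date(2021, 2, 5): 3, datetime.date(2021, 2, 8): 7}
for day, count in fill_missing_days(sample):
    print(day, count)
```

Every calendar day between the first and last observation then appears exactly once, which keeps the time axis continuous when plotting.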

### Twitter

### Most popular users
We start by separating account data from the main dataframe into a new dataframe.
This makes it easy to show which accounts have the most followers.
User accounts change constantly, however: the number of followers, friends and favourites an account has rarely stays the same for long.
Because these values rise and fall, the maximum follower count seen for an account is only a rough snapshot, not necessarily its current value. The chart below uses that maximum as a simple approximation.

In [None]:
accounts_df = tweets_df.groupBy('user_name').max('user_followers').withColumnRenamed('max(user_followers)', 'user_followers').sort(desc('user_followers'))

# Only the top n accounts are plotted, so only those are collected to the driver
n = 20
top_accounts = accounts_df.limit(n).collect()

fig, ax = plt.subplots(figsize=(16, 8))

ax.barh(
    [row.user_name for row in top_accounts],
    [row.user_followers for row in top_accounts],
    color='green',
    label='followers / user'
)

plt.show()

## Where is Elon Musk?

In [None]:
# tweets_df.where(tweets_df.user_name == "Elon Musk").show()
# tweets_df.where(tweets_df.user_name == "Reuters").show()

### Tweets / user

In [None]:
%matplotlib inline

user_volume = tweets_df.groupby("user_name").count().withColumnRenamed("count", "user_count").sort(desc("user_count"))

n = 10
x_rows = user_volume.limit(n).select("user_name").collect()
y_rows = user_volume.limit(n).select("user_count").collect()

fig, ax = plt.subplots(figsize=(16, 8))

ax.bar_label(
    ax.barh([x.user_name for x in x_rows],
           [y.user_count for y in y_rows],
           label='tweets/ user')
)

ax.set_xlabel('# tweets')
ax.set_ylabel('username')

### Bitcoin Value

In [None]:
# df_location = tweets_df.groupBy('user_location').count().sort(col("count").desc()).show()
geolocator = Nominatim(user_agent="example")
location_df = tweets_df.groupBy('user_location').count().filter("count >= 500").where(
    "user_location not in ('Decentralized', 'Moon', '🇦🇺', 'Everywhere', 'Road Warrior', 'Mars', 'Cloud Engineer', 'Planet Earth', 'Earth', 'Blockchain', 'The Blockchain')").sort(
    col("count").desc()).dropna().collect()


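def coords(location_string):
    try:
        # geocode returns None when nothing is found, which raises AttributeError below
        location = geolocator.geocode(location_string)
        return (location.latitude, location.longitude)
    except Exception:
        # Fall back to a fixed placeholder point when geocoding fails
        return (20, 20)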


locations = list(map(lambda r: [r['user_location'], r['count'], coords(r['user_location'])], location_df))
map_tweets = folium.Map(location=[51,10], zoom_start=2)

for location_name, count, location_coords in locations:
    folium.Circle(location=location_coords,
                  popup=f"{location_name}: {count}",
                  radius=count * 50,
                  color="crimson",
                  fill_color="crimson",
                  tooltip=count).add_to(map_tweets)
map_tweets

In [None]:
# X for tweet counts
counts_dates = counts_df.select("date").collect()
counts_x = list(map(lambda r: (r['date']), counts_dates))

# Y: number of tweets per day
y_rows = counts_df.select("count").collect()
tweets_y = list(map(lambda r: r['count'], y_rows))

# X and Y for the BTC opening price, taken from the pandas frame returned by
# yfinance so that every price is paired with its own date
btc_x = [ts.date() for ts in btc_stock.index]
btc_y = [float(v) for v in btc_stock["Open"]]

fig, ax = plt.subplots(figsize=(16, 8))
ax.plot(counts_x, tweets_y, color='blue', label='Tweet Volume')
# ax.set_yscale('log')
ax.tick_params(axis='y')

ax2 = ax.twinx()
ax2.plot(btc_x, btc_y, color='red', label='BTC Value')
# ax2.set_yscale('log')
ax2.tick_params(axis='y')

lines, labels = ax.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(lines + lines2 , labels + labels2, loc=0)
plt.show()


## Sentiment Analysis

Here we compute the polarity and subjectivity of each tweet's text using the `sentiment` function from `pattern.en`.

In [None]:
tweet_text = tweets_df.select("text").collect()
sentiments = [(x.text, *sentiment(x.text)) for x in tweet_text]

sentiment_schema = ["text", "polarity", "subjectivity"]
sentiments_df = spark.createDataFrame(
    data=sentiments,
    schema=sentiment_schema
)

sentiments_df.show()
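
The research question also asks about sentiment within a period. One way to get there, sketched below in plain Python, is to average the polarity per month. This assumes each polarity value is paired with the date of its tweet, which the dataframe above does not yet do; the helper `mean_polarity_per_month` and the sample values are made up for illustration.

```python
import datetime
from collections import defaultdict


def mean_polarity_per_month(rows):
    """Average polarity per (year, month) from an iterable of (date, polarity) pairs."""
    buckets = defaultdict(list)
    for day, polarity in rows:
        buckets[(day.year, day.month)].append(polarity)
    return {month: sum(values) / len(values) for month, values in sorted(buckets.items())}


# Made-up (date, polarity) pairs, NOT taken from the dataset
rows = [
    (datetime.date(2021, 2, 5), 0.5),
    (datetime.date(2021, 2, 20), -0.1),
    (datetime.date(2021, 3, 1), 0.2),
]

monthly = mean_polarity_per_month(rows)
for month, mean in monthly.items():
    print(month, round(mean, 3))
```

The same grouping could be done in Spark with `groupBy` on a month column; the plain-Python version only shows the idea.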

To avoid cluttering the word cloud with filler words such as 'a', 'in' or 'and', we left those out, along with a short, hand-picked list of others.

In [None]:
word_occurance = tweets_df.withColumn('word', F.explode(F.split(F.col('text'), ' '))).groupBy('word').count().sort('count', ascending=False).where(
    "word not in (' ', '', 'the', 'a', 'to', 'and', 'a', 'in', 'of', 'for', 'you', 'will', 'be', 'on', 'this', 'i', 'The', 'are', 'at', 'it', 'I')").limit(100)

twitter_mask = np.array(Image.open('twitter.jpeg'))
freqs = {r.asDict()['word'] : r.asDict()['count'] for r in word_occurance.collect()}
wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white", mask=twitter_mask).generate_from_frequencies(freqs)
plt.figure(figsize=(10, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
# hashtags_occurance = tweets_df.withColumn('hashtag', F.explode(col('hashtags'))).groupBy('hashtag').count().sort('count', ascending=False).limit(20)

hashtags_occurance = tweets_df.select("hashtags").withColumn("hashtag", F.explode(col('hashtags'))).groupBy('hashtag').count().sort(desc('count')).limit(20)

hashtags_occurance.show()


hashtag_rows = hashtags_occurance.select("hashtag").collect()
x = list(map(lambda r: (r['hashtag']), hashtag_rows))

y_rows = hashtags_occurance.select("count").collect()
y = list(map(lambda r: int(r['count']), y_rows))

plt.subplots(figsize=(16, 8))
plt.barh(x,y, color='orange', label='Volume')
plt.show()