MODULE 4 | LESSON 3


---


# **Applications of Alternative Data**


*In this lesson we consider applications of alternative data. Alternative Data in finance refers to non-traditional data sources used to gain insights into market trends, asset pricing, and investment decisions. This contrasts with traditional data like financial statements, company filings, and market prices. Text data from social media falls under this category as it reflects public sentiment, market chatter, and news dissemination that can influence asset prices. Specifically, Twitter and StockTwits data are considered valuable sources of alternative data in financial engineering due to their real-time nature and focus on financial discussions and stock market activity.*

In [1]:
# Load libraries
import datetime
import matplotlib.pyplot as plt
import os
import pandas as pd
import plotly.express as px
import re
import requests
import zipfile

from datetime import datetime, timedelta


## **1 What is Twitter Data**

Twitter data refers to the vast amount of information generated and shared on the Twitter platform. Twitter data is a vast collection of information generated by users on the platform, including tweets, profiles, messages, trends, media, and interactions. It's primarily available in JSON format through the Twitter API, though other methods like web scraping and third-party tools can also be used. Access to historical data is often limited, and an Academic Research Track offers greater access for researchers. Twitter data is a rich resource for understanding and analyzing the online world. This data is valuable for social media analytics, market research, business intelligence, news, public health, political science, and academic research, allowing insights into user behavior, public opinion, and real-time events.

Data Structure and Format:

 - JSON: The Twitter API primarily provides data in JSON (JavaScript Object Notation) format. JSON is a lightweight data-interchange format that is easy for humans to read and write and for machines to parse and generate. Each tweet, user profile, or other data object is represented as a JSON object with various attributes and values.
 - CSV/TSV: Twitter data can also be exported or converted into CSV (Comma-Separated Values) or TSV (Tab-Separated Values) formats for analysis in spreadsheet software or other tools.
 - Databases: For large-scale analysis and storage, Twitter data is often loaded into databases like MySQL, PostgreSQL, or NoSQL databases like MongoDB.

Twitter Data Access and Limitations:

 - Twitter API: While the Twitter API provides a powerful way to access data, there are limitations and restrictions. These include rate limits on the number of requests you can make, access to only a subset of historical data (typically the past 7 days for the standard API), and requirements for developer accounts and API keys.
 - Academic Research Track: For academic researchers, Twitter offers an "Academic Research Track" with elevated access levels and greater data availability. This track requires a detailed research proposal and approval process.
 - Data Bias and Ethical Considerations: It's important to be aware of potential biases in Twitter data, such as the overrepresentation of certain demographics or viewpoints. Ethical considerations are crucial, including respecting user privacy, obtaining informed consent when necessary, and avoiding the spread of misinformation. Twitter data is inherently biased towards active users and publicly shared opinions. This dataset may not represent the views of the entire investor population.
 - Data Sparsity and Quality: The dataset may contain gaps or periods where tweet volume is low, which can affect the analysis. It's essential to verify the data quality and address any potential issues like missing values or inconsistencies before conducting analysis.

There are several options for downloading ready-made Twitter datasets, several avenues exist for obtaining ready-made Twitter data, catering to both researchers and general users, with varying levels of access and data specifics:

 - For researchers: The Twitter Academic Research Track offers the most comprehensive access, but requires application and approval.

 - Public options: Platforms like Kaggle, figshare, Zenodo, Data.World, Hugging Face, IEEE DataPort and Mendeley Data host a variety of datasets on different topics, readily available for download. Kaggle is the most prominent community-run platform among the ones mentioned. IEEE DataPort is more focused on research data sharing within specific fields. Hugging Face is a community-driven platform specializing in NLP and related areas.

 - Specialized sources: Government agencies and organizations may offer datasets relevant to specific events or topics.

When using these platforms, perhaps most important is to adhere to responsible data practices and prioritize legality, user protection, and ethical conduct, ensuring trust and positive societal impact. Data licensing, user privacy, and ethical considerations are essential for responsible data use. Licensing ensures legal compliance, privacy safeguards personal information, and ethics guide fair and respectful data practices. These aspects are crucial for publicly available data as well, requiring adherence to license terms, anonymization for privacy, and awareness of potential biases for ethical analysis.





### **1.1 Getting Twitter Data**

In this lesson we explore Stock Market Tweets Data from IEEE Dataport available at https://dx.doi.org/10.21227/g8vy-5w61. IEEE DataPort is a curated repository for research data, primarily focusing on engineering and technology-related fields. It is operated by the Institute of Electrical and Electronics Engineers (IEEE) and encourages researchers to share their datasets to advance knowledge in their respective domains. This dataset contains a collection of tweets related to the stock market, specifically focusing on the S&P 500 index. It aims to provide data for researchers and practitioners interested in analyzing the relationship between social media sentiment and stock market movements.

This dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. This license allows for the free use, sharing, adaptation, and redistribution of the data, as long as appropriate credit is given to the original creators.

The dataset covers tweets from April 9th, 2020 to July 16th, 2020. This time range provides a substantial amount of data for analysis and covers various market conditions. The data was colected using the S&P 500 tag (#SPX500), the references to the top 25 companies in the S&P 500 index, and the Bloomberg tag (#stocks). The data is packaged in ZIP archive in two CSV files: 'tweets_labelled_09042020_16072020.csv' consists of 5,000 tweets selected using random sampling. Out of those 5,000 tweets, 1,300 were manually annotated in positive, neutral, or negative classes. The second file 'tweets_remaining_09042020_16072020.csv' contains remainder of tweets.

The following code snippet is designed to extract a ZIP file containing Twitter data and then locate all the CSV files within the extracted folder for further processing. The code iterates through the list of CSV files containing Twitter data, reads each file into a pandas DataFrame, standardizes the column name for tweet text, and then combines all the DataFrames into a single DataFrame called `twitter_df`:




In [None]:
# Extract the ZIP file in folder named 'extracted_tweets'
with zipfile.ZipFile('tweets.zip', 'r') as zip_ref:
    zip_ref.extractall('extracted_tweets')

# Get a list of CSV files in the inner 'tweets' folder
tweets_folder = 'extracted_tweets/tweets'
csv_files = [f for f in os.listdir(tweets_folder) if f.endswith('.csv')]

# Loop through CSV files and read them into DataFrames
dfs = []
for csv_file in csv_files:
    file_path = os.path.join(tweets_folder, csv_file)
    df = pd.read_csv(file_path, sep=';')

    # Rename 'full_text' or 'text' column to a 'tweet_text' common name
    if 'full_text' in df.columns:
        df = df.rename(columns={'full_text': 'tweet_text'})
    elif 'text' in df.columns:
        df = df.rename(columns={'text': 'tweet_text'})

    dfs.append(df)

# Concatenate all DataFrames into a single DataFrame and display it
twitter_df = pd.concat(dfs, ignore_index=True)
twitter_df


### **1.2 Adjusting time zone in Twitter data**


Let's now observe Date and Time formats in 'created_at' column. The "+00:00" in the 'created_at' values indicates a timezone. It represents an offset from Coordinated Universal Time (UTC), which in this case is 0 hours. This means that the timestamps are in UTC itself. Let's convert this to New York timezone. We do this by using `pd.to_datetime()` function converts the 'created_at' column to pandas DateTime objects. Then we apply chained `.dt.tz_convert('America/New_York')` function which converts the DateTime objects to the New York timezone using the IANA time zone name 'America/New_York'. This ensures that the timestamps are adjusted for daylight saving time transitions in New York:



In [None]:
# Convert timestamps to the New York timezone
twitter_df['created_at'] = pd.to_datetime(twitter_df['created_at']).dt.tz_convert('America/New_York')
twitter_df


### **1.3 Discovering retweets in Twitter data**

On Twitter, "RT" is a common abbreviation for "retweet". When a user wants to share someone else's tweet with their own followers, they can retweet it.  "RT" is a key indicator of retweets on Twitter. It's used in manual retweets, quote retweets, and is a convention that has become part of Twitter's culture. By searching for "RT" at the beginning of tweets, we can effectively identify retweets in data analysis.

Let's now take our collection of tweets and separate them into two groups: one for retweets and one for original tweets (or other types of tweets that are not retweets). The following code snippet takes `twitter_df` DataFrame containing Twitter data and splits it into two new DataFrames by filtering 'tweet_text' column. Then the code displays both `retweets_df` and `remainder_df` using the `display()` function:



In [None]:
# Create a DataFrame containing only retweets
retweets_df = twitter_df[twitter_df['tweet_text'].str.startswith('RT')]

# Create a DataFrame containing tweets that are not retweets
remainder_df = twitter_df[~twitter_df['tweet_text'].str.startswith('RT')]

# Display the DataFrames
print("## Retweets DataFrame:")
print("This DataFrame contains only retweets from the Twitter data.")
display(retweets_df)

print("\n## Remainder DataFrame:")
print("This DataFrame contains tweets that are not retweets.")
display(remainder_df)


Here we see that `retweets_df` has 350,752 rows and `remainder_df` has 577,921 rows. We should note that DataFrames are separated based on whether a tweet is a retweet or not. The number of rows in `retweets_df` and `remainder_df` gives an idea of the proportion of retweets in dataset but this doesn't directly tell us how many unique tweets were retweeted. A single original tweet can be retweeted many times by different users so that the 350,752 rows in `retweets_df` could represent a much smaller number of unique original tweets that were retweeted multiple times. We should also consider time delay effect based on typical Twitter behavior. It's highly probable that the majority of retweets in our dataset are from the same period but there would be some from slightly earlier period as the original tweets we collected.


### **1.4 Understanding distribution of tweets**

Let's now try to understand the distribution of tweets over time. The following code takes the 'created_at' column as the index for resampling. It then resamples the data weekly and calculates the count of tweets within each weekly interval. Finally, it counts the number of tweets that occurred within each week:

In [None]:
# Get weekly tweet counts
weekly_counts = twitter_df.set_index('created_at').resample('W-MON', label='left', closed='left').count()
weekly_counts = weekly_counts.rename_axis(index="Week Starting")
display(weekly_counts)


Here `.resample('W-MON', label='left', closed='left')` is the core of the operation. `.resample('W-MON')` resamples the data into weekly bins, starting on Mondays ('W-MON'). `label='left'` indicates that the label for each bin should be the left edge (the beginning) of the bin. `closed='left'` specifies that the left edge of each bin is inclusive (included in the bin). This means that tweets created on Monday will be included in the bin for that week.

This weekly distribution table provides opportunity to observe trends in tweet activity over time. Here we see that the tweet counts generally fluctuate but seem to be relatively high throughout June and July. And there were two weeks (May 11th and May 18th) with zero tweets recorded and possibly the following week (May 25th) also with reduced tweet activity. This could indicate a data collection issue or a period of low activity.


### **1.5 Visualising Twitter data**

By considering the data volume and applying appropriate visualization techniques, we can gain more meaningful insights from the daily tweet count plot. Let's now group data by date and plot the daily tweet counts for the `twitter_df` DataFrame to investigate it. The followig code first groups the `twitter_df` DataFrame by the date part of the 'created_at' column using `dt.date` to extract the date. This creates groups of tweets for each day.
Then it counts the number of tweets in each group using `['tweet_text'].count()`, which counts the non-null values in the 'tweet_text' column. We then plot the data:

In [None]:
# GroupBy date and count
daily_tweet_counts_twitter_all = twitter_df.groupby(twitter_df['created_at'].dt.date)['tweet_text'].count()

# Plotting
plt.figure(figsize=(12, 6))

plt.plot(daily_tweet_counts_twitter_all.index, daily_tweet_counts_twitter_all.values, label='Twitter (All Data)', marker='o')

plt.xlabel('Date')
plt.ylabel('Number of Tweets')
plt.title('Daily Tweet Counts - Twitter (All Data)')
plt.legend()
plt.grid(True)

plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability

plt.tight_layout()  # Adjust layout to prevent labels from overlapping
plt.show()


By examining the plot, we are able to identify any gaps or periods where the tweet counts are significantly less or zero. These gaps might indicate missing data or periods where there was a decrease in tweet activity. If we need to further investigate specific gaps or patterns, we can adjust the date range of the plot or add annotations to highlight areas of interest. In our case we know that there is gap for the period from May 10 to May 27 2020. This means there were either no tweets collected or recorded during that period, or they were filtered out during data processing.

There are various Data Imputation techniques to fill the missing data. For a simple approach, we could fill the gap with the average tweet count from the days before and after the gap. For more complex scenarios, we could explore time series imputation techniques or machine learning models to predict the missing values based on patterns in the existing data. However, this should be done cautiously and with awareness of the potential biases introduced by imputation. In our case neither of these techniques might be accurate since there is significant gap during which trends or events might be very unpredictable to infer or model. So we leave this gap as is for now and continue with further explorations.





### **1.6 Discovering cashtags and hashtags in Twitter data**

In Twitter tweets, stock symbols can appear as either cashtags (e.g., \\$AAPL, \\$GOOG) or hashtags (e.g., #AAPL, #GOOG), both typically representing a company's stock ticker symbol. Cashtags are becoming more common, but hashtags are still used for broader stock-related discussions or events. These symbols can be placed anywhere within a tweet and often provide context for the message, linking it to a specific stock or topic.

Let's now uncover some insights into which stocks are being discussed most frequently in our Twitter data. The following code first extracts stock symbols from the beginning of each tweet in the `twitter_df` DataFrame using regular expressions and stores them in a new 'stock_symbols' column. Then, it explodes this column to create separate rows for each symbol in a tweet. It converts the symbols to lowercase, counts their occurrences, sorts them by frequency in descending order, and finally displays the top 10 most frequently mentioned stock symbols in the dataset:

In [None]:
# Discover stock symbols and hashtags
twitter_df['cashtags_hashtags'] = twitter_df['tweet_text'].str.findall(r'(?:\$|#)([a-zA-Z]+\.*[a-zA-Z]+)')
twitter_df.explode('cashtags_hashtags')['cashtags_hashtags'].str.lower().value_counts().sort_values(ascending=False).head(10)


Here `.str.findall(r'(?:\$|#)([a-zA-Z]+\.*[a-zA-Z]+)')` applies a regular expression to find and extract cashtags and hashtags from the tweet text. Let's break down the regex:

 - `(?: ... )`: This creates a non-capturing group. It means this part of the regex is used for matching, but the matched content won't be separately captured or stored.
 - `\$|#`: This matches either a literal dollar sign ($) or a hashtag (#). The ` | ` symbol represents an "or" condition.
 - `[a-zA-Z]+`: Matches one or more letters (a-z, A-Z).
 - `\.*`: Matches zero or one dot (.), allowing for symbols with dots (e.g., \$BRK.B).
 - `[a-zA-Z]+`: Matches one or more letters again, capturing the rest of the symbol after the dot (if any).

So the entire regex in the first line in the above code searches for patterns that look like stock symbol cashtags (e.g., \\$AAPL, \\$AAPL, \\$GOOG, \\$BRK.B) or hashtags and stores them in a new column called 'cashtags_hashtags'.

`twitter_df.explode('cashtags_hashtags')` in the second line of the above code "explodes" the 'cashtags_hashtags' column, transforming each list of cash-/hashtags into separate rows. So, if a tweet had multiple cash-/hashtags, it will now have multiple rows, each with one tag. Then the reminder of second line in the above code takes the extracted cash-/hashtags, counts how many times each tag appears in the tweets, sorts them by frequency, and shows the top 10 most mentioned cash-/hashtags.

Overall, this table highlights that the S&P 500 index (spx) together with SPDR S&P 500 ETF Trust (spy) and E-mini S&P 500 futures (es) (both are financial instruments that track the performance of the S&P 500 index) are among the top 10 mentions in this Twitter dataset. This is not suprising since this Twitter dataset itself was originally collected using S&P 500 tag and other top 25 companies in the S&P 500 index. From the above table we can also gather that other most discussed financial instruments within the analyzed Twitter data are Apple (aapl), Amazon (amzn), FB (fb), and Microsoft (msft). Hashtags "stocks", "stockmarket" and "trading" further emphasize the dataset's focus on stock market-related topics. This information can be valuable for understanding market trends, investor sentiment, and the overall focus of financial discussions on Twitter during the period covered by the dataset.



### **1.7 Filterig Twitter data**

Let's now try to filter some of tweets for specific terms. The following code defines a function called `contains_spy()` that checks if a given text string contains any mention of the SPY ETF (SPDR S&P 500 ETF Trust), considering various textual variations. This time we check entire tweet text instead of just searching for cashtags and hashtags. First we use `text.lower()` which converts the input text to lowercase to ensure case-insensitive matching. `patterns = [...]` defines a list of regular expression patterns representing different ways the SPY ETF might be mentioned. It includes variations like "spy", "$spy", "#spy", "spydr", "s&p 500 etf", with and without word boundaries (\b). Word boundaries ensure that "spy" is not matched within words like "spying". In essence, this function aims to identify text entries in tweets that mention the SPY ETF, regardless of how it is written. Then the code filters the `twitter_df` DataFrame to keep only the tweets that mention the SPY ETF, using the `contains_spy()` function:

In [None]:
# Define a function to check for S&P 500 mentions with variations
def contains_spy(text):
    # Handle variations in spacing, "&", and case
    text = text.lower()  # Convert to lowercase for case-insensitive matching
    text = re.sub(r"[^a-zA-Z0-9 ]", "", text)  # Remove special characters except spaces

    # Check for different patterns
    patterns = [
        r"\bspy\b",  # Using word boundaries (\b) to avoid capturing part of word (e.g. "spying")
        r"\$spy",  # Cashtag
        r"#spy",  # Hashtag
        r"\b\$spy\b",  # Cashtag with word boundaries
        r"\b#spy\b",  # Hashtag with word boundaries
        r"\bspdr\b",  # Full name part using word boundaries
        r"\bs\&p 500 etf\b",  # Full name part with variations using word boundaries
        r"\bs\&p500 etf\b",  # Full name part with variations using word boundaries
        r"\bsp500 etf\b"  # Full name part with variations using word boundaries
        r"\bspy etf\b"  # Full name part with variations using word boundaries
        # ...add more variations as needed...
    ]

    return any(re.search(pattern, text) for pattern in patterns)

# Apply the function to filter the DataFrame and display it
filtered_twitter_df = twitter_df[twitter_df['tweet_text'].apply(contains_spy)]
filtered_twitter_df


The resulting `filtered_twitter_df` DataFrame contains a subset of the original data, specifically focusing on tweets relevant to the SPY ETF. This filtering step is essential for further analysis or visualization tasks when we are specifically interested in tweets related to the SPY ETF.

## **2 What is StockTwits data**

StockTwits is a social media platform specifically designed for investors and traders to share ideas and information about the stock market. It's like Twitter, but with a focus on financial discussions. StockTwits aims to provide a platform for investors of all levels to connect, share ideas, and stay informed about the stock market in real time. It's a valuable tool for understanding market sentiment, discovering new investment ideas, and engaging with a community of like-minded individuals.

StockTwits fosters a strong sense of community by enabling users to connect with each other, follow interesting profiles, and engage in discussions about specific stocks or broader market trends. It's a space for collaborative learning and idea exchange. It ofers ability to engage in Stock-Specific Discussions. Conversations are often organized using "cashtags" (similar to hashtags on Twitter), which are denoted by a dollar sign followed by a stock symbol (e.g., \$AAPL for Apple stock). This feature makes it easy to find discussions and sentiment related to particular companies.

StockTwits' distinct feature is to allow users to express their attitude and overall emotional tone or opinion regarding a particular stock, market trend, or investment strategy. It's essentially a measure of whether people are feeling bullish (positive), bearish (negative), or neutral about a specific topic. On the StockTwits platform, users can tag their messages as either Bearish or Bullish to indicate their sentiment or outlook on a particular stock or the market in general. These tags provide a quick and visual way for other users to understand the sentiment behind a message. These tags can be levereged for:

 - Sentiment Analysis: The Bearish and Bullish tags provide valuable data for sentiment analysis algorithms. These algorithms can use the tags to quickly identify and quantify the overall sentiment expressed by users towards specific stocks or the market as a whole.
 - Filtering and Search: Users can filter their StockTwits feed or search for messages based on these tags. This allows them to focus on messages that align with their own sentiment or to get a sense of the prevailing sentiment around a particular stock.
 - Community Engagement: The tags encourage users to explicitly express their opinions and engage in discussions with others who share similar or opposing views. This fosters a more dynamic and interactive community.

Important considerations for Bearish and Bullish tags on StockTwits:

 - Subjectivity: The Bearish and Bullish tags reflect the individual opinions of the users who post them. These opinions might be based on different factors, including personal biases, research, or speculation. It's important to remember that these tags are subjective and might not always be accurate.

 - Noise and Misinformation: Social media platforms like StockTwits can also contain a significant amount of noise and misinformation. There's always a possibility that some users might intentionally try to manipulate sentiment by posting misleading or exaggerated messages with certain tags. It's advisable to be cautious and to verify information from multiple sources before making any investment decisions based on sentiment data. It's crucial to verify information from multiple sources and to avoid relying solely on sentiment data for investment decisions.

 - Dynamic Nature of Sentiment: Sentiment can change rapidly, especially in response to market events or news. It's essential to monitor sentiment trends over time rather than relying on a single snapshot.

In summary, the Bearish and Bullish tags on StockTwits provide a valuable tool for understanding market sentiment and engaging with the financial community. However, it's important to use them wisely and in conjunction with other forms of analysis to make informed investment decisions.







### **2.1 Getting StockTwits data**

Unfortunately, StockTwits no longer offers a publicly available developer API or developer accounts for general use. StockTwits did have a developer API program in the past, but it have been significantly restricted recently. The previous API allowed developers to access data like streams, sentiment, symbols, and user information. Currently available data access options are via Enterprise Partnerships, Third-Party Data Providers, or Public Datasets.

In our lesson we explore 'SPY Messages Data from Stocktwits' dataset hosted on figshare and available at https://doi.org/10.6084/m9.figshare.20237736.v1. Figshare is a repository for research data, including datasets, figures, and other research outputs. It encourages researchers to share their data openly and allows users to upload and contribute datasets.

This dataset is also available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. This dataset is provided in CSV format and contains a collection of messages posted on the StockTwits platform related to the SPY ETF (SPDR S&P 500 ETF Trust).

The following code downloads this dataset from figshare, saves it as a CSV file, and then reads the data into `stocktwits_df` pandas DataFrame for further analysis:




In [None]:
# Define the chunk size (adjust this value as needed)
chunksize = 10000  

# Create an empty list to store processed chunks
processed_chunks = []  

# Iterate through the CSV file in chunks
for chunk in pd.read_csv("SPY_stocktwits_messages.csv", chunksize=chunksize):
    processed_chunks.append(chunk) # Append the chunk to the list 

# Concatenate the processed chunks into a single DataFrame named stocktwits_df
stocktwits_df = pd.concat(processed_chunks, ignore_index=True) 
stocktwits_df

Please note that size of original 'SPY Messages Data from Stocktwits' dataset hosted on figshare is about 452.52 MB. When dealing with large datasets, like the StockTwits dataset, attempting to load the entire file into memory at once can lead to memory exhaustion and crashes. In or code we download this dataset in chunks to prevent overloading the memory. By reading the data in smaller chunks, we process the dataset piece-by-piece, reducing the memory footprint at any given time. This allows to work with larger-than-memory datasets without running into memory issues.



### **2.2 Adjusting time zone in StockTwits data**

The StockTwits dataset covers much longer period spanning 2 years from 2020 to 2022 than our Twitter dataset. The following code focuses on filtering the StockTwits messages DataFrame (`stocktwits_df`) to include only messages within a specific date range, from April 9, 2020, to July 16, 2020 that corresponds to data availability window for Twitter dataset in previous section of this lesson.

First, we again convert date values to DateTime object. `stocktwits_df['DateTime'] = pd.to_datetime(stocktwits_df['DateTime'])` line converts values in the 'DateTime' column in the `stocktwits_df` DataFrame to pandas DateTime objects. This is essential for performing date-based filtering and comparisons effectively. Then we define the start and end dates of the desired date range and filter the original DataFrame to create a new DataFrame containing only messages posted between April 9, 2020, and July 16, 2020, inclusive:

In [None]:
# Convert 'DateTime' column to DateTime objects
stocktwits_df['DateTime'] = pd.to_datetime(stocktwits_df['DateTime'])

# Filter stocktwits_df DateTime from April 9 to July 16, 2020
start_date = pd.to_datetime('2020-04-09')
end_date = pd.to_datetime('2020-07-16')

filtered_stocktwits_df = stocktwits_df[(stocktwits_df['DateTime'] >= start_date) & (stocktwits_df['DateTime'] <= end_date)].copy()
filtered_stocktwits_df


In this DataFrame we have both 'Timestamp' column and 'DateTime' column. It's common for 'Timestamp' columns to store Unix timestamps, which represent the number of seconds that have elapsed since January 1, 1970, at 00:00:00 Coordinated Universal Time (UTC). This is a standard way to represent time in many systems. Other options are Epoch Time (the number of seconds since a specific reference point), Relative Time (the time elapsed since the start of a specific event or process), or some other custom time formats. There are many ways to investigate and work with Timestamps. As a startig point in our case let's convert 'Timestamp' column to Unix timestamps, but in milliseconds and compare the newly created 'DateTime_from_Timestamp' column with the existing 'DateTime' column to see if they represent the same information or if there are any differences.

In [None]:
# Convert Timestamp values to pandas DateTime object
filtered_stocktwits_df.loc[:, 'DateTime_from_Timestamp'] = pd.to_datetime(filtered_stocktwits_df['Timestamp'], unit='ms')
filtered_stocktwits_df


Ideally both 'Timestamp' and 'DateTime' columns are supposed to represent the same time information. However, here we see that time difference between the newly created 'DateTime_from_Timestamp' column with the existing 'DateTime' column is consistently 4 hours. This suggests that correct timezone should be New York timezone for this dataset. The following code transforms the timestamp data into a more usable and timezone-aware format:

In [None]:
# Convert and localize Timestamp values to NY timezone
filtered_stocktwits_df['DateTime_from_Timestamp'] = pd.to_datetime(filtered_stocktwits_df['Timestamp'], unit='ms', utc=True).dt.tz_convert('America/New_York')
filtered_stocktwits_df


By localizing and converting the timestamps, we ensure that the datetime values in the DataFrame are timezone-aware and represent the correct local time in New York. This is crucial for accurate analysis and interpretation of time-based data. This process of timezone localization and conversion is essential for consistency in datetime representation when working with time-based data, particularly when dealing with data from different sources or locations, to ensure accuracy and consistency in analysis.

### **2.3 Visualising StockTwits data**

Let's now craft the code to plot the `filtered_stocktwits_df` data as a bar plot, grouped by date and showing the counts of 'Bearish', 'Neutral', and 'Bullish' values in the Sentiment column. The following code processes the `filtered_stocktwits_df` DataFrame to replace NaN values in the 'Sentiment' column with 'Neutral', groups the data by date and sentiment, and then creates an interactive bar plot using Plotly with specific color mapping for different sentiments.

In [None]:
# Replace 'NaN' with 'Neutral' in the 'Sentiment' column before grouping, using .loc
filtered_stocktwits_df.loc[filtered_stocktwits_df['Sentiment'].isnull(), 'Sentiment'] = 'Neutral'

# Group the data by date and sentiment, then count occurrences
sentiment_counts = filtered_stocktwits_df.groupby([filtered_stocktwits_df['DateTime_from_Timestamp'].dt.date, 'Sentiment'])['Sentiment'].count().unstack(fill_value=0)
sentiment_counts = sentiment_counts[['Bearish', 'Neutral', 'Bullish']] # reorder columns in specific order

# Create the interactive bar plot using Plotly
fig = px.bar(sentiment_counts, x=sentiment_counts.index, y=sentiment_counts.columns,
             labels={'x': 'Date', 'y': 'Count'},
             title='Daily Sentiment Counts - StockTwits (Bearish, Neutral, Bullish)',
             color_discrete_map={'Bearish': 'red', 'Neutral': 'grey', 'Bullish': 'green'})

# Customize the layout for better readability and display the plot
fig.update_layout(barmode='group', xaxis_tickangle=-45)
fig.show()


This approach allows to visualize the daily sentiment trends on StockTwits, with clear color-coded representation for Bearish, Neutral, and Bullish sentiments, enhancing the interpretability of the plot. The bar plot visualizes daily sentiment counts for StockTwits messages about the SPY ETF, categorized as Bearish (red), Neutral (grey), or Bullish (green). The x-axis represent dates, while the y-axis represent message counts. Three bars per date show counts for each sentiment. Plotly's interactivity enables zooming, hovering for details, and potential data selection, allowing in-depth exploration of sentiment trends over time.

### **2.4 Discovering stock symbols in StockTwits data**

StockTwits messages often contain stock symbols, also known as tickers or cashtags. Users frequently include them to denote the specific stocks they are discussing. These symbols are typically represented by a dollar sign followed by the stock's ticker symbol (e.g., \\$AAPL for Apple, \\$SPY for SPDR S&P 500 ETF Trust).

Similar to how we did for Twitter dataset here again we can extract stock symbols from StockTwits messages. The following code aims to extract stock symbols (tickers) from the 'Sentence' column of the `filtered_stocktwits_df` DataFrame, count their occurrences, and display the top 10 most frequently mentioned tickers:


In [None]:
# Discovering stock symbols in StockTwits data
filtered_stocktwits_df['tickers'] = filtered_stocktwits_df['Sentence'].str.findall('\$[a-zA-Z]+\.*[a-zA-Z]+')
filtered_stocktwits_df.explode(['tickers'])['tickers'].value_counts().sort_values(ascending=False).head(10)


This table shows the top 10 most frequently mentioned stock symbols (tickers) in the StockTwits dataset, along with their respective counts. The most prominent observation to be made is SPY Dominance - the ticker \\$SPY (SPDR S&P 500 ETF Trust) and \\$spy (its lowecase vrsion) is by far the most mentioned stock symbol in the dataset, appearing over 553,000 times. This is not surprising considering that this entire StockTwits dataset is SPY centric.

The table reveals that the StockTwits dataset primarily focuses on discussions about the S&P 500 index and related ETFs. However, there is also significant interest in other major market indices, individual stocks (particularly large-cap companies), and futures contracts. This information provides insights into the topics and trends that are most prevalent within the StockTwits community during the analyzed period.

### **2.5 Comparing Twitter and StockTwits data**

Let's now compare Twitter and StockTwits data. The following code aims to do this by creating a line plot. First we calculate the daily tweet counts from the `filtered_twitter_df` DataFrame by grouping the data by the date part of the 'created_at' column (using `dt.date` to extract the date) and then counting the number of tweets in each group by counting the non-null values in the 'tweet_text' column. We do the similar operation for the `filtered_stocktwits_df` DataFrame, using the 'DateTime_from_Timestamp' column for date grouping and the 'Sentence' column for counting messages. The we create and customize plot:



In [None]:
# Group Twitter data by date and count tweets
daily_tweet_counts_twitter = filtered_twitter_df.groupby(filtered_twitter_df['created_at'].dt.date)['tweet_text'].count()

# Group StockTwits data by date and count messages
daily_tweet_counts_stocktwits = filtered_stocktwits_df.groupby(filtered_stocktwits_df['DateTime_from_Timestamp'].dt.date)['Sentence'].count()

# # Create a figure and plot Twitter and StockTwits data
plt.figure(figsize=(12, 6))
plt.plot(daily_tweet_counts_twitter.index, daily_tweet_counts_twitter.values, label='Twitter', marker='o')
plt.plot(daily_tweet_counts_stocktwits.index, daily_tweet_counts_stocktwits.values, label='StockTwits', marker='x')

# Customize the plot
plt.xlabel('Date')
plt.ylabel('Number of Tweets')
plt.title('Daily Tweet Counts - Twitter vs. StockTwits')
plt.legend()
plt.grid(True)

# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha='right')

# Adjust layout to prevent overlapping labels and display the plot
plt.tight_layout()
plt.show()


First let's observe that for the same time period `filtered_twitter_df` has 99,246  rows, while `filtered_stocktwits_df` has 564,588 rows. This information provides valuable context for interpreting the daily tweet count plots. It highlights the significant difference in the volume of data between the two platforms for the selected period. Here are some additional points to keep in mind:

 - Platform Activity: The difference in row counts suggests that StockTwits might have had much higher activity or user engagement compared to Twitter for the given period and filtering criteria. This could be due to various factors, including the platforms' user base, focus, and popularity within the investment community.

 - Data Collection and Integrity: The difference in data volume might be influenced by the data collection methods used. Data collectors use different APIs or data sources for Twitter and StockTwits and there might be variations in how data was collected, filtered, or processed, leading to differences in the final row counts. We also need to ensure that the data collection and filtering processes were consistent and reliable for both platforms to minimize potential biases or errors in the analysis.

 - Analysis Implications: When comparing trends or patterns between Twitter and StockTwits, keep in mind that the absolute tweet counts might not be directly comparable due to the difference in data volume. We need to consider using normalization or relative metrics to make more meaningful comparisons.

 - Data Gap Context: While the gap in Twitter data from May 10 to May 27 2020 is important to acknowledge, it's also essential to consider it in the context of the overall data volume. If the gap represents a small proportion of the total Twitter data, its impact on the overall analysis might be limited. However, if the gap is significant relative to the total data, it could affect the interpretation of trends or patterns.

By considering the overall data volume difference and its potential implications, we can make more informed interpretations of the daily tweet count plots and draw more robust conclusions from analysis. We need to remember to clearly communicate the data volume difference and any normalization or scaling applied in reports or presentations for transparency.

## **X. Conclusion**

In this lesson we explored applications of alternative data. Alternative Data in finance refers to non-traditional data sources used to gain insights into market trends, asset pricing, and investment decisions. This contrasts with traditional data like financial statements, company filings, and market prices. Text data from social media falls under this category as it reflects public sentiment, market chatter, and news dissemination that can influence asset prices. Specifically, Twitter and StockTwits data are considered valuable sources of alternative data in financial engineering due to their real-time nature and focus on financial discussions and stock market activity.

In this lesson we explored and compared Twitter and StockTwits data to see how they could be used for financial analysis. Main focus was on the SPY ETF (SPDR S&P 500 ETF Trust) and a specific time period. This involved getting data from both Twitter and StockTwits, adjusting timezones, and handling retweets on Twitter; Creating plots to see how many tweets were posted each day on each platform to understand the differences in data volume and user activity.

**References**

 - Bruno Taborda, Ana de Almeida, José Carlos Dias, Fernando Batista, Ricardo Ribeiro, April 15, 2021, "Stock Market Tweets Data", IEEE Dataport, doi: https://dx.doi.org/10.21227/g8vy-5w61

 - Liu, Jin-Xian (2022). SPY_stocktwits_messages.csv. figshare. Dataset. https://doi.org/10.6084/m9.figshare.20237736.v1


---
Copyright 2024 WorldQuant University. This
content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
