MODULE 4 | LESSON 4


---


# **Social Media**


|  |  |
|:---|:---|
|**Reading Time** | 6h |
|**Prior Knowledge** | API, function of social media in general  |
|**Keywords** | API, youtube, reddit, subreddit, sentiment analysis, nltk, keywords |

---

*Social media has become an integral part of modern life, transforming how we communicate, share information, and interact with the world around us. To understand the significance of social media data and its applications today, it's essential to examine the historical evolution of these platforms.*

*The roots of social media can be traced back to the early days of the internet. In 1997, SixDegrees.com launched as the first recognizable social network site, allowing users to create profiles and list friends. This paved the way for subsequent platforms like Friendster, MySpace, and eventually Facebook, which revolutionized the social media landscape. As these platforms grew in popularity and functionality, they began generating vast amounts of user data. This data, encompassing everything from user profiles and connections to content interactions and behavioral patterns, quickly became a valuable resource for researchers, marketers, and businesses.*

*The rise of social media data as a field of study and application coincided with the increasing sophistication of these platforms. Facebook's introduction of the News Feed in 2006, for instance, not only changed how users interacted with the platform but also provided a rich source of data on user preferences and behaviors. Today, social media data is used across various domains, including marketing, public health, political science, and social research. Platforms like YouTube and Reddit offer unique insights into user behavior, content consumption, and public discourse. Through APIs (application programming interfaces) provided by these platforms, researchers and developers can access and analyze this data in unprecedented ways.*

*In this lesson, we will focus on the practical aspects of retrieving and working with social media data. We'll explore how to use APIs to collect data from YouTube and Reddit, demonstrating the real-world applications of social media data analysis. By understanding both the historical context and current methodologies, we can better appreciate the power and potential of social media data in today's digital landscape.*

## **1. What is Social Media?**

Although most of us are already familiar with these platforms, it helps to start from a few formal definitions of social media.

Boyd and Ellison define social network sites as web-based services that allow individuals to
* construct a public or semi-public profile within a bounded system,
* articulate a list of other users with whom they share a connection, and
* view and traverse their list of connections and those made by others within the system.

This definition highlights the structural aspects of social media, such as user profiles, connections, and network visibility.

The Merriam-Webster Dictionary definition emphasizes the communication and community-building aspects of social media: it describes social media as forms of electronic communication (such as websites for social networking and microblogging) through which users create online communities to share information, ideas, personal messages, and other content (such as videos) ("Social media" [*Merriam-Webster.com Dictionary*]).

The Cambridge Dictionary, by contrast, describes social media as technological platforms: websites and computer programs that allow people to communicate and share information on the internet using a computer or mobile phone ("Social media" [*Cambridge Dictionary*]).

In short, we can sum up the aspects of social media as follows:
* User profiles and connections
* Online communities
* Information and content sharing
* Technological platforms
* User-generated content
* Interactivity and collaboration
* Virtual networks
* Expression and communication

Each of the above characteristics is a data-generating process. For example, the "User profiles and connections" part of social media generates demographic data, the social network itself (friends, followers, and connections), and personal interests and preferences. The "Online communities" generate group membership data, community interaction patterns, and interests/trends within different communities.

It should be clear by now that the vast amount of data generated can be used for multiple purposes that depend on the objectives of the researcher. Specifically, in the context of finance, we are very interested in the "User-generated content" from which we can extract:
* text data: posts, comments
* sentiment and opinion data
* multimedia content related to financial topics

This data is valuable for several finance-specific applications:

* Market sentiment analysis: By analyzing the sentiment of social media posts about specific stocks, commodities, market conditions, etc., researchers can quantify public opinion.
* Stock price prediction: Some studies have shown correlations between social media sentiment and short-term stock price changes, making this data potentially useful for trading strategies.
* Financial news dissemination: Social media often spreads financial news faster than traditional media.
* Risk assessment: By monitoring social media discussions, financial institutions can identify potential risks or reputational issues early.
* Cryptocurrency trends: Given the significant role of online communities in cryptocurrency markets, social media data is particularly valuable in this sector.


## **2. Retrieving the Data**

In this section, we will provide several examples of how to use an API to connect to a social media platform, how to retrieve the data, and what the data looks like before we structure it into dataframes for plotting and analysis.

### **2.1 An Example of Quantifying Sentiment Using YouTube Data**

In this subsection, we'll explore how to leverage the YouTube API to gather and analyze public sentiment around significant financial events. We'll use the 2020 market crash as a case study to understand how sentiment can be quantified and compared with financial data like the S&P 500 index.

Pre-requisites:

* Google Account: Ensure you have an active Google account. This is necessary to access Google's API services.

* Google Cloud Project: You'll need to create a project in the Google Cloud Console. This project will host your API key, which is essential for accessing YouTube's data.

YouTube API Setup: Follow these steps to get started:

1. Go to the Google Cloud Console.

2. Create a new project.

3. Navigate to the API Library and enable the YouTube Data API v3.

4. Create credentials (API key) for accessing the API.
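
Once the key exists, it is safer to keep it out of the notebook itself. The following is a minimal sketch of one way to do that, assuming you have stored the key in an environment variable; the variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are illustrative choices, not part of the lesson's code.

```python
import os

def load_api_key(env_var="YOUTUBE_API_KEY", environ=None):
    """Return the API key stored in an environment variable, or raise a clear error."""
    # `environ` can be overridden with a plain dict, which makes the helper easy to test
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key

# Usage (assumes YOUTUBE_API_KEY was exported in your shell beforehand):
# API_KEY = load_api_key()
```

With this approach, sharing the notebook does not expose your credentials.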

In [None]:
import requests
import praw
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import pandas as pd
from collections import defaultdict
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
import yfinance as yf
import seaborn as sns
sns.set()

Key Considerations

1. Quota Management: Be mindful of the API quota limits. Google provides a limited number of requests per day for free. Plan your data retrieval carefully to stay within these limits or you will be forced to stop your workflow for a day.

2. Keyword Selection: The choice of keywords is critical. Think about the terms that were relevant during the time of the financial event. For our case study on the 2020 crash, we used terms like "recession," "crisis," "inflation," etc.

3. Channel Selection: Choose reputable and relevant channels that provide consistent and high-quality content.

4. Data Analysis: After retrieving the data, you'll analyze the frequency of the keywords and correlate this with financial metrics like the S&P 500 index.
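
To make the quota consideration concrete, here is a rough back-of-the-envelope estimator. The per-request costs and the default daily allocation below reflect Google's published quota table at the time of writing and should be treated as assumptions to verify; the function name and the example inputs are hypothetical.

```python
# Quota costs per request for the YouTube Data API v3
# (assumption: check the current quota documentation before relying on these).
SEARCH_LIST_COST = 100        # each search.list page
COMMENT_THREADS_COST = 1      # each commentThreads.list page
DEFAULT_DAILY_QUOTA = 10_000  # default allocation for a new project

def estimate_quota(n_videos, avg_comment_pages_per_video, search_page_size=50):
    """Rough estimate of quota units needed to list videos and fetch their comments."""
    search_pages = -(-n_videos // search_page_size)  # ceiling division
    search_units = search_pages * SEARCH_LIST_COST
    comment_units = n_videos * avg_comment_pages_per_video * COMMENT_THREADS_COST
    return search_units + comment_units

units = estimate_quota(n_videos=500, avg_comment_pages_per_video=3)
print(f"Estimated units: {units} of {DEFAULT_DAILY_QUOTA}")
```

Under these assumptions, listing videos through search is far more expensive per call than paging through comments, so the date range and the number of channels drive most of the cost.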

In [None]:
API_KEY = "YOUR_API_KEY"  
CHANNEL_NAMES = ['cnbc'] # You can include your own selection of YouTube finance channels
CHANNEL_IDs = {} # Dictionary to store channel names and their IDs

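# Function to get channel ID by channel name
def get_channel_id_by_name(channel_name, api_key):
    url = "https://www.googleapis.com/youtube/v3/search"
    # Passing params lets requests URL-encode channel names that contain spaces
    params = {'part': 'snippet', 'type': 'channel', 'q': channel_name, 'key': api_key}
    response = requests.get(url, params=params)
    response.raise_for_status()  # Fail early on quota or API key errors
    data = response.json()
    return data['items'][0]['snippet']['channelId']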

# Get channel IDs for all channel names
for channel_name in CHANNEL_NAMES:
    channel_id = get_channel_id_by_name(channel_name, API_KEY)
    CHANNEL_IDs[channel_name] = channel_id

# Print channel names and their IDs
for channel_name, channel_id in CHANNEL_IDs.items():
    print(f"{channel_name}: {channel_id}")
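
The lookup above assumes that the search always returns at least one channel. A defensive variant might look like the sketch below; `extract_channel_id` is a hypothetical helper, and the two dictionaries are made-up stand-ins shaped like the JSON that the search endpoint returns.

```python
def extract_channel_id(data):
    """Return the first channel ID in a search response, or None if there is none."""
    items = data.get("items", [])
    if not items:
        return None
    return items[0].get("snippet", {}).get("channelId")

# Hypothetical responses shaped like the search endpoint's JSON
ok_response = {"items": [{"snippet": {"channelId": "UC_example_id"}}]}
error_response = {"error": {"code": 403, "message": "quota exceeded"}}

print(extract_channel_id(ok_response))
print(extract_channel_id(error_response))
```

Returning `None` instead of raising a `KeyError` or `IndexError` lets the calling loop skip a channel name that could not be resolved.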

In [None]:
# Define parameters
KEYWORDS = ['recession', 'unemployment', 'crash', 'stimulus', 'crisis', 'inflation'] # List of keywords to search for. You can include your selection of relevant keywords
START_DATE = '2020-01-01T00:00:00Z'
END_DATE = '2020-05-30T23:59:59Z'

In [None]:
# Initialize YouTube API
def initialize_youtube(api_key):
    youtube = build('youtube', 'v3', developerKey=api_key)
    return youtube

# Get videos from a channel
def get_videos(youtube, channel_id, start_date, end_date):
    request = youtube.search().list(
        part='snippet',
        channelId=channel_id,
        publishedAfter=start_date,
        publishedBefore=end_date,
        maxResults=50,
        type='video'
    )
    videos = []
    while request:
        response = request.execute()
        videos.extend(response['items'])
        request = youtube.search().list_next(request, response)
    return videos

# Get comments from a video
def get_comments(youtube, video_id):
    comments = []
    request = youtube.commentThreads().list(
        part='snippet',
        videoId=video_id,
        maxResults=100
    )
    while request:
        try:
            response = request.execute()
            comments.extend(response['items'])
            request = youtube.commentThreads().list_next(request, response)
        except HttpError as e:
            if e.resp.status == 403 and 'commentsDisabled' in str(e):
                print(f"Comments are disabled for video ID: {video_id}")
                break
            else:
                raise e
    return comments

# Filter data and count keyword mentions
def filter_and_count_comments(comments, keywords, start_date, end_date):
    keyword_counts = defaultdict(int)
    start = datetime.strptime(start_date, '%Y-%m-%dT%H:%M:%SZ')
    end = datetime.strptime(end_date, '%Y-%m-%dT%H:%M:%SZ')
    current_date = start

    while current_date <= end:
        keyword_counts[current_date.strftime('%Y-%m-%d')] = 0
        current_date += timedelta(days=1)

    for comment_thread in comments:
        comment = comment_thread['snippet']['topLevelComment']['snippet']
        comment_date = datetime.strptime(comment['publishedAt'], '%Y-%m-%dT%H:%M:%SZ').strftime('%Y-%m-%d')
        comment_text = comment['textDisplay'].lower()
        if any(keyword in comment_text for keyword in keywords):
            keyword_counts[comment_date] += 1

    return keyword_counts

def plot_keyword_counts(keyword_counts, channel_name):
    dates = list(keyword_counts.keys())
    counts = list(keyword_counts.values())
    df = pd.DataFrame({'Date': dates, 'Count': counts})
    df['Date'] = pd.to_datetime(df['Date'])
    df = df.set_index('Date')
    df.sort_index(inplace=True)

    # Pass figsize to df.plot directly; a separate plt.figure() call would leave an empty figure behind
    ax = df.plot(kind='bar', legend=False, width=0.8, figsize=(14, 7))
    ax.set_title(f'Keyword Mentions per Day in {channel_name}')
    ax.set_xlabel('Date')
    ax.set_ylabel('Count of Keywords')

    # Set x-ticks at regular intervals and rotate labels for better readability
    ax.set_xticks(range(0, len(df.index), 14))
    ax.set_xticklabels(df.index.strftime('%Y-%m-%d')[::14], rotation=45, ha='right')

    #plt.tight_layout()
    plt.show()
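
Note that `filter_and_count_comments` uses substring matching (`keyword in comment_text`), so a keyword also matches inside longer words. If whole-word matching is preferred, one possible sketch uses a regular expression with word boundaries; `build_keyword_pattern` is a hypothetical helper, not part of the cell above.

```python
import re

def build_keyword_pattern(keywords):
    """Compile a case-insensitive regex that matches any keyword as a whole word."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)

pattern = build_keyword_pattern(["crash", "crisis", "inflation"])

print(bool(pattern.search("The crash was brutal")))  # keyword appears as a whole word
print(bool(pattern.search("He crashed his car")))    # keyword appears only as a substring
```

The trade-off is that whole-word matching also skips inflected forms such as "crashes", so the keyword list may need to include them explicitly.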



In [None]:
# Main script
# THIS CELL CAN TAKE A LONG TIME TO COMPLETE (UP TO ABOUT 20 MINUTES).
# PLEASE BE PATIENT.

# Initialize the YouTube API client
youtube = initialize_youtube(API_KEY)

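# Fetch videos from the specified channel within the date range
# (look the ID up explicitly instead of relying on the loop variable left over from an earlier cell)
channel_id = CHANNEL_IDs[CHANNEL_NAMES[0]]
videos = get_videos(youtube, channel_id, START_DATE, END_DATE)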

# Initialize a list to store all comments
all_comments = []

# Fetch comments for each video
for video in videos:
    video_id = video['id']['videoId']
    try:
        comments = get_comments(youtube, video_id)
        all_comments.extend(comments)
    except HttpError as e:
        print(f"An error occurred: {e}")

# Filter and count keywords in comments
keyword_counts = filter_and_count_comments(all_comments, KEYWORDS, START_DATE, END_DATE)

In the cell below, you can see what a response looks like. You can use your imagination to think of how the data can be used. For example, the `likeCount` key can be used to assess whether a specific comment should be weighted more than the rest. The `totalReplyCount` values can be used to identify controversial, favorite, or interesting topics.

In [None]:
all_comments[0]  # `comments` only holds the last video's threads and may be empty
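
The weighting idea mentioned above can be sketched as follows. The log-scaled weight is an illustrative choice rather than the lesson's method, `weighted_keyword_score` is a hypothetical helper, and the sample items are made up to mimic the structure of a comment thread response.

```python
import math

def weighted_keyword_score(comment_threads, keywords):
    """Sum log-scaled like weights over top-level comments that mention a keyword."""
    score = 0.0
    for thread in comment_threads:
        snippet = thread["snippet"]["topLevelComment"]["snippet"]
        text = snippet["textDisplay"].lower()
        if any(keyword in text for keyword in keywords):
            # 1 + log(1 + likes): a comment with no likes still counts once
            score += 1.0 + math.log1p(snippet.get("likeCount", 0))
    return score

# Hypothetical items that mimic the structure of a commentThreads response
sample = [
    {"snippet": {"totalReplyCount": 4, "topLevelComment": {"snippet": {
        "textDisplay": "A recession is coming", "likeCount": 9}}}},
    {"snippet": {"totalReplyCount": 0, "topLevelComment": {"snippet": {
        "textDisplay": "Nice video", "likeCount": 50}}}},
]
print(weighted_keyword_score(sample, ["recession"]))
```

The logarithm keeps a single viral comment from dominating the daily total while still rewarding comments that many viewers agreed with.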

In [None]:
# Preparing dataset for visualization
df_keywords = pd.DataFrame(keyword_counts.items()).set_index(0)
df_keywords.index = pd.to_datetime(df_keywords.index)
df_keywords.sort_index(inplace=True)
df_keywords.columns = ['count']
df_keywords['count'] = df_keywords['count'].astype(int)
df_keywords = df_keywords.resample('D').sum()

rolling_sum = df_keywords.loc[:"2020-05-30"].rolling('7D').sum()

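# Downloading S&P500 as a benchmark
# auto_adjust=False keeps the 'Adj Close' column available in recent yfinance versions
gspc = yf.download('^GSPC', start='2020-01-01', end='2020-05-30', auto_adjust=False)['Adj Close']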

# Plotting
fig, ax1 = plt.subplots(figsize=(16, 7))

# Plotting the keywords data
#ax1.plot(df_keywords.loc[:"2020-05-30"].rolling('7D').sum(), color='green', alpha=0.6, label='Rolling weekly sum of keywords')
ax1.bar(rolling_sum.index, rolling_sum.values.flatten(), color='green', alpha=0.6, label='Rolling weekly sum of keywords')
#ax1.set_ylim(-1, 40)
ax1.set_ylabel('Sum of Keywords')

# Creating a twin axis to plot S&P500 data
ax2 = ax1.twinx()
ax2.plot(gspc, color='blue', alpha=0.5, label='S&P500 Price')
ax2.set_ylim(1500, 3400)
ax2.set_ylabel('S&P500 Price')

# Adding legends
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')

plt.title('Rolling weekly sum of keywords and S&P500 Price')
plt.show()

The graph above is an example of how one can quantify sentiment. The simplest approach is to count occurrences of words of our choice. Because words such as "recession" rarely appear in everyday comments without a reason, an unusually high number of mentions can be read as a signal of the prevailing sentiment. The green bars represent the rolling weekly sum of comments that contain keywords like "recession," "crisis," and "inflation," indicating heightened public concern. The blue line tracks the S&P 500 index price, highlighting the market's reaction during the same period. Keyword mentions continue to rise after March 2020, coinciding with the steep decline in the S&P 500, which reflects growing public concern and the impact of the crash on overall sentiment.
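
A visual comparison can be complemented with a simple correlation. The sketch below uses small synthetic series (not real data) in place of the rolling keyword sum and the index level; the variable names are illustrative.

```python
import pandas as pd

# Synthetic stand-ins for the two series (NOT real market or comment data)
dates = pd.date_range("2020-03-02", periods=10, freq="B")
keyword_sum = pd.Series([2, 3, 5, 8, 12, 15, 14, 18, 20, 22], index=dates, dtype=float)
index_level = pd.Series([100, 99, 97, 94, 90, 88, 89, 85, 83, 80], index=dates, dtype=float)

# Align the two series on common dates before comparing them
aligned = pd.concat({"keywords": keyword_sum, "index": index_level}, axis=1).dropna()

# Pearson correlation of levels, and of day-over-day changes
level_corr = aligned["keywords"].corr(aligned["index"])
change_corr = aligned["keywords"].diff().corr(aligned["index"].pct_change())

print(f"Correlation of levels:  {level_corr:.2f}")
print(f"Correlation of changes: {change_corr:.2f}")
```

Two trending series can show a high correlation in levels even when they are unrelated, so checking the correlation of changes is a useful sanity test before drawing conclusions.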

**Exercise 1**

Choose an asset and a period when significant events for this asset occurred. Identify the keywords you want to count and the channel(s) that are appropriate for your analysis. Use the code provided and follow the same workflow as above. Plot the results (you may need to make substantial modifications to the code) and post the plot in the forum along with a "what we see" analysis.

### **2.2 Reddit**

Reddit is a popular social news and discussion website where users can share content and engage in conversations on various topics, including cryptocurrency. For a financial engineering student, Reddit data offers valuable examples of how to gain insight into market sentiment and trends.

In this section, we'll explore how to use the Reddit API to collect and analyze data related to cryptocurrency discussions. The Reddit API allows developers to access Reddit's database of posts, comments, and other content programmatically. To use the API, you'll need to create a Reddit account and register an application to obtain the necessary credentials (client ID, client secret, and user agent). If you get stuck during registration, an AI assistant can walk you through the steps.

Our example focuses on analyzing sentiment in cryptocurrency-related subreddits over the past month. This restriction is due to the limitations of Reddit's API, which was not created for data analysis but merely as a tool for subreddit moderators. The code considers only the last 30 days, counted back from the time you run the code cells, so the results will differ every time you run it.

You are STRONGLY advised not to run the cell that downloads the comments repeatedly, as your app may be restricted from reaching these subreddits (although this can be worked around by creating a new app). This is a good reminder that sentiment analysis data does not come for free: we are bound by API restrictions and the absence of free historical data.
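
One way to avoid hitting the API repeatedly is to cache the downloaded comments on disk. The following is a minimal sketch of that idea; `load_or_fetch` is a hypothetical helper, and `fake_fetch` stands in for the real download so that the example runs without network access.

```python
import json
import os
import tempfile

def load_or_fetch(cache_path, fetch_fn):
    """Return cached records if the file exists; otherwise call fetch_fn and cache the result."""
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return json.load(f)
    records = fetch_fn()
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(records, f)
    return records

# Demonstration with a stand-in fetch function (hypothetical; no network access)
calls = {"n": 0}

def fake_fetch():
    calls["n"] += 1
    return [{"text": "to the moon", "created_utc": 1700000000.0}]

cache_file = os.path.join(tempfile.mkdtemp(), "comments.json")
first = load_or_fetch(cache_file, fake_fetch)
second = load_or_fetch(cache_file, fake_fetch)  # served from disk, no second fetch
print(calls["n"])
```

Because JSON cannot store `datetime` objects, this sketch assumes timestamps are kept as numbers and converted after loading.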

It's also important to note that while we often expect sentiment analysis to align closely with price movements, this correlation isn't always perfect. Factors such as market manipulation, external news, or regulatory changes can cause divergences between sentiment and price. However, sentiment analysis remains a valuable tool for understanding market psychology and potential price directions.

In [None]:
# Initialize the Reddit instance

reddit = praw.Reddit(
    client_id='CLIENT_ID',
    client_secret="CLIENT_SECRET",
    user_agent='USER_AGENT',
    check_for_async=False
)

# Example of fetching the top posts from a subreddit
submissions = reddit.subreddit('finance').top(limit=5)
for submission in submissions:
    print(f"Title: {submission.title}")
    print(f"Score: {submission.score}")
    print(f"URL: {submission.url}")
    print("="*40)

In [None]:
# Define subreddits
subreddits = ['cryptocurrency']

# Define bull and bear sentiment keywords
bull_keywords = ['bull', 'bullish', 'moon', 'lambo', 'hodl', 'buy', 'long', 'uptrend', 'breakout', 'rally']
bear_keywords = ['bear', 'bearish', 'crash', 'dip', 'sell', 'short', 'downtrend', 'correction', 'dump', 'fud']

# Function to get comments from a subreddit
def get_comments(subreddit_name, start_date):
    subreddit = reddit.subreddit(subreddit_name)
    comments = []
    for submission in subreddit.new(limit=None):
        if submission.created_utc < start_date:
            break
        submission.comments.replace_more(limit=0)
        for comment in submission.comments.list():
            comments.append({
                'text': comment.body,
                'created_utc': datetime.fromtimestamp(comment.created_utc),
                'subreddit': subreddit_name
            })
    return comments

# Get comments from the last month
end_date = datetime.now()
start_date = end_date - timedelta(days=30)
all_comments = []

for subreddit in subreddits:
    all_comments.extend(get_comments(subreddit, start_date.timestamp()))

NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. In our analysis, we use several NLTK components:

* `VADER` (Valence Aware Dictionary and sEntiment Reasoner) lexicon: A rule-based sentiment analysis tool specifically attuned to sentiments expressed in social media.
* `SentimentIntensityAnalyzer`: A VADER-based tool that provides sentiment scores for text.
* `word_tokenize`: A function that splits text into individual words or tokens.

These tools allow us to process and analyze the sentiment of Reddit comments effectively, providing insights into the overall mood of cryptocurrency discussions.
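
VADER's `polarity_scores` returns a dictionary whose `compound` value lies between -1 and 1. A common convention, suggested by the VADER authors, is to treat scores of at least 0.05 as positive and at most -0.05 as negative. The sketch below applies that rule; `label_from_compound` is a hypothetical helper, and the scores are made-up inputs rather than output from the analyzer.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Made-up compound scores for illustration
for score in (0.62, -0.31, 0.01):
    print(score, label_from_compound(score))
```

In this lesson we work with the raw compound score, but labels like these can be handy when you want to count positive and negative comments per day.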

In [None]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize

# Download the VADER lexicon and the tokenizer models used by word_tokenize
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('punkt_tab')

# Initialize the sentiment analyzer
sia = SentimentIntensityAnalyzer()

**Exercise 2**

Using the documentation, explain the differences between the `.split(" ")` string method and `word_tokenize`. Demonstrate those differences on a sample paragraph. Explain in simple words what tokenization is in the NLP context.

In [None]:
# Preparing the dataset for visualization

text_df = pd.DataFrame(all_comments) # Create a DataFrame from the comments

text_df['tokenized'] = text_df['text'].apply(word_tokenize) # Tokenize the comments
text_df['tokenized'] = text_df['tokenized'].apply(lambda x: [word.lower() for word in x]) # Convert words to lowercase
text_df['bull_count'] = text_df['tokenized'].apply(lambda x: sum(1 for word in x if word in bull_keywords)) # Count bull keywords
text_df['bear_count'] = text_df['tokenized'].apply(lambda x: sum(1 for word in x if word in bear_keywords)) # Count bear keywords

text_df['sia'] = text_df['text'].apply(lambda x: sia.polarity_scores(x)['compound']) # Calculate sentiment using NLTK

text_df.set_index('created_utc', inplace=True) # Set the index to the created_utc column
text_df.sort_index(inplace=True) # Sort the DataFrame by the index

In [None]:
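# Downloading BTC-USD price as a crypto market proxy.
# auto_adjust=False keeps the 'Adj Close' column available in recent yfinance versions
btcusd = yf.download('BTC-USD', start=start_date, end=end_date, auto_adjust=False)['Adj Close'] # Download Bitcoin price data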

In [None]:
# Superimposing the price on the smoothed sentiment score

fig, ax1 = plt.subplots(figsize=(14, 5)) # Create a figure and axis
ax2 = ax1.twinx() # Create a second axis that shares the same x-axis

smoothened_sia = text_df[['sia']].resample('D').mean().rolling('7D').mean() # Calculate the rolling mean of SIA
ax1.plot(smoothened_sia, color='green', label='Sentiment Analysis Score') # Plot the SIA mean
ax2.plot(btcusd, color='blue', label='BTC-USD') # Plot the Bitcoin price

ax1.set_ylabel('SIA Sentiment') # Set the label for the first axis
ax2.set_ylabel('BTC-USD Price') # Set the label for the second axis
ax1.set_xlabel('Date') # Set the label for the x-axis

# Get the handles and labels from both axes
lines_1, labels_1 = ax1.get_legend_handles_labels()
lines_2, labels_2 = ax2.get_legend_handles_labels()

# Combine the handles and labels, and create a legend
ax1.legend(lines_1 + lines_2, labels_1 + labels_2, loc='upper left')

plt.title('SIA Sentiment vs. BTC-USD Price') # Set the title of the plot
plt.show() # Show the plot

Let's do the same as above, but this time with our own keywords to estimate the sentiment. We will aggregate the daily count of bullish words and of bearish words. As a sentiment estimate, we will use the difference bull - bear. This difference is smoothed with a 7-day window (as in the SIA case) before we superimpose the BTC-USD price.

In [None]:
keywords_bull_bear = text_df[['bull_count', 'bear_count']].resample('D').sum()
bull_minus_bear = keywords_bull_bear['bull_count'] - keywords_bull_bear['bear_count']
bull_minus_bear = bull_minus_bear.rolling('7D').mean()
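
To see what the daily resampling and the 7-day rolling mean do, here is the same computation on a tiny made-up table. The timestamps and counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import pandas as pd

# Four made-up comments with their bull/bear keyword counts
toy = pd.DataFrame(
    {"bull_count": [2, 0, 1, 3], "bear_count": [0, 1, 1, 0]},
    index=pd.to_datetime(["2024-01-01 09:00", "2024-01-01 18:00",
                          "2024-01-02 12:00", "2024-01-04 08:00"]),
)

daily = toy.resample("D").sum()                  # one row per calendar day
diff = daily["bull_count"] - daily["bear_count"]  # bull - bear per day
smoothed = diff.rolling("7D").mean()              # mean over the trailing 7 days

print(pd.DataFrame({"diff": diff, "smoothed": smoothed}))
```

A day without comments (here, 2024-01-03) contributes a difference of zero, and the first values of the rolling mean are averaged over fewer than seven days because a time-based window only requires one observation by default.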

In [None]:
fig, ax1 = plt.subplots(figsize=(14, 5)) # Create a figure and axis
ax2 = ax1.twinx() # Create a second axis that shares the same x-axis


ax1.plot(bull_minus_bear, color='green', label='Bull - Bear') # Plot the difference
ax2.plot(btcusd, color='blue', label='BTC-USD') # Plot the Bitcoin price

ax1.set_ylabel('Bull - Bear counts') # Set the label for the first axis
ax2.set_ylabel('BTC-USD Price') # Set the label for the second axis
ax1.set_xlabel('Date') # Set the label for the x-axis

# Get the handles and labels from both axes
lines_1, labels_1 = ax1.get_legend_handles_labels()
lines_2, labels_2 = ax2.get_legend_handles_labels()

# Combine the handles and labels, and create a legend
ax1.legend(lines_1 + lines_2, labels_1 + labels_2, loc='upper left')

plt.title('Bull - Bear counts vs. BTC-USD Price') # Set the title of the plot
plt.show() # Show the plot

**Exercise 3**

In the Reddit section, we downloaded all the comments posted during a specific period in one subreddit. Using the code provided, select keywords (given the subreddit we have already selected) that will help you search for additional trends (specific altcoins, specific meme coins, etc.). Use both the `nltk` approach and the manual keyword-count approach to estimate sentiment. Post your results in the forum.

## **3. Conclusion**

In conclusion, this lesson has demonstrated the powerful role that social media data can play in financial analysis. We've explored how to retrieve and analyze data from major platforms like YouTube and Reddit, using API interactions and employing simple sentiment analysis techniques. These methods provide valuable insights into market sentiment, public opinion, and the trends that can impact financial markets.

The examples given show how much one can achieve with a **free** API. Keep in mind that data for sentiment analysis is hard to find and most likely expensive. However, by carefully choosing your period and your sources (channels, subreddits, tweets, etc.), you will be able to extract, quantify, and visualize the sentiment.

**References**

* Boyd, Danah M., and Nicole B. Ellison. "Social Network Sites: Definition, History, and Scholarship." *Journal of Computer-Mediated Communication*, vol. 13, no. 1, 2007, pp. 210-230.

* "Social media." *Cambridge Dictionary*, Cambridge University Press, https://dictionary.cambridge.org. Accessed 18 Nov. 2024.

* "Social media." *Merriam-Webster.com Dictionary*, Merriam-Webster, https://www.merriam-webster.com. Accessed 18 Nov. 2024.

---
Copyright 2024 WorldQuant University. This
content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
