# COGS 108 - Data Checkpoint

# Names


- Parv Chordiya
- Vivek Rane
- Sonakshi Mohanty
- Rahul Bulsara
- Varun Naik


# Research Question

Can we identify a measurable correlation between specific social media sentiment indicators (such as counts of positive, neutral, and negative posts) on platforms like X (formerly Twitter) and Reddit, and key Bitcoin price metrics? Specifically, we aim to analyze whether changes in sentiment on these platforms are associated with variations in Bitcoin’s price volatility, daily closing price, and trading volume over defined time frames (e.g., hourly, daily, weekly).

In this project, sentiment will be measured by categorizing posts as positive, neutral, or negative based on commonly used sentiment indicators. We will then examine how these sentiment trends align with Bitcoin's price action and trading activity, using statistical analysis to determine any significant correlations.


## Background and Prior Work


In recent years, social media has changed how information spreads, and it has had a significant effect on financial markets. This is especially true for the cryptocurrency market, and Bitcoin in particular. Because Bitcoin is decentralized and its price is highly volatile, it is more sensitive to public opinion and speculation. Platforms like Twitter and Reddit have become central spaces where people express their views about Bitcoin, and those views can influence its price. A well-known example is Elon Musk's tweets, which have coincided with sharp rises and drops in Bitcoin's price, illustrating how strongly social media sentiment can drive market behavior[^1]("#1").

Previous studies have looked into this connection between social media sentiment and Bitcoin prices, and they’ve uncovered some interesting patterns. One notable study used Twitter data to gauge “public mood” and see how it affects Bitcoin price volatility. They found that major price shifts often happened in response to swings in public sentiment on Twitter. By analyzing the positive and negative tones of tweets, this study showed a significant link between these sentiments and Bitcoin’s price changes, particularly when the market was very volatile[^2]("#2"). Another study looked at Reddit’s cryptocurrency discussions, analyzing how the amount and sentiment of posts impacted Bitcoin prices. They observed that sudden increases in positive or negative discussions were often followed by price changes, suggesting that social media sentiment can lead to short-term trading behavior that affects prices[^3]("#3").

While a lot of research has been done on Bitcoin’s price volatility and market trends, combining social media sentiment analysis with Bitcoin price prediction is still a developing area. This project aims to build on what we know by using sentiment analysis to measure the real-time impact of Twitter and Reddit on Bitcoin prices, hoping to see if we can make accurate price predictions based on sentiment trends. By mixing sentiment analysis with time series modeling, this project seeks to deepen our understanding of the relationship between social media and cryptocurrency markets, adding to ongoing research on how online platforms influence financial markets.


<a name="#1"></a> Molla, R. (2021, June 14). When Elon Musk tweets, crypto prices move. Vox. https://www.vox.com/recode/2021/5/18/22441831/elon-musk-bitcoin-dogecoin-crypto-prices-tesla

<a name="#2"></a> Abraham, J., Higdon, D., Nelson, J., & Ibarra, J. (2018). Cryptocurrency price prediction using tweet volumes and sentiment analysis. https://api.semanticscholar.org/CorpusID:52950647

<a name="#3"></a> Oliveira, N., Cortez, P., & Areal, N. (2017). The impact of microblogging data for stock market prediction: Using Twitter to predict returns, volatility, trading volume and survey sentiment indices. Expert Syst. Appl., 73, 125-144. https://doi.org/10.1016/j.im.2017.10.004 https://api.semanticscholar.org/CorpusID:21682466



# Hypothesis


We hypothesize that there is a positive correlation between social media sentiment toward Bitcoin and Bitcoin's price metrics. Specifically:

Positive Sentiment Hypothesis: Increases in the proportion of positive mentions of Bitcoin on social media platforms like X (formerly Twitter) and Reddit (categorized based on words or phrases associated with optimism, such as "bullish," "gains," or "buy") will be associated with an increase in Bitcoin's closing price, trading volume, or price volatility.

Negative Sentiment Hypothesis: Conversely, increases in the proportion of negative mentions (identified by terms reflecting pessimism, such as "bearish," "losses," or "sell") will correlate with a decrease in these metrics.

For clarity, sentiment will be categorized into positive, neutral, or negative based on keyword counts and sentiment scoring rules that classify posts according to commonly accepted standards. We expect that shifts in sentiment will show a measurable impact on Bitcoin’s price metrics, reflecting the influence of public opinion on market trends.

# Data

## Data overview

Bitcoin Historical Data

- [Link to the dataset](https://www.kaggle.com/datasets/mczielinski/bitcoin-historical-data/data?select=btcusd_1-min_data.csv)
- Number of observations: ~ 6,712,281
- Number of variables: 6

Bitcoin Tweets

- [Link to the dataset](https://www.kaggle.com/datasets/kaushiksuresh147/bitcoin-tweets)
- Number of observations: ~ 4,689,354
- Number of variables: 13

Bitcoin Historical Data: This dataset contains historical Bitcoin market data at 1-minute intervals from select Bitcoin exchanges. The key variables are `Open`, the price of Bitcoin at the start of each 1-minute interval, and `Close`, the price at the end of that interval. `Volume` is also informative, as it gives the volume of Bitcoin transacted during the interval. The dataset is extensive, going back to 2012. We will most likely trim it to observations from 2021 to the present, since this is the timeframe covered by our Bitcoin Tweets dataset.

Bitcoin Tweets: This dataset contains tweets from Twitter/X that include the #Bitcoin and #Btc hashtags. Collection started on 6/2/2021 and is updated daily. An important variable is `date`, the UTC date and time when the tweet was posted. By combining this dataset with the Bitcoin Historical Data on that variable, we can see the price of Bitcoin around the time of each tweet. Another important variable is `text`, the actual content of the tweet. We use sentiment scoring rules to classify each tweet as positive, neutral, or negative. More information about our sentiment analysis is included in our hypothesis.

## Bitcoin Historical Data

In [None]:
import pandas as pd

btc_df = pd.read_csv("/Users/rahulbulsara/Downloads/btcusd_1-min_data.csv")
btc_df

### Bitcoin Data Wrangling

Convert the `Timestamp` column to `datetime` and set it as the index

In [None]:
btc_df['Timestamp'] = pd.to_datetime(btc_df['Timestamp'], unit='s', utc=True)
btc_df.set_index('Timestamp', inplace=True)


Check for missing values and drop any rows that contain them

In [None]:
missing_values = btc_df.isnull().sum()
print("Missing values per column:\n", missing_values)

In [None]:
btc_df.dropna(inplace=True)


Resample data to hourly

In [None]:
btc_hourly = btc_df.resample('h').agg({
    'Open': 'first',
    'High': 'max',
    'Low': 'min',
    'Close': 'last',
    'Volume': 'sum'
})

In [None]:
btc_hourly
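To make the hourly aggregation concrete, here is a minimal sketch on made-up 1-minute bars (the `toy` frame and its values are hypothetical, not taken from the Kaggle file). It uses the same aggregation rules as above: first `Open`, max `High`, min `Low`, last `Close`, and summed `Volume`.

```python
import pandas as pd

# Hypothetical 1-minute bars that straddle an hour boundary (10:58 to 11:01 UTC).
idx = pd.date_range("2021-02-05 10:58", periods=4, freq="min", tz="UTC")
toy = pd.DataFrame(
    {
        "Open": [100.0, 101.0, 102.0, 103.0],
        "High": [101.5, 102.5, 103.5, 104.5],
        "Low": [99.5, 100.5, 101.5, 102.5],
        "Close": [101.0, 102.0, 103.0, 104.0],
        "Volume": [1.0, 2.0, 3.0, 4.0],
    },
    index=idx,
)

# Same aggregation rules as the notebook cell above.
toy_hourly = toy.resample("h").agg(
    {"Open": "first", "High": "max", "Low": "min", "Close": "last", "Volume": "sum"}
)
print(toy_hourly)
```

The two bars before 11:00 collapse into one hourly row and the two bars from 11:00 onward into another, so each hourly row keeps the usual open/high/low/close meaning.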

## Bitcoin Tweets 

In [None]:
tweets_df = pd.read_csv("C:/Users/User/Documents/cogs108_data/Bitcoin_tweets.csv", usecols=['user_followers', 'date', 'text'])



### Bitcoin tweets Data Wrangling

Set index to `datetime` and drop any tweets that don't have valid dates

In [None]:
tweets_df['tweet_datetime'] = pd.to_datetime(tweets_df['date'], utc=True, errors='coerce')
tweets_df.dropna(subset=['tweet_datetime'], inplace=True)

tweets_df.set_index('tweet_datetime', inplace=True)
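As a small illustration of what `errors='coerce'` does here, the sketch below parses a made-up `date` column in which one value is malformed (the `raw` frame and its rows are hypothetical). The bad value becomes `NaT` and is then dropped, which is the behavior the cell above relies on.

```python
import pandas as pd

# Hypothetical rows; the middle one has an unparseable date.
raw = pd.DataFrame(
    {
        "date": ["2021-02-05 10:52:04", "not a date", "2021-02-05 11:03:15"],
        "text": ["btc to the moon", "malformed row", "selling my bitcoin"],
    }
)

# Unparseable dates become NaT instead of raising an error.
raw["tweet_datetime"] = pd.to_datetime(raw["date"], utc=True, errors="coerce")

# Drop the NaT rows, then index by the parsed UTC timestamp.
parsed = raw.dropna(subset=["tweet_datetime"]).set_index("tweet_datetime")
print(parsed)
```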


Clean the tweets by removing URLs, emojis and other non-ASCII characters, punctuation, numbers, and extra whitespace, then lowercase the text.

In [None]:
import re
import string

def clean_tweet_text(text):
    text = str(text)
    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Mention/hashtag removal is currently disabled; the punctuation step
    # below still strips the '@' and '#' characters themselves
    #text = re.sub(r'@\w+|#\w+', '', text)
    # Remove emojis and other non-ASCII characters
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text.lower()


In [None]:
tweets_df['cleaned_text'] = tweets_df['text'].apply(clean_tweet_text)
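A quick way to sanity-check the cleaning steps is to run them on a single made-up tweet. The sketch below restates the cleaning function so that it runs on its own; the `sample` text is hypothetical.

```python
import re
import string

def clean_tweet_text(text):
    """Restatement of the cleaning steps above so this cell runs standalone."""
    text = str(text)
    text = re.sub(r"http\S+|www\.\S+", "", text)            # URLs
    text = text.encode("ascii", "ignore").decode("ascii")   # emojis / non-ASCII
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", "", text)                         # numbers
    text = re.sub(r"\s+", " ", text).strip()                # extra whitespace
    return text.lower()

sample = "#Bitcoin is up 5% today!! 🚀 https://example.com/chart @trader"
print(clean_tweet_text(sample))  # bitcoin is up today trader

# A tweet made only of emojis and a link cleans down to an empty string.
print(repr(clean_tweet_text("🚀🚀 https://t.co/abc")))  # ''
```

Note that the hashtag and mention words survive (only the `#` and `@` symbols are stripped), and that a tweet can end up empty, which is why empty results need to be filtered out afterwards.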


Keep only the tweets that are still non-empty after cleaning.

In [None]:
tweets_df = tweets_df[tweets_df['cleaned_text'] != '']

Sort by datetime

In [None]:
tweets_df.sort_index(inplace=True)

## Sentiment Analysis

We apply the SentimentIntensityAnalyzer to get the polarity scores for each cleaned tweet in our dataset.

Polarity in sentiment analysis refers to the degree of positivity, negativity, or neutrality expressed in a piece of text. The `compound` score we use quantifies it on a scale from -1 (strongly negative) to +1 (strongly positive), with 0 representing neutral sentiment. It is useful for understanding the emotional tone of social media posts, reviews, or other textual content.

In [None]:
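# Assumption: NLTK's VADER implementation is used; its lexicon must be
# downloaded once beforehand with nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()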

def get_sentiment_score(text):
    return sia.polarity_scores(text)['compound']

tweets_df['sentiment_score'] = tweets_df['cleaned_text'].apply(get_sentiment_score)


We then categorize each polarity score as positive, neutral, or negative. Since neutral tweets score at or very close to 0, only scores between -0.05 and 0.05 (inclusive) are labeled neutral.

In [None]:
def categorize_sentiment(score):
    if score > 0.05:
        return 'positive'
    elif score < -0.05:
        return 'negative'
    else:
        return 'neutral'

tweets_df['sentiment_category'] = tweets_df['sentiment_score'].apply(categorize_sentiment)
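The boundary behavior of this rule is easy to miss, so here is a small standalone sketch of it. The `threshold` parameter is a hypothetical addition for illustration (the notebook hard-codes 0.05), and the scores are made up.

```python
def categorize_sentiment(score, threshold=0.05):
    """Map a compound score to a label; `threshold` is a hypothetical knob."""
    if score > threshold:
        return "positive"
    elif score < -threshold:
        return "negative"
    else:
        return "neutral"

# Made-up scores, including the exact boundaries.
for score in [0.6, 0.05, 0.0, -0.05, -0.4]:
    print(score, categorize_sentiment(score))
```

Scores of exactly 0.05 and -0.05 fall in the neutral band because both comparisons are strict.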


Count the tweets in each sentiment category per hour, matching the hourly resolution of our Bitcoin data.

In [None]:
sentiment_counts = tweets_df.groupby([pd.Grouper(freq='h'), 'sentiment_category']).size().unstack(fill_value=0)

Make sure every category has its own column, adding one filled with 0 if a category never appears in the data.

In [None]:
for category in ['positive', 'neutral', 'negative']:
    if category not in sentiment_counts.columns:
        sentiment_counts[category] = 0
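The grouping and the missing-column guard can be seen on a tiny made-up sample in which no tweet is negative (the `toy` frame is hypothetical). This sketch uses `reindex(columns=...)` as one alternative way to guarantee all three columns; it is not the notebook's own loop.

```python
import pandas as pd

# Hypothetical labeled tweets: two in the 10:00 hour, one in the 11:00 hour.
idx = pd.to_datetime(
    ["2021-02-05 10:05", "2021-02-05 10:40", "2021-02-05 11:10"], utc=True
)
toy = pd.DataFrame(
    {"sentiment_category": ["positive", "neutral", "positive"]}, index=idx
)

# Count tweets per hour and category, one column per category.
counts = (
    toy.groupby([pd.Grouper(freq="h"), "sentiment_category"])
    .size()
    .unstack(fill_value=0)
)

# 'negative' never occurs above, so the column is missing until it is added.
counts = counts.reindex(columns=["positive", "neutral", "negative"], fill_value=0)
print(counts)
```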


Compute the features `total_tweets`, `net_sentiment_score`, and `average_sentiment`.

`net_sentiment_score` is the number of positive tweets minus the number of negative tweets, divided by the total number of tweets in that hour.

`average_sentiment` is the mean of the raw (uncategorized) sentiment scores within each hour.

In [None]:
sentiment_counts['total_tweets'] = sentiment_counts.sum(axis=1)


In [None]:
sentiment_counts['net_sentiment_score'] = (sentiment_counts['positive'] - sentiment_counts['negative']) / sentiment_counts['total_tweets']


In [None]:
average_sentiment = tweets_df.resample('h')['sentiment_score'].mean()
sentiment_counts['average_sentiment'] = average_sentiment
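As a worked example of the `net_sentiment_score` formula, the sketch below recomputes it from the category counts of the first two hours in our merged output (0/5/6 and 10/47/31 negative/neutral/positive tweets). The frame name `counts` is only for illustration.

```python
import pandas as pd

# Category counts for the first two hours of the merged dataset.
counts = pd.DataFrame(
    {"negative": [0, 10], "neutral": [5, 47], "positive": [6, 31]}
)

counts["total_tweets"] = counts[["negative", "neutral", "positive"]].sum(axis=1)
counts["net_sentiment_score"] = (
    counts["positive"] - counts["negative"]
) / counts["total_tweets"]

print(counts.round(6))
# First hour:  (6 - 0) / 11   = 0.545455
# Second hour: (31 - 10) / 88 = 0.238636
```

The score ranges from -1 (all tweets negative) to +1 (all tweets positive), and neutral tweets pull it toward 0 because they count in the denominator only.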

Since the tweets dataset is missing certain periods, we need to filter our Bitcoin data to only include the hours for which we have tweets.

In [None]:
available_hours = sentiment_counts.index.unique()
btc_filtered = btc_hourly.loc[btc_hourly.index.isin(available_hours)]

We can now merge both of our fully processed datasets into one.

In [None]:
sentiment_counts = sentiment_counts.reindex(btc_filtered.index)

In [None]:
merged_df = btc_filtered.join(sentiment_counts, how='inner')
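To show what the inner join keeps, here is a minimal sketch with made-up hourly frames in which the sentiment data is missing one hour (`prices`, `sent`, and their values are hypothetical). Only the hours present in both frames survive.

```python
import pandas as pd

# Hypothetical hourly prices for 10:00, 11:00, and 12:00 UTC.
price_idx = pd.date_range("2021-02-05 10:00", periods=3, freq="h", tz="UTC")
prices = pd.DataFrame({"Close": [100.0, 101.0, 102.0]}, index=price_idx)

# Hypothetical tweet counts with no data for the 11:00 hour.
sent = pd.DataFrame({"total_tweets": [11, 139]}, index=price_idx[[0, 2]])

# An inner join on the index keeps only hours present in both frames.
merged = prices.join(sent, how="inner")
print(merged)
```

Because both indexes are timezone-aware UTC timestamps at hourly resolution, the rows line up on the index without any extra key column.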

In [None]:
# Reload the merged dataset from a previously saved CSV (local path)
merged_df = pd.read_csv(r'C:\Users\User\Group017-FA24\Group017-FA24\Data\mergedData.csv', index_col=0)
merged_df

| Timestamp | Open | High | Low | Close | Volume | negative | neutral | positive | total_tweets | net_sentiment_score | average_sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2021-02-05 10:00:00+00:00 | 37302.24 | 37391.12 | 37049.34 | 37094.46 | 97.855257 | 0 | 5 | 6 | 11 | 0.545455 | 0.274245 |
| 2021-02-05 11:00:00+00:00 | 37094.46 | 37700.43 | 37060.00 | 37431.08 | 227.234701 | 10 | 47 | 31 | 88 | 0.238636 | 0.109375 |
| 2021-02-05 12:00:00+00:00 | 37430.83 | 37777.78 | 37430.83 | 37706.73 | 152.645297 | 12 | 59 | 68 | 139 | 0.402878 | 0.192934 |
| 2021-02-05 13:00:00+00:00 | 37714.98 | 37836.97 | 37462.33 | 37771.16 | 255.330184 | 15 | 68 | 48 | 131 | 0.251908 | 0.130253 |
| 2021-02-05 14:00:00+00:00 | 37766.09 | 37817.53 | 37250.00 | 37301.38 | 189.317303 | 15 | 82 | 63 | 160 | 0.300000 | 0.135852 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2023-01-09 19:00:00+00:00 | 17225.00 | 17268.00 | 17211.00 | 17263.00 | 33.898095 | 251 | 680 | 615 | 1546 | 0.235446 | 0.125101 |
| 2023-01-09 20:00:00+00:00 | 17263.00 | 17328.00 | 17230.00 | 17271.00 | 128.415261 | 248 | 705 | 630 | 1583 | 0.241314 | 0.128724 |
| 2023-01-09 21:00:00+00:00 | 17270.00 | 17355.00 | 17270.00 | 17325.00 | 150.895088 | 214 | 518 | 591 | 1323 | 0.284958 | 0.150869 |
| 2023-01-09 22:00:00+00:00 | 17333.00 | 17373.00 | 17321.00 | 17359.00 | 50.276312 | 234 | 495 | 465 | 1194 | 0.193467 | 0.116634 |
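With the merged data in this shape, one possible first step toward our research question is to correlate hourly sentiment with hourly price changes. The sketch below is only an illustration under assumptions: it uses the first five rows shown above, the names `sample` and `hourly_return` are hypothetical, and five hours are far too few to support any conclusion.

```python
import pandas as pd

# First five hourly rows of the merged output (Close and net_sentiment_score).
sample = pd.DataFrame(
    {
        "Close": [37094.46, 37431.08, 37706.73, 37771.16, 37301.38],
        "net_sentiment_score": [0.545455, 0.238636, 0.402878, 0.251908, 0.300000],
    },
    index=pd.date_range("2021-02-05 10:00", periods=5, freq="h", tz="UTC"),
)

# Hourly return: fractional change in closing price from the previous hour.
sample["hourly_return"] = sample["Close"].pct_change()

# Pearson correlation; the first hour is skipped because its return is NaN.
corr = sample["net_sentiment_score"].corr(sample["hourly_return"])
n_hours = sample["hourly_return"].notna().sum()
print(f"Pearson r over {n_hours} hours: {corr:.3f}")
```

The same pattern could be applied to `Volume` or to a rolling standard deviation of returns as a volatility proxy, and to lagged sentiment if we want to ask whether sentiment leads price.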


# Ethics & Privacy

In our data science project, we prioritize ethics and privacy by using only open-source datasets that adhere to strict data usage policies, ensuring we respect data ownership and compliance with privacy regulations. 

Bitcoin price data is publicly available and does not raise privacy or ethical concerns. Analyzing social media sentiment does raise some concerns about using people's opinions and posts as input to our sentiment model, but all scraped data comes from public posts.

Conducting sentiment analysis on social media platforms inherently carries risks of bias. Different sentiment tones can arise due to cultural, linguistic, and regional aspects. Additionally, sentiment extraction methods may amplify certain viewpoints or under-represent others, leading to biased outcomes in our analysis. For instance, words or phrases with different meanings across regions can cause sentiment scores to differ and be inaccurate.

To mitigate these biases, we employ normalization techniques and careful preprocessing steps to ensure that our sentiment model is as objective as possible. This includes standardizing language differences, removing outliers, and applying region-specific sentiment adjustments when feasible. Furthermore, we plan to perform a bias audit on our model’s outputs to identify and address any unintended skew in our results.


# Team Expectations 


- We will use iMessage to communicate. We will meet at least once a week. If a group member posts a question in the group chat, they should be able to get an answer within one hour.
- Everyone is expected to contribute to the group tasks each week. Tasks will be delegated based on preference and everyone should have around an equal amount of work. 
- If someone is struggling with their work or unable to complete something, they should tell the rest of the group ASAP 
- Be blunt but polite if you want to express a disagreement or different opinion. It is important to have effective communication and be direct. 
- It is important to take everyone’s perspective into consideration and hear each other out during times of conflict. 


# Project Timeline Proposal




| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 10/30  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 10/30  |  10 AM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 10/30  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 11/18  | 6 PM  | Import & Wrangle Data; EDA | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 11/23  | 12 PM  | Finalize wrangling/EDA; Begin Analysis | Discuss/edit Analysis; Complete project check-in |
| 12/10  | 12 PM  | Complete analysis; Draft results/conclusion/discussion| Discuss/edit full project |
| 12/11  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |