# COGS 108 - Data Checkpoint

# Names

- Kavin Raj
- Arnav Saxena
- Tiantong Wu
- Peike Xu
- Jing Yin Yip

# Research Question

How do opinionated tweets on Twitter as measured by a sentiment analysis model affect the stock prices of tech firms within the time period of ...?

## Background and Prior Work

Twitter is a widely used social media platform for expressing many kinds of opinions, including opinions about specific companies. We believe that this may have an impact on the public perception of these companies, and we aim to investigate whether there is a connection between opinions expressed on social media and the actual stock prices of the companies in question. An example of this was seen in 2021, when “a thread on r/WallStreetBets” caused “more than 7,200% increase in GME—and a 689% run”<a name="cite_note-1"></a> <sup>[<a href="#cite_ref-1">1</a>]</sup>. This occurrence suggested to us that there is a potential causal effect between opinions on social media and real-world stock prices; we think that this effect has a much greater scope than this isolated case of GameStop stock, and we are interested in seeing whether it is a larger, more general phenomenon that applies to other time periods and companies.

In a research paper from the IOP Conference Series, we found a sentiment analysis model that relates social media opinion to stock trading. Its conclusion remarks that “looking into correlation coefficient compared by number of days before and after the trading day, the result shows that correlation reaches to the peak on trading day then it gradually declines with the magnitude depending on the day length after trading day.” <a name="cite_note-2"></a> <sup>[<a href="#cite_ref-2">2</a>]</sup> This research is similar to what we aim to investigate: the paper ran its tests on a Thai social media platform called Pantip and covered ten Thai companies. The paper provides more evidence of a link between social media sentiment and stock prices, and we are interested to see if a similar trend can be seen with tweets and tech companies' stocks in the US. We would also like to investigate other similarities or differences between the trends in our research and those in the paper, such as whether the correlation between variables peaks on the trading day.

1. <a name="cite_note-1"></a> <sup>[<a href="#cite_ref-1">1</a>]</sup> Rechel, J. (28 Jan 2021) How social media moves markets: Analyzing GameStop (GME) using social listening data. Sprout Blog.
   <br> https://sproutsocial.com/insights/gamestop-stock-social-media

2. <a name="cite_note-2"></a> <sup>[<a href="#cite_ref-2">2</a>]</sup>  P Padhanarath et al 2019 IOP Conf. Ser.: Mater. Sci. Eng. 620 012094.
   <br>https://iopscience.iop.org/article/10.1088/1757-899X/620/1/012094/pdf


# Hypothesis


Before answering our research question by investigating the available data, we think that there will be a positive correlation between a positive sentiment for a company and said company’s stock prices rising, as well as a positive correlation between a negative sentiment for a company and said company’s stock prices falling. This is due to the connection between a company’s public perception, how that is reflected in social media, and how it manifests in the stock market. If tweets about a company are mostly negative within a certain time period, we would expect to observe a decrease in stock prices, as both phenomena correspond to a decrease in public perception of the company. However, we also acknowledge that this relationship may not be as straightforward as is stated here, as there may be other confounds affecting each variable, such as Twitter only capturing the sentiment of a more vocal sample of people as compared to the rest of the population.
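One way to check the direction and timing of the relationship described above is to correlate daily sentiment with daily returns at several day offsets. The following is only a minimal sketch on synthetic data: the function name `lagged_correlations`, the `max_lag` parameter, and the generated series are assumptions for illustration and are not part of our datasets or results.

```python
import numpy as np
import pandas as pd

def lagged_correlations(sentiment, returns, max_lag=3):
    """Pearson correlation between sentiment on day t and returns on day t + lag."""
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        # shift(-lag) lines up the return from day t + lag with sentiment on day t
        out[lag] = sentiment.corr(returns.shift(-lag))
    return pd.Series(out, name="correlation")

# Synthetic illustration only: returns are constructed to follow sentiment by one day
rng = np.random.default_rng(0)
days = pd.date_range("2022-01-03", periods=200, freq="B")
sentiment = pd.Series(rng.normal(size=len(days)), index=days)
noise = pd.Series(rng.normal(scale=0.5, size=len(days)), index=days)
returns = sentiment.shift(1).fillna(0) + noise

corrs = lagged_correlations(sentiment, returns)
print(corrs.round(2))
```

Because the synthetic returns are built to trail sentiment by one business day, the correlation should be largest at lag 1. On real data, the lag with the strongest correlation (if any) is exactly what we would want to find out.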

# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
- Dataset #2 (if you have more than one!)
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
- etc

Now write 2 - 5 sentences describing each dataset here. Include a short description of the important variables in the dataset; what the metrics and datatypes are, what concepts they may be proxies for. Include information about how you would need to wrangle/clean/preprocess the dataset

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

## Set up

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
import seaborn as sns


## Data Cleaning

We clean two data files, stock_tweets.csv and TESLA_HISTORICAL.csv, so that they are ready for the analysis later on.

### Tweet Stock

For the tweets, we read the file and drop any rows with missing values. We convert the dates into a standard date format. We then keep only the tweets whose stock name is "TSLA". Finally, we convert all the tweets to lower case, which is needed later for sentiment analysis.

In [None]:
# Load the tweets and drop rows with missing values
tweet = pd.read_csv('stock_tweets.csv')
tweet = tweet.dropna()
# Convert the date entries into a standardized form (calendar dates)
tweet['Date'] = pd.to_datetime(tweet['Date']).dt.date
# Reduce the dataset to Tesla only; .copy() avoids a SettingWithCopyWarning below
tweet_tesla = tweet[tweet['Stock Name'] == 'TSLA'].copy()
# Convert all the tweets to lower case for easier analysis
tweet_tesla['Tweet'] = tweet_tesla['Tweet'].str.lower()
# Not run yet: tokenize the tweets and remove stopwords for further sentiment analysis
# %pip install nltk
# import nltk
# from nltk.corpus import stopwords
# from nltk.tokenize import word_tokenize
# tweet_tesla['Tweet'] = tweet_tesla['Tweet'].apply(word_tokenize)
# stop_words = set(stopwords.words('english'))
# tweet_tesla['Tweet'] = tweet_tesla['Tweet'].apply(lambda x: [item for item in x if item not in stop_words])
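The tokenization and stopword-removal step that is still commented out above can be illustrated without any extra downloads. This is a rough sketch only: the `STOP_WORDS` set is a hypothetical, deliberately tiny stand-in for NLTK's English stopword list, the regular expression is our own simplification of a tokenizer, and the two sample tweets are made up.

```python
import re
import pandas as pd

# Hypothetical, deliberately tiny stopword list; NLTK's English list is much longer
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for", "it", "this"}

def tokenize(text):
    """Split lower-cased text into word tokens, keeping cashtags and hashtags whole."""
    return re.findall(r"[$#]?[a-z0-9']+", text.lower())

def remove_stopwords(tokens):
    """Drop every token that appears in STOP_WORDS."""
    return [tok for tok in tokens if tok not in STOP_WORDS]

# Made-up tweets standing in for tweet_tesla['Tweet']
sample = pd.Series(["The $TSLA earnings call is today!", "This is a great day for Tesla"])
tokens = sample.apply(tokenize).apply(remove_stopwords)
print(tokens.tolist())
```

Keeping the leading `$` lets cashtags such as `$tsla` survive as single tokens, which may matter if a later model treats ticker mentions differently from ordinary words.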

### Tesla Stock Price

We first read the CSV file for Tesla and drop any rows with missing data. We then convert the dates into a standard date format, which makes it easier to select the dates we are looking for. Next, we run a function that removes the dollar signs from the price entries and converts them to floats. The last step is to match up the time range of the Tesla stock prices with that of the tweets.

In [1]:
# Requires the imports from the Set up cell (run that cell first)
# Load the Tesla price history and drop rows with missing values
tesla = pd.read_csv('TESLA_HISTORICAL.csv')
tesla = tesla.dropna()
# Convert the date entries into a standardized form (calendar dates)
tesla['Date'] = pd.to_datetime(tesla['Date']).dt.date
def nodollartofloat(series):
    """Strip the dollar sign from each price string and convert it to float."""
    return series.str.strip('$').astype(float)
# Convert all price strings into numbers
for col in ['Close/Last', 'Open', 'High', 'Low']:
    tesla[col] = nodollartofloat(tesla[col])
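Matching the time range of the two tables can be done with an inner merge on the date column. The sketch below is a hedged illustration, not our final pipeline: the helper `align_on_dates` and the `n_tweets` column are names we introduce here, and the rows (including the prices) are made up to stand in for `tweet_tesla` and `tesla`.

```python
import datetime as dt
import pandas as pd

def align_on_dates(tweets, prices):
    """Count tweets per day and keep only the days present in both tables."""
    daily = tweets.groupby("Date").size().rename("n_tweets").reset_index()
    # An inner merge drops dates that are missing from either table,
    # for example weekend tweets with no trading data
    merged = daily.merge(prices, on="Date", how="inner")
    return merged.sort_values("Date").reset_index(drop=True)

# Made-up rows standing in for tweet_tesla and tesla (prices are not real)
tweets = pd.DataFrame({
    "Date": [dt.date(2022, 9, 29), dt.date(2022, 9, 29), dt.date(2022, 10, 1)],
    "Tweet": ["tsla up", "tsla down", "weekend tweet"],
})
prices = pd.DataFrame({
    "Date": [dt.date(2022, 9, 28), dt.date(2022, 9, 29)],
    "Close/Last": [280.0, 270.0],
})
merged = align_on_dates(tweets, prices)
print(merged)
```

An inner merge is the simplest choice, but it discards weekend and holiday tweets. If those turn out to matter, an alternative would be to carry them forward to the next trading day before merging.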

# Ethics & Privacy

Our project raises several ethical and privacy concerns. First, collecting users' Twitter posts might violate their privacy. Although tweets are posted publicly, people might not want their posts to be part of a project or an experiment. We plan to handle this by anonymizing all user data. In addition, there is a potential bias: it is hard to judge a human-written post as simply positive or negative. Some posts contain complex emotions; their authors may be unsure about the trend of a stock's price, or it may be hard to determine whether they are clearly for or clearly against a change in the stock price. We also need to interpret the information contained in each tweet accurately. To handle this issue, we must present our findings as transparently as possible. Also, since our dataset is relatively large (there are thousands of tweets), we can identify the posts that are less complex or take a clear stance, and analyze the correlation between those posts and the stock prices.

Another concern is that our results rely heavily on sentiment analysis models. It is important to use these models with caution, be aware of their potential shortcomings, and complement them with human judgment and context whenever possible to ensure the most accurate and reliable analysis of sentiment in our data. Also, the words that count as negative in a stock context might differ from those in general sentiment analysis. We can try out several models and compare their results to find the most accurate one.
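Comparing models could be as simple as scoring each one against a small hand-labelled sample. The sketch below is a toy illustration under stated assumptions: the two word lists, the `make_scorer` and `accuracy` helpers, and the four labelled tweets are all invented for this example, and a real comparison would use actual sentiment models and a much larger labelled set.

```python
# Toy word lists (assumptions for illustration, not real sentiment lexicons)
POSITIVE_GENERAL = {"great", "good", "love"}
NEGATIVE_GENERAL = {"bad", "terrible", "hate"}
# The same lists extended with finance-specific vocabulary
POSITIVE_FINANCE = POSITIVE_GENERAL | {"bullish", "rally", "beat"}
NEGATIVE_FINANCE = NEGATIVE_GENERAL | {"bearish", "short", "crash"}

def make_scorer(positive, negative):
    """Build a scorer that labels text by counting positive and negative words."""
    def score(text):
        words = text.lower().split()
        total = sum(w in positive for w in words) - sum(w in negative for w in words)
        if total > 0:
            return "pos"
        if total < 0:
            return "neg"
        return "neu"
    return score

def accuracy(scorer, labelled):
    """Fraction of hand-labelled examples that the scorer labels correctly."""
    return sum(scorer(text) == label for text, label in labelled) / len(labelled)

# Made-up tweets with hand-assigned labels
labelled = [
    ("love this rally", "pos"),
    ("bearish on tsla", "neg"),
    ("going to short it before the crash", "neg"),
    ("terrible quarter", "neg"),
]
models = {
    "general": make_scorer(POSITIVE_GENERAL, NEGATIVE_GENERAL),
    "finance": make_scorer(POSITIVE_FINANCE, NEGATIVE_FINANCE),
}
results = {name: accuracy(model, labelled) for name, model in models.items()}
best = max(results, key=results.get)
print(results, best)
```

In this toy sample the general word list misses "bearish" and "short", which illustrates how stock-specific vocabulary can change a model's verdict.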

# Team Expectations 

- Keeping up with the project work in a timely manner with everyone providing similar effort to the project
- Having open and consistent communication to update everyone on any conflicts and issues in order to proceed with the project smoothly. 

# Project Timeline Proposal

Two aspects that we may need guidance on are obtaining data from social media sites where APIs may not be available, and using sentiment analysis models to determine whether text is positive or negative towards certain entities.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 11/5  |  1 PM | Proposal  | Collect Datasets, Assign EDA Tasks | 
| 11/12  |  1 PM |  Collection of Raw Data; Data cleaning | Completing Checkpoint 1 | 
| 11/19  | 1 PM | Completed Raw Data Collection  | Start EDA |
| 11/26  | 1 PM | Finishing up EDA, starting analysis | Completing Checkpoint 2 |
| 12/3  | 1 PM  | Draft the report, discuss results | Start Final Project Report |
| 12/10 | 1 PM  | Finalize Project |Prepare for the final presentation |