In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
import nltk
import seaborn as sns

# Investigating Stock Price Influences: The Impact of Tweets, Insider Trading, and Earnings

## Abstract

## 1. Introduction

The stock market is a complex system influenced by many factors. All of the people, that trade long or short term stocks, always seek ways to dynamics, that drive stock prices. The traditional financial indicators like earnings, financial statements and insider trading activities have long been linked with influincing the market. The rise of social media and the interconnected world, we live in, introduces a new dimension to market analysis. The social network websites like twitter and reddit are considered to potentially trigger a market movement. We have observed such cases in the last couple of years, for example with Gamestop and Dogecoin, but does it happen more often and is it as reliable as the traditional indicators?

This project aims to investigate how various factors, that influence stock prices. By leveraging multiple datasets, comprehensive analysis will be conducted, in order to identify the extent to which these factors impact stock market behavior and what factors seem to be more useful than others.

- **What is the stock market?**

Explain what the stock market is, how it is moved and how people make money potentially?
https://www.investopedia.com/terms/s/stockmarket.asp
Stocks vs indexes

- **What factors make the stock move?**

In here explain about technical analysis, financials, insider trading etc
https://www.investopedia.com/articles/basics/04/100804.asp


Explain the terms and so on from the above points

What is a stock split? Dividend?

## 2. Data Collection and Preparation

- **Overview**

To conduct a comprehensive analysis of the factors influencing stock prices, we have utilized a diverse set of datasets from Kaggle. These datasets encompass a range of information, including stock market data, insider trading activities, financial indicators, social media sentiment, and earnings data. Below is an overview of each dataset:

   1. **Stock Market Data**:
      - **Source**: [Price Volume Data for All U.S. Stocks & ETFs](https://www.kaggle.com/datasets/borismarjanovic/price-volume-data-for-all-us-stocks-etfs)
      - **Description**: This dataset includes price and volume data for all U.S. stocks and ETFs, offering a comprehensive view of market movements over time. It provides essential information such as opening and closing prices, highest and lowest prices, and trading volumes.

   2. **Insider Trading Data**:
      - **Source**: [Insider Trading S&P 500 Inside Info](https://www.kaggle.com/datasets/ilyaryabov/insider-trading-sp500-inside-info)
      - **Description**: This dataset shows insider trading activities, including trades made by executives and other insiders within S&P 500 companies. Information includes the type of trade (buy/sell), the number of shares traded, and the position of the insider within the company.

   3. **Financial Indicators**:
      - **Source**: [200 Financial Indicators of U.S. Stocks (2014-2018)](https://www.kaggle.com/datasets/cnic92/200-financial-indicators-of-us-stocks-20142018)
      - **Description**: This dataset provides a wide range of financial indicators for U.S. stocks, covering metrics such as profitability, liquidity, leverage, and valuation from 2014 to 2018. These indicators are critical for assessing the financial health and performance of companies.

   4. **Tweets for Sentiment Analysis**:
      - **Source**: [Stock Tweets for Sentiment Analysis and Prediction](https://www.kaggle.com/datasets/equinxx/stock-tweets-for-sentiment-analysis-and-prediction)
      - **Description**: This dataset includes tweets related to stock market sentiment. Each tweet will be tagged with sentiment scores (positive, negative, neutral), providing insights into public perception and its potential impact on stock prices. The dataset is useful for analyzing the correlation between social media sentiment and market movements.

   5. **Earnings Data**:
      - **Source**: [U.S. Historical Stock Prices with Earnings Data](https://www.kaggle.com/datasets/tsaustin/us-historical-stock-prices-with-earnings-data)
      - **Description**: This dataset focuses on stock price movements around earnings announcements. It includes historical stock prices and earnings data, allowing for the analysis of how earnings reports impact stock prices both in the short-term and long-term.

By integrating these datasets, we can conduct a multifaceted analysis to uncover the relationships between various factors and stock prices, providing a holistic view of market dynamics.


**Preparation**

Let's begin with looking at the data and perform some cleaning operations, where necessary.

First we will investigate the main dataset - stock prices:

In [102]:
stock_data = pd.read_csv('stocks_data/MSFT.csv')
stock_data

Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
0,1986-03-13,0.055653,0.063838,0.055653,0.061109,1031788800,0.0,0.0
1,1986-03-14,0.061109,0.064384,0.061109,0.063292,308160000,0.0,0.0
2,1986-03-17,0.063292,0.064929,0.063292,0.064383,133171200,0.0,0.0
3,1986-03-18,0.064383,0.064929,0.062201,0.062746,67766400,0.0,0.0
4,1986-03-19,0.062746,0.063292,0.061109,0.061655,47894400,0.0,0.0
...,...,...,...,...,...,...,...,...
9152,2022-07-06,263.750000,267.989990,262.399994,266.209991,23824400,0.0,0.0
9153,2022-07-07,265.119995,269.059998,265.019989,268.399994,20859900,0.0,0.0
9154,2022-07-08,264.790009,268.100006,263.290009,267.660004,19648100,0.0,0.0
9155,2022-07-11,265.649994,266.529999,262.179993,264.510010,19455200,0.0,0.0


TODO talk about the dataset, explaining all the columns (if possible) and then do some interesting charts if any possible

TODO do this about all the datasets

## 3. Analyzing the Impact of Tweets on Stock Prices
- **Hypothesis**: Social media sentiment can influence stock prices.
- **Data Analysis**:
  - Sentiment analysis of tweets (positive, negative, neutral).
  - Correlation between tweet sentiment and stock price movements.
- **Methods**:
  - Text preprocessing and sentiment analysis using NLP techniques.
  - Time series analysis to link tweet sentiment to stock prices.
- **Visualization**: Display results and key findings.

## 4. Investigating the Effect of Insider Trading on Stock Prices
- **Hypothesis**: Insider trading activity can be a predictor of stock price movements.
- **Data Analysis**:
  - Identifying patterns in insider trading data.
  - Correlation between insider trades and subsequent stock price movements.
- **Methods**:
  - Statistical analysis to detect abnormal returns post insider trading.
  - Machine learning models to predict price changes based on insider trades.
- **Visualization**: Display results and key findings.

## 5. Financial Statements and Long-term Stock Price Trends
- **Hypothesis**: Earnings reports have a significant impact on stock prices, both short-term and long-term.
- **Data Analysis**:
  - Immediate effect of earnings announcements on stock prices.
  - Long-term impact of positive and negative earnings on stock price trends.
- **Methods**:
  - Event study methodology to assess immediate price reactions.
  - Time series analysis for long-term impact.
  - Regression analysis to understand the relationship between earnings surprises and price changes.
- **Visualization**: Display results and key findings.

## 6. Comparative Analysis and Synthesis of Findings
- **Comparison**: Compare the influence of tweets, insider trading, and earnings on stock prices.
- **Discussion**: Discuss which factor has the most significant impact.
- **Exploration**: Explore any potential interactions between these factors.

## 7. Conclusion and Recommendations
- **Summary**: Summarize key findings from the analyses.
- **Implications**: Practical implications for investors and financial analysts.
- **Future Research**: Recommendations for future research.

# TODO delete playground

In [89]:
# https://www.kaggle.com/datasets/rprkh15/sp500-stock-prices?select=AAPL.csv
# How are dividends calculated? -> they are in USD ($)
# ADD PERCCENTAGE CHANGE
# Think about stock price to dividend
# Think about stock splits as well?
# Stock split 1 to 7 split (prices are correctly adjusted for the split)
# Check biggest diffs in opea and close prices (as well as high and low)
# Does volumne impact the price. More volume -> more likely to go up or down?
# Do stock splits affect the price?

stock_data = pd.read_csv('stocks_data/MSFT.csv')
stock_data

Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
0,1986-03-13,0.055653,0.063838,0.055653,0.061109,1031788800,0.0,0.0
1,1986-03-14,0.061109,0.064384,0.061109,0.063292,308160000,0.0,0.0
2,1986-03-17,0.063292,0.064929,0.063292,0.064383,133171200,0.0,0.0
3,1986-03-18,0.064383,0.064929,0.062201,0.062746,67766400,0.0,0.0
4,1986-03-19,0.062746,0.063292,0.061109,0.061655,47894400,0.0,0.0
...,...,...,...,...,...,...,...,...
9152,2022-07-06,263.750000,267.989990,262.399994,266.209991,23824400,0.0,0.0
9153,2022-07-07,265.119995,269.059998,265.019989,268.399994,20859900,0.0,0.0
9154,2022-07-08,264.790009,268.100006,263.290009,267.660004,19648100,0.0,0.0
9155,2022-07-11,265.649994,266.529999,262.179993,264.510010,19455200,0.0,0.0


In [90]:
# https://www.kaggle.com/datasets/ilyaryabov/insider-trading-sp500-inside-info

insider_trading = pd.read_csv('insider_trading/MSFT.csv').sort_values(['Value ($)'], ascending = False)
insider_trading

Unnamed: 0,Insider Trading,Relationship,Date,Transaction,Cost,Shares,Value ($),Shares Total,SEC Form 4
1,SMITH BRADFORD L,President and Vice Chair,2022-02-08,Sale,304.64,27860,8487170,622460,Feb 09 06:16 PM
4,List Teri,Director,2021-12-07,Sale,334.9,1650,552578,1654,Dec 08 05:59 PM
3,Walmsley Emma N,Director,2022-01-28,Buy,295.48,1700,502317,7086,Feb 01 06:15 PM
9,Hogan Kathleen T,"EVP, Human Resources",2021-09-10,Sale,298.68,20000,5973540,183988,Sep 13 06:05 PM
2,Walmsley Emma N,Director,2022-01-31,Buy,311.53,1600,498445,8686,Feb 01 06:15 PM
11,Nadella Satya,Chief Executive Officer,2021-09-01,Sale,303.28,75573,22920037,1669375,Sep 01 08:12 PM
10,Capossela Christopher C,"EVP, Chief Marketing Officer",2021-09-10,Sale,298.82,10000,2988176,94415,Sep 13 06:03 PM
0,Nadella Satya,Chief Executive Officer,2022-03-01,Sale,296.52,7931,2351736,809645,Mar 02 06:20 PM
7,Althoff Judson,"EVP, Chief Commercial Officer",2021-11-02,Sale,332.28,54757,18194837,139586,Nov 03 06:13 PM
12,Hood Amy,"EVP, Chief Financial Officer",2021-09-01,Sale,303.08,60000,18184980,463259,Sep 03 06:05 PM


In [92]:
# https://www.kaggle.com/datasets/cnic92/200-financial-indicators-of-us-stocks-20142018
# we have data per year, is it cleaned data?
# also maybe check this out

financial_indicators = pd.read_csv('financial_indicators/2018_Financial_Data.csv')
financial_indicators

Unnamed: 0.1,Unnamed: 0,Revenue,Revenue Growth,Cost of Revenue,Gross Profit,R&D Expenses,SG&A Expense,Operating Expenses,Operating Income,Interest Expense,...,Receivables growth,Inventory Growth,Asset Growth,Book Value per Share Growth,Debt Growth,R&D Expense Growth,SG&A Expenses Growth,Sector,2019 PRICE VAR [%],Class
0,CMCSA,9.450700e+10,0.1115,0.000000e+00,9.450700e+10,0.000000e+00,6.482200e+10,7.549800e+10,1.900900e+10,3.542000e+09,...,0.2570,0.0000,0.3426,0.0722,0.7309,0.0000,0.1308,Consumer Cyclical,32.794573,1
1,KMI,1.414400e+10,0.0320,7.288000e+09,6.856000e+09,0.000000e+00,6.010000e+08,3.062000e+09,3.794000e+09,1.917000e+09,...,0.0345,-0.0920,-0.0024,0.0076,-0.0137,0.0000,-0.1265,Energy,40.588068,1
2,INTC,7.084800e+10,0.1289,2.711100e+10,4.373700e+10,1.354300e+10,6.750000e+09,2.042100e+10,2.331600e+10,-1.260000e+08,...,0.1989,0.0387,0.0382,0.1014,-0.0169,0.0390,-0.0942,Technology,30.295514,1
3,MU,3.039100e+10,0.4955,1.250000e+10,1.789100e+10,2.141000e+09,8.130000e+08,2.897000e+09,1.499400e+10,3.420000e+08,...,0.4573,0.1511,0.2275,0.6395,-0.5841,0.1738,0.0942,Technology,64.213737,1
4,GE,1.216150e+11,0.0285,9.546100e+10,2.615400e+10,0.000000e+00,1.811100e+10,4.071100e+10,-1.455700e+10,5.059000e+09,...,-0.2781,-0.2892,-0.1575,-0.4487,-0.2297,0.0000,0.0308,Industrials,44.757840,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4387,YRIV,0.000000e+00,0.0000,0.000000e+00,0.000000e+00,0.000000e+00,3.755251e+06,3.755251e+06,-3.755251e+06,1.105849e+07,...,0.0000,0.0000,-0.0508,-0.1409,-0.0152,0.0000,-0.2602,Real Estate,-90.962099,0
4388,YTEN,5.560000e+05,-0.4110,0.000000e+00,5.560000e+05,4.759000e+06,5.071000e+06,9.830000e+06,-9.274000e+06,0.000000e+00,...,0.3445,0.0000,-0.2323,-0.8602,0.0000,0.0352,-0.0993,Basic Materials,-77.922077,0
4389,ZKIN,5.488438e+07,0.2210,3.659379e+07,1.829059e+07,1.652633e+06,7.020320e+06,8.672953e+06,9.617636e+06,1.239170e+06,...,0.1605,0.7706,0.2489,0.4074,-0.0968,0.2415,0.8987,Basic Materials,-17.834400,0
4390,ZOM,0.000000e+00,0.0000,0.000000e+00,0.000000e+00,1.031715e+07,4.521349e+06,1.664863e+07,-1.664863e+07,0.000000e+00,...,0.8980,0.0000,0.1568,-0.2200,0.0000,2.7499,0.1457,Industrials,-73.520000,0


In [99]:
# twitter data https://www.kaggle.com/datasets/equinxx/stock-tweets-for-sentiment-analysis-and-prediction
# also maybe check this out

tweet_data = pd.read_csv('tweet_data/stock_tweets.csv')
tweet_data

Unnamed: 0,Date,Tweet,Stock Name,Company Name
0,2022-09-29 23:41:16+00:00,Mainstream media has done an amazing job at br...,TSLA,"Tesla, Inc."
1,2022-09-29 23:24:43+00:00,Tesla delivery estimates are at around 364k fr...,TSLA,"Tesla, Inc."
2,2022-09-29 23:18:08+00:00,3/ Even if I include 63.0M unvested RSUs as of...,TSLA,"Tesla, Inc."
3,2022-09-29 22:40:07+00:00,@RealDanODowd @WholeMarsBlog @Tesla Hahaha why...,TSLA,"Tesla, Inc."
4,2022-09-29 22:27:05+00:00,"@RealDanODowd @Tesla Stop trying to kill kids,...",TSLA,"Tesla, Inc."
...,...,...,...,...
80788,2021-10-07 17:11:57+00:00,Some of the fastest growing tech stocks on the...,XPEV,XPeng Inc.
80789,2021-10-04 17:05:59+00:00,"With earnings on the horizon, here is a quick ...",XPEV,XPeng Inc.
80790,2021-10-01 04:43:41+00:00,Our record delivery results are a testimony of...,XPEV,XPeng Inc.
80791,2021-10-01 00:03:32+00:00,"We delivered 10,412 Smart EVs in Sep 2021, rea...",XPEV,XPeng Inc.


In [98]:
# earnings data https://www.kaggle.com/datasets/tsaustin/us-historical-stock-prices-with-earnings-data

earnings_data = pd.read_csv('earnings_data/earnings_latest.csv')
earnings_data

Unnamed: 0,symbol,date,qtr,eps_est,eps,release_time
0,A,2009-05-14,04/2009,,,post
1,A,2009-08-17,07/2009,,,post
2,A,2009-11-13,10/2009,,,pre
3,A,2010-02-12,01/2010,,,pre
4,A,2010-05-17,04/2010,,,post
...,...,...,...,...,...,...
168598,ZYXI,2020-02-27,Q4,0.077,0.09,post
168599,ZYXI,2020-04-28,Q1,0.063,0.09,post
168600,ZYXI,2020-07-28,Q2,0.086,0.09,post
168601,ZYXI,2020-10-27,Q3,0.053,0.04,post


If have time -> check institutional investments