# Stock Price Sentiment Analysis using Tweet Data

## Team Members


1. Uzay Karadağ | 090200738
2. Hasan Çelik | 090180305
3. Muhammed Fatih Kaya | 090200751

## Dataset

We will use the **[tweepy](https://www.tweepy.org/)** module to gather tweets through the Twitter API, querying for each stock by its ticker symbol.


## Description of the Problem

After gathering tweets about our chosen stocks, we will perform sentiment analysis on the tweets for each ticker using various NLP modules. We will then use these sentiments to predict whether the stock will close higher or lower than its opening price on that date. We will test the predictions against the price time series from the **yfinance** module.

## Project Planning

First, we will acquire the raw tweets with the **tweepy** module and process the raw text using several methods we learned in class. We will then build a DataFrame with the **pandas** library in which each row is labeled with its ticker and date.
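As a minimal sketch of the text-processing step, the snippet below strips URLs and user mentions, lowercases the text, and extracts cashtags such as `$AAPL`. The function names `clean_tweet` and `extract_cashtags`, and the exact cleaning rules, are illustrative assumptions rather than a finalized pipeline.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# A cashtag is assumed here to be "$" followed by 1-5 letters.
CASHTAG_RE = re.compile(r"\$([A-Za-z]{1,5})\b")


def extract_cashtags(text):
    """Return the upper-cased ticker symbols mentioned as $TICKER."""
    return [m.upper() for m in CASHTAG_RE.findall(text)]


def clean_tweet(text):
    """Remove URLs and @mentions, collapse whitespace, and lowercase."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


sample = "Insanity of today weirdo massive selling. $aapl bid up @someone http://example.com/x"
print(extract_cashtags(sample))
print(clean_tweet(sample))
```

Keeping the cashtags in the cleaned text is a deliberate choice in this sketch, since they identify which ticker a tweet refers to.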

Second, we will do Exploratory Data Analysis (EDA) to get a better picture of the dataset.

We will then use **spaCy** and **NLTK** to perform sentiment analysis on the newly formed DataFrame. We may fine-tune features and other parameters depending on the accuracy of our model.
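One possible way to turn tweet text into a numerical daily series is to score each tweet individually and then aggregate the scores per day. The sketch below uses a toy word lexicon as a stand-in for a real sentiment model; the lexicon values, the `score_tweet` helper, and the column names are all hypothetical, and a scorer built on spaCy or NLTK could be swapped in for `score_tweet`.

```python
import pandas as pd

# Toy lexicon standing in for a real sentiment model (hypothetical values).
LEXICON = {"profit": 1.0, "good": 1.0, "record": 0.5, "selling": -1.0, "recall": -1.0}


def score_tweet(text):
    """Average lexicon score of the words in a tweet; 0.0 if none match."""
    words = text.lower().split()
    hits = [LEXICON[w] for w in words if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


tweets = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-02", "2015-01-02", "2015-01-05"]),
    "tweet": ["massive selling today", "looking good", "record recall announced"],
})
tweets["sentiment"] = tweets["tweet"].apply(score_tweet)

# One row per day: the mean score and the number of tweets behind it.
daily = tweets.groupby("date", as_index=False).agg(
    mean_sentiment=("sentiment", "mean"),
    n_tweets=("sentiment", "size"),
)
print(daily)
```

Keeping the tweet count next to the mean lets later steps tell a day with two tweets apart from a day with two thousand.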

Lastly, we will test the model against historical price data gathered from **yfinance**. We may circle back to the second step (EDA) if needed.


### Project Pieces

1. Data Acquisition | Uzay Karadağ
2. Data Cleaning and Preprocessing | Uzay Karadağ
3. EDA | Hasan Çelik
4. Analysis of EDA | Hasan Çelik
5. Literature Review, Feature Engineering | Fatih Kaya
6. Model Construction | Fatih Kaya


### Hardware and Software

1. Uzay Karadağ: 2017 MacBook Pro | 2.3 GHz Dual-Core Intel Core i5, 8 GB RAM
2. Fatih Kaya: Monster Abra A5 v15.5 | 2.4 GHz Quad-Core Intel Core i5, 16 GB RAM
3. Hasan Çelik: Dell Vostro 3515 | 2.6 GHz AMD Ryzen 7, 16 GB RAM

*Google Colab might be used for GPU and/or TPU computation if needed.*

## Atabey's notes

I like the fact that you have a clear idea which tools you are going to use to collect the data. However, this is still a very weak proposal. Something written in less than an hour.

You must show a sample of the data and explain each of the pieces. You must explain how you are going to clean the Twitter data, and also how you are going to match the timelines. The data you are going to collect from Yahoo is a time series, and the same goes for the Twitter data. You must also explain how you are going to convert text data (Twitter data) into usable numerical time series data.

As for the models: how are you going to check for correlations or cause/effect relationships between the Twitter data and the yfinance data? What are your questions? What is your hypothesis? How are you going to test it? What ML algorithms are available? Which ones do you think are appropriate to use? I need more details.

Also, the division of labor and the timetable are not detailed enough.

### Cleaning the Data

In [228]:
import pandas as pd

Despite several attempts, we could not manage to get Twitter API access (thanks, Elon!). For now we have decided to use an alternative [dataset](https://www.kaggle.com/datasets/omermetinn/tweets-about-the-top-companies-from-2015-to-2020?resource=download&select=Tweet.csv) from Kaggle. However, we could not find a way to import the dataset into the notebook without first downloading it to the machine running it; we will try to come up with a method for that in the coming days. For now, treat this as a scratch commit that demonstrates how the time series data will be merged (joined) during data processing.

In [183]:
tdf  = pd.read_csv('Tweet.csv')
tdf

Unnamed: 0,tweet_id,writer,post_date,body,comment_num,retweet_num,like_num
0,550441509175443456,VisualStockRSRC,1420070457,"lx21 made $10,008 on $AAPL -Check it out! htt...",0,0,1
1,550441672312512512,KeralaGuy77,1420070496,Insanity of today weirdo massive selling. $aap...,0,0,0
2,550441732014223360,DozenStocks,1420070510,S&P100 #Stocks Performance $HD $LOW $SBUX $TGT...,0,0,0
3,550442977802207232,ShowDreamCar,1420070807,$GM $TSLA: Volkswagen Pushes 2014 Record Recal...,0,0,1
4,550443807834402816,i_Know_First,1420071005,Swing Trading: Up To 8.91% Return In 14 Days h...,0,0,1
...,...,...,...,...,...,...,...
3717959,1212159765914079234,TEEELAZER,1577836383,That $SPY $SPX puuump in the last hour was the...,1,0,6
3717960,1212159838882533376,ShortingIsFun,1577836401,In 2020 I may start Tweeting out positive news...,0,0,1
3717961,1212160015332728833,Commuternyc,1577836443,Patiently Waiting for the no twitter sitter tw...,0,0,5
3717962,1212160410692046849,MoriaCrypto,1577836537,I don't discriminate. I own both $aapl and $ms...,1,0,1


In [184]:
ctdf = pd.read_csv('Company_Tweet.csv')
ctdf

Unnamed: 0,tweet_id,ticker_symbol
0,550803612197457920,AAPL
1,550803610825928706,AAPL
2,550803225113157632,AAPL
3,550802957370159104,AAPL
4,550802855129382912,AAPL
...,...,...
4336440,1212158772015034369,TSLA
4336441,1212159099632267268,TSLA
4336442,1212159184931717120,TSLA
4336443,1212159838882533376,TSLA


In [185]:
df = tdf.join(ctdf, how='inner', lsuffix='_t', rsuffix='_ct')
df

Unnamed: 0,tweet_id_t,writer,post_date,body,comment_num,retweet_num,like_num,tweet_id_ct,ticker_symbol
0,550441509175443456,VisualStockRSRC,1420070457,"lx21 made $10,008 on $AAPL -Check it out! htt...",0,0,1,550803612197457920,AAPL
1,550441672312512512,KeralaGuy77,1420070496,Insanity of today weirdo massive selling. $aap...,0,0,0,550803610825928706,AAPL
2,550441732014223360,DozenStocks,1420070510,S&P100 #Stocks Performance $HD $LOW $SBUX $TGT...,0,0,0,550803225113157632,AAPL
3,550442977802207232,ShowDreamCar,1420070807,$GM $TSLA: Volkswagen Pushes 2014 Record Recal...,0,0,1,550802957370159104,AAPL
4,550443807834402816,i_Know_First,1420071005,Swing Trading: Up To 8.91% Return In 14 Days h...,0,0,1,550802855129382912,AAPL
...,...,...,...,...,...,...,...,...,...
3717959,1212159765914079234,TEEELAZER,1577836383,That $SPY $SPX puuump in the last hour was the...,1,0,6,1012677309940359168,TSLA
3717960,1212159838882533376,ShortingIsFun,1577836401,In 2020 I may start Tweeting out positive news...,0,0,1,1012677639792943104,TSLA
3717961,1212160015332728833,Commuternyc,1577836443,Patiently Waiting for the no twitter sitter tw...,0,0,5,1012677722924036096,TSLA
3717962,1212160410692046849,MoriaCrypto,1577836537,I don't discriminate. I own both $aapl and $ms...,1,0,1,1012677751738904577,TSLA
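Note that `DataFrame.join` without an `on` argument aligns rows by index, so the result above pairs row *i* of `Tweet.csv` with row *i* of `Company_Tweet.csv`; the `tweet_id_t` and `tweet_id_ct` columns differ on every row shown. A minimal sketch of matching on the `tweet_id` key instead, using small made-up frames in place of the real files:

```python
import pandas as pd

# Toy stand-ins for Tweet.csv and Company_Tweet.csv (made-up values).
tweets = pd.DataFrame({
    "tweet_id": [10, 11, 12],
    "body": ["a $AAPL", "b $TSLA", "c $AAPL $TSLA"],
})
company = pd.DataFrame({
    "tweet_id": [12, 10, 12, 11],
    "ticker_symbol": ["AAPL", "AAPL", "TSLA", "TSLA"],
})

# Match rows on the shared key, regardless of row order in either file.
merged = tweets.merge(company, on="tweet_id", how="inner")
print(merged)
```

`Company_Tweet.csv` has more rows than `Tweet.csv`, which suggests that one tweet can map to several tickers. A key-based merge keeps one row per (tweet, ticker) pair, as tweet 12 shows here.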


In [186]:
df.drop(['tweet_id_t','tweet_id_ct', 'writer'], axis=1, inplace=True)
df

Unnamed: 0,post_date,body,comment_num,retweet_num,like_num,ticker_symbol
0,1420070457,"lx21 made $10,008 on $AAPL -Check it out! htt...",0,0,1,AAPL
1,1420070496,Insanity of today weirdo massive selling. $aap...,0,0,0,AAPL
2,1420070510,S&P100 #Stocks Performance $HD $LOW $SBUX $TGT...,0,0,0,AAPL
3,1420070807,$GM $TSLA: Volkswagen Pushes 2014 Record Recal...,0,0,1,AAPL
4,1420071005,Swing Trading: Up To 8.91% Return In 14 Days h...,0,0,1,AAPL
...,...,...,...,...,...,...
3717959,1577836383,That $SPY $SPX puuump in the last hour was the...,1,0,6,TSLA
3717960,1577836401,In 2020 I may start Tweeting out positive news...,0,0,1,TSLA
3717961,1577836443,Patiently Waiting for the no twitter sitter tw...,0,0,5,TSLA
3717962,1577836537,I don't discriminate. I own both $aapl and $ms...,1,0,1,TSLA


In [187]:
df['interactions'] = df[['comment_num', 'retweet_num', 'like_num']].sum(axis=1)
df.drop(['comment_num', 'retweet_num', 'like_num'], axis=1, inplace=True)
df

Unnamed: 0,post_date,body,ticker_symbol,interactions
0,1420070457,"lx21 made $10,008 on $AAPL -Check it out! htt...",AAPL,1
1,1420070496,Insanity of today weirdo massive selling. $aap...,AAPL,0
2,1420070510,S&P100 #Stocks Performance $HD $LOW $SBUX $TGT...,AAPL,0
3,1420070807,$GM $TSLA: Volkswagen Pushes 2014 Record Recal...,AAPL,1
4,1420071005,Swing Trading: Up To 8.91% Return In 14 Days h...,AAPL,1
...,...,...,...,...
3717959,1577836383,That $SPY $SPX puuump in the last hour was the...,TSLA,7
3717960,1577836401,In 2020 I may start Tweeting out positive news...,TSLA,1
3717961,1577836443,Patiently Waiting for the no twitter sitter tw...,TSLA,5
3717962,1577836537,I don't discriminate. I own both $aapl and $ms...,TSLA,2


In [188]:
import datetime

In [189]:
pdate = df.post_date.apply(lambda epoch : datetime.datetime.fromtimestamp(epoch))
pdate

0         2015-01-01 02:00:57
1         2015-01-01 02:01:36
2         2015-01-01 02:01:50
3         2015-01-01 02:06:47
4         2015-01-01 02:10:05
                  ...        
3717959   2020-01-01 02:53:03
3717960   2020-01-01 02:53:21
3717961   2020-01-01 02:54:03
3717962   2020-01-01 02:55:37
3717963   2020-01-01 02:55:53
Name: post_date, Length: 3717964, dtype: datetime64[ns]
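`datetime.datetime.fromtimestamp` interprets the epoch in the local time zone of the machine running the notebook, so the timestamps above depend on where the code runs. Since the tweets will later be matched to trading days, one option is to convert the epochs explicitly to the exchange's time zone. The sketch below assumes US Eastern time (`America/New_York`) is the relevant zone for these tickers:

```python
import pandas as pd

epochs = pd.Series([1420070457, 1577836383], name="post_date")

# Interpret the integers as seconds since the Unix epoch, in UTC.
utc = pd.to_datetime(epochs, unit="s", utc=True)

# Convert to the exchange's time zone (assumed here: US Eastern).
eastern = utc.dt.tz_convert("America/New_York")

# Calendar date of each tweet as seen from the exchange, at midnight.
trading_date = eastern.dt.tz_localize(None).dt.normalize()
print(trading_date)
```

With this conversion the first and last tweets in the sample fall on 2014-12-31 and 2019-12-31 rather than on New Year's Day.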

In [190]:
df['post_date'] = pdate
df.rename(columns={'post_date': 'date', 'ticker_symbol': 'ticker', 'body': 'tweet'}, inplace=True)
df

Unnamed: 0,date,tweet,ticker,interactions
0,2015-01-01 02:00:57,"lx21 made $10,008 on $AAPL -Check it out! htt...",AAPL,1
1,2015-01-01 02:01:36,Insanity of today weirdo massive selling. $aap...,AAPL,0
2,2015-01-01 02:01:50,S&P100 #Stocks Performance $HD $LOW $SBUX $TGT...,AAPL,0
3,2015-01-01 02:06:47,$GM $TSLA: Volkswagen Pushes 2014 Record Recal...,AAPL,1
4,2015-01-01 02:10:05,Swing Trading: Up To 8.91% Return In 14 Days h...,AAPL,1
...,...,...,...,...
3717959,2020-01-01 02:53:03,That $SPY $SPX puuump in the last hour was the...,TSLA,7
3717960,2020-01-01 02:53:21,In 2020 I may start Tweeting out positive news...,TSLA,1
3717961,2020-01-01 02:54:03,Patiently Waiting for the no twitter sitter tw...,TSLA,5
3717962,2020-01-01 02:55:37,I don't discriminate. I own both $aapl and $ms...,TSLA,2


In [191]:
df['date'] = df['date'].dt.normalize()
df['date']

0         2015-01-01
1         2015-01-01
2         2015-01-01
3         2015-01-01
4         2015-01-01
             ...    
3717959   2020-01-01
3717960   2020-01-01
3717961   2020-01-01
3717962   2020-01-01
3717963   2020-01-01
Name: date, Length: 3717964, dtype: datetime64[ns]

In [192]:
pd.unique(df.ticker)

array(['AAPL', 'GOOG', 'GOOGL', 'AMZN', 'MSFT', 'TSLA'], dtype=object)

In [193]:
df.loc[df['ticker'] == 'GOOGL', "ticker"] = 'GOOG'

In [194]:
pd.unique(df.ticker)

array(['AAPL', 'GOOG', 'AMZN', 'MSFT', 'TSLA'], dtype=object)

In [195]:
tickers = ['AAPL', 'GOOG', 'AMZN', 'MSFT', 'TSLA']

In [196]:
t = {}
for ticker in tickers:
    # Chain drop() on the filtered rows instead of dropping in place;
    # this avoids pandas' SettingWithCopyWarning on a slice of df.
    per_ticker = df.loc[df['ticker'] == ticker].drop(columns=['ticker'])
    # One row per calendar day: concatenate that day's tweets and sum the interactions.
    t[ticker] = per_ticker.groupby('date', as_index=False).agg({'tweet': ' + '.join, 'interactions': 'sum'})
t['AAPL']

Unnamed: 0,date,tweet,interactions
0,2015-01-01,"lx21 made $10,008 on $AAPL -Check it out! htt...",1843
1,2015-01-02,"$AAPL stock content, charts, analysis, & more ...",3235
2,2015-01-03,The Ever-Changing World Of Apple http://seekin...,601
3,2015-01-04,Free 5€ in account balance for first 100.000 m...,1373
4,2015-01-05,http://mf.tt/xKIKa Stock Alert Video for $GOOG...,1636
...,...,...,...
658,2016-10-20,"eBay continues shift to ""#Amazon-esque"" busine...",3447
659,2016-10-21,This Video Of Tesla's Complete Autonomy Is Pre...,2670
660,2016-10-22,Apple Can Still Profit Off Scaled Back Automob...,2477
661,2016-10-23,"$GOOG $AAPL easymoneylucy: $TBEV looking good,...",1289


In [197]:
import yfinance as yf
import numpy as np

In [198]:
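p = {}
for ticker in tickers:
    # Download the full daily price history for this ticker.
    p[ticker] = yf.download(ticker)
    # Keep only the opening and closing prices.
    p[ticker].drop(['High', 'Low', 'Adj Close', 'Volume'], axis=1, inplace=True)
    # Turn the Date index into a regular column so it can be merged on later.
    p[ticker] = p[ticker].reset_index(level=0)
    p[ticker].rename(columns={'Date':'date', 'Open':'open', 'Close': 'close'}, inplace=True)
    # Target label: 1 if the stock closed above its open, else 0.
    p[ticker]['closed_higher'] = np.where(p[ticker]['close'] > p[ticker]['open'], 1, 0)
p['AAPL']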


Unnamed: 0,date,open,close,closed_higher
0,1980-12-12,0.128348,0.128348,0
1,1980-12-15,0.122210,0.121652,0
2,1980-12-16,0.113281,0.112723,0
3,1980-12-17,0.115513,0.115513,0
4,1980-12-18,0.118862,0.118862,0
...,...,...,...,...
10574,2022-11-18,152.309998,151.289993,0
10575,2022-11-21,150.160004,148.009995,0
10576,2022-11-22,148.130005,150.179993,1
10577,2022-11-23,149.449997,151.070007,1


In [227]:
data = {}
for ticker in tickers:
    data[ticker] = pd.merge(p[ticker], t[ticker], on=['date'])
data['AAPL']

Unnamed: 0,date,open,close,closed_higher,tweet,interactions
0,2015-01-02,27.847500,27.332500,0,"$AAPL stock content, charts, analysis, & more ...",3235
1,2015-01-05,27.072500,26.562500,0,http://mf.tt/xKIKa Stock Alert Video for $GOOG...,1636
2,2015-01-06,26.635000,26.565001,0,#TOPTICKERTWEETS $IMRS $SPY $AAPL $USO $GILD $...,1300
3,2015-01-07,26.799999,26.937500,1,"Apple Should Post An Astounding Q1, But It Mig...",1450
4,2015-01-08,27.307501,27.972500,1,Apple: Asian Carriers Are Making iPhone 6 Chea...,2421
...,...,...,...,...,...,...
452,2016-10-18,29.545000,29.367500,0,Apple Stock Price: 117.55 #apple $AAPL + #Appl...,2618
453,2016-10-19,29.312500,29.280001,0,$GOOG - New CEO Set for Sony Music Entertainme...,3319
454,2016-10-20,29.215000,29.264999,1,"eBay continues shift to ""#Amazon-esque"" busine...",3447
455,2016-10-21,29.202499,29.150000,0,This Video Of Tesla's Complete Autonomy Is Pre...,2670
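Once a numerical sentiment column exists, a merged frame like this one could be evaluated along the following lines. This is only a sketch under stated assumptions: the `mean_sentiment` column and its values are made up, the one-day lag is one possible way to keep same-day price information out of the features, and the majority-class baseline is a reference point rather than the model itself.

```python
import pandas as pd

# Toy stand-in for one ticker's merged frame; the sentiment values are made up.
daily = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-02", "2015-01-05", "2015-01-06",
                            "2015-01-07", "2015-01-08", "2015-01-09"]),
    "closed_higher": [0, 0, 0, 1, 1, 0],
    "mean_sentiment": [0.1, -0.3, 0.2, 0.4, -0.1, 0.0],
}).sort_values("date")

# Use only information available before the session: the previous row's sentiment.
daily["sentiment_lag1"] = daily["mean_sentiment"].shift(1)
daily = daily.dropna(subset=["sentiment_lag1"])

# Chronological split: the first 60% of days for fitting, the rest for testing.
split = len(daily) * 3 // 5
train, test = daily.iloc[:split], daily.iloc[split:]

# Majority-class baseline that any real model should beat.
majority = int(train["closed_higher"].mean() >= 0.5)
baseline_accuracy = (test["closed_higher"] == majority).mean()
print(len(train), len(test), majority, baseline_accuracy)
```

Splitting by time rather than at random keeps every test day strictly after every training day, which matches how the predictions would be used in practice.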
