# Sentiment Analysis of Stock Tweets

This Jupyter Notebook cleans and processes comments on various stocks. Given the huge amount of news about different stocks every day, it is convenient to filter that news automatically and extract its basic sentiment. Even though the sentiment may be unrelated to how the stock price moves, it can serve as a reference for the current mood of the market.

Data taken from Kaggle at 
https://www.kaggle.com/yash612/stockmarket-sentiment-dataset

Prepared by Shing Chi Leung on 27 May 2021

In [1]:
import pandas as pd


In [2]:
df = pd.read_csv("stock_data.csv")
df

Unnamed: 0,Text,Sentiment
0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,1
1,user: AAP MOVIE. 55% return for the FEA/GEED i...,1
2,user I'd be afraid to short AMZN - they are lo...,1
3,MNTA Over 12.00,1
4,OI Over 21.37,1
...,...,...
5786,Industry body CII said #discoms are likely to ...,-1
5787,"#Gold prices slip below Rs 46,000 as #investor...",-1
5788,Workers at Bajaj Auto have agreed to a 10% wag...,1
5789,"#Sharemarket LIVE: Sensex off day’s high, up 6...",1


We can see a few types of text here. The simplest type states that a certain stock is above or below a certain price, which can be read as a break through a support or resistance level. The second type is news about a certain company, mostly about company decisions or large transactions. Finally, there are user comments on a certain stock. In most cases, one or more stock codes appear in full capital letters.

In [3]:
df.shape

(5791, 2)

In [4]:
df["Sentiment"].value_counts()

 1    3685
-1    2106
Name: Sentiment, dtype: int64

The author of the dataset has already labelled each tweet as positive (1) or negative (-1). The classes are moderately imbalanced: 3685 positive tweets against 2106 negative ones, roughly 64% to 36%. Both classes are still well represented, which gives the machine learning part reasonable exposure to each kind of text.
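To put numbers on the class balance, the shares can be computed directly from the counts above. This is a small sketch that re-enters the two counts by hand instead of reading `stock_data.csv`; the name `majority_baseline` is introduced here purely for illustration.

```python
import pandas as pd

# Class counts copied by hand from the value_counts() output above
counts = pd.Series({1: 3685, -1: 2106}, name="Sentiment")

# Share of each class in the data set
shares = counts / counts.sum()
print(shares.round(3))

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = shares.max()
print(round(majority_baseline, 3))
```

Any classifier trained on this data should be compared against that trivial baseline rather than against 50%.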

Before the machine learning part, I will remove punctuation, stock codes, and numbers, and convert the text to lower case.

In [83]:
# remove punctuation (regex=True is needed in pandas >= 2.0, where the default is a literal match)
df["Text"] = df["Text"].str.replace(r"[-,.\(\):;\'\"!\?#%/]", "", regex=True)

# remove stock codes (any run of capital letters followed by a space)
df["Text"] = df["Text"].str.replace(r"[A-Z]+ ", "", regex=True)

# convert all characters to lower case
df["Text"] = df["Text"].str.lower()

# remove all digits
df["Text"] = df["Text"].str.replace(r"[0-9]", "", regex=True)

# strip leading and trailing whitespace
df["Text"] = df["Text"].str.strip()

df

Unnamed: 0,Text,Sentiment
0,kickers on my watchlist trade method or meth...,1
1,user return for the indicator just trades fo...,1
2,user id be afraid to short they are looking l...,1
3,over,1
4,over,1
...,...,...
5786,industry body said discoms are likely to suffe...,-1
5787,gold prices slip below rs as investors book p...,-1
5788,workers at bajaj auto have agreed to a wage c...,1
5789,sharemarket sensex off day’s high up points n...,1


The processed text now consists only of lower-case words, which should be sufficient for the machine learning part to pick up the features.
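The cleaning steps can also be traced on a single string with Python's `re` module, which makes their side effects easier to see. The helper `clean_tweet` below is a hypothetical name used only for this sketch; it applies the same patterns in the same order as the cell above.

```python
import re


def clean_tweet(text: str) -> str:
    """Apply the notebook's cleaning steps to one string (illustrative helper)."""
    text = re.sub(r"[-,.\(\):;\'\"!\?#%/]", "", text)  # drop punctuation
    text = re.sub(r"[A-Z]+ ", "", text)  # drop runs of capitals followed by a space
    text = text.lower()  # lower-case everything
    text = re.sub(r"[0-9]", "", text)  # drop digits
    return text.strip()  # trim leading and trailing whitespace


print(clean_tweet("MNTA Over 12.00"))
print(clean_tweet("I like AAPL stock"))
```

The second call shows that the stock-code pattern also removes ordinary one-letter capital words such as "I", since it matches any run of capitals followed by a space.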

In [84]:
corpus = df["Text"].to_numpy()
corpus[:10]

array(['kickers on my watchlist  trade method  or method  see prev posts',
       'user  return for the indicator just  trades for the year',
       'user id be afraid to short  they are looking like a nearmonopoly in ebooks and infrastructureasaservice',
       'over', 'over', 'over',
       'user if so then the current downtrend will break otherwise just a shortterm correction in medterm downtrend',
       'mondays relative weakness',
       'ower trend line channel test & volume support',
       'will watch tomorrow for entry'], dtype=object)

In [85]:
labels = df["Sentiment"].to_numpy()

## Machine Learning of Tweet Sentiment by Classification

Now we move on to the second part of the project. We will use machine learning to classify the tweets one by one. In particular, we will do the following steps:

1. Use tokens to represent all the words
2. Weight each word's relative importance by its TF-IDF score
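The two steps can be sketched on a toy corpus before applying them to the tweets. The three sentences below are made up for illustration, and `toy_pipeline` is a hypothetical name; the point is only that a word occurring in every document ("stock") receives a lower TF-IDF weight than a rarer one ("rally").

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Made-up sentences; "stock" appears in all of them, "rally" in only one
toy_corpus = ["stock up today", "stock down today", "stock rally"]

# Step 1: token counts, step 2: TF-IDF weighting
toy_pipeline = Pipeline([("cv", CountVectorizer()), ("tf", TfidfTransformer())])
toy_scores = toy_pipeline.fit_transform(toy_corpus)

# Look up the column index of each word, then read the weights of the third sentence
vocab = toy_pipeline["cv"].vocabulary_
row = toy_scores[2].toarray().ravel()
print("stock:", row[vocab["stock"]])
print("rally:", row[vocab["rally"]])
```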

In [86]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

In [87]:
pipeline = Pipeline([("cv", CountVectorizer()), ("tf", TfidfTransformer())])
pipeline.fit(corpus)

Pipeline(steps=[('cv', CountVectorizer()), ('tf', TfidfTransformer())])

In [88]:
len(pipeline["cv"].vocabulary_)

8542

In [89]:
tf_vec = pipeline.transform(corpus)

The variable `tf_vec` now stores the TF-IDF scores of every tweet. Because its dimension is very high (8542 features), we can project it onto a lower-dimensional space with a truncated singular value decomposition (SVD). This gives each tweet a much more compact representation.

In [90]:
from sklearn.decomposition import TruncatedSVD
tsvd = TruncatedSVD(n_components=50)
tsvd.fit(tf_vec)

TruncatedSVD(n_components=50)

In [91]:
tf_vec50 = tsvd.transform(tf_vec)
tf_vec50.shape

(5791, 50)

The SVD is very helpful for compressing the very sparse, high-dimensional matrix into a compact dense one.
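How much information a truncated SVD keeps can be checked through the `explained_variance_ratio_` attribute. The sketch below uses a random sparse matrix as a stand-in for `tf_vec`; the matrix size, density and number of components are arbitrary assumptions, so the printed numbers say nothing about the tweet data itself.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix standing in for a TF-IDF matrix (assumed shape and density)
X = sparse_random(200, 500, density=0.01, format="csr", random_state=0)

# Project the 500 sparse columns onto 20 dense components
svd = TruncatedSVD(n_components=20, random_state=0)
X_small = svd.fit_transform(X)

print(X_small.shape)
# Fraction of the total variance retained by the kept components
print(svd.explained_variance_ratio_.sum())
```

Running the same check on the fitted `tsvd` object would show how much of the TF-IDF variance the 50 components retain, which is one way to judge whether the reduced vectors are adequate for classification.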

Now we move on to the classification itself, using a Random Forest classifier. For demonstration and accuracy purposes, I will use the original vector representation of the tweets rather than the decomposed one, because the dataset is small enough for the training to finish within minutes.

In [92]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [97]:
train_X, test_X, train_y, test_y = train_test_split(tf_vec, labels, test_size=0.2)

In [98]:
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(train_X, train_y)

RandomForestClassifier()

In [99]:
y_pred = rfc.predict(test_X)

In [100]:
from sklearn.metrics import classification_report, accuracy_score
print(classification_report(y_pred, test_y))
print(accuracy_score(y_pred, test_y))

              precision    recall  f1-score   support

          -1       0.55      0.77      0.64       301
           1       0.91      0.78      0.84       858

    accuracy                           0.77      1159
   macro avg       0.73      0.77      0.74      1159
weighted avg       0.81      0.77      0.79      1159

0.7748058671268335


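The score is neither particularly high nor low. Note that `classification_report` expects the true labels first and the predictions second; because the predictions are passed first here, the columns labelled precision and recall are interchanged, and the support column counts predicted rather than true labels. Read with that in mind, the classifier recovers 91% of the truly positive tweets but only 55% of the truly negative ones, while 78% of its positive predictions and 77% of its negative predictions are correct. Part of the reason for the weak result on negative tweets may be that negative sentiment is sometimes hidden behind an ironic or sarcastic tone, which cannot be captured without a large number of samples. The accuracy score is about 0.77, which is acceptable in view of the simple classification we have done.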
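scikit-learn's metric functions take the true labels as the first argument and the predictions as the second, and reversing them interchanges precision and recall. The toy labels below are made up to make the effect visible; they are not taken from the tweet data.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: four true positives in the data, only two of them predicted
y_true = [1, 1, 1, 1, -1, -1]
y_pred = [1, 1, -1, -1, -1, -1]

# Conventional order: (y_true, y_pred)
precision = precision_score(y_true, y_pred, pos_label=1)
recall = recall_score(y_true, y_pred, pos_label=1)
print(precision, recall)

# Reversed order: the value reported as "precision" is really the recall
swapped_precision = precision_score(y_pred, y_true, pos_label=1)
print(swapped_precision)
```

Accuracy is symmetric in its two arguments, so `accuracy_score` gives the same value in either order.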

## Application to Fiducial Tweets

Now we test the classifier with some artificial tweets that I wrote myself. The tweets are deliberately very similar to one another, and we want to see if the classifier can handle this basic task.

In [105]:
test_sentences = ["this is a good stock", 
                  "this is a bad stock", 
                  "this stock will go up", 
                  "this stock will go down",
                  "hope for this stock",
                  "no hope for this stock"]
test_vec = pipeline.transform(test_sentences)
rfc.predict(test_vec)

array([ 1,  1,  1, -1,  1,  1], dtype=int64)

In [106]:
print("bad = {}".format(pipeline["cv"].vocabulary_["bad"]))
print("no = {}".format(pipeline["cv"].vocabulary_["no"]))

bad = 531
no = 5318


The six sentences are meant to be three positive and three negative. However, the classifier identifies only one of the negative tweets ("this stock will go down"). The main feature of the other two negative tweets is a negative word such as "bad" or "no", and the classifier does not seem to register the negativity of these words. To find the reason, we need to examine the training tweets with similar features.

Let's take a look at the tweets that contain the word "bad".

In [108]:
df[df["Text"].str.contains("bad")].groupby("Sentiment").head(3)

Unnamed: 0,Text,Sentiment
289,option traders bet on bad earnings selling ja...,-1
638,demand for breast implants at alltime highs g...,1
1186,wooooow really bad wont touch the applestock w...,-1
1738,if manufacturing index eslts are bad = below,-1
1842,feel bad for those stoplosses that got taken out,1
3794,m my position hurt so bad today butt it just s...,1


In [109]:
df[df["Text"].str.contains("no")].groupby("Sentiment").head(3)

Unnamed: 0,Text,Sentiment
2,user id be afraid to short they are looking l...,1
11,it really worries me how everyone expects the ...,1
13,user maykiljil posted that agree that is goin...,1
26,red not ready for break out,-1
107,wow not good for user sales for the nook were...,-1
109,the new wightwatchers ads are more fun and lig...,-1
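
Note that `str.contains("no")` matches the two letters anywhere in the text, so tweets containing "not" or "nearmonopoly" are selected as well. A sketch of a whole-word match with the regex word boundary `\b` follows; the three toy strings are shortened from tweets and test sentences shown earlier, and `toy` is a hypothetical name.

```python
import pandas as pd

toy = pd.Series(
    [
        "no hope for this stock",
        "red not ready for break out",
        "a nearmonopoly in ebooks",
    ]
)

# Substring match: also hits "not" and "nearmonopoly"
substring_hits = toy.str.contains("no")

# Whole-word match: \b marks a word boundary on each side of "no"
whole_word_hits = toy.str.contains(r"\bno\b", regex=True)

print(substring_hits.tolist())
print(whole_word_hits.tolist())
```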


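The samples show a mixed use of terms like "not" or "bad". (The pattern "no" is matched as a substring, so it also selects tweets containing "not" or "nearmonopoly".) Several tweets carry such a word and yet are labelled positive overall, for example "feel bad for those stoplosses that got taken out". A classifier that looks at individual words therefore cannot treat them as reliable negative signals; it would need to take the overall structure of the sentence into account in order to improve its accuracy.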
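
One simple way to give a bag-of-words model a little context, which is not used in this notebook, is to count adjacent word pairs as well as single words through the `ngram_range` parameter of `CountVectorizer`. The sketch below only shows that a negated phrase such as "no hope" becomes a feature of its own; whether this improves the classifier on the tweet data is untested here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two of the artificial test sentences from above
toy_sentences = ["hope for this stock", "no hope for this stock"]

# Count single words (unigrams) and adjacent word pairs (bigrams)
bigram_cv = CountVectorizer(ngram_range=(1, 2))
counts = bigram_cv.fit_transform(toy_sentences)

# The vocabulary now contains phrases such as "no hope" next to the single words
features = sorted(bigram_cv.vocabulary_)
print(features)
```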