# Twitter Sentiment Analysis (NLP)

The dataset consists of tweets from Twitter. The aim of this code is to predict whether a tweet is a positive or a negative statement using Natural Language Processing.

In [1]:
#importing libraries
import pandas as pd
import numpy as np

In [2]:
#reading the dataset
df = pd.read_csv("twitter_sentiment.csv", encoding='ISO-8859-1')   #this encoding prevents a UnicodeDecodeError

In [3]:
df.shape

(99988, 3)

The dataset contains 99988 tweets in 3 columns. The two that matter here are SentimentText (the tweet itself) and Sentiment (its label).

In [4]:
df.head(10)

   ItemID  Sentiment  SentimentText
0       1          0  is so sad for my APL frie...
1       2          0  I missed the New Moon trail...
2       3          1  omg its already 7:30 :O
3       4          0  .. Omgaga. Im sooo im gunna CRy. I'...
4       5          0  i think mi bf is cheating on me!!! ...
5       6          0  or i just worry too much?
6       7          1  Juuuuuuuuuuuuuuuuussssst Chillin!!
7       8          0  Sunny Again Work Tomorrow :-| ...
8       9          1  handed in my uniform today . i miss you ...
9      10          1  hmmmm.... i wonder how she my number @-)


In [5]:
df['Sentiment'].value_counts()        # 56457 positive (1) and 43531 negative (0) statements

1    56457
0    43531
Name: Sentiment, dtype: int64
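
The class counts above also give a useful reference point: a model that always predicts the majority class is already right on every positive tweet. Below is a minimal sketch of that baseline, using only the two counts printed above (the variable names are illustrative and not part of the original notebook).

```python
# Class counts taken from the value_counts() output above.
positive = 56457
negative = 43531
total = positive + negative

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(positive, negative) / total
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

A trained classifier is best judged against this figure rather than against 50%, since the two classes are not perfectly balanced.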

Importing the Natural Language Toolkit (NLTK) package along with the sub-packages used for preprocessing

In [6]:
import re
import nltk
# nltk.download('stopwords')   #run once if the stop word corpus is not installed yet
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [7]:
ps = PorterStemmer()
data = []

In [8]:
stop_words = set(stopwords.words('english'))   #build the stop word set once, not once per word
for i in range(len(df)):
    sen_text = df["SentimentText"][i]
    #remove URLs
    sen_text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                      r'(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', sen_text)
    sen_text = re.sub(r"(@[A-Za-z0-9_]+)", "", sen_text)   #remove @mentions
    sen_text = sen_text.lower()
    sen_text = sen_text.split()
    #drop stop words, then stem the words that remain
    sen_text = [ps.stem(word) for word in sen_text if word not in stop_words]
    sen_text = ' '.join(sen_text)
    data.append(sen_text)
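
The cleaning steps in the loop above can be tried on a single tweet without NLTK. The sketch below is a simplified stand-in, not the notebook's exact pipeline: `clean_tweet` is a hypothetical helper, the URL pattern is a shorter approximation (`https?://\S+`), the stop word set is a tiny illustrative sample, and stemming is left out.

```python
import re

# Simplified patterns (assumptions for illustration, not the notebook's exact regexes).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@[A-Za-z0-9_]+")

# Tiny hypothetical stop word sample; the notebook uses NLTK's English list instead.
STOP_WORDS = frozenset({"is", "a", "the", "to", "for", "my", "so", "at"})

def clean_tweet(text, stop_words=STOP_WORDS):
    """Strip URLs and @mentions, lowercase, and drop stop words."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    tokens = text.lower().split()
    return " ".join(tok for tok in tokens if tok not in stop_words)

cleaned = clean_tweet("@bob Check this out http://example.com/x?y=1 it is SO cool")
print(cleaned)
```

Wrapping the steps in one function also makes it easier to apply exactly the same cleaning to new text at prediction time.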

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

In [10]:
vect = CountVectorizer(max_features=10000)

In [11]:
x = vect.fit_transform(data)
x = x.toarray()     #dense array; at 99988 x 10000 int64 values this needs about 8 GB of RAM
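
To make the vectorization step more concrete, here is a rough pure-Python sketch of what a bag-of-words count matrix with a capped vocabulary looks like. It only approximates `CountVectorizer`: `fit_vocabulary` and `transform` are hypothetical helpers, tokens are split on whitespace, and ties between equally frequent words may be resolved differently than in scikit-learn.

```python
from collections import Counter

def fit_vocabulary(docs, max_features):
    """Keep the max_features most frequent tokens and index them alphabetically."""
    totals = Counter(tok for doc in docs for tok in doc.split())
    top = [tok for tok, _ in totals.most_common(max_features)]
    return {tok: idx for idx, tok in enumerate(sorted(top))}

def transform(docs, vocab):
    """Turn each document into a row of token counts over the vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in doc.split():
            if tok in vocab:          # tokens outside the vocabulary are ignored
                row[vocab[tok]] += 1
        rows.append(row)
    return rows

docs = ["good day good", "bad day", "good night"]
vocab = fit_vocabulary(docs, max_features=2)
matrix = transform(docs, vocab)
print(vocab)
print(matrix)
```

Each row has one column per vocabulary word, which is why the notebook's matrix ends up with 10000 columns when `max_features=10000`.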

In [12]:
#saving the fitted CountVectorizer with joblib so that it can be used to predict elsewhere without running the entire code
from joblib import dump     #sklearn.externals.joblib was removed in scikit-learn 0.23
dump(vect,"twitterdata.bin")



['twitterdata.bin']

In [13]:
x.shape             #x holds the vectorized sentiment text

(99988, 10000)

In [14]:
#y is the output column "Sentiment"
y = df["Sentiment"].values

In [15]:
y = y.reshape(-1, 1)

In [16]:
y.shape

(99988, 1)

Training a Keras model to make predictions

In [17]:
#Splitting the dataset into 80% training and 20% testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
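
As a sanity check on the split sizes, the sketch below reproduces the 80/20 arithmetic for 99988 rows with a seeded shuffle. `split_indices` is a hypothetical helper; it rounds the test share up, which is assumed here to match `train_test_split`, but it will not produce the same row ordering as scikit-learn's `random_state=0`.

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed and cut off the test share."""
    n_test = math.ceil(test_size * n_samples)   # test share rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # seeded, so the split is repeatable
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(99988)
print(len(train_idx), len(test_idx))
```

Fixing the seed is what makes the reported accuracy reproducible from run to run.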

In [18]:
from keras.models import Sequential
from keras.layers import Dense

Using TensorFlow backend.


In [19]:
model = Sequential()              #to initialize the Neural Network model

In [20]:
model.add(Dense(activation="relu", input_dim=10000, units=500, kernel_initializer="uniform"))

In [21]:
model.add(Dense(activation="relu", units=150, kernel_initializer="uniform"))

In [22]:
model.add(Dense(activation="relu", units = 20 , kernel_initializer="uniform"))

In [23]:
model.add(Dense(activation="relu", units = 6, kernel_initializer="uniform"))

In [24]:
model.add(Dense(activation = 'sigmoid', units = 1, kernel_initializer = 'uniform'))
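
The size of this network can be worked out by hand from the `model.add(...)` calls above. The sketch below assumes each Dense layer keeps its default bias term; `dense_param_count` is an illustrative helper, not a Keras function.

```python
# Layer widths read from the model.add(...) calls above:
# input, four hidden layers, and the single output unit.
layer_sizes = [10000, 500, 150, 20, 6, 1]

def dense_param_count(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

per_layer = [dense_param_count(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)
```

Almost all of the parameters sit in the first layer, because it connects the 10000 vocabulary columns to 500 units.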

In [25]:
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])



Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
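
The `binary_crossentropy` loss chosen in the compile step penalizes confident wrong predictions much more than uncertain ones. Below is a minimal pure-Python sketch of the formula, not Keras's implementation; the function name and the clipping value `eps` are assumptions made for this illustration.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Same two labels, three levels of prediction quality.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
unsure = binary_crossentropy([1, 0], [0.5, 0.5])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, unsure, confident_wrong)
```

The loss grows from the confident correct case to the unsure case to the confident wrong case, which is the behaviour the optimizer exploits during training.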


In [26]:
model.fit(X_train, y_train, batch_size = 32, epochs = 20)




Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x20268f27e88>

In [27]:
model.save("twitterdata.h5")             #saving the model so that it can be used for future reference

In [28]:
#making predictions
y_pred = model.predict(X_test)

In [29]:
y_pred = y_pred>0.5

In [49]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test,y_pred)

In [51]:
print('Accuracy: %f' % accuracy)

Accuracy: 0.723522
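
The last two steps, cutting the sigmoid outputs at 0.5 and scoring them against the true labels, can be sketched in plain Python. The helpers `threshold` and `accuracy` and the four sample values are made up for illustration; they are not the notebook's data.

```python
def threshold(probabilities, cutoff=0.5):
    """Convert predicted probabilities to booleans, like y_pred > 0.5."""
    return [p > cutoff for p in probabilities]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if bool(t) == bool(p))
    return correct / len(y_true)

probs = [0.9, 0.2, 0.6, 0.4]    # hypothetical model outputs
labels = [1, 0, 0, 0]           # hypothetical true labels
preds = threshold(probs)
acc = accuracy(labels, preds)
print(preds, acc)
```

Moving the cutoff away from 0.5 trades false positives against false negatives, so accuracy alone does not show how the errors are distributed between the two classes.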


Testing the trained model

In [43]:
prediction = model.predict(vect.transform(["this is good"]))

In [44]:
prediction = prediction>0.5

In [45]:
prediction

array([[ True]])