# Sentiment Analysis - Twitter Dataset

### Author: [Shivam Gupta](https://www.linkedin.com/in/shivamgupta1999/)

* Classifies whether a tweet has a **positive** or a **negative** sentiment.
* Preprocessing of the dataset includes:
  * Removing punctuation and stopwords (such as "and", "the", "is")
  * Removing hyperlinks and the `#` sign from hashtags
  * Tokenizing the tweet
  * Stemming the tokens
* A logistic regression model is implemented from scratch with NumPy, without using a machine-learning library.
* The gradient descent algorithm used to minimize the loss is also implemented in this notebook.
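
For reference, the model, cost, and update rule implemented later in the notebook can be written as below. The notation is an assumption made here for readability and is not taken from the notebook: $m$ is the number of training tweets, $X$ the $m \times 3$ feature matrix, $y$ the label vector, $\theta$ the weight vector, and $\alpha$ the learning rate (called `LR` in the code).

```latex
h = \sigma(X\theta), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}

J(\theta) = -\frac{1}{m}\left[\, y^{\top}\log h + (1 - y)^{\top}\log(1 - h) \,\right]

\theta \leftarrow \theta - \frac{\alpha}{m}\, X^{\top}(h - y)
```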

### Importing all the libraries

In [3]:
import re
import string
import numpy as np
import nltk
from nltk.corpus import twitter_samples
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

In [4]:
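# Downloading the dataset
nltk.download('twitter_samples')
# process_tweet below also needs the stopword list; if it is missing, run nltk.download('stopwords')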

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

### Preparing the dataset

In [5]:
positive_tweets=twitter_samples.strings('positive_tweets.json')
negative_tweets=twitter_samples.strings('negative_tweets.json')

# There are 5000 positive and 5000 negative tweets
# Split each class into train and test sets in a 4:1 ratio
train_pos = positive_tweets[:4000]
test_pos = positive_tweets[4000:]
train_neg = negative_tweets[:4000]
test_neg = negative_tweets[4000:]

train_x = train_pos + train_neg 
test_x = test_pos + test_neg

#Preparing labels
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

print("train_y.shape = " + str(train_y.shape))
print("test_y.shape = " + str(test_y.shape))

train_y.shape = (8000, 1)
test_y.shape = (2000, 1)
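
Because `train_x` lists all positive tweets before the negative ones, the labels have to follow the same order. The sketch below uses toy sizes (an assumption, standing in for the 4000/4000 split) to show the resulting layout and an equivalent construction with `np.concatenate`.

```python
import numpy as np

# Toy sizes standing in for the real training split (assumed for illustration).
n_pos, n_neg = 3, 2

# Approach used above: np.append along axis 0.
labels_append = np.append(np.ones((n_pos, 1)), np.zeros((n_neg, 1)), axis=0)

# Equivalent construction with np.concatenate.
labels_concat = np.concatenate([np.ones((n_pos, 1)), np.zeros((n_neg, 1))], axis=0)

print(labels_append.shape)             # (5, 1)
print(labels_append.ravel().tolist())  # [1.0, 1.0, 1.0, 0.0, 0.0]
```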


### Defining a function that performs all the preprocessing tasks

In [11]:
def process_tweet(tweet):
    
    # remove hyperlinks (note: '.*' is greedy, so everything from the URL to the end of the line is removed)
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    
    #only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    
    #tokenizing the tweet
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
    
    # Stemming the words and removing stopwords and punctuation
    stemmer = PorterStemmer()
    stopwords_english = set(stopwords.words('english'))  # set for fast membership tests
    
    tweets_clean = [stemmer.stem(word) for word in tweet_tokens
                    if word not in stopwords_english and word not in string.punctuation]
    
    return tweets_clean
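
The hyperlink pattern in `process_tweet` is worth a closer look, since it removes more than the URL itself. The sketch below compares it with a narrower pattern on a made-up tweet. `NARROW_URL` is a hypothetical alternative and not part of the notebook; swapping it in may change the frequency counts and accuracy reported here.

```python
import re

# Pattern used in process_tweet: '.*' is greedy, so the match runs to the end of the line.
ORIGINAL_URL = re.compile(r'https?:\/\/.*[\r\n]*')
# Hypothetical alternative: stop at the first whitespace character.
NARROW_URL = re.compile(r'https?://\S+')

# Made-up tweet with text after the link.
sample = "great read https://example.com/post loved it"

print(repr(ORIGINAL_URL.sub('', sample)))  # 'great read '
print(repr(NARROW_URL.sub('', sample)))    # 'great read  loved it'
```

With the original pattern, any words that follow a link on the same line never reach the tokenizer.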

### This function returns a dictionary mapping each (word, sentiment) pair to its frequency in the dataset

In [12]:
def frequency_dict(tweets, sentiments):
    
    labels = np.squeeze(sentiments).tolist()

    frequencies = {}
    for label, tweet in zip(labels, tweets):
        for word in process_tweet(tweet):
            pair = (word, label)
            if pair in frequencies:
                frequencies[pair] += 1
            else:
                frequencies[pair] = 1

    return frequencies
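
The same counting logic can be sketched with `collections.Counter` on tweets that are already tokenized. The function name `frequency_counter` and the toy data are assumptions for illustration; the notebook itself uses the plain dictionary above.

```python
from collections import Counter

def frequency_counter(token_lists, labels):
    """Count (token, label) pairs from tweets that are already tokenized."""
    counts = Counter()
    for tokens, label in zip(token_lists, labels):
        counts.update((token, label) for token in tokens)
    return counts

# Hypothetical, already-processed tweets with labels (1.0 = positive, 0.0 = negative).
toy_tokens = [["happi", "day", ":)"], ["sad", "day", ":("], ["happi", "happi"]]
toy_labels = [1.0, 0.0, 1.0]

toy_freqs = frequency_counter(toy_tokens, toy_labels)
print(toy_freqs[("happi", 1.0)])  # 3
print(toy_freqs[("day", 0.0)])    # 1
print(toy_freqs[("sad", 1.0)])    # 0 (a Counter returns 0 for unseen pairs)
```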

### This function builds the feature vector from the frequencies of positive and negative tokens

In [18]:
def feature_vector(tweet, frequencies):
    
    # clean the tweet: remove punctuation and stopwords, tokenize and stem
    clean_tweet = process_tweet(tweet)
    
    # feature vector containing a bias term and the positive and negative token frequencies
    features = np.zeros((1, 3)) 
    
    # bias term is set to 1
    features[0,0] = 1 
    
    for word in clean_tweet:
        # add the word's frequency in positive tweets
        features[0,1] += frequencies.get((word,1),0)
        
        # add the word's frequency in negative tweets
        features[0,2] += frequencies.get((word,0),0)
        
    return features
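
To make the three features concrete, here is a small worked example using a hypothetical frequency table and tokens that are already processed. `toy_feature_vector` mirrors the logic above but skips `process_tweet`, so it runs without NLTK.

```python
import numpy as np

# Hypothetical frequency table of (token, label) pairs.
toy_frequencies = {
    ("happi", 1.0): 10, ("happi", 0.0): 1,
    ("day", 1.0): 4, ("day", 0.0): 5,
    ("sad", 0.0): 8,
}

def toy_feature_vector(tokens, frequencies):
    """Return [[bias, summed positive counts, summed negative counts]]."""
    features = np.zeros((1, 3))
    features[0, 0] = 1  # bias term
    for token in tokens:
        features[0, 1] += frequencies.get((token, 1.0), 0)
        features[0, 2] += frequencies.get((token, 0.0), 0)
    return features

# "unseen" is not in the table, so it contributes 0 to both counts.
vec = toy_feature_vector(["happi", "day", "unseen"], toy_frequencies)
print(vec.tolist())  # [[1.0, 14.0, 6.0]]
```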

### Applying the preprocessing functions to the actual dataset

In [19]:
frequencies = frequency_dict(train_x, train_y)

print("len(frequencies) = " + str(len(frequencies.keys())))

print('This is an example of a positive tweet: \n', train_x[0])
print('\nThis is an example of the processed version of the tweet: \n', process_tweet(train_x[0]))

len(frequencies) = 11345
This is an example of a positive tweet: 
 #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)

This is an example of the processed version of the tweet: 
 ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']


In [20]:
# Converting each tweet to its corresponding feature vector
X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X[i, :] = feature_vector(train_x[i], frequencies)

# training labels corresponding to X
Y = train_y

### Implementing the sigmoid function

In [21]:
def sigmoid(z): 
    
    sig = 1/(1+np.exp(-z))
    
    return sig
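
For large negative `z`, `np.exp(-z)` overflows and NumPy emits a `RuntimeWarning`, although the result still evaluates to 0. The sketch below is one possible overflow-free variant. The name `stable_sigmoid` is an assumption, and note that it returns at least a 1-D array even for scalar input.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that avoids overflow in exp() for large-magnitude inputs."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the usual form is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use exp(z) / (1 + exp(z)) so exp() never sees a large positive argument.
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

print(stable_sigmoid([-1000.0, 0.0, 1000.0]).tolist())  # [0.0, 0.5, 1.0]
```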

### Implementing the Gradient Descent algorithm

In [24]:
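def gradientDescent(x, y, theta, LR, iterations):
    
    m = x.shape[0]  # number of training examples
    
    for i in range(iterations):
        
        # predicted probabilities for the current weights
        z = np.dot(x, theta)
        sig = sigmoid(z)
        # average cross-entropy cost
        cost = (-1./m)*(np.dot(y.transpose(), np.log(sig)) + np.dot((1 - y).transpose(), np.log(1 - sig)))
        # move theta against the gradient (updates the array in place)
        theta -= (LR/m)*(np.dot(x.transpose(), sig - y))
        
    # cost is a 1x1 array holding the value from the last iteration; .item() extracts it as a float
    cost = cost.item()
    return cost, theta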
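
A quick way to check a gradient descent implementation is to record the cost at every iteration and confirm that it falls. The sketch below does this on a tiny made-up dataset. `gradient_descent_with_history`, `x_toy`, and `y_toy` are hypothetical names, and the learning rate of 0.1 is chosen for this toy data only.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent_with_history(x, y, theta, lr, iterations):
    """Batch gradient descent that also records the cost at every step."""
    m = x.shape[0]
    theta = theta.copy()  # leave the caller's array untouched
    history = []
    for _ in range(iterations):
        h = sigmoid(x @ theta)
        cost = -(y.T @ np.log(h) + (1 - y).T @ np.log(1 - h)) / m
        history.append(cost.item())
        theta -= (lr / m) * (x.T @ (h - y))
    return history, theta

# Tiny made-up dataset: a bias column plus two features.
x_toy = np.array([[1.0, 3.0, 0.0], [1.0, 2.0, 1.0], [1.0, 0.0, 3.0], [1.0, 1.0, 2.0]])
y_toy = np.array([[1.0], [1.0], [0.0], [0.0]])

history, theta_toy = gradient_descent_with_history(x_toy, y_toy, np.zeros((3, 1)), 0.1, 200)
print(f"first cost: {history[0]:.4f}, last cost: {history[-1]:.4f}")
```

The first recorded cost equals ln 2 (about 0.6931), because with all-zero weights every prediction starts at 0.5.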

In [27]:
# Apply gradient descent
cost, theta = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 1500) #Learning rate is set to 1e-9 and 1500 iterations
print(f"The cost after training is {cost:.5f}.")
print(f"The resulting vector of weights is {[round(t, 5) for t in np.squeeze(theta)]}")

The cost after training is 0.24217.
The resulting vector of weights is [0.0, 0.00052, -0.00056]
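
The weights are small because the features are raw summed counts, which can be large. With a bias of roughly zero, the model predicts positive when `0.00052 * positive_count` exceeds `0.00056 * negative_count`. The sketch below applies the rounded weights printed above to hypothetical counts, so the probabilities are approximate.

```python
import math

# Weights as printed above (rounded to five decimals).
theta_reported = [0.0, 0.00052, -0.00056]

def positive_probability(pos_count, neg_count, theta=theta_reported):
    """Probability of positive sentiment for given summed token counts."""
    z = theta[0] + theta[1] * pos_count + theta[2] * neg_count
    return 1 / (1 + math.exp(-z))

# Hypothetical summed counts for two tweets.
print(positive_probability(3000, 100) > 0.5)  # True
print(positive_probability(100, 3000) > 0.5)  # False
```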


### Logistic regression: prediction and accuracy functions

In [28]:
def predict_tweet(tweet, frequencies, theta):
    
    # extracting features of the tweet
    x = feature_vector(tweet, frequencies)
    
    # Prediction using x and theta
    y_pred = sigmoid(np.dot(x,theta))
    
    return y_pred

def logistic_regression(test_x, test_y, frequencies, theta):

    m=test_y.shape[0]
    
    # Predictions
    y_hat = []
    
    for tweet in test_x:
        # get the label prediction for the tweet
        y_pred = predict_tweet(tweet, frequencies, theta)
        
        if y_pred > 0.5:
            y_hat.append(1.0)  # Positive sentiment
        else:
            y_hat.append(0.0)  # Negative sentiment

    # Calculating accuracy
    y_hat = np.asarray(y_hat)
    test_y = np.squeeze(test_y)
    acc = y_hat==test_y
    accuracy = np.sum(acc)/m
    
    return accuracy
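
The `np.squeeze(test_y)` call above matters: comparing a flat prediction array with an `(m, 1)` label array broadcasts to an `(m, m)` matrix and would give a wrong accuracy. A minimal sketch with made-up predictions and labels:

```python
import numpy as np

y_hat = np.array([1.0, 0.0, 1.0, 1.0])               # predictions, shape (4,)
test_y_col = np.array([[1.0], [0.0], [0.0], [1.0]])  # labels, shape (4, 1)

# Without squeezing, broadcasting compares every prediction with every label.
print((y_hat == test_y_col).shape)  # (4, 4)

# Squeezing the labels gives an element-wise comparison.
accuracy = np.mean(y_hat == np.squeeze(test_y_col))
print(accuracy)  # 0.75
```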

### Predicting sentiments of test data

In [31]:
test_accuracy = logistic_regression(test_x, test_y, frequencies, theta)
print(f"Logistic regression model's accuracy = {test_accuracy:.4f}")  

Logistic regression model's accuracy = 0.9950


## Our model predicts sentiment with 99.5% accuracy on the 2000 test tweets! Isn't that great? :)

### You can also test this model:

In [36]:
# Replace my_tweet with any sentence you like and check the prediction

my_tweet = "Christian Bale's performance was brilliant in Dark Knight!"

y_hat = predict_tweet(my_tweet, frequencies, theta)
if y_hat > 0.5:
    print('Positive sentiment')
else: 
    print('Negative sentiment')

Positive sentiment
