<a href="https://colab.research.google.com/github/thedatadj/computer-vision/blob/main/sentiment-analysis/deep_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data loading
Load two sets from NLTK's `twitter_samples` corpus:
* Set of tweets with positive sentiment
* Set of tweets with negative sentiment

In [None]:
import nltk
from nltk.corpus import twitter_samples
nltk.download('twitter_samples')
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


How many tweets are there in each set?

In [None]:
pn = len(all_positive_tweets)
nn = len(all_negative_tweets)
print(pn, nn)

5000 5000


# Data split
Create:
* Training set `train_x`
* Validation set `val_x`

<img src="https://drive.google.com/uc?export=view&id=19vSJsGft3277To0aw_YBpqLDAu2guDLa" width="50%" height="50%">

In [None]:
# Tool for this task
import numpy as np

Divide all the positive tweets into two subsets:
* Positive tweets for training (4,000)
* Positive tweets for validation (1,000)

In [None]:
train_pos = all_positive_tweets[:4000]
val_pos = all_positive_tweets[4000:]

Divide all the negative tweets into two subsets:
* Negative tweets for training (4,000)
* Negative tweets for validation (1,000)

In [None]:
train_neg = all_negative_tweets[:4000]
val_neg = all_negative_tweets[4000:]

Create a features training dataset, consisting of:
* Positive tweets for training (4,000)
* Negative tweets for training (4,000)

For a total of 8,000 training examples.

In [None]:
train_x = train_pos + train_neg

Create a dataset containing the labels/target of the training features.
* 1 for the positive tweets
* 0 for the negative tweets

In [None]:
positives = np.ones(4000)
negatives = np.zeros(4000)
train_y = np.append(positives, negatives)

Create validation set containing:
* Last 1,000 positive tweets from all the positive tweets.
* Last 1,000 negative tweets from all the negative tweets.

In [None]:
val_x = val_pos + val_neg

Create the validation set of labels containing:
* 1 for positive tweets
* 0 for negative tweets

In [None]:
val_y = np.append(positives[:1000], negatives[:1000])

In [None]:
print(f"length of train_x {len(train_x)}")
print(f"length of val_x {len(val_x)}")

length of train_x 8000
length of val_x 2000
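The slicing and labelling steps above can also be folded into one helper, which keeps features and labels aligned by construction. This is only a minimal sketch: `split_and_label` and the toy strings are hypothetical names introduced for illustration and are not part of the notebook.

```python
def split_and_label(positive, negative, n_train):
    """Split two class lists at n_train and build matching 1/0 labels."""
    pos_train, pos_val = positive[:n_train], positive[n_train:]
    neg_train, neg_val = negative[:n_train], negative[n_train:]
    x_train = pos_train + neg_train
    x_val = pos_val + neg_val
    # Labels follow the same order as the features: positives first, then negatives
    y_train = [1] * len(pos_train) + [0] * len(neg_train)
    y_val = [1] * len(pos_val) + [0] * len(neg_val)
    return x_train, y_train, x_val, y_val

# Toy stand-ins for the two tweet lists (hypothetical data)
toy_pos = [f"pos {i}" for i in range(5)]
toy_neg = [f"neg {i}" for i in range(5)]

toy_train_x, toy_train_y, toy_val_x, toy_val_y = split_and_label(toy_pos, toy_neg, n_train=4)
print(len(toy_train_x), len(toy_val_x))  # 8 2
print(toy_train_y)                       # [1, 1, 1, 1, 0, 0, 0, 0]
```

With the real data, the same call with `n_train=4000` would reproduce the 8,000 / 2,000 split shown above.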


# Data preprocessing
Create a function that:
* Removes unwanted characters from a tweet
* Tokenizes the tweet
* Removes stopwords and punctuation
* Stems the words

In [None]:
# Tool to remove unwanted characters from a string
import re

# Tool to tokenize the tweet
from nltk.tokenize import TweetTokenizer
tokenizer = TweetTokenizer(preserve_case=False,
                           strip_handles=True,
                           reduce_len=True)

# Tool to remove stopwords and punctuation
from nltk.corpus import stopwords
nltk.download('stopwords')
stopwords_english = stopwords.words('english')
import string

# Tool to word stem
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
def process_tweet(tweet):
    # Remove unwanted characters
    tweet = re.sub(r'\$\w*', '', tweet)
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    tweet = re.sub(r'#', '', tweet)
    # Tokenize tweet
    tweet_tokens = tokenizer.tokenize(tweet)
    # Remove stopwords, punctuation and stem the words
    tweet_clean = []
    for token in tweet_tokens:
        if (token not in stopwords_english and # remove stopwords
            token not in string.punctuation): # remove punctuation
            token_stem = stemmer.stem(token)
            tweet_clean.append(token_stem)
    return tweet_clean

In [None]:
# Try out function that processes tweets
print("Original tweet at training position 0\n")
print("-->", train_pos[0])

print("\n\nTweet at training position 0 after processing:\n")
process_tweet(train_pos[0])

Original tweet at training position 0

--> #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)


Tweet at training position 0 after processing:



['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
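To see what the four regular expressions in `process_tweet` do before tokenization, they can be run on their own. The sketch below uses only the standard library; `strip_tweet_markup` and the sample tweet are hypothetical and exist only for this illustration.

```python
import re

def strip_tweet_markup(tweet):
    """Apply the same four substitutions as process_tweet, in the same order."""
    tweet = re.sub(r'\$\w*', '', tweet)                 # stock tickers such as $GE
    tweet = re.sub(r'^RT[\s]+', '', tweet)              # leading retweet marker
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)  # a URL and the rest of its line
    tweet = re.sub(r'#', '', tweet)                     # the hash sign only, the word stays
    return tweet

# Hypothetical sample containing every pattern the function targets
sample = "RT #Happy day for $GE holders https://example.com/x more text"
cleaned = strip_tweet_markup(sample)
print(cleaned.split())  # ['Happy', 'day', 'for', 'holders']
```

Note that the URL pattern ends in `.*`, so any words that follow a link on the same line are removed along with it.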

# Vocabulary
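Build the vocabulary of the training data.
* Map each processed token to a unique integer index.
* Reserve indexes 0 to 2 for the special tokens `__PAD__`, `__</e>__` and `__UNK__`.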

In [None]:
def get_vocab(train_x):
    # Special tokens
    vocab = {'__PAD__': 0,
             '__</e>__': 1,
             '__UNK__': 2}
    for tweet in train_x:
        tweet_processed = process_tweet(tweet)
        for token in tweet_processed:
            if token not in vocab:
                index = len(vocab)
                vocab[token] = index
    return vocab

In [None]:
vocab = get_vocab(train_x)
print("Total number of tokens in vocab:", len(vocab))

Total number of tokens in vocab: 9088


In [None]:
print(list(vocab.items())[:4])

[('__PAD__', 0), ('__</e>__', 1), ('__UNK__', 2), ('followfriday', 3)]
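The index assignment is easiest to follow on a tiny input. The following sketch applies the same rule as `get_vocab` to tweets that are already tokenized, and adds an inverse mapping for decoding. `build_vocab`, `toy_tweets` and `index_to_token` are hypothetical names used only here.

```python
def build_vocab(tokenized_tweets):
    """Assign the next free index to every token not seen before."""
    vocab = {'__PAD__': 0, '__</e>__': 1, '__UNK__': 2}
    for tokens in tokenized_tweets:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

# Two hypothetical tweets, already processed into tokens
toy_tweets = [['happi', 'day', ':)'], ['sad', 'day', ':(']]
toy_vocab = build_vocab(toy_tweets)
print(toy_vocab)
# {'__PAD__': 0, '__</e>__': 1, '__UNK__': 2, 'happi': 3, 'day': 4, ':)': 5, 'sad': 6, ':(': 7}

# Inverse mapping, handy for turning index lists back into tokens
index_to_token = {index: token for token, index in toy_vocab.items()}
print([index_to_token[i] for i in [6, 4, 7]])  # ['sad', 'day', ':(']
```

The repeated token `day` keeps the index it received on first sight, which is why the vocabulary is smaller than the total token count.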


# Tweet to list of indexes
Convert a tweet to a list of vocabulary indexes.
* Each token in the tweet is replaced by its index in the vocabulary.
* Tokens that are not in the vocabulary are replaced by the index of `__UNK__`.

In [None]:
def tweet_to_list(tweet, vocab_dict, unk_token='__UNK__'):
    tokens = process_tweet(tweet)
    lista = []
    unk_ID = vocab_dict[unk_token]
    for token in tokens:
        token_id = vocab_dict.get(token, unk_ID)
        lista.append(token_id)
    return lista

In [None]:
print("Actual tweet is\n", val_pos[1])
print("\nTensor of tweet:\n", tweet_to_list(val_pos[1], vocab_dict=vocab))

Actual tweet is
 @heyclaireee is back! thnx God!!! i'm so happy :)

Tensor of tweet:
 [443, 2, 303, 566, 56, 9]
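The `2` in the output above is the `__UNK__` index, meaning that token never appeared in the training tweets. One way to gauge how often this happens is to measure the share of unknown tokens over a set of encoded tweets. The sketch below is illustrative only: `encode`, `unk_rate` and the toy vocabulary are hypothetical and not part of the notebook.

```python
def encode(tokens, vocab, unk_token='__UNK__'):
    """Replace each token by its index, falling back to the unknown index."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]

def unk_rate(encoded_tweets, unk_id):
    """Fraction of all encoded tokens that map to the unknown index."""
    total = sum(len(ids) for ids in encoded_tweets)
    unknown = sum(ids.count(unk_id) for ids in encoded_tweets)
    return unknown / total if total else 0.0

# Hypothetical vocabulary and pre-tokenized tweets
toy_vocab = {'__PAD__': 0, '__</e>__': 1, '__UNK__': 2, 'happi': 3, 'back': 4, ':)': 5}
encoded = [encode(['back', 'thnx', 'happi', ':)'], toy_vocab),
           encode(['god', 'happi'], toy_vocab)]
print(encoded)  # [[4, 2, 3, 5], [2, 3]]

rate = unk_rate(encoded, toy_vocab['__UNK__'])
print(f"{rate:.2f}")  # 0.33
```

A high rate on the validation set would suggest that the training vocabulary covers new tweets poorly.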


# Model

In [173]:
train_x1 = []
for t in train_x:
    tweet = tweet_to_list(t, vocab)
    train_x1.append(tweet)

In [174]:
train_x[0]

'#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)'

In [175]:
train_x1[0]

[3, 4, 5, 6, 7, 8, 9]

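Pad every tweet with `__PAD__` (index 0) up to the length of the longest training tweet, so that all inputs have the same shape.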

In [176]:
max_len = max([len(t) for t in train_x1])
max_len

51

In [177]:
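tweets_padded = []
for tweet_idx in train_x1:
    # Append __PAD__ (index 0) until the tweet reaches max_len
    n_pad = max_len - len(tweet_idx)
    pad_list = [0] * n_pad
    tweet_padded = tweet_idx + pad_list
    tweets_padded.append(tweet_padded)
# Build the array once, after the loop has finished
inputs = np.array(tweets_padded)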

In [180]:
train_x1[0]

[3, 4, 5, 6, 7, 8, 9]

In [179]:
inputs[0]

array([3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0])

In [181]:
inputs.shape

(8000, 51)
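The padding loop works for the training set because `max_len` is computed from it. A tweet from another set could be longer, in which case `[0] * n_pad` with a negative `n_pad` yields an empty list and the tweet keeps its extra length. A small helper that also truncates avoids this. The sketch below is one possible approach; `pad_or_truncate` is a hypothetical name, not something the notebook defines.

```python
def pad_or_truncate(ids, length, pad_id=0):
    """Return a list of exactly `length` indexes: cut long inputs, pad short ones."""
    ids = ids[:length]
    return ids + [pad_id] * (length - len(ids))

print(pad_or_truncate([3, 4, 5], 5))           # [3, 4, 5, 0, 0]
print(pad_or_truncate([3, 4, 5, 6, 7, 8], 5))  # [3, 4, 5, 6, 7]

# Applied to a batch, every row ends up with the same length
batch = [pad_or_truncate(ids, 4) for ids in [[9], [1, 2, 3, 4, 5], []]]
print(batch)  # [[9, 0, 0, 0], [1, 2, 3, 4], [0, 0, 0, 0]]
```

The resulting list of lists can be passed to `np.array` in the same way as `tweets_padded`.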

In [183]:
train_y.shape

(8000,)

In [233]:
inputs[0].shape

(51,)

In [234]:
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(len(vocab), 16, input_length=max_len),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256),
    tf.keras.layers.Dense(1)
])

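# The last Dense layer has no sigmoid, so the model outputs a raw score (logit).
# from_logits=True tells the loss to expect that score instead of a probability.
model.compile(loss = tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer = 'adam',
              metrics = ['accuracy'])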

In [235]:
model.fit(inputs, train_y, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x79ef01ce2620>

# Demonstration

Take a positive tweet that the model has not seen.

In [253]:
tweet = val_x[1]
tweet

"@heyclaireee is back! thnx God!!! i'm so happy :)"

Convert it to a list of vocabulary indexes.

In [247]:
tweet0 = tweet_to_list(tweet, vocab)
tweet0

[443, 2, 303, 566, 56, 9]

Pad it.

In [248]:
n_pad = max_len - len(tweet0)
pad_list = [0] * n_pad
tweet1 = tweet0 + pad_list
tweet2 = np.array(tweet1)
tweet2

array([443,   2, 303, 566,  56,   9,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0])

Reshape it into a 2-dimensional array with a batch dimension of 1, which is the input shape the model expects.

In [249]:
tweet3 = tweet2.reshape(1, max_len)

In [250]:
tweet3.shape

(1, 51)

In [251]:
prediction = model.predict(tweet3)[0][0]



In [252]:
prediction

1.373132

The score is greater than zero, so the model classifies this as a positive tweet.
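The value printed above is not a probability, because the output layer has no sigmoid. If the score is treated as a logit, which is an assumption that holds only when the loss is configured for logits, a sigmoid maps it to a value between 0 and 1. The sketch below uses the standard library only; `sigmoid` is a helper written for this illustration.

```python
import math

def sigmoid(score):
    """Map an unbounded score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Scores copied from the two predictions shown in this notebook
positive_score = 1.373132
negative_score = -0.40451747

p_positive = sigmoid(positive_score)
p_negative = sigmoid(negative_score)
print(p_positive > 0.5)  # True
print(p_negative > 0.5)  # False
```

A score of exactly 0 maps to 0.5, so thresholding the score at zero and thresholding the sigmoid at 0.5 give the same decision.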

Do the same with a negative tweet.

In [254]:
ntweet = val_x[-1]
ntweet

'@eawoman As a Hull supporter I am expecting a misserable few weeks :-('

In [255]:
ntweet0 = tweet_to_list(ntweet, vocab)
ntweet0

[2, 219, 1375, 2, 8, 5747]

In [259]:
n_pad = max_len - len(ntweet0)
pad_list = [0] * n_pad
ntweet1 = ntweet0 + pad_list
ntweet2 = np.array(ntweet1)
ntweet2

array([   2,  219, 1375,    2,    8, 5747,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0])

In [260]:
ntweet3 = ntweet2.reshape(1, max_len)
prediction2 = model.predict(ntweet3)[0][0]



In [261]:
prediction2

-0.40451747

The score is below zero, so the model classifies this as a negative tweet.
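Two hand-picked tweets say little about overall quality, and the labels in `val_y` built earlier can be used to score the whole validation set. The sketch below shows the accuracy computation on its own, with scores thresholded at zero as in the two examples. `accuracy_from_scores` is a hypothetical helper, and apart from the first two values the scores and labels are made up for illustration.

```python
def accuracy_from_scores(scores, labels, threshold=0.0):
    """Share of examples whose thresholded score matches the 1/0 label."""
    predictions = [1 if score > threshold else 0 for score in scores]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# First two scores echo the notebook; the remaining values are invented
toy_scores = [1.37, -0.40, 0.25, -2.10, -0.05]
toy_labels = [1, 0, 0, 0, 1]

print(accuracy_from_scores(toy_scores, toy_labels))  # 0.6
```

In the notebook, the scores would come from calling `model.predict` on the padded validation tweets, and the labels from `val_y`.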