
# Preprocessing

In this assignment, we explore how to preprocess tweets for sentiment analysis.


In [None]:
import nltk                                # Python library for NLP
from nltk.corpus import twitter_samples    # sample Twitter dataset from NLTK
import matplotlib.pyplot as plt            # library for visualization
import random                              # pseudo-random number generator

## About the Twitter dataset

The sample dataset from NLTK is split into positive and negative tweets, with exactly 5000 of each. This exact match is deliberate: the intention is to have a balanced dataset. It does not reflect the real distribution of positive and negative classes in live Twitter streams.



In [None]:
# downloads sample twitter dataset.
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


True

We can load the text fields of the positive and negative tweets by using the module's `strings()` method like this:

In [None]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

Next, we'll print a report with the number of positive and negative tweets. It is also essential to know the data structure of the datasets, so we print the relevant types as well.

In [None]:
print('Number of positive tweets: ', len(all_positive_tweets))
print('Number of negative tweets: ', len(all_negative_tweets))

print('\nThe type of all_positive_tweets is: ', type(all_positive_tweets))
print('The type of a tweet entry is: ', type(all_negative_tweets[0]))
print(all_negative_tweets[0])

## Looking at raw texts



Below, we print one random positive and one random negative tweet.

In [None]:
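# random.randint includes both endpoints, so the upper bound must be len - 1
print(all_positive_tweets[random.randint(0, len(all_positive_tweets) - 1)])
print(all_negative_tweets[random.randint(0, len(all_negative_tweets) - 1)])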
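
When drawing a random index, it helps to remember that `random.randint(a, b)` can return `b` itself, while valid indices of a 5000-item list run from 0 to 4999. The following is a minimal sketch of three safe ways to sample; `sample_tweets` is a hypothetical stand-in list, not the NLTK data.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# hypothetical stand-in for a list of 5000 tweets
sample_tweets = [f"tweet {i}" for i in range(5000)]

# randint(a, b) includes b, so the upper bound must be len - 1
idx = random.randint(0, len(sample_tweets) - 1)
print(sample_tweets[idx])

# randrange excludes the upper bound, like range()
idx2 = random.randrange(len(sample_tweets))
print(sample_tweets[idx2])

# choice picks an element directly, with no index arithmetic at all
picked = random.choice(sample_tweets)
print(picked)
```

`random.choice` is usually the simplest option when the index itself is not needed.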

## Preprocess raw text for Sentiment analysis

Data preprocessing is one of the critical steps in any machine learning project. It includes cleaning and formatting the data before feeding it into a machine learning algorithm. For NLP, preprocessing typically consists of the following tasks:

* Tokenizing the string
* Lowercasing
* Removing stop words and punctuation
* Stemming




In [None]:
# Our selected sample. Complex enough to exemplify each step
tweet = all_positive_tweets[2277]
print(tweet)

Let's import a few more libraries for this purpose.

In [None]:
# download the stopwords from NLTK
nltk.download('stopwords')

In [None]:
import re                                  # library for regular expression operations
import string                              # for string operations

from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming
from nltk.tokenize import TweetTokenizer   # module for tokenizing strings

### Remove hyperlinks, Twitter marks and styles

Since we have a Twitter dataset, we'd like to remove some substrings commonly used on the platform, such as hashtags, retweet marks, and hyperlinks. We'll use the [re](https://docs.python.org/3/library/re.html) library to perform regular expression operations on our tweet. We'll define our search pattern and use the `sub()` method to remove matches by substituting them with an empty string (i.e. `''`).

In [None]:
# remove old style retweet text "RT"
tweet2 = re.sub(r'^RT[\s]+', '', tweet)

# remove hyperlinks
tweet2 = re.sub(r'https?://[^\s\n\r]+', '', tweet2)

# remove hashtags
# only removing the hash # sign from the word
tweet2 = re.sub(r'#', '', tweet2)

print(tweet2)
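
To make the effect of each substitution easier to see, here is a small sketch that applies the same three patterns to a made-up tweet. The helper name `strip_twitter_markup` and the sample text are assumptions for illustration only.

```python
import re


def strip_twitter_markup(text):
    """Apply the three substitutions described above (hypothetical helper)."""
    text = re.sub(r'^RT[\s]+', '', text)            # old-style retweet marker at the start
    text = re.sub(r'https?://[^\s\n\r]+', '', text)  # hyperlinks
    text = re.sub(r'#', '', text)                    # only the hash sign, the word stays
    return text


sample = "RT @user: Loving the #sunny weather! https://t.co/abc123 :)"
cleaned_sample = strip_twitter_markup(sample)
print(repr(cleaned_sample))
```

Note that removing the link leaves a double space behind and that the `@user` handle is untouched at this stage; both are dealt with later by the tokenizer.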

### Tokenize the string

To tokenize means to split a string into individual words, without blanks or tabs. In the same step, we also convert each word to lower case. The [tokenize](https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual) module from NLTK lets us do both easily:

In [None]:
# instantiate tokenizer class
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)

# tokenize tweets
tweet_tokens = tokenizer.tokenize(tweet2)

print()
print('Tokenized string:')
print(tweet_tokens)
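
The three `TweetTokenizer` options used above can be observed on a short made-up string. This is a sketch under the assumption that NLTK is installed; the sample text is invented, and the variable names are chosen so as not to clash with the notebook's own.

```python
from nltk.tokenize import TweetTokenizer

# same options as in the cell above
demo_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                                reduce_len=True)

demo_text = "@remy: This is waaaaayyyy too much for you!!!!!!"
demo_tokens = demo_tokenizer.tokenize(demo_text)
print(demo_tokens)
```

`strip_handles=True` drops the `@remy` handle, `preserve_case=False` lowercases the words, and `reduce_len=True` shortens runs of a repeated character to at most three, so `waaaaayyyy` becomes `waaayyy`.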

### Remove stop words and punctuation

The next step is to remove stop words and punctuation. Stop words are words that don't add significant meaning to the text. You'll see the list provided by NLTK when you run the cells below.

In [None]:
# import the English stop words list from NLTK
stopwords_english = stopwords.words('english')

print('Stop words\n')
print(stopwords_english)

print('\nPunctuation\n')
print(string.punctuation)

We can see that the stop words list above contains some words that could be important in some contexts.


Time to clean up our tokenized tweet!

In [None]:

print(tweet_tokens)

tweets_clean = []

for word in tweet_tokens: # Go through every word in your tokens list
    if (word not in stopwords_english and  # remove stopwords
        word not in string.punctuation):  # remove punctuation
        tweets_clean.append(word)

print('removed stop words and punctuation:')
print(tweets_clean)
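
One detail worth knowing: `string.punctuation` is a single string, so `word not in string.punctuation` is a substring test rather than a test against a collection of tokens. A sketch of the same filter using sets is shown below; the tiny `STOP_WORDS` set and the token list are hypothetical and much smaller than NLTK's real data.

```python
import string

STOP_WORDS = {"the", "is", "a", "on", "my"}  # hypothetical tiny stop word list
PUNCT = set(string.punctuation)               # one entry per punctuation character


def remove_stop_words(tokens):
    """Keep tokens that are neither stop words nor single punctuation marks."""
    return [t for t in tokens if t not in STOP_WORDS and t not in PUNCT]


demo = ["my", "sunflowers", "on", "a", "sunny", "friday", "!", ":)"]
demo_clean = remove_stop_words(demo)
print(demo_clean)
```

Multi-character tokens such as the emoticon `:)` are kept, which is often desirable for sentiment analysis. Set lookups are also faster than scanning a list of stop words for every token.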

Please note that the words **happy** and **sunny** in this list are correctly spelled.

### Stemming

Stemming is the process of converting a word to its most general form, or stem. This helps in reducing the size of our vocabulary.

Consider the words:
 * **learn**
 * **learn**ing
 * **learn**ed
 * **learn**t

All these words share the common root **learn**. However, in some cases, the stemming process produces words that are not correct spellings of the root word, for example **happi** and **sunni**. That's because it chooses the most common stem for related words. For example, we can look at the set of words that make up the different forms of happy:

 * **happ**y
 * **happi**ness
 * **happi**er

We can see that the prefix **happi** is more commonly used. We cannot choose **happ** because it is the stem of unrelated words like **happen**.

NLTK has different modules for stemming and we will be using the [PorterStemmer](https://www.nltk.org/api/nltk.stem.html#module-nltk.stem.porter) module which uses the [Porter Stemming Algorithm](https://tartarus.org/martin/PorterStemmer/). Let's see how we can use it in the cell below.

In [None]:

print(tweets_clean)

# Instantiate stemming class
stemmer = PorterStemmer()

# Create an empty list to store the stems
tweets_stem = []

for word in tweets_clean:
    stem_word = stemmer.stem(word)  # stemming word
    tweets_stem.append(stem_word)  # append to the list

print('stemmed words:')
print(tweets_stem)
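
To see the stems discussed above in isolation, the sketch below runs `PorterStemmer` on a handful of words. It assumes NLTK is installed; the word list is chosen for illustration, and irregular forms may not reduce to the stem one would expect.

```python
from nltk.stem import PorterStemmer

demo_stemmer = PorterStemmer()

demo_words = ["learn", "learning", "learned", "happy", "happiness", "sunny"]

# map each word to its Porter stem
demo_stems = {w: demo_stemmer.stem(w) for w in demo_words}

for word, stem in demo_stems.items():
    print(f"{word:>10} -> {stem}")
```

Both `happy` and `happiness` map to `happi`, so they end up as the same vocabulary entry even though `happi` is not a correctly spelled word.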

In [None]:
processed_tweet=' '.join(tweets_stem)
processed_tweet

That's it! Now we have a string that can be fed into the next stage of our project.


# Part 2: Sentiment Analysis

In [None]:
import numpy as np
import pandas as pd
nltk.download('twitter_samples')
# select the lists of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# concatenate the lists, 1st part is the positive tweets followed by the negative
tweets = all_positive_tweets + all_negative_tweets

In [None]:
#print tweets
print(tweets)

In [None]:
# labels: 1 for the 5000 positive tweets, 0 for the 5000 negative tweets
y = np.zeros(10000)
y[:5000] = 1


Now write a function that applies the preprocessing steps above to every tweet, and collect all processed tweets so that they can be turned into strings.

In [None]:
# Write your code here
def pre_processing(data):
    """Return one list of stemmed tokens per tweet in `data`.

    Relies on `stopwords_english` and `stemmer` defined in the cells above.
    """
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    processed = []
    for text in data:
        # remove retweet marks, hyperlinks and the hash sign
        res = re.sub(r'^RT[\s]+', '', text)
        res = re.sub(r'https?://[^\s\n\r]+', '', res)
        res = re.sub(r'#', '', res)
        # tokenize, drop stop words and punctuation, then stem what is left
        tokens = [stemmer.stem(w) for w in tokenizer.tokenize(res)
                  if w not in stopwords_english and w not in string.punctuation]
        # keep tweets that end up empty so the rows stay aligned with the labels in y
        processed.append(tokens)
    return processed


clean_text4 = pre_processing(tweets)
clean_text4



Now use **TfidfVectorizer** to vectorize the tweets into a numeric matrix **X**.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# join each token list back into one string per tweet
flat_list = [' '.join(tokens) for tokens in clean_text4]

# a float min_df is a proportion: terms in fewer than about 0.05% of the tweets are dropped
vectorizer = TfidfVectorizer(min_df=0.0005)
X = vectorizer.fit_transform(flat_list)
X = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
X

In [None]:

import tensorflow as tf
from sklearn.model_selection import train_test_split

# split features and labels in one call so that rows and labels stay aligned
X_train, X_test, y_train, y_test = train_test_split(
    X.values, y, test_size=0.2, random_state=1234)

X_train_tensor = tf.convert_to_tensor(X_train)
y_train_tensor = tf.convert_to_tensor(y_train)
X_test_tensor = tf.convert_to_tensor(X_test)
y_test_tensor = tf.convert_to_tensor(y_test)

# normalization layer adapted on the training features only
normalise = tf.keras.layers.Normalization(axis=-1)
normalise.adapt(X_train_tensor)


def build_and_compile_model(norm):
    # feed-forward classifier with a 2-way softmax output (negative / positive)
    model = tf.keras.Sequential([
        norm,
        tf.keras.layers.Dense(500, activation='relu'),
        tf.keras.layers.Dense(100, activation='relu'),
        tf.keras.layers.Dense(50, activation='relu'),
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(2, activation='softmax')
    ])
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                  metrics=['accuracy'])
    return model


dnn_model = build_and_compile_model(normalise)
dnn_model.fit(X_train_tensor, y_train_tensor, epochs=10, batch_size=32)
# evaluate returns [loss, accuracy] on the held-out test set
dnn_model.evaluate(X_test_tensor, y_test_tensor)






Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


[1.1960663795471191, 0.7120000123977661]