# Twitter Sentiment Analysis

In this project, we aim to perform sentiment analysis on tweets related to disasters, utilizing natural language processing techniques and machine learning algorithms. Our goal is to accurately classify tweets as either disaster-related or non-disaster-related based on their content and sentiments expressed.

The dataset used for this project is sourced from https://www.kaggle.com/code/pavansanagapati/knowledge-graph-nlp-tutorial-bert-spacy-nltk/data. This dataset contains a collection of tweets, each labeled as either "disaster" or "non-disaster."

Twitter serves as a significant platform for real-time information dissemination during natural disasters and emergencies. However, amid the influx of tweets, it becomes challenging for human operators to quickly discern which tweets are genuinely relevant to disasters. Therefore, an automated system that can efficiently identify and classify disaster-related tweets can be invaluable for emergency response, news agencies, and public awareness.

In this project, we will explore various natural language processing techniques, including tokenization, stemming or lemmatization, and feature extraction. We will employ machine learning models, with a particular focus on Naive Bayes, to classify tweets based on their content and sentiment, determining whether they pertain to a disaster or not. Additionally, we will evaluate the model's performance using appropriate metrics and aim to achieve high accuracy in classifying disaster-related tweets to aid in crisis management and public safety.

## 1. Download Corpora & Import Libraries

Download the Gutenberg corpus from NLTK, which contains a selection of literary texts. We use one of these texts to walk through the preprocessing steps (tokenization, stopword removal, and lemmatization) before applying the same steps to the tweets.

In [35]:
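import nltk
# Uncomment on first run to fetch the NLTK resources used below
# (resource names can differ between NLTK versions):
# nltk.download('gutenberg')
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')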

from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from collections import Counter
from string import punctuation

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Import Gutenberg Project
from nltk.corpus import gutenberg

We selected **Bryant-stories.txt** from the Gutenberg corpus as the text file to explore and analyze.

In [3]:
text = gutenberg.raw('bryant-stories.txt')

# print out small chunk of bryant-stories.txt
print(text.strip()[:1000])

[Stories to Tell to Children by Sara Cone Bryant 1918] 


TWO LITTLE RIDDLES IN RHYME


     There's a garden that I ken,
     Full of little gentlemen;
     Little caps of blue they wear,
     And green ribbons, very fair.
           (Flax.)

     From house to house he goes,
     A messenger small and slight,
     And whether it rains or snows,
     He sleeps outside in the night.
           (The path.)




THE LITTLE YELLOW TULIP


Once there was a little yellow Tulip, and she lived down in a little
dark house under the ground. One day she was sitting there, all by
herself, and it was very still. Suddenly, she heard a little _tap, tap,
tap_, at the door.

"Who is that?" she said.

"It's the Rain, and I want to come in," said a soft, sad, little voice.

"No, you can't come in," the little Tulip said.

By and by she heard another little _tap, tap, tap_ on the window-pane.

"Who is there?" she said.

The same soft little voice answered, "It's the 

**Bryant-stories.txt** is a compilation of short poems and stories written in straightforward language, with plenty of dialogue, which gives us several text types to work with in this project.

## 2. Tokenization

Tokenization is about breaking a paragraph into smaller meaningful units. In this case, we have two types:

- **Sentence tokenization:** It involves dividing a paragraph into individual sentences. 
- **Word tokenization:** It takes each sentence and breaks it down further into individual words. 
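
To make the two levels concrete, here is a minimal sketch that uses only regular expressions from the standard library. It is a deliberately naive stand-in for NLTK's `sent_tokenize` and `word_tokenize`, which handle abbreviations, quotes, and contractions far more carefully; the helper names `naive_sent_tokenize` and `naive_word_tokenize` are hypothetical.

```python
import re

def naive_sent_tokenize(paragraph: str) -> list:
    """Split after '.', '!' or '?' when it is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

def naive_word_tokenize(sentence: str) -> list:
    """Return runs of word characters, with each punctuation mark on its own."""
    return re.findall(r"\w+|[^\w\s]", sentence)

paragraph = (
    "Once there was a little yellow Tulip. "
    "One day she was sitting there, all by herself."
)
sentences = naive_sent_tokenize(paragraph)
print(sentences)                           # two sentences
print(naive_word_tokenize(sentences[1]))   # words and punctuation of the second one
```

The same two-stage idea (paragraph to sentences, sentence to words) is what the NLTK functions imported above provide in a more robust form.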

### 2.1 Step 1 - Cleaning: Remove Title, Opening Poem, Line separation

In [4]:
# Remove title
text = text.replace("[Stories to Tell to Children by Sara Cone Bryant 1918]", "")

In [5]:
# Remove the opening poem and the first story's title
# (the 402-character offset is specific to this file)
text = text.strip()[402:]
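
The hard-coded offset above only works for this exact file. As a hedged alternative, the sketch below locates the first story by searching for its title and slicing after it. The `sample` string is an abbreviated stand-in for the real corpus text, and `skip_to_first_story` is a hypothetical helper that is not part of the original notebook.

```python
def skip_to_first_story(raw: str, title: str) -> str:
    """Return the text that follows the first occurrence of `title`.

    If the title is not found, the input is returned unchanged.
    """
    start = raw.find(title)
    if start == -1:
        return raw
    return raw[start + len(title):].lstrip()

# Abbreviated stand-in for the beginning of bryant-stories.txt.
sample = (
    "TWO LITTLE RIDDLES IN RHYME\n\n"
    "     There's a garden that I ken,\n\n"
    "THE LITTLE YELLOW TULIP\n\n\n"
    "Once there was a little yellow Tulip, and she lived down in a little\n"
    "dark house under the ground."
)
body = skip_to_first_story(sample, "THE LITTLE YELLOW TULIP")
print(body[:36])
```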

In [6]:
# Remove line separation
text = text.replace("\r\n", " ")

Let's take a look at the first dozen fragments produced by a naive split on periods.

In [7]:
for i, line in enumerate(text.split('.')):
    if len(line) > 0:
        print(str(i) + ':\t' + line)
    if i > 10: 
        break

0:	Once there was a little yellow Tulip, and she lived down in a little dark house under the ground
1:	 One day she was sitting there, all by herself, and it was very still
2:	 Suddenly, she heard a little _tap, tap, tap_, at the door
3:	  "Who is that?" she said
4:	  "It's the Rain, and I want to come in," said a soft, sad, little voice
5:	  "No, you can't come in," the little Tulip said
6:	  By and by she heard another little _tap, tap, tap_ on the window-pane
7:	  "Who is there?" she said
8:	  The same soft little voice answered, "It's the Rain, and I want to come in!"  "No, you can't come in," said the little Tulip
9:	  Then it was very still for a long time
10:	 At last, there came a little rustling, whispering sound, all round the window: _rustle, whisper, whisper_
11:	  "Who is there?" said the little Tulip


### 2.2 Step 2 - Tokenize

In [8]:
# Tokenize and lowercase
tokenized_lowered = list(map(str.lower, word_tokenize(text)))
print(tokenized_lowered[:10])

['once', 'there', 'was', 'a', 'little', 'yellow', 'tulip', ',', 'and', 'she']


### 2.3 Step 3 - Define Stopwords

In [9]:
# Stopwords for English
stopwords_en = set(stopwords.words('english'))

# Print the example of stopwords from NLTK
print(stopwords_en)

{"needn't", 'we', 'them', 'from', 'just', 'can', 'don', 'themselves', 'when', 'o', 'below', 'yours', 've', "you'll", 'haven', "wasn't", 'a', 'there', 'own', 't', 'same', "shan't", 'does', "haven't", 'was', 'over', 'only', "you'd", 'off', 'our', 'isn', 'then', 'yourself', 'about', 'whom', 'all', 'each', 'down', 'mustn', 'having', 'its', 's', 'myself', 'because', "wouldn't", 'won', 'she', "that'll", 'doing', 'should', 'aren', 'couldn', "hasn't", 'any', 'for', 'it', "don't", 'her', 'until', 'to', 'again', 'weren', 'some', 'you', 'an', 'shan', 'had', 'more', 'hadn', 'ourselves', 'theirs', 'am', 'not', 'doesn', 'while', "you're", 'most', "didn't", "doesn't", 'wasn', "aren't", 'this', 'after', 'he', 'yourselves', 'himself', 'but', 'now', 'the', 'both', 'his', "should've", 'needn', 'itself', 'who', 'herself', 'their', 'above', 'than', 're', "she's", "couldn't", 'what', 'wouldn', 'other', 'here', 'or', 'as', 'few', 'ain', 'where', 'of', 'ours', 'mightn', 'in', 'very', "won't", 'were', "you've"

In [10]:
print([word for word in tokenized_lowered if word not in stopwords_en][:10])

['little', 'yellow', 'tulip', ',', 'lived', 'little', 'dark', 'house', 'ground', '.']


### 2.4 Step 4 - Define Punctuation

Let's take a look at what is included in the `punctuation` constant provided by the `string` module.

In [11]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

### 2.5 Step 5 - Remove Stopwords and Punctuation from Tokenized Words

In [12]:
# Combining the stopwords with punctuation
stopwords_en_withpunct = stopwords_en.union(set(punctuation))

# Remove both from Tokenized word
tokenized_lowered_noStop_noPunc = [word for word in tokenized_lowered if word not in stopwords_en_withpunct]
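
One caveat: `set(punctuation)` holds single characters only, so multi-character tokens emitted by `word_tokenize` (the paired quote tokens and `--`) are not removed by the filter above. Below is a minimal sketch of a stricter check, assuming we want to drop any token made up entirely of punctuation characters. `is_punct_only` is a hypothetical helper and the token list is a hand-picked sample.

```python
from string import punctuation

def is_punct_only(token: str) -> bool:
    """True when every character of a non-empty token is punctuation."""
    return bool(token) and all(ch in punctuation for ch in token)

tokens = ["little", "``", "''", "--", "n't", "window-pane", "_tap", ","]

# Filter used in the notebook: only single punctuation characters match.
single_char_filter = [t for t in tokens if t not in set(punctuation)]

# Stricter filter: also drops multi-character punctuation tokens.
full_filter = [t for t in tokens if not is_punct_only(t)]

print(single_char_filter)
print(full_filter)
```

Tokens that mix letters and punctuation, such as `n't` or `_tap`, survive both filters; handling those would need a separate rule.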

In [13]:
tokenized_lowered_noStop_noPunc

['little',
 'yellow',
 'tulip',
 'lived',
 'little',
 'dark',
 'house',
 'ground',
 'one',
 'day',
 'sitting',
 'still',
 'suddenly',
 'heard',
 'little',
 '_tap',
 'tap',
 'tap_',
 'door',
 '``',
 "''",
 'said',
 '``',
 "'s",
 'rain',
 'want',
 'come',
 "''",
 'said',
 'soft',
 'sad',
 'little',
 'voice',
 '``',
 'ca',
 "n't",
 'come',
 "''",
 'little',
 'tulip',
 'said',
 'heard',
 'another',
 'little',
 '_tap',
 'tap',
 'tap_',
 'window-pane',
 '``',
 "''",
 'said',
 'soft',
 'little',
 'voice',
 'answered',
 '``',
 "'s",
 'rain',
 'want',
 'come',
 "''",
 '``',
 'ca',
 "n't",
 'come',
 "''",
 'said',
 'little',
 'tulip',
 'still',
 'long',
 'time',
 'last',
 'came',
 'little',
 'rustling',
 'whispering',
 'sound',
 'round',
 'window',
 '_rustle',
 'whisper',
 'whisper_',
 '``',
 "''",
 'said',
 'little',
 'tulip',
 '``',
 "'s",
 'sunshine',
 "''",
 'said',
 'little',
 'soft',
 'cheery',
 'voice',
 '``',
 'want',
 'come',
 "''",
 '``',
 'n',
 '--',
 "''",
 'said',
 'little',
 'tulip

## 3. Stemming and Lemmatization

**Stemming** and **Lemmatization** are techniques used in natural language processing to reduce words to their base or root form, allowing different forms of the same word to be treated as one. For example, "walks", "walking", and "walked" should all reduce to "walk".

- **Stemming:**
Stemming involves removing suffixes or prefixes from words to obtain their base form. The resulting stems might not always be actual words, but they represent the core meaning of the original words.

- **Lemmatization:**
Lemmatization, on the other hand, uses linguistic rules and knowledge to find the base or root word, which is called the lemma. Unlike stemming, lemmatization ensures that the resulting lemma is a valid word in the language. This requires the use of a lexicon or a vocabulary, as well as morphological analysis.
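
The contrast can be illustrated with a toy example. The sketch below is not NLTK's stemmer or `WordNetLemmatizer`: `naive_stem` strips a handful of suffixes by rule, while `toy_lemmatize` looks words up in a tiny hand-written lexicon that stands in for WordNet. Both helpers and the lexicon are assumptions made for illustration.

```python
def naive_stem(word: str) -> str:
    """Strip one common suffix; the result need not be a real word."""
    rules = (("ies", "i"), ("ing", ""), ("ed", ""), ("s", ""))
    for suffix, replacement in rules:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

# Tiny hand-written lexicon standing in for WordNet: (word, POS) -> lemma.
TOY_LEXICON = {
    ("walks", "v"): "walk",
    ("walking", "v"): "walk",
    ("walked", "v"): "walk",
    ("studies", "v"): "study",
    ("was", "v"): "be",
}

def toy_lemmatize(word: str, pos: str) -> str:
    """Look the word up with its part of speech; fall back to the word itself."""
    return TOY_LEXICON.get((word, pos), word)

for w in ["walks", "walking", "walked", "studies", "was"]:
    print(w, naive_stem(w), toy_lemmatize(w, "v"))
```

Here the rule-based stem of "studies" is the non-word "studi", whereas the lexicon lookup returns "study", and only the lookup can map an irregular form such as "was" to "be".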

In [14]:
wnl = WordNetLemmatizer()

def penn2morphy(penntag):
    """ Converts Penn Treebank tags to WordNet. """
    morphy_tag = {
        'NN':'n', 
        'JJ':'a',
        'VB':'v', 
        'RB':'r'
    }
    # Fall back to noun for any tag outside the four mapped families
    return morphy_tag.get(penntag[:2], 'n')
    
def lemmatize_sent(text): 
    return [
        wnl.lemmatize(
            word.lower(), 
            pos=penn2morphy(tag)
        ) 
        for word, tag in pos_tag(word_tokenize(text))
    ]

In [18]:
def preprocess_text(text: str):
    """
    Args
        text: str - i.e. document/sentence
    Output
         list(str) - i.e. list of lemmas
    """
    return [
        word for word in lemmatize_sent(text) 
        if word not in stopwords_en_withpunct
        and not word.isdigit()
    ]

## 4. Bag of Words

In [19]:
processed_sent = preprocess_text(text)

print('Word counts:')
print(Counter(processed_sent[:100]))

Word counts:
Counter({'little': 11, '``': 9, "''": 8, 'say': 7, 'come': 6, 'tulip': 4, "'s": 3, 'want': 3, 'soft': 3, 'voice': 3, 'still': 2, 'hear': 2, '_tap': 2, 'tap': 2, 'tap_': 2, 'rain': 2, 'ca': 2, "n't": 2, 'whisper': 2, 'yellow': 1, 'live': 1, 'dark': 1, 'house': 1, 'ground': 1, 'one': 1, 'day': 1, 'sit': 1, 'suddenly': 1, 'door': 1, 'sad': 1, 'another': 1, 'window-pane': 1, 'answer': 1, 'long': 1, 'time': 1, 'last': 1, 'rustling': 1, 'sound': 1, 'round': 1, 'window': 1, '_rustle': 1, 'whisper_': 1, 'sunshine': 1, 'cheery': 1})


## 5. Twitter Dataset Preparation

In this section, we load the dataset from https://www.kaggle.com/code/pavansanagapati/knowledge-graph-nlp-tutorial-bert-spacy-nltk/data and prepare it for identifying which tweets are about a disaster and which are not.

`Training data`

In [21]:
# Load and read csv file
df = pd.read_csv("train.csv")

# Create Train dataset with 3 columns (id, text, target)
df_train = df[['id', 'text', 'target']]
df_train.head(5)

| | id | text | target |
|---|---|---|---|
| 0 | 1 | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | Just got sent this photo from Ruby #Alaska as ... | 1 |


`Testing data`

In [22]:
# Load and read csv file
df = pd.read_csv("test.csv")

# Create test dataset with 2 columns (id, text); copy so a column can be added later
df_test = df[['id', 'text']].copy()
df_test.head(5)

| | id | text |
|---|---|---|
| 0 | 0 | Just happened a terrible car crash |
| 1 | 2 | Heard about #earthquake is different cities, s... |
| 2 | 3 | there is a forest fire at spot pond, geese are... |
| 3 | 9 | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | Typhoon Soudelor kills 28 in China and Taiwan |


In [24]:
# Split into train and validation sets, holding out 25% for validation
train, valid = train_test_split(df_train, test_size=0.25)
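
The split above is not seeded, so the rows that land in each subset, and therefore the scores reported later, will vary from run to run. scikit-learn's `train_test_split` accepts `random_state` and `stratify` arguments for this purpose. The stdlib-only sketch below illustrates the underlying idea of a seeded, class-stratified split; `stratified_split` is a hypothetical helper and the labels are made up.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.25, seed=0):
    """Return (train_idx, valid_idx), holding out test_size of each class."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_valid = round(len(indices) * test_size)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

# Made-up labels: 4 disaster tweets (1) and 8 non-disaster tweets (0).
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
train_idx, valid_idx = stratified_split(labels, test_size=0.25, seed=42)
print(train_idx, valid_idx)
```

Because each class is sampled separately, the validation subset keeps the same one-to-two class ratio as the full toy label list.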

## 6. Vectorization

In [25]:
# Initialize the vectorizer and 
# override the analyzer totally with the preprocess_text().
count_vect = CountVectorizer(
    analyzer=preprocess_text
)

# `CountVectorizer.fit_transform` builds the vocabulary (dictionary)
# from the training text and vectorizes that text in one step.
train_set = count_vect.fit_transform(train['text'])
train_tags = train['target']

# When vectorizing the validation data and testing data, we use `CountVectorizer.transform()`.
valid_set = count_vect.transform(valid['text'])
valid_tags = valid['target']

test_set = count_vect.transform(df_test['text'])
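
The distinction between `fit_transform` and `transform` matters because the vocabulary is learned from the training text only. The following stdlib-only sketch mimics that behaviour on toy token lists. `fit_vocabulary` and `to_count_vector` are hypothetical helpers, not scikit-learn functions, and the real `CountVectorizer` returns a sparse matrix rather than plain lists.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct training token to a column index (sorted order)."""
    tokens = sorted({tok for doc in docs for tok in doc})
    return {tok: col for col, tok in enumerate(tokens)}

def to_count_vector(doc, vocabulary):
    """Count tokens of one document; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok, n in Counter(doc).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector

# "fit" step: the vocabulary comes from the training documents only.
train_docs = [["forest", "fire", "near", "fire"], ["nice", "hat"]]
vocabulary = fit_vocabulary(train_docs)
train_vectors = [to_count_vector(doc, vocabulary) for doc in train_docs]

# "transform" step: "flood" was never seen in training, so it has no column.
valid_vector = to_count_vector(["fire", "flood", "hat"], vocabulary)

print(vocabulary)
print(train_vectors)
print(valid_vector)
```

This is why the validation and test tweets are passed through `transform()` only: refitting on them would change the columns the classifier was trained on.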

## 7. Modelling: Classification

In this section, we build a binary classifier using Naive Bayes to determine whether a tweet is related to a disaster or not. We fit the model on the training split, estimate its performance with 5-fold cross-validation on that split, and check it against the held-out validation split. We then retrain on the full labeled data (training plus validation) and predict labels for the test set.
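
For intuition about what the classifier computes, here is a minimal from-scratch sketch of the multinomial Naive Bayes decision rule with add-one (Laplace) smoothing, which corresponds to the default `alpha=1.0` of scikit-learn's `MultinomialNB`. It is an illustration under simplifying assumptions, not the library's implementation: `train_nb` and `predict_nb` are hypothetical helpers, the four training "tweets" are made up, and tokens never seen in training are simply skipped.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Collect per-class token counts and log priors."""
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    classes = {}
    for label, n_docs in Counter(labels).items():
        total = sum(word_counts[label].values())
        classes[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "counts": word_counts[label],
            # Smoothed denominator: class token total + alpha * vocabulary size.
            "denom": total + alpha * len(vocab),
        }
    return {"vocab": vocab, "alpha": alpha, "classes": classes}

def predict_nb(model, tokens):
    """Return the class with the highest log prior + summed log likelihoods."""
    best_label, best_score = None, -math.inf
    for label, params in model["classes"].items():
        score = params["log_prior"]
        for tok in tokens:
            if tok in model["vocab"]:
                smoothed = params["counts"][tok] + model["alpha"]
                score += math.log(smoothed / params["denom"])
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [
    ["forest", "fire", "near", "town"],
    ["earthquake", "hit", "city"],
    ["nice", "hat"],
    ["love", "this", "song"],
]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)
print(predict_nb(model, ["fire", "in", "city"]))
print(predict_nb(model, ["nice", "song"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a token that is absent from one class from driving that class's probability to zero.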

In [28]:
clf = MultinomialNB() 

# To train the classifier, simply call fit()
clf.fit(train_set, train_tags)

# Run 5-fold cross-validation on the train set and calculate the mean and standard deviation of the fold scores
score = cross_validate(clf, train_set, train_tags, cv=5)
print("The mean Performance of Naive Bayes classifier model = ", np.mean(score['test_score']))
print("The standard deviation Performance of Naive Bayes classifier model = ", np.std(score['test_score']))

The mean Performance of Naive Bayes classifier model =  0.7962863251733279
The standard deviation Performance of Naive Bayes classifier model =  0.015196153672174995
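
To show what 5-fold cross-validation does with the data, the sketch below generates the train/test index pairs for each fold and aggregates a list of fold scores. It uses plain contiguous folds for simplicity, whereas scikit-learn uses stratified folds when an integer `cv` is passed together with a classifier. `k_fold_indices` is a hypothetical helper and the fold scores are invented purely to show the aggregation step.

```python
from statistics import mean, pstdev

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) for k contiguous, near-equal folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(12, k=5))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)

# Invented per-fold accuracies, only to show how they are summarized.
fold_scores = [0.78, 0.80, 0.81, 0.79, 0.82]
print(mean(fold_scores), pstdev(fold_scores))
```

Each sample is used for testing exactly once and for training in the other four folds; `pstdev` is the population standard deviation, which matches the default behaviour of `np.std`.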


In [29]:
# To predict our tags (i.e. whether each tweet is about a disaster),
# we feed the vectorized `valid_set` to .predict()
predictions_valid = clf.predict(valid_set)

# accuracy_score expects the true labels first, then the predictions
print('Disaster prediction from tweet accuracy = {}'.format(
        accuracy_score(valid_tags, predictions_valid) * 100))

Disaster prediction from tweet accuracy = 79.56932773109243
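
Accuracy alone can hide how well the disaster class itself is detected, so precision, recall, and F1 are worth reporting alongside it. scikit-learn provides these in `sklearn.metrics` (for example `precision_score`, `recall_score`, and `f1_score`). The stdlib-only sketch below shows how they are derived from the prediction counts; `binary_metrics` is a hypothetical helper and the labels are made up, not taken from the validation set.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels: 3 disaster tweets, of which only 1 is detected.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case the accuracy is 0.625 while recall on the disaster class is only one in three, which is the kind of gap a single accuracy figure does not reveal.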


Re-fit the vectorizer on the full labeled data, since its vocabulary now has to cover the validation tweets as well, and then re-vectorize the test set with it.

In [31]:
count_vect = CountVectorizer(
    analyzer=preprocess_text
)

full_train_set = count_vect.fit_transform(df_train['text'])
full_tags = df_train['target']

test_set = count_vect.transform(df_test['text'])

Now, train on the full labeled dataset (training data plus validation data) and predict on the test set.

In [32]:
clf = MultinomialNB() 
clf.fit(full_train_set, full_tags) 
predictions = clf.predict(test_set)

## 8. Prediction

In [36]:
df_test["prediction"] = predictions

df_test.head(10)

| | id | text | prediction |
|---|---|---|---|
| 0 | 0 | Just happened a terrible car crash | 1 |
| 1 | 2 | Heard about #earthquake is different cities, s... | 1 |
| 2 | 3 | there is a forest fire at spot pond, geese are... | 1 |
| 3 | 9 | Apocalypse lighting. #Spokane #wildfires | 1 |
| 4 | 11 | Typhoon Soudelor kills 28 in China and Taiwan | 1 |
| 5 | 12 | We're shaking...It's an earthquake | 1 |
| 6 | 21 | They'd probably still show more life than Arse... | 0 |
| 7 | 22 | Hey! How are you? | 0 |
| 8 | 27 | What a nice hat? | 0 |
| 9 | 29 | Fuck off! | 0 |


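In the first 10 rows of the test data, the predictions look plausible: the six tweets that mention a crash, fire, earthquake, or typhoon are labeled 1, and the four conversational tweets are labeled 0. Because the test data loaded here carries no labels, this is a sanity check rather than a measurement of accuracy.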

# Conclusion

In this project, we developed a model that classifies tweets as disaster-related or non-disaster-related using natural language processing techniques. The model reached an accuracy of 79.57% on the validation split, in line with the 5-fold cross-validation mean of about 79.6% on the training split, demonstrating its capability to discern relevant information during emergencies.

We began by preprocessing the tweets with tokenization, lemmatization, and stopword and punctuation removal, and then extracted bag-of-words counts as features. We used Naive Bayes, a simple yet effective machine learning algorithm, to perform binary classification. The model's accuracy suggests that it can aid emergency responders, news agencies, and the public in quickly identifying disaster-related tweets amidst the vast volume of social media content.

While the achieved accuracy is a reasonable baseline, there is room for improvement. Future work may involve exploring deep learning models or incorporating domain-specific lexicons to enhance sentiment analysis. Additionally, accounting for context and for the evolving nature of language in tweets could lead to better generalization and higher accuracy.