# Introduction
![](https://miro.medium.com/v2/resize:fit:1200/1*VT7AxioAGXplMe7RAEYfSA.png)
Twitter sentiment analysis applies sentiment analysis to tweets: short, often informal messages posted on the Twitter platform. It is valuable for understanding public opinion, gauging brand perception, and tracking trends in real time.


### Context
This is the Sentiment140 dataset. It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated (0 = negative, 4 = positive) and can be used to detect sentiment.

#### Content
It contains the following 6 fields:

- `target`: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- `ids`: the id of the tweet (e.g., 2087)
- `date`: the date of the tweet (e.g., Sat May 16 23:58:44 UTC 2009)
- `flag`: the query (e.g., lyx). If there is no query, this value is NO_QUERY.
- `user`: the user who tweeted (e.g., robotickilldozr)
- `text`: the text of the tweet (e.g., Lyx is cool)

#### Acknowledgements
The official link regarding the dataset with resources about how it was generated is here
The official paper detailing the approach is here

Citation: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
import numpy as np
# plotting
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# nltk
from nltk.stem import WordNetLemmatizer
# sklearn
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Importing the dataset
DATASET_COLUMNS=['target','ids','date','flag','user','text']
DATASET_ENCODING = "ISO-8859-1"
df = pd.read_csv('/content/drive/MyDrive/DATA SETS/twitter.csv', encoding=DATASET_ENCODING,names=DATASET_COLUMNS)
df.head()

In [None]:
df.info()

In [None]:
df.tail()

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
df['target'].unique()

In [None]:
df['target'].nunique()

In [None]:
# Take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
dataset = df[['text','target']].copy()

In [None]:
dataset['target'] = dataset['target'].replace(4,1)

In [None]:
dataset['target'].unique()

### Steps for Text Cleaning
Text cleaning is a crucial preprocessing step in sentiment analysis: it puts the text into a consistent format before features are extracted. Each step is described below:

**Step 1: Remove HTML Tags**

HTML tags are used to format text on web pages. When performing sentiment analysis on text data extracted from websites, it's essential to remove these tags as they don't contribute to the sentiment and can interfere with the analysis.

**Step 2: Remove URLs**

Uniform Resource Locators (URLs) are web addresses that often appear in text. They usually don't provide any meaningful sentiment information and can be removed to make the text more focused on the content itself.

**Step 3: Handling Emojis**

Emojis are graphical symbols that can convey emotions. You can choose to keep, remove, or replace emojis with their textual equivalents (e.g., converting 😊 to "smile"). The choice depends on whether you want to incorporate emoji sentiment into your analysis.

**Step 4: Chat word treatment**

In social media and online conversations, people often use slang, abbreviations, and informal language. You may need to replace these with their standard equivalents. For example, "u" becomes "you," "gr8" becomes "great," and so on. This step helps standardize the text for analysis.

**Step 5: Remove Punctuation**

Punctuation marks (e.g., !, ?, .) don't typically carry sentiment information and can be removed. However, in some cases, you might want to keep certain punctuation marks to understand the sentiment better, such as exclamation points to identify excitement or question marks for uncertainty.

**Step 6: Make Lower Case**

Consistency is essential in text analysis. Converting all text to lowercase ensures that "happy" and "Happy" are treated as the same word, avoiding duplication and improving analysis accuracy.

**Step 7: Spelling Correction**

Correcting misspelled words is important to improve the quality of sentiment analysis. You can use spelling correction algorithms or dictionaries to handle this step.

**Step 8: Tokenization**

Tokenization involves breaking the text into individual words or tokens. This step is crucial for further analysis because it allows you to work with individual words, making it easier to analyze sentiment at a granular level.

**Step 9: Remove Stop Words**

Stop words are common words such as "the," "and," "in," which occur frequently in the language but often don't carry much sentiment information. Removing stop words can reduce noise in the analysis and help focus on content words with more sentiment significance.

**Step 10: Stemming and Lemmatization**

Stemming and lemmatization are techniques to reduce words to their root or base forms. Stemming applies heuristic rules that chop off word endings (e.g., "jumping" becomes "jump"), so the result is not always a dictionary word. Lemmatization, on the other hand, reduces words to their dictionary form (e.g., "better" becomes "good" when it is treated as an adjective). These techniques help ensure that variations of a word are treated as the same word, improving analysis accuracy.

**Step 11: Apply Algorithm**

The cleaned text is converted into TF-IDF features and passed to a classifier (Multinomial Naive Bayes in this notebook).

#### Step 1: Remove HTML Tags

In [None]:
import re
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'', text)



In [None]:
text = "<html><body><p> Movie 1</p><p> Actor - Aamir Khan</p><p> Click here to <a href='http://google.com'>download</a></p></body></html>"

In [None]:
text

In [None]:
remove_html_tags(text)
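
Tweets often carry HTML entities such as `&amp;` or `&quot;` rather than full tags, and the regex above leaves those untouched. The following is a minimal sketch, assuming entities should be decoded once the tags are gone. `clean_html` is a hypothetical helper name that is not part of the original notebook.

```python
import html
import re

TAG_PATTERN = re.compile(r'<.*?>')

def clean_html(text):
    # Strip tags first, then decode entities such as &amp; and &quot;.
    # Decoding first could turn "&lt;3 ... &gt;" into something that looks like a tag.
    without_tags = TAG_PATTERN.sub('', text)
    return html.unescape(without_tags)

print(clean_html("<p>Tom &amp; Jerry said &quot;hi&quot;</p>"))
```

Whether to apply this to the dataset depends on how many tweets actually contain entities, which is worth checking before adding the step.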

#### Step 2: Remove URLs

In [None]:
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'', text)

In [None]:
text1 = 'Check out my notebook https://www.kaggle.com/ubaidshah/notebook8223fc1abb'
text2 = 'Check out my notebook http://www.kaggle.com/ubaidshah/notebook8223fc1abb'
text3 = 'Google search here www.google.com'
text4 = 'For notebook click https://www.kaggle.com/ubaidshah/notebook8223fc1abb to search check www.google.com'

In [None]:
remove_url(text2)

In [None]:
dataset['text'] = dataset['text'].apply(lambda x: remove_url(x))
dataset['text'].tail()

#### Step 3: Handling Emojis

In [None]:
# Dictionary mapping text emoticons to their meanings.
emojis = {':)': 'smile', ':-)': 'smile', ';d': 'wink', ':-E': 'vampire', ':(': 'sad',
          ':-(': 'sad', ':-<': 'sad', ':P': 'raspberry', ':O': 'surprised',
          ':-@': 'shocked', ':@': 'shocked',':-$': 'confused', ':\\': 'annoyed',
          ':#': 'mute', ':X': 'mute', ':^)': 'smile', ':-&': 'confused', '$_$': 'greedy',
          '@@': 'eyeroll', ':-!': 'confused', ':-D': 'smile', ':-0': 'yell', 'O.o': 'confused',
          '<(-_-)>': 'robot', 'd[-_-]b': 'dj', ":'-)": 'sadsmile', ';)': 'wink',
          ';-)': 'wink', 'O:-)': 'angel','O*-)': 'angel','(:-D': 'gossip', '=^.^=': 'cat'}

In [None]:
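def handel_emoji(text):
    # Replace longer emoticons first so that 'O:-)' is not partially matched by ':-)'
    for emoji in sorted(emojis, key=len, reverse=True):
        text = text.replace(emoji, "EMOJI" + emojis[emoji])

    return text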


In [None]:
handel_emoji("@stustone Your show is whack. Way worse than whack, it's wiggety-whack.  :(:(:(")

In [None]:
dataset['text']=dataset['text'].apply(lambda x:handel_emoji(x) )

#### Step 4: Chat word treatment

In [None]:
url1='https://github.com/rishabhverma17/sms_slang_translator/blob/master/slang.txt'
slang='/content/drive/MyDrive/DATA SETS/slang.txt'

In [None]:
slang

In [None]:
with open(slang,'r') as f:
    lines = f.readlines()


In [None]:
lines

In [None]:

(lines[0].split('='))[1][:-1]

In [None]:
chat_words = {}
for line in lines:
    # Skip blank or malformed lines; split only on the first '='
    if '=' in line:
        abbreviation, meaning = line.split('=', 1)
        chat_words[abbreviation.strip()] = meaning.strip()

In [None]:
chat_words

In [None]:
def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words:
            new_text.append(chat_words[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)
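
Because chat word treatment runs before punctuation removal, a token such as `gr8!` does not match the dictionary key `GR8`. The sketch below shows one way to handle this, assuming trailing punctuation should be preserved. `sample_chat_words` and `chat_conversion_punct` are hypothetical names used only for illustration; the notebook itself builds `chat_words` from `slang.txt`.

```python
import string

# Hypothetical mini-dictionary standing in for the slang file
sample_chat_words = {"LOL": "Laughing Out Loud", "GR8": "Great", "U": "You"}

def chat_conversion_punct(text, mapping):
    converted = []
    for word in text.split():
        core = word.rstrip(string.punctuation)  # word without trailing punctuation
        tail = word[len(core):]                  # the punctuation that was stripped
        if core.upper() in mapping:
            converted.append(mapping[core.upper()] + tail)
        else:
            converted.append(word)
    return " ".join(converted)

print(chat_conversion_punct("u r gr8!", sample_chat_words))
```

An alternative is simply to move punctuation removal ahead of this step, at the cost of losing emoticons that have not yet been replaced.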

In [None]:
dataset.head(10)

In [None]:
print(chat_conversion(dataset.iloc[7, 0]))

In [None]:
dataset['text'] = dataset['text'].apply(lambda x: chat_conversion(x))
dataset['text'].tail()

In [None]:
print(dataset.iloc[7, 0])

#### Step 5: Remove Punctuation

In [None]:
import string
string.punctuation

In [None]:
exclude = string.punctuation
def remove_punc(text):
    return text.translate(str.maketrans('', '', exclude))

In [None]:
text = 'string. With. Punctuation?'

In [None]:
remove_punc(text)

In [None]:
dataset['text'] = dataset['text'].apply(lambda x: remove_punc(x))
dataset['text'].tail()
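
As noted in the step description, some punctuation can carry sentiment. The following sketch keeps exclamation and question marks while removing everything else. The choice of `!` and `?` is an assumption for illustration, and `remove_punc_keep` is a hypothetical helper rather than part of the original pipeline.

```python
import string

# Assumption: '!' and '?' are treated as sentiment-bearing and kept
KEEP = "!?"
REMOVE = "".join(ch for ch in string.punctuation if ch not in KEEP)

def remove_punc_keep(text):
    # Delete every punctuation character except the ones listed in KEEP
    return text.translate(str.maketrans('', '', REMOVE))

print(remove_punc_keep("Wow!!! Is this real? Yes, it is."))
```

If the kept marks are meant to become features, the vectorizer's token pattern would also need to accept them, since the default pattern only keeps word characters.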

#### Step 6: Make Lower Case

In [None]:
dataset['text']=dataset['text'].str.lower()
dataset['text'].tail()

#### Step 7: Spelling Correction

In [None]:
# ! pip install textblob

In [None]:
# from textblob import TextBlob

In [None]:
# incorrect_text = 'ceertain conditionas duriing seveal ggenerations aree moodified in the saame maner.'

# textBlb = TextBlob(incorrect_text)
# str(textBlb.correct())

In [None]:
# incorrect_text = "ahh i've always wanted to see rent  love the soundtrack!!"

# textBlb = TextBlob(incorrect_text)
# str(textBlb.correct())

In [None]:
# def spell_correct(text):
#     return str(TextBlob(text).correct())

In [None]:
# print(spell_correct('ahh ive always wanted to see rent  love the soundtrack!!'))

In [None]:
# dataset['text']=dataset['text'].apply(lambda x:spell_correct(x) )

In [None]:
# import itertools
# from autocorrect import Speller
# text="ahh ive always wanted to see rent  love the soundtrack!!"
# # #One letter in a word should not be present more than twice in continuation
# # text_correction = ''.join(''.join(s)[:2] for _, s in itertools.groupby(text))
# # print("Normal Text:\n{}".format(text_correction))
# spell = Speller(lang='en')
# ans = spell(text)
# print("After correcting text:\n{}".format(ans))
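
Full spelling correction is slow on 1.6 million tweets, but the commented-out `itertools.groupby` line above hints at a cheaper normalization: limiting any run of the same character to two. A minimal sketch of that idea follows. `collapse_repeats` and its `max_repeat` parameter are hypothetical names.

```python
import itertools

def collapse_repeats(text, max_repeat=2):
    # groupby yields runs of identical consecutive characters;
    # keep at most max_repeat characters from each run
    return "".join("".join(run)[:max_repeat] for _, run in itertools.groupby(text))

print(collapse_repeats("sooooo goooood!!!"))
```

This does not fix real misspellings. It only reduces elongated words such as "sooooo" to a smaller set of variants, so a dictionary-based corrector could still be applied afterwards.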

In [None]:
# def auto_correct(text):
#     spell=Speller(lang='en')
#     return spell(text)

In [None]:
# dataset['text']=dataset['text'].apply(lambda x:auto_correct(x) )

In [None]:
# dataset.iloc[201][0]

In [None]:
# import spacy
# nlp = spacy.load("en_core_web_sm")
# text = 'My email is abc@gmail.com'
# doc = nlp(text)
# l=[]
# for token in doc:
# #     print(token)
#     if not token.like_email:
# #         l.append(str(token))
# " ".join(l)

In [None]:
# !pip install autocorrect

In [None]:
# ! pip install spacy

In [None]:
# ! python -m spacy download en_core_web_sm

#### Step 8: Tokenization

In [None]:
import nltk
nltk.download('punkt')
# Recent NLTK releases look for 'punkt_tab' instead, so request both
nltk.download('punkt_tab')

In [None]:
from nltk.tokenize import word_tokenize,sent_tokenize

In [None]:
sent_tokenize("@stustone Your show is whack. Way worse than whack, it's wiggety-whack.  EMOJIsadEMOJIsadEMOJIsad")

In [None]:
sent=sent_tokenize("@stustone Your show is whack. Way worse than whack, it's wiggety-whack.  EMOJIsadEMOJIsadEMOJIsad")

In [None]:
wt=word_tokenize(("@stustone Your show is whack. Way worse than whack, it's wiggety-whack.  EMOJIsadEMOJIsadEMOJIsad"))

In [None]:
wt

In [None]:
def tokenize_text(text):
    # Use a distinct name: redefining word_tokenize would shadow the NLTK import above
    return word_tokenize(text)
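
General-purpose tokenizers split `@stustone` into `@` and `stustone`, which loses the fact that the token is a mention. The sketch below is a small regex tokenizer that keeps mentions and hashtags intact. The pattern is an assumption chosen for illustration, and `simple_tweet_tokenize` is a hypothetical helper. NLTK also ships a `TweetTokenizer` class aimed at this kind of text.

```python
import re

# Assumed token pattern: @mentions / #hashtags, words with an optional
# apostrophe part, or any single non-space symbol
TOKEN_PATTERN = re.compile(r"[@#]\w+|\w+(?:'\w+)?|[^\w\s]")

def simple_tweet_tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(simple_tweet_tokenize("@stustone it's wiggety-whack #fail!"))
```

Note that the notebook removes punctuation before tokenizing, so `@` and `#` are already gone by this step; a tokenizer like this only helps if it runs earlier in the pipeline.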

#### Step 9: Remove Stop Words

In [None]:
from nltk.corpus import stopwords

In [None]:

nltk.download('stopwords')

In [None]:
print(stopwords.words('english'))

In [None]:
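# Build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
sample_words = [word for word in wt if word not in stop_words]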

In [None]:
print(" ".join(sample_words))
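
One caveat for sentiment work: NLTK's English stop-word list includes negations such as "not" and "no", and removing them can flip the meaning of a tweet. The sketch below shows the effect with a tiny hypothetical stop-word set (`mini_stop_words`); the real list is much longer, and `remove_stop_words` is an illustrative helper.

```python
# Hypothetical mini stop-word list for illustration only
mini_stop_words = {"the", "is", "a", "this", "not", "no", "and"}
negations = {"not", "no", "nor", "never"}

# Keep negation words even though they appear in the stop-word list
sentiment_stop_words = mini_stop_words - negations

def remove_stop_words(tokens, stop_words):
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = "this is not a good movie".split()
print(remove_stop_words(tokens, mini_stop_words))       # negation is lost
print(remove_stop_words(tokens, sentiment_stop_words))  # negation is kept
```

The same subtraction can be applied to `set(stopwords.words('english'))` if stop-word removal is enabled in the pipeline.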

In [None]:
" ".join(wt)

In [None]:
# import spacy
# nlp = spacy.load('en_core_web_sm')

In [None]:
# doc1 = nlp("@stustone Your show is whack. Way worse than whack, it's wiggety-whack.  EMOJIsadEMOJIsadEMOJIsad")

In [None]:
# for i in doc1:
#     print(i)

In [None]:
# w=("@stustone Your show is whack. Way worse than whack, it's wiggety-whack.  EMOJIsadEMOJIsadEMOJIsad").split()

In [None]:
# sample_words = [word for word in w if word not in stopwords.words('english')]
# print(" ".join(sample_words))

In [None]:
def token_split(text):
    lis_w=text.split()
    return lis_w


In [None]:
dataset['text']=dataset['text'].apply(lambda x:token_split(x))

In [None]:
dataset.head()

In [None]:
# def stopword_remove(lis):
#     sample_words = [word for word in lis if word not in stopwords.words('english')]
#     return " ".join(sample_words)


In [None]:
# dataset['text']=dataset['text'].apply(lambda x:stopword_remove(x))

In [None]:
dataset.head()

#### Step 10: Stemming and Lemmatization
![Imgur](https://i.imgur.com/uqNdwzX.png)


In [None]:
from nltk.stem.porter import PorterStemmer
st=PorterStemmer()
def stemming_on_text(data):
    text = [st.stem(word) for word in data]
    return " ".join(text)

In [None]:
dataset['text'] = dataset['text'].apply(lambda x: stemming_on_text(x))
dataset['text'].head()

In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

In [None]:
lm = WordNetLemmatizer()
def lemmatizer_on_text(text):
    data = [lm.lemmatize(word,pos='v') for word in text.split()]
    return " ".join(data)


In [None]:
lemmatizer_on_text((dataset['text'][0]))

In [None]:
dataset['text'] = dataset['text'].apply(lambda x: lemmatizer_on_text(x))
dataset['text'].head()
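
The difference between the two techniques can be made concrete with a toy sketch: stemming is rule-based suffix stripping, while lemmatization is a dictionary lookup. Everything here is hypothetical and deliberately tiny. `toy_stem` is not the Porter algorithm, and `LEMMA_LOOKUP` stands in for a real lexicon such as WordNet.

```python
# Toy illustration only: real stemmers (e.g. Porter) use many more rules,
# and real lemmatizers consult a full dictionary such as WordNet.
SUFFIXES = ("ing", "ed", "s")
LEMMA_LOOKUP = {"better": "good", "ran": "run", "mice": "mouse"}  # hypothetical entries

def toy_stem(word):
    # Rule-based: strip the first matching suffix, keeping at least 3 characters
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    # Dictionary-based: look the word up, fall back to the word itself
    return LEMMA_LOOKUP.get(word, word)

for w in ["jumping", "better", "mice"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

With the real `WordNetLemmatizer`, the result also depends on the `pos` argument: the cell above uses `pos='v'`, so irregular adjectives such as "better" are only mapped to "good" when `pos='a'` is passed instead.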

In [None]:
dataset.head()

In [None]:
dataset['target'].value_counts()

In [None]:
dataset.to_csv('/content/drive/MyDrive/DATA SETS/processed_tweets.csv',index=False)

#### Step 11: Apply Algorithm

**TF-IDF**, which stands for **Term Frequency-Inverse Document Frequency**, is a numerical statistic used in natural language processing and information retrieval to evaluate the importance of a term (word) within a document relative to a collection of documents (corpus). It's a common technique for text feature extraction and is particularly useful for text-based applications like information retrieval, text classification, and document ranking.

The formula for calculating TF-IDF for a term in a document is as follows:

**TF-IDF(t, d) = TF(t, d) * IDF(t)**

Where:

TF(t, d) (Term Frequency): This component measures the frequency of a term (t) within a specific document (d). It indicates how often the term appears in the document and can be computed in various ways, such as simple word count or normalized frequency (e.g., by dividing the raw count by the total number of words in the document). A common formula for TF is:

**TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)**

IDF(t) (Inverse Document Frequency): This component assesses the importance of a term across a collection of documents. It's calculated as:

**IDF(t) = log(N / (n_t + 1))**

Where:

- N is the total number of documents in the corpus.
- n_t is the number of documents that contain the term t.
- The "+1" in the denominator is a smoothing factor to avoid division by zero when a term is not found in any documents in the corpus.
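
The formulas above can be checked by hand on a small example. The sketch below computes TF-IDF exactly as written in this section, on a hypothetical three-document corpus. The helper names `tf`, `idf`, and `tf_idf` are illustrative.

```python
import math

# Hypothetical three-document corpus, already tokenized
corpus = [
    ["good", "movie"],
    ["bad", "movie"],
    ["good", "good", "fun"],
]

def tf(term, doc):
    # Raw count divided by document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(N / (n_t + 1)), the smoothed variant given in the text
    n_t = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (n_t + 1))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(round(tf_idf("fun", corpus[2], corpus), 4))   # rare term gets a positive weight
print(round(tf_idf("good", corpus[2], corpus), 4))  # term in 2 of 3 documents gets zero here
```

scikit-learn's `TfidfVectorizer` will not reproduce these numbers: by default it uses `ln((1 + N) / (1 + n_t)) + 1` for the IDF term and then L2-normalizes each document vector, so the ranking of terms is similar but the values differ.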

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(dataset['text'], dataset['target'], test_size=0.2, random_state=42)

# TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=500000,ngram_range=(1,3),stop_words='english')
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)


In [None]:
X_train_tfidf

In [None]:
X_train_tfidf.shape

In [None]:
print(X_train_tfidf)

In [None]:
print("Feature names:\n", tfidf_vectorizer.get_feature_names_out())


In [None]:
# Print only the first 20 features; the full vocabulary can hold up to 500,000 entries
for i, feature in enumerate(tfidf_vectorizer.get_feature_names_out()[:20]):
    print(i, feature)

In [None]:
# X_train_tfidf.toarray()

Gaussian Naive Bayes, Multinomial Naive Bayes, and Bernoulli Naive Bayes are three different variants of the Naive Bayes algorithm, each suited for specific types of data and classification tasks. Here's a comparison of these variants:

1. **Gaussian Naive Bayes:**

   - **Use Case:** It is suitable for continuous data that follows a Gaussian (normal) distribution. It is commonly used for real-valued features.

   - **Mathematics:** It models the likelihood of features as Gaussian distributions. It assumes that the features are conditionally independent given the class.

   - **Strengths:**
     - Effective for continuous, real-valued data.
     - Works well for data that can be reasonably approximated by a Gaussian distribution.

   - **Weaknesses:**
     - May not perform well for data with non-Gaussian distributions.
     - Not ideal for text data or data with a large number of discrete categories.

   - **Example Use Cases:** Handwriting recognition, facial recognition, medical data analysis.

2. **Multinomial Naive Bayes:**

   - **Use Case:** It is suitable for discrete data, especially when dealing with text data. It is commonly used in text classification tasks.

   - **Mathematics:** It models the likelihood of features as a Multinomial distribution over word counts. The model formally expects non-negative counts, although fractional weights such as TF-IDF values often work well in practice.

   - **Strengths:**
     - Effective for text classification tasks, such as sentiment analysis and spam detection.
     - Handles discrete data well, where each feature represents the count or frequency of a term.

   - **Weaknesses:**
     - It assumes features are count-valued and conditionally independent given the class, which may not hold in some cases.
     - Ignores the order of words in a document.

   - **Example Use Cases:** Sentiment analysis, document classification, spam email detection.

3. **Bernoulli Naive Bayes:**

   - **Use Case:** It is suitable for binary data, where features represent binary attributes (0/1 values). It is often used in document classification tasks where you have binary presence/absence features.

   - **Mathematics:** It models the likelihood of features as a Bernoulli distribution.

   - **Strengths:**
     - Effective for binary data, where features are binary indicators (e.g., word presence/absence).
     - Suitable for tasks like spam detection and document classification.

   - **Weaknesses:**
     - Ignores term frequency information (only considers binary presence/absence).
     - Assumes independence of features.

   - **Example Use Cases:** Text classification with binary features, such as spam detection or sentiment analysis with a bag-of-words representation.

In summary, the choice between Gaussian, Multinomial, and Bernoulli Naive Bayes depends on the nature of your data and the specific classification task you're working on. For text classification, Multinomial and Bernoulli Naive Bayes are often more appropriate, while Gaussian Naive Bayes is better suited for continuous data. The effectiveness of each variant depends on the appropriateness of its underlying distribution assumption to the data you're working with.
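
To show what the Multinomial variant computes, here is a from-scratch sketch on a hypothetical four-document training set. It uses raw word counts and add-one (Laplace) smoothing. The names `train`, `log_posterior`, and `predict` are illustrative, and this is not how scikit-learn implements the model internally.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (tokens, label) with 1 = positive, 0 = negative
train = [
    (["love", "great", "movie"], 1),
    (["great", "fun"], 1),
    (["terrible", "movie"], 0),
    (["hate", "terrible", "plot"], 0),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {tok for tokens, _ in train for tok in tokens}

def log_posterior(tokens, label, alpha=1.0):
    # log P(label) + sum of log P(token | label) with add-alpha smoothing
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(train))
    for tok in tokens:
        if tok in vocab:  # ignore words never seen in training
            score += math.log((word_counts[label][tok] + alpha) / (total + alpha * len(vocab)))
    return score

def predict(tokens):
    # Pick the class with the highest (unnormalized) log posterior
    return max(class_counts, key=lambda label: log_posterior(tokens, label))

print(predict(["great", "movie"]))
print(predict(["terrible", "plot"]))
```

A Bernoulli version would replace the counts with presence/absence indicators and would also add a term for every vocabulary word that is missing from the tweet, which is the main practical difference between the two variants.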

In [None]:
from sklearn.naive_bayes import MultinomialNB

nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)
y_pred = nb_model.predict(X_test_tfidf)
print(classification_report(y_test, y_pred))

In [None]:
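new_tweets = ["I love this product!", "This is terrible."]

def preprocess(text):
    # Apply the same cleaning steps that were applied to the training text
    text = chat_conversion(handel_emoji(remove_url(text)))
    text = remove_punc(text).lower()
    return lemmatizer_on_text(stemming_on_text(token_split(text)))

new_tweets_tfidf = tfidf_vectorizer.transform([preprocess(t) for t in new_tweets])

nb_sentiments = nb_model.predict(new_tweets_tfidf)

# Labels: 0 = negative, 1 = positive
print("Sentiments (Naive Bayes):", nb_sentiments)


In [None]:
tweet=['''Agencies slipping in my DM on behalf of T Series and Adipurush and begging me to delete my tweets for some money, sorry guys you chose the wrong person. #AdipurushDisaster ''']

In [None]:
new_tweets_tfidf = tfidf_vectorizer.transform([preprocess(t) for t in tweet])

nb_sentiments = nb_model.predict(new_tweets_tfidf)

print("Sentiments (Naive Bayes):", nb_sentiments)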