
# **Problem Statement**
With the growing importance of social media in shaping public opinion, organizations increasingly rely on platforms like Twitter to gauge customer sentiment and brand perception. However, extracting meaningful insights from the vast amount of unstructured data on Twitter can be challenging. Many corporate teams lack the technical skills to analyze sentiment effectively or visually communicate findings in a way that drives actionable decisions. This gap in expertise hinders organizations from leveraging real-time feedback to improve customer engagement, enhance product offerings, and address reputational risks.

# **Objective**
The objective of this corporate training program is to empower participants with the skills and tools required to perform sentiment analysis on Twitter data. Through hands-on sessions, participants will learn how to collect, analyze, and visualize Twitter data using modern tools and techniques. By the end of the training, participants will be equipped to:

1) Extract and preprocess Twitter data for sentiment analysis.

2) Apply natural language processing (NLP) techniques to classify sentiments (positive, negative, neutral).

3) Design and interpret visualizations that clearly communicate sentiment trends and insights to stakeholders.

4) Leverage the analysis to inform strategic decisions, improve customer relations, and manage brand reputation.

This training will enhance organizational capability in social media analytics, fostering data-driven decision-making and improving competitive advantage.

In [3]:
!pip install pyspellchecker
#nltk.download('stopwords')

Collecting pyspellchecker
  Downloading pyspellchecker-0.8.1-py3-none-any.whl.metadata (9.4 kB)
Downloading pyspellchecker-0.8.1-py3-none-any.whl (6.8 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.8.1


In [14]:
import nltk
nltk.download("stopwords")  # stop word list used in the lemmatizer step
nltk.download("wordnet")    # data used by WordNetLemmatizer

True

In [15]:
import os
import re

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import spacy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from spellchecker import SpellChecker

In [5]:
data=pd.read_csv("twitter_data.csv",encoding="ISO-8859-1")

In [6]:
data.head()

| Unnamed: 0 | ItemID | Sentiment | SentimentText |
|---|---|---|---|
| 0 | 1 | 0 | is so sad for my APL frie... |
| 1 | 2 | 0 | I missed the New Moon trail... |
| 2 | 3 | 1 | omg its already 7:30 :O |
| 3 | 4 | 0 | .. Omgaga. Im sooo im gunna CRy. I'... |
| 4 | 5 | 0 | i think mi bf is cheating on me!!! ... |


In [7]:
data.isnull().sum()

Unnamed: 0,0
ItemID,0
Sentiment,0
SentimentText,0


# **Remove Unwanted Spaces**

In [4]:
data["SentimentText"][5]

'         or i just worry too much?        '

In [5]:
data["SentimentText"][5].strip()

'or i just worry too much?'

In [8]:
data["SentimentText"]=data["SentimentText"].str.strip()

# **Remove Username**

In [8]:
for i in data["SentimentText"][0:100]:
    if "@" in i:
        print(i)

hmmmm.... i wonder how she my number @-)
I just cut my beard off. It's only been growing for well over a year. I'm gonna start it over. @shaunamanu is happy in the meantime.
@ginaaa &lt;3 GO TO THE SHOW TONIGHT
@Spiral_galaxy @YMPtweet  it really makes me sad when i look at Muslims reality now
and the entertainment is over, someone complained properly..   @rupturerapture experimental you say? he should experiment with a melody
I wanna be at home @ church...I wonder wht they are doing?
I will send sunshine to Northern Ireland, are you going swimming today @kezbat
I wish I could go to T4 On The Beach :'(    Would be great to see @Shontelle_Layne &amp; @DanMerriweather


In [10]:
q=data["SentimentText"][99987]

In [11]:
re.findall(r"@\w+",q)  # raw string so \w is not treated as a string escape

['@Cupcake_Dollie']

In [9]:
data["sentiment_clean"]=data["SentimentText"].apply(lambda x : re.sub(r"@\w+","",x))
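
As a quick check of what the `@\w+` pattern does and does not remove, here is a small sketch on sample strings modelled on the tweets printed above. The helper name `remove_usernames` is an assumption made for illustration; it is not part of the notebook.

```python
import re

# Hypothetical helper; it mirrors the re.sub call used in the cell above.
USERNAME = re.compile(r"@\w+")

def remove_usernames(text):
    """Delete @handles; an '@' followed by a space or punctuation is left alone."""
    return USERNAME.sub("", text)

samples = [
    "@ginaaa GO TO THE SHOW TONIGHT",
    "I wanna be at home @ church",
    "i wonder how she my number @-)",
]
cleaned = [remove_usernames(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Only the first sample changes, and it keeps the space that followed the handle, so a later whitespace cleanup may still be worthwhile.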

# **Remove Hyperlinks**

In [17]:
for i in data["sentiment_clean"][0:100]:
    if "http" in i:
        print(i)

awhhe man.... I'm completely useless rt now. Funny, all I can do is twitter. http://myloc.me/27HX
-- Meet your Meat http://bit.ly/15SSCI
(: !!!!!! - so i wrote something last week. and i got a call from someone in the new york office... http://tumblr.com/xcn21w6o7
friends are leaving me 'cause of this stupid love  http://bit.ly/ZoxZC
go give ur mom a hug right now. http://bit.ly/azFwv
- I love you guys so much that it hurts. http://tumblr.com/xkh1z19us
- Longest night ever.. ugh! http://tumblr.com/xwp1yxhi6


In [18]:
p=r"http://\S+"  # raw string; forward slashes need no escaping
data["sentiment_clean"]=data["sentiment_clean"].apply(lambda x :re.sub(p,"",x))

In [22]:
for i in data["sentiment_clean"][10000:100000]:
    if "http" in i:
        print(i)

&quot;The Gmail gadget does not support the &quot;Always use https&quot;&quot; grr doofes igoogle  will aber kein http nutzen........
()Went to see Bob Dylan last night, was amazin'  Going to work soon. I was put on till 13 for my first ever shift! http ...
 and who's this? https://twitter.com/frillneck and bakit wala si baylee?
 wish you were coming villey  mehhhh.You at least get the chance to avoid being in photos like this https://twitpic.com/884td
 https worked for me in some cases hehe
 yes, but to do the same thing with cURL it's like 4 times as long. I was thinking cURL was the only way to get http headers
 You should try using https://destroytwitter.com/ Really neat twitter app
 DUNNO    What does 'can't open' mean? http//thejoshuablog.com
 our addiction is killing meee!  grabe pg stalk mo ah! hehe  http:www.twitter.com/duchess07 http:www.plurk.com/Eesshh


In [10]:
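# Longest alternatives first: full URLs, then malformed ones (http:www..., http//...), then bare leftovers
p=r"https://\S+|http://\S+|http:\S+|http//\S+|http:|https|http"
data["sentiment_clean"]=data["sentiment_clean"].apply(lambda x :re.sub(p,"",x))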

In [24]:
for i in data["sentiment_clean"][0:100000]:
    if "http" in i:
        print(i)
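
To see how the final alternation behaves on well-formed and malformed links, the following sketch applies the same pattern to a few strings adapted from the output above. The helper name `remove_links` is hypothetical.

```python
import re

# Same alternation as the notebook's final pattern, written as a raw string.
URL_PATTERN = re.compile(
    r"https://\S+|http://\S+|http:\S+|http//\S+|http:|https|http"
)

def remove_links(text):
    """Strip URLs, malformed URLs, and bare 'http'/'https' tokens."""
    return URL_PATTERN.sub("", text)

samples = [
    "Meet your Meat http://bit.ly/15SSCI",
    "who's this? https://twitter.com/frillneck and more",
    "What does 'can't open' mean? http//thejoshuablog.com",
    "https worked for me in some cases",
]
cleaned = [remove_links(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Note that the last three alternatives also delete the plain words "http" and "https" from ordinary sentences, as the fourth sample shows. That may be acceptable for sentiment analysis, but it is a design choice rather than a requirement.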

# **Spell Checker**

In [30]:

# Load spaCy for preprocessing
nlp = spacy.load("en_core_web_sm")
spell = SpellChecker()

def spell_check_with_exceptions(text):
    corrected_words = []
    doc = nlp(text)

    for token in doc:
        # Skip punctuation, numbers, whitespace, and all-caps tokens (e.g. acronyms such as "APL")
        if token.is_punct or token.like_num or token.is_space or token.is_upper:
            word = token.text  # Keep as is
        elif spell.unknown([token.text]):  # Word is not in the dictionary
            # correction() may return None when no candidate is found
            word = spell.correction(token.text) or token.text
        else:
            word = token.text  # Already a known word

        # Re-attach the token's trailing whitespace so punctuation is not spaced out
        corrected_words.append(word + token.whitespace_)

    return "".join(corrected_words)





KeyboardInterrupt: 

In [None]:
data["sentiment_clean"]=data["sentiment_clean"].apply(spell_check_with_exceptions)
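
Correcting every token of every tweet can be slow on a dataset of this size, which may be why the run above was interrupted. One common mitigation is to cache the correction for each distinct word so the expensive lookup runs once per word rather than once per occurrence. The sketch below illustrates the idea with a stand-in corrector: `slow_correct` and `TOY_CORRECTIONS` are made up for illustration, and in the notebook the cached function would wrap `spell.correction` instead.

```python
from functools import lru_cache

# Stand-in for an expensive corrector such as spell.correction;
# the tiny lookup table is made up for illustration.
TOY_CORRECTIONS = {"sooo": "so", "gunna": "gonna", "mi": "my"}
calls = 0

def slow_correct(word):
    global calls
    calls += 1  # count how often the expensive path runs
    return TOY_CORRECTIONS.get(word, word)

@lru_cache(maxsize=None)
def cached_correct(word):
    """Memoized wrapper: each distinct word is corrected only once."""
    return slow_correct(word)

def correct_text(text):
    return " ".join(cached_correct(w) for w in text.split())

tweets = ["im sooo sad", "sooo tired", "mi bf is sooo late"]
corrected = [correct_text(t) for t in tweets]
print(corrected)
print("expensive calls:", calls)
```

The three sample tweets contain ten tokens but only eight distinct words, so the expensive function runs eight times. On real tweets, where common words repeat heavily, the saving should be much larger.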

In [32]:
data.head()

| Unnamed: 0 | ItemID | Sentiment | SentimentText | sentiment_clean |
|---|---|---|---|---|
| 0 | 1 | 0 | is so sad for my APL friend............. | is so sad for my APL friend............. |
| 1 | 2 | 0 | I missed the New Moon trailer... | I missed the New Moon trailer... |
| 2 | 3 | 1 | omg its already 7:30 :O | omg its already 7:30 :O |
| 3 | 4 | 0 | .. Omgaga. Im sooo im gunna CRy. I've been at... | .. Omgaga. Im sooo im gunna CRy. I've been at... |
| 4 | 5 | 0 | i think mi bf is cheating on me!!! T_T | i think mi bf is cheating on me!!! T_T |


# **Lemmatizer**

In [11]:
from nltk.stem import WordNetLemmatizer

In [12]:
ls=WordNetLemmatizer()

In [17]:
# stopwords.words() raises LookupError ("Resource stopwords not found") until the corpus is downloaded
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)  # data used by WordNetLemmatizer
stop_words=set(stopwords.words("english"))  # build the set once instead of once per token
data["sentiment_clean"]=data["sentiment_clean"].apply(lambda x :" ".join([ls.lemmatize(j,pos="v") for j in x.split() if j not in stop_words]))
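
One detail worth checking in the stop word filter: NLTK's English stop word list is all lowercase, so a case-sensitive membership test keeps capitalized stop words such as "I" or "The". The sketch below shows the difference using a small hand-written set as a stand-in for `stopwords.words("english")`; the function name `drop_stop_words` and its `case_sensitive` flag are assumptions for illustration.

```python
# Small stand-in for stopwords.words("english"); the real list is much longer.
STOP_WORDS = {"i", "the", "is", "so", "for", "my"}

def drop_stop_words(text, case_sensitive=True):
    """Remove stop words; optionally lowercase each token before the lookup."""
    kept = []
    for token in text.split():
        key = token if case_sensitive else token.lower()
        if key not in STOP_WORDS:
            kept.append(token)
    return " ".join(kept)

tweet = "I missed the New Moon trailer"
strict = drop_stop_words(tweet)                          # compares raw tokens
relaxed = drop_stop_words(tweet, case_sensitive=False)   # compares lowercased tokens
print(strict)
print(relaxed)
```

With the case-sensitive test the leading "I" survives; lowercasing the token before the lookup removes it while leaving the original casing of the kept words untouched.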


Visualizing the **negative tweets** as a **word cloud** to identify the most frequently used terms in negative user comments

In [None]:
from wordcloud import WordCloud

In [None]:
a=data.loc[data["Sentiment"]==0,"sentiment_clean"]  # the cleaned column is named sentiment_clean

In [None]:
words_0=" ".join(a)

In [None]:
wc_0=WordCloud(width=1920,height=1080,max_font_size=240, colormap="viridis",background_color='WHITE').generate(words_0)
plt.figure(figsize=(30,20))
plt.imshow(wc_0)
plt.axis("off")
plt.show()

Visualizing the **positive tweets** as a **word cloud** to identify the most frequently used terms in positive user comments

In [None]:
words_1=" ".join(data.loc[data["Sentiment"]==1,"sentiment_clean"])  # the cleaned column is named sentiment_clean

wc_1=WordCloud(width=1920,height=1080,random_state=20,max_font_size=250).generate(words_1)
plt.figure(figsize=(30,20))
plt.imshow(wc_1)
plt.axis("off")
plt.show()
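
A word cloud conveys frequency only through font size, so a plain count table can be a useful companion when exact numbers matter. The sketch below uses `collections.Counter` on made-up strings; in the notebook, the inputs would be the joined text for each class (`words_0` and `words_1`). The helper name `top_terms` is hypothetical.

```python
from collections import Counter

def top_terms(text, n=3):
    """Return the n most frequent whitespace-separated tokens, lowercased."""
    return Counter(text.lower().split()).most_common(n)

# Made-up stand-ins for the joined negative and positive tweet text.
negative_text = "sad miss sad work miss sad"
positive_text = "love good love thanks love good"

neg_top = top_terms(negative_text)
pos_top = top_terms(positive_text)
print("negative:", neg_top)
print("positive:", pos_top)
```

Comparing the two lists side by side also makes it easy to spot terms that rank highly in both classes and therefore carry little sentiment signal.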