SOCIAL MEDIA SENTIMENT ANALYSIS

Sentiment analysis is an NLP technique applied to text to determine whether the author’s attitude towards a particular topic, product, etc. is positive or negative.

In [5]:
# Import the libraries 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import string
import nltk
import warnings 
warnings.filterwarnings("ignore", category=DeprecationWarning)
%matplotlib inline

Data Preprocessing

In [6]:
# Drop the tweets that have neutral sentiment
df = pd.read_csv('Tweets.csv')
df = df[df['airline_sentiment'] != 'neutral']

In [7]:
# Keep only the tweet text and its sentiment label, then show the first 5 rows
df = df[['text', 'airline_sentiment']]
df.head(5)

Unnamed: 0,text,airline_sentiment
1,@VirginAmerica plus you've added commercials t...,positive
3,@VirginAmerica it's really aggressive to blast...,negative
4,@VirginAmerica and it's a really big bad thing...,negative
5,@VirginAmerica seriously would pay $30 a fligh...,negative
6,"@VirginAmerica yes, nearly every time I fly VX...",positive


In [8]:
# Removing stopwords from text
nltk.download('stopwords')
from nltk.corpus import stopwords
stop=set(stopwords.words("english"))
print(stop)

{'off', 'we', 'about', 'll', 'again', "hasn't", 'wouldn', 'm', 'which', "that'll", 've', 'theirs', "she's", 'being', 'over', 'who', 'him', 'their', 'whom', 'doesn', 'had', 't', 'same', "didn't", 'that', 'what', 'yourselves', 'if', 'why', 'is', "haven't", 'your', 'when', 'wasn', 'how', 'a', 'because', 'me', 'then', "aren't", "doesn't", 'with', 'mustn', 'to', 'into', 'most', 'hadn', "isn't", 'each', 'he', 'was', 'during', "mustn't", "it's", 'be', 'no', "don't", "needn't", 'having', 'her', 'shan', 'will', "hadn't", 'were', 'very', "you're", 'can', 'such', 'above', "should've", 'they', 'them', 'i', 'herself', 'these', 'has', 'but', 's', 'so', 'aren', 'in', "wasn't", 'hasn', 'up', 'some', 'haven', 'while', 'does', 'before', "won't", 'and', 'through', 'once', 'not', 'out', 'both', 'won', 'further', 'ourselves', 'don', 'myself', "wouldn't", 'against', 'more', 'did', 'weren', 'those', 'of', 'itself', 'this', 'do', 'isn', 'my', 'yours', 'am', 'than', "mightn't", 'the', 'only', 'any', 'yourself'

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
df['text'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
df

Unnamed: 0,text,airline_sentiment
1,@VirginAmerica plus added commercials experien...,positive
3,@VirginAmerica really aggressive blast obnoxio...,negative
4,@VirginAmerica really big bad thing,negative
5,@VirginAmerica seriously would pay $30 flight ...,negative
6,"@VirginAmerica yes, nearly every time I fly VX...",positive
...,...,...
14633,"@AmericanAir flight Cancelled Flightled, leavi...",negative
14634,@AmericanAir right cue delays👌,negative
14635,@AmericanAir thank got different flight Chicago.,positive
14636,@AmericanAir leaving 20 minutes Late Flight. N...,negative
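
The filter above compares each word with a lowercase stopword list, so capitalized words such as "I" pass through (see "I fly" in row 6 above). A minimal sketch of a case-insensitive variant follows. The small `STOP` set and the `remove_stopwords` name are assumptions for illustration; the notebook itself uses NLTK's full English list.

```python
# Hypothetical mini stopword list; the notebook uses NLTK's full English list.
STOP = {"i", "you", "the", "a", "it", "to", "and"}

def remove_stopwords(text, stop=STOP):
    """Drop words whose lowercase form is in the stopword set."""
    return " ".join(w for w in text.split() if w.lower() not in stop)

print(remove_stopwords("yes, nearly every time I fly VX"))
# prints: yes, nearly every time fly VX
```

Lowercasing only for the comparison keeps the original casing of the words that survive.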


In [10]:
# Removing words of 3 characters or fewer (e.g. is, are, I)
df['text'] = df['text'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3 ]))
df

Unnamed: 0,text,airline_sentiment
1,@VirginAmerica plus added commercials experien...,positive
3,@VirginAmerica really aggressive blast obnoxio...,negative
4,@VirginAmerica really thing,negative
5,@VirginAmerica seriously would flight seats pl...,negative
6,"@VirginAmerica yes, nearly every time “ear wor...",positive
...,...,...
14633,"@AmericanAir flight Cancelled Flightled, leavi...",negative
14634,@AmericanAir right delays👌,negative
14635,@AmericanAir thank different flight Chicago.,positive
14636,@AmericanAir leaving minutes Late Flight. warn...,negative
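
The length filter is blunt: it also drops short words that carry sentiment, such as "bad" in row 4 above. A small sketch of the same rule on a single string makes this visible. The `drop_short_words` helper and its `min_len` parameter are names introduced here for illustration only.

```python
def drop_short_words(text, min_len=4):
    """Keep only words with at least min_len characters (same rule as len(w) > 3)."""
    return " ".join(w for w in text.split() if len(w) >= min_len)

print(drop_short_words("@VirginAmerica really big bad thing"))
# prints: @VirginAmerica really thing
```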


Removing Punctuation, Numbers, and Special Characters

Punctuation, numbers and special characters contribute little to sentiment classification, so it is better to remove them from the text. Here we replace every character except ASCII letters and the hashtag symbol (#) with a space.

In [11]:
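# Removing special characters and symbols
# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default
df['text'] = df['text'].str.replace("[^a-zA-Z#]", " ", regex=True)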
df

  


Unnamed: 0,text,airline_sentiment
1,VirginAmerica plus added commercials experien...,positive
3,VirginAmerica really aggressive blast obnoxio...,negative
4,VirginAmerica really thing,negative
5,VirginAmerica seriously would flight seats pl...,negative
6,VirginAmerica yes nearly every time ear wor...,positive
...,...,...
14633,AmericanAir flight Cancelled Flightled leavi...,negative
14634,AmericanAir right delays,negative
14635,AmericanAir thank different flight Chicago,positive
14636,AmericanAir leaving minutes Late Flight warn...,negative
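
The character class `[^a-zA-Z#]` matches any single character that is not an ASCII letter or `#`, which is why mentions lose their `@` and emoji disappear in the output above. A minimal sketch of the same substitution on one string with the standard `re` module; the sample tweet and the function name are made up for illustration.

```python
import re

def keep_letters_and_hashtags(text):
    """Replace every character that is not an ASCII letter or '#' with a space."""
    return re.sub(r"[^a-zA-Z#]", " ", text)

sample = "@AmericanAir thanks! #happy 2day"  # made-up example tweet
cleaned = keep_letters_and_hashtags(sample)

# The substitution leaves runs of spaces behind; split/join collapses them
print(" ".join(cleaned.split()))
# prints: AmericanAir thanks #happy day
```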


Tokenization

Tokenization breaks raw text into smaller chunks, such as words or sentences, called tokens. These tokens help in understanding the context and in developing an NLP model. Tokenization also helps in interpreting the meaning of the text by analyzing the sequence of the words.

In [12]:
tokenized_text1= df['text'].apply(lambda x: x.split())
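
Because the previous step replaced characters with spaces, the text can contain leading, trailing, and repeated spaces. The sketch below, on a made-up cleaned string, shows why calling `split()` without an argument is a reasonable choice here.

```python
cleaned = " AmericanAir thank different flight Chicago "  # cleaned text with stray spaces

# split() with no argument splits on any run of whitespace and drops empty strings
tokens = cleaned.split()
print(tokens)
# prints: ['AmericanAir', 'thank', 'different', 'flight', 'Chicago']

# split(" ") keeps an empty string for every extra space
naive_tokens = cleaned.split(" ")
print(naive_tokens)
# prints: ['', 'AmericanAir', 'thank', 'different', 'flight', 'Chicago', '']
```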

Stemming (imported from the NLTK package)

Stemming is a rule-based process that strips suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word to reduce it to a base form.

In [13]:
# Stemming each token to extract the base form of the word
from nltk import PorterStemmer
ps = PorterStemmer()
tokenized_text1 = tokenized_text1.apply(lambda x: [ps.stem(i) for i in x])
tokenized_text1

1        [virginamerica, plu, ad, commerci, experi, tacki]
3        [virginamerica, realli, aggress, blast, obnoxi...
4                           [virginamerica, realli, thing]
5        [virginamerica, serious, would, flight, seat, ...
6        [virginamerica, ye, nearli, everi, time, ear, ...
                               ...                        
14633    [americanair, flight, cancel, flightl, leav, t...
14634                          [americanair, right, delay]
14635        [americanair, thank, differ, flight, chicago]
14636    [americanair, leav, minut, late, flight, warn,...
14638    [americanair, money, chang, flight, answer, ph...
Name: text, Length: 11541, dtype: object
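
To illustrate the idea of rule-based suffix stripping, here is a toy stemmer. It is not the Porter algorithm used above, which applies many more rules (for example, Porter yields "realli" for "really", while this sketch yields "real"). The suffix list, the `min_stem` threshold, and the function name are assumptions chosen for the example.

```python
# Toy suffix stripper, NOT the Porter algorithm: it only illustrates the idea.
SUFFIXES = ("ing", "ly", "es", "s")  # tried in this order, longer suffixes first

def naive_stem(word, min_stem=3):
    """Lowercase the word and strip the first matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["leaving", "really", "delays", "flight"]])
# prints: ['leav', 'real', 'delay', 'flight']
```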

Vectorization

Word vectorization is a methodology in NLP that maps words or phrases from a vocabulary to corresponding vectors of real numbers, which can then be used for word prediction and for measuring word similarity or semantics. The process of converting words into numbers is called vectorization.

In [14]:
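# Converting words to vectors for comparison with input data
# Note: the features are built from df['text'] (the cleaned text), not from the stemmed tokens in tokenized_text1
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=1000)
vectors = vectorizer.fit_transform(df['text'])
# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out())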
words_df.head()



Unnamed: 0,aa,able,about,absolute,absolutely,acceptable,access,account,actually,address,...,yall,yeah,year,years,yes,yesterday,yet,you,your,zero
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.487496,0.0,0.0,0.0,0.0,0.0
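
Each cell in the table above is a TF-IDF weight: a word's count in the tweet, scaled down when the word appears in many tweets, with each row then normalized to unit length. The sketch below computes this by hand on three made-up documents. It follows the smoothed formula that, to my understanding, scikit-learn documents for its default settings, idf = ln((1 + n) / (1 + df)) + 1 with L2 row normalization; the tokenization is simplified to a whitespace split, so it is an approximation of `TfidfVectorizer`, not a replacement.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed idf and L2 row normalization (simplified tokenization)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    vocab = sorted({t for doc in tokenized for t in doc})
    # Document frequency: in how many documents each term appears
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Made-up mini corpus for illustration
vocab, rows = tfidf(["late flight", "great flight", "late again"])
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

In the second document, "great" receives a higher weight than "flight" because "flight" also occurs in another document.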


In [15]:
X = words_df
y = df['airline_sentiment']

Training the model using machine learning methods such as Logistic Regression and Support Vector Machines.

In [16]:
# Splitting the dataset for training and testing
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,y,random_state=1,test_size=0.1,shuffle=False)
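
With `shuffle=False` the split is purely positional, and `random_state` has no effect: the first rows become the training set and the last rows become the test set, so the test set here consists of the tweets at the end of the file. A minimal sketch with a plain list, assuming the test count is rounded up; the helper name is hypothetical, and the row count 11541 is taken from the stemming output above.

```python
import math

def ordered_split(items, test_size=0.1):
    """Split without shuffling: the last ceil(test_size * n) items form the test set."""
    n_test = math.ceil(test_size * len(items))
    return items[:-n_test], items[-n_test:]

rows = list(range(11541))  # stand-in for the 11541 tweets
train, test = ordered_split(rows)
print(len(train), len(test))
# prints: 10386 1155
```

If the file is ordered by airline or by date, a positional split like this can give a test set that differs systematically from the training set, so shuffling is worth considering.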

In [17]:
# Training the model using Logistic Regression
from sklearn.linear_model import LogisticRegression
# C is the inverse regularization strength, so C=1e9 makes the regularization negligible
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(x_train, y_train)
prediction_linear = logreg.predict(vectors)
logreg.score(x_test, y_test)  # accuracy of the Logistic Regression model on the test split

  "X does not have valid feature names, but"


0.9385281385281385

In [18]:
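# Training the model using SVM
from sklearn import svm
from sklearn.metrics import classification_report
# Perform classification with a linear-kernel SVM
clf_svm = svm.SVC(kernel='linear')
# Note: the model is fitted on the full dataset (vectors), which includes the test rows,
# so the score below is not a held-out accuracy and is not directly comparable to the
# Logistic Regression score above.
clf_svm.fit(vectors, df['airline_sentiment'])
prediction_linear1 = clf_svm.predict(vectors)
clf_svm.score(x_test, y_test)  # accuracy of the SVM model on the test rows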

  f"X has feature names, but {self.__class__.__name__} was fitted without"


0.9645021645021645

In [19]:
# Results: classification report for the SVM predictions on the full dataset
report = classification_report(df['airline_sentiment'], prediction_linear1, output_dict=True)
print('positive: ', report['positive'])
print('negative: ', report['negative'])

positive:  {'precision': 0.8985655737704918, 'recall': 0.7422767668218366, 'f1-score': 0.8129779837775203, 'support': 2363}
negative:  {'precision': 0.9364897278131192, 'recall': 0.978426672477664, 'f1-score': 0.95699898758459, 'support': 9178}
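
The f1-score in each row is the harmonic mean of that row's precision and recall. A quick check using the values printed above; the `f1` helper is a name introduced here for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values copied from the classification report above
pos = f1(0.8985655737704918, 0.7422767668218366)
neg = f1(0.9364897278131192, 0.978426672477664)
print(round(pos, 4), round(neg, 4))
# prints: 0.813 0.957
```

The lower recall for the positive class (about 0.74 against 0.98 for negative) reflects the class imbalance visible in the support column: 2363 positive tweets against 9178 negative ones.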


In [20]:
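# Function built for predicting the sentiment of a word or sentence
def predict_sentiment():
    review = input()
    # Transform the input with the fitted TF-IDF vectorizer before predicting
    review_vector = vectorizer.transform([review])
    print(clf_svm.predict(review_vector))  # the SVM model is used because it scored higher above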

In [21]:
# Predicting any data 
predict_sentiment()

happy
['positive']
