##### Problem: We have a Kaggle dataset of tweets from a long discussion among a group of users. Our task is to determine how many tweets are negative and how many are positive so that we can draw a conclusion about the overall sentiment.

### Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import re
import nltk

In [4]:
#dataset

data = pd.read_csv("datasets/twitter.csv")

In [5]:
print(data.head())

   Unnamed: 0  count  hate_speech  offensive_language  neither  class  \
0           0      3            0                   0        3      2   
1           1      3            0                   3        0      1   
2           2      3            0                   3        0      1   
3           3      3            0                   2        1      1   
4           4      6            0                   6        0      1   

                                               tweet  
0  !!! RT @mayasolovely: As a woman you shouldn't...  
1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...  
2  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...  
3  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...  
4  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...  


The tweet column in the above dataset contains the text we need to analyze the sentiment of the people engaged in the discussion. Before going further, we have to clean this text, because the tweets contain many spelling errors, special symbols, and other noise. Here is how we can clean up the tweet column:

In [6]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/snigdhakakkar/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Stop words are a set of commonly used words in a language. Examples in English include “a”, “the”, “is”, and “are”. In text mining and natural language processing (NLP), stop words are usually removed because they are so common that they carry very little useful information.
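As a rough illustration of stop word removal, the sketch below filters a sentence against a tiny hand-picked stop word set. The set `SAMPLE_STOPWORDS` and the helper `remove_stopwords` are assumptions made for this example only; the notebook itself relies on NLTK's much larger English list.

```python
# Hypothetical, hand-picked subset of English stop words (illustration only).
SAMPLE_STOPWORDS = {"a", "the", "is", "are", "and", "of"}

def remove_stopwords(sentence, stopwords=SAMPLE_STOPWORDS):
    """Return the sentence with stop words removed (case-insensitive match)."""
    kept = [w for w in sentence.split() if w.lower() not in stopwords]
    return " ".join(kept)

print(remove_stopwords("The weather is a mix of sun and rain"))
```

Only the content-bearing words survive, which is the effect the cleaning step aims for.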

In [7]:
stemmer = nltk.SnowballStemmer("english")

In simple terms, stemming reduces a word to its base form, or stem, so that related words share a common stem. For example, the words care, cared, and caring all reduce to the stem ‘care’.
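The example from the text can be checked directly with a small sketch. It assumes NLTK is installed; the names `demo_stemmer` and `stem_words` are hypothetical and used only here.

```python
import nltk

# Same stemmer class the notebook uses; the English Snowball stemmer
# is rule-based and needs no corpus download.
demo_stemmer = nltk.SnowballStemmer("english")

def stem_words(words):
    """Stem each word in the list with the English Snowball stemmer."""
    return [demo_stemmer.stem(w) for w in words]

print(stem_words(["care", "cared", "caring"]))
```

All three words should map to the same stem, so they count as one feature in later analysis.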

In [8]:
from nltk.corpus import stopwords
import string
stopword=set(stopwords.words('english'))

In [9]:
def clean(text):
    # Lowercase, then strip bracketed text, URLs, and HTML tags
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    # Drop punctuation, newlines, and any token containing a digit
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    # Remove stop words, then stem the remaining words
    words = [word for word in text.split(' ') if word not in stopword]
    words = [stemmer.stem(word) for word in words]
    return " ".join(words)
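To see what the regular expressions in the cleaning function do on their own, here is a minimal sketch that applies only those steps to a made-up tweet. The helper `strip_noise` and the sample string are assumptions for illustration; unlike the notebook's function, it skips stop word removal and stemming and collapses leftover whitespace at the end.

```python
import re
import string

def strip_noise(text):
    """Apply the regex-based cleaning steps only (no stop words, no stemming)."""
    text = str(text).lower()
    text = re.sub(r"\[.*?\]", "", text)                # bracketed text
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # URLs
    text = re.sub(r"<.*?>+", "", text)                 # HTML tags
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # punctuation
    text = re.sub(r"\n", "", text)                     # newlines
    text = re.sub(r"\w*\d\w*", "", text)               # tokens containing digits
    # Assumption: collapse leftover whitespace, which the notebook does not do.
    return " ".join(text.split())

sample = "!!! RT @user: Check https://example.com <b>now</b> 2day!!\n"
print(strip_noise(sample))
```

Note that the digit rule removes whole tokens such as "2day", not just the digit.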

In [10]:
data["tweet"] = data["tweet"].apply(clean)

In [11]:
print(data.head())

   Unnamed: 0  count  hate_speech  offensive_language  neither  class  \
0           0      3            0                   0        3      2   
1           1      3            0                   3        0      1   
2           2      3            0                   3        0      1   
3           3      3            0                   2        1      1   
4           4      6            0                   6        0      1   

                                               tweet  
0   rt mayasolov woman shouldnt complain clean ho...  
1   rt  boy dat coldtyga dwn bad cuffin dat hoe  ...  
2   rt urkindofbrand dawg rt  ever fuck bitch sta...  
3             rt cganderson vivabas look like tranni  
4   rt shenikarobert shit hear might true might f...  


###### Next, we calculate sentiment scores for these tweets so that we can label the discussion as positive, negative, or neutral.

In [12]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sentiments = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/snigdhakakkar/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [13]:
# Score each tweet once, then split the result into three columns
scores = [sentiments.polarity_scores(i) for i in data["tweet"]]
data["Positive"] = [s["pos"] for s in scores]
data["Negative"] = [s["neg"] for s in scores]
data["Neutral"] = [s["neu"] for s in scores]

Now, let us select the columns that we need for the rest of the Twitter sentiment analysis task:

In [14]:
data = data[["tweet", "Positive", 
             "Negative", "Neutral"]]

print(data.head())

                                               tweet  Positive  Negative  \
0   rt mayasolov woman shouldnt complain clean ho...     0.147     0.157   
1   rt  boy dat coldtyga dwn bad cuffin dat hoe  ...     0.000     0.280   
2   rt urkindofbrand dawg rt  ever fuck bitch sta...     0.000     0.577   
3             rt cganderson vivabas look like tranni     0.333     0.000   
4   rt shenikarobert shit hear might true might f...     0.154     0.407   

   Neutral  
0    0.696  
1    0.720  
2    0.423  
3    0.667  
4    0.440  
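The three scores can also be turned into a label for each individual tweet by picking the largest one. The sketch below is one possible way to do that, not something the notebook does; `label_tweet` is a hypothetical helper that reuses the comparison logic of `sentiment_score`, and the score triples are copied from the first three rows printed above.

```python
def label_tweet(pos, neg, neu):
    """Return the label with the strictly largest score; ties fall back to Neutral."""
    if pos > neg and pos > neu:
        return "Positive"
    if neg > pos and neg > neu:
        return "Negative"
    return "Neutral"

# (Positive, Negative, Neutral) scores of the first three rows printed above.
rows = [(0.147, 0.157, 0.696), (0.000, 0.280, 0.720), (0.000, 0.577, 0.423)]
print([label_tweet(*r) for r in rows])
```

Counting such labels would give the number of tweets per class, which is a different quantity from the summed scores.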


Let us find the dominant sentiment by summing each score column across all tweets:

In [15]:
x = sum(data["Positive"])
y = sum(data["Negative"])
z = sum(data["Neutral"])

In [16]:
def sentiment_score(a, b, c):
    if (a>b) and (a>c):
        print("Positive 😊 ")
    elif (b>a) and (b>c):
        print("Negative 😠 ")
    else:
        print("Neutral 🙂 ")

In [17]:
sentiment_score(x,y,z)

Neutral 🙂 


So, the neutral score has the largest total across all tweets. Now, let us look at the individual totals:

In [18]:
print("Positive: ", x)
print("Negative: ", y)
print("Neutral: ", z)

Positive:  2880.086000000009
Negative:  7201.020999999922
Neutral:  14696.887999999733
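To put these totals in proportion, the following sketch computes each total's share of the overall sum and the ratio of the negative to the positive total. The numbers are copied from the output above and rounded to three decimals; the variable names are hypothetical.

```python
# Totals copied from the output above, rounded to three decimals.
totals = {"Positive": 2880.086, "Negative": 7201.021, "Neutral": 14696.888}

grand_total = sum(totals.values())
shares = {label: value / grand_total for label, value in totals.items()}
neg_to_pos = totals["Negative"] / totals["Positive"]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Negative-to-positive ratio: {neg_to_pos:.2f}")
```

The ratio comes out at roughly 2.5, so the negative total is about two and a half times the positive one.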


##### We can see that the total neutral score is far higher than the positive or negative totals. However, the total negative score is about 2.5 times the total positive score. Thus, among the tweets that express a sentiment, most opinions are negative.