# Twitter Duygu Analizi

Yararlanilan Kaynak
 - https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/
 
Veri
 - https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/
 
Problem
> The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist tweets from other tweets.
Formally, given a training sample of tweets and labels, where label '1' denotes the tweet is racist/sexist and label '0' denotes the tweet is not racist/sexist, your objective is to predict the labels on the test dataset.

 
Su paketleri yukleyin 
 - textblob
 - NLTK
 - tweepy

In [2]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


In [3]:
train = pd.read_csv('train.csv')

In [4]:
train.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7359 entries, 0 to 7358
Data columns (total 3 columns):
id       7359 non-null int64
label    7359 non-null int64
tweet    7359 non-null object
dtypes: int64(2), object(1)
memory usage: 172.6+ KB


In [6]:
train.describe()

Unnamed: 0,id,label
count,7359.0,7359.0
mean,3680.0,0.069575
std,2124.504648,0.254446
min,1.0,0.0
25%,1840.5,0.0
50%,3680.0,0.0
75%,5519.5,0.0
max,7359.0,1.0


# Yeni Ozellik ekleme

> One of the most basic features we can extract is the number of words in each tweet. The basic intuition behind this is that generally, the negative sentiments contain a lesser amount of words than the positive ones.

In [8]:
train['word_count'] = train['tweet'].apply(lambda x: len(str(x).split(" ")))
train['char_count'] = train['tweet'].str.len() ## this also includes spaces
train.head()

Unnamed: 0,id,label,tweet,word_count,char_count
0,1,0,@user when a father is dysfunctional and is s...,21,102
1,2,0,@user @user thanks for #lyft credit i can't us...,22,122
2,3,0,bihday your majesty,5,21
3,4,0,#model i love u take with u all the time in ...,17,118
4,5,0,factsguide: society now #motivation,8,39


### Ortalama kelime sayisi

__Varsayim__ : 
 - Negatif tweetlerde, kelime sayisi daha azdir

In [10]:
def avg_word(sentence):
    words = sentence.split()
    return (sum(len(word) for word in words)/len(words))

train['avg_word'] = train['tweet'].apply(lambda x: avg_word(x))
train.head()

Unnamed: 0,id,label,tweet,word_count,char_count,avg_word
0,1,0,@user when a father is dysfunctional and is s...,21,102,4.555556
1,2,0,@user @user thanks for #lyft credit i can't us...,22,122,5.315789
2,3,0,bihday your majesty,5,21,5.666667
3,4,0,#model i love u take with u all the time in ...,17,118,7.846154
4,5,0,factsguide: society now #motivation,8,39,8.0


### Stopwords sayisi

In [11]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

train['stopwords'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x in stop]))
train.head()

Unnamed: 0,id,label,tweet,word_count,char_count,avg_word,stopwords
0,1,0,@user when a father is dysfunctional and is s...,21,102,4.555556,10
1,2,0,@user @user thanks for #lyft credit i can't us...,22,122,5.315789,5
2,3,0,bihday your majesty,5,21,5.666667,1
3,4,0,#model i love u take with u all the time in ...,17,118,7.846154,5
4,5,0,factsguide: society now #motivation,8,39,8.0,1


### Hashtag sayisi

In [13]:
train['hastags'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))
train.head()

Unnamed: 0,id,label,tweet,word_count,char_count,avg_word,stopwords,hastags
0,1,0,@user when a father is dysfunctional and is s...,21,102,4.555556,10,1
1,2,0,@user @user thanks for #lyft credit i can't us...,22,122,5.315789,5,3
2,3,0,bihday your majesty,5,21,5.666667,1,0
3,4,0,#model i love u take with u all the time in ...,17,118,7.846154,5,1
4,5,0,factsguide: society now #motivation,8,39,8.0,1,1


In [14]:
train['numerics'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
train.head()

Unnamed: 0,id,label,tweet,word_count,char_count,avg_word,stopwords,hastags,numerics
0,1,0,@user when a father is dysfunctional and is s...,21,102,4.555556,10,1,0
1,2,0,@user @user thanks for #lyft credit i can't us...,22,122,5.315789,5,3,0
2,3,0,bihday your majesty,5,21,5.666667,1,0,0
3,4,0,#model i love u take with u all the time in ...,17,118,7.846154,5,1,0
4,5,0,factsguide: society now #motivation,8,39,8.0,1,1,0


# Buyuk harfle baslayan kelime sayisi
> Anger or rage is quite often expressed by writing in UPPERCASE words which makes this a necessary operation to identify those words.

In [15]:
train['upper'] = train['tweet'].apply(lambda x: len([x for x in x.split() if x.isupper()]))
train.head()

Unnamed: 0,id,label,tweet,word_count,char_count,avg_word,stopwords,hastags,numerics,upper
0,1,0,@user when a father is dysfunctional and is s...,21,102,4.555556,10,1,0,0
1,2,0,@user @user thanks for #lyft credit i can't us...,22,122,5.315789,5,3,0,0
2,3,0,bihday your majesty,5,21,5.666667,1,0,0,0
3,4,0,#model i love u take with u all the time in ...,17,118,7.846154,5,1,0,1
4,5,0,factsguide: society now #motivation,8,39,8.0,1,1,0,0


# Onisleme

In [18]:
# kucuk harfe donustur
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))

# noktalama isaretlerinden kurtul
train['tweet'] = train['tweet'].str.replace('[^\w\s]','')

# stopwordsleri sil
from nltk.corpus import stopwords
stop = stopwords.words('english')
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

In [19]:
# en cok gorulen 10 kelime
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[:10]
freq

user     3974
love      610
day       494
happy     426
ãâÿââ     420
ãââ       381
amp       370
u         278
im        277
today     256
dtype: int64

### En cok gorulen 10 kelimeden kurtul

In [20]:
freq = list(freq.index)
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
train.head()

Unnamed: 0,id,label,tweet,word_count,char_count,avg_word,stopwords,hastags,numerics,upper
0,1,0,father dysfunctional selfish drags kids dysfun...,21,102,4.555556,10,1,0,0
1,2,0,thanks lyft credit cant use cause dont offer w...,22,122,5.315789,5,3,0,0
2,3,0,bihday majesty,5,21,5.666667,1,0,0,0
3,4,0,model take time urãâÿââ ãâÿââãâÿââžãâÿââãâÿââã...,17,118,7.846154,5,1,0,1
4,5,0,factsguide society motivation,8,39,8.0,1,1,0,0


In [21]:
# en az gorulen 10 kelime
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[-10:]
freq

diversityandcreativity    1
defeat                    1
goto                      1
toa                       1
beachãââ                  1
goaway                    1
tuerie                    1
aleksejplatonov           1
edited                    1
patienceãââ               1
dtype: int64

### En az gorulen 10 kelimeden kurtul

In [22]:
freq.index

Index(['diversityandcreativity', 'defeat', 'goto', 'toa', 'beachãââ', 'goaway',
       'tuerie', 'aleksejplatonov', 'edited', 'patienceãââ'],
      dtype='object')

In [23]:
freq.values

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [24]:
freq = list(freq.index)
train['tweet'] = train['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))

# Yazim hatasi duzeltme

In [26]:
from textblob import TextBlob
train['tweet'][:5].apply(lambda x: str(TextBlob(x).correct()))

0    father dysfunctional selfish drags kiss dysfun...
1    thanks left credit can use cause dont offer wh...
2                                       midday majesty
3    model take time urãâÿââ ãâÿââãâÿââžãâÿââãâÿââã...
4                        factsguide society motivation
Name: tweet, dtype: object

In [27]:
# tokenization
TextBlob(train['tweet'][1]).words

WordList(['thanks', 'lyft', 'credit', 'cant', 'use', 'cause', 'dont', 'offer', 'wheelchair', 'vans', 'pdx', 'disapointed', 'getthanked'])

In [28]:
# kok bulma - stemmer
from nltk.stem import PorterStemmer
st = PorterStemmer()
train['tweet'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

0        father dysfunct selfish drag kid dysfunct run
1    thank lyft credit cant use caus dont offer whe...
2                                       bihday majesti
3    model take time urãâÿââ ãâÿââãâÿââžãâÿââãâÿââã...
4                              factsguid societi motiv
Name: tweet, dtype: object

# Kok Bulma

In [29]:
from textblob import Word
train['tweet'] = train['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
train['tweet'].head()

0    father dysfunctional selfish drag kid dysfunct...
1    thanks lyft credit cant use cause dont offer w...
2                                       bihday majesty
3    model take time urãâÿââ ãâÿââãâÿââžãâÿââãâÿââã...
4                        factsguide society motivation
Name: tweet, dtype: object

# N-gram

In [31]:
print(train['tweet'][0])
TextBlob(train['tweet'][0]).ngrams(2)

father dysfunctional selfish drag kid dysfunction run


[WordList(['father', 'dysfunctional']),
 WordList(['dysfunctional', 'selfish']),
 WordList(['selfish', 'drag']),
 WordList(['drag', 'kid']),
 WordList(['kid', 'dysfunction']),
 WordList(['dysfunction', 'run'])]

### Term Frquency
> Term frequency is simply the ratio of the count of a word present in a sentence, to the length of the sentence.

In [34]:
tf1 = (train['tweet'][1:2]).apply(lambda x: pd.value_counts(x.split(" "))).sum(axis = 0).reset_index()
tf1.columns = ['words','tf']
tf1

Unnamed: 0,words,tf
0,getthanked,1
1,offer,1
2,pdx,1
3,use,1
4,wheelchair,1
5,van,1
6,credit,1
7,dont,1
8,cause,1
9,thanks,1


# Inverse Document Frequency
> The intuition behind inverse document frequency (IDF) is that a word is not of much use to us if it’s appearing in all the documents.

IDF = log(N/n), where, N is the total number of rows and n is the number of rows in which the word was present.

In [35]:
for i,word in enumerate(tf1['words']):
      tf1.loc[i, 'idf'] = np.log(train.shape[0]/(len(train[train['tweet'].str.contains(word)])))

tf1

Unnamed: 0,words,tf,idf
0,getthanked,1,8.903679
1,offer,1,6.418773
2,pdx,1,7.805067
3,use,1,3.523782
4,wheelchair,1,8.903679
5,van,1,5.377319
6,credit,1,7.805067
7,dont,1,3.661932
8,cause,1,5.571475
9,thanks,1,4.669573


In [36]:
tf1['tfidf'] = tf1['tf'] * tf1['idf']
tf1

Unnamed: 0,words,tf,idf,tfidf
0,getthanked,1,8.903679,8.903679
1,offer,1,6.418773,6.418773
2,pdx,1,7.805067,7.805067
3,use,1,3.523782,3.523782
4,wheelchair,1,8.903679,8.903679
5,van,1,5.377319,5.377319
6,credit,1,7.805067,7.805067
7,dont,1,3.661932,3.661932
8,cause,1,5.571475,5.571475
9,thanks,1,4.669573,4.669573


> We can see that the TF-IDF has penalized words like ‘don’t’, ‘can’t’, and ‘use’ because they are commonly occurring words. However, it has given a high weight to “disappointed” since that will be very useful in determining the sentiment of the tweet.

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word',
stop_words= 'english',ngram_range=(1,1))
train_vect = tfidf.fit_transform(train['tweet'])

train_vect

<7359x1000 sparse matrix of type '<class 'numpy.float64'>'
	with 26566 stored elements in Compressed Sparse Row format>

In [39]:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1,1),analyzer = "word")
train_bow = bow.fit_transform(train['tweet'])
train_bow

<7359x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 29761 stored elements in Compressed Sparse Row format>

# Sentiment Analysis
> Below, you can see that it returns a tuple representing polarity and subjectivity of each tweet. Here, we only extract polarity as it indicates the sentiment as value nearer to 1 means a positive sentiment and values nearer to -1 means a negative sentiment. 

In [40]:
train['tweet'][:5].apply(lambda x: TextBlob(x).sentiment)

0    (-0.3, 0.5354166666666667)
1                    (0.2, 0.2)
2                    (0.0, 0.0)
3                    (0.0, 0.0)
4                    (0.0, 0.0)
Name: tweet, dtype: object

In [41]:
train['sentiment'] = train['tweet'].apply(lambda x: TextBlob(x).sentiment[0] )
train[['tweet','sentiment']].head()

Unnamed: 0,tweet,sentiment
0,father dysfunctional selfish drag kid dysfunct...,-0.3
1,thanks lyft credit cant use cause dont offer w...,0.2
2,bihday majesty,0.0
3,model take time urãâÿââ ãâÿââãâÿââžãâÿââãâÿââã...,0.0
4,factsguide society motivation,0.0


# Word Embeddings
>  The underlying idea here is that similar words will have a minimum distance between their vectors.

Pre-trained word vectors
 - https://nlp.stanford.edu/projects/glove/
 
```
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)


from gensim.models import KeyedVectors # load the Stanford GloVe model
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)

model['go']
```

# Tweet Inceleme

Kaynak
 - https://www.geeksforgeeks.org/twitter-sentiment-analysis-using-python/

        

__Authentication__:
> 
In order to fetch tweets through Twitter API, one needs to register an App through their twitter account. Follow these steps for the same:

 - Open this link and click the button: ‘Create New App’
 - Fill the application details. You can leave the callback url field empty.
 - Once the app is created, you will be redirected to the app page.
 - Open the ‘Keys and Access Tokens’ tab.
 - Copy ‘Consumer Key’, ‘Consumer Secret’, ‘Access token’ and ‘Access Token Secret’.

In [4]:
import re
import tweepy
from tweepy import OAuthHandler
from textblob import TextBlob
 
class TwitterClient(object):
    '''
    Generic Twitter Class for sentiment analysis.
    '''
    def __init__(self):
        '''
        Class constructor or initialization method.
        '''
        # keys and tokens from the Twitter Dev Console
        consumer_key = 'kTjNJbRkm9FOlD7alNfqE5OVz'
        consumer_secret = 'mvpUe965F3sMSRiYGcRQj5kqnr5kMRVhsUweTIKxp7njqmqjNz'
        access_token = '14519511-fVpPVLuuGiWbrFyRKYzbZNxc05IQg141fQGbVZoHy'
        access_token_secret = 'YgbM4yIUCiZBttRJA9Jm3VNF5glINY08XPuIzxC7n6VFw'
        
        # attempt authentication
        try:
            # create OAuthHandler object
            self.auth = OAuthHandler(consumer_key, consumer_secret)
            # set access token and secret
            self.auth.set_access_token(access_token, access_token_secret)
            # create tweepy API object to fetch tweets
            self.api = tweepy.API(self.auth)
        except:
            print("Error: Authentication Failed")
 
    def clean_tweet(self, tweet):
        '''
        Utility function to clean tweet text by removing links, special characters
        using simple regex statements.
        '''
        return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())
 
    def get_tweet_sentiment(self, tweet):
        '''
        Utility function to classify sentiment of passed tweet
        using textblob's sentiment method
        '''
        # create TextBlob object of passed tweet text
        analysis = TextBlob(self.clean_tweet(tweet))
        # set sentiment
        if analysis.sentiment.polarity > 0:
            return 'positive'
        elif analysis.sentiment.polarity == 0:
            return 'neutral'
        else:
            return 'negative'
 
    def get_tweets(self, query, count = 10):
        '''
        Main function to fetch tweets and parse them.
        '''
        # empty list to store parsed tweets
        tweets = []
 
        try:
            # call twitter api to fetch tweets
            fetched_tweets = self.api.search(q = query, count = count)
 
            # parsing tweets one by one
            for tweet in fetched_tweets:
                # empty dictionary to store required params of a tweet
                parsed_tweet = {}
 
                # saving text of tweet
                parsed_tweet['text'] = tweet.text
                # saving sentiment of tweet
                parsed_tweet['sentiment'] = self.get_tweet_sentiment(tweet.text)
 
                # appending parsed tweet to tweets list
                if tweet.retweet_count > 0:
                    # if tweet has retweets, ensure that it is appended only once
                    if parsed_tweet not in tweets:
                        tweets.append(parsed_tweet)
                else:
                    tweets.append(parsed_tweet)
 
            # return parsed tweets
            return tweets
 
        except tweepy.TweepError as e:
            # print error (if any)
            print("Error : " + str(e))
 


In [5]:
def main():
    # creating object of TwitterClient Class
    api = TwitterClient()
    # calling function to get tweets
    tweets = api.get_tweets(query = 'Donald Trump', count = 200)
 
    # picking positive tweets from tweets
    ptweets = [tweet for tweet in tweets if tweet['sentiment'] == 'positive']
    # percentage of positive tweets
    print("Positive tweets percentage:")
    print(100*len(ptweets)/len(tweets))
    
    # picking negative tweets from tweets
    ntweets = [tweet for tweet in tweets if tweet['sentiment'] == 'negative']
    # percentage of negative tweets
    print("Negative tweets percentage:")
    print(100*len(ntweets)/len(tweets))
    
    # percentage of neutral tweets
    print("Neutral tweets percentage:")
    print(1 - 100*len(ntweets)/len(tweets) - 100*len(ptweets)/len(tweets))
 
    # printing first 5 positive tweets
    print("\n\nPositive tweets:")
    for tweet in ptweets[:10]:
        print(tweet['text'])
 
    # printing first 5 negative tweets
    print("\n\nNegative tweets:")
    for tweet in ntweets[:10]:
        print(tweet['text'])
 

In [6]:
main()

Positive tweets percentage:
46.57534246575342
Negative tweets percentage:
17.80821917808219
Neutral tweets percentage:
-63.38356164383561


Positive tweets:
RT @MillenPolitics: #PresidentialAlert — Donald Trump is still the President. 

Action is needed. Now. https://t.co/0Gv7gIfR3X
RT @PalmerReport: The next #PresidentialAlert had better be an announcement that Donald Trump has resigned.
RT @jonfavs: Yes, when rich people like Donald Trump commit tax fraud to become even richer, the real blame lies with the...tax code. https…
RT @PalmerReport: Donald Trump cheated his own father Fred Trump, cheated on all three of his wives, cheated on his taxes, cheated in the e…
RT @nytimes: How Times journalists uncovered the original source of the president’s wealth https://t.co/kBccM5h95D
RT @thehill: NYC Mayor Bill De Blasio: "The city of New York is looking to recoup any money that Donald Trump owes the people of New York C…
Jeff Flake says Trump mocking Christine Blasey Ford is 'kind of appall

# NLP Guide

Kaynak
 - https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/

In [1]:
from textblob.classifiers import NaiveBayesClassifier as NBC
from textblob import TextBlob
training_corpus = [
                   ('I am exhausted of this work.', 'Class_B'),
                   ("I can't cooperate with this", 'Class_B'),
                   ('He is my badest enemy!', 'Class_B'),
                   ('My management is poor.', 'Class_B'),
                   ('I love this burger.', 'Class_A'),
                   ('This is an brilliant place!', 'Class_A'),
                   ('I feel very good about these dates.', 'Class_A'),
                   ('This is my best work.', 'Class_A'),
                   ("What an awesome view", 'Class_A'),
                   ('I do not like this dish', 'Class_B')]
test_corpus = [
                ("I am not feeling well today.", 'Class_B'), 
                ("I feel brilliant!", 'Class_A'), 
                ('Gary is a friend of mine.', 'Class_A'), 
                ("I can't believe I'm doing this.", 'Class_B'), 
                ('The date was good.', 'Class_A'), ('I do not enjoy my job', 'Class_B')]

model = NBC(training_corpus) 
print(model.classify("Their codes are amazing."))

Class_A


In [2]:
print(model.classify("I don't like their computer."))

Class_B


In [3]:
print(model.accuracy(test_corpus))


0.8333333333333334


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics import classification_report
from sklearn import svm 

# preparing data for SVM model (using the same training_corpus, test_corpus from naive bayes example)
train_data = []
train_labels = []
for row in training_corpus:
    train_data.append(row[0])
    train_labels.append(row[1])

test_data = [] 
test_labels = [] 
for row in test_corpus:
    test_data.append(row[0]) 
    test_labels.append(row[1])

# Create feature vectors 
vectorizer = TfidfVectorizer(min_df=4, max_df=0.9)
# Train the feature vectors
train_vectors = vectorizer.fit_transform(train_data)
# Apply model on test data 
test_vectors = vectorizer.transform(test_data)

# Perform classification with SVM, kernel=linear 
model = svm.SVC(kernel='linear') 
model.fit(train_vectors, train_labels) 
prediction = model.predict(test_vectors)

print (classification_report(test_labels, prediction))

             precision    recall  f1-score   support

    Class_A       0.50      0.67      0.57         3
    Class_B       0.50      0.33      0.40         3

avg / total       0.50      0.50      0.49         6

