## Bitcoin NLP Using Twitter API

Sentiment analysis using data collected from Twitter API. What is the sentiment on a topic based on chatters over Twitter? 

Data:
Due to API limitation of a standard account, I can only access 100 tweets a day. Those tweets are then collected, and extracted for some parameters, such as created_at, status_texts, favourites_count, retweet_count, follower_count.

Tools:
Using the Textblob library, a natural language process library in Python, the status_texts are analyzed for their polarity values and sentiments. The data is then visualized using plotly for descriptive analysis. 

Result:
Overall, out of the most recent 100 tweets I've collected, roughly under 10% are negative, and the rest are split between neutral and positive, with neural tweets slightly overpowering positive tweets. There is a trend of neutral and positive tweets that receives a higher favourite counts and higher retweets, but we see significant more retweet with positive tweets. Negative tweets tend to have a high favourite_counts but very little retweets. Perhaps everyone is pushing for a positive outlook on bitcoin, but realistically find comfort in the neutral, and perhaps even negative/critical contents. 

In [224]:
import pickle
import os
import json
import pandas as pd

Accessing Twitter API

In [225]:
if not os.path.exists('secret_twitter_credentials.pkl'):
    Twitter={}
    #Twitter['Header']='POST /1.1/statuses/update.json?include_entities=true HTTP/1.1 \n Accept: */* \n Connection: close \n User-Agent: OAuth python v0.4.4 \n Content-Type: application/x-www-form-urlencoded'
    Twitter['Consumer Key'] = ''
    Twitter['Consumer Secret'] = ''
    Twitter['Access Token'] = ''
    Twitter['Access Token Secret'] = ''
    with open('secret_twitter_credentials.pkl','wb') as f:
        pickle.dump(Twitter, f)
else:
    Twitter=pickle.load(open('secret_twitter_credentials.pkl','rb'))

In [226]:
import twitter

#auth = twitter.oauth.OAuth(Twitter['Access Token'],
#                           Twitter['Access Token Secret'],
#                           Twitter['Consumer Key'],
#                           Twitter['Consumer Secret'])

auth = twitter.oauth.OAuth('',
                           '',
                           '',
                           '')

twitter_api = twitter.Twitter(auth=auth)

# Nothing to see by displaying twitter_api except that it's now a
# defined variable

print(twitter_api)

<twitter.api.Twitter object at 0x11f956710>


## Use textblob to build a simple sentiment analysis function

In [227]:
from textblob import TextBlob
import re
import nltk

Function to extract sentiments from polarity of a text

In [228]:
def get_sentiment (t):
    s = TextBlob(t)
    if s.sentiment.polarity > 0:
        return 'positive'
    elif s.sentiment.polarity == 0:
        return 'neutral'
    else:
        return 'negative'

def get_polarity (t):
    s = TextBlob(t)
    return s.sentiment.polarity

Access Twitter REST API for tweets

In [305]:
query = input("Enter topic \n")
number = input("Enter number of tweets \n")

Enter topic 
crypto
Enter number of tweets 
100


In [306]:
result=twitter_api.search.tweets(q=query, result_type='mixed', lang='en', count=number)
statuses = result['statuses']

In [298]:
#print(result)

In [307]:
# Check how many results were extracted
len(statuses)

98

In [153]:
#Sanity check
#print(statuses)

Remove duplicates

In [308]:
all_text = []
filtered_statuses = []
for s in statuses:
    if not s["text"] in all_text:
        filtered_statuses.append(s)
        all_text.append(s["text"])
statuses = filtered_statuses  

In [309]:
# Check if any duplicates were removed
len(filtered_statuses)

82

In [156]:
#Show tweet status
#[s['text'] for s in result['statuses']]

In [157]:
# Sanity Check
#print(json.dumps(statuses[0]['user']['favourites_count'], indent=1))
#print(json.dumps(statuses[0]['user']['followers_count'], indent=1))
#print(json.dumps(statuses[0]['retweet_count'], indent=1))
#print(json.dumps(statuses[0]['created_at'], indent=1))
#print(json.dumps(statuses[0], indent=1))

Extract favourites_count, followers_count, retweet_count, created_at and tweet status from API searches and put them into arrays

In [310]:
status_texts = [ status['text'] 
                 for status in statuses ]

screen_names = [ user_mention['screen_name'] 
                 for status in statuses
                     for user_mention in status['entities']['user_mentions'] ]

hashtags = [ hashtag['text'] 
             for status in statuses
                 for hashtag in status['entities']['hashtags'] ]

favourites_count = [ status['user']['favourites_count']
        for status in statuses]

followers_count = [status['user']['followers_count']
                      for status in statuses]

retweet_count = [ status['retweet_count'] 
                 for status in statuses ]

created_at = [ status['created_at'] 
                 for status in statuses ]
# Compute a collection of all words from all tweets
words = [ w 
          for t in status_texts 
              for w in t.split() ]

In [311]:
# Explore the results

print(json.dumps(status_texts[0], indent=1))
print(json.dumps(screen_names[0], indent=1)) 
print(json.dumps(hashtags[0], indent=1))
print(json.dumps(words[0], indent=1))
print(json.dumps(favourites_count[0], indent=1))
print(json.dumps(followers_count[0], indent=1))
print(json.dumps(retweet_count[0], indent=1))
print(json.dumps(created_at[0], indent=1))

"George Soros is preparing to trade cryptocurrencies https://t.co/1tR9NGRUTb https://t.co/8QeX574txI"
"rickyquiroz43"
"crypto"
"George"
10
51838
181
"Sun Apr 08 20:01:34 +0000 2018"


Remove links in and special character in tweet status

In [312]:
def clean_tweet(tweet):
    '''
    Utility function to clean the text in a tweet by removing 
    links and special characters using regex.
    '''
    return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

In [313]:
#Check the first item
print(clean_tweet(status_texts[0]))

tweet_clean=[clean_tweet(t)
                 for t in status_texts]


George Soros is preparing to trade cryptocurrencies


In [314]:
#Sentiment on the first tweet
print(get_sentiment(clean_tweet(status_texts[0])))

neutral


Perform sentiment analysis on all data and store sentiments and polarity values into arrays

In [315]:
tweet_sentiment = []
for t in status_texts:
    tweet_sentiment.append(get_sentiment(clean_tweet(t)))
    
print(tweet_sentiment)

tweet_polarity = []
for t in status_texts:
    tweet_polarity.append(get_polarity(clean_tweet(t)))

print(tweet_polarity)

['neutral', 'neutral', 'neutral', 'positive', 'positive', 'neutral', 'neutral', 'positive', 'positive', 'negative', 'neutral', 'negative', 'neutral', 'positive', 'neutral', 'neutral', 'positive', 'negative', 'positive', 'neutral', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'positive', 'positive', 'negative', 'positive', 'positive', 'neutral', 'neutral', 'neutral', 'positive', 'positive', 'neutral', 'positive', 'neutral', 'neutral', 'neutral', 'positive', 'positive', 'positive', 'positive', 'positive', 'negative', 'positive', 'positive', 'neutral', 'neutral', 'positive', 'positive', 'positive', 'negative', 'positive', 'neutral', 'positive', 'neutral', 'positive', 'neutral', 'neutral', 'positive', 'positive', 'positive', 'neutral', 'positive', 'neutral', 'positive', 'positive', 'neutral', 'positive', 'neutral', 'neutral', 'positive', 'neutral']
[0.0, 0.0, 0.0, 0.75, 0.8, 0.0, 0.0, 0.1333333333

Construct a dataframe for the analysis

In [316]:
print('created_at: ', str(len(created_at)),
      'status_texts: ', str(len(status_texts)), 
      'hashtags: ' , str(len(hashtags)), 
      'favourites_count: ', str(len(favourites_count)), 
      'followers_count: ',  str(len(followers_count)),
      'retweet_count: ', str(len(retweet_count)),  
      'tweet_sentiment: ', str(len(tweet_sentiment)))

df = pd.DataFrame({'created_at':created_at, 'status_texts':tweet_clean, 'favourites_count':favourites_count, 'followers_count':followers_count, 'retweet_count':retweet_count, 'tweet_sentiment':tweet_sentiment, 'tweet_polarity':tweet_polarity })


created_at:  82 status_texts:  82 hashtags:  111 favourites_count:  82 followers_count:  82 retweet_count:  82 tweet_sentiment:  82


In [317]:
#Check the first 5 rows
print(df.head())

                       created_at  favourites_count  followers_count  \
0  Sun Apr 08 20:01:34 +0000 2018                10            51838   
1  Sat Apr 07 11:06:46 +0000 2018              1647           700848   
2  Sun Apr 08 11:45:04 +0000 2018                10            51838   
3  Mon Apr 09 02:11:01 +0000 2018              2176              624   
4  Mon Apr 09 02:11:01 +0000 2018               449               49   

   retweet_count                                       status_texts  \
0            181  George Soros is preparing to trade cryptocurre...   
1             86  How to Defend Yourself Against Pump and Dump C...   
2             85  Blockchain has the scope to reshape securities...   
3              1  RT XVG is forming a great cup and handle chart...   
4              0  rand look a show my car I m so happy mom I m d...   

   tweet_polarity tweet_sentiment  
0            0.00         neutral  
1            0.00         neutral  
2            0.00         neutra

In [318]:
#Perform some simple analysis 
print(df.groupby(['tweet_sentiment']).sum())
print(df.groupby(['tweet_sentiment'])['favourites_count'].count())

                 favourites_count  followers_count  retweet_count  \
tweet_sentiment                                                     
negative                    79763            28322             66   
neutral                     79203           910300           5151   
positive                   102003           119311          31387   

                 tweet_polarity  
tweet_sentiment                  
negative              -1.582639  
neutral                0.000000  
positive              14.031548  
tweet_sentiment
negative     6
neutral     34
positive    42
Name: favourites_count, dtype: int64


Time to visualize the data collected and see if we can draw some insights!

In [319]:
# Import plotly
import plotly.plotly as py
import plotly.offline as off
from plotly.graph_objs import Scatter, Figure, Layout
import plotly.graph_objs as go
off.init_notebook_mode(connected=True)
%matplotlib inline

In [320]:
#Creating a simple pie chart
labels = ['negative', 'positive','neutral']
print(df.groupby(['tweet_sentiment']).count())
positive_tweet_count = df.loc[df['tweet_sentiment'] == 'positive']['status_texts'].count()
neutral_tweet_count = df.loc[df['tweet_sentiment'] == 'neutral']['status_texts'].count()
negative_tweet_count = df.loc[df['tweet_sentiment'] == 'negative']['status_texts'].count()
values = [positive_tweet_count,neutral_tweet_count,negative_tweet_count]

trace = go.Pie(labels=labels, values=values)

py.iplot([trace], filename='basic_pie_chart')

                 created_at  favourites_count  followers_count  retweet_count  \
tweet_sentiment                                                                 
negative                  6                 6                6              6   
neutral                  34                34               34             34   
positive                 42                42               42             42   

                 status_texts  tweet_polarity  
tweet_sentiment                                
negative                    6               6  
neutral                    34              34  
positive                   42              42  


In [323]:
data =[dict(
        #type = 'scatter',
        mode = 'markers',
        x= df[df['tweet_sentiment']==sentiment]['favourites_count'],
        y=df[df['tweet_sentiment']==sentiment]['retweet_count'],
        name = sentiment, 
        #text=df['status_texts'],
        text=df['status_texts'],
        hoverinfo='text',
        opacity= 0.8,
        marker = dict(
          size = df['retweet_count'],
          sizemode = 'area',
      ),
    )for sentiment in ['positive', 'neutral', 'negative']]

layout = dict(
    title = '<b>Sentiment Analysis on Bitcoin Related Twitter Chatter</b><br>Logarithmic plot retweet_count vs. favourites_counts , size of the plot dependent on retweet_count',
    yaxis = dict(
        title='retweet_count',
        type = 'log',
    ),
    xaxis= dict(
        title= 'favourites_count',
        type = 'log',
        ticklen= 5,
        zeroline= False,
        gridwidth= 2,
    ),
)

fig = dict(data=data, layout=layout)


py.iplot(fig)

Lastly, just for fun, let's perfom some word counts

In [246]:
from collections import Counter

for item in [words, hashtags]:
    c = Counter(item)
    print(c.most_common()[:10]) # top 10
    print()

[('RT', 51), ('the', 36), ('to', 24), ('is', 22), ('a', 20), ('of', 16), ('I', 15), ('and', 15), ('in', 13), ('for', 13)]

[('Bitcoin', 13), ('blockchain', 9), ('bitcoin', 9), ('cryptocurrency', 8), ('Airdrop', 5), ('crypto', 4), ('ICO', 4), ('Earn_Money', 4), ('Satoshi', 4), ('hodl', 3)]



Remove stop words

In [247]:
from nltk.corpus import stopwords
import string

In [248]:
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via', 'RT']

In [249]:
from collections import Counter

clean_words = []
clean_hash = []
for w, h in zip(words, hashtags):
    if w not in stop:
        clean_words.append(w)
        clean_hash.append(h)

for item in [clean_words, clean_hash]:
    c = Counter(item)
    print(c.most_common()[:10]) # top 10
    print()

[('Bitcoin', 2), ('#Bitcoin', 2), ('unforgeable', 2), ('😂', 1), ('certificates', 1), ('uooMAG:', 1), ('#Bethereum', 1), ('https://t.co/Io8RgSiYUr', 1), ('sir', 1), ('#Ethereum', 1)]

[('Bitcoin', 9), ('blockchain', 6), ('bitcoin', 5), ('cryptocurrency', 4), ('crypto', 4), ('Airdrop', 4), ('ICO', 3), ('Earn_Money', 3), ('ethereum', 3), ('Satoshi', 3)]



In [123]:
#print(terms_stop)

In [250]:
#Let's figure out which tweets are retweets
retweets = [
            # Store out a tuple of these three values ...
            (status['retweet_count'], 
             status['retweeted_status']['user']['screen_name'],
             status['text'].replace("\n","\\")) 
            
            # ... for each status ...
            for status in statuses 
            
            # ... so long as the status meets this condition.
                if 'retweeted_status' in status
           ]

In [51]:
#print(retweets)

