## Bitcoin NLP Using Twitter API

Sentiment analysis on data collected from the Twitter API. What is the sentiment on a topic, based on chatter on Twitter?

Data:
Due to the API limits of a standard account, I can only access 100 tweets a day. Those tweets are collected, and a few fields are extracted from each: created_at, status_texts, favourites_count, retweet_count, followers_count.

Tools:
Using TextBlob, a natural language processing library for Python, the status_texts are analyzed for their polarity values and sentiments. The data is then visualized with plotly for descriptive analysis.

Result:
Overall, of the 100 most recent tweets I collected, just under 10% are negative, and the rest are split between neutral and positive, with neutral tweets slightly outnumbering positive ones. Neutral tweets tend to have a higher favourites_count than the other two categories, but positive tweets receive significantly more retweets. Negative tweets tend to have a high favourites_count but very few retweets. Perhaps everyone is pushing for a positive outlook on bitcoin, but realistically finds comfort in neutral, and perhaps even negative or critical, content.

In [3]:
import pickle
import os
import json
import pandas as pd

Accessing Twitter API

In [4]:
if not os.path.exists('secret_twitter_credentials.pkl'):
    Twitter={}
    #Twitter['Header']='POST /1.1/statuses/update.json?include_entities=true HTTP/1.1 \n Accept: */* \n Connection: close \n User-Agent: OAuth python v0.4.4 \n Content-Type: application/x-www-form-urlencoded'
    Twitter['Consumer Key'] = ''
    Twitter['Consumer Secret'] = ''
    Twitter['Access Token'] = ''
    Twitter['Access Token Secret'] = ''
    with open('secret_twitter_credentials.pkl','wb') as f:
        pickle.dump(Twitter, f)
else:
    # Use a context manager so the file handle is closed after loading
    with open('secret_twitter_credentials.pkl','rb') as f:
        Twitter = pickle.load(f)
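
As an alternative to pickling, the four values could be read from environment variables so that no secret is ever written into the notebook. This is only a minimal sketch and not part of the original notebook: the variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are hypothetical.

```python
import os

# Hypothetical environment variable names, mapped to the dictionary keys
# used by the notebook
ENV_NAMES = {
    'Consumer Key': 'TWITTER_CONSUMER_KEY',
    'Consumer Secret': 'TWITTER_CONSUMER_SECRET',
    'Access Token': 'TWITTER_ACCESS_TOKEN',
    'Access Token Secret': 'TWITTER_ACCESS_TOKEN_SECRET',
}

def load_credentials(env=None):
    """Build the credentials dict from environment variables.

    Raises KeyError naming the missing variables if any are unset or empty.
    """
    env = os.environ if env is None else env
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise KeyError('missing environment variables: ' + ', '.join(missing))
    return {key: env[name] for key, name in ENV_NAMES.items()}
```

The returned dictionary has the same keys as `Twitter`, so it could be passed to the authentication step unchanged.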

In [10]:
import twitter

# Read the credentials from the pickled dictionary instead of
# hard-coding secrets in the notebook
auth = twitter.oauth.OAuth(Twitter['Access Token'],
                           Twitter['Access Token Secret'],
                           Twitter['Consumer Key'],
                           Twitter['Consumer Secret'])

twitter_api = twitter.Twitter(auth=auth)

# Nothing to see by displaying twitter_api except that it's now a
# defined variable

print(twitter_api)

<twitter.api.Twitter object at 0x11c815cf8>


## Use textblob to build a simple sentiment analysis function

In [11]:
from textblob import TextBlob
import re
import nltk

Function to extract sentiments from polarity of a text

In [12]:
def get_sentiment (t):
    s = TextBlob(t)
    if s.sentiment.polarity > 0:
        return 'positive'
    elif s.sentiment.polarity == 0:
        return 'neutral'
    else:
        return 'negative'

def get_polarity (t):
    s = TextBlob(t)
    return s.sentiment.polarity
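
Both functions above run TextBlob on the same text. One possible refactoring, sketched here under assumptions, separates the labelling from the scoring so that the polarity is computed once and the label is derived from the number. The helper name `label_from_polarity` and its `eps` tolerance are hypothetical; with the default `eps=0.0` the labels match `get_sentiment`.

```python
def label_from_polarity(polarity, eps=0.0):
    """Map a polarity score in [-1.0, 1.0] to a sentiment label.

    Scores within eps of zero count as neutral; eps=0.0 reproduces the
    strict > 0 / == 0 / < 0 split.
    """
    if polarity > eps:
        return 'positive'
    if polarity < -eps:
        return 'negative'
    return 'neutral'

# Polarity values taken from the first rows of the dataframe in this notebook
for p in (0.268182, 0.0, -0.074242):
    print(p, label_from_polarity(p))
```

A non-zero `eps` would treat weakly scored tweets as neutral, which changes the class balance, so it is a modelling choice rather than a drop-in replacement.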

Access Twitter REST API for tweets

In [8]:
query = input("Enter topic \n")
number = int(input("Enter number of tweets \n"))  # input() returns a string

Enter topic 
bitcoin
Enter number of tweets 
100


In [13]:
result=twitter_api.search.tweets(q=query, result_type='recent', lang='en', count=number)
statuses = result['statuses']

In [115]:
#print(result)

In [14]:
# Check how many results were extracted
len(statuses)

100

In [117]:
#Sanity check
#print(statuses)

Remove duplicates

In [15]:
# Keep only the first status for each distinct tweet text
seen_texts = set()
filtered_statuses = []
for s in statuses:
    if s["text"] not in seen_texts:
        filtered_statuses.append(s)
        seen_texts.add(s["text"])
statuses = filtered_statuses

In [16]:
# Check if any duplicates were removed
len(filtered_statuses)

92

In [1]:
#Show tweet status
#[s['text'] for s in result['statuses']]

In [2]:
# Sanity Check
#print(json.dumps(statuses[0]['user']['favourites_count'], indent=1))
#print(json.dumps(statuses[0]['user']['followers_count'], indent=1))
#print(json.dumps(statuses[0]['retweet_count'], indent=1))
#print(json.dumps(statuses[0]['created_at'], indent=1))
#print(json.dumps(statuses[0], indent=1))

Extract favourites_count, followers_count, retweet_count, created_at and the tweet text from the search results and put them into lists

In [17]:
status_texts = [ status['text'] 
                 for status in statuses ]

screen_names = [ user_mention['screen_name'] 
                 for status in statuses
                     for user_mention in status['entities']['user_mentions'] ]

hashtags = [ hashtag['text'] 
             for status in statuses
                 for hashtag in status['entities']['hashtags'] ]

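# Note: status['user']['favourites_count'] is the number of tweets the
# author has liked, not the number of likes on this tweet (that field
# is status['favorite_count'])
favourites_count = [ status['user']['favourites_count']
                     for status in statuses ]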

followers_count = [status['user']['followers_count']
                      for status in statuses]

retweet_count = [ status['retweet_count'] 
                 for status in statuses ]

created_at = [ status['created_at'] 
                 for status in statuses ]
# Compute a collection of all words from all tweets
words = [ w 
          for t in status_texts 
              for w in t.split() ]

In [18]:
# Explore the results

print(json.dumps(status_texts[0], indent=1))
print(json.dumps(screen_names[0], indent=1)) 
print(json.dumps(hashtags[0], indent=1))
print(json.dumps(words[0], indent=1))
print(json.dumps(favourites_count[0], indent=1))
print(json.dumps(followers_count[0], indent=1))
print(json.dumps(retweet_count[0], indent=1))
print(json.dumps(created_at[0], indent=1))

"RT @KoniosOfficial: AIRDROP IS LIVE\nGet your free KON Token now: https://t.co/cQNhnrYkS9     \n#aidrop #ico #cryptocurrency #Crypto #TokenSa\u2026"
"KoniosOfficial"
"aidrop"
"RT"
14
19
3645
"Thu Apr 05 02:57:07 +0000 2018"


Remove links and special characters from the tweet text

In [19]:
def clean_tweet(tweet):
    '''
    Utility function to clean the text in a tweet by removing 
    mentions, links and special characters using regex.
    '''
    # Raw string so that \w, \/ and \S are passed to the regex engine unchanged
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())
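
To make the pattern easier to read, here is a small sketch that splits it into its three alternatives and runs it on a made-up tweet. The constant names (`MENTION`, `SPECIAL`, `LINK`) and the sample text are illustrative only; the pattern itself is the one used in `clean_tweet`.

```python
import re

# Same pattern as clean_tweet, split into its three alternatives
MENTION = r"(@[A-Za-z0-9]+)"      # @username (letters and digits only)
SPECIAL = r"([^0-9A-Za-z \t])"    # anything that is not alphanumeric, space or tab
LINK = r"(\w+:\/\/\S+)"           # scheme://... up to the next whitespace
PATTERN = re.compile("|".join([MENTION, SPECIAL, LINK]))

def clean_tweet(tweet):
    # Replace every match with a space, then collapse runs of whitespace
    return " ".join(PATTERN.sub(" ", tweet).split())

sample = "RT @user: check https://t.co/abc #btc!"   # made-up tweet
print(clean_tweet(sample))   # RT check btc
```

Because the mention pattern stops at the first character that is not a letter or digit, a handle containing an underscore is only partly removed: `@Mavro_coin` leaves `coin` behind.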

In [27]:
#Check one cleaned tweet (index 30)
print(clean_tweet(status_texts[30]))

Copy successful traders automatically with Bitcoin I copy Dividend667 Do you


In [28]:
#Sentiment of the tweet at index 30
print(get_sentiment(clean_tweet(status_texts[30])))

positive


Perform sentiment analysis on all tweets and store the sentiments and polarity values in lists

In [29]:
tweet_sentiment = []
for t in status_texts:
    tweet_sentiment.append(get_sentiment(clean_tweet(t)))
    
print(tweet_sentiment)

tweet_polarity = []
for t in status_texts:
    tweet_polarity.append(get_polarity(clean_tweet(t)))

print(tweet_polarity)

['positive', 'neutral', 'negative', 'neutral', 'neutral', 'positive', 'negative', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'negative', 'positive', 'neutral', 'neutral', 'positive', 'neutral', 'positive', 'neutral', 'neutral', 'positive', 'negative', 'negative', 'positive', 'neutral', 'neutral', 'neutral', 'neutral', 'positive', 'neutral', 'positive', 'neutral', 'neutral', 'neutral', 'positive', 'positive', 'neutral', 'negative', 'positive', 'neutral', 'negative', 'negative', 'positive', 'negative', 'positive', 'neutral', 'neutral', 'positive', 'neutral', 'positive', 'neutral', 'neutral', 'positive', 'positive', 'positive', 'neutral', 'neutral', 'positive', 'positive', 'positive', 'negative', 'positive', 'negative', 'positive', 'neutral', 'neutral', 'positive', 'positive', 'positive', 'negative', 'negative', 'positive', 'negative', 'neutral', 'positive', 'neutral', 'neutral', 'positive', 'neutral', 'positive', 'positive', 'positive', 'neutral', 'positive',

Construct a dataframe for the analysis

In [30]:
print('created_at: ', str(len(created_at)),
      'status_texts: ', str(len(status_texts)), 
      'hashtags: ' , str(len(hashtags)), 
      'favourites_count: ', str(len(favourites_count)), 
      'followers_count: ',  str(len(followers_count)),
      'retweet_count: ', str(len(retweet_count)),  
      'tweet_sentiment: ', str(len(tweet_sentiment)))

df = pd.DataFrame({'created_at':created_at, 'status_texts':status_texts, 'favourites_count':favourites_count, 'followers_count':followers_count, 'retweet_count':retweet_count, 'tweet_sentiment':tweet_sentiment, 'tweet_polarity':tweet_polarity })


created_at:  92 status_texts:  92 hashtags:  107 favourites_count:  92 followers_count:  92 retweet_count:  92 tweet_sentiment:  92
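
The length check shows 107 hashtags against 92 tweets. This is because `hashtags` is flattened across all tweets, so it cannot be added as a dataframe column. One way to keep hashtags aligned with tweets, sketched here, is to build one list per status. The sample `statuses` are made up and only mimic the `entities` layout used earlier in the notebook.

```python
# Made-up statuses with the same nesting as the search results
statuses = [
    {'text': 'first', 'entities': {'hashtags': [{'text': 'bitcoin'}, {'text': 'crypto'}]}},
    {'text': 'second', 'entities': {'hashtags': []}},
    {'text': 'third', 'entities': {'hashtags': [{'text': 'blockchain'}]}},
]

# One list of hashtag texts per tweet: always the same length as statuses
hashtags_per_tweet = [
    [hashtag['text'] for hashtag in status['entities']['hashtags']]
    for status in statuses
]

# The flattened form used in the notebook has one entry per hashtag instead
flat_hashtags = [h for tags in hashtags_per_tweet for h in tags]

print(len(statuses), len(hashtags_per_tweet), len(flat_hashtags))
```

A list like `hashtags_per_tweet` could then be passed to `pd.DataFrame` alongside the other columns.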


In [31]:
#Check the first 5 rows
print(df.head())

                       created_at  favourites_count  followers_count  \
0  Thu Apr 05 02:57:07 +0000 2018                14               19   
1  Thu Apr 05 02:57:06 +0000 2018              2945              795   
2  Thu Apr 05 02:57:04 +0000 2018              3499              240   
3  Thu Apr 05 02:57:02 +0000 2018             21275             1015   
4  Thu Apr 05 02:57:01 +0000 2018               284               17   

   retweet_count                                       status_texts  \
0           3645  RT @KoniosOfficial: AIRDROP IS LIVE\nGet your ...   
1             42  RT @Fisher85M: The #Blockchain Ecosystem {Info...   
2              0  Only a dramatic and rapid capitulation will fi...   
3              0  " " RAP " " " " MUSIC " "\n(((((PICK ONE))))))...   
4              1  RT @CryptoKirby: 🚀ALERT🚀\n\nToday's #Bitcoin S...   

   tweet_polarity tweet_sentiment  
0        0.268182        positive  
1        0.000000         neutral  
2       -0.074242        negativ

In [32]:
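#Perform some simple analysis: per-sentiment totals and tweet counts
# numeric_only=True keeps text columns out of the sum
print(df.groupby(['tweet_sentiment']).sum(numeric_only=True))
print(df.groupby(['tweet_sentiment'])['favourites_count'].count())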

                 favourites_count  followers_count  retweet_count  \
tweet_sentiment                                                     
negative                   475425            98380              3   
neutral                    335815           261045           7739   
positive                   491647           137091           8022   

                 tweet_polarity  
tweet_sentiment                  
negative              -3.203409  
neutral                0.000000  
positive              16.000173  
tweet_sentiment
negative    16
neutral     34
positive    42
Name: favourites_count, dtype: int64
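
The three groups have different sizes (16, 34 and 42 tweets), so the sums are not directly comparable. Below is a minimal sketch of the per-tweet averages, using only the totals and counts printed above. On the dataframe itself, `df.groupby('tweet_sentiment').mean(numeric_only=True)` should give the same figures.

```python
# Totals and tweet counts copied from the groupby output above
totals = {
    'negative': {'favourites_count': 475425, 'followers_count': 98380, 'retweet_count': 3},
    'neutral': {'favourites_count': 335815, 'followers_count': 261045, 'retweet_count': 7739},
    'positive': {'favourites_count': 491647, 'followers_count': 137091, 'retweet_count': 8022},
}
counts = {'negative': 16, 'neutral': 34, 'positive': 42}

# Per-tweet mean = group total / number of tweets in the group
means = {
    sentiment: {col: total / counts[sentiment] for col, total in cols.items()}
    for sentiment, cols in totals.items()
}

for sentiment, cols in means.items():
    print(sentiment, {col: round(value, 1) for col, value in cols.items()})
```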


Time to visualize the data collected and see if we can draw some insights!

In [39]:
# Import plotly
# Note: in plotly 4 and later, plotly.plotly moved to the separate
# chart_studio package (chart_studio.plotly)
import plotly.plotly as py
from plotly.offline import init_notebook_mode, iplot
from plotly.graph_objs import Scatter, Figure, Layout
import plotly.graph_objs as go
init_notebook_mode(connected=True)
%matplotlib inline

In [40]:
#Creating a simple pie chart from the sentiment counts in the dataframe
counts = df['tweet_sentiment'].value_counts()
labels = counts.index.tolist()
values = counts.values.tolist()

trace = go.Pie(labels=labels, values=values)

py.iplot([trace], filename='basic_pie_chart')

In [54]:
#Scatter plot
#import plotly.offline as off

# One color per sentiment group; name the groups explicitly instead of
# reading them from the first three rows of the dataframe
sentiments = ['positive', 'neutral', 'negative']
colors = ['blue', 'orange', 'green']

opts = [
    dict(target = s, value = dict(marker = dict(color = c)))
    for s, c in zip(sentiments, colors)
]

print(opts)

data = [dict(
  type = 'scatter',
  mode = 'markers',
  x = df['retweet_count'],
  y = df['favourites_count'],
  text = df['status_texts'],
  hoverinfo = 'text',
  opacity = 0.8,
  marker = dict(
      size = df['retweet_count'],
      sizemode = 'area',
  ),
  transforms = [
      dict(
        type = 'groupby',
        groups = df['tweet_sentiment'],
        styles = opts
    )]
)]

layout = dict(
    title = '<b>Sentiment Analysis on Bitcoin Related Twitter Chatter</b><br>favourites_count (log scale) vs retweet_count, marker size proportional to retweet_count',
    yaxis = dict(
        title='favourites_count',
        type = 'log'
    ),
    xaxis= dict(
        title= 'retweet_count',
        ticklen= 5,
        zeroline= False,
        gridwidth= 2,
    ),
)


iplot({'data': data, 'layout': layout}, validate=False)

[{'target': 'positive', 'value': {'marker': {'color': 'blue'}}}, {'target': 'neutral', 'value': {'marker': {'color': 'orange'}}}, {'target': 'negative', 'value': {'marker': {'color': 'green'}}}]


Lastly, just for fun, let's perform some word counts

In [150]:
from collections import Counter

for item in [words, hashtags]:
    c = Counter(item)
    print(c.most_common(10)) # top 10
    print()

[('RT', 44), ('to', 23), ('the', 22), ('Bitcoin', 21), ('and', 21), ('a', 21), ('of', 19), ('you', 14), ('is', 14), ('in', 12)]

[('bitcoin', 11), ('crypto', 10), ('Bitcoin', 10), ('blockchain', 10), ('cryptocurrency', 7), ('Cryptocurrency', 5), ('trading', 4), ('Blockchain', 3), ('Crypto', 3), ('Infographic', 3)]
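
The hashtag counts above treat `bitcoin` and `Bitcoin` as different tags. A small sketch of folding case variants together follows, using only the top-10 counts printed above. On the full list, the same effect could be had by lowercasing before counting, for example `Counter(h.lower() for h in hashtags)`.

```python
from collections import Counter

# Top-10 hashtag counts copied from the output above
hashtag_counts = Counter({
    'bitcoin': 11, 'crypto': 10, 'Bitcoin': 10, 'blockchain': 10,
    'cryptocurrency': 7, 'Cryptocurrency': 5, 'trading': 4,
    'Blockchain': 3, 'Crypto': 3, 'Infographic': 3,
})

# Fold case variants together by summing counts under the lowercase key
merged = Counter()
for tag, n in hashtag_counts.items():
    merged[tag.lower()] += n

print(merged.most_common(5))
```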



Remove stop words

In [152]:
from nltk.corpus import stopwords
import string

In [153]:
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via', 'RT']

In [154]:
from collections import Counter

clean_words = []
clean_hash = []
for w, h in zip(words, hashtags):
    if w not in stop:
        clean_words.append(w)
        clean_hash.append(h)

for item in [clean_words, clean_hash]:
    c = Counter(item)
    print(c.most_common(10)) # top 10
    print()

[('#Bitcoin', 3), ('bitcoin', 2), ('Bitcoin', 2), ('think', 2), ('is,', 1), ('#blockchain', 1), ('week?', 1), ('@MAVRO_COIN:', 1), ('@GoldenCurrency:', 1), ('Have', 1)]

[('bitcoin', 10), ('crypto', 9), ('Bitcoin', 8), ('blockchain', 8), ('Cryptocurrency', 4), ('cryptocurrency', 4), ('Crypto', 3), ('BlackPanther', 2), ('Mavro', 2), ('Blockchain', 2)]
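
Note that `zip(words, hashtags)` pairs each word with an unrelated hashtag and stops at the shorter list. The cell above therefore examines only as many words as there are hashtags, and keeps or drops each hashtag according to the word it happens to be paired with. A minimal sketch of filtering the two lists independently follows. The sample data and the small `stop` set are made up (the notebook builds `stop` from NLTK), and lowercasing before the lookup is an added assumption so that capitalized stop words such as `The` are dropped too.

```python
from collections import Counter

# Made-up stand-ins for the notebook's lists
words = ['RT', 'The', 'Bitcoin', 'is', 'up', 'bitcoin', 'to', 'the', 'moon']
hashtags = ['bitcoin', 'Crypto', 'crypto']
stop = {'rt', 'via', 'the', 'is', 'to', 'up'}   # hypothetical stop list

# Filter each list on its own elements
clean_words = [w for w in words if w.lower() not in stop]
clean_hash = [h for h in hashtags if h.lower() not in stop]

for item in [clean_words, clean_hash]:
    print(Counter(item).most_common(10))
```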



In [123]:
#print(terms_stop)

In [88]:
#Let's figure out which tweets are retweets
retweets = [
            # Store out a tuple of these three values ...
            (status['retweet_count'], 
             status['retweeted_status']['user']['screen_name'],
             status['text'].replace("\n","\\")) 
            
            # ... for each status ...
            for status in statuses 
            
            # ... so long as the status meets this condition.
                if 'retweeted_status' in status
           ]
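
A short sketch of how the `retweets` tuples could be ranked, relying on the fact that tuples sort by their first element (here the retweet count). The statuses below are made up and carry only the fields the comprehension reads.

```python
# Made-up statuses; only the second and third are retweets
statuses = [
    {'retweet_count': 0, 'text': 'plain tweet'},
    {'retweet_count': 42, 'text': 'RT @alice: hello\nworld',
     'retweeted_status': {'user': {'screen_name': 'alice'}}},
    {'retweet_count': 3645, 'text': 'RT @bob: airdrop',
     'retweeted_status': {'user': {'screen_name': 'bob'}}},
]

# Same comprehension as in the notebook cell
retweets = [
    (status['retweet_count'],
     status['retweeted_status']['user']['screen_name'],
     status['text'].replace("\n", "\\"))
    for status in statuses
    if 'retweeted_status' in status
]

# Highest retweet count first; keep the top 5
top_retweets = sorted(retweets, reverse=True)[:5]
for count, screen_name, text in top_retweets:
    print(count, screen_name, text)
```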

In [155]:
#print(retweets)