# Unit 12 - Tales from the Crypto

---


## 1. Sentiment Analysis

Use the News API to pull the latest news articles for Bitcoin and Ethereum, and create a DataFrame of sentiment scores for each coin.

Use descriptive statistics to answer the following questions:
1. Which coin had the highest mean positive score?
2. Which coin had the highest compound score?
3. Which coin had the highest positive score?

In [4]:
# Initial imports
import os
import pandas as pd
from dotenv import load_dotenv
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from newsapi import NewsApiClient

# Download the VADER lexicon and create the sentiment analyzer
nltk.download('vader_lexicon')
analyzer = SentimentIntensityAnalyzer()

%matplotlib inline

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\richa\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [5]:
# Read your api key environment variable
load_dotenv("example.env")
api_key = os.getenv("News_API_KEY")

In [6]:
# Create a newsapi client
newsapi = NewsApiClient(api_key=api_key)

In [7]:
# Fetch the Bitcoin news articles
btc_articles = newsapi.get_everything(q='bitcoin', language='en', sort_by='relevancy')
btc_articles

{'status': 'ok',
 'totalResults': 4336,
 'articles': [{'source': {'id': 'engadget', 'name': 'Engadget'},
   'author': 'Nicole Lee',
   'title': 'Tampa teenager arrested for Twitter Bitcoin hack',
   'description': 'Authorities in Tampa, Florida have arrested a 17-year-old for being the alleged “mastermind” behind the Twitter Bitcoin hack that targeted several high-profile accounts on July 15th, 2020. His name has not been revealed due to his underage status. According t…',
   'url': 'https://www.engadget.com/teenager-arrested-twitter-bitcoin-hack-183302700.html',
   'urlToImage': 'https://o.aolcdn.com/images/dims?resize=1200%2C630&crop=1200%2C630%2C0%2C0&quality=95&image_uri=https%3A%2F%2Fs.yimg.com%2Fos%2Fcreatr-images%2F2020-07%2F80319ad0-c77f-11ea-adfe-d560f6400e1e&client=amp-blogside-v2&signature=3ae5e1a0ea67905f52a03c1a851c07fc1c61bdbb',
   'publishedAt': '2020-07-31T18:33:02Z',
   'content': 'Authorities in Tampa, Florida have arrested a 17-year-old for being the alleged “masterm

In [8]:
# Fetch the Ethereum news articles
ethereum_articles = newsapi.get_everything(q='ethereum', language='en', sort_by='relevancy')
ethereum_articles

{'status': 'ok',
 'totalResults': 1263,
 'articles': [{'source': {'id': 'mashable', 'name': 'Mashable'},
   'author': 'Joseph Green',
   'title': 'Master blockchain with this cheap online course',
   'description': "TL;DR: The Mega Blockchain Mastery Bundle is on sale for £29.81 as of August 17, saving you 97% on list price.\n\nCash isn't necessarily king anymore. You've probably heard that cryptocurrency and blockchain technologies (which power things like Bitcoin and Eth…",
   'url': 'https://mashable.com/uk/shopping/aug-17-mega-blockchain-mastery-bundle/',
   'urlToImage': 'https://mondrian.mashable.com/2020%252F08%252F17%252F40%252F5fe1250a25cd46bca29df0fa3c2e813f.4539c.png%252F1200x630.png?signature=PeH92TKb8dmntUe5Zygr2icxP4o=',
   'publishedAt': '2020-08-17T04:00:00Z',
   'content': "TL;DR: The Mega Blockchain Mastery Bundle is on sale for £29.81 as of August 17, saving you 97% on list price.\r\nCash isn't necessarily king anymore. You've probably heard that cryptocurrency and b

In [10]:
# Create a function - Create the sentiment scores DataFrame
def get_sentiment_scores(article_type, key_word):
    sentiments = []
    for article in article_type["articles"]:
        try:
            text = article[key_word]
            date = article["publishedAt"][:10]
            sentiment = analyzer.polarity_scores(text)
            compound = sentiment["compound"]
            pos = sentiment["pos"]
            neu = sentiment["neu"]
            neg = sentiment["neg"]

            sentiments.append({"text": text,
                               "date": date,
                               "compound": compound,
                               "positive": pos,
                               "negative": neg,
                               "neutral": neu
                               })
        except AttributeError:
            pass
    
    # Create DataFrame
    df = pd.DataFrame(sentiments)
    return df

In [12]:
# Create the Bitcoin sentiment scores DataFrame
btc_sentiment_df = get_sentiment_scores(btc_articles, 'content')
btc_sentiment_df

| | text | date | compound | positive | negative | neutral |
|---|---|---|---|---|---|---|
| 0 | Authorities in Tampa, Florida have arrested a ... | 2020-07-31 | -0.4767 | 0.0 | 0.094 | 0.906 |
| 1 | Casa, a Colorado-based provider of bitcoin sec... | 2020-08-06 | 0.5994 | 0.149 | 0.0 | 0.851 |
| 2 | On July 15, a Discord user with the handle Kir... | 2020-08-01 | -0.4019 | 0.0 | 0.074 | 0.926 |
| 3 | In April, the Secret Service seized 100 Bitcoi... | 2020-08-03 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | The question still remained, though, whether a... | 2020-08-06 | -0.0516 | 0.065 | 0.071 | 0.864 |
| 5 | A ransomware variant called NetWalker is doing... | 2020-08-04 | 0.5106 | 0.122 | 0.0 | 0.878 |
| 6 | Earlier this month a number of Twitter account... | 2020-07-31 | 0.6249 | 0.184 | 0.0 | 0.816 |
| 7 | Hillsborough State Attorney Andrew Warren anno... | 2020-07-31 | -0.6808 | 0.0 | 0.157 | 0.843 |
| 8 | “The COVID-19 pandemic has resulted in a mass ... | 2020-08-23 | 0.2732 | 0.063 | 0.0 | 0.937 |
| 9 | (CNN)A teenager in Tampa was arrested and char... | 2020-07-31 | -0.0258 | 0.118 | 0.12 | 0.762 |


In [13]:
# Create the Ethereum sentiment scores DataFrame
ethereum_sentiment_df = get_sentiment_scores(ethereum_articles, 'content')
ethereum_sentiment_df

| | text | date | compound | positive | negative | neutral |
|---|---|---|---|---|---|---|
| 0 | TL;DR: The Mega Blockchain Mastery Bundle is o... | 2020-08-17 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | LONDON (Reuters) - It sounds like a surefire b... | 2020-08-26 | 0.7579 | 0.181 | 0.0 | 0.819 |
| 2 | NEW YORK (Reuters) - Brooklyn-based technology... | 2020-08-25 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | An outspoken Bitcoin whale who rarely shows af... | 2020-08-19 | -0.2677 | 0.045 | 0.074 | 0.881 |
| 4 | `REUTERS/Rick Wilking\r\n<ul><li>Michael Novogr...` | 2020-08-14 | 0.34 | 0.072 | 0.0 | 0.928 |
| 5 | `August\r\n4, 2020\r\n5 min read\r\nOpinions ex...` | 2020-08-04 | 0.5423 | 0.123 | 0.0 | 0.877 |
| 6 | Ethereum is one of the most growing cryptocurr... | 2020-08-11 | 0.2484 | 0.057 | 0.0 | 0.943 |
| 7 | Ethereum 2.0s final and official public testne... | 2020-08-04 | 0.0 | 0.0 | 0.0 | 1.0 |
| 8 | On Aug. 2, the price of Ethereum peaked at $41... | 2020-08-02 | -0.2732 | 0.0 | 0.052 | 0.948 |
| 9 | usa one hundred dollar banknotes among the bin... | 2020-08-03 | 0.0 | 0.0 | 0.0 | 1.0 |


In [14]:
# Describe the Bitcoin Sentiment
btc_sentiment_df.describe()

| | compound | positive | negative | neutral |
|---|---|---|---|---|
| count | 20.0 | 20.0 | 20.0 | 20.0 |
| mean | 0.16628 | 0.06915 | 0.0353 | 0.89555 |
| std | 0.437014 | 0.054345 | 0.05445 | 0.052494 |
| min | -0.6808 | 0.0 | 0.0 | 0.762 |
| 25% | -0.0707 | 0.0 | 0.0 | 0.862 |
| 50% | 0.35 | 0.0815 | 0.0 | 0.91 |
| 75% | 0.507 | 0.09 | 0.07175 | 0.92625 |
| max | 0.6249 | 0.184 | 0.157 | 1.0 |


In [15]:
# Describe the Ethereum Sentiment
ethereum_sentiment_df.describe()

| | compound | positive | negative | neutral |
|---|---|---|---|---|
| count | 18.0 | 18.0 | 18.0 | 18.0 |
| mean | 0.094278 | 0.054611 | 0.029389 | 0.916 |
| std | 0.379953 | 0.061135 | 0.055994 | 0.085755 |
| min | -0.5994 | 0.0 | 0.0 | 0.732 |
| 25% | -0.077025 | 0.0 | 0.0 | 0.878 |
| 50% | 0.0 | 0.051 | 0.0 | 0.9295 |
| 75% | 0.33455 | 0.078 | 0.039 | 1.0 |
| max | 0.7717 | 0.181 | 0.189 | 1.0 |


### Questions:

Q: Which coin had the highest mean positive score?

A: Bitcoin - 0.069150

Q: Which coin had the highest compound score?

A: Ethereum - 0.771700

Q: Which coin had the highest positive score?

A: Bitcoin - 0.184000
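
The answers above can be cross-checked by comparing the summary rows for the two coins directly. The following is a minimal sketch that copies the relevant `mean` and `max` values from the two `describe()` outputs into a plain dictionary. The `summary` dictionary, its metric names, and the `leader` helper are illustrative names that do not appear in the notebook.

```python
# Values copied by hand from the describe() outputs shown earlier.
# The dictionary layout and metric names are assumptions made for this sketch.
summary = {
    "Bitcoin": {"mean_positive": 0.06915, "max_compound": 0.6249, "max_positive": 0.184},
    "Ethereum": {"mean_positive": 0.054611, "max_compound": 0.7717, "max_positive": 0.181},
}


def leader(metric):
    """Return the (coin, value) pair with the largest value for the given metric."""
    coin = max(summary, key=lambda name: summary[name][metric])
    return coin, summary[coin][metric]


for metric in ("mean_positive", "max_compound", "max_positive"):
    coin, value = leader(metric)
    print(f"{metric}: {coin} ({value})")
```

With the DataFrames in memory, the same comparison could be made on the `describe()` results themselves rather than on hand-copied numbers.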

---

## 2. Natural Language Processing
---
### Tokenizer

In this section, you will use NLTK and Python to tokenize the text for each coin. Be sure to:
1. Lowercase each word
2. Remove Punctuation
3. Remove Stopwords
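
Before bringing in NLTK, the three steps can be sketched with the standard library alone. This is only an illustration of the idea: the `STOPWORDS` set is a tiny hypothetical stand-in for NLTK's English stopword list, `simple_tokenize` is a made-up name, and whitespace splitting replaces `word_tokenize`.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# the notebook uses NLTK's full English stopword list instead.
STOPWORDS = {"the", "a", "an", "is", "of", "for", "and", "to", "in"}


def simple_tokenize(text, stopwords=STOPWORDS):
    """Lowercase the text, replace non-letters with spaces, and drop stopwords."""
    lowered = text.lower()
    # Anything that is not a lowercase letter or a space becomes a space
    letters_only = re.sub(r"[^a-z ]", " ", lowered)
    return [word for word in letters_only.split() if word not in stopwords]


sample = "The price of Bitcoin is rising, and Ethereum follows!"
tokens = simple_tokenize(sample)
print(tokens)
```

Unlike the notebook's tokenizer, this sketch does not lemmatize, so `follows` and `follow` would be counted as different tokens.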

In [16]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from string import punctuation
import re

In [17]:
# Expand the default stopwords list if necessary
lemmatizer = WordNetLemmatizer()
sw = set(stopwords.words('english'))

In [12]:
# Complete the tokenizer function
def tokenizer(text):
    """Tokenizes text."""

    # Remove the punctuation (keep only letters and spaces)
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', text)

    # Convert to lowercase and create a list of the words
    words = word_tokenize(re_clean.lower())

    # Remove the stop words
    words = [word for word in words if word not in sw]

    # Lemmatize words into root words
    tokens = [lemmatizer.lemmatize(word) for word in words]

    return tokens

In [13]:
# Create a new tokens column for Bitcoin
btc_sentiment_df['tokens'] = btc_sentiment_df['text'].apply(tokenizer)
btc_sentiment_df

In [14]:
# Create a new tokens column for Ethereum
ethereum_sentiment_df['tokens'] = ethereum_sentiment_df['text'].apply(tokenizer)
ethereum_sentiment_df

---

### NGrams and Frequency Analysis

In this section, you will look at the n-grams and word frequency for each coin.

1. Use NLTK to produce the n-grams for N = 2. 
2. List the top 10 words for each coin. 
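
To make the two tasks concrete, here is a minimal standard-library sketch of what a bigram (N = 2) count and a top-word count produce. The `bigrams` helper and the `sample_tokens` list are hypothetical. The helper pairs each token with its successor, which is the same pairing `nltk.ngrams` yields for N = 2.

```python
from collections import Counter


def bigrams(tokens):
    """Pair each token with its successor: [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(tokens, tokens[1:]))


# Hypothetical token list standing in for one coin's combined tokens
sample_tokens = ["bitcoin", "price", "rises", "bitcoin", "price", "falls"]

# Count how often each adjacent pair of words occurs
bigram_counts = Counter(bigrams(sample_tokens))

# Count individual words and keep the two most frequent
top_words = Counter(sample_tokens).most_common(2)

print(bigram_counts.most_common(1))
print(top_words)
```

A list of five or more tokens always yields one fewer bigram than tokens, so very short articles contribute little to the bigram counts.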

In [15]:
from collections import Counter
from nltk import ngrams

In [20]:
def get_token(df):
    tokens = []
    for i in df['tokens']:
        tokens.extend(i)
    return tokens

btc_tokens = get_token(btc_sentiment_df)
ethereum_tokens = get_token(ethereum_sentiment_df)


In [16]:
# Generate the Bitcoin N-grams where N=2
bit_ngram = Counter(ngrams(btc_tokens, n=2))

In [17]:
# Generate the Ethereum N-grams where N=2
eth_ngram = Counter(ngrams(ethereum_tokens, n=2))

In [18]:
# Use the token_count function to generate the top 10 words from each coin
def token_count(tokens, N=10):
    """Returns the top N tokens from the frequency count"""
    return Counter(tokens).most_common(N)

In [19]:
# Get the top 10 words for Bitcoin
token_count(btc_tokens, 10)

In [20]:
# Get the top 10 words for Ethereum
token_count(ethereum_tokens, 10)

---

### Word Clouds

In this section, you will generate a word cloud for each coin to summarize its news coverage.

In [18]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [20.0, 10.0]

In [19]:
# Generate the Bitcoin word cloud
wc = WordCloud().generate(' '.join(btc_tokens))
plt.title("Bitcoin Word Cloud", fontsize = 50)
plt.imshow(wc)


In [23]:
# Generate the Ethereum word cloud
wc = WordCloud().generate(' '.join(ethereum_tokens))
plt.title("Ethereum Word Cloud", fontsize = 50)
plt.imshow(wc)

---
## 3. Named Entity Recognition

In this section, you will build a named entity recognition model for both Bitcoin and Ethereum, then visualize the tags using spaCy.
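
Once entities have been extracted, it can help to summarize them by label instead of reading a flat list. The sketch below assumes the entities have already been collected as `(text, label)` pairs, for example from `ent.text` and `ent.label_`. The `sample_entities` list is hypothetical illustration data, not actual spaCy output for these articles.

```python
from collections import Counter, defaultdict

# Hypothetical (text, label) pairs standing in for (ent.text, ent.label_);
# these are NOT real model output for the articles in this notebook.
sample_entities = [
    ("Tampa", "GPE"),
    ("Florida", "GPE"),
    ("Twitter", "ORG"),
    ("July 15th, 2020", "DATE"),
    ("Tampa", "GPE"),
]

# How many entity mentions carry each label
label_counts = Counter(label for _, label in sample_entities)

# Unique entity texts grouped under their label
by_label = defaultdict(set)
for text, label in sample_entities:
    by_label[label].add(text)

for label, count in label_counts.most_common():
    print(label, count, sorted(by_label[label]))
```

Grouping by label makes repeated mentions easy to spot, such as a place name that recurs across several articles.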

In [24]:
import spacy
from spacy import displacy

In [25]:
# Optional - download a language model for spaCy
# !python -m spacy download en_core_web_sm

In [26]:
# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

---
### Bitcoin NER

In [27]:
# Concatenate all of the Bitcoin text together
btc_content = ' '.join(btc_sentiment_df['text'])
btc_content

In [28]:
# Run the NER processor on all of the text
doc = nlp(btc_content)

# Add a title to the document
doc.user_data["title"] = "BITCOIN NER"

In [29]:
# Render the visualization
displacy.render(doc, style='ent')

In [30]:
# List all Entities
for ent in doc.ents:
    print('{} {}'.format(ent.text, ent.label_))

---

### Ethereum NER

In [31]:
# Concatenate all of the Ethereum text together
eth_content = ' '.join(ethereum_sentiment_df['text'])
eth_content

In [32]:
# Run the NER processor on all of the text
doc = nlp(eth_content)

# Add a title to the document
doc.user_data["title"] = "Ethereum NER"

In [33]:
# Render the visualization
displacy.render(doc, style='ent')

In [34]:
# List all Entities
for ent in doc.ents:
    print('{} {}'.format(ent.text, ent.label_))

---