# Unit 12 - Tales from the Crypto

---


## 1. Sentiment Analysis

Use the [newsapi](https://newsapi.org/) to pull the latest news articles for Bitcoin and Ethereum and create a DataFrame of sentiment scores for each coin.

Use descriptive statistics to answer the following questions:
1. Which coin had the highest mean positive score?
2. Which coin had the highest negative score?
3. Which coin had the highest positive score?

In [17]:
# Initial imports
import os
import pandas as pd
from dotenv import load_dotenv
from datetime import datetime, timedelta
from newsapi import NewsApiClient
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

%matplotlib inline

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/rajaabhishek/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [20]:
# Read your api key environment variable
load_dotenv()
api_key = os.getenv("news_api")

In [21]:
# Create a newsapi client
newsapi = NewsApiClient(api_key=api_key)

In [22]:
# Define a function to fetch news articles for a given search term

def news_article(crypto):
    # Fetch English-language articles matching the term, most relevant first
    headlines = newsapi.get_everything(
        q=crypto,
        language="en",
        sort_by="relevancy"
    )
    return headlines

In [23]:
# Fetch the Bitcoin news articles
bitcoin_headlines = news_article("bitcoin")

# Print total articles
print(f"Total articles about bitcoin: {bitcoin_headlines['totalResults']}")

# Show sample article
bitcoin_headlines["articles"][0]

Total articles about bitcoin: 7480


{'source': {'id': None, 'name': 'Blogspot.com'},
 'author': 'noreply@blogger.com (Unknown)',
 'title': "Elon Musk reveals who bitcoin's creator Satoshi Nakamoto might be",
 'description': 'Musk.MARK RALSTON/AFP via Getty Images\r\nElon Musk seems to agree with many that hyper-secret cryptocurrency expert Nick Szabo could be Satoshi Nakamoto, the mysterious creator of the digital currency Bitcoin.\xa0"You can watch ideas evolve before Bitcoin was lau…',
 'url': 'https://techncruncher.blogspot.com/2021/12/elon-musk-reveals-who-bitcoins-creator.html',
 'urlToImage': 'https://blogger.googleusercontent.com/img/a/AVvXsEik_48hPzMzsDzwfdUeHj4jNGqYGevEuVjTTPkAGTu9bRN3oePxV6bxF897GK8Az3AaSqUOalYXNG4HSCy0fW5KUHuruCWP8hAfZxgrgbzh-dsbLM9jxyFGCthOZdBCa1dNkqk6mrVl0VtflsV2VvKXfGnwL6-68m-mxp7qHJuLlvqGIahZ9YDe5mt97w=w1200-h630-p-k-no-nu',
 'publishedAt': '2021-12-29T20:41:00Z',
 'content': 'Musk.MARK RALSTON/AFP via Getty Images\r\nElon Musk seems to agree with many that hyper-secret cryptocurrency expe

In [24]:
# Fetch the Ethereum news articles
ethereum_headlines = news_article("ethereum")

# Print total articles
print(f"Total articles about ethereum: {ethereum_headlines['totalResults']}")

# Show sample article
ethereum_headlines["articles"][0]

Total articles about ethereum: 3420


{'source': {'id': 'the-verge', 'name': 'The Verge'},
 'author': 'Corin Faife',
 'title': 'Crypto.com admits over $30 million stolen by hackers',
 'description': 'Cryptocurrency exchange Crypto.com has said that $15 million in ethereum and $18 million in bitcoin were stolen by hackers in a security breach',
 'url': 'https://www.theverge.com/2022/1/20/22892958/crypto-com-exchange-hack-bitcoin-ethereum-security',
 'urlToImage': 'https://cdn.vox-cdn.com/thumbor/mde_l3lUC4muDPEFG7LYrUz0O3g=/0x146:2040x1214/fit-in/1200x630/cdn.vox-cdn.com/uploads/chorus_asset/file/8921023/acastro_bitcoin_2.jpg',
 'publishedAt': '2022-01-20T13:23:31Z',
 'content': 'In a new blog post the company said that 4,836 ETH and 443 bitcoin were taken\r\nIllustration by Alex Castro / The Verge\r\nIn a blog post published in the early hours of Thursday morning, cryptocurrency… [+2004 chars]'}
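
The response is a plain dictionary, so the fields used in the rest of the notebook can be pulled out with ordinary indexing. A minimal sketch, using a hand-trimmed copy of the Ethereum sample above rather than a live API call; `summarize_article` is a hypothetical helper, not part of the newsapi client:

```python
# Hand-trimmed copy of the sample article shown above (not a live API call).
sample_response = {
    "totalResults": 3420,
    "articles": [
        {
            "source": {"id": "the-verge", "name": "The Verge"},
            "title": "Crypto.com admits over $30 million stolen by hackers",
            "publishedAt": "2022-01-20T13:23:31Z",
            "content": "In a new blog post the company said that 4,836 ETH and 443 bitcoin were taken",
        }
    ],
}


def summarize_article(article):
    """Hypothetical helper: keep only the fields the sentiment step needs."""
    return {
        "date": article["publishedAt"][:10],    # ISO timestamp -> YYYY-MM-DD
        "source": article["source"]["name"],
        "text": article.get("content") or "",   # guard against a missing or None content field
    }


summary = summarize_article(sample_response["articles"][0])
print(summary["date"], summary["source"])
print(len(sample_response["articles"]), "of", sample_response["totalResults"], "articles in this page")
```

Note that `totalResults` counts every match, while `articles` holds only the page that was returned, which is why the DataFrames below have 20 rows each.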

In [25]:
# Define function for sentiment score DataFrame
def sentiment_analysis(headline):
    """Build a DataFrame of VADER sentiment scores, one row per article."""
    sentiments = []

    for article in headline["articles"]:
        try:
            date = article["publishedAt"][:10]
            text = article["content"]
            sentiment = analyzer.polarity_scores(text)

            sentiments.append({
                "date": date,
                "text": text,
                "compound": sentiment["compound"],
                "positive": sentiment["pos"],
                "negative": sentiment["neg"],
                "neutral": sentiment["neu"]
            })
        except AttributeError:
            # Skip articles that cannot be scored (e.g. content is None)
            continue

    # Create the DataFrame once, after every article has been scored
    return pd.DataFrame(sentiments)

In [26]:
# Create the Bitcoin sentiment scores DataFrame
bitcoin_df = sentiment_analysis(bitcoin_headlines)

# Show sample data
bitcoin_df.head()

| | date | text | compound | positive | negative | neutral |
|---|---|---|---|---|---|---|
| 0 | 2021-12-29 | Musk.MARK RALSTON/AFP via Getty Images\r\nElon... | 0.3612 | 0.077 | 0.0 | 0.923 |
| 1 | 2022-01-12 | When Denis Rusinovich set up cryptocurrency mi... | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 2022-01-25 | El Salvador introduced Bitcoin as a legal tend... | 0.3182 | 0.105 | 0.0 | 0.895 |
| 3 | 2022-01-14 | Were officially building an open Bitcoin minin... | -0.4404 | 0.0 | 0.083 | 0.917 |
| 4 | 2022-01-20 | In a new blog post the company said that 4,836... | 0.0 | 0.0 | 0.0 | 1.0 |
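
The `compound` column is easier to read once it is bucketed into labels. A minimal sketch, assuming the commonly cited VADER convention of ±0.05 as the cut-off; `label_sentiment` is an illustrative helper, not something the notebook defines, and the values are the five Bitcoin compound scores shown above:

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to a coarse label (threshold is an assumption)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores of the first five Bitcoin articles, copied from the table above.
compounds = [0.3612, 0.0, 0.3182, -0.4404, 0.0]
labels = [label_sentiment(score) for score in compounds]
print(labels)  # ['positive', 'neutral', 'positive', 'negative', 'neutral']
```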


In [27]:
# Create the Ethereum sentiment scores DataFrame
ethereum_df = sentiment_analysis(ethereum_headlines)

# Show sample data
ethereum_df.head()

| | date | text | compound | positive | negative | neutral |
|---|---|---|---|---|---|---|
| 0 | 2022-01-20 | In a new blog post the company said that 4,836... | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 2022-01-19 | Hackers who made off with roughly $15 million ... | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 2022-01-20 | On some level, the new mayor is simply employi... | 0.1779 | 0.052 | 0.0 | 0.948 |
| 3 | 2022-01-21 | Back in September\r\n, Robinhood announced pla... | 0.0772 | 0.038 | 0.0 | 0.962 |
| 4 | 2022-01-20 | Trading platform Crypto.com lost about $34 mil... | -0.1027 | 0.056 | 0.067 | 0.877 |


In [28]:
# Describe the Bitcoin Sentiment
bitcoin_df.describe()

| | compound | positive | negative | neutral |
|---|---|---|---|---|
| count | 20.0 | 20.0 | 20.0 | 20.0 |
| mean | 0.025825 | 0.0487 | 0.03305 | 0.91825 |
| std | 0.366588 | 0.055739 | 0.041999 | 0.055389 |
| min | -0.4404 | 0.0 | 0.0 | 0.787 |
| 25% | -0.37415 | 0.0 | 0.0 | 0.90775 |
| 50% | 0.0 | 0.045 | 0.0 | 0.923 |
| 75% | 0.32895 | 0.074 | 0.0785 | 0.942 |
| max | 0.6808 | 0.185 | 0.101 | 1.0 |


In [29]:
# Describe the Ethereum Sentiment
ethereum_df.describe()

| | compound | positive | negative | neutral |
|---|---|---|---|---|
| count | 20.0 | 20.0 | 20.0 | 20.0 |
| mean | 0.164365 | 0.05305 | 0.0087 | 0.9382 |
| std | 0.272452 | 0.065873 | 0.021502 | 0.066563 |
| min | -0.1531 | 0.0 | 0.0 | 0.783 |
| 25% | 0.0 | 0.0 | 0.0 | 0.894 |
| 50% | 0.0 | 0.0395 | 0.0 | 0.951 |
| 75% | 0.4068 | 0.09025 | 0.0 | 1.0 |
| max | 0.7579 | 0.217 | 0.067 | 1.0 |


### Questions:

Q: Which coin had the highest mean positive score?

A: Ethereum. Its mean positive score (0.05305) is higher than Bitcoin's mean positive score (0.0487).

Q: Which coin had the highest compound score?

A: Ethereum. Its max compound score (0.7579) is higher than Bitcoin's max compound score (0.6808).

Q: Which coin had the highest positive score?

A: Ethereum. Its max positive score (0.217) is higher than Bitcoin's max positive score (0.185).
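
The comparisons behind these answers can be checked mechanically. A minimal sketch, with the values copied by hand from the two `describe()` outputs above; the `summary` dictionary and `leader` helper are illustrative names, not part of the notebook:

```python
# Values copied by hand from the describe() outputs above.
summary = {
    "bitcoin": {
        "mean positive": 0.0487,
        "max compound": 0.6808,
        "max positive": 0.185,
        "max negative": 0.101,
    },
    "ethereum": {
        "mean positive": 0.05305,
        "max compound": 0.7579,
        "max positive": 0.217,
        "max negative": 0.067,
    },
}


def leader(metric):
    """Return the coin with the larger value for the given metric."""
    return max(summary, key=lambda coin: summary[coin][metric])


leaders = {metric: leader(metric) for metric in summary["bitcoin"]}
for metric, coin in leaders.items():
    print(f"{metric}: {coin} ({summary[coin][metric]})")
```

With these figures Ethereum leads on all three positive and compound measures, while Bitcoin has the higher maximum negative score.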

## 2. Natural Language Processing
---
### Tokenizer

In this section, you will use NLTK and Python to tokenize the text for each coin. Be sure to:
1. Lowercase each word.
2. Remove Punctuation.
3. Remove Stopwords.
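
The three steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's tokenizer: `STOPWORDS` is a tiny hand-picked set standing in for NLTK's English list, `simple_tokens` is a hypothetical name, and no lemmatization is done. Punctuation is replaced with a space rather than deleted, so words joined by a period or line break stay separate.

```python
import re

# Tiny illustrative stopword set (an assumption; NLTK's English list is much longer).
STOPWORDS = {"a", "an", "the", "to", "with", "that", "of", "is", "in"}


def simple_tokens(text, stopwords=STOPWORDS):
    """Lowercase, strip punctuation, and drop stopwords."""
    lowered = text.lower()
    # Replace every run of non-letters with a single space so that
    # "musk.mark" becomes "musk mark" instead of "muskmark".
    letters_only = re.sub(r"[^a-z]+", " ", lowered)
    return [word for word in letters_only.split() if word not in stopwords]


sample = "Musk.MARK RALSTON/AFP via Getty Images\r\nElon Musk seems to agree"
tokens = simple_tokens(sample)
print(tokens)
```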

In [30]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from string import punctuation
import re

In [31]:
# Instantiate the lemmatizer
lemmatizer = WordNetLemmatizer()

# Create a list of stopwords
sw = set(stopwords.words('english'))
print(sw)
# Expand the default stopwords list if necessary
sw_add = set()  # no extra stopwords yet
sw = sw.union(sw_add)

{'own', 'd', 'out', 'only', "wasn't", 'had', 'does', 'but', 'at', 'ourselves', 'ma', 'them', 'during', "needn't", 'hers', 'is', 'before', 'more', 'up', 'or', 'her', 'no', 'wasn', 'i', 'of', "you'll", 'because', 'through', 'm', 'in', 'having', 't', 'nor', 'shouldn', 'his', 's', 'their', 'whom', "mustn't", 'by', 'now', 'isn', 'how', "aren't", 'wouldn', 'same', 'theirs', 'has', 'into', 'that', 'yourself', 'needn', 'any', 'about', 'there', "it's", 'very', "isn't", "you're", 'few', 'haven', 'aren', 'he', 'below', 'then', 'be', 'all', 'will', 'off', "wouldn't", 'over', 'these', 'my', 'won', "should've", 'can', 'll', 'this', 'hasn', 'with', 'under', "don't", 'mightn', 'ours', 'too', 'those', 'not', 'should', "didn't", 're', 'herself', 'yours', 'for', 'down', "haven't", 'was', 'myself', 'from', 'yourselves', 'what', 'again', 'so', 'did', 'are', "couldn't", 'you', 'a', 'above', 'am', 'some', 'and', 'between', 'than', 'your', 'being', "that'll", 'just', 'each', "hasn't", 'she', 'weren', 'mustn',

In [15]:
# Complete the tokenizer function
def tokenizer(text):
    """Tokenizes text."""
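    # Remove the punctuation from text. Characters are deleted rather than
    # replaced with a space, so "Musk.MARK" becomes the single token "MuskMARK".
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', text)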
    # Create a tokenized list of the words
    words = word_tokenize(re_clean)
    
    # Lemmatize words into root words
    lem = [lemmatizer.lemmatize(word) for word in words]
   
    # Convert the words to lowercase and remove stop words
    output = [word.lower() for word in lem if word.lower() not in sw]

    return output

In [45]:
# Create a function to get the tokens
def add_token(crypto):
    """Tokenize every article for the named coin ("bitcoin" or "ethereum")."""
    # Look up the coin's DataFrame by name, e.g. bitcoin_df
    df = globals()[f"{crypto}_df"]
    # One token list per article, in the same order as the DataFrame rows
    return {"tokens": [tokenizer(text) for text in df["text"]]}



{'tokens': [['muskmark',
   'ralstonafp',
   'via',
   'getty',
   'imageselon',
   'musk',
   'seems',
   'agree',
   'many',
   'hypersecret',
   'cryptocurrency',
   'expert',
   'nick',
   'szabo',
   'could',
   'satoshi',
   'nakamoto',
   'mysterious',
   'creator',
   'digital',
   'currency',
   'char'],
  ['denis',
   'rusinovich',
   'set',
   'cryptocurrency',
   'mining',
   'company',
   'maveric',
   'group',
   'kazakhstan',
   'thought',
   'hit',
   'jackpot',
   'next',
   'door',
   'china',
   'russia',
   'country',
   'everything',
   'bitcoin',
   'char'],
  ['el',
   'salvador',
   'introduced',
   'bitcoin',
   'legal',
   'tender',
   'alongside',
   'us',
   'dollar',
   'illustration',
   'alex',
   'castro',
   'verge',
   'international',
   'monetary',
   'funds',
   'executive',
   'board',
   'ha',
   'recommended',
   'el',
   'char'],
  ['officially',
   'building',
   'open',
   'bitcoin',
   'mining',
   'systemphoto',
   'joe',
   'raedlegetty',
 

In [47]:
# Create a new tokens column for Bitcoin

# Create token DataFrame using function add_token
bitcoin_token_df = pd.DataFrame(add_token('bitcoin'))

# Add token column into bitcoin DataFrame

bitcoin_df = bitcoin_df.join(bitcoin_token_df)
bitcoin_df.head()

| | date | text | compound | positive | negative | neutral | tokens |
|---|---|---|---|---|---|---|---|
| 0 | 2021-12-29 | Musk.MARK RALSTON/AFP via Getty Images\r\nElon... | 0.3612 | 0.077 | 0.0 | 0.923 | [muskmark, ralstonafp, via, getty, imageselon,... |
| 1 | 2022-01-12 | When Denis Rusinovich set up cryptocurrency mi... | 0.0 | 0.0 | 0.0 | 1.0 | [denis, rusinovich, set, cryptocurrency, minin... |
| 2 | 2022-01-25 | El Salvador introduced Bitcoin as a legal tend... | 0.3182 | 0.105 | 0.0 | 0.895 | [el, salvador, introduced, bitcoin, legal, ten... |
| 3 | 2022-01-14 | Were officially building an open Bitcoin minin... | -0.4404 | 0.0 | 0.083 | 0.917 | [officially, building, open, bitcoin, mining, ... |
| 4 | 2022-01-20 | In a new blog post the company said that 4,836... | 0.0 | 0.0 | 0.0 | 1.0 | [new, blog, post, company, said, eth, bitcoin,... |


In [48]:
# Create a new tokens column for Ethereum
# Create token DataFrame using function add_token
ethereum_token_df = pd.DataFrame(add_token('ethereum'))

# Add token column into ethereum DataFrame
ethereum_df = ethereum_df.join(ethereum_token_df)
ethereum_df.head()

| | date | text | compound | positive | negative | neutral | tokens |
|---|---|---|---|---|---|---|---|
| 0 | 2022-01-20 | In a new blog post the company said that 4,836... | 0.0 | 0.0 | 0.0 | 1.0 | [new, blog, post, company, said, eth, bitcoin,... |
| 1 | 2022-01-19 | Hackers who made off with roughly $15 million ... | 0.0 | 0.0 | 0.0 | 1.0 | [hackers, made, roughly, million, ethereum, cr... |
| 2 | 2022-01-20 | On some level, the new mayor is simply employi... | 0.1779 | 0.052 | 0.0 | 0.948 | [level, new, mayor, simply, employing, ageold,... |
| 3 | 2022-01-21 | Back in September\r\n, Robinhood announced pla... | 0.0772 | 0.038 | 0.0 | 0.962 | [back, september, robinhood, announced, plan, ... |
| 4 | 2022-01-20 | Trading platform Crypto.com lost about $34 mil... | -0.1027 | 0.056 | 0.067 | 0.877 | [trading, platform, cryptocom, lost, million, ... |


---

### NGrams and Frequency Analysis

In this section you will look at the ngrams and word frequency for each coin. 

1. Use NLTK to produce the n-grams for N = 2. 
2. List the top 10 words for each coin. 
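
Both steps reduce to counting. A minimal standard-library sketch of the idea, using a shortened, hand-picked token list loosely based on the first Ethereum article; `make_ngrams` is a hypothetical stand-in for `nltk.ngrams`, not the function the notebook uses:

```python
from collections import Counter


def make_ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens as tuples."""
    # zip over n shifted copies of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))


# Shortened, hand-picked token list for illustration only.
tokens = ["new", "blog", "post", "company", "said", "blog", "post", "published"]

bigram_counts = Counter(make_ngrams(tokens, 2))
top_words = Counter(tokens).most_common(2)

print(bigram_counts.most_common(1))  # [(('blog', 'post'), 2)]
print(top_words)
```

Counting per article, as the notebook's `bigram` function does, shows which pairs repeat inside one article; flattening all tokens first would instead show pairs that recur across articles.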

In [54]:
from collections import Counter
from nltk import ngrams

In [61]:
# Create a function to generate N-grams
def bigram(crypto, n=2):
    """Count the n-grams of each article for the named coin."""
    # Look up the coin's DataFrame by name, e.g. bitcoin_df
    df = globals()[f"{crypto}_df"]
    # One Counter per article, built from that article's token list
    return {"bigrams": [Counter(ngrams(tokens, n)) for tokens in df["tokens"]]}
    

In [62]:
# Generate the Bitcoin N-grams where N=2
print(bigram('bitcoin',2))

{'bigrams': [Counter({('muskmark', 'ralstonafp'): 1, ('ralstonafp', 'via'): 1, ('via', 'getty'): 1, ('getty', 'imageselon'): 1, ('imageselon', 'musk'): 1, ('musk', 'seems'): 1, ('seems', 'agree'): 1, ('agree', 'many'): 1, ('many', 'hypersecret'): 1, ('hypersecret', 'cryptocurrency'): 1, ('cryptocurrency', 'expert'): 1, ('expert', 'nick'): 1, ('nick', 'szabo'): 1, ('szabo', 'could'): 1, ('could', 'satoshi'): 1, ('satoshi', 'nakamoto'): 1, ('nakamoto', 'mysterious'): 1, ('mysterious', 'creator'): 1, ('creator', 'digital'): 1, ('digital', 'currency'): 1, ('currency', 'char'): 1}), Counter({('denis', 'rusinovich'): 1, ('rusinovich', 'set'): 1, ('set', 'cryptocurrency'): 1, ('cryptocurrency', 'mining'): 1, ('mining', 'company'): 1, ('company', 'maveric'): 1, ('maveric', 'group'): 1, ('group', 'kazakhstan'): 1, ('kazakhstan', 'thought'): 1, ('thought', 'hit'): 1, ('hit', 'jackpot'): 1, ('jackpot', 'next'): 1, ('next', 'door'): 1, ('door', 'china'): 1, ('china', 'russia'): 1, ('russia', 'coun

In [63]:
# Generate the Ethereum N-grams where N=2
print(bigram('ethereum',2))

{'bigrams': [Counter({('blog', 'post'): 2, ('new', 'blog'): 1, ('post', 'company'): 1, ('company', 'said'): 1, ('said', 'eth'): 1, ('eth', 'bitcoin'): 1, ('bitcoin', 'takenillustration'): 1, ('takenillustration', 'alex'): 1, ('alex', 'castro'): 1, ('castro', 'vergein'): 1, ('vergein', 'blog'): 1, ('post', 'published'): 1, ('published', 'early'): 1, ('early', 'hour'): 1, ('hour', 'thursday'): 1, ('thursday', 'morning'): 1, ('morning', 'cryptocurrency'): 1, ('cryptocurrency', 'char'): 1}), Counter({('hackers', 'made'): 1, ('made', 'roughly'): 1, ('roughly', 'million'): 1, ('million', 'ethereum'): 1, ('ethereum', 'cryptocom'): 1, ('cryptocom', 'attempting'): 1, ('attempting', 'launder'): 1, ('launder', 'fund'): 1, ('fund', 'socalled'): 1, ('socalled', 'ethereum'): 1, ('ethereum', 'mixer'): 1, ('mixer', 'known'): 1, ('known', 'tornado'): 1, ('tornado', 'cash'): 1, ('cash', 'according'): 1, ('according', 'new'): 1, ('new', 'report'): 1, ('report', 'char'): 1}), Counter({('level', 'new'): 1,

In [None]:
# Function token_count returns the top N words for a given coin
def token_count(tokens, N=3):
    """Returns the top N tokens from the frequency count"""
    return Counter(tokens).most_common(N)

In [None]:
# Use token_count to get the top 10 words for Bitcoin

# Flatten the per-article token lists into one list
token = [word for tokens in bitcoin_df["tokens"] for word in tokens]

token_count(token, N=10)


In [None]:
# Use token_count to get the top 10 words for Ethereum

# Flatten the per-article token lists into one list
token = [word for tokens in ethereum_df["tokens"] for word in tokens]

token_count(token, N=10)

---

### Word Clouds

In this section, you will generate a word cloud for each coin to summarize its news.

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Note: Matplotlib 3.6+ names this style 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [20.0, 10.0]

In [None]:
# Generate the Bitcoin word cloud
# Flatten the per-article token lists and join them into one string
token = [word for tokens in bitcoin_df["tokens"] for word in tokens]
output = ' '.join(token)

wc = WordCloud().generate(output)
plt.imshow(wc)

In [None]:
# Generate the Ethereum word cloud
# Flatten the per-article token lists and join them into one string
token = [word for tokens in ethereum_df["tokens"] for word in tokens]
output = ' '.join(token)

wc = WordCloud().generate(output)
plt.imshow(wc)

---
## 3. Named Entity Recognition

In this section, you will run named entity recognition on the Bitcoin and Ethereum text, then visualize the tags using spaCy.

In [None]:
import spacy
from spacy import displacy

In [None]:
# Download the language model for spaCy
# !python -m spacy download en_core_web_sm

In [None]:
# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

---
### Bitcoin NER

In [None]:
# Concatenate all of the Bitcoin text together
bitcoin_text = ' '.join(bitcoin_df["text"])

print(bitcoin_text)

In [None]:
# Run the NER processor on all of the text
doc = nlp(bitcoin_text)

# Add a title to the document
doc.user_data["title"] = "Bitcoin NER"

In [None]:
# Render the visualization
displacy.render(doc, style='ent')

In [None]:
# List all Entities
for ent in doc.ents:
    print(ent.text, ent.label_)

---

### Ethereum NER

In [None]:
# Concatenate all of the Ethereum text together
ethereum_text = ' '.join(ethereum_df["text"])

In [None]:
# Run the NER processor on all of the text
ethereum_doc = nlp(ethereum_text)

# Add a title to the document
ethereum_doc.user_data["title"] = "Ethereum NER"

In [None]:
# Render the visualization
displacy.render(ethereum_doc, style='ent')

In [None]:
# List all Entities
for ent in ethereum_doc.ents:
    print(ent.text, ent.label_)

---