# Unit 12 - Tales from the Crypto

---


## 1. Sentiment Analysis

Use the news api to pull the latest news articles for bitcoin and ethereum and create a DataFrame of sentiment scores for each coin. 

Use descriptive statistics to answer the following questions:
1. Which coin had the highest mean positive score?
2. Which coin had the highest negative score?
3. Which coin had the highest positive score?

In [18]:
# Initial imports
import os
import pandas as pd
from dotenv import load_dotenv
import nltk as nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer() 
from newsapi import NewsApiClient
%matplotlib inline

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\wazar\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [19]:
# Read your api key environment variable
load_dotenv()
api_key = os.getenv("NEWS_API_KEY")

In [21]:
# Create a newsapi client
newsapi = NewsApiClient(api_key=api_key)


In [23]:
# Fetch the Bitcoin news articles
bitcoin_news = newsapi.get_everything(
    q='bitcoin',
    language="en",
    sort_by="relevancy"
)
#bitcoin_news = newsapi.get_everything(q="Bitcoin", language="en")

bitcoin_news["totalResults"]



5463

In [24]:
# Fetch the Ethereum news articles
ethereum_news = newsapi.get_everything(q="Ethereum", language="en")
ethereum_news["totalResults"]

1340

In [26]:
# Create the Bitcoin sentiment scores DataFrame
def df_maker(news, language):
    articles = []
    for article in news:
        try:
            sentiment = analyzer.polarity_scores(article["description"])
            title = article["title"]
            description = article["description"]
            text = article["content"]
            date = article["publishedAt"][:10]
            
            articles.append({
                "Title": title,
                "Description": description,
                "Text": text,
                "Date": date,
                "Compound": sentiment["compound"],
                "Positive": sentiment["pos"],
                "Negative": sentiment["neg"],
                "Neutral": sentiment["neu"],
            })
            
            cols = ["Compound", "Negative", "Neutral", "Positive", "Text"]
        except AttributeError:
            pass
    
    return pd.DataFrame(articles)[cols]

bitcoin_df = df_maker(bitcoin_news["articles"],"en")
bitcoin_df.head()

Unnamed: 0,Compound,Negative,Neutral,Positive,Text
0,0.7506,0.0,0.811,0.189,After reaching a previous all-time high on Nov...
1,0.4019,0.0,0.93,0.07,Its been almost three years to the day since t...
2,0.34,0.0,0.876,0.124,Everything is dumb until it works.\r\nAs 2020 ...
3,0.0813,0.048,0.897,0.055,The government of India is considering an 18% ...
4,0.3612,0.074,0.767,0.158,Just weeks after it shattered its yearslong as...


In [27]:
# Create the Ethereum sentiment scores DataFrame
ethereum_df = df_maker(ethereum_news["articles"],"en")

ethereum_df.head()

Unnamed: 0,Compound,Negative,Neutral,Positive,Text
0,0.5267,0.0,0.888,0.112,The Securities and Exchange Commission plans t...
1,0.0772,0.071,0.874,0.055,Bitcoin was once derided by serious investors ...
2,0.0788,0.051,0.891,0.058,Cryptocurrencies stole headlines on the first ...
3,-0.4019,0.083,0.917,0.0,FILE PHOTO: A representation of virtual curren...
4,-0.4019,0.083,0.917,0.0,FILE PHOTO: Representations of virtual currenc...


In [28]:
# Describe the Bitcoin Sentiment
bitcoin_df.describe()

Unnamed: 0,Compound,Negative,Neutral,Positive
count,20.0,20.0,20.0,20.0
mean,0.221585,0.02745,0.8769,0.09555
std,0.323745,0.046389,0.081369,0.072699
min,-0.5859,0.0,0.746,0.0
25%,0.0,0.0,0.8075,0.04125
50%,0.34,0.0,0.886,0.1065
75%,0.4068,0.05175,0.9375,0.1505
max,0.7506,0.124,1.0,0.211


In [29]:
# Describe the Ethereum Sentiment
ethereum_df.describe()

Unnamed: 0,Compound,Negative,Neutral,Positive
count,20.0,20.0,20.0,20.0
mean,0.037055,0.03745,0.9112,0.05135
std,0.338136,0.047129,0.066676,0.06145
min,-0.4019,0.0,0.769,0.0
25%,-0.3182,0.0,0.87625,0.0
50%,0.0,0.0,0.917,0.042
75%,0.218425,0.07475,0.958,0.073
max,0.6369,0.162,1.0,0.197


### Questions:

Q: Which coin had the highest mean positive score?

A: 

Q: Which coin had the highest compound score?

A: 

Q. Which coin had the highest positive score?

A: 

---

## 2. Natural Language Processing
---
###   Tokenizer

In this section, you will use NLTK and Python to tokenize the text for each coin. Be sure to:
1. Lowercase each word
2. Remove Punctuation
3. Remove Stopwords

In [30]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from string import punctuation
import re

In [31]:
# Expand the default stopwords list if necessary
#not needed

In [32]:
# Complete the tokenizer function
def tokenizer(text):
    """Tokenizes text."""
    
    # Create a list of the words

    # Convert the words to lowercase
    
    # Remove the punctuation
    
    # Remove the stop words
    
    # Lemmatize Words into root words
def tokenizer(text):
    sw = set(stopwords.words('english'))
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', text)
    words = word_tokenize(re_clean)
    lem = [lemmatizer.lemmatize(word) for word in words]
    tokens = [word.lower() for word in lem if word.lower() not in sw]
    return tokens    
    

In [33]:
# Create a new tokens column for Bitcoin
bitcoin_df['Text'] = str(bitcoin_df['Text'])
btc_tokens = []
for i in bitcoin_df['Text']:
    tokenized_text = tokenizer(i)
    btc_tokens.append({'tokens':tokenized_text})

btc_tokens_df = pd.DataFrame(btc_tokens)
bitcoin_df['tokens'] = btc_tokens_df
bitcoin_df.head()

NameError: name 'lemmatizer' is not defined

In [34]:
# Create a new tokens column for Ethereum
ethereum_df['Text'] = str(ethereum_df['Text'])

eth_tokens = []
for i in ethereum_df['Text']:
    tokenized_text = tokenizer(i)
    eth_tokens.append({'tokens':tokenized_text})

eth_tokens_df = pd.DataFrame(eth_tokens)
ethereum_df['tokens'] = eth_tokens_df
ethereum_df.head()

NameError: name 'lemmatizer' is not defined

---

### NGrams and Frequency Analysis

In this section you will look at the ngrams and word frequency for each coin. 

1. Use NLTK to produce the n-grams for N = 2. 
2. List the top 10 words for each coin. 

In [35]:
from collections import Counter
from nltk import ngrams

In [36]:
def process_text(doc):
    sw = set(stopwords.words('english'))
    regex = re.compile("[^a-zA-Z ]")
    re_clean = regex.sub('', doc)
    words = word_tokenize(re_clean)
    lemmatizer = WordNetLemmatizer()
    lem = [lemmatizer.lemmatize(word) for word in words]
    output = [word.lower() for word in lem if word.lower() not in sw]
    return output

def bigram_counter(corpus): 
    big_string = " ".join(corpus)
    processed = process_text(big_string)
    bigrams = ngrams(processed, n=2)
    top_15 = dict(Counter(bigrams).most_common(15))
    return pd.DataFrame(list(top_15.items()), columns=['word','count'])

# Generate the Bitcoin N-grams where N=2
corpus = bitcoin_df["Text"]
bigram_counter(corpus)

Unnamed: 0,word,count
0,"(reuters, staffrnfile)",40
1,"(staffrnfile, photo)",40
2,"(reuters, staffrnlondon)",40
3,"(reaching, previous)",20
4,"(previous, alltime)",20
5,"(alltime, high)",20
6,"(high, nov)",20
7,"(nov, almost)",20
8,"(almost, three)",20
9,"(three, year)",20


In [37]:
# Generate the Ethereum N-grams where N=2
corpus = ethereum_df["Text"]
bigram_counter(corpus)

Unnamed: 0,word,count
0,"(file, photo)",80
1,"(new, york)",80
2,"(york, reuters)",80
3,"(photo, representation)",40
4,"(representation, virtual)",40
5,"(virtual, curren)",40
6,"(photo, representations)",40
7,"(representations, virtual)",40
8,"(virtual, currenc)",40
9,"(reuters, institutional)",40


In [38]:
# Use the token_count function to generate the top 10 words from each coin
def token_count(tokens, N=3):
    """Returns the top N tokens from the frequency count"""
    return Counter(tokens).most_common(N)

In [39]:
btc_words = []
for text in bitcoin_df['tokens']:
    for word in text:
        btc_words.append(word)
        
token_count(btc_words)# Get the top 10 words for Bitcoin


KeyError: 'tokens'

In [40]:
# Get the top 10 words for Ethereum
eth_words = []
for text in ethereum_df['tokens']:
    for word in text:
        eth_words.append(word)
        
token_count(eth_words)

KeyError: 'tokens'

---

### Word Clouds

In this section, you will generate word clouds for each coin to summarize the news for each coin

In [41]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = [20.0, 10.0]

ModuleNotFoundError: No module named 'wordcloud'

In [42]:
# Generate the Bitcoin word cloud
def stringmaker(words):
    big_string = " ".join(words)
    return big_string

btc_words = stringmaker(btc_words)
eth_words = stringmaker(eth_words)

btc_word_cloud = WordCloud(width=1200, height=800, max_words=30).generate(btc_words)

plt.imshow(btc_word_cloud)

NameError: name 'WordCloud' is not defined

In [43]:
# Generate the Ethereum word cloud
eth_word_cloud = WordCloud(width=1200, height=800, max_words=30).generate(eth_words)

plt.imshow(eth_word_cloud)

NameError: name 'WordCloud' is not defined

---
## 3. Named Entity Recognition

In this section, you will build a named entity recognition model for both Bitcoin and Ethereum, then visualize the tags using SpaCy.

In [44]:
import spacy
from spacy import displacy

In [45]:
# Optional - download a language model for SpaCy
!python -m spacy download en_core_web_sm

[+] Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [46]:
# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

---
### Bitcoin NER

In [47]:

articles = bitcoin_df["Text"].str.cat()
articles

# Run the NER processor on all of the text
# YOUR CODE HERE!
doc = nlp(articles)

# Add a title to the document
# YOUR CODE HERE!
doc.user_data['title'] = 'Bitcoin NER'

In [48]:
# Render the visualization
displacy.render(doc, style='ent')

In [49]:
# List all Entities
for ent in doc.ents:
    print (ent, ent.label_)

1 CARDINAL
2 CARDINAL
3 CARDINAL
India GPE
18% PERCENT
4 CARDINAL
5 CARDINAL
New York GPE
CNN ORG
6 CARDINAL
The Securities and Exchange Commission ORG
7 CARDINAL
a cent MONEY
8 CARDINAL
9      QUANTITY
Mexico GPE
10 CARDINAL
11 CARDINAL
Grayson Blackmon PERSON
12 CARDINAL
6 CARDINAL
6 CARDINAL
participating\r\nThe NORP
13 CARDINAL
Bitcoin GPE
14 CARDINAL
15 CARDINAL
16 CARDINAL
17 CARDINAL
Reuters Staff\r\nLONDON ORG
Dec 30 DATE
Reuters ORG
18 CARDINAL
19 CARDINAL
Reuters Staff\r\nLONDON ORG
Jan 4 DATE
Reuters ORG
1 CARDINAL
2 CARDINAL
3 CARDINAL
India GPE
18% PERCENT
4 CARDINAL
5 CARDINAL
New York GPE
CNN ORG
6 CARDINAL
The Securities and Exchange Commission ORG
7 CARDINAL
a cent MONEY
8 CARDINAL
9      QUANTITY
Mexico GPE
10 CARDINAL
11 CARDINAL
Grayson Blackmon PERSON
12 CARDINAL
6 CARDINAL
6 CARDINAL
participating\r\nThe NORP
13 CARDINAL
Bitcoin GPE
14 CARDINAL
15 CARDINAL
16 CARDINAL
17 CARDINAL
Reuters Staff\r\nLONDON ORG
Dec 30 DATE
Reuters ORG
18 CARDINAL
19 CARDINAL
Reuters S

---

### Ethereum NER

In [50]:
# Concatenate all of the Ethereum text together
articles = ethereum_df["Text"].str.cat()
articles

'0     The Securities and Exchange Commission plans t...\n1     Bitcoin was once derided by serious investors ...\n2     Cryptocurrencies stole headlines on the first ...\n3     FILE PHOTO: A representation of virtual curren...\n4     FILE PHOTO: Representations of virtual currenc...\n5     FILE PHOTO: Representations of virtual currenc...\n6     NEW YORK (Reuters) - Institutional investors p...\n7     NEW YORK (Reuters) - Institutional investors p...\n8     NEW YORK (Reuters) - Total investor inflows in...\n9     Bitcoin fizzled in Monday trading as the famou...\n10    Ethereum creator Vitalik Buterin.\\r\\n14 with 1...\n11    It seems only fitting to end 2020 on a depress...\n12    December\\r\\n15, 2020\\r\\n6 min read\\r\\nOpinions...\n13                                                 None\n14    Ethereum is one of the leading crypto projects...\n15    FILE PHOTO: A representation of virtual curren...\n16    LONDON (Reuters) - Bitcoin on Wednesday jumped...\n17    NEW YORK (Reuter

In [51]:
# Run the NER processor on all of the text
doc = nlp(articles)

# Add a title to the document
doc.user_data['title'] = 'Ethereum NER'

In [52]:
# Render the visualization
displacy.render(doc, style='ent')

In [53]:
# List all Entities
for ent in doc.ents:
    print (ent, ent.label_)

The Securities and Exchange Commission ORG
1 CARDINAL
Bitcoin GPE
2 CARDINAL
first ORDINAL
3 CARDINAL
4 CARDINAL
5 CARDINAL
6 CARDINAL
Reuters ORG
7 CARDINAL
Reuters ORG
8 CARDINAL
Reuters ORG
9 CARDINAL
Bitcoin GPE
Monday DATE
10 CARDINAL
Vitalik Buterin.\r\n14 PERSON
1 CARDINAL
11 CARDINAL
2020 DATE
12 CARDINAL
13 CARDINAL
14 CARDINAL
15 CARDINAL
16 CARDINAL
LONDON GPE
Reuters ORG
Bitcoin GPE
Wednesday DATE
17 CARDINAL
Reuters ORG
18 CARDINAL
19 CARDINAL
The Securities and Exchange Commission ORG
1 CARDINAL
Bitcoin GPE
2 CARDINAL
first ORDINAL
3 CARDINAL
4 CARDINAL
5 CARDINAL
6 CARDINAL
Reuters ORG
7 CARDINAL
Reuters ORG
8 CARDINAL
Reuters ORG
9 CARDINAL
Bitcoin GPE
Monday DATE
10 CARDINAL
Vitalik Buterin.\r\n14 PERSON
1 CARDINAL
11 CARDINAL
2020 DATE
12 CARDINAL
13 CARDINAL
14 CARDINAL
15 CARDINAL
16 CARDINAL
LONDON GPE
Reuters ORG
Bitcoin GPE
Wednesday DATE
17 CARDINAL
Reuters ORG
18 CARDINAL
19 CARDINAL
The Securities and Exchange Commission ORG
1 CARDINAL
Bitcoin GPE
2 CARDINAL
f

---