<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Application Programming Interface, Natural Language Processing, & Classification Modelling

### Contents:
#### Part 1 (of 3)
- [Executive Summary](#Executive-Summary)
- [Problem Statement](#Problem-Statement)
- [Background & Research](#Background-&-Research)
- [Data Collection](#Data-Collection)
- Data Wrangling
- Exploration & Visualisation
- Pre-Processing & Modelling
- Results & Analysis
- Recommendations & Conclusions
- References

## Executive Summary

A cryptocurrency is a digital currency that is secured by cryptography. Bitcoin and Ethereum are the two largest cryptocurrencies by market capitalization at the time of writing. While the two coins share some similarities, they also differ in many ways, and investors and traders who wish to learn more about them may find it difficult to grasp the terminology and jargon used in this field.

A natural language processing classifier will be trained on posts from the two respective subreddits. It will learn to identify keywords that are more commonly associated with, or unique to, each coin. This will help businesses and enterprises develop an efficient and sophisticated query-answering and routing algorithm for their online chatbot, so that it can handle the large number of enquiries they receive.

A total of 6 vectorizer-model combinations were evaluated. The vectorizers considered were Count Vectorizer and Tfidf Vectorizer. The models considered were Multinomial Naive Bayes, K-Nearest Neighbours, and Logistic Regression. Each combination was placed in a pipeline and then passed into a cross-validated grid search. The combinations were evaluated using 2 metrics: accuracy and area under the receiver operating characteristic curve (ROC AUC). Overall, Tfidf Vectorizer-Logistic Regression performed the best: on the testing dataset, it scored the highest accuracy (0.884) and the highest ROC AUC (0.950).

The top 3 words that predicted a post to be from the Bitcoin subreddit were ‘bitcoin’, ‘btc’, and ‘lightning’. The top 3 words that predicted a post to be from the Ethereum subreddit were ‘ethereum’, ‘eth’, and ‘gas’. With the top words that the classifier has found for each subreddit, a minimum viable product can be designed. In its implementation, the chatbot will pick up on the keywords in a user-submitted message and try to identify whether the query pertains to Bitcoin or Ethereum. It will then give a suitable answer or route the message to the right customer service representative for further handling.
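To make the keyword-routing idea concrete, here is a minimal sketch that uses only the six top words reported above. The function name `route_query`, the `KEYWORDS` table, and the tie-breaking policy are illustrative assumptions, not the project's implementation; the actual chatbot would presumably rely on the trained classifier rather than a fixed word list.

```python
import re

# Top predictive words reported in the summary above. A real deployment
# would likely use a longer, model-derived list (assumption).
KEYWORDS = {
    "Bitcoin": {"bitcoin", "btc", "lightning"},
    "Ethereum": {"ethereum", "eth", "gas"},
}


def route_query(message):
    """Return 'Bitcoin', 'Ethereum', or 'Unknown' based on keyword hits."""
    # Lowercase the message and split it into alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9]+", message.lower())

    # Count how many tokens match each coin's keyword set.
    scores = {
        coin: sum(token in words for token in tokens)
        for coin, words in KEYWORDS.items()
    }

    best = max(scores.values())
    winners = [coin for coin, score in scores.items() if score == best]

    # Hypothetical policy: no hits or a tie means the query is undecided
    # and should go to a human operative.
    if best == 0 or len(winners) > 1:
        return "Unknown"
    return winners[0]
```

For example, `route_query("How high are gas fees on ETH right now?")` counts two Ethereum keywords and no Bitcoin keywords, so the query would be routed to the Ethereum queue.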

## Problem Statement

This project aims to help businesses and enterprises in the cryptocurrency space (e.g. news providers, brokerage platforms, coin vaults, mining pools) develop an efficient and sophisticated query-answering and routing algorithm for their online chatbot. Such a chatbot would help handle the large number of enquiries received from site visitors on a daily basis and reduce the burden on customer service operatives. The goal is for the chatbot to accurately determine the nature of an enquiry and return an appropriate answer; where it is unable to do so, it will categorise the enquiry and route it to the relevant operative. As a first step, the chatbot needs to recognise keywords associated with two well-known cryptocurrencies (Bitcoin and Ethereum), with the help of a natural language processing classifier trained on posts from the two respective subreddits.

## Background & Research

A cryptocurrency is a digital currency that is secured by cryptography. It can be transferred between users or, in certain permitted situations, be exchanged for goods and services. These online transactions are verified and recorded on an online ledger, known as a blockchain, that is enforced by a decentralised network of computers. ([source](https://www.investopedia.com/terms/c/cryptocurrency.asp)) Bitcoin and Ethereum are the two largest cryptocurrencies by market capitalization at the time of writing. ([source](https://www.nerdwallet.com/article/investing/cryptocurrency-7-things-to-know)) Bitcoin has been around since 2009, whereas Ethereum was first introduced into circulation in 2015. ([source](https://bernardmarr.com/what-is-the-difference-between-bitcoin-and-ethereum/)) Interest in cryptocurrencies has skyrocketed in recent years, with total fund inflows into cryptocurrency products hitting USD 5.6 billion in 2020, a jump of more than 600% compared to 2019. ([source](https://www.reuters.com/article/us-crypto-currencies-flows-idUSKBN28V2OE)) Most people who are new to cryptocurrency would have heard of Bitcoin and Ethereum, as shown by the first question asked in this consumer insights survey on crypto assets. ([source](https://www.oecd.org/financial/education/consumer-insights-survey-on-cryptoassets.pdf)) While the top two coins share some similarities, they also differ in many ways. ([source](https://www.bloomberg.com/news/articles/2021-05-09/bitcoin-and-ethereum-how-are-they-different-quicktake)) Investors and traders who wish to know more about these two cryptocurrencies may find it difficult to grasp the terminology and jargon used in this field. A natural language processing classifier will be used to identify keywords that are more commonly associated with, or unique to, each coin, so as to clarify the distinctions between the two.

## Data Collection

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import requests
import time

# show all columns when displaying dataframes
pd.set_option('display.max_columns', None)

### Get Posts From Bitcoin Subreddit

In [2]:
# get posts from the bitcoin subreddit and place them in a dataframe
# the pushshift api returns at most 100 posts per request
# so getting the desired no of posts (5000) takes 50 requests
# the group agreed on a common starting timestamp of 1626939127
# so that all group members would start from the same post and get every post before that

url = 'https://api.pushshift.io/reddit/search/submission'

timestamp = 1626939127
repetitions = 50
pdf_btc = pd.DataFrame()

for repetition in range(repetitions):
    
    params = {'subreddit': 'Bitcoin', 'size': 100, 'before': timestamp}
    
    response = requests.get(url, params)
    print(f'Repetition {repetition + 1}: Status Code {response.status_code}')
    
    data = response.json()
    posts = data['data']
    
    df = pd.DataFrame(posts)
    pdf_btc = pd.concat([pdf_btc, df])
    timestamp = posts[-1]['created_utc']

# use this code block if the api scrape keeps getting interrupted by errors

# url = 'https://api.pushshift.io/reddit/search/submission'

# timestamp = 1626939127
# repetitions = 50
# pdf_btc = pd.DataFrame()
# count = 0

# while count < repetitions:
    
#     params = {'subreddit': 'Bitcoin', 'size': 100, 'before': timestamp}
    
#     try:
#         response = requests.get(url, params)
#         print(f'Repetition {count + 1}: Status Code {response.status_code}')
        
#         data = response.json()
#         posts = data['data']
        
#         df = pd.DataFrame(posts)
#         pdf_btc = pd.concat([pdf_btc, df])
#         timestamp = posts[-1]['created_utc']
#         count += 1
        
#     except Exception:
#         print('Error')
#         time.sleep(5)

Repetition 1: Status Code 200
Repetition 2: Status Code 200
Repetition 3: Status Code 200
Repetition 4: Status Code 200
Repetition 5: Status Code 200
Repetition 6: Status Code 200
Repetition 7: Status Code 200
Repetition 8: Status Code 200
Repetition 9: Status Code 200
Repetition 10: Status Code 200
Repetition 11: Status Code 200
Repetition 12: Status Code 200
Repetition 13: Status Code 200
Repetition 14: Status Code 200
Repetition 15: Status Code 200
Repetition 16: Status Code 200
Repetition 17: Status Code 200
Repetition 18: Status Code 200
Repetition 19: Status Code 200
Repetition 20: Status Code 200
Repetition 21: Status Code 200
Repetition 22: Status Code 200
Repetition 23: Status Code 200
Repetition 24: Status Code 200
Repetition 25: Status Code 200
Repetition 26: Status Code 200
Repetition 27: Status Code 200
Repetition 28: Status Code 200
Repetition 29: Status Code 200
Repetition 30: Status Code 200
Repetition 31: Status Code 200
Repetition 32: Status Code 200
Repetition 33: St

In [3]:
# reset index of dataframe
pdf_btc.reset_index(drop=True, inplace=True)

In [4]:
# check that index has been reset
pdf_btc.index

RangeIndex(start=0, stop=5000, step=1)

In [5]:
# export dataframe as csv file
pdf_btc.to_csv('../data/btc_posts.csv', index=False)
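
The paging loop used above can be sketched as a reusable function. It assumes, as the loop does, that each response lists posts newest first, so the last post's `created_utc` becomes the next `before` cursor. The names `collect_posts` and `fetch_page` are hypothetical: `fetch_page` stands in for whatever issues the request, which lets the paging logic be exercised without network access. Unlike the loop above, this sketch stops early when a page comes back empty instead of raising an `IndexError`.

```python
import time


def collect_posts(fetch_page, start_timestamp, pages, pause=0.0):
    """Page backwards in time and return all collected posts as a list.

    fetch_page: hypothetical callable that takes a `before` timestamp and
        returns a list of post dicts, newest first (assumption).
    start_timestamp: only posts created before this time are requested.
    pages: maximum number of requests to make.
    pause: seconds to wait between requests.
    """
    posts = []
    before = start_timestamp

    for _ in range(pages):
        batch = fetch_page(before)

        # An empty page means there are no older posts left to fetch.
        if not batch:
            break

        posts.extend(batch)

        # The oldest post in this batch becomes the cursor for the next request.
        before = batch[-1]["created_utc"]
        time.sleep(pause)

    return posts
```

In this notebook, `fetch_page` could be a small wrapper around the `requests.get(url, params)` call that returns `response.json()['data']`, and the returned list could be passed to `pd.DataFrame` in one step rather than concatenating a dataframe per request.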

### Get Posts From Ethereum Subreddit

In [6]:
# get posts from the ethereum subreddit and place them in a dataframe
# the pushshift api returns at most 100 posts per request
# so getting the desired no of posts (10000) takes 100 requests
# the group agreed on a common starting timestamp of 1626939643
# so that all group members would start from the same post and get every post before that

url = 'https://api.pushshift.io/reddit/search/submission'

timestamp = 1626939643
repetitions = 100
pdf_eth = pd.DataFrame()

for repetition in range(repetitions):
    
    params = {'subreddit': 'ethereum', 'size': 100, 'before': timestamp}
    
    response = requests.get(url, params)
    print(f'Repetition {repetition + 1}: Status Code {response.status_code}')
    
    data = response.json()
    posts = data['data']
    
    df = pd.DataFrame(posts)
    pdf_eth = pd.concat([pdf_eth, df])
    timestamp = posts[-1]['created_utc']

# use this code block if the api scrape keeps getting interrupted by errors

# url = 'https://api.pushshift.io/reddit/search/submission'

# timestamp = 1626939643
# repetitions = 100
# pdf_eth = pd.DataFrame()
# count = 0

# while count < repetitions:
    
#     params = {'subreddit': 'ethereum', 'size': 100, 'before': timestamp}
    
#     try:
#         response = requests.get(url, params)
#         print(f'Repetition {count + 1}: Status Code {response.status_code}')
        
#         data = response.json()
#         posts = data['data']
        
#         df = pd.DataFrame(posts)
#         pdf_eth = pd.concat([pdf_eth, df])
#         timestamp = posts[-1]['created_utc']
#         count += 1
        
#     except Exception:
#         print('Error')
#         time.sleep(5)

Repetition 1: Status Code 200
Repetition 2: Status Code 200
Repetition 3: Status Code 200
Repetition 4: Status Code 200
Repetition 5: Status Code 200
Repetition 6: Status Code 200
Repetition 7: Status Code 200
Repetition 8: Status Code 200
Repetition 9: Status Code 200
Repetition 10: Status Code 200
Repetition 11: Status Code 200
Repetition 12: Status Code 200
Repetition 13: Status Code 200
Repetition 14: Status Code 200
Repetition 15: Status Code 200
Repetition 16: Status Code 200
Repetition 17: Status Code 200
Repetition 18: Status Code 200
Repetition 19: Status Code 200
Repetition 20: Status Code 200
Repetition 21: Status Code 200
Repetition 22: Status Code 200
Repetition 23: Status Code 200
Repetition 24: Status Code 200
Repetition 25: Status Code 200
Repetition 26: Status Code 200
Repetition 27: Status Code 200
Repetition 28: Status Code 200
Repetition 29: Status Code 200
Repetition 30: Status Code 200
Repetition 31: Status Code 200
Repetition 32: Status Code 200
Repetition 33: St

In [7]:
# reset index of dataframe
pdf_eth.reset_index(drop=True, inplace=True)

In [8]:
# check that index has been reset
pdf_eth.index

RangeIndex(start=0, stop=10000, step=1)

In [9]:
# export dataframe as csv file
pdf_eth.to_csv('../data/eth_posts.csv', index=False)
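
The commented-out fallback loops in this section retry forever whenever a request fails. A bounded alternative might look like the sketch below. The helper name `call_with_retries` and its parameters are assumptions for illustration; the default 5-second wait mirrors the `time.sleep(5)` in the fallback code.

```python
import time


def call_with_retries(func, max_attempts=5, wait_seconds=5.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    func: zero-argument callable, e.g. a wrapper around one API request.
    sleep: injectable wait function, so the helper can be tested without waiting.
    """
    last_error = None

    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as error:
            # Catching Exception (not a bare except) keeps Ctrl+C working.
            last_error = error
            print(f"Attempt {attempt} failed: {error!r}")
            if attempt < max_attempts:
                sleep(wait_seconds)

    # All attempts failed: surface the last error instead of looping forever.
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Wrapping a single request in `call_with_retries` keeps the main loop simple and makes a persistent outage visible as an error rather than an endless stream of `Error` messages.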

### Backup: Get Comments From Bitcoin Subreddit [Not Used]

In [10]:
# get comments from the bitcoin subreddit and place them in a dataframe
# the pushshift api returns at most 100 comments per request
# so getting the desired no of comments (5000) takes 50 requests
# the group agreed on a common starting timestamp of 1626939127
# so that all group members would start from the same comment and get every comment before that

# url = 'https://api.pushshift.io/reddit/search/comment'

# timestamp = 1626939127
# repetitions = 50
# cdf_btc = pd.DataFrame()

# for repetition in range(repetitions):
    
#     params = {'subreddit': 'Bitcoin', 'size': 100, 'before': timestamp}
    
#     response = requests.get(url, params)
#     print(f'Repetition {repetition + 1}: Status Code {response.status_code}')
    
#     data = response.json()
#     posts = data['data']
    
#     df = pd.DataFrame(posts)
#     cdf_btc = pd.concat([cdf_btc, df])
#     timestamp = posts[-1]['created_utc']

In [11]:
# # reset index of dataframe
# cdf_btc.reset_index(drop=True, inplace=True)

In [12]:
# # check that index has been reset
# cdf_btc.index

In [13]:
# # export dataframe as csv file
# cdf_btc.to_csv('../data/btc_comments.csv', index=False)

### Backup: Get Comments From Ethereum Subreddit [Not Used]

In [14]:
# get comments from the ethereum subreddit and place them in a dataframe
# the pushshift api returns at most 100 comments per request
# so getting the desired no of comments (10000) takes 100 requests
# the group agreed on a common starting timestamp of 1626939643
# so that all group members would start from the same comment and get every comment before that

# url = 'https://api.pushshift.io/reddit/search/comment'

# timestamp = 1626939643
# repetitions = 100
# cdf_eth = pd.DataFrame()

# for repetition in range(repetitions):
    
#     params = {'subreddit': 'ethereum', 'size': 100, 'before': timestamp}
    
#     response = requests.get(url, params)
#     print(f'Repetition {repetition + 1}: Status Code {response.status_code}')
    
#     data = response.json()
#     posts = data['data']
    
#     df = pd.DataFrame(posts)
#     cdf_eth = pd.concat([cdf_eth, df])
#     timestamp = posts[-1]['created_utc']

In [15]:
# # reset index of dataframe
# cdf_eth.reset_index(drop=True, inplace=True)

In [16]:
# # check that index has been reset
# cdf_eth.index

In [17]:
# # export dataframe as csv file
# cdf_eth.to_csv('../data/eth_comments.csv', index=False)