# Quickstart with coinmarketcap

This is a quickstart notebook for the [coinmarketcap dataset](https://www.kaggle.com/bizzyvinci/coinmarketcap-historical-data). It includes notes, EDA and plotting ideas. Good luck.

## Import libraries

In [None]:
import os
import sqlite3

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import bokeh

from pathlib import Path

## Reading dataset

The dataset is provided as csv and sqlite. The sqlite file helps if you want to run SQL queries.

In [None]:
data_dir = Path('../input/coinmarketcap-historical-data')

### Reading csv

In [None]:
# reading csv is straightforward
coins = pd.read_csv(data_dir/'coins.csv', parse_dates=['date_added', 'date_launched'])
historical = pd.read_csv(data_dir/'historical.csv', parse_dates=['date'])

In [None]:
print(coins.shape)
coins.head()

In [None]:
print(historical.shape)
historical.head()

### Reading sqlite

sqlite is great if you want a particular subset of the data.

In [None]:
con = sqlite3.connect(data_dir/'coinmarketcap.sqlite')
query = """
select h.date, h.cmc_rank, h.coin_id, c.name, c.symbol,
h.market_cap, h.price
from historical as h
left join coins as c
on h.coin_id = c.id
where symbol in ('BTC', 'ETH', 'BNB', 'DOGE')
"""

sql_df = pd.read_sql_query(query, con, parse_dates=['date'])
print(sql_df.shape)
sql_df.head()
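
If the list of symbols changes often, one option is to pass it as query parameters instead of editing the SQL string. Here's a rough sketch on a toy in-memory database: the rows and the `symbols` list are made up, and only the table and column names mirror the query above.

```python
import sqlite3
import pandas as pd

# Toy in-memory database that mimics the two tables (all values are made up)
con = sqlite3.connect(":memory:")
con.executescript("""
create table coins (id integer, name text, symbol text);
create table historical (date text, coin_id integer, price real);
insert into coins values (1, 'Bitcoin', 'BTC'), (74, 'Dogecoin', 'DOGE'), (2, 'Litecoin', 'LTC');
insert into historical values
    ('2021-01-01', 1, 29000.0), ('2021-01-01', 74, 0.005), ('2021-01-01', 2, 126.0);
""")

symbols = ["BTC", "DOGE"]  # hypothetical selection
placeholders = ", ".join("?" for _ in symbols)  # one "?" per symbol

query = f"""
select h.date, h.coin_id, c.name, c.symbol, h.price
from historical as h
left join coins as c on h.coin_id = c.id
where c.symbol in ({placeholders})
order by h.coin_id
"""

# sqlite3 fills each "?" with the matching entry of params
sql_df = pd.read_sql_query(query, con, params=symbols, parse_dates=["date"])
print(sql_df)
```

Only the placeholders are formatted into the string; the symbol values themselves travel through `params`.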

## Tables
There are 2 tables:
* coins
* historical

So let's perform a quick EDA.

### Coins
`coins` contains the metadata of each coin.

In [None]:
coins.info()

In [None]:
print(f"There are {coins.shape[0]} rows")
for x in ["id", "name", "slug", "symbol"]:
    print(f"There are {coins[x].nunique()} unique {x}")

As you can see, `id` is the only thing that's unique to a coin, which is why it is highly recommended to use `id`. Here's an example below.

In [None]:
#coins['name'].value_counts()[:5]
coins[coins['name']=='Swarm']
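
To list every name that is shared by more than one coin, `duplicated` with `keep=False` is one way to go. This is a small sketch on a toy frame (`coins_demo` and its rows are made up; only the column names match `coins`).

```python
import pandas as pd

# Toy stand-in for `coins` (rows are made up for illustration)
coins_demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Swarm", "Swarm", "Bitcoin", "Litecoin"],
    "symbol": ["SWM", "BZZ", "BTC", "LTC"],
})

# keep=False flags every member of a duplicated group, not only the repeats
dupes = coins_demo[coins_demo.duplicated("name", keep=False)]
print(dupes)
```

The same call with `"symbol"` or `"slug"` would show clashes in those columns.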

In [None]:
coins['status'].value_counts()

I don't really understand the difference between active, inactive and untracked. However, extinct is a status I added for coins that I couldn't find info about like the others. Extinct coins are like ghosts: there's no info about them, but they show up in historical.

In [None]:
coins[coins['status']=='extinct'].sample(5)

In [None]:
coins['category'].value_counts().plot.barh(title='Categories')
plt.show()

Other columns such as `description`, `subreddit`, `notice`, `platform_id`, `date_added` and `date_launched` are straightforward. Some others, such as `tags` and `website`, are straightforward too, but they hold comma-separated lists.


In [None]:
coins.iloc[:, 6:].sample(3)
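
The comma-separated columns can be turned into one row per item before counting. A minimal sketch, assuming the separator is a plain comma with optional spaces around it (the toy `coins_demo` rows are made up):

```python
import pandas as pd

# Toy stand-in for the `tags` column (values are made up)
coins_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "tags": ["mineable, pow", "defi,pow", None],
})

# Split on commas, put each tag on its own row, then trim stray spaces
tags = (
    coins_demo.set_index("id")["tags"]
    .dropna()
    .str.split(",")
    .explode()
    .str.strip()
)
tag_counts = tags.value_counts()
print(tag_counts)
```

Because `id` is kept as the index, the exploded series can be joined back to `coins` if needed.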

In [None]:
coins.isna().sum().sort_values().plot.barh(title='Column nans', figsize=(14,6))
plt.show()

### Historical
`historical` contains historical data including ranking, price and OHLCV. `time_high` and `time_low` are the only non-numeric columns here, as they are the times in `"%H:%M:%S"` format at which `high` and `low` were attained.

In [None]:
historical.describe(datetime_is_numeric=True)
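
If you need the full timestamp of a daily high, the time string can be added to the date. A quick sketch, assuming `date` holds each day at midnight (the toy `hist_demo` rows are made up):

```python
import pandas as pd

# Toy stand-in for `historical` (values are made up)
hist_demo = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    "time_high": ["13:45:10", "00:05:00"],
})

# "%H:%M:%S" strings parse as offsets from midnight, so they can be added to the date
hist_demo["high_at"] = hist_demo["date"] + pd.to_timedelta(hist_demo["time_high"])
print(hist_demo)
```

The same pattern would apply to `time_low`.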

## Plots

In [None]:
# Let's compare the price of BTC, ETH and DOGE for the year to date
# And it starts with knowing their ids
mask = (coins['symbol']=='BTC') | (coins['symbol']=='ETH') | (coins['symbol']=='DOGE')
coins[mask]

In [None]:
mask = (historical['coin_id']==1) | (historical['coin_id']==74) | (historical['coin_id']==1027)
df = historical[mask].merge(coins[['id', 'name', 'symbol']], left_on='coin_id', right_on='id', how='left')
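
With more than a handful of coins, chaining `|` gets long, and `isin` is one alternative. Here's a sketch on toy frames (`coins_demo`, `hist_demo` and `wanted` are made up for illustration). Since symbols are not unique in the real table, it's worth checking the looked-up ids before using them.

```python
import pandas as pd

# Toy stand-ins for `coins` and `historical` (values are made up)
coins_demo = pd.DataFrame({
    "id": [1, 74, 1027, 2],
    "symbol": ["BTC", "DOGE", "ETH", "LTC"],
})
hist_demo = pd.DataFrame({
    "coin_id": [1, 2, 74, 1027, 1],
    "price": [10.0, 20.0, 30.0, 40.0, 50.0],
})

wanted = ["BTC", "ETH", "DOGE"]  # hypothetical selection

# Look up the ids by symbol, then keep only the matching historical rows
ids = coins_demo.loc[coins_demo["symbol"].isin(wanted), "id"]
subset = hist_demo[hist_demo["coin_id"].isin(ids)]
print(sorted(ids))
print(subset)
```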

In [None]:
fig, ax = plt.subplots(2,3, figsize=(14,6))
for i,col in enumerate(['price', 'market_cap']):
    for j,(idx, name) in enumerate(zip(df['coin_id'].unique(), df['name'].unique())):
        data = df[(df['coin_id']==idx) & (df['date'].dt.year==2021)]
        ax[i,j].plot(data[col])
        ax[i,j].set_title(f'{name} {col} 2021')
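
Since the coins trade at very different price levels, one way to compare them on a single axis is to rebase each series to 100 at its first day in the window. This is only a sketch of the pandas step on toy data (`df_demo` is made up), and it assumes the rows are sorted by date within each coin.

```python
import pandas as pd

# Toy price history for two coins (values are made up)
df_demo = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02"] * 2),
    "coin_id": [1, 1, 74, 74],
    "price": [100.0, 150.0, 0.5, 1.0],
}).sort_values(["coin_id", "date"])

# Divide each price by that coin's first price in the window, times 100
first_price = df_demo.groupby("coin_id")["price"].transform("first")
df_demo["rebased"] = 100 * df_demo["price"] / first_price
print(df_demo)
```

Plotting `rebased` against `date` for each coin then puts every line on the same scale.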

It seems each coin took a different path in the same direction :D. Which coins would have given us the best return in 2021?

In [None]:
def add_coin_data(df):
    return df.merge(coins[['id', 'name', 'symbol']], left_on='coin_id', right_on='id', how='left')

In [None]:
df = add_coin_data(historical[historical['date'].dt.year==2021])
print(df.shape)
df.head()

In [None]:
# Select coins that are present in max date and set coin_id as index
data = df[df['date'] == df['date'].max()][['coin_id', 'name', 'symbol', 'price']].set_index('coin_id')
data['start_date'] = df.groupby('coin_id')['date'].min()[data.index]

# There must be a better way to get start_price, this is too damn slow
start_price = []
for coin_id, start_date in zip(data.index, data.start_date):
    start_price.append(df[(df['coin_id']==coin_id) & (df['date']==start_date)]['price'].squeeze())

data['start_price'] = start_price
data['return %'] = 100 * (data['price'] - data['start_price']) / data['start_price']

# sort by return
data = data.sort_values('return %', ascending=False)
data
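
One possible way to avoid the loop is to sort by date once and take the first and last row of each coin. This is a sketch on toy data, not a drop-in replacement: `df_demo` is made up, and it assumes there is at most one row per coin per date.

```python
import pandas as pd

# Toy stand-in for the 2021 slice of `historical` (values are made up)
df_demo = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-03", "2021-01-02", "2021-01-03", "2021-01-01"]
    ),
    "coin_id": [1, 1, 2, 2, 3],
    "price": [10.0, 15.0, 4.0, 2.0, 1.0],
})

last_day = df_demo["date"].max()
ordered = df_demo.sort_values("date")

# First and last row of each coin, indexed by coin_id
start = ordered.drop_duplicates("coin_id", keep="first").set_index("coin_id")
end = ordered.drop_duplicates("coin_id", keep="last").set_index("coin_id")

# Keep only coins that are still present on the last day
end = end[end["date"] == last_day]

summary = end[["price"]].copy()
summary["start_date"] = start["date"]    # aligns on the coin_id index
summary["start_price"] = start["price"]
summary["return %"] = (
    100 * (summary["price"] - summary["start_price"]) / summary["start_price"]
)
summary = summary.sort_values("return %", ascending=False)
print(summary)
```

Coin 3 drops out here because its last row is not on the last day, which matches the filter used in the cell above.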

Wow, these are coins I don't know. The best returns are infinity because those coins started from a price of 0, and the rows at the bottom are NaN, most likely because they started at 0 and are still worth 0.00000 (0/0) :D.

Where do BTC, ETH and DOGE rank?

In [None]:
# get_loc is 0-based, so add 1 to turn the position into a rank
print(f"BTC is in {data.index.get_loc(1) + 1}/{data.index.size}")
print(f"ETH is in {data.index.get_loc(1027) + 1}/{data.index.size}")
print(f"DOGE is in {data.index.get_loc(74) + 1}/{data.index.size}")

In [None]:
data.loc[[1,1027,74]]

In [None]:
data.describe()

Most coins have a negative return, so choose wisely.
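
The infinite and NaN returns can distort summary statistics, so one option is to set them aside before counting losers. A small sketch on a made-up series (`returns` only stands in for the `return %` column):

```python
import numpy as np
import pandas as pd

# Made-up returns, including the inf and NaN cases discussed above
returns = pd.Series([np.inf, 250.0, -10.0, -80.0, -100.0, np.nan], name="return %")

# Treat infinite returns as missing, then drop all missing values
finite = returns.replace([np.inf, -np.inf], np.nan).dropna()

# The mean of a boolean series is the share of True values
share_negative = (finite < 0).mean()
print(f"{share_negative:.0%} of coins with a finite return lost value")
print(finite.describe())
```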

### Bokeh

A lot more can be done with the analysis, especially when you add bokeh's interactivity.

In [None]:
from bokeh.plotting import figure
# Make Bokeh push output to Jupyter Notebook.
from bokeh.io import push_notebook, show, output_notebook
from bokeh.resources import INLINE
output_notebook(resources=INLINE)

mask = (historical['coin_id']==74) & (historical['date'].dt.year==2021)
x = historical[mask]['date'].values
y = historical[mask]['price'].values

p = figure(title="DOGE price in 2021", x_axis_label='Time', y_axis_label='Price',
           tools="pan,wheel_zoom,box_zoom,reset", plot_width=800, plot_height=300)
p.line(x,y, line_width=1)
show(p)

### Word Cloud

Let's check the word cloud for coin descriptions

In [None]:
# copied from https://www.geeksforgeeks.org/generating-word-cloud-python/
from wordcloud import WordCloud, STOPWORDS

stopwords = set(STOPWORDS)

# join all non-missing descriptions into one lowercase string
# (dropping NaN keeps str(NaN) from adding the word "nan" to the cloud)
comment_words = " ".join(coins['description'].dropna().astype(str).str.lower())
 
wordcloud = WordCloud(width = 800, height = 800,
                background_color ='white',
                stopwords = stopwords,
                min_font_size = 10).generate(comment_words)

In [None]:
# plot the WordCloud image                      
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
 
plt.show()

## Conclusion
I hope you find this notebook and dataset helpful. This is sponsored by BED coin, a coin that would help you and your wellbeing. It would take the stress out of your body and make you energetic the next morning. Use it every day and you'll keep the doctor and ageing a little more distant. BED coin is very different from and better than Btc, Eth, and Doge, although they inspire us.

FAQs

**How do you get BED coin?**:
There's a good chance you already have it in your wallet called BEDroom. But if not, you can get it from the nearest store (including e-stores on your phone).

**How do I use it?**:
You can lie on it, lay your children on it, your parents, your partner, but I wouldn't recommend strangers.

<br>

**If this kernel or [dataset](https://www.kaggle.com/bizzyvinci/coinmarketcap-historical-data) was helpful, drop an upvote. And thanks for reading till the end.**

