<img src="http://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# Python for Finance Basics

&copy; Dr. Yves J. Hilpisch | The Python Quants GmbH

http://tpq.io | [training@tpq.io](mailto:training@tpq.io) | [@dyjh](http://twitter.com/dyjh)

## `scikit-learn` package

In [None]:
!git clone https://github.com/tpq-classes/pff_basics.git
import sys
sys.path.append('pff_basics')


In [None]:
import numpy as np
import pandas as pd
from pylab import plt
np.set_printoptions(suppress=True)
plt.style.use('seaborn-v0_8')
%config InlineBackend.figure_format = 'svg'

## Supervised Learning

**Classification**

### Market Prediction

In [None]:
url = 'https://certificate.tpq.io/mlfin.csv'

In [None]:
raw = pd.read_csv(url, index_col=0, parse_dates=True)

In [None]:
raw.info()

In [None]:
sym = 'BTC='
# sym = 'EUR='

In [None]:
data = pd.DataFrame(raw[sym]).dropna()

In [None]:
data.plot();

In [None]:
data['r'] = data[sym].pct_change()

In [None]:
data['r'].plot();

In [None]:
data['r'].hist(bins=50);

### Binary Features

This section uses binary data only, for both the features and the labels.

The returns are transformed into binary features and labels by taking their sign: `+1` for an upward move and `-1` for a downward move.

Then we lag the "directional" data to get a number of historical lags as the features. The algorithm is then supposed to learn the future market direction (= labels) from the historical lags.
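
The lagging step can be sketched on a toy series. This is a minimal illustration, not the notebook's data: the return values and the names `toy` and `n_lags` are made up for the example, and only two lags are used to keep the output readable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy returns; the values are made up for illustration only.
r = pd.Series([0.01, -0.02, 0.03, 0.01, -0.01, 0.02], name='r')

toy = pd.DataFrame(r)
toy['d'] = np.sign(toy['r'])  # direction: +1.0 for up, -1.0 for down

n_lags = 2  # fewer lags than in the notebook, for readability
for lag in range(1, n_lags + 1):
    # lag_k at row t holds the direction observed at row t - k
    toy[f'lag_{lag}'] = toy['d'].shift(lag)

toy = toy.dropna()  # the first n_lags rows have incomplete features
print(toy)
```

Each remaining row pairs the label `d` (the direction at time t) with the directions of the two preceding periods, which is the structure the classifier is trained on.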

In [None]:
lags = 5

In [None]:
data['d'] = np.sign(data['r'])

In [None]:
cols = list()
for lag in range(1, lags + 1):
    col = f'lag_{lag}'
    data[col] = data['d'].shift(lag)
    cols.append(col)

In [None]:
data.head(8)

In [None]:
data.dropna(inplace=True)

In [None]:
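2 ** 2  # number of distinct up/down patterns with two binary lags

In [None]:
2 ** lags  # number of distinct up/down patterns with `lags` binary lags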
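
Reading `2 ** lags` as the number of distinct feature patterns is an interpretation of the cell above; under that reading, the patterns can be enumerated explicitly. The sketch below assumes every lag is either `-1` or `+1`. Since `np.sign` returns `0` for a zero return, real data may contain a third value, in which case the count would be larger.

```python
from itertools import product

n_lags = 5  # same value as `lags` in the notebook

# Every possible sequence of down (-1) / up (+1) moves over n_lags periods.
patterns = list(product([-1, 1], repeat=n_lags))

print(len(patterns))   # equals 2 ** n_lags
print(patterns[:3])    # first few patterns
```

With so few distinct feature vectors, a classifier on binary lags can at best learn the majority label for each pattern.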

#### Classification

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [None]:
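# each assignment overwrites the previous one, so only the last model (SVC) is used;
# comment out lines to try a different classifier
model = GaussianNB()
model = LogisticRegression()
model = DecisionTreeClassifier()
model = SVC(probability=True)  # probability=True enables predict_proba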

In [None]:
model.fit(data[cols], data['d'])

In [None]:
data['p'] = model.predict(data[cols])
data['p'].head()

In [None]:
model.predict_proba(data[cols])[:5]

In [None]:
data['p'].value_counts()

In [None]:
data['d'].value_counts()

In [None]:
accuracy_score(data['d'], data['p'])  # in-sample accuracy (same data as used for fitting)

### Floating Point Features

This section combines floating point features (the lagged returns themselves) with binary labels (the sign of the return).
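
A minimal, self-contained sketch of this setup on synthetic data follows. The pseudo-returns are random numbers, not market data, and the names `toy`, `feature_cols`, and `clf` as well as the seed are assumptions made for the example. `LogisticRegression` stands in for any of the classifiers used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(100)  # fixed seed, chosen arbitrarily

# Hypothetical data: normally distributed pseudo-returns, not market data.
toy = pd.DataFrame({'r': rng.normal(0, 0.02, 500)})
toy['d'] = np.sign(toy['r'])  # binary label: +1.0 or -1.0

n_lags = 5
feature_cols = []
for lag in range(1, n_lags + 1):
    col = f'lag_{lag}'
    toy[col] = toy['r'].shift(lag)  # real-valued feature: the lagged return
    feature_cols.append(col)
toy = toy.dropna()  # drop the first n_lags rows with incomplete features

clf = LogisticRegression()
clf.fit(toy[feature_cols], toy['d'])

# In-sample accuracy: predictions are made on the data used for fitting.
acc = accuracy_score(toy['d'], clf.predict(toy[feature_cols]))
print(f'in-sample accuracy: {acc:.3f}')
```

Unlike the binary case, the features now carry the magnitude of past returns as well as their direction, so the number of distinct feature vectors is no longer limited to a handful of patterns.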

In [None]:
cols = list()
for lag in range(1, lags + 1):
    col = f'lag_{lag}'
    data[col] = data['r'].shift(lag)
    cols.append(col)

In [None]:
data.head(8)

In [None]:
data.dropna(inplace=True)

In [None]:
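# each assignment overwrites the previous one, so only the last model (SVC) is used;
# comment out lines to try a different classifier
model = GaussianNB()
model = LogisticRegression()
model = DecisionTreeClassifier(max_depth=12)
model = SVC(kernel='rbf')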

In [None]:
model.fit(data[cols], data['d'])

In [None]:
data['p'] = model.predict(data[cols])
data['p'].head(5)

In [None]:
data['p'].value_counts()

In [None]:
data['d'].value_counts()

In [None]:
accuracy_score(data['d'], data['p'])  # in-sample accuracy (same data as used for fitting)
