In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import pandas as pd
import lib

### Load Data

Trading on [Xetra](https://www.xetra.com/xetra-en/trading/Trading-calendar-and-trading-hours-22048) runs Monday to Friday, from 09:00 to 17:30 CET.

Let's first load the data for the whole of 2019.

In [None]:
df = lib.read_date_range(start_date='2019-01-01', end_date='2019-12-31')
df.head()

../data/deutsche-boerse-xetra-pds/2019-04-19/*


In [None]:
df.shape

Let's see the type of securities on offer. 

In [None]:
df['SecurityType'].value_counts()

We're only interested in common stock. 

In [None]:
df = df[df.SecurityType == 'Common stock']

In [None]:
df.shape

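Next, we filter on time of day, keeping rows between 08:00 and 20:00 (both inclusive). `between_time` needs a datetime index, so we first move `CalcDateTime` into the index.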

In [None]:
df.set_index('CalcDateTime', drop=True, inplace=True)

In [None]:
df = df.between_time('08:00', '20:00')

In [None]:
df.shape
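To make the behaviour of the time filter concrete, here is a minimal sketch on toy data. The column names mirror the notebook, but the timestamps and volumes are made up for illustration.

```python
import pandas as pd

# Toy rows: column names follow the notebook, values are invented.
toy = pd.DataFrame(
    {
        "CalcDateTime": pd.to_datetime(
            [
                "2019-01-02 07:59",
                "2019-01-02 08:00",
                "2019-01-02 12:30",
                "2019-01-02 20:00",
                "2019-01-02 20:01",
            ]
        ),
        "TradedVolume": [10, 20, 30, 40, 50],
    }
)

# between_time only works on a DatetimeIndex, hence the set_index step.
toy = toy.set_index("CalcDateTime")

# Both endpoints are included by default, so 08:00 and 20:00 survive.
in_window = toy.between_time("08:00", "20:00")
print(in_window["TradedVolume"].tolist())  # [20, 30, 40]
```

Note that the filter looks only at the time of day, so it applies to every calendar date in the index at once.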

Finally, we remove all auction rows, i.e. rows where `TradedVolume` is 0.

In [None]:
df = df[df.TradedVolume > 0]

In [None]:
df.shape

Let's count the unique securities by grouping on mnemonic and description.

In [None]:
# numeric_only=True keeps string columns out of the aggregation
grouped_securites = df.groupby(['Mnemonic', 'SecurityDesc']).sum(numeric_only=True)

In [None]:
grouped_securites.shape

There are 946 unique securities. Let's sort them by volume traded. 

In [25]:
sorted_grouped_securites = grouped_securites.sort_values('TradedVolume', ascending=False)
sorted_grouped_securites.head(10)

Mnemonic,SecurityDesc,StartPrice,MaxPrice,MinPrice,EndPrice,TradedVolume,NumberOfTrades
DBK,DEUTSCHE BANK AG NA O.N.,596346.0,596734.6,595946.4,596339.0,3009862000.0,2160011.0
DTE,DT.TELEKOM AG NA,1096916.0,1097261.0,1096567.0,1096914.0,2133462000.0,1144762.0
SNH,"STEINHOFF INT.HLDG.EO-,50",1339.673,1341.808,1337.258,1339.46,2112372000.0,72991.0
CBK,COMMERZBANK AG,339701.3,339920.9,339478.0,339701.1,1999202000.0,1149542.0
EOAN,E.ON SE NA O.N.,722273.7,722503.4,722041.4,722273.2,1591527000.0,917812.0
LHA,LUFTHANSA AG VNA O.N.,791084.6,791631.6,790519.6,791081.2,1427914000.0,1281523.0
IFX,INFINEON TECH.AG NA O.N.,1521153.0,1521956.0,1520339.0,1521151.0,1266239000.0,1651415.0
DAI,DAIMLER AG NA O.N.,2927073.0,2928782.0,2925333.0,2927042.0,910371100.0,2460598.0
AT1,"AROUNDTOWN EO-,01",398207.6,398360.2,398056.8,398208.0,845397500.0,530283.0
O2D,TELEFONICA DTLD HLDG NA,143106.1,143143.3,143069.3,143106.9,829270900.0,286114.0
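The ranking step can be sketched on a small toy frame. This is an illustration under assumptions, not the notebook's exact code: the mnemonics and volumes are invented, and the sketch sums only the additive columns (`TradedVolume`, `NumberOfTrades`), since a sum of prices across bars has no obvious meaning.

```python
import pandas as pd

# Toy trades: invented mnemonics and volumes, notebook-style column names.
toy = pd.DataFrame(
    {
        "Mnemonic": ["AAA", "BBB", "AAA", "CCC", "BBB", "AAA"],
        "TradedVolume": [100, 50, 200, 10, 25, 300],
        "NumberOfTrades": [1, 2, 3, 1, 1, 4],
    }
)

# Sum only the columns that are meaningful as totals.
totals = toy.groupby("Mnemonic")[["TradedVolume", "NumberOfTrades"]].sum()

# Rank securities by total traded volume, largest first.
ranked = totals.sort_values("TradedVolume", ascending=False)

# Take the top N mnemonics (N is 100 in the notebook, 2 in this toy case).
top_n = 2
top = list(ranked.index[:top_n])
print(top)  # ['AAA', 'BBB']
```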


The current analysis will be limited to the top 100 stocks by traded volume.

In [26]:
securities = list(sorted_grouped_securites.index.get_level_values('Mnemonic')[:100])

In [None]:
import pickle
with open("securities.txt", "wb") as f:   # pickle writes binary data, despite the .txt extension
    pickle.dump(securities, f)
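For reference, reading the list back works the same way in reverse. The sketch below is a self-contained round trip that writes to a temporary directory rather than the notebook's working directory. The three mnemonics are taken from the table above, and the temporary path is an assumption made for illustration.

```python
import pickle
import tempfile
from pathlib import Path

# First three mnemonics from the ranking; stands in for the full list.
securities = ["DBK", "DTE", "SNH"]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "securities.txt"

    # Write in binary mode, as pickle produces bytes.
    with open(path, "wb") as f:
        pickle.dump(securities, f)

    # Read in binary mode to restore the original Python list.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == securities)  # True
```

Since the list holds plain strings, a text format such as one mnemonic per line would also work and would match the `.txt` extension. Pickle is kept here because it is what the notebook uses.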

Now we limit our dataset to these 100 securities.

In [None]:
df = df[df['Mnemonic'].isin(securities)]

In [None]:
df.shape

In [None]:
df.head()

Sweet! Let's export this to Parquet.

In [30]:
df.to_parquet('../data/processed_data/20200904/top100stocks_cleaned_2019.parquet')