### Dataset

In this notebook, we take a single day's dataset of returns for NYSE-listed stocks from the Trade and Quote (TAQ) database. The data are stored on Dropbox as .mat (MATLAB) files. To read them in Python, we first convert each file from .mat to .csv in MATLAB (once we are ready to pull the whole dataset, a MATLAB loop can perform the conversion for every file).

Each dataset covers one day and is stored in a file named YYYYMMDD.csv, with the time of day (in HHMMSS form) as the index and the stock symbols as columns. Each entry is the one-minute return of the corresponding stock at the corresponding minute.
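The layout described above can be illustrated with a small sketch. The symbols `AAA` and `BBB` and their values are hypothetical stand-ins, not data from the TAQ files; the sketch only shows one way to combine the date from the file name with the HHMMSS index to get full timestamps.

```python
import io
import pandas as pd

# Hypothetical two-stock excerpt in the layout described above:
# the first column is the time as HHMMSS, the others are stock symbols.
csv_text = """Time,AAA,BBB
93100,0.001,0.000
93200,-0.002,0.003
155900,0.000,0.001
"""
date = "20030102"  # would come from the file name YYYYMMDD.csv

sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

# The index is read as integers, so 09:31:00 appears as 93100.
# Zero-pad to six digits before parsing it as a time of day.
times = sample.index.astype(str).str.zfill(6)
stamps = [date + t for t in times]
sample.index = pd.to_datetime(stamps, format="%Y%m%d%H%M%S")

print(sample)
```

The zero-padding step matters because times before 10:00:00 lose their leading zero when the index is parsed as an integer.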

This notebook outputs two CSV files: one containing 250 stocks that will be used as the prediction set, and another containing the returns of all stocks lagged by up to three periods.
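As a minimal sketch of what "lagged by up to three periods" means, the toy example below builds the contemporaneous column and three lags for a single hypothetical stock `AAA` with made-up returns. The `(t)` and `(t-i)` column names follow the naming convention used in this notebook.

```python
import pandas as pd

# Hypothetical one-minute returns for a single stock "AAA".
toy = pd.DataFrame(
    {"AAA": [0.01, 0.02, 0.03, 0.04, 0.05]},
    index=[93100, 93200, 93300, 93400, 93500],
)

# Contemporaneous column plus lags 1..3, built with Series.shift.
lags = {"AAA(t)": toy["AAA"]}
for i in range(1, 4):
    lags["AAA(t-%d)" % i] = toy["AAA"].shift(i)
toy_lagged = pd.DataFrame(lags)

print(toy_lagged)
```

Because `shift(i)` has no earlier observation to draw from, the first `i` rows of each lagged column are `NaN`, so the first three rows of the lagged output contain missing values.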

In [20]:
# packages
import numpy as np
import pandas as pd

In [21]:
# suppress warning messages
import warnings
warnings.filterwarnings("ignore")

In [22]:
# create a dataframe from the CSV holding one day of returns
df = pd.read_csv('../../input/data/1min/by_date/20030102.csv', index_col=0)

In [23]:
# choose 250 stocks at random (fixed seed) to use as the prediction set
np.random.seed(0)
random_colnames = np.random.choice(df.columns, 250, replace = False)

In [24]:
# create a dataframe with the randomly chosen stocks
# (explicit copy so that later changes do not touch df)
pred_df = df[random_colnames].copy()

In [25]:
# append (t) to every column name
pred_df = pred_df.add_suffix('(t)')

In [26]:
pred_df

Time,IFUL(t),RMD(t),NI(t),HYSQ(t),HSC(t),ACDO(t),GNLB(t),DRVR(t),BJCT(t),SP(t),...,BPRX(t),DLX(t),RRGB(t),PLUM(t),CALA(t),DHB(t),RRA(t),RMHT(t),FDTR(t),DRRX(t)
93100,0.0,0.000000,0.001001,-0.011561,0.000000,-0.000284,0.000000,0.0,0.0,0.000000,...,0.015456,0.000119,-0.021070,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.023867
93200,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,-0.011990,0.0,0.0,0.000000,...,-0.000844,-0.000119,0.000000,0.000000,0.0,0.002972,0.000000,0.000000,0.0,0.000000
93300,0.0,0.000000,0.001000,0.011561,0.000939,-0.000851,0.023551,0.0,0.0,0.000000,...,-0.016952,0.000000,0.000000,0.000000,0.0,-0.002972,0.000000,0.000000,0.0,0.000000
93400,0.0,0.000000,0.000499,0.000000,0.000313,-0.000568,0.005731,0.0,0.0,0.003069,...,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,-0.004728
93500,0.0,0.000000,0.000000,0.000000,0.000000,-0.001421,-0.028988,0.0,0.0,0.000000,...,0.000000,-0.000953,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155600,0.0,0.000000,-0.001956,-0.000602,0.000000,-0.001103,0.000000,0.0,0.0,-0.001043,...,0.000000,0.000238,-0.000039,0.000000,0.0,0.000000,0.002732,0.000000,0.0,0.000000
155700,0.0,0.000322,0.000489,0.000000,0.001547,0.000000,0.000000,0.0,0.0,-0.001044,...,0.003656,0.000000,0.000039,0.000000,0.0,0.000000,0.004084,0.007737,0.0,0.000000
155800,0.0,0.000000,-0.000979,0.000000,0.000927,0.000000,0.000000,0.0,0.0,0.000000,...,-0.001461,-0.000238,0.004294,0.003824,0.0,0.000000,0.000000,-0.007737,0.0,0.000000
155900,0.0,0.000645,0.002934,0.000000,-0.000618,-0.001656,0.000000,0.0,0.0,0.000000,...,0.000000,0.000238,-0.016893,0.011197,0.0,0.006192,0.000000,0.000000,0.0,0.000000


In [27]:
# build three lags per stock, then append (t) to the original column names;
# collecting the lags first and concatenating once avoids growing df column by column
lagged = {"%s(t-%s)" % (col, i): df[col].shift(i)
          for col in df.columns for i in range(1, 4)}
df = pd.concat([df.add_suffix('(t)'), pd.DataFrame(lagged)], axis=1)

In [None]:
df

In [32]:
pred_df.to_csv('../../output/data/20030102_y.csv', sep=',', encoding='utf-8')

In [31]:
df.to_csv('../../output/data/20030102_x.csv', sep=',', encoding='utf-8')