![](https://signals.numer.ai/homepage-signals/img/signals-logo.png)

Numerai Signals is an interesting way to convert your data science skills into money:)

In contrast to the Numerai Tournament, where data are provided to participants for modeling, Numerai Signals provides only historical targets, so participants need to bring their own stock data for modeling.

For details, visit the [website](https://signals.numer.ai/).

To this end, I created a Kaggle dataset of stock price data obtained via the YFinance API.

In this notebook, I will show you how to load the data, along with the targets for the Numerai Signals.

A large part of this notebook is derived from the following Numerai Signals baseline:

https://colab.research.google.com/drive/1ECh69C0LDCUnuyvEmNFZ51l_276nkQqo#scrollTo=tTBUzPep2dm3

# Libraries

In [None]:
!pip install numerapi==2.3.8
import numerapi

In [None]:
!pip install git+https://github.com/leonhma/yfinance.git #drop-in replacement yfinance fork for failed downloads, h/t ceunen
!pip install simplejson
import yfinance
import simplejson

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import gc
import pathlib
from tqdm.auto import tqdm
import json
from multiprocessing import Pool, cpu_count
import time
import requests  # not aliased as `re`, which would shadow the standard-library regex module
from datetime import datetime
from dateutil.relativedelta import relativedelta, FR

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os

# visualize
import matplotlib.pyplot as plt
import matplotlib.style as style
from matplotlib_venn import venn2, venn3
import seaborn as sns
from matplotlib import pyplot
from matplotlib.ticker import ScalarFormatter
sns.set_context("talk")
style.use('seaborn-colorblind')  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8-colorblind'

import warnings
warnings.simplefilter('ignore')

# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Today

In [None]:
today = datetime.now().strftime('%Y-%m-%d')
today

# Config

In [None]:
class CFG:
    INPUT = '../input/yfinance-stock-price-data-for-numerai-signals/full_data.csv'
    OUTPUT_DIR = './'

In [None]:
# Logging is always nice for your experiment:)
def init_logger(log_file='train.log'):
    from logging import getLogger, INFO, FileHandler,  Formatter,  StreamHandler
    logger = getLogger(__name__)
    logger.setLevel(INFO)
    handler1 = StreamHandler()
    handler1.setFormatter(Formatter("%(message)s"))
    handler2 = FileHandler(filename=log_file)
    handler2.setFormatter(Formatter("%(message)s"))
    logger.addHandler(handler1)
    logger.addHandler(handler2)
    return logger

logger = init_logger(log_file=f'{CFG.OUTPUT_DIR}/{today}.log')
logger.info(f'Start Logging...today is {today}')

# Get Numerai-Eligible Tickers
YFinance tickers and Numerai tickers differ slightly, so a mapping between the two is needed for a successful submission.

In [None]:
napi = numerapi.SignalsAPI()
logger.info('numerai api setup!')

In [None]:
# read in list of active Signals tickers which can change slightly era to era
eligible_tickers = pd.Series(napi.ticker_universe(), name='ticker') 
logger.info(f"Number of eligible tickers: {len(eligible_tickers)}")

In [None]:
# read in the yahoo-to-numerai ticker map, h/t wsouza
# this ticker map is a work in progress and not guaranteed to be 100% correct
ticker_map = pd.read_csv('https://numerai-signals-public-data.s3-us-west-2.amazonaws.com/signals_ticker_map_w_bbg.csv')
ticker_map = ticker_map[ticker_map.bloomberg_ticker.isin(eligible_tickers)]

numerai_tickers = ticker_map['bloomberg_ticker']
yfinance_tickers = ticker_map['yahoo']
logger.info(f"Number of eligible tickers in map: {len(ticker_map)}")

In [None]:
print(ticker_map.shape)
ticker_map.head()
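
To make the mapping step concrete, here is a minimal sketch on a hypothetical three-row ticker map. The column names `yahoo` and `bloomberg_ticker` are the ones used above; the ticker values and the small `prices` frame are made up for illustration only.

```python
import pandas as pd

# Hypothetical mini ticker map (illustrative values, not the real file)
ticker_map = pd.DataFrame({
    'yahoo': ['AAPL', '7203.T', 'XXXX'],
    'bloomberg_ticker': ['AAPL US', '7203 JP', 'XXXX US'],
})
# Hypothetical list of eligible tickers
eligible_tickers = pd.Series(['AAPL US', '7203 JP'], name='ticker')

# Keep only the rows whose Numerai ticker is currently eligible
ticker_map = ticker_map[ticker_map['bloomberg_ticker'].isin(eligible_tickers)]

# Build a Yahoo -> Numerai lookup and apply it to a price table
yahoo_to_numerai = dict(zip(ticker_map['yahoo'], ticker_map['bloomberg_ticker']))
prices = pd.DataFrame({'ticker': ['AAPL', '7203.T'], 'close': [130.0, 8000.0]})
prices['ticker'] = prices['ticker'].map(yahoo_to_numerai)
print(prices)
```

Any Yahoo ticker that is missing from the map becomes `NaN` after `.map`, so it is worth checking for nulls before building a submission.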

# Load Data
The function below fetches stock price data via YFinance and maps the Yahoo tickers to Numerai-compatible tickers.

Here we don't use it and simply load the existing dataset.

Note that a small amount of volume data is missing (the YFinance API is not perfect).

In [None]:
# If you want to fetch the data on your own, you can use this function...

def fetch_yfinance(ticker_map, start='2002-12-01'):
    """
    # fetch yfinance data
    :INPUT:
    - ticker_map : Numerai eligible ticker map (pd.DataFrame)
    - start : date (str)
    
    :OUTPUT:
    - full_data : pd.DataFrame 
    ('date', '(numerai) ticker', '(adjusted) close', 'high', 'low', 'open', 'volume')
    """
    
    # ticker map
    numerai_tickers = ticker_map['bloomberg_ticker']
    yfinance_tickers = ticker_map['yahoo']

    # fetch
    raw_data = yfinance.download(
        yfinance_tickers.str.cat(sep=' '), 
        start=start, 
        threads=True
    ) 
    
    # format
    cols = ['Adj Close', 'High', 'Low', 'Open', 'Volume']
    full_data = raw_data[cols].stack().reset_index()
    full_data.columns = ['date', 'ticker', 'close', 'high', 'low', 'open', 'volume']
    
    # map yfinance tickers to numerai tickers
    full_data['ticker'] = full_data.ticker.map(
        dict(zip(yfinance_tickers, numerai_tickers))
    )
    return full_data
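
The reshaping step in `fetch_yfinance` (wide, multi-ticker columns to one row per date and ticker) can be hard to picture. Below is a small sketch on a hand-made frame that only mimics the two-level column layout of a multi-ticker download; the level names `field` and `ticker` and all values are assumptions for illustration.

```python
import pandas as pd

# Toy wide frame: first column level is the price field, second is the ticker
dates = pd.Index(pd.to_datetime(['2021-01-04', '2021-01-05']), name='date')
columns = pd.MultiIndex.from_product(
    [['Adj Close', 'Volume'], ['AAPL', 'MSFT']], names=['field', 'ticker']
)
wide = pd.DataFrame(
    [[129.0, 217.0, 1000, 2000],
     [131.0, 218.0, 1100, 2100]],
    index=dates,
    columns=columns,
)

# Move the ticker level from the columns into the rows, then flatten the index
long = wide.stack(level='ticker').reset_index()
long = long.rename(columns={'Adj Close': 'close', 'Volume': 'volume'})
print(long)
```

Renaming by column name, rather than assigning a list of names positionally, avoids depending on the column order that `stack` returns.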


In [None]:
# just load the kaggle dataset I made
full_data = pd.read_csv(CFG.INPUT)

logger.info('{:,} tickers in YFinance data'.format(full_data['ticker'].nunique()))
logger.info('{:.2f}% volume data missing'.format(100*np.sum(full_data['volume'] == 0) / len(full_data)))
logger.info('{:.2f}% price data missing'.format(100*np.sum(full_data['close'] == 0) / len(full_data)))

print(full_data.shape)
full_data.head()

In [None]:
full_data.tail()
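
The log lines above report the overall share of zero values. If you want to see which tickers are affected, a per-ticker breakdown can be sketched as follows, here on a toy frame with made-up tickers and values rather than the real dataset.

```python
import pandas as pd

# Toy stand-in for full_data (hypothetical tickers and values)
df = pd.DataFrame({
    'ticker': ['AAA US', 'AAA US', 'BBB US', 'BBB US'],
    'volume': [0, 100, 50, 60],
})

# Share of zero-volume rows per ticker, worst first
zero_share = (
    df['volume'].eq(0)
    .groupby(df['ticker'])
    .mean()
    .sort_values(ascending=False)
)
print(zero_share)
```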

# Load Numerai Targets 
Let's load the historical targets for Numerai Signals to see the overlap of tickers.

In [None]:
def read_numerai_signals_targets():
    # read in Signals targets
    numerai_targets = 'https://numerai-signals-public-data.s3-us-west-2.amazonaws.com/signals_train_val.csv'
    targets = pd.read_csv(numerai_targets)
    
    # convert friday_date to an integer in YYYYMMDD form
    targets['friday_date'] = pd.to_datetime(targets['friday_date'].astype(str), format='%Y-%m-%d').dt.strftime('%Y%m%d').astype(int)
    
#     # train, valid split
#     train_targets = targets.query('data_type == "train"')
#     valid_targets = targets.query('data_type == "validation"')
    
    return targets

targets = read_numerai_signals_targets()

In [None]:
logger.info('Target shape: {}, dates {} - {}'.format(
    targets.shape, targets['friday_date'].min(), targets['friday_date'].max())
           )
targets.head()

In [None]:
targets.tail()
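
Since `friday_date` is stored as a `YYYYMMDD` integer, price rows can be aligned to targets by giving them the same kind of key. The sketch below assumes you want to join on `ticker` and `friday_date`; the two toy frames and their values are hypothetical.

```python
import pandas as pd

# Toy price rows (hypothetical values); 2021-01-08 is a Friday
prices = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-07', '2021-01-08', '2021-01-08']),
    'ticker': ['AAA US', 'AAA US', 'BBB US'],
    'close': [10.0, 10.5, 20.0],
})
# Toy targets in the same layout as the frame loaded above
toy_targets = pd.DataFrame({
    'ticker': ['AAA US', 'BBB US'],
    'friday_date': [20210108, 20210108],
    'target': [0.75, 0.25],
})

# Express the price date as a YYYYMMDD integer so the keys match
prices['friday_date'] = prices['date'].dt.strftime('%Y%m%d').astype(int)

# An inner join keeps only the (ticker, date) pairs present in both frames
merged = prices.merge(toy_targets, on=['ticker', 'friday_date'], how='inner')
print(merged)
```

Only rows dated on a target Friday survive the join; the Thursday row is dropped.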

In [None]:
for t in ['train', 'validation']:
    plt.hist(targets.query('data_type == @t')['target'])
    plt.title(f'{t}: target distribution')
    plt.show()

# Load Numerai Signals' sample submission
You can also load a sample submission to get a sense of what your submission should look like.

In [None]:
sample_submission = 'https://numerai-signals-public-data.s3-us-west-2.amazonaws.com/example_signal/latest.csv'
ss = pd.read_csv(sample_submission)

logger.info('{:,} tickers in the sample submission'.format(ss['numerai_ticker'].nunique()))

print(ss.shape)
ss.head()

In [None]:
ss.tail()

In [None]:
for t in ['validation', 'live']:
    plt.hist(ss.query('data_type == @t')['signal'], bins=100)
    plt.title(f'{t}: example signal distribution')
    plt.show()
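
As a rough illustration of turning a feature into a `signal` column, here is one possible sketch. It assumes the signal should lie strictly between 0 and 1, and it uses a hypothetical `momentum` feature with made-up values; this is not the method behind the example file.

```python
import pandas as pd

# Hypothetical per-ticker feature values
feat = pd.DataFrame({
    'numerai_ticker': ['AAA US', 'BBB US', 'CCC US', 'DDD US'],
    'momentum': [0.1, -0.2, 0.05, 0.3],
})

# Rank the feature, then shift by 0.5 and divide by n so that
# every value falls strictly inside (0, 1)
ranks = feat['momentum'].rank(method='first')
feat['signal'] = (ranks - 0.5) / len(feat)
print(feat[['numerai_ticker', 'signal']])
```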

# Overlap of Tickers
Finally, let's check whether the tickers in our YFinance data overlap sufficiently with those in the targets and the sample submission.

In [None]:
# check ticker overlap
venn3(
    [
        set(full_data['ticker'].unique().tolist())
        , set(targets['ticker'].unique().tolist())
        , set(ss['numerai_ticker'].unique().tolist())
    ],
    set_labels=('yf price', 'historical target', 'sample sub')
)

You can actually submit as long as your signal covers at least 5 tickers, so this amount of overlap is fine.
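
If you prefer numbers to a diagram, the same overlap can be computed with plain set operations. The sketch below uses three small hypothetical ticker sets in place of the real ones.

```python
# Hypothetical ticker sets standing in for the three sources
yf_tickers = {'AAA US', 'BBB US', 'CCC US'}
target_tickers = {'BBB US', 'CCC US', 'DDD US'}
sub_tickers = {'CCC US', 'DDD US', 'EEE US'}

# Tickers present in all three sources
common = yf_tickers & target_tickers & sub_tickers

# Fraction of sample-submission tickers covered by the price data
coverage = len(yf_tickers & sub_tickers) / len(sub_tickers)

print(f'{len(common)} tickers in all three sets')
print(f'{coverage:.1%} of sample submission tickers have price data')
```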

Note that using this YFinance stock price data is NOT mandatory for Numerai Signals. You can join the competition with your own data and approach for more original signals.

GOOD LUCK!