## Codebase1: Data Wrangling

The structured data is sourced from Sharadar's paid subscription and consists of (i) market data, (ii) financial data and (iii) metadata. The market data is used to calculate the maximum 20 day rolling drawdown in the 1 year period following the filing of the annual report. The binary target is defined as a positive event when this drawdown is greater than 80% and a negative event otherwise.

The codebase is structured in the following sections:

1. Data retrieval and early calculations
2. Preprocessing
3. Merging datasets into target-features data frame 


### (1) Data retrieval and early calculations



The market price database was too big to be loaded by API call and was instead bulk downloaded as a CSV file from the Quandl site.

The below code uses the closing price of each equity to return the rolling 20 day max drawdowns on a daily basis.

In [None]:
'''
Converts daily equity prices from Sharadar database to rolling 20 day max
drawdowns in dataframe format with columns as ticker and dates as index
'''


import pickle
import pandas as pd
from datetime import datetime as dt


t1 = dt.now()
print(t1)

#specify inputs
window_dd = 20

input_file_1 = 'daily_equity_prices.csv'
output_file = 'monthly_rolling_20d_dd_whole_db.pickle'

#read csv file of stock price data
df_stocks = pd.read_csv(input_file_1, parse_dates=['date'])

#pivot table and select closing prices only
df_prices = df_stocks.pivot(index='date', columns='ticker', values='close')

df_prices = df_prices.sort_index()
#calculate max rolling 20 day drawdowns on rolling daily basis
    #compute rolling dd
df_dd = df_prices / df_prices.rolling(window_dd).max() -1
df_dd = df_dd.clip(upper=0)     #cap at zero so only drawdowns remain; NaNs are preserved

df_dd = df_dd.dropna(how='all', axis=1)

#save dict to pickle
with open(output_file, 'wb') as handle:                                     
    pickle.dump(df_dd, handle, protocol=pickle.HIGHEST_PROTOCOL)

t2 = dt.now()
print(t2 - t1)

#runtime 3min30sec

2020-08-25 11:23:12.440870
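To make the rolling drawdown calculation concrete, the sketch below applies the same formula to a tiny, made-up price table. The tickers `AAA` and `BBB`, the prices and the 3 row window are assumptions chosen for readability; the notebook itself uses a 20 day window on the full Sharadar price history.

```python
import pandas as pd

# Hypothetical closing prices for two made-up tickers
prices = pd.DataFrame(
    {"AAA": [100.0, 110.0, 99.0, 55.0, 60.0],
     "BBB": [50.0, 50.0, 51.0, 52.0, 53.0]},
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)

window = 3  # 3 rows for readability; the notebook uses 20

# Drawdown = latest close relative to the highest close in the trailing window
rolling_max = prices.rolling(window).max()
dd = prices / rolling_max - 1

print(dd.round(3))
```

The first `window - 1` rows are NaN because the trailing window is not yet full. A steadily rising series such as `BBB` has a drawdown of zero throughout, while `AAA` is 50% below its trailing high on the fourth day.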


Sharadar also provides metadata for each equity ticker, which is downloaded via the Quandl API in Python.

In [None]:
import quandl
import pickle
import numpy as np


output_file = 'meta_df_whole_db.pickle' 

#API Key
quandl.ApiConfig.api_key = "key"

#Pull data from quandl in df format
df_meta = quandl.get_table('SHARADAR/TICKERS', paginate = True)     #all tickers and metadata


#Wrangle Meta Table                                                    
df_meta = df_meta[df_meta.table.eq('SF1')]                                              #filter by table 'SF1' 
df_meta.set_index('ticker', inplace=True)                                               #set ticker as index
df_meta['CIK'] = df_meta['secfilings'].apply(lambda x: x[x.find('CIK=')+4:].strip())    #form new column for CIK reference number (number as text)                                   
df_meta.fillna(np.nan, inplace=True)                                                    #fill None with NaN
df_meta = df_meta.transpose()                                                           

#save dataframe to file
with open(output_file, 'wb') as handle:                                     
    pickle.dump(df_meta, handle, protocol=pickle.HIGHEST_PROTOCOL)       
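The CIK extraction in the cell above relies on the `secfilings` field ending in a `CIK=` query parameter. The sketch below shows the same slicing on a single string. The URL format and the CIK value are assumptions for illustration, and `extract_cik` is a hypothetical helper that is not part of the original code.

```python
def extract_cik(url):
    """Return the text after 'CIK=', or None if the marker is absent."""
    marker = "CIK="
    pos = url.find(marker)
    if pos == -1:
        # str.find returns -1 when the marker is missing; slicing from
        # -1 + 4 would silently return part of the URL, so stop here
        return None
    return url[pos + len(marker):].strip()

# Assumed URL shape with a made-up CIK
example_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000012345"

cik = extract_cik(example_url)
cik_no_zeros = cik.lstrip("0")   # later cells strip leading zeros before matching

print(cik, cik_no_zeros)
```

Keeping the CIK as text matters because the later cells match it against directory names, where the leading zeros are stripped.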

The next step is to download the 10Ks from the SEC website. Given the highly imbalanced dataset, we use the rolling drawdown dataframe to find those tickers with maximum drawdowns over 80% and make sure these 10Ks are downloaded first. While company tickers can change for various reasons, the CIK number is unique, and it links the drawdown and SEC 10K data through the metadata. Once the target companies have been specified and the CIK numbers retrieved, we use the existing SEC downloader library to retrieve these annual statements to the local drive.

In [None]:
'''
Takes in CIK number from metadata and drawdown dataframe to choose tickers for 
download from SEC website. Downloads to local drive and saves custom log.
'''

import pickle
from sec_edgar_downloader import Downloader
from datetime import datetime as dt

#Specify input and output files

input_file_meta = 'meta_df_whole_db.pickle' 
input_file_dd = 'monthly_rolling_20d_dd_whole_db.pickle'
output_log_file = '10k_dowload_logs.pickle'

local_drive_destination = 'XXX'

with open(input_file_meta, 'rb') as f_meta:
        df_meta = pickle.load(f_meta)

with open(input_file_dd, 'rb') as f_dd:
        df_dd = pickle.load(f_dd)

#find tickers with max dd >= 80%
s_dd = df_dd.min(axis=0)
mask_dd = s_dd <= -0.8
pos_tickers = s_dd[mask_dd].index.tolist()
neg_tickers = s_dd[~mask_dd].index.tolist()         
tickers = pos_tickers + neg_tickers             #ensure pos_tickers downloaded first

t0 = dt.now()
print(t0)

df_ticker_cik = df_meta.loc['CIK']

#Initialize a downloader instance with specified destination
dl = Downloader(local_drive_destination)

# Initialize lists for custom log
descr_list = []
error_list = []

#download all 10Ks of ticker after January 1997
for idx, ticker in enumerate(tickers):                      
    cik = df_ticker_cik[ticker]
    try:
        t1 = dt.now()
        dl.get("10-K", cik, after_date="19970101")     
        t2 = dt.now()
        delta = t2-t1
        descr = str(idx) + ' : ' + ticker + ' : ' + str(delta.seconds) + 'sec'
        descr_list.append(descr)
        print(descr)
    except:
        error_list.append(ticker)
        descr = str(idx) + ' : ' + ticker + ' : ' + 'Error'
        print(descr)
        continue

d_log = {'log': descr_list, 'error_codes': error_list}

#save custom log to file
with open(output_log_file, 'wb') as handle:
    pickle.dump(d_log, handle, protocol=pickle.HIGHEST_PROTOCOL)

t3 =dt.now()

print(t3-t0)  

#runtime overnight- stopped in morning

### (2) Preprocessing

The 10Ks are pulled over a 20+ year period and are inconsistent in format (text, html, xbrl). As a result, a general regex approach is preferred to a format-specific parser for preprocessing. This is programmed as a function below. Stemming and lemmatization are intentionally excluded in order to leave the corpus as nuanced as possible.

In [None]:
import re

def remove_html_tags_char(text):
    '''Takes in string and removes html tags and defined special characters'''
    
    #Define special Chars
    clean1 = re.compile('\n')               
    clean2 = re.compile('\r')               
    clean3 = re.compile('&nbsp;')           
    clean4 = re.compile('&#160;')
    clean5 = re.compile('  ')
    #Define html tags
    clean6 = re.compile('<.*?>')
    #remove special characters and html tags
    text = re.sub(clean1,' ', text)
    text = re.sub(clean2,' ',text)  
    text = re.sub(clean3,' ',text) 
    text = re.sub(clean4,' ',text) 
    text = re.sub(clean5,' ',text) 
    text = re.sub(clean6,' ',text) 
    # check spacing
    final_text = ' '.join(text.split())  
    
    return final_text 
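A short usage sketch of the same regex idea on a toy snippet is given below. The function `strip_markup` is a condensed, hypothetical stand-in for `remove_html_tags_char`, and the input string is made up. One assumed variation is the `re.DOTALL` flag, which lets a tag pattern match across line breaks; the original function reaches the same result by replacing newlines before removing tags.

```python
import re

def strip_markup(text):
    """Condensed sketch: drop tags and non-breaking space entities, then tidy spacing."""
    text = re.sub(r"<.*?>", " ", text, flags=re.DOTALL)   # html / xbrl style tags
    text = re.sub(r"&nbsp;|&#160;", " ", text)            # non-breaking space entities
    return " ".join(text.split())                         # collapse all whitespace runs

raw = "<p>Item&nbsp;1A.\r\n<b>Risk   Factors</b></p>"
clean = strip_markup(raw)

print(clean)
```

Because `str.split()` with no arguments splits on any whitespace, the final join also removes the `\r` and `\n` characters and the repeated spaces in one step.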

In addition to cleaning the 10Ks of special characters, the below code also pulls out document metadata from the text and creates a custom log to track failed documents. The most important metadata is the filing date which will be used to join the unstructured data with the target data. The program saves the output as a dictionary with each document specified by a concatenation of the ticker name and year as the primary key. 

In [None]:


"""
Program processes downloaded 10ks with the following steps:
    (i) maps SEC CIK number to stock exchange tickers needed for later comparison to financial data
    (ii) Finds CIKs with two tickers to ensure 10k data stored for both
    (iii) Walks through directory of downloaded CIKs converting CIK to ticker label
    (iv) finds metadata section and extracts metadata for each 10k
    (v) Find main 10k body and uses regex to remove html and other tags 
         before storing document as single string (10ks from 1997 have different
         and inconsistent formats so regex preferred to html parser)
    (vi) Ticker metadata added eg: sector, industry
    (vii) user defined log and error list per ticker created and stored
    (viii) Final output is dictionary with keys for log, errors and data.
           Data is a nested dictionary with keys equal to ticker_name concatenated with
           year in label of 10k document, values are another dictionary including
           document metadata, ticker metadata, and processed 10k text as string 
"""

import os
import pickle
import pandas as pd
from datetime import datetime as dt
from capstone_10k_functions import remove_html_tags_char

t0 = dt.now()

rootdir = 'XXX'  #for looping through raw 10ks 
input_file_1 = 'meta_df_whole_db.pickle'      #metadata  
output_file = '10k_clean_dict.pickle'


with open(input_file_1, 'rb') as f1:
        v = pickle.load(f1)
        

#create cik to ticker df
df_cik2tic = pd.DataFrame(v.loc['CIK',:])
df_cik2tic = df_cik2tic.reset_index()


#find duplicate tickers for single CIK
bool_series = df_cik2tic['CIK'].duplicated(keep=False)
df_dup_cik = df_cik2tic[bool_series].sort_values(by='CIK')
df_dup_cik['CIK'] = df_dup_cik['CIK'].apply(lambda x: x.lstrip('0'))
dup_n = len(df_dup_cik)

if dup_n % 2 != 0:
    print('Error: duplicate CIKs not an even number')
else:
    pass
    
index_list = [2*number for number in range(dup_n//2)]
dict_dupes_cik2tic = {df_dup_cik.CIK.iloc[j]: (df_dup_cik.ticker.iloc[j], 
                                                df_dup_cik.ticker.iloc[j+1]) for j in index_list}

#Remove duplicates from primary cik to ticker df 
df_cik2tic = df_cik2tic[~bool_series]
df_cik2tic['CIK'] = df_cik2tic['CIK'].apply(lambda x: x.lstrip('0'))
df_cik2tic = df_cik2tic.set_index('CIK')


#
n = 0

d_all = {}          #final output dict
descr_list = []     #description list for live debugging
error_list = []     #list for error logging

#Walk through 10k downloads for primary non-duplicated ciks
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        
        t1 = dt.now()       #start clock for each document 
        
        #find cik number from filename
        subdir_str = str(subdir)
        start_sub= subdir_str.find('filings') + 8
        end_sub = subdir_str.find('10-K') - 1
        cik = subdir_str[start_sub:end_sub]
        
        #find year in name of document (label in file, may not reflect report yr)
        old_fname_str = str(file)
        start_fn = old_fname_str.find('-') +1
        end_fn = start_fn + 2
        year_fn = old_fname_str[start_fn:end_fn]            
     
        #map cik to ticker for renaming & check if cik map unique
        try:
            ticker = df_cik2tic.loc[cik, 'ticker']
            key_all = ticker + '_' + year_fn
            dupe_flag = False
        except KeyError:        #cik not in unique map, so look it up among the duplicated ciks
            list_ticker = dict_dupes_cik2tic[cik]
            key_all_0 = list_ticker[0] + '_' + year_fn
            key_all_1 = list_ticker[1] + '_' + year_fn
            dupe_flag = True        
        
        #finally ready to open and work with document
        filename = os.path.join(subdir, file)
        
        n += 1      #counter for live print debugging
       
        try:
            with open (filename, 'r') as file:    
    
                file_str = file.read()                      #read file into memory
    
                end_1 = file_str.find('<SEQUENCE>2')        #start section follow main 10k
                start = file_str.find('<SEQUENCE>1')        #start of 10k / end of metadata
                end = file_str.find('</DOCUMENT>')          #end of 10k if before end_1
                
                #Extract metadata from document
                meta_text = file_str[:start]                #meta data section
    
                doc_metadata = ['ACCESSION NUMBER:', 'CONFORMED SUBMISSION TYPE:',                #metadata labels in document (order important)
                                'PUBLIC DOCUMENT COUNT:', 'CONFORMED PERIOD OF REPORT:', 
                                'FILED AS OF DATE:', 'DATE AS OF CHANGE:']
    
                doc_key_names = ['Accession_#', 'Type', 'Doc_Count','Period',       #key names for metadata
                                 'Filed_Date', 'Change_Date']
                
                pos_start = [meta_text.find(label) for label in doc_metadata]               #start pos meta data label
                pos_end = [meta_text.find(label) + len(label) for label in doc_metadata]    #end pos meta data label
                doc_meta_values = [meta_text[pos_end[j]:pos_start[j+1]].strip()             #metadata value between end label and beg next label 
                                       for j in range(len(doc_metadata)-1)]  
                doc_meta_values[-1] = doc_meta_values[-1][:8]                               #last label manual as no next label
                
                #define 10k body and clean of html / text / xbrl etc.
                text = file_str[start:end_1]            #define sequence 1
                text = text[:end]                       #end sequence 1 doc
                
                #remove html tags and special chars
                text = remove_html_tags_char(text)
                
                #create main dict with 10k metadata and 10k text as string
                d = dict(zip(doc_key_names, doc_meta_values))
                d.update({'Text': text})
                
                #add ticker sector / industry metadata to main dict
                ticker_meta_short = ['name', 'sicsector', 'sicindustry', 'famasector',  #define metadata of interest
                                     'famaindustry', 'sector', 'industry']      #for later categorical analysis
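                meta_ticker = list_ticker[0] if dupe_flag else ticker   #ticker is undefined or stale for duplicated ciks, so use first of pair
                df_meta_short = v[meta_ticker][ticker_meta_short]      #extract metadata 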
                d.update(df_meta_short.to_dict())                           #add to main dict

                
                #treat for cik duplicate or not to populate final dict with
                #logs and errors
                if dupe_flag == False:                          #no dupe, write 10k to unique ticker
                    d_all.update({key_all: d})
                    
                    t2 = dt.now()
                    delta = t2 - t1
                    descr = str(n) + ' : ' + key_all + ' : ' + str(delta.total_seconds())  #description for log (elapsed seconds)
                    descr_list.append(descr)                                                    #append to log
                    print(descr)                                                                #print to screen for live record
                else:
                    d_all.update({key_all_0: d, key_all_1: d})  #if dupe, write 10k to both tickers
                
                    t2 = dt.now()
                    delta = t2 - t1
                    descr = str(n) + ' : ' + key_all_0 + ' | ' + key_all_1 + ' : ' + str(delta.total_seconds())    #description for log (elapsed seconds)
                    descr_list.append(descr)                #append to log
                    print(descr)                            #print to screen for live record
                    
        #if metadata and 10k wrangle fails, record error            
        except:  
                 
            try:
                if dupe_flag == False:
                    error_list.append(key_all)                           #write error to list
                else:                                                       
                    error_list.append(key_all_0 + ' | ' + key_all_1)    #if fail on duplicate, make sure to record both tickers
                        
                t2 = dt.now()            
                delta = t2 - t1
                descr = str(n) + ' : ' + key_all + ' : ' + 'Error'      #print to screen for live record of error
                descr_list.append(descr)                                #append to log
                print(descr)
                continue
        
            except:
                continue
            




d_all.update({'log' : descr_list, 'error_codes': error_list})       #write log and errors to final dict

with open(output_file, 'wb') as handle:                             #save final dict as pickle
    pickle.dump(d_all, handle, protocol=pickle.HIGHEST_PROTOCOL)

t3 = dt.now()   
print(t3-t0)    

#runtime 30mins
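The metadata step in the cell above slices the header between consecutive labels, which depends on the labels appearing in a fixed order. As an alternative sketch (not the method used above), each label can be looked up independently with a regular expression. The header text and its values below are made up, and `parse_header` is a hypothetical helper.

```python
import re
from datetime import datetime

# Made-up header in the layout the cell above expects
header = """
ACCESSION NUMBER:            0000000000-18-000001
CONFORMED SUBMISSION TYPE:   10-K
PUBLIC DOCUMENT COUNT:       88
CONFORMED PERIOD OF REPORT:  20180929
FILED AS OF DATE:            20181105
DATE AS OF CHANGE:           20181105
"""

# Same key names as doc_key_names in the cell above
labels = {
    "Accession_#": "ACCESSION NUMBER:",
    "Type": "CONFORMED SUBMISSION TYPE:",
    "Doc_Count": "PUBLIC DOCUMENT COUNT:",
    "Period": "CONFORMED PERIOD OF REPORT:",
    "Filed_Date": "FILED AS OF DATE:",
    "Change_Date": "DATE AS OF CHANGE:",
}

def parse_header(text, labels):
    """Return the first whitespace-delimited token after each label (None if missing)."""
    out = {}
    for key, label in labels.items():
        match = re.search(re.escape(label) + r"\s*(\S+)", text)
        out[key] = match.group(1) if match else None
    return out

meta = parse_header(header, labels)
filed = datetime.strptime(meta["Filed_Date"], "%Y%m%d")

print(meta["Type"], filed.date())
```

With this approach a missing label yields `None` for that field only, rather than shifting the values of the labels that follow it.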

The dictionary is converted to a data frame format where the concatenated reference is dropped and the "ticker" and "Filed_Date" columns become the unique identifiers.

In [None]:

"""
Convert the dictionary of processed 10k statements into a single dataframe with one row per filing, identified by ticker and Filed_Date
"""

import pickle
import pandas as pd

input_file = '10k_clean_dict.pickle'
output_file = '10k_clean_df.pickle'

#Load clean dictionary 
with open(input_file, 'rb') as f:
        z = pickle.load(f).copy()
        
#delete keys that are not related to 10k data        
del z['log']                    
del z['error_codes']

#create dictionary with ticker as key and a list of all annual dictionaries as values
d_temp = {}
for k, v in z.items():
    end = k.find('_')                               #find ticker (eg: AAPL) from long key name (eg: AAPL_18)
    ticker = k[:end]
    
    if ticker not in d_temp:           
        d_temp.update({ticker: [v]})            #if key hasn't appeared yet, initialise with list for value
    else:
        d_temp[ticker].append(v)                #if key has appeared, append value to list
        

#convert ticker dictionaries to dataframes 
frames = []
for k, v in d_temp.items():    
    df = pd.DataFrame.from_dict(v, orient='columns')    
    df['ticker']= k                                                                    #add ticker column
    df['file_month_date'] = pd.to_datetime(df['Filed_Date'], errors = 'coerce')
    df['file_month_date'] = df['file_month_date'] + pd.offsets.MonthEnd(0)           #roll filing date forward to month end
    df = df.sort_values(['ticker', 'file_month_date'])    
    
    frames.append(df)                                                                  #collect per-ticker dataframe

df_final = pd.concat(frames)                                                           #DataFrame.append was removed in pandas 2.0


with open(output_file, 'wb') as handle:                                     #save final dataframe as pickle file
    pickle.dump(df_final, handle, protocol=pickle.HIGHEST_PROTOCOL)
    
#runtime 10min
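The conversion above can be illustrated on a toy dictionary. The keys, dates and text below are made up, and the sketch builds the rows directly rather than grouping by ticker first, which is an assumed simplification of the cell above. It also shows what `pd.offsets.MonthEnd(0)` does to a filing date.

```python
import pandas as pd

# Made-up nested dictionary in the same shape as the processed 10k output
docs = {
    "AAA_17": {"Filed_Date": "20170228", "Text": "first filing"},
    "AAA_18": {"Filed_Date": "20180305", "Text": "second filing"},
    "BBB_18": {"Filed_Date": "20181105", "Text": "other company"},
}

rows = []
for key, doc in docs.items():
    ticker = key.rsplit("_", 1)[0]          # 'AAA_17' -> 'AAA'
    rows.append({"ticker": ticker, **doc})

df = pd.DataFrame(rows)
df["Filed_Date"] = pd.to_datetime(df["Filed_Date"], format="%Y%m%d", errors="coerce")

# MonthEnd(0) rolls forward to the month end and leaves month-end dates unchanged
df["file_month_date"] = df["Filed_Date"] + pd.offsets.MonthEnd(0)
df = df.sort_values(["ticker", "file_month_date"]).reset_index(drop=True)

print(df[["ticker", "Filed_Date", "file_month_date"]])
```

After this step each row is identified by the pair (`ticker`, `Filed_Date`), so the concatenated key is no longer needed.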

### (3) Merging the Datasets

Merging the 10K and drawdown dataframes will result in a loss of information. For example, some tickers have a price history but no recorded filings, or only incomplete filings. The below code inner joins the dataframes on the ticker and Filed_Date columns and calculates the maximum of the drawdowns in the year following the filing as the target variable.

The function for computing this drawdown is:

In [None]:
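import pandas as pd

def find_max_dd_period(s, date1, date2, window=20):
    """finds the max drawdown of the rolling dd series in the period (date1, date2]"""
    #shift series back by window-1 rows so each date holds the drawdown of the window that starts on it
    s_dd = pd.Series(s[window-1:].values, index=s.index[:-(window-1)], name=s.name)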

    mask = (s_dd.index > date1) & (s_dd.index <= date2)
    
    max_dd = s_dd[mask].min()
    
    return max_dd
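The re-indexing inside this function is easiest to see on a small series. The sketch below re-implements the same logic under the hypothetical name `max_dd_in_period`, using `.iloc` for the positional slice, and applies it to made-up drawdown values with a 3 row window instead of 20.

```python
import numpy as np
import pandas as pd

def max_dd_in_period(s, date1, date2, window=20):
    """Worst rolling drawdown among windows that start in (date1, date2]."""
    # Move each value back by window-1 rows so a date refers to the window starting on it
    s_fwd = pd.Series(s.iloc[window - 1:].values,
                      index=s.index[:-(window - 1)], name=s.name)
    mask = (s_fwd.index > date1) & (s_fwd.index <= date2)
    return s_fwd[mask].min()

idx = pd.date_range("2020-01-01", periods=6, freq="D")
# Made-up 3 row rolling drawdowns; the first two entries are NaN as in the rolling output
s = pd.Series([np.nan, np.nan, -0.1, -0.5, -0.4, 0.0], index=idx, name="AAA")

result = max_dd_in_period(s, pd.Timestamp("2020-01-02"), pd.Timestamp("2020-01-04"), window=3)
wider = max_dd_in_period(s, pd.Timestamp("2020-01-01"), pd.Timestamp("2020-01-04"), window=3)
flag = int(result <= -0.8)   # same 80% threshold as the target definition

print(result, wider, flag)
```

The start date is exclusive and the end date inclusive, so a drawdown window that starts on the filing date itself is not counted towards the target.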


The code for the merge is:

In [None]:
import pickle
import pandas as pd
from capstone_10k_functions import find_max_dd_period
from pandas.tseries.offsets import DateOffset


input_text = '10k_clean_df.pickle'
input_dd = 'monthly_rolling_20d_dd_whole_db.pickle'

output_file = 'dict_10k_matched_dd.pickle'



with open(input_text, 'rb') as f_text:
        df_text = pickle.load(f_text)

#set to datetime format
df_text['Filed_Date'] = pd.to_datetime(df_text['Filed_Date'], errors = 'coerce')
#define 10k text df for later merging
df_text_actual = df_text[['ticker', 'Filed_Date', 'Text']]
df_text_actual.columns = ['ticker_', 'Filed_Date', 'Text']
#non text data to carry through calcs before merge
df_text = df_text[['ticker', 'Filed_Date', 'sector', 'sicsector']]

#10K tickers to list
tickers_text = set(df_text['ticker'].tolist())   #len = 4,482


with open(input_dd, 'rb') as f_dd:
        df_dd = pickle.load(f_dd)
        
#drawdown tickers to list
tickers_dd = set(df_dd.columns.tolist())        #len = 16,973

#find intersection of tickers across the dataframes
tickers = tickers_text.intersection(tickers_dd)   #len = 4,456
tickers = list(tickers)


#match 10k file date with max 20d dd over next 12 months
counter=0
#loop through tickers
for code in tickers:
    s_dd = df_dd[code]
    
    #event flag column
    ticker_dd_flag = (s_dd.min() <= -0.8)*1
    
    df_10k = df_text[df_text.ticker == code].reset_index(drop=True)
    df_10k.columns = ['ticker_', 'Filed_Date', 'sector', 'sicsector']
    
    #info for meta df
    sector =df_10k['sector'][0]
    sic_sector =df_10k['sicsector'][0]
    #custom sector category
    custom_sector = str(sector) + ' : ' + str(sic_sector)
    
    meta_dict = {'sector': sector, 'sic_sector': sic_sector, 
                       'custom_sector': custom_sector, 'ticker_dd_flag': 
                           ticker_dd_flag }
    df_meta = pd.DataFrame(meta_dict, index = [code])
    
    #loop through years
    for row in range(len(df_10k)):
        #find max dd over next 1 and 2 years
        start = df_10k.loc[row, 'Filed_Date']
        end_1yr = start + DateOffset(months=12)
        end_2yr = start + DateOffset(months=24)
        max_dd_1yr = find_max_dd_period(s_dd, start, end_1yr, window=20)
        max_dd_2yr = find_max_dd_period(s_dd, start, end_2yr, window=20)
        df_10k.loc[row, 'max_dd_1yr'] = max_dd_1yr
        df_10k.loc[row, 'max_dd_2yr'] = max_dd_2yr
        df_10k.loc[row, 'year_dd_flag'] = (max_dd_1yr <= -0.8)*1
                     
    #add_cumulative_year_dd_flag (incl)
    df_10k['cum_year_dd_flag'] = df_10k['year_dd_flag'].expanding().max()
    
    df_10k = df_10k.dropna()
    
    #if empty then skip ticker (counter only counts tickers that contribute rows)
    if df_10k.empty:
        continue
    #if first non-empty ticker, initialize df for concat over next loops
    if counter == 0:
        df_final = df_10k
        df_meta_final = df_meta
    else:
        df_final = pd.concat([df_final, df_10k])
        df_meta_final = pd.concat([df_meta_final, df_meta])
        
    counter += 1

df_final = df_final.reset_index(drop=True)

#
df_final = df_final.merge(df_text_actual, on=['ticker_','Filed_Date'], how='inner')

dict_final = {'matched_df_10k_dd': df_final, 'matched_df_10k_dd_meta': df_meta_final}    #len = 4,365 tickers / 38,807 docs


#save dict to pickle
with open(output_file, 'wb') as handle:                                     
    pickle.dump(dict_final, handle, protocol=pickle.HIGHEST_PROTOCOL)




The next step is to convert the 10Ks from a text string in a column of a dataframe to a feature set in TF-IDF matrix form, which makes use of the following function:

In [None]:
import pandas as pd

def vectorize_corpus(text_series, vectorizer_func, min_df, max_df, ngram_range):
    '''vectorize corpus with specified vectorizer (tfidf or count) 
    and parameters'''
    
    vectorizer = vectorizer_func(min_df=min_df, max_df=max_df, ngram_range=ngram_range)   #pass ngram_range through to the vectorizer
    vectors = vectorizer.fit_transform(text_series)
    feature_names = vectorizer.get_feature_names_out()   #get_feature_names() on scikit-learn < 1.0

    #wrap vectors in sparse dataframe and label
    df = pd.DataFrame.sparse.from_spmatrix(vectors, columns = feature_names)
    
    #drop null columns
    df_test = df[:5]
    null_columns = df_test.columns[df_test.isnull().any()]
    df = df.drop(null_columns, axis=1)
    
    dict_answer ={'df_wv': df, 'vectorizer': vectorizer}
    
    return dict_answer

The code for the vectorization and formation of the training and test sets across the expanding time-series cross validation folds is given by:

In [None]:
"""
vectorize corpus and create target-features matrix across validation folds 
using expanding windows for time series data
"""


import pickle
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from capstone_10k_functions import vectorize_corpus
from datetime import datetime as dt

t1 = dt.now()
print(t1)


input_file = 'dict_10k_matched_dd.pickle'


vector_func = TfidfVectorizer    
func_name = 'TfidfVectorizer'   #['TfidfVectorizer', 'CountVectorizer']

hold_out_set_start = 2015

k_ratio = 0.2
min_df = 15
min_df_grid = [min_df]

max_df = 0.5
ngram = (1,2)
ngram_name = 'bigram'

label_cv = ['cv1', 'cv2', 'cv3', 'cv4']



with open(input_file , 'rb') as f:
        d_data = pickle.load(f)
df = d_data['matched_df_10k_dd']

df = df.sort_values("Filed_Date")


##Define validation sets
mask_hold_out = df['Filed_Date'].dt.year >= hold_out_set_start
df_v = df[~mask_hold_out]
size = df_v.shape[0]
n = int(k_ratio*size)
k_stops = [n, 2*n, 3*n, 4*n, size]


    
#Generate df master (word vector / vectorizer) sets for each cv fold


for idx_cv, label in enumerate(label_cv):
    
    output_filename = label + '_' + func_name + '_' +'min_df_' + str(min_df) +'_' + ngram_name + '.pickle'
    dict_cv = {}
    
    print(label)
    
    stop_train = k_stops[idx_cv]
    stop_test = k_stops[idx_cv + 1]
    df_test = df[stop_train: stop_test]
    df_train = df[:stop_train ]
    
    #format training data
    df_train_text = df_train[['ticker_','Filed_Date', 'Text']]
    df_train_other = df_train.drop('Text', axis=1)
    df_train_other.columns = ['ticker_', 'Filed_Date', 'sector_', 'sic_sector', 
                        'max_dd_1yr', 'max_dd_2yr', 'year_dd_flag', 
                        'cum_year_dd_flag']
    df_train_other['custom_sector'] = df_train_other['sector_'].astype(str) + ' : ' + df_train_other['sic_sector'].astype(str)   #row-wise concatenation

    #format testing data
    df_test_text = df_test[['ticker_','Filed_Date', 'Text']]
    df_test_other = df_test.drop('Text', axis=1)
    df_test_other.columns = ['ticker_', 'Filed_Date', 'sector_', 'sic_sector', 
                        'max_dd_1yr', 'max_dd_2yr', 'year_dd_flag', 
                        'cum_year_dd_flag']
    df_test_other['custom_sector'] = df_test_other['sector_'].astype(str) + ' : ' + df_test_other['sic_sector'].astype(str)   #row-wise concatenation
        

    for min_df in min_df_grid: 
        print(min_df)
        
        #name for cv dictionary specified by min_df value
        key_name = 'min_df_' + str(min_df)
        
        #vectorize corpus and assign word vector and vectorizer
        function = vectorize_corpus(df_train_text['Text'], vector_func, min_df, 
                                            max_df,ngram)
        X = function['df_wv']
        vectorizer = function['vectorizer']
        
        #Transform training data into df_master format
        vocab = X.columns.tolist()
        X['Filed_Date'] = df_train_text['Filed_Date'].values
        X['ticker_'] = df_train_text['ticker_'].values
                        
        df_train_master = df_train_other.merge(X, on=['ticker_','Filed_Date'], how='inner')
        
        #Transform test data into df master format
        arr_test_transform = vectorizer.transform(df_test_text['Text'])
        df_test_transform = pd.DataFrame.sparse.from_spmatrix(arr_test_transform,
                                                           columns = vocab)
        df_test_transform['Filed_Date'] = df_test_text['Filed_Date'].values
        df_test_transform['ticker_'] = df_test_text['ticker_'].values
        
        
        df_test_master = df_test_other.merge(df_test_transform, 
                                             on=['ticker_','Filed_Date'], 
                                                                 how='inner')
        
            
        dict_final = {'df_test_master': df_test_master, 'df_train_master': df_train_master}
                
        dict_cv[key_name] = dict_final
           
    with open(output_filename, 'wb') as handle:                                     
        pickle.dump(dict_cv, handle, protocol=pickle.HIGHEST_PROTOCOL)
                    



t2 = dt.now()
print(t2)
print(t2-t1)
              
                
    #runtime 2hrs30mins
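The fold boundaries produced by `k_ratio` and `k_stops` in the cell above can be checked in isolation. The sketch below reproduces the same arithmetic for a made-up sample of 10 rows; `expanding_folds` is a hypothetical helper and is not part of the original code.

```python
def expanding_folds(size, k_ratio=0.2, n_folds=4):
    """Return (train, test) row ranges as half-open (start, stop) pairs."""
    n = int(k_ratio * size)
    # Same stops as k_stops above: n, 2n, ..., then the full validation size
    stops = [n * (i + 1) for i in range(n_folds)] + [size]
    return [((0, stops[i]), (stops[i], stops[i + 1])) for i in range(n_folds)]

folds = expanding_folds(10)

for label, (train, test) in zip(["cv1", "cv2", "cv3", "cv4"], folds):
    print(label, "train rows", train, "test rows", test)
```

Because the rows are sorted by `Filed_Date` before slicing, every training window ends before its test window begins, so no fold trains on filings made after the ones it is tested on.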
