# Using a Machine Learning Model to Find Investor Presentations
Investor presentation materials contain in-depth information, yet they also highlight the high-level picture well instead of focusing on the accounting details and disclosures required by regulations. Here is [an example](http://edgar.secdatabase.com/2259/115752318001602/filing-main.htm).

However, these presentation slides are buried among all the other SEC filings. Investors are unlikely to find them by checking companies one by one.

To make them easier to find, SECdatabase applies a Machine Learning algorithm to determine whether a particular filing is an investor presentation.

This notebook shares how we tackle this problem.

In [1]:
import SecdbAPI
import pandas as pd
import MyUtil
import re
import random
from multiprocessing.pool import ThreadPool
from multiprocessing import Manager

Next we read the data from the database. To help readers understand the dataset, we also provide a file-based snapshot, `data/8k_df.csv`, which was dumped from pandas.

In [2]:
readFromDB=False
if readFromDB:
    df = SecdbAPI.load8K(12)
    df.to_csv("data/8k_df.csv", sep="\t", header=True)
else:
    df = pd.read_csv("data/8k_df.csv",sep="\t")
    df= df.drop(columns=["Unnamed: 0"])
df.shape

(69623, 5)

### Feature Engineering
As an initial step, we need to find out which features can help detect presentations in 8-K filings. After manually checking some files, we came up with the following list. Here is a [sample file list](http://edgar.secdatabase.com/886/119312517334408/filelist.txt).
- number of files 
- number of pictures
- size of the core html
- average size of all pictures

All of these features come from the file list. However, we believe that the text in the html could give us more insight and help us make the judgement.
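
As a minimal sketch of how these four features fall out of a file list, the snippet below parses a made-up list held in memory. The layout (tab-separated fields, records separated by `\r`, document type in field 1, file name in field 2, size in bytes in field 3) is inferred from the parsing code later in this notebook; the sample rows, the meaning of field 0, and the helper name `file_list_features` are assumptions for illustration only.

```python
# Made-up file list in the layout the notebook's parsing code expects.
# Field 0 is assumed to be a sequence number; it is not used here.
SAMPLE_FILELIST = "\r".join([
    "1\t8-K\tfiling-main.htm\t25000",
    "2\tEX-99.1\tex99-1.htm\t48000",
    "3\tGRAPHIC\tslide01.jpg\t120000",
    "4\tGRAPHIC\tslide02.jpg\t80000",
])


def file_list_features(text):
    """Return the four file-list features as a dict (sizes in KB)."""
    # Skip blank records so a trailing separator does not count as a file.
    rows = [line.split("\t") for line in text.split("\r") if line.strip()]
    pictures = [row for row in rows if row[1] == "GRAPHIC"]
    total_pict_bytes = sum(int(row[3]) for row in pictures)
    return {
        "num_files": len(rows),
        "num_pict": len(pictures),
        # The first record is treated as the core html document.
        "size_corehtml": int(rows[0][3]) // 1000,
        "avg_pict_size": total_pict_bytes // (len(pictures) * 1000) if pictures else 0,
    }


features = file_list_features(SAMPLE_FILELIST)
print(features)
```

Unlike the notebook's own code, this sketch ignores blank records when counting files, which is a deliberate simplification.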

### Add text-based features
We know that if a filing is a presentation, its text usually contains the word "presentation", and the filing is usually categorized under "Item 7.01", or "Regulation FD Disclosure". So we count how many times such phrases show up in the document.
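
To make the counting concrete, here is a small sketch that applies the same substring counts to a made-up sentence instead of a downloaded filing. The sample text and the helper name `count_keywords` are assumptions; the keywords are the ones used in this notebook.

```python
# Made-up text standing in for the body of an 8-K filing.
SAMPLE_TEXT = (
    "Item 7.01 Regulation FD Disclosure. "
    "The Company will hold an investor meeting and webcast. "
    "The presentation slides and the earnings presentations are attached."
)


def count_keywords(text):
    """Return (kw1, kw2, kw3) keyword counts for a piece of text."""
    txt = text.lower()
    # kw1: words hinting at a presentation or an investor event
    kw1 = sum(txt.count(word) for word in
              ("presentation", "meeting", "webinar", "webcast"))
    # kw2: the title of Item 7.01
    kw2 = txt.count("regulation fd disclosure")
    # kw3: substring match, so "earnings" is counted as well
    kw3 = txt.count("earning")
    return kw1, kw2, kw3


counts = count_keywords(SAMPLE_TEXT)
print(counts)  # (4, 1, 1)
```

Because `str.count` matches substrings, "presentations" and "earnings" are counted too, which is convenient here but can also pick up unrelated words.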

In [3]:
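def searchKeyWords(url):
    """Count presentation-related keywords in the document at `url`."""
    cont = MyUtil.requestContent(url)
    txt = cont.decode('ascii').lower()

    # kw1: words that hint at a presentation or an investor event
    matches = txt.count("presentation")+txt.count("meeting")+\
            txt.count("webinar")+txt.count("webcast")
    # kw2: the title of Item 7.01
    matches2 =  txt.count("regulation fd disclosure")
    # kw3: substring match, so "earnings" is counted as well
    matches3 =  txt.count("earning")

    return (matches , matches2 , matches3 )

def extractFeature(urlt):
    """Build one tab-separated feature record for the filing under `urlt`."""
    try:
        cont = MyUtil.requestContent(urlt+'filelist.txt')
        ss = cont.decode('ascii')
        lines=ss.split('\r')
        num_files=len(lines)
        # the first entry of the file list is the core html document (size in KB)
        size_corehtml=int(lines[0].split("\t")[3])//1000
        keyfile=lines[0].split('\t')[2]
        kw1, kw2, kw3 = searchKeyWords(urlt+keyfile)
        graphs=[l for l in lines if len(l.strip())!=0 and l.split("\t")[1]=="GRAPHIC"]
        num_pict = len(graphs)
        # average picture size in KB; 0 when the filing has no pictures
        avg_pict_size =sum([int(a.split("\t")[3]) for a in graphs])//(num_pict*1000) \
                            if num_pict!=0 else 0

        return "\t".join([urlt+'filelist.txt', str(num_files), str(size_corehtml), \
                str(num_pict), str(avg_pict_size), str(kw1), str(kw2), str(kw3)])+"\r"
    except Exception:
        # any download, decoding or parsing error marks the filing as failed
        print("failed")
        return "failed"+urlt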

Here we use Python's `ThreadPool` to speed up the program. Because of Python's [Global Interpreter Lock](https://wiki.python.org/moin/GlobalInterpreterLock), only one thread executes Python bytecode at a time, so threads do not speed up CPU-bound work. However, since this download step spends most of its time waiting on network I/O, multithreading still helps.
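
The effect is easy to reproduce without any network access. In the sketch below, `fake_download` is a hypothetical stand-in for a request: it only sleeps, which releases the lock in the same way waiting on a socket does. The URLs and the timings are illustrative assumptions, not measurements from our pipeline.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_download(url):
    """Stand-in for a network request; sleeping releases the GIL like I/O waits do."""
    time.sleep(0.1)
    return len(url)


urls = ["http://example.com/filing/%d" % i for i in range(10)]

# One request after another: total time is roughly 10 x 0.1 s.
start = time.perf_counter()
serial = [fake_download(u) for u in urls]
serial_seconds = time.perf_counter() - start

# Ten worker threads wait concurrently: total time is roughly one request.
start = time.perf_counter()
with ThreadPool(processes=10) as pool:
    threaded = pool.map(fake_download, urls)
threaded_seconds = time.perf_counter() - start

print("serial:   %.2f s" % serial_seconds)
print("threaded: %.2f s" % threaded_seconds)
```

`pool.map` returns results in the order of the inputs, so the output lines up with the URL list regardless of which thread finishes first.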

In [None]:
urls=df.loc[:,'url']

pool = ThreadPool(processes=20)
output=pool.map(extractFeature, urls)

with open("data/8k_present3.txt", "w") as ff:
    ff.write("".join(output))
            

### Manually label our data
Since this is a classification problem, we need labeled data to train our model, so we have to label it manually. 
We first focus on the boundary cases: filings that contain 4 to 7 pictures, which could be either a presentation or an ordinary 8-K report with several scanned document images. Here is an example of such reports.

In [187]:
import sklearn.utils 
from sklearn import linear_model
from sklearn.metrics import mean_squared_error,\
    r2_score,classification_report,precision_recall_fscore_support

data=pd.read_csv("data/8k_present_labeled.txt",sep="\t",header=0)
data = sklearn.utils.shuffle(data)
cut=int(0.8*len(data))
train=data.head(cut)
valid=data.tail(len(data)-cut)
intputs=["size_corehtml","num_pict","avg_pict_size","kw1","kw2","kw3"]
train_x= train[intputs]
train_y= train["IsPresentation"]
valid_x=valid[intputs]
valid_y=valid["IsPresentation"]

In our testing, we found that `Number of Files` is not a good indicator, so we exclude it from our features.  
As in every Machine Learning project, finding the right features, aka Feature Engineering, is an iterative process and takes a decent amount of time. To keep this demo concise, we don't include our feature selection steps.

In [188]:

# Create logistic regression object
regr = linear_model.LogisticRegression()

# Train the model using the training sets
regr.fit(train_x, train_y)

# The coefficients
print('Coefficients: \n', regr.coef_)

Coefficients: 
 [[-1.17231602e-02  1.09428410e-02  7.55687369e-04  5.24685845e-01
   1.96332677e+00  2.24943295e-01]]


The coefficients are the outcome of the whole model. They are listed in the same order as the input features: `size_corehtml`, `num_pict`, `avg_pict_size`, `kw1`, `kw2`, `kw3`. The largest weight belongs to `kw2`, the count of "Regulation FD Disclosure".

### In-sample metrics

In [189]:
# Make predictions on the training set
pred_y = regr.predict(train_x)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(pred_y, train_y))
# R^2 score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(train_y, pred_y))

# predict() already returns 0/1 labels, so this thresholding leaves them unchanged
pred_y3=[1 if x>0.5 else 0 for x in pred_y]
print(classification_report(train_y, pred_y3))

Mean squared error: 0.11
Variance score: 0.42
             precision    recall  f1-score   support

          0       0.84      0.70      0.76       169
          1       0.90      0.95      0.93       487

avg / total       0.89      0.89      0.89       656



### Out-of-Sample validation

In [190]:
# Make predictions on the validation set
pred_y = regr.predict(valid_x)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(pred_y, valid_y))
# R^2 score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(valid_y, pred_y))

# predict() already returns 0/1 labels, so this thresholding leaves them unchanged
pred_y2=[1 if x>0.5 else 0 for x in pred_y]
print(classification_report(valid_y, pred_y2))

Mean squared error: 0.15
Variance score: 0.26
             precision    recall  f1-score   support

          0       0.82      0.60      0.69        45
          1       0.86      0.95      0.90       119

avg / total       0.85      0.85      0.85       164



### Take a look at the incorrect predictions
This step helps us in two ways. First, we can see where our model predicts incorrectly, which gives us intuition for adjusting the model. Second, our manually labeled data may contain errors; fixing them improves the quality of our data, which further improves our model.

In [191]:
#false negative, error in recall
diff=[y1==1 and y2==0 for y1, y2 in zip(valid_y, pred_y2)]
valid[diff]

| | URL | num_files | size_corehtml | num_pict | avg_pict_size | kw1 | kw2 | kw3 | IsPresentation |
|---|---|---|---|---|---|---|---|---|---|
| 213 | http://edgar.secdatabase.com/946/1493152180094... | 55 | 28 | 52 | 92 | 0 | 0 | 0 | 1 |
| 740 | http://edgar.secdatabase.com/893/1279569180007... | 114 | 175 | 100 | 121 | 2 | 0 | 0 | 1 |
| 164 | http://edgar.secdatabase.com/493/1144204180055... | 29 | 1150 | 19 | 61 | 24 | 0 | 6 | 1 |
| 644 | http://edgar.secdatabase.com/2219/100210518000... | 94 | 83 | 88 | 335 | 2 | 0 | 0 | 1 |
| 159 | http://edgar.secdatabase.com/2199/915761800001... | 30 | 32 | 25 | 114 | 1 | 0 | 1 | 1 |
| 555 | http://edgar.secdatabase.com/2939/875159180000... | 48 | 32 | 44 | 146 | 2 | 0 | 0 | 1 |


In [192]:
#false negative, error in recall
diff=[y1==1 and y2==0 for y1, y2 in zip(train_y, pred_y3)]
train_x[diff]

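| | size_corehtml | num_pict | avg_pict_size | kw1 | kw2 | kw3 |
|---|---|---|---|---|---|---|
| 295 | 36 | 15 | 81 | 3 | 0 | 0 |
| 49 | 28 | 36 | 119 | 2 | 0 | 0 |
| 234 | 17 | 22 | 272 | 2 | 0 | 0 |
| 85 | 16 | 32 | 231 | 0 | 0 | 0 |
| 709 | 44 | 68 | 382 | 1 | 0 | 0 |
| 410 | 22 | 14 | 209 | 3 | 0 | 0 |
| 166 | 20 | 19 | 147 | 0 | 0 | 2 |
| 103 | 31 | 35 | 172 | 2 | 0 | 0 |
| 311 | 37 | 21 | 127 | 3 | 0 | 0 |
| 102 | 15 | 35 | 86 | 2 | 0 | 0 |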
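
The precision and recall figures in the classification reports come down to four counts, and the same comparison that selects false negatives can select false positives too. The sketch below works on two short made-up label lists; the data and the helper name `confusion_counts` are assumptions for illustration, not values from our dataset.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


# Made-up labels: 1 = presentation, 0 = not a presentation.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # share of predicted presentations that are real
recall = tp / (tp + fn)     # share of real presentations that were found

# Positions of the misclassified samples, for manual inspection.
false_negatives = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t == 1 and p == 0]
false_positives = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t == 0 and p == 1]

print("precision=%.2f recall=%.2f" % (precision, recall))
print("false negatives:", false_negatives)
print("false positives:", false_positives)
```

False negatives lower recall for class 1, while false positives lower its precision, so looking at both lists shows which kind of error dominates.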


## Test Random Forest method
After looking at all the failed cases of the regression model, we find its performance okay but not great; in particular, its precision is lacking. Random Forest is built on top of a number of small decision trees and can handle outliers and non-linear relationships quite well. We give it a try here.
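
To illustrate the idea of many small trees deciding together, here is a toy sketch with three hand-written trees and a plain majority vote. The feature names come from this notebook, but the thresholds, the sample rows, and the function names are invented for illustration. It is also a simplification: scikit-learn's `RandomForestClassifier` learns its trees from data and averages their class probabilities instead of counting hard votes.

```python
from collections import Counter


# Three tiny hand-written "trees"; all thresholds are invented.
def tree_pictures(row):
    return 1 if row["num_pict"] >= 8 else 0


def tree_keywords(row):
    return 1 if row["kw1"] >= 2 or row["kw2"] >= 1 else 0


def tree_picture_size(row):
    return 1 if row["avg_pict_size"] >= 50 and row["num_pict"] >= 4 else 0


FOREST = [tree_pictures, tree_keywords, tree_picture_size]


def forest_predict(row):
    """Return the class that most trees vote for."""
    votes = Counter(tree(row) for tree in FOREST)
    return votes.most_common(1)[0][0]


# Made-up filings: a slide deck with few keywords, and a few small scans.
slides = {"num_pict": 25, "avg_pict_size": 114, "kw1": 1, "kw2": 0}
scans = {"num_pict": 5, "avg_pict_size": 30, "kw1": 0, "kw2": 0}

print(forest_predict(slides), forest_predict(scans))  # 1 0
```

Each tree applies hard thresholds, so a filing with many large pictures can still be classified as a presentation when its keyword counts are low, something a single weighted sum handles less gracefully.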

In [193]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

clf = RandomForestClassifier(max_depth=3, random_state=0)
clf.fit(train_x, train_y)


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)

In [194]:
pred_y= clf.predict(train_x)
print(classification_report(train_y, pred_y))

             precision    recall  f1-score   support

          0       0.98      0.93      0.95       169
          1       0.98      0.99      0.98       487

avg / total       0.98      0.98      0.98       656



In [195]:
pred_y2= clf.predict(valid_x)
print(classification_report(valid_y, pred_y2))
#false negative, error in recall
diff=[y1==1 and y2==0 for y1, y2 in zip(valid_y, pred_y2)]
valid[diff]['URL'].tolist()

             precision    recall  f1-score   support

          0       0.98      0.89      0.93        45
          1       0.96      0.99      0.98       119

avg / total       0.96      0.96      0.96       164



['http://edgar.secdatabase.com/946/149315218009498/filelist.txt']

After tuning the model parameters, we arrived at a fairly accurate model that outperforms Logistic Regression. This result is not surprising. However, the Random Forest model has a downside. A prediction from a linear regression model can be implemented easily in any language without an external library, since it is nothing but a weighted sum of the inputs; logistic regression adds a sigmoid function at the end, but is still straightforward. A Random Forest is an ensemble model whose prediction is the average or majority vote of a number of small trees, so the algorithm is less trivial to implement, and the tree data has to be loaded as well.
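
As a rough sketch of how little code a logistic regression prediction needs, the snippet below uses only the standard library and the coefficients printed earlier in this notebook. The intercept is not shown in the notebook output, so `INTERCEPT = 0.0` is a placeholder assumption that would have to be replaced with `regr.intercept_`; the sample row and the function names are also made up.

```python
import math

FEATURES = ["size_corehtml", "num_pict", "avg_pict_size", "kw1", "kw2", "kw3"]
# Coefficients as printed by the notebook, in the order of FEATURES.
COEFFICIENTS = [-1.17231602e-02, 1.09428410e-02, 7.55687369e-04,
                5.24685845e-01, 1.96332677e+00, 2.24943295e-01]
# Placeholder assumption: the notebook does not print regr.intercept_.
INTERCEPT = 0.0


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(row, intercept=INTERCEPT):
    """Probability of the positive class: sigmoid of the weighted sum."""
    z = intercept + sum(w * row[name] for w, name in zip(COEFFICIENTS, FEATURES))
    return sigmoid(z)


def predict(row, threshold=0.5):
    """Return 1 (presentation) when the probability reaches the threshold."""
    return 1 if predict_proba(row) >= threshold else 0


# Made-up filing with several keyword hits.
row = {"size_corehtml": 30, "num_pict": 20, "avg_pict_size": 100,
       "kw1": 3, "kw2": 1, "kw3": 0}
print(predict_proba(row), predict(row))
```

The whole model is six weights plus one intercept, which is why it ports to other languages so easily; a forest would need every tree's split features, thresholds, and leaf values.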

# Predict for all the documents

In [200]:
dat=pd.read_csv("data/8k_present3.txt",sep="\t", names=\
        ["num_files","size_corehtml","num_pict","avg_pict_size","kw1","kw2","kw3"])
predict_y = clf.predict(dat[intputs])
dat["Prediction"]=predict_y

dat[dat["Prediction"]==1].to_csv("data/8k_presentation_predicts.csv", sep="\t")

### Save the final result  
After all this modeling, we have the final list of filings that contain presentations. However, one final step remains: finding the entry URL that takes viewers directly to the presentation.  
Our approach is pretty simple. If there is a PDF file, we use it; otherwise, we look among all the htm files for the one whose HTML body contains the picture file names. That must be the right one.  
Finally, we save the result in a file.
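
The selection rule can be tried out on in-memory data before touching the network. In this sketch, the file names, the HTML snippets, and the helper name `pick_entry_file` are assumptions made up for illustration.

```python
def pick_entry_file(file_names, html_bodies, picture_names):
    """Pick the file that most likely displays the presentation.

    file_names: names taken from the file list
    html_bodies: dict mapping an htm file name to its HTML text
    picture_names: names of the GRAPHIC files in the filing
    """
    # A PDF wins outright.
    pdfs = [name for name in file_names if name.endswith(".pdf")]
    if pdfs:
        return pdfs[0]

    htms = [name for name in file_names if name.endswith(".htm")]
    if not htms:
        return None

    def picture_mentions(name):
        # How often the picture file names appear in this document.
        body = html_bodies.get(name, "")
        return sum(body.count(picture) for picture in picture_names)

    # Otherwise take the htm file that references the pictures most often.
    return max(htms, key=picture_mentions)


files = ["filing-main.htm", "ex99-1.htm", "slide01.jpg", "slide02.jpg"]
bodies = {
    "filing-main.htm": "<p>Item 7.01 Regulation FD Disclosure</p>",
    "ex99-1.htm": '<img src="slide01.jpg"><img src="slide02.jpg">',
}
pictures = ["slide01.jpg", "slide02.jpg"]

entry = pick_entry_file(files, bodies, pictures)
print(entry)  # ex99-1.htm
```

Keeping the rule in a pure function like this makes it possible to check edge cases, such as a filing with no htm files, without downloading anything.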

In [204]:
def detechString(urlhtm, substr_list):
    content = MyUtil.requestContent(urlhtm)
    string =content.decode('utf-8', 'ignore')
    return sum([string.count(s) for s in substr_list])
    

urlz= dat[dat["Prediction"]==1].index.tolist()
res =set()
for urlt in urlz:
    urlp= urlt.replace('/filelist.txt','')
    cont = MyUtil.requestContent(urlt)
    ss = cont.decode('ascii')
    lines=ss.split('\r')
    toadd=""
    pdfs=[x.split('\t')[2] for x in lines \
          if len(x.strip())>0 and x.split('\t')[2][-3:]=='pdf']
    htms=[x.split('\t')[2] for x in lines \
          if  len(x.strip())>0 and x.split('\t')[2][-3:]=='htm']
    graphs = [x.split('\t')[2] for x in lines \
              if len(x.strip())>0 and x.split('\t')[1]=='GRAPHIC']
    if(not graphs): continue
        
    cnts= [detechString(urlp+"/"+htm, graphs) for htm in htms]
    #either pdf or the htm that contains all the graph filenames.
    toadd= pdfs[0] if pdfs else htms[cnts.index(max(cnts))]

    res.add(urlp+"/"+toadd)

with open("data/presentation_urls.txt", 'w') as outfile:
    outfile.write("\r".join(res))

One last thing: we save our Random Forest model so that we can use it in our daily production.

In [205]:
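try:
    import joblib  # standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23
except ImportError:
    from sklearn.externals import joblib  # fallback for older scikit-learn releases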
joblib.dump(clf, "data/presentation_rfmodel.pkl")

['data/presentation_rfmodel.pkl']