# Predicting on new data with the Tf-Keras patent or citation count models

This script shows how to predict on new data using the Tf-Keras objects provided in the GitHub directory for the finance research paper "Patent text analytics and venture-backed innovation".

In [1]:
import joblib

import numpy as np
import pandas as pd

from keras.models import load_model
import keras.backend as K

from sklearn.preprocessing import MinMaxScaler

Using TensorFlow backend.


In [2]:
np.__version__

'1.18.1'

In [3]:
joblib.__version__

'0.14.1'

In [4]:
pd.__version__

'0.25.3'

The models expect 31 input variables, in the following order:

The variable of interest in the research paper:
- log.Cosine: log of cosine similarity between the description text of the patent of interest and that of a peer company's patent 
(non-negative float)

For the company creating the patent of interest (prepended with 'pat1', or patent 1): 
- pat1.DistKM: Distance in kilometers between the company producing the patent of interest and its VC's headquarters. 
Great-circle distance is used for international pairs and driving distance for domestic pairs (non-negative float)
- pat1.Co.Age: Age in days between the founding of the company producing the patent of interest and its first VC round; can be negative 
for companies that received seed money (int)
- pat1.Buyout.Stage: Indicator variable, was the company producing the patent of interest in the buyout stage or not? (1 or 0)
- pat1.Early.Stage: Indicator variable, was the company producing the patent of interest in the early stage or not? (1 or 0)
- pat1.Expansion.Stage: Indicator variable, was the company producing the patent of interest in the expansion stage or not? (1 or 0)
- pat1.Later.Stage: Indicator variable, was the company producing the patent of interest in the later stage or not? (1 or 0)
- pat1.Startup.Seed: Indicator variable, was the company producing the patent of interest in the startup/seed stage or not? (1 or 0)
- pat1.Stage.Other: Indicator variable, was the company producing the patent of interest in a stage other than above or not? (1 or 0)

Shared industry variables between company producing patent of interest and the peer company of comparison to which its patent text 
is compared. Note that only one of the following six variables should take a value of one, while the rest should be zero:
- x.Biotech: Indicator variable, are the companies in the biotech industry or not? (1 or 0)
- x.Comm.Media: Indicator variable, are the companies in the communications and media industry or not? (1 or 0)
- x.Comp.Rel: Indicator variable, are the companies in a computer-related industry or not? (1 or 0)
- x.Medical: Indicator variable, are the companies in the medical industry or not? (1 or 0)
- x.NonHT: Indicator variable, are the companies in a non-high tech industry or not? (1 or 0)
- x.Semi: Indicator variable, are the companies in the semiconductor or another high-tech industry or not? (1 or 0)

For the company of comparison, which produces the patent against which text similarity is measured (prepended with 'pat2', or patent 2):
- pat2.DistKM: Distance in kilometers between the company of comparison and its VC's headquarters. 
Great-circle distance is used for international pairs and driving distance for domestic pairs (non-negative float)
- pat2.Co.Age: Age in days between the founding of the company of comparison and its first VC round (int)
- pat2.Buyout.stage: Indicator variable, was the company of comparison in the buyout stage or not? (1 or 0)
- pat2.Early.Stage: Indicator variable, was the company of comparison in the early stage or not? (1 or 0)
- pat2.Expansion.Stage: Indicator variable, was the company of comparison in the expansion stage or not? (1 or 0)
- pat2.Later.Stage: Indicator variable, was the company of comparison in the later stage or not? (1 or 0)
- pat2.Startup.Seed: Indicator variable, was the company of comparison in the startup/seed stage or not? (1 or 0)
- pat2.Stage.Other: Indicator variable, was the company of comparison in a stage other than above or not? (1 or 0)

Shared economic and linguistic variables between company producing patent of interest and the peer company of comparison to which 
its patent text is compared (see the research paper for exact definitions):
- pat1.Cr.Dev: credit development of the country of domicile of the company producing the patent of interest (positive float)
- pat1.Eq.Dev: equity development of the country of domicile of the company producing the patent of interest (positive float)
- pat2.Cr.Dev: credit development of the country of domicile of the company of comparison (positive float)
- pat2.Eq.Dev: equity development of the country of domicile of the company of comparison (positive float)
- pat1.Firm.ling.dist: Measure of linguistic distance for language of first VC firm for company of interest from English (float, 0-6)
- pat1.Comp.ling.dist: Measure of linguistic distance for language of company of interest from English (float, 0-6)
- pat2.Firm.ling.dist: Measure of linguistic distance for language of first VC firm for company of comparison from English (float, 0-6)
- pat2.Comp.ling.dist: Measure of linguistic distance for language of company of comparison from English (float, 0-6)
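
Because the models rely on position rather than on names, one way to avoid ordering mistakes is to build each row from named values. The sketch below is illustrative only: `FEATURE_ORDER` copies the names from the list above (including the lowercase `pat2.Buyout.stage` as written there), and `to_feature_vector` is a hypothetical helper, not part of the repository. It assumes that omitted variables should default to 0, which is convenient for the indicator variables but would silently zero out a forgotten continuous variable.

```python
import numpy as np

# Column order copied from the variable list above (31 names).
FEATURE_ORDER = [
    "log.Cosine",
    "pat1.DistKM", "pat1.Co.Age",
    "pat1.Buyout.Stage", "pat1.Early.Stage", "pat1.Expansion.Stage",
    "pat1.Later.Stage", "pat1.Startup.Seed", "pat1.Stage.Other",
    "x.Biotech", "x.Comm.Media", "x.Comp.Rel", "x.Medical", "x.NonHT", "x.Semi",
    "pat2.DistKM", "pat2.Co.Age",
    "pat2.Buyout.stage", "pat2.Early.Stage", "pat2.Expansion.Stage",
    "pat2.Later.Stage", "pat2.Startup.Seed", "pat2.Stage.Other",
    "pat1.Cr.Dev", "pat1.Eq.Dev", "pat2.Cr.Dev", "pat2.Eq.Dev",
    "pat1.Firm.ling.dist", "pat1.Comp.ling.dist",
    "pat2.Firm.ling.dist", "pat2.Comp.ling.dist",
]


def to_feature_vector(record):
    """Hypothetical helper: order a {name: value} mapping into a 1-D array.

    Assumption: names missing from `record` default to 0.0.
    Names that are not in FEATURE_ORDER raise a KeyError.
    """
    unknown = set(record) - set(FEATURE_ORDER)
    if unknown:
        raise KeyError(f"Unknown variable names: {sorted(unknown)}")
    return np.array([float(record.get(name, 0.0)) for name in FEATURE_ORDER])


# Values taken from the sample row used later in this notebook.
vec = to_feature_vector({
    "log.Cosine": 0.017126015,
    "pat1.DistKM": 12254.9, "pat1.Co.Age": -153, "pat1.Startup.Seed": 1,
    "x.Semi": 1,
    "pat2.DistKM": 112.6, "pat2.Co.Age": 1003, "pat2.Early.Stage": 1,
    "pat1.Cr.Dev": 117.205201, "pat1.Eq.Dev": 59.781144,
    "pat2.Cr.Dev": 94.979948, "pat2.Eq.Dev": 139.973378,
    "pat1.Firm.ling.dist": 0, "pat1.Comp.ling.dist": 6,
    "pat2.Firm.ling.dist": 0.9, "pat2.Comp.ling.dist": 0.9,
})
print(vec.shape)
```

The resulting array has the same layout as a row loaded from the CSV file, so it can be passed through the same scaling and prediction steps.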

In [5]:
# load data as a numpy array for prediction. 
sample = np.loadtxt(fname="Sample_data.csv", 
                    delimiter=",", 
                    skiprows=1)

In [6]:
sample

array([ 1.71260150e-02,  1.22549000e+04, -1.53000000e+02,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  1.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  1.00000000e+00,  1.12600000e+02,
        1.00300000e+03,  0.00000000e+00,  1.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  1.17205201e+02,
        5.97811440e+01,  9.49799480e+01,  1.39973378e+02,  0.00000000e+00,
        6.00000000e+00,  9.00000000e-01,  9.00000000e-01])

In [7]:
len(sample)

31
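
`np.loadtxt` returns a 1-D array when the file holds a single data row and a 2-D array when it holds several, so code written for one case can break on the other. A minimal sketch of a loader that always returns shape `(n_rows, 31)` follows; `load_rows` is a hypothetical helper, and the in-memory CSV text stands in for a real file.

```python
import io

import numpy as np

N_FEATURES = 31


def load_rows(source):
    """Hypothetical helper: load one or more rows as a 2-D float array.

    `source` is a file name or file-like object with one header line.
    """
    rows = np.loadtxt(source, delimiter=",", skiprows=1)
    rows = np.atleast_2d(rows)  # a single data row becomes shape (1, 31)
    if rows.shape[1] != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} columns, got {rows.shape[1]}")
    return rows


# Made-up CSV text used only to demonstrate the shapes.
header = ",".join(f"v{i}" for i in range(N_FEATURES))
one_row = header + "\n" + ",".join(["0"] * N_FEATURES) + "\n"
two_rows = one_row + ",".join(["1"] * N_FEATURES) + "\n"

print(load_rows(io.StringIO(one_row)).shape)
print(load_rows(io.StringIO(two_rows)).shape)
```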

In [8]:
# Check values are within acceptable ranges, also defined in data dictionary

# Cosine similarity
assert sample[0] >= 0

# Distance between company of interest and its first VC firm
assert sample[1] >= 0

# Check exactly one of the six industry indicator variables is one
assert sample[9:15].sum() == 1

# Distance between company of comparison and its first VC firm
assert sample[15] >= 0

# Language variables for both companies and both VC firms
assert sample[27] >= 0 and sample[27] <= 6
assert sample[28] >= 0 and sample[28] <= 6
assert sample[29] >= 0 and sample[29] <= 6
assert sample[30] >= 0 and sample[30] <= 6
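
When several rows are predicted at once, the same range checks can be applied column-wise. The sketch below assumes the 31-column layout described in the data dictionary; `validate_rows` is a hypothetical helper that returns a list of problems instead of raising, so that every issue in a batch is reported together.

```python
import numpy as np


def validate_rows(rows):
    """Hypothetical helper: return a list of range problems (empty if none)."""
    rows = np.atleast_2d(np.asarray(rows, dtype=float))
    if rows.shape[1] != 31:
        raise ValueError(f"expected 31 columns, got {rows.shape[1]}")

    problems = []
    if (rows[:, 0] < 0).any():
        problems.append("log.Cosine must be non-negative")
    if (rows[:, [1, 15]] < 0).any():
        problems.append("pat1.DistKM and pat2.DistKM must be non-negative")

    industry = rows[:, 9:15]
    is_binary = np.isin(industry, (0, 1)).all()
    if not is_binary or not (industry.sum(axis=1) == 1).all():
        problems.append("exactly one industry indicator must equal 1")

    ling = rows[:, 27:31]
    if ((ling < 0) | (ling > 6)).any():
        problems.append("linguistic distances must lie in [0, 6]")
    return problems


# Made-up rows: a minimal valid row and a copy with two deliberate errors.
good = np.zeros(31)
good[14] = 1          # x.Semi
bad = good.copy()
bad[9] = 1            # a second industry indicator set to 1
bad[28] = 7           # linguistic distance above 6

print(validate_rows(good))
print(validate_rows(np.vstack([good, bad])))
```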

In [9]:
# Reading in the MinMaxScaler file with joblib
scalerfile = 'Patent_text_cosine_similarity_training_MinMaxScaler-2-14-20.save'
scaler = joblib.load(scalerfile)

# Transforming sample data
# (scaler.transform() expects a 2-D array of shape (n_samples, 31);
# sample is a single 1-D row, so it is reshaped to (1, 31))
scaled_data = scaler.transform(sample.reshape(-1, 31))

In [10]:
# The data is now rescaled using the per-variable minimum and maximum of the training data for these predictive objects,
# so values inside the training range fall between 0 and 1
scaled_data

array([[0.02549758, 0.800606  , 0.17048587, 0.        , 0.        ,
        0.        , 0.        , 1.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 1.        ,
        0.0073561 , 0.27007236, 0.        , 1.        , 0.        ,
        0.        , 0.        , 0.        , 0.4664382 , 0.10617387,
        0.35250498, 0.26557875, 0.        , 0.85714286, 0.12857143,
        0.12857143]])
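
With its default `feature_range` of (0, 1), `MinMaxScaler` rescales each column as `(x - min) / (max - min)`, where `min` and `max` were learned from the training data. A consequence is that new values outside the training range land outside [0, 1]. The NumPy sketch below reproduces that formula on made-up numbers; `train_min` and `train_max` are illustrative and are not the values stored in the saved scaler.

```python
import numpy as np


def min_max_transform(x, data_min, data_max):
    """Column-wise min-max scaling to the (0, 1) range of the training data."""
    return (x - data_min) / (data_max - data_min)


# Made-up training ranges for two columns (not the real scaler values).
train_min = np.array([0.0, -500.0])
train_max = np.array([10.0, 1500.0])

new_rows = np.array([
    [2.5, 0.0],      # inside the training range
    [12.0, -500.0],  # first value exceeds the training maximum
])

scaled = min_max_transform(new_rows, train_min, train_max)
print(scaled)
```

The first row scales to values inside [0, 1], while the 12.0 in the second row scales to 1.2, which signals that the model is being asked to extrapolate beyond its training data.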

In [11]:
def patent_value_loss(y_true, y_pred):
    """
    Custom loss used when training the patent count model.
    Because the model was compiled with it, it must be passed through
    custom_objects when that model is loaded for prediction.
    Remove the custom_objects argument below if you are predicting on
    citation counts (measure of quality).
    """
    loss = K.abs(1 - K.exp(y_true - y_pred)) * 50000
    return loss
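
To see what this loss penalizes, note that when the targets are natural logs, `exp(y_true - y_pred)` is the ratio of the true value to the predicted value on the original scale, so the loss is 50000 times the absolute deviation of that ratio from 1. The NumPy re-implementation below is only an illustration with made-up inputs (it is not used for loading the model) and assumes natural-log targets.

```python
import numpy as np


def patent_value_loss_np(y_true, y_pred):
    """NumPy version of the loss formula, for illustration only."""
    return np.abs(1 - np.exp(y_true - y_pred)) * 50000


y_true = np.log(10.0)  # made-up log target

exact = patent_value_loss_np(y_true, y_true)
under = patent_value_loss_np(y_true, y_true - np.log(2.0))  # prediction is half the truth
over = patent_value_loss_np(y_true, y_true + np.log(2.0))   # prediction is twice the truth

print(exact, under, over)
```

The loss is asymmetric: predicting half the true value costs about 50000, while predicting twice the true value costs about 25000, so under-prediction is penalized more heavily than over-prediction by the same factor.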

In [12]:
patent_model = load_model("patent_text_model_epoch_no.030-2-14-20.h5", 
                          custom_objects={'patent_value_loss': patent_value_loss})

In [13]:
prediction = patent_model.predict(scaled_data)

In [14]:
prediction

array([[1.6938591]], dtype=float32)

The prediction is on the log scale and can be back-transformed to obtain the patent count (or the citation count, when the citation model is used).
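
A minimal sketch of the back-transformation, assuming the target was the natural log of the count (the `exp` in the loss function suggests this, but the research paper is the authority on the exact transformation). If the target was instead `log(1 + count)`, `np.expm1` would be the matching inverse.

```python
import numpy as np

# The prediction printed above, copied here so the example is self-contained.
prediction = np.array([[1.6938591]], dtype=np.float32)

# Assumption: the model was trained on natural-log counts.
count_estimate = np.exp(prediction)
print(float(count_estimate[0, 0]))
```

Under that assumption the sample row corresponds to roughly five and a half patents.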