# Using Sherlock out-of-the-box
This notebook shows how to predict the semantic type of a table column with the pretrained Sherlock model.
The steps are:
- Download the files needed for word-embedding and paragraph-vector feature extraction (downloaded only once) and initialize the feature extraction models.
- Extract features from the table columns.
- Initialize Sherlock.
- Predict a semantic type from each column's feature representation.

In [1]:
import numpy as np
import pandas as pd
import pyarrow as pa

from sherlock import helpers
from sherlock.deploy.model import SherlockModel
from sherlock.functional import extract_features_to_csv
from sherlock.features.paragraph_vectors import initialise_pretrained_model, initialise_nltk
from sherlock.features.preprocessing import (
    extract_features,
    convert_string_lists_to_lists,
    prepare_feature_extraction,
    load_parquet_values,
)
from sherlock.features.word_embeddings import initialise_word_embeddings


In [2]:
%env PYTHONHASHSEED

'13'
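
The cell above only displays the variable. Python randomizes string hashing per process unless `PYTHONHASHSEED` is set, so a fixed value presumably keeps any hash-dependent step reproducible between runs. As a minimal sketch, the same check can be done from plain Python; the helper name `hash_seed_is_fixed` is hypothetical, and the expected value `"13"` is simply the one shown in the output above.

```python
import os


def hash_seed_is_fixed(expected="13", environ=None):
    """Return True if PYTHONHASHSEED is set to the expected value.

    `environ` defaults to os.environ; passing a dict makes the check testable.
    """
    env = os.environ if environ is None else environ
    return env.get("PYTHONHASHSEED") == expected


if not hash_seed_is_fixed():
    # The variable is read at interpreter start-up, so it has to be set
    # before the Python process (or notebook kernel) is launched.
    print("PYTHONHASHSEED is not '13'; set it before starting Python.")
```

Setting the variable from inside a running kernel does not change the hashing of that kernel, which is why the sketch only reports the problem instead of trying to fix it.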

## Initialize feature extraction models

In [2]:
prepare_feature_extraction()
initialise_word_embeddings()
initialise_pretrained_model(400)
initialise_nltk()

Preparing feature extraction by downloading 4 files:
        
 ../sherlock/features/glove.6B.50d.txt, 
 ../sherlock/features/par_vec_trained_400.pkl.docvecs.vectors_docs.npy,
        
 ../sherlock/features/par_vec_trained_400.pkl.trainables.syn1neg.npy, and 
 ../sherlock/features/par_vec_trained_400.pkl.wv.vectors.npy.
        
All files for extracting word and paragraph embeddings are present.
Initialising word embeddings
Initialise Word Embeddings process took 0:00:17.058032 seconds.
Initialise Doc2Vec Model, 400 dim, process took 0:00:28.667573 seconds. (filename = ../sherlock/features/par_vec_trained_400.pkl)
Initialised NLTK, process took 0:00:00.397791 seconds.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
import json


In [11]:
v = [
    ["Jane Smith", "Lute Ahorn", "Anna James"],
    ["Amsterdam", "Haarlem", "Zwolle"],
    ["Chabot Street 19", "1200 fifth Avenue", "Binnenkant 22, 1011BH"],
]

In [10]:
json_str = """{"ld":[
        ["Jane Smith", "Lute Ahorn", "Anna James"],
        ["Amsterdam", "Haarlem", "Zwolle"],
        ["Chabot Street 19", "1200 fifth Avenue", "Binnenkant 22, 1011BH"]
        ]}"""
j = json.loads(json_str)

In [47]:
# json_str was already parsed into j above; "ld" holds the list of columns
val = j["ld"]
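
When the columns arrive as a JSON string, as in the cells above, it can help to validate the payload before handing it to feature extraction. The following is a sketch only: the helper `columns_from_json` is hypothetical, the key `"ld"` is taken from the example payload, and coercing every cell to `str` is an assumption about what the feature extractors expect.

```python
import json


def columns_from_json(payload, key="ld"):
    """Parse a JSON string and return its columns as lists of strings."""
    columns = json.loads(payload)[key]
    if not isinstance(columns, list):
        raise ValueError(f"'{key}' must be a list of columns")
    for i, column in enumerate(columns):
        if not isinstance(column, list):
            raise ValueError(f"column {i} is not a list of values")
    # Assumption: cell values are treated as text, so coerce them to str.
    return [[str(cell) for cell in column] for column in columns]


example = '{"ld": [["Amsterdam", "Haarlem"], ["Jane Smith", "Anna James"]]}'
columns = columns_from_json(example)
print(len(columns), "columns parsed")
```

The returned list has the same shape as `v`, so it could be wrapped in `pd.Series(columns, name="values")` in the same way.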

In [53]:
data = pd.Series(val , name="values" )

In [12]:
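def do_pred(req_info):
    """Predict a semantic type for each column in req_info (a list of lists of values)."""
    # Each element of the Series holds the values of one column.
    data = pd.Series(req_info, name="values")

    # Extract features to a temporary CSV file, then load them as float32.
    extract_features(
        "../temporary.csv",
        data
    )
    feature_vectors = pd.read_csv("../temporary.csv", dtype=np.float32)

    # Load the pretrained Sherlock model together with its weights.
    # Note: the model is reloaded on every call.
    model = SherlockModel()
    model.initialize_model_from_json(with_weights=True, model_id="sherlock")

    predicted_labels = model.predict(feature_vectors, "sherlock")
    return predicted_labels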

do_pred(v)

Extracting Features: 100%|████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 49.84it/s]

Exporting 1588 column features


array(['person', 'city', 'address'], dtype=object)
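
`do_pred` builds a fresh model on every call. If it were called repeatedly, for example once per request, the model could be loaded once and reused. The sketch below illustrates that pattern with `functools.lru_cache`; `StandInModel` is a hypothetical placeholder so the example runs without Sherlock installed, and it is not the library's API. In the notebook, `get_model` would instead construct and initialize `SherlockModel` as shown above.

```python
from functools import lru_cache


class StandInModel:
    """Hypothetical placeholder for a model that is expensive to load."""

    instances = 0  # counts how many times the model was constructed

    def __init__(self):
        StandInModel.instances += 1

    def predict(self, feature_vectors):
        # Dummy prediction: one label per input row.
        return ["unknown" for _ in feature_vectors]


@lru_cache(maxsize=1)
def get_model():
    """Build the model on first use and return the same object afterwards."""
    return StandInModel()


def predict_cached(feature_vectors):
    return get_model().predict(feature_vectors)


predict_cached([[0.0], [1.0]])
predict_cached([[2.0]])
print(StandInModel.instances)  # 1: the second call reused the cached model
```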

## Extract features

In [50]:
data = pd.Series(
    [
        ["Jane Smith", "Lute Ahorn", "Anna James"],
        ["Amsterdam", "Haarlem", "Zwolle"],
        ["Chabot Street 19", "1200 fifth Avenue", "Binnenkant 22, 1011BH"]
    ],
    name="values"
)

In [54]:
data

0                 [Jane Smith, Lute Ahorn, Anna James]
1                         [Amsterdam, Haarlem, Zwolle]
2    [Chabot Street 19, 1200 fifth Avenue, Binnenka...
Name: values, dtype: object

In [55]:
extract_features(
    "../temporary.csv",
    data
)
feature_vectors = pd.read_csv("../temporary.csv", dtype=np.float32)

Extracting Features: 100%|████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 67.44it/s]

Exporting 1588 column features





In [10]:
feature_vectors

| | n_[0]-agg-any | n_[0]-agg-all | n_[0]-agg-mean | n_[0]-agg-var | n_[0]-agg-min | n_[0]-agg-max | n_[0]-agg-median | n_[0]-agg-sum | n_[0]-agg-kurtosis | n_[0]-agg-skewness | ... | par_vec_390 | par_vec_391 | par_vec_392 | par_vec_393 | par_vec_394 | par_vec_395 | par_vec_396 | par_vec_397 | par_vec_398 | par_vec_399 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -3.0 | 0.0 | ... | -0.114918 | 0.024795 | -0.130516 | 0.005319 | -0.135544 | -0.071919 | -0.051267 | -0.067097 | 0.087215 | -0.1451 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -3.0 | 0.0 | ... | -0.053141 | 0.023613 | -0.166896 | -0.014876 | -0.058541 | 0.008594 | -0.046941 | 0.02512 | 0.036836 | -0.088337 |
| 2 | 1.0 | 0.0 | 1.0 | 0.666667 | 0.0 | 2.0 | 1.0 | 3.0 | -1.5 | 0.0 | ... | -0.022528 | 0.000734 | 0.04697 | 0.119519 | -0.09343 | 0.035313 | -0.004014 | -0.089037 | -0.118543 | -0.190988 |


## Initialize Sherlock

In [56]:
model = SherlockModel();
model.initialize_model_from_json(with_weights=True, model_id="sherlock");

## Predict semantic type for column

In [57]:
predicted_labels = model.predict(feature_vectors, "sherlock")

In [58]:
predicted_labels

array(['name', 'city', 'address'], dtype=object)
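
The prediction is a flat array with one label per input column, in the same order as the columns. A small sketch of pairing each label with sample values from its column follows; the helper `label_columns` is hypothetical, and the labels are copied from the output above instead of being computed, so the example runs without Sherlock.

```python
def label_columns(columns, labels, n_samples=2):
    """Pair each predicted label with a few sample values from its column."""
    if len(columns) != len(labels):
        raise ValueError("expected exactly one label per column")
    return [
        {"label": str(label), "samples": list(column[:n_samples])}
        for column, label in zip(columns, labels)
    ]


columns = [
    ["Jane Smith", "Lute Ahorn", "Anna James"],
    ["Amsterdam", "Haarlem", "Zwolle"],
    ["Chabot Street 19", "1200 fifth Avenue", "Binnenkant 22, 1011BH"],
]
labels = ["name", "city", "address"]  # copied from the output above

for row in label_columns(columns, labels):
    print(row["label"], "<-", row["samples"])
```

Checking the lengths first guards against silently misaligned labels, since `zip` would otherwise stop at the shorter sequence.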