# RLTK Notebook

This notebook is a walkthrough of RLTK.

Useful links:
- [GitHub Repository](https://github.com/usc-isi-i2/rltk)
- [Documentation](http://rltk.readthedocs.io/en/latest/)

## Installation

1. Get RLTK from GitHub (it will be uploaded to PyPI later).
```
git clone https://github.com/usc-isi-i2/rltk.git
```
2. Install the dependencies.
```
pip install -r requirements.txt
```
3. For testing purposes, install [pytest](http://doc.pytest.org/en/latest/) and run the test cases.
```
python -m pytest
```

## Quick Start

Let's start with a simple example. To compute the Levenshtein distance between two sequences, initialize RLTK and invoke the measurement method. All files used in the examples can be found in the `examples` folder.

In [1]:
# These two lines are only needed to locate the package.
import sys
sys.path.append('..')

import rltk
tk = rltk.init()
tk.levenshtein_distance('a', 'abc')

2

For similarity measurements, RLTK supports both a `distance` form and a `similarity` form.

In [2]:
tk.levenshtein_similarity('a', 'abc')

0.33333333333333337
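
To make the two results above concrete, here is a minimal pure-Python sketch of the Levenshtein distance and a normalized similarity. It is not RLTK's implementation; in particular, the normalization `1 - distance / max(len)` is an assumption that happens to agree with the output shown above.

```python
def levenshtein_distance(s1, s2):
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of s1 and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute, free on a match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(s1, s2):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(s1, s2) / float(longest)


print(levenshtein_distance('a', 'abc'))    # 2
print(levenshtein_similarity('a', 'abc'))
```

The sketch keeps only two rows of the dynamic-programming table, so memory grows with the length of the second string rather than with the product of both lengths.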

Some methods need extra resources. In RLTK, you simply load a resource and give it a name, then tell RLTK which resource to use when invoking a method. This example shows how to compute a Levenshtein distance with customized weights.

In [3]:
edit_distance_cost = {'insert': {'c':50}, 'insert_default':100, 'delete_default':100, 'substitute_default':100}

tk.load_edit_distance_table('A1', edit_distance_cost)
tk.levenshtein_distance('a', 'abc', name='A1')

150
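
The result of 150 can be traced by hand: turning `'a'` into `'abc'` takes one insertion of `'b'` at the default cost (100) and one insertion of `'c'` at its custom cost (50). The sketch below reproduces that arithmetic. It is a simplified illustration, not RLTK's code: the lookup rule (a per-character entry overrides `<operation>_default`) is an assumption, and substitutions only use the default cost.

```python
def weighted_levenshtein(s1, s2, costs):
    """Edit distance where insert and delete costs are looked up per character."""
    def cost(op, ch):
        # Assumed rule: a per-character entry wins, otherwise use '<op>_default'.
        return costs.get(op, {}).get(ch, costs[op + '_default'])

    # previous[j] is the cheapest way to turn the processed prefix of s1 into s2[:j]
    previous = [0]
    for c2 in s2:
        previous.append(previous[-1] + cost('insert', c2))
    for c1 in s1:
        current = [previous[0] + cost('delete', c1)]
        for j, c2 in enumerate(s2, start=1):
            # Simplification: substitutions always use the default cost
            substitute = 0 if c1 == c2 else costs['substitute_default']
            current.append(min(
                previous[j] + cost('delete', c1),
                current[j - 1] + cost('insert', c2),
                previous[j - 1] + substitute,
            ))
        previous = current
    return previous[-1]


edit_distance_cost = {'insert': {'c': 50}, 'insert_default': 100,
                      'delete_default': 100, 'substitute_default': 100}
print(weighted_levenshtein('a', 'abc', edit_distance_cost))  # 150
```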

If a method takes a file path as input, the path can be absolute or relative. A relative path is resolved against `root_path`. Use `set_root_path` to change it.

In [4]:
tk.set_root_path('../examples')
tk.get_root_path()

'/home/zege/ISI/rltk/examples'
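
The resolution rule described above can be sketched with the standard library. The helper name `resolve_path` is hypothetical and only illustrates the idea; RLTK's internal logic may differ.

```python
import os


def resolve_path(root_path, file_path):
    """Return file_path unchanged if it is absolute, otherwise anchor it at root_path."""
    if os.path.isabs(file_path):
        return file_path
    return os.path.normpath(os.path.join(root_path, file_path))


root = os.path.abspath('../examples')
print(resolve_path(root, 'df_corpus_1.txt'))
```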

You can also load resources from files, for example with `load_df_corpus`. In a text file, each line contains tokens separated by whitespace, and RLTK treats each line as a document. [Example file: [df_corpus_1.txt](df_corpus_1.txt)]

In [5]:
tk.load_df_corpus('B1', 'df_corpus_1.txt', file_type='text', mode='append')
tk.tf_idf_similarity(['a', 'b', 'a'], ['a', 'c', 'd', 'f'], name='B1')

0.17541160386140583
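
As an illustration of what a document-frequency corpus holds, the sketch below counts, for each token, how many lines (documents) contain it. The corpus content is made up for the example and does not reproduce `df_corpus_1.txt`, and the function is an assumption about the general technique rather than RLTK's loader.

```python
import io
from collections import Counter


def document_frequency(lines):
    """Count, for every token, the number of documents (lines) that contain it."""
    df = Counter()
    doc_count = 0
    for line in lines:
        tokens = line.split()   # whitespace-separated tokens
        if not tokens:
            continue            # skip blank lines
        doc_count += 1
        df.update(set(tokens))  # each token counts once per document
    return df, doc_count


# Hypothetical corpus content standing in for a text file
corpus = io.StringIO(u"a b c\na a d\nb f\n")
df, doc_count = document_frequency(corpus)
print(doc_count, df['a'], df['b'], df['f'])  # 3 2 2 1
```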

`load_df_corpus` also supports loading a JSON Lines file (each line is a JSON object). `json_path` is used to extract the field(s) that form the document. [Example file: [jl_file_1.jsonl](jl_file_1.jsonl)]

In [6]:
tk.load_df_corpus('B2', 'jl_file_1.jsonl', file_type='json_lines', json_path='desc[*]', mode='append')
tk.tf_idf_similarity(['abc'], ['abc', 'def'], name='B2')

0.8944271909999159
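
The following sketch shows the general idea of reading a JSON Lines source and pulling one field out of each object as a document. It uses a plain key lookup instead of a JSONPath expression such as `desc[*]`, and the records are hypothetical, so treat it as an approximation of the behavior rather than RLTK's loader.

```python
import io
import json


def iter_documents(lines, field):
    """Yield the tokens stored under `field` for every JSON object, one object per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        value = record.get(field, [])
        # Accept both a single token and a list of tokens
        yield value if isinstance(value, list) else [value]


# Hypothetical records; the real jl_file_1.jsonl may look different
jsonl = io.StringIO(u'{"id": 1, "desc": ["abc", "def"]}\n{"id": 2, "desc": ["abc"]}\n')
documents = list(iter_documents(jsonl, 'desc'))
print(documents)  # [['abc', 'def'], ['abc']]
```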

Make sure to use a resource of the matching type; otherwise an exception is raised.

In [7]:
tk.levenshtein_similarity('a', 'abc', name='B1')

ValueError: Invalid name or type

## Compute Feature Vector

Let's go a step further. Suppose you have the following JSON dictionaries:

In [8]:
j1 = {'id': 1, 'name': 'abc', 'gender': 'male'}
j2 = {'id': '2', 'name': 'bcd', 'gender': 'male'}

To compute a feature vector from different fields with different feature functions, load a configuration file [Example file: [feature_config_1.json](feature_config_1.json)] and invoke `compute_feature_vector`. The complete explanation of the configuration can be found [here](http://rltk.readthedocs.io/en/latest/rltk.html#core.Core.load_feature_configuration). In the config file, you can define the id of the entity, how exceptions are handled, and the feature functions with their parameters (a field name starting with `_` is a comment and will be ignored).

In [9]:
tk.load_feature_configuration('C1', 'feature_config_1.json')
tk.compute_feature_vector(j1, j2, name='C1')

{'feature_vector': [0.33333333333333337, 1.0], 'id': [1, '2']}
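
To show how a configuration can drive feature computation, here is a small self-contained sketch. The configuration layout, the key names (`id_field`, `features`, `function`, `field`) and the two feature functions are all hypothetical and are not RLTK's schema; see the linked documentation for the real format. Because the feature functions differ, the numbers differ from the output above.

```python
def exact_match(a, b):
    """1.0 when both values are equal, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


def char_jaccard(a, b):
    """Jaccard similarity over the sets of characters in each string."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / float(len(sa | sb))


FEATURE_FUNCTIONS = {'exact_match': exact_match, 'char_jaccard': char_jaccard}

# Hypothetical configuration layout, not RLTK's actual schema
config = {
    'id_field': 'id',
    '_note': 'keys starting with "_" are comments and are never read',
    'features': [
        {'function': 'char_jaccard', 'field': 'name'},
        {'function': 'exact_match', 'field': 'gender'},
    ],
}


def compute_feature_vector(obj1, obj2, config):
    """Apply every configured feature function to the matching fields of both objects."""
    vector = []
    for feature in config['features']:
        func = FEATURE_FUNCTIONS[feature['function']]
        field = feature['field']
        vector.append(func(obj1[field], obj2[field]))
    id_field = config['id_field']
    return {'id': [obj1[id_field], obj2[id_field]], 'feature_vector': vector}


j1 = {'id': 1, 'name': 'abc', 'gender': 'male'}
j2 = {'id': '2', 'name': 'bcd', 'gender': 'male'}
result = compute_feature_vector(j1, j2, config)
print(result)
```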

## Classification

After getting feature vectors, you can featurize them with ground truth. [Example files: [feature_file_1.jsonl](feature_file_1.jsonl), [ground_truth_1.jsonl](ground_truth_1.jsonl)]

In [10]:
tk.featurize_ground_truth('feature_file_1.jsonl', 'ground_truth_1.jsonl')

{u'id': [3, 4], u'feature_vector': [0.3, 0.4], 'label': 0.8}


If `output_file_path` is not set, the result is printed to STDOUT.
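
Conceptually, featurizing with ground truth is a join between feature records and labels on the id pair. The sketch below illustrates that join on in-memory records. The record layout is inferred from the output shown above, and the function is an illustrative assumption, not RLTK's implementation.

```python
def featurize_ground_truth(feature_records, ground_truth_records):
    """Attach the ground-truth label to every feature record with a matching id pair."""
    labels = {tuple(gt['id']): gt['label'] for gt in ground_truth_records}
    featurized = []
    for record in feature_records:
        key = tuple(record['id'])
        if key in labels:  # pairs without ground truth are skipped
            labeled = dict(record)
            labeled['label'] = labels[key]
            featurized.append(labeled)
    return featurized


# Hypothetical in-memory records in place of the two .jsonl files
features = [{'id': [1, 2], 'feature_vector': [0.1, 0.2]},
            {'id': [3, 4], 'feature_vector': [0.3, 0.4]}]
ground_truth = [{'id': [3, 4], 'label': 0.8}]
featurized = featurize_ground_truth(features, ground_truth)
print(featurized)
```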

Once the featurized ground truth is generated, supervised learning can be used to train a classifier.

In [11]:
featurized_ground_truth = [
    {'feature_vector': [0, 0], 'label': [0], 'id': [1, 2]},
    {'feature_vector': [1, 1], 'label': [1], 'id': [3, 4]}
]
model = tk.train_classifier(featurized_ground_truth, config={'function': 'svm'})
print(model.predict([[2., 2.]]))

[1]
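
For readers who want to see the same training step outside RLTK, the sketch below fits a support vector machine directly with scikit-learn. That the `'svm'` option corresponds to something like scikit-learn's `SVC` is an assumption, and the linear kernel is chosen here only to keep the toy example easy to reason about.

```python
from sklearn.svm import SVC

featurized_ground_truth = [
    {'feature_vector': [0, 0], 'label': [0], 'id': [1, 2]},
    {'feature_vector': [1, 1], 'label': [1], 'id': [3, 4]},
]

# Split the records into a feature matrix and a flat label list
X = [record['feature_vector'] for record in featurized_ground_truth]
y = [record['label'][0] for record in featurized_ground_truth]

model = SVC(kernel='linear')
model.fit(X, y)
prediction = model.predict([[2., 2.]])
print(prediction)  # [1]
```

The point `[2., 2.]` lies beyond `[1, 1]` on the side away from `[0, 0]`, so a linear separator between the two training points assigns it label 1.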


## Performance Optimization

For large data sets, many calculations are duplicated. For TF-IDF, RLTK supports pre-computing term frequency (TF) and inverse document frequency (IDF). This is especially useful when computing similarities over pairwise data.

```python

# load test data and initialize rltk
...

# compute and cache tf
cached_tf = []
for i in test_data:
    cached_tf.append(tk.compute_tf(i))

# compute idf
tk.compute_idf('corpus_wc', new_name='corpus_wc_cached', math_log=True)

# compute similarity
for i in range(total):
    for j in range(total):
        tk.cached_tf_idf_similarity(
            test_data[i], test_data[j], cached_tf[i], cached_tf[j], idf_name='corpus_wc_cached')

```
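
The benefit of caching can be seen in a self-contained sketch: term frequencies are computed once per document and IDF once per corpus, so the pairwise loop only does dictionary lookups and arithmetic. The formulas used here (relative term frequency, `log(N / df)`, cosine similarity) are common textbook choices and are assumptions; RLTK's exact weighting may differ, so the numbers are not expected to match RLTK's output.

```python
import math
from collections import Counter


def compute_tf(tokens):
    """Relative term frequency of every token in one document."""
    counts = Counter(tokens)
    total = float(len(tokens))
    return {term: count / total for term, count in counts.items()}


def compute_idf(documents):
    """log(N / df) for every term, computed once for the whole corpus."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    n = float(len(documents))
    return {term: math.log(n / count) for term, count in df.items()}


def cached_tf_idf_similarity(tf1, tf2, idf):
    """Cosine similarity of two TF-IDF vectors built from cached TF and IDF."""
    v1 = {t: w * idf.get(t, 0.0) for t, w in tf1.items()}
    v2 = {t: w * idf.get(t, 0.0) for t, w in tf2.items()}
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


# Hypothetical test data
test_data = [['a', 'b', 'a'], ['a', 'c', 'd', 'f'], ['b', 'c']]

# Cache TF per document and IDF per corpus before the pairwise loop
cached_tf = [compute_tf(doc) for doc in test_data]
idf = compute_idf(test_data)

total = len(test_data)
similarities = [[cached_tf_idf_similarity(cached_tf[i], cached_tf[j], idf)
                 for j in range(total)] for i in range(total)]
print(similarities[0][1])
```

With `n` documents the pairwise loop runs `n * n` times, while the TF and IDF work is done only `n` times and once, respectively.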