# RLTK Notebook

This notebook walks through the basics of RLTK.

Useful links:
- [GitHub Repository](https://github.com/usc-isi-i2/rltk)
- [Documentation](http://rltk.readthedocs.io/en/latest/)

## Installation

1. Get RLTK from GitHub (it will be uploaded to PyPI later).
```
git clone https://github.com/usc-isi-i2/rltk.git
```
2. Create a virtual environment and install the dependencies ([Conda](https://github.com/conda/conda) must be installed).
```
conda-env create .
source activate rltk_env
```
3. For testing, install [pytest](http://doc.pytest.org/en/latest/) and run the test cases.
```
python -m pytest
```

## Quick Start

Let's start with a simple example. To compute the Levenshtein distance between two sequences, initialize RLTK and call the corresponding method. All files used in the examples can be found in the `examples` folder.

In [7]:
# these two lines are just for locating the package.
import sys
sys.path.append('..')

import rltk
tk = rltk.init()
tk.levenshtein_distance('a', 'abc')

2

For similarity measures, RLTK provides both a `distance` and a `similarity` variant.

In [8]:
tk.levenshtein_similarity('a', 'abc')

0.33333333333333337
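
The value above is consistent with normalizing the distance by the length of the longer sequence, that is `1 - distance / max(len)`. The following is a minimal pure-Python sketch under that assumption. `levenshtein` and `normalized_similarity` are illustrative names, not RLTK functions, and RLTK's actual implementation may differ.

```python
def levenshtein(s1, s2):
    """Classic dynamic-programming edit distance with unit costs."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # delete c1
                            curr[j - 1] + 1,             # insert c2
                            prev[j - 1] + (c1 != c2)))   # substitute or match
        prev = curr
    return prev[-1]


def normalized_similarity(s1, s2):
    """Assumed normalization: 1 - distance / length of the longer input."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    return 1.0 - float(levenshtein(s1, s2)) / longest
```

With these definitions, `levenshtein('a', 'abc')` is 2 and `normalized_similarity('a', 'abc')` is roughly one third, in line with the two cells above.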

Some methods need extra resources. In RLTK, you simply load a resource and give it a name, then tell RLTK which resource to use when invoking a method. This example shows how to compute a Levenshtein distance with customized weights.

In [9]:
edit_distance_cost = {'insert': {'c':50}, 'insert_default':100, 'delete_default':100, 'substitute_default':100}

tk.load_edit_distance_table('A1', edit_distance_cost)
tk.levenshtein_distance('a', 'abc', name='A1')

150
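
To see where 150 comes from: turning `'a'` into `'abc'` takes two insertions, `'b'` at the default cost of 100 and `'c'` at its custom cost of 50. The sketch below reproduces that arithmetic in plain Python. It is not RLTK's implementation; `weighted_levenshtein` is a hypothetical helper, and the layout of the `'delete'` and `'substitute'` tables is an assumption, since the example above only shows `'insert'`.

```python
def weighted_levenshtein(s1, s2, cost):
    """Edit distance with per-character costs and per-operation defaults."""
    def ins(c):
        return cost.get('insert', {}).get(c, cost.get('insert_default', 1))

    def dele(c):
        # Assumed layout: {'delete': {char: cost}}
        return cost.get('delete', {}).get(c, cost.get('delete_default', 1))

    def sub(a, b):
        if a == b:
            return 0
        # Assumed layout: {'substitute': {from_char: {to_char: cost}}}
        table = cost.get('substitute', {}).get(a, {})
        return table.get(b, cost.get('substitute_default', 1))

    # prev[j] = cost of turning the processed prefix of s1 into s2[:j]
    prev = [0]
    for c2 in s2:
        prev.append(prev[-1] + ins(c2))
    for c1 in s1:
        curr = [prev[0] + dele(c1)]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + dele(c1),
                            curr[j - 1] + ins(c2),
                            prev[j - 1] + sub(c1, c2)))
        prev = curr
    return prev[-1]


cost = {'insert': {'c': 50}, 'insert_default': 100,
        'delete_default': 100, 'substitute_default': 100}
result = weighted_levenshtein('a', 'abc', cost)  # 100 for 'b' + 50 for 'c'
```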

If a method takes a file path as input, the path can be absolute or relative. RLTK looks up relative paths under `root_path`; use `set_root_path` to change it.

In [10]:
tk.set_root_path('../examples/ex1')
tk.get_root_path()

'/home/zege/ISI/rltk/examples/ex1'
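
The lookup rule described above can be sketched with the standard library. `resolve_path` is a hypothetical helper written for illustration, not part of RLTK, and how RLTK normalizes paths internally is an assumption here.

```python
import os


def resolve_path(path, root_path):
    """Return `path` unchanged if it is absolute, else join it onto `root_path`."""
    if os.path.isabs(path):
        return path
    return os.path.normpath(os.path.join(root_path, path))
```

Under this rule, a bare file name such as `'df_corpus_1.txt'` would be looked up inside the root path shown above.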

You can also load resources from files, for example with `load_df_corpus`. In a text file, each line holds tokens separated by whitespace, and RLTK treats each line as a document. [Example file: [df_corpus_1.txt](../examples/ex1/df_corpus_1.txt)]

In [11]:
tk.load_df_corpus('B1', 'df_corpus_1.txt', file_type='text', mode='append')
tk.tf_idf_similarity(['a', 'b', 'a'], ['a', 'c','d','f'], name='B1')

0.17541160386140583
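
For intuition, here is a rough sketch of how a document-frequency corpus can feed a TF-IDF cosine similarity. Everything in it is an assumption made for illustration: the corpus lines are invented (they are not the contents of `df_corpus_1.txt`), the smoothed IDF formula is one common choice, and the helper names are hypothetical. It is not expected to reproduce the score above.

```python
import math
from collections import Counter


def document_frequencies(lines):
    """Count, for each token, how many documents (lines) contain it."""
    df = Counter()
    n_docs = 0
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        df.update(set(tokens))
        n_docs += 1
    return df, n_docs


def tf_idf_cosine(bag1, bag2, df, n_docs):
    """Cosine similarity of two token lists under smoothed TF-IDF weights."""
    def weights(bag):
        counts = Counter(bag)
        # Assumed weighting: relative term frequency times smoothed IDF.
        return {t: (float(c) / len(bag))
                   * (math.log((1.0 + n_docs) / (1.0 + df[t])) + 1.0)
                for t, c in counts.items()}

    w1, w2 = weights(bag1), weights(bag2)
    dot = sum(w * w2.get(t, 0.0) for t, w in w1.items())
    norm1 = math.sqrt(sum(w * w for w in w1.values()))
    norm2 = math.sqrt(sum(w * w for w in w2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical corpus: one document per line, tokens separated by whitespace.
corpus = ["a b c", "a d", "c d f"]
df, n_docs = document_frequencies(corpus)
score = tf_idf_cosine(['a', 'b', 'a'], ['a', 'c', 'd', 'f'], df, n_docs)
```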

`load_df_corpus` also supports loading a JSON Lines file (each line is a JSON object). `json_path` is used to extract the field(s) that make up the document. [Example file: [jl_file_1.jsonl](../examples/ex1/jl_file_1.jsonl)]

In [12]:
tk.load_df_corpus('B2', 'jl_file_1.jsonl', file_type='json_lines', json_path='desc[*]', mode='append')
tk.tf_idf_similarity(['abc'], ['abc', 'def'], name='B2')

0.8944271909999159
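
As a rough picture of what extracting `desc[*]` from a JSON Lines file means, the sketch below reads one JSON object per line and takes every element of its `desc` array as the document's tokens. The sample lines are hypothetical (the real `jl_file_1.jsonl` may look different), `documents_from_json_lines` is an illustrative helper, and it only covers the simple case of a top-level array field rather than general JSON path expressions.

```python
import json


def documents_from_json_lines(lines, field):
    """Yield one token list per JSON object, taken from the array at `field`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        obj = json.loads(line)
        yield [str(token) for token in obj.get(field, [])]


# Hypothetical file contents, one JSON object per line.
sample = [
    '{"id": 1, "desc": ["abc", "def"]}',
    '{"id": 2, "desc": ["abc"]}',
]
docs = list(documents_from_json_lines(sample, 'desc'))
```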

Make sure the named resource matches the type the method expects; otherwise RLTK raises an exception.

In [13]:
try:
    tk.levenshtein_similarity('a', 'abc', name='B1')
except ValueError as e:
    print('Caught exception: {}'.format(e))

Caught exception: Invalid name or type
