# An Introduction to the `srlearn` Python Package

Alexander L. Hayes &mdash; Health Informatics Ph.D. Student &mdash; Indiana University Bloomington  
Sriraam Natarajan &mdash; Professor of Computer Science at the University of Texas at Dallas

Email: [hayesall@iu.edu](mailto:hayesall@iu.edu)  
GitHub: [https://github.com/hayesall/srlearn](https://github.com/hayesall/srlearn)

This notebook accompanies several parts of the user guide. Refer to the following pages for more information:

- [Getting Started (srlearn docs)](https://srlearn.readthedocs.io/en/latest/getting_started.html)
- [User Guide (srlearn docs)](https://srlearn.readthedocs.io/en/latest/user_guide.html)

## Quick-Start

A few things are required before running this notebook:

- Unix-based system (this has not been sufficiently tested in a Windows environment)
- Java (>=1.8)
- Python (3.6, 3.7)

Java should be installed on your system and available on your PATH. If running `java -version` prints a version number to the terminal, you are probably fine.
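
The same check can be done from Python before importing `srlearn`. This is a minimal sketch using only the standard library; `find_java` and `java_version_banner` are hypothetical helper names used for illustration, not part of `srlearn`.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable found on PATH, or None."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Run `java -version` and return its raw text output."""
    result = subprocess.run(
        [java_path, "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=False,
    )
    # `java -version` writes its banner to stderr, so check that stream first.
    return (result.stderr or result.stdout).strip()


java_path = find_java()
if java_path is None:
    print("java was not found on PATH; install Java (>=1.8) first.")
else:
    print(java_version_banner(java_path))
```

If the first branch prints, installing Java and reopening the terminal (so that PATH is refreshed) is the usual fix.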

`srlearn` can be installed with `pip`:

```bash
$ pip install srlearn
```

## `srlearn`

In [1]:
import srlearn

### Background Knowledge

**Background Knowledge** currently involves specifying the **modes** that constrain the hypothesis search space. Other parameters such as `max_tree_depth` and `node_size` may be specified here as well, but these would be more appropriately defined as part of a model. In the future, these model-specific parameters may be set elsewhere.

In [2]:
from srlearn import Background

# Modes constrain the search space for hypotheses
toy_cancer_modes = [
    "cancer(+Person).",
    "smokes(+Person).",
    "friends(+Person, -Person).",
    "friends(-Person, +Person).",
]

# Background object includes the modes and some additional parameters for how our domain may look.
bk = Background(
    modes=toy_cancer_modes,
    use_std_logic_variables=True,
)

print(bk)

setParam: nodeSize=2.
setParam: maxTreeDepth=3.
setParam: numberOfClauses=100.
setParam: numberOfCycles=100.
useStdLogicVariables: true.
mode: cancer(+Person).
mode: smokes(+Person).
mode: friends(+Person, -Person).
mode: friends(-Person, +Person).
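
To make the mode syntax concrete, the sketch below splits each mode string into its predicate name and typed arguments. It assumes the usual inductive logic programming convention, in which `+` marks an argument that must be bound to a variable already in the clause and `-` marks an argument that may introduce a new variable. `parse_mode` is a hypothetical helper written for this illustration and is not part of the `srlearn` API.

```python
import re

# Hypothetical helper for illustration; not part of the srlearn API.
MODE_PATTERN = re.compile(r"^\s*(\w+)\((.*)\)\.\s*$")


def parse_mode(mode):
    """Split a mode string into (predicate, [(marker, type), ...])."""
    match = MODE_PATTERN.match(mode)
    if match is None:
        raise ValueError(f"Not a valid mode string: {mode!r}")
    predicate, raw_args = match.groups()
    args = []
    for raw in raw_args.split(","):
        raw = raw.strip()
        # The first character is the marker (e.g. + or -); the rest is the type.
        args.append((raw[0], raw[1:]))
    return predicate, args


toy_cancer_modes = [
    "cancer(+Person).",
    "smokes(+Person).",
    "friends(+Person, -Person).",
    "friends(-Person, +Person).",
]

for mode in toy_cancer_modes:
    print(parse_mode(mode))
```

Read this way, the two `friends` modes let the search follow a friendship in either direction: starting from a known person and introducing a new one, or the reverse.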



### Database of clauses

Our next focus should be on *the data*. Data for `srlearn` takes the form of predicate logic. Here the clauses are defined inline in the code, but they could just as easily be read from a file or created with a simulator for *reinforcement learning* domains.

In [3]:
from srlearn import Database

train_pos = ["cancer(Alice).", "cancer(Bob).", "cancer(Chuck).", "cancer(Fred)."]
train_neg = ["cancer(Dan).", "cancer(Earl)."]
train_facts = [
    "friends(Alice, Bob).", "friends(Alice, Fred).", "friends(Chuck, Bob).", "friends(Chuck, Fred).",
    "friends(Dan, Bob).", "friends(Earl, Bob).", "friends(Bob, Alice).", "friends(Fred, Alice).",
    "friends(Bob, Chuck).", "friends(Fred, Chuck).", "friends(Bob, Dan).", "friends(Bob, Earl).",
    "smokes(Alice).", "smokes(Chuck).", "smokes(Bob).",
]

# Instantiate a `Database` object
db = Database()

# Set the positive examples, negative examples, and facts for the Database.
db.pos = train_pos
db.neg = train_neg
db.facts = train_facts

print(db)

Positive Examples:
['cancer(Alice).', 'cancer(Bob).', 'cancer(Chuck).', 'cancer(Fred).']
Negative Examples:
['cancer(Dan).', 'cancer(Earl).']
Facts:
['friends(Alice, Bob).', 'friends(Alice, Fred).', 'friends(Chuck, Bob).', 'friends(Chuck, Fred).', 'friends(Dan, Bob).', 'friends(Earl, Bob).', 'friends(Bob, Alice).', 'friends(Fred, Alice).', 'friends(Bob, Chuck).', 'friends(Fred, Chuck).', 'friends(Bob, Dan).', 'friends(Bob, Earl).', 'smokes(Alice).', 'smokes(Chuck).', 'smokes(Bob).']
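
The clauses above were written inline, but reading them from a file works just as well. The following is a minimal sketch assuming one clause per line in a plain-text file; `read_clauses` and the file name `facts.txt` are hypothetical and chosen only for this example.

```python
import os
import tempfile


def read_clauses(path):
    """Return the non-empty lines of a text file as a list of clause strings."""
    clauses = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                clauses.append(line)
    return clauses


# Write a small example file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    facts_path = os.path.join(tmp, "facts.txt")
    with open(facts_path, "w") as handle:
        handle.write("friends(Alice, Bob).\nsmokes(Alice).\n\n")
    facts = read_clauses(facts_path)

print(facts)
```

The returned lists are ordinary Python lists of strings, so they can be assigned to `db.pos`, `db.neg`, and `db.facts` in the same way as the inline lists.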


### Relational Dependency Network Learning

The model API should look familiar if you've worked with [scikit-learn](https://scikit-learn.org/stable/) before. **The only difference** is that instead of passing `X, y` NumPy arrays, we pass `Database` objects.

We'll instantiate an RDN to see what some of the default parameters look like:

In [4]:
from srlearn.rdn import BoostedRDN

# Instantiate an RDN with no parameters.
dn = BoostedRDN()

print(dn)

BoostedRDN(background=None, max_tree_depth=3, n_estimators=10, node_size=2,
           target='None')


... **but fitting this model will not make sense unless** we specify a **target** and the **background knowledge** we defined earlier.

In [5]:
# Instantiate an RDN with a target to learn, and the background knowledge
dn = BoostedRDN(
    target="cancer",
    background=bk,
)

# Fit a model with the `fit` method
dn.fit(db)

BoostedRDN(background=setParam: nodeSize=2.
setParam: maxTreeDepth=3.
setParam: numberOfClauses=100.
setParam: numberOfCycles=100.
useStdLogicVariables: true.
mode: cancer(+Person).
mode: smokes(+Person).
mode: friends(+Person, -Person).
mode: friends(-Person, +Person).
,
           max_tree_depth=3, n_estimators=10, node_size=2, target='cancer')

### Testing our model on new data

Now that we've fit a model, we can perform inference on new examples to estimate the probability that each one is a positive instance of the target.

In [6]:
test_pos = ["cancer(Zod).", "cancer(Xena).", "cancer(Yoda)."]
test_neg = ["cancer(Voldemort).", "cancer(Watson)."]
test_facts = [
    "friends(Zod, Xena).", "friends(Xena, Watson).", "friends(Watson, Voldemort).", "friends(Voldemort, Yoda).",
    "friends(Yoda, Zod).", "friends(Xena, Zod).", "friends(Watson, Xena).", "friends(Voldemort, Watson).",
    "friends(Yoda, Voldemort).", "friends(Zod, Yoda).", "smokes(Zod).", "smokes(Xena).", "smokes(Yoda).",
]

test_db = Database()
test_db.pos = test_pos
test_db.neg = test_neg
test_db.facts = test_facts

print(dn.predict_proba(test_db))

[0.88079619 0.88079619 0.88079619 0.3075821  0.3075821 ]
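
These probabilities can be turned into class labels by thresholding. The sketch below copies the printed values into a plain list and assumes the scores are ordered as the positive test examples followed by the negative ones, which is consistent with the output above but is an assumption here. The 0.5 threshold is likewise an illustrative choice.

```python
# Probabilities copied from the output printed above.
probabilities = [0.88079619, 0.88079619, 0.88079619, 0.3075821, 0.3075821]

# Assumption: scores are ordered as test_pos (3 examples), then test_neg (2 examples).
true_labels = [1, 1, 1, 0, 0]

# Illustrative decision threshold; other values may suit other problems.
threshold = 0.5
predicted_labels = [1 if p >= threshold else 0 for p in probabilities]

# Fraction of examples where the thresholded prediction matches the label.
correct = sum(int(p == t) for p, t in zip(predicted_labels, true_labels))
accuracy = correct / len(true_labels)

print(predicted_labels)
print(accuracy)
```

Under these assumptions, the three examples who smoke are labeled positive and the two who do not are labeled negative.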
