README wip + scikit-learn Python 2 and Python 3 compatibility support (version 0.20.1)
sergioburdisso committed Nov 12, 2019
1 parent 35c2179 commit 2b12af2
Showing 7 changed files with 174 additions and 102 deletions.
83 changes: 69 additions & 14 deletions README.md
@@ -2,33 +2,88 @@
 [![Documentation Status](https://readthedocs.org/projects/pyss3/badge/?version=latest)](http://pyss3.readthedocs.io/en/latest/?badge=latest)
 [![Build Status](https://travis-ci.org/sergioburdisso/pyss3.svg?branch=master)](https://travis-ci.org/sergioburdisso/pyss3)

-The SS3 text classifier was originally introduced in Section 3 of the [paper](https://dx.doi.org/10.1016/j.eswa.2019.05.023) entitled _"A text classification framework for simple and effective early depression detection over social media streams"_ (preprint available [here](https://arxiv.org/abs/1905.08772)).
+The SS3 text classifier is a novel supervised machine learning model for text classification. SS3 was originally introduced in Section 3 of the paper _["A text classification framework for simple and effective early depression detection over social media streams"](https://dx.doi.org/10.1016/j.eswa.2019.05.023)_ (preprint available [here](https://arxiv.org/abs/1905.08772)).

-**SS3 highlights:**
+**Some virtues of SS3:**

-* A novel text classifier having the ability to visually explain its rationale.
-* Domain-independent classification that does not require feature engineering.
-* Naturally supports incremental (online) learning and incremental classification.
+* It has the **ability to visually explain its rationale**.
+* Introduces a **domain-independent** classification model that does not require feature engineering.
+* Naturally supports **incremental (online) learning** and **incremental classification**.
+* Well suited to work over **text streams**.

+## What is PySS3?

+PySS3 is a Python package that allows you to work with SS3 in a very straightforward, interactive and visual way. In addition to the implementations of the SS3 classifier, PySS3 comes with a set of tools to help you to develop your machine learning models in a clearer and faster way. These tools let you analyze, supervise and understand your models (what they have actually learned and why). To achieve this, PySS3 provides you 3 main components: the ``SS3`` class, the ``Server`` class and the ``PySS3 Command Line`` tool, as pointed out below.

+### The `SS3` class

+which implements the classifier using a clear API (very similar to that of `sklearn`):
+````python
+from pyss3 import SS3
+clf = SS3()
+...
+clf.fit(x_train, y_train)
+y_pred = clf.predict(x_test)
+````

+### The `Server` class

+which allows you to interactively test your model and visually see the reasons behind classification decisions, **with just one line of code**:
+```python
+import pyss3
+from pyss3 import SS3

+clf = SS3(name="my_model")
+...
+clf.fit(x_train, y_train)
+pyss3.Server.serve(clf, x_test, y_test) # <- this one! cool uh? :)
+```
+As shown in the image below, this will open up, locally, an interactive tool in your browser which you can use to (live) test your models with the documents given in `x_test` (or typing in your own!). This will allow you to visualize and understand what your model is actually learning.

+![img](docs/_static/ss3_live_test.gif)

+### And last but not least, the _PySS3 Command Line_

+This is probably the most useful component of PySS3. When you install the package (for instance by using `pip install pyss3`) a new command line is automatically added to your environment, called _"pyss3"_. This command allows you to access to the _PySS3 Command Line_, an interactive command-line query tool. This tool will let you interact with your SS3 models through special commands while assisting you during the whole machine learning pipeline (model selection, training, testing, etc.). Probably one of the most important features is the ability to automatically (and permanently) record the history of every evaluation result of any type (tests, k-fold cross-validations, grid searches, etc.) that you've performed. This will allow you (with a single command) to interactively visualize and analyze your classifier performance in terms of its different hyper-parameters values (and select the best model according to your needs). For instance, let's perform a grid search with a 4-fold cross-validation on the three hyper-parameters, smoothness(`s`), significance(`l`), and sanction(`p`) as follows:

+```console
+your@user:/your/project/path$ pyss3
+(pyss3) >>> load my_model
+(pyss3) >>> grid_search path/to/dataset 4-fold -s r(.2,.8,6) -l r(.1,2,6) -p r(.5,2,6)
+```
+In this illustrative example, `s` will take 6 different values between .2 and .8, `l` between .1 and 2, and `p` between .5 and 2. After the grid search finishes, we can use the following command to open up the interactive plot in the browser:
+```console
+(pyss3) >>> plot evaluations
+```
+![img](docs/_static/plot_evaluations.gif)

+Each dot represents an experiment/evaluation performed using that particular combination of values (s, l, and p). Also, dots are painted proportional to how good the performance was using that configuration of the model. Researchers can interactively change the evaluation metrics to be used (accuracy, precision, recall, f1, etc.) and plots will update "on the fly". Additionally, when the cursor is moved over a data point, useful information is shown (including a "compact" representation of the confusion matrix obtained in that experiment). Finally, it is worth mentioning that, before showing the 3D plots, PySS3 creates a single and portable HTML file containing the plots and stores it locally. This allows researchers to store, send or upload the plots to another place using this single HTML file (their papers can now link to these types of plots to increase experimentation transparency!). For example, we have uploaded two of these files we've obtained for the "Tutorials" section: ["Movie Review Classification"](http://tworld.io/ss3/ss3_model_evaluation[movie_review_3grams].html) and ["Topic Categorization"](http://tworld.io/ss3/ss3_model_evaluation[topics_3grams].html) evaluation plots.


+## The PySS3 Workflow

+### The somewhat standard way

+### The "Command-Line" way

 ## Installation


 ### PyPi installation

 Simply type:

-    pip install pyss3

+```console
+$ pip install pyss3
+```

 ### Installation from source

-To install latest version from github, clone the source from the project repository and install with setup.py::

-    git clone https://github.com/sergioburdisso/pyss3
-    cd pyss3
-    python setup.py install

+To install latest version from github, clone the source from the project repository and install with setup.py:
+```console
+$ git clone https://github.com/sergioburdisso/pyss3
+$ cd pyss3
+$ python setup.py install
+```

 ## API Documentation

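Note on the README change: the new text describes a grid search in which `s`, `l` and `p` each take 6 values and every combination is evaluated with 4-fold cross-validation. The sketch below only illustrates how large that search is. It assumes `r(a,b,n)` means `n` evenly spaced values from `a` to `b` with both endpoints included, which the README does not spell out; the `r` helper here is a hypothetical stand-in, not PySS3 code.

```python
from itertools import product


def r(start, stop, n):
    """Hypothetical stand-in for the `r(start,stop,n)` notation:
    n evenly spaced values, endpoints included (an assumption)."""
    step = (stop - start) / (n - 1)
    return [round(start + i * step, 4) for i in range(n)]


s_values = r(.2, .8, 6)
l_values = r(.1, 2, 6)
p_values = r(.5, 2, 6)

# every (s, l, p) combination visited by the grid search
grid = list(product(s_values, l_values, p_values))
k_fold = 4

print(len(grid))           # number of hyperparameter combinations
print(len(grid) * k_fold)  # evaluations when each one is run on 4 folds
```

Under that assumption each dot in the evaluation plot corresponds to one element of `grid`.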
Binary file added docs/_static/plot_evaluations.gif
Binary file added docs/_static/ss3_live_test.gif
183 changes: 100 additions & 83 deletions pyss3/cmd.py
@@ -1086,34 +1086,40 @@ def k_fold_validation(
     x_data, y_data = np.array(x_data), np.array(y_data)
     skf = StratifiedKFold(n_splits=k_fold)
     progress_bar = tqdm(total=k_fold, desc=" K-Fold Progress")
-    for i_fold, (train_ix, test_ix) in enumerate(skf.split(x_data, y_data)):
-        if not cache or not is_in_cache(
-            data_path, method, def_cat, s, l, p, a
-        ):
-            x_train, y_train = x_data[train_ix], y_data[train_ix]
-            y_test = [CLF.get_category_index(y) for y in y_data[test_ix]]
-            x_test = x_data[test_ix]
+    try:
+        for i_fold, (train_ix, test_ix) in enumerate(skf.split(x_data, y_data)):
+            if not cache or not is_in_cache(
+                data_path, method, def_cat, s, l, p, a
+            ):
+                x_train, y_train = x_data[train_ix], y_data[train_ix]
+                y_test = [CLF.get_category_index(y) for y in y_data[test_ix]]
+                x_test = x_data[test_ix]

-            CLF = SS3(name=model_name)
-            CLF.set_hyperparameters(s, l, p, a)
-            train(x_train, y_train, n_grams, save=False, leave_pbar=False)
+                CLF = SS3(name=model_name)
+                CLF.set_hyperparameters(s, l, p, a)
+                train(x_train, y_train, n_grams, save=False, leave_pbar=False)

-            try:
-                y_pred = CLF.predict(
-                    x_test, def_cat, labels=False, leave_pbar=False
-                )
-            except InvalidCategoryError:
-                Print.error(ERROR_ICN % def_cat)
-                return
+                try:
+                    y_pred = CLF.predict(
+                        x_test, def_cat, labels=False, leave_pbar=False
+                    )
+                except InvalidCategoryError:
+                    Print.error(ERROR_ICN % def_cat)
+                    return

-            results(
-                y_test, y_pred,
-                categories, def_cat,
-                cache, method, data_path,
-                plots=False, k_fold=k_fold, i_fold=i_fold
-            )
+                results(
+                    y_test, y_pred,
+                    categories, def_cat,
+                    cache, method, data_path,
+                    plots=False, k_fold=k_fold, i_fold=i_fold
+                )

-        progress_bar.update(1)
+            progress_bar.update(1)
+    except KeyboardInterrupt:
+        Print.set_quiet(False)
+        print()
+        Print.warn("Interrupted!")
+        pass

     progress_bar.close()
     CLF = SS3(name=model_name)
@@ -1151,54 +1157,60 @@ def grid_search_loop(
     S, L, P, _ = CLF.get_hyperparameters()

     Print.quiet_begin()
-    for s, l, p in slp_list:
-        CLF.set_hyperparameters(s, l, p)
-        updated = False
-        for a in aa:
-            if not cache or not is_in_cache(
-                data_path, method, def_cat, s, l, p, a
-            ):
-                if not updated:
+    try:
+        for s, l, p in slp_list:
+            CLF.set_hyperparameters(s, l, p)
+            updated = False
+            for a in aa:
+                if not cache or not is_in_cache(
+                    data_path, method, def_cat, s, l, p, a
+                ):
+                    if not updated:
+                        progress_desc.set_description_str(
+                            " Status: [updating model...] "
+                            "(s=%.3f; l=%.3f; p=%.3f; a=%.3f)"
+                            %
+                            (s, l, p, a)
+                        )
+                        CLF.update_values()
+                        updated = True

+                    CLF.set_alpha(a)
                     progress_desc.set_description_str(
-                        " Status: [updating model...] "
+                        " Status: [classifying...] "
                         "(s=%.3f; l=%.3f; p=%.3f; a=%.3f)"
                         %
                         (s, l, p, a)
                     )
-                    CLF.update_values()
-                    updated = True

-                CLF.set_alpha(a)
-                progress_desc.set_description_str(
-                    " Status: [classifying...] "
-                    "(s=%.3f; l=%.3f; p=%.3f; a=%.3f)"
-                    %
-                    (s, l, p, a)
-                )

-                try:
-                    y_pred = CLF.predict(
-                        x_test, def_cat, labels=False, leave_pbar=False
+                    try:
+                        y_pred = CLF.predict(
+                            x_test, def_cat, labels=False, leave_pbar=False
+                        )
+                    except InvalidCategoryError:
+                        Print.error(ERROR_ICN % def_cat)
+                        return

+                    results(
+                        y_test, y_pred,
+                        categories, def_cat,
+                        cache, method, data_path,
+                        plots=False, k_fold=k_fold, i_fold=i_fold
                     )
-                except InvalidCategoryError:
-                    Print.error(ERROR_ICN % def_cat)
-                    return
+                else:
+                    progress_desc.set_description_str(
+                        " Status: [skipping (already cached)...] "
+                        "(s=%.3f; l=%.3f; p=%.3f; a=%.3f)"
+                        %
+                        (s, l, p, a)
+                    )
+                progress_bar.update(1)
+                progress_desc.update(1)
+    except KeyboardInterrupt:
+        Print.set_quiet(False)
+        print()
+        Print.warn("Interrupted!")

-                results(
-                    y_test, y_pred,
-                    categories, def_cat,
-                    cache, method, data_path,
-                    plots=False, k_fold=k_fold, i_fold=i_fold
-                )
-            else:
-                progress_desc.set_description_str(
-                    " Status: [skipping (already cached)...] "
-                    "(s=%.3f; l=%.3f; p=%.3f; a=%.3f)"
-                    %
-                    (s, l, p, a)
-                )
-            progress_bar.update(1)
-            progress_desc.update(1)
     progress_desc.set_description_str(" Status: [finished]")
     progress_bar.close()
     progress_desc.close()
@@ -1251,24 +1263,29 @@ def grid_search(
             position=0, total=k_fold,
             desc=" K-Fold Progress"
         )
-        for i_fold, (train_ix, test_ix) in enumerate(
-            skf.split(x_data, y_data)
-        ):
-            x_train, y_train = x_data[train_ix], y_data[train_ix]
-            y_test = [CLF.get_category_index(y) for y in y_data[test_ix]]
-            x_test = x_data[test_ix]

-            CLF = SS3(name=model_name)
-            train(x_train, y_train, n_gram, save=False, leave_pbar=False)

-            grid_search_loop(
-                data_path, x_test, y_test, categories, def_cat,
-                k_fold, i_fold, ss, ll, pp, aa, cache, leave_pbar=False
-            )
+        try:
+            for i_fold, (train_ix, test_ix) in enumerate(
+                skf.split(x_data, y_data)
+            ):
+                x_train, y_train = x_data[train_ix], y_data[train_ix]
+                y_test = [CLF.get_category_index(y) for y in y_data[test_ix]]
+                x_test = x_data[test_ix]

+                CLF = SS3(name=model_name)
+                train(x_train, y_train, n_gram, save=False, leave_pbar=False)

+                grid_search_loop(
+                    data_path, x_test, y_test, categories, def_cat,
+                    k_fold, i_fold, ss, ll, pp, aa, cache, leave_pbar=False
+                )

-            save_results_history()
+                save_results_history()

-            progress_bar.update(1)
+                progress_bar.update(1)
+        except KeyboardInterrupt:
+            Print.set_quiet(False)
+            print()
+            Print.warn("Interrupted!")

         progress_bar.close()
         CLF = SS3(name=model_name)
@@ -2634,11 +2651,11 @@ def main():
     """Main function."""
     global MODELS
     prompt = SS3Prompt()
-    prompt.prompt = '(ss3) >>> '
+    prompt.prompt = '(pyss3) >>> '
     prompt.doc_header = "Documented commands (type help <command>):"
     Print.info(
-        'SS3 Command Line v%s | Sergio Burdisso (sergio.burdisso@gmail.com).\n'
-        'SS3 comes with ABSOLUTELY NO WARRANTY. This is free software,\n'
+        'PySS3 Command Line v%s | Sergio Burdisso (sergio.burdisso@gmail.com).\n'
+        'PySS3 comes with ABSOLUTELY NO WARRANTY. This is free software,\n'
         'and you are welcome to redistribute it under certain conditions\n'
         '(Type "license" for more details).\n'
         'Type "help" or "help <command>" for more information.\n'
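
Note on the `cmd.py` changes: the three loop hunks above share one pattern. The long-running loop is wrapped in `try`/`except KeyboardInterrupt`, the handler prints a warning without re-raising, and the progress-bar cleanup that follows the `try` block still runs. A minimal standalone sketch of that pattern follows; `run_folds` and `fake_evaluate` are hypothetical names used only for illustration and are not part of PySS3.

```python
def run_folds(folds, evaluate):
    """Run `evaluate` on each fold; stop early but cleanly on Ctrl+C."""
    completed = []
    try:
        for fold in folds:
            completed.append(evaluate(fold))
    except KeyboardInterrupt:
        # like the handlers in the diff: report and fall through, no re-raise
        print("\nInterrupted!")
    # code after the try block runs whether or not the loop was interrupted
    print("finished %d of %d folds" % (len(completed), len(folds)))
    return completed


def fake_evaluate(fold):
    # hypothetical evaluation that simulates the user pressing Ctrl+C on fold 2
    if fold == 2:
        raise KeyboardInterrupt
    return fold * 10


results = run_folds([0, 1, 2, 3], fake_evaluate)
```

One design point worth keeping in mind: code placed after the `try` block is skipped by an early `return` inside the loop, so a `finally` clause is the usual choice when cleanup must run on every exit path.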
6 changes: 3 additions & 3 deletions pyss3/server.py
@@ -367,7 +367,7 @@ def start_listening(port=0):
         Server.__port__ = server_socket.getsockname()[1]

         Print.info(
-            "SS3 server started (listening on port %d)"
+            "PySS3 server started (listening on port %d)"
             %
             Server.__port__
         )
@@ -452,7 +452,7 @@ def serve(


 if __name__ == "__main__":
-    parser = argparse.ArgumentParser(description='SS3 Server')
+    parser = argparse.ArgumentParser(description='PySS3 Live Test Server')

     parser.add_argument('MODEL', help="the model name")
     parser.add_argument('-ph', '--path', help="the test set path")
@@ -467,7 +467,7 @@ def serve(

     try:
         Print.warn(
-            'SS3 Server comes with ABSOLUTELY NO WARRANTY. This is free software,'
+            'PySS3 Server comes with ABSOLUTELY NO WARRANTY. This is free software,'
             '\nand you are welcome to redistribute it under certain conditions'
             '\n(read license.txt for more details)\n', decorator=False
         )
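
Note on `server.py`: `start_listening(port=0)` reads the port back from `server_socket.getsockname()[1]`, which is how a program normally learns the port after asking the operating system to pick a free one. A minimal standalone sketch with the standard `socket` module is shown below; the address and variable names are illustrative assumptions and are not taken from `server.py`.

```python
import socket

server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # port 0 asks the operating system to assign any free port
    server_socket.bind(("127.0.0.1", 0))
    server_socket.listen(1)
    # getsockname() returns (host, port) for an AF_INET socket
    port = server_socket.getsockname()[1]
    print("listening on port %d" % port)
finally:
    server_socket.close()
```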
2 changes: 1 addition & 1 deletion requirements-dev.txt
@@ -5,6 +5,6 @@ sphinx_rtd_theme
 Cython
 numpy>=1.13.0
 scipy>= 0.17.0
-scikit-learn>=0.20
+scikit-learn==0.20.1
 tqdm>=4.8.4
 matplotlib
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,6 +1,6 @@
 Cython
 numpy>=1.13.0
 scipy>= 0.17.0
-scikit-learn>=0.20
+scikit-learn==0.20.1
 tqdm>=4.8.4
 matplotlib
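
Note on the requirements change: both files move scikit-learn from a minimum version (`>=0.20`) to an exact pin (`==0.20.1`). The sketch below illustrates the difference with a deliberately simplified comparison of dotted numeric versions. The `satisfies` helper is hypothetical and much less complete than pip's own specifier handling.

```python
def parse(version):
    """Turn a dotted numeric version such as '0.20.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies(installed, requirement):
    """Check `installed` against a '>=X' or '==X' requirement (simplified)."""
    operator, wanted = requirement[:2], requirement[2:].strip()
    a, b = parse(installed), parse(wanted)
    # pad with zeros so that '0.20' and '0.20.0' compare as equal
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    if operator == ">=":
        return a >= b
    if operator == "==":
        return a == b
    raise ValueError("unsupported operator: %r" % operator)


for installed in ("0.20.1", "0.21.3"):
    print(installed, satisfies(installed, ">=0.20"), satisfies(installed, "==0.20.1"))
```

With the exact pin, a later release such as 0.21.3 no longer satisfies the requirement, so every install resolves to the same scikit-learn version.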
