Description
I'm trying to use Intelex to accelerate training of an SVC. My dataset is fairly small (18 MB; I am attaching it, since it is a publicly available dataset, Universal Dependencies ISDT). I wasn't expecting this task to fill my 16 GB of RAM (plus 16 GB of swap), so I wonder whether this could be a bug. However, I am a student, so it may be an error on my part (if so, I'm sorry).
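For clarity, this is roughly how the script enables Intelex (a sketch; the attached train_parser.txt has the exact code):

```python
# Sketch: enable Intelex by patching scikit-learn. patch_sklearn()
# must run before sklearn (or NLTK, which imports it) so that
# sklearn.svm.SVC is replaced by the accelerated implementation.
from sklearnex import patch_sklearn
patch_sklearn()
```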
To Reproduce
Steps to reproduce the behavior:
- Download the attached files into the same folder
- Rename train_parser.txt to train_parser.py
- Install NLTK
- Run the Python script (a sketch of what it does follows this list)
- See the out-of-memory error
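A rough sketch of what train_parser.py does (the CoNLL-U preprocessing shown here is simplified and the file names are mine; the attached script is authoritative):

```python
from sklearnex import patch_sklearn
patch_sklearn()  # enable Intelex before NLTK imports sklearn

from nltk.parse.dependencygraph import DependencyGraph
from nltk.parse.transitionparser import TransitionParser

# Load the ISDT training treebank. Real CoNLL-U files also contain
# comment lines and multi-word token ranges that need filtering first.
with open("it_isdt-ud-train.txt", encoding="utf-8") as f:
    blocks = f.read().strip().split("\n\n")
gold_sents = [DependencyGraph(b, top_relation_label="root") for b in blocks]

# Train the transition-based parser; internally this fits sklearn.svm.SVC.
parser = TransitionParser("arc-standard")
parser.train(gold_sents, "isdt.model")  # memory blows up during this call
```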
Expected behavior
A new file should be created containing the trained model. Instead, an out-of-memory error is raised.
Note on NLTK implementation
The code of the train function is fairly straightforward; see the source here: https://www.nltk.org/_modules/nltk/parse/transitionparser.html#TransitionParser.train. The core of it is sketched below.
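Paraphrased from the linked source into a standalone sketch (train_from_svmlight is my name, not NLTK's), the training step boils down to:

```python
import pickle

from sklearn import svm
from sklearn.datasets import load_svmlight_file

# Abridged from TransitionParser.train: NLTK writes the training
# examples to a temporary svmlight file, reloads them as a sparse
# matrix, and fits an sklearn SVC with these fixed hyperparameters.
def train_from_svmlight(svmlight_path, modelfile):
    x_train, y_train = load_svmlight_file(svmlight_path)
    model = svm.SVC(
        kernel="poly",
        degree=2,
        coef0=0,
        gamma=0.2,
        C=0.5,
        verbose=True,
        probability=True,  # triggers internal cross-validation (slow)
    )
    model.fit(x_train, y_train)
    with open(modelfile, "wb") as f:
        pickle.dump(model, f)
```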
Environment:
- OS: Ubuntu 20.04
- Intelex 2021.5
- Python 3.9.11
- scikit-learn 1.0.2
- NLTK 3.7
- conda 4.13.0
- CPU: i5-10500
Attachments
train_parser.txt
it_isdt-ud-train.txt
EDIT:
The svmlight file generated by NLTK is actually 62 MB, and the memory used during training with plain scikit-learn (no Intelex) is around 1 GB.
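One guess (unverified, I don't know Intelex internals) is that the patched SVC densifies the sparse input, which would easily explain the blow-up. A quick way to estimate the dense-equivalent size of the training matrix ("train.svmlight" is a stand-in for the temporary file NLTK generates):

```python
from sklearn.datasets import load_svmlight_file

# Load the sparse training matrix and estimate how large it would
# become if converted to a dense float64 array.
x_train, y_train = load_svmlight_file("train.svmlight")

n_rows, n_cols = x_train.shape
dense_bytes = n_rows * n_cols * 8  # 8 bytes per float64 entry
print(f"sparse nnz: {x_train.nnz}, shape: {x_train.shape}")
print(f"dense equivalent: {dense_bytes / 1024**3:.1f} GiB")
```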