buildCrawler error #353

Closed
mnavasloro opened this issue Jun 27, 2023 · 1 comment

I get the following error when trying to build a model from the sample data (both locally and with Docker):

docker run -v \config:/config -v /data:/data -p 8080:8080 vidanyu/ache buildModel -c /config/sample_config/stopwords.txt -t /config/sample_training_data -o /config/sample_model

The problem seems to be the generated ARFF file (smile_input.arff), which the Smile ArffParser fails to parse:

-------------------
ACHE Crawler 0.15.0
-------------------

Preparing training data...
Positive samples: 169
Negative samples: 443
Featurizing positive samples...
Featurizing negative samples...
Selecting best features based on page frequency...
Training target classifier model...
Learning algorithm: SVM
Writting temporarily data file to: /config/sample_training_data/smile_input.arff
Failed to build model.

java.text.ParseException: Invalid attribute type or invalid enumeration
        at smile.data.parser.ArffParser.parseAttribute(ArffParser.java:280)
        at smile.data.parser.ArffParser.readHeader(ArffParser.java:210)
        at smile.data.parser.ArffParser.parse(ArffParser.java:401)
        at achecrawler.target.classifier.SmileTargetClassifierBuilder.trainModel(SmileTargetClassifierBuilder.java:40)
        at achecrawler.target.classifier.TargetClassifierBuilder.train(TargetClassifierBuilder.java:110)
        at achecrawler.Main$BuildModel.run(Main.java:182)
        at achecrawler.Main.main(Main.java:59)
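
Since the exception is thrown while the ARFF header is being read, one way to narrow this down is to scan the attribute declarations in smile_input.arff for anything that does not look like a valid ARFF type. The following is only a rough sketch under that assumption: the helper names (`suspicious_attributes`, `is_valid_type`, `check_file`) are hypothetical, the accepted types follow the general ARFF format rather than Smile's exact parsing rules, and it does not claim to identify the actual cause of this error.

```python
import re

# Attribute types defined by the ARFF format (matched case-insensitively).
SIMPLE_TYPES = {"numeric", "real", "integer", "string", "relational"}

# "@attribute <name> <type>", where <name> may be quoted if it contains spaces.
ATTR_RE = re.compile(
    r"^@attribute\s+('(?:[^'\\]|\\.)*'|\"(?:[^\"\\]|\\.)*\"|\S+)\s*(.*)$",
    re.IGNORECASE,
)


def is_valid_type(type_part):
    """Return True if the text after the attribute name looks like an ARFF type."""
    lowered = type_part.lower()
    if lowered in SIMPLE_TYPES:
        return True
    if lowered == "date" or lowered.startswith("date "):
        return True
    # Nominal attribute: a non-empty list of values in braces.
    if type_part.startswith("{") and type_part.endswith("}"):
        return type_part[1:-1].strip() != ""
    return False


def suspicious_attributes(lines):
    """Return (line_number, text) for header lines with a questionable declaration."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if line.lower().startswith("@data"):
            break  # the header ends where the data section starts
        if not line.lower().startswith("@attribute"):
            continue
        match = ATTR_RE.match(line)
        type_part = match.group(2).strip() if match else ""
        if not is_valid_type(type_part):
            problems.append((number, line))
    return problems


def check_file(path):
    """Print every suspicious attribute declaration found in the given ARFF file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, text in suspicious_attributes(handle):
            print(f"line {number}: {text}")
```

Calling `check_file("/config/sample_training_data/smile_input.arff")` (the path from the log above) would list any header lines worth inspecting by hand, for example an unquoted attribute name containing whitespace or an empty nominal list.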

mnavasloro commented Jul 25, 2023

I managed to solve the issue. I also added a new library for multi-language detection (so you can define a list of target languages) and added the KNN algorithm as an option. The fork is available here: https://github.com/mnavasloro/ache-multilingual
