"requests.exceptions.HTTPError: 404 Client Error" while trying tapioca train-classifier #11

heathersherry · 2019-11-19T09:23:45Z

Dear authors,

Thanks for sharing the great project.

I followed the documentation of this project to run it. Everything went smoothly until I tried to train a classifier on the dataset.
I created a Solr collection named collection_5 and ran:
bunzip2 < latest-all.json.bz2 | tapioca index-dump collection_5 - --profile profiles/human_organization_place.json
This worked well: the Wikidata dump was indexed into the Solr collection successfully.

Then I tried this command to get the classifier:
tapioca train-classifier -c collection_5 -b data/wd_2019-02-24.bow.pkl -p data/wd_2019-02-24.pgrank.npy -d data/merged_RSS-500_and_istex_train.ttl -o data/rss_istex_classifier.pkl
It fails with this error:

Traceback (most recent call last):
  File "/usr/local/bin/tapioca", line 11, in <module>
    load_entry_point('opentapioca==0.1.0', 'console_scripts', 'tapioca')()
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/data2/xxx/related_work/opentapioca/opentapioca/cli.py", line 184, in train_classifier
    best_params = clf.crossfit_model(d, parameter_grid, max_iter=max_iter)
  File "/data2/xxx/related_work/opentapioca/opentapioca/classifier.py", line 113, in crossfit_model
    docid_to_mentions[str(context.uri)] = self.create_mentions(context.mention)
  File "/data2/xxx/related_work/opentapioca/opentapioca/classifier.py", line 78, in create_mentions
    mentions = self.tagger.tag_and_rank(phrase)
  File "/data2/xxx/related_work/opentapioca/opentapioca/tagger.py", line 52, in tag_and_rank
    r.raise_for_status()
  File "/home/xxx/.local/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://localhost:8983/solr/collection_5/tag?overlaps=NO_SUB&tagsLimit=500&fl=id%2Clabel%2Caliases%2Cextra_aliases%2Cdesc%2Cnb_statements%2Cnb_sitelinks%2Cedges%2Ctypes&wt=json&indent=off

(I put the opentapioca project in the folder /data2/xxx/related_work)

Could you please give some hints for solving this problem? Is it caused by Solr? I have checked the status of Solr, and everything seems to be working well:

Found 1 Solr nodes: 

Solr process 8173 running on port 8983
{
  "solr_home":"/data2/sherry/related_work/solr-8.2.0/server/solr",
  "version":"8.2.0 31d7ec7bbfdcd2c4cc61d9d35e962165410b65fe - ivera - 2019-07-19 15:11:04",
  "startTime":"2019-10-30T06:39:53.937Z",
  "uptime":"20 days, 2 hours, 43 minutes, 18 seconds",
  "memory":"3.3 GB (%83.2) of 4 GB",
  "cloud":{
    "ZooKeeper":"localhost:9983",
    "liveNodes":"1",
    "collections":"6"}}

Thanks a lot!


wetneb · 2019-11-19T10:14:27Z

If you created the Solr collection yourself, then it probably lacks the /tag endpoint that is required by opentapioca.

You should run tapioca index-dump my_collection_name latest-all.json.bz2 --profile profiles/human_organization_place.json for a collection that does not exist yet: tapioca will create the collection by itself, with the appropriate /tag endpoint.

There might be a way to add the endpoint after the fact, having already ingested the dump in a collection - but I am not sure how!

I will make it clearer in the docs that you should not create the Solr collection yourself.
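One way to confirm this diagnosis before re-indexing is to probe the collection's /tag handler directly. The following is a minimal sketch, not part of opentapioca: the helper names `tag_url`, `http_status` and `has_tag_endpoint` are hypothetical, and the default Solr address is taken from the URL in the error message above. It treats a 404 as "handler missing" and any other status as "handler present".

```python
import urllib.error
import urllib.request


def tag_url(collection, solr_base="http://localhost:8983/solr"):
    """Build the URL of the tagger handler that tapioca queries."""
    return "{}/{}/tag".format(solr_base.rstrip("/"), collection)


def http_status(url):
    """POST a tiny plain-text body to url and return the HTTP status code."""
    req = urllib.request.Request(
        url, data=b"test", headers={"Content-Type": "text/plain"}
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        # Error statuses (404, 400, ...) are raised; report the code instead.
        return e.code


def has_tag_endpoint(collection, fetch_status=http_status):
    """Return False if Solr answers 404 for the collection's /tag handler."""
    return fetch_status(tag_url(collection)) != 404
```

Calling `has_tag_endpoint("collection_5")` against a running Solr should return False for a collection that was created by hand without the tagger handler. The `fetch_status` parameter exists only so the check can be exercised without a live server.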


heathersherry · 2019-11-24T13:00:22Z

Dear authors,

Thanks for the quick explanation!
I have tried several times to run tapioca index-dump my_collection_name latest-all.json.bz2 --profile profiles/human_organization_place.json without creating the Solr collection myself first. However, I get another error message:

requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://localhost:8983/solr/original/update?commit=false

If I create the Solr collection first and then run tapioca index-dump my_collection_name latest-all.json.bz2 --profile profiles/human_organization_place.json, there is no problem. However, running tapioca train-classifier afterwards produces the error message from my last post.

I also tried bunzip2 < latest-all.json.bz2 | tapioca index-dump my_collection_name - --profile profiles/human_organization_place.json, but that fails with the following error:

Traceback (most recent call last):
  File "/home/sherry/.local/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/home/sherry/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 720, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/sherry/.local/lib/python3.6/site-packages/urllib3/util/retry.py", line 400, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/sherry/.local/lib/python3.6/site-packages/urllib3/packages/six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "/home/sherry/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "/home/sherry/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/sherry/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.6/http/client.py", line 1346, in getresponse
    response.begin()
  File "/usr/lib/python3.6/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

I have also checked the status of Solr, and it is the same as in my last post. So I am not sure why there is a "Connection aborted" error.

Could you please give some hints? Thanks a lot!

wetneb · 2019-11-24T13:14:11Z

For the HTTP 400 error you get, there should be some logs available in the Solr web interface. Can you check there and report what exactly causes this Bad Request error?
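The body of a failed Solr response usually states the reason as well, so it can be inspected from the client side. Below is a minimal sketch, assuming the response is JSON in Solr's usual error layout (an "error" object with a "msg" field). The helper name `solr_error_message` is hypothetical, and the sample body is invented for illustration, not taken from this issue.

```python
import json


def solr_error_message(body):
    """Extract the 'msg' field from a Solr JSON error body, or return None."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON (for example an HTML error page).
        return None
    if not isinstance(payload, dict):
        return None
    error = payload.get("error")
    if isinstance(error, dict):
        return error.get("msg")
    return None


# Invented example body, for illustration only.
sample = '{"responseHeader": {"status": 400}, "error": {"msg": "example reason", "code": 400}}'
print(solr_error_message(sample))
```

With the requests library, the same function could be applied to `r.text` just before `r.raise_for_status()` is called, so that the reason is printed alongside the traceback.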

heathersherry · 2019-11-26T03:30:06Z

Hi,

Here are the logs of the program (I run Solr and OpenTapioca from a Linux terminal):

2019-11-25 15:04:38,220 opentapioca.taggerfactory INFO     Stream index: 10674820
2019-11-25 15:04:38,221 opentapioca.taggerfactory INFO     Updating 2000 docs, deleting 0 others
2019-11-25 15:04:41,453 opentapioca.taggerfactory INFO     Stream index: 10676994
2019-11-25 15:04:41,453 opentapioca.taggerfactory INFO     Updating 2000 docs, deleting 0 others
Traceback (most recent call last):
  File "/usr/local/bin/tapioca", line 11, in <module>
    load_entry_point('opentapioca==0.1.0', 'console_scripts', 'tapioca')()
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/xxx/.local/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/data2/xxx/related_work/opentapioca/opentapioca/cli.py", line 118, in index_dump
    batch_size=2000, commit_time=10, delete_excluded=False, skip_docs=skip)
  File "/data2/xxx/related_work/opentapioca/opentapioca/taggerfactory.py", line 91, in index_stream
    self._push_documents(batch, collection_name, commit)
  File "/data2/xxx/related_work/opentapioca/opentapioca/taggerfactory.py", line 121, in _push_documents
    r.raise_for_status()
  File "/home/xxx/.local/lib/python3.6/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://localhost:8983/solr/cony_collection_5/update?commit=false

(I named the collection cony_collection_5.)
I also checked the status of Solr, and it seems to be working well:

$ /home/xxx/solr-8.3.0/bin/solr status

Found 1 Solr nodes:

Solr process 933 running on port 8983
{ 
  "solr_home":"/home/xxx/solr-8.3.0/server/solr",
  "version":"8.3.0 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:15:22",
  "startTime":"2019-11-25T12:26:18.094Z",
  "uptime":"0 days, 14 hours, 59 minutes, 5 seconds",
  "memory":"5.1 GB (%16) of 32 GB",
  "cloud":{
    "ZooKeeper":"localhost:9983",
    "liveNodes":"1",
    "collections":"7"}}

wetneb · 2019-11-26T08:09:51Z

Thanks! The Solr logs themselves should be accessible on the Solr web interface. By default it runs at http://hostname:8983/solr/.

heathersherry · 2019-11-26T09:04:36Z

Thanks a lot for the reply!
I have run bunzip2 < latest-all.json.bz2 | tapioca index-dump my_collection_name - --profile profiles/human_organization_place.json five times, starting Solr with different amounts of memory (4G, 8G, 16G, 32G and 64G). However, the error occurs every time and at the same point, when the stream index reaches 10676994.

2019-11-26 09:02:02,305 opentapioca.taggerfactory INFO     Stream index: 10670820
2019-11-26 09:02:02,306 opentapioca.taggerfactory INFO     Updating 2000 docs, deleting 0 others
2019-11-26 09:02:04,307 opentapioca.taggerfactory INFO     Stream index: 10672820
2019-11-26 09:02:04,309 opentapioca.taggerfactory INFO     Updating 2000 docs, deleting 0 others
2019-11-26 09:02:06,467 opentapioca.taggerfactory INFO     Stream index: 10674820
2019-11-26 09:02:06,468 opentapioca.taggerfactory INFO     Updating 2000 docs, deleting 0 others
2019-11-26 09:02:12,948 opentapioca.taggerfactory INFO     Stream index: 10676994
2019-11-26 09:02:12,949 opentapioca.taggerfactory INFO     Updating 2000 docs, deleting 0 others
Traceback (most recent call last): ...

Therefore, I guess that the error is caused by the data. To skip the problematic data, is it fine to wrap line 121, r.raise_for_status(), in opentapioca/opentapioca/taggerfactory.py in a try/except block? Is it fine to skip this step when there is an error?

P.S. I am running the experiments on a Linux server without a graphical interface, so I cannot reach the Solr web interface. I will try that later if the above solution does not help.
Thanks a lot for your help. :)

heathersherry · 2019-11-28T13:04:15Z

Adding exception handling around line 121, r.raise_for_status(), in opentapioca/opentapioca/taggerfactory.py solves this problem. Now I can successfully run the application. Thanks a lot!
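The idea can be sketched in isolation as follows. This is only an illustration of skipping a batch whose upload fails, not necessarily the exact change made to taggerfactory.py. The names `push_all` and `fake_push` are hypothetical, and `fake_push` stands in for the real HTTP call so that the example runs without Solr.

```python
import logging

logger = logging.getLogger("index_sketch")


def push_all(batches, push, skip_on=(Exception,)):
    """Push every batch; log and collect the indices of batches whose push raises."""
    skipped = []
    for index, batch in enumerate(batches):
        try:
            push(batch)
        except skip_on as exc:
            logger.warning(
                "Skipping batch %d (%d docs): %s", index, len(batch), exc
            )
            skipped.append(index)
    return skipped


def fake_push(batch):
    """Stand-in for the upload: fails when the batch contains a 'bad' document."""
    if "bad" in batch:
        raise ValueError("400 Bad Request")


print(push_all([["a"], ["bad", "b"], ["c"]], fake_push))  # [1]
```

Note that this drops the whole batch, not only the offending document. With the batch size of 2000 shown in the traceback above, one rejected document would cost up to 2000 entries, so logging the skipped batches makes it possible to retry them in smaller pieces later.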

wetneb · 2019-11-28T15:12:49Z

@heathersherry wonderful! Do you think you could create a pull request for that change? I think it would make a lot of sense!

heathersherry · 2019-12-02T04:46:36Z

> @heathersherry wonderful! Do you think you could create a pull request for that change? I think it would make a lot of sense!

Sure! Thanks again for creating such a great project. :)
Should I create the pull request against the default branch? Currently it seems that I do not have permission.

wetneb · 2019-12-02T09:59:03Z

Yes, you should be able to create a pull request by first creating a fork of this repository in your own account, pushing your change there and then creating the pull request. Alternatively, if you only want to propose a change to a single file (as is the case here), you should be able to view that file on GitHub and use the edit link there.
In case neither of these works for you, I have invited you as a collaborator to this project, which should make things easier.


dinani65 · 2021-03-29T07:42:00Z

I also get the same error when I want to create a collection.
Command:
tapioca index-dump col2 latest-all.json.bz2 --profile profiles/human_organization_location.json
Error:
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://localhost:8983/solr/admin/collections?action=CREATE&name=col2&collection.configName=tapioca&numShards=1
Log:
org.apache.solr.common.SolrException: Solr instance is not running in SolrCloud mode.
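The request that fails here goes to Solr's Collections API, which is only available when Solr runs in SolrCloud mode (typically started with bin/solr start -c). A minimal sketch for checking the mode from Python follows, assuming Solr's system info endpoint (admin/info/system) reports a "mode" field with the value "solrcloud" in cloud mode. The helper names `is_solrcloud` and `fetch_system_info` are hypothetical.

```python
import json
import urllib.request


def is_solrcloud(system_info):
    """Return True if a parsed system info response reports SolrCloud mode."""
    return system_info.get("mode") == "solrcloud"


def fetch_system_info(solr_base="http://localhost:8983/solr"):
    """Fetch and parse the system info response of a running Solr node."""
    url = solr_base.rstrip("/") + "/admin/info/system?wt=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a running node, `is_solrcloud(fetch_system_info())` returning False would be consistent with the log message above.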

lucyhorowitz · 2022-12-16T16:36:16Z

I am getting the same error as just above, but in the log there are a few NoSuchFileExceptions about solr-9.0.0/lib and /dist and then org.apache.solr.common.SolrException: Error CREATEing SolrCore 'collection5_shard1_replica_n1': Unable to create core [collection5_shard1_replica_n1] Caused by: solr.XSLTResponseWriter.

wetneb added a commit that referenced this issue Nov 19, 2019

Make it clearer that you should not create the collection yourself, for

47849d6

#11

heathersherry closed this as completed Nov 26, 2019

wetneb reopened this Nov 28, 2019

heathersherry added a commit to heathersherry/opentapioca that referenced this issue Dec 3, 2019

Update taggerfactory.py

8e4fc0f

Fixed Issue opentapioca#11.

heathersherry mentioned this issue Dec 3, 2019

Update taggerfactory.py #12

Merged

ziodave pushed a commit to ziodave/opentapioca that referenced this issue Feb 24, 2021

Make it clearer that you should not create the collection yourself, for

ddf8f32

opentapioca#11

ziodave pushed a commit to ziodave/opentapioca that referenced this issue Feb 24, 2021

Update taggerfactory.py

e6ea361

Fixed Issue opentapioca#11.
