Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spidermon fails to add _validation field to the item. #379

Closed
proway2 opened this issue Jan 16, 2023 · 10 comments
Closed

Spidermon fails to add _validation field to the item. #379

proway2 opened this issue Jan 16, 2023 · 10 comments
Assignees

Comments

@proway2
Copy link

proway2 commented Jan 16, 2023

Ubuntu 22.04, Python 3.10.9, Spidermon 1.17.1

If an item defined with attrs library and this condition https://github.com/scrapinghub/spidermon/blob/master/spidermon/contrib/scrapy/pipelines.py#L140 is True, then it fails here https://github.com/scrapinghub/spidermon/blob/master/spidermon/contrib/scrapy/pipelines.py#L141 with this message:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
TypeError: 'ProductList' object does not support item assignment
@VMRuiz
Copy link
Collaborator

VMRuiz commented Jan 17, 2023

This is a limitation on the Item definition as it don't support adding new fields.

In theory, using slots=False should work according to the attr dosc, however it didn't work for me.

Alternatively, adding the field _validation to the class should remove the error, however, you will get _validation automatically populated to None on each item which is not the best solution either.

On my experience I had to just convert from Attrs class to dict before spidermon validaton.

@proway2
Copy link
Author

proway2 commented Jan 17, 2023

@VMRuiz I had a hope that with this PR we wouldn't need this conversion-to-dict pipeline anymore.

@VMRuiz
Copy link
Collaborator

VMRuiz commented Jan 17, 2023

That class is just an interface to simplify read and write access to the different types of Item classes. It doesn't add any additional functionality.

I don't know if we could overcharge it to support additional fields not defined on the original item class.

@VMRuiz
Copy link
Collaborator

VMRuiz commented Jan 17, 2023

Maybe we can start by capturing the exception and raising a warning instead.

cc. @Gallaecio @rennerocha

@Gallaecio
Copy link
Member

Alternatively, adding the field _validation to the class should remove the error

I think this is the way to go. You either disable the creation of this _validation field, if Spidermon allows for that, or you add the field to your item class.

@VMRuiz
Copy link
Collaborator

VMRuiz commented Jan 24, 2023

@Gallaecio Shall we close this ticket as wont fix or similar?

@Gallaecio
Copy link
Member

Gallaecio commented Jan 25, 2023

Well, I think there might be an action here even if we go with #379 (comment).

  • If the _validation field cannot be disabled without disabling validation itself (I don’t know if that is the case), then that’s something to support.
  • We should document this scenario, i.e. the need to add _validation to an item class that does not support defining unknown fields, and the option to disable the generation of that field instead.

@VMRuiz
Copy link
Collaborator

VMRuiz commented Jan 25, 2023

This can be turn off with SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS - https://spidermon.readthedocs.io/en/latest/item-validation.html#spidermon-validation-add-errors-to-items

We may expand the docs to explicitly state that if enabled, it will only work on items that have defined the _validation field or that support adding undefined fields, like dicts .

Edit: Additionally, SPIDERMON_VALIDATION_ERRORS_FIELD can be use to replace the _validation field with any other field name.

rennerocha added a commit to okfn-brasil/querido-diario that referenced this issue Sep 18, 2023
### Descrição
Resolve #929 

### Comentários
Atualizando minha venv local com os novos requirements, a execução de
Prado-BA - mesmo raspador apontado na issue, coletando apenas desde
agosto/23 até hoje, gera os seguintes arquivos:

[prado_com_erro.csv](https://github.com/okfn-brasil/querido-diario/files/12612259/prado_com_erro.csv)
- vazio

[prado_com_erro.txt](https://github.com/okfn-brasil/querido-diario/files/12612261/prado_com_erro.txt)

Ao fazer as atualizações propostas neste PR, temos (para o mesmo período
de coleta):

[prado_modificado.csv](https://github.com/okfn-brasil/querido-diario/files/12612271/prado_modificado.csv)
- coleta com sucesso

[prado_modificado.txt](https://github.com/okfn-brasil/querido-diario/files/12612273/prado_modificado.txt)

Atualizar a biblioteca deprecada `pkg_resources` foi um dos erros
apontados em #929. Segui as instruções em
https://importlib-resources.readthedocs.io/en/latest/migration.html para
isso. Pareceu funcionar.

Quanto ao erro de validação apontado na issue #929, há uma issue aberta
no repositório do Spidermon, que dialoga com o problema:
scrapinghub/spidermon#379. Não tive certeza o
que fazer com isso.

Abro o PR como ponto de partida do que já enderecei, porém talvez haja
outras verificações e modificações para fazer.
@rennerocha
Copy link
Collaborator

rennerocha commented Sep 18, 2023

This problem was added by #358 and it is a backward incompatible change that we didn't inform our users that it was introduced. So we need to do something about it. Probably one new release that indicates that _validation field will only work if we are dealing with dict item.

@rennerocha
Copy link
Collaborator

We already have this situation in Scrapy related to files and files_urls field for the media pipelines (https://docs.scrapy.org/en/latest/topics/media-pipeline.html#usage-example). So adding the information that we need to specify _validation should be enough. However, I believe that we should put as a warning, not just in the middle of the text.

@VMRuiz VMRuiz closed this as completed May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants