@joaquingx joaquingx commented Dec 16, 2018

This PR adds Dublin Core schema support to extruct. To implement parsing I used this document as a guide: http://dublincore.org/documents/dcq-html/ (especially section 3, which explains how a DC consumer should act).
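For readers unfamiliar with the DC-in-HTML convention the guide describes, here is a minimal sketch of what a consumer might do: read `<link rel="schema.PREFIX" href="...">` tags as namespace declarations and `<meta name="PREFIX.term" content="...">` tags as metadata elements. This is not the code in this PR. The class name, the function name, the accepted prefixes (`DC`, `DCTERMS`) and the output shape are all assumptions made for illustration, and it uses the standard library's `html.parser` rather than whatever parser extruct relies on.

```python
from html.parser import HTMLParser


class DublinCoreSketchParser(HTMLParser):
    """Hypothetical helper: collects schema.* links and DC-prefixed meta tags."""

    def __init__(self):
        super().__init__()
        self.namespaces = {}  # prefix -> namespace URI
        self.elements = []    # one dict per matching <meta> tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link":
            rel = attrs.get("rel") or ""
            # <link rel="schema.DC" href="..."> declares the "DC" prefix.
            if rel.lower().startswith("schema."):
                self.namespaces[rel[len("schema."):]] = attrs.get("href")
        elif tag == "meta":
            name = attrs.get("name") or ""
            prefix = name.split(".", 1)[0]
            # Assumption: only names such as "DC.title" or "DCTERMS.issued" count.
            if "." in name and prefix.lower() in ("dc", "dcterms"):
                self.elements.append(
                    {"name": name, "content": attrs.get("content")}
                )


def parse_dublincore_sketch(html):
    """Return the namespaces and elements found in an HTML string."""
    parser = DublinCoreSketchParser()
    parser.feed(html)
    parser.close()
    return {"namespaces": parser.namespaces, "elements": parser.elements}
```

Calling `parse_dublincore_sketch` on a page whose head contains a `schema.DC` link and a `DC.title` meta tag would return one namespace entry and one element, while unrelated tags such as `<meta name="description">` are skipped.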

More references:

Fixes #10

@joaquingx joaquingx changed the title [MRG] Add dublincore schema [WIP] Add dublincore schema Dec 16, 2018
@joaquingx joaquingx changed the title [WIP] Add dublincore schema [WIP] Add dublincore metadata Dec 16, 2018
codecov bot commented Dec 18, 2018

Codecov Report

Merging #101 into master will increase coverage by 1.34%.
The diff coverage is 98.48%.

@@            Coverage Diff             @@
##           master     #101      +/-   ##
==========================================
+ Coverage    87.3%   88.65%   +1.34%     
==========================================
  Files          11       12       +1     
  Lines         457      520      +63     
  Branches       97      111      +14     
==========================================
+ Hits          399      461      +62     
  Misses         52       52              
- Partials        6        7       +1
Impacted Files           Coverage Δ
extruct/_extruct.py      75% <100%> (+1.66%) ⬆️
extruct/uniform.py       94.59% <100%> (+1.37%) ⬆️
extruct/dublincore.py    97.67% <97.67%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ab5592...3d4bf5d. Read the comment docs.

@joaquingx joaquingx changed the title [WIP] Add dublincore metadata [MRG] Add dublincore metadata Dec 18, 2018
@joaquingx
Contributor Author

What do you think, @lopuhin?

Member

lopuhin commented Dec 18, 2018

@joaquingx thanks for the PR! I only had a quick look, but the PR looks great 👍
I have only one concern, though I'm not sure it is applicable. Besides extractors, we also have uniform_processors:

extruct/extruct/_extruct.py

Lines 111 to 133 in 3ab5592

if uniform:
    uniform_processors = []
    if 'microdata' in syntaxes:
        uniform_processors.append(
            ('microdata',
             _umicrodata_microformat,
             output['microdata'],
             schema_context,
             ))
    if 'microformat' in syntaxes:
        uniform_processors.append(
            ('microformat',
             _umicrodata_microformat,
             output['microformat'],
             'http://microformats.org/wiki/',
             ))
    if 'opengraph' in syntaxes:
        uniform_processors.append(
            ('opengraph',
             _uopengraph,
             output['opengraph'],
             None,
             ))

Their goal is to convert the output of each extractor into a more uniform syntax that is similar across extractors, which makes it easier to write code that supports all of them. Do you think such a converter would make sense for dublincore?
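To make the idea concrete, here is a rough sketch of the pattern in the quoted snippet: each entry is a `(syntax, processor, raw_output, context)` tuple, and a later loop replaces the raw output of each syntax with its uniform form. The loop itself is not shown in the snippet, so the way the tuples are consumed is an assumption, and `_toy_uniform`, `build_uniform_processors` and `apply_uniform_processors` are hypothetical names, not extruct functions.

```python
def _toy_uniform(extracted, context):
    """Hypothetical processor: attach an "@context" key to every item."""
    return [{**item, "@context": context} for item in extracted]


def build_uniform_processors(output, syntaxes, contexts):
    """Collect (syntax, processor, raw_output, context) tuples.

    Mirrors the shape of the tuples in the quoted snippet, but uses one
    toy processor for every syntax instead of the real ones.
    """
    processors = []
    for syntax in syntaxes:
        if syntax in output:
            processors.append(
                (syntax, _toy_uniform, output[syntax], contexts.get(syntax))
            )
    return processors


def apply_uniform_processors(output, processors):
    """Return a new dict with each listed syntax converted to uniform form."""
    result = dict(output)
    for syntax, processor, raw, context in processors:
        result[syntax] = processor(raw, context)
    return result
```

Under this sketch, supporting a new syntax such as dublincore would mean adding one more tuple with its own processor function; syntaxes without a processor pass through unchanged.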

@joaquingx
Contributor Author

@lopuhin I added the uniform option. Can you confirm that it works as intended? 🙏

Member

@lopuhin lopuhin left a comment

Thanks @joaquingx, looks good! I left a small comment regarding the code.

Also it would be great to update README.rst, adding dublincore at the start (at "Currently, extruct supports:"), and also adding it to the list of valid syntaxes https://github.com/scrapinghub/extruct#select-syntaxes and to https://github.com/scrapinghub/extruct#uniform. I don't think we should add it under https://github.com/scrapinghub/extruct#single-extractors (not that it's different from other syntaxes, but we should just refer to tests here instead of repeating them in README - this is unrelated to this PR).

codecov-io commented Jan 14, 2019

Codecov Report

Merging #101 into master will increase coverage by 1.00%.
The diff coverage is 98.52%.

@@            Coverage Diff             @@
##           master     #101      +/-   ##
==========================================
+ Coverage   89.23%   90.24%   +1.00%     
==========================================
  Files          12       13       +1     
  Lines         539      605      +66     
  Branches      122      136      +14     
==========================================
+ Hits          481      546      +65     
  Misses         52       52              
- Partials        6        7       +1     
Impacted Files           Coverage Δ
extruct/dublincore.py    97.61% <97.61%> (ø)
extruct/_extruct.py      75.60% <100.00%> (+2.27%) ⬆️
extruct/uniform.py       95.50% <100.00%> (+1.06%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d245b3...043a479. Read the comment docs.

@joaquingx
Contributor Author

Hey @lopuhin, thanks for the feedback 🙏. I just have one question here:

I don't think we should add it under https://github.com/scrapinghub/extruct#single-extractors (not that it's different from other syntaxes, but we should just refer to tests here instead of repeating them in README - this is unrelated to this PR).

What's the correct way to do that? Should I just add a title, "DublinCore Extractor", and a link to https://github.com/joaquingx/extruct/blob/master/tests/test_extruct.py?

Member

lopuhin commented Jan 15, 2019

@joaquingx thanks, the way you updated README looks perfect to me 👍

Member

@lopuhin lopuhin left a comment

Looks great, thanks 👍

@lopuhin lopuhin changed the title [MRG] Add dublincore metadata [MRG+1] Add dublincore metadata Jan 15, 2019
Member

lopuhin commented Jan 15, 2019

@Kebniss @kmike would you like to have another look?

def _udublincore(extracted):
    out = []
    for obj in extracted:
        context = obj.pop('namespaces', None)
Member

I think it is better not to modify the passed object here, and to work on a copy instead.

Contributor Author

Modified, thanks!

Member

With list(extracted) you're making a shallow copy; changes to the list elements still happen in place.
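The difference can be illustrated with a small sketch. The two function names below are hypothetical and the bodies are simplified stand-ins for the reviewed function; only `list(...)`, `copy.deepcopy` and the `pop('namespaces', None)` call come from the discussion.

```python
import copy


def strip_namespaces_shallow(extracted):
    """Shallow copy: a new list, but it still holds the caller's dicts."""
    out = list(extracted)
    for obj in out:
        obj.pop("namespaces", None)  # mutates the caller's dict as well
    return out


def strip_namespaces_deep(extracted):
    """Deep copy: a new list holding new dicts, so the input is untouched."""
    out = copy.deepcopy(extracted)
    for obj in out:
        obj.pop("namespaces", None)  # only the copy loses the key
    return out
```

With the shallow variant, the dicts in the caller's list lose their `namespaces` key after the call; with the deep variant they keep it.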

Member

Hey @joaquingx! Do you think you can address the comment above? It seems this is the only remaining issue; the PR has everything else: a clean implementation, docs, tests, ...

Contributor Author

Sure, I'll work on that 👍

Contributor Author

Done

Contributor

Kebniss commented Jan 16, 2019

Aside from the comments left by @kmike, this PR looks good to me. Thank you @joaquingx!

@joaquingx
Contributor Author

Thanks @Kebniss 🙏. Do the changes look good to you, @kmike?

@joaquingx
Contributor Author

@kmike is it good to merge now?

Member

kmike commented Oct 5, 2019

Hey @joaquingx! Could you please check this comment: #101 (comment)?

@joaquingx joaquingx force-pushed the add-support-dublin-core-metadata branch from 3dc5a69 to e5b51a7 Compare September 20, 2020 22:28
Contributor Author

joaquingx commented Sep 20, 2020

I finally changed the shallow copy to a deep copy 🙈. Please review it, @kmike 🙏

@joaquingx joaquingx requested a review from kmike September 20, 2020 23:37
@joaquingx joaquingx requested a review from Gallaecio October 2, 2020 15:47
@joaquingx joaquingx force-pushed the add-support-dublin-core-metadata branch from e471d95 to 043a479 Compare October 4, 2020 21:41
Member

@Gallaecio Gallaecio left a comment

Looks good to me, thanks!

@kmike kmike merged commit d78167c into scrapinghub:master Oct 6, 2020
Member

kmike commented Oct 6, 2020

Thanks @joaquingx for your persistence!

Development

Successfully merging this pull request may close these issues.

Add support for dublin core metadata(DCMI)
