[MRG+1] Add dublincore metadata #101

joaquingx · 2018-12-16T03:54:15Z

This PR adds the Dublin Core schema to extruct. To implement parsing, I used this document as a guide: http://dublincore.org/documents/dcq-html/ (especially section 3, which explains how a DC consumer should act).

More references:

Fixes #10

codecov · 2018-12-18T04:13:10Z

Codecov Report

Merging #101 into master will increase coverage by 1.34%.
The diff coverage is 98.48%.

@@            Coverage Diff             @@
##           master     #101      +/-   ##
==========================================
+ Coverage    87.3%   88.65%   +1.34%     
==========================================
  Files          11       12       +1     
  Lines         457      520      +63     
  Branches       97      111      +14     
==========================================
+ Hits          399      461      +62     
  Misses         52       52              
- Partials        6        7       +1

Impacted Files	Coverage Δ
extruct/_extruct.py	`75% <100%> (+1.66%)`	⬆️
extruct/uniform.py	`94.59% <100%> (+1.37%)`	⬆️
extruct/dublincore.py	`97.67% <97.67%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3ab5592...3d4bf5d.

joaquingx · 2018-12-18T04:49:27Z

What do you think, @lopuhin?

lopuhin · 2018-12-18T07:23:36Z

@joaquingx thanks for the PR! I've only had a quick look, and the PR looks great 👍
I have only one concern, though I'm not sure it applies here. Besides extractors, we also have uniform_processors:

extruct/extruct/_extruct.py

Lines 111 to 133 in 3ab5592

    
    if uniform:
        uniform_processors = []
        if 'microdata' in syntaxes:
            uniform_processors.append(
                ('microdata',
                 _umicrodata_microformat,
                 output['microdata'],
                 schema_context,
                 ))
        if 'microformat' in syntaxes:
            uniform_processors.append(
                ('microformat',
                 _umicrodata_microformat,
                 output['microformat'],
                 'http://microformats.org/wiki/',
                 ))
        if 'opengraph' in syntaxes:
            uniform_processors.append(
                ('opengraph',
                 _uopengraph,
                 output['opengraph'],
                 None,
                 ))

Their goal is to convert the syntax provided by each extractor into a more uniform syntax that is similar across extractors, making it easier to write code that supports all of them. Do you think such a converter would make sense for dublincore?
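To illustrate the idea for readers of this thread, here is a minimal sketch of how a list of `(syntax, converter, raw output, context)` tuples like the one quoted above could be applied. The names `toy_uniform` and `apply_uniform` and the `@context` key are assumptions made for this example; they are not taken from extruct's code, and the real converters may take different arguments.

```python
def toy_uniform(raw, context):
    # Hypothetical converter: attach the same context key to every item,
    # so items from different syntaxes end up with a similar shape.
    return [{**item, '@context': context} for item in raw]


def apply_uniform(output, processors):
    # Work on a shallow copy of the mapping so the caller's dict keeps
    # the raw extractor output under each syntax name.
    result = dict(output)
    for syntax, convert, raw, context in processors:
        result[syntax] = convert(raw, context)
    return result


output = {'microdata': [{'name': 'Example'}]}
processors = [
    ('microdata', toy_uniform, output['microdata'], 'http://schema.org'),
]
uniform_output = apply_uniform(output, processors)
```

With this shape, supporting a new syntax would only mean appending one more tuple, which is roughly what the question about dublincore amounts to.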

joaquingx · 2018-12-23T07:18:12Z

@lopuhin I added the uniform option. Can you confirm that it works as intended 🙏?

lopuhin

Thanks @joaquingx , looks good! Left a small comment regarding the code.

Also it would be great to update README.rst, adding dublincore at the start (at "Currently, extruct supports:"), and also adding it to the list of valid syntaxes https://github.com/scrapinghub/extruct#select-syntaxes and to https://github.com/scrapinghub/extruct#uniform. I don't think we should add it under https://github.com/scrapinghub/extruct#single-extractors (not that it's different from other syntaxes, but we should just refer to tests here instead of repeating them in README - this is unrelated to this PR).

extruct/uniform.py

codecov-io · 2019-01-14T21:04:51Z

Codecov Report

Merging #101 into master will increase coverage by 1.34%.
The diff coverage is 98.48%.

@@            Coverage Diff             @@
##           master     #101      +/-   ##
==========================================
+ Coverage    87.3%   88.65%   +1.34%     
==========================================
  Files          11       12       +1     
  Lines         457      520      +63     
  Branches       97      111      +14     
==========================================
+ Hits          399      461      +62     
  Misses         52       52              
- Partials        6        7       +1

Impacted Files	Coverage Δ
extruct/_extruct.py	`75% <100%> (+1.66%)`	⬆️
extruct/uniform.py	`94.59% <100%> (+1.37%)`	⬆️
extruct/dublincore.py	`97.67% <97.67%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3ab5592...031427f.

codecov-io · 2019-01-14T21:04:52Z

Codecov Report

Merging #101 into master will increase coverage by 1.00%.
The diff coverage is 98.52%.

@@            Coverage Diff             @@
##           master     #101      +/-   ##
==========================================
+ Coverage   89.23%   90.24%   +1.00%     
==========================================
  Files          12       13       +1     
  Lines         539      605      +66     
  Branches      122      136      +14     
==========================================
+ Hits          481      546      +65     
  Misses         52       52              
- Partials        6        7       +1

Impacted Files	Coverage Δ
extruct/dublincore.py	`97.61% <97.61%> (ø)`
extruct/_extruct.py	`75.60% <100.00%> (+2.27%)`	⬆️
extruct/uniform.py	`95.50% <100.00%> (+1.06%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7d245b3...043a479.

joaquingx · 2019-01-15T00:30:01Z

Hey @lopuhin, thanks for the feedback 🙏. I just have one question about this:

> I don't think we should add it under https://github.com/scrapinghub/extruct#single-extractors (not that it's different from other syntaxes, but we should just refer to tests here instead of repeating them in README - this is unrelated to this PR).

What's the correct way to do that? Should I just add a title, "DublinCore Extractor", and a link to https://github.com/joaquingx/extruct/blob/master/tests/test_extruct.py ?

lopuhin · 2019-01-15T10:54:02Z

@joaquingx thanks, the way you updated README looks perfect to me 👍

lopuhin

Looks great, thanks 👍

lopuhin · 2019-01-15T10:55:12Z

@Kebniss @kmike would you like to have another look?

extruct/dublincore.py

kmike · 2019-01-15T12:51:55Z

extruct/uniform.py

+def _udublincore(extracted):
+    out = []
+    for obj in extracted:
+        context = obj.pop('namespaces', None)


kmike (Jan 15, 2019): I think it is better not to modify the passed object here, and to make a copy instead.

joaquingx (Jan 17, 2019): Modified, thanks!

kmike (Jan 18, 2019): By using list(extracted) you're doing a shallow copy; changes to list elements still happen in place.

kmike (Feb 21, 2019): Hey @joaquingx! Do you think you can address the comment above? It seems this is the only remaining issue; the PR has everything else: a clean implementation, docs, tests, ...

joaquingx (Oct 5, 2019): Sure, I'll work on that 👍
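For anyone following this review thread, a small sketch of the difference under discussion. The function names, the sample data, and the `@context` output key are assumptions made for illustration; only the `obj.pop('namespaces', None)` call comes from the diff above.

```python
import copy


def pop_namespaces_shallow(extracted):
    # list(extracted) builds a new list, but its elements are the very
    # same dict objects, so pop() still mutates the caller's data.
    out = []
    for obj in list(extracted):
        context = obj.pop('namespaces', None)
        out.append({'@context': context, **obj})
    return out


def pop_namespaces_deep(extracted):
    # copy.deepcopy() also copies each dict, so pop() only touches the copies.
    out = []
    for obj in copy.deepcopy(extracted):
        context = obj.pop('namespaces', None)
        out.append({'@context': context, **obj})
    return out


def sample():
    # Hypothetical extractor output, for demonstration only.
    return [{
        'namespaces': {'DC': 'http://purl.org/dc/elements/1.1/'},
        'elements': [{'name': 'DC.title', 'content': 'Example'}],
    }]


shallow_input = sample()
pop_namespaces_shallow(shallow_input)
shallow_mutated = 'namespaces' not in shallow_input[0]

deep_input = sample()
pop_namespaces_deep(deep_input)
deep_preserved = 'namespaces' in deep_input[0]
```

Both functions return the same result; the only difference is whether the caller's input still contains `namespaces` afterwards, which is the side effect the review comment asks to avoid.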

tests/samples/songkick/elysianfields.json

extruct/dublincore.py

Kebniss · 2019-01-16T20:06:53Z

Aside from the comments left by @kmike, this PR looks good to me. Thank you @joaquingx!

joaquingx · 2019-01-17T21:58:46Z

Thanks @Kebniss 🙏. Do the changes look good to you, @kmike?

extruct/dublincore.py

joaquingx · 2019-10-04T18:07:29Z

@kmike is it good to merge now?

kmike · 2019-10-05T11:44:03Z

Hey @joaquingx! Could you please check this comment: #101 (comment)?

joaquingx · 2020-09-20T22:32:48Z

Finally changed the shallow copy to a deep copy 🙈. Please review it, @kmike 🙏

README.rst

tests/samples/songkick/tovestyrke.json

Gallaecio

Looks good to me, thanks!

kmike · 2020-10-06T19:42:35Z

Thanks @joaquingx for your persistence!

Add dublincore schema

ba568a3

joaquingx changed the title ~~[MRG] Add dublincore schema~~ [WIP] Add dublincore schema Dec 16, 2018

joaquingx changed the title ~~[WIP] Add dublincore schema~~ [WIP] Add dublincore metadata Dec 16, 2018

Update tests and change to raw strings

edc1f64

Fix file typo

cd01c5f

joaquingx changed the title ~~[WIP] Add dublincore metadata~~ [MRG] Add dublincore metadata Dec 18, 2018

Add uniform option

3d4bf5d

lopuhin reviewed Jan 9, 2019

extruct/uniform.py (outdated)

Update Readme with DublinCore Options

031427f

Fix list iteration

8cc8385

lopuhin approved these changes Jan 15, 2019


lopuhin changed the title ~~[MRG] Add dublincore metadata~~ [MRG+1] Add dublincore metadata Jan 15, 2019

kmike reviewed Jan 15, 2019

extruct/dublincore.py (outdated)

kmike reviewed Jan 15, 2019

extruct/dublincore.py (outdated)

kmike reviewed Jan 15, 2019

tests/samples/songkick/elysianfields.json (outdated)

kmike reviewed Jan 15, 2019

extruct/dublincore.py

Make requested changes

ac2bdfc

kmike reviewed Jan 18, 2019

extruct/dublincore.py (outdated)

Add local variable to improve legibility

32e416b

joaquingx added 2 commits September 20, 2020 17:19

Change shallow cpy to deep cpy, update extruct, readme.

bd1448f

Merge branch 'master' into add-support-dublin-core-metadata

e5b51a7

joaquingx force-pushed the add-support-dublin-core-metadata branch from 3dc5a69 to e5b51a7 Compare September 20, 2020 22:28

joaquingx requested a review from kmike September 20, 2020 23:37

Gallaecio reviewed Oct 1, 2020

README.rst (outdated)

tests/samples/songkick/tovestyrke.json (outdated)

update README.rst, normalize indentation

dee37e6

joaquingx requested a review from Gallaecio October 2, 2020 15:47

Specify DC version, update link.

043a479

joaquingx force-pushed the add-support-dublin-core-metadata branch from e471d95 to 043a479 Compare October 4, 2020 21:41

Gallaecio approved these changes Oct 5, 2020

kmike merged commit d78167c into scrapinghub:master Oct 6, 2020
