Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix duplicated garbage #130

Merged
merged 7 commits into from
Jul 9, 2019
Merged

Fix duplicated garbage #130

merged 7 commits into from
Jul 9, 2019

Conversation

manycoding
Copy link
Contributor

@manycoding manycoding commented Jun 29, 2019

The pr does several things:

  1. uses apply(str) to search for patterns in nested data
  2. updates regex for better matching (it finds more symbols, especially entities)

apply(str) makes strings from nested objects (lists, dicts and any combination), and the downside is that the current regex won't search traling\ending whitespaces in nested data:
"['days inn lombard street hotel san francisco', 'lombard street days inn', 'days inn san francisco ', 'san francisco days inn hotel lombard st', 'san francisco day inn lombard street']"
Here r"(?P<spaces>^\s|\s$)" doesn't detect this whitespace in francisco '

Personally, I don't think it's crucial to find those whitespaces in nested data, any comments?

Performance (now, before):

  1. Garbage performance
    nested data
    2k sample - 331ms vs 800ms
    12k sample - 1.4s vs 1.6s (2.4s including flattening)
    90k sample - 2:07m vs 1:31m (matt takes about 1:30m too)
    not nested
    10k sample - 560ms vs 600ms
  2. Boost from removing flat_df varies, about 1s on 10k sample but it grows with the number of items. The biggest benefit is that pattern search is much faster on fewer nested columns with apply.str() than on larger number of flat columns.

And here's how it looks:
Screenshot 2019-07-01 at 12 58 00

@ivankivanov you will be interested in this one.

@manycoding manycoding added this to the 0.3.6 milestone Jun 29, 2019
@codecov
Copy link

codecov bot commented Jun 29, 2019

Codecov Report

Merging #130 into neat_report will decrease coverage by 0.05%.
The diff coverage is 100%.

Impacted file tree graph

@@               Coverage Diff               @@
##           neat_report     #130      +/-   ##
===============================================
- Coverage        79.83%   79.78%   -0.06%     
===============================================
  Files               22       22              
  Lines             1582     1583       +1     
  Branches           273      273              
===============================================
  Hits              1263     1263              
  Misses             275      275              
- Partials            44       45       +1
Impacted Files Coverage Δ
src/arche/data_quality_report.py 77.52% <ø> (ø) ⬆️
src/arche/arche.py 85.5% <100%> (ø) ⬆️
src/arche/report.py 97.75% <100%> (+0.05%) ⬆️
src/arche/rules/others.py 100% <100%> (ø) ⬆️
src/arche/readers/items.py 88.28% <0%> (-0.79%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 14b4da2...75b6326. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 29, 2019

Codecov Report

Merging #130 into 0.3.6dev will decrease coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff              @@
##           0.3.6dev     #130      +/-   ##
============================================
- Coverage     79.81%   79.77%   -0.04%     
============================================
  Files            22       22              
  Lines          1580     1582       +2     
  Branches        274      274              
============================================
+ Hits           1261     1262       +1     
  Misses          275      275              
- Partials         44       45       +1
Impacted Files Coverage Δ
src/arche/data_quality_report.py 77.52% <ø> (ø) ⬆️
src/arche/arche.py 85.61% <100%> (ø) ⬆️
src/arche/report.py 97.75% <100%> (+0.05%) ⬆️
src/arche/rules/others.py 100% <100%> (ø) ⬆️
src/arche/readers/items.py 88.28% <0%> (-0.79%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7a79cd8...8d5688a. Read the comment docs.

@manycoding manycoding marked this pull request as ready for review July 1, 2019 17:55
@andersonberg
Copy link
Member

I agree it's not crucial, but maybe, if possible, we can make it optional. This is because I think it depends on the customer if they need to export, let's say, to a csv file and we have to remove trailing whitespaces

@manycoding
Copy link
Contributor Author

manycoding commented Jul 3, 2019

@andersonberg But nested fields are not exportable to csv, aren't they?

@andersonberg
Copy link
Member

Yes, you're right. So, yeah, we don't have to find whitespaces in nested data

@manycoding manycoding changed the title Fix garbage and get rid of flat_df Fix duplicated garbage Jul 4, 2019
@manycoding manycoding mentioned this pull request Jul 4, 2019
@manycoding manycoding changed the base branch from neat_report to 0.3.6dev July 5, 2019 15:17
@manycoding manycoding changed the base branch from 0.3.6dev to master July 5, 2019 17:18
@manycoding manycoding changed the base branch from master to 0.3.6dev July 5, 2019 17:18
src/arche/rules/others.py Outdated Show resolved Hide resolved
src/arche/rules/others.py Outdated Show resolved Hide resolved
manycoding and others added 2 commits July 8, 2019 12:52
Co-Authored-By: Adrián Chaves <adrian@chaves.io>
@manycoding
Copy link
Contributor Author

manycoding commented Jul 8, 2019

@Gallaecio Thanks for noting inconsistencies.
I updated regex to account for &#9; and exclude&a;, and made every quantifier non-greedy.

src/arche/rules/others.py Outdated Show resolved Hide resolved
Co-Authored-By: Adrián Chaves <adrian@chaves.io>
@manycoding manycoding merged commit af1abcb into 0.3.6dev Jul 9, 2019
@manycoding manycoding deleted the fix_duplicated_garbage branch July 9, 2019 22:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants