Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added scripts and data retrieved from unicode CLDR #321

Merged
merged 47 commits into from Jan 15, 2018
Merged

added scripts and data retrieved from unicode CLDR #321

merged 47 commits into from Jan 15, 2018

Conversation

sarthakmadaan
Copy link
Contributor

@sarthakmadaan sarthakmadaan commented Jun 3, 2017

Added scripts use to retrieve and store data in desired format, and added the data retrieved. Still support for numbering systems and numerals need to be added and some more issues are to be dealt with.

@codecov-io
Copy link

codecov-io commented Jun 3, 2017

Codecov Report

Merging #321 into master will decrease coverage by 3.32%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #321      +/-   ##
==========================================
- Coverage   97.61%   94.28%   -3.33%     
==========================================
  Files          20      299     +279     
  Lines        1674     1836     +162     
==========================================
+ Hits         1634     1731      +97     
- Misses         40      105      +65
Impacted Files Coverage Δ
data/__init__.py 100% <ø> (ø)
dateparser/utils/__init__.py 98.13% <0%> (-0.1%) ⬇️
dateparser/conf.py 97.91% <0%> (ø) ⬆️
dateparser/languages/__init__.py 100% <0%> (ø) ⬆️
dateparser/languages/detection.py
dateparser/languages/language.py
dateparser/languages/validation.py
dateparser/data/date_translation_data/eu.py 100% <0%> (ø)
dateparser/data/numeral_translation_data/hi.py 0% <0%> (ø)
dateparser/data/languages_info.py 100% <0%> (ø)
... and 282 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5308434...e24f8e8. Read the comment docs.

Copy link
Member

@asadurski asadurski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! I'm leaving some comments regarding data download scripts.

@@ -0,0 +1,480 @@
import requests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any new modules (requests, orderedset) must be added to requirements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But these are imported in the scripts which won't be used by users, do we still need to add these to requirements?

from utils import get_dict_difference
from order_languages import language_locale_dict

OAuth_Access_Token = 'OAuth_Access_Token' # Add OAuth_Access_Token here
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that we could replace direct access to git with requests+auth with using:
Repo.clone_from('https://github.com/unicode-cldr/cldr-dates-full', 'path')
and working on temporary cloned repo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know about that, will make changes soon.

redundant_keys = []
for key, value in json_dict.items():
if not value:
redundant_keys.append(key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would probably be more efficient to do with filter(). I mean something like:

def filter_func(keyvalue):
    key, value = keyvalue
    if value and not value.isdigit():  # etc... the conditions for filtering
        return True
json_dict = dict(filter(filter_func, json_dict.items()))

json_dict["date_order"] = DATE_ORDER_PATTERN.sub(
r'\1\2\3', DATE_ORDER_PATTERN.search(date_format_string).group())

json_dict["january"] = [gregorian_dict["months"]["stand-alone"]["wide"]["1"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dict creation part could be made a lot shorter with loops.

@sarthakmadaan
Copy link
Contributor Author

Thanks for the review @asadurski. I have modified the scripts and they work much faster now.

@asadurski asadurski mentioned this pull request Jan 7, 2018
@asadurski asadurski merged commit e24f8e8 into scrapinghub:master Jan 15, 2018
@asadurski asadurski mentioned this pull request Jan 22, 2018
@sarthakmadaan sarthakmadaan deleted the Integrate_CLDR branch February 5, 2018 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants