added scripts and data retrieved from unicode CLDR #321

sarthakmadaan · 2017-06-03T14:18:56Z

Added the scripts used to retrieve and store the data in the desired format, along with the retrieved data. Support for numbering systems and numerals still needs to be added, and some more issues remain to be dealt with.

codecov-io · 2017-06-03T14:27:39Z

Codecov Report

Merging #321 into master will decrease coverage by 3.32%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #321      +/-   ##
==========================================
- Coverage   97.61%   94.28%   -3.33%     
==========================================
  Files          20      299     +279     
  Lines        1674     1836     +162     
==========================================
+ Hits         1634     1731      +97     
- Misses         40      105      +65

Impacted Files	Coverage Δ
data/__init__.py	`100% <ø> (ø)`
dateparser/utils/__init__.py	`98.13% <0%> (-0.1%)`	⬇️
dateparser/conf.py	`97.91% <0%> (ø)`	⬆️
dateparser/languages/__init__.py	`100% <0%> (ø)`	⬆️
dateparser/languages/detection.py
dateparser/languages/language.py
dateparser/languages/validation.py
dateparser/data/date_translation_data/eu.py	`100% <0%> (ø)`
dateparser/data/numeral_translation_data/hi.py	`0% <0%> (ø)`
dateparser/data/languages_info.py	`100% <0%> (ø)`
... and 282 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5308434...e24f8e8.

asadurski

Hi! I'm leaving some comments regarding data download scripts.

asadurski · 2017-07-27T09:24:32Z

scripts/get_cldr_data.py

@@ -0,0 +1,480 @@
+import requests


Any new modules (requests, orderedset) must be added to requirements.

sarthakmadaan · 2017-08-11

But these are imported only in the scripts, which won't be used by users. Do we still need to add them to requirements?

asadurski · 2017-08-11T12:10:10Z

scripts/get_cldr_data.py

+from utils import get_dict_difference
+from order_languages import language_locale_dict
+
+OAuth_Access_Token = 'OAuth_Access_Token'       # Add OAuth_Access_Token here


I believe we could replace the direct access to GitHub through requests + auth with:
Repo.clone_from('https://github.com/unicode-cldr/cldr-dates-full', 'path')
and then work on the temporary cloned repo.

sarthakmadaan · 2017-08-11

I didn't know about that; I will make the changes soon.
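A minimal sketch of the clone-based approach suggested above, assuming GitPython is installed (a third-party package; the PR as shown does not confirm it was adopted). The function name `load_cldr_json`, its `clone` parameter, and the file path used in the usage note are hypothetical and chosen for illustration; only `Repo.clone_from(url, to_path)` comes from the review comment.

```python
import json
import os
import tempfile


def load_cldr_json(repo_url, relative_path, clone=None):
    """Clone repo_url into a temporary directory and load one JSON file from it.

    `clone` is any callable taking (url, to_path); it defaults to
    GitPython's Repo.clone_from and exists so the function can be
    exercised without network access.
    """
    if clone is None:
        # Assumption: GitPython is available (pip install GitPython).
        from git import Repo
        clone = Repo.clone_from

    # The temporary directory is deleted when the block exits,
    # so nothing from the cloned repository is left behind.
    with tempfile.TemporaryDirectory() as workdir:
        clone(repo_url, workdir)
        with open(os.path.join(workdir, relative_path), encoding="utf-8") as handle:
            return json.load(handle)
```

A call might look like `load_cldr_json('https://github.com/unicode-cldr/cldr-dates-full', 'main/en/ca-gregorian.json')`, where the path inside the repository is an assumption. Cloning over HTTPS from a public repository needs no OAuth token, which is the point of the suggestion.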

asadurski · 2017-08-11T13:14:06Z

scripts/get_cldr_data.py

+    redundant_keys = []
+    for key, value in json_dict.items():
+        if not value:
+            redundant_keys.append(key)


It would probably be more efficient to do this with filter(). I mean something like:

    def filter_func(keyvalue):
        key, value = keyvalue
        if value and not value.isdigit():  # etc... the conditions for filtering
            return True

    json_dict = dict(filter(filter_func, json_dict.items()))
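The filter() idea above can be sketched in runnable form as follows. The names `is_useful` and `drop_redundant` and the sample data are hypothetical, and the "purely numeric string" condition only stands in for the real filtering rules, which the comment leaves open. Whether this is actually faster than the loop depends on the data; the main gain is that the dict is rebuilt in one pass instead of collecting keys and deleting them afterwards.

```python
def is_useful(item):
    """Return True for (key, value) pairs that should be kept."""
    _key, value = item
    if not value:
        # Empty strings, empty lists, None: the "redundant" entries.
        return False
    # Hypothetical extra condition: drop strings made only of digits.
    if isinstance(value, str) and value.isdigit():
        return False
    return True


def drop_redundant(json_dict):
    """Build a new dict containing only the useful entries."""
    return dict(filter(is_useful, json_dict.items()))


# Made-up sample data, not taken from CLDR.
sample = {"january": ["January", "Jan"], "era": "", "count": "12", "am": ["AM"]}
cleaned = drop_redundant(sample)
```

With this sample, `cleaned` keeps only the `january` and `am` entries.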

asadurski · 2017-08-11T17:50:15Z

scripts/get_cldr_data.py

+    json_dict["date_order"] = DATE_ORDER_PATTERN.sub(
+            r'\1\2\3', DATE_ORDER_PATTERN.search(date_format_string).group())
+
+    json_dict["january"] = [gregorian_dict["months"]["stand-alone"]["wide"]["1"],


This dict creation part could be made a lot shorter with loops.
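As an illustration of the loop-based rewrite suggested above, here is a minimal sketch. It assumes the CLDR JSON layout `months -> context -> width -> "1".."12"`; the function name `month_translations` and the particular choice of contexts and widths are assumptions, since the diff excerpt shows only the stand-alone wide form of the first month.

```python
MONTH_KEYS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]

# Assumed selection of CLDR contexts and widths to collect.
CONTEXTS = ("stand-alone", "format")
WIDTHS = ("wide", "abbreviated")


def month_translations(gregorian_dict):
    """Collect the translations of each month name into one list per month."""
    months = gregorian_dict["months"]
    result = {}
    for number, key in enumerate(MONTH_KEYS, start=1):
        index = str(number)  # CLDR keys months by the strings "1" to "12"
        result[key] = [
            months[context][width][index]
            for context in CONTEXTS
            for width in WIDTHS
        ]
    return result
```

The same pattern would apply to weekday names, replacing twelve (or seven) near-identical assignments with one loop over a list of keys.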

sarthakmadaan · 2017-08-14T23:28:31Z

Thanks for the review @asadurski. I have modified the scripts and they work much faster now.

sarthakmadaan added 13 commits March 23, 2017 00:27

Attempted to implement lazy loading of languages data

68d8f4b

implemented loading if languages not present in available_language_map

a6ece73

defined _get_language_map to implement on demand loading and made some changes

6c9349a

made changes in LanguageDataLoader and DateDataParser

2bdffe1

made changes to remove errors

047cb98

added tests for loading

66a6432

added more testcases

6fb9938

Merge remote-tracking branch 'upstream/master' into reduce_import_time

f535791

Merge remote-tracking branch 'upstream/master' into reduce_import_time

3c8bcd9

added sv and ka to language_order

9cb58cc

Merge remote-tracking branch 'upstream/master' into reduce_import_time

50d2de9

made changes to load via file and added tests

48c077b

added scripts and data retrieved from unicode CLDR

b53ff4d

sarthakmadaan added 3 commits June 9, 2017 02:47

made some changes to script to correct date_order

4fd4068

set order and avoided storing data same as language in locale specific data

e5a7050

added numeral data

5789686

waqasshabbir added the GSOC 2017 label Jun 12, 2017

sarthakmadaan added 12 commits June 16, 2017 00:37

made changes to implement LanguageDataLoader as generator function and other changes to make parsing more efficient

e891fed

Merge remote-tracking branch 'upstream/master' into reduce_import_time

8972412

made some changes and resolved failing tests

d1b7916

added some changes and more tests for date.py

b8f1503

added more tests for loading

268d410

separated numeral translation data

b11e21f

added use_given_order argument to DateDataParser

ddf666a

added test for loading with given order(not strict)

c8c0c16

added translations for am, pm

5f890b6

cleaned cldr data, modified existing data to form supplementary data, added language_order

30ada12

tried to make code pep8 compliant

3f65cd7

tried to make scripts pep8 compliant

ee0df55

sarthakmadaan added 6 commits June 30, 2017 01:20

stored complete translation data in python files

dbed66f

made scripts python3 compatible, made data more clean and ordered

032b4d2

Merge branch 'reduce_import_time' into Integrate_CLDR

faef593

made changes to support parsing with new data

8c24e3a

modified detection of applicable languages

a1956bb

separated relative regex strings and combined relative regex strings with same translations

642edfb

asadurski requested changes Aug 11, 2017

sarthakmadaan added 2 commits August 13, 2017 18:33

resolved some tests in test_languages

57b1cde

modified scripts and data

1d37900

sarthakmadaan added 11 commits August 16, 2017 04:59

resolved some tests

cf76f2e

edited docstrings

8900997

made changes to resolve tests for python 3 and added requirements for scripts

b1953ae

resolved tests

1ee6949

removed validation and added test_data, removed languages with insufficient data, made necessary changes

7f823c5

added some more tests

b11030e

added more tests for translation

e5e44a0

modified splitting to allow month and week names with numerals, added more tests for translation

45675e4

added tests for freshness_date_parser and added docstrings

fe0310c

edited README.rst and CONTRIBUTING.rst

6f6a058

added more tests for relative future dates

e24f8e8

asadurski mentioned this pull request Jan 7, 2018

GSoC features #372

Merged

asadurski merged commit e24f8e8 into scrapinghub:master Jan 15, 2018

asadurski mentioned this pull request Jan 22, 2018

Remove yaml dependency #374

Merged

sarthakmadaan deleted the Integrate_CLDR branch February 5, 2018 18:18
