Optional Language Detect #932

Merged
merged 141 commits into from
Sep 6, 2021
Changes from 137 commits
1e86099
test
gavishpoddar Jun 1, 2021
2c6e29d
test
gavishpoddar Jun 1, 2021
176565b
creating basic structure
gavishpoddar Jun 1, 2021
d1c8678
commit
gavishpoddar Jun 1, 2021
c8e22d2
updates
gavishpoddar Jun 2, 2021
39a2491
implimenting language library
gavishpoddar Jun 3, 2021
32321cf
custom language detect workable model
gavishpoddar Jun 3, 2021
499a29d
custom language parser updates
gavishpoddar Jun 3, 2021
9a41213
lang_detect implimentation
gavishpoddar Jun 4, 2021
20c5b0e
fixing error handling
gavishpoddar Jun 5, 2021
f6a2098
template update
gavishpoddar Jun 5, 2021
a3de5d5
fixes
gavishpoddar Jun 7, 2021
6531fd8
optional language detect in search_dates
gavishpoddar Jun 7, 2021
28fbd94
fixing language detection loader
gavishpoddar Jun 8, 2021
edfc760
fixes
gavishpoddar Jun 8, 2021
51f67f0
fiexs
gavishpoddar Jun 8, 2021
584f40b
fixing language detection
gavishpoddar Jun 9, 2021
b2d1ad6
fixing code and PEP8
gavishpoddar Jun 11, 2021
2b6f195
removing models
gavishpoddar Jun 11, 2021
b0c1ad0
fixes on search_data
gavishpoddar Jun 11, 2021
82ce3f1
restruction functions
gavishpoddar Jun 14, 2021
8a4268a
creating tox tests
gavishpoddar Jun 16, 2021
36a88d7
Update dateparser/date.py
gavishpoddar Jun 18, 2021
e04faf8
Update tests/test_language_detect.py
gavishpoddar Jun 18, 2021
270faf6
minor fixes
gavishpoddar Jun 18, 2021
adb8d3f
Update tests/test_language_detect.py
gavishpoddar Jun 18, 2021
6976a65
Update tests/test_language_detect.py
gavishpoddar Jun 18, 2021
ad04bcf
fixes
gavishpoddar Jun 18, 2021
4790d97
fixes
gavishpoddar Jun 18, 2021
7bc43af
updates
gavishpoddar Jun 18, 2021
a313d33
Update dateparser/search/search.py
gavishpoddar Jun 18, 2021
7fda13e
exception handling
gavishpoddar Jun 19, 2021
20aba0a
minor fixes
gavishpoddar Jun 20, 2021
57f2cca
fixes
gavishpoddar Jun 20, 2021
e6e3ed2
Update dateparser/search/search.py
gavishpoddar Jun 20, 2021
d4e68c7
Update dateparser/search/search.py
gavishpoddar Jun 20, 2021
205e29f
fixing tests and search_date default langauge
gavishpoddar Jun 23, 2021
345c18e
WIP: map_languages and fixes for USE_STRICT
gavishpoddar Jun 23, 2021
d25f3b2
fixing language_maps structure and WIP docs
gavishpoddar Jun 24, 2021
86df926
updating mapping
gavishpoddar Jun 25, 2021
d5764bf
passing settings as param to language detect.
gavishpoddar Jun 26, 2021
63262cf
WIP : Documentation
gavishpoddar Jun 26, 2021
2fd53c4
Fixing : DEFAULT_LANGUAGES
gavishpoddar Jun 26, 2021
adb3fcc
Updating : Docs
gavishpoddar Jun 26, 2021
2aa4146
Updating Docs
gavishpoddar Jun 27, 2021
6f32301
Fixing Docs
gavishpoddar Jun 29, 2021
2ce7e07
Fixing langdetect global state issue
gavishpoddar Jun 30, 2021
172361f
WIP:Language Map
gavishpoddar Jun 30, 2021
23c959d
Updating language_info with language_map
gavishpoddar Jun 30, 2021
339f7f6
WIP:Download Manager
gavishpoddar Jun 30, 2021
580a859
Updating setup.py
gavishpoddar Jul 1, 2021
2a81f26
complete : datearser-download
gavishpoddar Jul 1, 2021
0b7607a
download_manager HTTP error handling
gavishpoddar Jul 4, 2021
bbfebc9
Updating docs custom_lang_detect
gavishpoddar Jul 4, 2021
f6e8c7b
Updating date.py `text` param
gavishpoddar Jul 4, 2021
4f134d3
Update docs
gavishpoddar Jul 4, 2021
50b7224
dateparser-download setting default dir
gavishpoddar Jul 5, 2021
11b55d1
Merge branch 'language' of https://github.com/gavishpoddar/dateparser…
gavishpoddar Jul 5, 2021
2971134
updating params position
gavishpoddar Jul 5, 2021
3836548
Updating docs
gavishpoddar Jul 5, 2021
19ca2ff
Fixning docs
gavishpoddar Jul 16, 2021
c715e58
Implimenting clear_cache and remaning detect_lang_function
gavishpoddar Jul 18, 2021
03cd5b5
Updating Docs
gavishpoddar Jul 18, 2021
d6448d0
Updating Docs: Apply suggestions from code review
gavishpoddar Jul 19, 2021
639fb0a
Updating: Docs
gavishpoddar Jul 19, 2021
25780f8
Commenting test
gavishpoddar Jul 20, 2021
a5fe589
Minor fixes
gavishpoddar Jul 20, 2021
82cad00
print -> logging
gavishpoddar Jul 20, 2021
a684fdf
DEFAULT_LANGUAGE works without optional langauge detection
gavishpoddar Jul 22, 2021
03dd0be
fixning check_data_model_home_existance()
gavishpoddar Jul 22, 2021
df0d54a
implimenting argparse in dateparser-download
gavishpoddar Jul 22, 2021
54ccdf8
Updating docs
gavishpoddar Jul 22, 2021
2c8c007
Updating tests
gavishpoddar Jul 26, 2021
b5e0a30
Commenting test
gavishpoddar Jul 26, 2021
985dad6
Updating tests
gavishpoddar Jul 26, 2021
6745790
Removing fasttext default confidence_threshold and removing results l…
gavishpoddar Jul 26, 2021
69ea22f
caching fasettext model
gavishpoddar Jul 26, 2021
e184092
Fixing tests
gavishpoddar Jul 26, 2021
239f4a3
fixing texts and removinf confidence_threshold
gavishpoddar Jul 26, 2021
abab857
improving coverage
gavishpoddar Jul 26, 2021
ffea246
updating settings
gavishpoddar Jul 26, 2021
e199252
updating tests codecov
gavishpoddar Jul 26, 2021
9266b06
Creating new codecov tests in test_language_detect
gavishpoddar Jul 26, 2021
7073f39
removing unnecessary files
gavishpoddar Jul 27, 2021
fdc5d17
Minor improvement : map_languages.py += to .extend()
gavishpoddar Jul 29, 2021
4cabce8
fixing _load_fasttext_model()
gavishpoddar Jul 29, 2021
08e416e
updating dateparser-cli
gavishpoddar Jul 29, 2021
68f7e10
adding support for windows
gavishpoddar Jul 29, 2021
4601424
Adding exception for file not found
gavishpoddar Jul 29, 2021
d0d2cc2
Fixing: fasttext working in windows
gavishpoddar Jul 29, 2021
3cd316b
Fixing tests for python 3.5
gavishpoddar Jul 29, 2021
c68b2f3
Improving dateparser-download
gavishpoddar Jul 29, 2021
7139c79
Updating langdetect.py `_get_language_probablities`
gavishpoddar Jul 30, 2021
3ac2fd9
Update += to .extend() in date.py
gavishpoddar Jul 30, 2021
8896467
updating settings.rst
gavishpoddar Jul 30, 2021
1833f1f
improvements: dateparser-download
gavishpoddar Jul 30, 2021
1318afd
creating setting
gavishpoddar Jul 30, 2021
5401b88
Updating docs
gavishpoddar Jul 30, 2021
0507b48
adding comments in langdetect.py
gavishpoddar Jul 31, 2021
166ac33
Making Factory in langdetect.py locally accessible
gavishpoddar Jul 31, 2021
6e6eb14
minor fixes
gavishpoddar Aug 1, 2021
1fb9768
LANGUAGE_DETECTION_CONFIDENCE_THRESHOLD aditionat checks and test
gavishpoddar Aug 2, 2021
a95cf0c
updating tests
gavishpoddar Aug 2, 2021
dd76db3
minor fixes and improvements
gavishpoddar Aug 5, 2021
ea5be90
fixes
gavishpoddar Aug 7, 2021
5de7f98
improving language_mapping
gavishpoddar Aug 8, 2021
67cfad2
improvising dateparser
gavishpoddar Aug 9, 2021
3cb5257
improving dateparser-cli
gavishpoddar Aug 9, 2021
4b8f850
renamming create_language_maps to generate_language_map
gavishpoddar Aug 9, 2021
8de4d3d
DEFAULT_WIXDOWS_CACHE_DIR python 3.5 compitability
gavishpoddar Aug 9, 2021
05f43bb
improving:detect_languages_function
gavishpoddar Aug 9, 2021
b3db7d7
Updating docs
gavishpoddar Aug 9, 2021
6c1475a
trying to resolve git conflicting files
gavishpoddar Aug 9, 2021
b8a25dc
Trying to even this base with master
gavishpoddar Aug 9, 2021
57d3386
Merge branch 'master' into language
gavishpoddar Aug 9, 2021
51d1b80
Creating default_languages extra_check
gavishpoddar Aug 9, 2021
2602650
Improvements
gavishpoddar Aug 9, 2021
1a035ea
Apply suggestions from code review
gavishpoddar Aug 10, 2021
10bd877
micro improvements from review
gavishpoddar Aug 12, 2021
ab15a3b
Merge branch 'language' of https://github.com/gavishpoddar/dateparser…
gavishpoddar Aug 12, 2021
ebe64f9
fixing typo
gavishpoddar Aug 12, 2021
1ba3dc2
removing env variable
gavishpoddar Aug 17, 2021
a2c999f
return type checking in language_mapping
gavishpoddar Aug 17, 2021
88d080e
Updates from code review
gavishpoddar Aug 18, 2021
909c5f3
commit of checks
gavishpoddar Aug 19, 2021
658e11f
Apply suggestions from code review
gavishpoddar Aug 21, 2021
5a17537
Apply suggestions from code review
gavishpoddar Aug 21, 2021
3b296db
updating tests and docs
gavishpoddar Aug 23, 2021
cbb2564
Merge branch 'language' of https://github.com/gavishpoddar/dateparser…
gavishpoddar Aug 23, 2021
b6cbded
updating docs
gavishpoddar Aug 23, 2021
5b19b86
Update dateparser/data/languages_info.py
gavishpoddar Aug 23, 2021
03a689c
removing __init__.py
gavishpoddar Aug 24, 2021
6a78400
Merge branch 'scrapinghub:master' into language
gavishpoddar Aug 24, 2021
dd59484
Merge branch 'language' of https://github.com/gavishpoddar/dateparser…
gavishpoddar Aug 24, 2021
0555e38
adding tests
gavishpoddar Aug 24, 2021
5df60c2
adding __init__.py file
gavishpoddar Aug 30, 2021
e68260a
updating dateparser-downloads and docs
gavishpoddar Aug 30, 2021
141d2ca
Merge branch 'scrapinghub:master' into language
gavishpoddar Sep 1, 2021
001a9d7
updating dateparser-download
gavishpoddar Sep 1, 2021
ae36bc6
Merge branch 'language' of https://github.com/gavishpoddar/dateparser…
gavishpoddar Sep 1, 2021
b8dcf7b
PIP8 : new line
gavishpoddar Sep 3, 2021
13 changes: 10 additions & 3 deletions dateparser/__init__.py
@@ -7,7 +7,8 @@


@apply_settings
-def parse(date_string, date_formats=None, languages=None, locales=None, region=None, settings=None):
+def parse(date_string, date_formats=None, languages=None, locales=None,
+          region=None, settings=None, detect_languages_function=None):
"""Parse date and time from given date string.

:param date_string:
@@ -39,6 +40,12 @@ def parse(date_string, date_formats=None, languages=None, locales=None, region=N
Configure customized behavior using settings defined in :mod:`dateparser.conf.Settings`.
:type settings: dict

:param detect_languages_function:
A function for language detection that takes as input a string (the `date_string`) and
a `confidence_threshold`, and returns a list of detected language codes.
Note: this function is only used if ``languages`` and ``locales`` are not provided.
:type detect_languages_function: function

:return: Returns :class:`datetime <datetime.datetime>` representing parsed date if successful, else returns None
:rtype: :class:`datetime <datetime.datetime>`.
:raises:
@@ -47,9 +54,9 @@ def parse(date_string, date_formats=None, languages=None, locales=None, region=N
"""
     parser = _default_parser

-    if languages or locales or region or not settings._default:
+    if languages or locales or region or detect_languages_function or not settings._default:
         parser = DateDataParser(languages=languages, locales=locales,
-                                region=region, settings=settings)
+                                region=region, settings=settings, detect_languages_function=detect_languages_function)

     data = parser.get_date_data(date_string, date_formats)

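A minimal sketch of a function that fits the `detect_languages_function` contract described in the docstring above: it takes the text and a `confidence_threshold` and returns a list of language codes. The names `detect_by_keyword`, `_HINTS` and `parse_with_detector` are hypothetical and exist only for illustration; the toy "confidence" is an assumption, not how the bundled detectors score text.

```python
# Hypothetical keyword table: language code -> words that hint at that language.
_HINTS = {
    "es": {"enero", "febrero", "lunes", "martes"},
    "fr": {"janvier", "février", "lundi", "mardi"},
}


def detect_by_keyword(text, confidence_threshold):
    """Toy detector: confidence is the share of words that match a language's hints."""
    words = text.lower().split()
    if not words:
        return []
    detected = []
    for code, hints in _HINTS.items():
        confidence = sum(word in hints for word in words) / len(words)
        if confidence > confidence_threshold:
            detected.append(code)
    return detected


def parse_with_detector(date_string):
    """Pass the detector to dateparser.parse.

    Requires a dateparser release that includes this PR; imported lazily so the
    sketch above can be tried without dateparser installed.
    """
    import dateparser
    return dateparser.parse(date_string, detect_languages_function=detect_by_keyword)


print(detect_by_keyword("12 enero 2021", 0.2))
```

Per the docstring, the detector is only used when neither `languages` nor `locales` is passed to `parse`.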
33 changes: 33 additions & 0 deletions dateparser/conf.py
@@ -2,6 +2,7 @@
from datetime import datetime
from functools import wraps

from dateparser.data.languages_info import language_order
from .parser import date_order_chart
from .utils import registry

@@ -25,6 +26,8 @@ class Settings:
* `NORMALIZE`
* `RETURN_TIME_AS_PERIOD`
* `PARSERS`
* `DEFAULT_LANGUAGES`
* `LANGUAGE_DETECTION_CONFIDENCE_THRESHOLD`
"""

_default = True
@@ -129,6 +132,28 @@ def _check_parsers(setting_name, setting_value):
_check_repeated_values(setting_name, setting_value)


def _check_default_languages(setting_name, setting_value):
    unsupported_languages = set(setting_value) - set(language_order)
    if unsupported_languages:
        raise SettingValidationError(
            "Found invalid languages in the '{}' setting: {}".format(
                setting_name, ', '.join(map(repr, unsupported_languages))
            )
        )
    _check_repeated_values(setting_name, setting_value)


def _check_between_0_and_1(setting_name, setting_value):
    is_valid = 0 <= setting_value <= 1
    if not is_valid:
        raise SettingValidationError(
            '{} is not a valid value for {}. It can take values between 0 and '
            '1.'.format(
                setting_value, setting_name,
            )
        )


def check_settings(settings):
"""
Check if provided settings are valid, if not it raises `SettingValidationError`.
@@ -193,6 +218,14 @@ def check_settings(settings):
        'PREFER_LOCALE_DATE_ORDER': {
            'type': bool
        },
        'DEFAULT_LANGUAGES': {
            'type': list,
            'extra_check': _check_default_languages
        },
        'LANGUAGE_DETECTION_CONFIDENCE_THRESHOLD': {
            'type': float,
            'extra_check': _check_between_0_and_1
        },
    }

    modified_settings = settings._mod_settings  # check only modified settings
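To illustrate what the two new `extra_check` validators accept and reject, here is a self-contained sketch. `SettingValidationError`, `SUPPORTED` and `is_valid` below are stand-ins defined for this example (the real code uses dateparser's own exception and `language_order`), so treat the output as indicative only.

```python
class SettingValidationError(ValueError):
    """Stand-in for dateparser's exception so the sketch is self-contained."""


# Hypothetical subset of supported codes; the real check uses `language_order`.
SUPPORTED = ["en", "es", "fr"]


def check_default_languages(name, value):
    # Reject any code that is not in the supported list.
    unsupported = set(value) - set(SUPPORTED)
    if unsupported:
        raise SettingValidationError(
            "Found invalid languages in the '{}' setting: {}".format(
                name, ", ".join(map(repr, sorted(unsupported)))
            )
        )


def check_between_0_and_1(name, value):
    # Both bounds are inclusive, matching `0 <= setting_value <= 1` in the diff.
    if not 0 <= value <= 1:
        raise SettingValidationError(
            "{} is not a valid value for {}. "
            "It can take values between 0 and 1.".format(value, name)
        )


def is_valid(check, name, value):
    """Helper for the demo: True if the check passes, False if it raises."""
    try:
        check(name, value)
    except SettingValidationError:
        return False
    return True


print(is_valid(check_default_languages, "DEFAULT_LANGUAGES", ["en", "xx"]))
print(is_valid(check_between_0_and_1, "LANGUAGE_DETECTION_CONFIDENCE_THRESHOLD", 0.5))
```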
Empty file.
45 changes: 45 additions & 0 deletions dateparser/custom_language_detection/fasttext.py
@@ -0,0 +1,45 @@
import os

import fasttext

from dateparser_cli.fasttext_manager import fasttext_downloader
from dateparser_cli.utils import dateparser_model_home, create_data_model_home
from dateparser_cli.exceptions import FastTextModelNotFoundException


_supported_models = ["large.bin", "small.bin"]
_DEFAULT_MODEL = "small"


class _FastTextCache:
    model = None


def _load_fasttext_model():
    if _FastTextCache.model:
        return _FastTextCache.model
    create_data_model_home()
    downloaded_models = [
        file for file in os.listdir(dateparser_model_home)
        if file in _supported_models
    ]
    if not downloaded_models:
        fasttext_downloader(_DEFAULT_MODEL)
        return _load_fasttext_model()
    model_path = os.path.join(dateparser_model_home, downloaded_models[0])
    if not os.path.isfile(model_path):
        raise FastTextModelNotFoundException('Fasttext model file not found')
    _FastTextCache.model = fasttext.load_model(model_path)
    return _FastTextCache.model


def detect_languages(text, confidence_threshold):
    _language_parser = _load_fasttext_model()
    text = text.replace('\n', ' ').replace('\r', '')
    language_codes = []
    parser_data = _language_parser.predict(text)
    for idx, language_probability in enumerate(parser_data[1]):
        if language_probability > confidence_threshold:
            language_code = parser_data[0][idx].replace("__label__", "")
            language_codes.append(language_code)
    return language_codes
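The filtering step in `detect_languages` can be tried without a model by feeding it a hand-written prediction. The sketch below assumes the `(labels, probabilities)` pair shape that the code above indexes as `parser_data[0]` and `parser_data[1]`; `filter_predictions` and the sample values are hypothetical.

```python
def filter_predictions(prediction, confidence_threshold):
    """Keep labels whose probability is strictly above the threshold, minus the prefix."""
    labels, probabilities = prediction
    return [
        label.replace("__label__", "")
        for label, probability in zip(labels, probabilities)
        if probability > confidence_threshold
    ]


# Hand-written stand-in for a model prediction (not real model output).
sample = (("__label__en", "__label__de"), [0.91, 0.04])

print(filter_predictions(sample, 0.5))
```

Note that the comparison is strict, so a probability exactly equal to the threshold is dropped.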
37 changes: 37 additions & 0 deletions dateparser/custom_language_detection/langdetect.py
@@ -0,0 +1,37 @@
import langdetect


# The below _Factory is set to prevent setting global state of the library
# but still get consistent results.
# Refer : https://github.com/Mimino666/langdetect

class _Factory:
    data = None


def _init_factory():
    if _Factory.data is None:
        _Factory.data = langdetect.detector_factory.DetectorFactory()
        _Factory.data.load_profile(langdetect.detector_factory.PROFILES_DIRECTORY)
        _Factory.data.seed = 0


def _get_language_probablities(text):
    _init_factory()
    detector = _Factory.data.create()
    detector.append(text)
    return detector.get_probabilities()


def detect_languages(text, confidence_threshold):
    language_codes = []
    try:
        parser_data = _get_language_probablities(text)
        for language_candidate in parser_data:
            if language_candidate.prob > confidence_threshold:
                language_codes.append(language_candidate.lang)
    except langdetect.lang_detect_exception.LangDetectException:
        # This exception can be produced with empty strings or inputs without letters like `10-10-2021`.
        # As this could be really common, we ignore them.
        pass
    return language_codes
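Both `_Factory` here and `_FastTextCache` in the fastText module use the same lazy, class-attribute cache: the expensive object is built on first use and reused afterwards. A stripped-down sketch of that pattern follows; `_Cache`, `_expensive_load` and `get_resource` are hypothetical names, and the dictionary stands in for the real profiles or model.

```python
class _Cache:
    value = None  # filled in on first use


load_count = 0  # only here to show how often the loader runs


def _expensive_load():
    """Stand-in for loading language profiles or a model from disk."""
    global load_count
    load_count += 1
    return {"profiles": "loaded"}


def get_resource():
    # Build the resource once, then hand back the cached object.
    if _Cache.value is None:
        _Cache.value = _expensive_load()
    return _Cache.value


first = get_resource()
second = get_resource()
print(first is second, load_count)
```

Keeping the object on a private class rather than mutating the library's module-level state is what the comment at the top of this file refers to.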
18 changes: 18 additions & 0 deletions dateparser/custom_language_detection/language_mapping.py
@@ -0,0 +1,18 @@
from dateparser.data.languages_info import language_map


def map_languages(language_codes):
    """
    Returns the candidates from the supported languages codes.
    :param language_codes:
        A list of language codes, e.g. ['en', 'es'] in ISO 639 Standard.
    :type language_codes: list
    :return: Returns list[str] representing supported languages
    :rtype: list[str]
    """
    return [
        language_code
        for language in language_codes
        if language in language_map
        for language_code in language_map[language]
    ]
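A small sketch of how `map_languages` expands detected codes. The `language_map` contents below are hypothetical sample data, not the real table from `dateparser.data.languages_info`; the comprehension itself is the same as in the diff.

```python
# Hypothetical sample of the mapping: detected code -> dateparser language codes.
language_map = {
    "en": ["en"],
    "zh": ["zh", "zh-Hans", "zh-Hant"],
}


def map_languages(language_codes):
    """Expand each known code to its mapped codes; silently drop unknown ones."""
    return [
        language_code
        for language in language_codes
        if language in language_map
        for language_code in language_map[language]
    ]


print(map_languages(["zh", "xx", "en"]))
```

Input order is preserved, and a code missing from the map (here `"xx"`) contributes nothing rather than raising.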