# `clean_language()`: Clean Language Names

Convert a language name or code into one of the following formats, following the [ISO 639 language codes](https://en.wikipedia.org/wiki/ISO_639/):
1. "name": the language name
2. "alpha-2": two-letter abbreviation
3. "alpha-3": three-letter abbreviation
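
As a minimal sketch of what the three formats look like for a single language, the snippet below uses a small hand-written table (an assumption for illustration, not the pycountry data) and a hypothetical `convert` helper that maps one value from one format to another.

```python
from typing import Optional

# Hand-written sample rows (assumption); a real table would be built from ISO 639 data.
LANGUAGES = [
    {"name": "English", "alpha-2": "en", "alpha-3": "eng"},
    {"name": "French", "alpha-2": "fr", "alpha-3": "fra"},
    {"name": "Japanese", "alpha-2": "ja", "alpha-3": "jpn"},
]


def convert(value: str, input_format: str, output_format: str) -> Optional[str]:
    """Hypothetical helper: map one value between formats, ignoring case."""
    key = value.strip().lower()
    for row in LANGUAGES:
        if row[input_format].lower() == key:
            return row[output_format]
    return None  # unknown value


print(convert("English", "name", "alpha-2"))  # en
print(convert("fra", "alpha-3", "name"))      # French
```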

# Features

1. Create a database of ISO 639 data and associated regular expressions for string formatting (see [pycountry](https://pypi.org/project/pycountry/), in particular `iso639-x.json` in its `databases` folder). This could be a JSON file, loaded into a Python dict.

2. Standardize null values

3. If `input_format` is `auto`: use regex matching and table lookups to identify each language, then output it in the output format.

4. If `input_format` is not `auto`: map directly from the input to the output format

5. (Optional) Support fuzzy matching of language names when `input_format` is `name`. (Hint: refer to the implementation of [clean_country](https://github.com/sfu-db/dataprep/blob/develop/dataprep/clean/clean_country.py).)
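
The first four features could be sketched roughly as follows. This is only an illustration under stated assumptions: the rows are a hand-written sample rather than the pycountry data, the set of strings treated as null is a guess, and `infer_format` and `clean_value` are hypothetical helper names that do not appear in the design.

```python
import re
from typing import Optional

# Hand-written sample of ISO 639 rows (assumption: real data would come from a JSON file).
ROWS = [
    ("English", "en", "eng"),
    ("French", "fr", "fra"),
    ("German", "de", "deu"),
]
FORMATS = ("name", "alpha-2", "alpha-3")

# One lookup dict per input format, keyed by the lower-cased value.
INDEX = {
    fmt: {row[i].lower(): dict(zip(FORMATS, row)) for row in ROWS}
    for i, fmt in enumerate(FORMATS)
}

# Assumed set of strings treated as missing; the design does not list them.
NULL_VALUES = {"", "nan", "null", "none", "n/a", "na"}

ALPHA2_RE = re.compile(r"[a-z]{2}")
ALPHA3_RE = re.compile(r"[a-z]{3}")


def infer_format(key: str) -> str:
    """Guess the format from the shape of the value, then confirm it by table lookup."""
    if ALPHA2_RE.fullmatch(key) and key in INDEX["alpha-2"]:
        return "alpha-2"
    if ALPHA3_RE.fullmatch(key) and key in INDEX["alpha-3"]:
        return "alpha-3"
    return "name"


def clean_value(value, input_format: str = "auto", output_format: str = "name") -> Optional[str]:
    """Return the value in output_format, or None for nulls and unknown values."""
    if value is None:
        return None
    key = str(value).strip().lower()
    if key in NULL_VALUES:
        return None
    # 'auto' infers the format; otherwise map directly from the stated format.
    fmt = infer_format(key) if input_format == "auto" else input_format
    row = INDEX[fmt].get(key)
    return row[output_format] if row else None
```

Keeping one dict per input format makes the non-`auto` path a single lookup, while the `auto` path only adds a cheap shape check before the same lookup.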



# Tentative design

from typing import Union

import dask.dataframe as dd
import pandas as pd


def clean_language(
    df: Union[pd.DataFrame, dd.DataFrame],
    column: str,
    input_format: str = 'auto',
    output_format: str = 'name',
    fuzzy: bool = False,
    fuzzy_dist: float = 0.0,
    inplace: bool = False,
    report: bool = True,
    progress: bool = True,
) -> pd.DataFrame:
    """
    Parameters
    ----------
    df
        A pandas or Dask DataFrame containing the data to be cleaned.
    column
        The name of the column containing language names.
    input_format
        The ISO 639 input format of the language.
            - 'auto': infer the input format
            - 'name': language name ('English')
            - 'alpha-2': alpha-2 code ('en')
            - 'alpha-3': alpha-3 code ('eng')
        (default: 'auto')
    output_format
        The desired ISO 639 format of the language:
            - 'name': language name ('English')
            - 'alpha-2': alpha-2 code ('en')
            - 'alpha-3': alpha-3 code ('eng')
        (default: 'name')
    fuzzy
        If False, matching for the 'name' input format is done by looking
        for a direct match. If True, matching is done by searching the input for a
        regex match.
        (default: False)
    fuzzy_dist
        The maximum edit distance (number of single character insertions, deletions
        or substitutions required to change one word into the other) between a language value
        and input that will count as a match. Only applies to the 'auto' and 'name'
        input formats.
        (default: 0.0)
    inplace
        If True, delete the column containing the data that was cleaned. Otherwise,
        keep the original column.
        (default: False)
    report
        If True, output the summary report. Otherwise, no report is output.
        (default: True)
    progress
        If True, display a progress bar.
        (default: True)
    """
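
To make the `fuzzy_dist` parameter concrete, here is a minimal sketch of edit-distance matching against language names. It uses a plain Levenshtein distance written in pure Python, which is an assumption made for illustration and not necessarily how `clean_country` does its fuzzy matching. The name list is a hand-written sample and `fuzzy_match` is a hypothetical helper.

```python
from typing import Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (free if equal)
            ))
        prev = curr
    return prev[-1]


NAMES = ["English", "French", "German"]  # hand-written sample (assumption)


def fuzzy_match(value: str, fuzzy_dist: int) -> Optional[str]:
    """Return the closest name within fuzzy_dist edits, or None if none qualifies."""
    key = value.strip().lower()
    best_name, best_dist = None, fuzzy_dist + 1
    for name in NAMES:
        dist = edit_distance(key, name.lower())
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name


print(fuzzy_match("Englsh", 1))  # English
print(fuzzy_match("Englsh", 0))  # None
```

With `fuzzy_dist=0` this reduces to an exact, case-insensitive match, which is consistent with the default value in the signature above.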

# Resources
   1. [pycountry](https://pypi.org/project/pycountry/)
   2. [Implementation of clean_country function in DataPrep](https://github.com/sfu-db/dataprep/blob/develop/dataprep/clean/clean_country.py)
