Languages can be converted to and from the following formats via the `input_format` and `output_format` parameters:

* `name`: reference language name, like "English"
* `alpha-2`: two letter 639-1 identifier, like "en"
* `alpha-3`: three letter 639-3 identifier, like "eng"

`input_format` can be set to "auto", which infers the input format of each value automatically. A tuple of input formats may also be given to indicate that each value may be in any of the listed formats.


The `kb_path` and `encode` parameters can be used to customize the knowledge base. The default knowledge base comes from the [ISO 639-3 official website](https://iso639-3.sil.org/code_tables/download_tables). At the current stage, a user-specified knowledge base must be a file in the user's local directory and must follow the format of [the default one](https://github.com/sfu-db/dataprep/blob/develop/dataprep/clean/language_data.csv): a CSV file containing at least the three columns "name", "alpha-2" and "alpha-3". These two parameters are passed to `pd.read_csv` to load the data.
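As a rough illustration of the file layout described above, the sketch below writes a toy knowledge base with the three required columns and reads it back to confirm that none is missing. It uses only the standard library; the file name `toy_language_data.csv`, the temporary directory and the choice of rows are assumptions made for this example and are not part of dataprep.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical toy knowledge base. The codes are real ISO 639 entries,
# but the file name and the selection of rows are illustrative only.
rows = [
    {"name": "English", "alpha-2": "en", "alpha-3": "eng"},
    {"name": "Chinese", "alpha-2": "zh", "alpha-3": "zho"},
    {"name": "Japanese", "alpha-2": "ja", "alpha-3": "jpn"},
]
required_columns = ["name", "alpha-2", "alpha-3"]

# Write the CSV into a temporary local directory.
kb_file = Path(tempfile.mkdtemp()) / "toy_language_data.csv"
with kb_file.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=required_columns)
    writer.writeheader()
    writer.writerows(rows)

# Read the file back and check that every required column is present.
with kb_file.open(newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    loaded = list(reader)
    missing = [c for c in required_columns if c not in reader.fieldnames]

print("missing columns:", missing)
print("first row:", loaded[0])
```

A file laid out like this could then be supplied through `kb_path`, with `encode` set to the encoding used when writing it.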


Invalid parsing is handled with the `errors` parameter:

* `coerce` (default): invalid parsing will be set to NaN
* `ignore`: invalid parsing will return the input
* `raise`: invalid parsing will raise an exception
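The three policies can be sketched in plain Python. This is not dataprep's implementation: `resolve` and the `KNOWN` table are hypothetical names introduced for illustration, and the exception type raised under `raise` is an assumption.

```python
import math

# Toy lookup table standing in for the knowledge base (illustrative only).
KNOWN = {"en": "English", "zh": "Chinese"}


def resolve(value, errors="coerce"):
    """Hypothetical helper mirroring the three `errors` policies."""
    key = str(value).strip().lower()
    if key in KNOWN:
        return KNOWN[key]
    if errors == "coerce":
        return math.nan  # invalid input becomes NaN
    if errors == "ignore":
        return value  # invalid input is returned unchanged
    if errors == "raise":
        # The exception type is an assumption made for this sketch.
        raise ValueError(f"unable to parse value {value!r}")
    raise ValueError(f"unknown errors policy {errors!r}")


print(resolve(" ZH "))
print(resolve("tp", errors="coerce"))
print(resolve("tp", errors="ignore"))
```

With `errors="raise"`, the first invalid value stops processing, so this mode is best suited to data that is expected to be clean already.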

The following sections demonstrate the functionality of `clean_language()` and `validate_language()`. 

### An example dataset containing language names

In [None]:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {'messy_language': [
            'eng',
            'zh',
            'Japanese',
            "english",
            "Zh",
            "tp", # fake language code
            "233",
            304,
            "dd eng",
            " tr ",
            "hello",
            np.nan,
            "NULL",
     ],
    }
)
df

## 1. Default `clean_language()`

By default, `input_format` is set to "auto" (the input format is determined automatically), `output_format` is set to "name", `kb_path` is set to "default", and `errors` is set to "coerce" (invalid values are set to NaN).

In [None]:
from dataprep.clean import clean_language
clean_language(df, "messy_language")

## 2. Input formats

This section demonstrates the supported language input formats.

### `auto` (default)
Automatically detects the input format of each value.
When the length of the input value is 2, "alpha-2" is preferred. Similarly, when the length is 3, "alpha-3" is preferred. Otherwise, "name", "alpha-2" and "alpha-3" are considered in that order.
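The length-based preference can be sketched as a small helper that returns the order in which formats would be tried. `candidate_formats` is a hypothetical name, not a dataprep function. Two details are assumptions of this sketch: whitespace is stripped before the length is measured, and the remaining formats are tried in the default order after the preferred one.

```python
DEFAULT_ORDER = ["name", "alpha-2", "alpha-3"]
PREFERRED_BY_LENGTH = {2: "alpha-2", 3: "alpha-3"}


def candidate_formats(value):
    """Return the formats to try for `value`, most likely first."""
    # Assumption: surrounding whitespace does not count towards the length.
    text = str(value).strip()
    preferred = PREFERRED_BY_LENGTH.get(len(text))
    if preferred is None:
        return list(DEFAULT_ORDER)
    # Assumption: after the preferred format, fall back to the default order.
    return [preferred] + [f for f in DEFAULT_ORDER if f != preferred]


for value in ["zh", "eng", "Japanese", " tr "]:
    print(repr(value), candidate_formats(value))
```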

In [None]:
clean_language(df, "messy_language", input_format="auto")

### `name`
Looks for a direct match with a reference language name, case insensitive and ignoring leading and trailing whitespace.

In [None]:
clean_language(df, "messy_language", input_format="name")

### `alpha-2`

In [None]:
clean_language(df, "messy_language", input_format="alpha-2")

### `alpha-3`

In [None]:
clean_language(df, "messy_language", input_format="alpha-3")

### `(name, alpha-3)`

A tuple containing any combination of input formats may be used to clean any of the given input formats.

In [None]:
clean_language(df, "messy_language", input_format=("name", "alpha-3"))

## 3. Output formats

This section demonstrates the supported output language formats.

### `name` (default)

In [None]:
clean_language(df, "messy_language", output_format="name")

### `alpha-2`

In [None]:
clean_language(df, "messy_language", output_format="alpha-2")

### `alpha-3`

In [None]:
clean_language(df, "messy_language", output_format="alpha-3")

## 4. Knowledge base

Customize the knowledge base used to clean language names. We will use a toy knowledge base as follows:

In [None]:
alternative_path = "./alternative_language_data.csv"
pd.read_csv(alternative_path, encoding = "utf-8")

In [None]:
clean_language(df, "messy_language", kb_path = alternative_path, encode = "utf-8")

Note that "tr" is not a valid language now, since it does not appear in the alternative knowledge base.

## 5. `inplace` parameter
Setting `inplace=True` deletes the given column from the returned DataFrame.
A new column containing the cleaned languages is added with a title in the format `"{original title}_clean"`.

In [None]:
clean_language(df, "messy_language", inplace=True)

## 6. `errors` parameter

### `coerce` (default)

In [None]:
clean_language(df, "messy_language", errors="coerce")

### `ignore`

In [None]:
clean_language(df, "messy_language", errors="ignore")

## 7. `validate_language()`

`validate_language()` returns `True` when the input is a valid language name or code. Otherwise it returns `False`.

The input of `validate_language()` can be a string, a pandas Series, a Dask Series, a pandas DataFrame or a Dask DataFrame.

When the input is a string, a pandas Series or a Dask Series, the user does not need to specify a column name to be validated.

When the input is a pandas DataFrame or a Dask DataFrame, the user may or may not specify a column name to be validated. If the user specifies a column name, `validate_language()` only returns the validation result for the specified column. If the user does not specify a column name, `validate_language()` returns the validation result for the whole DataFrame.

In [None]:
from dataprep.clean import validate_language

print(validate_language("english"))
print(validate_language("zh"))
print(validate_language(" ZH "))
print(validate_language("tp"))
print(validate_language("eng"))
print(validate_language("hello"))
print(validate_language("233"))
print(validate_language("dd eng"))
print(validate_language(""))

### An example dataset containing multiple columns

In [None]:
df2 = pd.DataFrame(
    {'some_messy_language': [
        'eng', 
        'zh',
        'Japanese', 
        "english",
        "Zh",
        "tp",
    ], 
     'other_messy_language':[
        "233", 
        304, 
        " tr ", 
        "hello", 
        np.nan, 
        "NULL",
     ],
    }
)
df2

### Series

In [None]:
validate_language(df2["some_messy_language"])

### DataFrame + Specify Column

In [None]:
validate_language(df2, column="other_messy_language")

### Only DataFrame

In [None]:
validate_language(df2)

### Specify `input_format`

In [None]:
validate_language(df2["other_messy_language"], input_format = "alpha-3")

With `input_format` specified as "alpha-3", "tr" becomes False, since it is an alpha-2 code rather than an alpha-3 code.

### Change knowledge base

In [None]:
validate_language(df2["other_messy_language"], kb_path = alternative_path)

Note that "tr" becomes False since it does not appear in the alternative knowledge base.