### An example dataset with country values

In [None]:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "country": [
        "Canada", "foo canada bar", "cnada", "northern ireland", " ireland ",
        "congo, kinshasa", "congo, brazzaville", 304, "233", " tr ", "ARG",
        "hello", np.nan, "NULL"
    ]
})
df

## 1. Default `clean_country()`

By default, `input_format` is "auto" (the input format is determined automatically), `output_format` is "name", `fuzzy_dist` is 0, and `strict` is False. The `errors` parameter defaults to "coerce" (invalid inputs are set to NaN).

In [None]:
from dataprep.clean import clean_country
clean_country(df, "country")

Note that "Canada" is considered not cleaned in the report since its cleaned value is the same as the input. Also, "northern ireland" is invalid because it is part of the United Kingdom. Kinshasa and Brazzaville are the capital cities of their respective countries.

## 2. Input formats

This section demonstrates the supported country input formats.

### name

If the input contains a match for one of the country regexes, it is successfully converted.

In [None]:
clean_country(df, "country", input_format="name")

### official

Does the same thing as `input_format="name"`.

In [None]:
clean_country(df, "country", input_format="official")

### alpha-2

Looks for a direct match with an ISO 3166-1 alpha-2 country code, case-insensitive and ignoring leading and trailing whitespace.

In [None]:
clean_country(df, "country", input_format="alpha-2")

### alpha-3

Looks for a direct match with an ISO 3166-1 alpha-3 country code, case-insensitive and ignoring leading and trailing whitespace.

In [None]:
clean_country(df, "country", input_format="alpha-3")

### numeric

Looks for a direct match with an ISO 3166-1 numeric country code, case-insensitive and ignoring leading and trailing whitespace. Works on integers and strings.

In [None]:
clean_country(df, "country", input_format="numeric")
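
The direct-match behaviour described above (integers and strings both accepted, surrounding whitespace ignored) can be pictured with a small standalone sketch. This is not dataprep's implementation: `NUMERIC_TO_NAME` and `lookup_numeric` are hypothetical names, and the table holds only three ISO 3166-1 numeric codes for illustration.

```python
# Hypothetical three-entry lookup table, for illustration only.
NUMERIC_TO_NAME = {"124": "Canada", "233": "Estonia", "304": "Greenland"}


def lookup_numeric(value):
    """Return the country name for a numeric code, or None if there is no match."""
    # str() lets integers and strings share one code path;
    # strip() ignores leading and trailing whitespace.
    key = str(value).strip()
    return NUMERIC_TO_NAME.get(key)


print(lookup_numeric(304))      # integer input
print(lookup_numeric(" 233 "))  # padded string input
print(lookup_numeric("hello"))  # no match, so None
```

Inputs that are not in the table, such as "hello", simply produce no match, which corresponds to an invalid value in the cleaned output.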

### (name, alpha-2)

A tuple containing any combination of input formats may be passed; values in any of the given formats are cleaned.

In [None]:
clean_country(df, "country", input_format=("name", "alpha-2"))
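
One way to think about a tuple of input formats is as a list of lookups that are tried in turn. The sketch below is a simplified illustration under that assumption, not dataprep's code: `TABLES` and `lookup` are hypothetical, the tables are heavily truncated, and names are matched by direct lookup rather than by regex.

```python
# Hypothetical, heavily truncated lookup tables, for illustration only.
TABLES = {
    "name": {"canada": "Canada", "argentina": "Argentina"},
    "alpha-2": {"ca": "Canada", "ar": "Argentina"},
    "alpha-3": {"can": "Canada", "arg": "Argentina"},
}


def lookup(value, input_format):
    """Try each requested format in order and return the first match, else None."""
    # Accept either a single format string or a tuple of formats.
    formats = (input_format,) if isinstance(input_format, str) else input_format
    key = str(value).strip().lower()
    for fmt in formats:
        match = TABLES[fmt].get(key)
        if match is not None:
            return match
    return None


print(lookup("ARG", ("name", "alpha-2")))  # alpha-3 was not requested, so None
print(lookup("ARG", ("name", "alpha-3")))  # matched as an alpha-3 code
print(lookup(" ca ", ("name", "alpha-2")))  # matched as an alpha-2 code
```

In this sketch a value is only recognised if its format is among those requested, which is why "ARG" fails with `("name", "alpha-2")`.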

## 3. Output formats

This section demonstrates the supported output country formats.

### official

In [None]:
clean_country(df, "country", output_format="official")

### alpha-2

In [None]:
clean_country(df, "country", output_format="alpha-2")

### alpha-3

In [None]:
clean_country(df, "country", output_format="alpha-3")

### numeric

In [None]:
clean_country(df, "country", output_format="numeric")

### Any combination of input and output formats may be used.

In [None]:
clean_country(df, "country", input_format="alpha-2", output_format="official")

## 4. `strict` parameter

This parameter allows for control over the type of matching used for "name" and "official" input formats. When False, the input is searched for a regex match. When True, matching is done by looking for a direct match with a country in the same format. 

In [None]:
clean_country(df, "country", strict=True)

"foo canada bar", "congo kinshasa" and "congo brazzaville" are now invalid because they are not a direct match with a country in the "name" or "official" formats. 

## 5. Fuzzy Matching

The `fuzzy_dist` parameter sets the maximum edit distance (number of single character insertions, deletions or substitutions required to change one word into the other) allowed between the input and a country regex. If an input is successfully cleaned by `clean_country()` with `fuzzy_dist=0` then that input with one character inserted, deleted or substituted will match with `fuzzy_dist=1`. This parameter only applies to the "name" and "official" input formats.
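
To make the notion of edit distance concrete, here is a minimal sketch of the standard Levenshtein distance between two plain strings. It is an assumption-laden illustration: `edit_distance` is a hypothetical helper, and it compares an input with a literal country name, whereas `clean_country()` measures the distance to a country regex.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("cnada", "canada"))           # 1
print(edit_distance("koreea", "korea"))           # 1
print(edit_distance("cxnda", "canada"))           # 2
print(edit_distance("afghnitan", "afghanistan"))  # 2
```

Under this measure, "cnada" is within reach of `fuzzy_dist=1`, while "cxnda" and "afghnitan" need `fuzzy_dist=2`.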

### `fuzzy_dist=1`

Countries at most one edit away from matching a regex are successfully cleaned.

In [None]:
df = pd.DataFrame({
    "country": [
        "canada", "cnada", "australa", "xntarctica", "koreea", "cxnda",
        "afghnitan", "country: cnada", "foo indnesia bar"
    ]
})
clean_country(df, "country", fuzzy_dist=1)

### `fuzzy_dist=2`

Countries at most two edits away from matching a regex are successfully cleaned.

In [None]:
clean_country(df, "country", fuzzy_dist=2)

## 6. `inplace` parameter

When `inplace=True`, the given column is deleted from the returned DataFrame. A new column containing the cleaned countries is added with a title in the format `"{original title}_clean"`.

In [None]:
clean_country(df, "country", fuzzy_dist=2, inplace=True)

## 7. `validate_country()`

`validate_country()` returns True when the input is a valid country value; otherwise it returns False. Valid types are the same as for `clean_country()`. By default `strict=True`, as opposed to `clean_country()`, which has `strict` set to False by default. The default `input_format` is "auto".

In [None]:
from dataprep.clean import validate_country

print(validate_country("switzerland"))
print(validate_country("country = united states"))
print(validate_country("country = united states", strict=False))
print(validate_country("ca"))
print(validate_country(800))

### `validate_country()` on a pandas series

Since `strict=True` by default, the inputs "foo canada bar", "congo, kinshasa" and "congo, brazzaville" are invalid since they don't directly match a country in the "name" or "official" formats.

In [None]:
df = pd.DataFrame({
    "country": [
        "Canada", "foo canada bar", "cnada", "northern ireland", " ireland ",
        "congo, kinshasa", "congo, brazzaville", 304, "233", " tr ", "ARG",
        "hello", np.nan, "NULL"
    ]
})

df["valid"] = validate_country(df["country"])
df

### `strict=False`

For the "name" and "official" input formats, the input is searched for a regex match.

In [None]:
df["valid"] = validate_country(df["country"], strict=False)
df

### Specifying `input_format`

In [None]:
df["valid"] = validate_country(df["country"], input_format="numeric")
df

## Credit

The country data and regular expressions used are based on the [country_converter](https://github.com/konstantinstadler/country_converter) project.