Currently IPv4 and IPv6 are supported as valid input.

The IP addresses can be converted into any of the following formats:
* `compressed`: the compressed version of the IP address,
* `full`: the full version of the IP address,
* `binary`: the binary representation of the IP address,
* `hexa`: the hexadecimal representation of the IP address,
* `integer`: the integer representation of the IP address.

The default output format is `compressed`.
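As a rough illustration of what each format means, the sketch below reproduces them with Python's standard `ipaddress` module. It is not DataPrep's implementation: the helper name `format_ip` is hypothetical, and the zero-padded IPv4 form of `full` is an assumption modeled on the outputs shown later in this document.

```python
import ipaddress


def format_ip(value, output_format="compressed"):
    """Hypothetical helper: render one IP address in the named format."""
    # ip_address accepts a string or an integer and raises ValueError otherwise.
    addr = ipaddress.ip_address(value)
    if output_format == "compressed":
        return addr.compressed
    if output_format == "full":
        if addr.version == 4:
            # Assumption: pad every octet to four digits, as in this document.
            return ".".join(octet.zfill(4) for octet in str(addr).split("."))
        return addr.exploded
    if output_format == "binary":
        # 32 bits for IPv4, 128 bits for IPv6.
        return format(int(addr), f"0{addr.max_prefixlen}b")
    if output_format == "hexa":
        return hex(int(addr))
    if output_format == "integer":
        return int(addr)
    raise ValueError(f"unknown output_format: {output_format!r}")


for fmt in ("compressed", "full", "binary", "hexa", "integer"):
    print(fmt, format_ip(876234, fmt))
```

The integer `876234` is used here because it round-trips cleanly through every format; strings such as `"684D:1111:222:3333:4444:5555:6:77"` can be passed in the same way.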

Values that cannot be parsed are handled according to the `errors` parameter:

* `coerce` (default): invalid values are set to NaN
* `ignore`: invalid values are returned unchanged
* `raise`: an invalid value raises an exception
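The three strategies can be sketched for a single value as follows. This is only an approximation built on the standard `ipaddress` module: the helper name `clean_one` is hypothetical, and the exception type DataPrep raises is not stated in this document, so `ValueError` here is simply what `ipaddress` produces.

```python
import ipaddress
import math


def clean_one(value, errors="coerce"):
    """Hypothetical helper: compress one IP, handling failures per `errors`."""
    if errors not in ("coerce", "ignore", "raise"):
        raise ValueError(f"unknown errors option: {errors!r}")
    try:
        return ipaddress.ip_address(value).compressed
    except ValueError:
        if errors == "coerce":
            return math.nan   # replace the invalid value with NaN
        if errors == "ignore":
            return value      # hand the input back untouched
        raise                 # errors == "raise": propagate the exception


print(clean_one("455.0.0.0"))                   # coerce
print(clean_one("455.0.0.0", errors="ignore"))  # ignore
try:
    clean_one("455.0.0.0", errors="raise")
except ValueError as exc:
    print("raised:", exc)
```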

After cleaning, a **report** is printed that provides the following information:

* How many values were cleaned (a value counts as cleaned only if it was transformed).
* How many values could not be parsed.
* A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.
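One plausible way to arrive at these counts is sketched below. The definitions are assumptions inferred from the reports printed in this document (with `errors="coerce"`), not taken from DataPrep's source, and the helper names `is_null` and `cleaning_report` are hypothetical.

```python
import math


def is_null(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def cleaning_report(originals, cleaned):
    """Hypothetical helper: tally a cleaning run, assuming errors="coerce"."""
    pairs = list(zip(originals, cleaned))
    n_null = sum(1 for _, c in pairs if is_null(c))
    return {
        # Cleaned: a non-null result that differs from the input value.
        "cleaned": sum(1 for o, c in pairs if not is_null(c) and c != o),
        # Unparsed: a non-null input that ended up null.
        "unparsed": sum(1 for o, c in pairs if not is_null(o) and is_null(c)),
        "correct": len(pairs) - n_null,
        "null": n_null,
    }


nan = math.nan
originals = ["00.000.0.0", "455.0.0.0", None, 876234, {}, "00.12.021.255",
             "684D:1111:222:3333:4444:5555:6:77"]
cleaned = ["0.0.0.0", nan, nan, "0.13.94.202", nan, nan,
           "684d:1111:222:3333:4444:5555:6:77"]

report = cleaning_report(originals, cleaned)
for key, count in report.items():
    print(f"{key}: {count} ({count / len(originals):.2%})")
```

The input and output values are copied from the default `clean_ip` run in section 1, and under these assumed definitions the tallies agree with the report printed there.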

### An example dataset containing IP addresses

In [1]:
import pandas as pd
df = pd.DataFrame({
    "ips": [
        "00.000.0.0", "455.0.0.0", None, 876234, {}, "00.12.021.255",
        "684D:1111:222:3333:4444:5555:6:77"
    ]
})
df

| | ips |
|---|---|
| 0 | 00.000.0.0 |
| 1 | 455.0.0.0 |
| 2 | |
| 3 | 876234 |
| 4 | {} |
| 5 | 00.12.021.255 |
| 6 | 684D:1111:222:3333:4444:5555:6:77 |


## 1. Default `clean_ip`

By default, `clean_ip` parses both IPv4 and IPv6 addresses and outputs them in the `compressed` format.

In [2]:
from dataprep.clean import clean_ip
clean_ip(df, "ips")

IP Cleaning Report:
	3 values cleaned (42.86%)
	3 values unable to be parsed (42.86%), set to NaN
Result contains 3 (42.86%) values in the correct format and 4 null values (57.14%)


| | ips | ips_clean |
|---|---|---|
| 0 | 00.000.0.0 | 0.0.0.0 |
| 1 | 455.0.0.0 | |
| 2 | | |
| 3 | 876234 | 0.13.94.202 |
| 4 | {} | |
| 5 | 00.12.021.255 | |
| 6 | 684D:1111:222:3333:4444:5555:6:77 | 684d:1111:222:3333:4444:5555:6:77 |


## 2. Input formats

This section demonstrates the `input_format` parameter.

### `ipv4`

Will parse only IPv4 addresses.

In [3]:
clean_ip(df, "ips", input_format="ipv4")

IP Cleaning Report:
	2 values cleaned (28.57%)
	4 values unable to be parsed (57.14%), set to NaN
Result contains 2 (28.57%) values in the correct format and 5 null values (71.43%)


| | ips | ips_clean |
|---|---|---|
| 0 | 00.000.0.0 | 0.0.0.0 |
| 1 | 455.0.0.0 | |
| 2 | | |
| 3 | 876234 | 0.13.94.202 |
| 4 | {} | |
| 5 | 00.12.021.255 | |
| 6 | 684D:1111:222:3333:4444:5555:6:77 | |


### `ipv6`

Will parse only IPv6 addresses.

In [4]:
clean_ip(df, "ips", input_format="ipv6")

IP Cleaning Report:
	1 values cleaned (14.29%)
	5 values unable to be parsed (71.43%), set to NaN
Result contains 1 (14.29%) values in the correct format and 6 null values (85.71%)


| | ips | ips_clean |
|---|---|---|
| 0 | 00.000.0.0 | |
| 1 | 455.0.0.0 | |
| 2 | | |
| 3 | 876234 | |
| 4 | {} | |
| 5 | 00.12.021.255 | |
| 6 | 684D:1111:222:3333:4444:5555:6:77 | 684d:1111:222:3333:4444:5555:6:77 |


### `auto` (default)

Will parse both IPv4 and IPv6 addresses.

In [5]:
clean_ip(df, "ips", input_format="auto")

IP Cleaning Report:
	3 values cleaned (42.86%)
	3 values unable to be parsed (42.86%), set to NaN
Result contains 3 (42.86%) values in the correct format and 4 null values (57.14%)


| | ips | ips_clean |
|---|---|---|
| 0 | 00.000.0.0 | 0.0.0.0 |
| 1 | 455.0.0.0 | |
| 2 | | |
| 3 | 876234 | 0.13.94.202 |
| 4 | {} | |
| 5 | 00.12.021.255 | |
| 6 | 684D:1111:222:3333:4444:5555:6:77 | 684d:1111:222:3333:4444:5555:6:77 |
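
The three `input_format` modes shown in this section can be approximated with the standard `ipaddress` module, as in the sketch below. The helper name `parse_ip` is hypothetical and this is not how DataPrep is implemented. The sketch is deliberately limited to string input, because the standard library treats integers differently from the runs above (it will also read an integer as an IPv6 address).

```python
import ipaddress

# Map each mode to the standard-library constructor that enforces it.
_PARSERS = {
    "ipv4": ipaddress.IPv4Address,
    "ipv6": ipaddress.IPv6Address,
    "auto": ipaddress.ip_address,
}


def parse_ip(value, input_format="auto"):
    """Hypothetical helper: compressed address, or None if it does not parse."""
    if input_format not in _PARSERS:
        raise ValueError(f"unknown input_format: {input_format!r}")
    if not isinstance(value, str):
        return None  # assumption: this sketch handles strings only
    try:
        return _PARSERS[input_format](value).compressed
    except ValueError:
        return None


samples = ["0.13.94.202", "684D:1111:222:3333:4444:5555:6:77", "455.0.0.0"]
for mode in ("ipv4", "ipv6", "auto"):
    print(mode, [parse_ip(s, mode) for s in samples])
```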


## 3. Output formats

### `compressed` (default)

In [6]:
clean_ip(df, "ips", output_format="compressed")

IP Cleaning Report:
	3 values cleaned (42.86%)
	3 values unable to be parsed (42.86%), set to NaN
Result contains 3 (42.86%) values in the correct format and 4 null values (57.14%)


| | ips | ips_clean |
|---|---|---|
| 0 | 00.000.0.0 | 0.0.0.0 |
| 1 | 455.0.0.0 | |
| 2 | | |
| 3 | 876234 | 0.13.94.202 |
| 4 | {} | |
| 5 | 00.12.021.255 | |
| 6 | 684D:1111:222:3333:4444:5555:6:77 | 684d:1111:222:3333:4444:5555:6:77 |


### `full`

In [7]:
clean_ip(df, "ips", output_format="full")

IP Cleaning Report:
	3 values cleaned (42.86%)
	3 values unable to be parsed (42.86%), set to NaN
Result contains 3 (42.86%) values in the correct format and 4 null values (57.14%)


| | ips | ips_clean |
|---|---|---|
| 0 | 00.000.0.0 | 0000.0000.0000.0000 |
| 1 | 455.0.0.0 | |
| 2 | | |
| 3 | 876234 | 0000.0013.0094.0202 |
| 4 | {} | |
| 5 | 00.12.021.255 | |
| 6 | 684D:1111:222:3333:4444:5555:6:77 | 684d:1111:0222:3333:4444:5555:0006:0077 |


### `binary`

In [8]:
clean_ip(df, "ips", output_format="binary")

IP Cleaning Report:
	3 values cleaned (42.86%)
	3 values unable to be parsed (42.86%), set to NaN
Result contains 3 (42.86%) values in the correct format and 4 null values (57.14%)


| | ips | ips_clean |
|---|---|---|
| 0 | 00.000.0.0 | 00000000000000000000000000000000 |
| 1 | 455.0.0.0 | |
| 2 | | |
| 3 | 876234 | 00000000000011010101111011001010 |
| 4 | {} | |
| 5 | 00.12.021.255 | |
| 6 | 684D:1111:222:3333:4444:5555:6:77 | 0110100001001101000100010001000100000010001000... |


### `hexa`

In [9]:
clean_ip(df, "ips", output_format="hexa")

IP Cleaning Report:
	3 values cleaned (42.86%)
	3 values unable to be parsed (42.86%), set to NaN
Result contains 3 (42.86%) values in the correct format and 4 null values (57.14%)


| | ips | ips_clean |
|---|---|---|
| 0 | 00.000.0.0 | 0x0 |
| 1 | 455.0.0.0 | |
| 2 | | |
| 3 | 876234 | 0xd5eca |
| 4 | {} | |
| 5 | 00.12.021.255 | |
| 6 | 684D:1111:222:3333:4444:5555:6:77 | 0x684d1111022233334444555500060077 |


### `integer`

In [10]:
clean_ip(df, "ips", output_format="integer")

IP Cleaning Report:
	2 values cleaned (28.57%)
	3 values unable to be parsed (42.86%), set to NaN
Result contains 3 (42.86%) values in the correct format and 4 null values (57.14%)


| | ips | ips_clean |
|---|---|---|
| 0 | 00.000.0.0 | 0 |
| 1 | 455.0.0.0 | |
| 2 | | |
| 3 | 876234 | 876234 |
| 4 | {} | |
| 5 | 00.12.021.255 | |
| 6 | 684D:1111:222:3333:4444:5555:6:77 | 138639864568240772614187040837063802999 |


## 4. `errors` parameter

### `coerce` (default)

In [11]:
clean_ip(df, "ips", errors="coerce")

IP Cleaning Report:
	3 values cleaned (42.86%)
	3 values unable to be parsed (42.86%), set to NaN
Result contains 3 (42.86%) values in the correct format and 4 null values (57.14%)


| | ips | ips_clean |
|---|---|---|
| 0 | 00.000.0.0 | 0.0.0.0 |
| 1 | 455.0.0.0 | |
| 2 | | |
| 3 | 876234 | 0.13.94.202 |
| 4 | {} | |
| 5 | 00.12.021.255 | |
| 6 | 684D:1111:222:3333:4444:5555:6:77 | 684d:1111:222:3333:4444:5555:6:77 |


### `ignore`

In [12]:
clean_ip(df, "ips", errors="ignore")

IP Cleaning Report:
	3 values cleaned (42.86%)
	3 values unable to be parsed (42.86%), left unchanged
Result contains 3 (42.86%) values in the correct format and 1 null values (14.29%)


| | ips | ips_clean |
|---|---|---|
| 0 | 00.000.0.0 | 0.0.0.0 |
| 1 | 455.0.0.0 | 455.0.0.0 |
| 2 | | |
| 3 | 876234 | 0.13.94.202 |
| 4 | {} | {} |
| 5 | 00.12.021.255 | 00.12.021.255 |
| 6 | 684D:1111:222:3333:4444:5555:6:77 | 684d:1111:222:3333:4444:5555:6:77 |


## 5. `validate_ip()`

`validate_ip()` returns `True` if the input is a valid IP address, otherwise `False`. When given a Series, it returns a boolean Series with one result per value.

In [13]:
from dataprep.clean import validate_ip

print(validate_ip("455.0.0.0"))
print(validate_ip({}))
print(validate_ip(" "))
print(validate_ip("0.0.0.0"))
print(validate_ip("684D:1111:222:3333:4444:5555:6:77"))

False
False
False
True
True


In [14]:
df_2 = validate_ip(df["ips"])
df_2

0     True
1    False
2    False
3     True
4    False
5    False
6     True
Name: ips, dtype: bool
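
For comparison, a similar check can be sketched with the standard `ipaddress` module alone. The helper name `is_valid_ip` is hypothetical, and its verdicts may differ from `validate_ip()` on edge cases such as octets with leading zeros, whose handling depends on the Python version.

```python
import ipaddress


def is_valid_ip(value):
    """Hypothetical helper: True if `value` parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
    except (ValueError, TypeError):
        return False
    return True


for candidate in ["455.0.0.0", {}, " ", "0.0.0.0",
                  "684D:1111:222:3333:4444:5555:6:77"]:
    print(repr(candidate), is_valid_ip(candidate))
```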