[ENH] Pandas String Methods in a Single Class #694

Merged 39 commits on Jul 13, 2020
Showing changes from 27 commits

Commits
bb73f91
Updated general_functions API page
Jun 24, 2020
0f4391a
Updated changelog
Jun 25, 2020
c0152f7
String Method functions
Jul 1, 2020
6ca95a5
Merge branch 'dev' of https://github.com/samukweku/pyjanitor into pyj…
Jul 3, 2020
f69d1cf
updates to string methods
Jul 3, 2020
19d9448
updates to pyjanitor string methods
Jul 9, 2020
e6ad2c5
update string methods
Jul 9, 2020
c346683
updates to string methods
Jul 9, 2020
7095dd7
update to string methods
Jul 9, 2020
3a064bc
updates to string methods
Jul 9, 2020
ef10739
Update janitor/functions.py
samukweku Jul 11, 2020
59e558e
Update janitor/functions.py
samukweku Jul 11, 2020
eed0295
Update tests/functions/test_process_text.py
samukweku Jul 11, 2020
be2bed3
Update tests/functions/test_process_text.py
samukweku Jul 11, 2020
bd5eaf5
Update tests/functions/test_process_text.py
samukweku Jul 11, 2020
3d4e63e
Update tests/functions/test_process_text.py
samukweku Jul 11, 2020
abf2e95
Update tests/functions/test_process_text.py
samukweku Jul 11, 2020
a8f4373
Update tests/functions/test_process_text.py
samukweku Jul 11, 2020
0ee65e8
Update tests/functions/test_process_text.py
samukweku Jul 11, 2020
d7b5e79
Update tests/functions/test_process_text.py
samukweku Jul 11, 2020
87b028f
Update tests/functions/test_process_text.py
samukweku Jul 11, 2020
6d57a42
Update tests/functions/test_process_text.py
samukweku Jul 11, 2020
662a967
updates to string methods
Jul 11, 2020
88c6fda
update to string methods
Jul 11, 2020
5b2ef35
updates to string methods
Jul 12, 2020
9b98f92
changes to travis.yml
Jul 13, 2020
a6f6c4c
updates to string methods
Jul 13, 2020
b6d4b57
Update tests/functions/test_process_text.py
samukweku Jul 13, 2020
320f9fe
Update tests/functions/test_process_text.py
samukweku Jul 13, 2020
64577f6
Update tests/functions/test_process_text.py
samukweku Jul 13, 2020
5c92e82
Update tests/functions/test_process_text.py
samukweku Jul 13, 2020
6eda096
Update janitor/functions.py
samukweku Jul 13, 2020
6b8d303
Update janitor/functions.py
samukweku Jul 13, 2020
230ab34
updates to string methods
Jul 13, 2020
0540a5d
updates to changelog
Jul 13, 2020
4422cf2
updates to changelog
Jul 13, 2020
b54c142
Merge branch 'dev' into pyjanitor_string_wrapper
samukweku Jul 13, 2020
c76eba3
Update CHANGELOG.rst
hectormz Jul 13, 2020
66334e5
Update general_functions.rst
hectormz Jul 13, 2020
8 changes: 5 additions & 3 deletions .travis.yml
@@ -18,21 +18,23 @@ install:
   - conda update -q conda
   - conda config --add channels conda-forge
 
+  - conda install -c conda-forge mamba
+
   # Useful for debugging any issues with conda
   - conda info -a
 
   # Install Python, py.test, and required packages.
-  - conda env create -f environment-dev.yml
+  - mamba env create -f environment-dev.yml
 
   # This guarantees that Python version is matrixed.
-  - conda install python=$PYTHON_VERSION
+  - mamba install python=$PYTHON_VERSION -n pyjanitor-dev
   - source $HOME/miniconda/etc/profile.d/conda.sh
   - conda activate pyjanitor-dev
   - python -m ipykernel install --name pyjanitor-dev --user
   - python setup.py develop
 
   # Build development container
-  - docker build -t ericmjl/pyjanitor:devcontainer -f .devcontainer/Dockerfile .
+  # - docker build -t ericmjl/pyjanitor:devcontainer -f .devcontainer/Dockerfile .
 
   # We use TravisCI to build docs and not run tests.
   # Tests are covered on Azure.
1 change: 1 addition & 0 deletions docs/reference/general_functions.rst
@@ -86,3 +86,4 @@ Other
     move
     toset
     unionize_dataframe_categories
+    process_text
106 changes: 105 additions & 1 deletion janitor/functions.py
@@ -1,6 +1,7 @@
 """ General purpose data cleaning functions. """
 
 import datetime as dt
+import inspect
 import re
 import unicodedata
 import warnings
@@ -1083,7 +1084,9 @@ def deconcatenate_column(
         df_deconcat = df[column_name].str.split(sep, expand=True)
     else:
         df_deconcat = pd.DataFrame(
-            df[column_name].to_list(), columns=new_column_names, index=df.index
+            df[column_name].to_list(),
+            columns=new_column_names,
+            index=df.index,
         )
 
     if preserve_position:
@@ -4001,3 +4004,104 @@ def expand_grid(
     dfs, dicts = _check_instance(others)
 
     return _grid_computation(dfs, dicts)
+
+
+@pf.register_dataframe_method
+def process_text(
+    df: pd.DataFrame,
+    column: str,
+    string_function: str,
+    *args: str,
+    **kwargs: str,
+) -> pd.DataFrame:
+    """
+    Apply a Pandas string method to an existing column
+    and return the dataframe.
+
+    This function aims to make string cleaning easy while chaining:
+    pass the name of the string method to ``process_text``.
+
+    Note that this modifies an existing column, and should not be
+    used to create a new column.
+
+    A list of all the string methods in Pandas can be accessed here:
+    https://pandas.pydata.org/docs/user_guide/text.html#method-summary.
+
+    Example:
+
+    .. code-block:: python
+
+        import re
+
+        import pandas as pd
+        import janitor as jn
+
+        df = pd.DataFrame({"text": ["ragnar", "sammywemmy", "ginger"],
+                           "code": [1, 2, 3]})
+
+        df.process_text(column = "text", string_function = "lower")
+        #       text   | code
+        #     ragnar   |  1
+        # sammywemmy   |  2
+        #     ginger   |  3
+
+        # For string methods with parameters, simply pass the arguments:
+        df.process_text(
+            column = "text",
+            string_function = "extract",
+            pat = r"(ag)",
+            flags = re.IGNORECASE
+        )
+
+        # text | code
+        #   ag |  1
+        #  NaN |  2
+        #  NaN |  3
+
+
+    Functional usage syntax:
+
+    .. code-block:: python
+
+        import pandas as pd
+        import janitor as jn
+
+        df = pd.DataFrame(...)
+        df = jn.process_text(
+            df = df,
+            column = "column_name_here",
+            string_function = "string_func_name_here",
+            **kwargs
+        )
+
+    Method-chaining usage syntax:
+
+    .. code-block:: python
+
+        import pandas as pd
+        import janitor as jn
+
+        df = (
+            pd.DataFrame(...)
+            .process_text(
+                column = "column_name_here",
+                string_function = "string_func_name_here",
+                **kwargs
+            )
+        )
+
+
+    :param df: A pandas dataframe.
+    :param column: String column to be operated on.
+    :param string_function: Name of the Pandas string method to apply.
+    :param args, kwargs: Positional and keyword arguments passed on
+        to the string method.
+    :returns: A pandas dataframe with modified column.
+    :raises KeyError: if ``string_function`` is not a Pandas string method.
+    :raises TypeError: if a wrong ``arg`` or ``kwarg`` is supplied.
+    """
+
+    pandas_string_methods = [
+        func.__name__
+        for _, func in inspect.getmembers(pd.Series.str, inspect.isfunction)
+        if not func.__name__.startswith("_")
+    ]
+
+    if string_function not in pandas_string_methods:
+        raise KeyError(f"{string_function} is not a Pandas string method.")
+
+    df[column] = getattr(df[column].str, string_function)(*args, **kwargs)
+
+    return df
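The validate-then-dispatch pattern used by `process_text` can be tried on a bare Series. The following is a minimal sketch under stated assumptions, not the pyjanitor implementation: `public_str_methods` and `apply_str_method` are hypothetical names introduced here, and the sketch takes method names from the first element of each `inspect.getmembers` pair rather than from `func.__name__`.

```python
import inspect

import pandas as pd


def public_str_methods() -> set:
    """Names of the public functions defined on the pandas ``.str`` accessor."""
    return {
        name
        for name, _ in inspect.getmembers(pd.Series.str, inspect.isfunction)
        if not name.startswith("_")
    }


def apply_str_method(series: pd.Series, method: str, *args, **kwargs):
    """Hypothetical helper: validate ``method``, then dispatch to ``series.str``."""
    if method not in public_str_methods():
        raise KeyError(f"{method} is not a Pandas string method.")
    # Look the method up by name on the accessor and call it.
    return getattr(series.str, method)(*args, **kwargs)


text = pd.Series(["Ragnar", "SammyWemmy", "Ginger"])

lowered = apply_str_method(text, "lower")
# expand=False makes a single-group extract return a Series.
extracted = apply_str_method(text, "extract", pat=r"(ag)", expand=False)

print(lowered.tolist())
print(extracted.tolist())
```

With the default `expand=True`, `extract` returns a DataFrame even for a single capture group, which is worth keeping in mind when the result is assigned back to one column.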
99 changes: 99 additions & 0 deletions tests/functions/test_process_text.py
@@ -0,0 +1,99 @@
+import numpy as np
+import pandas as pd
+import pytest
+from pandas.testing import assert_frame_equal
+
+
+def test_str_split():
+    """Test wrapper for Pandas ``.str.split()`` method."""
+
+    df = pd.DataFrame(
+        {"text": ["a_b_c", "c_d_e", np.nan, "f_g_h"], "numbers": range(1, 5)}
+    )
+
+    expected = df.copy()
+    expected["text"] = expected["text"].str.split("_")
+
+    result = df.process_text(column="text", string_function="split", pat="_")
+
+    assert_frame_equal(result, expected)
+
+
+def test_str_cat():
+    """Test wrapper for Pandas ``.str.cat()`` method."""
+
+    df = pd.DataFrame({"text": ["a", "b", "c", "d"], "numbers": range(1, 5)})
+
+    expected = df.copy()
+    expected["text"] = expected["text"].str.cat(others=["A", "B", "C", "D"])
+
+    result = df.process_text(
+        column="text", string_function="cat", others=["A", "B", "C", "D"],
+    )
+
+    assert_frame_equal(result, expected)
+
+
+def test_str_get():
+    """Test wrapper for Pandas ``.str.get()`` method."""
+
+    df = pd.DataFrame(
+        {"text": ["aA", "bB", "cC", "dD"], "numbers": range(1, 5)}
+    )
+
+    expected = df.copy()
+    expected["text"] = expected["text"].str.get(-1)
+    result = df.process_text(column="text", string_function="get", i=-1)
+
+    assert_frame_equal(result, expected)
+
+
+def test_str_lower():
+    """Test string conversion to lowercase using ``.str.lower()``"""
+
+    df = pd.DataFrame(
+        {
+            "codes": range(1, 7),
+            "names": [
+                "Graham Chapman",
+                "John Cleese",
+                "Terry Gilliam",
+                "Eric Idle",
+                "Terry Jones",
+                "Michael Palin",
+            ],
+        }
+    )
+
+    expected = df.copy()
+    expected["names"] = expected["names"].str.lower()
+
+    result = df.process_text(column="names", string_function="lower")
+
+    assert_frame_equal(result, expected)
+
+
+def test_str_wrong():
+    """
+    Test that a ``string_function`` that is not a Pandas string method
+    raises a ``KeyError``.
+    """
+    df = pd.DataFrame(
+        {"text": ["ragnar", "sammywemmy", "ginger"], "code": [1, 2, 3]}
+    )
+    with pytest.raises(KeyError):
+        df.process_text(column="text", string_function="ragnar")
+
+
+def test_str_wrong_parameters():
+    """
+    Test that wrong parameters for a Pandas string method
+    raise a ``TypeError``.
+    """
+
+    df = pd.DataFrame(
+        {"text": ["a_b_c", "c_d_e", np.nan, "f_g_h"], "numbers": range(1, 5)}
+    )
+
+    with pytest.raises(TypeError):
+        df.process_text(column="text", string_function="split", pattern="_")
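
A note on the fixture in `test_str_get`: every string there is two characters long, so index `1` and index `-1` point at the same character and the test cannot tell them apart. The sketch below uses plain pandas with illustrative data that is not part of the PR to show where the two indices diverge.

```python
import pandas as pd

two_chars = pd.Series(["aA", "bB"])
three_chars = pd.Series(["abC", "deF"])

# On two-character strings, index 1 and index -1 are the same position.
same_on_two = two_chars.str.get(1).equals(two_chars.str.get(-1))

# On longer strings the two indices select different characters.
same_on_three = three_chars.str.get(1).equals(three_chars.str.get(-1))

print(same_on_two, same_on_three)  # True False
```

A fixture with strings of three or more characters would therefore make a test of negative indexing more discriminating.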