Column utils added. (ifelse and case_when statements) (#15)
* replace_na and drop_na implemented.
* Changes to core logic of na functions.
* Code refactored and test cases fixed.
* ifelse and case_when statements added.
* Add build files for website.
1 parent 5ae6671, commit 8bdea91
Showing 20 changed files with 882 additions and 5 deletions.
238 changes: 238 additions & 0 deletions
docs/_build/html/_modules/tidypyspark/column_utils.html
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,238 @@ | ||
<!DOCTYPE html> | ||
<html class="writer-html5" lang="en" > | ||
<head> | ||
<meta charset="utf-8" /> | ||
<meta name="viewport" content="width=device-width, initial-scale=1.0" /> | ||
<title>tidypyspark.column_utils — tidypyspark documentation</title> | ||
<link rel="stylesheet" href="../../_static/pygments.css" type="text/css" /> | ||
<link rel="stylesheet" href="../../_static/css/theme.css" type="text/css" /> | ||
<link rel="stylesheet" href="../../_static/mystnb.4510f1fc1dee50b3e5859aac5469c37c29e427902b24a333a5f9fcb2f0b3ac41.css" type="text/css" /> | ||
<link rel="stylesheet" href="../../_static/graphviz.css" type="text/css" /> | ||
<!--[if lt IE 9]> | ||
<script src="../../_static/js/html5shiv.min.js"></script> | ||
<![endif]--> | ||
|
||
<script data-url_root="../../" id="documentation_options" src="../../_static/documentation_options.js"></script> | ||
<script src="../../_static/jquery.js"></script> | ||
<script src="../../_static/underscore.js"></script> | ||
<script src="../../_static/_sphinx_javascript_frameworks_compat.js"></script> | ||
<script src="../../_static/doctools.js"></script> | ||
<script src="../../_static/js/theme.js"></script> | ||
<link rel="index" title="Index" href="../../genindex.html" /> | ||
<link rel="search" title="Search" href="../../search.html" /> | ||
</head> | ||
|
||
<body class="wy-body-for-nav"> | ||
<div class="wy-grid-for-nav"> | ||
<nav data-toggle="wy-nav-shift" class="wy-nav-side"> | ||
<div class="wy-side-scroll"> | ||
<div class="wy-side-nav-search" > | ||
|
||
|
||
|
||
<a href="../../index.html" class="icon icon-home"> | ||
tidypyspark | ||
</a> | ||
<div role="search"> | ||
<form id="rtd-search-form" class="wy-form" action="../../search.html" method="get"> | ||
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" /> | ||
<input type="hidden" name="check_keywords" value="yes" /> | ||
<input type="hidden" name="area" value="default" /> | ||
</form> | ||
</div> | ||
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu"> | ||
<ul> | ||
<li class="toctree-l1"><a class="reference internal" href="../../changelog.html">Changelog</a></li> | ||
<li class="toctree-l1"><a class="reference internal" href="../../autoapi/index.html">API Reference</a></li> | ||
</ul> | ||
|
||
</div> | ||
</div> | ||
</nav> | ||
|
||
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" > | ||
<i data-toggle="wy-nav-top" class="fa fa-bars"></i> | ||
<a href="../../index.html">tidypyspark</a> | ||
</nav> | ||
|
||
<div class="wy-nav-content"> | ||
<div class="rst-content"> | ||
<div role="navigation" aria-label="Page navigation"> | ||
<ul class="wy-breadcrumbs"> | ||
<li><a href="../../index.html" class="icon icon-home" aria-label="Home"></a></li> | ||
<li class="breadcrumb-item"><a href="../index.html">Module code</a></li> | ||
<li class="breadcrumb-item active">tidypyspark.column_utils</li> | ||
<li class="wy-breadcrumbs-aside"> | ||
</li> | ||
</ul> | ||
<hr/> | ||
</div> | ||
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article"> | ||
<div itemprop="articleBody"> | ||
|
||
<h1>Source code for tidypyspark.column_utils</h1><div class="highlight"><pre> | ||
<span></span><span class="kn">import</span> <span class="nn">pyspark.sql.functions</span> <span class="k">as</span> <span class="nn">F</span> | ||
|
||
<div class="viewcode-block" id="ifelse"><a class="viewcode-back" href="../../autoapi/tidypyspark/column_utils/index.html#tidypyspark.column_utils.ifelse">[docs]</a><span class="k">def</span> <span class="nf">ifelse</span><span class="p">(</span><span class="n">condition</span><span class="p">,</span> <span class="n">yes</span><span class="p">,</span> <span class="n">no</span><span class="p">):</span> | ||
<span class="w"> </span> | ||
<span class="w"> </span><span class="sd">'''</span> | ||
<span class="sd"> Vectorized if and else statement.</span> | ||
<span class="sd"> ifelse returns a pyspark col whose value in each row is taken from yes</span> | ||
<span class="sd"> when condition is true for that row and from no otherwise (including</span> | ||
<span class="sd"> rows where condition is null).</span> | ||
|
||
<span class="sd"> Parameters</span> | ||
<span class="sd"> ----------</span> | ||
<span class="sd"> condition: expression or pyspark col</span> | ||
<span class="sd"> Should evaluate to a boolean pyspark col.</span> | ||
<span class="sd"> yes: expression or pyspark col</span> | ||
<span class="sd"> Should evaluate to a pyspark col for true elements of condition.</span> | ||
<span class="sd"> no: expression or pyspark col</span> | ||
<span class="sd"> Should evaluate to a pyspark col for false elements of condition.</span> | ||
|
||
<span class="sd"> Returns</span> | ||
<span class="sd"> -------</span> | ||
<span class="sd"> pyspark col</span> | ||
|
||
<span class="sd"> Examples</span> | ||
<span class="sd"> --------</span> | ||
<span class="sd"> >>> from pyspark.sql import SparkSession </span> | ||
<span class="sd"> >>> import pyspark.sql.functions as F </span> | ||
<span class="sd"> >>> spark = SparkSession.builder.getOrCreate()</span> | ||
<span class="sd"> >>> import pyspark</span> | ||
|
||
<span class="sd"> >>> df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)],</span> | ||
<span class="sd"> ["letter", "number"]</span> | ||
<span class="sd"> )</span> | ||
<span class="sd"> >>> df.show()</span> | ||
<span class="sd"> +------+------+</span> | ||
<span class="sd"> |letter|number|</span> | ||
<span class="sd"> +------+------+</span> | ||
<span class="sd"> | a| 1|</span> | ||
<span class="sd"> | b| 2|</span> | ||
<span class="sd"> | c| 3|</span> | ||
<span class="sd"> +------+------+</span> | ||
|
||
<span class="sd"> >>> df.withColumn("new_number",</span> | ||
<span class="sd"> ifelse(F.col("number") == 1, </span> | ||
<span class="sd"> F.lit(10), </span> | ||
<span class="sd"> F.lit(0)</span> | ||
<span class="sd"> )</span> | ||
<span class="sd"> ).show()</span> | ||
<span class="sd"> +------+------+----------+</span> | ||
<span class="sd"> |letter|number|new_number|</span> | ||
<span class="sd"> +------+------+----------+</span> | ||
<span class="sd"> | a| 1| 10|</span> | ||
<span class="sd"> | b| 2| 0|</span> | ||
<span class="sd"> | c| 3| 0|</span> | ||
<span class="sd"> +------+------+----------+</span> | ||
<span class="sd"> '''</span> | ||
|
||
<span class="k">return</span> <span class="n">F</span><span class="o">.</span><span class="n">when</span><span class="p">(</span><span class="n">condition</span><span class="p">,</span> <span class="n">yes</span><span class="p">)</span><span class="o">.</span><span class="n">otherwise</span><span class="p">(</span><span class="n">no</span><span class="p">)</span></div> | ||
|
||
<div class="viewcode-block" id="case_when"><a class="viewcode-back" href="../../autoapi/tidypyspark/column_utils/index.html#tidypyspark.column_utils.case_when">[docs]</a><span class="k">def</span> <span class="nf">case_when</span><span class="p">(</span><span class="n">list_of_tuples</span><span class="p">,</span> <span class="n">default</span> <span class="o">=</span> <span class="kc">None</span><span class="p">):</span> | ||
|
||
<span class="w"> </span><span class="sd">"""</span> | ||
<span class="sd"> Implements a case_when function using PySpark.</span> | ||
<span class="sd"> </span> | ||
<span class="sd"> Parameters:</span> | ||
<span class="sd"> ----------</span> | ||
<span class="sd"> list_of_tuples (list): </span> | ||
<span class="sd"> A list of tuples, where each tuple represents a condition </span> | ||
<span class="sd"> and its corresponding value.</span> | ||
<span class="sd"> default (optional): </span> | ||
<span class="sd"> The default value to use when no conditions are met. Defaults to None.</span> | ||
<span class="sd"> </span> | ||
<span class="sd"> Returns:</span> | ||
<span class="sd"> ----------</span> | ||
<span class="sd"> PySpark Column: A PySpark column representing the case_when expression.</span> | ||
|
||
<span class="sd"> Examples:</span> | ||
<span class="sd"> ----------</span> | ||
<span class="sd"> >>> from pyspark.sql import SparkSession </span> | ||
<span class="sd"> >>> import pyspark.sql.functions as F </span> | ||
<span class="sd"> >>> spark = SparkSession.builder.getOrCreate()</span> | ||
<span class="sd"> >>> import pyspark</span> | ||
|
||
<span class="sd"> >>> df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], </span> | ||
<span class="sd"> ["letter", "number"]</span> | ||
<span class="sd"> )</span> | ||
<span class="sd"> >>> df.show()</span> | ||
<span class="sd"> +------+------+</span> | ||
<span class="sd"> |letter|number|</span> | ||
<span class="sd"> +------+------+</span> | ||
<span class="sd"> | a| 1|</span> | ||
<span class="sd"> | b| 2|</span> | ||
<span class="sd"> | c| 3|</span> | ||
<span class="sd"> +------+------+</span> | ||
|
||
<span class="sd"> >>> df.withColumn("new_number",</span> | ||
<span class="sd"> case_when([(F.col("number") == 1, F.lit(10)),</span> | ||
<span class="sd"> (F.col("number") == 1, F.lit(20)),</span> | ||
<span class="sd"> (F.col("number") == 3, F.lit(30))],</span> | ||
<span class="sd"> default = F.lit(0)</span> | ||
<span class="sd"> )</span> | ||
<span class="sd"> ).show()</span> | ||
<span class="sd"> +------+------+----------+</span> | ||
<span class="sd"> |letter|number|new_number|</span> | ||
<span class="sd"> +------+------+----------+</span> | ||
<span class="sd"> | a| 1| 10|</span> | ||
<span class="sd"> | b| 2| 0|</span> | ||
<span class="sd"> | c| 3| 30|</span> | ||
<span class="sd"> +------+------+----------+</span> | ||
|
||
<span class="sd"> """</span> | ||
|
||
<span class="k">assert</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">list_of_tuples</span><span class="p">,</span> <span class="nb">list</span><span class="p">),</span> \ | ||
<span class="s2">"list_of_tuples should be a list of tuples"</span> | ||
|
||
<span class="k">assert</span> <span class="nb">all</span><span class="p">([</span><span class="nb">isinstance</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="nb">tuple</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">list_of_tuples</span><span class="p">]),</span>\ | ||
<span class="s2">"list_of_tuples should be a list of tuples"</span> | ||
|
||
<span class="k">assert</span> <span class="nb">all</span><span class="p">([</span><span class="nb">len</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">list_of_tuples</span><span class="p">]),</span>\ | ||
<span class="s2">"list_of_tuples should be a list of tuples of length 2"</span> | ||
|
||
<span class="c1"># Create a list of PySpark expressions for each condition in list_of_tuples</span> | ||
<span class="n">conditions</span> <span class="o">=</span> <span class="p">([</span><span class="n">F</span><span class="o">.</span><span class="n">when</span><span class="p">(</span><span class="n">condition</span><span class="p">,</span><span class="n">value</span><span class="p">)</span> | ||
<span class="k">for</span> <span class="n">condition</span><span class="p">,</span><span class="n">value</span> <span class="ow">in</span> <span class="n">list_of_tuples</span><span class="p">]</span> | ||
<span class="p">)</span> | ||
|
||
<span class="c1"># Define a pyspark expression that checks conditions in order and returns</span> | ||
<span class="c1"># the corresponding value if the condition is met. If no conditions are met,</span> | ||
<span class="c1"># return the default value.</span> | ||
<span class="k">if</span> <span class="n">default</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span> | ||
<span class="n">case_when_expression</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">coalesce</span><span class="p">(</span><span class="o">*</span><span class="n">conditions</span><span class="p">)</span> | ||
<span class="k">else</span><span class="p">:</span> | ||
<span class="n">case_when_expression</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">coalesce</span><span class="p">(</span><span class="o">*</span><span class="n">conditions</span><span class="p">,</span> <span class="n">default</span><span class="p">)</span> | ||
|
||
<span class="k">return</span> <span class="n">case_when_expression</span></div> | ||
</pre></div> | ||
|
||
</div> | ||
</div> | ||
<footer> | ||
|
||
<hr/> | ||
|
||
<div role="contentinfo"> | ||
<p>© Copyright 2023, Srikanth Komala sheshachala.</p> | ||
</div> | ||
|
||
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a | ||
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a> | ||
provided by <a href="https://readthedocs.org">Read the Docs</a>. | ||
|
||
|
||
</footer> | ||
</div> | ||
</div> | ||
</section> | ||
</div> | ||
<script> | ||
jQuery(function () { | ||
SphinxRtdTheme.Navigation.enable(true); | ||
}); | ||
</script> | ||
|
||
</body> | ||
</html> |
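The `case_when` source above builds one `F.when(condition, value)` per tuple and wraps them in `F.coalesce`. The following is a minimal plain-Python sketch of that row-level behavior, written without Spark so it can run anywhere. The helpers `when`, `coalesce`, and `case_when_model` are hypothetical stand-ins invented for illustration, not part of tidypyspark or PySpark, and `None` is assumed to play the role of a SQL null.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Row = Dict[str, Any]
Expr = Callable[[Row], Any]


def when(condition: Callable[[Row], Optional[bool]], value: Expr) -> Expr:
    """Hypothetical model of F.when(condition, value) with no otherwise():
    yields the value when the condition is true, else None (null)."""
    def evaluate(row: Row) -> Any:
        return value(row) if condition(row) is True else None
    return evaluate


def coalesce(*exprs: Expr) -> Expr:
    """Hypothetical model of F.coalesce: first non-null result, else None."""
    def evaluate(row: Row) -> Any:
        for expr in exprs:
            result = expr(row)
            if result is not None:
                return result
        return None
    return evaluate


def case_when_model(branches: List[Tuple[Callable[[Row], Optional[bool]], Expr]],
                    default: Optional[Expr] = None) -> Expr:
    """Mirrors the structure of case_when above: coalesce over when() branches,
    with the default appended only when one is given."""
    exprs = [when(condition, value) for condition, value in branches]
    if default is not None:
        exprs.append(default)
    return coalesce(*exprs)


rows = [{"letter": "a", "number": 1},
        {"letter": "b", "number": 2},
        {"letter": "c", "number": 3}]

# Same branches as the docstring example: the repeated "== 1" branch never wins.
expr = case_when_model(
    [(lambda r: r["number"] == 1, lambda r: 10),
     (lambda r: r["number"] == 1, lambda r: 20),
     (lambda r: r["number"] == 3, lambda r: 30)],
    default=lambda r: 0,
)
results = [expr(r) for r in rows]

# Caveat of the coalesce approach: a matching branch whose value is null
# does not stop evaluation; later branches (or the default) are used instead.
fallthrough = case_when_model(
    [(lambda r: r["number"] == 1, lambda r: None),
     (lambda r: r["number"] < 3, lambda r: 99)],
    default=lambda r: 0,
)
fallthrough_results = [fallthrough(r) for r in rows]
```

In this model `results` reproduces the `new_number` column from the docstring, since the first matching branch wins. The second call illustrates a design trade-off worth knowing about: a SQL-style `CASE WHEN` would return null for the first row, whereas a coalesce-based version skips a null value and moves on to the next branch.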
122 changes: 122 additions & 0 deletions
docs/_build/html/_sources/autoapi/tidypyspark/column_utils/index.rst.txt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,122 @@ | ||
:py:mod:`tidypyspark.column_utils` | ||
================================== | ||
|
||
.. py:module:: tidypyspark.column_utils | ||
Module Contents | ||
--------------- | ||
|
||
|
||
Functions | ||
~~~~~~~~~ | ||
|
||
.. autoapisummary:: | ||
|
||
tidypyspark.column_utils.ifelse | ||
tidypyspark.column_utils.case_when | ||
|
||
|
||
|
||
.. py:function:: ifelse(condition, yes, no) | ||
Vectorized if and else statement. | ||
ifelse returns a pyspark col whose value in each row is taken from yes | ||
when condition is true for that row and from no otherwise (including | ||
rows where condition is null). | ||
|
||
:param condition: Should evaluate to a boolean pyspark col. | ||
:type condition: expression or pyspark col | ||
:param yes: Should evaluate to a pyspark col for true elements of condition. | ||
:type yes: expression or pyspark col | ||
:param no: Should evaluate to a pyspark col for false elements of condition. | ||
:type no: expression or pyspark col | ||
|
||
:rtype: pyspark col | ||
|
||
.. rubric:: Examples | ||
|
||
>>> from pyspark.sql import SparkSession | ||
>>> import pyspark.sql.functions as F | ||
>>> spark = SparkSession.builder.getOrCreate() | ||
>>> import pyspark | ||
|
||
>>> df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], | ||
["letter", "number"] | ||
) | ||
>>> df.show() | ||
+------+------+ | ||
|letter|number| | ||
+------+------+ | ||
| a| 1| | ||
| b| 2| | ||
| c| 3| | ||
+------+------+ | ||
|
||
>>> df.withColumn("new_number", | ||
ifelse(F.col("number") == 1, | ||
F.lit(10), | ||
F.lit(0) | ||
) | ||
).show() | ||
+------+------+----------+ | ||
|letter|number|new_number| | ||
+------+------+----------+ | ||
| a| 1| 10| | ||
| b| 2| 0| | ||
| c| 3| 0| | ||
+------+------+----------+ | ||
|
||
|
||
.. py:function:: case_when(list_of_tuples, default=None) | ||
Implements a case_when function using PySpark. | ||
|
||
Parameters: | ||
---------- | ||
list_of_tuples (list): | ||
A list of tuples, where each tuple represents a condition | ||
and its corresponding value. | ||
default (optional): | ||
The default value to use when no conditions are met. Defaults to None. | ||
|
||
Returns: | ||
---------- | ||
PySpark Column: A PySpark column representing the case_when expression. | ||
|
||
Examples: | ||
---------- | ||
>>> from pyspark.sql import SparkSession | ||
>>> import pyspark.sql.functions as F | ||
>>> spark = SparkSession.builder.getOrCreate() | ||
>>> import pyspark | ||
|
||
>>> df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], | ||
["letter", "number"] | ||
) | ||
>>> df.show() | ||
+------+------+ | ||
|letter|number| | ||
+------+------+ | ||
| a| 1| | ||
| b| 2| | ||
| c| 3| | ||
+------+------+ | ||
|
||
>>> df.withColumn("new_number", | ||
case_when([(F.col("number") == 1, F.lit(10)), | ||
(F.col("number") == 1, F.lit(20)), | ||
(F.col("number") == 3, F.lit(30))], | ||
default = F.lit(0) | ||
) | ||
).show() | ||
+------+------+----------+ | ||
|letter|number|new_number| | ||
+------+------+----------+ | ||
| a| 1| 10| | ||
| b| 2| 0| | ||
| c| 3| 30| | ||
+------+------+----------+ | ||
|
||
|
||
|
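The `ifelse` entry documented in this file is a thin wrapper around `F.when(condition, yes).otherwise(no)`. As a rough illustration of what that means for a single row, here is a plain-Python sketch that needs no Spark session. The function `ifelse_model` is a hypothetical name used only for this example, and `None` is assumed to stand in for a SQL null.

```python
from typing import Any, List, Optional


def ifelse_model(condition: Optional[bool], yes: Any, no: Any) -> Any:
    """Hypothetical row-level model of F.when(condition, yes).otherwise(no):
    only a true condition selects `yes`; false and null both select `no`."""
    return yes if condition is True else no


# The "number" column from the documented example, plus one null entry.
numbers: List[Optional[int]] = [1, 2, 3, None]

# Comparing a null with 1 yields null in SQL, modeled here as None.
conditions = [None if n is None else n == 1 for n in numbers]

new_numbers = [ifelse_model(c, 10, 0) for c in conditions]
```

The first three values match the `new_number` column shown in the example. The last one shows that a null condition lands in the `no` branch, which differs from R's `ifelse`, where a missing condition yields a missing result.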