# How to use this notebook

A couple of notes to improve the quality of the presentation:
1. Many cells contain HTML and are flagged as "skip" for the slide type, so the cell that shows in the presentation is a simple one-liner [e.g. display(HTML(<<some_html_snippet_defined_in_a_prior_cell>>))]
2. This notebook mixes code cells that run with code stored in RAW cells; each RAW cell is followed by an image of the same snippet from https://carbon.now.sh/, which displays more nicely for presentation purposes.

To run the code in a RAW cell, change the cell type from RAW to code, and keep in mind that earlier cells in the notebook may need to be run first for a later cell to run correctly.


In [1]:
# setup
from IPython.display import display, HTML, IFrame, Image
from pathlib import Path
from qrcode.image.styledpil import StyledPilImage
from qrcode.image.styles.moduledrawers import GappedSquareModuleDrawer

# main notebook talk
import numpy as np
import pandas as pd
import polars as pl

# for QR code to github repo
import qrcode

# important paths
data_path = Path("data")
data_csv = data_path / "python_dev_universe.csv"
data_csv_gz = data_path / "python_dev_universe.csv.gz"
data_parquet = data_path / "python_dev_universe.parquet"
images_code_path = Path("images_code")
images_path = Path("images")
qr_imagefile = "qr_code_extended_talk.png"
qr_full_path = Path(images_path, qr_imagefile)
git_url_for_this_talk = (
    "https://github.com/surfaceowl/talk_nov2023_pandas_polars_arrow.git"
)

In [2]:
import requests

# a timeout keeps the cell from hanging if the network is unavailable
response = requests.get(git_url_for_this_talk, timeout=10)
if response.ok:  # True for any status code below 400
    print(
        f"The github repo for this talk is public and available ({git_url_for_this_talk})"
    )
else:
    print("ERROR - github repo not available or URL error")

The github repo for this talk is public and available (https://github.com/surfaceowl/talk_nov2023_pandas_polars_arrow.git)


# Note: RAW cells are used as input for Carbon slides; their code is not runnable as-is because there is no active df to work with

<html>
<head>
</head>
<body style="background-color: #FFFFFF;">
  <h1 align="center" style="font-weight: bold; font-style: italic; font-size: 390%;">Better Together: Pandas + Polars + Apache Arrow</h1>
  <table border="0" align="center" width="100%" bgcolor="#FFFFFF">
    <tr>
      <td align="center" width="50%" bgcolor="#FFFFFF">
        <img src="images/pandas_logo.1280x517.png" width="620" height="250">
      </td>
      <td align="center" width="50%" bgcolor="#FFFFFF">
        <img src="images/polars.round.400x400.png" width="250" height="250">
      </td>
    </tr>
    <tr>
      <td colspan="2" align="center" bgcolor="#FFFFFF">
        <img src="images/arrow-logo_horizontal.1800x936.png" width="481" height="250">
      </td>
    </tr>
  </table>
  <br>
  <div style="font-size: 300%;">
    <b>Better Together: Unleashing the Synergy of Pandas, Polars, and Apache Arrow</b><br>
    <b>Speaker:  Chris Brousseau</b> <br>
    <b>30 Nov 2023</b> 
  </div>
</body>
</html>


<table class="custom-slide-table">
    <tr>
        <td class="image-content">
            <img src="./images/intro_slide_full.jpg" alt="Images of Chris - full slide" width="105%">
        </td>
    </tr>
</table>

<table style="width: 105%; border-collapse: collapse; margin-top: 0;">
    <tr>
        <td style="vertical-align: top;">
            <h1 style="font-size: 48px; color: #333; margin-top: 0; margin-bottom: 10px;">TLDR</h1>
        </td>
    </tr>
    <tr>
        <td style="vertical-align: top;">
            <h2 style="font-size: 24px; color: #555; margin-top: 0; margin-bottom: 10px;">- Use Pandas - power (completeness) / flexibility / stability</h2>
        </td>
    </tr>
    <tr>
        <td style="vertical-align: top;">
            <h2 style="font-size: 24px; color: #555; margin-top: 0; margin-bottom: 10px;">- Add Polars where you can for speed</h2>
        </td>
    </tr>
    <tr>
        <td style="vertical-align: top;">
            <h2 style="font-size: 24px; color: #555; margin-top: 0; margin-bottom: 10px;">- Both getting faster - Arrow is the driver</h2>
        </td>
    </tr>
    <tr>
        <td style="vertical-align: top;">
            <h2 style="font-size: 24px; color: #555; margin-top: 0;">- Build better pipelines -- use together where it makes sense</h2>
        </td>
    </tr>
</table>


<h1>What is Apache Arrow?</h1>
<table border="0" style="width: 95%">
    <tr style="font-weight: bold;">
      <th style="text-align: left; padding: 10px">
        <img src="images/arrow-logo_horizontal.1800x936.png" alt="Apache Arrow logo" width="620" height="150" style="margin-left: 10px;">
      </th>
    </tr>
</table>
<p style="font-size: 24px; line-height: 2.5;">
  -- <strong>Software platform for in-memory analytics & queries</strong><br>
  -- <strong>In-memory columnar data format</strong> for tabular data<br>
  -- Fast/language-agnostic messaging & bindings<br>
  -- Batch & streaming data<br>
  -- IO to local/remote filesystems and other data structures<br>
</p>

<h1>Why is Arrow a Game Changer?</h1>

<p style="font-size: 24px; line-height: 2;">
  <strong>Interoperability ==> Easier</strong> ...Makes data program-independent<br> <br>
  <strong>Speed ==> Faster</strong> ...Columnar format; zero-copy reads ==> transfer pointers + metadata<br> <br>
  <strong>Datatypes ==> More + Better</strong>  ...Nullable; amazing Strings<br> <br>
</p>


<img src="./images/arrow_interop.jpg" alt="Interoperability - full slide" width="100%">

<img src="./images/arrow_speed.jpg" alt="Speed - full slide"  width="100%">

<h1>In-Memory Columnar Format</h1>
<br>
<br>
<br>
<br>
<div style="text-align: center; width:110%">
    <img src="./images/arrow_simd.948x651.png" alt="in-memory columnar format"  width="80%">
</div>

<img src="./images/arrow_datatypes.jpg" alt="Datatypes - full slide"  width="100%">

<h1>Zero Copy Reads</h1>

<table border="1" style="width: 100%">
  <thead>
    <tr style="font-weight: bold;">
      <th style="text-align: center; padding: 10px">
        <img src="./images/arrow_data_copy.574x318.png" alt="current world copies data" style="margin-left: 10px;">
      </th>
      <th style="text-align: center; padding: 10px">
        <img src="./images/arrow_data_shared.png" alt="arrow uses zero-copy reads" style="margin-left: 10px;">
      </th>
    </tr>
  </thead>
</table>


In [3]:
display(
    IFrame(
        "https://arrow.apache.org/docs/python/pandas.html",
        width="105%",
        height="1000px",
    )
)

# Into Pandas and Polars

<h1>Key Differences - Packages</h1>

<table border="1" style="width: 80%; font-size: 24px;">
  <thead>
    <tr style="font-weight: bold;">
      <th style="vertical-align: bottom;">Feature</th>
      <th style="text-align: left;">
        <img src="./images/pandas_secondary.svg" alt="Pandas" style="width: 300px; max-width: 100%;">
      </th>
      <th style="text-align: left;">
        <img src="./images/polars.round.400x400.png" alt="Polars" style="width: 200px; max-width: 100%;">
      </th>
    </tr>
  </thead>
  <tbody>
    <tr style="height:100px; background-color: #FFFFFF;">
      <td>First Release Date</td>
      <td>2008</td>
      <td>2019</td>
    </tr>
    <tr style="height:100px; background-color: #F0F0F0;">
      <td>Current Release</td>
      <td>2.1.3</td>
      <td>0.19.15</td>
    </tr>
    <tr style="height:100px; background-color: #FFFFFF;">
      <td>Programming Language</td>
      <td>C, Cython, Python</td>
      <td>Rust</td>
    </tr>
    <tr style="height:100px; background-color: #FFFFFF;">
      <td>Project Goal</td>
      <td>To be the fundamental building block for Python data analysis & manipulation - the most powerful and flexible dataframe tool</td>
      <td>To provide lightning-fast dataframes that use all local resources</td>
    </tr>
  </tbody>
</table>


<h1>Key Differences - Memory</h1>

<table border="1" style="width: 80%; font-size: 24px;">
  <thead>
    <tr style="font-weight: bold;">
      <th style="vertical-align: bottom;">Feature</th>
      <th style="text-align: left;">
        <img src="./images/pandas_secondary.svg" alt="Pandas" style="width: 300px; max-width: 100%;">
      </th>
      <th style="text-align: left;">
        <img src="./images/polars.round.400x400.png" alt="Polars" style="width: 200px; max-width: 100%;">
      </th>
    </tr>
  </thead>
  <tbody>
    <tr style="height:100px; background-color: #F0F0F0; font-weight: bold;">
      <td><strong>Memory</strong></td>
      <td></td>
      <td></td>
    </tr>
    <tr style="height:100px; background-color: #FFFFFF;">
      <td>Memory Backend</td>
      <td>Numpy (default) or <strong>Apache Arrow</strong></td>
      <td><strong>Apache Arrow</strong></td>
    </tr>
    <tr style="height:100px; background-color: #F0F0F0;">
      <td>Memory implementation</td>
      <td>Pyarrow <br>(C++ wrapper on data)</td>
      <td>Arrow2 <br>(Rust wrapper on data)</td>
    </tr>
    <tr style="height:100px; background-color: #FFFFFF;">
      <td>Larger-than-Memory/ Out-of-Core</td>
      <td>No (but possible via Dask)</td>
      <td><strong>Native on Lazy df only (<code>collect(streaming=True)</code>); chunking + spill</strong></td>
    </tr>
    <tr style="height:100px; background-color: #F0F0F0;">
      <td>Represent Missing Data</td>
      <td>"NaN" or "None"</td>
      <td>"null"</td>
    </tr>
  </tbody>
</table>


<h1>Key Differences - API</h1>

<table border="1" style="width: 80%; font-size: 24px;">
  <thead>
    <tr style="font-weight: bold;">
      <th style="vertical-align: bottom;">Feature</th>
      <th style="text-align: left;">
        <img src="./images/pandas_secondary.svg" alt="Pandas" style="width: 300px; max-width: 100%;">
      </th>
      <th style="text-align: left;">
        <img src="./images/polars.round.400x400.png" alt="Polars" style="width: 200px; max-width: 100%;">
      </th>
    </tr>
  </thead>
  <tbody>
    <tr style="height:70px; background-color: #FFFFFF;">
      <td>Number of Methods</td>
      <td><strong><i>Many</i></strong></td>
      <td><strong><i>not 1:1</i></strong></td>
    </tr>
    <tr style="height:70px; background-color: #F0F0F0;">
      <td>Index/MultiIndex</td>
      <td>Yes</td>
      <td><strong>No - "index free"</strong></td>
    </tr>
    <tr style="height:70px; background-color: #FFFFFF;">
      <td>Nullable dtype</td>
      <td>Yes</td>
      <td>Yes</td>
    </tr>
    <tr style="height:70px; background-color: #F0F0F0;">
      <td>API mode</td>
      <td>Eager</td>
      <td>Eager (+ Lazy)</td>
    </tr>
    <tr style="height:70px; background-color: #FFFFFF;">
      <td>Query Optimization</td>
      <td>No</td>
      <td>Yes (with Lazy)</td>
    </tr>
    <tr style="height:100px; background-color: #F0F0F0;">
      <td>Parallelization</td>
      <td>No - single threaded</td>
      <td><strong>Yes - multithreaded</strong></td>
    </tr>
    <tr style="height:70px; background-color: #FFFFFF;">
      <td>SIMD</td>
      <td>No</td>
      <td><strong>Yes</strong></td>
    </tr>
  </tbody>
</table>


<h1>Recent Updates</h1>
<table border="1" style="width: 80%; font-size: 24px;">
  <thead>
    <tr style="font-weight: bold;">
      <th style="vertical-align: bottom;">2023 - Speed & Consistency</th>
      <th style="text-align: left;">
        <img src="./images/pandas_secondary.svg" alt="Pandas" style="width: 300px; max-width: 100%;">
      </th>
      <th style="text-align: left;">
        <img src="./images/polars.round.400x400.png" alt="Polars" style="width: 200px; max-width: 100%;">
      </th>
    </tr>
  </thead>
  <tbody>
    <tr style="background-color: #FFFFFF;">
      <td>Backend</td>
      <td>Apache Arrow. <strong>PyArrow planned as a required dependency from 3.0.0</strong></td>
      <td>+cloud reading; +speed +bugfix</td>
    </tr>
    <tr style="height:100px; background-color: #F0F0F0;">
      <td></td>
      <td><i>Lazy </i><strong>Copy-on-Write</strong><br>-simplifies API<br>-each operation modifies only one object<br>-fewer defensive copies</td>
      <td></td>
    </tr>
    <tr style="height:100px; background-color: #FFFFFF;">
      <td>Feature Flags / optional dependencies</td>
      <td>pip install <br>"pandas[aws,performance]"</td>
      <td>pip install <br>"polars[pandas,fsspec]"</td>
    </tr>
    <tr style="height:100px; background-color: #F0F0F0;">
      <td>Reference</td>
      <td><a href="https://pandas.pydata.org/docs/whatsnew/index.html">pandas release notes</a></td>
      <td><a href="https://github.com/pola-rs/polars/releases">polars release notes</a></td>
    </tr>
  </tbody>
</table>
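A short sketch of what Copy-on-Write changes in practice, assuming pandas 2.x, where it is opt-in through `pd.options.mode.copy_on_write`. The DataFrame is made up. With the option enabled, writing to an object derived from `df` never alters `df` itself, which is the "modifies only one object" behavior listed above.

```python
import pandas as pd

# Opt in to Copy-on-Write (assumption: pandas 2.x, where this is not yet the default)
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

subset = df["a"]      # no data is copied at this point (the copy is lazy)
subset.iloc[0] = 100  # the copy is made here, on the first write

print(df["a"].tolist())   # the parent is unchanged: [1, 2, 3]
print(subset.tolist())    # only the derived object changed: [100, 2, 3]
```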

<h1>But how much faster?</h1>
<table border="1" style="width: 80%; ">
  <thead>
    <tr style="font-size: 24px; font-weight: bold;">
      <th style="vertical-align: bottom;">Anecdotes & Considerations</th>
      <th style="text-align: left;">
        <img src="./images/pandas_secondary.svg" alt="Pandas" style="width: 300px; max-width: 100%;">
      </th>
      <th style="text-align: left;">
        <img src="./images/polars.round.400x400.png" alt="Polars" style="width: 200px; max-width: 100%;">
      </th>
    </tr>
  </thead>
  <tbody>
    <tr style="height:70px; font-size: 24px; background-color: #FFFFFF;">
      <td>read csv +csv.gz</td>
      <td></td>
      <td><strong>~2x-20x faster</strong></td>
    </tr>
    <tr style="height:70px; font-size: 24px; background-color: #F0F0F0;">
      <td>read parquet</td>
      <td></td>
      <td>~1x-5x</td>
    </tr>
    <tr style="height:70px; font-size: 24px; background-color: #FFFFFF;">
      <td>groupby</td>
      <td></td>
      <td>~10x</td>
    </tr>
    <tr style="height:70px;  font-size: 24px; background-color: #F0F0F0;">
      <td>Other Considerations</td>
      <td>Huge Ecosystem / examples</td>
      <td>Smaller but growing Ecosystem</td>
    </tr>
    <tr style="height:100px; font-size: 24px; background-color: #FFFFFF;">
      <td>API</td>
      <td><strong>Stable</strong></td>
      <td><strong>Less Stable - but improving</strong></td>
    </tr>
    <tr style="height:70px; font-size: 24px; background-color: #F0F0F0;">
      <td> </td>
      <td> </td>
      <td>Categorical; some window functions / plotting /etc.</td>
    </tr>
    <tr style="height:70px; font-size: 24px; background-color: #FFFFFF;">
      <td><strong>Reference - see repo</strong></td>
      <td><a href="https://pandas.pydata.org/docs/reference/index.html">pandas api</a></td>
      <td><a href="https://pola-rs.github.io/polars/py-polars/html/reference/">polars api</a></td>
    </tr>
  </tbody>
</table>

<style>
    .custom-slide-table {
        width: 100%;
        table-layout: fixed;
    }

    .custom-slide-table td {
        vertical-align: middle;
    }

    .custom-title {
        font-size: 32px;
        text-align: left;
        margin-top: 40px;
        margin-bottom: 20px;
    }
</style>

<h1 class="custom-title">Caution comparing speeds!  Pandas configs have a big impact!</h1>

<table class="custom-slide-table">
    <tr>
        <td class="image-content">
            <img src="./images/csv_read_speed_comparison.jpg" alt="A lot faster!" width="120%">
        </td>
    </tr>
</table>


# Get the Best out of Pandas <br>
<br>

### 0- Configure Pandas properly - huge impact on performance <br><br>

### 1- use pyarrow for I/O & nullable dtypes (faster)
pd.read_csv("my_data.csv", engine="pyarrow", dtype_backend="pyarrow")<br>pd.read_parquet("my_data.parquet", dtype_backend="pyarrow") <br> <br>
<br>
### 2- set pyarrow for all string data (faster/smaller)
pd.options.future.infer_string = True   <br>
<br>
### 3- enable Copy-On-Write (lazy + more consistent api) -- just do this anyway; will be default!
pd.options.mode.copy_on_write = True <br>
<br>

# some syntax differences

In [4]:
# code for image displayed below
syntax_compare_image = (
    '<img src="./images_code/pandas_polars_syntax_compare.png" style="width:100%"/>'
)

In [5]:
display(HTML(syntax_compare_image))

# Use Columnar Example


In [6]:
# code for image displayed below
columnar_image = (
    '<img src="./images_code/pandas_pred_pushdown.png" style="width:100%"/>'
)

In [7]:
display(HTML(columnar_image))

# Polars - Lazy Query Example

In [8]:
# polars Eager API
df_pl2 = pl.read_parquet(data_parquet).filter(
    (pl.col("occupation") == "Data Engineer")
    & (pl.col("psf_membership_status") != "Not Yet a Member")
)

In [9]:
# polars Lazy API - chain transformations then collect()

df_pl2 = (
    pl.read_parquet(data_parquet)
    .lazy()
    .filter(
        (pl.col("occupation") == "Apache Arrow Understudy")
        & (pl.col("psf_membership_status") == "Contributing")
    )
    .collect()
)

In [10]:
eager_lazy_api_side_by_side = """
<table style="width:100%; table-layout: fixed;">
    <tr>
        <td style="width:50%; vertical-align: top; background-color: white;">
            <img src="./images_code/polars_eager_api_match_lines.png" style="width:100%;"/>
        </td>
        <td style="width:50%; vertical-align: top; background-color: white;">
            <img src="./images_code/polars_lazy_api.png" style="width:100%;"/>
        </td>
    </tr>
</table>
"""

In [11]:
display(HTML(eager_lazy_api_side_by_side))



<h1 style="font-size: 32px; color: #333; margin-bottom: 20px;">Integrating Pandas and Polars</h1>

<table style="width: 90%;">
    <tr>
        <td style="vertical-align: top; padding-right: 10px;">
            <h2 style="font-size: 32px; font-weight: bold; color: #555; margin: 15px 0;">Why</h2>
            <ul>
                <li style="font-size: 24px; color: #777; margin: 10px 0;">Creating new pipeline (great!)</li>
                <li style="font-size: 24px; color: #777; margin: 10px 0;">Need big speed improvements + worth your time + budget</li>
                <li style="font-size: 24px; color: #777; margin: 10px 0;">Expensive Pandas ops with direct matches (e.g. select)</li>
                <li style="font-size: 24px; color: #777; margin: 10px 0;">Input data is Polars-friendly (a lot is)</li>
                <ul style="font-size: 24px; margin-left: 40px; list-style-type: disc;">
                    <li style="font-size: 24px;"> Pandas df</li>
                    <li style="font-size: 24px;"> Flat files</li>
                    <li style="font-size: 24px;"> Standard db (sql) or supported datastore</li>
                    <li style="font-size: 24px;"> Already in Arrow (e.g. Spark)</li>
                    <li style="font-size: 24px;"> <strong> API does NOT require  pandas (e.g. Snowflake)</strong></li>
                </ul>
            </ul>
        </td>
    </tr>
</table>


### Reminder: RAW cells are used as input for Carbon slides; their code is not runnable as-is because there is no active df to work with

In [12]:
# code for image displayed below
polars_to_from_pandas = (
    '<img src="./images_code/polars_to_from_pandas.png" style="width:100%"/>'
)

In [13]:
display(HTML(polars_to_from_pandas))

In [14]:
# code for image displayed below
polars_from_pandas_detail = (
    '<img src="./images_code/polars_from_pandas_detail.png" style="width:100%"/>'
)

In [15]:
display(HTML(polars_from_pandas_detail))

In [16]:
# code for image displayed below
polars_to_pandas_detail = (
    '<img src="./images_code/polars_to_pandas_detail.png" style="width:100%"/>'
)

In [17]:
display(HTML(polars_to_pandas_detail))

In [18]:
display(
    IFrame(
        "https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas",
        width="105%",
        height="1000px",
    )
)

In [19]:
# code for image displayed below
pandas_with_pyarrow = (
    '<img src="./images_code/pandas_with_pyarrow.png" style="width:100%"/>'
)

In [20]:
display(HTML(pandas_with_pyarrow))

## Re: Speed - what are practical implications of transfers to/from Arrow?


In [21]:
# code for image displayed below
pl_pd_xfer_speed_setup = (
    '<img src="./images_code/pl_pd_xfer_speed_setup.png" style="width:100%"/>'
)

In [22]:
display(HTML(pl_pd_xfer_speed_setup))

In [23]:
# code for image displayed below
pl_to_pandas_slow = (
    '<img src="./images_code/pl_to_pandas_slow.png" style="width:100%"/>'
)

In [24]:
display(HTML(pl_to_pandas_slow))

In [25]:
# code for image displayed below
pl_to_pandas_fast = (
    '<img src="./images_code/pl_to_pandas_fast.png" style="width:100%"/>'
)

In [26]:
display(HTML(pl_to_pandas_fast))

<table style="width: 100%; border-collapse: collapse; margin-top: 0;">
    <tr>
        <td style="vertical-align: top;">
            <h1 style="font-size: 48px; color: #333; margin-top: 0; margin-bottom: 10px;">TLDR</h1>
        </td>
    </tr>
    <tr>
        <td style="vertical-align: top;">
            <h2 style="font-size: 24px; color: #555; margin-top: 0; margin-bottom: 10px;">- Use Pandas - power (completeness) / flexibility / stability</h2>
        </td>
    </tr>
    <tr>
        <td style="vertical-align: top;">
            <h2 style="font-size: 24px; color: #555; margin-top: 0; margin-bottom: 10px;">- Add Polars where you can for speed</h2>
        </td>
    </tr>
    <tr>
        <td style="vertical-align: top;">
            <h2 style="font-size: 24px; color: #555; margin-top: 0; margin-bottom: 10px;">- Both getting faster - Arrow is the driver</h2>
        </td>
    </tr>
    <tr>
        <td style="vertical-align: top;">
            <h2 style="font-size: 24px; color: #555; margin-top: 0;">- Build better pipelines -- use together where it makes sense</h2>
        </td>
    </tr>
</table>


# Thank You / Wrap Up


In [27]:
# check or create QR code for this github repo
if not qr_full_path.is_file():
    print(f"QR code missing; recreating {qr_full_path}")
    qr = qrcode.QRCode(error_correction=qrcode.constants.ERROR_CORRECT_M)
    qr.add_data(git_url_for_this_talk)

    qr_code_extended_talk = qr.make_image(
        image_factory=StyledPilImage,
        module_drawer=GappedSquareModuleDrawer(),
    )
    qr_code_extended_talk.save(qr_full_path)
    qr_code_extended_talk
else:
    print("QR code exists")

QR code exists


<html>
<head>
    <link href="https://fonts.googleapis.com/css2?family=Architects+Daughter&display=swap" rel="stylesheet">
</head>
<body>
    <table style="border-collapse: collapse; border: none; background-color: #F7F7F7; width: 105%; font-size: 148px;">
        <tr>
            <td colspan="2" style="border: none; background-color: #E0E0E0; text-align: center; font-weight: bold; font-family: 'Architects Daughter';">Thank you!</td>
        </tr>
        <tr>
            <td style="border: none; background-color: #F7F7F7; text-align: left; font-size: 48px;">Chris Brousseau</td>
        </tr>
        <tr>
            <td style="border: none; background-color: #F7F7F7; text-align: left; font-weight: bold; font-size: 36px;">Data / ML & Data Science Consulting</td>
        </tr>
        <tr>
            <td style="border: none; background-color: #F7F7F7; text-align: left; font-weight: bold; font-size: 36px;">chris@surfaceowl.com</td>
        </tr>
        <tr>
            <td style="padding: 8px; border: none; background-color: #F7F7F7; text-align: left; font-size: 24px;">
                <a href="https://github.com/surfaceowl/talk_nov2023_pandas_polars_arrow" target="_blank"> https://github.com/surfaceowl/talk_nov2023_pandas_polars_arrow</a>
            </td>
            <td style="padding: 8px; border: none; background-color: #F7F7F7; text-align: center;">
                <img src="./images/qr_code_extended_talk.png" alt="QR Code for Extended Better Together Talk" style="max-width: 100%; height: auto;">
            </td>
        </tr>
    </table>
</body>
</html>

# Appendix

<div style="font-size: 48px; width: 75%; text-align: left;">
    <table style="font-size: 18px; width: 105%;">
        <tr>
            <td style="vertical-align: top;"><b>Arrow Revolution</b></td>
            <td><a href="https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i">https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i</a><br><br></td>
        </tr>
        <tr>
            <td style="vertical-align: top;"><b>interesting articles/video</b></td>
            <td>
                <a href="https://www.youtube.com/watch?v=QfLzEp-yt_U">Ritchie Vink; Polars Creator</a><br>
                <a href="https://kyleake.medium.com/pandas-to-polars-a-comprehensive-transition-guide-81b6f50e9154">https://kyleake.medium.com/pandas-to-polars-a-comprehensive-transition-guide-81b6f50e9154</a><br><br>
            </td>
        </tr>
        <tr>
            <td style="vertical-align: top;"><b>pandas</b></td>
            <td><a href="https://pandas.pydata.org/docs/whatsnew/index.html">https://pandas.pydata.org/docs/whatsnew/index.html</a><br><br></td>
        </tr>
        <tr>
            <td style="vertical-align: top;"><b>polars</b></td>
            <td><a href="https://pola-rs.github.io/polars/">https://pola-rs.github.io/polars/</a><br><br></td>
        </tr>
        <tr>
            <td style="vertical-align: top;"><b>convert to/from pandas/pyarrow</b></td>
            <td><a href="https://arrow.apache.org/docs/python/pandas.html">https://arrow.apache.org/docs/python/pandas.html</a><br><br></td>
        </tr>
        <tr>
            <td style="vertical-align: top;"><b>apache arrow</b></td>
            <td>
                <a href="https://arrow.apache.org/overview/">https://arrow.apache.org/overview/</a><br><br>
                <a href="https://arrow.apache.org/docs/python/index.html">https://arrow.apache.org/docs/python/index.html</a><br><br>
            </td>
        </tr>
        <tr>
            <td style="vertical-align: top;"><b>dataframe API standard</b></td>
            <td>
                <a href="https://data-apis.org/dataframe-api/draft/index.html">https://data-apis.org/dataframe-api/draft/index.html</a><br>
                <a href="https://ponder.io/how-the-python-dataframe-interchange-protocol-makes-life-better/">https://ponder.io/how-the-python-dataframe-interchange-protocol-makes-life-better/</a><br><br>
            </td>
        </tr>
        <tr>
            <td style="vertical-align: top;"><b>Reference - Datatypes</b></td>
            <td>
                <a href="https://pythonspeed.com/articles/pandas-string-dtype-memory/">https://pythonspeed.com/articles/pandas-string-dtype-memory/</a><br>
                <a href="https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/">https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/</a><br>
                <a href="https://pandas.pydata.org/docs/user_guide/basics.html#dtypes">https://pandas.pydata.org/docs/user_guide/basics.html#dtypes</a><br>
                <a href="https://pola-rs.github.io/polars/py-polars/html/reference/datatypes.html">https://pola-rs.github.io/polars/py-polars/html/reference/datatypes.html</a><br>
                <a href="https://arrow.apache.org/docs/python/pandas.html">https://arrow.apache.org/docs/python/pandas.html</a><br><br>
            </td>
        </tr>
    </table>
</div>


In [28]:
# string memory usage - credit to:  https://pythonspeed.com/articles/pandas-string-dtype-memory/
from random import random
import sys

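# In a Jupyter kernel, sys.argv[1] is typically the launcher's "-f" flag, so this
# prefix is a short constant string rather than a user-supplied argument
prefix = sys.argv[1]

# A Python list of strings generated from random numbers:
random_strings = [prefix + str(random()) for _ in range(1_000_000)]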

# The default dtype, object:
object_dtype = pd.Series(random_strings)
print("object", object_dtype.memory_usage(deep=True))

# A normal Pandas string dtype:
standard_dtype = pd.Series(random_strings, dtype="string")
print("string", standard_dtype.memory_usage(deep=True))

# The Arrow-backed string dtype, as available in pandas 2.1.3:
arrow_dtype = pd.Series(random_strings, dtype="string[pyarrow]")
print("arrow ", arrow_dtype.memory_usage(deep=True))

object 77270726
string 77270726
arrow  24270726


In [29]:
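# fraction of memory saved by the Arrow string dtype relative to the standard string dtype
standard_bytes = standard_dtype.memory_usage(deep=True)
arrow_bytes = arrow_dtype.memory_usage(deep=True)
arrow_string_savings_pct = (standard_bytes - arrow_bytes) / standard_bytes

arrow_string_savings_pct = round(arrow_string_savings_pct * 100, 0)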
arrow_string_savings_pct

69.0