# Getting Started in Polars
---
**Alier Reng**

**Date: 2024-04-20**


# Motivation
I've always been intrigued by the streamlined syntax and speed of the `Polars` library. However, my initial experience with it was challenging. I tried to clean and manipulate data with `Polars`, but I struggled to import the data correctly because my data required the `null_values` argument in the import functions. This left me discouraged, and I gave up for a while. But I didn't lose hope and kept trying. Finally, I succeeded in creating a tutorial on the basics of the `Polars` library in May 2023, which I updated on April 20, 2024. This tutorial is my way of sharing my knowledge with those interested in learning about the `Polars` package. [That tutorial can be found here](https://github.com/tongakuot/python_tutorials/tree/main/Data%20Wrangling%20with%20Polars).

This tutorial is the first of my **Learning the Polars Library the Right Way** series, which will be published on medium.com and later on alierwaidatastudio.com.

# Getting Started
First, I will import the necessary libraries, read South Sudan's 2008 census data, and then showcase how to clean and transform this dataset with the `polars` library.

As a bonus, I will demonstrate how to plot the data with the `lets-plot` Python library, which is derived from `ggplot2`, and tabulate data with the `great_tables` Python package, a Python version of the R `gt` package. My goal is to walk my readers through, step by step, from loading packages and importing data to data wrangling, visualization, and tabulation. I will conclude with a comprehensive summary.

## Loading the Required Libraries
Here, I will load `polars`, `lets_plot`, `great_tables`, `numpy`, `seaborn`, and `matplotlib.pyplot`.

In [2]:
# Libraries -------
import polars as pl
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from great_tables import GT, md, html, style, loc
from lets_plot import *
LetsPlot.setup_html()


## Importing Data


In [3]:
census_raw = pl.read_csv("data/ss_2008_census_data_raw.csv", null_values="NA")

# Inspect the first 5 rows
print(census_raw.head(5))

shape: (5, 10)
┌────────┬─────────────┬───────────────────┬──────────┬───┬──────────┬───────┬─────────┬────────┐
│ Region ┆ Region Name ┆ Region - RegionId ┆ Variable ┆ … ┆ Age Name ┆ Scale ┆ Units   ┆ 2008   │
│ ---    ┆ ---         ┆ ---               ┆ ---      ┆   ┆ ---      ┆ ---   ┆ ---     ┆ ---    │
│ str    ┆ str         ┆ str               ┆ str      ┆   ┆ str      ┆ str   ┆ str     ┆ i64    │
╞════════╪═════════════╪═══════════════════╪══════════╪═══╪══════════╪═══════╪═════════╪════════╡
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ Total    ┆ units ┆ Persons ┆ 964353 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 0 to 4   ┆ units ┆ Persons ┆ 150872 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 5 to 9   ┆ units ┆ Persons ┆ 151467 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 10 to 14 ┆ units ┆ Persons ┆ 126140 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 15 to 19 ┆ units ┆ Persons ┆ 103804 │
└────

## Importing the Data Lazily


In [4]:
census_lazy = pl.scan_csv(
    'data/ss_2008_census_data_raw.csv', null_values='NA'
    )

# Inspect the first 5 rows
print(census_lazy.collect().head(5))

shape: (5, 10)
┌────────┬─────────────┬───────────────────┬──────────┬───┬──────────┬───────┬─────────┬────────┐
│ Region ┆ Region Name ┆ Region - RegionId ┆ Variable ┆ … ┆ Age Name ┆ Scale ┆ Units   ┆ 2008   │
│ ---    ┆ ---         ┆ ---               ┆ ---      ┆   ┆ ---      ┆ ---   ┆ ---     ┆ ---    │
│ str    ┆ str         ┆ str               ┆ str      ┆   ┆ str      ┆ str   ┆ str     ┆ i64    │
╞════════╪═════════════╪═══════════════════╪══════════╪═══╪══════════╪═══════╪═════════╪════════╡
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ Total    ┆ units ┆ Persons ┆ 964353 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 0 to 4   ┆ units ┆ Persons ┆ 150872 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 5 to 9   ┆ units ┆ Persons ┆ 151467 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 10 to 14 ┆ units ┆ Persons ┆ 126140 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 15 to 19 ┆ units ┆ Persons ┆ 103804 │
└────

In [5]:
# Inspect the last 5 rows
print(census_raw.tail(5))

shape: (5, 10)
┌───────────────┬────────────────────┬──────────┬──────────┬───┬──────────┬───────┬─────────┬──────┐
│ Region        ┆ Region Name        ┆ Region - ┆ Variable ┆ … ┆ Age Name ┆ Scale ┆ Units   ┆ 2008 │
│ ---           ┆ ---                ┆ RegionId ┆ ---      ┆   ┆ ---      ┆ ---   ┆ ---     ┆ ---  │
│ str           ┆ str                ┆ ---      ┆ str      ┆   ┆ str      ┆ str   ┆ str     ┆ i64  │
│               ┆                    ┆ str      ┆          ┆   ┆          ┆       ┆         ┆      │
╞═══════════════╪════════════════════╪══════════╪══════════╪═══╪══════════╪═══════╪═════════╪══════╡
│ KN.A11        ┆ Eastern Equatoria  ┆ SS-EE    ┆ KN.B8    ┆ … ┆ 60 to 64 ┆ units ┆ Persons ┆ 5274 │
│ KN.A11        ┆ Eastern Equatoria  ┆ SS-EE    ┆ KN.B8    ┆ … ┆ 65+      ┆ units ┆ Persons ┆ 8637 │
│ null          ┆ null               ┆ null     ┆ null     ┆ … ┆ null     ┆ null  ┆ null    ┆ null │
│ Source:       ┆ National Bureau of ┆ null     ┆ null     ┆ … ┆ null     ┆ 

In [44]:
list(census_raw['Region Name'].tail())

['Eastern Equatoria',
 'Eastern Equatoria',
 'Eastern Equatoria',
 'Eastern Equatoria',
 'Eastern Equatoria',
 'Eastern Equatoria',
 'Eastern Equatoria',
 None,
 'National Bureau of Statistics, South Sudan',
 'http://southsudan.opendataforafrica.org/fvjqdpe/population-by-age-and-sex-2008-south-sudan']

In [6]:
# Inspect random 5 rows
print(census_raw.sample(5))

shape: (5, 10)
┌────────┬─────────────────────────┬──────────┬──────────┬───┬──────────┬───────┬─────────┬────────┐
│ Region ┆ Region Name             ┆ Region - ┆ Variable ┆ … ┆ Age Name ┆ Scale ┆ Units   ┆ 2008   │
│ ---    ┆ ---                     ┆ RegionId ┆ ---      ┆   ┆ ---      ┆ ---   ┆ ---     ┆ ---    │
│ str    ┆ str                     ┆ ---      ┆ str      ┆   ┆ str      ┆ str   ┆ str     ┆ i64    │
│        ┆                         ┆ str      ┆          ┆   ┆          ┆       ┆         ┆        │
╞════════╪═════════════════════════╪══════════╪══════════╪═══╪══════════╪═══════╪═════════╪════════╡
│ KN.A8  ┆ Lakes                   ┆ SS-LK    ┆ KN.B2    ┆ … ┆ 30 to 34 ┆ units ┆ Persons ┆ 47137  │
│ KN.A11 ┆ Eastern Equatoria       ┆ SS-EE    ┆ KN.B2    ┆ … ┆ 0 to 4   ┆ units ┆ Persons ┆ 126467 │
│ KN.A5  ┆ Warrap                  ┆ SS-WR    ┆ KN.B2    ┆ … ┆ 60 to 64 ┆ units ┆ Persons ┆ 12243  │
│ KN.A6  ┆ Northern Bahr el Ghazal ┆ SS-BN    ┆ KN.B5    ┆ … ┆ 65+      ┆ un

## Checking for Missing Values


In [7]:
# Count missing values in each column
print(census_raw.null_count())

shape: (1, 10)
┌────────┬─────────────┬───────────────────┬──────────┬───┬──────────┬───────┬───────┬──────┐
│ Region ┆ Region Name ┆ Region - RegionId ┆ Variable ┆ … ┆ Age Name ┆ Scale ┆ Units ┆ 2008 │
│ ---    ┆ ---         ┆ ---               ┆ ---      ┆   ┆ ---      ┆ ---   ┆ ---   ┆ ---  │
│ u32    ┆ u32         ┆ u32               ┆ u32      ┆   ┆ u32      ┆ u32   ┆ u32   ┆ u32  │
╞════════╪═════════════╪═══════════════════╪══════════╪═══╪══════════╪═══════╪═══════╪══════╡
│ 1      ┆ 1           ┆ 3                 ┆ 3        ┆ … ┆ 3        ┆ 3     ┆ 3     ┆ 3    │
└────────┴─────────────┴───────────────────┴──────────┴───┴──────────┴───────┴───────┴──────┘


In [8]:
import polars.selectors as cs
# Count missing values in each column using selectors
print(
    census_raw
    .select(cs.all().is_null().sum())
)

shape: (1, 10)
┌────────┬─────────────┬───────────────────┬──────────┬───┬──────────┬───────┬───────┬──────┐
│ Region ┆ Region Name ┆ Region - RegionId ┆ Variable ┆ … ┆ Age Name ┆ Scale ┆ Units ┆ 2008 │
│ ---    ┆ ---         ┆ ---               ┆ ---      ┆   ┆ ---      ┆ ---   ┆ ---   ┆ ---  │
│ u32    ┆ u32         ┆ u32               ┆ u32      ┆   ┆ u32      ┆ u32   ┆ u32   ┆ u32  │
╞════════╪═════════════╪═══════════════════╪══════════╪═══╪══════════╪═══════╪═══════╪══════╡
│ 1      ┆ 1           ┆ 3                 ┆ 3        ┆ … ┆ 3        ┆ 3     ┆ 3     ┆ 3    │
└────────┴─────────────┴───────────────────┴──────────┴───┴──────────┴───────┴───────┴──────┘


## Selecting Columns of Interest

In [9]:
# Selecting columns of interest: polars provides various methods for selecting columns
print(
    census_raw
    .select(cs.ends_with('Name'), '2008')
    .columns
)

['Region Name', 'Variable Name', 'Age Name', '2008']


In [10]:
age_mapping = {
    "0 to 4": "0-14",
    "5 to 9": "0-14",
    "10 to 14": "0-14",
    "15 to 19": "15-24",
    "20 to 24": "15-24",
    "25 to 29": "25-34",
    "30 to 34": "25-34",
    "35 to 39": "35-44",
    "40 to 44": "35-44",
    "45 to 49": "45-54",
    "50 to 54": "45-54",
    "55 to 59": "55-64",
    "60 to 64": "55-64",
    "65+": "65 and above",
}
census = (
    census_raw
    .select(
        ["Region Name", "Variable Name", "Age Name", "2008"]
    )
    .rename(
        {
            "Region Name": "state",
            "Variable Name": "gender",
            "Age Name": "age_category",
            "2008": "population",
        }
    )
    .with_columns(
        gender=pl.col("gender").str.split(" ").list.get(1),
        age_category=pl.col("age_category").replace(age_mapping),
    )
    .filter(
        (pl.col("gender") != "Total") & (pl.col("age_category") != "Total")
    )
    .group_by(['state', 'gender', 'age_category'])
    .agg(total=pl.col('population').sum())
    .sort('total', descending=True)
)


In [11]:
print(census.head(5))

shape: (5, 4)
┌───────────────────┬────────┬──────────────┬────────┐
│ state             ┆ gender ┆ age_category ┆ total  │
│ ---               ┆ ---    ┆ ---          ┆ ---    │
│ str               ┆ str    ┆ str          ┆ i64    │
╞═══════════════════╪════════╪══════════════╪════════╡
│ Jonglei           ┆ Male   ┆ 0-14         ┆ 338443 │
│ Jonglei           ┆ Female ┆ 0-14         ┆ 263646 │
│ Central Equatoria ┆ Male   ┆ 0-14         ┆ 242247 │
│ Upper Nile        ┆ Male   ┆ 0-14         ┆ 237461 │
│ Warrap            ┆ Male   ┆ 0-14         ┆ 230854 │
└───────────────────┴────────┴──────────────┴────────┘


# Tabulating Data with the Great Tables Package

In [40]:
# Summarize data by state
state = (
    census
    .group_by(['state', 'gender'])
    .agg(total=pl.col('total').sum())
    .sort('total', descending=True)
    # Reshape to one row per state; newer Polars releases rename `columns=` to `on=`
    .pivot(index='state', columns='gender', values='total')
)

# Display output
print(state)

shape: (10, 3)
┌─────────────────────────┬────────┬────────┐
│ state                   ┆ Male   ┆ Female │
│ ---                     ┆ ---    ┆ ---    │
│ str                     ┆ i64    ┆ i64    │
╞═════════════════════════╪════════╪════════╡
│ Jonglei                 ┆ 734327 ┆ 624275 │
│ Central Equatoria       ┆ 581722 ┆ 521835 │
│ Upper Nile              ┆ 525430 ┆ 438923 │
│ Warrap                  ┆ 470734 ┆ 502194 │
│ Eastern Equatoria       ┆ 465187 ┆ 440974 │
│ Northern Bahr el Ghazal ┆ 348290 ┆ 372608 │
│ Lakes                   ┆ 365880 ┆ 329850 │
│ Western Equatoria       ┆ 318443 ┆ 300586 │
│ Unity                   ┆ 300247 ┆ 285554 │
│ Western Bahr el Ghazal  ┆ 177040 ┆ 156391 │
└─────────────────────────┴────────┴────────┘


In [86]:
# Tabulate state data frame
import polars.selectors as cs
st_gt = (
    GT(state, rowname_col='state')
    .tab_header(
        title='Jonglei State leads with over 1.3 million persons',
        subtitle=md(
            'Western Bahr el Ghazal State has the lowest <br>population according to the 2008 census'
        )
    )
    .fmt_integer(columns=cs.integer(), use_seps=True)
    .cols_align(align='center', columns=['Male', 'Female'])
    .opt_align_table_header(align="left")
    .cols_label(state='State')
    .tab_spanner(columns=cs.integer(), label='Population by Gender')
    .tab_source_note(source_note=html('<a href="http://southsudan.opendataforafrica.org/fvjqdpe/population-by-age-and-sex-2008-south-sudan"><strong>2008 Census: National Bureau of Statistics, South Sudan</strong></a>'))
)

In [75]:
# Save gt table as a png file
st_gt.save(file='img/state.png')

# Visualizing Census Data with the Lets-Plot Package

In [216]:
state_by_gender = (
    census
    .group_by(['state', 'gender'])
    .agg(total=pl.col('total').sum())
    .sort('total', descending=True)
)

# Inspect output
print(state_by_gender)

shape: (20, 3)
┌────────────────────────┬────────┬────────┐
│ state                  ┆ gender ┆ total  │
│ ---                    ┆ ---    ┆ ---    │
│ str                    ┆ str    ┆ i64    │
╞════════════════════════╪════════╪════════╡
│ Jonglei                ┆ Male   ┆ 734327 │
│ Jonglei                ┆ Female ┆ 624275 │
│ Central Equatoria      ┆ Male   ┆ 581722 │
│ Upper Nile             ┆ Male   ┆ 525430 │
│ Central Equatoria      ┆ Female ┆ 521835 │
│ …                      ┆ …      ┆ …      │
│ Western Equatoria      ┆ Female ┆ 300586 │
│ Unity                  ┆ Male   ┆ 300247 │
│ Unity                  ┆ Female ┆ 285554 │
│ Western Bahr el Ghazal ┆ Male   ┆ 177040 │
│ Western Bahr el Ghazal ┆ Female ┆ 156391 │
└────────────────────────┴────────┴────────┘


In [217]:
# Plot state population by gender
from lets_plot import *

(
    ggplot(
        state_by_gender, 
        aes(x='state', y='total', fill='gender')
    ) + 
    geom_bar(stat='identity') +
    scale_y_continuous(limits=[0, 1_500_000], format='~e') +
    scale_fill_manual(values=['#a9a9a9', '#4D5B68']) +
    labs(
        title='Jonglei State leads in both categories',
        subtitle='Western Bahr el Ghazal State has the lowest \ncount in both categories',
        x='',
        y='Population (in persons)',
        fill='Gender'
    ) +
    theme(plot_title=element_text(hjust=0, size=25, family='Arial'))
)

In [172]:
# Visualize state data by age category
census_by_age = (
    census
    .group_by(['state', 'gender', 'age_category'])
    .agg(total=pl.col('total').sum())
    .sort('total', descending=True)
)

# Inspect output
print(census_by_age.head())

shape: (5, 4)
┌───────────────────┬────────┬──────────────┬────────┐
│ state             ┆ gender ┆ age_category ┆ total  │
│ ---               ┆ ---    ┆ ---          ┆ ---    │
│ str               ┆ str    ┆ str          ┆ i64    │
╞═══════════════════╪════════╪══════════════╪════════╡
│ Jonglei           ┆ Male   ┆ 0-14         ┆ 338443 │
│ Jonglei           ┆ Female ┆ 0-14         ┆ 263646 │
│ Central Equatoria ┆ Male   ┆ 0-14         ┆ 242247 │
│ Upper Nile        ┆ Male   ┆ 0-14         ┆ 237461 │
│ Warrap            ┆ Male   ┆ 0-14         ┆ 230854 │
└───────────────────┴────────┴──────────────┴────────┘


In [215]:
# Plot census_by_age
(
    ggplot(census_by_age, aes('state', 'total', fill='gender')) +
    geom_bar(stat='identity', show_legend=False) +
    scale_fill_manual(values=['#5C6468', '#110D01']) +
    facet_grid(x='gender', y='age_category', scales='free_y') +
    scale_y_continuous(format='~e') +
    labs(
        title='Facet Grid Display of the South Sudan 2008 Census Data',
        x='',
        y='Population (in persons)'
    ) +
    theme(
        plot_title=element_text(size=20, margin=[15, 0, 15, 0]),
        axis_line_x='blank', 
        axis_ticks=element_line(color='white'), 
        panel_grid_major_x='blank', 
        strip_background=element_rect(color='black', fill='white'), 
        axis_tooltip=element_rect(color='black', fill='white'),
        legend_position='none'
    )
)
