# Getting Started in Polars
---
**Alier Reng**

**Date: 2024-04-20**


# Motivation
I've always been intrigued by the streamlined syntax and speed of the `Polars` library. However, my initial experience with it was challenging: when I tried to clean and manipulate data with `Polars`, I struggled to import the data correctly because the import functions needed the `null_values` argument to handle the file's missing-value markers. Discouraged, I set it aside for a while, but I kept coming back to it. In May 2023, I finally succeeded in creating a tutorial on the basics of the `Polars` library, which I updated on April 20, 2024. This tutorial is my way of sharing what I learned with anyone interested in the `Polars` package. [That tutorial can be found here](https://github.com/tongakuot/python_tutorials/tree/main/Data%20Wrangling%20with%20Polars).

This tutorial is the first of my **Learning the Polars Library the Right Way** series, which will be published on medium.com and later on alierwaidatastudio.com.

# Getting Started
First, I will import the necessary libraries, read South Sudan's 2008 census data, and then showcase how to clean and transform this dataset with the `polars` library.

As a bonus, I will demonstrate how to plot the data with the `lets-plot` Python library, which is modeled on `ggplot2`, and tabulate data with the `great_tables` Python package, a Python version of the R `gt` package. My goal is to walk my readers, step by step, from loading packages and importing data to data wrangling, visualization, and tabulation. I will conclude with a comprehensive summary.

## Loading the Required Libraries
Here, I will load `polars` and `great_tables`.

In [1]:
# Libraries -------
import polars as pl
import polars.selectors as cs 
from great_tables import GT, md, html, style, loc

# pl.__version__

## Importing Data


In [2]:
# Load the dataset
census_raw = pl.read_csv(
    'data/ss_2008_census_data_raw.csv', 
    null_values='NA'
)

# Inspect the first 5 rows
print(census_raw.head(5))

shape: (5, 10)
┌────────┬─────────────┬───────────────────┬──────────┬───┬──────────┬───────┬─────────┬────────┐
│ Region ┆ Region Name ┆ Region - RegionId ┆ Variable ┆ … ┆ Age Name ┆ Scale ┆ Units   ┆ 2008   │
│ ---    ┆ ---         ┆ ---               ┆ ---      ┆   ┆ ---      ┆ ---   ┆ ---     ┆ ---    │
│ str    ┆ str         ┆ str               ┆ str      ┆   ┆ str      ┆ str   ┆ str     ┆ i64    │
╞════════╪═════════════╪═══════════════════╪══════════╪═══╪══════════╪═══════╪═════════╪════════╡
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ Total    ┆ units ┆ Persons ┆ 964353 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 0 to 4   ┆ units ┆ Persons ┆ 150872 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 5 to 9   ┆ units ┆ Persons ┆ 151467 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 10 to 14 ┆ units ┆ Persons ┆ 126140 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 15 to 19 ┆ units ┆ Persons ┆ 103804 │
└────────┴─────────────┴───────────────────┴──────────┴───┴──────────┴───────┴─────────┴────────┘

## Importing the Data Lazily


In [3]:
# Load the dataset lazily
census_lazy = pl.scan_csv(
    'data/ss_2008_census_data_raw.csv', 
    null_values='NA'
)

# Inspect the first 5 rows
print(census_lazy.collect().head(5))

shape: (5, 10)
┌────────┬─────────────┬───────────────────┬──────────┬───┬──────────┬───────┬─────────┬────────┐
│ Region ┆ Region Name ┆ Region - RegionId ┆ Variable ┆ … ┆ Age Name ┆ Scale ┆ Units   ┆ 2008   │
│ ---    ┆ ---         ┆ ---               ┆ ---      ┆   ┆ ---      ┆ ---   ┆ ---     ┆ ---    │
│ str    ┆ str         ┆ str               ┆ str      ┆   ┆ str      ┆ str   ┆ str     ┆ i64    │
╞════════╪═════════════╪═══════════════════╪══════════╪═══╪══════════╪═══════╪═════════╪════════╡
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ Total    ┆ units ┆ Persons ┆ 964353 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 0 to 4   ┆ units ┆ Persons ┆ 150872 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 5 to 9   ┆ units ┆ Persons ┆ 151467 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 10 to 14 ┆ units ┆ Persons ┆ 126140 │
│ KN.A2  ┆ Upper Nile  ┆ SS-NU             ┆ KN.B2    ┆ … ┆ 15 to 19 ┆ units ┆ Persons ┆ 103804 │
└────────┴─────────────┴───────────────────┴──────────┴───┴──────────┴───────┴─────────┴────────┘

In [4]:
# Inspect the last 5 rows
print(census_raw.tail(5))

shape: (5, 10)
┌───────────────┬────────────────────┬──────────┬──────────┬───┬──────────┬───────┬─────────┬──────┐
│ Region        ┆ Region Name        ┆ Region - ┆ Variable ┆ … ┆ Age Name ┆ Scale ┆ Units   ┆ 2008 │
│ ---           ┆ ---                ┆ RegionId ┆ ---      ┆   ┆ ---      ┆ ---   ┆ ---     ┆ ---  │
│ str           ┆ str                ┆ ---      ┆ str      ┆   ┆ str      ┆ str   ┆ str     ┆ i64  │
│               ┆                    ┆ str      ┆          ┆   ┆          ┆       ┆         ┆      │
╞═══════════════╪════════════════════╪══════════╪══════════╪═══╪══════════╪═══════╪═════════╪══════╡
│ KN.A11        ┆ Eastern Equatoria  ┆ SS-EE    ┆ KN.B8    ┆ … ┆ 60 to 64 ┆ units ┆ Persons ┆ 5274 │
│ KN.A11        ┆ Eastern Equatoria  ┆ SS-EE    ┆ KN.B8    ┆ … ┆ 65+      ┆ units ┆ Persons ┆ 8637 │
│ null          ┆ null               ┆ null     ┆ null     ┆ … ┆ null     ┆ null  ┆ null    ┆ null │
│ Source:       ┆ National Bureau of ┆ null     ┆ null     ┆ … ┆ null     ┆ 

In [5]:
# Pull out a column and inspect its last 10 values;
# list() formats the output as a plain Python list
list(
    census_raw.get_column('Region Name')
    .tail(10)
)

['Eastern Equatoria',
 'Eastern Equatoria',
 'Eastern Equatoria',
 'Eastern Equatoria',
 'Eastern Equatoria',
 'Eastern Equatoria',
 'Eastern Equatoria',
 None,
 'National Bureau of Statistics, South Sudan',
 'http://southsudan.opendataforafrica.org/fvjqdpe/population-by-age-and-sex-2008-south-sudan']

In [6]:
# Inspect the sorted unique values of a column;
# list() formats the output as a plain Python list
list(
    census_raw
    .get_column('Region Name')
    .unique()
    .sort()
)

[None,
 'Central Equatoria',
 'Eastern Equatoria',
 'Jonglei',
 'Lakes',
 'National Bureau of Statistics, South Sudan',
 'Northern Bahr el Ghazal',
 'Unity',
 'Upper Nile',
 'Warrap',
 'Western Bahr el Ghazal',
 'Western Equatoria',
 'http://southsudan.opendataforafrica.org/fvjqdpe/population-by-age-and-sex-2008-south-sudan']

In [7]:
# Inspect 5 random rows (seed set for reproducibility)
print(census_raw.sample(5, seed=254, with_replacement=True))

shape: (5, 10)
┌────────┬─────────────┬───────────────────┬──────────┬───┬──────────┬───────┬─────────┬───────┐
│ Region ┆ Region Name ┆ Region - RegionId ┆ Variable ┆ … ┆ Age Name ┆ Scale ┆ Units   ┆ 2008  │
│ ---    ┆ ---         ┆ ---               ┆ ---      ┆   ┆ ---      ┆ ---   ┆ ---     ┆ ---   │
│ str    ┆ str         ┆ str               ┆ str      ┆   ┆ str      ┆ str   ┆ str     ┆ i64   │
╞════════╪═════════════╪═══════════════════╪══════════╪═══╪══════════╪═══════╪═════════╪═══════╡
│ KN.A11 ┆ Eastern     ┆ SS-EE             ┆ KN.B2    ┆ … ┆ 30 to 34 ┆ units ┆ Persons ┆ 57187 │
│        ┆ Equatoria   ┆                   ┆          ┆   ┆          ┆       ┆         ┆       │
│ KN.A8  ┆ Lakes       ┆ SS-LK             ┆ KN.B5    ┆ … ┆ 65+      ┆ units ┆ Persons ┆ 10100 │
│ KN.A9  ┆ Western     ┆ SS-EW             ┆ KN.B2    ┆ … ┆ 30 to 34 ┆ units ┆ Persons ┆ 42527 │
│        ┆ Equatoria   ┆                   ┆          ┆   ┆          ┆       ┆         ┆       │
│ KN.A10 ┆ Cent

## Checking for Missing Values


In [8]:
# Inspect the dataset for missing values using .null_count()
print(census_raw.null_count())

shape: (1, 10)
┌────────┬─────────────┬───────────────────┬──────────┬───┬──────────┬───────┬───────┬──────┐
│ Region ┆ Region Name ┆ Region - RegionId ┆ Variable ┆ … ┆ Age Name ┆ Scale ┆ Units ┆ 2008 │
│ ---    ┆ ---         ┆ ---               ┆ ---      ┆   ┆ ---      ┆ ---   ┆ ---   ┆ ---  │
│ u32    ┆ u32         ┆ u32               ┆ u32      ┆   ┆ u32      ┆ u32   ┆ u32   ┆ u32  │
╞════════╪═════════════╪═══════════════════╪══════════╪═══╪══════════╪═══════╪═══════╪══════╡
│ 1      ┆ 1           ┆ 3                 ┆ 3        ┆ … ┆ 3        ┆ 3     ┆ 3     ┆ 3    │
└────────┴─────────────┴───────────────────┴──────────┴───┴──────────┴───────┴───────┴──────┘


In [9]:
# Inspect the dataset for missing values with the help of the polars selectors
print(
    census_raw
    .select(cs.all().is_null().sum())
)

shape: (1, 10)
┌────────┬─────────────┬───────────────────┬──────────┬───┬──────────┬───────┬───────┬──────┐
│ Region ┆ Region Name ┆ Region - RegionId ┆ Variable ┆ … ┆ Age Name ┆ Scale ┆ Units ┆ 2008 │
│ ---    ┆ ---         ┆ ---               ┆ ---      ┆   ┆ ---      ┆ ---   ┆ ---   ┆ ---  │
│ u32    ┆ u32         ┆ u32               ┆ u32      ┆   ┆ u32      ┆ u32   ┆ u32   ┆ u32  │
╞════════╪═════════════╪═══════════════════╪══════════╪═══╪══════════╪═══════╪═══════╪══════╡
│ 1      ┆ 1           ┆ 3                 ┆ 3        ┆ … ┆ 3        ┆ 3     ┆ 3     ┆ 3    │
└────────┴─────────────┴───────────────────┴──────────┴───┴──────────┴───────┴───────┴──────┘


## Selecting Columns of Interest

In [10]:
# Selecting columns of interest: 
# polars provides various methods for selecting columns
print(
    census_raw
    .select(cs.ends_with('Name'), '2008')
    .columns
)

['Region Name', 'Variable Name', 'Age Name', '2008']


In [11]:
# Selecting multiple columns - pass in a list of column names
print(
    census_raw
    .select(['Region Name', 'Variable Name', 'Age Name', '2008'])
)

shape: (453, 4)
┌─────────────────────────────────┬─────────────────────────────┬──────────┬────────┐
│ Region Name                     ┆ Variable Name               ┆ Age Name ┆ 2008   │
│ ---                             ┆ ---                         ┆ ---      ┆ ---    │
│ str                             ┆ str                         ┆ str      ┆ i64    │
╞═════════════════════════════════╪═════════════════════════════╪══════════╪════════╡
│ Upper Nile                      ┆ Population, Total (Number)  ┆ Total    ┆ 964353 │
│ Upper Nile                      ┆ Population, Total (Number)  ┆ 0 to 4   ┆ 150872 │
│ Upper Nile                      ┆ Population, Total (Number)  ┆ 5 to 9   ┆ 151467 │
│ Upper Nile                      ┆ Population, Total (Number)  ┆ 10 to 14 ┆ 126140 │
│ Upper Nile                      ┆ Population, Total (Number)  ┆ 15 to 19 ┆ 103804 │
│ …                               ┆ …                           ┆ …        ┆ …      │
│ Eastern Equatoria               ┆ Po

In [12]:
# Selecting multiple columns
print(
    census_raw
    .select('Region Name', 'Variable Name', 'Age Name', '2008')
)

shape: (453, 4)
┌─────────────────────────────────┬─────────────────────────────┬──────────┬────────┐
│ Region Name                     ┆ Variable Name               ┆ Age Name ┆ 2008   │
│ ---                             ┆ ---                         ┆ ---      ┆ ---    │
│ str                             ┆ str                         ┆ str      ┆ i64    │
╞═════════════════════════════════╪═════════════════════════════╪══════════╪════════╡
│ Upper Nile                      ┆ Population, Total (Number)  ┆ Total    ┆ 964353 │
│ Upper Nile                      ┆ Population, Total (Number)  ┆ 0 to 4   ┆ 150872 │
│ Upper Nile                      ┆ Population, Total (Number)  ┆ 5 to 9   ┆ 151467 │
│ Upper Nile                      ┆ Population, Total (Number)  ┆ 10 to 14 ┆ 126140 │
│ Upper Nile                      ┆ Population, Total (Number)  ┆ 15 to 19 ┆ 103804 │
│ …                               ┆ …                           ┆ …        ┆ …      │
│ Eastern Equatoria               ┆ Po

# Handling String Data in Columns

In [13]:
# Splitting string column values
(
    census_raw
    .select(pl.col('Variable Name').str.split(' '))
    .head(5)
)

Variable Name
list[str]
"[""Population,"", ""Total"", ""(Number)""]"
"[""Population,"", ""Total"", ""(Number)""]"
"[""Population,"", ""Total"", ""(Number)""]"
"[""Population,"", ""Total"", ""(Number)""]"
"[""Population,"", ""Total"", ""(Number)""]"


In [14]:
# Next, we will keep only the middle piece
(
    census_raw
    .select(pl.col('Variable Name').str.split(' ').list.get(1))
    .head(5)
)

Variable Name
str
"""Total"""
"""Total"""
"""Total"""
"""Total"""
"""Total"""


In [15]:
# Adding a new column with .with_columns(); existing columns are left unchanged
(
    census_raw
    .select(['Region Name', 'Variable Name', 'Age Name', '2008'])
    .with_columns(gender=pl.col('Variable Name').str.split(' ').list.get(1))
    .head(5)
)

Region Name,Variable Name,Age Name,2008,gender
str,str,str,i64,str
"""Upper Nile""","""Population, Total (Number)""","""Total""",964353,"""Total"""
"""Upper Nile""","""Population, Total (Number)""","""0 to 4""",150872,"""Total"""
"""Upper Nile""","""Population, Total (Number)""","""5 to 9""",151467,"""Total"""
"""Upper Nile""","""Population, Total (Number)""","""10 to 14""",126140,"""Total"""
"""Upper Nile""","""Population, Total (Number)""","""15 to 19""",103804,"""Total"""


In [16]:
# Using index access to get the desired element
(
    census_raw
    .select(['Region Name', 'Variable Name', 'Age Name', '2008'])
    .with_columns(gender=pl.col('Variable Name').str.split(' ').list[1])
    .head(5)
)

Region Name,Variable Name,Age Name,2008,gender
str,str,str,i64,str
"""Upper Nile""","""Population, Total (Number)""","""Total""",964353,"""Total"""
"""Upper Nile""","""Population, Total (Number)""","""0 to 4""",150872,"""Total"""
"""Upper Nile""","""Population, Total (Number)""","""5 to 9""",151467,"""Total"""
"""Upper Nile""","""Population, Total (Number)""","""10 to 14""",126140,"""Total"""
"""Upper Nile""","""Population, Total (Number)""","""15 to 19""",103804,"""Total"""


## Replacing String Patterns

In [17]:
# Replacing a pattern with blank
(
    census_raw
    .select(pl.col('Variable Name').str.replace('Population,', ''))
    .head(5)
)

Variable Name
str
""" Total (Number)"""
""" Total (Number)"""
""" Total (Number)"""
""" Total (Number)"""
""" Total (Number)"""


In [18]:
# Replacing a pattern with blank
(
    census_raw
    .select(['Region Name', 'Variable Name', 'Age Name', '2008'])
    .with_columns(gender=pl.col('Variable Name').str.replace('Population,', ''))
    .with_columns(gender=pl.col('gender').str.replace('\\(Number\\)', ''))
    .head(5)
)

Region Name,Variable Name,Age Name,2008,gender
str,str,str,i64,str
"""Upper Nile""","""Population, Total (Number)""","""Total""",964353,""" Total """
"""Upper Nile""","""Population, Total (Number)""","""0 to 4""",150872,""" Total """
"""Upper Nile""","""Population, Total (Number)""","""5 to 9""",151467,""" Total """
"""Upper Nile""","""Population, Total (Number)""","""10 to 14""",126140,""" Total """
"""Upper Nile""","""Population, Total (Number)""","""15 to 19""",103804,""" Total """


In [19]:
# Using str.strip_chars() to remove leading and trailing whitespace
(
    census_raw
    .select(['Region Name', 'Variable Name', 'Age Name', '2008'])
    .with_columns(gender=pl.col('Variable Name').str.replace('Population,', ''))
    .with_columns(gender=pl.col('gender').str.replace('\\(Number\\)', ''))
    .with_columns(gender=pl.col('gender').str.strip_chars())
    .head(5)
)

Region Name,Variable Name,Age Name,2008,gender
str,str,str,i64,str
"""Upper Nile""","""Population, Total (Number)""","""Total""",964353,"""Total"""
"""Upper Nile""","""Population, Total (Number)""","""0 to 4""",150872,"""Total"""
"""Upper Nile""","""Population, Total (Number)""","""5 to 9""",151467,"""Total"""
"""Upper Nile""","""Population, Total (Number)""","""10 to 14""",126140,"""Total"""
"""Upper Nile""","""Population, Total (Number)""","""15 to 19""",103804,"""Total"""


In [20]:
# Using str.replace_many()
(
    census_raw
    .select(['Region Name', 'Variable Name', 'Age Name', '2008'])
    .with_columns(
        gender=pl.col('Variable Name')
        .str.replace_many(['Population, ', ' (Number)'], '',)
    )
    .head()
)

Region Name,Variable Name,Age Name,2008,gender
str,str,str,i64,str
"""Upper Nile""","""Population, Total (Number)""","""Total""",964353,"""Total"""
"""Upper Nile""","""Population, Total (Number)""","""0 to 4""",150872,"""Total"""
"""Upper Nile""","""Population, Total (Number)""","""5 to 9""",151467,"""Total"""
"""Upper Nile""","""Population, Total (Number)""","""10 to 14""",126140,"""Total"""
"""Upper Nile""","""Population, Total (Number)""","""15 to 19""",103804,"""Total"""


In [21]:
# Modifying the Age Name column
# Note: str.replace_many() replaces substrings, so a short pattern such as
# '0 to 4' can also match inside a longer label such as '40 to 44'
old_cats = ['0 to 4', '5 to 9', '10 to 14', 
            '15 to 19', '20 to 24', 
            '25 to 29', '30 to 34', 
            '35 to 39', '40 to 44',
            '45 to 49', '50 to 54',
            '55 to 59', '60 to 64',
            '65+'
            ]
new_cats = ['0-14', '0-14', '0-14',
            '15-24', '15-24',
            '25-34', '25-34',
            '35-44', '35-44',
            '45-54', '45-54',
            '55-64', '55-64',
            '65 and above'
            ]

(
    census_raw
    .select(['Region Name', 'Variable Name', 'Age Name', '2008'])
    .with_columns(category=pl.col('Age Name').str.replace_many(old_cats, new_cats))
    .head(5)
)

Region Name,Variable Name,Age Name,2008,category
str,str,str,i64,str
"""Upper Nile""","""Population, Total (Number)""","""Total""",964353,"""Total"""
"""Upper Nile""","""Population, Total (Number)""","""0 to 4""",150872,"""0-14"""
"""Upper Nile""","""Population, Total (Number)""","""5 to 9""",151467,"""0-14"""
"""Upper Nile""","""Population, Total (Number)""","""10 to 14""",126140,"""0-14"""
"""Upper Nile""","""Population, Total (Number)""","""15 to 19""",103804,"""15-24"""


In [22]:
# Using the .replace()
old_cats = ['0 to 4', '5 to 9', '10 to 14', 
            '15 to 19', '20 to 24', 
            '25 to 29', '30 to 34', 
            '35 to 39', '40 to 44',
            '45 to 49', '50 to 54',
            '55 to 59', '60 to 64',
            '65+'
            ]
new_cats = ['0-14', '0-14', '0-14',
            '15-24', '15-24',
            '25-34', '25-34',
            '35-44', '35-44',
            '45-54', '45-54',
            '55-64', '55-64',
            '65 and above'
            ]
(
    census_raw
    .select(['Region Name', 'Variable Name', 'Age Name', '2008'])
    .with_columns(category=pl.col('Age Name').replace(old_cats, new_cats))
    .head(5)
)

Region Name,Variable Name,Age Name,2008,category
str,str,str,i64,str
"""Upper Nile""","""Population, Total (Number)""","""Total""",964353,"""Total"""
"""Upper Nile""","""Population, Total (Number)""","""0 to 4""",150872,"""0-14"""
"""Upper Nile""","""Population, Total (Number)""","""5 to 9""",151467,"""0-14"""
"""Upper Nile""","""Population, Total (Number)""","""10 to 14""",126140,"""0-14"""
"""Upper Nile""","""Population, Total (Number)""","""15 to 19""",103804,"""15-24"""


In [23]:
# Using a dictionary that maps old labels to new ones
age_mapping = {
    '0 to 4': '0-14',
    '5 to 9': '0-14',
    '10 to 14': '0-14',
    '15 to 19': '15-24',
    '20 to 24': '15-24',
    '25 to 29': '25-34',
    '30 to 34': '25-34',
    '35 to 39': '35-44',
    '40 to 44': '35-44',
    '45 to 49': '45-54',
    '50 to 54': '45-54',
    '55 to 59': '55-64',
    '60 to 64': '55-64',
    '65+': '65 and above',
}
(
    census_raw
    .select(['Region Name', 'Variable Name', 'Age Name', '2008'])
    .with_columns(category=pl.col("Age Name").replace(age_mapping))
    .head(5)
)

Region Name,Variable Name,Age Name,2008,category
str,str,str,i64,str
"""Upper Nile""","""Population, Total (Number)""","""Total""",964353,"""Total"""
"""Upper Nile""","""Population, Total (Number)""","""0 to 4""",150872,"""0-14"""
"""Upper Nile""","""Population, Total (Number)""","""5 to 9""",151467,"""0-14"""
"""Upper Nile""","""Population, Total (Number)""","""10 to 14""",126140,"""0-14"""
"""Upper Nile""","""Population, Total (Number)""","""15 to 19""",103804,"""15-24"""


## Filtering: Selecting Rows

In [24]:
# Filtering data: removing unwanted rows
(
    census_raw
    .select(['Region Name', 'Variable Name', 'Age Name', '2008'])
    .filter(pl.col('Age Name') == 'Total')
    .head(5)
)


Region Name,Variable Name,Age Name,2008
str,str,str,i64
"""Upper Nile""","""Population, Total (Number)""","""Total""",964353
"""Upper Nile""","""Population, Male (Number)""","""Total""",525430
"""Upper Nile""","""Population, Female (Number)""","""Total""",438923
"""Jonglei""","""Population, Total (Number)""","""Total""",1358602
"""Jonglei""","""Population, Male (Number)""","""Total""",734327


In [25]:
# Filtering data, then drawing a random sample of 5 rows
(
    census_raw
    .select(["Region Name", "Variable Name", "Age Name", "2008"])
    .filter(pl.col('Age Name') == 'Total')
    .sample(5)
)

Region Name,Variable Name,Age Name,2008
str,str,str,i64
"""Central Equatoria""","""Population, Female (Number)""","""Total""",521835
"""Upper Nile""","""Population, Female (Number)""","""Total""",438923
"""Upper Nile""","""Population, Male (Number)""","""Total""",525430
"""Western Bahr el Ghazal""","""Population, Female (Number)""","""Total""",156391
"""Western Equatoria""","""Population, Female (Number)""","""Total""",300586


In [26]:
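# Multiple conditions: combine with | (or), wrapping each condition in parentheses
# (only the first condition matches here, because 'Variable Name' holds
# values such as 'Population, Total (Number)', never just 'Total')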
(
    census_raw
    .select(['Region Name', 'Variable Name', 'Age Name', '2008'])
    .filter((pl.col('Age Name') == 'Total') | (pl.col('Variable Name') == 'Total'))
    .sample(n=5, seed=254)
)

Region Name,Variable Name,Age Name,2008
str,str,str,i64
"""Western Equatoria""","""Population, Female (Number)""","""Total""",300586
"""Lakes""","""Population, Male (Number)""","""Total""",365880
"""Central Equatoria""","""Population, Total (Number)""","""Total""",1103557
"""Western Equatoria""","""Population, Male (Number)""","""Total""",318443
"""Warrap""","""Population, Female (Number)""","""Total""",502194


In [27]:
# Selecting desired rows
old_cats = ['0 to 4', '5 to 9', '10 to 14', 
            '15 to 19', '20 to 24', 
            '25 to 29', '30 to 34', 
            '35 to 39', '40 to 44',
            '45 to 49', '50 to 54',
            '55 to 59', '60 to 64',
            '65+'
            ]
new_cats = ['0-14', '0-14', '0-14',
            '15-24', '15-24',
            '25-34', '25-34',
            '35-44', '35-44',
            '45-54', '45-54',
            '55-64', '55-64',
            '65 and above'
            ]
print(
    census_raw
    .select(cs.ends_with('Name'), '2008')
    .with_columns(
        gender=pl.col('Variable Name').str.replace_many(['Population, ', ' (Number)'], '',),
        category=pl.col('Age Name').replace(old_cats, new_cats)
    )
    .select(pl.col('*').exclude(['Variable Name', 'Age Name']))
    .filter(~((pl.col('gender') == 'Total') | (pl.col('category') == 'Total')))
)

shape: (280, 4)
┌───────────────────┬───────┬────────┬──────────────┐
│ Region Name       ┆ 2008  ┆ gender ┆ category     │
│ ---               ┆ ---   ┆ ---    ┆ ---          │
│ str               ┆ i64   ┆ str    ┆ str          │
╞═══════════════════╪═══════╪════════╪══════════════╡
│ Upper Nile        ┆ 82690 ┆ Male   ┆ 0-14         │
│ Upper Nile        ┆ 83744 ┆ Male   ┆ 0-14         │
│ Upper Nile        ┆ 71027 ┆ Male   ┆ 0-14         │
│ Upper Nile        ┆ 57387 ┆ Male   ┆ 15-24        │
│ Upper Nile        ┆ 42521 ┆ Male   ┆ 15-24        │
│ …                 ┆ …     ┆ …      ┆ …            │
│ Eastern Equatoria ┆ 13727 ┆ Female ┆ 45-54        │
│ Eastern Equatoria ┆ 9482  ┆ Female ┆ 45-54        │
│ Eastern Equatoria ┆ 5740  ┆ Female ┆ 55-64        │
│ Eastern Equatoria ┆ 5274  ┆ Female ┆ 55-64        │
│ Eastern Equatoria ┆ 8637  ┆ Female ┆ 65 and above │
└───────────────────┴───────┴────────┴──────────────┘


In [28]:
# We could have accomplished the above this way
(
    census_raw
    .select(cs.ends_with('Name'), '2008')
    .with_columns(
        gender=pl.col('Variable Name').str.replace_many(['Population, ', ' (Number)'], '',),
        category=pl.col('Age Name').replace(old_cats, new_cats)
    )
    .select(pl.col('*').exclude(['Variable Name', 'Age Name']))
    .filter(((pl.col('gender') != 'Total') & (pl.col('category') != 'Total')))
    # .sample(n=5, seed=254)
)

Region Name,2008,gender,category
str,i64,str,str
"""Upper Nile""",82690,"""Male""","""0-14"""
"""Upper Nile""",83744,"""Male""","""0-14"""
"""Upper Nile""",71027,"""Male""","""0-14"""
"""Upper Nile""",57387,"""Male""","""15-24"""
"""Upper Nile""",42521,"""Male""","""15-24"""
…,…,…,…
"""Eastern Equatoria""",13727,"""Female""","""45-54"""
"""Eastern Equatoria""",9482,"""Female""","""45-54"""
"""Eastern Equatoria""",5740,"""Female""","""55-64"""
"""Eastern Equatoria""",5274,"""Female""","""55-64"""


In [29]:
# Extracting rows with missing values
(
    census_raw
    .select(['Region Name', 'Variable Name', 'Age Name', '2008'])
    .filter(pl.col('Age Name').is_null())
)

Region Name,Variable Name,Age Name,2008
str,str,str,i64
,,,
"""National Bureau of Statistics,…",,,
"""http://southsudan.opendatafora…",,,


In [30]:
# Removing rows with missing values - by negating the condition
(
    census_raw
    .select(['Region Name', 'Variable Name', 'Age Name', '2008'])
    .filter(~pl.col('Age Name').is_null())
)

Region Name,Variable Name,Age Name,2008
str,str,str,i64
"""Upper Nile""","""Population, Total (Number)""","""Total""",964353
"""Upper Nile""","""Population, Total (Number)""","""0 to 4""",150872
"""Upper Nile""","""Population, Total (Number)""","""5 to 9""",151467
"""Upper Nile""","""Population, Total (Number)""","""10 to 14""",126140
"""Upper Nile""","""Population, Total (Number)""","""15 to 19""",103804
…,…,…,…
"""Eastern Equatoria""","""Population, Female (Number)""","""45 to 49""",13727
"""Eastern Equatoria""","""Population, Female (Number)""","""50 to 54""",9482
"""Eastern Equatoria""","""Population, Female (Number)""","""55 to 59""",5740
"""Eastern Equatoria""","""Population, Female (Number)""","""60 to 64""",5274


In [31]:
# Removing rows with missing values - using .is_not_null() method
(
    census_raw
    .select(['Region Name', 'Variable Name', 'Age Name', '2008'])
    .filter(pl.col('Age Name').is_not_null())
)

Region Name,Variable Name,Age Name,2008
str,str,str,i64
"""Upper Nile""","""Population, Total (Number)""","""Total""",964353
"""Upper Nile""","""Population, Total (Number)""","""0 to 4""",150872
"""Upper Nile""","""Population, Total (Number)""","""5 to 9""",151467
"""Upper Nile""","""Population, Total (Number)""","""10 to 14""",126140
"""Upper Nile""","""Population, Total (Number)""","""15 to 19""",103804
…,…,…,…
"""Eastern Equatoria""","""Population, Female (Number)""","""45 to 49""",13727
"""Eastern Equatoria""","""Population, Female (Number)""","""50 to 54""",9482
"""Eastern Equatoria""","""Population, Female (Number)""","""55 to 59""",5740
"""Eastern Equatoria""","""Population, Female (Number)""","""60 to 64""",5274


In [32]:
# Removing rows with missing values - using the .drop_nulls() method
# (drops a row if any of the selected columns is null)
(
    census_raw
    .select(['Region Name', 'Variable Name', 'Age Name', '2008'])
    .drop_nulls()
)

Region Name,Variable Name,Age Name,2008
str,str,str,i64
"""Upper Nile""","""Population, Total (Number)""","""Total""",964353
"""Upper Nile""","""Population, Total (Number)""","""0 to 4""",150872
"""Upper Nile""","""Population, Total (Number)""","""5 to 9""",151467
"""Upper Nile""","""Population, Total (Number)""","""10 to 14""",126140
"""Upper Nile""","""Population, Total (Number)""","""15 to 19""",103804
…,…,…,…
"""Eastern Equatoria""","""Population, Female (Number)""","""45 to 49""",13727
"""Eastern Equatoria""","""Population, Female (Number)""","""50 to 54""",9482
"""Eastern Equatoria""","""Population, Female (Number)""","""55 to 59""",5740
"""Eastern Equatoria""","""Population, Female (Number)""","""60 to 64""",5274


## Using the .when() Method
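We can also use the `pl.when()` function to relabel or replace row values conditionally. The `.when().then()` pairs are checked in order, and `.otherwise()` supplies the value for rows that match none of them. This code snippet illustrates how to add a new column, `former_region`, this way.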

In [33]:
# Relabeling string values with .when() method
print(
    census_raw
    .select(['Region Name', 'Variable Name', 'Age Name', '2008'])
    .drop_nulls()
    .with_columns(former_region=pl.when(pl.col('Region Name').is_in(['Upper Nile', 'Unity', 'Jonglei']))
                                .then(pl.lit('Greater Upper Nile'))
                                .when(pl.col('Region Name').str.ends_with('Equatoria'))
                                .then(pl.lit('Greater Equatoria'))
                                .when(pl.col('Region Name').str.contains('Ghazal'))
                                .then(pl.lit('Greater Bahr el Ghazal'))
                                .otherwise(pl.lit('Greater Bahr el Ghazal'))
    )
)

shape: (450, 5)
┌───────────────────┬─────────────────────────────┬──────────┬────────┬────────────────────┐
│ Region Name       ┆ Variable Name               ┆ Age Name ┆ 2008   ┆ former_region      │
│ ---               ┆ ---                         ┆ ---      ┆ ---    ┆ ---                │
│ str               ┆ str                         ┆ str      ┆ i64    ┆ str                │
╞═══════════════════╪═════════════════════════════╪══════════╪════════╪════════════════════╡
│ Upper Nile        ┆ Population, Total (Number)  ┆ Total    ┆ 964353 ┆ Greater Upper Nile │
│ Upper Nile        ┆ Population, Total (Number)  ┆ 0 to 4   ┆ 150872 ┆ Greater Upper Nile │
│ Upper Nile        ┆ Population, Total (Number)  ┆ 5 to 9   ┆ 151467 ┆ Greater Upper Nile │
│ Upper Nile        ┆ Population, Total (Number)  ┆ 10 to 14 ┆ 126140 ┆ Greater Upper Nile │
│ Upper Nile        ┆ Population, Total (Number)  ┆ 15 to 19 ┆ 103804 ┆ Greater Upper Nile │
│ …                 ┆ …                           ┆ … 

## Renaming Columns

In [34]:
# Renaming columns
print(
    census_raw
    .select(['Region Name', 'Variable Name', 'Age Name', '2008'])
    .drop_nulls()
    .with_columns(former_region=pl.when(pl.col('Region Name').is_in(['Upper Nile', 'Unity', 'Jonglei']))
                                .then(pl.lit('Greater Upper Nile'))
                                .when(pl.col('Region Name').str.ends_with('Equatoria'))
                                .then(pl.lit('Greater Equatoria'))
                                .when(pl.col('Region Name').str.contains('Ghazal'))
                                .then(pl.lit('Greater Bahr el Ghazal'))
                                .otherwise(pl.lit('Greater Bahr el Ghazal'))
    )
    .rename({'Region Name': 'state',
            'Variable Name': 'gender',
            'Age Name': 'age_category',
            '2008': 'population',
    })
)

shape: (450, 5)
┌───────────────────┬─────────────────────────────┬──────────────┬────────────┬────────────────────┐
│ state             ┆ gender                      ┆ age_category ┆ population ┆ former_region      │
│ ---               ┆ ---                         ┆ ---          ┆ ---        ┆ ---                │
│ str               ┆ str                         ┆ str          ┆ i64        ┆ str                │
╞═══════════════════╪═════════════════════════════╪══════════════╪════════════╪════════════════════╡
│ Upper Nile        ┆ Population, Total (Number)  ┆ Total        ┆ 964353     ┆ Greater Upper Nile │
│ Upper Nile        ┆ Population, Total (Number)  ┆ 0 to 4       ┆ 150872     ┆ Greater Upper Nile │
│ Upper Nile        ┆ Population, Total (Number)  ┆ 5 to 9       ┆ 151467     ┆ Greater Upper Nile │
│ Upper Nile        ┆ Population, Total (Number)  ┆ 10 to 14     ┆ 126140     ┆ Greater Upper Nile │
│ Upper Nile        ┆ Population, Total (Number)  ┆ 15 to 19     ┆ 103804  

In [35]:
# Transforming column names
print(
    census_raw
    .select(['Region Name', 'Variable Name', 'Age Name', '2008'])
    .drop_nulls()
    .rename(lambda c: c.lower().replace(' ', '_'))
)

shape: (450, 4)
┌───────────────────┬─────────────────────────────┬──────────┬────────┐
│ region_name       ┆ variable_name               ┆ age_name ┆ 2008   │
│ ---               ┆ ---                         ┆ ---      ┆ ---    │
│ str               ┆ str                         ┆ str      ┆ i64    │
╞═══════════════════╪═════════════════════════════╪══════════╪════════╡
│ Upper Nile        ┆ Population, Total (Number)  ┆ Total    ┆ 964353 │
│ Upper Nile        ┆ Population, Total (Number)  ┆ 0 to 4   ┆ 150872 │
│ Upper Nile        ┆ Population, Total (Number)  ┆ 5 to 9   ┆ 151467 │
│ Upper Nile        ┆ Population, Total (Number)  ┆ 10 to 14 ┆ 126140 │
│ Upper Nile        ┆ Population, Total (Number)  ┆ 15 to 19 ┆ 103804 │
│ …                 ┆ …                           ┆ …        ┆ …      │
│ Eastern Equatoria ┆ Population, Female (Number) ┆ 45 to 49 ┆ 13727  │
│ Eastern Equatoria ┆ Population, Female (Number) ┆ 50 to 54 ┆ 9482   │
│ Eastern Equatoria ┆ Population, Female (Number

# Putting It Together
In the upcoming code snippets, I will consolidate all the steps and elaborate on the methods I have not yet covered.

In [36]:
# Selecting desired rows
old_cats = ['0 to 4', '5 to 9', '10 to 14', 
            '15 to 19', '20 to 24', 
            '25 to 29', '30 to 34', 
            '35 to 39', '40 to 44',
            '45 to 49', '50 to 54',
            '55 to 59', '60 to 64',
            '65+'
            ]
new_cats = ['0-14', '0-14', '0-14',
            '15-24', '15-24',
            '25-34', '25-34',
            '35-44', '35-44',
            '45-54', '45-54',
            '55-64', '55-64',
            '65 and above'
            ]
print(
    census_raw
    .select(cs.ends_with('Name'), '2008')
    .with_columns(
        gender=pl.col('Variable Name').str.replace_many(['Population, ', ' (Number)'], ''),
        category=pl.col('Age Name').replace(old_cats, new_cats)
    )
    .select(pl.col('*').exclude(['Variable Name', 'Age Name']))
    .filter(~((pl.col('gender') == 'Total') | (pl.col('category') == 'Total')))
    .drop_nulls()
    .rename({'Region Name': 'state', '2008': 'population'})
    .with_columns(former_region=pl.when(pl.col('state').is_in(['Upper Nile', 'Unity', 'Jonglei']))
                                .then(pl.lit('Greater Upper Nile'))
                                .when(pl.col('state').str.ends_with('Equatoria'))
                                .then(pl.lit('Greater Equatoria'))
                                .when(pl.col('state').str.contains('Ghazal'))
                                .then(pl.lit('Greater Bahr el Ghazal'))
                                .otherwise(pl.lit('Greater Bahr el Ghazal'))
    )
    .group_by(['former_region', 'state', 'gender', 'category'])
    .agg(total=pl.col('population').sum())
    .sort('total', descending=True)
)


shape: (140, 5)
┌────────────────────────┬────────────────────────┬────────┬──────────────┬────────┐
│ former_region          ┆ state                  ┆ gender ┆ category     ┆ total  │
│ ---                    ┆ ---                    ┆ ---    ┆ ---          ┆ ---    │
│ str                    ┆ str                    ┆ str    ┆ str          ┆ i64    │
╞════════════════════════╪════════════════════════╪════════╪══════════════╪════════╡
│ Greater Upper Nile     ┆ Jonglei                ┆ Male   ┆ 0-14         ┆ 338443 │
│ Greater Upper Nile     ┆ Jonglei                ┆ Female ┆ 0-14         ┆ 263646 │
│ Greater Equatoria      ┆ Central Equatoria      ┆ Male   ┆ 0-14         ┆ 242247 │
│ Greater Upper Nile     ┆ Upper Nile             ┆ Male   ┆ 0-14         ┆ 237461 │
│ Greater Bahr el Ghazal ┆ Warrap                 ┆ Male   ┆ 0-14         ┆ 230854 │
│ …                      ┆ …                      ┆ …      ┆ …            ┆ …      │
│ Greater Bahr el Ghazal ┆ Lakes                 
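
Because `replace(old_cats, new_cats)` pairs the two lists by position, a misaligned entry would silently assign the wrong age band. The pure-Python sketch below makes the pairing explicit, assuming the same two lists as in the cell above. It also confirms the row count: the printed `state` table has 10 states, and there are 2 genders, so the 7 merged bands account for the 140 rows in the output.

```python
old_cats = ['0 to 4', '5 to 9', '10 to 14',
            '15 to 19', '20 to 24',
            '25 to 29', '30 to 34',
            '35 to 39', '40 to 44',
            '45 to 49', '50 to 54',
            '55 to 59', '60 to 64',
            '65+']
new_cats = ['0-14', '0-14', '0-14',
            '15-24', '15-24',
            '25-34', '25-34',
            '35-44', '35-44',
            '45-54', '45-54',
            '55-64', '55-64',
            '65 and above']

# Positional pairing only makes sense if both lists have the same length
assert len(old_cats) == len(new_cats)

# Build the old -> new lookup so the pairing can be inspected
age_map = dict(zip(old_cats, new_cats))
for old, new in age_map.items():
    print(f'{old:>8} -> {new}')

# Number of distinct bands after merging, and the implied row count
n_bands = len(set(new_cats))
expected_rows = 10 * 2 * n_bands  # states x genders x bands
print(n_bands, expected_rows)
```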

# Converting the Code into a Function

In [37]:
# Write a function for cleaning and transforming South Sudan census data
def tweak_census(df, grouping_cols, condition):
    """
    Cleans and transforms South Sudan census data.

    This function processes a DataFrame containing South Sudan census data by:
    - Selecting relevant columns
    - Renaming and recategorizing age groups
    - Filtering rows based on a given condition
    - Handling null values
    - Renaming columns for clarity
    - Adding a new column to categorize regions
    - Grouping data by specified columns
    - Aggregating population counts
    - Sorting the results in descending order of total population

    Parameters:
    df (pl.DataFrame): The input DataFrame containing South Sudan census data.
    grouping_cols (list): A list of column names to group by.
    condition (pl.Expr): A condition to filter the DataFrame.

    Returns:
    pl.DataFrame: A transformed DataFrame with aggregated population counts, grouped by specified columns, and sorted by total population.
    """
    old_cats = ['0 to 4', '5 to 9', '10 to 14', 
                '15 to 19', '20 to 24', 
                '25 to 29', '30 to 34', 
                '35 to 39', '40 to 44',
                '45 to 49', '50 to 54',
                '55 to 59', '60 to 64',
                '65+'
               ]
    new_cats = ['0-14', '0-14', '0-14',
                '15-24', '15-24',
                '25-34', '25-34',
                '35-44', '35-44',
                '45-54', '45-54',
                '55-64', '55-64',
                '65 and above'
               ]
    return (
        df
        .select(cs.ends_with('Name'), '2008')
        .with_columns(
            gender=pl.col('Variable Name').str.replace_many(['Population, ', ' (Number)'], ''),
            category=pl.col('Age Name').replace(old_cats, new_cats)
        )
        .select(pl.col('*').exclude(['Variable Name', 'Age Name']))
        .filter(condition)
        .drop_nulls()
        .rename({'Region Name': 'state', '2008': 'population'})
        .with_columns(former_region=pl.when(pl.col('state').is_in(['Upper Nile', 'Unity', 'Jonglei']))
                                    .then(pl.lit('Greater Upper Nile'))
                                    .when(pl.col('state').str.ends_with('Equatoria'))
                                    .then(pl.lit('Greater Equatoria'))
                                    .when(pl.col('state').str.contains('Ghazal'))
                                    .then(pl.lit('Greater Bahr el Ghazal'))
                                    .otherwise(pl.lit('Greater Bahr el Ghazal'))
        )
        .group_by(grouping_cols)
        .agg(total=pl.col('population').sum())
        .sort('total', descending=True)
    )   

In [38]:
# Test the new function
print(
    tweak_census(
        census_raw,
        grouping_cols=['former_region', 'state', 'gender', 'category'],
        condition=~((pl.col('gender') == 'Total') | (pl.col('category') == 'Total'))
    )
)

shape: (140, 5)
┌────────────────────────┬────────────────────────┬────────┬──────────────┬────────┐
│ former_region          ┆ state                  ┆ gender ┆ category     ┆ total  │
│ ---                    ┆ ---                    ┆ ---    ┆ ---          ┆ ---    │
│ str                    ┆ str                    ┆ str    ┆ str          ┆ i64    │
╞════════════════════════╪════════════════════════╪════════╪══════════════╪════════╡
│ Greater Upper Nile     ┆ Jonglei                ┆ Male   ┆ 0-14         ┆ 338443 │
│ Greater Upper Nile     ┆ Jonglei                ┆ Female ┆ 0-14         ┆ 263646 │
│ Greater Equatoria      ┆ Central Equatoria      ┆ Male   ┆ 0-14         ┆ 242247 │
│ Greater Upper Nile     ┆ Upper Nile             ┆ Male   ┆ 0-14         ┆ 237461 │
│ Greater Bahr el Ghazal ┆ Warrap                 ┆ Male   ┆ 0-14         ┆ 230854 │
│ …                      ┆ …                      ┆ …      ┆ …            ┆ …      │
│ Greater Bahr el Ghazal ┆ Lakes                 

# Tabulating Data with the Great Tables Package

In [48]:
# Summarize data by state
state = (
    tweak_census(
        census_raw,
        grouping_cols=['former_region', 'state', 'gender', 'category'],
        condition=~((pl.col('gender') == 'Total') | (pl.col('category') == 'Total'))
    )
    .group_by(['former_region', 'state', 'gender'])
    .agg(total=pl.col('total').sum())
    .sort('total', descending=True)
    .pivot(index=['former_region', 'state'], columns='gender', values='total')
)

# Display output
print(state)

shape: (10, 4)
┌────────────────────────┬─────────────────────────┬────────┬────────┐
│ former_region          ┆ state                   ┆ Male   ┆ Female │
│ ---                    ┆ ---                     ┆ ---    ┆ ---    │
│ str                    ┆ str                     ┆ i64    ┆ i64    │
╞════════════════════════╪═════════════════════════╪════════╪════════╡
│ Greater Upper Nile     ┆ Jonglei                 ┆ 734327 ┆ 624275 │
│ Greater Equatoria      ┆ Central Equatoria       ┆ 581722 ┆ 521835 │
│ Greater Upper Nile     ┆ Upper Nile              ┆ 525430 ┆ 438923 │
│ Greater Bahr el Ghazal ┆ Warrap                  ┆ 470734 ┆ 502194 │
│ Greater Equatoria      ┆ Eastern Equatoria       ┆ 465187 ┆ 440974 │
│ Greater Bahr el Ghazal ┆ Northern Bahr el Ghazal ┆ 348290 ┆ 372608 │
│ Greater Bahr el Ghazal ┆ Lakes                   ┆ 365880 ┆ 329850 │
│ Greater Equatoria      ┆ Western Equatoria       ┆ 318443 ┆ 300586 │
│ Greater Upper Nile     ┆ Unity                   ┆ 300247 ┆ 

In [55]:
# Tabulate state data frame
(
    GT(state, rowname_col='state', groupname_col='former_region')
    .tab_header(
        title='Jonglei State leads with over 1.3 million persons',
        subtitle=md(
            'Western Bahr el Ghazal State has the lowest <br>population according to the 2008 census'
        )
    )
    .fmt_integer(columns=cs.integer(), use_seps=True)
    .opt_align_table_header(align="left")
    .cols_label(state='State')
    .tab_source_note(source_note=html('<a href="http://southsudan.opendataforafrica.org/fvjqdpe/population-by-age-and-sex-2008-south-sudan"><strong>2008 Census: National Bureau of Statistics, South Sudan</strong></a>'))
)

**Jonglei State leads with over 1.3 million persons**
Western Bahr el Ghazal State has the lowest population according to the 2008 census

| State | Male | Female |
|---|---|---|
| **Greater Upper Nile** | | |
| Upper Nile | 525430 | 438923 |
| Unity | 300247 | 285554 |
| **Greater Equatoria** | | |
| Eastern Equatoria | 465187 | 440974 |
| Western Equatoria | 318443 | 300586 |
| **Greater Bahr el Ghazal** | | |
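
The headline figure can be checked against the `Male` and `Female` counts printed for Jonglei in the `state` data frame. Here is a quick arithmetic sketch:

```python
# Counts for Jonglei as printed in the state data frame
jonglei_male = 734_327
jonglei_female = 624_275

jonglei_total = jonglei_male + jonglei_female

# Thousands separators make the magnitude easier to read
print(f'{jonglei_total:,}')  # 1,358,602
```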


In [54]:
# Save gt table as a png file
# GT.save(st_obj, file='img/state.png')