pandas is by far the most common library you will encounter for working with dataframes.
That alone is a good reason to be familiar with its syntax.

However, pandas is not the fastest tool around.

Just have a look at this benchmark:

<img src="polars.png">

Polars is written in Rust, is extremely fast, and can handle large amounts of data.
Most courses can use pandas without problems because their datasets are small, but you might
run into problems once you encounter real-world datasets.

I will reproduce the cleaning_data.ipynb lesson in polars. [Read the documentation](https://pola-rs.github.io/polars-book/user-guide/index.html) if you want to know more.

In [1]:
from pathlib import Path
datadir = Path("../data/raw/")
outputdir = Path("../data/processed/")
filename = datadir / "les1.csv"
filename.resolve(), filename.exists()

(PosixPath('/Users/raoulgrouls/code/DME22/notebooks/les1/data/raw/les1.csv'),
 True)

Read data

In [17]:
import polars as pl
df = pl.read_csv(filename)
df.head()

x1,x2,name
i64,f64,str
4,0.683287,"""Python Regius"""
5,0.787097,"""Python Regius"""
7,,"""Python Regius"""
9,0.802364,"""Python Regius"""
0,,"""Python Regius"""


In [18]:
df.describe()

describe,x1,x2,name
str,f64,f64,str
"""mean""",4.62,0.587542,
"""std""",2.834777,0.229508,
"""min""",0.0,0.203242,"""Python Regius"""
"""max""",9.0,0.996141,"""Python Regius"""
"""median""",4.5,0.632648,


Find null values.

In [19]:
df.null_count() > 0 

x1,x2,name
bool,bool,bool
False,True,False


Drop them

In [20]:
df = df.drop_nulls("x2")

Check types

In [22]:
df.dtypes

[polars.datatypes.Int64, polars.datatypes.Float64, polars.datatypes.Utf8]

Apply functions

In [23]:
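import re

# Match the leading run of word characters, i.e. the first word of the name.
# The raw string keeps Python from treating \w as a string escape.
regex = re.compile(r"^\w+")

def extract(regex, msg):
    # Return the first word of msg, or None when msg does not start with one.
    out = re.search(regex, msg)
    return out.group() if out else None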

In [25]:
# Note: from polars 0.19 onwards, apply on an expression is named map_elements.
df = df.with_columns(
    pl.col("name").apply(lambda x: extract(regex, x))
)

In [26]:
df.head()

x1,x2,name
i64,f64,str
4,0.683287,"""Python"""
5,0.787097,"""Python"""
9,0.802364,"""Python"""
8,0.855227,"""Python"""
9,0.861283,"""Python"""


Save data

In [27]:
from datetime import datetime
tag = datetime.now().strftime("%Y%m%d-%H%M") + ".csv"
output = outputdir / tag
df.write_csv(output)