pandas is the library you will most often encounter when working with dataframes.
That in itself is a reason to be familiar with its syntax.

However, pandas is not always the best tool for the job.

Just have a look at this benchmark:

<img src="polars.png">

Polars is written in Rust, is extremely fast, and can handle large amounts of data.
Most courses can use pandas without problems because their datasets are small, but you might
run into problems once you encounter real-world datasets.

I will reproduce the cleaning_data.ipynb lesson in polars. [Read the documentation](https://pola-rs.github.io/polars-book/user-guide/index.html) if you want to know more.

In [None]:
from pathlib import Path
datadir = Path("../data/raw/")
outputdir = Path("../data/processed/")
filename = datadir / "les1.csv"
filename.resolve(), filename.exists()

Read data

In [None]:
import polars as pl
df = pl.read_csv(filename)
df.head()

In [None]:
df.describe()

Find null values.

In [None]:
df.null_count() > 0 

Drop them

In [None]:
df = df.drop_nulls("x2")

Check types

In [None]:
df.dtypes

Apply functions

In [None]:
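import re

# Raw string, so that \w reaches the regex engine without an invalid-escape warning.
regex = re.compile(r"^\w+")
def extract(regex, msg):
    # Return the leading run of word characters, or None if nothing matches.
    out = regex.search(msg)
    return out.group() if out else None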

In [None]:
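# Newer polars releases call this `map_elements`; `apply` is the older, deprecated name.
df = df.with_columns(
    pl.col("name").map_elements(lambda x: extract(regex, x), return_dtype=pl.Utf8)
)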

In [None]:
df.head()

Save data

In [None]:
from datetime import datetime
tag = datetime.now().strftime("%Y%m%d-%H%M") + ".csv"
outputdir.mkdir(parents=True, exist_ok=True)  # make sure the output folder exists
output = outputdir / tag
df.write_csv(output)