# **Polars** ❄️

Polars and Pandas are both popular Python libraries for data manipulation and analysis, but they have different strengths and weaknesses. The best choice depends on your specific needs and priorities.

**Polars Advantages:**

* **Performance:** Polars generally outperforms Pandas, especially on larger datasets. This is due to its use of Apache Arrow, a columnar memory format, and its efficient query execution engine. It leverages parallel processing more effectively.

* **Memory Efficiency:** Polars' columnar storage and lazy evaluation can substantially reduce memory consumption compared to Pandas, which makes it practical to work with larger datasets on the same hardware.

* **Lazy Evaluation:** Polars supports lazy evaluation through its `LazyFrame` API: computations are not performed until the final result is requested. This lets the query optimizer rearrange and prune the plan, and can significantly speed up complex operations.

* **Data Types:** Polars has a strict, Arrow-based type system with dedicated `Categorical` and `Enum` types and first-class nested types such as `List` and `Struct`.

* **Extensibility:** Polars offers a more flexible and extensible architecture, making it easier to integrate with other libraries and tools.

* **Modern API:** Polars features a more modern and intuitive API, often considered easier to learn and use for certain operations.


**Polars Disadvantages:**

* **Maturity:** Polars is a newer library than Pandas, so it has a smaller community and fewer readily available resources (tutorials, worked examples, answered questions). While the community is growing rapidly, this can still be a hurdle.

* **Ecosystem:** The ecosystem of tools and integrations around Pandas is significantly larger. While Polars is catching up, there might be some libraries or workflows that are more readily available for Pandas.

* **Learning Curve:** While the API is often considered more modern, the shift from Pandas' familiar syntax can present a learning curve for experienced Pandas users.

**In Summary:**

| Feature | Polars | Pandas |
|-----------------|---------------------------------------|-----------------------------------------|
| Performance | Generally superior, especially on large datasets | Can be slow with large datasets |
| Memory Efficiency | Significantly better | Can be memory-intensive |
| Maturity | Less mature, smaller community | Mature, large and active community |
| Ecosystem | Growing rapidly, but smaller | Extensive and well-established |
| API | Modern and often considered more intuitive | More established, but can be less intuitive for some tasks |
| Lazy Evaluation | Yes | No |


Choose **Polars** if:

* You're working with extremely large datasets that exceed Pandas' capabilities.
* Performance is critical.
* Memory efficiency is a major concern.
* You value a modern and potentially more intuitive API.


Choose **Pandas** if:

* You need a mature and well-supported library with extensive documentation and community resources.
* You require access to a vast ecosystem of tools and integrations.
* You're already proficient in Pandas and the learning curve of Polars is a significant concern.


Ultimately, the best way to decide is to try both libraries on your specific data and workflows to see which one better suits your needs. You might even find that using both libraries together, leveraging their respective strengths, is the optimal approach.


<https://docs.pola.rs/>

In [5]:
import polars as pl

## **DataFrame**

The central object in Polars is the DataFrame. It is similar in form to a matrix, as it consists of rows and columns. Columns are referred to by their string names, and rows by their integer position. One DataFrame can contain many different data types, but within a column, everything has to be the same data type. A column of a DataFrame is essentially a Series. All columns must have the same number of elements (rows).

Unlike pandas, Polars has no row index: rows are identified only by their position (0, 1, 2, ...). If you need explicit row labels, store them in an ordinary column.


### **Creation** 

You can think of a DataFrame as a spreadsheet or as a SQL table. You can manually create a DataFrame or fill it with data from a CSV, an Excel spreadsheet, or a SQL query. 

#### **Dictionary to DataFrame**

You can pass in a dictionary to `pl.DataFrame()`. Each key is a column name and each value is a list of column values. The columns must all be the same length or you will get an error.

In [6]:
dict_input = {
    "Product ID": [1, 2, 3, 4],
    "Product Name": ["t-shirt", "t-shirt", "skirt", "skirt"],
    "Color": ["blue", "green", "red", "black"],
}

polars_df1 = pl.DataFrame(dict_input)
polars_df1

Product ID,Product Name,Color
i64,str,str
1,"""t-shirt""","""blue"""
2,"""t-shirt""","""green"""
3,"""skirt""","""red"""
4,"""skirt""","""black"""


#### **Lists to DataFrame**

You can also add data using lists.

For example, you can pass in a list of lists, where each one represents a row of data. Use the keyword argument `schema` to pass a list of column names, and `orient="row"` to tell Polars that each inner list is a row rather than a column.

In [7]:
list_input = [
    [1, "San Diego", 100],
    [2, "Los Angeles", 120],
    [3, "San Francisco", 90],
    [4, "Sacramento", 115],
]

polars_df2 = pl.DataFrame(
    list_input,
    schema=["Store ID", "Location", "Number of Employees"],
    orient="row",
)
polars_df2


Store ID,Location,Number of Employees
i64,str,i64
1,"""San Diego""",100
2,"""Los Angeles""",120
3,"""San Francisco""",90
4,"""Sacramento""",115


#### **External Data**

When you have data in a CSV, you can load it into a DataFrame using `pl.read_csv()`. We can also save a DataFrame to a CSV, using its `write_csv()` method.


In [None]:
import os

from src.config import CREDIT_RISK_DATA_DIR

os.chdir(CREDIT_RISK_DATA_DIR)


In [None]:
df = pl.read_csv("credit_risk_dataset.csv")

#### **Dealing with Multiple Files**

Often, you have the same data separated out into multiple files.

Let’s say that we have a ton of files following the filename structure: `'file1.csv'`, `'file2.csv'`, `'file3.csv'`, and so on. The power of Polars is mainly in being able to manipulate large amounts of structured data. We want to be able to get all of the relevant information into one table so that we can analyze the aggregate data.

We can combine the use of `glob`, a Python library for working with files, with `polars` to organize this data better. `glob` finds the matching filenames using shell-style wildcard matching, and each file can then be read and combined:

```python
import glob
import polars as pl

files = glob.glob("file*.csv")

df_list = []
for filename in files:
  data = pl.read_csv(filename)
  df_list.append(data)

df = pl.concat(df_list)  # stack the per-file DataFrames vertically

print(files)
```


### **Inspection**



In [None]:
df.head()

person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
i64,i64,str,f64,str,str,i64,f64,i64,f64,str,i64
22,59000,"""RENT""",123.0,"""PERSONAL""","""D""",35000,16.02,1,0.59,"""Y""",3
21,9600,"""OWN""",5.0,"""EDUCATION""","""B""",1000,11.14,0,0.1,"""N""",2
25,9600,"""MORTGAGE""",1.0,"""MEDICAL""","""C""",5500,12.87,1,0.57,"""N""",3
23,65500,"""RENT""",4.0,"""MEDICAL""","""C""",35000,15.23,1,0.53,"""N""",2
24,54400,"""RENT""",8.0,"""MEDICAL""","""C""",35000,14.27,1,0.55,"""Y""",4


In [None]:
# df.info()

In [None]:
df.glimpse()

Rows: 32581
Columns: 12
$ person_age                 <i64> 22, 21, 25, 23, 24, 21, 26, 24, 24, 21
$ person_income              <i64> 59000, 9600, 9600, 65500, 54400, 9900, 77100, 78956, 83000, 10000
$ person_home_ownership      <str> 'RENT', 'OWN', 'MORTGAGE', 'RENT', 'RENT', 'OWN', 'RENT', 'RENT', 'RENT', 'OWN'
$ person_emp_length          <f64> 123.0, 5.0, 1.0, 4.0, 8.0, 2.0, 8.0, 5.0, 8.0, 6.0
$ loan_intent                <str> 'PERSONAL', 'EDUCATION', 'MEDICAL', 'MEDICAL', 'MEDICAL', 'VENTURE', 'EDUCATION', 'MEDICAL', 'PERSONAL', 'VENTURE'
$ loan_grade                 <str> 'D', 'B', 'C', 'C', 'C', 'A', 'B', 'B', 'A', 'D'
$ loan_amnt                  <i64> 35000, 1000, 5500, 35000, 35000, 2500, 35000, 35000, 35000, 1600
$ loan_int_rate              <f64> 16.02, 11.14, 12.87, 15.23, 14.27, 7.14, 12.42, 11.11, 8.9, 14.74
$ loan_status                <i64> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1
$ loan_percent_income        <f64> 0.59, 0.1, 0.57, 0.53, 0.55, 0.25, 0.45, 0.44, 0.42, 0.16
$ cb_per

In [None]:
df.columns

['person_age',
 'person_income',
 'person_home_ownership',
 'person_emp_length',
 'loan_intent',
 'loan_grade',
 'loan_amnt',
 'loan_int_rate',
 'loan_status',
 'loan_percent_income',
 'cb_person_default_on_file',
 'cb_person_cred_hist_length']

In [None]:
df.dtypes

[Int64,
 Int64,
 String,
 Float64,
 String,
 String,
 Int64,
 Float64,
 Int64,
 Float64,
 String,
 Int64]

In [None]:
df.shape

(32581, 12)

In [None]:
df.estimated_size()

2667563

### **Manipulation**

#### **Selecting Columns**

There are two possible syntaxes for selecting all values from a column:

1. Select the column as if you were selecting a value from a dictionary using a key. In our example, we would type `df["person_age"]` to select the ages. This returns a Series.
2. Use `df.select("person_age")`, which returns a DataFrame with a single column. Unlike pandas, Polars does not support attribute-style access such as `df.person_age`.

In [None]:
df["person_age"]

person_age
i64
22
21
25
23
24
…
57
54
65
56


In [None]:
subset = df[["person_age", "loan_grade"]]
subset.head()

person_age,loan_grade
i64,str
22,"""D"""
21,"""B"""
25,"""C"""
23,"""C"""
24,"""C"""


#### **Selecting Rows**

DataFrames are zero-indexed, meaning that we start with the 0th row and count up from there. In Polars, `df.row(i)` returns a single row as a tuple of values, and `df.slice(offset, length)` returns a DataFrame containing `length` rows starting at position `offset`.

In [None]:
df.row(2)

(25, 9600, 'MORTGAGE', 1.0, 'MEDICAL', 'C', 5500, 12.87, 1, 0.57, 'N', 3)

In [None]:
df.slice(2, 3)

person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
i64,i64,str,f64,str,str,i64,f64,i64,f64,str,i64
25,9600,"""MORTGAGE""",1.0,"""MEDICAL""","""C""",5500,12.87,1,0.57,"""N""",3
23,65500,"""RENT""",4.0,"""MEDICAL""","""C""",35000,15.23,1,0.53,"""N""",2
24,54400,"""RENT""",8.0,"""MEDICAL""","""C""",35000,14.27,1,0.55,"""Y""",4


#### **Logical Subsets**

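You can select a subset of a DataFrame by passing a logical expression to `filter`. Comparison operators applied to `pl.col(...)` build these expressions: `==` tests if a value is exactly equal to another value, while `>` and `<` test for greater or smaller values.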


In [None]:
df.filter(pl.col("person_age") > 30).head()

person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
i64,i64,str,f64,str,str,i64,f64,i64,f64,str,i64
144,250000,"""RENT""",4.0,"""VENTURE""","""C""",4800,13.57,0,0.02,"""N""",3
144,200000,"""MORTGAGE""",4.0,"""EDUCATION""","""B""",6000,11.86,0,0.03,"""N""",2
123,80004,"""RENT""",2.0,"""EDUCATION""","""B""",20400,10.25,0,0.25,"""N""",3
123,78000,"""RENT""",7.0,"""VENTURE""","""B""",20000,,0,0.26,"""N""",4
32,1200000,"""MORTGAGE""",1.0,"""VENTURE""","""A""",12000,7.51,0,0.01,"""N""",8


You can also combine multiple logical statements, as long as each statement is in parentheses. In Polars expressions, `|` means “or” and `&` means “and”.

In [None]:
df.filter((pl.col("person_age") > 30) & (pl.col("loan_intent") == "VENTURE")).head()

person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
i64,i64,str,f64,str,str,i64,f64,i64,f64,str,i64
144,250000,"""RENT""",4.0,"""VENTURE""","""C""",4800,13.57,0,0.02,"""N""",3
123,78000,"""RENT""",7.0,"""VENTURE""","""B""",20000,,0,0.26,"""N""",4
32,1200000,"""MORTGAGE""",1.0,"""VENTURE""","""A""",12000,7.51,0,0.01,"""N""",8
34,120000,"""RENT""",17.0,"""VENTURE""","""B""",35000,10.59,0,0.29,"""N""",6
33,350000,"""MORTGAGE""",0.0,"""VENTURE""","""C""",10000,14.65,0,0.03,"""Y""",10


We could use the `is_in` method to check whether a column's value is one of a list of values.

In [None]:
df.filter(pl.col("loan_grade").is_in(["C", "B"])).head()

person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
i64,i64,str,f64,str,str,i64,f64,i64,f64,str,i64
21,9600,"""OWN""",5.0,"""EDUCATION""","""B""",1000,11.14,0,0.1,"""N""",2
25,9600,"""MORTGAGE""",1.0,"""MEDICAL""","""C""",5500,12.87,1,0.57,"""N""",3
23,65500,"""RENT""",4.0,"""MEDICAL""","""C""",35000,15.23,1,0.53,"""N""",2
24,54400,"""RENT""",8.0,"""MEDICAL""","""C""",35000,14.27,1,0.55,"""Y""",4
26,77100,"""RENT""",8.0,"""EDUCATION""","""B""",35000,12.42,1,0.45,"""N""",3


#### **Setting Indices**

Polars DataFrames have no row index, so filtering does not leave behind non-consecutive index labels and there is nothing to reset: rows are always addressed by position. If you want an explicit row-number column, for example to keep track of each row's original position, add one with `with_row_index()`. Its first argument sets the name of the new column.

In [None]:
df_reset = df.with_row_index("index")
df_reset.head()

index,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
u32,i64,i64,str,f64,str,str,i64,f64,i64,f64,str,i64
0,22,59000,"""RENT""",123.0,"""PERSONAL""","""D""",35000,16.02,1,0.59,"""Y""",3
1,21,9600,"""OWN""",5.0,"""EDUCATION""","""B""",1000,11.14,0,0.1,"""N""",2
2,25,9600,"""MORTGAGE""",1.0,"""MEDICAL""","""C""",5500,12.87,1,0.57,"""N""",3
3,23,65500,"""RENT""",4.0,"""MEDICAL""","""C""",35000,15.23,1,0.53,"""N""",2
4,24,54400,"""RENT""",8.0,"""MEDICAL""","""C""",35000,14.27,1,0.55,"""Y""",4


#### **Adding Columns**

Sometimes, we want to add a column to an existing DataFrame. We might want to add new information or perform a calculation based on the data that we already have.

One way that we can add a new column is by giving a list of the same length as the existing DataFrame.

In [None]:
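# Polars DataFrames do not support assignment with df["name"] = ...;
# wrap the list in a Series and add it with with_columns instead
df_with_list = df.with_columns(pl.Series("List Column", list(range(1, df.height + 1))))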

We can also add a new column that is the same for all rows in the DataFrame. 

In [None]:
df = df.with_columns(pl.lit(True).alias("Constant Column"))  # Adds "Constant Column" with the value True in every row

Finally, you can add a new column by performing a function on the existing columns.

In [None]:
polars_df = df.with_columns(
    (pl.col("person_age") * 0.075).alias("Function Column")  # A transformation based on an existing column
)

#### **Column Operations**

Often, the column that we want to add is related to existing columns, but requires a calculation more complex than multiplication or addition. We can use Polars expressions, such as the string method `str.to_lowercase()`, to transform every value in a particular column.

In [None]:
df = df.with_columns(pl.col("loan_intent").str.to_lowercase().alias("Lowercase"))

A lambda function is a way of defining a function in a single line of code. Usually, we would assign them to a variable. We can make our lambdas more complex by using a modified form of an if statement.

In [None]:
def myfunction(x):
    if x > 40:
        return 40 + (x - 40) * 1.50
    else:
        return x

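Below is a lambda version of the same formula. Note that, as written, it omits the `x > 40` check and applies the formula to every value, so it matches `myfunction` only for inputs above 40: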

In [None]:
myfunction = lambda x: 40 + (x - 40) * 1.50

In general, the syntax for a conditional expression in a lambda function is:

```python
lambda x: [OUTCOME IF TRUE] if [CONDITIONAL] else [OUTCOME IF FALSE]
```
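Filling in that template with the `x > 40` condition from `myfunction` gives the following sketch. The name `conditional_version` is introduced here only for illustration:

```python
# Apply the formula only when x > 40; otherwise return x unchanged
conditional_version = lambda x: 40 + (x - 40) * 1.50 if x > 40 else x

print(conditional_version(45))  # 47.5
print(conditional_version(30))  # 30
```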

In Polars, an arbitrary Python function such as a lambda can be applied to every value of a column with `map_elements`, although native expressions are usually much faster.

In [None]:
df = df.with_columns(pl.col("person_age").map_elements(myfunction).alias("Lambda Column"))

Expr.map_elements is significantly slower than the native expressions API.
Only use if you absolutely CANNOT implement your logic otherwise.
Replace this expression...
  - pl.col("person_age").map_elements(lambda x: ...)
with this one instead:
  + 40 + ((pl.col("person_age") - 40) * 1.5)



#### **Row Operations**

Conditional logic that involves several columns is written with `pl.when(...).then(...).otherwise(...)`. On large datasets this approach is significantly faster and more memory-efficient than applying a Python function with `map_elements`, because it operates on entire columns at once rather than value by value. It is the recommended approach in Polars for optimal performance.

In [None]:
df = df.with_columns(
    pl.when(pl.col("person_age") > 60)
    .then(pl.col("person_age") + pl.col("person_income"))
    .otherwise(pl.col("person_income") * 100)
    .alias("Row Operation Column")
)

Polars expressions do not rely on NumPy's `np.where`. The equivalent functionality is built into the expression system as the `pl.when().then().otherwise()` construct demonstrated above, which is designed for better performance and integration with Polars' data structures.

#### **Renaming Columns**

This code uses a lambda function within the rename method to convert each column name to lowercase. This is a concise and efficient way to perform this operation in Polars.

In [None]:
# Rename columns to lowercase
df = df.rename(lambda col_name: col_name.lower())
df.columns

['person_age',
 'person_income',
 'person_home_ownership',
 'person_emp_length',
 'loan_intent',
 'loan_grade',
 'loan_amnt',
 'loan_int_rate',
 'loan_status',
 'loan_percent_income',
 'cb_person_default_on_file',
 'cb_person_cred_hist_length',
 'constant column',
 'lowercase',
 'lambda column',
 'row operation column']

You can also rename individual columns with the `.rename` method by passing it a dictionary that maps old names to new names:

In [None]:
# Rename columns
df = df.rename({"person_age": "age", "person_income": "income"})
df.columns

['age',
 'income',
 'person_home_ownership',
 'person_emp_length',
 'loan_intent',
 'loan_grade',
 'loan_amnt',
 'loan_int_rate',
 'loan_status',
 'loan_percent_income',
 'cb_person_default_on_file',
 'cb_person_cred_hist_length',
 'constant column',
 'lowercase',
 'lambda column',
 'row operation column']

### **Aggregation**

#### **Group By**

In general, we use the following syntax to calculate aggregates:

```python
df.group_by('column1').agg(pl.col('column2').measurement())
```

In [None]:
df.group_by("loan_intent").agg(pl.col("age").mean())

loan_intent,age
str,f64
"""DEBTCONSOLIDATION""",27.606293
"""PERSONAL""",28.208477
"""MEDICAL""",27.998023
"""VENTURE""",27.568456
"""HOMEIMPROVEMENT""",29.066574
"""EDUCATION""",26.588099


Sometimes, the operation that you want to perform is more complicated than `mean` or `count`. In those cases, you can build the aggregation from other Polars expressions inside `agg`, such as `quantile`. Each expression is evaluated once per group.

In [None]:
# Calculate the 75th percentile of income for each loan_intent group
result = df.group_by("loan_intent").agg(pl.col("income").quantile(0.75))

Sometimes, we want to group by more than one column. We can easily do this by passing a list of column names into the `group_by` method.

In [None]:
# Group by loan_intent and loan_status, calculate mean age
result = df.group_by(["loan_intent", "loan_status"]).agg(pl.col("age").mean())

We can also perform multiple aggregations on a single column.

In [None]:
# Group by loan_intent and calculate mean, median, and standard deviation of income
result = df.group_by("loan_intent").agg(
    [
        pl.col("income").mean().alias("mean_income"),
        pl.col("income").median().alias("median_income"),
        pl.col("income").std().alias("std_income"),
    ]
)

Finally, you can perform multiple aggregations over multiple grouping columns.

In [None]:
# Group and aggregate in Polars
result = df.group_by(["loan_intent", "loan_status"]).agg(
    [
        pl.col("age").mean().alias("mean_age"),
        pl.col("age").median().alias("median_age"),
        pl.col("income").mean().alias("mean_income"),
        pl.col("income").median().alias("median_income"),
    ]
)

#### **Pivot Tables**

When we perform a `group_by` across multiple columns, we often want to change how our data is stored. Reorganizing a table in this way is called pivoting. The new table is called a pivot table. In Polars 1.0 and later, the column whose values become the new column headers is passed as `on` (earlier versions called this argument `columns`):

```python
df.pivot(on='ColumnToPivot',
         index='ColumnToBeRows',
         values='ColumnToBeValues')
```

Just like with `group_by`, the output of a pivot is a new DataFrame. Because Polars has no row index, the `index` column is returned as an ordinary column and no `.reset_index()` step is needed.

In [None]:
# Efficient pivot in Polars
df.group_by(["loan_intent", "loan_status"]).agg(pl.col("age").mean()).pivot(
    "loan_intent", index="loan_status", values="age"
)

loan_status,VENTURE,DEBTCONSOLIDATION,EDUCATION,HOMEIMPROVEMENT,MEDICAL,PERSONAL
i64,f64,f64,f64,f64,f64,f64
1,26.838253,27.709396,27.152115,27.671626,27.77483,27.361566
0,27.695402,27.565019,26.470797,29.559309,28.079326,28.41872


Alternatively, the grouping can be performed in the pivot table itself.

In [None]:
df.pivot(on="loan_status", index="loan_intent", values="age", aggregate_function="mean")


loan_intent,1,0
str,f64,f64
"""PERSONAL""",27.361566,28.41872
"""EDUCATION""",27.152115,26.470797
"""MEDICAL""",27.77483,28.079326
"""VENTURE""",26.838253,27.695402
"""HOMEIMPROVEMENT""",27.671626,29.559309
"""DEBTCONSOLIDATION""",27.709396,27.565019


### **Multiple Tables**

#### **Merging**

The `.join()` method matches rows from two DataFrames on the columns named in `on`, looking for rows where those columns' values are the same. It then combines the matching rows into a single row in a new table.

In [None]:
# Join the aggregated results back onto the original rows
merged_df = df.join(result, on=["loan_intent", "loan_status"], how="left")

#### **Specifying Join Columns**

Sometimes a column with the same name means something different in each table (for example, `id` in an `orders` table and `id` in a `customers` table), so joining on that shared name would be wrong. One way that we could address this problem is to use `.rename()` to rename the columns before joining.

```python
# Rename the customers key so that both tables share a join column
merged_df = orders.join(customers.rename({"id": "customer_id"}), on="customer_id", how="inner")
```

If we don’t want to do that, we have another option. We could use the keywords `left_on` and `right_on` to specify which columns we want to perform the join on.

```python
# Join on key columns that have different names in each table
merged = orders.join(customers, left_on="customer_id", right_on="id", how="inner")
```

If we use this syntax, the join key is taken from a differently named column in each table, so no renaming is needed. Any other column name that appears in both tables would clash, because a Polars DataFrame cannot have two columns with the same name. Polars' `join` resolves such conflicts automatically by appending a suffix to the column that comes from the right-hand table (`_right` by default). If you need more control, set the `suffix` parameter or rename the conflicting columns before performing the join.

#### **Concatenation**

Sometimes, a dataset is broken into multiple tables. For instance, data is often split into multiple CSV files so that each download is smaller.

When we need to reconstruct a single DataFrame from multiple smaller DataFrames, we can use the function `pl.concat([df1, df2, df3, ...])`. By default, it stacks the DataFrames vertically, which only works if all of the DataFrames have the same columns.

```python
# Concatenate the two menus to form a new menu
menu = pl.concat([bakery, ice_cream])
```