## Differences

### Difference 1: Polars has no index

Pandas gives a label to each row with an index.   
Polars does not use an index and  
 each row **is indexed by its integer position** in the table.

### Difference 2: Polars uses Arrow arrays

Polars uses Apache Arrow arrays to represent data in memory  
 while Pandas uses NumPy arrays.

Apache Arrow is an emerging standard for in-memory columnar analytics that can

- accelerate data load times,
- reduce memory usage,
- and accelerate calculations.

Polars can convert data to NumPy format with the `to_numpy` method.

In short, Arrow arrays are faster, and Polars can still convert to NumPy when needed.

### Difference 3: more support for parallelism

Polars has more support for parallel operations than Pandas.

Polars exploits the strong support for concurrency in Rust to run many operations in parallel.   
While some operations in Pandas are multi-threaded, the core of the library is single-threaded   
and an additional library such as **Dask** must be used to parallelise operations.

In other words, Pandas is essentially single-threaded, so you have to reach for a package like Dask to get parallel speed-ups.

### Difference 4: Polars has a lazy mode

Polars can lazily evaluate queries and apply query optimization.

Eager evaluation is where code is evaluated as soon as you run the code.  
Lazy evaluation is where running a line of code means that  
    the underlying logic is added to a query plan rather than being evaluated.  

Polars supports eager evaluation and lazy evaluation whereas Pandas only supports eager evaluation.   
The lazy evaluation mode is powerful because Polars carries out automatic query optimization where it examines the query plan and looks for ways to accelerate the query or reduce memory usage.

Dask also supports lazy evaluation where it generates a query plan. However, Dask does not carry out query optimization on the query plan.  
Dask takes another hit here: it has a lazy mode, but it only defers execution and does not optimize the query.

## Some operations

### Write Polars the Polars way

Do not write Polars code in the Pandas style; it will be slow.

In Pandas, select column `a`:
```python
df['a']
df.loc[:,'a']
```

In Polars, select column `a`:
```python
df.select(['a'])
```

In Polars, keep the rows where column `a` meets a condition:
```python
df.filter(pl.col('a')<10)
```



### Be lazy: prefer lazy mode

Working in lazy evaluation mode is straightforward and  
 should be your default in Polars as the lazy mode allows Polars to do query optimization.

We can run in lazy mode by either using an implicitly lazy function (such as **scan_csv**) or explicitly using the lazy method.

Take the following simple example where we read a CSV file from disk   and do a groupby.   
The CSV file has numerous columns but we just want to do a groupby on one of the id columns (id1)   
and then sum by a value column (v1). 

In Pandas this would be:

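```python
df = pd.read_csv(csvFile)
# Keep only the two relevant columns, then group by id1 and sum v1
groupedDf = df.loc[:, ['id1', 'v1']].groupby('id1').sum()
```
1. Read the file.  
2. Select the relevant columns, then group by `id1` and sum `v1`.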

In Polars you can build this query in lazy mode with query optimization   
by replacing the eager Pandas function `read_csv` with  
 the implicitly lazy Polars function `scan_csv`:

```python
df = pl.scan_csv(csvFile)
# Recent Polars releases spell this group_by; older ones used groupby
groupedDf = df.group_by('id1').agg(pl.col('v1').sum()).collect()
```
1. Read the file (lazily).  
2. Pick out the relevant data, then group by `id1` and sum `v1`.

Polars optimizes this query by identifying that   
**only the id1 and v1 columns are relevant and**  
 **so will only read these columns from the CSV.**  
By calling the `.collect` method at the end of the second line we instruct Polars to eagerly evaluate the query.

If you do want to run this query in eager mode you can just replace scan_csv with read_csv in the Polars code.

Read more about working with lazy evaluation in the lazy API section.

### Express yourself: prefer expressions

A typical Pandas script consists of multiple data transformations that are executed sequentially.   
However, in Polars these transformations can be executed in parallel using expressions.  
In other words, Pandas executes a series of operations one after another (its calls are chained, too),  
whereas Polars runs expressions in parallel.

#### Column assignment

We have a dataframe df with a column called `value`.  
We want to add two new columns,   
a column called `tenXValue` where the value column is multiplied by 10 and  
a column called `hundredXValue` where the value column is multiplied by 100.

In Pandas this would be:

```python
df["tenXValue"] = df["value"] * 10
df["hundredXValue"] = df["value"] * 100
```
These column assignments are executed sequentially.

In Polars we add columns to df using the `.with_columns` method   
and name them with the `.alias` method:

```python
df.with_columns([
    (pl.col("value") * 10).alias("tenXValue"),
    (pl.col("value") * 100).alias("hundredXValue"),
])
```
These column assignments are executed **in parallel**.  
So the Polars syntax looks more laborious, but it runs in parallel.

#### Column assignment based on predicate

In this case we have a dataframe df with columns `a`,`b` and `c`.   
We want to re-assign the values in column `a` based on a condition.   
When the value in column `c` is equal to 2 then we replace the value in `a` with the value in `b`.

In Pandas this would be:
```python
df.loc[df["c"] == 2, "a"] = df.loc[df["c"] == 2, "b"]
```

In Polars this would be:

```python
df.with_columns(
    pl.when(pl.col("c") == 2)
    .then(pl.col("b"))
    .otherwise(pl.col("a")).alias("a")
)
```
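Comparing the logic of the two approaches closely is instructive.  
Pandas overwrites `a` with the value of `b` in the rows that meet the condition.  
Polars says: where the condition holds take the value from `b`, otherwise take it from `a`.  
So naturally the original `a` is never modified.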

The Polars way is pure in that **the original DataFrame is not modified**: the expression returns a new DataFrame and leaves `df` as it was.  
The mask is also not computed twice as in Pandas  
 (you could prevent this in Pandas, but that would require setting a temporary variable).

The branches of the conditional are parallelised as well.  
Polars can compute every branch of   
an `if -> then -> otherwise` in parallel.   
This is valuable when the branches get more expensive to compute.

#### Filtering

We want to filter the dataframe df with housing data **based on some criteria**.  
In Pandas you filter the dataframe by passing **Boolean expressions** to the loc method:
```python
df.loc[(df['sqft_living'] > 2500) & (df['price'] < 300000)]
```

In Polars this would be:

```python
df.filter(
    (pl.col("sqft_living") > 2500) & (pl.col("price") < 300000)
)
```
The query optimizer in Polars can also detect if you write multiple filters separately and combine them into a single filter in the optimized plan.

## Pandas transform

The Pandas documentation demonstrates an operation on a groupby called transform.  
In this case we have a dataframe `df` and  
we want **a new column showing the number of rows** in each group.

In Pandas this would be:

```python
import pandas as pd

df = pd.DataFrame({
    "c": [1, 1, 1, 2, 2, 2, 2],
    "type": ["m", "n", "o", "m", "m", "n", "n"],
})

# Group by "c", take "type", and write each group's length back to its rows
df["size"] = df.groupby("c")["type"].transform(len)
df
```

Here Pandas 
1. does a groupby on "c",   
2. takes column "type",   
3. computes the group length and  
4. then joins the result back to the original DataFrame producing:

```python
   c type size
0  1    m    3
1  1    n    3
2  1    o    3
3  2    m    4
4  2    m    4
5  2    n    4
6  2    n    4
```

In Polars this would be:

```python
import polars as pl

df = pl.DataFrame({
    "c": [1, 1, 1, 2, 2, 2, 2],
    "type": ["m", "n", "o", "m", "m", "n", "n"],
})

df.select([
    pl.all(),
    pl.col("type").count().over("c").alias("size")
])
```
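A note on the difference: the Pandas steps are listed above.  
Polars instead partitions by `c` with `over("c")`,  
then takes the `type` column and counts its length within each group with `count()`.

Output: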
```python
shape: (7, 3)
┌─────┬──────┬──────┐
│ c   ┆ type ┆ size │
│ --- ┆ ---  ┆ ---  │
│ i64 ┆ str  ┆ u32  │
╞═════╪══════╪══════╡
│ 1   ┆ m    ┆ 3    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1   ┆ n    ┆ 3    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1   ┆ o    ┆ 3    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ m    ┆ 4    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ m    ┆ 4    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ n    ┆ 4    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ n    ┆ 4    │
└─────┴──────┴──────┘
```

Because we can store the whole operation in a single expression,  
we can combine several window functions and even combine different groups!  
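Because the whole operation is held in one expression,  
several window functions and groupings can be combined, which improves efficiency.  
To be honest, though, this is just the handbook's example: in practice we rarely group by several keys at once  
and then stitch the results together. Usually we would compute them separately and merge, right?

Polars will cache window expressions that are applied over the same group,  
so storing them in a single select is both convenient and optimal.  
In the following example we look at a case  
where we are calculating group statistics over "c" twice: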

```python
df.select([
    pl.all(),
    pl.col("c").count().over("c").alias("size"),
    pl.col("c").sum().over("type").alias("sum"),
    pl.col("c").reverse().over("c").flatten().alias("reverse_type")
])

# Output
shape: (7, 5)
┌─────┬──────┬──────┬─────┬──────────────┐
│ c   ┆ type ┆ size ┆ sum ┆ reverse_type │
│ --- ┆ ---  ┆ ---  ┆ --- ┆ ---          │
│ i64 ┆ str  ┆ u32  ┆ i64 ┆ i64          │
╞═════╪══════╪══════╪═════╪══════════════╡
│ 1   ┆ m    ┆ 3    ┆ 5   ┆ 2            │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ n    ┆ 3    ┆ 5   ┆ 2            │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ o    ┆ 3    ┆ 1   ┆ 2            │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ m    ┆ 4    ┆ 5   ┆ 2            │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ m    ┆ 4    ┆ 5   ┆ 1            │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ n    ┆ 4    ┆ 5   ┆ 1            │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ n    ┆ 4    ┆ 5   ┆ 1            │
└─────┴──────┴──────┴─────┴──────────────┘
```

### Missing data

Pandas uses `NaN` and/or `None` values to indicate missing values 
    depending on the dtype of the column.  
In addition the behaviour in Pandas varies depending on 
    whether the default dtypes or optional nullable arrays are used.  
In Polars missing data corresponds to a `null` value for all data types.

In other words, how Pandas marks a missing value depends on the column's dtype,  
whereas Polars always uses `null`.

For float columns Polars permits the use of `NaN` values.   
These `NaN` values are not considered to be missing data but instead a special floating point value.

In Pandas an integer column with missing values is cast to be a float column with `NaN` values for the missing values   
(unless using optional nullable integer dtypes).   
In Polars any missing values in an integer column are simply `null` values   
and the column remains an integer column.

See the missing data section for more details.