# Value counts
By the end of this lecture you will be able to:
- count occurrences in a column with `value_counts`
- create a bar chart of the outputs
- use `value_counts` in an expression
- use `value_counts` in lazy mode

In [None]:
import polars as pl

In [None]:
csv_file = '../data/titanic.csv'

In [None]:
df = pl.read_csv(csv_file)
df.head(3)

## Count occurrences in a `Series`
We use `value_counts` to count occurrences in a `Series`.

In [None]:
df['Pclass'].value_counts()

> In Pandas the output of this operation is a `Series` but in Polars the output is a `DataFrame` with one column for the categories and one for the counts.

The order of the rows is not guaranteed and can change between runs unless you pass `sort=True`, which orders the output by count from highest to lowest.

In [None]:
df['Pclass'].value_counts(sort=True)

We can also sort by the category by calling the `sort` method on the output `DataFrame`.

In [None]:
df['Pclass'].value_counts().sort("Pclass")

As `value_counts` works on a single column it is not done in parallel by default. If we have a long `Series` it might be worth doing this in parallel with the `parallel` argument.

In [None]:
df['Pclass'].value_counts(parallel=True)

## Value counts as an expression
We can call `value_counts` in an expression

In [None]:
(
    df
    .select(
        pl.col("Pclass").value_counts()
    )
)

However, the output is a one-column `DataFrame` with a `pl.Struct` column.

We can get a two-column `DataFrame` by extracting the struct column as a `Series` and calling `.struct.unnest` on it. This gives the same output as calling `value_counts` on a `Series`.

In [None]:
(
    df
    .select(
        pl.col("Pclass").value_counts()
    )
    ["Pclass"]
    .struct.unnest()
)

## Plotting the value counts

We call `value_counts` on a `Series` again.

To display the output as a bar chart we need to convert the integer `Pclass` column to string dtype before plotting.

In [None]:
(
    df['Pclass']
    .value_counts()
    .sort("Pclass")
    .with_columns(
        pl.col("Pclass").cast(pl.Utf8)
    )
    .plot
    .bar(
        x="Pclass",
        y="count"
    )
)

## Value counts in lazy mode
To calculate value counts in lazy mode we call `value_counts` as an expression on a `LazyFrame`.

As the output of the `value_counts` expression is a `struct` dtype we then:
- trigger evaluation of the `LazyFrame`
- transform the `struct` column to a `DataFrame`

In [None]:
(
    pl.scan_csv(csv_file)
    .select(
        pl.col("Pclass").value_counts()
    )
    .collect()
    ["Pclass"]
    .struct.unnest()
)

In this lazy query Polars detects that only the `Pclass` column needs to be read from the CSV.

In [None]:
print(
    pl.scan_csv(csv_file)
    .select(
        pl.col("Pclass").value_counts()
    )
    .explain()
)

We see this from `PROJECT 1/12 COLUMNS` in the optimised query plan.

However, `value_counts` does not currently work in streaming mode. We see this because the `value_counts` expression occurs after the `STREAMING` block of the streaming query plan.

In [None]:
print(
    pl.scan_csv(csv_file)
    .select(
        pl.col("Pclass").value_counts()
    )
    .explain(streaming=True)
)

## Exercises

In the exercises you will develop your understanding of:
- calculating value counts
- calculating percentages
- visualising the outputs
- doing `value_counts` in lazy mode

### Exercise 1
Calculate the value counts on the `Survived` column as a `Series`. 

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

Sort the output by count from highest to lowest

Calculate the value counts on the `Survived` column as an expression 

Calculate the value counts on the `Survived` column as an expression and convert the `pl.Struct` column to a `DataFrame`

### Exercise 2
As in the first part of Exercise 1, calculate the value counts on the `Survived` column as a `Series`

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
)

Add an additional column with the percentage of passengers in each `Survived` category (divide the `count` column by the sum of the `count` column).

Express the percentages as values ranging from 0 to 100.

Visualise the percentage values for each `Survived` category in a bar chart

### Exercise 3

Construct the query that produces the following optimized query plan
```
 SELECT [col("Age").round().value_counts()] FROM

    Csv SCAN ../data/titanic.csv
    PROJECT 1/12 COLUMNS
```


In [None]:
dfLazy = (
     <blank>
)


print(dfLazy.explain())

### Exercise 4
We create a `DataFrame` from the Spotify data

In [None]:
pl.Config.set_fmt_str_lengths(100)
pl.Config.set_tbl_rows(10)
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

Create a `DataFrame` with the 5 most common tracks by count of rows

Create a bar chart of the 5 most common tracks by count of rows in your preferred plotting library (solutions provided for hvPlot, Altair, Matplotlib and Plotly)

## Solutions

### Solution to Exercise 1

Calculate the value counts on the `Survived` column as a `Series`

In [None]:
(
    pl.read_csv(csv_file)
    ['Survived']
    .value_counts()
)

Sort by the counts from highest to lowest

In [None]:
(
    pl.read_csv(csv_file)
    ['Survived']
    .value_counts(sort=True)
)

Calculate the value counts on the `Survived` column as an expression

In [None]:
(
    pl.read_csv(csv_file)
    .select(
        pl.col("Survived").value_counts()
    )
)

Calculate the value counts on the `Survived` column as an expression and convert the `pl.Struct` column to a `DataFrame`

In [None]:
(
    pl.read_csv(csv_file)
    .select(
        pl.col("Survived").value_counts()
    )
    ["Survived"]
    .struct.unnest()
)

### Solution to Exercise 2
As in the first part of Exercise 1, calculate the value counts on the `Survived` column as a `Series`

In [None]:
(
    pl.read_csv(csv_file)
    ['Survived']
    .value_counts(sort=True)
)

Add an additional column with the percentage of passengers in each `Survived` category (divide the `count` column by the sum of the `count` column).

In [None]:
(
    pl.read_csv(csv_file)
    ['Survived']
    .value_counts(sort=True)
    .with_columns(
        (pl.col("count")/pl.col("count").sum()).alias("percent")
    )
)

Express the percentages as values ranging from 0 to 100.

In [None]:
(
    pl.read_csv(csv_file)
    ['Survived']
    .value_counts(sort=True)
    .with_columns(
        (100*(pl.col("count")/pl.col("count").sum())).alias("percent")
    )
)

Visualise the outputs as a bar chart

In [None]:
survived_count_df = (
    pl.read_csv(csv_file)
    ['Survived']
    .value_counts(sort=True)
    .with_columns(
        pl.col("Survived").cast(pl.Utf8)
    )
    .with_columns(
        (100*(pl.col("count")/pl.col("count").sum())).alias("percent")
    )
)

(
    survived_count_df
    .plot
    .bar(
        x="Survived",
        y="percent"
    )
)

### Solution to Exercise 3
Construct the query that produces the following optimized query plan
```
 SELECT [col("Age").round().value_counts()] FROM

    Csv SCAN ../data/titanic.csv
    PROJECT 1/12 COLUMNS
```


In [None]:
dfLazy = (
    pl.scan_csv(csv_file)
    .select(
        pl.col("Age").round(0).value_counts()
    )
)
print(dfLazy.explain())

### Solution to Exercise 4
We create a `DataFrame` from the Spotify data

In [None]:
pl.Config.set_fmt_str_lengths(100)
pl.Config.set_tbl_rows(10)
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

Create a `DataFrame` with the 5 most common tracks by count of rows

In [None]:
(
    spotify_df
    ["title"]
    .value_counts(sort=True)
    .head()
)

Create a bar chart of the 5 most common tracks by count of rows (solutions provided for hvPlot, Altair, Matplotlib and Plotly)

In [None]:
# hvPlot
top_titles_df = (
    spotify_df
    ["title"]
    .value_counts(sort=True)
    .head()
    )
(
    top_titles_df
    .plot
    .bar(
        x="title",
        y="count",
    )
)

In [None]:
import altair as alt
alt.Chart(
    (
    spotify_df
    ["title"]
    .value_counts(sort=True)
    .head()
    ),
    width=600
).mark_bar().encode(
    x=alt.X("title:N",sort="-y"),
    y="count:Q"
)

In [None]:
import matplotlib.pyplot as plt
top_titles_df = (
    spotify_df
    ["title"]
    .value_counts(sort=True)
    .head()
    )
plt.bar(
    x=top_titles_df["title"],
    height=top_titles_df["count"],
)

In [None]:
import plotly.express as px
top_titles_df = (
    spotify_df
    ["title"]
    .value_counts(sort=True)
    .head()
    )
px.bar(
    top_titles_df,
    x="title",
    y="count",
)