<a id="back_to_top">

# Ibis Cheat Sheet

#### Table of Contents

- [Reading a local CSV file](#csv)
- [Viewing schema](#schema)
- [Number of rows and columns](#rows_columns)
- [Selecting certain columns](#selecting_columns)
- [Creating new column based on value from another column using mutate and split() combination](#split)
- [Creating new column using CASE statements](#case)
- [Running Total / Cumulative Sum](#running_total)
- [Forward fill / backward fill](https://github.com/ibis-project/ibis/wiki/ffill-and-bfill-using-window-functions)
- [Aggregations / Summarizations](#aggregations)
- [Filtering](#filtering)
- [Convert Pandas dataframe to Ibis table expression](#pandas_to_ibis)
- [Obtain distinct values in column(s)](#distinct)
- [Find Duplicate Values](#duplicates)
- [Percent of Rows](#perc_of_rows)
- [Percent of Columns](#perc_of_columns)

In [None]:
import ibis
import ibis.selectors as s
import pandas as pd
from ibis import _
ibis.options.interactive = True

# create a DuckDB client
client = ibis.duckdb.connect()

<a id="csv"></a>

## Reading a CSV file

[[back to top]](#back_to_top)

In [2]:
# read in a local CSV file as an Ibis table
cars = client.read_csv('./data/cars.csv')

In [3]:
cars.head()

<a id="schema">

## Viewing Schema

[[back to top]](#back_to_top)

In [4]:
cars.schema()

ibis.Schema {
  Car           string
  MPG           float64
  Cylinders     int64
  Displacement  float64
  Horsepower    float64
  Weight        float64
  Acceleration  float64
  Model         int64
  Origin        string
}

<a id="rows_columns">

## Number of Rows and Columns

[[back to top]](#back_to_top)

\# of rows:

In [5]:
cars.count()

406

\# of columns:

In [6]:
len(cars.schema())

9

<a id="selecting_columns">

## Selecting Certain Columns

[[back to top]](#back_to_top)

#### We can use pandas-style square-bracket syntax

In [7]:
cars['Origin','Car','MPG'].head()

#### or use the Ibis `select()` method

In [8]:
cars.select('Origin', 'Car', 'MPG').head()

#### Select columns based on their data types

In [9]:
from ibis import selectors as s
cars.select(s.of_type('float64'))

#### We can also combine selectors with `|` (logical OR) to select columns of several data types

In [10]:
cars.select(s.of_type('string') | s.of_type('int64'))

<a id="split">

## Creating new column based on value from another column using mutate and split() combination

[[back to top]](#back_to_top)

The `Car` column contains both the car's make and its model name. To get the make, split the value in the `Car` column on a single space and take the first token (array index 0). The second token (index 1) is the model name. Note that index 1 returns only the second word, so a multi-word model name is cut down to its first word.

In [11]:
cars = cars.mutate(Make=cars.Car.split(' ')[0])
cars = cars.mutate(ModelName=cars.Car.split(' ')[1])

In [12]:
cars.head()
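
The token indexing described above can be checked on a single string in plain Python. This is only a sketch, not Ibis code, and the sample value is a hypothetical car name rather than a row quoted from `cars.csv`.

```python
# Plain-Python check of the token indexing (not Ibis).
# The sample value is hypothetical.
car = "Chevrolet Chevelle Malibu"
tokens = car.split(" ")

make = tokens[0]        # first token
model_name = tokens[1]  # second token only; later tokens are dropped

print(make)        # Chevrolet
print(model_name)  # Chevelle
```

Here the third word, `Malibu`, does not appear in either new value, which is the truncation to expect for multi-word model names.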

<a id="case">

## Creating new column using CASE statements

[[back to top]](#back_to_top)

In [13]:
foods = client.read_csv('./data/food.csv')

In [14]:
foods

In [15]:
case = (
    foods.food
    .case()
    .when('bacon', 'pig')
    .when('pulled pork', 'pig')
    .when('pastrami', 'cow')
    .when('corned beef', 'cow')
    .when('honey ham', 'pig')
    .else_('salmon')
    .end()
)

foods = foods.mutate(animal=case)

In [16]:
foods
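
The CASE expression above behaves like a lookup with a default value. The sketch below restates the same mapping in plain Python (not Ibis) to make the fall-through explicit. The helper name `classify` and the unlisted value `'nova lox'` are assumptions made for illustration.

```python
# Plain-Python sketch of the CASE logic (not Ibis).
FOOD_TO_ANIMAL = {
    "bacon": "pig",
    "pulled pork": "pig",
    "pastrami": "cow",
    "corned beef": "cow",
    "honey ham": "pig",
}


def classify(food: str) -> str:
    """Return the mapped animal; unlisted foods take the else_ branch."""
    return FOOD_TO_ANIMAL.get(food, "salmon")


print(classify("bacon"))     # pig
print(classify("nova lox"))  # salmon (hypothetical unlisted value)
```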

<a id="running_total">

## Running Total / Cumulative Sum

[[back to top]](#back_to_top)

Let's calculate the running total (cumulative sum) of MPG within each country of origin, ordered by MPG.

In [17]:
# Define running total calculation
window_spec = ibis.window(
    order_by='MPG',
    group_by='Origin'
)

In [18]:
running_total = cars['MPG'].cumsum().over(window_spec).name('running_total')

In [19]:
running_total

In [20]:
# Apply the running_total expression. To put a space in the column name, use relabel() to rename the column
cars_cumulative_mpg = cars['Origin','MPG'].mutate(running_total=running_total).relabel(dict(running_total='Running Total'))

In [21]:
cars_cumulative_mpg.filter(cars_cumulative_mpg['Origin'] == 'US')

In [22]:
cars_cumulative_mpg.filter(cars_cumulative_mpg['Origin'] == 'Japan')

In [23]:
cars_cumulative_mpg.filter(cars_cumulative_mpg['Origin'] == 'Europe')
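
To sanity-check what the window computes, the same logic can be sketched in plain Python (not Ibis): sort by origin and MPG, then restart the cumulative sum for each origin. The rows below are hypothetical values, not data from `cars.csv`.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# Hypothetical (Origin, MPG) rows, not taken from cars.csv.
rows = [("US", 18.0), ("Japan", 24.0), ("US", 15.0), ("Japan", 31.0), ("US", 16.0)]

# Sorting by (Origin, MPG) mirrors group_by='Origin' with order_by='MPG'.
ordered = sorted(rows, key=itemgetter(0, 1))

running = []
for origin, group in groupby(ordered, key=itemgetter(0)):
    mpgs = [mpg for _origin, mpg in group]
    # accumulate() restarts for every origin, so totals never cross groups.
    running.extend(
        (origin, mpg, total) for mpg, total in zip(mpgs, accumulate(mpgs))
    )

for row in running:
    print(row)
# ('Japan', 24.0, 24.0)
# ('Japan', 31.0, 55.0)
# ('US', 15.0, 15.0)
# ('US', 16.0, 31.0)
# ('US', 18.0, 49.0)
```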

<a id="aggregations">

## Aggregations / Summarizations

[[back to top]](#back_to_top)

**GOAL:** For each country of origin, calculate the minimum, maximum, and average MPG.

In [24]:
# The keyword names become the output column names
agged = cars.group_by('Origin').aggregate(
    min_mpg=cars['MPG'].min(),
    max_mpg=cars['MPG'].max(),
    avg_mpg=cars['MPG'].mean(),
)

agged.relabel(
    dict(
        min_mpg='Min MPG',
        max_mpg='Max MPG',
        avg_mpg='Avg MPG',
    )
).head()

<a id="filtering">

## Filtering

[[back to top]](#back_to_top)

In [25]:
cars

When passing multiple conditions to `filter()`, wrap each condition in parentheses. Combine conditions with `&` for `AND` and `|` for `OR`.

In [26]:
cars.filter(
    (cars['MPG'] > 0)
    & (cars['Cylinders'] < 5)
)['MPG'].mean()

29.118750000000002

In [27]:
cars.filter(
    (cars['MPG'] > 0) & (cars['Cylinders'] < 5)
)

In [28]:
min_mpg = 0
max_cyl = 5
condition = (cars['MPG'] > min_mpg) & (cars['Cylinders'] < max_cyl)

cars.filter(condition)
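
The parentheses matter because Python's `&` binds more tightly than comparison operators such as `>` and `<`. The sketch below shows the difference with plain integers (not Ibis expressions). The values are hypothetical. On column expressions the unparenthesized form likewise does not build the intended condition.

```python
# Plain-Python illustration of operator precedence (not Ibis).
# The values are hypothetical.
mpg, cylinders = 30, 8

# Intended meaning: evaluate both comparisons, then combine them.
with_parens = (mpg > 0) & (cylinders < 5)

# Without parentheses, & is applied first, so this parses as the
# chained comparison  mpg > (0 & cylinders) < 5.
without_parens = mpg > 0 & cylinders < 5

print(with_parens)     # False
print(without_parens)  # True
```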

<a id="pandas_to_ibis">

## Convert Pandas dataframe to Ibis table expression

[[back to top]](#back_to_top)

In [29]:
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4 + ['three'] * 1, 'k2': [3, 2, 1, 3, 3, 4, 4, 1]})

In [30]:
data

|   | k1 | k2 |
|---|---|---|
| 0 | one | 3 |
| 1 | one | 2 |
| 2 | one | 1 |
| 3 | two | 3 |
| 4 | two | 3 |
| 5 | two | 4 |
| 6 | two | 4 |
| 7 | three | 1 |


In [31]:
ibis_data = ibis.memtable(data)

In [32]:
ibis_data

<a id="distinct">

## Obtain distinct values in column(s)

[[back to top]](#back_to_top)

Distinct values in a specific column:

In [33]:
ibis_data.select("k1").distinct()

Or distinct rows across all columns:

In [34]:
ibis_data.distinct()

<a id="duplicates">

## Find Duplicate Values

[[back to top]](#back_to_top)

In [35]:
(
    ibis_data
    .group_by('k1')
    .count()
    .relabel({'count': 'Count'})
    .filter(_.Count > 1)
)
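
The expression above groups by `k1`, counts the rows in each group, and keeps only the groups that occur more than once. A plain-Python sketch of the same idea (not Ibis), using the `k1` values from the DataFrame built earlier:

```python
from collections import Counter

# Same k1 values as the DataFrame built earlier.
k1 = ["one"] * 3 + ["two"] * 4 + ["three"] * 1

counts = Counter(k1)
# Keep only values that appear more than once.
duplicates = {value: n for value, n in counts.items() if n > 1}

print(duplicates)  # {'one': 3, 'two': 4}
```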

<a id="perc_of_rows">

## Percent of Rows

[[back to top]](#back_to_top)

In [36]:
data = pd.DataFrame(
    {'group': [80, 70, 75, 75],
     'ounces': [20, 30, 25, 25],
     'size': [100, 100, 100, 100]
    }
)

In [37]:
ibis_data = ibis.memtable(data)

In [38]:
ibis_data

In [39]:
(
    ibis_data
    .mutate(group_perc=_.group / (_.group + _.ounces + _.size) * 100)
    .mutate(ounces_perc=_.ounces / (_.group + _.ounces + _.size) * 100)
    .mutate(size_perc=_.size / (_.group + _.ounces + _.size) * 100)
    .select(s.contains('perc'))
)
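
Each value above is divided by the sum of its own row. As a quick arithmetic check in plain Python (not Ibis) on the same data, the first row sums to 200, so its shares are 40%, 10%, and 50%:

```python
# Plain-Python check of the row percentages (not Ibis), same data as above.
data = {
    "group": [80, 70, 75, 75],
    "ounces": [20, 30, 25, 25],
    "size": [100, 100, 100, 100],
}

row_percents = []
for values in zip(data["group"], data["ounces"], data["size"]):
    row_total = sum(values)  # denominator is the sum across the row
    row_percents.append(tuple(v / row_total * 100 for v in values))

print(row_percents[0])  # (40.0, 10.0, 50.0)
```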

<a id="perc_of_columns">

## Percent of Columns

[[back to top]](#back_to_top)

In [40]:
(
    ibis_data
    .mutate(group_perc=_.group / (_.group.sum()) * 100)
    .mutate(ounces_perc=_.ounces / (_.ounces.sum()) * 100)
    .mutate(size_perc=_.size / (_.size.sum()) * 100)
    .select(s.contains('perc'))
)
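
Here each value is divided by the sum of its own column instead. A plain-Python check (not Ibis) on the same data: the `ounces` column sums to 100, so its percentages equal the raw values.

```python
# Plain-Python check of the column percentages (not Ibis), same data as above.
data = {
    "group": [80, 70, 75, 75],
    "ounces": [20, 30, 25, 25],
    "size": [100, 100, 100, 100],
}

column_percents = {
    # denominator is the sum down the column
    name: [v / sum(values) * 100 for v in values]
    for name, values in data.items()
}

print(column_percents["ounces"])  # [20.0, 30.0, 25.0, 25.0]
print(column_percents["size"])    # [25.0, 25.0, 25.0, 25.0]
```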