<a id="back_to_top">

# Ibis Cheat Sheet

#### Table of Contents

- [Reading a local CSV file](#csv)
- [Viewing schema](#schema)
- [Number of rows and columns](#rows_columns)
- [Selecting certain columns](#selecting_columns)
- [Creating new column based on value from another column using mutate and split() combination](#split)
- [Creating new column using CASE statements](#case)
- [Running Total / Cumulative Sum](#running_total)
- [Aggregations / Summarizations](#aggregations)
- [Filtering](#filtering)

In [1]:
import ibis
ibis.options.interactive = True

# create a DuckDB client
client = ibis.duckdb.connect()

<a id="csv">

## Reading a CSV file

[[back to top]](#back_to_top)

In [2]:
# read in a local CSV file as an Ibis table
cars = client.read_csv('./data/cars.csv')

In [3]:
cars.head()

<a id="schema">

## Viewing Schema

[[back to top]](#back_to_top)

In [4]:
cars.schema()

ibis.Schema {
  Car           string
  MPG           float64
  Cylinders     int64
  Displacement  float64
  Horsepower    float64
  Weight        float64
  Acceleration  float64
  Model         int64
  Origin        string
}

<a id="rows_columns">

## Number of Rows and Columns

[[back to top]](#back_to_top)

\# of Rows:

In [5]:
cars.count()

406

\# of Columns:

In [6]:
len(cars.schema())

9

<a id="selecting_columns">

## Selecting Certain Columns

[[back to top]](#back_to_top)

In [7]:
cars['Origin','Car','MPG'].head()

#### Select columns based on their data types

In [8]:
from ibis import selectors as s
cars.select(s.of_type('float64'))

#### Use `|` (boolean OR) to select columns of more than one data type

In [9]:
cars.select(s.of_type('string') | s.of_type('int64'))

<a id="split">

## Creating new column based on value from another column using mutate and split() combination

[[back to top]](#back_to_top)

The `Car` column contains both the make and the model name of each car. To get the make, split the value in the `Car` column on a single space and take the first token (array index 0). The second token (index 1) gives the model name; note that if a model name contains more than one word, index 1 captures only its first word.

In [10]:
cars = cars.mutate(Make=cars.Car.split(' ')[0])
cars = cars.mutate(ModelName=cars.Car.split(' ')[1])

In [11]:
cars.head()
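To make the token indexing concrete, here is a plain-Python sketch that mirrors `cars.Car.split(' ')[0]` and `[1]` on a single string. The car name is a made-up sample value, not a row taken from `cars.csv`, and the split-once variant at the end is only one possible way to keep a multi-word model name together.

```python
# Plain-Python mirror of the split-and-index logic; the value below is hypothetical
car = "Chevrolet Chevelle Malibu"

tokens = car.split(" ")
make = tokens[0]          # first token: the make
second_token = tokens[1]  # second token only, not the whole model name

# Optional variant: split once so everything after the make stays together
make_once, full_model = car.split(" ", 1)

print(make, second_token, full_model)
```

For a two-word value such as a make followed by a single-word model, both approaches give the same result.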

<a id="case">

## Creating new column using CASE statements

[[back to top]](#back_to_top)

In [12]:
foods = client.read_csv('./data/food.csv')

In [13]:
foods

In [14]:
case = (
    foods.food
    .case()
    .when('bacon', 'pig')
    .when('pulled pork', 'pig')
    .when('pastrami', 'cow')
    .when('corned beef', 'cow')
    .when('honey ham', 'pig')
    .else_('salmon')
    .end()
)

foods = foods.mutate(animal=case)

In [15]:
foods

<a id="running_total">

## Running Total / Cumulative Sum

[[back to top]](#back_to_top)

Let's calculate the running total (cumulative sum) of MPG by country of origin, with rows ordered by MPG within each origin.

In [16]:
# Define running total calculation
window_spec = ibis.window(
    order_by='MPG',
    group_by='Origin'
)

running_total = cars['MPG'].cumsum().over(window_spec).name('running_total')

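# Apply the running-total window expression. To get a space in the column name, use relabel() to rename the column.
# Note (assuming a recent Ibis release): relabel() is deprecated in favor of rename(), which maps new names to old names.
cars_cumulative_mpg = (
    cars['Origin', 'MPG']
    .mutate(running_total=running_total)
    .relabel(dict(running_total='Running Total'))
)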

In [17]:
cars_cumulative_mpg.filter(cars_cumulative_mpg['Origin'] == 'US')

In [18]:
cars_cumulative_mpg.filter(cars_cumulative_mpg['Origin'] == 'Japan')

In [19]:
cars_cumulative_mpg.filter(cars_cumulative_mpg['Origin'] == 'Europe')
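As a rough illustration of what the window specification above computes, the following plain-Python sketch sorts rows by origin and MPG and restarts the cumulative sum for each origin. The rows are invented for illustration, and the sketch ignores details such as null handling; it is not how Ibis or DuckDB evaluates the query.

```python
from itertools import accumulate, groupby

# Hypothetical (origin, mpg) rows; these values are illustrative, not from cars.csv
rows = [("US", 18.0), ("Japan", 24.0), ("US", 15.0), ("Japan", 31.0), ("US", 16.0)]

# Sort by group key, then by MPG, mirroring group_by='Origin' and order_by='MPG'
ordered = sorted(rows, key=lambda r: (r[0], r[1]))

running = []
for origin, group in groupby(ordered, key=lambda r: r[0]):
    mpgs = [mpg for _, mpg in group]
    # accumulate() starts over for each origin, like a partitioned window
    for mpg, total in zip(mpgs, accumulate(mpgs)):
        running.append((origin, mpg, total))

for row in running:
    print(row)
```

Each origin gets its own running total, which is why the three filtered views above each start from that origin's lowest MPG value.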

<a id="aggregations">

## Aggregations / Summarizations

[[back to top]](#back_to_top)

**GOAL:** By country of origin, calculate the minimum, maximum, and average MPG.

In [20]:
# The keyword names become the output column names, so no extra .name() call is needed
agged = cars.group_by('Origin').aggregate(
    min_mpg=cars['MPG'].min(),
    max_mpg=cars['MPG'].max(),
    avg_mpg=cars['MPG'].mean(),
)

agged.relabel(
    dict(
        min_mpg='Min MPG',
        max_mpg='Max MPG',
        avg_mpg='Avg MPG',
    )
).head()
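For readers who want to check the grouped summary by hand, here is a small plain-Python sketch of the same idea: collect MPG values per origin, then take the minimum, maximum, and mean of each group. The sample rows are hypothetical, and this only mirrors the result shape, not the Ibis API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (origin, mpg) rows; illustrative values, not from cars.csv
rows = [("US", 18.0), ("Japan", 24.0), ("US", 15.0), ("Japan", 31.0)]

# Group MPG values by origin
by_origin = defaultdict(list)
for origin, mpg in rows:
    by_origin[origin].append(mpg)

# One summary record per origin, using the same labels as the relabeled table
summary = {
    origin: {"Min MPG": min(vals), "Max MPG": max(vals), "Avg MPG": mean(vals)}
    for origin, vals in by_origin.items()
}

print(summary)
```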

<a id="filtering">

## Filtering

[[back to top]](#back_to_top)

In [21]:
cars

When passing multiple conditions to `filter()`, wrap each condition in parentheses. Combine conditions with the `&` and `|` operators for `AND` and `OR` logic, respectively.
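The parentheses matter because Python's `&` and `|` bind more tightly than comparison operators such as `>` and `<`. The sketch below shows the effect with plain integers standing in for column values (`a` and `b` are hypothetical names); with Ibis column expressions, the unparenthesized form is likely to raise an error or build a different expression than intended.

```python
# Plain integers stand in for column values; a and b are hypothetical
a, b = 3, 7

# Intended logic: (a > 0) AND (b < 5)
with_parens = (a > 0) & (b < 5)

# Without parentheses, & binds first, so this is read as a > (0 & b) < 5,
# which is a chained comparison rather than the intended AND
without_parens = a > 0 & b < 5

print(with_parens, without_parens)
```

The two expressions disagree for these values, which is the kind of silent mistake the parentheses prevent.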

In [22]:
cars.filter(
    (cars['MPG'] > 0)
    & (cars['Cylinders'] < 5)
)['MPG'].mean()

29.118750000000002

In [23]:
cars.filter(
    (cars['MPG'] > 0) & (cars['Cylinders'] < 5)
)

In [24]:
min_mpg = 0
max_cyl = 5
condition = (cars['MPG'] > min_mpg) & (cars['Cylinders'] < max_cyl)

cars.filter(condition)