<a id="back_to_top">

# Ibis Cheat Sheet

#### Table of Contents

- [Obtaining built-in sample data sets](#sample_data_sets)
- [Reading a CSV file](#csv)
- [Viewing schema](#schema)
- [Number of rows and columns](#rows_columns)
- [Selecting certain columns](#selecting_columns)
- [Creating new column based on value from another column using mutate and split() combination](#split)
- [Creating new column using CASE statements](#case)
- [Aggregations / Summarizations](#aggregations)
- [Running Total / Cumulative Sum](#running_total)
- [Renaming a column](#renaming)
- [Forward fill / backward fill](https://ibis-project.org/posts/ffill-and-bfill-using-ibis/index.html)
- [Finding maximum value across all groups](#argmax)
- [Filtering](#filtering)
- [Convert Pandas dataframe to Ibis table expression](#pandas_to_ibis)
- [Convert Ibis table expression to Pandas DataFrame](#ibis_to_pandas)
- [Obtain distinct values in column(s)](#distinct)
- [Find Duplicate Values](#duplicates)
- [Percent of Rows](#perc_of_rows)
- [Percent of Columns](#perc_of_columns)
- [Adding Leading Zeros](#zfill)
- [Creating column values from column names](https://www.youtube.com/watch?v=-0pjPE6VgDs#t=17m)
- [Unnest an array](https://www.youtube.com/watch?v=J7sEn9VklKY#t=7m)
- [Custom Array Functions](https://www.youtube.com/watch?v=6TgpRMmvNQs)
- Additional resources:
    - YouTube Videos: [Phillip in the Cloud](https://www.youtube.com/@cpcloud)

In [1]:
import ibis
import ibis.selectors as s
import pandas as pd
from ibis import _
ibis.options.interactive = True

# create a DuckDB client
client = ibis.duckdb.connect()

<a id="sample_data_sets">

## Obtaining built-in sample data sets

[[back to top]](#back_to_top)

In [2]:
from ibis.interactive import ex

#### To get a list of available data sets

In [3]:
dir(ex)[:20]   # The list is quite long, so limited to just 20

['Aids2',
 'Aids2_raw',
 'Alfalfa',
 'Alfalfa_raw',
 'AllstarFull',
 'AllstarFull_raw',
 'Appearances',
 'Appearances_raw',
 'Assay',
 'Assay_raw',
 'AwardsManagers',
 'AwardsManagers_raw',
 'AwardsPlayers',
 'AwardsPlayers_raw',
 'AwardsShareManagers',
 'AwardsShareManagers_raw',
 'AwardsSharePlayers',
 'AwardsSharePlayers_raw',
 'BOD',
 'BOD_raw']

In [4]:
ex.relig_income_raw

relig_income_raw(name='relig_income_raw', help='Pew religion and income survey')

In [5]:
relig_income = ex.relig_income_raw.fetch()

In [6]:
relig_income

<a id="csv">

## Reading a CSV file

[[back to top]](#back_to_top)

In [7]:
# read in a local CSV file as an Ibis table
cars = client.read_csv('./data/cars.csv')

In [8]:
cars.head()

<a id="schema">

## Viewing Schema

[[back to top]](#back_to_top)

In [9]:
cars.schema()

ibis.Schema {
  Car           string
  MPG           float64
  Cylinders     int64
  Displacement  float64
  Horsepower    float64
  Weight        float64
  Acceleration  float64
  Model         int64
  Origin        string
}

<a id="rows_columns">

## Number of Rows and Columns

[[back to top]](#back_to_top)

Number of rows:

In [10]:
cars.count()

406

Number of Columns:

In [11]:
len(cars.schema())

9

<a id="selecting_columns">

## Selecting Certain Columns

[[back to top]](#back_to_top)

#### We can use pandas square bracket syntax

In [12]:
cars['Origin','Car','MPG'].head()

#### or use ibis `select()` method

In [13]:
cars.select('Origin', 'Car', 'MPG').head()

#### Select columns based on their data types

In [14]:
from ibis import selectors as s
cars.select(s.of_type('float64'))

#### We can also use boolean logic OR using `|` to select columns with different column types

In [15]:
cars.select(s.of_type('string') | s.of_type('int64'))

#### We can also select based on range of column indices using special `s.r` object

In [16]:
cars.select(s.r[:3])

<a id="split">

## Creating new column based on value from another column using mutate and split() combination

[[back to top]](#back_to_top)

The `Car` column happens to contain both the car's make and model names.  If we want the make, we just need to extract the first token after splitting the string value in the `Car` column on a single space.  The model name is the 2nd token.  We will use array index 0 to obtain the make and index 1 to obtain the model name.

In [17]:
cars = cars.mutate(Make=cars.Car.split(' ')[0])
cars = cars.mutate(ModelName=cars.Car.split(' ')[1])

In [18]:
cars.head()

or we can create both columns in a single `mutate()` call

In [19]:
cars.mutate(
    Make=cars.Car.split(' ')[0],
    ModelName=cars.Car.split(' ')[1]
)

<a id="case">

## Creating new column using CASE statements

[[back to top]](#back_to_top)

In [20]:
foods = client.read_csv('./data/food.csv')

In [21]:
foods

In [22]:
case = (
    foods.food
    .case()
    .when('bacon', 'pig')
    .when('pulled pork', 'pig')
    .when('pastrami', 'cow')
    .when('corned beef', 'cow')
    .when('honey ham', 'pig')
    .else_('salmon')
    .end()
)

foods = foods.mutate(animal=case)

In [23]:
foods

<a id="aggregations">

## Aggregations / Summarizations

[[back to top]](#back_to_top)

In [24]:
cars.MPG.max()

46.6

In [25]:
cars['MPG'].max()

46.6

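It may be surprising that you can't call `min()`, `max()`, etc. directly on the result of `select()`. That is because `select()` returns a table, not a column, even when only one column is selected.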

In [26]:
cars.select('MPG').max()

AttributeError: 'Table' object has no attribute 'max'

**GOAL:** By country of origin, calculate min mpg, max mpg, and avg mpg.

**NOTE:** To group by more than one column, pass a list of column names to `group_by()`.

In [27]:
agged = cars.group_by('Origin').aggregate(
    min_mpg=cars['MPG'].min(),
    max_mpg=cars['MPG'].max(),
    avg_mpg=cars['MPG'].mean(),
)
agged

<a id="renaming">

## Renaming column names

[[back to top]](#back_to_top)

In case you want to rename the columns:

In [28]:
agged.rename(
    min_miles_per_gallon='min_mpg',
    max_miles_per_gallon='max_mpg',
    avg_miles_per_gallon='avg_mpg',
)

Using a window function like `row_number()` to obtain the top 3 most fuel-efficient cars from each Origin/Country (`ibis.row_number()` starts at 0, hence the `+ 1`):

In [29]:
(
    cars
    .mutate(rank=ibis.row_number().over(group_by=_.Origin, order_by=ibis.desc(_.MPG)) + 1)
    .order_by(['Origin', ibis.desc('MPG')])
    .filter(_.rank <= 3)
)

<a id="running_total">

## Running Total / Cumulative Sum

[[back to top]](#back_to_top)

Let's calculate running total or cumulative sum of MPG by country of origin.

If you have a background in SQL and are familiar with window functions, the cell below defines the "window" that groups and orders our data:

In [30]:
# Define running total calculation
window_spec = ibis.window(
    order_by='MPG',
    group_by='Origin'
)

Then we calculate the running total of MPG (cumulative sum) over the window we just defined. The cumulative sum will be in a column called "running_total".

In [31]:
running_total = cars['MPG'].cumsum().over(window_spec).name('running_total')

In [32]:
running_total

Add the `running_total` expression as a new column with `mutate()`.

In [33]:
cars_cumulative_mpg = cars['Origin','Car','MPG'].mutate(running_total=running_total)

Since we can't view the entire result at once, we will look at the results for each Origin and find which of its cars had the highest MPG.

In [34]:
cars_cumulative_mpg.filter(cars_cumulative_mpg['Origin'] == 'US')

In [35]:
cars_cumulative_mpg.filter(cars_cumulative_mpg['Origin'] == 'US').filter(_.MPG == _.MPG.max())

In [36]:
cars_cumulative_mpg.filter(cars_cumulative_mpg['Origin'] == 'Japan')

In [37]:
cars_cumulative_mpg.filter(cars_cumulative_mpg['Origin'] == 'Japan').filter(_.MPG == _.MPG.max())

In [38]:
cars_cumulative_mpg.filter(cars_cumulative_mpg['Origin'] == 'Europe')

In [39]:
cars_cumulative_mpg.filter(cars_cumulative_mpg['Origin'] == 'Europe').filter(_.MPG == _.MPG.max())

<a id="argmax">

## Finding maximum value across all groups

[[back to top]](#back_to_top)

Above, we had to run a separate filter for each of the 3 origins to find the one with the maximum MPG.  `argmax()` can do this in fewer steps: `Origin.argmax(MPG)` returns the value of `Origin` from the row where `MPG` is largest, looking across the whole table regardless of group.

The country of origin with the largest MPG across all origins:

In [40]:
cars_cumulative_mpg.Origin.argmax(cars_cumulative_mpg.MPG)

'Japan'

The expression above can then be used in a filter to obtain the row with the maximum MPG:

In [41]:
cars_cumulative_mpg.filter(
    (_.Origin == cars_cumulative_mpg.Origin.argmax(cars_cumulative_mpg.MPG))
    & (_.MPG == _.MPG.max())
)

<a id="filtering">

## Filtering

[[back to top]](#back_to_top)

In [42]:
cars

When passing multiple conditions to `filter()`, wrap each condition in parentheses.  Combine conditions with `&` for `AND` and `|` for `OR`.

In [43]:
cars.filter(
    (cars['MPG'] > 0)
    & (cars['Cylinders'] < 5)
)['MPG'].mean()

29.118750000000002

In [44]:
cars.filter(
    (cars['MPG'] > 0) & (cars['Cylinders'] < 5)
)

In [45]:
min_mpg = 0
max_cyl = 5
condition = (cars['MPG'] > min_mpg) & (cars['Cylinders'] < max_cyl)

cars.filter(condition)

<a id="pandas_to_ibis">

## Convert Pandas dataframe to Ibis table expression

[[back to top]](#back_to_top)

In [46]:
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4 + ['three'] * 1, 'k2': [3, 2, 1, 3, 3, 4, 4, 1]})

In [47]:
data

|   | k1 | k2 |
|---|---|---|
| 0 | one | 3 |
| 1 | one | 2 |
| 2 | one | 1 |
| 3 | two | 3 |
| 4 | two | 3 |
| 5 | two | 4 |
| 6 | two | 4 |
| 7 | three | 1 |


In [48]:
ibis_data = ibis.memtable(data)

In [49]:
ibis_data

<a id="ibis_to_pandas">

## Convert Ibis table expression to Pandas DataFrame

[[back to top]](#back_to_top)

In [50]:
pdf = ibis_data.to_pandas()

In [51]:
pdf

|   | k1 | k2 |
|---|---|---|
| 0 | one | 3 |
| 1 | one | 2 |
| 2 | one | 1 |
| 3 | two | 3 |
| 4 | two | 3 |
| 5 | two | 4 |
| 6 | two | 4 |
| 7 | three | 1 |


<a id="distinct">

## Obtain distinct values in column(s)

[[back to top]](#back_to_top)

Distinct values in specific column

In [52]:
ibis_data.select("k1").distinct()

or across all columns

In [53]:
ibis_data.distinct()

<a id="duplicates">

## Find Duplicate Values

[[back to top]](#back_to_top)

In [54]:
(
    ibis_data
    .group_by('k1')
    .aggregate(
        Count = _.k1.count()
    )
    .filter(_.Count > 1)
)

<a id="perc_of_rows">

## Percent of Rows

[[back to top]](#back_to_top)

In [55]:
data = pd.DataFrame(
    {'group': [80, 70, 75, 75],
     'ounces': [20, 30, 25, 25],
     'size': [100, 100, 100, 100]
    }
)

In [56]:
ibis_data = ibis.memtable(data)

In [57]:
ibis_data

In [58]:
(
    ibis_data
    .mutate(group_perc=_.group / (_.group + _.ounces + _.size) * 100)
    .mutate(ounces_perc=_.ounces / (_.group + _.ounces + _.size) * 100)
    .mutate(size_perc=_.size / (_.group + _.ounces + _.size) * 100)
    .select(s.contains('perc'))
)

<a id="perc_of_columns">

## Percent of Columns

[[back to top]](#back_to_top)

In [59]:
(
    ibis_data
    .mutate(group_perc=_.group / (_.group.sum()) * 100)
    .mutate(ounces_perc=_.ounces / (_.ounces.sum()) * 100)
    .mutate(size_perc=_.size / (_.size.sum()) * 100)
    .select(s.contains('perc'))
)

<a id="zfill">

## Adding Leading Zeros

[[back to top]](#back_to_top)

pandas has a handy [zfill()](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.zfill.html) method which ibis does not have.  But it is relatively easy to implement this using ibis.

In [60]:
us_fips_codes = client.read_csv('data/co-est2020-alldata.csv')

In [61]:
us_fips_codes

Goal:
- New column: the state FIPS code is a 2-character code that should have a leading zero when the state code is a single digit
- New column: the county FIPS code is a 3-character code that should have 2 leading zeros when the county code is a single digit, and 1 leading zero when it has 2 digits

In [62]:
(
    us_fips_codes
    .mutate(state_fips=ibis.literal("0") * (2 - us_fips_codes.STATE.cast('string').length()) + us_fips_codes.STATE.cast('string'))
    .mutate(county_fips=ibis.literal("0") * (3 - us_fips_codes.COUNTY.cast('string').length()) + us_fips_codes.COUNTY.cast('string'))
)