In [1]:
!pip install "polars[all]"

Collecting polars[all]
  Downloading polars-0.14.26-cp37-abi3-macosx_11_0_arm64.whl (11.7 MB)
Collecting xlsx2csv>=0.8.0
  Downloading xlsx2csv-0.8.0-py3-none-any.whl (13 kB)
Collecting connectorx
  Downloading connectorx-0.3.1-cp39-cp39-macosx_11_0_arm64.whl (42.0 MB)

# Introduction to Polars

## Polars provides next-generation data frames for Python

* **Expressive.** Queries are familiar, readable, and composable.
* **Parallel.** Can use all available cores/threads.
* **Fast.** Among the fastest in-memory data frames.
* **Lazy.** Allows lazy evaluation for
    * Efficient memory usage
    * Query optimization
    * Filter pushdown
* **Eager.** Allows eager evaluation for convenience on small data sets.

In [3]:
import polars as pl

## Our first dataframe

In [4]:
df = pl.DataFrame({"Names": ["Iverson", "Malone", "Bergen"],
                   "Python_mastery": [10, 5, 1.0],
                   "Love_of_R": [2, 5, 11],
                   "years_at_wsu": [4, 17, 5]})
df.head()

Names,Python_mastery,Love_of_R,years_at_wsu
str,f64,i64,i64
"""Iverson""",10.0,2,4
"""Malone""",5.0,5,17
"""Bergen""",1.0,11,5


## Reading from a csv

* Most data sets will be read in from a CSV or JSON data file
* `polars` provides `read_csv` and `read_json`

### Open a local file w/ relative path

In [5]:
# Won't work in colab
artists = pl.read_csv('./data/Artists.csv')
artists.head()

ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
i64,str,str,str,str,i64,i64,str,i64
1,"""Robert Arneson...","""American, 1930...","""American""","""Male""",1930,1992,,
2,"""Doroteo Arnaiz...","""Spanish, born ...","""Spanish""","""Male""",1936,0,,
3,"""Bill Arnold""","""American, born...","""American""","""Male""",1941,0,,
4,"""Charles Arnold...","""American, born...","""American""","""Male""",1946,0,"""Q1063584""",500027998.0
5,"""Per Arnoldi""","""Danish, born 1...","""Danish""","""Male""",1941,0,,


### Open a web address

In [6]:
url = "https://github.com/MuseumofModernArt/collection/raw/master/Artists.csv"
artists = pl.read_csv(url)
artists.head()

ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
i64,str,str,str,str,i64,i64,str,i64
1,"""Robert Arneson...","""American, 1930...","""American""","""Male""",1930,1992,,
2,"""Doroteo Arnaiz...","""Spanish, born ...","""Spanish""","""Male""",1936,0,,
3,"""Bill Arnold""","""American, born...","""American""","""Male""",1941,0,,
4,"""Charles Arnold...","""American, born...","""American""","""Male""",1946,0,"""Q1063584""",500027998.0
5,"""Per Arnoldi""","""Danish, born 1...","""Danish""","""Male""",1941,0,,


# JSON data file

* Another (more modern) storage format
* Here the data is stored as a list of row `dict`s

```{json}
[
{
  "ConstituentID": 1,
  "DisplayName": "Robert Arneson",
  "ArtistBio": "American, 1930–1992",
  "Nationality": "American",
  "Gender": "Male",
  "BeginDate": 1930,
  "EndDate": 1992,
  "Wiki QID": null,
  "ULAN": null
},
{
  "ConstituentID": 2,
  "DisplayName": "Doroteo Arnaiz",
  "ArtistBio": "Spanish, born 1936",
  "Nationality": "Spanish",
  "Gender": "Male",
  "BeginDate": 1936,
  "EndDate": 0,
  "Wiki QID": null,
  "ULAN": null
},
...
```

## `polars` can read `json` data

In [7]:
# Won't work in colab
artists = pl.read_json('./data/Artists.json')
artists.head()

ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
i64,str,str,str,str,i64,i64,str,str
1,"""Robert Arneson...","""American, 1930...","""American""","""Male""",1930,1992,,
2,"""Doroteo Arnaiz...","""Spanish, born ...","""Spanish""","""Male""",1936,0,,
3,"""Bill Arnold""","""American, born...","""American""","""Male""",1941,0,,
4,"""Charles Arnold...","""American, born...","""American""","""Male""",1946,0,"""Q1063584""","""500027998"""
5,"""Per Arnoldi""","""Danish, born 1...","""Danish""","""Male""",1941,0,,


## <font color="red"> Exercise 2.1.1 </font>
    
Use tab-completion and `help` to discover and explore two more methods of reading a file into a `polars` dataframe.


In [None]:
pl.read_ #<-- Tab here

> Discuss what you found here

> Looks like we can read csv, excel, json, parquet; as well as perform batch jobs.

## <font color="red"> Exercise 2.1.2 </font>
    
Read in the `Artwork.csv` from [https://github.com/MuseumofModernArt/collection](https://github.com/MuseumofModernArt/collection) and display the head of the resulting dataframe.


In [10]:
# Your code here

## So what is a `DataFrame`?

* Like R, Polars focuses on columns
* Think `dict` of `(str, Series)` pairs 
* A series is a typed list-like structure

In [17]:
# This is how I imagine a dataframe
df = pl.DataFrame({"Names": ["Iverson", "Malone", "Bergen"],
                   "Python_mastery": [10, 5, 1.0],
                   "years_at_wsu": [4.5, 17.5, 5.5]})

In [18]:
type(df)

polars.internals.dataframe.frame.DataFrame

## Columns are `Series` and hold one type of data

In [19]:
type(artists['BeginDate']), type(artists['DisplayName'])

(polars.internals.series.series.Series, polars.internals.series.series.Series)

In [20]:
artists['BeginDate'].dtype, artists['DisplayName'].dtype

(polars.datatypes.Int64, polars.datatypes.Utf8)

## Two ways to access a column

* **Method 1:** Actual data series
    * `df["column_name"]`
* **Method 2:** Lazy column expression, used inside other contexts
    * `pl.col('column_name')`
    * Only for proper names!

In [22]:
artists['BeginDate'].head(2)

shape: (2,)
Series: 'BeginDate' [i64]
[
	1930
	1936
]

In [24]:
pl.col('BeginDate') # Nothing (yet)

## More on data types

* See all data types with `df.dtypes`
* You can set the `dtypes` when you read a dataframe
* Read more about types: [Pandas docs](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html)

In [25]:
import polars.datatypes

dir(polars.datatypes)

['Any',
 'Binary',
 'Boolean',
 'Callable',
 'Categorical',
 'ColumnsType',
 'DTYPE_TEMPORAL_UNITS',
 'DataType',
 'DataTypeMappings',
 'Date',
 'Datetime',
 'Decimal',
 'Dict',
 'Duration',
 'Field',
 'Float32',
 'Float64',
 'ForwardRef',
 'Int16',
 'Int32',
 'Int64',
 'Int8',
 'List',
 'Mapping',
 'NoneType',
 'Null',
 'Object',
 'OptionType',
 'Optional',
 'PolarsDataType',
 'Schema',
 'Sequence',
 'Struct',
 'T',
 'TYPE_CHECKING',
 'TemporalDataType',
 'Time',
 'Tuple',
 'Type',
 'TypeVar',
 'UInt16',
 'UInt32',
 'UInt64',
 'UInt8',
 'Union',
 'UnionType',
 'Unknown',
 'Utf8',
 '_DOCUMENTING',
 '_DataTypeMappings',
 '_PYARROW_AVAILABLE',
 '_SimpleCData',
 '__annotations__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_base_type',
 '_custom_reconstruct',
 '_get_idx_type',
 'annotations',
 'cache',
 'ctypes',
 'date',
 'datetime',
 'dtype_str_repr',
 'dtype_to_arrow_type',
 'dtype_to_ctype',
 'dtype_to_ffiname',


In [26]:
artists.dtypes

[polars.datatypes.Int64,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8,
 polars.datatypes.Int64,
 polars.datatypes.Int64,
 polars.datatypes.Utf8,
 polars.datatypes.Utf8]

## Setting `dtypes` with `read_csv`

We can pass a `dict` of types to the `dtypes` keyword

In [27]:
import numpy as np
artist_types = {'ConstituentID': pl.Int64,
                'DisplayName': pl.Utf8,
                'ArtistBio': pl.Utf8,
                'Nationality': pl.Utf8,
                'Gender':pl.Utf8,
                'BeginDate': pl.Int64,
                'EndDate': pl.Int64,
                'Wiki QID': pl.Utf8,
                'ULAN': pl.Int64}
# Same file as before; note that paths are case-sensitive on many systems
artists2 = pl.read_csv('./data/Artists.csv', dtypes=artist_types)
artists2.head()

ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
i64,str,str,str,str,i64,i64,str,i64
1,"""Robert Arneson...","""American, 1930...","""American""","""Male""",1930,1992,,
2,"""Doroteo Arnaiz...","""Spanish, born ...","""Spanish""","""Male""",1936,0,,
3,"""Bill Arnold""","""American, born...","""American""","""Male""",1941,0,,
4,"""Charles Arnold...","""American, born...","""American""","""Male""",1946,0,"""Q1063584""",500027998.0
5,"""Per Arnoldi""","""Danish, born 1...","""Danish""","""Male""",1941,0,,


## More on `None` and `NaN`

* `None`/`null` is a missing value.
* `NaN` represents the result of an undefined operation
* `NaN` is **not** missing

In [28]:
df = pl.DataFrame({'a': [-1, 0 , 1, None],
                   'b': [1, 2, None, 4],
                   'c': [1, 2, float('nan'), 4]})
df

a,b,c
i64,i64,f64
-1.0,1.0,1.0
0.0,2.0,2.0
1.0,,
,4.0,4.0


### `NaN`s are the result of undefined operations

In [29]:
df_w_sqrt = (df
             .select([pl.col('a'),
                      pl.col('a').sqrt().alias('sqrt_a'),
                     ])
)
df_w_sqrt

a,sqrt_a
i64,f64
-1.0,
0.0,0.0
1.0,1.0
,


### `NaN`s are not `None`

In [30]:
(df_w_sqrt
 .with_columns([
          pl.col('sqrt_a').is_null().alias('sqrt_a_is_null'),
          pl.col('sqrt_a').is_nan().alias('sqrt_a_is_nan'),
             ])
)

a,sqrt_a,sqrt_a_is_null,sqrt_a_is_nan
i64,f64,bool,bool
-1.0,,False,True
0.0,0.0,False,False
1.0,1.0,False,False
,,True,


## `NaN` and `None` affect aggregation differently.

We will discuss the effects of these values on aggregation in a future lecture.

#### Getting summary statistics for each column with `describe`

In [32]:
df.describe()

describe,a,b,c
str,f64,f64,f64
"""count""",4.0,4.0,4.0
"""null_count""",1.0,1.0,0.0
"""mean""",0.0,2.333333,
"""std""",1.0,1.527525,
"""min""",-1.0,1.0,1.0
"""max""",1.0,4.0,4.0
"""median""",0.0,2.0,3.0


## Preview of coming attractions

* Now we can switch `BeginDate` and `EndDate` from `0` to `null`
* We will do this in the next section

# Getting to know your data

## Basic inspection tools

* `df.head()`        first five rows
* `df.tail()`        last five rows
* `df.sample(5)`     random sample of rows
* `df.shape`         number of rows/columns in a tuple
* `df.describe()`    summary statistics (count, mean, std, min, max, median)
* `df.schema`        column names and their data types

## <font color="red"> Exercise 2.1.3: Inspect the artwork from MoMA </font>

#### Read the csv and inspect the `head`

In [None]:
artwork.head()

**Task:** Write a few sentences describing any problems you see

*Your thoughts here*

#### Inspect the column names with the `columns` attribute

In [17]:
artwork.columns

Index(['Title', 'Artist', 'ConstituentID', 'ArtistBio', 'Nationality',
       'BeginDate', 'EndDate', 'Gender', 'Date', 'Medium', 'Dimensions',
       'CreditLine', 'AccessionNumber', 'Classification', 'Department',
       'DateAcquired', 'Cataloged', 'ObjectID', 'URL', 'ThumbnailURL',
       'Circumference (cm)', 'Depth (cm)', 'Diameter (cm)', 'Height (cm)',
       'Length (cm)', 'Weight (kg)', 'Width (cm)', 'Seat Height (cm)',
       'Duration (sec.)'],
      dtype='object')

**Question:** See any problems?

*Your thoughts here*

#### Inspect the tail

In [None]:
artwork.tail()

#### Check out the `shape`

In [None]:
artwork.shape

**Question:** What do these numbers mean?

*Your thoughts here*

#### Use `describe` to compute statistics

In [None]:
artwork.describe()

**Question:** What did you learn from the last cell?

*Your thoughts here*