## Colab Prep

Execute the following code cells whenever you open or restart the notebook in Google Colab.

In [None]:
!pip install "polars[all]" #execute each time you start/restart a Colab session

In [None]:
!wget https://github.com/WSU-DataScience/dsci_325_module6_basic_data_management_in_python/raw/main/sample_data.zip

In [None]:
!unzip ./sample_data.zip

# Module 6.1 - Reading Data in `polars`

## Dataframes in Python

Here is a summary of some of the important data management libraries in Python.

* `pandas` was the first (and still most popular) data frame library.  It was based on `R` data frames, but is starting to show its age.
* `polars` is a new library similar to `pandas`, but has new features that make it easier to work with and more efficient for large data and multi-core machines.
* `pyspark` is used for managing very large data on a distributed network of machines.
* `koalas` is an interface to `pyspark` that is based on the `pandas` interface.

**Note.** We will be primarily focusing on `polars`, but will occasionally need to convert to `pandas` to work with other libraries.

## Polars provides Python next-generation data frames

* **Expressive.** Queries are familiar, readable, and composable.
* **Parallel.** Can use all cores/threads
* **Fast.** Among the fastest in-memory data frames
* **Lazy.** Allows lazy evaluation for
    * Efficient memory usage
    * Query optimization
    * Filter pushdown
* **Eager.** Allows eager evaluation for convenience on small data sets.

In [1]:
import polars as pl

## Our first dataframe

In [4]:
df = pl.DataFrame({"Names": ["Iverson", "Malone", "Bergen"],
                   "Python_mastery": [10.0, 5.5, 1.0],
                   "Love_of_R": [2, 5, 11],
                   "years_at_wsu": [4, 17, 5]})
df.head()

Names,Python_mastery,Love_of_R,years_at_wsu
str,f64,i64,i64
"""Iverson""",10.0,2,4
"""Malone""",5.5,5,17
"""Bergen""",1.0,11,5


## Reading from a data file

* Most data sets will be read in from a CSV or JSON data file
* `polars` provides `read_csv` and `read_json`

### Open a CSV file from a local file w/ relative path

In [5]:
artists = pl.read_csv('./sample_data/Artists.csv')
artists.head()

ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
i64,str,str,str,str,i64,i64,str,i64
1,"""Robert Arneson""","""American, 1930–1992""","""American""","""Male""",1930,1992,,
2,"""Doroteo Arnaiz""","""Spanish, born 1936""","""Spanish""","""Male""",1936,0,,
3,"""Bill Arnold""","""American, born 1941""","""American""","""Male""",1941,0,,
4,"""Charles Arnoldi""","""American, born 1946""","""American""","""Male""",1946,0,"""Q1063584""",500027998.0
5,"""Per Arnoldi""","""Danish, born 1941""","""Danish""","""Male""",1941,0,,


### Open a CSV using a web address

In [6]:
url = "https://github.com/MuseumofModernArt/collection/raw/main/Artists.csv"
artists =  pl.read_csv(url)
artists.head()

ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
i64,str,str,str,str,i64,i64,str,i64
1,"""Robert Arneson""","""American, 1930–1992""","""American""","""male""",1930,1992,,
2,"""Doroteo Arnaiz""","""Spanish, born 1936""","""Spanish""","""male""",1936,0,,
3,"""Bill Arnold""","""American, born 1941""","""American""","""male""",1941,0,,
4,"""Charles Arnoldi""","""American, born 1946""","""American""","""male""",1946,0,"""Q1063584""",500027998.0
5,"""Per Arnoldi""","""Danish, born 1941""","""Danish""","""male""",1941,0,,


### What is a JSON data file?

* Another (more modern) storage format
* Here the data is stored as a list of rows, each row being a `dict`-like object

```{json}
[
{
  "ConstituentID": 1,
  "DisplayName": "Robert Arneson",
  "ArtistBio": "American, 1930–1992",
  "Nationality": "American",
  "Gender": "Male",
  "BeginDate": 1930,
  "EndDate": 1992,
  "Wiki QID": null,
  "ULAN": null
},
{
  "ConstituentID": 2,
  "DisplayName": "Doroteo Arnaiz",
  "ArtistBio": "Spanish, born 1936",
  "Nationality": "Spanish",
  "Gender": "Male",
  "BeginDate": 1936,
  "EndDate": 0,
  "Wiki QID": null,
  "ULAN": null
},
...
```

### `polars` can read `json` data

In [7]:
artists =  pl.read_json('./sample_data/Artists.json')
artists.head()

ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
i64,str,str,str,str,i64,i64,str,str
1,"""Robert Arneson""","""American, 1930–1992""","""American""","""Male""",1930,1992,,
2,"""Doroteo Arnaiz""","""Spanish, born 1936""","""Spanish""","""Male""",1936,0,,
3,"""Bill Arnold""","""American, born 1941""","""American""","""Male""",1941,0,,
4,"""Charles Arnoldi""","""American, born 1946""","""American""","""Male""",1946,0,"""Q1063584""","""500027998"""
5,"""Per Arnoldi""","""Danish, born 1941""","""Danish""","""Male""",1941,0,,


## <font color="red"> Exercise 6.1.1 </font>
    
Use tab-completion and `help` to discover and explore two more methods of reading a file into a `polars` dataframe.


In [None]:
pl.read_ #<-- Tab here

> Discuss what you found here

## <font color="red"> Exercise 6.1.2 </font>
    
Read in the `./sample_data/Artwork.csv` from [https://github.com/MuseumofModernArt/collection](https://github.com/MuseumofModernArt/collection) and display the head of the resulting dataframe.


In [None]:
# Your code here

## Working with other character encodings

Data stored in a text file 

* Is encoded using some [character encoding](https://en.wikipedia.org/wiki/Character_encoding) and 
* Is commonly stored using [UTF-8](https://en.wikipedia.org/wiki/UTF-8), but
* Needs to be read and converted when using another encoding.
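The points above can be illustrated with plain Python strings and bytes. This is only a sketch using a single word as sample text; it shows why bytes written in one encoding cannot always be read as another.

```python
text = "Cézanne"

latin1_bytes = text.encode("ISO-8859-1")  # é is the single byte 0xE9
utf8_bytes = text.encode("utf-8")         # é is the two bytes 0xC3 0xA9

print(latin1_bytes)
print(utf8_bytes)

# Decoding with the wrong codec fails ...
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print("decode failed:", err.reason)

# ... while the matching codec recovers the original text
print(latin1_bytes.decode("ISO-8859-1"))
```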

### Example - MoMA exhibitions

An example of a data set that is stored with a non-standard encoding is the `./sample_data/MoMAExhibitions1929to1989.csv` provided by the [Museum of Modern Art (MoMA)](https://github.com/MuseumofModernArt/collection).

### The exhibition file gives encoding errors by default

When trying to read this file, we get an error about the encoding.

In [8]:
exhibitions = pl.read_csv('./sample_data/MoMAExhibitions1929to1989.csv')

ComputeError: could not parse `"C�zanne, Gauguin, Seurat, Van Gogh"` as dtype `str` at column 'ExhibitionTitle' (column number 3)

The current offset in the file is 403 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `"C�zanne, Gauguin, Seurat, Van Gogh"` to the `null_values` list.

Original error: ```invalid utf-8 sequence```

## Switching encodings fixes the problem

* This file uses ISO-8859-1 encoding, see [this Stack Overflow question](https://stackoverflow.com/questions/18171739/unicodedecodeerror-when-reading-csv-file-in-pandas-with-python)
* More details on [ISO-8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1)
* How to read non-utf8 encodings
    * Use Python's tools (`with` statement and `open`) to read the file.
    * Encode as `utf-8` and pass to `polars`

In [10]:
with open('./sample_data/MoMAExhibitions1929to1989.csv', 'r', encoding='ISO-8859-1') as fh:
    converted_file = fh.read().encode('utf-8')
    exhibitions = pl.read_csv(converted_file,
                              ignore_errors=True,
                              try_parse_dates=True)
    
exhibitions.head(2)

ExhibitionID,ExhibitionNumber,ExhibitionTitle,ExhibitionCitationDate,ExhibitionBeginDate,ExhibitionEndDate,ExhibitionSortOrder,ExhibitionURL,ExhibitionRole,ExhibitionRoleinPressRelease,ConstituentID,ConstituentType,DisplayName,AlphaSort,FirstName,MiddleName,LastName,Suffix,Institution,Nationality,ConstituentBeginDate,ConstituentEndDate,ArtistBio,Gender,VIAFID,WikidataID,ULANID,ConstituentURL
i64,str,str,str,str,str,i64,str,str,str,i64,str,str,str,str,str,str,str,str,str,i64,i64,str,str,i64,str,i64,str
2557,"""1""","""Cézanne, Gauguin, Seurat, Van …","""[MoMA Exh. #1, November 7-Dece…","""11/7/1929""","""12/7/1929""",1,"""moma.org/calendar/exhibitions/…","""Curator""","""Director""",9168,"""Individual""","""Alfred H. Barr, Jr.""","""Barr Alfred H. Jr.""","""Alfred""","""H.""","""Barr""","""Jr.""",,"""American""",1902,1981,"""American, 19021981""","""Male""",109252853,"""Q711362""",500241556,"""moma.org/artists/9168"""
2557,"""1""","""Cézanne, Gauguin, Seurat, Van …","""[MoMA Exh. #1, November 7-Dece…","""11/7/1929""","""12/7/1929""",1,"""moma.org/calendar/exhibitions/…","""Artist""","""Artist""",1053,"""Individual""","""Paul Cézanne""","""Cézanne Paul""","""Paul""",,"""Cézanne""",,,"""French""",1839,1906,"""French, 18391906""","""Male""",39374836,"""Q35548""",500004793,"""moma.org/artists/1053"""


## So what is a `DataFrame`?

* Like R, `polars` focuses on columns
* Think `dict` of `(str, Series)` pairs 
* A series is a typed list-like structure

In [11]:
# This is how I imagine a dataframe
df = pl.DataFrame({"Names": ["Iverson", "Malone", "Bergen"],
                   "Python_mastery": [10.0, 5.5, 1.0],
                   "years_at_wsu": [4.5, 17.5, 5.5]})

In [12]:
type(df)

polars.dataframe.frame.DataFrame

In [13]:
df

Names,Python_mastery,years_at_wsu
str,f64,f64
"""Iverson""",10.0,4.5
"""Malone""",5.5,17.5
"""Bergen""",1.0,5.5


## Two ways to access a column

* **Method 1:** Actual data series
    * `df["column_name"]`
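* **Method 2:** Lazy column expression, evaluated later inside another context (e.g. `select` or `filter`)
    * `pl.col('column_name')`
    * The string form works for any column name; the attribute shorthand `pl.col.column_name` only works for names that are valid Python identifiers.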

In [14]:
artists['BeginDate'].head(2)

BeginDate
i64
1930
1936


In [15]:
pl.col('BeginDate') # Lazy - Nothing (yet)

## Columns are type `Series` and hold one type of data

In [16]:
type(artists['BeginDate'])

polars.series.series.Series

In [17]:
type(artists['DisplayName'])

polars.series.series.Series

In [18]:
artists['BeginDate'].dtype

Int64

In [19]:
artists['DisplayName'].dtype

String

## More on data types

* A list of all `polars` data types is available in `pl.datatypes`
    * Look for names starting with a capital letter.
* Use `df.dtypes` to see the column types in a dataframe named `df`

#### A list of all `polars` data types

In [20]:
[m for m in dir(pl.datatypes) if m.istitle()] # istitle keeps title-cased names only, so names such as UInt8 are filtered out

['Array',
 'Binary',
 'Boolean',
 'Categorical',
 'Date',
 'Datetime',
 'Decimal',
 'Duration',
 'Enum',
 'Field',
 'Float32',
 'Float64',
 'Int16',
 'Int32',
 'Int64',
 'Int8',
 'List',
 'Null',
 'Object',
 'String',
 'Struct',
 'Time',
 'Unknown',
 'Utf8']

#### Inspecting the data types for a data frame

In [21]:
artists.dtypes

[Int64, String, String, String, String, Int64, Int64, String, String]

## Setting `dtypes` with `read_csv`

We can pass a `dict` of types to the `dtypes` keyword (renamed `schema_overrides` in newer `polars` releases).

In [22]:
artist_types = {'ConstituentID': pl.Int64,
                'DisplayName': pl.Utf8,
                'ArtistBio': pl.Utf8,
                'Nationality': pl.Utf8,
                'Gender':pl.Utf8,
                'BeginDate': pl.Int64,
                'EndDate': pl.Int64,
                'Wiki QID': pl.Utf8,
                'ULAN':pl.Int64} 

artists2 = pl.read_csv('./sample_data/Artists.csv', dtypes = artist_types)
artists2.head()


ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
i64,str,str,str,str,i64,i64,str,i64
1,"""Robert Arneson""","""American, 1930–1992""","""American""","""Male""",1930,1992,,
2,"""Doroteo Arnaiz""","""Spanish, born 1936""","""Spanish""","""Male""",1936,0,,
3,"""Bill Arnold""","""American, born 1941""","""American""","""Male""",1941,0,,
4,"""Charles Arnoldi""","""American, born 1946""","""American""","""Male""",1946,0,"""Q1063584""",500027998.0
5,"""Per Arnoldi""","""Danish, born 1941""","""Danish""","""Male""",1941,0,,


## More on `None` and `NaN`

`polars` distinguishes between missing data and `NaN`.

* `None`/`null` is a missing value.
* `NaN` represents the result of an undefined operation
* `NaN` is **not** missing

In [24]:
df = pl.DataFrame({'a': [-1, 0 , 1, None],
                   'b': [1, 2, None, 4],
                   'c': [1.0, 2.0, float('nan'), 4.0]})
df

a,b,c
i64,i64,f64
-1.0,1.0,1.0
0.0,2.0,2.0
1.0,,NaN
,4.0,4.0


### `NaN` values are a result of undefined operations

Note that computing the square root of a negative number returns `NaN`, not `None`/`null`.

In [25]:
df_w_sqrt = (df
             .select([pl.col('a'),
                      pl.col('a').sqrt().alias('sqrt_a'),
                     ])
)
df_w_sqrt

a,sqrt_a
i64,f64
-1.0,NaN
0.0,0.0
1.0,1.0
,


### `NaN` values are not `None`

In [26]:
(df_w_sqrt
 .select([
          pl.col('sqrt_a'),
          pl.col('sqrt_a').is_null().alias('Is null'),
          pl.col('sqrt_a').is_nan().alias('Is nan'),
             ])
)

sqrt_a,Is null,Is nan
f64,bool,bool
NaN,False,True
0.0,False,False
1.0,False,False
,True,


### `NaN` and `None` affect aggregation differently.

We will discuss the effects of these values on aggregation in a future lecture.

## Getting to know your data

To get to know your data, use the following data frame methods.

* `df.head()`        first five rows
* `df.tail()`        last five rows
* `df.sample(5)`     random sample of rows
* `df.shape`         number of rows/columns in a tuple
* `df.describe()`    summary statistics for each column (count, null count, mean, std, min, quartiles, max)

#### Getting the number of rows and columns using `shape`

In [27]:
df.shape

(4, 3)

#### Getting summary statistics for each column with `describe`

In [28]:
df.describe()

statistic,a,b,c
str,f64,f64,f64
"""count""",3.0,3.0,4.0
"""null_count""",1.0,1.0,0.0
"""mean""",0.0,2.333333,
"""std""",1.0,1.527525,
"""min""",-1.0,1.0,1.0
"""25%""",0.0,2.0,2.0
"""50%""",0.0,2.0,4.0
"""75%""",1.0,4.0,4.0
"""max""",1.0,4.0,4.0


## <font color="red"> Exercise 6.1.3</font>

**Tasks.**

* Use various methods to inspect the `./sample_data/Artwork.csv` data from MoMA 
* Write a short summary of what you learn.

In [None]:
# Your code here (open new code cells for each method)

> Your thoughts here (open new markdown cells for each method)