# Introduction to Pandas

## Pandas provides Python data frames

* Popular and established
* Inspired by R dataframes
* Built on `numpy` for fast computation

In [13]:
import pandas as pd

## Our first dataframe

In [2]:
df = pd.DataFrame({"Names": ["Iverson", "Malone", "Bergen"],
                   "Python_mastery": [10, 5, 1.0],
                   "Love_of_R": [2, 5, 11],
                   "years_at_wsu": [3, 16, 4]})
df.head()

Unnamed: 0,Names,Python_mastery,Love_of_R,years_at_wsu
0,Iverson,10.0,2,3
1,Malone,5.0,5,16
2,Bergen,1.0,11,4


## Reading from a csv

* Most data sets will be read in from a csv or JSON data file
* `Pandas` provides `read_csv` and `read_json`

In [3]:
artists = pd.read_csv("./data/Artists.csv")
artists.head()

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
0,1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
1,2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,
2,3,Bill Arnold,"American, born 1941",American,Male,1941,0,,
3,4,Charles Arnoldi,"American, born 1946",American,Male,1946,0,Q1063584,500027998.0
4,5,Per Arnoldi,"Danish, born 1941",Danish,Male,1941,0,,


# JSON data file

* Another (more modern) storage
* Here the data is stored in row `dict`

In [5]:
!head -n 23 ./data/Artists.json

[
{
  "ConstituentID": 1,
  "DisplayName": "Robert Arneson",
  "ArtistBio": "American, 1930–1992",
  "Nationality": "American",
  "Gender": "Male",
  "BeginDate": 1930,
  "EndDate": 1992,
  "Wiki QID": null,
  "ULAN": null
},
{
  "ConstituentID": 2,
  "DisplayName": "Doroteo Arnaiz",
  "ArtistBio": "Spanish, born 1936",
  "Nationality": "Spanish",
  "Gender": "Male",
  "BeginDate": 1936,
  "EndDate": 0,
  "Wiki QID": null,
  "ULAN": null
},


## Reading a JSON data file

* Another (more modern) storage
* Here the data is stored in row `dict`

In [59]:
artists = pd.read_json("/Users/tiverson/Desktop/dsci430data/MoMA/Artists.json")
artists.head()

Unnamed: 0,ArtistBio,BeginDate,ConstituentID,DisplayName,EndDate,Gender,Nationality,ULAN,Wiki QID
0,"American, 1930–1992",1930,1,Robert Arneson,1992,Male,American,,
1,"Spanish, born 1936",1936,2,Doroteo Arnaiz,0,Male,Spanish,,
2,"American, born 1941",1941,3,Bill Arnold,0,Male,American,,
3,"American, born 1946",1946,4,Charles Arnoldi,0,Male,American,500027998.0,Q1063584
4,"Danish, born 1941",1941,5,Per Arnoldi,0,Male,Danish,,


## <font color="red"> Exercise 2 </font>
    
Use tab-completion and `help` to discover and explore two more methods of reading a file into a `Pandas` dataframe.


In [None]:
pd.read_ #<-- Tab here

## So what is a `DateFrame`

* Like R, Pandas focuses on columns
* Think `dict` of `(str, Series)` pairs 
* A series is a typed list-like structure

In [7]:
# This is how I imagine a dataframe
df = pd.DataFrame({"Names": ["Iverson", "Malone", "Bergen"],
                   "Python_mastery": [10, 5, 1.0],
                   "years_at_wsu": [3, 14, 4]})

## Columns are `Series` and hold one type of data

In [4]:
type(artists.BeginDate), type(artists.DisplayName)

(pandas.core.series.Series, pandas.core.series.Series)

In [5]:
artists.BeginDate.dtype, artists.DisplayName.dtype

(dtype('int64'), dtype('O'))

## Two ways to access a column

* **Method 1:** like a dictionary
    * `df["column_name"]`
* **Method 2:** like an object attribute
    * `df.column_name`
    * Only for proper names!

In [10]:
artists.BeginDate.head(2)

0    1930
1    1936
Name: BeginDate, dtype: int64

In [11]:
artists['BeginDate'].head(2)

0    1930
1    1936
Name: BeginDate, dtype: int64

## More on data types

* See all data types with `df.dtypes`
* You can set the `dtypes` when you read a dataframe
* Read more about types: [Pandas docs](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html)

In [12]:
artists.dtypes

ArtistBio         object
BeginDate          int64
ConstituentID      int64
DisplayName       object
EndDate            int64
Gender            object
Nationality       object
ULAN             float64
Wiki QID          object
dtype: object

## Setting `dtypes` with `read_csv`

We can pass a `dict` of types to `dtype` keyword

In [27]:
import numpy as np
artist_types = {'ConstituentID': np.int64,
                'DisplayName': str,
                'ArtistBio': str,
                'Nationality': str,
                'Gender':str,
                'BeginDate': np.int64,
                'EndDate': np.int64,
                'Wiki QID': str,
                'ULAN':np.float64}
artists2 = pd.read_csv('./data/artists.csv', dtype = artist_types)
artists2.head()

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
0,1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
1,2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,
2,3,Bill Arnold,"American, born 1941",American,Male,1941,0,,
3,4,Charles Arnoldi,"American, born 1946",American,Male,1946,0,Q1063584,500027998.0
4,5,Per Arnoldi,"Danish, born 1941",Danish,Male,1941,0,,


## Is `ULAN` REALLY a `float`?

* Currently, `pandas` $\rightarrow$ no missing `int`s
    * This is also the reason for the `0`s in dates
* This is incredibly annoying!
* Hope is on the horizon

In [30]:
pd.__version__

'0.23.4'

## Installing the development version

In [31]:
!pip install --upgrade --pre pandas

Requirement already up-to-date: pandas in /Users/tiverson/.pyenv/versions/anaconda3-5.0.0/lib/python3.6/site-packages (0.24.0rc1)


In [2]:
pd.__version__

'0.24.0rc1'

## Using the new `Int64` extension type

Make sure you restart the kernel

In [9]:
import pandas as pd
import numpy as np
artist_types = {'ConstituentID': np.int64,
                'DisplayName': str,
                'ArtistBio': str,
                'Nationality': str,
                'Gender':str,
                'BeginDate': pd.Int64Dtype(),
                'EndDate': pd.Int64Dtype(),
                'Wiki QID': str,
                'ULAN':pd.Int64Dtype()}
artists2 = pd.read_csv('./data/artists.csv', dtype = artist_types)
artists2.head()

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
0,1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
1,2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,
2,3,Bill Arnold,"American, born 1941",American,Male,1941,0,,
3,4,Charles Arnoldi,"American, born 1946",American,Male,1946,0,Q1063584,500027998.0
4,5,Per Arnoldi,"Danish, born 1941",Danish,Male,1941,0,,


## An `Int` by any other name ...

* `int64` $\rightarrow$ no missing values
* `Int64` $\rightarrow$ allows `NaN`

In [7]:
artists2.dtypes

ConstituentID     int64
DisplayName      object
ArtistBio        object
Nationality      object
Gender           object
BeginDate         Int64
EndDate           Int64
Wiki QID         object
ULAN              Int64
dtype: object

## Preview of coming attractions

* Now we can switch `BeginDate` and `EndDate` from `0` to `np.NaN`
* We will do this in the next section

# Getting to know your data

## Basic inspection tools

* `df.head()`        first five rows
* `df.tail()`        last five rows
* `df.sample(5)`     random sample of rows
* `df.shape`         number of rows/columns in a tuple
* `df.describe()`    calculates measures of central tendency
* `df.info()`

## <font color="red"> Exercise 1: Load and inspect the artwork from MoMA </font>

Make sure you can load both the csv and json files

[Data source](https://github.com/MuseumofModernArt/collection)

#### Read the csv and inspect the `head`

In [12]:
artwork = pd.read_csv("./data/Artworks.csv")
artwork.head()

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
0,"Ferdinandsbrücke Project, Vienna, Austria, Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,http://www.moma.org/media/W1siZiIsIjU5NDA1Il0s...,,,,48.6,,,168.9,,
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw...,,,,40.6401,,,29.8451,,
2,"Villa near Vienna Project, Outside Vienna, Aus...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, pen, color pencil, ink, and gouache ...",...,http://www.moma.org/media/W1siZiIsIjk4Il0sWyJw...,,,,34.3,,,31.8,,
3,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)",(),(1944),(0),(Male),1980,Photographic reproduction with colored synthet...,...,http://www.moma.org/media/W1siZiIsIjEyNCJdLFsi...,,,,50.8,,,50.8,,
4,"Villa, project, outside Vienna, Austria, Exter...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, color pencil, ink, and gouache on tr...",...,http://www.moma.org/media/W1siZiIsIjEyNiJdLFsi...,,,,38.4,,,19.1,,


**Task:** Write a few sentences describing an problems

*Your thoughts here*

#### Inspect the column names with the `columns` attribute

In [17]:
artwork.columns

Index(['Title', 'Artist', 'ConstituentID', 'ArtistBio', 'Nationality',
       'BeginDate', 'EndDate', 'Gender', 'Date', 'Medium', 'Dimensions',
       'CreditLine', 'AccessionNumber', 'Classification', 'Department',
       'DateAcquired', 'Cataloged', 'ObjectID', 'URL', 'ThumbnailURL',
       'Circumference (cm)', 'Depth (cm)', 'Diameter (cm)', 'Height (cm)',
       'Length (cm)', 'Weight (kg)', 'Width (cm)', 'Seat Height (cm)',
       'Duration (sec.)'],
      dtype='object')

**Question:** See any problems?

*Your thoughts here*

#### Inspect the tail

In [11]:
artwork.tail()

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
136526,Duplicate of plate facing page 6 from Mazas,Maximilien Luce,3621,"(French, 1858–1941)",(French),(1858),(1941),(Male),1894,Lithograph from the supplementary suite of an ...,...,,,,,32.3,,,23.5,,
136527,Duplicate of plate facing page 7 from Mazas,Maximilien Luce,3621,"(French, 1858–1941)",(French),(1858),(1941),(Male),1894,Lithograph from the supplementary suite of an ...,...,,,,,30.4,,,24.0,,
136528,Duplicate of plate facing page 8 from Mazas,Maximilien Luce,3621,"(French, 1858–1941)",(French),(1858),(1941),(Male),1894,Lithograph from the supplementary suite of an ...,...,,,,,33.0,,,24.2,,
136529,Duplicate of plate facing page 9 from Mazas,Maximilien Luce,3621,"(French, 1858–1941)",(French),(1858),(1941),(Male),1894,Lithograph from the supplementary suite of an ...,...,,,,,32.0,,,24.0,,
136530,Duplicate of plate facing page 10 from Mazas,Maximilien Luce,3621,"(French, 1858–1941)",(French),(1858),(1941),(Male),1894,Lithograph from the supplementary suite of an ...,...,,,,,22.0,,,30.6,,


#### Check out the `shape`

In [15]:
artwork.shape

(136531, 29)

**Question:** What do these number mean?

*Your thoughts here*

#### Use `describe` to compute statistics

In [13]:
artwork.describe()

Unnamed: 0,ObjectID,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
count,136531.0,10.0,13157.0,1429.0,117187.0,738.0,289.0,116280.0,0.0,3304.0
mean,90576.707253,44.86802,16.789801,23.184666,37.642548,89.892356,1287.944097,38.072156,,7766.9
std,68230.367311,28.631604,55.337322,45.070383,47.700061,330.290367,12038.129595,66.48243,,114507.0
min,2.0,9.9,0.0,0.635,0.0,0.0,0.09,0.0,,0.0
25%,36117.5,23.5,0.0,7.9,18.1,17.1,5.67,17.780036,,238.0
50%,72981.0,36.0,0.5,13.8,27.940056,26.7,19.9583,25.400051,,780.0
75%,137842.5,71.125,10.2,24.9,44.3,79.7,80.2867,44.6,,4320.0
max,294767.0,83.8,1808.483617,914.4,9140.0,8321.0566,185067.585957,9144.0,,6283065.0


#### Use `info` to look at types and totals

In [17]:
artwork.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 136531 entries, 0 to 136530
Data columns (total 29 columns):
Title                 136492 non-null object
Artist                135083 non-null object
ConstituentID         135083 non-null object
ArtistBio             131085 non-null object
Nationality           135083 non-null object
BeginDate             135083 non-null object
EndDate               135083 non-null object
Gender                135083 non-null object
Date                  134128 non-null object
Medium                125364 non-null object
Dimensions            125536 non-null object
CreditLine            133653 non-null object
AccessionNumber       136531 non-null object
Classification        136531 non-null object
Department            136531 non-null object
DateAcquired          129775 non-null object
Cataloged             136531 non-null object
ObjectID              136531 non-null int64
URL                   78575 non-null object
ThumbnailURL          67825 non-null

# Working with large files

## Using `chunksize` to read table chunks

* `chunksize` is the number of rows

In [21]:
c_size = 500
df_iter = pd.read_csv("./data/Artworks.csv", chunksize=c_size)
df_iter

<pandas.io.parsers.TextFileReader at 0x110a50eb8>

## What is `df_iter`?

* Using `chunksize` $\rightarrow$ dataframe iterator
* Lazily returns chunks of data
* Won't fill up memory

## Big data best practice

* Use `toolz.first` to grab the first chunk
* Prototype on a small chunk of data
* Run your code on all chunks later


In [28]:
from toolz import first 

first_chuck = first(df_iter)
first_chuck.shape

(500, 29)

In [29]:
first_chuck.head()

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
1500,Candlestick,Louis Comfort Tiffany,5876,"(American, 1848–1933)",(American),(1848),(1933),(Male),c. 1910,Bronze,...,http://www.moma.org/media/W1siZiIsIjIxMjUiXSxb...,,,11.43,38.1,,,,,
1501,Tumblers,Kaj Franck,1968,"(Finnish, 1911–1989)",(Finnish),(1911),(1989),(Male),1954,Turn mold-blown glass,...,http://www.moma.org/media/W1siZiIsIjIyODAxMyJd...,,,8.0,8.255,,,,,
1502,Jelly Fish Fabric,Reiko Sudo,7045,"(Japanese, born 1953)",(Japanese),(1953),(0),(Female),c. 1994,Polyester,...,http://www.moma.org/media/W1siZiIsIjIxMDMzMiJd...,,,,,637.5,,86.4,,
1503,Multi-use Chair,Frederick Kiesler,3091,"(American, born Austria-Hungary. 1890–1965)",(American),(1890),(1965),(Male),1942,Oak and linoleum,...,http://www.moma.org/media/W1siZiIsIjIxMDMyNyJd...,,88.9002,,84.7727,,,39.6876,,
1504,Bowl,James Prestini,4729,"(American, born Italy. 1908–1993)",(American),(1908),(1993),(Male),c. 1945,Birch wood,...,http://www.moma.org/media/W1siZiIsIjIxMDMzMyJd...,,,17.8,4.4,,,,,


## Prototyping your code

* Now build/test your code on a small chunk
* Good general coding technique

In [43]:
fix_begin_date = lambda df: (df >> mutate(BeginDate = X.BeginDate.str.replace('[()]', '')))
fix_begin_date(first_chuck).head()

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
1500,Candlestick,Louis Comfort Tiffany,5876,"(American, 1848–1933)",(American),1848,(1933),(Male),c. 1910,Bronze,...,http://www.moma.org/media/W1siZiIsIjIxMjUiXSxb...,,,11.43,38.1,,,,,
1501,Tumblers,Kaj Franck,1968,"(Finnish, 1911–1989)",(Finnish),1911,(1989),(Male),1954,Turn mold-blown glass,...,http://www.moma.org/media/W1siZiIsIjIyODAxMyJd...,,,8.0,8.255,,,,,
1502,Jelly Fish Fabric,Reiko Sudo,7045,"(Japanese, born 1953)",(Japanese),1953,(0),(Female),c. 1994,Polyester,...,http://www.moma.org/media/W1siZiIsIjIxMDMzMiJd...,,,,,637.5,,86.4,,
1503,Multi-use Chair,Frederick Kiesler,3091,"(American, born Austria-Hungary. 1890–1965)",(American),1890,(1965),(Male),1942,Oak and linoleum,...,http://www.moma.org/media/W1siZiIsIjIxMDMyNyJd...,,88.9002,,84.7727,,,39.6876,,
1504,Bowl,James Prestini,4729,"(American, born Italy. 1908–1993)",(American),1908,(1993),(Male),c. 1945,Birch wood,...,http://www.moma.org/media/W1siZiIsIjIxMDMzMyJd...,,,17.8,4.4,,,,,


## Safe operations on big data

* Aggregate to a managable size
* Filter to a managable size
* Read chuck $\rightarrow$ Process chuck $\rightarrow$ Write chunk
* Use generator expressions
    * like list comprehensions, but with `()`

In [47]:
df_iter = pd.read_csv("./data/Artworks.csv", chunksize=c_size)
df_iter2 = (fix_begin_date(df) for df in df_iter)
df_iter2

<generator object <genexpr> at 0x110a38678>

In [48]:
?first_chuck.to_csv

## Writing large files

After processing each chunk, we need to write the rows to a file. Note that

1. We need to append each chunk to the same file
1. We want to print the header on the first chunk.
2. We *don't* want to print the header on all subsequent chunks.

#### Step 1 - Read the file in chunks

In [83]:
df_iter = pd.read_csv('./data/Artworks2.csv', 
                      chunksize=500, # Pick a reasonable chunk size.  I had memory errors with a smaller size
                      sep=',', # To help the parser not run out of memory
                      dtype={'BeginDate':str}, # We are using string method, make sure they will work
                      engine='python') # The way I fixed parsing errors

#### Step 2 - Process and write the first chunk

In [84]:
first_chunk = first(df_iter)
fix_begin_date(first_chuck).to_csv('./data/Artwork_new.csv', mode='w')

#### Step 3 - Process and append the rest of the chunks

In [85]:
for i, chunk in enumerate(df_iter):
    fixed_chunk = fix_begin_date(chunk)
    fixed_chunk.to_csv('./data/Artwork_new.csv', mode='a') # mode 'a' is append

## Inspecting the result

In [81]:
artwork_new = pd.read_csv('./data/Artwork_new.csv')
artwork_new.shape

(136805, 30)

In [79]:
artwork_new.head()

Unnamed: 0.1,Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
0,1500.0,Candlestick,Louis Comfort Tiffany,5876,"(American, 1848–1933)",(American),1848,(1933),(Male),c. 1910,...,http://www.moma.org/media/W1siZiIsIjIxMjUiXSxb...,,,11.43,38.1,,,,,
1,1501.0,Tumblers,Kaj Franck,1968,"(Finnish, 1911–1989)",(Finnish),1911,(1989),(Male),1954,...,http://www.moma.org/media/W1siZiIsIjIyODAxMyJd...,,,8.0,8.255,,,,,
2,1502.0,Jelly Fish Fabric,Reiko Sudo,7045,"(Japanese, born 1953)",(Japanese),1953,(0),(Female),c. 1994,...,http://www.moma.org/media/W1siZiIsIjIxMDMzMiJd...,,,,,637.5,,86.4,,
3,1503.0,Multi-use Chair,Frederick Kiesler,3091,"(American, born Austria-Hungary. 1890–1965)",(American),1890,(1965),(Male),1942,...,http://www.moma.org/media/W1siZiIsIjIxMDMyNyJd...,,88.9002,,84.7727,,,39.6876,,
4,1504.0,Bowl,James Prestini,4729,"(American, born Italy. 1908–1993)",(American),1908,(1993),(Male),c. 1945,...,http://www.moma.org/media/W1siZiIsIjIxMDMzMyJd...,,,17.8,4.4,,,,,
