# Day One: Introduction to DataFrames

First, import pandas using the conventional alias `pd` and check the version. Any version above 2.0 should do.

In [5]:
import pandas as pd
print(pd.__version__)

2.2.3


After downloading the `.csv` file from GitHub, we should be able to read it in.

Let's use the `os` module to make sure that we're looking for the file in the right place: the folder that has our Jupyter notebook in it. 

The first function we'll use from `os` is `os.getcwd()`, which returns the *current working directory*. This is where Python will look for files when it tries to open them if you don't specify an alternative path.

In [6]:
import os
os.getcwd()

'/Users/emwin/Documents/pandas-workshop'

Your folder name might be different, depending on how you set up your project and your operating system, but this is the name and path of the folder where I'm storing my workshop materials for this class.

The next function we'll use is `os.listdir()`, which lists the files and folders in the current directory. Do you see the `.csv` file you just downloaded?

In [7]:
os.listdir()

['day-one.ipynb',
 'resources.md',
 'day-01-w25.ipynb',
 'Archive',
 'environment.yml',
 'Untitled.ipynb',
 'README.md',
 '.gitignore',
 '.ipynb_checkpoints',
 '.git',
 'IAAPI_raw_2023-10-05.csv',
 'Notebooks']
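If you would rather not scan the list by eye, here is a minimal sketch of checking for a file programmatically. The file name `my_data.csv` is a hypothetical placeholder, not the workshop file; swap in whatever you downloaded.

```python
import os

# Hypothetical file name, used only for illustration.
filename = "my_data.csv"

# Build an absolute path starting from the current working directory.
full_path = os.path.join(os.getcwd(), filename)

# os.path.exists() returns True only if something exists at that path.
file_found = os.path.exists(full_path)

print(full_path)
print("Found it!" if file_found else "Not in this folder.")
```

If the file is not found, it was probably saved somewhere else (a Downloads folder, for example) and needs to be moved next to the notebook.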

Our first attempt to read in this file will fail.

In [9]:
moma = pd.read_csv("IAAPI_raw_2023-10-05.csv")
moma

ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3


Let's look more closely at this file. This file might have the `.csv` suffix, but it's not actually comma-separated. It has the `|` separator.

Looking at the documentation for [**read_csv()**](https://pandas.pydata.org/docs/dev/reference/api/pandas.read_csv.html), we can pass in a `sep=` argument that tells Pandas what character (aka letter/number/symbol) is used to separate the columns in each line of this file.

In [10]:
moma = pd.read_csv("IAAPI_raw_2023-10-05.csv", sep="|")
moma

Unnamed: 0,Match point,ULAN ID,Wikidata ID,VIAF ID,LC ID,Name,Label,Birth date,Death date,Ancestry/Heritage,Indexes,Description,Watsonline,Watsonline URL,Met Collection,Met Collection URL,Wikipedia,Wikipedia URL,Art Asia America,AAA URL
0,AAPI-0001,500487777.0,Q466654,79468708.0,n86857749,"Abad, Pacita",Pacita Abad,1946.0,2004.0,Filipino/a/x,,painter,Watsonline,https://library.metmuseum.org/search~S1/?searc...,,,Wikipedia,https://en.wikipedia.org/wiki/Pacita_Abad,,
1,AAPI-0970,,Q47157454,102816611.0,no2007063757,"Abbas, Hamra",Hamra Abbas,1976.0,,Pakistani,,sculptor;painter;installation artist,Watsonline,https://library.metmuseum.org/search~S1/?searc...,,,,,,
2,AAPI-0002,500116914.0,Q7426381,96547973.0,no99018101,"Abe, Satoru",Satoru Abe,1926.0,,Japanese;Hawaiian (Kamaʻāina),,painter;sculptor,Watsonline,https://library.metmuseum.org/search~S1/?searc...,,,Wikipedia,https://en.wikipedia.org/wiki/Satoru_Abe,,
3,AAPI-1031,,Q23881684,307448466.0,no2013110010,"Abichandani, Jaishri",Jaishri Abichandani,1969.0,,Indian,,interdisciplinary artist,Watsonline,https://library.metmuseum.org/search/?searchty...,,,,,,
4,AAPI-0003,,Q19867429,,,"Acebo Davis, Terry",Terry Acebo Davis,1953.0,,Filipino/a/x,,installation artist;mixed-media artist;printmaker,Watsonline,https://library.metmuseum.org/search~S1/?searc...,,,Wikipedia,https://en.wikipedia.org/wiki/Terry_Acebo_Davis,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
925,AAPI-0890,500524609.0,Q21607972,307424125.0,n2014002953,"Zheng, Chongbin",Chongbin Zheng,1961.0,,Chinese,,installation artist;painter;time-based media a...,Watsonline,https://library.metmuseum.org/search/?searchty...,Met Collection,https://www.metmuseum.org/art/collection/searc...,,,,
926,AAPI-0735,,Q8070726,,,"Zheng, Lianjie",Lianjie Zheng,1962.0,,Chinese,,installation artist;painter;performance artist...,Watsonline,https://library.metmuseum.org/search~S1/?searc...,,,Wikipedia,https://en.wikipedia.org/wiki/Zheng_Lianjie,artasiamerica,http://artasiamerica.org/artist/detail/57
927,AAPI-0736,,Q120867919,,,"Zheng, Shengtian",Shengtian Zheng,1938.0,,Chinese,,painter;ink artist/calligrapher,Watsonline,https://library.metmuseum.org/search~S1/?searc...,,,,,,
928,AAPI-0889,,Q120868879,,,"Zhong, Yueying",Yueying Zhong,1960.0,,Chinese,,painter;ink artist/calligrapher,Watsonline,https://library.metmuseum.org/search~S1/?searc...,,,,,,
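To see what `sep=` changes without touching the real file, here is a small sketch that reads made-up, pipe-delimited text from memory. The data and the use of `io.StringIO` as a stand-in for a file are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up, pipe-delimited text standing in for a file on disk.
raw_text = "Label|Birth date\nJane Doe|1950\nRichard Roe|1962\n"

# With the default separator (a comma), each line is read as one big column.
wrong = pd.read_csv(io.StringIO(raw_text))
print(wrong.shape)

# Telling pandas the separator is "|" splits each line into two columns.
right = pd.read_csv(io.StringIO(raw_text), sep="|")
print(right.shape)
print(list(right.columns))
```

Note that a wrong separator does not always raise an error, as it did with our file. Sometimes it silently produces a one-column DataFrame, so it is worth checking the result.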


Let's make a quick adjustment so that Pandas displays all the columns in this DataFrame.

In [11]:
pd.set_option('display.max_columns', None)

Our next Pandas function is **DataFrame.head(n)**, which displays the first *n* rows of a given DataFrame.
Now we can inspect the contents of our DataFrame.

In [12]:
moma.head(5) # Gives the first five rows of a DataFrame

Unnamed: 0,Match point,ULAN ID,Wikidata ID,VIAF ID,LC ID,Name,Label,Birth date,Death date,Ancestry/Heritage,Indexes,Description,Watsonline,Watsonline URL,Met Collection,Met Collection URL,Wikipedia,Wikipedia URL,Art Asia America,AAA URL
0,AAPI-0001,500487777.0,Q466654,79468708.0,n86857749,"Abad, Pacita",Pacita Abad,1946.0,2004.0,Filipino/a/x,,painter,Watsonline,https://library.metmuseum.org/search~S1/?searc...,,,Wikipedia,https://en.wikipedia.org/wiki/Pacita_Abad,,
1,AAPI-0970,,Q47157454,102816611.0,no2007063757,"Abbas, Hamra",Hamra Abbas,1976.0,,Pakistani,,sculptor;painter;installation artist,Watsonline,https://library.metmuseum.org/search~S1/?searc...,,,,,,
2,AAPI-0002,500116914.0,Q7426381,96547973.0,no99018101,"Abe, Satoru",Satoru Abe,1926.0,,Japanese;Hawaiian (Kamaʻāina),,painter;sculptor,Watsonline,https://library.metmuseum.org/search~S1/?searc...,,,Wikipedia,https://en.wikipedia.org/wiki/Satoru_Abe,,
3,AAPI-1031,,Q23881684,307448466.0,no2013110010,"Abichandani, Jaishri",Jaishri Abichandani,1969.0,,Indian,,interdisciplinary artist,Watsonline,https://library.metmuseum.org/search/?searchty...,,,,,,
4,AAPI-0003,,Q19867429,,,"Acebo Davis, Terry",Terry Acebo Davis,1953.0,,Filipino/a/x,,installation artist;mixed-media artist;printmaker,Watsonline,https://library.metmuseum.org/search~S1/?searc...,,,Wikipedia,https://en.wikipedia.org/wiki/Terry_Acebo_Davis,,


`moma` is a DataFrame object.

In [13]:
type(moma) # Print data type

pandas.core.frame.DataFrame

If we look at the **.shape** attribute (a tuple in `(row, column)` form) we can see it has 930 rows and 20 columns.

In [14]:
print(moma.shape)

(930, 20)
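Because `.shape` is an ordinary tuple, you can unpack it into two variables. This is a small sketch on a made-up DataFrame (the names and years are invented):

```python
import pandas as pd

# A tiny, made-up DataFrame with 3 rows and 2 columns.
toy = pd.DataFrame(
    {
        "Label": ["Jane Doe", "Richard Roe", "Pat Poe"],
        "Birth date": [1950.0, 1962.0, 1971.0],
    }
)

# Unpack the (row, column) tuple into two separate variables.
n_rows, n_cols = toy.shape
print(n_rows, "rows and", n_cols, "columns")

# len() on a DataFrame gives the number of rows, the same as shape[0].
print(len(toy) == n_rows)
```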


We can look at the row names in the **.index** attribute and the column names in the **.columns** attribute--we'll learn how to change these later!
If we do not specify an index column using the *index_col=* parameter, the row names will default to a **RangeIndex** that starts at 0 and ends at `num_rows - 1`. (The `stop` value that Pandas reports is exclusive, so it equals the number of rows.)

In [15]:
moma.index

RangeIndex(start=0, stop=930, step=1)

In [16]:
moma.columns

Index(['Match point', 'ULAN ID', 'Wikidata ID', 'VIAF ID', 'LC ID', 'Name',
       'Label', 'Birth date', 'Death date', 'Ancestry/Heritage', 'Indexes',
       'Description', 'Watsonline', 'Watsonline URL', 'Met Collection',
       'Met Collection URL', 'Wikipedia', 'Wikipedia URL', 'Art Asia America',
       'AAA URL'],
      dtype='object')
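As a rough illustration of what *index_col=* does, the sketch below reads the same made-up text twice: once with the default index and once using a column as the row names. The `ID` column and its values are invented for this example.

```python
import io

import pandas as pd

# Made-up, pipe-delimited text; "ID" is a hypothetical column name.
raw_text = "ID|Label\nA-1|Jane Doe\nA-2|Richard Roe\n"

# No index_col: the rows are labeled 0, 1, ... by a RangeIndex.
default_index = pd.read_csv(io.StringIO(raw_text), sep="|")
print(default_index.index)

# index_col="ID": the values of that column become the row names instead.
named_index = pd.read_csv(io.StringIO(raw_text), sep="|", index_col="ID")
print(named_index.index)
print(named_index.columns)
```

Notice that once `ID` is the index, it no longer appears in `.columns`.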

Every **column** in a **DataFrame** is a **pd.Series**, and so is every **row**. However, each column has an assigned **Dtype**, whereas a row does not, because its values come from columns of different types. This makes sense because rows typically represent *observations* while columns represent *variables*.

You can see the Dtypes for each column in your DataFrame using **.dtypes**. (**object** is the default Dtype for columns of text, which is how the categorical variables in this dataset are stored.)

In [17]:
moma.dtypes

Match point            object
ULAN ID               float64
Wikidata ID            object
VIAF ID               float64
LC ID                  object
Name                   object
Label                  object
Birth date            float64
Death date            float64
Ancestry/Heritage      object
Indexes                object
Description            object
Watsonline             object
Watsonline URL         object
Met Collection         object
Met Collection URL     object
Wikipedia              object
Wikipedia URL          object
Art Asia America       object
AAA URL                object
dtype: object
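Here is a small sketch of the column-versus-row difference on a made-up DataFrame. When a row that mixes text and numbers is pulled out as a Series, Pandas has to fall back to the general-purpose `object` Dtype to hold both.

```python
import pandas as pd

# A tiny, made-up DataFrame with one text column and one numeric column.
toy = pd.DataFrame(
    {
        "Label": ["Jane Doe", "Richard Roe"],
        "Birth date": [1950.0, 1962.0],
    }
)

# Each column keeps its own Dtype.
print(toy.dtypes)

# A single row mixes text and numbers, so as a Series it is held as "object".
first_row = toy.iloc[0]
print(type(first_row))
print(first_row.dtype)
```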

We can select a single column in a DataFrame by using `frame["column_name"]` notation.

In [18]:
moma["Name"]

0              Abad, Pacita
1              Abbas, Hamra
2               Abe, Satoru
3      Abichandani, Jaishri
4        Acebo Davis, Terry
               ...         
925         Zheng, Chongbin
926          Zheng, Lianjie
927        Zheng, Shengtian
928          Zhong, Yueying
929          Tamotzu, Chuzo
Name: Name, Length: 930, dtype: object

A column is a **pd.Series** of the same length as the number of rows in the parent **DataFrame**.

In [19]:
type(moma["Name"])

pandas.core.series.Series

## Describing Distributions in DataFrames and Series
We can get descriptive statistics for **numeric** Series or for several numeric columns in a DataFrame at once using [**DataFrame.describe()**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html). The summary reports:
* count
* mean
* std
* min
* 25% quartile
* 50% quartile (median)
* 75% quartile
* max

In [43]:
moma.describe() # On all numeric columns in a DataFrame

Unnamed: 0,ULAN ID,VIAF ID,Birth date,Death date
count,308.0,418.0,820.0,248.0
mean,500238700.0,3.184308e+20,1946.328049,1989.181452
std,172538.3,1.326401e+21,28.555683,24.795098
min,500001300.0,6613.0,1840.0,1893.0
25%,500107100.0,44883970.0,1928.0,1977.5
50%,500152600.0,96348450.0,1952.0,1994.0
75%,500336400.0,268501800.0,1969.0,2008.0
max,500778600.0,9.81616e+21,1994.0,2022.0


We can compute any of the aggregate measurements from those descriptive statistics on Series or DataFrames by calling **Series.count()**, **Series.min()**, etc.

In [44]:
moma["Death date"].max()

2022.0

What do you think happens if I call the **.min()** method on an alphabetical column like `Name`?

In [45]:
moma["Name"].min()

'Abad, Pacita'

Aggregation functions like **.min()** and **.count()** ignore NaN or missing values. For example, only 248 of the artists have known death dates.

In [46]:
moma["Death date"].count()

248
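To make the NaN-skipping behavior concrete, here is a sketch on a short, made-up Series of years with two missing values. The `skipna=False` argument is shown only to contrast with the default.

```python
import numpy as np
import pandas as pd

# Four made-up entries, two of them missing.
years = pd.Series([2004.0, np.nan, 1999.0, np.nan])

print(len(years))      # Total number of entries, missing or not
print(years.count())   # Number of non-missing entries only
print(years.min())     # Smallest non-missing value
print(years.mean())    # Average of the non-missing values

# Passing skipna=False makes the result NaN if any value is missing.
print(years.min(skipna=False))
```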

To see a longer list of Series aggregation/computation functions, [see the official documentation](https://pandas.pydata.org/docs/reference/series.html#computations-descriptive-stats).

### Categorical Variable Distributions
But what about those columns of dtype `object` in our DataFrame? Pandas has ways for us to assess those categorical variables at a glance, even though you can't take the average or standard deviation of, say, a name.

For an individual column or Series, we can use [**Series.value_counts()**](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html#pandas.Series.value_counts) to compute the value frequency of categorical variables in descending order of frequency.

For example, let's select the "Ancestry/Heritage" column and get its **.value_counts()**.

In [47]:
moma["Ancestry/Heritage"].value_counts()

Ancestry/Heritage
Japanese                                                 256
Chinese                                                  176
Korean                                                   139
Filipino/a/x                                              68
Japanese;Hawaiian (Kamaʻāina)                             38
Indian                                                    34
Vietnamese                                                27
Taiwanese                                                 27
Hongkonger                                                18
Pakistani                                                 14
Native Hawaiian (Kānaka ʻŌiwi, Kānaka Maoli)               9
Cambodian                                                  8
Chinese;Hongkonger                                         8
Chinese;Hawaiian (Kamaʻāina)                               7
Hongkonger;Chinese                                         6
Filipino/a/x;Hawaiian (Kamaʻāina)                          5
Thai  

Looking at this distribution, we can see that artists with ancestry from multiple Asian nations will have `;` in the `Ancestry/Heritage` column. You can also see that artists of Japanese, Chinese, and Korean descent make up the largest share.
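Two optional arguments to **.value_counts()** are often handy here: `dropna=False` counts missing values as their own category, and `normalize=True` returns shares instead of raw counts. The sketch below uses a short, made-up Series rather than the real column.

```python
import numpy as np
import pandas as pd

# A short, made-up Series with one missing value.
heritage = pd.Series(
    ["Japanese", "Chinese", "Japanese", np.nan, "Chinese;Hongkonger", "Japanese"]
)

# Default: missing values are left out, most frequent value first.
counts = heritage.value_counts()
print(counts)

# dropna=False adds a row for the missing values.
with_missing = heritage.value_counts(dropna=False)
print(with_missing)

# normalize=True gives each value's share of the non-missing entries.
shares = heritage.value_counts(normalize=True)
print(shares)
```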

To assess several categorical variables in a DataFrame, we can pass an argument called `include=` to **DataFrame.describe()**. Because our categorical variables are of type `object`, we'll pass in `"object"` as its value.

In [48]:
moma.describe(include="object")

Unnamed: 0,Match point,Wikidata ID,LC ID,Name,Label,Ancestry/Heritage,Indexes,Description,Watsonline,Watsonline URL,Met Collection,Met Collection URL,Wikipedia,Wikipedia URL,Art Asia America,AAA URL
count,930,930,341,930,918,903,31,896,930,930,83,83,420,419,135,135
unique,930,929,341,930,918,56,9,296,1,929,1,83,1,417,1,135
top,AAPI-0001,Q28663102,n86857749,"Abad, Pacita",Pacita Abad,Japanese,Index of Indigenous and Native American Artists\n,painter,Watsonline,https://library.metmuseum.org/search~S1/?searc...,Met Collection,https://www.metmuseum.org/art/collection/searc...,Wikipedia,https://en.wikipedia.org/wiki/Eiko_%26_Koma,artasiamerica,http://artasiamerica.org/artist/detail/73
freq,1,2,1,1,1,256,13,155,930,2,83,1,420,2,135,1


As you can see, only **count**, **unique**, **top**, and **freq** (of the top value) are computed for string variables.

You can also use `include="all"` to see the numeric and categorical distributions simultaneously, but this can be a bit awkward to read.

In [49]:
moma.describe(include="all")

Unnamed: 0,Match point,ULAN ID,Wikidata ID,VIAF ID,LC ID,Name,Label,Birth date,Death date,Ancestry/Heritage,Indexes,Description,Watsonline,Watsonline URL,Met Collection,Met Collection URL,Wikipedia,Wikipedia URL,Art Asia America,AAA URL
count,930,308.0,930,418.0,341,930,918,820.0,248.0,903,31,896,930,930,83,83,420,419,135,135
unique,930,,929,,341,930,918,,,56,9,296,1,929,1,83,1,417,1,135
top,AAPI-0001,,Q28663102,,n86857749,"Abad, Pacita",Pacita Abad,,,Japanese,Index of Indigenous and Native American Artists\n,painter,Watsonline,https://library.metmuseum.org/search~S1/?searc...,Met Collection,https://www.metmuseum.org/art/collection/searc...,Wikipedia,https://en.wikipedia.org/wiki/Eiko_%26_Koma,artasiamerica,http://artasiamerica.org/artist/detail/73
freq,1,,2,,1,1,1,,,256,13,155,930,2,83,1,420,2,135,1
mean,,500238700.0,,3.184308e+20,,,,1946.328049,1989.181452,,,,,,,,,,,
std,,172538.3,,1.326401e+21,,,,28.555683,24.795098,,,,,,,,,,,
min,,500001300.0,,6613.0,,,,1840.0,1893.0,,,,,,,,,,,
25%,,500107100.0,,44883970.0,,,,1928.0,1977.5,,,,,,,,,,,
50%,,500152600.0,,96348450.0,,,,1952.0,1994.0,,,,,,,,,,,
75%,,500336400.0,,268501800.0,,,,1969.0,2008.0,,,,,,,,,,,



Q: Let's say you're interested in studying notable artists who haven't been added to Wikipedia. How would you determine how many of the artists in this dataset do *not* have Wikipedia pages?

**HINT:** Remember `.shape`?

A: Use the **.count()** method on the `Wikipedia URL` column to count the artists who have a page, take the first item in the `.shape` tuple to get the number of rows, then subtract.

In [50]:
have_wiki = moma["Wikipedia URL"].count() # Artists who do have a Wikipedia URL
row_count = moma.shape[0] # Number of rows in the DataFrame
print("There are", row_count - have_wiki, "artists without Wikipedia pages featured at the MoMA in this dataset.")

There are 511 artists without Wikipedia pages featured at the MoMA in this dataset.
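Another way to reach the same kind of answer, sketched here on a made-up DataFrame, is to count the missing values directly with **Series.isna()** followed by **.sum()**. This is an alternative to the subtraction above, not a replacement for it.

```python
import numpy as np
import pandas as pd

# A tiny, made-up DataFrame: only the first artist has a URL.
toy = pd.DataFrame(
    {
        "Label": ["Jane Doe", "Richard Roe", "Pat Poe"],
        "Wikipedia URL": ["https://example.org/jane", np.nan, np.nan],
    }
)

# Approach 1: total rows minus the non-missing count.
by_subtraction = toy.shape[0] - toy["Wikipedia URL"].count()

# Approach 2: isna() marks each missing value as True, and sum() counts the Trues.
by_isna = toy["Wikipedia URL"].isna().sum()

print(by_subtraction, by_isna)
```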


## Indexing

We can access values in a DataFrame using [one of two selection methods](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html):
* .loc[*row*, *column*] to get values by named indices and columns (inclusive on both ends)
* .iloc[*row*, *column*] to get values by numbered indices (exclusive on the right)

### Selecting from a DataFrame with the default (numeric) index
*Q: How would you select the third column in the second row using loc? How would you select it with iloc?*
Use this peek of the first three rows as a guide.

In [28]:
moma.head(3)

Unnamed: 0,Match point,ULAN ID,Wikidata ID,VIAF ID,LC ID,Name,Label,Birth date,Death date,Ancestry/Heritage,Indexes,Description,Watsonline,Watsonline URL,Met Collection,Met Collection URL,Wikipedia,Wikipedia URL,Art Asia America,AAA URL
0,AAPI-0001,500487777.0,Q466654,79468708.0,n86857749,"Abad, Pacita",Pacita Abad,1946.0,2004.0,Filipino/a/x,,painter,Watsonline,https://library.metmuseum.org/search~S1/?searc...,,,Wikipedia,https://en.wikipedia.org/wiki/Pacita_Abad,,
1,AAPI-0970,,Q47157454,102816611.0,no2007063757,"Abbas, Hamra",Hamra Abbas,1976.0,,Pakistani,,sculptor;painter;installation artist,Watsonline,https://library.metmuseum.org/search~S1/?searc...,,,,,,
2,AAPI-0002,500116914.0,Q7426381,96547973.0,no99018101,"Abe, Satoru",Satoru Abe,1926.0,,Japanese;Hawaiian (Kamaʻāina),,painter;sculptor,Watsonline,https://library.metmuseum.org/search~S1/?searc...,,,Wikipedia,https://en.wikipedia.org/wiki/Satoru_Abe,,


In [29]:
print(moma.loc[1, "Wikidata ID"]) # Our index in this case is a number
print(moma.iloc[1, 2]) # The second row, the third column

Q47157454
Q47157454


We can also use .loc and .iloc to select ranges of cells in the form of rows and columns. 

In [30]:
moma.iloc[0:2, :] # First two rows: iloc excludes the right end of the slice

Unnamed: 0,Match point,ULAN ID,Wikidata ID,VIAF ID,LC ID,Name,Label,Birth date,Death date,Ancestry/Heritage,Indexes,Description,Watsonline,Watsonline URL,Met Collection,Met Collection URL,Wikipedia,Wikipedia URL,Art Asia America,AAA URL
0,AAPI-0001,500487777.0,Q466654,79468708.0,n86857749,"Abad, Pacita",Pacita Abad,1946.0,2004.0,Filipino/a/x,,painter,Watsonline,https://library.metmuseum.org/search~S1/?searc...,,,Wikipedia,https://en.wikipedia.org/wiki/Pacita_Abad,,
1,AAPI-0970,,Q47157454,102816611.0,no2007063757,"Abbas, Hamra",Hamra Abbas,1976.0,,Pakistani,,sculptor;painter;installation artist,Watsonline,https://library.metmuseum.org/search~S1/?searc...,,,,,,


In [31]:
moma.loc[:, "Name":"Death date"] # loc is inclusive

Unnamed: 0,Name,Label,Birth date,Death date
0,"Abad, Pacita",Pacita Abad,1946.0,2004.0
1,"Abbas, Hamra",Hamra Abbas,1976.0,
2,"Abe, Satoru",Satoru Abe,1926.0,
3,"Abichandani, Jaishri",Jaishri Abichandani,1969.0,
4,"Acebo Davis, Terry",Terry Acebo Davis,1953.0,
...,...,...,...,...
925,"Zheng, Chongbin",Chongbin Zheng,1961.0,
926,"Zheng, Lianjie",Lianjie Zheng,1962.0,
927,"Zheng, Shengtian",Shengtian Zheng,1938.0,
928,"Zhong, Yueying",Yueying Zhong,1960.0,
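The inclusive/exclusive difference is easiest to see side by side. This sketch uses a made-up four-row DataFrame with the default numeric index, so the same slice `0:2` can be given to both selectors.

```python
import pandas as pd

# A made-up DataFrame with the default RangeIndex (0, 1, 2, 3).
toy = pd.DataFrame(
    {
        "Name": ["Doe, Jane", "Roe, Richard", "Poe, Pat", "Moe, Max"],
        "Label": ["Jane Doe", "Richard Roe", "Pat Poe", "Max Moe"],
        "Birth date": [1950.0, 1962.0, 1971.0, 1980.0],
    }
)

# loc slices by name and includes both ends: rows 0, 1, and 2.
by_label = toy.loc[0:2, "Name":"Label"]
print(by_label.shape)

# iloc slices by position and excludes the right end: rows 0 and 1 only.
by_position = toy.iloc[0:2, 0:2]
print(by_position.shape)
```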


In [32]:
moma.loc[:, "Name"] # Returns all the rows with the Name column

0              Abad, Pacita
1              Abbas, Hamra
2               Abe, Satoru
3      Abichandani, Jaishri
4        Acebo Davis, Terry
               ...         
925         Zheng, Chongbin
926          Zheng, Lianjie
927        Zheng, Shengtian
928          Zhong, Yueying
929          Tamotzu, Chuzo
Name: Name, Length: 930, dtype: object

Many Pandas functions use **lists** or **dictionaries** as arguments.

### Lists
[**Lists**](https://docs.python.org/3/tutorial/datastructures.html#more-on-lists) are one-dimensional, ordered sequences. Lists can include items of heterogeneous data types. While lists have more functionality than described here, we will use them in this workshop as constants declared in the form `list_name = [item_0, item_1, ...]`.

In [40]:
my_list = ["a", "b", 3, 5]
print(my_list[0]) # Gets the first item
print(my_list[:2]) # Gets the first two items
my_list.append(["dog", "cat"])
print(my_list)

a
['a', 'b']
['a', 'b', 3, 5, ['dog', 'cat']]
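Notice that `.append()` added the whole list `["dog", "cat"]` as a single, nested item. If you want the items added one by one instead, `.extend()` is the usual tool. A short sketch of the difference:

```python
# append() adds its argument as ONE new item, even if that argument is a list.
nested = ["a", "b"]
nested.append(["dog", "cat"])
print(nested)
print(len(nested))

# extend() adds each item of its argument separately.
flat = ["a", "b"]
flat.extend(["dog", "cat"])
print(flat)
print(len(flat))
```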


We can use lists to select several columns at once and in any order by passing a list of column names to .loc[]. Remember `:` means select all columns or all rows.

In [41]:
moma.loc[:, ["Name", "Label", "Birth date"]]

Unnamed: 0,Name,Label,Birth date
0,"Abad, Pacita",Pacita Abad,1946.0
1,"Abbas, Hamra",Hamra Abbas,1976.0
2,"Abe, Satoru",Satoru Abe,1926.0
3,"Abichandani, Jaishri",Jaishri Abichandani,1969.0
4,"Acebo Davis, Terry",Terry Acebo Davis,1953.0
...,...,...,...
925,"Zheng, Chongbin",Chongbin Zheng,1961.0
926,"Zheng, Lianjie",Lianjie Zheng,1962.0
927,"Zheng, Shengtian",Shengtian Zheng,1938.0
928,"Zhong, Yueying",Yueying Zhong,1960.0


You can make the list a separate variable for readability purposes.

In [42]:
longer_list = ["Name", "Label", "Birth date", "Death date", "Wikipedia"]
moma.loc[:, longer_list]

Unnamed: 0,Name,Label,Birth date,Death date,Wikipedia
0,"Abad, Pacita",Pacita Abad,1946.0,2004.0,Wikipedia
1,"Abbas, Hamra",Hamra Abbas,1976.0,,
2,"Abe, Satoru",Satoru Abe,1926.0,,Wikipedia
3,"Abichandani, Jaishri",Jaishri Abichandani,1969.0,,
4,"Acebo Davis, Terry",Terry Acebo Davis,1953.0,,Wikipedia
...,...,...,...,...,...
925,"Zheng, Chongbin",Chongbin Zheng,1961.0,,
926,"Zheng, Lianjie",Lianjie Zheng,1962.0,,Wikipedia
927,"Zheng, Shengtian",Shengtian Zheng,1938.0,,
928,"Zhong, Yueying",Yueying Zhong,1960.0,,
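The order of names in the list controls the order of columns in the result. The sketch below, on a made-up DataFrame, also shows that passing the list directly inside square brackets selects the same columns as `.loc[:, ...]`.

```python
import pandas as pd

# A tiny, made-up DataFrame.
toy = pd.DataFrame(
    {
        "Name": ["Doe, Jane", "Roe, Richard"],
        "Label": ["Jane Doe", "Richard Roe"],
        "Birth date": [1950.0, 1962.0],
    }
)

# The columns come back in the order given in the list, not the original order.
wanted = ["Birth date", "Name"]
via_loc = toy.loc[:, wanted]
print(list(via_loc.columns))

# Passing the list in square brackets selects the same columns.
via_brackets = toy[wanted]
print(via_loc.equals(via_brackets))
```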
