# Indexing & Filtering

- How many distinct artists are there in the dataset?
- How many artworks by Francis Bacon are there?
- What is the artwork with the biggest dimensions?

----
Do's & Don'ts
- Always use iloc and loc for indexing; they give consistent, predictable results
- Use bracket references such as df['artist'] in expressions

----

In [1]:
import pandas as pd

In [3]:
df = pd.read_pickle("data_frame.pickle")
df.head()

id,artist,title,medium,year,acquisitionYear,width,height,units
1035,"Blake, Robert",A Figure Bowing before a Seated Old Man with h...,"Watercolour, ink, chalk and graphite on paper....",,1922.0,394,419,mm
1036,"Blake, Robert","Two Drawings of Frightened Figures, Probably f...",Graphite on paper,,1922.0,311,213,mm
1037,"Blake, Robert",The Preaching of Warning. Verso: An Old Man En...,Graphite on paper. Verso: graphite on paper,1785.0,1922.0,343,467,mm
1038,"Blake, Robert",Six Drawings of Figures with Outstretched Arms,Graphite on paper,,1922.0,318,394,mm
1039,"Blake, William",The Circle of the Lustful: Francesca da Rimini...,Line engraving on paper,1826.0,1919.0,243,335,mm


In [4]:
# select 1 column
df["artist"].head()

id
1035     Blake, Robert
1036     Blake, Robert
1037     Blake, Robert
1038     Blake, Robert
1039    Blake, William
Name: artist, dtype: object

In [6]:
# select 2 columns
df[["artist", "title"]].head()

id,artist,title
1035,"Blake, Robert",A Figure Bowing before a Seated Old Man with h...
1036,"Blake, Robert","Two Drawings of Frightened Figures, Probably f..."
1037,"Blake, Robert",The Preaching of Warning. Verso: An Old Man En...
1038,"Blake, Robert",Six Drawings of Figures with Outstretched Arms
1039,"Blake, William",The Circle of the Lustful: Francesca da Rimini...


In [8]:
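# avoid attribute access: it is unpredictable, because a column whose name
# matches a DataFrame attribute or method is shadowed by that attribute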
df.artist.head()

id
1035     Blake, Robert
1036     Blake, Robert
1037     Blake, Robert
1038     Blake, Robert
1039    Blake, William
Name: artist, dtype: object
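
To illustrate why attribute access is risky, here is a minimal sketch on a hypothetical toy frame (not the artwork data). The column name "size" is an assumption chosen because it clashes with the built-in `DataFrame.size` attribute.

```python
import pandas as pd

# Hypothetical toy frame; "size" is an illustrative column name that
# clashes with the built-in DataFrame.size attribute.
toy = pd.DataFrame({"artist": ["Blake, Robert", "Blake, William"],
                    "size": [394, 243]})

by_bracket = toy["size"]   # always the column, returned as a Series
by_attribute = toy.size    # DataFrame.size: number of cells (2 rows * 2 columns)

print(type(by_bracket).__name__)
print(by_attribute)
```

With bracket access the result is the column whatever it is called, so expressions built on it keep working when column names change.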

----
How many distinct artists are there in the dataset?

In [9]:
artists = df['artist']

In [10]:
len(pd.unique(artists))

3336
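
An alternative worth knowing is `Series.nunique()`. The sketch below uses a hypothetical toy Series to show one difference: `pd.unique` keeps a missing value as its own entry, while `nunique()` skips missing values by default. Whether the artist column actually contains missing values is not shown above.

```python
import pandas as pd

# Hypothetical toy data with one missing artist
toy_artists = pd.Series(["Blake, Robert", "Blake, Robert", "Blake, William", None])

n_with_missing = len(pd.unique(toy_artists))   # the missing value counts as an entry
n_without_missing = toy_artists.nunique()      # missing values are skipped by default

print(n_with_missing, n_without_missing)
```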

----
How many artworks by Francis Bacon are there?

In [11]:
s = artists == 'Bacon, Francis'
s.value_counts()

False    69151
True        50
Name: artist, dtype: int64

In [12]:
# Or
artist_counts = artists.value_counts()
artist_counts['Bacon, Francis']

50
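
A third option, sketched here on a hypothetical toy Series, is to sum the boolean mask directly: `True` counts as 1 and `False` as 0, so the sum is the number of matching rows.

```python
import pandas as pd

# Hypothetical toy data
toy_artists = pd.Series(["Bacon, Francis", "Blake, Robert", "Bacon, Francis"])

is_bacon = toy_artists == "Bacon, Francis"   # boolean Series, one entry per row
n_bacon = int(is_bacon.sum())                # True counts as 1

print(n_bacon)
```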

----
## Indexing done the right way

- use loc and iloc - these are the indexers: loc selects by label, iloc by integer position
- these give consistent results

df.loc[ row indexer, column indexer ]
```
df.loc[ 1035, 'artist' ]  # one cell: row label 1035, column label 'artist'
df.loc[ df['artist']=='Bacon, Francis', : ]  # all the columns for Francis Bacon
```

df.iloc[ row indexer, column indexer ]
```
df.iloc[ 1:100, : ]  # rows at positions 1 to 99, all columns
df.iloc[ :, : ]  # all rows, all columns
```
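
The difference between the two indexers shows up most clearly when slicing. The sketch below uses a hypothetical toy frame with an integer `id` index like the one in this notebook: a `loc` slice works on labels and includes the end label, while an `iloc` slice works on positions and excludes the end position.

```python
import pandas as pd

# Hypothetical toy frame with an integer id index
toy = pd.DataFrame(
    {"artist": ["Blake, Robert", "Blake, Robert", "Blake, Robert", "Blake, William"],
     "width": [394, 311, 343, 243]},
    index=pd.Index([1035, 1036, 1037, 1039], name="id"),
)

by_label = toy.loc[1035:1037, "artist"]   # labels 1035..1037, end label included
by_position = toy.iloc[0:2, 0]            # positions 0 and 1, end position excluded

print(len(by_label), len(by_position))
```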

In [22]:
df.loc[ df['artist']=='Bacon, Francis', : ].head()

id,artist,title,medium,year,acquisitionYear,width,height,units
672,"Bacon, Francis",Figure in a Landscape,Oil paint on canvas,1945,1950.0,1448.0,1283.0,mm
673,"Bacon, Francis",Study of a Dog,Oil paint on canvas,1952,1952.0,1981.0,1372.0,mm
674,"Bacon, Francis",Three Studies for Figures at the Base of a Cru...,Oil paint on 3 boards,1944,1953.0,,,
677,"Bacon, Francis",Study for a Portrait of Van Gogh IV,Oil paint on canvas,1957,1958.0,1524.0,1168.0,mm
678,"Bacon, Francis",Reclining Woman,Oil paint on canvas,1961,1961.0,1988.0,1416.0,mm
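
The row indexer can also combine several conditions. This is a minimal sketch on a hypothetical toy frame; the year threshold is purely illustrative. Each condition needs its own parentheses, and conditions are joined with `&` (and) or `|` (or) rather than the Python keywords.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame(
    {"artist": ["Bacon, Francis", "Bacon, Francis", "Blake, William"],
     "year": [1945, 1952, 1826]},
    index=pd.Index([672, 673, 1039], name="id"),
)

# Parenthesise each condition; & combines them element-wise
mask = (toy["artist"] == "Bacon, Francis") & (toy["year"] > 1950)
selected = toy.loc[mask, ["artist", "year"]]

print(selected)
```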


In [26]:
df.loc[1035, 'artist']

'Blake, Robert'

In [24]:
df.iloc[ 90:100, : ]

id,artist,title,medium,year,acquisitionYear,width,height,units
1728,"Burne-Jones, Sir Edward Coley, Bt",Two Studies of a Seated Male Nude for ‘The Lib...,Graphite and chalk on paper,1863.0,1927.0,179,335,mm
1729,"Burne-Jones, Sir Edward Coley, Bt",Study for ‘Blind Love’,Graphite on paper,,1927.0,302,140,mm
1730,"Burne-Jones, Sir Edward Coley, Bt",Study for ‘Blind Love’,Graphite on paper,,1927.0,311,133,mm
1731,"Burne-Jones, Sir Edward Coley, Bt",Study of a Male Figure for ‘Clerk Saunders’,Graphite on paper,1861.0,1927.0,297,148,mm
1732,"Burne-Jones, Sir Edward Coley, Bt",Study of a Seated Woman for ‘The Hours’,Graphite on paper,1866.0,1927.0,478,264,mm
1733,"Burne-Jones, Sir Edward Coley, Bt",Study of a Reclining Figure,Graphite on paper,,1927.0,105,152,mm
1734,"Burne-Jones, Sir Edward Coley, Bt",Head of a Girl,Graphite on paper,1861.0,1927.0,186,140,mm
1735,"Burne-Jones, Sir Edward Coley, Bt",Two Studies for the Head of the King in ‘King ...,Graphite on paper,1880.0,1927.0,152,162,mm
1736,"Burne-Jones, Sir Edward Coley, Bt",Head of a Woman: ?Georgiana Burne-Jones,Graphite on paper,1861.0,1927.0,140,166,mm
20230,"Burne-Jones, Sir Edward Coley, Bt",Study of Heads for a Pietà,Graphite and chalk on paper,,1927.0,136,154,mm


In [27]:
df.iloc[0,0]

'Blake, Robert'

In [28]:
df.iloc[0,:]

artist                                                 Blake, Robert
title              A Figure Bowing before a Seated Old Man with h...
medium             Watercolour, ink, chalk and graphite on paper....
year                                                             NaN
acquisitionYear                                                 1922
width                                                            394
height                                                           419
units                                                             mm
Name: 1035, dtype: object

In [29]:
df.iloc[0:2,0:2]

id,artist,title
1035,"Blake, Robert",A Figure Bowing before a Seated Old Man with h...
1036,"Blake, Robert","Two Drawings of Frightened Figures, Probably f..."


----
What is the artwork with the biggest dimensions?

In [36]:
# This raises an error because the data is not clean.
# df['height'].astype(float) * df['width'].astype(float)

In [37]:
# note that the dtype is object: the values are stored as strings, not numbers
df['width'].sort_values().head()

id
20822            (1):
105337    (diameter):
98671         (each):
76420         (each):
91391        (image):
Name: width, dtype: object

In [38]:
# missing values appear as NaN and are sorted to the end
df['width'].sort_values().tail()

id
121283    NaN
117863    NaN
120549    NaN
122900    NaN
112306    NaN
Name: width, dtype: object

In [41]:
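# try to convert (raises, because of strings such as "(each):")
# pd.to_numeric(df['width'])

# Force unparseable values to NaN (this looks good)
# pd.to_numeric(df['width'], errors="coerce")

# Modify df['width']. Plain column assignment replaces the column and its dtype;
# df.loc[:, 'width'] = ... may keep the object dtype in pandas 2.x.
df['width'] = pd.to_numeric(df['width'], errors="coerce")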

In [43]:
# Modify df['height'] (same conversion as for width)
df['height'] = pd.to_numeric(df['height'], errors="coerce")
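
The effect of `errors="coerce"` is easier to see on a small example. The sketch below uses a hypothetical toy Series that mimics the dirty width values seen above: anything that cannot be parsed as a number becomes NaN and the result has a numeric dtype.

```python
import pandas as pd

# Hypothetical toy data mimicking the dirty width column
raw_width = pd.Series(["394", "(each):", None, "311"], dtype="object")

# Unparseable strings become NaN instead of raising an error
clean_width = pd.to_numeric(raw_width, errors="coerce")

print(clean_width.dtype)
print(clean_width.isna().sum())
```

Coercion silently discards the unparseable entries, so it is worth checking how many values turned into NaN before relying on the result.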

In [47]:
# Compute the area as a Series (width * height, in square mm)
area = df['width'] * df['height']
area.head()

id
1035    165086.0
1036     66243.0
1037    160181.0
1038    125292.0
1039     81405.0
dtype: float64

In [50]:
# add the area as a new column (the cells below rely on df['area'])
df = df.assign(area=area)

In [52]:
# thankfully only mm, but good to check
df['units'].value_counts()

mm    65860
Name: units, dtype: int64
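
The count above (65860) is lower than the 69201 rows implied by the earlier True/False counts, which suggests that some rows have no unit recorded. `value_counts()` leaves missing values out by default; passing `dropna=False` makes them visible. A minimal sketch on a hypothetical toy Series:

```python
import pandas as pd

# Hypothetical toy data with one missing unit
toy_units = pd.Series(["mm", "mm", None, "mm"])

counts_default = toy_units.value_counts()           # missing values are left out
counts_all = toy_units.value_counts(dropna=False)   # missing values get their own row

print(counts_default.sum(), counts_all.sum())
```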

In [56]:
# Wow, this is a very large size for a picture....
df['area'].max()

132462000.0

In [54]:
df['area'].idxmax()

98367

In [55]:
df.loc[df['area'].idxmax(), :]

artist                               Therrien, Robert
title                No Title (Table and Four Chairs)
medium             Aluminium, steel, wood and plastic
year                                             2003
acquisitionYear                                  2008
width                                            8920
height                                          14850
units                                              mm
area                                      1.32462e+08
Name: 98367, dtype: object

In [58]:
# ^^^ This looks like a sculpture, which would explain the area being so large
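
The whole lookup can be sketched end to end on a hypothetical toy frame. The ids and values are illustrative; only the last row reuses the width and height shown above. `idxmax()` returns the index label of the largest value and skips NaN by default, and that label can be passed straight to `loc`.

```python
import pandas as pd

# Hypothetical toy frame; the second row has a missing width
toy = pd.DataFrame(
    {"width": [394.0, None, 8920.0],
     "height": [419.0, 213.0, 14850.0]},
    index=pd.Index([1035, 1036, 98367], name="id"),
)

toy = toy.assign(area=toy["width"] * toy["height"])   # NaN wherever a side is missing
largest_id = toy["area"].idxmax()                     # index label of the largest area
largest = toy.loc[largest_id, :]                      # full row for that label

print(largest_id)
print(largest["area"])
```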