In [1]:
import pandas as pd 

In [2]:
cars = pd.read_csv('data/cars.csv')

In [3]:
people = pd.read_csv('data/people.csv')

In [4]:
cars.head()

Unnamed: 0,Name,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Origin
0,Chevrolet Chevelle Malibu,18.0,8,307.0,130,3504,12.0,US
1,Buick Skylark 320,15.0,8,350.0,165,3693,11.5,US
2,Plymouth Satellite,18.0,8,318.0,150,3436,11.0,US
3,AMC Rebel SST,16.0,8,304.0,150,3433,12.0,US
4,Ford Torino,17.0,8,302.0,140,3449,10.5,US


In [5]:
people.head()

Unnamed: 0,Name,Age,Weight,Height,Gender
0,Rita,27,67,1.65,F
1,Dexter,35,81,1.84,M
2,Anna,29,55,1.6,F
3,Bob,41,73,1.79,M


We can access a specific cell of a pandas dataframe using the **DataFrame.iloc** property by providing the row and column index. The name **iloc** stands for **integer location**.

When we only want to access a single location, as in the first example below, it is recommended to use the **DataFrame.iat** property. It has the same syntax but doesn't allow ranges.

In [6]:
print(people.iat[0, 0])

Rita


In [7]:
cars_odd = cars.iloc[1::2,:3]
fifth_odd_car_name = cars_odd.iat[4,0]
last_four = cars_odd.tail(4)
print(last_four)

                  Name   MPG  Cylinders
399  Dodge Charger 2.2  36.0          4
401    Ford Mustang GL  27.0          4
403      Dodge Rampage  32.0          4
405         Chevy S-10  31.0          4
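As a rough illustration of the difference between the two properties, the sketch below rebuilds a small dataframe in memory. The name `people_demo` and its values are assumptions that mirror the rows shown earlier; they are not read from the CSV file.

```python
import pandas as pd

# Hypothetical miniature of the people dataframe shown above.
people_demo = pd.DataFrame({
    "Name": ["Rita", "Dexter", "Anna", "Bob"],
    "Age": [27, 35, 29, 41],
    "Weight": [67, 81, 55, 73],
})

# iloc accepts slices: every second row starting at position 1, first two columns.
odd_rows = people_demo.iloc[1::2, :2]

# iat accepts exactly one row position and one column position.
second_odd_name = odd_rows.iat[1, 0]

print(odd_rows)
print(second_odd_name)
```

Note that the positions passed to `iat` are relative to `odd_rows`, not to the original dataframe, even though `odd_rows` keeps the original row labels.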


To access the data with row and column labels instead of indexes, we can use the **DataFrame.loc** property.

In [8]:
print(people.loc[1, 'Name'])

Dexter


By default, when we read a CSV, pandas uses the integer row positions as row labels. If we want something else, we need to say so explicitly through the index_col keyword argument of the pandas.read_csv() function. We pass it the index of the column whose values we want to use as labels for the rows.

Here's how we could use the Name column (index 0) as row labels when we read the CSV:

In [9]:
people = pd.read_csv('data/people.csv', index_col=0)

In [10]:
people.head()

Name,Age,Weight,Height,Gender
Rita,27,67,1.65,F
Dexter,35,81,1.84,M
Anna,29,55,1.6,F
Bob,41,73,1.79,M
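To see the effect of index_col without the data file at hand, here is a minimal sketch that reads the same kind of CSV from a string. The `csv_text` content is an assumption modeled on the output above, and `io.StringIO` is only used to stand in for the file.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for data/people.csv.
csv_text = """Name,Age,Weight,Height,Gender
Rita,27,67,1.65,F
Dexter,35,81,1.84,M
"""

# Without index_col, the rows get integer labels.
default_labels = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column (Name) supplies the row labels.
name_labels = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_labels.index.tolist())
print(name_labels.index.tolist())
print(name_labels.loc["Dexter", "Age"])
```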


In [11]:
people1 = pd.read_csv('data/people.csv')
people1.head()

Unnamed: 0,Name,Age,Weight,Height,Gender
0,Rita,27,67,1.65,F
1,Dexter,35,81,1.84,M
2,Anna,29,55,1.6,F
3,Bob,41,73,1.79,M


We can also change the index after loading the dataframe using the **DataFrame.set_index()** method. By default, this method returns a copy of the dataframe with the new index. If we don't want a copy but would rather modify the dataframe itself, we need to set the **inplace** keyword argument to **True**.

In [12]:
people_indexed_on_name = people1.set_index('Name') # Get new dataframe
people1.set_index('Name', inplace=True)            # Change the people1 dataframe directly

In [13]:
people1.head()

Name,Age,Weight,Height,Gender
Rita,27,67,1.65,F
Dexter,35,81,1.84,M
Anna,29,55,1.6,F
Bob,41,73,1.79,M
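The copy-versus-in-place behaviour can be checked on a small example. This is only a sketch: `people_demo` is a hypothetical stand-in for people1, and reassigning the result is shown as one possible alternative to inplace=True.

```python
import pandas as pd

# Hypothetical stand-in for the people1 dataframe.
people_demo = pd.DataFrame({
    "Name": ["Rita", "Dexter"],
    "Age": [27, 35],
})

# set_index returns a new dataframe; people_demo itself is untouched.
indexed_copy = people_demo.set_index("Name")
columns_before = people_demo.columns.tolist()
columns_in_copy = indexed_copy.columns.tolist()

# Reassigning the result is an alternative to inplace=True.
people_demo = people_demo.set_index("Name")
columns_after = people_demo.columns.tolist()

print(columns_before, columns_in_copy, columns_after)
```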


In [14]:
cars.set_index('Name', inplace=True)
weight_torino = cars.loc['Ford Torino', 'Weight']

In [15]:
weight_torino

3449

When we **convert a column into an index, that column is no longer a column in our dataframe**. For example, we set the Name column as the index of the people dataframe earlier. This means that this dataframe now has four columns: Age, Weight, Height, and Gender.

<Img src="https://github.com/rhnyewale/Data-Engineering/blob/main/Images/pandasindex.jpg?raw=true">

In [16]:
print(people1.shape)

(4, 4)


In [17]:
print(cars.loc['Ford Pinto', 'Weight'])

Name
Ford Pinto    2046
Ford Pinto    2310
Ford Pinto    2451
Ford Pinto    2639
Ford Pinto    2984
Ford Pinto    2565
Name: Weight, dtype: int64


We can remove an index by using the **DataFrame.reset_index()** method. This restores the column that we used to create the index and goes back to the default integer row labels. By default, it returns a new dataframe. If we want to change the current one instead, we need to pass **inplace=True** to it.

In [18]:
honda_civic_hp = cars.loc['Honda Civic','Horsepower']
print(honda_civic_hp)

Name
Honda Civic    97
Honda Civic    53
Honda Civic    67
Name: Horsepower, dtype: int64


In [19]:
honda_civic_hp.head()

Name
Honda Civic    97
Honda Civic    53
Honda Civic    67
Name: Horsepower, dtype: int64

In [20]:
cars.head()

Name,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Origin
Chevrolet Chevelle Malibu,18.0,8,307.0,130,3504,12.0,US
Buick Skylark 320,15.0,8,350.0,165,3693,11.5,US
Plymouth Satellite,18.0,8,318.0,150,3436,11.0,US
AMC Rebel SST,16.0,8,304.0,150,3433,12.0,US
Ford Torino,17.0,8,302.0,140,3449,10.5,US


In [21]:
cars.reset_index(inplace=True)
cars.head()

Unnamed: 0,Name,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Origin
0,Chevrolet Chevelle Malibu,18.0,8,307.0,130,3504,12.0,US
1,Buick Skylark 320,15.0,8,350.0,165,3693,11.5,US
2,Plymouth Satellite,18.0,8,318.0,150,3436,11.0,US
3,AMC Rebel SST,16.0,8,304.0,150,3433,12.0,US
4,Ford Torino,17.0,8,302.0,140,3449,10.5,US
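A small sketch of what reset_index() does to the index column follows. The `cars_demo` frame is an assumption built from a few of the values shown above, and the `drop=True` variant is included only to contrast it with the default.

```python
import pandas as pd

# Hypothetical miniature of the cars dataframe, indexed on Name.
cars_demo = pd.DataFrame(
    {"Weight": [3449, 2046, 2310]},
    index=pd.Index(["Ford Torino", "Ford Pinto", "Ford Pinto"], name="Name"),
)

# Default: the index comes back as a regular column.
restored = cars_demo.reset_index()

# drop=True discards the old index instead of restoring it.
dropped = cars_demo.reset_index(drop=True)

print(restored.columns.tolist())
print(dropped.columns.tolist())
print(restored.index.tolist())
```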


The true power of the loc property lies in the fact that we can also use it to select ranges of data. The syntax is the same as iloc and 2-dimensional ndarrays except for two differences:

1. We use row and column labels instead of indexes to specify the ranges.
2. The range endpoints now are inclusive.

In [22]:
weights = cars['Weight']
name_origin_0_and_3 = cars.loc[[0, 3], ['Name', 'Origin']]
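The inclusive endpoint is easy to overlook, so here is a minimal sketch comparing the two properties on the same slice. The `letters` dataframe is a hypothetical example with the default integer labels.

```python
import pandas as pd

# Hypothetical dataframe with default labels 0..4.
letters = pd.DataFrame({
    "lower": list("abcde"),
    "upper": list("ABCDE"),
    "code": [97, 98, 99, 100, 101],
})

by_position = letters.iloc[0:3]   # positions 0, 1, 2 (end excluded)
by_label = letters.loc[0:3]       # labels 0, 1, 2, 3 (end included)

# Column ranges by label are inclusive as well.
two_columns = letters.loc[:, "lower":"upper"]

print(len(by_position), len(by_label))
print(two_columns.columns.tolist())
```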

Pandas keeps two types of indexes:

* Integer indexes from 0 to the number of rows (or columns) minus one. These can be accessed using the **DataFrame.iloc** property.

* Label indexes. By default, the columns use the CSV header, and the rows use indexes starting at 0 and going up to the number of rows minus one. These can be accessed using the **DataFrame.loc** property.

**Difference between loc and iloc**

One way to remember this is to think about the first letter of the property:

* loc: label based selection
* iloc: integer based selection

<Img src="https://github.com/rhnyewale/Data-Engineering/blob/main/Images/loc_iloc.JPG?raw=true">
    

Labels don't need to be strings. As a matter of fact, the default labels on the rows are integers. Let's replace the index on the **cars** dataframe with integers starting at 1 rather than 0. To create an index from a specific list, we need to give that list to the **pandas.Index()** constructor.
    
<Img src="https://github.com/rhnyewale/Data-Engineering/blob/main/Images/loc_iloc2.jpg?raw=true">

In [23]:
num_rows = cars.shape[0]
one_index = pd.Index([i for i in range(1,num_rows+1)])
cars.set_index(one_index,inplace=True)

In [24]:
car_100 = cars.loc[100]
print(car_100)

Name            Ford LTD
MPG                   13
Cylinders              8
Displacement         351
Horsepower           158
Weight              4363
Acceleration          13
Origin                US
Name: 100, dtype: object
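Once the labels start at 1, loc and iloc no longer agree on what "row 1" means. The following sketch shows this on a hypothetical three-row frame named `cars_demo`; the car names are taken from the output above.

```python
import pandas as pd

# Hypothetical three-row version of the cars dataframe.
cars_demo = pd.DataFrame({
    "Name": ["Chevrolet Chevelle Malibu", "Buick Skylark 320", "Plymouth Satellite"],
})

# Relabel the rows 1..3 instead of 0..2.
cars_demo = cars_demo.set_index(pd.Index(range(1, len(cars_demo) + 1)))

first_by_label = cars_demo.loc[1, "Name"]   # label 1 is the first row
first_by_position = cars_demo.iat[0, 0]     # position 0 is still the first row
second_by_position = cars_demo.iat[1, 0]    # position 1 is the second row
has_label_zero = 0 in cars_demo.index       # label 0 no longer exists

print(first_by_label, "|", first_by_position, "|", second_by_position, "|", has_label_zero)
```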


In [25]:
cars2_10 = cars.loc[2:10]
print(cars2_10)

                  Name   MPG  Cylinders  Displacement  Horsepower  Weight  \
2    Buick Skylark 320  15.0          8         350.0         165    3693   
3   Plymouth Satellite  18.0          8         318.0         150    3436   
4        AMC Rebel SST  16.0          8         304.0         150    3433   
5          Ford Torino  17.0          8         302.0         140    3449   
6     Ford Galaxie 500  15.0          8         429.0         198    4341   
7     Chevrolet Impala  14.0          8         454.0         220    4354   
8    Plymouth Fury iii  14.0          8         440.0         215    4312   
9     Pontiac Catalina  14.0          8         455.0         225    4425   
10  AMC Ambassador DPL  15.0          8         390.0         190    3850   

    Acceleration Origin  
2           11.5     US  
3           11.0     US  
4           12.0     US  
5           10.5     US  
6           10.0     US  
7            9.0     US  
8            8.5     US  
9           10.0     

## Series

Pandas uses a **pandas.Series** object as a data structure to represent 1-dimensional data. In the previous mission, we learned that dataframes are the pandas equivalent of 2-dimensional ndarrays.

Series objects are the pandas equivalent of 1-dimensional ndarrays. They come with the same vectorized operations as ndarrays but are more flexible because they inherit the labels from the dataframe. This makes them very convenient because we can use names rather than indexes.

In [26]:
ages = people['Age']
print(type(ages))

<class 'pandas.core.series.Series'>


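A series works as an enhanced dictionary. It inherits **both the indexes and labels** from the original dataframe. We can access the values of a series using indexes from 0 to the length minus 1, like an ndarray. Note that recent pandas versions deprecate this plain-bracket positional access (such as ages[0]) on a label-indexed series in favor of **Series.iloc**.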

But we can also access them using the labels.

In [27]:
dexter = people.loc['Dexter']
print(dexter)
print(type(dexter))

Age         35
Weight      81
Height    1.84
Gender       M
Name: Dexter, dtype: object
<class 'pandas.core.series.Series'>


In [28]:
print(ages['Rita'])
print(ages[0])
print(dexter['Height'])
print(dexter[2])

27
27
1.84
1.84


Here we used the names as labels. What if we had instead kept the Name column and used numeric labels for the rows (blue values)? Say we define the dataframe index to start at 1 instead of 0, like in the cars dataframe.

Then we have an ambiguous situation for the ages series: an expression like ages[1] could refer either to the label 1 or to the position 1.

In this case, the labels (blue values) take priority, and we cannot use the indexes (pink values).
For this reason, we recommend that you use the labels (blue values) rather than the indexes (pink values) when accessing values in a pandas series. After all, one of the reasons why we wanted to move from NumPy to pandas is to be able to use labels to access rows and columns.
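The ambiguity can be reproduced with a small series whose labels start at 1. This is a sketch under the assumption that `ages_demo` holds the four ages shown earlier; using `.loc` and `.iloc` explicitly is one way to avoid having to guess which rule plain brackets follow.

```python
import pandas as pd

# Hypothetical ages series labeled 1..4 instead of 0..3.
ages_demo = pd.Series([27, 35, 29, 41], index=[1, 2, 3, 4], name="Age")

by_label = ages_demo.loc[1]        # label 1 is the first value
by_position = ages_demo.iloc[1]    # position 1 is the second value
plain = ages_demo[1]               # with integer labels, plain brackets use the labels

# Label 0 does not exist, so ages_demo[0] would raise a KeyError.
has_label_zero = 0 in ages_demo.index

print(by_label, by_position, plain, has_label_zero)
```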

A series object is closely related to a one-dimensional ndarray. In fact, we can convert a series into an ndarray using the **Series.values** attribute.

In [29]:
ages = people['Age'].values
print(ages)
print(type(ages))

[27 35 29 41]
<class 'numpy.ndarray'>
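The conversion drops the labels and keeps only the values. The sketch below uses `Series.to_numpy()`, a method that returns the same kind of ndarray as the `values` attribute for a numeric series; `ages_series` is a hypothetical copy of the Age column.

```python
import numpy as np
import pandas as pd

# Hypothetical copy of the Age column, labeled by name.
ages_series = pd.Series([27, 35, 29, 41], index=["Rita", "Dexter", "Anna", "Bob"])

# The resulting ndarray has positions only, no labels.
as_array = ages_series.to_numpy()

print(type(as_array))
print(as_array[1])           # positional access on the array
print(ages_series["Dexter"]) # label access on the series
```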


We learned about the ndarray.max(), ndarray.min() and ndarray.sum() methods. These compute the maximum value, minimum value and sum of the values, respectively. The same methods exist for series objects as well:


In [30]:
print(people['Age'].max())
print(people['Age'].min())
print(people['Age'].sum())

41
27
132


In [31]:
print(people['Weight'] + people['Height'])

Name
Rita      68.65
Dexter    82.84
Anna      56.60
Bob       74.79
dtype: float64


In [32]:
max_weight = cars['Weight'].max()
min_weight = cars['Weight'].min()
weight_ratio = max_weight/min_weight


In [33]:
print(weight_ratio)

3.186608803471792


In [34]:
print(people['Gender'].value_counts())

F    2
M    2
Name: Gender, dtype: int64


**Series to dictionary**<br>
Series.to_dict()

In [35]:
gender_count = people['Gender'].value_counts()
gender_count_dict = gender_count.to_dict()
print(gender_count_dict)

{'F': 2, 'M': 2}


In [36]:
origin_counts = cars['Origin'].value_counts()
origin_counts_dict = origin_counts.to_dict()
print(origin_counts_dict)

{'US': 254, 'Japan': 79, 'Europe': 73}


When we use a comparison operator between a series and a value, we get a boolean series object. We can then pass that boolean series to a dataframe to keep only the rows where it is True.


In [37]:
print(people['Gender']=='M')

Name
Rita      False
Dexter     True
Anna      False
Bob        True
Name: Gender, dtype: bool


In [38]:
european_cars = cars[cars['Origin'] == 'Europe']
print(european_cars.shape)

(73, 8)


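We can combine several conditions with the **& (and)** and **| (or)** operators, and use the **negation operator ~** to select all rows that don't satisfy a given condition. Each condition should be put between parentheses. For example, the first cell below selects the people who weigh at most 75 and are at most 1.7 tall, and the second selects all people who are not more than 30 years old: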

In [39]:
people[(people['Weight'] <= 75) & (people['Height'] <= 1.7)]

Name,Age,Weight,Height,Gender
Rita,27,67,1.65,F
Anna,29,55,1.6,F


In [40]:
people[~(people['Age'] > 30)]

Name,Age,Weight,Height,Gender
Rita,27,67,1.65,F
Anna,29,55,1.6,F


In [41]:
non_us_cars = cars[~(cars['Origin'] == 'US')]
low_mpg_horsepower = cars[(cars['MPG'] > 0) & (cars['MPG'] < 10) & (cars['Horsepower'] > 149)]
light_or_fast = cars[(cars['Weight'] < 2001) | (cars['Acceleration'] > 29)]
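The parentheses around each condition matter because `&` and `|` bind more tightly than comparison operators such as `<=` in Python. The sketch below applies the three operators to a hypothetical `people_demo` frame that mirrors the people data shown earlier.

```python
import pandas as pd

# Hypothetical copy of the people dataframe, indexed on name.
people_demo = pd.DataFrame(
    {
        "Age": [27, 35, 29, 41],
        "Weight": [67, 81, 55, 73],
        "Height": [1.65, 1.84, 1.60, 1.79],
    },
    index=["Rita", "Dexter", "Anna", "Bob"],
)

# and: both conditions must hold.
light_and_short = people_demo[(people_demo["Weight"] <= 75) & (people_demo["Height"] <= 1.7)]

# or: at least one condition must hold.
young_or_heavy = people_demo[(people_demo["Age"] < 30) | (people_demo["Weight"] > 80)]

# not: invert a condition.
not_over_30 = people_demo[~(people_demo["Age"] > 30)]

print(light_and_short.index.tolist())
print(young_or_heavy.index.tolist())
print(not_over_30.index.tolist())
```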

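We can select a list of columns by providing the names of the columns in a list. With loc, this list (or a label range of columns) can be combined with a boolean condition on the rows.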

In [42]:
non_us_cars

Unnamed: 0,Name,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Origin
11,Citroen DS-21 Pallas,0.0,4,133.0,115,3090,17.5,Europe
21,Toyota Corolla Mark ii,24.0,4,113.0,95,2372,15.0,Japan
25,Datsun PL510,27.0,4,97.0,88,2130,14.5,Japan
26,Volkswagen 1131 Deluxe Sedan,26.0,4,97.0,46,1835,20.5,Europe
27,Peugeot 504,25.0,4,110.0,87,2672,17.5,Europe
...,...,...,...,...,...,...,...,...
392,Honda Civic,38.0,4,91.0,67,1965,15.0,Japan
393,Honda Civic (auto),32.0,4,91.0,67,1965,15.7,Japan
394,Datsun 310 GX,38.0,4,91.0,67,1995,16.2,Japan
399,Toyota Celica GT,32.0,4,144.0,96,2665,13.9,Japan


In [43]:
non_us_cars.loc[non_us_cars['Acceleration']< 20,['Name','Weight','Acceleration']]

Unnamed: 0,Name,Weight,Acceleration
11,Citroen DS-21 Pallas,3090,17.5
21,Toyota Corolla Mark ii,2372,15.0
25,Datsun PL510,2130,14.5
27,Peugeot 504,2672,17.5
28,Audi 100 LS,2430,14.5
...,...,...,...
391,Toyota Corolla,2245,16.9
392,Honda Civic,1965,15.0
393,Honda Civic (auto),1965,15.7
394,Datsun 310 GX,1995,16.2


In [44]:
non_us_cars.loc[non_us_cars['Acceleration']< 20,'Name':'Acceleration']

Unnamed: 0,Name,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration
11,Citroen DS-21 Pallas,0.0,4,133.0,115,3090,17.5
21,Toyota Corolla Mark ii,24.0,4,113.0,95,2372,15.0
25,Datsun PL510,27.0,4,97.0,88,2130,14.5
27,Peugeot 504,25.0,4,110.0,87,2672,17.5
28,Audi 100 LS,24.0,4,107.0,90,2430,14.5
...,...,...,...,...,...,...,...
391,Toyota Corolla,34.0,4,108.0,70,2245,16.9
392,Honda Civic,38.0,4,91.0,67,1965,15.0
393,Honda Civic (auto),32.0,4,91.0,67,1965,15.7
394,Datsun 310 GX,38.0,4,91.0,67,1995,16.2


In [45]:
name_and_origin = cars.loc[(cars['MPG'] > 0)  & 
                           (cars['MPG'] < 12) & 
                           (cars['Horsepower'] >= 200), 
                           ['Name', 'Origin']]

name_and_origin

Unnamed: 0,Name,Origin
32,Ford F250,US
33,Chevy C20,US
34,Dodge D200,US
75,Mercury Marquis,US


**Adding a new column in pandas**

In [46]:
# BMI is weight divided by height squared (assuming kilograms and meters)
people['BMI'] = people['Weight'] / (people['Height'] ** 2)
people

Name,Age,Weight,Height,Gender,BMI
Rita,27,67,1.65,F,24.609734
Dexter,35,81,1.84,M,23.924858
Anna,29,55,1.6,F,21.484375
Bob,41,73,1.79,M,22.783309


In [47]:
cars['PW_ratio'] = cars['Horsepower'] / cars['Weight']
max_pw_ratio = cars['PW_ratio'].max()
print(max_pw_ratio)

0.0729099157485418


In [48]:
mpg_l100_constant = 235.214583
mpg_non_zero = cars.loc[cars['MPG'] > 0, 'MPG']
L100 = mpg_l100_constant / mpg_non_zero
cars['L/100'] = L100
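One detail worth noting in the cell above is that the new column is computed from a filtered series, so it has fewer rows than the dataframe. Pandas aligns the assignment on the row labels, and rows missing from the series receive NaN. The sketch below shows this on a hypothetical three-row `cars_demo` frame.

```python
import pandas as pd

# Hypothetical miniature of the cars dataframe; the second car has no MPG value recorded.
cars_demo = pd.DataFrame({"MPG": [18.0, 0.0, 27.0]}, index=[1, 2, 3])

mpg_l100_constant = 235.214583
mpg_non_zero = cars_demo.loc[cars_demo["MPG"] > 0, "MPG"]  # only labels 1 and 3

# Assignment aligns on labels: label 2 is absent, so it becomes NaN.
cars_demo["L/100"] = mpg_l100_constant / mpg_non_zero

print(cars_demo)
```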

In [49]:
cars

Unnamed: 0,Name,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Origin,PW_ratio,L/100
1,Chevrolet Chevelle Malibu,18.0,8,307.0,130,3504,12.0,US,0.037100,13.067477
2,Buick Skylark 320,15.0,8,350.0,165,3693,11.5,US,0.044679,15.680972
3,Plymouth Satellite,18.0,8,318.0,150,3436,11.0,US,0.043655,13.067477
4,AMC Rebel SST,16.0,8,304.0,150,3433,12.0,US,0.043694,14.700911
5,Ford Torino,17.0,8,302.0,140,3449,10.5,US,0.040591,13.836152
...,...,...,...,...,...,...,...,...,...,...
402,Ford Mustang GL,27.0,4,140.0,86,2790,15.6,US,0.030824,8.711651
403,Volkswagen Pickup,44.0,4,97.0,52,2130,24.6,Europe,0.024413,5.345786
404,Dodge Rampage,32.0,4,135.0,84,2295,11.6,US,0.036601,7.350456
405,Ford Ranger,28.0,4,120.0,79,2625,18.6,US,0.030095,8.400521


## Optimizing Dataframe Memory Footprint

The DataFrame.info() method prints a summary of a dataframe, including an estimate of the amount of memory it consumes.

In [50]:
moma = pd.read_csv('data/moma.csv')
moma.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34558 entries, 0 to 34557
Data columns (total 27 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   ExhibitionID            34129 non-null  float64
 1   ExhibitionNumber        34558 non-null  object 
 2   ExhibitionTitle         34558 non-null  object 
 3   ExhibitionCitationDate  34557 non-null  object 
 4   ExhibitionBeginDate     34558 non-null  object 
 5   ExhibitionEndDate       33354 non-null  object 
 6   ExhibitionSortOrder     34558 non-null  float64
 7   ExhibitionURL           34125 non-null  object 
 8   ExhibitionRole          34424 non-null  object 
 9   ConstituentID           34044 non-null  float64
 10  ConstituentType         34424 non-null  object 
 11  DisplayName             34424 non-null  object 
 12  AlphaSort               34424 non-null  object 
 13  FirstName               31499 non-null  object 
 14  MiddleName              3804 non-null 

The moma dataframe has an estimated memory footprint of 7.1+ megabytes.

To grasp how pandas calculates this estimate, we first need to understand how pandas represents different types of values. Based on the dataframe summary from the last step, we can tell that the moma dataframe only contains float64 and object columns. Let's examine how pandas represents these values.

Under the hood, pandas groups the columns into blocks of values of the same type. Here's a preview of how pandas stores the first seven columns of the moma dataframe:

<Img src="https://github.com/rhnyewale/Data-Engineering/blob/main/Images/df_memory_rep.jpg?raw=true">
    
You'll notice that the blocks don't maintain references to the column names. This is because blocks are optimized for storing the actual values in the dataframe.

The **BlockManager class** is responsible for maintaining the mapping between the row and column indexes and the actual blocks. It acts as an API that provides access to the underlying data. Whenever we select, edit, or delete values, the dataframe class interfaces with the BlockManager class to translate our requests to function and method calls.

In the pandas version used here, pandas uses the ObjectBlock class to represent the block containing string columns, and the FloatBlock class to represent the block containing float columns. For blocks representing numeric values like integers and floats, pandas combines the columns and stores them as a NumPy ndarray. Due to this storage scheme, accessing a slice of values is incredibly fast.

To observe how the BlockManager organizes the data, we can retrieve the internal BlockManager object from within a dataframe using the **DataFrame._data** private attribute. This returns the column and row axes, as well as the individual Block instance for each unique type in the dataframe. Because this is a private attribute, its name and printed output can differ between pandas versions.

In [51]:
print(moma._data)

BlockManager
Items: Index(['ExhibitionID', 'ExhibitionNumber', 'ExhibitionTitle',
       'ExhibitionCitationDate', 'ExhibitionBeginDate', 'ExhibitionEndDate',
       'ExhibitionSortOrder', 'ExhibitionURL', 'ExhibitionRole',
       'ConstituentID', 'ConstituentType', 'DisplayName', 'AlphaSort',
       'FirstName', 'MiddleName', 'LastName', 'Suffix', 'Institution',
       'Nationality', 'ConstituentBeginDate', 'ConstituentEndDate',
       'ArtistBio', 'Gender', 'VIAFID', 'WikidataID', 'ULANID',
       'ConstituentURL'],
      dtype='object')
Axis 1: RangeIndex(start=0, stop=34558, step=1)
FloatBlock: [0, 6, 9, 19, 20, 23, 25], 7 x 34558, dtype: float64
ObjectBlock: [1, 2, 3, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 21, 22, 24, 26], 20 x 34558, dtype: object


**Float Columns**

The float64 type represents floating-point values using 64 bits, or 8 bytes. There are 34,558 rows in the dataframe, which means that each float64 column in our dataframe uses 276,464 bytes of memory (34558 rows times 8 bytes).

Under the hood, pandas represents numeric values as NumPy ndarrays, and stores them in a contiguous block of memory. This storage model consumes less space and allows us to access the values themselves quickly. Because pandas represents each value of the same type using the same number of bytes, and a NumPy ndarray stores the number of values, pandas can return the number of bytes a numeric column consumes quickly and accurately.

We can retrieve the amount of memory the values in a column consume using the **Series.nbytes** attribute.
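The arithmetic above can be checked without loading the dataset. The sketch below builds a hypothetical float64 series of zeros with the same number of rows as the moma dataframe and reads its `nbytes` attribute.

```python
import numpy as np
import pandas as pd

# Row count reported for the moma dataframe.
num_rows = 34558

# Hypothetical float64 column with that many rows.
float_col = pd.Series(np.zeros(num_rows, dtype="float64"))

# Each float64 value takes 8 bytes, so nbytes is rows * 8.
print(float_col.nbytes)
print(num_rows * 8)
```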

**Object Columns**

The object type represents most non-numeric data, like string values. Pandas represents each of these values using a Python string object, partly due to the lack of support for missing string values in NumPy. Because Python is a high-level, interpreted language, it doesn't have fine-grained control over how values are stored in memory.

This limitation causes Python to store a list of strings in a fragmented way that consumes more memory and is slower to access. Each element in a Python list is really a pointer that contains the "address" for the actual value's location in memory. Here's a diagram that visualizes the difference between how NumPy and Python store an array of values:

<Img src="https://github.com/rhnyewale/Data-Engineering/blob/main/Images/Array_list_rep.jpg?raw=true">
    
**How pandas Estimates the Dataframe Memory Footprint**

Because the NumPy array stores its own dimensions underneath and all of the values in a NumPy array have the same type, pandas can accurately calculate the memory footprint of numeric columns without having to look up each value.

For object type columns, however, pandas only knows that each value consumes at least 8 bytes (for just the pointer) unless it inspects the linked value. This means that, for its default estimate, pandas counts 8 bytes of memory for each value in a float64 column and in an object column alike.

If you recall, a kilobyte is equivalent to 1,024 bytes (2^10), and a megabyte is equivalent to 1,048,576 bytes (2^20). With this in mind, we can calculate the estimated shallow memory footprint that the DataFrame.info() method returned.

In [52]:
print(moma.size)

933066


In [53]:
num_entries = moma.size
total_bytes = num_entries*8
total_megabytes = total_bytes/1048576

print(total_megabytes)

7.1187286376953125


The original memory footprint pandas returned was 7.1+ MB, which matches our result of about 7.12 megabytes from the last step. The + sign indicates that the true footprint is larger than the estimate. To force pandas to inspect the memory for each linked string value and return the true memory footprint, we need to set the memory_usage parameter to "deep" when calling DataFrame.info().

In [54]:
moma.info(memory_usage="deep")  # info() prints its report itself and returns None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34558 entries, 0 to 34557
Data columns (total 27 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   ExhibitionID            34129 non-null  float64
 1   ExhibitionNumber        34558 non-null  object 
 2   ExhibitionTitle         34558 non-null  object 
 3   ExhibitionCitationDate  34557 non-null  object 
 4   ExhibitionBeginDate     34558 non-null  object 
 5   ExhibitionEndDate       33354 non-null  object 
 6   ExhibitionSortOrder     34558 non-null  float64
 7   ExhibitionURL           34125 non-null  object 
 8   ExhibitionRole          34424 non-null  object 
 9   ConstituentID           34044 non-null  float64
 10  ConstituentType         34424 non-null  object 
 11  DisplayName             34424 non-null  object 
 12  AlphaSort               34424 non-null  object 
 13  FirstName               31499 non-null  object 
 14  MiddleName              3804 non-null 

The true memory footprint of our dataframe is 45.6 megabytes. This means that the actual Python strings for the object columns require about 38.5 megabytes (45.6 - 7.1).

Let's calculate the amount of memory just the object columns are consuming (both the pointers as well as the actual linked string values).

We can use the **DataFrame.memory_usage()** method to return the amount of memory each column consumes. We need to set the **deep** parameter to **True** to display the deep memory footprint of each column:

In [55]:
print(moma.memory_usage(deep=True))

Index                         128
ExhibitionID               276464
ExhibitionNumber          2085250
ExhibitionTitle           3333695
ExhibitionCitationDate    3577728
ExhibitionBeginDate       2281851
ExhibitionEndDate         2234872
ExhibitionSortOrder        276464
ExhibitionURL             3494606
ExhibitionRole            2179383
ConstituentID              276464
ConstituentType           2313112
DisplayName               2548428
AlphaSort                 2534329
FirstName                 2104909
MiddleName                1218917
LastName                  2162937
Suffix                    1110333
Institution               1221368
Nationality               1949664
ConstituentBeginDate       276464
ConstituentEndDate         276464
ArtistBio                 3183300
Gender                    1858994
VIAFID                     276464
WikidataID                1821293
ULANID                     276464
ConstituentURL            2677922
dtype: int64
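To see the shallow and deep figures side by side, here is a minimal sketch on a hypothetical two-column frame named `demo`. The object dtype is requested explicitly, as an assumption, so that the string column is stored as pointers to Python strings regardless of the pandas version.

```python
import pandas as pd

# Hypothetical frame: one float64 column and one object column.
demo = pd.DataFrame({
    "score": pd.Series([1.0, 2.0, 3.0], dtype="float64"),
    "title": pd.Series(["a", "a much longer title", "b"], dtype="object"),
})

shallow = demo.memory_usage(deep=False)  # counts only the pointers for object columns
deep = demo.memory_usage(deep=True)      # also inspects each linked Python string

print(shallow)
print(deep)
```

The float64 column reports the same size either way, while the object column grows once the strings themselves are counted.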


If we want to get a dataframe containing only the columns with the object datatype (in this case, 20 of the 27 columns), we can use the **DataFrame.select_dtypes()** method.

In [56]:
obj_cols = moma.select_dtypes(include=['object'])
print(obj_cols)

      ExhibitionNumber                              ExhibitionTitle  \
0                    1           Cézanne, Gauguin, Seurat, Van Gogh   
1                    1           Cézanne, Gauguin, Seurat, Van Gogh   
2                    1           Cézanne, Gauguin, Seurat, Van Gogh   
3                    1           Cézanne, Gauguin, Seurat, Van Gogh   
4                    1           Cézanne, Gauguin, Seurat, Van Gogh   
...                ...                                          ...   
34553             1536  Recent Japanese Posters from the Collection   
34554             1536  Recent Japanese Posters from the Collection   
34555             1536  Recent Japanese Posters from the Collection   
34556             1536  Recent Japanese Posters from the Collection   
34557             1537                Directed by Vincente Minnelli   

                                  ExhibitionCitationDate ExhibitionBeginDate  \
0            [MoMA Exh. #1, November 7-December 7, 1929]           

In [57]:
obj_cols = moma.select_dtypes(include=['object'])
obj_cols_mem = obj_cols.memory_usage(deep=True)
obj_cols_sum = obj_cols_mem.sum() / 1048576
print(obj_cols_sum)

43.76699352264404
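The same steps can be tried on a small frame. In this sketch the `demo` dataframe and its column names are hypothetical, and the `exclude` parameter of select_dtypes() is shown alongside `include` to pick the complementary set of columns.

```python
import pandas as pd

# Hypothetical frame with one numeric column and two object columns.
demo = pd.DataFrame({
    "id": pd.Series([1.0, 2.0], dtype="float64"),
    "name": pd.Series(["first", "second"], dtype="object"),
    "city": pd.Series(["here", "there"], dtype="object"),
})

obj_only = demo.select_dtypes(include=["object"])
non_obj = demo.select_dtypes(exclude=["object"])

# Deep memory usage of the object columns, converted to megabytes.
obj_megabytes = obj_only.memory_usage(deep=True).sum() / 1048576

print(obj_only.columns.tolist())
print(non_obj.columns.tolist())
print(obj_megabytes)
```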


Pandas uses 43.8 megabytes of the total 45.6 megabytes to represent the object columns. This means that we can achieve the greatest memory savings by converting object columns to numeric ones. However, we'll start by learning how to optimize the numeric datatypes.

Now that we understand how pandas represents two common data types in memory, let's learn more about the other types in pandas, their subtypes, and other ways we can reduce a dataframe's memory footprint.

Similarly to NumPy, many types in pandas have multiple subtypes that can use fewer bytes to represent each value. For example, the float type has the float16, float32, float64, and float128 subtypes. The number portion of a type's name indicates the number of bits that type uses to represent values. For example, the subtypes we just listed use 2, 4, 8 and 16 bytes, respectively. The following table shows the subtypes for the most common pandas types:

