# Performance considerations for Pandas - Part II
There is often more than one way to do something in pandas, and the choice can change the runtime by orders of magnitude. A little investigation and care in the implementation pays off in the long term.

The demonstration below uses the weather data from noaa.gov.
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/

The mapping from two-letter country code to country name is given in ghcnd-countries.txt. The weather station details are in ghcnd-stations.txt. Each station ID starts with the two-letter country code. The goal of this exercise is to add to the ghcnd-stations.txt table the name of the country where each station is located. There are about 114k stations, so the table is not very big: it takes about 8 MB in memory.

First, read and examine the data.

In [12]:
import pandas as pd
countries_df = pd.read_fwf("ghcnd-countries.txt",header=None, names=['code','country']) 
stations_df = pd.read_fwf('ghcnd-stations.txt', header=None, 
                names=['station', 'latitude','longitude','elevation','state','name','gsn_flag',
                      'hcn_crn_flag','wmo_id'])
print("Number of countries : ", len(countries_df))
print("Number of stations : ", len(stations_df))
print("Size of stations dataframe in memory : ", round(stations_df.memory_usage().sum()*1e-6,2), " MB")

Number of countries :  219
Number of stations :  113951
Size of stations dataframe in memory :  8.2  MB


In [2]:
stations_df.head()

Unnamed: 0,station,latitude,longitude,elevation,state,name,gsn_flag,hcn_crn_flag,wmo_id
0,ACW00011604,17.1167,-61.7833,10.1,ST JOHNS COOLIDGE FLD,,,,
1,ACW00011647,17.1333,-61.7833,19.2,ST JOHNS,,,,
2,AE000041196,25.333,55.517,34.0,SHARJAH INTER. AIRP,,GSN,41196.0,
3,AEM00041194,25.255,55.364,10.4,DUBAI INTL,,,41194.0,
4,AEM00041217,24.433,54.651,26.8,ABU DHABI INTL,,,41217.0,
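
In the `head()` output above, the station names appear to land under `state`, which suggests the column boundaries inferred by `read_fwf` do not line up with the nine names. One possible remedy is to pass explicit `colspecs`. The sketch below assumes the fixed-width layout described in the dataset's readme (the positions are an assumption here and should be checked against your copy), and it uses a hypothetical helper, `fixed_width_line`, to build two sample lines in memory instead of reading the real file.

```python
import io
import pandas as pd

# Assumed column positions (0-based, end-exclusive). Verify against the readme.
colspecs = [(0, 11), (12, 20), (21, 30), (31, 37), (38, 40),
            (41, 71), (72, 75), (76, 79), (80, 85)]
names = ['station', 'latitude', 'longitude', 'elevation', 'state',
         'name', 'gsn_flag', 'hcn_crn_flag', 'wmo_id']

def fixed_width_line(station, lat, lon, elev, state, name, gsn, hcn, wmo):
    """Hypothetical helper: format one record in the assumed layout."""
    return (f"{station:<11} {lat:>8} {lon:>9} {elev:>6} {state:<2} "
            f"{name:<30} {gsn:<3} {hcn:<3} {wmo:<5}")

sample = "\n".join([
    fixed_width_line("ACW00011604", "17.1167", "-61.7833", "10.1", "",
                     "ST JOHNS COOLIDGE FLD", "", "", ""),
    fixed_width_line("AE000041196", "25.333", "55.517", "34.0", "",
                     "SHARJAH INTER. AIRP", "GSN", "", "41196"),
])

# Explicit colspecs stop read_fwf from guessing where columns begin and end.
sample_df = pd.read_fwf(io.StringIO(sample), colspecs=colspecs,
                        header=None, names=names)
print(sample_df[['station', 'state', 'name', 'gsn_flag', 'wmo_id']])
```

The misalignment does not affect the timings below, since every method only reads the `station` column.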


First, create a dictionary that maps each prefix to its country name. The list has only about 200 countries, so this step is very quick. The dictionary is used in all of the attempts below.

In [3]:
%%time
countries_map = pd.Series(countries_df['country'].values, index=countries_df['code'].values).to_dict()

Wall time: 0 ns
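
As a small illustration of what `countries_map` holds, the sketch below builds it from a two-row stand-in for `countries_df` (the two rows are taken from the outputs in this notebook) and compares it with two other spellings that should give the same dictionary. The names `via_zip` and `via_index` are made up for this example.

```python
import pandas as pd

# Tiny stand-in for countries_df.
countries_df = pd.DataFrame({
    'code': ['AC', 'AE'],
    'country': ['Antigua and Barbuda', 'United Arab Emirates'],
})

# Same construction as in the cell above.
countries_map = pd.Series(countries_df['country'].values,
                          index=countries_df['code'].values).to_dict()

# Two alternative spellings of the same mapping.
via_zip = dict(zip(countries_df['code'], countries_df['country']))
via_index = countries_df.set_index('code')['country'].to_dict()

print(countries_map)
print(countries_map == via_zip == via_index)
```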


### Method 1: Iterate through every row
Iterate through every row and determine the country name based on the prefix. Note that with `iterrows()`, the `row` variable is a copy of the row (a Series), not a view into the dataframe, so you need to use `.loc` to set a value in the dataframe itself. This is probably the first approach that comes to mind for pandas beginners.

In [4]:
%%time
stations_df['country_1'] = ""
for i, row in stations_df.iterrows():
    prefix = row['station'][0:2]
    stations_df.loc[i,'country_1'] = countries_map[prefix]

Wall time: 8min 48s
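
The point about `row` being a copy can be checked on a small stand-in frame. This is a minimal sketch, assuming a frame with mixed column types like `stations_df`; the frame `df` and its values are illustrative only.

```python
import pandas as pd

# Stand-in frame with mixed dtypes (strings and floats), like stations_df.
df = pd.DataFrame({
    'station': ['ACW00011604', 'AE000041196'],
    'latitude': [17.1167, 25.333],
    'country': ['', ''],
})

for i, row in df.iterrows():
    # `row` is a separate Series built from the row's values,
    # so this assignment changes only the temporary object.
    row['country'] = 'changed'

untouched = df['country'].tolist()
print(untouched)  # the frame still holds the empty strings

for i, row in df.iterrows():
    # Writing through the frame itself does stick.
    df.loc[i, 'country'] = row['station'][:2]

print(df['country'].tolist())
```

Building a new Series for every row is also a likely contributor to the long runtime of this method.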


### Method 2: Iterate through every row, create a temporary Series
This is similar to Method 1, but creates a temporary Series to hold the column. The Series starts empty and grows by one element per row.

In [5]:
%%time
country_column = pd.Series()
for i, row in stations_df.iterrows():
    prefix = row['station'][0:2]
    country_column.at[i]= countries_map[prefix]
stations_df['country_2'] = country_column

Wall time: 8min 26s


### Method 3: Iterate through every row, create a temporary Series of known length
This is similar to Method 2, but the Series is initialized to the length of `stations_df`. Performance improves dramatically over Method 2.

In [6]:
%%time
country_column = pd.Series(index=stations_df.index,dtype=str)
for i, row in stations_df.iterrows():
    prefix = row['station'][0:2]
    country_column.at[i]= countries_map[prefix]
stations_df['country_3'] = country_column

Wall time: 14.5 s
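
A plausible explanation for the gap between Methods 2 and 3 is that assigning to a label that does not exist yet forces pandas to enlarge the Series, which costs more and more as the Series grows. The sketch below compares the two patterns on a small synthetic input. The size `n` and the `values` list are made up, `.loc` is used for the growing case, and the printed timings will vary by machine and pandas version.

```python
import time
import pandas as pd

n = 1000  # small stand-in size so the sketch runs quickly
values = [f"country_{k % 5}" for k in range(n)]

# Growing: each assignment targets a label that is not in the index yet.
start = time.perf_counter()
grown = pd.Series(dtype=object)
for i, v in enumerate(values):
    grown.loc[i] = v
grow_seconds = time.perf_counter() - start

# Preallocated: the index already contains every label.
start = time.perf_counter()
prealloc = pd.Series(index=range(n), dtype=object)
for i, v in enumerate(values):
    prealloc.at[i] = v
prealloc_seconds = time.perf_counter() - start

print(f"grown: {grow_seconds:.3f} s, preallocated: {prealloc_seconds:.3f} s")
print(grown.tolist() == prealloc.tolist())
```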


### Method 4: Iterate through every row, create a temporary list
This is similar to Method 2, but uses a temporary list to hold the column. Performance is similar to Method 3, yet the size of the list does not need to be known in advance.

In [7]:
%%time
country_list = []
for i, row in stations_df.iterrows():
    prefix = row['station'][0:2]
    country_list.append(countries_map[prefix])
stations_df['country_4'] = country_list

Wall time: 13.6 s
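
Since only the `station` value of each row is used, the same list could also be built by looping over that one column instead of calling `iterrows()`. The following is a sketch on a three-row stand-in, not one of the methods timed in this notebook, so no runtime is claimed for it.

```python
import pandas as pd

# Stand-ins for the mapping and the stations table.
countries_map = {'AC': 'Antigua and Barbuda', 'AE': 'United Arab Emirates'}
stations_df = pd.DataFrame({
    'station': ['ACW00011604', 'ACW00011647', 'AE000041196'],
    'latitude': [17.1167, 17.1333, 25.333],
})

# Iterate over the station IDs only; no per-row Series is created.
stations_df['country'] = [countries_map[s[:2]] for s in stations_df['station']]
print(stations_df)
```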


### Method 5: Use dictionary and map
For each country, get the list of its stations and build a dictionary that maps station to country. Then use this dictionary to create a new column of country names.

In [8]:
%%time
stations_map1 = {}
for prefix, country in countries_map.items():
    sub_stations_list = stations_df[stations_df['station'].str.startswith(prefix)]['station'].values.tolist()
    stations_map1.update({st : country for st in sub_stations_list })
stations_df['country_5'] = stations_df['station'].map(stations_map1)

Wall time: 8.53 s


### Method 6: Apply with custom function
Use the pandas `apply` method to call a function on every row and get the country name. The logic is very similar to Method 1, which uses `iterrows()`, but it is much faster.

In [9]:
%%time
def get_country_name(row):    
    prefix = row['station'][0:2]
    return countries_map[prefix]

stations_df['country_6'] = stations_df.apply(get_country_name, axis='columns')


Wall time: 1.61 s


### Method 7: Apply and Map
1. First, create a Series with just the prefixes: `stations_df['station'].apply(lambda x : x[0:2])`
2. Then use `map` to get the country names.

In [10]:
%%time
stations_df['country_7'] = stations_df['station'].apply(lambda x : x[0:2]).map(countries_map)

Wall time: 45.5 ms
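
The prefix step can also be written with the `.str` accessor instead of `apply` with a lambda. The sketch below is a variant on a stand-in Series and was not timed in this notebook. The `ZZ` station ID is a made-up value, included to show that `map` with a dictionary yields `NaN` for an unknown prefix, whereas the `countries_map[prefix]` lookups in the earlier methods would raise a `KeyError`.

```python
import pandas as pd

countries_map = {'AC': 'Antigua and Barbuda', 'AE': 'United Arab Emirates'}

# 'ZZ000000000' is a hypothetical ID whose prefix is not in the mapping.
stations = pd.Series(['ACW00011604', 'AE000041196', 'ZZ000000000'],
                     name='station')

prefixes = stations.str[:2]              # slice every string in the column
countries = prefixes.map(countries_map)  # dictionary lookup per element
print(countries)
```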


In [11]:
stations_df.head()

Unnamed: 0,station,latitude,longitude,elevation,state,name,gsn_flag,hcn_crn_flag,wmo_id,country_1,country_2,country_3,country_4,country_5,country_6,country_7
0,ACW00011604,17.1167,-61.7833,10.1,ST JOHNS COOLIDGE FLD,,,,,Antigua and Barbuda,Antigua and Barbuda,Antigua and Barbuda,Antigua and Barbuda,Antigua and Barbuda,Antigua and Barbuda,Antigua and Barbuda
1,ACW00011647,17.1333,-61.7833,19.2,ST JOHNS,,,,,Antigua and Barbuda,Antigua and Barbuda,Antigua and Barbuda,Antigua and Barbuda,Antigua and Barbuda,Antigua and Barbuda,Antigua and Barbuda
2,AE000041196,25.333,55.517,34.0,SHARJAH INTER. AIRP,,GSN,41196.0,,United Arab Emirates,United Arab Emirates,United Arab Emirates,United Arab Emirates,United Arab Emirates,United Arab Emirates,United Arab Emirates
3,AEM00041194,25.255,55.364,10.4,DUBAI INTL,,,41194.0,,United Arab Emirates,United Arab Emirates,United Arab Emirates,United Arab Emirates,United Arab Emirates,United Arab Emirates,United Arab Emirates
4,AEM00041217,24.433,54.651,26.8,ABU DHABI INTL,,,41217.0,,United Arab Emirates,United Arab Emirates,United Arab Emirates,United Arab Emirates,United Arab Emirates,United Arab Emirates,United Arab Emirates
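
Looking at `head()` only confirms the first five rows. A check over every row could look like the sketch below, shown here on a two-row stand-in with three of the country columns. The names `country_cols` and `all_agree` are introduced for this example.

```python
import pandas as pd

# Stand-in with a few of the country_N columns produced above.
df = pd.DataFrame({
    'station': ['ACW00011604', 'AE000041196'],
    'country_1': ['Antigua and Barbuda', 'United Arab Emirates'],
    'country_6': ['Antigua and Barbuda', 'United Arab Emirates'],
    'country_7': ['Antigua and Barbuda', 'United Arab Emirates'],
})

# Select the country_N columns and count distinct values in each row.
country_cols = df.filter(regex=r'^country_\d+$')
all_agree = bool(country_cols.nunique(axis=1).eq(1).all())
print(all_agree)
```

Note that `nunique` ignores missing values by default, so a row of all `NaN` would count as zero distinct values and fail this check.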


## Conclusion
The methods are roughly arranged from worst to best runtime. Note that memory is not a consideration here. As far as possible, operate on the DataFrame columns directly to get the best performance, and avoid `.iterrows()`. Try different methods to see which one is best. In this case, we went from about 8 min 48 s (Method 1) to 45.5 ms (Method 7), roughly an 11,600x improvement. Method 7 is **more than nine thousand times faster**!
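
The speedup of each method relative to Method 1 can be worked out from the wall times reported in the cells above. The sketch below does only that arithmetic. The timings come from a single run each, so the ratios are indicative rather than precise.

```python
# Wall times reported above, converted to seconds.
wall_seconds = {
    'Method 1': 8 * 60 + 48,
    'Method 2': 8 * 60 + 26,
    'Method 3': 14.5,
    'Method 4': 13.6,
    'Method 5': 8.53,
    'Method 6': 1.61,
    'Method 7': 45.5e-3,
}

baseline = wall_seconds['Method 1']
speedups = {method: baseline / seconds
            for method, seconds in wall_seconds.items()}

for method, factor in speedups.items():
    print(f"{method}: {wall_seconds[method]:9.4f} s, "
          f"{factor:8.1f}x relative to Method 1")
```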