# Performance considerations when building a Pandas DataFrame
Sometimes we need to collect values one at a time and store them in a DataFrame. This might happen when we make a series of API calls and store the result of each call (if valid) as a row. One tempting approach is to create an empty DataFrame and add rows as the data arrives ("I don't like to create unnecessary variables," someone told me). However, as I will demonstrate here, this is not the best way.
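As a rough sketch, the tempting approach described above might look like the following. Note that `fetch_result`, the column names, and the "every third call is invalid" rule are hypothetical stand-ins for a real API call, not part of the benchmark in this post.

```python
import pandas as pd

def fetch_result(i):
    """Hypothetical stand-in for an API call; returns None for an invalid result."""
    if i % 3 == 0:
        return None
    return {"id": i, "value": i * 0.5}

# The tempting approach: start with an empty DataFrame and grow it row by row.
results_df = pd.DataFrame(columns=["id", "value"])
for i in range(10):
    result = fetch_result(i)
    if result is not None:
        # Each assignment enlarges the DataFrame by one row.
        results_df.loc[len(results_df)] = [result["id"], result["value"]]
```

This works, but every row assignment enlarges the existing DataFrame, which is the cost the measurements in this post try to expose.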

To simulate building a DataFrame step by step, I am going to use a dummy array (a list of lists) and iterate through it. The first dataset is all floats, and the second dataset is randomly generated strings.

In [10]:
import numpy as np
import pandas as pd
num_rows = 10000
num_cols = 5
my_data = np.random.rand(num_rows,num_cols).tolist()

### Method 1: Construct an empty table and fill each cell individually

In [2]:
%%time
my_data_df1 = pd.DataFrame(columns=range(num_cols),index=range(num_rows))
for i in range(num_rows):
    for j in range(num_cols):
        my_data_df1.iloc[i,j] = my_data[i][j]  

Wall time: 1min 58s


### Method 2: Add one row at a time

In [3]:
%%time
my_data_df2 = pd.DataFrame(columns=range(num_cols))
for i in range(num_rows):
    row = []
    for j in range(num_cols):
        row.append(my_data[i][j])
    my_data_df2.loc[i] = row

Wall time: 20.2 s


### Method 3: Construct a Python list of lists and create the DataFrame

In [4]:
%%time
tmp_data = []
for row in my_data:
    line = []
    for j in row:
        line.append(j)
    tmp_data.append(line)
my_data_df3 = pd.DataFrame(tmp_data)

Wall time: 20.9 ms


Now we repeat the same process with randomly generated strings.

In [5]:
import random
import string
def randomString(stringLength=10):
    """Generate a random string of fixed length """
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for i in range(stringLength))

In [6]:
my_string_data = [[randomString() for _ in range(num_cols)] for _ in range(num_rows)]


### Method 1: Construct an empty table and fill each cell individually

In [7]:
%%time
my_string_df1 = pd.DataFrame(columns=range(num_cols),index=range(num_rows))
for i in range(num_rows):
    for j in range(num_cols):
        my_string_df1.iloc[i,j] = my_string_data[i][j] 

Wall time: 1min 56s


### Method 2: Add one row at a time

In [8]:
%%time
my_string_df2 = pd.DataFrame(columns=range(num_cols))
for i in range(num_rows):
    row = []
    for j in range(num_cols):
        row.append(my_string_data[i][j])
    my_string_df2.loc[i] = row

Wall time: 20.8 s


### Method 3: Construct a Python list of lists and create the DataFrame

In [9]:
%%time
tmp_str_data = []
for row in my_string_data:
    line = []
    for j in row:
        line.append(j)
    tmp_str_data.append(line)
my_string_df3 = pd.DataFrame(tmp_str_data)

Wall time: 19.9 ms


### Conclusion
Whenever possible, collect the complete data in a Python list of lists (or, more generally, a list of dictionaries or a dictionary of lists) before creating a DataFrame. In this example, building the list first (Method 3) was roughly 5,600 times faster than filling the DataFrame cell by cell (Method 1): 20.9 ms versus 1min 58s for the float data. That's a **performance improvement of about five thousand six hundred times!**
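The list-of-dictionaries and dictionary-of-lists variants mentioned above were not timed in this post, so the following is only a minimal sketch of what they could look like. `fetch_result` and the column names are hypothetical stand-ins for a real API call.

```python
import pandas as pd

def fetch_result(i):
    """Hypothetical stand-in for an API call; returns None for an invalid result."""
    if i % 3 == 0:
        return None
    return {"id": i, "value": i * 0.5}

# List of dictionaries: one dict per row, DataFrame built once at the end.
records = []
for i in range(10):
    result = fetch_result(i)
    if result is not None:
        records.append(result)
df_from_records = pd.DataFrame(records)

# Dictionary of lists: one list per column, DataFrame built once at the end.
columns = {"id": [], "value": []}
for i in range(10):
    result = fetch_result(i)
    if result is not None:
        columns["id"].append(result["id"])
        columns["value"].append(result["value"])
df_from_columns = pd.DataFrame(columns)
```

Both variants append to plain Python containers inside the loop and call `pd.DataFrame` exactly once, which is the same idea as Method 3.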