# Efficient Techniques for Working with Pandas DataFrames


Handling Pandas DataFrames can be a cumbersome and time-consuming process, particularly when dealing with looping operations. As a Python developer, you may have spent considerable effort searching for the most efficient way to loop through DataFrames. What if there were simpler and more effective alternatives? This article explores superior methods of working with Pandas DataFrames that eliminate the need for looping. We'll delve into the advantages of these approaches and provide practical examples to get you started. If you're eager to enhance your DataFrame workflow and bid farewell to tedious loops, keep reading!

## Table of Contents:
1. Why Do We Need Efficient Coding?
2. Looping Effectively Using .iterrows()
3. Looping Effectively Using .apply()
4. Looping Effectively Using Vectorization
5. Summary of Best Practices

Throughout this article, we will use the Poker card game dataset. First let's load and explore the data:

In [1]:
import pandas as pd
import time
poker_data = pd.read_csv('poker_hand.csv')
poker_data.head()

Unnamed: 0,S1,R1,S2,R2,S3,R3,S4,R4,S5,R5,Class
0,1,10,1,11,1,13,1,12,1,1,9
1,2,11,2,13,2,10,2,12,2,1,9
2,3,12,3,11,3,13,3,10,3,1,9
3,4,10,4,11,4,1,4,13,4,12,9
4,4,1,4,13,4,12,4,11,4,10,9


In each poker round, each player has five cards in hand, each one characterized by its symbol, which can be either hearts, diamonds, clubs, or spades, and its rank, which ranges from 1 to 13. The dataset consists of every possible combination of five cards one person can possess.
* Sn: symbol of the n-th card where: 1 (Hearts), 2 (Diamonds), 3 (Clubs), 4 (Spades)
* Rn: rank of the n-th card where: 1 (Ace), 2–10, 11 (Jack), 12 (Queen), 13 (King)

## 1. Why Do We Need Efficient Coding?


Efficient code is code that executes faster and uses less memory. In this article, we will use the time() function from Python's time module to measure execution time.

This function returns the current time, so we record it immediately before and immediately after running the code and take the difference to get the elapsed time. A simple example is shown in the code below:



In [None]:
# record time before execution
start_time = time.time()
# execute operation
result = 5 + 2
# record time after execution
end_time = time.time()
print("Result calculated in {} sec".format(end_time - start_time))

Result calculated in 0.0001010894775390625 sec
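A single time.time() reading around a very short operation is noisy, so the number above will vary from run to run. As a hedged alternative, the sketch below repeats the call several times with time.perf_counter() and keeps the best run. The helper name best_time and the repeat count are illustrative assumptions, not part of the original notebook.

```python
import time

def best_time(func, repeats=5):
    """Return the shortest wall-clock time (in seconds) over several runs of func()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # high-resolution clock suited to short measurements
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Hypothetical usage: time squaring the first 100,000 integers
elapsed = best_time(lambda: [i * i for i in range(100_000)])
print("Best of 5 runs: {:.6f} sec".format(elapsed))
```

Taking the minimum rather than the mean reduces the influence of other processes that happen to be running during one of the repeats.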


Let's look at some examples of how writing efficient code reduces runtime. We will calculate the square of each integer from zero up to, but not including, one million. First we will use a list comprehension, and then we will repeat the same operation using a for loop.

First using list comprehension:




In [None]:
#using List comprehension 
list_comp_start_time = time.time()
result = [i*i for i in range(0,1000000)]
list_comp_end_time = time.time()
print("Time using the list_comprehension: {} sec".format(list_comp_end_time -
list_comp_start_time))

Time using the list_comprehension: 0.12260246276855469 sec


Now we will use a for loop to execute the same operation:



In [None]:
# Using For loop
for_loop_start_time= time.time()
result=[]
for i in range(0,1000000):
  result.append(i*i)
for_loop_end_time= time.time()
print("Time using the for loop: {} sec".format(for_loop_end_time - for_loop_start_time))

Time using the for loop: 0.37175941467285156 sec


We can see that there is a big difference between the two. We can express that difference as a percentage of the list comprehension's time:



In [None]:
list_comp_time = list_comp_end_time - list_comp_start_time
for_loop_time = for_loop_end_time - for_loop_start_time
print("Difference in time: {} %".format((for_loop_time - list_comp_time)/
list_comp_time*100))

Difference in time: 203.22344778232394 %


Here is another example of the effect of writing efficient code. We would like to calculate the sum of all consecutive integers from 1 to 1 million. There are two ways to do this. The first is brute force, in which we add the numbers one by one up to a million.



In [None]:
def sum_brute_force(N):
  res = 0
  for i in range(1,N+1):
    res+=i
  return res

# Using brute force
bf_start_time = time.time()
bf_result = sum_brute_force(1000000)
bf_end_time = time.time()

print("Time using brute force: {} sec".format(bf_end_time - bf_start_time))

Time using brute force: 0.06304192543029785 sec


A more efficient method is to use a formula. To calculate the sum of all the integers from 1 up to some number N, we multiply N by N+1 and then divide by 2. According to a well-known anecdote, this problem was given to a class of schoolchildren in Germany in the late 18th century, and a young Carl Friedrich Gauss used this insight to solve it in seconds.

In [None]:
def sum_formula(N):
  # N*(N+1) is always even, so integer division is exact and keeps the result an int
  return N*(N+1)//2

# Using the formula
formula_start_time = time.time()
formula_result = sum_formula(1000000)
formula_end_time = time.time()

print("Time using the formula: {} sec".format(formula_end_time - formula_start_time))

Time using the formula: 5.8650970458984375e-05 sec


Comparing the two runs, the formula is roughly 1,000 times faster than brute force here (a difference of more than 100,000%), which clearly demonstrates why we need efficient and optimized code, even for simple tasks.
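A faster version is only useful if it returns the same answer. The short check below is a minimal sketch that redefines both functions so it runs on its own and confirms they agree for a few values of N. The test values are chosen arbitrarily for illustration.

```python
def sum_formula(n):
    # Closed-form sum of 1..n; integer division keeps the result an int
    return n * (n + 1) // 2

def sum_brute_force(n):
    # Add the integers 1..n one at a time
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

# The two approaches must agree before their speed is worth comparing
for n in (1, 10, 100, 1_000_000):
    assert sum_formula(n) == sum_brute_force(n)

print(sum_formula(100))  # 5050
```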



One of the most common sources of inefficiency in Python code is heavy use of explicit loops, especially on large data. As a data scientist, you will need to iterate through your DataFrame extensively, especially in the data preparation and exploration phase, so it is important to be able to do this efficiently: it will save you much time and leave room for more important work. We will walk through three methods for working through a DataFrame, from slowest to fastest:

* Looping using the .iterrows() function
* Looping using the .apply() function
* Vectorization


## 2. Looping Effectively Using .iterrows()
Before we talk about how to use the .iterrows() function to improve the looping process, let’s refresh the notion of a generator function.

Generators are a simple tool for creating iterators. Inside the body of a generator function, values are produced with yield statements rather than handed back once with return. There can be just one yield statement or several. Here, we can see a generator function, city_name_generator(), that produces four city names. We call it and assign the resulting generator to the variable city_names for simplicity.



In [None]:
def city_name_generator():
  yield 'New York'
  yield 'London'
  yield 'Tokyo'
  yield 'Sao Paolo'

city_names = city_name_generator()


To access the elements that the generator yields, we can use Python's next() function. Each time next() is called, the generator produces its next value, until there are no more values to yield. Here we have four cities. Let's call next() four times and see what it returns:



In [None]:
next(city_names)

'New York'

In [None]:
next(city_names)

'London'

In [None]:
next(city_names)

'Tokyo'

In [None]:
next(city_names)

'Sao Paolo'

As we can see, every time we call next() it returns a new city name.
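Once a generator has produced all of its values, a further next() call raises StopIteration. The sketch below uses a shortened, hypothetical two-city generator to show two common ways of dealing with that: giving next() a default value, and letting a for loop consume the generator.

```python
def city_name_generator():
    yield 'New York'
    yield 'London'

city_names = city_name_generator()
print(next(city_names))  # New York
print(next(city_names))  # London

# The generator is now exhausted; a default value avoids the StopIteration exception
print(next(city_names, 'no more cities'))

# A for loop (or a comprehension) handles StopIteration automatically
collected = [city for city in city_name_generator()]
print(collected)
```

This is also why a for loop over .iterrows() stops cleanly after the last row.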



Let's go back to .iterrows(). It is a method of every pandas DataFrame. When called, it returns a generator that yields one tuple with two elements per row. We will use this generator to iterate through each row of our poker DataFrame. The first element is the index of the row, while the second is a pandas Series holding the row's values: the symbol and the rank of each of the five cards. This is very similar to the enumerate() function, which, when applied to a list, returns each element along with its index.
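To see what .iterrows() hands back, the sketch below pulls a single item from the generator with next(). The two-row DataFrame is a hypothetical stand-in that only borrows the article's column naming.

```python
import pandas as pd

# Hypothetical two-hand sample using the article's column naming
sample = pd.DataFrame({'S1': [1, 2], 'R1': [10, 11], 'S2': [1, 2], 'R2': [11, 13]})

row_generator = sample.iterrows()
index, row = next(row_generator)  # each item is an (index, Series) pair

print(index)       # label of the first row
print(type(row))   # a pandas Series holding that row's values
print(row['R1'])   # a single value, selected by column name
```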

The most intuitive way to iterate through a pandas DataFrame is to loop over the row positions with the range() function, which is often called crude looping. This is shown in the code below:



In [None]:
start_time = time.time()
for index in range(poker_data.shape[0]):
  pass  # empty body: we only time the loop itself
print("Time using range(): {} sec".format(time.time() - start_time))

Time using range(): 0.0036385059356689453 sec


One smarter way to iterate through a pandas DataFrame is to use the **.iterrows()** method, which is designed for this task. We simply define the **‘for’** loop with two loop variables, one for the index of each row and the other for the row's values.

Inside the loop, the **pass** statement does nothing, so we measure only the cost of moving from one row to the next.

In [None]:
data_generator = poker_data.iterrows()
start_time = time.time()
for index, values in data_generator:
  pass  # empty body: we only time the iteration itself
print("Time using .iterrows(): {} sec".format(time.time() - start_time))

Time using .iterrows(): 1.2583379745483398 sec


Comparing the two times, we notice that .iterrows() does not improve the speed of iterating through a pandas DataFrame. Note that the range() loop above never reads the rows, while .iterrows() builds a Series for every row, which accounts for much of the gap. .iterrows() is still very useful when we need a cleaner way to use the values of each row while iterating through the dataset.
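For a like-for-like comparison, the range() loop should also fetch each row. The sketch below does that with .iloc on a hypothetical 5,000-row DataFrame (not the poker data) and checks that both loops compute the same totals. The timings depend on the machine and the pandas version, so no expected numbers are shown.

```python
import time
import pandas as pd

# Hypothetical data: 5,000 rows with two rank-like columns
sample = pd.DataFrame({'R1': range(5000), 'R2': range(5000)})

start = time.perf_counter()
range_totals = []
for position in range(sample.shape[0]):
    row = sample.iloc[position]  # actually fetch the row this time
    range_totals.append(row['R1'] + row['R2'])
range_time = time.perf_counter() - start

start = time.perf_counter()
iterrows_totals = []
for index, row in sample.iterrows():
    iterrows_totals.append(row['R1'] + row['R2'])
iterrows_time = time.perf_counter() - start

print("range() + .iloc: {:.4f} sec".format(range_time))
print(".iterrows():     {:.4f} sec".format(iterrows_time))
```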


## 3. Looping Effectively Using .apply()

Now we will use the **.apply()** method to perform a specific task while iterating through a pandas DataFrame. The **.apply()** method does what its name says: it applies another function along an axis of the DataFrame, that is, to each column or to each row.

The syntax of **.apply()** is simple: we pass it the function we want to apply, a lambda function in this case, and the axis to apply it along. Here, with axis=0, the lambda receives each column in turn and takes the square root of all its values, so every cell of the DataFrame ends up square-rooted. Because NumPy's sqrt() processes each column as a whole, the speed is comparable to calling np.sqrt() on the whole DataFrame directly.


In [None]:
import numpy as np
data_sqrt = poker_data.apply(lambda x: np.sqrt(x), axis=0)
data_sqrt.head()

Unnamed: 0,S1,R1,S2,R2,S3,R3,S4,R4,S5,R5,Class
0,1.0,3.162278,1.0,3.316625,1.0,3.605551,1.0,3.464102,1.0,1.0,3.0
1,1.414214,3.316625,1.414214,3.605551,1.414214,3.162278,1.414214,3.464102,1.414214,1.0,3.0
2,1.732051,3.464102,1.732051,3.316625,1.732051,3.605551,1.732051,3.162278,1.732051,1.0,3.0
3,2.0,3.162278,2.0,3.316625,2.0,1.0,2.0,3.605551,2.0,3.464102,3.0
4,2.0,1.0,2.0,3.605551,2.0,3.464102,2.0,3.316625,2.0,3.162278,3.0


This is a simple case because the function acts on each value independently.

But what happens when the function of interest takes more than one cell as input? For example, what if we want to calculate the sum of the ranks of all the cards in each hand? In this case, we use .apply() the same way as before, but we pass axis=1 to specify that the function is applied to each row.



In [None]:
apply_start_time = time.time()
poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].apply(lambda x: sum(x), axis=1)
apply_end_time = time.time()
apply_time = apply_end_time - apply_start_time
print("Time using .apply(): {} sec".format(apply_time))

Time using .apply(): 0.2000577449798584 sec
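The effect of the axis argument is easiest to see on a tiny example. The sketch below uses a hypothetical three-hand DataFrame with only two rank columns and applies the same lambda along each axis.

```python
import pandas as pd

# Hypothetical three-hand sample with only two rank columns
ranks = pd.DataFrame({'R1': [10, 11, 12], 'R2': [11, 13, 11]})

row_sums = ranks.apply(lambda row: sum(row), axis=1)     # one value per row (hand)
column_sums = ranks.apply(lambda col: sum(col), axis=0)  # one value per column

print(row_sums.tolist())     # [21, 24, 23]
print(column_sums.tolist())  # [33, 35]
```

With axis=1 the result has one entry per hand, and with axis=0 it has one entry per column.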


Then, we will use the .iterrows() function we saw previously, and compare their efficiency.



In [None]:
for_loop_start_time = time.time()
for ind, value in poker_data.iterrows():
  # select the five ranks by column name rather than by position
  sum([value['R1'], value['R2'], value['R3'], value['R4'], value['R5']])
for_loop_end_time = time.time()

for_loop_time = for_loop_end_time - for_loop_start_time
print("Time using .iterrows(): {} sec".format(for_loop_time))

Time using .iterrows(): 1.1545953750610352 sec


Using .apply() is significantly faster than .iterrows(): the loop takes roughly 477 percent longer in this run, which makes .apply() a massive improvement!



In [None]:
print('The difference: {} %'.format((for_loop_time - apply_time) / apply_time * 100))

The difference: 477.1310554246618 %


As we did with rows, we can do exactly the same thing for columns and apply one function to each column. By replacing axis=1 with axis=0, we apply the sum function to every column.



In [None]:
apply_start_time = time.time()
poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].apply(lambda x: sum(x), axis=0)
apply_end_time = time.time()
apply_time = apply_end_time - apply_start_time
print("Time using .apply(): {} sec".format(apply_time))

Time using .apply(): 0.021090030670166016 sec


Comparing **.apply()** with pandas' native function for summing each column, we can see that the native .sum() method performs the same operation faster.



In [None]:
pandas_start_time = time.time()
poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].sum(axis=0)
pandas_end_time = time.time()
pandas_time = pandas_end_time - pandas_start_time
print("Time using pandas: {} sec".format(pandas_time))

Time using pandas: 0.0039751529693603516 sec


In [None]:
print('The difference: {} %'.format((apply_time - pandas_time) / pandas_time * 100))

The difference: 430.54639237089907 %


In conclusion, .apply() is faster than .iterrows() when we want to process every row of a pandas DataFrame, but for column-wise operations it is slower than pandas' native methods such as .sum().



## 4. Looping Effectively Using Vectorization

To understand how we can reduce the amount of iteration performed, recall that the fundamental units of pandas, DataFrames and Series, are both based on arrays. pandas performs more efficiently when an operation is applied to a whole array than to each value separately or sequentially. This can be achieved through vectorization, the process of executing operations on entire arrays.
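To make the idea concrete, here is a minimal sketch on a hypothetical three-row DataFrame. It computes the same per-row totals twice: once by visiting each row in Python, and once with a single expression over whole columns.

```python
import pandas as pd

# Hypothetical three-hand sample with only two rank columns
ranks = pd.DataFrame({'R1': [10, 11, 12], 'R2': [11, 13, 11]})

# Element by element: Python visits every row in turn
looped = [ranks['R1'].iloc[i] + ranks['R2'].iloc[i] for i in range(len(ranks))]

# Vectorized: one expression adds the two columns as whole arrays
vectorized = ranks['R1'] + ranks['R2']

print(vectorized.tolist())  # [21, 24, 23]
```

Both versions give the same totals; the vectorized form simply leaves the looping to optimized array code.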

In the code below we want to calculate the sum of the ranks of all the cards in each hand. In order to do that, we slice the poker dataset, keeping only the columns that contain the ranks of each card. Then, we call the built-in .sum() method of the DataFrame, using the parameter axis=1 to denote that we want the sum for each row, and time the operation.



In [None]:
start_time_vectorization = time.time()

poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].sum(axis=1)
end_time_vectorization = time.time()

vectorization_time = end_time_vectorization  - start_time_vectorization
print("Time using pandas vectorization: {} sec".format(vectorization_time))

Time using pandas vectorization: 0.009327411651611328 sec


We saw previously various methods that perform functions applied to a DataFrame faster than simply iterating through all the rows of the DataFrame. Our goal is to find the most efficient method to perform this task.



Using .iterrows() to loop through the DataFrame:


In [None]:
data_generator = poker_data.iterrows()

start_time_iterrows = time.time()

for index, value in data_generator:
  # add up all five ranks of the hand, selecting each by column name
  sum([value['R1'], value['R2'], value['R3'], value['R4'], value['R5']])

end_time_iterrows = time.time()
iterrows_time = end_time_iterrows - start_time_iterrows
print("Time using .iterrows() {} seconds".format(iterrows_time))

Time using .iterrows() 1.1502439975738525 seconds 


Using the .apply() method:


In [None]:
start_time_apply = time.time()
poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].apply(lambda x: sum(x),axis=1)
end_time_apply = time.time()

apply_time = end_time_apply - start_time_apply

print("Time using apply() {} seconds"  .format(apply_time))

Time using apply() 0.3497791290283203 seconds


Comparing the time it takes to sum the ranks of all the cards in each hand using vectorization, the .iterrows() function, and the .apply() function, we can see that the vectorization method performs much better.

We can also use another vectorization method to process the DataFrame efficiently: converting it to NumPy arrays.

The NumPy library, which defines itself as a “fundamental package for scientific computing in Python”, performs operations under the hood in optimized, pre-compiled C code. Similar to pandas working with arrays, NumPy operates on arrays called ndarrays. A major difference between Series and ndarrays is that ndarrays leave out many operations such as indexing, data type checking, etc. As a result, operations on NumPy arrays can be significantly faster than operations on pandas Series. NumPy arrays can be used in place of the pandas Series when the additional functionality offered by the pandas Series isn’t critical.

For the problems we explore in this article, we could use NumPy ndarrays instead of pandas Series. The question at stake is whether this would be more efficient or not.

Again, we will calculate the sum of the ranks of all the cards in each hand. We convert the rank columns from a pandas DataFrame to a NumPy array simply by using the .values attribute, which returns the underlying data as a NumPy ndarray (the pandas documentation recommends the equivalent .to_numpy() method in newer code). As with vectorization on the Series, calling .sum() on the NumPy array applies the operation to the entire array at once.



In [None]:
start_time = time.time()

poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].values.sum(axis=1)

print("Time using NumPy vectorization: {} sec" .format(time.time() - start_time))

Time using NumPy vectorization: 0.001745462417602539 sec


In [None]:
start_time = time.time()
poker_data[['R1', 'R2', 'R3', 'R4', 'R5']].sum(axis=1)
print("Time using the pandas vectorization %s seconds" % (time.time() - start_time))

Time using the pandas vectorization 0.003729104995727539 seconds
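The sketch below shows the same conversion with .to_numpy() on a hypothetical three-hand DataFrame and confirms that the NumPy and pandas row sums match. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-hand sample with three rank columns
ranks = pd.DataFrame({'R1': [10, 11, 12], 'R2': [11, 13, 11], 'R3': [13, 10, 13]})

rank_array = ranks.to_numpy()        # 2-D ndarray, one row per hand
numpy_sums = rank_array.sum(axis=1)  # NumPy sums each row
pandas_sums = ranks.sum(axis=1)      # pandas equivalent, returns a Series

print(type(rank_array).__name__)  # ndarray
print(numpy_sums.tolist())        # [34, 34, 36]
```

One design consequence is that the NumPy result is a plain array without the DataFrame's index, so it is a good fit when row labels are not needed afterwards.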


## 5. Summary of Best Practices for Looping Through DataFrames
* Using **.iterrows()** does not improve the speed of iterating through the DataFrame, but it gives a cleaner way to access the values of each row.
* The **.apply()** function is faster than .iterrows() when we want to process every row of a pandas DataFrame, but for column-wise operations it is slower than pandas' native methods such as .sum().
* Vectorizing over pandas Series covers the overwhelming majority of optimization needs for everyday calculations. However, if speed is of the highest priority, we can call in reinforcements in the form of the NumPy library.
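To try these comparisons without the poker CSV, the following sketch runs the four approaches side by side on randomly generated ranks. The DataFrame hands, its size, and the repeat count are illustrative assumptions, and the timings will differ from those reported above.

```python
import timeit
import numpy as np
import pandas as pd

# Hypothetical stand-in for the poker data: 2,000 random hands, ranks only
rng = np.random.default_rng(0)
rank_columns = ['R1', 'R2', 'R3', 'R4', 'R5']
hands = pd.DataFrame(rng.integers(1, 14, size=(2000, 5)), columns=rank_columns)

methods = {
    '.iterrows()': lambda: [row.sum() for _, row in hands.iterrows()],
    '.apply()': lambda: hands.apply(lambda row: sum(row), axis=1),
    'pandas .sum()': lambda: hands.sum(axis=1),
    'NumPy .sum()': lambda: hands.to_numpy().sum(axis=1),
}

# All four approaches should agree before their speed is compared
expected = hands.to_numpy().sum(axis=1).tolist()
for name, method in methods.items():
    assert list(method()) == expected, name

# Average over three runs of each approach
for name, method in methods.items():
    seconds = timeit.timeit(method, number=3) / 3
    print("{:<15} {:.5f} sec per run".format(name, seconds))
```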