# How to Iterate/Loop over a Pandas DataFrame
Let's import `pandas` and `numpy` to get started.

In [1]:
import pandas as pd
import numpy as np

Now we can create an example DataFrame with 10,000 rows and 5 columns of random values in [0, 1):

In [2]:
df = pd.DataFrame(np.random.rand(10000, 5), columns=['A','B','C','D','E'])
df.head()

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 0.554527 | 0.389633 | 0.024277 | 0.660475 | 0.399419 |
| 1 | 0.066705 | 0.66827 | 0.473297 | 0.007648 | 0.300665 |
| 2 | 0.997717 | 0.734364 | 0.718288 | 0.41089 | 0.271431 |
| 3 | 0.175517 | 0.799431 | 0.820741 | 0.300792 | 0.302119 |
| 4 | 0.301502 | 0.178007 | 0.053361 | 0.76178 | 0.462002 |


We can get an overview of the DataFrame with the `describe` method.

In [3]:
df.describe()

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 0.501036 | 0.497504 | 0.500794 | 0.498799 | 0.493497 |
| std | 0.287209 | 0.288277 | 0.287789 | 0.289711 | 0.287943 |
| min | 2.5e-05 | 3.2e-05 | 1e-06 | 6.8e-05 | 6.3e-05 |
| 25% | 0.255397 | 0.246733 | 0.250577 | 0.24818 | 0.247233 |
| 50% | 0.496472 | 0.496364 | 0.500797 | 0.501773 | 0.489159 |
| 75% | 0.746159 | 0.742955 | 0.749463 | 0.746268 | 0.743193 |
| max | 0.999975 | 0.999796 | 0.999992 | 0.999991 | 0.999989 |


### **First method: iterating over a Pandas DataFrame with `iterrows`**

In [14]:
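# Restrict the loop to the first three rows instead of scanning all 10,000.
for i, row in df.head(3).iterrows():
    # Print the index label, the full row (a Series), and one column value.
    print(f"i = {i}", row,
          f"Column B: {row['B']}", '<-------------------------->', sep='\n')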

i = 0
A    0.554527
B    0.389633
C    0.024277
D    0.660475
E    0.399419
Name: 0, dtype: float64
Column B: 0.38963262110181274
<-------------------------->
i = 1
A    0.066705
B    0.668270
C    0.473297
D    0.007648
E    0.300665
Name: 1, dtype: float64
Column B: 0.6682699176000896
<-------------------------->
i = 2
A    0.997717
B    0.734364
C    0.718288
D    0.410890
E    0.271431
Name: 2, dtype: float64
Column B: 0.7343644889037613
<-------------------------->


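The printed rows match the first three rows of the DataFrame. Column A shows 0 and 1 below, most likely because this cell (In [26]) was executed after `iterrow_example` had already overwritten column A in place: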

In [26]:
df.head(3)

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 1.0 | 0.389633 | 0.024277 | 0.660475 | 0.399419 |
| 1 | 0.0 | 0.66827 | 0.473297 | 0.007648 | 0.300665 |
| 2 | 1.0 | 0.734364 | 0.718288 | 0.41089 | 0.271431 |


Next, we can measure how long it takes to run a loop like this over the whole
DataFrame, and then see whether we can improve on it.

### **Creating an iterating function with the `iterrows` method**

In [20]:
def iterrow_example(df, col):
    # Binarize `col` in place: values below 0.5 become 0, all others become 1.
    for i, row in df.iterrows():
        val = row[col]
        if val < 0.5:
            df.at[i, col] = 0
        else:
            df.at[i, col] = 1
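
As a quick check of what this function does, here is a minimal sketch on a tiny, hypothetical DataFrame (`small` is not part of the notebook). It assumes the same thresholding logic, written more compactly, and shows that the function returns nothing and overwrites the column in place.

```python
import pandas as pd

def iterrow_example(df, col):
    # Same logic as the notebook's function: threshold each value at 0.5.
    for i, row in df.iterrows():
        df.at[i, col] = 0 if row[col] < 0.5 else 1

# Tiny hypothetical frame, used only for this check.
small = pd.DataFrame({"A": [0.2, 0.7, 0.5], "B": [0.9, 0.1, 0.4]})
result = iterrow_example(small, "A")

print(result)               # None: the function returns nothing
print(small["A"].tolist())  # [0.0, 1.0, 1.0]: column A was overwritten in place
```

Because the column keeps its float dtype, the new values appear as `0.0` and `1.0` rather than integers.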

How much time does it take to run the `iterrow_example` function over our df?

In [21]:
%timeit iterrow_example(df, 'A')

641 ms ± 33.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
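
`%timeit` is an IPython magic, so it only works inside a notebook or IPython shell. Outside of one, a rough equivalent can be sketched with the standard library's `timeit` module. The smaller frame `df_small` and the repeat counts below are assumptions chosen so the sketch runs quickly; the numbers it prints will not match the notebook's.

```python
import timeit

import numpy as np
import pandas as pd

def iterrow_example(df, col):
    # Same thresholding logic as the notebook's function.
    for i, row in df.iterrows():
        df.at[i, col] = 0 if row[col] < 0.5 else 1

# Smaller hypothetical frame so the sketch finishes quickly.
df_small = pd.DataFrame(np.random.rand(1000, 5), columns=list("ABCDE"))

# Call the function 3 times per repeat, for 3 repeats, and keep the best repeat.
times = timeit.repeat(lambda: iterrow_example(df_small, "A"), number=3, repeat=3)
best_per_call = min(times) / 3
print(f"best: {best_per_call * 1e3:.2f} ms per call")
```

Since the function mutates the frame, every call after the first sees a column that already holds only 0s and 1s. The loop still visits every row, so the timing should remain broadly comparable.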


I'm sure we can do better.

### **Creating an iterating function with `iloc`**

In [24]:
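def iloc_iterrow_example(df, col):
    # Note: `i` is an index label but is passed to the positional `.iloc`,
    # so this only works when the DataFrame has the default 0..n-1 index.
    for i in df.index:
        val = df[col].iloc[i]
        if val < 0.5:
            df.at[i, col] = 0
        else:
            df.at[i, col] = 1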

In [25]:
%timeit iloc_iterrow_example(df, 'B')

171 ms ± 31.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


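With just those few small changes, the loop ran almost four times faster (171 ms versus 641 ms), most likely because it no longer builds a new Series object for every row the way `iterrows` does.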
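
One caveat: the loop above passes index labels to the positional `.iloc`, which is only safe with the default integer index. A minimal sketch of a label-safe variant follows; `iloc_label_safe_example` and `demo` are hypothetical names, not part of the notebook, and the sketch assumes the index labels are unique.

```python
import pandas as pd

def iloc_label_safe_example(df, col):
    # Hypothetical variant: pair each position with its label so that
    # .iloc (positional) and .at (label-based) always refer to the same row.
    for pos, label in enumerate(df.index):
        val = df[col].iloc[pos]
        df.at[label, col] = 0 if val < 0.5 else 1

# A frame with string labels, where passing a label to .iloc would fail.
demo = pd.DataFrame({"B": [0.3, 0.8, 0.6]}, index=["x", "y", "z"])
iloc_label_safe_example(demo, "B")
print(demo["B"].tolist())  # [0.0, 1.0, 1.0]
```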

Next, we can try the `apply` method.

### **Using the `apply` method**

In [27]:
%timeit df['C'] = df['C'].apply(lambda x: 0 if x < 0.5 else 1)

5.05 ms ± 633 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


What an improvement: from 171 ms down to about 5 ms, roughly 34 times faster.

We can use the `numpy.where()` function too.

As an exercise, the reader could also try the `map` method.

### **Using the `numpy.where()` function**
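
For readers who want a starting point, here is a minimal sketch of the `map` variant. It assumes a freshly generated frame (`df_demo` is a hypothetical name) and only checks that `Series.map` gives the same result as `Series.apply`; it makes no claim about relative speed.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same shape as the notebook's df.
df_demo = pd.DataFrame(np.random.rand(10_000, 5), columns=list("ABCDE"))

# Series.map calls the function once per element, like Series.apply.
mapped = df_demo["C"].map(lambda x: 0 if x < 0.5 else 1)
applied = df_demo["C"].apply(lambda x: 0 if x < 0.5 else 1)

print(mapped.equals(applied))  # True
```

In a notebook, `%timeit df_demo['C'].map(lambda x: 0 if x < 0.5 else 1)` can be compared against the `apply` timing.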

In [28]:
%timeit np.where(df['D'] < 0.5, 0, 1)

106 µs ± 3.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


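It is even faster than the `apply` method: about 106 µs versus 5.05 ms. Note, however, that this cell only computes the result and does not assign it back to the DataFrame.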
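
To confirm that the vectorized call produces the same values as `apply`, and to show how the result could be stored, here is a small sketch. The frame `df_demo` and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same shape as the notebook's df.
df_demo = pd.DataFrame(np.random.rand(10_000, 5), columns=list("ABCDE"))

# Element-by-element version, as in the apply section.
via_apply = df_demo["D"].apply(lambda x: 0 if x < 0.5 else 1)

# Vectorized version: np.where evaluates the whole column at once
# and returns a plain NumPy array, not a Series.
via_where = np.where(df_demo["D"] < 0.5, 0, 1)

print(type(via_where).__name__)                          # ndarray
print(np.array_equal(via_apply.to_numpy(), via_where))   # True

# Assign the array back to keep the result in the frame.
df_demo["D"] = via_where
```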

### **Using `numpy.where()` with `.values`**

`Series.values` (like `DataFrame.values`) is an attribute that returns a NumPy array
representation of the data, so the comparison runs directly on the array.

In [31]:
%timeit np.where(df['D'].values < 0.5, 0, 1)

74.9 µs ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
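
The pandas documentation recommends the `to_numpy()` method over the `.values` attribute for this conversion. The sketch below, using a hypothetical frame `df_demo`, suggests that for a plain float column the two give the same array, so either could be passed to `np.where`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same shape as the notebook's df.
df_demo = pd.DataFrame(np.random.rand(10_000, 5), columns=list("ABCDE"))

arr_values = df_demo["D"].values      # attribute used in the notebook
arr_numpy = df_demo["D"].to_numpy()   # method form of the same conversion

print(type(arr_values).__name__)              # ndarray
print(np.array_equal(arr_values, arr_numpy))  # True

# Either array can be passed to np.where without going through pandas.
result = np.where(arr_numpy < 0.5, 0, 1)
```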


I hope this notebook is helpful and gives you some ideas on how to speed up
iteration over pandas DataFrames.

How many other methods do you know?
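
One more option, which this notebook does not cover, is `DataFrame.itertuples`. It yields each row as a namedtuple rather than a Series. The helper below is a hypothetical sketch that mirrors the earlier loops; it was not timed here.

```python
import pandas as pd

def itertuples_example(df, col):
    # Hypothetical helper, not from the notebook.
    # Field 0 of each tuple is the index label, so columns start at position 1.
    col_pos = df.columns.get_loc(col) + 1
    for row in df.itertuples():
        df.at[row[0], col] = 0 if row[col_pos] < 0.5 else 1

demo = pd.DataFrame({"E": [0.1, 0.9, 0.5]})
itertuples_example(demo, "E")
print(demo["E"].tolist())  # [0.0, 1.0, 1.0]
```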

## References
- [YouTube Tutorial](https://www.youtube.com/watch?v=CG3EV7UBELA)