<a href="https://colab.research.google.com/github/stevenkhwun/Python_Notes/blob/main/Vectorization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vectorization in Python

Based on the Medium article *Say Goodbye to Loops in Python, and Welcome Vectorization!* by Anmol Tomar.

https://medium.com/codex/say-goodbye-to-loops-in-python-and-welcome-vectorization-e4df66615a52


## What is Vectorization?

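Vectorization means expressing a computation as operations on whole arrays (for example NumPy arrays or pandas Series) rather than as an explicit Python loop. Behind the scenes, the library applies the operation to every element in optimized, compiled code, so it behaves as if all elements were processed in one go, whereas a `for` loop handles one element or row per iteration in the Python interpreter.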
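As a minimal illustration of the idea (the list `values` and the doubling operation are made up for this sketch, not taken from the article), the same element-wise computation can be written as a loop or as a single array expression:

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration
values = [1.0, 2.0, 3.0, 4.0]

# Loop version: the interpreter handles one element per iteration
doubled_loop = []
for v in values:
    doubled_loop.append(v * 2)

# Vectorized version: one expression applied to the whole array
arr = np.array(values)
doubled_vec = arr * 2

print(doubled_loop)
print(doubled_vec)
```

Both versions produce the same numbers; the vectorized one simply hands the per-element work to NumPy.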

## Example 1: Finding the Sum of Numbers

In [1]:
# Using Loops
import time
start = time.time()

# iterative sum
total = 0
# iterating through 1.5 million numbers
for item in range(0, 1500000):
  total = total + item

print('Sum is:' + str(total))
end = time.time()

print(end - start)

Sum is:1124999250000
0.4100453853607178


In [3]:
# Using Vectorization
import numpy as np
start = time.time()

# vectorized sum - using numpy for vectorization
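# np.arange creates the sequence of integers from 0 to 1499999;
# dtype=np.int64 guards against overflow on platforms where NumPy's
# default integer is 32-bit (e.g. older NumPy versions on Windows)
print(np.sum(np.arange(1500000, dtype=np.int64)))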
end = time.time()

print(end - start)

1124999250000
0.012412548065185547
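A single `time.time()` measurement can be noisy, so the timings above are best read as rough indications. One possible way to get steadier numbers is the standard-library `timeit` module. The sketch below is not from the article; the helper names `loop_sum` and `vectorized_sum` are hypothetical, and the actual timings will vary by machine.

```python
import timeit

import numpy as np

N = 1_500_000  # same problem size as in the cells above


def loop_sum(n):
    """Sum the integers 0..n-1 with an explicit Python loop."""
    total = 0
    for item in range(n):
        total += item
    return total


def vectorized_sum(n):
    """Sum the integers 0..n-1 with NumPy."""
    return int(np.arange(n, dtype=np.int64).sum())


# Run each version three times and keep the fastest run
loop_seconds = min(timeit.repeat(lambda: loop_sum(N), number=1, repeat=3))
vec_seconds = min(timeit.repeat(lambda: vectorized_sum(N), number=1, repeat=3))

print(f"loop:       {loop_seconds:.4f} s")
print(f"vectorized: {vec_seconds:.4f} s")
```

Taking the minimum of several runs reduces the influence of unrelated background activity on the comparison.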


## Example 2: Mathematical Operations on a DataFrame

In data science work with pandas DataFrames, developers often use loops to create derived columns from mathematical operations on existing ones.

The following example shows how easily such a loop can be replaced with a vectorized expression.

In [13]:
# Creating a DataFrame with 1 million rows and 4 columns,
# filled with random integers from 1 to 49 (the upper bound 50 is exclusive)
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,50,size=(1000000,4)), columns=('a','b','c','d'))
print(df.shape)
df.head()

(1000000, 4)


|   | a | b | c | d |
|---|---|---|---|---|
| 0 | 4 | 43 | 27 | 26 |
| 1 | 37 | 41 | 40 | 14 |
| 2 | 13 | 44 | 3 | 5 |
| 3 | 16 | 7 | 1 | 47 |
| 4 | 31 | 41 | 41 | 25 |


We will create a new column 'ratio' that expresses column 'd' as a percentage of column 'c', that is, 100 * d / c.
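To make the formula concrete, here is a small sketch that applies it to the 'c' and 'd' values from the five sample rows printed above. The frame name `sample` is hypothetical and used only for this illustration.

```python
import pandas as pd

# 'c' and 'd' values copied from the five sample rows shown earlier
sample = pd.DataFrame({"c": [27, 40, 3, 1, 41], "d": [26, 14, 5, 47, 25]})

# 'd' as a percentage of 'c', computed for all rows at once
sample["ratio"] = 100 * (sample["d"] / sample["c"])

print(sample)
```

For the first row this gives 100 * 26 / 27, roughly 96.3.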

In [15]:
# Using Loops
import time
start = time.time()

# Iterating through the DataFrame row by row using iterrows
for idx, row in df.iterrows():
    # writing the ratio for this row (the column is created on the first assignment)
    df.at[idx, 'ratio'] = 100 * (row["d"] / row["c"])
end = time.time()

print(end - start)
df.head()

56.64831018447876


|   | a | b | c | d | ratio |
|---|---|---|---|---|---|
| 0 | 4 | 43 | 27 | 26 | 96.296296 |
| 1 | 37 | 41 | 40 | 14 | 35.0 |
| 2 | 13 | 44 | 3 | 5 | 166.666667 |
| 3 | 16 | 7 | 1 | 47 | 4700.0 |
| 4 | 31 | 41 | 41 | 25 | 60.97561 |


In [16]:
# Using Vectorization
start = time.time()
df["ratio"] = 100 * (df["d"] / df["c"])
end = time.time()

print(end - start)
df.head()

0.016091108322143555


|   | a | b | c | d | ratio |
|---|---|---|---|---|---|
| 0 | 4 | 43 | 27 | 26 | 96.296296 |
| 1 | 37 | 41 | 40 | 14 | 35.0 |
| 2 | 13 | 44 | 3 | 5 | 166.666667 |
| 3 | 16 | 7 | 1 | 47 | 4700.0 |
| 4 | 31 | 41 | 41 | 25 | 60.97561 |
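The two approaches are meant to produce identical values, differing only in speed. A quick way to check this is to run both on a much smaller frame and compare the results. The sketch below assumes a 1,000-row frame named `small` and a fixed random seed; neither comes from the article.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame so that the loop finishes quickly
rng = np.random.default_rng(0)
small = pd.DataFrame(rng.integers(1, 50, size=(1_000, 4)),
                     columns=["a", "b", "c", "d"])

# Loop version: collect one ratio per row
loop_ratio = []
for _, row in small.iterrows():
    loop_ratio.append(100 * (row["d"] / row["c"]))

# Vectorized version: a single expression over whole columns
vectorized_ratio = 100 * (small["d"] / small["c"])

# Compare the two results element by element
results_match = np.allclose(loop_ratio, vectorized_ratio.to_numpy())
print(results_match)
```

Checking equivalence on a small sample first is a cheap safeguard before replacing a slow loop on the full dataset.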
