
# **Updated Comparison of Vectorization vs For Loops**

Vectorization means expressing a computation as whole-array (NumPy) operations instead of an explicit Python loop. The operation is applied to every element of the array in a single call that runs in compiled code, whereas a for loop processes one element or row at a time in the interpreter.

In [1]:
# Using for loops

import time
start = time.time()

total = 0

# iterate through 1.5 million numbers

for item in range(0,1500000):
  total = total + item

print('sum is:' + str(total))
end = time.time()

print(end-start)



sum is:1124999250000
0.5068621635437012


In [2]:
# Using Vectorization

import numpy as np
start = time.time()

# use numpy for vectorization

print(np.sum(np.arange(1500000)))

end = time.time()

print(end-start)


1124999250000
0.018259286880493164


Calculation time comparison:

for loop: 0.5068 seconds
vectorization: 0.0183 seconds

Vectorization was 28x faster
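A single `time.time()` measurement can be noisy, so the ratio above should be read as approximate. The sketch below is one way to repeat the comparison with `timeit`; the helper names `loop_sum` and `vectorized_sum` and the smaller size `N` are assumptions made for illustration and are not part of the original notebook.

```python
import timeit

import numpy as np

N = 100_000  # assumed size, smaller than the 1.5 million above so the sketch runs quickly


def loop_sum(n):
    """Sum the integers 0..n-1 with an explicit Python loop."""
    total = 0
    for item in range(n):
        total += item
    return total


def vectorized_sum(n):
    """Sum the integers 0..n-1 with NumPy (int64 guards against overflow)."""
    return int(np.arange(n, dtype=np.int64).sum())


# Both approaches must agree before their timings are worth comparing.
assert loop_sum(N) == vectorized_sum(N)

# Keep the best of several repeats to reduce noise from other processes.
loop_time = min(timeit.repeat(lambda: loop_sum(N), number=1, repeat=5))
vec_time = min(timeit.repeat(lambda: vectorized_sum(N), number=1, repeat=5))

print(f"for loop:      {loop_time:.6f} s")
print(f"vectorization: {vec_time:.6f} s")
```

The absolute numbers depend on the machine, so no expected timings are shown here.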

**Mathematical Operations on DataFrame**

Compare computation time on a DataFrame with 4 columns and 5 million rows.

In [3]:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,50,size=(5000000,4)), columns=('a','b','c','d'))


In [4]:
df.shape

(5000000, 4)

In [5]:
df.head()

    a   b   c   d
0  12  45  15   6
1   2  36  42  29
2  17  44   6  37
3  29   2  19  33
4   2  34  38  36


Using for loops:

In [6]:
start = time.time()

for idx, row in df.iterrows():
  df.at[idx, 'ratio'] = 100*(row['d']/row['c'])

end = time.time()
print(end-start)



357.4451746940613


Using Vectorization:

In [7]:
start = time.time()

df['ratio'] = 100*(df['d']/df['c'])

end = time.time()
print(end-start)



0.10319709777832031


Calculation time comparison:

for loop: 357 seconds 
vectorization: 0.103 seconds

Vectorization was 3,500x faster
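Because `np.random.randint(0,50,...)` can produce zeros, column `c` will contain some zero divisors. The vectorized division does not raise an exception in that case. The sketch below, on a small hypothetical frame named `small`, shows what the plain division returns and one possible way to mark undefined ratios as missing; it is an illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame; column 'c' contains zeros on purpose.
small = pd.DataFrame({"c": [5, 0, 0, 10], "d": [10, 7, 0, 5]})

# Plain vectorized division: a nonzero value over 0 gives inf, 0 over 0 gives NaN.
raw = 100 * (small["d"] / small["c"])

# Replacing zero divisors with NaN first makes every undefined ratio NaN.
safe = 100 * (small["d"] / small["c"].replace(0, np.nan))

print(raw.tolist())
print(safe.tolist())
```

Whether `inf` or `NaN` is the better marker depends on how the `ratio` column is used downstream.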

**If-else Statements**

Create a new column in dataframe from above based on data in an existing column

Using for loops: 

In [8]:
start = time.time()

# iterate through DataFrame using iterrows
for idx, row in df.iterrows():
  if row.a == 0:
    df.at[idx,'e'] = row.d
  elif 0 < row.a <= 25:
    df.at[idx, 'e'] = (row.b) - (row.c)
  else:
    df.at[idx, 'e'] = row.b + row.c

end = time.time()

print('time using for-loops:', (end-start))

time using for-loops: 553.3422744274139


Using Vectorization:

In [9]:
start = time.time()

df['e'] = df['b'] + df['c']
df.loc[df['a'] <= 25, 'e'] = df['b'] - df['c']
df.loc[df['a'] == 0, 'e'] = df['d']

end = time.time()

print('time using vectorization:', (end-start))

time using vectorization: 0.3814542293548584


Calculation time comparison:

for loop: 553 seconds 
vectorization: 0.381 seconds

Vectorization was 1,400x faster
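The same three-way rule can also be written with `np.select`, which lists each condition next to its result instead of relying on the order of successive `.loc` assignments. This is a sketch of an alternative on a small hypothetical frame named `small`, not the approach timed above.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame covering each branch of the rule.
small = pd.DataFrame({
    "a": [0, 10, 25, 40],
    "b": [5, 6, 7, 8],
    "c": [1, 2, 3, 4],
    "d": [9, 9, 9, 9],
})

# Conditions are checked in order; the first one that is True picks the value.
conditions = [
    small["a"] == 0,   # take d
    small["a"] <= 25,  # only reached when a is 1..25: take b - c
    small["a"] > 25,   # take b + c
]
choices = [
    small["d"],
    small["b"] - small["c"],
    small["b"] + small["c"],
]

small["e"] = np.select(conditions, choices)
print(small)
```

Every row matches one of the three conditions here, so the default value of `np.select` is never used.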

**Solving ML/DL Networks**

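Deep learning models repeatedly evaluate the same linear equations (weighted sums of inputs) over millions of rows. Here, each of the 5 million rows of x is multiplied element-wise by the weight vector z and summed, which is a dot product.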


In [10]:
# Create the data

z = np.random.rand(1,5)

x = np.random.rand(5000000,5)

In [11]:
z , x

(array([[0.65662776, 0.61650631, 0.30600364, 0.77061291, 0.7527969 ]]),
 array([[0.5259611 , 0.84462899, 0.64114104, 0.87970317, 0.33274583],
        [0.64151787, 0.26050838, 0.20988106, 0.86704442, 0.92814382],
        [0.79469601, 0.04986064, 0.84661822, 0.77369978, 0.92500103],
        ...,
        [0.06464728, 0.54142503, 0.74455576, 0.8421134 , 0.19020428],
        [0.02633111, 0.32439616, 0.51270438, 0.82808423, 0.04288669],
        [0.23119859, 0.30296862, 0.67940736, 0.24556172, 0.76543868]]))

Using for-loops:

In [15]:
# pre-allocate one output slot per row
result = np.zeros(5000000)
tic = time.process_time()

for i in range(0,5000000):
  total = 0
  for j in range(0,5):
    total = total + x[i][j]*z[0][j]

  # store this row's weighted sum
  result[i] = total

toc = time.process_time()

print('Computation time = ' + str((toc-tic)) + ' seconds')

Using Vectorization:

In [14]:
tic = time.process_time()

# dot product
np.dot(x, z.T)

toc = time.process_time()
print('Computation time = ' + str((toc - tic)) + ' seconds')



Computation time = 0.13225094500000978 seconds
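Before comparing timings, it is worth confirming that the nested loop and `np.dot` compute the same values. The sketch below checks this on a smaller sample; the 1,000-row size, the seeded generator, and the names `loop_result` and `vec_result` are assumptions chosen so the check runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the check is repeatable
z = rng.random((1, 5))          # one row of 5 weights
x = rng.random((1000, 5))       # assumed sample size, much smaller than 5 million

# Explicit loops: one weighted sum per row of x.
loop_result = np.zeros(x.shape[0])
for i in range(x.shape[0]):
    total = 0.0
    for j in range(x.shape[1]):
        total += x[i][j] * z[0][j]
    loop_result[i] = total

# Vectorized: (1000, 5) dot (5, 1) gives a (1000, 1) column of weighted sums.
vec_result = np.dot(x, z.T)

# Flatten the column and compare within floating-point tolerance.
assert np.allclose(loop_result, vec_result.ravel())
print("loop and vectorized results match")
```

`np.allclose` is used instead of `==` because the two methods may add the terms in a different order, which can change the last few floating-point digits.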
