### In this code, I demonstrate the impact of sub-optimal choices when manipulating your data before training a model. 

### Perhaps this applies to the entire program: from data import, to cleaning, to transformation, all the way to training and evaluating the model.

### You need to be conscious of your choices; otherwise, they will become a performance bottleneck.




# Goal: predict if a Kickstarter project is going to be successful
Data source:
https://www.kaggle.com/kemical/kickstarter-projects?select=ks-projects-201801.csv

In [8]:
#Load the required libraries

import pandas as pd
import numpy as np


## Download data
### The code below needs to be run in Colab. If you want to run it locally, you need to modify it.

In [9]:
%%time
# Get the file from Google Drive (%%time must be the first line of the cell)
!pip install googledrivedownloader
from google_drive_downloader import GoogleDriveDownloader as gdd

gdd.download_file_from_google_drive(file_id="1bOhcqr5bxx5WcyUzCL3Ncfzf-pc0lc5h",
                                    dest_path="./ks-projects-201801.csv",
                                    unzip=False)

CPU times: user 14.5 ms, sys: 9.97 ms, total: 24.5 ms
Wall time: 2.79 s


# Load data

In [10]:
df = pd.read_csv("ks-projects-201801.csv", header=0, encoding="utf-8-sig")
df.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.0


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 378661 entries, 0 to 378660
Data columns (total 15 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   ID                378661 non-null  int64  
 1   name              378657 non-null  object 
 2   category          378661 non-null  object 
 3   main_category     378661 non-null  object 
 4   currency          378661 non-null  object 
 5   deadline          378661 non-null  object 
 6   goal              378661 non-null  float64
 7   launched          378661 non-null  object 
 8   pledged           378661 non-null  float64
 9   state             378661 non-null  object 
 10  backers           378661 non-null  int64  
 11  country           378661 non-null  object 
 12  usd pledged       374864 non-null  float64
 13  usd_pledged_real  378661 non-null  float64
 14  usd_goal_real     378661 non-null  float64
dtypes: float64(5), int64(2), object(8)
memory usage: 43.3+ MB


### From the above output, it is clear that deadline and launched are not inferred as dates (both have dtype object), so we need to convert them to datetime manually. There may be other techniques to correct the schema at import time, but the focus of this notebook is to demonstrate the impact of your coding choices, not to solve the schema issue.
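
As an aside, one way to handle the schema at import time is the `parse_dates` argument of `pd.read_csv`. The sketch below is a minimal illustration, not part of the benchmark in this notebook: it uses a tiny in-memory CSV (the variable names `sample_csv` and `sample` are made up for this example, and only three of the real columns are included) so that it runs without downloading the data.

```python
import io

import pandas as pd

# Tiny stand-in for the Kickstarter file; only three of its columns are used here.
sample_csv = io.StringIO(
    "ID,deadline,launched\n"
    "1000002330,2015-10-09,2015-08-11 12:12:28\n"
    "1000003930,2017-11-01,2017-09-02 04:43:57\n"
)

# parse_dates asks pandas to convert the listed columns to datetime while reading.
sample = pd.read_csv(sample_csv, parse_dates=["deadline", "launched"])

# Both columns should now show a datetime64 dtype instead of object.
print(sample.dtypes)
```

With the real file, the same argument could presumably be passed to the `pd.read_csv` call above; that has not been timed here.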

In [14]:
#How many rows?
df.shape

(378661, 15)

In [12]:
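import timeit
from dateutil.parser import parse

def convert_using_vectorization():
  # Vectorized: pandas parses the whole column in one call (the result is not stored)
  pd.to_datetime(df['deadline'])

def convert_using_applyloop():
  # Row by row: parse each string in Python, then format it back to text
  df['deadline'] = df['deadline'].apply(lambda x: parse(x).strftime('%Y-%m-%d %H:%M:%S'))

# Time a single run of each approach
print("Time taken to complete is: ", timeit.timeit(convert_using_vectorization, number=1))
print("Time taken to complete is: ", timeit.timeit(convert_using_applyloop, number=1))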


Time taken to complete is:  0.06592728500004341
Time taken to complete is:  18.292234538000002


In [15]:
# The row-by-row apply takes about 277 times longer than the vectorized conversion
18.292234538000002/0.06592728500004341

277.46075904669755
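
The two functions timed above do slightly different work: the apply version also formats each date back to a string and writes it into `df`, while the vectorized version discards its result. The sketch below is one way to make the comparison closer to like-for-like. It is an illustration under assumptions, not the measurement reported above: it uses synthetic date strings instead of the Kickstarter file, and the names `dates`, `convert_vectorized` and `convert_looped` are made up for this example. Timings will vary by machine, so none are quoted.

```python
import timeit

import pandas as pd
from dateutil.parser import parse

# Synthetic column of date strings, so the example runs without the CSV file.
dates = pd.Series(
    pd.date_range("2015-01-01", periods=2000, freq="D").strftime("%Y-%m-%d")
)

def convert_vectorized(s):
    # One call: pandas parses the whole column at once.
    return pd.to_datetime(s)

def convert_looped(s):
    # One Python-level call to dateutil's parse per row.
    return s.apply(parse)

vectorized = convert_vectorized(dates)
looped = convert_looped(dates)

# Both approaches should produce the same timestamps.
same_result = list(vectorized) == list(pd.to_datetime(looped))
print("Same result:", same_result)

# Time a single run of each approach on the same input.
t_vec = timeit.timeit(lambda: convert_vectorized(dates), number=1)
t_loop = timeit.timeit(lambda: convert_looped(dates), number=1)
print("Vectorized:", t_vec)
print("Row by row:", t_loop)
```

Both functions take the column as an argument and return a new Series, so neither run changes the input and the order in which they are timed does not matter.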

### Code optimization has always been a hot programming topic, and it matters even more in compute-intensive tasks like machine learning.

There are plenty of articles about optimizing Python code. One you can refer to is:
https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6