In [15]:
import pandas as pd
import os
import re
import time

# Speed test example

I have a large number of rows in my data (about 750k!), and a lot of the data isn't usable in its current format.

When I started to try and clean it, I ran into a lot of problems with processing speed. Through some trial and error, I found a way to make my function run about 8,500 times faster!

## The original function

In [3]:
sixteenData = pd.read_csv('2016Short.csv') #Read the file into pandas

In [6]:
sixteenData.category.head() #Show the head of the column I want to change

0    {"urls":{"web":{"discover":"http://www.kicksta...
1    {"urls":{"web":{"discover":"http://www.kicksta...
2    {"urls":{"web":{"discover":"http://www.kicksta...
3    {"urls":{"web":{"discover":"http://www.kicksta...
4    {"urls":{"web":{"discover":"http://www.kicksta...
Name: category, dtype: object

The data in this column was originally in JSON format. When it was exported to a csv and then read into Pandas, it lost its JSON formatting, and as such I can't use it as a JSON object anymore. So I'm going to try to extract the text I need using regular expressions.
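A possible alternative, sketched under an assumption: if each cell still holds valid JSON text (this has not been checked against the real file), the standard-library `json.loads` could parse it directly. The sample string below, including its `name` and `slug` keys and the `example.com` URL, is a hypothetical stand-in and not a value taken from the dataset.

```python
import json

# Hypothetical sample cell; the real column values are truncated in the output above
sample = (
    '{"urls": {"web": {"discover": '
    '"http://www.example.com/discover/categories/art/painting"}}, '
    '"name": "Painting", "slug": "art/painting"}'
)

record = json.loads(sample)  # Parse the JSON text into a Python dict

# Assumed layout: the slug reads "parent/subcategory"
parent, subcategory = record["slug"].split("/")
print(parent, subcategory)
```

If the real strings do not parse, `json.loads` raises `json.JSONDecodeError`, which is a quick way to find out whether this route is open at all.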

In [7]:
def catStrip2016(target):
    '''
    This function strips out the parent category and subcategory of the project.
    re.split with r'\W+' splits the input string at every run of non-word
    characters (anything other than letters, digits, and underscores).
    The list positions of the parent/subcategory change between scrape dates,
    so the correct positions have to be chosen for each.
    '''
    x = re.split(r'\W+', target) #Split the input at all non-alphanumerics
    results = [x[10], x[11]] #Extract the two parts I'm interested in
    return(results) #Return the results of the function
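
To see why positions 10 and 11 are the ones picked, here is what `re.split(r'\W+', ...)` produces on a made-up string shaped like the truncated values shown earlier. The `example.com` URL and the `art/painting` path are assumptions for illustration only.

```python
import re

# Hypothetical input shaped like the column values (not real data)
sample = '{"urls":{"web":{"discover":"http://www.example.com/discover/categories/art/painting"}}}'

parts = re.split(r'\W+', sample)  # Split on runs of non-word characters
for position, token in enumerate(parts):
    print(position, repr(token))

# The string starts and ends with punctuation, so the first and last items are empty strings
print(parts[10], parts[11])
```

Because the positions depend on how many tokens come before the category, any change in the URL layout shifts them, which is why the indices have to be re-checked for each scrape date.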

In [10]:
timeCounter4 = []
for entry in range(5):
    start = time.time()
    temp = sixteenData.category[entry]
    categories = catStrip2016(temp)
    # Chained indexing (.iloc[entry][...]) is what triggers the warning below
    sixteenData.iloc[entry]['Parent'] = categories[0]
    sixteenData.iloc[entry]['Subcategory'] = categories[1]
    end = time.time()
    timeCounter4.append(end - start)
    print(entry, 'Last operation took ',end - start,' seconds')    

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


0 Last operation took  0.14040517807006836  seconds
1 Last operation took  0.15074706077575684  seconds
2 Last operation took  0.14543581008911133  seconds
3 Last operation took  0.1445779800415039  seconds
4 Last operation took  0.1438159942626953  seconds


Aside from some warnings, the function executed okay. The problem is that each iteration took about 0.15 seconds. Stretched over the entire length of the dataframe (about 189,000 rows), this operation would take around 8 hours to complete!

I have four .csv files that I need to clean, and I have four columns in each .csv file that need similar operations done. If I used this method, the whole thing would take 128 hours to run (8 hours x 4 files x 4 columns)!

Clearly I need to find another way to work. My first guess is that the `.iloc` indexer adds to the cost, since it has to look up the right location on every iteration.
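
One thing worth trying here, as a sketch rather than a measured fix: the warnings above come from chained indexing (`.iloc[entry]['Parent'] = ...`), which can assign into a temporary copy of the row instead of the dataframe itself. Writing through a single indexer such as `.at[row, column]` avoids that. The tiny frame and its values below are invented for the demonstration.

```python
import pandas as pd

# Invented stand-in for sixteenData, with the target columns created up front
df = pd.DataFrame({'category': ['art/painting', 'music/rock']})
df['Parent'] = None
df['Subcategory'] = None

for row in range(len(df)):
    parent, sub = df.at[row, 'category'].split('/')
    df.at[row, 'Parent'] = parent      # One indexer call: writes to df itself
    df.at[row, 'Subcategory'] = sub

print(df)
```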

## Second try - use `.append()` to make a separate data frame and then merge it

I want to try and break up the operations, so for this try I'm going to pull the entry out of the 
sixteenData dataframe, run the function on it, and then append it to the end of a 
new dataframe. 

In [12]:
'''
This time, keep the same function but append the results to the end of a
new dataframe rather than locating it in the original

This block of code extracts the desired entry in the original dataframe,
runs the function on it, then appends it to the end of a new dataframe
'''

tempFrame = pd.DataFrame()
for entry in range(5):
    start = time.time()
    temp = catStrip2016(sixteenData.category[entry])
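    # .append() returns a new dataframe, so the result has to be assigned back
    # (note: DataFrame.append was removed in pandas 2.0)
    tempFrame = tempFrame.append({'Parent':temp[0],
                                  'Subcategory':temp[1]
                                  } , ignore_index = True)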
    end = time.time()
    print(entry, 'Last operation took ',end - start,' seconds')

0 Last operation took  0.0035181045532226562  seconds
1 Last operation took  0.0032079219818115234  seconds
2 Last operation took  0.0033278465270996094  seconds
3 Last operation took  0.004068851470947266  seconds
4 Last operation took  0.003242969512939453  seconds


Great start! We've drastically reduced the time it takes for each iteration. I know from personal experience that inserting the new dataframe into the original won't take long, so this is a major improvement. 

It's still pretty slow though. Running this method on all the rows will take around 11 minutes (0.0035 * 189000). For my entire dataset, I'm looking at about 3 hours of computation time, and I can't do that. 
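
A common pattern for this kind of loop, shown here as a sketch: collect plain dictionaries in a list and build the dataframe once at the end, since every `.append()` call on a dataframe returns a whole new frame. `strip_categories` is a hypothetical stand-in for `catStrip2016`, and the sample strings are invented.

```python
import pandas as pd

def strip_categories(text):
    """Hypothetical stand-in for catStrip2016: returns [parent, subcategory]."""
    return text.split('/')

raw = pd.Series(['art/painting', 'music/rock', 'games/tabletop'])  # Invented data

rows = []
for entry in range(len(raw)):
    parent, sub = strip_categories(raw[entry])
    rows.append({'Parent': parent, 'Subcategory': sub})  # Cheap list append

tempFrame = pd.DataFrame(rows)  # Build the dataframe once, after the loop
print(tempFrame)
```

This also works on current pandas releases, where `DataFrame.append` no longer exists.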

At this point, I figured the size of the dataframe I was working with might be slowing it down for some reason (the original file is about 500 MB).

So I'm going to try to isolate the data of interest a little more.

## Third try - Use `DataFrame.pop()` to isolate the column I want to work with

For this try, I used the dataframe's `.pop()` method to isolate the column of interest. Then I ran the code from try 2 to see if there's any improvement.
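
For reference, a small sketch of what `.pop()` does: it removes the named column from the dataframe in place and returns it as a Series. The two-column frame is made up.

```python
import pandas as pd

df = pd.DataFrame({'category': ['a', 'b'], 'goal': [100, 200]})  # Made-up data
dummy = df.copy()               # Work on a copy so df keeps its column
lone = dummy.pop('category')    # Returns a Series and drops the column from dummy

print(type(lone).__name__)      # Series
print(list(dummy.columns))      # ['goal']
print(list(df.columns))         # ['category', 'goal']
```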

In [13]:
'''
This block of code pops a specific column from a pandas dataframe,
runs the cleaning operation on just that column, and then appends the
result to a new dataframe
'''
tempFrame2 = pd.DataFrame() #Create a blank dataframe for later use
dummy = sixteenData.copy() #Make a copy of the original dataframe to keep it intact
loneColumn = dummy.pop('category') #pop the category column into a separate Series
for entry in range(5):
    start = time.time()
    temp = catStrip2016(loneColumn[entry])
    #Append to tempFrame2 (not tempFrame) and keep the returned dataframe
    tempFrame2 = tempFrame2.append({'Categories':temp}, ignore_index = True )
    end = time.time()
    print(entry, 'Last operation took ',end - start,' seconds')

0 Last operation took  0.016701936721801758  seconds
1 Last operation took  0.0019998550415039062  seconds
2 Last operation took  0.0019178390502929688  seconds
3 Last operation took  0.0019621849060058594  seconds
4 Last operation took  0.0023949146270751953  seconds


Well, that was... underwhelming. There was roughly a 2x increase in speed, but at about 0.002 seconds per row I'm still looking at nearly two hours in total to run this on my whole dataset.

It's time to get serious: I'm asking the internet.

## Desperation - Turning to Google

At this point, I've exhausted my knowledge of Python. From my limited understanding of computer science, I know that `for` loops can get computationally expensive for a large sample size. 

For my last attempt, I just ran a Google search for methods that apply a function to an entire column in Pandas.

To my surprise, Pandas has a method called .... `.apply()`

(I feel like the Pandas developers are taking a cheap shot at newbies with this one)

In [14]:
print('Function Start')
start = time.time()
sixteenData['category_list'] = sixteenData['category'].apply(catStrip2016)
end = time.time()
total = end - start
print('Operation took ', total, ' seconds')

Function Start
Operation took  3.229017734527588  seconds


## Success!

Using the `.apply()` method, running this function on a single column over all of the rows took only 3.23 seconds!

To compare properly, that means this operation takes 3.23/189000 seconds per entry, or 0.0000171 seconds.
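
The arithmetic behind these figures, using the timings printed in this notebook and the 189,000-row count quoted earlier:

```python
# Timings copied from the cell outputs in this notebook (seconds)
loop_times = [0.14040517807006836, 0.15074706077575684, 0.14543581008911133,
              0.1445779800415039, 0.1438159942626953]
apply_total = 3.229017734527588
n_rows = 189_000  # Approximate row count for one file

loop_per_row = sum(loop_times) / len(loop_times)  # Mean time of the first attempt
apply_per_row = apply_total / n_rows              # Mean time with .apply()

print(f'loop:  {loop_per_row:.4f} s per row, {loop_per_row * n_rows / 3600:.1f} h per column')
print(f'apply: {apply_per_row:.7f} s per row')
print(f'speedup: {loop_per_row / apply_per_row:,.0f}x')
```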

That's an improvement of about 8,500x over the original (badly coded) operation. Now I have a time frame that I can work with, and I can make a proper cleaning script for each of my .csv files.
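
One step the `.apply()` cell leaves open is turning `category_list` (a column of two-item lists) into the separate `Parent` and `Subcategory` columns from the first attempt. A sketch of one way to do it, on invented data:

```python
import pandas as pd

# Invented stand-in for sixteenData after the .apply() step
df = pd.DataFrame({'category_list': [['art', 'painting'], ['music', 'rock']]})

# Build both columns in one pass; index=df.index keeps the rows aligned
split = pd.DataFrame(df['category_list'].tolist(),
                     columns=['Parent', 'Subcategory'],
                     index=df.index)
df = df.join(split)
print(df)
```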

From what I understand, programmers use "big O notation" (i.e. O(1), O(n), O(n<sup>2</sup>)) to keep track of algorithm complexity and how fast things get calculated.

Here's a quick primer on big O: [link](https://rob-bell.net/2009/06/a-beginners-guide-to-big-o-notation/)

If I was a better student, I'd estimate the O() complexity for each of the above cases. Maybe next time!