# Purpose
The purpose of this notebook is to speed up the word count function, which became a bottleneck in my algorithm. In pure Python, counting words for a bag-of-words representation took too much time, so I decided to use Cython to make it faster. On my PC, the time for word counting was reduced by ~50%.

**Some functions were relatively quicker on my computer than here.**

## Summary
* v1 - intuitive Python solution - 30.4 microseconds
* v1.1 - using a Cython word-counting helper function - 18.8 microseconds
* v1.1.1 - more declarations inside the Cython helper function - 19 microseconds
* v1.2 - helper function using *for* rather than *foreach* - 19.4 microseconds
* v1.3 - converting the full intuitive function to Cython - 19.1 microseconds
* v1.4 - using *for* rather than *foreach* in the unified function - 19.4 microseconds
* v2 - using word comparison to remove the need for text.split() - 408 microseconds
* v2.1 - changing to Cython - 77.5 microseconds
* v3 - using lazy evaluation - 947 microseconds
* v3.1 - changing to Cython - 13.1 microseconds
* v3.2 - removing some declarations - 12.4 microseconds
* v3.3 - caching a slice of the text for each position - 82 microseconds
* v3.4 - switching to while - 14.5 microseconds
* v3.5 - skipping ahead by len(word) after a match - 14.5 microseconds
* v3.6 - caching more variables - 12.2 microseconds
* v3.7 - counting only complete words - 10.3 microseconds
* v3.7.1 - refactoring the version 3.7 monstrosity - 11 microseconds

Conclusions:

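About 64% less computation time, or roughly 2.8x as fast (30.4 microseconds for v1 vs. 11 microseconds for the final refactored version) :)

Nearly 40% of the computation time can be saved easily, just by compiling the intuitive function with Cython (30.4 microseconds for v1 vs. 19.1 microseconds for v1.3).

Iterating directly over the list (*foreach*) is a bit faster than indexing with *for* over a range.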
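
The percentages in the conclusions follow from the timings in the summary list. As a quick check of the arithmetic, the sketch below recomputes them from the 30.4, 19.1, and 11 microsecond figures; the helper names `time_saved` and `speedup` are illustrative and not part of the notebook.

```python
# Timings in microseconds, copied from the summary list above.
baseline_us = 30.4  # v1, plain Python
cython_us = 19.1    # v1.3, same algorithm compiled with Cython
best_us = 11.0      # final refactored whole-word scan (word_count_v3_7_1)


def time_saved(before, after):
    """Fraction of the original run time that was eliminated."""
    return 1 - after / before


def speedup(before, after):
    """How many times faster the new version runs."""
    return before / after


print(f"v1.3:  {time_saved(baseline_us, cython_us):.0%} less time, "
      f"{speedup(baseline_us, cython_us):.2f}x")  # 37% less time, 1.59x
print(f"final: {time_saved(baseline_us, best_us):.0%} less time, "
      f"{speedup(baseline_us, best_us):.2f}x")    # 64% less time, 2.76x
```

Note that "N% less time" and "N times as fast" are different measures: cutting the run time by 64% corresponds to running about 2.8 times as fast.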

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
train_X = pd.read_csv(
    '../input/training_text', sep=r"\|\|", engine='python', header=None, skiprows=1,
    names=["ID","Text"], index_col=0)
# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 replaces it
train_y = pd.read_csv("../input/training_variants", index_col=0)
train_X = pd.concat([train_X, train_y], axis=1)
train_y = train_X["Class"] - 1
del train_X["Class"]


In [None]:
test_X = pd.read_csv(
    '../input/test_text', sep=r"\|\|", engine='python', header=None, skiprows=1,
    names=["ID","Text"], index_col=0)
# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 replaces it
test_y = pd.read_csv("../input/test_variants", index_col=0)
test_X = pd.concat([test_X, test_y], axis=1)
del test_y


In [None]:
data = pd.concat([train_X, test_X], axis=0)
print(data.head())
print(data.tail())

In [None]:
# remove punctuation
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

def remove_punctuation(text):
    # keep only runs of word characters and re-join them with single spaces
    token_list = tokenizer.tokenize(text)
    return ' '.join(token_list)

# stem words
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

def stem_text(text):
    text_list = text.split()
    stemmed_list = [stemmer.stem(word) for word in text_list]
    stemmed_text = ' '.join(stemmed_list)
    return stemmed_text

In [None]:
def preprocess_text(text):
    text = remove_punctuation(text)
    text = stem_text(text)
    return text

In [None]:
demo_text = data["Text"].iat[0][:2000]
print(demo_text, '\n')
demo_text = preprocess_text(demo_text)
print(demo_text)



In [None]:
def word_count_v1(text, word):
    count = 0
    text = text.split()
    for word_i in text:
        if word_i == word:
            count += 1
    return count

print('Normal word counting function in python')
print(word_count_v1(demo_text, "cdk10"))
%timeit word_count_v1(demo_text, "cdk10")
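
As a sanity check for v1, the same whole-word count can be obtained with the built-in `list.count`. The sketch below is not one of the optimized versions, only a reference to compare against. The sample sentence is made up, and the `timeit` call is a rough script-level stand-in for the `%timeit` magic, which only works inside IPython/Jupyter.

```python
import timeit


def word_count_v1(text, word):
    """Count whitespace-delimited occurrences of word (same logic as the cell above)."""
    count = 0
    for word_i in text.split():
        if word_i == word:
            count += 1
    return count


# Hypothetical sample text; the notebook itself uses demo_text.
sample_text = "cdk10 bind cyclin m and cdk101 is not cdk10"

# The built-in list.count gives the same whole-word count.
reference = sample_text.split().count("cdk10")
print(word_count_v1(sample_text, "cdk10"), reference)  # 2 2

# Total seconds for 10,000 calls, converted to microseconds per call.
elapsed = timeit.timeit(lambda: word_count_v1(sample_text, "cdk10"), number=10_000)
print(f"{elapsed / 10_000 * 1e6:.2f} microseconds per call")
```

The token `cdk101` is not counted because the comparison is against whole tokens, which is the behaviour the later substring-based versions have to work to recover.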

In [None]:
%load_ext Cython

In [None]:
%%cython
def c_word_counts_v1(word_list, str word):
    cdef int count = 0
    for word_i in word_list:
        if word_i == word:
            count += 1
    return count

In [None]:
def word_count_v1_1(text, word):
    text = text.split()
    return c_word_counts_v1(text, word)

print('Changing word counting from Python to Cython')
print(word_count_v1_1(demo_text, "cdk10"))
%timeit word_count_v1_1(demo_text, "cdk10")

In [None]:
%%cython
def c_word_counts_v1_1(word_list, str word):
    cdef int count = 0
    cdef str word_i
    for word_i in word_list:
        if word_i == word:
            count += 1
    return count

In [None]:
def word_count_v1_1_1(text, word):
    text = text.split()
    return c_word_counts_v1_1(text, word)

print('Declaring in Cython word_count that word_i would also be str')
print(word_count_v1_1_1(demo_text, "cdk10"))
%timeit word_count_v1_1_1(demo_text, "cdk10")

In [None]:
%%cython
def c_word_counts_v2(word_list, str word):
    cdef int count = 0
    cdef int i
    for i in range(len(word_list)):
        if word_list[i] == word:
            count += 1
    return count

In [None]:
def word_count_v1_2(text, word):
    text = text.split()
    return c_word_counts_v2(text, word)

print('Using for instead of "foreach" in cython word_count')
print(word_count_v1_2(demo_text, "cdk10"))
%timeit word_count_v1_2(demo_text, "cdk10")

In [None]:
%%cython
def word_count_v1_3(str text, str word):
    cdef int count = 0
    cdef str word_i
    text_list = text.split()
    for word_i in text_list:
        if word_i == word:
            count += 1
    return count

In [None]:
print('Moving text.split() into the Cython function, where it might be optimized as well')
print(word_count_v1_3(demo_text, "cdk10"))
%timeit word_count_v1_3(demo_text, "cdk10")

In [None]:
%%cython
def word_count_v1_4(str text, str word):
    cdef int count = 0
    cdef int i
    text_list = text.split()
    for i in range(len(text_list)):
        if text_list[i] == word:
            count += 1
    return count

In [None]:
print('Using for instead of \"foreach\" in the unified cython word_count')
print(word_count_v1_4(demo_text, "cdk10"))
%timeit word_count_v1_4(demo_text, "cdk10")

In [None]:
def word_count_v2(text, word):
    count = 0
    n_chars = len(word)
    for i in range(len(text) - n_chars + 1):  # + 1 so a match ending at the last character is checked
        if text[i:i+n_chars] == word:
            count += 1
    return count
print('Using rolling word comparison, removes the need for str.split() with python')
print(word_count_v2(demo_text, "cdk10"))
%timeit word_count_v2(demo_text, "cdk10")
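
Note that the rolling comparison counts substring occurrences rather than whitespace-delimited words, so its result can differ from v1 whenever the word also appears inside a longer token. A minimal sketch on a made-up sentence (the sample text and the helper names `count_split` and `count_rolling` are illustrative, not from the notebook):

```python
def count_split(text, word):
    """Whole-word count, equivalent to the split-based v1."""
    return text.split().count(word)


def count_rolling(text, word):
    """Substring count using a sliding window of len(word) characters."""
    n_chars = len(word)
    # + 1 so that the window starting at the last possible position is included
    return sum(text[i:i + n_chars] == word
               for i in range(len(text) - n_chars + 1))


sample = "cdk10 bind cdk101 and cdk10"
print(count_split(sample, "cdk10"))    # 2
print(count_rolling(sample, "cdk10"))  # 3, because "cdk101" contains "cdk10"
```

On text where the target never occurs inside a longer token, the two counts agree.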

In [None]:
%%cython
def word_count_v2_1(str text, str word):
    cdef int count = 0
    cdef int n_chars = len(word)
    cdef int i
    for i in range(len(text) - n_chars + 1):  # + 1 so a match ending at the last character is checked
        if text[i:i+n_chars] == word:
            count += 1
    return count

In [None]:
print('Changing to Cython')
print(word_count_v2_1(demo_text, "cdk10"))
%timeit word_count_v2_1(demo_text, "cdk10")

In [None]:
def word_count_v3(text, word):
    count = 0
    n_chars = len(word)
    for i in range(len(text) - n_chars + 1):  # + 1 so a match ending at the last character is checked
        for j in range(n_chars):
            if text[i+j] != word[j]:
                break
            if j == n_chars-1:
                count += 1
    return count
print('Instead of comparing words, using lazy evaluation. In python')
print(word_count_v3(demo_text, "cdk10"))
%timeit word_count_v3(demo_text, "cdk10")
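
The idea behind the character-by-character loop is that it can stop at the first mismatching character instead of building and comparing a full slice at every position. The sketch below is a hypothetical instrumented variant (`count_with_early_exit` is not from the notebook) that also reports how many character comparisons were made. It uses Python's `for ... else` in place of the `j == n_chars-1` test, which is a stylistic substitution rather than the author's code.

```python
def count_with_early_exit(text, word):
    """Return (matches, character comparisons) for a substring scan with early exit."""
    count = 0
    comparisons = 0
    n_chars = len(word)
    for i in range(len(text) - n_chars + 1):
        for j in range(n_chars):
            comparisons += 1
            if text[i + j] != word[j]:
                break  # first mismatch: give up on this position
        else:
            count += 1  # loop finished without break, so every character matched
    return count, comparisons


sample = "cdk10 bind cdk101 and cdk10"  # made-up sample text
matches, comparisons = count_with_early_exit(sample, "cdk10")
positions = len(sample) - len("cdk10") + 1
print(matches)                                  # 3
print(comparisons, "of at most", positions * len("cdk10"))
```

In pure Python each of these comparisons is interpreted, which is consistent with v3 being slower than v2 in the summary; the early exit only pays off once the loop is compiled.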

In [None]:
%%cython
def word_count_v3_1(str text, str word):
    cdef int count = 0
    cdef int word_chars = len(word)
    cdef int text_chars = len(text)
    cdef int i
    cdef int j
    for i in range(text_chars - word_chars + 1):
        for j in range(word_chars):
            if text[i+j] != word[j]:
                break
            if j == word_chars-1:
                count += 1
    return count

In [None]:
print('Switching to Cython')
print(word_count_v3_1(demo_text, "cdk10"))
%timeit word_count_v3_1(demo_text, "cdk10")

In [None]:
%%cython
def word_count_v3_2(str text, str word):
    cdef int count = 0
    cdef int i
    cdef int j
    cdef int word_chars = len(word)
    for i in range(len(text) - word_chars + 1):
        for j in range(word_chars):
            if text[i+j] != word[j]:
                break
            if j == word_chars-1:
                count += 1
    return count

In [None]:
print('removing declaration for text_chars')
print(word_count_v3_2(demo_text, "cdk10"))
%timeit word_count_v3_2(demo_text, "cdk10")

In [None]:
%%cython
def word_count_v3_3(str text, str word):
    cdef int count = 0
    cdef str text_cache
    cdef int i
    cdef int j
    cdef int word_chars = len(word)
    for i in range(len(text) - word_chars + 1):  # stop where a full-length slice still fits
        text_cache = text[i:i+word_chars]
        for j in range(word_chars):
            if text_cache[j] != word[j]:
                break
            if j == word_chars-1:
                count += 1
    return count

In [None]:
print('Trying to cache the word for each iteration')
print(word_count_v3_3(demo_text, "cdk10"))
%timeit word_count_v3_3(demo_text, "cdk10")

In [None]:
%%cython
def word_count_v3_4(str text, str word):
    cdef int count = 0
    cdef int i = 0
    cdef int j
    cdef int word_chars = len(word)
    while i < (len(text) - word_chars + 1):
        for j in range(word_chars):
            if text[i+j] != word[j]:
                break
            if j == word_chars-1:
                count += 1
        i += 1
    return count

In [None]:
print('Switching to while')
print(word_count_v3_4(demo_text, "cdk10"))
%timeit word_count_v3_4(demo_text, "cdk10")

In [None]:
%%cython
def word_count_v3_5(str text, str word):
    cdef int count = 0
    cdef int i = 0
    cdef int j
    cdef int word_chars = len(word)
    while i < (len(text) - word_chars + 1):
        for j in range(word_chars):
            if text[i+j] != word[j]:
                break
            if j == word_chars-1:
                count += 1
                i += word_chars  # after the word ends there's a space so it can move word_chars+1 chars
        i += 1
    return count

In [None]:
print('Skipping ahead by len(word) after a match')
print(word_count_v3_5(demo_text, "cdk10"))
%timeit word_count_v3_5(demo_text, "cdk10")
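
The jump of `word_chars` characters after a match (plus the usual `i += 1`) relies on the text being space-separated words, so that the character right after a match cannot start a new one. A pure-Python sketch of the same control flow makes the consequence visible; `count_with_skip` is a hypothetical name, and it uses a slice comparison instead of the inner character loop to stay short.

```python
def count_with_skip(text, word):
    """Substring count that jumps past each match, as in the while-loop version above."""
    count = 0
    i = 0
    n_chars = len(word)
    scan_end = len(text) - n_chars + 1
    while i < scan_end:
        if text[i:i + n_chars] == word:
            count += 1
            i += n_chars  # jump over the matched word
        i += 1            # and over the character that follows it
    return count


print(count_with_skip("cdk10 cdk10", "cdk10"))  # 2
print(count_with_skip("aaaa", "aa"))            # 1, a scan without the jump would find 3
```

For space-separated text the jump changes nothing but the amount of work. For overlapping or adjacent matches, as in the second call, it undercounts, so this shortcut is only safe on preprocessed text like `demo_text`.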

In [None]:
%%cython
def word_count_v3_6(str text, str word):
    cdef int count = 0
    cdef int i = 0
    cdef int j
    cdef int word_chars = len(word)
    cdef int text_scan_end = (len(text) - word_chars + 1)
    cdef int word_scan_end = word_chars - 1
    while i < text_scan_end:
        for j in range(word_chars):
            if text[i+j] != word[j]:
                break
            if j == word_scan_end:
                count += 1
                i += word_chars  # after the word ends there's a space so it can move word_chars+1 chars
        i += 1
    return count

In [None]:
print('Caching (len(text) - word_chars + 1), word_chars - 1')
print(word_count_v3_6(demo_text, "cdk10"))
%timeit word_count_v3_6(demo_text, "cdk10")

In [None]:
%%cython
def word_count_v3_7(str text, str word):
    cdef int count = 0
    cdef int i = 0
    cdef int j
    cdef int word_chars = len(word)
    cdef int text_scan_end = (len(text) - word_chars + 1)
    cdef int word_scan_end = word_chars - 1
    while i < text_scan_end:
        if i:
            if text[i-1] == " ":                
                for j in range(word_chars):
                    if text[i+j] != word[j]:
                        break
                    if j == word_scan_end:
                        if i == text_scan_end - 1:  # match ends exactly at the end of the text
                            count += 1
                        else:
                            if text[i+j+1] == " ":
                                count += 1
                        i += word_chars  # after the word ends there's a space so it can move word_chars+1 chars
        else:
            for j in range(word_chars):
                if text[i+j] != word[j]:
                    break
                if j == word_scan_end:
                    if i == text_scan_end - 1:  # match ends exactly at the end of the text
                        count += 1
                    else:
                        if text[i+j+1] == " ":
                            count += 1
                    i += word_chars  # after the word ends there's a space so it can move word_chars+1 chars
        i += 1
    return count

In [None]:
print('Starting to check only if it is a start of a word, and count only when it is not a part of a longer word')
print(word_count_v3_7(demo_text, "cdk10"))
%timeit word_count_v3_7(demo_text, "cdk10")

In [None]:
%%cython
def word_count_v3_7_1(str text, str word):
    cdef int count = 0
    cdef int i = 0  # must be initialized; a cdef int has no defined starting value
    cdef int j
    cdef int word_chars = len(word)
    cdef int text_scan_end = (len(text) - word_chars + 1)
    cdef int word_scan_end = word_chars - 1
    cdef bint start_word_flag
    
    while i < text_scan_end:
        start_word_flag = False
        if i:
            if text[i-1] == " ":
                start_word_flag = True
        else:
            start_word_flag = True
            
        if start_word_flag:
            for j in range(word_chars):
                if text[i+j] != word[j]:
                    break
                if j == word_scan_end:
                    if i == text_scan_end - 1:  # match ends exactly at the end of the text
                        count += 1
                    else:
                        if text[i+j+1] == " ":
                            count += 1
                    i += word_chars  # after the word ends there's a space so it can move word_chars+1 chars
        i += 1
    return count

In [None]:
print('Refactoring the monstrosity')
print(word_count_v3_7_1(demo_text, "cdk10"))
%timeit word_count_v3_7_1(demo_text, "cdk10")
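
Because the whole-word versions have several boundary cases (word at the start of the text, at the very end, or embedded in a longer token), it helps to check them against the split-based count. The sketch below is a pure-Python port of the same idea, not the Cython code itself: `count_whole_words` and the sample strings are hypothetical, and the end-of-text test is written as `end == len(text)`.

```python
def count_whole_words(text, word):
    """Count occurrences of word that are delimited by spaces or the text boundaries."""
    count = 0
    i = 0
    n_chars = len(word)
    scan_end = len(text) - n_chars + 1
    while i < scan_end:
        at_word_start = (i == 0) or (text[i - 1] == " ")
        if at_word_start and text[i:i + n_chars] == word:
            end = i + n_chars
            # count only if the match is followed by a space or ends the text
            if end == len(text) or text[end] == " ":
                count += 1
            i += n_chars  # skip the matched characters
        i += 1
    return count


samples = [
    "cdk10 bind cdk101 and cdk10",  # word at start, inside a longer token, and at the end
    "xcdk10 cdk10x cdk10",          # prefixed and suffixed variants must not count
    "cdk10",                        # text is exactly the word
    "no match here",
]
for sample in samples:
    expected = sample.split().count("cdk10")
    assert count_whole_words(sample, "cdk10") == expected
    print(repr(sample), expected)
```

Running the same samples through the compiled functions is a cheap way to confirm that an optimization did not change the result.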