#### with

In [None]:
with open(training_set,'r') as f:
    for line in f:
        line=line.strip()

Using "with", it's no longer necessary to close the file yourself; "with" closes it when the block exits, even if an exception is raised. In essence, it follows this scheme: automatically call the `__enter__` method, run "some code", and then automatically call the `__exit__` method. In the case above, `open` returns a file object whose `__enter__` returns the file itself and whose `__exit__` calls `close`.

In [None]:
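class controlled_execution(object):
    def __enter__(self):
        # set things up
        thing = object()  # placeholder for the managed resource
        return thing

    def __exit__(self, exc_type, exc_value, traceback):
        # tear things down (runs even if "some code" raises)
        pass

with controlled_execution() as thing:
    pass  # some code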
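To see when `__enter__` and `__exit__` actually run, here is a minimal sketch. The class name `Tracked` and the `events` list are hypothetical, introduced only for illustration; the point is that `__exit__` runs even when the body raises.

```python
events = []  # hypothetical log used only to record the call order


class Tracked(object):
    def __enter__(self):
        events.append('enter')   # set things up
        return self              # bound to the name after "as"

    def __exit__(self, exc_type, exc_value, traceback):
        events.append('exit')    # tear things down
        return False             # do not swallow the exception


try:
    with Tracked() as t:
        events.append('body')
        raise ValueError('boom')
except ValueError:
    events.append('handled')

print(events)  # ['enter', 'body', 'exit', 'handled']
```

Returning `False` (or `None`) from `__exit__` lets the exception propagate, which is why the `except` clause still sees it.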

#### Creating a class takes more time and memory than instantiating one, but classes themselves are not a bottleneck [link](https://stackoverflow.com/questions/10072428/why-is-creating-a-class-in-python-so-much-slower-than-instantiating-a-class)

In [None]:
>>> import sys
>>> class Haha(object): pass
...
>>> sys.getsizeof(Haha)
904
>>> sys.getsizeof(Haha())
64
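A rough way to check the timing side of this claim is `timeit`. This is a sketch, not a benchmark: the helper names `make_class` and `make_instance` are hypothetical, and the absolute numbers depend on the machine and the Python version.

```python
import timeit


class Haha(object):
    pass


def make_class():
    # Build a brand-new class object on every call.
    return type('Haha', (object,), {})


def make_instance():
    # Instantiate the class that already exists.
    return Haha()


n = 10000
class_time = timeit.timeit(make_class, number=n)
instance_time = timeit.timeit(make_instance, number=n)
print('create class:    %.4f s' % class_time)
print('create instance: %.4f s' % instance_time)
```

Class creation normally happens once at import time, while instantiation happens many times, which is why the creation cost rarely matters in practice.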

#### dictionary

A dictionary itself is quite efficient (it is a hash table). But a dictionary of dictionaries can be a bottleneck, like `ConditionalFreqDist` in nltk:

In [None]:
class FreqDist(dict):
    def __init__(self, samples=None):

        dict.__init__(self)
        self._N = 0
        self._reset_caches()
        if samples:
            self.update(samples)

    def inc(self, sample, count=1):

        if count == 0: return
        self[sample] = self.get(sample,0) + count
......

class ConditionalFreqDist(defaultdict):
    def __init__(self, cond_samples=None):
        defaultdict.__init__(self, FreqDist)
        if cond_samples:
            for (cond, sample) in cond_samples:
                self[cond].inc(sample)
......

The above configuration is a potential bottleneck. It builds a table of conditional probabilities, prob[c][d]: prob itself is a dict, prob[c] returns another dict, and prob[c][d] is the value.

A naive but feasible way to replace the dict of dicts is to use a single larger dict whose key combines both parts, such as 'c_d':

In [None]:
tag_bigram_dict={}

def transition_count(tag_bigram_dict,tag_tag):
    # tag_tag is the combined key, e.g. 'c_d'
    if tag_tag not in tag_bigram_dict:
        tag_bigram_dict[tag_tag]=1
    else:
        tag_bigram_dict[tag_tag]+=1
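One caveat with a string key such as 'c_d' is that it becomes ambiguous if a tag itself contains the separator. A tuple key `(c, d)` avoids that. The sketch below compares the nested and the flat layout on made-up tag bigrams; the names `bigrams`, `nested`, and `flat` are hypothetical.

```python
# Hypothetical tag bigrams, used only for illustration.
bigrams = [('DT', 'NN'), ('NN', 'VB'), ('DT', 'NN'), ('DT', 'JJ')]

# Dict of dicts: two lookups per access, one inner dict per condition.
nested = {}
for c, d in bigrams:
    inner = nested.setdefault(c, {})
    inner[d] = inner.get(d, 0) + 1

# Single flat dict keyed by the tuple (c, d): one lookup per access.
flat = {}
for c, d in bigrams:
    flat[(c, d)] = flat.get((c, d), 0) + 1

print(nested['DT']['NN'])   # 2
print(flat[('DT', 'NN')])   # 2
```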

A better way is to use the `defaultdict` class from the `collections` module, which performs well.


    >>> from collections import defaultdict
    >>> s = 'mississippi'
    >>> d = defaultdict(int)
    >>> for k in s:
    ...     d[k] += 1
    ...
    >>> d.items()
    [('i', 4), ('p', 2), ('s', 4), ('m', 1)]
    
    
    >>> s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
    >>> d = defaultdict(list)
    >>> for k, v in s:
    ...     d[k].append(v)
    ...
    >>> d.items()
    [('blue', [2, 4]), ('red', [1]), ('yellow', [1, 3])]



Usually, a Python dictionary throws a KeyError if you try to get an item with a key that is not currently in the dictionary. The defaultdict in contrast will simply create any items that you try to access (provided of course they do not exist yet). 

To create such a "default" item, it calls `default_factory`, the function object passed as the first argument to the constructor (or `None` if that argument is absent). In the first example, default items are created using `int()`, which returns the integer object 0. In the second example, default items are created using `list()`, which returns a new empty list object.
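The behavior described above can be checked directly. A minimal sketch: note that only indexing with `d[key]` triggers the factory, while `d.get(key)` and `key in d` do not create anything. The variable names are illustrative.

```python
from collections import defaultdict

plain = {}
try:
    plain['missing']
except KeyError:
    print('plain dict raises KeyError')

d = defaultdict(int)
print(d['missing'])      # 0, and the key now exists
print('missing' in d)    # True

print(d.get('other'))    # None: get() does not call the factory
print('other' in d)      # False

# Any zero-argument callable works as the factory.
d5 = defaultdict(lambda: 5)
print(d5['x'])           # 5

# Without a factory, defaultdict behaves like a plain dict.
nofactory = defaultdict()
try:
    nofactory['missing']
except KeyError:
    print('no default_factory: KeyError')
```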

In [None]:
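# Excerpt from a method body: t_ct, wt_ct, bi_t_ct and sum_t_ct are
# assumed to be initialized elsewhere in the class.
self.w_ct = defaultdict(int)
self.tag_dict = defaultdict(set)
# First pass: count every token.
with open(train_file,'r') as f:
    for l in f:
        l = l.split()
        l.append('END')
        l.append('END')
        for i in range(0,len(l)):
            self.w_ct[l[i]] += 1
# Second pass: l[i] is a word and l[i+1] is its tag.
with open(train_file,'r') as f:
    for l in f:
        l = l.split()
        l.append('END')
        l.append('END')
        self.sum_t_ct += len(l)
        for i in range(0,len(l),2):
            w = l[i]
            if(self.w_ct[w]<5):
                w='UNKA'  # rare words share one placeholder
            self.t_ct[l[i+1]] += 1
            self.wt_ct[(w,l[i+1])] += 1
            self.tag_dict[w].add(l[i+1])
            if(i>=1):
                self.bi_t_ct[(l[i-1],l[i+1])] += 1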
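Since the cell above is only an excerpt, here is a self-contained sketch of the same two-pass counting on in-memory lines. It assumes each line alternates word and tag tokens, as the `l[i]` / `l[i+1]` indexing suggests; the function name `count_tags`, the `min_count` parameter, and the sample lines are hypothetical.

```python
from collections import defaultdict


def count_tags(lines, min_count=5):
    """Two-pass counting over lines of alternating 'word tag' tokens."""
    w_ct = defaultdict(int)       # token frequency
    t_ct = defaultdict(int)       # tag frequency
    wt_ct = defaultdict(int)      # (word, tag) frequency
    bi_t_ct = defaultdict(int)    # (previous tag, tag) frequency
    tag_dict = defaultdict(set)   # word -> set of tags seen with it

    rows = [line.split() + ['END', 'END'] for line in lines]

    # Pass 1: count every token so rare words can be detected.
    for row in rows:
        for tok in row:
            w_ct[tok] += 1

    # Pass 2: step through the (word, tag) pairs.
    for row in rows:
        for i in range(0, len(row), 2):
            w, t = row[i], row[i + 1]
            if w_ct[w] < min_count:
                w = 'UNKA'        # rare words share one placeholder
            t_ct[t] += 1
            wt_ct[(w, t)] += 1
            tag_dict[w].add(t)
            if i >= 1:
                bi_t_ct[(row[i - 1], t)] += 1
    return t_ct, wt_ct, bi_t_ct, tag_dict


lines = ['the DT dog NN', 'the DT cat NN']
t_ct, wt_ct, bi_t_ct, tag_dict = count_tags(lines, min_count=2)
print('%d %d %d' % (t_ct['DT'], wt_ct[('the', 'DT')], bi_t_ct[('DT', 'NN')]))
```

Because every counter is a `defaultdict`, no key needs to be checked or initialized before `+= 1` or `.add()`.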

####  IPython parallel

In [2]:
from multiprocessing import cpu_count
print(cpu_count())

2


To start an IPython cluster, use the following in the command line:

ipcluster start -n 2

It can also be started through the notebook GUI.

#### A nice way to delete empty strings from a list

In [1]:
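str_list = ['a','b','']
# Each line below is an alternative; any one of them is enough.
# list() keeps the result a list on Python 3, where filter returns an iterator.
str_list = list(filter(None, str_list)) # fastest
str_list = list(filter(bool, str_list)) # fastest
str_list = list(filter(len, str_list))  # a bit slower
str_list = list(filter(lambda item: item, str_list)) # slower than a list comprehension
print(str_list)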

['a', 'b']
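The variants above all produce the same list, and a list comprehension is a common alternative. The sketch below checks that they agree and times each one with `timeit`. The `variants` and `results` names are hypothetical, and relative timings vary across Python versions and machines, so no ranking is asserted here.

```python
import timeit

str_list = ['a', 'b', '']

# Each entry builds a new list and leaves str_list untouched.
variants = {
    'filter(None)': lambda: list(filter(None, str_list)),
    'filter(bool)': lambda: list(filter(bool, str_list)),
    'filter(len)': lambda: list(filter(len, str_list)),
    'filter(lambda)': lambda: list(filter(lambda item: item, str_list)),
    'comprehension': lambda: [s for s in str_list if s],
}

results = {name: fn() for name, fn in variants.items()}

for name in sorted(variants):
    seconds = timeit.timeit(variants[name], number=100000)
    print('%-16s %.4f s' % (name, seconds))
```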
