# Case study: data structure selection

* How should you choose? The first step is to think about the operations you will need to implement for each data structure.
* Say you need to keep a count of each word in a text.
* Your first choice might be a list, since it is easy to add elements.
* With tuples, you can’t append or remove, but you can use the addition operator to form a new tuple.
* You can use a dictionary if you need to store each word as a key and its count as a value.
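
A rough sketch of the three options side by side. The sample list `words` and the variable names are made up for illustration:

```python
# Hypothetical sample data, not taken from the file used later
words = ["apple", "and", "orange", "orange"]

# List: easy to grow in place with append
word_list = []
for w in words:
    word_list.append(w)

# Tuple: no append, but + builds a new tuple each time
word_tuple = ()
for w in words:
    word_tuple = word_tuple + (w,)

# Dictionary: word as key, count as value
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1

print(word_list)
print(word_tuple)
print(counts)
```

Only the dictionary keeps the counts directly; with the list or the tuple you would still have to count the repeats afterwards.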


### Other factors to consider in choosing data structures.
* One is run time. Sometimes there is a theoretical reason to expect one data structure to be faster than another; for example, I mentioned that the in operator is faster for dictionaries than for lists, at least when the number of elements is large.

* But often you don’t know ahead of time which implementation will be faster. One option is to implement both of them and see which is better. This approach is called benchmarking. 
* A practical alternative is to choose the data structure that is easiest to implement, and then see if it is fast enough for the intended application. 

* If so, there is no need to go on. If not, there are tools, like the profile module, that can identify the places in a program that take the most time.

* The other factor to consider is storage space.  In some cases, saving space can also make your program run faster, and in the extreme, your program might not run at all if you run out of memory. But for many applications, space is a secondary consideration after run time.
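
One way to benchmark the in operator for a list and a dictionary is the standard timeit module. This is only a sketch: the size `n`, the names `as_list` and `as_dict`, and the number of repetitions are arbitrary choices, and the timings will differ from machine to machine.

```python
import timeit

# Hypothetical test data: 100,000 integers stored both ways
n = 100_000
as_list = list(range(n))
as_dict = dict.fromkeys(range(n))

# Look up a value near the end, the slow case for a list scan
target = n - 1

# Time 100 membership tests on each structure
list_time = timeit.timeit(lambda: target in as_list, number=100)
dict_time = timeit.timeit(lambda: target in as_dict, number=100)

print("list:", list_time)
print("dict:", dict_time)
```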

# 13.1  Word frequency analysis

In [63]:
f = open("mydata.dat","w")
f.write("apple and orange\n")
f.write("mango\n")
f.write("orange\n")
f.write("grapes\n")
f.close()

In [64]:
# The with block closes the file automatically after reading
with open("mydata.dat","r") as f:
  text = f.read()
print(text)

apple and orange
mango
orange
grapes



In [76]:
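def freq(text):
  words = text.split()   # break the string into a list of words
  print(words)           # debug: show the word list

  unique_words = []      # each distinct word, in order of first appearance

  for word in words:
    if word not in unique_words:
      print(word)          # debug: the new word about to be added
      print(unique_words)  # debug: the list before adding it
      unique_words.append(word)

  for word in unique_words:
    print('Frequency of', word, 'is :', words.count(word))
  print(unique_words)


text = 'apple mango apple orange '

freq(text)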

['apple', 'mango', 'apple', 'orange']
apple
[]
mango
['apple']
orange
['apple', 'mango']
Frequency of apple is : 2
Frequency of mango is : 1
Frequency of orange is : 1
['apple', 'mango', 'orange']


In [82]:
def hist(myfile):
  counts = {}  # maps each word to the number of times it appears
  with open(myfile, "r") as f1:  # the file is closed automatically
    for line in f1:
      words = line.split()  # split on whitespace; this also drops the newline
      print(words)
      for word in words:
        if word in counts:
          counts[word] = counts[word] + 1  # seen before: increment
        else:
          counts[word] = 1  # first occurrence, e.g. counts['apple'] = 1

  return counts

hist("mydata.dat")

['apple', 'and', 'orange']
['mango']
['orange']
['grapes']


{'and': 1, 'apple': 1, 'grapes': 1, 'mango': 1, 'orange': 2}

# 13.2  Random numbers

In [84]:
import random

for i in range(10):
    x = random.random()  # random float in the range [0.0, 1.0)
    print(x)

0.7815285329571251
0.6705926499622238
0.056179322417355015
0.2238596618253167
0.7478502708139596
0.538598550035814
0.48886522778680086
0.8560787211603603
0.38217386300796996
0.19645380416417635


The function randint takes parameters low and high and returns an integer between low and high (including both).

In [88]:
random.randint(5, 100)

45

In [89]:
random.randint(5, 100)

24

To choose an element from a sequence at random, you can use choice:

In [130]:
t = [1, 2, 3, 5, 6]
t

[1, 2, 3, 5, 6]

In [131]:
random.choice(t)

3

In [95]:
random.choice(t)

3
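
A quick way to convince yourself of the ranges of these three functions is to call them many times and check every result. This is only a sanity check written for illustration; the loop count of 1000 is arbitrary.

```python
import random

t = [1, 2, 3, 5, 6]

for _ in range(1000):
    # random() returns a float from 0.0 up to, but not including, 1.0
    assert 0.0 <= random.random() < 1.0
    # randint includes both end points
    assert 5 <= random.randint(5, 100) <= 100
    # choice always returns an element of the sequence
    assert random.choice(t) in t

print("all checks passed")
```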

# 13.3  Word histogram

In [11]:
f = open("mydata.dat","w")
f.write("apple and orange\n")
f.write("mango\n")
f.write("orange\n")
f.write("grapes\n")
f.close()

In [98]:
def hist(myfile):
  counts = {}  # maps each word to the number of times it appears
  with open(myfile, "r") as f1:  # the file is closed automatically
    for line in f1:
      words = line.split()  # split on whitespace; this also drops the newline
      for word in words:
        if word in counts:
          counts[word] = counts[word] + 1
        else:
          counts[word] = 1

  return counts

myhist = hist("mydata.dat")

In [99]:
print(myhist)

{'apple': 1, 'and': 1, 'orange': 2, 'mango': 1, 'grapes': 1}
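
The standard library also has collections.Counter, which does the same bookkeeping. The following is an alternative sketch, not the hist function above; the `lines` list stands in for the contents of mydata.dat so that the snippet runs on its own.

```python
from collections import Counter

# Same lines that were written to mydata.dat above
lines = ["apple and orange\n", "mango\n", "orange\n", "grapes\n"]

counts = Counter()
for line in lines:
    counts.update(line.split())  # add one for every word on the line

print(dict(counts))
```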


# 13.4  Most common words

most_common takes a histogram and returns a list of frequency-word tuples, sorted in reverse order by frequency.

In [103]:
def most_common(hist):
    t = []
    for key, value in hist.items():
        t.append((value, key))

    t.sort(reverse=True)
    return t

In [104]:
t = most_common(myhist)
print ('The most common words are:')
for freq, word in t[0:2]:
    print (word, '\t', freq)

The most common words are:
orange 	 2
mango 	 1
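
Because the tuples are (frequency, word) and the whole list is sorted in reverse, words with the same frequency come out in reverse alphabetical order, which is why mango is listed ahead of the other words that appear once. If alphabetical order among ties is preferred, one possible variant uses sorted with a key. The name `most_common_alpha` is made up for this sketch, and the histogram is typed in by hand so the snippet runs on its own.

```python
# Histogram from the section above, typed in by hand
myhist = {'apple': 1, 'and': 1, 'orange': 2, 'mango': 1, 'grapes': 1}

def most_common_alpha(hist):
    # Hypothetical variant: highest count first, ties in alphabetical order
    return sorted(hist.items(), key=lambda pair: (-pair[1], pair[0]))

for word, freq in most_common_alpha(myhist)[:2]:
    print(word, '\t', freq)
```

Note that this variant returns (word, frequency) pairs, the reverse of most_common.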


# 13.5  Optional parameters

We have seen built-in functions and methods that take a variable number of arguments. It is possible to write user-defined functions with optional arguments, too. For example, here is a function that prints the most common words in a histogram:

In [112]:
def print_most_common(hist, num=3):
    t = most_common(hist)
    print ('The most common words are:')
    for freq, word in t[:num]:
        print (word, '\t', freq)

In [113]:
print_most_common(myhist)

The most common words are:
orange 	 2
mango 	 1
grapes 	 1


In [114]:
print_most_common(myhist,5)

The most common words are:
orange 	 2
mango 	 1
grapes 	 1
apple 	 1
and 	 1
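
An optional argument can be overridden by position, as above, or by name. A small self-contained sketch follows; `top_words` is a made-up helper, not one of the functions defined earlier.

```python
def top_words(hist, num=3):
    # num is optional; 3 is used when the caller leaves it out
    pairs = sorted(hist.items(), key=lambda p: p[1], reverse=True)
    return [word for word, count in pairs[:num]]

h = {'apple': 1, 'and': 1, 'orange': 2, 'mango': 1, 'grapes': 1}

print(top_words(h))         # default: num=3
print(top_words(h, 1))      # override by position
print(top_words(h, num=5))  # override by name
```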


Write a function that calculates the sum of n numbers:


In [111]:
def sumofn(l=1,h=100):
  # Assumes the task is to sum the integers from l through h, inclusive
  return sum(range(l, h + 1))

In [110]:
sumofn()

5050

# 13.6  Dictionary subtraction

subtract takes dictionaries d1 and d2 and returns a new dictionary that contains all the keys from d1 that are not in d2. Since we don’t really care about the values, we set them all to None.

In [116]:
f = open("mydata.dat","w")
f.write("apple and orange\n")
f.write("mango\n")
f.write("orange\n")
f.write("grapes\n")
f.write("banana\n")
f.close()
# The with block closes the file automatically after reading
with open("mydata.dat","r") as f:
  print(f.read())

apple and orange
mango
orange
grapes
banana



In [118]:
words = hist('mydata.dat')
words

{'and': 1, 'apple': 1, 'banana': 1, 'grapes': 1, 'mango': 1, 'orange': 2}

In [117]:
myhist

{'and': 1, 'apple': 1, 'grapes': 1, 'mango': 1, 'orange': 2}

In [120]:
def subtract(d1, d2):
    res = dict()
    for key in d1:
        #print(key)
        if key not in d2:
            res[key] = None
    return res

In [121]:
words = hist('mydata.dat')
diff = subtract(words, myhist)

In [122]:
words

{'and': 1, 'apple': 1, 'banana': 1, 'grapes': 1, 'mango': 1, 'orange': 2}

In [123]:
myhist

{'and': 1, 'apple': 1, 'grapes': 1, 'mango': 1, 'orange': 2}

In [124]:
for word in diff.keys():
    print (word)

banana
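
Since only the keys matter here, the same result can be had from set subtraction on the dictionaries' key views. This is an alternative sketch rather than a replacement for subtract; the two dictionaries are typed in by hand so the snippet runs on its own.

```python
words = {'and': 1, 'apple': 1, 'banana': 1, 'grapes': 1, 'mango': 1, 'orange': 2}
myhist = {'and': 1, 'apple': 1, 'grapes': 1, 'mango': 1, 'orange': 2}

# Key views behave like sets, so - gives the keys found only in words
diff = words.keys() - myhist.keys()
print(diff)
```

The result is a set, not a dictionary, so there are no placeholder None values to carry around.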


# 13.7  Random words

To choose a random word from the histogram, the simplest algorithm is to build a list with multiple copies of each word, according to the observed frequency, and then choose from the list:

In [125]:
myhist

{'and': 1, 'apple': 1, 'grapes': 1, 'mango': 1, 'orange': 2}

In [126]:
myhist.items()

dict_items([('apple', 1), ('and', 1), ('orange', 2), ('mango', 1), ('grapes', 1)])

In [146]:
def random_word(h):
    t = []
    for word, freq in h.items():
        t.extend([word] * freq)  # one copy of the word per occurrence

    return random.choice(t)

In [147]:
random_word(myhist)

'and'
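
Building the expanded list takes memory in proportion to the total number of words. One possible alternative is random.choices from the standard library (Python 3.6 and later), which accepts the frequencies as weights. The name `random_word_weighted` is made up for this sketch.

```python
import random

myhist = {'apple': 1, 'and': 1, 'orange': 2, 'mango': 1, 'grapes': 1}

def random_word_weighted(h):
    # Hypothetical variant: let random.choices do the weighting
    words = list(h.keys())
    weights = list(h.values())
    # choices returns a list of k items, so take the first one
    return random.choices(words, weights=weights, k=1)[0]

print(random_word_weighted(myhist))
```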