# Problem 7

**Letter frequencies.** This problem has three (3) exercises worth a total of ten (10) points.

Letter frequency in text has been studied in cryptanalysis, in particular frequency analysis. Linguists use letter frequency analysis as a rudimentary technique for language identification, where it's particularly effective as an indicator of whether an unknown writing system is alphabetic, syllabic, or ideographic.

There are three main ways to count letter frequency, and they generally produce very different charts of common letters. The first counts letters in the root words of a dictionary. The second includes all word variants when counting, such as gone, going, and goes, and not just the root word go; this approach makes letters like "s" appear much more frequently. The third counts letters by their frequency in the actual text being studied.

For more details, refer to the link: 
https://en.wikipedia.org/wiki/Letter_frequency

In this problem, we will focus on the third method.
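As a rough illustration of counting letters in the actual text being studied, the sketch below computes each letter's share of all letters in a short sample. The helper name `relative_frequencies` and the sample sentence are assumptions made for this example; they are not part of the exercises.

```python
from collections import Counter

def relative_frequencies(text):
    """Hypothetical helper: share of each letter among the letters of `text`."""
    # Keep letters only and fold case so that 'G' and 'g' are counted together.
    letters = [ch.lower() for ch in text if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {letter: count / total for letter, count in counts.items()}

freqs = relative_frequencies("Go, going, gone!")
```

Here `freqs` maps each letter that occurs in the sample to a fraction, and the fractions sum to 1 (up to floating-point rounding).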

**Exercise 0** (2 points). First, given an input string, define a function `preprocess` that returns the string with non-alphabetic characters removed and all letters converted to lowercase.

For example, "We are coding letter Frequency! Yay!" would be transformed into "wearecodingletterfrequencyyay".
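One detail worth keeping in mind: Python's `str.isalpha()` also returns `True` for non-ASCII letters such as "é", so a filter based on it can let through characters outside a-z. If the input might contain such characters, a stricter variant could look like the sketch below. The name `preprocess_ascii` is hypothetical, and restricting to ASCII is an assumption about what "alphabetic" should mean here.

```python
import re

def preprocess_ascii(text):
    """Hypothetical variant: keep only ASCII letters, then lowercase them."""
    # Remove every character that is not A-Z or a-z before lowercasing.
    return re.sub(r"[^A-Za-z]", "", text).lower()

cleaned = preprocess_ascii("We are coding letter Frequency! Yay!")
```

For plain English input this behaves the same as an `isalpha()`-based filter; the two differ only when accented or other non-ASCII letters are present.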

In [1]:
def preprocess(S):
    ###
    return ''.join([i.lower() for i in S if i.isalpha()])
    ###


In [3]:
preprocess("We are coding letter Frequency! Yay!")

'wearecodingletterfrequencyyay'

In [2]:
# Test cell: valid_string
import random, string

N_str = 100 #Length of random string

def generate_str(n):
    random_str = ''.join(random.choice(string.ascii_lowercase + string.ascii_uppercase + string.digits + string.punctuation) for _ in range(n))
    return random_str

def check_preprocess_str(n):
    random_str = generate_str(n)
    print("Input String: ",random_str)
    assert preprocess(random_str).islower()
    assert preprocess(random_str).isalpha()
    print("|----Your function seems to work correct for the string----|"+"\n")

check_preprocess_str(N_str)
check_preprocess_str(N_str)
check_preprocess_str(N_str)

print("\n(Passed)!")

Input String:  xg]$M~L0oD|7n7"&>!@R_l)&BL5jo2K4P(+4~}|E:(a+ZH;ve:0'NU}&J8mrk'QJ}%L&'F2H1{("wa=ftmY?+&cUeXRLvv%m:67,
|----Your function seems to work correct for the string----|

Input String:  7y!A]b9&!9Y?+Q`EdqP1ilXF]`iv}..BNTjif".j\@Bqpf"/JdztigO9F/3`u)MLY2SF>*cCi%CPZ"5@ObQ*e%,qIOz_`pE;LRJM
|----Your function seems to work correct for the string----|

Input String:  8ht+\[:0N1&y3b;%18U-[5X9%]d_;%q.g#WL<6`nsY[@{O/UVJEd\([khZ]R."})p[q!)V]M6`}{(1oMrMvQ~$8.(aKII$k.c.F\
|----Your function seems to work correct for the string----|


(Passed)!


**Exercise 1** (4 points). With the necessary pre-processing complete, the next step is to write a function `count_letters(S)` to count the number of occurrences of each letter in the alphabet.  

You can assume that only letters will be present in the input string. The function should return a dictionary; if any letter (a-z) is missing from the input string, it should still appear in the output dictionary with a value of zero.


In [4]:
import string
def count_letters(S):
    ###
    d = dict.fromkeys(string.ascii_lowercase,0)
    S_clean = preprocess(S)
    for s in S_clean:
        d[s] += 1
    return d
    ###


In [5]:
# Test cell: count_letters
import collections

N_processed_str = 100

def generate_processed_str(n):
    random_processed_str = ''.join(random.choice(string.ascii_lowercase) for _ in range(n))
    return random_processed_str

def check_count_letters(S):
    print("Input String: ",S)
    random_char = chr(random.randint(97,122))
    print("Character frequency evaluated for: ", random_char)
    if(random_char in S):
        assert count_letters(S)[random_char] == collections.Counter(S)[random_char]
        print("|----Your function seems to return correct freq for the char----|"+"\n")
    else:
        assert count_letters(S)[random_char] == 0
        print("|----Your function seems to return correct freq for the char----|"+"\n")
        
check_count_letters(generate_processed_str(N_processed_str))
check_count_letters(generate_processed_str(N_processed_str))
check_count_letters(generate_processed_str(N_processed_str))
print("\n(Passed)!")

Input String:  iebcitpudqmwhziklusxtrzrsmpcdvwibwtucjrpjpzbwiwxorweqsegkfrrcwkfnlgdjyiwzcwgnmidchrcscftqiabrxcauqok
Character frequency evaluated for:  n
|----Your function seems to return correct freq for the char----|

Input String:  xggrbypxijrouhupagsqusjgygcsadewficttnchvmevuhvumrnthtrcsqrhrpaxdpfukzcarlaywvcsiuadaxvafcqoezprojyb
Character frequency evaluated for:  m
|----Your function seems to return correct freq for the char----|

Input String:  dconbxganhqmqvebubsrvvernvfgnuopslcpxqkrhfvqaczauhhjdehfitgkivzubtxwrczpobmfgjwqqnsutfovnaauotmhzjzb
Character frequency evaluated for:  c
|----Your function seems to return correct freq for the char----|


(Passed)!


**Exercise 2** (4 points). The next step is to sort a dictionary whose keys are all the letters of the alphabet and whose values are the number of occurrences of each letter in the text.

Sort first in decreasing order of occurrence count; for two letters with the same count, order them alphabetically. The function `find_top_letter(d)` should return the first letter in this order.
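The two-level ordering described above can be expressed with a single tuple sort key: negating the count makes larger counts sort first, and the letter itself breaks ties alphabetically. The sketch below returns the full ranking instead of only the top letter; `rank_letters` and the small sample dictionary are assumptions made for illustration.

```python
def rank_letters(counts):
    """Hypothetical helper: letters by decreasing count, ties in alphabetical order."""
    # Tuples compare element by element: first -count, then the letter.
    return sorted(counts, key=lambda letter: (-counts[letter], letter))

ranking = rank_letters({"a": 2, "b": 5, "c": 5, "d": 0})
# ranking == ['b', 'c', 'a', 'd']
```

Because the tie-break is part of the key, the result does not depend on the order in which keys were inserted into the dictionary.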

In [6]:
def find_top_letter(d):
    ###
    # Sort by decreasing count, breaking ties alphabetically by letter.
    sorted_items = sorted(d.items(), key=lambda item: (-item[1], item[0]))
    return sorted_items[0][0]
    ###


In [7]:
# Test cell: highest_freq_letter

def create_random_dict():
    max_char_value = random.randint(5, 20)
    random_dict = {c:random.randint(0,max_char_value-1) for c in string.ascii_lowercase}
    random_letter1, random_letter2 = random.sample(string.ascii_lowercase, 2)
    random_dict[random_letter1], random_dict[random_letter2] = max_char_value, max_char_value
    if(random_letter1 < random_letter2):
        return random_letter1, random_dict
    else:
        return random_letter2, random_dict

def check_top_letter():
    top_letter, random_dict = create_random_dict()
    user_letter = find_top_letter(random_dict)
    assert user_letter == top_letter
    print("Input Dictionary: ", random_dict)
    print("Your function correctly returned most frequent letter: {} \n".format(user_letter))
    
check_top_letter()
check_top_letter()
check_top_letter()
print("\n(Passed)!")

Input Dictionary:  {'a': 4, 'b': 4, 'c': 3, 'd': 0, 'e': 1, 'f': 0, 'g': 4, 'h': 2, 'i': 3, 'j': 2, 'k': 4, 'l': 3, 'm': 2, 'n': 4, 'o': 4, 'p': 1, 'q': 0, 'r': 5, 's': 4, 't': 6, 'u': 3, 'v': 2, 'w': 5, 'x': 6, 'y': 0, 'z': 3}
Your function correctly returned most frequent letter: t 

Input Dictionary:  {'a': 5, 'b': 4, 'c': 10, 'd': 8, 'e': 6, 'f': 6, 'g': 1, 'h': 0, 'i': 0, 'j': 7, 'k': 1, 'l': 1, 'm': 5, 'n': 7, 'o': 9, 'p': 7, 'q': 6, 'r': 8, 's': 1, 't': 2, 'u': 9, 'v': 8, 'w': 11, 'x': 7, 'y': 5, 'z': 11}
Your function correctly returned most frequent letter: w 

Input Dictionary:  {'a': 3, 'b': 0, 'c': 5, 'd': 3, 'e': 5, 'f': 0, 'g': 6, 'h': 0, 'i': 7, 'j': 9, 'k': 7, 'l': 2, 'm': 11, 'n': 7, 'o': 11, 'p': 0, 'q': 10, 'r': 6, 's': 2, 't': 7, 'u': 7, 'v': 7, 'w': 5, 'x': 5, 'y': 6, 'z': 2}
Your function correctly returned most frequent letter: m 


(Passed)!


**Fin!** You've reached the end of this problem. Don't forget to restart the kernel and run the entire notebook from top-to-bottom to make sure you did everything correctly. If that is working, try submitting this problem. (Recall that you *must* submit and pass the autograder to get credit for your work!)