# Week 6 Validation and Functional tests

## Team 19: Ish Jain, Mudit Jain, Nick Munoz, Sagar Jogadhenu

## Question 1 - Chunkify
You have a file that needs to be divided into n chunks. While it would be straightforward to split the file into chunks of equal byte size and write each chunk to its own file, you cannot write any incomplete lines to the files. This means that none of the n files you create may contain a truncated line. If a split at a certain byte size would result in a truncated line, back off and write only up to the previous complete line, saving the rest for the next chunk.

You can download Metamorphosis, by Franz Kafka as the sample text. The file is of size 139055 bytes. Splitting into three pieces gives the following files and their respective sizes:

| size | filename |
|------|----------|
| 46310 | pg5200.txt_000.txt |
| 46334 | pg5200.txt_001.txt |
| 46411 | pg5200.txt_002.txt |

The last line of the pg5200.txt_000.txt is the following:

her, she hurried out again and even turned the key in the lock so

The last line of the pg5200.txt_001.txt is the following:

there.  He, fortunately, would usually see no more than the object

As a final hint, splitting the same file into eight parts gives the following:

| size | filename |
|------|----------|
| 17321 | pg5200.txt_000.txt |
| 17376 | pg5200.txt_001.txt |
| 17409 | pg5200.txt_002.txt |
| 17354 | pg5200.txt_003.txt |
| 17445 | pg5200.txt_004.txt |
| 17332 | pg5200.txt_005.txt |
| 17381 | pg5200.txt_006.txt |
| 17437 | pg5200.txt_007.txt |

You should think about making your file sizes as uniform as possible (this is not graded, however). Otherwise, for a very long file, the last file may be inordinately large compared to the others. Your algorithm should pass through the file exactly once. You should assume that you cannot read the entire file into memory at once. If possible, you also want to minimize how much you move the file pointer around in the file. You should ensure that your code produces the file sizes that are indicated for each of the cases shown above.

Here is the function signature:

def split_by_n(fname,n=3):
    '''
    Split files into sub files of near same size
    fname : Input file name
    n is the number of segments
    '''
Hint: Use wt as the file write mode.
The individual filenames should include the original filename (fname) and a number indicating the current file sequence number in the split. For example, if pg5200.txt is the original file then the 8th division should be named pg5200.txt_007.txt. Your code should strive to produce file sizes as close to the file sizes shown in the example above.
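
The requirements above (one pass, no truncated lines, near-uniform sizes) can be sketched as follows. This is only one possible approach, not a reference solution: it assumes a UTF-8 input file, it uses a cumulative byte boundary of `total * (i + 1) / n` for chunk `i` (an assumption, chosen so the last file does not grow disproportionately), and `chunk_name` is a hypothetical helper. It has not been checked against the file sizes listed above.

```python
import os

def chunk_name(fname, idx):
    # Hypothetical helper: original name plus a zero-padded sequence number
    return '%s_%03d.txt' % (fname, idx)

def split_by_n(fname, n=3):
    '''
    Split fname into n files of near-equal byte size without truncating lines.
    '''
    assert isinstance(fname, str) and os.path.isfile(fname), "fname must name an existing file"
    assert isinstance(n, int) and n > 0, "n must be a positive integer"

    total = os.path.getsize(fname)
    written = 0   # bytes written so far, over all chunks
    idx = 0       # sequence number of the chunk being written
    # newline='' leaves line endings untranslated so byte counts stay exact
    with open(fname, 'rt', encoding='utf-8', newline='') as src:
        out = open(chunk_name(fname, idx), 'wt', encoding='utf-8', newline='')
        try:
            for line in src:
                size = len(line.encode('utf-8'))
                # Back off: a line that would cross the current chunk's cumulative
                # byte boundary is saved for the next chunk (the last chunk takes the rest)
                while idx < n - 1 and written + size > total * (idx + 1) / n:
                    out.close()
                    idx += 1
                    out = open(chunk_name(fname, idx), 'wt', encoding='utf-8', newline='')
                out.write(line)
                written += size
        finally:
            out.close()
    # Create any chunks that were never reached so that exactly n files exist
    for rest in range(idx + 1, n):
        open(chunk_name(fname, rest), 'wt').close()
```

Only one line is held in memory at a time and the file pointer only moves forward. A single line longer than a whole chunk would leave some chunk files empty in this sketch.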

## Validation tests

In [None]:
# Test 1: check doc string and defaults
import inspect
assert split_by_n.__doc__ != None, "doc string must exist"
assert len(split_by_n.__doc__) > 0, "doc string must not be empty"
assert inspect.signature(split_by_n).parameters['n'].default == 3, "parameter n must have a default value of 3"
    
# Test2: Supply different invalid values for n 
n_value_list = ['5', -2, 0, 1.5, None]
fname = "pg5200.txt"
assert_count = 0
for n in n_value_list:
    try:
        split_by_n(fname, n)
    except AssertionError:
        assert_count = assert_count + 1
assert assert_count == len(n_value_list), "for each of the supplied n, there should be an assert"
        
# Test3: Supply different invalid values for file name
fname_list = [2, '', ' ', "pg5200_not_present.txt", None]
n=2
assert_count = 0
for fname in fname_list:
    try:
        split_by_n(fname, n)
    except AssertionError:
        assert_count = assert_count + 1
assert assert_count == len(fname_list), "for each of the supplied fname, there should be an assert"
print('Passed all validation tests for split_by_n')

## Functional tests

In [2]:
import os
import random
main_file_name = "test.txt"
infile = open(main_file_name,'r')
main_file = infile.readlines()   # List of strings that each element is one line from file
infile.close()


def test_fn(int_var,main_txt,file_test_name, size_check=False):
    '''
    This function is called to run a test condition for this function

    int_var: number of segments for the split
    main_txt: list of strings from the original text file(each element is one line from text file)
    file_test_name: name of the original text file
    size_check: if True, check chunk sizes are within +/- 5% 
    '''
    chunk_size = len(main_txt)//int_var   # Chunk size

    # Split the file into int_var smaller files
    split_by_n(file_test_name,int_var)

    file_chunks = []
    # Check that all int_var files were created
    for ind in range(0,int_var):
        file_name = file_test_name + "_%03d.txt" % ind
        if not os.path.exists(file_name):
            return False
        # check if each file is of similar size to show equal distribution of text
        infile = open(file_name,'r')
        txt_file = infile.readlines()
        infile.close()
        file_chunks.append(txt_file)

    if size_check:
        for ele in file_chunks:
            # Check if each file is within 5% of the chunk size
            if (len(ele) / chunk_size) < 0.95 or (len(ele) / chunk_size) > 1.05:
                return False
            # Check if each line is in original file
            for line in ele:
                if line not in main_txt:
                    return False
    
    # Clean up the created files
    for ind in range(0,int_var):
        file_name = file_test_name + "_%03d.txt" % ind
        os.remove(file_name)
    return True


# Test case 1:
n = 3
assert test_fn(n,main_file,main_file_name,True)

# Test case 2:
n = 15
assert test_fn(n,main_file,main_file_name)

# Test case 3:
n = 105
assert test_fn(n,main_file,main_file_name)

# Stress test:
for i in range(0,25):
    n = random.randrange(1,200)
    assert test_fn(n,main_file,main_file_name)

print("Passed all functional tests for split_by_n")

## Question 2 - Encrypted Sentence
We will implement a very simple encryption scheme that closely resembles the one-time-pad. You have probably seen this method used in movies like Unknown. The idea is that you and your counterparty share a book whose words you will use as the raw material for a codebook. In this case, you need Metamorphosis, by Franz Kafka.
Your job is to create a codebook of 2-tuples that map to specific words in the given text, based on the line and the position at which each word appears in the text. The text is very long, so there will be duplicated words. Strip out all of the punctuation and make everything lowercase.
For example, the word let appears on line 1683 in the text as the fifth word (reading from left-to-right). Similarly, the word us appears in the text on line 1761 as the fifth word.
Thus, if the message you want to send is the following:

`let us not say we met late at the night about the secret`

Then, one possible valid sequence for that message is the following:

 [(1394, 2), (1773, 11), (894, 10), (840, 1), (541, 2), (1192, 5), (1984, 7), (2112, 6), (1557, 2), (959, 8), (53, 10), (2232, 8), (552, 5)] 
 
Your counterparty receives the above sequence of tuples, and, because she has the same text, she is able to look up the line and word numbers of each of the tuples to retrieve the encoded message. Notice that the word the appears twice in the above message but is encoded differently each time. This is because re-using codewords (i.e., 2-tuples) destroys the encryption strength. In case of repeated words, you should have a randomized scheme to ensure that no message contains the same 2-tuple, even if the same word appears multiple times in the message. If there is only one occurrence of a word in the text and the message uses that word repeatedly so that each occurrence of the word cannot have a unique 2-tuple, then the message should be rejected (i.e., assert against this).
Your assignment is to create an encryption function and the corresponding decryption function to implement this scheme. Note that your downloaded text should have 2362 lines and 25186 words in it.

Function definitions:

#### def encrypt_message(message,fname):

    '''
    Given `message`, which is a lowercase string without any punctuation, and `fname` which is the
    name of a text file source for the codebook, generate a sequence of 2-tuples that
    represents the `(line number, word number)` of each word in the message. The output is a list
    of 2-tuples for the entire message. Repeated words in the message should not have the same 2-tuple.

    :param message: message to encrypt
    :type message: str
    :param fname: filename for source text
    :type fname: str
    :returns: list of 2-tuples
    '''
    
#### def decrypt_message(inlist,fname):

    '''
    Given `inlist`, which is a list of 2-tuples, and `fname`, which is the
    name of a text file source for the codebook, return the decrypted message.
    :param inlist: inlist to decrypt
    :type inlist: list of 2-tuples
    :param fname: filename for source text
    :type fname: str
    :returns: decrypted message string
    '''
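
The scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: `words_by_line` is a hypothetical helper, only ASCII punctuation (`string.punctuation`) is removed, punctuation is deleted rather than replaced by a space (the text does not say which, and the choice affects word positions), and the input validation that the tests below expect is left out to keep it short.

```python
import random
import string
from collections import defaultdict

def words_by_line(fname):
    # Hypothetical helper: lowercase every line, drop ASCII punctuation, split into words
    table = str.maketrans('', '', string.punctuation)
    with open(fname, 'rt', encoding='utf-8') as f:
        return [line.lower().translate(table).split() for line in f]

def encrypt_message(message, fname):
    '''Return one unused (line number, word number) 2-tuple per word of message.'''
    codebook = defaultdict(list)
    for line_no, words in enumerate(words_by_line(fname), start=1):
        for pos, word in enumerate(words, start=1):
            codebook[word].append((line_no, pos))
    # Shuffle so that repeated words draw different, randomly chosen codewords
    for positions in codebook.values():
        random.shuffle(positions)
    out = []
    for word in message.split():
        # Reject the message once a word has no unused occurrence left
        assert codebook.get(word), "no unused codeword left for %r" % word
        out.append(codebook[word].pop())
    return out

def decrypt_message(inlist, fname):
    '''Look up each (line number, word number) 2-tuple and join the words.'''
    lines = words_by_line(fname)
    # Line and word numbers are 1-based, list indices are 0-based
    return ' '.join(lines[line_no - 1][pos - 1] for line_no, pos in inlist)
```

Popping from a shuffled list guarantees that no 2-tuple is reused within a message, and a word that runs out of occurrences triggers the assertion.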


## Validation Tests

### Validation tests for encrypt_message

In [5]:
# Test1: function validation - check doc string and any defaults
assert encrypt_message.__doc__ != None, "doc string must exist"
assert len(encrypt_message.__doc__) > 0, "doc string must not be empty"

# Test2: input parameter message - input types and values other than specified must assert
fname = 'pg5200.txt'
msg_list = [1,'', ' ',None]
assert_count = 0
for message in msg_list:
    try:
        encrypt_message(message, fname)
    except AssertionError:
        assert_count = assert_count + 1
        
assert assert_count == len(msg_list)

# Test3: input fname validation 
fname_list = [2, '', ' ', "pg5200_not_present.txt", None]
message='secret'
assert_count = 0
for fname in fname_list:
    try:
        encrypt_message(message, fname)
    except AssertionError:
        assert_count = assert_count + 1
assert assert_count == len(fname_list), "for each of the supplied fname, there should be an assert"

# Test4: Output format validation
message = 'it is a secret'
fname = 'pg5200.txt'
out_tuple_list = encrypt_message(message, fname)
assert isinstance(out_tuple_list, list), "output must be a list"
assert len(out_tuple_list) == len(message.split()), "output must contain one element for each word in input"
assert all(isinstance(item, tuple) for item in out_tuple_list), "elements of output list must be tuples"
assert all(len(item) == 2 for item in out_tuple_list), "each tuple must be a 2-tuple"
assert all(isinstance(item[0], int) and isinstance(item[1],int) for item in out_tuple_list), "elements of each tuple must be an int"
assert all(item[0] > 0 and item[1] > 0 for item in out_tuple_list), "values in each tuple must be greater than 0"

#Test5: check line number and word position are within bounds of input file
# This is an approximate check since it doesn't consider punctuation stripping
line_word_list = []
with open(fname,'r') as f:
    for line in f:
        line_word_list.append(line.lower().split())
assert all(item[0] <= len(line_word_list) for item in out_tuple_list), "line number must be within total number of lines of input file"
# line numbers are 1-based, list indices are 0-based
assert all(item[1] <= len(line_word_list[item[0] - 1]) for item in out_tuple_list), "word position must be within the number of words for a given line"
print("Passed all validation tests for encrypt_message")

### Validation tests for decrypt_message

In [None]:
#Test1 function checks - check doc string, defaults
assert decrypt_message.__doc__ != None, "doc string must exist for the method"
assert len(decrypt_message.__doc__) > 0, "doc string must not be empty"

fname = 'pg5200.txt'
#Test2 - input parameter, inlist validation
# various incorrect inputs for inlist parameter
inlist_list = [(1,2), 1.0, [1,2], [('a',1),(1,'b')],[]]
assert_count = 0
for inlist in inlist_list:
    try:
        decrypt_message(inlist, fname)
    except AssertionError:
        assert_count = assert_count + 1
assert assert_count == len(inlist_list)

# Test3: Duplicate tuples and incorrect tuples in inlist should fail with assert
inlist_list = [[(9,4), (9,4)],  # duplicate tuples are not allowed
               [(9,4),(1,2,3)], # every tuple must be a 2-tuple
               [(-1,10)]]       # tuples must not contain negative values
assert_count = 0
for inlist in inlist_list:
    try:
        decrypt_message(inlist, fname)
    except AssertionError:
        assert_count = assert_count + 1
assert assert_count == len(inlist_list)

#Test 4: Supply invalid line number and word number in inlist 
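inlist_list = [[(25000,1)], # invalid line number
               [(2,10)]]    # invalid word position
assert_count = 0
for inlist in inlist_list:
    try:
        decrypt_message(inlist, fname)
    except AssertionError:
        assert_count = assert_count + 1
assert assert_count == len(inlist_list)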


# Test5: input fname validation 
fname_list = [2, '', ' ', "pg5200_not_present.txt", None]
inlist = [(9,4)]
assert_count = 0
for fname in fname_list:
    try:
        decrypt_message(inlist, fname)
    except AssertionError:
        assert_count = assert_count + 1
assert assert_count == len(fname_list), "for each of the supplied fname, there should be an assert"


# Test 6 - check line number is within the number of lines of input file
# This is an approximate test since code for stripping punctuation is not shown
fname = 'pg5200.txt'  # reset: the Test5 loop leaves fname set to an invalid value
line_word_list = []
with open(fname,'r') as f:
    for line in f:
        line_word_list.append(line.lower().split())
assert all(item[0] <= len(line_word_list) for item in inlist), "line number must be within total number of lines of input file"
# line numbers are 1-based, list indices are 0-based
assert all(item[1] <= len(line_word_list[item[0] - 1]) for item in inlist), "word position must be within the number of words for a given line"

#Test 7 - output format validation
fname = "pg5200.txt"
inlist = [(9,4)]
assert isinstance(decrypt_message(inlist,fname), str), "output must be a string"
assert len(inlist) == len(decrypt_message(inlist,fname).split()), "number of words in output must match number of tuples in input"

print("Passed validation tests for decrypt_message")

## Functional tests

### Functional tests for encrypt_message

In [None]:
# Test1 - check output generates correct tuple
fname = "pg5200.txt"
message = "copyrighted"
assert encrypt_message(message,fname) == [(9, 4)], "word copyrighted exists only at line 9 as 4th word"
message = "secret"
assert encrypt_message(message,fname)[0] in [(552, 5), (850, 3), (902, 1)], "word secret exists at three places hence output should be one of them"

#Test 2 - check corner cases
message = "this is copyrighted that is not copyrighted" 
assert_count = 0
try:
    # word copyrighted only exists once hence the message must result in assert
    encrypt_message(message, fname)
except AssertionError:
    assert_count = 1
assert assert_count == 1

message = "xylophone"
assert_count = 0
try:
    # word does not exist in input file hence assert
    encrypt_message(message, fname)
except AssertionError:
    assert_count = 1
assert assert_count == 1


#Test3 - check each repeated word is encoded into a unique tuple
message = "this is a secret that is not a secret"
out_tuple_list = encrypt_message(message, fname)
assert len(out_tuple_list) == len(message.split()), "number of tuples in output must match number of words in input"
assert len(out_tuple_list) == len(set(out_tuple_list)), "all tuples of output must be unique"

# Test4 - we can successfully decrypt the encrypted message
message = "let us not say we met late at the night about the secret"
assert decrypt_message(encrypt_message(message, fname), fname) == message, "encrypted message must be correctly decrypted"

print("Passed functional tests for encrypt_message")

### Functional tests for decrypt_message

In [None]:
# Test1 output generates correct string
fname = 'pg5200.txt'
inlist = [(9,4)]
assert decrypt_message(inlist,fname) == "copyrighted", "actual output must match expected output"

# Test2
inlist = [(559, 11), (1761, 6), (1119, 2), (367, 9), (541, 2), (2328, 3), (1253, 10), (1500, 4), (2072, 4), (747, 5),(1545, 4), (2318, 8), (850, 3)]
expected_msg = "let us not say we met late at the night about the secret"
assert decrypt_message(inlist,fname) == expected_msg, "decoded message must match expected msg"


## Question 3 - Multinomial

Write a function to return samples from the Multinomial distribution using pure Python (i.e., no third-party modules like Numpy, Scipy). Here is some sample output.

>>> multinomial_sample(10,[1/3,1/3,1/3],k=10)
 [[3, 3, 4], 
  [4, 4, 2], 
  [3, 4, 3], 
  [5, 2, 3], 
  [3, 3, 4], 
  [3, 4, 3], 
  [6, 2, 2], 
  [2, 6, 2], 
  [5, 4, 1], 
  [4, 4, 2]]
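
One way to produce samples like these in pure Python is sketched below. It is a sketch under stated assumptions rather than a reference solution: it relies on `random.choices` from the standard library (Python 3.6 or later), it takes `k=1` as the default because the validation tests in this document check for it, and the assertions are an assumption about what those tests expect.

```python
import random

def multinomial_sample(n, p, k=1):
    '''
    Return k samples from a multinomial distribution with n trials
    and category probabilities p, as a list of k lists of counts.
    '''
    assert isinstance(n, int) and n > 0, "n must be a positive integer"
    assert isinstance(k, int) and k > 0, "k must be a positive integer"
    assert isinstance(p, list), "p must be a list"
    assert all(isinstance(x, (int, float)) and x >= 0 for x in p), "p must hold non-negative numbers"
    assert round(sum(p), 5) == 1, "probabilities must sum to 1"

    samples = []
    for _ in range(k):
        counts = [0] * len(p)
        # Each of the n trials picks category i with probability p[i]
        for i in random.choices(range(len(p)), weights=p, k=n):
            counts[i] += 1
        samples.append(counts)
    return samples
```

Each inner list sums to `n` because every trial increments exactly one count.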

## Validation tests

In [9]:
# Test 1: check doc string and defaults
import inspect
assert multinomial_sample.__doc__ != None, "doc string must exist"
assert len(multinomial_sample.__doc__) > 0, "doc string must not be empty"
assert inspect.signature(multinomial_sample).parameters['k'].default == 1, "parameter k must have a default value of 1"
    
# Test2: Supply different invalid values for n 
n_value_list = ['5', -1, 0, 1.5, None]
p = [1/3,1/3,1/3]
assert_count = 0
for n in n_value_list:
    try:
        multinomial_sample(n,p)
    except AssertionError:
        assert_count = assert_count + 1
assert assert_count == len(n_value_list), "for each of the supplied n, there should be an assert"

#Test3: Different invalid values for p
p_value_list = [1,  # not a list
                ['a',1/3,1/3], # list contains strings instead of int or float
                (1,2), # not a list
                [-1,1/3,1/3], # p contains negative values
               ]
n = 2
assert_count = 0
for p in p_value_list:
    try:
        multinomial_sample(n,p)
    except AssertionError:
        assert_count = assert_count + 1
assert assert_count == len(p_value_list), "for each of the supplied p, there should be an assert"

# Test 4 : sum of probabilities equal to 1
# function should have assert(round(sum(p),5)==1)
n = 10
k = 10
p = [1/2,1/3,1/3]
assert_count = 0
try:
    multinomial_sample(n,p,k)
except AssertionError:
    assert_count = 1
assert assert_count == 1

# Test5: Supply different invalid values for k
k_value_list = ['5', -1, 0, 1.5, None]
p = [1/3,1/3,1/3]
n = 10
assert_count = 0
for k in k_value_list:
    try:
        multinomial_sample(n,p,k)
    except AssertionError:
        assert_count = assert_count + 1
assert assert_count == len(k_value_list), "for each of the supplied k, there should be an assert"

# Test 6 - output format validation
p = [1/3,1/3,1/3]
n = 10
k = 10
output = multinomial_sample(n,p,k)
assert isinstance(output, list), " output must be a list of lists"
for item in output:
    assert isinstance(item, list), "each element of output must be a list"
    assert len(item) == len(p), "the length of each inner list must match the length of probabilities list"
    

## Functional Tests

In [1]:
# Test 1 - number of samples must match requested
p = [1/3,1/3,1/3]
n = 10
k = 10
output = multinomial_sample(n,p,k)
assert k == len(output), "number of samples must match k"
for item in output:
    assert sum(item) == n, "sum of all elements in innerlist must match number of trials"
    
assert len(set(tuple(item) for item in output)) >= k//2, "at least 50% of samples are unique i.e., results are random"

# Test 2 - sample approximately follows the specified probability
p = [0.25, 0.5, 0.25]
n = 2000
k = 10
output = multinomial_sample(n,p,k)
for sample in output:
    assert sample[1] > sample[0] and sample[1] > sample[2],  "over large number of trials, the most probable element must win max times"


