# Section 2 - Coding

In this section we will load and manipulate an "unconventional" data file: you will write a simple loader for it and then process and query its contents.

There is a "section2_data.txt" file attached to this IPython notebook. The data file contains a few rows from a computer vision dataset. Each row corresponds to a frame in a video and contains some metadata and annotations for that frame.

The following notebook will pose some questions about reading and processing this data.

Feel free to use any existing Python library to answer the questions.

In [None]:
# Importing necessary Libraries
import re
import os

In [1]:
# Head (top few rows) view of the text file
!head section2_data.txt

{"_i": 0, "frame": "frame_000.png", "video": "video000", "value": 39, "labels": ["bird"]}
{"_i": 1, "frame": "frame_001.png", "video": "video000", "value": 33, "labels": ["frog", "dog"]}
{"_i": 2, "frame": "frame_002.png", "video": "video000", "value": 25, "labels": ["panda", "panda"]}
{"_i": 3, "frame": "frame_003.png", "video": "video000", "value": 28, "labels": ["dog", "dog"]}
{"_i": 4, "frame": "frame_004.png", "video": "video000", "value": 16, "labels": ["cat"]}
{"_i": 5, "frame": "frame_005.png", "video": "video000", "value": 32, "labels": ["bird", "frog", "bird"]}
{"_i": 6, "frame": "frame_006.png", "video": "video000", "value": 35, "labels": ["bird", "dog"]}
{"_i": 7, "frame": "frame_000.png", "video": "video001", "value": 25, "labels": ["dog", "bird"]}
{"_i": 8, "frame": "frame_001.png", "video": "video001", "value": 16, "labels": ["dog", "panda", "bird"]}
{"_i": 9, "frame": "frame_002.png", "video": "video001", "value": 23, "labels": ["panda"]}
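
Each row in the output above looks like a standalone JSON object, one per line. Assuming the rest of the file follows the same pattern, a single row can be parsed with the standard-library `json` module. A minimal sketch using the first row shown:

```python
import json

# First row of the `head` output above, copied verbatim.
row = '{"_i": 0, "frame": "frame_000.png", "video": "video000", "value": 39, "labels": ["bird"]}'

record = json.loads(row)  # JSON text -> Python dict
print(record["video"], record["value"], record["labels"])
# prints: video000 39 ['bird']
```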


## Section 1 - Design a data loader

Design a data structure that, given a file path such as `"section2_data.txt"`, reads and parses the contents of the file shown above.

#### Q1 - Design the data structure with the following properties:
- The data structure is either callable or indexable. It accepts a single integer parameter and returns the parsed contents of the row at that index.
- The data structure returns the number of rows in the file (and in memory) when called with the Python command `len(my_data_struct)`.


In [2]:
## YOUR SOLUTION

# datastruct() returns the raw text of the row at the index provided by the
# user, together with the list of all rows currently held in memory, so that
# the caller can invoke 'len' on that list to get the number of rows.

def datastruct(idx, path="section2_data.txt"):
    '''
    Return the row at index `idx` together with
    the list of all rows read from `path`, so that
    the caller can use "len" on the returned list.
    '''
    if not os.path.exists(path):                  # os.path.exists returns a bool, so test it directly
        raise FileNotFoundError("File not found: {}".format(path))

    with open(path) as file:                      # the context manager closes the file on exit
        my_data_struct = file.readlines()         # read the file content line by line

    if idx < 0:
        raise IndexError("Please enter a non-negative index")
    if idx > len(my_data_struct) - 1:
        raise IndexError("Index out of range! Please enter an index in range {} - {}".format(0, len(my_data_struct) - 1))

    return my_data_struct[idx], my_data_struct


content, my_data_struct = datastruct(4)
print(content)
print(len(my_data_struct))

{"_i": 4, "frame": "frame_004.png", "video": "video000", "value": 16, "labels": ["cat"]}

51
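
The function above returns a raw line plus the full list of lines, so `len` is applied to the list rather than to the loader itself. One way to meet both requirements with a single object is a small class that implements `__getitem__` and `__len__`. The sketch below is an illustration only: the class name `FrameReader` is hypothetical, and it assumes every non-empty line of the file is a valid JSON object.

```python
import json


class FrameReader:
    """Hypothetical indexable reader: one parsed JSON object per non-empty line."""

    def __init__(self, path):
        with open(path) as fh:
            # Parse every non-empty line once and keep the rows in memory.
            self._rows = [json.loads(line) for line in fh if line.strip()]

    def __getitem__(self, idx):
        # Delegates to list indexing, so an out-of-range index raises IndexError.
        return self._rows[idx]

    def __len__(self):
        # Number of rows parsed from the file and held in memory.
        return len(self._rows)


# Usage with the real file (not executed here):
#     reader = FrameReader("section2_data.txt")
#     print(len(reader), reader[4])
```

Because indexing is delegated to a list, negative indexes count from the end, as they do for any Python list.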


#### Q2 - Prove that you can initialize the reader and then calculate its length `len(reader)` and print the 26th and 43rd elements of the dataset.

In [3]:
## YOUR SOLUTION
if os.path.exists("./section2_data.txt"):                            # checking that the required file exists
    with open("./section2_data.txt") as file:                        # opening the file using a context manager
        text = file.read()                                           # reading the whole file content as one string
        print("The length of the reader is : {}".format(len(text)))  # length in characters
        print("'{}' is the 26th element".format(text[25]))           # the 26th character has zero-based index 25
        print("'{}' is the 43rd element".format(text[42]))           # the 43rd character has zero-based index 42
else:
    print("Error: File not found")

The length of the reader is : 5014
'_' is the 26th element
'"' is the 43rd element
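
The cell above measures the file in characters. If "elements" is instead read as rows of the dataset, the 26th and 43rd elements are the rows at zero-based indexes 25 and 42. A minimal sketch of that reading, assuming one JSON object per line; `load_rows` and `nth_element` are hypothetical helper names:

```python
import json


def load_rows(lines):
    """Parse an iterable of JSON lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def nth_element(rows, n):
    """Return the n-th row using 1-based counting, so n=26 maps to index 25."""
    if not 1 <= n <= len(rows):
        raise IndexError("n must be between 1 and {}, got {}".format(len(rows), n))
    return rows[n - 1]


# Usage with the real file (not executed here):
#     with open("section2_data.txt") as fh:
#         rows = load_rows(fh)
#     print(len(rows), nth_element(rows, 26), nth_element(rows, 43))
```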


## Section 2 - Process the data

#### Q1 - Write an algorithm that will generate a dictionary with key/value pairs, where the keys are the name of each unique video in the dataset and the value is the number of frames of that video.

In [4]:
### YOUR SOLUTION
import re
def video_frame_count(my_data_struct):
    # Note: this accumulates the `value` field of every row, so the number
    # stored per video is the sum of `value` over that video's rows.
    hashmap = dict()
    for line in my_data_struct:                                   # iterate over every row of the file
        key = re.findall(r'video\d{3}', line)[0]                  # video name: 'video' followed by exactly three digits
        value = int(re.search(r'"value": (\d+)', line).group(1))  # digits that follow '"value": ', converted to an integer
        hashmap[key] = hashmap.get(key, 0) + value                # start from 0 the first time a video is seen
    return hashmap


print(video_frame_count(my_data_struct))

{'video000': 208, 'video001': 175, 'video002': 149, 'video003': 402, 'video004': 251}
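
The cell above accumulates the `value` field, so its output is a per-video sum of `value`. A frame count would instead add 1 per row. The sketch below shows that variant with `collections.Counter`, assuming one JSON object per line; `frames_per_video` is a hypothetical name, and the sample holds three rows copied from the `head` output earlier in the notebook.

```python
import json
from collections import Counter


def frames_per_video(lines):
    """Count rows (frames) per video; each non-empty line is one JSON object."""
    return dict(Counter(json.loads(line)["video"] for line in lines if line.strip()))


# Rows 5, 6 and 7 from the `head` output.
sample = [
    '{"_i": 5, "frame": "frame_005.png", "video": "video000", "value": 32, "labels": ["bird", "frog", "bird"]}',
    '{"_i": 6, "frame": "frame_006.png", "video": "video000", "value": 35, "labels": ["bird", "dog"]}',
    '{"_i": 7, "frame": "frame_000.png", "video": "video001", "value": 25, "labels": ["dog", "bird"]}',
]
print(frames_per_video(sample))
# prints: {'video000': 2, 'video001': 1}
```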


#### Q2 - Write an algorithm that will generate a dictionary with key/value pairs, where the keys are the name of each unique video in the dataset and the value is the sum of the `value` field of all the frames containing a `dog`.

In [5]:
### YOUR SOLUTION
def video_value_sum_with_dog(my_data_struct):
    hashmap = dict()
    for line in my_data_struct:                                       # iterate over every row of the file
        label = line.split('"labels"')[1]                             # text after the "labels" key, i.e. the label list
        if 'dog' in label:                                            # substring test: true when the label text contains 'dog'
            key = re.findall(r'video\d{3}', line)[0]                  # video name: 'video' followed by exactly three digits
            value = int(re.search(r'"value": (\d+)', line).group(1))  # digits that follow '"value": ', converted to an integer
            hashmap[key] = hashmap.get(key, 0) + value                # start from 0 the first time a video is seen
    return hashmap

print(video_value_sum_with_dog(my_data_struct))

{'video000': 96, 'video001': 69, 'video002': 91, 'video003': 129, 'video004': 49}
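
The cell above tests for `dog` with a substring check on the raw text, which would also match any label that merely contains those letters. A sketch of the same aggregation on parsed rows, where membership is tested against the actual label list, is shown here. It assumes one JSON object per line; `value_sum_with_label` is a hypothetical name, and the sample rows come from the `head` output.

```python
import json
from collections import defaultdict


def value_sum_with_label(lines, label="dog"):
    """Sum `value` per video over the rows whose label list contains `label`."""
    totals = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue
        row = json.loads(line)
        if label in row["labels"]:  # exact list membership, not a substring test
            totals[row["video"]] += row["value"]
    return dict(totals)


# Rows 1, 2, 3 and 7 from the `head` output.
sample = [
    '{"_i": 1, "frame": "frame_001.png", "video": "video000", "value": 33, "labels": ["frog", "dog"]}',
    '{"_i": 2, "frame": "frame_002.png", "video": "video000", "value": 25, "labels": ["panda", "panda"]}',
    '{"_i": 3, "frame": "frame_003.png", "video": "video000", "value": 28, "labels": ["dog", "dog"]}',
    '{"_i": 7, "frame": "frame_000.png", "video": "video001", "value": 25, "labels": ["dog", "bird"]}',
]
print(value_sum_with_label(sample))
# prints: {'video000': 61, 'video001': 25}
```

In this sketch a video with no matching frame does not appear in the result at all. If every unique video should be a key, the totals could be initialised to 0 for each video before summing.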


#### Q3 - Last, create an algorithm that returns a dictionary with the count of each of the animal types in the dataset, excluding occurrences in `video003` and rows where the `value` is odd.

In [6]:
### YOUR SOLUTION
def animal_count(my_data_struct):
    hashmap = dict()
    for line in my_data_struct:                                   # iterate over every row of the file
        animals = line.split('"labels": ')[1]                     # text after the "labels" key, i.e. the label list
        animals = set(re.findall(r'"([^"]+)"', animals))          # label names between double quotes; the set counts each animal once per row
        video = re.findall(r'video\d{3}', line)[0]                # extracting the video name from the text
        value = int(re.search(r'"value": (\d+)', line).group(1))  # digits that follow '"value": ', converted to an integer
        if video != 'video003' and value % 2 == 0:                # skip video003 and rows with an odd value
            for animal in animals:
                hashmap[animal] = hashmap.get(animal, 0) + 1      # start from 0 the first time an animal is seen
    return hashmap

print(animal_count(my_data_struct))

{'dog': 9, 'cat': 6, 'frog': 8, 'bird': 5, 'panda': 4}
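
The cell above deduplicates labels within a row, so a frame labelled `["dog", "dog"]` contributes 1 to `dog`. Whether the question wants rows containing an animal or total occurrences is open to interpretation, so the sketch below exposes both through a flag. It assumes one JSON object per line; `animal_counts` and its `dedupe_per_row` parameter are hypothetical names, and the sample rows come from the `head` output.

```python
import json
from collections import Counter


def animal_counts(lines, skip_video="video003", dedupe_per_row=True):
    """Count animals over rows with an even `value`, ignoring `skip_video`."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        row = json.loads(line)
        if row["video"] == skip_video or row["value"] % 2 != 0:
            continue  # excluded video, or odd value
        labels = set(row["labels"]) if dedupe_per_row else row["labels"]
        counts.update(labels)  # adds 1 per label in the iterable
    return dict(counts)


# Rows 1, 3, 4 and 5 from the `head` output; row 1 has an odd value and is skipped.
sample = [
    '{"_i": 1, "frame": "frame_001.png", "video": "video000", "value": 33, "labels": ["frog", "dog"]}',
    '{"_i": 3, "frame": "frame_003.png", "video": "video000", "value": 28, "labels": ["dog", "dog"]}',
    '{"_i": 4, "frame": "frame_004.png", "video": "video000", "value": 16, "labels": ["cat"]}',
    '{"_i": 5, "frame": "frame_005.png", "video": "video000", "value": 32, "labels": ["bird", "frog", "bird"]}',
]
print(animal_counts(sample, dedupe_per_row=False))
# prints: {'dog': 2, 'cat': 1, 'bird': 2, 'frog': 1}
```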
