# <span style="color:yellow">Step 5: Data Wrangling</span>  

<span style="color:red">- What kind of cleaning steps did you perform?</span>  
The loaded annotation data was not in a row/column layout suitable for a dataframe, so I selected specific columns and rows to get a usable format.
The annotations were split across 4 files, each containing different information, so I combined the needed data from those files into a single dataframe.
In addition, the annotation column headers were renamed for clarity.
&nbsp;


<span style="color:red">- How did you deal with missing values, if any?</span>  
There was no missing data. I also verified that the annotation data contained no duplicates.  
&nbsp;
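
The missing-value and duplicate checks described above could look roughly like the sketch below. The helper name `check_annotations`, the choice of `object_id` plus `frame_id` as the duplicate key, and the sample values are assumptions for illustration, not the checks actually run in this notebook.

```python
import pandas as pd

def check_annotations(df, key_cols=("object_id", "frame_id")):
    """Return (number of missing cells, number of rows repeating an earlier key)."""
    n_missing = int(df.isna().sum().sum())
    n_duplicates = int(df.duplicated(subset=list(key_cols)).sum())
    return n_missing, n_duplicates

# Tiny illustrative frame (values are made up)
sample = pd.DataFrame({
    "object_id": [0, 0, 1],
    "frame_id": [0, 1, 0],
    "bb_left": [485, 489, 193],
})
print(check_annotations(sample))  # (0, 0)
```

An object should have at most one bounding box per frame, which is why the pair of IDs is used as the key here.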


<span style="color:red">- Were there outliers? If so, how did you handle them?</span>  
There were no outliers.  
&nbsp;

<span style="color:red">- If your dataset is too large to work with, does it make sense to build your prototype on a smaller subset of the data?</span>  
The dataset was in fact too large, since it consists of video. I added the ability to select the start and end points of a video so that a shorter segment can be analyzed. 
In addition, to reduce annotation loading time, I created an automated script that ran through all available video annotations and created df_bbox.csv files containing the necessary information. I found that loading the preprocessed csv was significantly faster than parsing the yml files. The preprocessed dataframes were placed in the appropriate "./processed/" folder.
&nbsp;
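
As a rough illustration of working on a shorter segment, the sketch below keeps only the annotation rows whose frame falls inside a time window. The function `select_window` is hypothetical, the 30 frames/sec default is taken from the video used later in this notebook, and the half-open window `[start, start + duration)` is an assumption; the notebook's own video utility may treat the end frame differently.

```python
import pandas as pd

def select_window(df, start_time_sec, duration_sec, fps=30):
    """Keep rows whose frame_id lies in [start, start + duration) seconds."""
    first = int(start_time_sec * fps)
    last = first + int(duration_sec * fps)
    mask = (df["frame_id"] >= first) & (df["frame_id"] < last)
    return df.loc[mask].reset_index(drop=True)

# Illustrative data: one object annotated on 100 consecutive frames
frames = pd.DataFrame({"frame_id": list(range(100)), "object_id": 0})
clip = select_window(frames, start_time_sec=1, duration_sec=1)
print(clip["frame_id"].min(), clip["frame_id"].max())  # 30 59
```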


I had to understand the annotation file formats and codes, which were spread across multiple files:

Event File Columns 
- 1: event ID        (unique identifier per event within a clip, same eid can exist on different clips)
- 2: event type      (event type)
- 3: duration        (event duration in frames)
- 4: start frame     (start frame of the event)
- 5: end frame       (end frame of the event)
- 6: current frame   (current frame number)
- 7: bbox lefttop x  (horizontal x coordinate of left top of bbox, origin is lefttop of the frame)
- 8: bbox lefttop y  (vertical y coordinate of left top of bbox, origin is lefttop of the frame)
- 9: bbox width      (horizontal width of the bbox)
- 10: bbox height    (vertical height of the bbox)

Event Type ID (for column 2 above)
- 1: Person loading an Object to a Vehicle
- 2: Person Unloading an Object from a Car/Vehicle
- 3: Person Opening a Vehicle/Car Trunk
- 4: Person Closing a Vehicle/Car Trunk
- 5: Person getting into a Vehicle
- 6: Person getting out of a Vehicle
- 7: Person gesturing
- 8: Person digging
- 9: Person carrying an object
- 10: Person running
- 11: Person entering a facility
- 12: Person exiting a facility
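
The event file layout and the event type codes above can be turned into a labelled dataframe along the following lines. This is a sketch under assumptions: it supposes the event file is whitespace-delimited with no header row, the column names and the helper `read_events` are my own, and the two sample rows are made up.

```python
import io
import pandas as pd

# Names chosen for illustration; the order follows the column list above
EVENT_COLUMNS = [
    "event_id", "event_type", "duration", "start_frame", "end_frame",
    "current_frame", "bb_left", "bb_top", "bb_width", "bb_height",
]

EVENT_TYPES = {
    1: "Person loading an Object to a Vehicle",
    2: "Person Unloading an Object from a Car/Vehicle",
    3: "Person Opening a Vehicle/Car Trunk",
    4: "Person Closing a Vehicle/Car Trunk",
    5: "Person getting into a Vehicle",
    6: "Person getting out of a Vehicle",
    7: "Person gesturing",
    8: "Person digging",
    9: "Person carrying an object",
    10: "Person running",
    11: "Person entering a facility",
    12: "Person exiting a facility",
}

def read_events(source):
    """Read a whitespace-delimited event file and add a readable event name."""
    df = pd.read_csv(source, sep=r"\s+", header=None, names=EVENT_COLUMNS)
    df["event_name"] = df["event_type"].map(EVENT_TYPES)
    return df

# Two made-up rows standing in for a real event file
sample = io.StringIO(
    "0 5 3 100 102 100 10 20 30 40\n"
    "0 5 3 100 102 101 11 20 30 40\n"
)
events = read_events(sample)
print(events.loc[0, "event_name"])  # Person getting into a Vehicle
```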

Object File Columns
- 1: Object id        (a unique identifier of an object track. Unique within a file.)
- 2: Object duration  (duration of the object track)
- 3: Current frame    (corresponding frame number)
- 4: bbox lefttop x   (horizontal x coordinate of the left top of bbox, origin is lefttop of the frame)
- 5: bbox lefttop y   (vertical y coordinate of the left top of bbox, origin is lefttop of the frame)
- 6: bbox width       (horizontal width of the bbox)
- 7: bbox height      (vertical height of the bbox)
- 8: Object type      (object type)

Object Type ID (for column 8 above for object files)
- 1: person
- 2: car              (usually passenger vehicles such as sedan, truck)
- 3: vehicles         (vehicles other than usual passenger cars. Examples include construction vehicles)
- 4: object           (neither car nor person, usually carried objects)
- 5: bike, bicycles   (may include engine-powered auto-bikes)
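
The object file stores each box as left, top, width and height, while the dataframe built later in this notebook uses left, top, right and bottom. A minimal sketch of that conversion follows. It assumes `right = left + width` and `bottom = top + height` (some conventions subtract 1), and the function name, column names and sample row are illustrative only.

```python
import pandas as pd

# Object type codes from the list above
OBJECT_TYPES = {1: "person", 2: "car", 3: "vehicles", 4: "object", 5: "bike"}

def to_corner_boxes(df):
    """Convert (left, top, width, height) boxes to (left, top, right, bottom)."""
    out = df.copy()
    out["bb_right"] = out["bb_left"] + out["bb_width"]
    out["bb_bottom"] = out["bb_top"] + out["bb_height"]
    out["category"] = out["object_type"].map(OBJECT_TYPES)
    return out.drop(columns=["bb_width", "bb_height"])

# One made-up object row
objects = pd.DataFrame({
    "object_id": [0], "bb_left": [485], "bb_top": [743],
    "bb_width": [168], "bb_height": [171], "object_type": [2],
})
boxes = to_corner_boxes(objects)
print(int(boxes.loc[0, "bb_right"]), int(boxes.loc[0, "bb_bottom"]),
      boxes.loc[0, "category"])  # 653 914 car
```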

Mapping File Columns
- 1: event ID         (unique event ID, points to column 1 of event file)
- 2: event type       (event type, points to column 2 of event file)
- 3: event duration   (event duration, points to column 3 of event file)
- 4: start frame      (start frame of event)
- 5: end frame        (end frame of event)
- 6: number of obj    (total number of associated objects)
- 7-end:              (a variable number of columns that capture the association map for the objects in the clip.
                     If '1', the event is associated with the object. If '0', it is not.
                     The corresponding oid in the object file can be found by 'column number - 7')
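
The 'column number - 7' rule can be sketched as below, assuming the column numbers above are 1-indexed, so that column 7 maps to object ID 0. The function name and the sample row are made up for illustration.

```python
def associated_object_ids(fields):
    """Return the object IDs flagged with 1 in one mapping-file row.

    `fields` is the row split into tokens. Columns 7..end (1-indexed)
    are the association flags, so the flag at 1-indexed column c refers
    to object ID c - 7.
    """
    flags = fields[6:]  # 0-indexed slice covering 1-indexed columns 7..end
    return [oid for oid, flag in enumerate(flags) if int(flag) == 1]

# Made-up row: event 0 of type 5, frames 100-102, 2 associated objects
row = "0 5 3 100 102 2 1 0 1 0".split()
print(associated_object_ids(row))  # [0, 2]
```

The length of the returned list should equal column 6 (number of obj), which gives a cheap consistency check when parsing real files.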


Loading libraries:

In [1]:
# !pip install -r requirements.txt 
import cv2
import numpy as np
import pandas as pd
# import glob
import os
from pandas import json_normalize
from os import getcwd
from yaml import SafeLoader, load
import datetime
import matplotlib.pyplot as plt
import yaml
import shutil
import json
from sys import path  # sys.path (module search path), used below to locate the project root

%matplotlib inline
cv2.__version__

'4.5.4'

Setting variables and paths to dataset info:

In [2]:

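# Change to the parent directory of the first sys.path entry
# (this assumes the notebook is started from a sub-folder of the project)
os.chdir(os.path.dirname(path[0]))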

video_name = 'VIRAT_S_050000_07_001014_001126'
dataset_dir_path = './sample_datasets/VIRAT/'  # top directory where the dataset is located 
video_src_path = dataset_dir_path + video_name +'/' # location where videos are stored
annotations_path = dataset_dir_path + video_name +'/'

# setup
image_ext = '.jpg'
video_max_frames = 2000

#video
video_ext = '.mp4'

video_name_orig = video_name + video_ext
video_dest_path = './processed/' +  video_name + '/'  # location where to place processed videos/data

# annotations
saved_csv = video_dest_path + 'df_bbox.csv'

video_name_new = 'ann_yml_'

ann_activities_file = annotations_path + video_name + '.activities.yml'
ann_geom_file = annotations_path + video_name + '.geom.yml'
ann_regions_file = annotations_path + video_name + '.regions.yml'
ann_types_file = annotations_path + video_name + '.types.yml'

video_name_new = video_name_new + video_name + video_ext

Parsing the annotated object categories and incorporating them into a dataframe

In [3]:
# Create directory to store new video
if not os.path.exists(video_dest_path):
    os.makedirs(video_dest_path)


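if not os.path.exists(saved_csv):
    # Flatten the types YAML into one column per nested key
    with open(ann_types_file) as yaml_file:
        yaml_contents = load(yaml_file, Loader=SafeLoader)
    yaml_df = json_normalize(yaml_contents)
    # Replace each type flag (1) with the type name taken from the column label
    for col in yaml_df.columns:
        type_name = col.split('.')[-1]
        if type_name != 'id1':
            yaml_df.loc[yaml_df[col] == 1, col] = type_name
    
    # Keep only rows that describe an object and drop columns that are entirely empty
    yaml_df = yaml_df[yaml_df['types.id1'].notna()].reset_index().dropna(axis=1, how='all')  
    # After forward-filling along each row, the last column holds the object's category
    type_df = yaml_df.ffill(axis=1).iloc[:,-1].to_frame(name='category')
    type_df.insert(0, "id", yaml_df['types.id1'])
    
    type_df.head(10)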
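
For comparison, the same id-to-category table can be built directly from the parsed YAML records without the forward-fill step. This is an alternative sketch, not the method used above. It assumes each object record looks like `{'types': {'id1': <id>, 'cset3': {<category>: <score>}}}`, which is how I read the flattened column names such as `types.id1`; the key `cset3`, the helper name, and the sample records are assumptions.

```python
import pandas as pd

def categories_from_types(records):
    """Build an (id, category) table from parsed types records."""
    rows = []
    for record in records:
        entry = record.get("types")
        if not entry or "id1" not in entry:
            continue  # skip metadata or records without an object ID
        labels = entry.get("cset3", {})
        if labels:
            # Keep the label with the highest score
            category = max(labels, key=labels.get)
            rows.append({"id": entry["id1"], "category": category})
    return pd.DataFrame(rows, columns=["id", "category"])

# Made-up records standing in for the parsed YAML contents
records = [
    {"meta": "illustrative header"},
    {"types": {"id1": 0, "cset3": {"Vehicle": 1.0}}},
    {"types": {"id1": 1, "cset3": {"Person": 1.0}}},
]
type_df_sketch = categories_from_types(records)
print(type_df_sketch["category"].tolist())  # ['Vehicle', 'Person']
```

Unlike the forward-fill version, this picks the highest-scoring label when a record lists more than one, so the two approaches only agree when each object has a single label.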

Parsing the annotated object bounding boxes and incorporating them into a single dataframe

In [4]:
# using annotations:
print("Loading annotations...")
def add_category_type(row):
  """Look up the category name for the object ID in this row."""
  object_id = row['object_id']
  return type_df.loc[type_df['id'] == object_id, 'category'].iloc[0]


if os.path.exists(saved_csv):
  # Reuse the cached, preprocessed annotations
  df_bbox = pd.read_csv(saved_csv)
else:
  with open(ann_geom_file) as yaml_file:
      yaml_contents = load(yaml_file, Loader=SafeLoader)
  yaml_df = json_normalize(yaml_contents)

  # Keep object ID, frame number, timestamp and bounding box, then rename for clarity
  df_bbox = yaml_df[['geom.id1','geom.ts0','geom.ts1','geom.g0']].dropna().reset_index(drop=True)
  df_bbox = df_bbox.rename(columns={'geom.id1': 'object_id', 'geom.ts0': 'frame_id','geom.ts1': 'time_sec', 'geom.g0': 'bbox'})
  # Split the whitespace-separated bbox string into four coordinate columns
  df_bbox['bbox'] = df_bbox['bbox'].str.split()
  df_tmp = pd.DataFrame(df_bbox['bbox'].to_list(), columns=['bb_left', 'bb_top', 'bb_right', 'bb_bottom'])
  df_bbox = pd.concat([df_bbox, df_tmp], axis=1).drop(columns=['bbox'])

  # Attach the category name and cache the result for faster loading next time
  df_bbox['category'] = df_bbox.apply(add_category_type, axis=1)
  df_bbox.to_csv(saved_csv, index=False)


df_bbox.head()

Loading annotations...


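| | object_id | frame_id | time_sec | bb_left | bb_top | bb_right | bb_bottom | category |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 485 | 743 | 653 | 914 | Vehicle |
| 1 | 0.0 | 1.0 | 0.033333 | 489 | 748 | 657 | 919 | Vehicle |
| 2 | 0.0 | 2.0 | 0.066667 | 488 | 747 | 656 | 918 | Vehicle |
| 3 | 0.0 | 3.0 | 0.1 | 488 | 747 | 656 | 918 | Vehicle |
| 4 | 0.0 | 4.0 | 0.133333 | 488 | 747 | 656 | 918 | Vehicle |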
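
The category lookup above runs once per row through `apply`. A vectorised lookup with `Series.map` is a possible alternative, sketched here on made-up data. The helper name `add_categories` is hypothetical, and its behaviour differs in one respect: an object ID with no matching type becomes NaN instead of raising an error.

```python
import pandas as pd

def add_categories(bbox_df, types_df):
    """Attach a category column by mapping object_id through an id -> category lookup."""
    lookup = types_df.set_index("id")["category"]
    out = bbox_df.copy()
    out["category"] = out["object_id"].map(lookup)
    return out

# Illustrative stand-ins for type_df and df_bbox
types_sample = pd.DataFrame({"id": [0, 1], "category": ["Vehicle", "Person"]})
bbox_sample = pd.DataFrame({"object_id": [0, 0, 1], "frame_id": [0, 1, 0]})
print(add_categories(bbox_sample, types_sample)["category"].tolist())
# ['Vehicle', 'Vehicle', 'Person']
```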


Annotation of videos:

In [5]:
video_dest_path

'./processed/VIRAT_S_050000_07_001014_001126/'

In [6]:
from utils.video_utils import VideoUtils
from utils.folder_utils import FolderUtils
# vUtils = VideoUtils(categoriesDict) 
# Map each category name to an integer index
types_lst = df_bbox['category'].unique()
types_dict = {type_name: i for i, type_name in enumerate(types_lst)}

vUtils = VideoUtils() 
start_time = 0
# Annotate one second of video, starting at start_time
gt_video_out, bbox_gt = vUtils.AnnotateVideo(
    video_dest_path,
    video_src_path + video_name_orig,
    video_dest_path + video_name_new,
    df_bbox,
    start_time_sec=start_time,
    duration_sec=1,
    save_images=False,
)

df = pd.DataFrame(bbox_gt)
df.head(10)

Total frames in video: 3351 @ 30 frames/sec
3351 0 30
Created frame id  0, 0.00 sec in video; completed:  0.0 %
Created frame id 25, 0.83 sec in video; completed:  83.3 %
Done: Created video: ./processed/VIRAT_S_050000_07_001014_001126/ann_yml_VIRAT_S_050000_07_001014_001126.mp4


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,21,22,23,24,25,26,27,28,29,30
0,"[485, 743, 653, 914]","[489, 748, 657, 919]","[488, 747, 656, 918]","[488, 747, 656, 918]","[488, 747, 656, 918]","[488, 746, 656, 917]","[488, 746, 656, 917]","[488, 746, 656, 917]","[488, 745, 656, 916]","[488, 745, 656, 916]",...,"[486, 742, 654, 913]","[486, 742, 654, 913]","[486, 742, 654, 913]","[486, 742, 654, 913]","[486, 742, 654, 913]","[485, 742, 653, 913]","[485, 742, 653, 913]","[485, 742, 653, 913]","[485, 742, 653, 913]","[485, 742, 653, 913]"
1,"[193, 168, 410, 432]","[193, 168, 410, 432]","[193, 168, 410, 432]","[193, 168, 410, 432]","[193, 168, 410, 432]","[193, 168, 410, 432]","[193, 168, 410, 432]","[193, 168, 410, 432]","[193, 168, 410, 432]","[193, 168, 410, 432]",...,"[193, 168, 410, 432]","[193, 168, 410, 432]","[193, 168, 410, 432]","[193, 168, 410, 432]","[194, 168, 411, 432]","[194, 168, 411, 432]","[194, 168, 411, 432]","[194, 168, 411, 432]","[194, 168, 411, 432]","[194, 168, 411, 432]"
2,"[1403, 1020, 1661, 1080]","[1397, 1018, 1655, 1079]","[1392, 1016, 1650, 1079]","[1387, 1015, 1645, 1079]","[1382, 1013, 1639, 1079]","[1378, 1011, 1633, 1079]","[1373, 1009, 1628, 1079]","[1369, 1007, 1622, 1079]","[1364, 1005, 1617, 1079]","[1360, 1003, 1611, 1079]",...,"[1316, 975, 1548, 1078]","[1313, 973, 1543, 1078]","[1310, 970, 1537, 1077]","[1306, 968, 1532, 1077]","[1303, 965, 1527, 1077]","[1300, 963, 1522, 1077]","[1296, 960, 1517, 1077]","[1293, 958, 1511, 1077]","[1290, 955, 1506, 1077]","[1286, 953, 1501, 1077]"
3,"[672, 3, 813, 60]","[671, 3, 812, 60]","[671, 3, 812, 60]","[671, 3, 812, 60]","[671, 3, 812, 60]","[671, 3, 812, 60]","[671, 3, 812, 60]","[671, 3, 812, 60]","[671, 3, 812, 60]","[671, 3, 812, 60]",...,"[671, 3, 812, 60]","[671, 3, 812, 60]","[671, 3, 812, 60]","[671, 3, 812, 60]","[671, 3, 812, 60]","[671, 3, 812, 60]","[671, 3, 812, 60]","[671, 3, 812, 60]","[671, 3, 812, 60]","[671, 3, 812, 60]"
4,"[737, 42, 873, 141]","[737, 41, 873, 140]","[737, 41, 873, 140]","[737, 41, 873, 140]","[737, 41, 873, 140]","[737, 41, 873, 140]","[737, 41, 873, 140]","[737, 41, 873, 140]","[737, 41, 873, 140]","[737, 41, 873, 140]",...,"[737, 39, 873, 138]","[737, 39, 873, 138]","[737, 39, 873, 138]","[737, 39, 873, 138]","[737, 39, 873, 138]","[738, 39, 874, 138]","[738, 39, 874, 138]","[738, 39, 874, 138]","[738, 39, 874, 138]","[738, 39, 874, 138]"
5,"[1298, 769, 1421, 848]","[1298, 769, 1420, 848]","[1298, 769, 1420, 848]","[1298, 769, 1420, 848]","[1298, 769, 1420, 848]","[1298, 769, 1420, 848]","[1298, 769, 1420, 848]","[1298, 769, 1420, 848]","[1298, 769, 1420, 848]","[1298, 769, 1420, 848]",...,"[1298, 769, 1420, 848]","[1298, 769, 1420, 848]","[1298, 769, 1420, 848]","[1298, 769, 1420, 848]","[1298, 769, 1420, 848]","[1298, 769, 1420, 848]","[1298, 769, 1420, 848]","[1298, 769, 1420, 848]","[1298, 769, 1420, 848]","[1298, 769, 1420, 848]"
6,"[837, 168, 949, 239]","[837, 168, 949, 239]","[837, 168, 949, 239]","[837, 168, 949, 239]","[837, 168, 949, 239]","[837, 168, 949, 239]","[837, 168, 949, 239]","[837, 168, 949, 239]","[837, 168, 949, 239]","[837, 168, 949, 239]",...,"[837, 168, 949, 239]","[837, 168, 949, 239]","[837, 168, 949, 239]","[837, 168, 949, 239]","[837, 168, 949, 239]","[837, 168, 949, 239]","[837, 168, 949, 239]","[837, 168, 949, 239]","[837, 168, 949, 239]","[837, 168, 949, 239]"
7,"[1085, 285, 1134, 371]","[1085, 284, 1134, 370]","[1085, 283, 1134, 369]","[1085, 282, 1134, 368]","[1085, 282, 1134, 368]","[1085, 281, 1134, 367]","[1085, 280, 1134, 366]","[1085, 280, 1134, 366]","[1085, 279, 1134, 365]","[1085, 278, 1134, 364]",...,"[1082, 270, 1131, 358]","[1081, 269, 1130, 357]","[1081, 268, 1130, 356]","[1080, 267, 1129, 356]","[1080, 267, 1129, 355]","[1080, 266, 1129, 355]","[1079, 265, 1128, 354]","[1079, 264, 1128, 354]","[1079, 264, 1128, 353]","[1078, 263, 1127, 353]"
8,"[607, 505, 698, 628]","[607, 505, 697, 627]","[607, 505, 697, 627]","[607, 505, 697, 627]","[607, 505, 697, 627]","[607, 505, 697, 627]","[607, 505, 697, 627]","[607, 505, 697, 627]","[607, 505, 697, 627]","[607, 505, 697, 627]",...,"[607, 505, 697, 627]","[607, 505, 697, 627]","[607, 505, 697, 627]","[607, 505, 697, 627]","[607, 505, 697, 627]","[607, 505, 697, 627]","[607, 505, 697, 627]","[607, 505, 697, 627]","[607, 505, 697, 627]","[607, 505, 697, 627]"
9,"[1798, 824, 1842, 895]","[1793, 824, 1841, 896]","[1789, 825, 1840, 897]","[1788, 824, 1839, 897]","[1787, 824, 1838, 898]","[1786, 824, 1837, 899]","[1785, 824, 1836, 900]","[1784, 824, 1835, 901]","[1783, 824, 1834, 902]","[1782, 824, 1833, 903]",...,"[1772, 824, 1823, 913]","[1771, 825, 1822, 914]","[1770, 825, 1821, 915]","[1769, 826, 1820, 916]","[1768, 826, 1819, 917]","[1768, 827, 1819, 918]","[1767, 827, 1818, 918]","[1766, 828, 1817, 919]","[1765, 828, 1816, 920]","[1764, 829, 1815, 921]"
