# Section 1
You would have received a Access_data.zip file with 22 datasets in .csv format and 1 data dictionary in .docx format. The datasets are sample of the mock-data on individuals accessing two office sites of Company ABC, consisting of:

1. When: Time of entry by the individual

2. Profile: Type of access card
    -	0 - Staff Pass
    -	1 - Temp Pass
    -	2 - Visitor Pass

3. Dept: Department of the individual

4. CardNum: Card unique identifier. The length of the card number cannot be less than 8 characters. Currently, if CardNum starts with a/multiple ‘0’, the data captured in system will exclude/remove the “0”.

You can assume that the total number of staff in the company is 2000 and the data is extracted from the company’s building access system. An individual can tap in and out several times within the same day. When the individual first clock in, that would be the earliest time slot and the only record you will base off the analysis. (You can also state your other assumptions if need be.)


In [1]:
## Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels as sm
import os

## Question 1.1
Write code preferably in R or Python to process and organise the raw data (Access_data.zip) to make it suitable for analysis. Identify and resolve the data quality issues in the raw data, if any. (The created code should allow user to efficiently and easily run it to ingest additional datasets of different period, beyond the given sample.)

We assume the files are represented in the format of 'SiteAYYYYMMDD-YYYYMMDDa.csv' or 'SiteBYYYYMMDD-YYYYMMDDb.csv' for both sites respectively.

In [2]:
# Utility function for listing particular required site files in a specified data directory path
def list_files_for_a_site(site_name):
    data_dir_path=os.path.join(os.getcwd(), "data", "Access_Data")
    return [os.path.join(data_dir_path, file) for file in os.listdir(data_dir_path) if file.startswith(site_name)]

In [3]:
# Construct list of site A and site B files.
site_A_file_list =  list_files_for_a_site(site_name="SiteA")
site_B_file_list =  list_files_for_a_site(site_name="SiteB")

print(site_A_file_list)
print(site_B_file_list)

['c:\\Users\\quekz\\OneDrive\\Desktop\\MOM_Senior-Analyst-Analyst_SensingAnalytics_Assessment\\data\\Access_Data\\SiteA20200420-20200426a.csv', 'c:\\Users\\quekz\\OneDrive\\Desktop\\MOM_Senior-Analyst-Analyst_SensingAnalytics_Assessment\\data\\Access_Data\\SiteA20200427-20200503a.csv', 'c:\\Users\\quekz\\OneDrive\\Desktop\\MOM_Senior-Analyst-Analyst_SensingAnalytics_Assessment\\data\\Access_Data\\SiteA20200504-20200510a.csv', 'c:\\Users\\quekz\\OneDrive\\Desktop\\MOM_Senior-Analyst-Analyst_SensingAnalytics_Assessment\\data\\Access_Data\\SiteA20200511-20200517a.csv', 'c:\\Users\\quekz\\OneDrive\\Desktop\\MOM_Senior-Analyst-Analyst_SensingAnalytics_Assessment\\data\\Access_Data\\SiteA20200518-20200524a.csv', 'c:\\Users\\quekz\\OneDrive\\Desktop\\MOM_Senior-Analyst-Analyst_SensingAnalytics_Assessment\\data\\Access_Data\\SiteA20200525-20200531a.csv', 'c:\\Users\\quekz\\OneDrive\\Desktop\\MOM_Senior-Analyst-Analyst_SensingAnalytics_Assessment\\data\\Access_Data\\SiteA20200601a.csv', 'c:\\Us

In [4]:
def column_checker(df, filename):
    expected_col = set(["When", "Profile", "Dept", "CardNum"])
    symmetric_diff_set = set(df.columns).symmetric_difference(expected_col)
    if symmetric_diff_set :
        print(f"Identified a non-expected column for {filename}")
        print(f"Symmetric difference: {symmetric_diff_set}")
    return None

### Process site A and site B

Do a quick check on column name and found that one of site A data file has column named "Depts" instead of "Dept" while for site B, there is a column named "CardNum " instead of "CardNum" for same period of 20200622-20200628. To resolve this issue, we will strip all leading/trailing space and extract the first 4 alphanumeric representation for convenience

In [5]:
for file in site_A_file_list:
    temp_df = pd.read_csv(file, sep=",")
    column_checker(temp_df, file)

print()
for file in site_B_file_list:
    temp_df = pd.read_csv(file, sep=",")
    column_checker(temp_df, file)

Identified a non-expected column for c:\Users\quekz\OneDrive\Desktop\MOM_Senior-Analyst-Analyst_SensingAnalytics_Assessment\data\Access_Data\SiteA20200622-20200628a.csv
Symmetric difference: {'Depts', 'Dept'}

Identified a non-expected column for c:\Users\quekz\OneDrive\Desktop\MOM_Senior-Analyst-Analyst_SensingAnalytics_Assessment\data\Access_Data\SiteB20200622-20200628b.csv
Symmetric difference: {'CardNum ', 'CardNum'}


Append the loaded dataframe into a list for vertical stacking. We assume the time period for both sites will be Apr 20 2020 to Jun 28 2020 based on file name date representation.

Notice that there are quite a significant number of nulls for Department feature in Site A (9266) and Site B (10100); and 5 null card information for site B.

In [6]:
site_A_df_list = []
for file in site_A_file_list:
    temp_df = pd.read_csv(file, sep=",")
    temp_df.columns = [col.strip()[:4] for col in temp_df.columns]
    site_A_df_list.append(temp_df)

site_B_df_list = []
for file in site_B_file_list:
    temp_df = pd.read_csv(file, sep=",")
    temp_df.columns = [col.strip()[:4] for col in temp_df.columns]
    site_B_df_list.append(temp_df)


site_A_df = pd.concat(site_A_df_list, ignore_index=True)
site_B_df = pd.concat(site_B_df_list, ignore_index=True)

# Split When into Date and TIme
site_A_df[['Time', 'Date']] = site_A_df['When'].str.split(' ', expand=True)
site_B_df[['Time', 'Date']] = site_B_df['When'].str.split(' ', expand=True)

print(site_A_df.info())
print(site_B_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12192 entries, 0 to 12191
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   When    12192 non-null  object
 1   Prof    12192 non-null  object
 2   Dept    2925 non-null   object
 3   Card    12192 non-null  object
 4   Time    12192 non-null  object
 5   Date    12192 non-null  object
dtypes: object(6)
memory usage: 571.6+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24499 entries, 0 to 24498
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   When    24499 non-null  object
 1   Prof    24499 non-null  object
 2   Dept    14399 non-null  object
 3   Card    24494 non-null  object
 4   Time    24499 non-null  object
 5   Date    24499 non-null  object
dtypes: object(6)
memory usage: 1.1+ MB
None


In [7]:
# Convert card type to string in case of int and string mix representation
site_A_df["Card"] = site_A_df["Card"].astype(str)

Check Department and Profile uniqueness for both sides as they are categorical. Notice that for Site A, we see that there is some form of discrepancy for Dept 1, as well as the Profile value represented in numeric or string format.

To simplify department representation, we will remove all spaces and concatenate the alphanumeric representation, while for profile representation, we will do a string cast and standardise to meaningful name representation

In [8]:
print(site_A_df["Dept"].unique())
print(site_A_df["Prof"].unique())

[nan 'Dept 5' 'Dept 11' 'Dept 18' 'Dept 4' 'Dept 9' 'Dept 15' 'Dept 14'
 'Dept 2' 'Dept 8' 'Dept 12' 'Dept  1' 'Dept 1' 'Dept 19' 'Dept 7'
 'Dept 17' 'Dept 10' 'Dept 6' 'Dept 3' 'Dept 16' 'Dept 13']
[2 1 0 '1' '0' '2' 'Visitor Pass']


In [9]:
profile_pass_mapping = {
    "0": "Staff Pass",
    "1": "Temp Pass",
    "2": "Visitor Pass"
}

site_A_df["Dept"] = site_A_df["Dept"].fillna("unknown")
site_A_df["Dept"] = site_A_df["Dept"].map(lambda x: x.replace(" ","") if x else x)

site_A_df["Prof"] = site_A_df["Prof"].map(lambda x: str(x))
site_A_df["Prof"] = site_A_df["Prof"].map(lambda x: profile_pass_mapping[x] if x in profile_pass_mapping else x)

In [10]:
print(site_A_df["Dept"].unique())
print(site_A_df["Prof"].unique())

['unknown' 'Dept5' 'Dept11' 'Dept18' 'Dept4' 'Dept9' 'Dept15' 'Dept14'
 'Dept2' 'Dept8' 'Dept12' 'Dept1' 'Dept19' 'Dept7' 'Dept17' 'Dept10'
 'Dept6' 'Dept3' 'Dept16' 'Dept13']
['Visitor Pass' 'Temp Pass' 'Staff Pass']


Handle missing data for Site A involving department feature. Identify the pass type (profile) which the department information is unknown. We need to check from the perspective of each pass type(profile) and see the applicability of department before deciding how to impute.

Checking from this 2 perspective, we conclude that the department information is not applicable for both temp/visitor pass. We can fill as not applicable representation.

In [11]:
# Identify the pass type (profile) which the department information is unknown, as we need to impute
site_A_df[site_A_df["Dept"]=="unknown"]["Prof"].unique()

array(['Visitor Pass', 'Temp Pass'], dtype=object)

In [12]:
# For each pass type (profile), see its applicability to department information
for profile in site_A_df["Prof"].unique():
    unique_dept = site_A_df[site_A_df["Prof"]==profile]["Dept"].unique()
    print(f"{profile}:{unique_dept}")

Visitor Pass:['unknown']
Temp Pass:['unknown']
Staff Pass:['Dept5' 'Dept11' 'Dept18' 'Dept4' 'Dept9' 'Dept15' 'Dept14' 'Dept2'
 'Dept8' 'Dept12' 'Dept1' 'Dept19' 'Dept7' 'Dept17' 'Dept10' 'Dept6'
 'Dept3' 'Dept16' 'Dept13']


Similarity check for site B

In [14]:
# Convert card type to string in case of int and string mix representation
site_B_df["Card"] = site_B_df["Card"].astype(str)

In [15]:
print(site_B_df["Dept"].unique())
print(site_B_df["Prof"].unique())

['Dept 4' nan 'Dept 2' 'Dept 16' 'Dept 10' 'Dept 5' 'Dept 14' 'Dept 3'
 'Dept 15' 'Dept 8' 'Dept 13' 'Dept 19' 'Dept 11' 'Dept 9' 'Dept 18'
 'Dept 7' 'Dept 17' 'Dept 1' 'Dept 12' 'Dept 6']
[0 1 2 '0' '1' '2' 'Temp Pass' 'Staff Pass']


In [16]:
site_B_df["Dept"] = site_B_df["Dept"].fillna("unknown")
site_B_df["Dept"] = site_B_df["Dept"].map(lambda x: x.replace(" ","") if x else x)

site_B_df["Prof"] = site_B_df["Prof"].map(lambda x: str(x))
site_B_df["Prof"] = site_B_df["Prof"].map(lambda x: profile_pass_mapping[x] if x in profile_pass_mapping else x)

In [17]:
print(site_B_df["Dept"].unique())
print(site_B_df["Prof"].unique())

['Dept4' 'unknown' 'Dept2' 'Dept16' 'Dept10' 'Dept5' 'Dept14' 'Dept3'
 'Dept15' 'Dept8' 'Dept13' 'Dept19' 'Dept11' 'Dept9' 'Dept18' 'Dept7'
 'Dept17' 'Dept1' 'Dept12' 'Dept6']
['Staff Pass' 'Temp Pass' 'Visitor Pass']


Handle missing data for Site B involving Department feature.

For Department, identify the profile which department info is unknown. We also need to check from the perspective of each pass type(profile) and see the uniqueness of department to decide how to impute.

Checking from this 2 perspective, we conclude that the department information is not applicable for both temp/visitor pass. We can fill as not applicable representation.

In [18]:
site_B_df[site_B_df["Dept"]=="unknown"]["Prof"].unique()

array(['Temp Pass', 'Visitor Pass'], dtype=object)

In [20]:
for profile in site_B_df["Prof"].unique():
    unique_dept = site_B_df[site_B_df["Prof"]==profile]["Dept"].unique()
    print(f"{profile}:{unique_dept}")

Staff Pass:['Dept4' 'Dept2' 'Dept16' 'Dept10' 'Dept5' 'Dept14' 'Dept3' 'Dept15'
 'Dept8' 'Dept13' 'Dept19' 'Dept11' 'Dept9' 'Dept18' 'Dept7' 'Dept17'
 'Dept1' 'Dept12' 'Dept6']
Temp Pass:['unknown']
Visitor Pass:['unknown']


In both sites, we are sure that temp and visitor pass do not have corresponding department info. As such we will replace it with a vale as not_applicable.

In [21]:
site_A_df["Dept"] = site_A_df["Dept"].map(lambda x:x.replace("unknown","not_applicable" if x=="unknown" else x))
site_B_df["Dept"] = site_B_df["Dept"].map(lambda x:x.replace("unknown","not_applicable" if x=="unknown" else x))

We still have null cases where card information is unknown. A quick check shows the card affected are either temp pass/visitor pass. We may want to combine the dataframe from 2 sites to resolve the card information issue as no additional information are provided.

In [None]:
site_B_df[site_B_df["Card"].isna()]

Unnamed: 0,When,Prof,Dept,Card,Time,Date
2429,26/4/2020 7:31,Temp Pass,unknown,,26/4/2020,7:31
2430,24/4/2020 7:34,Temp Pass,unknown,,24/4/2020,7:34
2431,22/4/2020 7:27,Temp Pass,unknown,,22/4/2020,7:27
4715,3/5/2020 8:32,Visitor Pass,unknown,,3/5/2020,8:32
6985,8/5/2020 9:42,Visitor Pass,unknown,,8/5/2020,9:42


In [None]:
# Create a new identifier row for both dataframe
site_A_df["Site"] = "A"
site_B_df["Site"] = "B"
