# Section 1
The data folder contains an 'Access_data' subfolder with 22 datasets in .csv format and 1 data dictionary in .docx format. The datasets are a sample of mock data on individuals accessing two office sites of Company ABC. Each record consists of:

1. When: Time of entry by the individual

2. Profile: Type of access card
    -	0 - Staff Pass
    -	1 - Temp Pass
    -	2 - Visitor Pass

3. Dept: Department of the individual

4. CardNum: Unique card identifier. A card number cannot be shorter than 8 characters. Currently, if a CardNum starts with one or more '0' characters, the system drops those leading zeros when it captures the data.

You can assume that the **total number of staff in the company is 2000** and that the data is extracted from the company's building access system. An individual can tap in and out several times within the same day. The first clock-in of the day is the earliest time slot, and it is the only record the analysis should be based on. (You can also state other assumptions if need be.)


In [1]:
## Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels as sm
import os

## Question 1.1
Write code preferably in R or Python to **process and organise the raw data** to make it **suitable for analysis**. Identify and resolve the data quality issues in the raw data, if any. (The created code should allow user to efficiently and easily run it to ingest additional datasets of different period, beyond the given sample.)

Assumption: Data files are represented in the format of 'SiteAYYYYMMDD-YYYYMMDDa.csv' or 'SiteBYYYYMMDD-YYYYMMDDb.csv' which represents sites A and B respectively.

In [2]:
# Utility function for listing particular required site files in a specified data directory path
def list_files_for_a_site(site_name):
    data_dir_path=os.path.join(os.getcwd(), "data", "Access_Data")
    return [os.path.join(data_dir_path, file) for file in os.listdir(data_dir_path) if file.startswith(site_name)]
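
As a complement to the prefix filter above, the naming assumption can also be checked explicitly. The following is a minimal sketch, not part of the original notebook: `FILENAME_PATTERN` and `parse_access_filename` are hypothetical names, and the pattern treats the end date as optional because one of the sample files (`SiteA20200601a.csv`) covers a single day.

```python
import re
from datetime import datetime

# Hypothetical pattern for the assumed naming scheme. The end date is optional
# because one sample file (SiteA20200601a.csv) covers a single day.
FILENAME_PATTERN = re.compile(
    r"^Site(?P<site>[AB])(?P<start>\d{8})(?:-(?P<end>\d{8}))?[ab]\.csv$"
)

def parse_access_filename(filename):
    """Return (site, start_date, end_date), or None if the name does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    start = datetime.strptime(match["start"], "%Y%m%d").date()
    # Fall back to the start date when the file name carries no end date.
    end = datetime.strptime(match["end"] or match["start"], "%Y%m%d").date()
    return match["site"], start, end

print(parse_access_filename("SiteA20200420-20200426a.csv"))
print(parse_access_filename("SiteA20200601a.csv"))
print(parse_access_filename("data_dictionary.docx"))
```

Files that return `None` (for example the data dictionary) can then be skipped or reported before any CSV is read.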

In [3]:
# Construct list of site A and site B files.
site_A_file_list =  list_files_for_a_site(site_name="SiteA")
site_B_file_list =  list_files_for_a_site(site_name="SiteB")

print(site_A_file_list)
print(site_B_file_list)

['c:\\Users\\quekz\\OneDrive\\Desktop\\MOM_Senior-Analyst-Analyst_SensingAnalytics_Assessment\\data\\Access_Data\\SiteA20200420-20200426a.csv', 'c:\\Users\\quekz\\OneDrive\\Desktop\\MOM_Senior-Analyst-Analyst_SensingAnalytics_Assessment\\data\\Access_Data\\SiteA20200427-20200503a.csv', 'c:\\Users\\quekz\\OneDrive\\Desktop\\MOM_Senior-Analyst-Analyst_SensingAnalytics_Assessment\\data\\Access_Data\\SiteA20200504-20200510a.csv', 'c:\\Users\\quekz\\OneDrive\\Desktop\\MOM_Senior-Analyst-Analyst_SensingAnalytics_Assessment\\data\\Access_Data\\SiteA20200511-20200517a.csv', 'c:\\Users\\quekz\\OneDrive\\Desktop\\MOM_Senior-Analyst-Analyst_SensingAnalytics_Assessment\\data\\Access_Data\\SiteA20200518-20200524a.csv', 'c:\\Users\\quekz\\OneDrive\\Desktop\\MOM_Senior-Analyst-Analyst_SensingAnalytics_Assessment\\data\\Access_Data\\SiteA20200525-20200531a.csv', 'c:\\Users\\quekz\\OneDrive\\Desktop\\MOM_Senior-Analyst-Analyst_SensingAnalytics_Assessment\\data\\Access_Data\\SiteA20200601a.csv', 'c:\\Us

In [4]:
def column_checker(df, filename):
    expected_col = set(["When", "Profile", "Dept", "CardNum"])
    symmetric_diff_set = set(df.columns).symmetric_difference(expected_col)
    if symmetric_diff_set :
        print(f"Identified a non-expected column for {filename}")
        print(f"Symmetric difference: {symmetric_diff_set}")
    return None

### Column naming check

A quick check across the data columns shows that, for the period 20200622-20200628, the Site A file has a column named "Depts" instead of the expected "Dept", and the Site B file has a column named "CardNum " (with a trailing space) instead of the expected "CardNum". To resolve these formatting issues, we strip leading and trailing whitespace from every column name and keep only its first 4 characters ("When", "Prof", "Dept", "Card") for convenience and standardisation.

In [5]:
for file in site_A_file_list:
    temp_df = pd.read_csv(file, sep=",")
    column_checker(temp_df, file)

print()
for file in site_B_file_list:
    temp_df = pd.read_csv(file, sep=",")
    column_checker(temp_df, file)

Identified a non-expected column for c:\Users\quekz\OneDrive\Desktop\MOM_Senior-Analyst-Analyst_SensingAnalytics_Assessment\data\Access_Data\SiteA20200622-20200628a.csv
Symmetric difference: {'Dept', 'Depts'}

Identified a non-expected column for c:\Users\quekz\OneDrive\Desktop\MOM_Senior-Analyst-Analyst_SensingAnalytics_Assessment\data\Access_Data\SiteB20200622-20200628b.csv
Symmetric difference: {'CardNum', 'CardNum '}
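
The two header problems reported above could also be handled with an explicit alias table instead of truncating names. This is only a sketch of that alternative: `COLUMN_ALIASES` and `normalise_columns` are hypothetical, and unlike the rest of this notebook it keeps the full column names rather than the four-character ones.

```python
import pandas as pd

EXPECTED_COLUMNS = ["When", "Profile", "Dept", "CardNum"]

# Hypothetical alias table: stripped header -> canonical name.
COLUMN_ALIASES = {"Depts": "Dept"}

def normalise_columns(df):
    """Strip whitespace from headers, apply known aliases, and fail on leftovers."""
    cleaned = [COLUMN_ALIASES.get(col.strip(), col.strip()) for col in df.columns]
    unexpected = set(cleaned).symmetric_difference(EXPECTED_COLUMNS)
    if unexpected:
        raise ValueError(f"Unexpected or missing columns: {sorted(unexpected)}")
    out = df.copy()
    out.columns = cleaned
    return out

# Empty frame that reproduces both header problems found in the sample files.
sample = pd.DataFrame(columns=["When", "Profile", "Depts", "CardNum "])
print(list(normalise_columns(sample).columns))
```

Raising on an unknown header is a design choice: a new file with an unseen column name stops the run instead of being silently renamed.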


### Data consolidation by sites

Append the loaded dataframe into a list for vertical stacking. Based on file name date representation, we have data representing the period from Apr 20, 2020 to Jun 28, 2020.

Notice the significant number of nulls in the 'Dept' feature for Site A (9267) and Site B (10100), and the 5 null 'CardNum' values for Site B. These counts exclude values that may be malformed, since every column except 'When' is categorical in nature.

In [6]:
site_A_df_list = []
for file in site_A_file_list:
    temp_df = pd.read_csv(file, sep=",")
    temp_df.columns = [col.strip()[:4] for col in temp_df.columns]
    site_A_df_list.append(temp_df)

site_B_df_list = []
for file in site_B_file_list:
    temp_df = pd.read_csv(file, sep=",")
    temp_df.columns = [col.strip()[:4] for col in temp_df.columns]
    site_B_df_list.append(temp_df)


site_A_df = pd.concat(site_A_df_list, ignore_index=True)
site_B_df = pd.concat(site_B_df_list, ignore_index=True)

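# Split When on the space. The first part is the date, so the two labels below are swapped;
# both columns are overwritten with properly parsed values in later cells.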
site_A_df[['Time', 'Date']] = site_A_df['When'].str.split(' ', expand=True)
site_B_df[['Time', 'Date']] = site_B_df['When'].str.split(' ', expand=True)

print(site_A_df.info())
print(site_B_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12192 entries, 0 to 12191
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   When    12192 non-null  object
 1   Prof    12192 non-null  object
 2   Dept    2925 non-null   object
 3   Card    12192 non-null  object
 4   Time    12192 non-null  object
 5   Date    12192 non-null  object
dtypes: object(6)
memory usage: 571.6+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24499 entries, 0 to 24498
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   When    24499 non-null  object
 1   Prof    24499 non-null  object
 2   Dept    14399 non-null  object
 3   Card    24494 non-null  object
 4   Time    24499 non-null  object
 5   Date    24499 non-null  object
dtypes: object(6)
memory usage: 1.1+ MB
None
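
Because the loading loop is repeated per site, it could be wrapped in one function so that files from a new period are ingested the same way. The sketch below assumes that approach; `load_access_files` and the `SourceFile` column are hypothetical additions, and the two small CSV files are made up purely to exercise the header clean-up.

```python
import os
import tempfile

import pandas as pd

def load_access_files(paths, site):
    """Read each CSV, shorten headers to four characters, tag the site, and stack."""
    frames = []
    for path in paths:
        df = pd.read_csv(path, sep=",")
        df.columns = [col.strip()[:4] for col in df.columns]
        df["Site"] = site
        df["SourceFile"] = os.path.basename(path)  # hypothetical provenance column
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

# Made-up files that reproduce the two header problems found earlier.
with tempfile.TemporaryDirectory() as tmp_dir:
    path_1 = os.path.join(tmp_dir, "week1.csv")
    path_2 = os.path.join(tmp_dir, "week2.csv")
    with open(path_1, "w") as handle:
        handle.write("When,Profile,Depts,CardNum\n")
        handle.write("20/4/2020 7:17,2,,1001\n")
        handle.write("20/4/2020 8:02,0,Dept 11,885201\n")
    with open(path_2, "w") as handle:
        handle.write("When,Profile,Dept,CardNum \n")
        handle.write("21/4/2020 7:10,0,Dept 5,942530\n")
    demo_df = load_access_files([path_1, path_2], site="A")

print(demo_df)
```

Keeping the source file name on every row makes it easier to trace a bad record back to the extract it came from.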


In [7]:
# Convert card type to string in case of int and string mix representation
site_A_df["Card"] = site_A_df["Card"].astype(str)
site_B_df["Card"] = site_B_df["Card"].astype(str)

### Checking 'Department' and 'Profile' features uniqueness 

These features are expected to be categorical, judging by their names. For Site A, 'Dept 1' appears in two spellings ('Dept  1' with a double space and 'Dept 1'), and the Profile values mix integers, numeric strings, and a text label.

To simplify the 'Department' feature, we remove all spaces so that every value has the form 'DeptXX'. For 'Profile', we cast the values to string and map the codes to the meaningful names 'Staff Pass', 'Temp Pass', or 'Visitor Pass'.

Site A

In [8]:
print(site_A_df["Dept"].unique())
print(site_A_df["Prof"].unique())

[nan 'Dept 5' 'Dept 11' 'Dept 18' 'Dept 4' 'Dept 9' 'Dept 15' 'Dept 14'
 'Dept 2' 'Dept 8' 'Dept 12' 'Dept  1' 'Dept 1' 'Dept 19' 'Dept 7'
 'Dept 17' 'Dept 10' 'Dept 6' 'Dept 3' 'Dept 16' 'Dept 13']
[2 1 0 '1' '0' '2' 'Visitor Pass']


In [9]:
profile_pass_mapping = {
    "0": "Staff Pass",
    "1": "Temp Pass",
    "2": "Visitor Pass"
}

site_A_df["Dept"] = site_A_df["Dept"].fillna("unknown")
site_A_df["Dept"] = site_A_df["Dept"].map(lambda x: x.replace(" ","") if x else x)

site_A_df["Prof"] = site_A_df["Prof"].map(lambda x: str(x))
site_A_df["Prof"] = site_A_df["Prof"].map(lambda x: profile_pass_mapping[x] if x in profile_pass_mapping else x)

In [10]:
# Check conversion
print(site_A_df["Dept"].unique())
print(site_A_df["Prof"].unique())

['unknown' 'Dept5' 'Dept11' 'Dept18' 'Dept4' 'Dept9' 'Dept15' 'Dept14'
 'Dept2' 'Dept8' 'Dept12' 'Dept1' 'Dept19' 'Dept7' 'Dept17' 'Dept10'
 'Dept6' 'Dept3' 'Dept16' 'Dept13']
['Visitor Pass' 'Temp Pass' 'Staff Pass']
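
The same clean-up is applied to both sites, so it could be expressed once with vectorised string operations. A minimal sketch, assuming the shortened column names `Dept` and `Prof` used above; `normalise_dept_and_profile` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd

PROFILE_NAMES = {"0": "Staff Pass", "1": "Temp Pass", "2": "Visitor Pass"}

def normalise_dept_and_profile(df):
    """Return a copy with spaces removed from Dept and Prof codes mapped to names."""
    out = df.copy()
    out["Dept"] = out["Dept"].fillna("unknown").str.replace(" ", "", regex=False)
    prof = out["Prof"].astype(str)
    # Map known codes; values that are already names fall through unchanged.
    out["Prof"] = prof.map(PROFILE_NAMES).fillna(prof)
    return out

sample = pd.DataFrame({
    "Dept": [None, "Dept  1", "Dept 19"],
    "Prof": [2, "0", "Visitor Pass"],
})
cleaned = normalise_dept_and_profile(sample)
print(cleaned)
```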


Site B

In [11]:
print(site_B_df["Dept"].unique())
print(site_B_df["Prof"].unique())

['Dept 4' nan 'Dept 2' 'Dept 16' 'Dept 10' 'Dept 5' 'Dept 14' 'Dept 3'
 'Dept 15' 'Dept 8' 'Dept 13' 'Dept 19' 'Dept 11' 'Dept 9' 'Dept 18'
 'Dept 7' 'Dept 17' 'Dept 1' 'Dept 12' 'Dept 6']
[0 1 2 '0' '1' '2' 'Temp Pass' 'Staff Pass']


In [12]:
site_B_df["Dept"] = site_B_df["Dept"].fillna("unknown")
site_B_df["Dept"] = site_B_df["Dept"].map(lambda x: x.replace(" ","") if x else x)

site_B_df["Prof"] = site_B_df["Prof"].map(lambda x: str(x))
site_B_df["Prof"] = site_B_df["Prof"].map(lambda x: profile_pass_mapping[x] if x in profile_pass_mapping else x)

In [13]:
print(site_B_df["Dept"].unique())
print(site_B_df["Prof"].unique())

['Dept4' 'unknown' 'Dept2' 'Dept16' 'Dept10' 'Dept5' 'Dept14' 'Dept3'
 'Dept15' 'Dept8' 'Dept13' 'Dept19' 'Dept11' 'Dept9' 'Dept18' 'Dept7'
 'Dept17' 'Dept1' 'Dept12' 'Dept6']
['Staff Pass' 'Temp Pass' 'Visitor Pass']


### Handle missing (unknown) data for Site A's 'department' feature and identify the affected pass type (profile)

For each card profile (Visitor, Temp, or Staff Pass), we list the Dept values and look for 'unknown'. The results show that department information is missing only for the temp and visitor pass categories, so we conclude that it does not apply to them and can be filled with a 'not applicable' value.

In [14]:
# Site A: For each pass type (profile), see its applicability to department information
for profile in site_A_df["Prof"].unique():
    unique_dept = site_A_df[site_A_df["Prof"]==profile]["Dept"].unique()
    print(f"{profile}:{unique_dept}")

Visitor Pass:['unknown']
Temp Pass:['unknown']
Staff Pass:['Dept5' 'Dept11' 'Dept18' 'Dept4' 'Dept9' 'Dept15' 'Dept14' 'Dept2'
 'Dept8' 'Dept12' 'Dept1' 'Dept19' 'Dept7' 'Dept17' 'Dept10' 'Dept6'
 'Dept3' 'Dept16' 'Dept13']


Similarly for site B

In [15]:
site_B_df[site_B_df["Dept"]=="unknown"]["Prof"].unique()

array(['Temp Pass', 'Visitor Pass'], dtype=object)

In [16]:
for profile in site_B_df["Prof"].unique():
    unique_dept = site_B_df[site_B_df["Prof"]==profile]["Dept"].unique()
    print(f"{profile}:{unique_dept}")

Staff Pass:['Dept4' 'Dept2' 'Dept16' 'Dept10' 'Dept5' 'Dept14' 'Dept3' 'Dept15'
 'Dept8' 'Dept13' 'Dept19' 'Dept11' 'Dept9' 'Dept18' 'Dept7' 'Dept17'
 'Dept1' 'Dept12' 'Dept6']
Temp Pass:['unknown']
Visitor Pass:['unknown']
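
The per-profile listings above can be condensed into a single number per profile: the share of rows whose department is the 'unknown' placeholder. This is a sketch on made-up rows; `missing_dept_share` is a hypothetical helper.

```python
import pandas as pd

def missing_dept_share(df):
    """Share of rows per profile whose Dept is the 'unknown' placeholder."""
    return df["Dept"].eq("unknown").groupby(df["Prof"]).mean()

sample = pd.DataFrame({
    "Prof": ["Staff Pass", "Staff Pass", "Temp Pass", "Visitor Pass"],
    "Dept": ["Dept5", "Dept11", "unknown", "unknown"],
})
share = missing_dept_share(sample)
print(share)
```

A share of 0.0 for staff passes and 1.0 for the other two profiles would match what the listings show for both sites.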


**Conclusion: At both sites, *temp* and *visitor* passes never carry department information, while staff passes always do. We therefore replace 'unknown' with 'not_applicable', on the assumption that non-staff passes simply have no department information.**

In [17]:
site_A_df["Dept"] = site_A_df["Dept"].replace("unknown", "not_applicable")
site_B_df["Dept"] = site_B_df["Dept"].replace("unknown", "not_applicable")

#### Check card representation length.
The card ID should not be shorter than 8 characters. We check whether any valid (numeric) card ID is longer than 8 characters; if none is, we assume that card IDs are exactly 8 characters long. The output below shows that the longest recorded card ID has 8 characters.

In [18]:
site_A_df["Card_ID_Length"]= site_A_df["Card"].map(lambda x: len(x) if x else "Invalid ID")
site_B_df["Card_ID_Length"]= site_B_df["Card"].map(lambda x: len(x) if x else "Invalid ID")

print(site_A_df["Card_ID_Length"].unique())
print(site_B_df["Card_ID_Length"].unique())

[4 5 6 7 8 3]
[4 5 6 7 8 3]


In [19]:
# Check if card ID are numeric.
site_A_df["is_Card_ID_Numeric"]= site_A_df["Card"].map(lambda x: x.isnumeric())
site_B_df["is_Card_ID_Numeric"]= site_B_df["Card"].map(lambda x: x.isnumeric())

print(site_A_df[site_A_df["is_Card_ID_Numeric"]==False].shape)
print(site_B_df[site_B_df["is_Card_ID_Numeric"]==False].shape)

(110, 8)
(761, 8)


In [20]:
site_A_df[site_A_df["is_Card_ID_Numeric"]==False][["Prof","Dept","Card"]].value_counts()

Prof        Dept            Card   
Temp Pass   not_applicable  #REF!      47
Staff Pass  Dept18          #REF!      13
            Dept5           #REF!      10
            Dept15          #REF!       8
            Dept11          #REF!       5
            Dept14          #REF!       5
            Dept2           #REF!       5
            Dept8           #REF!       4
            Dept17          #REF!       3
            Dept15          #VALUE!     2
            Dept3           #REF!       2
            Dept7           #REF!       2
            Dept10          #REF!       1
            Dept19          9 42530     1
            Dept4           #REF!       1
Temp Pass   not_applicable  #VALUE!     1
dtype: int64

In [21]:
site_B_df[site_B_df["is_Card_ID_Numeric"]==False][["Prof","Dept","Card"]].value_counts()

Prof          Dept            Card   
Temp Pass     not_applicable  #REF!      740
Staff Pass    Dept8           #VALUE!     10
              Dept16          #VALUE!      3
Temp Pass     not_applicable  nan          3
                              #VALUE!      2
Visitor Pass  not_applicable  nan          2
Staff Pass    Dept16          40202!       1
dtype: int64
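
The non-numeric values in the two listings above fall into a few groups: Excel error strings, missing values that became the string 'nan' after the cast, and strings with stray characters. A small classifier makes those groups explicit. This is a sketch; `classify_card` and its labels are hypothetical, not part of the original notebook.

```python
EXCEL_ERRORS = {"#REF!", "#VALUE!"}

def classify_card(card):
    """Label a card string as 'valid', 'excel_error', 'missing', or 'malformed'."""
    if card in EXCEL_ERRORS:
        return "excel_error"
    if card in ("", "nan"):
        # astype(str) turns missing values into the literal string 'nan'.
        return "missing"
    if card.isdigit():
        return "valid"
    return "malformed"

for raw in ["942530", "#REF!", "#VALUE!", "nan", "9 42530", "40202!"]:
    print(f"{raw!r:>10} -> {classify_card(raw)}")
```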

The card IDs '9 42530' (Site A) and '40202!' (Site B) look like malformed strings. Since department information is available for both records, we cross-check them against the other card IDs in the same department.
Card '942530' appears in Dept19 at Site A and card '40202' appears in Dept16 at Site B, which supports this assumption, so we update both values to the corrected versions.

In [22]:
site_A_df[site_A_df["Dept"]== "Dept19"]["Card"].unique()

array(['885201', '942530', '9 42530', '34802', '175703', '222002',
       '257102', '811717', '960606', '6801234', '7872732', '193902',
       '954488', '490402', '7870682', '3902', '5201', '7102', '8914575',
       '852098'], dtype=object)

In [23]:
site_B_df[site_B_df["Dept"]== "Dept16"]["Card"].unique()

array(['19002', '38502', '40202', '60902', '67802', '96702', '103003',
       '105687', '115002', '126002', '128203', '159502', '165467',
       '181738', '205202', '207903', '216501', '217504', '219001',
       '236802', '312903', '329203', '353602', '394402', '477203',
       '706803', '712253', '714228', '732480', '740209', '761795',
       '781362', '793632', '800972', '811771', '820702', '820831',
       '827519', '851469', '862960', '863435', '872492', '873052',
       '880956', '882135', '883767', '900102', '908603', '911961',
       '6863216', '7860070', '8253259', '8833245', '8861027', '9761019',
       '23002', '39502', '48402', '105502', '110302', '169702', '211102',
       '236302', '399902', '447302', '704079', '713274', '732653',
       '747351', '760427', '800928', '801699', '810356', '822249',
       '823779', '832576', '840492', '863433', '880904', '954429',
       '8850177', '10302', '78204', '215902', '409402', '412803',
       '842415', '863158', '127268', '134502',

In [24]:
# Correction
site_B_df.loc[site_B_df["Card"]=="40202!",["Card","is_Card_ID_Numeric"]] = "40202",True
site_A_df.loc[site_A_df["Card"]=="9 42530",["Card","is_Card_ID_Numeric"]] = "942530", True

In [25]:
# Left-pad card numbers with zeros to 8 characters, leaving placeholder values untouched
site_A_df["Card"] = site_A_df["Card"].map(lambda x: x.zfill(8) if (x!="#VALUE!" and x!="#REF!" and x!="nan") else x)
site_B_df["Card"] = site_B_df["Card"].map(lambda x: x.zfill(8) if (x!="#VALUE!" and x!="#REF!" and x!="nan") else x)
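
The padding rule used above can be isolated in a small function, which makes it easy to test on individual values. A minimal sketch, assuming a fixed width of 8 as concluded from the length check; `restore_leading_zeros` is a hypothetical helper.

```python
CARD_LENGTH = 8  # assumed fixed width, based on the length check above
INVALID_TOKENS = {"#VALUE!", "#REF!", "nan"}

def restore_leading_zeros(card, width=CARD_LENGTH):
    """Left-pad a numeric card string with zeros; leave placeholders untouched."""
    if card in INVALID_TOKENS or not card.isdigit():
        return card
    return card.zfill(width)

for raw in ["1001", "942530", "#REF!", "12345678"]:
    print(raw, "->", restore_leading_zeros(raw))
```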

#### Check that the 'When' feature follows a date and time format

This can be checked quickly by converting the strings to datetime with an explicit format. Since the conversion raises no error, we can assume that the format is consistent.

In [26]:
# Extract date info as midnight-normalised timestamps
site_A_df["Date"] = pd.to_datetime(site_A_df["When"], format="%d/%m/%Y %H:%M", errors="raise").dt.normalize()
site_B_df["Date"] = pd.to_datetime(site_B_df["When"], format="%d/%m/%Y %H:%M", errors="raise").dt.normalize()

site_A_df.head()

Unnamed: 0,When,Prof,Dept,Card,Time,Date,Card_ID_Length,is_Card_ID_Numeric
0,20/4/2020 7:17,Visitor Pass,not_applicable,1001,20/4/2020,2020-04-20,4,True
1,21/4/2020 7:10,Visitor Pass,not_applicable,1001,21/4/2020,2020-04-21,4,True
2,22/4/2020 7:09,Visitor Pass,not_applicable,1001,22/4/2020,2020-04-22,4,True
3,23/4/2020 7:16,Visitor Pass,not_applicable,1001,23/4/2020,2020-04-23,4,True
4,24/4/2020 7:25,Visitor Pass,not_applicable,1001,24/4/2020,2020-04-24,4,True


In [27]:
# Extract time info as zero-padded HH:MM strings
site_A_df["Time"] = pd.to_datetime(site_A_df["When"], format="%d/%m/%Y %H:%M", errors="raise").dt.strftime("%H:%M")
site_B_df["Time"] = pd.to_datetime(site_B_df["When"], format="%d/%m/%Y %H:%M", errors="raise").dt.strftime("%H:%M")

In [28]:
site_A_df.head()

Unnamed: 0,When,Prof,Dept,Card,Time,Date,Card_ID_Length,is_Card_ID_Numeric
0,20/4/2020 7:17,Visitor Pass,not_applicable,1001,07:17,2020-04-20,4,True
1,21/4/2020 7:10,Visitor Pass,not_applicable,1001,07:10,2020-04-21,4,True
2,22/4/2020 7:09,Visitor Pass,not_applicable,1001,07:09,2020-04-22,4,True
3,23/4/2020 7:16,Visitor Pass,not_applicable,1001,07:16,2020-04-23,4,True
4,24/4/2020 7:25,Visitor Pass,not_applicable,1001,07:25,2020-04-24,4,True
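
For new extracts where the conversion does fail, it helps to list the offending rows rather than stop at the first error. The sketch below assumes the same `%d/%m/%Y %H:%M` format; `find_unparseable_when` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd

def find_unparseable_when(df, fmt="%d/%m/%Y %H:%M"):
    """Return the rows whose 'When' value does not match the expected format."""
    parsed = pd.to_datetime(df["When"], format=fmt, errors="coerce")
    return df[parsed.isna()]

sample = pd.DataFrame({
    "When": ["20/4/2020 7:17", "2020-04-21 07:10", "31/2/2020 9:00"],
})
bad_rows = find_unparseable_when(sample)
print(bad_rows)
```

With `errors="coerce"`, values that do not match the format (or are impossible dates) become `NaT`, so they can be inspected and fixed before the strict conversion is run.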


In [29]:
site_B_df[site_B_df["is_Card_ID_Numeric"]==False][["Prof","Dept","Card"]].value_counts()

Prof          Dept            Card   
Temp Pass     not_applicable  #REF!      740
Staff Pass    Dept8           #VALUE!     10
              Dept16          #VALUE!      3
Temp Pass     not_applicable  nan          3
                              #VALUE!      2
Visitor Pass  not_applicable  nan          2
dtype: int64

Drop duplicates based on existing data

In [30]:
# Drop rows that are exact duplicates across all columns, assuming they are repeated records of the same tap
site_A_df.drop_duplicates(inplace=True)
site_B_df.drop_duplicates(inplace=True)

Check the dates on which the invalid records occur. For Site A, the impacted dates are from May 27 onwards, with the #REF! strings concentrated in the Jun 1 data. In Excel, #REF! denotes an invalid cell reference, which suggests that these values were meant to reference another record. For Site B, the impacted dates are Apr 20 to May 7 and May 26 onwards.

For ease of analysis, we assume that #REF! marks information referenced from the same Prof on the same date, while #VALUE! marks malformed information that is not a reference but some number.

We also assume that temp passes are tied to their respective sites and cannot be shared.

In [31]:
site_A_df[site_A_df["is_Card_ID_Numeric"]==False][["Date", "Prof","Dept","Card"]].value_counts()

Date        Prof        Dept            Card   
2020-06-01  Temp Pass   not_applicable  #REF!      41
            Staff Pass  Dept18          #REF!      13
                        Dept5           #REF!      10
                        Dept15          #REF!       8
                        Dept14          #REF!       5
                        Dept2           #REF!       5
                        Dept11          #REF!       4
                        Dept8           #REF!       4
                        Dept17          #REF!       3
                        Dept3           #REF!       2
                        Dept7           #REF!       2
2020-05-27  Staff Pass  Dept15          #VALUE!     1
2020-05-28  Staff Pass  Dept15          #VALUE!     1
            Temp Pass   not_applicable  #VALUE!     1
2020-06-01  Staff Pass  Dept10          #REF!       1
                        Dept4           #REF!       1
dtype: int64

In [32]:
site_A_df = site_A_df[~site_A_df["Card"].str.contains("#REF!")]
site_A_df[site_A_df["is_Card_ID_Numeric"]==False][["Date", "Prof","Dept","Card"]].value_counts()

Date        Prof        Dept            Card   
2020-05-27  Staff Pass  Dept15          #VALUE!    1
2020-05-28  Staff Pass  Dept15          #VALUE!    1
            Temp Pass   not_applicable  #VALUE!    1
dtype: int64

In [33]:
# Look for patterns that could help replace the #VALUE! entries
dept15_A_df = site_A_df[site_A_df["Dept"]=="Dept15"]
dept15_A_df[(dept15_A_df["Date"]< "2020-05-30") & (dept15_A_df["Date"]> "2020-05-25")].sort_values("Date")

Unnamed: 0,When,Prof,Dept,Card,Time,Date,Card_ID_Length,is_Card_ID_Numeric
5048,26/5/2020 8:34,Staff Pass,Dept15,00001897,08:34,2020-05-26,4,True
5206,26/5/2020 8:15,Staff Pass,Dept15,00005602,08:15,2020-05-26,4,True
5818,26/5/2020 8:50,Staff Pass,Dept15,00044552,08:50,2020-05-26,5,True
5813,26/5/2020 8:40,Staff Pass,Dept15,00043285,08:40,2020-05-26,5,True
5824,27/5/2020 9:15,Staff Pass,Dept15,00061445,09:15,2020-05-27,5,True
5821,27/5/2020 9:14,Staff Pass,Dept15,00044552,09:14,2020-05-27,5,True
5811,27/5/2020 10:15,Staff Pass,Dept15,00043285,10:15,2020-05-27,5,True
5854,27/5/2020 9:00,Staff Pass,Dept15,#VALUE!,09:00,2020-05-27,7,False
5309,27/5/2020 8:48,Staff Pass,Dept15,00007794,08:48,2020-05-27,4,True
5267,27/5/2020 8:16,Staff Pass,Dept15,00006603,08:16,2020-05-27,4,True


In [34]:
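# Treat the remaining non-numeric IDs (#VALUE!) as one placeholder card, "NewCard", and flag them as resolved
site_A_df.loc[site_A_df["is_Card_ID_Numeric"]==False,["Card", "is_Card_ID_Numeric"]]= "NewCard", True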

In [35]:
temp_pass_A_df = site_A_df[site_A_df["Prof"]=="Temp Pass"]


Apply the same process to Site B.

In [36]:
site_B_df[site_B_df["is_Card_ID_Numeric"]==False][["Date", "Prof","Dept","Card"]].sort_values("Date").value_counts(sort=False)

Date        Prof          Dept            Card   
2020-04-20  Staff Pass    Dept8           #VALUE!     1
            Temp Pass     not_applicable  #REF!      73
2020-04-21  Staff Pass    Dept8           #VALUE!     1
            Temp Pass     not_applicable  #REF!      76
2020-04-22  Temp Pass     not_applicable  #REF!      82
                                          nan         1
2020-04-23  Staff Pass    Dept8           #VALUE!     1
            Temp Pass     not_applicable  #REF!      89
2020-04-24  Temp Pass     not_applicable  #REF!      87
                                          nan         1
2020-04-25  Staff Pass    Dept8           #VALUE!     1
            Temp Pass     not_applicable  #REF!      58
2020-04-26  Temp Pass     not_applicable  #REF!      61
                                          nan         1
2020-04-28  Staff Pass    Dept8           #VALUE!     1
2020-04-30  Staff Pass    Dept8           #VALUE!     1
2020-05-03  Visitor Pass  not_applicable  nan         

In [37]:
site_B_df = site_B_df[~site_B_df["Card"].str.contains("#REF!")]
site_B_df.loc[site_B_df["is_Card_ID_Numeric"]==False,["Card", "is_Card_ID_Numeric"]]= "NewCard", True

### Stack both sites dataframe for overall analysis

In [38]:
# Add a site identifier column to both dataframes
site_A_df["Site"] = "A"
site_B_df["Site"] = "B"

# Stack
combine_df = pd.concat([site_A_df, site_B_df], ignore_index=True)

# Create is_weekend column (Monday=0, so Saturday=5 and Sunday=6)
combine_df["is_weekend"] = combine_df["Date"].dt.dayofweek > 4
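
The task statement says that only the first clock-in of the day should be used. One way to apply that rule to the stacked data is sketched below. `first_entry_per_day` is a hypothetical helper, and grouping by site as well as by card is an assumption (a card seen at both sites on one day keeps one record per site); the sample rows are made up.

```python
import pandas as pd

def first_entry_per_day(df):
    """Keep the earliest record per card, site, and calendar day."""
    stamped = df.assign(
        Timestamp=pd.to_datetime(df["When"], format="%d/%m/%Y %H:%M")
    )
    stamped["Day"] = stamped["Timestamp"].dt.normalize()
    earliest = stamped.sort_values("Timestamp").drop_duplicates(
        subset=["Card", "Site", "Day"], keep="first"
    )
    return earliest.sort_index()

sample = pd.DataFrame({
    "When": ["20/4/2020 12:30", "20/4/2020 7:17", "21/4/2020 8:00", "20/4/2020 9:00"],
    "Card": ["00001001", "00001001", "00001001", "00001002"],
    "Site": ["A", "A", "A", "A"],
})
first_taps = first_entry_per_day(sample)
print(first_taps[["When", "Card", "Site"]])
```

Sorting on the parsed timestamp rather than on the raw string matters here, because "12:30" sorts before "7:17" as text.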


Can a staff member access multiple offices? A quick check on card 00119102 indicates that the same staff pass can access both sites on the same day and may be recorded under a different department at each site (e.g. card 08743285). We therefore assume that these records come from readers located at the individual departments of each building rather than at the office entrances, since no other access information is available. Notice that card 00005301 also has records with a 'not_applicable' department. A 'not_applicable' department on a staff card could mean access to rooms whose department information is unknown or missing.

The total number of staff based on distinct card IDs is 1517, which is below the assumed 2000 and therefore consistent with that assumption.

In [39]:
staff_pass_card_id_list = combine_df[combine_df["Prof"]=="Staff Pass"]["Card"].unique()
print("Total staff based on ID:", len(staff_pass_card_id_list))
number_of_multi_dept_staff_pass_id_dict= {}
for card_id in staff_pass_card_id_list:
    dept_list = combine_df[(combine_df["Card"]==card_id) & (combine_df["is_Card_ID_Numeric"]==True)]["Dept"].unique()
    if len(dept_list) > 1:
        number_of_multi_dept_staff_pass_id_dict[card_id]= dept_list

number_of_multi_dept_staff_pass_id_dict

Total staff based on ID: 1517


{'00119102': array(['Dept11', 'Dept3', 'Dept1'], dtype=object),
 '08743285': array(['Dept15', 'Dept16'], dtype=object),
 '00005301': array(['not_applicable', 'Dept18', 'Dept4'], dtype=object),
 '00068502': array(['Dept7', 'Dept1'], dtype=object),
 '00216002': array(['Dept18', 'Dept4'], dtype=object),
 '00366902': array(['Dept2', 'Dept18'], dtype=object),
 '00282902': array(['Dept18', 'Dept2'], dtype=object),
 '06740173': array(['Dept18', 'Dept17'], dtype=object),
 '00095902': array(['Dept5', 'Dept4'], dtype=object),
 '00333803': array(['Dept5', 'Dept4'], dtype=object),
 '00338003': array(['Dept3', 'Dept1'], dtype=object),
 '00438704': array(['Dept5', 'Dept4'], dtype=object),
 '00821802': array(['Dept5', 'Dept4'], dtype=object),
 '00295302': array(['Dept5', 'Dept4'], dtype=object),
 '00335103': array(['Dept5', 'Dept4'], dtype=object),
 '00408202': array(['Dept5', 'Dept4'], dtype=object),
 '00782409': array(['Dept5', 'Dept4'], dtype=object),
 '00000902': array(['Dept5', 'Dept4', 'Dept3']
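
The loop above filters the full dataframe once per card. The same lookup can be sketched with a single `groupby`, which tends to scale better as more periods are ingested. `cards_with_multiple_depts` is a hypothetical helper; the sample reuses two card IDs from the output above, but the rows themselves are made up.

```python
import pandas as pd

def cards_with_multiple_depts(df):
    """Map each staff card seen under more than one Dept value to those values."""
    staff_cards = df.loc[df["Prof"] == "Staff Pass", "Card"].unique()
    rows = df[df["Card"].isin(staff_cards) & df["is_Card_ID_Numeric"]]
    depts = rows.groupby("Card")["Dept"].unique()
    return {card: list(values) for card, values in depts.items() if len(values) > 1}

sample = pd.DataFrame({
    "Card": ["00119102", "00119102", "00005301", "00005301", "00000001"],
    "Prof": ["Staff Pass", "Staff Pass", "Staff Pass", "Temp Pass", "Staff Pass"],
    "Dept": ["Dept11", "Dept3", "Dept18", "not_applicable", "Dept1"],
    "is_Card_ID_Numeric": [True, True, True, True, True],
})
multi_dept = cards_with_multiple_depts(sample)
print(multi_dept)
```

As in the loop, rows from any profile are included once a card number is known to be a staff card, which is how a 'not_applicable' department can show up against a staff card.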

Check the breakdown of unique cards by profile. As noted earlier, only staff passes carry department information; for the other pass types (visitor and temp), none is captured. There are 966 distinct temp pass and visitor pass cards (945 + 21) and 951 distinct cards whose department is recorded as not_applicable. A simple guess based on these figures is that 1-2 staff could be assigned to each temp pass or visitor pass holder for special events or meetings.

In [40]:
for prof in combine_df["Prof"].unique():
    print(prof)
    print(combine_df[combine_df["Prof"]==prof]["Card"].nunique())

Visitor Pass
21
Temp Pass
945
Staff Pass
1517


In [43]:
# Count distinct cards whose Dept is not_applicable (note: no filter on profile)
combine_df[(combine_df["Dept"]=="not_applicable")]["Card"].nunique()

951