# **Attendance Data Consolidation and Transformation with Pandas**

**Note:** This notebook contains a modified version of code I developed to improve an existing process at a previous employer. The techniques and data formats are similar to those used in the real-world scenario, but all data, including names, numbers, and values, has been randomly generated for illustration and confidentiality. The modifications aim to showcase the approach and techniques applied in the original context.

### **Background**

The code in this notebook streamlined an attendance reporting process that I used to perform daily. The task involved transforming schedule and phone data for over 500 service desk employees into a format that could be uploaded to a SQL database for use in an attendance dashboard.

#### **Goal**

The goal is to generate two outputs from the raw data quickly and efficiently:
1. Consolidated schedule data - Contains:
  * Date
  * Start time
  * End time  
  * Employee number
  * Code: 1 (Off), 2 (Shift), 3 (Training), 4 (Vacation), 5 (PTO)

2. Employee login logout data - Contains:
  * Employee name
  * Date
  * Login time
  * Logout time
  * Employee number
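
The code scheme listed for the consolidated schedule can be sketched as a small lookup. This is a minimal illustration, not the notebook's own implementation: `MARKER_TO_CODE` and `schedule_code` are hypothetical names, `OFF` and `VL` are the markers visible in the sample schedule (reading `VL` as vacation is an assumption), and the `TRAINING` and `PTO` keys are assumed placeholders for whatever the template actually uses. It also assumes shift start times arrive as `datetime.time` objects, which is how pandas typically reads Excel time cells.

```python
import datetime as dt

# Hypothetical mapping from non-time schedule markers to output codes.
# 'OFF' and 'VL' appear in the sample schedule; the other keys are assumed.
MARKER_TO_CODE = {
    "OFF": 1,       # Off
    "TRAINING": 3,  # Training (assumed marker)
    "VL": 4,        # Vacation (assumed meaning of 'VL')
    "PTO": 5,       # PTO (assumed marker)
}


def schedule_code(cell):
    """Return the output code for one schedule cell, or None if unrecognised."""
    if isinstance(cell, dt.time):
        return 2  # any clock time is treated as a regular shift
    if isinstance(cell, str):
        return MARKER_TO_CODE.get(cell.strip().upper())
    return None  # missing values and other types are left for manual review


print(schedule_code(dt.time(21, 30)), schedule_code("OFF"), schedule_code("VK"))
```

Returning `None` for anything unrecognised keeps unknown markers visible instead of silently assigning them a code.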

#### **Raw Data**

##### **Schedule data**

Schedule data is provided separately by each department. They all use the same Excel file template.

![picture](https://drive.google.com/uc?export=view&id=1OsR9ezvDlW5Pw3ORYCtt4g9LNmYaK_TD)

##### **Phone data**

Employee login/logout data is obtained from the software phone system in Excel format. An employee may have multiple login and logout times in a single day.

![picture](https://drive.google.com/uc?export=view&id=15MZvNlQopQJdMIvQ3qmwzsnTW-qINl8u)
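
Because one employee can have several login/logout pairs, the sessions need to be collapsed before they fit the one-row-per-employee output. The sketch below shows one possible approach (earliest login, latest logout); the notebook's own handling may differ. The column names `Phone ID`, `Login`, and `Logout` and all timestamps are made up for illustration, and the data is assumed to cover a single shift day, so the grouping is by employee only.

```python
import pandas as pd

# Made-up sessions; column names are assumptions, not the phone export's real headers.
sessions = pd.DataFrame({
    "Phone ID": [39105058, 39105058, 25393153],
    "Login": pd.to_datetime(
        ["2023-12-06 21:28", "2023-12-07 02:05", "2023-12-06 21:31"]
    ),
    "Logout": pd.to_datetime(
        ["2023-12-07 01:00", "2023-12-07 06:32", "2023-12-07 06:30"]
    ),
})

# One row per employee: earliest login and latest logout across all sessions.
daily = (
    sessions.groupby("Phone ID", as_index=False)
    .agg(Login=("Login", "min"), Logout=("Logout", "max"))
)

print(daily)
```

Grouping by calendar date as well would split a night shift that crosses midnight into two rows, which is why the sketch avoids it.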

##### **Employee List**

The employee list is also saved as an Excel file. It contains employee IDs, Z IDs and phone IDs.

![picture](https://drive.google.com/uc?export=view&id=1_as6A6QaJtbvSfLDHSRXP7j57ovyaJ-E)



### **Part 1: Schedule Extraction**

In [1]:
import pandas as pd
import datetime as dt
import os

year = 2023
month = 12
date = 6

selected_date = dt.datetime(year, month, date) # Assign date to variable
print(selected_date)

2023-12-06 00:00:00


In [2]:
schedule_path = '/content/drive/MyDrive/attendance/schedule' # Location of schedule files

schedule_files = [file for file in os.listdir(schedule_path) if file.endswith('.xlsx')] # Read folder for list of xlsx files

for file in schedule_files:
    print(file) # Show list of xlsx files

Account3.xlsx
Account1.xlsx
Account2.xlsx


In [3]:
# List of accounts for schedule extraction
accounts = ['Account1', 'Account2', 'Account3']

In [4]:
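combined_df = [] # Will hold one DataFrame per account file

for account in accounts:
    # Read each account's schedule workbook; the first row holds the column headers
    file_df = pd.read_excel(os.path.join(schedule_path, f'{account}.xlsx'), header=0)
    combined_df.append(file_df)

df = pd.concat(combined_df, ignore_index=True) # Combine data from every file into one DataFrame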

df.head()

Unnamed: 0,Account,Site,Employee First Name,Employee Last Name,Employee Full Name,Z ID,ROLE,Time In,Time Out,WFH or Onsite,2023-12-04 00:00:00,2023-12-05 00:00:00,2023-12-06 00:00:00,2023-12-07 00:00:00,2023-12-08 00:00:00,2023-12-09 00:00:00,2023-12-10 00:00:00
0,Account1,Site B,Leo,Anderson,"Anderson, Leo",Z920211,L1,21:30:00,06:30:00,Onsite,21:30:00,21:30:00,VL,VL,VL,OFF,OFF
1,Account1,Site B,Sophia,Archer,"Archer, Sophia",Z431118,L1,21:30:00,06:30:00,Onsite,21:30:00,21:30:00,21:30:00,21:30:00,21:30:00,OFF,OFF
2,Account1,Site B,Emma,Barrett,"Barrett, Emma",Z895504,L1,22:30:00,07:30:00,Onsite,22:30:00,22:30:00,22:30:00,22:30:00,VL,OFF,OFF
3,Account1,Site A,Nathan,Barrett,"Barrett, Nathan",Z450462,L1,18:00:00,03:00:00,WFH,18:00:00,18:00:00,VL,VL,VL,OFF,OFF
4,Account1,Site B,Samuel,Bennett,"Bennett, Samuel",Z881393,L1,16:00:00,01:00:00,Onsite,09:00:00,09:30:00,16:00:00,16:00:00,16:00:00,OFF,OFF


In [5]:
df['Z ID'] = df['Z ID'].str.strip() # Remove leading/trailing whitespace from all Z IDs

# Keep only rows whose 'ROLE' is L1, L1.5 or L2 and whose value for the selected date is not missing (NaN)
role_mask = df['ROLE'].isin(['L1', 'L1.5', 'L2'])
scheduled_mask = df[selected_date].notna()

# Select only the relevant columns
df = df.loc[role_mask & scheduled_mask, ['Account', 'Employee Full Name', 'Z ID', 'ROLE', 'Time In', selected_date]]


df.head()

Unnamed: 0,Account,Employee Full Name,Z ID,ROLE,Time In,2023-12-06 00:00:00
0,Account1,"Anderson, Leo",Z920211,L1,21:30:00,VL
1,Account1,"Archer, Sophia",Z431118,L1,21:30:00,21:30:00
2,Account1,"Barrett, Emma",Z895504,L1,22:30:00,22:30:00
3,Account1,"Barrett, Nathan",Z450462,L1,18:00:00,VL
4,Account1,"Bennett, Samuel",Z881393,L1,16:00:00,16:00:00
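
Selecting `df[selected_date]` works because the date headers in the schedule are read as `datetime` column labels rather than strings. A small guard, sketched below on a toy frame, can confirm the selected date is present before filtering. The names `toy`, `date_columns`, and `has_date` are illustrative only.

```python
import datetime as dt

import pandas as pd

# Toy frame with one text column and one datetime-labelled column
toy = pd.DataFrame({
    "Z ID": ["Z920211"],
    dt.datetime(2023, 12, 6): ["VL"],
})

selected_date = dt.datetime(2023, 12, 6)

# Column labels that are dates, as opposed to text headers such as 'Z ID'
date_columns = [c for c in toy.columns if isinstance(c, dt.datetime)]

# True only if the schedule actually contains a column for the selected date
has_date = selected_date in toy.columns

print(date_columns, has_date)
```

Checking `has_date` first gives a clearer failure than the `KeyError` raised when the week in the file does not include the selected date.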


In [6]:
df['Z ID'].value_counts() # Check for duplicate Z IDs

Z768155    2
Z284698    1
Z399840    1
Z735944    1
Z739094    1
          ..
Z855300    1
Z453011    1
Z668825    1
Z217526    1
Z735420    1
Name: Z ID, Length: 84, dtype: int64
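
Scanning `value_counts()` works for a short list, but the repeated IDs can also be pulled out directly. The sketch below uses a toy frame in place of `df`; `toy` and `dupes` are illustrative names.

```python
import pandas as pd

# Toy frame standing in for the filtered schedule
toy = pd.DataFrame({
    "Account": ["Account1", "Account2", "Account1"],
    "Z ID": ["Z768155", "Z768155", "Z920211"],
})

# keep=False marks every occurrence of a repeated Z ID, not just the later ones
dupes = toy[toy["Z ID"].duplicated(keep=False)]

print(dupes)
```

An empty `dupes` frame means every Z ID is unique and no manual clean-up is needed.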

In [7]:
df[df['Z ID'] == 'Z768155'] # Inspect the rows for the duplicated Z ID

Unnamed: 0,Account,Employee Full Name,Z ID,ROLE,Time In,2023-12-06 00:00:00
42,Account1,"Sullivan, Zoey",Z768155,L2,21:30:00,21:30:00
76,Account2,"Sullivan, Zoey",Z768155,L1,,OFF


In [8]:
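# Drop the duplicate Account2 row (index 76): it has no 'Time In' and is marked OFF
df = df.drop(76)
df[df['Z ID'] == 'Z768155'] # Confirm only one row remains for this Z ID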

Unnamed: 0,Account,Employee Full Name,Z ID,ROLE,Time In,2023-12-06 00:00:00
42,Account1,"Sullivan, Zoey",Z768155,L2,21:30:00,21:30:00
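
Dropping by a hard-coded index label only works for this particular day's data. As a hedged alternative, the sketch below keeps, for each Z ID, the row that has a `Time In` value. That rule is an assumption drawn from this one example, not a confirmed business rule, and `toy` and `deduped` are illustrative names.

```python
import pandas as pd

# Toy frame reproducing the duplicate seen above (index labels 42 and 76)
toy = pd.DataFrame(
    {
        "Account": ["Account1", "Account2", "Account1"],
        "Z ID": ["Z768155", "Z768155", "Z920211"],
        "Time In": ["21:30:00", None, "21:30:00"],
    },
    index=[42, 76, 0],
)

# Put rows with a missing 'Time In' last, keep the first row per Z ID,
# then restore the original row order.
deduped = (
    toy.assign(_missing=toy["Time In"].isna())
    .sort_values("_missing", kind="stable")
    .drop_duplicates(subset="Z ID", keep="first")
    .drop(columns="_missing")
    .sort_index()
)

print(deduped)
```

If both duplicate rows had a `Time In`, this would keep whichever came first, so such cases would still deserve a manual look.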


In [9]:
# Get employee numbers from emplist file

emp_path = '/content/drive/MyDrive/attendance/EmpList.xlsx' # Location of emplist file

emp_df = pd.read_excel(emp_path, header=0)

emp_df.head()

Unnamed: 0,Employee ID,Z ID,Name,Account,Phone ID
0,1089037,Z920211,"Anderson, Leo",Account1,39105058
1,1172007,Z431118,"Archer, Sophia",Account1,25393153
2,1323496,Z895504,"Barrett, Emma",Account1,34591484
3,1407756,Z450462,"Barrett, Nathan",Account1,20152612
4,1544048,Z881393,"Bennett, Samuel",Account1,39176237


In [10]:
# Left join on 'Z ID' keeps every schedule row; the shared 'Account' column gets _x/_y suffixes
df = df.merge(emp_df, how='left', on='Z ID')

df.head()

Unnamed: 0,Account_x,Employee Full Name,Z ID,ROLE,Time In,2023-12-06 00:00:00,Employee ID,Name,Account_y,Phone ID
0,Account1,"Anderson, Leo",Z920211,L1,21:30:00,VL,1089037,"Anderson, Leo",Account1,39105058
1,Account1,"Archer, Sophia",Z431118,L1,21:30:00,21:30:00,1172007,"Archer, Sophia",Account1,25393153
2,Account1,"Barrett, Emma",Z895504,L1,22:30:00,22:30:00,1323496,"Barrett, Emma",Account1,34591484
3,Account1,"Barrett, Nathan",Z450462,L1,18:00:00,VL,1407756,"Barrett, Nathan",Account1,20152612
4,Account1,"Bennett, Samuel",Z881393,L1,16:00:00,16:00:00,1544048,"Bennett, Samuel",Account1,39176237
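
A left merge silently leaves `NaN` in `Employee ID` for any Z ID missing from the employee list, and it duplicates schedule rows if the employee list repeats a Z ID. The sketch below shows how the `validate` and `indicator` options of `DataFrame.merge` can surface both problems. The frames are toy data and `Z000000` is a made-up ID.

```python
import pandas as pd

schedule = pd.DataFrame({"Z ID": ["Z920211", "Z431118", "Z000000"]})  # last ID is made up
employees = pd.DataFrame({
    "Z ID": ["Z920211", "Z431118"],
    "Employee ID": [1089037, 1172007],
})

merged = schedule.merge(
    employees,
    how="left",
    on="Z ID",
    validate="many_to_one",  # raises MergeError if a Z ID repeats in `employees`
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# Z IDs on the schedule that have no match in the employee list
unmatched = merged.loc[merged["_merge"] == "left_only", "Z ID"].tolist()

print(unmatched)
```

Any ID in `unmatched` would otherwise reach the final output without an employee number.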


In [11]:
df[selected_date].value_counts() # Check values from selected date

21:30:00    13
OFF          8
20:00:00     8
22:00:00     7
21:00:00     6
19:00:00     5
14:00:00     4
VL           4
23:00:00     4
06:00:00     3
15:00:00     3
05:00:00     3
16:00:00     2
17:00:00     2
07:00:00     2
22:30:00     2
18:00:00     2
00:00:00     1
09:00:00     1
13:00:00     1
VK           1
11:00:00     1
03:00:00     1
Name: 2023-12-06 00:00:00, dtype: int64
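
The counts above include a single `VK`, which does not match the other markers in the sample. A check like the one sketched below can list such values automatically. `KNOWN_MARKERS` is an assumed set built from the markers seen in the sample schedule, `is_expected` is a hypothetical helper, and time cells are assumed to be `datetime.time` objects.

```python
import datetime as dt

import pandas as pd

# Assumed set of valid markers; extend it with whatever the template allows
KNOWN_MARKERS = {"OFF", "VL"}

# Toy values standing in for df[selected_date]
values = pd.Series([dt.time(21, 30), "OFF", "VL", "VK", dt.time(6, 0)])


def is_expected(cell):
    """A cell is expected if it is a clock time or one of the known markers."""
    return isinstance(cell, dt.time) or cell in KNOWN_MARKERS


unexpected = values[~values.map(is_expected).astype(bool)]

print(unexpected)
```

Whether an unexpected value is a typo or a legitimate new marker still has to be decided by whoever maintains the schedule.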