# Validating timezone regularity
The meal-intake distributions were not uniform, and for some individuals the peak was visibly shifted into the overnight period, so it makes sense to check each individual's timezones. Two sources can be validated against each other: the timezone of the datetime values in the record files, and the region held in the profile file, which can be translated to a timezone.

The first check is whether an individual has more than one timezone across the datetime values in their records. This should focus on the single datetime column in each file that is used for the time series in the processed data, rather than taking a generic approach to all timestamp columns. This matters for the device status files, where we know that the timestamp columns do not align: within the same dataset, some are localised to UTC while others keep their timezone offset. Two timezones in the time series column may be legitimate, but they may also have been introduced through error. Either way, this needs checking, and affected individuals may need to be excluded by default.

The second check is whether the timezone of the datetime values matches the timezone of the region in the profile file. These should align. The check translates the profile region to a timezone and then converts both it and the record timestamps to UTC offsets, so that the two can be compared like for like.
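As a rough illustration of the first check, the sketch below collects the distinct UTC offsets seen per individual in a single timestamp column and flags anyone with more than one. It is not the project's implementation: the function name `offsets_per_person`, the `(zip_id, timestamp)` input shape, and the sample records are all assumptions made up for this example.

```python
from collections import defaultdict
from datetime import datetime


def offsets_per_person(records):
    """Map each id to the set of UTC offsets (in hours) seen in its timestamps.

    `records` is an iterable of (zip_id, iso_timestamp) pairs. Timestamps that
    carry no offset are recorded as None rather than guessed.
    """
    seen = defaultdict(set)
    for zip_id, stamp in records:
        offset = datetime.fromisoformat(stamp).utcoffset()
        seen[zip_id].add(None if offset is None else offset.total_seconds() / 3600)
    return dict(seen)


# Made-up records for illustration only (hypothetical ids "A" and "B")
records = [
    ("A", "2023-06-01T08:00:00+02:00"),
    ("A", "2023-06-01T09:00:00+02:00"),
    ("B", "2023-06-01T08:00:00-07:00"),
    ("B", "2023-06-02T08:00:00+00:00"),
]

offsets = offsets_per_person(records)
# Individuals whose time series column mixes more than one UTC offset
multiple = sorted(zip_id for zip_id, offs in offsets.items() if len(offs) > 1)
print(multiple)
```

Working with offsets rather than timezone names keeps this check comparable with the profile regions once those are converted to offsets too. Note that a daylight saving change would also show up as two offsets for one person, so a flagged individual is a prompt for inspection rather than proof of an error.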

In [2]:
%load_ext autoreload
%autoreload 2
from pathlib import Path
import pandas as pd

from src.data_processing.read import read_all_profile, convert_timezone_to_utc_offset
from src.configurations import Configuration
from src.config import INTERIM_DATA_DIR


config = Configuration()

profile_read_recs = read_all_profile(config)
profile_offsets = {}
for rr in profile_read_recs:
    profile_offsets[rr.zip_id] = rr.utc_offsets

2025-05-23 22:45:00.776 | INFO     | src.config:<module>:11 - PROJ_ROOT path is: C:\Users\ross\PycharmProjects\masters_project


00221634
00309157
00897741
01177138
01352464
01739655
01884126
01919652
01949240
02033176
02050717
02199852
02611986
03403352
03572116
04762925
05274556
05582191
07613176
07886752
10540336
12689381
13029224
13484299
13708515
13783771
14092221
14470046
15558575
15634563
16553776
16975609
17161370
18001564
18991425
19626656
20216809
20396154
20649783
20656313
20777653
21946407
22961398
23340371
23428091
23711486
23769130
23863411
24110807
24448124
24587372
25401109
25692073
26691577
26856617
27526291
27553507
27700103
27819368
28176124
28608066
28756888
28761103
28768536
28823146
32407882
32635618
32997134
33324736
33470634
33831564
33962890
33999544
34148224
35187603
35533061
35719805
37764532
37875431
37948668
37998755
38110191
39038570
39079816
39182506
39819048
39901815
39986716
40237051
40634871
40997757
41131654
41663654
42052178
42360672
43589707
45025419
45120081
46253612
47323535
47631371
47750728
47971065
48509634
48540630
49141524
49182092
49551394
['timezone']
Index(['carbs_h

In [3]:
df = pd.DataFrame(list(profile_offsets.items()), columns=['id', 'tz'])
df.head()

| | zip_id | regions |
|---|---|---|
| 0 | 221634 | [Europe/Berlin] |
| 1 | 309157 | [Australia/Brisbane] |
| 2 | 897741 | [US/Pacific, UTC] |
| 3 | 1177138 | |
| 4 | 1352464 | [Europe/Stockholm, UTC, US/Pacific-New] |


In [11]:
df_profile_exp = df.explode('tz').reset_index(drop=True)
df_profile_exp['offset'] = df_profile_exp['tz'].apply(convert_timezone_to_utc_offset)
df_count_tz = df_profile_exp.groupby('id').count()
df_count_tz.reset_index().groupby('tz').agg({'id': 'count'}).reset_index().sort_values('id', ascending=False)

| | regions | zip_id |
|---|---|---|
| 1 | 1 | 106 |
| 0 | 0 | 61 |
| 2 | 2 | 44 |
| 3 | 3 | 9 |
| 5 | 5 | 3 |
| 7 | 7 | 3 |
| 4 | 4 | 2 |
| 6 | 6 | 1 |


In [12]:
df_profile_exp = df_profile_exp.set_index(['id', 'offset']) # This will be used to check the timezone of the datetime values in the device status files
df_profile_exp.to_csv(INTERIM_DATA_DIR / 'profile_region_utc_offset.csv')

In [6]:
df_profile_exp.head()

| zip_id | region_utc_offset | regions |
|---|---|---|
| 221634 | 2.0 | Europe/Berlin |
| 309157 | 10.0 | Australia/Brisbane |
| 897741 | -7.0 | US/Pacific |
| 897741 | 0.0 | UTC |
| 1177138 | | |
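The implementation of `convert_timezone_to_utc_offset` is not shown in this notebook. The sketch below is one way such a conversion could work using the standard library's `zoneinfo`, and is an assumption rather than the project's code. The function name `utc_offset_hours` and its `when` parameter are hypothetical. The offsets in the table above (2.0 for Europe/Berlin, -7.0 for US/Pacific) are consistent with daylight saving time being in effect, which suggests the result depends on the reference date, so the sketch makes that date explicit.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def utc_offset_hours(tz_name, when):
    """Return the UTC offset in hours of `tz_name` at the aware datetime `when`.

    Returns None for missing or unrecognised names instead of raising, so the
    function can be applied to a column that contains empty regions.
    """
    if not isinstance(tz_name, str):
        return None
    try:
        zone = ZoneInfo(tz_name)
    except (ZoneInfoNotFoundError, ValueError):
        return None
    return when.astimezone(zone).utcoffset().total_seconds() / 3600


# Two reference instants, one in northern summer and one in northern winter
summer = datetime(2025, 7, 1, 12, tzinfo=timezone.utc)
winter = datetime(2025, 1, 1, 12, tzinfo=timezone.utc)

berlin_summer = utc_offset_hours("Europe/Berlin", summer)
berlin_winter = utc_offset_hours("Europe/Berlin", winter)
print(berlin_summer, berlin_winter)
```

Because a region's offset can differ between summer and winter, comparing against record timestamps is safest when the profile offset is evaluated at, or near, the date of the records being checked. Deprecated aliases such as `US/Pacific-New` may be absent from newer timezone databases, in which case this sketch returns `None`.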


In [7]:
df_profile_exp.loc['41131654']

| region_utc_offset | regions |
|---|---|
| 12.0 | Pacific/Auckland |


None of the openaps/enacted/timestamp columns appeared to have issues when read. The other datetime columns in the device status files, which carried different timezone values, are not needed, so only the timezones in the openaps/enacted/timestamp column need to be addressed.
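The second check could then be sketched as a per-individual comparison between the offsets derived from the profile regions and the offsets observed in the openaps/enacted/timestamp column. Everything below is illustrative: the function name `compare_offsets`, the four category labels, and the sample data are assumptions, not results from the dataset.

```python
def compare_offsets(profile_offsets, record_offsets):
    """Classify each individual by how profile and record UTC offsets relate.

    Both arguments map an id to a set of UTC offsets in hours.
    """
    result = {}
    for zip_id in sorted(set(profile_offsets) | set(record_offsets)):
        prof = profile_offsets.get(zip_id, set())
        rec = record_offsets.get(zip_id, set())
        if not prof or not rec:
            result[zip_id] = "missing"    # nothing to compare on one side
        elif rec <= prof:
            result[zip_id] = "match"      # every record offset is in the profile
        elif rec & prof:
            result[zip_id] = "partial"    # some overlap, some unexplained offsets
        else:
            result[zip_id] = "mismatch"   # no offset in common
    return result


# Made-up offsets for illustration only (hypothetical ids)
profile = {"A": {2.0}, "B": {-7.0, 0.0}, "C": {10.0}, "D": set()}
records = {"A": {2.0}, "B": {-7.0, 1.0}, "C": {-5.0}, "D": {12.0}}

status = compare_offsets(profile, records)
print(status)
```

Individuals classed as "partial" or "mismatch" are the ones to inspect or exclude. "missing" covers profiles with no region, such as the empty row in the table above.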