# Re-identification Risks in De-identified Transport Data: The Myki Case
| Name       | Student Number | Email Address            | Student Type         |
|------------|----------------|--------------------------|----------------------|
| YUPENG WEN | s224212855     | s224212855@deakin.edu.au | Postgraduate(Sit731) |

## 1. Introduction
The release of government-held datasets for research and innovation purposes has become increasingly common, driven by the promise of data-driven policy making, transparency, and economic value creation. However, high-profile incidents have demonstrated that datasets believed to be “de-identified” may still expose individuals to privacy harms through re-identification. The Myki public transport dataset released by Public Transport Victoria (PTV) represents a prominent Australian example where traditional de-identification techniques failed to adequately mitigate privacy risks.

The purpose of this article is to critically examine the Myki data release in light of contemporary legal and technical understandings of re-identification. The article first introduces the background of the Myki dataset and the circumstances that enabled re-identification. It then compares how re-identification is treated under the General Data Protection Regulation (GDPR) and Australia’s Privacy Act 1988, before analysing how the Victorian Privacy and Data Protection Act 2014 (PDP Act) defines personal information through the standard of reasonable identifiability. A comparative case study of the Netflix Prize Movie Ratings Dataset is presented to demonstrate that the Myki incident is not isolated. Finally, the article proposes practical recommendations and technical techniques that data custodians can adopt to reduce re-identification risk while preserving analytical value in future Myki data releases.

## 2. Background of the Myki Dataset
### 2.1 De-identification measures applied
Public Transport Victoria released the Myki dataset after applying a series of conventional de-identification techniques intended to remove personal identifiers while retaining research utility. Direct identifiers, such as names, Myki card numbers, and customer account details, were removed or replaced with randomly generated identifiers. Each Myki card was assigned a unique but persistent pseudonymous identifier, enabling trips associated with the same card to be linked over time without directly revealing the cardholder’s real-world identity.

In addition, limited attribute suppression and generalisation were applied to certain sensitive fields. However, key data elements—including precise timestamps, locations, and sequential trip information—were largely retained. While these measures aligned with common de-identification practices at the time, they did not sufficiently reduce the uniqueness of individual travel patterns. As a result, individuals could still be distinguished based on habitual movements across time and space.

### 2.2 External data enabling re-identification

Re-identification became feasible through data linkage with external information that was publicly available or reasonably accessible. Individuals frequently disclose commuting habits through social media posts, blogs, or professional profiles. Journalistic and academic investigations further demonstrated that the known travel routines of politicians and public officials could be matched against distinctive Myki travel records.

The combination of time, location, and frequency acted as a set of quasi-identifiers. Repeated travel between specific origins and destinations at consistent times generated highly distinctive patterns. When linked with external knowledge, these patterns enabled individuals to be singled out with a high degree of confidence, even in the absence of explicit identifiers.
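The uniqueness of such patterns can be estimated directly. The following is a minimal sketch, assuming trip records reduced to a card ID plus an (origin, destination, hour) triple; the data and field layout are hypothetical and do not reflect the actual Myki schema.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (card_id, origin, destination, hour of day)
trips = [
    ("A", "Richmond", "Flinders St", 8),
    ("A", "Flinders St", "Richmond", 17),
    ("B", "Richmond", "Flinders St", 8),
    ("B", "Flinders St", "Richmond", 17),
    ("C", "Box Hill", "Parliament", 7),
    ("C", "Parliament", "Box Hill", 18),
]

def unique_fingerprints(trips):
    """Return the cards whose set of quasi-identifier triples is shared by no other card."""
    patterns = defaultdict(set)
    for card_id, origin, destination, hour in trips:
        patterns[card_id].add((origin, destination, hour))
    fingerprints = {card: frozenset(p) for card, p in patterns.items()}
    counts = Counter(fingerprints.values())
    unique = sorted(card for card, fp in fingerprints.items() if counts[fp] == 1)
    return unique, len(unique) / len(fingerprints)

unique_cards, share = unique_fingerprints(trips)
print(unique_cards, round(share, 2))
```

In this toy data, cards A and B share the same commute and hide each other, while card C has a pattern nobody else has and can be singled out by anyone who knows that routine.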

### 2.3 Legal and governance issues identified by OVIC

The Office of the Victorian Information Commissioner (OVIC) identified several legal and governance failures in its investigation of the Myki data release. First, OVIC found that PTV misinterpreted the concept of de-identified data by assuming that the removal of direct identifiers was sufficient to render the dataset non-personal. Under the Privacy and Data Protection Act 2014 (Vic), information remains personal if individuals are reasonably identifiable, including through linkage with other datasets [1].

Second, OVIC concluded that PTV failed to adequately assess the foreseeable risk of re-identification arising from linkage with external data. Modern privacy risk assessments, as emphasised by OVIC, must account for reasonably foreseeable linkage attacks rather than focusing solely on internal datasets. Finally, OVIC determined that the release breached several Information Privacy Principles, including those relating to data security, use and disclosure limitations, and accountability for personal information.

## 3. Re-identification under the GDPR and the Privacy Act 1988
### 3.1 Treatment of re-identification and dataset linkage

The GDPR does not explicitly define re-identification as a standalone concept, nor does it directly regulate dataset linkage as a specific activity. Nevertheless, re-identification risk is embedded throughout the regulation. The GDPR distinguishes between anonymous and pseudonymous data and treats pseudonymised data as personal data, because it can still be attributed to an individual using additional information. Recital 26 [2] provides that data is anonymous only if the individual is not identifiable once all the means reasonably likely to be used are taken into account, which includes linkage with external datasets. As a result, controllers remain responsible for assessing identification risks and must treat any data that could reasonably be linked back to an individual as personal data.

Similarly, the Privacy Act 1988 does not explicitly define re-identification. However, it addresses the concept more directly through its definition of de-identified information and through subsequent amendments and regulatory guidance. De-identified information is data that is no longer about an identifiable individual and therefore falls outside the scope of the Act. Re-identification via data linkage is recognised as a critical risk that must be managed to preserve de-identified status. Moreover, complementary legislation, such as the Data Availability and Transparency Act 2022, introduces explicit prohibitions and penalties for attempts to re-identify individuals in shared government datasets.

### 3.2 Determining when data is sufficiently anonymised

Under the GDPR, anonymisation is treated as a dynamic, risk-based concept. Data is considered anonymous only if re-identification would be too costly, time-consuming, or technically difficult using means reasonably likely to be employed now or in the foreseeable future. Because technological capabilities and auxiliary datasets evolve, data that was once anonymous may later revert to being personal data. Until a controller can demonstrate that re-identification is not reasonably likely, the full obligations of the GDPR continue to apply.

In contrast, the Privacy Act 1988 has traditionally adopted a more static approach. Data is treated as de-identified once individuals are no longer identifiable or reasonably identifiable, often assessed at the time of release. While guidance from the Office of the Australian Information Commissioner (OAIC) encourages agencies to consider linkage risks, this consideration is less deeply embedded in the statutory framework than under the GDPR’s forward-looking model.

### 3.3 Key differences, overlaps, and gaps

The GDPR imposes a deeper, forward-looking risk assessment requirement and clearly distinguishes anonymisation from pseudonymisation, whereas the Privacy Act historically focuses on threshold-based identifiability at the time of disclosure. Accountability mechanisms under the GDPR are also stronger, with enforceable documentation requirements and significant penalties.

Despite these differences, both frameworks rely on the concept of reasonable identifiability and recognise that removing direct identifiers alone is insufficient. Neither framework guarantees absolute anonymity, acknowledging that contextual information can transform ostensibly non-identifying data into personal data.

However, notable gaps remain. Neither regime explicitly regulates large-scale dataset linkage activities, and both rely on flexible standards that may be interpreted inconsistently. Furthermore, detailed technical standards for anonymisation are largely left to regulatory guidance rather than codified in legislation.

## 4. Re-identification under the Victorian Privacy and Data Protection Act 2014

The PDP Act defines personal information as information about an individual whose identity is apparent, or can “reasonably be ascertained”, from that information. This definition adopts a contextual and risk-based standard rather than requiring direct identification. Although the Act does not explicitly refer to re-identification, the notion of reasonable ascertainability implicitly encompasses re-identification through data linkage.

Where an individual’s identity can be inferred by combining a released dataset with other reasonably available information—such as public records, social media disclosures, or known behavioural patterns—the information remains personal under the Act. Data linkage amplifies identifiability by combining quasi-identifiers such as time, location, and travel behaviour, often rendering individuals unique despite the absence of names or identifiers.

The Myki case illustrates that such linkage was reasonably foreseeable. Consequently, compliance with the PDP Act requires data custodians to evaluate how easily a member of the public could link released data with external information to re-identify individuals, rather than relying solely on technical de-identification measures.

## 5. Comparative Case Study: The Netflix Prize Dataset

A widely cited example of failed anonymisation is the Netflix Prize Movie Ratings Dataset. In 2006, Netflix released approximately 100 million movie ratings from around 480,000 users, removing direct identifiers and replacing them with random user IDs [3]. The dataset was intended to advance research in recommendation systems.

Researchers Narayanan and Shmatikov demonstrated that the dataset could be re-identified by linking it with publicly available IMDb ratings. By matching patterns of movie ratings and approximate timestamps, they were able to associate anonymised Netflix records with specific IMDb user profiles. Even a small number of overlapping ratings was sufficient to enable re-identification [4].
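The matching idea can be illustrated with a deliberately simplified sketch. It is not the scoring algorithm used by Narayanan and Shmatikov; it only counts titles where the rating agrees and the dates fall within a tolerance. All names, data, and the 14-day window are assumptions made for illustration.

```python
from datetime import date

# Hypothetical released records: pseudonymous ID -> {title: (rating, rating date)}
released = {
    "u1": {"Heat": (5, date(2005, 3, 1)), "Alien": (4, date(2005, 3, 9)), "Big": (2, date(2005, 4, 2))},
    "u2": {"Heat": (3, date(2005, 1, 5)), "Alien": (4, date(2005, 6, 1))},
}
# Hypothetical ratings posted publicly under a known profile (auxiliary knowledge)
public_profile = {"Heat": (5, date(2005, 3, 3)), "Big": (2, date(2005, 4, 1))}

def match_score(record, profile, max_days=14):
    """Count titles rated identically in both sources within max_days of each other."""
    score = 0
    for title, (rating, when) in profile.items():
        if title in record:
            released_rating, released_when = record[title]
            if released_rating == rating and abs((released_when - when).days) <= max_days:
                score += 1
    return score

def best_match(released, profile):
    """Return the pseudonymous ID with the highest score, plus all scores."""
    scores = {uid: match_score(rec, profile) for uid, rec in released.items()}
    return max(scores, key=scores.get), scores

print(best_match(released, public_profile))
```

Two overlapping ratings are enough to separate the two records here, which mirrors the observation that a small overlap can suffice in sparse behavioural data.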

The consequences were significant. The findings highlighted the inadequacy of traditional anonymisation methods for behavioural data and triggered public criticism and legal action against Netflix. Ultimately, Netflix cancelled its planned follow-up competition, illustrating the reputational and legal risks associated with underestimating re-identification threats [5].

## 6. Recommendations for Future Myki Data Releases

To reduce re-identification risk, data custodians should adopt a proactive, risk-based approach to data release. First, re-identification testing should be conducted by deliberately attempting to link anonymised datasets with realistic external data sources, such as publicly shared travel information. This provides an empirical measure of real-world risk.
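A re-identification test of this kind might be sketched as follows. The tester plays an outsider who knows a few approximate sightings of a target and checks how many pseudonymous cards remain consistent with all of them. The record layout, station names, and five-minute tolerance are assumptions, not details of the actual release.

```python
from datetime import datetime, timedelta

# Hypothetical released records: (pseudonymous card ID, station, touch-on time)
released = [
    ("c1", "Richmond", datetime(2018, 7, 2, 8, 4)),
    ("c1", "Parliament", datetime(2018, 7, 3, 17, 31)),
    ("c2", "Richmond", datetime(2018, 7, 2, 8, 6)),
    ("c2", "Southern Cross", datetime(2018, 7, 3, 17, 30)),
    ("c3", "Box Hill", datetime(2018, 7, 2, 8, 5)),
]

# What the tester assumes an outsider knows, e.g. from two public posts
observations = [
    ("Richmond", datetime(2018, 7, 2, 8, 5)),
    ("Parliament", datetime(2018, 7, 3, 17, 30)),
]

def candidate_cards(released, observations, tolerance=timedelta(minutes=5)):
    """Return the cards that have a touch-on matching every observation."""
    candidates = {card for card, _, _ in released}
    for station, seen_at in observations:
        matching = {
            card
            for card, touched_station, touched_at in released
            if touched_station == station and abs(touched_at - seen_at) <= tolerance
        }
        candidates &= matching
    return candidates

print(candidate_cards(released, observations[:1]))  # after one sighting
print(candidate_cards(released, observations))      # after two sightings
```

A candidate set that shrinks to a single card after only a few sightings is direct evidence that the release is not safe in its current form.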

Second, privacy risk scoring frameworks can be applied to assess re-identification likelihood based on factors such as the number of quasi-identifiers, the uniqueness of individual patterns, and the accessibility of auxiliary datasets. Third, simulation of linkage scenarios using synthetic or hypothetical external data can help identify vulnerable data combinations before release.
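One simple scoring approach, offered here as an illustrative assumption rather than a prescribed framework, assigns each record a risk of 1/k, where k is the number of records sharing the same quasi-identifier values. The records and the threshold below are hypothetical.

```python
from collections import Counter

# Hypothetical records reduced to quasi-identifiers: (origin, destination, hour)
records = [
    ("Richmond", "Flinders St", 8),
    ("Richmond", "Flinders St", 8),
    ("Richmond", "Flinders St", 8),
    ("Box Hill", "Parliament", 7),
    ("Werribee", "Footscray", 23),
]

def risk_scores(records):
    """Score each record as 1/k, where k is the size of its equivalence class."""
    class_sizes = Counter(records)
    return [1 / class_sizes[r] for r in records]

def share_at_risk(records, k_min=2):
    """Fraction of records whose equivalence class is smaller than k_min."""
    class_sizes = Counter(records)
    risky = sum(1 for r in records if class_sizes[r] < k_min)
    return risky / len(records)

print(risk_scores(records))
print(share_at_risk(records))
```

A score of 1.0 marks a record that is unique on the chosen quasi-identifiers. The result depends heavily on which fields are treated as quasi-identifiers, so the choice should reflect what an outsider could plausibly know.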

Finally, principles of data minimisation and aggregation should guide release decisions. Only data necessary for defined research objectives should be shared, with aggregation by station, time blocks, or user cohorts preferred over individual-level trip records.

## 7. Technical Techniques to Reduce Re-identification Risk

A layered technical approach can reduce re-identification risk while preserving analytical value. Aggregation hides individual-level sequences while retaining population-level trends, such as total trips per station and hour. Rounding and generalisation, including coarsening timestamps or locations, further reduce uniqueness. Suppression of rare or highly unique records protects individuals whose patterns are most vulnerable to linkage attacks.

### Aggregation
- Combine data into groups (e.g., by station clusters, time blocks, or user cohorts) to reduce uniqueness.
- Preserves trends and patterns while hiding individual-level information.

Example (pandas): group trips by station and hour

In [None]:
# Aggregate trips by origin station and hour of day.
# Assumes `trips` is a pandas DataFrame with a datetime `timestamp` column.
trips['hour_block'] = trips['timestamp'].dt.hour
aggregated = (
    trips.groupby(['origin_station', 'hour_block'])
         .size()
         .reset_index(name='trip_count')
)

Effect: Individual travel sequences are hidden, but total traffic per station and hour remains analysable.
### Generalisation
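- Replace precise values with coarser categories (e.g., 15-minute time blocks instead of exact timestamps, or zones instead of individual stations).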

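Example (Python): rounding timestamps down to a fixed interval

In [None]:
def round_time(timestamp, interval_minutes=15):
    # Floor a datetime to the start of its interval, e.g. 08:07:42 -> 08:00:00 (intervals up to 60 minutes)
    floored_minute = timestamp.minute - timestamp.minute % interval_minutes
    return timestamp.replace(minute=floored_minute, second=0, microsecond=0)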

Effect: Reduces uniqueness of trips without losing overall temporal trends.
### Suppression of Rare or Unique Records
- Remove trips or users with rare patterns that make re-identification easy (e.g., only one user travelling a particular route at a rare hour).

Example (pandas): filter rare trips

In [None]:
# Remove origin-destination pairs with fewer than 5 trips.
# transform() returns one count per row, so the mask aligns with `trips`.
pair_counts = trips.groupby(['origin_station', 'destination_station'])['trip_id'].transform('count')
trips = trips[pair_counts >= 5]

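Effect: Protects individuals with highly unique travel patterns. Because a single frequent traveller can generate five trips on a rare route alone, applying the threshold to distinct cards per route rather than to trips gives stronger protection.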
### Differential Privacy (DP)
- Introduce carefully calibrated noise to data aggregates so that adding or removing any one individual’s records changes the output only by a bounded amount, while statistical properties are approximately preserved.
- Example: adding Laplace noise to station counts.

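Example (NumPy): Laplace noise for DP

In [None]:
import numpy as np

epsilon = 0.1      # privacy budget: smaller values mean more noise and stronger privacy
sensitivity = 1    # assumes each card contributes at most one trip per cell
rng = np.random.default_rng()
# Assumes `aggregated` is a DataFrame with a numeric `trip_count` column
noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=len(aggregated))
aggregated['noisy_count'] = aggregated['trip_count'] + noise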

Effect: Retains approximate patterns and distributions while mathematically limiting re-identification risk.
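The amount of noise follows from the Laplace mechanism. As a sketch, assuming a count query $f$ with true cell count $c$ and neighbouring datasets $D$, $D'$ that differ in the records of a single card (the choice of a card as the unit of protection is an assumption here):

```latex
\tilde{c} = c + \eta, \qquad
\eta \sim \mathrm{Laplace}\!\left(0,\ \frac{\Delta f}{\varepsilon}\right), \qquad
\Delta f = \max_{D \sim D'} \bigl\lvert f(D) - f(D') \bigr\rvert

% Worked example with \varepsilon = 0.1:
% one trip per card per cell:        \Delta f = 1  \Rightarrow b = 10,  \ \sigma = \sqrt{2}\, b \approx 14.1
% up to 20 trips per card per cell:  \Delta f = 20 \Rightarrow b = 200, \ \sigma \approx 282.8
```

The second case shows why contributions per card usually need to be capped before noise is added: without a cap, the sensitivity of a trip count is as large as the most frequent traveller's contribution, and the noise required for the same $\varepsilon$ grows accordingly.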

Unlike the heuristic techniques above, differential privacy provides a formal, quantifiable privacy guarantee for the statistical outputs it protects, at the cost of some accuracy. When combined, these techniques make linkage attacks substantially more difficult without eliminating the dataset’s research utility.

## 8. Conclusion

The Myki data incident demonstrates that de-identification is not a purely technical exercise but a contextual, legal, and risk-based assessment. Legal frameworks in Australia and the European Union converge on the principle of reasonable identifiability, implicitly encompassing re-identification through data linkage. Lessons from both the Myki and Netflix cases underline the need for forward-looking risk assessments, rigorous testing, and privacy-preserving design. By adopting these measures, future Myki data releases can better balance public value with the protection of individual privacy.

## References
[1] Office of the Victorian Information Commissioner. (2019). Disclosure of myki travel information: Investigation under section 8C(2)(e) of the Privacy and Data Protection Act 2014 (Vic).
[2] General Data Protection Regulation (GDPR). (n.d.). Recital 26 - Not Applicable to Anonymous Data. [online] Available at: https://gdpr-info.eu/recitals/no-26/ [Accessed 7 Jan. 2026].
[3] Schneier, B. (2007). Why ‘Anonymous’ Data Sometimes Isn’t. [online] WIRED. Available at: https://www.wired.com/2007/12/why-anonymous-data-sometimes-isnt [Accessed 9 Jan. 2026].
[4] Vic.gov.au. (2020). An Introduction to De-Identification – Office of the Victorian Information Commissioner. [online] Available at: https://ovic.vic.gov.au/privacy/resources-for-organisations/an-introduction-to-de-identification [Accessed 9 Jan. 2026].
[5] Applying Theoretical Advances in Privacy to Computational Social Science Practice. (n.d.). Available at: https://privacytools.seas.harvard.edu/sites/g/files/omnuum6656/files/privacytools/files/sloan_project_proposal.pdf [Accessed 9 Jan. 2026].
