# iTunes Library Data Cleaning

This notebook demonstrates cleaning iTunes library data. I use a custom reader module, `iTunes`, to keep the notebook tidy.

In [1]:
# Custom-made reader
from iTunes import Library, Utils

## Read / Load Data

The module reads two file formats: the XML file exported from iTunes and the MessagePack file generated by the module itself.

In [2]:
# lib = Library.from_xml(r'..\..\xml\Music\2025-11.xml') # iTunes XML format
lib = Library.from_msgpack(r'.\data\lib.msgpack') # Message pack format

print(lib)
display(lib.data.head())

iTunes Library <2836 tracks>


|   | Track ID | Name | Artist | Composer | Album | Genre | Year | Date Modified | Date Added | Play Count | Size | Total Time | Disc Number | Track Number | Tags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3816 | 十年 | 陳奕迅 |  | THE 1ST ELEVEN YEARS 然後呢? | Mando Pop | 2008 | 2020-12-20 01:08:28 | 2020-12-02 14:24:31 | 40 | 8144121 | 0 days 00:03:21.926000 | 1.0 |  | {List: Mandarin, Library, Music} |
| 1 | 3818 | 也可以 | 閻奕格 |  | 我有我自己 | Mando Pop | 2017 | 2021-02-15 14:59:40 | 2021-02-16 05:48:44 | 22 | 4418942 | 0 days 00:04:30.680000 | 1.0 | 10.0 | {List: Mandarin, Library, Music} |
| 2 | 3820 | 大哥 | 衛蘭 |  | My Love | Canto Pop | 2005 | 2021-03-29 11:03:21 | 2020-12-02 14:24:31 | 26 | 4335914 | 0 days 00:03:50.739000 |  |  | {Library, Music, List: Cantonese} |
| 3 | 3822 | 小手拉大手 | 梁靜茹 |  | 親親 | Mando Pop | 2006 | 2021-03-29 10:56:28 | 2020-12-02 13:50:11 | 17 | 4658111 | 0 days 00:04:04.349000 |  |  | {List: Mandarin, Library, Music} |
| 4 | 3824 | 小幸運 | 田馥甄 |  | 小幸運 | Mando Pop | 2015 | 2021-03-29 10:59:10 | 2020-12-02 14:24:31 | 23 | 10709820 | 0 days 00:04:25.586000 |  |  | {List: Mandarin, Library, Music} |
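
For context on the XML format mentioned above: an iTunes export is an Apple property list whose top-level `Tracks` dictionary maps track IDs to per-track fields. The sketch below reads such a file with the standard-library `plistlib`. It is only an illustration under that assumption, not how `Library.from_xml` is implemented; `SAMPLE_XML` and `read_tracks` are hypothetical names, and the sample data is made up.

```python
import plistlib

# A tiny hand-written stand-in for an iTunes XML export (not real library data).
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
    <key>Tracks</key>
    <dict>
        <key>101</key>
        <dict>
            <key>Track ID</key><integer>101</integer>
            <key>Name</key><string>Example Song</string>
            <key>Artist</key><string>Example Artist</string>
            <key>Play Count</key><integer>3</integer>
        </dict>
    </dict>
</dict>
</plist>
"""


def read_tracks(xml_bytes: bytes) -> list[dict]:
    """Return one dict per track from an iTunes-style property list."""
    root = plistlib.loads(xml_bytes)
    # "Tracks" maps a track-ID string to that track's field dictionary.
    return list(root.get("Tracks", {}).values())


tracks = read_tracks(SAMPLE_XML)
print(tracks[0]["Name"], tracks[0]["Play Count"])
```

A list of dictionaries like this can be passed straight to `pandas.DataFrame` to obtain a table similar to the one shown above.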


## Filter Tags

First, the data contains some tracks that do not belong to the library. To remove them, a predefined tag whitelist decides whether each track is kept.

In [3]:
tag_map: dict[str, str] = Utils.read_yaml(r'.\data\tags.yaml')
lib = lib.map('Tags', tag_map).filter('Tags', tag_map.values())

print(lib)
display(lib.data.head().iloc[:, [0, 1, 2, 14]])

iTunes Library <2525 tracks>


|   | Track ID | Name | Artist | Tags |
|---|---|---|---|---|
| 0 | 3816 | 十年 | 陳奕迅 | {Mandarin} |
| 1 | 3818 | 也可以 | 閻奕格 | {Mandarin} |
| 2 | 3820 | 大哥 | 衛蘭 | {Cantonese} |
| 3 | 3822 | 小手拉大手 | 梁靜茹 | {Mandarin} |
| 4 | 3824 | 小幸運 | 田馥甄 | {Mandarin} |
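
The map-then-filter step above can be sketched in plain Python. This is a guess at the behaviour based on the output shown, not the module's actual code: it assumes that mapping renames whitelisted tags and drops all others, and that a track is kept when at least one mapped tag remains. `map_tags`, `keep_track`, and the contents of `tag_map` are hypothetical, since the real `tags.yaml` is not shown.

```python
def map_tags(tags: set[str], tag_map: dict[str, str]) -> set[str]:
    """Rename whitelisted tags and drop every tag missing from the map."""
    return {tag_map[tag] for tag in tags if tag in tag_map}


def keep_track(tags: set[str], allowed: set[str]) -> bool:
    """Keep a track only if at least one of its tags is allowed."""
    return bool(tags & allowed)


# Hypothetical whitelist; the real contents of tags.yaml are not shown here.
tag_map = {"List: Mandarin": "Mandarin", "List: Cantonese": "Cantonese"}

# Made-up tracks for illustration.
tracks = [
    {"Name": "A", "Tags": {"List: Mandarin", "Library", "Music"}},
    {"Name": "B", "Tags": {"Library", "Music", "List: Cantonese"}},
    {"Name": "C", "Tags": {"Library", "Podcasts"}},
]

allowed = set(tag_map.values())
cleaned = []
for track in tracks:
    mapped = map_tags(track["Tags"], tag_map)
    if keep_track(mapped, allowed):
        # Copy the track with its tags replaced by the mapped set.
        cleaned.append({**track, "Tags": mapped})

print([(t["Name"], t["Tags"]) for t in cleaned])
```

Track `C` has no whitelisted tag, so it is the only one dropped.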


## Gather Artists

To identify the artists of each track, the module derives them from the Artist and Name fields and stores them as a nested list. \
Note also that an artist may have multiple names, so I provide another predefined conversion table to merge their releases.

In [4]:
artist_map: dict[str, str | list[str]] = Utils.read_yaml(r'.\data\artists.yaml')
lib = lib.nested_artists(artist_map, artists_with_comma = ['接個吻,開一槍'])

display(lib.data.tail().iloc[:, [0, 1, 2, 3]])

|   | Track ID | Name | Artist | Composer |
|---|---|---|---|---|
| 2520 | 9466 | Hello I Miss U | [Mazare, SadBois, Carter Rubin] | Maarten Vorwerk, Spencer Jordan, Elias Nichola... |
| 2521 | 9468 | Upside Down | [Cloudy Parallels, Xentry] | Caleb Hunter, Nhan Đức & Azend |
| 2522 | 9470 | Prism Gate | [Tatsunoshin, Aira Arere, GALSTYLEZ, Massive N... | Tatsunoshin & Massive New Krew |
| 2523 | 9472 | Novel | [technoplanet, Tamako Kinoshita, Shumpei Tsuyama] | technoplanet |
| 2524 | 9474 | Glide | [Myuk] | knoak |
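
One way such artist gathering could work is sketched below. It rests on several assumptions that the notebook does not confirm: credits are separated by `,`, `&`, `feat.` or `ft.`; featured artists appear in the title as `(feat. ...)`; and the conversion table maps an alias to one canonical name or to a list of names. `split_artists` and `gather_artists` are hypothetical helpers, not the module's `nested_artists`.

```python
import re

# Separators assumed between credited artists: ",", "&", "feat.", "ft.".
SPLIT_RE = re.compile(r"\s*(?:,|&|\bfeat\.|\bft\.)\s*", re.IGNORECASE)
# Featured artists embedded in a title, e.g. "Song (feat. X & Y)".
FEAT_RE = re.compile(r"[(\[]\s*(?:feat|ft)\.?\s+([^)\]]+)[)\]]", re.IGNORECASE)


def split_artists(credit: str, protected=()) -> list[str]:
    """Split a credit string, leaving names listed in `protected` intact."""
    # Swap protected names for placeholders so their commas survive the split.
    for i, name in enumerate(protected):
        credit = credit.replace(name, f"\x00{i}\x00")
    parts = [p for p in SPLIT_RE.split(credit) if p]
    restored = []
    for part in parts:
        # Put the protected names back in place of their placeholders.
        for i, name in enumerate(protected):
            part = part.replace(f"\x00{i}\x00", name)
        restored.append(part)
    return restored


def gather_artists(artist: str, name: str, alias_map: dict, protected=()) -> list[str]:
    """Collect artists from the Artist field and from the title's feat. credit."""
    credits = split_artists(artist, protected)
    match = FEAT_RE.search(name)
    if match:
        credits += split_artists(match.group(1), protected)
    result = []
    for credit in credits:
        canonical = alias_map.get(credit, credit)
        # Assumption: an alias may map to one name or to a list of names.
        names = canonical if isinstance(canonical, list) else [canonical]
        for n in names:
            if n not in result:  # Drop duplicates while keeping order.
                result.append(n)
    return result


# Made-up credits; only the protected name comes from the cell above.
print(gather_artists("Alpha & Beta-san", "Song (feat. Gamma)", {"Beta-san": "Beta"}))
print(split_artists("接個吻,開一槍, Alpha", ["接個吻,開一槍"]))
```

The placeholder swap is what keeps a name such as `接個吻,開一槍` from being split at its comma, which is the role `artists_with_comma` appears to play in the call above.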


## Export Cleaned Data

The cleaned data is exported for further analysis.

In [5]:
lib.to_msgpack(r'.\data\lib-cln.msgpack')