# Ontology #2: Music Domain    

## Data generation approach  
   

### Main Classes    
    
1. **Genre**      
2. **RecordLabel**      
3. **Award**      
4. **Artist**      
5. **Album**      
6. **Song**      
   - References: Artist(s), Album, RecordLabel (via Artist), multiple Genres, Award(s), etc.    
    
### Generation Order & Rationale    
    
1. **Genre**      
   - Foundational list (e.g., “Rock,” “Pop,” “Jazz”).      
   - Other classes (Songs, Albums) will reference these genres.    
    
2. **RecordLabel**      
   - Independent entities. Artists can be “signedTo” a label.      
   - Must exist before we generate Artists who might link to them.    
    
3. **Award**      
   - Also an independent list (e.g., “Grammy Award”).      
   - Songs can reference these if they “haveWonAward.”      
    
4. **Artist**      
   - Some artists will reference a record label (stored as `labelID` in the code below) if they’re signed to a label.      
   - So we create the labels first, then the artists can pick from them.    
    
5. **Album**      
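   - Conceptually linked to an `Artist` (the album artist, or via “performedBy” / “featuredOn” relationships).      
   - Also references one or more `Genre` IDs.      
   - So `Artist` and `Genre` must exist first. Note that the implementation below stores only `genreIDs` on each album; artist links are carried by each Song's `artistIDs`.    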
    
6. **Song**      
   - The most dependent entity: references `Artist`(s), possibly an `AlbumID`, `GenreID`(s), `AwardID`(s).      
   - Because it can have multiple relationships (artistIDs, albumID, awards, etc.), all those must exist prior.   
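
The ordering above can also be derived mechanically from the dependencies. The following is a minimal sketch, assuming Python 3.9+ for the standard-library `graphlib` module; the `DEPENDENCIES` mapping is a hypothetical restatement of the list above and is not part of the notebook.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each class maps to the classes it references (its prerequisites),
# as described in the generation-order list.
DEPENDENCIES = {
    "Genre": set(),
    "RecordLabel": set(),
    "Award": set(),
    "Artist": {"RecordLabel"},
    "Album": {"Artist", "Genre"},
    "Song": {"Artist", "Album", "Genre", "Award"},
}

# static_order() yields every class only after all of its prerequisites.
generation_order = list(TopologicalSorter(DEPENDENCIES).static_order())
print(generation_order)
```

Several valid orders exist (the three independent classes can come in any sequence), so the printed order may differ from the numbered list while still respecting every dependency.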

## Implementation

In [1]:
import random
import pandas as pd
from faker import Faker

In [2]:
data_path = "./data/"

In [3]:
fake = Faker()

In [4]:
# Predefined genre info
GENRE_OPTIONS = [
    {"name": "Rock", "description": "Characterized by a strong beat and typically guitar-based"},
    {"name": "Pop", "description": "Popular music, often catchy and mainstream"},
    {"name": "Hip Hop", "description": "Rhythmic music featuring rap and urban culture"},
    {"name": "Jazz", "description": "Originated from African-American communities, improvisation-based"},
    {"name": "Classical", "description": "Rooted in the tradition of Western culture (symphonies, operas)"},
    {"name": "R&B", "description": "Rhythm and Blues, soulful vocals and strong backbeat"},
    {"name": "Electronic", "description": "Synthesizer-driven, EDM, house, techno, etc."},
    {"name": "Country", "description": "Southern US origins, storytelling, guitar and banjo"},
    {"name": "Reggae", "description": "Jamaican origin, syncopated rhythms, offbeat accents"},
    {"name": "Latin", "description": "Spanish/Portuguese language, salsa, bachata, reggaeton"},
    {"name": "Metal", "description": "Heavier rock with distorted guitars and aggressive vocals"},
    {"name": "Blues", "description": "African-American origin, soulful, 'blue' notes and chord progressions"},
    {"name": "Funk", "description": "Strong rhythmic groove, syncopated bass lines"},
    {"name": "Soul", "description": "Combines elements of gospel and R&B"},
    {"name": "Folk", "description": "Traditional, storytelling focus, acoustic instruments"}
]

In [5]:
# Predefined awarding bodies
AWARDING_BODIES = [
    "Recording Academy",
    "Billboard",
    "American Music Association",
    "MTV",
    "British Phonographic Industry",
    "Canadian Academy of Recording Arts and Sciences"
]

In [6]:
# Predefined award sets
AWARD_NAMES = [
    "Grammy Award for Song of the Year",
    "Grammy Award for Record of the Year",
    "Billboard Music Award",
    "American Music Award",
    "MTV Video Music Award",
    "Brit Award",
    "Juno Award",
    "Country Music Association Award",
    "Soul Train Music Award"
]

In [7]:
# 1. Configuration
NUM_SONGS = 600
NUM_ARTISTS = 100
NUM_ALBUMS = 100
NUM_LABELS = 25
NUM_GENRES = 15  # matches the 15 entries in GENRE_OPTIONS
NUM_AWARDS = 30  # award records; names and bodies are drawn from the lists above

# Optional: seed for reproducibility
# random.seed(42)
# Faker.seed(42)
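
To illustrate what the optional seed buys, here is a small sketch that uses only the standard library. The helper `draw_release_years` is hypothetical and exists only for this example; values produced by Faker would additionally need the commented-out `Faker.seed(42)` line to be enabled.

```python
import random

def draw_release_years(seed, n=5):
    """Hypothetical helper: draw n release years from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, independent of the global random state
    return [rng.randint(1980, 2023) for _ in range(n)]

first_run = draw_release_years(42)
second_run = draw_release_years(42)
print(first_run == second_run)  # True: the same seed gives the same sequence
```

Without a seed, every run of the notebook yields different entities, which is why the sample outputs shown later will not be reproduced exactly.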

In [8]:
# 2. Generate Entities

In [9]:
# a) Genres
# Draw without replacement so every genre name appears at most once
# (capped at the number of predefined options).
genres = []
selected = random.sample(GENRE_OPTIONS, k=min(NUM_GENRES, len(GENRE_OPTIONS)))
for i, genre_info in enumerate(selected):
    genres.append({
        "id": f"genre_{i}",
        "name": genre_info["name"],
        "description": genre_info["description"]
    })

In [10]:
# b) Record Labels
labels = []
for i in range(NUM_LABELS):
    labels.append({
        "id": f"label_{i}",
        "labelName": fake.company(),
        "location": fake.city()
    })

In [11]:
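# c) Awards
# Note: awardName and awardingBody are drawn independently, so a record can pair
# an award with an unrelated body (see the sample output further down).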
awards = []
for i in range(NUM_AWARDS):
    awards.append({
        "id": f"award_{i}",
        "awardName": random.choice(AWARD_NAMES),
        "year": random.randint(1990, 2023),
        "awardingBody": random.choice(AWARDING_BODIES)
    })
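
If consistent name/body pairs matter, one option is a lookup table. The sketch below is an assumption-laden illustration, not the notebook's method: `AWARD_TO_BODY`, `FALLBACK_BODIES`, and `make_award` are hypothetical names, and only award names whose awarding body appears in the lists above are mapped. Anything else falls back to a random body, as in the original cell.

```python
import random

# Hypothetical lookup table: only pairings covered by the lists above are included.
AWARD_TO_BODY = {
    "Grammy Award for Song of the Year": "Recording Academy",
    "Grammy Award for Record of the Year": "Recording Academy",
    "Billboard Music Award": "Billboard",
    "MTV Video Music Award": "MTV",
    "Brit Award": "British Phonographic Industry",
    "Juno Award": "Canadian Academy of Recording Arts and Sciences",
}
FALLBACK_BODIES = sorted(set(AWARD_TO_BODY.values()))

def make_award(index, award_name, rng=random):
    """Build one award record whose awarding body matches its name when known."""
    body = AWARD_TO_BODY.get(award_name) or rng.choice(FALLBACK_BODIES)
    return {
        "id": f"award_{index}",
        "awardName": award_name,
        "year": rng.randint(1990, 2023),
        "awardingBody": body,
    }

# The last name is unmapped on purpose to exercise the fallback path.
sample_names = list(AWARD_TO_BODY) + ["Soul Train Music Award"]
consistent_awards = [make_award(i, name) for i, name in enumerate(sample_names)]
print(consistent_awards[0])
```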

In [12]:
# d) Artists
artists = []
for i in range(NUM_ARTISTS):
    # ~50% chance the artist is signed to a label
    maybe_label = random.choice(labels)["id"] if random.random() < 0.5 else None
    artists.append({
        "id": f"artist_{i}",
        "name": fake.name(),
        "birthDate": fake.date_of_birth(minimum_age=18, maximum_age=70).isoformat(),
        "nationality": fake.country(),
        "labelID": maybe_label  # can be None
    })

In [13]:
# e) Albums
albums = []
for i in range(NUM_ALBUMS):
    # multiple genres possible: choose 1-2 random
    album_genres = random.sample(genres, k=random.randint(1, 2))
    albums.append({
        "id": f"album_{i}",
        "title": f"{fake.catch_phrase()} Album {i}",
        "releaseYear": random.randint(1980, 2023),
        # store genre IDs
        "genreIDs": [g["id"] for g in album_genres]
    })
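
The design notes describe albums as referencing an artist, while the cell above stores only genres. A possible way to add that link is sketched here with tiny stand-in lists; the `artistID` field and the `demo_*` names are hypothetical and do not appear in the notebook.

```python
import random

# Tiny stand-in lists; in the notebook these would be the full `artists` and `genres`.
demo_artists = [{"id": f"artist_{i}"} for i in range(3)]
demo_genres = [{"id": f"genre_{i}"} for i in range(4)]

demo_albums = []
for i in range(5):
    album_genres = random.sample(demo_genres, k=random.randint(1, 2))
    demo_albums.append({
        "id": f"album_{i}",
        "releaseYear": random.randint(1980, 2023),
        "artistID": random.choice(demo_artists)["id"],  # hypothetical field, not in the original cell
        "genreIDs": [g["id"] for g in album_genres],
    })

print(demo_albums[0])
```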

In [14]:
# 3. Generate Songs
songs = []
for i in range(NUM_SONGS):
    # each song has 1-2 performing artists
    assigned_artists = random.sample(artists, k=random.randint(1,2))
    # maybe the song is on an album, or maybe it's a single
    if random.random() < 0.8:
        # 80% chance the song is on some album
        album_obj = random.choice(albums)
        albumID = album_obj["id"]
        albumGenres = album_obj["genreIDs"]
    else:
        albumID = None
        albumGenres = []

    # choose 1 or 2 genres, possibly from the album's genres, or random
    possible_genres = albumGenres or [g["id"] for g in genres]  # if album has genres, use those first
    chosen_genres = random.sample(possible_genres, k=random.randint(1, min(2, len(possible_genres))))

    # ~10% chance the song has an award
    assigned_awards = []
    if random.random() < 0.1:
        assigned_awards = random.sample(awards, k=1)

    # random duration 120-420s (2-7 minutes)
    duration_sec = random.randint(120, 420)

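    # random release date within the last 43 years, relative to the run date
    # (back to about 1980 when run in 2023); independent of the album's releaseYear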
    release_date = fake.date_between(start_date='-43y', end_date='today')

    songs.append({
        "id": f"song_{i}",
        "title": f"{fake.catch_phrase()} Song {i}",
        "duration": duration_sec,
        "releaseDate": release_date.isoformat(),
        "artistIDs": [a["id"] for a in assigned_artists],
        "albumID": albumID,
        "genreIDs": chosen_genres,
        "awardIDs": [aw["id"] for aw in assigned_awards]
    })
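
Because songs depend on every other entity, a quick referential-integrity check can catch ordering mistakes. The function below is a hedged sketch: `find_dangling_refs` and the `demo_*` data are hypothetical, and the field names simply mirror those used in the song records above.

```python
def find_dangling_refs(songs, artists, albums, genres, awards):
    """Return (song_id, field, missing_id) for every reference that points nowhere."""
    known = {
        "artistIDs": {a["id"] for a in artists},
        "genreIDs": {g["id"] for g in genres},
        "awardIDs": {aw["id"] for aw in awards},
    }
    album_ids = {al["id"] for al in albums}
    problems = []
    for song in songs:
        for field, valid_ids in known.items():
            for ref in song[field]:
                if ref not in valid_ids:
                    problems.append((song["id"], field, ref))
        # albumID is optional: None means the song is a single
        if song["albumID"] is not None and song["albumID"] not in album_ids:
            problems.append((song["id"], "albumID", song["albumID"]))
    return problems

# Small hand-written example with one deliberately broken reference.
demo_artists = [{"id": "artist_0"}]
demo_albums = [{"id": "album_0"}]
demo_genres = [{"id": "genre_0"}]
demo_awards = [{"id": "award_0"}]
demo_songs = [
    {"id": "song_0", "artistIDs": ["artist_0"], "albumID": "album_0",
     "genreIDs": ["genre_0"], "awardIDs": []},
    {"id": "song_1", "artistIDs": ["artist_9"], "albumID": None,
     "genreIDs": ["genre_0"], "awardIDs": ["award_0"]},
]

dangling = find_dangling_refs(demo_songs, demo_artists, demo_albums, demo_genres, demo_awards)
print(dangling)  # [('song_1', 'artistIDs', 'artist_9')]
```

Applied to the notebook's own lists, the same call would be expected to return an empty list, since every ID is sampled from entities that already exist.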

In [15]:
# -------------------------
# 4. Summary and Sample Prints
# -------------------------
print("Entities Generated:")
print("  Genres:", len(genres))
print("  Record Labels:", len(labels))
print("  Awards:", len(awards))
print("  Artists:", len(artists))
print("  Albums:", len(albums))
print("  Songs:", len(songs))

Entities Generated:
  Genres: 15
  Record Labels: 25
  Awards: 30
  Artists: 100
  Albums: 100
  Songs: 600


In [16]:
print("\nSample Genre:", genres[0])
print("Sample Record Label:", labels[0])
print("Sample Award:", awards[0])
print("Sample Artist:", artists[0])
print("Sample Album:", albums[0])
print("Sample Song:", songs[0])


Sample Genre: {'id': 'genre_0', 'name': 'Funk', 'description': 'Strong rhythmic groove, syncopated bass lines'}
Sample Record Label: {'id': 'label_0', 'labelName': 'Kelly, Taylor and Fletcher', 'location': 'Ortizchester'}
Sample Award: {'id': 'award_0', 'awardName': 'Grammy Award for Record of the Year', 'year': 2002, 'awardingBody': 'British Phonographic Industry'}
Sample Artist: {'id': 'artist_0', 'name': 'George Rivera', 'birthDate': '1997-05-15', 'nationality': 'Saint Vincent and the Grenadines', 'labelID': None}
Sample Album: {'id': 'album_0', 'title': 'Reverse-engineered human-resource paradigm Album 0', 'releaseYear': 2004, 'genreIDs': ['genre_8', 'genre_10']}
Sample Song: {'id': 'song_0', 'title': 'Re-engineered analyzing benchmark Song 0', 'duration': 402, 'releaseDate': '1983-10-03', 'artistIDs': ['artist_97', 'artist_91'], 'albumID': 'album_94', 'genreIDs': ['genre_3', 'genre_4'], 'awardIDs': []}


In [17]:
# persist the data: one CSV file per entity type
import os
os.makedirs(data_path, exist_ok=True)  # create the output directory if it is missing
entity_tables = {
    "genres": genres, "labels": labels, "awards": awards,
    "artists": artists, "albums": albums, "songs": songs,
}
for name, rows in entity_tables.items():
    pd.DataFrame(rows).to_csv(os.path.join(data_path, f"{name}.csv"), encoding="utf-8", escapechar="\\", index=False)
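
List-valued columns such as `genreIDs` end up as plain text in a CSV, so a reader has to parse them back into lists. One way to make that round trip explicit is to encode such columns as JSON before writing. The sketch below uses only the standard-library `csv`, `io`, and `json` modules and an in-memory buffer; the sample rows and this encoding choice are assumptions for illustration, not what the notebook does.

```python
import csv
import io
import json

rows = [
    {"id": "song_0", "albumID": "album_94", "genreIDs": ["genre_3", "genre_4"]},
    {"id": "song_1", "albumID": None, "genreIDs": ["genre_7"]},
]

# Write: encode the list column as JSON text so it survives the flat CSV format.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "albumID", "genreIDs"])
writer.writeheader()
for row in rows:
    writer.writerow({**row, "genreIDs": json.dumps(row["genreIDs"])})

# Read: decode the JSON text back into a list; an empty albumID means "no album".
buffer.seek(0)
restored = []
for row in csv.DictReader(buffer):
    restored.append({
        "id": row["id"],
        "albumID": row["albumID"] or None,
        "genreIDs": json.loads(row["genreIDs"]),
    })

print(restored == rows)  # True
```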