# 📘 Notebook Summary – 01_Data_Collection.ipynb
This notebook is the data collection pipeline for a broader music analytics project. It gathers the Billboard Year-End Hot 100 songs for 1959 through 2024, along with their lyrics, by combining web scraping with asynchronous API requests. Concurrent requests and a persistent cache keep repeated runs fast.

## Key Features:
* **Chart Scraping:** Scrapes the Billboard Year-End Hot 100 list for each year from the corresponding Wikipedia page.

* **Async Lyrics Fetching:** Uses aiohttp to fetch lyrics from the lyrics.ovh API, trying several title/artist variations with retries to raise the success rate.

* **Caching System:** Implements a JSON-based caching mechanism to avoid redundant API calls and improve performance across sessions.

* **Error Handling & Logging:** Tolerates missing data and writes songs whose lyrics could not be fetched to missing_lyrics_log.csv for later review.

* **Modular Functions:** Functions are clearly organized by role (scraping, cleaning, API calls, caching, orchestration), making the code reusable and extensible.

This notebook sets the foundation for subsequent analysis, including sentiment, word frequency, topic modeling, and visual storytelling based on lyrics trends.

## 📦 Imports & Setup

This section imports all the essential libraries used throughout the notebook:

* **Web Scraping & Requests**: requests, BeautifulSoup – to scrape Billboard Hot 100 chart data from Wikipedia.

* **Async I/O & Networking**: aiohttp, asyncio, nest_asyncio – to enable asynchronous fetching of lyrics from an API for improved performance in a Jupyter environment.

* **Data Handling**: pandas, numpy, json, os – for efficient data manipulation, caching, and storage.

* **Text Processing & NLP**: re, nltk, TextBlob, WordNetLemmatizer, stopwords – for cleaning, tokenizing, and analyzing song lyrics.

* **Visualization**: matplotlib, seaborn, WordCloud – to visualize insights such as most common words and trends.

* **Utility**: tqdm.asyncio – for progress bars in asynchronous loops, and logging for monitoring execution.

* **nest_asyncio.apply()** is called to allow nested event loops in Google Colab, so that asyncio.run() works inside the notebook, where an event loop is already running.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import aiohttp
import asyncio
import re
import urllib.parse
import time
import logging
from tqdm.asyncio import tqdm_asyncio
import nest_asyncio
import os
import json
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from wordcloud import WordCloud
from nltk import ngrams
import zipfile

# Applying nest_asyncio for notebook compatibility
nest_asyncio.apply()
logging.basicConfig(level=logging.INFO)
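
The need for nest_asyncio can be illustrated with the standard library alone. asyncio.run() raises a RuntimeError when the current thread already has a running event loop, which is the situation inside a Jupyter or Colab kernel. The sketch below only detects that situation; `loop_is_running` and `check_inside` are hypothetical helper names, not part of the notebook.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running
        return False
    return True


async def check_inside() -> bool:
    # A coroutine only executes while an event loop is running
    return loop_is_running()


if __name__ == "__main__":
    print(loop_is_running())            # called from plain synchronous code
    print(asyncio.run(check_inside()))  # called from inside asyncio.run()
```

When `loop_is_running()` is true at the top level of a notebook cell, calling asyncio.run() directly would fail unless nest_asyncio.apply() has patched the loop.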

## Creating a function to scrape data from the Billboard Year-End Hot 100 pages on Wikipedia

The function parses the HTML table on the Wikipedia page using ***BeautifulSoup***, handles rows where the artist cell is omitted because of a rowspan (consecutive songs by the same artist), and returns a clean pandas DataFrame with each song's rank, title, and artist.

In [None]:
# --------------------------
# WEB SCRAPING FUNCTIONS
# --------------------------

def scrape_hot_100(year):
    """Scrape Billboard Hot 100 for given year"""
    url = f"https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_{year}"
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
    except Exception as e:
        logging.error(f"Error fetching {year}: {str(e)}")
        return None

    table = soup.find('table', {'class': 'wikitable sortable'})
    if table is None:
        logging.warning(f"Hot 100 table not found for {year}")
        return None

    data = []
    current_artist = None

    for row in table.find_all('tr')[1:]:  # Skip header
        cols = row.find_all('td')

        if len(cols) == 3:  # Full row with artist
            try:
                rank = int(cols[0].text.strip())
                title = cols[1].text.strip().strip('"')
                current_artist = cols[2].text.strip()
                data.append([rank, title, current_artist])
            except ValueError:
                continue

        elif len(cols) == 2 and current_artist:  # Rowspan artist case
            try:
                rank = int(cols[0].text.strip())
                title = cols[1].text.strip().strip('"')
                data.append([rank, title, current_artist])
            except ValueError:
                continue

    if not data:
        return None

    df = pd.DataFrame(data, columns=["Rank", "Title", "Artist"])
    return df.set_index("Rank")
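
The rowspan handling in scrape_hot_100() can be followed more easily on rows that are already reduced to lists of cell texts. The sketch below is a simplified, HTML-free illustration of the same carry-forward logic; `rows_to_records` is a hypothetical helper and the sample rows are invented placeholders, not chart data.

```python
def rows_to_records(rows):
    """Turn lists of cell texts into (rank, title, artist) tuples.

    A 3-cell row carries its own artist. A 2-cell row reuses the artist of
    the most recent valid 3-cell row, which is the rowspan case.
    """
    records = []
    current_artist = None
    for cells in rows:
        has_artist = len(cells) == 3
        if not has_artist and not (len(cells) == 2 and current_artist):
            continue  # unexpected shape, or no artist seen yet
        try:
            rank = int(cells[0].strip())
        except ValueError:
            continue  # rank cell is not a number
        if has_artist:
            current_artist = cells[2].strip()
        title = cells[1].strip().strip('"')
        records.append((rank, title, current_artist))
    return records


if __name__ == "__main__":
    sample = [
        ["1", '"Song A"', "Artist X"],
        ["2", '"Song B"'],            # artist cell omitted by a rowspan
        ["3", '"Song C"', "Artist Y"],
    ]
    for record in rows_to_records(sample):
        print(record)
```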


## 🎵 Lyric Fetching Functions

This section defines asynchronous functions to fetch song lyrics from the lyrics.ovh API ([website](https://lyricsovh.docs.apiary.io/#) / [GitHub](https://github.com/NTag/lyrics.ovh)). It includes:



*   **clean_artist_name()**: Cleans artist names by removing extra punctuation, featured artists, and brackets to improve match accuracy.

*   **fetch_lyrics()**: Sends asynchronous API requests to retrieve lyrics for a given artist and song title.

*   **fetch_with_retries()**: Tries multiple variations of artist and song title combinations to maximize the chance of a successful match, with optional retries and delays for robustness.

These functions work together to collect lyrics data while handling naming inconsistencies common in music metadata.
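
One detail of building the request URL is worth a small sketch. urllib.parse.quote() leaves `/` unescaped by default (`safe='/'`), so an artist or title containing a slash would add an extra path segment. The helper below, a hypothetical `build_lyrics_url`, passes `safe=''` so that each name stays within a single segment. Whether the API resolves names encoded this way is not verified here.

```python
from urllib.parse import quote

API_ROOT = "https://api.lyrics.ovh/v1"  # endpoint used by the notebook


def build_lyrics_url(artist: str, title: str) -> str:
    """Percent-encode both path segments, including any '/' characters."""
    return f"{API_ROOT}/{quote(artist, safe='')}/{quote(title, safe='')}"


if __name__ == "__main__":
    print(build_lyrics_url("AC/DC", "Back in Black"))
```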

In [None]:
# --------------------------
# LYRIC FETCHING FUNCTIONS
# --------------------------

def clean_artist_name(artist):
    """Return the primary artist: strip bracketed text, then cut at the first separator"""
    if not isinstance(artist, str):
        return ""

    # Remove content in parentheses and brackets
    artist = re.sub(r'\([^)]*\)|\[[^\]]*\]', '', artist)

    # Cut at the first separator. Word separators (and, feat., ft., with, x)
    # must be surrounded by whitespace, so names such as "Matchbox Twenty"
    # are no longer split in the middle of a word.
    match = re.search(
        r'^([^,;&/]+?)\s*(?:[,;&/]|\s(?:and|feat(?:uring)?\.?|ft\.?|with|x)\s|$)',
        artist,
        flags=re.IGNORECASE,
    )
    return match.group(1).strip() if match else artist.strip()

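async def fetch_lyrics(session, artist, title, timeout=8):
    """Fetch lyrics for one artist/title pair; return '' on any failure"""
    # safe='' also encodes '/', which would otherwise add a path segment
    artist_encoded = urllib.parse.quote(artist, safe='')
    title_encoded = urllib.parse.quote(title, safe='')
    url = f"https://api.lyrics.ovh/v1/{artist_encoded}/{title_encoded}"

    try:
        # Per-request timeouts are passed as a ClientTimeout object;
        # a bare number is deprecated in aiohttp 3.x
        client_timeout = aiohttp.ClientTimeout(total=timeout)
        async with session.get(url, timeout=client_timeout) as response:
            if response.status == 200:
                data = await response.json()
                return data.get('lyrics', '')
            elif response.status == 404:
                return ''  # Known missing
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return ''
    return ''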

async def fetch_with_retries(session, artist, title, max_retries=3):
    """Optimized with precomputed variations"""
    # Clean artist name once
    primary_artist = clean_artist_name(artist)

    # Precompute title variations
    base_title = re.sub(r'\([^)]*\)', '', title).strip()
    base_title_no_punct = re.sub(r'[\!\?\.\,\']', '', base_title)
    base_title_no_feat = title.split(' (')[0].strip()

    # Prepare all variations upfront
    variations = [
        (primary_artist, title),
        (primary_artist, base_title),
        (primary_artist, base_title_no_feat),
        (artist.split(',')[0].strip(), title),
        (artist, title),
        (primary_artist, f"{base_title} (feat. ...)"),
        (primary_artist, base_title_no_punct),
    ]

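    # Add an extra variation for titles containing punctuation
    if any(char in title for char in ['!', '?', '.', ',', "'"]):
        variations.append((artist, base_title_no_punct))

    # Drop duplicate (artist, title) pairs while preserving order, so the
    # limited retry slots below are not spent on identical requests
    variations = list(dict.fromkeys(variations))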

    # Try variations without delay first
    for try_artist, try_title in variations[:max_retries]:
        lyrics = await fetch_lyrics(session, try_artist, try_title)
        if lyrics:
            return lyrics

    # Add delays only for remaining retries
    for i, (try_artist, try_title) in enumerate(variations[max_retries:max_retries*2]):
        await asyncio.sleep(0.2 * (i + 1))
        lyrics = await fetch_lyrics(session, try_artist, try_title)
        if lyrics:
            return lyrics

    return ''
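
The title handling in fetch_with_retries() can be tried in isolation. The sketch below is a reduced version, assuming only three candidates per title (the original, the title without parenthesised text, and that title without punctuation); `title_variations` is a hypothetical helper. Because titles without parentheses or punctuation produce identical candidates, the list is de-duplicated while keeping its order.

```python
import re


def title_variations(title: str) -> list[str]:
    """Return de-duplicated title candidates, most specific first."""
    no_parens = re.sub(r'\([^)]*\)', '', title).strip()
    no_punct = re.sub(r"[!?.,']", '', no_parens)
    # dict.fromkeys keeps the first occurrence of each candidate, in order
    return list(dict.fromkeys([title, no_parens, no_punct]))


if __name__ == "__main__":
    print(title_variations("Don't Stop (Remix)"))
    print(title_variations("Hello"))
```

A plain title collapses to a single candidate, so no request is repeated for it.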

## 💾 Lyrics Caching System
This class implements a JSON-based cache to store and reuse fetched lyrics, reducing redundant API calls and improving efficiency. Key features include:

*   Persistent storage in lyrics_cache.json.

*   Efficient lookup using normalized artist and song title keys.

*   Periodic auto-saving every 50 entries to avoid frequent file I/O.

This helps speed up lyric retrieval in repeated or large-scale scraping runs.

In [None]:
# --------------------------
# CACHE SYSTEM FOR REUSABLE RESULTS
# --------------------------

class LyricsCache:
    def __init__(self, cache_file='lyrics_cache.json'):
        self.cache_file = cache_file
        self.cache = self.load_cache()

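    def load_cache(self):
        if os.path.exists(self.cache_file):
            try:
                with open(self.cache_file, 'r', encoding='utf-8') as f:
                    return json.load(f)
            except (OSError, ValueError):
                # Unreadable or corrupt cache file (json.JSONDecodeError is
                # a ValueError): start with an empty cache
                return {}
        return {}

    def save_cache(self):
        with open(self.cache_file, 'w', encoding='utf-8') as f:
            json.dump(self.cache, f)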

    def get_key(self, artist, title):
        return f"{clean_artist_name(artist)}|||{title.lower().strip()}"

    def get(self, artist, title):
        return self.cache.get(self.get_key(artist, title), None)

    def set(self, artist, title, lyrics):
        key = self.get_key(artist, title)
        self.cache[key] = lyrics
        # Save periodically rather than on every set
        if len(self.cache) % 50 == 0:
            self.save_cache()

    def __len__(self):
        return len(self.cache)
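
The persistence behaviour can be checked with a round trip: write an entry, save it, then load the file into a fresh object as a new session would. The sketch below uses a pared-down stand-in, `TinyCache`, which is hypothetical and not the notebook's class. It keeps the `|||` key separator but, as an assumption of this sketch, lowercases the artist as well as the title.

```python
import json
import os
import tempfile


class TinyCache:
    """Hypothetical, pared-down stand-in for LyricsCache."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path, 'r', encoding='utf-8') as f:
                self.data = json.load(f)

    @staticmethod
    def key(artist, title):
        # Same '|||' separator as the notebook; both parts are normalized here
        return f"{artist.strip().lower()}|||{title.strip().lower()}"

    def get(self, artist, title):
        return self.data.get(self.key(artist, title))

    def set(self, artist, title, lyrics):
        self.data[self.key(artist, title)] = lyrics

    def save(self):
        with open(self.path, 'w', encoding='utf-8') as f:
            json.dump(self.data, f)


def round_trip():
    """Store one entry, then read it back through a second cache object."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'lyrics_cache.json')
        first = TinyCache(path)
        first.set('Artist X', 'Song A', 'la la la')
        first.save()
        second = TinyCache(path)  # simulates a later session
        return second.get('artist x', ' SONG A ')


if __name__ == "__main__":
    print(round_trip())
```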

## ⚙️ Optimized Main Execution Pipeline
This section orchestrates the full scraping and lyric-fetching workflow across multiple years of Billboard Hot 100 data. Key features include:

### process_year():

*   Scrapes chart data for a given year.

*   Retrieves cached lyrics where available.

*   Uses asynchronous batch requests to fetch missing lyrics concurrently.

*   Writes results to a CSV file and logs any failed fetches.

### main():

*   Coordinates the end-to-end process for multiple years.

*   Manages connection pooling via aiohttp for performance.

*   Persists the lyrics cache and logs missing entries for later review.


The workflow is designed for speed and efficiency, using async I/O, intelligent caching, and batch processing to handle large-scale data collection and enrichment.
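
In the code that follows, the number of simultaneous connections is capped by aiohttp.TCPConnector(limit=...), while the `concurrency` argument of process_year() only sets the batch size. As an alternative illustration of bounded concurrency, the sketch below caps in-flight coroutines with asyncio.Semaphore and keeps results in input order. `gather_bounded`, `fake_fetch`, and `demo` are hypothetical names, and the sleep stands in for a network call.

```python
import asyncio


async def gather_bounded(coros, limit):
    """Await all coroutines with at most `limit` running at once.

    Results are returned in the same order as `coros`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(run_one(c) for c in coros))


async def demo(n=10, limit=3):
    in_flight = 0
    peak = 0

    async def fake_fetch(i):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for a network round trip
        in_flight -= 1
        return i * i

    results = await gather_bounded([fake_fetch(i) for i in range(n)], limit)
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(demo())
    print(results, "peak concurrency:", peak)
```

Because the results come back in input order, they can be matched to the songs that were submitted with a simple zip().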

In [None]:
# --------------------------
# OPTIMIZED MAIN EXECUTION
# --------------------------

async def process_year(year, cache, session, concurrency=50):
    """Process a single year with caching and optimized fetching"""
    print(f"\n{'='*40}\nProcessing {year}\n{'='*40}")
    start_time = time.time()

    # Scrape Billboard data
    billboard_df = scrape_hot_100(year)
    if billboard_df is None:
        print(f"⚠️  No data found for {year}")
        return None, []

    print(f"Found {len(billboard_df)} songs for {year}")

    # Initialize lyrics column
    billboard_df['Lyrics'] = ''

    # Check cache first. Rows are addressed by position (iat) rather than by
    # Rank label: Rank labels may repeat (the 1969 table yields 101 rows),
    # and label-based .at then returns a Series, which raises
    # "The truth value of a Series is ambiguous".
    lyrics_col = billboard_df.columns.get_loc('Lyrics')
    cached_count = 0
    missing = []  # (position, rank, artist, title) for songs still lacking lyrics
    for pos, (rank, row) in enumerate(billboard_df.iterrows()):
        cached = cache.get(row['Artist'], row['Title'])
        if cached:
            billboard_df.iat[pos, lyrics_col] = cached
            cached_count += 1
        else:
            missing.append((pos, rank, row['Artist'], row['Title']))

    print(f"🚀 {cached_count}/{len(billboard_df)} lyrics from cache")

    # Prepare tasks for missing lyrics
    tasks = [
        fetch_with_retries(session, artist, title)
        for _, _, artist, title in missing
    ]

    # Process in batches for better memory management
    results = []
    batch_size = concurrency * 5  # Process in larger batches
    for i in range(0, len(tasks), batch_size):
        batch = tasks[i:i+batch_size]
        batch_results = await tqdm_asyncio.gather(
            *batch,
            desc=f"Fetching {year} lyrics",
            unit="song"
        )
        results.extend(batch_results)

    # Update DataFrame and cache; results are in the same order as `missing`
    failed_entries = []
    for (pos, rank, artist, title), lyrics in zip(missing, results):
        if lyrics:
            billboard_df.iat[pos, lyrics_col] = lyrics
            cache.set(artist, title, lyrics)
        else:
            failed_entries.append({
                'Year': year,
                'Artist': artist,
                'Title': title,
                'Rank': rank
            })

    # Save results
    billboard_df.to_csv(f"hot100_{year}.csv")
    elapsed = time.time() - start_time
    print(f"✅ Saved {year} data in {elapsed:.1f}s - {len(failed_entries)} missing")

    return billboard_df, failed_entries

async def main(years, concurrency=30):
    """Run the full scraping process with caching and concurrency"""
    cache = LyricsCache()
    all_data = {}
    all_failed_entries = []

    # Reusable HTTP session with connection pooling
    connector = aiohttp.TCPConnector(limit=concurrency)
    async with aiohttp.ClientSession(connector=connector) as session:
        for year in years:
            try:
                df, failed = await process_year(year, cache, session, concurrency)
                if df is not None:
                    all_data[year] = df
                    all_failed_entries.extend(failed)
            except Exception as e:
                print(f"🚨 Critical error processing {year}: {str(e)}")
                logging.exception(e)

    # Final cache save
    cache.save_cache()

    # Save failure log
    if all_failed_entries:
        fail_df = pd.DataFrame(all_failed_entries)
        fail_df.to_csv('missing_lyrics_log.csv', index=False)
        print(f"\nSaved {len(fail_df)} missing entries to log")

    return all_data

# Run on Colab
if __name__ == "__main__":
    # Years to process
    years_to_scrape = list(range(1959, 2025))

    # Run with high concurrency
    hot_100_data = asyncio.run(main(years_to_scrape, concurrency=40))

    # Show summary
    for year, df in hot_100_data.items():
        missing = df[df['Lyrics'] == '']
        print(f"{year}: {len(missing)} missing lyrics")


Processing 1959
Found 100 songs for 1959
🚀 0/100 lyrics from cache


Fetching 1959 lyrics: 100%|██████████| 100/100 [00:11<00:00,  8.58song/s]


✅ Saved 1959 data in 11.9s - 42 missing

Processing 1960
Found 100 songs for 1960
🚀 1/100 lyrics from cache


Fetching 1960 lyrics: 100%|██████████| 99/99 [00:08<00:00, 11.67song/s]


✅ Saved 1960 data in 8.7s - 36 missing

Processing 1961
Found 100 songs for 1961
🚀 0/100 lyrics from cache


Fetching 1961 lyrics: 100%|██████████| 100/100 [00:07<00:00, 12.70song/s]


✅ Saved 1961 data in 8.0s - 29 missing

Processing 1962
Found 100 songs for 1962
🚀 1/100 lyrics from cache


Fetching 1962 lyrics: 100%|██████████| 99/99 [00:08<00:00, 12.22song/s]


✅ Saved 1962 data in 8.6s - 29 missing

Processing 1963
Found 100 songs for 1963
🚀 0/100 lyrics from cache


Fetching 1963 lyrics: 100%|██████████| 100/100 [00:11<00:00,  9.02song/s]


✅ Saved 1963 data in 11.3s - 42 missing

Processing 1964
Found 100 songs for 1964
🚀 0/100 lyrics from cache


Fetching 1964 lyrics: 100%|██████████| 100/100 [00:10<00:00,  9.39song/s]


✅ Saved 1964 data in 10.9s - 40 missing

Processing 1965
Found 100 songs for 1965
🚀 0/100 lyrics from cache


Fetching 1965 lyrics: 100%|██████████| 100/100 [00:08<00:00, 11.21song/s]


✅ Saved 1965 data in 9.1s - 29 missing

Processing 1966
Found 100 songs for 1966
🚀 0/100 lyrics from cache


Fetching 1966 lyrics: 100%|██████████| 100/100 [00:09<00:00, 10.12song/s]


✅ Saved 1966 data in 10.1s - 42 missing

Processing 1967
Found 100 songs for 1967
🚀 0/100 lyrics from cache


Fetching 1967 lyrics: 100%|██████████| 100/100 [00:09<00:00, 10.83song/s]


✅ Saved 1967 data in 9.6s - 33 missing

Processing 1968
Found 100 songs for 1968
🚀 0/100 lyrics from cache


Fetching 1968 lyrics: 100%|██████████| 100/100 [00:08<00:00, 11.92song/s]


✅ Saved 1968 data in 8.6s - 31 missing

Processing 1969


ERROR:root:The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Traceback (most recent call last):
  File "<ipython-input-10-57c07e56214c>", line 88, in main
    df, failed = await process_year(year, cache, session, concurrency)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<ipython-input-10-57c07e56214c>", line 34, in process_year
    if not billboard_df.at[idx, 'Lyrics']:
  File "/usr/local/lib/python3.11/dist-packages/pandas/core/generic.py", line 1577, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
  logging.exception(e)


Found 101 songs for 1969
🚀 0/101 lyrics from cache
🚨 Critical error processing 1969: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Processing 1970
Found 100 songs for 1970
🚀 0/100 lyrics from cache


Fetching 1970 lyrics: 100%|██████████| 100/100 [00:09<00:00, 10.60song/s]


✅ Saved 1970 data in 9.6s - 32 missing

Processing 1971
Found 100 songs for 1971
🚀 0/100 lyrics from cache


Fetching 1971 lyrics: 100%|██████████| 100/100 [00:12<00:00,  7.75song/s]


✅ Saved 1971 data in 13.2s - 42 missing

Processing 1972
Found 100 songs for 1972
🚀 0/100 lyrics from cache


Fetching 1972 lyrics: 100%|██████████| 100/100 [00:09<00:00, 10.04song/s]


✅ Saved 1972 data in 10.2s - 31 missing

Processing 1973
Found 100 songs for 1973
🚀 0/100 lyrics from cache


Fetching 1973 lyrics: 100%|██████████| 100/100 [00:10<00:00,  9.53song/s]


✅ Saved 1973 data in 10.7s - 28 missing

Processing 1974
Found 100 songs for 1974
🚀 0/100 lyrics from cache


Fetching 1974 lyrics: 100%|██████████| 100/100 [00:09<00:00, 10.88song/s]


✅ Saved 1974 data in 9.4s - 22 missing

Processing 1975
Found 100 songs for 1975
🚀 0/100 lyrics from cache


Fetching 1975 lyrics: 100%|██████████| 100/100 [00:08<00:00, 11.90song/s]


✅ Saved 1975 data in 8.6s - 28 missing

Processing 1976
Found 100 songs for 1976
🚀 1/100 lyrics from cache


Fetching 1976 lyrics: 100%|██████████| 99/99 [00:07<00:00, 13.57song/s]


✅ Saved 1976 data in 7.7s - 20 missing

Processing 1977
Found 100 songs for 1977
🚀 0/100 lyrics from cache


Fetching 1977 lyrics: 100%|██████████| 100/100 [00:07<00:00, 12.51song/s]


✅ Saved 1977 data in 8.2s - 22 missing

Processing 1978
Found 100 songs for 1978
🚀 0/100 lyrics from cache


Fetching 1978 lyrics: 100%|██████████| 100/100 [00:06<00:00, 15.52song/s]


✅ Saved 1978 data in 6.6s - 10 missing

Processing 1979
Found 100 songs for 1979
🚀 1/100 lyrics from cache


Fetching 1979 lyrics: 100%|██████████| 99/99 [00:06<00:00, 15.07song/s]


✅ Saved 1979 data in 6.8s - 23 missing

Processing 1980
Found 100 songs for 1980
🚀 0/100 lyrics from cache


Fetching 1980 lyrics: 100%|██████████| 100/100 [00:06<00:00, 14.64song/s]


✅ Saved 1980 data in 7.0s - 22 missing

Processing 1981
Found 100 songs for 1981
🚀 0/100 lyrics from cache


Fetching 1981 lyrics: 100%|██████████| 100/100 [00:07<00:00, 13.80song/s]


✅ Saved 1981 data in 7.4s - 21 missing

Processing 1982
Found 100 songs for 1982
🚀 1/100 lyrics from cache


Fetching 1982 lyrics: 100%|██████████| 99/99 [00:06<00:00, 14.77song/s]


✅ Saved 1982 data in 6.9s - 24 missing

Processing 1983
Found 100 songs for 1983
🚀 1/100 lyrics from cache


Fetching 1983 lyrics: 100%|██████████| 99/99 [00:06<00:00, 15.80song/s]


✅ Saved 1983 data in 6.6s - 10 missing

Processing 1984
Found 100 songs for 1984
🚀 0/100 lyrics from cache


Fetching 1984 lyrics: 100%|██████████| 100/100 [00:06<00:00, 14.72song/s]


✅ Saved 1984 data in 7.0s - 15 missing

Processing 1985
Found 100 songs for 1985
🚀 0/100 lyrics from cache


Fetching 1985 lyrics: 100%|██████████| 100/100 [00:06<00:00, 15.76song/s]


✅ Saved 1985 data in 6.6s - 15 missing

Processing 1986
Found 100 songs for 1986
🚀 0/100 lyrics from cache


Fetching 1986 lyrics: 100%|██████████| 100/100 [00:06<00:00, 15.63song/s]


✅ Saved 1986 data in 6.6s - 9 missing

Processing 1987
Found 100 songs for 1987
🚀 1/100 lyrics from cache


Fetching 1987 lyrics: 100%|██████████| 99/99 [00:06<00:00, 15.93song/s]


✅ Saved 1987 data in 6.4s - 9 missing

Processing 1988
Found 100 songs for 1988
🚀 0/100 lyrics from cache


Fetching 1988 lyrics: 100%|██████████| 100/100 [00:06<00:00, 16.27song/s]


✅ Saved 1988 data in 6.3s - 9 missing

Processing 1989
Found 100 songs for 1989
🚀 0/100 lyrics from cache


Fetching 1989 lyrics: 100%|██████████| 100/100 [00:06<00:00, 16.63song/s]


✅ Saved 1989 data in 6.3s - 12 missing

Processing 1990
Found 100 songs for 1990
🚀 2/100 lyrics from cache


Fetching 1990 lyrics: 100%|██████████| 98/98 [00:06<00:00, 15.84song/s]


✅ Saved 1990 data in 6.7s - 8 missing

Processing 1991
Found 100 songs for 1991
🚀 1/100 lyrics from cache


Fetching 1991 lyrics: 100%|██████████| 99/99 [00:06<00:00, 14.39song/s]


✅ Saved 1991 data in 7.1s - 16 missing

Processing 1992
Found 100 songs for 1992
🚀 1/100 lyrics from cache


Fetching 1992 lyrics: 100%|██████████| 99/99 [00:05<00:00, 17.62song/s]


✅ Saved 1992 data in 5.8s - 9 missing

Processing 1993
Found 100 songs for 1993
🚀 3/100 lyrics from cache


Fetching 1993 lyrics: 100%|██████████| 97/97 [00:06<00:00, 15.09song/s]


✅ Saved 1993 data in 6.6s - 18 missing

Processing 1994
Found 100 songs for 1994
🚀 5/100 lyrics from cache


Fetching 1994 lyrics: 100%|██████████| 95/95 [00:06<00:00, 15.53song/s]


✅ Saved 1994 data in 6.3s - 10 missing

Processing 1995
Found 100 songs for 1995
🚀 10/100 lyrics from cache


Fetching 1995 lyrics: 100%|██████████| 90/90 [00:05<00:00, 15.21song/s]


✅ Saved 1995 data in 6.1s - 17 missing

Processing 1996
Found 100 songs for 1996
🚀 6/100 lyrics from cache


Fetching 1996 lyrics: 100%|██████████| 94/94 [00:06<00:00, 15.51song/s]


✅ Saved 1996 data in 6.5s - 17 missing

Processing 1997
Found 100 songs for 1997
🚀 13/100 lyrics from cache


Fetching 1997 lyrics: 100%|██████████| 87/87 [00:06<00:00, 14.23song/s]


✅ Saved 1997 data in 6.3s - 17 missing

Processing 1998
Found 100 songs for 1998
🚀 10/100 lyrics from cache


Fetching 1998 lyrics: 100%|██████████| 90/90 [00:05<00:00, 15.00song/s]


✅ Saved 1998 data in 6.2s - 14 missing

Processing 1999
Found 100 songs for 1999
🚀 3/100 lyrics from cache


Fetching 1999 lyrics: 100%|██████████| 97/97 [00:06<00:00, 15.34song/s]


✅ Saved 1999 data in 6.5s - 8 missing

Processing 2000
Found 100 songs for 2000
🚀 5/100 lyrics from cache


Fetching 2000 lyrics: 100%|██████████| 95/95 [00:07<00:00, 13.29song/s]


✅ Saved 2000 data in 7.4s - 7 missing

Processing 2001
Found 100 songs for 2001
🚀 7/100 lyrics from cache


Fetching 2001 lyrics: 100%|██████████| 93/93 [00:06<00:00, 14.93song/s]


✅ Saved 2001 data in 6.5s - 7 missing

Processing 2002
Found 100 songs for 2002
🚀 6/100 lyrics from cache


Fetching 2002 lyrics: 100%|██████████| 94/94 [00:05<00:00, 15.93song/s]


✅ Saved 2002 data in 6.2s - 6 missing

Processing 2003
Found 100 songs for 2003
🚀 3/100 lyrics from cache


Fetching 2003 lyrics: 100%|██████████| 97/97 [00:05<00:00, 16.34song/s]


✅ Saved 2003 data in 6.5s - 7 missing

Processing 2004
Found 100 songs for 2004
🚀 8/100 lyrics from cache


Fetching 2004 lyrics: 100%|██████████| 92/92 [00:05<00:00, 15.88song/s]


✅ Saved 2004 data in 6.1s - 5 missing

Processing 2005
Found 100 songs for 2005
🚀 6/100 lyrics from cache


Fetching 2005 lyrics: 100%|██████████| 94/94 [00:05<00:00, 15.73song/s]


✅ Saved 2005 data in 6.2s - 8 missing

Processing 2006
Found 100 songs for 2006
🚀 10/100 lyrics from cache


Fetching 2006 lyrics: 100%|██████████| 90/90 [00:06<00:00, 14.33song/s]


✅ Saved 2006 data in 6.5s - 9 missing

Processing 2007
Found 100 songs for 2007
🚀 8/100 lyrics from cache


Fetching 2007 lyrics: 100%|██████████| 92/92 [00:05<00:00, 15.60song/s]


✅ Saved 2007 data in 6.1s - 9 missing

Processing 2008
Found 100 songs for 2008
🚀 9/100 lyrics from cache


Fetching 2008 lyrics: 100%|██████████| 91/91 [00:05<00:00, 17.52song/s]


✅ Saved 2008 data in 5.5s - 4 missing

Processing 2009
Found 100 songs for 2009
🚀 10/100 lyrics from cache


Fetching 2009 lyrics: 100%|██████████| 90/90 [00:05<00:00, 15.73song/s]


✅ Saved 2009 data in 6.3s - 6 missing

Processing 2010
Found 100 songs for 2010
🚀 12/100 lyrics from cache


Fetching 2010 lyrics: 100%|██████████| 88/88 [00:06<00:00, 14.51song/s]


✅ Saved 2010 data in 6.3s - 9 missing

Processing 2011
Found 100 songs for 2011
🚀 9/100 lyrics from cache


Fetching 2011 lyrics: 100%|██████████| 91/91 [00:05<00:00, 17.06song/s]


✅ Saved 2011 data in 5.6s - 9 missing

Processing 2012
Found 100 songs for 2012
🚀 8/100 lyrics from cache


Fetching 2012 lyrics: 100%|██████████| 92/92 [00:05<00:00, 16.50song/s]


✅ Saved 2012 data in 5.8s - 4 missing

Processing 2013
Found 100 songs for 2013
🚀 11/100 lyrics from cache


Fetching 2013 lyrics: 100%|██████████| 89/89 [00:05<00:00, 16.59song/s]


✅ Saved 2013 data in 5.6s - 8 missing

Processing 2014
Found 100 songs for 2014
🚀 11/100 lyrics from cache


Fetching 2014 lyrics: 100%|██████████| 89/89 [00:05<00:00, 16.90song/s]


✅ Saved 2014 data in 5.5s - 4 missing

Processing 2015
Found 100 songs for 2015
🚀 8/100 lyrics from cache


Fetching 2015 lyrics: 100%|██████████| 92/92 [00:06<00:00, 15.13song/s]


✅ Saved 2015 data in 6.5s - 13 missing

Processing 2016
Found 100 songs for 2016
🚀 10/100 lyrics from cache


Fetching 2016 lyrics: 100%|██████████| 90/90 [00:05<00:00, 15.63song/s]


✅ Saved 2016 data in 6.0s - 13 missing

Processing 2017
Found 100 songs for 2017
🚀 7/100 lyrics from cache


Fetching 2017 lyrics: 100%|██████████| 93/93 [00:05<00:00, 16.19song/s]


✅ Saved 2017 data in 6.1s - 12 missing

Processing 2018
Found 100 songs for 2018
🚀 13/100 lyrics from cache


Fetching 2018 lyrics: 100%|██████████| 87/87 [00:05<00:00, 15.42song/s]


✅ Saved 2018 data in 5.9s - 11 missing

Processing 2019
Found 100 songs for 2019
🚀 10/100 lyrics from cache


Fetching 2019 lyrics: 100%|██████████| 90/90 [00:06<00:00, 14.59song/s]


✅ Saved 2019 data in 6.5s - 9 missing

Processing 2020
Found 100 songs for 2020
🚀 8/100 lyrics from cache


Fetching 2020 lyrics: 100%|██████████| 92/92 [00:05<00:00, 15.74song/s]


✅ Saved 2020 data in 6.1s - 11 missing

Processing 2021
Found 100 songs for 2021
🚀 7/100 lyrics from cache


Fetching 2021 lyrics: 100%|██████████| 93/93 [00:05<00:00, 16.28song/s]


✅ Saved 2021 data in 6.0s - 8 missing

Processing 2022
Found 100 songs for 2022
🚀 12/100 lyrics from cache


Fetching 2022 lyrics: 100%|██████████| 88/88 [00:05<00:00, 15.27song/s]


✅ Saved 2022 data in 6.2s - 5 missing

Processing 2023
Found 100 songs for 2023
🚀 13/100 lyrics from cache


Fetching 2023 lyrics: 100%|██████████| 87/87 [00:06<00:00, 14.23song/s]


✅ Saved 2023 data in 6.4s - 8 missing

Processing 2024
Found 100 songs for 2024
🚀 18/100 lyrics from cache


Fetching 2024 lyrics: 100%|██████████| 82/82 [00:05<00:00, 14.26song/s]


✅ Saved 2024 data in 6.0s - 5 missing

Saved 1088 missing entries to log
1959: 42 missing lyrics
1960: 36 missing lyrics
1961: 29 missing lyrics
1962: 29 missing lyrics
1963: 42 missing lyrics
1964: 40 missing lyrics
1965: 29 missing lyrics
1966: 42 missing lyrics
1967: 33 missing lyrics
1968: 31 missing lyrics
1970: 32 missing lyrics
1971: 42 missing lyrics
1972: 31 missing lyrics
1973: 28 missing lyrics
1974: 22 missing lyrics
1975: 28 missing lyrics
1976: 20 missing lyrics
1977: 22 missing lyrics
1978: 10 missing lyrics
1979: 23 missing lyrics
1980: 22 missing lyrics
1981: 21 missing lyrics
1982: 24 missing lyrics
1983: 10 missing lyrics
1984: 15 missing lyrics
1985: 15 missing lyrics
1986: 9 missing lyrics
1987: 9 missing lyrics
1988: 9 missing lyrics
1989: 12 missing lyrics
1990: 8 missing lyrics
1991: 16 missing lyrics
1992: 9 missing lyrics
1993: 18 missing lyrics
1994: 10 missing lyrics
1995: 17 missing lyrics
1996: 17 missing lyrics
1997: 17 missing lyrics
1998: 14 missing lyrics

## 🗜️ Zipping Billboard Hot 100 CSV Files
The function **zip_csv_files()** bundles the individual Billboard Hot 100 CSV files into a single ZIP archive, named hot100_data_all_years.zip by default. It checks whether each yearly CSV file (e.g., hot100_2000.csv) exists and adds it to the archive if found.

The zipped file is useful for efficiently downloading and transferring the complete dataset between different stages of the project (e.g., from the Data Collection stage to pre-processing and analysis).



In [None]:

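def zip_csv_files(start_year=1959, end_year=2024, output_zip='hot100_data_all_years.zip'):
    """Add every existing hot100_<year>.csv in the range to one ZIP archive"""
    # ZIP_DEFLATED compresses the files; the default (ZIP_STORED) only bundles them
    with zipfile.ZipFile(output_zip, 'w', compression=zipfile.ZIP_DEFLATED) as zipf:
        for year in range(start_year, end_year + 1):
            filename = f"hot100_{year}.csv"
            if os.path.exists(filename):
                zipf.write(filename)
    print(f"✅ All CSV files zipped to {output_zip}")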

# Create zip file
zip_csv_files()

✅ All CSV files zipped to hot100_data_all_years.zip
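
Since zip_csv_files() silently skips years whose CSV is absent, it can be useful to list what actually ended up in the archive. The sketch below is a self-contained illustration that builds a small archive from placeholder files in a temporary directory and reads the member names back; `list_archive` and `demo` are hypothetical helpers, and the file contents are placeholders.

```python
import os
import tempfile
import zipfile


def list_archive(zip_path):
    """Return the member names stored in a ZIP archive, sorted."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(zf.namelist())


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        years = (1959, 1960)
        # Placeholder CSVs standing in for the hot100_<year>.csv files
        for year in years:
            csv_path = os.path.join(tmp, f"hot100_{year}.csv")
            with open(csv_path, 'w', encoding='utf-8') as f:
                f.write("Rank,Title,Artist,Lyrics\n")
        zip_path = os.path.join(tmp, 'sample.zip')
        with zipfile.ZipFile(zip_path, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
            for year in years:
                name = f"hot100_{year}.csv"
                # arcname stores the bare file name instead of the temp path
                zf.write(os.path.join(tmp, name), arcname=name)
        return list_archive(zip_path)


if __name__ == "__main__":
    print(demo())
```

Comparing the listed names against the expected range of years shows which yearly files are missing from the archive.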
