# Data Cleaning Notebook

Download music datasets from GCS, create smaller subsets, then explore and clean them using pandas.

In [1]:
import os
from dotenv import load_dotenv
import pandas as pd
from google.cloud import storage
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

## 1. Setup: Load Environment Variables and Initialize GCS Client

In [2]:
import os
from dotenv import load_dotenv
from google.cloud import storage

In [3]:
# Load .env file
load_dotenv('/Users/mani/Desktop/msds-694-cohort-14-group-5/.env')
gcs_key_path = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
bucket_name = os.getenv("GCS_BUCKET_NAME")

In [4]:
# Set GCP credentials and initialize client
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = gcs_key_path
client = storage.Client()
bucket = client.bucket(bucket_name)
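
If the `.env` file is missing or a variable is misspelled, `os.getenv` returns `None` and the assignment to `os.environ` above fails with a `TypeError` that does not name the missing variable. A minimal sketch of a guard, assuming a hypothetical helper called `require_env` (not part of the original notebook):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    `require_env` is a hypothetical helper introduced for illustration only.
    """
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable (None) and an empty string
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check the .env file"
        )
    return value
```

With this in place, the setup cell could read `gcs_key_path = require_env("GOOGLE_APPLICATION_CREDENTIALS")` and `bucket_name = require_env("GCS_BUCKET_NAME")`, so a configuration problem surfaces before the GCS client is created.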

In [5]:
# Helper function to download CSV from GCS
def download_csv_from_gcs(blob_name, local_path):
    blob = bucket.blob(blob_name)
    blob.download_to_filename(local_path)
    print(f"Downloaded {blob_name} to {local_path}")

In [6]:
# Local folders
local_folder = '/Users/mani/Desktop/msds-694-cohort-14-group-5/data'
os.makedirs(local_folder, exist_ok=True)

files = ['album_info.csv', 'critic_ratings.csv', 'user_ratings.csv']

In [7]:
# Download files
for f in files:
    download_csv_from_gcs(f'data/{f}', os.path.join(local_folder, f))

Downloaded data/album_info.csv to /Users/mani/Desktop/msds-694-cohort-14-group-5/data/album_info.csv
Downloaded data/critic_ratings.csv to /Users/mani/Desktop/msds-694-cohort-14-group-5/data/critic_ratings.csv
Downloaded data/user_ratings.csv to /Users/mani/Desktop/msds-694-cohort-14-group-5/data/user_ratings.csv


## 2. Create Smaller Subsets of the Data

In [8]:
subset_folder = os.path.join(local_folder, 'subsets/raw')
os.makedirs(subset_folder, exist_ok=True)

for f in files:
    df = pd.read_csv(os.path.join(local_folder, f))
    subset = df.head(5000)
    subset.to_csv(os.path.join(subset_folder, f.replace(
        '.csv', '_subset.csv')), index=False)
    print(f"Created subset: {f.replace('.csv', '_subset.csv')}")

Created subset: album_info_subset.csv
Created subset: critic_ratings_subset.csv
Created subset: user_ratings_subset.csv
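
The loop above reads each full CSV into memory before keeping the first 5,000 rows. When only the head of a file is needed, `pd.read_csv` can stop early via its `nrows` parameter. The sketch below shows this with a small in-memory CSV; the helper name `read_head` and the sample data are assumptions made for illustration, not part of the notebook.

```python
import io

import pandas as pd


def read_head(source, n_rows=5000):
    """Read only the first `n_rows` data rows of a CSV (hypothetical helper)."""
    return pd.read_csv(source, nrows=n_rows)


# Tiny stand-in for one of the downloaded files (made-up rows)
sample_csv = io.StringIO("slug,critic_score\na,95\nb,96\nc,90\n")
subset = read_head(sample_csv, n_rows=2)
print(subset.shape)  # (2, 2)
```

Because `nrows` limits how much of the file is parsed, the resulting subset should match `df.head(5000)`. One difference is that column types are inferred from fewer rows, so a column with mixed values further down the file may get a different dtype.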


## 3. Load Subsets into Pandas and Explore


In [9]:
subset_files = ['album_info_subset.csv',
                'critic_ratings_subset.csv', 'user_ratings_subset.csv']
dfs = {}

for f in subset_files:
    path = os.path.join(subset_folder, f)
    df = pd.read_csv(path)
    dfs[f] = df
    print(f"\nLoaded {f} with shape {df.shape}")


Loaded album_info_subset.csv with shape (5000, 8)

Loaded critic_ratings_subset.csv with shape (5000, 6)

Loaded user_ratings_subset.csv with shape (5000, 4)


In [10]:
# Explore datasets: column types, missing values, duplicates
for name, df in dfs.items():
    print(f"\n--- Summary for {name} ---")
    print("Columns and types:")
    print(df.dtypes)
    print("\nFirst 5 rows:")
    print(df.head())
    print("\nMissing values per column:")
    print(df.isna().sum())
    print("\nDuplicate rows:", df.duplicated().sum())
    print("-" * 50)


--- Summary for album_info_subset.csv ---
Columns and types:
slug            object
artist          object
album           object
critic_score     int64
user_score       int64
release_date    object
release_year     int64
genres          object
dtype: object

First 5 rows:
                                 slug          artist  \
0      6647-john-coltrane-giant-steps   John Coltrane   
1  6041-miles-davis-sketches-of-spain     Miles Davis   
2     6654-charles-mingus-blues-roots  Charlie Mingus   
3             7022-etta-james-at-last      Etta James   
4           21774-max-roach-we-insist       Max Roach   

                                      album  critic_score  user_score  \
0                               Giant Steps            95          86   
1                         Sketches of Spain            96          83   
2                             Blues & Roots            90          84   
3                                  At Last!           100          81   
4  We Insist! Max

#### `album_info_subset.csv`

**Observations:**

- **Missing values:**  
  - `artist`: 270 missing entries  
  - `release_date`: 1,834 missing entries  
  - `genres`: 1,766 missing entries  
- **Duplicates:** 2 duplicate rows detected  
- **Data types:** Most columns have appropriate types. `release_date` is stored as `object` (string) and may need conversion to a datetime type for analysis.

#### `critic_ratings_subset.csv`

**Observations:**

- **Missing values:**  
  - `author`: 2,224 missing entries  
  - `snippet`: 2,412 missing entries  
- **Duplicates:** None  
- **Data types:** All columns are correctly typed; `date` may require conversion to datetime for temporal analysis.

#### `user_ratings_subset.csv`

**Observations:**

- **Missing values:** None  
- **Duplicates:** None  
- **Data types:** All columns are appropriately typed; `date` can be converted to datetime.
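
The datetime conversion mentioned in the observations above could be sketched as follows. The actual date format in these files is not shown in the notebook, so the example uses made-up ISO-style strings. `errors="coerce"` turns unparseable or missing values into `NaT` instead of raising. The helper name `to_datetime_column` is hypothetical.

```python
import pandas as pd


def to_datetime_column(df, column):
    """Return a copy of `df` with `column` parsed as datetime (hypothetical helper)."""
    out = df.copy()
    # Unparseable or missing entries become NaT rather than raising an error
    out[column] = pd.to_datetime(out[column], errors="coerce")
    return out


# Made-up values: one valid date, one missing value, one unparseable string
example = pd.DataFrame({"release_date": ["1960-02-01", None, "not a date"]})
converted = to_datetime_column(example, "release_date")
print(converted["release_date"].isna().sum())  # 2
```

Comparing the `NaT` count after conversion with the original missing count (for example, 1,834 for `release_date`) is a quick way to spot values that were present but could not be parsed.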

## 4. Cleaning the Data
- Drop duplicate rows
- Fill missing categorical/text columns with "Unknown" (`artist`, `genres`, `author`) or "No snippet" (`snippet`)
- Leave missing `release_date` values unfilled
- Keep numeric columns as is

In [11]:
# Cleaning album_info
# .copy() gives an independent DataFrame, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning
album_df = dfs['album_info_subset.csv'].drop_duplicates().copy()
album_df['artist'] = album_df['artist'].fillna("Unknown")
album_df['genres'] = album_df['genres'].fillna("Unknown")

# Cleaning critic_ratings
critic_df = dfs['critic_ratings_subset.csv'].drop_duplicates().copy()
critic_df['author'] = critic_df['author'].fillna("Unknown")
critic_df['snippet'] = critic_df['snippet'].fillna("No snippet")

# Cleaning user_ratings (no missing values)
user_df = dfs['user_ratings_subset.csv'].drop_duplicates().copy()


In [12]:
# Save cleaned CSVs
cleaned_folder = os.path.join(local_folder, 'subsets', 'cleaned')
os.makedirs(cleaned_folder, exist_ok=True)

album_df.to_csv(os.path.join(
    cleaned_folder, 'album_info_cleaned.csv'), index=False)
critic_df.to_csv(os.path.join(
    cleaned_folder, 'critic_ratings_cleaned.csv'), index=False)
user_df.to_csv(os.path.join(
    cleaned_folder, 'user_ratings_cleaned.csv'), index=False)

## 5. Load Cleaned CSVs and Verify

In [13]:
subset_cleaned_files = ['album_info_cleaned.csv',
                        'critic_ratings_cleaned.csv', 'user_ratings_cleaned.csv']
cleaned_dfs = {}

for f in subset_cleaned_files:
    path = os.path.join(cleaned_folder, f)
    df = pd.read_csv(path)
    cleaned_dfs[f] = df
    print(f"\nLoaded {f} with shape {df.shape}")


Loaded album_info_cleaned.csv with shape (4998, 8)

Loaded critic_ratings_cleaned.csv with shape (5000, 6)

Loaded user_ratings_cleaned.csv with shape (5000, 4)


In [14]:
# Explore cleaned datasets
for name, df in cleaned_dfs.items():
    print(f"\n--- Summary for {name} ---")
    print("Columns and types:")
    print(df.dtypes)
    print("\nFirst 5 rows:")
    print(df.head())
    print("\nMissing values per column:")
    print(df.isna().sum())
    print("\nDuplicate rows:", df.duplicated().sum())
    print("-" * 50)


--- Summary for album_info_cleaned.csv ---
Columns and types:
slug            object
artist          object
album           object
critic_score     int64
user_score       int64
release_date    object
release_year     int64
genres          object
dtype: object

First 5 rows:
                                 slug          artist  \
0      6647-john-coltrane-giant-steps   John Coltrane   
1  6041-miles-davis-sketches-of-spain     Miles Davis   
2     6654-charles-mingus-blues-roots  Charlie Mingus   
3             7022-etta-james-at-last      Etta James   
4           21774-max-roach-we-insist       Max Roach   

                                      album  critic_score  user_score  \
0                               Giant Steps            95          86   
1                         Sketches of Spain            96          83   
2                             Blues & Roots            90          84   
3                                  At Last!           100          81   
4  We Insist! Ma

#### `album_info_cleaned.csv`

**Observations:**

- **Missing values:**  
  - `release_date`: 1,834 missing entries remain  
  - All other columns are now complete  
- **Duplicates:** None  
- **Data types:** Appropriate; `release_date` may still require conversion to datetime for temporal analysis.

#### `critic_ratings_cleaned.csv`

**Observations:**

- **Missing values:** None remain; all previously missing `author` and `snippet` entries were filled with the placeholders "Unknown" and "No snippet".  
- **Duplicates:** None  
- **Data types:** All appropriate; `date` can be converted to datetime for temporal analysis.

#### `user_ratings_cleaned.csv`

**Observations:**

- **Missing values:** None  
- **Duplicates:** None  
- **Data types:** Appropriate; `date` may be converted to datetime if needed.

#### Summary of Cleaning Outcomes

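- **Duplicates:** No dataset contains duplicate rows after cleaning (2 were removed from `album_info`, leaving 4,998 rows).  
- **Missing values:**  
  - `album_info_cleaned.csv` still has missing `release_date` values (1,834).  
  - All other missing values were filled with placeholder strings ("Unknown" or "No snippet").  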
- **Data type considerations:** Convert `release_date` and `date` columns to datetime for time-based analysis.  
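
The checks summarized above can also be collected into a small report instead of being read off printed output. The following is a minimal sketch, assuming a hypothetical helper named `quality_report` and made-up example rows; it is not part of the original notebook.

```python
import pandas as pd


def quality_report(df):
    """Summarize row count, duplicate rows, and columns with missing values.

    Hypothetical helper for illustration; returns a plain dict.
    """
    missing = df.isna().sum()
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        # Only list columns that actually have missing entries
        "missing": {col: int(n) for col, n in missing.items() if n > 0},
    }


# Made-up rows: one exact duplicate and one missing release_date
example = pd.DataFrame(
    {
        "artist": ["A", "A", "B"],
        "release_date": ["1960-01-01", "1960-01-01", None],
    }
)
report = quality_report(example)
print(report)
```

Applied to each entry in `cleaned_dfs`, a report like this would be expected to show zero duplicate rows everywhere and `release_date` as the only column with missing values.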