In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('/content/dataset.csv')

## Export Processed DataFrame to CSV

### Subtask:
Export the final processed `df_encoded` DataFrame to a CSV file.

**Reasoning**:
To save the DataFrame `df_encoded` to a CSV file, I will use the `df_encoded.to_csv()` method. I will set `index=False` to avoid writing the DataFrame index as a column in the CSV file.

In [10]:
df_encoded.to_csv('processed_music_data.csv', index=False)
print("DataFrame exported to 'processed_music_data.csv' successfully.")

DataFrame exported to 'processed_music_data.csv' successfully.
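
As a quick illustration of what `index=False` changes, the sketch below writes a small hypothetical frame (`demo`, not the notebook's data) to in-memory buffers both ways and compares the header lines. The column names are borrowed from the notebook; the values are made up.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for df_encoded
demo = pd.DataFrame({"popularity": [73, 55], "duration_s": [230.666, 149.61]})

with_index = io.StringIO()
without_index = io.StringIO()
demo.to_csv(with_index)                  # default: index written as an unnamed first column
demo.to_csv(without_index, index=False)  # index omitted

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)
print(header_without)
```

With the default setting, the header starts with an empty field for the index, which shows up as an extra unnamed column when the file is read back.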


# Task
Perform data cleaning and feature engineering on the `df` DataFrame. This involves inspecting data types and missing values, handling missing values (especially in `track_genre`), converting `duration_ms` to `duration_s`, creating a `popularity_bucket` feature by binning `popularity`, and one-hot encoding categorical variables such as `track_genre`, `key`, `mode`, and `time_signature`. Finally, display the updated DataFrame's head and info.

## Inspect Data Types and Missing Values (Current df)


**Reasoning**:
This cell prints the data type and non-null count of each column in `df`, then counts the missing values per column.



In [None]:
print(df.info())
print(df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114000 entries, 0 to 113999
Data columns (total 21 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   index             114000 non-null  int64  
 1   track_id          114000 non-null  object 
 2   artists           113999 non-null  object 
 3   album_name        113999 non-null  object 
 4   track_name        113999 non-null  object 
 5   popularity        114000 non-null  int64  
 6   duration_ms       114000 non-null  int64  
 7   explicit          114000 non-null  bool   
 8   danceability      114000 non-null  float64
 9   energy            114000 non-null  float64
 10  key               114000 non-null  int64  
 11  loudness          114000 non-null  float64
 12  mode              114000 non-null  int64  
 13  speechiness       114000 non-null  float64
 14  acousticness      114000 non-null  float64
 15  instrumentalness  114000 non-null  float64
 16  liveness          11

## Handle Missing Values (Current df)

**Reasoning**:
Based on the previous inspection, 'artists', 'album_name', and 'track_name' columns have one missing value each. The subtask requires filling these missing values with 'Unknown' and then verifying the absence of any missing values.



In [3]:
df['artists'] = df['artists'].fillna('Unknown')
df['album_name'] = df['album_name'].fillna('Unknown')
df['track_name'] = df['track_name'].fillna('Unknown')

print(df.isnull().sum())

index               0
track_id            0
artists             0
album_name          0
track_name          0
popularity          0
duration_ms         0
explicit            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
time_signature      0
track_genre         0
dtype: int64
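
The three separate `fillna` calls could also be written as one call, since `DataFrame.fillna` accepts a dict that maps column names to fill values. The following is a minimal sketch on a hypothetical frame (`demo`), not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per text column
demo = pd.DataFrame({
    "artists": ["Gen Hoshino", np.nan],
    "album_name": [np.nan, "Hold On"],
    "track_name": ["Comedy", np.nan],
    "popularity": [73, 82],
})

text_cols = ["artists", "album_name", "track_name"]
# A dict passed to fillna maps each column name to its fill value
demo = demo.fillna({col: "Unknown" for col in text_cols})

remaining = int(demo.isnull().sum().sum())
print(remaining)
```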


## Convert duration_ms to duration_s (Feature Engineering)


**Reasoning**:
The subtask requires converting 'duration_ms' to 'duration_s' by dividing by 1000 and then displaying the head of the new column for verification.



In [4]:
df['duration_s'] = df['duration_ms'] / 1000
print(df[['duration_ms', 'duration_s']].head())

   duration_ms  duration_s
0       230666     230.666
1       149610     149.610
2       210826     210.826
3       201933     201.933
4       198853     198.853


## IGNORE Create popularity_bucket (Feature Engineering)

**Reasoning**:
To create the new categorical feature `popularity_bucket` by binning the `popularity` column, I will define appropriate bins and labels, then use `pd.cut()` to apply these bins and display the head and value counts for verification.



In [None]:
import pandas as pd

bins = [0, 33, 66, 100]
labels = ['Low', 'Medium', 'High']

df['popularity_bucket'] = pd.cut(df['popularity'], bins=bins, labels=labels, right=True, include_lowest=True)

print(df[['popularity', 'popularity_bucket']].head())
print(df['popularity_bucket'].value_counts())

   popularity popularity_bucket
0          73              High
1          55            Medium
2          57            Medium
3          71              High
4          82              High
popularity_bucket
Low       55582
Medium    50786
High       7632
Name: count, dtype: int64
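
To check how the bin edges behave, the sketch below applies the same `bins` and `labels` to a handful of boundary scores. The `scores` series is made up for illustration.

```python
import pandas as pd

bins = [0, 33, 66, 100]
labels = ["Low", "Medium", "High"]

# Boundary scores chosen to probe each bin edge
scores = pd.Series([0, 33, 34, 66, 67, 100])
buckets = pd.cut(scores, bins=bins, labels=labels, right=True, include_lowest=True)
result = buckets.astype(str).tolist()
print(result)

# Without include_lowest, a score of exactly 0 falls outside (0, 33] and becomes NaN
zero_is_missing = bool(pd.cut(pd.Series([0]), bins=bins, labels=labels).isna().iloc[0])
print(zero_is_missing)
```

This is why `include_lowest=True` matters here: tracks with a popularity of 0 would otherwise end up without a bucket.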


## Export Processed DataFrame with New Features to CSV

### Subtask:
Export the final processed `df_encoded` DataFrame, which now includes all engineered features, to a new CSV file.

**Reasoning**:
To save the DataFrame `df_encoded` with all new features to a CSV file, I will use the `df_encoded.to_csv()` method. I will set `index=False` to avoid writing the DataFrame index as a column in the CSV file, and use a new filename to distinguish it from previous exports.

In [None]:
df_encoded.to_csv('processed_music_data_with_features.csv', index=False)
print("DataFrame exported to 'processed_music_data_with_features.csv' successfully.")

DataFrame exported to 'processed_music_data_with_features.csv' successfully.


In [5]:
# Alias, not a copy: `df` and `df_encoded` now refer to the same DataFrame
df_encoded = df
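
Plain assignment binds a second name to the same DataFrame, so columns added through one name appear under the other. The sketch below contrasts that with `DataFrame.copy()` on a hypothetical frame; the names `alias` and `snapshot` are illustrative only.

```python
import pandas as pd

demo = pd.DataFrame({"energy": [0.461, 0.166]})

alias = demo            # same object, as with plain assignment
snapshot = demo.copy()  # independent copy

# Adding a column through the original name
demo["energy_doubled"] = demo["energy"] * 2

print(alias is demo)
print("energy_doubled" in alias.columns)
print("energy_doubled" in snapshot.columns)
```

If the intent is to keep the cleaned `df` untouched while engineering features, `.copy()` is probably the safer choice.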

## Create More Interaction Features

### Subtask:
Create additional interaction features: `energy_valence_gap` and `dance_energy_ratio`.

**Reasoning**:
To further enrich the feature set, I will create two more interaction features: `energy_valence_gap` (absolute difference between energy and valence) and `dance_energy_ratio` (ratio of danceability to energy, with a small epsilon added to energy to prevent division by zero). Finally, I will display the head and info of the DataFrame to verify the additions.

In [6]:
import numpy as np

# Create energy_valence_gap
df_encoded["energy_valence_gap"] = np.abs(
    df_encoded["energy"] - df_encoded["valence"]
)

# Create dance_energy_ratio, adding a small epsilon to energy to avoid division by zero
df_encoded["dance_energy_ratio"] = (
    df_encoded["danceability"] / (df_encoded["energy"] + 1e-6)
)

print("DataFrame head with new interaction features:")
print(df_encoded[['energy', 'valence', 'energy_valence_gap', 'danceability', 'energy', 'dance_energy_ratio']].head())

print("\nDataFrame Info after adding more interaction features:")
df_encoded.info()

DataFrame head with new interaction features:
   energy  valence  energy_valence_gap  danceability  energy  \
0  0.4610    0.715              0.2540         0.676  0.4610   
1  0.1660    0.267              0.1010         0.420  0.1660   
2  0.3590    0.120              0.2390         0.438  0.3590   
3  0.0596    0.143              0.0834         0.266  0.0596   
4  0.4430    0.167              0.2760         0.618  0.4430   

   dance_energy_ratio  
0            1.466374  
1            2.530105  
2            1.220052  
3            4.463012  
4            1.395031  

DataFrame Info after adding more interaction features:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114000 entries, 0 to 113999
Data columns (total 24 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   index               114000 non-null  int64  
 1   track_id            114000 non-null  object 
 2   artists             114000 non-null  object 
 3   alb
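
The role of the epsilon in `dance_energy_ratio` can be seen on a toy example. The frame below is hypothetical; its second row has an energy of exactly 0.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"danceability": [0.5, 0.5], "energy": [0.5, 0.0]})

raw_ratio = demo["danceability"] / demo["energy"]            # inf where energy == 0
safe_ratio = demo["danceability"] / (demo["energy"] + 1e-6)  # finite, but very large

print(raw_ratio.tolist())
print(safe_ratio.tolist())
```

The epsilon keeps the result finite, but rows with near-zero energy still produce very large ratios, so clipping or a log transform may be worth considering before modeling.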

## Create Interaction Features

### Subtask:
Create new interaction features by multiplying existing columns: `energy_x_loudness`, `dance_x_valence`, and `acoustic_x_instrumental`.

**Reasoning**:
To enhance the feature set, I will create new features by multiplying pairs of existing relevant columns. First, I'll compute `loudness_abs` (absolute value of `loudness`) as `loudness` can be negative, and then use it to create `energy_x_loudness`. I will also create `dance_x_valence` and `acoustic_x_instrumental` as requested. Finally, I will display the head of the DataFrame with these new columns and its `info()` to verify the changes.

In [7]:
import numpy as np

# Create loudness_abs
df_encoded["loudness_abs"] = np.abs(df_encoded["loudness"])

# Create new interaction features
df_encoded["energy_x_loudness"] = df_encoded["energy"] * df_encoded["loudness_abs"]
df_encoded["dance_x_valence"] = df_encoded["danceability"] * df_encoded["valence"]
df_encoded["acoustic_x_instrumental"] = (
    df_encoded["acousticness"] * df_encoded["instrumentalness"]
)

print("DataFrame head with new interaction features:")
print(df_encoded[['energy', 'loudness', 'loudness_abs', 'energy_x_loudness', 'danceability', 'valence', 'dance_x_valence', 'acousticness', 'instrumentalness', 'acoustic_x_instrumental']].head())

print("\nDataFrame Info after adding interaction features:")
df_encoded.info()

DataFrame head with new interaction features:
   energy  loudness  loudness_abs  energy_x_loudness  danceability  valence  \
0  0.4610    -6.746         6.746           3.109906         0.676    0.715   
1  0.1660   -17.235        17.235           2.861010         0.420    0.267   
2  0.3590    -9.734         9.734           3.494506         0.438    0.120   
3  0.0596   -18.515        18.515           1.103494         0.266    0.143   
4  0.4430    -9.681         9.681           4.288683         0.618    0.167   

   dance_x_valence  acousticness  instrumentalness  acoustic_x_instrumental  
0         0.483340        0.0322          0.000001             3.252200e-08  
1         0.112140        0.9240          0.000006             5.137440e-06  
2         0.052560        0.2100          0.000000             0.000000e+00  
3         0.038038        0.9050          0.000071             6.398350e-05  
4         0.103206        0.4690          0.000000             0.000000e+00  

DataFrame 

## Label Encode Categorical Variables

In [8]:
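from sklearn.preprocessing import LabelEncoder

cols_to_encode = ['track_genre', 'key', 'mode', 'time_signature']

# Fit a separate encoder per column and keep it, so the integer codes
# can be mapped back to the original values with inverse_transform()
encoders = {}
for col in cols_to_encode:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])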
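
For comparison, here is a pandas-only sketch of the same idea. It uses `pd.factorize(..., sort=True)` rather than `LabelEncoder`; like `LabelEncoder`, it assigns integer codes in sorted order. The frame and the `mappings` dict are hypothetical and only illustrate how keeping a mapping per column makes the codes reversible.

```python
import pandas as pd

demo = pd.DataFrame({
    "track_genre": ["rock", "acoustic", "rock", "jazz"],
    "mode": [1, 0, 1, 1],
})
original_genres = demo["track_genre"].tolist()

mappings = {}
for col in ["track_genre", "mode"]:
    codes, uniques = pd.factorize(demo[col], sort=True)
    mappings[col] = list(uniques)  # position in the list == integer code
    demo[col] = codes

encoded_genres = demo["track_genre"].tolist()
decoded_genres = [mappings["track_genre"][code] for code in encoded_genres]
print(encoded_genres)
print(mappings["track_genre"])
```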

In [9]:
print(df_encoded.head())
print(df_encoded.shape)

   index                track_id                 artists  \
0      0  5SuOikwiRyPMVoIQDJUgSV             Gen Hoshino   
1      1  4qPNDBW1i3p13qLCt0Ki3A            Ben Woodward   
2      2  1iJBSr7s7jYXzM8EGcbK5b  Ingrid Michaelson;ZAYN   
3      3  6lfxq3CG4xtTiEg7opyCyx            Kina Grannis   
4      4  5vjLSffimiIP26QG5WcN2K        Chord Overstreet   

                                          album_name  \
0                                             Comedy   
1                                   Ghost (Acoustic)   
2                                     To Begin Again   
3  Crazy Rich Asians (Original Motion Picture Sou...   
4                                            Hold On   

                   track_name  popularity  duration_ms  explicit  \
0                      Comedy          73       230666     False   
1            Ghost - Acoustic          55       149610     False   
2              To Begin Again          57       210826     False   
3  Can't Help Falling In Love 

## One-Hot Encode Categorical Variables (Feature Engineering)


**Reasoning**:
To one-hot encode the specified categorical columns (`track_genre`, `key`, `mode`, `time_signature`), I will use `pd.get_dummies()` with the `columns` argument, which replaces each listed column with one binary column per category. Because `drop_first=False` keeps every category, the indicator columns for each variable always sum to 1; for linear models, `drop_first=True` would remove that redundancy.



In [None]:
categorical_cols_for_ohe = ['track_genre', 'key', 'mode', 'time_signature']
df_encoded = pd.get_dummies(df, columns=categorical_cols_for_ohe, drop_first=False)

print(df_encoded.head())
print(df_encoded.shape)

   index                track_id                 artists  \
0      0  5SuOikwiRyPMVoIQDJUgSV             Gen Hoshino   
1      1  4qPNDBW1i3p13qLCt0Ki3A            Ben Woodward   
2      2  1iJBSr7s7jYXzM8EGcbK5b  Ingrid Michaelson;ZAYN   
3      3  6lfxq3CG4xtTiEg7opyCyx            Kina Grannis   
4      4  5vjLSffimiIP26QG5WcN2K        Chord Overstreet   

                                          album_name  \
0                                             Comedy   
1                                   Ghost (Acoustic)   
2                                     To Begin Again   
3  Crazy Rich Asians (Original Motion Picture Sou...   
4                                            Hold On   

                   track_name  popularity  duration_ms  explicit  \
0                      Comedy          73       230666     False   
1            Ghost - Acoustic          55       149610     False   
2              To Begin Again          57       210826     False   
3  Can't Help Falling In Love 
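
The effect of `drop_first` is easier to see on a small example. The sketch below one-hot encodes a hypothetical three-row frame both ways; the values are made up.

```python
import pandas as pd

demo = pd.DataFrame({"mode": [0, 1, 1], "popularity": [73, 55, 57]})

full = pd.get_dummies(demo, columns=["mode"], drop_first=False)
reduced = pd.get_dummies(demo, columns=["mode"], drop_first=True)

full_cols = list(full.columns)
reduced_cols = list(reduced.columns)
print(full_cols)
print(reduced_cols)

# With drop_first=False, the indicator columns of one variable sum to 1 in every row
row_sums = full[["mode_0", "mode_1"]].sum(axis=1)
print(row_sums.tolist())
```

In both cases the original `mode` column is replaced automatically, so no separate drop or concatenation step is needed.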

**Reasoning**:
All specified data cleaning and feature engineering steps have been completed. The final instruction is to display the updated DataFrame's head and info to verify the overall changes.



In [None]:
print(df_encoded.head())
print(df_encoded.info())

   index                track_id                 artists  \
0      0  5SuOikwiRyPMVoIQDJUgSV             Gen Hoshino   
1      1  4qPNDBW1i3p13qLCt0Ki3A            Ben Woodward   
2      2  1iJBSr7s7jYXzM8EGcbK5b  Ingrid Michaelson;ZAYN   
3      3  6lfxq3CG4xtTiEg7opyCyx            Kina Grannis   
4      4  5vjLSffimiIP26QG5WcN2K        Chord Overstreet   

                                          album_name  \
0                                             Comedy   
1                                   Ghost (Acoustic)   
2                                     To Begin Again   
3  Crazy Rich Asians (Original Motion Picture Sou...   
4                                            Hold On   

                   track_name  popularity  duration_ms  explicit  \
0                      Comedy          73       230666     False   
1            Ghost - Acoustic          55       149610     False   
2              To Begin Again          57       210826     False   
3  Can't Help Falling In Love 

## Summary:

### Data Analysis Key Findings

*   **Missing Value Handling**: Initially, `artists`, `album_name`, and `track_name` columns each contained 1 missing value. These missing values were successfully imputed with the string 'Unknown', resulting in a DataFrame with no remaining missing values.
*   **Duration Conversion**: A new numerical feature, `duration_s`, was created by converting `duration_ms` (milliseconds) to seconds, by dividing by 1000.
*   **Popularity Bucketing**: The continuous `popularity` score was binned into three categorical buckets: 'Low' (0-33), 'Medium' (34-66), and 'High' (67-100). The distribution of songs across these buckets is: 55,582 'Low', 50,786 'Medium', and 7,632 'High'.
*   **One-Hot Encoding**: The categorical variables `track_genre`, `key`, `mode`, and `time_signature` were successfully one-hot encoded. This process expanded the DataFrame significantly, increasing the column count from 23 (the 21 original columns + `duration_s` + `popularity_bucket`) to 152, with the original categorical columns being replaced by binary (boolean) columns.

### Insights or Next Steps

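*   The data now has no missing values and its categorical variables are encoded, which prepares it for machine learning model training or further statistical analysis. The identifier and text columns (`track_id`, `artists`, `album_name`, `track_name`) are still strings and would need to be dropped or encoded first, and `popularity_bucket` is a pandas categorical that most estimators cannot consume without encoding.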
*   Consider exploring the distributions within each `popularity_bucket` for other features (e.g., `danceability`, `energy`) to understand the characteristics of songs in different popularity tiers.
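
One possible way to start the per-bucket exploration suggested above is a `groupby` on `popularity_bucket`. The sketch below uses a hypothetical six-row frame with made-up values and reuses the bin edges from the bucketing step.

```python
import pandas as pd

demo = pd.DataFrame({
    "popularity": [10, 20, 50, 60, 80, 90],
    "danceability": [0.2, 0.4, 0.5, 0.7, 0.6, 0.8],
})
demo["popularity_bucket"] = pd.cut(
    demo["popularity"],
    bins=[0, 33, 66, 100],
    labels=["Low", "Medium", "High"],
    include_lowest=True,
)

# observed=True restricts the result to buckets that actually occur in the data
bucket_means = demo.groupby("popularity_bucket", observed=True)["danceability"].mean()
print(bucket_means)
```

Swapping `.mean()` for `.describe()` would give a fuller per-bucket summary of the same feature.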
