
In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

# Make plots look nicer
sns.set_theme(style="whitegrid")

Question 1: Introduction to Exploratory Data Analysis (EDA)
Step 1: Load the Data
Note: The dataset file is assumed to be named spotify_data.csv.

In [None]:
# Load dataset
df = pd.read_csv('spotify_data.csv')

(a) Display first and last 5 rows

In [None]:
print("First 5 rows:")
display(df.head())

print("\nLast 5 rows:")
display(df.tail())

(b) Print dataset shape and column names

In [None]:
print(f"Shape of dataset: {df.shape}")
print(f"Column Names: {df.columns.tolist()}")

(c) Use info() and describe()

In [None]:
print("--- Info ---")
df.info()

print("\n--- Statistical Summary ---")
display(df.describe())

(d) Identify numerical and categorical columns

In [None]:
# 'number' matches every numeric dtype (int32, float32, ...), not only int64/float64
numerical_cols = df.select_dtypes(include='number').columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

print(f"Numerical Columns: {numerical_cols}")
print(f"Categorical Columns: {categorical_cols}")

Question 2: Handling Missing Values & Feature Scaling

(a) & (b) Check and Handle Missing Values

In [None]:
# Check for missing values
print("Missing values before cleaning:\n", df.isnull().sum())

# Handling them: We usually drop rows with missing values for simple assignments
df_clean = df.dropna().copy()

print("Missing values after cleaning:\n", df_clean.isnull().sum())
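Dropping rows is the simplest option, but it discards every row that has even one gap. As a minimal sketch of an alternative, the snippet below fills numeric gaps with each column's median. The `toy` DataFrame and its values are hypothetical stand-ins, not rows from spotify_data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "energy": [0.2, np.nan, 0.8, 0.6],
    "tempo": [120.0, 98.0, np.nan, 140.0],
})

# Dropping rows keeps only the fully populated ones
rows_if_dropped = len(toy.dropna())

# Median imputation keeps every row and fills each gap with the column median
toy_imputed = toy.fillna(toy.median(numeric_only=True))

print("Rows kept by dropna:", rows_if_dropped)
print(toy_imputed)
```

Which approach is better depends on how many rows are affected; with only a handful of missing values, dropna() as used above is usually fine.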

(c) Select numerical features

In [None]:
features = ['danceability', 'energy', 'loudness', 'tempo', 'valence']
X = df_clean[features]

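(d) Apply Standardization
Standardization rescales each feature so it has a mean of 0 and a standard deviation of 1. This matters for scale-sensitive models: Logistic Regression (regularized by default in scikit-learn) and distance-based models such as KNN.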

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Convert back to dataframe for easier viewing later (keep the original row index so rows still line up with df_clean)
X_scaled_df = pd.DataFrame(X_scaled, columns=features, index=X.index)
print("First 5 rows of scaled data:")
display(X_scaled_df.head())

(e) Explain Normalization vs Standardization

Normalization (Min-Max Scaling): Rescales every value into a fixed range, usually 0 to 1. Useful when you need non-negative or strictly bounded inputs (for example, in neural networks or image processing).

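Standardization (Z-score): Centers each feature at 0 with a standard deviation of 1. It does not enforce a fixed range (a value of 3.5 is possible), and it is less distorted by outliers than Min-Max scaling, whose range is set entirely by the extreme values. It does not require the data to follow a bell curve; it is generally preferred for scale-sensitive models such as regularized Logistic Regression and KNN.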
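The two definitions can be checked by hand. The sketch below applies both formulas with plain NumPy to a small array of hypothetical values (not taken from the dataset); `values.std()` uses the population standard deviation, which is also what StandardScaler uses.

```python
import numpy as np

# Hypothetical toy values, not from the Spotify dataset
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max normalization: (x - min) / (max - min), bounded to [0, 1]
normalized = (values - values.min()) / (values.max() - values.min())

# Standardization: (x - mean) / std, giving mean 0 and standard deviation 1
standardized = (values - values.mean()) / values.std()

print("Min-max:", np.round(normalized, 3))
print("Z-score:", np.round(standardized, 3))
```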

Question 3: Data Visualization for EDA
(a) Histogram of danceability

In [None]:
plt.figure(figsize=(8, 5))
sns.histplot(df_clean['danceability'], kde=True, color='skyblue')
plt.title('Distribution of Danceability')
plt.xlabel('Danceability Score')
plt.show()
# Observation: The data looks fairly normal, slightly skewed towards higher danceability.

(b) Boxplot of energy

In [None]:
plt.figure(figsize=(8, 5))
sns.boxplot(x=df_clean['energy'], color='lightgreen')
plt.title('Boxplot of Energy')
plt.show()
# Observation: Check if there are dots outside the "whiskers"—those are outliers.

(c) Scatter plot: energy vs loudness

In [None]:
plt.figure(figsize=(8, 6))
# Each point is one track; alpha makes dense regions easier to see
sns.scatterplot(data=df_clean, x='energy', y='loudness', alpha=0.5)
plt.title('Energy vs Loudness')
plt.xlabel('Energy')
plt.ylabel('Loudness (dB)')
plt.show()
# Observation: Look for an upward trend. Louder tracks tend to have higher energy.

(d) Correlation heatmap

In [None]:
plt.figure(figsize=(10, 8))
# Calculating correlation only on numerical columns
corr = df_clean[features].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Feature Correlation Heatmap')
plt.show()
# Observation: Look for dark red squares. Energy and Loudness often have a high correlation coefficient (near 0.7 or 0.8).

Question 4: Audio Features & Mood
(a) Explain danceability, energy, valence

Danceability: Describes how suitable a track is for dancing based on tempo, rhythm stability, and beat strength. High value = very danceable.

Energy: A measure of intensity and activity. Energetic tracks feel fast, loud, and noisy (e.g., Death Metal is high energy, Bach is low).

Valence: A measure of musical "positiveness." High valence sounds happy/cheerful; low valence sounds sad/depressed.

(b) Identify features related to happy or energetic music

Happy Music: High Valence (the primary indicator) and typically high Danceability.

Energetic Music: High Energy, high Loudness, and often faster Tempo.
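These rules of thumb can be expressed as simple filters. The sketch below uses a hypothetical four-row DataFrame in place of df_clean, and the cutoffs (0.5 and 0.7) are illustrative assumptions rather than values given in the assignment.

```python
import pandas as pd

# Hypothetical toy rows standing in for df_clean
tracks = pd.DataFrame({
    "track": ["A", "B", "C", "D"],
    "valence": [0.9, 0.2, 0.7, 0.3],
    "danceability": [0.8, 0.4, 0.5, 0.7],
    "energy": [0.6, 0.9, 0.3, 0.8],
})

# Illustrative cutoffs (assumptions): happy = high valence and high danceability
happy = tracks[(tracks["valence"] > 0.5) & (tracks["danceability"] > 0.5)]

# Illustrative cutoff (assumption): energetic = high energy
energetic = tracks[tracks["energy"] > 0.7]

print("Happy:", happy["track"].tolist())
print("Energetic:", energetic["track"].tolist())
```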

Question 5: Supervised Learning – Classification
(a) Create mood column
We create a binary target variable: 1 for "Happy" (valence above 0.5) and 0 for "Sad" (valence of 0.5 or below).

In [None]:
# 1 if valence > 0.5, else 0
y = (df_clean['valence'] > 0.5).astype(int)

print("Class distribution:")
print(y.value_counts())

(b) Train-test split (80-20)

In [None]:
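# X_scaled was created in Q2 (d); drop 'valence' because the target is derived from it (label leakage)
X_model = X_scaled[:, [i for i, f in enumerate(features) if f != 'valence']]
X_train, X_test, y_train, y_test = train_test_split(X_model, y, test_size=0.2, random_state=42)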
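The split above reuses X_scaled from Q2 (d), where the scaler saw every row, including those that end up in the test set. A minimal sketch of one alternative is to split first and let a scikit-learn pipeline fit the scaler on the training rows only. The data here is synthetic, and the names `X_demo`, `y_demo`, and `model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 4 features, binary target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

# Split the raw (unscaled) data first
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics when scoring the test rows
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.2f}")
```

For a dataset of this kind the difference in accuracy is usually small, but the pipeline keeps the test set fully unseen during preprocessing.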

(c) Train Logistic Regression

In [None]:
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

(d) Evaluate Results

In [None]:
y_pred = log_reg.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {acc:.2f}")
print("Confusion Matrix:")
print(cm)

# Observation: The model tries to predict mood based on energy/danceability.
# Accuracy might be modest because mood is complex!
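Accuracy alone hides which kind of mistake the model makes. As a sketch, the snippet below derives precision and recall from a 2x2 confusion matrix. The counts in `cm_demo` are hypothetical, not results from this notebook; the layout (rows = true class, columns = predicted class) matches what confusion_matrix returns.

```python
import numpy as np

# Hypothetical counts in scikit-learn's layout: rows = true class, columns = predicted class
cm_demo = np.array([[50, 10],
                    [5, 35]])
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()
precision = tp / (tp + fp)  # of tracks predicted "Happy", the share that really are
recall = tp / (tp + fn)     # of truly "Happy" tracks, the share the model found

print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}")
```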

Question 6: Bonus (Compare with KNN)
Let's try K-Nearest Neighbors to see if it performs better.

In [None]:
# Initialize KNN classifier (let's try k=5 neighbors)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict
y_pred_knn = knn.predict(X_test)

# Evaluate
acc_knn = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy: {acc_knn:.2f}")

if acc_knn > acc:
    print("Observation: KNN performed better than Logistic Regression.")
else:
    print("Observation: Logistic Regression performed better than (or similar to) KNN.")
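The choice of k=5 is a starting point, not a tuned value. One hedged way to compare a few candidates is cross-validation, sketched here on synthetic data; `X_demo`, `y_demo`, and the list of k values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), already on a comparable scale
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Mean 5-fold cross-validated accuracy for a few odd values of k
scores = {}
for k in (1, 3, 5, 7, 9):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn_k, X_demo, y_demo, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores)
print("Best k:", best_k)
```

On the real data, the same loop would run on the training split only, so the test set stays untouched until the final comparison.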