# Introduction

Welcome to my report on Assignment 1 for the Machine Learning course (Fys-2021)!

In this report, we will explore the fundamental concepts and techniques of machine learning. Machine learning is a fascinating field of study that focuses on developing algorithms and models that enable computers to learn from data and make predictions or decisions without being explicitly programmed.

Throughout this report, we will cover various topics related to machine learning, including supervised learning, unsupervised learning, data preprocessing, model evaluation, and an introduction to deep learning. We will delve into the concepts, algorithms, and techniques used in each of these areas, providing a comprehensive understanding of the subject matter.

The main objectives of this report are:

1. To provide a clear and concise overview of the key concepts and techniques in machine learning.
2. To demonstrate an understanding of the practical application of machine learning algorithms.
3. To showcase the ability to preprocess data, train models, and evaluate their performance.
4. To explore the potential of deep learning and its applications in various domains.

By the end of this report, you will have gained valuable insights into the field of machine learning and developed practical skills in applying machine learning algorithms to real-world datasets.

Let's dive into the exciting world of machine learning and explore the possibilities it offers for building intelligent systems!



## (1a)

Load the SpotifyFeatures.csv file and report the number of samples (songs) as well as the number of
features (song properties) in the dataset. Hint: you may use the Python module Pandas and its function
read_csv.

In [23]:
import pandas as pd 

# Load the dataset (the path is relative to the notebook's location)
data = pd.read_csv('../data/SpotifyFeatures.csv')
# print(data.head())
# data.shape is (rows, columns): rows are songs, columns are song properties.
# Every column is counted here, including non-numeric ones such as 'genre'.
num_samples, num_features = data.shape
print(f'Number of samples: {num_samples}')
print(f'Number of features: {num_features}')



Number of samples: 232725
Number of features: 18


## (1b) 

Filter samples for 'Pop' and 'Classical', label them, and report the number of samples per class. Extract features 'liveness' and 'loudness'.

In [25]:


# Filter the dataset for Pop and Classical genres
# Use .copy() to avoid the SettingWithCopyWarning by working on a copy of the slice.
pop_classical_df = data[data['genre'].isin(['Pop', 'Classical'])].copy()

# Create labels: 'Pop' = 1, 'Classical' = 0
# pop_classical_df is already an independent copy, so adding a column does not touch `data`.
pop_classical_df['label'] = (pop_classical_df['genre'] == 'Pop').astype(int)

# Report the number of samples for each class
# .shape[0] returns the number of rows (songs) for each class.
num_pop = pop_classical_df[pop_classical_df['label'] == 1].shape[0]
num_classical = pop_classical_df[pop_classical_df['label'] == 0].shape[0]

# Print the number of songs in each class
print(f"Number of Pop songs: {num_pop}")
print(f"Number of Classical songs: {num_classical}")

# Extract only the 'liveness' and 'loudness' features into a numpy array
features = pop_classical_df[['liveness', 'loudness']].values

# Extract the labels (1 for Pop, 0 for Classical)
labels = pop_classical_df['label'].values


Number of Pop songs: 9386
Number of Classical songs: 9256
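The filter-and-label step above can be sketched on a small stand-in table, which makes it easy to check by eye that only the two genres survive and that the labels line up with the feature rows. This is a minimal sketch, not the assignment's data: the `toy` DataFrame and all of its values are hypothetical, and `value_counts` is used here only as an alternative way to get the per-class counts.

```python
import pandas as pd

# Hypothetical toy data standing in for SpotifyFeatures.csv (all values are made up).
toy = pd.DataFrame({
    "genre": ["Pop", "Classical", "Rock", "Pop", "Classical"],
    "liveness": [0.12, 0.08, 0.30, 0.25, 0.10],
    "loudness": [-5.1, -22.4, -7.0, -4.3, -19.8],
})

# Keep only the two genres of interest; .copy() gives an independent frame.
subset = toy[toy["genre"].isin(["Pop", "Classical"])].copy()

# Label Pop as 1 and Classical as 0.
subset["label"] = (subset["genre"] == "Pop").astype(int)

# Per-class counts in a single call, ordered by label value.
class_counts = subset["label"].value_counts().sort_index()

# Feature matrix (n_samples, 2) and label vector (n_samples,).
X = subset[["liveness", "loudness"]].to_numpy()
y = subset["label"].to_numpy()

print(class_counts.to_dict())
print(X.shape, y.shape)
```

The 'Rock' row is dropped by the `isin` filter, so the feature matrix and the label vector end up with the same number of rows.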


In [32]:
# import numpy as np

# # Convert features ('liveness' and 'loudness') and labels to NumPy arrays
# features = pop_classical_df[['liveness', 'loudness']].values
# labels = pop_classical_df['label'].values

# # Split the data manually while preserving the class distribution
# def train_test_split_by_class(features, labels, test_size=0.2):
#     # Separate data by class
#     pop_features = features[labels == 1]
#     classical_features = features[labels == 0]
#     pop_labels = labels[labels == 1]
#     classical_labels = labels[labels == 0]

#     # Determine the number of test samples
#     num_pop_test = int(test_size * len(pop_features))
#     num_classical_test = int(test_size * len(classical_features))

#     # Shuffle the indices
#     pop_indices = np.random.permutation(len(pop_features))
#     classical_indices = np.random.permutation(len(classical_features))

#     # Split the pop songs into train and test sets
#     pop_train_features = pop_features[pop_indices[num_pop_test:]]
#     pop_test_features = pop_features[pop_indices[:num_pop_test]]
#     pop_train_labels = pop_labels[pop_indices[num_pop_test:]]
#     pop_test_labels = pop_labels[pop_indices[:num_pop_test]]

#     # Split the classical songs into train and test sets
#     classical_train_features = classical_features[classical_indices[num_classical_test:]]
#     classical_test_features = classical_features[classical_indices[:num_classical_test]]
#     classical_train_labels = classical_labels[classical_indices[num_classical_test:]]
#     classical_test_labels = classical_labels[classical_indices[:num_classical_test]]

#     # Concatenate pop and classical songs back together
#     X_train = np.concatenate([pop_train_features, classical_train_features])
#     X_test = np.concatenate([pop_test_features, classical_test_features])
#     y_train = np.concatenate([pop_train_labels, classical_train_labels])
#     y_test = np.concatenate([pop_test_labels, classical_test_labels])

#     # Shuffle the training and test sets so the samples are no longer grouped by class
#     train_indices = np.random.permutation(len(X_train))
#     test_indices = np.random.permutation(len(X_test))

#     return X_train[train_indices], X_test[test_indices], y_train[train_indices], y_test[test_indices]

# # Perform the split
# X_train, X_test, y_train, y_test = train_test_split_by_class(features, labels, test_size=0.2)

# # Output the results
# print(f"Training set size: {len(X_train)}")
# print(f"Test set size: {len(X_test)}")


In [33]:
from sklearn.model_selection import train_test_split

# Split the data into training and test sets while preserving the class distribution
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, stratify=labels, random_state=42)

# Output the results
print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

Training set size: 14913
Test set size: 3729
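One way to confirm that a stratified split keeps the class balance is to compare the share of Pop labels before and after splitting. The sketch below assumes synthetic labels built from the class counts reported in (1b) and random placeholder features, so it checks only the behaviour of `stratify`, not the real data; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts reported in (1b): 9386 Pop (1), 9256 Classical (0).
y = np.concatenate([np.ones(9386, dtype=int), np.zeros(9256, dtype=int)])

# Placeholder features (random numbers, NOT the real liveness/loudness values).
rng = np.random.default_rng(0)
X = rng.normal(size=(len(y), 2))

# Same call as in the report: 80/20 split, stratified on the labels.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Share of Pop songs overall, in the training set, and in the test set.
pop_share_all = y.mean()
pop_share_train = y_tr.mean()
pop_share_test = y_te.mean()

print(f"Sizes: train={len(X_tr)}, test={len(X_te)}")
print(f"Pop share: all={pop_share_all:.4f}, train={pop_share_train:.4f}, test={pop_share_test:.4f}")
```

If the three shares agree to within rounding, the split has preserved the class distribution; dropping `stratify=y` would leave the shares close only on average, not by construction.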
