# Content-Based Recsys

**Problem**

- Build a recsys that returns the 5 songs most similar to a given song, based on its content (Artist, Genre and Length in this case).
- Similarity is computed with cosine similarity, a common metric for comparing feature vectors.
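
As a minimal illustration of the metric, the cosine similarity of two feature vectors is their dot product divided by the product of their norms. The sketch below computes it by hand for three toy vectors; the values and the `cosine` helper are made up for illustration and are not taken from the dataset.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical feature vectors: [scaled length, is_pop, is_rock]
song_a = [0.8, 1, 0]
song_b = [0.6, 1, 0]   # same genre as song_a, different length
song_c = [0.8, 0, 1]   # same length as song_a, different genre

sim_ab = cosine(song_a, song_b)
sim_ac = cosine(song_a, song_c)
print(round(sim_ab, 3), round(sim_ac, 3))
```

In this toy setup, sharing a genre matters far more than having the same length, because the one-hot entries are larger than the scaled length.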

*Note: In a real-world scenario, adding more descriptive features like song lyrics, mood, tempo, etc., would help in building a more robust content-based recommendation system.*


**Features**
1. Convert categorical attributes (like Genre and Artist) to a numerical format, for example with One-Hot Encoding.
2. Scale numerical attributes (like Length) so they are on a range comparable to the encoded columns.
3. For each song, we can then compute its feature vector.
4. To get song recommendations for a given song, we can compute the similarity between this song's feature vector and the feature vectors of all other songs. The most similar songs are the ones recommended.

In [1]:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from src.utils import load_songs_data


In [2]:
# Load df
df = load_songs_data()
df.head()

Unnamed: 0,Title,Artist,Genre,Length (seconds)
0,Song0,Christina Aguilera,Pop,290.0
1,Song1,The Rolling Stones,Rock,233.0
2,Song2,OutKast,Hip-Hop,282.0
3,Song3,Jeff Mills,Techno,358.0
4,Song4,Arcade Fire,Indie,150.0


In [3]:
# One-hot encode the categorical columns (Genre and Artist)
df_encoded = pd.get_dummies(df, columns=['Genre', 'Artist'])
# Scale length into (0, 1] by dividing by the maximum length
df_encoded['Length (seconds)'] = df_encoded['Length (seconds)'] / df_encoded['Length (seconds)'].max()
df_encoded.head()

Unnamed: 0,Title,Length (seconds),Genre_Blues,Genre_Country,Genre_Electronic,Genre_Folk,Genre_Hip-Hop,Genre_Indie,Genre_Jazz,Genre_Pop,...,Artist_Skrillex,Artist_Stevie Ray Vaughan,Artist_Taylor Swift,Artist_The Chemical Brothers,Artist_The Rolling Stones,Artist_The Shins,Artist_U2,Artist_Vampire Weekend,Artist_Willie Nelson,Artist_Zedd
0,Song0,0.810056,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,Song1,0.650838,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,Song2,0.787709,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Song3,1.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Song4,0.418994,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
def recommend_songs(song_title, df_encoded, top_n=5):
    # Find the row position of the song (raises IndexError if the title is missing)
    idx = (df_encoded['Title'] == song_title).to_numpy().nonzero()[0][0]
    # Compute cosine similarity between this song's features and the features of all songs
    df_sim = df_encoded.drop(columns='Title')
    cosine_similarities = cosine_similarity(df_sim.iloc[[idx]], df_sim)[0]
    # Rank all songs from most to least similar to this song
    sim_scores = list(enumerate(cosine_similarities))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Return the top_n most similar songs, explicitly excluding the input song itself
    sim_song_indices = [i for i, _ in sim_scores if i != idx][:top_n]
    return df_encoded['Title'].iloc[sim_song_indices]

In [5]:
print(recommend_songs('Song0', df_encoded))

240    Song240
140    Song140
50      Song50
10      Song10
230    Song230
Name: Title, dtype: object


# Improving Content-based Recommender Systems

The simple content-based recommender system we've developed has several potential shortcomings and challenges:

1. **Sparsity with One-Hot Encoding**: One-hot encoding categorical data (like genres and artists) produces a wide, mostly-zero feature matrix, especially when there are many distinct values. Most pairs of songs then share no active categorical feature, so their similarity is driven almost entirely by the few remaining numeric features (here, only Length) and the scores become hard to tell apart.

2. **Bias towards Popular Genres**: Since the system recommends items based on their content, songs from genres with many entries are recommended more often simply because there are more of them, which can reduce the diversity of recommendations.

3. **Cold Start Problem for New Songs**: New songs added to the dataset cannot be recommended, or receive recommendations, until they are encoded and added to the feature matrix. A new artist or genre also introduces a new one-hot column, so the matrix has to be rebuilt.

4. **Lack of Personalization**: The recommendations are based solely on song features and don't consider user-specific preferences. Two users with vastly different musical tastes get the same recommendations for a given song.

5. **Scalability**: The number of pairwise cosine similarities grows quadratically with the number of songs, so computing them all becomes computationally intensive for larger datasets with many songs and features. More efficient methods or data structures might be needed for real-world large-scale applications.
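
For the cold-start point, one way to score a new song is to encode it into the same columns as the existing feature matrix. The sketch below is a minimal illustration under assumptions: the toy catalog, the helper name `encode_new_song`, and the choice to reuse the catalog's maximum length for scaling are hypothetical and not part of the notebook above.

```python
import pandas as pd

# Toy catalog standing in for the real songs data (hypothetical subset)
catalog = pd.DataFrame({
    "Title": ["Song0", "Song1", "Song2"],
    "Artist": ["Christina Aguilera", "The Rolling Stones", "OutKast"],
    "Genre": ["Pop", "Rock", "Hip-Hop"],
    "Length (seconds)": [290.0, 233.0, 282.0],
})

# Encode the catalog the same way as the notebook: one-hot + max scaling
max_length = catalog["Length (seconds)"].max()
encoded = pd.get_dummies(catalog, columns=["Genre", "Artist"], dtype=float)
encoded["Length (seconds)"] = encoded["Length (seconds)"] / max_length
feature_columns = encoded.columns.drop("Title")

def encode_new_song(song, feature_columns, max_length):
    """Encode one song (a dict) into the catalog's existing feature columns."""
    row = pd.get_dummies(pd.DataFrame([song]), columns=["Genre", "Artist"], dtype=float)
    row["Length (seconds)"] = row["Length (seconds)"] / max_length
    # Align to the catalog's columns: categories unseen in the catalog are
    # dropped, and one-hot columns missing from this row are filled with 0
    return row.reindex(columns=feature_columns, fill_value=0.0)

new_song = {"Title": "SongNew", "Artist": "Unknown Artist", "Genre": "Pop",
            "Length (seconds)": 200.0}
new_features = encode_new_song(new_song, feature_columns, max_length)
print(new_features)
```

An artist or genre that the catalog has never seen ends up as all zeros in the one-hot columns, so only the remaining features contribute to the similarity until the full matrix is rebuilt.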

# Improving Scalability of the Model


1. **Dimensionality Reduction**:
    - Techniques like PCA (Principal Component Analysis) or autoencoders can reduce the number of feature dimensions, which makes similarity computations faster in high-dimensional spaces. t-SNE also reduces dimensionality, but it is mainly used for visualization rather than as a preprocessing step for similarity search.

2. **Approximate Nearest Neighbors (ANN) Algorithms**:
    - Instead of computing exact similarities, use algorithms that find approximate nearest neighbors. Libraries like `Annoy`, `Faiss`, and `NMSLib` provide efficient implementations for this. They trade a small loss in accuracy for much quicker similarity searches, especially in high-dimensional spaces.

3. **Sampling**:
    - In some applications, it might be feasible to use a representative subset of the data rather than the whole dataset. Techniques like stratified sampling help ensure the sample retains the characteristics of the full dataset.

4. **Matrix Factorization**:
    - Techniques like Singular Value Decomposition (SVD) decompose a matrix into lower-dimensional factors. In collaborative filtering this is the user-item matrix; in a content-based system it can be applied to the item-feature matrix. Working with the low-dimensional factors reduces the number of computations required.

5. **Batching**:
    - Instead of computing similarities for each item in real time, compute them in batches and store the results. Real-time recommendations can then be served from the pre-computed values.

6. **Clustering**:
    - Group songs into clusters using algorithms like K-means or DBSCAN. Once items are grouped, similarities only need to be computed within a song's own cluster, which reduces computational overhead.

7. **Caching**:
    - Store the results of expensive computations in memory or other fast-access storage to avoid redundant calculations. Systems like Redis can be used for this.

8. **Feedback Loop**:
    - Regularly prune items or users that are deemed 'inactive' or 'irrelevant' based on user feedback and interactions, thereby reducing the size of the dataset being processed.
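
The batching and caching ideas above can be sketched as a precomputation step: normalize the feature rows once, compute similarities one block of rows at a time, and keep only each song's top-k neighbors in a lookup table. This is a minimal sketch on random data; the function name `precompute_top_k`, the batch size, and the in-memory array standing in for a cache are assumptions for illustration, not part of the notebook above.

```python
import numpy as np

def precompute_top_k(features, k=5, batch_size=256):
    """Return an (n_songs, k) array with each song's k most similar song indices."""
    # L2-normalize rows so that a dot product equals cosine similarity
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    n = unit.shape[0]
    top_k = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # Similarities of this batch against all songs: a (batch, n) block,
        # so the full (n, n) matrix is never held in memory at once
        sims = unit[start:stop] @ unit.T
        # Exclude each song from its own recommendations
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # Sort each row by descending similarity and keep the first k indices
        top_k[start:stop] = np.argsort(-sims, axis=1)[:, :k]
    return top_k

rng = np.random.default_rng(0)
features = rng.random((1000, 20))   # stand-in for the encoded song features
neighbors = precompute_top_k(features, k=5)

# At request time, a recommendation is a simple row lookup
print(neighbors[0])
```

In a real deployment the `neighbors` table could be written to a store such as Redis and refreshed on a schedule; how often to refresh depends on how quickly the catalog changes.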
