Xinghua Wu 

z5526892

# Part 1

## Q1:

i) The new regular expression retains letters, numbers, spaces, and some meaningful symbols.
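A minimal sketch of what this filter keeps, using the same pattern as the `clean_lyrics` cell in this notebook. The sample string is made up for illustration.

```python
import re

# Same character class as in clean_lyrics: keep letters, digits, spaces
# and the symbols # @ $ % & - _ / \ ; everything else is deleted.
KEEP_PATTERN = r"[^a-zA-Z0-9 #@\$%&\-_\/\\]"

def clean_lyrics(text):
    text = re.sub(KEEP_PATTERN, "", str(text))  # drop disallowed characters
    text = text.lower()                          # normalize case
    return re.sub(r"\s+", " ", text).strip()     # collapse repeated spaces

# Hypothetical sample line, not taken from the dataset.
sample = "Hello, World!  It's 100% rock & roll #1"
cleaned = clean_lyrics(sample)
print(cleaned)
```

Note that disallowed characters are deleted rather than replaced by a space, so `It's` becomes `its`. Newlines and tabs are also outside the allowed set, so words separated only by a line break would be joined.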


ii) I will use the 5-fold cross-validation provided by scikit-learn. Averaging over five folds gives a more stable estimate of the model's generalization ability than a single train/test split, and avoids overestimating or underestimating performance because of one lucky or unlucky split.
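One way to set this up is sketched below, assuming scikit-learn is installed. The toy texts and labels are hypothetical stand-ins for the lyrics and topics. Wrapping the vectorizer and the classifier in a pipeline is a design choice of this sketch: the vocabulary is then learned from the training folds only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical); replace with the real lyrics and topics.
texts = [
    "storm night fear", "fight blood night", "shadow fear storm",
    "night fight fear", "blood storm shadow", "fear night shadow",
    "dream life hope", "life grow dream", "hope change life",
    "dream future grow", "life hope future", "grow dream change",
]
labels = ["dark"] * 6 + ["personal"] * 6

# The vectorizer is refit inside every fold, so test-fold words never
# influence the vocabulary used for training.
model = make_pipeline(CountVectorizer(), MultinomialNB())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="f1_macro")

print(f"macro F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```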

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

df = pd.read_csv('dataset.tsv', sep='\t')
print(df.head())

                            artist_name         track_name  release_date  \
0                                loving  the not real lake          2016   
1                               incubus    into the summer          2019   
2                             reignwolf           hardcore          2016   
3                  tedeschi trucks band             anyhow          2016   
4  lukas nelson and promise of the real  if i started over          2017   

   genre                                             lyrics      topic  
0   rock  awake know go see time clear world mirror worl...       dark  
1   rock  shouldn summer pretty build spill ready overfl...  lifestyle  
2  blues  lose deep catch breath think say try break wal...    sadness  
3  blues  run bitter taste take rest feel anchor soul pl...    sadness  
4  blues  think think different set apart sober mind sym...       dark  


In [2]:
import re

def clean_lyrics(text):
    # Keep letters, numbers, spaces and some meaningful symbols.
    text = re.sub(r"[^a-zA-Z0-9 #@\$%&\-_\/\\]", "", str(text))
    text = text.lower()  # Convert to lowercase
    text = re.sub(r"\s+", " ", text).strip() 
    return text

df['lyrics_clean'] = df['lyrics'].apply(clean_lyrics)


In [3]:
print(df['topic'].value_counts())

topic
dark         490
sadness      376
personal     347
lifestyle    205
emotion       82
Name: count, dtype: int64


In [4]:
print(df['lyrics'].iloc[0])
print(df['lyrics_clean'].iloc[0])

awake know go see time clear world mirror world mirror magic hour confuse power steal word unheard unheard certain forget bless angry weather head angry weather head angry weather head know gentle night mindless fight walk woods
awake know go see time clear world mirror world mirror magic hour confuse power steal word unheard unheard certain forget bless angry weather head angry weather head angry weather head know gentle night mindless fight walk woods


In [5]:
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

def preprocess(text, 
               keep_emotion_punct=True, 
               use_nltk_tokenizer=False, 
               stopword_source=None, 
               to_lower=True, 
               stemming=None):
    # Special character handling
    if keep_emotion_punct:
        text = re.sub(r"[^a-zA-Z!?]", " ", text)
    else:
        text = re.sub(r"[^a-zA-Z]", " ", text)
    # lowercase
    if to_lower:
        text = text.lower()
    # tokenize
    if use_nltk_tokenizer:
        tokens = word_tokenize(text)
    else:
        tokens = text.split()
    # stopwords
    if stopword_source == 'nltk':
        stop_words = set(stopwords.words('english'))
        tokens = [w for w in tokens if w not in stop_words]
    elif stopword_source == 'sklearn':
        from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
        stop_words = set(ENGLISH_STOP_WORDS)
        tokens = [w for w in tokens if w not in stop_words]
    # stemming/lemmatization
    if stemming == 'porter':
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(w) for w in tokens]
    elif stemming == 'lemmatizer':
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(w) for w in tokens]
    return ' '.join(tokens)

In [6]:
def run_experiment(preprocess_kwargs):
    # preprocessing
    df['processed'] = df['lyrics_clean'].apply(lambda x: preprocess(x, **preprocess_kwargs))
    # split data
    X_train, X_test, y_train, y_test = train_test_split(df['processed'], df['topic'], test_size=0.2, random_state=42)
    # feature extraction
    vectorizer = CountVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)
    # train model
    clf = MultinomialNB()
    clf.fit(X_train_vec, y_train)
    # evaluate
    y_pred = clf.predict(X_test_vec)
    acc = accuracy_score(y_test, y_pred)
    return acc

In [7]:
# find the best config
results = []
for keep_emotion_punct in [True, False]:
    for use_nltk_tokenizer in [True, False]:
        for stopword_source in [None, 'nltk', 'sklearn']:
            for to_lower in [True, False]:
                for stemming in [None, 'porter', 'lemmatizer']:
                    kwargs = dict(
                        keep_emotion_punct=keep_emotion_punct,
                        use_nltk_tokenizer=use_nltk_tokenizer,
                        stopword_source=stopword_source,
                        to_lower=to_lower,
                        stemming=stemming
                    )
                    acc = run_experiment(kwargs)
                    results.append((kwargs, acc))
# output the best result
best = max(results, key=lambda x: x[1])
print("Best config:", best[0], "Accuracy:", best[1])

Best config: {'keep_emotion_punct': True, 'use_nltk_tokenizer': True, 'stopword_source': 'sklearn', 'to_lower': True, 'stemming': None} Accuracy: 0.8133333333333334


## Q2 :

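Using the default `CountVectorizer` configuration, I tried every combination of the following preprocessing options: whether to keep the `!` and `?` characters, whether to tokenize with NLTK's `word_tokenize` or a plain whitespace split, which stop-word list to use (none, NLTK, or scikit-learn), whether to lowercase, and whether to apply Porter stemming, WordNet lemmatization, or neither.

The best configuration keeps `!` and `?`, tokenizes with NLTK, removes scikit-learn's stop words, lowercases the text, and uses neither stemming nor lemmatization (accuracy 0.8133 on the 20% hold-out split).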
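The nested loops used for this search can also be written with `itertools.product`. The sketch below only enumerates the configurations; the option values are copied from the search cell, while `GRID` and `iter_configs` are illustrative names.

```python
from itertools import product

# Option values copied from the search loop in this notebook.
GRID = {
    "keep_emotion_punct": [True, False],
    "use_nltk_tokenizer": [True, False],
    "stopword_source": [None, "nltk", "sklearn"],
    "to_lower": [True, False],
    "stemming": [None, "porter", "lemmatizer"],
}

def iter_configs(grid):
    """Yield one keyword-argument dict per combination of options."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(GRID))
print(len(configs))  # 2 * 2 * 3 * 2 * 3 = 72 combinations
```

Each dict could then be passed to `run_experiment` in the same way as `kwargs` in the original loop.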

In [8]:
df

| | artist_name | track_name | release_date | genre | lyrics | topic | lyrics_clean | processed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | loving | the not real lake | 2016 | rock | awake know go see time clear world mirror worl... | dark | awake know go see time clear world mirror worl... | awake know time clear world mirror world mirro... |
| 1 | incubus | into the summer | 2019 | rock | shouldn summer pretty build spill ready overfl... | lifestyle | shouldn summer pretty build spill ready overfl... | shouldn summer pretty build spill ready overfl... |
| 2 | reignwolf | hardcore | 2016 | blues | lose deep catch breath think say try break wal... | sadness | lose deep catch breath think say try break wal... | lose deep catch breath think say try break wal... |
| 3 | tedeschi trucks band | anyhow | 2016 | blues | run bitter taste take rest feel anchor soul pl... | sadness | run bitter taste take rest feel anchor soul pl... | run bitter taste rest feel anchor soul play ga... |
| 4 | lukas nelson and promise of the real | if i started over | 2017 | blues | think think different set apart sober mind sym... | dark | think think different set apart sober mind sym... | think think different set apart sober mind sym... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1495 | ra ra riot | absolutely | 2016 | rock | year absolutely absolutely absolutely crush ab... | emotion | year absolutely absolutely absolutely crush ab... | year absolutely absolutely absolutely crush ab... |
| 1496 | mat kearney | face to face | 2018 | rock | breakthrough hours hear truth moments trade fa... | dark | breakthrough hours hear truth moments trade fa... | breakthrough hour hear truth moment trade fake... |
| 1497 | owane | born in space | 2018 | jazz | look look right catch blue eye own state breat... | dark | look look right catch blue eye own state breat... | look look right catch blue eye state breath ce... |
| 1498 | nappy roots | blowin' trees | 2019 | hip hop | nappy root gotta alright flyin dear leave lone... | personal | nappy root gotta alright flyin dear leave lone... | nappy root gotta alright flyin dear leave lone... |


In [9]:
# construct the text column
new_df = df[['artist_name', 'track_name', 'genre', 'lyrics_clean','topic']].copy()
new_df['text'] = (
    new_df['artist_name'].astype(str) + ' ' +
    new_df['track_name'].astype(str) + ' ' +
    new_df['genre'].astype(str) + ' ' +
    new_df['lyrics_clean'].astype(str)
)

# use the best preprocess parameters to process the text column
def best_preprocess(text):
    return preprocess(
        text,
        keep_emotion_punct=True,
        use_nltk_tokenizer=True,
        stopword_source='sklearn',
        to_lower=True,
        stemming=None
    )

new_df['text_processed'] = new_df['text'].apply(best_preprocess)

# use CountVectorizer to encode
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(new_df['text_processed'])

# print the shape and part of the encoded result
print('encoded result shape:', X.shape)
print('part of feature names:', vectorizer.get_feature_names_out()[:20])
print('part of encoded result (first 5 rows):\n', X[:5].toarray())

encoded result shape: (1500, 9808)
part of feature names: ['aaaah' 'aaah' 'aaahaha' 'aaliyah' 'aand' 'aaron' 'ab' 'ababa' 'aback'
 'abandon' 'abdicate' 'aberration' 'abide' 'ability' 'ablaze' 'able'
 'abolition' 'abomination' 'abound' 'abroad']
part of encoded result (first 5 rows):
 [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [10]:
new_df

| | artist_name | track_name | genre | lyrics_clean | topic | text | text_processed |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | loving | the not real lake | rock | awake know go see time clear world mirror worl... | dark | loving the not real lake rock awake know go se... | loving real lake rock awake know time clear wo... |
| 1 | incubus | into the summer | rock | shouldn summer pretty build spill ready overfl... | lifestyle | incubus into the summer rock shouldn summer pr... | incubus summer rock shouldn summer pretty buil... |
| 2 | reignwolf | hardcore | blues | lose deep catch breath think say try break wal... | sadness | reignwolf hardcore blues lose deep catch breat... | reignwolf hardcore blues lose deep catch breat... |
| 3 | tedeschi trucks band | anyhow | blues | run bitter taste take rest feel anchor soul pl... | sadness | tedeschi trucks band anyhow blues run bitter t... | tedeschi trucks band blues run bitter taste re... |
| 4 | lukas nelson and promise of the real | if i started over | blues | think think different set apart sober mind sym... | dark | lukas nelson and promise of the real if i star... | lukas nelson promise real started blues think ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1495 | ra ra riot | absolutely | rock | year absolutely absolutely absolutely crush ab... | emotion | ra ra riot absolutely rock year absolutely abs... | ra ra riot absolutely rock year absolutely abs... |
| 1496 | mat kearney | face to face | rock | breakthrough hours hear truth moments trade fa... | dark | mat kearney face to face rock breakthrough hou... | mat kearney face face rock breakthrough hours ... |
| 1497 | owane | born in space | jazz | look look right catch blue eye own state breat... | dark | owane born in space jazz look look right catch... | owane born space jazz look look right catch bl... |
| 1498 | nappy roots | blowin' trees | hip hop | nappy root gotta alright flyin dear leave lone... | personal | nappy roots blowin' trees hip hop nappy root g... | nappy roots blowin trees hip hop nappy root go... |


In [11]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.metrics import make_scorer, f1_score, accuracy_score, confusion_matrix, classification_report
import numpy as np

X = vectorizer.fit_transform(new_df['text_processed']) # feature matrix
y = new_df['topic']     # label

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Macro F1
macro_f1 = make_scorer(f1_score, average='macro')

# MultinomialNB
mnb = MultinomialNB()
mnb_f1 = cross_val_score(mnb, X, y, cv=skf, scoring=macro_f1)
mnb_acc = cross_val_score(mnb, X, y, cv=skf, scoring='accuracy')

# BernoulliNB
bnb = BernoulliNB()
bnb_f1 = cross_val_score(bnb, X, y, cv=skf, scoring=macro_f1)
bnb_acc = cross_val_score(bnb, X, y, cv=skf, scoring='accuracy')

print(f"MNB Macro F1: {np.mean(mnb_f1):.4f}, Accuracy: {np.mean(mnb_acc):.4f}")
print(f"BNB Macro F1: {np.mean(bnb_f1):.4f}, Accuracy: {np.mean(bnb_acc):.4f}")

# confusion matrix and detailed report
for train_idx, test_idx in skf.split(X, y):
    # use positional indexing so this works regardless of the Series index
    mnb.fit(X[train_idx], y.iloc[train_idx])
    y_pred = mnb.predict(X[test_idx])
    print(classification_report(y.iloc[test_idx], y_pred))
    print(confusion_matrix(y.iloc[test_idx], y_pred))
    break  # only show one fold

MNB Macro F1: 0.7091, Accuracy: 0.7867
BNB Macro F1: 0.3448, Accuracy: 0.5293
              precision    recall  f1-score   support

        dark       0.76      0.72      0.74        98
     emotion       0.29      0.12      0.17        16
   lifestyle       0.82      0.66      0.73        41
    personal       0.85      0.84      0.85        69
     sadness       0.68      0.88      0.77        76

    accuracy                           0.75       300
   macro avg       0.68      0.65      0.65       300
weighted avg       0.74      0.75      0.74       300

[[71  3  2  3 19]
 [ 3  2  3  2  6]
 [ 8  0 27  3  3]
 [ 5  1  1 58  4]
 [ 6  1  0  2 67]]


## Q3:

Regarding the evaluation method, I used `StratifiedKFold` from scikit-learn for 5-fold cross-validation. Stratification keeps the topic proportions roughly the same in every fold, which reduces the effect of an unlucky data partition on the evaluation.

The evaluation metrics are:

- Accuracy: the proportion of correct predictions, the most intuitive classification metric.
- Macro-averaged F1-score: the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size.
- Confusion matrix: shows which classes the model confuses with each other.
- Precision/recall: used to analyze performance on the minority classes in more detail.

The five topic labels are imbalanced (from 490 songs for `dark` down to 82 for `emotion`), so I use macro F1 as the main metric to account for the model's performance on every class.
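The difference between accuracy and macro F1 on imbalanced labels can be seen in a small hand-computed case. The sketch below uses made-up labels and a plain-Python F1 (not scikit-learn's implementation): a classifier that always predicts the majority class still reaches 80% accuracy, but its macro F1 is much lower.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 for one class, computed as 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)

# Hypothetical imbalanced example: 8 'dark' songs, 2 'emotion' songs,
# and a classifier that always answers 'dark'.
y_true = ["dark"] * 8 + ["emotion"] * 2
y_pred = ["dark"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, macro F1={macro:.3f}")
```

Here the `dark` class has F1 = 16/18 and the `emotion` class has F1 = 0, so the macro average is 4/9 even though accuracy is 0.8.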

In [12]:

max_features_list = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, None]
results = []

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
macro_f1 = make_scorer(f1_score, average='macro')

for max_features in max_features_list:
    vectorizer = CountVectorizer(max_features=max_features)
    X = vectorizer.fit_transform(new_df['text_processed'])
    y = new_df['topic']

    # MultinomialNB
    mnb = MultinomialNB()
    mnb_f1 = cross_val_score(mnb, X, y, cv=skf, scoring=macro_f1)
    mnb_acc = cross_val_score(mnb, X, y, cv=skf, scoring='accuracy')

    # BernoulliNB
    bnb = BernoulliNB()
    bnb_f1 = cross_val_score(bnb, X, y, cv=skf, scoring=macro_f1)
    bnb_acc = cross_val_score(bnb, X, y, cv=skf, scoring='accuracy')

    results.append({
        'max_features': max_features if max_features is not None else 'all',
        'MNB_f1': np.mean(mnb_f1),
        'MNB_acc': np.mean(mnb_acc),
        'BNB_f1': np.mean(bnb_f1),
        'BNB_acc': np.mean(bnb_acc)
    })

results_df = pd.DataFrame(results)
print(results_df)


   max_features    MNB_f1   MNB_acc    BNB_f1   BNB_acc
0           100  0.723673  0.750667  0.504874  0.583333
1           200  0.819569  0.839333  0.546220  0.626000
2           300  0.834056  0.859333  0.556188  0.653333
3           400  0.833829  0.862667  0.550633  0.642667
4           500  0.832464  0.860667  0.543731  0.641333
5           600  0.819904  0.849333  0.535553  0.639333
6           700  0.812070  0.836667  0.534089  0.640000
7           800  0.813866  0.839333  0.538203  0.642667
8           900  0.814999  0.839333  0.535388  0.644000
9          1000  0.806872  0.836000  0.528638  0.637333
10          all  0.709071  0.786667  0.344754  0.529333


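## Q4:
As the results show, MultinomialNB performs best with N between 300 and 500. Macro F1 peaks at N = 300 (0.8341) and is almost identical at N = 400 (0.8338), where accuracy is highest (0.8627), so I take N = 400 as the best value.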
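As a quick check of this choice, the sketch below looks up the best N for each metric. The numbers are the MultinomialNB values copied from the results table above; the variable names are illustrative.

```python
# (macro F1, accuracy) for MultinomialNB, copied from the results table.
mnb_results = {
    100: (0.723673, 0.750667),
    200: (0.819569, 0.839333),
    300: (0.834056, 0.859333),
    400: (0.833829, 0.862667),
    500: (0.832464, 0.860667),
    600: (0.819904, 0.849333),
    700: (0.812070, 0.836667),
    800: (0.813866, 0.839333),
    900: (0.814999, 0.839333),
    1000: (0.806872, 0.836000),
    "all": (0.709071, 0.786667),
}

# Pick the max_features value with the highest score for each metric.
best_by_f1 = max(mnb_results, key=lambda n: mnb_results[n][0])
best_by_acc = max(mnb_results, key=lambda n: mnb_results[n][1])
f1_gap = mnb_results[best_by_f1][0] - mnb_results[400][0]

print(f"best N by macro F1: {best_by_f1}")
print(f"best N by accuracy: {best_by_acc}")
print(f"macro F1 gap between best N and N=400: {f1_gap:.6f}")
```

The two metrics point to neighbouring values of N, and the macro F1 difference between them is well below 0.001.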

In [13]:
from sklearn.linear_model import LogisticRegression


# set the same max_features list as before
logreg_results = []

for max_features in max_features_list:
    vectorizer = CountVectorizer(max_features=max_features)
    X = vectorizer.fit_transform(new_df['text_processed'])
    y = new_df['topic']

    # logistic regression tuning: C=0.1, 1, 10
    best_f1 = -1
    best_acc = -1
    best_c = None
    for c in [0.1, 1, 10]:
        lr = LogisticRegression(C=c, max_iter=1000, random_state=42, solver='liblinear')
        lr_f1 = cross_val_score(lr, X, y, cv=skf, scoring=macro_f1)
        lr_acc = cross_val_score(lr, X, y, cv=skf, scoring='accuracy')
        mean_f1 = np.mean(lr_f1)
        mean_acc = np.mean(lr_acc)
        if mean_f1 > best_f1:
            best_f1 = mean_f1
            best_acc = mean_acc
            best_c = c

    logreg_results.append({
        'max_features': max_features if max_features is not None else 'all',
        'LogReg_f1': best_f1,
        'LogReg_acc': best_acc,
        'LogReg_C': best_c
    })

# merge the results of BNB/MNB and Logistic Regression
logreg_results_df = pd.DataFrame(logreg_results)
final_results_df = results_df.merge(logreg_results_df, on='max_features')
print(final_results_df)

# find the best classification method
best_row = final_results_df.iloc[final_results_df[['MNB_f1', 'BNB_f1', 'LogReg_f1']].values.max(axis=1).argmax()]
print("\nbest model parameters and performance:")
if best_row['LogReg_f1'] >= best_row['MNB_f1'] and best_row['LogReg_f1'] >= best_row['BNB_f1']:
    print(f"best model: Logistic Regression (C={best_row['LogReg_C']}, max_features={best_row['max_features']})")
    print(f"Macro F1: {best_row['LogReg_f1']:.4f}, Accuracy: {best_row['LogReg_acc']:.4f}")
elif best_row['MNB_f1'] >= best_row['BNB_f1']:
    print(f"best model: MultinomialNB (max_features={best_row['max_features']})")
    print(f"Macro F1: {best_row['MNB_f1']:.4f}, Accuracy: {best_row['MNB_acc']:.4f}")
else:
    print(f"best model: BernoulliNB (max_features={best_row['max_features']})")
    print(f"Macro F1: {best_row['BNB_f1']:.4f}, Accuracy: {best_row['BNB_acc']:.4f}")


   max_features    MNB_f1   MNB_acc    BNB_f1   BNB_acc  LogReg_f1  \
0           100  0.723673  0.750667  0.504874  0.583333   0.717325   
1           200  0.819569  0.839333  0.546220  0.626000   0.802063   
2           300  0.834056  0.859333  0.556188  0.653333   0.831354   
3           400  0.833829  0.862667  0.550633  0.642667   0.847538   
4           500  0.832464  0.860667  0.543731  0.641333   0.846391   
5           600  0.819904  0.849333  0.535553  0.639333   0.843823   
6           700  0.812070  0.836667  0.534089  0.640000   0.840410   
7           800  0.813866  0.839333  0.538203  0.642667   0.842081   
8           900  0.814999  0.839333  0.535388  0.644000   0.842933   
9          1000  0.806872  0.836000  0.528638  0.637333   0.843828   
10          all  0.709071  0.786667  0.344754  0.529333   0.841247   

    LogReg_acc  LogReg_C  
0     0.759333       0.1  
1     0.825333       0.1  
2     0.859333       0.1  
3     0.878000       0.1  
4     0.876667       0.1

## Q5:
As the results show, the best performance comes from Logistic Regression with C=0.1 and max_features=400 (macro F1 0.8475, accuracy 0.8780).

# Part 2

In [14]:
# 1. load user keywords
def load_user_keywords(tsv_path):
    df = pd.read_csv(tsv_path, sep='\t')
    user_keywords = {}
    for _, row in df.iterrows():
        topic = row['topic']
        keywords = [k.strip() for k in row['keywords'].split(',')]
        user_keywords[topic] = keywords
    return user_keywords

user1_keywords = load_user_keywords('user1.tsv')
user2_keywords = load_user_keywords('user2.tsv')

user1_keywords

{'dark': ['fire', 'enemy', 'pain', 'storm', 'fight'],
 'sadness': ['cry', 'alone', 'heartbroken', 'tears', 'regret'],
 'personal': ['dream', 'truth', 'life', 'growth', 'identity'],
 'lifestyle': ['party', 'city', 'night', 'light', 'rhythm'],
 'emotion': ['love', 'memory', 'hug', 'kiss', 'feel']}

In [15]:

from sklearn.feature_extraction.text import TfidfVectorizer


# 2. build user3 custom keywords
user3_keywords = {
    'dark':     ['night', 'shadow', 'fear', 'storm', 'mystery'],
    'sadness':  ['tears', 'goodbye', 'lost', 'pain', 'empty'],
    'personal': ['dream', 'hope', 'change', 'grow', 'future'],
    'lifestyle':['dance', 'party', 'city', 'drive', 'adventure'],
    'emotion':  ['love', 'smile', 'hug', 'joy', 'heart']
}

# 3. only take the first 750 songs
df_750 = new_df[['text_processed','topic']].iloc[:750].copy()
df_750


| | text_processed | topic |
| --- | --- | --- |
| 0 | loving real lake rock awake know time clear wo... | dark |
| 1 | incubus summer rock shouldn summer pretty buil... | lifestyle |
| 2 | reignwolf hardcore blues lose deep catch breat... | sadness |
| 3 | tedeschi trucks band blues run bitter taste re... | sadness |
| 4 | lukas nelson promise real started blues think ... | dark |
| ... | ... | ... |
| 745 | h e r kind way pop baby sound better want mind... | personal |
| 746 | joe bonamassa ve known long time blues matter ... | lifestyle |
| 747 | khalid hopeless pop spend time worry break pro... | lifestyle |
| 748 | days grace right left wrong rock wan na away a... | sadness |


In [16]:
# use max_features=400 and C=0.1 (the best-performing configuration)
vectorizer = CountVectorizer(max_features=400)
X = vectorizer.fit_transform(df_750['text_processed'])
y = df_750['topic']

lr = LogisticRegression(C=0.1, max_iter=1000, random_state=42, solver='liblinear')
lr.fit(X, y)
predicted_topic = lr.predict(X)
df_750['predicted_topic'] = predicted_topic

text_df = new_df[['text_processed','topic']].iloc[750:1000].copy()

predicted_topic = lr.predict(vectorizer.transform(text_df['text_processed']))
text_df['predicted_topic'] = predicted_topic
df_750
text_df

| | text_processed | topic | predicted_topic |
| --- | --- | --- | --- |
| 750 | door cinema club rock away sugar dance tongue ... | sadness | sadness |
| 751 | kelsea ballerini legends country golden magic ... | personal | personal |
| 752 | soccer mommy dog rock want drag collar neck ti... | dark | dark |
| 753 | score revolution rock wolves begin howl time o... | dark | dark |
| 754 | l ind cis sunrise drive jazz breath suffocate ... | dark | dark |
| ... | ... | ... | ... |
| 995 | radio moscow miles brain cycles blues recall s... | dark | sadness |
| 996 | cage elephant wide world blues young mama say ... | personal | personal |
| 997 | tesseract smile jazz calm soothe mechanical br... | sadness | sadness |
| 998 | godsmack scars rock sense think spite black wh... | dark | dark |


In [17]:
# analyze the distribution of topic and predicted_topic in df_750 and text_df

print("df_750 real topic distribution:")
print(df_750['topic'].value_counts())
print("\ndf_750 predicted topic distribution:")
print(df_750['predicted_topic'].value_counts())

print("\ntext_df real topic distribution:")
print(text_df['topic'].value_counts())
print("\ntext_df predicted topic distribution:")
print(text_df['predicted_topic'].value_counts())

# cross table: real vs predicted
print("\ndf_750 real vs predicted cross table:")
print(pd.crosstab(df_750['topic'], df_750['predicted_topic']))

print("\ntext_df real vs predicted cross table:")
print(pd.crosstab(text_df['topic'], text_df['predicted_topic']))


df_750 real topic distribution:
topic
dark         246
personal     188
sadness      182
lifestyle     92
emotion       42
Name: count, dtype: int64

df_750 predicted topic distribution:
predicted_topic
dark         255
personal     188
sadness      180
lifestyle     87
emotion       40
Name: count, dtype: int64

text_df real topic distribution:
topic
dark         81
sadness      69
personal     51
lifestyle    32
emotion      17
Name: count, dtype: int64

text_df predicted topic distribution:
predicted_topic
dark         82
sadness      74
personal     58
lifestyle    27
emotion       9
Name: count, dtype: int64

df_750 real vs predicted cross table:
predicted_topic  dark  emotion  lifestyle  personal  sadness
topic                                                       
dark              244        1          0         1        0
emotion             2       39          0         1        0
lifestyle           5        0         87         0        0
personal            3        0     

In [18]:
# 4. build the profile documents for each user

user_profiles = {}
for uname, ukeywords in zip(['user1', 'user2', 'user3'], [user1_keywords, user2_keywords, user3_keywords]):
    profile_docs = {}
    for topic in ['dark', 'sadness', 'personal', 'lifestyle', 'emotion']:
        # if the user has no keywords for this topic, store an empty profile
        if topic not in ukeywords or not ukeywords[topic]:
            profile_docs[topic] = ''
            continue
        topic_songs = df_750[df_750['predicted_topic'] == topic]
        liked_mask = topic_songs['text_processed'].apply(
            lambda text: any(kw in text.split() for kw in ukeywords[topic])
        )
        liked_lyrics = topic_songs[liked_mask]['text_processed']
        profile_docs[topic] = ' '.join(liked_lyrics.tolist())
        print(uname, topic, 'number of liked songs:', liked_mask.sum())
    user_profiles[uname] = profile_docs


user_profiles

user1 dark number of liked songs: 65
user1 sadness number of liked songs: 3
user1 personal number of liked songs: 117
user1 lifestyle number of liked songs: 34
user1 emotion number of liked songs: 24
user2 sadness number of liked songs: 19
user2 emotion number of liked songs: 12
user3 dark number of liked songs: 62
user3 sadness number of liked songs: 34
user3 personal number of liked songs: 99
user3 lifestyle number of liked songs: 16
user3 emotion number of liked songs: 19


{'user1': {'dark': 'loving real lake rock awake know time clear world mirror world mirror magic hour confuse power steal word unheard unheard certain forget bless angry weather head angry weather head angry weather head know gentle night mindless fight walk woods zayde w lf gladiator rock start climb face army vipers lions reach cause time tear kingdom liars jail heart pessimists nail mouth impressionists spend money therapist couldn accept gladiator gladiator gladiator pick fight gods giant slayer boneshaker dominator freight train wreck ball gladiator tell think believe catch crossfire trouble single feel gold underneath lock doors second catch breath cause heart jumpin chest know cause best time accept gladiator gladiator gladiator pick fight gods giant slayer boneshaker dominator freight train wreck ball gladiator reason fight reason fight shake hand devil night tell reason fight gladiator gladiator gladiator pick fight gods giant slayer bone shaker dominator freight train wreck ba

In [19]:
# 5. build the TfidfVectorizer for each topic(only use the songs in the training set)
topic_vectorizers = {}
topic_vocab = {}
for topic in ['dark', 'sadness', 'personal', 'lifestyle', 'emotion']:
    topic_songs = df_750[df_750['topic'] == topic]['text_processed']
    vectorizer = TfidfVectorizer()
    vectorizer.fit(topic_songs)
    topic_vectorizers[topic] = vectorizer
    topic_vocab[topic] = vectorizer.get_feature_names_out()

# 6. convert the user profile documents to TF-IDF vectors, and output the top 20 keywords for each topic
def print_top_keywords_for_user(user_profile_docs, topic_vectorizers, topn=20):
    for topic in ['dark', 'sadness', 'personal', 'lifestyle', 'emotion']:
        doc = user_profile_docs[topic]
        vectorizer = topic_vectorizers[topic]
        if not doc.strip():
            print(f"topic: {topic} (no liked songs)")
            print('-'*40)
            continue
        tfidf_vec = vectorizer.transform([doc])
        arr = tfidf_vec.toarray().flatten()
        top_idx = arr.argsort()[::-1][:topn]
        top_words = np.array(vectorizer.get_feature_names_out())[top_idx]
        top_scores = arr[top_idx]
        print(f"topic: {topic}")
        for w, s in zip(top_words, top_scores):
            print(f"{w}: {s:.3f}", end='  ')
        print('\n' + '-'*40)

print("User1 keywords:")
print_top_keywords_for_user(user_profiles['user1'], topic_vectorizers)
print("\nUser2 keywords:")
print_top_keywords_for_user(user_profiles['user2'], topic_vectorizers)
print("\nUser3 keywords:")
print_top_keywords_for_user(user_profiles['user3'], topic_vectorizers)

User1 keywords:
topic: dark
fight: 0.360  blood: 0.175  like: 0.174  know: 0.167  grind: 0.140  gon: 0.133  tell: 0.131  na: 0.130  kill: 0.129  black: 0.127  stand: 0.121  dilly: 0.121  lanky: 0.121  head: 0.114  follow: 0.108  people: 0.097  yeah: 0.097  hand: 0.096  come: 0.096  shoot: 0.094  
----------------------------------------
topic: sadness
tear: 0.396  think: 0.332  greater: 0.321  regret: 0.257  leave: 0.248  place: 0.209  beg: 0.208  want: 0.163  bring: 0.151  blame: 0.140  wider: 0.139  hold: 0.133  lord: 0.121  word: 0.121  gon: 0.119  change: 0.116  mind: 0.111  cause: 0.109  trust: 0.101  na: 0.090  
----------------------------------------
topic: personal
life: 0.449  live: 0.240  na: 0.166  change: 0.156  world: 0.149  know: 0.147  ordinary: 0.146  yeah: 0.137  dream: 0.134  wan: 0.128  like: 0.119  thank: 0.116  teach: 0.115  lord: 0.113  come: 0.107  time: 0.107  beat: 0.102  think: 0.099  learn: 0.090  need: 0.087  
----------------------------------------
topic:

## Q1:

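I think these keywords are reasonable for each topic, especially the top three in each list. In User 1's profile, for example, `fight` ranks first for dark, `tear` for sadness, and `life` for personal, and each of them matches or is close to one of User 1's original keywords (`fight`, `tears`, `life`).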
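To make the ranking easier to interpret, here is a minimal plain-Python sketch of how a term's weight in a profile document comes about. It assumes the smoothed IDF that `TfidfVectorizer` uses by default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization; it tokenizes by whitespace, so it is an approximation rather than a reimplementation. The corpus and profile are made up.

```python
import math
from collections import Counter

def top_tfidf_terms(profile_doc, corpus, topn=3):
    """Rank the terms of profile_doc by TF-IDF weight against corpus."""
    n_docs = len(corpus)
    # document frequency: in how many corpus documents each term occurs
    df = Counter()
    for doc in corpus:
        df.update(set(doc.split()))
    # term frequency in the profile, restricted to the corpus vocabulary
    tf = Counter(w for w in profile_doc.split() if w in df)
    # smoothed IDF, assumed to match TfidfVectorizer's default settings
    weights = {
        w: count * (math.log((1 + n_docs) / (1 + df[w])) + 1.0)
        for w, count in tf.items()
    }
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
    # sort by weight (descending), breaking ties alphabetically
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(w, v / norm) for w, v in ranked[:topn]]

# Hypothetical topic corpus and concatenated profile document.
corpus = ["fight blood night", "know like night", "know like storm"]
profile = "fight blood night fight know"

top = top_tfidf_terms(profile, corpus, topn=3)
for word, score in top:
    print(f"{word}: {score:.3f}")
```

In this toy case `fight` ranks first because it is frequent in the profile and rare in the corpus. Because the profile is one long concatenated document, raw term frequency has a large influence, which may be why generic tokens such as `like`, `know`, and `na` also appear in the lists above.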

In [20]:

from sklearn.metrics.pairwise import cosine_similarity

# function to evaluate the recommendation system
def evaluate_recommendation(user_profile_docs, df_week4, topic_vectorizers, N=10, M=20):
    """
    user_profile_docs: dict, user profile documents for each topic (string)
    df_week4: DataFrame, Week4 songs data, need to contain 'topic' and 'text_processed'
    topic_vectorizers: dict, TfidfVectorizer for each topic
    N: size of the recommendation set, recommend N songs for each topic
    M: number of keywords to keep in the user profile (5,10,20,all)
    """
    results = {}
    for topic in ['dark', 'sadness', 'personal', 'lifestyle', 'emotion']:
        # 1. build the user interest vector (only keep the top M keywords)
        doc = user_profile_docs[topic]
        vectorizer = topic_vectorizers[topic]
        if not doc.strip():
            results[topic] = {'precision': None, 'recall': None, 'f1': None, 'num_gt': 0, 'num_hit': 0}
            continue
        tfidf_vec = vectorizer.transform([doc])
        arr = tfidf_vec.toarray().flatten()
        if M == 'all':
            top_idx = np.where(arr > 0)[0]
        else:
            top_idx = arr.argsort()[::-1][:M]
        # build the user interest vector (only keep the top M keywords, others are 0)
        user_vec = np.zeros_like(arr)
        user_vec[top_idx] = arr[top_idx]

        # 2. calculate the similarity between all songs in Week4 and the user profile
        week4_songs = df_week4[df_week4['predicted_topic'] == topic]
        if week4_songs.empty:
            results[topic] = {'precision': None, 'recall': None, 'f1': None, 'num_gt': 0, 'num_hit': 0}
            continue
        song_vecs = vectorizer.transform(week4_songs['text_processed'])
        sims = cosine_similarity(song_vecs, user_vec.reshape(1, -1)).flatten()
        topN_idx = sims.argsort()[::-1][:N]
        rec_songs = week4_songs.iloc[topN_idx]

        # 3. count the number of hits
        # ground truth: all songs in Week4 that the user likes in this topic
        # here we assume "like" is defined as: the song contains one of the user profile keywords
        user_keywords = np.array(vectorizer.get_feature_names_out())[top_idx]
        def song_hit(text):
            return any(kw in text.split() for kw in user_keywords)
        week4_songs = week4_songs.copy()
        week4_songs['is_gt'] = week4_songs['text_processed'].apply(song_hit)
        gt_songs = week4_songs[week4_songs['is_gt']]
        num_gt = len(gt_songs)
        if num_gt == 0:
            results[topic] = {'precision': None, 'recall': None, 'f1': None, 'num_gt': 0, 'num_hit': 0}
            continue
        # count the number of hits in the recommendation list
        rec_songs = rec_songs.copy()
        rec_songs['is_hit'] = rec_songs['text_processed'].apply(song_hit)
        num_hit = rec_songs['is_hit'].sum()
        precision = num_hit / N
        recall = num_hit / num_gt
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
        results[topic] = {'precision': precision, 'recall': recall, 'f1': f1, 'num_gt': num_gt, 'num_hit': num_hit}
    return results

# show the evaluation results for different M values
def print_eval_summary(user_name, user_profile_docs, df_week4, topic_vectorizers, M_list=(5, 10, 20, 'all'), N=10):
    print(f"user: {user_name}")
    for M in M_list:
        print(f"\nM={M}, N={N}")
        res = evaluate_recommendation(user_profile_docs, df_week4, topic_vectorizers, N=N, M=M)
        precs = [v['precision'] for v in res.values() if v['precision'] is not None]
        recalls = [v['recall'] for v in res.values() if v['recall'] is not None]
        f1s = [v['f1'] for v in res.values() if v['f1'] is not None]
        print("topic\tPrecision@N\tRecall@N\tF1@N\tGT\thits")
        for topic, v in res.items():
            if v['precision'] is None:
                print(f"{topic}\t-")
            else:
                print(f"{topic}\t{v['precision']:.2f}\t\t{v['recall']:.2f}\t\t{v['f1']:.2f}\t{v['num_gt']}\t{v['num_hit']}")
        if precs:
            print(f"average\t{np.mean(precs):.2f}\t\t{np.mean(recalls):.2f}\t\t{np.mean(f1s):.2f}")

# text_df is passed as df_week4 and must contain 'predicted_topic' and 'text_processed' columns
# run the evaluation for all three users
print_eval_summary('User1', user_profiles['user1'], text_df, topic_vectorizers, M_list=[5,10,20,'all'], N=10)
print_eval_summary('User2', user_profiles['user2'], text_df, topic_vectorizers, M_list=[5,10,20,'all'], N=10)
print_eval_summary('User3', user_profiles['user3'], text_df, topic_vectorizers, M_list=[5,10,20,'all'], N=10)


user: User1

M=5, N=10
topic	Precision@N	Recall@N	F1@N	GT	hits
dark	1.00		0.16		0.28	62	10
sadness	1.00		0.24		0.38	42	10
personal	1.00		0.18		0.31	55	10
lifestyle	1.00		0.67		0.80	15	10
emotion	0.90		1.00		0.95	9	9
average	0.98		0.45		0.54

M=10, N=10
topic	Precision@N	Recall@N	F1@N	GT	hits
dark	1.00		0.14		0.25	71	10
sadness	1.00		0.17		0.29	58	10
personal	1.00		0.17		0.29	58	10
lifestyle	1.00		0.40		0.57	25	10
emotion	0.90		1.00		0.95	9	9
average	0.98		0.38		0.47

M=20, N=10
topic	Precision@N	Recall@N	F1@N	GT	hits
dark	1.00		0.13		0.23	76	10
sadness	1.00		0.15		0.26	68	10
personal	1.00		0.17		0.29	58	10
lifestyle	1.00		0.37		0.54	27	10
emotion	0.90		1.00		0.95	9	9
average	0.98		0.36		0.45

M=all, N=10
topic	Precision@N	Recall@N	F1@N	GT	hits
dark	1.00		0.12		0.22	82	10
sadness	1.00		0.14		0.24	74	10
personal	1.00		0.17		0.29	58	10
lifestyle	1.00		0.37		0.54	27	10
emotion	0.90		1.00		0.95	9	9
average	0.98		0.36		0.45
user: User2

M=5, N=10
topic	Precision@N	Recall@N	F1@N	GT	hits
dark	

## Q2:

I set N = 10. This gives each topic a chance to be covered while keeping the recommendation list short enough that it does not degrade the quality of the feedback.

To measure recommendation quality, I use precision@N and recall@N as the main indicators. Precision@N is the fraction of the N recommended songs that the user likes, which reflects the accuracy of the recommendations. Recall@N is the fraction of all the songs the user likes that actually appear in the recommendation list.
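As a worked check of these two indicators, the sketch below recomputes one row of the User1 table (topic `dark`, M=5: 10 hits among N=10 recommendations, 62 ground-truth songs). The helper name `precision_recall_f1` is introduced here only for illustration and is not part of the notebook.

```python
def precision_recall_f1(num_hit, n_recommended, num_gt):
    """Return (precision@N, recall@N, F1@N) from raw counts."""
    precision = num_hit / n_recommended if n_recommended else 0.0
    recall = num_hit / num_gt if num_gt else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

# Counts copied from the User1 table (M=5, N=10), topic "dark"
p, r, f1 = precision_recall_f1(num_hit=10, n_recommended=10, num_gt=62)
print(f"{p:.2f} {r:.2f} {f1:.2f}")  # 1.00 0.16 0.28
```

With every recommended song counted as a hit, precision is 1.00, while recall is capped at N divided by the ground-truth size (10/62 here), which is why recall is low for the large topics.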

Based on the results, precision is quite good when M >= 20. Precision is also relatively high when M = all. My guess is that the vocabulary in our text is not diverse enough, so the topic-related features of the songs within a given topic are mostly the same. In addition, we used the top-N strategy to build the user profile, which also helps explain why the hit rate of the recommendations behaves this way.
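One factor worth keeping in mind when reading these numbers is that the evaluation code defines a "liked" song as one containing at least one profile keyword, so the ground-truth set itself grows with M (the GT column for `dark` goes from 62 at M=5 to 82 at M=all). The toy sketch below illustrates that effect in plain Python. The keyword weights and the five short "songs" are made up for illustration, and `top_m_keywords` and `count_ground_truth` are hypothetical helpers, not functions from the notebook.

```python
def top_m_keywords(weights, m):
    """Return the m highest-weighted terms, or all positive-weight terms when m == 'all'."""
    ranked = sorted((t for t, w in weights.items() if w > 0),
                    key=lambda t: weights[t], reverse=True)
    return ranked if m == "all" else ranked[:m]

def count_ground_truth(songs, keywords):
    """Count songs that contain at least one of the keywords as a whole token."""
    kw = set(keywords)
    return sum(1 for text in songs if kw & set(text.split()))

# Hypothetical profile weights and song texts (not taken from the dataset)
profile_weights = {"night": 0.9, "fear": 0.7, "blood": 0.5, "cold": 0.2}
songs = [
    "night fall slow",
    "cold wind blow",
    "blood moon rise",
    "summer sun shine",
    "fear night run",
]

for m in (1, 3, "all"):
    kws = top_m_keywords(profile_weights, m)
    print(m, kws, count_ground_truth(songs, kws))
```

In this toy data the ground-truth count rises from 2 to 3 to 4 as M grows. Under the same keyword-based definition, a larger M makes it easier for any recommended song to count as a hit, while recall@N is divided by a larger ground-truth set.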

# Part 3

In [21]:

import random

# Parameters
N = 10  # Number of songs shown per week
user_name = 'User1'  # Friendly subject

# Simulate the 4 weeks' data splits
week1_idx = range(0, 250)
week2_idx = range(250, 500)
week3_idx = range(500, 750)
week4_idx = range(0, 250)

# For reproducibility
random.seed(42)

# Select N random songs for each week
week1_songs = df_750.iloc[random.sample(list(week1_idx), N)].copy()
print(week1_songs)
week2_songs = df_750.iloc[random.sample(list(week2_idx), N)].copy()
print(week2_songs)
week3_songs = df_750.iloc[random.sample(list(week3_idx), N)].copy()
print(week3_songs)
week4_songs = text_df.iloc[list(week4_idx)].copy()  # All week 4 songs for recommendation

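# user feedback: record which songs the subject "liked"
def user_likes(songs_df, like_prob=0.4, liked_mask=None):
    # like_prob is unused: the likes come from the subject's answers in liked_mask,
    # which must hold one boolean per row of songs_df
    songs_df = songs_df.copy()
    songs_df['liked'] = liked_mask
    return songs_df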

week1_liked_mask = [False, True, False, False, True, False, False, True, True, False] 
week2_liked_mask = [True, True, True, False, True, False, False, False, False, True] 
week3_liked_mask = [False, False, False, False, False, False, True, True, True, True]


# user likes for weeks 1-3
week1_feedback = user_likes(week1_songs, liked_mask=week1_liked_mask)
week2_feedback = user_likes(week2_songs, liked_mask=week2_liked_mask)
week3_feedback = user_likes(week3_songs, liked_mask=week3_liked_mask)

# Combine all liked songs as user training data
user_liked_songs = pd.concat([
    week1_feedback[week1_feedback['liked']],
    week2_feedback[week2_feedback['liked']],
    week3_feedback[week3_feedback['liked']]
])

# Build user profile from liked songs
def build_user_profile_from_likes(liked_songs_df):
    # Group by predicted topic and concatenate text_processed;
    # the loop key and the row filter must use the same column
    profile = {}
    for topic in liked_songs_df['predicted_topic'].unique():
        texts = liked_songs_df[liked_songs_df['predicted_topic'] == topic]['text_processed']
        profile[topic] = ' '.join(texts)
    return profile

user_profile_sim = build_user_profile_from_likes(user_liked_songs)


# Recommend N songs for week 4 by ranking songs against the per-topic profiles, as in Part 2
def recommend_for_user(user_profile_docs, week4_df, topic_vectorizers, N=10, M='all'):
    recs = []
    for topic, doc in user_profile_docs.items():
        vect = topic_vectorizers[topic]
        # Week 4 songs predicted to belong to this topic
        topic_df = week4_df[week4_df['predicted_topic'] == topic].copy()
        X = vect.transform(topic_df['text_processed'])
        user_vec = vect.transform([doc])
        # Dot product of each song vector with the profile vector; this equals
        # cosine similarity when the vectorizer L2-normalises its rows
        sims = (X * user_vec.T).toarray().flatten()
        topic_df['sim'] = sims
        topic_df = topic_df.sort_values('sim', ascending=False)
        # Here M caps the number of candidate songs kept per topic
        topM = topic_df if M == 'all' else topic_df.head(M)
        recs.append(topM)
    # Merge the per-topic candidates and keep the N most similar songs overall
    rec_songs = pd.concat(recs).sort_values('sim', ascending=False).head(N)
    return rec_songs

week4_recs = recommend_for_user(user_profile_sim, week4_songs, topic_vectorizers, N=N, M='all').copy()

# user likes for recommended week 4 songs
week4_liked_mask = [True, True, True, False, True, True, True, True, False, False]
week4_recs_feedback = user_likes(week4_recs, like_prob=0.4, liked_mask=week4_liked_mask)
print(week4_recs_feedback)
# Calculate metrics for week 4 recommendations
def calc_metrics(recs_feedback):
    num_recommended = len(recs_feedback)
    num_liked = recs_feedback['liked'].sum()
    print(num_liked)
    print(num_recommended)
    precision = num_liked / num_recommended if num_recommended > 0 else 0

    return {
        'precision': precision,
    }

metrics = calc_metrics(week4_recs_feedback)

# Show the recommended songs and which were liked
print("\nRecommended Songs for Week 4 and User Feedback:")
display_cols = ['artist_name', 'track_name', 'topic', 'text_processed', 'liked']
if 'artist_name' in week4_recs_feedback.columns:
    print(week4_recs_feedback[display_cols])
else:
    print(week4_recs_feedback)

# Show all metrics in a table
metrics_df = pd.DataFrame([metrics])
print("User Study Metrics for Week 4 Recommendations")
print(metrics_df)


                                        text_processed     topic  \
163  skillet anchor rock driftin beneath horizon bo...      dark   
28   imagine dragons walking wire rock feel away oo...   sadness   
6    rebelution trap door reggae long long road occ...      dark   
189  andy grammer wish pain rock doubt come like mo...   sadness   
70   scotty mccreery country mountains thousand fee...  personal   
62   jd mcpherson desperate love blues desperate kn...      dark   
57   gregg allman going going gone blues reach plac...   emotion   
35   george strait away country wild horse couldn d...   sadness   
188  movement cool reggae fight thoughts suicide al...  personal   
26   nappy roots walls dirty mc edit hip hop hmmmmm...  personal   

    predicted_topic  
163            dark  
28          sadness  
6              dark  
189         sadness  
70         personal  
62             dark  
57          emotion  
35          sadness  
188        personal  
26         personal  
         

## Q1:

User feedback: Judged purely on the literal meaning of the lyrics, the recommended songs seemed reasonably good to the user. However, the recommendations take no account of rhythm or melody, so the list felt rather awkward to him and did not accurately express his preferences.
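For reference alongside this qualitative feedback, the Week 4 like mask recorded in the Part 3 cell can be turned into a precision figure. This is a minimal sketch that assumes the mask lines up one-to-one with the 10 recommended songs. The helper `precision_at_n` is introduced here for illustration and mirrors what `calc_metrics` computes.

```python
# Like mask for the 10 Week 4 recommendations, copied from the Part 3 cell
week4_liked_mask = [True, True, True, False, True, True, True, True, False, False]

def precision_at_n(liked_mask):
    """Fraction of shown recommendations that the subject liked."""
    return sum(liked_mask) / len(liked_mask) if liked_mask else 0.0

print(precision_at_n(week4_liked_mask))  # 0.7
```

Seven of the ten recommended songs were liked, which gives a precision of 0.7 for the Week 4 list.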