# Problem Set 3 CS174
---
The following problem set will focus on the lessons on Bag of Words and TFIDF. Answer the following guide questions with your groups by writing functions or code in cells that accomplish the requirements.

1. Take your favorite Spotify or Apple Music playlist, and use their APIs to programmatically extract the first 20 song titles. Build a dictionary that has the song index as the key, and the title as the value. Please make sure that the playlist is **NOT** an instrumental playlist. **3 pts.**

In [1]:
!pip install spotipy sklearn pandas numpy



In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import pandas as pd
import re
import numpy as np

In [3]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id="9fdc5e181b5c4c48b221b82739b05353",
                                                           client_secret="7c0a5ef5bd544aef8cb2c59b3cb9b117"))

In [4]:
PLAYLIST_ID = "37i9dQZF1DX1rVvRgjX59F"
my_playlist = sp.playlist(PLAYLIST_ID, fields=['tracks'])

In [5]:
track_list = [item['track']['name'] for item in my_playlist['tracks']['items']]
# Remove "remaster" indicator -- it might mess up genius lyrics search.
track_list = [re.sub("\s*-[\s\d]*(R|r)emaster(ed)?[\s\-]*\d*",  '', track) for track in track_list]
first_20_tracks = track_list[:20]
first_20_tracks

['Smells Like Teen Spirit',
 'Zombie',
 'Under the Bridge',
 'Killing In The Name',
 'Wonderwall',
 'Enter Sandman',
 'Come As You Are',
 'Bitter Sweet Symphony',
 'Black Hole Sun',
 'Song 2',
 'Losing My Religion',
 '1979',
 'Plush',
 'All The Small Things',
 'Creep',
 'Loser',
 "Don't Speak",
 'When I Come Around',
 'Self Esteem',
 'High And Dry']

2. Use the Genius API to programmatically download the lyrics of these songs and store them in a dictionary 
with the index as the key, and the lyrics as the value. **3 pts.**

In [6]:
import lyricsgenius as lg

genius = lg.Genius()

song_lyrics = {}
for track in first_20_tracks:
    lyrics = genius.search_song(track).lyrics
    song_lyrics[track] = lyrics

Searching for "Smells Like Teen Spirit"...
Done.
Searching for "Zombie"...
Done.
Searching for "Under the Bridge"...
Done.
Searching for "Killing In The Name"...
Done.
Searching for "Wonderwall"...
Done.
Searching for "Enter Sandman"...
Done.
Searching for "Come As You Are"...
Done.
Searching for "Bitter Sweet Symphony"...
Done.
Searching for "Black Hole Sun"...


Timeout: Request timed out:
HTTPSConnectionPool(host='genius.com', port=443): Read timed out. (read timeout=5)

In [None]:
list(song_lyrics)

3. Build a 20x20 matrix containing the cosine similarities of the songs to each other. Use Bag of Words to determine these similarities. **4 pts.**

**Preprocessing Pipeline:**

In [None]:
# Modified from: 
# Aman Kedia_ Mayank Rasu - Hands-On Python Natural Language Processing

def clean_text(corpus, keep_list=[]):
    """Replace non-alphanumeric characters with space."""
    cleaned_corpus = []
    for row in corpus:
        cleaned_text = []
        for word in row.split():
            if word not in keep_list:
                p1 = re.sub('[^a-zA-Z0-9]', ' ', word)
                p1 = p1.lower()
                cleaned_text.append(p1)
            else:
                cleaned_text.append(word)
        cleaned_corpus.append(' '.join(cleaned_text))
    return cleaned_corpus


def remove_stopwords(corpus):
    """Remove all stopwords except wh_words."""
    wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']
    stop = set(stopwords.words('english'))
    for word in wh_words:
        stop.remove(word)
    corpus = [[word for word in text.split() if word not in stop] for text in corpus]
    return [w for w in corpus if w]


def lemmatize(corpus):
    """Lemmatize with WordNetLemmatizer."""
    lem = WordNetLemmatizer()
    corpus = [[lem.lemmatize(word, pos='v') for word in text] for text in corpus]
    return corpus

def preprocess(corpus):
    corpus = clean_text(corpus)
    corpus = remove_stopwords(corpus)
    corpus = lemmatize(corpus)
    corpus = [' '.join(sentence) for sentence in corpus]
    return corpus

In [None]:
song_titles, lyrics = zip(*song_lyrics.items())
corpus = pd.Series(lyrics_list, index=song_titles)
corpus

**Create a bag-of-words matrix with CountVectorizer:**

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

preprocessed_corpus = preprocess(corpus)

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(preprocessed_corpus)
bow_matrix = bow_matrix.toarray()
bow_matrix

In [None]:
bow_matrix.shape

**Create the cosine similarities matrix:**

In [None]:
def cosine(v1, v2):
    "Find cosine similarity of two vectors."
    return (v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

In [None]:
cosine_matrix = np.zeros(shape=(20, 20))
for row_idx, v1 in enumerate(bow_matrix):
    for col_idx, v2 in enumerate(bow_matrix):
        cosine_matrix[row_idx][col_idx] = cosine(v1, v2)

In [None]:
cosine_matrix

4. Build a 20x20 matrix containing the cosine similarities of the songs to each other. Use TFIDF scores to determine these similarities. **5 pts.**

5. Compare the similarities using BoW and TFIDF. Analyze the results and discuss any findings that interest you. You can use heatmaps or other plots to present your analysis. **3 pts**
Guide Questions:
- Can this be used to determine playlist quality?
- What does this say about the homogeneity of the playlist themes?
- What does this say about how songs (in that playlist genre) are written?

6. Identify the top 5 most important words for each song using TFIDF. Discuss the relationship of these words to their respective songs and analyze if they can be used as passable summaries of the songs. **2 pts** 

Make sure to remove stopwords. No need to lemmatize or stem, but is very welcome.

### Bonus:
Do a TFIDF analysis for a single artist's top 5 songs. Identify the top 10 most important words and discuss an analysis of these. **5 pts.**

Deadline **March 17 11:59PM**. Submission link to be posted in Moodle. 
Submit a .zip file containing the notebook and a 'data/' directory containing the songs with name <SURNAME>_<ID NUMBER>.ipynb.
Make sure to remove or obfuscate any API keys you include in the final submission.
    
Sample: **"BAUTISTA_110464.zip"**

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("there's")