# Text Analysis: tf-idf and cosine similarity

I've been trying to learn more about tf-idf and cosine similarity, and working with music-related data seemed like a fun way to do that. Using the LastFM API, I put together a CSV file of 500 of their top artists along with the long version of each artist's bio. The goal of this experiment was to see whether I could find similar artists based on the words in their biographies.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import pandas as pd
import numpy as np

#read into pandas
data = pd.read_csv('artists_500.csv')
data.head(20)

Unnamed: 0,artist,bio
0,Coldplay,"Coldplay is a British alternative rock band, f..."
1,Radiohead,Radiohead are an English alternative rock band...
2,Sia,Sia Kate Isobelle Furler (born 18 December 197...
3,The Weeknd,"The Weeknd, the stage name for Abel Tesfaye, i..."
4,Drake,"Aubrey “Drake” Graham (born October 24, 1986 i..."
5,Rihanna,"Robyn Rihanna Fenty (born February 20, 1988), ..."
6,David Bowie,David Bowie (born David Robert Jones on 8th Ja...
7,Lady Gaga,Stefani Joanne Angelina Germanotta (born March...
8,The Chainsmokers,The Chainsmokers are a disc jockey/producer du...
9,Calvin Harris,Calvin Harris (born Adam Richard Wiles; Januar...


## tf-idf

tf-idf is essentially a weighting system. It stands for term frequency-inverse document frequency, and it gives a term a high weight in a document when the term appears often in that document but in few of the other documents. It's frequently used in information filtering and text mining, in applications like search engines and recommenders. As the name suggests, tf-idf is made up of two pieces: the term frequency and the inverse document frequency.

Term frequency comes down to: the number of times a term appears in a document, divided by the total number of terms in that document.

Inverse document frequency is: the log of (total number of documents / number of documents that contain the term t).

The tf-idf weight of a term in a document is the product of the two.
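To make the two definitions concrete, here is a minimal pure-Python sketch that applies them to a made-up three-document corpus. The corpus and the helper names `tf`, `idf`, and `tf_idf` are my own illustrations, not part of the LastFM data or of scikit-learn. Note also that scikit-learn's `TfidfVectorizer` uses a smoothed variant of idf by default and L2-normalizes each row, so its numbers will not match this textbook version exactly.

```python
import math

# Hypothetical toy corpus: each document is already split into tokens.
docs = [
    ["rock", "band", "from", "london"],
    ["pop", "singer", "from", "london"],
    ["rock", "rock", "guitar", "band"],
]

def tf(term, doc):
    # occurrences of the term divided by the total number of terms in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log of (number of documents / number of documents containing the term);
    # assumes the term occurs in at least one document
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

rock_score = tf_idf("rock", docs[2], docs)      # frequent in the doc, but in 2 of 3 docs
guitar_score = tf_idf("guitar", docs[2], docs)  # appears once, but only in this doc
```

In this toy corpus 'guitar' ends up with a higher weight than 'rock' in the third document, even though 'rock' appears twice there, because 'guitar' is unique to that document.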

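In the following code block, I use scikit-learn to define a model for tf-idf. For this example, I'm looking at unigrams, bigrams, and trigrams, and using scikit-learn's built-in English stop word list to exclude common words such as 'the'. I then fit the model to my data and transform the data in one step using the .fit_transform() method, which returns a sparse matrix with one row per bio and one column per n-gram. .get_feature_names() returns the n-grams in the learned vocabulary, ordered by column index, so position i in that list names column i of the matrix. Newer scikit-learn releases replace it with .get_feature_names_out().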
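As a rough illustration of what "unigrams, bigrams, and trigrams with stop words removed" means, here is a simplified pure-Python sketch. It is not scikit-learn's implementation: the function `word_ngrams` and the tiny stop word set are hypothetical, and tokenizing with `split()` is cruder than the vectorizer's regex tokenizer. What it does mirror is the order of operations, since stop words are dropped first and n-grams are then built from the remaining tokens.

```python
def word_ngrams(text, stop_words, ngram_range=(1, 3)):
    # lowercase, split on whitespace, and drop stop words before building n-grams
    tokens = [t for t in text.lower().split() if t not in stop_words]
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # slide a window of n tokens across the filtered token list
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

# Hypothetical, tiny stop word set (the real list is much longer).
toy_stop_words = {"the", "one", "more"}
features = word_ngrams("the baby one more time", toy_stop_words)
```

Because the stop words are removed before the n-grams are formed, words that were not adjacent in the original text can end up in the same bigram. That may be how a feature like 'baby time' ends up in the vocabulary.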

In [2]:
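#set up model for tfidf, using ngrams of 1, 2, 3, and excluding the list of stop words that come with sklearn

# min_df=1 (the default) keeps every term that appears in at least one bio;
# recent scikit-learn releases reject an integer min_df of 0
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df=1, stop_words='english')
# astype('U') casts every bio to a unicode string, so a missing bio (NaN) would become 'nan'
tfidf_matrix = tfidf.fit_transform(data['bio'].values.astype('U'))
# newer scikit-learn releases replace get_feature_names() with get_feature_names_out()
feature_names = np.array(tfidf.get_feature_names())
tfidf_matrix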

<500x443189 sparse matrix of type '<class 'numpy.float64'>'
	with 645092 stored elements in Compressed Sparse Row format>

In [1]:
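def top_tfidf(row_num, n):
    # return the n highest-weighted n-grams for the bio in row row_num
    row = tfidf_matrix[row_num]
    # argsort is ascending, so reverse it to put the largest weights first
    tfidf_sorting = np.argsort(row.toarray()).flatten()[::-1]
    top_n = feature_names[tfidf_sorting][:n]
    return top_n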

In [2]:
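def cos_sim(artist_id):
    # tf-idf rows are L2-normalized by default, so the dot product from linear_kernel is the cosine similarity
    cosine_similarity = linear_kernel(tfidf_matrix[artist_id], tfidf_matrix).flatten()
    # nine highest-scoring rows, best first; the first is normally the artist itself
    top_sim = cosine_similarity.argsort()[:-10:-1]
    s = data['artist'].iloc[top_sim].tolist()
    return s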
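The function above uses `linear_kernel`, which computes plain dot products, rather than a dedicated cosine similarity function. That works because `TfidfVectorizer` scales each row to unit length by default (`norm='l2'`), and for unit-length vectors the dot product and the cosine similarity are the same number. Here is a small pure-Python sketch of that identity. The vectors and the helper names `dot`, `l2_normalize`, and `cosine` are made up for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    # scale the vector so that its Euclidean length is 1
    length = math.sqrt(dot(v, v))
    return [a / length for a in v]

def cosine(u, v):
    # cosine similarity: dot product divided by the product of the lengths
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical term-weight vectors for two documents.
u = [3.0, 4.0, 0.0]
v = [1.0, 0.0, 2.0]

# After normalizing, a plain dot product gives the cosine similarity.
dot_of_normalized = dot(l2_normalize(u), l2_normalize(v))
cosine_of_raw = cosine(u, v)
```

If the vectorizer were created with `norm=None`, this shortcut would no longer hold and the dot products would also reflect document length.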

In [228]:
row = tfidf_matrix[70]  # still a sparse row; .toarray() below makes it dense
tfidf_sorting = np.argsort(row.toarray()).flatten()[::-1]
tfidf_sorting

array([376115,  79955, 156599, ..., 294673, 294674,      0])

In [229]:
n = 20
top_n = feature_names[tfidf_sorting][:n]

In [230]:
top_n

array(['spears', 'britney', 'fatale', 'femme fatale', 'femme', 'number',
       'album', 'baby time', 'britney jean', 'spears released', 'circus',
       'billboard', 'hot', 'debuted', 'hot 100', 'kevin', 'selling',
       'billboard hot', 'video', '2009 spears'], 
      dtype='<U121')

In [215]:
cosine_similarity = linear_kernel(tfidf_matrix[34], tfidf_matrix).flatten()

In [216]:
cosine_similarity.argsort()[:-10:-1]

array([ 34, 481, 150, 109,   5, 105, 313,   0,  70])
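The `[:-10:-1]` slice is a little cryptic: it walks the ascending argsort result backwards from the end and stops before the tenth-from-last position, so it returns nine indices with the highest score first. The first of those is the query row itself (34 in the output above), since a document is as similar to itself as anything can be. The sketch below shows the same slicing pattern on a made-up list of five scores, using `sorted` as a stand-in for `argsort`. The scores and variable names are hypothetical.

```python
# Hypothetical similarity scores of five documents against document 1.
scores = [0.10, 1.00, 0.35, 0.80, 0.05]
query = 1

# Indices ordered by ascending score, a stand-in for numpy's argsort().
order = sorted(range(len(scores)), key=lambda i: scores[i])

# Walk backwards from the end and stop before the 4th-from-last position:
# this yields the 3 highest-scoring indices, best first.
top3 = order[:-4:-1]

# Drop the query itself to keep only the other documents.
neighbours = [i for i in top3 if i != query]
```

Filtering out the query index, as in the last line, would be one way to return only other artists.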

In [231]:
data.loc[70, ['artist', 'bio']]

artist                                       Britney Spears
bio       Britney Jean Spears (born on December 2, 1981 ...
Name: 70, dtype: object