<h1>Document Similarity using LSI</h1>

<h4>In this assignment we’re going to practice document similarity. Here’s
what you need to do:</h4>
<ol>
<li>From Wikipedia’s Lists of musicians page (https://en.wikipedia.org/wiki/Lists_of_musicians), pick five lists of
musicians (e.g., List of big band musicians). You can pick any five
you like, but make sure that each list has the word “musicians” in
its title and that it lists at least 30 musicians
<li>Collect the urls of all the musicians on those five pages and place them in a list
<li>Grab the text of each musician's page and place the texts in a list (of documents)
<li>Build an LSI model using this data. This is your "reference" data set
<li>Now grab another list of musicians from Wikipedia and create a new list of documents using the text of each musician's page. This is your "musician" data set
<li>For each musician in the new list, find the musician in the reference data set that is the closest in similarity. 
<li>Print a table that contains each musician from the musician data set and the most similar musician from the reference data set
</ol>
<h4>Use the code below to build your solution</h4>

<p><span style="color:blue">get_musicians</span>: A function that, given a "list of musicians" url, returns a list containing the names of the musicians and the urls for their wikipedia pages
<p>non_musician_finder tries its best to remove links that are not musician links from the page (not perfect, but good enough!)

In [26]:
from gensim import corpora, models, similarities
from gensim.similarities.docsim import Similarity
from collections import OrderedDict
from wordcloud import WordCloud, STOPWORDS
import nltk
from nltk.corpus import PlaintextCorpusReader, gutenberg, stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.probability import FreqDist
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/vriddhimisra/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
def get_musicians(url):
    from bs4 import BeautifulSoup
    import requests
    page_soup = BeautifulSoup(requests.get(url).content,'lxml')
    li_tags = page_soup.find_all('li')
    all_musicians = list()
    for tag in li_tags:
        # Skip list items that carry an id attribute
        if tag.get('id'):
            continue

        # Treat the first link in the list item as the musician link
        a_tag = tag.find('a')
        if a_tag is None:
            continue
        link = a_tag.get('href')
        name = a_tag.get_text()
        if link and "/wiki/" in link and non_musician_finder(link):
            all_musicians.append((name,"https://en.wikipedia.org" + link))
    return all_musicians

def non_musician_finder(link):
    # Reject any link that contains one of these words anywhere in it
    non_musician_words = ['Category','Template','Portal','List','File','Special','Main','Help','User']
    for word in non_musician_words:
        if word in link:
            return False
    return True

<h4>testing the function</h4>
<li>Note that Wikipedia does not have a standard for its page design so this code may not work with every list

In [3]:
url = "https://en.wikipedia.org/wiki/List_of_alternative_country_musicians"
get_musicians(url)

[('16 Horsepower', 'https://en.wikipedia.org/wiki/16_Horsepower'),
 ('Ryan Adams', 'https://en.wikipedia.org/wiki/Ryan_Adams'),
 ('Jill Andrews', 'https://en.wikipedia.org/wiki/Jill_Andrews'),
 ('The Autumn Defense', 'https://en.wikipedia.org/wiki/The_Autumn_Defense'),
 ('Backyard Tire Fire', 'https://en.wikipedia.org/wiki/Backyard_Tire_Fire'),
 ('Del Barber', 'https://en.wikipedia.org/wiki/Del_Barber'),
 ('Eef Barzelay', 'https://en.wikipedia.org/wiki/Eef_Barzelay'),
 ("Bear's Den", 'https://en.wikipedia.org/wiki/Bear%27s_Den_(band)'),
 ('Rico Bell', 'https://en.wikipedia.org/wiki/Rico_Bell'),
 ('Blitzen Trapper', 'https://en.wikipedia.org/wiki/Blitzen_Trapper'),
 ('Blue Rodeo', 'https://en.wikipedia.org/wiki/Blue_Rodeo'),
 ('Bosque Brown', 'https://en.wikipedia.org/wiki/Bosque_Brown'),
 ('The Bottle Rockets', 'https://en.wikipedia.org/wiki/The_Bottle_Rockets'),
 ('BR549', 'https://en.wikipedia.org/wiki/BR549'),
 ('Jim Bryson', 'https://en.wikipedia.org/wiki/Jim_Bryson'),
 ('Richard B
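non_musician_finder matches its words anywhere in the link, so an ordinary article whose title happens to contain "Help" or "Main" is also dropped. As a rough sketch of one possible refinement (the function name and the example links below are illustrative assumptions, not part of the assignment code), the check can be limited to the namespace prefix before the colon and to list pages:

```python
# Hypothetical alternative to non_musician_finder; names are illustrative only.
NON_ARTICLE_NAMESPACES = {'Category', 'Template', 'Portal', 'File',
                          'Special', 'Help', 'User'}

def is_probable_musician_link(link):
    """Return True if link looks like an ordinary article that is not a list page."""
    if not link or not link.startswith('/wiki/'):
        return False
    title = link[len('/wiki/'):]
    # Namespace pages look like "Category:Foo": test only the text before the colon
    if ':' in title and title.split(':', 1)[0] in NON_ARTICLE_NAMESPACES:
        return False
    # Skip list pages and the main page
    if title.startswith(('List_of_', 'Lists_of_')) or title == 'Main_Page':
        return False
    return True

# Illustrative links, chosen to show the difference from a plain substring test
for link in ['/wiki/Ryan_Adams',
             '/wiki/Category:Bluegrass_musicians',
             '/wiki/List_of_emo_artists',
             '/wiki/Help_Yourself_(band)']:
    print(link, is_probable_musician_link(link))
```

This is still a heuristic: a list item can link to an ordinary article that is not a musician, and neither version can detect that.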

<h4>get_musician_text(url): returns the page text of the wikipedia page associated with a musician</h4>
<li>Since we're not sure if this will always work, we use a try ... except to catch exceptions
<li>If it doesn't work, the function returns None
<li>We will need to delete this (musician, url) pair from our musicians list

In [4]:
def get_musician_text(url):
    from bs4 import BeautifulSoup
    import requests
    all_text = ''
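    try:
        page_soup = BeautifulSoup(requests.get(url).content,'lxml')
        # Concatenate the text of every paragraph on the page
        for p_tag in page_soup.find_all('p'):
            all_text += p_tag.get_text()
    except Exception:
        # Any download or parsing failure: signal it to the caller with None
        return None
    return all_text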


<h4>testing get_musician_text</h4>

In [5]:
url = "https://en.wikipedia.org/wiki/Jim_Morrison"
get_musician_text(url)

'\nJames Douglas Morrison (December 8, 1943 – July 3, 1971) was an American singer, poet and songwriter who was the lead vocalist of the rock band the Doors. Due to his wild personality, poetic lyrics, distinctive voice, unpredictable and erratic performances, and the dramatic circumstances surrounding his life and early death, Morrison is regarded by music critics and fans as one of the most influential frontmen in rock history. Since his death, Morrison\'s fame has endured as one of popular culture\'s top rebellious and oft-displayed icons, representing the generation gap and youth counterculture.[3]\nTogether with pianist Ray Manzarek, Morrison founded the Doors in 1965 in Venice, California. The group spent two years in obscurity until shooting to prominence with their number-one single in the United States, "Light My Fire", taken from their self-titled debut album. Morrison recorded a total of six studio albums with the Doors, all of which sold well and received critical acclaim. 
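The text returned above still contains footnote markers such as [3]. A minimal sketch of an optional cleanup step follows; strip_citation_markers is a hypothetical helper, not part of the assignment code, and it assumes the markers are always a number in square brackets:

```python
import re

def strip_citation_markers(text):
    """Remove bracketed numeric footnote markers such as [3] or [12]."""
    return re.sub(r'\[\d+\]', '', text)

# Fragment taken from the output above
sample = 'representing the generation gap and youth counterculture.[3]\nTogether with pianist Ray Manzarek'
print(strip_citation_markers(sample))
```

Bracketed text that is not purely numeric is left untouched.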

<p><span style="color:blue">get_all_musicians</span>: A function that, given a list of genres, returns a list containing the names of the musicians and the urls for their wikipedia pages associated with that list of genres
<p>The function should return a list of (name,url) pairs for all the musicians in the list of genres
<p>You need to:
<ol>
<li>initialize a list "all_musicians"
<li>iterate through the list of genres
<li>construct a url for the list of musicians (I've done these first three steps for you)
<li>call get_musicians for that url
<li>extend all_musicians by what get_musicians returns
</ol>

In [6]:
def get_all_musicians(genre_list):
    all_musicians = list()
    for genre in genre_list:
        url = 'https://en.wikipedia.org/wiki/List_of_' + genre
        all_musicians.extend(get_musicians(url))
    
    return all_musicians

<h4>Example of how to use get_all_musicians</h4>

In [7]:
genre_list = ['bluegrass_musicians#G','British_blues_musicians','country_blues_musicians','emo_artists']
all_musicians = get_all_musicians(genre_list)

In [8]:
all_musicians

[('Tom Adams', 'https://en.wikipedia.org/wiki/Tom_Adams_(bluegrass_musician)'),
 ('Eddie Adcock', 'https://en.wikipedia.org/wiki/Eddie_Adcock'),
 ('David "Stringbean" Akeman',
  'https://en.wikipedia.org/wiki/David_%22Stringbean%22_Akeman'),
 ('Red Allen', 'https://en.wikipedia.org/wiki/Red_Allen_(bluegrass)'),
 ('Darol Anger', 'https://en.wikipedia.org/wiki/Darol_Anger'),
 ('Mike Auldridge', 'https://en.wikipedia.org/wiki/Mike_Auldridge'),
 ('Kenny Baker', 'https://en.wikipedia.org/wiki/Kenny_Baker_(fiddler)'),
 ('Jessie Baker', 'https://en.wikipedia.org/wiki/Jessie_Baker'),
 ('Butch Baldassari', 'https://en.wikipedia.org/wiki/Butch_Baldassari'),
 ('Russ Barenberg', 'https://en.wikipedia.org/wiki/Russ_Barenberg'),
 ('Byron Berline', 'https://en.wikipedia.org/wiki/Byron_Berline'),
 ('Carroll Best', 'https://en.wikipedia.org/wiki/Carroll_Best'),
 ('Norman Blake',
  'https://en.wikipedia.org/wiki/Norman_Blake_(American_musician)'),
 ('Kathy Boyd', 'https://en.wikipedia.org/wiki/Kathy_Boy
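Because all_musicians is built from several list pages, the same musician may appear more than once if they are on more than one list. A small sketch of an optional de-duplication step, assuming the url identifies a musician uniquely (dedupe_musicians and the sample pairs are hypothetical):

```python
def dedupe_musicians(all_musicians):
    """Keep the first (name, url) pair seen for each url, preserving order."""
    seen_urls = set()
    unique = []
    for name, url in all_musicians:
        if url not in seen_urls:
            seen_urls.add(url)
            unique.append((name, url))
    return unique

# Hypothetical input in which one musician was collected from two lists
sample_pairs = [('Cream', 'https://en.wikipedia.org/wiki/Cream_(band)'),
                ('Free', 'https://en.wikipedia.org/wiki/Free_(band)'),
                ('Cream', 'https://en.wikipedia.org/wiki/Cream_(band)')]
print(dedupe_musicians(sample_pairs))
```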

<p><span style="color:blue">get_all_musician_docs</span>: A function that, given the list of (musician,url) pairs, returns two lists, a list of musicians and a parallel (same size) list of documents. 

<p>You need to:

<ol>
<li>initialize the two lists

<li>iterate through the all_musicians list
<li>extract the name and the url of the musician
<li>get the text using the get_musician_text() function
<li>if the function returns None, ignore it and move to the next musician
<li>otherwise, append the name to the musician_names list and the text to the musician_texts list
<li>return musician_names and musician_texts
</ol>


In [9]:
def get_all_musician_docs(all_musicians):
    musician_names = list()
    musician_texts = list()
    for name, url in all_musicians:
        # Fetch each page once and reuse the result
        text = get_musician_text(url)
        if text is None:
            # The page could not be retrieved: skip this musician
            continue
        musician_names.append(name)
        musician_texts.append(text)
    return musician_names,musician_texts
        

<h4>Example of how to use get_all_musician_docs</h4>

In [10]:
reference_names,reference_docs = get_all_musician_docs(all_musicians)

In [11]:
reference_names

['Tom Adams',
 'Eddie Adcock',
 'David "Stringbean" Akeman',
 'Red Allen',
 'Darol Anger',
 'Mike Auldridge',
 'Kenny Baker',
 'Jessie Baker',
 'Butch Baldassari',
 'Russ Barenberg',
 'Byron Berline',
 'Carroll Best',
 'Norman Blake',
 'Kathy Boyd',
 'Dale Ann Bradley',
 'David Bromberg',
 'Herman Brock Jr',
 'Jesse Brock',
 'Alison Brown',
 'Buckethead',
 'Buzz Busby',
 'Roger Bush',
 'Sam Bush',
 'Ann Marie Calhoun',
 'Jason Carter',
 'Vassar Clements',
 'Michael Cleveland',
 'Bill Clifton',
 'Charlie Cline',
 'Curly Ray Cline',
 'Mike Compton',
 'John Byrne Cooke',
 'J. P. Cormier',
 'John Cowan',
 'Dan Crary',
 'J. D. Crowe',
 'Jamie Dailey',
 'Charlie Daniels',
 'Vernon Derrick',
 'Hazel Dickens',
 'Doug Dillard',
 'The Dillards',
 'Jerry Douglas',
 'Casey Driessen',
 'John Duffey',
 'Stuart Duncan',
 'Chris Eldridge',
 'Bill Emerson',
 'Bill Evans',
 'Lester Flatt',
 'Dennis Fetchet',
 'Pete Fidler',
 'Béla Fleck',
 'Sally Ann Forrester',
 'Tony Furtado',
 'Raymond Fairchild',
 '

<h3>Set up the LSI model</h3>
<li>reference_docs is the list of documents
<li>construct texts, dictionary, and corpus (see class iPython notebook)
<li>construct an LSI model. Use 5 topics initially but you should play around with this number

In [12]:
documents = reference_docs
texts = [[word for word in document.lower().split()
        if word not in STOPWORDS and word.isalnum()]
        for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=5)
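To make the dictionary and corpus steps concrete, here is a plain-Python illustration of the bag-of-words idea: each document becomes a list of (token id, count) pairs, and tokens that are not in the vocabulary are ignored. This is only a sketch of the concept with made-up documents; build_vocabulary and to_bow are hypothetical helpers, and gensim's own id assignment may differ.

```python
from collections import Counter

def build_vocabulary(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Made-up tokenized documents
toy_texts = [['banjo', 'fiddle', 'banjo'],
             ['guitar', 'blues', 'guitar', 'fiddle']]
token2id = build_vocabulary(toy_texts)
toy_corpus = [to_bow(tokens, token2id) for tokens in toy_texts]
print(token2id)
print(toy_corpus)

# A new document is encoded with the same vocabulary; 'drums' is unknown and dropped
print(to_bow(['banjo', 'drums'], token2id))
```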

<h3>Construct the "musician" data set</h3>
<h4>Example</h4>

In [14]:
musician_genre_list = ['acid_rock_artists']
all_musicians = get_all_musicians(musician_genre_list)
musician_names,musician_docs = get_all_musician_docs(all_musicians)

<h4>find the most similar musicians for each new musician from our reference data set</h4>

In [16]:
table_data = list()
# Build the similarity index once; it depends only on the reference corpus
index = similarities.MatrixSimilarity(lsi[corpus])
for i,musician in enumerate(musician_docs):
    # Project the new document into the LSI topic space
    vec_bow = dictionary.doc2bow(musician.lower().split())
    vec_lsi = lsi[vec_bow]
    sims = index[vec_lsi]
    # Sort reference documents by descending similarity
    sims = sorted(enumerate(sims), key=lambda item: -item[1])

    most_similar_musician = sims[0][0]
    table_data.append((musician_names[i],reference_names[most_similar_musician]))
    
table_data

[('The 13th Floor Elevators', 'Fragile Rock'),
 ('Alice Cooper', 'At the Drive-In'),
 ('The Amboy Dukes', 'Joan of Arc'),
 ('Amon Düül', 'Joan of Arc'),
 ('Big Brother and the Holding Company', 'The Pretty Things'),
 ('Black Sabbath', 'Nude'),
 ('Blue Cheer', 'Steamhammer'),
 ('Blues Magoos', 'Hattie Hart'),
 ('The Charlatans', 'Fire Party'),
 ('Count Five', 'The Pretty Things'),
 ('Country Joe and the Fish', 'Ray Legere'),
 ('Coven', 'Marcus Mumford'),
 ('Cream', 'Cream'),
 ('Deep Purple', 'Nude'),
 ('The Deviants', 'Braid'),
 ('The Doors', 'The Pretty Things'),
 ('The Electric Prunes', 'Joan of Arc'),
 ('The Fugs', 'Free'),
 ('Grateful Dead', 'Drive Like Jehu'),
 ('The Great Society', 'Wishbone Ash'),
 ('The Groundhogs', 'The Groundhogs'),
 ('Hawkwind', 'Wishbone Ash'),
 ('Iron Butterfly', 'Steamhammer'),
 ('Jefferson Airplane', 'The Animals'),
 ('The Jimi Hendrix Experience', 'The Jimi Hendrix Experience'),
 ('Janis Joplin', 'Led Zeppelin'),
 ('JPT Scare Band', 'Rites of Spring'),
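MatrixSimilarity ranks reference documents by cosine similarity between topic vectors. The following plain-Python sketch shows that calculation on made-up two-topic vectors; the names and numbers are hypothetical and the helpers are not gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(query_vec, reference_vecs):
    """Return (index, score) of the reference vector closest to query_vec."""
    scores = [cosine_similarity(query_vec, ref) for ref in reference_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Made-up topic vectors for three reference documents
toy_reference_names = ['Reference A', 'Reference B', 'Reference C']
toy_reference_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

best_index, best_score = most_similar([2.0, 0.1], toy_reference_vecs)
print(toy_reference_names[best_index], round(best_score, 3))
```

Cosine similarity depends on the direction of the vectors, not their length, so a long page and a short page with the same topic mix score as close.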
 

# Some simple sentiment analysis

In this part we will run a simple sentiment analysis using the previously defined musician_names and musician_docs lists.

Define a function simple_sentiment_analysis(musician_names,musician_docs) that takes as inputs the list of musician names and the list of their corresponding descriptions.
The expected output is a list in which each element is a tuple containing the musician's name, the percentage of positive words in the description, and the percentage of negative words in the description.

In [None]:
#Example output
"""
[('The 13th Floor Elevators', 0.94, 1.16),
('Alice Cooper', 1.73, 1.34),
('The Amboy Dukes', 1.28, 1.01),
 ...]
"""

To ensure results can be compared, please use the following function to define your lists of positive and negative words:

In [30]:
def get_pos_neg_words():
    def get_words(url):
        import requests
        words = requests.get(url).content.decode('latin-1')
        word_list = words.split('\n')
        index = 0
        while index < len(word_list):
            word = word_list[index]
            if ';' in word or not word:
                word_list.pop(index)
            else:
                index+=1
        return word_list
    #Get lists of positive and negative words
    p_url = 'http://ptrckprry.com/course/ssd/data/positive-words.txt'
    n_url = 'http://ptrckprry.com/course/ssd/data/negative-words.txt'
    positive_words = get_words(p_url)
    negative_words = get_words(n_url)
    return positive_words,negative_words
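The inner get_words function keeps a line only if it is non-empty and contains no ';'. The same filter can be written without modifying the list while looping over it. This is a sketch on an in-memory string so that it runs without network access; parse_word_list and the sample text are hypothetical, and for comparable results the function above should still be the one used.

```python
def parse_word_list(raw_text):
    """Return the non-empty lines of raw_text that do not contain ';'."""
    words = []
    # splitlines() also handles '\r\n' endings, which split('\n') would not strip
    for line in raw_text.splitlines():
        line = line.strip()
        if line and ';' not in line:
            words.append(line)
    return words

# Made-up sample: comment lines containing ';', blank lines, then words
sample_text = ';;; header comment\n;\n\nabound\nabounds\r\n\nzest\n'
print(parse_word_list(sample_text))
```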


In [43]:
def simple_sentiment_analysis(musician_names,musician_docs, debug=False):
    from nltk import word_tokenize
    results=[]
    positive_words,negative_words = get_pos_neg_words()
    # Sets make the membership tests below much faster than lists
    positive_words,negative_words = set(positive_words),set(negative_words)

    for name,doc in zip(musician_names, musician_docs):
        # Tokenize once and reuse the tokens for counting and for the total
        tokens = word_tokenize(doc)
        if not tokens:
            # Empty description: avoid dividing by zero
            results.append((name,0.0,0.0))
            continue
        cpos=cneg=0
        for word in tokens:
            if word in positive_words:
                if debug:
                    print("Positive",word)
                cpos+=1
            if word in negative_words:
                if debug:
                    print("Negative",word)
                cneg+=1
        # Percentage of all tokens that are positive / negative words
        c_pos_score=round(((cpos/len(tokens))*100),2)
        c_neg_score=round(((cneg/len(tokens))*100),2)
        results.append((name,c_pos_score,c_neg_score))
    return results

In [44]:
simple_sentiment_analysis(musician_names,musician_docs)

[('The 13th Floor Elevators', 0.94, 1.16),
 ('Alice Cooper', 1.73, 1.34),
 ('The Amboy Dukes', 1.28, 1.01),
 ('Amon Düül', 1.34, 1.05),
 ('Big Brother and the Holding Company', 1.65, 0.74),
 ('Black Sabbath', 1.31, 1.42),
 ('Blue Cheer', 0.96, 1.05),
 ('Blues Magoos', 1.15, 0.31),
 ('The Charlatans', 0.83, 1.67),
 ('Count Five', 1.07, 1.07),
 ('Country Joe and the Fish', 1.54, 1.25),
 ('Coven', 0.66, 0.66),
 ('Cream', 1.54, 1.24),
 ('Deep Purple', 1.31, 0.78),
 ('The Deviants', 0.2, 0.61),
 ('The Doors', 1.39, 1.36),
 ('The Electric Prunes', 1.4, 1.06),
 ('The Fugs', 0.51, 0.96),
 ('Grateful Dead', 1.35, 0.71),
 ('The Great Society', 0.89, 0.38),
 ('The Groundhogs', 1.07, 0.68),
 ('Hawkwind', 1.07, 0.86),
 ('Iron Butterfly', 1.11, 1.34),
 ('Jefferson Airplane', 1.53, 1.01),
 ('The Jimi Hendrix Experience', 1.4, 1.13),
 ('Janis Joplin', 1.27, 1.12),
 ('JPT Scare Band', 0.79, 1.19),
 ('Love', 1.56, 1.06),
 ('MC5', 1.87, 1.64),
 ('Moby Grape', 1.38, 1.14),
 ('The Music Machine', 1.39, 1.5
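
Each score above is the number of tokens found in the positive (or negative) word list, divided by the total number of tokens and multiplied by 100. A toy calculation makes the arithmetic explicit; it uses str.split instead of word_tokenize so that it runs without NLTK, and the word sets and sentence are made up.

```python
def word_percentages(tokens, positive_words, negative_words):
    """Return (percent positive, percent negative) of tokens, rounded to 2 decimals."""
    total = len(tokens)
    if total == 0:
        return 0.0, 0.0
    pos = sum(1 for t in tokens if t in positive_words)
    neg = sum(1 for t in tokens if t in negative_words)
    return round(pos / total * 100, 2), round(neg / total * 100, 2)

# Made-up word sets and sentence: 8 tokens, 2 positive, 1 negative
toy_positive = {'acclaimed', 'great'}
toy_negative = {'erratic'}
toy_tokens = 'the acclaimed band gave a great erratic show'.split()

print(word_percentages(toy_tokens, toy_positive, toy_negative))
```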