<h1>Document Similarity using LSI</h1>

<h4>In this assignment we’re going to practice document similarity. Here’s
what you need to do:</h4>
<ol>
<li>From Wikipedia’s List of musicians page (https://en.wikipedia.org/wiki/Lists_of_musicians), pick five lists of
musicians (e.g., List of big band musicians). You can pick any five
you like but make sure that the list has the words “musicians” in
it and that the list has at least 30 musicians listed
<li>Collect the urls of all the musicians on those five pages and place them in a list
<li>Grab the content of each musician in the list and place them in a list (of documents)
<li>Build an LSI model using this data. This is your "reference" data set
<li>Now grab another list of musicians from wikipedia and create a new list of documents using the detail from each musicians page. This is your "musician" data set
<li>For each musician in the new list, find the musician in the reference data set that is the closest in similarity. 
<li>Print a table that contains each musician from the musician data set and the most similar musician from the reference data set
</ol>
<h4>Use the code below to build your solution

<p><span style="color:blue">get_musicians</span>: A function that, given a "list of musicians" url, returns a list containing the names of the musicians and the urls for their wikipedia pages
<p>non_musician_finder tries its best to remove links that are not musician links from the page (not perfect, but good enough!)

In [2]:
def get_musicians(url):
    from bs4 import BeautifulSoup
    import requests
    page_soup = BeautifulSoup(requests.get(url).content,'lxml')
    li_tags = page_soup.find_all('li')
    all_musicians = list()
    for tag in li_tags:
        if tag.get('id'):
            continue

        try:
            tag.find('sup',class_="reference")
            link = tag.find('a').get('href')
            name = tag.find('a').get_text()
            if "/wiki/" in link and non_musician_finder(link):
                all_musicians.append((name,"https://en.wikipedia.org" + link))
        except:
            pass
    return all_musicians

def non_musician_finder(link):
    non_musician_words = ['Category','Template','Portal','List','File','Template','Special','Main','Help','User']
    for word in non_musician_words:
        if word in link:
            return False
    return True

<h4>testing the function</h4>
<li>Note that Wikipedia does not have a standard for its page design so this code may not work with every list

In [3]:
url = "https://en.wikipedia.org/wiki/List_of_alternative_country_musicians"
#url = "https://en.wikipedia.org/wiki/List_of_Arabic_pop_musicians"
#url = "https://en.wikipedia.org/wiki/List_of_bluegrass_musicians"
#url = "https://en.wikipedia.org/wiki/List_of_G-funk_musicians"
#url = "https://en.wikipedia.org/wiki/List_of_soul_jazz_musicians"
#url = "https://en.wikipedia.org/wiki/List_of_funk_musicians"
get_musicians(url)

[('16 Horsepower', 'https://en.wikipedia.org/wiki/16_Horsepower'),
 ('Ryan Adams', 'https://en.wikipedia.org/wiki/Ryan_Adams'),
 ('Jill Andrews', 'https://en.wikipedia.org/wiki/Jill_Andrews'),
 ('The Autumn Defense', 'https://en.wikipedia.org/wiki/The_Autumn_Defense'),
 ('Backyard Tire Fire', 'https://en.wikipedia.org/wiki/Backyard_Tire_Fire'),
 ('Del Barber', 'https://en.wikipedia.org/wiki/Del_Barber'),
 ('Eef Barzelay', 'https://en.wikipedia.org/wiki/Eef_Barzelay'),
 ("Bear's Den", 'https://en.wikipedia.org/wiki/Bear%27s_Den_(band)'),
 ('Rico Bell', 'https://en.wikipedia.org/wiki/Rico_Bell'),
 ('Blitzen Trapper', 'https://en.wikipedia.org/wiki/Blitzen_Trapper'),
 ('Blue Rodeo', 'https://en.wikipedia.org/wiki/Blue_Rodeo'),
 ('Bosque Brown', 'https://en.wikipedia.org/wiki/Bosque_Brown'),
 ('The Bottle Rockets', 'https://en.wikipedia.org/wiki/The_Bottle_Rockets'),
 ('BR549', 'https://en.wikipedia.org/wiki/BR549'),
 ('Jim Bryson', 'https://en.wikipedia.org/wiki/Jim_Bryson'),
 ('Richard B

<h4>get_musician_text(url): returns the page text of the wikipedia page associated with a musician</h4>
<li>Since we're not sure if this will always work, we use a try ... except to catch exceptions
<li>If it doesn't work, the function returns None
<li>We will need to delete this (musician, url) pair from our musicians list

In [4]:
def get_musician_text(url):
    from bs4 import BeautifulSoup
    import requests
    all_text = ''
    try:
        page_soup = BeautifulSoup(requests.get(url).content,'lxml')
        for p_tag in page_soup.find_all('p'):
            all_text += p_tag.get_text()
    except:
        return None
    return all_text


<h4>testing get_musician_text</h4>

In [5]:
url = "https://en.wikipedia.org/wiki/Jim_Morrison"
get_musician_text(url)

'\nJames Douglas "Jim" Morrison (December 8, 1943 – July 3, 1971) was an American singer, songwriter and poet, best remembered as the lead vocalist of the rock band The Doors. Due to his poetic lyrics, distinctive voice, wild personality, performances, and the dramatic circumstances surrounding his life and early death, Morrison is regarded by music critics and fans as one of the most iconic and influential frontmen in rock music history. Since his death, his fame has endured as one of popular culture\'s most rebellious and oft-displayed icons, representing the generation gap and youth counterculture.[1]\nMorrison co-founded the Doors during the summer of 1965 in Venice, California. The band spent two years in obscurity until shooting to prominence with their #1 single in the United States, "Light My Fire", taken from their self-titled debut album. Morrison recorded a total of six studio albums with the Doors, all of which sold well and received critical acclaim. Though the Doors recor

<p><span style="color:blue">get_all_musicians</span>: A function that, given a list of genres, returns a list containing the names of the musicians and the urls for their wikipedia pages associated with that list of genres
<p>The function should return a list of (name,url) pairs for all the musicians in the list of genres
<p>You need to:
<ol>
<li>iterate through the list of genres
<li>initialize a list "all_musicians"
<li>construct a url for the list of musicians (I've done these first three steps for you)
<li>call get_musicians for that url
<li>extend all_musicians by what get_musicians returns

In [6]:
def get_all_musicians(genre_list):
    all_musicians = list()
    for genre in genre_list:
        url = 'https://en.wikipedia.org/wiki/List_of_' + genre
        #Your code here
        all_musicians = all_musicians + list(get_musicians(url))
    
    return all_musicians

<h4>Example of how to use get_all_musicians</h4>

In [7]:
genre_list = ['bluegrass_musicians#G','British_blues_musicians','country_blues_musicians','emo_artists', 'Indonesian_pop_musicians']
#genre_list = ['bluegrass_musicians#G']
all_musicians = get_all_musicians(genre_list)
print(all_musicians)

[('Tom Adams', 'https://en.wikipedia.org/wiki/Tom_Adams_(bluegrass_musician)'), ('Eddie Adcock', 'https://en.wikipedia.org/wiki/Eddie_Adcock'), ('David "Stringbean" Akeman', 'https://en.wikipedia.org/wiki/David_%22Stringbean%22_Akeman'), ('Red Allen', 'https://en.wikipedia.org/wiki/Red_Allen_(bluegrass)'), ('Darol Anger', 'https://en.wikipedia.org/wiki/Darol_Anger'), ('Mike Auldridge', 'https://en.wikipedia.org/wiki/Mike_Auldridge'), ('Kenny Baker (fiddler)', 'https://en.wikipedia.org/wiki/Kenny_Baker_(fiddler)'), ('Jessie Baker', 'https://en.wikipedia.org/wiki/Jessie_Baker'), ('Butch Baldassari', 'https://en.wikipedia.org/wiki/Butch_Baldassari'), ('Russ Barenberg', 'https://en.wikipedia.org/wiki/Russ_Barenberg'), ('Byron Berline', 'https://en.wikipedia.org/wiki/Byron_Berline'), ('Norman Blake', 'https://en.wikipedia.org/wiki/Norman_Blake_(American_musician)'), ('Kathy Boyd', 'https://en.wikipedia.org/wiki/Kathy_Boyd_and_Phoenix_Rising'), ('Dale Ann Bradley', 'https://en.wikipedia.org/

<p><span style="color:blue">get_all_musician_docs</span>: A function that, given the list of (musician,url) pairs, returns two lists, a list of musicians and a parallel (same size) list of documents. 

<p>You need to:

<ol>
<li>initialize the two lists

<li>iterate through the all_musicians list
<li>extract the name and the url of the musician
<li>get the text using the get_musician_text() function
<li>if the function returns None, ignore it and move to the next musician
<li>otherwise, append the name ot the musician_names list and the text to the musician_texts list
<li>return musician_names and musician_texts


In [8]:
def get_all_musician_docs(all_musicians, debug = True):
    musician_names = list()
    musician_texts = list()
    for musician in all_musicians:
        name = musician[0]
        url = musician[1]
        if get_musician_text(url) != None:
            if debug:
                musician_names.append(name)
                musician_texts.append(get_musician_text(url))
        else:
            if debug:
                musician_names = musician_names
                musician_texts = musician_texts
        
        
        #Your code here
    return musician_names, musician_texts
        

In [9]:
get_all_musician_docs(all_musicians)


(['Tom Adams',
  'Eddie Adcock',
  'David "Stringbean" Akeman',
  'Red Allen',
  'Darol Anger',
  'Mike Auldridge',
  'Kenny Baker (fiddler)',
  'Jessie Baker',
  'Butch Baldassari',
  'Russ Barenberg',
  'Byron Berline',
  'Norman Blake',
  'Kathy Boyd',
  'Dale Ann Bradley',
  'David Bromberg',
  'Herman Brock Jr',
  'Jesse Brock',
  'Alison Brown',
  'Buckethead',
  'Buzz Busby',
  'Roger Bush',
  'Sam Bush',
  'Ann Marie Calhoun',
  'Jason Carter',
  'Vassar Clements',
  'Michael Cleveland',
  'Charlie Cline',
  'Curly Ray Cline',
  'Mike Compton',
  'John Byrne Cooke',
  'J. P. Cormier',
  'John Cowan',
  'Dan Crary',
  'J. D. Crowe',
  'Jamie Dailey',
  'Charlie Daniels',
  'Vernon Derrick',
  'Hazel Dickens',
  'Doug Dillard',
  'The Dillards',
  'Jerry Douglas',
  'Casey Driessen',
  'John Duffey',
  'Stuart Duncan',
  'Chris Eldridge',
  'Bill Emerson',
  'Bill Evans',
  'Lester Flatt',
  'Dennis Fetchet',
  'Pete Fidler',
  'Béla Fleck',
  'Tony Furtado',
  'Raymond Fairchild

In [10]:
#print(len(get_all_musician_docs(all_musicians)[0]))


In [11]:
#print(len(get_all_musician_docs(all_musicians)[1]))

<h4>Example of how to use get_all_musician_docs</h4>

In [12]:
reference_names,reference_docs = get_all_musician_docs(all_musicians)

<h3>Set up the LSI model</h3>
<li>reference_docs is the list of documents
<li>construct texts, dictionary, and corpus (see class iPython notebook)
<li>construct an LSI model. Use 5 topics initially but you should play around with this number

In [13]:
import warnings
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk import sent_tokenize,word_tokenize 
from gensim import corpora
from gensim.parsing.preprocessing import STOPWORDS
from gensim.similarities.docsim import Similarity
from gensim import corpora, models, similarities
import pprint


#Code for LSI model goes here
for i in range(len(reference_docs)):
    document = reference_docs[i]
    sents = sent_tokenize(document)
    for j in range(len(sents)):
        sent = sents[j]
        sent = sent.strip().replace('\n','')
        sents[j] = sent
    reference_docs[i] = '. '.join(sents)
    
documents = reference_docs
texts = [[word for word in document.lower().split()
        if word not in STOPWORDS and word.isalnum()]
        for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=5)



<h3>Construct the "musician" data set</h3>
<h4>Example</h4>

In [14]:
musician_genre_list = ['acid_rock_artists']
all_musicians = get_all_musicians(musician_genre_list)
musician_names,musician_docs = get_all_musician_docs(all_musicians)

In [15]:
#len(musician_names)

<h4>find the most similar musicians for each new musician from our reference data set</h4>

In [16]:
table_data = list()
for index,musician in enumerate(musician_docs):
    
    #Your similarity code here. Use the in-class notebook as a reference
    doc = str(musician)
    vec_bow = dictionary.doc2bow(doc.lower().split())
    vec_lsi = lsi[vec_bow]
    i = similarities.MatrixSimilarity(lsi[corpus])
    sims = i[vec_lsi]
    sims = sorted(enumerate(sims), key=lambda item: -item[1])
    
    most_similar_musician = sims[0][0]
    table_data.append((musician_names[index],reference_names[most_similar_musician]))
    
#Write code to print table_data after the for loop ends
table_data
    

  if np.issubdtype(vec.dtype, np.int):


[('The 13th Floor Elevators', 'Pete Fidler'),
 ('Alice Cooper', 'The Appleseed Cast'),
 ('The Amboy Dukes', 'Garden Variety'),
 ('Amon Düül', 'Joan of Arc'),
 ('Big Brother and the Holding Company', 'Joan of Arc'),
 ('Black Sabbath', 'The Anniversary'),
 ('Blue Cheer', 'Steamhammer'),
 ('Blues Magoos', 'Hattie Hart'),
 ('The Charlatans', 'Glenn Fredly'),
 ('Count Five', 'Tommy Ramone'),
 ('Country Joe and the Fish', 'Chris Hillman'),
 ('Coven', 'Rites of Spring'),
 ('Cream', 'Cream'),
 ('Deep Purple', 'The Anniversary'),
 ('The Deviants', 'The Changcuters'),
 ('The Doors', 'Sara Watkins'),
 ('The Electric Prunes', 'Bluesology'),
 ('The Fugs', 'Joan of Arc'),
 ('Grateful Dead', 'Cherry Belle'),
 ('The Great Society', 'The Anniversary'),
 ('The Groundhogs', 'The Groundhogs'),
 ('Hawkwind', 'Saetia'),
 ('Iron Butterfly', 'Joan of Arc'),
 ('Jefferson Airplane', 'Bluesology'),
 ('The Jimi Hendrix Experience', 'The Jimi Hendrix Experience'),
 ('Janis Joplin', 'Memphis Jug Band'),
 ('JPT Scar

In [17]:
#print(sims)
#print(sims[0])
#print(sims[0][0])
#print(most_similar_musician)

In [18]:
#len(table_data)