<h1>Document Similarity using LSI</h1>

<h4>In this assignment we’re going to practice document similarity. Here’s
what you need to do:</h4>
<ol>
<li>From Wikipedia’s List of writers page (https://en.wikipedia.org/wiki/Lists_of_writers), pick five lists of
writers (e.g., List of detective fiction authors). You can pick any five
you like but make sure that the list has at least 30 writers listed
<li>Collect the urls of all the writers on those five pages and place them in a list
<li>Add summarization of the pages
<li>Grab the content of each writer in the list and place them in a list (of documents)
<li>Build an LSI model using this data. This is your <b>"reference" data set</b>
<li>Now grab another list of writers from wikipedia and create a new list of documents using the detail from each writers page. This is your <b>"writer" data set</b>
<li>For each writer in the new list, find the writer in the <b>reference data set</b> that is the closest in similarity.
<li>Print a table that contains each writer from the <b>writer data set</b> and the most similar writer from the <b>reference data set</b>
<li>Perform sentiment analysis
</ol>
<h4>Use the code below to build your solution

<p><span style="color:blue">get_writers</span>: A function that, given a "list of writers" url, returns a list containing the names of the writers and the urls for their wikipedia pages
<p><span style="color:blue">non_writer_finder</span> tries its best to remove links that are not writer links from the page (not perfect, but good enough!)

In [13]:
def get_writers(url):
    from bs4 import BeautifulSoup
    import requests

    page_soup = BeautifulSoup(requests.get(url).content, "lxml")
    li_tags = page_soup.find_all("li")
    all_writers = list()
    for tag in li_tags:
        if tag.get("id"):
            continue

        try:
            tag.find("sup", class_="reference")
            link = tag.find("a").get("href")
            name = tag.find("a").get_text()
            if "/wiki/" in link and non_writer_finder(link):
                all_writers.append((name, "https://en.wikipedia.org" + link))
        except:
            pass
    return all_writers

def non_writer_finder(link):
    non_writer_words = ['Category','Template','Portal','List','File','Template','Special','Main','Help','User','https']
    for word in non_writer_words:
        if word in link:
            return False
    return True

In [None]:
# !pip install --upgrade gensim



<h4>testing the function</h4>
<li>Note that Wikipedia does not have a standard for its page design so this code may not work with every list

In [14]:
url = "https://en.wikipedia.org/wiki/List_of_detective_fiction_authors"
get_writers(url)

  parser = parser(


[('Mario Acevedo', 'https://en.wikipedia.org/wiki/Mario_Acevedo_(author)'),
 ('Douglas Adams', 'https://en.wikipedia.org/wiki/Douglas_Adams'),
 ('Humayun Ahmed', 'https://en.wikipedia.org/wiki/Humayun_Ahmed'),
 ('Margery Allingham', 'https://en.wikipedia.org/wiki/Margery_Allingham'),
 ('Rudolfo Anaya', 'https://en.wikipedia.org/wiki/Rudolfo_Anaya'),
 ('Gosho Aoyama', 'https://en.wikipedia.org/wiki/Gosho_Aoyama'),
 ('Frank Arnau', 'https://en.wikipedia.org/wiki/Frank_Arnau'),
 ('Taku Ashibe', 'https://en.wikipedia.org/wiki/Taku_Ashibe'),
 ('Ace Atkins', 'https://en.wikipedia.org/wiki/Ace_Atkins'),
 ('Kate Atkinson', 'https://en.wikipedia.org/wiki/Kate_Atkinson_(writer)'),
 ('Yukito Ayatsuji', 'https://en.wikipedia.org/wiki/Yukito_Ayatsuji'),
 ('Sharadindu Bandyopadhyay',
  'https://en.wikipedia.org/wiki/Sharadindu_Bandyopadhyay'),
 ('Nevada Barr', 'https://en.wikipedia.org/wiki/Nevada_Barr'),
 ('Earle Basinsky', 'https://en.wikipedia.org/wiki/Earle_Basinsky'),
 ('M. C. Beaton', 'https:/

<h4>get_writer_text(url): returns the page text of the wikipedia page associated with a writer</h4>
<li>Since we're not sure if this will always work, we use a try ... except to catch exceptions
<li>If it doesn't work, the function returns None
<li>We will need to delete this (writer, url) pair from our writers list

In [15]:
def get_writer_text(url):
    from bs4 import BeautifulSoup
    import requests

    all_text = ""
    try:
        page_soup = BeautifulSoup(requests.get(url).content, "lxml")
        for p_tag in page_soup.find_all("p"):
            all_text += p_tag.get_text()
    except:
        return None
    return all_text

<h4>testing get_writer_text</h4>

In [16]:
url = "https://en.wikipedia.org/wiki/Kate_Atkinson_(writer)"
get_writer_text(url)

'\nKate Atkinson MBE (born 20 December 1951) is an English writer of novels, plays and short stories.[1]  She has written historical novels, detective novels and family novels, incorporating postmodern and magical realist elements into the plots. Her debut, Behind the Scenes at the Museum, won the Whitbread Book Award, the precursor to the Costa Book Award, in 1995. The novels Life After Life and A God in Ruins won the Costa Book Award for novel in 2013 and 2015. She is also known for the Jackson Brodie series of detective novels, which has been adapted into the BBC One series, Case Histories.[2][3]\nThe daughter of a shopkeeper, Atkinson was born in York, the setting for several of her books.[4] She was an only child and often had to finds ways to amuse herself. She describes herself as an anxious child, something she believes had to do with being illegitimate. Her parents lived together but were not married, because her mother could not divorce her first husband. At the time, that wa

<p><span style="color:blue">get_all_writers</span>: A function that, given a list of genres, returns a list containing the names of the writers and the urls for their wikipedia pages associated with that list of genres
<p>The function should return a list of (name,url) pairs for all the writers in the list of genres
<p>You need to:
<ol>
<li>iterate through the list of genres
<li>initialize a list "all_writers"
<li>construct a url for the list of writers (I've done these first three steps for you)
<li>call get_writers for that url
<li>extend all_writers by what get_writers returns

In [17]:
def get_all_writers(genre_list):
    all_writers = list()

    for genre in genre_list:
        url = f"https://en.wikipedia.org/wiki/List_of_{genre}"
        writers = get_writers(url)
        all_writers.extend(writers)

    return all_writers


<h4>Example of how to use get_all_writers</h4>

In [18]:
genre_list = ['detective_fiction_authors', 'romantic_novelists', 'mystery_writers', 'fantasy_authors', 'role-playing_game_designers'] # can change it to your list of 5 genres
all_writers = get_all_writers(genre_list)
all_writers


[('Mario Acevedo', 'https://en.wikipedia.org/wiki/Mario_Acevedo_(author)'),
 ('Douglas Adams', 'https://en.wikipedia.org/wiki/Douglas_Adams'),
 ('Humayun Ahmed', 'https://en.wikipedia.org/wiki/Humayun_Ahmed'),
 ('Margery Allingham', 'https://en.wikipedia.org/wiki/Margery_Allingham'),
 ('Rudolfo Anaya', 'https://en.wikipedia.org/wiki/Rudolfo_Anaya'),
 ('Gosho Aoyama', 'https://en.wikipedia.org/wiki/Gosho_Aoyama'),
 ('Frank Arnau', 'https://en.wikipedia.org/wiki/Frank_Arnau'),
 ('Taku Ashibe', 'https://en.wikipedia.org/wiki/Taku_Ashibe'),
 ('Ace Atkins', 'https://en.wikipedia.org/wiki/Ace_Atkins'),
 ('Kate Atkinson', 'https://en.wikipedia.org/wiki/Kate_Atkinson_(writer)'),
 ('Yukito Ayatsuji', 'https://en.wikipedia.org/wiki/Yukito_Ayatsuji'),
 ('Sharadindu Bandyopadhyay',
  'https://en.wikipedia.org/wiki/Sharadindu_Bandyopadhyay'),
 ('Nevada Barr', 'https://en.wikipedia.org/wiki/Nevada_Barr'),
 ('Earle Basinsky', 'https://en.wikipedia.org/wiki/Earle_Basinsky'),
 ('M. C. Beaton', 'https:/

<p><span style="color:blue">get_all_writer_docs</span>: A function that, given the list of (writer,url) pairs, returns two lists, a list of writers and a parallel (same size) list of documents.

<p>You need to:

<ol>
<li>initialize the two lists

<li>iterate through the all_writers list
<li>extract the name and the url of the writer
<li>get the text using predefined function
<li>if the function returns None, ignore it and move to the next writer
<li>otherwise, append the name ot the writer_names list and the text to the writer_texts list
<li>return writer_names and writer_texts


In [19]:
def get_all_writer_docs(all_writers):
    writer_names = []
    writer_texts = []
    for writer, url in all_writers:
        text = get_writer_text(url)
        if text is None:
            continue

        writer_names.append(writer)
        writer_texts.append(text)

    return writer_names, writer_texts


<h4>Example of how to use get_all_writer_docs</h4>

In [20]:
# may take some minutes to run the code depend on the number of writers on your list
# print(get_all_writer_docs(all_writers))

reference_names, reference_docs = get_all_writer_docs(all_writers)
print(len(reference_names), len(reference_docs))

2755 2755


<h3>Repeat the process on summarized text</h3>

<h4>Text summarization</h4>
Create a function <span style="color:blue">summarize_text()</span> which summarizes text with the most relevant sentences.
<br>The function should summarize the given text by extracting the top N sentences containing the highest frequency of important words.
    
Parameters:
<li>text (str): The full text content of a writer's Wikipedia page.
<li>num_sentences (int): Number of sentences to include in the summary (default is 3).

In [22]:
import warnings

warnings.filterwarnings("default")
import nltk

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")  # Download wordnet
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from collections import OrderedDict
import pprint


# # Ensure necessary resources are available
# nltk.download("punkt")
# nltk.download("stopwords")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [23]:

def summarize_text(text, num_sentences=3):
    sentences = sent_tokenize(text)
    words = word_tokenize(text.lower())
    stop_words = set(stopwords.words("english"))#delete the, a ...
    filtered_words = [word for word in words if word.isalnum() and word not in stop_words]#only number, vocab


    word_freq = Counter(filtered_words)#calculate each frequency and return dict with vocab:freq

    sentence_scores = {}
    for sentence in sentences:
        sentence_word_count = word_tokenize(sentence.lower())
        sentence_score = sum(word_freq.get(word, 0) for word in sentence_word_count if word in word_freq)#sum each vocab freq as their sentence score

        sentence_scores[sentence] = sentence_score


    top_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:num_sentences]


    summary = " ".join(top_sentences)
    return summary


<h4>Testing summarize_text</h4>

In [24]:

sample_url = "https://en.wikipedia.org/wiki/Carolyn_Zane"

nltk.download("punkt_tab")
# Fetch the full text for the writer
writer_text = get_writer_text(sample_url)

# Apply the summarization function to the fetched text
if writer_text:
    summary = summarize_text(writer_text, num_sentences=25)
    print("Summary of the writer's page:")
    print(summary)
else:
    print("Failed to fetch writer text.")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
  parser = parser(


Summary of the writer's page:
[1][2]  Her work has been selected as Top Pick by Romantic Times, where Cheryl Hanson described her latest work Beyond the Storm by saying, "The writing is so riveting and real that you'll feel the storm and the pain of the aftermath." Carolyn Pizzuti is an American author of romance novels under the pen name Carolyn Zane. [2] After what Zane refers to as the world's longest maternity leave, she went back to work as the launch author for Abingdon Press's Quilts of Love series. Carolyn graduated from high school in 1976 in Silverton, Oregon and enrolled at Oregon State University, where she majored in broadcast/speech communications. [1]
On their sixteenth wedding anniversary, Carolyn and her husband welcomed their first daughter, Madeline Alexa. When he was 18 months, an African American sibling set, Grace Elizabeth and Gabriel Robert were added to the family, adopted through foster care. [1] Since then, she has had more than 30 novels published by four pu

In [None]:
# all_writers

Now create a function that get the summaries for each writer

In [25]:
def get_all_writer_docs_summary(all_writers):
    """
    Fetches, summarizes, and returns texts for each writer in the list.

    Parameters:
    - all_writers (list): A list of tuples containing writer names and URLs.

    Returns:
    - writer_names (list): Names of the writers.
    - writer_summaries (list): Summarized text content for each writer.
    """
    writer_names = []
    writer_summaries = []

    for writer in all_writers:
        name, url = writer
        text = get_writer_text(url)


        if text:
            summary = summarize_text(text, num_sentences=3)
            writer_names.append(name)
            writer_summaries.append(summary)

    return writer_names, writer_summaries


test get all writers function with summarized texts

In [None]:
# May take a couple of minutes depend on the length of your writer list.
# print(get_all_writer_docs_summary(all_writers))

In [26]:
reference_names_2, reference_docs_2 = get_all_writer_docs_summary(all_writers)
print(len(reference_names_2), len(reference_docs_2))

2755 2755


In [None]:
# import csv
# with open("reference_docs_2.csv", "w", newline="") as file:
#     writer = csv.writer(file)
#     writer.writerow(["Reference Doc"])  # 標題
#     for doc in reference_docs_2:
#         writer.writerow([doc])

<h3>Set up the LSI model</h3>
<li>reference_docs is the list of documents
<li>construct texts, dictionary, and corpus (see class iPython notebook)
<li>construct an LSI model. Use 5 topics initially but you should play around with this number

In [27]:
# for i in range(len(reference_docs)):
#     story = reference_docs[i]
#     print(f"Processing document {i}: {story}")  # Debug point 1
#     sents = sent_tokenize(story)
#     for j in range(len(sents)):
#         sent = sents[j]
#         sent = sent.strip().replace('\n', '')
#         sents[j] = sent
#     reference_docs[i] = '. '.join(sents)
#     print(f"Updated document {i}: {reference_docs[i]}")  # Debug point 2
# documents = [doc.raw() for doc in reference_docs]
from gensim.parsing.preprocessing import STOPWORDS
from nltk.tokenize import sent_tokenize
texts = [[word for word in document.lower().split()
        if word not in STOPWORDS and word.isalnum()]
        for document in reference_docs]
# texts = [
#     [word for word in doc.lower().split() if word not in STOPWORDS]
#     for doc in reference_docs
# ]




In [28]:
texts[0]

  and should_run_async(code)


['mario',
 'acevedo',
 'july',
 'american',
 'novelist',
 'known',
 'series',
 'urban',
 'fantasy',
 'novels',
 'featuring',
 'vampire',
 'private',
 'investigator',
 'felix',
 'lives',
 'works',
 'acevedo',
 'born',
 'el',
 'published',
 'acevedo',
 'held',
 'jobs',
 'military',
 'helicopter',
 'infantry',
 'art',
 'software',
 'assorted',
 'deployed',
 'soldier',
 'artist',
 'army',
 'operation',
 'desert',
 'article',
 'novelist',
 'united',
 'states',
 'born',
 '1950s',
 'help',
 'wikipedia',
 'expanding']

In [30]:
import json


with open('texts.json', 'w', encoding='utf-8') as json_file:
    json.dump(texts, json_file, ensure_ascii=False, indent=4)


In [None]:
with open('texts.json', 'r', encoding='utf-8') as json_file:
    loaded_data = json.load(json_file)

print(loaded_data)

[['mike', 'young', 'american', 'game', 'founder', 'independent', 'professional', 'larp', 'publishing', 'works', 'include', 'rules', 'live', 'published', 'generic', 'larp', 'published', 'larps', 'card', 'contributed', 'short', 'pieces', 'professional', 'larp', 'journal', '1989', 'galactic', 'emperor', 'larp', 'sold', 'skotos', 'online', 'game', 'floating', 'vagabond', 'floating', 'vagabond', 'floating', 'vagabond', 'square', 'root', 'officially', 'licensed', 'larps', 'run', 'tales', 'floating', 'vagabond', 'rpg', 'setting', 'originally', 'published', 'avalon', 'run', 'creator', 'tales', 'floating', 'lee', 'authored', 'games', 'alien', 'including', 'neophyte', 'credited', 'legend', 'crosstime', 'credited', 'level', 'designer', 'developer', 'icebreaker', 'magnet', 'interactive', 'mike', 'published', 'card', 'game', 'won', 'award', 'polycon', 'independent', 'game', 'design', 'addition', 'designing', 'games', 'larp', 'mike', 'current', 'works', 'biap', 'systems', 'mike', 'lead', 'developer'

In [35]:
# Code for LSI model for text goes here
from gensim import corpora
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
from gensim import corpora, models, similarities
lsi_text = models.LsiModel(corpus, id2word=dictionary, num_topics=3)

  sparsetools.csc_matvecs(


In [32]:
#clean document
texts_summary = [[word for word in doc.lower().split() if word not in STOPWORDS and word.isalnum()]
        for doc in reference_docs_2]



  and should_run_async(code)


In [33]:
with open('texts_summary.json', 'w', encoding='utf-8') as json_file:
    json.dump(texts_summary, json_file, ensure_ascii=False, indent=4)


In [34]:
!pip install --upgrade gensim



In [36]:
# Code for LSI model for summaries goes here
from gensim import corpora
from gensim import corpora, models, similarities
dictionary_summary = corpora.Dictionary(texts_summary)
corpus_summary = [dictionary_summary.doc2bow(text) for text in texts_summary]

lsi_summary = models.LsiModel(corpus_summary, id2word=dictionary_summary, num_topics=5)

<h3>Construct the writer data set with texts and summaries</h3>
<h4>Example</h4>

In [37]:
# for whole text
writer_genre_list = ["Western_fiction_authors"]
all_writers = get_all_writers(writer_genre_list)
writer_names, writer_docs = get_all_writer_docs(all_writers)
writer_names, writer_docs

  parser = parser(


(['Film',
  'Television',
  'Literature',
  'Visual arts',
  'Dime novels',
  'Comics',
  'Wild West shows',
  'Acid Western',
  'Australian Western',
  'Contemporary Western',
  'Dacoit Western',
  'Epic Western',
  'Fantasy Western',
  'Florida Western',
  'Gothic Western',
  'Horror Western',
  'Northern',
  'Ostern',
  'Revisionist Western',
  'Science fiction Western',
  'Singing cowboy',
  'Space Western',
  'Spaghetti Western',
  'Weird Western',
  'Western romance',
  'Zapata Western',
  'Golden Boot Awards',
  'Old West',
  'Cowboy culture',
  'Cowboy',
  'Gunfighter',
  'Outlaw',
  'Quick draw',
  'Saloon',
  'Manifest destiny',
  'Edward Abbey',
  'Andy Adams',
  'William Lacey Amy',
  'Rudolfo Anaya',
  'Todhunter Ballard',
  'S. Omar Barker',
  'Rex Beach',
  'James Warner Bellah',
  'Don Bendell',
  'Tom W. Blackburn',
  'James Carlos Blake',
  'William Blinn',
  'Stephen Bly',
  'Frank Bonham',
  'Allan R. Bosworth',
  'Peter Bowen',
  'B.M. Bower',
  'Leigh Brackett',
 

In [38]:
# for summaries
writer_genre_list = ["Western_fiction_authors"]
all_writers = get_all_writers(writer_genre_list)
writer_names_2, writer_docs_summaries = get_all_writer_docs_summary(all_writers)
writer_names_2, writer_docs_summaries

(['Film',
  'Television',
  'Literature',
  'Visual arts',
  'Dime novels',
  'Comics',
  'Wild West shows',
  'Acid Western',
  'Australian Western',
  'Contemporary Western',
  'Dacoit Western',
  'Epic Western',
  'Fantasy Western',
  'Florida Western',
  'Gothic Western',
  'Horror Western',
  'Northern',
  'Ostern',
  'Revisionist Western',
  'Science fiction Western',
  'Singing cowboy',
  'Space Western',
  'Spaghetti Western',
  'Weird Western',
  'Western romance',
  'Zapata Western',
  'Golden Boot Awards',
  'Old West',
  'Cowboy culture',
  'Cowboy',
  'Gunfighter',
  'Outlaw',
  'Quick draw',
  'Saloon',
  'Manifest destiny',
  'Edward Abbey',
  'Andy Adams',
  'William Lacey Amy',
  'Rudolfo Anaya',
  'Todhunter Ballard',
  'S. Omar Barker',
  'Rex Beach',
  'James Warner Bellah',
  'Don Bendell',
  'Tom W. Blackburn',
  'James Carlos Blake',
  'William Blinn',
  'Stephen Bly',
  'Frank Bonham',
  'Allan R. Bosworth',
  'Peter Bowen',
  'B.M. Bower',
  'Leigh Brackett',
 

<gensim.similarities.docsim.MatrixSimilarity at 0x791215495d50>

array([0.9723313 , 0.9957996 , 0.9360981 , ..., 0.64177924, 0.91535586,
       0.7410233 ], dtype=float32)

<h4>find the least similar writers with at least 0.6 similarity for each new writer from our reference data set (both for whole and summarized texts)</h4>
<li>Write code to print table_data after the for loop ends

In [44]:
import pandas as pd
warnings.filterwarnings("ignore")
table_data = []


for i, doc in enumerate(writer_docs):

    writer_full_text = ' '.join(doc)
    writer_bow = dictionary.doc2bow(writer_full_text.lower().split())
    writer_lsi = lsi_text[writer_bow]
    index = similarities.MatrixSimilarity(lsi_text[corpus])
    sims = index[writer_lsi]
    filtered_scores = [(ref_idx, score) for ref_idx, score in enumerate(sims) if score >= 0.6]
    sorted_scores = sorted(filtered_scores, key=lambda x: x[1], reverse=True)
    least_similar_writer_index = sorted_scores[-1][0]
    least_similar_writer_name = reference_names[least_similar_writer_index]
    table_data.append([writer_names[i], least_similar_writer_name])


columns = ["Writer from New List", "Least Similar Writer (Full Text)"]
similarity_df = pd.DataFrame(table_data, columns=columns)


print(similarity_df)
# Write code to print table_data after the for loop ends




           Writer from New List Least Similar Writer (Full Text)
0                          Film                            Stan!
1                    Television                           Balrog
2                    Literature                  Tricia Sullivan
3                   Visual arts                          Authors
4                   Dime novels                       Television
..                          ...                              ...
220             S. Craig Zahler                         Mythpunk
221             Western fiction                  Tricia Sullivan
222               Western movie                            Stan!
223  Western Writers of America               Character creation
224           Western lifestyle                        Cam Banks

[225 rows x 2 columns]


In [49]:


warnings.filterwarnings("ignore")
table_data_summary = []

for i, doc in enumerate(writer_docs_summaries):
    writer_full_text = ' '.join(doc)
    writer_bow = dictionary_summary.doc2bow(writer_full_text.lower().split())
    writer_lsi = lsi_summary[writer_bow]
    index = similarities.MatrixSimilarity(lsi_summary[corpus_summary])
    sims = index[writer_lsi]
    filtered_scores = [(ref_idx, score) for ref_idx, score in enumerate(sims) if score >= 0.6]
    sorted_scores = sorted(filtered_scores, key=lambda x: x[1], reverse=True)
    least_similar_writer_index = sorted_scores[-1][0]
    least_similar_writer_name = reference_names_2[least_similar_writer_index]
    table_data_summary.append([writer_names[i], least_similar_writer_name])



columns = ["Writer from New List", "Least Similar Writer (Full Text)"]
similarity_df_sum = pd.DataFrame(table_data_summary, columns=columns)


print(similarity_df_sum)




           Writer from New List Least Similar Writer (Full Text)
0                          Film       Margaret Wetherby Williams
1                    Television                Katherine Roberts
2                    Literature                    Lauren Willig
3                   Visual arts              Lovecraftian horror
4                   Dime novels                    Margaret Coel
..                          ...                              ...
220             S. Craig Zahler                     Rebecca York
221             Western fiction                    Lauren Willig
222               Western movie       Margaret Wetherby Williams
223  Western Writers of America                           Isekai
224           Western lifestyle               Mignon G. Eberhart

[225 rows x 2 columns]


In [None]:
# print table for the texts

In [None]:
# print table for the summaries

# Some simple sentiment analysis

In this part we are gonna run some simple sentiment analysis to find out which writer has the most positive description.

Define a function simple_sentiment_analysis(writer_names, writer_docs) that takes as inputs the list of writers and their corresponding descriptions.
The expected output is a list, each element of this list should be a list with the writer name, the percentage of positive words in their description and the percentage of negative words in their description.

In [None]:
# Example output
"""
[('William Blinn', 0.81, 0.54),
 ('Stephen Bly', 0.75, 0.94),
 ('Frank Bonham', 3.73, 0.62)
 ...]
"""

To ensure results can be compared please use the following function to define your list of positive and negative words:

In [None]:
def get_pos_neg_words():
    def get_words(url):
        import requests

        words = requests.get(url).content.decode("latin-1")
        word_list = words.split("\n")
        index = 0
        while index < len(word_list):
            word = word_list[index]
            if ";" in word or not word:
                word_list.pop(index)
            else:
                index += 1
        return word_list

    # Get lists of positive and negative words
    p_url = "http://ptrckprry.com/course/ssd/data/positive-words.txt"
    n_url = "http://ptrckprry.com/course/ssd/data/negative-words.txt"
    positive_words = get_words(p_url)
    negative_words = get_words(n_url)
    return positive_words, negative_words

In [None]:
def simple_sentiment_analysis(writer_names, writer_docs):

    positive_words, negative_words = get_pos_neg_words()
    from nltk import word_tokenize


    results = []


    for name, doc in zip(writer_names, writer_docs):
        positive_count = 0
        negative_count = 0


        words = word_tokenize(doc.lower())
        for word in words:
            if word in positive_words:
                positive_count += 1
            elif word in negative_words:
                negative_count += 1

        total_words = len(words)
        positive_percentage = round(100*(positive_count / total_words) , 2) if total_words > 0 else 0
        negative_percentage = round(100*(negative_count / total_words) , 2) if total_words > 0 else 0


        results.append((name, positive_percentage, negative_percentage))

    return results


In [None]:
simple_sentiment_analysis(writer_names, writer_docs)

[('Film', 1.74, 1.64),
 ('Television', 1.21, 1.77),
 ('Literature', 2.61, 1.25),
 ('Visual arts', 2.44, 2.19),
 ('Dime novels', 2.66, 2.85),
 ('Comics', 2.09, 1.62),
 ('Wild West shows', 1.5, 2.48),
 ('Acid Western', 2.08, 2.08),
 ('Australian Western', 1.07, 0.74),
 ('Contemporary Western', 2.11, 2.3),
 ('Dacoit Western', 1.73, 1.84),
 ('Epic Western', 1.72, 2.45),
 ('Fantasy Western', 1.72, 2.45),
 ('Florida Western', 1.37, 1.91),
 ('Gothic Western', 1.13, 3.73),
 ('Horror Western', 1.95, 3.57),
 ('Northern', 1.77, 2.44),
 ('Ostern', 2.64, 1.2),
 ('Revisionist Western', 2.12, 2.51),
 ('Science fiction Western', 1.31, 3.54),
 ('Singing cowboy', 2.16, 0.89),
 ('Space Western', 1.76, 2.76),
 ('Spaghetti Western', 2.17, 2.61),
 ('Weird Western', 1.65, 3.36),
 ('Western romance', 4.73, 2.26),
 ('Zapata Western', 2.17, 2.61),
 ('Golden Boot Awards', 10.19, 0.93),
 ('Old West', 2.3, 2.06),
 ('Cowboy culture', 1.29, 0.37),
 ('Cowboy', 1.73, 1.33),
 ('Gunfighter', 2.34, 3.0),
 ('Outlaw', 2.28