# Assignment 1

In this assignment, you will explore word vectors.

- Submission: A report in ``pdf``, your completed notebook file in ``ipynb``, and training data in ``txt``
    - The assignment will be evaluated mainly on the report, so please include every detail you want to present in your report, including figures.
    - Report: Free format. You can copy and paste parts of your code for some problems.
      - The report has to be written in English
    - ipynb: Save your notebook (with the output of each cell if possible) as ipynb and submit it
- Evaluation criteria
    - How interesting and original the presented examples are
    - How well you explain why your examples succeed or fail, considering how Word2Vec is trained

## 0. Setup
- Check that the ``gensim`` library is installed
  - If not, you can install it using ``!pip install gensim``
- List the downloadable vectors from ``gensim``


In [1]:
import gensim
import numpy as np
import pprint as pp

In [2]:
import gensim.downloader
list(gensim.downloader.info()['models'].keys())

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']

- Among the model codes above, select one model of your choice from ``glove-wiki-gigaword`` or ``glove-twitter``
    - The number at the end is the dimensionality of the word vectors in each model
        - e.g. ``glove-twitter-200`` was trained on a Twitter dataset and embeds each word into a 200-dim vector
        - e.g. ``glove-wiki-gigaword-300`` was trained on a Wikipedia and Gigaword dataset and embeds each word into a 300-dim vector
- Download the selected model and load it as ``model``

In [3]:
your_model_code = 'glove-twitter-200' # select one of the model codes above
model = gensim.downloader.load(your_model_code) # download and load the model. It can take some time

In [4]:
# test the model output
model['cat']

array([ 1.4557e-01, -4.7214e-01,  4.5594e-02, -1.1133e-01, -4.4561e-01,
        1.6502e-02,  4.6724e-01, -1.8545e-01,  4.1239e-01, -6.7263e-01,
       -4.8698e-01,  7.2586e-01, -2.2125e-01, -2.0023e-01,  1.7790e-01,
        6.7062e-01,  4.1636e-01,  6.5783e-02,  4.8212e-01, -3.5627e-02,
       -4.7048e-01,  7.7485e-02, -2.8296e-01, -4.9671e-01,  3.3700e-01,
        7.1805e-01,  2.2005e-01,  1.2718e-01,  6.7862e-02,  4.0265e-01,
       -1.8210e-02,  7.8379e-01, -5.2571e-01, -3.9359e-01, -5.6827e-01,
       -1.5662e-01, -8.4099e-02, -2.0918e-01, -6.6157e-02,  2.5114e-01,
       -4.0015e-01,  1.5930e-01,  1.7887e-01, -3.2110e-01,  9.9510e-02,
        5.2923e-01,  4.8289e-01,  1.4505e-01,  4.4368e-01,  1.7365e-01,
        3.6350e-01, -5.1496e-01, -1.2889e-01, -1.9713e-01,  1.8096e-01,
       -1.1301e-02,  8.4409e-01,  9.8606e-01,  8.3535e-01,  3.5410e-01,
       -2.3395e-01,  3.5510e-01,  4.1899e-01, -5.4763e-02,  2.2902e-01,
       -1.9593e-01, -5.7777e-01,  2.9728e-01,  3.3972e-01, -3.11

## Problem 1. Simple Mathematics with Word2Vec (10 pts)
- In this problem, you have to complete the given functions ``word_analogy_with_vector`` and ``get_cosine_similarity``
  - To get exactly the same result as ``model.most_similar()``, you have to normalize each vector before doing arithmetic.
  - Use the L2 norm (square root of the sum of squares of every item in the vector)
  - The result will also naturally include the positive query words themselves.
- In your report, **please include your code for these functions**
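
The effect of L2 normalization can be checked on a toy example before touching the real model. The sketch below is dependency-free and uses hand-picked 2-d vectors that are assumptions for illustration only (they are not taken from any pretrained model); the helper names ``l2_normalize`` and ``cosine_similarity`` are likewise hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector so that its L2 norm (Euclidean length) becomes 1."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cosine_similarity(a, b):
    """Dot product divided by the product of the two L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical 2-d vectors, hand-picked for illustration (not from any model).
toy = {
    "man": [3.0, 4.0],
    "king": [6.0, 8.0],    # same direction as "man", but twice as long
    "woman": [4.0, -3.0],  # perpendicular to "man"
}

# Analogy arithmetic y_1 - x_1 + x_2 on raw vectors: length differences leak in.
raw = [y - x1 + x2 for x1, x2, y in zip(toy["man"], toy["king"], toy["woman"])]

# Same arithmetic after L2 normalization: only the directions matter.
unit = {word: l2_normalize(vec) for word, vec in toy.items()}
normalized = [y - x1 + x2 for x1, x2, y in zip(unit["man"], unit["king"], unit["woman"])]

print(raw)         # analogy vector built from raw vectors
print(normalized)  # analogy vector built from unit vectors
print(cosine_similarity(toy["man"], toy["king"]))   # parallel vectors
print(cosine_similarity(toy["man"], toy["woman"]))  # perpendicular vectors
```

Because "king" and "man" point in the same direction here, their unit vectors cancel and the normalized result points along "woman", while the raw result is pulled toward the longer "king" vector. This is one way to see why skipping normalization can give a different ranking from ``model.most_similar()``.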


In [5]:
def word_analogy_with_vector(model, x_1, x_2, y_1):
  '''
  This function takes a gensim Word2Vec model and outputs a vector to find y2 that corresponds to x_1 → x_2 == y_1 → y_2
  e.g. x_1 (man) → x_2 (king) == y_1 (woman) → y_2(?)

  inputs
  model (gensim.models.keyedvectors.KeyedVectors): Word2Vec model in KeyedVectors in gensim library
  x_1, x_2, y_1 (str): Words in the model's vocabulary.

  output (np.ndarray): A vector in np.ndarray, which can be used to find proper y_2 for given (model, x_1, x_2, y_1)
  '''

  # Write your code from here
  # We need a vector close to y_2, which is approximated by
  # y_1 - x_1 + x_2

  # Look up the three word vectors used in the arithmetic
  x_1_model = model[x_1]
  x_2_model = model[x_2]
  y_1_model = model[y_1]

  # Before doing the arithmetic, L2-normalize the vectors for x_1, x_2, y_1
  x_1_model = x_1_model / np.linalg.norm(x_1_model)
  x_2_model = x_2_model / np.linalg.norm(x_2_model)
  y_1_model = y_1_model / np.linalg.norm(y_1_model)

  # Then return the vector produced by the arithmetic
  resVec = y_1_model - x_1_model + x_2_model
  return resVec

# test whether the function works well
result_vector = word_analogy_with_vector(model, 'man', 'king', 'woman')
print(model.most_similar(positive=['woman', 'king'], negative=['man'], topn=10))
print('result vector is ', result_vector[:5])
assert isinstance(result_vector, np.ndarray), "Output of the function has to be np.ndarray"
# model.most_similar(result_vector, topn=13)
model.most_similar(result_vector)

[('queen', 0.6820898056030273), ('prince', 0.5875527262687683), ('princess', 0.5620489120483398), ('royal', 0.5522865056991577), ('mother', 0.5362966656684875), ('elizabeth', 0.5142496228218079), ('lady', 0.5010437369346619), ('lion', 0.4998807907104492), ('women', 0.4985955059528351), ('’s', 0.4935073256492615)]
result vector is  [-0.10950299  0.01094289  0.01424381  0.02372333 -0.04006428]


[('king', 0.739342212677002),
 ('queen', 0.6820898056030273),
 ('woman', 0.626846432685852),
 ('prince', 0.5875527262687683),
 ('princess', 0.5620488524436951),
 ('royal', 0.5522865056991577),
 ('mother', 0.5362966060638428),
 ('elizabeth', 0.5142496228218079),
 ('lady', 0.5010437369346619),
 ('lion', 0.49988076090812683)]

In [6]:
def get_cosine_similarity(model, x, y):
  '''
  This function returns cosine similarity of x,y

  inputs
  model (gensim.models.keyedvectors.KeyedVectors): Word2Vec model in KeyedVectors in gensim library
  x, y (str): Words in the model's vocabulary.

  output
  similarity (float): cosine similarity between x's vector and y's vector
  '''
  # Write your code from here

  # Cosine similarity is the dot product of the two vectors divided by the product of their L2 norms.
  x_vector = model[x]
  y_vector = model[y]

  # Equivalent explicit form of the denominator:
  #   np.sqrt(np.sum(x_vector**2)) * np.sqrt(np.sum(y_vector**2))
  # np.linalg.norm computes the same L2 norm more concisely.
  return np.dot(x_vector, y_vector)/(np.linalg.norm(x_vector)*np.linalg.norm(y_vector))

# test the output with your own choice
word_a = 'good'
word_b = 'bad'

# these two should be identical
similarity = get_cosine_similarity(model, word_a, word_b)
print(similarity)
assert -1 <= similarity <= 1, "Similarity has to be between -1 and 1"

print('gensim library result:', model.similarity(word_a, word_b))

0.7983508
gensim library result: 0.7983508


## Problem 2. Find Most Similar Words
- One of the simplest and most typical use cases of Word2Vec is finding words by similarity.
- You can list the most similar words for a given query word by using ``model.most_similar(your_word)``
    - Usually, every word in a Word2Vec model's vocabulary is lowercase
- **In your report**, present more than **5** interesting examples and explain **why each was interesting to you**
    - Try to explain why those words are regarded as similar in Word2Vec, considering how it was trained
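
One way to reason about "how it was trained" is that words appearing in similar contexts end up with similar vectors. The sketch below illustrates that idea with plain co-occurrence counts instead of a trained embedding; it is a count-based stand-in, not the Word2Vec or GloVe training procedure, and the three-sentence corpus and the function names are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def context_counts(sentences, window=2):
    """For every word, count the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

def count_cosine(c1, c2):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(c1[key] * c2[key] for key in c1 if key in c2)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

# Hypothetical mini-corpus, invented for this example.
toy_corpus = [
    "i watch movies on netflix".split(),
    "i watch movies on hbo".split(),
    "the virus infected my laptop".split(),
]

counts = context_counts(toy_corpus)
sim_streaming = count_cosine(counts["netflix"], counts["hbo"])
sim_unrelated = count_cosine(counts["netflix"], counts["virus"])
print(sim_streaming, sim_unrelated)
```

Here "netflix" and "hbo" never occur together, yet they score as similar because they share the neighbours "movies" and "on". The same reasoning can help explain results where a query returns words that are used in the same way rather than words that mean the same thing.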
   

In [7]:
target_words = ['diana','gaga', 'netflix', 'virus', 'cheonan'] # Enter your word strings here

for target_word in target_words:
  # check the word is in the vocabulary of the model
  assert model.has_index_for(target_word), f"The selected word, {target_word}, is not included in the model's vocabulary"
  pp.pprint(model.most_similar(target_word))

[('laura', 0.6634964942932129),
 ('nadia', 0.6587172150611877),
 ('sara', 0.6520454287528992),
 ('lina', 0.6286117434501648),
 ('sarah', 0.6259469985961914),
 ('maria', 0.6209515333175659),
 ('daniela', 0.6194291710853577),
 ('tania', 0.6157795190811157),
 ('sabrina', 0.6114152073860168),
 ('andrea', 0.6097177267074585)]
[('lady', 0.7784555554389954),
 ('katy', 0.76102215051651),
 ('madonna', 0.7485153079032898),
 ('rihanna', 0.7212374806404114),
 ('miley', 0.7206928133964539),
 ('swift', 0.7128058671951294),
 ('selena', 0.7001330256462097),
 ('taylor', 0.6920899152755737),
 ('perry', 0.6823055148124695),
 ('britney', 0.6755322217941284)]
[('movies', 0.6666310429573059),
 ('hbo', 0.6364792585372925),
 ('episodes', 0.6205011606216431),
 ('movie', 0.5996598601341248),
 ('watching', 0.5983713865280151),
 ('watch', 0.5762320160865784),
 ('channel', 0.5635140538215637),
 ('watched', 0.5465399622917175),
 ('dexter', 0.545410692691803),
 ('thrones', 0.5439996123313904)]
[('malware', 0.6426654

## Problem 3. Word Analogy
- Another interesting thing you can do with Word2Vec is word analogy
- Word analogy is done by adding and subtracting word vectors
- In the cell below, you can run an example like this
    - ``analogy(model, 'man', 'king', 'woman')`` represents the question "man is to king as woman is to what?"
- Try with your own choice.
- **In your report**, present at least **5** interesting examples of your choice
    - You can include failure cases
    - Describe what you expected and why the result was interesting to you
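
When judging an analogy result, it helps to remember that the query words themselves are often the nearest neighbours of the arithmetic result, which is why they are usually filtered out of the ranking. The sketch below shows this on hand-picked 3-d vectors; the vectors, the word list, and the function ``rank_analogy`` are all hypothetical, and the code illustrates the idea rather than reproducing gensim's implementation.

```python
import math

def unit(vec):
    """Return the vector scaled to L2 norm 1."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def rank_analogy(vectors, x1, x2, y1, exclude_inputs=True):
    """Rank words by cosine similarity to unit(x2) - unit(x1) + unit(y1)."""
    target = unit([b - a + c for a, b, c in
                   zip(unit(vectors[x1]), unit(vectors[x2]), unit(vectors[y1]))])
    ranked = []
    for word, vec in vectors.items():
        if exclude_inputs and word in (x1, x2, y1):
            continue  # skip the query words themselves
        ranked.append((word, sum(t * v for t, v in zip(target, unit(vec)))))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# Hypothetical vectors chosen by hand; they do not come from a trained model.
toy_vectors = {
    "man": [0.0, 0.0, 1.0],
    "woman": [0.0, 0.3, 1.0],
    "king": [1.0, 0.0, 0.0],
    "queen": [1.0, 0.3, 0.5],
    "apple": [-1.0, 0.0, 0.0],
}

with_inputs = rank_analogy(toy_vectors, "man", "king", "woman", exclude_inputs=False)
without_inputs = rank_analogy(toy_vectors, "man", "king", "woman")
print([word for word, _ in with_inputs])
print([word for word, _ in without_inputs])
```

With the inputs kept, "king" ranks first because the offset between "man" and "woman" is small compared with the "king" vector itself. A failure case in your report may have the same cause: the answer is simply a near neighbour of one of the query words.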

In [8]:
def analogy(model, x1, x2, y1):
  pp.pprint(model.most_similar([x2, y1], negative=[x1]))
  print()

# Try with your own word choice
# analogy(model, 'man', 'king', 'woman')

my_word_analogy_pairs = [
    ['student', 'school', 'alumni'],
    ['unemployed', 'recession', 'employed'],
    ['music', 'concert', 'dance'],
    ['train', 'station', 'airplane'],
    ['id', 'domestic', 'passport']
]

for x1, x2, y1 in my_word_analogy_pairs:
  analogy(model, x1, x2, y1)

[('sd', 0.5498698353767395),
 ('osis', 0.5004357695579529),
 ('highschool', 0.49871546030044556),
 ('reunian', 0.4949413537979126),
 ('smp', 0.4894418716430664),
 ('tomorrow', 0.48640692234039307),
 ('reuni', 0.48418116569519043),
 ('boys', 0.48372092843055725),
 ('graduation', 0.48320072889328003),
 ('shs', 0.48291248083114624)]

[('downturn', 0.4404403865337372),
 ('triple-dip', 0.4340003728866577),
 ('economic', 0.4171305298805237),
 ('explained', 0.4051300287246704),
 ('inflation', 0.4030877351760864),
 ('gdp', 0.40111812949180603),
 ('economy', 0.40024855732917786),
 ('slump', 0.3988068699836731),
 ('caused', 0.3962199091911316),
 ('slowdown', 0.3859215974807739)]

[('stage', 0.6779358386993408),
 ('tour', 0.5775321125984192),
 ('perform', 0.5644304156303406),
 ('flashmob', 0.5615183711051941),
 ('dancing', 0.5590847730636597),
 ('performing', 0.5416744351387024),
 ('tickets', 0.5249868631362915),
 ('concerts', 0.5237776041030884),
 ('performance', 0.5086872577667236),
 ('rehearsa

## Problem 4. Visualize Word Vectors
- Select a list of words of your interest
    - **At least 30 words**
    - ``word_list`` is a list of strings
    - every element in ``word_list`` has to be included in the model's vocabulary
- Visualize the vectors of words using dimensionality reduction (in this case, PCA)
- In your report, describe how words are located in 2D space
    - How are the words clustered?
    - Do you think the words are properly located based on their semantic meanings?
    - Are there any surprising or unexpected examples?
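
To make clear what the 2D projection does, the sketch below implements PCA by hand: center the data, build the covariance matrix, and find the two directions of largest variance. It uses power iteration with deflation, which is an assumption made to keep the example dependency-free; it is not how ``sklearn.decomposition.PCA`` computes its result, and the signs of the axes may differ from sklearn's. The four 3-d vectors are invented for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mat_vec(matrix, vec):
    return [dot(row, vec) for row in matrix]

def top_eigenvector(matrix, iterations=500):
    """Power iteration: repeatedly multiply a start vector and renormalize it."""
    start = [float(i + 1) for i in range(len(matrix))]  # fixed, asymmetric start
    start_norm = math.sqrt(dot(start, start))
    vec = [v / start_norm for v in start]
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = math.sqrt(dot(nxt, nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    return vec

def pca_2d(rows):
    """Project rows onto their first two principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    pc1 = top_eigenvector(cov)
    eig1 = dot(pc1, mat_vec(cov, pc1))
    # Deflation: remove the first component so the second one becomes dominant.
    deflated = [[cov[i][j] - eig1 * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    pc2 = top_eigenvector(deflated)
    return [[dot(r, pc1), dot(r, pc2)] for r in centered]

# Hypothetical 3-d "word vectors" forming two loose groups.
words = ["sushi", "ramen", "pizza", "lasagna"]
vectors = [[5.0, 0.2, 0.1], [4.8, 0.0, 0.3], [0.1, 5.0, 0.2], [0.3, 4.9, 0.0]]

twodim = pca_2d(vectors)
for word, (x, y) in zip(words, twodim):
    print(f"{word}: x={x:.3f}, y={y:.3f}")
```

In this toy data the first axis separates the two groups and the second axis carries very little variance. When you describe your own plot, keep in mind that two axes may capture only part of the structure of 200-dimensional vectors, so words that look close in 2D are not necessarily close in the original space.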

In [9]:
# Run this cell to define the plotting function
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import plotly.express as px

# def display_pca_scatterplot(model, words=None, sample=0):
#   if len(words) < 30:
#     print("WARNING: For your report, please select more than 30 word samples for the visualization")
#     print(f"Current length of input word list: {len(words)}")
#   word_vectors = np.array([model[w] for w in words])

#   twodim = PCA().fit_transform(word_vectors)[:,:2]

#   # plt.figure(figsize=(12,12))
#   # plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
#   # for word, (x,y) in zip(words, twodim):
#   #     plt.text(x+0.05, y+0.05, word, fontsize=15)
#   fig = px.scatter(twodim, x=0, y=1, text=words)
#   fig.update_traces(textposition='top center')
#   fig.show()

#altered code
def display_pca_scatterplot(model, words=None, sample=0):
  if len(words) < 30:
    print("WARNING: For your report, please select more than 30 word samples for the visualization")
    print(f"Current length of input word list: {len(words)}")

  word_vectors = np.array([model[w] for w in words])
  twodim = PCA().fit_transform(word_vectors)[:,:2]

  # Words are expected in this order: 10 European, 10 American, then the Asian dishes
  groups = ['Europe'] * 10 + ['America'] * 10 + ['Asia'] * (len(words) - 20)

  # Create a DataFrame for visualization with Plotly
  import pandas as pd
  df = pd.DataFrame(twodim, columns=['x', 'y'])
  df['word'] = words
  df['group'] = groups  # assign groups

  # Plot the scatterplot with different colors for each group
  fig = px.scatter(df, x='x', y='y', text='word', color='group')
  fig.update_traces(textposition='top center')
  fig.show()

In [10]:
# Select word list of your own interests
word_list = [
            "baguette", "pizza", "pierogi", "croissant", "paella",
             "borscht", "gyro", "lasagna", "bratwurst", "fondue",
             "hamburger", "hotdog", "taco", "burrito", "barbecue",
             "chili", "macaroni", "pancake", "cornbread", "cheesecake",
             "sushi", "ramen", "curry", "bibimbap", "pho",
             "bao", "satay", "samosa", "dimsum", "kimbap"
]
display_pca_scatterplot(model, word_list)

## Problem 5. Train New Word2Vec
- Word2Vec models can be trained on different corpora (texts)
- Train your own model with your custom selection of text
- In your report, present at least **5** interesting examples that give different results depending on the dataset
    - You can compare word analogy examples, similarities, or visualizations
- You can refer to the [Official Documentation](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec) of the Word2Vec model

In [11]:
# You don't have to change this cell
import string
from gensim.models import Word2Vec

def remove_punctuation(x):
  return x.translate(''.maketrans('', '', string.punctuation))
def make_tokenized_corpus(corpus):
  out= [ [y.lower() for y in remove_punctuation(sentence).split(' ') if y] for sentence in corpus]
  return [x for x in out if x!=[]]
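
It is worth checking what this tokenizer does to lyrics-style text before training on it. The sketch below restates the two helper functions so that it runs on its own (using ``str.maketrans``, which behaves the same as ``''.maketrans``) and applies them to a few sample lines that are made up for illustration.

```python
import string

def remove_punctuation(x):
    # Delete every ASCII punctuation character, including "-" and "'".
    return x.translate(str.maketrans('', '', string.punctuation))

def make_tokenized_corpus(corpus):
    # Lowercase, split on single spaces, and drop empty tokens and empty sentences.
    out = [[y.lower() for y in remove_punctuation(sentence).split(' ') if y]
           for sentence in corpus]
    return [x for x in out if x != []]

# Sample lines invented for this example.
sample_lines = [
    "Uh-uh, uh-uh",
    "I don't know about you",
    "   ",
    "But I'm feelin' twenty-two!",
]

tokenized = make_tokenized_corpus(sample_lines)
for sentence in tokenized:
    print(sentence)
```

Hyphenated words and contractions are merged into single tokens ("uhuh", "dont", "twentytwo"), and the whitespace-only line disappears. Since the split is on a single space character, a newline left at the end of a line would stay attached to the last token, so newlines should be removed or replaced before calling the function.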

In [12]:
#multiple file -> use google drive, then browse to the correct directory

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## First Taylor Model

In [106]:
#text importer with modifications - different txt file for different names

#Modification that should be made
#0. use Google Drive - cell above
#1. multiple file structure: Albums/{album_name}/{song.txt} - for loop
#2. first line of the file should be ignored - in append function
#3. numbers, word that start with [ should be eliminated

strings=[]

from glob import glob
from tqdm import tqdm
import re

albums = glob("/content/drive/MyDrive/ColabData/TaylorSwiftAlbums/*")

for album in tqdm(albums):
  songs = glob(str(album)+"/*")
  for song in songs:
    with open(song, 'r') as f:
      song_lyric = f.readlines()[1:] #ignore first line
      if song_lyric:
        # drop section dividers like [chorus], lowercase the rest, and skip the last line
        strings.extend([x.lower() for x in song_lyric[:-1] if x[0] !="["])

        # dividers never appear in the last line, so it is handled separately below

        # the lyrics rarely contain digits, so deleting a number plus the word that
        # follows it, in the last line only, should not remove real lyrics
        # this line deletes the "<number>embed" trailer
        last_line_modified = re.sub(r'\d+\s*\w+', '', song_lyric[-1].lower())
        # this line deletes "embed" when no number precedes it
        last_line_modified = last_line_modified.replace('embed','')
        strings.append(last_line_modified)
    # break #used to check
    # print(strings[-1]) #used to check
  # break
# pp.pprint(strings)

100%|██████████| 46/46 [00:01<00:00, 23.48it/s]
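
The last-line cleanup relies on one regular expression and one string replacement, so it is easy to test in isolation. The sketch below wraps the same two steps in a hypothetical helper, ``clean_last_line``, and runs it on invented inputs; the exact shape of the trailer in the real files is an assumption based on the comments in the cell above.

```python
import re

def clean_last_line(line):
    """Apply the same two cleanup steps the importer uses on a song's last line."""
    lowered = line.lower()
    # Remove a run of digits, optional whitespace, and the word that follows.
    without_counter = re.sub(r'\d+\s*\w+', '', lowered)
    # Remove a bare "embed" that had no number in front of it.
    return without_counter.replace('embed', '')

# Invented inputs for illustration.
print(repr(clean_last_line("You belong with me42Embed")))
print(repr(clean_last_line("You belong with meEmbed")))
print(repr(clean_last_line("I'm feeling 22 tonight")))
```

The third call shows the side effect mentioned in the comments: a genuine number in the lyrics is deleted together with the word after it. Also note that ``replace('embed', '')`` removes the substring anywhere in the line, so a word such as "embedded" would be shortened as well.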


In [107]:
print(len(strings))

35786


In [108]:
'''
This line is for the case when the text file is not properly formatted.
It was used to ignore linebreaks and join the sentences into one string, since the text example included linebreak following printed book lines.

strings = "".join(strings).replace('\n', ' ').replace('Mr.', 'mr').replace('Mrs.', 'mrs').split('. ')
'''
# The corpus has to be a list of lists of strings, where each inner list is a sentence
# strings = "".join(strings).replace('\n', ' ').replace('Mr.', 'mr').replace('Mrs.', 'mrs').replace('"','').replace("'","").split('. ')
# strings_processed = "".join(strings).replace('\n', '*').replace('Mr.', 'mr').replace('Mrs.', 'mrs').split('*')

strings_processed = "".join(strings).replace("\n", '_').replace(',', '').split('_')

# for i in strings_processed:
  # print("line")
  # print(i)
print("Checking the first 5 sentences in the text file")
for i in range(5):
  print(f"Sentence {i+1}: {strings_processed[i]}")
corpus = make_tokenized_corpus(strings_processed)

Checking the first 5 sentences in the text file
Sentence 1: it feels like a perfect night
Sentence 2: to dress up like hipsters
Sentence 3: and make fun of our exes
Sentence 4: uh-uh uh-uh
Sentence 5: it feels like a perfect night


In [109]:
#unique word counter
unique_word = set()
for x in corpus:
  for y in x:
    unique_word.add(y)
print(len(unique_word)) #num of unique words

11629


- gensim Word2Vec arguments
  - ``sentences``: list of lists of strings
  - ``vector_size``: dimension of word vector
  - ``window``: maximum distance between the current and predicted word within a sentence
  - ``min_count``: ignore all words with total frequency lower than this
  - ``sg``: training algorithm: 1 for skip-gram; otherwise CBOW
  - ``negative``: if > 0, negative sampling will be used; the integer specifies how many "noise words" should be drawn (usually between 5-20)
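
The roles of ``window`` and ``min_count`` become clearer when the (center word, context word) pairs used by skip-gram are listed explicitly. The sketch below is a simplified illustration with a hypothetical function ``skipgram_pairs``; it leaves out refinements that real implementations may apply, and the two short sentences are invented.

```python
from collections import Counter

def skipgram_pairs(sentences, window=2, min_count=1):
    """List (center, context) pairs after dropping words rarer than min_count."""
    counts = Counter(tok for sent in sentences for tok in sent)
    pairs = []
    for sent in sentences:
        # Rare words are removed first, so the window spans the remaining words.
        kept = [tok for tok in sent if counts[tok] >= min_count]
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs

# Invented sentences for illustration.
toy_sentences = [
    ["it", "feels", "like", "a", "perfect", "night"],
    ["a", "perfect", "storm"],
]

pairs_all = skipgram_pairs(toy_sentences, window=1, min_count=1)
pairs_filtered = skipgram_pairs(toy_sentences, window=1, min_count=2)
print(len(pairs_all))
print(pairs_filtered)
```

With ``min_count=2`` only "a" and "perfect" survive, so every remaining training pair links those two words. On a small corpus such as song lyrics, a high ``min_count`` can shrink the vocabulary a lot, which is worth considering when you interpret the nearest neighbours.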

In [110]:
taylor_model = Word2Vec(sentences=corpus, vector_size=80, window=3, min_count=6, sg=1)
taylor_model = taylor_model.wv # To match the previous code, we use wv (KeyedVectors) of the Word2Vec class
# Try the functions above with the newly trained model

In [112]:
target_words = ['woman', 'love', 'friend']
for target_word in target_words:
  print(target_word)
  pp.pprint(taylor_model.most_similar(target_word))
  pp.pprint(model.most_similar(target_word))

woman
[('likes', 0.8789421319961548),
 ('fellow', 0.8760899305343628),
 ('safe', 0.8748647570610046),
 ('marrying', 0.8726508617401123),
 ('sinks', 0.8682652115821838),
 ('masterpiece', 0.8681350350379944),
 ('second', 0.8645243048667908),
 ('ere', 0.8632925152778625),
 ('realizing', 0.8592833876609802),
 ('means', 0.8580912351608276)]
[('girl', 0.7817050218582153),
 ('women', 0.7705847024917603),
 ('guy', 0.7154314517974854),
 ('she', 0.710436224937439),
 ('person', 0.703464686870575),
 ('wife', 0.7029582262039185),
 ('female', 0.7000528573989868),
 ('mother', 0.6994999647140503),
 ('lady', 0.6945760846138),
 ('who', 0.6705518960952759)]
love
[('spiral', 0.7282082438468933),
 ('uhoh', 0.6988125443458557),
 ('fallin', 0.6968759298324585),
 ('affair', 0.6765228509902954),
 ('tragic', 0.6732059717178345),
 ('nobodys', 0.6559011936187744),
 ('desperately', 0.6525399684906006),
 ('insane', 0.6440119743347168),
 ('worship', 0.643915057182312),
 ('true', 0.6437399387359619)]
[('you', 0.84608

In [91]:
analogy(taylor_model, 'man', 'he', 'woman')
analogy(model, 'man', 'he', 'woman')

[('chasing', 0.7183302640914917),
 ('fame', 0.708710253238678),
 ('hasnt', 0.6730809211730957),
 ('midnight', 0.6669391393661499),
 ('teatime', 0.6615288853645325),
 ('changed', 0.6504514217376709),
 ('agrees', 0.6459696888923645),
 ('stayed', 0.6376455426216125),
 ('trimalchio', 0.6363664269447327),
 ('calls', 0.6356037855148315)]

[('she', 0.7365854382514954),
 ('has', 0.6785745620727539),
 ('thinks', 0.6429623365402222),
 ('knows', 0.6414571404457092),
 ('says', 0.6331944465637207),
 ('told', 0.6292889714241028),
 ('tells', 0.6237863898277283),
 ('does', 0.623668909072876),
 ('said', 0.6230323314666748),
 ('once', 0.6061710119247437)]



In [92]:
analogy(taylor_model, 'good', 'bad', 'beautiful')
analogy(model, 'good', 'bad', 'beautiful')

[('tragic', 0.6374070644378662),
 ('rush', 0.6065912842750549),
 ('fuckin', 0.5918627381324768),
 ('perfect', 0.5720840692520142),
 ('peaks', 0.5548309087753296),
 ('pretty', 0.55328369140625),
 ('million', 0.5497406721115112),
 ('weird', 0.549546480178833),
 ('whos', 0.5354764461517334),
 ('motion', 0.5340595841407776)]

[('girl', 0.7060506939888),
 ('gorgeous', 0.6973251700401306),
 ('pretty', 0.6896135807037354),
 ('ugly', 0.6653832793235779),
 ('perfect', 0.6541532874107361),
 ('crazy', 0.6417362093925476),
 ('such', 0.6366500854492188),
 ('look', 0.6262451410293579),
 ('cute', 0.6249862909317017),
 ('stunning', 0.6183186173439026)]



In [93]:
#altered code for groups of 5
def display_pca_scatterplot_groups_of_five(model, words=None, sample=0):
  if len(words) < 30:
    print("WARNING: For your report, please select more than 30 word samples for the visualization")
    print(f"Current length of input word list: {len(words)}")

  word_vectors = np.array([model[w] for w in words])
  twodim = PCA().fit_transform(word_vectors)[:,:2]

  # Create a DataFrame for visualization with Plotly
  import pandas as pd
  df = pd.DataFrame(twodim, columns=['x', 'y'])
  df['word'] = words
  df['group'] = ['fearless'] * 5 + ['speaknow'] * 5 + ['red'] * 5 + ['1989'] * 5+ ['reputation'] * 5+ ['lover'] * 5 + ['midnights'] * 5#assign groups of 5

  # Plot the scatterplot with different colors for each group
  fig = px.scatter(df, x='x', y='y', text='word', color='group')
  fig.update_traces(textposition='top center')
  # fig.update_layout(xaxis_type='log')
  # fig.update_layout(yaxis_type='log')
  fig.show()

In [94]:
word_list = [
  # fearless album
  'breathe', 'fearless', 'stephen', 'romeo', 'belong',
  # speak now album
  'enchanted', 'december', 'john', 'innocent', 'now',
  # red album
  'red', 'well', 'twenty', 'trouble', 'stay',
  # 1989 album
  'welcome', 'space', 'woods', 'james', 'wildest',
  # reputation album
  'delicate', 'crazy', 'look', 'ready', 'gorgeous',
  # lover album
  'thousand', 'miss', 'pageant', 'london', 'archer',
  # midnights album
  'lavender', 'karma', 'maroon', 'haze', 'hero'
]
display_pca_scatterplot_groups_of_five(taylor_model, word_list)

In [95]:
display_pca_scatterplot_groups_of_five(model, word_list)

In [96]:
#altered code for groups of 5_smaller
def display_pca_scatterplot_groups_of_five_smaller(model, words=None, sample=0):
  if len(words) < 30:
    print("WARNING: For your report, please select more than 30 word samples for the visualization")
    print(f"Current length of input word list: {len(words)}")

  word_vectors = np.array([model[w] for w in words])
  twodim = PCA().fit_transform(word_vectors)[:,:2]

  # Create a DataFrame for visualization with Plotly
  import pandas as pd
  df = pd.DataFrame(twodim, columns=['x', 'y'])
  df['word'] = words
  df['group'] = ['lover'] * 5 + ['reputation'] * 5 + ['midnights'] * 5 #assign groups of 5

  # Plot the scatterplot with different colors for each group
  fig = px.scatter(df, x='x', y='y', text='word', color='group')
  fig.update_traces(textposition='top center')
  # fig.update_layout(xaxis_type='log')
  # fig.update_layout(yaxis_type='log')
  fig.show()

In [97]:
word_list = [
  #lover era(5)
  'lover', 'cornelia', 'cruel', 'london', 'death',
  #reputation era
  'delicate', 'end', 'blame', 'getaway', 'tied',
  #midnights era
  'karma', 'lavender', 'maroon', 'mastermind', 'midnight'
]
display_pca_scatterplot_groups_of_five_smaller(taylor_model, word_list)

Current length of input word list: 15


In [98]:
word_list = [
  # fearless album
  'breathe', 'fearless', 'stephen', 'romeo', 'belong',
  # speak now album
  'enchanted', 'december', 'john', 'innocent', 'now',
  # red album
  'red', 'well', 'twenty', 'trouble', 'stay',
]
display_pca_scatterplot_groups_of_five_smaller(taylor_model, word_list)

Current length of input word list: 15


## New Taylor Model

In [113]:
#text importer with modifications - different txt file for different names

#Modification that should be made
#0. use Google Drive - cell above
#1. multiple file structure: Albums/{album_name}/{song.txt} - for loop
#2. first line of the file should be ignored - in append function
#3. numbers, word that start with [ should be eliminated

strings=[]

from glob import glob
from tqdm import tqdm
import re

albums = glob("/content/drive/MyDrive/ColabData/TaylorSwiftAlbums/*")

for album in tqdm(albums):
  songs = glob(str(album)+"/*")
  for song in songs:
    with open(song, 'r') as f:
      song_lyric = f.readlines()[1:] #ignore first line
      if song_lyric:
        this_song_lyric = ""
        # process every line except the last one
        for line in song_lyric[:-1]:
          # lowercase the line and skip section dividers like [chorus]
          line = line.lower()
          if "[" not in line:
            line = line.replace('\n',' ')
            this_song_lyric+=" "+line

        # last line
        # the lyrics rarely contain digits, so deleting a number plus the word that
        # follows it, in the last line only, should not remove real lyrics
        # this line deletes the "<number>embed" trailer
        last_line_modified = re.sub(r'\d+\s*\w+', '', song_lyric[-1].lower())
        # this line deletes "embed" when no number precedes it
        last_line_modified = last_line_modified.replace('embed','')
        this_song_lyric+=" "+(last_line_modified)

        #append lyric of the song as a long sentence
        strings.append(this_song_lyric)

# pp.pprint(strings[0])

100%|██████████| 46/46 [00:02<00:00, 16.69it/s]


In [114]:
print(len(strings))
print(strings[-1])

544
 mmm-mm, mm-mm  mmm-mm, mm-mm  mmm-mm, mm-mm, yeah    hey, stephen, i know looks can be deceiving  but i know i saw a light in you  and as we walked, we were talking  i didn't say half the things i wanted to  of all the girls tossing rocks at your window  i'll be the one waiting there even when it's cold  hey, stephen, boy, you might have me believing  i don't always have to be alone    'cause i can't help it if you look like an angel  can't help it if i wanna kiss you in the rain so  come feel this magic i've been feeling since i met you  can't help it if there's no one else  mmm, i can't help myself    hey, stephen, i've been holding back this feeling  so i got some things to say to you, ha  i've seen it all, so i thought  but i never seen nobody shine the way you do  the way you walk, way you talk, way you say my name  it's beautiful, wonderful, don't you ever change  hey, stephen, why are people always leaving?  i think you and i should stay the same  'cause i can't help it if 

In [115]:
'''
This line is for the case when the text file is not properly formatted.
It was used to ignore linebreaks and join the sentences into one string, since the text example included linebreak following printed book lines.

strings = "".join(strings).replace('\n', ' ').replace('Mr.', 'mr').replace('Mrs.', 'mrs').split('. ')
'''
# The corpus has to be a list of lists of strings, where each inner list is a sentence
# strings = "".join(strings).replace('\n', ' ').replace('Mr.', 'mr').replace('Mrs.', 'mrs').replace('"','').replace("'","").split('. ')
# strings_processed = "".join(strings).replace('\n', '*').replace('Mr.', 'mr').replace('Mrs.', 'mrs').split('*')

# strings_processed = "".join(strings).replace("\n", '_').replace(',', '').split('_')
# print(strings_processed)
# # for i in strings_processed:
#   # print("line")
#   # print(i)
# print("Checking the first 5 sentences in the text file")
# for i in range(5):
#   print(f"Sentence {i+1}: {strings_processed[i]}")
corpus = make_tokenized_corpus(strings)

In [117]:
print(corpus[:100])

['it', 'feels', 'like', 'a', 'perfect', 'night', 'to', 'dress', 'up', 'like', 'hipsters', 'and', 'make', 'fun', 'of', 'our', 'exes', 'uhuh', 'uhuh', 'it', 'feels', 'like', 'a', 'perfect', 'night', 'for', 'breakfast', 'at', 'midnight', 'to', 'fall', 'in', 'love', 'with', 'strangers', 'uhuh', 'uhuh', 'yeah', 'were', 'happy', 'free', 'confused', 'and', 'lonely', 'at', 'the', 'same', 'time', 'its', 'miserable', 'and', 'magical', 'oh', 'yeah', 'tonights', 'the', 'night', 'when', 'we', 'forget', 'about', 'the', 'deadlines', 'its', 'time', 'ohoh', 'i', 'dont', 'know', 'about', 'you', 'but', 'im', 'feelin', 'twentytwo', 'everything', 'will', 'be', 'alright', 'if', 'you', 'keep', 'me', 'next', 'to', 'you', 'you', 'dont', 'know', 'about', 'me', 'but', 'ill', 'bet', 'you', 'want', 'to', 'everything', 'will', 'bе', 'alright', 'if', 'we', 'just', 'keep', 'dancin', 'like', 'werе', 'twentytwo', 'twentytwo', 'it', 'seems', 'like', 'one', 'of', 'those', 'nights', 'this', 'place', 'is', 'too', 'crowded'

- gensim Word2Vec arguments
  - ``sentences``: list of lists of strings
  - ``vector_size``: dimension of word vector
  - ``window``: maximum distance between the current and predicted word within a sentence
  - ``min_count``: ignore all words with total frequency lower than this
  - ``sg``: training algorithm: 1 for skip-gram; otherwise CBOW
  - ``negative``: if > 0, negative sampling will be used; the integer specifies how many "noise words" should be drawn (usually between 5-20)
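
Because ``window`` only looks at words within the same sentence, the way the corpus is split into sentences changes which word pairs the model sees. The sketch below compares treating each lyric line as a sentence with treating a whole song as one long sentence; the helper ``context_pairs`` is hypothetical, and the two lines are the first two sentences printed earlier in this notebook.

```python
def context_pairs(sentences, window):
    """Collect the set of (center, context) pairs that fall inside the window."""
    pairs = set()
    for sent in sentences:
        for i, center in enumerate(sent):
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.add((center, sent[j]))
    return pairs

lines = [
    ["it", "feels", "like", "a", "perfect", "night"],
    ["to", "dress", "up", "like", "hipsters"],
]

per_line = context_pairs(lines, window=3)                 # one sentence per lyric line
per_song = context_pairs([lines[0] + lines[1]], window=3)  # one sentence per song
cross_line = per_song - per_line                           # pairs that cross the line break

print(len(per_line), len(per_song))
print(sorted(cross_line))
```

Joining the lines keeps every within-line pair and adds pairs such as ("night", "to") that span the line break. Whether those extra pairs help or add noise is something the comparison between the two Taylor models can address.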

In [103]:
#unique word counter
unique_word = set()
for x in corpus:
  for y in x:
    unique_word.add(y)
print(len(unique_word)) #num of unique words

11025


In [75]:
taylor_model = Word2Vec(sentences=corpus, vector_size=80, window=3, min_count=6, sg=1, negative=10)
taylor_model = taylor_model.wv # To match the previous code, we use wv (KeyedVectors) of the Word2Vec class
# Try the functions above with the newly trained model

In [76]:
target_words = ['woman', 'love', 'friend']
import pprint
for target_word in target_words:
  print(target_word)
  pp.pprint(taylor_model.most_similar(target_word))

woman
[('trace', 0.8545616269111633),
 ('whom', 0.8512718081474304),
 ('likes', 0.8409059643745422),
 ('wears', 0.8380570411682129),
 ('lovely', 0.8344222903251648),
 ('ascyltos', 0.8294989466667175),
 ('himself', 0.8281412720680237),
 ('countryman', 0.8275983929634094),
 ('whose', 0.8264104127883911),
 ('tis', 0.8249528408050537)]
love
[('fallin', 0.6349379420280457),
 ('affair', 0.6345676779747009),
 ('shining', 0.6283682584762573),
 ('play', 0.6232140064239502),
 ('ruthless', 0.6126230955123901),
 ('worship', 0.6116556525230408),
 ('players', 0.6061928272247314),
 ('tragic', 0.593811571598053),
 ('ours', 0.5881608128547668),
 ('heels', 0.5881523489952087)]
friend
[('share', 0.8309302926063538),
 ('mythical', 0.8218633532524109),
 ('trophy', 0.8200570344924927),
 ('heartbreak', 0.8128861784934998),
 ('shot', 0.8001842498779297),
 ('tho', 0.7999062538146973),
 ('masterpiece', 0.7982810139656067),
 ('bought', 0.7973718047142029),
 ('shall', 0.7965989112854004),
 ('nеver', 0.79616737365

In [77]:
analogy(taylor_model, 'man', 'he', 'woman')
# analogy(model, 'man', 'he', 'woman')

[('thinks', 0.6637744903564453),
 ('says', 0.6459203958511353),
 ('midnight', 0.6441094875335693),
 ('stayed', 0.613298773765564),
 ('calls', 0.6122281551361084),
 ('comfortable', 0.6117876768112183),
 ('sunshine', 0.5995890498161316),
 ('ascyltos', 0.5914212465286255),
 ('knelt', 0.5906340479850769),
 ('exactly', 0.5832395553588867)]



In [78]:
analogy(taylor_model, 'good', 'bad', 'beautiful')
# analogy(model, 'good', 'bad', 'beautiful')

[('peaks', 0.6466321349143982),
 ('tragic', 0.6325075030326843),
 ('rush', 0.6206655502319336),
 ('gold', 0.6052859425544739),
 ('flying', 0.604031503200531),
 ('weird', 0.5992770195007324),
 ('fuckin', 0.5983691811561584),
 ('windermere', 0.588883638381958),
 ('angel', 0.5804174542427063),
 ('sad', 0.5757679343223572)]



In [79]:
#altered code for groups of 5
def display_pca_scatterplot_groups_of_five(model, words=None, sample=0):
  if len(words) < 30:
    print("WARNING: For your report, please select more than 30 word samples for the visualization")
    print(f"Current length of input word list: {len(words)}")

  word_vectors = np.array([model[w] for w in words])
  twodim = PCA().fit_transform(word_vectors)[:,:2]

  # Create a DataFrame for visualization with Plotly
  import pandas as pd
  df = pd.DataFrame(twodim, columns=['x', 'y'])
  df['word'] = words
  df['group'] = ['fearless'] * 5 + ['speaknow'] * 5 + ['red'] * 5 + ['1989'] * 5+ ['reputation'] * 5+ ['lover'] * 5 + ['midnights'] * 5#assign groups of 5

  # Plot the scatterplot with different colors for each group
  fig = px.scatter(df, x='x', y='y', text='word', color='group')
  fig.update_traces(textposition='top center')
  # fig.update_layout(xaxis_type='log')
  # fig.update_layout(yaxis_type='log')
  fig.show()

In [80]:
word_list = [
  # fearless album
  'breathe', 'fearless', 'stephen', 'romeo', 'belong',
  # speak now album
  'enchanted', 'december', 'john', 'innocent', 'now',
  # red album
  'red', 'well', 'twenty', 'trouble', 'stay',
  # 1989 album
  'welcome', 'space', 'woods', 'james', 'wildest',
  # reputation album
  'delicate', 'crazy', 'look', 'ready', 'gorgeous',
  # lover album
  'thousand', 'miss', 'pageant', 'london', 'archer',
  # midnights album
  'lavender', 'karma', 'maroon', 'haze', 'hero'
]
display_pca_scatterplot_groups_of_five(taylor_model, word_list)

In [81]:
# display_pca_scatterplot_groups_of_five(model, word_list)

In [82]:
#altered code for groups of 5_smaller
def display_pca_scatterplot_groups_of_five_smaller(model, words=None, sample=0):
  if len(words) < 30:
    print("WARNING: For your report, please select more than 30 word samples for the visualization")
    print(f"Current length of input word list: {len(words)}")

  word_vectors = np.array([model[w] for w in words])
  twodim = PCA().fit_transform(word_vectors)[:,:2]

  # Create a DataFrame for visualization with Plotly
  import pandas as pd
  df = pd.DataFrame(twodim, columns=['x', 'y'])
  df['word'] = words
  df['group'] = ['lover'] * 5 + ['reputation'] * 5 + ['midnights'] * 5 #assign groups of 5

  # Plot the scatterplot with different colors for each group
  fig = px.scatter(df, x='x', y='y', text='word', color='group')
  fig.update_traces(textposition='top center')
  # fig.update_layout(xaxis_type='log')
  # fig.update_layout(yaxis_type='log')
  fig.show()

In [118]:
word_list = [
  #lover era(5)
  'lover', 'cornelia', 'cruel', 'london', 'death',
  #reputation era
  'delicate', 'end', 'blame', 'getaway', 'tied',
  #midnights era
  'karma', 'lavender', 'maroon', 'mastermind', 'midnight'
]
display_pca_scatterplot_groups_of_five_smaller(taylor_model, word_list)

Current length of input word list: 15


In [84]:
word_list = [
  # fearless album
  'breathe', 'fearless', 'stephen', 'romeo', 'belong',
  # speak now album
  'enchanted', 'december', 'john', 'innocent', 'now',
  # red album
  'red', 'well', 'twenty', 'trouble', 'stay',
]
# display_pca_scatterplot_groups_of_five_smaller(taylor_model, word_list)