# CMSC 35440 Machine Learning in Biology and Medicine
## Homework 1: Embedding Research Articles
**Released**: Jan 14, 2025

**Due**: Jan 24, 2025 at 11:59 PM Chicago Time on Gradescope

**In this first homework, you'll generate embeddings for 20 provided research articles and visualize them.**

At a high level, embeddings are vectors computed by some algorithm or model that encode information from data. Embeddings can be computed in many different ways, from concatenating manually created features to using deep neural networks.

For this homework, you will encode text documents as vectors using the bag-of-words algorithm and normalize these vectors using the term frequency-inverse document frequency (TF-IDF) method. This method dates back over 50 years to 1972. Through this homework, we hope to convince you that it's still very much relevant.

Please read through the instructions below carefully. Also, while not required for the homework, the articles themselves are worth a read. They are seminal papers from various domains of biomedical AI/ML.

The starter notebook for this homework can be downloaded from GitHub:

https://github.com/StevenSong/CMSC-35440-Source/blob/main/hw1/CMSC_35440_HW1_Student_Version.ipynb

## Instructions


1. Download and open the starter notebook in your favorite Jupyter Notebook host. We recommend using [Google Colab](https://colab.research.google.com/).
  * **NB:** We'll design all homeworks so that they can be run on the *free* tier of Colab. You're welcome to use any other host, but the benefit of Colab is that it offers free GPU instances.
  * Technically, you don't need GPUs for any of the modeling, but they can speed it up considerably. For homeworks where GPU acceleration is recommended, we'll provide additional instructions on how to access GPU instances on Colab.
  * For this homework, we don't require the use of any GPUs.
1. Download and unzip the research articles. We've provided them as a tarball that can be downloaded from [https://github.com/StevenSong/CMSC-35440-Source/releases/download/hw1/hw1.tar.gz](https://github.com/StevenSong/CMSC-35440-Source/releases/download/hw1/hw1.tar.gz).
  * You'll notice that there's a CSV of article metadata and a folder of article *PDFs*. While these articles are available elsewhere on the internet as extracted text (come to office hours if you're interested in using such a resource for other projects), real-world data is messy. One way that data can be messy is that it only exists as PDFs - so **you must use the article PDFs for this assignment**.
1. Extract the text from the articles. You should probably use some variables from the metadata at this step.
1. Compute the term-document matrix and then normalize the term-document matrix using the TF-IDF method. **You must implement TF-IDF yourself. You may not use any existing implementations for computing the TF-IDF matrix** (e.g. you can NOT use sklearn's function for TF-IDF).
  * Defining what is a "term" is up to you but don't overcomplicate it. Splitting on whitespace characters works fine.
  * The Wikipedia article should be all you need to understand the formula: [https://en.wikipedia.org/wiki/Tf-idf](https://en.wikipedia.org/wiki/Tf-idf).
1. Normalize your per-document embeddings. Normalization is an important step to make embeddings comparable across the data.
1. Visualize your embeddings. Embeddings are typically used in some downstream application, but visualization at this stage can be a nice sanity check before proceeding with further usage - have your embeddings actually captured information that reflects the underlying data?
  * Your embeddings are probably high-dimensional vectors. Humans have a hard time visualizing things beyond 3 dimensions and honestly we can get away with 2 dimensions in most cases.
  * There are many methods to do unsupervised dimensionality reduction. Some of the classical methods include principal component analysis (PCA), uniform manifold approximation and projection (UMAP), and t-distributed stochastic neighbor embedding (t-SNE). These are all fine methods for this homework as they are provided by existing packages.
  * However, beware of the pitfalls of methods such as UMAP and t-SNE, whose output is highly sensitive to the hyperparameters used with the underlying data. This is a nice post detailing these pitfalls; check out the mammoth figure: [https://pair-code.github.io/understanding-umap/](https://pair-code.github.io/understanding-umap/).
  * In your visualization plot, it may be helpful to incorporate aspects of the metadata. We'll leave that open ended; visualize in a way that you think will help your discussion.
1. After you're happy with your work, analyze your results, write up what you've done, and submit the homework to Gradescope.
  * Your submission should include 2 things:
    1. Your writeup containing a figure with your embedding visualization.
    1. Your notebook with your code for computing TF-IDF and generating the figure.
  * Your writeup should be 0.5 to 1 page long. This length should be *before* including your figure. The text should be size 12pt, single spaced, with 1 inch margins, and on letter size paper. Please submit either a PDF or Word document.
  * Some guiding questions: Have your embeddings actually captured underlying information about the articles? How can you tell? Why are some articles embedded closer to each other while others are not?
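
For reference, steps 4 and 5 can be written out as equations. This is one common variant, using the notation of the linked Wikipedia article; other TF-IDF weightings exist, and the choice of variant and of L2 normalization for step 5 is an assumption here, not a requirement of the assignment.

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{n_t}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $n_t$ is the number of documents that contain $t$. With $N = 20$ articles and the natural logarithm, a term that occurs in all 20 articles gets $\mathrm{idf} = \log(20/20) = 0$, while a term that occurs in only 2 articles gets $\mathrm{idf} = \log(10) \approx 2.30$.

If $v_d$ is the vector of TF-IDF weights for document $d$, one possible per-document normalization is

$$
\hat{v}_d = \frac{v_d}{\lVert v_d \rVert_2}
$$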

**Tips and Tricks:**
1. In general, you're welcome to use any tools you need for this homework. The only exceptions have been noted in the instructions.
1. Reading CSVs can be done with `pandas`. We'll use `pandas` plenty more in the future so be sure to familiarize yourself with it.
1. Extracting text from PDFs is relatively simple these days with [`pypdf`](https://github.com/py-pdf/pypdf).
    * If you've used PyPDF2 in the past, that package has been merged back into the original pypdf project, where development has resumed. So make the switch back! This change was made around the end of 2022. You can see the release notes [here](https://github.com/py-pdf/pypdf/releases/tag/3.1.0).
1. `numpy` will probably be useful in the normalization step.
1. You should probably use `matplotlib` or a derivative (e.g. `seaborn`) for visualization.
1. If you're looking for guidance on any part of the homework or related topics, email Steven (songs1@uchicago.edu) or come to office hours! JCL 205 Wed 11a - 12p. Also open to scheduling 1:1 meetings if this time does not work for you, just email to ask.
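
For the visualization step, here is a minimal sketch of projecting embeddings to two dimensions with PCA, computed via `numpy`'s SVD rather than a packaged implementation (the assignment allows either). The helper name `pca_2d` is hypothetical, and the input is assumed to be a documents-by-terms array.

```python
import numpy as np


def pca_2d(embeddings):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(embeddings, dtype=float)
    # PCA operates on mean-centered features
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T


# Toy input: four points that lie on a line in 3-D space
toy = [[0, 0, 0], [1, 1, 0], [2, 2, 0], [3, 3, 0]]
coords = pca_2d(toy)
```

The two returned columns can be passed to `plt.scatter` and colored by a metadata column such as `topic`. The sign of each component is arbitrary, so a plot may appear mirrored between runs or implementations.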

## Code

In [None]:
!pip install pypdf
!wget https://github.com/StevenSong/CMSC-35440-Source/releases/download/hw1/hw1.tar.gz
!tar -xzf hw1.tar.gz

Collecting pypdf
  Downloading pypdf-5.1.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.1.0-py3-none-any.whl (297 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 298.0/298.0 kB 4.1 MB/s eta 0:00:00
Installing collected packages: pypdf
Successfully installed pypdf-5.1.0
--2025-01-14 20:08:33--  https://github.com/StevenSong/CMSC-35440-Source/releases/download/hw1/hw1.tar.gz
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/915385537/280fd7f6-f4a2-4024-ba9b-2111f384e9df?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20250114%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250114T200834Z&X-Amz-Expires=300&X-Amz-Signature=a0a5cd08d93d25670a7fe6c658489de8b582b71e28a0aa799693a9d1c0544505&X-Amz-SignedHeaders=h

In [5]:
pip install pypdf

Collecting pypdf
  Downloading pypdf-5.1.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.1.0-py3-none-any.whl (297 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 298.0/298.0 kB 5.0 MB/s eta 0:00:00
Installing collected packages: pypdf
Successfully installed pypdf-5.1.0


In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pypdf
import os
from pypdf import PdfReader
from sklearn.decomposition import PCA
from collections import Counter
import re

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [8]:
print(os.getcwd())

/content


In [None]:
# Read in and inspect the article metadata
meta = pd.read_csv('hw1/article-metadata.csv')
pdf_path = 'hw1/articles/'
print(meta.head())
print(type(meta))

                                               title  year  journal  topic  \
0  Neural networks and physical systems with emer...  1982     PNAS  model   
1  Learning representations by back-propagating e...  1986   Nature  model   
2  ImageNet Classification with Deep Convolutiona...  2012  NeurIPS  model   
3                                      Deep learning  2015   Nature  model   
4       Deep Residual Learning for Image Recognition  2016     CVPR    vis   

      short_name  main_pages  
0    neural-nets           5  
1       backprop           4  
2            cnn           8  
3  deep-learning           7  
4         resnet           8  
<class 'pandas.core.frame.DataFrame'>


In [None]:
# Add a file-name column to the metadata table
meta['file_name'] = meta['short_name'] + '.pdf'
print(meta.head())

                                               title  year  journal  topic  \
0  Neural networks and physical systems with emer...  1982     PNAS  model   
1  Learning representations by back-propagating e...  1986   Nature  model   
2  ImageNet Classification with Deep Convolutiona...  2012  NeurIPS  model   
3                                      Deep learning  2015   Nature  model   
4       Deep Residual Learning for Image Recognition  2016     CVPR    vis   

      short_name  main_pages          file_name  
0    neural-nets           5    neural-nets.pdf  
1       backprop           4       backprop.pdf  
2            cnn           8            cnn.pdf  
3  deep-learning           7  deep-learning.pdf  
4         resnet           8         resnet.pdf  


In [None]:
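def extract_text_from_pdf(file_path):
    """Return the text of every page in the PDF, each page followed by a newline."""
    reader = PdfReader(file_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n"
    return text

# One entry per article, in the same order as the metadata rows
texts = []

for _, row in meta.iterrows():
    pdf_file = os.path.join(pdf_path, row['file_name'])
    file_text = extract_text_from_pdf(pdf_file)
    # Note: the "file_name" key stores the article's short name (no extension)
    texts.append({"file_name": row['short_name'], "text": file_text})

# Sanity check: one extracted text per article
print(len(texts))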


20
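
The instructions suggest using metadata variables during extraction, and the cell above reads every page. Below is a minimal sketch, assuming (the source does not confirm this) that `main_pages` counts the leading pages of main text, so that later pages such as references or appendices can be skipped. `extract_main_text` is a hypothetical helper; it accepts any sequence of objects with an `extract_text()` method, such as `PdfReader(path).pages`.

```python
def extract_main_text(pages, main_pages):
    """Join the text of the first `main_pages` pages, one page per line break.

    Assumption: `main_pages` is the number of leading pages to keep.
    """
    parts = []
    for page in list(pages)[: int(main_pages)]:
        # Fall back to an empty string if a page yields no text
        parts.append(page.extract_text() or "")
    return "\n".join(parts)


# Possible use inside the loop above (names taken from the notebook):
#   reader = PdfReader(pdf_file)
#   file_text = extract_main_text(reader.pages, row['main_pages'])
```

Whether trimming pages helps is an empirical question; comparing the resulting plots with and without it is one way to check.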


In [None]:
print(texts)

[{'file_name': 'neural-nets', 'text': 'Proc. NatL Acad. Sci. USAVol. 79, pp. 2554-2558, April 1982\nBiophysics\nNeural networks and physical systems with emergent collective\ncomputational abilities(associative memory/parallel processing/categorization/content-addressable memory/fail-soft devices)\nJ. J. HOPFIELD\nDivision of Chemistry and Biology, California Institute of Technology, Pasadena, California 91125; and Bell Laboratories, Murray Hill, New Jersey 07974\nContributed by John J. Hopfweld, January 15, 1982\nABSTRACT Computational properties of use to biological or-\nganisms or to the construction of computers can emerge as col-\nlective properties of systems -having a large number of simple\nequivalent components (or neurons). The physical meaning of con-tent-addressable memory is described by an appropriate phase\nspace flow of the state of a system. A model of such a system is\ngiven, based on aspects of neurobiology but readily adapted to in-\ntegrated circuits. The collectiv

In [None]:
# Count how often each word occurs in each PDF
word_count = []
for entry in texts:
  text = entry['text']
  # Strip punctuation, lowercase, then split on whitespace
  clean_text = re.sub(r'[^\w\s]', '', text).lower()
  words = clean_text.split()
  word_counts = Counter(words)
  word_count.append({"file_name": entry['file_name'], "word_counts": word_counts})
print(word_count)

[{'file_name': 'neural-nets', 'word_counts': Counter({'the': 311, 'of': 236, 'a': 160, 'in': 97, 'and': 96, 'to': 80, 'is': 71, 'be': 58, 'n': 53, 'for': 48, 'by': 44, 'state': 41, 'from': 39, 'states': 37, 'memory': 33, 'system': 32, 'are': 31, 'with': 30, 'this': 30, 'stable': 29, 'can': 25, 'memories': 25, '1': 25, 'an': 24, 'on': 24, 'that': 24, 'or': 23, 'as': 23, '0': 23, 'model': 21, 'were': 21, 'was': 21, 'properties': 19, 'neurons': 19, 'algorithm': 19, 'will': 18, 'at': 18, 'collective': 17, 'j': 17, 'but': 17, 'which': 17, 'time': 17, 's': 17, 'such': 16, 'information': 16, 'tij': 16, '2': 16, 't': 16, 'processing': 15, 'i': 15, '100': 15, 'would': 14, 'if': 14, 'random': 14, 'neuron': 14, 'physical': 13, 'it': 13, '10': 13, 'not': 13, '30': 13, 'computational': 12, 'have': 12, 'new': 11, 'number': 11, 'simple': 11, 'space': 11, 'we': 11, 'has': 11, 'stored': 11, 'there': 11, 'b': 11, 'e': 11, 'one': 11, 'large': 10, 'flow': 10, 'synapses': 10, 'each': 10, '7336173158': 10, 
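
The counts above include digit strings such as `7336173158` and single letters such as `n`, and the vocabulary printed further down contains ligature forms such as `ﬂexible`. The sketch below shows one possible stricter tokenizer. `tokenize` is a hypothetical helper, and keeping only runs of two or more ASCII letters is a design choice made for illustration, not something the assignment requires.

```python
import re
import unicodedata
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of two or more ASCII letters."""
    # NFKC folds compatibility characters, e.g. the ligature "ﬂ" becomes "fl"
    text = unicodedata.normalize("NFKC", text).lower()
    # This pattern drops digits, punctuation, single letters, and non-ASCII letters
    return re.findall(r"[a-z]{2,}", text)


sample = "The ﬂexible model; 0.001 and 7336173158 n"
sample_counts = Counter(tokenize(sample))
```

Swapping `tokenize(text)` in for the `re.sub(...).lower().split()` steps would shrink the vocabulary, at the cost of losing numeric tokens and any non-English terms.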

In [None]:
# Collect the sorted vocabulary of unique words across all files
unique_words = set()
for entry in word_count:
  unique_words.update(entry["word_counts"].keys())
unique_words = sorted(unique_words)
#print(unique_words)

# Build the count matrix: one row per paper, one column per word
matrix = []
file_names = []
for entry in word_count:
  row = [entry["word_counts"].get(word, 0) for word in unique_words]
  matrix.append(row)
  file_names.append(entry['file_name'])

df = pd.DataFrame(matrix, columns=unique_words)
df.index = file_names
df.index.name = "Paper Name"



['0', '00', '000', '0000', '00000', '00001', '00005', '0001', '0003', '0004', '0005', '0006', '0007', '0008', '0009', '00091', '001', '0010', '0011', '0013', '0014', '0015', '0016', '0017', '0019', '002', '0020', '0022', '0023', '0024', '0025', '0026', '0027', '0028', '0029', '003', '003045', '0033', '0039', '004', '004042', '0041', '0044', '005', '005041', '005048', '005052', '005082', '0051', '0055', '006', '007', '007058', '00710128', '0075', '0076', '008', '008068', '008075', '0084', '009048', '009063', '0098sefﬁcientnetb7', '01', '010', '010001', '010051', '010089', '0101', '0102', '0107', '011', '0110', '011066', '011070', '012', '012030', '012065', '0122', '0125', '0131', '014', '0140', '014089', '015', '0150', '015049', '016', '016024', '016088', '0164', '0167', '017', '017083', '018031', '018071', '0188', '019079', '0193', '01duetorandominitializationofthetaskspecificmodelanddropouttheperformancemayvaryfordifferent', '01indicating', '02', '020', '0200', '020022', '021078', '02

In [None]:
print(df.head(10))

                0  00  000  0000  00000  00001  00005  0001  0003  0004  ...  \
Paper Name                                                               ...   
neural-nets    23   0    0     3      1      0      0     0     0     0  ...   
backprop       12   2    0     0      0      0      0     0     0     0  ...   
cnn             2   0    0     0      0      0      2     0     0     0  ...   
deep-learning   2   0    0     0      0      0      0     0     0     0  ...   
resnet          8   0    0     0      0      2      0     0     0     0  ...   
attention       2   2    0     0      0      0      0     0     0     0  ...   
chexnet         1   0    0     0      0      0      0     1     0     0  ...   
densenet        9   0    1     0      0      0      0     0     0     0  ...   
ecg             2  14    0     0      0      0      0     0     0     0  ...   
gpt-2           0   0    0     0      0      0      0     0     0     0  ...   

               ﬂexible  ﬂip  ﬂipping  ﬂ

In [None]:
# Optionally save the word-count table as a CSV
#df.to_csv('hw1/word_counts.csv', index=True)

In [None]:
def TfIdf():
    raise NotImplementedError("TF-IDF weighting has not been written yet")
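
The `TfIdf` cell above stops at the function header. Below is a minimal sketch of how the weighting and the per-document normalization could be written with the standard library only. It assumes the variant tf = count / document length and idf = ln(N / n_t), followed by L2 normalization; the names `tf_idf` and `l2_normalize` are hypothetical, and this is not the notebook author's implementation.

```python
import math
from collections import Counter


def tf_idf(doc_counts):
    """Return one {term: weight} dict per document.

    doc_counts: list of Counter objects mapping term -> raw count.
    """
    n_docs = len(doc_counts)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter()
    for counts in doc_counts:
        doc_freq.update(counts.keys())

    weighted = []
    for counts in doc_counts:
        total = sum(counts.values())
        row = {}
        for term, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[term])
            row[term] = tf * idf
        weighted.append(row)
    return weighted


def l2_normalize(row):
    """Scale a {term: weight} dict to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in row.values()))
    if norm == 0:
        return dict(row)  # an all-zero vector is returned unchanged
    return {term: w / norm for term, w in row.items()}


# Toy corpus: "the" occurs in both documents, so its weight is zero
toy_counts = [Counter({"gene": 2, "the": 2}), Counter({"image": 1, "the": 3})]
toy_weights = tf_idf(toy_counts)
toy_unit = [l2_normalize(row) for row in toy_weights]
```

With the notebook's variables, the input could be built as `[entry["word_counts"] for entry in word_count]`. Terms that occur in every document receive weight zero under this variant, which is why very common words such as "the" stop dominating the embeddings.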