# AcaRead

## 1. Load the data

In [1]:
# import string
from sklearn.metrics.pairwise import euclidean_distances
# import random
# import gensim.downloader as api

In [14]:
import pandas as pd

# read csv file
df = pd.read_csv("arxiv_data_210930-054931.csv", on_bad_lines = 'skip')
df = df.rename(columns={"summaries":"abstracts", "terms": "categories"})
df.head()

Unnamed: 0,categories,titles,abstracts
0,['cs.LG'],Multi-Level Attention Pooling for Graph Neural...,Graph neural networks (GNNs) have been widely ...
1,"['cs.LG', 'cs.AI']",Decision Forests vs. Deep Networks: Conceptual...,Deep networks and decision forests (such as ra...
2,"['cs.LG', 'cs.CR', 'stat.ML']",Power up! Robust Graph Convolutional Network v...,Graph convolutional networks (GCNs) are powerf...
3,"['cs.LG', 'cs.CR']",Releasing Graph Neural Networks with Different...,With the increasing popularity of Graph Neural...
4,['cs.LG'],Recurrence-Aware Long-Term Cognitive Network f...,Machine learning solutions for pattern classif...


We loaded the data from our CSV file as a pandas dataframe. We renamed some of the columns for clarity.

### 1.2 Create and Process Queries

Create a usable interface that lets the user pick an abstract from our large dataset by its index and enter a sample query.

In [15]:
# extract abstracts
abstracts = df["abstracts"]

In [16]:
# check length of abstracts
len(abstracts)

56181

In [17]:
index = 25221 # @param {type:"slider", min:0, max:56180, step:1}

query = 'algorithm' # @param {type:"string"}

In [18]:
abstract = abstracts[index]
# abstract = abstract.split("\n")
# abstract = " ".join(abstract)
print(abstract)

In this paper, we propose a novel map for dense crowd localization and crowd
counting. Most crowd counting methods utilize convolution neural networks (CNN)
to regress a density map, achieving significant progress recently. However,
these regression-based methods are often unable to provide a precise location
for each person, attributed to two crucial reasons: 1) the density map consists
of a series of blurry Gaussian blobs, 2) severe overlaps exist in the dense
region of the density map. To tackle this issue, we propose a novel Focal
Inverse Distance Transform (FIDT) map for crowd localization and counting.
Compared with the density maps, the FIDT maps accurately describe the people's
location, without overlap between nearby heads in dense regions. We
simultaneously implement crowd localization and counting by regressing the FIDT
map. Extensive experiments demonstrate that the proposed method outperforms
state-of-the-art localization-based methods in crowd localization tasks,
achievin

In [19]:
query = query.lower()
print(query)

algorithm


## 2. Named Entity Recognition

In [None]:
import spacy

# design a function that takes in an abstract as the input and the output as NER tags
def ner_tag(abstract):
  # put unique sentences in a list
  sentences = abstract.split(". ")
  # process sentences and perform NER
  nlp = spacy.load("en_core_web_sm")
  for i in range(len(sentences)):
    print(f'Sentence {i+1}:')
    doc = nlp(sentences[i])
    for ent in doc.ents:
      print(ent.text, ent.label_)
    print()

ner_tag(abstract)

Sentence 1:

Sentence 2:
cnn ORG

Sentence 3:
two CARDINAL
1 CARDINAL
blurry gaussian blobs PERSON
2 CARDINAL

Sentence 4:

Sentence 5:

Sentence 6:

Sentence 7:

Sentence 8:



## 3. Sentiment-Based Search

This step involves parsing through the abstract for sentences that match the sentiments of a given query. We employed two different methods to capture the texts numerically: TfIdf Vectorization and BERT Sentence Transformers. For each of these methods, we calculated the Euclidean distance between the vector of each sentence and the vector of the query, then ranked the sentences by distance.
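
The ranking step can be illustrated independently of how the vectors are produced. The following is a minimal sketch in plain Python, assuming made-up 3-dimensional vectors in place of real sentence vectors; `rank_by_euclidean` is a hypothetical helper, not part of the notebook.

```python
import math

def rank_by_euclidean(query_vec, sentence_vecs):
    """Return (index, distance) pairs sorted from nearest to farthest."""
    distances = [math.dist(query_vec, vec) for vec in sentence_vecs]
    return sorted(enumerate(distances), key=lambda pair: pair[1])

# Toy vectors standing in for real sentence vectors (illustrative only).
toy_query = [1.0, 0.0, 0.0]
toy_sentences = [
    [0.0, 1.0, 0.0],  # far from the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # in between
]

ranking = rank_by_euclidean(toy_query, toy_sentences)
for idx, dist in ranking:
    print(f"sentence {idx}: distance {dist:.3f}")
```

A smaller distance means a closer match, so the first entries of the ranking correspond to the "top" sentences reported in the searches.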

### 3.1 TfIdf Vectorized-Based Search

In order to perform TfIdf Vectorization, we first cleaned the abstracts through lowercasing, removing URLs, tokenization, stopword removal, and lemmatization. Afterwards, we used scikit-learn's TfIdf Vectorizer to vectorize the corpus and vectorize the lowercased query. Finally, we calculated the Euclidean distances.
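
The weighting behind this step can be sketched in plain Python. This is a simplified illustration rather than the notebook's code: it assumes whitespace-tokenized toy sentences (made up here) and uses the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is how scikit-learn documents the default weighting of its TfidfVectorizer. The helper name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Compute L2-normalized TF-IDF vectors using a smoothed IDF term."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors

# Toy, already-cleaned sentences (illustrative only).
toy_docs = [
    "propose novel map crowd counting".split(),
    "density map blurry gaussian blob".split(),
    "fidt map crowd localization".split(),
]
vocab, vectors = tfidf_vectors(toy_docs)

map_col, fidt_col = vocab.index("map"), vocab.index("fidt")
print(f"weight of 'map'  in sentence 3: {vectors[2][map_col]:.3f}")
print(f"weight of 'fidt' in sentence 3: {vectors[2][fidt_col]:.3f}")
```

In this toy corpus "map" occurs in every sentence and "fidt" in only one, so "fidt" receives the larger weight: terms that are rare across sentences count for more than terms that appear everywhere.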

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# load a set of English stopwords from the NLTK library
stop_words = set(stopwords.words('english'))

def clean(abstract):
  # split abstract into sentences
  sentences = abstract.split(".")
  cleaned_abstract = []
  for sent in sentences:
    # lowercase the data
    text = sent.lower()
    # remove URLs
    text = re.sub(r'http\S+', '', text)
    text = text.strip()
    # tokenize the input text into individual tokens (words)
    tokenized_text = word_tokenize(text)
    # filter out any tokens that are in the list of stopwords
    filtered_text = [token for token in tokenized_text if token not in stop_words]
    # join the filtered tokens back into a single string, separated by spaces
    filtered_text = " ".join(filtered_text)
    # lemmatize text
    wnl = WordNetLemmatizer()
    tokens = filtered_text.split()
    lemmas = [wnl.lemmatize(token) for token in tokens]
    lemmatized_text = " ".join(lemmas)
    cleaned_abstract.append(lemmatized_text)
  return cleaned_abstract

# call clean function
cleaned_abstract = clean(abstract)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
# vectorize abstract data to search by a sample query for the top 3 most similar sentences

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import pairwise_distances

# perform tfidf vectorization
# define the N for N-grams
N = 1

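# initialize the TfIdfVectorizer with N-gram range
# note: a callable analyzer bypasses tokenization and ngram_range; because each item of
# cleaned_abstract is a string, the identity analyzer below yields one feature per character
tfidfvectorizer = TfidfVectorizer(ngram_range=(N, N), lowercase=False, binary=True, analyzer=lambda x: x, max_features=1000)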

# fit and transform the corpus
tfidfvectorizer.fit(cleaned_abstract)
n_hot_matrix = tfidfvectorizer.transform(cleaned_abstract)

# print the length of vocabulary (unique words)
vocabulary_dict = tfidfvectorizer.vocabulary_
print(f"Vocabulary length {len(vocabulary_dict)}")

# convert the matrix to an array and print the result for the first 5 instances
print("N-Hot Vectorization:")
for vec in n_hot_matrix.toarray().tolist()[:5]:
  print(vec)

# process and vectorize queries and then calculate Euclidean distance scores
query = [query]
queryVector = tfidfvectorizer.transform(query).toarray()
distances = pairwise_distances(queryVector, n_hot_matrix.toarray(), metric='euclidean')[0]
sorted_indices = np.argsort(distances)
# print top 3 most similar queries
print("\n")
print (f"Printing top 3 most similar text for query {query}")
for q in sorted_indices[:3]:
  print(f'Score: {distances[q]} --- {cleaned_abstract[q]}')

Vocabulary length 34
N-Hot Vectorization:
[0.21454733125344497, 0.0, 0.0, 0.0, 0.2550395756397351, 0.0, 0.0, 0.0, 0.0, 0.0, 0.19725208310305747, 0.0, 0.19725208310305747, 0.19725208310305747, 0.21454733125344497, 0.0, 0.21454733125344497, 0.0, 0.19725208310305747, 0.0, 0.19725208310305747, 0.19725208310305747, 0.21454733125344497, 0.19725208310305747, 0.2336663410727109, 0.2336663410727109, 0.2336663410727109, 0.21454733125344497, 0.2336663410727109, 0.2336663410727109, 0.2550395756397351, 0.0, 0.0, 0.30724315810790637]
[0.16495088599974306, 0.0, 0.33292243287036155, 0.292786640315043, 0.1960826253162601, 0.0, 0.0, 0.0, 0.0, 0.0, 0.15165374317664382, 0.0, 0.15165374317664382, 0.15165374317664382, 0.16495088599974306, 0.17965019542811453, 0.16495088599974306, 0.23621841787157874, 0.15165374317664382, 0.26165490099852584, 0.15165374317664382, 0.15165374317664382, 0.16495088599974306, 0.15165374317664382, 0.17965019542811453, 0.17965019542811453, 0.17965019542811453, 0.16495088599974306, 

For abstract_no 25221, these results accurately indicate that this abstract does not include the query word "algorithm" because all of the distances are greater than 0. Some of the sentences do seem to indicate functions or actions that would be well-represented by an algorithm, such as "map dense crowd localization" or "fidt map," also known as Focal Inverse Distance Transform map. However, the search does seem to miss other potential matches of "algorithm," like "convolutional neural networks." Overall, the search results are somewhat reasonable for the given query.

### 3.2 Sentence Transformer-Based Search

To run the Sentence Transformer-based search, we first cleaned the abstract by lowercasing it, removing URLs, and splitting it into sentences. Afterwards, we used the BERT model to get sentence embeddings for the abstract and the query. Finally, we calculated the Euclidean distances.

In [None]:
!pip install sentence_transformers



In [None]:
# design a function that takes in an abstract as an input and then cleans and vectorizes the data,
#so the output can be searched by a sample query for the top 3 most similar sentences
from sentence_transformers import SentenceTransformer

# Load a pre-trained BERT model
model = SentenceTransformer('bert-base-uncased')

#some cleaning
# lowercase and split selected abstract text into sentences
abstract_lower = abstract.lower()
abstract_lower = abstract_lower.split("\n")
abstract_lower = " ".join(abstract_lower)
# remove URLs
abstract_cleaned = re.sub(r'http\S+', '', abstract_lower)
abstract_sentences = abstract_cleaned.split(".")
abstract_sentences = [i for i in abstract_sentences if i]
abstract_sentences = [i.strip() for i in abstract_sentences]

# generate sentence embeddings
sentence_embeddings = model.encode(abstract_sentences)

# lowercase and embed query
query_embedding = model.encode(query)
query_embedding_flat = query_embedding.reshape(1, -1)  # Reshape the query embedding to have a single sample

# iterate through sentences and generate distances
distances = []
for sent in range(len(abstract_sentences)):
    distance = pairwise_distances(query_embedding_flat, sentence_embeddings[sent].reshape(1, -1), metric='euclidean')[0][0]
    distances.append(distance)

sorted_indices = np.argsort(distances)
# print top 3 most similar queries
print("\n")
print (f"Printing top 3 most similar sentences for query: {query}")
for q in sorted_indices[:3]:
    print(f'Score: {distances[q]} --- {abstract_sentences[q]}')





Printing top 3 most similar sentences for query: ['algorithm']
Score: 9.567865371704102 --- we simultaneously implement crowd localization and counting by regressing the fidt map
Score: 9.578535079956055 --- in this paper, we propose a novel map for dense crowd localization and crowd counting
Score: 9.804736137390137 --- to tackle this issue, we propose a novel focal inverse distance transform (fidt) map for crowd localization and counting


For abstract_no 25221, these results accurately indicate that this abstract does not include the query word "algorithm" because all of the distances are much greater than 0. Similar to the TfIdf vectorization results, some of the sentences do seem to indicate actions that would be well-represented by an algorithm. However, this sentiment-based search also seems to miss other potential matches of "algorithm," like "convolutional neural networks." Overall, the search results are somewhat reasonable for the given query.

Unlike the search results with TfIdf vectorization, the distance scores are much higher with the sentence embeddings. This is likely because BERT sentence embeddings are not always well-suited to single-word inputs. So, although our results were similar across both sentiment-based searches, the distance scores of the BERT sentence embedding-based search may be skewed because we encoded a single-word query with no context.
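
One further factor that may contribute to the gap in scores is vector scale. TfidfVectorizer L2-normalizes its rows by default, so the Euclidean distance between two of its vectors cannot exceed 2, whereas raw transformer embeddings are usually not unit length. The sketch below uses made-up vectors to show how Euclidean distance depends on magnitude; it is an illustration of the general effect, not a measurement of the models used here.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Made-up vectors (illustrative only).
a = [3.0, 4.0, 0.0]
b = [0.0, 5.0, 12.0]

raw_distance = math.dist(a, b)
unit_distance = math.dist(l2_normalize(a), l2_normalize(b))
scaled_distance = math.dist([10 * x for x in a], [10 * x for x in b])

print(f"raw vectors:        {raw_distance:.3f}")
print(f"unit-length copies: {unit_distance:.3f}")
print(f"10x larger copies:  {scaled_distance:.3f}")
```

Because the scores of the two methods live on different scales, they are easier to compare by rank than by absolute value.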

## 4. Summarization

We intend to use the Mistral 7b Large Language Model (LLM), fine-tuned on yahma's Alpaca dataset, to summarize a given abstract.

In [20]:
pip install mistralai

Collecting mistralai
  Downloading mistralai-0.1.8-py3-none-any.whl (15 kB)
Collecting httpx<0.26.0,>=0.25.2 (from mistralai)
  Downloading httpx-0.25.2-py3-none-any.whl (74 kB)
Collecting orjson<4.0.0,>=3.9.10 (from mistralai)
  Downloading orjson-3.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (141 kB)
Collecting httpcore==1.* (from httpx<0.26.0,>=0.25.2->mistralai)
  Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<0.26.0,>=0.25.2->mistralai)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)

In [21]:
%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
# Must install separately since Colab has torch 2.2.1, which breaks packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass

We download the 4-bit quantized version of Mistral 7b to reduce memory usage and prevent Google Colab from crashing.
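
A rough back-of-the-envelope estimate shows why quantization matters on a Colab GPU. The sketch below assumes roughly 7 billion parameters and counts only the weights; activations, optimizer state, and quantization metadata are ignored, so the real footprint is larger. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7e9  # assumption: roughly 7 billion parameters

fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f" 4-bit weights: ~{int4_gb:.1f} GB")
```

Under these assumptions the 16-bit weights alone would nearly fill the 14.748 GB that the loader reports for the Tesla T4, while the 4-bit weights leave room for fine-tuning.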

In [38]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized Mistral 7b we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth: Fast Mistral patching release 2024.4
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unused kwargs: ['quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


We add LoRA adapters to make finetuning more efficient later on.
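
The efficiency gain comes from training two small low-rank factors instead of a full weight matrix. The sketch below counts trainable parameters for a single projection, assuming an illustrative 4096 x 4096 matrix (the shape is an assumption for the example, not read from the model) and the rank r = 16 used in the next cell. `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_out, d_in, rank):
    """Parameters in a full d_out x d_in matrix vs. its LoRA factors.

    LoRA keeps the original matrix frozen and trains B (d_out x rank)
    and A (rank x d_in), whose product is added to the frozen weights.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora

full, lora = lora_param_counts(4096, 4096, rank=16)
print(f"full matrix: {full:,} parameters")
print(f"LoRA (r=16): {lora:,} parameters ({lora / full:.2%} of full)")
```

For this shape the adapter trains well under 1% of the parameters that full finetuning of the same matrix would update.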

In [39]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

We are training Mistral 7b on the Alpaca dataset from yahma, which is a filtered version of the larger original Alpaca dataset. Each example in the Alpaca dataset describes a task through an instruction, an input, and an output.
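
To make the record structure concrete, here is a minimal sketch that formats one example into a single training string. The record is hypothetical (not taken from the dataset), the template is a shortened stand-in for the full prompt, and the `"</s>"` end-of-sequence marker is a placeholder; in the notebook the real value comes from `tokenizer.eos_token`.

```python
# Shortened stand-in for the full Alpaca prompt (illustrative only).
TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

# Hypothetical record; not taken from the dataset.
record = {
    "instruction": "Summarize the input in one sentence.",
    "input": "Graph neural networks have been widely used.",
    "output": "GNNs are widely used.",
}

EOS = "</s>"  # placeholder; the real value comes from the tokenizer
text = TEMPLATE.format(**record) + EOS
print(text)
```

Appending the end-of-sequence marker teaches the model where a response stops; at inference time the output slot is left empty so the model fills it in.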

In [40]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

We need to download several packages to ensure that we can properly use HuggingFace's SFTTrainer.

In [26]:
pip install git+https://github.com/huggingface/trl.git@7630f877f91c556d9e5a3baa4b6e2894d90ff84c

Collecting git+https://github.com/huggingface/trl.git@7630f877f91c556d9e5a3baa4b6e2894d90ff84c
  Cloning https://github.com/huggingface/trl.git (to revision 7630f877f91c556d9e5a3baa4b6e2894d90ff84c) to /tmp/pip-req-build-psxaw14r
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/trl.git /tmp/pip-req-build-psxaw14r
  Running command git rev-parse -q --verify 'sha^7630f877f91c556d9e5a3baa4b6e2894d90ff84c'
  Running command git fetch -q https://github.com/huggingface/trl.git 7630f877f91c556d9e5a3baa4b6e2894d90ff84c
  Running command git checkout -q 7630f877f91c556d9e5a3baa4b6e2894d90ff84c
  Resolved https://github.com/huggingface/trl.git to commit 7630f877f91c556d9e5a3baa4b6e2894d90ff84c
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.4.0->trl==0.7.12.dev0)
  Using cached nv

We also need to ensure that the trl library is installed for finetuning.

In [41]:
pip install trl



In [42]:
#from datasets import load_dataset
from trl import SFTTrainer
from transformers import (AutoTokenizer, AutoConfig,
                              AutoModelForSequenceClassification, TrainingArguments, Trainer)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60, # Set num_train_epochs = 1 for full training runs
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

max_steps is given, it will override any value given in num_train_epochs
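
The training arguments above imply a fairly small run. The sketch below works out the effective batch size and the number of examples seen, and approximates the learning-rate schedule. It assumes a single GPU and that the "linear" scheduler follows the usual shape of linear warmup followed by linear decay to zero; `linear_warmup_decay` is a hypothetical helper, not the trainer's own code.

```python
def linear_warmup_decay(step, base_lr, warmup_steps, total_steps):
    """Learning rate after `step` updates under linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)

per_device_batch, grad_accum, max_steps = 2, 4, 60
effective_batch = per_device_batch * grad_accum  # examples per optimizer step
examples_seen = effective_batch * max_steps

start = linear_warmup_decay(0, 2e-4, 5, 60)
peak = linear_warmup_decay(5, 2e-4, 5, 60)
final = linear_warmup_decay(60, 2e-4, 5, 60)

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen}")
print(f"learning rate at steps 0, 5, 60: {start:.1e}, {peak:.1e}, {final:.1e}")
```

With these settings the model sees 480 examples in total, which is why this counts as a quick demonstration run rather than a full pass over the dataset.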


In [43]:
trainer_stats = trainer.train()

Step,Training Loss
1,1.4151
2,1.8406
3,1.3395
4,1.3249
5,1.1013
6,1.0203
7,0.7856
8,0.9497
9,0.846
10,0.9321


After training the model on a speedier run of 60 training steps (max_steps = 60), we can test it. We will instruct the model to summarize our given abstract.

In [58]:
# extract the abstracts

# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Provide a brief and concise summary of the input. Do not exactly repeat or reuse text from the input.", # instruction
        abstract, # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 200)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Provide a brief and concise summary of the input. Do not exactly repeat or reuse text from the input.

### Input:
In this paper, we propose a novel map for dense crowd localization and crowd
counting. Most crowd counting methods utilize convolution neural networks (CNN)
to regress a density map, achieving significant progress recently. However,
these regression-based methods are often unable to provide a precise location
for each person, attributed to two crucial reasons: 1) the density map consists
of a series of blurry Gaussian blobs, 2) severe overlaps exist in the dense
region of the density map. To tackle this issue, we propose a novel Focal
Inverse Distance Transform (FIDT) map for crowd localization and counting.
Compared with the density maps, the FIDT maps accurately describe the people's
location, w

For this given abstract, this summary incorporates a lot of the original vocabulary. However, it does alter the sentence structure and increase concision. By making the prompt more specific and experimenting with the max_new_tokens count, we were able to find a setup that produced a summary differing from the original input. For instance, too large a max_new_tokens value results in the model's response almost completely copying the input text.

In the future, this summarization could be improved by incorporating a more diverse vocabulary and reusing fewer words from the original document. For our future work, we would like to analyze the quality of the summary more quantitatively, potentially using Latent Semantic Analysis. If we had the resources available, it would be ideal to have reference summaries for our abstracts to better gauge the quality of the model's summaries.
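
As a simple first step toward quantifying how much a summary reuses its source, one could measure the fraction of the summary's word n-grams that also appear in the input. The sketch below is one possible measure under that assumption, not Latent Semantic Analysis and not an established metric from the notebook; the helper names and the paraphrased summary are hypothetical.

```python
import re

def ngrams(text, n):
    """Lowercase the text and return its word n-grams as tuples."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def copied_ngram_fraction(source, summary, n=2):
    """Fraction of the summary's n-grams that also occur in the source."""
    summary_ngrams = ngrams(summary, n)
    if not summary_ngrams:
        return 0.0
    source_set = set(ngrams(source, n))
    copied = sum(1 for gram in summary_ngrams if gram in source_set)
    return copied / len(summary_ngrams)

source = "We propose a novel map for dense crowd localization and crowd counting."
near_copy = "We propose a novel map for dense crowd localization."
paraphrase = "A new representation locates and counts people in packed scenes."

copy_score = copied_ngram_fraction(source, near_copy)
paraphrase_score = copied_ngram_fraction(source, paraphrase)
print(f"near copy:  {copy_score:.2f}")
print(f"paraphrase: {paraphrase_score:.2f}")
```

A score near 1 suggests the summary is largely copied, while a score near 0 suggests new wording. The measure says nothing about whether the summary is faithful, so it would complement rather than replace a comparison against reference summaries.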