<a href="https://colab.research.google.com/github/shanmukh2325/Shanmukha_INFO5731_-Fall2023/blob/main/Bollavaram_Exercise_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The third in-class exercise (due at 11:59 PM on 10/08/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:

An interesting text classification task in healthcare could be the classification of medical research articles into categories such as "Clinical Trials," "Drug Discovery," "Epidemiology," "Genomic Research," and "Disease Diagnosis." This task would assist researchers in quickly identifying relevant articles for their specific research interests.

Here are five types of features that might be useful for building a machine learning model for this task:

Bag of Words (BoW) Features: BoW features represent the frequency of each word in the text. These features can help capture the most common terms associated with each research category.

TF-IDF (Term Frequency-Inverse Document Frequency) Features: TF-IDF weights a word by how often it occurs in a document, discounted by how many documents in the corpus contain it. These features can highlight words that are particularly relevant to a specific research category.

N-grams Features: N-grams represent sequences of adjacent words or characters. Extracting n-grams can capture important phrases or patterns in the text that may be indicative of the research category.

Word Embeddings: Word embeddings, such as Word2Vec or GloVe, can convert words into dense vector representations. These embeddings capture semantic relationships between words and can be used to understand the context of terms in the text.

Topic Modeling Features: Topic modeling techniques like Latent Dirichlet Allocation (LDA) can identify the underlying topics in a document. These features can help categorize research articles based on the topics they cover.



'''
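The word-embedding feature described in the answer above can be turned into a fixed-length document vector by averaging the vectors of the words in a document. The following is a minimal sketch under stated assumptions: the 3-dimensional vectors in `toy_embeddings` are hypothetical values invented for illustration (not real Word2Vec or GloVe output), and `document_vector` is an illustrative helper name.

```python
# Hypothetical 3-dimensional embeddings, invented for illustration only.
# Real Word2Vec or GloVe vectors typically have 50 to 300 dimensions.
toy_embeddings = {
    "clinical": [0.9, 0.1, 0.0],
    "trial":    [0.8, 0.2, 0.1],
    "genomic":  [0.1, 0.9, 0.2],
    "cancer":   [0.2, 0.7, 0.6],
}

def document_vector(tokens, embeddings, dim=3):
    """Average the embeddings of the tokens found in the table.

    Tokens missing from the table are skipped; a document with no
    known tokens maps to the zero vector.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

doc_tokens = "a randomized clinical trial".split()
print(document_vector(doc_tokens, toy_embeddings))
```

Averaging discards word order, so this representation is usually combined with order-aware features such as n-grams.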

Question 2 (10 points): Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [7]:
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Sample healthcare research articles
documents = [
    "A randomized clinical trial on the effectiveness of drug X in treating hypertension.",
    "Genomic research identifies potential biomarkers for cancer diagnosis.",
    "Epidemiological study on the spread of infectious diseases in urban areas.",
    "A review of recent advances in drug discovery for Alzheimer's disease.",
]

# Tokenize and lowercase the text, then rejoin the tokens so that the
# scikit-learn vectorizers receive one preprocessed string per document
nltk.download('punkt')
tokenized_documents = [nltk.word_tokenize(doc.lower()) for doc in documents]
preprocessed_documents = [" ".join(doc) for doc in tokenized_documents]

# Bag of Words (BoW) features
vectorizer_bow = CountVectorizer()
bow_features = vectorizer_bow.fit_transform(preprocessed_documents)

# TF-IDF features
vectorizer_tfidf = TfidfVectorizer()
tfidf_features = vectorizer_tfidf.fit_transform(preprocessed_documents)

# N-grams features (bi-grams and tri-grams)
vectorizer_ngrams = CountVectorizer(ngram_range=(2, 3))
ngrams_features = vectorizer_ngrams.fit_transform(preprocessed_documents)

# Topic modeling features using LDA (random_state fixed so the topics are reproducible)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda_features = lda.fit_transform(bow_features)

# Word embeddings (Word2Vec or GloVe) are not extracted in this cell.
# Doing so requires loading pre-trained vectors and combining them per document, e.g. by averaging.

# Display the extracted features
print("BoW Features:")
print(bow_features.toarray())

print("\nTF-IDF Features:")
print(tfidf_features.toarray())

print("\nN-grams Features:")
print(ngrams_features.toarray())

print("\nTopic Modeling Features (LDA):")
print(lda_features)


BoW Features:
[[0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 0 1 1 1 0]
 [0 0 0 1 1 0 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1 1 0 0 1]
 [1 1 0 0 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0]]

TF-IDF Features:
[[0.         0.         0.         0.         0.         0.33942996
  0.         0.         0.         0.         0.26761048 0.33942996
  0.         0.         0.         0.33942996 0.         0.21665375
  0.         0.21665375 0.26761048 0.         0.33942996 0.
  0.         0.         0.         0.         0.26761048 0.33942996
  0.33942996 0.        ]
 [0.         0.         0.         0.36222393 0.36222393 0.
  0.36222393 0.         0.         0.         0.         0.
  0.         0.2855815  0.36222393 0.         0.36222393 0.
  0.         0.         0.         0.36222393 0.         0.
  0.36222393 0.         0.         0.         0.         0.
  0.         0.        ]
 [0.         0.  

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
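To make the TF-IDF numbers above easier to interpret, the weight of a single term can be recomputed by hand. The sketch below assumes scikit-learn's default `TfidfVectorizer` settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and a regular-expression tokenizer that keeps tokens of two or more word characters. The names `tokenize` and `tfidf_row` are illustrative and not part of the original notebook.

```python
import math
import re
from collections import Counter

documents = [
    "A randomized clinical trial on the effectiveness of drug X in treating hypertension.",
    "Genomic research identifies potential biomarkers for cancer diagnosis.",
    "Epidemiological study on the spread of infectious diseases in urban areas.",
    "A review of recent advances in drug discovery for Alzheimer's disease.",
]

def tokenize(text):
    # Assumed to mirror scikit-learn's default: lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_row(doc_index, docs):
    """Return {term: L2-normalized TF-IDF weight} for one document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    tf = Counter(tokenized[doc_index])
    # Raw term count multiplied by the smoothed IDF
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    # L2-normalize the row
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row = tfidf_row(0, documents)
for term in ("randomized", "drug", "of"):
    print(f"{term}: {row[term]:.4f}")
```

The three printed weights correspond to the values 0.3394, 0.2676, and 0.2167 in the first row of the TF-IDF output above: a term found in only one document gets the highest weight, while a term shared by three documents gets the lowest.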


Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them in descending order of importance.

In [9]:
from sklearn.feature_selection import SelectKBest, chi2

# Target labels (research categories): one per document, in the same order as `documents`
labels = ["Clinical Trials", "Genomic Research", "Epidemiology", "Drug Discovery"]

# Score every BoW feature with the chi-squared statistic.
# k='all' keeps all features so that they can be ranked rather than filtered.
k_best = SelectKBest(score_func=chi2, k='all')
selected_features = k_best.fit_transform(bow_features, labels)

# Get the feature indices in descending order of chi-squared score
sorted_feature_indices = (-k_best.scores_).argsort()

# Map the sorted indices back to the vocabulary terms
feature_names = vectorizer_bow.get_feature_names_out()
sorted_features = [feature_names[i] for i in sorted_feature_indices]

print("Top Features in Descending Order of Importance:")
print(sorted_features)


Top Features in Descending Order of Importance:
['advances', 'treating', 'study', 'spread', 'review', 'research', 'recent', 'randomized', 'potential', 'infectious', 'identifies', 'trial', 'genomic', 'hypertension', 'epidemiological', 'alzheimer', 'areas', 'biomarkers', 'cancer', 'clinical', 'diagnosis', 'discovery', 'diseases', 'effectiveness', 'disease', 'urban', 'for', 'on', 'drug', 'the', 'of', 'in']
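The ranking above is easier to read once the chi-squared score is computed by hand for a few words. This is a sketch assuming the formulation used for count features: the observed count of a word in each class is compared with the count expected if the word were spread across classes in proportion to class frequency. `chi2_score` is an illustrative helper, not a scikit-learn function.

```python
def chi2_score(feature_counts, labels):
    """Chi-squared statistic for one count feature against class labels.

    feature_counts[i] is the feature's count in document i and
    labels[i] is that document's class. The feature must occur at
    least once in the corpus, otherwise the expected count is zero.
    """
    total = sum(feature_counts)
    score = 0.0
    for c in sorted(set(labels)):
        # Observed: total count of the feature within class c
        observed = sum(x for x, y in zip(feature_counts, labels) if y == c)
        # Expected: total count spread in proportion to class frequency
        expected = total * labels.count(c) / len(labels)
        score += (observed - expected) ** 2 / expected
    return score

labels = ["Clinical Trials", "Genomic Research", "Epidemiology", "Drug Discovery"]

# Per-document counts for three words from the sample corpus
print(chi2_score([1, 0, 0, 0], labels))  # "randomized": appears in one document
print(chi2_score([1, 0, 0, 1], labels))  # "drug": appears in two documents
print(chi2_score([1, 0, 1, 1], labels))  # "of": appears in three documents
```

With only one document per class, the score depends solely on how many documents contain the word (about 3, 2, and 1 here). That explains why every single-document word ties for first place and 'of' and 'in' come last; a larger labeled corpus would be needed for the scores to reflect how strongly a word is tied to a category.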


In [11]:
pip install transformers

Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Insta

In [13]:
pip install torch torchvision torchaudio



Question 4 (10 points): Write Python code to rank the text based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the texts by similarity in descending order.

In [15]:
# Your code here (please add comments in the code):

import numpy as np
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel


# Your query
query = "Clinical trial for Alzheimer's disease treatment."

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"  # You can choose other BERT variants as well
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode the query and documents
query_encoding = tokenizer(query, return_tensors="pt", padding=True, truncation=True)
document_encodings = tokenizer(documents, return_tensors="pt", padding=True, truncation=True)

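# Get BERT embeddings for the query and documents by mean-pooling the last hidden state
# (note: the mean runs over every token position, so padding tokens in the batched documents are included)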
with torch.no_grad():
    query_embedding = model(**query_encoding).last_hidden_state.mean(dim=1)
    document_embeddings = model(**document_encodings).last_hidden_state.mean(dim=1)

# Calculate cosine similarity between the query and documents
similarities = cosine_similarity(query_embedding, document_embeddings).flatten()

# Rank documents by similarity in descending order
ranked_documents = sorted(zip(documents, similarities), key=lambda pair: pair[1], reverse=True)

# Print ranked documents
for i, (document, similarity) in enumerate(ranked_documents, start=1):
    print(f"Rank {i}: Similarity = {similarity:.4f}")
    print(document)
    print()





Rank 1: Similarity = 0.8249
A randomized clinical trial on the effectiveness of drug X in treating hypertension.

Rank 2: Similarity = 0.8074
A review of recent advances in drug discovery for Alzheimer's disease.

Rank 3: Similarity = 0.7311
Genomic research identifies potential biomarkers for cancer diagnosis.

Rank 4: Similarity = 0.6817
Epidemiological study on the spread of infectious diseases in urban areas.
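One caveat about the scores above: the documents are encoded as a padded batch, and a plain mean over token positions also averages the vectors at padding positions. A common alternative is to average only the positions where the attention mask is 1. The sketch below illustrates the idea in plain Python, with hypothetical 2-dimensional token vectors standing in for BERT's hidden states; `masked_mean` is an illustrative helper name.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Hypothetical token vectors for one padded sequence:
# two real tokens followed by two padding positions.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
attention_mask = [1, 1, 0, 0]

# Plain mean over all four positions, padding included
plain = [sum(vec[i] for vec in token_vectors) / len(token_vectors) for i in range(2)]
print("mean over all positions:", plain)
print("mean over real tokens:  ", masked_mean(token_vectors, attention_mask))
```

In the notebook's PyTorch code, the same idea amounts to multiplying `last_hidden_state` by the tokenizer's `attention_mask` before summing, then dividing by the number of real tokens in each sequence. Applying it would likely change the similarity values and possibly the ranking.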

