
## The third In-class-exercise (due on 11:59 PM 10/08/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Classifying news articles is a common and important text classification task. Suppose we want to classify news articles into categories such as politics, sports, technology, entertainment, and health. Here are six types of features that could be useful for building a machine learning model for this task:

TF-IDF Features:
TF-IDF (Term Frequency-Inverse Document Frequency) features represent the importance of words in a document relative to a corpus. These features are helpful because they highlight words or terms that are specific to a particular category. For example, the word "election" might be more prominent in politics articles.

Word Embeddings:
Word embeddings, such as Word2Vec or GloVe, capture the semantic relationships between words. These features can help the model understand the context and meaning of words in news articles, allowing it to distinguish between categories based on word usage.

Named Entity Recognition (NER) Features:
NER identifies and categorizes named entities (e.g., people, organizations, locations) in text. Extracting and counting named entities can be useful for classifying news articles, as certain categories tend to involve specific entities. For instance, politics articles often mention politicians' names and government organizations.

Sentiment Analysis Features:
Sentiment analysis estimates the tone of an article, for example whether it reads as positive or negative. Detecting positive or negative sentiment in financial news articles, for instance, can be informative, and sentiment features give the model context about the overall tone of an article.

Bag of Words (BoW) with N-grams:
Combining BoW with N-grams captures not only individual words but also short phrases. This can be helpful for recognizing patterns specific to certain categories. For example, "World Cup" might be a relevant N-gram for sports articles.

Source and Metadata Features:
Features related to the source of the news article, such as the publication name, author, and publication date, can provide valuable information for classification. Different news sources may have specific biases or tendencies in their reporting.
'''
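The last feature type above, source and metadata, is categorical rather than textual, so it has to be turned into numbers before a model can use it. The following is a minimal sketch of one way to do that, assuming a one-hot encoding for the publication name and two simple date parts. The records and the `encode` helper are hypothetical and only illustrate the idea; the author field is left out because it would usually need its own handling.

```python
from datetime import date

# Hypothetical metadata records (illustrative values only).
records = [
    {"source": "CNN", "publication_date": "2023-04-10"},
    {"source": "BBC", "publication_date": "2023-04-09"},
    {"source": "CNN", "publication_date": "2023-04-12"},
]

# Fixed, sorted list of sources so every row has the same columns.
sources = sorted({r["source"] for r in records})

def encode(record):
    """Return [one-hot source columns..., weekday, month] for one record."""
    one_hot = [1 if record["source"] == s else 0 for s in sources]
    d = date.fromisoformat(record["publication_date"])
    return one_hot + [d.weekday(), d.month]  # weekday: Monday == 0

metadata_matrix = [encode(r) for r in records]
for row in metadata_matrix:
    print(row)
```

Each row can then be concatenated with the text features of the same article.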

Question 2 (10 points): Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [3]:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from textblob import TextBlob

# Sample news articles
news_articles = [
    "The election results were announced yesterday, with a landslide victory for the incumbent party.",
    "The football World Cup is set to begin next month, with teams from around the world competing.",
    "A new technology breakthrough in AI research promises to revolutionize the way we use computers.",
    "The latest blockbuster movie has hit theaters, and it's already breaking box office records.",
    "A new study on health and nutrition reveals surprising findings about the benefits of certain foods."
]

# TF-IDF Features
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(news_articles)

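# Word Embeddings: one document vector per article from spaCy (not Word2Vec).
# Note: en_core_web_sm ships without static word vectors, so doc.vector is
# derived from the pipeline's context tensors; a larger model such as
# en_core_web_md would be needed for pretrained word vectors.
nlp = spacy.load("en_core_web_sm")
word_embeddings = []
for article in news_articles:
    doc = nlp(article)
    word_embeddings.append(doc.vector)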

# Named Entity Recognition (NER) Features
named_entities = []
for article in news_articles:
    doc = nlp(article)
    entities = [ent.text for ent in doc.ents]
    named_entities.append(entities)

# Topic Modeling Features (LDA)
# Note: with only five short articles, min_df=2 leaves a very small vocabulary.
lda_vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
lda_counts = lda_vectorizer.fit_transform(news_articles)
lda_model = LatentDirichletAllocation(n_components=5, random_state=42)
lda_features = lda_model.fit_transform(lda_counts)

# Sentiment Analysis Features
sentiment_scores = []
for article in news_articles:
    analysis = TextBlob(article)
    sentiment_scores.append(analysis.sentiment.polarity)

# Bag of Words (BoW) with N-grams
bow_vectorizer = CountVectorizer(ngram_range=(1, 2))
bow_features = bow_vectorizer.fit_transform(news_articles)

# Source and Metadata Features
source_metadata_features = [
    {"source": "CNN", "author": "John Doe", "publication_date": "2023-04-10"},
    {"source": "BBC", "author": "Jane Smith", "publication_date": "2023-04-09"},
    {"source": "TechCrunch", "author": "David Brown", "publication_date": "2023-04-11"},
    {"source": "Hollywood Reporter", "author": "Alice Johnson", "publication_date": "2023-04-08"},
    {"source": "WebMD", "author": "Michael White", "publication_date": "2023-04-12"}
]

# Text Length Features (word count per article)
text_length = [len(article.split()) for article in news_articles]

# Display the extracted features
print("TF-IDF Features:")
print(tfidf_features.toarray())

print("\nWord Embeddings (spaCy document vectors):")
print(word_embeddings)

print("\nNamed Entity Recognition (NER) Features:")
print(named_entities)

print("\nTopic Modeling Features (LDA):")
print(lda_features)

print("\nSentiment Analysis Features:")
print(sentiment_scores)

print("\nBag of Words (BoW) with N-grams:")
print(bow_features.toarray())

print("\nSource and Metadata Features:")
for metadata in source_metadata_features:
    print(metadata)

print("\nText Length:")
print(text_length)


TF-IDF Features:
[[0.         0.         0.         0.         0.29412852 0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.29412852 0.
  0.         0.         0.29412852 0.         0.         0.
  0.         0.         0.29412852 0.         0.         0.29412852
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.29412852 0.         0.
  0.         0.29412852 0.         0.         0.         0.
  0.         0.         0.         0.28030763 0.         0.
  0.         0.29412852 0.         0.         0.29412852 0.23730104
  0.         0.29412852]
 [0.         0.         0.         0.         0.         0.24105092
  0.24105092 0.         0.         0.         0.         0.
  0.         0.24105092 0.         0.24105092 0.         0.
  0.         0.24105092 0.         0.24105092 0.         0.
  0.         0.         0.         0.24105092 0.         0.
  0.         0.24105092 0.        
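The values in the first row of the TF-IDF output can be reproduced by hand, which makes it clearer what the vectorizer computes. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer` (lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document). The helper names `tokenize` and `tfidf` are introduced here for illustration only.

```python
import math
import re
from collections import Counter

news_articles = [
    "The election results were announced yesterday, with a landslide victory for the incumbent party.",
    "The football World Cup is set to begin next month, with teams from around the world competing.",
    "A new technology breakthrough in AI research promises to revolutionize the way we use computers.",
    "The latest blockbuster movie has hit theaters, and it's already breaking box office records.",
    "A new study on health and nutrition reveals surprising findings about the benefits of certain foods.",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [Counter(tokenize(a)) for a in news_articles]
n_docs = len(docs)

# Document frequency: in how many articles each term appears.
df = Counter(term for doc in docs for term in doc)

def tfidf(doc):
    """Smoothed TF-IDF weights for one document, L2-normalized."""
    raw = {t: tf * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, tf in doc.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = [tfidf(d) for d in docs]
print(round(weights[0]["election"], 4))  # term unique to the first article
print(round(weights[0]["with"], 4))      # term shared with the second article
print(round(weights[0]["the"], 4))       # appears in every article, twice in this one
```

Terms that occur in only one article, such as "election", receive the largest weight per occurrence, while "the" keeps a sizeable weight in this row only because it occurs twice in the sentence.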

Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [13]:
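from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Three sample articles, one per label (the same texts used in Question 4)
news_articles = [
    "The election results were announced yesterday, and the new president was sworn in.",
    "The football match last night was an intense battle between two rival teams.",
    "The latest smartphone from XYZ Corporation offers advanced features and cutting-edge technology.",
]

# Sample labels for the news articles, in the same order as the texts
labels = ['politics', 'sports', 'technology']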

# Convert labels to numerical values (0, 1, 2)
label_dict = {label: idx for idx, label in enumerate(labels)}
numerical_labels = [label_dict[label] for label in labels]

# Create a new TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(news_articles)

# Train a Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(tfidf_features, numerical_labels)

# Get feature importances
feature_importances = rf_classifier.feature_importances_

# Create a DataFrame to store feature names and their importance scores
feature_importance_df = pd.DataFrame({'Feature': tfidf_vectorizer.get_feature_names_out(), 'Importance': feature_importances})

# Sort features by importance in descending order
sorted_features = feature_importance_df.sort_values(by='Importance', ascending=False)

# Print the top N features and their importance scores
top_n = 10
for idx, row in sorted_features.head(top_n).iterrows():
    print(f"Feature: {row['Feature']}, Importance: {row['Importance']:.4f}")


Feature: offers, Importance: 0.0706
Feature: edge, Importance: 0.0647
Feature: an, Importance: 0.0588
Feature: last, Importance: 0.0529
Feature: corporation, Importance: 0.0529
Feature: in, Importance: 0.0529
Feature: advanced, Importance: 0.0471
Feature: results, Importance: 0.0471
Feature: yesterday, Importance: 0.0471
Feature: match, Importance: 0.0353
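Stop words such as "an" and "in" appear near the top of the ranking above, which is a sign that three training documents are too few for stable importance scores. As a point of comparison, here is a minimal sketch of the chi-square statistic, a filter-style selection method, assuming it is among the methods the cited review discusses. The six sentences, their labels, and the `chi_square` helper are hypothetical and exist only to illustrate the calculation; the score is computed from term presence or absence per document, which differs from the count-based `chi2` in scikit-learn.

```python
import re

# Hypothetical toy corpus: two short sentences per category.
corpus = [
    ("the election results were announced", "politics"),
    ("the president won the election", "politics"),
    ("the football match was intense", "sports"),
    ("the team won the football cup", "sports"),
    ("the new smartphone offers advanced features", "technology"),
    ("the smartphone uses new technology", "technology"),
]

doc_terms = [(set(re.findall(r"\b\w\w+\b", text.lower())), label) for text, label in corpus]
vocabulary = sorted(set().union(*(terms for terms, _ in doc_terms)))

def chi_square(term, category):
    """Chi-square score of a term for one category from a 2x2 contingency table."""
    a = b = c = d = 0
    for terms, label in doc_terms:
        present, in_cat = term in terms, label == category
        if present and in_cat:
            a += 1  # in category, contains term
        elif present:
            b += 1  # other category, contains term
        elif in_cat:
            c += 1  # in category, lacks term
        else:
            d += 1  # other category, lacks term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

# Rank all terms for the "politics" category, highest score first.
ranked = sorted(((t, chi_square(t, "politics")) for t in vocabulary),
                key=lambda kv: (-kv[1], kv[0]))
for term, score in ranked[:5]:
    print(f"Feature: {term}, Chi-square: {score:.4f}")
```

A term that occurs in every document, such as "the", scores zero because it carries no information about the category.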


Question 4 (10 points): Write Python code to rank the texts based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [12]:



from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Sample news articles
news_articles = [
    "The election results were announced yesterday, and the new president was sworn in.",
    "The football match last night was an intense battle between two rival teams.",
    "The latest smartphone from XYZ Corporation offers advanced features and cutting-edge technology.",
]

# Define your query
query = "Who won the recent election?"

# Load the pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Encode the query and texts
query_tokens = tokenizer(query, return_tensors='pt', padding=True, truncation=True)
article_tokens = tokenizer(news_articles, return_tensors='pt', padding=True, truncation=True)

# Get BERT embeddings for the query and texts
with torch.no_grad():
    query_outputs = model(**query_tokens)
    article_outputs = model(**article_tokens)

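# Extract sentence embeddings by mean-pooling the token vectors.
# Note: the mean runs over every position, including any padding tokens in the batch.
query_embeddings = query_outputs.last_hidden_state.mean(dim=1).numpy()
article_embeddings = article_outputs.last_hidden_state.mean(dim=1).numpy()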

# Calculate cosine similarity
similarities = cosine_similarity(query_embeddings, article_embeddings)

# Rank the documents based on similarity
ranking = np.argsort(similarities[0])[::-1]

# Print the ranked documents
print("Ranked Documents:")
for i, idx in enumerate(ranking):
    print(f"Rank {i+1}: Similarity {similarities[0][idx]:.4f}")
    print(f"Text: {news_articles[idx]}")
    print("-" * 30)

Ranked Documents:
Rank 1: Similarity 0.6951
Text: The election results were announced yesterday, and the new president was sworn in.
------------------------------
Rank 2: Similarity 0.5722
Text: The football match last night was an intense battle between two rival teams.
------------------------------
Rank 3: Similarity 0.4640
Text: The latest smartphone from XYZ Corporation offers advanced features and cutting-edge technology.
------------------------------
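One detail of the pooling step in the cell above is worth spelling out: when a batch is padded, taking a plain mean over the sequence dimension also averages the hidden states at padding positions. A common alternative is to weight the mean by the tokenizer's `attention_mask`. The sketch below illustrates that idea, followed by the cosine ranking, on toy 2-dimensional vectors in plain Python. The vectors and the helper names `masked_mean` and `cosine` are made up for illustration and are not BERT outputs.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask value is 1."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy "hidden states": two real tokens followed by one padding position.
tokens = [[1.0, 0.0], [1.0, 2.0], [9.0, 0.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)  # ignores the padding vector
naive = [sum(v[i] for v in tokens) / len(tokens) for i in range(2)]  # includes it
print("masked:", pooled)
print("naive: ", naive)

# Rank hypothetical document vectors against a query by cosine similarity.
query = [1.0, 0.9]
doc_vectors = [pooled, [0.0, 1.0], [1.0, -1.0]]
scores = [cosine(query, d) for d in doc_vectors]
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
for rank, idx in enumerate(ranking, start=1):
    print(f"Rank {rank}: document {idx}, similarity {scores[idx]:.4f}")
```

With PyTorch tensors the same idea is usually written by multiplying `last_hidden_state` by the mask (expanded to the hidden dimension), summing over the sequence dimension, and dividing by the number of unmasked tokens.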


In [11]:
pip install transformers

