
# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [None]:
# Your answer here (no code for this question; write your answer to the questions above in as much detail as possible):

'''
Please write your answer here:
I am interested in finance and stock markets, so my task is to classify news related to the stock market.
Task: Stock News Classification
Features:
Sentiment Lexicon: Captures polarity from financial terms like "growth" or "crash."
Bag of Words (BoW) / TF-IDF: Identifies important words or phrases, such as "profit" or "plunge," linked to stock performance.
N-grams: Captures contextual phrases like "record profits" or "market volatility."
POS Tags: Focuses on adjectives and verbs that convey stock sentiment (e.g., "strong," "decline").
Word Embeddings: Encodes semantic meaning, capturing relationships between words like "rise" and "increase."




'''

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [2]:
!python -m spacy download en_core_web_md # Download the en_core_web_md model

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 14.0 MB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [3]:
# Your code here (Please add comments in the code):

import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import sentiwordnet as swn
from nltk import word_tokenize, pos_tag
import spacy
from textblob import TextBlob
from collections import Counter

# Download NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('sentiwordnet')
nlp = spacy.load('en_core_web_md')  # Load spaCy model for word embeddings

# Sample stock news data
stock_news = [
    "Tesla's stock surges after record deliveries and positive quarterly earnings.",
    "Apple shares plunge as the company faces supply chain disruptions.",
    "Amazon's strong earnings report boosts investor confidence, driving its stock price up.",
    "Microsoft's stock declines as cloud growth slows.",
    "Nvidia stock spikes due to increased demand for AI chips, but analysts warn of market overheating."
]

### 1. Sentiment Lexicon (TextBlob)
def extract_sentiment_lexicon(texts):
    sentiments = []
    for text in texts:
        blob = TextBlob(text)
        sentiments.append(blob.sentiment.polarity)  # Polarity score between [-1, 1]
    return sentiments

print("Sentiment Lexicon (TextBlob):")
print(extract_sentiment_lexicon(stock_news))

### 2. Bag of Words (BoW)
vectorizer_bow = CountVectorizer(stop_words='english')
X_bow = vectorizer_bow.fit_transform(stock_news)
print("\nBag of Words (BoW) feature matrix:")
print(X_bow.toarray())
print("Feature Names:", vectorizer_bow.get_feature_names_out())

### 3. TF-IDF
vectorizer_tfidf = TfidfVectorizer(stop_words='english')
X_tfidf = vectorizer_tfidf.fit_transform(stock_news)
print("\nTF-IDF feature matrix:")
print(X_tfidf.toarray())
print("Feature Names:", vectorizer_tfidf.get_feature_names_out())

### 4. N-grams (Bi-grams)
vectorizer_ngrams = CountVectorizer(ngram_range=(2, 2), stop_words='english')
X_ngrams = vectorizer_ngrams.fit_transform(stock_news)
print("\nBi-grams feature matrix:")
print(X_ngrams.toarray())
print("Feature Names:", vectorizer_ngrams.get_feature_names_out())

### 5. Part of Speech (POS) Tagging
def extract_pos_tags(texts):
    pos_features = []
    for text in texts:
        tokens = word_tokenize(text)
        pos_tags = pos_tag(tokens)
        pos_features.append(Counter([tag for word, tag in pos_tags]))
    return pos_features

print("\nPart of Speech (POS) Tags:")
print(extract_pos_tags(stock_news))

### 6. Word Embeddings using spaCy
def extract_word_embeddings(texts):
    embeddings = []
    for doc in texts:
        doc_nlp = nlp(doc)
        embeddings.append(doc_nlp.vector)  # Extract the document's embedding as a vector
    return embeddings

print("\nWord Embeddings:")
word_embeddings = extract_word_embeddings(stock_news)
print(word_embeddings)  # List of vectors for each document



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!


Sentiment Lexicon (TextBlob):
[0.22727272727272727, 0.0, 0.4333333333333333, 0.0, -0.125]

Bag of Words (BoW) feature matrix:
[[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 0
  1 1 0]
 [0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1
  0 0 0]
 [0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0
  0 0 0]
 [0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0
  0 0 0]
 [1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0
  0 0 1]]
Feature Names: ['ai' 'amazon' 'analysts' 'apple' 'boosts' 'chain' 'chips' 'cloud'
 'company' 'confidence' 'declines' 'deliveries' 'demand' 'disruptions'
 'driving' 'earnings' 'faces' 'growth' 'increased' 'investor' 'market'
 'microsoft' 'nvidia' 'overheating' 'plunge' 'positive' 'price'
 'quarterly' 'record' 'report' 'shares' 'slows' 'spikes' 'stock' 'strong'
 'supply' 'surges' 'tesla' 'warn']

TF-IDF feature matrix:
[[0.         0.         0.         0.
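
To make the TF-IDF weights above easier to interpret, here is a minimal pure-Python sketch of the weighting, assuming scikit-learn's default settings (smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row). The function name `tfidf_rows` and the toy documents are illustrative assumptions, not part of the exercise data.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed IDF, assumed to mirror scikit-learn's defaults.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        raw = {t: tf[t] * idf[t] for t in tf}
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

# Hypothetical hand-tokenised documents (stop words already removed).
toy_docs = [
    ["tesla", "stock", "surges"],
    ["apple", "shares", "plunge"],
    ["microsoft", "stock", "declines"],
]
toy_rows = tfidf_rows(toy_docs)
for row in toy_rows:
    print({t: round(w, 3) for t, w in row.items()})
```

In this sketch "stock" appears in two of the three toy documents, so it receives a lower weight than terms such as "tesla" that appear in only one.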

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [4]:
# Your code here (Please add comments in the code):

import numpy as np
from sklearn.feature_selection import chi2
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Sample stock news data
stock_news = [
    "Tesla's stock surges after record deliveries and positive quarterly earnings.",
    "Apple shares plunge as the company faces supply chain disruptions.",
    "Amazon's strong earnings report boosts investor confidence, driving its stock price up.",
    "Microsoft's stock declines as cloud growth slows.",
    "Nvidia stock spikes due to increased demand for AI chips, but analysts warn of market overheating."
]

# Labels (Sentiment): Positive = 1, Negative = 0 (Manually labeled for this example)
labels = [1, 0, 1, 0, 1]  # For simplicity, let's assume these sentiments based on the news.

### 1. Feature Extraction using Bag of Words (BoW)
vectorizer_bow = CountVectorizer(stop_words='english')
X_bow = vectorizer_bow.fit_transform(stock_news)
features_bow = vectorizer_bow.get_feature_names_out()

### 2. Apply Chi-Square Feature Selection
chi2_scores, p_values = chi2(X_bow, labels)

# Create a DataFrame for better readability
chi2_df = pd.DataFrame({
    'Feature': features_bow,
    'Chi2_Score': chi2_scores,
    'P_Value': p_values
})

# Sort the features based on Chi2 score in descending order
chi2_df_sorted = chi2_df.sort_values(by='Chi2_Score', ascending=False)
print("\nRanked Features using Chi-Square:")
print(chi2_df_sorted)

### 3. Feature Extraction using TF-IDF
vectorizer_tfidf = TfidfVectorizer(stop_words='english')
X_tfidf = vectorizer_tfidf.fit_transform(stock_news)
features_tfidf = vectorizer_tfidf.get_feature_names_out()

# Apply Chi-Square for TF-IDF features
chi2_scores_tfidf, p_values_tfidf = chi2(X_tfidf, labels)

# Create a DataFrame for TF-IDF features
chi2_df_tfidf = pd.DataFrame({
    'Feature': features_tfidf,
    'Chi2_Score': chi2_scores_tfidf,
    'P_Value': p_values_tfidf
})

# Sort and display the features based on Chi2 score
chi2_df_tfidf_sorted = chi2_df_tfidf.sort_values(by='Chi2_Score', ascending=False)
print("\nRanked TF-IDF Features using Chi-Square:")
print(chi2_df_tfidf_sorted)





Ranked Features using Chi-Square:
        Feature  Chi2_Score   P_Value
31        slows    1.500000  0.220671
8       company    1.500000  0.220671
16        faces    1.500000  0.220671
30       shares    1.500000  0.220671
13  disruptions    1.500000  0.220671
24       plunge    1.500000  0.220671
10     declines    1.500000  0.220671
17       growth    1.500000  0.220671
7         cloud    1.500000  0.220671
35       supply    1.500000  0.220671
5         chain    1.500000  0.220671
21    microsoft    1.500000  0.220671
3         apple    1.500000  0.220671
15     earnings    1.333333  0.248213
29       report    0.666667  0.414216
28       record    0.666667  0.414216
26        price    0.666667  0.414216
27    quarterly    0.666667  0.414216
0            ai    0.666667  0.414216
32       spikes    0.666667  0.414216
34       strong    0.666667  0.414216
36       surges    0.666667  0.414216
37        tesla    0.666667  0.414216
25     positive    0.666667  0.414216
19     investor
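
The `Chi2_Score` values printed above can be reproduced by hand. The sketch below is a NumPy re-implementation of the statistic as I understand `sklearn.feature_selection.chi2` to compute it: observed per-class feature totals are compared with the totals expected if the feature were independent of the class. The helper name `chi2_by_hand` and the three-column toy matrix are assumptions for illustration; the columns copy the counts of "slows", "earnings", and "record" from the five documents.

```python
import numpy as np

def chi2_by_hand(X, y):
    """Chi-square score per feature column for non-negative counts X and labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # One-hot class indicators, shape (n_samples, n_classes).
    Y = (y[:, None] == classes[None, :]).astype(float)
    observed = Y.T @ X                    # feature totals within each class
    class_prob = Y.mean(axis=0)           # share of documents in each class
    feature_total = X.sum(axis=0)         # total count of each feature
    expected = np.outer(class_prob, feature_total)
    return ((observed - expected) ** 2 / expected).sum(axis=0)

labels = [1, 0, 1, 0, 1]
# Columns: "slows", "earnings", "record" (one row per document).
toy_counts = np.array([
    [0, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
])
scores = chi2_by_hand(toy_counts, labels)
print([round(float(s), 6) for s in scores])
```

The three scores come out as 1.5, 1.333333, and 0.666667, matching the table. Every word that occurs once in a negative document gets 1.5 and every word that occurs once in a positive document gets 0.666667, which is why the ranking contains so many ties: with five documents, the score mostly reflects the class imbalance rather than the words themselves.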

## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [5]:
# Your code here (Please add comments in the code):

!pip install torch transformers sentence-transformers




Collecting sentence-transformers
  Downloading sentence_transformers-3.1.1-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.1.1-py3-none-any.whl (245 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 245.3/245.3 kB 4.5 MB/s eta 0:00:00
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-3.1.1


In [7]:
# Your code here (Please add comments in the code):
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd

# Load pre-trained Sentence-BERT model
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Sample stock news data (from question 2)
stock_news = [
    "Tesla's stock surges after record deliveries and positive quarterly earnings.",
    "Apple shares plunge as the company faces supply chain disruptions.",
    "Amazon's strong earnings report boosts investor confidence, driving its stock price up.",
    "Microsoft's stock declines as cloud growth slows.",
    "Nvidia stock spikes due to increased demand for AI chips, but analysts warn of market overheating."
]

# Define a query to match the most relevant documents
query = "Tesla stock surges due to strong earnings report."

# 1. Generate embeddings for both the query and the text documents
news_embeddings = model.encode(stock_news)
query_embedding = model.encode([query])

# 2. Calculate cosine similarity between the query and each document
cosine_similarities = cosine_similarity(query_embedding, news_embeddings).flatten()

# 3. Rank the documents based on cosine similarity scores
ranked_indices = np.argsort(cosine_similarities)[::-1]  # Sort in descending order of similarity
ranked_similarities = cosine_similarities[ranked_indices]

# 4. Create a DataFrame to display the results
ranked_news = pd.DataFrame({
    'Document': np.array(stock_news)[ranked_indices],
    'Similarity_Score': ranked_similarities
})

# Display the ranked documents with their similarity scores
print("\nRanked Documents based on Text Similarity with the Query:")
print(ranked_news)





Ranked Documents based on Text Similarity with the Query:
                                            Document  Similarity_Score
0  Tesla's stock surges after record deliveries a...          0.906337
1  Amazon's strong earnings report boosts investo...          0.872387
2  Nvidia stock spikes due to increased demand fo...          0.783173
3  Apple shares plunge as the company faces suppl...          0.484475
4  Microsoft's stock declines as cloud growth slows.          0.326168
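
For reference, the ranking step does not depend on BERT itself: cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below shows that computation with plain NumPy on hypothetical 2-dimensional vectors standing in for the sentence embeddings; `rank_by_cosine` is an illustrative helper, not a library function.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs):
    """Return document indices and similarities, most similar first."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity: dot product divided by the product of the norms.
    sims = (doc_vecs @ query_vec) / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    order = np.argsort(sims)[::-1]  # indices sorted by descending similarity
    return order, sims[order]

# Hypothetical toy vectors; real sentence embeddings have hundreds of dimensions.
toy_query = [1.0, 0.0]
toy_doc_vecs = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
order, sims = rank_by_cosine(toy_query, toy_doc_vecs)
print(order.tolist(), sims.round(4).tolist())
```

The second toy vector points in the same direction as the query, so it ranks first with similarity 1 even though it is twice as long. Cosine similarity ignores vector length and compares direction only.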


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [10]:
# Your answer here (no code for this question; write your answer to the questions above in as much detail as possible):

'''
This exercise was highly educational, particularly in exploring how text features can be extracted for classification tasks such as stock news sentiment analysis. Key concepts that stood out included the various feature extraction methods, such as TF-IDF, Bag of Words, and Word Embeddings using BERT.
One of the main challenges was implementing the different feature extraction methods and handling the nuances between them, such as how TF-IDF compares to Word Embeddings in capturing the context of words. Calculating cosine similarity for BERT embeddings and working with large vectors was an additional challenge.
This exercise is directly relevant to Natural Language Processing (NLP) in Data Science, particularly for tasks like sentiment analysis, text classification, and information retrieval.


'''
