# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission. , and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [None]:

'''
Please write you answer here:
An interesting text classification task could be sentiment analysis of product reviews. This task involves classifying customer reviews as positive, negative, or neutral based on the language used. Let’s consider we have a large dataset of product reviews written in natural language.

Features for the Machine Learning Model:

1. Bag of Words (BoW):
Description: The Bag of Words model represents text by counting the occurrence of each word in a document without
considering the order of the words.
Why Useful: BoW can capture the frequency of positive and negative terms,
such as "amazing," "terrible," "good," or "bad," which are indicative of sentiment.
The occurrence of these words can strongly correlate with the overall sentiment of a review.

2. TF-IDF (Term Frequency-Inverse Document Frequency):
Description: TF-IDF adjusts the word frequency (from BoW) by reducing the weight of common words
and increasing the weight of rarer but more significant words.
Why Useful: TF-IDF helps to reduce the noise of common words like "the" or "and" and
gives higher importance to words that carry more sentiment-specific meaning.
Words that are specific to the review domain but occur less frequently, such as "refund" or "defective,"
will get a higher weight and better capture the sentiment.

3. Part-of-Speech (POS) Tags:
Description: POS tagging identifies the grammatical parts of speech (nouns, verbs, adjectives, etc.) in the text.
Why Useful: Adjectives and adverbs often play a crucial role in sentiment classification. For instance, adjectives like "excellent" or "poor" are strong indicators of sentiment. Including POS features helps the model to focus on the parts of speech that are more sentiment-bearing.

4. Sentiment Lexicons:
Description: Sentiment lexicons are precompiled lists of words associated with positive, negative, or neutral sentiment.
Examples include the AFINN or SentiWordNet lexicons.
Why Useful: This feature is directly designed for sentiment analysis.
The model can identify and weigh sentiment-laden words from these lexicons to determine the overall sentiment of a review.
Lexicons can also help to capture the polarity of rare words not frequent enough for the model to learn from purely statistical methods.

5. N-grams (Bigrams or Trigrams):
Description: N-grams are sequences of N consecutive words from the text. For example, "not good" would be a bigram.
Why Useful: Sentiment often depends on word combinations rather than individual words.
For instance, the phrase "not bad" is positive, even though "bad" alone is negative.
Including bigrams or trigrams captures these dependencies between words and improves the model’s understanding of context.




'''

## Question 2 (10 Points)
Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [4]:
# You code here (Please add comments in the code):

import numpy as np
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords
from textblob import TextBlob

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

# Sample text data (1-5 samples)
texts = [
    "This product is fantastic! I'm so happy with my purchase.",
    "The item broke after just two days. Terrible quality.",
    "Not bad, but I expected more. It's okay for the price.",
    "Amazing service! Will definitely buy again!",
    "The product was not what I expected. I am disappointed."
]

# Function to extract various text features from the input list of texts
def extract_features(texts):
    features = pd.DataFrame()  # Initialize an empty DataFrame to hold all features

    # 1. Bag of Words (BoW) using CountVectorizer
    vectorizer = CountVectorizer(stop_words='english')  # Create a vectorizer ignoring stop words
    bow_matrix = vectorizer.fit_transform(texts).toarray()  # Convert texts to BoW feature matrix
    bow_df = pd.DataFrame(bow_matrix, columns=vectorizer.get_feature_names_out())  # Convert matrix to DataFrame
    features = pd.concat([features, bow_df], axis=1)  # Add BoW features to the final features DataFrame

    # 2. TF-IDF (Term Frequency-Inverse Document Frequency) using TfidfVectorizer
    tfidf_vectorizer = TfidfVectorizer(stop_words='english')  # Create a TF-IDF vectorizer ignoring stop words
    tfidf_matrix = tfidf_vectorizer.fit_transform(texts).toarray()  # Convert texts to TF-IDF feature matrix
    tfidf_df = pd.DataFrame(tfidf_matrix, columns=tfidf_vectorizer.get_feature_names_out())  # Convert matrix to DataFrame
    features = pd.concat([features, tfidf_df], axis=1, keys=['tfidf'])  # Add TF-IDF features to the final DataFrame

    # 3. Part of Speech (POS) Tags Count using NLTK
    pos_tag_counts = []
    for text in texts:
        tokens = word_tokenize(text)  # Tokenize each text into words
        pos_tags = pos_tag(tokens)  # Get the POS tags for each word
        # Initialize POS counts for specific tags (adjectives, nouns, adverbs, etc.)
        pos_counts = {tag: 0 for tag in ["JJ", "NN", "RB", "VB", "DT"]}
        # Count occurrences of each POS tag
        for _, tag in pos_tags:
            if tag in pos_counts:
                pos_counts[tag] += 1
        pos_tag_counts.append(pos_counts)  # Store the POS counts for each text
    pos_tag_df = pd.DataFrame(pos_tag_counts)  # Convert POS counts to DataFrame
    features = pd.concat([features, pos_tag_df], axis=1)  # Add POS features to the final DataFrame

    # 4. Sentiment Lexicon (Polarity) using TextBlob
    sentiment_polarity = [TextBlob(text).sentiment.polarity for text in texts]  # Compute sentiment polarity
    features['Sentiment_Polarity'] = sentiment_polarity  # Add sentiment polarity as a feature

    # 5. N-grams (Bigrams) using CountVectorizer
    bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words='english')  # Create bigram vectorizer
    bigram_matrix = bigram_vectorizer.fit_transform(texts).toarray()  # Convert texts to bigram feature matrix
    bigram_df = pd.DataFrame(bigram_matrix, columns=bigram_vectorizer.get_feature_names_out())  # Convert matrix to DataFrame
    features = pd.concat([features, bigram_df], axis=1)  # Add bigram features to the final DataFrame

    return features  # Return the complete DataFrame of features

# Extract features from the sample texts
features_df = extract_features(texts)

# Display the resulting DataFrame with extracted features
print(features_df.head())


   (tfidf, amazing)  (tfidf, bad)  (tfidf, broke)  (tfidf, buy)  \
0                 0             0               0             0   
1                 0             0               1             0   
2                 0             1               0             0   
3                 1             0               0             1   
4                 0             0               0             0   

   (tfidf, days)  (tfidf, definitely)  (tfidf, disappointed)  \
0              0                    0                      0   
1              1                    0                      0   
2              0                    0                      0   
3              0                    1                      0   
4              0                    0                      1   

   (tfidf, expected)  (tfidf, fantastic)  (tfidf, happy)  ...  expected okay  \
0                  0                   1               1  ...              0   
1                  0                   0            

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
  features = pd.concat([features, tfidf_df], axis=1, keys=['tfidf'])  # Add TF-IDF features to the final DataFrame


## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, rank the features based on their importance in the descending order.

In [6]:
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import LabelEncoder
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords
from textblob import TextBlob

# Sample text data
texts = [
    "This product is fantastic! I'm so happy with my purchase.",
    "The item broke after just two days. Terrible quality.",
    "Not bad, but I expected more. It's okay for the price.",
    "Amazing service! Will definitely buy again!",
    "The product was not what I expected. I am disappointed."
]

# Function to extract various text features from the input list of texts
def extract_features(texts):
    features = pd.DataFrame()  # Initialize an empty DataFrame to hold all features

    # 1. Bag of Words (BoW) using CountVectorizer
    vectorizer = CountVectorizer(stop_words='english')  # Create a vectorizer ignoring stop words
    bow_matrix = vectorizer.fit_transform(texts).toarray()  # Convert texts to BoW feature matrix
    bow_df = pd.DataFrame(bow_matrix, columns=vectorizer.get_feature_names_out())  # Convert matrix to DataFrame
    features = pd.concat([features, bow_df], axis=1)  # Add BoW features to the final features DataFrame

    # 2. TF-IDF (Term Frequency-Inverse Document Frequency) using TfidfVectorizer
    tfidf_vectorizer = TfidfVectorizer(stop_words='english')  # Create a TF-IDF vectorizer ignoring stop words
    tfidf_matrix = tfidf_vectorizer.fit_transform(texts).toarray()  # Convert texts to TF-IDF feature matrix
    tfidf_df = pd.DataFrame(tfidf_matrix, columns=tfidf_vectorizer.get_feature_names_out())  # Convert matrix to DataFrame
    features = pd.concat([features, tfidf_df], axis=1)  # Add TF-IDF features to the final DataFrame

    # 3. Part of Speech (POS) Tags Count using NLTK
    pos_tag_counts = []
    for text in texts:
        tokens = word_tokenize(text)  # Tokenize each text into words
        pos_tags = pos_tag(tokens)  # Get the POS tags for each word
        # Initialize POS counts for specific tags (adjectives, nouns, adverbs, etc.)
        pos_counts = {tag: 0 for tag in ["JJ", "NN", "RB", "VB", "DT"]}
        # Count occurrences of each POS tag
        for _, tag in pos_tags:
            if tag in pos_counts:
                pos_counts[tag] += 1
        pos_tag_counts.append(pos_counts)  # Store the POS counts for each text
    pos_tag_df = pd.DataFrame(pos_tag_counts)  # Convert POS counts to DataFrame
    features = pd.concat([features, pos_tag_df], axis=1)  # Add POS features to the final DataFrame

    # 4. Sentiment Lexicon (Polarity) using TextBlob
    sentiment_polarity = [TextBlob(text).sentiment.polarity for text in texts]  # Compute sentiment polarity
    features['Sentiment_Polarity'] = sentiment_polarity  # Add sentiment polarity as a feature

    # 5. N-grams (Bigrams) using CountVectorizer
    bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words='english')  # Create bigram vectorizer
    bigram_matrix = bigram_vectorizer.fit_transform(texts).toarray()  # Convert texts to bigram feature matrix
    bigram_df = pd.DataFrame(bigram_matrix, columns=bigram_vectorizer.get_feature_names_out())  # Convert matrix to DataFrame
    features = pd.concat([features, bigram_df], axis=1)  # Add bigram features to the final DataFrame

    # 6. Readability Features (Text Length and Word Count)
    features['Text_Length'] = [len(text) for text in texts]  # Calculate text length in characters
    features['Word_Count'] = [len(text.split()) for text in texts]  # Calculate word count for each text

    # 7. Punctuation Features (Exclamation and Question Marks)
    features['Exclamation_Count'] = [text.count('!') for text in texts]  # Count exclamation marks in each text
    features['Question_Mark_Count'] = [text.count('?') for text in texts]  # Count question marks in each text

    return features  # Return the complete DataFrame of features

# Labels for sentiment (1 = positive, 0 = negative)
labels = [1, 0, 1, 1, 0]  # Manual assignment for binary sentiment classification

# Extract the features
features_df = extract_features(texts)

# Convert labels to a numpy array
labels_np = np.array(labels)

# Ensure all column names are strings to avoid TypeError
features_df.columns = features_df.columns.astype(str)

# Select only non-negative features for Chi-Square (BoW, TF-IDF, etc.)
non_negative_features_df = features_df.loc[:, (features_df >= 0).all()]

# Perform Chi-Square feature selection
chi2_selector = SelectKBest(chi2, k='all')  # Select all features for ranking
chi2_selector.fit(non_negative_features_df, labels_np)

# Get the Chi-Square scores for each feature
chi2_scores = chi2_selector.scores_

# Create a DataFrame to store feature names and their corresponding Chi-Square scores
feature_scores = pd.DataFrame({
    'Feature': non_negative_features_df.columns,
    'Chi-Square Score': chi2_scores
})

# Sort features by their Chi-Square score in descending order
feature_scores_sorted = feature_scores.sort_values(by='Chi-Square Score', ascending=False)

# Display the top-ranked features
print("Top Ranked Features based on Chi-Square Scores:")
print(feature_scores_sorted.head(10))


Top Ranked Features based on Chi-Square Scores:
                  Feature  Chi-Square Score
61      Exclamation_Count               2.0
10                   item               1.5
45             broke just               1.5
46          days terrible               1.5
48  expected disappointed               1.5
18               terrible               1.5
16                quality               1.5
53              just days               1.5
11                   just               1.5
52             item broke               1.5


## Question 4 (10 points):
Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant docments. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similary with descending order.

In [7]:
# You code here (Please add comments in the code):
!pip install transformers torch scikit-learn

import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample text data
texts = [
    "This product is fantastic! I'm so happy with my purchase.",
    "The item broke after just two days. Terrible quality.",
    "Not bad, but I expected more. It's okay for the price.",
    "Amazing service! Will definitely buy again!",
    "The product was not what I expected. I am disappointed."
]

# Define the query
query = "I am satisfied with the quality of the service and the product."

# Function to get BERT embeddings
def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the [CLS] token's embedding (typically used for classification tasks)
    cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze()
    return cls_embedding

# Get embeddings for the query and each of the texts
query_embedding = get_bert_embedding(query)
text_embeddings = [get_bert_embedding(text) for text in texts]

# Calculate cosine similarity between query and each text
similarities = [cosine_similarity(query_embedding.reshape(1, -1), text_emb.reshape(1, -1))[0][0] for text_emb in text_embeddings]

# Create a DataFrame to rank the texts by similarity
similarity_df = pd.DataFrame({'Text': texts, 'Similarity': similarities})
similarity_df_sorted = similarity_df.sort_values(by='Similarity', ascending=False)

# Display the ranked texts based on similarity to the query
print("Ranked texts based on similarity to the query:")
print(similarity_df_sorted)







The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Ranked texts based on similarity to the query:
                                                Text  Similarity
4  The product was not what I expected. I am disa...    0.926592
0  This product is fantastic! I'm so happy with m...    0.917337
2  Not bad, but I expected more. It's okay for th...    0.884061
1  The item broke after just two days. Terrible q...    0.855123
3        Amazing service! Will definitely buy again!    0.839820


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:


'''
Please write you answer here:

This assignment helped me better understand how to extract and use features from text data for NLP tasks
like classification and similarity ranking. I learned how traditional methods like Bag of Words,
TF-IDF, and sentiment analysis turn text into numerical features, while modern techniques like BERT provide deeper,
context-aware embeddings. Handling data structure issues and feature selection challenges
made me appreciate how important clean, consistent data is.
Applying BERT for text similarity taught me how powerful deep learning models are for capturing meaning.
Overall, it was a hands-on experience in understanding core NLP concepts.



'''