## The third in-class exercise (due at 11:59 PM on 10/08/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
Analysis of sentiment in social media comments is a fascinating text classification or text mining task.
Finding the sentiment or emotional tone communicated in a text—which may be positive, negative, or neutral—is the goal of sentiment analysis.
Examining social media comments can offer insightful information on what the general public thinks of particular issues, goods, or businesses.
In order to construct a machine learning model for this assignment, the following feature categories can be helpful:

Bag of Words (BoW): This word-frequency representation turns texts into vectors.
Every distinct word in the corpus is turned into a feature, and each word's frequency in a comment is measured.
BoW is helpful for identifying the occurrence of particular words or phrases that can be indicative of emotion, such as "love," "hate," or "great."

Term Frequency-Inverse Document Frequency (TF-IDF): TF-IDF is a BoW variant that weights each word by how distinctive it is in the corpus, not only by how often it appears.
It gives higher weights to terms that are frequent in one document but relatively rare across the entire corpus. This can help surface words that are specific to a particular sentiment.

Word Embeddings (such as GloVe and Word2Vec): Word embeddings represent words as dense vectors in a continuous space that captures the semantic relationships between them.
Using these embeddings as features lets the model use the meaning of the words in a comment rather than only their surface form.
This helps capture subtler emotional cues, such as words that are close in meaning to known sentiment words.

Part-of-Speech (POS) Tags: Words in a text are given grammatical labels, such as noun, verb, or adjective, using POS tagging.
Analyzing the distribution of POS tags in a comment can reveal syntactic patterns associated with sentiment.
For instance, a strong sentiment may be shown by a high frequency of adjectives or adverbs.

Emoticons and Emoji Analysis: Emoticons and emojis are frequently used in social media comments to convey emotion.
For the purpose of classifying sentiment, these symbols can be extracted and examined as features.

Sentiment Lexicons: Features can be derived from lexicons, that is, curated lists of words with strong emotional associations.
Each lexicon contains terms known to be associated with a particular emotion. Counting how often terms from each lexicon appear in a comment gives a signal of the sentiment it expresses.

N-grams: An n-gram is a sequence of n consecutive words in a text.
Using n-grams as features makes it possible to capture sentiment-bearing phrases that are not apparent from individual words.
For example, the bigrams "not good" and "very happy" carry sentiment that the separate words do not.

Sentiment Scores from Existing Tools: To estimate the sentiment of each comment, you can use off-the-shelf sentiment analyzers such as VADER or TextBlob.
Their scores can act as features, simplifying the pipeline and giving the model a baseline signal to start from.

You may build a strong sentiment analysis system that accurately categorizes social media comments into positive,
negative, or neutral categories and provides deeper insights into the public sentiment by putting these features into a machine learning model.





'''
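To make the TF-IDF description above concrete, here is a minimal sketch that computes the weights by hand with the textbook formula tf × log(N / df). The toy corpus and the helper name `tf_idf` are assumptions made for illustration; library implementations such as scikit-learn's `TfidfVectorizer` add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized (not the assignment's data).
docs = [
    ["great", "phone", "great", "battery"],
    ["terrible", "phone"],
    ["phone", "arrived", "today"],
]

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
for doc_weights in weights:
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

In this sketch "phone" appears in every document, so its weight is zero, while "great" is frequent in the first document and absent elsewhere, so it receives the largest weight there.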

Question 2 (10 points): Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [2]:
pip install emoji

Collecting emoji
  Downloading emoji-2.8.0-py2.py3-none-any.whl (358 kB)
Installing collected packages: emoji
Successfully installed emoji-2.8.0


In [5]:
# Your code here (please add comments in the code):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import emoji
import re

# Sample text data
sample_text = [
    "I love this product! It's amazing.",
    "This is terrible, I hate it!",
    "Neutral comment with no strong feelings.",
    "😊 Great experience with their customer service! 😃",
    "I can't believe how bad this is. 😡",
]

# Preprocess text: tokenize, remove stopwords, and convert to lowercase
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    tokens = word_tokenize(text)  # Tokenize the text
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return ' '.join(tokens)

preprocessed_text = [preprocess_text(text) for text in sample_text]

# 1. Bag of Words (BoW)
vectorizer = CountVectorizer()
bow_features = vectorizer.fit_transform(preprocessed_text)
print("BoW Features:")
print(bow_features.toarray())

# 2. TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(preprocessed_text)
print("\nTF-IDF Features:")
print(tfidf_features.toarray())

# 3. Emoticons and Emoji Analysis
def extract_emojis(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F700-\U0001F77F"  # alchemical symbols
                           u"\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                           u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                           u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                           u"\U0001FA00-\U0001FA6F"  # Chess Symbols
                           u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                           u"\U0001FB00-\U0001FBFF"  # Symbols for Legacy Computing
                           u"\U0001FC00-\U0001FCFF"  # currently unassigned code points
                           u"\U0001F004-\U0001F0CF"  # Mahjong tiles, dominoes, playing cards
                           u"\U0001F170-\U0001F251"  # Enclosed Alphanumeric/Ideographic Supplement
                           "]+", flags=re.UNICODE)
    return emoji_pattern.findall(text)

emoji_features = [extract_emojis(text) for text in sample_text]
print("\nEmoji Features:")
for emojis in emoji_features:
    print(emojis)

# 4. N-grams (2-grams)
ngram_vectorizer = CountVectorizer(ngram_range=(2, 2))
ngram_features = ngram_vectorizer.fit_transform(preprocessed_text)
print("\n2-gram Features:")
print(ngram_features.toarray())

# 5. Sentiment Lexicons
positive_lexicon = ["love", "amazing", "great"]
negative_lexicon = ["terrible", "hate", "bad"]

def count_lexicon_words(text, lexicon):
    words = text.split()
    count = sum(1 for word in words if word in lexicon)
    return count

positive_lexicon_features = [count_lexicon_words(text, positive_lexicon) for text in preprocessed_text]
negative_lexicon_features = [count_lexicon_words(text, negative_lexicon) for text in preprocessed_text]

print("\nPositive Lexicon Features:", positive_lexicon_features)
print("Negative Lexicon Features:", negative_lexicon_features)



BoW Features:
[[1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1]
 [0 0 0 0 1 0 0 1 0 0 0 1 0 0 1 0]
 [0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0]
 [0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0]]

TF-IDF Features:
[[0.57735027 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.57735027 0.
  0.57735027 0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.70710678 0.         0.
  0.         0.         0.         0.70710678]
 [0.         0.         0.         0.         0.5        0.
  0.         0.5        0.         0.         0.         0.5
  0.         0.         0.5        0.        ]
 [0.         0.         0.         0.         0.         0.5
  0.5        0.         0.5        0.         0.         0.
  0.         0.5        0.         0.        ]
 [0.         0.57735027 0.57735027 0.57735027 0.         0.
  0.         0.         0.         0.         0.         0.
  

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
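One caveat about the preprocessing above: NLTK's English stopword list includes negations and intensifiers such as "not", "no", and "very", so phrases like "not good" lose their first word before the bigrams are built. The sketch below shows one possible workaround using only the standard library. The reduced stopword set `SMALL_STOPWORDS` and the helper `bigrams` are hypothetical names chosen for illustration, not part of NLTK or scikit-learn.

```python
import re
from collections import Counter

# Hypothetical stopword list for illustration; it deliberately omits
# negations and intensifiers such as "not", "no", and "very".
SMALL_STOPWORDS = {"i", "this", "is", "it", "the", "a", "with", "was"}

def bigrams(text, stopwords=SMALL_STOPWORDS):
    """Lowercase, keep word tokens (letters and apostrophes), drop stopwords,
    and return the adjacent token pairs."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in stopwords]
    return list(zip(tokens, tokens[1:]))

comments = ["This is not good", "I was very happy with it"]
bigram_counts = Counter(bg for comment in comments for bg in bigrams(comment))
print(bigram_counts)
```

With this list the pairs ("not", "good") and ("very", "happy") survive as features. Whether to keep such words is a design choice that depends on the task.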


Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, and rank the features by importance in descending order.

In [11]:
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Sample text data
sample_text = [
    "I love this product! It's amazing.",
    "This is terrible, I hate it!",
    "Neutral comment with no strong feelings.",
    "😊 Great experience with their customer service! 😃",
    "I can't believe how bad this is. 😡",
]

# Labels for the text data (three classes, for demonstration)
labels = [1, 0, 2, 1, 0]  # 1 for positive, 0 for negative, 2 for neutral

# Preprocess text: tokenize, remove stopwords, and convert to lowercase
# (You can reuse the preprocess_text function from the previous code)

preprocessed_text = [preprocess_text(text) for text in sample_text]

# Define a list of common emojis to check for in the text
common_emojis = ['😊', '😃', '😡']  # Add more emojis as needed

# Convert emoji features to binary values (1 if emoji present, 0 if not)
emoji_features_binary = [[1 if emoji in text else 0 for emoji in common_emojis] for text in sample_text]

# Combine all features into a feature matrix
# For this example, we'll use BoW and emoji features
feature_matrix = np.concatenate((bow_features.toarray(), emoji_features_binary), axis=1)

# Apply Mutual Information feature selection
num_features_to_select = 5  # You can adjust this number as needed
selector = SelectKBest(score_func=mutual_info_classif, k=num_features_to_select)
selector.fit(feature_matrix, labels)

# Rank every feature column by its mutual information score, highest first
# (this ranks all columns, not only the k selected by SelectKBest)
selected_feature_indices = np.argsort(selector.scores_)[::-1]

# Label each feature by its 1-based column position in feature_matrix
selected_features = [f"Feature {index + 1}" for index in selected_feature_indices]

# Display the ranked features in descending order
print("Ranked Features Based on Mutual Information (Descending Order):")
for index, feature_name in enumerate(selected_features):
    print(f"{index + 1}: {feature_name}")


Ranked Features Based on Mutual Information (Descending Order):
1: Feature 10
2: Feature 6
3: Feature 5
4: Feature 4
5: Feature 16
6: Feature 1
7: Feature 8
8: Feature 3
9: Feature 2
10: Feature 13
11: Feature 12
12: Feature 11
13: Feature 18
14: Feature 9
15: Feature 7
16: Feature 14
17: Feature 15
18: Feature 17
19: Feature 19
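To illustrate what a mutual information score measures, the following sketch computes it directly from counts for discrete features, using I(X;Y) = Σ p(x,y) log(p(x,y) / (p(x) p(y))). The function `mutual_information` and the two example feature columns are hypothetical. Note also that, as far as I understand its defaults, scikit-learn's `mutual_info_classif` treats dense input as continuous and uses a nearest-neighbor estimator, so its scores need not match this count-based formula.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    count_x = Counter(feature)
    count_y = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x,y) / (p(x) p(y)) expressed with raw counts.
        mi += p_xy * math.log(count * n / (count_x[x] * count_y[y]))
    return mi

labels = [1, 0, 2, 1, 0]        # same label scheme as the cell above
informative = [1, 0, 0, 1, 0]   # hypothetical feature: present only in class 1
constant = [0, 0, 0, 0, 0]      # hypothetical feature: never present

print(mutual_information(informative, labels))
print(mutual_information(constant, labels))
```

A feature that never varies carries no information about the label, so its score is zero, whereas a feature that lines up with one class scores higher.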


Question 4 (10 points): Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [7]:
pip install sentence-transformers


Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0 (from sentence-transformers)
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0 (from sentence-transformers)
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)

In [8]:
# Your code here (please add comments in the code):

import torch
from sentence_transformers import SentenceTransformer, util

# Sample text data
sample_text = [
    "I love this product! It's amazing.",
    "This is terrible, I hate it!",
    "Neutral comment with no strong feelings.",
    "😊 Great experience with their customer service! 😃",
    "I can't believe how bad this is. 😡",
]

# Query
query = "I had a wonderful experience with their customer service."

# Load pre-trained BERT model for sentence embeddings
model_name = "bert-base-nli-mean-tokens"
model = SentenceTransformer(model_name)

# Encode the query and text data
query_embedding = model.encode(query, convert_to_tensor=True)
text_embeddings = model.encode(sample_text, convert_to_tensor=True)

# Calculate cosine similarities
similarities = util.pytorch_cos_sim(query_embedding, text_embeddings)

# Convert similarities to a list
similarities = similarities.tolist()[0]

# Rank documents by similarity (in descending order)
similarity_ranking = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)

# Display the ranked documents
print("Text Similarity Ranking:")
for index, similarity_score in similarity_ranking:
    print(f"Document {index + 1}: Similarity Score = {similarity_score:.4f}")
    print(sample_text[index])
    print()




Text Similarity Ranking:
Document 4: Similarity Score = 0.9181
😊 Great experience with their customer service! 😃

Document 1: Similarity Score = 0.7909
I love this product! It's amazing.

Document 5: Similarity Score = 0.2880
I can't believe how bad this is. 😡

Document 3: Similarity Score = 0.2269
Neutral comment with no strong feelings.

Document 2: Similarity Score = 0.1333
This is terrible, I hate it!
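For reference, the cosine similarity used in the ranking above is the dot product of two vectors divided by the product of their lengths. The sketch below computes it with plain Python lists and ranks a few documents against a query. The 3-dimensional vectors and the names `doc_a`, `doc_b`, and `doc_c` are hypothetical stand-ins; real sentence embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings" chosen so the geometry is easy to see.
query_vec = [1.0, 2.0, 0.0]
doc_vecs = {
    "doc_a": [2.0, 4.0, 0.0],   # same direction as the query
    "doc_b": [0.0, 0.0, 3.0],   # orthogonal to the query
    "doc_c": [1.0, 0.0, 0.0],   # partially aligned with the query
}

# Rank documents by similarity to the query, highest first.
ranking = sorted(
    ((name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

Because the measure depends only on direction, `doc_a` ranks first even though it is twice as long as the query vector.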

