## The third in-class exercise (due at 11:59 PM on 10/08/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [1]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:

Task: Sorting customer reviews into categories based on their star ratings (for example, 1 star, 2 stars, 3 stars, 4 stars, and 5 stars)
Features:
Bag-of-words: Ignoring the order in which the words appear, this feature represents a product review as a collection of its individual words. This can be helpful for capturing both the overall sentiment of the review and the individual subjects that are being discussed.
N-grams: In a text document, an N-gram is a sequence of N consecutive words. Unigrams (single words), bigrams (two-word combinations), and trigrams (three-word combinations) are frequently employed when classifying product reviews. N-grams capture local word context (for example, "not good" versus "good"), which is crucial for determining the review's tone.
Part-of-speech (POS) tags: Each word in a product review is marked with its part of speech, such as noun, verb, adjective, or adverb. POS tags can be helpful for deciphering the review's grammatical structure, which can reveal clues about its tone. For instance, the presence of certain adjectives (such as "excellent" or "terrible") might be a powerful predictor of the reviewer's mood.
Named entities: Named entities are terms or expressions that refer to particular individuals, locations, or organizations. Identifying named entities might offer helpful context for categorizing product reviews. For instance, a review that names a brand or a competing product may be making a comparison, which can hint at how satisfied the reviewer is.
Product ratings: The star rating attached to each review serves as the supervised learning signal (the label) for training the machine learning model. A customer's 5-star rating of a product, for instance, can be used to train the model to recognize other reviews that are likely to be favorable.
Domain-specific features: It could be beneficial to add features that are unique to the product domain in addition to the ones above. Features like camera quality, battery life, and performance, for instance, may be helpful if the product is a smartphone.
These features can be combined to train a machine learning model that categorizes product reviews into star ratings. Businesses can use this to learn more about customer sentiment and find opportunities to improve their goods and services.

'''

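The domain-specific features described in the answer above are not extracted anywhere in Question 2, so here is a minimal sketch of how they might be counted. The `ASPECT_KEYWORDS` lexicon and the `aspect_features` helper are hypothetical names introduced for illustration, and the keyword lists are assumptions rather than anything derived from data.

```python
import re
from collections import Counter

# Hypothetical aspect lexicon for a smartphone review domain; the keyword
# lists are illustrative assumptions, not taken from any dataset.
ASPECT_KEYWORDS = {
    "camera": {"camera", "photo", "photos", "lens"},
    "battery": {"battery", "charge", "charging"},
    "performance": {"performance", "speed", "lag", "fast", "slow"},
}

def aspect_features(review):
    """Count how often each aspect's keywords occur in a review."""
    # Simple lowercase tokenizer; a real pipeline might use word_tokenize instead.
    tokens = re.findall(r"[a-z']+", review.lower())
    counts = Counter()
    for aspect, keywords in ASPECT_KEYWORDS.items():
        counts[aspect] = sum(1 for token in tokens if token in keywords)
    return dict(counts)

review = "The camera is fantastic, and the battery just keeps on going."
print(aspect_features(review))
```

Each review then contributes one count per aspect, which can be appended to the other feature vectors before training.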

Question 2 (10 points): Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [2]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [3]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk

# Collect sample text data
sample_texts = [
    "This product has exceeded my expectations! The camera is fantastic, and the battery just keeps on going. I've been using it for two weeks, and I couldn't be happier.",
    "The customer service was terrible. I had a bad experience with this company.",
    "I love the design of this product, but the performance is disappointing.",
    "The weather is beautiful today. I'm going for a walk in the park.",
    "I'm planning a trip to Europe. I can't wait to explore new places and try different cuisines.",
    "The new software update fixed all the issues I was experiencing. It's much more stable now.",
    "I had a delicious meal at the restaurant last night. The food was amazing.",
    "I'm reading a fascinating book. The author's writing style is captivating."
]

# Load the English stop-word list once instead of on every token comparison
STOP_WORDS = set(stopwords.words('english'))

# Function to extract bag-of-words features
def extract_bag_of_words(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stop words (the comparison is case-sensitive, so 'The' is kept)
    tokens = [token for token in tokens if token not in STOP_WORDS]
    # Return the bag-of-words features
    return tokens

# Function to extract n-grams features
def extract_n_grams(text, n):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stop words (the comparison is case-sensitive, so 'The' is kept)
    tokens = [token for token in tokens if token not in STOP_WORDS]
    # Generate n-grams by sliding a window of size n over the tokens
    n_grams = [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    # Return the n-grams features
    return n_grams

# Function to extract part-of-speech tags features
def extract_pos_tags(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Tag the tokens with their part-of-speech tags
    pos_tags = pos_tag(tokens)
    # Return the part-of-speech tags features
    return pos_tags

# Function to extract named entities features
def extract_named_entities(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Tag the tokens with their part-of-speech tags
    pos_tags = pos_tag(tokens)
    # Chunk the tokens into named entities
    named_entities = ne_chunk(pos_tags)
    # Return the entity labels, skipping the root 'S' node that wraps the whole sentence
    return [ne.label() for ne in named_entities.subtrees(filter=lambda t: t.label() != 'S')]

# Function to extract all features
def extract_features(text):
    # Extract bag-of-words features
    bag_of_words_features = extract_bag_of_words(text)
    # Extract n-grams features
    n_grams_features = extract_n_grams(text, 1) + extract_n_grams(text, 2)
    # Extract part-of-speech tags features
    pos_tags_features = extract_pos_tags(text)
    # Extract named entities features
    named_entities_features = extract_named_entities(text)
    # Return all features
    return bag_of_words_features + n_grams_features + pos_tags_features + named_entities_features

# Extract features from sample text data
for i, text in enumerate(sample_texts):
    features = extract_features(text)
    print(f"Features for Sample Text {i + 1}:")
    print(features)
    print()



Features for Sample Text 1:
['This', 'product', 'exceeded', 'expectations', '!', 'The', 'camera', 'fantastic', ',', 'battery', 'keeps', 'going', '.', 'I', "'ve", 'using', 'two', 'weeks', ',', 'I', 'could', "n't", 'happier', '.', 'This', 'product', 'exceeded', 'expectations', '!', 'The', 'camera', 'fantastic', ',', 'battery', 'keeps', 'going', '.', 'I', "'ve", 'using', 'two', 'weeks', ',', 'I', 'could', "n't", 'happier', '.', 'This product', 'product exceeded', 'exceeded expectations', 'expectations !', '! The', 'The camera', 'camera fantastic', 'fantastic ,', ', battery', 'battery keeps', 'keeps going', 'going .', '. I', "I 've", "'ve using", 'using two', 'two weeks', 'weeks ,', ', I', 'I could', "could n't", "n't happier", 'happier .', ('This', 'DT'), ('product', 'NN'), ('has', 'VBZ'), ('exceeded', 'VBN'), ('my', 'PRP$'), ('expectations', 'NNS'), ('!', '.'), ('The', 'DT'), ('camera', 'NN'), ('is', 'VBZ'), ('fantastic', 'JJ'), (',', ','), ('and', 'CC'), ('the', 'DT'), ('battery', 'NN')
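The output above is a flat list of tokens, n-grams, and tag pairs, which a classifier cannot consume directly. As a rough sketch of the next step, the snippet below turns unigram and bigram counts into fixed-length vectors over a shared vocabulary. The regex tokenizer and the helper names (`tokenize`, `ngram_counts`, `to_vector`) are assumptions made to keep the example dependency-free; they are not part of the notebook's NLTK pipeline.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters (a simplification of word_tokenize)."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(text, n):
    """Return a Counter mapping each n-gram (joined by spaces) to its frequency."""
    tokens = tokenize(text)
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)

def to_vector(counts, vocabulary):
    """Project a Counter onto a fixed, ordered vocabulary."""
    return [counts.get(term, 0) for term in vocabulary]

docs = [
    "The food was amazing.",
    "The customer service was terrible.",
]

# Combine unigram and bigram counts per document
doc_counts = [ngram_counts(d, 1) + ngram_counts(d, 2) for d in docs]

# Shared vocabulary across all documents, sorted for a stable column order
vocabulary = sorted(set().union(*doc_counts))
matrix = [to_vector(c, vocabulary) for c in doc_counts]

for row in matrix:
    print(row)
```

Every row has one column per vocabulary entry, so documents of different lengths become comparable vectors.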

Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [4]:
import re
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
from sklearn.feature_selection import chi2
from sklearn.feature_extraction.text import TfidfVectorizer

# Define sample texts
sample_texts = [
    "This product has exceeded my expectations! The camera is fantastic, and the battery just keeps on going. I've been using it for two weeks, and I couldn't be happier.",
    "The customer service was terrible. I had a bad experience with this company.",
    "I love the design of this product, but the performance is disappointing.",
    "The weather is beautiful today. I'm going for a walk in the park.",
    "I'm planning a trip to Europe. I can't wait to explore new places and try different cuisines.",
    "The new software update fixed all the issues I was experiencing. It's much more stable now.",
    "I had a delicious meal at the restaurant last night. The food was amazing.",
    "I'm reading a fascinating book. The author's writing style is captivating."
]

# Function to extract features from text
def extract_features(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stop words
    tokens = [token for token in tokens if token not in stopwords.words('english')]
    return tokens

# Initialize a list to store features for each text
features = []

# Extract features from each sample text
for text in sample_texts:
    features.append(extract_features(text))

# Convert text features to numerical representations using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform([" ".join(tokens) for tokens in features])

# Define labels corresponding to each sample text (for example, sentiment labels)
labels = ["positive", "negative", "negative", "positive", "positive", "positive", "positive", "positive"]

# Calculate Chi-squared scores (and p-values) for each feature
chi2_scores, p_values = chi2(X, labels)

# Create a list of (feature, score) tuples for ranking
feature_scores = list(zip(tfidf_vectorizer.get_feature_names_out(), chi2_scores))

# Sort the features based on their Chi-squared scores in descending order
sorted_feature_scores = sorted(feature_scores, key=lambda x: x[1], reverse=True)

# Print the ranked features based on Chi-squared scores
for i, (feature, score) in enumerate(sorted_feature_scores, start=1):
    print(f"Rank {i}: Feature = {feature}, Chi-squared Score = {score:.4f}")


Rank 1: Feature = design, Chi-squared Score = 1.3834
Rank 2: Feature = disappointing, Chi-squared Score = 1.3834
Rank 3: Feature = love, Chi-squared Score = 1.3834
Rank 4: Feature = performance, Chi-squared Score = 1.3834
Rank 5: Feature = bad, Chi-squared Score = 1.2000
Rank 6: Feature = company, Chi-squared Score = 1.2000
Rank 7: Feature = customer, Chi-squared Score = 1.2000
Rank 8: Feature = experience, Chi-squared Score = 1.2000
Rank 9: Feature = service, Chi-squared Score = 1.2000
Rank 10: Feature = terrible, Chi-squared Score = 1.2000
Rank 11: Feature = product, Chi-squared Score = 0.4870
Rank 12: Feature = going, Chi-squared Score = 0.1875
Rank 13: Feature = new, Chi-squared Score = 0.1788
Rank 14: Feature = beautiful, Chi-squared Score = 0.1366
Rank 15: Feature = park, Chi-squared Score = 0.1366
Rank 16: Feature = today, Chi-squared Score = 0.1366
Rank 17: Feature = walk, Chi-squared Score = 0.1366
Rank 18: Feature = weather, Chi-squared Score = 0.1366
Rank 19: Feature = amazi
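To make the ranking above easier to interpret, here is a small sketch of the chi-squared statistic computed by hand from a 2x2 table of term presence versus class membership. This is the document-frequency formulation commonly used for text feature selection; it is not the same computation as `sklearn.feature_selection.chi2` applied to TF-IDF weights, so the numbers will differ from the scores printed above. The toy token sets are hand-picked assumptions loosely based on the sample texts, and `chi_squared` is a hypothetical helper name.

```python
def chi_squared(term, docs, labels, target):
    """Chi-squared statistic for the 2x2 table of term presence vs. class membership."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        in_class = label == target
        if present and in_class:
            a += 1  # term present, document in target class
        elif present:
            b += 1  # term present, document in another class
        elif in_class:
            c += 1  # term absent, document in target class
        else:
            d += 1  # term absent, document in another class
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # term appears in every document or in none
    return n * (a * d - b * c) ** 2 / denominator

# Toy token sets (assumed for illustration) with the notebook's sentiment labels
docs = [
    {"product", "camera", "battery"},
    {"customer", "service", "terrible"},
    {"design", "product", "disappointing"},
    {"weather", "walk", "park"},
    {"trip", "europe"},
    {"software", "update", "stable"},
    {"meal", "restaurant", "food"},
    {"book", "author"},
]
labels = ["positive", "negative", "negative", "positive",
          "positive", "positive", "positive", "positive"]

for term in ["terrible", "product", "weather"]:
    print(term, round(chi_squared(term, docs, labels, "negative"), 4))
```

A term that occurs only in negative reviews scores higher than one spread across both classes, which matches the intuition behind the ranking.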

Question 4 (10 points): Write Python code to rank the texts based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [5]:
pip install transformers

Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Insta

In [6]:
pip install tensorflow-hub



In [7]:
pip install tensorflow-text

Collecting tensorflow-text
  Downloading tensorflow_text-2.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.5 MB)
Collecting tensorflow<2.15,>=2.14.0 (from tensorflow-text)
  Downloading tensorflow-2.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (489.8 MB)
Collecting ml-dtypes==0.2.0 (from tensorflow<2.15,>=2.14.0->tensorflow-text)
  Downloading ml_dtypes-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting wrapt<1.15,>=1.11.0 (from tensorflow<2.15,>=2.14.0->tensorflow-text)
  Downloading wrapt-1.14.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_

In [8]:
pip install --upgrade tensorflow




In [9]:
pip install torch

In [10]:
pip install tensorflow tensorflow-hub tensorflow-text




In [11]:
import re
import nltk
import torch
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Initialize BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

# Sample texts
sample_texts = [
    "This product has exceeded my expectations! The camera is fantastic, and the battery just keeps on going. I've been using it for two weeks, and I couldn't be happier.",
    "The customer service was terrible. I had a bad experience with this company.",
    "I love the design of this product, but the performance is disappointing.",
    "The weather is beautiful today. I'm going for a walk in the park.",
    "I'm planning a trip to Europe. I can't wait to explore new places and try different cuisines.",
    "The new software update fixed all the issues I was experiencing. It's much more stable now.",
    "I had a delicious meal at the restaurant last night. The food was amazing.",
    "I'm reading a fascinating book. The author's writing style is captivating."
]

# Target text for comparison
target_text = """
The new software update fixed all the issues I was experiencing. It's much more stable now."""

# Extract features from the provided sample texts and target text
def extract_features(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stop words
    tokens = [token for token in tokens if token not in stopwords.words('english')]
    # Join tokens back into a single text
    clean_text = ' '.join(tokens)
    # Tokenize and encode using BERT tokenizer
    inputs = tokenizer(clean_text, return_tensors='pt', padding=True, truncation=True)
    # Get BERT embeddings for the text
    with torch.no_grad():
        outputs = bert_model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1)  # Average pooling over tokens
    # Convert embeddings to a numpy array
    return embeddings.numpy()

# Calculate cosine similarity between the target text and each sample text
similarities = []
target_embeddings = extract_features(target_text)
for text in sample_texts:
    text_embeddings = extract_features(text)
    similarity = cosine_similarity(target_embeddings, text_embeddings)[0][0]
    similarities.append((text, similarity))

# Rank the similarities in descending order
ranked_similarities = sorted(similarities, key=lambda x: x[1], reverse=True)

# Print the ranked similarities
for i, (text, similarity) in enumerate(ranked_similarities, start=1):
    print(f"Rank {i}: Similarity = {similarity:.4f}")
    print(text)
    print()



Rank 1: Similarity = 1.0000
The new software update fixed all the issues I was experiencing. It's much more stable now.

Rank 2: Similarity = 0.7257
This product has exceeded my expectations! The camera is fantastic, and the battery just keeps on going. I've been using it for two weeks, and I couldn't be happier.

Rank 3: Similarity = 0.6857
The weather is beautiful today. I'm going for a walk in the park.

Rank 4: Similarity = 0.6615
I'm reading a fascinating book. The author's writing style is captivating.

Rank 5: Similarity = 0.6533
I love the design of this product, but the performance is disappointing.

Rank 6: Similarity = 0.6452
I had a delicious meal at the restaurant last night. The food was amazing.

Rank 7: Similarity = 0.6381
I'm planning a trip to Europe. I can't wait to explore new places and try different cuisines.

Rank 8: Similarity = 0.5893
The customer service was terrible. I had a bad experience with this company.
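The ranking above relies on cosine similarity between mean-pooled BERT vectors. As a minimal sketch of that final step only, the snippet below computes cosine similarity in plain Python and sorts documents by score. The three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_sim` and `rank_by_similarity` are hypothetical helper names (chosen so they do not shadow the `cosine_similarity` imported from scikit-learn).

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for a zero vector; treat it as 0
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_sim(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for embeddings (assumed values for illustration)
query = [1.0, 0.0, 1.0]
documents = {
    "doc_a": [1.0, 0.0, 1.0],
    "doc_b": [1.0, 1.0, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

for rank, (name, score) in enumerate(rank_by_similarity(query, documents), start=1):
    print(f"Rank {rank}: {name}, similarity = {score:.4f}")
```

Because cosine similarity depends only on direction, a document identical to the query always scores 1.0, which is why the sample text reused as the target appears at rank 1 above.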

