
## The third In-class-exercise (due on 11:59 PM 10/08/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
'''
I want to categorize customer reviews of a mobile phone by aspect: "Performance", "Camera Quality", "Battery Life", "Design", and "Overall Satisfaction". The features I would use, and why each might be helpful, are listed below:
Word Frequency Features, Sentiment Features, N-gram Features (bi-grams), Part-of-Speech (POS) Tag Features, Named Entity Features, Length-based Features.


1. Word Frequency Features:
These features capture the frequency of specific words in the review.
Helpful for identifying common keywords associated with different categories. For example, "battery life" might be a common phrase associated with the Battery Life category.

2. Sentiment Features:
These features quantify the sentiment of the review (positive, negative, neutral).
Useful for identifying the overall sentiment towards different aspects of the mobile. For instance, a negative sentiment in the "Camera Quality" aspect might indicate dissatisfaction.

3. N-gram Features (bi-grams):
These features represent sequences of two words (bi-grams).
Useful for capturing context and phrases that might be indicative of specific categories. For example, "amazing camera" might be indicative of a positive sentiment towards Camera Quality.

4. Part-of-Speech (POS) Tag Features:
These features label each word with its part of speech (e.g., noun, verb, adjective).
Useful for identifying the linguistic structure of the review. For instance, identifying adjectives like "amazing" might indicate positive sentiment.

5. Named Entity Features:
These features identify entities like names of people, organizations, locations, etc.
Useful for detecting mentions of specific entities that may be relevant to the categories. For example, if a review mentions a specific mobile brand or model, it might be relevant to the Design or Overall Satisfaction category.

6. Length-based Features:
These features include the length of the review, average word length, etc.
Useful for capturing the complexity or verbosity of the review. For instance, a longer review might contain more detailed feedback about different aspects of the mobile.

These features provide a diverse set of information about the text, allowing the machine learning model to capture various aspects of the reviews, which can help it predict the categories more accurately. Their effectiveness may vary with the specific dataset and task, so it is good practice to experiment with different feature sets and evaluate their performance. Length-based features in particular give a quantitative measure of how detailed or verbose a review is.
'''

Question 2 (10 points): Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [None]:
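import nltk
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.util import ngrams
from nltk.sentiment import SentimentIntensityAnalyzer

# Download every NLTK resource used below: tokenizer, POS tagger, VADER lexicon, NE chunker.
# Resource names can differ between NLTK releases (newer ones may ask for e.g. 'punkt_tab').
for resource in ['punkt', 'averaged_perceptron_tagger', 'vader_lexicon', 'maxent_ne_chunker', 'words']:
    nltk.download(resource)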

# Sample customer reviews
reviews = [
    "The battery life is amazing, but the camera quality could be better.",
    "The phone's design is sleek and elegant.",
    "I had a little trouble with the installation process, but the support team was quick to help. Great service!",
    "I'm very satisfied with the overall performance of this mobile.",  # comma was missing here, which silently merged this review with the next
    "Excellent processor makes me to take this mobile",
    "Amazing camera performance but the night mode could be better",
    "Mobile phone is very good, go for it"
]

# 1. Word Frequency Features
def word_frequency(review):
    tokens = word_tokenize(review.lower())
    return {word: tokens.count(word) for word in set(tokens)}

# 2. Sentiment Features
def sentiment_features(review):
    sia = SentimentIntensityAnalyzer()
    sentiment = sia.polarity_scores(review)
    return sentiment

# 3. N-gram Features (bi-grams)
def ngram_features(review, n=2):
    tokens = word_tokenize(review.lower())
    return [' '.join(gram) for gram in ngrams(tokens, n)]

# 4. Part-of-Speech (POS) Tag Features
def pos_tag_features(review):
    tokens = word_tokenize(review)
    return [tag for word, tag in pos_tag(tokens)]

# 5. Named Entity Features
def named_entity_features(review):
    tokens = word_tokenize(review)
    entities = ne_chunk(pos_tag(tokens))
    return [chunk.label() for chunk in entities if hasattr(chunk, 'label')]

# 6. Length-based Features
def length_features(review):
    tokens = word_tokenize(review)
    return {
        'num_words': len(tokens),
        'avg_word_length': sum(len(word) for word in tokens) / len(tokens)
    }

# Extract features for each review
for review in reviews:
    print("\nReview:", review)
    print("Word Frequency:", word_frequency(review))
    print("Sentiment Features:", sentiment_features(review))
    print("Bi-grams:", ngram_features(review))
    print("POS Tags:", pos_tag_features(review))
    print("Named Entities:", named_entity_features(review))
    print("Length Features:", length_features(review))


[nltk_data] Downloading package maxent_ne_chunker to C:\Users\Sai
[nltk_data]     Pavan\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to C:\Users\Sai
[nltk_data]     Pavan\AppData\Roaming\nltk_data...



Review: The battery life is amazing, but the camera quality could be better.
Word Frequency: {'could': 1, 'is': 1, 'life': 1, ',': 1, '.': 1, 'amazing': 1, 'camera': 1, 'be': 1, 'quality': 1, 'but': 1, 'battery': 1, 'better': 1, 'the': 2}
Sentiment Features: {'neg': 0.0, 'neu': 0.615, 'pos': 0.385, 'compound': 0.7391}
Bi-grams: ['the battery', 'battery life', 'life is', 'is amazing', 'amazing ,', ', but', 'but the', 'the camera', 'camera quality', 'quality could', 'could be', 'be better', 'better .']
POS Tags: ['DT', 'NN', 'NN', 'VBZ', 'JJ', ',', 'CC', 'DT', 'NN', 'NN', 'MD', 'VB', 'JJR', '.']
Named Entities: []
Length Features: {'num_words': 14, 'avg_word_length': 4.071428571428571}

Review: The phone's design is sleek and elegant.
Word Frequency: {'is': 1, 'elegant': 1, '.': 1, 'and': 1, 'phone': 1, 'sleek': 1, "'s": 1, 'the': 1, 'design': 1}
Sentiment Features: {'neg': 0.0, 'neu': 0.659, 'pos': 0.341, 'compound': 0.4767}
Bi-grams: ['the phone', "phone 's", "'s design", 'design is',

[nltk_data]   Unzipping corpora\words.zip.
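
The output above shows that `word_tokenize` keeps punctuation such as `,` and `.` as tokens, so they are counted as words and pull down `avg_word_length`. Below is a minimal sketch of one way to avoid that, assuming a simple regular-expression tokenizer in place of NLTK. The names `simple_tokens` and `clean_features` are hypothetical and not part of the original code.

```python
import re
from collections import Counter

def simple_tokens(review):
    """Lowercase the review and keep only runs of letters/apostrophes (assumed tokenizer)."""
    return re.findall(r"[a-z']+", review.lower())

def clean_features(review):
    """Word counts and length features computed without punctuation tokens."""
    tokens = simple_tokens(review)
    n = len(tokens)
    return {
        'word_counts': Counter(tokens),
        'num_words': n,
        # Guard against empty input to avoid dividing by zero
        'avg_word_length': sum(len(t) for t in tokens) / n if n else 0.0,
    }

features = clean_features("The battery life is amazing, but the camera quality could be better.")
print(features['num_words'])           # 12 (the NLTK version above reports 14)
print(features['word_counts']['the'])  # 2
```

Unlike `word_tokenize`, this tokenizer keeps "phone's" as a single token, so it is a trade-off rather than a drop-in replacement.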


Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [None]:
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_extraction.text import CountVectorizer

# Sample labels for the reviews (for demonstration purposes)
labels = ["Battery Life", "Design", "Support", "Performance", "Camera Quality", "Overall Satisfaction"]

# Sample customer reviews
reviews = [
    "The battery life is amazing, but the camera quality could be better.",
    "The phone's design is sleek and elegant.",
    "I had a little trouble with the installation process, but the support team was quick to help. Great service!",
    "I'm very satisfied with the overall performance of this mobile. Excellent processor makes me take this mobile",
    "Amazing camera performance but the night mode could be better",
    "Mobile phone is very good, go for it"
]

# 1. Create a Bag-of-Words representation of the reviews
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)

# Check if the number of labels matches the number of samples
if len(labels) != X.shape[0]:
    raise ValueError("Number of labels must match the number of samples.")

# 2. Calculate Mutual Information scores
mi_scores = mutual_info_classif(X, labels)

# 3. Create a list of feature names (in this case, words)
feature_names = vectorizer.get_feature_names_out()

# 4. Combine feature names and their corresponding MI scores
features_with_scores = list(zip(feature_names, mi_scores))

# 5. Sort features by MI scores in descending order
sorted_features = sorted(features_with_scores, key=lambda x: x[1], reverse=True)

# Print the sorted features
for feature, score in sorted_features:
    print(f"Feature: {feature}, MI Score: {score:.4f}")


Feature: the, MI Score: 1.0114
Feature: mobile, MI Score: 0.8676
Feature: but, MI Score: 0.6931
Feature: is, MI Score: 0.6931
Feature: amazing, MI Score: 0.6365
Feature: be, MI Score: 0.6365
Feature: better, MI Score: 0.6365
Feature: camera, MI Score: 0.6365
Feature: could, MI Score: 0.6365
Feature: performance, MI Score: 0.6365
Feature: phone, MI Score: 0.6365
Feature: very, MI Score: 0.6365
Feature: with, MI Score: 0.6365
Feature: and, MI Score: 0.4506
Feature: battery, MI Score: 0.4506
Feature: design, MI Score: 0.4506
Feature: elegant, MI Score: 0.4506
Feature: excellent, MI Score: 0.4506
Feature: for, MI Score: 0.4506
Feature: go, MI Score: 0.4506
Feature: good, MI Score: 0.4506
Feature: great, MI Score: 0.4506
Feature: had, MI Score: 0.4506
Feature: help, MI Score: 0.4506
Feature: installation, MI Score: 0.4506
Feature: it, MI Score: 0.4506
Feature: life, MI Score: 0.4506
Feature: little, MI Score: 0.4506
Feature: makes, MI Score: 0.4506
Feature: me, MI Score: 0.4506
Feature: mod
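
A note on reading these scores: each label in this toy dataset appears exactly once, so the label fully determines every word count and the mutual information collapses to the entropy of the word's count distribution. The sketch below recomputes three of the scores by hand in plain Python (natural log). The helper names are hypothetical, and the count lists are read off the six reviews above.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (natural log) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def mutual_information(feature, labels):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for two discrete sequences."""
    return entropy(feature) + entropy(labels) - entropy(list(zip(feature, labels)))

labels = ["Battery Life", "Design", "Support", "Performance",
          "Camera Quality", "Overall Satisfaction"]

# Per-review counts of each word, in the same order as the reviews above
the_counts = [2, 1, 2, 1, 1, 0]
but_counts = [1, 0, 1, 0, 1, 0]
battery_counts = [1, 0, 0, 0, 0, 0]

mi_the = mutual_information(the_counts, labels)
mi_but = mutual_information(but_counts, labels)
mi_battery = mutual_information(battery_counts, labels)

print(f"the: {mi_the:.4f}")          # the: 1.0114
print(f"but: {mi_but:.4f}")          # but: 0.6931
print(f"battery: {mi_battery:.4f}")  # battery: 0.4506
```

The ranking therefore reflects how evenly a word's counts are spread across the six reviews, not how well the word separates categories, which is why stop words like "the" and "but" score highest. Several labeled reviews per category would be needed for the scores to measure class relevance.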

Question 4 (10 points): Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [None]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Sample customer reviews
reviews = [
    "The battery life is amazing, but the camera quality could be better.",
    "The phone's design is sleek and elegant.",
    "I had a little trouble with the installation process, but the support team was quick to help. Great service!",
    "I'm very satisfied with the overall performance of this mobile.",
    "Excellent processor makes me to take this mobile",
    "Amazing camera performance but the night mode could be better",
    "Mobile phone is very good, go for it"
]

# Define the query
query = "I'm looking for a mobile phone with a great camera quality and long battery life."

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize the query and mean-pool BERT's last hidden state into a single vector
# (torch.no_grad: this is inference only, so gradient tracking is not needed)
query_tokens = tokenizer(query, return_tensors='pt', padding=True, truncation=True)
with torch.no_grad():
    query_output = model(**query_tokens).last_hidden_state.mean(dim=1)

# Tokenize and embed each review the same way
review_tensors = [tokenizer(review, return_tensors='pt', padding=True, truncation=True) for review in reviews]
with torch.no_grad():
    review_outputs = [model(**tokens).last_hidden_state.mean(dim=1) for tokens in review_tensors]

# Calculate cosine similarity between query and reviews
similarities = [cosine_similarity(query_output.detach().numpy(), output.detach().numpy())[0][0] for output in review_outputs]

# Combine similarities with reviews
ranked_reviews = list(zip(reviews, similarities))

# Sort reviews by similarity in descending order
ranked_reviews.sort(key=lambda x: x[1], reverse=True)

# Print ranked reviews
for review, similarity in ranked_reviews:
    print(f"Similarity: {similarity:.4f}\nReview: {review}\n")


Similarity: 0.8121
Review: The battery life is amazing, but the camera quality could be better.

Similarity: 0.7852
Review: I'm very satisfied with the overall performance of this mobile.

Similarity: 0.7418
Review: The phone's design is sleek and elegant.

Similarity: 0.7260
Review: Mobile phone is very good, go for it

Similarity: 0.7085
Review: Amazing camera performance but the night mode could be better

Similarity: 0.6797
Review: I had a little trouble with the installation process, but the support team was quick to help. Great service!

Similarity: 0.6675
Review: Excellent processor makes me to take this mobile
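
For reference, the cosine similarity behind this ranking is the dot product of two vectors divided by the product of their norms. Below is a minimal pure-Python sketch with made-up 3-dimensional vectors (the `bert-base-uncased` embeddings used above have 768 dimensions). The function `cosine` and the toy vectors are illustrative assumptions, not values from the model.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query_vec = [1.0, 2.0, 0.0]
doc_vecs = {
    "doc_a": [2.0, 4.0, 0.0],  # same direction as the query
    "doc_b": [1.0, 0.0, 0.0],  # partially aligned with the query
    "doc_c": [0.0, 0.0, 3.0],  # orthogonal to the query
}

# Rank document names by similarity to the query, highest first
ranked = sorted(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]), reverse=True)
print(ranked)  # ['doc_a', 'doc_b', 'doc_c']
```

Because the measure ignores vector length, `doc_a` scores as a perfect match even though it is twice as long as the query. In the BERT output above, all seven scores fall between roughly 0.67 and 0.81, so the ranking rests on fairly small differences.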

