# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions will be penalized 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:

Task: classify movie reviews into three classes: positive, negative, or neutral. The dataset is a small sample of movie reviews in text form. The following features might help build the machine learning model:

1. TF-IDF (Term Frequency-Inverse Document Frequency): weights each term by how often it appears in a review relative to how often it appears across the whole collection of reviews, which highlights the key terms of each review.
2. N-grams: bigrams or trigrams capture sentiment-bearing phrases, which adds context that single words miss.
3. Sentiment Score: each review can be processed with TextBlob or a similar library to get a polarity score indicating whether the review is positive, negative, or neutral.
4. Word Count: the number of words in each review, a simple measure of review length.
5. Part-of-Speech Tags: counting parts of speech shows which sentiment-laden word types (adjectives, for example) are common in a review.
6. Character Count: the number of characters in a review, a rough proxy for how detailed the review is.
7. Average Word Length: the average length of the words in a review, which may correlate with the complexity of the review.
8. Exclamation Points: the number of exclamation marks can suggest the intensity and emotion of a review.
'''


## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [1]:
#Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob
import nltk
from nltk import pos_tag
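import ssl
#Workaround for SSL certificate errors during nltk.download; note that this disables certificate verification
ssl._create_default_https_context = ssl._create_unverified_context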
#Downloading 'punkt' for tokenization
nltk.download('punkt')
#Downloading 'averaged_perceptron_tagger' for part-of-speech tagging
nltk.download('averaged_perceptron_tagger')

#Sample movie reviews dataset in text format
data = {
    'review':[
        "I loved this movie! The actors were amazing",
        "This was the worst purchase I have ever made",
        "Its okay, the movie did not reach my expectations",
        "Highly recommend! Amazing screenplay, direction and acting by the actors",
        "Dont watch this movie. Its a total time waste!",
        "Just Superb. Amazing Story",
        "Just wasted three hours of my valuable time!",
        "Not bad, the movie was okay for the first half",
        "Dont skip this movie! Everyone must watch it.",
        "The movie is a total flop. No story, no screenplay, and the actors did not do justice for their roles",

    ],
    'sentiment':[
        'positive',
        'negative',
        'neutral',
        'positive',
        'negative',
        'positive',
        'negative',
        'neutral',
        'positive',
        'negative'
    ]
}
#Creating the DataFrame to extract the features
df = pd.DataFrame(data)

#Display the first 5 rows of dataset
print("Sample movie reviews data\n", df.head())

#Extracting TF-IDF Features
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

#Fitting and transforming the reviews to TF-IDF representation
X_tfidf = tfidf_vectorizer.fit_transform(df['review'])
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
print("\n Extracted TF-IDF Features are\n",tfidf_df)

#Extracting bigram (2-gram) features
ngram_vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(2,2))

#Fitting and transforming the reviews to a TF-IDF-weighted bigram representation
X_ngrams = ngram_vectorizer.fit_transform(df['review'])
ngram_df = pd.DataFrame(X_ngrams.toarray(), columns=ngram_vectorizer.get_feature_names_out())
print("\n Extracted N-grams Features are\n",ngram_df)

#Calculating sentiment score
def sentiment(review):
    blob = TextBlob(review)
    #Returns sentiment score between -1 and 1. -1 indicates negative, 1 indicates positive
    return blob.sentiment.polarity

#Calculating sentiment scores for each review
sentiment_score = [sentiment(review) for review in df['review']]
sentiment_df = pd.DataFrame({'Sentiment Score': sentiment_score})
print("\n Extracted Sentiment scores are\n",sentiment_df)

#Defining function to extract Part_of_Speech tags

def pos_tags(review):
    #Tokenizing the review text
    tokens = nltk.word_tokenize(review)
    tag = pos_tag(tokens)
    return tag

#Generating POS tags for all reviews
pos_list = [pos_tags(review) for review in df['review']]
print("\n Extracted POS tags are\n",pos_list)

#Calculating word count
#Counting the number of words in each review
word_count = [len(review.split()) for review in df['review']]
wordcount_df = pd.DataFrame({'Word Count' : word_count})
print("\n Word counts are\n",wordcount_df)

#Counting the exclamation marks in each review
exclamation_count = [review.count('!') for review in df['review']]
exclamationcount_df = pd.DataFrame({'Exclamation Mark Count' : exclamation_count})
print("\n Exclamation marks count is\n",exclamationcount_df)

#Final dataset after combining all features
final_df = pd.concat([tfidf_df, ngram_df, sentiment_df, wordcount_df, exclamationcount_df], axis=1)
print("\n Dataset after combining all features\n",final_df)

Sample movie reviews data
                                               review sentiment
0        I loved this movie! The actors were amazing  positive
1       This was the worst purchase I have ever made  negative
2  Its okay, the movie did not reach my expectations   neutral
3  Highly recommend! Amazing screenplay, directio...  positive
4     Dont watch this movie. Its a total time waste!  negative

 Extracted TF-IDF Features are
      acting    actors   amazing      bad       did  direction      dont  \
0  0.000000  0.480631  0.480631  0.00000  0.000000   0.000000  0.000000   
1  0.000000  0.000000  0.000000  0.00000  0.000000   0.000000  0.000000   
2  0.000000  0.000000  0.000000  0.00000  0.439955   0.000000  0.000000   
3  0.414196  0.308050  0.308050  0.00000  0.000000   0.414196  0.000000   
4  0.000000  0.000000  0.000000  0.00000  0.000000   0.000000  0.415853   
5  0.000000  0.000000  0.429504  0.00000  0.000000   0.000000  0.000000   
6  0.000000  0.000000  0.000000  0.00
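
Question 1 lists character count, average word length, and part-of-speech counts as features, but the cell above does not compute them. The following is a minimal sketch of how they might be derived using only the standard library. The function names `character_count`, `average_word_length`, and `pos_tag_counts` are hypothetical helpers introduced here, and the tagged tokens in the example are hand-written rather than real tagger output.

```python
from collections import Counter


def character_count(review):
    """Number of characters in the review, including spaces and punctuation."""
    return len(review)


def average_word_length(review):
    """Mean length of whitespace-separated tokens; 0.0 for an empty review.

    Punctuation attached to a token counts toward its length, which matches
    the split()-based word count used in the cell above.
    """
    words = review.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


def pos_tag_counts(tagged_tokens):
    """Count tags in a list of (token, tag) pairs, the shape pos_tag returns."""
    return Counter(tag for _, tag in tagged_tokens)


review = "Just Superb. Amazing Story"
print(character_count(review))
print(average_word_length(review))

# Hand-written (token, tag) pairs for illustration only, not real tagger output
example_tags = [("Amazing", "JJ"), ("Story", "NN"), ("Superb", "JJ")]
print(pos_tag_counts(example_tags))
```

Each helper takes a single review, so it could be applied per row, for example with `df['review'].apply(character_count)`, and the resulting columns concatenated into `final_df` alongside the other features.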

## Question 3 (10 Points)
Use any of the feature selection methods mentioned in the paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [4]:
# Your code here (please add comments in the code):
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import chi2

#Preparing labels for feature selection
label_encoder = LabelEncoder()

#Transforming categorical sentiment labels into numerical
encoded_labels = label_encoder.fit_transform(df['sentiment'])

#Applying Chi-square test for tfidf features
chi_square_score, _ = chi2(X_tfidf,encoded_labels)
chi_square_df = pd.DataFrame({
    'Feature' : tfidf_vectorizer.get_feature_names_out(),
    'Chi_square value': chi_square_score
}).sort_values(by='Chi_square value', ascending=False)

#Displaying the chi square test results
print("Chi square results\n",chi_square_df)

Chi square results
          Feature  Chi_square value
16          okay          3.719483
9           half          2.305240
3            bad          2.305240
18         reach          2.070152
7   expectations          2.070152
2        amazing          1.827277
25          time          1.228571
26         total          1.115237
17      purchase          1.060660
31         worst          1.060660
14         loved          0.969367
22          skip          0.907261
24        superb          0.866250
4            did          0.842853
28         waste          0.733779
11         hours          0.711443
29        wasted          0.711443
27      valuable          0.711443
10        highly          0.621294
5      direction          0.621294
0         acting          0.621294
19     recommend          0.621294
20         roles          0.578124
8           flop          0.578124
13       justice          0.578124
1         actors          0.561812
6           dont          0.245498
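
The ranking above comes from `sklearn.feature_selection.chi2`. As a rough illustration of what that score measures, here is a pure-Python sketch of the observed-versus-expected calculation on a tiny hand-made matrix. To my understanding scikit-learn follows the same formulation, but the function names `chi2_scores` and `rank_features` and the toy data are assumptions made for this example, not part of the exercise. The sketch also assigns a score of 0 to an all-zero feature, which is a simplification.

```python
def chi2_scores(X, y):
    """Chi-square score of each feature column in X against class labels y.

    X is a list of rows holding non-negative feature values (e.g. TF-IDF
    weights); y is a list of class labels of the same length.
    """
    n_samples = len(y)
    n_features = len(X[0])
    classes = sorted(set(y))
    # Share of samples that belong to each class
    class_prob = {c: sum(1 for label in y if label == c) / n_samples for c in classes}

    scores = []
    for j in range(n_features):
        feature_total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            # Observed: total weight of feature j inside class c
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # Expected: weight class c would get if feature and class were independent
            expected = feature_total * class_prob[c]
            if expected > 0:  # skip all-zero features instead of dividing by zero
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def rank_features(names, scores):
    """Pair feature names with their scores, highest score first."""
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): three features over four labelled reviews
X_toy = [[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 1, 1]]
y_toy = ["positive", "positive", "negative", "negative"]
for name, score in rank_features(["good", "bad", "movie"], chi2_scores(X_toy, y_toy)):
    print(name, score)
```

A feature that is spread evenly across the classes, such as "movie" in the toy data, scores 0, while a feature concentrated in one class scores higher. This is why class-specific words rise to the top of the ranking.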


## Question 4 (10 Points)
Write Python code to rank the text based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [5]:
# Your code here (please add comments in the code):
#Importing libraries
import tensorflow as tf
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertTokenizer, TFBertModel, logging
logging.set_verbosity_error()

#Loading BERT Model and tokenizer
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

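#Defining function to generate BERT embeddings

def bert_embedding(text):
    #Tokenizing the text; named `inputs` to avoid shadowing the built-in input()
    inputs = bert_tokenizer(text, return_tensors='tf', truncation=True, padding=True)
    output = bert_model(**inputs)
    #Mean-pooling the token embeddings into a single sentence vector
    return tf.reduce_mean(output.last_hidden_state, axis=1)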

#Generating embeddings for all reviews present
embedding_of_review = tf.concat([bert_embedding(review) for review in df['review']], axis=0)

#Defining a query to match relevant documents
query = "Great movie with amazing acting!"
embedding_of_query = bert_embedding(query)

#Calculating cosine similarity between query and each review
similarity = cosine_similarity(embedding_of_query.numpy(), embedding_of_review.numpy())

# Ranking the reviews based on similarities
sorting_similarity = similarity.argsort()[0][::-1]
reviews_ranking = [df['review'].iloc[i] for i in sorting_similarity]
similarity_ranking = [similarity[0][i] for i in sorting_similarity ]

#Building a table of the ranked reviews with their similarity values
dataset = pd.DataFrame({'Rank' : range(1, len(reviews_ranking) + 1), 'Review' : reviews_ranking, 'Similarity' : similarity_ranking})

#Displaying the ranking in descending order
print("\nReviews Based on Similarity to the Query in descending order:\n")
print(dataset)


Reviews Based on Similarity to the Query in descending order:

   Rank                                             Review  Similarity
0     1        I loved this movie! The actors were amazing    0.819950
1     2                         Just Superb. Amazing Story    0.728510
2     3  Highly recommend! Amazing screenplay, directio...    0.727544
3     4  The movie is a total flop. No story, no screen...    0.716363
4     5     Not bad, the movie was okay for the first half    0.656546
5     6  Its okay, the movie did not reach my expectations    0.653633
6     7      Dont skip this movie! Everyone must watch it.    0.641844
7     8       Just wasted three hours of my valuable time!    0.601925
8     9     Dont watch this movie. Its a total time waste!    0.596875
9    10       This was the worst purchase I have ever made    0.549333
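
The similarity column above is the cosine of the angle between the query embedding and each review embedding. As an illustration of that calculation and of the descending sort, here is a small standard-library sketch on made-up 2-dimensional vectors. The helper names `cosine` and `rank_by_similarity` are hypothetical, and the vectors are not real BERT embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (document index, similarity) pairs, most similar first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up vectors for illustration only
query_vec = [1.0, 0.0]
doc_vecs = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
for rank, (index, score) in enumerate(rank_by_similarity(query_vec, doc_vecs), start=1):
    print(rank, index, round(score, 4))
```

Cosine similarity ignores vector length, so `[2.0, 0.0]` matches the query exactly even though it is twice as long. The score therefore reflects the direction of an embedding rather than its magnitude.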


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
Learning Experience: Extracting features from text data was interesting and helped me grasp the fundamentals of NLP. A key takeaway was the use of TF-IDF and N-grams for feature extraction: they assign each word or phrase a weight that reflects its importance within a document relative to the rest of the collection. BERT embeddings were entirely new to me, and I learned that they are an effective way to obtain contextual vector representations of text. The exercise also emphasized the need to clean and transform text before it is processed by machine learning algorithms, which improved both my practical and theoretical knowledge of NLP.

Challenges Encountered: The process was not entirely smooth. First, loading the BERT model produced many warnings about model weight initialization, and it took time to understand what those warnings meant and how to disable them. Calculating cosine similarity and handling TensorFlow tensors correctly was also difficult. Finally, it took care to make sure that every code component worked correctly and that the final output was well structured in a table.

Relevance to Your Field of Study: This exercise is highly relevant to my field of study. Feature extraction is a vital step in any document analysis procedure that involves text classification. Good feature extraction makes it possible to train a more effective model and to interpret its results in context. The exercise also shows how much NLP work relies on readily available pretrained models such as BERT. It has made me more eager to take on advanced NLP tasks such as text analytics and classification, which are important in real-life scenarios.
'''

