## The third in-class exercise (due at 11:59 PM on 10/08/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

# Your answer here (no code for this question; write your answer in as much detail as possible):

Text classification and text mining tasks are essential in transforming unstructured text data into structured information for analysis and decision-making. One interesting task within this domain is "Intent Detection." Intent detection involves automatically identifying the underlying goals or purposes of a given text, which can be invaluable in various applications, such as customer support, marketing, and chatbots. In this assignment, I will discuss the features that can be useful for building a machine learning model for intent detection.

1. **Word2Vec Features**: Word embeddings, such as Word2Vec, capture semantic relationships between words. By representing words in a continuous vector space, these features can help the model understand the context and meaning of words within a text. For intent detection, Word2Vec features can be instrumental in recognizing similarities and associations between different phrases or keywords, aiding in the classification of text into specific intents.

2. **Linguistic Feature Extraction**: Linguistic features encompass various aspects of language, including parts of speech, syntactic structures, and grammatical patterns. These features can provide valuable information about the syntactical and grammatical characteristics of the text. For example, identifying the presence of verbs, nouns, or specific grammatical constructs can assist in determining the intent behind a sentence.

3. **Dependency Parse**: Dependency parsing involves analyzing the grammatical structure of a sentence by identifying the relationships between words, such as subject-verb-object dependencies. Utilizing dependency parse features can help the model understand the underlying structure and semantics of a sentence, aiding in the classification of intent. For instance, recognizing that "buy" is related to "product" in a sentence can suggest an intent related to purchasing.

4. **TF-IDF (Term Frequency-Inverse Document Frequency)**: TF-IDF is a numerical statistic that reflects the importance of a word within a document relative to a corpus of documents. It can be helpful in identifying keywords or phrases that are highly indicative of specific intents. Words with high TF-IDF scores are likely to be significant in determining the intent of a text. For instance, in a customer support context, the presence of words like "refund" or "complaint" with high TF-IDF scores may suggest an intent related to issue resolution.

5. **Entity Features**: Entities represent specific named objects or concepts, such as names of products, locations, or dates. Recognizing and extracting entities from text can be crucial for intent detection. For example, if a text mentions "booking a flight to New York on December 1st," identifying entities like "New York" and "December 1st" can help classify the intent as travel planning.

In summary, building a machine learning model for intent detection requires a combination of diverse features to effectively capture the nuances of human language and context. Word2Vec features offer semantic understanding, linguistic features provide grammatical insights, dependency parse reveals sentence structure, TF-IDF identifies important keywords, and entity recognition pinpoints specific entities. By incorporating these features, the model can robustly classify text into meaningful intents, enabling businesses to streamline processes and provide more efficient services to customers.
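
As a minimal sketch of the TF-IDF feature described in item 4, the block below computes per-word weights for one utterance by hand. The four support utterances are hypothetical, tokenization is plain whitespace splitting, and TF is normalized by utterance length; these are all illustrative assumptions rather than choices made elsewhere in this notebook.

```python
import math
from collections import Counter

# Hypothetical customer-support utterances (illustrative only)
docs = [
    "i want a refund for my order",
    "where is my order",
    "please cancel my order and refund me",
    "what time is my flight",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of utterances that contain each word
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf(tokens):
    """Return {word: tf * idf} for one tokenized utterance."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }


weights = tfidf(tokenized[0])

# Print the words of the first utterance from highest to lowest weight
for word, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:8s} {weight:.3f}")
```

A word that occurs in every utterance ("my") receives a weight of zero, while a word limited to a subset of utterances ("refund") keeps a positive weight, which is the property that makes TF-IDF useful for separating intents.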

Question 2 (10 points): Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [1]:
# Import necessary libraries
import nltk
from bs4 import BeautifulSoup as bs
import numpy as np
import pandas as pd
import requests as rq
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from nltk.corpus import stopwords

# Collecting sample text data from IMDb "Pathaan movie reviews"

# Assigning the URL to a variable
url = 'https://www.imdb.com/title/tt12844910/reviews'

# Creating a data frame with one column to store the text of reviews
df = pd.DataFrame(columns=['Review'])

# Send an HTTP GET request to the specified URL and retrieve the HTML content
req = rq.get(url).text

# Parse the HTML content using BeautifulSoup
soup = bs(req, 'html.parser')

# Collecting review text elements
review_text = soup.find_all('div', attrs={'class': 'text show-more__control'})

# Extract the text of each review element into a list
review_list = [element.get_text() for element in review_text]

# Assigning the list of reviews to the 'Review' column in the data frame
df['Review'] = review_list

# Calculate the word count for each review and add it as a new column 'word_count'
df['word_count'] = df['Review'].apply(lambda x: len(str(x).split(" ")))

# Printing the first few rows of the data frame with columns 'Review' and 'word_count'
df[['Review', 'word_count']].head()


|  | Review | word_count |
|---|---|---|
| 0 | I went in to see this because of positive RT s... | 168 |
| 1 | This would have been a great films with such s... | 105 |
| 2 | Yash Raj Films cashed on the 'Civilian Vs Mili... | 299 |
| 3 | As a part of the audience, I feel that I just ... | 211 |
| 4 | Yes as a educated upsc aspirant this is , disa... | 130 |


In [2]:
# Calculate the character count for each review (including spaces) and add it as a new column 'char_count'
df['char_count'] = df['Review'].str.len()

# Displaying the first few rows of the data frame with columns 'Review' and 'char_count'
df[['Review', 'char_count']].head()


|  | Review | char_count |
|---|---|---|
| 0 | I went in to see this because of positive RT s... | 947 |
| 1 | This would have been a great films with such s... | 594 |
| 2 | Yash Raj Films cashed on the 'Civilian Vs Mili... | 1818 |
| 3 | As a part of the audience, I feel that I just ... | 1117 |
| 4 | Yes as a educated upsc aspirant this is , disa... | 747 |


In [3]:
# Download the English stopword list from NLTK
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

# Keep the English stopwords in a set for fast membership tests
stop = set(stopwords.words('english'))

# Count the stopwords in each review and store the result in a new column 'stopwords'
# (the match is case-sensitive, so capitalized stopwords such as 'The' are not counted)
df['stopwords'] = df['Review'].apply(lambda text: sum(word in stop for word in text.split()))

# Displaying the first few rows of the data frame with columns 'Review' and 'stopwords'
df[['Review', 'stopwords']].head()


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/shyamsonu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


|  | Review | stopwords |
|---|---|---|
| 0 | I went in to see this because of positive RT s... | 84 |
| 1 | This would have been a great films with such s... | 42 |
| 2 | Yash Raj Films cashed on the 'Civilian Vs Mili... | 120 |
| 3 | As a part of the audience, I feel that I just ... | 94 |
| 4 | Yes as a educated upsc aspirant this is , disa... | 51 |


In [4]:
# Count the tokens that start with '#' in each review and store the result in a new column 'hashtags'
df['hashtags'] = df['Review'].apply(lambda text: sum(word.startswith('#') for word in text.split()))

# Displaying the first few rows of the data frame with columns 'Review' and 'hashtags'
df[['Review', 'hashtags']].head()


|  | Review | hashtags |
|---|---|---|
| 0 | I went in to see this because of positive RT s... | 0 |
| 1 | This would have been a great films with such s... | 0 |
| 2 | Yash Raj Films cashed on the 'Civilian Vs Mili... | 0 |
| 3 | As a part of the audience, I feel that I just ... | 0 |
| 4 | Yes as a educated upsc aspirant this is , disa... | 0 |


In [5]:
# Count the purely numeric tokens (e.g. '2', but not '2.5') in each review and store the result in a new column 'numerics'
df['numerics'] = df['Review'].apply(lambda text: sum(word.isdigit() for word in text.split()))

# Displaying the first few rows of the data frame with columns 'Review' and 'numerics'
df[['Review', 'numerics']].head()


|  | Review | numerics |
|---|---|---|
| 0 | I went in to see this because of positive RT s... | 0 |
| 1 | This would have been a great films with such s... | 1 |
| 2 | Yash Raj Films cashed on the 'Civilian Vs Mili... | 0 |
| 3 | As a part of the audience, I feel that I just ... | 2 |
| 4 | Yes as a educated upsc aspirant this is , disa... | 1 |


In [6]:
# Count the all-uppercase tokens (including the single letter 'I') in each review and store the result in a new column 'upper'
df['upper'] = df['Review'].apply(lambda text: sum(word.isupper() for word in text.split()))

# Displaying the first few rows of the data frame with columns 'Review' and 'upper'
df[['Review', 'upper']].head()


|  | Review | upper |
|---|---|---|
| 0 | I went in to see this because of positive RT s... | 5 |
| 1 | This would have been a great films with such s... | 2 |
| 2 | Yash Raj Films cashed on the 'Civilian Vs Mili... | 5 |
| 3 | As a part of the audience, I feel that I just ... | 15 |
| 4 | Yes as a educated upsc aspirant this is , disa... | 1 |
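
The six surface features computed in the cells above can also be gathered in a single pass per review. The following is a sketch only: `surface_features` and `demo_stopwords` are hypothetical names introduced here, the tiny stopword set stands in for the NLTK list so the block runs on its own, and the word count uses `split()` rather than `split(" ")`, so it can differ slightly from the `word_count` column when a review contains consecutive spaces.

```python
def surface_features(text, stopword_set):
    """Return the surface-level counts for one piece of text as a dict."""
    tokens = text.split()
    return {
        "word_count": len(tokens),
        "char_count": len(text),
        "stopwords": sum(t in stopword_set for t in tokens),
        "hashtags": sum(t.startswith("#") for t in tokens),
        "numerics": sum(t.isdigit() for t in tokens),
        "upper": sum(t.isupper() for t in tokens),
    }


# A tiny hand-written stopword set so the sketch runs without NLTK (assumption)
demo_stopwords = {"i", "the", "a", "to", "is", "of", "in", "it"}

# A made-up sentence, not taken from the scraped reviews
sample = "I watched it in IMAX 2 times #Pathaan"
features = surface_features(sample, demo_stopwords)
print(features)
```

With the notebook's data, one way to apply it would be `pd.DataFrame([surface_features(t, stop) for t in df['Review']])`, which yields one column per feature.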


Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, rank the features based on their importance in the descending order.

In [13]:
# Term frequency (TF): count how often each word occurs in the second review (row 1)
text_classification = (df['Review'][1:2]).apply(lambda x: pd.Series(x.split(" ")).value_counts()).sum(axis=0).reset_index()
text_classification.columns = ['words', 'tf']
text_classification

|  | words | tf |
|---|---|---|
| 0 | movie | 5 |
| 1 | and | 5 |
| 2 | the | 4 |
| 3 | a | 3 |
| 4 | with | 3 |
| ... | ... | ... |
| 77 | in | 1 |
| 78 | lines | 1 |
| 79 | comedy | 1 |
| 80 | keep | 1 |


In [16]:
# Inverse document frequency (IDF): log of (number of reviews / number of reviews containing the word)
# regex=False makes str.contains do a literal substring match, so punctuation inside a word is not read as a regex
n_docs = df.shape[0]
for i, word in enumerate(text_classification['words']):
    doc_freq = df['Review'].str.contains(word, regex=False).sum()
    text_classification.loc[i, 'idf'] = np.log(n_docs / doc_freq)


In [17]:
# TF-IDF is the product of TF and IDF; store it in a new column 'tfidf'
text_classification['tfidf'] = text_classification['tf'] * text_classification['idf']

# Display the DataFrame with the newly added 'tfidf' column
text_classification


|  | words | tf | idf | tfidf |
|---|---|---|---|---|
| 0 | movie | 5 | 0.040822 | 0.204110 |
| 1 | and | 5 | 0.000000 | 0.000000 |
| 2 | the | 4 | 0.000000 | 0.000000 |
| 3 | a | 3 | 0.000000 | 0.000000 |
| 4 | with | 3 | 0.328504 | 0.985512 |
| ... | ... | ... | ... | ... |
| 77 | in | 1 | 0.000000 | 0.000000 |
| 78 | lines | 1 | 2.525729 | 2.525729 |
| 79 | comedy | 1 | 2.120264 | 2.120264 |
| 80 | keep | 1 | 1.832581 | 1.832581 |
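
The question also asks for a ranking in descending order of importance. One filter-style score widely used for text feature selection is the chi-square statistic, which needs class labels that the scraped reviews do not have. The sketch below therefore uses four invented reviews with invented `pos`/`neg` labels; `labeled_docs` and `chi_square_scores` are hypothetical names, and the block is an illustration of the technique rather than the method used in the cells above.

```python
from collections import defaultdict

# Hypothetical labeled reviews (texts and labels are invented for illustration)
labeled_docs = [
    ("great action and music", "pos"),
    ("great fun movie", "pos"),
    ("boring plot and weak story", "neg"),
    ("weak boring movie", "neg"),
]


def chi_square_scores(docs, target_label):
    """Chi-square score of each term with respect to one class (2x2 table)."""
    n = len(docs)
    n_target = sum(1 for _, label in docs if label == target_label)

    # docs_with[term] = [docs in the target class, docs in other classes]
    docs_with = defaultdict(lambda: [0, 0])
    for text, label in docs:
        for term in set(text.split()):
            docs_with[term][0 if label == target_label else 1] += 1

    scores = {}
    for term, (a, b) in docs_with.items():
        c = n_target - a          # target-class docs without the term
        d = (n - n_target) - b    # other docs without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


scores = chi_square_scores(labeled_docs, "pos")

# Rank the terms from most to least class-discriminating
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for term, score in ranked:
    print(f"{term:8s} {score:.3f}")
```

Terms that appear only in one class ("great", "boring", "weak") reach the top of the ranking, while terms spread evenly over both classes ("movie", "and") score zero and would be the first candidates to drop.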


Question 4 (10 points): Write Python code to rank the text based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [18]:
# Your code here (please add comments in the code):
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Loading the pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Sample text data 
texts = df['Review'].tolist()

# Query
query = "The weakest part of pathaan, I thought, was..."

# Tokenize and encode the query and texts
query_tokens = tokenizer(query, padding=True, truncation=True, return_tensors="pt")
text_tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Encoding the query and texts using BERT
with torch.no_grad():
    query_embeddings = model(**query_tokens).last_hidden_state.mean(dim=1)
    text_embeddings = model(**text_tokens).last_hidden_state.mean(dim=1)

# Calculating cosine similarity
cosine_similarities = cosine_similarity(query_embeddings, text_embeddings)

# Ranking documents by similarity in descending order
ranking = np.argsort(cosine_similarities[0])[::-1]

# Printing the ranked documents
print("Query:", query)
for i, idx in enumerate(ranking):
    print(f"Rank {i + 1}: {texts[idx]} (Similarity: {cosine_similarities[0][idx]:.4f})")



Query: The weakest part of pathaan, I thought, was...
Rank 1: I went in to see this because of positive RT scores and comments which said this movie reeked of action. The reviewers just left out the part where they were supposed to say the action was unbelievable. It is worth a watch if you have 2.5 hours to kill.The fight choreography between SRK and 'Jim' (I'm too tired to look up the actor's name) were exceptional. Those featuring Padukone were kind of hit and miss. Sometime they reached Marvel's Black Widow heights, other times they were almost laughable.Jim was a good villain. His backstory was touching and you could empathize with him and the reasons that he was doing what he was doing. I couldn't root for him because you already know how the story is going to end but you just don't know how they will get there.Finally, the soundtrack (or background music). It was repetitive when there was a fight between the protagonist and antagonist. So much so that I found it irritating after
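
One detail of the cell above is worth noting: with `padding=True`, shorter reviews are padded to the length of the longest one, and `last_hidden_state.mean(dim=1)` averages over the padded positions as well. A common alternative is to average only the positions where the attention mask is 1. The sketch below illustrates that idea, followed by a cosine-similarity ranking, on toy 2-dimensional vectors in plain Python; `masked_mean`, `cosine`, and the vectors themselves are hypothetical and are not BERT outputs.

```python
import math


def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask value is 1."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(vec[j] for vec in kept) / len(kept) for j in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy "token embeddings"; the second position of doc_b is padding (mask 0)
query_vec = masked_mean([[1.0, 0.0], [1.0, 1.0]], [1, 1])
doc_vecs = {
    "doc_a": masked_mean([[0.0, 1.0], [0.0, 3.0]], [1, 1]),
    "doc_b": masked_mean([[2.0, 1.0], [9.0, -9.0]], [1, 0]),
}

# Rank the documents by similarity to the query, highest first
ranking = sorted(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]), reverse=True)
for name in ranking:
    print(name, round(cosine(query_vec, doc_vecs[name]), 4))
```

In PyTorch the same idea is usually written by multiplying `last_hidden_state` by `attention_mask.unsqueeze(-1)`, summing over the sequence dimension, and dividing by the per-row mask sum. Switching the pooling may change the similarity scores and therefore the ranking printed above.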

### Thank you!