
# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided *.ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Title: CUSTOMER TICKET CLASSIFICATION TASK
This classification task distinguishes between different types of customer tickets, such as technical issues, payment issues, product issues, and account problems.

List of Features in this task:
1. Text Length (Word Count):
This feature measures the number of words in the combined subject and body of the ticket.

2. Number of Exclamation Marks:
It gives the count of exclamation marks in the text.

3. Presence of Urgent Words:
A boolean feature that checks for the presence of urgent words such as "urgent," "immediately," or "asap."

4. Sentiment Score:
The overall sentiment polarity score of the entire text, calculated using NLTK's VADER sentiment analyzer to check the positivity or negativity of the text.

5. Named Entities:
This feature classifies named entities such as products, dates, or organizations extracted from the text using SpaCy’s Named Entity Recognition (NER).

6. Number of Specific Keywords:
This feature counts how many times certain important keywords (such as “refund,” “password,” “support,” “upgrade,” “account”) appear in the ticket text.

How These Features are helpful:

-Text length: The length of a ticket gives insights into its complexity. Short tickets may indicate simple inquiries, while longer ones may involve more detailed complaints or technical issues.

-Number of Exclamation Marks: Tickets with multiple exclamation marks are more likely to be complaints or urgent requests that require immediate attention.

-Presence of Urgent Words: Urgent words signal that the customer expects a prompt response, and these tickets should be routed to high-priority teams.
                          It can help in identifying tickets that require immediate attention.

-Sentiment Score: A negative sentiment score might indicate dissatisfaction, while a positive score might represent a simple inquiry or positive feedback.
                  This feature helps classify whether the customer is facing an issue or is simply requesting information.

-Named Entities: Named entities help in understanding the specific product or service that the customer is referring to.

-Number of Specific Keywords: Certain keywords are strong indicators of the ticket type. For instance, "refund" suggests a billing issue,
                              while "password" indicates an account management problem.


'''

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [None]:
# Your code here (Please add comments in the code):
import pandas as pd
import spacy
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
import re

# Download NLTK VADER for sentiment analysis
nltk.download('vader_lexicon')

# Load Spacy's language model for Named Entity Recognition (NER)
nlp = spacy.load("en_core_web_sm")

# Sample customer service ticket data
data = [
    {"subject": "Unable to login to my account", "body": "I forgot my password and need help resetting it."},
    {"subject": "Payment Issue!", "body": "I was charged twice for my last purchase. Please issue a refund asap."},
    {"subject": "Product not working", "body": "The vacuum cleaner I bought is not turning on. It stopped working after a week."},
    {"subject": "Need upgrade assistance", "body": "Can you help me upgrade my software immediately to the latest version?"},
    {"subject": "Complaint about service", "body": "I am very disappointed with the service. The response time is too slow."}
]

# Convert to DataFrame
df = pd.DataFrame(data)

# Combine subject and body into a single text column for feature extraction
df['text'] = df['subject'] + ' ' + df['body']

# Feature 1: Length of the text (number of words)
df['text_length'] = df['text'].apply(lambda x: len(x.split()))
print("Feature 1 - Text Length (Word Count):")
print(df[['text', 'text_length']])

# Feature 2: Number of exclamation marks
df['exclamation_marks'] = df['text'].apply(lambda x: x.count('!'))
print("\nFeature 2 - Exclamation Marks Count:")
print(df[['text', 'exclamation_marks']])

# Feature 3: Presence of urgent words (like 'urgent', 'immediately', 'asap')
urgent_words = ['urgent', 'immediately', 'asap']
df['urgent_flag'] = df['text'].apply(lambda x: any(word in x.lower() for word in urgent_words))
print("\nFeature 3 - Presence of Urgent Words:")
print(df[['text', 'urgent_flag']])

# Feature 4: Sentiment Analysis (using NLTK's VADER sentiment analysis)
sia = SentimentIntensityAnalyzer()
df['sentiment'] = df['text'].apply(lambda x: sia.polarity_scores(x)['compound'])
print("\nFeature 4 - Sentiment Scores:")
print(df[['text', 'sentiment']])

# Feature 5: Named Entities (using SpaCy NER for entities like product names or organizations)
def extract_entities(text):
    doc = nlp(text)
    return [ent.label_ for ent in doc.ents]  # Extract only entity labels

df['entities'] = df['text'].apply(extract_entities)
print("\nFeature 5 - Named Entities:")
print(df[['text', 'entities']])

# Feature 6: Number of specific keywords (like 'refund', 'password', 'support')
keywords = ['refund', 'password', 'support', 'upgrade', 'account']
df['keyword_count'] = df['text'].apply(lambda x: sum([x.lower().count(word) for word in keywords]))
print("\nFeature 6 - Specific Keywords:")
print(df[['text', 'keyword_count']])



[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


Feature 1 - Text Length (Word Count):
                                                text  text_length
0  Unable to login to my account I forgot my pass...           15
1  Payment Issue! I was charged twice for my last...           15
2  Product not working The vacuum cleaner I bough...           18
3  Need upgrade assistance Can you help me upgrad...           15
4  Complaint about service I am very disappointed...           16

Feature 2 - Exclamation Marks Count:
                                                text  exclamation_marks
0  Unable to login to my account I forgot my pass...                  0
1  Payment Issue! I was charged twice for my last...                  1
2  Product not working The vacuum cleaner I bough...                  0
3  Need upgrade assistance Can you help me upgrad...                  0
4  Complaint about service I am very disappointed...                  0

Feature 3 - Presence of Urgent Words:
                                                text  urg
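
A plain substring test such as `word in text` also matches inside longer words (for example, "account" inside "accountant"). A minimal sketch of a whole-word variant of the urgent-word flag and keyword count, using only the standard `re` module, might look like the following. The helper names `has_any_word` and `count_words` are hypothetical and not part of the notebook; the word lists are the ones used in Question 2.

```python
import re

URGENT_WORDS = ['urgent', 'immediately', 'asap']
KEYWORDS = ['refund', 'password', 'support', 'upgrade', 'account']

def _word_pattern(words):
    # Build one case-insensitive pattern that matches any listed word as a whole word
    alternatives = '|'.join(re.escape(w) for w in words)
    return re.compile(r'\b(?:' + alternatives + r')\b', re.IGNORECASE)

def has_any_word(text, words):
    """Return True if any of the words occurs as a whole word in text (hypothetical helper)."""
    return _word_pattern(words).search(text) is not None

def count_words(text, words):
    """Count whole-word occurrences of the listed words in text (hypothetical helper)."""
    return len(_word_pattern(words).findall(text))

ticket = "Payment Issue! I was charged twice for my last purchase. Please issue a refund asap."
print(has_any_word(ticket, URGENT_WORDS))                      # "asap" is present as a whole word
print(count_words(ticket, KEYWORDS))                           # only "refund" is a listed keyword
print(count_words("Our accountant is unsupported", KEYWORDS))  # no whole-word keyword here
```

The trade-off is that whole-word matching misses inflected forms such as "refunds", so a stemmed or lemmatized variant may suit some datasets better.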

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [None]:
# Your code here (Please add comments in the code):
# Import the required libraries and modules
import pandas as pd
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

# Assuming df is already created with features from the previous step

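# Step 1: Generate labels (1 for urgent, 0 for non-urgent)
# The sample tickets have no ground-truth labels, so tickets with urgent words or
# negative sentiment are assumed to be "urgent". Because the label is derived from
# 'urgent_flag' and 'sentiment', the scores of those two features should be read
# with that construction in mind.
df['label'] = (df['urgent_flag'] | (df['sentiment'] < 0)).astype(int)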

# Step 2: Select the features for Chi-Square testing
# These are the features we want to test
features = ['text_length', 'exclamation_marks', 'urgent_flag', 'sentiment', 'keyword_count']

# Prepare the feature matrix (X) and target vector (y)
X = df[features]
y = df['label']

# chi2 requires non-negative inputs and the VADER compound score can be negative,
# so rescale every feature to the [0, 1] range first
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Apply Chi-Square feature selection
chi_scores, p_values = chi2(X_scaled, y)

# Step 4: Rank features based on Chi-Square score
chi2_scores_df = pd.DataFrame({
    'Feature': features,
    'Chi-Square Score': chi_scores,
    'P-Value': p_values
}).sort_values(by='Chi-Square Score', ascending=False)

print("\nRanked Features by Chi-Square Score:")
print(chi2_scores_df)



Ranked Features by Chi-Square Score:
             Feature  Chi-Square Score   P-Value
4      keyword_count          0.625000  0.429195
2        urgent_flag          0.500000  0.479500
0        text_length          0.333333  0.563703
1  exclamation_marks          0.250000  0.617075
3          sentiment          0.189825  0.663062
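
To make scores like these easier to interpret, the chi-square statistic for a single non-negative feature can be worked out by hand: sum the feature values within each class (observed), compare them with the share expected from the class frequencies, and add up the squared differences divided by the expected values. The sketch below is a pure-Python illustration of that calculation, not scikit-learn's code. `chi2_score` is a hypothetical helper, and the label vector is an assumed example rather than the labels computed in the notebook.

```python
def chi2_score(values, labels):
    """Chi-square statistic of one non-negative feature against class labels (hypothetical helper)."""
    n = len(values)
    total = sum(values)                      # overall feature mass
    score = 0.0
    for cls in sorted(set(labels)):
        # Observed: feature mass that falls inside this class
        observed = sum(v for v, y in zip(values, labels) if y == cls)
        # Expected: class frequency times the overall feature mass
        class_prob = sum(1 for y in labels if y == cls) / n
        expected = class_prob * total
        if expected > 0:                     # skip the term when the feature is all zeros
            score += (observed - expected) ** 2 / expected
    return score

# Assumed toy inputs: one ticket contains an exclamation mark, four of five tickets are "urgent"
exclamation_marks = [0, 1, 0, 0, 0]
assumed_labels = [1, 1, 0, 1, 1]
print(round(chi2_score(exclamation_marks, assumed_labels), 6))
```

A feature whose mass is spread across classes in proportion to the class sizes scores close to zero, which is why, on a five-ticket sample, none of the scores is large.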


## Question 4 (10 points):
Write Python code to rank the texts by similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the results by similarity in descending order.

In [None]:
# Your code here (Please add comments in the code):
import pandas as pd
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample customer service ticket data (same as in previous code)
data = [
    {"subject": "Unable to login to my account", "body": "I forgot my password and need help resetting it."},
    {"subject": "Payment Issue!", "body": "I was charged twice for my last purchase. Please issue a refund asap."},
    {"subject": "Product not working", "body": "The vacuum cleaner I bought is not turning on. It stopped working after a week."},
    {"subject": "Need upgrade assistance", "body": "Can you help me upgrade my software immediately to the latest version?"},
    {"subject": "Complaint about service", "body": "I am very disappointed with the service. The response time is too slow."}
]

# Convert to DataFrame
df = pd.DataFrame(data)

# Combine subject and body into a single text column for feature extraction
df['text'] = df['subject'] + ' ' + df['body']

# Define the query to match relevant documents
query = "Help me reset my password and account access"

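# Function to encode text using BERT model
def encode_text(text):
    # Tokenize, truncating to BERT's 512-token limit and padding up to it
    inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding='max_length')
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token vectors into one sentence embedding;
    # this average runs over all 512 positions, padding included
    return outputs.last_hidden_state.mean(dim=1).numpy()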

# Encode query and text data
query_embedding = encode_text(query)
df['embedding'] = df['text'].apply(lambda x: encode_text(x))

# Compute cosine similarity between the query and each document
df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(query_embedding, x.reshape(1, -1)).item())

# Rank documents by similarity in descending order
df = df.sort_values(by='similarity', ascending=False)

# Display the ranked documents
print("Ranked Documents based on Similarity to the Query:")
print(df[['subject', 'body', 'similarity']])




Ranked Documents based on Similarity to the Query:
                         subject  \
3        Need upgrade assistance   
0  Unable to login to my account   
1                 Payment Issue!   
4        Complaint about service   
2            Product not working   

                                                body  similarity  
3  Can you help me upgrade my software immediatel...    0.849565  
0   I forgot my password and need help resetting it.    0.832687  
1  I was charged twice for my last purchase. Plea...    0.813991  
4  I am very disappointed with the service. The r...    0.765909  
2  The vacuum cleaner I bought is not turning on....    0.758997  
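
When sequences are padded to a fixed length, a plain mean over all token positions also averages the vectors at padding positions. One common alternative is to average only the positions where the attention mask is 1 and then compare the pooled vectors with cosine similarity. The sketch below illustrates that idea in pure Python on made-up 2-dimensional vectors. `masked_mean`, `cosine`, and the toy numbers are hypothetical and merely stand in for BERT's `last_hidden_state` and `attention_mask`.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask value is 1 (hypothetical helper)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "documents": two real tokens followed by one padding position each
docs = {
    "doc_a": ([[1.0, 0.0], [1.0, 0.2], [0.0, 9.0]], [1, 1, 0]),
    "doc_b": ([[0.0, 1.0], [0.2, 1.0], [0.0, 9.0]], [1, 1, 0]),
}
query_vec = [1.0, 0.1]  # assumed, already-pooled query embedding

# Pool each document with the mask, score it against the query, rank in descending order
scores = {name: cosine(query_vec, masked_mean(vecs, mask)) for name, (vecs, mask) in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy setup the padding vector points in the same direction for every document, so leaving it in the average would pull all pooled vectors toward each other and compress the similarity scores.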


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Learning experience: Overall, I learned about data extraction from different websites through various methods, and it was very exciting.
Challenges encountered: At first it was tough to extract data, as many websites did not give access to it. Then I learned ways to overcome this.
'''