# **Natural Language Processing Assignment**  
### **Classifying Bird Species Based on Descriptions Using Supervised Learning Techniques**

**Student Name**: Tia Isabel Solanki  
**Admin Number**: 220892L  
**Class**: AA2303

---

## **Part 3: Model Evaluation and Insights**  
*Evaluating model performance and analyzing results for actionable insights.*

---

**Importing Necessary Libraries**

In [3]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import spacy
from collections import Counter

# Text Processing Libraries
import nltk
from nltk.corpus import stopwords
from nltk.util import bigrams
from wordcloud import WordCloud
from textblob import TextBlob

# Scikit-learn Libraries
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report
)

import joblib  # For loading the saved model

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Load SpaCy model
nlp = spacy.load("en_core_web_sm")

# Define NLTK stopwords
nltk_stopwords = set(stopwords.words('english'))

# Domain-specific stopwords
domain_stopwords = {'species', 'bird', 'animal', 'description'}

# Combine NLTK stopwords with domain-specific stopwords
combined_stopwords = nltk_stopwords.union(domain_stopwords)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### **Load and Preprocess Data**


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# Load the test data
filepath = '/content/drive/MyDrive/NYP/Year 2/sem 2/[2] IT2391 NATURAL LANGUAGE PROCESSING/NLP Assignment/Data_test.xlsx'
df = pd.read_excel(filepath)
df.head()

|   | description | species |
|---|---|---|
| 0 | Looking for fun and interesting facts about a ... | Black-naped Oriole |
| 1 | Giant Panda. Grey Wolf. Canis lupus. Proboscis... | Black-naped Oriole |
| 2 | javanicus displayed during feeding such as wal... | Javan Myna |
| 3 | Black-naped Oriole: Explore more topics. Name;... | Black-naped Oriole |
| 4 | NaN | Javan Myna |


### **Cleaning and Processing Text Data**

In [6]:
df.shape

(40, 2)

In [7]:
# Drop duplicate rows, keeping only the last occurrence of each
df = df.drop_duplicates(keep='last')

In [8]:
print("\nData types of columns:")
print(df.dtypes)


Data types of columns:
description    object
species        object
dtype: object


In [9]:
# Drop rows with missing values
df.dropna(inplace=True)
print("Shape after removing missing values:", df.shape)

Shape after removing missing values: (39, 2)


In [10]:
def clean_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove HTML tags (if any)
    text = re.sub(r'<[^>]+>', '', text)

    # Remove non-ASCII characters (but keep hyphens within words)
    text = re.sub(r'[^\x00-\x7F-]+', '', text)

    # Replace newlines, carriage returns, and tabs with a space
    text = re.sub(r'[\r\n\t]', ' ', text)

    # Remove URLs (the dot after "www" is escaped so it only matches a literal ".")
    text = re.sub(r'http\S+|www\.\S+', '', text)

    # Remove standalone numbers but keep numbers in words (e.g., "version 2.0")
    text = re.sub(r'\b\d+\b(?!\.\d)', '', text)

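    # Remove all punctuation except hyphens; "[" and "]" are escaped so the
    # character class is not closed early (unescaped, the pattern matched almost nothing)
    text = re.sub(r'[!"#$%&\'()*+,./:;<=>?@\[\\\]^_`{|}~“”‘’]', '', text)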

    # Remove any other special symbols like accented characters, etc.
    text = re.sub(r'[^\w\s-]', '', text)

    # Remove multiple spaces (including leading and trailing spaces)
    text = re.sub(r'\s+', ' ', text).strip()

    # Remove words containing numbers (e.g., "version2", "item10")
    text = re.sub(r'\b\w*\d\w*\b', '', text)

    # Remove a leading month name (full or short form) from the start of the text
    text = re.sub(r'^(january|jan|february|feb|march|mar|april|apr|may|june|jun|july|jul|august|aug|september|sep|october|oct|november|nov|december|dec)\b[\s]*', '', text, flags=re.IGNORECASE)

    # Remove the word "article"
    text = re.sub(r'\barticle\b', '', text, flags=re.IGNORECASE)

    # Remove leading and trailing hyphens (if they appear at the start or end of the sentence)
    text = re.sub(r'^[\-]+', '', text)
    text = re.sub(r'[\-]+$', '', text)

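    # Re-tokenize with spaCy and rejoin with single spaces; note that spaCy splits
    # hyphenated words into separate tokens (e.g. "black-naped" becomes "black - naped")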
    doc = nlp(text)
    cleaned_text = " ".join([token.text for token in doc])

    return cleaned_text
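
To make the effect of the regex steps easier to inspect, the sketch below applies a subset of them to one made-up sentence. The helper name `clean_text_regex_only` and the sample text are hypothetical and not part of the notebook; the spaCy step is left out, and whitespace is collapsed last so that earlier removals do not leave double spaces.

```python
import re

def clean_text_regex_only(text):
    """Hypothetical helper: a subset of the regex steps from clean_text, without spaCy."""
    text = text.lower()
    text = re.sub(r'<[^>]+>', '', text)            # strip HTML tags
    text = re.sub(r'http\S+|www\.\S+', '', text)   # strip URLs
    text = re.sub(r'\b\d+\b(?!\.\d)', '', text)    # strip standalone numbers
    text = re.sub(r'[^\w\s-]', '', text)           # strip punctuation, keep hyphens
    return re.sub(r'\s+', ' ', text).strip()       # collapse whitespace last

# Made-up sample sentence, used only for illustration
sample = "<p>The Javan Myna (Acridotheres javanicus) has 2 white wing-patches! See https://example.org/myna</p>"
cleaned = clean_text_regex_only(sample)
print(cleaned)
```

Running each substitution on a short string like this is a quick way to confirm that a pattern removes what its comment says it removes.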

**Removing Stopwords and Generating Tokens & Bigrams**


In [11]:
# Function to remove stopwords
def remove_stopwords(text):
    # Remove stopwords from the text and return the filtered text
    tokens = text.split()
    removed = [word for word in tokens if word.lower() in combined_stopwords]  # Track removed stopwords
    filtered_tokens = [word for word in tokens if word.lower() not in combined_stopwords]
    return " ".join(filtered_tokens), removed  # Return both filtered text and removed stopwords

In [12]:
# Function to process tokens (tokenize, lemmatize, and generate bigrams)
def process_tokens(text):
    # Remove stopwords first
    text_no_stopwords, removed_stopwords = remove_stopwords(text)

    # Run the spaCy pipeline once (tokenization and lemmatization)
    doc = nlp(text_no_stopwords)

    # Surface forms of the tokens, excluding punctuation and whitespace
    tokens = [token.text for token in doc if not token.is_punct and not token.is_space]

    # Lemmas of the same tokens; these are the input for the bigrams below
    cleaned_tokens = [token.lemma_ for token in doc if not token.is_punct and not token.is_space]

    # Generate bigrams
    bigrams_list = list(bigrams(cleaned_tokens))
    bigrams_str = [" ".join(bigram) for bigram in bigrams_list]

    return pd.Series([text_no_stopwords, tokens, bigrams_str, removed_stopwords])
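
As a small illustration of the bigram step, the sketch below pairs adjacent lemmas in plain Python, which produces the same consecutive pairs that `nltk.util.bigrams` yields. The helper name `make_bigrams` is hypothetical; the input lemmas are taken from the Javan Myna sample row shown further down.

```python
def make_bigrams(tokens):
    """Join each pair of adjacent tokens into a single space-separated string."""
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]

lemmas = ["javan", "myna", "also", "know", "white"]
bigram_strings = make_bigrams(lemmas)
print(bigram_strings)
```

A list with fewer than two tokens produces no bigrams, so very short descriptions end up with an empty `bigrams` entry.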

In [13]:
# Apply cleaning to the 'description' column
if 'description' in df.columns:
    print("\nCleaning raw text data...\n")
    df['cleaned_description'] = df['description'].apply(clean_text)
    print("\nFirst few rows of cleaned text:")
df[['description', 'cleaned_description']].head()



Cleaning raw text data...


First few rows of cleaned text:


|   | description | cleaned_description |
|---|---|---|
| 0 | Looking for fun and interesting facts about a ... | looking for fun and interesting facts about a ... |
| 1 | Giant Panda. Grey Wolf. Canis lupus. Proboscis... | giant panda grey wolf canis lupus proboscis mo... |
| 2 | javanicus displayed during feeding such as wal... | javanicus displayed during feeding such as wal... |
| 3 | Black-naped Oriole: Explore more topics. Name;... | black - naped oriole explore more topics name ... |
| 5 | The Javan myna, also known as the white-vented... | the javan myna also known as the white - vente... |


In [14]:
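# Function to remove unwanted rows (e.g., rows that match a publication date pattern)
# The pattern is matched against the raw 'description' column, because clean_text()
# strips the colon, comma, and standalone digits that the pattern relies on.
def remove_unwanted_rows(df):
    # Remove rows where 'description' matches the publication date pattern
    df = df[~df['description'].str.contains(r'published: \w+ \d{1,2}, \d{4}', case=False, na=False)]
    return df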

In [15]:
# Remove rows where the cleaned description is empty or contains only whitespace
df = df[df['cleaned_description'].str.strip() != '']
print("\nShape after removing empty descriptions:", df.shape)


Shape after removing empty descriptions: (39, 3)


In [16]:
# Apply processing to remove stopwords, tokenize, and create bigrams
print("\nProcessing text to remove stopwords, tokenize, and generate bigrams...\n")
df[['cleaned_no_stopwords', 'tokens', 'bigrams', 'removed_stopwords']] = df['cleaned_description'].apply(process_tokens)


Processing text to remove stopwords, tokenize, and generate bigrams...



In [17]:
# Print sample of cleaned text (no stopwords), tokens, bigrams, and removed stopwords
print("\nSample cleaned text (no stopwords), tokens, bigrams, and removed stopwords:")
df[['cleaned_description', 'cleaned_no_stopwords', 'tokens', 'bigrams', 'removed_stopwords']].head()


Sample cleaned text (no stopwords), tokens, bigrams, and removed stopwords:


|   | cleaned_description | cleaned_no_stopwords | tokens | bigrams | removed_stopwords |
|---|---|---|---|---|---|
| 0 | looking for fun and interesting facts about a ... | looking fun interesting facts black - naped or... | [looking, fun, interesting, facts, black, nape... | [look fun, fun interesting, interesting fact, ... | [for, and, about, a, about, this, bird, and, o... |
| 1 | giant panda grey wolf canis lupus proboscis mo... | giant panda grey wolf canis lupus proboscis mo... | [giant, panda, grey, wolf, canis, lupus, probo... | [giant panda, panda grey, grey wolf, wolf cani... | [about, and, and, and, and] |
| 2 | javanicus displayed during feeding such as wal... | javanicus displayed feeding walking jumping ho... | [javanicus, displayed, feeding, walking, jumpi... | [javanicus display, display feed, feed walk, w... | [during, such, as, and, a] |
| 3 | black - naped oriole explore more topics name ... | black - naped oriole explore topics name male ... | [black, naped, oriole, explore, topics, name, ... | [black naped, naped oriole, oriole explore, ex... | [more, description, description, with, and, but] |
| 5 | the javan myna also known as the white - vente... | javan myna also known white - vented myna myna... | [javan, myna, also, known, white, vented, myna... | [javan myna, myna also, also know, know white,... | [the, as, the, is, a, species, of, it, is, a, ... |


In [18]:
# Extract all removed stopwords from the dataframe
all_removed_stopwords = [item for sublist in df['removed_stopwords'] for item in sublist]
removed_stopwords_set = set(all_removed_stopwords)  # Get unique stopwords

In [19]:
# Print the removed stopwords
print("\nList of unique stopwords removed:")
print(removed_stopwords_set)


List of unique stopwords removed:
{'to', 'that', 'me', 'over', 'for', 'when', 'during', 'he', 'once', 'from', 'against', 'been', 'description', 'him', 'it', 'at', 'all', 'if', 'while', 'more', 'such', 'the', 'or', 'be', 'does', 'has', 's', 'she', 'its', 'are', 'do', 'and', 'them', 'very', 'there', 'which', 'i', 'o', 'm', 'out', 'into', 't', 'before', 'of', 'no', 'with', 'both', 'can', 'by', 'about', 'have', 'as', 'my', 'their', 'species', 'how', 'was', 'not', 'you', 'an', 'what', 'through', 'other', 'we', 'than', 'this', 'where', 'now', 'is', 'in', 'after', 'up', 'our', 'these', 'most', 'so', 'on', 'a', 'they', 'her', 'bird', 'but'}



### **Splitting Data into Training and Testing Sets**


In [21]:
X_train, X_test, y_train, y_test = train_test_split(
    df["cleaned_no_stopwords"], df["species"], test_size=0.2, random_state=42
)

### **Feature Extraction Using TF-IDF**

In [22]:
print("Missing values in X_train:", X_train.isnull().sum())
print("Missing values in X_test:", X_test.isnull().sum())
# Handle missing values by replacing NaN with an empty string
X_train = X_train.fillna("")
X_test = X_test.fillna("")

def get_tfidf_features(train_texts, test_texts):
    tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
    tfidf_train_matrix = tfidf_vectorizer.fit_transform(train_texts)
    tfidf_test_matrix = tfidf_vectorizer.transform(test_texts)
    return tfidf_train_matrix, tfidf_test_matrix, tfidf_vectorizer.get_feature_names_out()

tfidf_train_array, tfidf_test_array, feature_names = get_tfidf_features(X_train, X_test)

Missing values in X_train: 0
Missing values in X_test: 0
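
For intuition about what the TF-IDF features contain, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: smoothed IDF followed by L2 normalization of each row. It is an approximation for illustration only. It tokenizes on whitespace and skips stop-word removal and the `max_features` cap, and both the helper name `tfidf_rows` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Smoothed TF-IDF with L2 row normalization for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        # L2 norm of the row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows, idf

# Made-up documents, used only for illustration
toy_docs = ["javan myna white wing", "black naped oriole yellow", "javan myna yellow bill"]
rows, idf = tfidf_rows(toy_docs)
print(rows[0])
```

Terms that occur in fewer documents receive a larger IDF, so in the first toy document "white" outweighs "javan". In the notebook the IDF values are learned from `X_train` only, which means terms that appear only in `X_test` are ignored at transform time.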


### **Model Evaluation and Justification**

In [23]:

def evaluate_model(classifier, model_name):
    classifier.fit(tfidf_train_array, y_train)  # Train model
    y_pred = classifier.predict(tfidf_test_array)  # Predict species
    metrics = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, average='weighted'),
        "Recall": recall_score(y_test, y_pred, average='weighted'),
        "F1-Score": f1_score(y_test, y_pred, average='weighted')
    }

    print(f"\n=== {model_name} ===")
    print("Justification:")
    if model_name == "Logistic Regression":
        print("Chosen for its ability to perform well on linearly separable text data, offering interpretable results.")
    elif model_name == "Naive Bayes":
        print("Selected for its efficiency in handling text data with independent features.")
    elif model_name == "Random Forest":
        print("Selected for its capability to capture complex, non-linear relationships between text-derived features and species.")
    print("\nPerformance Metrics:")
    for metric, value in metrics.items():
        print(f"{metric}: {value:.2f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    return classifier, metrics

In [24]:
model_results = {}

# Logistic Regression
lr_model, lr_metrics = evaluate_model(LogisticRegression(random_state=42, max_iter=200), "Logistic Regression")
model_results["Logistic Regression"] = lr_metrics


=== Logistic Regression ===
Justification:
Chosen for its ability to perform well on linearly separable text data, offering interpretable results.

Performance Metrics:
Accuracy: 0.88
Precision: 0.90
Recall: 0.88
F1-Score: 0.87

Classification Report:
                    precision    recall  f1-score   support

Black-naped Oriole       0.83      1.00      0.91         5
        Javan Myna       1.00      0.67      0.80         3

          accuracy                           0.88         8
         macro avg       0.92      0.83      0.85         8
      weighted avg       0.90      0.88      0.87         8



## **Logistic Regression Model Analysis**
**Model Justification**

The Logistic Regression model was selected because it performs well on linearly separable text data and produces interpretable results: each term's coefficient shows how strongly it pushes a prediction toward one species. The observed performance is consistent with that expectation, although it is measured on a small test split.

**Overall Performance Metrics**

The Logistic Regression model reaches an accuracy of 0.88, meaning it correctly classified 7 of the 8 descriptions in the test split. The precision, recall, and F1 figures printed under "Performance Metrics" are support-weighted averages over both species, not scores for a single positive class. Weighted precision is 0.90, weighted recall is 0.88 (with weighted averaging, recall always equals accuracy), and the weighted F1-score is 0.87. With only 8 test samples, each misclassification shifts accuracy by 12.5 percentage points, so these figures are indicative rather than precise.

**Classification Report Analysis**

Black-naped Oriole: Precision is 0.83, meaning that 5 of the 6 descriptions predicted as Black-naped Oriole were correct. Recall is perfect (1.00): all 5 true Oriole descriptions were identified. The single false positive is a Javan Myna description that the model labeled as an Oriole.

Javan Myna: Precision is 1.00, so every description predicted as Javan Myna was correct. Recall is 0.67: the model found 2 of the 3 true Javan Myna descriptions and missed 1, which is the same error that lowers the Oriole precision.
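
The per-class figures can be checked by hand. The sketch below assumes the confusion counts implied by the classification report (supports of 5 and 3, Oriole recall of 1.00, Javan Myna recall of 0.67). The individual test labels are not printed in the notebook, so the matrix is a reconstruction, and the helper name `per_class_scores` is hypothetical.

```python
# Confusion counts implied by the classification report
# (rows = true class, columns = predicted class). This is a reconstruction,
# since the individual test labels are not shown in the notebook.
classes = ["Black-naped Oriole", "Javan Myna"]
confusion = [
    [5, 0],  # 5 true Orioles: all predicted as Oriole
    [1, 2],  # 3 true Mynas: 1 predicted as Oriole, 2 predicted as Myna
]

def per_class_scores(confusion, index):
    """Return (precision, recall, f1) for the class at the given index."""
    true_positives = confusion[index][index]
    predicted_total = sum(row[index] for row in confusion)  # column total
    actual_total = sum(confusion[index])                    # row total
    precision = true_positives / predicted_total
    recall = true_positives / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

scores = {name: per_class_scores(confusion, i) for i, name in enumerate(classes)}
accuracy = sum(confusion[i][i] for i in range(len(classes))) / sum(map(sum, confusion))

for name, (p, r, f1) in scores.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.3f}")
```

A single off-diagonal entry accounts for both the reduced Oriole precision and the reduced Javan Myna recall.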

**Macro and Weighted Averages**

Macro Average: The macro average is the unweighted mean of the per-class scores, so both species count equally regardless of how many samples each has. It gives precision of 0.92, recall of 0.83, and an F1-score of 0.85. Precision is high for both classes, while the lower recall comes entirely from the missed Javan Myna description.

Weighted Average: The weighted average weights each class by its support (5 Black-naped Oriole and 3 Javan Myna test samples) and gives an F1-score of 0.87. It is slightly higher than the macro F1 because the better-performing Oriole class has more samples.
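
The difference between the two averaging schemes can be sketched in a few lines. The per-class values below are the unrounded fractions implied by the report's supports (for example, 0.83 is taken to be 5/6), which is an assumption; the helper names are hypothetical.

```python
# Per-class values and supports; fractions are the unrounded values
# implied by the classification report (an assumption, see lead-in)
support = {"Black-naped Oriole": 5, "Javan Myna": 3}
precision = {"Black-naped Oriole": 5 / 6, "Javan Myna": 1.0}
recall = {"Black-naped Oriole": 1.0, "Javan Myna": 2 / 3}

def macro_average(values):
    """Unweighted mean: every class counts equally."""
    return sum(values.values()) / len(values)

def weighted_average(values, support):
    """Mean weighted by the number of true samples in each class."""
    total = sum(support.values())
    return sum(values[name] * support[name] for name in values) / total

macro_precision = macro_average(precision)
weighted_precision = weighted_average(precision, support)
macro_recall = macro_average(recall)
weighted_recall = weighted_average(recall, support)

print(f"precision: macro={macro_precision:.2f} weighted={weighted_precision:.2f}")
print(f"recall:    macro={macro_recall:.2f} weighted={weighted_recall:.2f}")
```

Weighted recall works out to the same value as accuracy, because weighting each class's recall by its support simply counts all correct predictions over all samples.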

**Suggestions for Improvement**

While Logistic Regression gives strong results on this split, recall for the Javan Myna is the clear weak point: one of its three test descriptions was missed. Because the test set holds only 8 samples, a more reliable estimate (for example from cross-validation or a larger test set) would help confirm whether this is a consistent pattern. Beyond that, options include further tuning of the model, comparing it against other models such as Random Forest or Support Vector Machines, and feature engineering to capture more subtle patterns that could improve recall across classes.
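
One possible way to follow up on these suggestions is sketched below: a stratified cross-validation of a TF-IDF plus Logistic Regression pipeline with `class_weight="balanced"`, scored on macro recall. This is only an illustration under stated assumptions, not the method used in this notebook. The toy corpus is made up and stands in for `df["cleaned_no_stopwords"]` and `df["species"]`, and class weighting is just one of several options for the minority class.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the cleaned descriptions and species labels
texts = [
    "yellow plumage black nape stripe", "golden yellow oriole fluty whistle",
    "black naped oriole canopy fruit", "bright yellow pink bill oriole",
    "oriole nest high tree fork", "oriole yellow body black wings",
    "javan myna white vent urban", "myna yellow bill dark grey",
    "javan myna roost flock city", "myna scavenges hawker centre",
]
labels = ["Black-naped Oriole"] * 6 + ["Javan Myna"] * 4

# Fitting TF-IDF inside the pipeline keeps each fold's vocabulary
# limited to that fold's training portion
pipeline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=200, random_state=42),
)

# Stratification keeps both species represented in every fold
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
recall_per_fold = cross_val_score(pipeline, texts, labels, cv=cv, scoring="recall_macro")

print("Macro recall per fold:", recall_per_fold)
print("Mean macro recall:", recall_per_fold.mean())
```

Averaging over several folds gives a steadier estimate than a single 8-sample split, and the spread across folds shows how much the score depends on which descriptions land in the test portion.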

---
Reference: I utilized AI tools to assist with phrasing, debugging, and as a learning and referencing resource. This support helped me enhance my work.