# **Natural Language Processing Assignment**  
### **Classifying Bird Species Based on Descriptions Using Supervised Learning Techniques**

**Student Name**: Tia Isabel Solanki  
**Admin Number**: 220892L  
**Class**: AA2303

---

## **Part 2: Feature Extraction and Model Training**  
*Transforming text data into features and training supervised classification models.*


---

In [1]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:

import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/NYP/Year 2/sem 2/[2] IT2391 NATURAL LANGUAGE PROCESSING/NLP Assignment/cleaned_text_output.csv')
df

Unnamed: 0,description,cleaned_no_stopwords,species
0,2 Jun 2023 ï¿½ The Javan myna shares some simi...,javan myna shares similarities common myna ter...,Javan Myna
1,The black-headed oriole ( Oriolus larvatus) is...,black - headed oriole oriolus larvatus family ...,Black-naped Oriole
2,"Search from thousands of royalty-free ""Javan M...",search thousands royalty - free javan myna sto...,Javan Myna
3,521 foreground recordings and 156 background ...,foreground recordings background recordings eg...,Little Egret
4,The little egret (Egretta garzetta) is a smal...,little egret egretta garzetta small white hero...,Little Egret
...,...,...,...
599,"August 13, 2016 - HISTORICAL records show that...",historical records show little egret egretta g...,Little Egret
600,File: Black-naped Oriole (Oriolus chinensis ch...,file black - naped oriole oriolus chinensis ch...,Black-naped Oriole
601,Larger than a Cattle Egret and with black leg...,larger cattle egret black legs yellow slippers...,Little Egret
602,22 Oct 2023 ï¿½ Dragon Snake (Javan Tubercle S...,dragon snake javan tubercle snake javan mudsna...,Javan Myna


**Import Necessary Libraries**

In [3]:
# Import necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

### **Step 1: Data Splitting and Preprocessing**

**Goal:**
The primary aim of this assignment is to classify species based on textual descriptions. The dataset contains species descriptions along with their corresponding species names. The goal is to develop a predictive model that can accurately classify species based on the input descriptions.

**Why Splitting the Data is Important:**
To evaluate how well the model generalizes to unseen data, we need to split the dataset into a training set and a testing set. The training set is used to train the model, while the testing set is reserved for evaluating the model’s performance.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    df["cleaned_no_stopwords"], df["species"], test_size=0.2, random_state=42
)
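
The split above does not pass `stratify`, so the class proportions in the training and test sets can differ by chance. The following is a minimal sketch of a stratified split, assuming a hypothetical toy DataFrame (`toy`) in place of the real `df`; it is an optional variation, not the split used for the results in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for df["cleaned_no_stopwords"] and df["species"]
toy = pd.DataFrame({
    "text": [f"description {i}" for i in range(20)],
    "species": ["Javan Myna"] * 12 + ["Little Egret"] * 8,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["text"],
    toy["species"],
    test_size=0.25,
    random_state=42,
    stratify=toy["species"],  # keep the 60/40 class ratio in both splits
)

# Class shares in each split; with stratification they match the full data
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

With the real data, the equivalent change would be adding `stratify=df["species"]` to the existing call.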

**Usage of TF-IDF Over Count Vectorization**

I used TF-IDF (Term Frequency-Inverse Document Frequency) instead of Count Vectorizer to transform the text into a numeric format that the machine learning models can work with. Count Vectorizer only records how often each word occurs, whereas TF-IDF also down-weights words that appear in many documents. Terms that help distinguish one species description from another therefore receive higher weights, while common words such as "bird", which may appear frequently but carry little class information, receive lower weights. This matters because words that are frequent across many species descriptions could otherwise dominate the model. Limiting the vocabulary to the 5000 most frequent terms (`max_features=5000`) reduces dimensionality and improves efficiency; the rarer terms that are dropped are assumed to contribute little to classification.
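
To make the difference concrete, here is a small sketch on a hypothetical three-document corpus (`toy_docs` is invented for illustration and is not taken from the dataset). Raw counts treat every word in a document equally, while TF-IDF gives "bird", which occurs in every document, a lower weight than the species-specific terms.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus: "bird" appears in every document
toy_docs = [
    "bird yellow oriole",
    "bird white egret",
    "bird blue kingfisher",
]

# Raw term counts: every word in a document gets the same value (1)
count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_docs)

# TF-IDF: counts are rescaled by inverse document frequency
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(toy_docs)

vocab = tfidf_vec.vocabulary_
bird_idf = tfidf_vec.idf_[vocab["bird"]]      # term found in all documents
oriole_idf = tfidf_vec.idf_[vocab["oriole"]]  # term found in one document

first_doc = tfidf[0].toarray().ravel()
print("idf(bird):", bird_idf, "idf(oriole):", oriole_idf)
print("weight of 'bird' in doc 0:", first_doc[vocab["bird"]])
print("weight of 'oriole' in doc 0:", first_doc[vocab["oriole"]])
```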

In [5]:
print("Missing values in X_train:", X_train.isnull().sum())
print("Missing values in X_test:", X_test.isnull().sum())
# Handle missing values by replacing NaN with an empty string
X_train = X_train.fillna("")
X_test = X_test.fillna("")

def get_tfidf_features(train_texts, test_texts):
    tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
    tfidf_train_matrix = tfidf_vectorizer.fit_transform(train_texts)
    tfidf_test_matrix = tfidf_vectorizer.transform(test_texts)
    return tfidf_train_matrix, tfidf_test_matrix, tfidf_vectorizer.get_feature_names_out()

tfidf_train_array, tfidf_test_array, feature_names = get_tfidf_features(X_train, X_test)

Missing values in X_train: 2
Missing values in X_test: 1


### **Step 3: Model Selection and Justification**
I selected three models based on their suitability for text classification:

**Logistic Regression:** Simple and interpretable, works well for linearly separable data.

**Naive Bayes:** Effective for text features, especially when feature independence holds, and computationally efficient.

**Random Forest:** Captures complex, non-linear relationships and reduces overfitting by averaging multiple decision trees. It also provides feature importance metrics, which help understand the role of specific terms in species descriptions.

**Why Random Forest Over Decision Tree:** Decision Trees were excluded due to their tendency to overfit, making them less robust. Random Forest mitigates this by averaging predictions from multiple trees, improving generalization and providing feature importance insights, which is valuable for text classification tasks.

### **Step 4: Model Training and Evaluation**
Each model is trained and evaluated using accuracy, precision, recall, and F1-score. We also print a classification report for a deeper understanding of the model's performance across all classes.
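
The metrics in this notebook use `average='weighted'`, which weights each class by its support, while the classification report also shows the unweighted macro average. The sketch below works through the difference on hypothetical toy labels (`y_true_toy` and `y_pred_toy` are invented for illustration). It also shows why weighted recall always equals accuracy.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical toy labels: class A has 8 samples, class B has 2
y_true_toy = ["A"] * 8 + ["B"] * 2
y_pred_toy = ["A"] * 8 + ["A", "B"]  # one B sample is misclassified as A

# Per-class recall: A = 8/8 = 1.0, B = 1/2 = 0.5
macro_recall = recall_score(y_true_toy, y_pred_toy, average="macro")        # (1.0 + 0.5) / 2
weighted_recall = recall_score(y_true_toy, y_pred_toy, average="weighted")  # (8*1.0 + 2*0.5) / 10
accuracy = accuracy_score(y_true_toy, y_pred_toy)                           # 9 correct out of 10

print(macro_recall, weighted_recall, accuracy)
```

The macro average is pulled down by the small class, while the weighted average is dominated by the large one, so a gap between the two hints at uneven per-class performance.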

In [6]:

def evaluate_model(classifier, model_name):
    classifier.fit(tfidf_train_array, y_train)  # Train model
    y_pred = classifier.predict(tfidf_test_array)  # Predict species
    metrics = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, average='weighted'),
        "Recall": recall_score(y_test, y_pred, average='weighted'),
        "F1-Score": f1_score(y_test, y_pred, average='weighted')
    }

    print(f"\n=== {model_name} ===")
    print("Justification:")
    if model_name == "Logistic Regression":
        print("Chosen for its ability to perform well on linearly separable text data, offering interpretable results.")
    elif model_name == "Naive Bayes":
        print("Selected for its efficiency in handling text data with independent features.")
    elif model_name == "Random Forest":
        print("Selected for its capability to capture complex, non-linear relationships between text-derived features and species.")
    print("\nPerformance Metrics:")
    for metric, value in metrics.items():
        print(f"{metric}: {value:.2f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    return classifier, metrics

### **Step 5: Comparing Models**
After training all the models, we compare them by organizing the metrics in a DataFrame. We also evaluate models based on a variety of metrics (accuracy, precision, recall, F1-score), not just accuracy alone.

In [7]:
# Step 4: Train and Evaluate Models
model_results = {}

# Logistic Regression
lr_model, lr_metrics = evaluate_model(LogisticRegression(random_state=42, max_iter=200), "Logistic Regression")
model_results["Logistic Regression"] = lr_metrics


=== Logistic Regression ===
Justification:
Chosen for its ability to perform well on linearly separable text data, offering interpretable results.

Performance Metrics:
Accuracy: 0.88
Precision: 0.89
Recall: 0.88
F1-Score: 0.89

Classification Report:
                     precision    recall  f1-score   support

 Black-naped Oriole       0.83      0.89      0.86        38
Collared Kingfisher       1.00      0.94      0.97        31
         Javan Myna       0.93      0.86      0.89        29
       Little Egret       0.79      0.83      0.81        23

           accuracy                           0.88       121
          macro avg       0.89      0.88      0.88       121
       weighted avg       0.89      0.88      0.89       121



The Logistic Regression model shows strong overall performance, with an accuracy of 88% and weighted precision, recall, and F1-score of 0.89, 0.88, and 0.89, respectively. It performs best on the Collared Kingfisher class, with perfect precision (1.00) and a recall of 0.94. For the Black-naped Oriole class, recall is high (0.89) but precision is lower (0.83), meaning some descriptions of other species are labelled as orioles. The Javan Myna class shows the opposite pattern, with high precision (0.93) and lower recall (0.86). The Little Egret class is the weakest, with the lowest precision (0.79), recall (0.83), and F1-score (0.81); it also has the smallest support (23 test samples). The macro averages (0.89, 0.88, 0.88) are close to the weighted averages, so performance is fairly even across classes. To improve performance further, especially for the Little Egret, additional training data could be collected, and the misclassified Black-naped Oriole and Javan Myna instances could be analyzed to identify patterns. Overall, the model performs well and gives interpretable results.
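
One way to analyze the misclassified instances is a confusion matrix, which shows which species each error was assigned to. The sketch below uses hypothetical toy labels (`toy_true`, `toy_pred`); in this notebook the same call would take `y_test` and the model's predictions instead, which is an assumption about how it would be wired in.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels: one Little Egret description is predicted as an oriole
species = ["Black-naped Oriole", "Little Egret"]
toy_true = ["Black-naped Oriole"] * 3 + ["Little Egret"] * 3
toy_pred = ["Black-naped Oriole"] * 4 + ["Little Egret"] * 2

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(toy_true, toy_pred, labels=species)
cm_table = pd.DataFrame(
    cm,
    index=[f"true {s}" for s in species],
    columns=[f"pred {s}" for s in species],
)
print(cm_table)
```

Off-diagonal cells count the errors, so a large value in one cell points to a specific pair of species that the model confuses.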

In [8]:
# Naive Bayes
nb_model, nb_metrics = evaluate_model(MultinomialNB(), "Naive Bayes")
model_results["Naive Bayes"] = nb_metrics


=== Naive Bayes ===
Justification:
Selected for its efficiency in handling text data with independent features.

Performance Metrics:
Accuracy: 0.83
Precision: 0.84
Recall: 0.83
F1-Score: 0.83

Classification Report:
                     precision    recall  f1-score   support

 Black-naped Oriole       0.76      0.89      0.82        38
Collared Kingfisher       0.97      0.90      0.93        31
         Javan Myna       0.84      0.90      0.87        29
       Little Egret       0.81      0.57      0.67        23

           accuracy                           0.83       121
          macro avg       0.84      0.81      0.82       121
       weighted avg       0.84      0.83      0.83       121



The Naive Bayes classifier achieves an accuracy of 83%, with weighted precision of 0.84, recall of 0.83, and F1-score of 0.83, the lowest of the three models. It performs well on the Collared Kingfisher class (precision 0.97, recall 0.90). For the Black-naped Oriole class, recall is 0.89 but precision is lower (0.76), suggesting that descriptions of other species are sometimes labelled as orioles. The Javan Myna class is fairly balanced (precision 0.84, recall 0.90). The Little Egret class is a notable weakness: recall is only 0.57 and the F1-score 0.67, so the model misses a large share of true Little Egret descriptions. The gap between the macro-average recall (0.81) and the weighted recall (0.83) reflects this uneven per-class performance. To improve performance, especially for the Little Egret, additional training data could be collected, and misclassified instances for the other classes could be analyzed to identify patterns. Overall, the model gives usable results, though addressing the class imbalance (Little Egret has the smallest support, 23 test samples) and feature overlap between classes could improve performance.



In [9]:
# Random Forest
rf_model, rf_metrics = evaluate_model(RandomForestClassifier(random_state=42, n_estimators=100), "Random Forest")
model_results["Random Forest"] = rf_metrics


=== Random Forest ===
Justification:
Selected for its capability to capture complex, non-linear relationships between text-derived features and species.

Performance Metrics:
Accuracy: 0.87
Precision: 0.91
Recall: 0.87
F1-Score: 0.87

Classification Report:
                     precision    recall  f1-score   support

 Black-naped Oriole       0.70      1.00      0.83        38
Collared Kingfisher       1.00      0.94      0.97        31
         Javan Myna       1.00      0.83      0.91        29
       Little Egret       1.00      0.61      0.76        23

           accuracy                           0.87       121
          macro avg       0.93      0.84      0.86       121
       weighted avg       0.91      0.87      0.87       121



The Random Forest classifier achieves an accuracy of 87%, with weighted precision of 0.91, recall of 0.87, and F1-score of 0.87. It classifies the Collared Kingfisher class well, with perfect precision and high recall (0.94), giving an F1-score of 0.97. For the Javan Myna class, precision is also perfect (1.00) but recall is lower (0.83), for an F1-score of 0.91. The Black-naped Oriole class has a recall of 1.00, meaning every oriole description was identified, but a precision of only 0.70, for an F1-score of 0.83. Because the other three classes all have a precision of 1.00, every misclassified description was predicted as Black-naped Oriole. The Little Egret class presents the biggest challenge, with the lowest recall (0.61) and F1-score (0.76). The difference between macro precision (0.93) and macro recall (0.84) shows that performance is less even across classes than the overall accuracy suggests. To improve performance, especially for the Little Egret class, additional training data and further tuning of the model could be beneficial. Overall, the Random Forest classifier gives strong results, but it over-predicts the Black-naped Oriole class.

In [10]:
# Compare models
print("\n=== Model Comparison ===")
comparison_df = pd.DataFrame(model_results).T
print(comparison_df)


=== Model Comparison ===
                     Accuracy  Precision    Recall  F1-Score
Logistic Regression  0.884298   0.889028  0.884298  0.885654
Naive Bayes          0.834711   0.840101  0.834711  0.830847
Random Forest        0.867769   0.906948  0.867769  0.867996
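
The table can also be queried directly to see which model leads on each metric. The sketch below re-enters the values printed above into a standalone DataFrame (`table` is a hypothetical name; in the notebook the same call would work on `comparison_df`).

```python
import pandas as pd

# Values copied from the model comparison output above
results = {
    "Logistic Regression": {"Accuracy": 0.884298, "Precision": 0.889028,
                            "Recall": 0.884298, "F1-Score": 0.885654},
    "Naive Bayes": {"Accuracy": 0.834711, "Precision": 0.840101,
                    "Recall": 0.834711, "F1-Score": 0.830847},
    "Random Forest": {"Accuracy": 0.867769, "Precision": 0.906948,
                      "Recall": 0.867769, "F1-Score": 0.867996},
}
table = pd.DataFrame(results).T

# For each metric (column), the name of the model with the highest value
best_per_metric = table.idxmax()
print(best_per_metric)
```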


**Logistic Regression:** The Logistic Regression model achieves an accuracy of 88%, with weighted precision, recall, and F1-score of 0.89, 0.88, and 0.89, respectively. It performs very well for Collared Kingfishers (precision 1.00, recall 0.94). However, it shows some misclassifications for Black-naped Orioles and Javan Mynas, and its lowest recall (0.83) and F1-score (0.81) are for Little Egrets.

**Naive Bayes:** The Naive Bayes model has an accuracy of 83%, with precision, recall, and F1-scores of 0.84, 0.83, and 0.83. It performs well for Collared Kingfishers (precision 0.97, recall 0.90) and Javan Mynas (precision 0.84, recall 0.90). However, it struggles with Little Egrets, with a recall of 0.57 and an F1-score of 0.67.

**Random Forest:** The Random Forest model achieves an accuracy of 87%, with weighted precision, recall, and F1-score of 0.91, 0.87, and 0.87. It performs well for Collared Kingfishers (precision 1.00, recall 0.94). For Javan Mynas, it shows high precision (1.00) but lower recall (0.83). Black-naped Orioles have perfect recall (1.00) but the lowest precision (0.70), and the Little Egret class has the lowest recall (0.61) and F1-score (0.76).

### **Best Performing Model**

**Best-Performing Model with Justification:** The Logistic Regression model is identified as the best-performing model. It achieves the highest accuracy (0.88), recall (0.88), and F1-score (0.89) of the three models, with a precision of 0.89. Random Forest has a slightly higher weighted precision (0.91), but its recall is uneven across classes (1.00 for Black-naped Oriole against 0.61 for Little Egret), whereas Logistic Regression keeps the gap between precision and recall small for every class (at most 0.07). These results show that Logistic Regression balances false positives and false negatives more effectively. Its strength lies in modelling linear decision boundaries, which appears to suit this dataset.

**Evaluation on Unseen Data:** On the held-out test set (121 descriptions that were not used for training), the Logistic Regression model reaches 88% accuracy. Its recall of 0.88 shows that it recovers most true instances of each species, and its precision of 0.89 shows that most of its predictions are correct. This balance matters in real-world applications where both false positives and false negatives are costly. Because the result comes from a single train/test split of about 600 descriptions, it is an estimate of the model's generalization ability rather than a guarantee.
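
A single 80/20 split gives one estimate of performance on unseen data. Cross-validation repeats the split several times and is a common way to check how stable that estimate is. The sketch below runs on a hypothetical toy corpus (`cv_texts`, `cv_labels`) and wraps the vectorizer and classifier in a pipeline so that TF-IDF is re-fitted on each training fold only; the fold count and scoring choice are assumptions, not settings used elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: six descriptions per species
cv_texts = (
    ["javan myna black bird yellow bill"] * 6
    + ["little egret white heron black legs"] * 6
)
cv_labels = ["Javan Myna"] * 6 + ["Little Egret"] * 6

# Vectorizer and classifier in one pipeline, fitted per fold
pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(max_iter=200, random_state=42),
)

# Stratified folds keep the class ratio in every fold
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
fold_scores = cross_val_score(
    pipeline, cv_texts, cv_labels, cv=folds, scoring="f1_weighted"
)

print("F1 per fold:", fold_scores)
print("Mean F1:", fold_scores.mean())
```

With the real data, `cv_texts` and `cv_labels` would be replaced by the cleaned description column (with missing values filled) and the species column.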

### **Feature Importance for Random Forest**
For Random Forest, we display the most important features that contribute to the classification decision. This helps us understand which terms play a key role in predicting species.

In [11]:
def display_feature_importance_rf(classifier, feature_names):
    importances = classifier.feature_importances_
    feature_importances = [(feature, round(importance, 10))
                           for feature, importance in zip(feature_names, importances)]
    feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)
    print("\n=== Random Forest Feature Importances ===")
    print("Key terms driving species classification:")
    for pair in feature_importances[:20]:
        print(f"Feature: {pair[0]:<20} Importance: {pair[1]}")

display_feature_importance_rf(rf_model, feature_names)


=== Random Forest Feature Importances ===
Key terms driving species classification:
Feature: javan                Importance: 0.061122987
Feature: oriole               Importance: 0.0479829825
Feature: collared             Importance: 0.0477226666
Feature: black                Importance: 0.0472700762
Feature: kingfisher           Importance: 0.0427217708
Feature: myna                 Importance: 0.0375950451
Feature: little               Importance: 0.0321154017
Feature: naped                Importance: 0.0311431383
Feature: egret                Importance: 0.0308495884
Feature: oriolus              Importance: 0.0221175838
Feature: mynas                Importance: 0.0149703135
Feature: acridotheres         Importance: 0.0143053619
Feature: chinensis            Importance: 0.0129926631
Feature: kingfishers          Importance: 0.0115418927
Feature: chloris              Importance: 0.0094753432
Feature: egretta              Importance: 0.0093240088
Feature: garzetta             Import

**Random Forest Model Performance:** The Random Forest model identifies the key terms driving species classification. The term "javan" has the highest importance score (0.0611), reflecting its relevance to identifying the Javan Myna. The next most important features are "oriole" (0.0480), "collared" (0.0477), "black" (0.0473), and "kingfisher" (0.0427): "oriole" and "black" point to the Black-naped Oriole, while "collared" and "kingfisher" point to the Collared Kingfisher. "myna" (0.0376) further supports the Javan Myna, and "little" (0.0321) and "egret" (0.0308) support the Little Egret. Almost all of the top-ranked terms are words from the species' common or scientific names (for example "oriolus", "acridotheres", and "egretta"), so the model relies mainly on whether a description mentions the species by name. These importance scores are global: they measure how much a term helps the forest overall and do not by themselves indicate which species a term is associated with.