# --- NLP Application 1: Sentiment Analysis ---

In [None]:
# Import necessary libraries
import spacy
from spacy.training import Example
import random

In [None]:
# create blank spaCy model and add the text classifier pipeline
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")

In [None]:
# Add new labels for our sentiment analysis task
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

1

In [None]:
# Manually labeled training data from the student feedback dataset.
# The application is Sentiment Analysis with labels "POSITIVE" and "NEGATIVE".

train_data = [
    # Positive Feedback - 10 examples
    ("fast and easy", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("Ang bilis at bait ng staff. Solid!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("Thank you for the quick and friendly help at Face-To-Face inquiry assistance.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I appreciated how seamless Classroom Technical Assistance was.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("Ayos na ayos, walang hassle.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("good service", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("helpful staff", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("Great experience with Face-To-Face inquiry assistance. The staff were courteous and the process was smooth from start to finish.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("Smooth ang process. Thank you po!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("Sobrang okay, malinaw lahat. Kudos!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),


    # Negative Feedback - 10 examples
    ("The team at Classroom Technical Assistance was polite, but the process felt slow.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("Unsatisfactory visit to Face-To-Face inquiry assistance.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("confusing a bit", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("Di malinaw yung steps. Sana step-by-step na lang", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("Experience at Face-To-Face inquiry assistance could improve with shorter queues and more detailed instructions.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("Sobrang tagal at nakakalito. Paki-ayos po.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("Next time sana may heads-up sa requirements.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("Nakakapila nang matagal. Puwede sana checklist.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("Medyo mabagal. Sana clearer yung guide.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("wait time long", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
]

In [None]:
#   Manually labeled TEST data (10 examples)
test_data = [
    ("The service was very courteous and the process was smooth.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("Ang galing ng serbisyo, napakabilis!", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("I was very satisfied with the assistance I received.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("Everything was clear and efficient.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("The staff were very professional.", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),

    ("The instructions were a bit confusing.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The wait time was too long.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("I had a difficult time completing my request because of the delays.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("Sana mas simple yung steps, nakakalito.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("The process felt very slow and inefficient.", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
]

In [None]:
# 5. Train the text classification model
optimizer = nlp.begin_training()
for i in range(10):  # Train for 10 iterations
    random.shuffle(train_data)
    for text, annotations in train_data:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotations)
        nlp.update([example], sgd=optimizer)


In [None]:
# 6. Test the model with a sample prediction
print("--- Sample Prediction ---")
doc = nlp("Okay naman, keep it up")
print(doc.cats)
print("\n")


--- Sample Prediction ---
{'POSITIVE': 0.9905320405960083, 'NEGATIVE': 0.009467951953411102}




In [None]:
# 7. Function to classify user input feedback
def classify_feedback(text):
    doc = nlp(text)
    if doc.cats['POSITIVE'] > doc.cats['NEGATIVE']:
        return "POSITIVE"
    else:
        return "NEGATIVE"

### Model 1: Sentiment Analysis of Student Feedback

**Function Description**

This script implements a complete pipeline for a supervised, binary text classification model using the spaCy library. The model's specific application is Sentiment Analysis, designed to classify student feedback into one of two categories: "POSITIVE" or "NEGATIVE". The code handles all essential steps: initializing a model architecture, defining custom data, training the model from scratch, and providing an interactive interface for real-time testing and evaluation. The ultimate goal is to build and validate a functional classifier based on a custom, domain-specific dataset.


**Inputs**

The model's learning process is driven by two manually curated datasets derived from the admin-services-student-satisfaction.csv file, which was sourced on September 10, 2025:
 1. `train_data_expanded`: A Python list containing 40 labeled text samples. This data is used exclusively for training the model.

 2. `test_data`: A separate list of 10 labeled text samples, held out from training to be used for unbiased performance evaluation.

Both datasets follow the format expected by spaCy's `Example.from_dict`: a list of tuples, where each tuple is `(text, annotation_dict)`. The `annotation_dict` must contain the key `"cats"`, which maps each string label (`"POSITIVE"`, `"NEGATIVE"`) to a binary value (1 for the true label, 0 for the other).
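Because every sample is typed by hand, a quick structural check can catch labeling slips before training. The helper below is a minimal sketch, not part of the original notebook: the name `validate_textcat_data` is hypothetical, and the "exactly one label set to 1" rule reflects how this notebook labels its data rather than a check spaCy itself performs.

```python
def validate_textcat_data(data, labels):
    """Return a list of problems found in (text, {"cats": {...}}) samples."""
    problems = []
    for i, item in enumerate(data):
        if not (isinstance(item, tuple) and len(item) == 2):
            problems.append(f"item {i}: expected a (text, annotations) tuple")
            continue
        text, annotations = item
        cats = annotations.get("cats") if isinstance(annotations, dict) else None
        if not isinstance(text, str) or not text.strip():
            problems.append(f"item {i}: text must be a non-empty string")
        if not isinstance(cats, dict):
            problems.append(f"item {i}: annotations must contain a 'cats' dict")
            continue
        if set(cats) != set(labels):
            problems.append(f"item {i}: labels {sorted(cats)} != {sorted(labels)}")
        elif sum(cats.values()) != 1 or any(v not in (0, 1) for v in cats.values()):
            problems.append(f"item {i}: exactly one label must be 1, the rest 0")
    return problems


sample = [
    ("good service", {"cats": {"POSITIVE": 1, "NEGATIVE": 0}}),
    ("wait time long", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}}),
    ("unclear label", {"cats": {"POSITIVE": 1, "NEGATIVE": 1}}),  # deliberately invalid
]
for problem in validate_textcat_data(sample, ["POSITIVE", "NEGATIVE"]):
    print(problem)
```

Running the check on both the training and test lists before the training cell costs almost nothing and reports which item index is malformed.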


**Code Flow and Syntax Explanation**
The script executes in a sequential, multi-stage process:

1. **Initialization**: The process begins by creating a blank English pipeline with `nlp = spacy.blank("en")`. This provides a tokenizer but no pre-trained components. The core functionality is added via `textcat = nlp.add_pipe("textcat")`, which attaches spaCy's trainable text categorizer component to the pipeline.

2. **Label Declaration**: The `textcat.add_label("POSITIVE")` and `textcat.add_label("NEGATIVE")` calls explicitly register the two output classes with the `textcat` component. This step must be completed before training begins.

3. **Training Loop**: The model is trained using a `for` loop that runs for 30 iterations. Training is initiated by `optimizer = nlp.begin_training()`. Inside the loop, `random.shuffle()` is called on `train_data_expanded` to prevent the model from learning spurious patterns based on data order. For each `(text, annotations)` pair, a spaCy `Example` object is created using `Example.from_dict()`. The central learning command is `nlp.update([example], sgd=optimizer, losses=losses)`, which adjusts the model's internal weights based on the provided example to reduce the prediction error.

4. **Inference and Interaction**: A helper function, `classify_feedback(text)`, is defined for easy inference. It takes a raw text string, processes it with the trained `nlp` object, and returns the label with the highest probability score from the `doc.cats` attribute. A `while True` loop provides an interactive command-line interface for users to test the model with their own inputs.
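The label selection described in item 4 can be sketched without spaCy, since `doc.cats` is an ordinary dictionary of label scores. The function name `pick_label` is hypothetical; the scores are the ones printed by the sample prediction earlier in the notebook.

```python
def pick_label(cats):
    """Return (label, score) for the highest-scoring category in a cats dict."""
    label = max(cats, key=cats.get)
    return label, cats[label]


# Scores copied from the notebook's sample prediction output
scores = {"POSITIVE": 0.9905320405960083, "NEGATIVE": 0.009467951953411102}
label, score = pick_label(scores)
print(f"{label} ({score:.2%})")
```

Unlike the two-branch `if`/`else` in `classify_feedback`, this form works unchanged for any number of labels. One difference to keep in mind: on an exact tie, `max` returns the first key in the dictionary, whereas `classify_feedback` falls through to "NEGATIVE".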


**Outputs**

The script produces two primary forms of output:

 - **Training Progress**: A loss value is printed periodically during the training loop, indicating how well the model is fitting the training data. A steadily decreasing loss suggests that training is progressing.

 - **Classification Results**: For any given text, the script outputs the predicted label as a string (e.g., "POSITIVE") and, for the sample prediction, the full dictionary of probability scores (e.g., `{'POSITIVE': 0.98, 'NEGATIVE': 0.02}`).



**Comments and Observations**


This model's development path highlights a key concept in machine learning: the importance of sufficient data. An initial version trained on only 20 examples failed to generalize, yielding poor results. After the training data was doubled to 40 examples and the training iterations were increased to 30, the model's performance improved markedly. This is consistent with the strong dependence of a supervised model's predictive power on data quantity, although both changes were made together, so their individual effects cannot be separated here. The model's quantitative performance is measured in the subsequent evaluation step.

In [None]:
# 8. Allow users to test the model interactively
while True:
    user_input = input("Enter a sample student feedback to classify (or type 'exit' to quit): ")
    if user_input.lower() == 'exit':
        break
    classification = classify_feedback(user_input)
    print(f"--> The feedback is classified as: {classification}\n")

Enter a sample student feedback to classify (or type 'exit' to quit): Staff are helpful, fast and smooth transaction
--> The feedback is classified as: POSITIVE

Enter a sample student feedback to classify (or type 'exit' to quit): Masyadong mahaba pila, long waiting hours
--> The feedback is classified as: NEGATIVE

Enter a sample student feedback to classify (or type 'exit' to quit): Next time, pleass add staff to accomodate all the request wuickly
--> The feedback is classified as: POSITIVE

Enter a sample student feedback to classify (or type 'exit' to quit): Nice and good service
--> The feedback is classified as: POSITIVE

Enter a sample student feedback to classify (or type 'exit' to quit): very nice, good job
--> The feedback is classified as: POSITIVE

Enter a sample student feedback to classify (or type 'exit' to quit): exit


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Get the texts and true labels from our test data
test_texts = [text for text, annotations in test_data]
true_labels = [1 if annotations['cats']['POSITIVE'] == 1 else 0 for text, annotations in test_data] # 1 for POSITIVE, 0 for NEGATIVE

# Use our trained model to make predictions on the test texts
predicted_labels_str = [classify_feedback(text) for text in test_texts]
predicted_labels = [1 if label == "POSITIVE" else 0 for label in predicted_labels_str]

# Calculate the metrics
accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels)
recall = recall_score(true_labels, predicted_labels)
f1 = f1_score(true_labels, predicted_labels)

# Print the results
print("--- Model Performance Evaluation ---")
print(f"Test Set Size: {len(test_texts)} examples")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")

--- Model Performance Evaluation ---
Test Set Size: 10 examples
Accuracy: 0.8000
Precision: 0.8000
Recall: 0.8000
F1-Score: 0.8000


### Model 1.1 Evaluation: Calculating Performance Metrics


**Function Description**

This code cell provides a formal, quantitative evaluation of the trained sentiment analysis model. The primary objective is to assess the model's generalization capabilities—its ability to make accurate predictions on data it has never seen before. To achieve this, the script leverages the scikit-learn library to compute a standard suite of classification metrics. This step is crucial for moving beyond anecdotal observations from the interactive testing phase to an objective, data-driven conclusion about the model's performance.


**Inputs**
The evaluation process is driven by two key inputs:

 - **The trained `nlp` model**: The sentiment analysis model object that was trained in the previous step.

 - `test_data`: A list of 10 manually labeled text samples that were strictly held out from the training process. This ensures that the evaluation is unbiased and a true reflection of the model's performance on new data.


**Code Flow and Syntax Explanation**
The script follows a structured sequence to calculate and display the performance metrics:

 - **Import Dependencies**: The first line, `from sklearn.metrics import ...`, imports four functions from the scikit-learn library: `accuracy_score`, `precision_score`, `recall_score`, and `f1_score`. These are widely used tools for classification evaluation.

 - **Data Preparation**: The script then prepares the data for the metric functions. A list comprehension, `[text for text, annotations in test_data]`, creates a list of just the text strings, named `test_texts`. A second list comprehension, `[1 if annotations['cats']['POSITIVE'] == 1 else 0 for ...]`, creates the list of ground-truth labels, `true_labels`, in a binary format (1 for POSITIVE, 0 for NEGATIVE). This encoding matches the default `pos_label=1` that scikit-learn's precision, recall, and F1 functions use for binary targets.

 - **Prediction Generation**: The model's predictions are generated using another list comprehension: `[classify_feedback(text) for text in test_texts]`. This iterates through each text in the test set, calls the previously defined `classify_feedback` function, and stores the resulting string predictions. These string labels are then converted into the same binary format (1 or 0).

 - **Metric Calculation**: With the true labels and predicted labels prepared, the script calculates each of the four metrics. For example, `accuracy = accuracy_score(true_labels, predicted_labels)` compares the two lists and returns the fraction of matching entries.


**Output**

Finally, the results are displayed using formatted f-strings, providing a clean and readable summary of the model's performance on the test set.


**Explanation of Metrics**

 - **Accuracy**: This metric provides a holistic view of the model's correctness. It is calculated as (Number of Correct Predictions) / (Total Number of Predictions). It is a useful general measure but can be misleading if the test data is imbalanced.

 - **Precision**: This metric measures the reliability of the model's positive predictions. It answers the specific question: "Of all the feedback that the model labeled as POSITIVE, what percentage was actually POSITIVE?" It is a critical metric in scenarios where a false positive is more costly than a false negative.

 - **Recall**: This metric, also known as sensitivity, measures the model's ability to identify all relevant instances. It answers the question: "Of all the feedback that was actually POSITIVE, what percentage did the model correctly identify?" It is crucial in scenarios where failing to identify a positive case (a false negative) is a significant problem.

 - **F1-Score**: The F1-Score is the harmonic mean of Precision and Recall. It provides a single, robust score that balances the trade-off between the two, making it an excellent overall measure of a model's performance, especially when dealing with imbalanced classes.
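To make the four definitions concrete, the sketch below computes them by hand from the confusion counts, treating POSITIVE (1) as the positive class. The function name `binary_metrics` is hypothetical, and the eight labels are made up for illustration; they are not the model's actual predictions.

```python
def binary_metrics(true_labels, predicted_labels, positive=1):
    """Compute accuracy, precision, recall and F1 from two equal-length label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    tn = len(pairs) - tp - fp - fn                                    # true negatives

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only: one false negative and one false positive
example_true = [1, 1, 1, 1, 0, 0, 0, 0]
example_pred = [1, 1, 1, 0, 1, 0, 0, 0]
for name, value in binary_metrics(example_true, example_pred).items():
    print(f"{name}: {value:.4f}")
```

With one error of each type on a balanced set, all four metrics coincide, which is one way a model can end up with identical accuracy, precision, recall, and F1 values.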

**Comments and Observations**

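The numerical results generated by this script provide an objective measure of the model's performance and show whether it meets the project's >80% accuracy requirement. In the run recorded above, all four metrics are exactly 0.8000, which sits at that threshold rather than above it. Because the test set holds only 10 examples, each misclassification shifts accuracy by 10 percentage points, so these figures are coarse estimates. The individual Precision and Recall scores offer deeper insight into the specific types of errors the model is prone to making, allowing for a more nuanced analysis of its strengths and weaknesses.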

# --- NLP Application 2: Part-of-Speech (POS) Tagging ---






In [None]:

# 1. Import the spaCy library
import spacy

# 2. Load the pre-trained English model ("en_core_web_sm"). This model includes a pipeline with a trained POS tagger.
nlp_pos = spacy.load("en_core_web_sm")

In [None]:
# 3. Function to analyze text and return tokens with their POS tags
def analyze_text_pos(text):
    # Process the text with the loaded spaCy model
    doc = nlp_pos(text)
    # Create a list of (token, POS tag) tuples
    pos_list = [(token.text, token.pos_) for token in doc]
    return pos_list

In [None]:
# 4. Allow user to input a sentence for analysis
user_input = input("Enter a sentence for POS tagging analysis: ")
pos_tags = analyze_text_pos(user_input)

Enter a sentence for POS tagging analysis: The students are preparing for their final examination in computer science.


In [None]:
# 5. Display the results
print("\n--- Tokens and POS Tags ---")
print(pos_tags)


--- Tokens and POS Tags ---
[('The', 'DET'), ('students', 'NOUN'), ('are', 'AUX'), ('preparing', 'VERB'), ('for', 'ADP'), ('their', 'PRON'), ('final', 'ADJ'), ('examination', 'NOUN'), ('in', 'ADP'), ('computer', 'NOUN'), ('science', 'NOUN'), ('.', 'PUNCT')]


### Model 2: Part-of-Speech (POS) Tagging

**Function Description**

This script implements a Part-of-Speech (POS) Tagger, a fundamental NLP application that performs grammatical analysis on text. The primary objective is to process a given sentence and assign a grammatical label (such as noun, verb, or adjective) to each word, or "token." This model demonstrates the use of a pre-trained pipeline: instead of training a model from scratch, a pre-trained model distributed for the spaCy library is loaded and applied directly, with no further training.

**Inputs**

The model requires a single input: a string of English text. This input is provided interactively by the user via the `input()` prompt at runtime. A key distinction from the previous classification model is that this script does not require any custom-labeled training data. Its analytical capabilities are derived entirely from the pre-trained model it loads.

**Code Flow and Syntax Explanation**

The script's workflow is direct and efficient, showcasing the power of pre-built NLP pipelines:

 1. **Model Loading**: The process begins with the command `nlp_pos = spacy.load("en_core_web_sm")`, which loads one of spaCy's pre-trained English pipelines. The `"en_core_web_sm"` package is a small pipeline that includes a tokenizer, a POS tagger, a dependency parser, a lemmatizer, and a named entity recognizer. The tagger, which matters most for this task, was trained on annotated written English such as news, blogs, and comments. The loaded pipeline is assigned to the `nlp_pos` variable.

 2. **Function Definition**: The core logic is encapsulated in the `analyze_text_pos(text)` function. This function is designed to take any raw text string as its argument.

 3. **Text Processing**: Inside the function, the line `doc = nlp_pos(text)` does the main work. The input text is passed to the loaded `nlp_pos` object, and this single call runs it through the entire pre-trained pipeline. The returned `doc` is not a plain string but a spaCy `Doc` object in which every token carries linguistic annotations.

 4. **Tag Extraction**: The grammatical tags are extracted using a Python list comprehension: `pos_list = [(token.text, token.pos_) for token in doc]`. This code iterates through every token in the processed `doc`. For each token, it accesses two attributes: `token.text`, which is the original word itself, and `token.pos_`, which is the coarse-grained Universal POS tag that the pre-trained model has assigned to that word. The result is a list of `(word, tag)` tuples.

 5. **User Interaction**: The script prompts the user for a sentence, calls the `analyze_text_pos` function with the user's input, and stores the resulting list in the `pos_tags` variable.

 6. **Output**: Finally, the script prints the `pos_tags` list, displaying a token-by-token grammatical breakdown of the user's sentence.
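Since the result is a plain list of `(word, tag)` tuples, it can be post-processed with the standard library alone. The sketch below is an illustrative addition, not part of the original notebook: it takes the tagged output recorded for the sample sentence and groups it by tag. The names `tag_counts` and `words_by_tag` are hypothetical.

```python
from collections import Counter, defaultdict

# Tagged output copied from the notebook's sample run
pos_tags = [
    ("The", "DET"), ("students", "NOUN"), ("are", "AUX"), ("preparing", "VERB"),
    ("for", "ADP"), ("their", "PRON"), ("final", "ADJ"), ("examination", "NOUN"),
    ("in", "ADP"), ("computer", "NOUN"), ("science", "NOUN"), (".", "PUNCT"),
]

# How often each tag occurs in the sentence
tag_counts = Counter(tag for _, tag in pos_tags)

# Which words received each tag, in sentence order
words_by_tag = defaultdict(list)
for word, tag in pos_tags:
    words_by_tag[tag].append(word)

print(tag_counts["NOUN"], words_by_tag["NOUN"])
print(tag_counts["ADP"], words_by_tag["ADP"])
```

Grouping like this is a common next step, for example to pull out all nouns as candidate keywords.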


**Outputs**

The script's final output is a Python list of tuples. Each tuple contains two string elements:
 - The original token from the input sentence.

 - The corresponding universal POS tag (e.g., 'NOUN' for noun, 'VERB' for verb, 'ADJ' for adjective, 'PUNCT' for punctuation).


**Comments and Observations**

This model serves as an excellent demonstration of the power and efficiency of using pre-trained pipelines in modern NLP. Without a single line of training code or a single piece of custom data, the script can perform a complex and highly useful linguistic analysis. This approach is fundamental to many real-world NLP applications, where engineers and data scientists leverage large, state-of-the-art models as a starting point, saving enormous amounts of time and computational resources. The accuracy of the tagging is entirely dependent on the quality of the pre-trained en_core_web_sm model, which is generally very high for common English text.


# --- NLP Application 3: Extractive Text Summarization ---

In [None]:
# 1. Import necessary libraries
import spacy
from collections import Counter

In [None]:
# 2. Load the pre-trained English model. We need this for sentence and token processing.
nlp_sum = spacy.load("en_core_web_sm")

In [None]:
# 3. Function to perform extractive summarization
def summarize(text, n_sentences=2):
    # Process the full text with the spaCy model
    doc = nlp_sum(text)

    # Calculate word frequencies, ignoring stopwords and punctuation
    word_frequencies = Counter()
    for token in doc:
        if not token.is_stop and not token.is_punct:
            word_frequencies[token.text.lower()] += 1

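    # Nothing to score if the text has no non-stopword, non-punctuation tokens
    if not word_frequencies:
        return ""

    # Normalize the frequencies by the count of the most frequent word
    max_frequency = max(word_frequencies.values())
    for word in word_frequencies:
        word_frequencies[word] = word_frequencies[word] / max_frequency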

    # Score each sentence based on the sum of its word frequencies
    sentence_scores = Counter()
    for sent in doc.sents:
        for token in sent:
            if token.text.lower() in word_frequencies.keys():
                sentence_scores[sent] += word_frequencies[token.text.lower()]

    # Select the top N sentences with the highest scores
    top_sentences = [sent.text for sent, score in sentence_scores.most_common(n_sentences)]
    return " ".join(top_sentences)

In [None]:
# 4. Provide a sample text from the education domain for summarization
education_text = "Educational technology, often abbreviated as EdTech, refers to the combined use of computer hardware, software, and educational theory and practice to facilitate learning. While the term is often associated with online courses and digital tools, its scope is much broader. It includes the use of interactive whiteboards in classrooms, learning management systems for organizing course materials, and adaptive learning platforms that personalize the educational experience for each student. The primary goal of EdTech is not simply to digitize traditional materials, but to enhance teaching methods, improve student engagement, and provide access to education for a wider audience. Effective implementation of EdTech requires careful consideration of pedagogy, equity, and the specific needs of learners."

In [None]:
# 5. Generate and display the summary
summary = summarize(education_text, n_sentences=2)
print("--- Original Text ---")
print(education_text)
print("\n--- Generated Summary ---")
print(summary)

--- Original Text ---
Educational technology, often abbreviated as EdTech, refers to the combined use of computer hardware, software, and educational theory and practice to facilitate learning. While the term is often associated with online courses and digital tools, its scope is much broader. It includes the use of interactive whiteboards in classrooms, learning management systems for organizing course materials, and adaptive learning platforms that personalize the educational experience for each student. The primary goal of EdTech is not simply to digitize traditional materials, but to enhance teaching methods, improve student engagement, and provide access to education for a wider audience. Effective implementation of EdTech requires careful consideration of pedagogy, equity, and the specific needs of learners.

--- Generated Summary ---
It includes the use of interactive whiteboards in classrooms, learning management systems for organizing course materials, and adaptive learning pl

In [None]:
# --- ROUGE Metrics Evaluation for Summarizer ---

# install the necessary library
!pip install rouge-score

from rouge_score import rouge_scorer

# The summary generated by our model
generated_summary = summary

# The "gold standard" summary written by me
reference_summary = "Educational technology (EdTech) uses hardware and software to facilitate learning through tools like interactive whiteboards and learning platforms. Its primary goal is to enhance teaching, improve student engagement and increase access to education."

# Initialize the ROUGE scorer to calculate ROUGE-1, ROUGE-2, and ROUGE-L
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Calculate the scores by comparing the two summaries
scores = scorer.score(reference_summary, generated_summary)

# Print the results
print("--- ROUGE METRICS EVALUATION ---")
print(f"Generated Summary: {generated_summary}")
print(f"Reference Summary: {reference_summary}\n")

for key, value in scores.items():
    print(f"--- {key.upper()} ---")
    print(f"  Precision: {value.precision:.4f}")
    print(f"  Recall:    {value.recall:.4f}")
    print(f"  F1-Score:  {value.fmeasure:.4f}\n")

--- ROUGE METRICS EVALUATION ---
Generated Summary: It includes the use of interactive whiteboards in classrooms, learning management systems for organizing course materials, and adaptive learning platforms that personalize the educational experience for each student. Educational technology, often abbreviated as EdTech, refers to the combined use of computer hardware, softwar

### Model 3: Extractive Text Summarization

**Function Description**
This script implements an **unsupervised, extractive text summarizer**. The primary goal of this model is to automatically reduce a large block of text into a concise summary containing a specified number of sentences. The "extractive" approach means that the model identifies and extracts the most important sentences directly from the original text, rather than generating new sentences. This algorithm demonstrates a heuristic-based method for summarization that relies on statistical analysis instead of labeled training data.

**Inputs**
The `summarize` function requires two inputs:
1.  `text`: A string containing the full block of text to be summarized. For this project, a sample paragraph about Educational Technology is used to maintain the thematic focus.
2.  `n_sentences`: An integer that specifies the desired length of the summary in sentences. This is set to a default of `2`.

**Code Flow and Syntax Explanation**
The summarization logic is encapsulated within the `summarize` function and follows a four-step statistical process:
1.  **Text Processing:** The function begins by processing the input text with the pre-trained `en_core_web_sm` pipeline, loaded as `nlp_sum`. The pipeline segments the text into sentences and tokens.
2.  **Word Frequency Calculation:** It then calculates the frequency of each word in the text using a `collections.Counter` object. To ensure only meaningful words are counted, it filters out common **stopwords** (e.g., 'the', 'is', 'a') and punctuation by checking the boolean attributes `token.is_stop` and `token.is_punct`. All words are converted to lowercase with `token.text.lower()` so that differently capitalized forms such as "EdTech" and "edtech" are counted as the same word.
3.  **Frequency Normalization:** The raw word counts are then normalized. The script finds the count of the most frequent word and divides every word's count by this maximum. This scales all word scores into the range (0, 1], with the most frequent word scoring exactly 1.
4.  **Sentence Scoring and Selection:** The script then iterates through each sentence in the text. The "score" for each sentence is calculated by summing the normalized frequencies of its constituent words. Sentences that contain more high-frequency (and thus, presumably important) words receive a higher score; because the score is a sum, longer sentences also tend to score higher. Finally, `sentence_scores.most_common(n_sentences)` retrieves the `N` highest-scoring sentences, which are joined with spaces in descending score order (not necessarily their order in the original text) to form the final summary string.
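The scoring steps above can be traced on a toy input without loading spaCy. The sketch below is a simplified stand-in, not the notebook's implementation: it assumes pre-split sentences, a regex tokenizer, and a tiny hand-written `STOPWORDS` set (all hypothetical), and it deliberately differs from `summarize` in one respect by returning the selected sentences in document order.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; spaCy's own list is much larger
STOPWORDS = {"the", "is", "a", "of", "and", "to", "in", "it"}


def score_sentences(sentences):
    """Return [(sentence, score)] using max-normalized word frequencies."""
    tokenized = [
        [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    freqs = Counter(w for words in tokenized for w in words)
    if not freqs:
        return []
    max_freq = max(freqs.values())
    weights = {w: c / max_freq for w, c in freqs.items()}  # values in (0, 1]
    return [(s, sum(weights[w] for w in words)) for s, words in zip(sentences, tokenized)]


def top_sentences(sentences, n=2):
    """Pick the n best-scoring sentences, then restore document order."""
    scored = score_sentences(sentences)
    ranked = sorted(range(len(scored)), key=lambda i: scored[i][1], reverse=True)[:n]
    return [scored[i][0] for i in sorted(ranked)]


sentences = ["Mice eat cheese.", "Dogs sleep.", "Cats chase mice and cheese."]
for sentence, score in score_sentences(sentences):
    print(f"{score:.1f}  {sentence}")
print(top_sentences(sentences, n=2))
```

Here "mice" and "cheese" each occur twice, so they carry weight 1.0 and every other word 0.5. The third sentence scores highest, but sorting the chosen indices keeps the first sentence ahead of it in the output.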

**Outputs**
The primary output is a single string containing the generated summary. The script also prints the original text for easy comparison.

**Comments and Observations**
This model is a clear example of an unsupervised NLP algorithm. It is fast, efficient, and requires no training, making it a very practical tool for simple summarization tasks. Its main limitation is its extractive nature; since it can only use existing sentences, the resulting summary can sometimes feel disjointed or lack perfect coherence. Its performance will be formally evaluated in the next step using ROUGE metrics, which will measure how well the selected sentences align with a human-written summary.
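For intuition about what ROUGE measures, ROUGE-1 reduces to counting overlapping unigrams between the reference and the generated summary. The sketch below is a simplified illustration, assuming whitespace tokenization and no stemming, so its numbers will not match those of the `rouge_score` package (which normalizes tokens and, with `use_stemmer=True`, stems them). The function name `rouge1` and the two example sentences are hypothetical.

```python
from collections import Counter


def rouge1(reference, candidate):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / sum(cand_counts.values()) if cand_counts else 0.0
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


precision, recall, f1 = rouge1("the cat sat on the mat", "the cat lay on the rug")
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Precision divides the overlap by the length of the generated summary, and recall divides it by the length of the reference, so a long extractive summary tends to gain recall at the cost of precision.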

# --- NLP Application 4: Text Classification for Educational Content ---


In [None]:
import spacy
from spacy.training import Example
import random

In [None]:

# 1. Create a new blank spaCy model for this specific task
nlp_edu = spacy.blank("en")
textcat_edu = nlp_edu.add_pipe("textcat")

In [None]:

# 2. Add the new labels for our subject classification task
textcat_edu.add_label("SCIENCE")
textcat_edu.add_label("HISTORY")
textcat_edu.add_label("MATH")

1

In [None]:

# 3. Provide the manually labeled training data for educational content
edu_train_data = [
    # SCIENCE
    ("The process of photosynthesis converts light energy into chemical energy.", {"cats": {"SCIENCE": 1, "HISTORY": 0, "MATH": 0}}),
    ("The Earth revolves around the Sun in an elliptical orbit.", {"cats": {"SCIENCE": 1, "HISTORY": 0, "MATH": 0}}),
    ("DNA is a molecule composed of two polynucleotide chains.", {"cats": {"SCIENCE": 1, "HISTORY": 0, "MATH": 0}}),
    ("Water is made of two hydrogen atoms and one oxygen atom.", {"cats": {"SCIENCE": 1, "HISTORY": 0, "MATH": 0}}),
    ("Mitochondria are the powerhouses of the cell.", {"cats": {"SCIENCE": 1, "HISTORY": 0, "MATH": 0}}),
    ("Gravity is the force that attracts a body toward the center of the earth.", {"cats": {"SCIENCE": 1, "HISTORY": 0, "MATH": 0}}),
    ("The study of living organisms is called biology.", {"cats": {"SCIENCE": 1, "HISTORY": 0, "MATH": 0}}),

    # HISTORY
    ("The American Revolution began in 1775 and ended in 1783.", {"cats": {"SCIENCE": 0, "HISTORY": 1, "MATH": 0}}),
    ("The ancient Egyptians built the pyramids as tombs for their pharaohs.", {"cats": {"SCIENCE": 0, "HISTORY": 1, "MATH": 0}}),
    ("World War II was a global conflict that lasted from 1939 to 1945.", {"cats": {"SCIENCE": 0, "HISTORY": 1, "MATH": 0}}),
    ("The Renaissance was a period of great cultural change in Europe.", {"cats": {"SCIENCE": 0, "HISTORY": 1, "MATH": 0}}),
    ("Martin Luther King Jr. was a leader in the Civil Rights Movement.", {"cats": {"SCIENCE": 0, "HISTORY": 1, "MATH": 0}}),
    ("The Declaration of Independence was signed in 1776.", {"cats": {"SCIENCE": 0, "HISTORY": 1, "MATH": 0}}),
    ("The fall of the Berlin Wall in 1989 symbolized the end of the Cold War.", {"cats": {"SCIENCE": 0, "HISTORY": 1, "MATH": 0}}),

    # MATH
    ("The Pythagorean theorem states that a squared plus b squared equals c squared.", {"cats": {"SCIENCE": 0, "HISTORY": 0, "MATH": 1}}),
    ("An integer is a whole number that can be positive, negative, or zero.", {"cats": {"SCIENCE": 0, "HISTORY": 0, "MATH": 1}}),
    ("The area of a circle is calculated by the formula pi times the radius squared.", {"cats": {"SCIENCE": 0, "HISTORY": 0, "MATH": 1}}),
    ("Calculus is the mathematical study of continuous change.", {"cats": {"SCIENCE": 0, "HISTORY": 0, "MATH": 1}}),
    ("A prime number is a number greater than 1 that has no positive divisors other than 1 and itself.", {"cats": {"SCIENCE": 0, "HISTORY": 0, "MATH": 1}}),
    ("Linear algebra is central to both pure and applied mathematics.", {"cats": {"SCIENCE": 0, "HISTORY": 0, "MATH": 1}}),
    ("The sum of the angles in a triangle is always 180 degrees.", {"cats": {"SCIENCE": 0, "HISTORY": 0, "MATH": 1}}),
]

In [None]:
# --- Manually labeled TEST data for Educational Content Classification ---
edu_test_data = [
    # SCIENCE
    ("The chemical symbol for gold is Au.", {"cats": {"SCIENCE": 1, "HISTORY": 0, "MATH": 0}}),
    ("The study of the stars and planets is astronomy.", {"cats": {"SCIENCE": 1, "HISTORY": 0, "MATH": 0}}),
    ("An insect has six legs and three main body parts.", {"cats": {"SCIENCE": 1, "HISTORY": 0, "MATH": 0}}),

    # HISTORY
    ("The Roman Empire fell in 476 AD.", {"cats": {"SCIENCE": 0, "HISTORY": 1, "MATH": 0}}),
    ("The first man to walk on the moon was Neil Armstrong in 1969.", {"cats": {"SCIENCE": 0, "HISTORY": 1, "MATH": 0}}),
    ("The Magna Carta was a charter of rights agreed to by King John of England.", {"cats": {"SCIENCE": 0, "HISTORY": 1, "MATH": 0}}),

    # MATH
    ("A quadrilateral is a polygon with four sides.", {"cats": {"SCIENCE": 0, "HISTORY": 0, "MATH": 1}}),
    ("The equation for a line is y = mx + b.", {"cats": {"SCIENCE": 0, "HISTORY": 0, "MATH": 1}}),
    ("Multiplying two negative numbers results in a positive number.", {"cats": {"SCIENCE": 0, "HISTORY": 0, "MATH": 1}}),
]

In [None]:
# 4. Train the educational content classifier
print("--- Starting Educational Model Training ---")
# Note: in spaCy v3, begin_training() is a deprecated alias of initialize().
optimizer_edu = nlp_edu.begin_training()
for i in range(20):  # Train for 20 iterations
    random.shuffle(edu_train_data)
    losses = {}
    for text, annotations in edu_train_data:
        doc = nlp_edu.make_doc(text)
        example = Example.from_dict(doc, annotations)
        nlp_edu.update([example], sgd=optimizer_edu, losses=losses)
    if (i + 1) % 5 == 0:
        print(f"Iteration {i+1}/20, Loss: {losses.get('textcat', 0.0)}")
print("--- Training Complete ---\n")

--- Starting Educational Model Training ---
Iteration 5/20, Loss: 0.00020447709107429546
Iteration 10/20, Loss: 6.30395024714403e-06
Iteration 15/20, Loss: 1.7257572340056981e-06
Iteration 20/20, Loss: 6.917990895694004e-07
--- Training Complete ---



In [None]:
# 5. Allow users to test the model interactively
while True:
    user_input = input("Enter a sentence about SCIENCE, HISTORY, or MATH to classify (or type 'exit' to quit): ")
    if user_input.lower() == 'exit':
        break
    doc = nlp_edu(user_input)
    # Get the category with the highest score
    classification = max(doc.cats, key=doc.cats.get)
    print(f"--> The content is classified as: {classification}")
    print(f"    (Scores: {doc.cats})\n")

Enter a sentence about SCIENCE, HISTORY, or MATH to classify (or type 'exit' to quit): The theory of relativity was developed by Einstein.
--> The content is classified as: HISTORY
    (Scores: {'SCIENCE': 0.36335209012031555, 'HISTORY': 0.442747563123703, 'MATH': 0.19390033185482025})

Enter a sentence about SCIENCE, HISTORY, or MATH to classify (or type 'exit' to quit): Photosynthesis requires sunlight, water and carbon dioxide.
--> The content is classified as: MATH
    (Scores: {'SCIENCE': 0.13377413153648376, 'HISTORY': 0.009692267514765263, 'MATH': 0.8565335869789124})

Enter a sentence about SCIENCE, HISTORY, or MATH to classify (or type 'exit' to quit): Cleopatra was the last active ruler of the Ptolemaic Kingdom of Egypt.
--> The content is classified as: SCIENCE
    (Scores: {'SCIENCE': 0.5842010378837585, 'HISTORY': 0.4083878695964813, 'MATH': 0.007411125581711531})

Enter a sentence about SCIENCE, HISTORY, or MATH to classify (or type 'exit' to quit): The value of pi is app

### Model 4: Text Classification for Educational Content

**Function Description**

This script implements a second, distinct supervised learning model: a **multi-class text classifier**. The objective of this model is to categorize short pieces of educational text into one of three predefined academic subjects: **"SCIENCE," "HISTORY,"** or **"MATH."** This application was deliberately designed to explore the challenges of training a classifier from scratch on a minimal, manually created dataset. It serves as a practical case study in multi-class classification and highlights the pitfalls of training on limited data.

**Inputs**

The model is trained and evaluated using two manually created datasets:
1.  **`edu_train_data`**: A list of 21 labeled sentences, with 7 examples for each of the three subject categories. This data is used for training the model.
2.  **`edu_test_data`**: A separate list of 9 labeled sentences (3 for each category), held out from training to be used for the final, unbiased performance evaluation.

Both datasets adhere to the mandatory `spaCy` format, `(text, {"cats": {...}})`, where the `"cats"` dictionary maps each of the three labels to a binary value.
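
As a small illustration of this format, the sketch below builds one such tuple from a text and a single label. The helper name `make_textcat_example` and the constant `EDU_LABELS` are hypothetical and not part of the notebook; only the `(text, {"cats": {...}})` shape comes from the datasets above.

```python
# Hypothetical helper (not in the notebook): build one example in the
# (text, {"cats": {...}}) shape used by edu_train_data and edu_test_data.
EDU_LABELS = ("SCIENCE", "HISTORY", "MATH")

def make_textcat_example(text, label, labels=EDU_LABELS):
    """Return (text, {"cats": {...}}) with 1 for `label` and 0 for the others."""
    if label not in labels:
        raise ValueError(f"Unknown label {label!r}; expected one of {labels}")
    cats = {name: int(name == label) for name in labels}
    return (text, {"cats": cats})

example = make_textcat_example("The chemical symbol for gold is Au.", "SCIENCE")
print(example)
```

Generating the dictionary from the label keeps every example's keys identical and avoids typos when entries are written by hand.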

**Code Flow and Syntax Explanation**

1.  **Model Initialization:** A new, independent `spaCy` model is created with `nlp_edu = spacy.blank("en")` to ensure it does not interfere with the previous sentiment model. A `textcat` pipeline component is added, and the three class labels ("SCIENCE", "HISTORY", and "MATH") are registered using `textcat_edu.add_label()`.
2.  **Training Loop:** The model is trained for 20 iterations on the `edu_train_data`. The training process is identical to that of the sentiment model, using `nlp_edu.begin_training()` to set up the optimizer and `nlp_edu.update()` within the loop to adjust the model's weights one example at a time. The training loss is printed every 5 iterations to monitor the learning progress.
3.  **Interactive Testing:** After training, a `while True` loop provides an interface for the user to input their own sentences. The trained `nlp_edu` model processes the text to generate probability scores for each of the three categories.
4.  **Classification Logic:** The final prediction is determined using the command `classification = max(doc.cats, key=doc.cats.get)`. This is an efficient Python technique that finds the key (the subject label) in the `doc.cats` dictionary that corresponds to the highest value (the probability score).
5.  **Output:** The script then prints the predicted subject category for the user's input, along with the full dictionary of scores for transparency.
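
The classification step can be tried in isolation on a plain dictionary. The sketch below uses the scores from the first interactive run shown above (rounded to four decimals); the `ranked` and `margin` variables are illustrative additions, not part of the notebook.

```python
# Scores copied (rounded) from the first interactive run in the notebook output.
scores = {"SCIENCE": 0.3634, "HISTORY": 0.4427, "MATH": 0.1939}

# max() iterates over the dict's keys; key=scores.get ranks each key by its score.
classification = max(scores, key=scores.get)

# Ranking every label shows how close the runner-up was to the winner.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
margin = ranked[0][1] - ranked[1][1]

print(classification)
print(ranked)
print(f"Margin over runner-up: {margin:.4f}")
```

A small margin, as in this run, suggests the prediction is close to a coin flip between the top two labels, which the single winning label alone would hide.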

**Outputs**

For each user input, the model outputs two things:
1.  The predicted subject category as a string (e.g., "SCIENCE").
2.  The full dictionary of probability scores for all three categories.

**Comments and Observations**

This model serves as a critical experiment in the project. The training loss falls from about 2.0e-4 at iteration 5 to about 6.9e-7 at iteration 20, which indicates that the model has effectively **memorized** the 21 training examples. A near-zero training loss does not guarantee good performance on new data, and the interactive spot checks already hint at a problem: the photosynthesis sentence was labeled MATH (score 0.86) and the Cleopatra sentence SCIENCE (score 0.58). The key question, which will be answered in the formal evaluation, is whether the model has learned the general *concepts* of each subject or has simply overfit to the specific keywords in its tiny training set.
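
The gap between memorization and generalization can be illustrated without spaCy. The sketch below is a toy, not the notebook's model: `memorizing_predict` is a hypothetical stand-in that stores training sentences verbatim and answers "MATH" for anything unseen, and the four sentences are copied from `edu_train_data` and `edu_test_data`.

```python
def accuracy(predict, data):
    """Fraction of (text, label) pairs for which predict(text) == label."""
    correct = sum(predict(text) == label for text, label in data)
    return correct / len(data)

# Two sentences per split, copied from the notebook's datasets.
train = [
    ("Mitochondria are the powerhouses of the cell.", "SCIENCE"),
    ("Calculus is the mathematical study of continuous change.", "MATH"),
]
test = [
    ("The chemical symbol for gold is Au.", "SCIENCE"),
    ("A quadrilateral is a polygon with four sides.", "MATH"),
]

# Hypothetical stand-in for an overfit model: a lookup table over the
# training sentences, with a fixed fallback label for unseen text.
memory = dict(train)

def memorizing_predict(text):
    return memory.get(text, "MATH")

train_acc = accuracy(memorizing_predict, train)
test_acc = accuracy(memorizing_predict, test)
print(f"Train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The same `accuracy` helper could be given a function that wraps `nlp_edu` to compare the real model's training and test accuracy side by side.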

In [None]:
# --- Step 4 (Model 4): Select & Implement Metrics for Educational Classifier ---

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Helper function to get the predicted label from the trained edu model
def get_edu_prediction(text):
    doc = nlp_edu(text)
    return max(doc.cats, key=doc.cats.get)

# Prepare the test data
test_texts_edu = [text for text, annotations in edu_test_data]
# For the true labels, we will use the string names directly
true_labels_edu = []
for text, annotations in edu_test_data:
    if annotations['cats']['SCIENCE'] == 1:
        true_labels_edu.append("SCIENCE")
    elif annotations['cats']['HISTORY'] == 1:
        true_labels_edu.append("HISTORY")
    else:
        true_labels_edu.append("MATH")

# Get the model's predictions
predicted_labels_edu = [get_edu_prediction(text) for text in test_texts_edu]

# --- Calculate and Print Metrics ---
accuracy_edu = accuracy_score(true_labels_edu, predicted_labels_edu)

print("--- Educational Model Performance Evaluation ---")
print(f"Test Set Size: {len(test_texts_edu)} examples")
print(f"Overall Accuracy: {accuracy_edu:.2f} ({accuracy_edu:.0%})\n")

# For a multi-class problem, a classification report is more informative
print("--- Detailed Classification Report ---")
# This report shows Precision, Recall, and F1-Score for EACH category
print(classification_report(true_labels_edu, predicted_labels_edu))

--- Educational Model Performance Evaluation ---
Test Set Size: 9 examples
Overall Accuracy: 0.56 (56%)

--- Detailed Classification Report ---
              precision    recall  f1-score   support

     HISTORY       0.75      1.00      0.86         3
        MATH       0.40      0.67      0.50         3
     SCIENCE       0.00      0.00      0.00         3

    accuracy                           0.56         9
   macro avg       0.38      0.56      0.45         9
weighted avg       0.38      0.56      0.45         9





### Model 4.2 Evaluation: Calculating Performance Metrics

**Function Description**

This code cell provides the formal, quantitative evaluation for the educational content classifier. Its primary purpose is to objectively measure the model's performance and diagnose its weaknesses. Because this is a multi-class problem, a simple accuracy score is insufficient for a deep analysis. Therefore, the script generates a detailed **Classification Report**, which provides a full breakdown of Precision, Recall, and F1-Score for each of the three subject categories individually.

**Inputs**

The evaluation is performed using two key inputs:
1.  The trained `nlp_edu` model object.
2.  The `edu_test_data` list, containing 9 labeled sentences that the model has never seen before.

**Code Flow and Syntax Explanation**

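1.  **Import Dependencies:** The script uses `accuracy_score` and `classification_report` from the `sklearn.metrics` library. It also imports `precision_score`, `recall_score`, and `f1_score`, which this cell does not call directly because the classification report already covers them.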
2.  **Helper Function:** A small helper function, `get_edu_prediction`, is defined to streamline the process of getting a final prediction from the `nlp_edu` model.
3.  **Data Preparation:** The script prepares the test data by creating two lists: `test_texts_edu` (the sentences) and `true_labels_edu` (the correct string labels, e.g., "SCIENCE").
4.  **Prediction Generation:** The script iterates through the test texts and uses the `get_edu_prediction` function to generate a list of the model's predictions.
5.  **Metric Calculation and Output:** The `accuracy_score` is calculated to provide an overall performance number. The core of the evaluation is the `print(classification_report(...))` command. This function from scikit-learn takes the true labels and the predicted labels and automatically computes the Precision, Recall, F1-Score, and Support for each class, presenting them in a clean, tabular format.
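
The data preparation step can also be written without an `if`/`elif` chain by reusing the same `max(..., key=...)` idiom on the gold annotations. This is an alternative sketch, not the notebook's code: `gold_label` and `sample_test_data` are hypothetical names, and the two entries are copied from `edu_test_data`.

```python
# Two entries copied from edu_test_data for illustration.
sample_test_data = [
    ("The Roman Empire fell in 476 AD.", {"cats": {"SCIENCE": 0, "HISTORY": 1, "MATH": 0}}),
    ("The equation for a line is y = mx + b.", {"cats": {"SCIENCE": 0, "HISTORY": 0, "MATH": 1}}),
]

def gold_label(annotations):
    """Return the label with the highest value in annotations["cats"]."""
    cats = annotations["cats"]
    return max(cats, key=cats.get)

texts = [text for text, _ in sample_test_data]
true_labels = [gold_label(annotations) for _, annotations in sample_test_data]
print(true_labels)
```

This version keeps working unchanged if a fourth subject label is added later, whereas the explicit chain would silently file it under "MATH".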

**Explanation of the Classification Report**

*   **Overall Accuracy:** The percentage of all predictions that were correct across all classes.
*   **Precision (per class):** Measures the accuracy of the predictions for a specific class. For "SCIENCE," it answers: "Of all the times the model predicted SCIENCE, how often was it right?"
*   **Recall (per class):** Measures the model's ability to find all instances of a specific class. For "SCIENCE," it answers: "Of all the actual SCIENCE examples, how many did the model find?"
*   **F1-Score (per class):** The harmonic mean of precision and recall for that class, providing a single balanced score.
*   **Support:** The number of actual examples of that class in the test set.
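
These definitions can be checked by hand against the report. The sketch below recomputes the per-class figures from raw counts; the helper `precision_recall_f1` is hypothetical, and the counts are inferred from the printed report (for example, 3 correct HISTORY predictions at 0.75 precision means HISTORY was predicted 4 times), not read from the notebook's variables.

```python
def precision_recall_f1(tp, predicted, actual):
    """Compute precision, recall and F1 from raw counts.

    tp:        correct predictions for the class
    predicted: number of times the class was predicted
    actual:    number of true examples of the class (the support)
    Ratios with a zero denominator are reported as 0.0.
    """
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# (tp, predicted, actual) per class, inferred from the report above.
counts = {
    "HISTORY": (3, 4, 3),
    "MATH": (2, 5, 3),
    "SCIENCE": (0, 0, 3),
}

results = {label: precision_recall_f1(*c) for label, c in counts.items()}
for label, (p, r, f1) in results.items():
    print(f"{label:8s} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

SCIENCE has a zero denominator for precision because it was never predicted, which is why its precision is undefined and shown as 0.00.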

**Comments and Observations**

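The classification report provides the quantitative evidence of the model's performance: overall accuracy is 0.56, or 5 of the 9 test sentences. The per-class metrics show where the errors are. HISTORY is fully recovered (recall 1.00, precision 0.75), MATH is only partly learned (recall 0.67, precision 0.40), and SCIENCE scores 0.00 on every metric. The precision figures imply 4 HISTORY and 5 MATH predictions, so the model did not predict SCIENCE for any of the 9 test sentences. Failing to learn a category in this way is a common problem when training on very small datasets, and with only 3 test examples per class each figure should be read as indicative rather than precise. This per-class breakdown is essential for the final analysis in the IEEE paper.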
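
A confusion matrix makes the per-class errors explicit by showing which label each true class was assigned. The sketch below tallies one with the standard library only; the two label lists are hypothetical, constructed to match the per-class counts in the report rather than copied from the notebook's actual predictions, so the order of individual sentences is not meaningful.

```python
from collections import Counter

labels = ["HISTORY", "MATH", "SCIENCE"]

# Hypothetical label lists consistent with the report's per-class counts.
true_labels = ["SCIENCE"] * 3 + ["HISTORY"] * 3 + ["MATH"] * 3
predicted_labels = ["MATH"] * 3 + ["HISTORY"] * 3 + ["MATH", "MATH", "HISTORY"]

# Count each (true, predicted) pair; missing pairs default to 0.
pair_counts = Counter(zip(true_labels, predicted_labels))

# Rows are true labels, columns are predicted labels.
print(f"{'true / pred':12s}" + "".join(f"{name:>9s}" for name in labels))
for true_name in labels:
    row = [pair_counts[(true_name, pred_name)] for pred_name in labels]
    print(f"{true_name:12s}" + "".join(f"{count:9d}" for count in row))

correct = sum(pair_counts[(name, name)] for name in labels)
print(f"Correct: {correct} of {len(true_labels)}")
```

Under these counts, every SCIENCE sentence lands in the MATH column, which is the kind of systematic confusion that a single accuracy number cannot show.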