## Assignment 1 - Named Entity Recognition using CRF model
## 02.CRF-Testing Program
## Group 8
### Anurag Maharshi - 2023MT12125
### SRIRAM ROKKAM - 2023MT12251
### GODLAVEETI ANIL GOVIND - 2023MT12272
### KORUKONDA SOWMYA - 2023MT12203
### VIGNESHWARAN K R - 2023MT12091

In [16]:
!pip install sklearn-crfsuite
!pip install simpletransformers
!pip install gdown
!pip install spacy



In [17]:
import joblib
import gdown
import spacy

## 02. Function to download the trained and saved model (.pkl file) from Google Drive using file ID

In [18]:
def download_file_from_google_drive(file_id, output_path):
    # Create the Google Drive URL
    drive_url = f"https://drive.google.com/uc?id={file_id}"

    # Download the file using gdown
    gdown.download(drive_url, output_path, quiet=False)


file_id = '18JYe_hVZYTp2wYASdybLfWl0ieL3r5WY'

# Define the output path where the file will be saved locally
output_path = 'crf_model2.pkl'

# Download the pickle file
download_file_from_google_drive(file_id, output_path)

Downloading...
From: https://drive.google.com/uc?id=18JYe_hVZYTp2wYASdybLfWl0ieL3r5WY
To: /Users/I310202/Library/CloudStorage/OneDrive-wilp.bits-pilani.ac.in/Desktop/03.Sem_03/ZG599-NLP/Assignment_1/F1_Versions/crf_model2.pkl
100%|██████████| 3.24M/3.24M [00:00<00:00, 6.99MB/s]


## Load the saved model
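
The cells shown import `joblib` but never assign `crf_model`, which `ner_with_crf` relies on later. A minimal sketch of the missing load step follows, assuming the model was saved with `joblib.dump` (the training notebook is not shown, so this is an assumption). `load_crf_model` is a hypothetical helper name.

```python
from pathlib import Path


def load_crf_model(model_path):
    """Load a pickled CRF model from disk (assumes it was saved with joblib.dump)."""
    path = Path(model_path)
    if not path.is_file():
        # Fail early with a clear message if the download step was skipped
        raise FileNotFoundError(f"Model file not found: {path}")

    # Imported here so that a missing file is reported before joblib is needed
    import joblib

    return joblib.load(str(path))
```

In this notebook it would be called as `crf_model = load_crf_model(output_path)` before any prediction cell runs.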


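2. **Feature Extraction (`word2features`)**:
   - Each word in the sentence is transformed into a feature dictionary that includes its lowercase form, its last two and three characters (suffixes), capitalization and digit flags, and its POS tag (an empty string when no tag is supplied).
   - The lowercase form and upper-case flag of the previous and next words are added to capture context. The function does not set explicit beginning-of-sentence (BOS) or end-of-sentence (EOS) flags; the first and last words simply omit the missing neighbor's features.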

In [19]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1] if sent[i][1] is not None else ""  # Handle None case

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }

    # Add features for the previous word
    if i > 0:
        prev_word = sent[i - 1][0]
        features.update({
            '-1:word.lower()': prev_word.lower(),
            '-1:word.isupper()': prev_word.isupper(),
        })

    # Add features for the next word
    if i < len(sent) - 1:
        next_word = sent[i + 1][0]
        features.update({
            '+1:word.lower()': next_word.lower(),
            '+1:word.isupper()': next_word.isupper(),
        })

    return features
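
`word2features` adds neighbor features only when a neighbor exists, so the first and last words carry no explicit boundary marker. If the model was trained with explicit `BOS`/`EOS` flags (an assumption; the training features are not shown here), they could be added with a small wrapper. `add_boundary_flags` is a hypothetical helper, and the flags only help if the same keys were used during training.

```python
def add_boundary_flags(features, sent, i):
    """Return a copy of `features` with BOS/EOS flags for sentence boundaries."""
    flagged = dict(features)  # copy so the caller's dictionary is not mutated
    if i == 0:
        flagged['BOS'] = True  # first word of the sentence
    if i == len(sent) - 1:
        flagged['EOS'] = True  # last word of the sentence
    return flagged


# Example with (token, POS) pairs shaped like the ones used in this notebook
sample = [("India", None), ("vs", None), ("Australia", None)]
first = add_boundary_flags({'bias': 1.0}, sample, 0)
middle = add_boundary_flags({'bias': 1.0}, sample, 1)
last = add_boundary_flags({'bias': 1.0}, sample, 2)
```

Only the first and last tokens gain a flag; tokens in the middle of the sentence are returned unchanged.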

3. **Helper Functions**:
   - `sent2features(sent)`: Converts an entire sentence into a list of feature dictionaries, one per word.
   - `sent2labels(sent)`: Extracts the true NER labels for each word in the sentence (if available).
   - `sent2tokens(sent)`: Extracts the word tokens from the sentence.
   - `sentence_to_word_pos(sentence)`: Intended to use spaCy to tag each word in the input sentence with its POS tag. It is not defined in the cells of this notebook; `ner_with_crf` passes `None` as the POS tag instead.

4. **NER Prediction (`predict_sentence`)**:
   - Converts the input sentence into features and calls the CRF model's `predict_single` to predict one NER tag per word.

5. **Full NER Pipeline (`ner_with_crf`)**:
   - Takes an input sentence, splits it on whitespace, pairs each token with a `None` POS placeholder, extracts features, and predicts NER tags.
   - Returns a formatted string with one line per word, giving the word and its predicted NER tag.
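
`spacy` is imported at the top of the notebook, but no POS-tagging helper appears in the cells shown. Below is a minimal sketch of what `sentence_to_word_pos` could look like. It assumes a loaded spaCy pipeline is passed in as `nlp` (a parameter added here for illustration) and that the model was trained on fine-grained tags, which spaCy exposes as `token.tag_`; both are assumptions.

```python
def sentence_to_word_pos(sentence, nlp):
    """Return (word, POS tag) pairs for `sentence` using a spaCy-style pipeline."""
    doc = nlp(sentence)  # tokenizes and tags the text
    # token.tag_ is the fine-grained tag (Penn Treebank style for English pipelines)
    return [(token.text, token.tag_) for token in doc]
```

With spaCy's small English pipeline installed, this could be called as `sentence_to_word_pos("India vs Australia", spacy.load("en_core_web_sm"))`. The resulting pairs have the same `(word, postag)` shape that `word2features` reads.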


In [20]:
def sent2features(sent):
    # Convert a sentence into a list of feature dictionaries
    return [word2features(sent, i) for i in range(len(sent))]


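def sent2labels(sent):
    # Extract the labels from a labeled sentence of (token, label) pairs.
    # Not used at prediction time: ner_with_crf builds (token, POS) pairs without labels.
    return [label for token, label in sent]


def sent2tokens(sent):
    # Extract the tokens (first element of each pair) from the sentence
    return [token for token, label in sent]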


def predict_sentence(crf_model, sentence):
    # Convert the sentence to features and predict using the CRF model
    features = sent2features(sentence)
    predicted_tags = crf_model.predict_single(features)
    return predicted_tags


def ner_with_crf(text):
    # Split the text into tokens
    tokens = text.split()

    # Create a list of tuples for tokens (token, None) since POS is not needed
    new_sentence_features = [(token, None) for token in tokens]

    # Make predictions using the CRF model
    predicted_tags = predict_sentence(crf_model, new_sentence_features)

    results = []
    for (word, _), tag in zip(new_sentence_features, predicted_tags):
        results.append(f"Word: {word}, Predicted NER Tag: {tag}")

    # Return results as a formatted string
    return "".join([f"{result} \n" for result in results])

## 03. Build a continuous loop to simulate the chatbot

In [15]:
# Chatbot loop
print("Welcome to the NER Chatbot! Enter your text or 'exit' to quit.")

while True:
    # Get input from the user
    user_input = input("You: ")

    # Exit condition
    if user_input.lower() == 'exit':
        print("Goodbye!")
        break

    # Get NER predictions
    ner_results = ner_with_crf(user_input)

    # Display the NER results
    print("NER Results:\n", ner_results)

Welcome to the NER Chatbot! Enter your text or 'exit' to quit.
NER Results:
 Word: India, Predicted NER Tag: B-geo 

NER Results:
 
NER Results:
 Word: Australia, Predicted NER Tag: B-gpe 

NER Results:
 Word: India, Predicted NER Tag: B-geo 
Word: vs, Predicted NER Tag: I-geo 
Word: Australia, Predicted NER Tag: I-geo 

NER Results:
 Word: I, Predicted NER Tag: B-eve 
Word: love, Predicted NER Tag: I-eve 
Word: india, Predicted NER Tag: I-eve 

NER Results:
 
