# Dataset

You will use the data introduced by the Language-Independent Named Entity Recognition tasks, through the following body of work:

* Erik F. Tjong Kim Sang and Fien De Meulder. 2003. [Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition](https://aclanthology.org/W03-0419/). In *Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003*, pages 142--147.
* Erik F. Tjong Kim Sang. 2002. [Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition](https://aclanthology.org/W02-2024/). In *COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)*.

This assignment, however, is restricted to NER in the English language only, and the dataset consists of three files:

1. `eng.train`, for training
2. `eng.testa`, as the development set
3. `eng.testb`, as the final test set

These files can be downloaded as a single `.zip` [here](https://drive.google.com/file/d/15YEXQlDk8wvqAFOE1chaS_PYvMaLGUGX/view?usp=sharing).

To avoid any complications, you should take advantage of the fact that the total amount of data is much smaller than in the previous assignment, and store the entire dataset in your own Google Drive. To do this, connect your Drive to your Colab notebook:

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


Then, unzip the dataset (**remember to change the path to where you have stored it in your own Google drive**):

In [None]:
!unzip /content/drive/MyDrive/courses/cse354/eng-ner-dataset.zip

unzip:  cannot find or open /content/drive/MyDrive/courses/cse354/eng-ner-dataset.zip, /content/drive/MyDrive/courses/cse354/eng-ner-dataset.zip.zip or /content/drive/MyDrive/courses/cse354/eng-ner-dataset.zip.ZIP.


At this point, you have the unzipped corpus (with the three files) as the `eng-ner-dataset` folder accessible to your Colab notebook. The format of this data is probably new to you, so the first thing to do is to use the `head` command and see what the data looks like ([see this man page](https://www.gnu.org/software/coreutils/manual/html_node/head-invocation.html) for the details of its syntax). For example, you can view the top 20 lines of the `eng.train` file as follows:

In [None]:
!head -n 20 /content/drive/MyDrive/eng-ner-dataset/eng.train

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O
to TO I-VP O
boycott VB I-VP O
British JJ I-NP I-MISC
lamb NN I-NP O
. . O O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP I-NP I-LOC
1996-08-22 CD I-NP O

The DT I-NP O
European NNP I-NP I-ORG
Commission NNP I-NP I-ORG
said VBD I-VP O


The format you see is known as the [IOB format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), popularly used in many NLP tasks since the CoNLL-2003 NER shared task. The file format has the following requirements:

- each token has to be on a separate line
- there must be an empty line after each sentence
- a line must contain at least two columns: first, the token itself; and the last, the named entity

It doesn't matter if there are extra columns in between (perhaps containing part-of-speech tag or other information), as long as the named entity information is given in the IOB format (either IOB or IOB2).
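As a minimal sketch of these requirements, the snippet below splits one annotated line into its first column (the token), its last column (the named entity tag), and whatever lies in between. The helper name `parse_iob_line` is hypothetical and not part of the assignment's required signatures; it also assumes that columns are separated by whitespace, as in the sample shown above.

```python
from typing import List, Tuple


def parse_iob_line(line: str) -> Tuple[str, List[str], str]:
    """Split one annotated line into (token, middle columns, named-entity tag).

    Hypothetical helper for illustration; assumes whitespace-separated columns.
    """
    columns = line.split()
    if len(columns) < 2:
        raise ValueError(f"expected at least two columns, got: {line!r}")
    # The first column is the token, the last one is the named-entity tag,
    # and anything in between (POS tag, chunk tag, ...) is optional.
    return columns[0], columns[1:-1], columns[-1]


token, extras, ner_tag = parse_iob_line("EU NNP I-NP I-ORG")
print(token, extras, ner_tag)
```

The same helper accepts a two-column line such as `lamb O`, in which case the list of middle columns is simply empty.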

**Note:** There is a slight difference between the original IOB and IOB2 formats, and you may need to convert the training and test data to IOB2 (if you spot that some instances are using IOB while others are using IOB2).
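To make the difference concrete: in the original IOB scheme, a `B-` tag appears only when an entity immediately follows another entity of the same type, whereas in IOB2 every entity starts with a `B-` tag. The sketch below converts a list of named entity tags from the former to the latter. The function name `iob1_to_iob2` is an assumption made for this example, and it operates on a plain list of tags rather than on the file itself.

```python
from typing import List


def iob1_to_iob2(tags: List[str]) -> List[str]:
    """Return a copy of `tags` in which every entity starts with a B- tag."""
    converted: List[str] = []
    previous = "O"
    for tag in tags:
        # An I- tag opens a new entity if it follows O or an entity of a different type.
        if tag.startswith("I-") and (previous == "O" or previous[2:] != tag[2:]):
            tag = "B-" + tag[2:]
        converted.append(tag)
        previous = tag
    return converted


print(iob1_to_iob2(["I-ORG", "O", "I-MISC", "O", "I-PER", "I-PER", "B-PER", "I-LOC"]))
```

Tags that are already `B-` or `O` pass through unchanged, so running the function on data that is already in IOB2 leaves it as it is.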

# Task Overview

The programming involves three stages:

1. converting the text data into feature vectors, so that it can readily be used by supervised machine learning algorithms,
2. implementing your own logistic regression classifier to identify whether or not a token is part of a person's name, and
3. implementing your own multinomial logistic regression classifier to develop a complete NER system.

Throughout this assignment, remember to use type annotations in your Python code. Even if you are unable to do this for variables whose data types are dependent on external libraries that are allowed in this assignment (specified later), don't forget the type annotations for the core data types. These annotations are already provided to you in the method signatures from this point onward (to illustrate how to do this, as well as to specify the method signatures required by this assignment).

* Feel free to import additional types as needed (see the line below, where a few data types are already imported for such type annotations: `from typing import ...`).

#### 1.1 Importing required libraries
- You may import modules from core Python
- You may use any modules from `numpy` and `pandas` as long as doing so does not involve any 'outsourcing' of machine learning algorithms to these modules.

**Do not add the following dependencies:**
- Any module from NLTK
- Any module from SciPy
- Any module from scikit-learn (i.e., `sklearn`) unless it is already provided to you in this Colab notebook
- Any library/module that performs optimizations (minimization or maximization of a function) for you. Purely numeric calculations that arise from mathematics (outside the topics in this assignment) can be done by calling numpy functions, but you must implement the stochastic gradient descent algorithm on your own.
  - For example, computing a dot product can be done using numpy, but logits, sigmoid, softmax, etc. must be your own implementation.
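As a rough illustration of the last point, the sketch below uses numpy only for element-wise arithmetic and writes the sigmoid and softmax functions by hand. The function names and the overflow-avoiding formulation are choices made for this example, not requirements stated by the assignment.

```python
import numpy as np


def sigmoid(z) -> np.ndarray:
    """Element-wise logistic function, arranged so that exp() never sees a large positive argument."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    non_negative = z >= 0
    # For z >= 0, exp(-z) is at most 1, so the usual formula is safe.
    out[non_negative] = 1.0 / (1.0 + np.exp(-z[non_negative]))
    # For z < 0, rewrite 1 / (1 + exp(-z)) as exp(z) / (1 + exp(z)).
    exp_z = np.exp(z[~non_negative])
    out[~non_negative] = exp_z / (1.0 + exp_z)
    return out


def softmax(logits) -> np.ndarray:
    """Softmax over the last axis; subtracting the maximum does not change the result."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp_shifted = np.exp(shifted)
    return exp_shifted / exp_shifted.sum(axis=-1, keepdims=True)


print(sigmoid(np.array([-1000.0, 0.0, 1000.0])))
print(softmax(np.array([[1.0, 2.0, 3.0]])))
```

Both functions rely on numpy for nothing more than `exp`, comparison, and summation, which keeps the actual modelling logic in your own code.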

**What about additional methods, variables, data structures, etc.?**

Throughout this assignment, you may add any number of helper methods, as you feel the need to do so. Similarly, you may use additional variables and/or data structures as the need arises. For example, if your implementation of the classifier requires you to add a class attribute, you can certainly do that.

However, please keep in mind three things:

1. Any external user should remain oblivious to any such additional function or variable (i.e., they should not have to assume or figure out things beyond what is already given, in order to run your code).
2. You must update the docstring of a class if you are introducing any additional attribute.
3. Any additional method that you write (say, a helper method) must also have a proper docstring and type hint/annotation for its signature (i.e., what data types it expects as parameters, and what data type it returns).

# Data Preparation [20 points]

The first step is to make sure that your code is able to read the data one sentence at a time. Given the number of sentences, and that you may have to analyze or process each sentence in computationally complex ways, it is always prudent in this kind of work to write your code in a way that avoids loading the entire training set into memory. In this assignment, loading everything at once may be possible, but the better option is to use the *generator* idea in Python. In this approach, the sentences are generated one at a time in a *lazy* manner (if you are more familiar with Java, think `Stream` instead of `List`).

In [None]:
import re, pandas, numpy
from typing import Dict, Iterator, List, TextIO, Tuple
from pathlib import Path
import numpy as np

UNKNOWN_TOKEN = 'UNK'  # This will be needed at times, so let's just declare it as a global constant right away

def load_instances(iob_file: TextIO, sep: str = '\n') -> Iterator[str]:
    """
    Load instances (which are sentences) from an input file stream.

    This function reads an input file stream (`iob_file`), where tokenized sentences are provided in the IOB or IOB2
    format, which requires each token to be on a separate line, and that there is an empty line after each sentence.
    This empty line acts as the default separator (`sep`). Each yielded instance is a single annotated sentence (in the
    IOB or IOB2 format, as given in the input file).

    Parameters:
        iob_file (TextIO): An input file stream containing annotated text data.
        sep (str, optional): The separator used to separate instances. Defaults to '\\n'.

    Yields:
        str: A string representing a single (tokenized and annotated) sentence.
    """
    # TODO
    sentence = []
    for line in iob_file:
        line = line.strip()
        if line:
            sentence.append(line)
        else:
            if sentence:
                yield ' '.join(sentence)
                sentence = []
    if sentence:
        yield ' '.join(sentence)

def convert_to_iob2(sentence: str) -> str:
    """
    Convert a sentence from IOB to IOB2 format.
    """
    previous_tag = "O"
    converted_sentence = []
    for token in sentence.split(' '):
        parts = token.rsplit(' ', 1)  # Split each token from its tag
        if len(parts) == 2:
            word, tag = parts
            if tag.startswith("I-") and (previous_tag == "O" or previous_tag[2:] != tag[2:]):
                # Convert I- to B- if previous tag is O or different entity type
                tag = "B" + tag[1:]
            converted_sentence.append(f"{word} {tag}")
            previous_tag = tag
        else:
            converted_sentence.append(token)
            previous_tag = "O"  # Reset for non-tagged tokens
    return '\n'.join(converted_sentence)

def process_file(input_file_path: Path, output_file_path: Path) -> None:
    """
    Process the given file from IOB to IOB2 format and save the output.
    """
    with input_file_path.open('r', encoding='utf-8') as input_file:
        sentences = load_instances(input_file)
        with output_file_path.open('w', encoding='utf-8') as output_file:
            for sentence in sentences:
                converted_sentence = convert_to_iob2(sentence)
                output_file.write(converted_sentence + "\n\n")  # Add extra newline to separate sentences

input_path = Path('/content/drive/MyDrive/eng-ner-dataset/eng.train')  # Adjust path as necessary
output_path = Path('/content/drive/MyDrive/eng-ner-dataset/engIOB2.train')
process_file(input_path, output_path)

input_path = Path('/content/drive/MyDrive/eng-ner-dataset/eng.testa')  # Adjust path as necessary
output_path = Path('/content/drive/MyDrive/eng-ner-dataset/engIOB2.testa')
process_file(input_path, output_path)

input_path = Path('/content/drive/MyDrive/eng-ner-dataset/eng.testb')  # Adjust path as necessary
output_path = Path('/content/drive/MyDrive/eng-ner-dataset/engIOB2.testb')
process_file(input_path, output_path)

In [None]:
with open("/content/drive/MyDrive/eng-ner-dataset/engIOB2.train") as file:
  for line in load_instances(file):
    print(line)
    break

EU NNP I-NP I-ORG rejects VBZ I-VP O German JJ I-NP I-MISC call NN I-NP O to TO I-VP O boycott VB I-VP O British JJ I-NP I-MISC lamb NN I-NP O . . O O
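Each instance printed above is a flat, space-separated string in which every token contributes four consecutive fields (token, POS tag, chunk tag, named entity tag). A small sketch of how such a string could be regrouped into one tuple per token is given below. The helper `group_fields` is hypothetical, and the width of four is an assumption that holds for this dataset's column layout.

```python
from typing import List, Tuple


def group_fields(instance: str, width: int = 4) -> List[Tuple[str, ...]]:
    """Regroup a flat, space-separated instance into one tuple of `width` fields per token."""
    fields = instance.split()
    if len(fields) % width != 0:
        raise ValueError(f"expected a multiple of {width} fields, got {len(fields)}")
    return [tuple(fields[i:i + width]) for i in range(0, len(fields), width)]


for token, pos, chunk, ner in group_fields("EU NNP I-NP I-ORG rejects VBZ I-VP O"):
    print(token, pos, ner)
```

This is equivalent to the strided slices `fields[::4]`, `fields[1::4]`, and `fields[3::4]`, but keeps the fields of each token together.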


Building the feature vectors will require the use of tokens seen in the training data, as well as other properties such as the part-of-speech (POS) tags of these tokens. Thus, it is imperative that all such tokens and POS tags are properly collected and tracked. The next method should help you do just that.

> ---
> **Optional features: phrasal information**
>
> You may have already noticed that the annotated data also contains information about the phrase containing a token (again, in IOB or IOB2 format). You are welcome to build features out of this information as well, although this is not mandated by the assignment. If you want to do this, we strongly suggest that you investigate this only *after* finishing everything else.
>
> *Phrasal information* is encoded as whether a token is a part of a noun phrase (NP), verb phrase (VP), prepositional phrase (PP), adjective/adverb phrases (ADJP/ADVP), verb particles (PRT), interjections (INTJ), and clauses introduced by a subordinating conjunction (SBAR).
>
> * If you are interested, you can read more about it [here](https://aclanthology.org/W00-0726.pdf).
>
> If you want to include features based on this phrase-level annotation, and want to add a third dictionary to the return type of the following function, please mention it very clearly in the docstring, and also modify the docstring to reflect this updated use of the `get_vocabulary` method.
>
> ---

In [None]:
def get_vocabulary(training_file: str) -> Tuple[Dict[str, int], Dict[str, int]]:
    """
    Create a vocabulary of lowercase tokens and part-of-speech (POS) tags from a training file, associating each token
    and each POS with a unique index.

    This function reads the specified training file, extracts tokens and POS tags from each sentence. It then converts
    the tokens to lowercase, and associates each lowercase token with a unique index. It also associates each POS with
    a unique index. The result is returned as a pair of dictionaries.

    Parameters:
        training_file (str): The path to the training file containing annotated text data.

    Returns:
        Tuple[Dict[str, int], Dict[str, int]]: A pair of dictionaries where the first dictionary consists of keys that
        are lowercase tokens and values are their corresponding indices; and the second dictionary consists of keys that
        are part-of-speech tags and values are their corresponding indices.

    Raises:
        IOError: If the specified training file cannot be opened or read.
    """
    # TODO
    tokens = {}
    pos_tags = {}
    token_index = 0
    pos_index = 0

    try:
        with open(training_file, 'r') as file:
            for line in load_instances(file):
                # Each instance is a flat list of fields: token, POS, chunk tag, NER tag, repeated per token
                fields = line.strip().split()
                line_tokens = [tok.lower() for tok in fields[::4]]  # Convert tokens to lowercase
                line_pos_tags = fields[1::4]
                for tok, pos in zip(line_tokens, line_pos_tags):
                    if tok not in tokens:
                        tokens[tok] = token_index
                        token_index += 1
                    if pos not in pos_tags:
                        pos_tags[pos] = pos_index
                        pos_index += 1
    except OSError as error:
        raise IOError(f"File {training_file} cannot be opened or read.") from error

    return (tokens, pos_tags)


In [None]:
tokens, pos = get_vocabulary("/content/drive/MyDrive/eng-ner-dataset/engIOB2.train")

In [None]:
print(len(pos.keys()))

46
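The vocabulary only covers tokens seen in the training file, so a lookup for the development or test data needs some fallback. One possible approach, sketched below, is to reserve a single extra index for the `UNKNOWN_TOKEN` constant declared earlier. The helper `token_index` and the choice of index are assumptions made for this illustration, not something the assignment prescribes.

```python
from typing import Dict

UNKNOWN_TOKEN = 'UNK'  # same constant as declared earlier in the notebook


def token_index(token: str, vocabulary: Dict[str, int]) -> int:
    """Hypothetical lookup: unseen tokens share one index reserved for UNKNOWN_TOKEN."""
    # If the vocabulary has no explicit entry for UNKNOWN_TOKEN, use the next free index.
    unknown_index = vocabulary.get(UNKNOWN_TOKEN, len(vocabulary))
    return vocabulary.get(token.lower(), unknown_index)


vocabulary = {'eu': 0, 'rejects': 1}
print(token_index('EU', vocabulary), token_index('Zurich', vocabulary))
```

Because the vocabulary keys are lowercase, the uppercase `UNK` cannot collide with a real token from the training data.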


## Feature Selection and Data Frames

The most important question to ask at this point is about the features. *What types of features are likely to be important in the identification of various kinds of named entities?*

Unsurprisingly, the token itself and its part of speech are the most important indicators. For example, a conjunction is probably not the name of a person; an adjective is probably not a part of the name of a place (assuming that the greatness of "Great Britain" or the length of "Long Island" are correctly tagged as nouns). A few other features that research in NER has found to be helpful are the orthographic properties of a token, which involve the patterns of capitalization (e.g., is the word in all capital letters? does it start with a capital letter?), the POS tags of surrounding tokens, the surrounding tokens themselves, and the orthographic properties of the surrounding tokens.
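A minimal sketch of such orthographic properties for a single token is given below. The function name `orthographic_features` and the particular feature names are illustrative assumptions; the set of features you end up using may well be different.

```python
from typing import Dict


def orthographic_features(token: str) -> Dict[str, int]:
    """Compute a few binary capitalization and digit features for one token."""
    return {
        'IsAllCaps': int(token.isupper()),   # e.g., "EU" or "BRUSSELS"
        'IsTitle': int(token.istitle()),     # e.g., "Peter"
        'HasDigits': int(any(ch.isdigit() for ch in token)),  # e.g., "1996-08-22"
    }


for token in ['EU', 'Peter', '1996-08-22']:
    print(token, orthographic_features(token))
```

The same function can be applied to the previous and next tokens of a sentence to obtain the contextual versions of these features.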

You are by no means restricted to use only these properties. They are provided to you as a minimal set to explore (i.e., a starting point from where you can/should explore incorporating better features).

This work is what people often call **feature engineering**. It is, in some ways, "old school" NLP. Nevertheless, it is a relatively recent phase of NLP research, and going through this will help you gain hands-on knowledge of various programming tools/approaches in NLP. It will (hopefully) also help you appreciate the complexity and utility of neural networks where such feature engineering is rarely needed.

It is, of course, important to represent the training, development, and test instances using the same set of features. And for the supervised classification, we want to store the training, development, and test sets as data frames (essentially, vectors with class labels). Your next task is to complete the following method to do this.
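One way to guarantee that every split ends up with the same columns is to one-hot encode categorical features against a fixed list of categories, rather than against whatever values happen to occur in a given file. The sketch below illustrates this with pandas. The helper `one_hot_pos` and the three-tag category list are assumptions made only for this example.

```python
from typing import List

import pandas as pd

POS_CATEGORIES = ['NNP', 'VBZ', 'JJ']  # illustrative subset, not the full tag set


def one_hot_pos(pos_tags: List[str], categories: List[str]) -> pd.DataFrame:
    """One-hot encode POS tags against a fixed category list."""
    # Values outside `categories` become missing and are encoded as an all-zero row.
    column = pd.Series(pd.Categorical(pos_tags, categories=categories))
    return pd.get_dummies(column, prefix='POS')


train = one_hot_pos(['NNP', 'VBZ', 'JJ'], POS_CATEGORIES)
dev = one_hot_pos(['NNP', 'XX'], POS_CATEGORIES)
print(list(train.columns) == list(dev.columns))
```

With a fixed category list, a tag that appears only in the development or test data cannot introduce a new column, so the weight vector learned on the training set remains applicable.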

First, let's define an enumerable type so that only a fixed set of the "kinds of data frames" are allowed.

In [None]:
from enum import Enum

class ActionType(Enum):
    TRAIN = 'train'
    TEST = 'test'
    DEV = 'dev'

Use this `ActionType` below:

In [None]:
import pandas as pd
def create_dataframe(actiontype: ActionType, output_file_name: str) -> None:
    """
    Generate a pandas DataFrame from text files containing sentences (tokenized and annotated, in IOB or IOB2 format)
    and class labels, and write the DataFrame to a CSV file.

    Parameters:
        actiontype (ActionType): The type of action, either ActionType.TRAIN, ActionType.TEST, or ActionType.DEV.
        output_file_name (str): The name of the output CSV file.

    Returns:
        None

    Raises:
        FileNotFoundError: If the input training or test file is not found.
    """
    if actiontype == ActionType.TRAIN:
        input_file = '/content/drive/MyDrive/eng-ner-dataset/engIOB2.train'
    elif actiontype == ActionType.TEST:
        input_file = '/content/drive/MyDrive/eng-ner-dataset/engIOB2.testb'
    elif actiontype == ActionType.DEV:
        input_file = '/content/drive/MyDrive/eng-ner-dataset/engIOB2.testa'
    else:
        raise ValueError("Invalid action type")

    try:
        with open(input_file, 'r') as file:
            data = []
            sentences = file.read().strip().split('\n\n')
            for sentence in sentences:
                lines = sentence.split('\n')
                tokens = lines[::4]
                pos_tags = lines[1::4]
                labels = lines[3::4]
                for i, (token, pos, label) in enumerate(zip(tokens, pos_tags, labels)):
                    features = {
                        'POS': pos,
                        'PrevLabel': labels[i-1] if i > 0 else '',
                        'NextLabel': labels[i+1] if i < len(labels)-1 else '',

                        'PrevIsAllCaps': (1 if tokens[i-1].isupper() else 0) if i > 0 else 0,
                        'IsAllCaps': 1 if token.isupper() else 0,
                        'NextIsAllCaps': (1 if tokens[i+1].isupper() else 0) if i < len(tokens)-1 else 0,

                        'PrevIsTitle': (1 if tokens[i-1].istitle() else 0) if i > 0 else 0,
                        'IsTitle': 1 if token.istitle() else 0,
                        'NextIsTitle': (1 if tokens[i+1].istitle() else 0) if i < len(tokens)-1 else 0,

                        'PrevHasDigits': (1 if any(char.isdigit() for char in tokens[i-1]) else 0) if i > 0 else 0,
                        'HasDigits': 1 if any(char.isdigit() for char in token) else 0,
                        'NextHasDigits': (1 if any(char.isdigit() for char in tokens[i+1]) else 0) if i < len(tokens)-1 else 0,

                        'PrevHasPunctuations': (1 if any(char in '.,;?!' for char in tokens[i-1]) else 0) if i > 0 else 0,
                        'HasPunctuations': 1 if any(char in '.,;?!' for char in token) else 0,
                        'NextHasPunctuations': (1 if any(char in '.,;?!' for char in tokens[i+1]) else 0) if i < len(tokens)-1 else 0,

                        'TokenLength': len(token),
                        'Label': label
                    }
                    data.append(features)
        # Create the DataFrame from the collected data
        df = pd.DataFrame(data)

        df["POS"] = pd.Categorical(df["POS"], categories=['NNP', 'VBZ', 'JJ', 'NN', 'TO', 'VB', '.', 'CD', 'DT', 'VBD', 'IN', 'PRP', 'NNS', 'VBP', 'MD', 'VBN', 'POS', 'JJR', '"', 'RB', ',', 'FW', 'CC', 'WDT', '(', ')', ':', 'PRP$', 'RBR', 'VBG', 'EX', 'WP', 'WRB', '-X-', '$', 'RP', 'NNPS', 'SYM', 'RBS', 'UH', 'PDT', "''", 'LS', 'JJS', 'WP$', 'NN|SYM'])
        df['PrevLabel'] = pd.Categorical(df['PrevLabel'], categories=['B-ORG', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O', 'B-MISC'])
        df['NextLabel'] = pd.Categorical(df['NextLabel'], categories=['B-ORG', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O', 'B-MISC'])

        # One-hot encode the 'POS', 'NextLabel', and 'PrevLabel' columns
        pos_dummies = pd.get_dummies(df['POS'], prefix='POS')
        next_label_dummies = pd.get_dummies(df['NextLabel'], prefix='NextLabel')
        prev_label_dummies = pd.get_dummies(df['PrevLabel'], prefix='PrevLabel')

        # Concatenate the one-hot encoded columns with the original DataFrame
        df = pd.concat([df, pos_dummies, next_label_dummies, prev_label_dummies], axis=1)

        # Optionally, you can drop the original columns if you don't need them anymore
        df = df.drop(['POS', 'NextLabel', 'PrevLabel'], axis=1)

        # Write the DataFrame to a CSV file
        df.to_csv(output_file_name, index=False)

    except FileNotFoundError:
        raise FileNotFoundError(f"The file {input_file} does not exist")



In [None]:
create_dataframe(ActionType.TRAIN, "training.csv")
pandas.read_csv("training.csv")

Unnamed: 0,PrevIsAllCaps,IsAllCaps,NextIsAllCaps,PrevIsTitle,IsTitle,NextIsTitle,PrevHasDigits,HasDigits,NextHasDigits,PrevHasPunctuations,...,NextLabel_I-PER,NextLabel_O,NextLabel_B-MISC,PrevLabel_B-ORG,PrevLabel_I-LOC,PrevLabel_I-MISC,PrevLabel_I-ORG,PrevLabel_I-PER,PrevLabel_O,PrevLabel_B-MISC
0,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,1,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
3,0,0,0,1,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
204562,0,0,0,0,1,0,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
204563,0,0,0,1,0,1,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
204564,0,0,0,0,1,0,1,0,1,0,...,0,1,0,0,0,0,0,0,1,0
204565,0,0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0


In [None]:
create_dataframe(ActionType.DEV, "testing.csv")
pandas.read_csv("testing.csv")

Unnamed: 0,PrevIsAllCaps,IsAllCaps,NextIsAllCaps,PrevIsTitle,IsTitle,NextIsTitle,PrevHasDigits,HasDigits,NextHasDigits,PrevHasPunctuations,...,NextLabel_I-PER,NextLabel_O,NextLabel_B-MISC,PrevLabel_B-ORG,PrevLabel_I-LOC,PrevLabel_I-MISC,PrevLabel_I-ORG,PrevLabel_I-PER,PrevLabel_O,PrevLabel_B-MISC
0,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,0,1,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
3,1,1,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,0,0,0
4,1,1,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51573,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
51574,0,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
51575,0,0,0,1,1,0,0,0,1,0,...,0,1,0,0,0,0,1,0,0,0
51576,0,0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0


In [None]:
create_dataframe(ActionType.TEST, "testingB.csv")
pandas.read_csv("testingB.csv")

Unnamed: 0,PrevIsAllCaps,IsAllCaps,NextIsAllCaps,PrevIsTitle,IsTitle,NextIsTitle,PrevHasDigits,HasDigits,NextHasDigits,PrevHasPunctuations,...,NextLabel_I-PER,NextLabel_O,NextLabel_B-MISC,PrevLabel_B-ORG,PrevLabel_I-LOC,PrevLabel_I-MISC,PrevLabel_I-ORG,PrevLabel_I-PER,PrevLabel_O,PrevLabel_B-MISC
0,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,0,1,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
3,1,1,1,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46661,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
46662,0,0,0,0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
46663,0,0,0,0,1,0,0,0,0,1,...,0,1,0,0,0,0,0,0,1,0
46664,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [None]:
set1 = set(pandas.read_csv("testing.csv").columns)
set2 = set(pandas.read_csv("training.csv").columns)
set3 = set(pandas.read_csv("testingB.csv").columns)
set1-set3

set()

# Binary Logistic Regression Classifier [30 points]

Now that your data frames are built, it is time to build your binary logistic regression classifier to identify if a token is a part of a person's name. To do this, you can effectively treat the labels `I-PER` and `B-PER` together as a single label, `PER`, and treat all the other labels simply as *other*, denoted by `O` (how you denote it internally in your code is entirely up to you).
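The label collapsing described above could look roughly like the following. The function name `to_binary_labels` is hypothetical, and encoding `PER` as 1 and everything else as 0 is just one possible internal representation.

```python
import pandas as pd


def to_binary_labels(labels: pd.Series) -> pd.Series:
    """Map I-PER and B-PER to 1 (the PER class) and every other label to 0."""
    return labels.isin(['I-PER', 'B-PER']).astype(int)


print(to_binary_labels(pd.Series(['I-PER', 'O', 'B-PER', 'I-ORG'])).tolist())
```

Returning a new series, instead of overwriting the original column, keeps the multi-class labels available for the later parts of the assignment.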

Complete the class `LogisticRegressionClassifier` below.

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score
import pandas as pd
from sklearn.metrics import log_loss
import ast

class LogisticRegressionClassifier:
    """
    A binary logistic regression classifier.

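    Attributes:
        training_data_csv (str): The path to the CSV file containing the training data.
        learning_rate (float): The learning rate for gradient descent.
        learning_rate_decay (float, optional): The factor by which learning rate decays with each iteration.
        num_iterations (int): The number of iterations for gradient descent.
        regularization (str): The type of regularization, either 'l1' or 'l2'; any other value disables it.
        regularization_strength (float): The weight of the regularization term.
        threshold (float): The probability at or above which a token is classified as PER.
        weights (ndarray): The weights for the features.
        bias (float): The bias term.
        df (pandas.DataFrame): The training data as a pandas DataFrame (read from `training_data_csv`)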
    """

    def __init__(self, training_data_csv: str, learning_rate=0.01, learning_rate_decay=1.0, num_iterations=1000,
                 regularization='l2', regularization_strength=0.1, threshold=0.5):
        self.training_data_csv = training_data_csv
        self.learning_rate = learning_rate
        self.learning_rate_decay = learning_rate_decay
        self.num_iterations = num_iterations
        self.regularization = regularization
        self.regularization_strength = regularization_strength
        self.threshold = threshold
        self.df = pd.read_csv(training_data_csv)
        self.weights = np.zeros(self.__to_feature_matrix(self.df).shape[1])
        self.bias = 0

    @staticmethod
    def sigmoid(z: float) -> float:
        """
        Compute the sigmoid function.

        Parameters:
            z (float): The input to the sigmoid function.

        Returns:
            float: The output of the sigmoid function.
        """
        return 1 / (1 + numpy.exp(-z))

    def __to_feature_matrix(self, df: pandas.DataFrame) -> pandas.DataFrame:
        """
        A private method to extract the feature matrix from the data frame.

        Parameters:
            df (pandas.DataFrame): The given data frame.

        Returns:
            The matrix of features as a pandas DataFrame
        """
        # TODO
        return df.drop('Label', axis='columns')

    def __to_class_labels(self, df: pandas.DataFrame) -> pandas.Series:
        """
        A private method to extract the class labels from the data frame.

        Parameters:
            df (pandas.DataFrame): The given data frame.

        Returns:
            The vector of binary class labels as a pandas Series (1 for `I-PER`/`B-PER`, 0 otherwise).
        """
        # TODO
        # Build a new Series rather than overwriting df['Label'], so calling this twice on the same frame is safe
        return df['Label'].apply(lambda x: 1 if x in ['I-PER', 'B-PER'] else 0)


    def learn(self, X, y, batch_size=250):
        """
        Learn the weight vector to obtain the best decision boundary that separates the two classes in the training set.

        It initializes the model parameters (the weights are initialized to zeros, and the bias is also initially set to
        zero). It then performs gradient descent to optimize the parameters based on the training data. The optimization
        is done by minimizing the cross-entropy loss or the logistic loss. The learning rate determines the step size
        taken during gradient descent. If the decay factor is less than 1, the step size reduces with each iteration.

        Parameters:
            X (ndarray or pandas.DataFrame): The feature matrix.
            y (ndarray or pandas.Series): The target class labels.
            batch_size (int, optional): The number of instances in each mini-batch. Defaults to 250.

        Returns:
            None
        """
        # TODO
        if isinstance(X, pd.DataFrame):
            X = X.to_numpy()

        self.weights = np.zeros(X.shape[1])
        self.bias = 0
        update_interval = self.num_iterations // 10

        for i in range(self.num_iterations):
            if update_interval and (i % update_interval == 0):
                print(f"Iteration {i} - {int((i / self.num_iterations) * 100)}% complete.")

            # Shuffle the dataset
            indices = np.random.permutation(len(X))
            X_shuffled = X[indices]
            y_shuffled = y[indices]

            for start in range(0, len(X), batch_size):
                end = start + batch_size
                X_batch = X_shuffled[start:end]
                y_batch = y_shuffled[start:end]

                # Skip empty batches
                if X_batch.size == 0:
                    continue

                predictions = self.sigmoid(np.dot(X_batch, self.weights) + self.bias)
                errors = predictions - y_batch

                # Compute gradients for weights and bias
                gradient_weights = np.dot(X_batch.T, errors) / len(X_batch)
                gradient_bias = np.mean(errors)

                # Apply L2 regularization
                if self.regularization == 'l2':
                    gradient_weights += (self.regularization_strength / len(X_batch)) * self.weights

                # Apply L1 regularization
                elif self.regularization == 'l1':
                    gradient_weights += (self.regularization_strength / len(X_batch)) * np.sign(self.weights)

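                # Update weights and bias; the step size shrinks with each iteration when the decay factor is below 1
                step_size = self.learning_rate * (self.learning_rate_decay ** i)
                self.weights -= step_size * gradient_weights
                self.bias -= step_size * gradient_bias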


    def predict(self, feature_matrix) -> List[int]:
        """
        Predict the target labels for new/test data.

        Parameters:
            feature_matrix (ndarray): The feature matrix of new/test data.

        Returns:
            list: The predicted target labels.
        """
        # TODO
        return (self.sigmoid(np.dot(feature_matrix, self.weights) + self.bias) >= self.threshold).astype(int)

    def set_threshold(self, threshold: float):
        """
        Set a new threshold value for the classifier.

        Parameters:
            threshold (float): The new threshold value.

        Returns:
            None
        """
        self.threshold = threshold

    def compute_loss(self, feature_matrix, y_true):
        """
        Compute the binary cross-entropy loss.

        Parameters:
            feature_matrix (ndarray): The feature matrix of new/test data.
            y_true (ndarray): The true class labels.

        Returns:
            float: The binary cross-entropy loss.
        """
        predictions = self.sigmoid(np.dot(feature_matrix, self.weights) + self.bias)
        return log_loss(y_true, predictions)

    def report(self, feature_matrix: numpy.ndarray, y_true: numpy.ndarray) -> Tuple[float, float, float]:
        """
        Compute the precision, recall, and F-1 scores for new/test data.

        Parameters:
            feature_matrix (ndarray): The feature matrix of new/test data.
            y_true (ndarray): The true class labels.

        Returns:
            Tuple[float, float, float]: A tuple containing three values: the positive class' precision, recall, and F-1
            scores (in this order)
        """
        # TODO
        #
        # Note 1: For binary classification, we only need these three values for the positive class (instead of micro-
        # or macro-averaging). The PER class is considered as the positive class for this component of the assignment.
        #
        # Note 2: You should aim for an F-1 measure of at least 0.7 on the final test set (eng.testb)
        y_pred = self.predict(feature_matrix)
        precision = precision_score(y_true, y_pred)
        recall = recall_score(y_true, y_pred)
        f1 = f1_score(y_true, y_pred)
        return precision, recall, f1

    def printWB(self) -> None:
        """Print the current weight vector and bias term."""
        print(f"W: {self.weights} and B: {self.bias}")

    def predict_proba(self, feature_matrix) -> numpy.ndarray:
        """Return the predicted probability of the positive class for each row of `feature_matrix`."""
        return self.sigmoid(np.dot(feature_matrix, self.weights) + self.bias)

    def get_feature_matrix(self, df: pd.DataFrame) -> pd.DataFrame:
        """Public wrapper around the private feature-matrix extraction method."""
        return self.__to_feature_matrix(df)

    def get_class_labels(self, df: pd.DataFrame) -> pd.Series:
        """Public wrapper around the private class-label extraction method."""
        return self.__to_class_labels(df)

## Training the model

In [None]:
classifier = LogisticRegressionClassifier('training.csv')
classifier.set_threshold(0.5)

train_data = classifier.df
train_feature_matrix = classifier.get_feature_matrix(train_data)
train_labels = classifier.get_class_labels(train_data)

classifier.learn(train_feature_matrix, train_labels)

precision, recall, f1 = classifier.report(train_feature_matrix, train_labels)
print(f"Precision: {precision}, Recall: {recall}, F1 Score: {f1}")

Iteration 0 - 0% complete.
Iteration 100 - 10% complete.
Iteration 200 - 20% complete.
Iteration 300 - 30% complete.
Iteration 400 - 40% complete.
Iteration 500 - 50% complete.
Iteration 600 - 60% complete.
Iteration 700 - 70% complete.
Iteration 800 - 80% complete.
Iteration 900 - 90% complete.
Precision: 0.9484841827768014, Recall: 0.7759705248023006, F1 Score: 0.8535982601818901


## Calculating Loss

In [None]:
loss = classifier.compute_loss(train_feature_matrix, train_labels)
print(f"Binary Cross-Entropy Loss: {loss}")

Binary Cross-Entropy Loss: 0.05030440624329137


## Testing the model

### Testing on dev set

In [None]:
test_data = pd.read_csv('testing.csv')

test_feature_matrix = classifier.get_feature_matrix(test_data)
test_labels = classifier.get_class_labels(test_data)

precision, recall, f1 = classifier.report(test_feature_matrix, test_labels)
print(f"Precision: {precision}, Recall: {recall}, F1 Score: {f1}")

Precision: 0.960967993754879, Recall: 0.7818355033343919, F1 Score: 0.862195762563474


### Testing on test set

In [None]:
test_data = pd.read_csv('testingB.csv')

test_feature_matrix = classifier.get_feature_matrix(test_data)
test_labels = classifier.get_class_labels(test_data)

precision, recall, f1 = classifier.report(test_feature_matrix, test_labels)
print(f"Precision: {precision}, Recall: {recall}, F1 Score: {f1}")

Precision: 0.9471195184866724, Recall: 0.7944464478903714, F1 Score: 0.8640909982349481


# Multinomial Logistic Regression Classifier [30 points]

This is also known as **Softmax Regression** or **Maxent Classifier**. It is a popular tool for multi-class classification, which we will use here for NER.

Note that the classification is happening on a *per-token* basis, and the class labels are `B-ORG`, `I-ORG`, etc.
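
As a minimal sketch of what the softmax step does on a per-token basis, the snippet below turns a matrix of logits (one row per token, one column per class) into one probability distribution per row. The function name `softmax_rows` and the toy logits are illustrative assumptions, not part of the provided skeleton.

```python
import numpy as np

def softmax_rows(logits: np.ndarray) -> np.ndarray:
    """Apply softmax independently to each row (one row per token)."""
    # Subtracting each row's maximum avoids overflow and does not change the result.
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp_shifted = np.exp(shifted)
    # Normalize per row so that every token gets its own distribution over classes.
    return exp_shifted / np.sum(exp_shifted, axis=1, keepdims=True)

# Two tokens, three hypothetical classes
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax_rows(logits)
print(probs.sum(axis=1))     # every row sums to 1
print(probs.argmax(axis=1))  # index of the most probable class for each token
```

The `axis=1, keepdims=True` arguments are what make the normalization happen per token rather than over the whole matrix.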

Generalizing the binary classification task, you should complete the `MultinomialLogisticRegression` class, whose skeleton is provided to you next. For this portion, you may (optionally) import the `OneHotEncoder` or `LabelEncoder` from scikit-learn. For example, you may add this line at the beginning of the next cell:

```
from sklearn.preprocessing import LabelEncoder
```
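
To illustrate how the optional encoder could be used, here is a small sketch that maps string labels to integer ids, builds one-hot rows from those ids, and maps the ids back to strings. The label list is toy data, and building the one-hot matrix with `np.eye` is just one possible approach (an assumption of this sketch, not a requirement of the assignment).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = ["O", "I-PER", "O", "B-ORG", "I-ORG"]  # toy labels for illustration

encoder = LabelEncoder()
label_ids = encoder.fit_transform(labels)            # string labels -> integer ids
one_hot = np.eye(len(encoder.classes_))[label_ids]   # integer ids -> one-hot rows
decoded = encoder.inverse_transform(label_ids)       # integer ids -> string labels

print(encoder.classes_)  # classes are stored in sorted order
print(label_ids)
print(one_hot.shape)     # (number of tokens, number of classes)
```

Keeping the fitted encoder around is what later lets predictions be reported with the actual label names instead of indices.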


In [None]:
from sklearn.metrics import precision_recall_fscore_support, classification_report
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

class MultinomialLogisticRegression:
    """
    A multinomial logistic regression classifier.

    Attributes:
        learning_rate (float): The learning rate for gradient descent.
        learning_rate_decay (float): The factor by which learning rate decays with each iteration.
        weights (ndarray): The weights for the features.
        bias (ndarray): The bias terms, one per class.
        training_data (pandas.DataFrame): The training data as a pandas DataFrame (to be read from a valid CSV file)
    """

    def __init__(self, learning_rate: float, learning_rate_decay: float, epochs: int, training_data_csv: str):
        """
        Initialize the multinomial logistic regression classifier.

        Parameters:
            learning_rate (float): The learning rate for gradient descent.
            learning_rate_decay (float): The factor by which learning rate decays with each iteration.
            epochs (int): The number of training epochs.
            training_data_csv (str): The file path to the CSV file containing training data.
        """
        # TODO
        self.learning_rate = learning_rate
        self.learning_rate_decay = learning_rate_decay
        self.epochs = epochs
        self.training_data_csv = training_data_csv
        self.weights = None
        self.bias = None
        self.label_encoder = None
        self.initialize(training_data_csv)

    @staticmethod
    def softmax(logits):
        """
        Compute the softmax function.

        Parameters:
            logits (numpy.ndarray): The input to the softmax function.

        Returns:
            numpy.ndarray: The output of the softmax function.
        """
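        # Normalize over the last axis so each row (one token) gets its own distribution;
        # subtracting the row maximum keeps the exponentials numerically stable.
        exp_logits = numpy.exp(logits - numpy.max(logits, axis=-1, keepdims=True))
        return exp_logits / numpy.sum(exp_logits, axis=-1, keepdims=True)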

    def learn(self) -> None:
        """
        Train the multinomial logistic regression model.

        This method trains the multinomial logistic regression model using stochastic gradient descent. It begins by
        encoding the target labels into one-hot encoded vectors and initializes the weights (to zeros) for the model.
        During each training epoch, it iterates through the training data, computing the softmax probabilities for each
        class and updating the weights based on the gradient of the cross-entropy loss.

        Returns:
            None
        """
        # TODO
        # Note: Remember to use regularization, but think about what type of regularization might suit this task.
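        # `sparse_output` replaced the `sparse` argument in scikit-learn 1.2 (use `sparse=False` on older versions)
        onehot_encoder = OneHotEncoder(sparse_output=False)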
        y = self.training_data['Label'].values.reshape(-1, 1)
        y_onehot = onehot_encoder.fit_transform(y)
        X = self.training_data.drop('Label', axis=1).values

        for epoch in range(self.epochs):
            logits = np.dot(X, self.weights.T) + self.bias
            probs = self.softmax(logits)

            # Compute the gradient of the cross-entropy loss
            error = probs - y_onehot
            gradient_weights = np.dot(error.T, X) / X.shape[0]
            gradient_bias = np.mean(error, axis=0)

            # Update weights and bias
            self.weights -= self.learning_rate * gradient_weights
            self.bias -= self.learning_rate * gradient_bias

            # Apply learning rate decay
            self.learning_rate *= self.learning_rate_decay

            if epoch % 100 == 0 or epoch == self.epochs - 1:
                # Calculate the cross-entropy loss
                loss = -np.mean(np.sum(y_onehot * np.log(probs + 1e-9), axis=1))
                print(f'Epoch {epoch}, Loss: {loss:.4f}')

    def predict(self, test_data_csv: str) -> numpy.ndarray:
        """
        Predict class labels for test data using the trained multinomial logistic regression model.

        Loads the test data from a CSV file into a pandas DataFrame. Then, computes the softmax probabilities for each
        class using the dot product of the feature matrix and the model weights. The class label for each instance is
        predicted based on the highest probability using argmax.

        Parameters:
            test_data_csv (str): File path to the CSV file containing test data.

        Returns:
            numpy.ndarray: Predicted class labels for the test data.
        """
        # TODO
        # Load test data
        test_data = pd.read_csv(test_data_csv)
        X_test = test_data.drop('Label', axis=1).values
        true_labels_encoded = self.label_encoder.transform(test_data['Label'])
        logits = np.dot(X_test, self.weights.T) + self.bias
        probs = self.softmax(logits)
        predictions_encoded = np.argmax(probs, axis=1)

        self.predictions_encoded = predictions_encoded
        self.true_labels_encoded = true_labels_encoded
        return self.label_encoder.inverse_transform(predictions_encoded)

    def initialize(self, training_data_csv):
        self.training_data = pd.read_csv(training_data_csv)
        self.label_encoder = LabelEncoder()
        # Fit the label encoder and transform the 'Label' column to get encoded labels
        self.training_data['Label'] = self.label_encoder.fit_transform(self.training_data['Label'])
        self.num_classes = len(self.label_encoder.classes_)
        self.num_features = self.training_data.drop('Label', axis=1).shape[1]
        self.weights = np.zeros((self.num_classes, self.num_features))
        self.bias = np.zeros(self.num_classes)

    def generate_report(self):
        if hasattr(self, 'predictions_encoded') and hasattr(self, 'true_labels_encoded'):
            true_labels = self.label_encoder.inverse_transform(self.true_labels_encoded)
            predictions = self.label_encoder.inverse_transform(self.predictions_encoded)

            report = classification_report(true_labels, predictions)
            print(report)
        else:
            raise ValueError("Predictions or true labels are not available. Please run predict() first.")

In [None]:
multiClassifier = MultinomialLogisticRegression(0.001, 1.0, 100, "training.csv")
multiClassifier.learn()
multiClassifier.predict("training.csv")
multiClassifier.generate_report()



Epoch 0, Loss: 14.3065
Epoch 99, Loss: 19.2726




              precision    recall  f1-score   support

       B-LOC       0.00      0.00      0.00        11
      B-MISC       0.00      0.00      0.00        35
       B-ORG       0.00      0.00      0.00        24
       I-LOC       0.00      0.00      0.00      8286
      I-MISC       0.00      0.00      0.00      4558
       I-ORG       0.00      0.00      0.00     10001
       I-PER       0.00      0.00      0.00     11128
           O       0.83      1.00      0.91    170524

    accuracy                           0.83    204567
   macro avg       0.10      0.12      0.11    204567
weighted avg       0.69      0.83      0.76    204567





In [None]:
multiClassifier.predict("testing.csv")
multiClassifier.generate_report()



              precision    recall  f1-score   support

      B-MISC       0.00      0.00      0.00         4
       I-LOC       0.00      0.00      0.00      2094
      I-MISC       0.00      0.00      0.00      1264
       I-ORG       0.00      0.00      0.00      2092
       I-PER       0.00      0.00      0.00      3149
           O       0.83      1.00      0.91     42975

    accuracy                           0.83     51578
   macro avg       0.14      0.17      0.15     51578
weighted avg       0.69      0.83      0.76     51578





In [None]:
multiClassifier.predict("testingB.csv")
multiClassifier.generate_report()



              precision    recall  f1-score   support

       B-LOC       0.00      0.00      0.00         6
      B-MISC       0.00      0.00      0.00         9
       B-ORG       0.00      0.00      0.00         5
       I-LOC       0.00      0.00      0.00      1919
      I-MISC       0.00      0.00      0.00       909
       I-ORG       0.00      0.00      0.00      2491
       I-PER       0.00      0.00      0.00      2773
           O       0.83      1.00      0.90     38554

    accuracy                           0.83     46666
   macro avg       0.10      0.12      0.11     46666
weighted avg       0.68      0.83      0.75     46666





In [None]:
df = pd.read_csv("testingB.csv")
label_counts = df['Label'].value_counts()
print(label_counts)

O         38554
I-PER      2773
I-ORG      2491
I-LOC      1919
I-MISC      909
B-MISC        9
B-LOC         6
B-ORG         5
Name: Label, dtype: int64


Reporting the results of a multiclass classification is a little more complicated than binary classification. For this part, [please read the documentation of scikit-learn's `classification_report`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).

Then, add an instance method called `generate_report` to the above class. This method should take no arguments (other than the default `self`), and print the report for your classification results in the same format as shown in the above documentation. That is, for a 3-class classification, it should print (shown with dummy results):

```
                precision    recall  f1-score   support

     class 0       0.50      1.00      0.67         1
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.67      0.80         3

    accuracy                           0.60         5
   macro avg       0.50      0.56      0.49         5
   micro avg       1.00      0.67      0.80         5
```

In your generated report, the class names must be the actual labels (e.g., `I-PER`) and not just numbers or indices. As you may have realized, your method will simply be a wrapper around scikit-learn's `classification_report`, but you will have to carefully think about the parameter values to use.

**Note:** You are not responsible for generating a report if a user calls this method before testing (by a call to the `predict` function). If a user does invoke `generate_report` without invoking `predict` first, it is acceptable for your code to raise an error.
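
The wrapper idea described above can be sketched as follows. This is only an illustration under stated assumptions: the toy labels and the variable names `true_ids` and `pred_ids` are made up, and the choice of `zero_division=0` is one possible setting rather than the required one.

```python
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Toy data: encoded true labels and encoded predictions for five tokens
encoder = LabelEncoder().fit(["I-LOC", "I-PER", "O"])
true_ids = encoder.transform(["O", "I-PER", "O", "I-LOC", "I-PER"])
pred_ids = encoder.transform(["O", "I-PER", "O", "O", "O"])

report = classification_report(
    true_ids,
    pred_ids,
    labels=list(range(len(encoder.classes_))),  # one row per known class, in encoder order
    target_names=list(encoder.classes_),        # print actual label names, not indices
    zero_division=0,                            # report 0.0 for classes that are never predicted
)
print(report)
```

Passing `labels` together with `target_names` keeps the row names aligned with the encoder's class order even when some class is missing from a particular test file.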

# Your insights [20 points]

## 1. Test set vs Dev set: Binary classification [4 points]

For the binary classification task, how much does the performance (in terms of each metric) differ between the dev set and the final test set?

**Precision: 0.960968 (dev) - 0.947120 (test) ≈ 0.0138 (lower on the test set)**<br>
**Recall: 0.794446 (test) - 0.781836 (dev) ≈ 0.0126 (higher on the test set)**<br>
**F1 Score: 0.864091 (test) - 0.862196 (dev) ≈ 0.0019 (higher on the test set)**<br><br>
**These differences indicate that the fraction of predicted positives that are correct (precision) decreased slightly on the test set, while the fraction of actual positives that the model finds (recall) increased marginally. The F1 score, which balances precision and recall, is marginally higher on the test set, so the model's overall performance is essentially unchanged from the dev set to the test set.**

What do you think are the causes behind these differences?

> **The differences in performance metrics between the development (dev) set and the test set can be attributed to several factors:**<br>
**1. Data distribution**<br>
**2. Feature representation**<br>
**3. Class imbalance**<br>

>**After looking at the data, I think the cause is class imbalance, because the occurrence counts of the different classes differ slightly.**

Suggest one or two experiments that you should design and conduct, in order to test your hypothesis (i.e., in order to test whether your answer to the above question is, indeed, correct).

> **I printed the number of occurrences of each value in the `Label` column, and there is a slight difference. That slight difference in class distribution may be what causes the slight difference in F-1 scores.**
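
A sketch of how this comparison could be made more systematic is shown below: it compares label proportions (rather than raw counts) between two sets. The two series here are toy data; in the notebook they would presumably come from the `Label` columns of `testing.csv` and `testingB.csv`. The helper name `label_proportions` is hypothetical.

```python
import pandas as pd

def label_proportions(labels: pd.Series) -> pd.Series:
    """Return the fraction of rows that carry each label."""
    return labels.value_counts(normalize=True)

# Toy stand-ins for the dev and test label columns
dev = pd.Series(["O"] * 8 + ["I-PER"] * 2)
test = pd.Series(["O"] * 7 + ["I-PER"] * 3)

comparison = pd.DataFrame({
    "dev": label_proportions(dev),
    "test": label_proportions(test),
}).fillna(0.0)  # a label missing from one set gets proportion 0
comparison["abs_diff"] = (comparison["dev"] - comparison["test"]).abs()
print(comparison)
```

Proportions are easier to compare than counts because the two files have different sizes.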

## 2. Stochastic Gradient Descent [10 points]

### What exactly is an epoch?

>**An epoch in the context of stochastic gradient descent (SGD) is one complete pass through the entire training dataset. During an epoch, the model iterates over all training examples, updating its parameters based on the gradients computed for each example or batch of examples.**

### Why is it important to optimize over multiple epochs, when in each epoch, the training is happening over the same data?

>**Optimizing over multiple epochs is important because it allows the model to refine its parameters iteratively. In each epoch, the model has the opportunity to adjust its weights based on the errors it made in the previous epoch. This iterative process helps the model to converge to a set of parameters that minimize the loss function.**
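
The effect of repeated passes can be sketched with a tiny binary logistic regression trained by mini-batch gradient descent. Everything here (the synthetic data, batch size, learning rate, and number of epochs) is an assumption chosen for illustration, not the configuration used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # synthetic, linearly separable labels

weights = np.zeros(2)
bias = 0.0
learning_rate = 0.1
batch_size = 20
losses = []

for epoch in range(5):
    order = rng.permutation(len(X))  # one epoch = one full pass over the same data, reshuffled
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        probs = 1.0 / (1.0 + np.exp(-(X[batch] @ weights + bias)))
        error = probs - y[batch]
        weights -= learning_rate * X[batch].T @ error / len(batch)
        bias -= learning_rate * error.mean()
    # Evaluate the cross-entropy loss on the full training set after this epoch
    probs_all = 1.0 / (1.0 + np.exp(-(X @ weights + bias)))
    loss = -np.mean(y * np.log(probs_all + 1e-9) + (1 - y) * np.log(1 - probs_all + 1e-9))
    losses.append(loss)
    print(f"Epoch {epoch}: loss = {loss:.4f}")
```

Each epoch starts from the parameters left by the previous one, so the same data keeps moving the weights closer to a minimum of the loss.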

### For the binary classification task, what regularization did you choose when optimizing? Why did you choose this, and not any other?

>**For the binary classification task, I chose L2 regularization when optimizing. L2 regularization adds a penalty term to the loss function proportional to the square of the magnitude of the weights. I chose L2 regularization because it tends to produce models with smaller and more spread-out weights, which can help prevent overfitting and improve generalization.**
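
As a sketch of how such a penalty enters the update, the function below computes the gradient of the binary cross-entropy loss plus an assumed penalty of `(l2_strength / 2) * ||weights||^2`. The function name, the parameter `l2_strength`, and the choice to leave the bias unregularized are assumptions of this sketch.

```python
import numpy as np

def l2_regularized_gradient(X, y, weights, bias, l2_strength):
    """Gradient of mean binary cross-entropy plus (l2_strength / 2) * ||weights||^2."""
    probs = 1.0 / (1.0 + np.exp(-(X @ weights + bias)))
    error = probs - y
    grad_weights = X.T @ error / len(y) + l2_strength * weights  # penalty adds l2_strength * w
    grad_bias = error.mean()                                     # bias is left unregularized here
    return grad_weights, grad_bias

# Toy check: the penalty shifts the weight gradient by exactly l2_strength * weights
X_toy = np.array([[1.0, 2.0], [3.0, 4.0]])
y_toy = np.array([0.0, 1.0])
w_toy = np.array([0.5, -0.5])
grad_plain, _ = l2_regularized_gradient(X_toy, y_toy, w_toy, 0.0, l2_strength=0.0)
grad_l2, _ = l2_regularized_gradient(X_toy, y_toy, w_toy, 0.0, l2_strength=0.5)
print(grad_l2 - grad_plain)
```

Because the extra term is proportional to each weight, large weights are pulled towards zero more strongly than small ones.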

### For the multiclass classification task, what regularization did you choose when optimizing? Why did you choose this, and not any other?

> **For the multiclass classification task, I chose L2 regularization. I selected it because it effectively prevents overfitting by adding a penalty term to the loss function. This penalty is proportional to the square of the magnitude of the coefficients, which encourages the model to maintain smaller weights. This approach can lead to a more generalizable model, as it discourages the model from relying too heavily on any single feature and helps to manage the complexity of the model.**



### What types of learning rate decay were included in your experiments (as discussed in the lecture before Spring break)? Did the dev set play an important role in these experiments? Briefly explain how. Also briefly explain what made you fix the type of learning rate decay when testing on the final test set.

> **In my experiments with logistic regression for NLP, I included exponential and time-based learning rate decay methods. The development (dev) set played a crucial role in these experiments as it allowed me to tune the hyperparameters, such as the learning rate and decay rate, without overfitting to the test set. By evaluating the performance on the dev set, I could adjust the learning rate decay parameters to achieve a balance between learning speed and model stability.<br>
I fixed the type of learning rate decay when testing on the final test set based on the results obtained from the dev set. I chose the decay type that provided the best balance between convergence speed and model performance on the dev set. This approach ensured that the model was neither learning too slowly nor oscillating too much around the optimal solution, leading to better generalization on unseen data.**
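
The two decay types mentioned above can be sketched as simple schedules. The function names and the example values are illustrative assumptions. Note that multiplying the learning rate by a constant factor after every epoch, as `learn` does with `learning_rate_decay`, corresponds to the exponential form.

```python
def exponential_decay(initial_rate: float, decay: float, epoch: int) -> float:
    """Learning rate after `epoch` multiplications by a constant factor `decay`."""
    return initial_rate * decay ** epoch

def time_based_decay(initial_rate: float, decay: float, epoch: int) -> float:
    """Learning rate that shrinks in proportion to 1 / (1 + decay * epoch)."""
    return initial_rate / (1.0 + decay * epoch)

for epoch in range(4):
    print(epoch,
          exponential_decay(0.1, 0.5, epoch),
          time_based_decay(0.1, 1.0, epoch))
```

Comparing dev-set scores across such schedules, and then freezing the chosen one before touching the test set, matches the procedure described in the answer.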



## 3. Multiclass classification [6 points]

### Were there any classes that were particularly hard to detect?

>**Based on the report, my model had a tough time identifying any class other than "O". Every other class has a precision, recall, and F1-score of zero, which tells me that my model did not correctly identify a single instance of those classes.<br>
Here is why I think these classes were difficult for the model to identify:<br>
The "O" class is heavily overrepresented, which may have biased the model towards predicting "O" and neglecting the other categories.<br>
My model might also not have been trained effectively because of this imbalance in the dataset. I tried converting the dataset to IOB2, but that did not give different results.**

### Why do you think these classes were comparatively more difficult to identify correctly?

>**To improve performance on these more challenging categories, I would conduct several experiments: <br>
Balancing the Dataset: I would experiment with data balancing techniques such as oversampling the minority classes or undersampling the majority class. Balancing should provide a more uniform distribution of classes for the model to learn from.<br>
Enhancing Feature Engineering: I would try out different features or look into more sophisticated representations like word embeddings or even contextual embeddings from transformer models, which might capture the nuances of language better.<br>
Adjusting Model Complexity: I would assess whether my model is complex enough to capture the data's diversity. If it's too simplistic, that could be a problem, so I might consider making it more complex, possibly exploring deep learning approaches.<br>
Hyperparameter Optimization: I would perform a thorough search for the optimal set of hyperparameters, including the learning rate and regularization strength, to see if any adjustments lead to better model performance.**

### What experiments would you design and conduct to try and improve the performance on these difficult categories? Support your answer with technical reasoning (in this context, "technical" means either based on mathematical reasoning, linguistic insights, or statistical insights drawn from data).

> **To tackle my model's poor performance on certain classes, I would design and conduct the following experiments, each with its technical reasoning:<br>
Class Rebalancing Experiment: First, I would adjust the class distribution in my training data, since the support numbers suggest a large imbalance. I would augment the minority classes either by synthesizing new data points using techniques like SMOTE or by simply duplicating existing ones. I might also try down-sampling the "O" class. The technical reasoning here is based on statistical learning theory, which suggests that models trained on balanced datasets are less biased towards the majority class.<br>
Feature Engineering Experiment: I would hypothesize that the current features might not capture the cues necessary for NER. Thus, I would experiment with linguistic features such as part-of-speech tags, word embeddings, and perhaps position embeddings to give the model more contextual information. Mathematically, richer feature vectors can help in defining more complex decision boundaries, making it easier to distinguish between different classes.**
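
The down-sampling variant of the rebalancing experiment could be sketched as below. The helper `downsample_majority`, its parameters, and the toy label counts are hypothetical; only the column name `Label` and the majority label `O` come from the data used in this notebook.

```python
import pandas as pd

def downsample_majority(df: pd.DataFrame, label_column: str = "Label",
                        majority_label: str = "O", keep_fraction: float = 0.25,
                        seed: int = 0) -> pd.DataFrame:
    """Keep only a random fraction of the majority-class rows and all other rows."""
    majority = df[df[label_column] == majority_label]
    minority = df[df[label_column] != majority_label]
    kept = majority.sample(frac=keep_fraction, random_state=seed)
    # Shuffle the combined rows so the classes are interleaved again
    return pd.concat([kept, minority]).sample(frac=1.0, random_state=seed).reset_index(drop=True)

# Toy training frame with a heavy "O" majority
toy = pd.DataFrame({"Label": ["O"] * 80 + ["I-PER"] * 12 + ["I-ORG"] * 8})
balanced = downsample_majority(toy)
print(balanced["Label"].value_counts())
```

Rebalancing should only be applied to the training data; the dev and test sets need to keep their original distribution for the reported scores to stay meaningful.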


# Collaboration Policy

You may discuss any details of this assignment at a conceptual level with anyone. In fact, discussion of ideas and helping each other to gain a better understanding of the concepts and the mathematical principles is encouraged. But any written answers (natural language or programming language) must be entirely your own original work.

- There must not be any collaboration in programming (including the design, implementation, and debugging of code).
- There must not be any code in your submission that is written by anyone other than you (whether human or AI).
  - Submitted code will be checked against other submissions AND against AI-generated code, and evidence of plagiarism will lead to academic dishonesty charges.


# What to submit?

1.  Make your colab notebook publicly accessible. You can do this by clicking "Share" on the top-right corner of your notebook and make sure "anyone with the link" can view your notebook. Also make sure that viewers are allowed to download your notebook. Then, **put this link in the comment section of your submission on Brightspace**.
2.  Create an empty folder (locally, on your computer) called `firstname-lastname-cse354-hw2`. For example, John Doe will create `john-doe-cse354-hw2`.
  * Download this colab notebook with all the questions (code as well as the text questions) implemented/answered. This will be a single Python notebook, as a `.ipynb` file. Put the notebook in your folder.
  
  Zip this folder (i.e., create `firstname-lastname-cse354-hw2.zip`) and submit on Brightspace.

Once unzipped, your submission is expected to have the following structure:

```
john-doe-cse354-hw2
├── CSE354-Assignment-2.ipynb
└── README.md (optional)
```

# Important Notes

- Write comments to make your code readable. The short-term benefit is that if your code is doing something wrong, members of the teaching staff can consult your comments to see if the attempt was in the right direction (for potential partial credit). The long-term benefit is that you get into the habit of writing code that has good human readability, including for your own future self!
- **DO NOT change any code already given to you** (except potentially in one or two places where exceptions have been clearly articulated). This includes the limitations imposed on the use of external libraries, the method signatures, the data types specified through type hints, the return types, and the descriptions provided in docstrings.

Next is an example use of your code. This is not the only way to run your code, but it is provided here as an example of how you might want to check the various components of your code (whether you are developing locally on your computer or directly working on colab). As you can see, this is only an outline (for example, there is no use of a dev set in this example code snippet). As the developer of this NER application, you should perform much more extensive testing and debugging to ensure correctness of this application.

```
# Data preparation: creating the CSV files
create_dataframe(dataloader.ActionType.TRAIN, 'train_set.csv')
create_dataframe(dataloader.ActionType.TEST, 'test_set.csv')

# Model training
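# learning_rate_decay is required by __init__; 1.0 (no decay) is only a placeholder value
model = MultinomialLogisticRegression(learning_rate=0.05, learning_rate_decay=1.0, epochs=5, training_data_csv='train_set.csv')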
model.learn()

# Model testing and report generation
predictions: numpy.ndarray = model.predict(test_data_csv='test_set.csv')
model.generate_report()
```

# Due date

## **11:59 pm, March 26 (Tuesday)**