
# LT2222 V23 Assignment 1: Intent Classification

In this assignment you will be working with data from the *Slot and Intent Detection for Low Resource
language varieties (SID4LR)* shared task. 

Given an utterance, the task consists of identifying the intent of the speaker along with the key spans that require an action from the system. For example, given the utterance *Add reminder to swim at 11am tomorrow*, the intent is *add reminder*, while the slots are *to do* and *datetime*. **Here we'll focus on intent classification only.**

The dataset consists of 13 languages (en, de-st, de, da, nl, it, sr, id, ar, zh, kk, tr, ja).

For more details about the data, please check [this paper](https://aclanthology.org/2021.naacl-main.197.pdf) by van der Goot et al. (2021).

## General instructions

You will do all the work inside this notebook and submit your edited notebook back into Canvas. You may not copy code from elsewhere, but you can use functions from any module currently available on mltgpu, where the notebook will be tested. A major goal of the assignment is, in fact, for you to find them yourself and apply them. Only edit the notebook in the places where we specify you should do so.

You will need to give reasonable, but not excessively verbose, documentation of your code so that we understand what you did.

**The assignment is officially due at 23:59 CET on Thursday February 16, 2023. There are 33 points and 5 bonus points.**

### 1. Choose a language and download the corresponding train, validation and test data splits. (2 points)

https://bitbucket.org/robvanderg/sid4lr/src/master/xSID-0.4/

Store the chosen data in a directory `data/`. You may use any method you prefer (e.g., command line tools, a graphical interface, copy+paste, ...).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from IPython.display import clear_output

In [None]:
dataset = "/content/drive/My Drive/MLSNLP/Data/"

### 2. Import all necessary modules here. (1 point)

**Enter and run your code below.**

In [None]:
!pip install conllu
clear_output()

In [None]:
import numpy as np
import pandas as pd
import conllu
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix, plot_confusion_matrix
import matplotlib.pyplot as plt

### 3. Import the data into Python. (7 points)

Write code to read the data into Python. You should have two variables: one corresponding to the utterances or `x` and another one corresponding to the intents or `y`. 

**Hint:** both the #slot information and the IOB column are **not** used in this assignment.

Note that the amount of data varies depending on the language selected. 

**Enter and run your code below. You may insert additional code boxes and text boxes for comments.**

In [None]:
train_dataset = dataset + 'en.train.conll'
# Read the whole training file; the context manager closes the handle afterwards.
with open(train_dataset, mode='r', encoding='utf-8') as train_file:
    a = train_file.read()

In [None]:
sentences = conllu.parse(a, fields=["id", "form", "intent"])

In [None]:
split_func = lambda line, i: line[i].split("/")
sentences = conllu.parse(a, fields=["id", "form", "intent"], field_parsers={"intent": split_func})

In [None]:
list(sentences[3])[8]['form']

'bay'

In [None]:
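xTrain = list()
yTrain = list()
for iSentence in sentences:
    # Join the token forms back into one utterance string.
    xTrain.append(' '.join(iToken['form'] for iToken in iSentence))
    # Every token of an utterance carries the same intent, so read it from the first token.
    intent = iSentence[0]['intent']
    # The field parser above splits the intent on "/", which yields an unhashable list;
    # join the parts back so each label is a single string.
    yTrain.append('/'.join(intent) if isinstance(intent, list) else intent)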

In [None]:
train_data = {'X': xTrain, 'y': yTrain}
df = pd.DataFrame(data=train_data)

In [None]:
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)

In [None]:
df.shape

(37163, 2)
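As a cross-check on the reading step, here is a minimal sketch that extracts utterances and intents with plain Python instead of `conllu`. It assumes the layout suggested by the field list used above: comment lines start with `#`, token lines are tab-separated with the form in the second column and the intent in the third, and a blank line separates utterances. The helper name `read_utterances` and the sample text are hypothetical and made up for illustration.

```python
# Made-up sample in the assumed layout (id, form, intent, IOB tag per token line).
sample = (
    "# id: 1\n"
    "1\tadd\treminder/set_reminder\tO\n"
    "2\treminder\treminder/set_reminder\tO\n"
    "3\tto\treminder/set_reminder\tO\n"
    "4\tswim\treminder/set_reminder\tO\n"
    "\n"
    "# id: 2\n"
    "1\twill\tweather/find\tO\n"
    "2\tit\tweather/find\tO\n"
    "3\train\tweather/find\tO\n"
)


def read_utterances(text):
    """Return (utterances, intents) from conll-style text (hypothetical helper)."""
    utterances, intents = [], []
    tokens, intent = [], None
    # The extra empty string flushes the last utterance if the text
    # does not end with a blank line.
    for line in text.splitlines() + [""]:
        line = line.strip()
        if line.startswith("#"):
            continue  # skip comment/metadata lines
        if not line:
            if tokens:
                utterances.append(" ".join(tokens))
                intents.append(intent)
            tokens, intent = [], None
            continue
        columns = line.split("\t")
        tokens.append(columns[1])   # word form
        if intent is None:
            intent = columns[2]     # intent is repeated on every token
    return utterances, intents


x_demo, y_demo = read_utterances(sample)
print(x_demo)
print(y_demo)
```

If the real files deviate from this layout, the column indices are the only part that should need adjusting.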

### 4. Explore your features. (11 points)

In this part of the assignment, we work with count features only. You need to convert the features into sparse vectors. You may use tools from Scikit-learn and/or Pandas to do this. Scikit-learn in particular has very handy tools for the vectorization of categorical features, for example [CountVectorizer()](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).

After converting your features into sparse vectors, answer the following questions: 

a) How many features are there?

b) What are the most common features?

**Enter and run your code below. You may insert additional code boxes and text boxes for comments.**

In [None]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['X'])

In [None]:
X.shape

(37163, 12971)

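There are 12971 features. Note that `CountVectorizer` lowercases the text and, with its default `token_pattern`, keeps only tokens of two or more alphanumeric characters, treating punctuation as a separator. Tokens such as 'a' or ',' are therefore excluded.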
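The effect of the default tokenization can be seen on a single made-up sentence. The sketch below compares the default `CountVectorizer` with one whose `token_pattern` also accepts one-character word tokens; the alternative pattern is an assumption chosen for illustration, not something the assignment prescribes.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Add a reminder to swim at 11am , please"]

# Default settings: lowercase, tokens of at least two word characters.
default_vec = CountVectorizer()
default_vec.fit(docs)

# Illustrative alternative: also keep one-character word tokens such as "a".
single_char_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
single_char_vec.fit(docs)

print(sorted(default_vec.vocabulary_))
print(sorted(single_char_vec.vocabulary_))
```

Punctuation is dropped in both cases because `\w` does not match it.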

In [None]:
X.sum(axis=0)

matrix([[257,   1,   1, ...,   4,   7,   1]])

In [None]:
vectorizer.get_feature_names_out()[X.sum(axis=0).argmax()]

'the'

The most common feature is the word 'the'.

### 5. Using Scikit-learn fit either a Decision Tree model or a Multinomial Naive Bayes model. (3 points)

**Enter and run your code below.**

In [None]:
le = preprocessing.LabelEncoder()

In [None]:
le.fit(list(set(df['y'])))

LabelEncoder()

In [None]:
y = le.transform(df['y'])

In [None]:
y.shape

(37163,)

In [None]:
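# The task asks for a Decision Tree or Multinomial Naive Bayes; MultinomialNB suits
# count features and accepts the sparse matrix directly, so no dense copy is needed.
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X, y)

MultinomialNB()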
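For reference, here is a minimal end-to-end sketch of this step on a toy corpus: encode the labels, fit a Multinomial Naive Bayes model on the sparse count matrix, and map a prediction back to its intent name. The sentences, the two intent names, and the `toy_` variables are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

toy_texts = [
    "set an alarm for six",
    "wake me up at seven",
    "will it rain today",
    "is it sunny tomorrow",
]
toy_intents = ["alarm", "alarm", "weather", "weather"]

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_texts)  # sparse count matrix

toy_le = LabelEncoder()
toy_y = toy_le.fit_transform(toy_intents)        # string labels -> integers

toy_clf = MultinomialNB()
toy_clf.fit(toy_X, toy_y)                        # sparse input is accepted as-is

# New text must go through the already fitted vectorizer.
pred = toy_clf.predict(toy_vectorizer.transform(["will it rain tomorrow"]))
print(toy_le.inverse_transform(pred))            # integers -> intent names
```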

### 6. Using [Scikit-learn.metrics](https://scikit-learn.org/stable/modules/model_evaluation.html), evaluate your model on the development (also known as validation) set. (9 points)

In order to evaluate your model you need to use the development set. Keep in mind that you need to pre-process the dev set in the same way that you pre-process your train set. 

Answer the following questions:

a) What is the accuracy?

b) What are the precision, recall and f1?

c) How do your results compare with those reported in the paper?

**Enter and run your code below. You may insert additional code boxes and text boxes for comments.**

In [None]:
y_pred = classifier.predict(X.toarray())

In [None]:
y_pred.shape

(37163,)

In [None]:
cm = confusion_matrix(y, y_pred)
print(cm)

In [None]:
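# `plot_confusion_matrix` is deprecated since scikit-learn 1.0 and removed in 1.2;
# `ConfusionMatrixDisplay.from_estimator` is its replacement.
# Label the axes with the intent names stored in the fitted LabelEncoder.
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(classifier, X.toarray(), y, display_labels=le.classes_, cmap=plt.cm.Blues)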

In [None]:
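# A confusion matrix is a plain NumPy array and has no `accuracy` attribute.
# Accuracy is the sum of the diagonal (correct predictions) over the total count.
cm.trace() / cm.sum()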
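The cells above predict on `X`, which is the training matrix, whereas the task asks for scores on the development set. A minimal sketch of that evaluation on toy data follows; the sentences, labels, and variable names are made up, and macro averaging is one possible choice rather than a requirement of the assignment. The key point is that the dev set goes through `transform` of the vectorizer fitted on the training data, never `fit_transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "set an alarm for six",
    "wake me up at seven",
    "will it rain today",
    "is it sunny tomorrow",
]
train_labels = ["alarm", "alarm", "weather", "weather"]
dev_texts = ["set an alarm for seven", "will it be sunny today"]
dev_labels = ["alarm", "weather"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_texts)  # learn the vocabulary on train only
X_dev = vec.transform(dev_texts)          # reuse that vocabulary for dev

clf = MultinomialNB().fit(X_train, train_labels)
dev_pred = clf.predict(X_dev)

accuracy = accuracy_score(dev_labels, dev_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    dev_labels, dev_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With many intent classes, `sklearn.metrics.classification_report` gives the same numbers broken down per class.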

### 7. Bonus - Improve the results obtained in step 6. (5 points)

Some options that you may explore are to:

- Target the data (slice or subset according to some criterion).
- Target the pre-processing.
- Consider different features.

Then explain why your adjustments produced improved results.

**Enter and run your code below. You may insert additional code boxes and text boxes for comments and write-up.**
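As one illustration of "consider different features", the sketch below compares a unigram model with a unigram-plus-bigram model inside a scikit-learn pipeline. The toy sentences are made up, and whether bigrams actually help on the real data is an open question that has to be checked on the development set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "set an alarm for six",
    "wake me up at seven",
    "will it rain today",
    "is it sunny tomorrow",
]
labels = ["alarm", "alarm", "weather", "weather"]

# Baseline: single words only.
unigram_model = make_pipeline(CountVectorizer(), MultinomialNB())
# Variant: single words plus adjacent word pairs.
bigram_model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())

for model in (unigram_model, bigram_model):
    model.fit(texts, labels)

# make_pipeline names each step after its lowercased class name.
n_unigram = len(unigram_model.named_steps["countvectorizer"].vocabulary_)
n_bigram = len(bigram_model.named_steps["countvectorizer"].vocabulary_)
print(n_unigram, n_bigram)
```

Keeping the vectorizer and classifier in one pipeline means the dev set can be scored with `model.predict(dev_texts)` on raw strings, which removes the risk of refitting the vocabulary by accident.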

### Submission

Submit this notebook with all your code in Canvas. 