<a href="https://colab.research.google.com/github/simulate111/Introduction-to-Human-Language-Technology/blob/main/course_project_Reza2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to HLT Project (Template)

- Student(s) Name(s): Mohammadreza Akhtari
- Date: April 2024
- Chosen Corpus: IMDb Large Movie Review Dataset (`stanfordnlp/imdb`)
- Contributions (if group project): -

### Corpus information

- Description of the chosen corpus: The Large Movie Review Dataset is a corpus for binary sentiment classification. It contains 25,000 highly polar movie reviews for training and 25,000 for testing. The version loaded here also includes an `unsupervised` split of 50,000 reviews.
- Paper(s) and other published materials related to the corpus: Maas, Andrew, et al. "Learning Word Vectors for Sentiment Analysis." Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 2011.
- State-of-the-art performance (best published results) on this corpus:

---

## 1. Setup

In [None]:
# Your code to install and import libraries etc. here
!pip install --quiet datasets transformers[torch] evaluate optuna plotly
import datasets
from pprint import pprint  # pprint = pretty-print


---

## 2. Data download and preprocessing

### 2.1. Download the corpus

In [13]:
# Your code to download the corpus here
dataset = datasets.load_dataset("stanfordnlp/imdb")
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [37]:
dataset['train'][0:2]

{'text': ["Before watching this film I had very low expectations and went to just see the cars. Eventually I even regretted going for that reason. Plot is almost non-existent. Character development is non-existent. So many clichés and so much jaw-dropping cheesiness existed in the movie that I could only stare and wonder how it was even released. If not for the exotics, I wouldn't have even rated this movie a 1. An attempt at a coherent story line is destroyed by the sheer absurdity of this elite racing cult and the laughable characters that make up its members. In fact, the movie's plot is so predictable and simple-minded that an average child could foretell the majority of the storyline. Bad acting, bad plot, bad jokes, bad movie.<br /><br />Don't see it. Play Gran Turismo HD instead and it'll satiate your thirst for fast sexy cars without leaving a bad aftertaste.",
  "I had never heard of this flick despite the connection to George Clooney (whose company produced and he appears in 

### 2.2. Preprocessing

In [15]:
# Your code for any necessary preprocessing here
# Shuffling is rarely a bad idea: datasets may be stored in some order, which is not what we want
dataset=dataset.shuffle()
# Delete the unsupervised part of the dataset; we do not need it for anything
del dataset["unsupervised"]
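The corpus ships with only a training and a test split, while Section 3.1 asks for evaluation on a validation set. One possible way to hold out part of the training data is sketched below in plain Python. The helper `split_indices`, the 20% validation fraction, and the seed are all assumptions made for illustration, not choices taken from this notebook. The `datasets` library also provides `Dataset.train_test_split`, which may be the more natural route here.

```python
import random

def split_indices(n_examples, validation_fraction=0.2, seed=42):
    """Return (train_indices, validation_indices) after a seeded shuffle.

    Hypothetical helper: the fraction and the seed are illustrative defaults.
    """
    indices = list(range(n_examples))
    # A dedicated Random instance keeps the split reproducible
    random.Random(seed).shuffle(indices)
    n_val = int(n_examples * validation_fraction)
    return indices[n_val:], indices[:n_val]

# 25,000 is the size of the training split reported by the loader above
train_idx, val_idx = split_indices(25000)
```

The resulting index lists could then be passed to something like `dataset["train"].select(...)` to build the two subsets.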

In [18]:
import sklearn.feature_extraction
vectorizer=sklearn.feature_extraction.text.CountVectorizer(binary=True, max_features=25000)

#get a list of all texts from the training data
texts=[ex["text"] for ex in dataset["train"]]

#"Trains" the vectorizer, i.e. builds its vocabulary
vectorizer.fit(texts)
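To make concrete what fitting a binary bag-of-words vectorizer involves, here is a minimal plain-Python sketch. It is a simplified stand-in, not scikit-learn's implementation. The helper names `tokenize`, `build_vocabulary`, and `binary_features` are hypothetical, and the tokenization rule (lowercased runs of two or more word characters) is an assumption meant to approximate the vectorizer's default behaviour.

```python
import re
from collections import Counter

# Assumption: a token is a run of two or more word characters
TOKEN_RE = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

def build_vocabulary(texts, max_features=None):
    """Map each kept word to a column index, keeping the most frequent words."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # Rank by descending frequency, breaking ties alphabetically
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    kept = [word for word, _ in ranked[:max_features]]
    # Column indices follow the alphabetical order of the kept words
    return {word: idx for idx, word in enumerate(sorted(kept))}

def binary_features(text, vocabulary):
    """Return the sorted indices of vocabulary words present in the text."""
    # A set records presence only, so repeated words count once
    return sorted({vocabulary[tok] for tok in tokenize(text) if tok in vocabulary})

toy_texts = ["Bad plot, bad acting", "Great acting and a great plot"]
toy_vocab = build_vocabulary(toy_texts, max_features=4)
toy_ids = binary_features("bad bad plot", toy_vocab)
```

In this toy example the least frequent word is dropped by the `max_features` cap, and the repeated "bad" contributes a single index, which is the effect of `binary=True`.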

In [21]:
def vectorize_example(ex):
    #because the vectorizer expects a list/iterable over inputs, not one input
    vectorized=vectorizer.transform([ex["text"]])
    #.nonzero gives a pair of (rows,columns), we want the columns
    non_zero_features=vectorized.nonzero()[1]
    #feature index 0 will have a special meaning
    # so let us produce it by adding +1 to everything
    non_zero_features+=1
    return {"input_ids":non_zero_features}

vectorized=vectorize_example(dataset["train"][0])
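The +1 shift above frees index 0 for a special role. The notebook does not say what that role is. A common choice, assumed here, is to use 0 as a padding value so that index lists of different lengths can be stacked into one rectangular batch. The helper `pad_batch` below is hypothetical and only illustrates that idea.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every list of feature indices to the length of the longest one.

    Assumption: index 0 is reserved for padding, which is why real
    feature indices were shifted to start at 1.
    """
    max_len = max(len(ids) for ids in batch)
    return [list(ids) + [pad_id] * (max_len - len(ids)) for ids in batch]

# Three toy "documents" with different numbers of active features
toy_batch = [[3, 7, 12], [5], [2, 9]]
padded = pad_batch(toy_batch)
```

Because no real feature can have index 0 after the shift, a model can safely ignore padded positions.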

In [22]:
# We can map back to vocabulary and check that everything works
# vectorizer.vocabulary_ is a dictionary {key:word, value:idx}

idx2word=dict((i,w) for (w,i) in vectorizer.vocabulary_.items()) # invert the vocab dictionary
words=[]
for idx in vectorized["input_ids"]:
    # It is easy to forget that we shifted all indices by +1
    words.append(idx2word[idx-1])
#This is now the bag of words representation of the document
pprint(", ".join(words))

('absurdity, acting, aftertaste, almost, an, and, at, attempt, average, bad, '
 'before, br, by, cars, character, characters, cheesiness, child, clichés, '
 'coherent, could, cult, destroyed, development, don, dropping, elite, even, '
 'eventually, existed, existent, expectations, fact, fast, film, for, going, '
 'had, have, hd, how, if, in, instead, is, it, its, jaw, jokes, just, '
 'laughable, leaving, line, ll, low, majority, make, many, members, minded, '
 'movie, much, non, not, of, only, play, plot, predictable, racing, rated, '
 'reason, regretted, released, see, sexy, sheer, simple, so, stare, story, '
 'storyline, that, the, thirst, this, to, up, very, was, watching, went, '
 'without, wonder, wouldn, your')


In [38]:
# Apply the vectorizer to the whole dataset using .map()
dataset_tokenized = dataset.map(vectorize_example,num_proc=4)
pprint(dataset_tokenized["train"][0])

{'input_ids': [374,
               484,
               698,
               905,
               1035,
               1066,
               1566,
               1607,
               1722,
               1830,
               2157,
               2840,
               3258,
               3529,
               3779,
               3789,
               3879,
               3934,
               4215,
               4358,
               5123,
               5455,
               6121,
               6177,
               6695,
               6913,
               7295,
               7771,
               7779,
               7918,
               7920,
               7941,
               8099,
               8227,
               8451,
               8767,
               9567,
               10018,
               10262,
               10294,
               10824,
               11010,
               11216,
               11565,
               11851,
               11881,
               11889,
       

---

## 3. Machine learning model

### 3.1. Model training

In [5]:
# Your code to train the machine learning model on the training set and evaluate the performance on the validation set here

### 3.2. Hyperparameter optimization

In [6]:
# Your code for hyperparameter optimization here

### 3.3. Evaluation on test set

In [7]:
# Your code to evaluate the final model on the test set here

---

## 4. Results and summary

### 4.1. Corpus insights

(Briefly discuss what you learned about the corpus and its annotation)

### 4.2. Results

(Briefly summarize your results)

### 4.3. Relation to state of the art

(Compare your results to the state-of-the-art performance)

---

## 5. Bonus Task (optional)

### 5.1. Annotating out-of-domain documents

(Briefly describe the chosen out-of-domain documents)

(Briefly describe the process of annotation)

### 5.2. Conversion into dataset

In [8]:
# Your code to convert the annotations into a dataset here

### 5.3. Model evaluation on out-of-domain test set

In [9]:
# Your code to evaluate the model on the out-of-domain test set here

### 5.4. Bonus task results

(Present the results of the evaluation on the out-of-domain test set)

### 5.5. Annotated data

In [10]:
# Include your annotated out-of-domain data here