# Lab 4d: Let's start working on our Bag of Words (BoW) implementation
Ultimately, we will build a system that can classify text as positive or negative in tone, a task called [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis). The objective of `lab-4d` is to familiarize students with text documents and a simple [natural language processing (NLP)](https://en.wikipedia.org/wiki/Natural_language_processing) model, the [bag of words model](https://en.wikipedia.org/wiki/Bag-of-words_model).

* We'll use the [Cornell movie review v2.0 data set](http://www.cs.cornell.edu/people/pabo/movie-review-data) as our corpus. This data set was introduced and analyzed in [Pang/Lee ACL 2004](https://aclanthology.org/P04-1035/). It contains 1000 positive and 1000 negative movie reviews in free(ish) text.

## Setup
We set up the computational environment by including [the `Include.jl` file](Include.jl) using [the `include(...)` method](https://docs.julialang.org/en/v1/base/base/#Base.include). The [`Include.jl` file](Include.jl) loads the external packages and functions we will use in these examples.
* For additional information on functions and types used in this example, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/). 

In [3]:
include("Include.jl");

## Prerequisite 
Break up into teams of 2-3 people and take `5 min` to walk through all the files in `Lab-4d`, starting with [`Include.jl` in the `root` directory](Include.jl). At the end of `5 min`, we'll do a class Q&A to ensure everyone understands the purpose of each file.

## Task 1: Load the positive and negative movie review datasets
In this task, we will load the positive and negative movie review datasets by parsing each review text file.

In [6]:
list_of_positive_review_files = readdir(_PATH_TO_POSITIVE_REVIEWS); # what is happening here?
postive_review_documents = readfiles(list_of_positive_review_files, 
    base = _PATH_TO_POSITIVE_REVIEWS, delim = " ");

Next, do the same thing for the negative reviews.

In [8]:
list_of_negative_review_files = readdir(_PATH_TO_NEGATIVE_REVIEWS);
negative_review_documents = readfiles(list_of_negative_review_files, 
    base = _PATH_TO_NEGATIVE_REVIEWS, delim = " ");
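
Both cells follow the same pattern: `readdir` returns the file names (not full paths) in a directory, and each file is then read and split into tokens on the delimiter. The lab's `readfiles(...)` implementation is not shown in this notebook, so the following is only a minimal sketch of that pattern. The helper name `read_tokens_sketch`, its return type, and the temporary demo directory are assumptions made for illustration.

```julia
# Hypothetical sketch: NOT the lab's readfiles(...) implementation.
# Reads each named file under `base` and splits its text into tokens on `delim`.
function read_tokens_sketch(filenames::Vector{String}; base::String, delim::String = " ")
    documents = Dict{String, Vector{String}}()
    for name in filenames
        # readdir returns names only, so join each name with the base path
        text = read(joinpath(base, name), String)
        tokens = split(text, delim; keepempty = false)
        documents[name] = String.(tokens)
    end
    return documents
end

# Usage on a throwaway directory holding two tiny made-up "reviews"
demo_dir = mktempdir()
write(joinpath(demo_dir, "a.txt"), "a great film")
write(joinpath(demo_dir, "b.txt"), "a dull plot")
demo_docs = read_tokens_sketch(readdir(demo_dir); base = demo_dir)
```

In this sketch the dictionary is keyed by file name; whether the lab's function does the same is something to check in the source files.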

## Task 2: Let's build a movie review corpus model
In this task, we create [a `MyMoviewReviewDocumentCorpusModel` instance](src/Types.jl) that holds information about the _entire collection_ of positive and negative reviews.

In [10]:
corpus = let
    allmoviewreviews = Dict{Int64, MyMoviewReviewRecordModel}();
    counter = 1;
    
    for (k,v) ∈ postive_review_documents
        allmoviewreviews[counter] = v;
        counter += 1
    end

    for (k,v) ∈ negative_review_documents
        allmoviewreviews[counter] = v;
        counter += 1
    end

    build(MyMoviewReviewDocumentCorpusModel, allmoviewreviews)
end;

#### What's in the corpus model?
Let's check out what's in the corpus model. The `records::Dict{Int64, MyMoviewReviewRecordModel}` dictionary holds the records, i.e., [`MyMoviewReviewRecordModel` instances](src/Types.jl). The key of the `records` dictionary is a file index, while the value is the `MyMoviewReviewRecordModel` model instance.
* __Hmmm__: Is there a better way to link the files to the corresponding record than using a file index? Let's think about implementing an updated version of the records dictionary. What is involved here (simple fix)?
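
One possible direction for the question above, offered as a sketch rather than the lab's solution: key the dictionary by label and file name instead of a running counter, so each record can be traced back to its source file. The `ToyRecord` type and the file names below are hypothetical stand-ins for `MyMoviewReviewRecordModel` and the real review files, and the sketch assumes the documents are available as `file name => record` pairs.

```julia
# Hypothetical stand-in for the lab's record type
struct ToyRecord
    fields::Vector{String}
end

# Toy inputs: file name => record (the names are made up for illustration)
toy_positive = Dict("pos_001.txt" => ToyRecord(["good", "film"]))
toy_negative = Dict("neg_001.txt" => ToyRecord(["bad", "film"]))

# Key by (label, file name) so both the class and the source file are recoverable
records_by_file = Dict{Tuple{Symbol, String}, ToyRecord}()
for (name, record) in toy_positive
    records_by_file[(:positive, name)] = record
end
for (name, record) in toy_negative
    records_by_file[(:negative, name)] = record
end
```

A lookup such as `records_by_file[(:positive, "pos_001.txt")]` then returns the record for that specific file.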

In [12]:
corpus.records

Dict{Int64, MyMoviewReviewRecordModel} with 2002 entries:
  1144 => MyMoviewReviewRecordModel(["the", "first", "species", "was", "a", "mo…
  1175 => MyMoviewReviewRecordModel(["five", "years", "after", "his", "director…
  1953 => MyMoviewReviewRecordModel(["the", "44", "caliber", "killer", "has", "…
  719  => MyMoviewReviewRecordModel(["good", "films", "are", "hard", "to", "fin…
  1546 => MyMoviewReviewRecordModel(["i'll", "bet", "right", "now", "you're", "…
  1703 => MyMoviewReviewRecordModel(["the", "swooping", "shots", "across", "dar…
  1956 => MyMoviewReviewRecordModel(["i", "think", "maybe", "it's", "time", "fo…
  1028 => MyMoviewReviewRecordModel(["mulholland", "drive", "did", "very", "wel…
  699  => MyMoviewReviewRecordModel(["through", "a", "spyglass", "i", "could", …
  831  => MyMoviewReviewRecordModel(["i", "know", "that", "funnest", "isn't", "…
  1299 => MyMoviewReviewRecordModel(["it", "would", "be", "hard", "to", "choose…
  1438 => MyMoviewReviewRecordModel(["woof", "too",

We can access the vocabulary dictionary (the mapping between `token => index`) in the `vocabulary::Dict{String, Int64}` field of the `corpus::MyMoviewReviewDocumentCorpusModel` instance.
* __Hmmmmm__: There seem to be strange characters in the tokens. We need to update our logic to build the token collection. Update the implementation of [the `_deepclean(...)` function](src/Factory.jl), which removes forbidden characters. Let's do this as a class.

In [14]:
## look at the vocab - have strange chars?
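
As a starting point for this cell, here is a minimal sketch of one way to spot suspicious tokens. It uses a toy dictionary in place of `corpus.vocabulary`, and the rule for what counts as "strange" (anything other than lowercase letters, digits, or an apostrophe) is an assumption, not the rule implemented by `_deepclean(...)`.

```julia
# Toy vocabulary standing in for corpus.vocabulary (token => index)
toy_vocab = Dict("film" => 1, "good" => 2, "(bad)" => 3, "plot," => 4, "it's" => 5)

# Assumed rule: flag tokens containing anything other than a-z, 0-9, or an apostrophe
is_suspicious(token) = occursin(r"[^a-z0-9']", token)

# Collect and sort the flagged tokens so they are easy to scan
suspicious_tokens = sort([t for t in keys(toy_vocab) if is_suspicious(t)])
```

Running the same filter over `keys(corpus.vocabulary)` would list the tokens that the cleaning logic still needs to handle.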

## Task 3: Rethink the hashing vector formulation
In this task, we're going to do some next-gen foo with [the `hashing(...)` function](src/Compute.jl); however, we need to have an `<OOV>` token in our vocabulary to make this happen. 
* __Hmmmmm__: Can you confirm that we have an `<OOV>` token in the corpus, and if not, add it? Wait a minute, do we already add this token to the records?

In [16]:
# fill in a check for the <OOV> token
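
A minimal sketch of such a check, using a toy dictionary in place of `corpus.vocabulary`. Assigning `<OOV>` the next free index is an assumption; the lab's code may place it elsewhere.

```julia
# Toy vocabulary standing in for corpus.vocabulary (token => index)
oov_demo_vocab = Dict("film" => 1, "good" => 2, "bad" => 3)

# Add "<OOV>" with the next free index only if it is missing
if !haskey(oov_demo_vocab, "<OOV>")
    oov_demo_vocab["<OOV>"] = length(oov_demo_vocab) + 1
end
```

Because of the `haskey` guard, running the cell twice leaves the vocabulary unchanged the second time.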

We want [a `hashing(...)` function](src/Compute.jl) implementation that doesn't wrap back on itself for indexes larger than `size`, while simultaneously handling the `0-based` index error, i.e., the error generated when $i\leftarrow\texttt{mod}(\text{h},\text{size})$ returns $i = 0$ for the case of $\text{h} = \text{size}$, which is not a valid index into a 1-based Julia array. Also, what happens if we get a new review that uses words we don't have in the vocabulary?
* __Hmmmmmm__: Let's revisit the implementation of [the `hashing(...)` function](src/Compute.jl) and try to address each of these issues. What kind of test case could we write to make sure this is working as it should?
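
To make the discussion concrete, here is a minimal sketch of one way to address all three issues at once. It is not the lab's `hashing(...)` implementation: the name `hashing_sketch` and the choice to reserve the last slot of the vector for `<OOV>` counts are assumptions made for illustration.

```julia
# Hypothetical sketch: NOT the lab's hashing(...) implementation.
# Builds a count vector of length `size`. Words missing from `hash`, or mapped
# outside 1:size, are counted in the last slot (assumed reserved for <OOV>).
function hashing_sketch(words::Vector{String}; hash::Dict{String, Int64}, size::Int64)
    fv = zeros(Int64, size)
    for word in words
        i = get(hash, word, size)          # unknown word => <OOV> slot
        i = (1 <= i <= size) ? i : size    # no wrap-around, and never index 0
        fv[i] += 1
    end
    return fv
end

# Small test case: "zzz" is not in the toy vocabulary
toy_hash = Dict("good" => 1, "film" => 2, "bad" => 3)
fv_demo = hashing_sketch(["good", "film", "good", "zzz"];
    hash = toy_hash, size = length(toy_hash) + 1)
```

A useful property to test is that the counts sum to the number of input words, so no word is silently dropped. Julia's built-in `mod1` would also avoid the zero index, but it still wraps large indexes back into `1:size`, which is the behavior we are trying to remove.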

In [31]:
let
    test_record = corpus.records[1]; # pick a record (here, the one with key 1)
    vocabulary = test_record.vocabulary;
    inverse = test_record.inverse;
    words = test_record.fields;

    # compute a feature vector (fv)
    fv = hashing(words, hash = vocabulary, size = (length(vocabulary) + 1))
end

599-element Vector{Int64}:
 5
 1
 1
 1
 1
 1
 2
 1
 1
 1
 1
 1
 1
 ⋮
 1
 3
 1
 1
 1
 1
 1
 3
 1
 1
 1
 0