# Lab 4d: Let's start working on our Bag of Words (BoW) implementation
Ultimately, we will build a system that classifies text as positive or negative in tone, a task called [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis). The objective of `lab-4d` is to familiarize students with text documents and with a simple [natural language processing (NLP)](https://en.wikipedia.org/wiki/Natural_language_processing) model, the [bag of words model](https://en.wikipedia.org/wiki/Bag-of-words_model).

* We'll use the [Cornell movie review v2.0 data set](http://www.cs.cornell.edu/people/pabo/movie-review-data) as our corpus. This data set was introduced and analyzed in [Pang/Lee ACL 2004](https://aclanthology.org/P04-1035/). It contains 1000 positive and 1000 negative movie reviews in free(ish) text.
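
To make the bag of words idea concrete before touching the data, here is a minimal sketch. The `bagofwords` function and the two toy sentences are made up for illustration and are not part of the lab code. The point is that a bag of words keeps word counts and discards word order.

```julia
# Toy illustration of a bag of words: word order is discarded, only counts remain.
# The two sentences are made-up examples, not taken from the review data set.
function bagofwords(tokens::Vector{String})::Dict{String,Int}
    counts = Dict{String,Int}()
    for token ∈ tokens
        counts[token] = get(counts, token, 0) + 1 # start at 0 for unseen words
    end
    return counts
end

first_bag = bagofwords(String.(split("the movie was good good fun")))
second_bag = bagofwords(String.(split("fun the good movie was good")))
first_bag == second_bag # same words in a different order give equal bags
```

Both sentences contain the same words with the same multiplicities, so the two dictionaries compare as equal even though the sentences read differently.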

## Setup
We set up the computational environment by including [the `Include.jl` file](Include.jl) using [the `include(...)` function](https://docs.julialang.org/en/v1/base/base/#Base.include). The [`Include.jl` file](Include.jl) loads the external packages and functions we will use in these examples. 
* For additional information on functions and types used in this example, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/). 

In [3]:
include("Include.jl");

## Prerequisite 
Break up into teams of 2-3 people and take `5 min` to walk through all the files in `Lab-4d`, starting with [`Include.jl` in the `root` directory](Include.jl). At the end of `5 min`, we'll do a class Q&A to ensure everyone understands the purpose of each file.

## Task 1: Load the positive and negative movie review datasets
In this task, we will load the positive and negative movie review datasets by parsing each review text file.

In [6]:
list_of_positive_review_files = readdir(_PATH_TO_POSITIVE_REVIEWS); # what is happening here?
postive_review_documents = readfiles(list_of_positive_review_files, 
    base = _PATH_TO_POSITIVE_REVIEWS, delim = " ");
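
As a hint for the question in the comment above, the sketch below shows what `readdir` returns on a throwaway directory. The temporary directory and the two file names are placeholders, not the actual review files. By default `readdir` returns only file names, which is presumably why `readfiles` also takes a `base` path.

```julia
# Sketch: `readdir` lists the entries of a directory, sorted by default.
# The temporary directory and file names below are placeholders for illustration.
listing = mktempdir() do dir
    for name ∈ ["cv001.txt", "cv000.txt"]
        write(joinpath(dir, name), "a placeholder review")
    end
    names = readdir(dir)              # file names only
    paths = readdir(dir; join = true) # names joined with the directory path
    (names = names, paths = paths, base = dir)
end

listing.names # file names in sorted order, without the directory prefix
```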

Next, do the same thing for the negative reviews.

In [8]:
list_of_negative_review_files = readdir(_PATH_TO_NEGATIVE_REVIEWS);
negative_review_documents = readfiles(list_of_negative_review_files, 
    base = _PATH_TO_NEGATIVE_REVIEWS, delim = " ");
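
The `readfiles` function lives in the lab's source files, so its details are not shown here. As a rough idea of what parsing a review into tokens can involve, here is a hypothetical `tokenize` helper. It assumes tokens are lowercased and stripped of punctuation, which is what the token lists printed later in this lab suggest, but it is not the lab's implementation.

```julia
# Hypothetical helper, NOT the lab's `readfiles`: turn raw review text into tokens.
# Assumption: tokens are lowercase with punctuation removed.
function tokenize(text::String; delim::String = " ")::Vector{String}
    tokens = String[]
    for piece ∈ split(lowercase(text), delim; keepempty = false)
        # keep only letters and digits, so "i'll" becomes "ill"
        cleaned = filter(c -> isletter(c) || isdigit(c), piece)
        isempty(cleaned) || push!(tokens, cleaned)
    end
    return tokens
end

toy_tokens = tokenize("I'll bet right now you're reading this!")
```

Splitting on a single space mirrors the `delim = " "` argument used above. A real parser would also need to decide how to treat newlines and tabs.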

## Task 2: Let's build a movie review corpus model
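In this task, we merge the positive and negative review records into a single dictionary keyed by a running integer index, and then pass that dictionary to the `build(...)` method to construct a `MyMoviewReviewDocumentCorpusModel` instance.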

In [10]:
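corpus = let
    # all reviews, keyed by a running integer index
    allmoviewreviews = Dict{Int64, MyMoviewReviewRecordModel}();
    counter = 1;
    
    # add the positive reviews first ...
    for (k,v) ∈ postive_review_documents
        allmoviewreviews[counter] = v;
        counter += 1
    end

    # ... then the negative reviews, continuing the same index
    for (k,v) ∈ negative_review_documents
        allmoviewreviews[counter] = v;
        counter += 1
    end

    # build the corpus model from the merged records
    build(MyMoviewReviewDocumentCorpusModel, allmoviewreviews)
end;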
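
The `build(...)` method is defined in the lab's source files. To give a feel for what building a corpus model might involve, here is a hypothetical `buildvocabulary` sketch that collects the unique words across a set of token lists and assigns each an integer index. Sorting the words alphabetically is an assumption, based on the `corpus.inverse` entries shown in this lab appearing to be in alphabetical order. The toy records are made up.

```julia
# Hypothetical sketch, NOT the lab's `build(...)` method: derive a word => index
# map and its inverse from a dictionary of token lists.
function buildvocabulary(records::Dict{Int,Vector{String}})
    words = Set{String}()
    for (_, tokens) ∈ records
        union!(words, tokens) # collect the unique words across all records
    end
    vocabulary = Dict{String,Int}() # word => index
    inverse = Dict{Int,String}()    # index => word
    for (i, word) ∈ enumerate(sort(collect(words))) # assumed alphabetical order
        vocabulary[word] = i
        inverse[i] = word
    end
    return vocabulary, inverse
end

toy_records = Dict(1 => ["good", "movie"], 2 => ["bad", "movie"])
toy_vocabulary, toy_inverse = buildvocabulary(toy_records)
```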

Let's check out what's in the corpus model.

In [12]:
corpus.records

Dict{Int64, MyMoviewReviewRecordModel} with 2002 entries:
  1144 => MyMoviewReviewRecordModel(["the", "first", "species", "was", "a", "mo…
  1175 => MyMoviewReviewRecordModel(["five", "years", "after", "his", "director…
  1953 => MyMoviewReviewRecordModel(["the", "44", "caliber", "killer", "has", "…
  719  => MyMoviewReviewRecordModel(["good", "films", "are", "hard", "to", "fin…
  1546 => MyMoviewReviewRecordModel(["ill", "bet", "right", "now", "youre", "ju…
  1703 => MyMoviewReviewRecordModel(["the", "swooping", "shots", "across", "dar…
  1956 => MyMoviewReviewRecordModel(["i", "think", "maybe", "its", "time", "for…
  1028 => MyMoviewReviewRecordModel(["mulholland", "drive", "did", "very", "wel…
  699  => MyMoviewReviewRecordModel(["through", "a", "spyglass", "i", "could", …
  831  => MyMoviewReviewRecordModel(["i", "know", "that", "funnest", "isnt", "a…
  1299 => MyMoviewReviewRecordModel(["it", "would", "be", "hard", "to", "choose…
  1438 => MyMoviewReviewRecordModel(["woof", "too",

The corpus model also stores an `inverse` dictionary, which maps an integer index to a word.

In [13]:
corpus.inverse

Dict{Int64, String} with 47582 entries:
  45120 => "vapor"
  1703  => "alliance"
  37100 => "seductive"
  3406  => "backer"
  28804 => "noseworthy"
  40691 => "stuckup"
  11251 => "devine"
  3220  => "automatically"
  422   => "25"
  46806 => "witness"
  15370 => "fiances"
  4030  => "bedrock"
  8060  => "code"
  3163  => "aurelius"
  22241 => "javier"
  23265 => "knockin"
  35395 => "ridley"
  27851 => "myrlie"
  23690 => "larters"
  44399 => "unexpectedly"
  844   => "abstraction"
  24859 => "lovemaker"
  20571 => "idioplot"
  2920  => "assist"
  2783  => "artistaction"
  ⋮     => ⋮
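
With an index for every word, a review can be encoded as a vector of word counts, which is the bag of words representation. The sketch below assumes a word => index map is available. The `countvector` function and `toy_index` are made up for illustration and are not fields or functions of the corpus model.

```julia
# Sketch: encode one token list as a count vector over a vocabulary.
# `toy_index` is a made-up word => index map; tokens outside it are skipped.
function countvector(tokens::Vector{String}, index::Dict{String,Int})::Vector{Int}
    x = zeros(Int, length(index))
    for token ∈ tokens
        i = get(index, token, 0) # 0 signals a word that is not in the vocabulary
        i == 0 || (x[i] += 1)
    end
    return x
end

toy_index = Dict("bad" => 1, "good" => 2, "movie" => 3)
x = countvector(["good", "good", "movie", "unseen"], toy_index) # [0, 2, 1]
```

Skipping unknown words is one possible design choice. Another would be to reserve an extra slot for out-of-vocabulary tokens.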