# Note: this Java notebook requires Ganymede
* Ganymede (Java kernel for Jupyter): [Installation and documentation](https://github.com/allen-ball/ganymede)
* The cells below also need the Lucene libraries, which are loaded as additional dependencies

In [1]:
%%pom
dependencies:
- org.apache.lucene:lucene-core:9.7.0
- org.apache.lucene:lucene-analysis-common:9.7.0

#### Common imports (Java)

In [20]:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import java.util.Arrays;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

#### Common imports (Lucene)

In [21]:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.TokenStream;  
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.Analyzer.TokenStreamComponents;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.en.KStemFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.analysis.CharArraySet;

## Let's read in the data collection

In [33]:
// Reads the IMDb CSV file and maps the columns we need to short field names.
ArrayList<Map<String, String>> read_collection(String name) throws IOException {
    ArrayList<Map<String, String>> docs = new ArrayList<>();
    // split on commas that are outside double quotes
    String splitter = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // try-with-resources closes the reader even if reading fails
    try (BufferedReader reader = new BufferedReader(new FileReader(name))) {
        String line;
        String[] keys = reader.readLine().split(splitter);

        while ((line = reader.readLine()) != null) {
            // limit -1 keeps trailing empty fields, so values stays aligned with keys
            String[] values = line.split(splitter, -1);
            Map<String, String> dataMap = new HashMap<>();

            for (int i = 0; i < keys.length && i < values.length; i++) {
                switch (keys[i]) {
                    case "Series_Title":
                        dataMap.put("title", values[i]);
                        break;
                    case "Released_Year":
                        dataMap.put("year", values[i]);
                        break;
                    case "Runtime":
                        dataMap.put("runtime", values[i].replace(" min", ""));
                        break;
                    case "Genre":
                        // multi-genre values are quoted, so strip quotes as well as commas
                        dataMap.put("genre", values[i].replace("\"", "").replace(",", ""));
                        break;
                    case "IMDB_Rating":
                        dataMap.put("rating", values[i]);
                        break;
                    case "Overview":
                        dataMap.put("summary", values[i].replace("\"", ""));
                        break;
                    case "Star1":
                        dataMap.put("actors", values[i]);
                        break;
                    case "Star2":
                    case "Star3":
                    case "Star4":
                        // append the remaining stars to the first one
                        dataMap.put("actors", dataMap.get("actors") + " " + values[i]);
                        break;
                }
            }
            docs.add(dataMap);
        }
    }

    // print summary
    System.out.println("Read " + docs.size() + " documents from " + name);
    return docs;
}

var collection = read_collection("datasets/imdb_top_1000.csv");
System.out.println("\nfirst document:");
collection.get(0).forEach((key, value) -> System.out.println(String.format("%10s: %s", key, value)));

Read 1000 documents from datasets/imdb_top_1000.csv

first document:
   summary: Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.
    actors: Tim Robbins Morgan Freeman Bob Gunton William Sadler
      year: 1994
     genre: Drama
    rating: 9.3
   runtime: 142
     title: The Shawshank Redemption
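
The `splitter` pattern in `read_collection` treats a comma as a field separator only when an even number of double quotes follows it, so commas inside a quoted field are left alone. A minimal sketch of that behaviour in isolation; the class name `CsvSplitDemo`, the helper `splitCsvLine`, and the sample row are made up for illustration and are not part of the notebook or the dataset:

```java
import java.util.Arrays;

class CsvSplitDemo {
    // Same pattern as `splitter` in read_collection: a comma is a separator
    // only if the rest of the line contains an even number of double quotes.
    static final String SPLITTER = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // Hypothetical helper: splits one CSV line into its fields.
    static String[] splitCsvLine(String line) {
        // limit -1 keeps trailing empty fields
        return line.split(SPLITTER, -1);
    }

    public static void main(String[] args) {
        // made-up sample row, not taken from the dataset
        String line = "Some Title,1999,\"Crime, Drama\",8.0";
        System.out.println(Arrays.toString(splitCsvLine(line)));
    }
}
```

Running `main` prints `[Some Title, 1999, "Crime, Drama", 8.0]`: the quoted field stays in one piece and keeps its quotes, which is why the quotes have to be stripped afterwards. The pattern does not handle quoted fields that span several lines.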


## Let's start with Lucene's analyzers

In [37]:

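// Prints the tokens an analyzer produces for the given text, separated by spaces.
void print_tokens(Analyzer analyzer, String text) throws IOException {
    // try-with-resources closes the stream, as the TokenStream contract requires
    try (TokenStream ts = analyzer.tokenStream("text", new StringReader(text))) {
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);

        ts.reset();
        while (ts.incrementToken()) {
            System.out.print(termAtt.toString() + " ");
        }
        ts.end();
    }
    System.out.println();
}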

// Custom analyzer: standard tokenizer, possessive removal, length filter, KStem.
class MyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        final Tokenizer source = new StandardTokenizer();
        // strip a trailing 's from tokens
        TokenStream result = new EnglishPossessiveFilter(source);
        // lowercasing is disabled here
        // result = new LowerCaseFilter(result);
        // keep only tokens longer than 3 characters
        result = new FilteringTokenFilter(result) {
            private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
            @Override
            protected boolean accept() throws IOException {
                return termAtt.length() > 3;
            }
        };
        result = new KStemFilter(result);
        return new TokenStreamComponents(source, result);
    }
}

var text = "I think text's values' color goes here; WHAT happens with it? do we see IT again; I went there to be gone with houses";
var stopWords = new CharArraySet(Arrays.asList("i", "do"), false);

System.out.println("             text: "+ text);
System.out.println();

// standard analyzer
System.out.print("         standard: ");
print_tokens(new StandardAnalyzer(), text);

// english analyzer (with porter stemmer)
System.out.print("          english: ");
print_tokens(new EnglishAnalyzer(), text);

// english analyzer (with porter stemmer) and new set of stopwords
System.out.print("english/stopwords: ");
print_tokens(new EnglishAnalyzer(stopWords), text);

// a custom analyzer: no lowercasing, drops tokens of 3 or fewer characters, KStem stemmer
System.out.print("      my analyzer: ");
print_tokens(new MyAnalyzer(), text);

// print standard stop word list
System.out.println("\nenglish stopword list:");
System.out.println(EnglishAnalyzer.getDefaultStopSet());

             text: I think text's values' color goes here; WHAT happens with it? do we see IT again; I went there to be gone with houses

         standard: i think text's values color goes here what happens with it do we see it again i went there to be gone with houses 
          english: i think text valu color goe here what happen do we see again i went gone hous 
english/stopwords: think text valu color goe here what happen with it we see it again went there to be gone with hous 
      my analyzer: think text value color go here WHAT happen with again went there gone with house 

english stopword list:
[but, be, with, such, then, for, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, these, by, to, of]
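
The `english/stopwords` line above still contains `with`, `it`, `there`, `to` and `be`, because a stop set passed to the `EnglishAnalyzer` constructor replaces the default list instead of extending it. The filtering step can be sketched in plain Java as follows. This is only an approximation under stated assumptions: `StopwordDemo` and `tokens` are hypothetical names, the regex split stands in for Lucene's tokenizer, and there is no stemming or possessive handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopwordDemo {
    // Lowercases the text, splits it on non-alphanumeric characters and
    // drops every token that appears in the given stop set.
    static List<String> tokens(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !stopWords.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }
}
```

With the stop set `Set.of("i", "do")`, the call `StopwordDemo.tokens("I do think it works", Set.of("i", "do"))` keeps `think`, `it` and `works`; only the words in the supplied set are removed.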


## Building an index (in memory)
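
Conceptually, an in-memory index maps every term to the documents that contain it (an inverted index). The sketch below shows the idea in plain Java, using only the standard library. It is not Lucene's API: the class `TinyInvertedIndex` and its methods are hypothetical names chosen for illustration, and tokenization is reduced to lowercasing and splitting on non-alphanumeric characters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

class TinyInvertedIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Adds one document and returns the id assigned to it.
    int add(String text) {
        int docId = nextDocId++;
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) continue;
            String term = raw.toLowerCase(Locale.ROOT);
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Returns the ids of all documents containing the term, in ascending order.
    List<Integer> lookup(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? List.of() : new ArrayList<>(ids);
    }

    // Number of documents added so far.
    int size() {
        return nextDocId;
    }
}
```

With the `collection` read earlier, one could call `index.add(doc.get("summary"))` for each document and then query single terms with `index.lookup(...)`. A Lucene index additionally stores positions, term statistics and field information, none of which this sketch attempts.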

