# Exercise 4 - Spell Correction
(8 points)

In this exercise, you should finish the implementation of the `TrigramBasedSpellCorrector` class by implementing a) the creation of a tri-gram matrix from a given text and b) a simple spell correction based on the tri-grams in the matrix and the Levenshtein distance.

Note that code from the other three exercises might be reusable for this exercise. However, please think about which parts you want to reuse. Not all of them might be necessary or helpful.

The tests use texts by [William Shakespeare](https://en.wikipedia.org/wiki/William_Shakespeare) to obtain a larger number of tri-grams. The file `shakespeare.txt` can be downloaded from [http://norvig.com/ngrams/shakespeare.txt](http://norvig.com/ngrams/shakespeare.txt).

#### Matrix Creation

The creation of the matrix is based on a given text. This text should be preprocessed with the same preprocessing rules as in Exercise 1 of this exercise series. After that, the tri-grams need to be extracted and stored in a way that they can be reused for getting word candidates for the spell correction.

##### Hints

* While the bi-gram matrix might work as a dense matrix implementation, the tri-gram matrix should use an implementation that is optimized for sparse matrices. Otherwise, your solution might run into memory issues.
* Make sure that creating your matrix does not take too long. The hidden tests may have to initialize your matrix several times; if this takes more than 1 minute, the complete evaluation of your solution may time out, which would result in 0 points. For reference, reading the content of the `shakespeare.txt` file and generating the matrix typically takes less than 2 seconds.
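
A minimal sketch of one possible sparse layout, assuming nested hash maps keyed by the first and second word. The class name `SparseTrigramIndex` and its `candidates` method are hypothetical and not part of the exercise template, and the preprocessing shown (lower-casing, splitting on non-alphanumeric characters) is only an assumed stand-in for the rules of Exercise 1.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sparse tri-gram store: only observed (w1, w2, w3) combinations use memory. */
public class SparseTrigramIndex {
    // w1 -> w2 -> set of third words seen after (w1, w2)
    private final Map<String, Map<String, Set<String>>> index = new HashMap<>();

    public SparseTrigramIndex(String text) {
        // Assumed preprocessing: lower-case, then split on runs of non-alphanumeric characters.
        String[] words = text.toLowerCase().split("[^a-z0-9]+");
        for (int i = 0; i + 2 < words.length; i++) {
            if (words[i].isEmpty()) {
                continue; // split() yields a leading empty token if the text starts with a separator
            }
            index.computeIfAbsent(words[i], k -> new HashMap<>())
                 .computeIfAbsent(words[i + 1], k -> new HashSet<>())
                 .add(words[i + 2]);
        }
    }

    /** Returns all third words observed after (w1, w2), or an empty set if there are none. */
    public Set<String> candidates(String w1, String w2) {
        return index.getOrDefault(w1, Collections.emptyMap())
                    .getOrDefault(w2, Collections.emptySet());
    }
}
```

Storing the third words in a set means that a tri-gram occurring many times still contributes only one candidate. A single pass over the word array keeps the construction time linear in the length of the text.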

#### Spell Correction

The spell correction should be implemented in the `getCorrection(String word1, String word2, String word3)` method. It receives a tri-gram $(w_1,w_2,w_3)$ as input, where the third word $w_3$ might be misspelled. Its aim is to provide a correct word that fits the first two words of the tri-gram. Internally, it should follow these two steps:
1. Get a list of candidate words $w_c$ from the tri-gram matrix which are known to occur as $(w_1,w_2,w_c)$.
2. From this list, choose the word that has the smallest Levenshtein distance to $w_3$.

The chosen word should be returned as the suggested correction for the third word.

##### Hints

* The given words of the tri-gram may have to be preprocessed to match the preprocessed words from the text. However, you can assume that every given word will still be a single word after applying the preprocessing, i.e., none of the words will contain characters that are not alphanumeric.
* If the list of candidates is empty (because $w_1$, $w_2$, or their combination does not occur in your tri-gram matrix), your solution should return `null`.
* If more than one candidate has the lowest Levenshtein distance, any of them will be accepted.
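
The selection step and the `null` rule can be sketched as follows. This is an illustration under stated assumptions, not the reference solution: the class `ClosestCandidate` and its method names are hypothetical, and the distance is computed with a plain full-table dynamic program rather than a memory-optimized variant.

```java
import java.util.Collection;

/** Hypothetical helper that picks the candidate closest to a possibly misspelled word. */
public class ClosestCandidate {

    /** Levenshtein distance via a full (n+1) x (m+1) table with unit edit costs. */
    public static int levenshtein(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) {
            d[i][0] = i; // delete the first i characters of s
        }
        for (int j = 0; j <= t.length(); j++) {
            d[0][j] = j; // insert the first j characters of t
        }
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[s.length()][t.length()];
    }

    /** Returns the candidate with the smallest distance to word, or null if there are no candidates. */
    public static String closest(Collection<String> candidates, String word) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : candidates) {
            int distance = levenshtein(word, candidate);
            if (distance < bestDistance) { // on a tie, the first candidate encountered is kept
                bestDistance = distance;
                best = candidate;
            }
        }
        return best;
    }
}
```

With the candidates `"of"` and `"in"`, `ClosestCandidate.closest(Arrays.asList("of", "in"), "im")` returns `"in"`, and an empty candidate collection yields `null`. Since any tied candidate is accepted, the tie-breaking order does not matter for the tests.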

#### Example

The following text could be used as the basis for the tri-gram matrix.
``` 
London is the capital and largest city of England. Million people live in London. 
The River Thames is in London. London is the largest city in Western Europe.
```
For the following tri-grams, your implementation of the `getCorrection` method should return the following results:

| Tri-gram | Result | Explanation |
|---|---|---|
| `("largest", "city", "im")` | `"in"` | In the given text, there are two tri-grams starting with `"largest", "city"` leading to the two candidates `"of"` and `"in"`. The latter has the smaller Levenshtein distance. |
| `("largest", "city", "on")` | `"in"` OR `"of"` | The candidates are the same as in the row above, but both have the same Levenshtein distance to the given third word. So both results would be correct. |
| `("London", "is", "teh")` | `"the"` | In the given text, there are two tri-grams starting with `"London", "is"`. However, both have `"the"` as a third word. So it is the only candidate. |
| `("largest", "capital", "in")` | `null` | There are no tri-grams starting with `"largest", "capital"`. |
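
The distances behind the first two rows can be checked by hand. Writing $\operatorname{lev}(a, b)$ for the Levenshtein distance (a notation introduced here for brevity) and assuming unit cost for each insertion, deletion, and substitution:

$$
\begin{aligned}
\operatorname{lev}(\text{im}, \text{in}) &= 1 && (\text{substitute m} \to \text{n}) \\
\operatorname{lev}(\text{im}, \text{of}) &= 2 && (\text{substitute i} \to \text{o and m} \to \text{f}) \\
\operatorname{lev}(\text{on}, \text{in}) &= 1 && (\text{substitute o} \to \text{i}) \\
\operatorname{lev}(\text{on}, \text{of}) &= 1 && (\text{substitute n} \to \text{f})
\end{aligned}
$$

So `"im"` is strictly closer to `"in"`, while `"on"` is equally close to both candidates.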

#### Notes

- Do not add additional external libraries.
- Interface
  - You can use _[TAB]_ for autocompletion and _[SHIFT]_+_[TAB]_ for code inspection.
  - Use _Menu_ -> _View_ -> _Toggle Line Numbers_ for debugging.
  - Check _Menu_ -> _Help_ -> _Keyboard Shortcuts_.
- Finish
  - Save your solution by clicking on the _disk icon_.
  - Finally, choose _Menu_ -> _File_ -> _Close and Halt_.
  - Do not forget to _Submit_ your solution in the _Assignments_ view.

In [1]:
import java.util.stream.Collectors;
/**
 * A simple spell correction approach based on tri-grams and the Levenshtein
 * distance.
 */
public class TrigramBasedSpellCorrector {
    // YOUR CODE HERE
    
    // Sparse tri-gram "matrix": maps the key "w1 w2" to every third word w3 observed after it.
    Map<String, List<String>> triGramsMap = new HashMap<String, List<String>>();

    /**
     * Constructor.
     */
    public TrigramBasedSpellCorrector(String text) {
        create(text);
    }

    /**
     * Internal methods for determining the necessary statistics about the tri-grams
     * of the given text.
     */
    protected void create(String text) {
        // YOUR CODE HERE
        
   /*     String dtext = "London is the capital and largest city of England. Million people " + 
    "live in London. The River Thames is in London. London is the largest city in " +
    "Western Europe. The river styx"; */
        
        // Skip null or empty input.
        if (text != null && !text.isEmpty()) {
            // Preprocessing: lower-case, then split on runs of non-alphanumeric characters.
            String[] words = text.toLowerCase().split("[^a-zA-Z0-9]+");

            // Slide a window of three words over the text and store w3 under the key "w1 w2".
            for (int i = 0; i + 2 < words.length; i++) {
                String key = words[i] + ' ' + words[i + 1];

                List<String> list = triGramsMap.get(key);
                if (list == null) {
                    triGramsMap.put(key, list = new ArrayList<String>());
                }
                list.add(words[i + 2]);
            }
        }
    }

    /**
     * Returns the correction of the third word based on the internal tri-grams that
     * start with word1 and word2 as well as the Levenshtein distance of candidates
     * from these tri-grams to the given word3.
     * 
     * @return a word for which a tri-gram with word1 and word2 at the beginning
     *         exists and which has the smallest Levenshtein distance to the given
     *         word3. Or null, if such a word does not exist.
     */
    public String getCorrection(String word1, String word2, String word3) {
        // YOUR CODE HERE
        // Build the lookup key the same way as in create(): "w1 w2", lower-cased.
        String keyToCheck = word1.toLowerCase().trim() + ' ' + word2.toLowerCase().trim();
        List<String> candidates = triGramsMap.get(keyToCheck);

        // No tri-gram starts with (word1, word2).
        if (candidates == null || candidates.isEmpty()) {
            return null;
        }

        String target = word3.toLowerCase().trim();
        int min = Integer.MAX_VALUE;
        String correctVal = null;
        for (String candidate : candidates) {
            int distance = calcLevenshteinDistance(target, candidate);
            // Strict '<' keeps the first candidate found in case of a tie.
            if (distance < min) {
                min = distance;
                correctVal = candidate;
            }
        }

        return correctVal;
    }
    
    
    /**
     * Computes the Levenshtein distance between the two given strings using
     * two rolling cost rows instead of the full distance matrix.
     */
    public int calcLevenshteinDistance(String string1, String string2) {
        // YOUR CODE HERE
        String s = string1, t = string2;

        if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }

        int n = s.length(); // length of s
        int m = t.length(); // length of t

        if (n == 0) {
            return m;
        } else if (m == 0) {
            return n;
        }

        if (n > m) {
            // swap the input strings so that the shorter one determines the row length
            String tmp = s;
            s = t;
            t = tmp;
            n = m;
            m = t.length();
        }

        int[] p = new int[n + 1]; // 'previous' cost row
        int[] d = new int[n + 1]; // current cost row
        int[] swap; // placeholder to assist in swapping p and d

        for (int i = 0; i <= n; i++) {
            p[i] = i;
        }

        for (int j = 1; j <= m; j++) {
            char tj = t.charAt(j - 1); // j-th character of t
            d[0] = j;

            for (int i = 1; i <= n; i++) {
                int cost = s.charAt(i - 1) == tj ? 0 : 1;
                // minimum of cell to the left + 1, to the top + 1, diagonally left and up + cost
                d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
            }

            // swap the rows: the current row becomes the 'previous' row
            swap = p;
            p = d;
            d = swap;
        }

        // the last action in the loop above was to switch d and p, so p now
        // holds the most recent costs
        return p[n];
    }

}

// This line should make sure that compile errors are directly identified when executing this cell
// (the line itself does not produce any meaningful result)
Arrays.sort(new Object[]{new TrigramBasedSpellCorrector("")});

# Evaluation

- Run the following cell to test your implementation.
- You can ignore the cells afterwards.

In [2]:
%maven org.junit.jupiter:junit-jupiter-api:5.3.1
%maven commons-io:commons-io:2.6
import org.apache.commons.io.FileUtils;
import org.junit.jupiter.api.Assertions;
import org.opentest4j.AssertionFailedError;

public void checkCorrection(TrigramBasedSpellCorrector corrector, String word1, String word2, String word3,
        String... expectedCorrections) {
    try {
        String result = corrector.getCorrection(word1, word2, word3);
        if(expectedCorrections.length > 0) {
            Set<String> expectedResults = new HashSet<String>(Arrays.asList(expectedCorrections));
            Assertions.assertTrue(expectedResults.contains(result),
                    "For the trigram (\"" + word1 + "\",\"" + word2 + "\",\"" + word3 + "\") your solution returned "
                            + result + " while one of the following words has been expected: "
                            + Arrays.toString(expectedCorrections));
        } else {
            Assertions.assertNull(result,
                    "For the trigram (\"" + word1 + "\",\"" + word2 + "\",\"" + word3 + "\") your solution returned "
                            + result + " while null has been expected.");
        }
        System.out.println("Test(s) successfully completed.");
    } catch (AssertionFailedError e) {
        throw e;
    } catch (Throwable e) {
        System.err.println("Your solution caused an unexpected error:");
        throw e;
    }
}

System.out.println("----- Testing on short example ----");
String text = "London is the capital and largest city of England. Million people " + 
    "live in London. The River Thames is in London. London is the largest city in " +
    "Western Europe.";

TrigramBasedSpellCorrector corrector = new TrigramBasedSpellCorrector(text);

checkCorrection(corrector, "largest", "city", "im", "in");
checkCorrection(corrector, "largest", "city", "on", "in", "of"); // we expect "in" OR "of"
checkCorrection(corrector, "London", "is", "teh", "the");
checkCorrection(corrector, "largest", "capital", "in"); // we expect null as result
checkCorrection(corrector, "natural", "language", "processing"); // we expect null as result

System.out.println("----- Testing on Shakespeare example ----");
// Read text of Shakespeare
File file = new File("/srv/distribution/shakespeare.txt");
text = FileUtils.readFileToString(file, "UTF-8");
long time = System.currentTimeMillis();
corrector = new TrigramBasedSpellCorrector(text);
time = System.currentTimeMillis() - time;
System.out.println("Loading the tri-grams from Shakespeare took " + time + "ms.");

checkCorrection(corrector, "The", "river", "stüx", "styx");
checkCorrection(corrector, "The", "River", "Stüx", "styx");
checkCorrection(corrector, "ambassadors", "from", "noway", "norway");
checkCorrection(corrector, "first", "noble", "Frodo", "friend");
checkCorrection(corrector, "the", "devil", "siaeks", "speaks", "rides"); // we expect "speaks" OR "rides"

----- Testing on short example ----
Test(s) successfully completed.
Test(s) successfully completed.
Test(s) successfully completed.
Test(s) successfully completed.
Test(s) successfully completed.
----- Testing on Shakespeare example ----
Loading the tri-grams from Shakespeare took 1066ms.
Test(s) successfully completed.
Test(s) successfully completed.
Test(s) successfully completed.
Test(s) successfully completed.
Test(s) successfully completed.


In [None]:
// Ignore this cell

In [None]:
// Ignore this cell