# Advanced NLP HW0

Before starting the task, please read these chapters of Speech and Language Processing by Daniel Jurafsky & James H. Martin thoroughly:

•	N-gram language models: https://web.stanford.edu/~jurafsky/slp3/3.pdf

•	Neural language models: https://web.stanford.edu/~jurafsky/slp3/7.pdf 

In this task you will be asked to implement the models described there.

Build a text generator based on an n-gram language model and a neural language model.
1.	Find a corpus (e.g. http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt); you are free to use any other corpus that interests you
2.	Preprocess it if necessary (we suggest using nltk for that)
3.	Build an n-gram model
4.	Try out different values of n and calculate perplexity on a held-out set
5.	Build a simple neural network model for text generation (for example, start with a feed-forward net). We suggest using tensorflow + keras for this task
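
Steps 3 and 4 both start from n-gram counts. A minimal sketch of counting padded n-grams, assuming the sentences are already tokenized; the helper names `extract_ngrams` and `count_ngrams` and the padding markers `<s>` / `</s>` are illustrative choices, not part of the assignment:

```python
from collections import Counter
from typing import Iterable, List, Tuple


def extract_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all n-grams of one sentence, padded with <s> / </s> markers."""
    # n - 1 start markers give the first real token a full context.
    padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


def count_ngrams(sentences: Iterable[List[str]], n: int) -> Counter:
    """Count n-grams over an iterable of tokenized sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(extract_ngrams(sentence, n))
    return counts


sentences = [["to", "be", "or", "not", "to", "be"]]
bigram_counts = count_ngrams(sentences, n=2)
print(bigram_counts[("to", "be")])
```

Changing `n` in the call is enough to compare models of different order on the same held-out set.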

Criteria:
1.	Data is split into train / validation / test, and the motivation for the split method is given
2.	N-gram model is implemented
  * Unknown words are handled
  * Add-k smoothing is implemented
3.	Neural network for text generation is implemented
4.	Perplexity is calculated for both models
5.	Examples of texts generated with different models are present and compared
6.	Optional: Try both character-based and word-based approaches.
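
Criteria 2 and 4 can be illustrated together on a toy bigram model. This is a hedged sketch, not a reference solution: the function names, the `min_count` threshold for mapping rare words to `<UNK>`, and the toy corpus are all assumptions made for the example, and the perplexity skips the first token because no sentence padding is used.

```python
import math
from collections import Counter
from typing import List, Set

UNK = "<UNK>"  # label for out-of-vocabulary tokens


def build_vocab(tokens: List[str], min_count: int = 2) -> Set[str]:
    """Keep tokens seen at least `min_count` times (assumed threshold), plus UNK."""
    counts = Counter(tokens)
    return {tok for tok, c in counts.items() if c >= min_count} | {UNK}


def map_unknown(tokens: List[str], vocab: Set[str]) -> List[str]:
    """Replace every token outside the vocabulary with UNK."""
    return [tok if tok in vocab else UNK for tok in tokens]


def add_k_prob(word: str, context: str, bigrams: Counter, contexts: Counter,
               vocab_size: int, k: float = 1.0) -> float:
    """Add-k estimate: (count(context, word) + k) / (count(context) + k * |V|)."""
    return (bigrams[(context, word)] + k) / (contexts[context] + k * vocab_size)


def perplexity(tokens: List[str], bigrams: Counter, contexts: Counter,
               vocab_size: int, k: float = 1.0) -> float:
    """exp of the negative mean log-probability over all (context, word) pairs."""
    pairs = list(zip(tokens, tokens[1:]))
    log_prob = sum(math.log(add_k_prob(w, c, bigrams, contexts, vocab_size, k))
                   for c, w in pairs)
    return math.exp(-log_prob / len(pairs))


train = "a b a b a c".split()
vocab = build_vocab(train)                 # "c" occurs once, so it becomes <UNK>
train = map_unknown(train, vocab)
bigrams = Counter(zip(train, train[1:]))   # (context, word) counts
contexts = Counter(train[:-1])             # how often each token serves as a context

held_out = map_unknown("a b z".split(), vocab)
pp = perplexity(held_out, bigrams, contexts, len(vocab))
print(round(pp, 4))
```

Because every count gets `k` added, an unseen pair such as `("b", "<UNK>")` still receives non-zero probability, which keeps the perplexity finite on held-out text.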

In [5]:
from typing import Iterable, Union, Tuple, List
import random
from functools import wraps

## Custom ngram model

Base class for the model.

In [6]:
class BaseLM:
    def _check_fit(func):
        """
        A helper decorator that ensures that the LM was fit on vocab.
        """
        @wraps(func)
        def wrapper(self, *args, **kwargs):
            # getattr guards against subclasses that never set the flag
            if not getattr(self, "is_fitted", False):
                raise AttributeError(f"Fit the model before calling the {func.__name__} method")
            return func(self, *args, **kwargs)
        return wrapper

    def __init__(self, 
                 n: int, 
                 vocab: Iterable[str] = None, 
                 unk_label: str = "<UNK>"
                ):
        """
        Language model constructor
        n -- n-gram size
        vocab -- optional fixed vocabulary for the model
        unk_label -- special token that stands in for so-called "unknown" items
        """
        self.n = n
        self._vocab = vocab if vocab else None
        self.unk_label = unk_label
        self.is_fitted = False  # fit() is expected to set this to True
  
    def _lookup(self, 
                words: Union[str, Iterable[str]]
               ) -> Union[str, Tuple[str]]:
        """
        Looks up words in the vocabulary
        """
        raise NotImplementedError

    @_check_fit
    def prob(self, 
             word: str, 
             context: Tuple[str] = None
            ) -> float:
        """This method returns probability of a word with given context: P(w_t | w_{t - 1}...w_{t - n + 1})

        For example:
        >>> lm.prob('hello', context=('world',))
        0.99988
        """
        raise NotImplementedError

    def prob_with_smoothing(self, 
                            word: str, 
                            context: Tuple[str] = None, 
                            alpha: float = 1.0
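                            ) -> float:
        """Probability with additive (add-k) smoothing

        see: https://en.wikipedia.org/wiki/Additive_smoothing

        P(word | context) = (x + a) / (N + a * d)

        where:
        x - count of word in context
        N - total count of words observed in this context
        d - vocab size
        a - alpha (smoothing parameter)

        """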
        raise NotImplementedError

    @_check_fit
    def generate(self, 
                 text_length: int, 
                 text_seed: Iterable[str] = None,
                 random_seed: Union[int, random.Random] = 42,
                 prob_method: str = "prob"
                 ) -> List[str]:
        """
        This method generates text of a given length.

        text_length: int -- Length of the output text, including `text_seed`.
        text_seed: Iterable[str] -- Initial tokens used as context for predicting the next words.
        random_seed: Union[int, random.Random] -- Seed or random generator for reproducible sampling.
        prob_method: str -- Name of the probability method to use: "prob" (no smoothing)
            or "prob_with_smoothing".

        For example
        >>> lm.generate(2)
        ["hello", "world"]

        """
        raise NotImplementedError

    def fit(self, 
            sequence_of_tokens: Iterable[str]
           ):
        """
        This method learns probabilities based on the given sequence of tokens,
        updates `self._vocab`, and marks the model as fitted (`self.is_fitted = True`).

        sequence_of_tokens -- iterable of tokens

        For example
        >>> lm.fit(['hello', 'world'])
        """
        raise NotImplementedError

    @_check_fit  
    def perplexity(self, 
                   sequence_of_tokens: Union[Iterable[str], Iterable[Tuple[str]]]
                   ) -> float:
        """
        This method returns perplexity for a given sequence of tokens

        sequence_of_tokens -- iterable of tokens
        """
        raise NotImplementedError
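
The `generate` contract above (a total length that includes the seed text, and a `random_seed` that may be an `int` or a `random.Random`) can be sketched independently of any particular model. In the sketch below, `sample_text`, `make_rng`, and the `next_token_probs` callable are hypothetical names introduced for illustration; a real subclass would obtain the distribution from its own `prob` or `prob_with_smoothing` method.

```python
import random
from typing import Callable, Dict, Iterable, List, Tuple, Union


def make_rng(random_seed: Union[int, random.Random]) -> random.Random:
    """Accept either an int seed or a ready-made random.Random instance."""
    if isinstance(random_seed, random.Random):
        return random_seed
    return random.Random(random_seed)


def sample_text(next_token_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
                n: int,
                text_length: int,
                text_seed: Iterable[str] = (),
                random_seed: Union[int, random.Random] = 42) -> List[str]:
    """Sample tokens one at a time until the text (seed included) has text_length tokens."""
    rng = make_rng(random_seed)
    text = list(text_seed)
    while len(text) < text_length:
        # The context is the last n - 1 tokens (shorter at the very start).
        context = tuple(text[-(n - 1):]) if n > 1 else ()
        dist = next_token_probs(context)
        tokens = list(dist)
        weights = [dist[tok] for tok in tokens]
        text.append(rng.choices(tokens, weights=weights, k=1)[0])
    return text


def toy_probs(context: Tuple[str, ...]) -> Dict[str, float]:
    """Toy deterministic bigram distribution: 'hello' is always followed by 'world'."""
    if context and context[-1] == "hello":
        return {"world": 1.0}
    return {"hello": 1.0}


print(sample_text(toy_probs, n=2, text_length=2, text_seed=["hello"]))
```

Passing a shared `random.Random` instance instead of an `int` lets several calls draw from one random stream, while an `int` seed makes each call reproducible on its own.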