# Lexer

A [lexer](https://en.wikipedia.org/wiki/Lexical_analysis) is the part of a compiler that turns source code into a sequence of tokens. A token is an ordered pair (class, lexeme). A lexeme is a character or keyword that has a function in the syntax of the programming language, and that function determines the lexeme's class. The lexer reads the source code, and this is the only time the source is read during the whole compilation process. The later phases of compilation need only the resulting sequence of tokens.

![pp-01](https://i.postimg.cc/SNmFQ6X0/pp-01.png)
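
To make the (class, lexeme) pairing concrete, the sketch below writes out by hand the token sequence one might expect for a one-line Pascal statement. It is an illustration only: the class names are plain strings chosen to resemble the enum defined later in this notebook, all lexemes are kept as strings, and nothing here is produced by an actual lexer.

```python
# Source text of a single Pascal statement.
source = "x := 5;"

# Hand-written token sequence for that statement, as (class, lexeme) pairs.
tokens = [
    ("ID", "x"),
    ("ASSIGN", ":="),
    ("INT", "5"),
    ("SEMICOLON", ";"),
    ("EOF", None),
]

# Joining the lexemes (skipping EOF) gives back the source without whitespace.
rebuilt = "".join(lexeme for _, lexeme in tokens if lexeme is not None)
print(rebuilt)
```

Whitespace does not appear in the token sequence, which is why later phases can work from the tokens alone.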

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from enum import Enum, auto

The `Class` enum defines every lexeme class that can appear in Pascal source code.

In [None]:
class Class(Enum):
    TYPE          = auto()
    ID            = auto()
    INT           = auto()
    REAL          = auto()
    BOOLEAN       = auto()
    CHAR          = auto()
    STRING        = auto()
    ARRAY         = auto()    # ?
    OF            = auto()

    PLUS          = auto()
    MINUS         = auto()
    ASTERISK      = auto()  # multiplication
    DIV           = auto()  # integer division
    FWDSLASH      = auto()  # real (floating-point) division
    MOD           = auto()

    ABS           = auto()
    ROUND         = auto()
    INC           = auto()
    DEC           = auto()

    NOT           = auto()
    AND           = auto()
    OR            = auto()
    XOR           = auto()

    EQ            = auto()  # =
    NEQ           = auto()  # <>
    LT            = auto()  # <
    GT            = auto()  # >
    LTE           = auto()  # <=
    GTE           = auto()  # >=

    LPAREN        = auto()
    RPAREN        = auto()
    LBRACKET      = auto()
    RBRACKET      = auto()
    RANGE         = auto()

    COMMA         = auto()
    DOT           = auto()
    COLON         = auto()
    SEMICOLON     = auto()
    ASSIGN        = auto()      # :=

    VAR           = auto()  
    PROCEDURE     = auto()
    FUNCTION      = auto()
    EXIT          = auto()
    BEGIN         = auto()
    END           = auto()     

    IF            = auto()
    THEN          = auto()
    ELSE          = auto()

    FOR           = auto()
    TO            = auto()
    DO            = auto()
    WHILE         = auto()
    REPEAT        = auto()
    UNTIL         = auto()
    BREAK         = auto()
    CONTINUE      = auto()

    EOF           = auto()



**Token** class is an ordered pair (class, lexeme).

Method **`__str__`** returns a string representation of a token. It is used for debugging.

In [None]:
class Token:
    def __init__(self, class_, lexeme):
        self.class_ = class_
        self.lexeme = lexeme

    def __str__(self):
        return "<{} {}>".format(self.class_, self.lexeme)
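
As a quick check of what the string representation looks like, the sketch below repeats the `Token` class from the cell above together with a reduced two-member stand-in for the `Class` enum (an assumption made only to keep the example self-contained) and prints two tokens.

```python
from enum import Enum, auto

class Class(Enum):      # reduced stand-in for the full enum above
    ID = auto()
    INT = auto()

class Token:
    def __init__(self, class_, lexeme):
        self.class_ = class_
        self.lexeme = lexeme

    def __str__(self):
        return "<{} {}>".format(self.class_, self.lexeme)

printed = [str(Token(Class.ID, "x")), str(Token(Class.INT, 5))]
print(" ".join(printed))   # <Class.ID x> <Class.INT 5>
```

The enum member is formatted with its qualified name, so the class of each token is easy to spot when a whole token list is printed.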

**Lexer** class contains methods for lexical analysis of the Pascal source code.

Method **lex** forms an array of tokens using the method **next_token**.

Method **next_token** forms a token of the appropriate class using the **next_char** method.

Method **next_char** moves the pointer to the next character.

Method **read_keyword** forms a keyword, type, or identifier token when the current character is a letter.

Method **read_string** forms a string token when the current character is a single quote (').

Method **read_char** forms a char token when the current character is a single quote (') and the quoted text is exactly one character long.

Method **read_num** forms an int or real token depending on the format of the number.

Method **read_space** skips whitespace so that the next character read is the first non-space character.

Method **die** is called when the lexer encounters an unexpected character.
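
The choice between an int and a real token depends on what follows the digits, and a Pascal range such as `1..10` must not be mistaken for a real number. The standalone sketch below illustrates that lookahead on a plain string. The function name `scan_number` and its return shape are hypothetical, and it assumes a slightly stricter rule than "the dot is not followed by another dot": it requires a digit after the dot.

```python
def scan_number(text, pos):
    """Scan a number starting at text[pos]; return (kind, value, next_pos)."""
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    # A '.' begins a fractional part only when a digit follows it;
    # in "1..10" the dots belong to the range operator instead.
    if end + 1 < len(text) and text[end] == '.' and text[end + 1].isdigit():
        end += 1
        while end < len(text) and text[end].isdigit():
            end += 1
        return ("REAL", float(text[pos:end]), end)
    return ("INT", int(text[pos:end]), end)

print(scan_number("42;", 0))      # ('INT', 42, 2)
print(scan_number("3.14;", 0))    # ('REAL', 3.14, 4)
print(scan_number("1..10", 0))    # ('INT', 1, 1)
```

In the last call the scan stops before the first dot, which leaves both dots available to be read as a range token.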

In [None]:
class Lexer:
    def __init__(self, text):
        self.text = text
        self.len = len(text)
        self.pos = -1

    def read_space(self):
        while self.pos + 1 < self.len and self.text[self.pos + 1].isspace():
            self.next_char()

    def read_num(self):
        lexeme = self.text[self.pos]
        while self.pos + 1 < self.len and self.text[self.pos + 1].isdigit():
            lexeme += self.next_char()
        # A single '.' starts a fractional part; '..' is the range operator.
        # peek() returns None past the end of the text instead of raising.
        if self.peek(1) == '.' and self.peek(2) != '.':
            lexeme += self.next_char()
            return self.read_real(lexeme)
        return Token(Class.INT, int(lexeme))

    def read_real(self, lexeme):
        while self.pos + 1 < self.len and self.text[self.pos + 1].isdigit():
            lexeme += self.next_char()
        return Token(Class.REAL, float(lexeme))

    def read_char(self): 
        self.pos += 1
        lexeme = self.text[self.pos]
        self.pos += 1
        return lexeme

    def read_string(self): 
        lexeme = ''
        while self.pos + 1 < self.len and self.text[self.pos + 1] != '\'':
            lexeme += self.next_char()
        self.pos += 1
        return lexeme

    def read_keyword(self):
        lexeme = self.text[self.pos]
        while self.pos + 1 < self.len and self.text[self.pos + 1].isalnum():
            lexeme += self.next_char()
        if lexeme == 'begin':
            return Token(Class.BEGIN, lexeme)
        elif lexeme == 'end':                           
            return Token(Class.END, lexeme)
        elif lexeme == 'procedure':
            return Token(Class.PROCEDURE, lexeme)
        elif lexeme == 'function':
            return Token(Class.FUNCTION, lexeme)
        elif lexeme == 'exit':
            return Token(Class.EXIT, lexeme)
        elif lexeme == 'var':
            return Token(Class.VAR, lexeme)
        elif lexeme == 'div':
            return Token(Class.DIV, lexeme)
        elif lexeme == 'mod':
            return Token(Class.MOD, lexeme)
        elif lexeme == 'true' or lexeme == 'false':
            return Token(Class.BOOLEAN, lexeme)
        elif lexeme == 'not':
            return Token(Class.NOT, lexeme)
        elif lexeme == 'and':
            return Token(Class.AND, lexeme)
        elif lexeme == 'or':
            return Token(Class.OR, lexeme)
        elif lexeme == 'xor':
            return Token(Class.XOR, lexeme)
        elif lexeme == 'array':
            return Token(Class.ARRAY, lexeme)
        elif lexeme == 'of':
            return Token(Class.OF, lexeme)
        elif lexeme == 'if':
            return Token(Class.IF, lexeme)
        elif lexeme == 'then':
            return Token(Class.THEN, lexeme)
        elif lexeme == 'else':
            return Token(Class.ELSE, lexeme)
        elif lexeme == 'while':
            return Token(Class.WHILE, lexeme)
        elif lexeme == 'for':
            return Token(Class.FOR, lexeme)
        elif lexeme == 'to':
            return Token(Class.TO, lexeme)
        elif lexeme == 'do':
            return Token(Class.DO, lexeme)
        elif lexeme == 'repeat':
            return Token(Class.REPEAT, lexeme)
        elif lexeme == 'until':
            return Token(Class.UNTIL, lexeme)
        elif lexeme == 'Break':       # matched with this exact capitalization
            return Token(Class.BREAK, lexeme)
        elif lexeme == 'Continue':    # matched with this exact capitalization
            return Token(Class.CONTINUE, lexeme)
        elif lexeme == 'integer' or lexeme == 'real' or lexeme == 'char' or lexeme == 'string' or lexeme == 'boolean':
            return Token(Class.TYPE, lexeme)
        return Token(Class.ID, lexeme)

    def next_char(self):
        self.pos += 1
        if self.pos >= self.len:
            return None
        return self.text[self.pos]

    def peek(self, step):
        peek_pos = self.pos + step
        if peek_pos >= self.len:
            return None
        return self.text[peek_pos]   

    def next_token(self):
        self.read_space()
        curr = self.next_char()
        if curr is None:
            return Token(Class.EOF, curr)
        token = None
        if curr.isalpha():
            token = self.read_keyword()
        elif curr.isdigit():
            token = self.read_num()
        elif curr == ':' and self.peek(1) == '=':
            self.next_char()
            token = Token(Class.ASSIGN, ':=')
        elif curr == '\'':
            # A quote, one character, and a closing quote form a char literal;
            # anything else is read as a string.
            if self.peek(2) == '\'':
                token = Token(Class.CHAR, self.read_char())
            else:
                token = Token(Class.STRING, self.read_string())
        elif curr == '+':
            token = Token(Class.PLUS, curr)
        elif curr == '-':
            token = Token(Class.MINUS, curr)
        elif curr == '*':
            token = Token(Class.ASTERISK, curr)
        elif curr == '/':
            token = Token(Class.FWDSLASH, curr)
        elif curr == '=':
            token = Token(Class.EQ, '=')
        elif curr == '<':
            if self.peek(1) == '=':
                self.next_char()
                token = Token(Class.LTE, '<=')
            elif self.peek(1) == '>':
                self.next_char()
                token = Token(Class.NEQ, '<>')
            else:
                token = Token(Class.LT, '<')
        elif curr == '>':
            if self.peek(1) == '=':
                self.next_char()
                token = Token(Class.GTE, '>=')
            else:
                token = Token(Class.GT, '>')
        elif curr == '(':
            token = Token(Class.LPAREN, curr)
        elif curr == ')':
            token = Token(Class.RPAREN, curr)
        elif curr == '[':
            token = Token(Class.LBRACKET, curr)
        elif curr == ']':
            token = Token(Class.RBRACKET, curr)
        elif curr == ';':
            token = Token(Class.SEMICOLON, curr)
        elif curr == ':':
            token = Token(Class.COLON, curr)
        elif curr == ',':
            token = Token(Class.COMMA, curr)
        elif curr == '.':
            token = Token(Class.DOT, curr)
            if self.peek(1) == '.':
                self.next_char()
                token = Token(Class.RANGE, '..')
        else:
            self.die(curr)
        return token

    def lex(self):
        tokens = []
        while True:
            curr = self.next_token()
            tokens.append(curr)
            if curr.class_ == Class.EOF:
                break
        return tokens

    def die(self, char):
        raise SystemExit("Unexpected character: {}".format(char))
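
The long `if`/`elif` chain in **read_keyword** maps a lexeme to its class and falls back to an identifier. One possible alternative is a dictionary lookup, sketched below. This is a sketch rather than the notebook's implementation: the `KEYWORDS` table and the `classify` helper are hypothetical names, and the enum is reduced to four members to keep the example self-contained.

```python
from enum import Enum, auto

class Class(Enum):      # reduced stand-in for the full enum
    BEGIN = auto()
    END = auto()
    TYPE = auto()
    ID = auto()

# Hypothetical lookup table: lexeme -> class.
KEYWORDS = {
    "begin": Class.BEGIN,
    "end": Class.END,
    "integer": Class.TYPE,
    "real": Class.TYPE,
}

def classify(lexeme):
    """Return the keyword class for a lexeme, or Class.ID if it is not a keyword."""
    return KEYWORDS.get(lexeme, Class.ID)

print(classify("begin"), classify("real"), classify("counter"))
```

With a table like this, adding a keyword is a one-line change, and the fallback to `Class.ID` stays in a single place.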