## Lexical Analyzer

The first stage of a compiler is lexical analysis: scanning the input and grouping its characters into tokens.

There are two ways to tokenize text/code:

- Scan the text character by character and check whether the characters accumulated so far form a known token.
- Split the text wherever a delimiter appears (such as a space or a comma).

Here we follow the second method using regular expressions...
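
As a rough illustration of the two approaches (not the lexer built below), the following sketch tokenizes a small statement both ways. The function names `scan_chars` and `split_on_delimiters` and the sample input are hypothetical and exist only for this example.

```python
import re

def scan_chars(text: str) -> list:
    """Approach 1: walk the input one character at a time, growing a token
    until a character that cannot extend it appears."""
    tokens, current = [], ''
    for ch in text:
        if ch.isalnum() or ch == '_':
            current += ch          # still inside a word or number
            continue
        if current:                # a non-word character ends the current token
            tokens.append(current)
            current = ''
        if not ch.isspace():       # punctuation becomes a token of its own
            tokens.append(ch)
    if current:
        tokens.append(current)
    return tokens

def split_on_delimiters(text: str) -> list:
    """Approach 2: split wherever a delimiter matches. The capturing group
    keeps punctuation delimiters; whitespace delimiters are discarded."""
    return [part for part in re.split(r'([^\w\s])|\s+', text) if part]

sample = 'x = y + 42;'
print(scan_chars(sample))
print(split_on_delimiters(sample))
```

Both functions produce the same token list for this input; the second one simply delegates the scanning to the regex engine.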

In [9]:
import re
from functools import partial
from dataclasses import dataclass, field
from typing import Optional, Callable, Type, List

from nltk.tokenize import word_tokenize

In [3]:
code = """
// This is a useless comment
print "What is your name? ".
my $name <- readLn.
printLn "I think you are ${name}!".

for my $i in 1..10 // Repeat from 1 to 10
  printLn "${i}".
done
"""

In [4]:
code2 = """
int age = 20;
std::cout << age << std::endl;
"""

For comparison, first try Python's built-in `str.split` and NLTK's `word_tokenize`:

In [5]:
split_by_space_method = code.split()
print(split_by_space_method)

['//', 'This', 'is', 'a', 'useless', 'comment', 'print', '"What', 'is', 'your', 'name?', '".', 'my', '$name', '<-', 'readLn.', 'printLn', '"I', 'think', 'you', 'are', '${name}!".', 'for', 'my', '$i', 'in', '1..10', '//', 'Repeat', 'from', '1', 'to', '10', 'printLn', '"${i}".', 'done']


In [6]:
nltk_word_tokenize_method = word_tokenize(code)
print(nltk_word_tokenize_method)

['//', 'This', 'is', 'a', 'useless', 'comment', 'print', '``', 'What', 'is', 'your', 'name', '?', '``', '.', 'my', '$', 'name', '<', '-', 'readLn', '.', 'printLn', '``', 'I', 'think', 'you', 'are', '$', '{', 'name', '}', '!', "''", '.', 'for', 'my', '$', 'i', 'in', '1', '..', '10', '//', 'Repeat', 'from', '1', 'to', '10', 'printLn', '``', '$', '{', 'i', '}', "''", '.', 'done']


In [8]:
difference = set(split_by_space_method).difference(set(nltk_word_tokenize_method))
print(difference)

{'name?', '"I', '1..10', '${name}!".', '$i', '$name', '".', '"${i}".', 'readLn.', '"What', '<-'}


Neither result is suited to our language (string literals are broken apart, for example), so we build our own lexer:

In [12]:
@dataclass(frozen=True)
class Token:
    name: str
    pattern: str
    calls: Optional[Callable] = None

In [11]:
@dataclass(frozen=True)
class TokenState:
    name: str
    value: str

In [13]:
@dataclass
class Tokens:
    dict_: dict = field(init=False, default_factory=dict)
    
    def add(self, name: str, pattern: str, calls: Optional[Callable] = None) -> None:
        """Register the token as both an instance attribute and a dictionary item."""
        token = Token(name, pattern, calls)
        setattr(self, name, token)
        self.dict_[name] = token
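
A short usage sketch of the registry, using a condensed copy of the two dataclasses so that it runs on its own. The instance name `demo` and the two tokens registered here are only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass(frozen=True)
class Token:
    name: str
    pattern: str
    calls: Optional[Callable] = None

@dataclass
class Tokens:
    dict_: dict = field(init=False, default_factory=dict)

    def add(self, name, pattern, calls=None):
        token = Token(name, pattern, calls)
        setattr(self, name, token)   # reachable as demo.NAME
        self.dict_[name] = token     # reachable as demo.dict_['NAME']

demo = Tokens()
demo.add('DOT', r'\.')
demo.add('PRINTLINE', r'printLn', print)

print(demo.DOT.pattern)                        # attribute access
print(demo.dict_['PRINTLINE'].calls is print)  # dictionary access
print(list(demo.dict_))                        # dicts keep insertion order
```

Because dictionaries preserve insertion order, the order of the `add` calls is also the order in which a lexer iterating over `dict_` will try the rules.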

In [15]:
@dataclass
class Lexer:
    tokens: Tokens
    separator: str = r'(\w+)|\s'

    def tokenize(self, code: str) -> List[TokenState]:
        """Split `code` on the separator pattern and classify each part as a token."""
        result = []
        # re.split yields None for groups that did not match and '' between
        # adjacent separators; drop both.
        parts = filter(None, re.split(self.separator, code))

        for part in parts:
            # Token rules are tried in insertion order; the first full match wins.
            for token_name, token_info in self.tokens.dict_.items():
                match = re.fullmatch(token_info.pattern, part)
                if match is not None:
                    result.append(TokenState(token_name, match.group()))
                    break
            else:
                # No rule matched this part.
                raise ValueError(f'{part!r} is not a valid token!')

        return result
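
`tokenize` relies on a detail of `re.split`: text matched by a capturing group in the separator is returned as part of the result, text matched outside any group is discarded, and a group that did not take part in a match contributes `None`. The sketch below shows this with the default separator; the input string is made up for the example.

```python
import re

separator = r'(\w+)|\s'   # the Lexer's default separator
raw_parts = re.split(separator, 'my $name <- readLn')
print(raw_parts)          # contains '' and None entries

# Dropping the empty strings and None values leaves the usable parts.
parts = [part for part in raw_parts if part]
print(parts)
```

With this default separator, `$name` is split into `$` and `name`, which is presumably why the token table below also has a separate `DOLLARSIGN` rule.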

The tokens of our unnamed programming language:

In [16]:
tokens = Tokens()
tokens.add('COMMENT', r'/{2}.*')
tokens.add('DOT', r'\.')
tokens.add('BETWEEN', r'\.{2}')
tokens.add('DOLLARSIGN', r'\$')
tokens.add('ARROW', r'<\-')
tokens.add('NUMBER', r'[\+\-]?(\d+|\d+\.\d*|\d*\.\d+)')
tokens.add('STRING', r'".*"')
tokens.add('PRINT', r'print', partial(print, end=''))
tokens.add('PRINTLINE', r'printLn', print)
tokens.add('READLINE', r'readLn', input)
tokens.add('MY', r'my')
tokens.add('IN', r'in')
tokens.add('IDENTIFIER', r'\$[A-Za-z]+[_A-Za-z0-9]*')
tokens.add('FOR', r'for')
tokens.add('DONE', r'done')

Next, build a more elaborate separator pattern from the token patterns:

In [23]:
separator_tokens = ['STRING', 'BETWEEN', 'NUMBER', 'COMMENT', 'DOT']
separator_pattern = '|'.join([
    f'({getattr(tokens, token_name).pattern})'
    for token_name in separator_tokens
])
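# A raw string avoids invalid-escape warnings. Note that this group matches the
# literal sequence "(){}[]", not an individual bracket.
separator_pattern += r'|(\(\)\{\}\[\])|\s'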
print(separator_pattern)

(".*")|(\.{2})|([\+\-]?(\d+|\d+\.\d*|\d*\.\d+))|(/{2}.*)|(\.)|(\(\)\{\}\[\])|\s


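*NOTE: The order in the pattern above is important. Regex alternation tries the alternatives from left to right, so `BETWEEN` (`..`) has to come before `DOT` (`.`).*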
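
To see why the order matters, here is a minimal sketch with only the two dot patterns. The variable names `between_first` and `dot_first` are made up for the example.

```python
import re

text = '1..10'
between_first = r'(\.{2})|(\.)|\s'   # two dots are tried before one dot
dot_first = r'(\.)|(\.{2})|\s'       # one dot is tried first and always wins

ordered = [p for p in re.split(between_first, text) if p]
swapped = [p for p in re.split(dot_first, text) if p]

print(ordered)   # the range operator stays in one piece
print(swapped)   # the range operator is broken into two single dots
```

In the second pattern the `\.{2}` alternative can never match, because the single-dot alternative already succeeds at every position where a dot starts.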

In [24]:
lexer = Lexer(tokens, separator_pattern)
print(lexer.tokenize(code))

[TokenState(name='COMMENT', value='// This is a useless comment'), TokenState(name='PRINT', value='print'), TokenState(name='STRING', value='"What is your name? "'), TokenState(name='DOT', value='.'), TokenState(name='MY', value='my'), TokenState(name='IDENTIFIER', value='$name'), TokenState(name='ARROW', value='<-'), TokenState(name='READLINE', value='readLn'), TokenState(name='DOT', value='.'), TokenState(name='PRINTLINE', value='printLn'), TokenState(name='STRING', value='"I think you are ${name}!"'), TokenState(name='DOT', value='.'), TokenState(name='FOR', value='for'), TokenState(name='MY', value='my'), TokenState(name='IDENTIFIER', value='$i'), TokenState(name='IN', value='in'), TokenState(name='NUMBER', value='1'), TokenState(name='NUMBER', value='1'), TokenState(name='BETWEEN', value='..'), TokenState(name='NUMBER', value='10'), TokenState(name='NUMBER', value='10'), TokenState(name='COMMENT', value='// Repeat from 1 to 10'), TokenState(name='PRINTLINE', value='printLn'), Toke
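
The repeated `NUMBER` entries in the output above (`'1'` twice, `'10'` twice) come from the nested capturing group inside the `NUMBER` pattern: `re.split` returns every capturing group, inner ones included. One possible fix, shown here as a sketch and not as the notebook's own code, is to make the inner group non-capturing with `(?:...)`. The names `nested` and `flat` are hypothetical.

```python
import re

# NUMBER pattern wrapped in an outer group, as in the separator above.
nested = r'([\+\-]?(\d+|\d+\.\d*|\d*\.\d+))|\s'
# Same pattern, but the inner group no longer captures.
flat = r'([\+\-]?(?:\d+|\d+\.\d*|\d*\.\d+))|\s'

with_duplicates = [p for p in re.split(nested, 'in 10') if p]
without_duplicates = [p for p in re.split(flat, 'in 10') if p]

print(with_duplicates)
print(without_duplicates)
```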

Testing on C++ code:

In [26]:
lexer = Lexer(tokens, separator_pattern)
print(lexer.tokenize(code2))

ValueError: 'int' is not a valid token!
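
The error is expected: the token table has no rule for bare words such as `int`, since `IDENTIFIER` requires a leading `$`. As a rough diagnostic sketch, the hypothetical helper `unknown_parts` below lists every part that no rule accepts, instead of stopping at the first one. It copies the patterns from the token table and, as a simplifying assumption, uses the lexer's default separator and not the custom `separator_pattern`.

```python
import re

# Patterns copied from the token table above (names omitted for brevity).
patterns = [
    r'/{2}.*', r'\.', r'\.{2}', r'\$', r'<\-',
    r'[\+\-]?(\d+|\d+\.\d*|\d*\.\d+)', r'".*"',
    r'print', r'printLn', r'readLn', r'my', r'in',
    r'\$[A-Za-z]+[_A-Za-z0-9]*', r'for', r'done',
]

def unknown_parts(code: str, separator: str = r'(\w+)|\s') -> list:
    """Return the parts of `code` that match none of the token patterns."""
    parts = [p for p in re.split(separator, code) if p]
    return [p for p in parts
            if not any(re.fullmatch(pattern, p) for pattern in patterns)]

print(unknown_parts('int age = 20;'))
```

Only `20` is recognized here (as a `NUMBER`); supporting C++ input would need additional rules for bare identifiers, `=`, `;` and so on.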