
In [None]:
#Simple script breaking words into tokens
#Can we generalize this tokenizer?

#Key link: https://medium.com/@pythonmembers.club/building-a-lexer-in-python-a-tutorial-3b6de161fe84
#Key link: https://www.jayconrod.com/posts/37/a-simple-interpreter-from-scratch-in-python-part-1

In [None]:
#What is a lexer?
#Once the user has written the code, compilation proceeds in four major steps:
#lexical analysis, parsing, code generation and execution
#A lexer performs lexical analysis.
#Today, lexical analysis also includes scanning (previously, the two were separate steps)

In [None]:
#Lexical analysis turns scanned characters into lexemes (a lexeme is a recognized piece of the source string)
#Refer to the theory material for examples using simple sentences
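
To make the idea of grouping scanned characters into lexemes concrete, here is a minimal sketch. The helper name `scan_lexemes` and its grouping rules (letters, digits and underscores form one lexeme; every other non-space character is a lexeme on its own) are assumptions made for illustration, not part of the theory material.

```python
def scan_lexemes(text):
    """Group scanned characters into lexemes (hypothetical helper)."""
    lexemes = []
    current = ""                      # characters of the lexeme being built
    for ch in text:
        if ch.isalnum() or ch == "_":
            current += ch             # keep extending a word or a number
        else:
            if current:               # any other character ends the lexeme
                lexemes.append(current)
                current = ""
            if not ch.isspace():      # punctuation is a lexeme on its own
                lexemes.append(ch)
    if current:                       # flush the last lexeme, if any
        lexemes.append(current)
    return lexemes

print(scan_lexemes("int result = 100;"))
```

At this stage the lexemes are still plain strings; classifying each one into a type is the separate step described under "Building the Lexer".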

**Building the Lexer**

In [None]:
#NB - the lexer will scan the source code and break it into a list of items
#Once this is done, it classifies each item and creates a [type, value] pair, i.e. a token such as ['INTEGER', '100']


In [None]:
#Our Source Code is: int result = 100;

In [1]:
import re                                # Regular expression (regex) library

In [2]:
tokens = []                               # for string tokens
source_code = 'int result = 100;'.split() # turning source code into list of words

# Loop through each source code word
for word in source_code:
    
    # Check whether the word is a datatype declaration
    if word in ['str', 'int', 'bool']: 
        tokens.append(['DATATYPE', word])
    
    # Treat any other word that starts with a letter as an identifier
    elif re.match("[a-zA-Z]", word):
        tokens.append(['IDENTIFIER', word])
    
    # Check for a single-character operator
    elif word in list('*-/+%='):
        tokens.append(['OPERATOR', word])
    
    # Check for an integer (kept as a string) and split off a trailing ';'
    elif re.match("[0-9]", word):
        if word[-1] == ';': 
            tokens.append(["INTEGER", word[:-1]])
            tokens.append(['END_STATEMENT', ';'])
        else: 
            tokens.append(["INTEGER", word])

print(tokens) # Outputs the token array

[['DATATYPE', 'int'], ['IDENTIFIER', 'result'], ['OPERATOR', '='], ['INTEGER', '100'], ['END_STATEMENT', ';']]


In [None]:
#The output above shows that we broke the source code into a token stream
#Each token in the stream is a ['TYPE', 'VALUE'] pair

**Explaining the Code**

1. First, we imported "re", Python's regex (regular expression) library.
    This library was used to check whether a word matches a predefined regex pattern.
2. An empty list named "tokens" was created [this list stores all the tokens created].
3. The source code, a single string, was split into a list of words.
4. This list of words was stored in a variable called "source_code".
5. The first check was then done, i.e.
    if word in ['str', 'int', 'bool']: 
       tokens.append(['DATATYPE', word])
6. This check tested whether the word is a datatype keyword, which tells us the type of the variable being declared.
7. Thereafter, more checks were done: each remaining word in the source code was identified and a token was created for it.
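
The steps above can be sketched as a reusable function, which is one way to start generalizing the tokenizer. The function name `tokenize` and the constants `DATATYPES` and `OPERATORS` are hypothetical names introduced here; the checks follow the same order as the cell above, with the integer check written so that single-digit numbers are also accepted.

```python
import re

DATATYPES = {"str", "int", "bool"}   # assumed set of datatype keywords
OPERATORS = set("*-/+%=")            # single-character operators

def tokenize(source_code):
    """Split source_code on whitespace and classify each word."""
    tokens = []
    for word in source_code.split():
        if word in DATATYPES:
            tokens.append(["DATATYPE", word])
        elif re.match("[a-zA-Z]", word):      # starts with a letter
            tokens.append(["IDENTIFIER", word])
        elif word in OPERATORS:
            tokens.append(["OPERATOR", word])
        elif re.match("[0-9]", word):         # starts with a digit
            if word.endswith(";"):
                tokens.append(["INTEGER", word[:-1]])
                tokens.append(["END_STATEMENT", ";"])
            else:
                tokens.append(["INTEGER", word])
    return tokens

print(tokenize("int result = 100;"))
```

Because the function still relies on `split()`, it only separates tokens that have whitespace between them.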

# ***Second variation: based on a new source code***

In [5]:
tokens = []                               # for string tokens
source_code = 'x+y=z;'.split() # turning source code into list of words

# Loop through each source code word
for word in source_code:
    
    # Check whether the word is a datatype declaration
    if word in ['str', 'int', 'bool']: 
        tokens.append(['DATATYPE', word])
    
    # Treat any other word that starts with a letter as an identifier
    elif re.match("[a-zA-Z]", word):
        tokens.append(['IDENTIFIER', word])
    
    # Check for a single-character operator
    elif word in list('*-/+%='):
        tokens.append(['OPERATOR', word])
    
    # Check for an integer (kept as a string) and split off a trailing ';'
    elif re.match("[0-9]", word):
        if word[-1] == ';': 
            tokens.append(["INTEGER", word[:-1]])
            tokens.append(['END_STATEMENT', ';'])
        else: 
            tokens.append(["INTEGER", word])

print(tokens) # Outputs the token array

[['IDENTIFIER', 'x+y=z;']]


***Why did the second variation give just one token?***
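
The likely reason is that `split()` only breaks a string at whitespace, and 'x+y=z;' contains none, so the whole statement arrives as a single word that starts with a letter and is labelled an identifier. One possible way around this, sketched below, is to let regular expressions find token boundaries instead of whitespace. The names `TOKEN_SPEC`, `MASTER` and `regex_tokenize` and the exact patterns are assumptions made for this example, not part of the original notebook.

```python
import re

# (token type, pattern) pairs; earlier entries win when several could match
TOKEN_SPEC = [
    ("INTEGER", r"[0-9]+"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OPERATOR", r"[*\-/+%=]"),
    ("END_STATEMENT", r";"),
    ("SKIP", r"\s+"),                      # whitespace, discarded below
]
# Combine the patterns into one regex made of named groups
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))
DATATYPES = {"str", "int", "bool"}

def regex_tokenize(source_code):
    """Tokenize without relying on whitespace between tokens."""
    tokens = []
    # Note: finditer silently skips characters that match no pattern
    for match in MASTER.finditer(source_code):
        kind, value = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "IDENTIFIER" and value in DATATYPES:
            kind = "DATATYPE"              # keywords look like identifiers
        tokens.append([kind, value])
    return tokens

print(regex_tokenize("x+y=z;"))
```

With this sketch, 'x+y=z;' is split into six tokens (three identifiers, two operators and the end of statement), and whitespace between tokens becomes optional.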

# **Third variation**

In [6]:
tokens = []                               # for string tokens
source_code = 'int x = 75;'.split() # turning source code into list of words

# Loop through each source code word
for word in source_code:
    
    # Check whether the word is a datatype declaration
    if word in ['str', 'int', 'bool']: 
        tokens.append(['DATATYPE', word])
    
    # Treat any other word that starts with a letter as an identifier
    elif re.match("[a-zA-Z]", word):
        tokens.append(['IDENTIFIER', word])
    
    # Check for a single-character operator
    elif word in list('*-/+%='):
        tokens.append(['OPERATOR', word])
    
    # Check for an integer (kept as a string) and split off a trailing ';'
    elif re.match("[0-9]", word):
        if word[-1] == ';': 
            tokens.append(["INTEGER", word[:-1]])
            tokens.append(['END_STATEMENT', ';'])
        else: 
            tokens.append(["INTEGER", word])

print(tokens) # Outputs the token array

[['DATATYPE', 'int'], ['IDENTIFIER', 'x'], ['OPERATOR', '='], ['INTEGER', '75'], ['END_STATEMENT', ';']]


***That worked!***