# What's Inside a Data Query Engine  
## *Building one from Scratch*  

## Part 2: Just A Tad More Detail 
  
![What's Inside a Data Query Engine](./images/dataengine05.png)

### <font color='green'>__Support for Google Colab__  </font>  
    
Open this notebook in Colab using the following button:  
  
<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/00-Python-Collections/01.03%20Fun%20with%20Functools.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>  

  
<font color='green'>Uncomment and execute the cell below to set up and run this notebook on Google Colab.</font>

In [None]:
# # SETUP FOR COLAB: select all the lines below and uncomment (CTRL+/ on Windows)
# # Let's download and unzip the Small MovieLens Dataset
# # (unzips into ./data so that it matches `datalocation` below)
# ! mkdir -p ./data
# ! wget -q https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
# ! unzip ./ml-latest-small.zip -d ./data/

### Get the _Small_ MovieLens Dataset

We'll use the [small MovieLens dataset](https://grouplens.org/datasets/movielens/#:~:text=Small%3A%20100%2C000%20ratings%20and%203%2C600%20tag%20applications) here.

Download it and unzip to the data folder under the name `ml-latest-small`.

This dataset expands to about 3.2 MB on your local disk. 

In [None]:
datalocation = "./data/ml-latest-small/"

In [None]:
# specify file names
file_path_movies = datalocation + "movies.csv"
file_path_links = datalocation + "links.csv"
file_path_ratings = datalocation + "ratings.csv"
file_path_tags = datalocation + "tags.csv"

# Here's what our data engine should be able to do  
* Load the data into memory and capture some metadata (column names, data types, etc.)  
* Accept a query of the form `SELECT (xxx) FROM (xxx) WHERE (xxx)`  
* Parse the query to make sense of it  
* Highlight any errors  
* Build a query plan  
* Using the plan and the metadata, optimize the query further  
* Execute the query  
* Show the results  
* Show the cost of running the query  
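
As a rough illustration of the last three steps (execute, show results, show cost), here is a minimal sketch. The function name `run_with_cost`, the sample rows, and the choice of "rows scanned" plus elapsed time as the cost measure are all assumptions made for illustration; they are not the engine built in these notebooks.

```python
import time


def run_with_cost(rows, predicate):
    """Filter rows with predicate and report a simple cost summary."""
    start = time.perf_counter()
    scanned = 0
    matched = []
    for row in rows:
        scanned += 1  # every row is inspected: a full table scan
        if predicate(row):
            matched.append(row)
    elapsed = time.perf_counter() - start
    cost = {
        "rows_scanned": scanned,
        "rows_returned": len(matched),
        "seconds": elapsed,
    }
    return matched, cost


# Hypothetical rows standing in for a loaded table
sample_rows = [
    {"movieId": "1", "title": "Sample A"},
    {"movieId": "2", "title": "Sample B"},
    {"movieId": "3", "title": "Sample C"},
]

results, cost = run_with_cost(sample_rows, lambda r: int(r["movieId"]) <= 2)
print(results)
print(cost)
```

A real engine would usually estimate cost before execution from the plan and metadata; measuring after the fact, as here, is just the simplest thing that shows the idea.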
  
  
_The full set of notebooks also covers JOINs and nested queries, but we are going to treat them as intermediate to advanced cases - since they may distract us from the goal of just being able to understand how data engines work._

We'll directly use the [CSV module](https://docs.python.org/3/library/csv.html) here just to keep our focus on the data engine itself and not get distracted by the intricacies of loading a CSV file.
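
To make the loading step concrete, here is a minimal sketch using `csv.DictReader`. The helper name `load_table` and the shape of the metadata dictionary are assumptions for illustration, and an in-memory string stands in for the real CSV file so the example runs without the dataset.

```python
import csv
import io


def load_table(file_obj):
    """Read CSV rows into a list of dicts and return them with basic metadata."""
    reader = csv.DictReader(file_obj)
    rows = list(reader)
    metadata = {"columns": reader.fieldnames or [], "row_count": len(rows)}
    return rows, metadata


# In-memory stand-in for a CSV file (hypothetical sample data)
sample_csv = io.StringIO(
    "movieId,title,genres\n"
    "1,Sample A,Comedy\n"
    "2,Sample B,Drama\n"
)

rows, metadata = load_table(sample_csv)
print(metadata)
print(rows[0])
```

Note that the `csv` module returns every value as a string, which is why later filters need an explicit cast such as `int(movieId)`.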

# SQL Parser  
  
The SQL parser converts a SQL query into a structured Abstract Syntax Tree (AST).  
An AST is a tree representation of the syntactic structure of the SQL code.  
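
To give a feel for the shape of such a tree, here is a minimal sketch that writes one out by hand as nested dictionaries. The key names (`type`, `columns`, `table`, `where`) and the helper `describe` are assumptions chosen for illustration; the notebook's own AST uses classes instead.

```python
# Hypothetical tree for: SELECT name, age FROM users WHERE age > 30
ast_sketch = {
    "type": "select",
    "columns": ["name", "age"],
    "table": "users",
    "where": {"type": "comparison", "left": "age", "operator": ">", "right": 30},
}


def describe(node, depth=0):
    """Return an indented, line-per-field rendering of a nested dict tree."""
    lines = []
    for key, value in node.items():
        if isinstance(value, dict):
            # A nested dict is a child node: print its key, then recurse
            lines.append("  " * depth + f"{key}:")
            lines.extend(describe(value, depth + 1))
        else:
            lines.append("  " * depth + f"{key}: {value}")
    return lines


print("\n".join(describe(ast_sketch)))
```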
  

In [None]:
import csv
import re
from typing import List, NamedTuple
from enum import Enum, auto

## Tokenize

In [None]:
class TokenType(Enum):
    SELECT = auto()
    FROM = auto()
    WHERE = auto()
    JOIN = auto()
    ON = auto()
    ORDER = auto()
    BY = auto()
    GROUP = auto()
    HAVING = auto()
    INSERT = auto()
    UPDATE = auto()
    DELETE = auto()
    IDENTIFIER = auto()
    STRING = auto()
    NUMBER = auto()
    OPERATOR = auto()
    PUNCTUATION = auto()
    WHITESPACE = auto()  # so we can ignore it in further processing

In [None]:
# Define a dictionary for quick keyword lookup
SQL_KEYWORDS = {
	'SELECT': TokenType.SELECT,
	'FROM': TokenType.FROM,
	'WHERE': TokenType.WHERE,
	'JOIN': TokenType.JOIN,
	'ON': TokenType.ON,
	'ORDER': TokenType.ORDER,
	'BY': TokenType.BY,
	'GROUP': TokenType.GROUP,
	'HAVING': TokenType.HAVING,
	'INSERT': TokenType.INSERT,
	'UPDATE': TokenType.UPDATE,
	'DELETE': TokenType.DELETE
}

In [None]:
def tokenize(sql):
    """Yield (text, TokenType) pairs for each token found in the SQL string."""
    token_patterns = r'''
        ('[^']*'|"[^"]*")              # String literals
      | (<=|>=|<>|!=|<|>|=)            # Comparison operators
      | (\d+\.\d*|\.\d+|\d+)           # Numeric values
      | ([,;()*])                      # Punctuation, including * for SELECT *
      | (\b[a-zA-Z_][a-zA-Z0-9_]*\b)   # Identifiers or SQL keywords
      | (\s+)                          # Whitespace
    '''
    token_regex = re.compile(token_patterns, re.VERBOSE)  # VERBOSE allows a multiline regex with comments
    for match in token_regex.finditer(sql):
        token = match.group(0)
        if token.isspace():
            yield (token, TokenType.WHITESPACE)
        elif token in (',', ';', '(', ')', '*'):
            yield (token, TokenType.PUNCTUATION)
        elif token.upper() in SQL_KEYWORDS:
            yield (token.upper(), SQL_KEYWORDS[token.upper()])
        elif re.match(r'^[\'"].*[\'"]$', token):
            yield (token, TokenType.STRING)
        # fullmatch with the same number pattern as above, so "3." and ".5" count as numbers
        elif re.fullmatch(r'\d+\.\d*|\.\d+|\d+', token):
            yield (token, TokenType.NUMBER)
        elif re.fullmatch(r'<=|>=|<>|!=|<|>|=', token):
            yield (token, TokenType.OPERATOR)
        else:
            # Anything else (column names, table names, words such as AND or DESC)
            yield (token, TokenType.IDENTIFIER)

In [None]:
# Example usage:
sql_query_test_01 = "SELECT name, age FROM users WHERE age >= 21 AND status = 'active' ORDER BY age DESC;"
t = tokenize(sql_query_test_01)
print(list(t))
# next(t)

## AST Node Definitions

In [None]:
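class ASTNode:
    def __repr__(self):
        # Show the node type and its attributes so that print(ast) is readable
        return f"{type(self).__name__}({vars(self)})"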

In [None]:
class SelectStatement(ASTNode):
    def __init__(self, columns, table_name, where_clause=None):
        self.columns = columns  # List of column names or '*'
        self.table_name = table_name  # Name of the table
        self.where_clause = where_clause  # WhereClause node or None

In [None]:
class WhereClause(ASTNode):
    def __init__(self, condition):
        self.condition = condition  # This could be a more complex structure in a full implementation

## Parser

In [None]:
import re

# Warm-up: a tiny arithmetic tokenizer to illustrate recursive-descent parsing.
# It has its own name so that it does not overwrite the SQL tokenize() defined above.
def tokenize_arithmetic(expression):
    token_specification = [
        ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
        ('PLUS',     r'\+'),           # Addition operator
        ('MULT',     r'\*'),           # Multiplication operator
        ('WS',       r'\s+'),          # Whitespace
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    for mo in re.finditer(tok_regex, expression):
        kind = mo.lastgroup
        value = mo.group()
        if kind != 'WS':  # Ignore whitespace
            yield (kind, value)

# Example usage
tokens = tokenize_arithmetic("3 + 5 * 2")
print(list(tokens))



In [None]:
class ArithmeticParser:
    def __init__(self, tokens):
        self.tokens = iter(tokens)
        self.current_token = None
        self.next_token()
    
    def next_token(self):
        """Advance to the next (kind, value) token, or None at end of input."""
        try:
            self.current_token = next(self.tokens)
        except StopIteration:
            self.current_token = None

    def parse_expression(self):
        """Expression ::= Term ((PLUS) Term)*"""
        value = self.parse_term()
        while self.current_token and self.current_token[0] == 'PLUS':
            self.next_token()
            value += self.parse_term()
        return value

    def parse_term(self):
        """Term ::= Factor ((MULT) Factor)*"""
        value = self.parse_factor()
        while self.current_token and self.current_token[0] == 'MULT':
            self.next_token()
            value *= self.parse_factor()
        return value

    def parse_factor(self):
        """Factor ::= NUMBER"""
        if self.current_token and self.current_token[0] == 'NUMBER':
            value = float(self.current_token[1])
            self.next_token()
            return value
        else:
            raise SyntaxError('Expected NUMBER')

# Example usage
tokens = tokenize_arithmetic("3 + 5 * 2")
parser = ArithmeticParser(tokens)
result = parser.parse_expression()
print(result)  # Output: 13.0


In [None]:
# Example Usage
sql_query_test_02 = "SELECT name, age FROM users WHERE age > 30"
# print("SQL Query:\n\t", sql_query_test_02)
# Materialize the generator into a list so it can be printed here and reused by the parser
tokens = list(tokenize(sql_query_test_02))
print("Tokenized:\n\t", tokens)

In [None]:
class Parser:
    """Recursive-descent parser for SELECT <columns> FROM <table> [WHERE <condition>]."""

    def __init__(self, tokens):
        # Tokens are (text, TokenType) tuples as produced by tokenize().
        # Accept any iterable and drop whitespace up front.
        self.tokens = iter([t for t in tokens if t[1] != TokenType.WHITESPACE])
        self.current_token = None
        self._advance()

    def _advance(self):
        """Advance to the next token; current_token is None once input is exhausted."""
        self.current_token = next(self.tokens, None)

    def _type(self):
        """Type of the current token, or None at end of input."""
        return self.current_token[1] if self.current_token else None

    def _value(self):
        """Text of the current token, or None at end of input."""
        return self.current_token[0] if self.current_token else None

    def parse(self):
        """Parse the tokens into an AST."""
        if self._type() != TokenType.SELECT:
            raise SyntaxError("Query must start with SELECT")
        self._advance()

        columns = self._parse_columns()

        if self._type() != TokenType.FROM:
            raise SyntaxError("Expected FROM after column list")
        self._advance()

        table_name = self._parse_table_name()

        where_clause = None
        if self._type() == TokenType.WHERE:
            self._advance()
            where_clause = self._parse_where_clause()

        return SelectStatement(columns, table_name, where_clause)

    def _parse_columns(self):
        """Parse the columns part of the SELECT statement."""
        columns = []
        if self._value() == '*':
            columns.append('*')
            self._advance()
        else:
            while True:
                if self._type() != TokenType.IDENTIFIER:
                    raise SyntaxError("Expected column name")
                columns.append(self._value())
                self._advance()

                if self._value() != ',':
                    break
                self._advance()  # Skip the comma

        return columns

    def _parse_table_name(self):
        """Parse the table name."""
        if self._type() != TokenType.IDENTIFIER:
            raise SyntaxError("Expected table name")
        table_name = self._value()
        self._advance()
        return table_name

    def _parse_where_clause(self):
        """Parse the WHERE clause."""
        # In a full implementation, this would build an expression tree.
        # For simplicity, we keep the condition as the raw text that follows WHERE,
        # stopping at a semicolon or at an ORDER / GROUP keyword.
        if self._type() != TokenType.IDENTIFIER:
            raise SyntaxError("Expected condition after WHERE")
        parts = []
        while (self.current_token is not None and self._value() != ';'
               and self._type() not in (TokenType.ORDER, TokenType.GROUP)):
            parts.append(self._value())
            self._advance()
        return WhereClause(' '.join(parts))



In [None]:
parser = Parser(tokens)
# print("Parser:\n\t",tokens)
ast = parser.parse()

In [None]:
# Printing the AST for demonstration
print(ast)

In [None]:
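sample_row, eval_expr = {'movieId': '12'}, 'int(movieId) == 12'  # illustrative stand-ins; not defined earlier
eval(eval_expr, sample_row)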

**A note of caution:**   
Python's built-in `eval()` and `exec()` functions are a security risk because they run arbitrary code, so never pass them untrusted input.  
We'll see later how to implement a safer version.  
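
One possible safer direction, sketched here under assumptions and not necessarily the version the later notebooks implement, is to avoid `eval()` entirely: represent a condition as a (column, operator, value) triple and look the operator up in a fixed table built from the standard `operator` module. The names `COMPARATORS` and `matches` are hypothetical.

```python
import operator

# Map supported comparison symbols to functions; anything else is rejected
COMPARATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    "<>": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def matches(row, column, op, value):
    """Return True if row[column] compared with value via op holds."""
    if op not in COMPARATORS:
        raise ValueError(f"Unsupported operator: {op}")
    if column not in row:
        raise KeyError(f"Unknown column: {column}")
    # CSV values arrive as strings, so cast the cell to the type of the literal
    cell = type(value)(row[column])
    return COMPARATORS[op](cell, value)


# Hypothetical row, shaped like one produced by csv.DictReader
sample = {"movieId": "12", "title": "Sample title"}
print(matches(sample, "movieId", "=", 12))
print(matches(sample, "movieId", ">", 20))
```

Because only whitelisted operators and known columns are accepted, a malicious string in the WHERE clause can at worst raise an error; it cannot run code.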

In [None]:
# What would our simple query look like?
# Let's say something that gets us the movie with a specific ID.
table_metadata = {
	'name': 'movies',
	'columns': ['movieId', 'title', 'genres']
}

where_clause = 'int(movieId) == 12'

a_simple_query = where_clause

In [None]:
# Execute a SELECT query given a dictionary with data in it
def execute_select(query, table_data):
	columns = table_metadata['columns']
	table_name = table_metadata['name']
	where_clause = query
	selected_rows = []
	# SELECT * FROM
	data = table_data[table_name]
	for row in data:
		if where_clause:
			# Apply WHERE clause filtering
			if eval(where_clause, row):
				selected_rows.append({col: row[col] for col in columns})
		else:
			selected_rows.append({col: row[col] for col in columns})
	return selected_rows

In [None]:
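with open(file_path_movies, newline='', encoding='utf-8') as f:
    movies_data = {'movies': list(csv.DictReader(f))}  # table name -> list of row dicts
result = execute_select(a_simple_query, movies_data)
print('result: \n', result)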

In [None]:
another_query = 'int(movieId) <= 12'
result = execute_select(another_query, movies_data)
print('result: \n', result)

Wait, was it this simple?   
Yea!

# Next

Building a more feature-rich data engine