# What's Inside a Data Query Engine  
## *Building one from Scratch*  

## Part 1: Starting Simple 
  
![What's Inside a Data Query Engine](./images/dataengine03.png)

### <font color='green'>__Support for Google Colab__  </font>  
    
Open this notebook in Colab using the following button:  
  
<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/00-Python-Collections/01.03%20Fun%20with%20Functools.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>  

  
<font color='green'>Uncomment and execute the cell below to set up and run this notebook on Google Colab.</font>

In [1]:
# # SETUP FOR COLAB: select all the lines below and uncomment (CTRL+/ on Windows)
# # Let's download and unzip the Small MovieLens Dataset into ./data,
# # which is where datalocation below expects to find it
# ! mkdir -p ./data
# ! wget -q https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
# ! unzip ./ml-latest-small.zip -d ./data/

### Get the _Small_ MovieLens Dataset

We'll use the [small MovieLens dataset](https://grouplens.org/datasets/movielens/#:~:text=Small%3A%20100%2C000%20ratings%20and%203%2C600%20tag%20applications) here.

Download it and unzip it into the `data` folder, so that the CSV files end up under `./data/ml-latest-small/`.

This dataset expands to about 3.2 MB on your local disk. 

In [2]:
datalocation = "./data/ml-latest-small/"

In [3]:
# specify file names
file_path_movies = datalocation + "movies.csv"
file_path_links = datalocation + "links.csv"
file_path_ratings = datalocation + "ratings.csv"
file_path_tags = datalocation + "tags.csv"

# Here's what our data engine should be able to do  
* Load the data into memory and capture some metadata (column names, data types, etc.)  
* Accept a query of the form `SELECT (xxx) FROM (xxx) WHERE (xxx)`  
* Parse the query to make sense of it  
* Highlight any errors in the query  
* Build a query plan  
* Optimize the query further by looking at the plan and the metadata  
* Execute the query  
* Show the results  
* Show the cost of running the query  
  
  
_The full set of notebooks also covers JOINs and nested queries, but we'll treat those as intermediate to advanced cases, since they may distract us from the goal of simply understanding how data engines work._

We'll directly use the [CSV module](https://docs.python.org/3/library/csv.html) here just to keep our focus on the data engine itself and not get distracted by the intricacies of loading a CSV file.
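
To see what the loader will be working with, here is a minimal sketch of `csv.DictReader` reading two rows, laid out like `movies.csv`, from an in-memory string instead of a file. The names `sample_csv` and `sample_rows` are made up for this illustration.

```python
import csv
import io

# Two rows in the same layout as movies.csv; sample_csv is a hypothetical name
sample_csv = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
)

# DictReader treats the first line as the header and yields one dict per row
sample_rows = list(csv.DictReader(sample_csv, delimiter=",", quotechar='"'))

print(sample_rows[0]["title"])          # Toy Story (1995)
print(type(sample_rows[0]["movieId"]))  # <class 'str'>
```

Every value comes back as a string, including `movieId`. That is why the WHERE expressions in this notebook wrap `movieId` in `int()` before comparing it with a number.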

# A Naïve Engine  

Let's first build a function to query the data.  
We'll worry about parsing a text query into something our engine understands later.

In [4]:
import csv

In [5]:
# Load CSV data into a Python dictionary: {table_name: [one dict per row]}
def load_csv(file_path, table_name = 'table'):
    data = []
    with open(file_path, "r", encoding="utf-8") as csvfile:
        csvreader = csv.DictReader(csvfile, delimiter=",", quotechar='"')
        for row in csvreader:
            data.append(row)
    return {table_name: data}

In [6]:
movies_data = load_csv(file_path_movies, 'movies')

In [7]:
movies_data['movies'][0:12]

[{'movieId': '1',
  'title': 'Toy Story (1995)',
  'genres': 'Adventure|Animation|Children|Comedy|Fantasy'},
 {'movieId': '2',
  'title': 'Jumanji (1995)',
  'genres': 'Adventure|Children|Fantasy'},
 {'movieId': '3',
  'title': 'Grumpier Old Men (1995)',
  'genres': 'Comedy|Romance'},
 {'movieId': '4',
  'title': 'Waiting to Exhale (1995)',
  'genres': 'Comedy|Drama|Romance'},
 {'movieId': '5',
  'title': 'Father of the Bride Part II (1995)',
  'genres': 'Comedy'},
 {'movieId': '6', 'title': 'Heat (1995)', 'genres': 'Action|Crime|Thriller'},
 {'movieId': '7', 'title': 'Sabrina (1995)', 'genres': 'Comedy|Romance'},
 {'movieId': '8',
  'title': 'Tom and Huck (1995)',
  'genres': 'Adventure|Children'},
 {'movieId': '9', 'title': 'Sudden Death (1995)', 'genres': 'Action'},
 {'movieId': '10',
  'title': 'GoldenEye (1995)',
  'genres': 'Action|Adventure|Thriller'},
 {'movieId': '11',
  'title': 'American President, The (1995)',
  'genres': 'Comedy|Drama|Romance'},
 {'movieId': '12',
  'title

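For the WHERE clause, we need a way to evaluate an expression against a row.   
[Python's built-in eval()](https://docs.python.org/3/library/functions.html#eval) can do this, provided the expression is valid Python syntax: we hand it the row as a dictionary, and the column names in the expression are looked up as variables in that dictionary.  
Let's test this idea first.  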

In [8]:
eval_expr = "int(movieId) == 12"
sample_row =  {'movieId': '12',
  'title': 'Dracula: Dead and Loving It (1995)',
  'genres': 'Comedy|Horror'}

In [9]:
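# pass the row as locals (with an empty globals dict) so that eval() does not
# add a '__builtins__' key to sample_row
eval(eval_expr, {}, sample_row)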

True

**A note of caution:**   
Python's built-in eval() and exec() functions are a security risk because they will run arbitrary code, so they should never be given untrusted input.  
We'll see later how to implement a safer version.  
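
One detail worth knowing: when a dictionary is passed as the `globals` argument, `eval()` adds a `__builtins__` entry to it, so the row is modified as a side effect. The sketch below shows that behaviour, along with one common way to narrow what an expression can reach: an empty `__builtins__` plus a short list of allowed names, with the row passed as `locals`. The names `ALLOWED_NAMES` and `eval_where` are made up for this illustration, and this is not a real sandbox; it only closes the most obvious doors.

```python
# Side effect: eval() inserts '__builtins__' into the dict used as globals
row = {'movieId': '12',
       'title': 'Dracula: Dead and Loving It (1995)',
       'genres': 'Comedy|Horror'}
eval("int(movieId) == 12", row)
print('__builtins__' in row)  # True

# A narrower variant (hypothetical helper, NOT a complete sandbox)
ALLOWED_NAMES = {'int': int, 'float': float, 'str': str, 'len': len}

def eval_where(expression, row):
    # An empty __builtins__ hides open(), __import__() and friends;
    # the row is passed as locals, so it is left unmodified
    restricted_globals = {'__builtins__': {}, **ALLOWED_NAMES}
    return bool(eval(expression, restricted_globals, row))

clean_row = {'movieId': '12',
             'title': 'Dracula: Dead and Loving It (1995)',
             'genres': 'Comedy|Horror'}
print(eval_where("int(movieId) == 12", clean_row))  # True
print('__builtins__' in clean_row)                  # False

# Names outside the allowed list are not visible to the expression
try:
    eval_where("__import__('os').getcwd()", clean_row)
except NameError as err:
    print('blocked:', err)
```

Restricting the names like this is still bypassable by a determined attacker, so treat it as a convenience for trusted input rather than a security boundary.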

In [10]:
# What would our simple query look like?
# Let's say, something that gets us movies with a specific Id.
table_metadata = {
	'name': 'movies',
	'columns': ['movieId', 'title', 'genres']
}

where_clause = 'int(movieId) == 12'

a_simple_query = where_clause
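
The metadata above is written out by hand. As a rough sketch of the "capture some metadata" step, the column names can be read from the first row and a data type guessed by trying conversions on the values. `infer_type` and `build_metadata` are hypothetical helpers and not part of the engine built in this notebook; the sample rows are copied from the output shown earlier.

```python
def infer_type(values):
    # Try the strictest type first; fall back to 'str' if any value fails
    for candidate in (int, float):
        try:
            for value in values:
                candidate(value)
            return candidate.__name__
        except ValueError:
            continue
    return 'str'

def build_metadata(table_name, rows):
    # Column names come from the keys of the first row (all rows share them)
    columns = list(rows[0].keys()) if rows else []
    types = {col: infer_type([r[col] for r in rows]) for col in columns}
    return {'name': table_name, 'columns': columns, 'types': types}

sample_rows = [
    {'movieId': '1', 'title': 'Toy Story (1995)',
     'genres': 'Adventure|Animation|Children|Comedy|Fantasy'},
    {'movieId': '2', 'title': 'Jumanji (1995)',
     'genres': 'Adventure|Children|Fantasy'},
]

inferred_metadata = build_metadata('movies', sample_rows)
print(inferred_metadata['columns'])  # ['movieId', 'title', 'genres']
print(inferred_metadata['types'])    # {'movieId': 'int', 'title': 'str', 'genres': 'str'}
```

Guessing types this way scans every value, which is fine for a dataset of this size; a larger table might only sample the first few rows.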

In [11]:
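# Execute a SELECT * query against a table held in a dictionary.
# For now the "query" is just the WHERE clause, written as a Python expression;
# the table name and columns come from the global table_metadata.
def execute_select(query, table_data):
	columns = table_metadata['columns']
	table_name = table_metadata['name']
	where_clause = query
	selected_rows = []
	# SELECT * FROM <table_name>
	data = table_data[table_name]
	for row in data:
		if where_clause:
			# Apply WHERE clause filtering: the row is passed as locals so that
			# column names resolve to its values and eval() does not add a
			# '__builtins__' key to the row itself
			if eval(where_clause, {}, row):
				selected_rows.append({col: row[col] for col in columns})
		else:
			# No WHERE clause: keep every row
			selected_rows.append({col: row[col] for col in columns})
	return selected_rows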

In [12]:
result = execute_select(a_simple_query, movies_data)
print('result: \n', result)

result: 
 [{'movieId': '12', 'title': 'Dracula: Dead and Loving It (1995)', 'genres': 'Comedy|Horror'}]


In [13]:
another_query = 'int(movieId) <= 12'
result = execute_select(another_query, movies_data)
print('result: \n', result)

result: 
 [{'movieId': '1', 'title': 'Toy Story (1995)', 'genres': 'Adventure|Animation|Children|Comedy|Fantasy'}, {'movieId': '2', 'title': 'Jumanji (1995)', 'genres': 'Adventure|Children|Fantasy'}, {'movieId': '3', 'title': 'Grumpier Old Men (1995)', 'genres': 'Comedy|Romance'}, {'movieId': '4', 'title': 'Waiting to Exhale (1995)', 'genres': 'Comedy|Drama|Romance'}, {'movieId': '5', 'title': 'Father of the Bride Part II (1995)', 'genres': 'Comedy'}, {'movieId': '6', 'title': 'Heat (1995)', 'genres': 'Action|Crime|Thriller'}, {'movieId': '7', 'title': 'Sabrina (1995)', 'genres': 'Comedy|Romance'}, {'movieId': '8', 'title': 'Tom and Huck (1995)', 'genres': 'Adventure|Children'}, {'movieId': '9', 'title': 'Sudden Death (1995)', 'genres': 'Action'}, {'movieId': '10', 'title': 'GoldenEye (1995)', 'genres': 'Action|Adventure|Thriller'}, {'movieId': '11', 'title': 'American President, The (1995)', 'genres': 'Comedy|Drama|Romance'}, {'movieId': '12', 'title': 'Dracula: Dead and Loving It (19

Wait, was it this simple?   
Yea!
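
It stays simple when the `SELECT (xxx)` part is added, too. As a sketch under assumptions, a query can be described by a small dictionary holding the columns to return, the table name, and the WHERE expression. `run_query` and the `'select'`, `'from'` and `'where'` keys are hypothetical names chosen for this illustration.

```python
def run_query(query, tables):
    rows = tables[query['from']]
    where = query.get('where')
    result = []
    for row in rows:
        # The row is passed as locals; an empty dict serves as globals
        if where is None or eval(where, {}, row):
            # Projection: keep only the requested columns
            result.append({col: row[col] for col in query['select']})
    return result

tables = {'movies': [
    {'movieId': '1', 'title': 'Toy Story (1995)',
     'genres': 'Adventure|Animation|Children|Comedy|Fantasy'},
    {'movieId': '2', 'title': 'Jumanji (1995)',
     'genres': 'Adventure|Children|Fantasy'},
    {'movieId': '3', 'title': 'Grumpier Old Men (1995)',
     'genres': 'Comedy|Romance'},
]}

# Roughly: SELECT title FROM movies WHERE movieId <= 2
query = {'select': ['title'], 'from': 'movies', 'where': 'int(movieId) <= 2'}
print(run_query(query, tables))
# [{'title': 'Toy Story (1995)'}, {'title': 'Jumanji (1995)'}]
```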

# Next

Building a more feature-rich data engine.