# Building Fast Queries on a CSV

For this project, we will assume that we own an online laptop store and want to build a way to answer a few different business questions about our inventory.

We will use the 'laptops.csv' file as our inventory. This CSV file was adapted from the [Laptop Prices dataset on Kaggle](https://www.kaggle.com/datasets/muhammetvarl/laptop-price). 

Let's load the data first.

In [1]:
import csv
import numpy as np
import pandas as pd

with open('laptops.csv') as file:
    reader = csv.reader(file)
    rows = list(reader)
    header = rows[0]
    rows = rows[1:]

print(header)

for i in range(5):
    print(rows[i])

['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price']
['6571244', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 2.3GHz', '8GB', '128GB SSD', 'Intel Iris Plus Graphics 640', 'macOS', '1.37kg', '1339']
['7287764', 'Apple', 'Macbook Air', 'Ultrabook', '13.3', '1440x900', 'Intel Core i5 1.8GHz', '8GB', '128GB Flash Storage', 'Intel HD Graphics 6000', 'macOS', '1.34kg', '898']
['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', '575']
['9722156', 'Apple', 'MacBook Pro', 'Ultrabook', '15.4', 'IPS Panel Retina Display 2880x1800', 'Intel Core i7 2.7GHz', '16GB', '512GB SSD', 'AMD Radeon Pro 455', 'macOS', '1.83kg', '2537']
['8550527', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 3.1GHz', '8GB', '256GB SSD',

## Inventory Class

We would like to create a class that represents our inventory. The methods in that class will implement the queries that we want to answer about our inventory. We will also preprocess that data to make those queries run faster.

Here are some queries that we will want to answer:

- Given a laptop id, find the corresponding data.
- Given an amount of money, find whether there are two laptops whose total price is that given amount.
- Identify all laptops whose price falls within a given budget.

Let's start by implementing the constructor. It will take the name of the CSV file as argument and then read the rows contained in it.

In [2]:
class Inventory():
    def __init__(self, csv_filename):
        with open(csv_filename) as file:
            reader = csv.reader(file)  # read the file
            rows = list(reader)  # convert the file to a list
        self.header = rows[0]  # first row is the header
        self.rows = rows[1:]  # the rest of the rows are the data
        for row in self.rows:
            row[-1] = int(row[-1])  # Convert the price to an integer (last row)

Let's print the header and the number of rows.

In [3]:
inventory = Inventory('laptops.csv')
print(inventory.header)
print(len(inventory.rows))

['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price']
1303


## Finding a Laptop From the Id

We will now update the `Inventory` class bit by bit and make a couple of improvements.

The first thing that we will implement is a way to look up a laptop from a given identifier. In this way, when a customer comes to our store with a purchase slip, we can quickly identify the laptop to which it corresponds.

For this, we will write a function named `get_laptop_from_id()`. This function will take as argument the identifier of the laptop and return the full row of the laptop with that id.

In [4]:
class Inventory():
    
    def __init__(self, csv_filename):
        with open(csv_filename) as file:
            reader = csv.reader(file)  # read the file
            rows = list(reader)  # convert the file to a list
        self.header = rows[0]  # first row is the header
        self.rows = rows[1:]  # the rest of the rows are the data
        for row in self.rows:
            row[-1] = int(row[-1])  # Convert the price to an integer (last row)
        
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if row[0] == laptop_id:  # if there's a matching laptop_id, return the row
                return row
        return None  # Default behaviour: return nothing

Let's test this addition.

In [5]:
Inventory('laptops.csv').get_laptop_from_id('3362737')

['3362737',
 'HP',
 '250 G6',
 'Notebook',
 '15.6',
 'Full HD 1920x1080',
 'Intel Core i5 7200U 2.5GHz',
 '8GB',
 '256GB SSD',
 'Intel HD Graphics 620',
 'No OS',
 '1.86kg',
 575]

In [6]:
Inventory('laptops.csv').get_laptop_from_id('3362736')  # does not find a match

Nice, our lookup works as it should!

## Improving Id Lookups

The current lookup algorithm requires us to look at every single row to find the one that we are looking for (or decide that such a row does not exist). This algorithm has time complexity O(R) where R is the number of rows.

But as we have learned, we can solve this problem more efficiently by preprocessing the data. If we would use a set, we can check in constant time whether a given identifier exists. But we also want to retrieve the remaining row information, and a dictionary seems more suited to that. Dictionaries have the same fast lookup properties that sets have, but allow us to associate values to the keys.

Let us preprocess the data so that IDs become the keys and the rows become the dictionary values.

In [7]:
class Inventory():

    def __init__(self, csv_filename):
        with open(csv_filename) as file:
            reader = csv.reader(file)  # read the file
            rows = list(reader)  # convert the file to a list
        self.header = rows[0]  # first row is the header
        self.rows = rows[1:]  # the rest of the rows are the data
        for row in self.rows:
            row[-1] = int(row[-1])  # Convert the price to an integer (last row)
        self.id_to_row = {}  # assign an empty dictionary
        for row in self.rows:
            self.id_to_row[row[0]] = row  # assign the laptop_id as the key and the row as the value

    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if row[0] == laptop_id:  # if there's a matching laptop_id, return the row
                return row
        return None  # Default behaviour: return nothing

    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]  # if the laptop_id is in the dictionary, return the row
        return None  # Default behaviour: return nothing

Let's test this improved lookup on the same values as before.

In [8]:
Inventory('laptops.csv').get_laptop_from_id_fast('3362737')

['3362737',
 'HP',
 '250 G6',
 'Notebook',
 '15.6',
 'Full HD 1920x1080',
 'Intel Core i5 7200U 2.5GHz',
 '8GB',
 '256GB SSD',
 'Intel HD Graphics 620',
 'No OS',
 '1.86kg',
 575]

In [9]:
Inventory('laptops.csv').get_laptop_from_id_fast('3362736') # does not find a match

## Comparing the Performance

The 'get_laptop_from_id()' method has time complexity *O(R)* where *R* is the number of rows. In contrast, the new implementation is time complexity *O(1)*.

Let's experiment to compare the performance of the two methods. The idea is to generate random IDs using the random module. Then, use both methods to lookup these same IDs. We will use the time module to measure the execution time of each lookup and, for each method, add all times together.

In [10]:
import time
import random

ids = [str(random.randint(1000000, 9999999)) for _ in range(10000)]

inventory = Inventory('laptops.csv')
total_time_no_dict = 0

for id in ids:
   start = time.time()
   inventory.get_laptop_from_id(id)
   end = time.time()
   total_time_no_dict += end - start
    
total_time_dict = 0

for id in ids:
   start = time.time()
   inventory.get_laptop_from_id_fast(id)
   end = time.time()
   total_time_dict += end - start
    
print(total_time_no_dict)
print(total_time_dict)
print(total_time_dict < total_time_no_dict)

1.0860135555267334
0.0045986175537109375
True


As we can see here, the lookup method using the dictionary is by far faster than the method looping through the list.

## Two Laptop Promotion

Sometimes, the store offers a promotion where you give a gift card. A customer can use the gift to buy up to two laptops. To avoid having to keep track of what was already spent, the gift card has a single time usage. This means that, even if there is leftover money, it cannot be used anymore.

Assume the prices of three laptops are \\$1,339, \\$898, and \\$575. Say we offered a gift card of \\$2,500. Since a customer can buy, at most, two laptops with a gift card, the maximum they can spend is  \\$2,237 (\\$1,339 plus \\$898). Therefore, they might feel cheated because, no matter how they spend their gift card, they cannot spend the full \\$2,500.

We don't want to make a customer feel cheated, so whenever you issue a gift card, you want to make sure that there is at least one way to spend it in full. In other words, before issuing a gift card for D dollars, you want to make sure that either there is a laptop that costs exactly D dollars or two laptops whose costs add up to precisely D dollars.

We will therefore now write a function that, given a dollar amount, checks whether it is possible to spend precisely that amount by purchasing up to two laptops.

In [11]:
class Inventory():

    def __init__(self, csv_filename):
        with open(csv_filename) as file:
            reader = csv.reader(file)  # read the file
            rows = list(reader)  # convert the file to a list
        self.header = rows[0]  # first row is the header
        self.rows = rows[1:]  # the rest of the rows are the data
        for row in self.rows:
            row[-1] = int(row[-1])  # Convert the price to an integer (last row)
        self.id_to_row = {}  # assign an empty dictionary
        for row in self.rows:
            self.id_to_row[row[0]] = row  # assign the laptop_id as the key and the row as the value

    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if row[0] == laptop_id:  # if there's a matching laptop_id, return the row
                return row
        return None  # Default behaviour: return nothing

    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]  # if the laptop_id is in the dictionary, return the row
        return None  # Default behaviour: return nothing

    def check_promotion_dollars(self, dollars):
        for row in self.rows:
            if row[-1] == dollars:
                return True
        
        for row1 in self.rows:
            for row2 in self.rows:
                if row1[-1] + row2[-1] == dollars:
                    return True
        return False
 
inventory = Inventory('laptops.csv')
print(inventory.check_promotion_dollars(1000))
print(inventory.check_promotion_dollars(442))

True
False


While we can find one or two laptops costing \\$1000, there is no laptop costing \\$442 dollars.

## Optimizing Laptop Promotion

As before, we can optimize our check for the eligibility of the promotion by preprocessing the data.

Since we only care about whether or not there is a solution, we can store all laptops prices in a set when we initialize the inventory. Then we can check in constant time whether there is a laptop with a given price.

In [12]:
class Inventory():

    def __init__(self, csv_filename):
        with open(csv_filename) as file:
            reader = csv.reader(file)  # read the file
            rows = list(reader)  # convert the file to a list
        self.header = rows[0]  # first row is the header
        self.rows = rows[1:]  # the rest of the rows are the data
        for row in self.rows:
            row[-1] = int(row[-1])  # Convert the price to an integer (last row)
        self.id_to_row = {}  # assign an empty dictionary
        for row in self.rows:
            self.id_to_row[row[0]] = row  # assign the laptop_id as the key and the row as the value
        self.prices = set()  # assign an empty set
        for row in self.rows:
            self.prices.add(row[-1])  # add the price to the set

    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if row[0] == laptop_id:  # if there's a matching laptop_id, return the row
                return row
        return None  # Default behaviour: return nothing

    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]  # if the laptop_id is in the dictionary, return the row
        return None  # Default behaviour: return nothing

    def check_promotion_dollars(self, dollars):
        for row in self.rows:
            if row[-1] == dollars:  # is there at least on laptop with the same price?
                return True

        for row1 in self.rows:
            for row2 in self.rows:
                if row1[-1] + row2[-1] == dollars:  # is there a pair of laptops that add up to the price?
                    return True
        return False

    def check_promotion_dollars_fast(self, dollars):
        if dollars in self.prices:  # is there at least on laptop with the same price?
            return True

        for price in self.prices:
            if dollars - price in self.prices:  # is there a pair of laptops that add up to the price?
                return True
        return False
    
inventory = Inventory('laptops.csv')
print(inventory.check_promotion_dollars(1000))
print(inventory.check_promotion_dollars(442))

True
False


We see, this works. Just how fast?

##  Comparing Promotion Functions

Let's compare the performance of the last two functions that we wrote.

In [13]:
prices = [random.randint(100, 5000) for _ in range(100)]

inventory = Inventory('laptops.csv')
total_time_no_set = 0

for price in prices:
   start = time.time()
   inventory.check_promotion_dollars(price)
   end = time.time()
   total_time_no_set += end - start
    
total_time_set = 0

for price in prices:
   start = time.time()
   inventory.check_promotion_dollars_fast(price)
   end = time.time()
   total_time_dict += end - start
    
print(total_time_no_set)
print(total_time_set)
print(total_time_set < total_time_no_set)

1.15433931350708
0
True


We find again that preprocessing the data, in this case generating a set, makes the check a lot faster.

## Finding Laptops Within a Budget

We have learned previously how to use binary search to find an element in a sorted list quickly. We are going to leverage and extend that algorithm to help a customer find all laptops that fall within their budget.

More formally, we want to write a method that efficiently answers the query: Given a budget of D dollars, find all laptops whose price it at most D.

If we sort all laptops by price, we can use binary search to identify the first laptop in the sorted list with a price larger than D. We need to make sure that our binary search finds the first one on the list. Then, the result of the query will consist of all laptops whose index in the sorted list is smaller than the index of the first laptop whose price is higher than D dollars.

In [14]:
class Inventory():

    def __init__(self, csv_filename):
        with open(csv_filename) as file:
            reader = csv.reader(file)  # read the file
            rows = list(reader)  # convert the file to a list
        self.header = rows[0]  # first row is the header
        self.rows = rows[1:]  # the rest of the rows are the data
        for row in self.rows:
            row[-1] = int(row[-1])  # Convert the price to an integer (last row)
        self.id_to_row = {}  # assign an empty dictionary
        for row in self.rows:
            self.id_to_row[row[0]] = row  # assign the laptop_id as the key and the row as the value
        self.prices = set()  # assign an empty set
        for row in self.rows:
            self.prices.add(row[-1])  # add the price to the set
        self.rows_by_price = sorted(self.rows, key=lambda row: row[-1])  # sort the rows by price

    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if row[0] == laptop_id:  # if there's a matching laptop_id, return the row
                return row
        return None  # Default behaviour: return nothing

    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]  # if the laptop_id is in the dictionary, return the row
        return None  # Default behaviour: return nothing

    def check_promotion_dollars(self, dollars):
        for row in self.rows:
            if row[-1] == dollars:  # is there at least one laptop with the same price?
                return True

        for row1 in self.rows:
            for row2 in self.rows:
                if row1[-1] + row2[-1] == dollars:  # is there a pair of laptops that add up to the price?
                    return True
        return False

    def check_promotion_dollars_fast(self, dollars):
        if dollars in self.prices:  # is there at least one laptop with the same price?
            return True

        for price in self.prices:
            if dollars - price in self.prices:  # is there a pair of laptops that add up to the price?
                return True
        return False

    def find_first_laptop_more_expensive(self, price):
        range_start = 0  # define the start of the range
        range_end = len(self.rows_by_price) - 1  # define the end of the range

        while range_start < range_end:  # within the range
            range_middle = (range_start + range_end) // 2  # find the middle of the range
            value = self.rows_by_price[range_middle][-1]  # find the price of the middle laptop

            if value > price:  # if the price is greater than the target price, move the end of the range to the middle
                range_end = range_middle
            else:
                range_start = range_middle + 1  # otherwise, move the start of the range to the middle + 1

        if self.rows_by_price[range_start][-1] <= price:
            return -1  # if there is no solution, return -1
        return range_start

inventory = Inventory('laptops.csv')
print(inventory.find_first_laptop_more_expensive(1000))  # returns the index 683
print(inventory.find_first_laptop_more_expensive(10000))  # returns -1

683
-1


We find that the next laptop more expensive than 1000 dollars is located at index 683. Also, we do not find a laptop that costs more than 10000 dollars.