# Building Fast Queries on a CSV

In this project, we would making use of an adapted csv file from the [laptop prices dataset from Kaggle](https://www.kaggle.com/ionaskel/laptop-prices), with changes to the IDs and converting the prices from float to integer. The accompanying data dictionary describing each column is as follows:
* ID: A unique identifier for the laptop.
* Company: The name of the company that produces the laptop.
* Product: The name of the laptop.
* TypeName: The type of laptop.
* Inches: The size of the screen in inches.
* ScreenResolution: The resolution of the screen.
* CPU: The laptop CPU.
* RAM: The amount of RAM in the laptop.
* Memory: The size of the hard drive.
* GPU: The graphics card name.
* OpSys: The name of the operating system.
* Weight: The laptop weight.
* Price: The price of the laptop.

Before we get started, let's do a quick exploration of the data to better understand what's stored within

In [1]:
# Reading in the file and converting to a list of lists
import csv

with open('laptops.csv') as file:
    reader = csv.reader(file)
    data = list(reader)
    header = data[0]
    rows = data[1:]    

In [2]:
# Column headers
print(header)

['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price']


In [3]:
# First five rows
print(rows[:5])

[['6571244', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 2.3GHz', '8GB', '128GB SSD', 'Intel Iris Plus Graphics 640', 'macOS', '1.37kg', '1339'], ['7287764', 'Apple', 'Macbook Air', 'Ultrabook', '13.3', '1440x900', 'Intel Core i5 1.8GHz', '8GB', '128GB Flash Storage', 'Intel HD Graphics 6000', 'macOS', '1.34kg', '898'], ['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', '575'], ['9722156', 'Apple', 'MacBook Pro', 'Ultrabook', '15.4', 'IPS Panel Retina Display 2880x1800', 'Intel Core i7 2.7GHz', '16GB', '512GB SSD', 'AMD Radeon Pro 455', 'macOS', '1.83kg', '2537'], ['8550527', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 3.1GHz', '8GB', '256GB SSD', 'Intel Iris Plus Graphics 650', 'macOS', '1.37kg', '1803']]


The goal of this project is to implement a class that represesnts the inventory of an online laptop store such that the methods of the class can implement queries to answer questions about the store's inventory. To make these queries run faster, we would preprocess the data to improve time complexity.

# Class Implementation

In [4]:
class Inventory():
    def __init__(self, csv_filename):
        with open('laptops.csv') as file:
            reader = csv.reader(file)
            data = list(reader)
            self.header = data[0]
            self.rows = data[1:]
            for row in self.rows:
                row[-1] = int(row[-1])

In [5]:
test_instance = Inventory('laptops.csv')

In [6]:
print(test_instance.header)

['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price']


In [7]:
print(test_instance.rows[0])

['6571244', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 2.3GHz', '8GB', '128GB SSD', 'Intel Iris Plus Graphics 640', 'macOS', '1.37kg', 1339]


In [8]:
print(len(test_instance.rows))

1303


# Finding a Laptop From ID

Given a unique ID identifier, we would want to look up the laptop it corresponds to.

In [9]:
class Inventory():
    def __init__(self, csv_filename):
        with open('laptops.csv') as file:
            reader = csv.reader(file)
            data = list(reader)
            self.header = data[0]
            self.rows = data[1:]
            for row in self.rows:
                row[-1] = int(row[-1])
                
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if row[0] == laptop_id:
                return row
        return None

In [10]:
test_instance = Inventory('laptops.csv')

In [11]:
print(test_instance.get_laptop_from_id('3362737'))

['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575]


In [12]:
print(test_instance.get_laptop_from_id('3362736'))

None


## Faster Lookup

However, looking at the above implemented algorithm, we are required to look at every single row to find the one that we are looking for, resulting in a time complexity of O(R) where R is the number of rows. The time complexity can be improved to constant time O(1) by either implementing a set or a dictionary which has a constant-time lookup. A dictionary would be used in this case since we are able to associate values to the keys, so the key would be the ID and the values would be the rows. However, this comes at the expense of space complexity as the instantiation of the class would be longer because the dictionary has to be created.

In [13]:
class Inventory():
    def __init__(self, csv_filename):
        with open('laptops.csv') as file:
            reader = csv.reader(file)
            data = list(reader)
            self.header = data[0]
            self.rows = data[1:]
            self.id_to_row = {}
            for row in self.rows:
                row[-1] = int(row[-1])
                self.id_to_row[row[0]] = row[1:]
                
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if row[0] == laptop_id:
                return row
        return None
    
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        else:
            return None

In [14]:
test_instance = Inventory('laptops.csv')

In [15]:
print(test_instance.get_laptop_from_id('3362737'))

['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575]


In [16]:
print(test_instance.get_laptop_from_id('3362736'))

None


## Comparing Performance

In [17]:
import time
import random

# Creating random IDs
ids = [str(random.randint(1000000, 9999999)) for _ in range(10000)]

In [18]:
test_instance = Inventory('laptops.csv')
total_time_no_dict = 0   # variable to aggregate times of calling get_laptop_from_id() method
total_time_dict = 0      # variable to aggregate times of calling get_laptop_from_id_fast() method

In [19]:
for id in ids:
    start = time.time()
    test_instance.get_laptop_from_id(id)
    end = time.time()
    runtime = end - start
    total_time_no_dict += runtime

In [20]:
for id in ids:
    start = time.time()
    test_instance.get_laptop_from_id_fast(id)
    end = time.time()
    runtime = end - start
    total_time_dict += runtime

In [21]:
print(total_time_no_dict)

0.7491128444671631


In [22]:
print(total_time_dict)

0.0038814544677734375


As confirmed above, the implementation of the dictionary which has a constant time lookup is substantially faster than the original implementation by looping through the list of rows.

# Checking for Purchases with No Changes

Sometimes, the store offers a promotion where a gift card is given. A customer uses the gift card to buy up to two laptops. To avoid having to track what was already spent, the gift card can only be used once and any leftover amount cannot be used for the next purchase. That is, the gift card can only be used in a single purchase for up to two laptops. Therefore, to avoid having customers feeling cheated since they are unable to use any remaining amount if the laptop(s) do not cover the full gift card amount, we would want to make sure that either there is a laptop or two laptops whose prices equal precisely to the gift card amount.

In [23]:
class Inventory():
    def __init__(self, csv_filename):
        with open('laptops.csv') as file:
            reader = csv.reader(file)
            data = list(reader)
            self.header = data[0]
            self.rows = data[1:]
            self.id_to_row = {}
            for row in self.rows:
                row[-1] = int(row[-1])
                self.id_to_row[row[0]] = row[1:]
                
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if row[0] == laptop_id:
                return row
        return None
    
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        else:
            return None
        
    def check_promotion_dollars(self, dollars):
        for row in self.rows:
            if row[-1] == dollars:
                return True
        for i in range(len(self.rows)):
            for j in range(i, len(self.rows)):
                if rows[i][-1] + rows[j][-1] == dollars:
                    return True
        return False

In [24]:
test_instance = Inventory('laptops.csv')

In [25]:
print(test_instance.check_promotion_dollars(1000))

True


In [26]:
print(test_instance.check_promotion_dollars(442))

False


## Faster Lookup

Again, this implementation of algorithm can be improved in terms of time complexity by making use of a set which allows for constant time lookup.

In [27]:
class Inventory():
    def __init__(self, csv_filename):
        with open('laptops.csv') as file:
            reader = csv.reader(file)
            data = list(reader)
            self.header = data[0]
            self.rows = data[1:]
            self.id_to_row = {}
            self.prices = set()
            for row in self.rows:
                row[-1] = int(row[-1])
                self.id_to_row[row[0]] = row[1:]
                self.prices.add(row[-1])
                
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if row[0] == laptop_id:
                return row
        return None
    
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        else:
            return None
        
    def check_promotion_dollars(self, dollars):
        for row in self.rows:
            if row[-1] == dollars:
                return True
        for i in range(len(self.rows)):
            for j in range(i, len(self.rows)):
                if rows[i][-1] + rows[j][-1] == dollars:
                    return True
        return False
    
    def check_promotion_dollars_fast(self, dollars):
        if dollars in self.prices:
            return True
        for price in self.prices:
            price2 = dollars - price
            if price2 in self.prices:
                return True
        return False

In [28]:
test_instance = Inventory('laptops.csv')

In [29]:
print(test_instance.check_promotion_dollars_fast(1000))

True


In [30]:
print(test_instance.check_promotion_dollars_fast(442))

False


## Comparing Performance

In [31]:
dollars = [random.randint(100,5000) for _ in range(100)]

In [32]:
test_instance = Inventory('laptops.csv')
total_time_no_set = 0   # variable to aggregate times of calling check_promotion_dollars() method
total_time_set = 0      # variable to aggregate times of calling check_promotion_dollars_fast() method

In [33]:
for amt in dollars:
    start = time.time()
    test_instance.check_promotion_dollars(amt)
    end = time.time()
    runtime = end - start
    total_time_no_set += runtime

In [34]:
for amt in dollars:
    start = time.time()
    test_instance.check_promotion_dollars_fast(amt)
    end = time.time()
    runtime = end - start
    total_time_set += runtime

In [35]:
print(total_time_no_set)

20.375697374343872


In [36]:
print(total_time_set)

0.00038170814514160156


As confirmed above, the implementation of the set which has a constant time lookup is substantially faster than the original implementation by double looping through the rows.

# Finding Laptops Within Budget

Next, we need to implement an algorithm that allows a customer to find all laptops that fall within his/her budget. That is, given a budget of D dollars, find all laptops whose price is less than or equal to D dollars. Equivalently, finding the first laptop within a sorted list by prices that is more than D dollars which is how we are going to implement here, by using a binary search algorithm.

In [37]:
class Inventory():
    def __init__(self, csv_filename):
        with open('laptops.csv') as file:
            reader = csv.reader(file)
            data = list(reader)
            self.header = data[0]
            self.rows = data[1:]
            self.id_to_row = {}
            self.prices = set()
            for row in self.rows:
                row[-1] = int(row[-1])
                self.id_to_row[row[0]] = row[1:]
                self.prices.add(row[-1])
            self.rows_by_price = sorted(self.rows, key = lambda row: row[-1])
                
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if row[0] == laptop_id:
                return row
        return None
    
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        else:
            return None
        
    def check_promotion_dollars(self, dollars):
        for row in self.rows:
            if row[-1] == dollars:
                return True
        for i in range(len(self.rows)):
            for j in range(i, len(self.rows)):
                if rows[i][-1] + rows[j][-1] == dollars:
                    return True
        return False
    
    def check_promotion_dollars_fast(self, dollars):
        if dollars in self.prices:
            return True
        for price in self.prices:
            price2 = dollars - price
            if price2 in self.prices:
                return True
        return False
    
    def find_first_laptop_more_expensive(self, budget):
        range_start = 0
        range_end = len(self.rows_by_price) - 1
        while range_start < range_end:
            range_middle = (range_end + range_start) // 2
            price = self.rows_by_price[range_middle][-1]
            if price > budget:
                range_end = range_middle
            else:
                range_start = range_middle + 1
        price = self.rows_by_price[range_start][-1]
        if price <= budget:
            return -1
        return range_start

In [38]:
test_instance = Inventory('laptops.csv')

In [39]:
print(test_instance.find_first_laptop_more_expensive(1000))

683


In [40]:
print(test_instance.find_first_laptop_more_expensive(10000))

-1
