# Multithreading & Multiprocessing Speed Test

In a recent interview, I was asked how to read in multiple large JSON files. My mind immediately went to distributed computing, such as PySpark. After further consideration, I realized that there are also ways to speed this up on a single machine. In this notebook we'll look at how multithreading and multiprocessing can be used to read several JSON files at once instead of one after another, and measure whether each approach actually reduces the total read time.

In [1]:
import json
import time
import threading
import multiprocessing

# Five copies of the same JSON file, each about 30 MB
file_list = [
    "./player_data.json",
    "./player_data2.json",
    "./player_data3.json",
    "./player_data4.json",
    "./player_data5.json",
]

def read_file(file_path):
    """Parse one JSON file and return its contents."""
    with open(file_path, 'r') as file:
        # json.load parses directly from the open file object
        return json.load(file)

start = time.time()
read_file(file_list[0])
end = time.time()
print(f'reading one json file took {end - start} seconds')

reading one json file took 0.10232782363891602 seconds


In [2]:

def simple_reader(file_paths):
    results = {}
    # Iterate over the argument, not the global file_list
    for file_path in file_paths:
        results[file_path] = read_file(file_path)
    return results

start = time.time()
simple_reader(file_list)
end = time.time()

print(f'Reading in the json list took {end - start} seconds')

Reading in the json list took 0.41242122650146484 seconds


Reading five files took roughly four times as long as reading one (0.41 s versus 0.10 s), close to the five-fold increase we would expect from reading copies of the same file one after another. To speed this up, we can try using multithreading to read the files concurrently.

In [3]:
def multi_threaded_reader(file_paths):
    threads = []
    results = {}

    def read_file_thread(file_path):
        results[file_path] = read_file(file_path)
    
    for file_path in file_paths:
        thread = threading.Thread(target=read_file_thread, args=(file_path,))
        threads.append(thread)
        thread.start()
    
    for thread in threads:
        thread.join()
    
    return results

start = time.time()
multi_threaded_reader(file_list)
end = time.time()
print(f'multi threaded read took {end - start} seconds')

multi threaded read took 0.4068269729614258 seconds


Now, that's not what we wanted: multithreading took just as long as the sequential read. The reason lies in how the CPython interpreter works. From the `threading` module docs:

"CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation)."

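To make the quoted limitation concrete, here is a minimal sketch that is not part of the original notebook. It runs two hypothetical workloads on four threads each: `wait_task`, which sleeps and therefore releases the GIL, and `cpu_task`, a pure-Python loop that holds it. The function names, task sizes, and thread count are illustrative assumptions. Parsing a large JSON string is mostly CPU work, so it likely behaves more like `cpu_task`, which would explain the result above.

```python
import threading
import time


def wait_task():
    # time.sleep releases the GIL, standing in for a blocking I/O wait.
    time.sleep(0.2)


def cpu_task():
    # A pure-Python loop holds the GIL while it runs.
    total = 0
    for i in range(200_000):
        total += i * i
    return total


def run_in_threads(task, n_threads=4):
    """Run `task` on n_threads threads and return the elapsed wall time."""
    threads = [threading.Thread(target=task) for _ in range(n_threads)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


def run_sequentially(task, n_times=4):
    """Run `task` n_times in a row and return the elapsed wall time."""
    start = time.perf_counter()
    for _ in range(n_times):
        task()
    return time.perf_counter() - start


wait_sequential = run_sequentially(wait_task)
wait_threaded = run_in_threads(wait_task)
cpu_sequential = run_sequentially(cpu_task)
cpu_threaded = run_in_threads(cpu_task)

print(f"waiting:  sequential {wait_sequential:.2f}s, threaded {wait_threaded:.2f}s")
print(f"cpu work: sequential {cpu_sequential:.2f}s, threaded {cpu_threaded:.2f}s")
```

On a standard (GIL-enabled) CPython build, the threaded waiting case should finish in roughly the time of a single sleep, while the threaded CPU case should take about as long as the sequential one.
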
To get around this lock, we can use multiprocessing, which starts multiple processes. Each process has its own Python interpreter and its own GIL, so the processes can parse files in parallel on separate CPU cores. This carries more overhead than multithreading, and sharing data between processes is harder than sharing it between threads. For reading large files, though, the gain from true parallelism is worth the tradeoff.
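The data-sharing difference can be seen in a small sketch. It assumes a platform where the `fork` start method is available (Linux or macOS), the same assumption the notebook's multiprocessing cell makes. The `shared` dict and the `record` function are hypothetical names used only for illustration.

```python
import multiprocessing
import os
import threading

shared = {}


def record(key):
    # Store the PID of whichever process runs this function.
    shared[key] = os.getpid()


# A thread runs inside the parent process, so it writes to the parent's dict.
thread = threading.Thread(target=record, args=("thread",))
thread.start()
thread.join()

# A forked child process works on its own copy of `shared`;
# its write never reaches the parent.
ctx = multiprocessing.get_context("fork")  # assumption: fork is available
process = ctx.Process(target=record, args=("process",))
process.start()
process.join()

print(sorted(shared))  # only the thread's key is present in the parent
```

To get data back from a child process, it has to be sent explicitly, for example through a `multiprocessing.Queue`, a `Pipe`, or the return values of a process pool.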

In [4]:
def multi_process_read(file_paths):
    processes = []
    results = {}

    def read_file_process(file_path):
        # Runs in a child process. Each child has its own copy of `results`,
        # so this assignment is not visible in the parent process.
        results[file_path] = read_file(file_path)

    for file_path in file_paths:
        process = multiprocessing.Process(target=read_file_process, args=(file_path,))
        processes.append(process)
        process.start()
    for process in processes:
        process.join()

# 'fork' lets the nested function be used as a process target
# (this start method is not available on Windows).
multiprocessing.set_start_method('fork', force=True)
start = time.time()
multi_process_read(file_list)
end = time.time()
print(f'multiprocessing read took {end - start} seconds')

multiprocessing read took 0.10682511329650879 seconds


That is a much better runtime: about 0.11 seconds, close to the time for reading a single file. Keep in mind that this timing covers only the reads inside the child processes; the parsed data is never sent back to the parent, and doing so would add serialization overhead. With multiprocessing we can increase file-reading throughput on a single machine. To go further, something like PySpark can spread the work across multiple machines, increasing parallelism and throughput again.
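To get the parsed data back into the parent process, which the `multiprocessing.Process` cell above does not do because each child fills its own copy of `results`, one option is a process pool. The sketch below is an assumption about how that could look, not the notebook's measured method: `pool_reader` is a hypothetical helper, the `fork` start method is assumed to be available as in the cell above, and small temporary files stand in for the 30 MB `player_data` files. No timing is claimed for it, since returning large objects adds pickling cost.

```python
import json
import multiprocessing
import os
import tempfile
from concurrent.futures import ProcessPoolExecutor


def read_file(file_path):
    """Parse one JSON file and return its contents."""
    with open(file_path, "r") as file:
        return json.load(file)


def pool_reader(file_paths, max_workers=None):
    """Read JSON files in worker processes and return {path: content}."""
    # Assumption: the 'fork' start method is available, as in the notebook.
    ctx = multiprocessing.get_context("fork")
    with ProcessPoolExecutor(max_workers=max_workers, mp_context=ctx) as pool:
        # map() pickles each worker's return value and sends it to the parent.
        contents = list(pool.map(read_file, file_paths))
    return dict(zip(file_paths, contents))


# Small stand-in files so the sketch runs without the notebook's data.
tmp_dir = tempfile.mkdtemp()
demo_paths = []
for i in range(3):
    path = os.path.join(tmp_dir, f"demo_{i}.json")
    with open(path, "w") as file:
        json.dump({"player_id": i, "scores": [i, i + 1]}, file)
    demo_paths.append(path)

results = pool_reader(demo_paths)
print(results[demo_paths[0]])
```

With the notebook's files, the call would be `pool_reader(file_list)`. Whether this still beats the sequential read depends on how much time goes into transferring the parsed objects between processes, so it is worth timing on the real data.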