# pandas vs PostgreSQL

Large data science projects usually begin with the user accessing, manipulating, and retrieving data on a server. The workflow then moves client-side, where the user applies more refined data analysis and processing, typically tasks that are impossible or too clumsy to do on the server. SQL (Structured Query Language) is ubiquitous in industry, and data scientists will have to use it in their work to access data on the server.

The line between what data manipulation should be done server-side with SQL and what should be done client-side with a language like Python is not clear. Further, people who are uncomfortable with SQL or dislike using it may be tempted to keep server-side manipulation to a minimum and move more of that work to the client. The availability of powerful and popular Python libraries for data wrangling and manipulation has only increased that temptation.

This article compares the execution time of several typical data manipulation tasks, such as join and group by, in PostgreSQL and pandas. PostgreSQL, often shortened to Postgres, is an object-relational database management system. It is free and open-source and runs on all major operating systems. Pandas is a Python data manipulation library that offers data structures akin to Excel spreadsheets and SQL tables, along with functions for manipulating those data structures.

The performance of both tools is measured for the following tasks:

- select columns
- filter rows
- group by and aggregation
- load a large CSV
- join two tables
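
As a rough illustration of what each task looks like in the two tools, the sketch below runs the five operations in pandas on toy data and gives a comparable PostgreSQL statement in the comments. The table names (`test_table_a`, `test_table_b`), the column names, and the file path are hypothetical and are not taken from the benchmark code.

```python
import io

import pandas as pd

# Toy stand-ins for the benchmark tables; all names here are hypothetical.
df_a = pd.DataFrame({
    "score_1": [0.1, 0.5, 0.9, 0.3],
    "score_2": [0.7, 0.2, 0.4, 0.8],
    "section": ["A", "B", "A", "B"],
    "id": [0, 1, 2, 3],
})
df_b = pd.DataFrame({"id": [0, 1, 2, 3], "score_3": [1.0, 2.0, 3.0, 4.0]})

# select columns
#   SELECT section, score_1 FROM test_table_a;
selected = df_a[["section", "score_1"]]

# filter rows
#   SELECT * FROM test_table_a WHERE section = 'A';
filtered = df_a[df_a["section"] == "A"]

# group by and aggregation
#   SELECT section, AVG(score_1), MAX(score_2)
#   FROM test_table_a GROUP BY section;
grouped = df_a.groupby("section").agg({"score_1": "mean", "score_2": "max"})

# load a CSV (an in-memory buffer stands in for a file on disk)
#   COPY test_table_a FROM '/path/to/file.csv' WITH (FORMAT csv, HEADER);
csv_buffer = io.StringIO(df_a.to_csv(index=False))
loaded = pd.read_csv(csv_buffer)

# join two tables
#   SELECT * FROM test_table_a
#   JOIN test_table_b ON test_table_a.id = test_table_b.id;
joined = df_a.merge(df_b, on="id")
```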

To explore how these tasks scale with table size, the analysis is run on datasets ranging from ten to ten million rows. The datasets are stored as CSV files and have four columns: the first two hold floats, the third holds strings, and the last holds integers that serve as a unique id. For joining two tables, a second dataset with two columns is used: a unique integer id column and a column of floats.
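
Datasets with this layout could be generated along the following lines. This is a minimal sketch using only the standard library, not the script from the repository; the column names, the set of string values, and the function names are assumptions made for illustration.

```python
import csv
import random


def write_main_dataset(path, n_rows, seed=0):
    """Write a CSV with two float columns, a string column, and a unique integer id."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        # Hypothetical column names; the text only describes the column types.
        writer.writerow(["score_1", "score_2", "section", "id"])
        for row_id in range(n_rows):
            writer.writerow([rng.random(), rng.random(), rng.choice("ABCD"), row_id])


def write_join_dataset(path, n_rows, seed=1):
    """Write the second CSV used for joins: a unique integer id and a float."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "score_3"])
        for row_id in range(n_rows):
            writer.writerow([row_id, rng.random()])
```

Calling each function once per dataset size, with `n_rows` running from ten up to ten million, would produce one pair of files per size.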

For each of the five tasks listed above, the benchmark runs one hundred replicates for each dataset size. The Postgres part of the benchmark was run from Python using psycopg2, a PostgreSQL adapter for Python. Running Postgres directly showed no significant performance difference compared with using psycopg2; however, psycopg2 allows for easier scripting and more flexibility in running the benchmark. The computer used for this study runs Ubuntu 16.04 with 16 GB of RAM and an 8-core processor at 1.8 GHz. The code used for this benchmark can be found on [GitHub](https://github.com/xofbd/pandas_vs_PostgreSQL). The repository contains all the code to run the benchmark, the results as JSON files, and figures plotting the comparison of the two methods.
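
The replicate-based timing described above can be sketched as follows. This is an illustrative harness rather than the code in the repository: the function names are hypothetical, and `make_query_task` only assumes a DB-API 2.0 connection, which is the kind of object `psycopg2.connect(...)` returns.

```python
import statistics
import time


def run_benchmark(task, n_replicates=100):
    """Call task() n_replicates times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(n_replicates):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def make_query_task(connection, sql):
    """Wrap a SQL query as a zero-argument task for run_benchmark.

    `connection` is assumed to be a DB-API 2.0 connection, for example
    the object returned by psycopg2.connect(...).
    """
    def task():
        cursor = connection.cursor()
        cursor.execute(sql)
        cursor.fetchall()  # fetch the rows so retrieval is included in the timing
        cursor.close()
    return task


def summarize(durations):
    """Return the mean and sample standard deviation of the durations."""
    return statistics.mean(durations), statistics.stdev(durations)
```

A pandas task can be passed in the same way, for instance `run_benchmark(lambda: df[["section", "score_1"]])` for a DataFrame `df`, so both tools are timed by the same loop.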

It is important for data scientists to know the limitations of their tools and which approaches are optimal in terms of time. Although smaller projects will not benefit much from a speedup, small percentage gains in more data-intensive applications translate into large absolute time savings.

## Benchmark Results

<img src='figures/select_results_plot.png' style="width: 600px;">
<img src='figures/filter_results_plot.png' style="width: 600px;">
<img src='figures/groupby_agg_results_plot.png' style="width: 600px;">
<img src='figures/load_results_plot.png' style="width: 600px;">
<img src='figures/join_results_plot.png' style="width: 600px;">

## Conclusions

Overall, pandas outperformed Postgres, often running over ten times faster. The only cases in which Postgres performed better were the filtering, group by and aggregation, and joining tasks, and then only for datasets with fewer than a thousand rows. Selecting columns was very efficient in pandas; the number of rows in the dataset had little effect on the execution time. In general, loading and joining were the tasks that took the longest, requiring more than a second for large datasets.

One might take away from the benchmark that pandas is far superior to Postgres, but pandas has its limitations and there is still a need for SQL. With pandas, the data is stored in memory, and it is difficult to load a CSV file larger than half of the system's memory. For the ten-million-row dataset, the file size was about 400 MB, but the dataset only had four columns. Datasets often contain hundreds of columns, resulting in file sizes on the order of 10 GB when the dataset has over a million rows.
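
One way to gauge how close a dataset is to that memory limit is to measure the in-memory footprint of a DataFrame. The sketch below uses `DataFrame.memory_usage`; the helper name and the toy data are illustrative assumptions, not part of the benchmark.

```python
import pandas as pd


def memory_footprint_mb(df):
    """Return the in-memory size of a DataFrame in megabytes."""
    # deep=True also counts the Python string objects held in object columns.
    return df.memory_usage(deep=True).sum() / 1e6


# Toy data with the same column types as the benchmark datasets
# (hypothetical column names).
df = pd.DataFrame({
    "score_1": [0.1, 0.5, 0.9],
    "score_2": [0.7, 0.2, 0.4],
    "section": ["A", "B", "A"],
    "id": [0, 1, 2],
})
footprint = memory_footprint_mb(df)
```

Comparing this figure for a sample of the data against the available RAM gives a rough idea of whether the full dataset will fit.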

Postgres and pandas are ultimately different tools. Postgres and other SQL-based database systems were created to manage databases and offer users a convenient way to access and retrieve data, especially across multiple tables. The server running Postgres would have all the datasets stored as tables across the system, and it would be impractical for a user to transfer the required tables to their own system and use pandas to perform tasks such as join and group by client-side. Pandas was created for data manipulation, and its strength lies in complex data analysis operations. One should not view pandas and Postgres as competing entities but rather as important tools that together make up the data science computational stack.