## Scaling Example: Flat files

We'll work with a [dataset](https://www.kaggle.com/datasets/sunnykakar/spotify-charts-all-audio-data) of Spotify Charts available on Kaggle. Here is the authors' description:


> This is a complete dataset of all the "Top 200" and "Viral 50" charts published globally by Spotify. Spotify publishes a new chart every 2-3 days. This is its entire collection since January 1, 2019. This dataset is a continuation of the Kaggle Dataset: Spotify Charts but contains 29 rows for each row that was populated using the Spotify API.

To download the data, you only need to create a free Kaggle account. The download is a zipped archive that contains a single CSV file called "merged_data.csv". Before loading any data, we can use Python's os module to check how much space the file takes up on disk.

In [13]:
import os

def show_file_size(path):
    # get file size in bytes
    file_size = os.path.getsize(path)

    # convert to a human-readable unit, stopping at the largest unit we know
    units = ['bytes', 'KB', 'MB', 'GB', 'TB']
    i = 0
    while file_size >= 1024 and i < len(units) - 1:
        file_size /= 1024
        i += 1
    print(f'File size: {file_size:.2f} {units[i]}')
    
show_file_size('merged_data.csv')

File size: 25.24 GB


You can also use the "ls -lh" command to list files and their sizes. The -l flag prints one file per line in long format, and -h gives the size in "human" units (KB, MB, GB, etc.) instead of bytes.

In [12]:
ls -lh merged_data.csv

-rw-r--r--@ 1 tracijohnson  staff    25G Apr 15 20:43 merged_data.csv


#### Note on compressed data

While this tutorial is mostly about conserving memory, this is also a good time to talk about conserving disk space. You can compress a CSV file and still load it into Python without decompressing it first. I used gzip to compress the CSV file in this example, and you can see it takes up far less space.

In [14]:
show_file_size('merged_data.csv.gz')

File size: 2.81 GB
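
If you would rather do the compression from Python than from the command line, here is a minimal sketch using the standard library. The helper name `gzip_file` and the 1 MB chunk size are my own choices for illustration, not part of the original workflow. The file is copied in chunks, so the whole CSV never has to fit in memory.

```python
import gzip
import shutil

def gzip_file(src_path, dst_path=None, chunk_size=1024 * 1024):
    """Compress src_path with gzip, streaming it in chunks (hypothetical helper)."""
    if dst_path is None:
        dst_path = src_path + '.gz'
    # copyfileobj reads and writes chunk_size bytes at a time
    with open(src_path, 'rb') as src, gzip.open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path
```

Calling `gzip_file('merged_data.csv')` would write "merged_data.csv.gz" next to the original and leave the original in place.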


However, while we've managed to shrink the data in its archived state, we still need to worry about the size of the uncompressed data. Unfortunately, the compression ratio can vary a lot depending on the type of data and the compression method, so it is not trivial to work out the size of the decompressed data without actually decompressing it.
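
If you only have the compressed file, one way to measure the decompressed size without holding the data in memory is to stream through it and count bytes. This is a sketch, not the only approach: the helper name `uncompressed_size` and the chunk size are assumptions of mine, and it still has to decompress the whole file, so expect it to take a while on an archive this large.

```python
import gzip

def uncompressed_size(path, chunk_size=1024 * 1024):
    """Count decompressed bytes by streaming (hypothetical helper).

    Only one chunk is held in memory at a time.
    """
    total = 0
    with gzip.open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            total += len(chunk)
    return total
```

The gzip format does store an uncompressed size in its footer, but only modulo 4 GiB, so that shortcut cannot be trusted for a file as large as this one.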

If you are working on a laptop, you most likely have 2-8 GB of RAM, which is far less than the roughly 25 GB the uncompressed file occupies on disk. Therefore, we cannot load the entire dataset into memory. So, what do we do?

### Data exploration

We can read a preview of the compressed CSV file directly into Pandas using a couple of additional keyword parameters.

- nrows: number of rows (beyond the header) that you want to load
- compression: the name of the algorithm used to compress the file. By default, Pandas tries to infer this from the file extension, but it's safer to state it explicitly. A gzipped file will usually have the extension '.gz', so we'll put 'gzip' here.


In [42]:
import pandas as pd

df = pd.read_csv('merged_data.csv.gz', 
                 compression='gzip',
                 nrows=10_000)

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           10000 non-null  int64  
 1   title                10000 non-null  object 
 2   rank                 10000 non-null  int64  
 3   date                 10000 non-null  object 
 4   artist               10000 non-null  object 
 5   url                  10000 non-null  object 
 6   region               10000 non-null  object 
 7   chart                10000 non-null  object 
 8   trend                10000 non-null  object 
 9   streams              10000 non-null  float64
 10  track_id             10000 non-null  object 
 11  album                9998 non-null   object 
 12  popularity           10000 non-null  float64
 13  duration_ms          10000 non-null  float64
 14  explicit             10000 non-null  bool   
 15  release_date         10000 non-null  

In [46]:
df.memory_usage().sum() / 1024**2  # total size of the preview in MB

2.145893096923828
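
The figure above is in MB, and it is likely an underestimate: by default, `memory_usage()` counts only the 8-byte pointers stored in `object` columns, not the Python strings they point to. Passing `deep=True` asks Pandas to measure those strings as well. Below is a minimal sketch of the difference, using a small made-up frame (`sample`) and a hypothetical helper (`memory_mb`) rather than the Spotify preview, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

def memory_mb(frame, deep=True):
    """Total memory used by a DataFrame, in MB (hypothetical helper)."""
    return frame.memory_usage(deep=deep).sum() / 1024**2

# Small synthetic frame standing in for the real preview (made-up data)
sample = pd.DataFrame({
    'title': [f'song title number {i}' for i in range(1000)],
    'rank': range(1000),
})

shallow = memory_mb(sample, deep=False)  # pointers only for object columns
deep = memory_mb(sample, deep=True)      # also counts the string contents
print(f'shallow: {shallow:.3f} MB, deep: {deep:.3f} MB')
```

On the real preview, the equivalent call would be `memory_mb(df)`. With this many text columns, the deep figure is the one to use when judging how much of the file fits in RAM.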