## Scaling Example: Flat files

We'll work with a [dataset](https://www.kaggle.com/datasets/sunnykakar/spotify-charts-all-audio-data) of Spotify Charts available on Kaggle. Here's the description according to the authors.


> This is a complete dataset of all the "Top 200" and "Viral 50" charts published globally by Spotify. Spotify publishes a new chart every 2-3 days. This is its entire collection since January 1, 2019. This dataset is a continuation of the Kaggle Dataset: Spotify Charts but contains 29 rows for each row that was populated using the Spotify API.

To download the data, you only need to create a free Kaggle account. The downloaded data is a zipped archive which contains one CSV file called "merged_data.csv". Before loading in any data, we can use the disk utilities via Python's os module to see the size of the file.

In [2]:
import os

def show_file_size(path):
    ## get file size in bytes
    file_size = os.path.getsize(path)

    ## convert to human readable format
    units = ['bytes', 'KB', 'MB', 'GB', 'TB']
    for i in range(len(units)):
        if file_size < 1024:
            break
        file_size /= 1024
    print(f'File size: {file_size:.2f} {units[i]}')
    
show_file_size('merged_data.csv')

File size: 25.24 GB


You can also use the "ls -lh" command to list files along with their sizes. The -l flag prints one file per line in long format (permissions, owner, size, and modification time) and -h reports the size in "human-readable" units (KB, MB, GB, etc.) instead of bytes.

In [3]:
ls -lh merged_data.csv

-rw-r--r--@ 1 tracijohnson  staff    25G Apr 15 20:43 merged_data.csv


#### Note on compressed data

While this tutorial is mostly about conserving memory, now is also a good time to talk about conserving disk space. You can compress CSV files and still load them into Pandas without decompressing them first. I used gzip to compress the CSV file in this example, and you can see it takes up far less space.

In [4]:
show_file_size('merged_data.csv.gz')

File size: 2.81 GB
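
As a minimal sketch of how such an archive could be produced and read back, the snippet below gzips a small CSV with Python's standard gzip and shutil modules and then loads it with Pandas without decompressing it on disk first. The file names and sample data are made up for illustration and are not part of the Spotify dataset.

```python
import gzip
import os
import shutil
import tempfile

import pandas as pd

# Hypothetical sample data standing in for merged_data.csv
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, 'sample.csv')
gz_path = csv_path + '.gz'

sample = pd.DataFrame({
    'title': ['Bad Blood'] * 5_000,
    'rank': list(range(1, 201)) * 25,
})
sample.to_csv(csv_path, index=False)

# Stream the CSV into a gzip archive without reading it all into memory
with open(csv_path, 'rb') as f_in, gzip.open(gz_path, 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

raw_size = os.path.getsize(csv_path)
gz_size = os.path.getsize(gz_path)
print(f'raw: {raw_size} bytes, gzipped: {gz_size} bytes')

# Pandas decompresses on the fly while reading
df_small = pd.read_csv(gz_path, compression='gzip')
```

Repetitive text like this compresses unusually well, so the ratio you see here says little about what to expect from real data.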


However, while we've managed to compress the data in its archived state, we still need to worry about the size of the uncompressed data. Unfortunately, the compression ratio can vary a lot depending on the type of data and the compression method, so it is not trivial to work out the size of the decompressed data ahead of time.
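
One way to measure the decompressed size without holding the data in memory is to stream through the archive in fixed-size blocks and count the bytes. This is only a sketch: `uncompressed_size` is a hypothetical helper, and it still has to decompress the whole file once, so expect it to take a while on a multi-gigabyte archive.

```python
import gzip

def uncompressed_size(path, block_size=1024 * 1024):
    """Count decompressed bytes by streaming through a gzip file."""
    total = 0
    with gzip.open(path, 'rb') as f:
        while True:
            # Only one block of decompressed data is held in memory at a time
            block = f.read(block_size)
            if not block:
                break
            total += len(block)
    return total
```

Calling `uncompressed_size('merged_data.csv.gz')` would then return the size in bytes of the original CSV.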

If you are working on a laptop, you most likely have 2-8 GB of RAM. Therefore, we cannot load the entire dataset into memory. So, what do we do?

### Data exploration

We can read a preview of the compressed CSV file directly into Pandas using a couple of additional keyword parameters.

- nrows: number of rows (beyond the header) that you want to load
- compression: the name of the algorithm used to compress the file. Pandas is usually smart enough to figure this out based on the file extension, but it's safer to provide this directly. A gzipped file will usually have the extension '.gz', so we'll put 'gzip' here.


In [9]:
import pandas as pd
pd.set_option('display.max_columns', None)

df = pd.read_csv('merged_data.csv.gz', 
                 compression='gzip',
                 nrows=10_000)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           10000 non-null  int64  
 1   title                10000 non-null  object 
 2   rank                 10000 non-null  int64  
 3   date                 10000 non-null  object 
 4   artist               10000 non-null  object 
 5   url                  10000 non-null  object 
 6   region               10000 non-null  object 
 7   chart                10000 non-null  object 
 8   trend                10000 non-null  object 
 9   streams              10000 non-null  float64
 10  track_id             10000 non-null  object 
 11  album                9998 non-null   object 
 12  popularity           10000 non-null  float64
 13  duration_ms          10000 non-null  float64
 14  explicit             10000 non-null  bool   
 15  release_date         10000 non-null  

In [7]:
df.memory_usage().sum() / 1024**2

2.145893096923828

The first 10k rows take up about 2.1 MB (the sum is in bytes, and dividing by 1024**2 converts it to megabytes), so we could read in quite a bit more. But this should be enough to get a feel for the data types involved.
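
Keep in mind that memory_usage() counts only the pointers for object columns by default. Passing deep=True also counts the strings themselves, which gives a more realistic estimate for text-heavy data like this. A small illustration with made-up data:

```python
import pandas as pd

# Made-up data: one numeric column and one text column
demo = pd.DataFrame({
    'rank': range(1_000),
    'title': pd.Series(['Bad Blood'] * 1_000, dtype='object'),
})

shallow = demo.memory_usage().sum()         # object column counted as pointers only
deep = demo.memory_usage(deep=True).sum()   # also counts the Python strings

print(f'shallow: {shallow / 1024**2:.3f} MB')
print(f'deep:    {deep / 1024**2:.3f} MB')
```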

### Column selection and data type assignment

You can save memory by loading only the columns that are relevant to your work. Letting the function infer a data type for each column can also be costly, both because inference takes extra work and because the defaults (such as int64 and object) are often wider than necessary. You can instead specify the data types ahead of time with a dictionary of Pandas data types.

Let's say we want to track the performance of a specific song in a specific region on the Spotify Charts. We only need to track the rank and date after filtering on region, artist, and song.

In [53]:
dtypes = {
    'title': pd.StringDtype(),
    'rank': pd.Int32Dtype(),
    'date': pd.StringDtype(),
    'artist': pd.StringDtype(),
    'region': pd.StringDtype(),
    'chart': pd.StringDtype(),
}

df = pd.read_csv('merged_data.csv.gz', 
                 compression='gzip',
                 nrows=10_000,
                 usecols=dtypes.keys(),
                 dtype=dtypes)
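
To get a feel for the savings, the sketch below reads the same made-up CSV twice, once with the Pandas defaults and once with usecols and explicit dtypes, and compares the memory footprint. The category dtype used here is my own addition rather than part of the example above. It tends to help when a column repeats a small number of distinct values, which is likely the case for columns such as region and chart.

```python
import io

import pandas as pd

# Made-up CSV text with a repetitive text column and a column we do not need
rows = ['rank,region,url']
rows += [f'{i % 200 + 1},United States,https://example.com/{i}' for i in range(5_000)]
csv_text = '\n'.join(rows)

# Default read: every column, inferred dtypes
df_default = pd.read_csv(io.StringIO(csv_text))

# Selective read: only the needed columns, with compact dtypes
df_compact = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['rank', 'region'],
    dtype={'rank': 'int32', 'region': 'category'},
)

default_bytes = df_default.memory_usage(deep=True).sum()
compact_bytes = df_compact.memory_usage(deep=True).sum()
print(f'default: {default_bytes} bytes, compact: {compact_bytes} bytes')
```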


### Breaking the analysis up into chunks

Now we need to process the rest of the dataset. We can use the chunksize parameter to pull out only a few thousand rows at a time for processing before we continue. With this keyword parameter, the function returns an iterator (a TextFileReader) instead of a DataFrame. It behaves much like a generator: you can loop over it, but only one chunk is read into memory at a time.

Below is an example of a really simple generator. This function yields the same values as the range function, one at a time.

In [27]:
def range_generator(n):
    for i in range(n):
        yield i

In [30]:
g = range_generator(10)

In [31]:
g

<generator object range_generator at 0x137f20e10>

In [32]:
next(g)

0

In [33]:
next(g)

1

In [34]:
for i in g:
    print(i)

2
3
4
5
6
7
8
9
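
read_csv with chunksize can be consumed the same way, except that each step hands back a DataFrame rather than a single number. A minimal sketch on made-up in-memory data:

```python
import io

import pandas as pd

# Made-up CSV with 10 data rows
csv_text = 'rank\n' + '\n'.join(str(i) for i in range(10))

chunk_lengths = []
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Each chunk is an ordinary DataFrame with at most 4 rows
    chunk_lengths.append(len(chunk))
    total += chunk['rank'].sum()

print(chunk_lengths, total)
```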


Here's a function that will loop through each chunk of the data, filter on the song/artist and region, and append the matching rows to an existing frame. We also want to keep the memory footprint of the filtered data small, so we'll put a cap on the total number of rows we keep in memory and periodically dump the results to disk.

In [56]:
def filter_data(title=None, artist=None, region=None):
    dtypes = {
        'title': pd.StringDtype(),
        'rank': pd.Int32Dtype(),
        'date': pd.StringDtype(),
        'artist': pd.StringDtype(),
        'region': pd.StringDtype(),
        'chart': pd.StringDtype()
    }

    filtered_data = pd.DataFrame()
    i = 0
    
    for df in pd.read_csv('merged_data.csv.gz', 
                    compression='gzip',
                    chunksize=10_000,
                    usecols=dtypes.keys(),
                    dtype=dtypes):
        
        # select only the top200 chart
        df = df[df['chart'].eq('top200')]
            
        ## filter by title, artist, region
        if title:
            df = df[df['title'].str.contains(title, case=False)]
        if artist:
            df = df[df['artist'].str.contains(artist, case=False)]
        if region:
            df = df[df['region'].str.contains(region, case=False)]
        
        ## append to filtered_data if the total number of rows is less than 10_000
        ## otherwise dump data and create a new frame
        if len(filtered_data) + len(df) <= 10_000:
            filtered_data = pd.concat([filtered_data, df])
        else:
            filtered_data.to_csv(f'filtered_data_{i}.csv', index=False)
            i += 1
            filtered_data = df
            
    ## after the loop finishes, save whatever is left in memory
    filtered_data.to_csv(f'filtered_data_{i}.csv', index=False)


In [57]:
filter_data(title='bad blood', 
            artist='taylor swift', 
            region='united states')
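
One caveat with the filters above: str.contains interprets its pattern as a regular expression by default, so a search string containing characters such as parentheses may fail or match something unexpected. If you want a literal, case-insensitive substring match, passing regex=False is one option. A small illustration with made-up titles:

```python
import pandas as pd

# Made-up titles, including a missing value
titles = pd.Series(
    ['Bad Blood', 'Bad Blood (feat. Kendrick Lamar)', 'Blank Space', None],
    dtype='string',
)

# The unbalanced parenthesis would be an invalid regular expression
query = 'blood (feat'

# regex=False treats the query as plain text; na=False maps missing titles to False
mask = titles.str.contains(query, case=False, regex=False, na=False)
print(titles[mask].tolist())
```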

In [61]:
import glob
files = glob.glob('filtered_data_*.csv')

len(files)

1

Turns out we only had one file generated, so it should be safe to load all of it into one frame.

In [62]:
df = pd.concat([pd.read_csv(f, dtype=dtypes) for f in files])

In [65]:
df

| | title | rank | date | artist | region | chart |
|---|---|---|---|---|---|---|
| 0 | Bad Blood | 127 | 2017-06-09 | Taylor Swift | United States | top200 |
| 1 | Bad Blood | 122 | 2017-06-10 | Taylor Swift | United States | top200 |
| 2 | Bad Blood | 175 | 2017-06-11 | Taylor Swift | United States | top200 |
| 3 | Bad Blood | 157 | 2017-06-12 | Taylor Swift | United States | top200 |
| 4 | Bad Blood | 181 | 2017-06-13 | Taylor Swift | United States | top200 |


This process could be very tedious if you needed to do it many times for different artists. It would be wiser to convert this file into one that can be queried using SQL. We can use duckdb to do this for us.