# Understanding and testing the performance of pandas.read_csv

Comma-separated values (CSV) files are widely used to read, write and transfer data for machine learning applications, so pandas.read_csv is the starting point of many data analysis and machine learning projects. That makes it worth understanding how it works and how it performs, so let's look under its hood. Refer to the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) and the source code for more in-depth knowledge.

In [None]:
import pandas as pd
import numpy as np

Let's first explore its performance.
Add *%time* before a statement in a Jupyter notebook to time it; it's a handy tool for evaluating performance.
### Starting out with all the default parameters

In [None]:
%time df=pd.read_csv('../input/csv-large/csv_large.csv')

Here's what our data looks like

In [None]:
df.head()

Python is a memory hog, and code written in pure Python can be up to [400 times slower](https://stackoverflow.com/questions/801657/is-python-faster-and-lighter-than-c) than native C code. That is why most functions in pandas, NumPy and other scientific computing modules are written in Cython or C. Let's verify this.
### Using the python engine / parser to read the file

In [None]:
%time df=pd.read_csv('../input/csv-large/csv_large.csv',engine='python')

### Using the C engine / parser (default)
To show the remarkable time difference, let's run it with the default engine (C) by passing it explicitly in the arguments

In [None]:
%time df=pd.read_csv('../input/csv-large/csv_large.csv',engine='c')

42.5 s vs 3.57 s: the C parser is roughly 12 times faster
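
Outside a notebook, where *%time* is not available, the same comparison can be sketched with `time.perf_counter`. This is only a minimal sketch: it reads a small synthetic CSV built in memory (made-up data, not csv_large.csv), and `time_read` is a helper name I made up for the illustration, so the absolute numbers will differ from the ones above.

```python
import io
import time

import pandas as pd

# Build a small synthetic CSV in memory (hypothetical data, not csv_large.csv).
rows = ["id,name,score"] + [f"{i},user{i},{i * 0.5}" for i in range(50_000)]
csv_text = "\n".join(rows)


def time_read(engine):
    """Return (dataframe, elapsed seconds) for one read with the given engine."""
    start = time.perf_counter()
    df = pd.read_csv(io.StringIO(csv_text), engine=engine)
    return df, time.perf_counter() - start


df_c, t_c = time_read("c")
df_py, t_py = time_read("python")

print(f"C engine:      {t_c:.3f} s")
print(f"Python engine: {t_py:.3f} s")
# Both engines should produce the same table; only the speed differs.
print("Same shape:", df_c.shape == df_py.shape)
```

A single run is noisy, so for anything serious it is better to repeat the measurement a few times and compare the best or median time.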
## Why even use the Python parser?
On reading the pandas.read_csv [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html), I realized that the Python parser, even though it is much slower, provides a few features that the C parser does not. Quoting the docs:
>The C engine is faster while the python engine is currently more feature-complete. 

What are these features?
- read_csv has a parameter called "sep", which defaults to ','. That makes sense because, after all, we are reading comma-separated values (CSV) files. But in the rare cases where the separator is something else, say a period (.), the C parser cannot detect the separator automatically, while the Python parser can, using Python's built-in sniffer tool, csv.Sniffer. Quoting the docs:
>if sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer.

- The Python parser has another great built-in feature: it accepts a regex as the separator, which can be very useful for badly formatted CSV files. Quoting the docs:
>separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. 



Let's test it out

In [None]:
df=pd.read_csv('../input/csv-large/csv_large.csv',engine='c',sep=None)

As expected, we get an error when trying to run the C parser with sep=None. Now let's try running it with the Python parser

In [None]:
df=pd.read_csv('../input/csv-large/csv_large.csv',engine='python',sep=None)

In [None]:
df.head()

Not only do we get no errors, the Python parser has also read the data correctly into a pd.DataFrame. So the Python parser has better support for such features, even though it is much slower than the C parser.
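
To make both features concrete, here is a minimal sketch on tiny in-memory samples. The sample strings and variable names are made up for the illustration; the calls themselves (`csv.Sniffer().sniff`, `sep=None` and a regex `sep` with `engine='python'`) are the ones described in the quotes above.

```python
import csv
import io

import pandas as pd

# Hypothetical sample that uses ';' instead of ',' as the delimiter.
semicolon_text = "id;name;sex\n1;alice;F\n2;bob;M\n3;carol;F\n"

# csv.Sniffer guesses the delimiter from a sample of the text.
dialect = csv.Sniffer().sniff(semicolon_text)
print("Sniffed delimiter:", repr(dialect.delimiter))

# sep=None asks pandas to detect the delimiter, which requires the Python engine.
df_sniffed = pd.read_csv(io.StringIO(semicolon_text), sep=None, engine="python")
print(df_sniffed)

# A separator longer than one character is treated as a regular expression.
# Here fields are split on a pipe surrounded by optional whitespace.
messy_text = "id | name|sex\n1 |alice | F\n2| bob |M\n"
df_regex = pd.read_csv(io.StringIO(messy_text), sep=r"\s*\|\s*", engine="python")
print(df_regex)
```

Sniffing is a heuristic, so when the separator is known it is safer (and faster) to pass it explicitly and stay on the C engine.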

### Setting verbose = True to get insights on time distribution

In [None]:
%time df=pd.read_csv('../input/csv-large/csv_large.csv',verbose=True)

**Tokenization** and **type conversion** take most of the time; let's try to improve them:
1. **low_memory**: Setting low_memory=False may reduce the number of tokenizations, which could help, because the default low_memory=True lets the parser internally break the CSV data into multiple chunks and read them one by one.
2. **Explicitly specifying the dtype of each column** should reduce the overall type conversion time

# low_memory
First, let's fiddle around with the **low_memory** parameter.
Quoting the docs:
>**low_memory** : boolean, default True
Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. 

In [None]:
%time df=pd.read_csv('../input/csv-large/csv_large.csv',low_memory=False,verbose=True)

Setting **low_memory=False** reduced the number of tokenizations, but it did not improve the parse time; it made it worse

### Total tokenization time in case low_memory=True
Let's sum the tokenization times of the chunks created by **low_memory=True** (the default)

In [None]:
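def get_linedata(lines):
    # Group the verbose output lines by the stage they report on.
    # startswith() also copes with blank lines, where split()[0] would raise an IndexError.
    return [
        [line for line in lines if line.startswith('Tokenization')],
        [line for line in lines if line.startswith('Type')],
        [line for line in lines if line.startswith('Parser')],
    ]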

def sum_times():
    total_tokenization=0
    total_type=0
    total_parser=0
    with open('../input/token-time/tokenization_time.txt') as handle:
        lines=handle.readlines()
        tokenizations,types,parsers=get_linedata(lines)
        for tz,ty,pr in zip(tokenizations,types,parsers):
            total_tokenization+=float(tz[19:24])
            total_type+=float(ty[22:26])
            total_parser+=float(pr[28:32])
    return {'total tokenization time':'{:.2f} ms'.format(total_tokenization),
                    'total type conversion time':'{:.2f} ms'.format(total_type) ,
                    'total parser time':'{:.2f} ms'.format(total_parser) 
           }     
        
sum_times()

The total parser time does not seem to be a reliable measure: within the small chunks created by **low_memory=True**, the parsing time was so small that it was lost in the two-decimal precision printed by **verbose=True**; a third decimal digit would have had a large impact on the summed parser time.
Apart from that, the total tokenization and type conversion times are pretty insightful. It turns out that processing the whole file at once is **not a good way** of reading large CSV files as far as performance is concerned. Hence it's good to use the default and keep **low_memory=True**
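
The fixed slice positions used in `sum_times` break as soon as a number has a different width, so here is a hedged alternative that pulls the numbers out with a regular expression. It is only a sketch: the sample lines follow the "... took: N ms" format the slices above rely on, the numbers are made up, and `sum_stage_times` is a name I invented for the illustration.

```python
import re
from collections import defaultdict

# Sample lines in the format printed by verbose=True (the numbers are made up).
sample_output = """\
Tokenization took: 12.34 ms
Type conversion took: 5.67 ms
Parser memory cleanup took: 0.01 ms
Tokenization took: 10.00 ms
Type conversion took: 4.33 ms
Parser memory cleanup took: 0.00 ms
"""

# Capture the stage label and the number of milliseconds that follows "took:".
LINE_PATTERN = re.compile(r"^(?P<stage>.+?) took: (?P<ms>\d+(?:\.\d+)?) ms$")


def sum_stage_times(text):
    """Sum the reported milliseconds per stage across all chunks."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:  # silently skip blank or unrelated lines
            totals[match.group("stage")] += float(match.group("ms"))
    return dict(totals)


for stage, total in sum_stage_times(sample_output).items():
    print(f"{stage}: {total:.2f} ms")
```

The same function can be fed the contents of a saved log file by passing `handle.read()` instead of the sample string.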

## When to use low_memory=False?
Let's see an example using a much larger dataset, which comes from a well-known Kaggle [problem](https://www.kaggle.com/c/bluebook-for-bulldozers/data)

In [None]:
%time df=pd.read_csv('../input/bluebook-for-bulldozers/train/Train.csv')

We get a DtypeWarning.

Columns (13,39,40,41) have mixed types. Let's look at these columns and the total number of columns

In [None]:
print(df.columns[13],df.columns[39],df.columns[40],df.columns[41])
print(f"Total columns: {df.columns.size}")

In [None]:
!wc -l ../input/bluebook-for-bulldozers/train/Train.csv

There are 53 columns and 401126 rows in our dataset; assigning dtypes to 53 columns while checking 401126 rows for consistency is not feasible by hand. In such cases, reading the file in chunks can cause a DtypeWarning, because the type inferred for a column may differ from one chunk to the next.
Hence, we use **low_memory=False** when:
1. A column has an inconsistent data type, because reading the CSV all at once lets the parser analyze each whole column and assign it a data type accordingly
2. There are too many columns to hardcode their dtypes before reading the CSV

Now let's run the above with **low_memory=False**

In [None]:
%time df=pd.read_csv('../input/bluebook-for-bulldozers/train/Train.csv',low_memory=False)

read_csv with **low_memory=False** is still slower, but at least we avoided the warning, and the parser has a better understanding of the data, which minimizes the chance of errors
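
When only a handful of columns are ambiguous, a middle ground is to pin the dtype of just those columns and let pandas infer the rest. The sketch below shows this on a tiny made-up sample (the column names `code` and `price` are hypothetical, not the bulldozer columns), together with a small helper, `mixed_type_columns`, that I made up to find columns holding more than one Python type.

```python
import io

import pandas as pd

# Hypothetical miniature of the problem: 'code' holds both numbers and text.
csv_text = "id,code,price\n1,100,9.5\n2,A7,3.0\n3,205,\n"

# Forcing the ambiguous column to str gives it one consistent type
# and leaves type inference on for every other column.
df = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})
print(df.dtypes)


def mixed_type_columns(frame):
    """Return the columns whose non-null values have more than one Python type."""
    return [
        col
        for col in frame.columns
        if frame[col].dropna().map(type).nunique() > 1
    ]


# A frame built directly with mixed values, mimicking a chunked read gone wrong.
messy = pd.DataFrame({"id": [1, 2, 3], "code": [100, "A7", 205]})
print("Mixed-type columns:", mixed_type_columns(messy))
```

On the real file, the column positions listed in the DtypeWarning tell you which names to put in the `dtype` dictionary.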


# Explicitly specifying datatypes of columns

Let's read a synthetically generated very large csv named **vv_large.csv**

In [None]:
!wc -l ../input/vv-large/vv_large.csv

It has 17.2M lines! It's massive

Let's see which data types the parser assigns to the columns of our synthetic CSV on its own. We store the column names in a list called names_; passing **names=names_** names the columns according to that list

In [None]:
names_=['id','name','sex']
%time df=pd.read_csv('../input/vv-large/vv_large.csv', names=names_)
print(
    type(df.id[2]),
    type(df.sex[2]),
    type(df.name[2]),
     )

This is what the CSV looks like. I generated it synthetically by appending the same file in a very large Python loop

In [None]:
df.head()

## Explicit type passing
There's no need for a 64-bit integer to store the id (since it's just a single-digit integer); it would be wiser to use an 8-bit integer for 'id' and the str type for both name and sex

In [None]:
names_=['id','name','sex']
dtypes={'id':np.int8,'name':'str','sex':'str'}
%time df=pd.read_csv('../input/vv-large/vv_large.csv',names=names_,dtype=dtypes)

- As expected, performance improves compared to the earlier run, where the data types were not passed explicitly in the parameters

Let's now try turning off **low_memory** while specifying the data types

In [None]:
names_=['id','name','sex']
dtypes={'id':np.int32,'name':'str','sex':'str'}
%time df=pd.read_csv('../input/vv-large/vv_large.csv',names=names_,dtype=dtypes,low_memory=False)

**low_memory=False** is slower, as expected and as always
### Set datatypes as objects
- Let's try specifying the data type of each column as 'object'.
- It should be computationally more expensive, because this pretty much asks for more data type conversions

In [None]:
names_=['id','name','sex']
dtypes={'id':'object','name':'object','sex':'object'}
%time df=pd.read_csv('../input/vv-large/vv_large.csv',names=names_,dtype=dtypes)

In [None]:
names_=['id','name','sex']
dtypes={'id':'object','name':'object','sex':'object'}
%time df=pd.read_csv('../input/vv-large/vv_large.csv',names=names_,dtype=dtypes,verbose=True)

As expected, the runtime is much larger, mostly due to the extra time spent in type conversion when every value, including the ids, has to be turned into a Python object
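
The dtype choice also shows up in memory, which can be checked on a small in-memory sample. This is a rough sketch, not a benchmark: the generated rows are a made-up stand-in for vv_large.csv, and `id_column_bytes` is a helper name invented for the illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for vv_large.csv: three columns, single-digit ids, no header.
csv_text = "\n".join(f"{i % 10},name{i},{'M' if i % 2 else 'F'}" for i in range(10_000))
names_ = ["id", "name", "sex"]


def id_column_bytes(id_dtype):
    """Read the sample with the given dtype for 'id' and return that column's size in bytes."""
    df = pd.read_csv(io.StringIO(csv_text), names=names_, dtype={"id": id_dtype})
    # deep=True also counts the Python objects behind an object column.
    return df["id"].memory_usage(index=False, deep=True)


for label, id_dtype in [("int8", np.int8), ("int64", np.int64), ("object", object)]:
    print(f"id as {label:>6}: {id_column_bytes(id_dtype):>8} bytes")
```

An int8 column needs one byte per row and an int64 column eight, while an object column stores a pointer per row plus the Python object it points to.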

# Reading really really large and massive CSV files
train.csv is a very large CSV I got from this Kaggle [problem](https://www.kaggle.com/c/favorita-grocery-sales-forecasting/data)

In [None]:
!ls -sh ../input/favorita-grocery-sales-forecasting/train.csv

In [None]:
!wc -l ../input/favorita-grocery-sales-forecasting/train.csv

125.4M lines is a lot, and the file size is 4.7 GB, which is huge!

Reading such a file at once causes the notebook kernel in my browser to stop responding, so I will paste a screenshot of my Linux terminal here (I don't really like using Jupyter notebooks; here's a [theme](https://github.com/TimeTraveller-San/OCD_fix/tree/master/jupyter_moe) I made to make them bearable)

![alt text](https://i.imgur.com/54Df6r9.png "Title")

I won't even bother using low_memory=False, because loading such a big file into memory all at once is unreasonable

One very obvious thing to do here is to read only the first few rows:
- passing **nrows=n** reads only the first **n** rows

In [None]:
df=pd.read_csv('../input/favorita-grocery-sales-forecasting/train.csv',nrows=10**3)
df.head()

But often we want to read the whole CSV into a single dataframe. What to do in such cases?
# Introducing chunksize | Iterating through files chunk by chunk
A very interesting parameter is **chunksize**: when it is set, read_csv returns a **TextFileReader** object that can be iterated over.

>pd.read_csv('some_csv',chunksize=chunk_size)

Each element the iterator returns consists of **chunk_size** rows

Let's examine this object

In [None]:
??pd.io.parsers.TextFileReader

The following is the code that concerns us for our problem:-

```python
def __next__(self):
    try:
        return self.get_chunk()
    except StopIteration:
        self.close()
        raise
```
This code calls `self.get_chunk()`, which in turn calls `self.read()`, which reads the file in pieces of the **chunk_size** we specified in the arguments.

`__next__` means it's an iterator; let's iterate over it and try reading the whole file
            

In [None]:
%time TextFileReaderObject=pd.read_csv('../input/favorita-grocery-sales-forecasting/train.csv',chunksize=10**5) 
#Reading 100k rows in each chunk

next(iterator) returns the next item from the iterator, which here is the next chunk as a dataframe

In [None]:
print(next(TextFileReaderObject).shape)
next(TextFileReaderObject).head()

In [None]:
TextFileReaderObject=pd.read_csv('../input/favorita-grocery-sales-forecasting/train.csv',chunksize=10**5)
%time df = pd.concat(chunk for chunk in TextFileReaderObject)

In [None]:
print(df.shape)
df.head()

There we have it! 125.4 million lines and 4.7 GB of data read into a single dataframe within 2 minutes
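
The same pattern can be tried on a tiny in-memory sample, which also shows a second option: if only an aggregate is needed, the chunks never have to be concatenated at all. This is a minimal sketch with made-up data; the column name `unit_sales` is an assumption for the illustration.

```python
import io

import pandas as pd

# Hypothetical small sample standing in for train.csv.
csv_text = "id,unit_sales\n" + "\n".join(f"{i},{i % 7}" for i in range(10))

# Option 1: rebuild the full dataframe from the chunks.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)
chunks = list(reader)
print("Rows per chunk:", [len(chunk) for chunk in chunks])
# ignore_index=True renumbers the rows 0..n-1 in the combined frame.
df_all = pd.concat(chunks, ignore_index=True)

# Option 2: keep only a running aggregate, so a single chunk is in memory at a time.
total_sales = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total_sales += chunk["unit_sales"].sum()

print("Total from chunks:", total_sales)
print("Total from full frame:", df_all["unit_sales"].sum())
```

Option 1 still needs enough memory for the final dataframe; option 2 only ever holds one chunk, which is what makes it usable on files larger than RAM.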

Thank you o' great pandas [developers](https://github.com/pandas-dev/pandas/graphs/contributors)

## Conclusion
- Default values are made default by great programmers; almost always use them with the **read_csv** function
- Name the columns of your dataframe by passing the **"names=column name list"** parameter
- Always try to specify the dtypes of the columns unless there are too many columns or their types are ambiguous
- Set **low_memory=False** only when the data has too many columns with ambiguous data types and you cannot hardcode their dtypes to avoid warnings (Note: it will always be very memory demanding, since the whole file is loaded into memory at once)
- Set **verbose=True** if you want insight into what exactly is taking so much time while reading the data
- Humongous CSV? Break it into chunks and read it!