# Notebook Instructions

1. If you are new to Jupyter notebooks, please go through this introductory manual <a href='https://quantra.quantinsti.com/quantra-notebook' target="_blank">here</a>.
1. Any changes made in this notebook will be lost after you close the browser window. **You can download the notebook to save your work on your PC.**
1. Before running this notebook on your local PC:<br>
i.  You need to set up a Python environment and the relevant packages on your local PC. To do so, go through the section on "**Run Codes Locally on Your Machine**" in the course.<br>
ii. You need to **download the zip file available in the last unit** of this course. The zip file contains the data files and/or python modules that might be required to run this notebook.

## Working with Pickle Files (.bz2)

1. The most important advantage of using a pickle file instead of a CSV is that it retains the datatypes of the columns and the index. For example, if the index is converted to a `DatetimeIndex` and the dataframe is saved as a pickle file, the index is still a `DatetimeIndex` when the file is read back.

2. However, the limitation of pickle files is that they are sensitive to the Python and pandas versions, i.e. you might encounter issues when saving a file in one version and reading it in another. Pickle files are generally backward compatible, i.e. a pickle file created in a lower version can usually be read in a higher version, but not necessarily the other way round.

3. The `bz2` format is used for saving the pickle file in a compressed manner. In this notebook, we will see the process of saving a dataframe as a pickle file.
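
The first point above can be illustrated with a minimal sketch. It assumes a small made-up dataframe and temporary file names (`sample.csv`, `sample.bz2`) that are not part of the course data. The same dataframe is written to both formats and read back, and the index type is compared.

```python
import os
import tempfile

import pandas as pd

# Small illustrative dataframe with a DatetimeIndex (values are made up)
df = pd.DataFrame(
    {"close": [10.0, 10.5, 10.2]},
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file names, created in a temporary folder
    csv_path = os.path.join(tmp_dir, "sample.csv")
    pkl_path = os.path.join(tmp_dir, "sample.bz2")

    # Save the same dataframe in both formats
    df.to_csv(csv_path)
    df.to_pickle(pkl_path)

    # Read both files back without any extra parsing options
    from_csv = pd.read_csv(csv_path, index_col=0)
    from_pickle = pd.read_pickle(pkl_path)

csv_index_type = type(from_csv.index).__name__
pickle_index_type = type(from_pickle.index).__name__

print("Index type after CSV round trip:   ", csv_index_type)
print("Index type after pickle round trip:", pickle_index_type)
```

The CSV round trip returns a plain `Index`, while the pickle round trip returns the `DatetimeIndex` that was saved.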

The notebook is structured as follows:
1. [Import the Data](#import)
2. [Save as Pickle File](#save)
3. [Read Pickle File](#read)
4. [Common Errors](#error)

## Import Libraries

In [1]:
# For data manipulation
import pandas as pd

# For checking python version
from platform import python_version

<a id='import'></a>
## Import the Data

Import the file `AAPL_daily_data.csv` using the `read_csv` method of `pandas`.

In [2]:
# Import price data of Apple stock
data = pd.read_csv("../data_modules/AAPL_daily_data.csv", index_col=0)
print(type(data.index))

<class 'pandas.core.indexes.base.Index'>


In [3]:
# Change index to datetime
data.index = pd.to_datetime(data.index)
print(type(data.index))

<class 'pandas.core.indexes.datetimes.DatetimeIndex'>


The datatype of index has been changed to `DatetimeIndex`.
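
As an aside, the same conversion can usually be done while reading the file. The sketch below assumes a tiny in-memory CSV with made-up contents instead of `AAPL_daily_data.csv`, and uses the `parse_dates` parameter of `read_csv` to parse the index column as dates.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a price file (not the course data)
csv_text = """date,close
2007-01-03,83.8
2007-01-04,85.64
2007-01-05,85.15
"""

# parse_dates=True asks pandas to try parsing the index column as dates
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

index_type = type(sample.index).__name__
print(index_type)
```

Whether the dates are parsed this way depends on the date format in the file, so it is worth checking the index type after reading, as done in the cells above.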

In [4]:
# Print first 5 rows
data.head(5)

Unnamed: 0,open,high,low,close
2007-01-03 00:00:00+00:00,86.3,86.58,81.9,83.8
2007-01-04 00:00:00+00:00,84.05,85.95,83.82,85.64
2007-01-05 00:00:00+00:00,85.76,86.2,84.4,85.15
2007-01-08 00:00:00+00:00,85.94,88.92,85.28,85.44
2007-01-09 00:00:00+00:00,86.49,92.98,85.15,92.55


<a id='save'></a>
## Save as Pickle File

We will save the dataframe `data` as a pickle file by using the `to_pickle` method. But before that, we will check the versions of `python` and `pandas`. The extension used for saving the file will be `.bz2`, as pandas infers the compression from the extension and `bz2` provides good size compression.

Syntax:
```python
df.to_pickle("filename.bz2")
```
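
By default, `to_pickle` infers the compression from the file extension. The sketch below shows a possible alternative where the compression is passed explicitly through the `compression` parameter, so the file name does not need to end in `.bz2`. The dataframe values and the file name `sample_prices.pkl` are assumptions made for illustration. To confirm the file really is a bz2-compressed pickle, it is also opened with the standard `bz2` and `pickle` modules.

```python
import bz2
import os
import pickle
import tempfile

import pandas as pd

# Small illustrative dataframe (not the course data)
df = pd.DataFrame(
    {"close": [83.8, 85.64, 85.15]},
    index=pd.to_datetime(["2007-01-03", "2007-01-04", "2007-01-05"]),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name without the .bz2 extension
    path = os.path.join(tmp_dir, "sample_prices.pkl")

    # Request bz2 compression explicitly instead of relying on the extension
    df.to_pickle(path, compression="bz2")

    # Read it back with pandas, again stating the compression explicitly
    from_pandas = pd.read_pickle(path, compression="bz2")

    # The same file can be opened with the standard library
    with bz2.open(path, "rb") as f:
        from_stdlib = pickle.load(f)

same_via_pandas = from_pandas.equals(df)
same_via_stdlib = from_stdlib.equals(df)
print(same_via_pandas, same_via_stdlib)
```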

In [5]:
# Check python version
print("Python version =", python_version())

# Check pandas version
print("Pandas version =", pd.__version__)

Python version = 3.11.9
Pandas version = 2.2.2


In [6]:
# Save the dataframe as pickle file
data.to_pickle("AAPL_daily_data.bz2")

<a id='read'></a>
## Read Pickle File

We can read the pickle file using the `read_pickle` function of pandas.

Syntax:
```python
pd.read_pickle("filename.bz2")
```

In [7]:
pickle_data = pd.read_pickle("AAPL_daily_data.bz2")

# Print top 5 rows
pickle_data.head(5)

Unnamed: 0,open,high,low,close
2007-01-03 00:00:00+00:00,86.3,86.58,81.9,83.8
2007-01-04 00:00:00+00:00,84.05,85.95,83.82,85.64
2007-01-05 00:00:00+00:00,85.76,86.2,84.4,85.15
2007-01-08 00:00:00+00:00,85.94,88.92,85.28,85.44
2007-01-09 00:00:00+00:00,86.49,92.98,85.15,92.55


The pickle file retains all the changes made to the dataframe before it was saved. We changed the index to `DatetimeIndex` before saving the pickle file. We can see that this change is retained when we read the file again.

In [8]:
# Check datatype of index
print(type(pickle_data.index))

<class 'pandas.core.indexes.datetimes.DatetimeIndex'>


<a id='error'></a>
## Common Errors

A common error while dealing with a pickle file comes from a mismatch between the Python/pandas versions used to save the file and those used to read it. The error raised is:

`AttributeError: Can't get attribute '_unpickle_block' on <module 'pandas._libs.internals' from '/opt/conda/lib/python3.8/site-packages/pandas/_libs/internals.cpython-38-x86_64-linux-gnu.so'>`

This is because the file was probably created with a newer pandas version than the one you are using to load it. The older version can't "deserialize" the object because the pandas internals changed between versions. The best way to avoid this is to use the same Python and pandas versions for saving and reading the file.
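
One possible way to keep track of the versions is to record them next to the pickle file when saving. The sketch below is only an illustration: the helper names `save_with_versions` and `versions_match`, the `.versions.json` sidecar file, and the sample data are all hypothetical and not part of pandas or this course.

```python
import json
import os
import tempfile
from platform import python_version

import pandas as pd


def save_with_versions(df, path):
    """Save df as a pickle and write the current versions to a JSON sidecar file."""
    df.to_pickle(path)
    versions = {"python": python_version(), "pandas": pd.__version__}
    with open(path + ".versions.json", "w") as f:
        json.dump(versions, f)
    return versions


def versions_match(path):
    """Return True if the current versions equal those recorded at save time."""
    with open(path + ".versions.json") as f:
        saved = json.load(f)
    current = {"python": python_version(), "pandas": pd.__version__}
    return saved == current


# Illustrative usage with made-up data and a temporary folder
df = pd.DataFrame({"close": [83.8, 85.64]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.bz2")
    saved_versions = save_with_versions(df, path)

    # Check the versions before attempting to read the pickle
    match = versions_match(path)
    if match:
        restored = pd.read_pickle(path)
    else:
        print("Version mismatch, file was saved with:", saved_versions)

print("Versions match:", match)
```

A mismatch does not always mean the file cannot be read, so a check like this is best treated as a warning rather than a hard failure.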


Another common error that one can come across is:

`ValueError: unsupported pickle protocol: 4`

The pickle protocol is basically the file format. The higher the protocol used to write a pickle, the more recent the version of Python needed to read it, and this error is raised when the Python version reading the file does not support the protocol the file was written with. Pickle protocol version 4 was added in Python 3.4, so the best way to solve this particular error is to upgrade to Python 3.4 or later.
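
When the file has to be read by an older Python version, another option is to choose a lower protocol while saving. The sketch below uses the `protocol` parameter of `to_pickle` with made-up data and a hypothetical file name. It then reads the first two bytes of the decompressed file, where pickles written with protocol 2 or higher store the protocol number in the second byte.

```python
import bz2
import os
import pickle
import tempfile

import pandas as pd

# Protocols supported by the Python version running this code
print("Default protocol:", pickle.DEFAULT_PROTOCOL)
print("Highest protocol:", pickle.HIGHEST_PROTOCOL)

# Small illustrative dataframe (not the course data)
df = pd.DataFrame({"close": [83.8, 85.64, 85.15]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name in a temporary folder
    path = os.path.join(tmp_dir, "sample_protocol4.bz2")

    # Write the pickle with protocol 4 instead of the pandas default
    df.to_pickle(path, protocol=4)

    # Decompress and read the 2-byte header of the pickle stream
    with bz2.open(path, "rb") as f:
        header = f.read(2)

# The second byte of the header holds the protocol number
protocol_used = header[1]
print("Protocol used in file:", protocol_used)
```

A lower protocol only addresses the Python side of the problem. The pandas version mismatch described earlier can still occur.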

## Conclusion

In this notebook, we saw how we can convert a CSV file into a pickle file with the `.bz2` extension and read it back. To explore it further, you can try converting a CSV file into a pickle file and compare the sizes of the `.csv` and `.bz2` files. The difference can be substantial, although it depends on the data.
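
A minimal sketch of such a size comparison is given below. It assumes a synthetic dataframe and temporary file names rather than the course data, and uses `os.path.getsize` to read the size of each file in bytes. No particular result is implied, since the sizes depend on the data being saved.

```python
import os
import tempfile

import pandas as pd

# Synthetic data for illustration only (5000 daily rows with a repeating pattern)
n_rows = 5000
df = pd.DataFrame(
    {
        "open": [100 + (i % 50) * 0.25 for i in range(n_rows)],
        "close": [101 + (i % 40) * 0.5 for i in range(n_rows)],
    },
    index=pd.date_range("2007-01-01", periods=n_rows, freq="D"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file names in a temporary folder
    csv_path = os.path.join(tmp_dir, "sample.csv")
    pkl_path = os.path.join(tmp_dir, "sample.bz2")

    df.to_csv(csv_path)
    df.to_pickle(pkl_path)

    # File sizes in bytes
    csv_size = os.path.getsize(csv_path)
    pkl_size = os.path.getsize(pkl_path)

print("CSV size (bytes):       ", csv_size)
print("Pickle .bz2 size (bytes):", pkl_size)
```

To try this on the course data, the same two `getsize` calls can be pointed at `AAPL_daily_data.csv` and `AAPL_daily_data.bz2`.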