# Storage Formats

## Working with Data Files in pandas

## Prerequisites & Outcomes

**Prerequisites**
- Intro to DataFrames and Series

**Outcomes**
- Understand that data can be saved in various formats
- Know where to get help on file input and output
- Know when to use the CSV, xlsx, Parquet, Feather, and SQL formats

**Data**
- Results for all NFL games from September 1920 to February 2017

In [36]:
import pandas as pd
import numpy as np


## File Formats Overview

Data can be saved in a variety of formats.

pandas understands how to write and read DataFrames to and from many of these formats.

**Full documentation**: [pandas I/O documentation](https://pandas.pydata.org/pandas-docs/stable/io.html)

## CSV Format

**What is it?** 
- CSVs store data as plain text (strings)
- Each row is a line
- Columns are separated by `,`

### CSV: Pros and Cons

**Pros**
- Widely used (you should be familiar with it)
- Plain text file (can open on any computer, "future proof")
- Can be read from and written to by most data software

**Cons**
- Not the most efficient way to store or access
- No formal standard, so there is room for user interpretation

**When to use**: A great default option for most use cases
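Because a CSV holds only text, type information is not stored in the file and has to be inferred again when reading. The sketch below illustrates this with a small made-up frame (the `date` and `points` columns are hypothetical, not taken from the NFL data); it writes to an in-memory string instead of a file so nothing is left on disk.

```python
import io

import pandas as pd

# Hypothetical two-row frame with a datetime column and an integer column
games = pd.DataFrame(
    {
        "date": pd.to_datetime(["2017-01-22", "2017-02-05"]),
        "points": [36, 34],
    }
)

# Plain text: one line per row, columns separated by commas
csv_text = games.to_csv(index=False)

# Numbers are re-inferred on reading, but dates come back as strings
plain = pd.read_csv(io.StringIO(csv_text))

# Naming the date columns asks pandas to parse them again
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(csv_text)
print(plain.dtypes)
print(parsed.dtypes)
```

Comparing the two `dtypes` printouts shows the difference: the `date` column is a datetime only in the `parse_dates` version.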

## XLSX Format (Excel)

**What is it?** 
- xlsx is Excel's default file format: a zip archive of XML files, so it cannot be read as plain text

### XLSX: Pros and Cons

**Pros**
- Standard format in many industries
- Easy to share with colleagues that use Excel

**Cons**
- Quite slow to read/write large amounts of data
- Stores both data and metadata (styling, plots, etc.)

**When to use**
- When sharing data with Excel
- When you would like special formatting applied

## Parquet Format

**What is it?** 
- Custom binary format designed for efficient reading and writing of columnar data

### Parquet: Pros and Cons

**Pros**
- *Very* fast
- Naturally understands all `dtypes` used by pandas
- Very common in "big data" systems (Hadoop, Spark)
- Supports various compression algorithms

**Cons**
- Binary storage format (not human-readable)

**When to use**
- If you have "not small" amounts (> 100 MB) of unchanging data
- When you need size-and-time-efficient storage
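A minimal sketch of a Parquet round trip, assuming a Parquet engine such as `pyarrow` is installed. The file name and the small integer frame are made up for illustration, and the `except ImportError` branch is only there so the example degrades gracefully when no engine is available.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=["a", "b", "c", "d"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df.parquet")  # hypothetical file name
    try:
        # snappy is the default compression; it is spelled out here for clarity
        df.to_parquet(path, compression="snappy")
        back = pd.read_parquet(path)
    except ImportError:
        # pandas raises ImportError when neither pyarrow nor fastparquet is found
        back = None
        print("No Parquet engine installed; try `pip install pyarrow`")

if back is not None:
    print(back.dtypes)
```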

## Feather Format

**What is it?** 
- Custom binary format designed for efficient reading and writing of columnar data

### Feather: Pros and Cons

**Pros**
- *Very* fast -- even faster than parquet
- Naturally understands all `dtypes` used by pandas

**Cons**
- Limited language support (Python, R, Julia)
- Newer format (March 2016)
- Only supports standard pandas index

**When to use**
- Alternative to Parquet for absolute best read/write speeds
- Only when you won't need to access data outside Python/R/Julia
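Given the index limitation listed above, one common workaround is to move a custom index into an ordinary column before writing and restore it after reading. The sketch below assumes `pyarrow` is installed (pandas' `to_feather` and `read_feather` rely on it); the `team` and `points` names are hypothetical, and the `except ImportError` branch simply skips the file round trip when `pyarrow` is missing.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with a labelled (non-default) index
games = pd.DataFrame(
    {"points": [34, 28]},
    index=pd.Index(["NE", "ATL"], name="team"),
)

# "team" becomes an ordinary column and the index becomes 0..n-1
flat = games.reset_index()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "games.feather")  # hypothetical file name
    try:
        flat.to_feather(path)
        restored = pd.read_feather(path).set_index("team")
    except ImportError:
        # pyarrow is not installed: restore the index without touching disk
        restored = flat.set_index("team")

print(restored)
```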

## SQL Format

**What is it?** 
- SQL is a language used to interact with relational databases

### SQL: Pros and Cons

**Pros**
- Well established industry standard
- Much of the world's data is in SQL databases

**Cons**
- Complicated: need to learn another language (SQL)

**When to use**
- When reading from or writing to existing SQL databases
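As a rough illustration of the SQL workflow, the sketch below uses Python's built-in `sqlite3` module with an in-memory database, so no database server is needed. The table name `games` and its columns are made up for the example; a real project would connect to an existing database instead.

```python
import sqlite3

import pandas as pd

# Hypothetical data, not taken from the NFL dataset
games = pd.DataFrame({"team": ["NE", "ATL", "NE"], "points": [34, 28, 36]})

conn = sqlite3.connect(":memory:")  # temporary database that lives in RAM
try:
    # Write the DataFrame to a table called "games"
    games.to_sql("games", conn, index=False)

    # Let the database do the aggregation, then load the result as a DataFrame
    totals = pd.read_sql(
        "SELECT team, SUM(points) AS total FROM games GROUP BY team ORDER BY team",
        conn,
    )
finally:
    conn.close()

print(totals)
```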

## Writing DataFrames

General pattern: If we have a DataFrame `df` and want to save it as type `FOO`:

```python
df.to_FOO(...)
```
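One way to see which `to_FOO` methods are available in your installed version of pandas is to list them directly. This is only a quick exploratory sketch; note that some of the results (for example `to_numpy` or `to_dict`) convert to in-memory objects and are not file formats.

```python
import pandas as pd

# Collect every DataFrame attribute whose name starts with "to_"
writers = sorted(name for name in dir(pd.DataFrame) if name.startswith("to_"))
print(writers)
```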

## Creating Sample DataFrames

In [37]:
np.random.seed(42)
df1 = pd.DataFrame(
    np.random.randint(0, 100, size=(10, 4)),
    columns=["a", "b", "c", "d"]
)

df1

In [None]:
wanted_mb = 10  # approximate in-memory size of df2, in MB
nrow = 100000
ncol = int(((wanted_mb * 1024**2) / 8) / nrow)  # 8 bytes per float64 value
df2 = pd.DataFrame(
    np.random.rand(nrow, ncol),
    columns=["x{}".format(i) for i in range(ncol)]
)

print("df2.shape = ", df2.shape)
print("df2 is approximately {} MB".format(df2.memory_usage().sum() / (1024**2)))



## Writing to CSV

In [41]:
# Without arguments, returns a string
print(df1.to_csv())



,a,b,c,d
0,51,92,14,71
1,60,20,82,86
2,74,74,87,99
3,23,2,21,52
4,1,87,29,37
5,1,63,59,20
6,32,75,57,21
7,88,48,90,58
8,41,91,59,79
9,14,61,61,46



In [None]:
# With filename argument, saves to file
df1.to_csv("df1.csv")

import os
print("File created:", os.path.isfile("df1.csv"))

In [None]:
%%time
df2.to_csv("df2.csv")

## Writing to Excel

In [44]:
# Single sheet
df1.to_excel("df1.xlsx", sheet_name="df1")



In [None]:
# Multiple sheets using ExcelWriter
with pd.ExcelWriter("df1.xlsx") as writer:
    df1.to_excel(writer, sheet_name="df1")
    (df1 + 10).to_excel(writer, sheet_name="df1 plus 10")

**Note**: Writing large DataFrames to Excel is very slow (~25 seconds for df2)

## Writing to Feather

First, install pyarrow if needed:
```python
!pip install pyarrow
```

In [43]:
# !conda install -c conda-forge pyarrow
import pyarrow.feather
pyarrow.feather.write_feather(df1, "df1.feather")

In [None]:
%%time
pyarrow.feather.write_feather(df2, "df2.feather")

## Performance Comparison

| Format | Time to write `df2` |
|:------:|:----:|
| CSV | 2.66 seconds |
| XLSX | 25.7 seconds |
| Feather | 43 milliseconds |

**Feather is ~60x faster than CSV and ~600x faster than Excel!**
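The `%%time` magic used for these timings only works inside a notebook. In a plain Python script, a rough equivalent can be sketched with `time.perf_counter`. The helper name `time_write` and the small test frame are hypothetical, and the numbers you get will depend on your machine and data size.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_write(write, filename):
    """Return the seconds taken by write(path), using a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, filename)
        start = time.perf_counter()
        write(path)
        return time.perf_counter() - start


# Small hypothetical frame; the comparison above used the much larger df2
small = pd.DataFrame(
    np.random.rand(1000, 10),
    columns=["x{}".format(i) for i in range(10)],
)

csv_seconds = time_write(small.to_csv, "small.csv")
print("CSV write took {:.4f} seconds".format(csv_seconds))
```

The same helper accepts any writer that takes a path as its first argument, for example `small.to_excel` or `small.to_feather`, provided the optional packages they need are installed.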

## Reading Files into DataFrames

General pattern: Use `pd.read_FOO()` functions

Note: These are pandas functions, not DataFrame methods
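To check which readers your installed pandas provides, and to confirm that they live on the `pd` module and not on `DataFrame`, a quick exploratory sketch is:

```python
import pandas as pd

# Top-level pandas functions whose names start with "read_"
readers = sorted(name for name in dir(pd) if name.startswith("read_"))
print(readers)

# Readers are module-level functions, so DataFrame has no read_csv attribute
print(hasattr(pd.DataFrame, "read_csv"))
```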

## Reading CSV Files

In [46]:
df1_csv = pd.read_csv("df1.csv", index_col=0)
df1_csv.head(n = 7)

,a,b,c,d
0,51,92,14,71
1,60,20,82,86
2,74,74,87,99
3,23,2,21,52
4,1,87,29,37
5,1,63,59,20
6,32,75,57,21


## Reading Excel Files

In [13]:
df1_xlsx = pd.read_excel("df1.xlsx",  index_col=0)
df1_xlsx.head()

,a,b,c,d
0,51,92,14,71
1,60,20,82,86
2,74,74,87,99
3,23,2,21,52
4,1,87,29,37


## Reading Feather Files

In [None]:
# Feather automatically knows what the index is
df1_feather = pyarrow.feather.read_feather("df1.feather")
df1_feather.head()

## Reading Files from the Web

In [47]:
df1_url = "https://raw.githubusercontent.com/QuantEcon/lecture-datascience.myst/main/lectures/pandas/df1.csv"
df1_web = pd.read_csv(df1_url, index_col=0)
df1_web.head()

,a,b,c,d
0,51,92,14,71
1,60,20,82,86
2,74,74,87,99
3,23,2,21,52
4,1,87,29,37


## Practice Exercise: NFL Games Dataset

**Tasks:**
1. Read the NFL games CSV from the URL below
2. Print the shape and column names
3. Save to an Excel file named `nfl.xlsx`
4. Open in Excel on your computer

**Bonus Analysis Ideas:**
- Compute average total points per game
- Analyze playoff games only
- Track your favorite team's performance
- Calculate ratio of upsets (lower ELO team wins)

In [None]:
url = "https://raw.githubusercontent.com/fivethirtyeight/nfl-elo-game/"
url = url + "3488b7d0b46c5f6583679bc40fb3a42d729abd39/data/nfl_games.csv"

# Your code here

## Cleanup

In [48]:
import os

def try_remove(file):
    if os.path.isfile(file):
        os.remove(file)

for df in ["df1", "df2"]:
    for extension in ["csv", "feather", "xlsx"]:
        filename = df + "." + extension
        try_remove(filename)


## Summary

- **CSV**: Great default, widely compatible
- **Excel**: Best for sharing with Excel users
- **Feather/Parquet**: Fast and efficient for large datasets
- **SQL**: For database integration

Choose based on your use case and performance needs!