---
# Reading and Writing Data to Different Sources

Data are stored in many different formats. This notebook covers loading data into pandas and writing DataFrames out to different file types.

---

In [None]:
import pandas as pd
import numpy as np
from IPython.display import display


In [None]:
# Function for printing a horizontal line. For display purposes
def printhr(s: str = None, n: int = 40):
    """Print a horizontal rule made of n "=" characters.

    If a header message is given, it is printed in the middle of the rule.

    Args:
        s (str, optional): Header message. Defaults to None.
        n (int, optional): Number of "=" characters. Defaults to 40.
    """

    if s:
        print("=" * int(n / 2), s, "=" * int(n / 2))
    else:
        print("=" * n)


---
## Comma-Separated Values - .csv

A CSV is a plain text file in which the values of each row are separated by a delimiter (a comma by default).

`.read_csv()` is used to load a CSV file as a DataFrame.  
`.to_csv()` is used to write a DataFrame to a CSV file.
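
As a minimal sketch of the round trip, the example below writes a small DataFrame to a CSV file and reads it back with the same index column. The data and the temporary folder are made up for illustration. It also creates the output folder first, because `.to_csv()` writes the file but, as far as I know, does not create missing parent folders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the survey data used in this notebook
frame = pd.DataFrame(
    {"Country": ["Japan", "Japan"]},
    index=pd.Index([10, 11], name="ResponseId"),
)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "new_files"
    # Create the folder up front; to_csv only creates the file itself
    out_dir.mkdir(parents=True, exist_ok=True)

    out_path = out_dir / "csv_file.csv"
    frame.to_csv(out_path)

    # Read it back, restoring ResponseId as the index
    reloaded = pd.read_csv(out_path, index_col="ResponseId")

print(reloaded)
```

The index is written as an ordinary first column, which is why `index_col` is needed again when reading.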

---

In [None]:
# Load in csv as df and make the ResponseId the index
df = pd.read_csv("data/survey_results_public_2022.csv", index_col="ResponseId")

df.head(3)

In [None]:
# Write to csv

# Create new df
filt = df["Country"] == "Japan"
japan_df = df.loc[filt]

# Write to csv file. This will create a file named csv_file.csv
# inside a folder named new_files (the folder must already exist)
japan_df.to_csv("new_files/csv_file.csv")

---
### Delimiters

Since CSVs are just plain text files, they can be delimited with different characters. The separator (delimiter) is usually a single character, the common ones being comma, tab, and colon.

When the delimiter is not a comma, pass it through the **sep** parameter, both when reading (`.read_csv()`) and when writing (`.to_csv()`). **sep** defaults to a comma (`,`).
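
To see why **sep** has to match on both sides, here is a small in-memory sketch. The DataFrame contents and the choice of `;` as the delimiter are assumptions made for the example. It writes with `sep=";"`, then reads the same text once with the matching separator and once with the default.

```python
import io

import pandas as pd

# Hypothetical sample data
frame = pd.DataFrame(
    {"ResponseId": [1, 2], "Country": ["Japan", "Germany"]}
).set_index("ResponseId")

# Write to an in-memory text buffer using a semicolon as the delimiter
buffer = io.StringIO()
frame.to_csv(buffer, sep=";")
text = buffer.getvalue()

# Reading with the same separator restores the original columns
restored = pd.read_csv(io.StringIO(text), sep=";", index_col="ResponseId")

# Reading with the default comma keeps each line as one single column
mismatched = pd.read_csv(io.StringIO(text))

print(restored.columns.tolist())
print(mismatched.columns.tolist())
```

With the wrong separator pandas does not raise an error. It simply returns a DataFrame with one wide column, so a quick look at `.columns` is a cheap sanity check.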

---

In [None]:
# Write to tab-separated value (TSV) file.
# TSV is a variation of CSV that uses a tab as its delimiter.

# Create new df
filt = df["Country"] == "Germany"
germany_df = df.loc[filt]
display(germany_df.head(3))

# Write
germany_df.to_csv("new_files/tsv_file.tsv", sep="\t")


---
Loading with a different separator

---

In [None]:
# Load in tab separated values (tsv) to a DataFrame
df = pd.read_csv("new_files/tsv_file.tsv", sep="\t", index_col="ResponseId")
display(df.head(3))

---
## Excel - .xlsx and .xls

Excel files are Microsoft's proprietary spreadsheet files. XLSX is the newer Excel file format and can be read only by Excel 2007 and later. XLS is the older file format and can be read by all versions.  

`.read_excel()` is used to load an Excel file as a DataFrame.  
`.to_excel()` is used to write a DataFrame to an Excel file.

### Dependencies

Unlike CSV files, reading and writing Excel files requires additional package installs:  
`openpyxl` - for reading and writing xlsx, and for writing to xls*  
`xlrd` - for reading old xls  

pip can install several packages in one command if you want both:  
`pip install openpyxl xlrd`

***Note**: xlwt support for writing XLS files has been deprecated since pandas 1.2.0. I can't find a replacement for writing XLS files, and the pandas docs don't mention XLS files. openpyxl and xlsxwriter can write to an XLS file, but the result is actually just an XLSX file saved with an .xls extension.

---

---
### Reading and Writing .xlsx

---

In [None]:
# Reading
excel_df = pd.read_excel("data/excel_new.xlsx", index_col=0)
display(excel_df.head())


In [None]:
# Writing
display(df.head(3))
df.to_excel("new_files/new_excel.xlsx")


---
### Reading and Writing .xls

When writing XLS files, we need to explicitly pass an **engine** argument. This can be either `openpyxl` or `xlsxwriter`.  

**Note**: xlwt support for writing XLS files has been deprecated since pandas 1.2.0. I can't find a replacement for writing XLS files, and the pandas docs don't mention XLS files. openpyxl and xlsxwriter can write to an XLS file, but the result is actually just an XLSX file saved with an .xls extension.
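
One way to check what a file with an .xls extension really contains is to look at its first bytes. This is a rough sketch, not part of pandas: `sniff_container` is a hypothetical helper, and it relies on XLSX files being ZIP archives (starting with `PK`) while legacy XLS files use the OLE2 compound file container. The demo uses stand-in files that contain only those signatures, not real workbooks.

```python
import tempfile
from pathlib import Path

ZIP_MAGIC = b"PK\x03\x04"  # start of a ZIP archive (XLSX is a ZIP container)
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 container (legacy XLS)


def sniff_container(path) -> str:
    """Guess a spreadsheet file's container format from its first bytes."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    return "unknown"


# Stand-in files: only the leading signature bytes are realistic
with tempfile.TemporaryDirectory() as tmp:
    renamed = Path(tmp) / "renamed.xls"
    renamed.write_bytes(ZIP_MAGIC + b"rest of archive")

    legacy = Path(tmp) / "legacy.xls"
    legacy.write_bytes(OLE2_MAGIC + b"rest of file")

    renamed_kind = sniff_container(renamed)
    legacy_kind = sniff_container(legacy)

print(renamed_kind, legacy_kind)
```

If the note above holds, pointing `sniff_container` at a file written by `.to_excel()` with an .xls extension should report `zip` rather than `ole2`.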

---

In [None]:
# Reading
excel_old_df = pd.read_excel("data/excel_old.xls", index_col=0)
display(excel_old_df.sample(5))


In [None]:
# Writing
display(df.head(3))
df.to_excel("new_files/old_excel22.xls", engine="xlsxwriter")


---
### Reading and Writing to Excel Sheets

`.read_excel()` and `.to_excel()` both have a **sheet_name** parameter that lets them work with sheets.

**.read_excel's** sheet_name selects the sheet(s) to load. It can take a str, int, list, or None, and defaults to 0 (the first sheet).  
- Strings refer to a sheet by its name.  
- Integers refer to a sheet by its position (zero-indexed, and chart sheets do not count).  
- Lists combining strs and ints read multiple sheets. If a list is passed, a dict of DataFrames is returned, where the passed list elements are the keys of the dict.
- None reads all sheets, returning a dict of DataFrames keyed by sheet name.  

**.to_excel's** sheet_name sets the name of the sheet to be created. Defaults to *Sheet1*.
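
The list behavior can be sketched in memory as follows. This assumes `openpyxl` is installed; the function name `read_selected_sheets` and the tiny DataFrames are made up for the example, and the sheet names mirror the ones used in this notebook. Several sheets are written into one workbook through `pd.ExcelWriter`, then two of them are read back with a mixed list.

```python
import io

import pandas as pd


def read_selected_sheets() -> dict:
    """Write three sheets to an in-memory workbook, then load two of them."""
    # Hypothetical data, one small DataFrame per sheet
    first = pd.DataFrame({"a": [1, 2]})
    second = pd.DataFrame({"b": [3, 4]})
    third = pd.DataFrame({"c": [5, 6]})

    buffer = io.BytesIO()
    # ExcelWriter lets several to_excel calls target the same workbook
    with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
        first.to_excel(writer, sheet_name="Sheet1")
        second.to_excel(writer, sheet_name="another sheet")
        third.to_excel(writer, sheet_name="third_sheet3")

    buffer.seek(0)
    # Mixed list: "Sheet1" by name, 2 by zero-indexed position (third sheet)
    return pd.read_excel(
        buffer, sheet_name=["Sheet1", 2], index_col=0, engine="openpyxl"
    )
```

Calling `read_selected_sheets()` returns a dict whose keys are exactly the list elements that were passed, so the third sheet is found under the key `2`, not under `"third_sheet3"`.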



---

---
Loading sheets as DataFrames  

`excel_new.xlsx` has 3 sheets in the order:  
"Sheet1", "another sheet", "third_sheet3"

---

In [None]:
# Check sheet names using the sheet_names attribute of ExcelFile objects
xl_file = pd.ExcelFile("data/excel_new.xlsx")
xl_file.sheet_names


In [None]:
# Load 2nd sheet
excel_df = pd.read_excel("data/excel_new.xlsx", sheet_name="another sheet", index_col=0)
excel_df.head()

In [None]:
# Load 1st and 3rd sheets
excel_dfs = pd.read_excel("data/excel_new.xlsx", sheet_name=["Sheet1", 2], index_col=0)

# This creates a dict of DataFrames
display(excel_dfs)

In [None]:
# Load individual DataFrames
xl_df1 = excel_dfs["Sheet1"]
xl_df2 = excel_dfs[2]

display(xl_df1.head(3))
printhr()
display(xl_df2.head(3))


---
Specifying the sheet name when writing an Excel file.

---

In [None]:
# Let's use the previously loaded DataFrame and write it to an .xlsx file
display(excel_df.head(2))
printhr()

# Write into a sheet named "Sample Data"
excel_df.to_excel("new_files/excel_sheet.xlsx", sheet_name="Sample Data")

# Confirm
xl_file = pd.ExcelFile("new_files/excel_sheet.xlsx")
xl_file.sheet_names
