## Overview
This notebook covers the most common 80% of reading and writing files in pandas.<br>
The pandas API has a large number of options; this notebook focuses on:
* listing the files in a directory
* CSV
* Parquet
* JSON
* Excel


To run it you need Python and JupyterLab installed.<br>
The code assumes basic familiarity with Python syntax and usage.

## Packages Needed
* sys
* os
* pandas
* numpy
* json
* pyarrow
* openpyxl

## Install & Import


In [None]:
import sys


!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install pyarrow
!{sys.executable} -m pip install openpyxl

# Using "!{sys.executable} -m pip install" instead of "!pip install"
# ensures that packages are installed into the environment of the kernel
# that is currently running this notebook. I default to this in notebooks
# because it is what I would want to see when collaborating with a group.
import os
import numpy as np
import pandas as pd
import json

## Files used

All files were downloaded and extracted into a folder called "data" in the
same folder as this notebook ("./data/*").

From Kaggle
* https://www.kaggle.com/datasets/jeffreybraun/chipotle-locations
    * chipotle_stores.csv
    * us-states.json


From GitHub
* https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet
    * userdata1.parquet
    * userdata2.parquet
    * userdata3.parquet


## Looping files in a directory

In [None]:
# print names of files in a directory and return them as a list.
def return_files_as_list(directory):
    files = []
    for filename in os.listdir(directory):
        f = os.path.join(directory, filename)
        # checking if it is a file before printing and adding to the
        # file list
        if os.path.isfile(f):
            print(f)
            files.append(f)
    return files

In [None]:
return_files_as_list("./data")
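
The same listing can also be written with `pathlib` from the standard library. This is a minimal sketch of an alternative, not what the rest of this notebook uses; the function name `list_files` and its `pattern` argument are hypothetical.

```python
from pathlib import Path


def list_files(directory, pattern="*"):
    """Return the sorted paths (as strings) of regular files in `directory`.

    `pattern` is a glob such as "*.csv"; subdirectories are skipped.
    """
    return sorted(str(p) for p in Path(directory).glob(pattern) if p.is_file())
```

With the data folder from this notebook, `list_files("./data", "*.csv")` would return only the CSV files.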

## Read/Write CSV
docs:
 * https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
 * https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

In [None]:
chipotle_loc_df = pd.read_csv("./data/chipotle_stores.csv")
chipotle_loc_df.head()

In [None]:
chipotle_loc_df.info()

In [None]:
chipotle_loc_df.to_csv("./data/chipotle.tsv", sep="\t", index=False)
return_files_as_list("./data")

In [None]:
tsv_df = pd.read_csv("./data/chipotle.tsv", sep="\t")
tsv_df.head()
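
The `index=False` argument in the `to_csv` call matters for round trips. The sketch below uses a small made-up frame and in-memory buffers (both are assumptions for illustration, not part of the Chipotle data) to show what comes back with and without it.

```python
import io

import pandas as pd

# Toy frame standing in for the store data (values are made up).
df = pd.DataFrame({"state": ["CO", "TX"], "stores": [3, 5]})

# With index=False the output holds only the data columns.
clean = io.StringIO()
df.to_csv(clean, sep="\t", index=False)
clean.seek(0)
clean_df = pd.read_csv(clean, sep="\t")

# Without it, the row index is written as an unnamed first column,
# which read_csv loads back as "Unnamed: 0".
with_index = io.StringIO()
df.to_csv(with_index, sep="\t")
with_index.seek(0)
with_index_df = pd.read_csv(with_index, sep="\t")

print(list(clean_df.columns))       # ['state', 'stores']
print(list(with_index_df.columns))  # ['Unnamed: 0', 'state', 'stores']
```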

### Chunky chunks

In [None]:
# Chunking is often taught too late or not at all.
# For large datasets, or memory-constrained machines such as laptops,
# it helps to learn it early: chunksize makes read_csv return an iterator
# of DataFrames with at most that many rows each.

for chunk_df in pd.read_csv("./data/chipotle_stores.csv", chunksize=2):
    chunk_df.info()  # info() prints directly and returns None
    break  # only inspect the first chunk
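
In practice a chunked read usually feeds a running aggregate, so the full file never has to sit in memory at once. Here is a minimal sketch of that pattern; the in-memory CSV and its `state` and `stores` columns are made up for illustration and are not the Chipotle file.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a large file (values are made up).
csv_text = "state,stores\nCO,3\nTX,5\nCO,2\nNY,4\nTX,1\n"

totals = {}
# chunksize=2 yields DataFrames of at most two rows each.
for chunk_df in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Aggregate within the chunk, then fold the result into the running totals.
    for state, count in chunk_df.groupby("state")["stores"].sum().items():
        totals[state] = totals.get(state, 0) + int(count)

print(totals)  # {'CO': 5, 'TX': 6, 'NY': 4}
```

Only one chunk and the small `totals` dict are held at any time, which is the point of the technique.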


## Read/Write Parquet

In [None]:
parquet_df = pd.read_parquet("./data/userdata1.parquet")
parquet_df.info()

In [None]:
# With pyarrow installed, pandas uses it under the hood to serialize
# the data and write the file.
parquet_df[["first_name", "email"]].to_parquet("./data/emails.parquet")
emails_df = pd.read_parquet("./data/emails.parquet")
emails_df.head()

## Read/Write JSON

In [None]:
json_df = pd.read_json("./data/us-states.json")
json_df.head()

In [None]:
# Use a context manager so the file handle is closed after loading.
with open("./data/us-states.json") as f:
    json_data = json.load(f)
print(json.dumps(json_data, indent=2))

In [None]:
json_df2 = pd.DataFrame.from_records(json_data["features"])
json_df2.head()

In [None]:
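# A list of records gives the same result as from_records above.
json_df3 = pd.DataFrame.from_dict(json_data["features"])
# A single nested dict gets wonky; this overwrites the frame from the line above.
json_df3 = pd.DataFrame.from_dict(json_data["features"][0])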

json_df3.head()
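
When records contain nested dicts, `pd.json_normalize` is one way to flatten them into dotted column names. The sketch below uses a hypothetical two-feature sample in GeoJSON style; it assumes us-states.json has a similar shape, and the ids, names, and coordinates here are shortened placeholders.

```python
import pandas as pd

# Hypothetical GeoJSON-style sample (coordinates shortened, not real borders).
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "id": "AL",
            "properties": {"name": "Alabama"},
            "geometry": {"type": "Polygon", "coordinates": [[[0, 0], [1, 0], [1, 1]]]},
        },
        {
            "type": "Feature",
            "id": "AK",
            "properties": {"name": "Alaska"},
            "geometry": {"type": "Polygon", "coordinates": [[[2, 2], [3, 2], [3, 3]]]},
        },
    ],
}

# Nested dicts become columns such as "properties.name";
# lists (the coordinates) are kept as single cell values.
flat_df = pd.json_normalize(sample["features"])
print(flat_df[["id", "properties.name", "geometry.type"]])
```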

## Read/Write Excel


In [None]:
parquet_df = pd.read_parquet("./data/userdata1.parquet")
parquet_df2 = pd.read_parquet("./data/userdata2.parquet")
parquet_df3 = pd.read_parquet("./data/userdata3.parquet")
parquet_df.head()

In [None]:
parquet_df.info()

In [None]:
parquet_df.to_excel("./data/output.xlsx", sheet_name="user_data_1")

In [None]:
with pd.ExcelWriter('./data/output.xlsx') as writer:
    parquet_df.to_excel(writer, sheet_name='user_data_1')
    parquet_df2.to_excel(writer, sheet_name='user_data_2')
    parquet_df3.to_excel(writer, sheet_name='user_data_3')


In [None]:
excel_df = pd.read_excel("./data/output.xlsx", sheet_name="user_data_1")
excel_df.head()

In [None]:
excel_df.info()
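
To read every sheet of a workbook in one call, `pd.read_excel` accepts `sheet_name=None` and returns a dict of DataFrames keyed by sheet name. A minimal sketch, assuming openpyxl is installed; the two toy frames and the in-memory buffer are made up for illustration. Passing `index=False` on write keeps the row index out of the sheet, so no "Unnamed: 0" column appears on read.

```python
import io

import pandas as pd

# Toy frames standing in for the user data (values are made up).
first_df = pd.DataFrame({"id": [1, 2]})
second_df = pd.DataFrame({"id": [3, 4]})

# Write both frames to one in-memory workbook, one sheet each.
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    first_df.to_excel(writer, sheet_name="user_data_1", index=False)
    second_df.to_excel(writer, sheet_name="user_data_2", index=False)
buffer.seek(0)

# sheet_name=None returns a dict mapping sheet name -> DataFrame.
sheets = pd.read_excel(buffer, sheet_name=None)
print(list(sheets))  # ['user_data_1', 'user_data_2']
```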

## View all columns

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
excel_df.head()
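
`pd.set_option` changes the display setting for the rest of the session. If the wider display is only needed for one cell, `pd.option_context` is an alternative that restores the previous value afterwards. A minimal sketch; the 30-column frame is made up for illustration.

```python
import pandas as pd

# A one-row frame with 30 columns, wide enough to be truncated by default.
wide_df = pd.DataFrame([range(30)], columns=[f"col_{i}" for i in range(30)])

before = pd.get_option("display.max_columns")

# The setting applies only inside the with-block and is restored afterwards.
with pd.option_context("display.max_columns", None):
    inside = pd.get_option("display.max_columns")
    print(wide_df)

after = pd.get_option("display.max_columns")
```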
