In [1]:
import numpy as np 
import pandas as pd

In [2]:
print(pd.__version__)

2.0.2


# Overview

pandas can read and write many data and document formats:
![download.jpg](attachment:download.jpg)

| Format Types | Data Description                                           | Read Function                                                     | Write Function                                                     |
| -------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| text     | CSV                                                          | [read_csv](https://pandas.pydata.org/docs/user_guide/io.html#io-read-csv-table) | [to_csv](https://pandas.pydata.org/docs/user_guide/io.html#io-store-in-csv) |
| text     | Fixed-Width Text File                                        | [read_fwf](https://pandas.pydata.org/docs/user_guide/io.html#io-fwf-reader) |                                                              |
| text     | [JSON](https://www.json.org/)                                | [read_json](https://pandas.pydata.org/docs/user_guide/io.html#io-json-reader) | [to_json](https://pandas.pydata.org/docs/user_guide/io.html#io-json-writer) |
| text     | HTML                                                         | [read_html](https://pandas.pydata.org/docs/user_guide/io.html#io-read-html) | [to_html](https://pandas.pydata.org/docs/user_guide/io.html#io-html) |
| text     | LaTeX                                                        |                                                              | [Styler.to_latex](https://pandas.pydata.org/docs/user_guide/io.html#io-latex) |
| text     | [XML](https://www.w3.org/standards/xml/core)                 | [read_xml](https://pandas.pydata.org/docs/user_guide/io.html#io-read-xml) | [to_xml](https://pandas.pydata.org/docs/user_guide/io.html#io-xml) |
| text     | Local clipboard                                              | [read_clipboard](https://pandas.pydata.org/docs/user_guide/io.html#io-clipboard) | [to_clipboard](https://pandas.pydata.org/docs/user_guide/io.html#io-clipboard) |
| binary   | MS Excel                                                     | [read_excel](https://pandas.pydata.org/docs/user_guide/io.html#io-excel-reader) | [to_excel](https://pandas.pydata.org/docs/user_guide/io.html#io-excel-writer) |
| binary   | [OpenDocument](http://opendocumentformat.org/)               | [read_excel](https://pandas.pydata.org/docs/user_guide/io.html#io-ods) |                                                              |
| binary   | [HDF5 Format](https://support.hdfgroup.org/HDF5/whatishdf5.html) | [read_hdf](https://pandas.pydata.org/docs/user_guide/io.html#io-hdf5) | [to_hdf](https://pandas.pydata.org/docs/user_guide/io.html#io-hdf5) |
| binary   | [Feather Format](https://github.com/wesm/feather)            | [read_feather](https://pandas.pydata.org/docs/user_guide/io.html#io-feather) | [to_feather](https://pandas.pydata.org/docs/user_guide/io.html#io-feather) |
| binary   | [Parquet Format](https://parquet.apache.org/)                | [read_parquet](https://pandas.pydata.org/docs/user_guide/io.html#io-parquet) | [to_parquet](https://pandas.pydata.org/docs/user_guide/io.html#io-parquet) |
| binary   | [ORC Format](https://orc.apache.org/)                        | [read_orc](https://pandas.pydata.org/docs/user_guide/io.html#io-orc) |                                                              |
| binary   | Stata                                                        | [read_stata](https://pandas.pydata.org/docs/user_guide/io.html#io-stata-reader) | [to_stata](https://pandas.pydata.org/docs/user_guide/io.html#io-stata-writer) |
| binary   | SAS                                                          | [read_sas](https://pandas.pydata.org/docs/user_guide/io.html#io-sas-reader) |                                                              |
| binary   | SPSS                                                         | [read_spss](https://pandas.pydata.org/docs/user_guide/io.html#io-spss-reader) |                                                              |
| binary   | [Python Pickle Format](https://docs.python.org/3/library/pickle.html) | [read_pickle](https://pandas.pydata.org/docs/user_guide/io.html#io-pickle) | [to_pickle](https://pandas.pydata.org/docs/user_guide/io.html#io-pickle) |
| SQL      | SQL                                                          | [read_sql](https://pandas.pydata.org/docs/user_guide/io.html#io-sql) | [to_sql](https://pandas.pydata.org/docs/user_guide/io.html#io-sql) |
| SQL      | Google BigQuery                                              | [read_gbq](https://pandas.pydata.org/docs/user_guide/io.html#io-bigquery) | [to_gbq](https://pandas.pydata.org/docs/user_guide/io.html#io-bigquery) |

# Read the data
## Read the CSV file

For the full list of arguments, refer to the official documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

Some commonly used arguments:

* `index_col = 0`: Uses the first column of the CSV file as the index of the DataFrame.
* `encoding`: Specifies the encoding to use when reading the file (for example `'utf-8'`), which matters mainly for non-English characters.
* `sep`: Specifies the delimiter used in the file. The default is a comma; pass `sep = ';'` for a semicolon-separated file.
* `names`: Manually specifies the name of each column.
* `low_memory = False`: Reads the file in a single pass instead of in chunks, which avoids mixed-type inference when a column contains more than one data type, at the cost of higher memory use.
* `parse_dates`: Specifies which columns should be parsed as `datetime` values.
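
The arguments above can be combined in a single call. The following is a minimal sketch, assuming a small semicolon-separated file without a header row; the data and the column names `id`, `name`, and `joined` are made up for illustration, and an in-memory `io.StringIO` buffer stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data with no header row.
raw = io.StringIO(
    "u1;Ana;2023-06-01\n"
    "u2;Björn;2023-07-15\n"
)

users = pd.read_csv(
    raw,
    sep=";",                         # the delimiter is a semicolon, not the default comma
    names=["id", "name", "joined"],  # no header row, so supply the column names
    index_col=0,                     # use the first column ("id") as the index
    parse_dates=["joined"],          # parse "joined" as datetime values
)
print(users)
print(users.dtypes)
```

When reading from a real file on disk, `encoding` would be passed alongside these arguments; it is omitted here because the buffer already holds decoded text.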

## Read Excel file
###### Requires the openpyxl package

https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

In [3]:
# user_df = pd.read_excel('./users.xlsx',index_col = "name")

# Write data
## CSV file
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
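
As a minimal sketch of `to_csv`, the example below writes a small made-up DataFrame to CSV text and reads it back. The frame `demo` and its values are assumptions for illustration; `sep`, `na_rep`, and `float_format` are `to_csv` parameters chosen to show a few common options.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
demo = pd.DataFrame(
    {"A": [1.5, 2.5], "B": [3.0, np.nan]},
    index=pd.Index(["x", "y"], name="key"),
)

# With no path argument, to_csv returns the CSV text as a string.
csv_text = demo.to_csv(sep=";", na_rep="NA", float_format="%.2f")
print(csv_text)

# Reading the text back with the same separator recovers the values;
# "NA" is one of the markers read_csv treats as missing by default.
restored = pd.read_csv(io.StringIO(csv_text), sep=";", index_col="key")
print(restored)
```

Passing a file name such as `demo.to_csv("output.csv")` writes to disk instead of returning a string.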

## Excel file

In [4]:
data = np.random.normal(5, 2, 36).reshape(6,6)
index = pd.period_range(start = '2023-06', end = '2023-11', freq = 'M', name = 'Time')
df1 = pd.DataFrame(data = data, columns = list('ABCDEF'), index = index)
df1

| Time    | A        | B        | C        | D        | E        | F        |
| ------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 2023-06 | 8.281828 | 8.04799  | 7.721259 | 4.340303 | 7.860816 | 6.295601 |
| 2023-07 | 2.101263 | 2.640764 | 8.201942 | 6.170401 | 6.629005 | 4.51462  |
| 2023-08 | 7.521201 | 5.436757 | 3.169956 | 6.860414 | 7.367665 | 6.958093 |
| 2023-09 | 5.764358 | 7.318291 | 5.685697 | 2.699052 | 8.493171 | 6.805117 |
| 2023-10 | 7.695481 | 6.183254 | 3.650151 | 6.383882 | 2.33762  | 4.89032  |
| 2023-11 | 5.567022 | 2.162513 | 6.140759 | 5.904984 | 6.767168 | 3.894126 |


In [5]:
# Write df1 to an Excel file named output.xlsx
df1.to_excel("output.xlsx")

In [6]:
# Add the sheet name when writing out the data
df1.to_excel("output.xlsx", sheet_name = 'mydata')  

In [7]:
# Copy df1 to df2 and add a new column 'G'
# Write df1 and df2 to output.xlsx, each on a separate sheet
df2 = df1.copy()
df2['G'] = np.random.normal(5,2,6)
with pd.ExcelWriter('output.xlsx') as writer:  
    df1.to_excel(writer, sheet_name = 'mydata1')
    df2.to_excel(writer, sheet_name = 'mydata2')

In [8]:
# Prepare another dataset

data = [
       {"name":"Lebron James","age":38,"city":"Los Angeles"},
       {"name":"Kevin Durant","age":35,"city":"Phoenix"},
       {"name":"Nikola Jokic","age":28,"city": "Denver"},
       {"name":"Stephen Curry","age":35,"city":"Golden State"},
       {"name":"Luka Doncic","age":24,"city":"Dallas"}
]
df = pd.DataFrame(data,columns = ["name","age","city"])
df.set_index('name', inplace = True)
df

| name          | age | city         |
| ------------- | --- | ------------ |
| Lebron James  | 38  | Los Angeles  |
| Kevin Durant  | 35  | Phoenix      |
| Nikola Jokic  | 28  | Denver       |
| Stephen Curry | 35  | Golden State |
| Luka Doncic   | 24  | Dallas       |


In [9]:
# mode = 'a' appends a new sheet to the existing workbook instead of overwriting it
with pd.ExcelWriter('output.xlsx', mode = 'a') as writer:
    df.to_excel(writer, sheet_name = 'NBA players2')

## JSON file

In [10]:
df = pd.DataFrame(
                [['tom', '20'], ['jerry', '18']],
                columns = ['name', 'age']
                )
df

|     | name  | age |
| --- | ----- | --- |
| 0   | tom   | 20  |
| 1   | jerry | 18  |


In [11]:
data = df.to_json()
data

'{"name":{"0":"tom","1":"jerry"},"age":{"0":"20","1":"18"}}'

The **`orient`** parameter

`orient` controls the layout of the JSON output. For a DataFrame the default is `columns`.

* `split` : dict like {'index' -> [index], 'columns' -> [columns], 'data' -> [values]}
* `records` : list like [{column -> value}, ... , {column -> value}]
* `index` : dict like {index -> {column -> value}}
* `columns` : dict like {column -> {index -> value}}
* `values` : just the array of values, without index or column labels
* `table` : dict like {'schema': {schema}, 'data': {data}}

In [12]:
json_split = df.to_json(orient ='split')
print("json_split = ", json_split, "\n")
   
json_records = df.to_json(orient ='records')
print("json_records = ", json_records, "\n")
   
json_index = df.to_json(orient ='index')
print("json_index = ", json_index, "\n")
   
json_columns = df.to_json(orient ='columns')
print("json_columns = ", json_columns, "\n")
   
json_values = df.to_json(orient ='values')
print("json_values = ", json_values, "\n")
   
json_table = df.to_json(orient ='table')
print("json_table = ", json_table, "\n")

json_split =  {"columns":["name","age"],"index":[0,1],"data":[["tom","20"],["jerry","18"]]} 

json_records =  [{"name":"tom","age":"20"},{"name":"jerry","age":"18"}] 

json_index =  {"0":{"name":"tom","age":"20"},"1":{"name":"jerry","age":"18"}} 

json_columns =  {"name":{"0":"tom","1":"jerry"},"age":{"0":"20","1":"18"}} 

json_values =  [["tom","20"],["jerry","18"]] 

json_table =  {"schema":{"fields":[{"name":"index","type":"integer"},{"name":"name","type":"string"},{"name":"age","type":"string"}],"primaryKey":["index"],"pandas_version":"1.4.0"},"data":[{"index":0,"name":"tom","age":"20"},{"index":1,"name":"jerry","age":"18"}]} 
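
A JSON string written with a given `orient` can be read back with `pd.read_json` using the same `orient`. The following is a minimal sketch of that round trip using the `records` layout. It rebuilds the small `tom`/`jerry` frame under the hypothetical name `people`, and it passes `dtype = False` on the assumption that `read_json` may otherwise convert numeric-looking strings such as `'20'` into numbers.

```python
import io
import pandas as pd

# Same small frame as above; ages are stored as strings.
people = pd.DataFrame([["tom", "20"], ["jerry", "18"]], columns=["name", "age"])

json_records = people.to_json(orient="records")

# Wrap the string in StringIO: newer pandas versions deprecate passing
# literal JSON text directly to read_json.
restored = pd.read_json(io.StringIO(json_records), orient="records", dtype=False)
print(restored)
```

The `records` layout does not store the index, so the restored frame gets a fresh default integer index; `split` or `table` would be a better fit when the index needs to survive the round trip.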

