In this post we will discuss reading and writing different file types using Pandas.

## File Formats:

- Data can be saved in a variety of formats.

- Pandas understands how to write and read DataFrames to and from many of these formats.

- We can refer to the [official documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) for a full description of how to interact with all the file formats, but we will briefly discuss a few of them here.

### Writing DataFrames

(How to save a DataFrame to a file.)

As a general rule of thumb, if we have a DataFrame `df` and we would like to save it as a file of type `F`, then we call the method named `df.to_F(...)`.

We will show how this works for a few formats.
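To get a feel for how many formats follow this naming rule, one quick check (a sketch, not an exhaustive reference) is to list the DataFrame attributes whose names start with `to_`. The exact list depends on the installed pandas version, and not every `to_*` method writes a file: `to_numpy` and `to_dict`, for example, return in-memory objects.

```python
import pandas as pd

# Collect every DataFrame attribute whose name starts with "to_".
writers = sorted(name for name in dir(pd.DataFrame) if name.startswith("to_"))
print(writers)
```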

First, let’s create some DataFrames and then try to save them.

In [None]:
import numpy as np
import pandas as pd
np.random.seed(42)  # makes sure we get the same random numbers each time
df1 = pd.DataFrame(
    np.random.randint(0, 100, size=(10, 4)),
    columns=["a", "b", "c", "d"]
)

wanted_mb = 10  # CHANGE THIS LINE
nrow = 100000
ncol = int(((wanted_mb * 1024**2) / 8) / nrow)
df2 = pd.DataFrame(
    np.random.rand(nrow, ncol),
    columns=["x{}".format(i) for i in range(ncol)]
)
print("df1.shape = ", df1.shape)
print("df1 is approximately {} MB".format(df1.memory_usage().sum() / (1024**2)))
print("df2.shape = ", df2.shape)
print("df2 is approximately {} MB".format(df2.memory_usage().sum() / (1024**2)))

df1.shape =  (10, 4)
df1 is approximately 0.00042724609375 MB
df2.shape =  (100000, 13)
df2 is approximately 9.9183349609375 MB


Note that by default df2 will be approximately 10 MB.

If you need to change this number, adjust the value of the `wanted_mb` variable.
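The column count for `df2` comes from simple arithmetic: each `float64` value takes 8 bytes, so the target size in bytes divided by 8 gives the number of values, and dividing that by the number of rows gives the number of columns. The sketch below repeats that calculation on its own; the name `bytes_per_float` is introduced here only for illustration.

```python
wanted_mb = 10
nrow = 100_000
bytes_per_float = 8  # np.random.rand produces float64 values

# Number of float64 columns that fit in the target size (rounded down).
ncol = int((wanted_mb * 1024**2 / bytes_per_float) / nrow)

# Size of the values alone, ignoring the small overhead of the index.
data_mb = nrow * ncol * bytes_per_float / 1024**2

print(ncol)     # 13
print(data_mb)  # 9.918212890625
```

Because the column count is rounded down, the result is slightly under 10 MB, which matches the size reported for `df2` above once the index is included.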

In [None]:
df1

Unnamed: 0,a,b,c,d
0,51,92,14,71
1,60,20,82,86
2,74,74,87,99
3,23,2,21,52
4,1,87,29,37
5,1,63,59,20
6,32,75,57,21
7,88,48,90,58
8,41,91,59,79
9,14,61,61,46


In [None]:
df2

Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12
0,0.618386,0.382462,0.983231,0.466763,0.859940,0.680308,0.450499,0.013265,0.942202,0.563288,0.385417,0.015966,0.230894
1,0.241025,0.683264,0.609997,0.833195,0.173365,0.391061,0.182236,0.755361,0.425156,0.207942,0.567700,0.031313,0.842285
2,0.449754,0.395150,0.926659,0.727272,0.326541,0.570444,0.520834,0.961172,0.844534,0.747320,0.539692,0.586751,0.965255
3,0.607034,0.275999,0.296274,0.165267,0.015636,0.423401,0.394882,0.293488,0.014080,0.198842,0.711342,0.790176,0.605960
4,0.926301,0.651077,0.914960,0.850039,0.449451,0.095410,0.370818,0.668841,0.665922,0.591298,0.274722,0.561243,0.382927
...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,0.392542,0.768030,0.776574,0.480756,0.225518,0.503168,0.050402,0.484236,0.279970,0.789663,0.525459,0.286276,0.945325
99996,0.238556,0.977451,0.524173,0.884652,0.868995,0.014537,0.059585,0.777772,0.173748,0.459401,0.679070,0.019902,0.329513
99997,0.930295,0.216646,0.625821,0.838146,0.985438,0.574139,0.609000,0.357564,0.247900,0.668989,0.205285,0.713146,0.615311
99998,0.940102,0.169032,0.640332,0.793704,0.087788,0.626955,0.553151,0.305402,0.129065,0.961668,0.810356,0.675793,0.039730


#### df.to_csv:

A CSV (comma-separated values) file is one of the most common file types that a data scientist works with.

These files use a comma (`,`) as a delimiter to separate the values, and each row in a CSV file is a data record.

Let’s start with `df.to_csv`.

Without any additional arguments, the `df.to_csv` method returns a string containing the CSV form of the DataFrame:

In [None]:
# notice the plain text format -- one row per line, columns separated by `,`

print(df1.to_csv())

,a,b,c,d
0,51,92,14,71
1,60,20,82,86
2,74,74,87,99
3,23,2,21,52
4,1,87,29,37
5,1,63,59,20
6,32,75,57,21
7,88,48,90,58
8,41,91,59,79
9,14,61,61,46



If we do pass an argument, the first argument will be used as the file name.

In [None]:
df1.to_csv("df1.csv")

Run the cell below to verify that the file was created.

In [None]:
import os
os.path.isfile("df1.csv")

True

Let’s see how long it takes to save `df2` to a file.

In [None]:
%%time
df2.to_csv("df2.csv")

CPU times: user 2.01 s, sys: 68.2 ms, total: 2.08 s
Wall time: 2.09 s


(Because of the `%%time` magic command at the top of the cell, Jupyter reports the total time taken to run all code in the cell.)
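The `%%time` magic only works inside Jupyter/IPython. In a plain Python script, a rough equivalent is to measure wall time with `time.perf_counter`. The sketch below assumes a small stand-in DataFrame called `demo` and a temporary directory (both introduced here for illustration) so that it leaves no files behind; note that it measures wall time only, not CPU time.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical); swap in df2 to time the real thing.
demo = pd.DataFrame(np.random.rand(1000, 5), columns=list("abcde"))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.csv")

    start = time.perf_counter()
    demo.to_csv(path)
    elapsed = time.perf_counter() - start

    # Check for the file before the temporary directory is deleted.
    file_written = os.path.isfile(path)

print("wrote file: {}, wall time: {:.4f} s".format(file_written, elapsed))
```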

As we will see below, CSV isn’t the fastest file format we could choose.

#### df.to_excel:
Most of you will be quite familiar with Excel files and why they are so widely used to store tabular data.
Let’s see how to save a DataFrame in Excel format.

- When saving a DataFrame to an Excel workbook, we can choose both the name of the workbook (file) and the name of the sheet within the file where the DataFrame should be written.

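- We do this by passing the workbook name as the first argument and the sheet name through the `sheet_name` argument, as follows. (Passing the sheet name by keyword is the safer choice, since recent pandas versions deprecate passing it positionally.)

In [None]:
df1.to_excel("df1.xlsx", sheet_name="df1")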

- Pandas also gives us the option to write more than one DataFrame to a workbook.

- To do this, we first construct an instance of `pd.ExcelWriter` and then pass it as the first argument to `df.to_excel`.

Let’s see how this works!

In [None]:
with pd.ExcelWriter("df1.xlsx") as writer:
    # each call writes one sheet into the same workbook
    df1.to_excel(writer, sheet_name="df1")
    (df1 + 10).to_excel(writer, sheet_name="df1 plus 10")

- The `with ... as ...:` syntax used above is an example of a context manager.

- We don’t need to understand all the details behind what this means (google it if you are curious).

- For now, just recognize that particular syntax as the way to write multiple sheets to an Excel workbook.
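For the curious, here is a minimal sketch of what a context manager does, using only the standard library. The function `managed_resource` is a hypothetical stand-in for something like `pd.ExcelWriter`: code before `yield` runs when the `with` block starts, and the `finally` part runs when the block ends, which is the point at which a writer would save and close its file.

```python
from contextlib import contextmanager

events = []  # records the order in which things happen


@contextmanager
def managed_resource(name):
    # Hypothetical stand-in for something like pd.ExcelWriter.
    events.append("open " + name)
    try:
        yield name
    finally:
        # Runs when the with-block ends, even if an error was raised inside it.
        events.append("close " + name)


with managed_resource("workbook") as wb:
    events.append("write to " + wb)

print(events)
# ['open workbook', 'write to workbook', 'close workbook']
```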

df2 as excel:

Saving `df2` to an Excel file takes much longer than saving it as CSV.

In [None]:
%%time
df2.to_excel("df2.xlsx")

CPU times: user 34.8 s, sys: 724 ms, total: 35.6 s
Wall time: 35.7 s


#### df.to_pickle:

- Pickle is Python’s native format for object serialization.

- Its advantage is that it can store almost any Python object, so a DataFrame is saved and restored with its index and column types intact.

- Writing a pickle file is much faster than writing a CSV file, and for numeric data like `df2` the file is often considerably smaller, because values are stored in binary form rather than as text.
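The point about column types can be checked with a small round trip. The sketch below assumes a tiny illustrative DataFrame `demo` with a datetime column (not part of the original example) and writes it to a temporary directory in both formats. Reading the CSV back without extra arguments does not restore the datetime type, while the pickle does.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame with a datetime column.
demo = pd.DataFrame({
    "when": pd.to_datetime(["2021-01-01", "2021-06-15"]),
    "value": [1.5, 2.5],
})

with tempfile.TemporaryDirectory() as tmpdir:
    pickle_path = os.path.join(tmpdir, "demo.pickle")
    csv_path = os.path.join(tmpdir, "demo.csv")

    demo.to_pickle(pickle_path)
    demo.to_csv(csv_path)

    from_pickle = pd.read_pickle(pickle_path)
    from_csv = pd.read_csv(csv_path, index_col=0)

# The pickle round trip gives back an identical DataFrame.
print(from_pickle.equals(demo))

# The CSV round trip reads the dates back as plain text.
print(demo["when"].dtype, "->", from_csv["when"].dtype)
```

One caveat worth keeping in mind: loading a pickle can execute arbitrary code, so only read pickle files from sources you trust.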

In [None]:
%%time
df2.to_pickle("df2.pickle")

CPU times: user 4.72 ms, sys: 13.7 ms, total: 18.5 ms
Wall time: 21.2 ms


#### df.to_json:

- JSON (JavaScript Object Notation) is a lightweight, human-readable text format for storing and exchanging data.

- It is easy for machines to parse and generate, and its syntax is derived from the JavaScript programming language.

- JSON objects store key-value pairs within `{}`, similar to a Python dictionary.
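To see the dictionary-like structure, it helps to look at the JSON text for a very small DataFrame. The sketch below uses an illustrative two-row frame `demo` (not part of the original example) and two of the layouts that `to_json` supports through its `orient` argument.

```python
import json

import pandas as pd

demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# With no path argument, to_json returns the JSON text as a string.
by_column = demo.to_json()                  # default layout: column -> {index -> value}
by_record = demo.to_json(orient="records")  # one object per row

print(by_column)
print(by_record)

# json.loads turns the text back into plain Python dicts and lists.
parsed = json.loads(by_record)
print(parsed[0]["a"])
```

In the default layout the row labels become string keys, since JSON object keys are always strings.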

In [None]:
%%time
df2.to_json("df2.json")

CPU times: user 286 ms, sys: 33.3 ms, total: 319 ms
Wall time: 339 ms


Comparing the total CPU times for saving `df2`: the pickle file took 18.5 ms, the JSON file 319 ms, the CSV file 2.08 s and the Excel file 35.6 s.
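Speed is only one side of the comparison; file size is another. The sketch below writes the same small illustrative DataFrame `demo` (a stand-in, not `df2`) to a temporary directory in three formats and records the size of each file with `os.path.getsize`. Excel is left out here only because it needs an extra engine package to be installed. The actual numbers depend on the data and the pandas version, so none are quoted.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical); use df2 for the real comparison.
demo = pd.DataFrame(np.random.rand(1000, 5), columns=list("abcde"))

sizes = {}
with tempfile.TemporaryDirectory() as tmpdir:
    # Map each file extension to the method that writes that format.
    writers = {
        "csv": demo.to_csv,
        "json": demo.to_json,
        "pickle": demo.to_pickle,
    }
    for extension, write in writers.items():
        path = os.path.join(tmpdir, "demo." + extension)
        write(path)
        sizes[extension] = os.path.getsize(path)

for extension, size in sizes.items():
    print("{:>6}: {:,} bytes".format(extension, size))
```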

### Reading Files into DataFrames:

- As with the `df.to_F` family of methods, there is a matching family of `pd.read_F` functions.

(Note: these are functions defined in the pandas module, not methods on a DataFrame.)


- For now, we just want to highlight the differences in how to read data from each of the file formats.

- Let’s start by reading the files we just created to verify that they match the data we began with.
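The write and read functions mirror each other, which can be seen even without touching the disk. In this sketch the CSV text produced by `to_csv` is wrapped in `io.StringIO` so that `pd.read_csv` can treat it like a file; the frame `demo` is a small illustrative example.

```python
import io

import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# to_csv() with no path returns text; StringIO lets read_csv treat it like a file.
csv_text = demo.to_csv()
round_trip = pd.read_csv(io.StringIO(csv_text), index_col=0)

# True when the values, column types and labels all match the original.
print(round_trip.equals(demo))
```

The `index_col=0` argument tells `read_csv` that the first column holds the row labels, which is why it also appears in the cells that follow.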

#### CSV files :

In [None]:
df1_csv = pd.read_csv("df1.csv", index_col=0)
df1_csv.head()

Unnamed: 0,a,b,c,d
0,51,92,14,71
1,60,20,82,86
2,74,74,87,99
3,23,2,21,52
4,1,87,29,37


#### Excel files :

Pandas has a very handy function called `read_excel()` to read Excel files:

```
df = pd.read_excel('data_excel.xlsx')
```
Parameters include:
  - sheet_name: name of the sheet to import, or its zero-based position in the workbook
  - index_col: index of the column to use as row labels
  - header: index of the row to use as the header
  - and several others

In [None]:
df1_xlsx = pd.read_excel("df1.xlsx", "df1", index_col=0)
df1_xlsx.head()

Unnamed: 0,a,b,c,d
0,51,92,14,71
1,60,20,82,86
2,74,74,87,99
3,23,2,21,52
4,1,87,29,37


But the Excel file we created contains multiple sheets, right?
So how can we access them?

For this, we can use the `pd.ExcelFile` class, whose `sheet_names` attribute lists the names of all the sheets in the file:

In [None]:
# read Excel sheets in pandas
x2 = pd.ExcelFile('df1.xlsx')

# print sheet name
x2.sheet_names

['df1', 'df1 plus 10']

After doing that, we can easily read data from any sheet we wish by providing its name in the `sheet_name` parameter in the `read_excel()` function:

In [None]:
df1_xlsx = pd.read_excel("df1.xlsx", sheet_name="df1 plus 10", index_col=0)
df1_xlsx.head()

Unnamed: 0,a,b,c,d
0,61,102,24,81
1,70,30,92,96
2,84,84,97,109
3,33,12,31,62
4,11,97,39,47


In [None]:
df2_xlsx = pd.read_excel("df2.xlsx", index_col=0)
df2_xlsx.head()

Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12
0,0.618386,0.382462,0.983231,0.466763,0.85994,0.680308,0.450499,0.013265,0.942202,0.563288,0.385417,0.015966,0.230894
1,0.241025,0.683264,0.609997,0.833195,0.173365,0.391061,0.182236,0.755361,0.425156,0.207942,0.5677,0.031313,0.842285
2,0.449754,0.39515,0.926659,0.727272,0.326541,0.570444,0.520834,0.961172,0.844534,0.74732,0.539692,0.586751,0.965255
3,0.607034,0.275999,0.296274,0.165267,0.015636,0.423401,0.394882,0.293488,0.01408,0.198842,0.711342,0.790176,0.60596
4,0.926301,0.651077,0.91496,0.850039,0.449451,0.09541,0.370818,0.668841,0.665922,0.591298,0.274722,0.561243,0.382927


#### JSON file :

`df = pd.read_json('data_index.json', orient = "index")`

Some of its parameters include:
  - orient: the expected layout of the JSON string, e.g. "index"
  - convert_dates: whether date-like columns should be converted to dates
  - dtype: the data types to use for the columns, or whether to infer them
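As a small illustration of `orient`, the sketch below reads JSON text written in the "index" layout, where each top-level key is a row label. The JSON text and its labels are made up for this example, and `io.StringIO` is used so that no file is needed.

```python
import io

import pandas as pd

# Hypothetical JSON text in the "index" layout: {row label -> {column -> value}}.
json_text = '{"alpha": {"a": 1, "b": 2}, "beta": {"a": 3, "b": 4}}'

demo = pd.read_json(io.StringIO(json_text), orient="index")
print(demo)
```

With the default `orient` for DataFrames, the same text would instead be read with `alpha` and `beta` as columns, so it matters that the value passed here matches the layout used when the file was written.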

In [None]:
df2_json = pd.read_json("df2.json")
df2_json.head()

Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12
0,0.618386,0.382462,0.983231,0.466763,0.85994,0.680308,0.450499,0.013265,0.942202,0.563288,0.385417,0.015966,0.230894
1,0.241025,0.683264,0.609997,0.833195,0.173365,0.391061,0.182236,0.755361,0.425156,0.207942,0.5677,0.031313,0.842285
2,0.449754,0.39515,0.926659,0.727272,0.326541,0.570444,0.520834,0.961172,0.844534,0.74732,0.539692,0.586751,0.965255
3,0.607034,0.275999,0.296274,0.165267,0.015636,0.423401,0.394882,0.293488,0.01408,0.198842,0.711342,0.790176,0.60596
4,0.926301,0.651077,0.91496,0.850039,0.449451,0.09541,0.370818,0.668841,0.665922,0.591298,0.274722,0.561243,0.382927


#### Pickle file :

To read a pickle file, use `pd.read_pickle()`:

In [None]:
df2_pickle = pd.read_pickle("df2.pickle")
df2_pickle.head()

Unnamed: 0,x0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12
0,0.618386,0.382462,0.983231,0.466763,0.85994,0.680308,0.450499,0.013265,0.942202,0.563288,0.385417,0.015966,0.230894
1,0.241025,0.683264,0.609997,0.833195,0.173365,0.391061,0.182236,0.755361,0.425156,0.207942,0.5677,0.031313,0.842285
2,0.449754,0.39515,0.926659,0.727272,0.326541,0.570444,0.520834,0.961172,0.844534,0.74732,0.539692,0.586751,0.965255
3,0.607034,0.275999,0.296274,0.165267,0.015636,0.423401,0.394882,0.293488,0.01408,0.198842,0.711342,0.790176,0.60596
4,0.926301,0.651077,0.91496,0.850039,0.449451,0.09541,0.370818,0.668841,0.665922,0.591298,0.274722,0.561243,0.382927


With the `pd.read_F` family of functions, we can also read files from places on the internet.


Below is an example of using `pd.read_csv` to read a CSV file from a URL:

In [None]:
df1_url = "http://winterolympicsmedals.com/medals.csv"
df1_web = pd.read_csv(df1_url, index_col=0)
df1_web.head()

Year,City,Sport,Discipline,NOC,Event,Event gender,Medal
1924,Chamonix,Skating,Figure skating,AUT,individual,M,Silver
1924,Chamonix,Skating,Figure skating,AUT,individual,W,Gold
1924,Chamonix,Skating,Figure skating,AUT,pairs,X,Gold
1924,Chamonix,Bobsleigh,Bobsleigh,BEL,four-man,M,Bronze
1924,Chamonix,Ice Hockey,Ice Hockey,CAN,ice hockey,M,Gold


### Cleanup:

If you want to remove the files we just created, run the following cell.



In [None]:
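def try_remove(file):
    # delete the file only if it exists, so the cell can be re-run safely
    if os.path.isfile(file):
        os.remove(file)

# build every file name we created above: df1.csv, df1.json, ..., df2.pickle
for name in ["df1", "df2"]:
    for extension in ["csv", "json", "xlsx", "pickle"]:
        filename = name + "." + extension
        try_remove(filename)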