# Exporting Data

> Data science is not effective without saving results.
>
> \- Another wise person

## Applied Review

### Data in Python

* Data is frequently represented inside a **DataFrame** - a class from the pandas library

* Other structures exist, too - dicts, models, etc.

* Data is stored in memory, which makes it relatively quick to access

* Data is session-specific, so quitting Python (i.e., shutting down JupyterLab) removes the data from memory

### Importing Data

* Tabular data can be imported into DataFrames using the `pd.read_csv()` function. Its parameters control options such as the delimiter, and other `pd.read_xxx()` functions handle other file formats.

* Other data formats like JSON (key-value pairs) and Pickle (native Python) can be imported using the `with` statement and respective functions:
  * JSON files use the `load()` function from the `json` library
  * Pickle files use the `load()` function from the `pickle` library
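
As a quick refresher, the JSON and Pickle import patterns above might look like the following sketch. The file names and contents are made up for illustration, and the files are first written to a temporary directory so the example is self-contained.

```python
import json
import pickle
import tempfile
from pathlib import Path

# Hypothetical sample files, created in a temporary directory for illustration
tmp_dir = Path(tempfile.mkdtemp())
json_path = tmp_dir / 'example.json'
pickle_path = tmp_dir / 'example.pickle'
json_path.write_text('{"first": "Guido", "last": "van Rossum"}')
pickle_path.write_bytes(pickle.dumps([1, 2, 3]))

# JSON: open the file in text mode and parse it with json.load()
with open(json_path, 'r') as f:
    example_dict = json.load(f)

# Pickle: open the file in binary mode and restore it with pickle.load()
with open(pickle_path, 'rb') as f:
    example_list = pickle.load(f)

print(example_dict)
print(example_list)
```

Note that the JSON file is opened in text mode (`'r'`) while the Pickle file is opened in binary mode (`'rb'`).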

## General Model

### General Framework

A general way to conceptualize data export from Python to disk:

1. Data sits in memory in the Python session

2. Python code can be used to copy the data from Python's memory to an appropriate format on disk

This framework can be visualized below:

<center>
<img src="images/export-framework.png" alt="export-framework.png" width="80%" height="80%">
</center>
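
The two steps of the framework can be sketched with nothing but the standard library. This is a minimal illustration, not a recommended export format: the `scores` list and the `scores.txt` file name are invented for the example, and a temporary directory stands in for a real data folder.

```python
import tempfile
from pathlib import Path

# Step 1: the data sits in memory for as long as the Python session lasts
scores = [88, 92, 79]

# Step 2: Python code copies the data from memory to a file on disk
# (temporary directory and file name are made up for illustration)
export_path = Path(tempfile.mkdtemp()) / 'scores.txt'
with open(export_path, 'w') as f:
    for score in scores:
        f.write(f'{score}\n')

# The file now exists on disk, independent of the in-memory list
print(export_path.exists())
print(export_path.read_text())
```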

## Exporting DataFrames

Remember that DataFrames are representations of tabular data -- therefore, knowing how to export DataFrames to tabular data files is important.

### Exporting Setup

We need data to export.

Let's begin by revisiting the importing of tabular data into a DataFrame:

In [1]:
import pandas as pd
planes_df = pd.read_csv('../data/planes.csv')

Next, let's do some manipulations on `planes_df`.

<div class="admonition tip alert alert-warning">
    <b><p class="first admonition-title" style="font-weight: bold">Question</p></b>
    <p>How do we select the <tt class="docutils literal">year</tt> and <tt class="docutils literal">manufacturer</tt> variables while returning a DataFrame?</p>
</div>

In [2]:
planes_df = planes_df[['year', 'manufacturer']]

<div class="admonition tip alert alert-warning">
    <b><p class="first admonition-title" style="font-weight: bold">Question</p></b>
    <p>How do we compute the average <tt class="docutils literal">year</tt> by <tt class="docutils literal">manufacturer</tt>?</p>
</div>

In [3]:
avg_year_by_man_df = (
    planes_df.groupby('manufacturer', as_index=False)
    .mean()
)

Let's view our result to find the manufacturers with the oldest planes:

In [4]:
avg_year_by_man_df.sort_values('year').head()

| | manufacturer | year |
|---|---|---|
| 16 | DOUGLAS | 1956.0 |
| 15 | DEHAVILLAND | 1959.0 |
| 7 | BEECH | 1969.5 |
| 13 | CESSNA | 1972.444444 |
| 12 | CANADAIR LTD | 1974.0 |


### Exporting DataFrames with Pandas

DataFrames can be exported using a method built into the DataFrame object itself: `DataFrame.to_csv()`.

In [5]:
avg_year_by_man_df.to_csv('../data/avg_year_by_man.csv')

Let's reimport to see the tabular data we just exported:

In [6]:
pd.read_csv('../data/avg_year_by_man.csv').head()

| | Unnamed: 0 | manufacturer | year |
|---|---|---|---|
| 0 | 0 | AGUSTA SPA | 2001.0 |
| 1 | 1 | AIRBUS | 2007.20122 |
| 2 | 2 | AIRBUS INDUSTRIE | 1998.233333 |
| 3 | 3 | AMERICAN AIRCRAFT INC | NaN |
| 4 | 4 | AVIAT AIRCRAFT INC | 2007.0 |


<div class="admonition warning alert alert-warning">
 <b><p class="first admonition-title" style="font-weight: bold">Question?</p></b>
 <p>Notice the extra column named <tt class="docutils literal">Unnamed: 0</tt>. Where did this extra column come from?</p>
</div>

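This `Unnamed: 0` column is the DataFrame's index. Although it isn't part of the original data, `to_csv()` writes it to the file by default. Because the index is written without a column name, `read_csv()` labels it `Unnamed: 0` when the file is read back in.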

We can elect not to save the index with the DataFrame by passing `False` to the `index` parameter of `to_csv()`:

In [7]:
avg_year_by_man_df.to_csv('../data/avg_year_by_man.csv', index=False)

And then check our result again:

In [8]:
pd.read_csv('../data/avg_year_by_man.csv').head()

| | manufacturer | year |
|---|---|---|
| 0 | AGUSTA SPA | 2001.0 |
| 1 | AIRBUS | 2007.20122 |
| 2 | AIRBUS INDUSTRIE | 1998.233333 |
| 3 | AMERICAN AIRCRAFT INC | NaN |
| 4 | AVIAT AIRCRAFT INC | 2007.0 |


The `to_csv()` method has similar parameters to `read_csv()`. A few examples:

* `sep` - the data's delimiter
* `header` - whether or not to write out the column names
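
To see what these two parameters do, here is a small sketch on a made-up DataFrame (the values are illustrative, not taken from `planes.csv`). It relies on the fact that `to_csv()` returns the CSV text as a string when no path is given, which makes the output easy to inspect without writing a file.

```python
import pandas as pd

# Small made-up DataFrame, for illustration only
example_df = pd.DataFrame({
    'manufacturer': ['AIRBUS', 'BOEING'],
    'year': [2007, 2000],
})

# With no path given, to_csv() returns the CSV text as a string
tab_separated = example_df.to_csv(sep='\t', index=False)  # tab-delimited
no_header = example_df.to_csv(header=False, index=False)  # column names omitted

print(tab_separated)
print(no_header)
```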

Full documentation can be pulled up by running the method name followed by a question mark:

In [9]:
pd.DataFrame.to_csv?

Signature:
pd.DataFrame.to_csv(
    self,
    path_or_buf: 'FilePath | WriteBuffer[bytes] | WriteBuffer[str] | None' = None,
    sep: 'str' = ',',
    na_rep: 'str' = '',
    float_format: 'str | Callable | None' = None,
    columns: 'Sequence[Hashable] | None' = None,
    header: 'bool_t | list[str]' = True,
    index: 'bool_t'
    ...

<div class="admonition note alert alert-info">
    <b><p class="first admonition-title" style="font-weight: bold">Note</p></b>
    <p>There are several other <tt class="docutils literal">df.to_xxx()</tt> methods that allow you to export DataFrames to other data formats. See more options <a href="https://pandas.pydata.org/docs/search.html?q=DataFrame.to_#">here</a>.</p>
</div>
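
As one example of these other methods, the sketch below uses `DataFrame.to_json()` to write a made-up DataFrame to a JSON file and then reads it back with the `json` library. The data, the file name, and the choice of `orient='records'` (one JSON object per row) are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Small made-up DataFrame, for illustration only
example_df = pd.DataFrame({
    'manufacturer': ['AIRBUS', 'BOEING'],
    'year': [2007, 2000],
})

# Write one JSON object per row to a file in a temporary directory
json_path = Path(tempfile.mkdtemp()) / 'example.json'
example_df.to_json(json_path, orient='records')

# Read the file back with the json library to inspect what was written
with open(json_path, 'r') as f:
    records = json.load(f)

print(records)
```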

### Your Turn

1. Exporting data is copying data from Python's ________ to the ________. 
2. Fill in the blanks to the following code to:
   - import the flights.csv file,
   - filter for flights with a destination to the 'CVG' airport,
   - write this subsetted data out to a new CSV file titled 'flights_to_cvg' (but don't save the index to the CSV). 
   <br><br>

   ```python
   import pandas as pd
   flights_df = pd.________('../data/flights.csv')
   flights_to_cvg_df = flights_df[flights_df[________] == 'CVG']
   flights_to_cvg_df.________('../data/flights_to_cvg.csv', ________ = False)
   ```

## Exporting Other Files

Recall that we previously imported JSON and Pickle files. Now we will see how to save them.

### JSON Files

Take a look at the `dict` below:

In [10]:
dict_example = {
    "first": "Guido",
    "last": "van Rossum"
}

We can save it as a JSON file using the `with` statement and the `dump()` function from the `json` library:

In [11]:
import json
with open('../data/dict_example_export.json', 'w') as f:
    json.dump(dict_example, f)

We can then reimport this to verify we saved it correctly:

In [12]:
with open('../data/dict_example_export.json', 'r') as f:
    imported_json = json.load(f)

In [13]:
type(imported_json)

dict

In [14]:
imported_json

{'first': 'Guido', 'last': 'van Rossum'}

### Pickle Files

<div class="admonition warning alert alert-warning">
 <b><p class="first admonition-title" style="font-weight: bold">Question?</p></b>
 <p>What are Pickle files?</p>
</div>

Python's native data files are known as **Pickle** files:

* Pickle files conventionally use the `.pickle` (or `.pkl`) extension

* Pickle files are great for saving native Python data that can't easily be represented by other file types
  * Pre-processed data
  * Models
  * Any other Python object...
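
To make the contrast with other file types concrete, the sketch below tries to serialize a Python `set` with both libraries. The `airports` set is made up for illustration, and `pickle.dumps()` / `pickle.loads()` are the in-memory counterparts of `dump()` / `load()`, used here so that no file is needed.

```python
import json
import pickle

# A set is a native Python type with no direct JSON equivalent
airports = {'CVG', 'JFK', 'LAX'}

# json refuses to serialize the set and raises a TypeError
try:
    json.dumps(airports)
    json_worked = True
except TypeError:
    json_worked = False

# pickle can serialize the same object and restore it as a set
restored = pickle.loads(pickle.dumps(airports))

print(json_worked)
print(restored == airports, type(restored))
```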

### Exporting Pickle Files

Pickle files can be exported using the `dump()` function from the `pickle` library, paired with the `with` statement and the `open()` function in binary write mode (`'wb'`):

In [15]:
import pickle
with open('../data/pickle_example_export.pickle', 'wb') as f:
    pickle.dump(dict_example, f)

We can then reimport this to verify we saved it correctly:

In [16]:
with open('../data/pickle_example_export.pickle', 'rb') as f:
    imported_pickle = pickle.load(f)

In [17]:
type(imported_pickle)

dict

In [18]:
imported_pickle

{'first': 'Guido', 'last': 'van Rossum'}

# Questions

Are there any questions before we move on?