In [1]:
from slide_tools import hide_code_in_slideshow

# Exporting Data

> Data science is not effective without saving results.
>
> \- Another wise person

## Applied Review

### Data in Python

* Data is frequently represented inside a **DataFrame** - a class from the pandas library

* Other structures exist, too - dicts, models, etc.

* Data is stored in memory - this makes it relatively fast to access

* Data is session-specific, so quitting Python (e.g., shutting down JupyterLab) removes the data from memory

### Importing Data

* Tabular data can be imported into DataFrames using the `pd.read_csv()` function - there are parameters for different options

* Other data formats like JSON (key-value pairs) and Pickle (native Python) can be imported using the `with` statement and respective functions:
  * JSON files use the `load()` function from the `json` library
  * Pickle files use the `load()` function from the `pickle` library
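As a quick refresher, the three import patterns above can be sketched as follows. The file names and contents here are hypothetical and are written to a temporary folder first, so the example does not depend on the course's `../data` directory.

```python
import json
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Create three small throwaway files (hypothetical contents) to import from.
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / 'example.csv').write_text('name,year\nA,2001\nB,2005\n')
(tmp_dir / 'example.json').write_text('{"first": "Guido"}')
(tmp_dir / 'example.pickle').write_bytes(pickle.dumps([1, 2, 3]))

# Tabular data -> DataFrame
example_df = pd.read_csv(tmp_dir / 'example.csv')

# JSON -> dict, using the with statement and json.load()
with open(tmp_dir / 'example.json', 'r') as f:
    example_dict = json.load(f)

# Pickle -> whatever Python object was saved (here, a list);
# note the binary read mode 'rb'
with open(tmp_dir / 'example.pickle', 'rb') as f:
    example_list = pickle.load(f)

print(example_df.shape, example_dict, example_list)
```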

## General Model

### General Framework

A general way to conceptualize exporting data from Python to disk:

1. Data sits in memory in the Python session

2. Python code can be used to copy the data from Python's memory to an appropriate format on disk

This framework can be visualized below:

<center>
<img src="images/export-framework.png" alt="export-framework.png" width="800" height="800">
</center>
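The two steps can be sketched with nothing but the standard library. The `scores` list and the file name are made up for illustration, and `del` only loosely stands in for ending the Python session.

```python
import os
import tempfile

# Step 1: data sits in memory in the Python session
scores = [90, 85, 77]

# Step 2: Python code copies the data to a file on disk
path = os.path.join(tempfile.mkdtemp(), 'scores.txt')
with open(path, 'w') as f:
    for score in scores:
        f.write(f'{score}\n')

# Dropping the in-memory copy does not touch the copy on disk
del scores

# The data can be read back from the file later
with open(path, 'r') as f:
    restored = [int(line) for line in f]

print(restored)
```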

## Exporting DataFrames

Remember that DataFrames are representations of tabular data -- therefore, knowing how to export DataFrames to tabular data files is important.

### Exporting Setup

We need data to export.

Let's begin by revisiting the importing of tabular data into a DataFrame:

In [21]:
import pandas as pd
planes_df = pd.read_csv('../data/planes.csv')

Next, let's do some manipulations on `planes_df`.

<font class="question">
    <strong>Question</strong>:<br><em>How do we select the <code>year</code> and <code>manufacturer</code> variables while returning a DataFrame?</em>
</font>

In [22]:
planes_df = planes_df[['year', 'manufacturer']]

<font class="question">
    <strong>Question</strong>:<br><em>How do we compute the average <code>year</code> by <code>manufacturer</code>?</em>
</font>

In [23]:
avg_year_by_man_df = planes_df.groupby('manufacturer', as_index = False).mean()

Let's view our result to find the manufacturers with the oldest planes:

In [24]:
avg_year_by_man_df.sort_values('year').head()

| | manufacturer | year |
|---|---|---|
| 16 | DOUGLAS | 1956.0 |
| 15 | DEHAVILLAND | 1959.0 |
| 7 | BEECH | 1969.5 |
| 13 | CESSNA | 1972.444444 |
| 12 | CANADAIR LTD | 1974.0 |


### Exporting DataFrames with Pandas

DataFrames can be exported using a method built into the DataFrame object itself: `DataFrame.to_csv()`.

In [6]:
avg_year_by_man_df.to_csv('../data/avg_year_by_man.csv')

Let's reimport to see the tabular data we just exported:

In [7]:
pd.read_csv('../data/avg_year_by_man.csv').head()

| | Unnamed: 0 | manufacturer | year |
|---|---|---|---|
| 0 | 0 | AGUSTA SPA | 2001.0 |
| 1 | 1 | AIRBUS | 2007.20122 |
| 2 | 2 | AIRBUS INDUSTRIE | 1998.233333 |
| 3 | 3 | AMERICAN AIRCRAFT INC | NaN |
| 4 | 4 | AVIAT AIRCRAFT INC | 2007.0 |


Notice the extra column named `Unnamed: 0`!

<font class="question">
    <strong>Question</strong>:<br><em>Where did the extra column come from?</em>
</font>

In [8]:
hide_code_in_slideshow()
pd.read_csv('../data/avg_year_by_man.csv').head()

| | Unnamed: 0 | manufacturer | year |
|---|---|---|---|
| 0 | 0 | AGUSTA SPA | 2001.0 |
| 1 | 1 | AIRBUS | 2007.20122 |
| 2 | 2 | AIRBUS INDUSTRIE | 1998.233333 |
| 3 | 3 | AMERICAN AIRCRAFT INC | NaN |
| 4 | 4 | AVIAT AIRCRAFT INC | 2007.0 |


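This `Unnamed: 0` column is the DataFrame's index. Although it is not part of the original data, `to_csv()` writes it to the file by default. Because the index column is written without a header name, `read_csv()` labels it `Unnamed: 0` on import.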

We can elect not to save the index with the DataFrame by passing `False` to the `index` parameter of `to_csv()`:

In [9]:
avg_year_by_man_df.to_csv('../data/avg_year_by_man.csv', index = False)

And then check our result again:

In [10]:
pd.read_csv('../data/avg_year_by_man.csv').head()

| | manufacturer | year |
|---|---|---|
| 0 | AGUSTA SPA | 2001.0 |
| 1 | AIRBUS | 2007.20122 |
| 2 | AIRBUS INDUSTRIE | 1998.233333 |
| 3 | AMERICAN AIRCRAFT INC | NaN |
| 4 | AVIAT AIRCRAFT INC | 2007.0 |
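To see where the `Unnamed: 0` name comes from, it can help to look at the raw CSV text. The sketch below uses a tiny made-up DataFrame (not the real `avg_year_by_man_df`) and relies on `to_csv()` returning the CSV as a string when no file path is given.

```python
import io

import pandas as pd

# Hypothetical stand-in for avg_year_by_man_df; the values are made up.
example_df = pd.DataFrame({
    'manufacturer': ['AIRBUS', 'BOEING'],
    'year': [2007.0, 2001.5],
})

# With no file path, to_csv() returns the CSV text instead of writing a file.
with_index = example_df.to_csv()
without_index = example_df.to_csv(index=False)

# The header row starts with an empty name when the index is written.
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# On import, read_csv() fills in that empty header name.
reimported = pd.read_csv(io.StringIO(with_index))
print(list(reimported.columns))
```

The first header line begins with a comma, meaning the index column was written without a name, and that nameless column is what `read_csv()` reports as `Unnamed: 0`.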


The `to_csv()` method has similar parameters to `read_csv()`. A few examples:

* `sep` - the data's delimiter
* `header` - whether or not to write out the column names
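A minimal sketch of these two parameters, again using a small made-up DataFrame and letting `to_csv()` return text rather than write a file:

```python
import pandas as pd

# Hypothetical example data; the values are made up.
example_df = pd.DataFrame({
    'manufacturer': ['AIRBUS', 'BOEING'],
    'year': [2007.0, 2001.5],
})

# sep: use a tab instead of a comma between values
tab_separated = example_df.to_csv(sep='\t', index=False)

# header: leave out the row of column names
no_header = example_df.to_csv(header=False, index=False)

print(tab_separated)
print(no_header)
```

When writing a tab-separated file to disk, the same `sep` value would also need to be passed to `read_csv()` when re-importing it.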

Full documentation can be pulled up by running the method name followed by a question mark:

In [11]:
pd.DataFrame.to_csv?

<font class="your_turn">
    <strong>Your Turn</strong>
</font>

1. Exporting data is copying data from Python's ________ to the ________. 
2. Fill in the blanks to fix the following code:

   ```python
   import pandas as pd
   flights_df = pd.________('../data/flights.csv')
   flights_to_cvg_df = flights_df[flights_df[________] == 'CVG']
   flights_to_cvg_df.________('../data/flights_to_cvg.csv', ________ = False)
   ```

## Exporting Other Files

Recall that we previously imported JSON and Pickle files -- now we will see how to save them.

### JSON Files

Take a look at the `dict` below:

In [12]:
dict_example = {
    "first": "Guido",
    "last": "van Rossum"
}

We can save it as a JSON file using the `with` statement and the `dump()` function from the `json` library:

In [13]:
import json
with open('../data/dict_example_export.json', 'w') as f:
    json.dump(dict_example, f)

We can then reimport this to verify we saved it correctly:

In [14]:
with open('../data/dict_example_export.json', 'r') as f:
    imported_json = json.load(f)

In [15]:
type(imported_json)

dict

In [16]:
imported_json

{'first': 'Guido', 'last': 'van Rossum'}
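The round trip above works cleanly because `dict_example` contains only strings. As a hedged illustration with made-up data, the sketch below shows that JSON stores only a handful of basic types, so some Python objects come back changed or cannot be exported at all.

```python
import json

# A tuple value and an integer key (made-up example data)
mixed = {'point': (1, 2), 3: 'three'}

# Export to JSON text and import it again
round_tripped = json.loads(json.dumps(mixed))
print(round_tripped)  # the tuple is now a list and the key 3 is now '3'

# A set has no JSON equivalent, so exporting it raises a TypeError
try:
    json.dumps({'tags': {'a', 'b'}})
    set_error = None
except TypeError as err:
    set_error = err

print(type(set_error).__name__)
```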

### Pickle Files

<font class="question">
    <strong>Question</strong>:<br><em>What are Pickle files?</em>
</font>

Python's native data files are known as **Pickle** files:

* Pickle files conventionally use the `.pickle` extension (`.pkl` is also common)

* Pickle files are great for saving native Python data that can't easily be represented by other file types
  * Pre-processed data
  * Models
  * Any other Python object...

#### Exporting Pickle Files

Pickle files can be exported using the `pickle` library paired with the `with` statement and the `open()` function:

In [17]:
import pickle
with open('../data/pickle_example_export.pickle', 'wb') as f:
    pickle.dump(dict_example, f)

We can then reimport this to verify we saved it correctly:

In [18]:
with open('../data/pickle_example_export.pickle', 'rb') as f:
    imported_pickle = pickle.load(f)

In [19]:
type(imported_pickle)

dict

In [20]:
imported_pickle

{'first': 'Guido', 'last': 'van Rossum'}
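The same pattern also covers native Python data that other file types can't easily represent. The sketch below is illustrative only: the `record` contents are made up, and the file goes to a temporary folder rather than `../data`.

```python
import datetime
import os
import pickle
import tempfile

# Made-up data using types that plain text formats do not store directly
record = {
    'tags': {'fast', 'old'},                  # set
    'checked_on': datetime.date(2020, 1, 1),  # date object
    'dims': (10, 20),                         # tuple
}

path = os.path.join(tempfile.mkdtemp(), 'record_example.pickle')

# Export: note the binary write mode 'wb'
with open(path, 'wb') as f:
    pickle.dump(record, f)

# Import: note the binary read mode 'rb'
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == record)
print(type(restored['tags']), type(restored['dims']))
```

The set, date, and tuple all come back with their original types, which is the main reason to reach for Pickle over a text format.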

# Questions

Are there any questions before we move on?