# 04. Introduction

We are going to work with data, so the **very first** thing to do is to understand how to **load** data.

Data can come in many formats. We are not going to cover every possible scenario, only the **most common** ones.

To do so, we are going to introduce one of the most important libraries in the scientific Python ecosystem: `pandas`.

## In this notebook:

* General introduction to Pandas and `pandas.DataFrame`
* Examples of data loading in `CSV` and `JSON` format, using `pandas` built-in functions to handle them.


---

## 1. Introducing Pandas

Pandas is the Swiss Army knife for data analysis in Python.

With Pandas, data analysis becomes easy, but first you need to get your head around two concepts: **Data Frames** and **Data Series**.

These structures are the data containers that `pandas` provides to **represent** the data. Once the data is loaded, pandas holds it in a proper container that makes it easy to handle programmatically.

## Brief Introduction to Pandas Data Structures

Pandas builds on top of two main data structures: **Data Frame** and **Series**.

### Data Frame _from the outside_

From the outside, a Data Frame looks like a _Table_ (two-dimensional data) with rows and columns.
Each column has its own name, and every row can be accessed either by position or via the **index**, which holds a reference to each row.

<img src="images/df_outside.png" width="50%" />
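The two ways of reaching a row can be sketched on a small hand-made table. The column names, values, and index labels below are made up for illustration and are not taken from the notebook's data files. `.iloc` selects by position, `.loc` by index label.

```python
import pandas as pd

# Hypothetical toy table; the index labels "a", "b", "c" are made up.
df = pd.DataFrame(
    {"product": ["pen", "ink", "pad"], "price": [1.5, 4.0, 2.25]},
    index=["a", "b", "c"],
)

first_row = df.iloc[0]   # access by position: the first row
row_b = df.loc["b"]      # access by index label: the row labelled "b"

print(df.columns.tolist())                   # the column names
print(first_row["product"], row_b["price"])
```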

### Data Frame _from the inside_

On the inside, each column of a **Data Frame** has its own **data type** (`dtype`), which is typically a NumPy data type (such as `int64` or `float64`) or the generic `object`.

**Note**: Each Column of a Data Frame in `pandas` is actually a **Data Series**.

<img src="images/df_inside.png" width="60%" />
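A minimal sketch of the note above, again on made-up data: selecting a single column returns a `Series`, and each column carries its own `dtype`.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame(
    {"name": ["Ann", "Bob"], "age": [31, 45], "height": [1.70, 1.82]}
)

ages = df["age"]             # selecting one column returns a Series
print(type(ages).__name__)   # Series
print(df.dtypes)             # one dtype per column
```

The Series shares its index with the Data Frame it was taken from, which is what keeps the columns aligned row by row.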

---

### Data Frame vs Numpy Array


#### Numpy Array

<img src="./images/ndarray.png" />

#### Pandas Data Frame

<img src="./images/df_inside_numpy.png" width="70%" />
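One way to read the comparison, offered as a sketch on made-up values rather than as a description of the figures: a NumPy array has a single `dtype` for all its elements, while a Data Frame keeps a separate `dtype` for each column.

```python
import numpy as np
import pandas as pd

# A NumPy array is homogeneous: mixing ints and floats upcasts everything.
arr = np.array([[1, 2.5], [3, 4.5]])
print(arr.dtype)

# A Data Frame is heterogeneous across columns (hypothetical toy data).
df = pd.DataFrame({"count": [1, 3], "price": [2.5, 4.5], "label": ["a", "b"]})
print(df.dtypes)

# Converting numeric columns back to an array yields one common dtype again.
numeric = df[["count", "price"]].to_numpy()
print(numeric.dtype)
```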

---

## Let's get Started


The following section has been adapted from https://github.com/alanderex/pandas-pydata-2017
<span style="font-size: small;float: right;">&copy; 2015-2017 Alexander C.S. Hendorf, <a href="http://koenigsweg.com">Königsweg GmbH</a>, Mannheim </span>

In [None]:
import pandas as pd

**`import pandas as pd`** is a widely used convention.


In the `data` directory we have some files stored from the *Blooth store*, *let's import one*!

In [None]:
!ls ./data

**Note**: When you run a command preceded by an exclamation mark (`!`) in a code cell in Jupyter Notebook, that command is interpreted as a **shell command**, as if typed in a Terminal.

In [None]:
!head ./data/blooth_sales_data.csv

### Reading data in CSV

CSV (Comma-Separated Values) is one of the most popular data formats:

- each column of data is separated by a comma (or by another separator that you specify);
- the first row of the file may correspond to column headers;
- the first column of each row may correspond to values of the `row index`.

To read a file in CSV format, `pandas` provides the built-in `read_csv` function:

In [None]:
sales_data = pd.read_csv('./data/blooth_sales_data.csv')
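The three properties of the CSV format listed above (separator, header row, index column) each map onto a `read_csv` parameter. The sketch below uses an in-memory string instead of a file, so the content and column names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content using ";" as separator; the first row holds the
# column headers and the first column ("id") holds the row index values.
raw = "id;product;price\n10;pen;1.5\n20;ink;4.0\n"

toy = pd.read_csv(
    io.StringIO(raw),
    sep=";",        # column separator
    header=0,       # row 0 contains the column names
    index_col=0,    # column 0 becomes the row index
)

print(toy)
print(toy.index.tolist())
```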

**Let's explore our data set**

**Remember**: the first general rule of data analysis is to **look** at the data! That is the **very** first thing to do after loading a dataset.

In [None]:
pd.set_option('display.max_rows', 10000)  # change presets for data preview
sales_data.head(10000)

#### Let's see what we have got now

In [None]:
type(sales_data)

In [None]:
len(sales_data)

#### Inspect your DataFrame with pandas methods

In [None]:
sales_data.head(5)

In [None]:
sales_data.tail(5)

In [None]:
sales_data.info()

**Note: floats and ints were detected automatically, but the date(time) columns are still string objects.**

`.info()` reports:

* the *columns*
* the row count
* the data types (numpy)
* the memory used

**`Strings`** are stored in **`pandas`** as **`object`**!

In [None]:
pd.read_csv?

**`pandas.read_csv`** has more than 50 parameters to customize imports.

For example, dates can be parsed automatically.

> **`parse_dates`**: a list of columns to parse as dates.

This is only one of the many options to customize imports.

In [None]:
sales_data = pd.read_csv('./data/blooth_sales_data.csv',
                         parse_dates=['birthday', 'orderdate']
                        )
sales_data.info()

In [None]:
sales_data.head(5)
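The effect of `parse_dates` is easiest to see by comparing the column's `dtype` with and without it. This is a minimal sketch on a made-up in-memory CSV; only the column name `orderdate` is borrowed from the example above.

```python
import io
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical CSV content with an ISO-formatted date column.
raw = "name,orderdate\nAnn,2017-03-01\nBob,2017-04-15\n"

as_text = pd.read_csv(io.StringIO(raw))
as_dates = pd.read_csv(io.StringIO(raw), parse_dates=["orderdate"])

print(as_text["orderdate"].dtype)    # still a string column
print(as_dates["orderdate"].dtype)   # a datetime column

# Datetime columns expose date components through the .dt accessor.
print(as_dates["orderdate"].dt.year.tolist())
```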

By default, the automatic date parser is US-date friendly: it assumes the month comes first (MM/DD/YYYY). Add `dayfirst=True` for the international and European format (DD/MM/YYYY).

In [None]:
sales_data = pd.read_csv('./data/blooth_sales_data.csv',
                         parse_dates=['birthday', 'orderdate'],
                         dayfirst=True)
sales_data.head(5)
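A small sketch of what `dayfirst` changes. The dates below are made up and deliberately ambiguous (both numbers are 12 or less), so the same text parses to different dates depending on the setting.

```python
import io
import pandas as pd

# Hypothetical, deliberately ambiguous dates: 01/02 could be Jan 2 or Feb 1.
raw = "orderdate\n01/02/2017\n03/04/2017\n"

month_first = pd.read_csv(io.StringIO(raw), parse_dates=["orderdate"])
day_first = pd.read_csv(
    io.StringIO(raw), parse_dates=["orderdate"], dayfirst=True
)

print(month_first["orderdate"].dt.month.tolist())  # months read as 1 and 3
print(day_first["orderdate"].dt.month.tolist())    # months read as 2 and 4
```

Because no error is raised in either case, it is worth checking a few known dates by eye after every import.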

**!** The date parser is US-date friendly! *MM/DD/YYYY*

To make sure the more common international format is used,<br>
add 
>**`dayfirst=True`** 

The CSV import can be highly customized, <br>e.g.:

* `parse_dates` - which columns to parse as dates
* `compression` - compression of the file; default: `infer` (auto-detected from the file extension)
* `delimiter` - the field delimiter (alias of `sep`)
* `thousands`, `decimal` - thousands or decimal character
* `encoding` - encoding of the file
* `dtype` - target data type of column(s)
* `header` - row number(s) to use as column names
* `skipfooter` - number of lines to skip at the bottom of the file (e.g. a summary line)
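A few of the options above can be sketched together. Both inputs are hypothetical in-memory strings: the first imitates a European-style export, the second a file that ends with a summary line.

```python
import io
import pandas as pd

# Hypothetical European-style export: ";" separator, "," as decimal mark,
# "." as thousands mark, and zip codes with leading zeros.
raw = "zip;amount\n01067;1.234,50\n68159;99,75\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",              # same role as `delimiter`
    decimal=",",
    thousands=".",
    dtype={"zip": str},   # keep zip codes as text so leading zeros survive
)
print(df)

# Hypothetical file ending in a summary line that should not be imported.
# skipfooter is not supported by the default C engine, so the slower
# "python" engine is requested explicitly.
raw_with_footer = "item,qty\napple,3\npear,5\nTOTAL,8"
items = pd.read_csv(io.StringIO(raw_with_footer), skipfooter=1, engine="python")
print(items)
```

Without `dtype={"zip": str}` the zip codes would be read as integers and `01067` would lose its leading zero.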



#### Reading a CSV from URL

The `read_csv` function is flexible enough to read a CSV file directly from a URL:

In [None]:
medals_dataset_url = "http://winterolympicsmedals.com/medals.csv"

In [None]:
medals_data = pd.read_csv(medals_dataset_url)

In [None]:
medals_data.head()

### Exercises

Repeat what you have just learned above:
* `.head()`

In [None]:
# your code here


* `.tail()`

In [None]:
# your code here


* `.info()`

In [None]:
# your code here


---

Read the file `blooth_sales_data_2.csv` from the directory *data* and save it to a variable called *data2*.

In [None]:
# your code here


Use the `read_csv` parameters to import the data in a useful format.

In [None]:
# your code here


In [None]:
# check your import using .info()


## Reading Data with `pandas`

**Take a look at**: 

- `pd.read_excel`
- `pd.read_json`
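As a starting point, here is a minimal sketch of `pd.read_json` on a made-up in-memory payload, assuming the JSON is a list of records with one object per row. Real files can be passed by path or URL instead of `io.StringIO`.

```python
import io
import pandas as pd

# Hypothetical JSON payload: a list of records, one object per row.
raw = '[{"name": "Ann", "qty": 3}, {"name": "Bob", "qty": 5}]'

orders = pd.read_json(io.StringIO(raw))
print(orders)
```

`pd.read_excel` is called in much the same way on a spreadsheet path, but it relies on an optional engine package (for example `openpyxl` for `.xlsx` files) being installed.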