# Python for Machine Learning - Loading Data with Pandas

> Introduction to common workflows using Pandas.

- toc: true
- badges: true
- comments: false
- categories: ['Python for ML','Pandas','Machine Learning']
- image: images/pandas-logo.jpg

# Importing Pandas

[Pandas](https://pandas.pydata.org/) is a library of pre-written code for wrangling data. We can think of it as Python's equivalent of Excel.

We import Pandas into our development environment the same way we import any other library: with the `import` statement.

```python
import pandas
```
It's standard to import Pandas with the shorthand `pd` in order to avoid typing `pandas` all the time.

```python
import pandas as pd
```

This gives us access to a vast array of pre-built objects, functions, and methods, which are detailed in the [API reference](https://pandas.pydata.org/docs/reference/index.html#api).


# Common Pandas Features

## Two Underlying Data Types

### Series
The basic unit of Pandas is the `pandas.Series` object which, in keeping with the Excel analogy, can be thought of as a column in an Excel table. It's a one-dimensional data structure built on top of a [NumPy](https://numpy.org/) array. However, unlike a NumPy array, the indices of a `Series` object aren't limited to the integer positions $0, 1, \dots, n-1$; they can also be descriptive labels.

Let's create a `Series` object representing the populations of the G-7 countries.

```python
import pandas as pd
g7_pop = pd.Series([35, 63, 80, 60, 127, 64, 318])
```

Indexing into a `Series` object is similar to indexing a Python list. For instance, let's print the first element in the above series.

```python
g7_pop[0]
```

```
35
```

To check the data type of a `Series` object, we can use the `dtype` property.

```python
g7_pop.dtype
```

```
dtype('int64')
```

Let's swap out the integer-based indices with descriptive labels. Each `Series` object has an `index` property that can be overwritten. 

```python
g7_pop.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'UK',
    'US'
]
```

Now, we can print the first element of the series using its descriptive label.

```python
g7_pop['Canada']
```

```
35
```

We may notice a similarity between a standard Python dictionary and a labeled `Series` object. Namely, indexing a series with a label and keying into a Python dictionary share a similar syntax. In fact, it's possible to create a labeled `Series` object directly from a Python dictionary. 

```python
g7_pop = pd.Series({
    'Canada': 35,
    'France': 63,
    'Germany': 80,
    'Italy': 60,
    'Japan': 127,
    'UK': 64,
    'US': 318
})
```

> Note: Even after overwriting the integer-based indices, it's still possible to access the elements of a `Series` by position using the `iloc` property (short for "integer location") like so: `g7_pop.iloc[0]`.
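
As a minimal sketch of the difference between positional and label-based access, the snippet below rebuilds the labeled series from above and reads the same element both ways. The variable names `first_by_position`, `last_by_position`, and `first_by_label` are only illustrative.

```python
import pandas as pd

# Rebuild the labeled series used in this section
g7_pop = pd.Series(
    [35, 63, 80, 60, 127, 64, 318],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan', 'UK', 'US'],
)

# iloc selects by integer position, regardless of the labels
first_by_position = g7_pop.iloc[0]    # element at position 0 (Canada)
last_by_position = g7_pop.iloc[-1]    # negative positions count from the end (US)

# loc selects by label; it is the explicit form of g7_pop['Canada']
first_by_label = g7_pop.loc['Canada']

print(first_by_position, last_by_position, first_by_label)
```

Using `loc` and `iloc` explicitly makes it clear to the reader whether a lookup is by label or by position, which matters most when the labels themselves are integers.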

Since the `Series` object is based on a NumPy array, it supports selecting several elements at once, as well as broadcast (i.e. element-wise) and vectorized operations.

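For instance, to print the populations of Canada and Germany at the same time, we pass a list of labels (note the double brackets). Passing the labels without the inner list, as in `g7_pop['Canada','Germany']`, raises a `KeyError` because Pandas treats the pair as a single tuple key.

```python
g7_pop[['Canada', 'Germany']]
```

```
Canada     35
Germany    80
dtype: int64
```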
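
The element-wise and vectorized behaviour mentioned above can be sketched as follows. The example assumes the population figures are in millions, which the text does not state, and the names `g7_pop_people`, `large`, and `total` are illustrative.

```python
import pandas as pd

g7_pop = pd.Series({
    'Canada': 35,
    'France': 63,
    'Germany': 80,
    'Italy': 60,
    'Japan': 127,
    'UK': 64,
    'US': 318
})

# Broadcasting: the scalar is applied to every element of the series.
# Assumption: the figures are in millions, so this converts to people.
g7_pop_people = g7_pop * 1_000_000

# A comparison is also element-wise and yields a boolean series,
# which can be used as a mask to filter the original series.
large = g7_pop[g7_pop > 70]

# Vectorized reduction over the whole series, no explicit loop needed
total = g7_pop.sum()

print(large)
print(total)
```

None of these operations needs an explicit Python loop; the work is delegated to the underlying array.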

## Loading Data

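A `DataFrame` is the two-dimensional counterpart of a `Series`: in keeping with the Excel analogy, it can be thought of as a whole table whose columns are `Series` objects.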

Reading a CSV file into a `DataFrame` is done with the `pandas.read_csv()` function.
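
Here is a small, self-contained sketch of `read_csv()`. To avoid depending on a file on disk, it reads from an in-memory string wrapped in `io.StringIO`. The CSV contents and the column names `country` and `population` are made up for illustration. In practice we would usually pass a file path instead.

```python
import io

import pandas as pd

# Hypothetical CSV contents; the first row is used as the header
csv_text = """country,population
Canada,35
France,63
Germany,80
"""

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)            # (number of rows, number of columns)
print(df.columns.tolist())

# Each column of the DataFrame is a Series
population = df['population']
print(type(population))
```

By default, `read_csv()` assigns an integer index to the rows. The `index_col` parameter can be used to turn one of the columns into row labels instead.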