# Python for Machine Learning - Pandas

> Introduction to common workflows using Pandas.

- toc: true
- badges: true
- comments: false
- categories: ['Python for ML','Pandas','Machine Learning']
- image: images/pandas-logo.jpg

# Importing Pandas

[Pandas](https://pandas.pydata.org/) is a library that contains pre-written code to help wrangle data. We can think of it as Python's equivalent of Excel.

We import Pandas into our development environment the same way we import any other library: with the `import` statement.

```python
import pandas
```
It's standard to import Pandas with the shorthand `pd` in order to avoid typing `pandas` all the time.

```python
import pandas as pd
```

This gives us access to a vast array of pre-built objects, functions, and methods which are detailed in the [API reference](https://pandas.pydata.org/docs/reference/index.html#api).


# Two Underlying Data Types

## Series
The basic unit of Pandas is the `pandas.Series` object which, in keeping with the Excel analogy, can be thought of as a column in an Excel table. It's a one-dimensional data structure built on top of a [NumPy](https://numpy.org/) array. However, unlike a NumPy array, the indices of a `Series` object aren't limited to the integer positions $0, 1, \dots, n-1$; they can also be descriptive labels.

Let's create a `Series` object representing the populations of the G-7 countries in units of millions.

```python
import pandas as pd
g7_pop = pd.Series([35, 63, 80, 60, 127, 64, 318])
```

As we can see, creating a series is a matter of passing a Python list (or a NumPy array) into the `Series` constructor.
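
The constructor accepts a NumPy array in the same way. The sketch below is only an illustration; the names `pop_array` and `g7_from_array` are made up for this example and do not appear elsewhere in the post.

```python
import numpy as np
import pandas as pd

# Populations of the G-7 countries in millions, stored in a NumPy array.
pop_array = np.array([35, 63, 80, 60, 127, 64, 318])

# Passing the array to the constructor yields a Series with the
# default integer index 0..6.
g7_from_array = pd.Series(pop_array)

print(g7_from_array.index.tolist())  # [0, 1, 2, 3, 4, 5, 6]
print(g7_from_array.tolist())        # [35, 63, 80, 60, 127, 64, 318]
```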

### Indexing
Indexing a `Series` object is similar to indexing a Python list. For instance, let's print the first element in the above series.

```python
g7_pop[0]
```

```
35
```

Let's now swap out the integer-based indices with descriptive labels. Each `Series` object has an `index` property that can be overwritten. 

```python
g7_pop.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'UK',
    'US'
]
```

Now, we can print the first element of the series using its descriptive label.

```python
g7_pop['Canada']
```

```
35
```

We may notice a similarity between a standard Python dictionary and a labeled `Series` object. Namely, indexing a series with a label and keying into a Python dictionary have similar syntax. In fact, it's possible to create a labeled `Series` object directly from a Python dictionary. 

```python
g7_pop = pd.Series({
    'Canada': 35,
    'France': 63,
    'Germany': 80,
    'Italy': 60,
    'Japan': 127,
    'UK': 64,
    'US': 318
})
```

> Note: After overwriting the integer-based indices, it's still possible to access the elements of a `Series` by position using the `iloc` property (short for "integer location") like so: `g7_pop.iloc[0]`.
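
As a small sketch of the note above, the snippet below contrasts position-based access through `iloc` with label-based access. The `loc` accessor is part of Pandas but is not used elsewhere in this post, and the variable names are chosen only for illustration.

```python
import pandas as pd

g7_pop = pd.Series(
    [35, 63, 80, 60, 127, 64, 318],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan', 'UK', 'US'],
)

first_by_position = g7_pop.iloc[0]     # position 0, regardless of the labels
first_by_label = g7_pop.loc['Canada']  # explicit label-based lookup
last_by_position = g7_pop.iloc[-1]     # negative positions count from the end

print(first_by_position, first_by_label, last_by_position)  # 35 35 318
```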


Since the `Series` object is built on a NumPy array, it also supports selecting several elements at once (*multi-indexing*) by passing a list of indices or a Boolean mask.

For instance, to print the populations of Canada and Germany at the same time, we can pass in the list `['Canada','Germany']` or the Boolean mask `[True, False, True, False, False, False, False]`. 

```python
g7_pop[['Canada', 'Germany']]
```

```
Canada     35
Germany    80
dtype: int64
```

```python
g7_pop[[True, False, True, False, False, False, False]]
```

```
Canada     35
Germany    80
dtype: int64
```

### Broadcasted and Vectorized Operations

Since it's based on a NumPy array, a `Series` object also supports vectorization and broadcasted operations. 

As a quick reminder, *vectorization* is the process by which NumPy optimizes looping in Python. It stores the array internally in a contiguous block of memory and restricts its contents to a single data type. Because that data type is known in advance, NumPy can skip the per-iteration type checking that Python normally does, which speeds up our code. In fact, NumPy delegates most operations on such arrays to precompiled C code under the hood.

*Broadcasting*, on the other hand, is the optimized process by which NumPy performs arithmetic and Boolean operations on arrays of different shapes.
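
To make the idea concrete, here is a minimal NumPy sketch of broadcasting between arrays of different shapes. The arrays `matrix` and `row` are invented for this illustration.

```python
import numpy as np

# A 2x3 array and a 1-D array of length 3: the shapes differ, but NumPy
# applies the smaller array to each row of the larger one.
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row = np.array([10, 20, 30])

summed = matrix + row
print(summed.tolist())   # [[11, 22, 33], [14, 25, 36]]

# A scalar is broadcast against every element in the same way.
doubled = matrix * 2
print(doubled.tolist())  # [[2, 4, 6], [8, 10, 12]]
```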

For instance, suppose the projected population growth of each G-7 country is `10 mln` by the year 2030. Instead of looping through the `Series` object and adding `10` to each row, or using a list comprehension, we can simply use broadcasted addition.

```python
g7_2030_pop = g7_pop + 10
g7_2030_pop
```

```
Canada      45
France      73
Germany     90
Italy       70
Japan      137
UK          74
US         328
dtype: int64
```
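
For comparison, the sketch below builds the same result with a list comprehension and checks that it matches the broadcasted version. The names `looped` and `broadcasted` are illustrative only.

```python
import pandas as pd

g7_pop = pd.Series(
    [35, 63, 80, 60, 127, 64, 318],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan', 'UK', 'US'],
)

# Explicit Python-level loop via a list comprehension.
looped = pd.Series([pop + 10 for pop in g7_pop], index=g7_pop.index)

# Broadcasted addition: the scalar 10 is applied to every element at once.
broadcasted = g7_pop + 10

# Both approaches produce the same values; the broadcasted form is shorter
# and avoids the Python-level loop.
print(looped.tolist() == broadcasted.tolist())  # True
```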

### Filtering

Thanks to broadcasted Boolean operations and multi-indexing with a Boolean mask, it's possible to write concise and readable filtering expressions on `Series`.

For instance, let's return the list of countries with a population of at least `70 mln`.

```python
g7_pop[g7_pop >= 70]
```

```
Germany     80
Japan      127
US         318
dtype: int64
```

The expression `g7_pop >= 70` is a broadcasted Boolean operation on the `Series` object `g7_pop` which returns a Boolean `Series` with the values `[False, False, True, False, True, False, True]`. Then `g7_pop` is indexed using this Boolean mask.
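
The two steps can also be written out separately, as in the sketch below. The variable names `mask`, `large` and `mid_sized` are made up for this example, and the combined condition at the end is an extra illustration of the `&` operator rather than something taken from the walkthrough above.

```python
import pandas as pd

g7_pop = pd.Series(
    [35, 63, 80, 60, 127, 64, 318],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan', 'UK', 'US'],
)

# Step 1: the broadcasted comparison produces a Boolean mask.
mask = g7_pop >= 70
print(mask.tolist())  # [False, False, True, False, True, False, True]

# Step 2: indexing with the mask keeps only the rows marked True.
large = g7_pop[mask]
print(large.index.tolist())  # ['Germany', 'Japan', 'US']

# Masks can be combined with & (and), | (or) and ~ (not);
# each comparison needs its own parentheses.
mid_sized = g7_pop[(g7_pop >= 60) & (g7_pop < 100)]
print(mid_sized.index.tolist())  # ['France', 'Germany', 'Italy', 'UK']
```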

As another example of readable filtering expressions, we can return the list of countries whose populations are at or above the mean population.

```python
g7_pop[g7_pop >= g7_pop.mean()]
```

```
Japan    127
US       318
dtype: int64
```

## DataFrame

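A `DataFrame` is the two-dimensional counterpart of a `Series`. In keeping with the Excel analogy, it can be thought of as an entire table whose columns are `Series` objects.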

Reading a CSV into a DataFrame is done through the `pandas.read_csv()` function.
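
A minimal sketch of `read_csv` follows. To keep it self-contained, the CSV content lives in an in-memory string wrapped in `io.StringIO`; in practice a file path would usually be passed instead. The column names `country` and `population` and the three rows are assumptions made for this illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content; normally this would be a file on disk.
csv_text = """country,population
Canada,35
France,63
Germany,80
"""

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)             # (3, 2)
print(df.columns.tolist())  # ['country', 'population']

# Selecting a single column returns a Series.
population = df['population']
print(type(population).__name__)  # Series
print(population.tolist())        # [35, 63, 80]
```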