![ine-divider](https://user-images.githubusercontent.com/7065401/92672068-398e8080-f2ee-11ea-82d6-ad53f7feb5c0.png)
<hr>

# Introducing Pandas Series
<img src="https://user-images.githubusercontent.com/7065401/75165824-badf4680-5701-11ea-9c5b-5475b0a33abf.png" style="width:200px; float: right; margin: 0 40px 40px 40px;"/>

The Pandas (Panel Data) Python library is a powerful tool for data manipulation and analysis. We will use it throughout several lessons of this bootcamp, and later lessons assume familiarity with it.

![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

The basic data type/collection in Pandas is the `Series`: an array of data of a single data type whose elements are also labeled in ways more meaningful than index position alone. Once you get to know Pandas, you will more often find yourself working with another kind of collection, called a `DataFrame`.

A special relationship exists between these two though: a DataFrame is a way of collecting together one or more Series.  Most operations on Series and DataFrames are very similar.  We start with the simpler data collection.

We start by importing Pandas; by convention it is given the two-letter name `pd` within Python programs. We also import the NumPy library, using the conventional short name `np`. This bootcamp will not discuss NumPy specifically, but Pandas Series are built on top of NumPy `ndarray`s, and occasionally we want to use capabilities of the underlying arrays.

In [None]:
import pandas as pd
import numpy as np
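
To make the relationship between the two collections concrete, here is a minimal sketch. The labels and numbers are made up purely for illustration (they are not part of the lesson's data), and `pd.concat` is only one of several ways to combine Series into a DataFrame.

```python
import pandas as pd

# Made-up illustrative values; the labels and numbers are not from the lesson.
heights = pd.Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'], name='height')
widths = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='width')

# A DataFrame collects Series that share an index, one Series per column.
df = pd.concat([heights, widths], axis=1)
print(df)

# Pulling a column back out gives a Series again, each with its own dtype.
print(type(df['height']))
print(df.dtypes)
```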

## Pandas Series

We'll start by analyzing "[The Group of Seven](https://en.wikipedia.org/wiki/Group_of_Seven)", a political forum formed by Canada, France, Germany, Italy, Japan, the United Kingdom and the United States. We'll look at population first, and for that, we'll use a `pandas.Series` object.

In [None]:
# In millions
populations = [63.951, 35.467, 80.940, 60.665, 127.061, 64.511, 318.523]
g7_pop = pd.Series(populations)
g7_pop

Someone might not know we're representing population in millions of inhabitants. Series can have a `name`, to better document the purpose of the Series:

In [None]:
g7_pop.name = 'G7 Population in millions'

In [None]:
g7_pop

In [None]:
g7_pop.dtype

In [None]:
g7_pop.values

Series wrap underlying NumPy arrays:

In [None]:
type(g7_pop.values)

And they _look_ like simple Python lists or NumPy arrays, but they're actually more similar to Python `dict`s.

A Series has an `index`, which by default is similar to the automatic positions of a Python list:

In [None]:
g7_pop

In [None]:
g7_pop[0]

In [None]:
g7_pop[1]

In [None]:
g7_pop.index

In contrast to lists, we can explicitly define the index:

In [None]:
g7_pop.index = [
    'France',
    'Canada',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]

In [None]:
g7_pop

Compare it with the [following table](https://docs.google.com/spreadsheets/d/1IlorV2-Oh9Da1JAZ7weVw86PQrQydSMp-ydVMH135iI/edit?usp=sharing): 

|                |   G7 Population in millions |
|:---------------|----------------------------:|
| Canada         |                      35.467 |
| France         |                      63.951 |
| Germany        |                      80.94  |
| Italy          |                      60.665 |
| Japan          |                     127.061 |
| United Kingdom |                      64.511 |
| United States  |                     318.523 |

Series look like "ordered dictionaries". Moreover, we can create Series out of dictionaries:

In [None]:
pd.Series({
    'France': 63.951,
    'Canada': 35.467,
    'Germany': 80.94,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523
}, name='G7 Population in millions')
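
The dictionary resemblance goes beyond construction. The sketch below rebuilds the same Series so that it runs on its own, and tries a few dict-style operations on it; it is meant as an illustration, not an exhaustive list.

```python
import pandas as pd

g7_pop = pd.Series({
    'France': 63.951,
    'Canada': 35.467,
    'Germany': 80.94,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523,
}, name='G7 Population in millions')

# Membership tests look at the index labels, like dict keys.
print('Japan' in g7_pop)
print('Spain' in g7_pop)

# .get() returns a default instead of raising KeyError for a missing label.
print(g7_pop.get('Spain', 'no data'))

# .items() yields (label, value) pairs in index order.
for country, pop in g7_pop.items():
    print(country, pop)

# Round-trip back to a plain dict.
as_dict = g7_pop.to_dict()
```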

There are several other "constructors" for Series, i.e. different ways of passing in the data needed to create one.

In [None]:
pd.Series(
    [63.951, 35.467, 80.94, 60.665, 127.061, 64.511, 318.523],
    index=['France', 'Canada', 'Germany', 'Italy', 'Japan', 
           'United Kingdom', 'United States'],
    name='G7 Population in millions')

You can create a Series out of another Series by specifying the index labels to keep; labels absent from the original, such as `'Spain'` here, get `NaN`:

In [None]:
pd.Series(g7_pop, index=['France', 'Germany', 'Italy', 'Spain'])
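
A short sketch of what happens to a label the original Series lacks. It rebuilds `g7_pop` so that it runs standalone, and compares the constructor call with `reindex`, which spells out the same operation more explicitly.

```python
import pandas as pd

g7_pop = pd.Series(
    [63.951, 35.467, 80.94, 60.665, 127.061, 64.511, 318.523],
    index=['France', 'Canada', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'],
    name='G7 Population in millions')

labels = ['France', 'Germany', 'Italy', 'Spain']
subset = pd.Series(g7_pop, index=labels)

# 'Spain' is not a label in g7_pop, so its value is NaN (missing).
print(subset)
print(subset.isna())

# reindex() is a more explicit spelling of the same selection.
same = g7_pop.reindex(labels)
print(subset.equals(same))
```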

## Indexing

Indexing works much as it does on lists and dictionaries: you use the **index** of the element you're looking for:

In [None]:
g7_pop

In [None]:
g7_pop['Canada']

In [None]:
g7_pop['Japan']

Numeric positions can also be used, with the `iloc` accessor:

In [None]:
g7_pop.iloc[0]

In [None]:
g7_pop.iloc[-1]

Selecting multiple elements at once:

In [None]:
g7_pop[['Italy', 'France']]

_(The result is another Series)_

In [None]:
g7_pop.iloc[[0, 1]]

Slicing also works, but **unlike in lists**, when you slice a Pandas Series by label the upper limit is included:

In [None]:
g7_pop['Canada': 'Italy']

It is usually preferred to use the `.loc` accessor rather than simple square brackets. It is a few extra characters, but it avoids confusion when the indices of a Series are themselves numbers (as in our first pass at `g7_pop`). A motto in Python is "explicit is better than implicit"; with `.loc` and `.iloc` we can specify whether we mean the label of a row or its numeric position.

In [None]:
g7_pop.loc['Canada': 'Italy']
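
The difference between the two accessors shows up most clearly in slices. The following sketch, which rebuilds `g7_pop` so that it runs on its own, contrasts a label slice (end included) with a positional slice (end excluded, as in lists).

```python
import pandas as pd

g7_pop = pd.Series(
    [63.951, 35.467, 80.94, 60.665, 127.061, 64.511, 318.523],
    index=['France', 'Canada', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'],
    name='G7 Population in millions')

# Label slice: both endpoints are included (Canada, Germany, Italy).
by_label = g7_pop.loc['Canada':'Italy']

# Positional slice: the stop position is excluded (Canada, Germany).
by_position = g7_pop.iloc[1:3]

print(by_label)
print(by_position)

# To get the same three rows by position, the stop must be one further.
print(g7_pop.iloc[1:4].equals(by_label))
```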

## Conditional selection (Boolean arrays)

A special "Boolean array" can be used to select from Pandas Series:

In [None]:
g7_pop

In [None]:
g7_pop > 70

In [None]:
g7_pop[g7_pop > 70]

Often we create Boolean arrays directly inside square brackets to index elements. But we can also save a useful filter under a name.

In [None]:
is_big = g7_pop > 70
g7_pop[is_big]

Series have various useful methods; many of them are numeric or statistical.

In [None]:
g7_pop.mean()

We can combine these two capabilities we have just seen:

In [None]:
g7_pop[g7_pop > g7_pop.mean()]

In [None]:
g7_pop.std()

We can use symbols as logical connectors for several different filters.

| Symbol | Meaning
|--------|---------
| ~      | not
| &#124; | or
| &      | and

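A common pitfall is failing to parenthesize subexpressions separated by these symbols. Because the symbols bind more tightly than comparisons such as `>` and `<`, an unparenthesized expression may raise an error or, worse, give you **some** answer, but the wrong one.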
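
Here is a minimal sketch of both failure modes. It uses a small made-up integer Series rather than the population data, because the silent wrong answer is easiest to show with integers.

```python
import pandas as pd

# Small made-up integer Series, used only to illustrate operator precedence.
s = pd.Series([1, 2, 5, 8])

# Intended: "not greater than 3".
correct = ~(s > 3)
# Without parentheses, ~ applies to s first (bitwise NOT of each integer),
# and only then is the result compared with 3.
wrong = ~s > 3

print(correct.tolist())
print(wrong.tolist())

# Intended: "greater than 2 and less than 8".
both = (s > 2) & (s < 8)
# Without parentheses this parses as s > (2 & s) < 8, a chained comparison
# that needs a single True/False from a whole Series, so pandas raises.
try:
    s > 2 & s < 8
    raised = False
except ValueError:
    raised = True

print(both.tolist(), raised)
```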

Let us select the G7 countries with population within 1/2 standard deviation of the mean.

In [None]:
g7_pop[(g7_pop > (g7_pop.mean() - g7_pop.std()/2)) &
       (g7_pop < (g7_pop.mean() + g7_pop.std()/2))
      ].rename('Mid-sized G7 countries')

That was a powerful query, but perhaps hard to read.  We might break it into steps.

In [None]:
bottom = g7_pop.mean() - g7_pop.std()/2
top = g7_pop.mean() + g7_pop.std()/2
print(bottom, top)

g7_pop[(g7_pop > bottom) & (g7_pop < top)]
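
As a possible alternative to combining two comparisons with `&`, `Series.between` expresses the same range check in one call. This is a sketch that assumes a pandas version (1.3 or later) in which `inclusive` accepts strings such as `'neither'`; it rebuilds `g7_pop` so that it runs standalone.

```python
import pandas as pd

g7_pop = pd.Series(
    [63.951, 35.467, 80.94, 60.665, 127.061, 64.511, 318.523],
    index=['France', 'Canada', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'],
    name='G7 Population in millions')

bottom = g7_pop.mean() - g7_pop.std() / 2
top = g7_pop.mean() + g7_pop.std() / 2

# The two-comparison filter from the lesson.
explicit = g7_pop[(g7_pop > bottom) & (g7_pop < top)]

# inclusive='neither' keeps both bounds strict, matching > and < above.
with_between = g7_pop[g7_pop.between(bottom, top, inclusive='neither')]

print(with_between)
print(with_between.equals(explicit))
```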

## Operations and methods

Series support "vectorized" operations and aggregation functions:

In [None]:
g7_pop

In [None]:
g7_pop * 1_000_000

In [None]:
mean = (g7_pop * 1e6).mean()
print(f"{mean:,.0f}")
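
Vectorized operations and aggregations combine naturally. As an illustrative sketch (the "share of total" calculation is our own example, not part of the lesson), the block below computes each country's percentage of the G7 total in one expression and checks it against an explicit Python loop.

```python
import pandas as pd

g7_pop = pd.Series(
    [63.951, 35.467, 80.94, 60.665, 127.061, 64.511, 318.523],
    index=['France', 'Canada', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'],
    name='G7 Population in millions')

# Vectorized: the division and multiplication apply to every element at once.
share = g7_pop / g7_pop.sum() * 100

# The same calculation written as an explicit loop over (label, value) pairs.
total = g7_pop.sum()
looped = pd.Series({country: pop / total * 100
                    for country, pop in g7_pop.items()})

print(share.round(1))
print((share - looped).abs().max() < 1e-9)
```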

Perhaps we would like to think only about the order of magnitude of these populations.

In [None]:
np.log10(g7_pop * 1e6)

We might aggregate over a subset of the data.

In [None]:
g7_pop['France': 'Italy'].mean()

Looking at Series in particular orders can be useful. For example, perhaps alphabetically by the index.

In [None]:
g7_pop.sort_index()

Or in order by the population.

In [None]:
g7_pop.sort_values()

In [None]:
g7_pop.sort_values(ascending=False)
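
Two details are worth checking for yourself: the sorting methods return a new Series and leave the original order alone, and when only the extremes matter, `nlargest` (or `nsmallest`) can stand in for a sort followed by a slice. A short sketch, rebuilding `g7_pop` so that it runs on its own:

```python
import pandas as pd

g7_pop = pd.Series(
    [63.951, 35.467, 80.94, 60.665, 127.061, 64.511, 318.523],
    index=['France', 'Canada', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'],
    name='G7 Population in millions')

by_size = g7_pop.sort_values(ascending=False)

# The original Series keeps its order; sort_values returned a new object.
print(g7_pop.index[0])
print(by_size.index[0])

# The three most populous countries, without sorting the whole Series by hand.
top3 = g7_pop.nlargest(3)
print(top3)
```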

## Modifying series

Let us make some hypothetical changes to the populations; think of them as projections. We work on a copy so that `g7_pop` itself stays unchanged.

In [None]:
g7_imagined = g7_pop.copy()
g7_imagined.rename("Imagined populations", inplace=True)
g7_imagined

In [None]:
g7_imagined.loc['Canada'] = 40.5
g7_imagined

Perhaps we stipulate by index position.

In [None]:
g7_imagined.iloc[-1] = 500
g7_imagined

Recall how we might filter.

In [None]:
g7_imagined[g7_imagined < 70]

We can also modify based on the filter.

In [None]:
g7_imagined[g7_imagined < 70] = 99.99
g7_imagined
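
The sketch below confirms that assigning through a filter on the copy leaves the original Series untouched, and shows `Series.mask` as one non-mutating way to get the same result. It rebuilds `g7_pop` so that it runs standalone.

```python
import pandas as pd

g7_pop = pd.Series(
    [63.951, 35.467, 80.94, 60.665, 127.061, 64.511, 318.523],
    index=['France', 'Canada', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'],
    name='G7 Population in millions')

# Modify a copy in place, selecting the rows to change with a Boolean filter.
g7_imagined = g7_pop.copy()
g7_imagined.loc[g7_imagined < 70] = 99.99

# The original is unaffected because we worked on a copy.
print(g7_pop['Canada'], g7_imagined['Canada'])

# mask() returns a new Series with values replaced where the condition is True.
replaced = g7_pop.mask(g7_pop < 70, 99.99)
print(replaced.equals(g7_imagined))
```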

# Exercises

![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Series creation

### Create an empty pandas Series

In [None]:
# your code goes here

### Given the Python list X, convert it to a pandas Series Y

In [None]:
# your code goes here
X = ['A','B','C']
print(X, type(X))

### Given the X pandas Series, name it 'My letters'

In [None]:
# your code goes here
X = pd.Series(['A','B','C'])
X

### Given the X pandas Series, show its values

In [None]:
# your code goes here
X = pd.Series(['A','B','C'])

## Series indexing

### Assign index names to the given X pandas Series


In [None]:
# your code goes here
X = pd.Series(['A','B','C'])
X

### Given the X pandas Series, show its first element

In [None]:
X = pd.Series(['A','B','C'], index=['first', 'second', 'third'])
# your code goes here

### Given the X pandas Series, show its last element

In [None]:
X = pd.Series(['A','B','C'], index=['first', 'second', 'third'])
# your code goes here

### Given the X pandas Series, show all middle elements

In [None]:
# your code goes here
X = pd.Series(['A','B','C','D','E'],
              index=['first','second','third','fourth','fifth'])

### Given the X pandas Series, show the elements in reverse order

In [None]:
# your code goes here
X = pd.Series(['A','B','C','D','E'],
              index=['first','second','third','fourth','fifth'])

### Given the X pandas Series, show the first and last elements

In [None]:
# your code goes here
X = pd.Series(['A','B','C','D','E'],
              index=['first','second','third','fourth','fifth'])

## Series manipulation

### Convert the given integer pandas Series to float


In [None]:
# your code goes here
X = pd.Series([1, 2, 3, 4, 5],
              index=['first', 'second', 'third', 'fourth', 'fifth'])
X

### Order (sort) the given pandas Series

In [None]:
X = pd.Series([4, 2, 5, 1, 3],
              index=['fourth', 'second', 'fifth', 'first', 'third'])
# your code goes here

### Given the X pandas Series, set the fifth element equal to 10

In [None]:
X = pd.Series([1, 2, 3, 4, 5],
              index=['A', 'B', 'C', 'D', 'E'])
# your code goes here

### Given the X pandas Series, change all the middle elements to 0

In [None]:
# your code goes here
X = pd.Series([1, 2, 3, 4, 5],
              index=['A', 'B', 'C', 'D', 'E'])

### Given the X pandas Series, add 5 to every element

In [None]:
X = pd.Series([1,2,3,4,5],
              index=['A','B','C','D','E'])
# your code goes here


## Boolean arrays (also called masks)

### Given the X pandas Series, make a mask showing negative elements


In [None]:
X = pd.Series([-1, 2, 0, -4, 5, 6, 0, 0, -9, 10])
# your code goes here

### Given the X pandas Series, get the negative elements

In [None]:
X = pd.Series([-1, 2, 0, -4, 5, 6, 0, 0, -9, 10])
# your code goes here

### Given the X pandas Series, get numbers larger than 5


In [None]:
X = pd.Series([-1, 2, 0, -4, 5, 6, 0, 0, -9, 10])
# your code goes here

### Given the X pandas Series, select numbers greater than the mean of its elements

In [None]:
X = pd.Series([-1, 2, 0, -4, 5, 6, 0, 0, -9, 10])
# your code goes here

### Given the X pandas Series, get numbers equal to 2 or 10

In [None]:
X = pd.Series([-1, 2, 0, -4, 5, 6, 0, 0, -9, 10])
# your code goes here

## Logic functions

### Given the X pandas Series, return True if none of its elements is zero

In [None]:
X = pd.Series([-1, 2, 0, -4, 5, 6, 0, 0, -9, 10])
# your code goes here

### Given the X pandas Series, return True if any of its elements is zero

In [None]:
X = pd.Series([-1, 2, 0, -4, 5, 6, 0, 0, -9, 10])
# your code goes here

## Summary statistics

### Given the X pandas Series, show the sum of its elements


In [None]:
X = pd.Series([3, 5, 6, 7, 2, 3, 4, 9, 4])
# your code goes here

### Given the X pandas Series, show the mean value of its elements

In [None]:
X = pd.Series([1, 2, 0, 4, 5, 6, 0, 0, 9, 10])
# your code goes here

### Given the X pandas Series, show the max value of its elements

In [None]:
X = pd.Series([1, 2, 0, 4, 5, 6, 0, 0, 9, 10])
# your code goes here