<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98728503-5ab82f80-2378-11eb-9c79-adeb308fc647.png"></img>

<h1 style="color: white; position: absolute; top:27%; left:10%;">
     INE Bootcamp
</h1>
<h2 style="color: white; position: absolute; top:36%; left:10%;">
    Data Analysis, Visualization and Predictive Modeling
</h2> 

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:58%; left:10%;">
    <b>David Mertz, Ph.D.</b>
</h3>

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:63%; left:10%;">
    <b>Data Scientist</b>
</h3>
</div>

<div style="width: 100%; height: 200px; background-color: #222; text-align: center; padding-top: 20px; margin-bottom: 40px;">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Introducing Pandas Series
</h1>

<br><br> 
</div>

The Pandas (Panel Data) Python library is a very powerful tool for data manipulation and analysis.  We will talk about it throughout several lessons of this bootcamp, and even assume familiarity with Pandas in later lessons.

<img src="https://user-images.githubusercontent.com/7065401/75165824-badf4680-5701-11ea-9c5b-5475b0a33abf.png" style="width:300px; float: right; margin: 0 40px 40px 40px;"/>

> The basic data type/collection in Pandas is the `Series`.  More often—once you get to know Pandas—you will find yourself working with another kind of collection, called a `DataFrame`.  A Series is an array of data having the same data type that is also labeled in ways more meaningful than only by index position.

A special relationship exists between these two though: a DataFrame is a way of collecting together one or more Series.  Most operations on Series and DataFrames are very similar.  We start with the simpler data collection.
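As a minimal sketch of that relationship, the following builds a DataFrame from two Series that share an index, then pulls one column back out as a Series. The column names `millions` and `persons` are illustrative choices, not names used elsewhere in this lesson.

```python
import pandas as pd

# Two Series sharing the same index labels (populations in millions)
millions = pd.Series({'France': 63.951, 'Canada': 35.467})
persons = millions * 1_000_000

# A DataFrame collects Series as named columns
frame = pd.DataFrame({'millions': millions, 'persons': persons})

# Selecting a single column gives back a Series
column = frame['millions']
print(type(column).__name__)
```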

To begin, we import Pandas; by convention it is given the two-letter name `pd` within Python programs.  We also import the NumPy library, using the conventional short name `np`.  This bootcamp will not discuss NumPy specifically, but Pandas Series are built on top of NumPy `ndarray`s, and occasionally we want to use capabilities of the underlying arrays.

In [1]:
import pandas as pd
import numpy as np

<h2 style="font-weight: bold;">
    Pandas Series
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

We'll start by analyzing "[The Group of Seven](https://en.wikipedia.org/wiki/Group_of_Seven)", a political forum formed by Canada, France, Germany, Italy, Japan, the United Kingdom and the United States. We'll look first at population, and for that, we'll use a `pandas.Series` object.

In [2]:
# In millions
populations = [63.951, 35.467, 80.940, 60.665, 127.061, 64.511, 318.523]
g7_pop = pd.Series(populations)
g7_pop

0     63.951
1     35.467
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
dtype: float64

Someone might not know we're representing population in millions of inhabitants. Series can have a `name`, to better document the purpose of the Series:

In [3]:
g7_pop.name = 'G7 Population in millions'

In [4]:
g7_pop

0     63.951
1     35.467
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64

Series always have a common data type for each of their elements.

In [5]:
g7_pop.dtype

dtype('float64')
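To illustrate the single-dtype rule, here is a small sketch with toy values (not part of the G7 data). When the inputs are mixed, Pandas picks one dtype that can hold all of them: integers and floats together become `float64`, while numbers mixed with strings fall back to the general-purpose `object` dtype.

```python
import pandas as pd

# Integers mixed with a float are all stored as float64
mixed_numbers = pd.Series([1, 2, 3.5])
print(mixed_numbers.dtype)

# Numbers mixed with a string are stored as generic Python objects
mixed_types = pd.Series([1, 'two', 3.0])
print(mixed_types.dtype)
```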

In [6]:
g7_pop.values

array([ 63.951,  35.467,  80.94 ,  60.665, 127.061,  64.511, 318.523])

Series wrap underlying NumPy arrays:

In [7]:
type(g7_pop.values)

numpy.ndarray

And they _look_ like simple Python lists or NumPy arrays. But they're actually more similar to Python `dict`s.

A Series has an `index` that is similar to the automatic index assigned to Python's lists:

In [8]:
g7_pop

0     63.951
1     35.467
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64

In [9]:
g7_pop[0]

63.951

In [10]:
g7_pop[1]

35.467

In [11]:
g7_pop.index

RangeIndex(start=0, stop=7, step=1)

In contrast to lists, we can explicitly define the index:

In [12]:
g7_pop.index = [
    'France',
    'Canada',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]

In [13]:
g7_pop

France             63.951
Canada             35.467
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

Compare it with the [following table](https://docs.google.com/spreadsheets/d/1IlorV2-Oh9Da1JAZ7weVw86PQrQydSMp-ydVMH135iI/edit?usp=sharing): 

|                |   G7 Population in millions |
|:---------------|----------------------------:|
| Canada         |                      35.467 |
| France         |                      63.951 |
| Germany        |                      80.94  |
| Italy          |                      60.665 |
| Japan          |                     127.061 |
| United Kingdom |                      64.511 |
| United States  |                     318.523 |

Series look like "ordered dictionaries". Moreover, we can create Series out of dictionaries:

In [14]:
pd.Series({
    'France': 63.951,
    'Canada': 35.467,
    'Germany': 80.94,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523
}, name='G7 Population in millions')

France             63.951
Canada             35.467
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
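The dictionary analogy can be sketched a little further. The example below uses only two of the countries to stay short; `in`, `.get()`, `.keys()` and `.to_dict()` all treat the index labels the way a `dict` treats its keys.

```python
import math
import pandas as pd

pop = pd.Series({'France': 63.951, 'Canada': 35.467})

# Membership tests look at the index labels, not at the values
has_france = 'France' in pop
has_value = 63.951 in pop

# .get() returns a default instead of raising for a missing label
spain = pop.get('Spain', math.nan)

# Labels and label/value pairs can be recovered much like with a dict
labels = list(pop.keys())
as_dict = pop.to_dict()

print(has_france, has_value, spain, labels, as_dict)
```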

There are several other "constructors" for Series, i.e. different ways of passing in the data needed to create one.

In [15]:
pd.Series(
    [63.951, 35.467, 80.94, 60.665, 127.061, 64.511, 318.523],
    index=['France', 'Canada', 'Germany', 'Italy', 'Japan', 
           'United Kingdom', 'United States'],
    name='G7 Population in millions')

France             63.951
Canada             35.467
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

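You can create a Series out of another Series by specifying the index labels you want; labels that are absent from the original get the missing-value marker `NaN`: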

In [16]:
pd.Series(g7_pop, index=['France', 'Germany', 'Italy', 'Spain'])

France     63.951
Germany    80.940
Italy      60.665
Spain         NaN
Name: G7 Population in millions, dtype: float64
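Passing an existing Series together with an `index` behaves like the `.reindex()` method. The sketch below, using three of the countries, compares the two spellings and shows one way to detect or drop the missing entries afterward.

```python
import pandas as pd

pop = pd.Series({'France': 63.951, 'Germany': 80.940, 'Italy': 60.665})
wanted = ['France', 'Germany', 'Spain']

# Two spellings of the same selection; 'Spain' is absent, so it becomes NaN
via_constructor = pd.Series(pop, index=wanted)
via_reindex = pop.reindex(wanted)

# isna() flags the missing entries; dropna() removes them
missing = via_reindex.isna()
present_only = via_reindex.dropna()

print(via_constructor.equals(via_reindex))
print(missing.tolist())
print(present_only.index.tolist())
```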

<h2 style="font-weight: bold;">
    Indexing
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Indexing works much as it does on lists and dictionaries: you use the **index** of the element you're looking for:

In [17]:
g7_pop

France             63.951
Canada             35.467
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [18]:
g7_pop['Canada']

35.467

In [19]:
g7_pop['Japan']

127.061

Numeric positions can also be used, with the `iloc` accessor:

In [20]:
g7_pop.iloc[0]

63.951

In [21]:
g7_pop.iloc[-1]

318.523

In [22]:
g7_pop.iloc[2:5]

Germany     80.940
Italy       60.665
Japan      127.061
Name: G7 Population in millions, dtype: float64

Selecting multiple elements at once:

In [23]:
g7_pop[['Italy', 'France']]

Italy     60.665
France    63.951
Name: G7 Population in millions, dtype: float64

_(The result is another Series)_

In [24]:
g7_pop.iloc[[0, 1]]

France    63.951
Canada    35.467
Name: G7 Population in millions, dtype: float64

Slicing by label also works, but **unlike in lists**, the upper limit of a label slice is included (position-based slices with `iloc` still exclude it, as seen above):

In [25]:
g7_pop['Canada': 'Italy']

Canada     35.467
Germany    80.940
Italy      60.665
Name: G7 Population in millions, dtype: float64

It is usually preferable to use the `.loc` accessor rather than simple square brackets.  It is a few extra characters, but it avoids confusion when the index labels of a Series are themselves numbers (as in our first pass at `g7_pop`).  A motto in Python is "explicit is better than implicit"; with `.loc` and `.iloc` we specify whether we mean the label of a row or its numeric position.

In [26]:
g7_pop.loc['Canada': 'Italy']

Canada     35.467
Germany    80.940
Italy      60.665
Name: G7 Population in millions, dtype: float64
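The difference between labels and positions is easiest to see when the labels are integers that do not match the positions. The following is a small sketch with toy data (the Series `s` and `letters` are made up for illustration); it also contrasts the inclusive label slice with the exclusive position slice.

```python
import pandas as pd

# Integer labels deliberately in reverse of the positions
s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s.loc[0]       # the row whose label is 0
by_position = s.iloc[0]   # the first row, whatever its label
print(by_label, by_position)

letters = pd.Series(['A', 'B', 'C', 'D'], index=['w', 'x', 'y', 'z'])

# Label slices include the stop label; position slices exclude the stop
label_slice = letters.loc['x':'y']
position_slice = letters.iloc[1:2]
print(label_slice.tolist(), position_slice.tolist())
```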

<h2 style="font-weight: bold;">
    Conditional selection (Boolean arrays)
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

A special "Boolean array" can be used to select from Pandas Series:

In [27]:
g7_pop

France             63.951
Canada             35.467
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [28]:
g7_pop > 70

France            False
Canada            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: G7 Population in millions, dtype: bool

In [29]:
g7_pop[g7_pop > 70]

Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64

Often we create Boolean arrays directly inside square brackets to index elements.  But we can also save a useful filter under a descriptive name.

In [30]:
is_big = g7_pop > 70
g7_pop[is_big]

Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64

Series have various useful methods; many of them are numeric or statistical.

In [31]:
g7_pop.mean()

107.30257142857144

We can combine these two capabilities we have just seen:

In [32]:
g7_pop[g7_pop > g7_pop.mean()]

Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64

In [35]:
g7_pop[g7_pop > g7_pop.median()]

Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64

In [36]:
g7_pop.std()

97.24996987121581
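One detail worth knowing about `.std()`: by default Pandas computes the sample standard deviation (`ddof=1`), whereas NumPy's `np.std` defaults to the population version (`ddof=0`). The sketch below uses a toy Series, not the G7 data, chosen so that the population standard deviation comes out to a round number.

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = s.std()              # Pandas default: ddof=1
population_std = s.std(ddof=0)    # divide by n instead of n - 1
numpy_std = np.std(s.to_numpy())  # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)
```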

We can use symbols as logical connectors for several different filters.

| Symbol | Meaning
|--------|---------
| ~      | not
| &#124; | or
| &      | and

A common pitfall is failing to parenthesize subexpressions separated by these symbols.  Often it will give you **some** answer, but the wrong one, when you do that.
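The pitfall can be sketched with the `~` operator on a small integer Series (toy values, not from the lesson's data). Because `~` binds more tightly than `>`, leaving out the parentheses inverts the bits of each integer before comparing, which silently produces a different mask.

```python
import pandas as pd

s = pd.Series([0, -1, 5])

# Without parentheses: parsed as (~s) > 0, a bitwise NOT of the integers
unparenthesized = ~s > 0

# With parentheses: the logical negation of the comparison, as intended
parenthesized = ~(s > 0)

print(unparenthesized.tolist())  # [False, False, False]
print(parenthesized.tolist())    # [True, True, False]
```

With `&` and `|` a missing pair of parentheses can instead raise an error, which is at least easier to notice than a quietly wrong result.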

Let us select the G7 countries with population within 1/2 standard deviation of the mean.

In [38]:
g7_pop[(g7_pop > (g7_pop.mean() - g7_pop.std()/2)) &
       (g7_pop < (g7_pop.mean() + g7_pop.std()/2))
      ].rename('Mid-sized G7 countries')

France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
Name: Mid-sized G7 countries, dtype: float64

That was a powerful query, but perhaps hard to read.  We might break it into steps.

In [39]:
bottom = g7_pop.mean() - g7_pop.std()/2
top = g7_pop.mean() + g7_pop.std()/2
print(bottom, top)

g7_pop[(g7_pop > bottom) & (g7_pop < top)]

58.677586492963535 155.92755636417934


France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
Name: G7 Population in millions, dtype: float64
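`Series.between` is one possible shorthand for this kind of range test. Note that by default it includes both endpoints, so it matches the strict `>` and `<` version only when no value sits exactly on a bound, which is the case for this data.

```python
import pandas as pd

g7_pop = pd.Series({
    'France': 63.951, 'Canada': 35.467, 'Germany': 80.940,
    'Italy': 60.665, 'Japan': 127.061, 'United Kingdom': 64.511,
    'United States': 318.523,
}, name='G7 Population in millions')

bottom = g7_pop.mean() - g7_pop.std() / 2
top = g7_pop.mean() + g7_pop.std() / 2

# between() builds the Boolean mask for bottom <= value <= top
mid_sized = g7_pop[g7_pop.between(bottom, top)]
print(mid_sized.index.tolist())
```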

<h2 style="font-weight: bold;">
    Operations and methods
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Series support "vectorized" operations and aggregation functions:

In [40]:
g7_pop

France             63.951
Canada             35.467
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [43]:
g7_pop_persons = (g7_pop * 1_000_000).rename("G7 population")
g7_pop_persons

France             63951000.0
Canada             35467000.0
Germany            80940000.0
Italy              60665000.0
Japan             127061000.0
United Kingdom     64511000.0
United States     318523000.0
Name: G7 population, dtype: float64
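"Vectorized" means the operation is applied to every element without writing a loop. As a rough sketch using two of the countries, the explicit loop below builds the same result one element at a time; the vectorized form is shorter and, on large data, typically much faster.

```python
import pandas as pd

pop = pd.Series([63.951, 35.467], index=['France', 'Canada'])

# Vectorized: one expression applies to every element
vectorized = pop * 1_000_000

# Equivalent explicit loop over (label, value) pairs
looped = pd.Series({label: value * 1_000_000
                    for label, value in pop.items()})

print(vectorized.equals(looped))
```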

In [44]:
mean = (g7_pop * 1e6).mean()
print(f"{mean:,.0f}")

107,302,571


Perhaps we would like to think only about the "order of magnitude" of these populations.

In [45]:
np.log10(g7_pop * 1e6)

France            7.805847
Canada            7.549824
Germany           7.908163
Italy             7.782938
Japan             8.104012
United Kingdom    7.809634
United States     8.503141
Name: G7 Population in millions, dtype: float64

We might aggregate over a subset of the data.

In [46]:
g7_pop['France': 'Italy'].mean()

60.25575

Looking at Series in particular orders can be useful. For example, perhaps alphabetically by the index.

In [47]:
g7_pop.sort_index()

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

Or in order by the population.

In [48]:
g7_pop.sort_values()

Canada             35.467
Italy              60.665
France             63.951
United Kingdom     64.511
Germany            80.940
Japan             127.061
United States     318.523
Name: G7 Population in millions, dtype: float64

In [49]:
g7_pop.sort_values(ascending=False)

United States     318.523
Japan             127.061
Germany            80.940
United Kingdom     64.511
France             63.951
Italy              60.665
Canada             35.467
Name: G7 Population in millions, dtype: float64
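Both sorting methods return a new Series and leave the original order untouched. The sketch below checks that, and also shows `nlargest` as one alternative when only the top few values are of interest.

```python
import pandas as pd

g7_pop = pd.Series({
    'France': 63.951, 'Canada': 35.467, 'Germany': 80.940,
    'Italy': 60.665, 'Japan': 127.061, 'United Kingdom': 64.511,
    'United States': 318.523,
}, name='G7 Population in millions')

# Sort descending, then keep the first three rows
top3 = g7_pop.sort_values(ascending=False).head(3)

# nlargest gives the same three rows in one call
also_top3 = g7_pop.nlargest(3)

print(top3.index.tolist())
print(g7_pop.index[0])   # the original Series still starts with France
```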

<h2 style="font-weight: bold;">
    Modifying series
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)


Let us make some hypothetical changes to the populations; think of them as projections.  We work on a copy so that the original `g7_pop` stays intact.

In [50]:
g7_imagined = g7_pop.copy()
g7_imagined.rename("Imagined populations", inplace=True)
g7_imagined

France             63.951
Canada             35.467
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: Imagined populations, dtype: float64

In [51]:
g7_imagined.loc['Canada'] = 40.5
g7_imagined

France             63.951
Canada             40.500
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: Imagined populations, dtype: float64

We can also assign by index position.

In [52]:
g7_imagined.iloc[-1] = 500
g7_imagined

France             63.951
Canada             40.500
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     500.000
Name: Imagined populations, dtype: float64

Recall how we might filter.

In [53]:
g7_imagined[g7_imagined < 70]

France            63.951
Canada            40.500
Italy             60.665
United Kingdom    64.511
Name: Imagined populations, dtype: float64

We can also modify based on the filter.

In [54]:
g7_imagined[g7_imagined < 70] = 99.99
g7_imagined

France             99.990
Canada             99.990
Germany            80.940
Italy              99.990
Japan             127.061
United Kingdom     99.990
United States     500.000
Name: Imagined populations, dtype: float64
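The `.copy()` call at the start of this section matters. As a minimal sketch with toy data (the names `original`, `alias` and `independent` are illustrative), a plain assignment only creates a second name for the same Series, while `.copy()` creates independent data.

```python
import pandas as pd

original = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

# A copy can be modified freely without touching the original
independent = original.copy()
independent.loc['a'] = 100.0

# Plain assignment is just another name for the same object
alias = original
alias.loc['b'] = -1.0

print(original.tolist())
print(independent.tolist())
```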

<div style="width: 100%; height: 200px; background-color: #ef7d22; text-align: center; padding-top: 20px; margin-bottom: 40px;">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Exercises
</h1>

<br><br> 
</div>

<h2 style="font-weight: bold;">
    Series creation
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)


---
### Create an empty pandas Series

In [None]:
# your code goes here

---
### Given the Python list X, convert it to a pandas Series Y

In [None]:
# your code goes here
X = ['A','B','C']
print(X, type(X))

---
### Given the X pandas Series, name it 'My letters'

In [None]:
# your code goes here
X = pd.Series(['A','B','C'])
X

---
### Given the X pandas Series, show its values

In [None]:
# your code goes here
X = pd.Series(['A','B','C'])

<h2 style="font-weight: bold;">
    Series indexing
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

---
### Assign index names to the given X pandas Series


In [None]:
# your code goes here
X = pd.Series(['A','B','C'])
X

---
### Given the X pandas Series, show its first element

In [None]:
X = pd.Series(['A','B','C'], index=['first', 'second', 'third'])
# your code goes here

---
### Given the X pandas Series, show its last element

In [None]:
X = pd.Series(['A','B','C'], index=['first', 'second', 'third'])
# your code goes here

---
### Given the X pandas Series, show all middle elements

In [None]:
# your code goes here
X = pd.Series(['A','B','C','D','E'],
              index=['first','second','third','fourth','fifth'])

---
### Given the X pandas Series, show the elements in reverse order

In [None]:
# your code goes here
X = pd.Series(['A','B','C','D','E'],
              index=['first','second','third','fourth','fifth'])

---
### Given the X pandas Series, show the first and last elements

In [None]:
# your code goes here
X = pd.Series(['A','B','C','D','E'],
              index=['first','second','third','fourth','fifth'])

<h2 style="font-weight: bold;">
    Series manipulation
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

---
### Convert the given integer pandas Series to float


In [None]:
# your code goes here
X = pd.Series([1, 2, 3, 4, 5],
              index=['first', 'second', 'third', 'fourth', 'fifth'])
X

---
### Order (sort) the given pandas Series

In [None]:
X = pd.Series([4, 2, 5, 1, 3],
              index=['fourth', 'second', 'fifth', 'first', 'third'])
# your code goes here

---
### Given the X pandas Series, set the fifth element equal to 10

In [None]:
X = pd.Series([1, 2, 3, 4, 5],
              index=['A', 'B', 'C', 'D', 'E'])
# your code goes here

---
### Given the X pandas Series, change all the middle elements to 0

In [None]:
# your code goes here
X = pd.Series([1, 2, 3, 4, 5],
              index=['A', 'B', 'C', 'D', 'E'])

---
### Given the X pandas Series, add 5 to every element

In [None]:
X = pd.Series([1,2,3,4,5],
              index=['A','B','C','D','E'])
# your code goes here


<h2 style="font-weight: bold;">
    Boolean arrays (also called masks)
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

---
### Given the X pandas Series, make a mask showing negative elements


In [None]:
X = pd.Series([-1, 2, 0, -4, 5, 6, 0, 0, -9, 10])
# your code goes here

---
### Given the X pandas Series, get the negative elements

In [None]:
X = pd.Series([-1, 2, 0, -4, 5, 6, 0, 0, -9, 10])
# your code goes here

---
### Given the X pandas Series, get numbers larger than 5


In [None]:
X = pd.Series([-1, 2, 0, -4, 5, 6, 0, 0, -9, 10])
# your code goes here

---
### Given the X pandas Series, select numbers higher than the mean of its elements

In [None]:
X = pd.Series([-1, 2, 0, -4, 5, 6, 0, 0, -9, 10])
# your code goes here

---
### Given the X pandas Series, get numbers equal to 2 or 10

In [None]:
X = pd.Series([-1, 2, 0, -4, 5, 6, 0, 0, -9, 10])
# your code goes here

<h2 style="font-weight: bold;">
    Logic functions
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

---
### Given the X pandas Series, return True if none of its elements is zero

In [None]:
X = pd.Series([-1, 2, 0, -4, 5, 6, 0, 0, -9, 10])
# your code goes here

---
### Given the X pandas Series, return True if any of its elements is zero

In [None]:
X = pd.Series([-1, 2, 0, -4, 5, 6, 0, 0, -9, 10])
# your code goes here

<h2 style="font-weight: bold;">
    Summary statistics
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

---
### Given the X pandas Series, show the sum of its elements


In [None]:
X = pd.Series([3, 5, 6, 7, 2, 3, 4, 9, 4])
# your code goes here

---
### Given the X pandas Series, show the mean value of its elements

In [None]:
X = pd.Series([1, 2, 0, 4, 5, 6, 0, 0, 9, 10])
# your code goes here

---
### Given the X pandas Series, show the max value of its elements

In [None]:
X = pd.Series([1, 2, 0, 4, 5, 6, 0, 0, 9, 10])
# your code goes here

<div style="width: 100%; height: 400px; background-color: #222; text-align: center; padding-top: 120px;">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Review and questions
</h1>

<br><br> 
</div>

---
<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

<img src="https://user-images.githubusercontent.com/7065401/98864025-08deda80-2448-11eb-9600-22aa17884cdf.png" style="height: 100%; max-height: inherit; position: absolute; top: 20%; left: 0px;"></img>
<br>

<h2 style="font-weight: bold;">
    David Mertz, Ph.D.
</h2>

<h3 style="color: #ef7d22; margin-top: 0.8em">
    Data Scientist
</h3>
<hr>
<br><br>

<p style="font-size: 80%; text-align: right; margin: 10px 0px;">
    david.mertz@gmail.com
</p>
<p style="font-size: 80%; text-align: right; margin: 10px 0px;">
    linkedin.com/in/dmertz/
</p>

</div>

<br><br><br>