<a href="https://github.com/theonaunheim">
    <img style="border-radius: 100%; float: right;" src="static/strawberry_thief_square.png" width=10% alt="Theo Naunheim's Github">
</a>

<br style="clear: both">
<hr>
<br>

<h1 align='center'>Series Basics</h1>

<br>

<div style="display: table; width: 100%">
    <div style="display: table-row; width: 100%;">
        <div style="display: table-cell; width: 50%; vertical-align: middle;">
            <img src="static/laozi.jpg">
        </div>
        <div style="display: table-cell; width: 10%">
        </div>
        <div style="display: table-cell; width: 40%; vertical-align: top;">
            <blockquote>
                <p style="font-style: italic;">"Act without doing; work without effort. Think of the small as large and the few as many. Confront the difficult while it is still easy; accomplish the great task by a series of small acts."</p>
                <br>
                <p>-Laozi (by way of Wes McKinney)</p>
            </blockquote>
        </div>
    </div>
</div>

<br>

<div align='left'>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Laozi.jpg'>ManosHacker</a>; released into the public domain.
</div>

<hr>

In [1]:
# Import stuff so we can use libraries.
import numpy as np
import pandas as pd

---

# What is a Series?

Conceptually, the [Series](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) object in Pandas is simple. It is just a container that attaches data to an index.

In [2]:
# You can directly create series using the pd.Series constructor. Which takes lists ...
s1 = pd.Series([50, 100, 150, 100, 50])

# Or dictionaries
s1 = pd.Series({0: 50, 1: 100, 2: 150, 3: 100, 4: 50})

# Or any iterable, really ...
s1 = pd.Series(data=tuple([50,100,150,100,50]), index=[0,1,2,3,4])

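# We can also load it from a data file. This is a little clunky because read_csv() returns a dataframe.
# Calling .squeeze('columns') on a one-column frame gives us a Series instead.
# (Older pandas accepted read_csv(..., squeeze=True); that option was removed in pandas 2.0.)
s1 = pd.read_csv('data/simple.csv').squeeze('columns')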

# Remember the last item in a cell is displayed.
s1

0     50
1    100
2    150
3    100
4     50
Name: data, dtype: int64

What do we have here? Looking at the above, we can see that we have our data listed in a column on the right, and a list of numbers on the left. And what are these labels on the left? These numbers are the "index".

At its core, a Series is a container that has an index and data ... which allows us to do a ton of useful stuff with it.

The purpose of the data is self-evident. It is the information we are analyzing. The purpose of the index is to let us select or look up parts of the data ("indexing") and to help associate data from one series with another.

Generally, an index should be a unique key (usually, but not always, a number or a string). In this case Pandas auto-assigned us an index.
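
To make the "unique key" idea concrete, here is a minimal sketch of label lookups. The series and their values are invented for illustration only; `is_unique` is the index attribute that reports whether any label repeats.

```python
import pandas as pd

# Hypothetical data, invented for this example.
games_back = pd.Series([0, 1, 7], index=['Brewers', 'Cubs', 'Cardinals'])

# With unique labels, a lookup returns exactly one value.
print(games_back.index.is_unique)
cubs = games_back.loc['Cubs']
print(cubs)

# Repeated labels are allowed, but a lookup then returns every match as a Series.
dupes = pd.Series([10, 20, 30], index=['a', 'b', 'a'])
print(dupes.index.is_unique)
matches = dupes.loc['a']
print(matches)
```

A non-unique index is not an error, but lookups stop being one-to-one, which is why a unique key is usually the better choice.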

In [3]:
# We can access our index directly using dot notation.
my_index = s1.index

print('We are accessing our index!\n')
print(my_index)

We are accessing our index!

RangeIndex(start=0, stop=5, step=1)


In [4]:
# And take a look at its values
value_of_my_index = my_index.values

print('\nPandas has helpfully auto-assigned the following numbers for our index!\n')
print(value_of_my_index)
print('\n... because we failed to do so.')


Pandas has helpfully auto-assigned the following numbers for our index!

[0 1 2 3 4]

... because we failed to do so.


In [5]:
# We can also directly inspect the data values
data_values = s1.values

print('\nOur data values are below!\n')
print(data_values)


Our data values are below!

[ 50 100 150 100  50]


In [6]:
# Series are built on top of numpy arrays for speed.
underlying_type = type(data_values)
number_type = type(data_values[0])

print(f'The list of values is stored as {underlying_type}.\n')
print(f'The data itself is stored as {number_type}, but we can just treat it as if it is a regular number.\n')

The list of values is stored as <class 'numpy.ndarray'>.

The data itself is stored as <class 'numpy.int64'>, but we can just treat it as if it is a regular number.



In [7]:
# Creating a slightly more complicated series.
nl_central = pd.Series(
    data=[0, 1, 7, 13, 28],
    index=['Brewers', 'Cubs', 'Cardinals', 'Pirates', 'Reds'],
    dtype=np.float64,
    name='Games Back'
)

# 'data' is the data we are storing
# 'index' is our lookup keys
# 'dtype' is generally inferred, but you can specify if need be
# 'name' is used for internal references within dataframes.
nl_central

Brewers       0.0
Cubs          1.0
Cardinals     7.0
Pirates      13.0
Reds         28.0
Name: Games Back, dtype: float64

Note: Numpy trades flexibility for speed. Plain Python works out the type of each object at runtime, and those objects can be of any size. If we commit to a single fixed-size ("static"-like) type instead, we can organize our data into arrays that are compact and provide quick lookups/operations.
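
As a rough illustration of that trade-off, the sketch below compares how one value is stored in a Python list versus a Numpy array. The exact size reported by `sys.getsizeof` is a CPython implementation detail and will vary by platform, so treat the numbers as indicative only.

```python
import sys
import numpy as np

values = [50, 100, 150, 100, 50]

# A Python list holds references to separate int objects,
# and each object carries its own type information.
list_item_bytes = sys.getsizeof(values[0])

# A Numpy array stores fixed-size integers back to back,
# with a single dtype shared by the whole array.
arr = np.array(values, dtype=np.int64)

print(f'One Python int object: {list_item_bytes} bytes')
print(f'One array element:     {arr.itemsize} bytes')
print(f'Whole array payload:   {arr.nbytes} bytes')
```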

---

# What does this Series allow us to do?

Whole bunch. Let's take a look see.

### Convenience attributes and methods.

First off, the Pandas [Series API](https://pandas.pydata.org/pandas-docs/stable/api.html#series) has a huge number of different types of calculations and operations to make your life easier. A brief sampling is below.

In [8]:
# Let's take a look at how many available attributes and methods we have
count = len(dir(s1))
print(f'Our series object has {count} attributes and methods.')

Our series object has 450 attributes and methods.


In [9]:
# How many of each value are there?
s1.value_counts()

100    2
50     2
150    1
Name: data, dtype: int64

In [10]:
# What's the mean value?
s1.mean()

90.0

In [11]:
# What unique values do we have?
s1.unique()

array([ 50, 100, 150], dtype=int64)

In [12]:
# What are our values sorted descending?
s1.sort_values(ascending=False)

2    150
3    100
1    100
4     50
0     50
Name: data, dtype: int64

### Vectorized mathematics

Secondly, and perhaps more importantly, putting the data in a container allows us to operate on the container as a whole, which makes computation much more convenient.

In [13]:
# Again, here's s1 ...
s1

0     50
1    100
2    150
3    100
4     50
Name: data, dtype: int64

In [14]:
# Say we want to create a new series that is just every item in the series times 10?
s2 = s1 * 10
s2

0     500
1    1000
2    1500
3    1000
4     500
Name: data, dtype: int64

In [15]:
# And then we want to make a new series that is just Series 2 minus Series 1?
s3 = s2 - s1
s3

0     450
1     900
2    1350
3     900
4     450
Name: data, dtype: int64

That's pretty nifty, but how does pandas know which row to subtract from which?

As you can see, pandas uses the index to figure out what gets subtracted from what. In other words, it subtracts the item at index 0 of s1 from the item at index 0 of s2 to give us the item at index 0 of s3.
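
The matching really is done by label rather than by position. A minimal sketch, using two small series invented for this purpose whose labels are stored in different orders:

```python
import pandas as pd

a = pd.Series([500, 1000, 1500], index=['x', 'y', 'z'])
# Same labels as `a`, but stored in a different order.
b = pd.Series([150, 50, 100], index=['z', 'x', 'y'])

# Series arithmetic lines the two up by label first.
by_label = a - b
print(by_label)

# Dropping down to the raw arrays ignores the labels and subtracts by position.
by_position = a.values - b.values
print(by_position)
```

The two results differ, which is the point: the index, not the storage order, decides which values meet.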

In [16]:
# Demonstrating the above:
print('The first item in s2 is:            ', s2[0])
print('The first item in s1 is:            ', s1[0])
print('Our operation is s2 - s1.')
print('Therefore, the first item in s3 is: ', s3[0])

The first item in s2 is:             500
The first item in s1 is:             50
Our operation is s2 - s1.
Therefore, the first item in s3 is:  450


In [17]:
# Be aware that if you're combining two series and an index label is only
# in one series, you will get a missing value.
#
# What is this mysterious missing value named NaN? More on this later ...
# iloc[] is your friend here: it selects by position.
first_3_elements_of_s1 = s1.iloc[:3]
s2 - first_3_elements_of_s1

0     450.0
1     900.0
2    1350.0
3       NaN
4       NaN
Name: data, dtype: float64

In [18]:
# We can also do logical operations
s1 > 50

0    False
1     True
2     True
3     True
4    False
Name: data, dtype: bool

In [19]:
# And what are all the numbers greater than 50 (we'll get to this)?
s1.loc[s1 > 50]

1    100
2    150
3    100
Name: data, dtype: int64

##### Side Note

There are certain limitations to pay attention to when operating on a series as a whole; these come from ["broadcasting" rules](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html). You can safely ignore most of them provided:

* You are using a single value, a.k.a. "scalar" 


    # scalar examples
    s1 + 1.5
    s2 * 3
    

* You are using two series, a.k.a. "array" or "vector"


    # series examples
    s1 * s2
    s1 + s1.iloc[:3]
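
For contrast, here is a hedged sketch of a case that falls outside those two safe patterns: combining a series with a plain Numpy array of a different length. The series here is invented for illustration, and the `ValueError` is what I would expect from recent pandas versions; the exact message may differ.

```python
import numpy as np
import pandas as pd

s = pd.Series([50, 100, 150])

# Safe: a scalar is applied to every element.
plus_one = s + 1

# Safe: two series are aligned on their index, so differing lengths just produce NaN.
partial = s + s.iloc[:2]

# Not safe: a bare array has no index to align on, so a length mismatch is an error.
try:
    s + np.array([1, 2])
    mismatch_failed = False
except ValueError as err:
    mismatch_failed = True
    print('Could not combine:', err)

print(plus_one)
print(partial)
```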


### Cross-Data Associations

Lastly, and most importantly, Series can come together Voltron-style to create "dataframes". We're not going to touch dataframes just yet, so just be aware that Dataframes are a collection of Series based around a shared index ... and they are totally awesome.

Here we are supplying our own index of names (strings). We create a group of series with the same index, and then use them to create a dataframe.

In [20]:
# List Data
names     = ['Keith Kogane', 'Lance McClain',  'Darrell Stoker', 'Princess Allura', 'Tsuyoshi Garett']
# Series data (np.nan marks a missing value; np.NaN and pd.datetime are gone from current releases)
nicknames = pd.Series(['Captain', np.nan, 'Pidge', np.nan , 'Hunk'  ], index=names)
color     = pd.Series(['Black'  , 'Red' , 'Green', 'Pink' , 'Yellow'], index=names)
updated   = pd.Series(pd.Timestamp.now(), index=names)
favorite  = pd.Series([pd       , np    , np     , np.nan , pd      ], index=names)

# Don't worry about dataframes yet! We'll get there ...
# Dataframes allow us to track multiple types of data on a row-by-row basis.
voltron = pd.DataFrame({
    'Handle'         : nicknames,
    'Color'          : color,
    'Last Update'    : updated,
    'Favorite Module': favorite,
})

voltron

| | Handle | Color | Last Update | Favorite Module |
|---|---|---|---|---|
| Keith Kogane | Captain | Black | 2018-11-03 20:18:40.663605 | `<module 'pandas' from 'C:\\Program Files (x86)...` |
| Lance McClain | NaN | Red | 2018-11-03 20:18:40.663605 | `<module 'numpy' from 'C:\\Program Files (x86)\...` |
| Darrell Stoker | Pidge | Green | 2018-11-03 20:18:40.663605 | `<module 'numpy' from 'C:\\Program Files (x86)\...` |
| Princess Allura | NaN | Pink | 2018-11-03 20:18:40.663605 | NaN |
| Tsuyoshi Garett | Hunk | Yellow | 2018-11-03 20:18:40.663605 | `<module 'pandas' from 'C:\\Program Files (x86)...` |


---

# What can a Series hold?

It ***can*** hold pretty much anything. What it ***should*** and **does** hold is another matter.

As mentioned before, Pandas uses Numpy to speed up computations. Numpy attaches a datatype to all arrays which allows those arrays to be faster and more compact than Python lists. In Pandas, we will generally use the following [datatypes](https://pandas.pydata.org/pandas-docs/stable/basics.html#dtypes):

* **object** (abbreviated **'O'**): which can be any Python object. Most notably this includes strings. This type supports np.nan, which is used for missing data.
* **bool**: which can be either True or False.
* **int64**: an integer type. This type does not support np.nan, so if your expected int64 column ends up as float64, that's probably why.
* **float64**: a floating point type. This type supports np.nan.
* **datetime64[ns]**: the basic time and/or date type with a resolution of nanoseconds. This comes in regular and time-zone enabled. This type supports pd.NaT.
* **timedelta64[ns]**: a type used for a period of time between two datetimes, with a resolution in nanoseconds.
* **category**: this is a pandas-specific type that is used to substitute a small integer for a frequently occurring value. This supports np.nan.
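
The category type is the least obvious entry in that list, so here is a minimal sketch of the "small integer" substitution it performs. The colors are invented sample data; `.cat.categories` and `.cat.codes` are the accessors that expose the two halves of the encoding.

```python
import pandas as pd

# Hypothetical data with only two distinct values.
colors = pd.Series(['Red', 'Green', 'Red', 'Red', 'Green'], dtype='category')

print(colors.dtype)            # category
print(colors.cat.categories)   # each distinct value, stored once
print(colors.cat.codes)        # the small integer stored for each row
```

Each row holds only a code that points into the list of categories, which is where the space saving for frequently repeated values comes from.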

---

Weird detour: np.nan is the value used to mark missing data. Oddly enough:
* it's actually a float.
* it is not equal to anything, including itself (np.nan != np.nan is True).
* if you want to see if a value is NaN, you must use pd.isnull(your_series), pd.notnull(your_series), your_series.isnull(), or your_series.notnull().
* of the types listed above, only object, float64, and category can hold np.nan; datetime64[ns] uses pd.NaT instead.
* if you have null values in an integer type, it will convert to float.
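
These quirks are easy to check for yourself. The sketch below uses a small integer series invented for the purpose and `reindex` to introduce a label that has no data, which is one simple way to produce a missing value.

```python
import numpy as np
import pandas as pd

# NaN is a float, and it never compares equal to itself.
print(type(np.nan))
print(np.nan == np.nan)

# An integer series with no gaps keeps its integer type ...
ints = pd.Series([1, 2, 3])
print(ints.dtype)

# ... but asking for a label with no data introduces NaN and forces a float type.
with_gap = ints.reindex([0, 1, 2, 3])
print(with_gap.dtype)

# Equality cannot find the gap, so use isnull() instead.
print(with_gap.isnull())
```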

For more detail, see [the official Pandas tutorial on missing data](http://pandas.pydata.org/pandas-docs/stable/missing_data.html).

---

In [21]:
# You can get the datatype of a particular Series by accessing its dtype attribute.
nicknames.dtype

dtype('O')

In [22]:
# And depending on the type, you may have special namespaces such as .str for strings ...
color.str.upper()

Keith Kogane        BLACK
Lance McClain         RED
Darrell Stoker      GREEN
Princess Allura      PINK
Tsuyoshi Garett    YELLOW
dtype: object

In [23]:
# Or .dt for datetimes.
updated.dt.day_name()

Keith Kogane       Saturday
Lance McClain      Saturday
Darrell Stoker     Saturday
Princess Allura    Saturday
Tsuyoshi Garett    Saturday
dtype: object

---

# Recap

1. Series are a flexible datatype for holding data
2. They have two major parts: an index and data
3. Unlike regular lists, they have a type
4. Series can be operated on as a unit
5. Series associate with other series via the index

---

# So now what?

Now that we have a better understanding of what a Series is, we're going to take a look at indexing and all the cool stuff it can do for us.

# Additional Learning Resources

* ### [Official Pandas Datatypes](https://pandas.pydata.org/pandas-docs/stable/basics.html#dtypes)
* ### [Practical Business Python: Pandas Datatypes](https://pbpython.com/pandas_dtypes.html)
* ### [Series Reference](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html)
* ### [Tutorials Point: Series](https://www.tutorialspoint.com/python_pandas/python_pandas_series.htm)

---

# Next Up: [Basic Indexing](4_basic_indexing.ipynb)

<br>


$\large{{{a} ={\begin{pmatrix}a_{1}\\a_{2}\\\vdots \\a_{n}\end{pmatrix}}}}$

---