<a href="https://github.com/theonaunheim">
    <img style="border-radius: 100%; float: right;" src="static/strawberry_thief_square.png" width=10% alt="Theo Naunheim's Github">
</a>

<br style="clear: both">
<hr>
<br>

<h1 align='center'>Background</h1>

<br>

<div style="display: table; width: 100%">
    <div style="display: table-row; width: 100%;">
        <div style="display: table-cell; width: 50%; vertical-align: middle;">
            <img src="static/baby_panda.jpg" width="400">
        </div>
        <div style="display: table-cell; width: 10%">
        </div>
        <div style="display: table-cell; width: 40%; vertical-align: top;">
            <blockquote>
                <p style="font-style: italic;">"A people without the knowledge of their past history, origin and culture is like a tree without roots."</p>
                <br>
                <p>-Marcus Garvey</p>
                <br>
                <br>
            </blockquote>
        </div>
    </div>
</div>

<br>

<div align='left'>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Yun_Zi_-_Baby_Giant_Panda_-_IMG_1499_(4305495389).jpg'>fortherock</a>; released under the <a href="https://creativecommons.org/licenses/by-sa/2.0/deed.en">CC BY-SA 2.0</a>
</div>

<hr>

# Where did Pandas come from?

In the early 2000s Python didn't really have a general-purpose library for data manipulation. If you wanted a free, intuitive interface for data analysis, you had to use the <a href="https://en.wikipedia.org/wiki/R_(programming_language)">R programming language</a>. That changed in 2008 with the introduction of the [Pandas](https://en.wikipedia.org/wiki/Pandas_%28software%29) data analysis library.

While working at a quantitative finance firm, an individual named Wes McKinney built an econometrics platform for analyzing panel data (**PAN**el **DA**ta -> Pandas). This platform used [Numpy](https://en.wikipedia.org/wiki/NumPy) and <a href="https://en.wikipedia.org/wiki/Python_(programming_language)">Python</a> as its base, but adopted many of R's concepts for effectively transforming datasets. It has since become the *de facto* standard in the Python community for high-level data analysis.

If you downloaded Anaconda, you've already got pandas. If not, you can get it by downloading the [Anaconda Python distribution](https://www.anaconda.com/download/), or by using [Pip/PyPI](https://pandas.pydata.org/getpandas.html).

---

# What is Pandas?

Pandas is a library that expands upon the capabilities of the Python language. It introduces data structures and calculations specifically geared towards making data analysis more effective. It also has features for tasks like plotting, cleaning, and loading data. These elements make it easy to tackle data problems iteratively, which is especially well suited to the Jupyter Notebook.
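
To make the "loading data" part concrete, here is a minimal sketch. The CSV text, its column names, and the 9% tax rate are all made up for illustration, and `io.StringIO` stands in for a real file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path or URL.
csv_text = io.StringIO("model,price\nsedan,10000\ncoupe,20000\n")

# read_csv parses the text into a two-dimensional table (a DataFrame).
cars = pd.read_csv(csv_text)

# Derive a new column from an existing one in a single expression.
cars["price_with_tax"] = cars["price"] * 1.09

print(cars)
```

Each step produces an object you can inspect right away, which is what makes the iterative notebook workflow pleasant.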

---

In [None]:
# Side note: the convention is to import Pandas as pd and numpy as np.
# You should do this too, or I will pass judgment upon you.
import pandas as pd
import numpy as np

---

# Why do we use Pandas?

### Versatility

First off, the Pandas library has [roughly a bajillion](https://pandas.pydata.org/pandas-docs/stable/api.html) different ways to mangle your data, as you'll soon see. Nothing else really compares in the Python space (or arguably anywhere else). Most operations worth doing can be done in pandas.

Secondly, we use Pandas to operate on a dataset as a whole instead of looping through its items one by one. This is known as [array programming, a.k.a. vectorization](https://en.wikipedia.org/wiki/Array_programming). Vectorization benefits us in two main ways.

### Vectorization: Code Clarity

First (and arguably more importantly), dealing with a dataset as a whole makes the code you write clearer. This benefits you because it makes your code more maintainable, explainable, and defensible.

A quick, naive demo (don't worry about understanding this just yet): say we have a list of customers and we want the average total each one paid, where the total is the sum of:

1. the car price with sales tax, and

2. untaxed add-on sales.

In other words, calculate:

    price + (price * tax) + add_on_costs

... and get the average total price paid.

In [None]:
# Data first.

# Define sales tax (we'll use the number instead of the variable for clarity)
sales_tax        = .09

# Starting with two lists
car_price_list   = [10_000, 20_000, 15_000, 40_000, 70_000]
other_costs_list = [250   , 250   , 250   , 2000  , 5000  ]

# Starting with two series based on previous lists.
car_price_series  = pd.Series(car_price_list)
other_cost_series = pd.Series(other_costs_list)

In [None]:
# Doing this in a non-vectorized manner with builtins ...

# Use a list comprehension to create our taxes.
tax_totals = [price * .09 for price in car_price_list]

# Create a zip object on a separate line for clarity
zip_object = zip(car_price_list, tax_totals, other_costs_list)

# List comprehension with zip object to add everything together
totals = [
    price + tax + other_cost
    for price, tax, other_cost
    in zip_object
]

# Now calculate the average.
aggregate_total   = sum(totals)
number_of_entries = len(totals)
average_1 = aggregate_total / number_of_entries

print('Non-vectorized mean cost is ${:,.2f}!'.format(average_1))

In [None]:
# Now vectorized! We can operate on the data **as a single unit** ...

# Here we multiply each value in car series by .09 (sales tax) to get tax for each item
tax_series = car_price_series * .09

# Here we add the tax + price + other costs
total_series = tax_series + car_price_series + other_cost_series

# And here we take the series and reduce it to a single value
average_2 = total_series.mean()

print('Vectorized mean cost is ${:,.2f}!'.format(average_2))
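
As a sanity check, the two approaches should give the same answer. The sketch below restates the data from the cells above so that it runs on its own. The variable names are mine, and `math.isclose` is used because floating-point results can differ in the last few digits.

```python
import math

import pandas as pd

# Same numbers as the demo above, restated so this cell is standalone.
car_prices  = [10_000, 20_000, 15_000, 40_000, 70_000]
other_costs = [250, 250, 250, 2000, 5000]

# Loop-style: build each customer's total, then average.
loop_totals = [p + p * .09 + o for p, o in zip(car_prices, other_costs)]
loop_mean = sum(loop_totals) / len(loop_totals)

# Vectorized: operate on whole Series at once.
prices = pd.Series(car_prices)
vector_mean = (prices + prices * .09 + pd.Series(other_costs)).mean()

# Both routes should agree (up to floating-point rounding).
assert math.isclose(loop_mean, vector_mean)
print('Both approaches give ${:,.2f}'.format(vector_mean))
```

By hand: the prices sum to 155,000, tax adds 9% of that (13,950), and the add-ons sum to 7,750, for 176,700 across five customers, or 35,340 each.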

### Vectorization: Speed

Computers have [specialized CPU instructions](https://en.wikipedia.org/wiki/SIMD) that let them perform the same computation on many values at once (instead of one by one). Pandas uses this (along with numpy) to its advantage to combat Python's main weakness: computation speed.

In [None]:
# Create a million random numbers
random_numbers = np.random.normal(1, .5, 1000000)
random_list    = list(random_numbers)
random_series  = pd.Series(random_numbers)

In [None]:
%%timeit
# Let's add 1 to each of those numbers individually
random_plus_1 = [number + 1 for number in random_list]

In [None]:
%%timeit
# Let's add 1 to each of those numbers collectively
random_plus_1 = random_series + 1
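
The `%%timeit` magic only works inside Jupyter/IPython. If you want to try the same comparison in a plain Python script, here is a rough sketch using the standard library's `timeit` module. The smaller array size and the repeat count of 10 are arbitrary choices of mine to keep it quick, and the actual timings will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

# A smaller sample than above so the script finishes quickly.
values    = np.random.normal(1, .5, 100_000)
as_list   = list(values)
as_series = pd.Series(values)

# Time 10 runs of each approach; timeit returns total seconds elapsed.
loop_seconds   = timeit.timeit(lambda: [n + 1 for n in as_list], number=10)
vector_seconds = timeit.timeit(lambda: as_series + 1, number=10)

print(f'loop: {loop_seconds:.4f}s, vectorized: {vector_seconds:.4f}s')
```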

---

# How do we use Pandas?

We use Pandas to take data from a variety of sources (CSVs, databases, etc.) and put it into a variety of useful objects (usually in RAM). We'll get to these in more detail, but for background, some of these are:

* [Series](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html): a one-dimensional list of data with an attached index. It is often used as a row or a column of a Dataframe.


* [Dataframes](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html): a two-dimensional table of data with an attached row index and column index. It is the primary structure for analyzing data.


* [Groupby](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html): a structure for placing data in certain groups for collective analysis.


* [Index](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html): a flat (Index) or hierarchical (MultiIndex) data structure attached to one of the above items that is used for lookups and associations.


We can then use these objects to transform our data in any one of a billion ways and output it. You can use these objects in a simple manner, or you can integrate them into your own functions and computations to do pretty much anything.
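
For a first taste of how the objects listed above fit together, here is a small sketch. The sales records, region names, and row labels are invented for illustration, and we'll cover each of these objects properly later on.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration.
sales = pd.DataFrame(
    {
        'region': ['east', 'west', 'east', 'west'],
        'price':  [10_000, 20_000, 15_000, 40_000],
    },
    index=['a', 'b', 'c', 'd'],  # the row Index
)

# A single column of a Dataframe is a Series ...
price_column = sales['price']

# ... and so is a single row, looked up here by its index label.
row_b = sales.loc['b']

# Groupby places rows into groups for collective analysis.
mean_price_by_region = sales.groupby('region')['price'].mean()

print(mean_price_by_region)
```

Note that the result of the groupby is itself a Series, with the region names as its index.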

Note: ironically, the Panel data structure from which Pandas got its name was deprecated and has since been removed from the library.

---

# What is the goal for this session?

We are going to learn about the Series, which is a basic building block of the Pandas library. Learning about the Series will allow us to tackle some core concepts before we use those concepts to discuss the Dataframe.

---

# Questions?

---

# Additional Learning Resources

* ### [Official Pandas Intro to Data Structures](https://pandas.pydata.org/pandas-docs/stable/dsintro.html)
* ### [Official Pandas Essential Basic Functionality](https://pandas.pydata.org/pandas-docs/stable/basics.html)
* ### [Official 10 Minutes to Pandas Guide](https://pandas.pydata.org/pandas-docs/stable/10min.html)
* ### [Python Data Science Handbook: Data Structures](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html)
* ### [Pandas From the Ground Up](https://www.youtube.com/watch?v=5JnMutdy6Fw) <- you should watch this.

---

# Next Up: [Series Basics](3_series_basics.ipynb)

<br>

<img style="margin-left: 0;" src="static/laozi.jpg">

<br>

<div align='left'>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Laozi.jpg'>ManosHacker</a>; released into the public domain.
</div>

---