<a href="https://github.com/theonaunheim">
    <img style="border-radius: 100%; float: right;" src="static/strawberry_thief_square.png" width=10% alt="Theo Naunheim's Github">
</a>

<br style="clear: both">
<hr>
<br>

<h1 align='center'>Background</h1>

<br>

<div style="display: table; width: 100%">
    <div style="display: table-row; width: 100%;">
        <div style="display: table-cell; width: 50%; vertical-align: middle;">
            <img src="static/baby_panda.jpg" width="400">
        </div>
        <div style="display: table-cell; width: 10%">
        </div>
        <div style="display: table-cell; width: 40%; vertical-align: top;">
            <blockquote>
                <p style="font-style: italic;">"A people without the knowledge of their past history, origin and culture is like a tree without roots."</p>
                <br>
                <p>-Marcus Garvey</p>
                <br>
                <br>
            </blockquote>
        </div>
    </div>
</div>

<br>

<div align='left'>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Yun_Zi_-_Baby_Giant_Panda_-_IMG_1499_(4305495389).jpg'>fortherock</a>; released under the <a href="https://creativecommons.org/licenses/by-sa/2.0/deed.en">CC BY-SA 2.0</a>
</div>

<hr>

# Where did pandas come from?

In the early 2000s Python didn't really have a general-purpose library for data manipulation. If you wanted a free, intuitive interface for data analysis, you had to use the [R language](https://en.wikipedia.org/wiki/R_(programming_language)). That changed in 2008 with the introduction of the [Pandas](https://en.wikipedia.org/wiki/Pandas_%28software%29) data analysis library.

While working at a quantitative finance firm, an individual named Wes McKinney built an econometrics platform for analyzing panel data (**PAN**el **DA**ta -> Pandas). This platform used [NumPy](https://numpy.org/) and [Python](https://www.python.org/) as its base, but adopted many of R's concepts for effectively transforming datasets. It has since become the *de facto* standard in the Python community for high-level data analysis.

If you downloaded Anaconda, you've already got pandas. If not, you can get it by downloading the [Anaconda Python distribution](https://www.anaconda.com/download/), or by using [Pip/PyPI](https://pandas.pydata.org/getpandas.html).

---

# What is pandas?

Pandas is a library that expands upon the capabilities of the Python language. It introduces data structures, syntax, and calculations specifically geared towards making data analysis more effective. It also has features for tasks like plotting, cleaning, and loading data. These new elements make it easy to tackle data problems in an iterative manner, which is especially suited to the Jupyter Notebook.
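
To make the loading, cleaning, and calculating a bit more concrete, here is a minimal sketch. The CSV text, the column names, and the variable names are all made up for illustration; a real analysis would usually point `pd.read_csv` at a file instead.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a real file on disk.
csv_text = """name,kilos,meters
Ann,60,1.5
Bob,,2.0
Cy,70,2.0
"""

# Loading: read_csv accepts a path or any file-like object.
people = pd.read_csv(io.StringIO(csv_text))

# Cleaning: drop the rows where the weight is missing.
clean = people.dropna(subset=['kilos'])

# Calculating: operate on a whole column at once.
mean_kilos = clean['kilos'].mean()

print(clean)
print(f'Mean weight is {mean_kilos} kg')
```

Each step returns a new object, so in a notebook you can inspect the result of one step before moving on to the next.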

---

In [None]:
# Side note: the convention is to import pandas as pd and numpy as np.
# You should do this too, or I will judge you.
import pandas as pd
import numpy as np

---

# Why do we use pandas?

### Versatility

First off, the pandas library has [roughly a bajillion](https://pandas.pydata.org/pandas-docs/stable/api.html) different ways to mangle your data, as you'll soon see. Nothing else really compares in the Python space (or arguably anywhere else).

Secondly, we use pandas to operate on a dataset as a whole instead of having to loop through each item one-by-one while processing. This is known as [Array Programming a.k.a. Vectorization](https://en.wikipedia.org/wiki/Array_programming). This vectorization benefits us in two main ways.

### Vectorization: Code Clarity

First (and arguably more importantly) dealing with a dataset as a whole makes the code you write clearer. This benefits you because it makes your code more maintainable, explainable, and defensible. A quick, naive example (don't worry about the details yet):

In [None]:
# A quick demo: say we want to get the average BMI of a group of people.
# https://en.wikipedia.org/wiki/Body_mass_index
kilos_as_list    = [100, 50, 60, 40, 70]
meters_as_list   = [2, 1, 1.5, 1, 2]
kilos_as_series  = pd.Series(kilos_as_list)
meters_as_series = pd.Series(meters_as_list)

In [None]:
# Let's do it non-vectorized.
# This could be more Pythonic, obviously.
bmis = []
for x in range(0, len(kilos_as_list)):
    height    = meters_as_list[x]
    weight    = kilos_as_list[x]
    height_sq = height ** 2
    bmi       = weight / height_sq  # BMI = kilos / meters squared
    bmis.append(bmi)

bmi_sum = sum(bmis)
avg_bmi = bmi_sum / len(bmis)

print(f'Non-vectorized mean BMI is {avg_bmi}!')

In [None]:
# Now vectorized! We can operate on the data **as a single unit**
# E.g. we can square each element of a column at once (or divide entire datasets)
height_sq = meters_as_series ** 2
bmis      = kilos_as_series / height_sq
avg_bmi   = bmis.mean()

print(f'Vectorized mean BMI is {avg_bmi}!')

### Vectorization: Speed

Computers have [specialized CPU instructions](https://en.wikipedia.org/wiki/SIMD) that speed up computation by operating on many values at once (instead of one by one). Pandas uses this (along with numpy) to its advantage to combat Python's main weakness: computation speed.

In [None]:
# Create a million random numbers
random_numbers = np.random.normal(1, .5, 1000000)
random_list    = list(random_numbers)
random_series  = pd.Series(random_numbers)

In [None]:
%%timeit
# Let's add 1 to each of those numbers individually
random_plus_1 = [number + 1 for number in random_list]

In [None]:
%%timeit
# Let's add 1 to each of those numbers collectively
random_plus_1 = random_series + 1

---

# How do we use pandas?

We use pandas to take data from a variety of sources (CSVs, databases, etc.) and put it into a variety of useful objects (usually in RAM). We'll get to these in more detail, but for background some of these are:

* [Series](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html): a one-dimensional list of data with an attached index. It is often used as a row or a column of a dataframe.


* [Dataframes](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html): a two-dimensional table of data with an attached row index and column index. It is the primary structure for analyzing data.


* [Groupby](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html): a structure for placing data in certain groups for collective analysis.


* [Index](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html): a flat (Index) or hierarchical (MultiIndex) data structure attached to one of the above items that is used for lookups and associations.


We can then use these objects to transform our data in any one of a billion ways and output it. You can use these objects in a simple manner, or you can integrate them into your own functions and computations to do pretty much anything.
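
As a rough preview, the objects listed above can be sketched in a few lines. The names and numbers here are invented for illustration, and we will cover each object properly later.

```python
import pandas as pd

# Series: one-dimensional data with an attached index.
kilos = pd.Series([100, 50, 60], index=['ann', 'bob', 'cy'])

# Dataframe: a two-dimensional table with a row index and a column index.
people = pd.DataFrame(
    {'team': ['red', 'blue', 'red'], 'kilos': [100, 50, 60]},
    index=['ann', 'bob', 'cy'],
)

# Index: the labels attached to the rows, used for lookups.
row_labels = people.index
bob_kilos = people.loc['bob', 'kilos']

# Groupby: place rows into groups, then analyze each group collectively.
mean_by_team = people.groupby('team')['kilos'].mean()

print(mean_by_team)
```

Note that the groupby result is itself a Series, indexed by team name, so everything you learn about the Series carries over.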

Note: ironically, the Panel data structure from which pandas got its name was deprecated and has since been removed from the library (as of pandas 0.25).

---

# What is the goal for this session?

We are going to learn about the Series, which is a basic building block of the pandas library. Learning about the Series will allow us to tackle some core concepts before we use those concepts to discuss the Dataframe.

---

# Questions?

---

# Additional Learning Resources

* ### [Official Pandas Intro to Data Structures](https://pandas.pydata.org/pandas-docs/stable/dsintro.html)
* ### [Official Pandas Essential Basic Functionality](https://pandas.pydata.org/pandas-docs/stable/basics.html)
* ### [Official 10 Minutes to Pandas Guide](https://pandas.pydata.org/pandas-docs/stable/10min.html)
* ### [Python Data Science Handbook: Data Structures](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html)
* ### [Pandas From the Ground Up](https://www.youtube.com/watch?v=5JnMutdy6Fw) <- you should watch this.

---

# Next Up: [Series Basics](3_series_basics.ipynb)

<br>

<img style="margin-left: 0;" src="static/laozi.jpg">

<br>

<div align='left'>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Laozi.jpg'>ManosHacker</a>; released into the public domain.
</div>

---