# Pandas Tutorial

### Introduction

_Pandas_ is an open source library for data analysis and manipulation in Python. We’re introducing it in this tutorial because it's one of the most popular tools for the kind of exploratory data analysis and data processing you’ll need for the project. 

Ultimately, you have full control over your tech stack for the project, so feel free to skip this tutorial if you’re already familiar with _Pandas_ or if your team has an alternate tech stack in mind for data exploration and pre-processing. 


### Data Structures
_Pandas_ is built around two foundational data structures:

*   **`DataFrame`**: The analog of a SQL table in Python. Just like a SQL table, it’s a two-dimensional data structure containing rows and named columns, with an index that labels the rows. 
*   **`Series`**: The analog of a single column of a table. Like a `DataFrame`, a `Series` also has an index. 

Usually, Python users import *pandas* under the abbreviation `pd`. Since many *numpy* functions operate on `Series`, users often import *numpy* alongside *pandas*.

In [3]:
import pandas as pd
import numpy as np
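
To make the two structures concrete, here is a minimal sketch that builds one of each and inspects the index and column labels. The variable names `ages` and `table` are illustrative only and are not used elsewhere in this tutorial.

```python
import pandas as pd

# A Series is a single labeled column: values plus an index.
ages = pd.Series([19, 24, 4], name='age')

# A DataFrame is a table: named columns that share one index.
table = pd.DataFrame({'age': [19, 24, 4], 'height': [183, 170, 70]})

print(list(ages.index))     # default index labels: [0, 1, 2]
print(list(table.columns))  # ['age', 'height']
print(table.shape)          # (3, 2): three rows, two columns
```

When no index is supplied, both structures fall back to the default integer labels `0, 1, 2, ...`.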

### Construct a `DataFrame`
Let's construct a `DataFrame` of toy data by passing a dictionary to the `pandas.DataFrame` constructor. Each key gives a column name, and the corresponding value gives that column's values. 

In [4]:
people = pd.DataFrame({'age': [19, 24, 4], 'height': [183, 170, 70], 'weight': [77, 70, 25]})
people

|   | age | height | weight |
| --- | --- | --- | --- |
| **0** | 19 | 183 | 77 |
| **1** | 24 | 170 | 70 |
| **2** | 4 | 70 | 25 |


The bolded text across the top of the `DataFrame` gives the column names. The bolded text running down the left side of the `DataFrame` gives the index values.
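
The column names and index values described above can also be read programmatically through the `columns` and `index` attributes. The sketch below additionally passes the constructor's `index` argument to show that row labels need not be integers; the labels `'ann'`, `'bob'`, and `'cal'` are hypothetical and do not appear in the tutorial's data.

```python
import pandas as pd

labeled_people = pd.DataFrame(
    {'age': [19, 24, 4], 'height': [183, 170, 70], 'weight': [77, 70, 25]},
    index=['ann', 'bob', 'cal'],  # hypothetical row labels, for illustration only
)

print(list(labeled_people.columns))  # ['age', 'height', 'weight']
print(list(labeled_people.index))    # ['ann', 'bob', 'cal']
```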

We can also construct a `DataFrame` using a dictionary of `Series`:

In [5]:
age = pd.Series([19, 24, 4])
height = pd.Series([183, 170, 70])
weight = pd.Series([77, 70, 25])

people = pd.DataFrame({'age': age, 'height': height, 'weight': weight})
people

|   | age | height | weight |
| --- | --- | --- | --- |
| **0** | 19 | 183 | 77 |
| **1** | 24 | 170 | 70 |
| **2** | 4 | 70 | 25 |
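
One detail worth knowing when building a `DataFrame` from `Series`: the columns are matched up by index label, not by position. The sketch below uses two deliberately mismatched `Series` (an assumption made for illustration, not part of the tutorial's data) to show that missing labels are filled with `NaN`.

```python
import pandas as pd

# Two Series whose index labels only partly overlap.
age = pd.Series([19, 24], index=[0, 1])
height = pd.Series([170, 70], index=[1, 2])

# Rows are aligned on the index labels; gaps become NaN.
aligned = pd.DataFrame({'age': age, 'height': height})
print(aligned)

print(list(aligned.index))              # [0, 1, 2]
print(pd.isna(aligned.loc[2, 'age']))   # True: no age was given for label 2
```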


### Accessing Data
Pandas defines a [powerful and extensive API](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) for indexing and subsetting data that I won't cover in detail. For now, I'll just introduce two basic subsetting methods: 
1. Dictionary operations
2. `.loc` 


#### Dictionary Operations
Standard Python `dict` square-bracket indexing allows us to access individual columns as `Series`. For example, let's grab the age column: 

In [6]:
people['age']

0    19
1    24
2     4
Name: age, dtype: int64

We can also pass a list of column names into the brackets to grab a subset of columns simultaneously. In this case, the operation returns a `DataFrame`. For example, let's grab the `height` and `weight` columns together:

In [7]:
people[['height', 'weight']]

|   | height | weight |
| --- | --- | --- |
| **0** | 183 | 77 |
| **1** | 170 | 70 |
| **2** | 70 | 25 |
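
The difference between the two bracket forms is easy to check directly. This short sketch, on a trimmed copy of the toy data, shows that a bare column name returns a `Series`, while a list of names returns a `DataFrame`, even when the list has only one element.

```python
import pandas as pd

people = pd.DataFrame({'age': [19, 24, 4], 'height': [183, 170, 70]})

single = people['age']      # a column name returns a Series
wrapped = people[['age']]   # a one-element list returns a one-column DataFrame

print(type(single).__name__)   # Series
print(type(wrapped).__name__)  # DataFrame
print(wrapped.shape)           # (3, 1)
```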


We can also use dictionary operations to define new columns or update existing ones:

In [8]:
people['height'] = people['height'] * 2
people['log_weight'] = np.log(people['weight'])  # numpy functions apply element-wise to a Series
people

|   | age | height | weight | log_weight |
| --- | --- | --- | --- | --- |
| **0** | 19 | 366 | 77 | 4.343805 |
| **1** | 24 | 340 | 70 | 4.248495 |
| **2** | 4 | 140 | 25 | 3.218876 |
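
Note that bracket assignment changes the `DataFrame` in place. When the original should stay untouched, one alternative is `DataFrame.assign`, which returns a new `DataFrame` with the extra column. The sketch below is a possible variation rather than the tutorial's own approach, and the column name `height_m` is made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({'age': [19, 24, 4], 'height': [183, 170, 70]})

# assign returns a new DataFrame; demo itself is left unchanged.
with_metres = demo.assign(height_m=demo['height'] / 100)

print(list(demo.columns))         # ['age', 'height']
print(list(with_metres.columns))  # ['age', 'height', 'height_m']
```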


#### `.loc`
The `.loc` indexer allows us to subset rows and columns of a `DataFrame` by label: the `DataFrame`'s index gives the row labels, and the column names give the column labels. 

The standard syntax is `my_data.loc[row_labels, column_labels]`. 

For example, let's grab the `age` and `weight` for the first and last rows:

In [9]:
people.loc[[0, 2], ['age', 'weight']]

|   | age | weight |
| --- | --- | --- |
| **0** | 19 | 77 |
| **2** | 4 | 25 |


*Notice that the index values of the rows we extracted remain the same. The index values of `DataFrame` rows are labels, not positions, so they don't change when rows are removed. They remain constant unless we explicitly redefine the index.*
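
A small sketch of what this means in practice, using a trimmed copy of the toy data: after subsetting, label `2` still exists but label `1` does not. Calling `reset_index(drop=True)` is one way to explicitly redefine the index as a fresh `0..n-1` range.

```python
import pandas as pd

people = pd.DataFrame({'age': [19, 24, 4], 'weight': [77, 70, 25]})
subset = people.loc[[0, 2]]

print(list(subset.index))    # [0, 2]: the original labels are kept
print(subset.loc[2, 'age'])  # 4: label 2 still refers to the same row
print(1 in subset.index)     # False: label 1 was not selected

# Explicitly redefine the index; drop=True discards the old labels.
renumbered = subset.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1]
```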


Passing a colon to `.loc` grabs all of the rows or columns:

In [10]:
people.loc[:, ['age', 'weight']]

|   | age | weight |
| --- | --- | --- |
| **0** | 19 | 77 |
| **1** | 24 | 70 |
| **2** | 4 | 25 |


In [11]:
people.loc[[0, 2], :]

|   | age | height | weight | log_weight |
| --- | --- | --- | --- | --- |
| **0** | 19 | 366 | 77 | 4.343805 |
| **2** | 4 | 140 | 25 | 3.218876 |


If we don't pass a comma inside the brackets, `.loc` interprets the values as row labels:

In [12]:
people.loc[[0, 2]]

|   | age | height | weight | log_weight |
| --- | --- | --- | --- | --- |
| **0** | 19 | 366 | 77 | 4.343805 |
| **2** | 4 | 140 | 25 | 3.218876 |
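
Besides label lists and a bare colon, `.loc` also accepts label slices. This goes slightly beyond the cases shown so far, so treat it as a supplementary sketch: unlike ordinary Python slicing, a `.loc` slice includes both endpoints.

```python
import pandas as pd

people = pd.DataFrame({'age': [19, 24, 4], 'height': [183, 170, 70], 'weight': [77, 70, 25]})

# Label slices in .loc are inclusive on both ends.
first_two = people.loc[0:1]
print(list(first_two.index))  # [0, 1]

# The same applies to column labels.
age_to_height = people.loc[:, 'age':'height']
print(list(age_to_height.columns))  # ['age', 'height']
```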


`.loc` also accepts several other data structures as indexers, for example a boolean list or `Series`, where `True` marks the rows (or columns) to keep:

In [13]:
people.loc[[True, False, True], :]

|   | age | height | weight | log_weight |
| --- | --- | --- | --- | --- |
| **0** | 19 | 366 | 77 | 4.343805 |
| **2** | 4 | 140 | 25 | 3.218876 |
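
In practice, the boolean indexer is rarely written out by hand. A more common pattern is to build a boolean `Series` from a comparison on a column and pass that to `.loc`. In the sketch below, the cutoff of 18 is an arbitrary value chosen for illustration.

```python
import pandas as pd

people = pd.DataFrame({'age': [19, 24, 4], 'weight': [77, 70, 25]})

# Comparing a column to a value yields a boolean Series with the same index.
is_adult = people['age'] >= 18
print(list(is_adult))  # [True, True, False]

# Use the boolean Series to keep only the matching rows.
adults = people.loc[is_adult, ['age', 'weight']]
print(list(adults.index))  # [0, 1]
```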


But the list must have the same number of elements as the axis it indexes:

In [14]:
people.loc[[True, True], :]

IndexError: Boolean index has wrong length: 2 instead of 3
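
The sketch below reproduces the failure inside a `try`/`except` so the error can be inspected without stopping the notebook. It then shows one way to avoid the problem: deriving the mask from the `DataFrame` itself, which guarantees a matching length. The `age > 0` condition is a placeholder chosen only for illustration.

```python
import pandas as pd

people = pd.DataFrame({'age': [19, 24, 4], 'weight': [77, 70, 25]})
mask = [True, True]  # one element short on purpose

try:
    people.loc[mask, :]
    failed = False
except IndexError as err:
    failed = True
    print(err)

# A mask derived from a column always has one entry per row.
safe_mask = people['age'] > 0
print(len(safe_mask) == len(people))  # True
```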

### Exercises

Now, try [these exercises](https://drive.google.com/open?id=1muxuaIDp4OW4rOO3ZWPxOZyXxNKkdDvr) to practice using _Pandas_.