<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Pandas-from-NumPy" data-toc-modified-id="Pandas-from-NumPy-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Pandas from NumPy</a></span><ul class="toc-item"><li><span><a href="#Objectives" data-toc-modified-id="Objectives-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Objectives</a></span></li></ul></li><li><span><a href="#What-are-Series-and-DataFrames?" data-toc-modified-id="What-are-Series-and-DataFrames?-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>What are Series and DataFrames?</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Some-Examples" data-toc-modified-id="Some-Examples-2.0.1"><span class="toc-item-num">2.0.1&nbsp;&nbsp;</span>Some Examples</a></span></li></ul></li><li><span><a href="#🧠-Knowledge-Check:-Why-use-Pandas-Series-and-DataFrames?" data-toc-modified-id="🧠-Knowledge-Check:-Why-use-Pandas-Series-and-DataFrames?-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>🧠 Knowledge Check: Why use Pandas Series and DataFrames?</a></span><ul class="toc-item"><li><span><a href="#Possible-Answer" data-toc-modified-id="Possible-Answer-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Possible Answer</a></span></li></ul></li></ul></li><li><span><a href="#Pandas-Methods-for-Importing-&amp;-Exporting-Data" data-toc-modified-id="Pandas-Methods-for-Importing-&amp;-Exporting-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Pandas Methods for Importing &amp; Exporting Data</a></span><ul class="toc-item"><li><span><a href="#def-Functions-vs-lambda-Functions" data-toc-modified-id="def-Functions-vs-lambda-Functions-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span><code>def</code> Functions vs <code>lambda</code> Functions</a></span></li></ul></li><li><span><a href="#Pandas-Methods-for-Formatting-and-Cleaning-Data" data-toc-modified-id="Pandas-Methods-for-Formatting-and-Cleaning-Data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Pandas Methods for 
Formatting and Cleaning Data</a></span></li><li><span><a href="#Accessing-Data-in-Pandas" data-toc-modified-id="Accessing-Data-in-Pandas-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Accessing Data in Pandas</a></span><ul class="toc-item"><li><span><a href="#Methods:" data-toc-modified-id="Methods:-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Methods:</a></span><ul class="toc-item"><li><span><a href="#Examples" data-toc-modified-id="Examples-5.1.1"><span class="toc-item-num">5.1.1&nbsp;&nbsp;</span>Examples</a></span></li></ul></li><li><span><a href="#Attributes:" data-toc-modified-id="Attributes:-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Attributes:</a></span></li><li><span><a href="#Additional-Resources" data-toc-modified-id="Additional-Resources-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Additional Resources</a></span></li><li><span><a href="#Boolean-Masking-for-Data-Selection" data-toc-modified-id="Boolean-Masking-for-Data-Selection-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Boolean Masking for Data Selection</a></span></li></ul></li></ul></div>

# Pandas from NumPy

![](images/pandas.svg)

> [Pandas](https://pandas.pydata.org/pandas-docs/stable/) is often the first tool a data scientist reaches for. It is built on top of the [NumPy package](https://docs.scipy.org/doc/numpy/reference/), so familiarity with NumPy will help you understand how to use Pandas. Pandas also adds many features of its own that are very useful to a data scientist!
>
> Pandas isn't the only tool you will use as a data scientist, but it can usually get the job done when the data are not _"big data"_. It is also useful for an initial exploration of larger datasets. A great (and free) resource is the _Python Data Science Handbook_ by Jake VanderPlas: https://jakevdp.github.io/PythonDataScienceHandbook

## Objectives

You will be able to:
* Understand and explain what Pandas Series and DataFrames are and how they differ from dictionaries and lists
* Create Series & DataFrames from dictionaries and lists
* Perform data manipulation on Series and DataFrames using methods and attributes
* Perform data manipulation on Series and DataFrames using lambda functions, def functions, and .map()/.apply()/.applymap()

In [None]:
# Some imports to do some work
import numpy as np # Our good old friend ♥️
import pandas as pd # We'll get to know her well 🐼

# What are Series and DataFrames?

One really cool thing about Pandas is that its main structures, **DataFrames** and **Series**, are built on top of NumPy arrays and add row and column labels, much like a spreadsheet. This is how we organize our data in Pandas.

* “Series” → one-dimensional labeled array; a column
* “DataFrame” → two-dimensional labeled array; a spreadsheet
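
A minimal sketch of that relationship, using made-up values: a Series keeps its data in a NumPy array and attaches an index of labels, while a DataFrame adds column names on top of that. The variable names here are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only for illustration
values = np.array([0.2, 0.4, 0.6])

# A Series is a one-dimensional array plus an index of labels
s = pd.Series(values, index=['a', 'b', 'c'])
s_values = s.to_numpy()          # the underlying data as a NumPy array
s_labels = list(s.index)         # the labels attached to each value

# A DataFrame is two-dimensional: rows share an index, columns have names
frame = pd.DataFrame({'x': values, 'y': values * 10}, index=['a', 'b', 'c'])
frame_values = frame.to_numpy()  # 2-D NumPy array with 3 rows and 2 columns

print(type(s_values), s_labels)
print(frame_values.shape)
```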

### Some Examples

In [None]:
# Series
data_col = pd.Series([0.2, 0.4, 0.6, 0.8, 1.0])
data_col


In [None]:
# Series with an index!
data_col = pd.Series([0.2, 0.4, 0.6, 0.8, 1.0], 
                     index=['a','b','c','d','e'])
data_col

In [None]:
data_col_other = pd.Series([0.1, 0.2, 0.3, 0.4, 5],
                           index=['a','b','c','d','e'])
data_col_other

In [None]:
# A DataFrame from our Series!
df = pd.DataFrame(data_col)
df

In [None]:
# Multiple Series! We can define our column names
df = pd.DataFrame({'1st_Col':data_col,'2nd_Col':data_col_other})
df

> A great way to learn is to try and break things! What do you predict will happen here?

In [None]:
# The two Series no longer share all of their index labels
data_col_other = pd.Series([0.1, 0.2, 0.3, 0.4, 5],
                           index=['a','b','c','z','y'])
data_col = pd.Series([0.2, 0.4, 0.6, 0.8, 1.0], 
                     index=['a','b','c','d','e'])

df = pd.DataFrame({'1st_Col':data_col,'2nd_Col':data_col_other})
df
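
To check your prediction, here is a smaller sketch of the same situation with hypothetical Series. It assumes the behavior at play is label alignment: Pandas matches rows by index label, not by position, and fills gaps with `NaN`.

```python
import pandas as pd

# Hypothetical Series whose labels only partly overlap
left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
right = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'z'])

# Pandas aligns on labels and keeps every label found in either Series
aligned = pd.DataFrame({'left': left, 'right': right})

# Labels present in only one Series get NaN in the other column
missing_counts = aligned.isna().sum()

print(aligned)
print(missing_counts)
```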

## 🧠 Knowledge Check: Why use Pandas Series and DataFrames?

> Why use Pandas Series and DataFrames instead of built-in Python data types of lists and dictionaries?

### Possible Answer

> Series and DataFrames come with a wide range of built-in methods that streamline standard practices and procedures, such as summarizing, filtering, and element-wise arithmetic.
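
As a rough illustration of that answer, the sketch below does the same work with a plain list and with a Series. The prices and labels are invented for the example.

```python
import pandas as pd

# Hypothetical prices, stored two ways
prices_list = [10.0, 20.0, 30.0]
prices = pd.Series(prices_list, index=['apple', 'bread', 'cheese'])

# With a list, element-wise math needs an explicit loop or comprehension
taxed_list = [p * 1.5 for p in prices_list]

# With a Series, the same operation is one vectorized expression
taxed = prices * 1.5

# Built-in methods cover common summaries, and labels give readable lookups
average = prices.mean()
bread_price = prices['bread']

print(taxed_list)
print(taxed.tolist(), average, bread_price)
```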

# Pandas Methods for Importing & Exporting Data

- pd.read_csv()
- pd.read_excel()
- pd.read_json()
- pd.DataFrame.from_dict()

- df.to_csv()
- df.to_excel()
- df.to_json()
- df.to_dict()
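
A small round trip through a few of these methods might look like the sketch below. It uses an in-memory string instead of a real file so it runs anywhere; the data and variable names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical data for illustration
records = {'name': ['Ada', 'Grace'], 'score': [90, 95]}
original = pd.DataFrame.from_dict(records)

# to_csv() returns a string when no path is given; index=False drops row labels
csv_text = original.to_csv(index=False)

# read_csv() accepts a path or any file-like object, such as StringIO
restored = pd.read_csv(io.StringIO(csv_text))

# to_dict(orient='list') gives back a plain {column: [values]} dictionary
as_dict = restored.to_dict(orient='list')

print(csv_text)
print(as_dict)
```

Passing a file path such as `'data.csv'` (a placeholder name) to `to_csv()` and `read_csv()` works the same way, but writes to and reads from disk.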

## `def` Functions vs `lambda` Functions

> .map(), .apply(), .applymap()

- both `def` and `lambda` functions are useful for applying an operation to all values in a Series or DataFrame
    * a `def` function has a name, so it is easy to re-use
    * a `lambda` function is anonymous, so it suits short, one-time operations
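
The sketch below applies the same conversion once with a named function and once with a lambda, then uses `DataFrame.apply()` on a whole column. The temperatures and the `to_fahrenheit` helper are hypothetical.

```python
import pandas as pd

# Hypothetical temperatures in Celsius
temps = pd.DataFrame({'city': ['Oslo', 'Cairo'], 'celsius': [5.0, 30.0]})

def to_fahrenheit(celsius):
    """Convert one Celsius value to Fahrenheit."""
    return celsius * 9 / 5 + 32

# A named function can be reused anywhere it is needed
with_def = temps['celsius'].map(to_fahrenheit)

# A lambda does the same work inline, handy for short one-off operations
with_lambda = temps['celsius'].map(lambda c: c * 9 / 5 + 32)

# DataFrame.apply() passes whole columns (or rows with axis=1) to the function
longest = temps[['city']].apply(lambda col: col.str.len().max())

print(with_def.tolist(), with_lambda.tolist(), longest['city'])
```

`.applymap()` works element by element on a whole DataFrame. From pandas 2.1 onward it is deprecated in favor of `DataFrame.map()`, so it is left out of this sketch.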
    
https://sites.google.com/site/prgrammnote/python/difference-between-map-applymap-and-apply-methods-in-pandas

https://chrisalbon.com/python/data_wrangling/pandas_apply_operations_to_dataframes/

# Pandas Methods for Formatting and Cleaning Data

* df['col'].astype()
* pd.to_datetime()
* df.rename()
* df.drop()
* df.set_index()

**When manipulating and formatting data, it is good practice to preview changes before overwriting the original data.**
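
One way to follow that advice is sketched below with invented data. It relies on these methods returning new objects by default, so the cleaned result can be inspected while the original stays unchanged.

```python
import pandas as pd

# Hypothetical raw data where every column arrived as text
raw = pd.DataFrame({
    'ID': ['1', '2', '3'],
    'date': ['2021-01-01', '2021-01-02', '2021-01-03'],
    'unused': ['x', 'y', 'z'],
})

# These methods return new objects by default, so `raw` is untouched
# and each result can be inspected before anything is overwritten
preview = (
    raw.rename(columns={'ID': 'id'})
       .drop(columns=['unused'])
       .astype({'id': 'int64'})
)
preview['date'] = pd.to_datetime(preview['date'])
preview = preview.set_index('id')

print(preview.dtypes)
print(raw.columns.tolist())  # original column names are still here
```

Only once `preview` looks right would you assign it back over the original variable.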

# Accessing Data in Pandas

In [None]:
from sklearn.datasets import load_wine
import pandas as pd 

data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)

## Methods:

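* `.info()` → Summary of the DataFrame: column names, non-null counts, and data types
* `.describe()` → Descriptive statistics of numerical columns
* `.head()` → First 5 rows of the DataFrame by default; `.head(n)` returns the first `n` rows
* `.tail()` → Last 5 rows of the DataFrame by default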

### Examples

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.head()

In [None]:
df.head(10)

In [None]:
df.tail()

## Attributes:

* .index
* .columns
* .dtypes
* .shape
* .loc - select by label
* .iloc - select by integer position
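
The difference between `.loc` and `.iloc` is easiest to see when the row labels are not 0, 1, 2. The sketch below uses a small hypothetical frame (not the wine data) with labels 10, 20, 30.

```python
import pandas as pd

# Hypothetical frame with non-default row labels
small = pd.DataFrame(
    {'alcohol': [14.2, 13.2, 12.4], 'hue': [1.04, 1.05, 0.86]},
    index=[10, 20, 30],
)

# Attributes are read without parentheses
n_rows, n_cols = small.shape
column_names = list(small.columns)

# .loc uses labels: the row whose index label is 10
by_label = small.loc[10, 'alcohol']

# .iloc uses integer positions: the first row, first column
by_position = small.iloc[0, 0]

# Label slices include the end label; position slices exclude the end position
label_slice = small.loc[10:20]    # rows labeled 10 and 20
position_slice = small.iloc[0:1]  # only the first row

print(n_rows, n_cols, column_names)
print(by_label, by_position, len(label_slice), len(position_slice))
```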

## Additional Resources

Confused about methods vs. attributes? Check out these links:

https://www.quora.com/What-is-the-difference-between-methods-and-attributes-in-Python

https://stackoverflow.com/questions/28798781/differences-between-data-attributes-and-method-attributes

## Boolean Masking for Data Selection

> Extremely useful for selecting the subset of rows that satisfy one or more conditions

In [None]:
df.loc[df["alcohol"] < 12] # boolean indexing

In [None]:
# Alternative
mask = df['alcohol'] < 12  # boolean masking
new_df = df[mask]
new_df

In [None]:
# Multiple conditions: AND → &
mask = (df['alcohol'] < 12) & (df['alcohol'] > 11.7) # Each condition must be in parentheses!
new_df = df[mask]
new_df

In [None]:
# Multiple conditions: OR → |
mask = (df['alcohol'] < 12) | (df['alcohol'] > 14) # Each condition must be in parentheses!
new_df = df[mask]
new_df
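
To see what a mask actually contains, and why the parentheses matter, here is a sketch on a tiny hypothetical frame standing in for the wine data. The `~` operator for negating a mask is an addition not used in the cells above.

```python
import pandas as pd

# Hypothetical values standing in for the wine data's 'alcohol' column
wines = pd.DataFrame({'alcohol': [11.5, 11.9, 12.5, 14.3]})

# A mask is just a Series of True/False values, one per row
low = wines['alcohol'] < 12
print(low.tolist())

# & and | bind more tightly than < and >, hence the parentheses
in_range = (wines['alcohol'] > 11.7) & (wines['alcohol'] < 12)
extremes = (wines['alcohol'] < 12) | (wines['alcohol'] > 14)

# ~ negates a mask, selecting every row the original mask rejected
not_low = ~low

print(wines[in_range]['alcohol'].tolist())
print(wines[extremes]['alcohol'].tolist())
print(wines[not_low]['alcohol'].tolist())
```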