# What is Pandas?

Pandas is a [package](https://data36.com/python-import-built-in-modules-data-science/) in Python, used for data formatting, analysis and manipulation. It gives you a way to work with 2-D data structures (like SQL or Excel tables) in Python, which aren't "native" to the language. That's why we have to `import` the package. If you installed Python through the Anaconda distribution, Pandas is already installed, so there is nothing extra to download. By `import`-ing the package, we simply make its functionality available in the current .py or .ipynb file we're working on.

### Why pandas?
- 2-D, table-like data structures are intuitive. Most people are used to seeing Excel tables, so they feel comfortable to work with. They are also what business people and other stakeholders are used to seeing.
- It's one of the quickest ways to "automate the boring stuff" and lets you add value to any team quickly. 
- It plays well with other packages & libraries like `scikit-learn` and `matplotlib`.
- Excellent tools for transforming & cleaning data (e.g. resampling, filling missing values).
- Sits on top of `numpy`, which can be fast if you vectorize calculations.
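
To make the last point concrete, here is a minimal sketch of a vectorized calculation next to the equivalent Python loop. The prices and the 1.2 multiplier are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, used only for illustration.
prices = pd.Series([10.0, 25.0, 40.0])

# Vectorized: one expression applied to the whole column at once.
with_tax_vectorized = prices * 1.2

# Equivalent element-by-element Python loop (usually much slower on large data).
with_tax_loop = pd.Series([p * 1.2 for p in prices])

# Both approaches give the same numbers.
print(np.allclose(with_tax_vectorized, with_tax_loop))
```

On three values the difference is invisible; the vectorized form tends to pay off as the number of rows grows.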

[For reference, you can use this great pandas cheat sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf).

### When Pandas?

In other iterations of this course, participants asked where Pandas fits in the DS/ML pipeline. Let's focus on that!

<img src="../pics/ds_path.png">

[source](https://towardsdatascience.com/how-it-feels-to-learn-data-science-in-2019-6ee688498029)

<img src="../pics/ds_pipeline.png">

[source](https://medium.com/dunder-data/how-to-learn-pandas-108905ab4955)

In [1]:
import pandas as pd

In [None]:
pd.__version__

## Python data types

A quick review of the basic data types in Python:

In [None]:
#float
float(16)

In [None]:
#int
int(16)

In [None]:
#str
str(16)

In [None]:
#bool
bool(16)

In [None]:
#bool
bool(0)

In [None]:
#bool
bool(None)

## Pandas Data Types

Pandas has 2 main data structures:

- pd.Series:
    - 1-D data structure. Different from a numpy array in that each value has a label (its index). Labels are usually unique, but they do not have to be.
    

- pd.DataFrame:
    - 2(+)-D data structure (better known as a table w/ named columns & numbered rows).
    
Here, we will focus on DataFrames:
- if you can do something in 2-D, you can also probably do it in 1-D.
- 2(+)-D data structures are much more useful to know how to work with in data science (you can't train a model with only a single array of data).
- the API of a Series and a DataFrame can differ slightly (e.g. `pd.Series.name` vs `pd.DataFrame.columns`).
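
As a small sketch of the two structures side by side, the example below builds one of each from made-up values (the column names `points` and `price` are borrowed from the wine data, the numbers are not).

```python
import pandas as pd

# Hypothetical data for illustration only.
points = pd.Series([87, 91, 85], name="points")
reviews = pd.DataFrame({"points": [87, 91, 85], "price": [15.0, 40.0, 12.0]})

print(points.name)            # a Series has a single name
print(list(reviews.columns))  # a DataFrame has one label per column

# Selecting one column of a DataFrame gives back a Series.
print(type(reviews.loc[:, "points"]))
```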

### Importing a DataFrame
Throughout this tutorial, we will use practice data sets. Here, we use a dataset of wine reviews.  

We use the `read_csv` function to load data from a CSV file into a DataFrame:

In [39]:
wine = pd.read_csv('data/wine_reviews/winemag-data_first150k.csv', index_col=0)
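
If you want to see what `index_col=0` does without the data file, the sketch below reads a tiny, made-up CSV from a string instead of from disk. The CSV text is an assumption that only mimics the layout of a file whose first column holds row labels.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk; the first, unnamed
# column holds the row labels.
csv_text = ",country,points\n0,US,96\n1,Spain,95\n2,France,94\n"

with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
without_index = pd.read_csv(io.StringIO(csv_text))

print(list(with_index.columns))     # ['country', 'points']
print(list(without_index.columns))  # ['Unnamed: 0', 'country', 'points']
```

Without `index_col=0`, the label column is read as ordinary data and pandas adds a fresh default index next to it.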

### Quickly viewing data

The DataFrame is a Python object, with useful methods (i.e. functions) and attributes (often referring to simpler Python data types like floats or tuples).  

You can tell if it is a method by `()` at the end.  A method will run code - an attribute will refer to data that already exists (most of the time - see the `@property` decorator for more).
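
A quick sketch of the difference, using a small made-up DataFrame:

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"points": [87, 91, 85], "price": [15.0, 40.0, 12.0]})

print(df.shape)    # attribute: no parentheses, gives a tuple
print(df.head(2))  # method: called with parentheses, can take arguments

# Leaving off the parentheses does not run the method; you get the method object.
print(callable(df.head))
```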

df - evaluating the DataFrame on its own, without any method or attribute, displays its representation (the output of its `__repr__` method). For a DataFrame, the Pandas developers chose to show the entire DataFrame (or up to a set number of rows, which you can configure with the `display.max_rows` option).

In [None]:
wine
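
The number of rows shown is controlled by a pandas display option. A minimal sketch, assuming a made-up 100-row DataFrame, that changes the limit only temporarily:

```python
import pandas as pd

# Hypothetical 100-row frame for illustration only.
df = pd.DataFrame({"n": range(100)})

default_rows = pd.get_option("display.max_rows")

# Temporarily show at most 10 rows; the setting reverts after the block.
with pd.option_context("display.max_rows", 10):
    truncated = repr(df)

print(truncated)
```

`pd.set_option("display.max_rows", 10)` would make the same change for the rest of the session.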

df.head() - shows the first n rows (5 by default) _in the order they were imported_! This matters because the data is not sorted, so the first row you see may not be the row you expect to come first.

In [None]:
wine.head()

df.tail() - shows the last n rows, in the order they were imported (below we limit it to three)

In [None]:
wine.tail(3)

df.shape - shows number of (rows, columns)

In [None]:
wine.shape

df.shape returns a tuple. We can use tuple indexes to quickly access just one of these.
**How many rows does our data have?**

In [None]:
wine.shape[0]

We can also quickly look at the data types of each of our columns:
- each column has a single data type (due to the data being represented as numpy arrays)
- pandas has figured out some of our columns' dtypes for us

In [None]:
wine.dtypes
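
The same idea on a small made-up DataFrame, as a sketch: pandas infers an integer dtype for whole numbers and a float dtype for a column with decimals, storing the missing value as `NaN`.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "points": [87, 91, 85],       # whole numbers
    "price": [15.0, None, 12.0],  # the missing value is stored as NaN
})

print(df.dtypes)
```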

Also useful is df.describe() - a method which generates summary statistics of the columns (only the numeric ones by default):

In [None]:
wine.describe()
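
By default `describe()` only summarizes numeric columns. A short sketch on made-up data, showing the default next to `include="all"`:

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "country": ["US", "Spain", "US"],
    "points": [90, 80, 100],
})

numeric_summary = df.describe()            # numeric columns only, by default
full_summary = df.describe(include="all")  # also summarizes non-numeric columns

print(numeric_summary)
print(full_summary)
```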

## What is a DataFrame made of?

Column names:

In [None]:
wine.columns

An index (similar to a primary key in SQL). NOTE: an index in pandas does NOT have to be unique!! This is a major difference between the index in pandas and a primary key in SQL.

In [None]:
wine.index
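
To see a non-unique index in action, here is a minimal sketch with made-up labels, where two rows share the label `"a"`:

```python
import pandas as pd

# Hypothetical data; the repeated label "a" is deliberate.
df = pd.DataFrame(
    {"points": [87, 91, 85]},
    index=["a", "a", "b"],
)

print(df.index.is_unique)  # False
print(df.loc["a"])         # both rows labelled "a"
```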

And data (stored in a `numpy` array):

In [None]:
wine.values

In [None]:
# getting just top row
wine.values[0]

In [None]:
# showing that data is stored as a numpy array 
type(wine.values[0])
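
The three parts can be seen together on a small made-up DataFrame. This sketch uses `to_numpy()`, which the pandas documentation recommends over `.values`; the index labels 10 and 20 are arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"points": [87, 91], "price": [15.0, 40.0]}, index=[10, 20])

print(list(df.columns))  # column names
print(list(df.index))    # row labels

values = df.to_numpy()   # the data itself, as a numpy array
print(type(values), values.shape)
```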

## Initial Data Exploration with Pandas
### Selecting Data
Anyone who knows SQL knows how important it is to be able to select particular columns or rows.
In Pandas, there are a lot of different ways to select.

### Selecting by column name
2 NOT RECOMMENDED ways:
- selecting with column name in quotes, directly in brackets 
- selecting with dot notation and column name (only works when column name has no spaces) 

Why are these not recommended? 
- not explicit (brackets can select rows OR columns depending on what you pass, which may cause issues: more about that [here](https://stackoverflow.com/questions/38886080/python-pandas-series-why-use-loc).)
- chaining these selections (e.g. `wine['price'][0] = ...`) may operate on a copy rather than on the original data, so an assignment can silently fail to change the DataFrame. More about this [here](https://realpython.com/pandas-settingwithcopywarning/)

In [None]:
# 1. With quotes
wine['country']

In [None]:
# 2. With period (NOTE - ONLY works when column names have no spaces!)
wine.country
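
For comparison, here is a minimal sketch of an assignment done in one explicit `.loc` call, on made-up data. Because the row and the column are named in the same call, pandas modifies the DataFrame itself rather than an intermediate selection.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"country": ["US", "Spain"], "price": [15.0, 40.0]})

# One explicit .loc call: row label 0, column "price".
df.loc[0, "price"] = 20.0

print(df)
```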

## Slicing - Precise Selection by Row and Column (Recommended)
- `df.loc` gets rows (or columns) with particular labels from the index.
- `df.iloc` gets rows (or columns) at particular positions in the index (so it only takes integers).


**When using `loc` and `iloc`, the row is ALWAYS first and the column is ALWAYS second!**

**[row:row, column:column]**

Notice how each defaults if you leave one side blank.

In [None]:
# selecting rows up to and including index label 5 (.loc slices include the end label)
wine.loc[:5]

In [None]:
# selecting the first 5 rows positionally (positions 0 to 4; the end position is excluded)
wine.iloc[:5]

In [None]:
# selecting first 5 rows AND columns positionally 
wine.iloc[:5, :5]

**Question**: What happens if we try to do this with `.loc`?

Selecting all rows in the column `price` with loc

In [None]:
wine.loc[:, 'price']

We can also chain methods together in Pandas:

In [None]:
wine.loc[:, 'price'].describe()

To confirm our understanding of how `.loc` and `.iloc` are different, we will change the index.

In [31]:
import random

In [32]:
# making a new dataframe where the index is a random scramble of the original index
random_index = random.sample(list(wine.index), k=len(wine))
wine.loc[:, 'scrambled_index'] = random_index
scrambled_wine = wine.set_index('scrambled_index')

In [None]:
# returns the first 5 rows by position (positions 0 to 4).
scrambled_wine.iloc[:5]

In [None]:
# returns all rows from the top through the row labelled 5 (inclusive). Since we scrambled the index, that row is no longer at its original position.
scrambled_wine.loc[:5]

Including columns:

In [None]:
scrambled_wine.iloc[:5, :5]
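
The same contrast fits in a few lines on a made-up DataFrame whose labels are deliberately out of order. This is only a sketch; the labels and values are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {"points": [87, 91, 85, 90]},
    index=[2, 0, 3, 1],  # labels deliberately out of order
)

by_position = df.iloc[:2]  # first two rows, wherever their labels are
by_label = df.loc[:3]      # every row from the top through label 3, inclusive

print(list(by_position.index))  # [2, 0]
print(list(by_label.index))     # [2, 0, 3]
```

`.iloc` counts rows and excludes the end position, while `.loc` looks up the label and includes it.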

## Exercise:
Select the first 5 rows and columns "points" and "region_1" of the *scrambled_wine* table.
Hint: you'll need to chain together 2 different types of selectors.

## Pandas + Matplotlib

Matplotlib is a library famous for having multiple ways to do one thing - partly due to its heritage that goes back to MATLAB.

Below is one way to use matplotlib with pandas that lets you draw multiple axes on a single figure.

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline

fig, axes = plt.subplots(2)

wine.iloc[:1000, :].loc[:, 'price'].plot(ax=axes[0])

wine.iloc[:1000, :].loc[:, 'points'].plot(ax=axes[1], kind='hist')
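
Outside a notebook the `%matplotlib inline` magic is not available. The sketch below follows the same pattern as a plain script, assuming a small made-up DataFrame and the non-interactive `Agg` backend; the axis labels are additions for readability.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"price": [15.0, 40.0, 12.0, 30.0], "points": [87, 91, 85, 90]})

fig, axes = plt.subplots(2)
df.loc[:, "price"].plot(ax=axes[0])                # line plot on the top axes
df.loc[:, "points"].plot(ax=axes[1], kind="hist")  # histogram on the bottom axes

axes[0].set_ylabel("price")
axes[1].set_xlabel("points")
fig.tight_layout()
```

Passing `ax=` tells pandas which axes to draw on, which is what lets two plots share one figure.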