# What is Pandas?

Pandas is a [package](https://data36.com/python-import-built-in-modules-data-science/) for Python, used for data formatting, analysis and manipulation. It gives you a way to work with 2-D data structures (like SQL or Excel tables) in Python. These structures aren't "native" to the language, which is why we have to `import` the package. If you installed Python through Anaconda, Pandas is already installed, so there is nothing extra to download. `import`-ing the package simply makes its functionality available in the .py or .ipynb file we're working on.

### Why Pandas?
- 2-D table-like data structures are intuitive. Everyone is used to seeing Excel tables, so they feel comfortable to work with. They are also what business people and other stakeholders are used to seeing.
- It's one of the quickest ways to "automate the boring stuff" and lets you add value to any team quickly.
- It plays well with other packages & libraries like scikit-learn.


For reference, you can use this great pandas [cheat sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf).

### When Pandas?

In other iterations of this course, participants asked where Pandas fits in the DS/ML pipeline. Let's focus on that!

<img src="pics/ds_path.png">

[source](https://towardsdatascience.com/how-it-feels-to-learn-data-science-in-2019-6ee688498029)

<img src="pics/ds_pipeline.png">

[source](https://medium.com/dunder-data/how-to-learn-pandas-108905ab4955)

In [1]:
import pandas as pd

In [2]:
pd.__version__

'0.24.2'

## Pandas Data Types

Pandas has 2 main data types:
- Series:
    - 1-D data structure. Different from a NumPy array in that each value carries a label from the index.

- DataFrame:
    - 2(+)-D data structure (better known as a table w/ named columns & numbered rows).

Here, we will focus on DataFrames, because if you can do something in 2-D, you can probably also do it in 1-D! Also, 2(+)-D data structures are much more useful to know how to work with in data science (you usually can't train a model on just a single array of data).
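
To make the difference between the two types concrete, here is a minimal sketch. The values and the labels `"a"`, `"b"`, `"c"` are made up for illustration and are not taken from the wine dataset.

```python
import pandas as pd

# A Series: 1-D values plus an index of labels (hypothetical sample data).
prices = pd.Series([235.0, 110.0, 90.0], index=["a", "b", "c"], name="price")

# A DataFrame: a table whose columns are Series that share one index.
wines = pd.DataFrame(
    {"price": [235.0, 110.0, 90.0], "points": [96, 96, 96]},
    index=["a", "b", "c"],
)

print(prices["b"])           # look up a value by its label
print(type(wines["price"]))  # selecting one column gives back a Series
print(wines.shape)           # (rows, columns)
```

Selecting a single column of a DataFrame hands you a Series, which is why most of what you learn in 2-D carries over to 1-D.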

### Importing a DataFrame
Throughout this tutorial, we will use practice data sets. 
Here, we use a dataset of wine reviews.

In [3]:
wine = pd.read_csv('data/wine_reviews/winemag-data_first150k.csv', index_col=0)
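
The `index_col=0` argument tells `read_csv` to use the first column of the file as the row index instead of as a regular column. The sketch below shows the effect on a tiny in-memory CSV. The CSV text is a hypothetical stand-in for the wine file, assuming (as the call above suggests) that its first column is an unnamed row number.

```python
import io
import pandas as pd

# Hypothetical stand-in for the wine file: the first column has no header
# because it is a row index that was written out when the CSV was saved.
csv_text = ",country,points\n0,US,96\n1,Spain,96\n2,France,95\n"

with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
without_index = pd.read_csv(io.StringIO(csv_text))

print(list(with_index.columns))     # ['country', 'points']
print(list(without_index.columns))  # ['Unnamed: 0', 'country', 'points']
```

Without `index_col=0`, the old row numbers would show up as an extra data column.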

### Quickly viewing data
`df` - calling the DataFrame by name shows the entire thing or, for large tables, a truncated view limited to a set number of rows. You can configure this number through pandas' display options (for example, `display.max_rows`).

`df.head()` - shows the first 5 rows _in the order they were imported_! This is important because the data is not sorted. So, the first row you see may not be the "actual" first row, or the first row you expect to see.

`df.tail()` - shows the last 5 rows, again in the order they were imported.

All 3 are useful to quickly get a feel for your data.
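
Both `head` and `tail` accept a number of rows, and the display limit can be changed from code. The following is a small sketch on a made-up 100-row table. The value `20` is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({"n": range(100)})  # hypothetical toy data

first_three = df.head(3)  # pass a number to override the default of 5
last_two = df.tail(2)

# One way to control how many rows are printed before pandas truncates.
pd.set_option("display.max_rows", 20)
print(pd.get_option("display.max_rows"))
pd.reset_option("display.max_rows")  # restore the default
```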

In [4]:
wine

,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
5,Spain,"Deep, dense and pure from the opening bell, th...",Numanthia,95,73.0,Northern Spain,Toro,,Tinta de Toro,Numanthia
6,Spain,Slightly gritty black-fruit aromas include a s...,San Román,95,65.0,Northern Spain,Toro,,Tinta de Toro,Maurodos
7,Spain,Lush cedary black-fruit aromas are luxe and of...,Carodorum Único Crianza,95,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
8,US,This re-named vineyard was formerly bottled as...,Silice,95,65.0,Oregon,Chehalem Mountains,Willamette Valley,Pinot Noir,Bergström
9,US,The producer sources from two blocks of the vi...,Gap's Crown Vineyard,95,60.0,California,Sonoma Coast,Sonoma,Pinot Noir,Blue Farm


`df.shape` - shows the number of (rows, columns)

In [5]:
wine.shape

(150930, 10)

`df.shape` is a tuple, so we can use tuple indexes to quickly access just one of its values.

**How many rows does our data have?**

In [6]:
wine.shape[0]

150930
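
As a small aside, tuple unpacking and `len()` are two other common ways to get at these numbers. This sketch uses a made-up three-row table rather than the wine data.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})  # hypothetical toy data

n_rows, n_cols = df.shape  # unpack the tuple into two names
print(n_rows, n_cols)
print(len(df))             # len() of a DataFrame is also its row count
```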

We can also quickly look at the data types of each of our columns.

In [7]:
wine.dtypes

country         object
description     object
designation     object
points           int64
price          float64
province        object
region_1        object
region_2        object
variety         object
winery          object
dtype: object
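
One detail worth noticing in the output above is that `points` is an integer column while `price` is a float column. A likely reason is that `price` contains missing values (`NaN`), which are stored as floats. The sketch below, on made-up data, shows this and also how to pick out only the numeric columns with `select_dtypes`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing price.
df = pd.DataFrame({
    "points": [96, 95, 90],
    "price": [235.0, np.nan, 15.0],
    "country": ["US", "France", "Spain"],
})

print(df.dtypes)

# Keep only the columns that hold numbers.
numeric = df.select_dtypes(include="number")
print(list(numeric.columns))  # ['points', 'price']
```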

## Initial Data Exploration with Pandas
### Selecting Data
Anyone who knows SQL knows how important it is to be able to select particular columns or rows.
In Pandas, there are a lot of different ways to select.

### Selecting by column name
2 ways:

In [8]:
# 1. With quotes
wine['price']

0         235.0
1         110.0
2          90.0
3          65.0
4          66.0
5          73.0
6          65.0
7         110.0
8          65.0
9          60.0
10         80.0
11         48.0
12         48.0
13         90.0
14        185.0
15         90.0
16        325.0
17         80.0
18        290.0
19         75.0
20         24.0
21         79.0
22        220.0
23         60.0
24         45.0
25         57.0
26         62.0
27        105.0
28         60.0
29         60.0
          ...  
150900     13.0
150901     12.0
150902     10.0
150903      7.0
150904     10.0
150905     13.0
150906     65.0
150907     52.0
150908     65.0
150909     52.0
150910     38.0
150911     37.0
150912     65.0
150913     30.0
150914     25.0
150915     30.0
150916     65.0
150917     30.0
150918     38.0
150919     37.0
150920     19.0
150921     38.0
150922      NaN
150923     30.0
150924     70.0
150925     20.0
150926     27.0
150927     20.0
150928     52.0
150929     15.0
Name: price, Length: 150930, dtype: float64

In [9]:
# 2. With a period (NOTE - ONLY works when the column name is a valid Python identifier, e.g. has no spaces, and does not clash with a DataFrame attribute or method!)
wine.price

0         235.0
1         110.0
2          90.0
3          65.0
4          66.0
5          73.0
6          65.0
7         110.0
8          65.0
9          60.0
10         80.0
11         48.0
12         48.0
13         90.0
14        185.0
15         90.0
16        325.0
17         80.0
18        290.0
19         75.0
20         24.0
21         79.0
22        220.0
23         60.0
24         45.0
25         57.0
26         62.0
27        105.0
28         60.0
29         60.0
          ...  
150900     13.0
150901     12.0
150902     10.0
150903      7.0
150904     10.0
150905     13.0
150906     65.0
150907     52.0
150908     65.0
150909     52.0
150910     38.0
150911     37.0
150912     65.0
150913     30.0
150914     25.0
150915     30.0
150916     65.0
150917     30.0
150918     38.0
150919     37.0
150920     19.0
150921     38.0
150922      NaN
150923     30.0
150924     70.0
150925     20.0
150926     27.0
150927     20.0
150928     52.0
150929     15.0
Name: price, Length: 150930, dtype: float64
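
To see where the period notation breaks down, here is a short sketch on a made-up table. The column names `unit price` and `count` are hypothetical and chosen only to trigger the two problem cases.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 20.0],
    "unit price": [1.0, 2.0],  # contains a space
    "count": [3, 4],           # same name as the DataFrame.count method
})

print(df.price.tolist())          # works: plain identifier, no clash
print(df["unit price"].tolist())  # brackets handle any column name
print(callable(df.count))         # True: this is the method, not the column
print(df["count"].tolist())       # brackets return the column
```

Because of these edge cases, the quoted form is the safer default in scripts, while the period form is handy for quick exploration.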

We can also chain together methods (commands) in Pandas:

In [10]:
wine.price.head()

0    235.0
1    110.0
2     90.0
3     65.0
4     66.0
Name: price, dtype: float64

## Slicing - Precise Selection by Row and Column
- `df.loc` gets rows (or columns) with particular labels from the index.
- `df.iloc` gets rows (or columns) at particular positions in the index (so it only takes integers).


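**When using `loc` and `iloc`, the row selector ALWAYS comes first and the column selector ALWAYS comes second!**

**[row:row, column:column]**

Notice that leaving one side of a slice blank defaults to the first or last row (or column), and leaving out the column selector returns all columns.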

In [11]:
wine.iloc[:5]

,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


In [12]:
# .loc selects by label and includes the end label, so this returns rows 0 through 5.
wine.loc[:5]

,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
5,Spain,"Deep, dense and pure from the opening bell, th...",Numanthia,95,73.0,Northern Spain,Toro,,Tinta de Toro,Numanthia


To see how these two are different, we will change the index.

In [13]:
import random

In [14]:
random_index = random.sample(list(wine.index), k=len(list(wine.index)))
wine['scrambled_index'] = random_index
scrambled_wine = wine.set_index('scrambled_index')
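
The shuffle above gives a different index every time the cell runs. If a repeatable result is wanted, one option is a seeded NumPy generator. The sketch below is an alternative under that assumption, not the approach used in this notebook. The toy table and the seed value `0` are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the wine table.
df = pd.DataFrame(
    {"variety": ["Pinot Noir", "Tinta de Toro", "Sauvignon Blanc", "Cabernet Sauvignon"]}
)

rng = np.random.default_rng(seed=0)  # the seed value is an arbitrary choice
shuffled_labels = rng.permutation(df.index.to_numpy())

# Work on a copy so the original table keeps its index and gains no extra column.
scrambled = df.copy()
scrambled.index = shuffled_labels

print(scrambled)
```

As in the cell above, only the labels are shuffled. The rows themselves stay in their original order.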

In [15]:
# returns the first 5 rows by position (positions 0 through 4; the end position is excluded).
scrambled_wine.iloc[:5]

scrambled_index,country,description,designation,points,price,province,region_1,region_2,variety,winery
56424,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
71353,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
64812,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
79646,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
91211,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


In [16]:
# returns all rows up to and including the row labeled 5. Since we scrambled the index, that label is no longer at the 5th position.
scrambled_wine.loc[:5]

scrambled_index,country,description,designation,points,price,province,region_1,region_2,variety,winery
56424,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
71353,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
64812,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
79646,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
91211,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
42040,Spain,"Deep, dense and pure from the opening bell, th...",Numanthia,95,73.0,Northern Spain,Toro,,Tinta de Toro,Numanthia
125364,Spain,Slightly gritty black-fruit aromas include a s...,San Román,95,65.0,Northern Spain,Toro,,Tinta de Toro,Maurodos
80268,Spain,Lush cedary black-fruit aromas are luxe and of...,Carodorum Único Crianza,95,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
146605,US,This re-named vineyard was formerly bottled as...,Silice,95,65.0,Oregon,Chehalem Mountains,Willamette Valley,Pinot Noir,Bergström
72238,US,The producer sources from two blocks of the vi...,Gap's Crown Vineyard,95,60.0,California,Sonoma Coast,Sonoma,Pinot Noir,Blue Farm
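
The same contrast is easier to follow on a table small enough to check by eye. This is a minimal sketch with a made-up four-row table whose index labels are deliberately out of order.

```python
import pandas as pd

# Hypothetical toy data with an out-of-order integer index.
df = pd.DataFrame(
    {"variety": ["A", "B", "C", "D"]},
    index=[3, 0, 2, 1],
)

by_position = df.iloc[:2]  # positions 0 and 1; the end position is excluded
by_label = df.loc[:2]      # from the start through the row labeled 2; the end label is included

print(by_position.index.tolist())  # [3, 0]
print(by_label.index.tolist())     # [3, 0, 2]
```

So `iloc` counts rows, while `loc` walks down the table until it reaches the requested label.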


Including columns:

In [17]:
scrambled_wine.iloc[:5, :5]

scrambled_index,country,description,designation,points,price
56424,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0
71353,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0
64812,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0
79646,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0
91211,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0


**Question:**
What happens when you try to use `.loc` with the column selectors above, instead of `.iloc`?

With `df.loc`, we can also select columns by name.

**Question:** Why doesn't the same work with `.iloc`?

In [18]:
wine.loc[:5,['variety', 'price']]

,variety,price
0,Cabernet Sauvignon,235.0
1,Tinta de Toro,110.0
2,Sauvignon Blanc,90.0
3,Pinot Noir,65.0
4,Provence red blend,66.0
5,Tinta de Toro,73.0
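
As a hint for the questions above, the sketch below selects the same columns once by label and once by position on a made-up table. It also shows `columns.get_loc`, which translates a column name into its position. The exact exception type raised by `iloc` may differ between pandas versions, so the example catches several.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "variety": ["Pinot Noir", "Tinta de Toro"],
    "points": [96, 95],
    "price": [65.0, 73.0],
})

by_label = df.loc[:, ["variety", "price"]]  # column labels
by_position = df.iloc[:, [0, 2]]            # column positions

# Translating a label into a position lets iloc reach the same column.
price_pos = df.columns.get_loc("price")
same_column = df.iloc[:, price_pos]

iloc_rejected_labels = False
try:
    df.iloc[:, ["variety", "price"]]  # labels are not valid positions
except (IndexError, TypeError, ValueError, KeyError) as err:
    iloc_rejected_labels = True
    print(type(err).__name__)
```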


## Exercise:
Select the first 5 rows and the columns "points" and "region_1" of the *scrambled_wine* table.

Hint: you'll need to chain together 2 different types of selectors.

In [21]:
scrambled_wine.iloc[:5][['points', 'region_1']]

scrambled_index,points,region_1
56424,96,Napa Valley
71353,96,Toro
64812,96,Knights Valley
79646,96,Willamette Valley
91211,95,Bandol
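
The chained answer is one of several possible solutions. The sketch below shows two single-call alternatives on a made-up table that borrows a few values from the output above. It selects the first 2 rows instead of 5 only to keep the example small.

```python
import pandas as pd

# Hypothetical toy data with a scrambled-looking index.
df = pd.DataFrame(
    {
        "points": [96, 96, 95],
        "price": [235.0, 110.0, 66.0],
        "region_1": ["Napa Valley", "Toro", "Bandol"],
    },
    index=[56424, 71353, 91211],
)

# Chained version, as in the exercise answer.
chained = df.iloc[:2][["points", "region_1"]]

# Single iloc call: convert the column labels to positions first.
col_positions = df.columns.get_indexer(["points", "region_1"])
single_iloc = df.iloc[:2, col_positions]

# Single loc call: convert the row positions to labels first.
single_loc = df.loc[df.index[:2], ["points", "region_1"]]

print(single_iloc)
```

All three return the same table. The single-call forms are worth knowing because chained selection can cause trouble when you later want to assign values to the selection.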
