# Pandas Overview
When finished with this notebook, we'll be ready for anything pandas.


In [ ]:
import pandas as pd


## 1. Creating, Reading and Writing
Pandas has two core objects, **DataFrame** and **Series**.

### DataFrame
A DataFrame is a table. It contains an array of individual *entries*, each of which has a certain *value*. Each entry corresponds to a row (or *record*) and a *column*.

For example, consider the following simple DataFrame:

In [ ]:
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})

Here's another example showing strings.

In [ ]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})

We are using the ```pd.DataFrame()``` constructor to generate these DataFrame objects. The syntax for declaring a new one is a dictionary whose keys are the column names (*Bob* and *Sue* in this example), and whose values are a list of entries. This is the standard way of constructing a new DataFrame, and the one you are most likely to encounter.


The dictionary-list constructor assigns values to the *column labels*, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the *row labels*. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.

The list of row labels used in a DataFrame is known as an **Index**. We can assign values to it by using an ```index``` parameter in our constructor:

In [ ]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
              'Sue': ['Pretty good.', 'Bland.']},
              index=['Product A', 'Product B'])

### Series
A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. 

In [ ]:
pd.Series([1, 2, 3, 4, 5])

A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an ```index``` parameter. However, a Series does not have a column name; it only has one overall ```name```:

In [ ]:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

The Series and the DataFrame are intimately related. It's helpful to think of a DataFrame as actually being just a bunch of Series "glued together". We'll see more of this below.
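As a minimal sketch of the "glued together" idea, the snippet below builds two Series by hand and joins them side by side with *pd.concat*. The variable names (*bob*, *sue*, *reviews*) are made up for illustration and reuse the toy review data from earlier.

```python
import pandas as pd

# Two Series that share the same row labels (index)
labels = ['Product A', 'Product B']
bob = pd.Series(['I liked it.', 'It was awful.'], index=labels, name='Bob')
sue = pd.Series(['Pretty good.', 'Bland.'], index=labels, name='Sue')

# Glue them together side by side; each Series name becomes a column label
reviews = pd.concat([bob, sue], axis='columns')

# Pulling a single column back out gives us a Series again
bob_column = reviews['Bob']
print(type(bob_column).__name__)  # Series
print(bob_column.equals(bob))     # True
```

Going the other way, selecting one column of any DataFrame hands back a Series whose ```name``` is the column label.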

### Reading data files
Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won't actually be creating our own data by hand. Instead, we'll be working with data that already exists.

Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file you get something that looks like this:
```
Product A, Product B, Product C
30,21,9
35,34,1
41,11,11
```
Let's now set aside our toy datasets and read a real dataset into a DataFrame. We'll use ```pd.read_csv()``` to do this.

In [ ]:
ign_scores = pd.read_csv("./datasets/data-vis/ign_scores.csv")

We can use the ```shape``` attribute to check how large a DataFrame is, and the ```head()``` method to peek at the first five rows.

In [ ]:
print(ign_scores.shape)
ign_scores.head()

The ```pd.read_csv()``` function is highly configurable, with over 30 optional parameters. For example, ```index_col``` lets you specify which column to use as the index.
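To see what *index_col* changes, here is a small sketch that reads the same CSV text twice. It uses an in-memory string instead of a file so it runs anywhere; the *Year* column is an assumption added to the toy product data for illustration and is not part of the IGN dataset.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (toy data, with a made-up Year column)
csv_text = """Year,Product A,Product B,Product C
2015,30,21,9
2016,35,34,1
2017,41,11,11
"""

# Without index_col, pandas adds a default 0, 1, 2, ... index
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, the named column becomes the row labels instead
year_index = pd.read_csv(io.StringIO(csv_text), index_col='Year')

print(default_index.shape)  # (3, 4)
print(year_index.shape)     # (3, 3)
```

Note that the column used as the index no longer counts toward the number of columns in ```shape```.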

## 2. Indexing, Selecting, & Assigning
Let's go over an important part of any data work: accessing our data.

### Native accessors
Native Python objects provide good ways of indexing data, which pandas carries over to its own objects. Consider the data:

In [ ]:
import pandas as pd
# Import data
import my_modules.data_imports as data
wine_data = data.import_wine_data()

# Peek the data
wine_data

In Python, we can access the property of an object by accessing it as an attribute. A *book* object, for example, might have a *title* property, which we can access by calling ```book.title```. Columns in a pandas DataFrame work in much the same way.

Hence, to access the *country* column of ```wine_data``` we can use:

In [ ]:
wine_data.country

If we have a Python dictionary, we can access its values using the indexing ```([])``` operator. We can do the same with columns in a DataFrame:

In [ ]:
wine_data['country']

### Indexing in pandas
The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem, making them easy to pick up and use. However, pandas has its own accessor operators, ```loc``` and ```iloc```. For more advanced operations, these are the ones we should use.

#### Index-based selection
Pandas indexing works in one of two paradigms. The first is **index-based selection**: selecting data based on its numerical position in the data. iloc follows this paradigm.

To select the first row of data in a DataFrame, we may use the following:

In [ ]:
wine_data.iloc[0]

Both ```loc``` and ```iloc``` are row-first, column-second. This is the opposite of the native accessors above, where we pick the column first and then the row (for example, ```wine_data['country'][0]```).

This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns.
##### iloc
To get a column with iloc, we can do the following:

In [ ]:
wine_data.iloc[:, 0]

On its own, the ```:``` operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values. For example, to select the *country* column from just the first, second, and third row, we would do:

In [ ]:
wine_data.iloc[:3, 0]

Finally, it's worth knowing that negative numbers can be used in selection. These count backwards from the end of the values. So, for example, here are the last five rows of the dataset.

In [ ]:
wine_data.iloc[-5:]

#### Label-based selection
The second paradigm for attribute selection is the one followed by the loc operator: **label-based selection**. In this paradigm, it's the data index value, not its position, which matters.

##### loc
For example, to get the first entry in ```wine_data```, we would now do the following:

In [ ]:
# get the country of row 0
wine_data.loc[0, 'country']

```iloc``` is conceptually simpler than ```loc``` because it ignores the dataset's indices. When we use ```iloc``` we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. ```loc```, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it's usually easier to do things using loc instead. For example, here's one operation that's much easier using loc:

In [ ]:
# get all data in the following columns
wine_data.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]

#### Difference between ```loc``` and ```iloc```
There's one main difference between ```loc``` and ```iloc``` and that's the way they handle their indexing schemas.

```iloc``` uses the Python ```stdlib``` indexing scheme, where the first element of the range is included and the last one excluded. So ```0:10``` will select entries ```0,...,9```. ```loc```, meanwhile, indexes inclusively. So ```0:10``` will select entries ```0,...,10```.
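The difference is easy to check on a toy Series. The sketch below uses made-up data (*numbers* and *letters* are illustrative names) and slices the same range with both accessors.

```python
import pandas as pd

# Toy Series with the default 0..19 index (illustrative data only)
numbers = pd.Series(range(100, 120))

by_position = numbers.iloc[0:10]  # positions 0..9, end excluded
by_label = numbers.loc[0:10]      # labels 0..10, end included

print(len(by_position))  # 10
print(len(by_label))     # 11

# The inclusive rule is what makes label slices natural for non-numeric indices
letters = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
a_to_c = letters.loc['a':'c']
print(list(a_to_c.index))  # ['a', 'b', 'c']
```

This is especially easy to trip over when the index happens to be a simple list of integers, because the two slices look identical but return a different number of rows.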

### Manipulating the index
Label-based selection derives its power from the labels in the index. Critically, the index we use is not immutable. We can change it in any way we see fit, or set it afterwards if it wasn't set during ```read_csv```.

The ```set_index()``` method can be used to do the job. Here is what happens when we set the index to the *title* field:


In [ ]:
wine_data.set_index('title')
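One detail worth seeing in action: by default *set_index* returns a new DataFrame rather than changing the original. The sketch below uses a tiny stand-in for the wine data (the rows and the names *wines* and *by_title* are invented for illustration).

```python
import pandas as pd

# Small stand-in for wine_data (rows invented for illustration)
wines = pd.DataFrame({
    'title': ['Wine One 2013', 'Wine Two 2011', 'Wine Three 2012'],
    'country': ['Italy', 'Portugal', 'US'],
    'points': [87, 87, 90],
})

# set_index returns a new DataFrame; `wines` keeps its 0, 1, 2 index
by_title = wines.set_index('title')

# Rows can now be looked up by title with loc
print(by_title.loc['Wine Two 2011', 'country'])  # Portugal
print(list(wines.index))                         # [0, 1, 2]
```

To keep the new index, assign the result back to a variable, as done with *by_title* here.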

### Conditional selection
We can use conditional statements inside of ```loc``` for more interesting ways of selecting data.

In [ ]:
wine_data.loc[wine_data.country == 'Italy']

The above statement pulled ~20,000 rows, while originally there were ~130,000. About 15% of the wines in the dataset come from Italy!

Now let's find the highly reviewed wines from Italy. Wine is reviewed on an 80-100 point scale, so let's find wines that scored at least 90.

In [ ]:
wine_data.loc[(wine_data.country == 'Italy') & (wine_data.points >= 90)]
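It can help to look at what the conditions themselves produce. In this sketch (toy data; *wines* and the mask names are illustrative), each comparison yields a boolean Series, and the masks are combined with ```&``` ("and") or with ```|``` ("or"), an operator not used above.

```python
import pandas as pd

# Toy stand-in for wine_data (values invented for illustration)
wines = pd.DataFrame({
    'country': ['Italy', 'Portugal', 'Italy', 'France'],
    'points': [87, 91, 92, 95],
})

# A comparison on a column yields a boolean Series, one True/False per row
is_italian = wines.country == 'Italy'
is_high_scoring = wines.points >= 90

# & means "and", | means "or"; loc keeps the rows where the mask is True
italian_and_high = wines.loc[is_italian & is_high_scoring]
italian_or_high = wines.loc[is_italian | is_high_scoring]

print(len(italian_and_high))  # 1
print(len(italian_or_high))   # 4
```

When the conditions are written inline, each one needs its own parentheses, because ```&``` and ```|``` bind more tightly than comparison operators such as ```==``` and ```>=```.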

Pandas also comes with a few useful built-in conditional selectors, two of which are:

#### ```isin```
```isin``` lets you select data whose value "is in" a list of values.


In [ ]:
wine_data.loc[wine_data.country.isin(['Italy', 'France'])]

#### ```isnull``` & ```notnull``` 
```isnull``` (and its companion ```notnull```) let us highlight values which are (or are not) NaN.


In [ ]:
wine_data.loc[wine_data.price.notnull()]

## 3. Summary Functions and Maps
In this section, we'll work on getting our data in the right "shape". Let's go ahead and get some summaries of our data.

### Summary Functions
First, let's consider the ```describe()``` method, which gives us a high-level summary of the attributes of a given column. Note that ```describe()``` is type-aware and will change its output based on the data type of the input.

In [ ]:
wine_data.points.describe()

In [ ]:
wine_data.taster_name.describe()

Here are some more useful commands:
- ```mean()```: returns the mean of the specified column
- ```unique()```: returns all unique values of the specified column
- ```value_counts()```: returns all unique values *and* how often they occur in the dataset for the specified column

In [ ]:
wine_data.taster_name.value_counts()
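For a quick feel of the other two commands, here is a small sketch on hand-made Series. The values and the names *tasters* and *points* are invented for illustration and are not taken from the wine data.

```python
import pandas as pd

# Hand-made columns (values invented for illustration)
tasters = pd.Series(['Roger Voss', 'Paul Gregutt', 'Roger Voss', 'Alexander Peartree'])
points = pd.Series([87, 90, 92, 87])

# mean() collapses a numeric column to a single number
print(points.mean())  # 89.0

# unique() returns each distinct value once, in order of first appearance
unique_tasters = tasters.unique()
print(list(unique_tasters))

# value_counts() returns the distinct values together with their frequencies
counts = tasters.value_counts()
print(counts['Roger Voss'])  # 2
```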

### Maps
Maps are very useful for transforming our data. Pandas comes with a few different mapping methods, but let's look at the two most useful.

#### ```map()```
This is your good ol' basic map function: it transforms each value in the specified column using the given function (here, a lambda).


In [35]:
point_mean = wine_data.points.mean()
wine_data.points.map(lambda p: p - point_mean)

0        -1.447138
1        -1.447138
2        -1.447138
3        -1.447138
4        -1.447138
            ...   
129966    1.552862
129967    1.552862
129968    1.552862
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

#### ```apply()```
This is the DataFrame equivalent of ```map()```: it takes the supplied function and, with ```axis='columns'```, applies it to each row.

In [36]:
def remean_points(row):
    row.points = row.points - point_mean
    return row

wine_data.apply(remean_points, axis='columns')

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,-1.447138,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,-1.447138,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,-1.447138,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,-1.447138,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,-1.447138,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129966,Germany,Notes of honeysuckle and cantaloupe sweeten th...,Brauneberger Juffer-Sonnenuhr Spätlese,1.552862,28.0,Mosel,,,Anna Lee C. Iijima,,Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ...,Riesling,Dr. H. Thanisch (Erben Müller-Burggraef)
129967,US,Citation is given as much as a decade of bottl...,,1.552862,75.0,Oregon,Oregon,Oregon Other,Paul Gregutt,@paulgwine,Citation 2004 Pinot Noir (Oregon),Pinot Noir,Citation
129968,France,Well-drained gravel soil gives this wine its c...,Kritt,1.552862,30.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Gresser 2013 Kritt Gewurztraminer (Als...,Gewürztraminer,Domaine Gresser
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,1.552862,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss


If ```apply()``` is called with ```axis='index'```, then instead of passing a function to transform each row, we would need to give a function to transform each *column*.

**Note:** Both ```map()``` and ```apply()``` return new, transformed Series/DataFrames, leaving the original intact.
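As a rough sketch of the column-wise case, the function below receives one whole column at a time and subtracts that column's own mean. The DataFrame *scores* and the helper *center* are hypothetical names with made-up values; the last line also checks that the original is left intact.

```python
import pandas as pd

# Toy numeric table (values invented for illustration)
scores = pd.DataFrame({'points': [87, 90, 93], 'price': [15.0, 14.0, 65.0]})

def center(column):
    # `column` is a whole Series, e.g. every value in 'points'
    return column - column.mean()

# axis='index' hands the function each column in turn
centered = scores.apply(center, axis='index')

print(list(centered['points']))  # [-3.0, 0.0, 3.0]
print(list(scores['points']))    # [87, 90, 93], the original is unchanged
```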

## 4. Grouping and Sorting


## 5. Data Types and Missing Values


## 6. Renaming and Combining
