# Intro

Below is a list of functions and features you will most likely use on this assignment. These are not the only functions you can use, and each of them can be used in more elaborate ways than shown here, but this wordbank focuses on the basics.

Each example uses a `pets.csv` file, shown below, whose contents we have read into a `DataFrame` called `data`:

In [1]:
import pandas as pd

In [2]:
with open('pets.csv') as f:
    print(f.read())

name,age,species
Fido,4,dog
Meowrty,6,cat
Chester,1,dog
Phil,1,axolotl


In [3]:
data = pd.read_csv('pets.csv')
data

      name  age  species
0     Fido    4      dog
1  Meowrty    6      cat
2  Chester    1      dog
3     Phil    1  axolotl
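
If you want to follow along without creating `pets.csv` on disk, the sketch below builds the same `DataFrame` from a string. The variable name `csv_text` is made up for this illustration; `read_csv` itself accepts any file-like object.

```python
import io

import pandas as pd

# Same contents as pets.csv, held in a string so no file is needed.
# (csv_text is a hypothetical name used only in this sketch.)
csv_text = """name,age,species
Fido,4,dog
Meowrty,6,cat
Chester,1,dog
Phil,1,axolotl
"""

# read_csv accepts a file-like object as well as a path.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)          # (number of rows, number of columns)
print(list(data.columns))  # column names taken from the header row
```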


# Get a column of a `DataFrame`
Usually you want to look at the values in one column of a `DataFrame`. Getting a column will return a `Series`.

In [4]:
data['name']

0       Fido
1    Meowrty
2    Chester
3       Phil
Name: name, dtype: object

You can ask for more than one column by passing a list of column names, and it will return a `DataFrame`.

In [5]:
data[['age', 'name']]

   age     name
0    4     Fido
1    6  Meowrty
2    1  Chester
3    1     Phil


# Get a row of a `DataFrame` (`loc`)

To get a particular row of a `DataFrame`, you should use the `loc` property.

In [6]:
data.loc[1]

name       Meowrty
age              6
species        cat
Name: 1, dtype: object

You can use a slice to get more than one row.

Warning: Slice semantics for accessing rows with `loc` are different from standard Python slice semantics. With `loc`, the result includes the stop point. A bit annoying, but you get used to it.

In [7]:
data.loc[1:3]

      name  age  species
1  Meowrty    6      cat
2  Chester    1      dog
3     Phil    1  axolotl
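
To make the warning concrete, the sketch below compares `loc` with `iloc`, a separate position-based indexer in `pandas` that this wordbank does not otherwise cover. It is only meant to show the contrast; the variable names are made up for this illustration.

```python
import pandas as pd

# Rebuild the pets data inline so this sketch is self-contained.
data = pd.DataFrame({
    'name': ['Fido', 'Meowrty', 'Chester', 'Phil'],
    'age': [4, 6, 1, 1],
    'species': ['dog', 'cat', 'dog', 'axolotl'],
})

# loc slices by index label and INCLUDES the stop label: rows 1, 2 and 3.
rows_by_label = data.loc[1:3]

# iloc slices by position like a Python list and EXCLUDES the stop: rows 1 and 2.
rows_by_position = data.iloc[1:3]

print(len(rows_by_label))
print(len(rows_by_position))
```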


# Filtering
You will usually want to select certain rows of a `DataFrame` by some condition. You can use comparison operators on a `Series` or `DataFrame`, and `pandas` will apply the operator element-wise. For example:

In [8]:
data['age'] >= 2

0     True
1     True
2    False
3    False
Name: age, dtype: bool

This boolean `Series` (boolean mask) can be used to index into the `DataFrame` and it will return the subset of the rows which contain true values in the mask.

In [9]:
data[data['age'] >= 2]

      name  age  species
0     Fido    4      dog
1  Meowrty    6      cat


You can also combine these masks with logical operators to apply several filters at once. With multiple filter conditions, you almost always need parentheses around each condition, because `&` and `|` bind more tightly than comparisons such as `<` and `==`.

In [10]:
data[(data['age'] < 5) & (data['species'] == 'dog')]

      name  age  species
0     Fido    4      dog
2  Chester    1      dog


Notice that we used `&` instead of `and` for the logical operator. For `pandas` filters, use the following symbols:

Logical operator | Symbol for `pandas`
---------------------------|----------
and                       | &
or                          | &#124;
not                        | ~
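
The example above only used `&`, so here is a short sketch of the other two symbols from the table. The variable names `young_or_cat` and `not_dog` are made up for this illustration.

```python
import pandas as pd

# Rebuild the pets data inline so this sketch is self-contained.
data = pd.DataFrame({
    'name': ['Fido', 'Meowrty', 'Chester', 'Phil'],
    'age': [4, 6, 1, 1],
    'species': ['dog', 'cat', 'dog', 'axolotl'],
})

# "or": keep rows where at least one condition holds.
young_or_cat = data[(data['age'] < 2) | (data['species'] == 'cat')]

# "not": flip every True/False value in the mask.
not_dog = data[~(data['species'] == 'dog')]

print(young_or_cat['name'].tolist())
print(not_dog['name'].tolist())
```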


# Loop over Series

You should rarely need this, since most of the time you want to use the functions described in this document, but sometimes you will need to loop over a `Series` so that you can print it out or transform it into some other data type. You can use a `for` loop to access the values in the `Series`.

In [11]:
for value in data['name']:
    print(value)

Fido
Meowrty
Chester
Phil


You can also access the "index" of the `Series` using the `index` property.

In [12]:
for i in data['name'].index:
    print('Index val:', i, 'Value:', data['name'][i])

Index val: 0 Value: Fido
Index val: 1 Value: Meowrty
Index val: 2 Value: Chester
Index val: 3 Value: Phil
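
As an alternative to looking each value up by its index, a `Series` also has an `items()` method that yields index and value together. The sketch below collects the pairs into a list; the name `pairs` is made up for this illustration.

```python
import pandas as pd

# Rebuild the pets data inline so this sketch is self-contained.
data = pd.DataFrame({
    'name': ['Fido', 'Meowrty', 'Chester', 'Phil'],
    'age': [4, 6, 1, 1],
    'species': ['dog', 'cat', 'dog', 'axolotl'],
})

pairs = []
# items() yields (index label, value) pairs, one per element of the Series.
for i, value in data['name'].items():
    pairs.append((i, value))
    print('Index val:', i, 'Value:', value)
```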


# `groupby`

Separates a `DataFrame` into groups by their values of the given column. This returns a `GroupBy` object that can be used to compute aggregates. We do not show any examples here, but many of the functions below show examples with `groupby` as well.
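
To get a feel for what the groups look like before any aggregate is applied, one option is to loop over the `GroupBy` object directly. This is only a sketch for inspection; the dictionary `names_by_species` is made up for this illustration.

```python
import pandas as pd

# Rebuild the pets data inline so this sketch is self-contained.
data = pd.DataFrame({
    'name': ['Fido', 'Meowrty', 'Chester', 'Phil'],
    'age': [4, 6, 1, 1],
    'species': ['dog', 'cat', 'dog', 'axolotl'],
})

names_by_species = {}
# Looping over a GroupBy yields (group key, sub-DataFrame) pairs.
for species, group in data.groupby('species'):
    names_by_species[species] = group['name'].tolist()

print(names_by_species)
```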

# `min`
Returns the minimum value in the `Series` or along each column of a `DataFrame`. Also works on a `GroupBy` object as an aggregate.

In [13]:
data['age'].min()

1

In [14]:
data.min()

name       Chester
age              1
species    axolotl
dtype: object

In [15]:
data.groupby('species')['age'].min()

species
axolotl    1
cat        6
dog        1
Name: age, dtype: int64

# `max`
Returns the maximum value in the `Series` or along each column of a `DataFrame`. Also works on a `GroupBy` object as an aggregate.

In [16]:
data['age'].max()

6

In [17]:
data.max()

name       Phil
age           6
species     dog
dtype: object

In [18]:
data.groupby('species')['age'].max()

species
axolotl    1
cat        6
dog        4
Name: age, dtype: int64

# `idxmin`

Returns the index of the minimum value in the `Series`. If there is a tie, returns the first index.



In [19]:
data['age'].idxmin()

2

# `idxmax`

Returns the index of the maximum value in the `Series`. If there is a tie, returns the first index.



In [20]:
data['age'].idxmax()

1
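
A common use of `idxmin` and `idxmax` is to pass the returned index to `loc` to fetch the whole row. The sketch below does that for the youngest and oldest pets; the variable names are made up for this illustration.

```python
import pandas as pd

# Rebuild the pets data inline so this sketch is self-contained.
data = pd.DataFrame({
    'name': ['Fido', 'Meowrty', 'Chester', 'Phil'],
    'age': [4, 6, 1, 1],
    'species': ['dog', 'cat', 'dog', 'axolotl'],
})

# idxmax/idxmin return an index label, which loc turns into a full row.
oldest = data.loc[data['age'].idxmax()]
youngest = data.loc[data['age'].idxmin()]  # ties go to the first matching index

print(oldest['name'])
print(youngest['name'])
```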

# `count`

Returns the number of non-missing elements in the `Series` or in each column of a `DataFrame`. Also works on a `GroupBy` object as an aggregate.



In [21]:
data['name'].count()

4

In [22]:
data.count()

name       4
age        4
species    4
dtype: int64

In [23]:
data.groupby('species')['age'].count()

species
axolotl    1
cat        1
dog        2
Name: age, dtype: int64
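
The pets data has no missing values, so `count` matches the number of rows there. The sketch below uses a small made-up `Series` called `ages` (not part of `pets.csv`) to show how `count` differs from `len` when a value is missing.

```python
import pandas as pd

# A hypothetical Series with one missing value (None is stored as NaN).
ages = pd.Series([4, None, 1, 1])

print(ages.count())  # number of non-missing values
print(len(ages))     # total number of elements, missing or not
```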

# `mean`
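Returns the mean value in the `Series` or along each **numeric** column of a `DataFrame`. Also works on a `GroupBy` object as an aggregate. When the `DataFrame` also has non-numeric columns, as `data` does, pass `numeric_only=True` so that `pandas` skips them; recent versions of `pandas` raise an error otherwise.

In [24]:
data['age'].mean()

3.0

In [25]:
data.mean(numeric_only=True)

age    3.0
dtype: float64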

In [26]:
data.groupby('species')['age'].mean()

species
axolotl    1.0
cat        6.0
dog        2.5
Name: age, dtype: float64

# `unique`

Returns the unique values that appear in a `Series`. This is technically returned as something called a `numpy` array, but for most intents and purposes, you can just treat it like a list.


In [27]:
data['species'].unique()

array(['dog', 'cat', 'axolotl'], dtype=object)
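
If you need an actual Python list, you can convert the result with `list`. The sketch below also uses `nunique`, a related `Series` method not otherwise covered in this wordbank, which returns just the number of distinct values. The variable names are made up for this illustration.

```python
import pandas as pd

# Rebuild the pets data inline so this sketch is self-contained.
data = pd.DataFrame({
    'name': ['Fido', 'Meowrty', 'Chester', 'Phil'],
    'age': [4, 6, 1, 1],
    'species': ['dog', 'cat', 'dog', 'axolotl'],
})

# unique() keeps values in order of first appearance; list() makes a plain list.
species_list = list(data['species'].unique())

# nunique() counts the distinct values instead of returning them.
species_count = data['species'].nunique()

print(species_list)
print(species_count)
```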