# Tutorial - Pandas Operations

This material is adapted from the extensive library of examples on data science by Chris Albon [https://chrisalbon.com/].

I have gathered here operations that are very common when working with pandas and that you will find useful in doing your assignments in this class.

**Table of Contents**

1. Apply operations
2. Creating new columns
3. Deleting duplicates
4. Filtering values
5. Replacing values
6. Dropping values 
7. A simple `groupby` example

**Your task:** Walk through the tutorial reading the examples, playing with them, trying to understand their output.

In [1]:
import numpy as np
import pandas as pd

## 1. Apply operations

In pandas, we often need to apply an operation to all members of a column or the whole dataframe. The methods `apply` and `applymap` will be helpful in that regard.

Let's create a dataframe from a dictionary first, to use in the examples:

In [2]:
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3],
        'coverage': [25, 94, 57, 62, 70]}

df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df

| | name | year | reports | coverage |
|---|---|---|---|---|
| Cochice | Jason | 2012 | 4 | 25 |
| Pima | Molly | 2012 | 24 | 94 |
| Santa Cruz | Tina | 2013 | 31 | 57 |
| Maricopa | Jake | 2014 | 2 | 62 |
| Yuma | Amy | 2014 | 3 | 70 |


### Capitalize all names in column `name`

First, we create a lambda function and give it a name:

In [3]:
capitalizer = lambda x: x.upper()

Let's test that it works:

In [4]:
capitalizer("wellesley")

'WELLESLEY'

We now use the method `apply` on the column `name`, by passing the new function as an argument:

In [5]:
df['name'].apply(capitalizer)

Cochice       JASON
Pima          MOLLY
Santa Cruz     TINA
Maricopa       JAKE
Yuma            AMY
Name: name, dtype: object

However, this operation hasn't changed the dataframe:

In [6]:
df

| | name | year | reports | coverage |
|---|---|---|---|---|
| Cochice | Jason | 2012 | 4 | 25 |
| Pima | Molly | 2012 | 24 | 94 |
| Santa Cruz | Tina | 2013 | 31 | 57 |
| Maricopa | Jake | 2014 | 2 | 62 |
| Yuma | Amy | 2014 | 3 | 70 |


To change the dataframe, we assign the result back to the original column:

In [7]:
df['name'] = df['name'].apply(capitalizer)
df

| | name | year | reports | coverage |
|---|---|---|---|---|
| Cochice | JASON | 2012 | 4 | 25 |
| Pima | MOLLY | 2012 | 24 | 94 |
| Santa Cruz | TINA | 2013 | 31 | 57 |
| Maricopa | JAKE | 2014 | 2 | 62 |
| Yuma | AMY | 2014 | 3 | 70 |
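
For simple string transformations like this one, pandas also offers vectorized string methods through the `.str` accessor, so no helper function is needed. The sketch below uses a small stand-in frame (`demo` is a hypothetical name, not part of the tutorial) and compares `apply(capitalizer)` with `Series.str.upper()`:

```python
import pandas as pd

# Small stand-in frame; `demo` is a hypothetical name used only here
demo = pd.DataFrame({'name': ['Jason', 'Molly', 'Tina']},
                    index=['Cochice', 'Pima', 'Santa Cruz'])

capitalizer = lambda x: x.upper()

via_apply = demo['name'].apply(capitalizer)  # calls the function on each element
via_str = demo['name'].str.upper()           # built-in vectorized string method

# Both approaches produce the same values
print(via_apply.tolist() == via_str.tolist())
```

Either form returns a new Series, so the result still has to be assigned back to the column to change the dataframe.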


**YOUR TURN:** Increase by 5 the `year` values using the approach shown above.

In [8]:
# write your solution here


### Change all elements
Writing a more sophisticated function lets us apply an operation to almost all values in the dataframe. The function below multiplies every numerical value by 100 but leaves the string elements untouched.

In [9]:
# create a function called times100
def times100(x):
    # that, if x is a string,
    if isinstance(x, str):
        # just returns it untouched
        return x
    # but, if it is any other non-missing value (including 0),
    elif x is not None:
        # returns it multiplied by 100
        return 100 * x
    # and leaves missing values (None) as they are
    else:
        return None

In [10]:
df.applymap(times100)

| | name | year | reports | coverage |
|---|---|---|---|---|
| Cochice | JASON | 201200 | 400 | 2500 |
| Pima | MOLLY | 201200 | 2400 | 9400 |
| Santa Cruz | TINA | 201300 | 3100 | 5700 |
| Maricopa | JAKE | 201400 | 200 | 6200 |
| Yuma | AMY | 201400 | 300 | 7000 |
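
A note on versions: `DataFrame.applymap` was deprecated in pandas 2.1 in favor of `DataFrame.map`, which does the same element-wise job. The sketch below is one way to write code that runs on both older and newer versions. The frame `demo` and the helper name `elementwise` are hypothetical, and the `times100` variant here is a simplified one that assumes there are no missing values:

```python
import pandas as pd

def times100(x):
    # Strings pass through unchanged; everything else is multiplied by 100
    if isinstance(x, str):
        return x
    return 100 * x

# Small stand-in frame; note the 0 in `reports`
demo = pd.DataFrame({'name': ['JASON', 'MOLLY'], 'reports': [4, 0]})

# Use DataFrame.map where it exists (pandas 2.1+), otherwise fall back to applymap
elementwise = demo.map if hasattr(demo, 'map') else demo.applymap
result = elementwise(times100)
print(result)
```

Because the function checks the type rather than the truthiness of `x`, a value of 0 stays 0 instead of becoming a missing value.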


## 2. Creating new columns

It's possible to add new columns to a dataframe at any time. Here are two ways to do that:

In [11]:
# 1. use the subscription operator []

df = pd.DataFrame()
df['name'] = ['ben', 'lyn', 'ada']
df

| | name |
|---|---|
| 0 | ben |
| 1 | lyn |
| 2 | ada |


In [12]:
# 2. use the method assign

df = df.assign(lastname=['wood','turbak','lerner'])
df

| | name | lastname |
|---|---|---|
| 0 | ben | wood |
| 1 | lyn | turbak |
| 2 | ada | lerner |
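
Both approaches also work when the new column is computed from existing ones, which is the more common case in practice. A minimal sketch, where the frame `people` and the columns `fullname` and `initials` are hypothetical names chosen for illustration:

```python
import pandas as pd

people = pd.DataFrame({'name': ['ben', 'lyn', 'ada'],
                       'lastname': ['wood', 'turbak', 'lerner']})

# 1. subscription operator: adds the column to `people` directly
people['fullname'] = people['name'] + ' ' + people['lastname']

# 2. assign: returns a new dataframe; the lambda receives the dataframe itself
people = people.assign(initials=lambda d: d['name'].str[0] + d['lastname'].str[0])
print(people)
```

The main difference is that `[]` modifies the dataframe in place, while `assign` leaves the original untouched and returns a copy with the extra column.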


## 3. Deleting duplicates

Sometimes there are identical rows, and sometimes there are rows that are partially identical. The example below shows how to remove them.

In [13]:
raw_data = {'first_name': ['Jason', 'Jason', 'Jason','Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', 'Miller', 'Miller','Ali', 'Milner', 'Cooze'], 
        'age': [42, 42, 1111111, 36, 24, 73], 
        'preTestScore': [4, 4, 4, 31, 2, 3],
        'postTestScore': [25, 25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df

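| | first_name | last_name | age | preTestScore | postTestScore |
|---|---|---|---|---|---|
| 0 | Jason | Miller | 42 | 4 | 25 |
| 1 | Jason | Miller | 42 | 4 | 25 |
| 2 | Jason | Miller | 1111111 | 4 | 25 |
| 3 | Tina | Ali | 36 | 31 | 57 |
| 4 | Jake | Milner | 24 | 2 | 62 |
| 5 | Amy | Cooze | 73 | 3 | 70 |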


Notice that we have two identical rows, 0 and 1, and one that is partially identical to them (row 2).

First, one can check for duplicate rows:

In [14]:
df.duplicated()

0    False
1     True
2    False
3    False
4    False
5    False
dtype: bool

The result shows that row 1 is a duplicate. It's easy to drop identical duplicates:

In [15]:
df.drop_duplicates()

| | first_name | last_name | age | preTestScore | postTestScore |
|---|---|---|---|---|---|
| 0 | Jason | Miller | 42 | 4 | 25 |
| 2 | Jason | Miller | 1111111 | 4 | 25 |
| 3 | Tina | Ali | 36 | 31 | 57 |
| 4 | Jake | Milner | 24 | 2 | 62 |
| 5 | Amy | Cooze | 73 | 3 | 70 |


Notice that row 1 disappeared. Meanwhile, the same method can remove partial duplicates when we tell it which columns to compare:

In [16]:
df.drop_duplicates(['first_name'], keep='last')

| | first_name | last_name | age | preTestScore | postTestScore |
|---|---|---|---|---|---|
| 2 | Jason | Miller | 1111111 | 4 | 25 |
| 3 | Tina | Ali | 36 | 31 | 57 |
| 4 | Jake | Milner | 24 | 2 | 62 |
| 5 | Amy | Cooze | 73 | 3 | 70 |


That is, we removed rows with the same `first_name`, keeping only the last occurrence (by default, this method keeps the first occurrence of a duplicate).
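
The `keep` parameter is also accepted by `duplicated`, which helps when you want to inspect the partial duplicates before dropping anything. A minimal sketch on a reduced stand-in frame (`demo` and the result names are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({'first_name': ['Jason', 'Jason', 'Jason', 'Tina'],
                     'age': [42, 42, 1111111, 36]})

# keep=False flags every row that shares its first_name with another row
flagged = demo.duplicated(subset=['first_name'], keep=False)

# Default keep='first' retains the first occurrence; keep='last' the last one
kept_first = demo.drop_duplicates(subset=['first_name'])
kept_last = demo.drop_duplicates(subset=['first_name'], keep='last')

print(flagged.tolist())
print(kept_first.index.tolist(), kept_last.index.tolist())
```

Looking at `demo[flagged]` first is a reasonable way to decide which occurrence deserves to be kept.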

## 4. Filtering values

One way to "filter" data is to select only a few columns or a few rows, as we saw when we learned to slice a dataframe. Often, however, we want to filter based on the values of variables. In that case, we write boolean expressions that keep some rows and ignore the others, as the examples in this section show.

In [17]:
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3],
        'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df

| | name | year | reports | coverage |
|---|---|---|---|---|
| Cochice | Jason | 2012 | 4 | 25 |
| Pima | Molly | 2012 | 24 | 94 |
| Santa Cruz | Tina | 2013 | 31 | 57 |
| Maricopa | Jake | 2014 | 2 | 62 |
| Yuma | Amy | 2014 | 3 | 70 |


**Find rows where `coverage` is greater than 50**

In [18]:
df[df['coverage'] > 50] # notice the boolean expression within the brackets

| | name | year | reports | coverage |
|---|---|---|---|---|
| Pima | Molly | 2012 | 24 | 94 |
| Santa Cruz | Tina | 2013 | 31 | 57 |
| Maricopa | Jake | 2014 | 2 | 62 |
| Yuma | Amy | 2014 | 3 | 70 |


**A complex filtering expression**

In [19]:
df[(df['coverage'] > 50) & (df['reports'] < 4)] # two boolean expressions combined with &; the parentheses are required

| | name | year | reports | coverage |
|---|---|---|---|---|
| Maricopa | Jake | 2014 | 2 | 62 |
| Yuma | Amy | 2014 | 3 | 70 |
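
Boolean filters can also be combined with `|` (or), negated with `~` (not), and built with `isin` to test membership in a list of values. A minimal sketch on a stand-in frame built from the same data (`demo` and the result names are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                     'year': [2012, 2012, 2013, 2014, 2014],
                     'reports': [4, 24, 31, 2, 3],
                     'coverage': [25, 94, 57, 62, 70]},
                    index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])

# OR: rows where coverage is above 90 or reports is below 3
either = demo[(demo['coverage'] > 90) | (demo['reports'] < 3)]

# NOT: rows whose year is anything other than 2012
not_2012 = demo[~(demo['year'] == 2012)]

# Membership: rows whose year is one of the listed values
selected_years = demo[demo['year'].isin([2012, 2014])]

print(either.index.tolist())
print(not_2012.index.tolist())
print(selected_years.index.tolist())
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the operators `&`, `|`, and `~` are used instead.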


## 5. Replacing values

Sometimes, we want to replace many values at once. That is easy to do with the `replace` method.

In [20]:
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 
        'age': [42, 52, 36, 24, 73], 
        'preTestScore': [-999, -999, -999, 2, 1],
        'postTestScore': [2, 2, -999, 2, -999]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df

| | first_name | last_name | age | preTestScore | postTestScore |
|---|---|---|---|---|---|
| 0 | Jason | Miller | 42 | -999 | 2 |
| 1 | Molly | Jacobson | 52 | -999 | 2 |
| 2 | Tina | Ali | 36 | -999 | -999 |
| 3 | Jake | Milner | 24 | 2 | 2 |
| 4 | Amy | Cooze | 73 | 1 | -999 |


In [21]:
df.replace(-999, np.nan) # np.nan means "Not a Number", a common way of denoting missing values in a dataset

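| | first_name | last_name | age | preTestScore | postTestScore |
|---|---|---|---|---|---|
| 0 | Jason | Miller | 42 | NaN | 2.0 |
| 1 | Molly | Jacobson | 52 | NaN | 2.0 |
| 2 | Tina | Ali | 36 | NaN | NaN |
| 3 | Jake | Milner | 24 | 2.0 | 2.0 |
| 4 | Amy | Cooze | 73 | 1.0 | NaN |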
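
`replace` can also be restricted to particular columns, or given several values to replace at once. A minimal sketch on a reduced stand-in frame. The names `demo`, `only_pre`, and `cleaned` are hypothetical, and the second sentinel value `-1` is an assumption made up for illustration:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'age': [42, 52, 36],
                     'preTestScore': [-999, -999, 2],
                     'postTestScore': [2, -999, -999]})

# Nested dictionary: replace -999 only in the column preTestScore
only_pre = demo.replace({'preTestScore': {-999: np.nan}})

# List of values: replace several sentinel values everywhere in the frame
cleaned = demo.replace([-999, -1], np.nan)

# Count the missing values per column after each replacement
print(only_pre.isna().sum())
print(cleaned.isna().sum())
```

Like the other methods in this tutorial, `replace` returns a new dataframe, so `demo` itself still contains the `-999` values afterwards.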


## 6. Dropping values

We might want to drop entire columns or rows. We can do that with the method `drop`. Study its documentation to learn about its different parameters.

In [22]:
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df

| | name | year | reports |
|---|---|---|---|
| Cochice | Jason | 2012 | 4 |
| Pima | Molly | 2012 | 24 |
| Santa Cruz | Tina | 2013 | 31 |
| Maricopa | Jake | 2014 | 2 |
| Yuma | Amy | 2014 | 3 |


**Drop rows (observations)**

In [23]:
df.drop(['Cochice', 'Pima'])

| | name | year | reports |
|---|---|---|---|
| Santa Cruz | Tina | 2013 | 31 |
| Maricopa | Jake | 2014 | 2 |
| Yuma | Amy | 2014 | 3 |


**Drop columns (variables)**

In [24]:
df.drop('reports', axis=1) # axis=1 denotes that we are referring to a column, not a row

| | name | year |
|---|---|---|
| Cochice | Jason | 2012 |
| Pima | Molly | 2012 |
| Santa Cruz | Tina | 2013 |
| Maricopa | Jake | 2014 |
| Yuma | Amy | 2014 |


**Note:** Neither example above changed the original dataframe. To do that, pass `inplace=True` or assign the result back to `df`.
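
The two ways of making a drop permanent can be sketched as follows. The example also uses the `columns=` and `index=` keywords, which `drop` accepts as an alternative to `axis`. The names `demo` and `trimmed` are hypothetical:

```python
import pandas as pd

demo = pd.DataFrame({'name': ['Jason', 'Molly', 'Tina'],
                     'year': [2012, 2012, 2013],
                     'reports': [4, 24, 31]},
                    index=['Cochice', 'Pima', 'Santa Cruz'])

# Option 1: keep the returned frame under a name of your choice
trimmed = demo.drop(columns='reports')      # same as demo.drop('reports', axis=1)

# Option 2: modify the frame itself; with inplace=True, drop returns None
demo.drop(index=['Cochice'], inplace=True)  # same as demo.drop(['Cochice'], inplace=True)

print(trimmed.columns.tolist())
print(demo.index.tolist())
```

Assigning the result is often the easier option to follow, because every step leaves a named dataframe you can inspect.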

## 7. A simple `groupby` example

This is just a simple taste of this powerful method. Next time we'll do a full tutorial on `groupby`.

In [25]:
# Example dataframe
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df

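| | regiment | company | name | preTestScore | postTestScore |
|---|---|---|---|---|---|
| 0 | Nighthawks | 1st | Miller | 4 | 25 |
| 1 | Nighthawks | 1st | Jacobson | 24 | 94 |
| 2 | Nighthawks | 2nd | Ali | 31 | 57 |
| 3 | Nighthawks | 2nd | Milner | 2 | 62 |
| 4 | Dragoons | 1st | Cooze | 3 | 70 |
| 5 | Dragoons | 1st | Jacon | 4 | 25 |
| 6 | Dragoons | 2nd | Ryaner | 24 | 94 |
| 7 | Dragoons | 2nd | Sone | 31 | 57 |
| 8 | Scouts | 1st | Sloan | 2 | 62 |
| 9 | Scouts | 1st | Piger | 3 | 70 |
| 10 | Scouts | 2nd | Riani | 2 | 62 |
| 11 | Scouts | 2nd | Ali | 3 | 70 |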


In [26]:
# Create a grouping object. In other words, create an object that
# represents that particular grouping. In this case we group
# pre-test scores by the regiment.
regiment_preScore = df['preTestScore'].groupby(df['regiment'])
regiment_preScore

<pandas.core.groupby.groupby.SeriesGroupBy object at 0x10867f2b0>

In [27]:
# Display the mean value of each regiment's pre-test score
regiment_preScore.mean()

regiment
Dragoons      15.50
Nighthawks    15.25
Scouts         2.50
Name: preTestScore, dtype: float64

In [28]:
regiment_preScore.max()

regiment
Dragoons      31
Nighthawks    31
Scouts         3
Name: preTestScore, dtype: int64

The idea is that once we have grouped the data by a certain variable, we can apply many different statistical operations to the groups.
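
Several statistics can be computed in one step with the `agg` method of the grouping object. A minimal sketch on a reduced stand-in frame (`demo` and `summary` are hypothetical names, and the data is a shortened version of the example above):

```python
import pandas as pd

demo = pd.DataFrame({
    'regiment': ['Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts'],
    'preTestScore': [4, 24, 3, 31, 2, 3]})

# Group by regiment, select the score column, and compute three statistics at once
summary = demo.groupby('regiment')['preTestScore'].agg(['mean', 'max', 'count'])
print(summary)
```

The result is a dataframe with one row per regiment and one column per statistic, which makes the groups easy to compare side by side.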