# Data Manipulation


We first import the two libraries used throughout:
* Pandas
* NumPy

In [1]:
import pandas as pd
import numpy as np

Next, we import the iris data set as a pandas data frame, and show the first few rows (the "head"):

In [2]:
iris = pd.read_csv('03_iris.csv')
iris.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


# Selecting Columns and Rows

## Select columns

Columns can be selected by indexing the data frame with a *list* of column names:

In [3]:
iris[ ['Sepal.Length', 'Sepal.Width'] ].head()

Unnamed: 0,Sepal.Length,Sepal.Width
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1
4,5.0,3.6


Columns can be dropped using the `drop()` method, applied to axis 1 (= columns):

In [4]:
iris.drop('Species', axis = 1).head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


More complex filtering can be done by extracting the column names from the data frame as strings and then applying a string method such as `startswith()` or `contains()`. The resulting boolean mask (one entry per column) can then be used with the `loc[]` indexer to extract the corresponding columns:

In [5]:
iris.loc[:,iris.columns.str.startswith('Sepal')].head()

Unnamed: 0,Sepal.Length,Sepal.Width
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1
4,5.0,3.6


In [6]:
iris.loc[:,iris.columns.str.contains('^S.*\\.')].head()

Unnamed: 0,Sepal.Length,Sepal.Width
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1
4,5.0,3.6
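
As an alternative to building the mask by hand, pandas also provides the `filter()` method, which selects columns by substring (`like`) or by regular expression (`regex`). The following is a minimal sketch; the frame `df` is a small stand-in built from the first rows shown above, not the full data set:

```python
import pandas as pd

# Small stand-in for the iris data (first rows shown above)
df = pd.DataFrame({
    'Sepal.Length': [5.1, 4.9, 4.7],
    'Sepal.Width': [3.5, 3.0, 3.2],
    'Petal.Length': [1.4, 1.4, 1.3],
    'Petal.Width': [0.2, 0.2, 0.2],
    'Species': ['setosa', 'setosa', 'setosa'],
})

# The string test yields one boolean per column name
mask = df.columns.str.startswith('Sepal')

by_mask = df.loc[:, mask]
by_like = df.filter(like='Sepal')       # substring match on column names
by_regex = df.filter(regex=r'^S.*\.')   # regular expression match on column names

print(by_regex.columns.tolist())  # ['Sepal.Length', 'Sepal.Width']
```

Note that `Species` starts with "S" but contains no dot, so the regular expression does not select it.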


## Filter rows

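Rows can be selected by placing a predicate (a boolean condition on one or more columns) inside the indexing operator. Several predicates are combined with `&` (and) or `|` (or), each wrapped in parentheses: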

In [7]:
iris[iris.Species == 'versicolor'].head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
50,7.0,3.2,4.7,1.4,versicolor
51,6.4,3.2,4.5,1.5,versicolor
52,6.9,3.1,4.9,1.5,versicolor
53,5.5,2.3,4.0,1.3,versicolor
54,6.5,2.8,4.6,1.5,versicolor


In [8]:
iris[(iris['Sepal.Length'] > 5) & (iris['Sepal.Width'] > 4)]

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
15,5.7,4.4,1.5,0.4,setosa
32,5.2,4.1,1.5,0.1,setosa
33,5.5,4.2,1.4,0.2,setosa
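
To make the mechanics explicit: each predicate is itself a boolean `Series` with one entry per row, and the indexing operator keeps the rows where it is `True`. The sketch below uses a small stand-in frame `df` built from a few of the rows shown above; `isin()` is included as one common way to test membership in a list of values:

```python
import pandas as pd

# Small stand-in for the iris data (a few of the rows shown above)
df = pd.DataFrame({
    'Sepal.Length': [5.1, 5.7, 5.2, 7.0, 5.5],
    'Sepal.Width': [3.5, 4.4, 4.1, 3.2, 2.3],
    'Species': ['setosa', 'setosa', 'setosa', 'versicolor', 'versicolor'],
})

# A comparison yields a boolean Series with one entry per row
long_sepal = df['Sepal.Length'] > 5
wide_sepal = df['Sepal.Width'] > 4

# & = and, | = or, ~ = not
both = df[long_sepal & wide_sepal]

# Inline comparisons need parentheses, because & and | bind
# more tightly than > and ==
either = df[(df['Sepal.Length'] > 6) | (df['Sepal.Width'] > 4)]
not_setosa = df[~(df['Species'] == 'setosa')]

# isin() tests membership in a list of values
selected = df[df['Species'].isin(['versicolor', 'virginica'])]

print(both.index.tolist())    # [1, 2]
print(either.index.tolist())  # [1, 2, 3]
```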


# Transforming Variables

## Change content

An existing variable is modified by assigning a new value to its column. Here, the sepal length is rounded to whole numbers:

In [9]:
iris['Sepal.Length'] = iris['Sepal.Length'].round()
iris.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.0,3.5,1.4,0.2,setosa
1,5.0,3.0,1.4,0.2,setosa
2,5.0,3.2,1.3,0.2,setosa
3,5.0,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


Adding a new variable, based on an existing one, can be done by using the `where()` function from the `numpy` library. It works like a vectorized ternary operator:

In [10]:
iris['Sepal'] = np.where(iris['Sepal.Length'] > 5, 'Long', 'Short')
iris.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,Sepal
0,5.0,3.5,1.4,0.2,setosa,Short
1,5.0,3.0,1.4,0.2,setosa,Short
2,5.0,3.2,1.3,0.2,setosa,Short
3,5.0,3.1,1.5,0.2,setosa,Short
4,5.0,3.6,1.4,0.2,setosa,Short
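
The correspondence with the ternary operator can be sketched as follows. The values in `lengths` are illustrative, and `np.select()` is shown only as one possible extension to more than two categories; it is not used elsewhere in this notebook:

```python
import numpy as np
import pandas as pd

lengths = pd.Series([5.0, 6.0, 5.0, 7.0])  # illustrative values

# Vectorized: the condition is evaluated for all elements at once
labels = np.where(lengths > 5, 'Long', 'Short')

# Equivalent element-by-element ternary expression
labels_loop = ['Long' if x > 5 else 'Short' for x in lengths]

# For more than two categories, np.select() takes a list of conditions
# and a matching list of values; the first matching condition wins
sizes = np.select(
    [lengths > 6, lengths > 5],
    ['Very long', 'Long'],
    default='Short',
)

print(labels.tolist())  # ['Short', 'Long', 'Short', 'Long']
print(sizes.tolist())   # ['Short', 'Long', 'Short', 'Very long']
```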


## Renaming variables

Renaming variables can be done using the `rename()` method. The simplest way is by specifying a dictionary of old name/new name pairs as the `columns` argument:

In [11]:
iris.rename(columns = {'Sepal.Length': 'Sepal_Length'}).head()

Unnamed: 0,Sepal_Length,Sepal.Width,Petal.Length,Petal.Width,Species,Sepal
0,5.0,3.5,1.4,0.2,setosa,Short
1,5.0,3.0,1.4,0.2,setosa,Short
2,5.0,3.2,1.3,0.2,setosa,Short
3,5.0,3.1,1.5,0.2,setosa,Short
4,5.0,3.6,1.4,0.2,setosa,Short
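
Two details are worth a short sketch: `rename()` returns a new data frame and leaves the original untouched unless the result is assigned, and the `columns` argument also accepts a function that is applied to every column name. The frame `df` below is a small stand-in, not the full data set:

```python
import pandas as pd

df = pd.DataFrame({'Sepal.Length': [5.1, 4.9], 'Sepal.Width': [3.5, 3.0]})

# rename() returns a new data frame; df itself keeps its column names
renamed = df.rename(columns={'Sepal.Length': 'Sepal_Length'})

# A function is applied to every column name
all_renamed = df.rename(columns=lambda name: name.replace('.', '_'))

print(df.columns.tolist())           # ['Sepal.Length', 'Sepal.Width']
print(renamed.columns.tolist())      # ['Sepal_Length', 'Sepal.Width']
print(all_renamed.columns.tolist())  # ['Sepal_Length', 'Sepal_Width']
```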


# Sorting and Summarizing

## Sorting

The `sort_values()` method sorts by one or more columns, each in ascending or descending order:

In [12]:
iris.sort_values(by = ['Species', 'Sepal.Length'], ascending = [False, True])

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species,Sepal
106,5.0,2.5,4.5,1.7,virginica,Short
100,6.0,3.3,6.0,2.5,virginica,Long
101,6.0,2.7,5.1,1.9,virginica,Long
103,6.0,2.9,5.6,1.8,virginica,Long
104,6.0,3.0,5.8,2.2,virginica,Long
...,...,...,...,...,...,...
14,6.0,4.0,1.2,0.2,setosa,Long
15,6.0,4.4,1.5,0.4,setosa,Long
18,6.0,3.8,1.7,0.3,setosa,Long
33,6.0,4.2,1.4,0.2,setosa,Long
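
On a frame small enough to check by hand, the effect of the `ascending` list looks like this. The values are illustrative; `reset_index(drop=True)` is an optional extra step that renumbers the rows after sorting:

```python
import pandas as pd

df = pd.DataFrame({
    'Species': ['setosa', 'virginica', 'setosa', 'virginica'],
    'Sepal.Length': [6.0, 6.0, 5.0, 5.0],  # illustrative values
})

# Species descending, then Sepal.Length ascending within each species
ordered = df.sort_values(by=['Species', 'Sepal.Length'], ascending=[False, True])

# The original row labels travel with the rows
print(ordered.index.tolist())  # [3, 1, 2, 0]

# reset_index(drop=True) renumbers the rows from 0
renumbered = ordered.reset_index(drop=True)
```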


## Summarizing

### Standard statistics

The simplest way to compute summary statistics is the `describe()` method. For each numeric variable, it computes the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum:

In [13]:
iris.describe()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
count,150.0,150.0,150.0,150.0
mean,5.86,3.057333,3.758,1.199333
std,0.867226,0.435866,1.765298,0.762238
min,4.0,2.0,1.0,0.1
25%,5.0,2.8,1.6,0.3
50%,6.0,3.0,4.35,1.3
75%,6.0,3.3,5.1,1.8
max,8.0,4.4,6.9,2.5


### Individual statistics

These statistics can also be computed individually by first selecting a variable, and then calling the corresponding method:

In [14]:
iris['Sepal.Length'].mean()

5.86

In [15]:
iris.Species.value_counts()

setosa        50
versicolor    50
virginica     50
Name: Species, dtype: int64
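
The relationship between `describe()` and the individual methods can be sketched on a short series with illustrative values. `agg()` is shown as one way to request a chosen subset of statistics in a single call:

```python
import pandas as pd

s = pd.Series([5.0, 5.0, 6.0, 7.0], name='Sepal.Length')  # illustrative values

summary = s.describe()

# Each entry of describe() has a corresponding individual method
mean = s.mean()           # 5.75, the "mean" row
median = s.quantile(0.5)  # 5.5, the "50%" row

# agg() computes a chosen subset of statistics in one call
subset = s.agg(['min', 'max', 'mean'])
```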

Aggregation functions can also be applied to the complete data frame:

In [16]:
iris.mean(numeric_only = True)


Sepal.Length    5.860000
Sepal.Width     3.057333
Petal.Length    3.758000
Petal.Width     1.199333
dtype: float64

## Grouping

An important feature is grouping. After grouping, all summary methods are applied to each group separately. In the following, we first group by the `Species` column, and then apply some summary methods:

In [17]:
iris.groupby('Species').mean(numeric_only = True)


Species,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
setosa,5.0,3.428,1.462,0.246
versicolor,6.04,2.77,4.26,1.326
virginica,6.54,2.974,5.552,2.026


In [18]:
iris.groupby('Species')['Sepal.Length'].describe()

Species,count,mean,std,min,25%,50%,75%,max
setosa,50.0,5.0,0.451754,4.0,5.0,5.0,5.0,6.0
versicolor,50.0,6.04,0.532993,5.0,6.0,6.0,6.0,7.0
virginica,50.0,6.54,0.734291,5.0,6.0,6.0,7.0,8.0
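
Grouped summaries can also be restricted to selected statistics with `agg()`. The following is a minimal sketch on a small stand-in frame with illustrative values; the output column names `mean_length` and `max_width` are made up for this example:

```python
import pandas as pd

df = pd.DataFrame({
    'Species': ['setosa', 'setosa', 'versicolor', 'versicolor'],
    'Sepal.Length': [5.0, 6.0, 6.0, 7.0],  # illustrative values
    'Sepal.Width': [3.5, 4.4, 3.2, 2.3],
})

grouped = df.groupby('Species')

# One statistic for every numeric column, per group
means = grouped.mean(numeric_only=True)

# Several statistics for a single column, per group
stats = grouped['Sepal.Length'].agg(['mean', 'max'])

# Named aggregation: output column name = (input column, function)
named = grouped.agg(
    mean_length=('Sepal.Length', 'mean'),
    max_width=('Sepal.Width', 'max'),
)

print(means.loc['setosa', 'Sepal.Length'])  # 5.5
```

In each result, the group labels (`setosa`, `versicolor`) form the row index.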
