# Working with DataFrames

## Manipulating DataFrame contents

We first create a small DataFrame to work with.

In [24]:
import pandas as pd
import numpy as np
x = {"a":[1.2,12.5], "b":[1.7,17.1], "c":[1.3,13.3], "d":[1.6,16.6]}
# y = dict(a = 12.5, b = 12.7, c = 11.9, d = 13.1)
df = pd.DataFrame(x, index = ['X', 'Y']).T

df

Unnamed: 0,X,Y
a,1.2,12.5
b,1.7,17.1
c,1.3,13.3
d,1.6,16.6


In [25]:
type(x)

dict

### Adding a column

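A new column can be added to a DataFrame by assignment, as shown below. When the assigned value is a Series, it is aligned on the index, so rows without a matching label receive a missing value. When it is a list, its length must equal the number of rows.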

In [26]:
df['Z'] = pd.Series({'d':32.1, 'c':35.5})
df

Unnamed: 0,X,Y,Z
a,1.2,12.5,
b,1.7,17.1,
c,1.3,13.3,35.5
d,1.6,16.6,32.1


In [27]:
df['W'] = [5, 6, 7, 8]
df

Unnamed: 0,X,Y,Z,W
a,1.2,12.5,,5
b,1.7,17.1,,6
c,1.3,13.3,35.5,7
d,1.6,16.6,32.1,8
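
The difference between assigning a Series and assigning a list can be checked with a minimal sketch. The frame `demo` and its values are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical small frame, used only for this illustration.
demo = pd.DataFrame({"X": [1.2, 1.7, 1.3]}, index=["a", "b", "c"])

# A Series is aligned on index labels: 'a' and 'b' have no match and become NaN,
# while the extra label 'z' is ignored.
demo["Z"] = pd.Series({"c": 35.5, "z": 99.0})

# A plain list is assigned by position, so its length must equal the number of rows.
demo["W"] = [5, 6, 7]

# A list of the wrong length is rejected with a ValueError.
try:
    demo["V"] = [1, 2]
    length_error = None
except ValueError as err:
    length_error = str(err)
```

Alignment by label is why the order of the keys in the assigned Series does not matter.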


### Adding rows

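New rows can be added using the `append` method, as shown below. Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, where `pd.concat` serves the same purpose.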

In [28]:
newRow = pd.Series([5, 15, 20, 25], index = ['X', 'Y', 'Z', 'W'], name = 'e')
df2 = df.append(newRow)
df2

Unnamed: 0,X,Y,Z,W
a,1.2,12.5,,5
b,1.7,17.1,,6
c,1.3,13.3,35.5,7
d,1.6,16.6,32.1,8
e,5.0,15.0,20.0,25


In [29]:
df

Unnamed: 0,X,Y,Z,W
a,1.2,12.5,,5
b,1.7,17.1,,6
c,1.3,13.3,35.5,7
d,1.6,16.6,32.1,8


Note that `append` returns a new DataFrame; the original DataFrame remains unchanged.
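
On pandas 2.0 and later, where `DataFrame.append` is no longer available, the same row addition could be written with `pd.concat`. The following is a sketch under that assumption, using a reduced, hypothetical version of the frame above.

```python
import pandas as pd

# Reduced, hypothetical version of the frame used above.
df = pd.DataFrame({"X": [1.2, 1.7], "W": [5, 6]}, index=["a", "b"])
new_row = pd.Series([5, 25], index=["X", "W"], name="e")

# to_frame().T turns the Series into a one-row DataFrame labelled 'e';
# pd.concat then stacks it below df and returns a new DataFrame.
df2 = pd.concat([df, new_row.to_frame().T])
```

As with `append`, `df` itself is left unchanged and the result has to be bound to a name.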

### Modifying existing data 

Existing data can be modified by assignment, either a whole column at a time or, through `loc`, a single cell or a whole row.

In [30]:
df2['Y'] = range(20, 25)
df2.loc['b','X'] = 1.5
df2

Unnamed: 0,X,Y,Z,W
a,1.2,20,,5
b,1.5,21,,6
c,1.3,22,35.5,7
d,1.6,23,32.1,8
e,5.0,24,20.0,25


In [31]:
df2.loc['e'] = [1, 2, 3, 4]
df2

Unnamed: 0,X,Y,Z,W
a,1.2,20,,5
b,1.5,21,,6
c,1.3,22,35.5,7
d,1.6,23,32.1,8
e,1.0,2,3.0,4


In [32]:
df2.W.dtype

dtype('int64')

### Deleting a column

An existing column can be deleted from a DataFrame using the `drop` method, as shown below.

In [33]:
df2.drop(columns = ['Z'])

Unnamed: 0,X,Y,W
a,1.2,20,5
b,1.5,21,6
c,1.3,22,7
d,1.6,23,8
e,1.0,2,4


In [34]:
df2

Unnamed: 0,X,Y,Z,W
a,1.2,20,,5
b,1.5,21,,6
c,1.3,22,35.5,7
d,1.6,23,32.1,8
e,1.0,2,3.0,4


### Deleting rows

Similarly, rows can be dropped by passing their labels to the `index` parameter, as shown below.

In [35]:
df2.drop(index = ['a','e'])

Unnamed: 0,X,Y,Z,W
b,1.5,21,,6
c,1.3,22,35.5,7
d,1.6,23,32.1,8


In [36]:
df2

Unnamed: 0,X,Y,Z,W
a,1.2,20,,5
b,1.5,21,,6
c,1.3,22,35.5,7
d,1.6,23,32.1,8
e,1.0,2,3.0,4


Note that `drop` returns a new DataFrame with the given rows or columns removed; the original DataFrame remains unchanged.
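
To keep the result of a drop, the returned DataFrame has to be bound to a name. A minimal sketch with a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame, used only for this illustration.
frame = pd.DataFrame({"X": [1.2, 1.5], "Z": [None, 35.5]}, index=["a", "b"])

# The result is discarded here, so frame still has column 'Z'.
frame.drop(columns=["Z"])
still_has_z = "Z" in frame.columns

# Rebinding the name keeps the change.
frame = frame.drop(columns=["Z"])
frame = frame.drop(index=["a"])
```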

## Computing with DataFrame


First, we create a DataFrame on which computing will be performed.

In [37]:
df = pd.DataFrame( {'X': np.random.randint(1, 4, 10), 
                    'Y': np.random.normal(15,2,10), 
                    'Z': np.random.uniform(0,1,10)
                   },
                   index = ['a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9'], 
                )

## Viewing a DataFrame

To see the first five rows, we can use the `head` method. An integer argument changes the number of rows shown.

In [38]:
df.head()

Unnamed: 0,X,Y,Z
a0,3,15.249281,0.076862
a1,2,12.80963,0.989788
a2,3,11.915218,0.654349
a3,1,13.300533,0.207719
a4,1,15.379844,0.216006


In [39]:
df.head(3)

Unnamed: 0,X,Y,Z
a0,3,15.249281,0.076862
a1,2,12.80963,0.989788
a2,3,11.915218,0.654349


Similarly, the `tail` method can be used to see the last five rows.
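
A short sketch of `tail`, using a hypothetical frame with the same index labels as above (the random values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical data with the labels a0 .. a9.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Y": rng.normal(15, 2, 10)},
                    index=[f"a{i}" for i in range(10)])

last_five = demo.tail()    # rows a5 .. a9
last_two = demo.tail(2)    # rows a8 and a9
```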

## Summary Statistics

The `describe` method returns a DataFrame with summary statistics contained in rows. To see how missing values are treated, we first set one entry of column X to missing.

In [40]:
df.iloc[0, 0] = None
df.describe()

Unnamed: 0,X,Y,Z
count,9.0,10.0,10.0
mean,1.888889,14.369315,0.568099
std,0.927961,2.145345,0.310382
min,1.0,11.135996,0.076862
25%,1.0,12.932355,0.293794
50%,2.0,14.580597,0.622321
75%,3.0,15.347203,0.795929
max,3.0,18.123013,0.989788


The count for X is 9 because `describe` excludes the missing value. The modified DataFrame is shown below; X is now displayed as floating point so that it can hold the missing entry.

In [41]:
df

Unnamed: 0,X,Y,Z
a0,,15.249281,0.076862
a1,2.0,12.80963,0.989788
a2,3.0,11.915218,0.654349
a3,1.0,13.300533,0.207719
a4,1.0,15.379844,0.216006
a5,3.0,14.991145,0.823008
a6,1.0,18.123013,0.590293
a7,1.0,16.618437,0.881111
a8,3.0,11.135996,0.527159
a9,2.0,14.17005,0.714692


**HW**: Explore the parameters that can be supplied to `describe` to customize its output.

## Descriptive Statistics 

Descriptive statistics such as the mean and variance can be computed as follows.

In [42]:
df.mean()

X     1.888889
Y    14.369315
Z     0.568099
dtype: float64

Note that the `mean` method returns a Series object indexed by the column names.

**HW**: 
1. Explore the parameters that can be passed to the `mean` method.
2. There are several other methods available for computing different descriptive statistics. Explore these methods.

A DataFrame containing descriptive statistics can be constructed as shown below. 

In [44]:
# rename() with a scalar sets the Series name, which becomes the row label below
Means = df.mean().rename('Mean')
Stds = df.std().rename('Std Dev')
Skewness = df.skew().rename('Coef Skewness')
pd.DataFrame([Means, Stds, Skewness])

Unnamed: 0,X,Y,Z
Mean,1.888889,14.369315,0.568099
Std Dev,0.927961,2.145345,0.310382
Coef Skewness,0.2632,0.177466,-0.385729


Note that the name attributes have been used as index values.

### Centering data

A DataFrame supports elementwise arithmetic and broadcasting in much the same way as a NumPy ndarray, so many operations that can be performed on an ndarray can also be performed on a DataFrame. For example, if df is regarded as a data matrix, the data can be centred by subtracting the column means:

In [45]:
df - df.mean()

Unnamed: 0,X,Y,Z
a0,,0.879966,-0.491237
a1,0.111111,-1.559685,0.42169
a2,1.111111,-2.454096,0.08625
a3,-0.888889,-1.068782,-0.36038
a4,-0.888889,1.01053,-0.352093
a5,1.111111,0.62183,0.254909
a6,-0.888889,3.753699,0.022195
a7,-0.888889,2.249122,0.313012
a8,1.111111,-3.233319,-0.04094
a9,0.111111,-0.199265,0.146593
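
As a quick check on the centring step, the column means of the centred data should be zero up to rounding error. The sketch below uses a hypothetical frame without missing values and compares the result with the same computation on the underlying ndarray.

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only for this illustration.
rng = np.random.default_rng(1)
demo = pd.DataFrame({"Y": rng.normal(15, 2, 10), "Z": rng.uniform(0, 1, 10)})

# df.mean() is a Series indexed by column name, so the subtraction
# is matched to the columns by label.
centered = demo - demo.mean()
col_means = centered.mean()

# The same computation on the underlying ndarray, using NumPy broadcasting.
values = demo.to_numpy()
centered_np = values - values.mean(axis=0)

means_are_zero = np.allclose(col_means, 0.0)
matches_numpy = np.allclose(centered.to_numpy(), centered_np)
```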


### Variance-covariance matrix

The variance-covariance matrix can be computed from the centred data as

In [47]:
dfCentered = df-df.mean()
n = df.index.size
print(n)
dfCentered.T.dot(dfCentered)/(n-1)

10


Unnamed: 0,X,Y,Z
X,,,
Y,,4.602505,-0.02481
Z,,-0.02481,0.096337


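Note that the row and column labels of the result are the column names of df. The entries involving X are missing because X contains a missing value, which propagates through the matrix product.

The `cov` method computes this matrix directly. It excludes missing values pairwise, so the entries involving X are filled in.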

In [48]:
df.cov()

Unnamed: 0,X,Y,Z
X,0.861111,-1.388491,0.091508
Y,-1.388491,4.602505,-0.02481
Z,0.091508,-0.02481,0.096337
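
When no values are missing, the manual formula and `cov` should agree. The following sketch checks this on hypothetical data and then shows that `cov` still returns a complete matrix after a value is set to missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only for this illustration.
rng = np.random.default_rng(2)
demo = pd.DataFrame({"Y": rng.normal(15, 2, 10), "Z": rng.uniform(0, 1, 10)})

# Manual computation: centred data, cross-product, division by n - 1.
centered = demo - demo.mean()
n = len(demo)
manual_cov = centered.T.dot(centered) / (n - 1)

same_as_cov = np.allclose(manual_cov, demo.cov())

# With a missing value, cov() skips the affected observations pair by pair
# instead of propagating NaN.
demo_na = demo.copy()
demo_na.iloc[0, 0] = np.nan
cov_na = demo_na.cov()
has_nan = bool(cov_na.isna().any().any())
```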


## Data Aggregation

By an aggregation operation, we mean an operation that reduces an array to a scalar.
The methods for computing descriptive statistics such as mean, median, count, sum, min, max, etc. are essentially aggregation methods.

### Using `agg` method

The DataFrame containing descriptive statistics that we computed earlier can be quickly computed as

In [50]:
df.agg(['mean', 'std', 'skew'])

Unnamed: 0,X,Y,Z
mean,1.888889,14.369315,0.568099
std,0.927961,2.145345,0.310382
skew,0.2632,0.177466,-0.385729


Note that `agg` is an alias of `aggregate`. Use of the alias is more common.

To get the required row labels, we can write

In [51]:
Dstats = df.agg(['mean', 'std', 'skew'])
Dstats.index = ['Mean', 'Std Dev', 'Skewness']
Dstats

Unnamed: 0,X,Y,Z
Mean,1.888889,14.369315,0.568099
Std Dev,0.927961,2.145345,0.310382
Skewness,0.2632,0.177466,-0.385729
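
Assigning to `index` relies on the order of the rows. One possible alternative is `rename` with an explicit mapping from old to new labels, sketched here on a hypothetical frame.

```python
import pandas as pd

# Hypothetical data, used only for this illustration.
demo = pd.DataFrame({"Y": [12.0, 15.0, 18.0], "Z": [0.1, 0.5, 0.6]})

# rename() maps each existing label to its replacement, independent of row order.
stats = demo.agg(["mean", "std", "skew"]).rename(
    index={"mean": "Mean", "std": "Std Dev", "skew": "Skewness"}
)
```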


### User-defined aggregate function

We can also use a user-defined aggregation function, as shown below.

In [52]:
# Named value_range so that the built-in range, used earlier, is not shadowed.
def value_range(x):
    return x.max() - x.min()
df.agg(['median', value_range])

Unnamed: 0,X,Y,Z
median,2.0,14.580597,0.622321
value_range,2.0,6.987017,0.912926
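
The row label of a user-defined aggregation is taken from the function's name. As a further sketch, the hypothetical helper `coef_var` below computes the coefficient of variation on illustrative data.

```python
import pandas as pd

def coef_var(x):
    """Hypothetical helper: standard deviation divided by the mean."""
    return x.std() / x.mean()

# Hypothetical data, used only for this illustration.
demo = pd.DataFrame({"Y": [12.0, 15.0, 18.0], "Z": [0.2, 0.4, 0.6]})

# Built-in aggregations are given as strings, user-defined ones as function objects.
result = demo.agg(["mean", coef_var])
```

Choosing a descriptive function name therefore also gives a descriptive row label in the output.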
