<img style="float: right;" src="https://c402277.ssl.cf1.rackcdn.com/photos/13100/images/featured_story/BIC_128.png?1485963152">
# Pandas

- Python package that is essentially a souped-up Excel
- Built off numpy, so you will see a lot of similarity
- Adds **labels** to data for easy readability
- Adds an analog of R data frames

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Contents
- [Series](#series)
- [DataFrames](#dataframes)
- [Pandas Index](#index)
- [Hierarchical index](#hierarchical)
- [Missing data](#missing)
- [`groupby`](#group)
- [Merging, joining, concatenating](#combine)

___
<a id='series'></a>
## Series
- "One-dimensional ndarray with axis labels", Pandas
- Well suited to time series
- Individual data points are labelled and referred to through the Series's **Index**

### Creating a Series

- The `pd.Series` constructor can create a Series from lists, arrays, dictionaries, and many other Python objects

In [None]:
# Calcium imaging data
ca_data = np.loadtxt('data_for_lectures/ca-traces.txt', delimiter=',')

In [None]:
ca_data[0, :]

In [None]:
plt.plot(ca_data[0, :])

In [None]:
neuron0 = pd.Series(data=ca_data[0, :])
neuron0

You can add labels for indices.

In [None]:
ts = np.arange(0, ca_data.shape[1]) * 0.2
neuron0 = pd.Series(data=ca_data[0, :], index=ts)
neuron0.head()

In [None]:
plt.plot(neuron0);

In [None]:
neuron0.plot();

Dictionaries already have labels supplied!

In [None]:
genotype = {
    'mouse2': 'Cre+',
    'mouse5': 'Cre-',
    'mouse6': 'Cre-',
    'mouse9': 'Cre+',
}

In [None]:
pd.Series(genotype)

Note that any object can be provided, not just numbers. Here we provided strings, but we could also provide any other type of object.

### Using an Index

- Defined by an ndarray of the same size as the series
- Used as a guide for operations

In [None]:
np1 = pd.Series([24, 20, 55, 32, 100], index=['mouse5', 'mouse6', 'mouse9', 'mouse2', 'mouse10'])
np2 = pd.Series([20, 20, 33, 51], index=['mouse5', 'mouse6', 'mouse2', 'mouse91'])

In [None]:
np1

In [None]:
np2.size

In [None]:
np1['mouse9']

In [None]:
np2['mouse9']  # raises KeyError: 'mouse9' is not a label in np2's index

In [None]:
np1.add(np2, fill_value=0)
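To make the role of the index in arithmetic concrete, here is a minimal self-contained sketch. The Series `a` and `b` and their values are made up for illustration. Plain `+` aligns on labels and gives `nan` wherever a label is missing on either side, while `add` with `fill_value=0` treats the missing side as zero.

```python
import pandas as pd

# Hypothetical data: two Series that share only some labels
a = pd.Series([1, 2, 3], index=['mouse2', 'mouse5', 'mouse6'])
b = pd.Series([10, 20], index=['mouse5', 'mouse9'])

plain = a + b                    # labels are aligned; unmatched labels become NaN
filled = a.add(b, fill_value=0)  # a label missing on one side is treated as 0

print(plain)
print(filled)
```

Note that the result covers the union of both indices, not just the labels of the first Series.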

---
<a id='dataframes'></a>
## DataFrames

- 2-dimensional Pandas objects (vs 1-dimensional Series)
- Essentially a column stack of Series
- Similar to data frames of R

In [None]:
# Calcium imaging data
ca_data = np.loadtxt('data_for_lectures/ca-traces.txt', delimiter=',')

In [None]:
pd.DataFrame(ca_data).head()

Add labels to really take advantage of Pandas

In [None]:
neuron_name = ['neuron{}'.format(n) for n in range(ca_data.shape[0])]
df_ca = pd.DataFrame(ca_data.T, index=ts, columns=neuron_name)
df_ca.tail()

You can give your columns and indices names for convenience.

In [None]:
df_ca.columns.name

In [None]:
df_ca.columns.name = 'neuron'
df_ca.index.name = 'time'
df_ca

### Selection and Indexing

- Choose columns with the brackets `[]` we are used to (unless you pass a boolean mask, which selects rows; see [conditional selection](#conditional))
- `loc` indexer selects by label: row label first, then (optionally) column label
- `iloc` indexer selects rows and columns by integer position (like a numpy array)

In [None]:
type(df_ca['neuron0'])

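You can pass a list of column names as well (just like indexing a numpy array with a list)

In [None]:
# Brackets select columns, so a (row, column) pair such as df_ca[0.0, 'neuron0'] raises KeyError
df_ca[['neuron0', 'neuron1']]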

### Creating a new column

In [None]:
df_ca['neuron46'] = df_ca['neuron0'] * 2
df_ca['neuron46']

In [None]:
df_ca

### Removing columns

In [None]:
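df_ca.drop('neuron46', axis=1)  # returns a new DataFrame without the column
df_ca['neuron46']  # still there: df_ca itself was not changed

Most methods do not modify the DataFrame in place!  
Instead, they return a modified **copy**.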

In [None]:
df_ca = df_ca.drop('neuron46', axis=1)

# Or use parameter `inplace`
# df_ca.drop('neuron46', axis=1, inplace=True)

Can also drop rows this way:

In [None]:
df_ca.drop(0.0, axis=0)

In [None]:
df_ca[['neuron0', 'neuron1']]

### Selecting rows

In [None]:
df_ca.loc[0:10]

### Selecting by row AND column

In [None]:
df_ca.loc[0:100, 'neuron0':'neuron3']

**Notice that slicing with `loc` INCLUDES the stop label**

In [None]:
df_ca.iloc[0:10]

In [None]:
df_ca.iloc[0:500, 0:3]

**Notice slicing with iloc EXCLUDES stop index**  
Just like numpy array slicing
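The difference between the two slicing rules is easy to check on a tiny frame. The sketch below uses a made-up DataFrame `toy` with a float index, loosely modelled on `df_ca`; the names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a float "time" index, similar in spirit to df_ca
toy = pd.DataFrame(
    np.arange(12).reshape(6, 2),
    index=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
    columns=['neuron0', 'neuron1'],
)

by_label = toy.loc[0.0:0.4]   # label slice: the stop label 0.4 IS included
by_position = toy.iloc[0:2]   # position slice: the stop position 2 is NOT included

print(len(by_label), len(by_position))
```

The label slice returns three rows and the position slice two, even though both look like "the first few rows".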

<a id='conditional'></a>
### Conditional selection
- Very similar to how we did it in numpy
- A boolean mask in bracket notation selects **rows** (by index)
- `isin` method is useful for checking whether values match any of several candidates

#### Boolean masks

- Provide boolean mask on index (rows) using bracket notation `[]`

In [None]:
# in situ data
# insitu_ctrl = np.loadtxt('data_for_lectures/insitu-ctrl.txt')
# insitu_fat = np.loadtxt('data_for_lectures/insitu-fat.txt')

insitu_ctrl = pd.read_csv(
    'data_for_lectures/insitu-ctrl.txt',
    delimiter=' ',
    header=None,
).transpose()

insitu_fat = pd.read_csv(
    'data_for_lectures/insitu-fat.txt',
    delimiter=' ',
    header=None,
).transpose()

for df in [insitu_ctrl, insitu_fat]:
    df.columns = ['nts', 'vgat', 'vglut2']
    df.columns.name = 'gene'
    df.index.name = 'cell'

# insitu_ctrl.columns = ['nts', 'vgat', 'vglut2']
# insitu_ctrl.columns.name = 'gene'
# insitu_ctrl.index.name = 'cell'

# insitu_fat.columns = ['nts', 'vgat', 'vglut2']
# insitu_fat.columns.name = 'gene'
# insitu_fat.index.name = 'cell'
    
insitu_ctrl

In [None]:
insitu_ctrl.shape

In [None]:
insitu_fat.shape

In [None]:
list('abcde')

In [None]:
df = pd.DataFrame(np.arange(20).reshape(4, 5), columns=list('abcde'))
df

In [None]:
df[df['a'] > 7]

In [None]:
insitu_ctrl[insitu_ctrl['nts'] > 245]

In [None]:
insitu_ctrl[insitu_ctrl['nts'] > 245]['vgat']

For two conditions you can use `|` and `&` with parentheses.

In [None]:
insitu_ctrl[(insitu_ctrl['nts'] > 60) & (insitu_ctrl['vgat'] > 60)]
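The parentheses matter because `&` and `|` bind more tightly than comparisons such as `>`. As a small sketch on made-up data (the frame `cells` and its values are hypothetical), the same pattern covers AND, OR, and NOT (`~`):

```python
import pandas as pd

# Hypothetical expression values for five cells
cells = pd.DataFrame({
    'nts':  [70, 10, 80, 5, 0],
    'vgat': [65, 90, 20, 1, 0],
})

high_nts = cells['nts'] > 60
high_vgat = cells['vgat'] > 60

both = cells[high_nts & high_vgat]          # AND: both conditions hold
either = cells[high_nts | high_vgat]        # OR: at least one condition holds
neither = cells[~(high_nts | high_vgat)]    # NOT: invert the mask

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())
```

Storing each mask in a named variable first, as done here, avoids the nested parentheses altogether.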

#### `DataFrame.isin`

In [4]:
gene_expr = pd.read_csv('data_for_lectures/gene-expr.csv')
gene_expr

Unnamed: 0,myAUC,avg_diff,power,cluster,gene
0,0.897,1.622876,0.794,0,Gad2
1,0.861,1.151010,0.722,0,A030009H04Rik
2,0.857,1.025466,0.714,0,Nap1l5
3,0.845,1.056095,0.690,0,Zwint
4,0.839,0.980689,0.678,0,Zcchc18
5,0.838,0.910083,0.676,0,Ttc3
6,0.835,0.976621,0.670,0,Snhg11
7,0.835,0.959488,0.670,0,Meg3
8,0.832,0.983590,0.664,0,Celf4
9,0.831,1.233286,0.662,0,Gad1


In [9]:
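# Row positions where the mask is True (the mask is stored as `gaba_ix` in the next cell)
np.where(gene_expr['gene'].isin(['Gad2', 'Gad1', 'Slc32a1']))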

(array([   0,    9,   14, 2477, 2572, 2598, 2761, 2810, 2822, 3133, 3175,
        3176, 3764, 4111, 4781, 4944, 5055]),)

In [5]:
gaba_ix = gene_expr['gene'].isin(['Gad2', 'Gad1', 'Slc32a1'])
gaba_ix

0        True
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9        True
10      False
11      False
12      False
13      False
14       True
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
6836    False
6837    False
6838    False
6839    False
6840    False
6841    False
6842    False
6843    False
6844    False
6845    False
6846    False
6847    False
6848    False
6849    False
6850    False
6851    False
6852    False
6853    False
6854    False
6855    False
6856    False
6857    False
6858    False
6859    False
6860    False
6861    False
6862    False
6863    False
6864    False
6865    False
Name: gene, Length: 6866, dtype: bool

In [6]:
gene_expr[gaba_ix]

Unnamed: 0,myAUC,avg_diff,power,cluster,gene
0,0.897,1.622876,0.794,0,Gad2
9,0.831,1.233286,0.662,0,Gad1
14,0.825,1.18322,0.65,0,Slc32a1
2477,0.738,0.744624,0.476,16,Gad1
2572,0.723,0.752828,0.446,16,Slc32a1
2598,0.719,0.855336,0.438,16,Gad2
2761,0.838,1.094347,0.676,19,Gad2
2810,0.784,0.903642,0.568,19,Slc32a1
2822,0.772,0.87746,0.544,19,Gad1
3133,0.817,0.993828,0.634,22,Gad2


---
<a id='index'></a>
## Pandas Index (and columns)
- Array-like objects (backed by numpy arrays) that act as labels for the axes of DataFrames and Series
- Both the columns and the index (rows) of a DataFrame are Index objects

### Changing the index with methods `reset_index` and `set_index`

In [None]:
df_ca.head()

In [None]:
# Remove current index
df_ca = df_ca.reset_index()
df_ca.head()

In [None]:
new_ix = df_ca['time'] / 60  # `reset_index` moved the old index into a new column named 'time' (the index's name)
df_ca['time (m)'] = new_ix

In [None]:
df_ca.head()

In [None]:
# Set column 'time (m)' as the new index
df_ca = df_ca.set_index('time (m)')
df_ca.head()
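The round trip between index and column can be sketched on a tiny frame. The DataFrame `toy` below is made up for illustration; `drop=True` is a standard `reset_index` option that discards the old index instead of keeping it as a column.

```python
import pandas as pd

# Hypothetical frame with a named index
toy = pd.DataFrame(
    {'signal': [0.1, 0.4, 0.3]},
    index=pd.Index([0.0, 0.2, 0.4], name='time'),
)

as_column = toy.reset_index()           # old index becomes a column named 'time'
discarded = toy.reset_index(drop=True)  # old index is thrown away
restored = as_column.set_index('time')  # back to the original layout

print(as_column.columns.tolist())
print(discarded.columns.tolist())
```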

---
<a id='hierarchical'></a>
## Multi-Index and Index Hierarchy

- Add multiple levels to column and/or row labels
- Can simulate multidimensional data

In [10]:
columns = pd.MultiIndex.from_tuples(
    [
        ('Cre+', 'mouse2'),
        ('Cre-', 'mouse5'),
        ('Cre-', 'mouse6'),
        ('Cre+', 'mouse9'),
    ],
    names=['genotype', 'mouse']
)
index = pd.Index(['first half', 'second half'], name='time epoch')

df_np = pd.DataFrame([[32, 24, 20, 55], [33, 20, 20, 51]], columns=columns, index=index)
df_np

genotype,Cre+,Cre-,Cre-,Cre+
mouse,mouse2,mouse5,mouse6,mouse9
time epoch,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
first half,32,24,20,55
second half,33,20,20,51


Sort index to organize data with method `sort_index`.

In [11]:
df_np = df_np.sort_index(axis=1, level=0)
df_np

genotype,Cre+,Cre+,Cre-,Cre-
mouse,mouse2,mouse9,mouse5,mouse6
time epoch,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
first half,32,55,24,20
second half,33,51,20,20


### Indexing a hierarchical index

- Use **tuple** to index each level
- Remember a list would select multiple values on one axis (or level in the case of multilevels)
- Use `slice(None)` to select entire level
- `DataFrame.xs` method can select levels as well
- `DataFrame.iloc` is not affected by multilevel
- Use `DataFrame.sort_index` method to enable full indexing with hierarchical indices
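The options listed above can be compared side by side on a small frame. This is a sketch on made-up data (the frame `toy` is hypothetical). `pd.IndexSlice` is not mentioned in the list; it is included here only as an optional shorthand for `slice(None)`.

```python
import pandas as pd

# Hypothetical two-level row index: (cluster, gene)
rows = pd.MultiIndex.from_tuples(
    [(0, 'Gad1'), (0, 'Gad2'), (1, 'Gad1'), (1, 'Vim')],
    names=['cluster', 'gene'],
)
toy = pd.DataFrame({'power': [0.6, 0.8, 0.5, 0.9]}, index=rows).sort_index()

one_value = toy.loc[(0, 'Gad2'), 'power']         # tuple: one label per level
with_slice = toy.loc[(slice(None), 'Gad1'), :]    # every cluster, gene 'Gad1'
with_idx = toy.loc[pd.IndexSlice[:, 'Gad1'], :]   # same selection, shorter spelling
with_xs = toy.xs('Gad1', level='gene')            # same rows, 'gene' level dropped

print(with_slice)
print(with_xs)
```

The main practical difference is that `xs` drops the level you selected on, while `loc` keeps both levels in the result.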

In [12]:
df_np['Cre+']

mouse,mouse2,mouse9
time epoch,Unnamed: 1_level_1,Unnamed: 2_level_1
first half,32,55
second half,33,51


In [13]:
df_np[('Cre+', 'mouse2')]

time epoch
first half     32
second half    33
Name: (Cre+, mouse2), dtype: int64

Let's load in some gene expression data

In [14]:
gene_expr = pd.read_csv('data_for_lectures/gene-expr.csv')
gene_expr.head()

Unnamed: 0,myAUC,avg_diff,power,cluster,gene
0,0.897,1.622876,0.794,0,Gad2
1,0.861,1.15101,0.722,0,A030009H04Rik
2,0.857,1.025466,0.714,0,Nap1l5
3,0.845,1.056095,0.69,0,Zwint
4,0.839,0.980689,0.678,0,Zcchc18


Create hierarchical index

In [None]:
gene_expr = gene_expr.set_index(['cluster', 'gene'])
gene_expr = gene_expr.sort_index(axis=0)  # Necessary to fully index

In [17]:
gene_expr.tail()

Unnamed: 0_level_0,Unnamed: 1_level_0,myAUC,avg_diff,power
cluster,gene,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
41,mt-Nd1,0.93,0.766301,0.86
41,mt-Nd2,0.878,0.74101,0.756
41,mt-Nd5,0.814,0.822959,0.628
41,mt-Te,0.746,0.567904,0.492
41,mt-Tp,0.726,0.381493,0.452


Index DataFrame

In [19]:
gene_expr.loc[40]

Unnamed: 0_level_0,myAUC,avg_diff,power
gene,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1500015O10Rik,0.954,3.867684,0.908
Arl6ip1,0.81,1.348885,0.62
Calml4,0.727,2.138549,0.454
Clu,0.807,0.864561,0.614
Cox8a,0.728,1.27393,0.456
Crebrf,0.713,1.505117,0.426
Enpp2,0.808,2.715721,0.616
Etfb,0.768,1.614254,0.536
Fam213a,0.774,1.165361,0.548
Folr1,0.727,2.399573,0.454


Indexing both axes: row label first, then column label

In [21]:
gene_expr.loc[40, 'myAUC']
# gene_expr.loc[(40, slice(None)), 'myAUC']  # Alternative: a tuple with one entry per index level

gene
1500015O10Rik    0.954
Arl6ip1          0.810
Calml4           0.727
Clu              0.807
Cox8a            0.728
Crebrf           0.713
Enpp2            0.808
Etfb             0.768
Fam213a          0.774
Folr1            0.727
Ifi27            0.731
Kcnj13           0.818
Ldhb             0.719
Ndufa1           0.796
Slco1c1          0.703
Sostdc1          0.814
Ttr              0.861
Vamp8            0.713
Name: myAUC, dtype: float64

Indexing both axes: notice parentheses

In [23]:
gene_expr.loc[(slice(None), 'Slc32a1'), slice(None)]

Unnamed: 0_level_0,Unnamed: 1_level_0,myAUC,avg_diff,power
cluster,gene,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,Slc32a1,0.825,1.18322,0.65
16,Slc32a1,0.723,0.752828,0.446
19,Slc32a1,0.784,0.903642,0.568
22,Slc32a1,0.773,0.759802,0.546
41,Slc32a1,0.869,0.580712,0.738


#### `slice` object

- Arguments mirror the usual slice notation: `slice(start, stop, step)`

In [24]:
arr = np.arange(50).reshape(-1, 5)
arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24],
       [25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34],
       [35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44],
       [45, 46, 47, 48, 49]])

In [25]:
arr[0:2, :]

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

In [26]:
arr[slice(0, 2), slice(None)]

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

In [29]:
sl = slice(0, 2)
arr[:, sl]

array([[ 0,  1],
       [ 5,  6],
       [10, 11],
       [15, 16],
       [20, 21],
       [25, 26],
       [30, 31],
       [35, 36],
       [40, 41],
       [45, 46]])

In [35]:
gene_expr.loc[(slice(None), 'Slc32a1'), slice(None)]

Unnamed: 0_level_0,Unnamed: 1_level_0,myAUC,avg_diff,power
cluster,gene,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,Slc32a1,0.825,1.18322,0.65
16,Slc32a1,0.723,0.752828,0.446
19,Slc32a1,0.784,0.903642,0.568
22,Slc32a1,0.773,0.759802,0.546
41,Slc32a1,0.869,0.580712,0.738


#### `DataFrame.xs`

In [34]:
gene_expr.xs('Slc32a1', axis=0, level='gene')

Unnamed: 0_level_0,myAUC,avg_diff,power
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.825,1.18322,0.65
16,0.723,0.752828,0.446
19,0.784,0.903642,0.568
22,0.773,0.759802,0.546
41,0.869,0.580712,0.738


### Manipulating levels

- `DataFrame.stack` and `DataFrame.unstack` move levels between rows and columns
- `DataFrame.pivot` and `DataFrame.melt` are similar, but operate on ordinary columns of the data rather than on Index objects
- `DataFrame.droplevel`
- `DataFrame.reorder_levels`
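As a rough illustration of the first two bullets, the sketch below runs a `stack`/`unstack` round trip on index levels and a `pivot`/`melt` round trip on ordinary columns. All frames and values are made up, and the column name `np` is a hypothetical label for the measured quantity.

```python
import pandas as pd

# stack / unstack act on Index levels
cols = pd.MultiIndex.from_tuples(
    [('drug X', 'mouse2'), ('drug X', 'mouse5'),
     ('drug Y', 'mouse2'), ('drug Y', 'mouse5')],
    names=['drug', 'mouse'],
)
toy = pd.DataFrame(
    [[32, 24, 77, 65], [33, 20, 76, 69]],
    index=pd.Index(['first half', 'second half'], name='time epoch'),
    columns=cols,
)
long_form = toy.stack('mouse')            # 'mouse' moves from the columns to the rows
wide_form = long_form.unstack('mouse')    # and back to the columns

# pivot / melt act on ordinary columns
tidy = pd.DataFrame({
    'mouse': ['mouse2', 'mouse2', 'mouse5', 'mouse5'],
    'drug': ['drug X', 'drug Y', 'drug X', 'drug Y'],
    'np': [32, 77, 24, 65],
})
table = tidy.pivot(index='mouse', columns='drug', values='np')
back = table.reset_index().melt(id_vars='mouse', var_name='drug', value_name='np')

print(long_form)
print(table)
```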

In [36]:
columns = pd.MultiIndex.from_tuples(
    [
        ('drug X', 'Cre+', 'mouse2'),
        ('drug X', 'Cre-', 'mouse5'),
        ('drug X', 'Cre-', 'mouse6'),
        ('drug X', 'Cre+', 'mouse9'),
        ('drug Y', 'Cre+', 'mouse2'),
        ('drug Y', 'Cre-', 'mouse5'),
        ('drug Y', 'Cre-', 'mouse6'),
        ('drug Y', 'Cre+', 'mouse9'),
    ],
    names=['drug', 'genotype', 'mouse']
)
index = pd.Index(['first half', 'second half'], name='time epoch')

df_np_drug = pd.DataFrame([[32, 24, 20, 55, 77, 65, 66, 101], [33, 20, 20, 51, 76, 69, 68, 123]], columns=columns, index=index)
df_np_drug

drug,drug X,drug X,drug X,drug X,drug Y,drug Y,drug Y,drug Y
genotype,Cre+,Cre-,Cre-,Cre+,Cre+,Cre-,Cre-,Cre+
mouse,mouse2,mouse5,mouse6,mouse9,mouse2,mouse5,mouse6,mouse9
time epoch,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3
first half,32,24,20,55,77,65,66,101
second half,33,20,20,51,76,69,68,123


In [37]:
stacked = df_np_drug.stack(['mouse', 'genotype'])
# stacked = df_np_drug.stack(-1)
stacked

Unnamed: 0_level_0,Unnamed: 1_level_0,drug,drug X,drug Y
time epoch,mouse,genotype,Unnamed: 3_level_1,Unnamed: 4_level_1
first half,mouse2,Cre+,32.0,77.0
first half,mouse5,Cre-,24.0,65.0
first half,mouse6,Cre-,20.0,66.0
first half,mouse9,Cre+,55.0,101.0
second half,mouse2,Cre+,33.0,76.0
second half,mouse5,Cre-,20.0,69.0
second half,mouse6,Cre-,20.0,68.0
second half,mouse9,Cre+,51.0,123.0


In [113]:
stacked.sort_index(axis=0, level=['time epoch', 'genotype', 'mouse'])

Unnamed: 0_level_0,Unnamed: 1_level_0,drug,drug X,drug Y
time epoch,mouse,genotype,Unnamed: 3_level_1,Unnamed: 4_level_1
first half,mouse2,Cre+,32.0,77.0
first half,mouse9,Cre+,55.0,101.0
first half,mouse5,Cre-,24.0,65.0
first half,mouse6,Cre-,20.0,66.0
second half,mouse2,Cre+,33.0,76.0
second half,mouse9,Cre+,51.0,123.0
second half,mouse5,Cre-,20.0,69.0
second half,mouse6,Cre-,20.0,68.0


In [40]:
stacked.sort_index(axis=0, level='genotype')

Unnamed: 0_level_0,Unnamed: 1_level_0,drug,drug X,drug Y
time epoch,mouse,genotype,Unnamed: 3_level_1,Unnamed: 4_level_1
first half,mouse2,Cre+,32.0,77.0
first half,mouse9,Cre+,55.0,101.0
second half,mouse2,Cre+,33.0,76.0
second half,mouse9,Cre+,51.0,123.0
first half,mouse5,Cre-,24.0,65.0
first half,mouse6,Cre-,20.0,66.0
second half,mouse5,Cre-,20.0,69.0
second half,mouse6,Cre-,20.0,68.0


In [38]:
reordered = stacked.reorder_levels(['genotype', 'mouse', 'time epoch'], axis=0)
reordered

Unnamed: 0_level_0,Unnamed: 1_level_0,drug,drug X,drug Y
genotype,mouse,time epoch,Unnamed: 3_level_1,Unnamed: 4_level_1
Cre+,mouse2,first half,32.0,77.0
Cre-,mouse5,first half,24.0,65.0
Cre-,mouse6,first half,20.0,66.0
Cre+,mouse9,first half,55.0,101.0
Cre+,mouse2,second half,33.0,76.0
Cre-,mouse5,second half,20.0,69.0
Cre-,mouse6,second half,20.0,68.0
Cre+,mouse9,second half,51.0,123.0


In [39]:
sortedd = reordered.sort_index(axis=0)
sortedd

Unnamed: 0_level_0,Unnamed: 1_level_0,drug,drug X,drug Y
genotype,mouse,time epoch,Unnamed: 3_level_1,Unnamed: 4_level_1
Cre+,mouse2,first half,32.0,77.0
Cre+,mouse2,second half,33.0,76.0
Cre+,mouse9,first half,55.0,101.0
Cre+,mouse9,second half,51.0,123.0
Cre-,mouse5,first half,24.0,65.0
Cre-,mouse5,second half,20.0,69.0
Cre-,mouse6,first half,20.0,66.0
Cre-,mouse6,second half,20.0,68.0


### Stacking picture
![stack](http://nikgrozev.com/images/blog/Reshaping%20in%20Pandas%20-%20Pivot%20Pivot-Table%20Stack%20and%20Unstack%20explained%20with%20Pictures/stack-unstack1.png)
nikgrozev.com

---
<a id='missing'></a>
## Missing data and `nan`

- We've already seen how missing data points are filled with `nan`
- Pandas reductions such as `mean` and `max` skip `nan` by default (unlike numpy, remember `nanmean` and `nanmax`?)
- Rows or columns containing `nan` can be removed with the `dropna` method
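The difference from numpy in the second bullet can be checked directly. This is a minimal sketch on a made-up Series of body weights; `skipna` is the standard pandas parameter that controls the behaviour.

```python
import numpy as np
import pandas as pd

weights = pd.Series([29.0, np.nan, 31.0])  # hypothetical body weights

numpy_mean = np.mean(weights.to_numpy())        # nan: plain numpy propagates missing values
numpy_nanmean = np.nanmean(weights.to_numpy())  # 30.0: the nan-aware numpy variant
pandas_mean = weights.mean()                    # 30.0: pandas skips NaN by default
pandas_strict = weights.mean(skipna=False)      # nan: opt back in to propagation
n_missing = weights.isna().sum()                # 1: count the missing values

print(numpy_mean, numpy_nanmean, pandas_mean, pandas_strict, n_missing)
```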

In [41]:
df_wts = pd.DataFrame({
    'mouse2': [29, 29, np.nan, 30, 29],
    'mouse5': [31, 30, np.nan, 30, 30],
    'mouse6': [33, 32, np.nan, np.nan, 33]
})

In [42]:
df_wts

Unnamed: 0,mouse2,mouse5,mouse6
0,29.0,31.0,33.0
1,29.0,30.0,32.0
2,,,
3,30.0,30.0,
4,29.0,30.0,33.0


In [43]:
df_wts.mean()

mouse2    29.250000
mouse5    30.250000
mouse6    32.666667
dtype: float64

In [44]:
df_wts.dropna(axis=0, how='any')

Unnamed: 0,mouse2,mouse5,mouse6
0,29.0,31.0,33.0
1,29.0,30.0,32.0
4,29.0,30.0,33.0


In [45]:
df_wts.dropna(axis=0, how='all')

Unnamed: 0,mouse2,mouse5,mouse6
0,29.0,31.0,33.0
1,29.0,30.0,32.0
3,30.0,30.0,
4,29.0,30.0,33.0


In [46]:
df_wts.fillna(value=df_wts.mean())

Unnamed: 0,mouse2,mouse5,mouse6
0,29.0,31.0,33.0
1,29.0,30.0,32.0
2,29.25,30.25,32.666667
3,30.0,30.0,32.666667
4,29.0,30.0,33.0


In [47]:
df_wts.interpolate(method='linear', axis=0)

Unnamed: 0,mouse2,mouse5,mouse6
0,29.0,31.0,33.0
1,29.0,30.0,32.0
2,29.5,30.0,32.333333
3,30.0,30.0,32.666667
4,29.0,30.0,33.0


---
<a id='group'></a>
## `DataFrame.groupby` and `DataFrame.apply`

- Used to apply operations to subsets of a DataFrame

![groupby](https://i.stack.imgur.com/sgCn1.jpg)
stackoverflow.com

### Partitioning (grouping) data

In [48]:
df = pd.read_csv('data_for_lectures/gene-expr.csv')
df.head()

Unnamed: 0,myAUC,avg_diff,power,cluster,gene
0,0.897,1.622876,0.794,0,Gad2
1,0.861,1.15101,0.722,0,A030009H04Rik
2,0.857,1.025466,0.714,0,Nap1l5
3,0.845,1.056095,0.69,0,Zwint
4,0.839,0.980689,0.678,0,Zcchc18


In [50]:
df[df['cluster'] == 0]

Unnamed: 0,myAUC,avg_diff,power,cluster,gene
0,0.897,1.622876,0.794,0,Gad2
1,0.861,1.151010,0.722,0,A030009H04Rik
2,0.857,1.025466,0.714,0,Nap1l5
3,0.845,1.056095,0.690,0,Zwint
4,0.839,0.980689,0.678,0,Zcchc18
5,0.838,0.910083,0.676,0,Ttc3
6,0.835,0.976621,0.670,0,Snhg11
7,0.835,0.959488,0.670,0,Meg3
8,0.832,0.983590,0.664,0,Celf4
9,0.831,1.233286,0.662,0,Gad1


In [52]:
df.groupby('cluster').max()

Unnamed: 0_level_0,myAUC,avg_diff,power,gene
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,0.897,1.622876,0.794,Zwint
1,0.864,1.291036,0.728,Zeb2
2,0.901,1.617671,0.802,Zwint
3,0.955,4.077598,0.91,Vim
4,0.869,1.362237,0.738,mt-Rnr2
5,0.961,2.531546,0.922,Zdhhc20
6,0.856,1.159741,0.712,Ywhaq
7,0.711,0.819242,0.422,Nap1l5
8,0.887,1.52628,0.774,Tsc22d4
9,0.953,2.442071,0.906,Ttyh1


In [56]:
df_np

genotype,Cre+,Cre+,Cre-,Cre-
mouse,mouse2,mouse9,mouse5,mouse6
time epoch,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
first half,32,55,24,20
second half,33,51,20,20


In [55]:
df_np.describe()

genotype,Cre+,Cre+,Cre-,Cre-
mouse,mouse2,mouse9,mouse5,mouse6
count,2.0,2.0,2.0,2.0
mean,32.5,53.0,22.0,20.0
std,0.707107,2.828427,2.828427,0.0
min,32.0,51.0,20.0,20.0
25%,32.25,52.0,21.0,20.0
50%,32.5,53.0,22.0,20.0
75%,32.75,54.0,23.0,20.0
max,33.0,55.0,24.0,20.0


In [59]:
df.groupby('cluster')

<pandas.core.groupby.DataFrameGroupBy object at 0x7f4800645d90>

In [58]:
for x, y in df.groupby('cluster'):
    print(x, y)

0      myAUC  avg_diff  power  cluster           gene
0    0.897  1.622876  0.794        0           Gad2
1    0.861  1.151010  0.722        0  A030009H04Rik
2    0.857  1.025466  0.714        0         Nap1l5
3    0.845  1.056095  0.690        0          Zwint
4    0.839  0.980689  0.678        0        Zcchc18
5    0.838  0.910083  0.676        0           Ttc3
6    0.835  0.976621  0.670        0         Snhg11
7    0.835  0.959488  0.670        0           Meg3
8    0.832  0.983590  0.664        0          Celf4
9    0.831  1.233286  0.662        0           Gad1
10   0.830  0.986635  0.660        0         Snap25
11   0.830  0.937059  0.660        0         Atp1b1
12   0.828  1.141674  0.656        0           Syt1
13   0.828  0.962677  0.656        0         Impact
14   0.825  1.183220  0.650        0        Slc32a1
15   0.825  0.973680  0.650        0            Nsf
16   0.825  0.964731  0.650        0          Vsnl1
17   0.823  0.969593  0.646        0       Atp6v1g2
18   0.821

13       myAUC  avg_diff  power  cluster      gene
1165  0.994  3.266682  0.988       13      Cst3
1166  0.993  4.606169  0.986       13      Ctss
1167  0.984  4.173578  0.968       13      Hexb
1168  0.977  4.139191  0.954       13      C1qb
1169  0.951  1.757091  0.902       13    Tmsb4x
1170  0.949  3.804534  0.898       13      C1qc
1171  0.944  3.549107  0.888       13     Csf1r
1172  0.944  2.677279  0.888       13       B2m
1173  0.942  3.698652  0.884       13    Cx3cr1
1174  0.936  2.679696  0.872       13      Ctsd
1175  0.931  2.065258  0.862       13     Sparc
1176  0.928  3.490830  0.856       13   Siglech
1177  0.916  3.363064  0.832       13    Laptm5
1178  0.907  2.601856  0.814       13      Egr1
1179  0.904  3.331499  0.808       13      C1qa
1180  0.903  3.343281  0.806       13    P2ry12
1181  0.903  3.297767  0.806       13    Tyrobp
1182  0.902  2.039653  0.804       13       Jun
1183  0.891  3.178250  0.782       13    Fcer1g
1184  0.889  3.274495  0.778       13

35       myAUC  avg_diff  power  cluster     gene
4187  0.912  1.528597  0.824       35      Trf
4188  0.886  1.140721  0.772       35     Fth1
4189  0.883  1.351313  0.766       35    Stmn4
4190  0.867  1.326315  0.734       35     Car2
4191  0.865  1.336836  0.730       35      Cnp
4192  0.851  1.331228  0.702       35    Lamp1
4193  0.841  1.246124  0.682       35     Glul
4194  0.840  1.035778  0.680       35      Mbp
4195  0.838  1.118583  0.676       35    Cryab
4196  0.820  1.585150  0.640       35   Opalin
4197  0.818  1.117940  0.636       35      Mag
4198  0.817  1.212829  0.634       35     Qdpr
4199  0.816  0.694145  0.632       35  mt-Cytb
4200  0.812  1.271293  0.624       35     Gatm
4201  0.793  1.202172  0.586       35    Olig1
4202  0.789  0.902882  0.578       35    Enpp2
4203  0.783  0.932776  0.566       35    Aplp1
4204  0.781  0.923248  0.562       35      Mal
4205  0.777  0.966045  0.554       35    Sept4
4206  0.773  1.097230  0.546       35  Gm21984
4207  0.77

In [53]:
df.groupby('cluster').describe().head()

Unnamed: 0_level_0,avg_diff,avg_diff,avg_diff,avg_diff,avg_diff,avg_diff,avg_diff,avg_diff,myAUC,myAUC,myAUC,myAUC,myAUC,power,power,power,power,power,power,power,power
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
cluster,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
0,323.0,0.624531,0.175893,0.364205,0.491062,0.596651,0.72939,1.622876,323.0,0.747486,...,0.7725,0.897,323.0,0.494972,0.076609,0.402,0.428,0.478,0.545,0.794
1,109.0,0.868817,0.169609,0.467214,0.761172,0.851672,0.978899,1.291036,109.0,0.771679,...,0.814,0.864,109.0,0.543358,0.097336,0.402,0.464,0.534,0.628,0.728
2,342.0,0.599796,0.174228,0.230671,0.46627,0.579214,0.694755,1.617671,342.0,0.743477,...,0.76575,0.901,342.0,0.486953,0.069728,0.402,0.43,0.47,0.5315,0.802
3,54.0,2.425819,0.700234,0.615907,2.012508,2.392945,2.899495,4.077598,54.0,0.787852,...,0.83575,0.955,54.0,0.575704,0.14209,0.404,0.459,0.532,0.6715,0.91
4,1.0,1.362237,,1.362237,1.362237,1.362237,1.362237,1.362237,1.0,0.869,...,0.869,0.869,1.0,0.738,,0.738,0.738,0.738,0.738,0.738


### Apply function to partitions of (or whole) dataframe

In [60]:
df.groupby('cluster').apply(np.amax).head()

Unnamed: 0_level_0,myAUC,avg_diff,power,cluster,gene
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,0.897,1.622876,0.794,0,Zwint
1,0.864,1.291036,0.728,1,Zeb2
2,0.901,1.617671,0.802,2,Zwint
3,0.955,4.077598,0.91,3,Vim
4,0.869,1.362237,0.738,4,mt-Rnr2


In [62]:
df.iloc[:, :-1].groupby('cluster').apply(lambda x: x.max() ** 2).head()

Unnamed: 0_level_0,myAUC,avg_diff,power,cluster
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,0.804609,2.633727,0.630436,0.0
1,0.746496,1.666774,0.529984,1.0
2,0.811801,2.616858,0.643204,4.0
3,0.912025,16.626801,0.8281,9.0
4,0.755161,1.85569,0.544644,16.0
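One thing to keep in mind with `groupby(...).max()` is that the maximum is taken for each column independently, so a text column such as `gene` shows the alphabetically last name in the group, not the gene with the highest score. The sketch below, on a made-up frame, shows one way to compute several statistics at once with `agg` and to look up the row holding each group's maximum with `idxmax`.

```python
import pandas as pd

# Hypothetical marker-gene table
toy = pd.DataFrame({
    'cluster': [0, 0, 1, 1, 1],
    'gene': ['Gad1', 'Gad2', 'Vim', 'Cst3', 'Ctss'],
    'power': [0.66, 0.79, 0.91, 0.99, 0.98],
})

# Several statistics per group in one call
summary = toy.groupby('cluster')['power'].agg(['count', 'mean', 'max'])

# idxmax returns, per group, the row label of the largest value
best_rows = toy.groupby('cluster')['power'].idxmax()
top_gene = toy.loc[best_rows, ['cluster', 'gene', 'power']]

print(summary)
print(top_gene)
```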


### More groupby

In [63]:
grouping = df.groupby('cluster')
grouping

<pandas.core.groupby.DataFrameGroupBy object at 0x7f480057eb10>

In [64]:
for label, grp in grouping:
    print('cluster {} myAUC max: {}'.format(label, grp['myAUC'].max()))

cluster 0 myAUC max: 0.897
cluster 1 myAUC max: 0.864
cluster 2 myAUC max: 0.901
cluster 3 myAUC max: 0.955
cluster 4 myAUC max: 0.869
cluster 5 myAUC max: 0.961
cluster 6 myAUC max: 0.856
cluster 7 myAUC max: 0.711
cluster 8 myAUC max: 0.887
cluster 9 myAUC max: 0.953
cluster 10 myAUC max: 0.863
cluster 11 myAUC max: 0.933
cluster 12 myAUC max: 0.832
cluster 13 myAUC max: 0.994
cluster 14 myAUC max: 0.958
cluster 15 myAUC max: 0.872
cluster 16 myAUC max: 0.875
cluster 17 myAUC max: 0.882
cluster 18 myAUC max: 0.87
cluster 19 myAUC max: 0.872
cluster 20 myAUC max: 0.795
cluster 21 myAUC max: 0.939
cluster 22 myAUC max: 0.862
cluster 23 myAUC max: 0.709
cluster 24 myAUC max: 0.849
cluster 25 myAUC max: 0.855
cluster 26 myAUC max: 0.904
cluster 27 myAUC max: 0.74
cluster 28 myAUC max: 0.723
cluster 29 myAUC max: 0.978
cluster 30 myAUC max: 0.994
cluster 31 myAUC max: 1.0
cluster 32 myAUC max: 0.894
cluster 33 myAUC max: 0.841
cluster 34 myAUC max: 1.0
cluster 35 myAUC max: 0.912
cluster 

---
<a id='combine'></a>
## Merging, joining, and concatenating

- `concat` can do almost everything for you
- Other options are `merge` and `join` (`append` was removed in pandas 2.0; use `concat` instead)
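Since the rest of this section concentrates on `concat`, here is a brief sketch of `merge` and `join` on two made-up tables (the frames `genotype` and `weight` and their values are hypothetical). `merge` matches rows on a shared column, while `join` matches on the index.

```python
import pandas as pd

# Hypothetical tables that share a 'mouse' column but not all of the same mice
genotype = pd.DataFrame({'mouse': ['mouse2', 'mouse5', 'mouse6'],
                         'genotype': ['Cre+', 'Cre-', 'Cre-']})
weight = pd.DataFrame({'mouse': ['mouse2', 'mouse5', 'mouse9'],
                       'weight': [29, 31, 30]})

inner = pd.merge(genotype, weight, on='mouse', how='inner')  # only mice present in both
outer = pd.merge(genotype, weight, on='mouse', how='outer')  # all mice, NaN where missing

# join works on the index and defaults to a left join
joined = genotype.set_index('mouse').join(weight.set_index('mouse'))

print(inner)
print(outer)
print(joined)
```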

In [65]:
# in situ data
insitu_ctrl = pd.read_csv('data_for_lectures/insitu-ctrl.txt', delimiter=' ', header=None,).transpose()
insitu_fat = pd.read_csv('data_for_lectures/insitu-fat.txt', delimiter=' ', header=None,).transpose()

for df in [insitu_ctrl, insitu_fat]:
    df.columns = ['nts', 'vgat', 'vglut2']
    df.columns.name = 'gene'
    df.index.name = 'cell'
    
insitu_ctrl.head()

gene,nts,vgat,vglut2
cell,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,1.5831,32.2526,0.0
1,1.09,0.0,37.5083
2,3.5255,4.8921,4.9607
3,0.0,0.0,0.0
4,0.2963,1.3258,23.0985


### Concatenation

- `concat` combines a list of DataFrames
- Works much like numpy concatenation, except that pandas aligns on labels: where the DataFrames do not share a label, the gap is filled with `nan`
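A small sketch of that alignment behaviour, using two made-up frames of different lengths (`ctrl` and `fat` here are toy stand-ins, not the in situ data). The `keys` and `ignore_index` parameters are standard `concat` options.

```python
import pandas as pd

ctrl = pd.DataFrame({'nts': [1.0, 2.0, 3.0]})  # hypothetical: 3 cells
fat = pd.DataFrame({'nts': [4.0, 5.0]})        # hypothetical: 2 cells

# Side by side: rows are matched by index label, the missing row becomes NaN
side_by_side = pd.concat([ctrl, fat], axis=1, keys=['ctrl', 'fat'])

# End to end: row labels would repeat, so build a fresh 0..n-1 index
end_to_end = pd.concat([ctrl, fat], axis=0, ignore_index=True)

print(side_by_side)
print(end_to_end)
```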

In [66]:
pd.concat([insitu_ctrl, insitu_fat], axis=1).head()

gene,nts,vgat,vglut2,nts,vgat,vglut2
cell,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,1.5831,32.2526,0.0,9.492,9.785,7.114
1,1.09,0.0,37.5083,1.863,0.0,1.6823
2,3.5255,4.8921,4.9607,126.0144,9.3607,1.1116
3,0.0,0.0,0.0,4.9399,43.577,5.2057
4,0.2963,1.3258,23.0985,30.291,0.0,17.2772


In [67]:
pd.concat([insitu_ctrl, insitu_fat], axis=0).head()

gene,nts,vgat,vglut2
cell,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,1.5831,32.2526,0.0
1,1.09,0.0,37.5083
2,3.5255,4.8921,4.9607
3,0.0,0.0,0.0
4,0.2963,1.3258,23.0985


In [68]:
pd.concat([insitu_ctrl, insitu_fat], axis=0).sort_index().head()

gene,nts,vgat,vglut2
cell,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,1.5831,32.2526,0.0
0,9.492,9.785,7.114
1,1.09,0.0,37.5083
1,1.863,0.0,1.6823
2,3.5255,4.8921,4.9607


**Notice Index labels are maintained and we have repeats**

**`ignore_index` resets index to default**

In [69]:
pd.concat([insitu_ctrl, insitu_fat], axis=0, ignore_index=True).sort_index(axis=0).head()

gene,nts,vgat,vglut2
0,1.5831,32.2526,0.0
1,1.09,0.0,37.5083
2,3.5255,4.8921,4.9607
3,0.0,0.0,0.0
4,0.2963,1.3258,23.0985


In [70]:
pd.concat([insitu_ctrl, insitu_fat], axis=0, ignore_index=True).sort_index(axis=0).tail()

gene,nts,vgat,vglut2
25607,0.0,0.0,0.0
25608,2.0429,0.0,15.0927
25609,1.9199,0.0,33.4847
25610,3.1569,0.0,3.1647
25611,0.6409,0.247,0.0


In [71]:
pd.concat([insitu_ctrl, insitu_fat], axis=1, ignore_index=True).head()

Unnamed: 0_level_0,0,1,2,3,4,5
cell,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,1.5831,32.2526,0.0,9.492,9.785,7.114
1,1.09,0.0,37.5083,1.863,0.0,1.6823
2,3.5255,4.8921,4.9607,126.0144,9.3607,1.1116
3,0.0,0.0,0.0,4.9399,43.577,5.2057
4,0.2963,1.3258,23.0985,30.291,0.0,17.2772


**Add hierarchical level to keep organization using parameter `keys`.**

In [72]:
pd.concat([insitu_ctrl, insitu_fat], axis=0, keys=['ctrl', 'fat']).head()

Unnamed: 0_level_0,gene,nts,vgat,vglut2
Unnamed: 0_level_1,cell,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ctrl,0,1.5831,32.2526,0.0
ctrl,1,1.09,0.0,37.5083
ctrl,2,3.5255,4.8921,4.9607
ctrl,3,0.0,0.0,0.0
ctrl,4,0.2963,1.3258,23.0985


In [73]:
pd.concat([insitu_ctrl, insitu_fat], axis=0, keys=['ctrl', 'fat']).tail()

Unnamed: 0_level_0,gene,nts,vgat,vglut2
Unnamed: 0_level_1,cell,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
fat,12690,0.0,0.0,0.0
fat,12691,2.0429,0.0,15.0927
fat,12692,1.9199,0.0,33.4847
fat,12693,3.1569,0.0,3.1647
fat,12694,0.6409,0.247,0.0


In [75]:
pd.concat([insitu_ctrl, insitu_fat], axis=1, keys=['ctrl', 'fat']).describe()

Unnamed: 0_level_0,ctrl,ctrl,ctrl,fat,fat,fat
gene,nts,vgat,vglut2,nts,vgat,vglut2
count,12917.0,12917.0,12917.0,12695.0,12695.0,12695.0
mean,13.348858,7.62455,8.05363,10.624147,8.5518,13.277236
std,34.302328,13.140338,16.984538,29.587007,16.013405,24.309336
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.2221,0.0,0.0,0.0,0.0,0.0
50%,2.502,1.8086,1.1097,1.8035,0.5357,1.7196
75%,6.832,9.5263,7.4907,5.8412,10.18535,15.8657
max,250.9453,190.6956,218.9755,244.6519,183.5165,215.0717


## Operations

There are lots of operations with pandas that will be really useful to you, but don't fall into any distinct category. Let's show them here in this lecture:

In [105]:
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()

Unnamed: 0,col1,col2,col3
0,1,444,abc
1,2,555,def
2,3,666,ghi
3,4,444,xyz


### Info on Unique Values

In [77]:
df['col2'].unique()

array([444, 555, 666])

In [78]:
df['col2'].nunique()

3

In [81]:
df['col2'].value_counts()

444    2
555    1
666    1
Name: col2, dtype: int64

### Applying Functions

In [82]:
def times2(x):
    return x*2

In [83]:
df['col1'].apply(times2)

0    2
1    4
2    6
3    8
Name: col1, dtype: int64

In [84]:
df['col3'].apply(len)

0    3
1    3
2    3
3    3
Name: col3, dtype: int64

In [85]:
df['col1'].sum()

10

**Permanently removing a column**

In [91]:
df.drop('col1', axis=1).columns  # `drop` returns a copy; df itself still has 'col1'

Index(['col2', 'col3'], dtype='object')

In [86]:
del df['col1']  # `del` removes the column from df itself

In [88]:
df.columns

Index(['col2', 'col3'], dtype='object')

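**Get and set column and index names:**

In [94]:
# Re-create the example DataFrame ('col1' was deleted above), then rename its columns
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.columns = ['a', 'b', 'c']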

In [95]:
df

Unnamed: 0,a,b,c
0,1,444,abc
1,2,555,def
2,3,666,ghi
3,4,444,xyz


In [93]:
df.index

RangeIndex(start=0, stop=4, step=1)

**Sorting and ordering a DataFrame:**

In [96]:
df

Unnamed: 0,a,b,c
0,1,444,abc
1,2,555,def
2,3,666,ghi
3,4,444,xyz


In [98]:
df.sort_values(by='b')  # inplace=False by default

Unnamed: 0,a,b,c
0,1,444,abc
3,4,444,xyz
1,2,555,def
2,3,666,ghi


**Check for null values**

In [99]:
df.isnull()

Unnamed: 0,a,b,c
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False


In [None]:
# Drop rows with NaN Values
df.dropna()

**Filling in NaN values with something else:**

In [None]:
data = {'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
        'B': ['one', 'one', 'two', 'two', 'one', 'one'],
        'C': ['x', 'y', 'x', 'y', 'x', 'y'],
        'D': [1, 3, 2, 5, 4, 1]}

df = pd.DataFrame(data)

In [None]:
df
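The DataFrame above contains no missing values, so it cannot show filling on its own. As a minimal sketch, assuming a made-up frame `toy` with gaps in both a numeric and a text column, `fillna` accepts either a single value or a dictionary with one replacement per column:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap in each column
toy = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': ['x', None, 'z']})

mask = toy.isnull()                 # True where a value is missing
zero_filled = toy['A'].fillna(0)    # one replacement value for a single column
per_column = toy.fillna({'A': toy['A'].mean(), 'B': 'unknown'})  # one per column

print(mask)
print(per_column)
```

Like most methods, `fillna` returns a new object and leaves `toy` unchanged.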

## Data Input and Output

- Pandas provides a family of `pd.read_*` functions (for example `read_csv` and `read_excel`), with matching `to_*` methods on DataFrames
- Can handle text files and Excel files

### CSV Input

In [103]:
df = pd.read_csv('data_for_lectures/gene-expr.csv')
df

Unnamed: 0,myAUC,avg_diff,power,cluster,gene
0,0.897,1.622876,0.794,0,Gad2
1,0.861,1.151010,0.722,0,A030009H04Rik
2,0.857,1.025466,0.714,0,Nap1l5
3,0.845,1.056095,0.690,0,Zwint
4,0.839,0.980689,0.678,0,Zcchc18
5,0.838,0.910083,0.676,0,Ttc3
6,0.835,0.976621,0.670,0,Snhg11
7,0.835,0.959488,0.670,0,Meg3
8,0.832,0.983590,0.664,0,Celf4
9,0.831,1.233286,0.662,0,Gad1


### CSV Output

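In [106]:
# Switch back to the small example DataFrame from the Operations section
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df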

Unnamed: 0,col1,col2,col3
0,1,444,abc
1,2,555,def
2,3,666,ghi
3,4,444,xyz


In [109]:
df.to_csv('example.csv')

In [110]:
pd.read_csv('example.csv')

Unnamed: 0.1,Unnamed: 0,col1,col2,col3
0,0,1,444,abc
1,1,2,555,def
2,2,3,666,ghi
3,3,4,444,xyz
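The extra unnamed column in the output above appears because `to_csv` writes the index by default and `read_csv` then reads it back as ordinary data. Two common ways around this are sketched below. The sketch writes to an in-memory `io.StringIO` buffer instead of `example.csv` purely so that it runs without touching the disk; that choice, and the frame `toy`, are assumptions for illustration.

```python
import io
import pandas as pd

toy = pd.DataFrame({'col1': [1, 2], 'col2': [444, 555]})

# Option 1: do not write the index at all
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
no_index = pd.read_csv(buf)

# Option 2: write the index, then tell read_csv that column 0 holds it
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf, index_col=0)

print(no_index.columns.tolist())
print(with_index.columns.tolist())
```

Option 1 suits a default integer index that carries no information; option 2 is the safer choice when the index holds meaningful labels.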


Pandas can read and write Excel files. Keep in mind that this only imports data, not formulas or images; workbooks that contain images or macros may cause `read_excel` to fail.

### Excel Input

In [None]:
pd.read_excel('Excel_Sample.xlsx', sheet_name='Sheet1')

### Excel Output

In [None]:
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')