<img src="support_files/cropped-SummerWorkshop_Header.png">  

<h1 align="center">Python Bootcamp</h1> 
<h3 align="center">August 20-21, 2016, Seattle, WA</h3> 

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<center><h1>Introduction to Pandas</h1></center>

<p>
**`pandas`** is a library with high-level data structures and manipulation tools:
<p><ul> 
<li>Data loading/saving
<li>Data exploration
<li>Filtering, selecting
<li>Plotting/visualization
<li>Computing summary statistics
<li>Groupby operations
</ul>

<p>
**DataFrame Object**
<ul>
<li>Represents a tabular, spreadsheet-like data structure
<li>Ordered collection of columns
<li>Each column can be a different value type (numeric, string, boolean, etc.)
</ul>
<p>This introduction only scratches the surface of pandas functionality. For more information, check out the full documentation here: 
<p>&nbsp;&nbsp;&nbsp;&nbsp;http://pandas.pydata.org/pandas-docs/stable/index.html
<p>Or check out the '10 minutes to Pandas' tutorial here: 
<p>&nbsp;&nbsp;&nbsp;&nbsp;http://pandas.pydata.org/pandas-docs/stable/10min.html
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Imports</h2>
<p>
</div>

In [1]:
# Convention for import naming
import pandas as pd

In [2]:
# __future__ imports belong before any other statement
from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt

%matplotlib notebook

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>Set a few optional pandas display settings:
</div>

In [9]:
# Format DataFrame display properties
pd.set_option('display.max_rows', 30) #maximum number of rows to display
pd.set_option('display.max_columns', 500) #maximum number of columns to display
pd.set_option('display.notebook_repr_html',True) #ensure that html display mode is enabled for best display
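
The settings above are global and stay in effect for the rest of the session. If you only want to change a display option temporarily, one possible approach is the ```pd.option_context``` context manager. The sketch below is illustrative only; the value 5 is an arbitrary choice, not something the bootcamp prescribes.

```python
import pandas as pd

# Remember the current global setting so we can compare afterwards
rows_before = pd.get_option('display.max_rows')

# Inside the `with` block the option is temporarily overridden...
with pd.option_context('display.max_rows', 5):
    rows_inside = pd.get_option('display.max_rows')

# ...and it is restored automatically when the block ends
rows_after = pd.get_option('display.max_rows')

print(rows_before, rows_inside, rows_after)
```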

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Loading data</h2>
<p>Pandas has a lot of convenient built-in methods for reading data of various formats.
<p>Make and print a list of all of the Pandas methods with the word 'read' in them:
</div>

In [4]:
read_methods = [x for x in dir(pd) if 'read' in x]
for method in read_methods:
    print(method)

read_clipboard
read_csv
read_excel
read_feather
read_fwf
read_gbq
read_hdf
read_html
read_json
read_msgpack
read_parquet
read_pickle
read_sas
read_spss
read_sql
read_sql_query
read_sql_table
read_stata
read_table



<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Load tabular data from CSV file</h2>

<p>A simple CSV file is saved in the ```support_files``` folder inside your working directory. We'll take a minute to open the file and view it.
<p>Pandas can quickly load and display it. Note that it automatically parses the column names from the first line of the file.
</div>

In [10]:
sample_dataframe = pd.read_csv('support_files/SampleWorkbook.csv')
sample_dataframe

Unnamed: 0,Column 1,Column 2
0,one,1
1,two,2
2,three,3
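
To see the header parsing without needing a file on disk, here is a minimal sketch that feeds ```read_csv``` an in-memory string. The CSV text is a hypothetical stand-in that mimics the sample table; it is not the actual contents of SampleWorkbook.csv.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """Column 1,Column 2
one,1
two,2
three,3
"""

# read_csv accepts any file-like object, not just a path or URL
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.columns.tolist())   # column names are taken from the first line
print(demo.dtypes)             # one dtype per column, inferred from the values
```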


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
We can access a particular row and column of the dataframe as follows:
</div>

In [11]:
print(sample_dataframe['Column 2'][0])

1
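
The cell above picks the column first and then the row (chained indexing). As a sketch of two alternatives, ```.loc``` and ```.at``` take the row label and the column name in a single step. The small DataFrame here is a made-up stand-in for the sample table.

```python
import pandas as pd

# Small stand-in for the sample table above
demo = pd.DataFrame({'Column 1': ['one', 'two', 'three'],
                     'Column 2': [1, 2, 3]})

# Chained indexing: pick the column, then the row label
value_chained = demo['Column 2'][0]

# .loc takes [row label, column name] in a single step
value_loc = demo.loc[0, 'Column 2']

# .at is the scalar-only counterpart of .loc
value_at = demo.at[0, 'Column 2']

print(value_chained, value_loc, value_at)
```

The single-step forms are generally the safer habit when you later want to assign a value, since chained indexing may operate on a temporary copy.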


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Load 'Mouse Connectivity Atlas Metadata' into DataFrame using url address</h2>
<p>We know in advance that this data is saved in comma separated value (CSV) format, so we can use the ```read_csv``` method.
</div>

In [12]:
# note the line continuations to keep the long URL from continuing outside of our cell
url_csv_file = 'http://connectivity.brain-map.org/projection/csv?'\
               'criteria=service::mouse_connectivity_injection_structure'\
               '[injection_structures$eq8,304325711][primary_structure_only$eqtrue]'
df = pd.read_csv(url_csv_file)

# The above code will download a file; if you are having trouble with the download, 
# you can try using the pre-cached file on your hard drive with the following path.
# (Call a TA to help if this doesn't work either.)
# csv_file = 'support_files/connectivity_metadata.csv'

# df = pd.read_csv(csv_file)

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

<h2>Use ```head()``` and ```tail()``` methods to take a quick look at the data structure</h2>
<p>The ```head()``` method displays the first N rows, with N=5 by default

<p>The ```tail()``` method displays the last N rows, with N=5 by default
</div>

In [13]:
df.head()

Unnamed: 0,id,transgenic-line,product-id,structure-id,structure-abbrev,structure-name,name,injection-volume,injection-structures,gender,strain,sum,structure-color,num-voxels,injection-coordinates,selected,experiment_page_url
0,180436360,,5,677,VISC,Visceral area,378-1827,1.13529,"[{""id""=>104, ""abbreviation""=>""AId"", ""name""=>""A...",M,C57BL/6J,35.602853,11ad83,,"[5520, 4470, 10080]",False,http://connectivity.brain-map.org/projection/e...
1,180435652,,5,895,ECT,Ectorhinal area,378-1825,1.075153,"[{""id""=>541, ""abbreviation""=>""TEa"", ""name""=>""T...",M,C57BL/6J,32.90317,0d9f91,,"[7860, 3740, 10390]",False,http://connectivity.brain-map.org/projection/e...
2,180719293,,5,993,MOs,Secondary motor area,378-1822,1.047049,"[{""id""=>104, ""abbreviation""=>""AId"", ""name""=>""A...",M,C57BL/6J,29.847877,1f9d5a,,"[2460, 2960, 8170]",False,http://connectivity.brain-map.org/projection/e...
3,167902586,Rbp4-Cre_KL100,5,746,ORBvl,"Orbital area, ventrolateral part",Rbp4-Cre-129,0.526425,"[{""id""=>723, ""abbreviation""=>""ORBl"", ""name""=>""...",F,,27.922282,248a5e,,"[2710, 3990, 6400]",False,http://connectivity.brain-map.org/projection/e...
4,180709942,,5,993,MOs,Secondary motor area,378-1812,0.808988,"[{""id""=>985, ""abbreviation""=>""MOp"", ""name""=>""P...",M,C57BL/6J,27.776882,1f9d5a,,"[2670, 2800, 8010]",False,http://connectivity.brain-map.org/projection/e...


In [14]:
df.tail(2)

Unnamed: 0,id,transgenic-line,product-id,structure-id,structure-abbrev,structure-name,name,injection-volume,injection-structures,gender,strain,sum,structure-color,num-voxels,injection-coordinates,selected,experiment_page_url
2916,501837158,Ai75(RCL-nT),42,385,VISp,Primary visual area,Ai75(T601)-208859,0.001036,"[{""id""=>385, ""abbreviation""=>""VISp"", ""name""=>""...",F,,0.010353,08858c,,"[8910, 840, 3340]",False,http://connectivity.brain-map.org/projection/e...
2917,516848906,Drd3-Cre_KI196,5,48,ACAv,"Anterior cingulate area, ventral part",Drd3-Cre_KI196_LURC-128044,0.002299,"[{""id""=>39, ""abbreviation""=>""ACAd"", ""name""=>""A...",M,,0.00953,40a666,,"[4630, 2600, 6110]",False,http://connectivity.brain-map.org/projection/e...


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>Print a particular value
</div>

In [15]:
print(df['experiment_page_url'][371])

http://connectivity.brain-map.org/projection/experiment/146078721


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Many familiar functions/methods work with DataFrames</h2>
<p>
</div>

In [16]:
# numpy function
np.shape(df)

(2918, 17)

In [17]:
# python built-in function
len(df)

2918

In [18]:
# methods
print(df.keys())
print("")
print(df.columns)

Index(['id', 'transgenic-line', 'product-id', 'structure-id',
       'structure-abbrev', 'structure-name', 'name', 'injection-volume',
       'injection-structures', 'gender', 'strain', 'sum', 'structure-color',
       'num-voxels', 'injection-coordinates', 'selected',
       'experiment_page_url'],
      dtype='object')

Index(['id', 'transgenic-line', 'product-id', 'structure-id',
       'structure-abbrev', 'structure-name', 'name', 'injection-volume',
       'injection-structures', 'gender', 'strain', 'sum', 'structure-color',
       'num-voxels', 'injection-coordinates', 'selected',
       'experiment_page_url'],
      dtype='object')


<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">
<p>**Exercise 6.1:**
<p>Identify another familiar function that works with the DataFrame
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>The dataframe object has a lot of useful built-in functions</h2>
<p>Start with ```unique```
</div>

In [19]:
print('Genders:',df['gender'].unique())

Genders: ['M' 'F']


In [20]:
print("transgenic lines in the dataset:")
for line in df['transgenic-line'].unique():
    print("  ",line)

transgenic lines in the dataset:
   nan
   Rbp4-Cre_KL100
   Slc6a4-Cre_ET33
   Vipr2-Cre_KE2
   Rasgrf2-T2A-dCre
   Emx1-IRES-Cre
   Gpr26-Cre_KO250
   Slc6a4-CreERT2_EZ13
   Erbb4-T2A-CreERT2
   Trib2-F2A-CreERT2
   Fezf1-T2A-dCre
   Efr3a-Cre_NO108
   Slc17a6-IRES-Cre
   Cux2-IRES-Cre
   Esr1-2A-Cre
   Ppp1r17-Cre_NL146
   Slc6a5-Cre_KF109
   Ntng2-IRES2-Cre
   Cart-Tg1-Cre
   Gnb4-IRES2-Cre
   Glt25d2-Cre_NF107
   Sim1-Cre_KJ18
   Grik4-Cre
   Cck-IRES-Cre
   Grm2-Cre_MR90
   Th-Cre_FI172
   Plxnd1-Cre_OG1
   Slc18a2-Cre_OZ14
   Slc6a3-Cre
   Syt17-Cre_NO14
   Prkcd-GluCla-CFP-IRES-Cre
   Slc32a1-IRES-Cre
   Adcyap1-2A-Cre
   Cnnm2-Cre_KD18
   Ntrk1-IRES-Cre
   Satb2-Cre_MO23
   A930038C07Rik-Tg1-Cre
   Calb1-T2A-dgCre
   Tlx3-Cre_PL56
   Chat-IRES-Cre-neo
   Drd1a-Cre_EY262
   Grp-Cre_KH288
   Lypd6-Cre_KL156
   Kcng4-Cre
   Tac1-IRES2-Cre
   Htr2a-Cre_KM207
   Cux2-CreERT2
   Drd3-Cre_KI198
   Etv1-CreERT2
   Syt6-Cre_KI148
   Pvalb-IRES-Cre
   Nkx2-1-CreERT2
   Dlg3-Cre_KG118
  

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

**Exercise 6.2:**
<ol>
<li> How many different transgenic lines were used in this dataset?
<li> How many different brain structures were injected in this dataset?
</ol>
</div>

In [23]:
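# Note: unique() includes NaN (the wild type entries), so the first count is one more
# than the number of actual transgenic lines
print(len(df['transgenic-line'].unique()))
print(len(df['structure-name'].unique()))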

124
217
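
Because ```unique()``` treats ```NaN``` as a value, counting its output can overcount by one. The following sketch, using a small hypothetical Series rather than the real dataset, compares three ways of counting distinct entries.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two real labels and two missing values
lines = pd.Series(['Rbp4-Cre_KL100', np.nan, 'Cux2-IRES-Cre', np.nan,
                   'Rbp4-Cre_KL100'])

n_with_nan = len(lines.unique())          # NaN is counted as one "value"
n_without_nan = lines.nunique()           # NaN is excluded by default
n_dropna = len(lines.dropna().unique())   # same result, spelled out

print(n_with_nan, n_without_nan, n_dropna)
```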


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

<h2>Selecting columns</h2>

<p>Retrieve column based on column name.
<p>There are two notations that allow you to access data from a column:
<ul>
<li>bracket notation
<li>dot notation
</ul>
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

<p>Bracket notation:

</div>

In [24]:
inj_vol = df['injection-volume']
inj_vol.head()

0    1.135290
1    1.075153
2    1.047049
3    0.526425
4    0.808988
Name: injection-volume, dtype: float64

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

<p>Dot notation:
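<p>Note that dot notation only works when the column name is a valid Python identifier (no spaces, dashes, etc.) and does not clash with an existing DataFrame attribute or method. For example, ```df.sum``` refers to the DataFrame method, not to the 'sum' column in this dataset.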

</div>

In [25]:
strain = df.strain
print(strain.head())

0    C57BL/6J
1    C57BL/6J
2    C57BL/6J
3         NaN
4    C57BL/6J
Name: strain, dtype: object
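
As a small illustration of where dot notation and bracket notation differ, the sketch below uses a made-up two-row DataFrame with a column called 'sum', a name that also appears in the connectivity dataset.

```python
import pandas as pd

# Made-up data; only the column names matter for this illustration
demo = pd.DataFrame({'strain': ['C57BL/6J', None],
                     'sum': [35.6, 32.9]})

# For a plain identifier, both notations return the same Series
same = demo.strain.equals(demo['strain'])

# 'sum' clashes with a DataFrame method: dot notation returns the method,
# while bracket notation returns the column
dot_is_method = callable(demo.sum)
bracket_is_column = isinstance(demo['sum'], pd.Series)

print(same, dot_is_method, bracket_is_column)
```

Bracket notation works for every column name, so it is the safer default when names come from an external file.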


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
The returned column is a Series object
</div>

In [26]:
print(type(strain))

<class 'pandas.core.series.Series'>


<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">
<p>**Exercise 6.3:**
<ol>
<li>What data type are entries in the column "injection-volume"?
<li>What data type are entries in the column "injection-coordinates"?
</ol>
</div>

In [28]:
print(type(df['injection-coordinates'][1]))

<class 'str'>


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

<h2>Get values as numpy ndarray</h2>
<p>
</div>


In [29]:
values_inj_vol = df['injection-volume'].values
values_inj_vol

array([1.13528986e+00, 1.07515251e+00, 1.04704854e+00, ...,
       3.46031783e-03, 1.03577877e-03, 2.29897664e-03])
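
Besides the ```.values``` attribute, pandas also offers a ```to_numpy()``` method (it appears in the list of ```to_``` methods later in this notebook). A minimal sketch with three made-up volumes shows that both give the same array:

```python
import numpy as np
import pandas as pd

volumes = pd.Series([1.135290, 1.075153, 1.047049], name='injection-volume')

arr_values = volumes.values        # attribute form, as used above
arr_to_numpy = volumes.to_numpy()  # method form

print(type(arr_to_numpy), arr_to_numpy.dtype)
print(np.array_equal(arr_values, arr_to_numpy))
```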

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

<p>Print the type of ```values_inj_vol```:
</div>

In [30]:
print(type(values_inj_vol))

<class 'numpy.ndarray'>


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">

<h2>Plot the injection volume values using Matplotlib</h2>
<p>We can use Matplotlib to plot the array that we just extracted from the dataframe:
</div>

In [31]:
# Plot array to inspect array
fig,ax = plt.subplots(1,1)
ax.plot(values_inj_vol,'.')

[<matplotlib.lines.Line2D at 0x7f087525fdc0>]

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Plot the injection volume values using the Pandas built-in plotting method</h2>
<p>Pandas also has a built-in plotting function that will allow us to make the plot directly from the dataframe
<p>It does some nice formatting for you, but you still have access to matplotlib methods
</div>

In [32]:
# The row index is used for the x-axis by default, so only y needs to be specified
ax = df.plot(y='injection-volume',marker='.',linestyle='none')

ax.set_title('Injection volumes for all rows')

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

**Exercise 6.4:**
<p>Retrieve a different column and make a plot of the data
</div>

In [33]:
fig,ax = plt.subplots()
ax.scatter(df['structure-id'],df['injection-volume'],c='k')

<matplotlib.collections.PathCollection at 0x7f08730dccd0>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Select multiple columns</h2>
<p>We can make a new dataframe that contains only a subset of the column data from the first dataframe
</div>

In [34]:
# Use copy to get new DataFrame object instead of a 'view' on existing object
df2 = df[['transgenic-line','injection-volume']].copy()

In [35]:
df2.head(10)

Unnamed: 0,transgenic-line,injection-volume
0,,1.13529
1,,1.075153
2,,1.047049
3,Rbp4-Cre_KL100,0.526425
4,,0.808988
5,Rbp4-Cre_KL100,0.483214
6,,1.298801
7,,0.56829
8,,1.414942
9,,1.457846
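
To make the effect of ```.copy()``` concrete, here is a short sketch on a made-up three-column DataFrame. After copying, changes to the subset do not touch the original.

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# Select two columns and take an explicit copy
subset = demo[['a', 'b']].copy()

# Modifying the copy leaves the original untouched
subset['a'] = subset['a'] * 10

print(subset['a'].tolist())
print(demo['a'].tolist())
```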


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Adding, deleting columns</h2>
<p>
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>Let's add a column denoting whether injection is in wild type or transgenic mouse.
<p>Note that wild type mice have a ```NaN``` in the 'transgenic-line' column
</div>

In [36]:
df2['transgenic-line'].head()

0               NaN
1               NaN
2               NaN
3    Rbp4-Cre_KL100
4               NaN
Name: transgenic-line, dtype: object

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>Step 1:
<p>We can use the ```isnull``` method to find all of the entries with ```NaN``` or ```None```
</div>

In [37]:
is_wt = df2['transgenic-line'].isnull() #isnull() returns True if value is NaN or None. 
print(is_wt.head())

0     True
1     True
2     True
3    False
4     True
Name: transgenic-line, dtype: bool


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>Step 2:
<p>We can create a new column and assign the 'is_wt' series that we just created to that column
</div>

In [38]:
df2['is_wildtype'] = is_wt

In [39]:
df2.head(5)

Unnamed: 0,transgenic-line,injection-volume,is_wildtype
0,,1.13529,True
1,,1.075153,True
2,,1.047049,True
3,Rbp4-Cre_KL100,0.526425,False
4,,0.808988,True


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Delete column (note: inplace argument)</h2>
<p>
</div>

In [40]:
df2.drop('transgenic-line',axis=1,inplace=True)
# note: this would be the same as df2 = df2.drop('transgenic-line',axis=1)

In [41]:
df2.head(6)

Unnamed: 0,injection-volume,is_wildtype
0,1.13529,True
1,1.075153,True
2,1.047049,True
3,0.526425,False
4,0.808988,True
5,0.483214,False
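
The same add-then-delete steps can also be written without modifying anything in place. This is only a sketch of one alternative style, using ```assign``` and ```drop(columns=...)``` on a made-up three-row DataFrame; each call returns a new DataFrame and leaves the input unchanged.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'transgenic-line': [np.nan, 'Rbp4-Cre_KL100', np.nan],
                     'injection-volume': [1.13, 0.53, 0.81]})

# assign() returns a new DataFrame with the extra column
with_flag = demo.assign(is_wildtype=demo['transgenic-line'].isnull())

# drop(columns=...) also returns a new DataFrame
trimmed = with_flag.drop(columns='transgenic-line')

print(list(demo.columns))      # original is unchanged
print(list(trimmed.columns))
```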


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Selecting rows and filtering</h2>
<p>
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
**Slice rows**
<p>We can use Numpy-like slicing to access particular rows
</div>

In [42]:
df[150:190:10] # [start:end:step]

Unnamed: 0,id,transgenic-line,product-id,structure-id,structure-abbrev,structure-name,name,injection-volume,injection-structures,gender,strain,sum,structure-color,num-voxels,injection-coordinates,selected,experiment_page_url
150,158314278,,5,1011,AUDd,Dorsal auditory area,378-1628,0.303672,"[{""id""=>378, ""abbreviation""=>""SSs"", ""name""=>""S...",M,C57BL/6J,8.756128,019399,,"[7720, 2650, 9460]",False,http://connectivity.brain-map.org/projection/e...
160,307558646,,5,385,VISp,Primary visual area,C57BL/6-153763,1.058476,"[{""id""=>385, ""abbreviation""=>""VISp"", ""name""=>""...",M,C57BL/6J,8.421091,08858c,,"[8740, 1590, 8140]",False,http://connectivity.brain-map.org/projection/e...
170,126710740,,5,491,MM,Medial mammillary nucleus,378-1314,1.142014,"[{""id""=>1, ""abbreviation""=>""TMv"", ""name""=>""Tub...",M,C57BL/6J,8.221756,ff4c3e,,"[7720, 6500, 6090]",False,http://connectivity.brain-map.org/projection/e...
180,114046440,,5,194,LHA,Lateral hypothalamic area,378-1157,0.772607,"[{""id""=>63, ""abbreviation""=>""PVHd"", ""name""=>""P...",M,C57BL/6J,8.034985,f2483b,,"[6530, 5900, 6570]",False,http://connectivity.brain-map.org/projection/e...


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
**Selection purely by position (integer index)**
<p>With ```iloc``` we can select both rows and columns by their integer positions
</div>

In [43]:
df.iloc[150:190:10,0:10:2]  # [row start:end:step, column start:end:step]

Unnamed: 0,id,product-id,structure-abbrev,name,injection-structures
150,158314278,5,AUDd,378-1628,"[{""id""=>378, ""abbreviation""=>""SSs"", ""name""=>""S..."
160,307558646,5,VISp,C57BL/6-153763,"[{""id""=>385, ""abbreviation""=>""VISp"", ""name""=>""..."
170,126710740,5,MM,378-1314,"[{""id""=>1, ""abbreviation""=>""TMv"", ""name""=>""Tub..."
180,114046440,5,LHA,378-1157,"[{""id""=>63, ""abbreviation""=>""PVHd"", ""name""=>""P..."
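
Position-based selection (```iloc```) and label-based selection (```loc```) are easy to confuse when the index happens to be 0, 1, 2, ... The sketch below uses a made-up DataFrame whose index labels are not consecutive, so the difference is visible.

```python
import pandas as pd

demo = pd.DataFrame({'abbrev': ['AUDd', 'VISp', 'MM', 'LHA'],
                     'volume': [0.30, 1.06, 1.14, 0.77]},
                    index=[150, 160, 170, 180])

by_position = demo.iloc[0:2]   # positions 0 and 1; the end is exclusive
by_label = demo.loc[150:160]   # labels 150 through 160; the end is inclusive
first_value = demo.iloc[0, 0]  # [row position, column position]

print(by_position.index.tolist())
print(by_label.index.tolist())
print(first_value)
```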


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
**Select rows based on boolean array (very commonly used)**
<p>This is very powerful as it lets you slice the dataframe using logical conditions
<p>Let's keep working with our new ```df2``` for now
</div>

In [44]:
df2.head()

Unnamed: 0,injection-volume,is_wildtype
0,1.13529,True
1,1.075153,True
2,1.047049,True
3,0.526425,False
4,0.808988,True


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>We can create a boolean array based on our 'is_wildtype' column
</div>

In [45]:
boolean_array = df2.is_wildtype.values
print(boolean_array)

[ True  True  True ... False False False]


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>And if we apply that boolean array to the entire dataframe, we'll be left with only rows where the boolean array was ```True```
</div>

In [46]:
df2[boolean_array].head(15)

Unnamed: 0,injection-volume,is_wildtype
0,1.13529,True
1,1.075153,True
2,1.047049,True
4,0.808988,True
6,1.298801,True
7,0.56829,True
8,1.414942,True
9,1.457846,True
10,0.330774,True
13,0.175197,True


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
**Expression in brackets that yields boolean array**
<p>This can be done in one line by putting an expression into the brackets that will yield a boolean array
</div>

In [47]:
df2[df2.is_wildtype==False].head(5)

Unnamed: 0,injection-volume,is_wildtype
3,0.526425,False
5,0.483214,False
11,0.424763,False
12,0.137302,False
17,0.415792,False


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>We can combine multiple logical statements using the ```&``` or ```|``` characters
<p>For instance, let's find all of the male Sst-Cre and Vip-IRES-Cre mice in our full dataframe:
</div>

In [48]:
df[((df['transgenic-line']=='Sst-Cre') | (df['transgenic-line']=='Vip-IRES-Cre')) & (df['gender']=='M')]

Unnamed: 0,id,transgenic-line,product-id,structure-id,structure-abbrev,structure-name,name,injection-volume,injection-structures,gender,strain,sum,structure-color,num-voxels,injection-coordinates,selected,experiment_page_url
2080,262188772,Vip-IRES-Cre,5,795,PAG,Periaqueductal gray,Vip-IRES-Cre-210,0.025072,"[{""id""=>294, ""abbreviation""=>""SCm"", ""name""=>""S...",M,,0.295257,ff90ff,,"[9620, 3160, 5690]",False,http://connectivity.brain-map.org/projection/e...
2132,160294327,Sst-Cre,5,250,LSc,"Lateral septal nucleus, caudal (caudodorsal) part",Sst-Cre-D-806,0.062106,"[{""id""=>250, ""abbreviation""=>""LSc"", ""name""=>""L...",M,C57BL/6J,0.272962,90cbed,,"[4920, 3400, 5960]",False,http://connectivity.brain-map.org/projection/e...
2423,182935487,Vip-IRES-Cre,5,993,MOs,Secondary motor area,Vip-IRES-Cre-206,0.014947,"[{""id""=>39, ""abbreviation""=>""ACAd"", ""name""=>""A...",M,,0.161382,1f9d5a,,"[5090, 1540, 6290]",False,http://connectivity.brain-map.org/projection/e...
2575,299897573,Vip-IRES-Cre,5,961,PIR,Piriform area,Vip-IRES-Cre-129391,0.021916,"[{""id""=>952, ""abbreviation""=>""EPd"", ""name""=>""E...",M,C57BL/6J,0.119429,6acbba,,"[6280, 6350, 9470]",False,http://connectivity.brain-map.org/projection/e...
2841,182293273,Vip-IRES-Cre,5,385,VISp,Primary visual area,Vip-IRES-Cre-192,0.023698,"[{""id""=>385, ""abbreviation""=>""VISp"", ""name""=>""...",M,,0.044201,08858c,,"[9980, 1200, 8360]",False,http://connectivity.brain-map.org/projection/e...
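
Long conditions can be easier to read when each piece gets its own name. The sketch below rebuilds the same kind of filter on a made-up four-row DataFrame (the column name 'line' is a shortened stand-in). Note that each comparison needs its own parentheses when written inline, because ```&``` and ```|``` bind more tightly than ```==```.

```python
import pandas as pd

demo = pd.DataFrame({'line': ['Sst-Cre', 'Vip-IRES-Cre', 'Sst-Cre', 'Pvalb-IRES-Cre'],
                     'gender': ['M', 'F', 'F', 'M']})

# Name each boolean Series separately
is_sst = demo['line'] == 'Sst-Cre'
is_vip = demo['line'] == 'Vip-IRES-Cre'
is_male = demo['gender'] == 'M'

# | is element-wise OR, & is element-wise AND, ~ is element-wise NOT
male_sst_or_vip = demo[(is_sst | is_vip) & is_male]
not_male = demo[~is_male]

print(male_sst_or_vip.index.tolist())
print(not_male.index.tolist())
```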


<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

**Exercise 6.5:**
<ol>
<li>Generate a new dataframe with only injections into primary visual cortex (hint: the abbreviation for primary visual cortex is VISp)
<li>How many injections were made into V1?
</ol>
</div>

In [49]:
df[df['structure-abbrev']=='VISp']

Unnamed: 0,id,transgenic-line,product-id,structure-id,structure-abbrev,structure-name,name,injection-volume,injection-structures,gender,strain,sum,structure-color,num-voxels,injection-coordinates,selected,experiment_page_url
58,180296424,,5,385,VISp,Primary visual area,378-1815,0.814006,"[{""id""=>385, ""abbreviation""=>""VISp"", ""name""=>""...",M,C57BL/6J,13.012431,08858c,,"[9290, 2220, 9410]",False,http://connectivity.brain-map.org/projection/e...
105,114008926,,5,385,VISp,Primary visual area,378-1155,0.158576,"[{""id""=>382, ""abbreviation""=>""CA1"", ""name""=>""F...",M,C57BL/6J,10.217399,08858c,,"[8480, 1510, 8120]",False,http://connectivity.brain-map.org/projection/e...
127,309004492,,5,385,VISp,Primary visual area,C57BL/6-155459,1.083030,"[{""id""=>385, ""abbreviation""=>""VISp"", ""name""=>""...",M,C57BL/6J,9.389121,08858c,,"[9450, 2180, 8700]",False,http://connectivity.brain-map.org/projection/e...
153,309372716,,5,385,VISp,Primary visual area,C57BL/6-155461,0.908790,"[{""id""=>385, ""abbreviation""=>""VISp"", ""name""=>""...",M,C57BL/6J,8.679576,08858c,,"[8740, 1590, 8140]",False,http://connectivity.brain-map.org/projection/e...
160,307558646,,5,385,VISp,Primary visual area,C57BL/6-153763,1.058476,"[{""id""=>385, ""abbreviation""=>""VISp"", ""name""=>""...",M,C57BL/6J,8.421091,08858c,,"[8740, 1590, 8140]",False,http://connectivity.brain-map.org/projection/e...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2881,528964707,Ai75(RCL-nT),42,385,VISp,Primary visual area,Ai75(T601)-251855,0.001641,"[{""id""=>385, ""abbreviation""=>""VISp"", ""name""=>""...",F,,0.031631,08858c,,"[8950, 1520, 3330]",False,http://connectivity.brain-map.org/projection/e...
2885,479203646,A930038C07Rik-Tg1-Cre,36,385,VISp,Primary visual area,A930038C07Rik-Tg1-Cre-190112,0.004368,"[{""id""=>385, ""abbreviation""=>""VISp"", ""name""=>""...",F,,0.029717,08858c,,"[9080, 1740, 2410]",False,http://connectivity.brain-map.org/projection/e...
2890,526784559,Ai75(RCL-nT),42,385,VISp,Primary visual area,Ai75(T601)-248508,0.000781,"[{""id""=>385, ""abbreviation""=>""VISp"", ""name""=>""...",M,,0.025474,08858c,,"[9010, 1440, 3750]",False,http://connectivity.brain-map.org/projection/e...
2894,297627858,Chrna2-Cre_OE25,5,385,VISp,Primary visual area,Chrna2-Cre_OE25-124949,0.005304,"[{""id""=>385, ""abbreviation""=>""VISp"", ""name""=>""...",M,FVB.CD1(ICR),0.023745,08858c,,"[9130, 890, 7220]",False,http://connectivity.brain-map.org/projection/e...


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>More useful methods</h2>
<p>
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
**isin()**
<p> Use ```isin()``` to find all injections into either 'AUDp' (primary auditory cortex) or 'SSp-bfd' (the barrel field of the primary somatosensory cortex)
</div>

In [50]:
area_list = ['AUDp','SSp-bfd']
df_areas = df[df['structure-abbrev'].isin(area_list)] #This is an alternative to using OR

print('There were',len(df_areas),'injections into these structures')

There were 94 injections into these structures


In [51]:
df_areas.head(6)

Unnamed: 0,id,transgenic-line,product-id,structure-id,structure-abbrev,structure-name,name,injection-volume,injection-structures,gender,strain,sum,structure-color,num-voxels,injection-coordinates,selected,experiment_page_url
30,112951804,,5,329,SSp-bfd,"Primary somatosensory area, barrel field",378-1091,0.412418,"[{""id""=>329, ""abbreviation""=>""SSp-bfd"", ""name""...",M,C57BL/6J,16.379417,188064,,"[6720, 2440, 8720]",False,http://connectivity.brain-map.org/projection/e...
63,583748537,Emx1-IRES-Cre,35,329,SSp-bfd,"Primary somatosensory area, barrel field",Emx1-IRES-Cre-305953,0.857399,"[{""id""=>329, ""abbreviation""=>""SSp-bfd"", ""name""...",M,,12.614273,188064,,"[7440, 2680, 1940]",False,http://connectivity.brain-map.org/projection/e...
104,126907302,,5,329,SSp-bfd,"Primary somatosensory area, barrel field",378-1352,0.627607,"[{""id""=>329, ""abbreviation""=>""SSp-bfd"", ""name""...",M,C57BL/6J,10.226739,188064,,"[7460, 1530, 8220]",False,http://connectivity.brain-map.org/projection/e...
126,100142655,,5,329,SSp-bfd,"Primary somatosensory area, barrel field",378-850,0.17592,"[{""id""=>329, ""abbreviation""=>""SSp-bfd"", ""name""...",M,C57BL/6J,9.430913,188064,,"[6920, 1660, 8020]",False,http://connectivity.brain-map.org/projection/e...
156,562671482,Emx1-IRES-Cre,35,1002,AUDp,Primary auditory area,Emx1-IRES-Cre-283754,0.333925,"[{""id""=>1002, ""abbreviation""=>""AUDp"", ""name""=>...",F,,8.584059,19399,,"[8030, 2370, 1780]",False,http://connectivity.brain-map.org/projection/e...
165,127866392,,5,329,SSp-bfd,"Primary somatosensory area, barrel field",378-1488,0.175445,"[{""id""=>329, ""abbreviation""=>""SSp-bfd"", ""name""...",M,C57BL/6J,8.316269,188064,,"[7310, 1890, 9080]",False,http://connectivity.brain-map.org/projection/e...
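
As a quick check that ```isin()``` really is an alternative to chaining ORs, the following sketch compares the two masks on a made-up four-row column, and also shows how to invert the selection.

```python
import pandas as pd

demo = pd.DataFrame({'structure-abbrev': ['AUDp', 'VISp', 'SSp-bfd', 'MOs']})
area_list = ['AUDp', 'SSp-bfd']

mask_isin = demo['structure-abbrev'].isin(area_list)
mask_or = (demo['structure-abbrev'] == 'AUDp') | (demo['structure-abbrev'] == 'SSp-bfd')

# Both masks select the same rows
print(mask_isin.equals(mask_or))

# ~ inverts the mask: everything NOT in the list
others = demo[~mask_isin]
print(others['structure-abbrev'].tolist())
```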


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
**value_counts()**
<p>This method returns an object containing counts of unique values, in descending order.
</div>

In [52]:
# Top 20 Cre lines used in connectivity atlas
df['transgenic-line'].value_counts()[:20]

Ai75(RCL-nT)             182
Rbp4-Cre_KL100           125
Cux2-IRES-Cre            118
A930038C07Rik-Tg1-Cre    102
Ntsr1-Cre_GN220           88
Tlx3-Cre_PL56             86
Emx1-IRES-Cre             68
Gad2-IRES-Cre             64
Slc17a6-IRES-Cre          61
Syt6-Cre_KI148            59
Sim1-Cre_KJ18             57
Scnn1a-Tg3-Cre            53
Chrna2-Cre_OE25           42
Ppp1r17-Cre_NL146         41
Efr3a-Cre_NO108           41
Chat-IRES-Cre-neo         33
Htr2a-Cre_KM207           33
Drd3-Cre_KI196            31
Gpr26-Cre_KO250           31
Nos1-CreERT2              31
Name: transgenic-line, dtype: int64
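
By default ```value_counts()``` leaves out missing values, which is why the wild type (```NaN```) entries do not show up in the list above. The sketch below, on a small hypothetical Series, shows the ```dropna``` and ```normalize``` arguments.

```python
import numpy as np
import pandas as pd

# Hypothetical labels with two missing entries
lines = pd.Series(['A', 'B', 'A', np.nan, 'A', np.nan])

counts = lines.value_counts()                    # NaN excluded by default
counts_with_nan = lines.value_counts(dropna=False)  # NaN gets its own row
fractions = lines.value_counts(normalize=True)   # fractions instead of counts

print(counts)
print(counts_with_nan)
print(fractions)
```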

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Summary statistics</h2>
<p>
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>Get summary statistics of a particular column
</div>

In [53]:
df['injection-volume'].describe()

count    2918.000000
mean        0.149618
std         0.194226
min         0.000233
25%         0.025037
50%         0.081198
75%         0.198513
max         1.614223
Name: injection-volume, dtype: float64

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Bar plot</h2>
<p>Use the built-in bar plot method
</div>

In [56]:
fig,ax=plt.subplots(figsize=(12,6))
df['transgenic-line'].value_counts()[:50].plot(kind='bar')
ax.set_title("Top 50 injected Cre lines");
ax.set_ylabel("# Experiments");
fig.tight_layout() #this keeps the x-labels from getting cut off

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

**Exercise 6.6:**
<p>Make a bar plot of the top 20 most injected brain areas in the connectivity atlas.
</div>

In [80]:
fig,ax = plt.subplots(figsize=(6,12))
data = df['structure-abbrev'].value_counts()[:20]  # top 20, as the exercise asks
ax.barh(np.arange(0,len(data)),data.values,tick_label=data.index)
ax.invert_yaxis()
ax.set_frame_on(False)

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Groupby operations</h2>
<p>We're going to group by two characteristics: the injection structure and the gender, then find the minimum injection volume in each group
</div>

In [81]:
grouped = df.groupby(['structure-abbrev','gender']).min()

columns_to_display = ['injection-volume','num-voxels']

grouped[columns_to_display].head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,injection-volume,num-voxels
structure-abbrev,gender,Unnamed: 2_level_1,Unnamed: 3_level_1
AAA,M,0.48044,
ACAd,F,0.003941,
ACAd,M,0.001885,
ACAv,F,0.003993,
ACAv,M,0.002299,
ACB,F,0.004732,
ACB,M,0.005276,
AD,F,0.019505,
AD,M,0.003086,
ADP,F,0.027138,


<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

**Exercise 6.7:**
<p>Use groupby to compute mean injection volume in male vs female mice.
</div>

In [82]:
# Select the numeric column before averaging so that text columns are not included
grouped = df.groupby('gender')['injection-volume'].mean()
grouped

gender
F    0.118051
M    0.170446
Name: injection-volume, dtype: float64
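
If you want several statistics per group at once, one option is ```agg``` with a list of function names. This is a sketch on a made-up four-row DataFrame (the column name 'volume' is a stand-in), not the result for the real dataset.

```python
import pandas as pd

demo = pd.DataFrame({'gender': ['M', 'F', 'M', 'F'],
                     'volume': [0.2, 0.1, 0.4, 0.3]})

# One row per group, one column per statistic
summary = demo.groupby('gender')['volume'].agg(['mean', 'min', 'count'])

print(summary)
```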

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Making a DataFrame from scratch</h2>
<p>
</div>

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
**From an array**
</div>

In [83]:
data = np.random.rand(100,3)
columns = ['cell_1','cell_2','cell_3']
df_arr = pd.DataFrame(data,columns=columns)
df_arr.head()

Unnamed: 0,cell_1,cell_2,cell_3
0,0.927052,0.221977,0.118807
1,0.373756,0.876106,0.262702
2,0.850917,0.785967,0.846697
3,0.048631,0.7465,0.911086
4,0.438631,0.740203,0.793531


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
**From a dictionary**
</div>

In [84]:
data1 = [1.,3.2,39.]
data2 = ['Steve','Joe','Bob']

dict_data = {
    'col1_name': data1,
    'col2_name': data2}

df_from_dict = pd.DataFrame(dict_data)
df_from_dict

Unnamed: 0,col1_name,col2_name
0,1.0,Steve
1,3.2,Joe
2,39.0,Bob
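
A third common starting point is a list of dictionaries, one per row. In this sketch (made-up names and values) the last record is missing a key, and pandas fills the gap with ```NaN```.

```python
import pandas as pd

# One dictionary per row; the last one has no 'score' entry
records = [{'name': 'Steve', 'score': 1.0},
           {'name': 'Joe', 'score': 3.2},
           {'name': 'Bob'}]

df_records = pd.DataFrame(records)

print(df_records)
print(df_records['score'].isnull().tolist())
```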


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<h2>Saving (to_pickle(), to_excel())</h2>
<p>
</div>

In [85]:
save_methods = [x for x in dir(df) if 'to_' in x]
print("save_methods:")
for method in save_methods:
    print(method)

save_methods:
_to_dict_of_blocks
to_clipboard
to_csv
to_dense
to_dict
to_excel
to_feather
to_gbq
to_hdf
to_html
to_json
to_latex
to_msgpack
to_numpy
to_parquet
to_period
to_pickle
to_records
to_sparse
to_sql
to_stata
to_string
to_timestamp
to_xarray


<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>Save to Excel
</div>

In [None]:
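df_arr.to_excel('random_df.xlsx') # note: recent pandas versions no longer write the old .xls format; .xlsx needs an engine such as openpyxl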

<div style="border-left: 3px solid #000; padding: 1px; padding-left: 10px; background: #F0FAFF; ">
<p>Save to a pickle file
</div>

In [None]:
df_arr.to_pickle('random_df.pkl')
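
To confirm that nothing is lost when saving, you can load the file back with ```pd.read_pickle``` and compare. The sketch below writes to a temporary directory (an assumption made here so the example cleans up after itself) rather than to your working directory.

```python
import os
import tempfile

import numpy as np
import pandas as pd

demo = pd.DataFrame(np.random.rand(5, 3), columns=['cell_1', 'cell_2', 'cell_3'])

# Write to a temporary folder that is deleted when the block ends
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'random_df.pkl')
    demo.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() checks that shape, labels and values all match
print(restored.equals(demo))
```

Pickle files preserve the index and column types exactly, but only load pickle files from sources you trust.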

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

**Exercise 6.8:** 
<p>Is there a relationship between injection volume and use of Cre vs wild type mouse?
</div>

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

**Exercise 6.9:** 
<p>Use documentation or online help to figure out how to sort a dataframe by the values in a particular column.
</div>

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

**Exercise 6.10:** 
<p>Use injection coordinates to plot spatial distribution of injections.

<p>Make a 2D plot in which the following is true:
<ol>
<li>Each injection is a dot
<li>The injection locations are collapsed on two of the three dimensions (choose which two, maybe try it multiple ways)
<li>The dot size represents the injection volume
<li>The dot color represents the cre-line
</ol>
<p>**The final plot should look like one projection of the rotatable plot at:** http://connectivity.brain-map.org/
</div>

<div style="background: #DFF0D8; border-radius: 3px; padding: 10px;">

**Exercise 6.11:**
<p>Find a dataset online and explore it with a DataFrame ... 
</div>
