# Python Bootcamp Day 2
## Theme for day 2: Manipulation and math of lists.

Today's goal is to take your raw data, select subsets of it, and do some basic manipulation to extract useful information from the data.  The things that we will cover today include:
 - indexing lists
 - smoothing data
 - plotting data

First, we'll import some python libraries for manipulating and plotting data.

In [108]:
import matplotlib.pyplot as plt
#This next line of code changes any matplotlib figures into interactive plots so we can manipulate them in the output, rather than just programmatically.  It's specific to Jupyter and needs the ipympl package installed
%matplotlib widget 
import numpy as np
import scipy as sp
import pandas as pd
from pathlib import Path

## Indexing
Sometimes, we only want a subset of a whole dataset, or even to select out one individual point.  This is accomplished by selecting points in the array using their index.  This can be thought of as an ID for the spot.  For a list (or vector, or 1D array), the index just indicates the position of the item in the list.  Python begins indexing at 0, so the first item in the list has an index of 0.  If we have a list of 10 fruits that starts [orange, cherry, apple, banana, ...] and ends with pear, the index for each looks like:

|Index|Item|
|---|---|
|0|orange|
|1|cherry|
|2|apple|
|3|banana|
|...|...|
|9|pear|

Let's create a list with the fruits.

In [71]:
fruit = ['orange','cherry','apple','banana','mango','blackberry','strawberry','peach','nectarine','pear']

Let's say I want to access the second item in the list.  Remember that in Python we start counting at 0, so the index for that item is 1.

In [72]:
print(fruit[1])

cherry


If I want to get a set of items from a list, let's say the 2nd (index 1) through the 5th (index 4) items in our fruits list, I can use the syntax 1:5.  This means "start at index 1 and stop before index 5", so the item at index 5 is not included (kind of confusing, but that's what it is...).

In [73]:
print(fruit[1:5])

['cherry', 'apple', 'banana', 'mango']
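
One way to make the "stop before" rule easier to remember: the number of items in a slice is the stop index minus the start index, and two slices that share a boundary split the list with no overlap.  The short sketch below checks both ideas on the same fruit list (the names start, stop and middle are only for illustration).

```python
fruit = ['orange', 'cherry', 'apple', 'banana', 'mango',
         'blackberry', 'strawberry', 'peach', 'nectarine', 'pear']

start, stop = 1, 5
middle = fruit[start:stop]

# The slice holds stop - start items because index 5 itself is excluded.
print(len(middle) == stop - start)

# Slices that share a boundary rebuild the original list with no overlap.
print(fruit[:stop] + fruit[stop:] == fruit)
```

Both lines print True, which is the practical payoff of excluding the stop index.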


If we want to start at the front of the list, we start with index 0

In [74]:
print(fruit[0:2])

['orange', 'cherry']


If we don't put a number before the ":"  python assumes we want to start indexing at the front of the list.

In [75]:
fruit[:2]

['orange', 'cherry']

Similarly, we can go from an index to the end of the list by omitting the index after the ":"

In [7]:
print(fruit[3:])

['banana', 'mango', 'blackberry', 'strawberry', 'peach', 'nectarine', 'pear']


Positive numbers in the index imply that we want to start indexing at the front of the list.  Python also indexes lists with negative numbers, to indicate the position relative to the end of the list.  The fruit list above can also be represented like this:

|Index|Item|
|---|---|
|-10|orange|
|-9|cherry|
|-8|apple|
|...|...|
|-1|pear|

In [76]:
print(fruit[-10])

orange


In [77]:
print(fruit[-1])
print(fruit[5:-1])

pear
['blackberry', 'strawberry', 'peach', 'nectarine']
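
It can help to see that a negative index is just a shortcut for counting from the length of the list.  The sketch below checks that index -k and index len(fruit) - k point at the same item, and that negative numbers work as slice boundaries too (the names n and last_three are only for illustration).

```python
fruit = ['orange', 'cherry', 'apple', 'banana', 'mango',
         'blackberry', 'strawberry', 'peach', 'nectarine', 'pear']
n = len(fruit)

# A negative index -k refers to the same item as the positive index n - k.
for k in (1, 3, n):
    print(-k, n - k, fruit[-k] == fruit[n - k])

# Negative numbers also work as slice boundaries: the last three items.
last_three = fruit[-3:]
print(last_three)

# fruit[5:-1] stops before the last item, exactly like fruit[5:n - 1].
print(fruit[5:-1] == fruit[5:n - 1])
```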


If we have a 2 dimensional array like:

| | | | |
|---|---|---|---|
|17|16|1|3|
|2|0|22|36|
|3|32|99|13|
|89|17|6|4|

The indices for the rows and columns of this array are shown below.  The first number is the row index and the second number is the column index.

| | | | |
|---|---|---|---|
|0,0|0,1|0,2|0,3| 
|1,0|1,1|1,2|1,3|
|2,0|2,1|2,2|2,3|
|3,0|3,1|3,2|3,3|

Let's initialize a list of lists to produce a 2 dimensional array.  We'll then convert it to a numpy array to do more traditional (more like Matlab) indexing.

In [10]:
randomNumbers = [[17,16,1,3],[2,0,22,36],[3,32,99,13],[89,17,6,4]]
randomNumbersArr = np.array(randomNumbers)
print(randomNumbersArr)

[[17 16  1  3]
 [ 2  0 22 36]
 [ 3 32 99 13]
 [89 17  6  4]]


To index a list of lists, you first select the inner list you want, then select the item inside that list.  This can be done in 2 steps, or in 1.

In [78]:
a = randomNumbers[0]
b = a[1]
print(b)
print(randomNumbers[0][1]) #This is equivalent to the above 3 lines

16
16


When we convert the list into a numpy array, we can index the array by inputting the index the same way seen in the index table above.  Often, you will want to convert your 2D lists into arrays to speed processing time, and for working with the data in general (many built in functions in the numpy library work better with arrays).

In [79]:
print(randomNumbersArr[0,1])
print(randomNumbersArr[3,2])

16
6
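
To see why the conversion matters, the sketch below tries the comma style of index on both objects.  The plain list of lists rejects it, while the numpy array accepts it and returns the same value as the two-step list index (the flag name list_accepts_comma is only for illustration).

```python
import numpy as np

randomNumbers = [[17, 16, 1, 3], [2, 0, 22, 36], [3, 32, 99, 13], [89, 17, 6, 4]]
randomNumbersArr = np.array(randomNumbers)

# A nested list only understands one index at a time, so [0, 1] fails.
try:
    randomNumbers[0, 1]
    list_accepts_comma = True
except TypeError:
    list_accepts_comma = False
print(list_accepts_comma)

# The array accepts [row, column] and agrees with the two-step list index.
print(randomNumbersArr[0, 1] == randomNumbers[0][1])

# The array also knows its own shape as (rows, columns).
print(randomNumbersArr.shape)
```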


We can also index to get sub-arrays.  To get the upper-left 2x2 matrix, we can slice each dimension the same way we sliced the lists.

In [13]:
print(randomNumbersArr[0:2,0:2])

[[17 16]
 [ 2  0]]


When we have multidimensional arrays, we also have the option to pull everything from a particular row or column.  This is intuitive based on what we know from indexing lists.  A ":" with nothing before or after indicates we want everything.  So if we want every row in a particular column, we put a ":" in the row position and the column index in the column position.  To get all of the rows of the second column (index 1), we'd use the index [:,1]

In [14]:
print(randomNumbersArr[:,1])

[16  0 32 17]


If we want every item from a particular row, we can grab it with the same style of syntax.  Here we take the second row (index 1).

In [15]:
print(randomNumbersArr[1,:])

[ 2  0 22 36]


Or maybe we need everything from a few columns.

In [16]:
print(randomNumbersArr[:,1:3])

[[16  1]
 [ 0 22]
 [32 99]
 [17  6]]
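
One detail that is easy to miss in the outputs above: a single integer index removes that dimension, while a slice keeps it, even if the slice only covers one column.  The sketch below compares the shapes (the names as_vector and as_column are only for illustration).

```python
import numpy as np

randomNumbersArr = np.array([[17, 16, 1, 3], [2, 0, 22, 36],
                             [3, 32, 99, 13], [89, 17, 6, 4]])

# An integer column index returns a 1D array of 4 values.
as_vector = randomNumbersArr[:, 1]
print(as_vector.shape)

# A one-column slice returns a 2D array with 4 rows and 1 column.
as_column = randomNumbersArr[:, 1:2]
print(as_column.shape)

# The values are the same either way; only the shape differs.
print(np.array_equal(as_vector, as_column[:, 0]))
```

This matters later when combining arrays, since many numpy functions care about whether the input is 1D or 2D.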


This same principle applies out to any dimensionality of array.  We can use this to access items in an array in python.  Just add an extra "," for the next dimension.
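
As a small illustration of the extra-comma rule, the sketch below builds a made-up 3D array (the name cube and its contents are hypothetical, not part of the tutorial data) and indexes it with three positions.

```python
import numpy as np

# Hypothetical 3D array: 2 blocks, each with 3 rows and 4 columns,
# filled with the numbers 0 through 23.
cube = np.arange(24).reshape(2, 3, 4)

# Three indices pick out a single element: block 1, row 2, column 3.
print(cube[1, 2, 3])

# A ":" in any position means "everything" along that dimension.
first_rows = cube[:, 0, :]      # row 0 of every block
print(first_rows.shape)

# Slices and integers can be mixed freely.
sub_block = cube[0, :, 1:3]     # block 0, all rows, columns 1 and 2
print(sub_block.shape)
```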

## Some pandas indexing

Next, we're going to import data to work through some examples.  We'll do it in 2 ways, once with the built-in "open" command, and once using the pandas library, which we'll also give a brief introduction to.

First: We'll import the data using the built in methods

In [109]:
filename = Path('TutorialData/HC3N_FIR_35mubar_transmitance_Z_0005Thresh.spectrumcal')
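rawX = []
rawY = []
#The with-block closes the file automatically, even if a line fails to parse
with open(filename,'r') as fid:
    for line in fid:
        #Split each line at the comma: the first field is x, the second is y
        tmp = line.split(',')
        rawX.append(float(tmp[0]))
        rawY.append(float(tmp[1]))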

This is a pretty big dataset, so I'm going to turn these lists into numpy arrays, which are optimized for doing math with large lists (arrays) and will speed up any computations we decide to do.

In [110]:
rawX = np.array(rawX)
rawY = np.array(rawY)
rawX

array([ 39.999802,  40.000272,  40.000742, ..., 719.998947, 719.999417,
       719.999887])
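
For files that really are just two comma-separated columns of numbers with no header, numpy can do the reading and the array conversion in one step with np.loadtxt.  The sketch below is only an illustration under that assumption: it reads from a small in-memory text instead of the real file, the three x values are copied from the output above, and the y values are made up.

```python
import io
import numpy as np

# Stand-in for the data file: three "x,y" lines.
# The y values here are invented purely for the demonstration.
fake_file = io.StringIO("39.999802,0.95\n40.000272,0.97\n40.000742,0.96\n")

# delimiter=',' splits each line at the comma;
# unpack=True returns one array per column instead of one 2D array.
x_demo, y_demo = np.loadtxt(fake_file, delimiter=',', unpack=True)

print(x_demo)
print(y_demo)
```

With the real data, the file path would take the place of fake_file.  The manual loop is still worth knowing for files with irregular lines that np.loadtxt cannot parse.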

## Simple visualization
If you've watched the module on plotting data, you'll have already seen how we can make simple plots using matplotlib to make quick line graphs of our data

In [111]:
plt.figure()
plt.plot(rawX,rawY)

[<matplotlib.lines.Line2D at 0x151748f0bb0>]

In [81]:
plt.xlim([202.67,202.71])
plt.ylim([-.2,1.1])
plt.show()

Now we'll import the data using the pandas library.  Pandas has some great built-in data importing, and has developed a very nice data structure (the DataFrame) for working with the imported data.  If your data has a header row, the columns will automatically be labeled, or you can specify what to call the columns with the "names" keyword.  Run the cell directly below this one for more information on the keywords available in the read_csv function.

In [None]:
pd.read_csv?

In [83]:
rawData = pd.read_csv(filename,sep=',',names=['frequency','transmittance'])
rawData.head() #This lets us examine the top 5 rows of the data to see if it's importing correctly.

| |frequency|transmittance|
|---|---|---|
|0|462.0|0.19284|
|1|466.0|0.19261|
|2|470.0|0.19153|
|3|474.0|0.19078|
|4|478.0|0.19059|


Because the columns have names, we can index a particular column directly by using the name.

In [84]:
rawData['frequency']

0       462.0
1       466.0
2       470.0
3       474.0
4       478.0
        ...  
829    3778.0
830    3782.0
831    3786.0
832    3790.0
833    3794.0
Name: frequency, Length: 834, dtype: float64

In [85]:
rawData['transmittance']

0      0.19284
1      0.19261
2      0.19153
3      0.19078
4      0.19059
        ...   
829    0.36475
830    0.36527
831    0.36542
832    0.36542
833    0.36542
Name: transmittance, Length: 834, dtype: float64

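Because of how a DataFrame is set up, we can't index it the same way we would a numpy array.  Square brackets on a DataFrame look up column names, so the index below is treated as a column labeled (100, 1), which doesn't exist.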

In [23]:
rawData[100,1]

KeyError: (100, 1)

Instead, we need to use either loc or iloc, the built-in pandas indexers, to slice our data.  iloc selects by integer position (like a numpy array), while loc selects by index label and column name.

In [23]:
rawData.iloc[100:200,0]

100    40.046882
101    40.047362
102    40.047832
103    40.048302
104    40.048772
         ...    
195    40.091612
196    40.092082
197    40.092552
198    40.093022
199    40.093502
Name: frequency, Length: 100, dtype: float64

In [24]:
rawData.iloc[15,:]

frequency        40.006872
transmittance     0.231060
Name: 15, dtype: float64
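
The difference between iloc and loc is easiest to see on a frame whose row labels do not start at 0.  The sketch below uses a small made-up DataFrame (the name demo, its values and its index labels are all hypothetical) and takes what looks like the same slice both ways.

```python
import pandas as pd

# Hypothetical frame whose row labels are 10..13 instead of 0..3.
demo = pd.DataFrame(
    {'frequency': [40.0, 40.5, 41.0, 41.5],
     'transmittance': [0.90, 0.95, 0.40, 0.97]},
    index=[10, 11, 12, 13],
)

# iloc counts positions from 0 and excludes the stop, like a list slice.
by_position = demo.iloc[1:3, 0]
print(list(by_position))

# loc uses the row labels and column names, and INCLUDES the stop label.
by_label = demo.loc[11:13, 'frequency']
print(list(by_label))
```

The iloc slice returns two values and the loc slice returns three, because label-based slices keep the end point.  When the row labels are the default 0, 1, 2, ... the two indexers look similar, which makes this difference easy to overlook.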

One of the fun things we can do with pandas fairly simply is use the names of the columns to find all the rows in the DataFrame where a particular column falls between specific values.  This is done below.

In [38]:
singlePeak = rawData[(rawData['frequency'] > 202.67) & (rawData['frequency'] < 202.71)]
singlePeak

| |frequency|transmittance|
|---|---|---|
|345506|202.670131|1.024467|
|345507|202.670611|0.992668|
|345508|202.671081|0.970010|
|345509|202.671551|1.010211|
|345510|202.672021|0.954384|
|...|...|...|
|345586|202.707801|0.996443|
|345587|202.708271|0.967306|
|345588|202.708741|0.948392|
|345589|202.709211|0.966158|
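
The selection above works in two stages: each comparison produces a column of True/False values, the "&" combines them row by row, and the DataFrame keeps only the rows marked True.  The sketch below spells out those stages on a tiny made-up frame (the name demo and its values are hypothetical).

```python
import pandas as pd

demo = pd.DataFrame({'frequency': [202.66, 202.68, 202.70, 202.72],
                     'transmittance': [1.00, 0.97, 0.95, 1.01]})

# Each comparison gives one True/False value per row.
above = demo['frequency'] > 202.67
below = demo['frequency'] < 202.71

# "&" combines them row by row. When written on one line, each comparison
# needs its own parentheses because "&" binds tighter than ">" and "<".
mask = above & below
print(mask.tolist())

# Indexing with the mask keeps only the True rows, with their original labels.
selected = demo[mask]
print(selected.index.tolist())
```

Keeping the original row labels is why the selected rows in the real data are numbered in the 345000s rather than starting again from 0.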


## Working with data
Now that we know how to select data that we're interested in playing with, we will explore a couple things we can do to manipulate it.  Let's use the rawX and rawY data that we were using earlier.

In [88]:
#Grab just one peak to work with for now, to make it a little easier for the moment.
X = rawX[(rawX < 202.71) & (rawX > 202.67)]
Y = rawY[(rawX < 202.71) & (rawX > 202.67)]
plt.figure()
plt.plot(X,Y)

[<matplotlib.lines.Line2D at 0x1516aa903a0>]

In [89]:
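#Differences between each point and the one before it
deltaX = X[1:]-X[:-1]
deltaY = Y[1:]-Y[:-1]
#Each slope belongs to the midpoint between the two points it was computed from
midX = (X[1:]+X[:-1])/2
#Finite-difference estimate of the first derivative dY/dX
firstDer = deltaY/deltaX
plt.plot(midX,firstDer)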

[<matplotlib.lines.Line2D at 0x1516aadd3a0>]
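
The cell above estimates the first derivative by dividing the change in Y by the change in X between neighboring points, and it plots each slope at the midpoint of the two points.  The sketch below repeats the same slicing on a made-up parabola (the X values are hypothetical), where the answer is known, and checks it against numpy's built-in np.diff.

```python
import numpy as np

# Made-up test data: a parabola sampled at uneven spacing on purpose.
X = np.array([0.0, 1.0, 2.0, 4.0])
Y = X**2

# Same slicing as above: everything but the first point minus
# everything but the last point gives neighbor-to-neighbor differences.
deltaX = X[1:] - X[:-1]
deltaY = Y[1:] - Y[:-1]
midX = (X[1:] + X[:-1]) / 2
firstDer = deltaY / deltaX

# np.diff computes the same neighbor-to-neighbor differences.
print(np.allclose(firstDer, np.diff(Y) / np.diff(X)))

# For a parabola, the slope between two points equals 2*x at their midpoint.
print(np.allclose(firstDer, 2 * midX))
```

Both checks print True.  Note that the derivative has one fewer point than the original data, which is why it is plotted against midX rather than X.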

In [90]:
plt.close('all')

The next example data for this part of the tutorial is from an FTIR spectrum of benzene.

In [91]:
filename = 'TutorialData/Benzene.dpt'
benzene = pd.read_csv(filename,sep=',',names=['Frequency','Intensity'])

In [92]:
benzene.head()

| |Frequency|Intensity|
|---|---|---|
|0|462.0|0.2093|
|1|466.0|0.21372|
|2|470.0|0.20994|
|3|474.0|0.2255|
|4|478.0|0.20481|


We can plot it with matplotlib as a line plot to see the data.

In [96]:
plt.figure()
plt.plot(benzene['Frequency'],benzene['Intensity'],'r')
plt.show()

Ideally, this data will have a nice baseline at 0, and any other species will be removed from it. This particular spectrum had a background measurement taken beforehand, which we should be able to use to modify our data.

In [97]:
filename = 'TutorialData/background.dpt'
background = pd.read_csv(filename,sep=',',names=['Frequency','Intensity'])
plt.plot(background['Frequency'],background['Intensity'],'b')

[<matplotlib.lines.Line2D at 0x1516ab1dfd0>]

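Since we learned how to index specific columns out of our pandas DataFrames, we'll use that method to select the intensity columns from both sets of data and subtract the background from the actual data.  This assumes both spectra were recorded at the same frequency points, since pandas subtracts row by row using the index rather than the frequency values.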

In [95]:
repairedX = benzene['Frequency']
repairedY = benzene['Intensity']-background['Intensity']
plt.figure()
plt.plot(repairedX,repairedY,label='Corrected Data')
plt.plot(benzene['Frequency'],benzene['Intensity'],'r',label='Raw Data')
plt.legend()
plt.show()
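
Subtracting one intensity column from another matches rows by their index label, not by frequency, so it is worth confirming that both spectra share the same frequency points first.  The sketch below shows one possible check on small made-up frames (benzene_demo, background_demo and their values are hypothetical), with interpolation via np.interp as a fallback when the grids differ.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the two spectra.
benzene_demo = pd.DataFrame({'Frequency': [462.0, 466.0, 470.0],
                             'Intensity': [0.30, 0.50, 0.35]})
background_demo = pd.DataFrame({'Frequency': [462.0, 466.0, 470.0],
                                'Intensity': [0.20, 0.21, 0.22]})

# Do the two spectra have the same number of points at the same frequencies?
same_grid = (len(benzene_demo) == len(background_demo)
             and np.allclose(benzene_demo['Frequency'],
                             background_demo['Frequency']))

if same_grid:
    # Row-by-row subtraction is safe.
    corrected = benzene_demo['Intensity'] - background_demo['Intensity']
else:
    # Otherwise, estimate the background at the sample frequencies first.
    # np.interp expects the background frequencies in increasing order.
    corrected = benzene_demo['Intensity'] - np.interp(
        benzene_demo['Frequency'],
        background_demo['Frequency'],
        background_demo['Intensity'])

print(same_grid)
print(np.round(corrected.to_numpy(), 2))
```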

In [98]:
from scipy import signal

In [99]:
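#Pull the two columns out as numpy arrays (no loop needed)
x = singlePeak['frequency'].to_numpy()
y = singlePeak['transmittance'].to_numpy()
#Median filter with a 5-point window: each point is replaced by the median of itself and its 4 nearest neighbors
filteredPeak = signal.medfilt(y,5)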

In [106]:
plt.figure()
plt.plot(singlePeak['frequency'],singlePeak['transmittance'],'r')

[<matplotlib.lines.Line2D at 0x1517224ec10>]

In [107]:
plt.plot(x,filteredPeak,'k')

[<matplotlib.lines.Line2D at 0x151721b5610>]
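
The smoothing above uses a median filter, which replaces each point with the median of the points in a small window around it.  That makes it good at removing isolated spikes without smearing them into their neighbors.  The sketch below shows the effect on a made-up signal with one spike (the name noisy and its values are hypothetical), using a 3-point window instead of the 5-point window used above.

```python
import numpy as np
from scipy import signal

# Made-up signal: flat at 1.0 with a single spike in the middle.
noisy = np.array([1.0, 1.0, 1.0, 10.0, 1.0, 1.0, 1.0])

# kernel_size is the window width and must be odd.
smoothed = signal.medfilt(noisy, kernel_size=3)

print(noisy)
print(smoothed)
```

The spike disappears because it is never the middle value of any 3-point window.  One thing to keep in mind is that medfilt pads the ends of the array with zeros, so with wider windows the first and last few points can be pulled toward 0.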