# Data Analysis
Now that we have some practice with the fundamentals of Python, let's apply them to a dataset. One example dataset is provided to you in this directory. The data come from measuring hydrogen bond distances within a Watson-Crick base pair modeled in an all-atom molecular dynamics simulation. If we have enough time, we can visualize this trajectory using VMD (Visual Molecular Dynamics).

## Using VMD
In a terminal (PuTTY) window, enter this command:

This will open up the VMD software GUI. There are a lot of menus and options to use within this software. I'll highlight the most important options available if we have time. 

## Using Python To Read A Data File
We now have our hydrogen-bonding data file, output from VMD, and we want to read in and analyze the raw data using a Python script. Time to introduce a new variable type! 

In [None]:
data_file = open('Trajectory_Data/hbond.dat')
print type(data_file)
data_file.close()

Our data_file variable is of type 'file' and represents the opened hbond.dat file. Unlike string and list variables, 'file' variables do not have a length, _but_ they are iterable. To access the data within the variable, we'll need to iterate over the lines in the file. As an example, we could iterate over all the lines in the file and print them.

In [None]:
data_file = open('Trajectory_Data/hbond.dat')
print type(data_file)
for line in data_file:
    print line, type(line)
data_file.close()

In [None]:
data_file = open('Trajectory_Data/hbond.dat')
print type(data_file)
for line in data_file:
    print line.split(), type(line.split())
data_file.close()
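The cells above open and close the file by hand. As a possible alternative, here is a minimal sketch that uses a `with` block, which closes the file automatically, and converts the second column to floats. The sample rows and the temporary file are hypothetical stand-ins for hbond.dat (the layout of a frame index followed by a distance is an assumption), and `print(...)` is written with parentheses so the sketch also runs under Python 3.

```python
import os
import tempfile

# Hypothetical sample rows standing in for hbond.dat (assumed layout:
# frame index, then distance). These are not the real trajectory values.
sample_text = "0 1.95\n1 2.10\n2 1.88\n"

# Write the sample to a temporary file so the example is self-contained.
tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.dat', delete=False)
tmp.write(sample_text)
tmp.close()

distances = []
# The with block closes the file when the block ends, even after an error.
with open(tmp.name) as data_file:
    for line in data_file:
        fields = line.split()               # list of strings, one per column
        distances.append(float(fields[1]))  # second column as a float

os.remove(tmp.name)  # clean up the temporary file
print(distances)
```

With this pattern there is no separate `data_file.close()` call to forget.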

In [52]:
data_file = open('Trajectory_Data/hbond.dat')
dataSum = 0
lineCount = 0
for line in data_file:
    dataSum = dataSum + float(line.split()[1])
    lineCount = lineCount + 1
data_file.close()
print dataSum, lineCount

1977.990574 1000


In [53]:
dataAvg = dataSum/lineCount
print 'Average hydrogen bond distance =', dataAvg

Average hydrogen bond distance = 1.977990574


In [55]:
dataStdev = 0.
data_file = open('Trajectory_Data/hbond.dat')
for line in data_file:
    temp = float(line.split()[1]) - dataAvg
    dataStdev += temp**2
data_file.close()
print 'Standard Deviation =', (dataStdev/lineCount)**0.5

Standard Deviation = 0.13321047324


## Note: we had to read the file twice to calculate the average and standard deviation. That's completely unnecessary! Let's read the file just once and store the values in a list.

In [None]:
lineCount = 0
dataList = []
data_file = open('Trajectory_Data/hbond.dat')
for line in data_file:
    dataList.append(float(line.split()[1]))
    lineCount += 1
data_file.close()
print type(dataList)
dataAvg = sum(dataList)/lineCount

dataStdev = 0.
for i in dataList:
    temp = i - dataAvg
    dataStdev += temp**2

print 'Average hydrogen bond distance =', dataAvg
print 'Standard Deviation =', (dataStdev/lineCount)**0.5
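The cell above reads the file only once, but it still loops twice: once over the file and once over the list. If a true single loop is wanted, one option is to accumulate the sum and the sum of squares together and use the identity variance = (mean of squares) - (square of the mean). The sketch below is illustrative only: the `values` list is hypothetical data, not the contents of hbond.dat, and `print(...)` is written with parentheses so it also runs under Python 3.

```python
# Hypothetical distances standing in for the second column of hbond.dat.
values = [1.95, 2.10, 1.88, 2.02, 1.97]

count = 0
total = 0.0      # running sum of x
total_sq = 0.0   # running sum of x**2
for x in values:
    count += 1
    total += x
    total_sq += x * x

mean = total / count
# Population variance: mean of the squares minus the square of the mean.
variance = total_sq / count - mean ** 2
# Guard against a tiny negative value caused by floating-point round-off.
stdev = max(variance, 0.0) ** 0.5

print('Average = %.6f' % mean)
print('Standard deviation = %.6f' % stdev)
```

This shortcut can lose precision when the spread is very small compared to the mean, so the two-step version is the safer default when the data fit in memory.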

## To be honest, we can read this file even more quickly. In doing so, we'll introduce an important tool that we can use within Python scripts.

In [None]:
import numpy as np # numpy = numerical python
data = np.loadtxt('Trajectory_Data/hbond.dat')

print type(data)
print data[0:10]
print data[0:10,1]

Using the numpy 'module', we can easily load data files (assuming a regular file format). With numpy's __loadtxt__ function, we create a numpy.ndarray variable (called data). This array behaves much like a list: it has a length and is indexable. But a numpy.ndarray variable has more functionality than a regular list does.

In [58]:
lineCount = len(data)
dataStdev = np.std(data[:,1])
dataAvg = np.mean(data[:,1])
print 'Number of lines in file =', lineCount
print 'Average hydrogen bonding distance =', dataAvg
print 'Standard Deviation =', dataStdev 

Number of lines in file = 1000
Average hydrogen bonding distance = 1.977990574
Standard Deviation = 0.13321047324
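The numpy value agrees with the hand-computed one because `np.std` divides by N by default (`ddof=0`), which is the same population formula used in the loops earlier. Passing `ddof=1` gives the sample standard deviation instead. The sketch below checks this on a small hypothetical two-column sample (not the real hbond.dat values) loaded from an in-memory text buffer. It is written for Python 3.

```python
import io
import numpy as np

# Hypothetical two-column sample (assumed layout: frame index, distance).
sample = io.StringIO(u"0 1.95\n1 2.10\n2 1.88\n3 2.02\n")
data = np.loadtxt(sample)   # 2D array: one row per line, one column per field

dist = data[:, 1]           # every row, second column
n = len(dist)

# Same formula as the explicit loops: sqrt(sum((x - mean)**2) / N).
# Subtraction and squaring act on the whole array at once.
manual = (((dist - dist.mean()) ** 2).sum() / n) ** 0.5

print(data.shape)
print(abs(np.std(dist) - manual) < 1e-12)    # default ddof=0 matches the loop
print(np.std(dist, ddof=1) > np.std(dist))   # dividing by N-1 gives a larger value
```

The whole-array arithmetic in `manual` is one example of what an ndarray can do that a plain list cannot.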
