# MolSSI Workshop 2
# Python for Data Analysis
You can find more information here:
https://education.molssi.org/python-data-analysis/01-numpy-arrays/index.html

## Section 1: Working with Numpy Arrays

__Questions__

What are the differences between numpy arrays and lists?

How can I use NumPy to do calculations?

__Objectives__

Be able to name the differences between Python lists and numpy arrays.

Understand the idea of broadcasting.

### NumPy Arrays vs. Python Lists

In [12]:
import numpy as np
import os

file_location = os.path.join('data','water.xyz')
xyz_file = np.genfromtxt(file_location, skip_header=2, dtype='unicode')
#print(xyz_file)
symbols = xyz_file[:,0]
#print(symbols)
coordinates = xyz_file[:,1:].astype(float) # extract the coords and as floats

print(symbols)
print(coordinates)

['O' 'H1' 'H2']
[[ 0.       -0.007156  0.965491]
 [-0.        0.001486 -0.003471]
 [ 0.        0.931026  1.207929]]


In [19]:
oxygen_coord = coordinates[0]  # all the columns of the first row
print(oxygen_coord)

[ 0.       -0.007156  0.965491]


Let’s imagine that we wanted to translate the position of the oxygen atom. We want to translate it 0.1 units in the x direction and -0.1 units in the y direction.
    
If we were writing for loops like we did before, we might do this by defining a translation vector and using a for loop.

In [52]:
translation_vector = [0.1, -0.1, 0]

oxygen_coord_new=[]

for dim in range(3):  # dim takes the values 0, 1, 2
    new_position = oxygen_coord[dim] + translation_vector[dim]
    oxygen_coord_new.append(new_position)
print(oxygen_coord_new)
    
# because oxygen_coord is a numpy array, the addition is element-wise
oxygen_coord_new = oxygen_coord + translation_vector
print(oxygen_coord_new)

print(type(oxygen_coord))

[0.1, -0.107156, 0.965491]
[ 0.1      -0.107156  0.965491]
<class 'numpy.ndarray'>


In [59]:
# a numpy array can be converted to a list
oxygen_list = list(oxygen_coord)
type(oxygen_list)

# but adding two lists concatenates them instead of adding element-wise:
oxygen_list + translation_vector



[0.0, -0.007156, 0.965491, 0.1, -0.1, 0]

In [60]:
# to get the same concatenation with numpy arrays, use np.concatenate

np.concatenate((oxygen_coord, translation_vector))

array([ 0.      , -0.007156,  0.965491,  0.1     , -0.1     ,  0.      ])

You can add two arrays together, multiply arrays by scalars, or do element-wise multiplication of arrays.

For example, you can multiply two numpy arrays to get their element-wise product. This means that given two vectors a = np.array([a0, a1, a2]) and b = np.array([b0, b1, b2]), a * b = [a0 * b0, a1 * b1, a2 * b2].

In [61]:
# for example:

a1 = np.array([2, 1, 0])
a2 = np.array([1, 3, 5])

print(a1 * a2)
print(a1 + a2)

[2 3 0]
[3 4 5]
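
The text above also mentions multiplying arrays by scalars, which the cell does not show. The following is a minimal sketch of that case, next to the element-wise product and the dot product for contrast. The names `a`, `b`, `scaled`, `shifted`, `product`, and `dot` are chosen here for illustration and are not part of the lesson.

```python
import numpy as np

a = np.array([2, 1, 0])
b = np.array([1, 3, 5])

scaled = 2 * a        # the scalar multiplies every element
shifted = a + 10      # the scalar is added to every element
product = a * b       # element-wise product, not a dot product
dot = np.dot(a, b)    # 2*1 + 1*3 + 0*5

print(scaled)   # [4 2 0]
print(shifted)  # [12 11 10]
print(product)  # [2 3 0]
print(dot)      # 5
```

Note that `a * b` returns an array of the same shape as the inputs, while `np.dot(a, b)` sums the products into a single number.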


In [66]:
# but if they are lists, multiplying them raises an error,
# and adding them concatenates the elements into one list

a1 = [2, 1, 0]
a2 = [1, 3, 5]

print(a1 + a2)
#print(a1 * a2)

[2, 1, 0, 1, 3, 5]
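
The commented-out `print(a1 * a2)` above fails for lists. As a small sketch of what happens, the cell below catches the error and shows two things lists can do instead: repetition with an integer, and an explicit element-wise product with `zip`. The variable names are made up for this example.

```python
list_a = [2, 1, 0]
list_b = [1, 3, 5]

raised_type_error = False
try:
    list_a * list_b  # lists do not support multiplication by another list
except TypeError as error:
    raised_type_error = True
    print("TypeError:", error)

# Multiplying a list by an integer repeats it rather than scaling it
repeated = list_a * 2
print(repeated)

# An element-wise product of two lists needs an explicit loop
elementwise = [x * y for x, y in zip(list_a, list_b)]
print(elementwise)
```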


### Broadcasting

Another special thing about numpy is something called broadcasting. Broadcasting occurs when you attempt mathematical operations on arrays that have different shapes. If possible, the smaller array is “broadcast” across the larger array.

Let’s think about what would happen if we wanted to move every atom in our water molecule by our translation vector.

In [83]:
print(coordinates)
print(translation_vector)

[[ 0.       -0.007156  0.965491]
 [-0.        0.001486 -0.003471]
 [ 0.        0.931026  1.207929]]
[0.1, -0.1, 0]


In [82]:
# using for loop

new_coordinates=[]

for atom in coordinates:
    new_x = atom[0] + translation_vector[0]
    new_y = atom[1] + translation_vector[1]
    new_z = atom[2] + translation_vector[2]
    new_coordinates.append([new_x, new_y, new_z])

print(new_coordinates)

[[0.1, -0.107156, 0.965491], [0.1, -0.098514, -0.003471], [0.1, 0.831026, 1.207929]]


In [84]:
# using numpy broadcasting

new_coordinates = coordinates + translation_vector

print(new_coordinates)

[[ 0.1      -0.107156  0.965491]
 [ 0.1      -0.098514 -0.003471]
 [ 0.1       0.831026  1.207929]]


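For this to work, the shapes of the two arrays must be compatible: comparing the dimensions starting from the last one, each pair must either be equal or one of them must be 1. You can see the shape of an array using the function np.shape.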

In [88]:
print(np.shape(coordinates))
print(np.shape(translation_vector))

(3, 3)
(3,)


In [91]:
row_translate = [[0.1], [0.2], [0.3]]
print(np.shape(row_translate))

(3, 1)


In [93]:
# row_translate can be added to coordinates because both have three rows;
# its single column is broadcast across the three columns of coordinates
print(coordinates + row_translate)

[[0.1      0.092844 1.065491]
 [0.2      0.201486 0.196529]
 [0.3      1.231026 1.507929]]
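
Because the water array is 3 x 3, it is hard to see which dimension is being matched. The sketch below uses a hypothetical array for four atoms (all zeros, not data from the lesson) so that rows and columns have different lengths. It shows one shape that broadcasts along columns, one that broadcasts along rows, and one that fails.

```python
import numpy as np

coords = np.zeros((4, 3))  # hypothetical: 4 atoms, 3 coordinates each

column_shift = np.array([0.1, -0.1, 0.0])            # shape (3,)
row_shift = np.array([[0.1], [0.2], [0.3], [0.4]])   # shape (4, 1)

# (4, 3) + (3,): the same x, y, z shift is applied to every atom
shifted_by_column = coords + column_shift

# (4, 3) + (4, 1): each atom gets its own shift, applied to x, y and z
shifted_by_row = coords + row_shift

print(shifted_by_column.shape, shifted_by_row.shape)

# (4, 3) + (4,): the last dimensions are 3 and 4, so this cannot broadcast
broadcast_failed = False
try:
    coords + np.array([0.1, 0.2, 0.3, 0.4])
except ValueError as error:
    broadcast_failed = True
    print("ValueError:", error)
```

Reshaping the length-4 array to shape (4, 1) is what makes the per-atom shift work.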


### Logical comparisons

We can also do logical comparisons on whole arrays. For example, to find out if values in the array are greater than 0, we can write

In [94]:
print(coordinates > 0)

[[False False  True]
 [False  True False]
 [False  True  True]]


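To get every value in the array that is greater than 0, we can use this array of True and False values as an index (often called a boolean mask). Only the elements where the mask is True are returned, as a one-dimensional array.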

In [97]:
greater_than_0_values = coordinates[coordinates > 0]
print(greater_than_0_values)

[0.965491 0.001486 0.931026 1.207929]
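
A boolean array can be used for more than selecting values. The following sketch, which copies the water coordinates printed earlier into a standalone array called `values` (a name chosen for this example), counts the matches, finds where they are, and combines two conditions.

```python
import numpy as np

# Coordinates copied from the output shown earlier in this section
values = np.array([[0.0, -0.007156, 0.965491],
                   [-0.0, 0.001486, -0.003471],
                   [0.0, 0.931026, 1.207929]])

mask = values > 0

# True counts as 1, so summing the mask counts the matching elements
n_positive = mask.sum()
print(n_positive)

# (row, column) position of every True entry
positions = np.argwhere(mask)
print(positions)

# Combine conditions with & (and) or | (or); the parentheses are required
between_0_and_1 = values[(values > 0) & (values < 1)]
print(between_0_and_1)
```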


### Array Axes

Imagine we wanted to calculate the geometric center of our molecule. To do this, we would need to get the average x coordinate, the average y coordinate, and the average z coordinate.

In a previous lesson, we calculated the mean of each column of an array using the range function and a for loop. This was good because it reminded us of the range function and how for loops worked. However, the numpy.mean function will let us do that without a for loop.

Written with a for loop, our code for that would look something like this:

In [117]:
center = list()

for dim in range(coordinates.shape[1]):  # one pass per column: x, y, z
    dim_mean = np.mean(coordinates[:, dim])  # use every row of column 0, then 1, then 2
    center.append(dim_mean)

print(center)
    
    

[0.0, 0.308452, 0.7233163333333333]


In [121]:
# you can do this without a loop by passing the axis argument
# axis=0 collapses the rows, so the following command says:
# calculate the mean for each column

np.mean(coordinates,axis=0)

array([0.        , 0.308452  , 0.72331633])
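
The axis argument and broadcasting work well together. As a sketch, the cell below computes the geometric center with `axis=0` and then subtracts it from every atom to move the molecule so that its center sits at the origin. It also shows what `axis=1` gives for comparison. The array `coords` is copied from the output above, and the other names are illustrative.

```python
import numpy as np

# Coordinates copied from the output shown earlier in this section
coords = np.array([[0.0, -0.007156, 0.965491],
                   [-0.0, 0.001486, -0.003471],
                   [0.0, 0.931026, 1.207929]])

# axis=0 averages over the rows (atoms): one value per column (x, y, z)
center = coords.mean(axis=0)
print(center)

# The (3,) center is broadcast across all rows of the (3, 3) array
centered = coords - center
print(centered.mean(axis=0))  # approximately zero in every column

# axis=1 averages over the columns instead: one value per atom
per_atom_mean = coords.mean(axis=1)
print(per_atom_mean.shape)
```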

### Optional - Returning to the geometry analysis project
In your geometry analysis project, you had to analyze an xyz file, find the bonds, and print bond lengths.

We can rewrite that project using the features of numpy arrays.

Recall that a solution given for a function for calculating distances between two points was the following

In [155]:
#my solution 

def calculate_distance(atom1_coord, atom2_coord):
    
    """Calculate the distance between two three-dimensional points."""

    x_distance = atom1_coord[0] - atom2_coord[0]
    y_distance = atom1_coord[1] - atom2_coord[1]
    z_distance = atom1_coord[2] - atom2_coord[2]
    bond_length = np.sqrt(x_distance**2+y_distance**2+z_distance**2)
    return bond_length
calculate_distance(oxygen_coord,coordinates[1])

0.9690005374652793

In [166]:
# the original approach, tidied up and renamed so it can be compared later

def calculate_distance_list(rA, rB):
    '''Calculate the distance between points A and B'''
    x_dist = (rA[0] - rB[0]) ** 2
    y_dist = (rA[1] - rB[1]) ** 2
    z_dist = (rA[2] - rB[2]) ** 2
    
    distance = np.sqrt(x_dist + y_dist + z_dist)
    
    return distance

calculate_distance_list(oxygen_coord, coordinates[1])


0.9690005374652793

In [157]:
# my first try, using the function arguments rather than the global array
def calculate_distance_1(rA, rB):
    a = (rA - rB)**2
    b = np.sum(a)
    c = np.sqrt(b)
    return c

calculate_distance_1(oxygen_coord, coordinates[1])


0.9690005374652793

In [158]:
# updating my code
# you can make it into one line:

def calculate_distance_1(rA, rB):
    a = np.sqrt(np.sum((rA - rB)**2))
    return a

calculate_distance_1(oxygen_coord, coordinates[1])

0.9690005374652793

In [161]:
# the lesson's solution, which is cleaner

def calculate_distance(rA, rB):
    AB = (rA - rB)**2
    distance = np.sqrt(np.sum(AB))
    return distance

#calculate_distance(oxygen_coord,coordinates[1])

# making it more compact

def calculate_distance(rA, rB):
    distance = np.sqrt(np.sum((rA - rB)**2))
    return distance

calculate_distance(oxygen_coord,coordinates[1])

0.9690005374652793

You might also have used the numpy function np.linalg.norm which calculates the magnitude of a given vector.

In [162]:
# even better

def calculate_distance(rA, rB):
    dist_vec = rA - rB
    distance = np.linalg.norm(dist_vec)
    return distance

In [169]:
# coordinates of each atom
r1 = coordinates[0]
r2 = coordinates[1]
r3 = coordinates[2]

Rename your original distance function to calculate_distance_list. Calling both functions shows that they give the same answer.

In [172]:
print(calculate_distance_list(r1, r2))
print(calculate_distance(r1, r2))

0.9690005374652793
0.9690005374652793
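
The geometry project needed the distance between every pair of atoms in order to find bonds. One possible way to get all of them at once, sketched here rather than taken from the lesson, is to combine broadcasting with the `axis` argument of `np.linalg.norm`. The array `coords` is copied from the output shown earlier, and `diff` and `distances` are names chosen for this example.

```python
import numpy as np

# Coordinates copied from the output shown earlier in this section
coords = np.array([[0.0, -0.007156, 0.965491],
                   [-0.0, 0.001486, -0.003471],
                   [0.0, 0.931026, 1.207929]])

# Shapes (3, 1, 3) and (1, 3, 3) broadcast to (3, 3, 3):
# diff[i, j] is the vector from atom j to atom i
diff = coords[:, np.newaxis, :] - coords[np.newaxis, :, :]

# Take the norm along the last axis: distances[i, j] is the i-j distance
distances = np.linalg.norm(diff, axis=-1)

print(distances.shape)
print(distances[0, 1])  # distance between the first and second atom
```

The result is symmetric with zeros on the diagonal, so only the entries above the diagonal are needed when looking for bonds.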


### Key points
NumPy arrays of the same shape are combined element-wise when added, subtracted, or multiplied.

When arrays have different but compatible shapes, NumPy uses broadcasting so that they can still be added or multiplied.

NumPy has extensive documentation online; check it whenever you need to do a computation.