<hr><hr>

# Data Science Summer School - Split '16

## Day 1 - Python for data analysis fundamentals 
### *Numpy, Pandas, Matplotlib*

(c) 2016 Damir Pintar

*version: 0.1* 


`kernel: Python 2.7`

<hr><hr>

# Part 1 - Numpy

*Numpy*, or *Numerical Python*, is a package aimed primarily at high-performance array-based scientific computing. If your work requires efficient processing of arrays (especially high-dimensional arrays), you might find *Numpy* extremely useful. For more generic data analysis involving tabular data there are more convenient higher-level packages available (such as *Pandas*, to be explained later), but certain *Numpy* principles carry over to that package, too.

What follows is only a small glimpse into the functionality offered by *Numpy*. We will focus primarily on 1-dimensional arrays and the operations we can perform on them. Arrays of two or more dimensions, the advanced mathematical capabilities offered by *Numpy* (such as linear algebra operations, Fourier transforms etc.), and arguably one of *Numpy*'s greatest strengths - integration with lower-level languages such as C++ or Fortran - are beyond the intended scope of this Notebook. Readers are encouraged to expand on the concepts provided below with additional reading, starting with the official Numpy documentation <a href = "#1">[1]</a>.

### *Numpy* arrays

Imagine that we have two Python lists holding the points scored by the same group of people, in the same order, on two separate exams. Let's say that we want to get the total score, which involves summing the paired elements of both lists. There are a few conceivable ways you might try this:

In [2]:
exam1 = [10, 8, 12, 14, 5, 16, 11, 10, 9]
exam2 = [8, 8, 13, 19, 15, 6, 18, 7, 19]

z1 = exam1 + exam2                              
z2 = [(i + j) for i in exam1 for j in exam2]    
z3 = [(i + j) for (i,j) in zip(exam1, exam2)]  

### print out the contents of the above lists and comment on the results
###
###
###


As you can see, some of the approaches above simply do not work as intended, while others require more complex code than we would prefer (and may be computationally expensive for larger lists!).
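To make the difference concrete, here is a minimal sketch using two short made-up lists (`a` and `b` are illustrative names, not the exam data). It shows what each of the three approaches produces on plain Python lists.

```python
a = [1, 2, 3]
b = [10, 20, 30]

# '+' on lists concatenates them instead of adding element by element
concatenated = a + b                          # [1, 2, 3, 10, 20, 30]

# two 'for' clauses pair every element of a with every element of b
all_pairs = [i + j for i in a for j in b]     # 3 * 3 = 9 sums

# zip walks both lists in step, which gives the paired sums we wanted
pairwise = [i + j for i, j in zip(a, b)]      # [11, 22, 33]

print(concatenated)
print(len(all_pairs))
print(pairwise)
```

Only the `zip` version gives the paired sums, and it needs an explicit loop to do so.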

*Numpy* offers so-called **vectorized** operations which automatically perform requested operations on paired array elements. To convert a list to a *Numpy* array, you simply call the `array` function of the *Numpy* package and put the list as the first argument.

Try to convert `exam1` and `exam2` lists to *Numpy* arrays. Then try to sum the contents of those two arrays.

In [3]:
# first we need to import the numpy package
# we will use the following convention:
import numpy as np               

# convert the exam1 and exam2 lists to numpy arrays using np.array() function
# call the new arrays np_exam1 and np_exam2

###
###

# print out the result of adding np_exam1 and np_exam2 using the arithmetic operator '+'

###


Subtraction, multiplication, division and other arithmetic operators work the same way - the result is an array with the same dimensions, whose elements are the results of applying the given arithmetic operator to the paired elements of both arrays.
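As a small illustration of the statement above, the following sketch applies a few operators to two made-up arrays (`p` and `q` are illustrative names). It also shows what happens when the two arrays do not have matching lengths.

```python
import numpy as np

p = np.array([6, 8, 10])
q = np.array([1, 2, 5])

print(p - q)     # element-wise difference: [5 6 5]
print(p * q)     # element-wise product:    [ 6 16 50]

# the result keeps the shape of the operands
print((p * q).shape)

# arrays of different lengths cannot be paired up element by element
try:
    p + np.array([1, 2])
except ValueError as err:
    print("shapes must match: %s" % err)
```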

We can also perform arithmetic operations with arrays and scalars:

In [None]:
# print out the contents of np_exam1

###

# print out the result of dividing np_exam1 by 2

###



Notice what happens when you perform arithmetic operations with arrays and scalars - the scalar gets "recycled", so each element of the array is combined with its own "copy" of the scalar in question.
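*Numpy* calls this recycling behaviour *broadcasting*. The sketch below, on a made-up array `v`, shows that combining an array with a scalar gives the same result as combining it with an array filled with that scalar. Dividing by `2.0` rather than `2` is a deliberate choice here: under Python 2.7, `/` between integers performs floor division unless true division is imported from `__future__`.

```python
import numpy as np

v = np.array([2, 4, 7])

print(v + 10)      # every element is increased by 10
print(v * 3)       # every element is tripled
print(v / 2.0)     # float division: 1.0, 2.0 and 3.5

# the scalar behaves like an array of the same length filled with that value
same = np.array_equal(v + 10, v + np.array([10, 10, 10]))
print(same)
```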

What about vectorizing boolean operations? We can do that too, but we need to remember that *Numpy* uses the `&` and `|` operators as vectorized versions of `and` and `or`.

In [21]:
x = np.array([True, False, True, False])
y = np.array([True, True, False, False])

# print out arrays corresponding to vectorized versions of "x and y" and "x or y"

###
###


Finally, let's see how vectorization works with relational operators. As expected, using a relational operator in an array-array or array-scalar operation will result in a Boolean array.
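Relational and boolean operators are often combined. The following sketch uses a made-up array `t` to show one detail worth knowing: `&` and `|` bind more tightly than the comparison operators, so each comparison needs its own parentheses. The plain `and` keyword does not work on arrays at all.

```python
import numpy as np

t = np.array([3, 8, 12, 15, 20])

in_range = (t > 5) & (t < 16)     # element-wise "and"
outside = (t <= 5) | (t >= 16)    # element-wise "or"
flipped = ~in_range               # element-wise "not"

print(in_range)
print(outside)
print(flipped)

# the plain keyword cannot decide on a single truth value for a whole array
try:
    (t > 5) and (t < 16)
except ValueError as err:
    print("plain 'and' fails: %s" % err)
```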

In [25]:
x = np.array([1,2,3,4,5])
y = np.array([-2,0,2,4,6])

# print the result of x > y
###

# print the result of x < 2
###


Unlike Python lists, *Numpy* arrays can hold only one type of data (if we try to create an array from elements of multiple types, *Numpy* will coerce them all to the "strongest" type). 

We can find out an array's data type by checking its `dtype` attribute.
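The sketch below illustrates the coercion order on a few made-up arrays: booleans give way to integers, integers to floats, and everything gives way to strings. The exact integer and string type names printed depend on the platform and Python version, so no specific output is assumed here.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed_float = np.array([1, 2, 3.5])     # integers are promoted to floats
mixed_bool = np.array([True, 2, 3])     # True is promoted to the integer 1
mixed_text = np.array([1, 2.5, "A"])    # everything becomes a string

print(ints.dtype)
print(mixed_float.dtype)
print(mixed_bool.dtype)
print(mixed_text.dtype)

# astype() returns a converted copy when a specific type is needed
as_float = ints.astype(float)
print(as_float.dtype)
```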

In [None]:
my_array1 = np.array([1, True, 2.3, "A"])

# print out the contents of above array and its data type

###
###


As with Python's lists, there are a few convenience functions for array creation which we may find very useful. For example, `arange` is *Numpy*'s version of `range`, `zeros` and `ones` create arrays of zeros and ones respectively, and `random.normal` gives us an array of elements randomly drawn from a normal distribution.

In [10]:
# print an array containing integers from 1 to 10 using the function arange 
# remember you can ask for help on using the function by typing help(np.arange)

###


# print an array of 10 zeros and an array of 5 ones 

###
###


# print an array of 10 elements drawn from a normal distribution with the mean of 50 and standard deviation of 5
np.random.seed(1234)            # setting the seed for reproducibility

###


Instead of `normal` you might have also used some other distribution, such as `gamma`, `uniform`, `chisquare` etc. 
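As a brief sketch of these helpers with arguments other than the ones in the exercise, the code below uses a step in `arange`, shows that `zeros` and `ones` produce floats by default, and draws from `uniform`. The seed value 42 is arbitrary; reseeding with the same value reproduces the same draws.

```python
import numpy as np

evens = np.arange(0, 10, 2)             # start, stop (exclusive), step
blank = np.zeros(4)                     # floats by default
sevens = np.ones(3) * 7                 # ones combined with a scalar

np.random.seed(42)
draws = np.random.uniform(0, 1, 5)      # low, high, number of elements

np.random.seed(42)
again = np.random.uniform(0, 1, 5)      # same seed, same numbers

print(evens)
print(blank.dtype)
print(sevens)
print(np.array_equal(draws, again))
```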

### Array slicing

Slicing arrays works in much the same way as slicing lists. Try it out:

In [19]:
my_array = np.arange(1, 11)

# print the first element of my_array
###

# print all but the last element
###


# print the following elements: second, third, fifth and then second again
###

 


You may have noticed that to successfully achieve the last bit you had to supply a list of indexes. This is what we call an *index list* - a regular list of integers which denote the indexes we are interested in retrieving. Notice that if we want, we can retrieve the same element multiple times. Also, note that we can use the "index list" not just to slice, but also to reorder the array.

There is one thing you might find slightly unexpected: array slices are basically "views" onto the original array. This means we can change a portion of the array simply by assigning values to its slice. Try it out. 
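A minimal sketch of index lists on a made-up array of letters (`letters` is an illustrative name), showing both repeated retrieval and reordering:

```python
import numpy as np

letters = np.array(["a", "b", "c", "d", "e"])

# the same index may appear more than once
picked = letters[[0, 3, 3]]

# listing every index in a new order reorders the array
reordered = letters[[4, 3, 2, 1, 0]]

print(picked)
print(reordered)
```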

In [20]:
# change the first three elements of my_array to zero
###

# put the last three elements of my_array into a separate variable called my_array_slice
###

# change all elements of my_array_slice to zero
###

# print out my_array
#my_array



[1 2 3 4 5 6 7 0 0 0]
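The view behaviour applies to regular slices (the `start:stop` form). The sketch below, on a made-up array `base`, contrasts it with two cases that do not write back to the original: an explicit `copy()` and an array retrieved through an index list.

```python
import numpy as np

base = np.arange(1, 6)            # 1, 2, 3, 4, 5

view = base[1:3]                  # regular slice: a view onto base
view[:] = 0                       # writes through, base is now 1, 0, 0, 4, 5

independent = base[3:].copy()     # explicit copy
independent[:] = -1               # base is not affected

fancy = base[[0, 3]]              # index list: the result is a copy
fancy[:] = 99                     # base is not affected

print(base)
```

When a slice should be modified without touching the original array, calling `copy()` on it first is the usual approach.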


Remember how we used a relational operator with a numerical array and got a Boolean array as a result? Time to put those Boolean arrays to use.

In [None]:
# just run this code
x = np.arange(1, 6)

# try to print the result of: x[x > 2] 
###


How do you explain this result? It's simple - we can retrieve a subset of an array by providing an array (or a list) of boolean elements which has the same length as the original array. The result will be an array containing only the elements for which the corresponding boolean value was `True`. Consequently, what happened here was this:
<hr>

`x > 2                             x[x > 2]`

`   1  >  2   ->   False              1   [False]  ->   x
2  >  2   ->   False              2   [False]  ->   x
3  >  2   ->   True               3   [True]   ->   3
4  >  2   ->   True               4   [True]   ->   4
5  >  2   ->   True               5   [True]   ->   5`


<hr>

Those versed in SQL might have noticed parallels with:
<hr>

`SELECT * FROM x WHERE x > 2`
<hr>
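A Boolean array built from one array can also be used to filter another array of the same length, much like a `WHERE` clause on one column selecting values from another. The sketch below uses made-up data; the names `names` and `points` are purely illustrative.

```python
import numpy as np

names = np.array(["Ana", "Ivo", "Mia", "Luka"])
points = np.array([14, 7, 19, 10])

# one True/False flag per person
passed = points >= 10

print(passed)
print(names[passed])      # names of those with at least 10 points
print(points[passed])     # their scores
```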

Using Boolean arrays as index lists is very powerful and extremely common. Let's try to solve the following exercise:

In [30]:
# first notice and explain the result of the following operation (np.sum will add up all array elements)
###


In [34]:
# Now try to put the above concept to some use. Let's initialize an array of random elements

np.random.seed(1234)            
x = np.random.normal(0, 1, 10)

# print out the array x and examine its contents
###

# print out the total number of elements smaller than 0
###

# change all those pesky negative elements to 0 and print out the array
###
###



What you saw above is a common trick analysts use when they want to count the elements (or observations) for which a certain criterion evaluates to `True`. We might have also used the function `mean` (in our case `np.mean`, of course), which would have given us the proportion of elements which satisfy the criterion.
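Both variants of the trick can be sketched on a small array with known contents (`temps` is a made-up example), so the numbers are easy to verify by hand:

```python
import numpy as np

temps = np.array([3, -1, 4, -1, 5, -9, 2, 6])
below_zero = temps < 0

# True counts as 1 and False as 0, so the sum is a count
count = np.sum(below_zero)       # 3 negative values

# the mean of the same flags is the proportion of True values
share = np.mean(below_zero)      # 3 out of 8, i.e. 0.375

print(count)
print(share)
```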

We'll wrap up our brief introduction to this powerful package by showcasing some of the fun stuff you can do with it. We have already mentioned the functions `sum` and `mean` of the *Numpy* package, which add up or average the elements of an array. *Numpy* also offers a very rich set of so-called "universal functions" (or `ufuncs`), which are essentially well-known mathematical functions implemented so that they work element by element on whole arrays, in vectorized fashion. You can check out a full list at the bottom of <a href = "#5"> [5] </a>. Let's try a few *Numpy* functions out:
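As a rough sketch of the two kinds of functions at work, the code below applies element-wise functions (`np.sqrt`, `np.minimum`) and an aggregating function (`np.sum`) to a made-up array `areas`. The element-wise ones return an array of the same length, while the aggregation collapses the array to a single number.

```python
import numpy as np

areas = np.array([1.0, 4.0, 9.0, 16.0])

sides = np.sqrt(areas)               # element-wise square root
capped = np.minimum(areas, 10.0)     # element-wise minimum against a scalar
total = np.sum(areas)                # one number for the whole array

print(sides)
print(capped)
print(total)

# np.sqrt is a ufunc object, np.sum is an ordinary function
print(isinstance(np.sqrt, np.ufunc))
print(isinstance(np.sum, np.ufunc))
```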

In [39]:
# We create an array of 100 elements drawn from a uniform distribution
np.random.seed(1234)
a = np.random.uniform(0, 100, 100)

# let's round these numbers to 2 decimals with the round function and save it back to variable a
###

# print a
###


In [42]:
# print out the mean, standard deviation and variance by using Numpy functions: mean, std, var
###
###
###


In [43]:
# print out the 75th percentile using the percentile function
###


In [49]:
# let's keep only 10 elements from this array. Use the random.choice function. Call the new array a_sample
###

# print the contents of a_sample
###


In [48]:
# check out the mean of a_sample
###


In [53]:

# finally, sort this sample in-place by using the sort method of the array itself
###

# print a_sample
###


Remember, we have only scratched the surface of what *Numpy* can do. If you want to learn more about this package, check out the official documentation at <a href = "http://docs.scipy.org/doc/numpy/"> [1] </a>. 

<hr> <hr> <hr>
## <font color = "blue">Exercises</font>

**`1.`** In this exercise we will run through a few of the basic concepts we learned in this lecture. Try to solve the following tasks without going back to the lecture segments, and only check them if you get stuck.

In [None]:
# Create an array of integers from 20 to 50. Print a sum of its elements



In [None]:
# Create an array of 20 zeros. Change each element which has an odd index to 1



In [None]:
# Create an array of 100 random elements drawn from a normal distribution with a mean of 100 and standard deviation of 5
# Count how many elements have a value lower than 90



In [None]:
# Draw two random samples of 10 elements from the above array. Find their mean and standard deviation.
# Then add those two samples together, divide the result by 2, and finally find the mean and standard deviation
#    of the resulting array




**`2.`** In this lecture we only dealt with 1-dimensional arrays. Let's try an exercise which deals with *Numpy* matrices (or two-dimensional arrays). You will see that almost everything we learned is directly applicable to multi-dimensional array structures, too.

In [None]:
# let's create a 3 x 3 matrix containing integers from 1 to 9, ordered row-wise
# first create a "list of lists", a list of 3 elements, each element being itself a list of integers,
#         first list being numbers from 1 to 3 etc.
# store this list in a variable called: a



In [None]:
# now create a Numpy array from this list: np_a
# print out the array, its shape attribute and its ndim attribute



In [None]:
# there's an easier way to create a Numpy 2d array
# use Numpy function arange to create an array of integers from 1 to 9,
# then use its method called "reshape" to turn a 1d array into a 2d array
#      - method reshape takes integers as arguments - i.e.  reshape(3,4) - 2d array, 3 rows, 4 columns
# store this new 2d array in a variable called np_b



In [None]:
# Check whether np_a and np_b are equal. You can use Numpy's function called array_equal which checks whether
#     two provided arrays have identical elements as well as identical shape




In [None]:
# Which element is in the 1st row, 2nd column of np_a?  
# Remember that 2d arrays use the same indexing methods as 1d arrays, only you use two indexes separated by a comma (",")



In [None]:
# Now try slicing - print out elements from the 1st and 2nd row, 2nd and 3rd column of np_a




In [1]:
# Finally, print the entire 3rd column of np_a. Remember that the shortcut for "all rows" is ":"



For further exercises, we suggest a great resource called **100 Numpy exercises**, available here: <a href = "#6">[6]</a>. It is a joint effort of the Numpy community, and is a great way both to review the concepts learned here and to learn some new Numpy tricks!

<hr> <hr> <hr>

## Additional resources

<a name="1"></a><a href = "http://docs.scipy.org/doc/">[1]</a> *Numpy and Scipy documentation*, last accessed 2016/09/19

<a name="2"></a><a href = "https://docs.scipy.org/doc/numpy-dev/user/quickstart.html">[2]</a> *Quickstart tutorial*, official *SciPy* documentation, last accessed 2016/09/04

<a name="3"></a><a href = "http://docs.scipy.org/doc/numpy-1.10.1/user/basics.html">[3]</a> *Numpy basics*, official SciPy documentation, last accessed 2016/09/04

<a name="4"></a><a href = "http://cs231n.github.io/python-numpy-tutorial/">[4]</a> *Python Numpy tutorial* by Justin Johnson, last accessed 2016/09/05

<a name="5"></a><a href = "http://docs.scipy.org/doc/numpy/reference/ufuncs.html">[5]</a> *Numpy universal functions*, official SciPy documentation, last accessed 2016/09/19

<a name="6"></a><a href = "http://www.labri.fr/perso/nrougier/teaching/numpy.100/">[6]</a> *100 Numpy exercises*, last accessed 2016/09/22