# Python Training

Welcome to Python training at 84.51$^\circ$! The tutorial is in 6 parts:

* Python Basics
* NumPy
* Matplotlib
* Pandas
* 84.51$^\circ$ Specific
* Small case study

By the end, you'll hopefully have a good enough understanding of Python to get working on your own projects. Some of the material draws on the excellent resource "Python Data Science Handbook", which can be found here:

https://jakevdp.github.io/PythonDataScienceHandbook/
 
The chapters of this book cover each topic in this training (and more) in greater depth.

# NumPy

NumPy is such an important package for data analytics that it merits its own small section. To begin with, the most common way to import NumPy is as follows:

In [2]:
# np serves as an alias 
import numpy as np

Writing `numpy` over and over again can be tedious, so `np` is a well-understood abbreviation that is used commonly throughout the programming world.

## NumPy Array vs. Standard List

The most important feature of NumPy, and the one that makes it stand out, is the array. **Arrays** are like lists, but with a few small twists. First, all elements of an array must share the same datatype (for example, all must be ints or all must be floats). A list, by contrast, may contain elements of different datatypes (for example, both integers and strings).

Moreover, arrays behave much like vectors in mathematics: you can perform mathematical operations on them (such as adding, subtracting, or multiplying two arrays element by element). Arrays allow for fast computation and make numeric operations easy. Later in this section we will look at an example of the stark difference in speed between arrays and lists.

Lists do offer flexibility in the items they can hold and in some of their operations (e.g. length or slicing). Luckily, many operations that work with lists (such as `len` and slicing) also work with arrays.
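To make the two differences described above concrete, here is a minimal sketch. The variable names (`mixed_list`, `mixed_arr`, `list_sum`, `arr_sum`) are made up for illustration.

```python
import numpy as np

# A list may mix types; an array converts everything to one common dtype
mixed_list = [1, 2.5, 3]
mixed_arr = np.array(mixed_list)
print(mixed_arr.dtype)  # float64: the ints were upcast to floats

# "+" concatenates two lists, but adds two arrays element by element
list_sum = [1, 2, 3] + [4, 5, 6]
arr_sum = np.array([1, 2, 3]) + np.array([4, 5, 6])
print(list_sum)  # [1, 2, 3, 4, 5, 6]
print(arr_sum)   # [5 7 9]
```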

In [3]:
arr = np.array([8,4,5,1])
#add arrays of equivalent length
print(arr+arr) 

#multiply arrays of equivalent length 
print(arr*arr)

# find the length of an array
print(len(arr))

# find the shape of an array 
print("The shape of my array is:{}".format(arr.shape))

# slice from 1st index to end (2nd element to the end of array)
print(arr[1:])

# print only the 3rd index (4th element in array)
print(arr[3])

[16  8 10  2]
[64 16 25  1]
4
The shape of my array is:(4,)
[4 5 1]
1


In [6]:
# Try adding arrays of different sizes: this raises a ValueError
arr2 = np.array([1,2,3])
print(arr+arr2)

ValueError: operands could not be broadcast together with shapes (4,) (3,) 
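The error message mentions "broadcast", which is NumPy's term for stretching a smaller operand to match a larger one. The sketch below shows two cases that do work (a scalar and a length-1 array) next to the failing case; the names `plus_scalar`, `plus_one`, and `failed` are illustrative only.

```python
import numpy as np

arr = np.array([8, 4, 5, 1])

# A scalar is stretched to match every element
plus_scalar = arr + 10
print(plus_scalar)  # [18 14 15 11]

# A length-1 array is stretched the same way
plus_one = arr + np.array([100])
print(plus_one)  # [108 104 105 101]

# Lengths 4 and 3 cannot be matched, so NumPy raises ValueError
try:
    arr + np.array([1, 2, 3])
    failed = False
except ValueError as err:
    failed = True
    print("Could not add:", err)
```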

## Mathematical Operations with NumPy

Below are samples of a few different mathematical operations that can be run on a NumPy array:

*Across full array*

* `np.mean`, `np.var`, `np.std` take the mean, variance, and standard deviation respectively of an array
* `np.max`, `np.min` provide the maximum and minimum values of an array
* `np.sum` outputs the sum of all array elements (Note: axis can be specified for multi-dimensional arrays) 

*Operation across each element in an array*
* `np.sqrt` returns the positive square-root of each element in an array
* `np.log` provides the natural logarithm of each element in an array 
* `np.add` adds its second argument (a number or another array) to each element
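The note about `axis` above can be illustrated with a small two-dimensional array. This is only a sketch: `weekly_spend` is hypothetical data (2 households by 3 weeks), not part of the training dataset.

```python
import numpy as np

# Hypothetical data: 2 households (rows) x 3 weeks (columns)
weekly_spend = np.array([[10.0, 20.0, 30.0],
                         [40.0, 50.0, 60.0]])

# No axis: sum over every element
total = np.sum(weekly_spend)
print(total)  # 210.0

# axis=0 collapses the rows, giving one sum per column (per week)
per_week = np.sum(weekly_spend, axis=0)
print(per_week)  # [50. 70. 90.]

# axis=1 collapses the columns, giving one sum per row (per household)
per_household = np.sum(weekly_spend, axis=1)
print(per_household)  # [ 60. 150.]
```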

Let's build an array called **hh_spend** which provides data for the total spend of 10 households in a 3-week period. We can then try some of the aforementioned operations on the **hh_spend** array.

More can be found at: https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.math.html

In [13]:
# In IPython/Jupyter, appending ? to a name displays its documentation
np.argpartition?

In [7]:
# Build hh_spend array 
hh_spend = np.array([800.56,457.3,505.98,100.01, 567.87, 303.23, 178.65, 213.98, 606.78, 907.65])
print("We have data for {} households".format(len(hh_spend)))

We have data for 10 households


In [8]:
# what is the mean spend? 
print(np.mean(hh_spend))

# What is the standard deviation? 
print(np.std(hh_spend))

464.201
254.037438912


In [9]:
# what is the maximum value? min? 
print("The max value is {}".format(np.max(hh_spend)))
print("The min value is {}".format(np.min(hh_spend)))

The max value is 907.65
The min value is 100.01


In [10]:
# what is the square-root of each HHs spend? 
print(np.sqrt(hh_spend))

[ 28.29416901  21.38457388  22.4939992   10.00049999  23.83002308
  17.41350051  13.36600165  14.62805524  24.63290482  30.12723021]


In [14]:
# Let's add $10 to each HH Spend 
print(np.add(hh_spend,10))
np.sum(hh_spend)

[ 810.56  467.3   515.98  110.01  577.87  313.23  188.65  223.98  616.78
  917.65]


4642.0099999999993
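Two things in the output above are worth a quick check. First, `np.add(hh_spend, 10)` gives the same result as the shorter `hh_spend + 10`. Second, the sum prints as `4642.0099999999993` rather than `4642.01` because of floating-point rounding, so comparisons are safer with a tolerance. The sketch below assumes `np.isclose` with its default tolerances is acceptable for that comparison.

```python
import numpy as np

hh_spend = np.array([800.56, 457.3, 505.98, 100.01, 567.87,
                     303.23, 178.65, 213.98, 606.78, 907.65])

# The + operator is shorthand for np.add
same_result = np.array_equal(np.add(hh_spend, 10), hh_spend + 10)
print(same_result)  # True

# Compare floating-point results with a tolerance rather than ==
close_enough = np.isclose(np.sum(hh_spend), 4642.01)
print(close_enough)  # True
```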

## NumPy Has Superior Computational Speed

Let's take a simple example to show the slowness of lists relative to arrays. 

Perhaps we have a list that contains the spend amounts for a household across the last 10 years' worth of transactions.

1) To build this we will use **np.random.seed** so that the random integers we create can be reproduced

2) Next, we create an array of 100,000 random integers from 1 to 49 with **np.random.randint** (the upper bound of 50 is exclusive) and name it **spend_array**

3) Then create a **spend_list** as a list of the same elements

4) Finally, the **%timeit** magic command can be called when running a mathematical operation on **spend_array** and **spend_list** to check how long each process takes. 

What you should see (although computation times will vary slightly with each run) is that the array is *significantly* faster than the list in terms of summation. The same holds if we multiply across values.

**Helpful reminder: 1000 µs = 1 ms**


In [17]:
np.random.seed(0) #use a seed for replication 

spend_array = np.random.randint(1,50,100000)
print(spend_array[1:10])

# make a copy that is a list
spend_list = list(spend_array)
print(spend_list[1:10])

# check types
print(type(spend_array))
print(type(spend_list))

[48  1  4  4 40 10 20 22 37]
[48, 1, 4, 4, 40, 10, 20, 22, 37]
<type 'numpy.ndarray'>
<type 'list'>


In [18]:
# for the array 
%timeit np.sum(spend_array)

10000 loops, best of 3: 160 µs per loop


In [19]:
# for the list 
%timeit sum(spend_list)

100 loops, best of 3: 13 ms per loop


In [20]:
# verify the values would be equivalent
sum(spend_list) == np.sum(spend_array)

True
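`%timeit` only works inside IPython or Jupyter. If you want to run the same comparison in a plain Python script, one option is the standard-library `timeit` module, as sketched below. The choice of 20 repetitions is arbitrary, and the timings you see will depend on your machine.

```python
import timeit
import numpy as np

np.random.seed(0)  # use a seed for replication
spend_array = np.random.randint(1, 50, 100000)
spend_list = list(spend_array)

# Time 20 repetitions of each summation (total seconds for all 20 runs)
array_time = timeit.timeit(lambda: np.sum(spend_array), number=20)
list_time = timeit.timeit(lambda: sum(spend_list), number=20)

print("array: {:.4f} s, list: {:.4f} s".format(array_time, list_time))
```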

## Multi-Dimensional Arrays

Arrays can also be multi-dimensional and treated like matrices. We can make two-dimensional arrays as well as three-dimensional arrays. To start, let's focus on two-dimensional arrays, which look very similar to a dataframe in R, a table in SQL, or an Excel spreadsheet.

In [21]:
arr1 = np.array([1,2,3,4])
arr2 = np.array([5,6,7,8])

In [22]:
twoD = np.array([arr1, arr2])
print(twoD)

[[1 2 3 4]
 [5 6 7 8]]
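As a quick sketch of the table analogy, the attributes below report how many dimensions, rows, and columns the array has, and `.T` swaps rows and columns. The name `col_means` is illustrative.

```python
import numpy as np

twoD = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]])

print(twoD.ndim)     # 2 dimensions
print(twoD.shape)    # (2, 4): 2 rows, 4 columns
print(twoD.T.shape)  # (4, 2): the transpose swaps rows and columns

# Like a table, each column can be summarized on its own
col_means = twoD.mean(axis=0)
print(col_means)  # [3. 4. 5. 6.]
```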


## Indexing for Multi-Dimensional Arrays

Indexing a multi-dimensional array is very similar to indexing a one-dimensional array, or to indexing in R (although Python indices start at 0 rather than 1). The next few examples show variations of this for a two-dimensional array:

In [23]:
# Print the 1 (0,0) in the 2-d array
# This is the 0th row and 0th column
# Two ways to do it: Think of this as a matrix (0,0) or as two arrays [0][0]
print(twoD[0,0])
print(twoD[0][0]) #same output

# Print the 2 in the 2-d array
# This is the 0th row, and 1st column
print(twoD[0,1])
print(twoD[0][1])

# Print the 8 in the 2-d array
# This is in the 2nd row (index 1) and 4th column (index 3)
print(twoD[1,3])
print(twoD[1][3])

# Print the full first row 
print(twoD[0,])
print(twoD[0]) #we just take first array 

# Print the last column
print(twoD[:,3])

# print the element from last column of each row 
for i in twoD:
    print(i[3])


1
1
2
2
8
8
[1 2 3 4]
[1 2 3 4]
[4 8]
4
8
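The same comma notation also accepts slices and negative indices, which saves hard-coding positions such as `3` for the last column. A short sketch (the names `last_col`, `middle`, and `corner` are made up for illustration):

```python
import numpy as np

twoD = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8]])

# -1 refers to the last position, so this is the last column
last_col = twoD[:, -1]
print(last_col)  # [4 8]

# All rows, columns at index 1 and 2 (the stop index 3 is excluded)
middle = twoD[:, 1:3]
print(middle)
# [[2 3]
#  [6 7]]

# Last row, last column
corner = twoD[-1, -1]
print(corner)  # 8
```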


## Iterating over Arrays

Moreover, arrays are sequences, so they can be iterated over just like lists and dictionaries. 

We can do simple things, such as print out each element in an array, or perform some calculation on each element.

In [24]:
# print out each element
for num in arr:
    print(num)

8
4
5
1


In [27]:
# add 10 to each element
for num in arr:
    print(num + 10)

18
14
15
11
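Looping works, but for element-wise arithmetic like this it is usually simpler (and faster, as the timing comparison earlier suggests) to operate on the whole array at once. A minimal sketch comparing the two approaches:

```python
import numpy as np

arr = np.array([8, 4, 5, 1])

# Loop version: build a list one element at a time
looped = [num + 10 for num in arr]

# Vectorized version: one expression, no explicit loop
vectorized = arr + 10
print(vectorized)  # [18 14 15 11]

# Both approaches give the same values
print(np.array_equal(looped, vectorized))  # True
```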


## Additional Functions for NumPy
Some important NumPy functions and methods that work on arrays include: 
* `.reshape`, which reshapes an array to a desired new shape. Importantly, this reshape must be possible (you can't turn a 1-element array into a 4x5 matrix)
* `.shape` returns the shape of your array
* `np.sin`, `np.log`, and many other mathematical operations exist that work element-wise on arrays
* `np.all` and `np.any` apply `and` and `or` logic, respectively, across all of the elements in the array and return a single boolean value
* `np.zeros` returns an array of 0s with the shape specified by the user
* `np.mean`, `np.var`, `np.std` take the mean, variance, and standard deviation respectively of an array
* `np.argmax`, `np.argmin` return the position of the largest/smallest number in an array
* `np.cov`, `np.corrcoef` return covariance and correlation coefficients for arrays
* `np.count_nonzero` returns the number of nonzero (equivalently, the number of non-false) items in an array
* `np.arange` returns an array with the specified start, stop, and increments (essentially, it returns a sequence)
* `np.random.seed`, `np.random.randn`, `np.random.randint`, and the rest of the `np.random` module deal with random numbers
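Of the functions listed, `np.cov` and `np.corrcoef` return matrices rather than single numbers, which can be surprising at first. The sketch below uses made-up data in which `y` is an exact linear function of `x`, so the correlation should come out as 1.

```python
import numpy as np

# Made-up data: y is an exact linear function of x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

# 2x2 matrix: variances on the diagonal, covariance off the diagonal
cov_matrix = np.cov(x, y)
print(cov_matrix)

# 2x2 matrix: 1s on the diagonal, correlation of x and y off the diagonal
corr_matrix = np.corrcoef(x, y)
print(corr_matrix)
```

Here `cov_matrix[0, 1]` is the covariance between `x` and `y`, and `corr_matrix[0, 1]` is their correlation coefficient.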

Below are some examples of how to use the above functions. 

In [28]:
# look at the elements of twoD
print(twoD)

# Look at the shape 
print(twoD.shape)

# Reshape to be 1-dimensional
twoDReshaped = twoD.reshape((8,))
print(twoDReshaped)
print(twoDReshaped.shape)

[[1 2 3 4]
 [5 6 7 8]]
(2, 4)
[1 2 3 4 5 6 7 8]
(8,)
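Reshaping also works in the other direction, and passing `-1` for one dimension asks NumPy to work out that dimension from the total number of elements. The sketch below also shows what happens when the requested shape is impossible; the names `flat`, `tall`, `inferred`, and `reshape_failed` are illustrative.

```python
import numpy as np

flat = np.arange(1, 9)  # [1 2 3 4 5 6 7 8]

# 8 elements fit a 4x2 shape
tall = flat.reshape((4, 2))
print(tall.shape)  # (4, 2)

# -1 lets NumPy infer the missing dimension (8 / 2 = 4 columns)
inferred = flat.reshape((2, -1))
print(inferred.shape)  # (2, 4)

# 8 elements cannot fill a 3x3 shape, so this raises ValueError
try:
    flat.reshape((3, 3))
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("Could not reshape:", err)
```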


In [29]:
# calculate the log of each element from twoD
np.log(twoD)

array([[ 0.        ,  0.69314718,  1.09861229,  1.38629436],
       [ 1.60943791,  1.79175947,  1.94591015,  2.07944154]])

In [30]:
# build out an array of booleans
bools = np.array([True, True, False])
print(np.all(bools))
print(np.any(bools))

False
True


In [31]:
# build out an array of zeroes
zeros = np.zeros(10)
print(zeros)
print(np.mean(zeros), np.std(zeros))

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
(0.0, 0.0)


In [32]:
# using argmax
print(np.argmax(np.array([1,0])))

0
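Because `np.argmax` and `np.argmin` return positions rather than values, they are handy for looking up which item is the largest or smallest. A short sketch reusing the **hh_spend** values from earlier (the names `top_idx` and `low_idx` are made up):

```python
import numpy as np

hh_spend = np.array([800.56, 457.3, 505.98, 100.01, 567.87,
                     303.23, 178.65, 213.98, 606.78, 907.65])

# Position of the household with the highest spend
top_idx = np.argmax(hh_spend)
print(top_idx, hh_spend[top_idx])  # 9 907.65

# Position of the household with the lowest spend
low_idx = np.argmin(hh_spend)
print(low_idx, hh_spend[low_idx])  # 3 100.01
```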


In [33]:
# printing a count of all non-zero elements
# in this case we have two trues (1) and a false (0)
print(np.count_nonzero(bools))

2


In [34]:
# similar to range: take values from [0,10), by 2s
# will not be inclusive for last element (10)
print(np.arange(0,10,2))

[0 2 4 6 8]
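As with Python's built-in `range`, the start and step are optional, and the step may be negative. A brief sketch of both variations (the names `first_five` and `countdown` are illustrative):

```python
import numpy as np

# Only a stop value: start defaults to 0 and step to 1
first_five = np.arange(5)
print(first_five)  # [0 1 2 3 4]

# A negative step counts down; the stop value (0) is still excluded
countdown = np.arange(10, 0, -2)
print(countdown)  # [10  8  6  4  2]
```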


In [35]:
# sample of 2 from standard normal distribution 
print(np.random.randn(2))

[-1.07395387  1.53100051]
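Your own numbers from `np.random.randn` will differ from the output above unless the same seed was set beforehand. The sketch below shows how resetting the seed reproduces a draw; the seed value 42 is an arbitrary choice.

```python
import numpy as np

np.random.seed(42)
first = np.random.randn(2)

# Resetting the same seed replays the same random numbers
np.random.seed(42)
second = np.random.randn(2)
print(np.array_equal(first, second))  # True

# Without reseeding, the next draw continues the sequence and differs
third = np.random.randn(2)
print(np.array_equal(second, third))  # False
```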


NumPy is a very powerful package with many more useful functions than you might expect. If you ever need a mathematical or statistical function, NumPy probably has one, and a Google search will help you find it.

### NumPy Practice 

Below are some practice problems on some of the concepts talked about above. Googling is encouraged! 


1. Create a 3x3 array of random values and a 3x1 array of random values. Multiply them and print the result.
2. Create a 5x5 random matrix, reshape it to 25x1, and normalize it. Print the mean and standard deviation, to verify.
3. Create a 1000x1 random matrix. Remove values above the 99th percentile.
4. Create a 5x5 matrix with row values ranging from 0 to 4.
5. Create a 2-dimensional NumPy array of random integers with a shape of 3 rows and 4 columns. Force the array to have a low value of 500 and a high value of 5000 (consider researching the `np.random.randint` function for support). Print the mean and median of each column. Then do the same for each row.

## NumPy Practice & Beyond:

For those of you cruising through this practice, the following link has 100 additional NumPy problems. Some of these problems served as the motivation for the above problems while some become much more advanced. 

The link to the repository: https://github.com/rougier/numpy-100