# Python Data Science Ecosystem (30 min)

The usefulness of Python for data science stems primarily from the large and active ecosystem of third-party packages. 

Some of the most important ones are:

#### [`numpy`](http://numpy.org/): Numerical Python

This library provides the `ndarray` for computation and manipulation of homogeneous array-based data in Python.

#### [`pandas`](http://pandas.pydata.org/): Panel Data

This library provides the `DataFrame` for analysis and manipulation of heterogeneous and labeled data in Python. 

#### [`scipy`](http://scipy.org/): Scientific Python

This library contains a wide range of functionality for accomplishing common scientific tasks, such as optimization, integration, interpolation, and much more.

#### [`matplotlib`](http://matplotlib.org): MATLAB-style Plotting

This library provides capabilities for a flexible range of publication-quality data visualizations in Python.

Reference: [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html)

# Recap on Numpy

Importing the required libraries is usually done in the first cell of a Jupyter notebook.

By convention, most people in the PyData world import NumPy under the shortened name `np` using the `import ... as ...` pattern:

In [2]:
import numpy as np

## 1. Numpy Arrays

### Creating Arrays from Python Lists with `np.array`

Unlike Python lists, NumPy arrays are constrained to hold elements of a single type. If the types do not match, NumPy will upcast where possible (here, integers are upcast to floating point):

In [3]:
list1 = [1, 2, 3, 4, 5.15]
list1

[1, 2, 3, 4, 5.15]

In [4]:
x1 = np.array([1, 4, 2, 5, 3.14])
print(x1)

[1.   4.   2.   5.   3.14]


Note the decimal point after each element: every value is now stored as a float.
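As a small sketch of the upcasting rule described above, the snippet below compares the inferred dtype for mixed and all-integer input, and shows that an explicit `dtype` argument overrides the inference. The variable names are illustrative only.

```python
import numpy as np

# Mixed ints and floats: NumPy picks a common type (float64 here)
mixed = np.array([1, 4, 2, 5, 3.14])

# All ints: an integer type is kept (exact width is platform dependent)
ints = np.array([1, 4, 2, 5, 3])

# An explicit dtype overrides the inferred one
as_float32 = np.array([1, 4, 2, 5, 3], dtype=np.float32)

print(mixed.dtype)       # float64
print(ints.dtype.kind)   # i  (signed integer)
print(as_float32.dtype)  # float32
```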

Create multi-dimensional arrays with nested lists:

In [5]:
x2 = np.array([[1,2,3], [4,5,6], [7,8,9], [10,11,12]])
print(x2)

[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]


The inner lists are treated as rows of the resulting two-dimensional array.

### Numpy Array Attributes

`dtype`: the data type of the array

In [6]:
print(x1)
print(x1.dtype)

[1.   4.   2.   5.   3.14]
float64


`ndim`: the number of dimensions

`shape`: the size of each dimension

`size`: the total number of elements in the array

In [7]:
print("ndim: ", x2.ndim)
print("shape: ", x2.shape)
print("size: ", x2.size)

ndim:  2
shape:  (4, 3)
size:  12


Note that `shape` is a tuple of ints, so individual dimensions can be read by index:

In [8]:
x2.shape[0]

4
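Because `shape` is an ordinary tuple, it can also be unpacked into separate names. A minimal sketch, using a hypothetical array `grid` with the same contents as `x2`:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Unpack the shape tuple into row and column counts
n_rows, n_cols = grid.shape

print(n_rows, n_cols)                # 4 3
print(grid.size == n_rows * n_cols)  # True: size is the product of the shape
```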

### Creating Arrays from Scratch

Create arrays filled with specific values, e.g., `np.zeros` and `np.ones`:

In [9]:
# Create a length-10 integer array filled with zeros
print(np.zeros(10, dtype=int)) # takes the shape (int or tuple of ints) as an argument
print()
# Create a 2x3 floating-point array filled with ones
print(np.ones((2, 3), dtype=float))
print()
# Create a 2x3 array filled with 3.14
print(np.full((2, 3), 3.14))

[0 0 0 0 0 0 0 0 0 0]

[[1. 1. 1.]
 [1. 1. 1.]]

[[3.14 3.14 3.14]
 [3.14 3.14 3.14]]


Create an array filled with a linear sequence with `np.arange` (similar to the built-in `range` function):

In [10]:
# Range between [0, 20), stepping by 2
np.arange(0, 20, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

Create an array of evenly spaced values between two endpoints with `np.linspace`:

In [11]:
# Five values evenly spaced between [0, 1]
np.linspace(0, 1, 5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])
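The two functions differ in how they treat the upper bound: `np.arange` excludes `stop`, while `np.linspace` includes it by default. The sketch below illustrates this, and also uses the `retstep=True` option of `np.linspace` to recover the spacing it chose. Variable names are illustrative.

```python
import numpy as np

# arange: half-open interval [0, 1), so 1.0 is not included
half_open = np.arange(0, 1, 0.25)

# linspace: closed interval [0, 1], so 1.0 is the last value
closed = np.linspace(0, 1, 5)

# retstep=True additionally returns the spacing between samples
samples, step = np.linspace(0, 1, 5, retstep=True)

print(half_open)  # 4 values, ending at 0.75
print(closed)     # 5 values, ending at 1.0
print(step)       # 0.25
```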

### Reshaping of Arrays

Use `reshape` to put the numbers 1 through 12 in a 3×4 grid:

In [13]:
x3 = np.arange(1, 13).reshape(3, 4)
x3

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

### Exercise 1:

Create the same array with `np.linspace`:

In [14]:
np.linspace(1, 12, 12, dtype=int).reshape(3, 4)

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

Note that if `dtype` is not given, the data type is inferred from `start` and `stop`. For `np.linspace`, the inferred dtype is never an integer: float is chosen even when the arguments would produce an array of whole numbers. This differs from `np.arange`.
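A quick way to see the difference is to build the same sequence with both functions and inspect the resulting dtypes. This is only a sketch; the names `from_arange` and `from_linspace` are illustrative.

```python
import numpy as np

from_arange = np.arange(1, 13)          # integer arguments give an integer array
from_linspace = np.linspace(1, 12, 12)  # always float unless dtype is passed

print(from_arange.dtype.kind)  # i  (signed integer)
print(from_linspace.dtype)     # float64

# The values agree even though the dtypes differ
print(np.array_equal(from_arange, from_linspace))  # True
```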

### Exercise 2:

Reshape `x3` to 6 rows and 2 columns:

In [15]:
x3.reshape(6, 2)

array([[ 1,  2],
       [ 3,  4],
       [ 5,  6],
       [ 7,  8],
       [ 9, 10],
       [11, 12]])
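Two further details of `reshape` are worth a short sketch: one dimension may be given as `-1`, in which case NumPy infers it from the array size, and a target shape whose element count does not match raises a `ValueError`. The array `values` below is a stand-in for the data in `x3`.

```python
import numpy as np

values = np.arange(1, 13)  # 12 elements

six_by_two = values.reshape(6, 2)

# -1 asks NumPy to infer that dimension: 12 elements / 4 columns = 3 rows
inferred = values.reshape(-1, 4)
print(inferred.shape)  # (3, 4)

# 5 * 3 = 15 does not match 12 elements, so reshape fails
try:
    values.reshape(5, 3)
except ValueError as err:
    print("cannot reshape:", err)
```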

## 2. Array Indexing and Slicing

### Array Indexing: Accessing Single Elements

Access the value at row $i$, column $j$ by giving a comma-separated pair of indices in square brackets (a single index works just as with Python lists):

In [16]:
x3

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [17]:
x3[1, 2]

np.int64(7)

Values can also be modified using the above index notation. However, because NumPy arrays have a fixed type, if you attempt to insert a floating-point value into an integer array, the value will be silently truncated.

In [18]:
x3[1, 2] = 3.14159
x3

array([[ 1,  2,  3,  4],
       [ 5,  6,  3,  8],
       [ 9, 10, 11, 12]])
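If the fractional part needs to be kept, one option is to convert the array to a floating-point type first with `astype`, which returns a new array. A minimal sketch, using fresh arrays rather than `x3`:

```python
import numpy as np

int_grid = np.arange(1, 13).reshape(3, 4)
int_grid[1, 2] = 3.14159        # truncated to 3 in the integer array

# astype returns a converted copy; the original stays integer-typed
float_grid = int_grid.astype(float)
float_grid[1, 2] = 3.14159      # stored without truncation

print(int_grid[1, 2])    # 3
print(float_grid[1, 2])  # 3.14159
```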

### Array Slicing: Accessing Subarrays

We can access subarrays with the slice notation (':' character). The NumPy slicing syntax follows that of the standard Python list: `x[start:stop:step]`.

Either bound of a slice may be omitted. If the lower bound is omitted, it defaults to 0; if the upper bound is omitted, the slice runs to the end of the axis; and if both are omitted (i.e., ':' on its own), the slice includes everything along that axis.

In [19]:
x3[0:3:2, 1:4]

array([[ 2,  3,  4],
       [10, 11, 12]])

The following example selects rows 0 through 1 and columns 1 through 3 of the array:

In [20]:
x3[:2, 1:4] # first two rows, 2nd to 4th columns

array([[2, 3, 4],
       [6, 3, 8]])
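A few more slicing patterns, sketched on a fresh 3×4 array (named `grid` here for illustration), show the default bounds, a step, and a negative step in action:

```python
import numpy as np

grid = np.arange(1, 13).reshape(3, 4)

first_row = grid[0, :]          # all columns of row 0
last_col = grid[:, -1]          # all rows of the last column
every_other_col = grid[:, ::2]  # columns 0 and 2
reversed_rows = grid[::-1, :]   # rows in reverse order

print(first_row)        # [1 2 3 4]
print(last_col)         # [ 4  8 12]
print(every_other_col)
print(reversed_rows)
```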

### Note:

One important, and extremely useful, thing to know about array slices is that they return *views* rather than *copies* of the array data. This is one area in which NumPy array slicing differs from Python list slicing: in lists, slices are copies.

In [21]:
x3

array([[ 1,  2,  3,  4],
       [ 5,  6,  3,  8],
       [ 9, 10, 11, 12]])

In [22]:
x3_sub = x3[:2, :2]
print(x3_sub)

[[1 2]
 [5 6]]


In [23]:
x3_sub[0, 0] = 99
print(x3_sub)

[[99  2]
 [ 5  6]]


In [24]:
x3

array([[99,  2,  3,  4],
       [ 5,  6,  3,  8],
       [ 9, 10, 11, 12]])

This default behavior is quite useful when we work with large datasets. We can access and process pieces of these datasets without the need to copy the underlying data buffer. But it is sometimes useful to instead explicitly copy the data within an array or a subarray. This can be most easily done with the `copy` method.
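The difference between a view and a copy can be sketched as follows. `np.shares_memory` is used here only to confirm whether two arrays overlap in memory; the array names are illustrative.

```python
import numpy as np

grid = np.arange(1, 13).reshape(3, 4)

view = grid[:2, :2]                # shares data with grid
independent = grid[:2, :2].copy()  # owns its own data

independent[0, 0] = 99  # does not touch grid
print(grid[0, 0])       # 1

view[0, 0] = 42         # writes through to grid
print(grid[0, 0])       # 42

print(np.shares_memory(grid, view))         # True
print(np.shares_memory(grid, independent))  # False
```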

## 3. Structured data

In [None]:
element = ['C', 'O', 'S']
atomic_number = [6, 8, 16]
atomic_mass = [12.011, 15.999, 32.06]

Initialize a NumPy structured array with one data type per field:
* `U10` for a Unicode string of at most 10 characters
* `i4` for a 32-bit signed integer
* `f8` for a 64-bit floating-point number

In [None]:
data = np.zeros(3, dtype={'names':('element', 'atomic_number', 'atomic_mass'),
                          'formats':('U10', 'i4', 'f8')})
data.dtype

Access columns (fields) by name using square-bracket indexing, here to fill them with values:

In [None]:
data['element'] = element
data['atomic_number'] = atomic_number
data['atomic_mass'] = atomic_mass
print(data)

Another way to specify structured data types:

In [None]:
datatype = np.dtype([('element', 'U10'), ('atomic_number', 'i4'), ('atomic_mass', 'f8')])
d = np.zeros(3, dtype=datatype)
d['element'] = element
d['atomic_number'] = atomic_number
d['atomic_mass'] = atomic_mass
print(d)
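Once filled, a structured array can be indexed by row, by field name, or with a boolean mask built from one of its fields. The sketch below builds the array directly from a list of tuples, which is an alternative to filling a zero-initialized array; the name `table` is illustrative, and the carbon mass used is the standard value 12.011.

```python
import numpy as np

datatype = np.dtype([('element', 'U10'),
                     ('atomic_number', 'i4'),
                     ('atomic_mass', 'f8')])

# Each tuple is one record; fields follow the order given in the dtype
table = np.array([('C', 6, 12.011), ('O', 8, 15.999), ('S', 16, 32.06)],
                 dtype=datatype)

print(table[0])              # the first record
print(table[-1]['element'])  # S

# Boolean mask on one field, then select another field
heavy = table[table['atomic_number'] > 7]['element']
print(heavy)                 # ['O' 'S']
```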