# Data analysis

This section gives a brief introduction to NumPy and a first look at plotting with seaborn.

## NumPy

NumPy (short for Numerical Python) is an open-source library for working with arrays. Its array operations are highly optimized; the W3Schools tutorial credited below claims they can be up to 50x faster than native Python lists. It was created in 2005 by Travis Oliphant.

Source code is available at https://github.com/numpy/numpy.

This content is borrowed from the NumPy tutorial at W3Schools: https://www.w3schools.com/python/numpy/default.asp

For a more thorough tutorial, please work through their course.

The primary array object is `ndarray`, and NumPy provides many supporting functions for working with it. Arrays are commonly used in data science, where speed is important.

In [None]:
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(type(arr))
print(arr)
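
As a minimal sketch of why arrays are convenient for numeric work, the cell below applies arithmetic to a whole array at once, with no explicit loop. The variable names (`prices`, `doubled`, `total`) are illustrative and do not come from the tutorial.

```python
import numpy as np

prices = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Arithmetic is applied element-wise to the whole array at once.
doubled = prices * 2
total = prices.sum()

# The equivalent with a native list needs an explicit loop or comprehension.
doubled_list = [p * 2 for p in prices.tolist()]

print(doubled)
print(total)
```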

Arrays can have any number of dimensions; the `ndim` attribute reports how many.

In [None]:
arr_0d = np.array(1)
arr_1d = np.array([1, 2, 3])
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
arr_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7,8]]])

print(f"{arr_0d.ndim}d:", arr_0d, "---", sep="\n")
print(f"{arr_1d.ndim}d:", arr_1d, "---", sep="\n")
print(f"{arr_2d.ndim}d:", arr_2d, "---", sep="\n")
print(f"{arr_3d.ndim}d:", arr_3d, "---", sep="\n")

One-dimensional arrays are indexed just like native lists; arrays with more dimensions take one comma-separated index per dimension.

In [None]:
arr = np.array([2, 4, 6, 8])
print(arr[2])

In [None]:
arr = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])

print(arr[1, 2])

In [None]:
arr = np.array([
    [
        [1, 2],
        [3, 4],
    ],
    [
        [5, 6],
        [7, 8],
    ]
])

print(arr[1, 0, 1])
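
Negative indices and slices also carry over from native lists. The following is a small sketch using an illustrative 3x3 array; the variable names are assumptions made for this example.

```python
import numpy as np

arr = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])

last_row = arr[-1]            # a negative index counts from the end
first_two_cols = arr[:, :2]   # every row, columns 0 and 1
corner = arr[-1, -1]          # last row, last column

print(last_row)
print(first_two_cols)
print(corner)
```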

NumPy identifies its data types with one-character codes:
* `i` - integer
* `b` - boolean
* `u` - unsigned integer
* `f` - float
* `c` - complex float
* `m` - timedelta
* `M` - datetime
* `O` - object
* `S` - string
* `U` - unicode string
* `V` - fixed chunk of memory for other type (void)

You can get the data type of an array from its `dtype` attribute.

In [None]:
arr1 = np.array([1, 2, 3, 4])
arr2 = np.array(['apple', 'banana', 'cherry'])

print(arr1, arr1.dtype)
print(arr2, arr2.dtype)

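You can set the data type with the `dtype` argument when creating an array. Here `i8` is an 8-byte integer and `i1` is a 1-byte integer.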

In [None]:
arr1 = np.array([1, 2, 3, 4], dtype="i8")
arr2 = np.array([1, 2, 3, 4], dtype="i1")

print(arr1.dtype)
print(arr2.dtype)
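
An existing array can also be converted to another type. The sketch below uses `astype`, which returns a new array rather than modifying the original; the example values are made up for illustration.

```python
import numpy as np

arr_float = np.array([1.9, 2.1, 3.5])

# astype returns a converted copy; arr_float itself is left unchanged.
arr_int = arr_float.astype("i4")   # fractional parts are truncated toward zero
arr_bool = np.array([0, 1, 2]).astype(bool)   # zero is False, non-zero is True

print(arr_int, arr_int.dtype)
print(arr_bool, arr_bool.dtype)
print(arr_int.itemsize)            # bytes per element
```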

Arrays have a defined shape: a tuple giving the number of elements along each dimension.

In [None]:
arr = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
])

print(arr.shape)

Arrays can be reshaped.

In [None]:
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

print(arr.shape)
print(arr)

In [None]:
new_arr = arr.reshape(4, 3)

print(new_arr.shape)
print(new_arr)

In [None]:
new_arr = arr.reshape(2, 6)

print(new_arr.shape)
print(new_arr)
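
A reshape only works when the new dimensions multiply to the number of elements. As a short sketch, one dimension can be given as `-1` to let NumPy infer it, and an incompatible shape raises `ValueError`. The array is rebuilt here with `np.arange` so the cell runs on its own.

```python
import numpy as np

arr = np.arange(1, 13)  # the values 1 through 12

# -1 asks NumPy to work out that dimension: 12 elements / 3 columns = 4 rows.
inferred = arr.reshape(-1, 3)
print(inferred.shape)

# A shape whose dimensions do not multiply to 12 raises ValueError.
try:
    arr.reshape(5, 3)
except ValueError as err:
    print("cannot reshape:", err)
```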

Arrays can be iterated over. Each level of nesting in the loop walks one dimension.

In [None]:
arr = np.array([
    [
        [1, 2, 3],
        [4, 5, 6],
    ],
    [
        [7, 8, 9],
        [10, 11, 12],
    ],
])

In [None]:
for x in arr:
    for y in x:
        for z in y:
            print(z, end=" ")

This can be shortened with `nditer`.

In [None]:
for d in np.nditer(arr):
    print(d, end=" ")
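
When the position of each element matters as well as its value, `np.ndenumerate` is one option. This sketch uses a small 2x2 array chosen for illustration.

```python
import numpy as np

arr = np.array([
    [1, 2],
    [3, 4],
])

# ndenumerate yields (index_tuple, value) pairs in row-major order.
pairs = [(idx, int(value)) for idx, value in np.ndenumerate(arr)]
for idx, value in pairs:
    print(idx, value)
```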

Arrays can be filtered.

In [None]:
# Keep only the elements greater than 6; the result is a 1-D array.
new_arr = arr[arr > 6]

print(new_arr)
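
Filtering works through a boolean mask. The sketch below builds the mask explicitly and then combines two conditions; the array values are illustrative.

```python
import numpy as np

arr = np.array([41, 42, 43, 44])

# Comparing an array to a scalar yields a boolean array of the same shape.
mask = arr > 42
print(mask)

# Indexing with the mask keeps only the elements where it is True.
filtered = arr[mask]
print(filtered)

# Conditions combine with & (and) and | (or); the parentheses are required.
even_and_large = arr[(arr % 2 == 0) & (arr > 42)]
print(even_and_large)
```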

The seaborn library provides statistical data visualization built on top of matplotlib.

https://seaborn.pydata.org/

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

arr = np.array([1, 2, 3, 4, 5])
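# distplot is deprecated in recent seaborn releases; histplot with kde=True
# draws a comparable histogram with a density curve.
sns.histplot(arr, kde=True)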
plt.show()

## Exercise 1

Plot the distribution graph of Proline from the `data/wines.csv` file.

In [None]:
# Your code

In [None]:
! cat answers/data_1.py