# NumPy and Pandas

In [None]:
!pip install numpy pandas

## NumPy
![NumPy logo](https://bids.berkeley.edu/sites/default/files/styles/250x140/public/projects/numpy_logo_project_page_banner.png?itok=LlKlkeqk)

[NumPy](https://www.numpy.org/) is a Python library for numerical computation. NumPy's main object is the homogeneous **multidimensional array**, ``ndarray`` (n-dimensional array). It is a table of elements (usually numbers), all of the **same type**, indexed by a tuple of non-negative integers. In NumPy, dimensions are called ``axes``.

__NEWS:__

And why is this really interesting? https://www.phoronix.com/scan.php?page=news_item&px=Intel-Numpy-AVX-512-Landed

In [None]:
import numpy as np

[![Python Import](https://imgs.xkcd.com/comics/python.png)](https://xkcd.com/353/)

## A new best friend

[![Numpy Cheat Sheet](https://blog.finxter.com/wp-content/uploads/2019/10/grafik-1-1024x725.png)](https://blog.finxter.com/wp-content/uploads/2019/10/grafik-1-1024x725.png)

For example, the array holding the coordinates of a point in 3D space, ``[1, 2, 1]``, has one axis. That axis has 3 elements in it, so we say it has a length of 3. The array ``b`` in the next cell has 2 axes: the first has a length of 2, the second a length of 3.

In [None]:
a = np.array([1,2,1])
b = np.array([[1,2,3],[4,5,6]])

In [None]:
a

In [None]:
a.shape

In [None]:
b

In [None]:
b.shape

NumPy’s array class is called ``ndarray``. It is also known by the alias ``array``. Some important attributes of an ``ndarray`` object are:

* ``ndarray.ndim``: the number of axes (dimensions) of the array.
* ``ndarray.shape``: the dimensions of the array. This is a tuple of integers indicating the size of the array in each dimension. For a matrix with ``n`` rows and ``m`` columns, shape will be ``(n,m)``. The length of the shape tuple is therefore the number of axes, ``ndim``.
* ``ndarray.size``: the total number of elements of the array. This is equal to the product of the elements of shape.
* ``ndarray.dtype``: an object describing the type of the elements in the array. One can create or specify dtypes using standard Python types. Additionally, NumPy provides types of its own: ``numpy.int32``, ``numpy.int16``, and ``numpy.float64`` are some examples.
* ``ndarray.data``: the buffer containing the actual elements of the array. Normally, we won’t need to use this attribute because we will access the elements in an array using indexing facilities. 

In [None]:
# arange (stands for "array range"): all numbers from 0 to 14; reshape: arrange in an array with the given dimensions
a = np.arange(15).reshape(3, 5) 
a

In [None]:
a.shape

In [None]:
a.ndim

In [None]:
a.dtype

In [None]:
a.size

In [None]:
type(a)

### Array Creation
There are several ways to create arrays. For example, you can create an array from a regular Python list or tuple. The type of the resulting array is deduced from the type of the elements in the sequences.

In [None]:
a = np.array([2,3,4])
a

In [None]:
a.dtype

In [None]:
b = np.array([1.2, 3.5, 5.1])
b.dtype

``array`` transforms sequences of sequences into two-dimensional arrays, sequences of sequences of sequences into three-dimensional arrays, and so on. The type of the array can also be explicitly specified at creation time.

In [None]:
np.array([(1.5,2,3), (4,5,6)])

In [None]:
np.array([ [1,2+3j], [3,4] ], dtype=np.csingle)  # single-precision complex (complex64)

A frequent error consists in calling ``array`` with multiple numeric arguments, rather than providing a single list of numbers as an argument.

_Wrong:_

In [None]:
a = np.array(1,2,3,4)  # raises a TypeError

_Right:_

In [None]:
a = np.array([1,2,3,4])
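
To make the difference between the two calls visible without stopping the notebook, the wrong call can be wrapped in a ``try`` block. This is only a small sketch; the name ``error_message`` is introduced here for illustration.

```python
import numpy as np

try:
    np.array(1, 2, 3, 4)  # four positional arguments instead of one sequence
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

a = np.array([1, 2, 3, 4])  # a single list argument works
print(a.shape)
```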

Often, the elements of an array are originally unknown, but its size is known. Hence, NumPy offers several functions to create arrays with initial placeholder content. These minimize the necessity of growing arrays, an expensive operation.

* The function ``zeros`` creates an array full of zeros.
* The function ``ones`` creates an array full of ones.
* The function ``empty`` creates an array without initializing its entries: the initial content is arbitrary and depends on the state of the memory.

By default, the dtype of the created arrays is ``float64``. The first argument to the functions is a shape-tuple, defining the size of the array.

In [None]:
np.zeros((3,4))

In [None]:
np.ones((2,3,4), dtype=np.int16 )  # dtype can also be specified

In [None]:
np.empty((2,3))  # uninitialized, output may vary
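
The remark about growing arrays can be illustrated with a small comparison. The sketch below assumes a toy task (storing the first five square numbers); the variable names are made up for this example. Both variants give the same result, but ``np.append`` returns a new, larger array on every call, whereas the preallocated array is filled in place.

```python
import numpy as np

n = 5

# Preallocate once, then fill in place.
squares = np.zeros(n, dtype=np.int64)
for i in range(n):
    squares[i] = i * i

# Growing instead: np.append returns a new array on every call.
grown = np.array([], dtype=np.int64)
for i in range(n):
    grown = np.append(grown, i * i)

print(squares)
print(np.array_equal(squares, grown))
```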

Use ``arange`` and ``linspace`` to create sequences of numbers:

In [None]:
np.arange( 10, 30, 5 )  # from 10 to 30 (exclusive) in steps of 5

In [None]:
np.linspace( 0, 2, 9 )  # 9 numbers between 0 and 2 (inclusive)
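
A short sketch of how the two functions treat the end point: ``linspace`` takes the number of points and includes the end value, while ``arange`` takes a step size and excludes the stop value. The ``retstep=True`` argument additionally returns the spacing that ``linspace`` chose; the variable names are illustrative.

```python
import numpy as np

# 9 points from 0 to 2, end point included; retstep also returns the spacing
points, step = np.linspace(0, 2, 9, retstep=True)
print(step)
print(points[-1])

# same spacing with arange: the stop value 2 is excluded
half_open = np.arange(0, 2, 0.25)
print(len(points), len(half_open))
```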

### Basic Operations
Arithmetic operators on arrays apply **elementwise**. A new array is created and filled with the result. The product operator ``*`` also operates elementwise in NumPy arrays. The matrix product can be performed using the ``@`` operator or the ``dot`` function or method:
![Dot Product](http://wiki.fast.ai/images/e/eb/Matrix_product_steps.png)

In [None]:
a = np.array( [20,30,40,50] )
b = np.arange( 4 )

a-b

In [None]:
b**2

In [None]:
10*np.sin(a)

In [None]:
a<35

In [None]:
A = np.array( [[1,1], 
               [0,1]] )
B = np.array( [[2,0], 
               [3,4]] )
A * B       # elementwise product

In [None]:
A @ B  # matrix product

In [None]:
A.dot(B)  # matrix product again

In [None]:
np.asmatrix(A) * np.asmatrix(B)  # matrix product again (np.mat was removed in NumPy 2.0; prefer @)
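
To see what distinguishes the matrix product from the elementwise product, the following sketch recomputes ``A @ B`` entry by entry with plain loops. The loop version is for illustration only and is not how one would compute a matrix product in practice.

```python
import numpy as np

A = np.array([[1, 1],
              [0, 1]])
B = np.array([[2, 0],
              [3, 4]])

elementwise = A * B   # multiplies entries at the same position
matrix = A @ B        # row-by-column products

# Entry (i, j) of the matrix product is the sum over k of A[i, k] * B[k, j].
by_hand = np.zeros((2, 2), dtype=A.dtype)
for i in range(2):
    for j in range(2):
        by_hand[i, j] = sum(A[i, k] * B[k, j] for k in range(2))

print(elementwise)
print(matrix)
print(np.array_equal(matrix, by_hand))
```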

Unary operations such as ``sum``, ``min`` and ``max`` treat the array as though it were a flat list of numbers, regardless of its shape. By specifying the ``axis`` parameter, you can instead apply the operation along the specified axis of the array:

In [None]:
a = np.random.random((2,3))
a

In [None]:
a.sum()

In [None]:
a.min()

In [None]:
a.max()

In [None]:
a.sum(axis=0)  # sum of each column

In [None]:
a.min(axis=1)  # min of each row
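
Because the array above is random, the effect of ``axis`` is easier to verify on fixed numbers. A minimal sketch with a small array chosen for this example:

```python
import numpy as np

m = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

total = m.sum()            # all elements
col_sums = m.sum(axis=0)   # collapses axis 0 (the rows): one value per column
row_sums = m.sum(axis=1)   # collapses axis 1 (the columns): one value per row

print(total)
print(col_sums)
print(row_sums)
```

The axis that is passed is the one that disappears from the result: summing a ``(2, 3)`` array over ``axis=0`` leaves a result of shape ``(3,)``.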

NumPy provides familiar mathematical functions such as ``sin``, ``cos``, and ``exp``. In NumPy, these are called "universal functions". Within NumPy, these functions operate elementwise on an array, producing an array as output.

In [None]:
b = np.arange(3)
b

In [None]:
np.exp(b)

In [None]:
np.sqrt(b)
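
The elementwise behaviour also holds for multidimensional arrays and for functions of two arguments. The sketch below uses ``np.add`` as an example of a binary universal function; the array is chosen for illustration.

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)

roots = np.sqrt(grid)         # applied to every element
summed = np.add(grid, grid)   # binary ufunc, same result as grid + grid

print(roots.shape)                    # the output has the input's shape
print(isinstance(np.sqrt, np.ufunc))  # sqrt is a ufunc object
print(np.array_equal(summed, grid + grid))
```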

### Indexing, Slicing and Iterating
One-dimensional arrays can be indexed, sliced and iterated over, much like lists and other Python sequences.

In [None]:
a = np.arange(10)**3
a

In [None]:
a[2]

In [None]:
a[2:5]   # Attention: stop-element is excluded

In [None]:
a[:6:2] = -1000
a

In [None]:
a[::-1]  # reverse a

Multidimensional arrays can have one index per axis. These indices are given in a tuple separated by commas:

In [None]:
def f(x,y):
    return 10*x+y
    
b = np.fromfunction(f,(5,4),dtype=int)
b

In [None]:
b[2, 3]

In [None]:
b[0:5, 1]

In [None]:
b[:,-1]

When fewer indices than the number of axes are provided, the missing indices are considered complete slices. The expression within brackets in ``b[i]`` is treated as an ``i`` followed by as many instances of ``:`` as needed to represent the remaining axes. NumPy also allows you to write this using dots as ``b[i,...]``.

In [None]:
b[-1]  # the last row. Equivalent to b[-1,:]

In [None]:
b[-1,...]
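
With more than two axes the dots save more typing. A small sketch on a three-dimensional array (the array ``c`` is introduced here for illustration):

```python
import numpy as np

c = np.arange(24).reshape(2, 3, 4)  # 3 axes

first = c[1]              # missing indices are filled with complete slices
explicit = c[1, :, :]
dots = c[1, ...]
last_column = c[..., -1]  # same as c[:, :, -1]

print(np.array_equal(first, explicit), np.array_equal(first, dots))
print(first.shape, last_column.shape)
```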

Iterating over multidimensional arrays is done with respect to the first axis:

In [None]:
for row in b:
    print(row)

However, if one wants to perform an operation on each element in the array, one can use the ``flat`` attribute which is an iterator over all the elements of the array:

In [None]:
for element in b.flat:
    print(element)

### Copies and Views
When operating and manipulating NumPy arrays, the data is sometimes copied into a new array and sometimes not. This is often a source of confusion. There are three cases:

#### 1) No Copy At All
Simple assignments make no copy of array objects or of their data.

In [None]:
a = np.arange(12)
b = a   # no new object is created
b is a  # a and b are two names for the same ndarray object

In [None]:
b.shape = 3,4  # also changes the shape of a
a.shape

Python passes mutable objects as references, so function calls make no copy.

In [None]:
id(a)

In [None]:
def f(x):
    print(id(x))
    
f(a)
f(b)
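
A practical consequence is that a function which modifies its argument also modifies the caller's array. The helper ``set_first_to_zero`` below is a hypothetical function written only for this sketch.

```python
import numpy as np

def set_first_to_zero(x):
    # hypothetical helper: modifies the array it receives in place
    x[0] = 0

values = np.arange(1, 5)   # [1, 2, 3, 4]
set_first_to_zero(values)
print(values)              # the caller's array was modified
```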

#### 2) View or Shallow Copy
Different array objects can share the same data. The ``view`` method creates a new array object that looks at the **same** data.

In [None]:
c = a.view()
c is a

In [None]:
c.base is a  # c is a view of the data owned by a

In [None]:
c.flags.owndata

In [None]:
c.shape = 2,6   # a's shape doesn't change
c

In [None]:
a.shape

In [None]:
c[0,4] = 1234  # a's data changes
a

Slicing an array returns a view of it:

In [None]:
s = a[:, 1:3]
s[:] = 10  # s[:] is a view of s. Note the difference between s=10 and s[:]=10
a
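
Whether two arrays share data can be checked with ``np.shares_memory``. The sketch below also contrasts basic slicing with indexing by a list of integers, which returns a copy rather than a view; that contrast is an addition to the text above, and the variable names are illustrative.

```python
import numpy as np

base = np.arange(12).reshape(3, 4)

sliced = base[:, 1:3]      # basic slicing: a view
picked = base[:, [1, 2]]   # indexing with a list of integers: a copy

print(np.shares_memory(base, sliced))
print(np.shares_memory(base, picked))

picked[:] = -1   # does not touch base
sliced[:] = 10   # writes through to base
print(base)
```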

#### 3) Deep Copy
The ``copy`` method makes a complete copy of the array and its data.

In [None]:
d = a.copy()  # a new array object with new data is created
d is a

In [None]:
d.base is a

In [None]:
d[0,0] = 9999
d

In [None]:
a

Sometimes ``copy`` should be called after slicing if the original array is not required anymore. For example, suppose ``a`` is a huge intermediate result and the final result ``b`` only contains a small fraction of ``a``, a deep copy should be made when constructing ``b`` with slicing:

In [None]:
a = np.arange(int(1e8))
b = a[:100].copy()
del a  # the memory of ``a`` can be released.

If ``b = a[:100]`` is used instead, ``a`` is referenced by ``b`` and will persist in memory even if ``del a`` is executed.
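
The ``base`` attribute shows which of the two variants still refers to the large array. A minimal sketch with a smaller array than above (names chosen for illustration):

```python
import numpy as np

big = np.arange(1000)

view = big[:100]           # keeps a reference to big's data
copied = big[:100].copy()  # owns its own data

print(view.base is big)
print(copied.base is None)
print(np.shares_memory(big, copied))
```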

## Pandas
![Pandas Logo](https://dev.pandas.io/static/img/pandas.svg)

Pandas is a Python library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

In [None]:
import pandas as pd

## Another new best friend

[![Pandas Cheat Sheet](https://s3.studylib.net/store/data/025268801_1-1bb4205c74b96358224e9e1be6dbfbda.png)](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

### Pandas Features
- Data manipulation: perform operations on datasets
- Handling missing values: datasets are imperfect
- File format support: various file formats are supported for I/O
- Data cleaning: data can be messy. Pandas helps in making data usable for analysis
- Visualization: visualization helps to understand analysis results
- Python support: use other Python libraries with Pandas objects

There are two data structures: ``Series`` and ``DataFrame``.

A ``Series`` is a one-dimensional labeled array, whereas a ``DataFrame`` is a 2-dimensional labeled data structure. You can think of a ``DataFrame`` as a spreadsheet or SQL table, or a dict of ``Series`` objects. Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments.

![Series vs DataFrame](https://storage.googleapis.com/lds-media/images/series-and-dataframe.width-1200.png)

In [None]:
data = {
    'apples': [3, 2, 6, 1, 7, 8], 
    'oranges': [2, 3, 7, 2, 6, 4],
    'bananas': [1, 1, 3, 5, 4, 1],
    'strawberries': [41, 27, 33, 15, 4, 8]
}

purchases = pd.DataFrame(data)
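
The "dict of ``Series``" view can be made concrete by building a small DataFrame from two labeled ``Series``. This is only a sketch; the names ``apples``, ``oranges`` and ``fruit`` are introduced for this example and reuse a few values from the data above.

```python
import pandas as pd

customers = ['June', 'Robert', 'Lily']
apples = pd.Series([3, 2, 6], index=customers, name='apples')
oranges = pd.Series([2, 3, 7], index=customers, name='oranges')

# Each dict key becomes a column; the shared index becomes the row labels.
fruit = pd.DataFrame({'apples': apples, 'oranges': oranges})

print(fruit.shape)
print(list(fruit.index))
print(isinstance(fruit['apples'], pd.Series))  # a single column is a Series
```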

### Viewing Data
You can view the top and bottom rows of a DataFrame, display index and column names, or show a quick statistical summary of the data:

In [None]:
purchases.head()

In [None]:
purchases.tail(2)  # default value is 5

In [None]:
purchases.columns

In [None]:
purchases.index

In [None]:
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David', 'Maria', 'Bert'])
purchases.index

In [None]:
purchases.head()

In [None]:
purchases.describe()

### Sorting

_Sorting by an axis_

In [None]:
purchases.sort_index(axis=1, ascending=False).head() # sort columns in reverse alphabetical order

_Sorting by a value_

In [None]:
pp = purchases.sort_values(by='oranges') # sort by ascending values in column 'oranges'
pp.head()

### Inplace
In pandas, when you pass ``inplace=True`` to a method that supports it, the data is modified in place and the method returns ``None``. Otherwise, the method returns a new object and leaves the original untouched, which costs additional memory.

In [None]:
purchases.head()

In [None]:
purchases.sort_values(by='oranges', inplace=True)

In [None]:
purchases.head()
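
The two calling styles can be compared side by side on a tiny frame. The following sketch uses made-up values and illustrative names (``demo``, ``sorted_copy``, ``result``):

```python
import pandas as pd

demo = pd.DataFrame({'oranges': [3, 1, 2]}, index=['a', 'b', 'c'])

sorted_copy = demo.sort_values(by='oranges')  # demo itself is unchanged
order_before = list(demo.index)

result = demo.sort_values(by='oranges', inplace=True)  # demo is modified
order_after = list(demo.index)

print(order_before)
print(result)
print(order_after)
```

Assigning the result of an ``inplace=True`` call back to the variable is a common mistake, since it replaces the DataFrame with ``None``.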

### Indexing and Selecting


#### Selecting by label: loc
``loc`` is purely label-based indexing. Every label asked for must be in the index; otherwise a ``KeyError`` is raised. Integers are valid labels, but they refer to the label and not the position.

In [None]:
df = pd.DataFrame(np.random.randn(6, 4), index=list('abcdef'), columns=list('ABCD'))
df.head(6)

In [None]:
df.loc[['a', 'b', 'd'], :]

In [None]:
df.loc['d':, 'A':'C']

In [None]:
df.loc['a']

In [None]:
df.loc['a'] > 0

In [None]:
df.loc[:, df.loc['a'] > 0]

Accessing a single value

In [None]:
df.loc['a', 'A']

In [None]:
df.at['a', 'A']

#### Valid label inputs:
- a single label: ``5`` or ``'a'``
- a list or array of labels: ``['a', 'b', 'c']``
- a slice object with labels: ``'a':'f'``
- a boolean array
- (a callable)
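
The last two input kinds are not used in the cells above, so here is a minimal sketch of a boolean mask and a callable, together with a label slice. The frame and its values are made up for this example.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, -2, 3], 'B': [10, 20, 30]}, index=['a', 'b', 'c'])

mask = frame['A'] > 0                            # boolean Series aligned with the index
by_mask = frame.loc[mask]
by_callable = frame.loc[lambda d: d['A'] > 0]    # the callable receives the DataFrame
by_slice = frame.loc['a':'b']                    # label slices include both end points

print(list(by_mask.index))
print(by_callable.equals(by_mask))
print(list(by_slice.index))
```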

#### Selecting by position: iloc
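``iloc`` is purely integer-based indexing by position (0-based). A single integer that is out of bounds raises an ``IndexError``; a non-integer key such as the label ``'a'`` is rejected as well (with a ``TypeError`` for a scalar key).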

In [None]:
df = pd.DataFrame(np.random.randn(6, 4), index=list(range(0, 12, 2)), columns=list(range(0, 8, 2)))
df

In [None]:
df.iloc[:3]

In [None]:
df.iloc[1:5, 2:4]

In [None]:
df.iloc[[1,3,5], [1,3]]

In [None]:
df.iloc[:, 1:3]

In [None]:
df.iloc[1, 1]

In [None]:
df.iat[1, 1]

#### Valid integer inputs:
- an integer: ``5``
- a list or array of integers: ``[4, 3, 0]``
- a slice object with integers: ``3:7``
- a boolean array
- (a callable)
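
The difference between label and position matters most when the labels are themselves integers, as in the frame above. A small sketch with fixed, made-up values (the name ``table`` is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'value': [10, 20, 30, 40]}, index=[0, 2, 4, 6])

by_label = table.loc[2, 'value']    # the row whose label is 2
by_position = table.iloc[2, 0]      # the third row, first column

label_slice = table.loc[0:4]        # labels 0, 2 and 4: the end label is included
position_slice = table.iloc[0:2]    # positions 0 and 1: the stop is excluded

print(by_label, by_position)
print(list(label_slice.index))
print(list(position_slice.index))
```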

### Visualization

In [None]:
import matplotlib.pyplot as plt

purchases.plot()
plt.show()

#### Other kinds of plots
Built into pandas: ``bar``, ``hist``, ``box``, ``density``, ``area``, ``scatter``, ``pie``, ...

In [None]:
purchases.plot(kind='bar')
plt.show()

In [None]:
purchases.plot(kind='scatter', x='oranges', y='apples')
plt.show()

#### Plotting Parameters
You can pass additional parameters to the plot function that are passed on to matplotlib's ``plot()``, e.g. ``alpha``.

In [None]:
purchases.plot(kind='bar', alpha=0.5)
plt.show()

In [None]:
purchases.plot(kind='barh', stacked=True)
plt.show()

More on Python plotting functionalities next time!

### Pandas I/O
Pandas provides a number of reader and writer functions for file handling and supports a large number of file types such as CSV, JSON, HTML, Excel, pickle and SQL. We will look into reading and writing CSV files.

__Attention:__ the CSV file used in the following example is hosted on the web. Instead of providing a URL, one can also simply specify a path on the local disk.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/ADSLab-Salzburg/DataAnalysiswithPython/main/data/Automobile_data.csv', delimiter=",")
df.head()

In [None]:
column_names = list(df.columns)
column_names.remove('index')
df = df.loc[:, column_names]
df.head()

In [None]:
import os

df.to_csv('Automobile_data.tsv', sep='\t')
os.remove('Automobile_data.tsv')  # I don't want to keep that file ;-)
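
Writing and reading can also be tried without touching the disk. The sketch below round-trips a tiny frame through an in-memory text buffer; the two rows are made-up values for illustration, not taken from the data set. Passing ``index=False`` keeps the row labels out of the file, which avoids an extra unnamed column when reading it back.

```python
import io
import pandas as pd

cars = pd.DataFrame({'company': ['audi', 'bmw'], 'price': [13950.0, 16430.0]})

buffer = io.StringIO()
cars.to_csv(buffer, sep='\t', index=False)  # index=False: do not write the row labels

buffer.seek(0)                              # rewind before reading
restored = pd.read_csv(buffer, sep='\t')

print(list(restored.columns))
print(restored.shape)
```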

#### Total cars per company

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

#### Most expensive car per company
What is the price of the most expensive car per company and overall?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

#### String replacement
Replace ``alfa-romero`` by ``alfa-romeo``

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

<a name="exercise"></a>
## Wrap-up Exercise # 1
Reuse the class ``Rocket`` from last time. The ``__init__`` method (without parameters) should again set the instance attributes ``x``, ``y`` and ``z`` to zero. Change the method ``move_up`` so that it increments the ``z`` value by a random integer between ``1`` and ``10`` whenever it is called (choose an appropriate method from [numpy.random](https://docs.scipy.org/doc/numpy-1.15.1/reference/routines.random.html)).

Then, create a fleet of 3 rockets and store them in the list ``space_rockets``. Call ``move_up`` for every rocket and then iterate over the list to print their altitudes.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

<a name="exercise"></a>
## Wrap-up Exercise # 2
Read the ``Automobile_data.csv`` file again into a pandas DataFrame. Extract the columns ``company`` and ``price`` into a new DataFrame ``df_prices``. Call the ``boxplot`` function on the new DataFrame and plot a boxplot grouped by ``company``. Note that ``boxplot`` also has a keyword argument ``by``.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

<a name="exercise"></a>
## Wrap-up Exercise # 3
Again, use the ``Automobile_data.csv`` dataframe. Print its shape by calling ``shape`` on the dataframe. Note how many rows the dataframe has.

Now, use the function ``isna()`` to find NaN values in the column ``price`` and count them using ``sum``. Remove each row that contains a NaN value in the ``price`` column. Again, call ``shape`` on the resulting frame to make sure the number of rows removed corresponds to the number of NaNs counted before.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Inspiration/Further Reading
- [NumPy Quickstart](https://www.numpy.org/devdocs/user/quickstart.html)
- [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)
- [Pandas Tutorials](https://data-flair.training/blogs/pandas-tutorials-home)