# Hands-on Introduction to Python And Machine Learning

Instructor: Tak-Kei Lam

(Readers are assumed to have a little programming background.)

It is often useful to visualise data to better understand its properties. Several Python libraries help us read, write, analyse, and display data:
- [NumPy](http://www.numpy.org/)
- [Pandas](https://pandas.pydata.org/)
- [Matplotlib](https://matplotlib.org/)

# NumPy
> NumPy is the fundamental package for scientific computing with Python. It contains among other things:
>
>    - a powerful N-dimensional array object
>    - sophisticated (broadcasting) functions
>    - tools for integrating C/C++ and Fortran code
>    - useful linear algebra, Fourier transform, and random number capabilities
>
> Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
>
> NumPy is licensed under the BSD license, enabling reuse with few restrictions.

If we want to multiply every element in an array by 2, how should we write the code?

In [None]:
data = [1, 2, 3, 4, 5]
# it is natural for us to express our idea in this way, but does it work?
data = data * 2

print('data is {}'.format(data))

In [None]:
import numpy as np

# what if we use numpy array?
data = [1, 2, 3, 4, 5]
np_data = np.array(data)
np_data = np_data * 2

print('data is {}'.format(np_data))
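
To make the contrast between the two cells above explicit, here is a minimal sketch. With a plain Python list, `*` repeats the list, whereas a NumPy array multiplies every element. A list comprehension is shown as the pure-Python alternative. The variable names are illustrative only.

```python
import numpy as np

data = [1, 2, 3, 4, 5]

# multiplying a list by an integer repeats the list
repeated = data * 2

# a list comprehension doubles every element without NumPy
doubled_list = [x * 2 for x in data]

# a NumPy array applies the multiplication to every element
doubled_array = np.array(data) * 2

print(repeated)
print(doubled_list)
print(doubled_array)
```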

NumPy can do much more!

In [None]:
# multi-dimensional

data = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]

data = np.array(data)
data = data * 2
print(data)

In [None]:
# indexing, slicing and transposing a multi-dimensional array

data = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]

data = np.array(data)
print('data[0][1][1]: {}'.format(data[0][1][1]))

print('data[:1] : ')
print(data[:1])

print('data.transpose():')
print(data.transpose())

print('data.shape:')
print(data.shape)
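
The cell above can be read more easily with a few facts in mind; the following is a small sketch of them rather than a complete reference. A tuple index such as `data[0, 1, 1]` selects the same element as chained indexing, `shape` lists the size of each axis, and `transpose()` called without arguments reverses the order of the axes.

```python
import numpy as np

data = np.array([
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
])

# a tuple index selects the same element as chained indexing
element = data[0, 1, 1]
print(element, data[0][1][1])

# shape lists the size of each axis: 2 blocks, 2 rows, 3 columns
print(data.shape)

# transpose() with no arguments reverses the order of the axes
transposed = data.transpose()
print(transposed.shape)

# a slice keeps its axis, so data[:1] is still three-dimensional
first_block = data[:1]
print(first_block.shape)
```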

In [None]:
# a 3x4 array filled with zeros
print(np.zeros((3, 4)))

# a 2x3x4 array filled with ones
print(np.ones((2, 3, 4)))

In [None]:
# reshaping: arrange the integers 0 to 11 as 4 rows and 3 columns
np.arange(12).reshape(4, 3)
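
As a brief sketch of how reshaping behaves: the new shape must hold exactly the same number of elements, and one dimension may be given as `-1` so that NumPy infers it. The names `flat`, `grid` and `inferred` are illustrative.

```python
import numpy as np

flat = np.arange(12)             # the integers 0 to 11
grid = flat.reshape(4, 3)        # 4 rows, 3 columns
inferred = flat.reshape(-1, 6)   # -1 lets NumPy work out the number of rows

print(grid)
print(inferred.shape)

# a shape that does not hold exactly 12 elements is rejected
try:
    flat.reshape(5, 3)
except ValueError as error:
    print('cannot reshape:', error)
```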

In [None]:
# number generation

# generate 50 random integers from 10 to 12 (high=13 is exclusive)
v = np.random.randint(low=10, high=13, size=50)
print(v)

# generate 50 numbers from a normal (Gaussian) distribution
mu, sigma = 2, 0.5  # mean, standard deviation
v = np.random.normal(mu, sigma, 50)
print(v)
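
The numbers above change on every run. If repeatable results are wanted, one option is a seeded generator, sketched below. This assumes NumPy 1.17 or later, where `np.random.default_rng` is available; the seed value 42 is arbitrary.

```python
import numpy as np

# a generator created with a fixed seed produces the same numbers every time
rng = np.random.default_rng(seed=42)

# 50 integers from 10 to 12 (high is exclusive, as with randint)
ints = rng.integers(low=10, high=13, size=50)

# 50 samples from a normal distribution with mean 2 and standard deviation 0.5
normals = rng.normal(loc=2, scale=0.5, size=50)

# a second generator with the same seed reproduces the integers
again = np.random.default_rng(seed=42).integers(low=10, high=13, size=50)
print((ints == again).all())
```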

# Pandas
>*pandas* is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

> Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.

The main data structures in pandas are *Series* and *DataFrame*.

Series is officially defined as follows:
> Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the *index*. 

DataFrame is officially defined as follows:
> DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. 

pandas can easily do what a spreadsheet can do, and much more.
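
The "dict of Series" view in the definition can be sketched as follows. The column names and values here are made up purely for illustration.

```python
import pandas as pd

# two Series that share the same index labels (illustrative values)
height = pd.Series([1.2, 0.7, 1.5], index=['a', 'b', 'c'])
weight = pd.Series([30, 12, 48], index=['a', 'b', 'c'])

# a dict of Series becomes a DataFrame: the keys are the column names
frame = pd.DataFrame({'height': height, 'weight': weight})
print(frame)

# each column of the DataFrame is itself a Series
print(type(frame['height']))
```

Because the rows are aligned on the shared index labels, each row of `frame` gathers the values that carry the same label in both Series.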

In [None]:
import pandas as pd

# wrap a NumPy array and give a label to each data point
v = pd.Series(np.random.randint(low=10, high=13, size=5), index=['a', 'b', 'c', 'd', 'e'])
print(type(v)) # print the type of v
print(v) # print the content of v

# instead of accessing the array by position, we can now access the Series by label. Yay!
print(v['b'])

In [None]:
# we can also wrap a dictionary in a pandas Series
v = pd.Series({'a':1, 'b':2, 'c':3})
print(v)


In [None]:
# a pandas Series behaves much like a NumPy array
v = pd.Series({'a': 1, 'b': 2, 'c': 3})
print(v)

# arithmetic is applied to every element
v = v * 2
print(v)


In [None]:
# let's go 2D!

d = [[1, 2], [3, 4], [5, 6]]
v = pd.DataFrame(d)
v

# does it look good?

In [None]:
# give a label to each row
v = pd.DataFrame(d, index=['a', 'b', 'c'])
v

In [None]:
# selecting rows
v = pd.DataFrame(d, index=['a', 'b', 'c'])
v.loc[['a', 'c']]

In [None]:
# selecting columns
v = pd.DataFrame(d, index=['a', 'b', 'c'])
v[0]

In [None]:
# selecting rows and columns
v = pd.DataFrame(d, index=['a', 'b', 'c'])
v.loc[['a', 'c']][1]

In [None]:
# we can also give names to the columns
v = pd.DataFrame(d, index=['a', 'b', 'c'], columns=['x', 'y'])

print(v.loc[['a', 'c']]['y'])

print(v['y'].loc[['a', 'c']])
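
The selections above chain two indexing steps. As an alternative sketch, `.loc` also accepts the row labels and the column label in a single call, and `.iloc` does the same by integer position. The names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

d = [[1, 2], [3, 4], [5, 6]]
v = pd.DataFrame(d, index=['a', 'b', 'c'], columns=['x', 'y'])

# rows and column selected by label in a single .loc call
by_label = v.loc[['a', 'c'], 'y']
print(by_label)

# the same cells selected by integer position with .iloc
by_position = v.iloc[[0, 2], 1]
print(by_position)
```

A single call states the rows and columns together, which is also the form pandas recommends when assigning new values to a selection.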

In [None]:
v = pd.DataFrame(d, index=['a', 'b', 'c'], columns=['x', 'y'])
print(v)

# get a summary of the data in one line
v.describe()

In [None]:
# pandas supports SQL-style joins
table1 = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
table1 = pd.DataFrame(table1, columns=['a','b','c'])
print(table1)

table2 = [[1, 0, 1], [1, 1, 0], [1, 0, 0]]
table2 = pd.DataFrame(table2, columns=['a','b','c'])
print(table2)

table3 = pd.merge(table1, table2, how='inner', on=['a', 'c'])
table3
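
To see what the join above does, the sketch below repeats it and adds a left join for comparison. Column `b` exists in both tables but is not a join key, so pandas keeps both copies and renames them with the default suffixes `_x` and `_y`. The `indicator=True` option adds a `_merge` column that records where each row came from.

```python
import pandas as pd

table1 = pd.DataFrame([[0, 0, 1], [0, 1, 0], [1, 0, 0]], columns=['a', 'b', 'c'])
table2 = pd.DataFrame([[1, 0, 1], [1, 1, 0], [1, 0, 0]], columns=['a', 'b', 'c'])

# inner join: keep only rows whose (a, c) pair appears in both tables
inner = pd.merge(table1, table2, how='inner', on=['a', 'c'])
print(inner)

# left join: keep every row of table1, matched or not
left = pd.merge(table1, table2, how='left', on=['a', 'c'], indicator=True)
print(left)
```

Only the pair `a=1, c=0` occurs in both tables. It appears once in `table1` and twice in `table2`, so the inner join has two rows.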

Besides data manipulation, pandas supports reading from and writing to CSV text files:

In [None]:
pokemons = pd.read_csv('pokemon.csv')
pokemons.to_csv('pokemon2.csv')
pokemons
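
One detail worth knowing is that `to_csv` writes the row index as an extra first column by default. The sketch below shows the effect on a tiny made-up table, which stands in for `pokemon.csv` (its contents are not shown here). It uses `io.StringIO` so that no file is needed.

```python
import io
import pandas as pd

# a small illustrative table; the names and values are made up
frame = pd.DataFrame({'name': ['A', 'B'], 'value': [1, 2]})

# called without a path, to_csv() returns the CSV text as a string;
# by default the row index is written as an unnamed first column
with_index = frame.to_csv()

# index=False writes only the data columns
without_index = frame.to_csv(index=False)

# reading both back shows the difference in columns
columns_with = pd.read_csv(io.StringIO(with_index)).columns.tolist()
columns_without = pd.read_csv(io.StringIO(without_index)).columns.tolist()
print(columns_with)
print(columns_without)
```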

# Matplotlib
> Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

It brings MATLAB-style plotting to Python. Its functions are easy to use: attractive graphs can be plotted in only a few lines of code.

The plotting functions generally take the following form:
<code>
plot(x array, y array, options)
</code>

In [None]:
import matplotlib.pyplot as plt

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])

plt.figure(figsize=(5, 4), dpi=100)

plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('y')
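
The "options" part of the general form can be sketched as follows. This is one possible combination, not an exhaustive list: `'ro--'` is a format string meaning red, circle markers and a dashed line, while `linewidth` and `label` are keyword options.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])

plt.figure(figsize=(5, 4), dpi=100)

# the format string sets colour, marker and line style in one argument
lines = plt.plot(x, y, 'ro--', linewidth=2, label='y')

plt.xlabel('x')
plt.ylabel('y')
plt.legend()  # shows the label given to plot()
```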

How about drawing two graphs side by side? We can use <code>subplot</code>.

In [None]:
import matplotlib.pyplot as plt

x = np.array([1, 2, 3])
y = np.array([1, 1, 1])

plt.figure(figsize=(8, 4), dpi=100)
plt.subplot(1, 2, 1)  # number of rows, number of columns, current index
plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('y against x')

plt.subplot(1, 2, 2)
plt.plot(y, x)
plt.xlabel('y')
plt.ylabel('x')
plt.title('x against y')

plt.tight_layout() # adjust the layout automatically to achieve the best fit

Besides line plots, Matplotlib offers many other graph types, such as scatter plots, bar charts, and histograms.
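
As a short sketch of three of these graph types, the cell below draws a scatter plot, a bar chart and a histogram side by side with `subplot`. The data are illustrative, and the histogram samples are random, so its bars will differ between runs.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
samples = np.random.normal(2, 0.5, 50)  # 50 random samples for the histogram

plt.figure(figsize=(9, 3), dpi=100)

plt.subplot(1, 3, 1)
plt.scatter(x, y)           # one marker per (x, y) pair
plt.title('scatter')

plt.subplot(1, 3, 2)
plt.bar(x, y)               # one bar of height y at each x
plt.title('bar')

plt.subplot(1, 3, 3)
plt.hist(samples, bins=10)  # counts of samples in 10 equal-width bins
plt.title('histogram')

plt.tight_layout()
```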