# <span style="color:#0b486b">SIT307 - Data Mining and Machine Learning</span>

---
Lecturer:   Richard Dazeley     | richard.dazeley@deakin.edu.au<br />
Assistant:  Adam Bignold | abignold@gmail.com

School of Information Technology, <br />
Deakin University, VIC 3216, Australia.


---


## <span style="color:#0b486b">Practical Session 2: Data and Visualisations with numpy</span>

**Prerequisite**
You should already have done, or be confident with the content of: 
1. Week 1 material

**The purpose of this session is:**

1. learn simple data visualisation skills with numpy

**Instructions** 

1. After you download this notebook, save it as another copy and rename it to `"[yourstudentID]_Week_2_Data_and_Visualisations_with_numpy.ipynb"`
2. Fill in the code cells indicated with your own solution. You can discuss approaches with other students, but you must only submit your own original solution.

## <span style="color:#0b486b">Matplotlib  </span>
### <span style="color:#0b486b">What is it?  </span>
matplotlib is a Python plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. You can generate plots, histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a few lines of code.

For simple plotting, the pyplot interface provides a MATLAB-like interface, particularly when combined with IPython. You have full control of line styles, font properties, axes properties, etc., via an object-oriented interface or via a set of functions familiar to MATLAB users.

### <span style="color:#0b486b">Getting Started  </span>
To get started with 'matplotlib' you can either execute:

In [None]:
from pylab import *

or

In [None]:
import matplotlib.pyplot

In fact it is a convention to import it under the name of 'plt':

In [None]:
import matplotlib.pyplot as plt

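Note: The second method (importing `matplotlib.pyplot`, by convention as `plt`) is preferred, because `from pylab import *` fills the namespace with a large number of names.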

In [None]:
import matplotlib.pyplot as plt
import numpy as np

Regardless of the method you use, it is better to configure matplotlib to embed figures in the notebook instead of opening them in a new window for each figure. To do this use the magic function:

In [None]:
%matplotlib inline

### <span style="color:#0b486b">Plot  </span>
By using `'subplots()'` you have access to both figure and axes objects. 

In [None]:
x = np.linspace(0, 10)
y = np.sin(x)

In [None]:
fig, ax = plt.subplots()
ax.plot(x, y)

### <span style="color:#0b486b">Title and labels  </span>

In [None]:
ax.set_title('title here!')
ax.set_xlabel('x')
ax.set_ylabel('sin(x)')
fig

You can also use $\LaTeX$ in title or labels, or change the font size or font family.

In [None]:
x = np.linspace(-10, 10)
fig, ax = plt.subplots(figsize=(6, 4), dpi=100)
ax.plot(x, x**3-x**2, 'bo', lw=3)

ax.set_title('$x^3-x^2$', fontsize=15)
ax.set_xlabel('$x$', fontsize=15)
ax.set_ylabel('$y$', fontsize=15)
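
As a small sketch of the point above about $\LaTeX$, font size and font family: when a label contains a LaTeX command with a backslash (such as `\sin`), writing it as a raw string (`r'...'`) keeps Python from interpreting the backslash as an escape. The function plotted and the `'serif'` family are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x) ** 2, lw=2)

# Raw strings (r'...') keep the backslashes that LaTeX commands need.
ax.set_title(r'$\sin^2(x)$', fontsize=15, fontfamily='serif')
ax.set_xlabel(r'$x$', fontsize=12)
ax.set_ylabel(r'$y$', fontsize=12)
```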

### <span style="color:#0b486b">Subplots  </span>
You can pass the number of subplots to 'subplots()'. In this case, 'axes' will be an array in which each element corresponds to one of the subplots. You can set the properties of each 'ax' object separately, as in the cell below.

You could, of course, use a loop to iterate over 'axes'.


In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2)

x = np.linspace(0, 10)

axes[0].plot(x, np.sin(x))
axes[0].set_xlabel('x')
axes[0].set_ylabel('sin(x)')

axes[1].plot(x, np.cos(x))
axes[1].set_xlabel('xx')
axes[1].set_ylabel('cos(x)')

The `'cos(x)'` label overlaps the `'sin'` graph. You can adjust the size of the figure or the space between the subplots to fix it.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
fig.subplots_adjust(wspace=0.4)
x = np.linspace(0, 10)

axes[0].plot(x, np.sin(x))
axes[0].set_xlabel('x')
axes[0].set_ylabel('sin(x)')

axes[1].plot(x, np.cos(x))
axes[1].set_xlabel('xx')
axes[1].set_ylabel('cos(x)')
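
The per-axes settings above can also be written as a loop over `axes`. This is only a sketch: the `curves` list, which pairs each function with its y-axis label, is a hypothetical helper introduced here for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10)

# Hypothetical pairing of each function with its y-axis label.
curves = [(np.sin, 'sin(x)'), (np.cos, 'cos(x)')]

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
fig.subplots_adjust(wspace=0.4)

# zip() walks through the axes array and the curve list together.
for ax, (func, label) in zip(axes, curves):
    ax.plot(x, func(x))
    ax.set_xlabel('x')
    ax.set_ylabel(label)
```

The loop scales to any number of subplots, as long as `curves` has one entry per axes object.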

### <span style="color:#0b486b">Legend</span>

In [None]:
x = np.linspace(0, 10)
fig, ax = plt.subplots(figsize=(7, 5))
ax.plot(x, np.sin(x), label='$sin(x)$')
ax.plot(x, np.cos(x), label='$cos(x)$')
ax.legend(fontsize=16, loc=3)  # loc=3 places the legend in the lower-left corner

### <span style="color:#0b486b">Customizing ticks</span>
In many cases you want to customize the ticks and their labels on x or y axis. First draw a simple graph and look at the ticks on x-axis. 

In [None]:
x = np.linspace(0, 10, num=100)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x, np.sin(x), x, np.cos(x), lw=2)

You can change the ticks easily by passing a list (or array) to `'set_xticks()'` or `'set_yticks()'`:

In [None]:
xticks = [0, 1, 2, 5, 8, 8.5, 10]
ax.set_xticks(xticks)
fig

You can even change the labels:

In [None]:
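# Raw strings keep the LaTeX backslashes; supply one label per tick (7 ticks were set above).
xticklabels = [r'$\gamma$', r'$\delta$', 'apple', 'b', '', 'c', 'd']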
ax.set_xticklabels(xticklabels, fontsize=18)
fig
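
A common use of custom ticks is labelling a trigonometric plot in multiples of $\pi$. The sketch below assumes nothing beyond `set_xticks()` and `set_xticklabels()` as used above; the tick positions and labels are chosen for illustration, and the two lists must have the same length.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, num=100)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x, np.sin(x), lw=2)

# One label per tick position.
xticks = [0, np.pi / 2, np.pi, 3 * np.pi / 2, 2 * np.pi]
xticklabels = ['0', r'$\pi/2$', r'$\pi$', r'$3\pi/2$', r'$2\pi$']

ax.set_xticks(xticks)
ax.set_xticklabels(xticklabels, fontsize=14)
```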

### <span style="color:#0b486b">Saving figures</span>

In [None]:
x = np.linspace(0, 10)
fig, ax = plt.subplots(figsize=(7, 5))
ax.plot(x, np.sin(x), label='$sin(x)$')
ax.plot(x, np.cos(x), label='$cos(x)$')
ax.legend(fontsize=16, loc=3)
fig.savefig('myfig.pdf', format='PDF', dpi=300)
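
`savefig()` can also write raster formats such as PNG, in which case `dpi` controls the pixel resolution. The following is a minimal sketch that writes into a temporary directory (an assumption made here so the example does not clutter the working folder) and then checks that the file was created.

```python
import os
import tempfile

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10)
fig, ax = plt.subplots(figsize=(7, 5))
ax.plot(x, np.sin(x), label='sin(x)')
ax.legend()

# Temporary output location, used only for this illustration.
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, 'myfig.png')

# bbox_inches='tight' trims the empty margin around the figure.
fig.savefig(png_path, dpi=150, bbox_inches='tight')
saved_ok = os.path.exists(png_path)
```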

### <span style="color:#0b486b">Other plot styles</span>

There are many other plot types supported by `'matplotlib'` in addition to the simple `'plot'`. You will find a complete list of them in the [matplotlib gallery](http://matplotlib.org/gallery.html).

### <span style="color:#0b486b">Scatter plot</span>

In [None]:
fig, ax = plt.subplots()

x = np.linspace(-0.75, 1., 100)
ax.scatter(x, np.random.randn(x.shape[0]),
           s=250 * np.abs(np.random.randn(x.shape[0])),
           alpha=0.4,
           edgecolor='none')
ax.set_title('scatter')

### <span style="color:#0b486b">Bar plot</span>

In [None]:
fig, ax = plt.subplots()

x = np.arange(1, 6)
ax.bar(x, x**2, align="center")
ax.set_title('bar')

## <span style="color:#0b486b">File I/O</span>
### <span style="color:#0b486b">TXT</span>
The TXT file format is the simplest way to store data.

Load a TXT file with `'np.loadtxt()'`:


In [None]:
import numpy as np

x = np.loadtxt("data/txt_data1.txt")
x

Save a TXT file with `'np.savetxt()'`:

In [None]:
y = np.random.randint(10, size=5)
np.savetxt("data/txt_data2.txt", y)
y
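
By default `np.savetxt()` writes every value in scientific notation (`'%.18e'`), and `np.loadtxt()` reads values back as floats. The sketch below shows one way to keep integers as integers with the `fmt` and `dtype` arguments; the temporary file path is an assumption made for illustration, not one of the session's data files.

```python
import os
import tempfile

import numpy as np

y = np.arange(5)

# Temporary file used only for this illustration.
txt_path = os.path.join(tempfile.mkdtemp(), 'txt_demo.txt')

np.savetxt(txt_path, y, fmt='%d')         # write plain integers
y_back = np.loadtxt(txt_path, dtype=int)  # read them back as integers
```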

### <span style="color:#0b486b">CSV</span>
The Comma Separated Values (CSV) format and its variations are among the most used file formats for storing data.

**NOTE:** The best way to read CSV and XLS files is using the **pandas** package, which will be introduced later.

You can use `'np.genfromtxt()'` to read a CSV file:


In [None]:
x = np.genfromtxt("data/csv_data1.csv", delimiter=",")
x

Use `'np.savetxt()'` to save a 2d-array in a CSV file.

In [None]:
x = np.random.randint(10, size=(6,4))
np.savetxt("data/csv_data2.csv", x, delimiter=',')
x
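
CSV files often start with a header row of column names. As a hedged sketch, the `header` argument of `np.savetxt()` writes such a row and the `skip_header` argument of `np.genfromtxt()` skips it when reading. The column names, values and temporary path below are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Made-up table: an id column and a value column.
table = np.array([[1, 2.5],
                  [2, 3.5],
                  [3, 4.5]])

csv_path = os.path.join(tempfile.mkdtemp(), 'csv_demo.csv')

# comments='' stops savetxt from prefixing the header line with '# '.
np.savetxt(csv_path, table, delimiter=',', header='id,value',
           comments='', fmt='%g')

# Skip the header row so that only the numeric rows are parsed.
loaded = np.genfromtxt(csv_path, delimiter=',', skip_header=1)
```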

### <span style="color:#0b486b">JSON</span>
JSON is one of the most widely used file formats when dealing with web services.

To read a JSON file, use the `'json'` package: `'load()'` reads from an open file object, while `'loads()'` parses data that has already been read into a string (or bytes). The data is parsed into the corresponding Python objects, for example a dictionary for a JSON object.

In [None]:
import json
with open("data/json_data1.json", 'rb') as fp:
    fcontent = fp.read()
data = json.loads(fcontent)
data.keys()

In [None]:
data

In [None]:
data['phoneNumbers']

You can also write a Python object, such as a dictionary or a list of dictionaries, into a JSON file. To do this, use `'dump()'` to write to a file or `'dumps()'` to produce a string.

In [None]:
data = [{'Name': 'Zara', 'Age': 7, 'Class': 'First'},
        {'Name': 'Lily', 'Age': 9, 'Class': 'Third'}]
data

In [None]:
with open("data/json_data_now.json", 'w') as fp:
    json.dump(data, fp)
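
The four functions mentioned above come in two pairs: `dumps()`/`loads()` work with strings, while `dump()`/`load()` work with file objects. The round trip below is a minimal sketch; the records and the temporary file path are made up for illustration.

```python
import json
import os
import tempfile

records = [{'Name': 'Zara', 'Age': 7}, {'Name': 'Lily', 'Age': 9}]

# String pair: Python object -> JSON text -> Python object.
text = json.dumps(records)
from_text = json.loads(text)

# File pair: write to a file object, then read it back.
json_path = os.path.join(tempfile.mkdtemp(), 'json_demo.json')
with open(json_path, 'w') as fp:
    json.dump(records, fp, indent=2)  # indent makes the file human-readable
with open(json_path) as fp:
    from_file = json.load(fp)
```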

## <span style="color:#0b486b">4. Plotting a histogram</span>
### <span style="color:#0b486b">4.1 Dataset</span>
You are provided with a dataset containing the percentage of body fat and 10 simple body measurements recorded for 252 men (courtesy of the Journal of Statistics Education - JSE). You can read about this and other [JSE datasets here](http://www.amstat.org/publications/jse/jse_data_archive.htm).

First load the data set into an array:

In [None]:
import numpy as np
data = np.genfromtxt("data/fat.dat.txt")
data.shape

Based on the [dataset description](http://www.amstat.org/publications/jse/datasets/fat.txt), the weight in lbs is stored in the column at index 5 (the sixth column, since indexing starts at 0). Index the weight column and call it `'weights'`:

In [None]:
weights = data[:, 5]
weights

Use array operators to convert the weights into kg. 1 lb equals 0.453592 kg.

In [None]:
a = 3
a = a + 4  # ordinary assignment: a is now 7
a += 4     # augmented assignment, shorthand for a = a + 4: a is now 11

In [None]:
weights *= 0.453592
weights = weights.round(2)
weights
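
One detail worth knowing about the in-place operator: a basic slice such as `data[:, 5]` returns a view, so `weights *= 0.453592` also changes the values stored in `data`. The sketch below illustrates the difference between a view and a copy on a small made-up array (the values are not taken from the dataset).

```python
import numpy as np

LB_TO_KG = 0.453592

# Made-up table: column 1 holds weights in lbs.
table = np.array([[1.0, 150.0],
                  [2.0, 200.0]])

col_view = table[:, 1]   # basic slicing returns a view of the same memory
col_view *= LB_TO_KG     # in-place: column 1 of `table` is now in kg too

table2 = np.array([[1.0, 150.0],
                   [2.0, 200.0]])

col_copy = table2[:, 1].copy()  # an independent copy
col_copy *= LB_TO_KG            # `table2` keeps its original values
```

Taking a copy is the safer choice when the original array is still needed in its original units.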

### <span style="color:#0b486b">4.2 Histogram</span>
A histogram is a bar plot that shows the statistical distribution of the data over a variable. The bars represent the frequency of occurrence for each class (bin) of data. We use the package `'matplotlib'` and the function `'hist()'` for plotting the histogram. To learn more about `'matplotlib'`, make sure you have read the tutorial.

The first line of the cell below is for showing the figure in the notebook rather than opening it in a separate window.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
fig, ax = plt.subplots()
ax.hist(weights)

The `'hist()'` function automatically groups the data into 10 bins. Usually you need to tweak the number of bins to obtain a more expressive histogram.

In [None]:
fig, ax = plt.subplots(figsize=(7, 5))
ax.hist(weights, bins=20)
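
To inspect the numbers behind a histogram, `np.histogram()` returns the count in each bin together with the bin edges. The sketch below uses randomly generated stand-in data (not the body fat dataset) and passes the computed edges to `hist()` so that the plot uses exactly the same bins.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data: 252 random values, made up for illustration.
rng = np.random.default_rng(0)
sample = rng.normal(loc=80, scale=10, size=252)

# counts[i] is the number of values between edges[i] and edges[i + 1].
counts, edges = np.histogram(sample, bins=20)

fig, ax = plt.subplots(figsize=(7, 5))
ax.hist(sample, bins=edges)  # reuse the same bin edges for the plot
```

With 20 bins there are 21 edges, and the counts add up to the number of data points.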

### <span style="color:#0b486b">4.3 Boxplot</span>

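A `Boxplot` is a convenient way to graphically summarise numerical data: the box spans the first to third quartiles, the line inside it marks the median, and points beyond the whiskers are drawn as outliers.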

In [None]:
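import matplotlib
import matplotlib.pyplot as plt

# Set the default font size before creating the figure.
matplotlib.rcParams.update({'font.size': 14})
fig, ax = plt.subplots(figsize=(7, 5))
ax.boxplot(weights, notch=False, labels=['group1'])  # notch=False draws a plain rectangular box
ax.set_ylabel('weight (kg)', fontsize=16)
ax.set_title('Weights BoxPlot', fontsize=16)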
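
The quantities a boxplot draws can be computed directly with numpy. The sketch below uses a small made-up sample and assumes matplotlib's default whisker rule, under which points more than 1.5 times the interquartile range beyond the quartiles are treated as outliers.

```python
import numpy as np

# Made-up sample with one unusually large value.
sample = np.array([60., 65., 70., 72., 75., 78., 80., 85., 90., 130.])

# Quartiles: the box spans q1 to q3, the line inside marks the median.
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Fences of the default whisker rule (1.5 * IQR beyond the quartiles).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the points drawn individually as outliers.
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]
```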