# <span style="color:#0b486b">SIT307 - Data Mining and Machine Learning</span>

---
Lecturer:   Richard Dazeley     | richard.dazeley@deakin.edu.au<br />
Assistant:  Adam Bignold | abignold@gmail.com

School of Information Technology, <br />
Deakin University, VIC 3216, Australia.


---


## <span style="color:#0b486b">Practical Session 1: Python for Data Science</span>

**Prerequisite**
You should already have done, or be confident with the content of: 
1. [Python Environments](1. Introduction to Python Environments.pdf)
2. [Jupyter Notebooks](2. Jupyter Notbook Introduction.pdf)
3. [Markdown](5. Markdown.ipynb)
4. [Basic Python](3. Introduction to Python Programming.ipynb)

If not, click the links above for an introduction.

**The purpose of this session is:**

1. Getting your feet wet with Python programming for Data Science. Demonstration of simple commands, numpy, IO, pandas etc
2. Understanding how to read and write data to files

**Instructions** 

1. After you download this notebook, save another copy and rename it so you keep your solution.
2. As we walk through the notebook, you will be required to write your own code in the cells provided for you where necessary. Enter your own code in these cells.
3. Although it is NOT marked, you are strongly recommended to keep these saved notebooks for your future reference.

**Please note that**:
* This unit is *not* about teaching you how to program in Python. You are responsible for developing your own programming skills. There are many excellent resources for learning Python online, as well as the recommended textbooks presented in the lecture.
* Also note this file does not need to be submitted.

## <span style="color:#0b486b">Numpy: Quick Introduction </span>
The NumPy library forms the core of many scientific packages. It offers support for arrays, matrices and related numerical calculations.

This tutorial reviews only the basics. To learn more about NumPy, visit the [NumPy User Guide](http://docs.scipy.org/doc/numpy/user/index.html).

### <span style="color:#0b486b">1 Importing Numpy </span>
To use any package, you must first import it. 

In [1]:
import numpy

The default convention is to import numpy as:

In [2]:
import numpy as np

### <span style="color:#0b486b">2 Numpy arrays</span>
The core of NumPy is its arrays. You can create an array from a Python list or tuple using the `array` function. Arrays work similarly to lists, except that:

* you can easily perform element-wise operations on them, and
* unlike lists, they should be pre-allocated.

The first point is further explained in [Array operations section](03-prac3.ipynb#Array-operations). The second point means that there is no in-place equivalent of list `append` for arrays: the size of an array is fixed when it is created.
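
As a minimal sketch of the two differences listed above (the values are made up for illustration): `*` repeats a list but scales every element of an array, and an array is typically created at its final size and then filled by index.

```python
import numpy as np

values = [1, 2, 3]            # illustrative values only
arr = np.array(values)

# '*' on a list repeats the list; on an array it multiplies element-wise
list_result = values * 2
array_result = arr * 2

# pre-allocate an array of the final size, then fill it by index
squares = np.zeros(3, dtype=int)
for i in range(3):
    squares[i] = arr[i] ** 2

print(list_result)
print(array_result)
print(squares)
```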

In [None]:
#create a list
x = [1, 7, 3, 4, 0, -5]

#convert it to an array
y = np.array(x)
print (type(y))
print (y.shape)   # It returns a tuple of array dimensions

Using numpy, we can specify a defined range of values to create a new array

In [None]:
range(5)

In [None]:
print(np.array(range(4,8)))

In [None]:
print(np.arange(2, 3, 0.2))  # values from 2 up to (but not including) 3, in steps of 0.2
print(np.linspace(2, 3, 5))  # 5 numbers spaced evenly on a linear scale, both endpoints included
print(np.logspace(2, 3, 5))  # 5 numbers spaced evenly on a log scale, from 10**2 to 10**3
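
The following sketch may help relate `linspace` and `logspace`: with the default base of 10, `logspace(2, 3, 5)` should match 10 raised to the values of `linspace(2, 3, 5)`. The variable names are illustrative.

```python
import numpy as np

lin = np.linspace(2, 3, 5)    # 2.0, 2.25, 2.5, 2.75, 3.0
log = np.logspace(2, 3, 5)    # starts at 10**2 and ends at 10**3

# logspace treats its first two arguments as exponents of the base (10 by default)
same = np.allclose(log, 10 ** lin)

print(lin)
print(log[0], log[-1])
print(same)
```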

Creating an array filled with zeros of a particular type and shape.


In [None]:
print(np.zeros(shape=(3,), dtype=int))  # np.int was removed from recent NumPy releases; use the built-in int

Creating an array filled with ones of a particular type and shape.


In [None]:
print(np.ones(shape=(3,), dtype=int))

We can reshape an array to a new specified shape keeping its contents.



In [None]:
a = np.array([[1,2], [3,4]])
print(a)

a = np.reshape(a, 4)
print(a)

We can append values to the end of an array. Note that `np.append` returns a new array rather than modifying the original in place.

In [None]:
a = np.array([1,2,3])
print(a)
a = np.append(a,[4])
print(a)
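
A short sketch of what `np.append` does to the original array (the names `a` and `b` are illustrative): the input keeps its size and the result is a separate, longer array.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.append(a, [4])     # a new array is returned; 'a' is left unchanged

print(a.shape, b.shape)
```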

We can stack arrays horizontally (columnwise) using numpy.


In [None]:
a = np.array([1, 2, 3])
print("shape of a is " , a.shape, "and the contents are",a)
b = np.array([2, 3, 4])
print("shape of b is " , b.shape, " and the contents are",b)
stacked = np.hstack((a,b)) #horizontally
print("shape of new array is " , stacked.shape, " and the contents are",stacked) 

We can also stack arrays vertically (rowwise) using numpy.

In [None]:
a = np.array([1, 2, 3])
print("shape of a is " , a.shape, "and the contents are",a)
b = np.array([2, 3, 4])
print("shape of b is " , b.shape, " and the contents are",b)
stacked = np.vstack((a,b)) #vertically
print("shape of new array is " , stacked.shape, " and the contents are",stacked)

We can create a matrix from an array-like object or from a string of data. A matrix is a 2D array.

In [None]:
a = np.matrix('1 2; 3 4') #string of data
print("shape is" ,a.shape, "and contents are", a)

a = np.matrix([[1, 2], [3, 4]]) #array-like structure
print("shape is" ,a.shape, "and contents are", a)

We can create an identity matrix using numpy. It is a square array with ones on the main diagonal and zeros elsewhere.

In [None]:
a = np.identity(3)
print("shape is" ,a.shape, "and contents are")
print(a)

We can find dot product of two arrays using numpy.

In [None]:
c =  np.dot(3, 4)
print(c)
print(c.shape)

a = [[1, 0], [0, 1]] #2D array 
b = [[4, 1], [2, 2]] #2D array
c = np.dot(a, b)  # matrix multiplication

print(c)
print(c.shape)
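
As a further sketch with made-up vectors: for two 1-D arrays `np.dot` gives the inner product, and for 2-D arrays the `@` operator (not used elsewhere in this notebook) gives the same matrix product as `np.dot`.

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

# inner product of two 1-D arrays: 1*4 + 2*5 + 3*6
inner = np.dot(u, v)

a = np.array([[1, 0], [0, 1]])
b = np.array([[4, 1], [2, 2]])

# matrix product of two 2-D arrays, written with the @ operator
product = a @ b

print(inner)
print(product)
```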

We can compute the arithmetic sum, mean and standard deviation using numpy.

In [None]:
a = np.array([[1, 2], [3, 4]])
print(np.sum(a))          # no axis specified: sum of all elements
print(np.sum(a, axis=0))  # axis=0: sum down the rows, one result per column
print(np.sum(a, axis=1))  # axis=1: sum across the columns, one result per row

print(np.mean(a))         # no axis specified: mean of all elements
print(np.mean(a, axis=0)) # axis=0: one mean per column
print(np.mean(a, axis=1)) # axis=1: one mean per row

print(np.std(a))          # no axis specified: standard deviation of all elements
print(np.std(a, axis=0))  # axis=0: one standard deviation per column
print(np.std(a, axis=1))  # axis=1: one standard deviation per row
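
The `axis` argument is easy to mix up, so here is a small worked sketch on the same 2x2 array. The axis you name is the one that gets collapsed.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])

col_sums = a.sum(axis=0)   # collapses the rows:    [1+3, 2+4]
row_sums = a.sum(axis=1)   # collapses the columns: [1+2, 3+4]

print(col_sums)
print(row_sums)
```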

We can also find the cumulative sum of the array along a given axis.

In [None]:
print(np.cumsum(a))         # no axis specified: running total over the flattened array
print(np.cumsum(a, axis=0)) # axis=0: running total down each column
print(np.cumsum(a, axis=1)) # axis=1: running total along each row

Using numpy, we can find the unique elements of an array and the index at which each first occurs.

In [None]:
print("the unique elements are ",  np.unique([1, 1, 2, 2, 3, 3]))

a = np.array([[1, 1], [2, 3]])
print("the unique elements are ", np.unique(a))
 
a = np.array(['a', 'b', 'b', 'c', 'a'])
u, indices = np.unique(a, return_index=True) #Return the indices of the original array that give the unique values:
print("the unique elements are ", u)
print("corresponding indices of unique elements are ", indices)

We can use numpy to create coordinate matrices from coordinate vectors.

It allows us to make N-D coordinate arrays for vectorized evaluations of N-D scalar/vector fields over N-D grids, given one-dimensional coordinate arrays.
meshgrid is very useful to evaluate functions on a grid.

In [None]:
nx, ny = (3, 2)
x = np.linspace(0, 1, nx)
y = np.linspace(0, 1, ny)
xv, yv = np.meshgrid(x, y)
print(xv)
print(yv)
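
To illustrate evaluating a function on a grid, the sketch below uses the same grid and an arbitrary example function f(x, y) = x + 2y, chosen only for illustration.

```python
import numpy as np

x = np.linspace(0, 1, 3)       # 3 points along x
y = np.linspace(0, 1, 2)       # 2 points along y
xv, yv = np.meshgrid(x, y)     # both coordinate matrices have shape (2, 3)

# evaluate f(x, y) = x + 2*y at every grid point in one vectorised step
z = xv + 2 * yv

print(z.shape)
print(z)
```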

NumPy has a module called `random` to generate arrays of random numbers. There are different ways to generate a random number:

In [None]:
print(np.random.rand())

Generate a 3x5 matrix of random numbers

In [None]:
print(np.random.rand(3,5))

We can fix the seed of the random number generator. Execute the previous cell more than once and watch the numbers change. In the cell below, once the seed is fixed, the sampled numbers no longer change between runs. Seeding is especially useful for debugging, because the program generates the same sequence every time.

In [None]:
np.random.seed(2348) #seeding
print(np.random.rand(3,5))
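
A minimal sketch of what seeding guarantees: resetting the same seed restarts the sequence, so the two draws below should be identical. The names `first` and `second` are illustrative.

```python
import numpy as np

np.random.seed(2348)
first = np.random.rand(3)

np.random.seed(2348)       # the same seed restarts the same sequence
second = np.random.rand(3)

print(np.array_equal(first, second))
```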

We can draw random samples from a multivariate normal distribution using numpy.

In [None]:
mean = (1, 2)   #arithmetic mean
cov = [[1, 0], [0, 1]] #covariance matrix
x = np.random.multivariate_normal(mean, cov, (5))
print(x)
print(x.shape)

We can compute eigenvalues and eigenvectors using numpy.


In [None]:
w, v = np.linalg.eig(np.diag((1, 2, 3)))
print("eigenvalues are :", w)
print("eigenvectors (one per column) are :", v)
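
One way to sanity-check the result, sketched here under the assumption that the matrix is the same diagonal one, is to confirm that each column of `v` satisfies A v = w v for its eigenvalue.

```python
import numpy as np

A = np.diag((1, 2, 3))
w, v = np.linalg.eig(A)

# column v[:, i] is the eigenvector paired with eigenvalue w[i];
# v * w scales column i by w[i], so this checks A v_i = w_i v_i for all i at once
check = np.allclose(A @ v, v * w)

print(check)
```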

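We can compute the (Moore-Penrose) pseudo-inverse of a matrix using numpy. Unlike the ordinary inverse, the pseudo-inverse is also defined for non-square matrices such as the 3x2 matrix below.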

In [None]:
a = np.random.randn(3, 2)
print(a)
b = np.linalg.pinv(a)
print(b)
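
The sketch below checks two properties of the pseudo-inverse on a random 3x2 matrix. The seed is arbitrary, and the second check assumes the matrix has full column rank, which holds for this sample.

```python
import numpy as np

np.random.seed(0)                 # arbitrary seed so the run is repeatable
a = np.random.randn(3, 2)
b = np.linalg.pinv(a)             # shape (2, 3)

# defining property of the pseudo-inverse: a @ b @ a reproduces a
reconstructs = np.allclose(a @ b @ a, a)

# for a matrix with full column rank, b @ a is the 2x2 identity
left_identity = np.allclose(b @ a, np.eye(2))

print(b.shape, reconstructs, left_identity)
```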

We can compute the norm of a matrix or vector using numpy.

In [None]:
a = np.linspace(-4, 4, num=8)
print(a)
print(np.linalg.norm(a))   # default: 2-norm for a vector (Frobenius norm for a matrix)
print(np.linalg.norm(a,1)) # 1-norm
print(np.linalg.norm(a,2)) # 2-norm
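
For a vector, these norms can be checked by hand. The sketch below uses a small made-up vector so the arithmetic is easy to follow.

```python
import numpy as np

a = np.array([3.0, -4.0])            # illustrative vector

two_norm = np.linalg.norm(a)         # sqrt(3**2 + (-4)**2) = 5
one_norm = np.linalg.norm(a, 1)      # |3| + |-4| = 7

# the same quantities computed directly from the definitions
two_by_hand = np.sqrt(np.sum(a ** 2))
one_by_hand = np.sum(np.abs(a))

print(two_norm, one_norm)
```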

Below are some miscellaneous functions that you might encounter later on.

In [None]:
import numpy as np
print(np.exp(0))  #exponential function
print(np.sin(0)) #sine function
print(np.cos(0)) #cosine function

**Note**: If you need help with what a function does or how to use it, you can use IPython's built-in help. Add a question mark (?) after the function name and execute the cell, for example `np.linspace?`.

### <span style="color:#0b486b">3 File Input and Output </span>
Here we will demonstrate reading, updating and writing files. In many cases the data you need to work with is stored in files. Real-world data usually comes in a file of some type, such as txt, csv, xml or json.


### 3.1 Reading and Writing Files
Now, let us read a text file already stored in the directory. 

In [None]:
myfile = open("data/months.txt")
print(myfile.read())
myfile.close()

Here, open() creates a file object (a way of getting at the contents of the file), which is then stored in the variable myfile. myfile.read() tells the file object to read the full contents of the file, and return it as a string.

We can also read the lines one by one using readlines() 

In [None]:
myfile = open("data/months.txt")

for month in myfile.readlines():
    #strip() removes leading and trailing whitespace (including the newline) from the string
    print("Month: {}".format(month.strip()))

myfile.close()

In fact, you don't even have to call readlines() - Python assumes that if you try to iterate through a text file with a for loop, you probably want to iterate through it line by line:

In [None]:
myfile = open("data/months.txt")

for month in myfile:
    #strip() removes leading and trailing whitespace (including the newline) from the string
    print("Month: {}".format(month.strip()))

myfile.close()


When you open() a file, you can optionally specify a file mode, which tells Python what you want to do with the file. The default mode is r for read, but another mode is w to write to a file.

Tip: the write (w) mode will write completely new contents to a file, wiping out what it had previously!

There are actually a whole lot of file modes, r and w are just the most common. Option 'a' will append to the end of the file.
There is a <a href="https://docs.python.org/3/library/functions.html#open">full list in the Python documentation for the open function</a> or you can type open? in IPython Notebook and run it to see the help displayed there.

Let's open a new file and write something to it.

In [None]:
f = open("data/awesomenewfile.txt", "w")
f.write("Awesome message!")
f.close()

In [None]:
f = open("data/awesomenewfile.txt", "r")
print(f.read())
f.close()

In [None]:
f = open("data/awesomenewfile.txt", "a") # appending to file
f.write("\nAnother message!") # "\n" introduces a new line character
f.close()

f = open("data/awesomenewfile.txt", "r")
print(f.read())
f.close()
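
The three modes used above can also be combined with Python's `with` statement, which is not covered in the cells above but closes the file automatically when the block ends. This is only a sketch: the file name `demo.txt` and the temporary directory are hypothetical, chosen so that no file in `data/` is touched.

```python
import os
import tempfile

# hypothetical scratch file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as f:       # 'w' creates the file or overwrites its contents
    f.write("first line\n")

with open(path, "a") as f:       # 'a' appends to the end of the file
    f.write("second line\n")

with open(path) as f:            # the default mode is 'r'
    lines = f.read().splitlines()

print(lines)
```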

**Exercise** Insert December into months.txt file

In [None]:
#open the file in append mode
myfile = 

#write into the file


#Close the file
myfile.close()

### <span style="color:#0b486b"> 4 Reading and Writing Files using Pandas</span>

We now introduce pandas, a powerful package that we will use to read data from a csv file. Using pandas, we can read and store data as a dataframe, just as in the R programming language. [You can read this section](http://pandas.pydata.org/pandas-docs/stable/10min.html) for a quick introduction to pandas.

In [None]:
#import pandas
import pandas as pd

Read the contents of the csv file into a pandas dataframe. Advertising.csv is a dataset describing the advertising media that contributed to the sales of a product. It has three features: TV, Radio and Newspaper.

In [None]:
my_dataframe = pd.read_csv('data/Advertising.csv', delimiter=',') 

In [None]:
my_dataframe.head() # print the first 5 entries

In [None]:
my_dataframe.tail() # print the last 5 entries

In pandas, you can use the column name to access individual columns. You can also combine columns to form the feature matrix.

In [None]:
feature_1 = my_dataframe.TV.values #  accessing a single column; .values converts the column to a numpy array
feature_2 = my_dataframe.Radio.values
feature_3 = my_dataframe.Newspaper.values

feature_cols = ['TV','Radio','Newspaper']

x = my_dataframe[feature_cols].values 
y = my_dataframe.Sales.values

print(feature_1.shape)
print(feature_2.shape)
print(feature_3.shape)


print(x.shape)
print(y.shape)
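
The same column-selection pattern can be tried without the data file. The sketch below reads a tiny in-memory csv through `io.StringIO`. The two rows are made-up numbers, not values from Advertising.csv.

```python
import io
import pandas as pd

# made-up rows standing in for a csv file on disk
csv_text = "TV,Radio,Newspaper,Sales\n10.0,20.0,30.0,5.0\n40.0,50.0,60.0,9.0\n"
df = pd.read_csv(io.StringIO(csv_text))

x = df[["TV", "Radio", "Newspaper"]].values   # 2-D feature matrix: rows x 3 features
y = df["Sales"].values                         # 1-D target vector

print(x.shape, y.shape)
```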

You can write the data to a csv using pandas. 

In [None]:
my_dataframe.to_csv("data/file_write.csv") # if the data is already in the dataframe format

# if the data is a numpy structure, first convert it to a dataframe and then write it
np_array = my_dataframe.values   
header_labels = ['TV','Radio','Newspaper','Sales']
my_dataframe = pd.DataFrame(np_array)
my_dataframe.to_csv("data/file_write.csv",  header = header_labels) # if there is a header, pass it here
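
By default `to_csv` also writes the dataframe's row index as the first column. The sketch below, which uses made-up values and hypothetical column names `A` and `B`, writes to an in-memory buffer with `index=False` to show the output without that extra column.

```python
import io
import numpy as np
import pandas as pd

np_array = np.array([[1.0, 2.0], [3.0, 4.0]])        # made-up values
df = pd.DataFrame(np_array, columns=["A", "B"])      # hypothetical column names

buffer = io.StringIO()
df.to_csv(buffer, index=False)    # index=False leaves out the row-index column
csv_lines = buffer.getvalue().splitlines()

print(csv_lines)
```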

## <span style="color:#0b486b">6 Acknowledgement </span>

Some of the materials are adapted from the course SIT112 by Prof. Dinh Phung.