# 1. Introduction
This lab introduces NumPy arrays and Pandas DataFrames, which together provide a computational foundation for general numerical data processing. The NumPy N-dimensional array (ndarray) object is a fast, flexible container, typically used for large datasets. Pandas provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

### 1.0.1 NumPy

#### Creating Arrays
First we need to import the NumPy library. To create an array, use the array function.

In [1]:
import numpy as np

In [2]:
data1 = [2, 5, 4, 8, 0, 1]
arr1 = np.array(data1)

In [3]:
arr1

array([2, 5, 4, 8, 0, 1])

We can also create an array using arange, a NumPy function that generates values within the half-open interval [start, stop). By default, the step size is 1. Here, we generate a sequence of values from 0 to 19.

In [4]:
arr1 = np.arange(0,20)

In [5]:
arr1

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

We can specify the step size as the third argument. Here the step is 0.1.

In [6]:
arr1 = np.arange(0,2,0.1)

In [7]:
arr1

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2,
       1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9])
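As a quick check of the half-open interval described above, the sketch below confirms that the stop value is never included. It also shows `np.linspace`, which is not used elsewhere in this lab, as one possible alternative when the number of points matters more than the step size. The variable names are illustrative only.

```python
import numpy as np

# arange excludes the stop value, so this runs from 0 to 19
ints = np.arange(0, 20)
print(ints[0], ints[-1], len(ints))

# With a fractional step the stop value (2) is still excluded
steps = np.arange(0, 2, 0.1)
print(len(steps), steps.max() < 2)

# linspace takes a number of points instead of a step
# and includes the endpoint by default
points = np.linspace(0, 2, 21)
print(points[0], points[-1], len(points))
```

Because floating-point steps accumulate rounding error, `linspace` is often the safer choice when the endpoint must be hit exactly.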

There are other functions for creating new arrays. <b>zeros</b> and <b>ones</b> create arrays of 0s and 1s respectively. Here we use <b>zeros</b> to create an array of length 10.

In [8]:
arr1 = np.zeros(10)

In [9]:
arr1

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

Here we use <b>ones</b> to create a two-dimensional array by passing a tuple as the shape of the array.

In [10]:
arr1 = np.ones((3,4))

In [11]:
arr1

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

We can create a multidimensional array by manually specifying the elements. Here we create an array (arr2) of two dimensions. The shape of the array will be inferred from the data.

In [12]:
data2 = [[3, 4.9, -2.5, 1.7], [-5.2, -3.1, 7.4, 3.6], [4.2, 6.8, -4.3, -5.7]]

In [13]:
arr2 = np.array(data2)

In [14]:
arr2

array([[ 3. ,  4.9, -2.5,  1.7],
       [-5.2, -3.1,  7.4,  3.6],
       [ 4.2,  6.8, -4.3, -5.7]])

We can confirm this by inspecting the <b>ndim</b> and <b>shape</b> attributes.

In [15]:
arr2.ndim

2

In [16]:
arr2.shape

(3, 4)

#### Indexing and Slicing
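We can access the elements of an array by placing an index inside square brackets []. Indexing starts at 0, so index 2 refers to the third item. Since arr1 is currently the two-dimensional array of ones created above, a single index selects a whole row: here we access its third row.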

In [17]:
arr1[2]

array([1., 1., 1., 1.])

The same index notation is used to access individual elements in a multidimensional array, but we need to specify two indexes: row, column. Here we access the element in the first row and second column.

In [18]:
arr2[0,1]

4.9

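Slicing an array is similar to slicing a list: we specify start:stop inside the square brackets []. A stop value beyond the end of the array is truncated, so on the three-row arr1 the slice 2:5 returns only the last row.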

In [19]:
arr1[2:5]

array([[1., 1., 1., 1.]])

To slice a multidimensional array, we specify two start:stop ranges, one for the rows and one for the columns. Here we select the first two rows and the last three columns.

In [20]:
arr2[0:2, 1:4]

array([[ 4.9, -2.5,  1.7],
       [-3.1,  7.4,  3.6]])
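The indexing and slicing rules above can be sketched on a small array with the same values as arr2. One detail worth knowing, and not covered in the text above, is that a NumPy slice is a view of the original data rather than a copy. The names `grid`, `block` and `safe` are made up for this illustration.

```python
import numpy as np

grid = np.array([[3.0, 4.9, -2.5, 1.7],
                 [-5.2, -3.1, 7.4, 3.6],
                 [4.2, 6.8, -4.3, -5.7]])

row = grid[2]            # a single index on a 2-D array selects a whole row
item = grid[0, 1]        # row 0, column 1
block = grid[0:2, 1:4]   # first two rows, last three columns
print(row.shape, item, block.shape)

# A slice is a view: writing to it also changes the original array
block[0, 0] = 0.0
print(grid[0, 1])

# Use copy() when an independent array is needed
safe = grid[0:2, 1:4].copy()
safe[0, 0] = 99.0
print(grid[0, 1])        # unaffected by the change to safe
```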

### 1.0.2 Arithmetic

NumPy enables batch operations on data without writing loops, which is known as vectorization. Any arithmetic operation between equal-size arrays is applied element-wise. Let's create another array called arr3. We are going to use the <b>randn</b> function to generate some normally distributed random data.

In [21]:
arr3 = np.random.randn(3,4)

In [22]:
arr3

array([[ 0.15871994, -1.10138556,  0.02429916, -1.43385989],
       [ 0.24381828, -1.16598724, -1.66770287,  0.39813812],
       [-0.77445368, -1.9141633 ,  1.53969353,  0.50608273]])

Now, let's perform addition between arr2 and arr3 and store the result in arr4.

In [23]:
arr4 = arr2 + arr3

In [24]:
arr4

array([[ 3.15871994,  3.79861444, -2.47570084,  0.26614011],
       [-4.95618172, -4.26598724,  5.73229713,  3.99813812],
       [ 3.42554632,  4.8858367 , -2.76030647, -5.19391727]])

We can perform subtraction, multiplication and division using the following operators.<br>
subtraction -, multiplication * and division / <br>
Here we perform multiplication between the arrays.

In [25]:
arr4 = arr2 * arr3

In [26]:
arr4

array([[  0.47615983,  -5.39678925,  -0.06074791,  -2.43756181],
       [ -1.26785503,   3.61456044, -12.34100121,   1.43329722],
       [ -3.25270544, -13.01631044,  -6.62068219,  -2.88467156]])

We can compare arrays of the same size, which yields a boolean array. Here we test whether each element in arr2 is greater than the corresponding element in arr3.

In [27]:
arr2 > arr3

array([[ True,  True, False,  True],
       [False, False,  True,  True],
       [ True,  True, False, False]])
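Since arr3 is random, the results above change on every run. The following sketch repeats the same element-wise operations on two small fixed arrays so the results can be checked by hand. It also shows one common use of a boolean array, selecting the elements where the condition holds, which goes slightly beyond the text above. The arrays `a` and `b` are invented for this example.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[4.0, 3.0], [2.0, 1.0]])

total = a + b      # element-wise addition
product = a * b    # element-wise multiplication, not a matrix product
mask = a > b       # boolean array with the same shape as a and b

print(total)
print(product)
print(mask)

# A boolean array can be used as an index to keep only the matching elements
print(a[mask])
```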

Transposing an array (matrix) is performed using the special T attribute.

In [28]:
arr2.T

array([[ 3. , -5.2,  4.2],
       [ 4.9, -3.1,  6.8],
       [-2.5,  7.4, -4.3],
       [ 1.7,  3.6, -5.7]])

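We can perform matrix computations such as the dot product using the <b>dot</b> function. Here we compute the dot product of arr2 and arr3. Both arrays have shape (3, 4), so we transpose arr2 to make the inner dimensions match: a (4, 3) array times a (3, 4) array gives a (4, 4) result.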

In [29]:
arr4 = np.dot(arr2.T, arr3)

In [30]:
arr4

array([[ -4.04440064,  -5.2805089 ,  15.21166522,  -4.24635041],
       [ -5.24439392, -14.79853925,  15.7588608 ,  -4.81877905],
       [  4.73760619,   2.35606052, -19.0224313 ,   4.35471605],
       [  5.56195565,   4.84082129, -14.73867487,  -3.88893615]])
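The shape rule behind the dot product can be checked with a short sketch. The arrays `left` and `right` below are stand-ins with the same (3, 4) shape as arr2 and arr3, and the `@` operator is shown as an equivalent spelling for two-dimensional arrays.

```python
import numpy as np

left = np.arange(12).reshape(3, 4)   # shape (3, 4)
right = np.ones((3, 4))              # shape (3, 4)

# (4, 3) dot (3, 4) gives a (4, 4) result
result = np.dot(left.T, right)
print(result.shape)

# For 2-D arrays the @ operator computes the same matrix product
print(np.array_equal(result, left.T @ right))

# Without the transpose the inner dimensions (4 and 3) do not match
try:
    np.dot(left, right)
except ValueError as err:
    print("shape mismatch:", err)
```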

#### Mathematical and statistical methods for arrays
NumPy provides mathematical element-wise operations on data in arrays. <br><br>
Square root: sqrt <br>
Square: square <br>
Exponent: exp <br>
Natural logarithm (base e): log <br>
Logarithm (base 10): log10 <br>
Logarithm (base 2): log2 <br>
Regular and hyperbolic trigonometric: cos, cosh, sin, sinh, tan, tanh <br>
Inverse trigonometric: arccos, arccosh, arcsin, arcsinh, arctan, arctanh

In [31]:
arr4 = np.array([0,1,2,3,4,5,6,7,8])

Here we perform element-wise transformation i.e. square root and exponent.

In [32]:
np.sqrt(arr4)

array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712])

In [33]:
np.exp(arr4)

array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03])
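The logarithm functions listed above differ only in their base. The sketch below, on values chosen so the answers are easy to verify, shows that `np.log` is the natural logarithm and that any other base can be obtained by the change-of-base rule. Base 3 is used purely as an example.

```python
import numpy as np

x = np.array([1.0, np.e, 8.0, 100.0])

natural = np.log(x)     # natural logarithm (base e)
base10 = np.log10(x)    # base 10
base2 = np.log2(x)      # base 2

print(natural)
print(base10)
print(base2)

# Any other base follows from the change-of-base rule: log_b(x) = ln(x) / ln(b)
base3 = np.log(x) / np.log(3)
print(base3)
```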

NumPy also provides mathematical functions that compute statistics about an entire array or about the data along an axis. <br><br>
Sum of all elements in the array or along an axis: <b>sum</b> <br>
Arithmetic mean: <b>mean</b> <br>
Standard deviation and variance: <b>std</b>, <b>var</b> <br>
Minimum and maximum: <b>min</b>, <b>max</b> <br>
Indices of minimum and maximum elements: <b>argmin</b>, <b>argmax</b> <br>
Cumulative sum of elements starting from 0: <b>cumsum</b> <br>
Cumulative product of elements starting from 1: <b>cumprod</b>

In [34]:
arr4 = np.random.randn(2,3)

In [35]:
arr4

array([[-0.33475437, -0.19599102,  0.48346306],
       [ 0.06125897, -0.48613045, -0.61257556]])

Here we compute the mean of the entire data.

In [36]:
np.mean(arr4)

-0.18078822798085983

We can also compute the mean along an axis. Here axis=0 collapses the rows, giving one mean per column.

In [37]:
np.mean(arr4, axis=0)

array([-0.1367477 , -0.34106074, -0.06455625])

axis=1 collapses the columns, giving one mean per row.

In [38]:
np.mean(arr4, axis=1)

array([-0.01576078, -0.34581568])
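The axis argument is easier to follow on fixed numbers. In this sketch the array `table` is invented for illustration, with 2 rows and 3 columns, so each result can be worked out by hand. The same axis convention applies to the other statistics listed above, such as cumsum and argmax.

```python
import numpy as np

table = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])   # 2 rows, 3 columns

overall = np.mean(table)              # a single number for the whole array
per_column = np.mean(table, axis=0)   # collapses the rows: one value per column
per_row = np.mean(table, axis=1)      # collapses the columns: one value per row

print(overall)
print(per_column)
print(per_row)

# Other statistics follow the same axis convention
running = np.cumsum(table, axis=1)    # running total along each row
largest = np.argmax(table, axis=0)    # row index of the maximum in each column
print(running)
print(largest)
```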

#### Reshaping, concatenating and splitting arrays
In many cases, we need to convert an array from one shape to another. NumPy uses row-major order by default when reshaping arrays. In row-major order, the consecutive elements of a row reside next to each other. Here we reshape arr1 from a one-dimensional array to a two-dimensional array.

In [39]:
arr1 = np.arange(0,9)

In [40]:
arr1

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [41]:
arr1.reshape((3,3))

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

We can let NumPy infer one dimension from the data. Here we specify -1 so that the number of columns is inferred from the length of the array and the number of rows.

In [42]:
arr1 = np.arange(0,12)

In [43]:
arr1

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [44]:
arr1.reshape((4, -1))

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

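The opposite operation of reshape is known as flattening: the flatten method converts an array of any shape to a one-dimensional array. Note that reshape returns a new array and leaves arr1 itself unchanged, so arr1 is already one-dimensional in the call below.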

In [45]:
arr1.flatten()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
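The reshape and flatten calls above never modify the array they are called on. A minimal sketch of that behaviour follows, using the illustrative names `flat`, `grid` and `back`, and including what happens when the requested shape does not fit the data.

```python
import numpy as np

flat = np.arange(12)
grid = flat.reshape((4, -1))   # -1 lets NumPy work out that 3 columns are needed

print(flat.shape)   # the original array keeps its shape
print(grid.shape)

# flatten returns a one-dimensional copy, so editing it leaves grid untouched
back = grid.flatten()
back[0] = 100
print(grid[0, 0])

# A shape that does not divide the 12 elements evenly raises an error
try:
    flat.reshape((5, -1))
except ValueError as err:
    print("cannot reshape:", err)
```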

Two or more arrays can be concatenated to form a single array. The concatenate function takes a sequence of arrays and joins them in order along the given axis.

In [46]:
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[7, 8, 9], [10, 11, 12]])

Here we concatenate the arrays along the row axis.

In [47]:
np.concatenate([arr1, arr2], axis=0)

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

Here we concatenate the arrays along the column axis.

In [48]:
np.concatenate([arr1, arr2], axis=1)

array([[ 1,  2,  3,  7,  8,  9],
       [ 4,  5,  6, 10, 11, 12]])

We can split an array into multiple arrays along an axis. When an integer is given, the array is divided into that many equal parts; if such a split is not possible, an error is raised.

In [49]:
arr1 = np.random.randn(6, 2)

In [50]:
arr1

array([[ 1.22152001, -1.39956223],
       [-0.56892279, -2.2945763 ],
       [-0.21400151, -0.91838446],
       [ 0.39123937, -0.28990437],
       [ 1.73674895, -0.42479259],
       [-0.9456671 ,  0.10861245]])

Here we split arr1 into two equal parts along the row axis.

In [51]:
first, sec = np.split(arr1, 2, axis=0)

In [52]:
first

array([[ 1.22152001, -1.39956223],
       [-0.56892279, -2.2945763 ],
       [-0.21400151, -0.91838446]])

In [53]:
sec

array([[ 0.39123937, -0.28990437],
       [ 1.73674895, -0.42479259],
       [-0.9456671 ,  0.10861245]])

We can also define where along the axis the array is split. For example, ```[1,3]``` results in ```arr1[:1]```, ```arr1[1:3]``` and ```arr1[3:]```.

In [54]:
first, sec, third = np.split(arr1, [1,3], axis=0)

In [55]:
first

array([[ 1.22152001, -1.39956223]])

In [56]:
sec

array([[-0.56892279, -2.2945763 ],
       [-0.21400151, -0.91838446]])

In [57]:
third

array([[ 0.39123937, -0.28990437],
       [ 1.73674895, -0.42479259],
       [-0.9456671 ,  0.10861245]])
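The two forms of split, and the error case mentioned earlier, can be sketched on a fixed array. The array `rows` is made up for this example. `np.array_split`, which the lab does not otherwise use, is included as one option when unequal parts are acceptable.

```python
import numpy as np

rows = np.arange(12).reshape(6, 2)   # 6 rows, 2 columns

# An integer asks for that many equal parts
top, bottom = np.split(rows, 2, axis=0)
print(top.shape, bottom.shape)

# A list gives the row positions at which to cut
a, b, c = np.split(rows, [1, 3], axis=0)
print(a.shape, b.shape, c.shape)

# Joining the pieces in order restores the original array
print(np.array_equal(np.concatenate([a, b, c], axis=0), rows))

# 6 rows cannot be divided into 4 equal parts, so split raises an error
try:
    np.split(rows, 4, axis=0)
except ValueError as err:
    print("cannot split:", err)

# array_split tolerates unequal parts
parts = np.array_split(rows, 4, axis=0)
print([p.shape[0] for p in parts])
```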

### 1.0.3 Pandas

Pandas provides high-performance, easy to use data structures and data analysis tools for the Python programming language.

#### Series
A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index.

First we need to import the Pandas library.

In [58]:
import pandas as pd

The simplest Series is formed from only a sequence of data (list). Notice the indexes are automatically generated and each element is associated with an index.

In [59]:
data1 = [2, 5, 4, 8, 0, 1]
sr1 = pd.Series(data1)

The values attribute returns the content of the Series as a NumPy array, without the index.

In [60]:
sr1.values

array([2, 5, 4, 8, 0, 1])

We access the elements in a Series using index notation.

In [61]:
sr1[2]

4

We can specify our own index labels for each element.

In [62]:
sr2 = pd.Series([4, 7, 1, -2], index=['b', 'd', 'f', 'c'])

In [63]:
sr2['d']

7

We can perform element-wise arithmetic operations such as addition, subtraction, multiplication and division on a Series, and apply NumPy functions such as exp to it. These operations preserve the index-value link.

In [64]:
sr2 * 2

b     8
d    14
f     2
c    -4
dtype: int64

In [65]:
np.exp(sr2)

b      54.598150
d    1096.633158
f       2.718282
c       0.135335
dtype: float64

#### DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, etc.). Here we build one from a dictionary of equal-length lists. The resulting DataFrame will have its index assigned automatically.

In [66]:
data = {'university': ['USM', 'UM', 'UPM', 'UKM', 'UTM', 'UiTM', 'UniMAP', 'UTHM'], 
       'year': [1969, 1905, 1931, 1970, 1972, 1956, 2001, 1993],
       'location': ['Pulau Pinang', 'Kuala Lumpur', 'Serdang, Selangor', 'Bangi, Selangor', 'Skudai, Johor', 'Shah Alam, Selangor', 'Arau, Perlis', 'Batu Pahat, Johor']}

In [67]:
fm1 = pd.DataFrame(data)

The head method selects only the first five rows by default.

In [68]:
fm1.head()

| | university | year | location |
|---|---|---|---|
| 0 | USM | 1969 | Pulau Pinang |
| 1 | UM | 1905 | Kuala Lumpur |
| 2 | UPM | 1931 | Serdang, Selangor |
| 3 | UKM | 1970 | Bangi, Selangor |
| 4 | UTM | 1972 | Skudai, Johor |


A column can be retrieved by specifying the column name.

In [69]:
fm1['university']

0       USM
1        UM
2       UPM
3       UKM
4       UTM
5      UiTM
6    UniMAP
7      UTHM
Name: university, dtype: object

In [70]:
fm1['year']

0    1969
1    1905
2    1931
3    1970
4    1972
5    1956
6    2001
7    1993
Name: year, dtype: int64

A row can be retrieved by using a special ```loc``` attribute with the index. Here we are retrieving a row with index 3.

In [71]:
fm1.loc[3]

university                UKM
year                     1970
location      Bangi, Selangor
Name: 3, dtype: object

We can add a row to a DataFrame by assigning to a new index label with ```loc```.

In [72]:
fm1.loc[8] = {'year': 2000, 'university': 'UTeM', 'location': 'Durian Tunggal, Melaka'}

We can drop a row from a DataFrame by specifying the index. The drop method returns a new DataFrame and leaves fm1 itself unchanged.

In [73]:
fm1.drop(7)

| | university | year | location |
|---|---|---|---|
| 0 | USM | 1969 | Pulau Pinang |
| 1 | UM | 1905 | Kuala Lumpur |
| 2 | UPM | 1931 | Serdang, Selangor |
| 3 | UKM | 1970 | Bangi, Selangor |
| 4 | UTM | 1972 | Skudai, Johor |
| 5 | UiTM | 1956 | Shah Alam, Selangor |
| 6 | UniMAP | 2001 | Arau, Perlis |
| 8 | UTeM | 2000 | Durian Tunggal, Melaka |
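Note that the index in the output above jumps from 6 to 8, because dropping a row does not renumber the remaining ones. The sketch below uses a cut-down, hypothetical version of the table to show that drop leaves the original DataFrame alone, and how the result might be kept and renumbered with `reset_index`.

```python
import pandas as pd

frame = pd.DataFrame({'university': ['USM', 'UM', 'UPM'],
                      'year': [1969, 1905, 1931]})

# drop returns a new DataFrame; frame itself still has three rows
dropped = frame.drop(1)
print(frame.shape)
print(dropped.index.tolist())

# To keep the change, assign the result back.
# reset_index(drop=True) renumbers the rows from 0 again.
kept = frame.drop(1).reset_index(drop=True)
print(kept.index.tolist())
print(kept['university'].tolist())
```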


We can concatenate two Series to form a DataFrame. With axis=1, each Series becomes a column.

In [74]:
s1 = pd.Series([82160, 4108], index = ['cost', 'duration'])
s2 = pd.Series([602, 30.1], index = ['cost', 'duration'])

In [75]:
s1

cost        82160
duration     4108
dtype: int64

In [76]:
s2

cost        602.0
duration     30.1
dtype: float64

In [77]:
fm2 = pd.concat([s1, s2], axis=1)
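The result of the concatenation is not displayed above, so here is a short sketch of what to expect. The `keys` argument and the column names 'total' and 'daily' are assumptions added for illustration; they are not part of the original lab.

```python
import pandas as pd

s1 = pd.Series([82160, 4108], index=['cost', 'duration'])
s2 = pd.Series([602, 30.1], index=['cost', 'duration'])

# axis=1 places the Series side by side; columns are numbered 0, 1 by default
combined = pd.concat([s1, s2], axis=1)
print(combined)

# keys gives the new columns readable names (hypothetical names here)
named = pd.concat([s1, s2], axis=1, keys=['total', 'daily'])
print(named.loc['cost', 'daily'])

# axis=0 would instead stack the values into one longer Series
stacked = pd.concat([s1, s2], axis=0)
print(len(stacked))
```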

#### File input and output using Pandas
We can save and load data to and from disk in text format. Here, we save the content of fm1 to a CSV file using the ```to_csv``` method. 
The first argument is the file name. The sep argument specifies the separator placed between values. The header argument controls whether the column names are written out. The index argument controls whether the row index is written.

In [78]:
fm1.to_csv('uni_in_Malaysia.csv', sep=';', header=True, index=False)

To load a CSV file, we use the ```read_csv``` function. Here, we read a file named diabetes.csv.

In [79]:
fm2 = pd.read_csv('diabetes.csv', sep=',')

In [80]:
fm2.head()

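| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |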
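The separator used when reading must match the one used when writing. The sketch below shows a full round trip without touching the disk by writing to an in-memory `io.StringIO` buffer instead of a file name, which is an assumption made only to keep the example self-contained. The small `frame` is a stand-in for fm1.

```python
import io
import pandas as pd

frame = pd.DataFrame({'university': ['USM', 'UM'],
                      'location': ['Pulau Pinang', 'Kuala Lumpur']})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer, sep=';', header=True, index=False)
text = buffer.getvalue()
print(text)

# Read it back with the same separator
loaded = pd.read_csv(io.StringIO(text), sep=';')
print(loaded)
```

With `index=False` the row index is not written, so `read_csv` assigns a fresh index starting from 0.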


#### Summarizing and Computing Descriptive Statistics
We can calculate a specific statistical measure such as the mean or the variance. Here axis=0 computes the statistic for each column; axis=1 would compute it across each row instead.

In [81]:
fm2.mean(axis=0)

Pregnancies                   3.845052
Glucose                     120.894531
BloodPressure                69.105469
SkinThickness                20.536458
Insulin                      79.799479
BMI                          31.992578
DiabetesPedigreeFunction      0.471876
Age                          33.240885
Outcome                       0.348958
dtype: float64

In [82]:
fm2.var(axis=0)

Pregnancies                    11.354056
Glucose                      1022.248314
BloodPressure                 374.647271
SkinThickness                 254.473245
Insulin                     13281.180078
BMI                            62.159984
DiabetesPedigreeFunction        0.109779
Age                           138.303046
Outcome                         0.227483
dtype: float64

The ```describe``` method produces multiple summary statistics of the DataFrame.

In [83]:
fm2.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


We compute the correlation between the attributes (columns) using the ```corr``` method.

In [84]:
corr_matrix = fm2.corr()

In [85]:
corr_matrix

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| Pregnancies | 1.0 | 0.129459 | 0.141282 | -0.081672 | -0.073535 | 0.017683 | -0.033523 | 0.544341 | 0.221898 |
| Glucose | 0.129459 | 1.0 | 0.15259 | 0.057328 | 0.331357 | 0.221071 | 0.137337 | 0.263514 | 0.466581 |
| BloodPressure | 0.141282 | 0.15259 | 1.0 | 0.207371 | 0.088933 | 0.281805 | 0.041265 | 0.239528 | 0.065068 |
| SkinThickness | -0.081672 | 0.057328 | 0.207371 | 1.0 | 0.436783 | 0.392573 | 0.183928 | -0.11397 | 0.074752 |
| Insulin | -0.073535 | 0.331357 | 0.088933 | 0.436783 | 1.0 | 0.197859 | 0.185071 | -0.042163 | 0.130548 |
| BMI | 0.017683 | 0.221071 | 0.281805 | 0.392573 | 0.197859 | 1.0 | 0.140647 | 0.036242 | 0.292695 |
| DiabetesPedigreeFunction | -0.033523 | 0.137337 | 0.041265 | 0.183928 | 0.185071 | 0.140647 | 1.0 | 0.033561 | 0.173844 |
| Age | 0.544341 | 0.263514 | 0.239528 | -0.11397 | -0.042163 | 0.036242 | 0.033561 | 1.0 | 0.238356 |
| Outcome | 0.221898 | 0.466581 | 0.065068 | 0.074752 | 0.130548 | 0.292695 | 0.173844 | 0.238356 | 1.0 |


We can calculate the covariance between the columns in the same way using the ```cov``` method.

In [86]:
cov_matrix = fm2.cov()

In [87]:
cov_matrix

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| Pregnancies | 11.354056 | 13.947131 | 9.214538 | -4.390041 | -28.555231 | 0.469774 | -0.037426 | 21.57062 | 0.356618 |
| Glucose | 13.947131 | 1022.248314 | 94.430956 | 29.239183 | 1220.935799 | 55.726987 | 1.454875 | 99.082805 | 7.115079 |
| BloodPressure | 9.214538 | 94.430956 | 374.647271 | 64.029396 | 198.378412 | 43.004695 | 0.264638 | 54.523453 | 0.600697 |
| SkinThickness | -4.390041 | 29.239183 | 64.029396 | 254.473245 | 802.979941 | 49.373869 | 0.972136 | -21.381023 | 0.568747 |
| Insulin | -28.555231 | 1220.935799 | 198.378412 | 802.979941 | 13281.180078 | 179.775172 | 7.066681 | -57.14329 | 7.175671 |
| BMI | 0.469774 | 55.726987 | 43.004695 | 49.373869 | 179.775172 | 62.159984 | 0.367405 | 3.36033 | 1.100638 |
| DiabetesPedigreeFunction | -0.037426 | 1.454875 | 0.264638 | 0.972136 | 7.066681 | 0.367405 | 0.109779 | 0.130772 | 0.027472 |
| Age | 21.57062 | 99.082805 | 54.523453 | -21.381023 | -57.14329 | 3.36033 | 0.130772 | 138.303046 | 1.336953 |
| Outcome | 0.356618 | 7.115079 | 0.600697 | 0.568747 | 7.175671 | 1.100638 | 0.027472 | 1.336953 | 0.227483 |
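Correlation and covariance are closely related: each correlation is the covariance divided by the product of the two columns' standard deviations, and the diagonal of the covariance matrix holds the variances (compare the diagonal above with the output of var). The sketch below checks both facts on a tiny, invented DataFrame rather than on diabetes.csv, so that it runs without the data file.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                      'y': [2.0, 4.0, 6.0, 8.0],    # y rises with x
                      'z': [4.0, 3.0, 2.0, 1.0]})   # z falls as x rises

cov = frame.cov()
corr = frame.corr()
std = frame.std()

# Dividing each covariance by the product of the two standard deviations
# reproduces the correlation matrix
rebuilt = cov / np.outer(std, std)
print(np.allclose(rebuilt, corr))

# The diagonal of the covariance matrix is the variance of each column
print(np.allclose(np.diag(cov), frame.var()))

# Perfectly linear relationships give correlations of +1 and -1
print(corr.loc['x', 'y'], corr.loc['x', 'z'])
```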
