# SCS2013 Exercise 14

**This exercise notebook introduces the NumPy and Pandas libraries for Python:**

- NumPy
- Pandas

Many other open-source Python packages are in wide use, including `pytest`, `Matplotlib`, `PyTorch`, and `Beautiful Soup`.






## NumPy 

[NumPy](https://numpy.org/) is a Python package providing fast, flexible, and expressive data structures designed to make working with arrays easy and efficient.

In [None]:
import numpy as np

In [None]:
my_arr = np.array([3.14, 4, 2, 3.5])
print(my_arr)

my_arr = np.array([1, 2, 3], dtype='float32')
print(my_arr)

In [None]:
# multi-dimensional array

my_arr = np.array([range(i, i+3) for i in [2,4,6]])
print(my_arr)

We can create arrays using the following functions:
- `np.zeros(shape)`: return an array of the given shape filled with zeros
- `np.ones(shape)`: return an array of the given shape filled with ones
- `np.full(shape, fill_value)`: return an array of the given shape filled with `fill_value`
- `np.arange(start, stop, step)`: return evenly spaced values in the half-open interval `[start, stop)`
- `np.linspace(start, stop, num)`: return `num` evenly spaced samples over the closed interval `[start, stop]`
- `np.random.rand(d0, d1, ...)`: return an array with dimensions `d0, d1, ...` of random values from a uniform distribution over `[0, 1)`
- `np.random.randn(d0, d1, ...)`: return an array with dimensions `d0, d1, ...` of random values from the standard normal distribution
- `np.random.randint(low, high, size)`: return an array of shape `size` with random integers from `[low, high)`
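The creation functions listed above can be sketched as follows. The variable names are chosen here for illustration only, and the random arrays will differ on every run.

```python
import numpy as np

zeros = np.zeros((2, 3))                      # 2x3 array of 0.0
ones = np.ones(4)                             # 1-D array of four 1.0 values
sevens = np.full((2, 2), 7)                   # 2x2 array filled with 7
steps = np.arange(0, 10, 2)                   # 0, 2, 4, 6, 8 (10 is excluded)
points = np.linspace(0, 1, 5)                 # 0.0, 0.25, 0.5, 0.75, 1.0 (1 is included)
uniform = np.random.rand(2, 3)                # dimensions are separate arguments
normal = np.random.randn(2, 3)                # standard normal samples
dice = np.random.randint(1, 7, size=(2, 3))   # integers from 1 to 6

for name, arr in [('zeros', zeros), ('ones', ones), ('sevens', sevens),
                  ('steps', steps), ('points', points), ('uniform', uniform),
                  ('normal', normal), ('dice', dice)]:
    print(name, arr.shape)
    print(arr)
```

Note that `np.random.rand` and `np.random.randn` take the dimensions as separate arguments, whereas the other functions take the shape as a single tuple.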

We can change the shape of an array with the `.reshape()` method: the total number of elements in the reshaped array must equal the number of elements in the original array.
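A minimal sketch of the element-count rule, using a hypothetical 12-element array: reshaping succeeds when the new dimensions multiply to 12 and raises `ValueError` otherwise. Passing `-1` for one dimension asks NumPy to infer it.

```python
import numpy as np

arr = np.arange(12)          # 12 elements
grid = arr.reshape(3, 4)     # works: 3 * 4 == 12
auto = arr.reshape(2, -1)    # -1 is inferred as 6

reshape_error = None
try:
    arr.reshape(5, 3)        # fails: 5 * 3 == 15, not 12
except ValueError as err:
    reshape_error = str(err)

print(grid.shape)
print(auto.shape)
print('reshape failed:', reshape_error)
```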

We can check the attributes of an `ndarray`: `arr.shape`, `arr.ndim`, `arr.itemsize`

In [None]:
# create an array with integers 
my_arr = np.arange(1,10)
print(my_arr)

my_arr_2 = np.arange(1,10).reshape(3,3)
print(my_arr_2)

In [None]:
print(f'1: Array shape: {my_arr.shape}')
print(f'2: Array dimensions: {my_arr.ndim}')
print(f'3: Length of each element of array in bytes: {my_arr.itemsize}')

print(f'1: Array shape: {my_arr_2.shape}')
print(f'2: Array dimensions: {my_arr_2.ndim}')
print(f'3: Length of each element of array in bytes: {my_arr_2.itemsize}')

In [None]:
my_arr = np.random.randint(20, size=6)
print(my_arr)

print(my_arr[0])
print(my_arr[2])
print(my_arr[-2])

In [None]:
# generate a 2x4 array of random integers between 0 and 19, inclusive:
my_arr = np.random.randint(20, size=(2,4))
print(my_arr)

print(my_arr[0,0])
print(my_arr[1,-1])

In [None]:
my_arr = np.random.randint(5, size=(5,4))
print(my_arr)

# pick the 1st and 2nd columns (the stop index 2 is excluded)
new_arr = my_arr[:, 0:2]
print(new_arr)

# pick every other column, starting from the 1st (indices 0 and 2)
new_arr = my_arr[:, ::2]
print(new_arr)

# reverse the rows of the array
new_arr = my_arr[::-1, :]
print(new_arr)
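
One detail worth keeping in mind when slicing: a basic slice of a NumPy array is a view that shares memory with the original, so writing to the slice also changes the original. The sketch below uses hypothetical names (`base`, `view`, `independent`) and shows `.copy()` as one way to get an independent array.

```python
import numpy as np

base = np.arange(12).reshape(3, 4)

view = base[:, 0:2]                  # a slice shares memory with base
view[0, 0] = 99                      # this also changes base[0, 0]

independent = base[:, 0:2].copy()    # an explicit copy owns its data
independent[0, 1] = -5               # base is left untouched

print(base)
print(np.shares_memory(base, view))
print(np.shares_memory(base, independent))
```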

In [None]:
# replace items that satisfy a condition with another value
print('Mask:', my_arr % 2 == 1)

my_arr[my_arr % 2 == 1] = -1
print(my_arr)
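
The mask assignment above modifies the array in place. If the original values should be kept, `np.where` is one alternative that returns a new array instead. This is a sketch with a small hypothetical array, not the only way to do it.

```python
import numpy as np

values = np.array([[1, 2, 3],
                   [4, 5, 6]])

# where the condition is True take -1, otherwise keep the original value
replaced = np.where(values % 2 == 1, -1, values)

print(values)     # unchanged
print(replaced)
```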

In [None]:
# masking: select the names whose first score is greater than 75
name_array = np.array(['Lee', 'Kim', 'Choi', 'Park'])
score_array = np.array([[100,50,60], [30,53,22], [90,95,88], [10,98,24]])

mask = score_array[:,0] > 75
print(mask)
print(name_array[mask])

NumPy provides vectorized functions that make computations on arrays very fast:
- `np.sum`, `np.prod`, `np.mean`, `np.std`, `np.min`, `np.max`, `np.argmax`, `np.argmin`, `np.median`, ... 
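
To give a rough feel for the speed claim, the sketch below sums the same numbers with a plain Python loop and with the array's `sum` method, timing both with `time.perf_counter`. The array size is an arbitrary choice and the measured times depend on the machine, so treat the timings as indicative only.

```python
import time
import numpy as np

numbers = np.arange(100_000, dtype=np.int64)

# plain Python loop over the values
start = time.perf_counter()
loop_total = 0
for n in numbers.tolist():
    loop_total += n
loop_seconds = time.perf_counter() - start

# vectorized NumPy sum
start = time.perf_counter()
numpy_total = int(numbers.sum())
numpy_seconds = time.perf_counter() - start

print(loop_total, numpy_total)
print(f'loop: {loop_seconds:.6f} s, numpy: {numpy_seconds:.6f} s')
```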

In [None]:
# computations on my_arr
print(my_arr)
print('=====')

# compute sum, product, and mean of all numbers in array
print(my_arr.sum())
print(my_arr.prod())
print(my_arr.mean())
print('=====')

# compute sum, product, and mean along axis 0 (down the rows: one result per column)
print(my_arr.sum(axis=0))
print(my_arr.prod(axis=0))
print(my_arr.mean(axis=0))
print('=====')

# compute sum, product, and mean along axis 1 (across the columns: one result per row)
print(my_arr.sum(axis=1))
print(my_arr.prod(axis=1))
print(my_arr.mean(axis=1))

In [None]:
# compute the max and min values and their positions (indices into the flattened array)
my_arr = np.random.randint(20, size=(4,3))
print(my_arr)

num_max = my_arr.max()
num_min = my_arr.min()
idx_max = my_arr.argmax()
idx_min = my_arr.argmin()

print(f'Max value: {num_max}, Max position(index): {idx_max}')
print(f'Min value: {num_min}, Min position(index): {idx_min}')
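
When called without an `axis` argument, `argmax()` and `argmin()` return a position in the flattened array rather than a (row, column) pair. The sketch below, using a small hypothetical array, converts that flat position back with `np.unravel_index` and also shows the per-column result obtained with `axis=0`.

```python
import numpy as np

grid = np.array([[3, 8, 1],
                 [9, 4, 7]])

flat_idx = grid.argmax()                            # position in the flattened array
row, col = np.unravel_index(flat_idx, grid.shape)   # convert to (row, column)
rows_of_col_max = grid.argmax(axis=0)               # row index of the max in each column

print(flat_idx)          # 3
print(row, col)          # 1 0
print(rows_of_col_max)   # [1 0 1]
```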

## Pandas

[Pandas](https://pandas.pydata.org/) is an open-source Python library for analyzing large and complex, "labeled" data.

In [None]:
import pandas as pd

In [None]:
# create a Series from a list of values and a list of index labels
my_ser = pd.Series([1,2,4,3], index=['Kim','Lee','Park','Choi'])
print(my_ser)

print('===')
print(my_ser.values)
print(my_ser.index)
print(my_ser['Kim'])

In [None]:
# create a Series from a dict: the keys become the index labels
my_dict = {'b': 10, 'a': 30, 'c': 20}
print(my_dict)

data = pd.Series(my_dict)
print(data)

In [None]:
# access a value by its label, then add a new labeled entry
print(data['a'])

data['e'] = 15
print(data)

In [None]:
ser1 = pd.Series([1,2,3,4], ['Kim','Lee','Park','Choi'])
ser2 = pd.Series([5.2,3.5,7.2,12.5], ['Kim','Lee','Yoo','Choi'])

print(ser1)
print('===')
print(ser2)
print('===')
print(ser1+ser2)
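
Adding two Series aligns them by label, so a label that appears in only one of them ('Park' and 'Yoo' here) produces `NaN` in the result. As a sketch of one way to handle this, `Series.add` with `fill_value=0` treats the missing side as 0; whether 0 is the right substitute depends on the data.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['Kim', 'Lee', 'Park', 'Choi'])
ser2 = pd.Series([5.2, 3.5, 7.2, 12.5], index=['Kim', 'Lee', 'Yoo', 'Choi'])

total = ser1 + ser2                      # 'Park' and 'Yoo' become NaN
filled = ser1.add(ser2, fill_value=0)    # a missing value is treated as 0

print(total)
print('===')
print(filled)
```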

A **DataFrame** is a tabular representation of data, with labeled rows and columns.

- `pd.DataFrame(data, index, columns)`: create a DataFrame from `data`, with optional row labels (`index`) and column labels (`columns`)
- Data stored in a `dict` can be converted directly into a DataFrame

A DataFrame exposes the following attributes:

- `DataFrame.index`: the row labels
- `DataFrame.columns`: the column labels
- `DataFrame.values`: the data as a NumPy array
- `DataFrame.size`: the total number of values
- `DataFrame.shape`: the number of rows and columns, as a tuple
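
As a minimal sketch of the `dict` conversion mentioned above: each key becomes a column label and each list becomes that column's values. The `scores` data and the row labels are made up for illustration.

```python
import pandas as pd

# hypothetical data: one list of values per column
scores = {'math': [90, 75, 60], 'art': [80, 85, 95]}
df = pd.DataFrame(scores, index=['Kim', 'Lee', 'Park'])

print(df)
print(df.index)     # row labels
print(df.columns)   # column labels
print(df.values)    # data as a NumPy array
print(df.size)      # 6
print(df.shape)     # (3, 2)
```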

In [None]:
arr = np.array([[1,100], [5,50], [10,10]])
label = ['a','b','c']
my_df = pd.DataFrame(data=arr, index=label, columns=['Type 1', 'Type 2'])

my_df

In [None]:
# check the DataFrame attributes
print(my_df.index)
print(my_df.columns)
print(my_df.values)
print(my_df.size)
print(my_df.shape)

In [None]:
my_df['Type 3'] = my_df['Type 1'] * my_df['Type 2']
my_df['Type 4'] = 'Hey!'

my_df

When working with data in a DataFrame, we can select a set of rows or columns:

**like a dictionary:**
- `DataFrame[column_label]`: select a column by its label

**by using index labels:**
- `DataFrame.loc[row_label, column_label]`: select data based on row labels and column labels
- `DataFrame.loc[row_label,:]` or `DataFrame.loc[row_label]` selects a row
- `DataFrame.loc[:, column_label]` selects a column

**by using index positions:**
- `DataFrame.iloc[row_index, column_index]`: select data based on the row position and column position
- `DataFrame.iloc[row_index,:]` or `DataFrame.iloc[row_index]` selects a row
- `DataFrame.iloc[:, column_index]` selects a column
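
One difference between the two indexers is easy to miss: a slice in `.loc` includes its end label, while a slice in `.iloc` excludes its stop position, as in ordinary Python slicing. A small sketch with a hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]}, index=['a', 'b', 'c'])

by_label = df.loc['a':'b', 'x']    # label slice includes 'b': rows a and b
by_position = df.iloc[0:1, 0]      # position slice excludes 1: row 0 only

cell_by_label = df.loc['b', 'y']     # single value by labels
cell_by_position = df.iloc[1, 1]     # the same value by positions

print(by_label)
print(by_position)
print(cell_by_label, cell_by_position)
```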

In [None]:
print(my_df)

# select column
print('\n=====Select a column like a dictionary:=====')
print(my_df['Type 2'])

In [None]:
print(my_df)

# select row, column by label
print('\n=====Select a row by using index labels=====')
print(my_df.loc['b'])

print('\n=====Select a column by using index labels=====')
print(my_df.loc[:, 'Type 2'])

In [None]:
print(my_df)
# select row, column by integer position
print('\n=====Select a row by using index positions=====')
print(my_df.iloc[1])

print('\n=====Select a column by using index positions=====')
print(my_df.iloc[:, 1])

In [None]:
# select several rows and columns at once by passing lists of labels
my_df.loc[['a','c'], ['Type 1', 'Type 4']]

In [None]:
# select rows with a boolean vector (mask)

print(my_df)

print('\n=====Select rows whose "Type 2" value is smaller than 100=====')
print(my_df[my_df['Type 2']<100])
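
Several conditions can be combined in one boolean selection. The sketch below rebuilds a small DataFrame with the same 'Type 1' and 'Type 2' values so that it runs on its own. Each condition needs its own parentheses, and the element-wise operators `&` (and) and `|` (or) are used instead of the Python keywords `and` and `or`.

```python
import pandas as pd

df = pd.DataFrame({'Type 1': [1, 5, 10], 'Type 2': [100, 50, 10]},
                  index=['a', 'b', 'c'])

# rows where both conditions hold
both = df[(df['Type 1'] > 1) & (df['Type 2'] < 100)]

# rows where at least one condition holds
either = df[(df['Type 1'] > 5) | (df['Type 2'] == 100)]

print(both)
print(either)
```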

## Exercises

### E-1:

Create a $5 \times 3$ array `my_arr` of integers in the range $[10, 99]$, starting at $10$, in which consecutive elements differ by $6$.

Result:
```
[[10 16 22]
 [28 34 40]
 [46 52 58]
 [64 70 76]
 [82 88 94]]
```

In [None]:
import numpy as np

# your code here:




### E-2:     

From the array `my_arr` created in E-1, return the even-numbered rows (2nd and 4th) and the odd-numbered columns (1st and 3rd), counting from 1.

Result:
```
[[28 40]
 [64 76]]
```

In [None]:
print('Input array:')
print(my_arr)
print(my_arr.shape)

print('\n')
print('Output array: pick even rows and odd columns:')

# your code here:




### E-3:     
Create a $3 \times 4$ array `my_arr` of random integers in the range $[10, 99]$, then compute the mean and standard deviation of the array along axis 0 and along axis 1.

Example result (your values will differ because the array is random):
```
===Array===
[[13 45 34 34]
 [44 51 38 60]
 [13 47 37 63]]
===Axis 0===
mean [23.33333333 47.66666667 36.33333333 52.33333333]
std [14.61354014  2.49443826  1.69967317 13.02134999]
===Axis 1===
mean [31.5  48.25 40.  ]
std [11.58663023  8.19679816 18.13835715]
```

In [None]:
# your code here: create an array


# your code here: compute mean and std along axis 0 (one value per column)


# your code here: compute mean and std along axis 1 (one value per row)




### E-4:    
Create a new array that normalizes the array `my_arr` from E-3 so that its values span exactly the range from 0 to 1.

- Normalization can be computed as `(array - minimum value) / (maximum value - minimum value)`
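
As a worked check of the formula against the example values shown in the expected result (minimum $13$, maximum $63$), the element $45$ maps to:

$$\frac{45 - 13}{63 - 13} = \frac{32}{50} = 0.64$$

The minimum itself maps to $0$ and the maximum to $1$, which is why the normalized values span exactly $[0, 1]$.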

Expected Result:
```
===Input array===
[[13 45 34 34]
 [44 51 38 60]
 [13 47 37 63]]
===Normalized array===
[[0.   0.64 0.42 0.42]
 [0.62 0.76 0.5  0.94]
 [0.   0.68 0.48 1.  ]]
```

In [None]:
print('===Input array===')
print(my_arr)

# your code here: get the maximum and minimum values



# your code here: normalize the array




### E-5:     
Create a DataFrame `my_df` from the ndarray `my_arr` created in E-3.

- Use 'a','b','c' as the row labels
- Use 'T1','T2','T3','T4' as the column labels

Expected Result:
```
===Input array===
[[13 45 34 34]
 [44 51 38 60]
 [13 47 37 63]]
===DataFrame:===
   T1  T2  T3  T4
a  13  45  34  34
b  44  51  38  60
c  13  47  37  63
```

In [None]:
import pandas as pd
print('===Input array===')
print(my_arr)

# your code here: create a dataframe


