# NLU+ 2021-2022: Lab 1
#### Authors: Christos Baziotis, Lexi Birch, Frank Keller
These exercises aim to support you in successfully completing your assignments. Here, we will focus on NumPy, which is a library for working with numerical data in Python and is what you will be using in your first assignment. It would be helpful to first read the official NumPy quickstart [guide](https://numpy.org/doc/stable/user/absolute_beginners.html). While here we will review some operations that will help you with your assignment, you should use the official guide as a reference.



In [1]:
import numpy as np
np.set_printoptions(suppress=True)  # suppresses the use of scientific notation for small numbers

# you may use this function to print a numpy array and its properties
def print_array(arr):
    print(arr)
    print("shape:", arr.shape)
    print("type:", arr.dtype.type)
    print()

# Load the data
We will be working with the Wine Data Set from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Wine). It contains the results of a chemical analysis of 178 wines. The wines are categorized into 3 classes and described by 13 attributes. All attributes are continuous.

The dataset is stored in the `wine.csv` file. The first row contains the column names and the rest of the rows the corresponding values. Open the file and check its structure. The columns in the data are as follows:

 1. *Type*: The type of wine, one of three classes: 1 (59 obs), 2 (71 obs), and 3 (48 obs).
 2. Alcohol
 3. Malic acid
 4. Ash
 5. Alcalinity of ash
 6. Magnesium
 7. Total phenols
 8. Flavanoids
 9. Nonflavanoid phenols
 10. Proanthocyanins
 11. Color intensity
 12. Hue
 13. D280/OD315 of diluted wines
 14. Proline

**NOTE**: As you can see, the first attribute is the *class* identifier (1-3)



First, we naively read all the data into a regular 2D Python list (i.e., list of lists), named `data`.
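As an aside, the same parsing could be sketched with the standard-library `csv` module. To keep the sketch runnable without the file, it reads from an in-memory `io.StringIO` buffer (an assumption for illustration) that holds only the first four columns of the header and of the first data row of `wine.csv`.

```python
import csv
import io

# In-memory stand-in for wine.csv (assumption): only the first four
# columns of the header row and of the first data row.
sample = io.StringIO(
    "Wine,Alcohol,Malic.acid,Ash\n"
    "1,14.23,1.71,2.43\n"
)

# csv.reader splits each line on commas and yields a list of strings.
rows = list(csv.reader(sample))
print(rows)
```

Note that every value, including the numbers, is still a string at this point.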

In [2]:
# solution
data = []
with open("wine.csv") as f:
    for line in f:
        row = line.strip().split(",")
        data.append(row)
        
print(data)

[['Wine', 'Alcohol', 'Malic.acid', 'Ash', 'Acl', 'Mg', 'Phenols', 'Flavanoids', 'Nonflavanoid.phenols', 'Proanth', 'Color.int', 'Hue', 'OD', 'Proline'], ['1', '14.23', '1.71', '2.43', '15.6', '127', '2.8', '3.06', '.28', '2.29', '5.64', '1.04', '3.92', '1065'], ['1', '13.2', '1.78', '2.14', '11.2', '100', '2.65', '2.76', '.26', '1.28', '4.38', '1.05', '3.4', '1050'], ['1', '13.16', '2.36', '2.67', '18.6', '101', '2.8', '3.24', '.3', '2.81', '5.68', '1.03', '3.17', '1185'], ['1', '14.37', '1.95', '2.5', '16.8', '113', '3.85', '3.49', '.24', '2.18', '7.8', '.86', '3.45', '1480'], ['1', '13.24', '2.59', '2.87', '21', '118', '2.8', '2.69', '.39', '1.82', '4.32', '1.04', '2.93', '735'], ['1', '14.2', '1.76', '2.45', '15.2', '112', '3.27', '3.39', '.34', '1.97', '6.75', '1.05', '2.85', '1450'], ['1', '14.39', '1.87', '2.45', '14.6', '96', '2.5', '2.52', '.3', '1.98', '5.25', '1.02', '3.58', '1290'], ['1', '14.06', '2.15', '2.61', '17.6', '121', '2.6', '2.51', '.31', '1.25', '5.05', '1.06', '

### 1. Initialize a Numpy Array
Create a numpy array (named `data`) out of the Python list and check its shape and data type.
What is the data type of the numpy array and why? How do numpy arrays differ from regular Python lists?
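A small sketch that may help with these questions. The array `toy` is a hypothetical miniature of the parsed file and is not part of the lab data. A numpy array stores all of its elements with a single data type, so a list that contains any strings becomes an array of strings.

```python
import numpy as np

# Hypothetical miniature of the parsed CSV: every value is a string.
toy = np.array([["Wine", "Alcohol"], ["1", "14.23"], ["1", "13.2"]])
print(toy.shape)       # (3, 2)
print(toy.dtype.kind)  # 'U' means unicode string

# A Python list may mix types; a numpy array picks one common type.
mixed = np.array([1, "a"])
print(mixed.dtype.kind)  # 'U': the integer was converted to a string

# With a numeric dtype, arithmetic applies to all elements at once.
numbers = np.array([1, 2, 3])
print(numbers * 2)       # [2 4 6]
```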

In [3]:
data = np.array(data)

In [4]:
# let's see what is in the array
print_array(data)

[['Wine' 'Alcohol' 'Malic.acid' ... 'Hue' 'OD' 'Proline']
 ['1' '14.23' '1.71' ... '1.04' '3.92' '1065']
 ['1' '13.2' '1.78' ... '1.05' '3.4' '1050']
 ...
 ['3' '13.27' '4.28' ... '.59' '1.56' '835']
 ['3' '13.17' '2.59' ... '.6' '1.62' '840']
 ['3' '14.13' '4.1' ... '.61' '1.6' '560']]
shape: (179, 14)
type: <class 'numpy.str_'>



### 2. Array Indexing and Slicing
Now we need to split the data into separate numpy arrays, in order to disentangle the attribute names, attribute values and class labels. To do this you need to use numpy array slicing.
Do the following:
 1. Store the 13 attribute names into a 1D numpy array, called `names`. This means you should ignore the first column (i.e., Wine type).
 2. Store the class labels (i.e., Wine type) into a 1D numpy array, called `classes`.
 3. Store the attribute values into a 2D numpy array, called `attributes`.
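The slicing patterns involved can be sketched on a small made-up array. The names `toy`, `header`, `labels` and `values` are hypothetical. The index before the comma selects rows and the index after it selects columns.

```python
import numpy as np

# Made-up array with the same layout as the data: a header row,
# then a label in the first column followed by values.
toy = np.array([
    ["label", "a", "b"],
    ["1", "0.5", "0.7"],
    ["2", "0.1", "0.9"],
])

header = toy[0, 1:]   # first row, skipping the first column
labels = toy[1:, 0]   # first column, skipping the first row
values = toy[1:, 1:]  # everything except the first row and column

print(header.shape, labels.shape, values.shape)  # (2,) (2,) (2, 2)

# Negative indices count from the end: second to last row,
# without its last element.
print(toy[-2, :-1])   # ['1' '0.5']
```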

In [5]:
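names = data[0, 1:]        # header row without the class column
classes = data[1:, 0]      # first column without the header
attributes = data[1:, 1:]  # all remaining rows and columns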
        

In [6]:
print("names")
print_array(names)

print("classes")
print_array(classes)

print("attributes")
print_array(attributes)

assert names.shape == (13,)
assert classes.shape == (178,)
assert attributes.shape == (178, 13)

names
['Alcohol' 'Malic.acid' 'Ash' 'Acl' 'Mg' 'Phenols' 'Flavanoids'
 'Nonflavanoid.phenols' 'Proanth' 'Color.int' 'Hue' 'OD' 'Proline']
shape: (13,)
type: <class 'numpy.str_'>

classes
['1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
 '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
 '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
 '1' '1' '1' '1' '1' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2'
 '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2'
 '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2'
 '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2' '2'
 '2' '2' '2' '2' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3'
 '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3'
 '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3' '3']
shape: (178,)
type: <class 'numpy.str_'>

attributes
[['14.23' '1.71' '2.43' ... '1.04' '3

4\. Using the `attributes` array, print the second to last row, without its last 3 elements.

The expected output is
`['13.17' '2.59' '2.37' '20' '120' '1.65' '.68' '.53' '1.46' '9.3']`

In [7]:
# index -2 selects the second to last row; :-3 drops its last 3 elements
print(attributes[-2, :-3])

['13.17' '2.59' '2.37' '20' '120' '1.65' '.68' '.53' '1.46' '9.3']


### 3. Data Type Casting
Cast each numpy array to the appropriate data type. We need to represent numerical values with the appropriate data type to be able to do numerical operations.
1. The `attributes` array contains continuous values, therefore it needs to be converted to `float`.
2. The `classes` array contains categorical values, so you should convert it to `int`.

(The `names` array already contains string values as it should. You don't need to change it.)
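A minimal sketch of casting with `astype`, using a few made-up string arrays (`text_values` and `text_labels` are hypothetical names). Note that `astype` returns a new array instead of modifying the original.

```python
import numpy as np

# Strings such as ".28" (no leading zero) parse fine as floats.
text_values = np.array(["14.23", ".28", "1065"])
as_float = text_values.astype(float)
print(as_float.dtype.kind)  # 'f' means floating point

text_labels = np.array(["1", "2", "3"])
as_int = text_labels.astype(int)
print(as_int)               # [1 2 3]

# A string holding a decimal number cannot be cast to int directly.
try:
    np.array(["14.23"]).astype(int)
except ValueError as err:
    print("cannot cast directly:", err)
```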

In [8]:
attributes = attributes.astype(float)
classes = classes.astype(int)

In [9]:
print("classes")
print_array(classes)

print("attributes")
print_array(attributes)

classes
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3]
shape: (178,)
type: <class 'numpy.int64'>

attributes
[[  14.23    1.71    2.43 ...    1.04    3.92 1065.  ]
 [  13.2     1.78    2.14 ...    1.05    3.4  1050.  ]
 [  13.16    2.36    2.67 ...    1.03    3.17 1185.  ]
 ...
 [  13.27    4.28    2.26 ...    0.59    1.56  835.  ]
 [  13.17    2.59    2.37 ...    0.6     1.62  840.  ]
 [  14.13    4.1     2.74 ...    0.61    1.6   560.  ]]
shape: (178, 13)
type: <class 'numpy.float64'>




### 4. Array Axis and Statistics
We often need to compute statistics using aggregating methods. A common pitfall, however, is computing these statistics along the wrong axis.

Using the `attributes` numpy array, do the following:

1. Compute the sum of all values.
2. Compute the average value of each column (i.e., feature).
3. Compute the maximum of the row minimums.
4. Compute the average of the column maximums.

**Hint:** The output for questions `4.3`, `4.4`, will be a *scalar*. To make sure you are aggregating over the correct values, check the shape of the intermediate resulting array first.
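The effect of the `axis` argument can be sketched on a small made-up array (`toy` is hypothetical). Aggregating over `axis=0` collapses the rows and leaves one value per column. Aggregating over `axis=1` collapses the columns and leaves one value per row.

```python
import numpy as np

toy = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])   # shape (2, 3)

col_means = toy.mean(axis=0)        # one value per column
print(col_means, col_means.shape)   # [2.5 3.5 4.5] (3,)

row_mins = toy.min(axis=1)          # one value per row
print(row_mins, row_mins.shape)     # [1. 4.] (2,)

max_of_row_mins = row_mins.max()    # scalar
print(max_of_row_mins)              # 4.0

mean_of_col_maxs = toy.max(axis=0).mean()  # mean of [4. 5. 6.]
print(mean_of_col_maxs)             # 5.0
```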


In [10]:
# Solution for 4.1
total = np.sum(attributes)  # named `total` to avoid shadowing the built-in `sum`
print(total)

159975.295999


In [11]:
# Solution for 4.2
ave = np.average(attributes, axis=0)
print(ave)

[ 13.00061798   2.33634831   2.36651685  19.49494382  99.74157303
   2.29511236   2.02926966   0.36185393   1.59089888   5.05808988
   0.95744944   2.61168539 746.89325843]


In [12]:
# Solution for 4.3
row_min = attributes.min(axis=1)
max_of_row_min = row_min.max()  # avoids shadowing the built-in `max`
print(max_of_row_min)

0.66


In [13]:
# Solution for 4.4
col_max = attributes.max(axis = 0)
ave = np.average(col_max)
print(ave)

148.29


### 5. Array Transpose
When transposing a `2x3` array, we get a `3x2` array. Matrix transpose has a lot of significance in linear algebra and you will rely on it many times in your assignments.

Using the transpose of the original `attributes` array:

1. Compute the same statistic as in question 4.2.
2. Compute the same statistic as in question 4.3.
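A short sketch of how the transpose swaps the roles of the axes, on a made-up array (`toy` is hypothetical). A per-column statistic of the original array becomes a per-row statistic of its transpose.

```python
import numpy as np

toy = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])  # shape (2, 3)
toy_t = toy.T                      # shape (3, 2)
print(toy_t.shape)

# Column means of `toy` equal row means of `toy_t`.
print(toy.mean(axis=0))            # [2.5 3.5 4.5]
print(toy_t.mean(axis=1))          # [2.5 3.5 4.5]

# Row minimums of `toy` equal column minimums of `toy_t`.
print(toy.min(axis=1))             # [1. 4.]
print(toy_t.min(axis=0))           # [1. 4.]
```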


In [14]:
# Solution for 5.1
attr_transpose = np.transpose(attributes)
# after transposing, each feature is a row, so average over axis 1
ave = np.average(attr_transpose, axis=1)
print(ave)

[ 13.00061798   2.33634831   2.36651685  19.49494382  99.74157303
   2.29511236   2.02926966   0.36185393   1.59089888   5.05808988
   0.95744944   2.61168539 746.89325843]

In [15]:
# Solution for 5.2
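# rows of the original array are columns of the transpose, so reduce over axis 0
row_min = attr_transpose.min(axis=0)
max_of_row_min = row_min.max()
print(max_of_row_min)

0.66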


### 6. Sorting and Indexing
Sort the `names` array alphabetically, and then apply the same ordering to the **columns** of the `attributes` array, in order to preserve the correspondence between them.

Hint: be careful when applying the sorting of `names` to `attributes` and think about the role of each axis.
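The idea can be sketched on a made-up example (`labels` and `table` are hypothetical). `np.argsort` returns the indices that would sort an array, and the same indices can then reorder the columns of a second array. The indices must be computed from the unsorted labels, because the argsort of an already sorted array is just `[0, 1, 2, ...]`.

```python
import numpy as np

labels = np.array(["b", "c", "a"])
table = np.array([[10, 20, 30],
                  [11, 21, 31]])   # column j belongs to labels[j]

order = np.argsort(labels)         # indices that sort the labels
print(order)                       # [2 0 1]

sorted_labels = labels[order]
sorted_table = table[:, order]     # reorder the columns (axis 1)

print(sorted_labels)               # ['a' 'b' 'c']
print(sorted_table)
```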



In [16]:
# compute the ordering before sorting `names`: the argsort of an
# already sorted array is the identity and would reorder nothing
ind = np.argsort(names)
names = names[ind]
print(names)

# apply the same ordering to the columns (axis 1) of `attributes`
attributes_sorted = attributes[:, ind]
print(attributes_sorted[0])

['Acl' 'Alcohol' 'Ash' 'Color.int' 'Flavanoids' 'Hue' 'Malic.acid' 'Mg'
 'Nonflavanoid.phenols' 'OD' 'Phenols' 'Proanth' 'Proline']
[  15.6    14.23    2.43    5.64    3.06    1.04    1.71  127.      0.28
    3.92    2.8     2.29 1065.  ]


### 7. Data Standardization
Standardization (not to be confused with normalization) is a preprocessing step that is commonly used with many machine learning models. It rescales every feature to have zero mean and unit variance. It does not change the shape of a feature's distribution, so it does not by itself make the features normally distributed.

To do this, you need to transform the data as follows: 
1. Remove the mean value of each feature (i.e., centering).
2. Divide the features by their standard deviation (i.e., rescaling).

Save the standardized version of the `attributes` array to `attributes_norm`.
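The two steps can be sketched on a small made-up array (`toy` is hypothetical), in which the second feature is on a much larger scale than the first. The per-column mean and standard deviation have shape `(2,)` and are broadcast across the rows of the `(3, 2)` array.

```python
import numpy as np

toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])

centered = toy - toy.mean(axis=0)          # step 1: centering
standardized = centered / toy.std(axis=0)  # step 2: rescaling

# Each column now has (numerically) zero mean and unit standard deviation.
print(np.allclose(standardized.mean(axis=0), 0.0))
print(np.allclose(standardized.std(axis=0), 1.0))

# Both features end up on the same scale.
print(np.allclose(standardized[:, 0], standardized[:, 1]))
```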

In [33]:
attributes_norm = attributes - np.average(attributes, axis = 0)
print(attributes_norm)
print(attributes_norm.shape)
attributes_norm /= np.std(attributes_norm, axis = 0)
print(attributes_norm)
print(attributes_norm.shape)

[[   1.22938202   -0.62634831    0.06348315 ...    0.08255056
     1.30831461  318.10674157]
 [   0.19938202   -0.55634831   -0.22651685 ...    0.09255056
     0.78831461  303.10674157]
 [   0.15938202    0.02365169    0.30348315 ...    0.07255056
     0.55831461  438.10674157]
 ...
 [   0.26938202    1.94365169   -0.10651685 ...   -0.36744944
    -1.05168539   88.10674157]
 [   0.16938202    0.25365169    0.00348315 ...   -0.35744944
    -0.99168539   93.10674157]
 [   1.12938202    1.76365169    0.37348315 ...   -0.34744944
    -1.01168539 -186.89325843]]
(178, 13)
[[ 1.51861254 -0.5622498   0.23205254 ...  0.36217728  1.84791957
   1.01300893]
 [ 0.24628963 -0.49941338 -0.82799632 ...  0.40605066  1.1134493
   0.96524152]
 [ 0.19687903  0.02123125  1.10933436 ...  0.31830389  0.78858745
   1.39514818]
 ...
 [ 0.33275817  1.74474449 -0.38935541 ... -1.61212515 -1.48544548
   0.28057537]
 [ 0.20923168  0.22769377  0.01273209 ... -1.56825176 -1.40069891
   0.29649784]
 [ 1.39508604  1.

### 8. Matrix Operations
We are going to work with the following slices of the `attributes` array (see the cell below). You will compute some simple operations without using NumPy's builtin methods, but you may use them to check that your solution is correct.

In [19]:
slice1 = attributes[6:10]
slice2 = attributes[76:80]

print_array(slice1)
print_array(slice2)

[[  14.39    1.87    2.45   14.6    96.      2.5     2.52    0.3     1.98
     5.25    1.02    3.58 1290.  ]
 [  14.06    2.15    2.61   17.6   121.      2.6     2.51    0.31    1.25
     5.05    1.06    3.58 1295.  ]
 [  14.83    1.64    2.17   14.     97.      2.8     2.98    0.29    1.98
     5.2     1.08    2.85 1045.  ]
 [  13.86    1.35    2.27   16.     98.      2.98    3.15    0.22    1.85
     7.22    1.01    3.55 1045.  ]]
shape: (4, 13)
type: <class 'numpy.float64'>

[[ 13.03   0.9    1.71  16.    86.     1.95   2.03   0.24   1.46   4.6
    1.19   2.48 392.  ]
 [ 11.84   2.89   2.23  18.   112.     1.72   1.32   0.43   0.95   2.65
    0.96   2.52 500.  ]
 [ 12.33   0.99   1.95  14.8  136.     1.9    1.85   0.35   2.76   3.4
    1.06   2.31 750.  ]
 [ 12.7    3.87   2.4   23.   101.     2.83   2.55   0.43   1.95   2.57
    1.19   3.13 463.  ]]
shape: (4, 13)
type: <class 'numpy.float64'>



**1\.** Compute the dot product between each vector (i.e., row) of `slice1` and the corresponding vector of `slice2`. This means the 1st vector of `slice1` with the 1st vector of `slice2`, the 2nd vector of `slice1` with the 2nd vector of `slice2`, etc. Use numpy, but avoid using `np.dot` or for loops. Think about the definition of the dot product.

The expected output is `[514410.1698, 661579.8319, 797379.7166, 494338.7313]`
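The definition of the dot product, the sum of the element-wise products, can be sketched on two made-up arrays (`a` and `b` are hypothetical). The `np.einsum` call is included only as an independent check.

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
b = np.array([[1.0, 0.0, 1.0],
              [2.0, 2.0, 2.0]])

# Multiply element-wise, then sum along each row (axis 1).
rowwise_dot = np.sum(a * b, axis=1)
print(rowwise_dot)   # [ 4. 30.]

# Independent check: einsum with the same row-wise contraction.
check = np.einsum("ij,ij->i", a, b)
print(np.allclose(rowwise_dot, check))
```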

In [24]:
# the dot product is the sum of the element-wise products,
# so multiply the slices and sum along each row (no loop, no np.dot)
dot = np.sum(slice1 * slice2, axis=1)
print(dot)

[514410.1698 661579.8319 797379.7166 494338.7313]


**2\.** Compute the cosine similarity between the 7th and 77th rows (use 0-based indexing) of the `attributes` array using the dot product.

(0-based indexing means you should use the vectors `attributes[7]` and `attributes[77]`.)



In [26]:
cos_sim = np.dot(attributes[7], attributes[77])/(np.linalg.norm(attributes[7])*np.linalg.norm(attributes[77]))
print(cos_sim)

0.9916060988277015


Cosine similarity is used a lot in machine learning and has many nice properties compared to other similarity metrics. One of them is that its values are in the `[-1, 1]` range regardless of the properties of the vector space (e.g., dimensionality).
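A minimal sketch of these properties, using a hypothetical helper `cosine_similarity` and made-up vectors. Because the dot product is divided by both vector lengths, rescaling a vector does not change the result, and the extreme values are reached for vectors that point in the same or in opposite directions.

```python
import numpy as np

def cosine_similarity(u, v):
    # dot product divided by the product of the Euclidean norms
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])

same_direction = cosine_similarity(u, 10 * u)   # approximately 1
opposite = cosine_similarity(u, -u)             # approximately -1
orthogonal = cosine_similarity(np.array([1.0, 0.0]),
                               np.array([0.0, 1.0]))  # 0

print(same_direction, opposite, orthogonal)
```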

**3\.** Compute the cosine similarity between the 7th and 77th rows (use 0-based indexing) of the `attributes_norm` array. How can you explain the difference in their cosine similarities? Why did these vectors look very similar before standardization, but dissimilar afterwards?

You can use the function below:

In [27]:
from scipy import spatial
cos_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)

In [34]:
cos_sim = cos_similarity(attributes[7], attributes[77])
print(cos_sim)
cos_sim_norm = cos_similarity(attributes_norm[7], attributes_norm[77])
print(cos_sim_norm)
# Before standardization all values are positive and the large-scale features
# (e.g., Proline) dominate, so all rows point in nearly the same direction and
# the cosine similarity is close to 1.
# Cosine similarity is sensitive to the mean of the elements: a large shared offset
# increases it. Centering and rescaling remove that offset and expose the differences.

0.9916060988277015
-0.28874806323497904
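The same effect can be reproduced on a small made-up dataset (`toy` is hypothetical and unrelated to the wine data). Its second feature is large and positive for every row, so the raw rows all point in almost the same direction. After standardization, rows that lie on opposite sides of the feature means point in roughly opposite directions.

```python
import numpy as np

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up data: the second feature dominates in scale.
toy = np.array([[1.0, 1000.0],
                [2.0, 1010.0],
                [3.0,  990.0],
                [4.0, 1005.0]])

toy_std = (toy - toy.mean(axis=0)) / toy.std(axis=0)

raw_sim = cosine_similarity(toy[0], toy[3])
std_sim = cosine_similarity(toy_std[0], toy_std[3])

print(raw_sim)   # very close to 1
print(std_sim)   # negative
```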
