# <center> CS 178: Machine Learning &amp; Data Mining </center>
## <center>  Discussion 01: 6 April 2023 </center>

---
## Part 1 : Setting up Your Conda Environment

### What is Conda?

[Conda](https://docs.conda.io/en/latest/#) is an open-source package and environment manager for Python. Using Conda, we can easily install, update, or remove various Python packages. One of the key features of Conda is that it lets us create separate environments, which allows us to install different packages (or even different versions of the same package) for different projects. For example, if we are working on a project which requires Python 2.7 and we're also working on a separate project which requires Python 3.10, we can maintain a separate Conda environment for each project in order to easily switch between the two.

If you have used Python before, you've probably used pip to install packages. For a comparison between pip and Conda, check out [this blog post](https://www.anaconda.com/blog/understanding-conda-and-pip).

### Installing Python and Conda via Miniconda
In this tutorial, we will use [Miniconda](https://docs.conda.io/en/latest/miniconda.html) to install Python and Conda.
1. Download the correct installer for your system [here](https://docs.conda.io/en/latest/miniconda.html#latest-miniconda-installer-links), and follow the instructions to install Miniconda. 

There is also [Anaconda](https://www.anaconda.com/), which is a distribution of Conda that comes with many popular data science packages in addition to Conda. It doesn't matter too much which one you use (see [here](https://docs.conda.io/projects/conda/en/latest/user-guide/install/download.html#anaconda-or-miniconda) for some guidelines), but in this tutorial we'll be using Miniconda.

###  Creating a Conda Environment

Once you've installed Conda, open up a terminal (on Linux / Mac) or the Anaconda Prompt (on Windows). We'll now set up an environment for CS178 and install some necessary packages.

1. Let's first verify that conda is installed correctly. In your terminal, run the command `conda --version` and verify that the output is something like `conda 4.14.0`.
2. Now, we'll create a new Conda environment named `cs178` with Python 3.10. To do this, run the command `conda create --name cs178 python=3.10`. If you ever forget the name of an environment you've created, you can list all of your environments with `conda info --envs`.
3. To use our new environment, we must first activate it with `conda activate cs178`. The name of the active environment should now be displayed in front of your prompt in parentheses. An environment can be deactivated using `conda deactivate`.
4. Let's install some packages in our `cs178` environment. First, run the command `conda config --env --add channels conda-forge`. This command tells Conda to search [Conda Forge](https://conda-forge.org/docs/user/introduction.html) for packages. Next, run the following command to install some packages: `conda install matplotlib pandas jupyterlab`.
5. Now, run `conda install 'scikit-learn>=1.1'`. This will install the package `scikit-learn`, with the version being at least 1.1. Having an up-to-date version of `scikit-learn` is important for Homework 1.
    - Some people have had trouble running this command, which can be caused by conflicts with the packages `conda` installed earlier. If you run into difficulties, try running just `conda install scikit-learn` and then confirm manually that the version is at least 1.1 by running the following in a Python interpreter: 
       
       `>>> import sklearn; print(sklearn.__version__)`
6. You can use the command `conda list` to show all of the packages installed in your environment -- if everything has worked so far, you should see the packages we installed (scikit-learn, pandas, ...) plus their (many) dependencies.
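
As a complement to `conda list`, the installed versions can also be checked from inside Python. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_version` is our own and not part of Conda or any of the packages. `numpy` is included on the assumption that it was pulled in as a dependency of the packages above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Distribution names as used by the installer (not necessarily the import names).
for name in ["numpy", "pandas", "matplotlib", "jupyterlab", "scikit-learn"]:
    print(f"{name}: {installed_version(name) or 'NOT INSTALLED'}")
```

Run this with the `cs178` environment active; a `NOT INSTALLED` line usually means the script ran in a different environment.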

Congratulations! You've now set up your Conda environment with all of the necessary packages to complete the assignments in CS178. In the next section, we'll tinker around with some of the packages we've installed to better familiarize ourselves with our new tools.

### Additional Resources
Note that we've only covered the bare essentials of Conda. If you'd like to read more, here are some resources:
- [Conda User Guide](https://conda.io/projects/conda/en/latest/user-guide/index.html)
- [Conda Cheat Sheet](https://docs.conda.io/projects/conda/en/latest/user-guide/cheatsheet.html)
- [Conda vs Pip](https://www.anaconda.com/blog/understanding-conda-and-pip)


---
## Part 2: Jupyter Notebooks and Numpy

### Jupyter Notebooks

In the previous section, we installed a package called `jupyterlab`. This package lets us use **Jupyter Notebooks**. A notebook is a web-browser based tool that combines code, text, equations, and much more into a single document. Let's create a notebook and get a feel for how these work.

1. In your terminal, `cd` into the directory where you'd like to create your notebooks, and run `jupyter notebook`. This should automatically open a tab in your browser. From here, we can create a new blank notebook (under "new"). Make sure you have our `cs178` environment active!

A notebook consists of many **cells**. Each cell can either contain Markdown (like this cell!) or code. In Markdown cells, we can use standard [Markdown syntax](https://www.markdownguide.org/cheat-sheet/) -- for example, we can make text **bold** or *italic*, make lists, write `code`, etc.

Markdown cells additionally support mathematical equations via LaTeX. Inline math can be written by wrapping LaTeX code in `$[...]$` and display-mode math can be written by wrapping LaTeX code in `$$[...]$$`. For example, here's some inline math: $a^2 + b^2 = c^2$, and here's some display-mode math: $$\int x^2 d x = \frac{1}{3} x^3 + C.$$ If you'd like to learn more about LaTeX, check out this [short tutorial](https://www.overleaf.com/learn/latex/Learn_LaTeX_in_30_minutes). **LaTeX is a very useful tool for typesetting math equations, but using LaTeX isn't a requirement in this course.**
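
For reference, this is roughly what one would type into a Markdown cell to produce the two formulas above. The thin space `\,` before `dx` is an optional typographic touch that we added; it is not required.

```latex
Inline math: $a^2 + b^2 = c^2$

Display-mode math:
$$\int x^2 \, dx = \frac{1}{3} x^3 + C$$
```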

In code cells, we can type Python code as usual. We can also execute individual cells in order to run the code they contain.

In [None]:
a = 3
b = 5
c = a + b
print(c)

### Numpy Arrays

Let's now familiarize ourselves a little bit with Python and Numpy. We start by importing some packages that we'll be using.

In [2]:
import numpy as np
from sklearn.datasets import load_iris

First, we'll load in the Iris dataset and store the features in a numpy array `X` and the labels in the numpy array `y`.

In [3]:
iris = load_iris()
X = iris.data
y = iris.target

#### Printing 
We can print `X` and `y` to see their contents.

In [5]:
X

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [4]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

#### Shape
We can see the shape of a numpy array by using `.shape`. Here, we see that `X` is a numpy array with 150 rows and 4 columns, and `y` is a numpy array with 150 entries.


In [6]:
X.shape

(150, 4)

In [7]:
y.shape

(150,)
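
Besides `.shape`, a few related attributes are handy when checking that data has the layout you expect. The sketch below uses a small made-up array `A` and label vector `labels` as stand-ins for `X` and `y`, so that it runs without loading any dataset.

```python
import numpy as np

# A small stand-in for a dataset: 3 samples (rows) with 4 features (columns).
A = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [4.7, 3.2, 1.3, 0.2]])
labels = np.array([0, 0, 1])

print(A.shape)       # (3, 4): rows, columns
print(A.ndim)        # 2: number of axes
print(A.size)        # 12: total number of entries
print(A.dtype)       # float64
print(labels.shape)  # (3,): a 1D array, hence the trailing comma
print(len(A))        # 3: length of the first axis

# shape is a plain tuple, so it can be unpacked directly.
n_samples, n_features = A.shape
```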

#### Indexing.

We can access specific elements of numpy arrays by *indexing*. There are various ways of doing this, each of which is useful in different situations. It is important to be familiar with all of them.

##### Basic Indexing.

The simplest way of indexing a numpy array is by specifying integers corresponding to which entry we want. We can also access particular rows/columns of our numpy array.

In [8]:
# Gets the entry of X in row 1 and column 2 -- remember that Python is zero-indexed!
X[1, 2]

1.4

In [9]:
# Gets the 7th entry of y
y[6]

0

In [10]:
# Get the first row of X
X[0, :]  # Or just X[0] would also work

array([5.1, 3.5, 1.4, 0.2])

In [14]:
# Get the first column of X
X[:, 0]

array([5.1, 4.9, 4.7, 4.6, 5. , 5.4, 4.6, 5. , 4.4, 4.9, 5.4, 4.8, 4.8,
       4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5. ,
       5. , 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5. , 5.5, 4.9, 4.4,
       5.1, 5. , 4.5, 4.4, 5. , 5.1, 4.8, 5.1, 4.6, 5.3, 5. , 7. , 6.4,
       6.9, 5.5, 6.5, 5.7, 6.3, 4.9, 6.6, 5.2, 5. , 5.9, 6. , 6.1, 5.6,
       6.7, 5.6, 5.8, 6.2, 5.6, 5.9, 6.1, 6.3, 6.1, 6.4, 6.6, 6.8, 6.7,
       6. , 5.7, 5.5, 5.5, 5.8, 6. , 5.4, 6. , 6.7, 6.3, 5.6, 5.5, 5.5,
       6.1, 5.8, 5. , 5.6, 5.7, 5.7, 6.2, 5.1, 5.7, 6.3, 5.8, 7.1, 6.3,
       6.5, 7.6, 4.9, 7.3, 6.7, 7.2, 6.5, 6.4, 6.8, 5.7, 5.8, 6.4, 6.5,
       7.7, 7.7, 6. , 6.9, 5.6, 7.7, 6.3, 6.7, 7.2, 6.2, 6.1, 6.4, 7.2,
       7.4, 7.9, 6.4, 6.3, 6.1, 7.7, 6.3, 6.4, 6. , 6.9, 6.7, 6.9, 5.8,
       6.8, 6.7, 6.7, 6.3, 6.5, 6.2, 5.9])

##### Setting elements of an array.

Using indexing, we can set the value of a specific element in a numpy array.

In [15]:
# Sets the 7th entry of y (index 6) to 1
y[6] = 1 

In [16]:
y[6]

1

##### Indexing by Slicing.

Slicing lets us access multiple contiguous rows/columns.

Answers to the questions:

1. `X[9:12, 1:3]`

2. `X[:, :-1]` (a trick using negative indexing)

In [13]:
# Get the first 3 rows of X
X[:3, :]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2]])

In [17]:
# Get rows 5, 6, 7 -- note row 8 is not included
X[5:8, :]

array([[5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2]])

In [18]:
# Get rows 5, 6, 7 of X and columns 1, 2
X[5:8, 1:3]

array([[3.9, 1.7],
       [3.4, 1.4],
       [3.4, 1.5]])
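
One detail worth knowing about slicing is that a basic slice is a *view* of the original array, not a copy. The sketch below illustrates this on a small made-up array `M` (our own example, not part of the Iris data), along with the optional step in `start:stop:step`.

```python
import numpy as np

M = np.arange(20).reshape(5, 4)  # rows 0..4, columns 0..3, entries 0..19

print(M[1:3, :])   # rows 1 and 2 (the stop index 3 is excluded)
print(M[:, 2:])    # columns 2 and 3 of every row
print(M[::2, :])   # every other row: rows 0, 2, 4

# Basic slices are views: writing through the slice changes M itself.
block = M[:2, :2]
block[0, 0] = -1
print(M[0, 0])     # -1

# Use .copy() when an independent array is needed.
safe = M[:2, :2].copy()
safe[0, 0] = 100
print(M[0, 0])     # still -1
```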

##### Negative Indexing.

You can also use negative indexes to count from the end of the array.

1. `X[-4:-1, :]`

2. `X[:, -3:-2]`

In [None]:
X[-1, :]   # Get the last row

In [None]:
X[-2:, :]  # Get the last two rows

In [19]:
X[:, :-1]  # First three columns

array([[5.1, 3.5, 1.4],
       [4.9, 3. , 1.4],
       [4.7, 3.2, 1.3],
       [4.6, 3.1, 1.5],
       [5. , 3.6, 1.4],
       [5.4, 3.9, 1.7],
       [4.6, 3.4, 1.4],
       [5. , 3.4, 1.5],
       [4.4, 2.9, 1.4],
       [4.9, 3.1, 1.5],
       [5.4, 3.7, 1.5],
       [4.8, 3.4, 1.6],
       [4.8, 3. , 1.4],
       [4.3, 3. , 1.1],
       [5.8, 4. , 1.2],
       [5.7, 4.4, 1.5],
       [5.4, 3.9, 1.3],
       [5.1, 3.5, 1.4],
       [5.7, 3.8, 1.7],
       [5.1, 3.8, 1.5],
       [5.4, 3.4, 1.7],
       [5.1, 3.7, 1.5],
       [4.6, 3.6, 1. ],
       [5.1, 3.3, 1.7],
       [4.8, 3.4, 1.9],
       [5. , 3. , 1.6],
       [5. , 3.4, 1.6],
       [5.2, 3.5, 1.5],
       [5.2, 3.4, 1.4],
       [4.7, 3.2, 1.6],
       [4.8, 3.1, 1.6],
       [5.4, 3.4, 1.5],
       [5.2, 4.1, 1.5],
       [5.5, 4.2, 1.4],
       [4.9, 3.1, 1.5],
       [5. , 3.2, 1.2],
       [5.5, 3.5, 1.3],
       [4.9, 3.6, 1.4],
       [4.4, 3. , 1.3],
       [5.1, 3.4, 1.5],
       [5. , 3.5, 1.3],
       [4.5, 2.3
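
To make the rule behind negative indexes concrete: for an axis of length `n`, the index `-k` refers to the same position as `n - k`. The following sketch checks this on a small made-up array `M` (our own example) and shows two easy-to-miss consequences.

```python
import numpy as np

M = np.arange(20).reshape(5, 4)

# Each negative index matches its non-negative counterpart.
assert np.array_equal(M[-1, :], M[4, :])      # last row
assert np.array_equal(M[-2:, :], M[3:, :])    # last two rows
assert np.array_equal(M[:, :-1], M[:, :3])    # all but the last column

# The stop index is still excluded, so -4:-1 gives rows 1, 2, 3 (not the last row).
print(M[-4:-1, :])

# A slice keeps the axis: M[:, -3:-2] has shape (5, 1), while M[:, -3] has shape (5,).
print(M[:, -3:-2].shape, M[:, -3].shape)
```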

##### Indexing with Arrays.

We can also access non-contiguous parts of a numpy array by specifying a list (or numpy array) of indexes.

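1. `rows = [3, 5, 19]`, then `X[rows]` or `X[[3, 5, 19]]`

2. `cols = [0, -1, -2]`, then `X[:, cols]` or `X[:, [0, -1, -2]]`

A handy trick is to write the list directly inside the square brackets. Note that a tuple does not work here: `X[(3, 5, 19)]` means `X[3, 5, 19]`, which raises an `IndexError` on a 2D array.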

In [None]:
# Gets rows 1, 5, 9 from X
rows = [1, 5, 9]
X[rows, :]

In [None]:
# Gets first and last column from X
cols = [0, -1]
X[:, cols]
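
A few properties of indexing with lists or arrays are easy to overlook: the order of the indexes is preserved, indexes may repeat, the result is a copy rather than a view, and a tuple is treated differently from a list. The sketch below shows each of these on a small made-up array `M` (our own example).

```python
import numpy as np

M = np.arange(20).reshape(5, 4)

rows = [3, 0, 3]            # order matters, and indexes may repeat
print(M[rows, :])           # rows 3, 0, 3 in that order

cols = np.array([0, -1])    # a numpy integer array works like a list
print(M[:, cols])           # first and last column

# Indexing with a list or array returns a copy, unlike basic slicing.
picked = M[rows, :]
picked[0, 0] = -1
print(M[3, 0])              # unchanged: 12

# A tuple is interpreted as one index per axis: M[(3, 1)] is M[3, 1].
print(M[(3, 1)])            # 13
try:
    M[(3, 1, 0)]            # three indexes for a 2D array
except IndexError as err:
    print("IndexError:", err)
```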

##### Logical Indexing.

We can perform logical operations on an array, and use the results to index the array.

1. `X[X <= 0.8]`

Cool trick: `X[0, X[0] > 0.5]` returns all entries in the first row with a value greater than 0.5.

In [None]:
Z = np.random.rand(10)  # A numpy array with 10 random elements drawn uniformly from [0, 1)

In [None]:
Z

In [None]:
Z > 0.5  # Check where entries are larger than 0.5

In [None]:
Z[Z > 0.5]  # Get all entries larger than 0.5
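
Logical indexing is especially useful for picking out the rows that belong to one class, or for combining several conditions. The sketch below uses a small made-up feature matrix `features` and label vector `labels` (our own stand-ins for `X` and `y`).

```python
import numpy as np

features = np.array([[5.1, 3.5],
                     [7.0, 3.2],
                     [6.3, 3.3],
                     [4.9, 3.0]])
labels = np.array([0, 1, 2, 0])

# A boolean mask with one entry per row selects whole rows.
class0 = features[labels == 0]           # rows 0 and 3
print(class0)

# Combine conditions element-wise with & (and), | (or), ~ (not).
# The parentheses are required because & binds tighter than comparisons.
mask = (features[:, 0] > 5.0) & (labels != 2)
print(features[mask])                     # rows 0 and 1

# Masks can also be used for assignment and counting.
clipped = features.copy()
clipped[clipped > 6.0] = 6.0
print(np.sum(features > 6.0))             # number of entries above 6.0: 2
```

Python's `and` / `or` keywords do not work element-wise on arrays, which is why `&` and `|` are used here.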

##### Vectorization

As was shown above, sometimes we can perform operations on all elements of an array without having to index into each element individually like we would have to with lists. Operations that allow this are called _vectorized operations_. Here are some of them:

In [20]:
u, v = np.arange(0, 6, 1), np.arange(10, 61, 10)  
print("u: {}, v: {}".format(u, v))

u: [0 1 2 3 4 5], v: [10 20 30 40 50 60]


In [21]:
# Unary Vectorized Operations
u + 5

array([ 5,  6,  7,  8,  9, 10])

In [22]:
u * 5

array([ 0,  5, 10, 15, 20, 25])

In [23]:
u ** 2

array([ 0,  1,  4,  9, 16, 25])

In [None]:
2 ** u

In [None]:
np.abs(u)  # absolute value, alternatively `abs(u)`

In [None]:
np.sum(u)  # summation, alternatively `sum(u)` or `u.sum()`

In [None]:
np.exp(u)  # e^u, natural log is also supported with `np.log(u)`

In [None]:
# Binary Vectorized Operations (Note: u and v need to have the same shape)
u+v, u-v

In [None]:
u*v, u/v

In [None]:
# Naturally, we can compose or chain multiple vectorized operations together
np.exp(u / v) * v / np.sum(u) - u**2

np.std(X[:, X[3] == 0])
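
To see what vectorization replaces, the sketch below computes the same quantity with an explicit Python loop and with a single vectorized expression, and then shows how reductions such as `sum` take an `axis` argument on 2D arrays. The function names are our own, chosen for illustration.

```python
import numpy as np


def squared_plus_one_loop(values):
    """Element-by-element version, as one would write it for a Python list."""
    result = []
    for value in values:
        result.append(value ** 2 + 1)
    return result


def squared_plus_one_vectorized(values):
    """Same computation as a single vectorized expression on a numpy array."""
    return values ** 2 + 1


u = np.arange(0, 6, 1)
loop_result = squared_plus_one_loop(u.tolist())
vec_result = squared_plus_one_vectorized(u)

print(loop_result)  # [1, 2, 5, 10, 17, 26]
print(vec_result)   # [ 1  2  5 10 17 26]

# Reductions take an `axis` argument on 2D arrays.
w = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
print(w.sum())                   # 15: sum over all entries
print(w.sum(axis=0))             # [3 5 7]: one sum per column
print(w.sum(axis=1))             # [ 3 12]: one sum per row
```

The vectorized form is shorter and usually much faster on large arrays, since the loop runs inside numpy's compiled code.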

##### Broadcasting

The vectorized operations extend to multi-dimensional arrays as you would expect, with the caveat that the two operands of a binary operation must either have the same shape or fall into a special case. Namely, if we have a 2D array of shape `(n, m)` and another of shape `(1, m)`, NumPy acts as if the second array had been repeated `n` times along the first axis (effectively becoming an `(n, m)` array) and then performs the vectorized operation on the two. This is called _broadcasting_. In this sense, the unary vectorized operations above (such as `u + 5`) can be seen as broadcasting a scalar to the shape of the array being operated on.

In [None]:
w = np.arange(0, 18, 1).reshape(3, 6)
w, w.shape

In [None]:
w - u  # u has shape (6,), matching w's last axis (u.shape[0] == w.shape[1]), so NumPy broadcasts u across each of the 3 rows of w

In [None]:
# Sometimes we need to be explicit and add in the missing dimension to broadcast over
print(u, u.shape)
print(u[np.newaxis, :], u[np.newaxis, :].shape)

In [None]:
# This is especially useful when performing outer products, distance matrices, etc.
u[np.newaxis, :] * v[:, np.newaxis]

In [None]:
abs(u[np.newaxis, :] - v[:, np.newaxis])
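
Two broadcasting patterns come up often in machine learning code: centering the columns of a data matrix, and computing all pairwise distances between points. The sketch below shows both on a small made-up array `points` (our own example), and also shows that incompatible shapes raise an error.

```python
import numpy as np

points = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [6.0, 8.0]])          # shape (3, 2): 3 points in 2D

# Center each column: the mean has shape (2,) and is broadcast across the 3 rows.
col_mean = points.mean(axis=0)            # [3., 4.]
centered = points - col_mean              # shape (3, 2)

# Pairwise differences: (3, 1, 2) - (1, 3, 2) broadcasts to (3, 3, 2).
diffs = points[:, np.newaxis, :] - points[np.newaxis, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=-1))  # shape (3, 3), Euclidean distances

print(centered)
print(dists)   # dists[0, 1] is 5.0 and dists[0, 2] is 10.0

# Shapes that cannot be aligned raise an error.
try:
    np.ones((3, 2)) + np.ones(3)
except ValueError as err:
    print("ValueError:", err)
```

NumPy compares shapes starting from the last axis; two axis lengths are compatible when they are equal or when one of them is 1.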

### Additional Resources
If you'd like to read more about Markdown, LaTeX, Jupyter Notebooks, or Python/Numpy, here are some useful resources:
- [Markdown Cheatsheet](https://www.markdownguide.org/cheat-sheet/)
- [Learn LaTeX in 30 Minutes](https://www.overleaf.com/learn/latex/Learn_LaTeX_in_30_minutes)
- [Jupyter Notebook Tutorial](https://www.dataquest.io/blog/jupyter-notebook-tutorial/)
- [A More In-Depth Python Tutorial](https://cs231n.github.io/python-numpy-tutorial/)
- [Numpy QuickStart](https://numpy.org/doc/stable/user/quickstart.html)
- [SciPy Lectures](http://scipy-lectures.org/)