# Lab 1
## Machine Learning

## Today

* [Preliminary lab info.](#prelim)
* [Helpful tools](#tools)
* [Setting up a new project...](#setup)
* [Good coding practice :)](#goodcoding)
* [Before you get started](#getstarted) (optional)
* [Numpy Matrices](#exercises)

## Some preliminary stuff. <a class="anchor" id="prelim"></a>

### How will these labs be structured?

90 minutes in total, divided into

1. Intro
2. Independent Work and questions (~1h)
3. Any other coding-related work & questions, **if there is time**. Otherwise, feel free to email me at K.conyngham@students.hertie-school.org


### Lab 'rules'

There are no stupid questions!

Try it yourself before asking others/googling/stack overflowing/asking chatGPT/asking me!

Tasks can be solved quickly with LLMs but it is very valuable to try yourself first and really make sure you understand each line of code.

Everyone's abilities will naturally differ coming into this class and that is okay! I will try to move at a pace that fits our schedule and suits as many of you as possible. If you already have advanced knowledge, feel free to sit nearer the back and work independently, but I do **expect everyone to go through the materials at least once**. I have also found that a great way to make sure I know something is to explain it to a colleague who might be struggling, and I am sure they will appreciate it too!

I'm very happy to help everyone. Unfortunately, time is limited in the lab, so try to make sure you have a well-defined question before I come over. It helps to simplify things until they work again and then work back up to the problem methodically, so you know where things are going wrong!

I will try to use datasets relevant to public policy questions throughout the course, but I highly recommend you find datasets of interest to **you** and try the things we are learning on those. Remember, we are learning a toolkit to answer questions and make predictions.

Enjoy it - Machine Learning can be fun!

### Access lab resources and exercises
You will find the Jupyter notebooks for the labs at [https://github.com/Killian-Conyngham/Hertie_ML_2025_Labs](https://github.com/Killian-Conyngham/Hertie_ML_2025_Labs).

To access the Jupyter notebooks, the easiest thing is to `git clone` the repository. I will update the repository each week with the lab for that week. Even better is to fork the repository so you have your own copy on GitHub (forking is done on the GitHub website using the `Fork` button at the link above; it is not a `git` command). This can be especially nice for posterity, so you can come back and use your own work as a reference. If you work from your own fork, make sure to sync it with the original repository before each lab to get the most up-to-date materials.

You can open the terminal by searching for "Terminal" (Mac), or by opening Command Prompt or Git Bash (Windows).

## Installing Git
### For Windows:
1. Visit [git-scm.com](https://git-scm.com) and download the Windows installer.
2. During installation, select "Git from the command line" when prompted.
3. After installation, open Command Prompt (search for "cmd") and verify the installation by running ```git --version```.

### For Mac: 

1. Install Homebrew (if you don't have it): ```/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"```
2. Install Git using Homebrew: ```brew install git``` 
*Alternatively, you can download Git directly from [git-scm.com](https://git-scm.com).*
3. Verify the installation by running ```git --version```

## Cloning the Repository:

Next we will need to clone the repository into our local machine. This can be done by navigating within the terminal to the desired directory and typing the ```git clone``` command. 

1. In-terminal, navigate to where you would like to set up your repository (I recommend that you create a directory called ```repositories``` either at the root directory, or within your documents directory).  
*Useful commands can be found in the [Helpful tools](#tools) section below.*
2. Clone the repo:

```
git clone https://github.com/Killian-Conyngham/Hertie_ML_2025_Labs.git
```
(If you have forked the repository, replace the link above with the link to your forked repository.)

This will create a directory (in whichever folder your terminal is currently operating in) and copy the remote repository into your local machine. 

---

## Running Jupyter Notebooks:
Once you've cloned the repository, follow the guide in the [Before you get started](#getstarted) section for a step-by-step walkthrough on how to run Jupyter Notebooks.

📌 Note: All lab materials will be distributed via GitHub, not Moodle. Check the repository regularly for updates.

## Helpful tools <a class="anchor" id="tools"></a>

Here is a list of tools and resources that can help you along the way. Use these to go through the steps in the [Before you get started](#getstarted) section.

* **How to work with the command line?** It's useful for anyone working with data and code to know how to use the command line (aka terminal/shell); this [article](https://www.dataquest.io/blog/why-learn-the-command-line/) explains why. [Here](https://tutorial.djangogirls.org/en/intro_to_command_line/) is an introduction to the most important commands on different operating systems. Using [TMUX](https://deliciousbrains.com/tmux-for-local-development/#what-is-tmux) will make using the command line much easier! 
* **How to install Python and keep track of dependencies?** I highly recommend using a virtual environment, ideally with [miniconda](https://docs.anaconda.com/free/miniconda/#quick-command-line-install)! Miniconda [CHEATSHEET](https://docs.conda.io/projects/conda/en/latest/_downloads/843d9e0198f2a193a3484886fa28163c/conda-cheatsheet.pdf)
* **Where/how to write your code?** Choose an integrated development environment (IDE) or code editor (I recommend [VSCode](https://code.visualstudio.com/)) and install [Jupyter Notebooks](https://realpython.com/jupyter-notebook-introduction/) for code experimentation (and to run these labs), and use Jupyter [keyboard shortcuts](https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330).
* **How to keep track of changes?** [Download and install Git](https://www.atlassian.com/git/tutorials/install-git)! Git [CHEATSHEET](https://education.github.com/git-cheat-sheet-education.pdf)
* **How to collaborate?** Sign up to GitHub.
* **How to write basic syntax in Python?** Look at this [CHEATSHEET](https://www.pythoncheatsheet.org/cheatsheet/basics) or use StackOverflow.

## Setting up a new project <a class="anchor" id="setup"></a>

Here are some best-practice steps you can take *every time* you start a new project (including a 'project' for what we'll be doing in these labs), to keep your code organised, sharable, and up-to-date. The [Before you get started](#getstarted) section below goes through these steps one by one.

1. Create a new directory for each project!
2. Set up a virtual environment with your preferred Python version using conda.
3. Install jupyter into this environment.
4. Set up a new project in your IDE (e.g. VS Code or PyCharm), selecting your preferred Python installation as the interpreter (ideally the one you just created using conda).
5. Set up git (create .gitignore, initialise, first commit, etc.).
6. Start jupyter within your environment by running the command `jupyter notebook`. (If you are using the VSCode jupyter notebook module this is not necessary.)

## Refresher: some best practice for coding <a class="anchor" id="goodcoding"></a>

The course textbooks are a great resource and there are many blog posts on best-practice for coding. Here's a very top-line summary:

* Think carefully about naming conventions for variables, classes and functions
* Write good documentation and comments
* Make sure your code is reusable and scalable.
* Test your code (the smaller the units you test, the easier you will make your life in the future).
* Track your changes; remember to use version control so that your collaborators and future self understand what you have changed/added!
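
As a small illustration of the points above (a descriptive name, a docstring, and a test on a small unit), here is a minimal sketch. The function name `mean_absolute_error` and the toy numbers are made up for this example and are not part of the lab materials.

```python
def mean_absolute_error(y_true, y_pred):
    """Return the average absolute difference between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# A tiny test on values small enough to check by hand: (0 + 0 + 2) / 3
assert mean_absolute_error([1, 2, 3], [1, 2, 5]) == 2 / 3
print(mean_absolute_error([1, 2, 3], [1, 2, 5]))
```

Testing a function this small takes one line, which is exactly why small units are easier to keep correct.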

Machine Learning Specifics:
* Set random seeds explicitly for reproducibility
* Make sure your code works on smaller chunks of the data or more limited loops before running the whole thing
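
The two points above might look like this in practice. This is only a sketch: the seed value `42` and the array `data` are arbitrary placeholders, and `np.random.default_rng` is one of several ways to seed NumPy's random numbers.

```python
import numpy as np

SEED = 42  # hypothetical value; any fixed integer works

# Two generators created with the same seed produce the same numbers.
rng_a = np.random.default_rng(SEED)
rng_b = np.random.default_rng(SEED)
sample_a = rng_a.normal(size=5)
sample_b = rng_b.normal(size=5)
print(np.array_equal(sample_a, sample_b))  # True

# Try your code on a small slice first, then on the full data.
data = np.arange(1_000_000)   # stand-in for a large dataset
small = data[:1_000]          # first 1,000 entries only
print(small.mean())           # 499.5
```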


## Before you get started <a class="anchor" id="getstarted"></a>

If you have been coding in Python for a while and already have your preferred set-up, you can skip this section and go on to the coding exercises. Otherwise, go through the following steps to create a handy and easily reproducible coding environment, before you dive into the coding exercises. Doing this once will very likely make your coding experience in the future much easier, even if it seems like a hassle in the beginning. **Use the resources and links listed in the [Helpful tools](#tools) section.**

### 1. Familiarise yourself with the command line
Being familiar with the command line, and being used to working with it, will help you set up and navigate data processing pipelines, work with data that is stored remotely (i.e. not on your local computer), switch between different programmes, and deploy web apps (for dsa). 

It's a little unwieldy at first and not necessarily intuitive, but essential and worth putting in a little time initially to get familiar. If you have never used the command line before, open the command line introduction and make yourself familiar with the commands to:
1. print the path of the current directory: `pwd` 
    - "print working directory".
    - outputs the full path to the current directory you're working in.
2. move between directories (aka folders): `cd`
    - "change directory"
    - `cd repositories`
    - `cd ..`
    - `cd ~`
    - absolute vs relative paths
3. print a list of files and subdirectories within the current directory: `ls`
    - "list"
    - `ls -l` - detailed information (like permissions, sizes, and timestamps)
    - `ls -a` - list files and folders including hidden ones (those starting with .)
4. create new directories: `mkdir`
    - "make directory"
    - `mkdir dir1 dir2 dir3`
    - `mkdir -p parent_dir/child_dir/grandchild_dir` (directory with nested subdirectories, using `-p` flag)
5. remove files and folders: `rm`
    - "remove"
    - remove a file: `rm` + file name
    - remove an empty directory: `rmdir` + dir name
    - remove a directory with its contents: `rm -r` + dir name (careful: this cannot be undone)
6. (optional and slightly more advanced) a really helpful tool for simplifying the command line is TMUX (terminal multiplexer), so you can run multiple terminal (command line) windows in parallel.
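
The same ideas (working directory, listing, nested directories, absolute vs relative paths) carry over to Python, which you will need once you start loading data files. Below is a rough sketch using the standard-library `pathlib` and `tempfile` modules; the directory names are placeholders, and everything happens inside a temporary folder that is deleted afterwards.

```python
import tempfile
from pathlib import Path

# Work inside a throwaway directory so nothing on your machine is touched.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)

    # Like `mkdir -p parent_dir/child_dir`
    nested = base / "parent_dir" / "child_dir"
    nested.mkdir(parents=True)

    # Like `ls`: list what is directly inside `base`
    names = sorted(p.name for p in base.iterdir())
    print(names)  # ['parent_dir']

    # A relative path depends on a starting point; an absolute path does not.
    relative = nested.relative_to(base)
    print(relative.is_absolute())  # False
    print(nested.is_absolute())    # True

# Like `pwd`: the directory Python is currently working in
print(Path.cwd())
```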

### 2. Set up a tool for virtual environments
Virtual environments are a great way to separate your package dependencies and even your Python versions. We'll be using a dedicated virtual environment for this course as well as DSA (and future Hertie technical courses).

I recommend miniconda (called `conda`), since it is not limited to Python; you can even set up an R environment with conda. *Alternatives are `virtualenv` and `pipenv`.*

If you already work with virtual environments, you can skip this step. Otherwise:

1. **install Miniconda**:
    - either from the [webpage](https://docs.anaconda.com/miniconda/install/)
    - or using basic command line: 
        - Mac: `bash Miniconda3-latest-MacOSX-x86_64.sh` 
        - Windows: `Miniconda3-latest-Windows-x86_64.exe`
    - follow the prompts (accept default install location)
    - initialise: `source ~/.bashrc` (or `source ~/.zshrc` if your terminal uses zsh, the default shell on recent versions of macOS)
    - verify: `conda --version`
2. **create a new virtual environment**: name it `ml2025` and install Python version 3.11 in it: `conda create -n ml2025 python=3.11`
3. **verify**: by running the conda command that lists all environments: `conda env list`
4. **activate the new environment** that you created: `conda activate ml2025`
5. **list the packages** that are installed within this environment: `conda list`


### 3. Create a new Git repo (or clone an existing one)

Using version control with git will make your collaborators' and future self's life *a lot* easier. It helps you to track your own changes on a project and to collaborate effectively with your teams. Generally, this part of the work flow will consist of creating a new repo or cloning an existing one from Github. 

The steps are:

1. **install git**: (run `git --version` to check if you already have it installed)
2. **clone an existing repository**: for the one used in these labs, run `git clone https://github.com/Killian-Conyngham/Hertie_ML_2025_Labs.git` and then move into the new directory that this creates (`cd Hertie_ML_2025_Labs`).
3. **branch**: now create and move into a new branch by running `git checkout -b lab1` (where lab1 is now the name of the branch you have created)

### 4. Set up Jupyter Notebooks
- Jupyter notebooks are a great way for quick experimentation with code, to present code, or to create data science work flows. 
- *NB: The final functions and classes (and testing) that you write for a project should **not** sit in a Jupyter notebook. Those should be written in `.py` files (aka Python modules) or, even better, in Python packages.*
- One best-practice tip: start your code experiments in a Jupyter notebook and once you find yourself using some functions/classes repeatedly or think you might need them in other notebooks, migrate them into Python modules (and later to a package). 

To run this Jupyter notebook, follow these steps:
1. make sure the `ml2025` environment you created in step 2 is activated
2. install jupyter into this environment using the `pip install jupyter` command (installing something *into* an environment just means that you need to run the installation command after having activated the environment)
3. now move into the directory of the labs repository that you cloned in step 3 (called `Hertie_ML_2025_Labs`).
4. run `jupyter notebook`; this will start your default browser (or open a new tab) and you should now see a folder structure that includes the Jupyter notebook file `lab1.ipynb`
5. click on `lab1.ipynb` to run it and familiarise yourself with the keyboard shortcuts for running cells, creating new cells (above and below the current cell), and removing cells.

### 5. Set up your IDE
- After step 4., you can now write code in Jupyter notebooks. 
- For the type of coding necessary for larger (and collaborative) projects, you'll also need an IDE (integrated development environment). IDEs are easier to use than the basic Jupyter notebook interface that you started from the command line above.
- This is where you'll write classes, functions, helper functions, tests etc. 
- I recommend VS Code or PyCharm, because they have many 'intelligent code' features such as code prediction, readability, error detection, and easy refactoring, and because they provide debugging tools.
- You can either open a project from the IDE, or open the file using the IDE programme. 
    - The 'more correct' way is to open up the project from the IDE.
    - **For VS Code**: 
        - click the `Explorer` icon on the top left hand corner
        - select `Open Folder`
        - select the `Interpreter` (top right, also the central top bar), choose the Python installation that is within the `ml2025` conda environment you created above.
    - **For PyCharm**
        - on the `New Project` screen, set the `Location` path to the path to the `Hertie_ML_2025_Labs` directory 
        - change the radio button from `New environment using ...` to `Previously configured interpreter`
        - in the `Interpreter` dropdown menu, choose the Python installation that is within the `ml2025` conda environment you created above.

## Let's get coding! <a class="anchor" id="exercises"></a>

### Import the Library




First you will have to install numpy into your virtual environment using `pip install numpy`. (Note: you can also use `conda install numpy`. This works better for managing dependencies, but is not available for every package, so for more niche packages it's better to use pip, which is Python's default package installer.)

In [2]:
import numpy as np

#### If this command runs without any errors, it means your library has been successfully imported.
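
If you want to double-check which installation you are actually using, a quick sketch like the one below can help. It assumes you named your environment `ml2025` as suggested earlier; the exact path and version number will differ from machine to machine.

```python
import sys

import numpy as np

# Path of the Python interpreter running this code; if the ml2025
# environment is active, the path should normally contain "ml2025".
print(sys.executable)

# Installed NumPy version (a string; yours may differ from your neighbour's).
print(np.__version__)
```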

## Matrix Operations using Numpy

See if you can complete the following exercises using the Python basics you have learned and the cheatsheet below for numpy:
https://media.datacamp.com/legacy/image/upload/v1676302459/Marketing/Blog/Numpy_Cheat_Sheet.pdf

Some code chunks are blank; this is where you should put your code. Others are already filled in; make sure to run these as you go along!

If you are struggling, the solutions are in lab1_solutions.ipynb, but try asking your neighbour or searching online first!

One important note: in NumPy, matrices are a special case of arrays (a matrix is simply an array with two dimensions).
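
To make that note concrete, here is a short sketch. The name `example` and its values are placeholders chosen to differ from the exercise matrices below.

```python
import numpy as np

# Hypothetical example values, not one of the exercise matrices.
example = np.array([[10, 20, 30],
                    [40, 50, 60]])

print(type(example).__name__)  # ndarray
print(example.ndim)            # 2  (two dimensions: rows and columns)
print(example.shape)           # (2, 3)  (2 rows, 3 columns)
```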

### 1. Defining and printing matrices

#### Define a 2x2 matrix A such that it takes the following values:

$$
\begin{bmatrix}
1 & 2 \\
3 & 4
\end{bmatrix}
$$

#### Let's check the values of A:

##### Way 1

In [None]:
A

##### Way 2

In [None]:
print(A)

#### Define a 2x2 matrix B such that it takes the following values:

$$
\begin{bmatrix}
5 & 6 \\
7 & 8
\end{bmatrix}
$$

In [None]:
B

#### Let's try to see values of 2 variables in one cell without the print statement

We should see that only matrix B gets shown on the screen.

#### Now try to see values of 2 variables in one cell with print statement.

We should see that both the matrices get printed with the print statement.

#### Now let's define a 3x3 matrix C such that it takes the following values:

$$
\begin{bmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{bmatrix}
$$

In [None]:
C

##### Bonus: see all variables defined in current scope:

In [None]:
whos


### 2. Datatypes of matrices:

The type of a matrix depends on its components and how we set it up.

In [None]:
A.dtype

#### We can define matrix G so that it is of type double (which allows for decimal values)
Note: Double and float64 are two names for the same datatype.

In [None]:
G=np.array([[1,2,3],[4,5,6]],np.double)
G

In [None]:
print(G.dtype)
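
The sketch below illustrates how the dtype follows from the values you pass in, and that `np.double` and `np.float64` are the same type. The array names are placeholders; the exact integer type NumPy picks can depend on your platform and NumPy version, so no specific width is assumed here.

```python
import numpy as np

ints = np.array([1, 2, 3])       # only integers -> an integer dtype
mixed = np.array([1, 2.5, 3])    # one decimal value -> every entry becomes a float
forced = np.array([1, 2, 3], dtype=np.float64)  # dtype requested explicitly

print(ints.dtype)    # an integer type; the exact width may vary
print(mixed.dtype)   # float64
print(forced.dtype)  # float64

# np.double and np.float64 refer to the same type.
print(np.dtype(np.double) == np.dtype(np.float64))  # True
```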

In [None]:
whos

### 3. Defining Zero and Identity Matrices:

#### Define a zero matrix of size 4x4:
(See if you can do this directly without typing out the zeroes)

##### We can also define a Zero matrix that has integer values:

In [None]:
Z4=np.zeros(shape=(4,4),dtype=int)
Z4

#### Define an Identity matrix of size 4x4:

#### Define an Identity Matrix of type int and size 4x4:

In [None]:
print(I4.dtype)

### 4. Access rows and columns of a matrix (slicing the matrix):

#### In Python, indices start from [0,0]

#### If we want to access the first element of a matrix, we will use indices [0,0]

In [None]:
print("Matrix C:")
print(C)

print("\nFirst element of Matrix C:")
print(C[0,0])

#### Modify that element so that it is 7:

In [None]:
print("First element of Matrix C:")
print(C[0,0])

print("\nMatrix C:")
print(C)

#### Then change it back to its original value of 1:

In [None]:
print(C)

#### We can also access entire rows of a matrix:

The first index denotes row, and the second denotes column: Matrix[row,column]

In [None]:
C[0,:]

In [None]:
print("First row of Matrix C:")
print(C[0,:])


#### Access the entries on the third row of matrix C:

#### Access the 2nd column of matrix C:

#### Retrieve Matrix Blocks:

#### Let's retrieve elements with values 1,2,4,5 from C:

In [None]:
print(C[0:2,0:2])

Note: 0:2 doesn't include position 2 (only 0 and 1)

#### Now retrieve elements with values 5,6,8,9 from C:
(Hint, we require rows 1 and 2 and columns 1 and 2, counting from 0)

Note: [1:2,1:2] only gives us the center most element in the matrix

In [None]:
print (C[1:2,1:2])

#### Another approach to get elements with values 5,6,8,9:
Here we use negative indices, which count from the end:
- -1 refers to the last row/column.
- -2 refers to the second-to-last row/column, and so on.

In [None]:
print(C[-2:,-2:])
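
The slicing rules used above (the end position is excluded, and negative indices count from the end) are easiest to see on a one-dimensional array. A small sketch, with `v` as a made-up example:

```python
import numpy as np

v = np.array([10, 11, 12, 13, 14])  # hypothetical 1-D example

print(v[0:2])   # [10 11]        positions 0 and 1; position 2 is excluded
print(v[1:2])   # [11]           only position 1
print(v[-2:])   # [13 14]        the last two entries
print(v[:-1])   # [10 11 12 13]  everything except the last entry
```

The same rules apply separately to the row part and the column part of a two-dimensional slice.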

##### Find the number of rows and columns for matrix G:

#### Now see if you can find the no. of rows and columns in G separately:

### 5. Matrix Operations:

#### Find the Transpose of matrix G:

#### Add A and B, call the answer S and print it:

#### Multiply A by a scalar of 2 i.e. multiply all elements by 2:

#### Multiplying Matrices:

If we use the default multiplication operator in Python, ``*``, we get element-wise multiplication of A and B, but not standard (dot product) matrix multiplication.

In [None]:
M=A*B
print ("M is\n",M)
print("\nA is\n", A)
print("\nB is \n", B)

#### Find the dot product of A and B:

We can also use the @ operator for (dot product) matrix multiplication

In [None]:
M3=A@B
print("\nMatrix multiplication of A and B is\n", M3)
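
To see the difference between the two kinds of product side by side, here is a sketch with two made-up matrices `P` and `Q` (not the lab's A and B), chosen so the results are easy to verify by hand.

```python
import numpy as np

# Hypothetical matrices, chosen so the two products are easy to tell apart.
P = np.array([[1, 0],
              [2, 1]])
Q = np.array([[3, 4],
              [5, 6]])

elementwise = P * Q   # multiplies matching entries
matmul = P @ Q        # rows of P times columns of Q

print(elementwise)    # [[ 3  0], [10  6]]
print(matmul)         # [[ 3  4], [11 14]]

# For 2-D arrays, np.dot and np.matmul give the same result as @.
print(np.array_equal(matmul, np.dot(P, Q)))     # True
print(np.array_equal(matmul, np.matmul(P, Q)))  # True
```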

#### Let's calculate inverse of matrix A:

In [46]:
A1=np.linalg.inv(A)

Note that this uses the `linalg` submodule of the numpy library. Submodules are accessed using dot notation and sometimes must be imported separately.

We can check whether A1 is the inverse of A using matrix multiplication: A1@A = A@A1 = identity matrix

In [None]:
print(A@A1)

Don't be confused by entries ending in `e-16`: they are tiny numbers caused by floating-point rounding and are effectively zero.
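
Because of this rounding, an exact comparison with the identity matrix may fail even though the inverse is correct. One common way to handle this, sketched below, is to compare within a small tolerance using `np.allclose`. The matrix `A` is redefined here with the same values as above so the snippet runs on its own.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])      # same values as matrix A above
A1 = np.linalg.inv(A)

product = A @ A1
print(product)

# An exact comparison can fail because of floating-point rounding...
print(np.array_equal(product, np.eye(2)))
# ...so compare within a small tolerance instead.
print(np.allclose(product, np.eye(2)))  # True
```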

### 6. Save matrix to a file:

#### Save matrix G to the file matrix_file.npy

#### Let's load the saved file and compare the matrices, all elements should be equal:

In [None]:
G4=np.load('matrix_file.npy')
print(G4==G)
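
Comparing with `==` prints one True/False per element. If you would rather get a single yes/no answer, something like the sketch below works; `original` and `reloaded` are hypothetical stand-ins for G and G4.

```python
import numpy as np

# Hypothetical stand-ins for an original matrix and a reloaded copy.
original = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])
reloaded = original.copy()

print(original == reloaded)                # element-by-element array of True/False
print((original == reloaded).all())        # True
print(np.array_equal(original, reloaded))  # True
```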

You are done with this tutorial, congrats! If you have lots of extra time in the tutorial, it might be worth starting to look for interesting datasets for your project!