# Lab 2

## 0, SSH to your EC2 instance
Start your instance and run:

```
ssh -i YOUR_KEY_PATH -L 8888:localhost:8888 EC2_IP_ADDRESS
```
Note that the option '-L 8888:localhost:8888' creates a port tunnel: connections to port 8888 on your local machine are forwarded to port 8888 on the EC2 instance.
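To check that the tunnel is up before opening a browser, something like the following can help. It is a minimal sketch using only the Python standard library; the helper name `port_is_open` is hypothetical. It only tests whether something accepts TCP connections on the local port (the SSH client does so once the tunnel is established), not whether Jupyter is actually running on the instance.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Run this on your local machine while the SSH session is still open.
print("localhost:8888 reachable:", port_is_open("localhost", 8888))
```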


## 1, Install Required Tools/Packages
    
(1) Install tmux on your EC2 instance; see https://github.com/tmux/tmux/wiki/Installing. A tmux cheat sheet is available at https://tmuxcheatsheet.com/. Important functions include starting new sessions, attaching to and detaching from sessions, creating and deleting windows in a session, and navigating between sessions and windows. (tmux is optional, but it is very convenient when you work with multiple shells.)

(2) Install the AWS CLI (AWS command line interface) on your EC2 instance; see https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html. (You can also install it on your local laptop in case you want to transfer files between your laptop and EC2 or an S3 bucket.)

(3) Configure the AWS CLI by following the instructions at https://www.cloud.northwestern.edu/resources/howtos/use-netid-authentication-with-the-aws-cli/ . This includes running
```
aws configure sso --no-browser
```
and setting your credentials (listed under 'access key' on your AWS login page) by running
```
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_SESSION_TOKEN="..."
```

(4) Install Docker on your EC2 instance by running
```
sudo apt install docker.io
```
and make the Docker socket accessible so that you can run 'docker' without 'sudo':
```
sudo chmod 666 /var/run/docker.sock    
```
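Once the steps in this section are done, a quick check like the one below can confirm the setup. It is a sketch under stated assumptions: the helper names `missing_tools` and `missing_env_vars` are hypothetical, the tool list simply mirrors what this section installs (tmux is optional), and the variable names are the standard AWS credential environment variables from step (3). Run it in the same shell session where you exported the credentials.

```python
import os
import shutil

# Tools installed in this section (tmux is optional) and the
# credential variables exported in step (3).
REQUIRED_TOOLS = ("tmux", "aws", "docker")
REQUIRED_ENV_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
)


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_env_vars(env=None, names=REQUIRED_ENV_VARS):
    """Return the variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


print("Missing tools:", missing_tools() or "none")
print("Missing AWS variables:", missing_env_vars() or "none")
```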


## 2, GitHub Setup

 - Generate an SSH key: https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent
 - Add the public key to your GitHub account: https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account

   After setting up the key pair, create a course repo (e.g. yourGithubAccount/DE300) on the GitHub webpage, 'git clone' it into a directory on your instance, create a test file with 'touch test', then run
   ```
    git add .
    git commit -m "Initial commit"
    # Skip the next line if you cloned the repo: 'git clone' already sets up 'origin'.
    # Use the SSH form of the URL so that the key you just added is used.
    git remote add origin <repo URL, i.e. git@github.com:(your GitHub name)/(repo name).git>
    git push origin main
   ```
If your test file appears in your remote GitHub repo, you have successfully connected your local repo to your remote repo!
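The SSH key from this section only authenticates remotes of the form `git@github.com:OWNER/REPO.git`; an `https://github.com/...` remote does not use it. If you copied the HTTPS address from the GitHub page, the conversion can be sketched as below. The helper name `https_to_ssh` is hypothetical, and the example repo name is the one used above.

```python
import re


def https_to_ssh(url: str) -> str:
    """Convert https://github.com/OWNER/REPO(.git) to git@github.com:OWNER/REPO.git."""
    match = re.fullmatch(r"https://github\.com/([^/]+)/([^/]+?)(?:\.git)?/?", url)
    if match is None:
        raise ValueError(f"not a GitHub HTTPS repository URL: {url!r}")
    owner, repo = match.groups()
    return f"git@github.com:{owner}/{repo}.git"


print(https_to_ssh("https://github.com/yourGithubAccount/DE300.git"))
```

The printed address can then be passed to 'git remote add origin' or, for an existing remote, to 'git remote set-url origin'.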

 
## 3, Create your first Docker (Jupyter Notebook) Image and run a Docker Container with this image
(1) Under your course directory, create a folder named my-jupyter-image (you can choose your preferred name). In this folder, create a file named Dockerfile that contains the following text:

```
# Use the Jupyter Data Science Notebook as the base image
FROM jupyter/datascience-notebook

# Set the working directory to /home/jovyan - the default for Jupyter images
WORKDIR /home/jovyan

EXPOSE 8888

# Avoid prompts from apt during the build process
ENV DEBIAN_FRONTEND=noninteractive

COPY . .

# Switch to root so that system packages can be installed
USER root

# Update and install necessary packages
RUN apt-get update && apt-get install -y \
    wget \
    curl \
    unzip \
    bzip2 \
    git \
    && rm -rf /var/lib/apt/lists/*

# Install the latest AWS CLI version 2 in the image
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
    unzip awscliv2.zip && \
    ./aws/install && \
    rm -rf awscliv2.zip ./aws
# Switch back to the default non-root notebook user
USER jovyan


# Start the Jupyter Notebook server when the container launches
# The base image already configures Jupyter to run on start, but you can customize
# startup options here if needed.
CMD ["start-notebook.sh", "--NotebookApp.token=''", "--NotebookApp.password=''", "--allow-root"]

```
The 'Dockerfile' is the set of instructions Docker follows to build the image.

(2) Build your image: cd into the my-jupyter-image folder and run
```
docker build -t my_jupyter .
```
After running the above command, you can check that the image was created by running 'docker images'. The first build takes a while because the image is built on top of the base image 'jupyter/datascience-notebook', which is about 6GB and has to be downloaded from Docker Hub (Docker Hub is to Docker images what GitHub is to git repos; look it up!). As a result, 'docker images' lists two images: the base image 'jupyter/datascience-notebook' and the one you built, 'my_jupyter'.

(3) Create your SAMPLE_PROJ_PATH directory, then run a container on your EC2 instance from the 'my_jupyter' image:
```  
docker run -p 8888:8888 -v SAMPLE_PROJ_PATH:/home/jovyan/ my_jupyter
```

In the above command, the option '-p 8888:8888' publishes a port: port 8888 inside the container, where the Jupyter server listens, is mapped to port 8888 on your instance. The option '-v SAMPLE_PROJ_PATH:/home/jovyan/' mounts your working directory 'SAMPLE_PROJ_PATH' at '/home/jovyan/' inside the container, so files created on either side are visible on the other.
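Both options follow the same HOST:CONTAINER order, which is easy to mix up. The sketch below only assembles and prints the command; it does not start a container. The helper name `docker_run_command` is hypothetical, and SAMPLE_PROJ_PATH stands in for an absolute path on your instance (Docker treats a host part that is not an absolute path as the name of a volume rather than a directory).

```python
import shlex


def docker_run_command(image, host_port, container_port, host_dir, container_dir):
    """Build the argument list for 'docker run' with one port mapping and one bind mount."""
    return [
        "docker", "run",
        "-p", f"{host_port}:{container_port}",  # host side first, container side second
        "-v", f"{host_dir}:{container_dir}",    # same order for the mounted directory
        image,
    ]


cmd = docker_run_command("my_jupyter", 8888, 8888, "SAMPLE_PROJ_PATH", "/home/jovyan/")
print(shlex.join(cmd))
```

To serve the notebook on a different instance port, only the host side changes, e.g. `docker_run_command("my_jupyter", 9000, 8888, ...)` yields '-p 9000:8888'; the SSH tunnel from section 0 would then have to target that port as well.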

You can check your container with the command 'docker ps -a'. The option '-a' makes it list both running and stopped containers.

Look up the following commands and see what they do:
```
docker start container_id
docker stop container_id
docker logs container_id

docker rm container_id
docker rmi image_id
```



 

## Lab 2 Assignment

(1) Find the bucket 'de300spring2024' in the S3 buckets and create your own folder named after you. This will be the volume where you store all your data for this course. Please do not modify files in other students' folders.

(2) Use 'aws s3 cp s3://de300spring2024/robert_su/sample_dataset.csv YOUR_PATH' to copy the file 'sample_dataset.csv' from the TA's (Robert Su) folder to your folder.

(3) Open JupyterLab by visiting 'http://localhost:8888/lab/' in a browser on your local machine. (JupyterLab is running in the container on your EC2 instance.)

(4) Create a Jupyter notebook named 'reading_data.ipynb' and run the Python code cells listed after step (5). If you can read the data, your setup is good to go!

(5) After completing all of this, push the git repo on your EC2 instance to the remote main branch. The mentors will grade based on your remote repo's origin/main.

```
# Loading required package

import boto3
from io import BytesIO
import pandas as pd
```

```
# you need to change the credentials for yourself

s3 = boto3.client('s3',
                  aws_access_key_id='...',
                  aws_secret_access_key='...',
                  aws_session_token='...')


bucket_name = 'de300spring2024'
object_key = 'robert_su/sample_dataset.csv'
```

```
csv_obj = s3.get_object(Bucket=bucket_name, Key=object_key)
body = csv_obj['Body']
csv_string = body.read().decode('utf-8')
```

```
df = pd.read_csv(BytesIO(csv_string.encode()))
print(df.head())
```

Output:

```
   ID           Name  Age         City
0   1       John Doe   28     New York
1   2     Jane Smith   32  Los Angeles
2   3    Emily Davis   45      Chicago
3   4  Michael Brown   22        Miami
```
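
The 's3://...' address used with 'aws s3 cp' in step (2) and the 'bucket_name' / 'object_key' pair in the notebook name the same object. The relationship can be sketched as follows, using only the Python standard library; the helper name `split_s3_uri` is hypothetical and not part of boto3 or the AWS CLI.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str):
    """Split s3://BUCKET/KEY into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the host part; the key is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, object_key = split_s3_uri(
    "s3://de300spring2024/robert_su/sample_dataset.csv"
)
print(bucket_name)
print(object_key)
```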
