# SageMaker notebook setup: Git, Snowflake, and S3

## Git

1. Stop your SageMaker notebook instance if it is running.
2. Edit the instance and add repositories under 'Git repositories'. You can also add public repositories for your specific instance.
3. Start the notebook instance. Your repositories are cloned under the notebook directories, and local changes persist when you restart your notebook.
4. Use either the Git extension or terminal commands to work with Git. The extension requires a personal access token; to generate one, see
   https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token


## S3

1. Open S3 in the AWS console and create a folder for yourself under `datascience-hbo-users/users`.
2. Access this folder from SageMaker under the `s3://datascience-hbo-users/` bucket, for example `s3://datascience-hbo-users/users/tjung/`. You can also use `aws s3` commands for general file management (`aws s3 rm`, `aws s3 cp`, etc.).

```
# Read a file (in a notebook cell)
pd.read_csv('s3://datascience-hbo-users/users/tjung/test_s.csv')

# Delete a file (in a notebook cell)
!aws s3 rm 's3://datascience-hbo-users/users/tjung/test.csv'
```
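
The same read and write pattern can be wrapped in a few small helpers. This is a minimal sketch: the function names (`user_s3_path`, `save_frame`, `load_frame`) are hypothetical, and reading or writing `s3://` paths with pandas assumes the `s3fs` package is available in the kernel.

```python
BUCKET = "datascience-hbo-users"  # bucket name taken from the steps above


def user_s3_path(username: str, filename: str) -> str:
    """Build the S3 URI for a file in a user's folder."""
    return f"s3://{BUCKET}/users/{username}/{filename.lstrip('/')}"


def save_frame(df, username: str, filename: str) -> str:
    """Write a DataFrame to the user's folder as CSV and return its URI."""
    path = user_s3_path(username, filename)
    df.to_csv(path, index=False)  # pandas hands s3:// paths to s3fs (assumed installed)
    return path


def load_frame(username: str, filename: str):
    """Read a CSV from the user's folder into a DataFrame."""
    import pandas as pd  # imported lazily so the path helper works without pandas

    return pd.read_csv(user_s3_path(username, filename))
```

For example, `load_frame('tjung', 'test_s.csv')` reads the same file as the `pd.read_csv` call above.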

## Plotly (optional)

1. Set up a lifecycle configuration on SageMaker and start your notebook instance with that configuration.
2. Enter the following under 'Scripts' > 'Start notebook':
    ```
    #!/bin/bash
    set -e
    sudo -u ec2-user -i <<'EOF'

    # Enable the JupyterLab Extension Manager.
    mkdir -p ~/.jupyter/lab/user-settings/@jupyterlab/extensionmanager-extension
    echo '{"enabled": true}' > ~/.jupyter/lab/user-settings/@jupyterlab/extensionmanager-extension/plugin.jupyterlab-settings

    # Reset the Jupyter config, then install the widget and Plotly extensions.
    echo '{"JupyterApp":{}}' > /home/ec2-user/anaconda3/envs/JupyterSystemEnv/etc/jupyter/jupyter_config.json
    source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
    conda install "ipywidgets=7.5" --yes
    export NODE_OPTIONS=--max-old-space-size=4096
    jupyter labextension install @jupyter-widgets/jupyterlab-manager@1.1 --no-build
    jupyter labextension install jupyterlab-plotly@4.6.0 --no-build
    jupyter labextension install plotlywidget@4.6.0 --no-build
    jupyter lab build
    unset NODE_OPTIONS

    # This will affect only the Jupyter kernel called "conda_python3".
    source activate python3
    pip install xgboost
    pip install lightgbm
    pip install fuzzywuzzy
    pip install snowflake
    pip install snowflake-connector-python
    pip install category_encoders

    source deactivate

    EOF
    ```
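
Both files that the script writes have to be valid JSON, and shell quoting mistakes are easy to make here. A minimal sketch for checking them from a notebook cell or terminal; the helper name `check_json_file` is hypothetical, and the paths are the ones used in the script above.

```python
import json
from pathlib import Path

# Paths written by the lifecycle script above.
SETTINGS_FILES = [
    Path.home() / ".jupyter/lab/user-settings/@jupyterlab/extensionmanager-extension/plugin.jupyterlab-settings",
    Path("/home/ec2-user/anaconda3/envs/JupyterSystemEnv/etc/jupyter/jupyter_config.json"),
]


def check_json_file(path: Path) -> str:
    """Return 'ok', 'missing', or an 'invalid: ...' message for one file."""
    if not path.exists():
        return "missing"
    try:
        json.loads(path.read_text())
    except json.JSONDecodeError as err:
        return f"invalid: {err.msg}"
    return "ok"


if __name__ == "__main__":
    for settings_file in SETTINGS_FILES:
        print(settings_file, "->", check_json_file(settings_file))
```

A file containing `{enabled: true}` (keys without double quotes) is reported as invalid, while `{"enabled": true}` passes.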



Alternatively, if you don't want to set up a lifecycle configuration:

1. Enable the Extension Manager under JupyterLab's Settings tab.
2. Run the following in a terminal:

    ```
    echo '{"JupyterApp":{}}' > /home/ec2-user/anaconda3/envs/JupyterSystemEnv/etc/jupyter/jupyter_config.json
    conda install "ipywidgets=7.5" --yes
    export NODE_OPTIONS=--max-old-space-size=4096
    jupyter labextension install @jupyter-widgets/jupyterlab-manager@1.1 --no-build
    jupyter labextension install jupyterlab-plotly@4.6.0 --no-build
    jupyter labextension install plotlywidget@4.6.0 --no-build
    jupyter lab build
    unset NODE_OPTIONS
    ```
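
After the build finishes, it can help to confirm which package versions the environment actually has. A minimal sketch, assuming Python 3.8+ (for `importlib.metadata`) and that the Python packages are named `ipywidgets` and `plotly`; the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(["ipywidgets", "plotly"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Run it with the same Python environment that the notebook kernel uses; a different environment may report different versions.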


## Snowflake

Run the following pip install commands, then restart the notebook kernel:

```
!pip install snowflake --user
!pip install snowflake-connector-python --user
```



Define the credential and connector helpers:

```
import json
from abc import ABCMeta, abstractmethod

import boto3
import pandas as pd
import snowflake.connector

## Limit Size of Returned Records
MAX_QUERY_RETURN_SIZE = 1000000


class Credentials(metaclass=ABCMeta):
    @abstractmethod
    def get_keys(self):
        """Return a dict that holds the secret JSON under 'SecretString'."""
        raise NotImplementedError


class SSMPSCredentials(Credentials):
    def __init__(self, secretid: str):
        self._secretid = secretid
        self._secrets = {}

    def get_keys(self):
        """Fetch the secret from AWS Secrets Manager."""
        _aws_sm_args = {'service_name': 'secretsmanager', 'region_name': 'us-east-1'}
        secrets_client = boto3.client(**_aws_sm_args)
        get_secret_value_response = secrets_client.get_secret_value(SecretId=self._secretid)
        return get_secret_value_response


class BaseConnector(metaclass=ABCMeta):
    @abstractmethod
    def connect(self, dbname: str, schema: str):
        raise NotImplementedError


class SnowflakeConnector(BaseConnector):
    def __init__(self, credentials: Credentials):
        keys = credentials.get_keys()
        # The secret is stored as a JSON string.
        self._secrets = json.loads(keys.get('SecretString', "{}"))

    def connect(self, dbname: str, schema: str = 'DEFAULT'):
        ctx = snowflake.connector.connect(
            user=self._secrets['login_name'],
            password=self._secrets['login_password'],
            account=self._secrets['account'],
            warehouse=self._secrets['warehouse'],
            database=dbname,
            schema=schema
        )

        return ctx
```
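
The connector reads four keys from the secret (`login_name`, `login_password`, `account`, `warehouse`). A minimal sketch of validating that shape before connecting; `parse_secret` is a hypothetical helper, and the response below is a hand-built stand-in for what `get_secret_value` returns, filled with placeholder values.

```python
import json

REQUIRED_KEYS = ("login_name", "login_password", "account", "warehouse")


def parse_secret(response: dict) -> dict:
    """Decode 'SecretString' and fail early if a required key is missing."""
    secrets = json.loads(response.get("SecretString", "{}"))
    missing = [key for key in REQUIRED_KEYS if key not in secrets]
    if missing:
        raise KeyError(f"secret is missing keys: {', '.join(missing)}")
    return secrets


# Stand-in response with placeholder values (not real credentials).
fake_response = {
    "SecretString": json.dumps(
        {
            "login_name": "example_user",
            "login_password": "example_password",
            "account": "example_account",
            "warehouse": "example_warehouse",
        }
    )
}
secrets = parse_secret(fake_response)
print(sorted(secrets))
```

Failing at this point gives a clearer error than a `KeyError` raised later inside `connect`.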
    

Connect to Snowflake with the credentials stored in AWS Secrets Manager:

```
## Credentials
SF_CREDS = 'datascience-max-dev-sagemaker-notebooks'

## Snowflake connection
conn = SnowflakeConnector(SSMPSCredentials(SF_CREDS))
ctx = conn.connect("MAX_PROD", "DATASCIENCE_STAGE")
cur = ctx.cursor()
```
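
Connections and cursors hold resources until they are closed. A minimal sketch of closing both automatically with `contextlib.closing`; it uses the standard-library `sqlite3` module as a stand-in for the Snowflake connection, on the assumption that both follow the Python DB-API (`cursor()`, `execute()`, `close()`). The helper name `run_query` is hypothetical.

```python
import sqlite3
from contextlib import closing


def run_query(connection, sql: str):
    """Run one statement and return all rows; the cursor is closed afterwards."""
    with closing(connection.cursor()) as cur:
        cur.execute(sql)
        return cur.fetchall()


# sqlite3 stands in for the object returned by conn.connect(...).
with closing(sqlite3.connect(":memory:")) as ctx:
    rows = run_query(ctx, "select 1 as one, 'a' as letter")

print(rows)
```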


Run a query and load the result into a pandas DataFrame:

```
# Execute a statement that will generate a result set.
querystr = '''
    select *
    from max_prod.content_intelligence.future_programming_schedule
    limit 2
'''
cur.execute(querystr)

# Fetch the result set from the cursor and deliver it as a pandas DataFrame.
# Column names come from the first field of each entry in cur.description.
columns = [col[0] for col in cur.description]
df = pd.DataFrame(cur.fetchall(), columns=columns)
display(df)

df.to_csv('test.csv')
```

Output:

| Unnamed: 0 | RELEASE_MONTH | CATEGORY | UNCLEANED_TITLE | TITLE | SEASON | TIER | NUM_EPISODES_RELEASED | NUM_HOURS_RELEASED | NUM_PREMIERING_TITLES |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-08-01 | DOCUMENTARY FEATURES | Class Action Park | Class Action Park | 0 | 3 | 1 | 2 | 1 |
| 1 | 2020-08-01 | DOCUMENTARY FEATURES | On the Trail: Inside the 2020 Primaries | On the Trail: Inside the 2020 Primaries | 0 | 3 | 1 | 2 | 1 |
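
`MAX_QUERY_RETURN_SIZE` is defined earlier, but the query relies on `limit 2` alone. One way to enforce a cap on the client side is `fetchmany`, which is part of the Python DB-API. A minimal sketch with a hypothetical helper `fetch_limited`, using `sqlite3` as a stand-in for the Snowflake cursor and a made-up `schedule` table:

```python
import sqlite3

MAX_QUERY_RETURN_SIZE = 1000000


def fetch_limited(cur, max_rows: int = MAX_QUERY_RETURN_SIZE):
    """Return (column_names, rows), reading at most max_rows rows."""
    columns = [col[0] for col in cur.description]
    rows = cur.fetchmany(max_rows)
    return columns, rows


# sqlite3 stands in for the Snowflake cursor in this sketch.
ctx = sqlite3.connect(":memory:")
cur = ctx.cursor()
cur.execute("create table schedule (title text, tier integer)")
cur.executemany("insert into schedule values (?, ?)", [("A", 1), ("B", 2), ("C", 3)])
cur.execute("select title, tier from schedule order by tier")

columns, rows = fetch_limited(cur, max_rows=2)
print(columns, rows)
ctx.close()
```

The returned pair can be passed straight to `pd.DataFrame(rows, columns=columns)`.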

