# SageMaker notebook Git, Snowflake, and S3 setup

## Git

1. Generate an SSH key pair on SageMaker and add the public key to GitHub to allow access. \
    a. If there is no `~/.ssh/id_rsa`, run `ssh-keygen -o`. \
    b. Navigate to `~/.ssh`, print the public key with `cat id_rsa.pub`, and copy it. \
    c. On GitHub, click your profile -> Settings -> SSH and GPG keys. Add a new SSH key and paste in the key from step b. \
2. On your notebook instance, open a terminal and clone your Git repo (`git clone ...`).
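
Steps 1a and 1b can also be checked from a notebook cell. The following is a minimal sketch, not part of the original setup: `read_public_key` is a hypothetical helper that returns the contents of `id_rsa.pub` if the file exists and `None` otherwise, in which case you would still run `ssh-keygen -o` in a terminal.

```python
from pathlib import Path
from typing import Optional


def read_public_key(ssh_dir: Optional[Path] = None,
                    key_name: str = "id_rsa") -> Optional[str]:
    """Return the public key text, or None if no key pair exists yet."""
    if ssh_dir is None:
        ssh_dir = Path.home() / ".ssh"  # default location used in step 1
    public_key = Path(ssh_dir) / f"{key_name}.pub"
    if not public_key.is_file():
        return None
    return public_key.read_text().strip()


# Usage in a notebook cell:
#   key = read_public_key()
#   print(key if key else "No key found: run `ssh-keygen -o` first.")
```

The returned string is what gets pasted into GitHub's new SSH key form in step 1c.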

## S3

In the AWS console, open S3 and create a folder for yourself under `datascience-hbo-users/users`. From SageMaker, that folder is available under the `s3://datascience-hbo-users/users/` prefix. You can also use `aws s3` commands for general file management (`aws s3 rm`, `aws s3 cp`, etc.).

For example: \
Read a file: `pd.read_csv('s3://datascience-hbo-users/users/tjung/test_s.csv')` \
Delete a file: `!aws s3 rm s3://datascience-hbo-users/users/tjung/test.csv`
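
To avoid retyping the bucket and prefix, the paths above can be built with a small helper. This is a sketch only: `user_s3_uri` is a hypothetical function, not part of any library, and the bucket name and `users/<username>/` layout are taken from the examples above.

```python
BUCKET = "datascience-hbo-users"  # bucket name from the examples above


def user_s3_uri(username: str, filename: str, bucket: str = BUCKET) -> str:
    """Build s3://<bucket>/users/<username>/<filename> for a file in your folder."""
    if not username or "/" in username:
        raise ValueError("username must be a single, non-empty path segment")
    return f"s3://{bucket}/users/{username}/{filename.lstrip('/')}"


uri = user_s3_uri("tjung", "test_s.csv")
print(uri)

# In a notebook, the URI can be passed straight to pandas
# (reading s3:// paths requires the s3fs package):
#   df = pd.read_csv(uri)
#   df.to_csv(user_s3_uri("tjung", "out.csv"), index=False)
```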

## Plotly (optional)


1. Remove the custom kernel spec manager from the JupyterLab config so that Jupyter falls back to the default `jupyter_client.kernelspec.KernelSpecManager`.
    1. In a terminal, find the configuration directories with `jupyter --paths`.
    2. Find the config file, `jupyter_notebook_config.json` or `jupyter_config.json` (likely `/home/ec2-user/anaconda3/envs/JupyterSystemEnv/etc/jupyter/jupyter_config.json`).
    3. Delete the following line from the file: `"kernel_spec_manager_class": "nb_conda_kernels.CondaKernelSpecManager"`

2. Enable the Extension Manager under JupyterLab's Settings tab.
3. Run the following in a terminal:

    ```bash
    #!/bin/bash
    conda install "ipywidgets=7.5" --yes
    export NODE_OPTIONS=--max-old-space-size=4096
    jupyter labextension install @jupyter-widgets/jupyterlab-manager@1.1 --no-build
    jupyter labextension install jupyterlab-plotly@4.6.0 --no-build
    jupyter labextension install plotlywidget@4.6.0 --no-build
    jupyter lab build
    unset NODE_OPTIONS
    ```
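
Step 1.3 above (deleting the `kernel_spec_manager_class` line) can also be scripted. The sketch below removes one key from a JSON config file wherever it is nested and keeps a `.bak` copy of the original. `remove_config_key` and `_drop_key` are hypothetical helpers, and the exact location of the key inside the file is an assumption, so open the file and confirm before running it.

```python
import json
import shutil
from pathlib import Path


def _drop_key(node, key: str) -> int:
    """Recursively delete `key` from nested dicts/lists; return the number removed."""
    removed = 0
    if isinstance(node, dict):
        if key in node:
            del node[key]
            removed += 1
        for value in node.values():
            removed += _drop_key(value, key)
    elif isinstance(node, list):
        for item in node:
            removed += _drop_key(item, key)
    return removed


def remove_config_key(config_path, key: str = "kernel_spec_manager_class") -> int:
    """Remove `key` from a JSON config file, saving a .bak copy before writing."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    removed = _drop_key(config, key)
    if removed:
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        path.write_text(json.dumps(config, indent=2) + "\n")
    return removed


# Usage, with the likely path from step 1.2:
#   remove_config_key(
#       "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/etc/jupyter/jupyter_config.json")
```

The function returns 0 and leaves the file untouched when the key is not present, so it is safe to run more than once.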


## Snowflake

    # Run the following pip install commands, then restart the notebook kernel.
    !pip install snowflake --user
    !pip install snowflake-connector-python --user
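
After restarting the kernel, it can help to confirm that the connector is importable before running the rest of the notebook. A minimal sketch using only the standard library; `is_importable` is a hypothetical helper name.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located by the current kernel."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. `snowflake`) is itself missing.
        return False


if not is_importable("snowflake.connector"):
    print("snowflake.connector not found: re-run the pip install cell "
          "and restart the kernel.")
```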


    import json
    from abc import ABCMeta, abstractmethod

    import boto3
    import pandas as pd
    import snowflake.connector

    # Limit the size of returned records
    MAX_QUERY_RETURN_SIZE = 1000000


    class Credentials(metaclass=ABCMeta):
        """Interface for credential providers used by the connectors."""

        @abstractmethod
        def get_keys(self):
            raise NotImplementedError


    class SSMPSCredentials(Credentials):
        def __init__(self, secretid: str):
            self._secretid = secretid
            self._secrets = {}

        def get_keys(self):
            """Fetch the secret from AWS Secrets Manager."""
            _aws_sm_args = {'service_name': 'secretsmanager', 'region_name': 'us-east-1'}
            secrets_client = boto3.client(**_aws_sm_args)
            get_secret_value_response = secrets_client.get_secret_value(SecretId=self._secretid)
            return get_secret_value_response


    class BaseConnector(metaclass=ABCMeta):
        @abstractmethod
        def connect(self):
            raise NotImplementedError


    class SnowflakeConnector(BaseConnector):
        def __init__(self, credentials: Credentials):
            keys = credentials.get_keys()
            # The secret is stored as a JSON string under 'SecretString'.
            self._secrets = json.loads(keys.get('SecretString', "{}"))

        def connect(self, dbname: str, schema: str = 'DEFAULT'):
            ctx = snowflake.connector.connect(
                user=self._secrets['login_name'],
                password=self._secrets['login_password'],
                account=self._secrets['account'],
                warehouse=self._secrets['warehouse'],
                database=dbname,
                schema=schema
            )

            return ctx
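
`SnowflakeConnector` expects the secret to be a JSON object with the keys `login_name`, `login_password`, `account`, and `warehouse`. The sketch below validates a `get_secret_value` response up front, so that a misconfigured secret fails with a clear message rather than a bare `KeyError` inside `connect`. `parse_snowflake_secret` and `REQUIRED_KEYS` are hypothetical names introduced for illustration, and the sample response is made up.

```python
import json

# Keys that SnowflakeConnector.connect reads from the secret.
REQUIRED_KEYS = ("login_name", "login_password", "account", "warehouse")


def parse_snowflake_secret(response: dict) -> dict:
    """Decode the SecretString of a Secrets Manager response and check its keys."""
    secrets = json.loads(response.get("SecretString", "{}"))
    missing = [key for key in REQUIRED_KEYS if key not in secrets]
    if missing:
        raise KeyError(f"secret is missing required keys: {', '.join(missing)}")
    return secrets


# Made-up response with the same shape as boto3's get_secret_value output.
fake_response = {
    "SecretString": json.dumps({
        "login_name": "svc_user",
        "login_password": "not-a-real-password",
        "account": "example_account",
        "warehouse": "EXAMPLE_WH",
    })
}
secrets = parse_snowflake_secret(fake_response)
print(sorted(secrets))
```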
    


    # Name of the Secrets Manager secret that holds the Snowflake credentials
    SF_CREDS = 'datascience-max-dev-sagemaker-notebooks'

    # Snowflake connection
    conn = SnowflakeConnector(SSMPSCredentials(SF_CREDS))
    ctx = conn.connect("MAX_PROD", "DATASCIENCE_STAGE")
    cur = ctx.cursor()


    # Execute a statement that generates a result set.
    querystr = '''
        select *
        from max_prod.content_intelligence.future_programming_schedule
        limit 2
    '''
    cur.execute(querystr)

    # Fetch the result set from the cursor and load it into a pandas DataFrame.
    # Column names come from the first field of each entry in cur.description.
    columns = [col[0] for col in cur.description]
    df = pd.DataFrame(cur.fetchall(), columns=columns)
    display(df)

    df.to_csv('test.csv')

| | RELEASE_MONTH | CATEGORY | UNCLEANED_TITLE | TITLE | SEASON | TIER | NUM_EPISODES_RELEASED | NUM_HOURS_RELEASED | NUM_PREMIERING_TITLES |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-08-01 | DOCUMENTARY FEATURES | Class Action Park | Class Action Park | 0 | 3 | 1 | 2 | 1 |
| 1 | 2020-08-01 | DOCUMENTARY FEATURES | On the Trail: Inside the 2020 Primaries | On the Trail: Inside the 2020 Primaries | 0 | 3 | 1 | 2 | 1 |
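
`MAX_QUERY_RETURN_SIZE` is defined in the setup cell, but the query cell calls `fetchall()`, which loads every row. One way to enforce the limit is `fetchmany`, which is part of the Python DB-API that the Snowflake cursor follows. The sketch below uses a stand-in cursor so that it runs without a Snowflake connection; `fetch_limited` and `FakeCursor` are hypothetical names.

```python
MAX_QUERY_RETURN_SIZE = 1000000


def fetch_limited(cur, max_rows: int = MAX_QUERY_RETURN_SIZE):
    """Return (column_names, rows), reading at most `max_rows` rows from an executed cursor."""
    columns = [col[0] for col in cur.description]
    rows = cur.fetchmany(max_rows)
    return columns, rows


class FakeCursor:
    """Stand-in for a DB-API cursor, used only to make this example runnable."""

    description = [("RELEASE_MONTH",), ("TITLE",)]

    def __init__(self, rows):
        self._rows = list(rows)

    def fetchmany(self, size):
        batch, self._rows = self._rows[:size], self._rows[size:]
        return batch


fake_cur = FakeCursor([
    ("2020-08-01", "Class Action Park"),
    ("2020-08-01", "On the Trail: Inside the 2020 Primaries"),
])
columns, rows = fetch_limited(fake_cur, max_rows=1)
print(columns, len(rows))
```

With a real cursor, `pd.DataFrame(rows, columns=columns)` builds the DataFrame as before, and calling `cur.close()` and `ctx.close()` when you are done releases the Snowflake session.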
