# AWS Starter
* StellarAlgo Data Science
* Peter Morrison
* May 30, 2022

This notebook does the following:
1. Reads all records from `datascience.ds.datadictionary` on the `qa-app` Redshift cluster and displays them in a Pandas DataFrame
2. Removes a requested set of features, along with the `featureimportanceid` column
3. Writes the remaining records to `datadictionary.json`

In [1]:
import awswrangler as wr
import boto3
import pandas as pd

from IPython.display import display

In [2]:
! aws sso login --profile Stellaralgo-DataScienceAdmin

Attempting to automatically open the SSO authorization page in your default browser.
If the browser does not open or you wish to use a different device to authorize this request, open the following URL:

https://device.sso.us-east-1.amazonaws.com/

Then enter the code:

MZPT-HBRW
Successfully logged into Start URL: https://stellaralgo.awsapps.com/start#/


In [3]:
TABLE_TO_MIGRATE = 'datadictionary'

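# setup_default_session() configures boto3's default session in place and returns None,
# so its result is not assigned.
boto3.setup_default_session(profile_name='Stellaralgo-DataScienceAdmin')
client = boto3.client('redshift')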

cluster = 'qa-app'
dbname = 'datascience'
schema = 'ds'
    

conn = wr.data_api.redshift.connect(
    cluster_id = cluster,
    database = dbname,
    db_user = "admin"
)


In [4]:
def execute_sql(statement: str) -> pd.DataFrame:
    """Executes the passed SQL statement on Redshift.

    Args:
        statement (str): SQL statement to run over the module-level `conn`.

    Returns:
        pd.DataFrame: Redshift result as a dataframe.
    """
    return wr.data_api.redshift.read_sql_query(sql=statement, con=conn)

## Reading Data

Here we read every row of the table from Redshift, which covers step 1.

In [5]:
read_all_statement = f'SELECT * FROM {dbname}.{schema}.{TABLE_TO_MIGRATE}'

# Use the execute_sql helper so the Redshift query call is not duplicated
df = execute_sql(read_all_statement)
display(df)

| | featureimportanceid | feature | featureraw | datatype | source | engineered | description |
|---|---|---|---|---|---|---|---|
| 0 | 1000 | Tenure | tenure | int64 | Ticketing | true | Days between first ticketing purchase date to ... |
| 1 | 1024 | Annual Household Income | annualHHIncome | object | Demographic Data | true | Total amount of annual income in the household |
| 2 | 1028 | Attendance | attendancePercent | float64 | Ticketing | true | The attendance percentage for the season |
| 3 | 1032 | Cancel Records | cancel_records | int64 | Secondary | true | Total number of Ticket Exchange cancel records... |
| 4 | 1036 | Children in Household | childrenPresentInHH | int64 | Demographic Data | false | Total number of childerns in a household |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 91 | 1392 | Not Married | maritalStatus_0 | object | Pycaret | true | Fan is not married |
| 92 | 1396 | Married | maritalStatus_1 | object | Pycaret | true | Fan is married |
| 93 | 1400 | No Post Secondary Education | education_0 | object | Pycaret | true | Fan has a record of post secondary education |
| 94 | 1404 | Post Secondary Education | education_1 | object | Pycaret | true | Fan does not have record of post secondary edu... |


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   featureimportanceid  96 non-null     int64 
 1   feature              96 non-null     object
 2   featureraw           96 non-null     object
 3   datatype             96 non-null     object
 4   source               96 non-null     object
 5   engineered           96 non-null     object
 6   description          96 non-null     object
dtypes: int64(1), object(6)
memory usage: 5.4+ KB


## Removing features
Here we remove the following features, as requested by Ryan:
* `annualHHIncome`
* `cancel_records`
* `childrenPresentInHH`
* `education`
* `email`
* `fill_out_form`
* `internal_note`
* `left_message`
* `lengthOfResidenceInYrs`
* `maritalStatus`
* `posting_records`
* `resale_atp`
* `resale_records`

We also drop the `featureimportanceid` column.

In [14]:
features_to_remove = [
    "annualHHIncome",
    "cancel_records",
    "childrenPresentInHH",
    "education",
    "email",
    "fill_out_form",
    "internal_note",
    "left_message",
    "lengthOfResidenceInYrs",
    "maritalStatus",
    "posting_records",
    "resale_atp",
    "resale_records",
]

# isin() compares featureraw values exactly, so rows such as maritalStatus_0 are kept
df_with_desired_features = df[~df['featureraw'].isin(features_to_remove)]
# Name the column with columns=; passing the axis positionally to drop() is deprecated
df_with_desired_features = df_with_desired_features.drop(columns="featureimportanceid")
display(df_with_desired_features)


| | feature | featureraw | datatype | source | engineered | description |
|---|---|---|---|---|---|---|
| 0 | Tenure | tenure | int64 | Ticketing | true | Days between first ticketing purchase date to ... |
| 2 | Attendance | attendancePercent | float64 | Ticketing | true | The attendance percentage for the season |
| 5 | Click Link | click_link | int64 | Marketing | true | Total number of activities a customer clicked ... |
| 6 | Click to Open Ratio | clickToOpenRatio | float64 | Marketing | true | Total number of clicked marketing email devide... |
| 7 | Click to Send Ratio | clickToSendRatio | float64 | Marketing | true | Total number of clicked marketing email devide... |
| ... | ... | ... | ... | ... | ... | ... |
| 91 | Not Married | maritalStatus_0 | object | Pycaret | true | Fan is not married |
| 92 | Married | maritalStatus_1 | object | Pycaret | true | Fan is married |
| 93 | No Post Secondary Education | education_0 | object | Pycaret | true | Fan has a record of post secondary education |
| 94 | Post Secondary Education | education_1 | object | Pycaret | true | Fan does not have record of post secondary edu... |
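
Rows 91 to 94 are still present in this output even though `maritalStatus` and `education` are in the removal list, because `isin` only matches `featureraw` values exactly. Whether the encoded variants (`maritalStatus_0`, `education_1`, and so on) were also meant to be removed is not stated here. If they were, a minimal sketch of one way to catch them follows. It runs on a toy frame with illustrative values, and it assumes every encoded column is named `<base name>_<suffix>`; the helper `matches_base_or_encoded` is hypothetical and not part of this notebook.

```python
import pandas as pd

# Toy stand-in for the data dictionary; values are illustrative only.
toy = pd.DataFrame({
    "feature": ["Tenure", "Marital Status", "Not Married", "Married", "Post Secondary Education"],
    "featureraw": ["tenure", "maritalStatus", "maritalStatus_0", "maritalStatus_1", "education_1"],
})

base_names = ["maritalStatus", "education"]


def matches_base_or_encoded(raw, names):
    """Return True if raw equals a base name or is that name followed by '_<suffix>'."""
    return any(raw == name or raw.startswith(name + "_") for name in names)


# Exact matching, as in the cell above: the encoded rows are kept.
kept_exact = toy[~toy["featureraw"].isin(base_names)]

# Prefix-aware matching (assumed naming pattern): the encoded rows are removed too.
mask = toy["featureraw"].map(lambda raw: matches_base_or_encoded(raw, base_names))
kept_prefix = toy[~mask]
```

The prefix rule would also drop any unrelated feature whose name happens to start with `education_` or `maritalStatus_`, so it is worth checking the matched rows before relying on it.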


## Writing to JSON
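Here we write the filtered data dictionary to `datadictionary.json`.

The path is relative, so the file lands in the notebook's current working directory. Make sure that is the root dir of data-sci, or move the file there afterwards.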

In [15]:
df_with_desired_features.to_json("datadictionary.json", orient="records")
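
To see what `orient="records"` produces, the file can be read back as plain JSON. The sketch below is not part of the original notebook: it uses a toy frame and a temporary directory (both assumptions for illustration) instead of the real `datadictionary.json`.

```python
import json
import os
import tempfile

import pandas as pd

# Toy stand-in for df_with_desired_features; note the non-default index (0, 2),
# similar to the gaps left behind after filtering rows.
toy = pd.DataFrame(
    {
        "feature": ["Tenure", "Attendance"],
        "featureraw": ["tenure", "attendancePercent"],
        "datatype": ["int64", "float64"],
    },
    index=[0, 2],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "datadictionary.json")
    toy.to_json(path, orient="records")
    with open(path) as fh:
        records = json.load(fh)

# Each row becomes one JSON object keyed by column name; the index is not written.
first_record = records[0]
```

With this orientation the file is a JSON array with one object per row, so the original row labels (0, 2, 5, ...) are not preserved.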