# Saving data to a Parquet file
* StellarAlgo Data Science
* Ryan Kazmerik & Nakisa Rad
* June 17, 2022

## Hypothesis
This notebook pulls the data used for training and scoring from Redshift and saves it to a Parquet file.

In [2]:
import boto3
import pandas as pd
import psycopg2

### To connect to Redshift, we first tell AWS which profile to log in with; this opens a browser window for SSO authentication:

In [3]:
! aws sso login --profile Stellaralgo-DataScienceAdmin

Attempting to automatically open the SSO authorization page in your default browser.
If the browser does not open or you wish to use a different device to authorize this request, open the following URL:

https://device.sso.us-east-1.amazonaws.com/

Then enter the code:

WKDS-KPZG
Successully logged into Start URL: https://stellaralgo.awsapps.com/start#/


### Now we can create a session and a Redshift client, and request the temporary database credentials that the Psycopg2 connection will use:

In [16]:
session = boto3.session.Session(profile_name='Stellaralgo-DataScienceAdmin')
client = session.client('redshift')

CLUSTER = 'prod-app'
DBNAME = 'stlrmilb'

creds = client.get_cluster_credentials(
    ClusterIdentifier=CLUSTER,
    DbUser='admin',
    DbName=DBNAME,
    DbGroups=['admin_group'],
    AutoCreate=True
)

print(f"GOT CONNECTION TO DATABASE: {CLUSTER} {DBNAME}")

GOT CONNECTION TO DATABASE: prod-app stlrmilb
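
The connection cell that follows hardcodes the cluster hostname. As a rough alternative, the endpoint could be looked up through the same Redshift client with `describe_clusters` and combined with the temporary credentials. This is only a sketch: the helper name `build_conn_kwargs` is hypothetical and not part of the original notebook, and it assumes the usual boto3 response shapes (`Clusters[0]['Endpoint']['Address']` and `['Port']`, plus `DbUser` and `DbPassword`).

```python
def build_conn_kwargs(creds, cluster_description, dbname):
    """Build keyword arguments for psycopg2.connect() (hypothetical helper).

    creds: response of client.get_cluster_credentials(...)
    cluster_description: response of client.describe_clusters(ClusterIdentifier=...)
    dbname: name of the database to connect to
    """
    endpoint = cluster_description["Clusters"][0]["Endpoint"]
    return {
        "host": endpoint["Address"],
        "port": endpoint["Port"],
        # Use the returned DbUser as-is: Redshift prefixes it (IAM: or IAMA:),
        # so it differs from the 'admin' value that was requested.
        "user": creds["DbUser"],
        "password": creds["DbPassword"],
        "database": dbname,
    }
```

With a live session this might be used as `conn = psycopg2.connect(**build_conn_kwargs(creds, client.describe_clusters(ClusterIdentifier=CLUSTER), DBNAME))`. The temporary password expires (the response includes an `Expiration` timestamp), so it is safest to request credentials shortly before connecting.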


### First we call the stored procedure, which opens a server-side cursor under the name we pass in; then we read the procedure's results through a named cursor with that same name:

In [17]:
# create a connection & cursor and call the stored procedure
conn = psycopg2.connect(
    host = 'prod-app.ctjussvyafp4.us-east-1.redshift.amazonaws.com',
    port = 5439,
    user = creds['DbUser'],
    password = creds['DbPassword'],
    database = DBNAME
)

cur = conn.cursor()
cur.execute("CALL dw.getretentionmodeldata(11, 2010, 2021, 'rkcursor')")

# create a named cursor based on the cursor name passed in above
named_cursor = conn.cursor('rkcursor')
data = named_cursor.fetchall()

# load the data and columns into a data frame
cols = [row[0] for row in named_cursor.description]
df = pd.DataFrame(data=data, columns=cols)

conn.commit()
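
The call-then-fetch pattern above could be wrapped in a small helper so that the procedure arguments are passed as query parameters and the rows are pulled in batches rather than with one `fetchall()`. This is a minimal sketch, not the notebook's own method: the function name `fetch_from_refcursor` and the `batch_size` default are assumptions, and it relies on the named cursor being read on the same connection before the transaction is committed.

```python
def fetch_from_refcursor(conn, call_sql, params, cursor_name, batch_size=10_000):
    """Call a stored procedure that opens a cursor, then read all of its rows.

    Hypothetical helper: conn is expected to behave like a psycopg2 connection.
    Returns (column_names, rows).
    """
    cur = conn.cursor()
    # Let the driver bind the parameters instead of formatting them into the SQL.
    cur.execute(call_sql, params)

    # Read from the cursor the procedure opened, within the same transaction.
    named_cursor = conn.cursor(cursor_name)
    rows = []
    while True:
        batch = named_cursor.fetchmany(batch_size)
        if not batch:
            break
        rows.extend(batch)

    # Column names are taken after fetching, as in the cell above.
    cols = [col[0] for col in named_cursor.description]

    # End the transaction once all rows have been read.
    conn.commit()
    return cols, rows
```

For the query in this notebook the call might look like `cols, rows = fetch_from_refcursor(conn, "CALL dw.getretentionmodeldata(%s, %s, %s, %s)", (11, 2010, 2021, "rkcursor"), "rkcursor")`, followed by `df = pd.DataFrame(data=rows, columns=cols)`.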

In [18]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31188 entries, 0 to 31187
Data columns (total 23 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   lkupclientid         31188 non-null  int64         
 1   dimcustomermasterid  31188 non-null  int64         
 2   year                 31188 non-null  int64         
 3   productgrouping      31188 non-null  object        
 4   totalspent           31188 non-null  float64       
 5   recentdate           31188 non-null  datetime64[ns]
 6   attendancepercent    31188 non-null  float64       
 7   renewedbeforedays    31186 non-null  float64       
 8   source_tenure        31188 non-null  object        
 9   tenure               31186 non-null  float64       
 10  disttovenue          31188 non-null  float64       
 11  recency              31188 non-null  int64         
 12  missed_games_1       31188 non-null  object        
 13  missed_games_2       31188 non-

| | lkupclientid | dimcustomermasterid | year | productgrouping | totalspent | recentdate | attendancepercent | renewedbeforedays | source_tenure | tenure | ... | missed_games_2 | missed_games_over_2 | forward_records | opentosendratio | clicktosendratio | clicktoopenratio | gender | phonecall | inperson_contact | isnextyear_buyer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 351977033 | 2014 | Mini Plan | 32.0 | 1970-01-01 | 1.0 | 0.0 | 1825 | 0.0 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | Unknown | 0 | 0 | 1 |
| 1 | 11 | 355679837 | 2014 | Mini Plan | 16.0 | 1970-01-01 | 1.0 | 0.0 | 1460 | 0.0 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | Unknown | 0 | 0 | 1 |
| 2 | 11 | 319764155 | 2014 | Group | 144.0 | 1970-01-01 | 0.91 | 2.0 | 2 | 2.0 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | Unknown | 0 | 0 | 0 |
| 3 | 11 | 351191248 | 2014 | Group | 40.5 | 1970-01-01 | 1.0 | 8.0 | 8 | 8.0 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | Unknown | 0 | 0 | 0 |
| 4 | 11 | 319744072 | 2014 | Group | 33.0 | 2014-05-01 | 0.33 | 1.0 | 1 | 1.0 | ... | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | Unknown | 0 | 0 | 0 |


In [19]:
team_name = 'Bulls'
df.to_parquet(f"Files/Prod_Retention_Redshift_{team_name}.parquet")
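
The write above assumes the `Files` folder already exists; if it does not, the call can fail. A small helper that builds the output path and creates the folder first is one way to guard against that. This is a sketch under that assumption: the name `parquet_path` and its `base_dir` parameter are hypothetical, and the file naming simply mirrors the cell above.

```python
from pathlib import Path


def parquet_path(team_name, base_dir="Files"):
    """Return the Parquet output path for a team, creating the folder if needed.

    Hypothetical helper; the file name pattern mirrors the notebook cell above.
    """
    out_dir = Path(base_dir)
    # Create the output folder (and any missing parents) before writing.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / f"Prod_Retention_Redshift_{team_name}.parquet"
```

It could then be used as `df.to_parquet(parquet_path(team_name))`, and the result checked by loading the file back with `pd.read_parquet(parquet_path(team_name))`.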