# Saving data to a Parquet file
* StellarAlgo Data Science
* Ryan Kazmerik & Nakisa Rad
* June 17, 2022

## Hypothesis
This notebook pulls the data used for training and scoring from Redshift and saves it to a Parquet file.

In [13]:
import boto3
import pandas as pd
import psycopg2

### To connect to Redshift, we first tell AWS which profile to log in with; this opens a browser window for SSO authentication:

In [8]:
! aws sso login --profile Stellaralgo-DataScienceAdmin

Attempting to automatically open the SSO authorization page in your default browser.
If the browser does not open or you wish to use a different device to authorize this request, open the following URL:

https://device.sso.us-east-1.amazonaws.com/

Then enter the code:

KGBQ-SVRC
Successully logged into Start URL: https://stellaralgo.awsapps.com/start#/


### Now we can create a session and a Redshift client, and request temporary credentials for the Psycopg2 connection:

In [14]:
session = boto3.session.Session(profile_name='Stellaralgo-DataScienceAdmin')
client = session.client('redshift')

CLUSTER = 'prod-app'
DBNAME = 'stlrcfl'

creds = client.get_cluster_credentials(                
    ClusterIdentifier = CLUSTER,
    DbUser = 'admin',
    DbName = DBNAME,
    DbGroups = ['admin_group'],
    AutoCreate=True
)

print(f"GOT CONNECTION TO DATABASE: {CLUSTER} {DBNAME}")

GOT CONNECTION TO DATABASE: prod-app stlrcfl
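
The credentials returned by `get_cluster_credentials` are temporary, so it can help to check how long they remain valid before connecting. The sketch below assumes the response dictionary carries the `DbUser`, `DbPassword`, and `Expiration` keys described in the boto3 documentation. The helper names `seconds_until_expiry` and `build_connect_kwargs` are hypothetical and are not part of this notebook.

```python
from datetime import datetime, timezone


def seconds_until_expiry(creds, now=None):
    """Return how many seconds remain before the temporary credentials expire."""
    now = now or datetime.now(timezone.utc)
    expiration = creds["Expiration"]
    # Assumption: treat a naive timestamp as UTC so the subtraction is valid.
    if expiration.tzinfo is None:
        expiration = expiration.replace(tzinfo=timezone.utc)
    return (expiration - now).total_seconds()


def build_connect_kwargs(creds, host, dbname, port=5439):
    """Collect the keyword arguments that psycopg2.connect() is called with."""
    return {
        "host": host,
        "port": port,
        "user": creds["DbUser"],
        "password": creds["DbPassword"],
        "database": dbname,
    }
```

With these helpers, the connection cell could call `psycopg2.connect(**build_connect_kwargs(creds, host, DBNAME))` and request fresh credentials first if `seconds_until_expiry(creds)` is close to zero.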


### First we call the stored procedure, which opens a named cursor; we can then fetch the stored procedure's results through that named cursor:

In [15]:
# create a connection & cursor and call the stored procedure
conn = psycopg2.connect(
    host = 'prod-app.ctjussvyafp4.us-east-1.redshift.amazonaws.com',
    port = 5439,
    user = creds['DbUser'],
    password = creds['DbPassword'],
    database = DBNAME
)

cur = conn.cursor()
cur.execute("CALL dw.getretentionmodeldata(35, 2010, 2022, 'rkcursor')")

# create a named cursor based on the cursor name passed in above
named_cursor = conn.cursor('rkcursor')
data = named_cursor.fetchall()

# load the data and columns into a data frame
cols = [row[0] for row in named_cursor.description]
df = pd.DataFrame(data=data, columns=cols)

conn.commit()
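
For larger result sets, an alternative to `fetchall()` is to read the named cursor in batches with `fetchmany()`. The sketch below is one way to do that, not the method used in this notebook. `fetch_in_batches` is a hypothetical helper, and `FakeCursor` is a stand-in that mimics only the `fetchmany` method and `description` attribute of a database cursor so the example runs without a database.

```python
def fetch_in_batches(cursor, batch_size=5000):
    """Read every row from a cursor in fixed-size batches.

    Returns the rows and the column names taken from cursor.description.
    """
    rows = []
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        rows.extend(batch)
    # Read the description after fetching, once the result set is known.
    cols = [col[0] for col in cursor.description]
    return rows, cols


class FakeCursor:
    """Hypothetical stand-in for a database cursor, for illustration only."""

    def __init__(self, rows, columns):
        self._rows = list(rows)
        self.description = [(name,) for name in columns]

    def fetchmany(self, size):
        batch, self._rows = self._rows[:size], self._rows[size:]
        return batch


demo = FakeCursor([(35, 2021), (35, 2018), (35, 2022)], ["lkupclientid", "year"])
rows, cols = fetch_in_batches(demo, batch_size=2)
```

With a real connection, the same helper could be called as `fetch_in_batches(named_cursor)` and its output passed to `pd.DataFrame(data=rows, columns=cols)`.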

In [16]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29604 entries, 0 to 29603
Data columns (total 23 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   lkupclientid         29604 non-null  int64         
 1   dimcustomermasterid  29604 non-null  int64         
 2   year                 29604 non-null  int64         
 3   productgrouping      29604 non-null  object        
 4   totalspent           29604 non-null  float64       
 5   recentdate           29604 non-null  datetime64[ns]
 6   attendancepercent    29604 non-null  float64       
 7   renewedbeforedays    29604 non-null  int64         
 8   source_tenure        29604 non-null  object        
 9   tenure               29604 non-null  int64         
 10  disttovenue          29604 non-null  float64       
 11  recency              29604 non-null  int64         
 12  missed_games_1       29604 non-null  object        
 13  missed_games_2       29604 non-

| | lkupclientid | dimcustomermasterid | year | productgrouping | totalspent | recentdate | attendancepercent | renewedbeforedays | source_tenure | tenure | ... | missed_games_2 | missed_games_over_2 | forward_records | opentosendratio | clicktosendratio | clicktoopenratio | gender | phonecall | inperson_contact | isnextyear_buyer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35 | 346078778 | 2021 | Full Season | 420.9 | 2021-11-20 | 0.73 | 22 | 1460 | 1251 | ... | 0 | 2 | 0 | 0.04 | 0.0 | 0.0 | Unknown | 0 | 0 | 1 |
| 1 | 35 | 346119210 | 2021 | Full Season | 794.5 | 1970-01-01 | 1.17 | 57 | 4380 | 1753 | ... | 0 | 1 | 5 | 2.07 | 0.55 | 0.27 | Unknown | 0 | 0 | 1 |
| 2 | 35 | 346130415 | 2018 | Full Season | 1184.22 | 2018-06-28 | 0.94 | 151 | 4380 | 660 | ... | 0 | 3 | 0 | 1.0 | 0.0 | 0.0 | Unknown | 0 | 0 | 1 |
| 3 | 35 | 346111500 | 2021 | Full Season | 221.36 | 1970-01-01 | 0.75 | 47 | 4380 | 1729 | ... | 0 | 1 | 0 | 2.4 | 0.0 | 0.0 | Unknown | 0 | 0 | 0 |
| 4 | 35 | 346208799 | 2022 | Full Season | 297.45 | 2022-06-09 | 0.0 | 182 | 4380 | 2158 | ... | 0 | 0 | 0 | 3.5 | 0.5 | 0.14 | Unknown | 0 | 0 | 0 |


In [17]:
team_name = 'Stamps'
df.to_parquet(f"Files/Prod_Retention_Redshift_{team_name}.parquet")
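
The write above may fail if the `Files` directory does not exist yet. A minimal sketch of a guard, assuming the same file naming pattern as above, is shown below. `parquet_path` is a hypothetical helper and is not part of this notebook.

```python
from pathlib import Path


def parquet_path(team_name, out_dir="Files"):
    """Create the output directory if needed and return the Parquet file path."""
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    return directory / f"Prod_Retention_Redshift_{team_name}.parquet"


# Usage in the notebook (requires the DataFrame built earlier):
# df.to_parquet(parquet_path(team_name))
```

Building the path in one place also makes it easier to read the file back later with `pd.read_parquet` using the same helper.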