# Clustering Visualization
* StellarAlgo Data Science
* Peter Morrison
* June 24, 2022

## Loading Data and Creating the Model
We load an example dataset to produce simple clustering results for the visualizations that follow.

In [1]:
import getpass
import pandas as pd
import pyodbc
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from pycaret.clustering import *

In [2]:
# Connect to SQL Server. The password is prompted for at run time so it is never stored in the notebook.
SERVER = '52.44.171.130'
DATABASE = 'datascience'
USERNAME = 'dsAdminWrite'
PASSWORD = getpass.getpass(prompt='Enter your password')
CNXN = pyodbc.connect(
    f'DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={SERVER};DATABASE={DATABASE};UID={USERNAME};PWD={PASSWORD}'
)
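
The connection settings above are hardcoded in the cell. As a minimal sketch, assuming the settings are supplied through environment variables instead (the names `DS_SQL_SERVER`, `DS_SQL_DATABASE`, and `DS_SQL_USER` and the helper `build_connection_string` are hypothetical and not part of the original notebook), the connection string could be assembled like this:

```python
import os


def build_connection_string(password, env=None):
    """Assemble an ODBC connection string from environment-style settings.

    The DS_SQL_* variable names are hypothetical; adapt them as needed.
    """
    env = os.environ if env is None else env
    parts = {
        "DRIVER": "{ODBC Driver 17 for SQL Server}",
        "SERVER": env["DS_SQL_SERVER"],
        "DATABASE": env["DS_SQL_DATABASE"],
        "UID": env["DS_SQL_USER"],
        "PWD": password,
    }
    # Dicts keep insertion order, so the keys appear in the order listed above.
    return ";".join(f"{key}={value}" for key, value in parts.items())


# Illustrative values only; a mapping is passed in so the sketch runs anywhere.
example = build_connection_string(
    "not-a-real-password",
    env={
        "DS_SQL_SERVER": "localhost",
        "DS_SQL_DATABASE": "datascience",
        "DS_SQL_USER": "reader",
    },
)
print(example)
```

In the notebook, the result would be passed on as `pyodbc.connect(build_connection_string(PASSWORD))`.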

In [3]:
TEAMS = [
    {
        "mssql_dbname": "stlrMLS",
        "lkupclientid": "31",
        "clientcode": "sacfc",
        "train_year": 2021,
        "test_year": 2022
    }
]

In [4]:
print("GETTING TEAM DATASETS:")

team_datasets = []
for team in TEAMS:

    # Call the team's retention-scoring stored procedure for this client id.
    storedProc = (
        f"""Exec {team["mssql_dbname"]}.[ds].[getRetentionScoringModelData] {team["lkupclientid"]}"""
    )

    # pd.read_sql runs the query on the connection itself, so no explicit cursor is needed.
    df = pd.read_sql(storedProc, CNXN)

    # Keep only seasons up to and including the training year.
    df["year"] = pd.to_numeric(df["year"])
    df = df[df["year"] <= team["train_year"]]

    print(f" > ADDING TEAM TO DATASET: {team['clientcode']}")

    # End the transaction that the read opened on the connection.
    CNXN.commit()

    team_datasets.append(df)

print(f"TOTAL TEAMS IN DATASET: {len(team_datasets)}")

GETTING TEAM DATASETS:
 > ADDING TEAM TO DATASET: sacfc
TOTAL TEAMS IN DATASET: 1
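
The loop above interpolates the database name and client id directly into the SQL text. A hedged alternative, assuming the stored procedure takes the client id as its single positional argument (as the call above suggests), is to send the id as a query parameter and to validate the database name, since a database name cannot be sent as a parameter. The helper `build_team_query` is hypothetical and only illustrates the idea:

```python
import re

# Accept plain SQL identifiers only (letters, digits, underscore).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_team_query(team):
    """Return (sql, params) for one team's stored-procedure call."""
    db_name = team["mssql_dbname"]
    if not _IDENTIFIER.fullmatch(db_name):
        raise ValueError(f"Unexpected database name: {db_name!r}")
    # pyodbc uses "?" as its parameter placeholder.
    sql = f"EXEC [{db_name}].[ds].[getRetentionScoringModelData] ?"
    params = [int(team["lkupclientid"])]
    return sql, params


sql, params = build_team_query({"mssql_dbname": "stlrMLS", "lkupclientid": "31"})
print(sql, params)
```

The pair can then be handed to pandas as `pd.read_sql(sql, CNXN, params=params)`.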


In [5]:
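# ignore_index=True gives every row a unique label, so the later drop(data.index)
# removes exactly the sampled rows even when several teams are concatenated.
df_dataset = pd.concat(team_datasets, ignore_index=True)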

print(df_dataset.shape)
print(df_dataset.year.value_counts())

(10409, 53)
2017    2358
2018    2202
2021    2078
2020    1988
2019    1767
2016      16
Name: year, dtype: int64


In [6]:
df_dataset.head()

Unnamed: 0,lkupClientId,dimCustomerMasterId,year,productGrouping,totalSpent,recentDate,attendancePercent,renewedBeforeDays,isBuyer,source_tenure,tenure,distToVenue,totalGames,recency,missed_games_1,missed_games_2,missed_games_over_2,click_link,fill_out_form,open_email,send_email,unsubscribe_email,openToSendRatio,clickToSendRatio,clickToOpenRatio,posting_records,resale_records,resale_atp,forward_records,cancel_records,email,inbound_email,inbound_phonecall,inperson_contact,internal_note,left_message,outbound_email,outbound_phonecall,phonecall,text,unknown,gender,childrenPresentInHH,maritalStatus,lengthOfResidenceInYrs,annualHHIncome,education,urbanicity,credits_after_refund,is_Lockdown,NumberofGamesPerSeason,CNTPostponedGames,isNextYear_Buyer
0,31,441555341,2016,Mini/Flex Plan,171.0,2016-09-17,1.0,2,True,2190,23,214.25574,2,0,1,0,0,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,4,0,0,0,2,0,0,M,1,1,,,,,0.0,0,17,,0
1,31,441565511,2016,Mini/Flex Plan,470.0,2016-09-03,0.5,0,True,2190,101,214.25574,2,0,0,1,0,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,M,1,1,,,,,0.0,0,17,,0
2,31,441573182,2016,Mini/Flex Plan,560.0,2016-09-17,0.666667,4,True,2190,32,214.25574,2,1,2,0,0,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,Unknown,1,1,,,,,0.0,0,17,,0
3,31,441621284,2016,Mini/Flex Plan,525.0,2016-09-17,1.0,8,True,2190,50,214.25574,4,0,1,0,0,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,Unknown,1,1,,,,,0.0,0,17,,0
4,31,441583206,2016,Mini/Flex Plan,514.0,2016-09-17,0.5,2,True,2190,30,214.25574,2,1,0,0,1,0,0,0,0,0,0.0,0.0,0.0,0,0,0.0,0,0,4,0,0,0,0,0,0,0,0,0,0,M,1,1,,,,,0.0,0,17,,0


## Creating Model
In this section we hold out a small sample of the data and set up the clustering experiment with PyCaret.

In [10]:
data = df_dataset.sample(frac=0.95, random_state=786)
data_unseen = df_dataset.drop(data.index)

data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))

Data for Modeling: (9889, 53)
Unseen Data For Predictions: (520, 53)
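
The split above relies on `drop(data.index)` removing exactly the rows that were sampled, which only holds when the index labels are unique. A minimal sketch on synthetic data (the frames and the helper `holdout_split` are made up for illustration and are not the notebook's dataset) shows the same pattern with a freshly numbered index:

```python
import pandas as pd


def holdout_split(df, frac=0.95, seed=786):
    """Sample `frac` of the rows for modeling and keep the rest as unseen data."""
    data = df.sample(frac=frac, random_state=seed)
    data_unseen = df.drop(data.index)  # needs unique index labels
    return data.reset_index(drop=True), data_unseen.reset_index(drop=True)


# Two synthetic "team" frames; each starts with its own 0..99 index.
team_a = pd.DataFrame({"year": [2020] * 100, "totalSpent": range(100)})
team_b = pd.DataFrame({"year": [2021] * 100, "totalSpent": range(100, 200)})

# ignore_index=True renumbers the rows 0..199 so no label appears twice.
combined = pd.concat([team_a, team_b], ignore_index=True)

data, data_unseen = holdout_split(combined)
print(data.shape, data_unseen.shape)
```

Without `ignore_index=True`, each label would occur twice in `combined`, and dropping by label would remove more rows than were sampled.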


In [11]:
cluster = setup(
    data,
    normalize=True,
    ignore_features=['lkupClientId', 'dimCustomerMasterId'],
    session_id=7652,
)

IntProgress(value=0, description='Processing: ', max=3)

ValueError: Setting a random_state has no effect since shuffle is False. You should leave random_state to its default (None), or set shuffle=True.
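
The wording of this `ValueError` matches a check that scikit-learn applies when a K-fold splitter is given a `random_state` without `shuffle=True`. One plausible cause, which is an assumption and not confirmed by the notebook, is a mismatch between the installed PyCaret and scikit-learn versions. A minimal sketch for recording the installed versions before investigating further (the helper name `installed_versions` is hypothetical):

```python
from importlib import metadata


def installed_versions(packages=("pycaret", "scikit-learn", "pandas")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions()
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Comparing the reported versions against the requirements of the installed PyCaret release is a reasonable first step.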

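## Conclusions
The `setup` call above raised a `ValueError`, so the experiment did not produce a clustering model and there are no cluster assignments to visualize yet. The next step is to resolve that error and rerun the setup cell before creating and plotting a model.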