### Part A: General descriptive analysis of our data

<strong>Background:</strong> Cluster analysis is usually a multivariate technique. Using k-means and similar techniques may be challenging for univariate data.
For that case we are going to employ the <strong><i>Jenks natural breaks optimization</i></strong> technique, a data clustering method designed to determine the best arrangement of values into different classes.
It does this by minimizing each class's average deviation from the class mean while maximizing each class's deviation from the means of the other classes.
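
To make the objective concrete, here is a minimal sketch that finds the breaks by brute force: it tries every way of cutting the sorted values into contiguous groups and keeps the cut with the smallest total within-class squared deviation. This is not how Jenks optimization is implemented in practice (real implementations avoid the exhaustive search), and the function names and `toy_ages` values are made up for illustration only.

```python
from itertools import combinations


def within_class_ssd(values):
    """Sum of squared deviations of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def brute_force_breaks(data, n_classes):
    """Exhaustively search for the contiguous grouping of the sorted data
    that minimizes the total within-class sum of squared deviations."""
    data = sorted(data)
    best_cost, best_groups = float("inf"), None
    # each cut is an index at which a new group starts
    for cuts in combinations(range(1, len(data)), n_classes - 1):
        bounds = (0, *cuts, len(data))
        groups = [data[a:b] for a, b in zip(bounds, bounds[1:])]
        cost = sum(within_class_ssd(g) for g in groups)
        if cost < best_cost:
            best_cost, best_groups = cost, groups
    # report breaks as [minimum, upper bound of each class]
    return [data[0]] + [g[-1] for g in best_groups]


toy_ages = [21, 22, 24, 35, 36, 38, 55, 58]  # hypothetical values, not from the dataset
print(brute_force_breaks(toy_ages, 3))  # [21, 24, 38, 58]
```

The search is only feasible for a handful of values, but it shows what "best arrangement" means: the three tight groups are recovered because any other cut would put a large gap inside one class.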

In [1]:
# importing packages

import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("../file_2.csv")

In [3]:
df.head(3)

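| Unnamed: 0 | tenure month | loan fee | loan | mobile transactions | mobile transactions value | mobile services | app users | TOTAL_INFLOW | SALARY_STATUS | TOTAL_OUTFLOW | age | gender | is_default |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 4.7 | 10680.0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 25 | M | Y |
| 1 | 21 | 9.6 | 9120.0 | 13 | 2922000.0 | 3 | 0 | 3000000 | 0 | 2994764 | 36 | M | N |
| 2 | 28 | 10.9 | 57900.0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 502900 | 25 | M | Y |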


In [4]:
df.columns

Index(['tenure month', 'loan fee', 'loan', 'mobile transactions',
       'mobile transactions value', 'mobile services', 'app users',
       'TOTAL_INFLOW', 'SALARY_STATUS', 'TOTAL_OUTFLOW', 'age', 'gender',
       'is_default'],
      dtype='object')

In [5]:
# import jenkspy, which provides Jenks natural breaks
import jenkspy

In [6]:
# To analyse another feature, change the column name here, then rerun the next cell to get its break points

Qty_array = df['age'].to_numpy()

def goodness_of_variance_fit(array, classes):
    # get the break points: [min, upper bound of class 1, ..., max]
    breaks = jenkspy.jenks_breaks(array, classes)

    # assign every value to a class (1-based)
    classified = np.array([classify(value, breaks) for value in array])

    # group the values by class
    zones = [array[classified == zone] for zone in range(1, classified.max() + 1)]

    # sum of squared deviations from the array mean (SDAM)
    sdam = np.sum((array - array.mean()) ** 2)

    # sum of squared deviations from each class mean (SDCM)
    sdcm = sum(np.sum((zone - zone.mean()) ** 2) for zone in zones)

    # goodness of variance fit: closer to 1 means tighter classes
    gvf = (sdam - sdcm) / sdam

    # note: gvf is computed but not returned; only the break points are used below
    return breaks

def classify(value, breaks):
    # breaks[i] is treated as the inclusive upper bound of class i
    for i in range(1, len(breaks)):
        if value <= breaks[i]:
            return i
    return len(breaks) - 1

In [7]:
# Determine breakpoints
goodness_of_variance_fit(Qty_array, classes=3)

[19, 30, 41, 60]
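
As a rough sketch of how these break points could be used, the snippet below labels values with a class index and computes the goodness of variance fit, GVF = (SDAM - SDCM) / SDAM, for a given set of breaks. It uses only the standard library. The helper names `assign_class` and `gvf_for_breaks` and the `sample_ages` list are hypothetical, and each inner break is assumed to be the inclusive upper bound of its class.

```python
from bisect import bisect_left


def assign_class(value, breaks):
    """Return the 1-based class of value, treating each inner break as an
    inclusive upper bound; out-of-range values fall into the end classes."""
    return bisect_left(breaks, value, 1, len(breaks) - 1)


def gvf_for_breaks(values, breaks):
    """Goodness of variance fit: share of total squared deviation
    that is removed by splitting the values at the given breaks."""
    overall_mean = sum(values) / len(values)
    sdam = sum((v - overall_mean) ** 2 for v in values)

    # collect the values that fall into each class
    groups = {}
    for v in values:
        groups.setdefault(assign_class(v, breaks), []).append(v)

    # squared deviations from each class's own mean
    sdcm = 0.0
    for members in groups.values():
        class_mean = sum(members) / len(members)
        sdcm += sum((v - class_mean) ** 2 for v in members)

    return (sdam - sdcm) / sdam


breaks = [19, 30, 41, 60]                    # break points from the output above
sample_ages = [19, 25, 30, 36, 41, 52, 60]   # hypothetical values, not from the dataset

labels = [assign_class(age, breaks) for age in sample_ages]
print(labels)  # [1, 1, 1, 2, 2, 3, 3]
print(gvf_for_breaks(sample_ages, breaks))
```

A GVF close to 1 suggests the classes are tight relative to the overall spread; comparing the score across different numbers of classes is one way to choose how many breaks to keep.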