## K-Means Clustering

**Overview**<br>
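<a href="https://archive.ics.uci.edu/ml/datasets/online+retail">Online Retail</a> is a transnational data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers. The cells below, however, work with a different file: `Cricket.csv`, which holds career batting statistics for 79 players, and cluster those players by batting average (`Ave`) and strike rate (`SR`).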

The steps are broadly:
1. Read and understand the data
2. Clean the data
3. Prepare the data for modelling
4. Modelling
5. Final analysis and recommendations

# 1. Read and understand the data

In [117]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import datetime as dt

import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

In [118]:
# read the dataset
cricket_df = pd.read_csv("Cricket.csv", sep=",", encoding="ISO-8859-1", header=0)
cricket_df.head()

Unnamed: 0,Player,Span,Mat,Inns,NO,Runs,HS,Ave,BF,SR,100,50,0
0,SR Tendulkar (INDIA),1989-2012,463,452,41,18426,200*,44.83,21367,86.23,49,96,20
1,KC Sangakkara (Asia/ICC/SL),2000-2015,404,380,41,14234,169,41.98,18048,78.86,25,93,15
2,RT Ponting (AUS/ICC),1995-2012,375,365,39,13704,164,42.03,17046,80.39,30,82,20
3,ST Jayasuriya (Asia/SL),1989-2011,445,433,18,13430,189,32.36,14725,91.2,28,68,34
4,DPMD Jayawardene (Asia/SL),1998-2015,448,418,39,12650,144,33.37,16020,78.96,19,77,28


In [119]:
# basics of the df
cricket_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79 entries, 0 to 78
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Player  79 non-null     object 
 1   Span    79 non-null     object 
 2   Mat     79 non-null     int64  
 3   Inns    79 non-null     int64  
 4   NO      79 non-null     int64  
 5   Runs    79 non-null     int64  
 6   HS      79 non-null     object 
 7   Ave     79 non-null     float64
 8   BF      79 non-null     int64  
 9   SR      79 non-null     float64
 10  100     79 non-null     int64  
 11  50      79 non-null     int64  
 12  0       79 non-null     int64  
dtypes: float64(2), int64(8), object(3)
memory usage: 8.2+ KB


# 2. Clean the data

In [120]:
# keep only Player, Ave (batting average) and SR (strike rate)
cricket_df_new = cricket_df[['Player', 'Ave', 'SR']]
cricket_df_new.head()

Unnamed: 0,Player,Ave,SR
0,SR Tendulkar (INDIA),44.83,86.23
1,KC Sangakkara (Asia/ICC/SL),41.98,78.86
2,RT Ponting (AUS/ICC),42.03,80.39
3,ST Jayasuriya (Asia/SL),32.36,91.2
4,DPMD Jayawardene (Asia/SL),33.37,78.96


# 3. Prepare the data for modelling

In [121]:
cricket_df_new1 = cricket_df_new.drop('Player', axis=1)
cricket_df_new1.head()

Unnamed: 0,Ave,SR
0,44.83,86.23
1,41.98,78.86
2,42.03,80.39
3,32.36,91.2
4,33.37,78.96


In [122]:
# rescale both features to zero mean and unit variance
# instantiate
scaler = StandardScaler()

# fit_transform
cricket_df_scaled = scaler.fit_transform(cricket_df_new1)
cricket_df_scaled.shape

(79, 2)

In [123]:
cricket_df_scaled = pd.DataFrame(cricket_df_scaled)
cricket_df_scaled.columns = ['Average', 'StrikeRate']
cricket_df_scaled.head()

Unnamed: 0,Average,StrikeRate
0,1.072294,0.703152
1,0.587725,-0.044139
2,0.596226,0.110997
3,-1.047909,1.207091
4,-0.876185,-0.034
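
As a quick check on what `StandardScaler` does, the sketch below standardises only the five rows shown by `cricket_df_new1.head()` by hand: it subtracts each column's mean and divides by the column's population standard deviation (`ddof=0`), then compares the result with `StandardScaler`. Because it uses five rows rather than all 79, its numbers will not match the scaled table above. The name `sample` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# the five (Ave, SR) rows printed by cricket_df_new1.head()
sample = pd.DataFrame({
    "Ave": [44.83, 41.98, 42.03, 32.36, 33.37],
    "SR": [86.23, 78.86, 80.39, 91.20, 78.96],
})

# manual z-score: (x - mean) / std, using the population std (ddof=0)
manual = (sample - sample.mean()) / sample.std(ddof=0)

# StandardScaler applies the same formula column by column
scaled = StandardScaler().fit_transform(sample)

# the two results should agree up to floating-point error
print(np.allclose(manual.to_numpy(), scaled))
```

Note that pandas' own `std()` defaults to `ddof=1`, so leaving out `ddof=0` would give slightly different values from `StandardScaler`.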


# 4. Modelling

In [124]:
# k-means with some arbitrary k
kmeans = KMeans(n_clusters=4, max_iter=50, random_state=100)
kmeans.fit(cricket_df_scaled)
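
Since `k=4` is chosen arbitrarily here, one common way to compare candidate values of `k` is to look at the inertia (within-cluster sum of squares) and the silhouette score for each. The sketch below is an illustration under stated assumptions, not part of the original analysis: it uses synthetic two-feature data from `make_blobs` as a stand-in for `cricket_df_scaled`, and the range of `k` values and `n_init=10` are choices made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic data standing in for cricket_df_scaled (79 rows, 2 features)
X, _ = make_blobs(n_samples=79, centers=4, n_features=2, random_state=100)
X = StandardScaler().fit_transform(X)

inertias = {}
silhouettes = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, max_iter=50, random_state=100).fit(X)
    inertias[k] = model.inertia_                          # within-cluster sum of squares
    silhouettes[k] = silhouette_score(X, model.labels_)   # ranges from -1 to 1

for k in inertias:
    print(f"k={k}: inertia={inertias[k]:.2f}, silhouette={silhouettes[k]:.3f}")

# the k with the highest silhouette score is one reasonable candidate
best_k = max(silhouettes, key=silhouettes.get)
print("k with the highest silhouette score:", best_k)
```

To apply the same loop to the notebook's data, `X` would be replaced with `cricket_df_scaled`; the resulting scores are not shown here because they depend on the full data set.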



In [125]:
kmeans.labels_

array([2, 2, 2, 3, 1, 2, 2, 2, 1, 2, 2, 2, 3, 0, 1, 0, 1, 2, 2, 2, 2, 1,
       1, 2, 3, 0, 2, 3, 1, 2, 1, 1, 2, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 1,
       1, 1, 2, 1, 1, 2, 3, 2, 1, 1, 1, 1, 2, 2, 1, 2, 1, 3, 2, 2, 0, 2,
       2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 2, 1])
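
A raw label array is hard to read, so it can help to count how many players fall into each cluster. The sketch below copies the labels printed above into a standalone array (named `labels` here for illustration) and counts them with `np.bincount`; inside the notebook, `np.bincount(kmeans.labels_)` would do the same thing.

```python
import numpy as np

# labels copied from the kmeans.labels_ output above
labels = np.array([
    2, 2, 2, 3, 1, 2, 2, 2, 1, 2, 2, 2, 3, 0, 1, 0, 1, 2, 2, 2, 2, 1,
    1, 2, 3, 0, 2, 3, 1, 2, 1, 1, 2, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 1,
    1, 1, 2, 1, 1, 2, 3, 2, 1, 1, 1, 1, 2, 2, 1, 2, 1, 3, 2, 2, 0, 2,
    2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 2, 1,
])

# sizes[i] is the number of players assigned to cluster i
sizes = np.bincount(labels)
for cluster_id, size in enumerate(sizes):
    print(f"cluster {cluster_id}: {size} players")
```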

In [126]:
# assign the cluster label to each player; assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning
cricket_df_new = cricket_df_new.assign(cluster_id=kmeans.labels_)
pd.set_option('display.max_rows', None)
cricket_df_new


Unnamed: 0,Player,Ave,SR,cluster_id
0,SR Tendulkar (INDIA),44.83,86.23,2
1,KC Sangakkara (Asia/ICC/SL),41.98,78.86,2
2,RT Ponting (AUS/ICC),42.03,80.39,2
3,ST Jayasuriya (Asia/SL),32.36,91.2,3
4,DPMD Jayawardene (Asia/SL),33.37,78.96,1
5,Inzamam-ul-Haq (Asia/PAK),39.52,74.24,2
6,JH Kallis (Afr/ICC/SA),44.36,72.89,2
7,SC Ganguly (Asia/INDIA),41.02,73.7,2
8,R Dravid (Asia/ICC/INDIA),39.16,71.24,1
9,BC Lara (ICC/WI),40.48,79.51,2


In [127]:
# Question 1: players assigned to cluster 0
cricket_df_new[cricket_df_new['cluster_id'] == 0]

Unnamed: 0,Player,Ave,SR,cluster_id
13,MS Dhoni (Asia/INDIA),51.32,88.69,0
15,AB de Villiers (Afr/SA),53.55,100.25,0
25,V Kohli (INDIA),53.94,90.99,0
34,HM Amla (SA),50.25,89.05,0
38,MG Bevan (AUS),53.58,74.16,0
42,IVA Richards (WI),47.0,90.2,0
64,MEK Hussey (AUS),48.15,87.16,0
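
To turn the cluster assignments into a profile, one option is to group by `cluster_id` and summarise `Ave` and `SR`. The sketch below does this for only the rows printed in this notebook (the first ten players plus the cluster 0 listing), so the means for clusters 1, 2 and 3 describe those few rows rather than the whole clusters. The names `shown` and `profile` are introduced here for illustration; in the notebook the same `groupby` could be run on `cricket_df_new` directly.

```python
import pandas as pd

# rows copied from the outputs above: the first ten players, then cluster 0
rows = [
    ("SR Tendulkar (INDIA)", 44.83, 86.23, 2),
    ("KC Sangakkara (Asia/ICC/SL)", 41.98, 78.86, 2),
    ("RT Ponting (AUS/ICC)", 42.03, 80.39, 2),
    ("ST Jayasuriya (Asia/SL)", 32.36, 91.20, 3),
    ("DPMD Jayawardene (Asia/SL)", 33.37, 78.96, 1),
    ("Inzamam-ul-Haq (Asia/PAK)", 39.52, 74.24, 2),
    ("JH Kallis (Afr/ICC/SA)", 44.36, 72.89, 2),
    ("SC Ganguly (Asia/INDIA)", 41.02, 73.70, 2),
    ("R Dravid (Asia/ICC/INDIA)", 39.16, 71.24, 1),
    ("BC Lara (ICC/WI)", 40.48, 79.51, 2),
    ("MS Dhoni (Asia/INDIA)", 51.32, 88.69, 0),
    ("AB de Villiers (Afr/SA)", 53.55, 100.25, 0),
    ("V Kohli (INDIA)", 53.94, 90.99, 0),
    ("HM Amla (SA)", 50.25, 89.05, 0),
    ("MG Bevan (AUS)", 53.58, 74.16, 0),
    ("IVA Richards (WI)", 47.00, 90.20, 0),
    ("MEK Hussey (AUS)", 48.15, 87.16, 0),
]
shown = pd.DataFrame(rows, columns=["Player", "Ave", "SR", "cluster_id"])

# per-cluster player count and mean of each feature, in original units
profile = shown.groupby("cluster_id").agg(
    players=("Player", "count"),
    mean_ave=("Ave", "mean"),
    mean_sr=("SR", "mean"),
).round(2)
print(profile)
```

Summarising in the original units rather than the scaled ones keeps the profile readable as batting averages and strike rates.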
