# KMeans Outlier Detection 

KMeans is an unsupervised learning algorithm, used when you have unlabeled data (i.e., data without defined categories or groups). Its goal is to find groups in the data, with the number of groups given by the variable K.

The algorithm works as follows:

1. Initialize k points, called means, at random.
2. Assign each item to its closest mean, then update that mean's coordinates to the average of the items assigned to it so far.
3. Repeat the process for a given number of iterations; the final assignments are the clusters.
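
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used by scikit-learn: it assumes the common batch variant (all items are assigned first, then every mean is recomputed), and the function name `kmeans`, its parameters, and the toy data are made up for the example.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Batch k-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize the means by picking k distinct data points at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop early once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Two well-separated toy blobs in 2-D (illustrative data only).
rng = np.random.default_rng(42)
toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = kmeans(toy, k=2)
```

On data like this, the first 50 rows should end up in one cluster and the last 50 in the other. In practice `sklearn.cluster.KMeans` is preferable, since it adds smarter initialization and multiple restarts.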

 

#### The objective is to identify outliers in Sales data using KMeans. The elbow method is used to determine the optimal number of clusters for k-means clustering.
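
One common way to turn KMeans into an outlier detector is to score each row by its distance to the nearest cluster centre and flag the largest scores. The sketch below shows that idea on toy data. It is an assumption about how such a detector could look, not the method used by the script in this notebook (which relies on pyod's KNN detector), and the `contamination` value and data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: two tight clusters plus three hand-placed far-away points.
X = np.vstack([
    rng.normal(0, 0.3, (100, 2)),
    rng.normal(4, 0.3, (100, 2)),
    [[10.0, 10.0], [-6.0, 8.0], [9.0, -7.0]],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# transform() returns the distance from each row to every centroid;
# the row-wise minimum is the distance to the assigned centroid.
dist_to_centroid = km.transform(X).min(axis=1)

contamination = 0.015  # hypothetical share of rows to flag
threshold = np.quantile(dist_to_centroid, 1 - contamination)
is_outlier = dist_to_centroid > threshold
```

Here the three hand-placed points sit far from both centres, so they receive the largest scores. The quantile threshold plays the same role as the `contamination` parameter in pyod: it fixes the fraction of rows reported, so it should be chosen from domain knowledge rather than read off the data.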

Client Name : XXXXX 

Project Name : Sales Outlier Detection

Data source : Sales

Script type : Python 3.7

LOD : 70

In [37]:
# Run garbage collection to free memory left over from earlier runs
import gc 
gc.collect()

import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns
from sklearn.cluster import KMeans


print('Outlier Detection for Sales Data')
print('Please enter the file path for dataset in the format : /users/..../sales.csv')
dataset=input()
print()
print('Please enter the file path for mapped file in the format : /users/..../map_file.csv')
map_file=input()
print()
print('Please enter the destination folder name in the format : /users/....desktop/')
dest_folder=input()
print()

#### loading the mapped file to pandas dataframe

n=pd.read_csv(map_file,encoding = "ISO-8859-1")
features=list(n['Columns'].loc[n['Features']==True])
dimension=list(n['Columns'].loc[(n['Type']=='Dimension') & (n['Features']==True)])
measure=list(n['Columns'].loc[(n['Type']=='Measure') & (n['Features']==True)])
date=list(n['Columns'].loc[(n['Type']=='Date') & (n['Features']==True)])[0]

### loading the CSV file to pandas dataframe

df=pd.read_csv(dataset,encoding = "ISO-8859-1",low_memory=False)

### Selecting only the relevant features
pd.set_option('mode.chained_assignment', None)
# Work on an explicit copy so later column assignments do not touch df
df1=df[features].copy()

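### Handling missing values and dropping rows where a measure is zero

for i in measure:
    df1[i]=df1[i].astype('float')
    # Filter on df1 (already cast to float), not on the raw df
    df1=df1.loc[df1[i]!=0]

df1=df1.dropna()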

### Extracting the month from the Date Column and adding a month Column

df1[date] = df1[date].astype('datetime64[ns]') 
df1['month']=df1[date].dt.month
dimension.append('month')

### Dropping the original date column (only the extracted month is kept)

df1=df1.drop([date],axis=1)

### One-hot encoding the dimension columns and standardizing the measure columns (z-score)

data = pd.get_dummies(df1, columns=dimension, drop_first=True)
for i in measure:
    data[i]=(data[i]-data[i].mean())/data[i].std()



### Finding the ideal K value using the Elbow Method
print('Give the value of k for elbow method:')
k_max=int(input())
print("This might take a few minutes.")
Sum_of_squared_distances=[]
# Try every k from 1 up to and including the value entered
K = range(1,k_max+1)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(data)
    print(k)
    # inertia_ is the sum of squared distances of samples to their closest centre
    Sum_of_squared_distances.append(km.inertia_)
# Plot the elbow
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
# Save before plt.show(); saving after show() can write a blank image
file_name=dest_folder+'elbow_graph.png'
plt.savefig(file_name, bbox_inches='tight')
plt.show()

### Fitting the k-NN outlier detector (pyod) and flagging the outliers
k=int(input('Set the value of K for kNN : '))
c=float(input('Set the value of contamination for kNN eg:0.005 or 0.01 : '))
print("This might take a few minutes.")
from pyod.models.knn import KNN
knn=KNN(n_neighbors=k,contamination=c)
knn.fit(data)

# predict() returns 1 for outliers and 0 for inliers
y_pred = knn.predict(data)
df1['Predicted']=y_pred

# Cluster label for each row; note that the same k is reused as the number of clusters
km = KMeans(n_clusters=k).fit(data)
df1['cluster'] = km.labels_

### Exporting the output in csv file

name=input('Please enter the name of the output file for outliers:')
extension=input('Please enter the extension for the output file:')
sep=input('Please enter the separator for the output file:')
dest_file=dest_folder+name+'.'+extension
pd.merge(df, df1, left_index=True, right_index=True).to_csv(dest_file, encoding='utf-8',sep=sep)
print('The job is done! Check the specified folder for the output:',dest_folder)

Outlier Detection for Sales Data
Please enter the file path for dataset in the format : /users/..../sales.csv
/Users/daksha.singhal/Desktop/Bajaj  auto/Bajaj_Auto_Limited_Sales_Files/new Bajaj_Sales_Domestic.csv

Please enter the file path for mapped file in the format : /users/..../map_file.csv
/Users/daksha.singhal/Desktop/Bajaj  auto/domestic map_file.csv

Please enter the destination folder name in the format : /users/....desktop/
/Users/daksha.singhal/Desktop/Bajaj  auto/Bajaj_Auto_Limited_Sales_Files/

Give the value of k for elbow method:
1


KeyboardInterrupt: 
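
The script expects a mapping file with `Columns`, `Type`, and `Features` columns, which decides which sales columns are used and how they are treated. The sketch below reproduces the selection, month extraction, one-hot encoding, and z-score steps on a tiny in-memory example, so they can be tried without the real files. All column names and values here (`Region`, `Billing Date`, `Net Value`, `Internal ID`) are hypothetical and only mirror the structure the script reads.

```python
import pandas as pd

# Hypothetical mapping file contents (column names are made up for illustration).
mapping = pd.DataFrame({
    "Columns": ["Region", "Billing Date", "Net Value", "Internal ID"],
    "Type": ["Dimension", "Date", "Measure", "Dimension"],
    "Features": [True, True, True, False],
})
selected = mapping[mapping["Features"]]
dimension = selected.loc[selected["Type"] == "Dimension", "Columns"].tolist()
measure = selected.loc[selected["Type"] == "Measure", "Columns"].tolist()
date = selected.loc[selected["Type"] == "Date", "Columns"].iloc[0]

# Hypothetical sales rows matching the mapping above.
sales = pd.DataFrame({
    "Region": ["North", "South", "North", "East"],
    "Billing Date": ["2019-01-15", "2019-02-03", "2019-02-20", "2019-03-09"],
    "Net Value": [100.0, 250.0, 175.0, 400.0],
    "Internal ID": [1, 2, 3, 4],
})

# Keep only the mapped features, derive the month, then drop the raw date.
frame = sales[dimension + measure + [date]].copy()
frame["month"] = pd.to_datetime(frame[date]).dt.month
frame = frame.drop(columns=[date])

# One-hot encode the categorical columns and standardize the measures.
encoded = pd.get_dummies(frame, columns=dimension + ["month"], drop_first=True)
for col in measure:
    encoded[col] = (encoded[col] - encoded[col].mean()) / encoded[col].std()
```

With `drop_first=True`, the first category of each encoded column (here `East` and month `1`) becomes the implicit baseline, so the encoded frame holds `Net Value`, `Region_North`, `Region_South`, `month_2`, and `month_3`. Standardizing the measures keeps a large-valued column from dominating the Euclidean distances that both KMeans and k-NN rely on.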