#### PROBLEM DESCRIPTION:
---------------------------------------------------------------------
Banks that receive customer complaints analyse the complaint data to find where most complaints are filed, which products or services generate the most complaints, and other useful patterns. The dataset covers complaints about Credit Reporting, Mortgage, Debt Collection, Consumer Loan and Bank Account products.

#### OBJECTIVE:
---------------------------------------------------------------------
Build a clustering model using Python or another suitable tool to find how many similar complaints exist for the same bank, service or product. The model will help banks identify the location and types of errors to resolve, leading to increased customer satisfaction that drives revenue and profitability.

#### DATASET SOURCE:
https://www.kaggle.com/sebastienverpile/consumercomplaintsdata

#### EXPECTED ACTIVITIES & OUTCOMES
---------------------------------------------------------------------
Your activities should include preparing the dataset for analysis, investigating the relationships in the data set using statistical techniques and visualization, creating a model, and evaluating the performance of the clustering model.

Demonstrate the Data Mining process with following activities:
1. Problem statement
2. Perform exploratory data analysis using the statistical techniques and box plot as applicable
3. Preprocess the Data as the data from internet source may be noisy
4. Evaluate the model performance using cohesion and separation
5. Suggest ways of improving the model
6. State all your assumptions clearly and provide clear explanations to explain your stand


In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
sns.set()

# Plotly libraries
import plotly
import plotly.express as px
import plotly.graph_objs as go
import chart_studio.plotly as py
import os
import cufflinks as cf
from plotly.offline import iplot, init_notebook_mode, plot
cf.go_offline()

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from kmodes.kmodes import KModes
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.decomposition import PCA

import warnings
warnings.filterwarnings('ignore')

In [None]:
customer_data = pd.read_csv('Consumer_Complaints.csv')
customer_data.head()

In [None]:
customer_data.columns = customer_data.columns.str.title()

In [None]:
customer_data[['Issue','Sub-Issue','Product','Sub-Product','Company','State',
               'Company Public Response','Consumer Consent Provided?','Date Received','Consumer Complaint Narrative',
               'Company Response To Consumer','Submitted Via']].describe().transpose()

### What are the top 15 issues, sub-issues and companies?

In [None]:
sns.set(style='white')
customer_data['Issue'].str.strip("'").value_counts()[0:15].iplot(kind='bar',title='Top 15 issues',fontsize=14,color='orange')

In [None]:
customer_data['Sub-Issue'].str.strip("'").value_counts()[0:15].iplot(
    kind ='bar',title='Top 15 Sub Issues',fontsize=14,color='#9370DB')

In [None]:
customer_data['Company'].str.strip("'").value_counts()[0:15].iplot(
    kind='bar',title='Top 15 Companies',fontsize=14,color='purple')

### What is the most common response received from companies?

In [None]:
grouped = customer_data.groupby(['Company Response To Consumer']).size()
pie_chart = go.Pie(labels=grouped.index,values=grouped,
                  title='Company Response to the Customer')
iplot([pie_chart])

### Which state received the largest number of complaints?

In [None]:
states = customer_data['State'].value_counts()

scl = [
    [0.0, 'rgb(242,240,247)'],
    [0.2, 'rgb(218,218,235)'],
    [0.4, 'rgb(188,189,220)'],
    [0.6, 'rgb(158,154,200)'],
    [0.8, 'rgb(117,107,177)'],
    [1.0, 'rgb(84,39,143)']
]

data = [go.Choropleth(
    colorscale = scl,
    autocolorscale = False,
    locations = states.index,
    z = states.values,
    locationmode = 'USA-states',
    text = states.index,
    marker = go.choropleth.Marker(
        line = go.choropleth.marker.Line(
            color = 'rgb(244,244,244)',
            width = 2
        )),
    colorbar = go.choropleth.ColorBar(
        title = "Complaints")
)]

layout = go.Layout(
    title = go.layout.Title(
        text = 'Complaints by State'
    ),
    geo = go.layout.Geo(
        scope = 'usa',
        projection = go.layout.geo.Projection(type = 'albers usa'),
        showlakes = True,
        lakecolor = 'rgb(100,149,237)'),
)

fig = go.Figure(data = data, layout = layout)
iplot(fig)

# Preprocessing of the data

### Removing unwanted features (checking % of missing values in data)

In [None]:
# getting the sum of null values and ordering.
total = customer_data.isnull().sum().sort_values(ascending = False)  

#getting the percent and order of null.
percent = (customer_data.isnull().sum()/customer_data.isnull().count()*100).sort_values(ascending =False)

# Concatenating the total and percent
df = pd.concat([total , percent],axis =1,keys=['Total' ,'Percent'])

# Show only the columns that have at least one null value
(df[~(df['Total'] == 0)])

##### Based on the missing-value percentages above, we apply the following cleanup:
1. Remove the ZIP Code column, since location information is already available from the State feature
2. Remove Sub-Issue, Consumer Complaint Narrative, Company Public Response, Tags and Consumer Consent Provided?, as their missing percentage is high and their relation to Product is low
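
The clean-up rule above can be sketched as a small helper that drops every column whose share of missing values exceeds a threshold. This is only an illustration on toy data: the function name `drop_sparse_columns` and the 50% cut-off are assumptions, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing=0.5):
    """Drop columns whose fraction of missing values exceeds max_missing.

    Hypothetical helper; the 0.5 default is an assumed threshold.
    """
    missing_fraction = df.isnull().mean()
    to_drop = missing_fraction[missing_fraction > max_missing].index
    report = missing_fraction.sort_values(ascending=False)
    return df.drop(columns=to_drop), report

# Toy stand-in for the complaints data
toy = pd.DataFrame({
    "Product": ["Mortgage", "Debt collection", "Mortgage", "Credit card"],
    "Tags": [np.nan, np.nan, np.nan, "Older American"],
    "State": ["CA", None, "TX", "NY"],
})

cleaned, missing_report = drop_sparse_columns(toy, max_missing=0.5)
print(missing_report)
print(list(cleaned.columns))
```

A threshold makes the decision reproducible, but columns such as ZIP Code that are dropped for redundancy rather than sparsity still have to be listed by hand.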

### Analyzing missing values and eradicating missing value features

In [None]:
train_complain = customer_data.drop(['Sub-Issue','Consumer Complaint Narrative','Date Received','Date Sent To Company',
                                 'Company Public Response','Zip Code','Tags','Consumer Consent Provided?'], axis=1)

In [None]:
train_complain.head(2)

In [None]:
train_complain.info()

In [None]:
train_complain[['Product','Sub-Product','Issue','Company','State',
               'Company Response To Consumer','Submitted Via','Timely Response?','Consumer Disputed?']].describe().transpose()

In [None]:
train = train_complain.dropna()

#### Finally, we select the features below, which have complete data to support our analysis

In [None]:
train.info()

### Using LabelEncoder to convert the data into numerical codes

In [None]:
from collections import defaultdict 

In [None]:
labelencoder = LabelEncoder()

In [None]:
encoder_dict = defaultdict(LabelEncoder)
labeled_df = train.apply(lambda x: encoder_dict[x.name].fit_transform(x))
# train_enc = train.apply(labelencoder.fit_transform)
labeled_df.head(2)

In [None]:
labeled_df.info()

In [None]:
X = labeled_df.drop(['Product'], axis=1)  # independent columns (axis must be a keyword in recent pandas)
y = labeled_df['Product']

### Apply SelectKBest to analyze top best features related to Product using Chi square Test

In [None]:
bestfeatures = SelectKBest(score_func=chi2, k='all')
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  # naming the dataframe columns
print(featureScores.nlargest(10,'Score'))  # print 10 best features

In [None]:
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
bestfeatures = SelectKBest(score_func=chi2, k='all')
fit = bestfeatures.fit(X_scaled,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  # naming the dataframe columns
print(featureScores.nlargest(10,'Score'))  # print 10 best features

### Top best features selected for clustering using the Feature selection method above

In [None]:
train_enc = labeled_df.drop(['Submitted Via','Consumer Disputed?','Company Response To Consumer','Timely Response?'], axis=1)

In [None]:
train_enc.info()

## First approach: using the PCA components of the dataset for K-Means clustering

### Scaling with StandardScaler for PCA

In [None]:
scalar = StandardScaler()
train_std = scalar.fit_transform(train_enc)

In [None]:
pca = PCA()
pca.fit(train_std)

In [None]:
pca.explained_variance_ratio_

In [None]:
plt.figure(figsize=(10,8))
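# Derive the x-axis from the number of fitted components instead of hard-coding it
n_components_total = len(pca.explained_variance_ratio_)
plt.plot(range(1, n_components_total + 1), pca.explained_variance_ratio_.cumsum(), marker='o', linestyle='--')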
plt.title('Explained variance by components')
plt.xlabel('No of components')
plt.ylabel('Cumulative Explained Variance')

#### The graph shows the cumulative variance captured (y-axis) against the number of components included (x-axis). A common rule of thumb is to preserve around 80% of the variance, so in this instance we keep 5 components.
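
The 80% rule of thumb can also be applied programmatically rather than read off the plot. The sketch below uses random toy data as a stand-in for the scaled features, so the number it prints is not the result for this dataset; the names `toy`, `cumulative` and `n_keep` are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the scaled feature matrix (6 columns, like train_std here)
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 6))
toy_std = StandardScaler().fit_transform(toy)

# Fit PCA with all components and accumulate the explained variance
pca_full = PCA().fit(toy_std)
cumulative = pca_full.explained_variance_ratio_.cumsum()

# Smallest number of components whose cumulative variance reaches 80%
target = 0.80
n_keep = int(np.searchsorted(cumulative, target) + 1)
print(f"Components needed for {target:.0%} variance: {n_keep}")
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` and then selects the number of components for that variance share itself.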

In [None]:
pca = PCA(n_components=5)
pca.fit(train_std)

In [None]:
scores_pca = pca.transform(train_std)

### RUN K-MEANS on PCA Components

In [None]:
cost = []
for num_clusters in list(range(2,11)):
    kmeans = KMeans(n_clusters=num_clusters, init="k-means++", verbose=0)
    cluster_labels = kmeans.fit_predict(scores_pca)
    cost.append(kmeans.inertia_)
    print(f"=============================== Cluster of {num_clusters} ====================================")

In [None]:
y_bar = np.array([i for i in range(2,11,1)])
plt.plot(y_bar, cost)
plt.xlabel("Number of Clusters")
plt.ylabel("K-Means cost")
# K-Means

### The elbow in the graph is at about k = 3, so run K-Means with n_clusters=3

In [None]:
kmeans_pca = KMeans(n_clusters=3, init="k-means++", verbose=0)
kmeans_pca.fit_predict(scores_pca)

In [None]:
customer_df = labeled_df.apply(lambda x: encoder_dict[x.name].inverse_transform(x))

In [None]:
# Name the PCA columns when building the frame; assigning into columns.values in place is unreliable
pca_columns = ['Component1','Component2','Component3','Component4','Component5']
customer_df_pca_kmeans = pd.concat([customer_df.reset_index(drop=True),
                                    pd.DataFrame(scores_pca, columns=pca_columns)], axis=1)
customer_df_pca_kmeans['PCA_KMeans_Clusters'] = kmeans_pca.labels_

In [None]:
customer_df_pca_kmeans.head()

In [None]:
customer_df_pca_kmeans['Cluster'] = customer_df_pca_kmeans['PCA_KMeans_Clusters'].map({0:'first',1:'second',2:'third'})

### visualize the segments with respect to the first two components

In [None]:
x_axis = customer_df_pca_kmeans['Component1']
y_axis = customer_df_pca_kmeans['Component2']
plt.figure(figsize=(10,8))
sns.scatterplot(x=x_axis, y=y_axis, hue=customer_df_pca_kmeans['Cluster'], palette=['g','r','y'])
plt.title('Clusters by PCA components [1 and 2]')
plt.show()

In [None]:
x_axis = customer_df_pca_kmeans['Component2']
y_axis = customer_df_pca_kmeans['Component3']
plt.figure(figsize=(10,8))
sns.scatterplot(x=x_axis, y=y_axis, hue=customer_df_pca_kmeans['Cluster'], palette=['g','r','y'])
plt.title('Clusters by PCA components [2 and 3]')
plt.show()

In [None]:
x_axis = customer_df_pca_kmeans['Component3']
y_axis = customer_df_pca_kmeans['Component4']
plt.figure(figsize=(10,8))
sns.scatterplot(x=x_axis, y=y_axis, hue=customer_df_pca_kmeans['Cluster'], palette=['g','r','y'])
plt.title('Clusters by PCA components [3 and 4]')
plt.show()

In [None]:
x_axis = customer_df_pca_kmeans['Component4']
y_axis = customer_df_pca_kmeans['Component5']
plt.figure(figsize=(10,8))
sns.scatterplot(x=x_axis, y=y_axis, hue=customer_df_pca_kmeans['Cluster'], palette=['g','r','y'])
plt.title('Clusters by PCA components [4 and 5]')
plt.show()

In [None]:
x_axis = customer_df_pca_kmeans['Component1']
y_axis = customer_df_pca_kmeans['Component5']
plt.figure(figsize=(10,8))
sns.scatterplot(x=x_axis, y=y_axis, hue=customer_df_pca_kmeans['Cluster'], palette=['g','r','y'])
plt.title('Clusters by PCA components [1 and 5]')
plt.show()

In [None]:
# Checking the count per cluster
cluster_pca_counts = customer_df_pca_kmeans['Cluster'].value_counts()
sns.barplot(x=cluster_pca_counts.index, y=cluster_pca_counts.values)

### Top 10 Products segregated per cluster

In [None]:
plt.subplots(figsize = (20,5))
sns.countplot(x=customer_df_pca_kmeans['Product'],order=customer_df_pca_kmeans['Product'].value_counts().index,hue=customer_df_pca_kmeans['Cluster'])
plt.show()

### Top 5 Issues segregated per cluster

In [None]:
plt.subplots(figsize = (20,5))
sns.countplot(x=customer_df_pca_kmeans['Issue'],order=customer_df_pca_kmeans['Issue'].value_counts()[:5].index,hue=customer_df_pca_kmeans['Cluster'])
plt.show()

### Top 5 Sub-Products segregated per cluster

In [None]:
plt.subplots(figsize = (20,5))
sns.countplot(x=customer_df_pca_kmeans['Sub-Product'],order=customer_df_pca_kmeans['Sub-Product'].value_counts()[:5].index,hue=customer_df_pca_kmeans['Cluster'])
plt.show()

### Top 5 Companies segregated per cluster

In [None]:
plt.subplots(figsize = (20,5))
sns.countplot(x=customer_df_pca_kmeans['Company'],order=customer_df_pca_kmeans['Company'].value_counts()[:5].index,hue=customer_df_pca_kmeans['Cluster'])
plt.show()

#### From the above charts we can see that companies have some relationship with clusters of similar complaints, for example:
1. Bank Of America - maximum complaints in cluster 1 and the rest in clusters 0 and 2, which are internally related to complaints like Loan Service and Loan Modification.
2. Wells Fargo & Company - maximum complaints in cluster 1 and the rest in clusters 0 and 2, which are internally related to complaints like Debts and Account opening.
3. JPMorgan Chase - maximum complaints in cluster 1 and a few in cluster 2, which are internally related to complaints like Account Opening, Cont'd attempts collect debt and Loans.
4. Ocwen - maximum complaints in clusters 0 and 1, which are internally related to complaints like Loan Service and Loan Modification and to Mortgage products.
5. Citibank - evenly distributed complaints across cluster 2, cluster 1 and cluster 0, which are internally related to complaints like Loan Service, Loan Modification and Disclosure verification of debt.


##### Top products - Mortgage and Debt Collection are distributed amongst cluster 0 and 1.

## Another Approach: using K-Modes clustering technique for categorical data

#### Top 5 features selected for clustering

In [None]:
train_enc.info()

#### Choosing K by comparing Cost against each K

In [None]:
from sklearn.metrics import silhouette_score

In [None]:
scalar = StandardScaler()
scaled_train = scalar.fit_transform(train_enc)

In [None]:
cost = []

for num_clusters in list(range(2,6)):
    kmode = KModes(n_clusters=num_clusters, init="Huang", n_init=1, verbose=0)
    kmode.fit_predict(train_enc)
    cost.append(kmode.cost_)

In [None]:
y_bar = np.array([i for i in range(2,6,1)])
plt.plot(y_bar, cost)
# RANDOM

### The elbow in the cost curve is at K = 3, so we select K=3 for our KModes algorithm

#### The Silhouette Coefficient measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). Its value ranges from -1 to 1. A value close to 1 indicates that the object is well matched to its own cluster and poorly matched to neighbouring clusters. If most objects have a high Silhouette Coefficient, the clusters are well formed. If many objects have negative values, something is wrong with the clustering configuration.
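
As a small illustration of the coefficient on categorical data, the sketch below scores two labelings of the same toy records. It uses the Hamming distance (the share of attributes on which two records differ), which is an assumption of this example and a swapped-in alternative to the Euclidean distance on scaled label codes; the toy records and labels are made up.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Toy integer-coded categorical records: two visibly different groups
X = np.array([
    [0, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [1, 2, 2],
    [1, 2, 3],
    [1, 3, 2],
])

good_labels = np.array([0, 0, 0, 1, 1, 1])   # matches the two groups
bad_labels = np.array([0, 1, 0, 1, 0, 1])    # mixes the groups

# metric="hamming" compares records attribute by attribute
good_score = silhouette_score(X, good_labels, metric="hamming")
bad_score = silhouette_score(X, bad_labels, metric="hamming")
print(f"good labeling: {good_score:.3f}, bad labeling: {bad_score:.3f}")
```

On a dataset of this notebook's size the pairwise distances become expensive; `silhouette_score` accepts `sample_size` and `random_state` arguments to score a random subset instead.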

In [None]:
silhouette_coefficients = []

for num_clusters in range(2, 5):
    kmodes_score = KModes(n_clusters=num_clusters, init = "Huang", n_init = 1, verbose=1)
    kmodes_score.fit(scaled_train)
    silhouette_avg = silhouette_score(scaled_train, kmodes_score.labels_)
    print(f"For n_clusters = {num_clusters}, The average silhouette_score is : {silhouette_avg}")
    silhouette_coefficients.append(silhouette_avg)

In [None]:
plt.style.use("fivethirtyeight")
plt.plot(range(2, 5), silhouette_coefficients)
plt.xticks(range(2, 5))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()

###### For n_clusters = 3, The average silhouette_score is : 0.12128887593549928

### Using K-Mode with "Huang" initialization for K=3

In [None]:
km_rand = KModes(n_clusters=3, init="Huang", n_init=5, verbose=0)
fitClusters_rand = km_rand.fit_predict(train_enc)

In [None]:
train_decoded = labeled_df.apply(lambda x: encoder_dict[x.name].inverse_transform(x))

In [None]:
# Align the decoded rows with the cluster labels on a fresh 0..n-1 index
train_decoded = train_decoded.reset_index(drop=True)
clustersDf = pd.DataFrame(fitClusters_rand, columns=['cluster_predicted'])
combinedDf = pd.concat([train_decoded, clustersDf], axis=1)

In [None]:
combinedDf.head()

In [None]:
# Checking the count per cluster
cluster_counts = combinedDf['cluster_predicted'].value_counts()
sns.barplot(x=cluster_counts.index, y=cluster_counts.values)

In [None]:
cluster_0 = combinedDf[combinedDf['cluster_predicted'] == 0]
cluster_1 = combinedDf[combinedDf['cluster_predicted'] == 1]
cluster_2 = combinedDf[combinedDf['cluster_predicted'] == 2]

In [None]:
plt.subplots(figsize = (20,5))
sns.countplot(x=combinedDf['Product'],order=combinedDf['Product'].value_counts().index,hue=combinedDf['cluster_predicted'])
plt.show()

In [None]:
plt.subplots(figsize = (20,7))
sns.countplot(x=combinedDf['Sub-Product'],order=combinedDf['Sub-Product'].value_counts()[:5].index,hue=combinedDf['cluster_predicted'])
plt.show()

In [None]:
plt.subplots(figsize = (20,5))
sns.countplot(x=combinedDf['Issue'],order=combinedDf['Issue'].value_counts()[:5].index,hue=combinedDf['cluster_predicted'])
plt.show()

In [None]:
plt.subplots(figsize = (20,5))
sns.countplot(x=combinedDf['Company'],order=combinedDf['Company'].value_counts()[:5].index,hue=combinedDf['cluster_predicted'])
plt.show()

#### From the above charts we can see that companies have some relationship with clusters of similar complaints, for example:
1. Bank Of America - maximum complaints in cluster 1 and a few in cluster 2, which are internally related to complaints like Loan Service and Loan Modification
2. Wells Fargo & Company - maximum complaints in clusters 1 and 2, which are internally related to complaints like Loan Modification and Account opening
3. JPMorgan Chase - maximum complaints in cluster 1 and cluster 2 and a few in cluster 0, which are internally related to complaints like Account Opening, Cont'd attempts collect debt and Loans
4. Ocwen - maximum complaints in cluster 1, which are internally related to complaints like Loan Service and Loan Modification
5. Citibank - a moderate number of complaints in cluster 0 and cluster 2 and most in cluster 1, which are internally related to complaints like Loan Service, Loan Modification and Disclosure verification of debt


##### Most of them are related to Mortgage and Debt Collection products

In [None]:
combinedDf.head()

### Ways to Improve the Model
1. We can use K-Medoids instead of K-Means to reduce the effect of outliers.
2. We can weight features using domain knowledge to improve the clustering. We have used this.
3. The input has a lot of missing data. We have eliminated 2 products with more missing values. The model can be improved by getting better data.
4. Principal Component Analysis can be used to identify the most important features of the dataset. We have used this in our model as well.
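
To make the K-Medoids suggestion concrete, here is a minimal sketch on toy records. It is not the method used in this notebook: the helper names, the Hamming distance and the deterministic farthest-first initialisation are all assumptions chosen for illustration, and the full pairwise distance matrix only fits in memory for a sample of the data.

```python
import numpy as np
import pandas as pd

def hamming_distance_matrix(codes):
    """Pairwise fraction of attributes on which two records differ."""
    return (codes[:, None, :] != codes[None, :, :]).mean(axis=2)

def init_medoids(D, k):
    """Deterministic farthest-first initialisation (illustrative choice)."""
    medoids = [int(D.sum(axis=1).argmin())]       # most central record
    while len(medoids) < k:
        nearest = D[:, medoids].min(axis=1)       # distance to closest medoid
        medoids.append(int(nearest.argmax()))     # add the farthest record
    return np.array(medoids)

def k_medoids(D, k, max_iter=100):
    """Alternate between assigning records and re-picking each medoid."""
    medoids = init_medoids(D, k)
    for _ in range(max_iter):
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue
            # The medoid is the member with the smallest total distance
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, D[:, medoids].argmin(axis=1)

# Toy complaints (made-up rows, not taken from the dataset)
toy = pd.DataFrame({
    "Product": ["Mortgage"] * 3 + ["Debt collection"] * 3,
    "Issue": ["Loan servicing", "Loan servicing", "Loan modification",
              "Disclosure", "Disclosure", "Communication"],
    "State": ["CA", "NY", "CA", "TX", "FL", "TX"],
})
codes = toy.apply(lambda col: pd.factorize(col)[0]).to_numpy()
D = hamming_distance_matrix(codes)
medoids, labels = k_medoids(D, k=2)
print(toy.assign(cluster=labels))
```

Because each medoid is an actual record, every cluster can be summarised by a real complaint, which is easier to interpret than a mean of label codes.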