# Mall Customer Data: Analysis and Visualization

In this notebook, I visualize the Mall Customers dataset and look for the feature that matters most when predicting the Spending Score. First I analyze the data, and then I estimate the importance of each feature.

I used these basic libraries:
* NumPy
* Pandas
* Matplotlib
* Plotly
* Seaborn
* Scikit-learn

**Importing Libraries**

In [None]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns  # statistical plots
import matplotlib.pyplot as plt  # base plotting
import plotly.express as px  # interactive plots, high-level API
import plotly.graph_objects as go  # interactive plots, low-level API

In [None]:
data = pd.read_csv('../input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')

Let's have a look at the dataset before preprocessing.

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.isnull().sum()

We see that there are no null values in the dataset. However, the "Gender" column holds categorical data, so we have to do some preprocessing here. Let's start.

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
label = LabelEncoder()
data["Gender"] = label.fit_transform(data["Gender"])
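
To make the encoding step concrete, here is a minimal pure-Python sketch of the kind of mapping `LabelEncoder` produces: each distinct category gets an integer code, assigned in sorted order. The helper `encode_labels` and the sample values are hypothetical illustrations, not part of scikit-learn or rows from this dataset.

```python
def encode_labels(values):
    """Map each distinct category to an integer code, assigned in sorted order."""
    classes = sorted(set(values))
    code_of = {cls: code for code, cls in enumerate(classes)}
    return [code_of[v] for v in values], classes


# Hypothetical sample column, not actual rows from the dataset
sample_gender = ["Male", "Female", "Female", "Male"]
codes, classes = encode_labels(sample_gender)
print(classes)
print(codes)
```

Assuming the column holds the strings `Female` and `Male`, this ordering means Female becomes 0 and Male becomes 1.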

The preprocessing is now done.

Let's have a look at the data after preprocessing.

In [None]:
data.head()

In [None]:
data.hist(figsize=(16, 8))
plt.show()

Now let's start visualizing the dataset.

**Heatmap Visualization**

In [None]:
sns.heatmap(data.corr(), annot=True)
plt.show()
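
Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a rough sketch of what a single cell means, the hypothetical helper `pearson` below computes the coefficient by hand for two short made-up lists (these are not values from the dataset).

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only
ages = [20, 30, 40, 50]
scores_down = [80, 60, 40, 20]  # falls steadily as age rises
scores_flat = [50, 60, 60, 50]  # no linear trend with age

r_down = pearson(ages, scores_down)
r_flat = pearson(ages, scores_flat)
print(r_down, r_flat)
```

A value near -1 or 1 indicates a strong linear relationship, while a value near 0 indicates none, which is how the colors in the heatmap should be read.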

**Violin Plotting**

In [None]:
# Age distribution for each gender (after label encoding: 0 = Female, 1 = Male)
sns.violinplot(x='Gender', y='Age', data=data)
sns.despine()

**Scatter Plotting**

In [None]:
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.scatter(data['Age'],data['Annual Income (k$)'], s=data['Spending Score (1-100)']) 
plt.show()

**Pie Visualizations**

In [None]:
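# Total annual income per gender (Gender is label-encoded at this point)
income_by_gender = data.groupby('Gender')['Annual Income (k$)'].sum()
# Recover the original category names for the slice labels
label_list = label.inverse_transform(income_by_gender.index)
plt.axis("equal")
plt.pie(income_by_gender, labels=label_list, autopct="%1.1f%%")
plt.title("Share of total annual income by gender")
plt.show()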

**Scatter Visualizations with Plotly**

In [None]:
fig = px.scatter(data, x="Age", y="Annual Income (k$)", 
                 color="Age",
                 size='Age', 
                 hover_data=[ 'Spending Score (1-100)'], 
                 title = "Age wise Annual Income")
fig.show()

In [None]:
fig2 = go.Figure(data=go.Scatter(x=data['Age'],
                                y=data['Annual Income (k$)'],
                                mode='markers',
                                marker_color=data['Age'],
                                text=data['Spending Score (1-100)'])) # hover text goes here

fig2.update_layout(title='Age wise Annual Income')
fig2.show()

**Finding Which Feature Is Most Important**

In [None]:
X = data.iloc[:, 0:4]  # features: CustomerID, Gender, Age, Annual Income (k$)
y = data.iloc[:, 4]    # target: Spending Score (1-100)

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

In [None]:
clf = ExtraTreesClassifier(n_estimators=5, criterion='entropy', max_features=2)
clf.fit(X, y)

In [None]:
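# Importances averaged over all trees in the ensemble
feature_importance = clf.feature_importances_
# Rescale so the plotted importances sum to 1
feature_importance_normalized = feature_importance / feature_importance.sum()
# Spread of the importances across trees (how much the trees disagree)
feature_importance_std = np.std([tree.feature_importances_ for tree in clf.estimators_],
                                axis=0)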

In [None]:
plt.bar(X.columns, feature_importance_normalized) 
plt.xlabel('Feature Labels') 
plt.ylabel('Feature Importances') 
plt.title('Comparison of different Feature Importances') 
plt.show()

As we can see in the graph, Age has the highest importance among all the columns in the dataset. So let's build a model using this column only.
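
When reading a chart like this, it helps to keep two summaries of the per-tree importances apart: their mean, which ranks the features, and their standard deviation, which only says how much the trees disagree. The sketch below uses made-up per-tree numbers (hypothetical, not output from this notebook) to show that the two summaries can point at different features.

```python
from statistics import mean, pstdev

# Hypothetical importances from three trees, one row per tree (each row sums to 1)
features = ["Gender", "Age", "Annual Income (k$)"]
per_tree = [
    [0.05, 0.55, 0.40],
    [0.05, 0.50, 0.45],
    [0.50, 0.45, 0.05],
]

# Transpose so that each entry holds one feature's values across the trees
columns = list(zip(*per_tree))
mean_importance = [mean(col) for col in columns]
spread = [pstdev(col) for col in columns]  # population std, like np.std's default

top_by_mean = features[mean_importance.index(max(mean_importance))]
top_by_spread = features[spread.index(max(spread))]
print("highest mean importance:", top_by_mean)
print("highest spread across trees:", top_by_spread)
```

In this toy example Age has the highest mean importance, while Gender has the largest spread, so only the mean should be used to rank features.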

In [None]:
X_new = data.iloc[:, [2]].values  # Age, kept 2-D as scikit-learn expects
y_new = data.iloc[:, 4].values    # Spending Score (1-100)

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
reg = DecisionTreeRegressor(random_state = 0)
reg.fit(X_new, y_new)

In [None]:
y_pred = reg.predict(X_new)

**Our Predictions**

In [None]:
y_pred
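
These predictions are made on the same rows the tree was trained on, so they say little about unseen customers. As a rough sketch (an assumption about how a fully grown regression tree on a single feature behaves, not scikit-learn's actual code), such a tree ends up returning the mean target of the training rows that share each feature value. The helper `fit_mean_per_value` is hypothetical, and the ages and scores below are made up.

```python
from collections import defaultdict


def fit_mean_per_value(xs, ys):
    """Return a lookup table: feature value -> mean target of rows with that value."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return {x: sum(vals) / len(vals) for x, vals in groups.items()}


# Made-up training data for illustration only
train_age = [19, 21, 21, 35]
train_score = [39, 81, 6, 77]

lookup = fit_mean_per_value(train_age, train_score)
train_pred = [lookup[a] for a in train_age]
print(train_pred)
```

Ages that appear only once are reproduced exactly, which is why training predictions look better than they would on new data. Holding out part of the data, for example with `sklearn.model_selection.train_test_split`, would give a fairer picture of accuracy.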

Let's test the model: given the age of a customer, it predicts their spending score.

In [None]:
y_test = reg.predict([[28]])

In [None]:
y_test

Okay, if you like this notebook, please give it an upvote. Your upvote is my encouragement.

Till then **Enjoy Machine Learning**