## **Data Exploration and Visualisation**

---
Visualising your variables is a great way to spot issues in your data straight away. You can
visualise the variables (columns, or features) in your dataset in several ways:


### 1- Univariate data visualisations
In these plots, only a single variable is visualised; hence the prefix “uni”, meaning “one”.
Examples are frequency distribution plots such as histograms and bar plots. To plot a
histogram, you can use the plotly library and its express module.


In [17]:
import plotly.express as px

Let’s load your prepared dataset and plot an interactive univariate histogram for the
“Spending_Score” variable.

In [18]:
# Load your prepared dataset
import pandas as pd  # for data manipulation
data = pd.read_csv('/content/Prepared_Mall_Customers.csv')

In [19]:
# Construct the histogram for the Spending_Score variable
Spending_Score_fig = px.histogram(data, x='Spending_Score')
# Display the plot
Spending_Score_fig.show()

Bar plots are similar to histograms; both visualise the frequency distribution of a
variable. However, histograms are used for variables whose values are numeric, such as
integers and floats, while bar plots are for categorical variables (object data type
variables). Let’s plot a bar plot for the “Customer_Satisfaction” variable.

In [20]:
# Construct the bar plot for the Customer_Satisfaction variable
Customer_Satisfaction_fig = px.bar(data, x='Customer_Satisfaction')
# Display the plot
Customer_Satisfaction_fig.show()
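
To cross-check the frequencies behind a bar plot, you can count the categories directly with pandas. This is a minimal sketch on made-up data; the labels “Satisfied” and “Unsatisfied” are assumptions, since the actual values in the prepared dataset are not shown here.

```python
import pandas as pd

# Hypothetical stand-in for the prepared dataset; the labels are assumed
demo = pd.DataFrame({
    "Customer_Satisfaction": ["Satisfied", "Unsatisfied", "Satisfied",
                              "Satisfied", "Unsatisfied"]
})

# Count how many rows fall into each category
satisfaction_counts = demo["Customer_Satisfaction"].value_counts()
print(satisfaction_counts)
```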

### 2- Bivariate data visualisations
In these plots, exactly two variables are visualised; hence the prefix “bi”, meaning “two”.
Bivariate plots combine two variables in one plot to show whether there is any association
between them. Scatter plots are one example. Let’s see if there is an association between
customers’ “Spending_Score” and their “Salary”. Do you notice anything interesting in the
plot?


In [21]:
# Scatter plot of Spending_Score against Salary (passing the DataFrame labels the axes)
Spending_Salary_Association_fig = px.scatter(data, x='Spending_Score', y='Salary')
Spending_Salary_Association_fig.show()
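
A scatter plot shows an association visually; a correlation coefficient puts a rough number on it. The sketch below is an addition, not part of the original notebook, and uses made-up values for the two columns. Pearson correlation only captures linear association, so treat it as a complement to the plot rather than a replacement.

```python
import pandas as pd

# Hypothetical stand-in for the prepared dataset (values are made up)
demo = pd.DataFrame({
    "Spending_Score": [10, 20, 30, 40, 50],
    "Salary": [15, 25, 30, 45, 60],
})

# Pearson correlation: values near +1 or -1 suggest a strong linear
# association, values near 0 suggest little linear association
r = demo["Spending_Score"].corr(demo["Salary"])
print(round(r, 2))
```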

You can also create bivariate histograms to compare two distributions. These are called
“Stacked Histograms”. Let’s see the distribution of customers’ “Age”, split by
“Customer_Satisfaction”.

In [22]:
Age_Satisfaction_fig = px.histogram(data, x='Age', color='Customer_Satisfaction')
Age_Satisfaction_fig.show()


Because the distributions overlap, it is hard to spot anything interesting between them in
the stacked view, but what if we made the colours translucent? That would give us a better
chance of seeing how separable the two groups are. For that, we use barmode='overlay' to
create an overlaid histogram. Can you spot anything interesting here?


In [23]:
Age_Satisfaction_fig = px.histogram(data, x='Age', color='Customer_Satisfaction', barmode='overlay')
Age_Satisfaction_fig.show()

You can also create bivariate bar plots to compare two mixed distributions: one for a
numeric variable and the other for a categorical variable. These are also called “Stacked Bar
Charts”. Let’s see the distribution of “Customer_Satisfaction” compared to “Salary”. Can
you interpret what you see in this plot? What does the colour scale (the heat legend) indicate?


In [24]:
Satisfaction_Salary_fig = px.bar(data, x='Customer_Satisfaction', color='Salary')
Satisfaction_Salary_fig.show()

You can also create bivariate bar plots to compare two categorical distributions, each for a
categorical (object data type) variable. These are also called Stacked Bar Charts. Let’s see
the distribution of “Customer_Satisfaction” compared to “Sex”. Why did we manipulate the
“Sex” variable? What was done to it? Can you interpret what you see in this plot? What
does the legend indicate?


In [25]:
# Replace the numeric codes with readable labels (run this cell only once)
data['Sex'] = data['Sex'].map({1: 'Male', 0: 'Female'})
Satisfaction_Sex_fig = px.bar(data, x='Customer_Satisfaction', color='Sex')
Satisfaction_Sex_fig.show()
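
A note on how `Series.map` behaves with a dictionary: any value that is not a key in the dictionary becomes NaN. The sketch below, on a small made-up series, shows why the mapping should be applied only once. After the first pass the values are strings, so a second pass finds no matching keys.

```python
import pandas as pd

# Made-up numeric codes, using the same encoding as the cell above
codes = pd.Series([1, 0, 1], name="Sex")

# First pass: every code has a matching key, so it becomes a label
labels = codes.map({1: "Male", 0: "Female"})

# Second pass: the values are now strings, no key matches, all become NaN
relabelled = labels.map({1: "Male", 0: "Female"})

print(labels.tolist())
print(relabelled.isna().all())
```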

If interpreting the previous Stacked Bar Charts was difficult, let’s try Clustered Bar
Charts. Let’s see if you can find anything interesting when visualising the distribution of
“Age” compared to “Customer_Satisfaction”. Can you interpret what you see in this
plot? Is there anything you want to flag to the marketing department? If you were to shop
there, are you likely to be satisfied or unsatisfied?

In [26]:
# The Sex column was already relabelled in the previous cell; mapping it again would turn it into NaN
Age_Satisfaction_fig = px.histogram(data, x='Age', color='Customer_Satisfaction', barmode="group")
Age_Satisfaction_fig.show()
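
If you want the numbers behind a clustered chart, a cross-tabulation gives the same comparison as a table. This is a minimal sketch on made-up data: the satisfaction labels and the age cut points (18, 30, 45, 60) are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the prepared dataset (values and labels are made up)
demo = pd.DataFrame({
    "Age": [19, 23, 28, 35, 37, 44, 51, 58],
    "Customer_Satisfaction": ["Satisfied", "Satisfied", "Unsatisfied", "Satisfied",
                              "Unsatisfied", "Unsatisfied", "Satisfied", "Unsatisfied"],
})

# Bin ages into bands; the cut points are arbitrary choices for illustration
age_band = pd.cut(demo["Age"], bins=[18, 30, 45, 60])

# Rows: age band, columns: satisfaction label, cells: number of customers
counts = pd.crosstab(age_band, demo["Customer_Satisfaction"])
print(counts)
```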

### 3- Multivariate plots
These plots combine more than two variables, hence the name. In them, you can look for
associations or interesting patterns among all of the variables. Let’s combine three variables,
“Customer_Satisfaction”, “Age”, and “Salary”, in one scatter plot! What are the salary
groups for unsatisfied customers? In which age group are they? In which income range?

In [27]:
Age_Salary_Satisfaction_fig = px.scatter(data, x="Age", y="Salary", color="Customer_Satisfaction")
Age_Salary_Satisfaction_fig.show()
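
To back up what you read off the scatter plot, you can summarise the age and salary ranges per satisfaction group. The following is a minimal sketch on made-up data; the satisfaction labels are assumptions, and the min/max summary is just one possible choice.

```python
import pandas as pd

# Hypothetical stand-in for the prepared dataset (values and labels are made up)
demo = pd.DataFrame({
    "Age": [21, 34, 45, 52, 29, 61],
    "Salary": [20, 35, 60, 80, 30, 90],
    "Customer_Satisfaction": ["Satisfied", "Unsatisfied", "Unsatisfied",
                              "Satisfied", "Satisfied", "Unsatisfied"],
})

# Smallest and largest Age and Salary within each satisfaction group
summary = demo.groupby("Customer_Satisfaction")[["Age", "Salary"]].agg(["min", "max"])
print(summary)
```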