A python cum Tableau project involving use of clustering to identify customer segments and create dashboards to analyze sales and customers (including new clusters)
Tasks- We have a dataset pertaining to the sales of an online global superstore. We are going to perform the following tasks:
1.Identify clusters of Customers using Unsupervised Machine Learning Algorithms.
2.Create Dashboards in Tableau to visualize and analyse a) The sales performance of the store & b) customers and the new clusters groups.
The project will involve the use of the following tools and skills -
- Python
- Machine Learning - Unsupervised learning
- MySQL
- Tableau
- Figma
- Data Visualization - Dashboard creation
- The data was loaded into python using jupyter notebook.
- Initial checks were performed to understand the structure and characteristics of the dataset like size, shape, missing values, count of categorical & numerical columns.
- Several categorical features like City, State, Product Name, Category, etc were checked for leading spaces, special characters, correct category names, etc.
- Missing values were detected only in postal code.
- Country had names which were either abbreviated or doubtful of being recognized in tableau.
- Region was a feature with too many different categories and ambigious regions.
- To make necessary corrections to country name and replace existing categories in region with lesser and more significant names, the World Indicators dataset was downloaded and imported into python.
- Using joins and country names/code, the country names were replaced with names which are identified by tableau.
- Regions were also replaced with more appropriate categories using country name or code with joins.
- Columns which were not required for our dashboard were dropped.
- In order to perform clustering, the RFM framework was followed where clustering of customers would be done on the basis of recency, frequency and monetary value of their purchases.
- A new dataset was created with customer id as index and new columns for recency, frequency and monetary value.
- Extreme / High value outliers were detected, which were removed using power transformer.
- All features were scaled for uniform scales required for optimum clustering.
- K-means and Agglomerative clustering were used to find optimum no. of clusters for customers
- Cluster quality were evaluated using inertia, silhoutte scores, elbow plots, dendograms, etc.
- Optimum no. of clusters identified were 2, with a silhoutte score of around 0.30 for K-means and 0.25 for Agglomerative clustering.
- New cluster labels were added to dataset.
- The sql alchemy package was imported to create a connection object required to connect python to mysql server.
- Using new connection, the final dataframe created was transferred from jupyter notebook to a new table created in a new database in mysql workbench.
- Using OBDC drivers, Tabeau was connected to MySQL Workbench and the data was pulled.
A. Sales Dashboard -
- Charts like sales, profit, sales per region, top 5 sub category, customer details, monthly sales, etc were created using several types of charts like line, donut, tree map, bar charts, world maps, etc.
- Template for the dashboard were created using Figma, an online platform.
The following dashboards were finally created