<a href="https://colab.research.google.com/github/sheldonkemper/portfolio/blob/main/KEMPER_SHELDON_CAM_C101_Week_6_Mini-project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mini-project 6.3 Customer segmentation with clustering

**Welcome to your second mini-project: Customer segmentation with clustering!**

Understanding and serving customers are two of the retail industry's most important strategic marketing objectives. Knowing the customer allows businesses to be more customer-centric, which improves marketing efficiency, product development, customer satisfaction, customer retention, price optimisation, and strategic resource allocation.

Customer segmentation allows a business to group customers based on demographics (e.g. age, gender, education, occupation, marital status, family size), geographics (e.g. country, transportation, preferred language), psychographics (e.g. lifestyle, values, personality, attitudes), behaviour (e.g. purchase history, brand loyalty, response to marketing activities), technographics (e.g. device type, browser type, original source), and needs (e.g. product features, service needs, delivery method). Your challenge in this project is to apply critical thinking and machine learning concepts to design and implement clustering models that perform customer segmentation and improve marketing efforts.

Please set aside approximately **12 hours** to complete the mini-project by **Friday, 30 August 2024, at 5 p.m. (UK Time)**.

<br></br>

## **Business context**
You are provided with an e-commerce data set from a real-world organisation to perform customer segmentation with clustering models to improve marketing efforts (SAS, 2024). It is a transnational data set with customers from five continents (Oceania, North America, Europe, Africa, and Asia) and 47 countries.

The data set contains 951,668 rows, each representing a product a customer ordered. The data set contains details about the customer (e.g. location, product type, loyalty member) and order (e.g. days to delivery, delivery date, order date, cost, quantity ordered, profit) based on orders between 1 January 2012 and 30 December 2016.

As each customer is unique, it is critical to identify and/or create new features for customer segmentation to inform marketing efforts. The data set has 20 features you can choose from:
- **Quantity:** The quantity the customer orders (e.g. 1, 2, 3).
- **City:** Name of the customer's residence (e.g. Leinster, Berowra, Northbridge).
- **Continent:** Name of the continent where the customer resides (e.g. Oceania, North America).
- **Postal code:** Where the customer resides (e.g. 6437, 2081, 2063).
- **State province:** State or province where the customer resides (e.g. Western Australia, Quebec, New South Wales).
- **Order date:** The date the order was placed (e.g. 1 January 2012, 20 June 2014).
- **Delivery date:** The date the order was delivered (e.g. 12 April 2014, 19 November 2016).
- **Total revenue:** Total revenue based on ordered items in USD (e.g. 123.80, 85.10).
- **Unit cost:** Cost per unit ordered in USD (e.g. 9.10, 56.90).
- **Discount:** Percentage of the normal total retail price (e.g. 50%, 30%).
- **Order type label:** Method in which the order was placed (e.g. internet sale, retail sale).
- **Customer country label:** The country where the customer resides (e.g. Australia, Canada, Switzerland).
- **Customer birthdate:** The date the customer was born (e.g. 8 May 1978, 18 December 1987).
- **Customer group:** Loyalty member group (e.g. internet/catalogue customers, Orion club gold members).
- **Customer type:** Loyalty member level (e.g. internet/catalogue customers, Orion club gold members high activity).
- **Order ID:** Unique order identifier (e.g. 1230000033).
- **Profit:** Total profit is calculated: $Total\:profit=(total\:revenue-unit\:cost)\times quantity$ in USD (e.g. 1.20, 0.40).
- **Days to delivery:** The number of days for delivery is calculated: $Delivery\:days=delivery\:date-order\:date$ (e.g. 6, 3, 2).
- **Loyalty number:** Loyal customer (99) versus non-loyal customer (0).
- **Customer ID:** A unique identifier for the customer (e.g. 8818, 47793).
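
As a small illustration of the date-based definition above, the following sketch derives days to delivery from two date columns. The column names (`order_date`, `delivery_date`, `days_to_delivery`) and the toy values are assumptions for illustration only; the headers and date format in the provided CSV may differ.

```python
import pandas as pd

# Toy order lines; column names are hypothetical stand-ins for the real headers.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2012-01-01", "2014-06-20"]),
    "delivery_date": pd.to_datetime(["2012-01-07", "2014-06-23"]),
})

# Delivery days = delivery date - order date, expressed as whole days.
orders["days_to_delivery"] = (orders["delivery_date"] - orders["order_date"]).dt.days
print(orders)
```

Recomputing a derived column like this is one way to sanity-check the values already present in the data set.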

Since we have a transnational data set, which implies customers from different continents, several metrics are important when performing customer segmentation for target marketing. From a marketing perspective, the following four metrics help to understand the nuance of the customer base, buying behaviour, preferences, and value to the business.
- **Frequency** indicates how often a customer purchases over a given period of time. A high frequency indicates a loyal customer, a high level of satisfaction, trust or brand loyalty, and/or effective marketing efforts. Frequency based on purchases guides a business in the effectiveness of target marketing campaigns and how to target less active customers.
- **Recency** measures how recently a customer made a purchase or placed an order. It helps predict customer churn (turnover) and engagement. A customer is a business’s most valuable asset, so securing customer retention is essential. A high recency (that is, a recent last purchase) indicates customer satisfaction and engagement.
- **Customer lifetime value (CLV)** indicates the average or total value a customer contributes to a business over the course of their relationship. In other words, CLV is a metric of the total income a business can expect to generate from a customer as long as said customer remains a loyal client. CLV helps to prioritise marketing efforts and resources, as it focuses on the customers expected to bring the most value over time and therefore on retaining high-value customers.
- The **average unit cost** indicates whether the customer prefers low-cost or high-cost items. This is related to the profitability of purchases. Customers buying products with a higher average unit cost should be targeted differently. Customer segmentation assists in identifying these customers.
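
The metrics above can be derived per customer with a single `groupby`. The sketch below is a minimal illustration on a toy DataFrame, not a prescribed solution. The column names (`customer_id`, `order_id`, `order_date`, `total_revenue`, `unit_cost`) are assumptions and may not match the headers in the provided CSV. It also assumes that CLV is approximated by total revenue per customer and that recency is measured in days before a chosen snapshot date.

```python
import pandas as pd

# Toy order lines; column names are hypothetical stand-ins for the real headers.
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_id": [100, 100, 101, 200, 201],
    "order_date": pd.to_datetime(
        ["2016-01-05", "2016-01-05", "2016-06-01", "2015-03-10", "2015-12-25"]
    ),
    "total_revenue": [20.0, 30.0, 50.0, 10.0, 40.0],
    "unit_cost": [8.0, 12.0, 25.0, 4.0, 16.0],
})

# Reference date for recency: one day after the last order in the data (an assumption).
snapshot = orders["order_date"].max() + pd.Timedelta(days=1)

# One row per customer, using named aggregation.
features = orders.groupby("customer_id").agg(
    frequency=("order_id", "nunique"),     # number of distinct orders
    last_order=("order_date", "max"),      # most recent order date
    clv=("total_revenue", "sum"),          # total revenue as a simple CLV proxy
    avg_unit_cost=("unit_cost", "mean"),   # mean unit cost across order lines
)

# Days between the customer's last order and the snapshot date.
features["recency_days"] = (snapshot - features["last_order"]).dt.days
features = features.drop(columns="last_order")
print(features)
```

With this definition, a smaller `recency_days` value means a more recent purchase. Other CLV definitions (for example, total profit or average revenue per year) are equally defensible and would only change the aggregation for `clv`.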

You may encounter data science challenges when performing customer segmentation. Let’s focus on five examples that you may encounter in this project:
1. **Data quality and management:** Data tends to be prone to inaccuracy, inconsistency, and incompleteness. The customer segments should be clearly defined, easily understood, and simple to incorporate into current and future strategies. Special attention should be paid to feature engineering and data preprocessing.
2. **Relevance segmentation:** The most relevant criteria (features) should be used for customer segmentation. Choosing the wrong or irrelevant criteria might dilute the clustering. As a result, cluster characteristics might overlap.
3. **Dynamic customer behaviour:** Customer preferences and behaviour can be seasonal, change rapidly based on new trends, or be influenced by personal and economic factors.  
4. **Privacy and ethical concerns:** Businesses must navigate the ethical and legal implications when collecting and analysing customer data. Data scientists must be unbiased regarding gender, race, country, etc.
5. **Actionability:** Creating segments that are too broad might ignore subtle but essential differences between customers, while segments that are too narrow might not be actionable. Creating a balance is important for marketing efficiency.

How you approach these challenges underscores the importance of understanding the business scenario for effective customer segmentation. Without direct input from the marketing team or domain experts, customer segmentation must be approached with a keen awareness of the nuanced relationships between different features and their potential implications for operational integrity.

Your task is to develop a robust customer segmentation to assist the e-commerce company in understanding and serving its customers better. This will give the company a more customer-centric focus and improve marketing efficiency. To do so, you’ll explore the data, apply preprocessing, feature engineering, and dimension reduction, and perform customer segmentation with clustering models.

You must prepare a report that illustrates your insights to the prospective stakeholders, showing how your solution will save the business money and build trust with its stakeholders. At this stage of the project, the six main questions you need to consider are:
1. What insights can be gained from the data, and what recommendations can be made to the company based on these insights? Clearly explain your rationale.
2. Which features can be deleted, selected, and combined (feature creation) to effectively segment customers based on the four metrics (frequency, recency, CLV, average unit cost)?
3. Are there more features beyond the suggested ones that would help in better segmentation?
4. Based on this data set, which statistical or ML technique is the best for determining the optimum number of clusters ($k$)?
5. How do the clusters compare based on frequency, recency, CLV, and average unit cost?
6. What did you deduce from the dimensional reduction analysis?

<br></br>

> **Disclaimer**
>
> Note that although a real-life data set was provided, the business context in this project is fictitious. Any resemblance to companies and persons (living or dead) is coincidental. The course designers and hosts assume no responsibility or liability for any errors or omissions in the content of the business context and data sets. The information in the data sets is provided on an 'as is' basis, with no guarantees of completeness, accuracy, usefulness, or timeliness.

<br></br>

## **Objective**
By the end of this mini-project, you’ll be able to understand and apply statistical and ML methods to perform customer segmentation with clustering techniques.

In the Notebook, you will:
- explore the data set
- preprocess the data and conduct feature engineering
- determine the optimal number of clusters ($k$)
- apply ML models to reduce dimensions and segment customers.
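
As an illustration of determining $k$, the sketch below compares k-means inertia (the quantity plotted in an Elbow chart) with the Silhouette score over a range of candidate values. It runs on synthetic data from `make_blobs`, which stands in for the scaled customer features; the range of $k$, the random seed, and the blob centres are all assumptions chosen for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled customer features: three well-separated groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=1.0,
    random_state=42,
)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_                         # within-cluster sum of squares
    silhouettes[k] = silhouette_score(X, model.labels_)  # mean silhouette coefficient

# The Silhouette method picks the k with the highest mean score.
best_k = max(silhouettes, key=silhouettes.get)
print("inertia by k:", inertias)
print("silhouette by k:", silhouettes)
print("best k by silhouette:", best_k)
```

Inertia always tends to fall as $k$ grows, so the Elbow method looks for the point where the decrease flattens, whereas the Silhouette score has a clear maximum. On real customer data the two methods may disagree, which is why the choice needs to be motivated.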

You will also write a report summarising the results of your findings and recommendations.

<br></br>

## **Assessment criteria**
By completing this project, you will be able to provide evidence that you can:
- demonstrate enhanced problem-solving skills and proposed strategic solutions by systematically analysing complex organisational challenges
- identify meaningful patterns in complex data to evidence advanced critical and statistical thinking skills
- select statistical techniques appropriate to a solutions design approach and evidence the ability to evaluate their effectiveness
- demonstrate enhanced data representation and improved model performance by systematically implementing relevant techniques
- design innovative solutions through critically selecting, evaluating and implementing effective unsupervised learning techniques.

<br></br>

## **Project guidance**
1. Import the required libraries and data set with the provided URL.
2. View the DataFrame and perform data pre-processing:
  - Identify missing values.
  - Check for duplicate values.
  - Determine if there are any outliers.
  - Aggregate the data into one customer per row.
3. Perform feature engineering as follows:
  - Create new features for frequency, recency, CLV, average unit cost, and customer age.
  - Select the best features for analysis.
  - Perform feature scaling and encoding if needed.
4. Perform EDA and create visualisations to explore the data.
5. For more efficient programming, incorporate a column transformer and a pipeline (for example, scikit-learn's `ColumnTransformer` and `Pipeline`). Consult the library documentation to understand their usage.
6. Select the optimum value of clusters ($k$) with the Elbow and Silhouette score methods. Motivate your choice.
7. Perform hierarchical clustering and create a dendrogram.
8. Based on the optimum number of $k$, perform k-means clustering.
9. View the cluster number associated with each `customer_ID`. You can create a table or DataFrame.
10. Create boxplots to display the clusters with regard to frequency, recency, CLV, average unit cost, and CLP.
11. Perform dimension reduction with PCA and t-SNE to reduce the data to 2D.
12. Create a 2D visualisation to display the clusters with different colours. Use the output from the PCA and t-SNE.
13. Document your approach and major inferences from the data analysis and describe which method provided the best results and why.
14. When you’ve completed the activity:
  - Download your completed Notebook as an IPYNB (Jupyter Notebook) or PY (Python) file. Save the file as follows: **LastName_FirstName_CAM_C101_Week_6_Mini-project**.
  - Prepare a detailed report (between 800 and 1,000 words) that includes:
    - an overview of your approach
    - a description of your analysis
    - an explanation of the insights you identified
    - a summary of which method gave the best results in determining $k$
    - a clear visualisation of your customer segmentation
    - an explanation of visualisations you created.
  - Save the document as a PDF named according to the following convention: **LastName_FirstName_CAM_C101_Week_6_Mini-project.pdf**.
  - You can submit your files individually or as a ZIP file. If you upload a ZIP file, use the correct naming convention: **LastName_FirstName_CAM_C101_Week_6_Mini-project.zip**.
  - Submit your Notebook and PDF document by **Friday, 30 August 2024, at 5 p.m. (UK Time)**.
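
The preprocessing, clustering, and dimension-reduction steps in the guidance above can be chained together. The following is a minimal sketch using scikit-learn's `ColumnTransformer` and `Pipeline`, followed by PCA on the preprocessed features. The toy DataFrame, its column names (`customer_id`, `frequency`, `clv`, `order_type`), and the choice of $k=2$ are assumptions for illustration and are not taken from the data set.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy per-customer table; names and values are hypothetical.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "frequency": [2, 3, 2, 20, 22, 19],
    "clv": [50.0, 60.0, 55.0, 900.0, 950.0, 880.0],
    "order_type": ["retail", "retail", "internet", "internet", "internet", "retail"],
})
numeric = ["frequency", "clv"]
categorical = ["order_type"]

# Scale numeric columns and one-hot encode categorical ones.
# sparse_threshold=0 forces a dense array so PCA can consume it directly.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0,
)

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=42)),
])

# Fit the whole pipeline and attach a cluster label to each customer.
customers["cluster"] = pipeline.fit_predict(customers[numeric + categorical])

# Reuse the fitted preprocessing step, then project to 2D for plotting.
X_ready = pipeline.named_steps["preprocess"].transform(customers[numeric + categorical])
coords = PCA(n_components=2).fit_transform(X_ready)
customers["pc1"], customers["pc2"] = coords[:, 0], coords[:, 1]
print(customers[["customer_id", "cluster", "pc1", "pc2"]])
```

Keeping the scaler and encoder inside the pipeline means the same transformations are applied consistently whenever the model is refitted. The `pc1` and `pc2` columns can then be passed to a scatter plot coloured by `cluster`; t-SNE could be swapped in for PCA in the last step.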

<br></br>
> **Declaration**
>
> By submitting your project, you indicate that the work is your own and has been created with academic integrity. Refer to the Cambridge plagiarism regulations.

> Start your activity here. Select the pen from the toolbar to add your entry.

In [None]:
!pip install gdown



In [None]:
!gdown 'https://drive.google.com/uc?export=download&id=1S5wniOV5_5htDfUFeZhlCLibvtihNLKK'

Downloading...
From (original): https://drive.google.com/uc?export=download&id=1S5wniOV5_5htDfUFeZhlCLibvtihNLKK
From (redirected): https://drive.google.com/uc?export=download&id=1S5wniOV5_5htDfUFeZhlCLibvtihNLKK&confirm=t&uuid=73980cf4-0135-42e4-994b-2dcc0204cfcf
To: /content/CUSTOMERS_CLEAN.csv
100% 191M/191M [00:01<00:00, 120MB/s]
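
Once the file has been downloaded, a first data-quality pass might look like the sketch below. The helper `quality_report` is a hypothetical name introduced here for illustration. It is demonstrated on a toy DataFrame so that the block runs on its own; in the notebook the DataFrame would instead come from reading `CUSTOMERS_CLEAN.csv`.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Count rows, missing values per column, and fully duplicated rows."""
    return {
        "rows": len(df),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# In the notebook this would be: df = pd.read_csv("CUSTOMERS_CLEAN.csv")
# Toy stand-in with one duplicated row and one missing value.
df = pd.DataFrame({
    "Customer ID": [8818, 8818, 47793, 47793],
    "Quantity": [1, 1, 2, None],
})

report = quality_report(df)
print(report)
```

Note that `duplicated()` only flags rows that repeat an earlier row in every column; since each row of the data set is an order line, apparent duplicates may need to be checked against the order ID before being dropped.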


# Reflect

Write a brief paragraph highlighting your process and the rationale to showcase critical thinking and problem-solving.

> Select the pen from the toolbar to add your entry.

### Reference:
SAS, 2024. CUSTOMERS_CLEAN [Data set]. SAS. Last revised on 15 December 2021. [Accessed 20 February 2024].