# Exercise 6: Customer Segmentation with Bank Marketing Data

In this exercise, you will explore the Bank Marketing Data Set from the UCI Machine Learning Repository and use it for customer segmentation. The dataset was gathered from direct marketing campaigns of a Portuguese banking institution; its original classification goal is to predict whether a client will subscribe to a term deposit.

**Data Source:** [Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing)

Below is an overview of the dataset and its attributes:

## Data Set Overview

The data come from direct marketing campaigns conducted by phone. Often, more than one contact with the same client was needed to determine whether they would subscribe to a term deposit. The classification target is `y`, which indicates whether the client subscribed (`yes`) or did not subscribe (`no`) to the term deposit.

### Available Datasets

There are four versions of the dataset:

1. **bank-additional-full.csv**: Contains all 41,188 examples with 20 input attributes. The data is ordered by date (from May 2008 to November 2010) and is similar to the dataset used in [Moro et al., 2014].
2. **bank-additional.csv**: A random 10% sample (4,119 examples) from the full dataset, with the same 20 input attributes.
3. **bank-full.csv**: Contains all examples with 17 input attributes (an older version of the dataset).
4. **bank.csv**: A random 10% sample from the older dataset with 17 input attributes.

*Note:* The smaller datasets are particularly useful for testing computationally intensive machine learning algorithms (e.g., Support Vector Machines).
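
As a starting point, here is a minimal loading sketch. It assumes the UCI CSV files are semicolon-delimited with quoted strings. The helper name `load_bank` and the two inline rows are hypothetical: the rows are hand-written for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Hand-written illustrative rows in the layout of the bank-additional files
# (semicolon-separated, quoted strings). They stand in for the real CSV.
SAMPLE = '''"age";"job";"marital";"duration";"pdays";"y"
30;"admin.";"married";120;999;"no"
45;"technician";"unknown";300;6;"yes"
'''


def load_bank(source):
    """Read a Bank Marketing CSV from a file path or file-like object."""
    # Assumption: the files use ';' rather than ',' as the field separator.
    return pd.read_csv(source, sep=";")


df = load_bank(io.StringIO(SAMPLE))
print(df.shape)
print(df.dtypes)
```

With a local copy of the data, the same helper would be called with a path, for example `load_bank("bank-additional.csv")` to iterate quickly on the 10% sample before switching to the full file.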

### Attribute Details

**Input Attributes:**

- **Bank Client Data:**
  1. **age**: Numeric.
  2. **job**: Type of job. (Categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')
  3. **marital**: Marital status. (Categorical: 'divorced' [includes divorced or widowed], 'married', 'single', 'unknown')
  4. **education**: Educational level. (Categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')
  5. **default**: Credit in default. (Categorical: 'no', 'yes', 'unknown')
  6. **housing**: Housing loan. (Categorical: 'no', 'yes', 'unknown')
  7. **loan**: Personal loan. (Categorical: 'no', 'yes', 'unknown')

- **Campaign Contact Information:**
  8. **contact**: Communication type. (Categorical: 'cellular', 'telephone')
  9. **month**: Last contact month of the year. (Categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
  10. **day_of_week**: Last contact day of the week. (Categorical: 'mon', 'tue', 'wed', 'thu', 'fri')
  11. **duration**: Duration of the last contact (in seconds). *Important:* This feature strongly influences the target (e.g., if duration = 0, then `y` = 'no'). However, the duration is not known before a call is made, and once the call ends `y` is already known. Include it only for benchmarking, and exclude it from any model or segmentation intended for prospective campaigns.

- **Other Attributes:**
  12. **campaign**: Number of contacts during the current campaign (numeric; includes the last contact).
  13. **pdays**: Number of days since the client was last contacted in a previous campaign (numeric; 999 indicates the client was not previously contacted).
  14. **previous**: Number of contacts before the current campaign (numeric).
  15. **poutcome**: Outcome of the previous marketing campaign. (Categorical: 'failure', 'nonexistent', 'success')

- **Social and Economic Context:**
  16. **emp.var.rate**: Employment variation rate (quarterly indicator, numeric).
  17. **cons.price.idx**: Consumer price index (monthly indicator, numeric).
  18. **cons.conf.idx**: Consumer confidence index (monthly indicator, numeric).
  19. **euribor3m**: Euribor 3-month rate (daily indicator, numeric).
  20. **nr.employed**: Number of employees (quarterly indicator, numeric).

**Output Attribute:**

- **y**: Indicates whether the client subscribed to a term deposit. (Binary: 'yes' or 'no')
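
The attribute notes above imply a few cleaning decisions: the `'unknown'` label marks a missing categorical value, `pdays = 999` is a sentinel rather than a real number of days, and `duration` leaks the outcome. The sketch below shows one possible way to encode those decisions. The function name `clean_bank`, the derived column `previously_contacted`, and the two demo rows are hypothetical, and other treatments (for example, keeping `'unknown'` as its own category) are equally defensible.

```python
import numpy as np
import pandas as pd

# Categorical attributes that list 'unknown' among their values.
CATEGORICAL_WITH_UNKNOWN = ["job", "marital", "education", "default", "housing", "loan"]


def clean_bank(df):
    """Return a cleaned copy of a Bank Marketing frame (one possible treatment)."""
    out = df.copy()

    # Treat the 'unknown' label as a missing value.
    cols = [c for c in CATEGORICAL_WITH_UNKNOWN if c in out.columns]
    out[cols] = out[cols].mask(out[cols] == "unknown")

    # pdays == 999 means "not previously contacted": keep that fact as a flag
    # and blank out the sentinel so it is not read as 999 days.
    out["previously_contacted"] = (out["pdays"] != 999).astype(int)
    out["pdays"] = out["pdays"].mask(out["pdays"] == 999)

    # duration is only known after the call, so drop it.
    return out.drop(columns=["duration"])


# Hand-written demo rows (not taken from the dataset).
raw = pd.DataFrame({
    "age": [30, 45],
    "marital": ["unknown", "married"],
    "duration": [120, 300],
    "pdays": [999, 6],
    "y": ["no", "yes"],
})
cleaned = clean_bank(raw)
print(cleaned)
```

The resulting missing values still need a policy (dropping rows, imputing the mode, or restoring an explicit category) before clustering, since $K$-Means cannot handle NaN inputs.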

### Discussion Points

- **Data Heterogeneity:** Reflect on the challenges of clustering when dealing with a mix of numerical and categorical data. How does the inclusion of highly influential features (such as `duration`) affect your analysis?
- **Preprocessing Impact:** Consider the importance of data cleaning, normalization, and feature encoding in obtaining meaningful clustering results.
- **Real-World Implications:** Think about how customer segmentation insights could influence targeted marketing strategies and improve campaign effectiveness.
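
To make the heterogeneity and preprocessing points concrete, the following sketch standardizes numeric columns and one-hot encodes categorical ones with scikit-learn's `ColumnTransformer`. The toy frame and its values are invented for illustration; only the column names come from the attribute list above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with invented values; column names follow the attribute list.
toy = pd.DataFrame({
    "age": [25, 40, 58, 33],
    "euribor3m": [1.2, 4.8, 4.9, 0.7],
    "job": ["student", "admin.", "retired", "admin."],
    "contact": ["cellular", "telephone", "cellular", "cellular"],
})
numeric = ["age", "euribor3m"]
categorical = ["job", "contact"]

preprocess = ColumnTransformer(
    [
        # Zero mean, unit variance, so no numeric column dominates distances.
        ("num", StandardScaler(), numeric),
        # One 0/1 indicator column per category level.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X = preprocess.fit_transform(toy)
print(X.shape)            # 2 numeric + 3 job levels + 2 contact levels
print(np.round(X, 2))
```

Without scaling, `age` (tens of years) would outweigh `euribor3m` (a few percentage points) in any Euclidean distance. Note also that Euclidean distance on one-hot indicators is only a rough proxy for categorical dissimilarity, which is one of the limitations worth discussing.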


# Exercises

In this section, you will work through several steps to analyze the Bank Marketing Data Set using clustering techniques. Complete the tasks below and reflect on your findings:

1. **Data Loading and Preprocessing:**
   - Load the Bank Marketing Data Set.
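   - Handle any missing values appropriately. Note that missing categorical values are coded as `'unknown'`, and `pdays = 999` means the client was not previously contacted.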
   - Normalize the numerical features and encode the categorical variables.
   - *Discussion:* Consider how data cleaning and feature scaling can impact the clustering results.

2. **Data Visualization:**
   - Create scatter plots or other visualizations to explore the dataset.
   - Visualize the distribution of key features and look for potential clusters or outliers.
   - *Reflection:* How can initial visualizations guide your understanding of the data?

3. **Clustering with $K$-Means:**
   - Apply the $K$-Means clustering algorithm to segment the data.
   - Experiment with different numbers of clusters ($K$).
   - Evaluate the clustering performance using metrics such as the Within-Cluster Sum of Squares (WCSS) and the silhouette coefficient.
   - *Discussion:* What criteria did you use to choose the optimal number of clusters? How does varying $K$ affect the clustering results?

4. **Cluster Visualization:**
   - Use Principal Component Analysis (PCA) to reduce the dimensionality of the dataset.
   - Plot the first two principal components and visualize the clusters.
   - *Reflection:* Assess how well the clusters separate in the reduced feature space. What might overlapping clusters indicate?

5. **Cluster Interpretation and Insights:**
   - Analyze the characteristics of each cluster.
   - Identify patterns or insights, such as demographic trends or features that distinguish customers who are more likely to subscribe to a term deposit.
   - *Discussion:* How could these insights inform targeted marketing strategies? What are the potential limitations of your clustering approach, and how might you address them in a real-world scenario?
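
Steps 3 to 5 can be sketched roughly as follows. This is an outline under stated assumptions, not a reference solution: the feature matrix `X` is a synthetic stand-in generated with `make_blobs` (replace it with your preprocessed data), the helper `sweep_k` is hypothetical, and picking $K$ by the highest silhouette score is just one of several reasonable criteria.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the preprocessed feature matrix (assumption).
X, _ = make_blobs(n_samples=300, centers=3, n_features=6, random_state=0)


def sweep_k(X, k_values, seed=0):
    """Fit K-Means for each K and record WCSS and the silhouette coefficient."""
    rows = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        rows.append({
            "k": k,
            "wcss": km.inertia_,  # within-cluster sum of squares
            "silhouette": silhouette_score(X, km.labels_),
        })
    return pd.DataFrame(rows)


# Step 3: compare several values of K.
scores = sweep_k(X, range(2, 7))
print(scores)

# One possible selection rule: the K with the highest silhouette score.
best_k = int(scores.loc[scores["silhouette"].idxmax(), "k"])
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)

# Step 4: project onto the first two principal components for plotting.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Step 5: per-cluster feature means as a first profile of each segment.
profile = pd.DataFrame(X).groupby(labels).mean()
print(profile)
```

To visualize the result, `coords[:, 0]` and `coords[:, 1]` can be passed to a scatter plot colored by `labels`. On the real data, grouping the original (unscaled) columns and the target `y` by `labels` is usually easier to interpret than the standardized means shown here.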
