<h1 style='background:#418E32; border:1; color:#F0FFFF; text-align:center; font-size:32px;  padding:6px;'><left>Customer Segmentation with Clustering Model</left></h1>

**Created By**: Wuttipat S. <br>
**Created Date**: 2024-05-19 <br>
**Status**: <span style="color:green">Completed</span>


# Index
---
Customer Personality Analysis is an in-depth examination of a business's target consumers. It enables a company to gain a deeper understanding of its customers, facilitating the tailoring of products to meet the distinct needs, behaviors, and concerns of various customer groups.

This analysis allows a business to fine-tune its products for specific customer segments. For instance, rather than marketing a new product to every customer in their database, a company can identify which segment is most likely to purchase the product and focus its marketing efforts solely on that group.

# Dataset Content
---


### People

- ID: Customer's unique identifier
- Year_Birth: Customer's birth year
- Education: Customer's education level
- Marital_Status: Customer's marital status
- Income: Customer's yearly household income
- Kidhome: Number of children in customer's household
- Teenhome: Number of teenagers in customer's household
- Dt_Customer: Date of customer's enrollment with the company
- Recency: Number of days since customer's last purchase
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise

### Products

- MntWines: Amount spent on wine in last 2 years
- MntFruits: Amount spent on fruits in last 2 years
- MntMeatProducts: Amount spent on meat in last 2 years
- MntFishProducts: Amount spent on fish in last 2 years
- MntSweetProducts: Amount spent on sweets in last 2 years
- MntGoldProds: Amount spent on gold in last 2 years

### Promotion

- NumDealsPurchases: Number of purchases made with a discount
- AcceptedCmp1: 1 if customer accepted 1st campaign offer, 0 otherwise
- AcceptedCmp2: 1 if customer accepted 2nd campaign offer, 0 otherwise
- AcceptedCmp3: 1 if customer accepted 3rd campaign offer, 0 otherwise
- AcceptedCmp4: 1 if customer accepted 4th campaign offer, 0 otherwise
- AcceptedCmp5: 1 if customer accepted 5th campaign offer, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

### Place

- NumWebPurchases: Number of purchases made through the company’s website
- NumCatalogPurchases: Number of purchases made using a catalogue
- NumStorePurchases: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to company’s website in the last month

In [None]:
import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")


In [None]:
'''
Verify which environment the notebook is running in
'''
import os
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

if iskaggle:
    path='/kaggle/input/customer-personality-analysis' 
else:
    path = os.getcwd()

<h1 style='background:#95E885; border:1; color:#F0FFFF; text-align:left; font-size:24px;  padding:6px;'><left>Import Libraries</left></h1>
---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)


from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans



# Import Dataset <span style=' border:1; color:#95E885; text-align:left; font-size:42px;  padding:6px;'>|</span>

In [None]:
# Import dataset
data = pd.read_csv(f"{path}/marketing_campaign.csv", sep='\t')
display(data.info())

display(data.describe())
display(data.head())

> The data contains 29 variables and 2240 observations.

# Data Cleaning <span style=' border:1; color:#95E885; text-align:left; font-size:42px;  padding:6px;'>|</span>
---

- Drop unused columns
- Validate column data types
- Remove missing values
- Remove outliers

### Drop **unused** columns

- **ID**: Remove because it's a unique identifier per customer, which doesn't contribute to predictive modeling and only serves to identify records.

- **Z_CostContact and Z_Revenue**: Remove because they have constant values across all data, providing no variability or useful information for analysis.
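
The claim that a column is constant can be checked rather than assumed. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) and `DataFrame.nunique()` to list the columns that hold a single distinct value.

```python
import pandas as pd

# Toy frame standing in for the marketing data (values are made up).
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Income": [58138.0, 46344.0, 71613.0],
    "Z_CostContact": [3, 3, 3],
    "Z_Revenue": [11, 11, 11],
})

# A column with only one distinct value carries no information.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)
```

Running the same two lines on `data` should confirm which columns are safe to drop for this reason.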


In [None]:
# Drop unnecessary columns 
data1 = data.drop(['ID', 'Z_CostContact', 'Z_Revenue'], axis=1)

### Validate columns **datatype**


- This table provides a structured overview to ensure that each feature is treated with the correct data type, facilitating more effective data handling and analysis.

| Feature Name           | Correct Data Type |
|------------------------|-------------------|
| Year_Birth             | int               |
| Education              | category          |
| Marital_Status         | category          |
| Income                 | float             |
| Kidhome                | int               |
| Teenhome               | int               |
| Dt_Customer            | datetime          |
| Recency                | int               |
| MntWines               | int               |
| MntFruits              | int               |
| MntMeatProducts        | int               |
| MntFishProducts        | int               |
| MntSweetProducts       | int               |
| MntGoldProds           | int               |
| NumDealsPurchases      | int               |
| NumWebPurchases        | int               |
| NumCatalogPurchases    | int               |
| NumStorePurchases      | int               |
| NumWebVisitsMonth      | int               |
| AcceptedCmp3           | bool              |
| AcceptedCmp4           | bool              |
| AcceptedCmp5           | bool              |
| AcceptedCmp1           | bool              |
| AcceptedCmp2           | bool              |
| Complain               | bool              |
| Response               | bool              |


In [None]:
'''
Validate Columns Datatype
'''

# List of categorical columns
categorical_cols = ['Education', 'Marital_Status']

# List of boolean columns
boolean_cols = ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3',
                'AcceptedCmp4','AcceptedCmp5', 'Complain', 'Response']

# List of numerical columns
numerical_cols = ['Year_Birth', 'Income', 'Kidhome', 'Teenhome', 'Recency', 
                  'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 
                  'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases', 
                  'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 
                  'NumWebVisitsMonth']

# Convert categorical columns to 'category' datatype
for col in categorical_cols:
    data1[col] = data1[col].astype('category')

# Convert boolean columns to 'boolean' datatype
for col in boolean_cols:
    data1[col] = data1[col].astype('boolean')

# Convert 'Dt_Customer' to datetime
data1['Dt_Customer'] = pd.to_datetime(data1['Dt_Customer'], format='%d-%m-%Y')



# Compare datatypes of the original and new DataFrames
compare_dtypes = pd.DataFrame({
    'Original DataType': data.dtypes,
    'New DataType': data1.dtypes
})

compare_dtypes

### Remove **missing values**

In [None]:
# Display missing values in a dataset
sns.heatmap(data1.isna())
plt.title('Missing values in dataset')
plt.show()

> The 'Income' feature contains missing values that will be removed.

In [None]:
# Removing missing values
data2 = data1.dropna(axis=0, how='any')

# Creating a dataframe comparing missing-value counts before and after
compare_missing_values = pd.DataFrame({'Original': data1.isna().sum(),
                                       'After': data2.isna().sum()})

print("Number of missing values: ")
display(compare_missing_values)

> The 24 missing values in the **'Income'** variable have been removed.

### Remove **outliers**

- I standardize all numerical columns before plotting, so that outliers are easier to detect.

In [None]:
# Normalizing the numerical columns
normalized_data = (data2[numerical_cols] - data2[numerical_cols].mean()) / data2[numerical_cols].std()

# Identify the most left outlier in 'Year_Birth'
year_birth_min_outlier = normalized_data['Year_Birth'].idxmin()

# Identify the most right outlier in 'Income'
income_max_outlier = normalized_data['Income'].idxmax()

# Plotting the boxplot
plt.figure(figsize=(15, 10))
sns_boxplot = sns.boxplot(data=normalized_data, orient='h')
plt.title("Normalized Boxplot for Numerical Columns")
plt.xticks(rotation=45)
plt.xlabel("Normalized Values")

# Annotating the most left outlier for 'Year_Birth'
plt.annotate('Outlier',
             xy=(normalized_data['Year_Birth'][year_birth_min_outlier], numerical_cols.index('Year_Birth')),
             xytext=(normalized_data['Year_Birth'][year_birth_min_outlier] + 0, numerical_cols.index('Year_Birth') + 2),
             arrowprops=dict(facecolor='red', shrink=0.1),
             fontsize=14, ha='center')

# Annotating the most right outlier for 'Income'
plt.annotate('Outlier',
             xy=(normalized_data['Income'][income_max_outlier], numerical_cols.index('Income')),
             xytext=(normalized_data['Income'][income_max_outlier] + 0, numerical_cols.index('Income') + 2),
             arrowprops=dict(facecolor='red', shrink=0.1),
             fontsize=14, ha='center')

# Turning off the x-axis ticks
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)


plt.show()


> Notice that two variables contain extreme outliers: 'Year_Birth' and 'Income'. Let's focus on these two variables.

In [None]:
# Setting up the subplots
fig, axs = plt.subplots(nrows=2, ncols=1, figsize=(8, 4))

# Plotting 'Year_Birth' on the first subplot
sns.boxplot(data=data2['Year_Birth'], orient='h', ax=axs[0])
axs[0].set_title('Boxplot of Year_Birth')

# Plotting 'Income' on the second subplot
sns.boxplot(data=data2['Income'], orient='h', ax=axs[1])
axs[1].set_title('Boxplot of Income')

plt.tight_layout()
plt.show()

>- The boxplot reveals outliers with birth years before 1900, indicating that these customers are either extremely old or deceased.
>- Similarly, the income data shows one extreme outlier with an income around $600,000. This could indicate a data entry error, an anomaly, or a genuinely high-income individual.
>- These outliers will be removed.
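
The notebook removes these outliers with hand-picked thresholds. As an alternative for comparison only, the sketch below flags outliers with the common 1.5 × IQR rule. The helper `iqr_bounds` and the toy income values are hypothetical and not part of the original analysis.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up incomes with one extreme value.
toy_income = pd.Series([30000, 42000, 51000, 58000, 69000, 666666], dtype=float)

low, high = iqr_bounds(toy_income)
flagged = toy_income[(toy_income < low) | (toy_income > high)]
print(flagged.tolist())
```

The IQR rule is stricter than the manual cut-offs used here and would likely flag more rows on skewed columns, so it is better treated as a screening aid than as an automatic filter.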

In [None]:
# Removing outliers: keep only rows with a plausible birth year and income
valid_year = (data2['Year_Birth'] > 1920)
valid_income = (data2['Income'] < 200000)

data3 = data2[valid_year & valid_income]

print(data2[['Year_Birth','Income']].describe())
print("\n===== After removing outliers =====\n")
print(data3[['Year_Birth','Income']].describe())

> The smallest values in 'Year_Birth' and the extreme high value in 'Income' have been removed as outliers.

#### Display data before and after outlier removal

In [None]:
plt.figure(figsize=(10, 6))

# Original Data Plot
plt.subplot(2, 1, 1)
sns.boxplot(data=normalized_data, orient='h')
plt.title("Original Data")
plt.xticks(rotation=45)
plt.ylabel("Normalized Values")
plt.xlabel("")
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)  # Turn off x-axis ticks


# Normalizing the filtered data
normalized_data_filtered = (data3[numerical_cols] - data3[numerical_cols].mean()) / data3[numerical_cols].std()


# Filtered Data Plot
plt.subplot(2, 1, 2, sharex=plt.gca())
sns.boxplot(data=normalized_data_filtered, orient='h')
plt.title("Data After Removing Outliers")
plt.xticks(rotation=45)
plt.ylabel("Normalized Values")
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)  # Turn off x-axis ticks



plt.tight_layout()
plt.show()



> The data is now cleaned and ready for further analysis. The next step is **'Feature Engineering'**.
---


# Feature Engineering <span style=' border:1; color:#95E885; text-align:left; font-size:42px;  padding:6px;'>|</span>
---

Create new features to enrich the analysis and improve model efficiency.
- Age: Age of the customer.
- Year_Membership: Period of membership, in years.
- Adulthome: Number of adults living in the household.
- Family_Size: Total number of people living in the household.
- Income_Segment: Income segments based on quartiles.
- Total_Spend: Total spending, the sum of the amounts spent on each product category.

In [None]:
data4 = data3.copy()

In [None]:
# Age: computed relative to 2021, the year the dataset was made public
data4['Age'] = 2021 - data4['Year_Birth']

data4[['Year_Birth', 'Age']].head()

In [None]:
# Year_Membership
data4['Year_Membership'] = 2021 - data4['Dt_Customer'].dt.year
data4[['Dt_Customer', 'Year_Membership']].head()

In [None]:
# Adult_home
mapping = {
    'Single': 1,
    'Together': 2,
    'Married': 2,
    'Divorced': 1,
    'Widow': 1,
    'Alone': 1,
    'Absurd': 1,
    'YOLO': 1}
data4['Adulthome'] = data4['Marital_Status'].map(mapping)

data4[['Marital_Status', 'Adulthome']].head(5)

In [None]:
# Family_Size
data4['Family_Size'] = data4[['Kidhome', 'Teenhome', 'Adulthome']].sum(axis=1)
data4[['Kidhome', 'Teenhome', 'Adulthome', 'Family_Size']].head()

In [None]:
# Income_Segment
data4['Income_Segment'] = pd.qcut(data4['Income'], 4, labels=['Low', 'Mid-Low', 'Mid-High', 'High'])

print(data4['Income'].describe()) #reference
data4[['Income', 'Income_Segment']].head()

In [None]:
data4

>The 'Income_Segment' is defined by ranges between the quartiles of the 'Income' column as follows:
>
> - 0 to 35,233 = 'Low'
> - 35,233 to 51,371 = 'Mid-Low'
> - 51,372 to 68,487 = 'Mid-High'
> - 68,488 and above = 'High'
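
The segment boundaries can be read directly from `pd.qcut` instead of being copied from `describe()`. The sketch below uses made-up incomes and the `retbins=True` option, which returns the bin edges alongside the labels.

```python
import pandas as pd

# Made-up incomes, evenly spaced so the quartiles are easy to follow.
toy_income = pd.Series(
    [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000], dtype=float
)
labels = ["Low", "Mid-Low", "Mid-High", "High"]

# retbins=True also returns the quartile edges used to cut the data.
segments, edges = pd.qcut(toy_income, 4, labels=labels, retbins=True)

print(edges)
print(segments.value_counts().sort_index())
```

Each bin is closed on the right, so a value exactly equal to an edge falls into the lower segment (the lowest edge is included in the first bin).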

In [None]:
# Total_Spend
data4['Total_Spend'] = data4[['MntWines', 'MntFruits', 'MntMeatProducts',
                              'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']].sum(axis=1)
data4[['MntWines', 'MntFruits', 'MntMeatProducts','MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'Total_Spend']].head()

In [None]:
#dataset after feature engineering
data4.head()

> The engineered features are now in place. The next section summarizes the cleaned dataset.


# After Cleaning Summary <span style=' border:1; color:#95E885; text-align:left; font-size:42px;  padding:6px;'>|</span>

In [None]:
print(data4.info())
display(data4.describe())

> 
> 1. **Age**: Customers were born between roughly 1940 and 1996, making them approximately 25-81 years old (taking 2021 as the current year).
> 2. **Income**: Has an average annual income of around 51,000 - 52,000.
> 
> 3. **Family Composition**: Likely to have a small family, with a mean of 0.44 kids and 0.51 teenagers at home. This suggests that they might have one child or teenager living with them, or possibly a mix of both. The average family size, including adults and children, is around 2.6.
> 
> 4. **Purchasing Habits**: Tends to spend more on wines, with a mean expenditure of around 305 in this category. Also spends on meats, fruits, fish, sweet products, and gold, but to a lesser extent.
> 
> 5. **Shopping Behavior**: Makes an average of 2.32 deals purchases, 4.09 web purchases, 2.67 catalog purchases, and 5.81 store purchases. Visits the web (presumably the store's website or related online platforms) around 5.32 times a month.
> 
> 6. **Engagement with Marketing Campaigns**: Generally has low engagement with marketing campaigns, as indicated by the low mean values for accepted campaigns.
> 
> 7. **Membership Duration**: The majority of customers have been members of the service or loyalty program for approximately 7-8 years, demonstrating sustained loyalty and engagement with the brand. It is unusual that there are no newer members. This may be an artifact of measuring membership against 2021, or it may suggest that the company has not accepted new registrations for the past 7 years.
> 
> 8. **Complaints**: Unlikely to have made complaints, as suggested by the low mean value in the 'Complain' category.
> <br>
>
> Our dataset is ready to analyse. Let's start the **'Exploratory Data Analysis'** process.

# Exploratory Data Analysis <span style=' border:1; color:#95E885; text-align:left; font-size:42px;  padding:6px;'>|</span>
---

1. **Distribution Analysis**: Analyze the distribution of key numerical variables.
1. **Categorical Analysis**: Understand the frequency of each category.
1. **Proportion Analysis**: Understand which family sizes and education levels are more prevalent in each income segment.
1. **Campaign Responses**: Look at how customers responded to different campaigns.
1. **Purchasing Behavior**: Explore purchases across different channels.
1. **Impact of Website**: Examine how the website impacts customer behaviour.
1. **Impact of 'Family Size'**: Explore how family size affects purchasing and product preferences.
1. **Correlation Analysis**: Discover the relationships between different variables.
1. **Analyzing Variables Across Segments**: Review how various variables, such as purchasing patterns, differ across distinct customer segments.

### 1. Distribution Analysis

In [None]:
# List of key numerical columns to visualize
numerical_columns = ['Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'Age',
       'Year_Membership', 'Family_Size', 'Total_Spend']

# Setting up the figure for multiple subplots
plt.figure(figsize=(15, 10))

# Plotting histograms for each numerical column
for i, col in enumerate(numerical_columns, 1):
    plt.subplot(4, 4, i)  # Adjust the grid dimensions as needed
    sns.histplot(data4[col],bins='auto', kde=True)
    plt.title(col)

plt.tight_layout()
plt.show()

> The plots show a series of histograms, each depicting the distribution of a different variable.
> 
> 1. **Recency**: The distribution appears fairly uniform, suggesting that customers have made purchases throughout the recent period sampled without significant time gaps.
> 
> 2. **MntWines**: Shows a right-skewed distribution with a peak at lower spending amounts, indicating that most customers spend less on wines, but there is a long tail of customers who spend more.
> 
> 3. **MntFruits, MntFishProducts, MntSweetProducts, MntGoldProds**: These are all right-skewed distributions, indicating that most customers spend smaller amounts on these product categories, with fewer customers spending more.
> 
> 4. **MntMeatProducts**: Also right-skewed, with a higher peak, suggesting that while there's a concentration of customers spending less, there's also a substantial number of customers spending a lot on meat products.
> 
> 5. **NumWebPurchases**: The plot shows that some customers have made 2-5 purchases on the web, while a portion has never tried purchasing via the website.
>
> 6. **NumCatalogPurchases**: The plot shows a right-skewed distribution, indicating that the majority of customers have never tried catalog purchasing.
>
> 7. **NumStorePurchases**: Although the plot shows signs of a right-skewed distribution, store purchases have a slightly gentler slope compared to other purchasing channels. Additionally, there are nearly no customers who have never made a store purchase.
> 
> 8. **NumDealsPurchases**: This histogram displays a high peak at the lower end, indicating that most customers purchase at least one discounted deal. There is a rapid decrease in the number of customers as the count of deal purchases increases.
> 
> 9. **NumWebVisitsMonth**: The distribution has a high peak between 5 and 10 visits, suggesting that most customers visit the company's website a few times per month.
> 
> 10. **Age**: The distribution of age looks approximately bell-shaped, centered around late middle age, implying a diverse customer base but with a concentration in late-middle-aged customers.
> 
> 11. **Year_Membership**: The data suggests that a significant number of customers have been members for approximately 8 years, with membership durations mostly ranging from 7 to 9 years.
> 
> 12. **Family_Size**: Most customers have a family size of 2 or 3, with single individuals and larger families being less common.
> 
> 13. **Total_Spend**: The distribution is right-skewed, indicating that while most customers have a lower total spend, there's a tail of customers who spend much more, up to 2500 units of currency.
> 
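
The skew read off the histograms can also be quantified. As a rough sketch on made-up numbers, `DataFrame.skew()` returns the sample skewness per column: values near 0 indicate a symmetric distribution, and positive values indicate a right-hand tail.

```python
import pandas as pd

# Made-up values: one symmetric column, one with a long right tail.
toy = pd.DataFrame({
    "Recency": [10, 30, 50, 70, 90],
    "MntWines": [5, 10, 20, 40, 900],
})

# Sample skewness per column.
skewness = toy.skew()
print(skewness.round(2))
```

Applying the same call to `data4[numerical_columns]` would give one number per histogram, which makes it easier to rank the variables by how skewed they are.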

### 2. Categorical Analysis

In [None]:
# Select the categorical columns of data4
multi_cat_cols = data4.select_dtypes('category').columns
print(multi_cat_cols.values)

# Setting up the figure for multiple subplots with a shared y-axis
plt.figure(figsize=(12, 4))

# Track the maximum count to set a common y-axis range
max_count = 0
for col in multi_cat_cols:
    max_count = max(max_count, data4[col].value_counts().max())

# Plotting bar charts for each categorical column with shared y-axis
for i, col in enumerate(multi_cat_cols, 1):
    plt.subplot(1, 3, i, sharey=ax1 if i > 1 else None)  # Sharing y-axis with the first subplot
    ax1 = sns.countplot(x=data4[col], order=data4[col].value_counts().index, color='seagreen')
    plt.title(f'Distribution of {col}')
    plt.tick_params(axis='x', rotation=45)

    # Set the same y-axis limit for all subplots based on the maximum count found
    plt.ylim(0, max_count)

plt.tight_layout()
plt.show()


In [None]:
multi_cat_cols = data4.select_dtypes('category').columns
print(multi_cat_cols.values)

# Setting up the figure for multiple subplots
plt.figure(figsize=(12, 4))

# Plotting bar charts for each categorical column
for i, col in enumerate(multi_cat_cols, 1):
    plt.subplot(1, 3, i)
    sns.countplot(x=data4[col], order = data4[col].value_counts().index, color='seagreen')
    plt.title(f'Distribution of {col}')
    plt.tick_params(axis='x', rotation=45)
    plt.tight_layout()
    
    plt.ylim(0, 1200)

plt.show()


> - Higher education levels like "Graduation" and "PhD" dominate.
> - Marital status varies widely, but traditional statuses like "Married", "Together", and "Single" are more common than others.
> - Income is evenly distributed across the segments because I divided income by quartile, which balances the number of customers per segment by construction.

### 3. Proportion Analysis
- Calculate the proportion of each **'Family size'** within each income segment.
- Calculate the proportion of each **'Education'** within each income segment.
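
The proportions described above can be obtained from `pd.crosstab` with its `normalize` argument. The sketch below uses made-up rows; `normalize="columns"` makes each income segment's column sum to 1, so each cell is the share of a family size within that segment.

```python
import pandas as pd

# Made-up rows: four customers per income segment.
toy = pd.DataFrame({
    "Family_Size": [1, 2, 2, 3, 1, 2, 3, 3],
    "Income_Segment": ["Low", "Low", "Low", "Low",
                       "High", "High", "High", "High"],
})

# Share of each family size within each income segment.
share = pd.crosstab(toy["Family_Size"], toy["Income_Segment"],
                    normalize="columns")
print(share)
```

Plotting `share.T` as a stacked bar chart would give bars of equal height, which makes segments of different sizes directly comparable.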

In [None]:
# Crosstabulation between Family_Size and Income_Segment
Family_Size_income_segment_crosstab = pd.crosstab(data4['Family_Size'], data4['Income_Segment'])

# Visualization using a stacked bar chart
Family_Size_income_segment_crosstab.plot(kind='bar', stacked=True, figsize=(6, 4))
plt.title('Family Size Distribution within Each Income Segment')
plt.xlabel('Family Size')
plt.ylabel('Number of Customers')
plt.xticks(rotation=0)
plt.show()

> Overall, the chart indicates that the number of customers tends to decrease as family size increases. Interestingly, larger families appear to be poorer than smaller families: they are less common in the higher income segments.

In [None]:
# Crosstabulation between Education and Income_Segment
education_income_segment_crosstab = pd.crosstab(data4['Education'], data4['Income_Segment'])

# Visualization using a stacked bar chart
education_income_segment_crosstab.plot(kind='bar', stacked=True, figsize=(6,4))
plt.title('Education Distribution within Each Income Segment')
plt.xlabel('Education')
plt.ylabel('Number of Customers')
plt.xticks(rotation=45)
plt.show()


> - Overall, the chart reveals a clear trend between those with education levels above graduation and those with only basic education. Specifically, basic education is associated solely with the 'Low' income segment. 
> - Conversely, higher education levels, such as a PhD, show no significant difference in income distribution compared to other graduate-level education, such as Master's degrees and Bachelor's degrees.

### 4. Campaign Responses

In [None]:
# Calculating the overall response rates for each campaign
campaign_columns = ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5']
campaign_response_rates = data4[campaign_columns].mean() * 100  # Converting to percentages

# Visualization of response rates
plt.figure(figsize=(6, 4))
sns.barplot(x=campaign_response_rates.index, y=campaign_response_rates.values)

# Calculate the average and add a horizontal line
average_response_rate = campaign_response_rates.mean()
plt.axhline(average_response_rate, color='red', linestyle='dashed', linewidth=1)
plt.text(len(campaign_response_rates)+0.7, average_response_rate+0.2, f'Average: {average_response_rate:.2f}%', color='red', ha='right')


plt.title('Response Rates for Each Campaign')
plt.ylabel('Response Rate (%)')
plt.xticks(rotation=90)
plt.show()

campaign_response_rates


> The bar chart visualizes the response rates for five different promotional campaigns, with each bar representing the percentage of customers who accepted the offer in each campaign. 
> 1. Campaigns 3, 4, and 5 have the highest response rates, which suggests that the offers or methods of these campaigns were particularly appealing or well received.
> 1. Campaign 2's strategy was the least effective, given its significantly lower response rate compared to the others.
> 1. Campaign 1 was moderately successful, with a response rate slightly below the highest ones.

### 5. Purchasing Behavior
To analyze purchasing behavior across different channels, we will focus on **NumWebPurchases, NumCatalogPurchases, NumStorePurchases**, and **NumDealsPurchases**. These represent purchases made through the company’s website, using a catalogue, directly in stores, and with a discount, respectively. **NumWebVisitsMonth** is included for reference. We approach this analysis by comparing the mean of each variable across income segments.

In [None]:
purchases_data = data4.groupby('Income_Segment')[['NumDealsPurchases','NumWebPurchases',
                       'NumCatalogPurchases', 'NumStorePurchases',
                       'NumWebVisitsMonth'
                      ]].mean()

purchases_data.plot(kind='bar')
plt.show()

> The chart shows the purchasing behavior of different income segments across several channels: deals, web, catalog, and store purchases, as well as the number of web visits per month. Let's analyze the trends observed in each income segment:
> 
> 1. **Deal Purchases**: Vary with income in a way that suggests wealthier customers care less about discounted purchases.
> 1. **Store Purchases**: Correlate with income levels.
> 1. **Web Purchases**: Show a negative correlation with income.
> 1. **Catalog Purchases**: Correlate with income levels.
> 1. **Web Visits**: Show a negative correlation with income.

### 6. Impact of Website
Next, let's examine the correlation between web visits and web purchases.

In [None]:
purchases_data = data4.groupby('Income_Segment')[['NumWebPurchases','NumWebVisitsMonth']].mean()

purchases_data.plot(kind='bar')
plt.show()


> From the chart, we can observe the following key trends regarding the number of web visits per month and the average number of web purchases across different income segments:
> - There is a clear negative correlation between income level and the number of web visits; as income increases, the frequency of web visits decreases.
> - Web purchase behavior does not directly correlate with the number of web visits; higher web visits do not necessarily translate to more purchases, especially in lower income segments.
> - High Income Segment experiences a further decrease in web purchases, despite their lower web visit numbers, which might reflect higher quality or more expensive purchases that occur less frequently.
> 
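
One way to put a number on how well visits convert into purchases is a purchases-per-visit ratio per segment. The sketch below uses made-up rows, and the column `Purchases_per_Visit` is a hypothetical name introduced for illustration. Because the two source columns may not cover the same time window, the ratio is only a relative indicator between segments.

```python
import pandas as pd

# Made-up rows: two customers per income segment.
toy = pd.DataFrame({
    "Income_Segment": ["Low", "Low", "High", "High"],
    "NumWebPurchases": [1, 2, 5, 4],
    "NumWebVisitsMonth": [8, 7, 3, 3],
})

# Total purchases and visits per segment, then their ratio.
totals = toy.groupby("Income_Segment")[
    ["NumWebPurchases", "NumWebVisitsMonth"]
].sum()
totals["Purchases_per_Visit"] = (
    totals["NumWebPurchases"] / totals["NumWebVisitsMonth"]
)
print(totals)
```

A segment with many visits and a low ratio browses more than it buys, which is the pattern described for the lower income segments.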

### 7. Impact of 'Family Size'
Next, let's investigate the impact of family size on spending habits. We'll create a series of boxplots to observe how family size influences spending on different product categories.

In [None]:
spending_columns = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts',
                    'MntSweetProducts', 'MntGoldProds', 'Total_Spend']

# Impact of Family Size on Spending Habits
plt.figure(figsize=(15, 10))

# Creating boxplots for each spending category against Family Size
for i, col in enumerate(spending_columns, 1):
    plt.subplot(3, 3, i)
    sns.boxplot(x=data4['Family_Size'], y=data4[col])
    plt.title(f'Family Size vs {col}')
    plt.tight_layout()

plt.show()


>The boxplots provide a visual summary of the spending distribution across different family sizes for various product categories:
>1. Across all product categories, as family size increases, median spending tends to decrease.
>1. Single-person families (family size 1) have the greatest variability in spending, which could be due to more disposable income or fewer family obligations.
>1. Outliers are present in all categories, suggesting that there are exceptions to the general spending patterns within each family size group.
>1. Smaller families have a wider range of spending, which narrows for larger family sizes, possibly indicating budget constraints or different spending priorities.
>
>These insights suggest that **family size does have an influence on spending habits**, with larger families generally spending less on these product categories. This could be due to budget allocation towards other necessities or preferences.
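
The medians drawn in the boxplots can be tabulated directly. A minimal sketch with made-up spending values:

```python
import pandas as pd

# Made-up rows: two customers per family size.
toy = pd.DataFrame({
    "Family_Size": [1, 1, 2, 2, 3, 3],
    "Total_Spend": [1200, 800, 600, 400, 150, 50],
})

# Median total spend for each family size.
median_spend = toy.groupby("Family_Size")["Total_Spend"].median()
print(median_spend)
```

On `data4`, the same `groupby` over all of `spending_columns` would show whether the downward trend holds for every product category or only for some.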

### 8. Correlation Analysis
Let's explore the relationships between different variables in the dataset. We'll focus on:

- Income & Spending
- Webvisit & WebPurchase
- Family Size & Spending

In [None]:
spending_columns = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts',
                    'MntSweetProducts', 'MntGoldProds', 'Total_Spend']

# Function to format the tick labels
def format_tick(value, pos):
    return f'{int(value/1000)}k' if value >= 1000 else str(int(value))

# Setting up the figure for multiple subplots
plt.figure(figsize=(15, 10))

# Creating scatter plots with red regression lines for each spending category against Income
for i, col in enumerate(spending_columns, 1):
    plt.subplot(3, 3, i)
    sns.regplot(x=data4['Income'], y=data4[col], marker='x', scatter_kws={'alpha':0.3}, line_kws={"color": "red", 'alpha':0.3})
    plt.title(f'Income vs {col}')
    
    # Format the x-tick labels with the custom formatter (e.g. 50000 -> 50k)
    ax = plt.gca()
    ax.xaxis.set_major_formatter(plt.FuncFormatter(format_tick))

    plt.tight_layout()

plt.show()


> The scatter plots reveal interesting patterns in the relationship between income and spending on various products:
>
> 1. **Income vs MntWines**: There seems to be the strongest positive correlation, suggesting that higher income customers tend to spend more on wines.
> 1. **Income vs MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts**: These categories also show a similar trend, with higher spending observed at higher income levels, although the correlation appears weaker compared to wine.
> 1. **Income vs MntGoldProds**: The pattern is less clear, but there is still an indication that higher income might lead to higher spending on gold products.
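
The visual impression of stronger and weaker relationships can be checked with correlation coefficients. The sketch below computes the Pearson correlation of each spending column with income on made-up values. Pearson only captures linear association, so it is a rough summary for skewed spending data.

```python
import pandas as pd

# Made-up values: wine rises steadily with income, gold only loosely.
toy = pd.DataFrame({
    "Income": [20000, 35000, 50000, 65000, 80000],
    "MntWines": [10, 60, 250, 500, 900],
    "MntGoldProds": [20, 15, 40, 30, 45],
})

# Correlation of every spending column with Income.
income_corr = toy.corr(method="pearson")["Income"].drop("Income")
print(income_corr.round(2))
```

Passing `method="spearman"` instead ranks the values first, which is less sensitive to the long right tails seen in the spending histograms.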

### 9. Analyzing Variables Across Segments
We'll examine key variables across the income segments to observe how spending patterns differ.

For numerical variables, we'll use boxplots to compare the distribution of key numerical variables across different income segments. This will help us understand how variables like 'MntWines', 'MntMeatProducts', and others vary with income.


In [None]:
# Visualization for Numerical Variables across Income Segments
numerical_variables_for_visualization = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 
                                         'MntSweetProducts','MntGoldProds', 'Total_Spend']

# Setting up the figure for multiple subplots
plt.figure(figsize=(15, 12))

# Creating boxplots for each numerical variable
for i, col in enumerate(numerical_variables_for_visualization, 1):
    plt.subplot(3, 3, i)
    sns.boxplot(x=data4['Income_Segment'], y=data4[col])
    plt.title(f'{col} by Income Segment')
    plt.tight_layout()

plt.show()


> The boxplots provide a visual representation of how key numerical variables vary across different income segments:
> 
> 1. Ranked by income segment, spending rises with income: the higher a family's income, the more it spends across product categories.
> 1. Wine is the product where the difference in spending across income levels is most obvious.
> 1. Gold is the only product on which 'Mid-High' and 'High' families spend similarly. This can infer..
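
A per-segment summary table is a useful companion to the boxplots. The sketch below assumes quartile-based segments built with `pd.qcut` and hypothetical labels; the notebook's own `Income_Segment` column may use different cut points or names, and the data here is synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
income = rng.uniform(10_000, 100_000, size=400)
demo = pd.DataFrame({
    "Income": income,
    "MntWines": 0.008 * income + rng.normal(0, 50, size=400),
})

# Hypothetical quartile-based segments (assumption, not the notebook's definition)
labels = ["Low", "Mid-Low", "Mid-High", "High"]
demo["Income_Segment"] = pd.qcut(demo["Income"], q=4, labels=labels)

# Median, mean and row count of wine spending per segment
segment_summary = (
    demo.groupby("Income_Segment", observed=True)["MntWines"]
        .agg(["median", "mean", "count"])
)
print(segment_summary.round(1))
```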

# Remove Unused Columns After Analysis <span style=' border:1; color:#95E885; text-align:left; font-size:42px;  padding:6px;'>|</span>

When preparing data for a customer segmentation analysis using a clustering model, the choice of which columns to drop depends on the relevance of the data to the analysis objectives.

In [None]:
data_new = data4.copy()

# Step 1: Remove unused columns
drop_cols = ['Year_Birth', 'Marital_Status', 'Dt_Customer', 'Kidhome', 'Teenhome', 'Adulthome'
             , 'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5']
data_processed = data_new.drop(columns=drop_cols)

display(data_new.head())
print(data_new.shape)
print("\nAfter removing columns: ")
display(data_processed.head())
print(data_processed.shape)

# Data Preprocessing <span style=' border:1; color:#95E885; text-align:left; font-size:42px;  padding:6px;'>|</span>
---

### How to deal with a dataset that has many columns

Dealing with high-dimensional data can be complex due to computational demands, increased risk of overfitting, and difficulty in visualizing or interpreting the data. Principal Component Analysis (PCA) is a common method for addressing these challenges.

### What is PCA?
PCA is a linear transformation technique used to reduce the dimensionality of a dataset while preserving the maximum variance. It does this by finding new axes, known as principal components, which represent the directions of greatest variance in the data.
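
To make "directions of greatest variance" concrete, here is a minimal NumPy sketch that computes principal components from the eigen-decomposition of the covariance matrix. The two-feature dataset is synthetic and only for illustration; the notebook itself relies on scikit-learn's `PCA` rather than this hand-rolled version.

```python
import numpy as np

rng = np.random.default_rng(42)
# Two correlated synthetic features (illustrative data only)
x = rng.normal(size=300)
X = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=300)])

# 1. Center the data
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix and its eigen-decomposition
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort components by descending variance (eigh returns ascending order)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project the data onto the principal components
scores = X_centered @ eigvecs
explained_ratio = eigvals / eigvals.sum()
print(explained_ratio.round(3))
```

The variance of each column of `scores` equals the corresponding eigenvalue, and the columns are uncorrelated with one another.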

### Steps to Use PCA with High-Dimensional Data

1. **Standardize the Data**: <br>
High-dimensional data often contain features with different scales. Standardize (or normalize) the data to ensure each feature has a mean of zero and a standard deviation of one. This step is crucial for PCA to work effectively.

2. **Fit PCA**: <br>
Apply PCA to the standardized data to obtain the principal components. You can specify the number of components to retain based on the cumulative explained variance or a specific number of components.

3. **Choose the Number of Components**: <br>
Determine the optimal number of principal components to retain. You can plot the explained variance ratio to find the point where adding more components yields diminishing returns.

4. **Use PCA for Dimensionality Reduction**: <br>
Once you've determined the optimal number of components, use the principal components to transform the original dataset, effectively reducing its dimensionality. This new dataset can be used for machine learning or data analysis tasks.

5. **Visualize the Reduced Data**: <br>
With fewer dimensions, it's easier to visualize the data. You can plot the first two or three principal components to see how the data points are distributed.
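
The steps above can be sketched end to end on a small synthetic dataset. Everything here is an assumption for illustration: the data, the 95% variance target, and the explicit `svd_solver="full"` (chosen so that a fractional `n_components` is accepted). It is not the configuration used later in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: 200 rows, 6 features built from 2 latent factors plus noise
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))
X *= np.array([1, 10, 100, 1, 10, 100])  # put features on very different scales

# Step 1: standardize each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# Steps 2-3: fit PCA and keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95, svd_solver="full")

# Step 4: transform the data into the reduced space
X_reduced = pca.fit_transform(X_scaled)

print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_.round(3))
```

For Step 5, the leading columns of `X_reduced` can be passed to a scatter plot in the same way the notebook later plots the first two principal components.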

## Step 1: Standardize and Encode the Data

In [None]:
# Step 1: Identify the numerical and categorical columns

numerical_cols = data_processed.select_dtypes(include=['int32', 'int64', 'float64']).columns
categorical_cols = data_processed.select_dtypes(include=['object', 'category', 'boolean']).columns

print("Numeric: {}".format(numerical_cols.values))
print("\n")
print("Category: {}".format(categorical_cols.values))

In [None]:
# Step 1 (continued): Create a ColumnTransformer for standardizing and encoding
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(), categorical_cols)
    ])

# Apply the transformations to the data
preprocessed_data = preprocessor.fit_transform(data_processed)

# Checking the shape of the transformed data
preprocessed_data_shape = preprocessed_data.shape

print("Original data has {} columns".format(data_new.shape[1]))
print("Preprocessed data has {} columns".format(preprocessed_data_shape[1]))

## Step 2: Fit PCA

In [None]:
# Step 2: Fit PCA

# The number of components is not specified, so PCA will retain all components but ordered by explained variance
pca = PCA()
pca_data = pca.fit_transform(preprocessed_data)

# Getting the explained variance ratio to determine how many components to keep
explained_variance_ratio = pca.explained_variance_ratio_

# Displaying the cumulative explained variance to decide on the number of components
cumulative_explained_variance = explained_variance_ratio.cumsum()

cumulative_explained_variance


## Step 3: Choose the Number of Components

In [None]:
# Creating subplot [1, 2] for the provided plots
plt.figure(figsize=(12, 6))

# First plot: Bar chart for explained variance ratio
plt.subplot(1, 2, 1)
plt.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio)
plt.xlabel('Number of Components')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance Ratio by PCA Components')
plt.grid(True)

# Second plot: Line plot for cumulative explained variance
plt.subplot(1, 2, 2)
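# Start the x-axis at 1 so it matches the number of components
plt.plot(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, marker='o')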
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Explained Variance by PCA Components')
plt.grid(True)

# Show the combined plot
plt.tight_layout()
plt.show()
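
Besides reading the plots by eye, the number of components can be picked with a simple threshold rule. This is a small sketch using hypothetical explained-variance ratios and an assumed 90% target; it does not reproduce the values computed above.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted as PCA returns them
explained_variance_ratio = np.array([0.40, 0.20, 0.15, 0.10, 0.08, 0.04, 0.03])
cumulative = explained_variance_ratio.cumsum()

threshold = 0.90  # assumed target; adjust to the analysis at hand

# Smallest number of components whose cumulative variance reaches the threshold
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print(n_components)  # 5
```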


## Step 4: Use PCA for Dimensionality Reduction

In [None]:
# Applying PCA with 15 components
pca_15 = PCA(n_components=15)
pca_15.fit(preprocessed_data)


# Shape of Preprocessed Data
print("The shape of the preprocessed data before applying PCA:")
print(f"Number of observations: {preprocessed_data.shape[0]}")
print(f"Number of features: {preprocessed_data.shape[1]}")


# Principal Components
print("\n\n====After applying PCA ==== ")
print(f"Shape of PCA components: {pca_15.components_.shape}")

In [None]:
import seaborn as sns

# Visualizing the importance of each original variable in the first few principal components
# We will use a heatmap for this purpose

# Reconstructing the feature names after one-hot encoding
feature_names = numerical_cols.tolist() + preprocessor.named_transformers_['cat'].get_feature_names_out().tolist()

# Correctly displaying the PCA loadings with the transformed feature names
pca_loadings_corrected = pd.DataFrame(pca_15.components_, columns=feature_names)

plt.figure(figsize=(12, 8))
sns.heatmap(pca_loadings_corrected.iloc[:5, :].transpose(), 
            cmap='YlGnBu', 
            annot=True, 
            fmt=".2f", 
            cbar_kws={'label': 'Loading Value'},
            xticklabels=[f'PC{i}' for i in range(1,6)])
plt.title('PCA Loadings - Importance of Each Original Variable in the First 5 Components')
plt.xlabel('Principal Component')
plt.ylabel('Original Variable')
plt.yticks(rotation=0)  # To keep the variable names readable
plt.show()


## Step 5: Visualize the Reduced Data

In [None]:
import pandas as pd

# Transforming the preprocessed data with PCA to reduce its dimensions
reduced_data = pca_15.transform(preprocessed_data)

# Checking the shape of the reduced data
print("Shape of the reduced data:", reduced_data.shape)

# Converting the reduced data to a DataFrame for better readability
reduced_data_df = pd.DataFrame(reduced_data, columns=[f'PC{i+1}' for i in range(reduced_data.shape[1])])

# Displaying the first few rows of the reduced data
print("First few rows of the reduced PCA data:")
display(reduced_data_df.head())


# Clustering with K-Means <span style=' border:1; color:#95E885; text-align:left; font-size:42px;  padding:6px;'>|</span>

### How can we cluster data?
Clustering is a technique used in data mining and machine learning to group similar data points together. There are various algorithms for clustering, but one of the most popular methods is K-means clustering. 

### Why can K-means cluster data?
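K-means is one of the most popular clustering algorithms because of its simplicity and efficiency. It works based on the following principles:

1. **Initialization**: Choose K initial centroids (for example, with the `k-means++` scheme).
2. **Assignment**: Assign each data point to its nearest centroid.
3. **Update**: Recompute each centroid as the mean of the points assigned to it.
4. **Repeat**: Iterate the assignment and update steps until the centroids stop moving or a maximum number of iterations is reached.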
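
The iterative procedure behind K-means can be sketched in a few lines of NumPy. This is an illustrative version with random initial centroids (the hypothetical helper `kmeans_sketch`), not scikit-learn's implementation, which adds `k-means++` initialization and other refinements.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with random initial centroids (illustrative only)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
])
labels, centroids = kmeans_sketch(X, k=2)
print(centroids.round(2))
```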

### How can we select the number of clusters in K-means?
Selecting the optimal number of clusters, often denoted by "K", is a crucial step in K-means clustering. Several methods can be used for this purpose, with one of the most commonly employed being the Elbow method. In the Elbow method, the within-cluster sum of squares (WCSS) is plotted against the number of clusters. The "elbow point" represents the optimal number of clusters, where adding more clusters does not significantly reduce WCSS.
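
WCSS is the sum, over all points, of the squared distance between a point and the centre of its assigned cluster. As a quick check on synthetic data, the sketch below computes it by hand and compares it with the `inertia_` attribute that scikit-learn's `KMeans` exposes; `n_init=10` is set explicitly here as an assumption, not taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic blobs
X = np.vstack([
    rng.normal(loc=c, scale=0.6, size=(60, 2))
    for c in ((0, 0), (4, 0), (2, 4))
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# WCSS: squared distance from every point to its assigned cluster centre
assigned_centres = kmeans.cluster_centers_[kmeans.labels_]
wcss_manual = float(((X - assigned_centres) ** 2).sum())

print(round(wcss_manual, 3), round(kmeans.inertia_, 3))
```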

### Selecting the Number of Clusters for K-means
Identify Elbow Point: Examine the plot. The point where the rate of decrease of WCSS slows down (forming an "elbow" shape) is often a good indication of the appropriate number of clusters. This is because adding more clusters beyond this point may not significantly reduce the WCSS.

In [None]:
# Calculate WCSS for different number of clusters
wcss = []  # Within-Cluster-Sum-of-Squares
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(reduced_data)
    wcss.append(kmeans.inertia_)

# Plotting the results onto a line graph to observe 'The elbow'
plt.figure(figsize=(10, 6))
lines = plt.plot(range(1, 11), wcss, marker='o', linestyle='--')

# Annotate the 3rd marker
third_marker = lines[0].get_xydata()[2]  # Get the x,y coordinates of the 3rd point
plt.scatter(*third_marker, s=100, color='red')  # Highlight the 3rd marker
plt.annotate('Elbow Point', (third_marker[0], third_marker[1]), textcoords="offset points", xytext=(-10,10), ha='center', color='red')

plt.title('Elbow Method For Optimal Number of Clusters')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')  # Within cluster sum of squares
plt.grid(True)

# Display the plot
plt.show()


In [None]:
# Compare K-Means partitions for several values of k

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# 'reduced_data' is the PCA-transformed data from Step 5

# Number of clusters to try
cluster_counts = [2, 3, 4, 5, 6, 7]

n_rows = 2
n_cols = 3

fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 10))

for i, k in enumerate(cluster_counts):
    row = i // n_cols
    col = i % n_cols

    # Apply K-means clustering
    kmeans = KMeans(n_clusters=k, random_state=42)
    clusters = kmeans.fit_predict(reduced_data)

    # Scatter plot using the first two principal components
    axes[row, col].scatter(reduced_data[:, 0], reduced_data[:, 1], c=clusters, cmap='viridis', marker='o', alpha=0.3)
    axes[row, col].set_title(f'K-Means Clustering with {k} Clusters', size=16)
    axes[row, col].set_xlabel('Principal Component 1')
    axes[row, col].set_ylabel('Principal Component 2')
    axes[row, col].grid(True)

plt.tight_layout()
plt.show()



> The analysis of the K-Means clustering plots for k values ranging from 2 to 7 suggests that **k=3** is the most suitable choice for clustering this data. With three clusters, the groups are distinct, well-separated, and each captures a dense group of data points, providing a clear and meaningful partition of the dataset. Increasing the number of clusters beyond three leads to overlaps and potentially over-segmentation.
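
Visual inspection can be complemented with a numeric score. The sketch below uses the silhouette score, a different technique from the elbow method used in this notebook, on synthetic blobs; the data and the `n_init=10` setting are assumptions for illustration, and the result says nothing about the real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs
X = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(80, 2))
    for c in ((0, 0), (5, 0), (2.5, 5))
])

# Mean silhouette score for each candidate k (higher is better)
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print({k: round(v, 3) for k, v in scores.items()}, best_k)
```

On the notebook's data, the same loop could be run on `reduced_data` to see whether the score agrees with the choice of three clusters.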

## Apply the chosen number of clusters

In [None]:
# Assigning the color palette
palette = 'viridis'

# Assigning cluster labels
kmeans_3_clusters = KMeans(n_clusters=3, random_state=42)
cluster_labels = kmeans_3_clusters.fit_predict(reduced_data)
data_new['Cluster'] = cluster_labels

plt.figure(figsize=(10, 6))

# Scatter plot of Income vs Total_Spend, colored by cluster (the KDE overlay below is disabled)
sns.scatterplot(data=data_new, x='Income', y='Total_Spend', hue='Cluster', palette=palette, alpha=1)
#     sns.kdeplot(data=data_new, x='Income', y=col, hue='Cluster', palette=palette, alpha=.9, linewidths=1)
plt.title('Income vs Total Spend by Cluster')
plt.xlabel('Income')
plt.ylabel('Total spend')

plt.tight_layout()
plt.show()

# Summary <span style=' border:1; color:#95E885; text-align:left; font-size:42px;  padding:6px;'>|</span>

## Key observations from the clustering results:

1. **Cluster Characteristics**:
   - **<p style="color: purple;">Cluster 0 (Purple)</p>**: Represents customers with a wide range of incomes but generally lower total spending. This is the largest cluster and suggests a group of customers who are conservative spenders across different income levels.
   - **<p style="color: teal;">Cluster 1 (Teal)</p>**: Encompasses customers with moderate incomes and moderate to high total spending. This cluster is tightly packed, indicating a strong correlation between income and spending for this group.
   - **<p style="color: gold;">Cluster 2 (Yellow)</p>**: Contains customers with high incomes and high total spending. It appears to be smaller and more spread out than the other two clusters, suggesting these are premium customers who vary more in their spending despite high incomes.


<br>

2. **Income vs. Spend Correlation**:
   - There is a positive correlation between income and total spend, which is most apparent in clusters 1 and 2. As income increases, total spend also tends to increase.
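
Observations like these can be checked against a per-cluster profile table. The following is a minimal sketch on a tiny hypothetical frame standing in for `data_new`; the numbers are invented for illustration and are not results from this analysis.

```python
import pandas as pd

# Hypothetical stand-in for data_new with its 'Cluster' column
demo = pd.DataFrame({
    "Income":      [20_000, 25_000, 30_000, 52_000, 58_000, 85_000, 90_000],
    "Total_Spend": [    60,     90,    120,    700,    820,  1_600,  1_900],
    "Cluster":     [     0,      0,      0,      1,      1,      2,      2],
})

# Size, mean income and mean spend per cluster
cluster_profile = demo.groupby("Cluster").agg(
    customers=("Income", "size"),
    mean_income=("Income", "mean"),
    mean_spend=("Total_Spend", "mean"),
)
# Share of all customers that falls in each cluster
cluster_profile["share"] = cluster_profile["customers"] / len(demo)
print(cluster_profile.round(2))
```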

# SWOT Analysis and Next Steps <span style=' border:1; color:#95E885; text-align:left; font-size:42px;  padding:6px;'>|</span>


#### <p style="color: purple;">Cluster 0 (Purple): Conservative Spenders</p>
| **Category** | **Description** |
|:--------------|:-----------------|
| **Strengths** | Large customer base, stable revenue, less impacted by economic changes. |
| **Weaknesses** | Low revenue per customer, limited growth. |
| **Opportunities** | Upsell services, create loyalty programs to boost spending. |
| **Threats** | Price sensitive, risk of losing to cheaper options. |
| **Next Steps** | Offer affordable products, improve rewards in loyalty programs. |

#### <p style="color: teal;">Cluster 1 (Teal): Middle Income, Moderate to High Spenders</p>
| **Category** | **Description** |
|:--------------|:-----------------|
| **Strengths** | Regular spending, responds well to value-focused marketing. |
| **Weaknesses** | Limited extra money for luxuries, faces lots of competition. |
| **Opportunities** | Introduce varied product levels, use targeted marketing. |
| **Threats** | Vulnerable to economic downturns, could lose customers to different brands. |
| **Next Steps** | Use data to customize products and ads, offer bundled products. |

#### <p style="color: gold;">Cluster 2 (Yellow): High Income, High Spenders</p>
| **Category** | **Description** |
|:--------------|:-----------------|
| **Strengths** | More profitable, likes high-end products. |
| **Weaknesses** | Fewer customers, expects high quality. |
| **Opportunities** | Sell unique, high-quality items, provide personalized service. |
| **Threats** | Spending may drop during economic hardship, faces strong competition. |
| **Next Steps** | Focus on high-end products and services, enhance personalized shopping experiences. |
