# Unsupervised Lab Session

## Learning outcomes:
- Exploratory data analysis and data preparation for model building.
- PCA for dimensionality reduction.
- K-means and agglomerative clustering.

## Problem Statement
Based on the given marketing campaign dataset, segment similar customers into suitable clusters. Analyze the clusters and provide insights to help the organization promote its business.

## Context:
- Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.
- Customer personality analysis helps a business tailor its product to target customers from different customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market it only to that segment.

## About dataset
- Source: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis?datasetId=1546318&sortBy=voteCount

### Attribute Information:
- ID: Customer's unique identifier
- Year_Birth: Customer's birth year
- Education: Customer's education level
- Marital_Status: Customer's marital status
- Income: Customer's yearly household income
- Kidhome: Number of children in customer's household
- Teenhome: Number of teenagers in customer's household
- Dt_Customer: Date of customer's enrollment with the company
- Recency: Number of days since customer's last purchase
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise
- MntWines: Amount spent on wine in last 2 years
- MntFruits: Amount spent on fruits in last 2 years
- MntMeatProducts: Amount spent on meat in last 2 years
- MntFishProducts: Amount spent on fish in last 2 years
- MntSweetProducts: Amount spent on sweets in last 2 years
- MntGoldProds: Amount spent on gold in last 2 years
- NumDealsPurchases: Number of purchases made with a discount
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
- NumWebPurchases: Number of purchases made through the company’s website
- NumCatalogPurchases: Number of purchases made using a catalogue
- NumStorePurchases: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to company’s website in the last month

### 1. Import required libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score


### 2. Load the CSV file (i.e., marketing.csv) and display the first 5 rows of the dataframe. Check the shape and info of the dataset.

In [2]:
# Load the dataset
df = pd.read_csv('marketing.csv')

# Display the first 5 rows, the shape, and the column info
print(df.head())
print(df.shape)
df.info()  # info() prints its summary itself and returns None, so no print() is needed


     ID  Year_Birth   Education Marital_Status   Income  Kidhome  Teenhome  \
0  5524        1957  Graduation         Single  58138.0        0         0   
1  2174        1954  Graduation         Single  46344.0        1         1   
2  4141        1965  Graduation       Together  71613.0        0         0   
3  6182        1984  Graduation       Together  26646.0        1         0   
4  5324        1981         PhD        Married  58293.0        1         0   

  Dt_Customer  Recency  MntWines  ...  NumCatalogPurchases  NumStorePurchases  \
0    4/9/2012       58       635  ...                   10                  4   
1    8/3/2014       38        11  ...                    1                  2   
2  21-08-2013       26       426  ...                    2                 10   
3   10/2/2014       26        11  ...                    0                  4   
4  19-01-2014       94       173  ...                    3                  6   

   NumWebVisitsMonth  AcceptedCmp3  Accepted

### 3. Check the percentage of missing values. If any are present, treat them accordingly.

In [3]:
# Check for missing values and calculate percentage
missing_percentage = df.isnull().sum() / len(df) * 100
print(missing_percentage)
# If there are missing values, handle them using imputation or removal


ID                     0.000000
Year_Birth             0.000000
Education              0.000000
Marital_Status         0.000000
Income                 1.071429
Kidhome                0.000000
Teenhome               0.000000
Dt_Customer            0.000000
Recency                0.000000
MntWines               0.000000
MntFruits              0.000000
MntMeatProducts        0.000000
MntFishProducts        0.000000
MntSweetProducts       0.000000
MntGoldProds           0.000000
NumDealsPurchases      0.000000
NumWebPurchases        0.000000
NumCatalogPurchases    0.000000
NumStorePurchases      0.000000
NumWebVisitsMonth      0.000000
AcceptedCmp3           0.000000
AcceptedCmp4           0.000000
AcceptedCmp5           0.000000
AcceptedCmp1           0.000000
AcceptedCmp2           0.000000
Complain               0.000000
Response               0.000000
dtype: float64
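
One way to treat the roughly 1% of missing `Income` values is median imputation, which is less sensitive to extreme incomes than the mean. The sketch below runs on a small made-up frame rather than `marketing.csv`. The values and the choice of the median are assumptions for illustration, and dropping the affected rows is an equally reasonable option at this missing rate.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the marketing data (values are made up for illustration)
toy = pd.DataFrame({
    "Income": [58138.0, 46344.0, np.nan, 26646.0, 58293.0],
    "Recency": [58, 38, 26, 26, 94],
})

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100
print(missing_pct)

# Impute missing incomes with the column median (NaNs are skipped by median())
income_median = toy["Income"].median()
toy["Income"] = toy["Income"].fillna(income_median)

# Confirm that no missing values remain
print(toy["Income"].isnull().sum())
```

Whichever treatment is chosen, it should run before any step that cannot handle NaNs, such as PCA.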


### 4. Check whether there are any duplicate records in the dataset. If there are, drop them.

In [4]:
# Check for duplicates
print(df.duplicated().sum())

# Drop duplicates
df = df.drop_duplicates()

# Verify if duplicates are dropped
print(df.shape)


0
(2240, 27)


### 5. Drop the columns which you think are redundant for the analysis

In [5]:
# List of columns to drop (left empty here; 'ID' and 'Dt_Customer' are dropped in step 11)
columns_to_drop = []

# Drop the columns
df = df.drop(columns=columns_to_drop)

# Verify the dataframe after dropping columns
print(df.head())


     ID  Year_Birth   Education Marital_Status   Income  Kidhome  Teenhome  \
0  5524        1957  Graduation         Single  58138.0        0         0   
1  2174        1954  Graduation         Single  46344.0        1         1   
2  4141        1965  Graduation       Together  71613.0        0         0   
3  6182        1984  Graduation       Together  26646.0        1         0   
4  5324        1981         PhD        Married  58293.0        1         0   

  Dt_Customer  Recency  MntWines  ...  NumCatalogPurchases  NumStorePurchases  \
0    4/9/2012       58       635  ...                   10                  4   
1    8/3/2014       38        11  ...                    1                  2   
2  21-08-2013       26       426  ...                    2                 10   
3   10/2/2014       26        11  ...                    0                  4   
4  19-01-2014       94       173  ...                    3                  6   

   NumWebVisitsMonth  AcceptedCmp3  Accepted

### 6. Check the unique categories in the column 'Marital_Status'
- i) Group categories 'Married', 'Together' as 'relationship'
- ii) Group categories 'Divorced', 'Widow', 'Alone', 'YOLO', and 'Absurd' as 'Single'.

In [6]:
# Check the unique categories before grouping
print(df['Marital_Status'].unique())

# Group categories in 'Marital_Status'
df['Marital_Status'] = df['Marital_Status'].replace({
    'Married': 'relationship',
    'Together': 'relationship',
    'Divorced': 'Single',
    'Widow': 'Single',
    'Alone': 'Single',
    'YOLO': 'Single',
    'Absurd': 'Single'
})


### 7. Group the columns 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', and 'MntGoldProds' as 'Total_Expenses'

In [7]:
# Group expenses columns
df['Total_Expenses'] = df[['MntWines', 'MntFruits', 'MntMeatProducts', 
                           'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']].sum(axis=1)

# Drop individual expense columns
df = df.drop(columns=['MntWines', 'MntFruits', 'MntMeatProducts', 
                      'MntFishProducts', 'MntSweetProducts', 'MntGoldProds'])


### 8. Group the columns 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', and 'NumDealsPurchases' as 'Num_Total_Purchases'

In [8]:
# Group purchases columns
df['Num_Total_Purchases'] = df[['NumWebPurchases', 'NumCatalogPurchases', 
                                'NumStorePurchases', 'NumDealsPurchases']].sum(axis=1)

# Drop individual purchase columns
df = df.drop(columns=['NumWebPurchases', 'NumCatalogPurchases', 
                      'NumStorePurchases', 'NumDealsPurchases'])


### 9. Group the columns 'Kidhome' and 'Teenhome' as 'Kids'

In [9]:
# Group 'Kidhome' and 'Teenhome'
df['Kids'] = df['Kidhome'] + df['Teenhome']

# Drop individual columns
df = df.drop(columns=['Kidhome', 'Teenhome'])


### 10. Group the columns 'AcceptedCmp1' through 'AcceptedCmp5' and 'Response' as 'TotalAcceptedCmp'

In [10]:
# Group campaign acceptance columns
df['TotalAcceptedCmp'] = df[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 
                             'AcceptedCmp4', 'AcceptedCmp5', 'Response']].sum(axis=1)

# Drop individual campaign acceptance columns
df = df.drop(columns=['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 
                      'AcceptedCmp4', 'AcceptedCmp5', 'Response'])


### 11. Drop those columns which we have used above for obtaining new features

In [11]:
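# The source columns were already dropped in steps 7-10 as each new feature was built.
# Here we drop the identifier and the enrollment date, which are not used as clustering features.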
df = df.drop(columns=['ID', 'Dt_Customer'])


### 12. Extract 'Age' using the column 'Year_Birth' and then drop the column 'Year_Birth'

In [12]:
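# Extract age from 'Year_Birth'
# (2024 is a hard-coded reference year; a constant offset has no effect once the column is standardized)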
df['Age'] = 2024 - df['Year_Birth']

# Drop 'Year_Birth'
df = df.drop(columns=['Year_Birth'])


### 13. Encode the categorical variables in the dataset

In [13]:
# Encode categorical variables as integer codes.
# One encoder per column is kept so that each fitted mapping (classes_) stays available.
encoders = {}
for col in ['Education', 'Marital_Status']:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])
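
`LabelEncoder` assigns integer codes in sorted label order, so the codes carry no inherent ranking, and uppercase labels sort before lowercase ones. The sketch below, on a toy frame, shows how to inspect the fitted mapping through `classes_` and decode the integers again. The category `'Master'` and the dictionary name `encoders` are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; 'Master' is an assumed category, the others appear in the lab
toy = pd.DataFrame({
    "Education": ["Graduation", "PhD", "Master", "PhD"],
    "Marital_Status": ["Single", "relationship", "relationship", "Single"],
})

# Fit one encoder per column and keep it for later decoding
encoders = {}
for col in ["Education", "Marital_Status"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# classes_ holds the labels in sorted order; the position is the integer code
for col, enc in encoders.items():
    print(col, dict(enumerate(enc.classes_)))

# Decode the integer codes back to the original labels
decoded = encoders["Education"].inverse_transform(toy["Education"])
print(list(decoded))
```

Keeping the fitted encoders around makes it possible to translate cluster summaries back into readable category names later.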


### 14. Standardize the columns so that all features are on a comparable scale

In [14]:
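# Standardize numerical columns (zero mean, unit variance).
# StandardScaler ignores NaNs when fitting and leaves them in place, so missing 'Income' values remain.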
scaler = StandardScaler()
numerical_cols = df.select_dtypes(include=np.number).columns.tolist()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
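
Standardization does not squeeze values into a fixed range. It rescales each column to zero mean and unit variance. A minimal check on toy data (the values are made up) could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy columns on very different scales (values are illustrative only)
toy = pd.DataFrame({
    "Income": [58138.0, 46344.0, 71613.0, 26646.0, 58293.0],
    "Recency": [58.0, 38.0, 26.0, 26.0, 94.0],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# Each column now has mean ~0 and population standard deviation ~1
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

Note the `ddof=0`: `StandardScaler` divides by the population standard deviation, whereas pandas' `std()` defaults to the sample version.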


### 15. Apply PCA to the standardized dataset and determine how many principal components are needed to explain 90-95% of the variance in the data.

In [17]:
# Check for missing values ('Income' still contains NaNs at this point)
missing_values = df.isnull().sum()
print(missing_values)

# PCA cannot be fitted on NaNs, so drop the rows with a missing 'Income' (about 1% of the data).
# Imputing instead (for example with sklearn.impute.SimpleImputer) is an alternative,
# but it must replace this dropna() call rather than follow it.
df = df.dropna()
# Apply PCA
pca = PCA()
pca.fit(df)

# Calculate cumulative explained variance
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Determine the smallest number of components that explains at least 95% of the variance
n_components = np.argmax(cumulative_variance >= 0.95) + 1
print("Number of components for 95% variance:", n_components)

# Apply PCA with the determined number of components
pca = PCA(n_components=n_components)
pca_data = pca.fit_transform(df)


Education               0
Marital_Status          0
Income                 24
Recency                 0
NumWebVisitsMonth       0
Complain                0
Total_Expenses          0
Num_Total_Purchases     0
Kids                    0
TotalAcceptedCmp        0
Age                     0
dtype: int64
Number of components for 95% variance: 9
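
scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` and then chooses the number of components needed to explain that fraction of the variance. The sketch below compares this shortcut with the manual cumulative-variance approach on synthetic data. The data, the random seed, and the explicit `svd_solver="full"` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized feature matrix (not the lab data):
# 8 observed features driven by 3 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 8))
X = latent @ mixing + 0.3 * rng.normal(size=(300, 8))
X = StandardScaler().fit_transform(X)

# Manual route: full PCA, then threshold the cumulative explained variance
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
n_manual = int(np.argmax(cumulative >= 0.95) + 1)

# Shortcut: a float n_components asks PCA to choose the count itself
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X)

print(n_manual, pca_95.n_components_)
```

Plotting `cumulative` against the component index is a useful complement, since it shows how sharply the explained variance levels off.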


### 16. Apply K-means clustering and segment the data (Use PCA transformed data for clustering)

In [21]:
# Step 16: Apply K-means clustering and assign clusters
desired_clusters = 5  # Replace with your desired number of clusters
kmeans = KMeans(n_clusters=desired_clusters, random_state=42, n_init=10)
kmeans.fit(pca_data)  # Fit K-means on PCA transformed data

# Take an explicit copy first: df came from dropna(), and assigning a new column
# to it directly can trigger pandas' SettingWithCopyWarning
df = df.copy()
df['KMeans_Cluster'] = kmeans.labels_

# Display cluster assignments
print(df['KMeans_Cluster'].value_counts())

KMeans_Cluster
2    707
0    570
3    491
1    427
4     21
Name: count, dtype: int64
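
The cluster count of 5 used above is a placeholder. One common way to choose it is to compare silhouette scores over a range of candidate values. The sketch below does this on synthetic blobs that stand in for `pca_data`. The blob centers, the candidate range 2 to 6, and the variable names are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for pca_data: three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
                  cluster_std=0.8, random_state=42)

# Fit K-means for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(k, round(score, 3))
print("Best k by silhouette:", best_k)
```

On real customer data the peak is usually far less pronounced, so the silhouette score is best read alongside an elbow plot of `inertia_` and the business interpretability of the segments.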


### 17. Apply Agglomerative clustering and segment the data (Use Original data for clustering), and perform cluster analysis by doing bivariate analysis between the cluster label and different features and write your observations.
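
A minimal sketch of this step follows, run on synthetic customers rather than the lab data. "Original data" is read here as the standardized, pre-PCA features, which is an interpretation. The column values, Ward linkage, and the choice of 2 clusters are likewise assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Synthetic customers (values are made up; column names mirror the lab's features)
rng = np.random.default_rng(42)
low = pd.DataFrame({"Income": rng.normal(30000, 3000, 50),
                    "Total_Expenses": rng.normal(100, 20, 50)})
high = pd.DataFrame({"Income": rng.normal(80000, 3000, 50),
                     "Total_Expenses": rng.normal(1500, 100, 50)})
customers = pd.concat([low, high], ignore_index=True)

# Scale first so Income does not dominate the distance computation
X = StandardScaler().fit_transform(customers)

# Ward linkage with 2 clusters is an illustrative choice, not a recommendation
agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
customers["Agg_Cluster"] = agg.fit_predict(X)

# Bivariate view: average of each feature per cluster label
print(customers.groupby("Agg_Cluster").mean().round(1))
```

The per-cluster means are a starting point for the bivariate analysis. A call such as `sns.boxplot(data=customers, x="Agg_Cluster", y="Income")` shows the same comparison graphically.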

### Visualization and Interpretation of results

-----
## Happy Learning
-----