# Unsupervised Lab Session

## Learning outcomes:
- Exploratory data analysis and data preparation for model building.
- PCA for dimensionality reduction.
- K-means and agglomerative clustering.

## Problem Statement
Based on the given marketing campaign dataset, segment similar customers into suitable clusters. Analyze the clusters and provide insights that help the organization promote its business.

## Context:
- Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.
- Customer personality analysis helps a business tailor its product to target customers in different customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only to that segment.

## About dataset
- Source: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis?datasetId=1546318&sortBy=voteCount

### Attribute Information:
- ID: Customer's unique identifier
- Year_Birth: Customer's birth year
- Education: Customer's education level
- Marital_Status: Customer's marital status
- Income: Customer's yearly household income
- Kidhome: Number of children in customer's household
- Teenhome: Number of teenagers in customer's household
- Dt_Customer: Date of customer's enrollment with the company
- Recency: Number of days since customer's last purchase
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise
- MntWines: Amount spent on wine in last 2 years
- MntFruits: Amount spent on fruits in last 2 years
- MntMeatProducts: Amount spent on meat in last 2 years
- MntFishProducts: Amount spent on fish in last 2 years
- MntSweetProducts: Amount spent on sweets in last 2 years
- MntGoldProds: Amount spent on gold in last 2 years
- NumDealsPurchases: Number of purchases made with a discount
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
- NumWebPurchases: Number of purchases made through the company’s website
- NumCatalogPurchases: Number of purchases made using a catalogue
- NumStorePurchases: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to company’s website in the last month

### 1. Import required libraries

In [None]:
import pandas as pd

### 2. Load the CSV file (i.e., marketing.csv) and display the first 5 rows of the dataframe. Check the shape and info of the dataset.

In [1]:
import pandas as pd
file_path = 'C:/Users/hp/Downloads/Lab 4 - Unsu[ervised learinng/marketing.csv'

# Load the dataset
df = pd.read_csv(file_path)

# Display a sample of five rows
print(df.head()) 
# Check the shape of the data
print("\nShape of the data (rows, columns):")
print(df.shape)

# Check general information about the dataframe
print("\nGeneral information about the dataframe:")
print(df.info())

     ID  Year_Birth   Education Marital_Status   Income  Kidhome  Teenhome  \
0  5524        1957  Graduation         Single  58138.0        0         0   
1  2174        1954  Graduation         Single  46344.0        1         1   
2  4141        1965  Graduation       Together  71613.0        0         0   
3  6182        1984  Graduation       Together  26646.0        1         0   
4  5324        1981         PhD        Married  58293.0        1         0   

  Dt_Customer  Recency  MntWines  ...  NumCatalogPurchases  NumStorePurchases  \
0    4/9/2012       58       635  ...                   10                  4   
1    8/3/2014       38        11  ...                    1                  2   
2  21-08-2013       26       426  ...                    2                 10   
3   10/2/2014       26        11  ...                    0                  4   
4  19-01-2014       94       173  ...                    3                  6   

   NumWebVisitsMonth  AcceptedCmp3  Accepted

### 3. Check the percentage of missing values. If any are present, treat them accordingly.

In [2]:
import pandas as pd

# Correcting the file path
file_path = r'C:/Users/hp/Downloads/Lab 4 - Unsu[ervised learinng/marketing.csv'

# Load the dataset
df = pd.read_csv(file_path)

# Check for missing values
missing_values = df.isnull().sum()

# Calculate percentage of missing values
missing_percentage = (missing_values / len(df)) * 100

# Combine missing values and percentages into a DataFrame for clarity
missing_info = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage': missing_percentage
})

print("Missing Value Information:")
print(missing_info)

# Handle missing values based on your chosen method (e.g., imputation or dropping rows)
# Example: Impute missing values in numeric columns with mean
numeric_columns = df.select_dtypes(include='number').columns
for col in numeric_columns:
    df[col].fillna(df[col].mean(), inplace=True)

# Example: Impute missing values in categorical columns with mode
categorical_columns = df.select_dtypes(include='object').columns
for col in categorical_columns:
    df[col].fillna(df[col].mode()[0], inplace=True)

# Alternatively, you can drop rows with any missing values
# df.dropna(axis=0, inplace=True)

# Now your dataset (`df`) should be processed with missing values handled appropriately


Missing Value Information:
                     Missing Values  Percentage
ID                                0    0.000000
Year_Birth                        0    0.000000
Education                         0    0.000000
Marital_Status                    0    0.000000
Income                           24    1.071429
Kidhome                           0    0.000000
Teenhome                          0    0.000000
Dt_Customer                       0    0.000000
Recency                           0    0.000000
MntWines                          0    0.000000
MntFruits                         0    0.000000
MntMeatProducts                   0    0.000000
MntFishProducts                   0    0.000000
MntSweetProducts                  0    0.000000
MntGoldProds                      0    0.000000
NumDealsPurchases                 0    0.000000
NumWebPurchases                   0    0.000000
NumCatalogPurchases               0    0.000000
NumStorePurchases                 0    0.000000
NumWebVisitsM

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna(df[col].mean(), inplace=True)
  df[col].fillna(df[col].mode()[0], inplace=True)
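
The warning above is triggered by calling `fillna(..., inplace=True)` on a column selection. A minimal sketch of the assignment form that the warning itself recommends, using a small made-up frame (the values are hypothetical, not taken from marketing.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the marketing data
toy = pd.DataFrame({
    "Income": [50000.0, np.nan, 70000.0],
    "Education": ["PhD", None, "PhD"],
})

# Numeric columns: assign the filled column back instead of using inplace=True
for col in toy.select_dtypes(include="number").columns:
    toy[col] = toy[col].fillna(toy[col].mean())

# Non-numeric columns: fill with the most frequent value
for col in toy.select_dtypes(exclude="number").columns:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

Assigning the result back to `toy[col]` modifies the original frame directly, so it does not depend on the chained `inplace` behavior that pandas 3.0 changes.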


### 4. Check whether there are any duplicate records in the dataset. If there are, drop them.

In [3]:
import pandas as pd

# Correcting the file path
file_path = r'C:/Users/hp/Downloads/Lab 4 - Unsu[ervised learinng/marketing.csv'

# Load the dataset
df = pd.read_csv(file_path)

# Check for duplicate records
duplicate_rows = df[df.duplicated()]

# Print the number of duplicate records
print(f"Number of duplicate records: {duplicate_rows.shape[0]}")

# Optionally, display the duplicate rows
if duplicate_rows.shape[0] > 0:
    print("\nDuplicate Rows:")
    print(duplicate_rows)

# Drop duplicate records
df.drop_duplicates(inplace=True)

# Confirm duplicates have been dropped
print(f"Dataset shape after removing duplicates: {df.shape}")

# Now your dataset (`df`) should be processed with duplicate records removed


Number of duplicate records: 0
Dataset shape after removing duplicates: (2240, 27)


### 5. Drop the columns that you consider redundant for the analysis

In [6]:
import pandas as pd

# Correcting the file path
file_path = r'C:/Users/hp/Downloads/Lab 4 - Unsu[ervised learinng/marketing.csv'

# Load the dataset
df = pd.read_csv(file_path)

# Display current columns
print("Columns before dropping redundant ones:")
print(df.columns)

# Example: Dropping columns with constant values
constant_columns = [col for col in df.columns if df[col].nunique() == 1]
df.drop(columns=constant_columns, inplace=True)

# Example: Dropping columns with high missing values (threshold set to 30%)
missing_threshold = 30  # Set your own threshold as needed
missing_percentage = (df.isnull().sum() / len(df)) * 100
high_missing_columns = missing_percentage[missing_percentage > missing_threshold].index.tolist()
df.drop(columns=high_missing_columns, inplace=True)

# Example: Dropping columns that are not relevant to analysis
columns_to_drop = ['IrrelevantColumn1', 'IrrelevantColumn2']

# Check if columns_to_drop exist in df.columns before dropping
columns_to_drop_existing = [col for col in columns_to_drop if col in df.columns]
df.drop(columns=columns_to_drop_existing, inplace=True)

# Display columns after dropping redundant ones
print("\nColumns after dropping redundant ones:")
print(df.columns)


Columns before dropping redundant ones:
Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Complain', 'Response'],
      dtype='object')

Columns after dropping redundant ones:
Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'Ac
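
The visible part of the column list is unchanged: no column is constant or mostly missing, and the placeholder names `IrrelevantColumn1` and `IrrelevantColumn2` do not exist in the data. One possible choice, offered as an assumption rather than the lab's required answer, is to drop `ID`, since a unique row identifier carries no information for segmentation. A sketch on three rows copied from the head output:

```python
import pandas as pd

toy = pd.DataFrame({
    "ID": [5524, 2174, 4141],
    "Income": [58138.0, 46344.0, 71613.0],
    "Recency": [58, 38, 26],
})

# Columns judged redundant here (an assumption): the row identifier
redundant = ["ID"]

# errors="ignore" skips any listed name that is not present
toy = toy.drop(columns=redundant, errors="ignore")
print(toy.columns.tolist())
```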

### 6. Check the unique categories in the column 'Marital_Status'
- i) Group categories 'Married', 'Together' as 'relationship'
- ii) Group categories 'Divorced', 'Widow', 'Alone', 'YOLO', and 'Absurd' as 'Single'.

In [7]:
import pandas as pd

# Correcting the file path
file_path = r'C:/Users/hp/Downloads/Lab 4 - Unsu[ervised learinng/marketing.csv'

# Load the dataset
df = pd.read_csv(file_path)

# Check unique categories in 'Marital_Status'
unique_categories = df['Marital_Status'].unique()
print("Unique categories in 'Marital_Status':")
print(unique_categories)

# Group categories as specified
category_mapping = {
    'Married': 'relationship',
    'Together': 'relationship',
    'Divorced': 'Single',
    'Widow': 'Single',
    'Alone': 'Single',
    'YOLO': 'Single',
    'Absurd': 'Single'
}

# Replace categories in 'Marital_Status' column
df['Marital_Status'] = df['Marital_Status'].replace(category_mapping)

# Verify the changes
updated_unique_categories = df['Marital_Status'].unique()
print("\nUpdated unique categories in 'Marital_Status':")
print(updated_unique_categories)


Unique categories in 'Marital_Status':
['Single' 'Together' 'Married' 'Divorced' 'Widow' 'Alone' 'Absurd' 'YOLO']

Updated unique categories in 'Marital_Status':
['Single' 'relationship']


### 7. Group the columns 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', and 'MntGoldProds' as 'Total_Expenses'

In [8]:
import pandas as pd

# Correcting the file path
file_path = r'C:/Users/hp/Downloads/Lab 4 - Unsu[ervised learinng/marketing.csv'

# Load the dataset
df = pd.read_csv(file_path)

# Specify columns to be summed
expense_columns = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']

# Create 'Total_Expenses' column by summing specified columns
df['Total_Expenses'] = df[expense_columns].sum(axis=1)

# Display the updated DataFrame with 'Total_Expenses'
print(df.head())  # Displaying the first few rows to verify

# Optionally, you can drop the individual expense columns if needed
# df.drop(columns=expense_columns, inplace=True)


     ID  Year_Birth   Education Marital_Status   Income  Kidhome  Teenhome  \
0  5524        1957  Graduation         Single  58138.0        0         0   
1  2174        1954  Graduation         Single  46344.0        1         1   
2  4141        1965  Graduation       Together  71613.0        0         0   
3  6182        1984  Graduation       Together  26646.0        1         0   
4  5324        1981         PhD        Married  58293.0        1         0   

  Dt_Customer  Recency  MntWines  ...  NumStorePurchases  NumWebVisitsMonth  \
0    4/9/2012       58       635  ...                  4                  7   
1    8/3/2014       38        11  ...                  2                  5   
2  21-08-2013       26       426  ...                 10                  4   
3   10/2/2014       26        11  ...                  4                  6   
4  19-01-2014       94       173  ...                  6                  5   

   AcceptedCmp3  AcceptedCmp4  AcceptedCmp5  AcceptedCmp

### 8. Group the columns 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', and 'NumDealsPurchases' as 'Num_Total_Purchases'

In [9]:
import pandas as pd

# Correcting the file path
file_path = r'C:/Users/hp/Downloads/Lab 4 - Unsu[ervised learinng/marketing.csv'

# Load the dataset
df = pd.read_csv(file_path)

# Specify columns to be summed
purchase_columns = ['NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumDealsPurchases']

# Create 'Num_Total_Purchases' column by summing specified columns
df['Num_Total_Purchases'] = df[purchase_columns].sum(axis=1)

# Display the updated DataFrame with 'Num_Total_Purchases'
print(df.head())  # Displaying the first few rows to verify

# Optionally, you can drop the individual purchase columns if needed
# df.drop(columns=purchase_columns, inplace=True)


     ID  Year_Birth   Education Marital_Status   Income  Kidhome  Teenhome  \
0  5524        1957  Graduation         Single  58138.0        0         0   
1  2174        1954  Graduation         Single  46344.0        1         1   
2  4141        1965  Graduation       Together  71613.0        0         0   
3  6182        1984  Graduation       Together  26646.0        1         0   
4  5324        1981         PhD        Married  58293.0        1         0   

  Dt_Customer  Recency  MntWines  ...  NumStorePurchases  NumWebVisitsMonth  \
0    4/9/2012       58       635  ...                  4                  7   
1    8/3/2014       38        11  ...                  2                  5   
2  21-08-2013       26       426  ...                 10                  4   
3   10/2/2014       26        11  ...                  4                  6   
4  19-01-2014       94       173  ...                  6                  5   

   AcceptedCmp3  AcceptedCmp4  AcceptedCmp5  AcceptedCmp

### 9. Group the columns 'Kidhome' and 'Teenhome' as 'Kids'

In [10]:
import pandas as pd

file_path = r'C:/Users/hp/Downloads/Lab 4 - Unsu[ervised learinng/marketing.csv'

# Load the dataset
df = pd.read_csv(file_path)

# Create 'Kids' column by summing 'Kidhome' and 'Teenhome'
df['Kids'] = df['Kidhome'] + df['Teenhome']

# Display the updated DataFrame with 'Kids'
print(df.head())  # Displaying the first few rows to verify

# Optionally, you can drop the individual 'Kidhome' and 'Teenhome' columns if needed
# df.drop(columns=['Kidhome', 'Teenhome'], inplace=True)


     ID  Year_Birth   Education Marital_Status   Income  Kidhome  Teenhome  \
0  5524        1957  Graduation         Single  58138.0        0         0   
1  2174        1954  Graduation         Single  46344.0        1         1   
2  4141        1965  Graduation       Together  71613.0        0         0   
3  6182        1984  Graduation       Together  26646.0        1         0   
4  5324        1981         PhD        Married  58293.0        1         0   

  Dt_Customer  Recency  MntWines  ...  NumStorePurchases  NumWebVisitsMonth  \
0    4/9/2012       58       635  ...                  4                  7   
1    8/3/2014       38        11  ...                  2                  5   
2  21-08-2013       26       426  ...                 10                  4   
3   10/2/2014       26        11  ...                  4                  6   
4  19-01-2014       94       173  ...                  6                  5   

   AcceptedCmp3  AcceptedCmp4  AcceptedCmp5  AcceptedCmp

### 10. Group the columns 'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', and 'Response' as 'TotalAcceptedCmp'

In [11]:
import pandas as pd

file_path = r'C:/Users/hp/Downloads/Lab 4 - Unsu[ervised learinng/marketing.csv'

# Load the dataset
df = pd.read_csv(file_path)

# Specify columns to be summed
accepted_cmp_columns = ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']

# Create 'TotalAcceptedCmp' column by summing specified columns
df['TotalAcceptedCmp'] = df[accepted_cmp_columns].sum(axis=1)

# Display the updated DataFrame with 'TotalAcceptedCmp'
print(df.head())  # Displaying the first few rows to verify

# Optionally, you can drop the individual accepted campaign columns if needed
# df.drop(columns=accepted_cmp_columns, inplace=True)


     ID  Year_Birth   Education Marital_Status   Income  Kidhome  Teenhome  \
0  5524        1957  Graduation         Single  58138.0        0         0   
1  2174        1954  Graduation         Single  46344.0        1         1   
2  4141        1965  Graduation       Together  71613.0        0         0   
3  6182        1984  Graduation       Together  26646.0        1         0   
4  5324        1981         PhD        Married  58293.0        1         0   

  Dt_Customer  Recency  MntWines  ...  NumStorePurchases  NumWebVisitsMonth  \
0    4/9/2012       58       635  ...                  4                  7   
1    8/3/2014       38        11  ...                  2                  5   
2  21-08-2013       26       426  ...                 10                  4   
3   10/2/2014       26        11  ...                  4                  6   
4  19-01-2014       94       173  ...                  6                  5   

   AcceptedCmp3  AcceptedCmp4  AcceptedCmp5  AcceptedCmp

### 11. Drop the columns used above to create the new features

In [12]:
# Assuming 'df' is your DataFrame and you have already created the new features

# Drop original columns used for creating 'Total_Expenses'
expense_columns = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
df.drop(columns=expense_columns, inplace=True)

# Drop original columns used for creating 'Num_Total_Purchases'
purchase_columns = ['NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumDealsPurchases']
df.drop(columns=purchase_columns, inplace=True)

# Drop original columns used for creating 'Kids'
df.drop(columns=['Kidhome', 'Teenhome'], inplace=True)

# Drop original columns used for creating 'TotalAcceptedCmp'
accepted_cmp_columns = ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']
df.drop(columns=accepted_cmp_columns, inplace=True)

# Optionally, you can print the updated DataFrame to verify
print(df.head())


     ID  Year_Birth   Education Marital_Status   Income Dt_Customer  Recency  \
0  5524        1957  Graduation         Single  58138.0    4/9/2012       58   
1  2174        1954  Graduation         Single  46344.0    8/3/2014       38   
2  4141        1965  Graduation       Together  71613.0  21-08-2013       26   
3  6182        1984  Graduation       Together  26646.0   10/2/2014       26   
4  5324        1981         PhD        Married  58293.0  19-01-2014       94   

   NumWebVisitsMonth  Complain  TotalAcceptedCmp  
0                  7         0                 1  
1                  5         0                 0  
2                  4         0                 0  
3                  6         0                 0  
4                  5         0                 0  
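
The output above contains `TotalAcceptedCmp` but not `Total_Expenses`, `Num_Total_Purchases`, or `Kids`, because each earlier cell reloads the CSV and so discards the column created by the previous cell. A minimal sketch of building all four features on one DataFrame before dropping their sources; the one-row frame is hypothetical and the helper name `add_aggregate_features` is illustrative:

```python
import pandas as pd

# Source columns for each aggregate feature, as listed in steps 7 to 10
FEATURE_SOURCES = {
    "Total_Expenses": ["MntWines", "MntFruits", "MntMeatProducts",
                       "MntFishProducts", "MntSweetProducts", "MntGoldProds"],
    "Num_Total_Purchases": ["NumWebPurchases", "NumCatalogPurchases",
                            "NumStorePurchases", "NumDealsPurchases"],
    "Kids": ["Kidhome", "Teenhome"],
    "TotalAcceptedCmp": ["AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3",
                         "AcceptedCmp4", "AcceptedCmp5", "Response"],
}

def add_aggregate_features(df):
    """Return a copy with the summed features added and their sources dropped."""
    out = df.copy()
    for new_col, sources in FEATURE_SOURCES.items():
        out[new_col] = out[sources].sum(axis=1)
        out = out.drop(columns=sources)
    return out

# Hypothetical one-row frame: every source column set to 1
all_sources = [c for cols in FEATURE_SOURCES.values() for c in cols]
toy = pd.DataFrame({col: [1] for col in all_sources})
result = add_aggregate_features(toy)
print(result)
```

Running the feature steps on the same `df`, without calling `pd.read_csv` again in between, would have the same effect in the notebook.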


### 12. Extract 'age' using the column 'Year_Birth' and then drop the column 'Year_Birth'

In [15]:
import pandas as pd

# Correcting the file path
file_path = r'C:/Users/hp/Downloads/Lab 4 - Unsu[ervised learinng/marketing.csv'

# Load the dataset
df = pd.read_csv(file_path)

# Print the first few rows of the dataset to inspect its structure
print(df.head())

# Check if 'Year_Birth' column exists in the dataset
if 'Year_Birth' in df.columns:
    # Calculate 'age' from 'Year_Birth' column
    current_year = 2024
    df['age'] = current_year - df['Year_Birth']

    # Drop 'Year_Birth' column
    df.drop(columns=['Year_Birth'], inplace=True)

    # Display the updated DataFrame with 'age' and without 'Year_Birth'
    print(df.head())
else:
    print("Column 'Year_Birth' not found in the dataset.")


     ID  Year_Birth   Education Marital_Status   Income  Kidhome  Teenhome  \
0  5524        1957  Graduation         Single  58138.0        0         0   
1  2174        1954  Graduation         Single  46344.0        1         1   
2  4141        1965  Graduation       Together  71613.0        0         0   
3  6182        1984  Graduation       Together  26646.0        1         0   
4  5324        1981         PhD        Married  58293.0        1         0   

  Dt_Customer  Recency  MntWines  ...  NumCatalogPurchases  NumStorePurchases  \
0    4/9/2012       58       635  ...                   10                  4   
1    8/3/2014       38        11  ...                    1                  2   
2  21-08-2013       26       426  ...                    2                 10   
3   10/2/2014       26        11  ...                    0                  4   
4  19-01-2014       94       173  ...                    3                  6   

   NumWebVisitsMonth  AcceptedCmp3  Accepted
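
Note that this cell names the new column `age`, whereas the later cells look for `Age`. Column labels are case-sensitive, which is one reason for the "Columns not found" messages further down. A small sketch with the reference year as an explicit parameter (the helper `add_age` and its default are illustrative; the birth years come from the head output above):

```python
import pandas as pd

def add_age(df, reference_year=2024):
    """Return a copy with 'Age' derived from 'Year_Birth', which is dropped."""
    out = df.copy()
    out["Age"] = reference_year - out["Year_Birth"]
    return out.drop(columns=["Year_Birth"])

toy = pd.DataFrame({"Year_Birth": [1957, 1954, 1965]})
aged = add_age(toy)
print(aged)
```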

### 13. Encode the categorical variables in the dataset

In [16]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Correcting the file path
file_path = r'C:/Users/hp/Downloads/Lab 4 - Unsu[ervised learinng/marketing.csv'

# Load the dataset
df = pd.read_csv(file_path)

# Example: Label Encoding for a single column 'Marital_Status'
label_encoder = LabelEncoder()
df['Marital_Status_LabelEncoded'] = label_encoder.fit_transform(df['Marital_Status'])

# Display the first few rows to verify
print(df[['Marital_Status', 'Marital_Status_LabelEncoded']].head())

# Repeat the above steps for other categorical columns as needed
# Example: One-Hot Encoding for a single column 'Education'
df_encoded = pd.get_dummies(df, columns=['Education'])

# Display the first few rows to verify
print(df_encoded.head())

# Repeat the above steps for other categorical columns as needed


  Marital_Status  Marital_Status_LabelEncoded
0         Single                            4
1         Single                            4
2       Together                            5
3       Together                            5
4        Married                            3
     ID  Year_Birth Marital_Status   Income  Kidhome  Teenhome Dt_Customer  \
0  5524        1957         Single  58138.0        0         0    4/9/2012   
1  2174        1954         Single  46344.0        1         1    8/3/2014   
2  4141        1965       Together  71613.0        0         0  21-08-2013   
3  6182        1984       Together  26646.0        1         0   10/2/2014   
4  5324        1981        Married  58293.0        1         0  19-01-2014   

   Recency  MntWines  MntFruits  ...  AcceptedCmp1  AcceptedCmp2  Complain  \
0       58       635         88  ...             0             0         0   
1       38        11          1  ...             0             0         0   
2       26       426 
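
The label codes 3, 4 and 5 above show that `Marital_Status` still holds its original eight categories, because the grouping from step 6 was lost when the CSV was reloaded. A hedged sketch that applies the step 6 grouping first and then encodes every non-numeric column; the toy rows are hypothetical, and integer codes from `LabelEncoder` are one choice among several (one-hot encoding with `pd.get_dummies` is another):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "Education": ["Graduation", "PhD", "Graduation", "Master"],
    "Marital_Status": ["Single", "Together", "Married", "Widow"],
    "Income": [58138.0, 46344.0, 71613.0, 26646.0],
})

# Grouping from step 6
toy["Marital_Status"] = toy["Marital_Status"].replace({
    "Married": "relationship", "Together": "relationship",
    "Divorced": "Single", "Widow": "Single",
    "Alone": "Single", "YOLO": "Single", "Absurd": "Single",
})

# Encode each non-numeric column with its own encoder, kept for later decoding
encoders = {}
for col in toy.select_dtypes(exclude="number").columns:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

print(toy)
```

`LabelEncoder` assigns codes in sorted order of the category names, so the integers carry no meaningful ranking; distance-based methods such as K-means may treat them as if they did.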

### 14. Standardize the columns so that all values are on a comparable scale

In [18]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Correcting the file path
file_path = r'C:/Users/hp/Downloads/Lab 4 - Unsu[ervised learinng/marketing.csv'

# Load the dataset
df = pd.read_csv(file_path)

# Display the columns to verify their names and existence
print(df.columns)

# Assuming 'Age' and 'Income' are actual columns in your dataset
columns_to_standardize = ['Age', 'Income']

# Check if columns_to_standardize are in df.columns
if all(col in df.columns for col in columns_to_standardize):
    # Initialize StandardScaler
    scaler = StandardScaler()

    # Fit and transform the selected columns
    df[columns_to_standardize] = scaler.fit_transform(df[columns_to_standardize])

    # Display the first few rows to verify
    print(df.head())
else:
    print("Columns not found in the dataset.")



Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Complain', 'Response'],
      dtype='object')
Columns not found in the dataset.
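
The cell prints "Columns not found" because the freshly loaded CSV has no `Age` column. A minimal sketch of standardizing every numeric column at once, on hypothetical values; after `StandardScaler`, each column has mean 0 and unit variance rather than a fixed range:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values for two numeric features
toy = pd.DataFrame({
    "Age": [67, 70, 59, 40],
    "Income": [58138.0, 46344.0, 71613.0, 26646.0],
})

scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy), columns=toy.columns, index=toy.index
)

# StandardScaler uses the population standard deviation (ddof=0)
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```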


### 15. Apply PCA to the above dataset and determine the number of principal components needed to explain 90-95% of the variance in the data.

In [20]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

# Correcting the file path
file_path = r'C:/Users/hp/Downloads/Lab 4 - Unsu[ervised learinng/marketing.csv'

# Load the dataset
df = pd.read_csv(file_path)

# Display the columns to verify their names and existence
print(df.columns)

# Assume 'Age' and 'Income' are actual columns in your dataset
columns_to_standardize = ['Age', 'Income']

# Check if columns_to_standardize are in df.columns
if all(col in df.columns for col in columns_to_standardize):
    # Initialize StandardScaler
    scaler = StandardScaler()

    # Standardize the selected columns
    df[columns_to_standardize] = scaler.fit_transform(df[columns_to_standardize])

    # Apply PCA
    pca = PCA()
    pca.fit(df)

    # Calculate cumulative explained variance ratio
    explained_variance_ratio = pca.explained_variance_ratio_
    cumulative_variance_ratio = np.cumsum(explained_variance_ratio)

    # Determine the number of components for 90-95% variance explained
    n_components = np.argmax(cumulative_variance_ratio >= 0.95) + 1  # +1 because index starts from 0

    # Print explained variance ratios and number of components
    print("Explained Variance Ratio:")
    print(explained_variance_ratio)
    print("\nCumulative Explained Variance Ratio:")
    print(cumulative_variance_ratio)
    print(f"\nNumber of components to explain at least 95% variance: {n_components}")
else:
    print("Columns not found in the dataset.")


Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Complain', 'Response'],
      dtype='object')
Columns not found in the dataset.
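
A sketch of the variance-threshold selection on synthetic data (random numbers generated below, not the marketing dataset). PCA needs a fully numeric matrix, so categorical and date columns would have to be encoded or dropped before this step:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic matrix: 200 samples, 6 features driven by 2 latent factors plus noise
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))

X_scaled = StandardScaler().fit_transform(X)

# Fit with all components to inspect the full variance profile
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print(cumulative.round(3), n_components)

# Project onto the selected number of components
X_pca = PCA(n_components=n_components).fit_transform(X_scaled)
```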


### 16. Apply K-means clustering and segment the data (Use PCA transformed data for clustering)

In [22]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Correcting the file path
file_path = r'C:/Users/hp/Downloads/Lab 4 - Unsu[ervised learinng/marketing.csv'

# Load the dataset
df = pd.read_csv(file_path)

# Display the columns to verify their names and existence
print(df.columns)

# Assume 'Age' and 'Income' are actual columns in your dataset
columns_to_standardize = ['Age', 'Income']

# Check if columns_to_standardize are in df.columns
if all(col in df.columns for col in columns_to_standardize):
    # Initialize StandardScaler
    scaler = StandardScaler()

    # Standardize the selected columns
    df[columns_to_standardize] = scaler.fit_transform(df[columns_to_standardize])

    # Apply PCA
    pca = PCA(n_components=2)  # Example: choose number of components based on explained variance ratio or requirement
    principal_components = pca.fit_transform(df)

    # Apply K-means clustering
    kmeans = KMeans(n_clusters=3, random_state=42)  # Example: choose number of clusters
    kmeans.fit(principal_components)

    # Add cluster labels to DataFrame
    df['Cluster'] = kmeans.labels_

    # Visualize clusters (Example plot for first two principal components)
    plt.figure(figsize=(10, 6))
    scatter = plt.scatter(principal_components[:, 0], principal_components[:, 1], c=kmeans.labels_, cmap='viridis')
    plt.title('K-means Clustering with PCA Components')
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.colorbar(scatter, label='Cluster')
    plt.show()

else:
    print("Columns not found in the dataset.")


Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Complain', 'Response'],
      dtype='object')
Columns not found in the dataset.
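
Because the `Age` check fails, the clustering branch above never runs. A hedged sketch of the intended scale, project, cluster sequence on synthetic data with three deliberately well-separated groups (all values are generated, not taken from the dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic data: three well-separated groups of 50 points in 5 dimensions
centers = np.array([[0.0] * 5, [10.0] * 5, [-10.0] * 5])
X = np.vstack([c + rng.normal(size=(50, 5)) for c in centers])

X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_pca)

# Number of points assigned to each cluster
print(np.bincount(labels))
```

On real data the number of clusters is not known in advance; comparing `kmeans.inertia_` across several values of `n_clusters` is one common way to choose it.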


### 17. Apply agglomerative clustering and segment the data (use the original data for clustering). Then perform cluster analysis through bivariate analysis between the cluster label and different features, and write your observations.

In [24]:
import pandas as pd

# Correcting the file path
file_path = r'C:/Users/hp/Downloads/Lab 4 - Unsu[ervised learinng/marketing.csv'

# Load the dataset
df = pd.read_csv(file_path)

# Display the columns to verify their names and existence
print(df.columns)

# Check if the columns you intend to use ('Age', 'Income', etc.) are in df.columns
columns_to_check = ['Age', 'Income']

if all(col in df.columns for col in columns_to_check):
    # Proceed with your data preprocessing and clustering steps here
    print("Columns found in the dataset. Proceed with further steps.")
else:
    # Print an error message or handle the case where columns are not found
    print("Columns not found in the dataset. Check column names or dataset loading.")


Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Complain', 'Response'],
      dtype='object')
Columns not found in the dataset. Check column names or dataset loading.
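
This cell only checks column names and does not cluster anything yet. A hedged sketch of the requested workflow on synthetic data: fit `AgglomerativeClustering`, attach the labels, and compare feature means per cluster as a simple bivariate summary. The column names mirror features engineered earlier, but the values are randomly generated:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic customers: a low-income/low-spend group and a high-income/high-spend group
toy = pd.DataFrame({
    "Income": np.concatenate([rng.normal(30000, 2000, 40),
                              rng.normal(80000, 2000, 40)]),
    "Total_Expenses": np.concatenate([rng.normal(100, 20, 40),
                                      rng.normal(1500, 100, 40)]),
})

# Scale the original features (no PCA) so both contribute equally to distances
X_scaled = StandardScaler().fit_transform(toy)

# Ward linkage merges the pair of clusters that least increases within-cluster variance
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
toy["Cluster"] = model.fit_predict(X_scaled)

# Bivariate view: mean of each feature per cluster label
summary = toy.groupby("Cluster").mean()
print(summary)
```

The per-cluster means give a starting point for written observations, for example whether the higher-income cluster is also the higher-spending one.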


### Visualization and Interpretation of results

In [30]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

# Load the dataset
df = pd.read_csv('C:/Users/hp/Downloads/Lab 4 - Unsu[ervised learinng/marketing.csv')

# Assume data preprocessing and K-means clustering have been performed

# Apply K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
df['Cluster'] = kmeans.fit_predict(df[['ID', 'Income']])

# Visualize clusters
plt.figure(figsize=(10, 6))
sns.scatterplot( x='ID',y='Income', hue='Cluster', data=df, palette='viridis', legend='full')
plt.title('K-means Clustering: ID vs Income by Cluster')
plt.xlabel('ID')
plt.ylabel('Income')
plt.show()

# Optional: Print cluster centers or perform additional analysis
print("\nCluster Centers:")
print(kmeans.cluster_centers_)

# Explore other visualizations and interpretations as per your analysis needs


ValueError: Input X contains NaN.
KMeans does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
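
The `ValueError` arises because `Income` has 24 missing values (see step 3) and this cell reloads the CSV without imputing them. A minimal sketch of one remedy, imputing with `SimpleImputer` inside a pipeline before K-means; the toy values are hypothetical, and dropping the incomplete rows with `dropna` would be an alternative:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical customers; two Income values are missing on purpose
toy = pd.DataFrame({
    "Age": [25, 27, 26, 60, 62, 61],
    "Income": [20000.0, np.nan, 22000.0, 90000.0, 95000.0, np.nan],
})

# Impute, scale, then cluster, so KMeans never sees NaN
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=42),
)
toy["Cluster"] = pipeline.fit_predict(toy[["Age", "Income"]])
print(toy)
```

The sketch clusters on `Age` and `Income` rather than `ID`, since an identifier has no meaningful distance between customers.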

-----
## Happy Learning
-----