<a href="https://colab.research.google.com/github/valentinpylypchuk/AAA-ML-Project-2025-26/blob/main/Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The Customer Personality Analysis project focuses on understanding the characteristics and behaviors of a company’s ideal customers. The goal of this analysis is to segment customers based on their purchasing habits, preferences, and demographic attributes, in order to help the company tailor its marketing strategies and product offerings to specific groups.

The dataset provides various customer-related features such as demographics, spending patterns, and product preferences. By analyzing these attributes, we aim to identify distinct customer segments and determine which profiles are most likely to respond to specific marketing campaigns or purchase particular products.

Link: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis

# Libraries

In [81]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

# Dataset loading and understanding

In [82]:
# Load the dataset (the file is tab-separated despite its .csv extension)
url = "https://raw.githubusercontent.com/valentinpylypchuk/AAA-ML-Project-2025-26/refs/heads/main/marketing_campaign.csv"
data = pd.read_csv(url, sep='\t')
data.head()

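| | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |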



Description of the dataset's columns, grouped by category (`Z_CostContact` and `Z_Revenue` appear in the data but are not described here):
###  People
- **ID**: Customer's unique identifier  
- **Year_Birth**: Customer's birth year  
- **Education**: Customer's education level  
- **Marital_Status**: Customer's marital status  
- **Income**: Customer's yearly household income  
- **Kidhome**: Number of children in customer's household  
- **Teenhome**: Number of teenagers in customer's household  
- **Dt_Customer**: Date of customer's enrollment with the company  
- **Recency**: Number of days since customer's last purchase  
- **Complain**: 1 if the customer complained in the last 2 years, 0 otherwise  


###  Products
- **MntWines**: Amount spent on wine in the last 2 years  
- **MntFruits**: Amount spent on fruits in the last 2 years  
- **MntMeatProducts**: Amount spent on meat in the last 2 years  
- **MntFishProducts**: Amount spent on fish in the last 2 years  
- **MntSweetProducts**: Amount spent on sweets in the last 2 years  
- **MntGoldProds**: Amount spent on gold in the last 2 years  



###  Promotion
- **NumDealsPurchases**: Number of purchases made with a discount  
- **AcceptedCmp1**: 1 if customer accepted the offer in the 1st campaign, 0 otherwise  
- **AcceptedCmp2**: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise  
- **AcceptedCmp3**: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise  
- **AcceptedCmp4**: 1 if customer accepted the offer in the 4th campaign, 0 otherwise  
- **AcceptedCmp5**: 1 if customer accepted the offer in the 5th campaign, 0 otherwise  
- **Response**: 1 if customer accepted the offer in the last campaign, 0 otherwise  


###  Place
- **NumWebPurchases**: Number of purchases made through the company’s website  
- **NumCatalogPurchases**: Number of purchases made using a catalogue  
- **NumStorePurchases**: Number of purchases made directly in stores  
- **NumWebVisitsMonth**: Number of visits to the company’s website in the last month  


Next, we check the size of the dataset and the data type of each column.

In [83]:
data.info()  # info() prints its report directly and returns None, so no print() is needed
print(data.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   i

# Data cleaning

We check for duplicated rows and missing values.

In [84]:
# Number of fully duplicated rows
print(data.duplicated().sum())
# Number of missing values per column
print(data.isnull().sum())
# Proportion of missing values in the Income column
print("Percentage of missing values in column income: ", data['Income'].isna().sum() / len(data) * 100)


0
ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Z_CostContact           0
Z_Revenue               0
Response                0
dtype: int64
Percentage of missing values in column income:  1.0714285714285714
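
The per-column counts and the percentage printed above can also be collected into a single table. The following is a minimal sketch on a small made-up frame; the `toy` data and the `missing_summary` helper are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have any."""
    counts = df.isna().sum()
    # The mean of a boolean mask is the fraction of True values, i.e. the share missing
    percent = df.isna().mean() * 100
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Income": [58138.0, np.nan, 71613.0, 26646.0],
    "Kidhome": [0, 1, 0, 1],
})

print(missing_summary(toy))
```

Applied to the real `data` frame, this would list only the columns that contain gaps, which here should be `Income` alone.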


Only the `Income` column has missing values, and they affect about 1% of the samples (24 of 2240 rows), so we simply drop those rows.

In [85]:
# Drop rows that contain missing values
data = data.dropna()
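
Called without arguments, `dropna()` removes every row that has a missing value in any column. Since only `Income` is affected in this dataset the result is the same, but restricting the drop to that column makes the intent explicit. The sketch below shows this variant on made-up data and reports how many rows are lost; the `toy` frame is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Income": [58138.0, np.nan, 71613.0, np.nan],
})

rows_before = len(toy)
# Drop only rows where Income is missing; gaps in other columns would be kept
cleaned = toy.dropna(subset=["Income"])
rows_dropped = rows_before - len(cleaned)

print(f"Dropped {rows_dropped} of {rows_before} rows")
```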

The `Dt_Customer` column stores dates as strings (dtype `object`), so we convert it to a proper datetime type with `pd.to_datetime`, passing the day-month-year format explicitly, so that it can be used in data exploration.

In [87]:
data['Dt_Customer'] = pd.to_datetime(data['Dt_Customer'], format='%d-%m-%Y')
print(data['Dt_Customer'].dtypes)

datetime64[ns]
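
The explicit `format='%d-%m-%Y'` matters because a value such as `04-09-2012` is ambiguous: read day-first it is 4 September, read month-first it is 9 April. The short sketch below parses two date strings taken from the preview above and extracts date parts with the `.dt` accessor; the names `enroll_year` and `enroll_month` are only illustrative.

```python
import pandas as pd

# Two enrollment dates copied from the dataset preview
dates = pd.Series(["04-09-2012", "21-08-2013"])

# Parse as day-month-year, matching the format used in the notebook
parsed = pd.to_datetime(dates, format="%d-%m-%Y")

# Once the values are datetimes, the .dt accessor exposes their components
enroll_year = parsed.dt.year
enroll_month = parsed.dt.month

print(parsed.dt.strftime("%Y-%m-%d").tolist())  # ['2012-09-04', '2013-08-21']
```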
