# Identifying Customer Segments

## Project

A mail-order company in Germany wants to target a marketing campaign at the people who are most likely to purchase its products, and it has asked for help identifying these groups. The goal of this project is to find the company's customer segments.

This is a real business case provided by Bertelsmann Arvato Analytics, which is why the company's name is not revealed. Arvato provided real datasets describing the company's customers. The datasets are not publicly accessible, so I can only present the results of my analysis here.

### Project outline

The project has several stages, so here is an outline of what I am going to do.
1. Data pre-processing

    The dataset needs a fair amount of preprocessing: some data is missing, and some features need re-encoding.


2. Feature transformation

    There are many features, so I will apply PCA to reduce their number. The features need to be scaled before PCA is applied.


3. Clustering

    First, I will find clusters in the general population. Then I will apply the same clustering to the customer population and look for clusters that are over-represented among customers. These clusters indicate the company's customer profile.
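
The three steps in the outline can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: the arrays `general` and `customers` are random stand-ins for the cleaned datasets, and the number of components (5) and clusters (4) are arbitrary placeholders, not values chosen in this analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the cleaned datasets (the real data is not public).
rng = np.random.default_rng(0)
general = rng.normal(size=(1000, 10))
customers = rng.normal(loc=0.5, size=(300, 10))

# Fit every step on the general population only.
scaler = StandardScaler().fit(general)
pca = PCA(n_components=5, random_state=0).fit(scaler.transform(general))
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(pca.transform(scaler.transform(general)))

# Reuse the fitted steps to assign customers to the same clusters.
general_labels = kmeans.labels_
customer_labels = kmeans.predict(pca.transform(scaler.transform(customers)))

# Share of each population per cluster; a positive difference marks
# a cluster that is over-represented among customers.
general_share = np.bincount(general_labels, minlength=4) / len(general_labels)
customer_share = np.bincount(customer_labels, minlength=4) / len(customer_labels)
over_representation = customer_share - general_share
print(over_representation)
```

Fitting the scaler, PCA, and clustering on the general population and only transforming the customer data keeps both populations in the same feature space, so their cluster shares are directly comparable.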

## Import resources

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Load the data

Arvato Analytics provided 4 files for the project:
- `AZDIAS_Subset.csv` - general German population data, 891221 persons x 85 features,
- `CUSTOMERS_Subset.csv` - company's customer population data, 191652 persons x 85 features,
- `Data_Dictionary.md` - file with information about the features in the provided datasets,
- `AZDIAS_Feature_Summary.csv` - attributes of each feature, 85 features x 4 columns.

Let's explore the data. The general population and customer datasets have the same structure, so it is enough to look at only one of them.
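
A claim like this can be checked directly. The following is a minimal sketch on two tiny hypothetical frames (`general_demo` and `customers_demo` are made up for illustration); with the real files, the same comparison would be run on the two loaded DataFrames.

```python
import pandas as pd

# Hypothetical miniature frames standing in for the two datasets.
general_demo = pd.DataFrame({"AGER_TYP": [-1, 2], "ANREDE_KZ": [1, 2]})
customers_demo = pd.DataFrame({"AGER_TYP": [1], "ANREDE_KZ": [2]})

# Same column names in the same order?
same_columns = list(general_demo.columns) == list(customers_demo.columns)

# Same dtype for every column? Only compared when the columns match.
same_dtypes = same_columns and bool(
    (general_demo.dtypes == customers_demo.dtypes).all()
)

print(same_columns, same_dtypes)
```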

In [2]:
# Load in the general demographics data.
azdias = pd.read_csv('AZDIAS_Subset.csv', sep = ';')

# Load in the feature summary file.
feat_info = pd.read_csv('AZDIAS_Feature_Summary.csv', sep = ';')

In [3]:
# Display the first few rows
azdias.head()

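| | AGER_TYP | ALTERSKATEGORIE_GROB | ANREDE_KZ | CJT_GESAMTTYP | FINANZ_MINIMALIST | FINANZ_SPARER | FINANZ_VORSORGER | FINANZ_ANLEGER | FINANZ_UNAUFFAELLIGER | FINANZ_HAUSBAUER | ... | PLZ8_ANTG1 | PLZ8_ANTG2 | PLZ8_ANTG3 | PLZ8_ANTG4 | PLZ8_BAUMAX | PLZ8_HHZ | PLZ8_GBZ | ARBEIT | ORTSGR_KLS9 | RELAT_AB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 2 | 1 | 2.0 | 3 | 4 | 3 | 5 | 5 | 3 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | -1 | 1 | 2 | 5.0 | 1 | 5 | 2 | 5 | 4 | 5 | ... | 2.0 | 3.0 | 2.0 | 1.0 | 1.0 | 5.0 | 4.0 | 3.0 | 5.0 | 4.0 |
| 2 | -1 | 3 | 2 | 3.0 | 1 | 4 | 1 | 2 | 3 | 5 | ... | 3.0 | 3.0 | 1.0 | 0.0 | 1.0 | 4.0 | 4.0 | 3.0 | 5.0 | 2.0 |
| 3 | 2 | 4 | 2 | 2.0 | 4 | 2 | 5 | 2 | 1 | 2 | ... | 2.0 | 2.0 | 2.0 | 0.0 | 1.0 | 3.0 | 4.0 | 2.0 | 3.0 | 3.0 |
| 4 | -1 | 3 | 1 | 5.0 | 4 | 3 | 4 | 1 | 3 | 2 | ... | 2.0 | 4.0 | 2.0 | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 | 6.0 | 5.0 |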


The features appear to be encoded numerically, and some values are `NaN`.

In [4]:
# Show the first few feature records to see how they look
feat_info.head()

| | attribute | information_level | type | missing_or_unknown |
|---|---|---|---|---|
| 0 | AGER_TYP | person | categorical | [-1,0] |
| 1 | ALTERSKATEGORIE_GROB | person | ordinal | [-1,0,9] |
| 2 | ANREDE_KZ | person | categorical | [-1,0] |
| 3 | CJT_GESAMTTYP | person | categorical | [0] |
| 4 | FINANZ_MINIMALIST | person | ordinal | [-1] |


The `missing_or_unknown` column lists the codes for missing data. It looks like some of the missing values were encoded with the codes given here.
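
One way to handle such codes is to parse each `missing_or_unknown` string into a list and replace matching values with `NaN`. The sketch below uses small made-up frames (`demo`, `demo_info`) that mimic the structure shown above, and `parse_codes` is a hypothetical helper introduced only for illustration. It assumes the codes are stored as strings such as `"[-1,0]"`, as they appear in the output.

```python
import numpy as np
import pandas as pd

# Toy stand-ins that mimic the structure shown above (values are made up).
demo = pd.DataFrame({
    "AGER_TYP": [-1, 2, 0, 1],
    "ALTERSKATEGORIE_GROB": [2, 9, 3, 0],
    "FINANZ_MINIMALIST": [3, -1, 1, 4],
})
demo_info = pd.DataFrame({
    "attribute": ["AGER_TYP", "ALTERSKATEGORIE_GROB", "FINANZ_MINIMALIST"],
    "missing_or_unknown": ["[-1,0]", "[-1,0,9]", "[-1]"],
})


def parse_codes(text):
    """Turn a string such as '[-1,0]' into a list of codes.

    Tokens that look like integers become ints; anything else stays a string.
    """
    tokens = [t.strip() for t in text.strip("[]").split(",") if t.strip()]
    codes = []
    for token in tokens:
        try:
            codes.append(int(token))
        except ValueError:
            codes.append(token)
    return codes


cleaned = demo.copy()
for attribute, text in zip(demo_info["attribute"], demo_info["missing_or_unknown"]):
    codes = parse_codes(text)
    # mask() replaces the entries where the condition is True with NaN.
    cleaned[attribute] = cleaned[attribute].mask(cleaned[attribute].isin(codes))

# Number of values per column that are now marked as missing.
print(cleaned.isna().sum())
```

Working on a copy keeps the raw values available for comparison, which makes it easy to see how many entries each code accounted for.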

In [5]:
azdias.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891221 entries, 0 to 891220
Data columns (total 85 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   AGER_TYP               891221 non-null  int64  
 1   ALTERSKATEGORIE_GROB   891221 non-null  int64  
 2   ANREDE_KZ              891221 non-null  int64  
 3   CJT_GESAMTTYP          886367 non-null  float64
 4   FINANZ_MINIMALIST      891221 non-null  int64  
 5   FINANZ_SPARER          891221 non-null  int64  
 6   FINANZ_VORSORGER       891221 non-null  int64  
 7   FINANZ_ANLEGER         891221 non-null  int64  
 8   FINANZ_UNAUFFAELLIGER  891221 non-null  int64  
 9   FINANZ_HAUSBAUER       891221 non-null  int64  
 10  FINANZTYP              891221 non-null  int64  
 11  GEBURTSJAHR            891221 non-null  int64  
 12  GFK_URLAUBERTYP        886367 non-null  float64
 13  GREEN_AVANTGARDE       891221 non-null  int64  
 14  HEALTH_TYP             891221 non-nu

You can see that many columns contain missing data. Note that 4 of the columns have the type `object`, which indicates that these columns are not numerical.
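
Both observations can also be extracted programmatically instead of being read off the `info()` listing. This is a minimal sketch on a made-up frame; `TEXT_FEATURE` is a hypothetical column name used only to show a non-numeric column, not one of the dataset's features.

```python
import numpy as np
import pandas as pd

# Made-up frame with one float, one integer, and one object column.
demo = pd.DataFrame({
    "CJT_GESAMTTYP": pd.Series([2.0, 5.0, np.nan, 2.0]),
    "FINANZ_SPARER": pd.Series([4, 5, 4, 2]),
    # Hypothetical non-numeric column, stored explicitly as object.
    "TEXT_FEATURE": pd.Series(["W", None, "O", "W"], dtype=object),
})

# Count missing values per column and keep only columns that have any.
missing_counts = demo.isna().sum()
columns_with_missing = missing_counts[missing_counts > 0]

# List the columns whose dtype is object, i.e. not numerical.
object_columns = demo.select_dtypes(include="object").columns.tolist()

print(columns_with_missing)
print(object_columns)
```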