## [EM Algorithm and Gaussian Mixture Models for Advanced Data Clustering](https://medium.com/data-science-collective/the-em-algorithm-and-gaussian-mixture-models-for-advanced-data-clustering-948756fe76c9)

> A deep dive into the core concepts of unsupervised clustering with practical application on customer data segmentation


The **Expectation-Maximization (EM)** algorithm, particularly its application to **Gaussian Mixture Models (GMM)**, is a foundational unsupervised learning technique.

A **_Gaussian Mixture Model (GMM)_** is a probabilistic model that represents the data as a combination of multiple Gaussian distributions.

It is a clustering tool for unsupervised learning that offers more flexibility than methods such as k-means, because it assigns each data point a probability of belonging to every cluster instead of a single hard label.

Its key features include:

- **Probabilistic Model**: Describes the data with probability distributions, yielding the likelihood of different outcomes rather than a single, definite prediction.
- **Unsupervised Learning**: No labeled data required for training.
- **Clustering and Density Estimation**: Clusters data points into different groups and estimates the probability density function (PDF) of the data.
- **Flexibility in Modeling**: Can approximate a wide range of distributions as a weighted sum of normal distributions.
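
To make the soft-assignment and density-estimation features concrete, here is a minimal sketch on synthetic one-dimensional data. The data, the names `X_toy` and `toy_gmm`, and the choice of two components are assumptions for illustration only and are not part of the churn analysis.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1-D data drawn from two well-separated Gaussians (illustrative only).
rng = np.random.default_rng(0)
X_toy = np.concatenate([
    rng.normal(loc=-3.0, scale=1.0, size=200),
    rng.normal(loc=3.0, scale=1.0, size=200),
]).reshape(-1, 1)

toy_gmm = GaussianMixture(n_components=2, random_state=0).fit(X_toy)

# Soft assignment: one probability per component; each row sums to 1.
resp = toy_gmm.predict_proba(X_toy)

# Hard assignment: index of the most probable component for each point.
hard = toy_gmm.predict(X_toy)

# Density estimation: log of the fitted mixture PDF evaluated at each point.
log_pdf = toy_gmm.score_samples(X_toy)
```

`predict_proba` is what distinguishes a GMM from k-means here: points near the boundary between the two groups get split probabilities instead of being forced into one cluster.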

GMMs are commonly used in various fields like financial investments, natural language analysis, predictive maintenance, and medical imaging (MRI, CT scans).

The core concept of GMMs is to assume that the data points are generated from a mixture of multiple Gaussian distributions, each of which has its own model parameters:

- _mean_ (`μ_k`),
- _variance_ (or _covariance_) (`Σ_k`), and
- _mixing coefficient_ (`ϕ_k`).
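
As a rough illustration of how EM estimates these three parameters, the sketch below runs the algorithm on a one-dimensional, two-component mixture using only NumPy. The helper names (`gaussian_pdf`, `em_step`, `log_likelihood`), the synthetic data, and the starting values are all hypothetical. It is a simplified version of the idea, not the implementation scikit-learn uses.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Density of a univariate normal N(mu, var) evaluated at x."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_step(x, phi, mu, var):
    """One EM iteration for a 1-D Gaussian mixture with K components."""
    # E-step: responsibility of component k for point i, shape (n, K).
    weighted = phi * gaussian_pdf(x[:, None], mu, var)
    resp = weighted / weighted.sum(axis=1, keepdims=True)

    # M-step: re-estimate the parameters from the soft assignments.
    n_k = resp.sum(axis=0)
    phi_new = n_k / x.size
    mu_new = (resp * x[:, None]).sum(axis=0) / n_k
    var_new = (resp * (x[:, None] - mu_new) ** 2).sum(axis=0) / n_k
    return phi_new, mu_new, var_new

def log_likelihood(x, phi, mu, var):
    """Log-likelihood of the data under the current mixture."""
    return np.log((phi * gaussian_pdf(x[:, None], mu, var)).sum(axis=1)).sum()

# Illustrative data: two groups centered near -2 and 3.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

# Arbitrary starting guesses for the mixing coefficients, means, and variances.
phi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

history = [log_likelihood(x, phi, mu, var)]
for _ in range(50):
    phi, mu, var = em_step(x, phi, mu, var)
    history.append(log_likelihood(x, phi, mu, var))
```

Each iteration alternates between computing soft assignments (E-step) and updating `phi`, `mu`, and `var` from them (M-step). The values in `history` should never decrease, which is the property that convergence checks rely on.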


In [1]:
!pip install -q numpy pandas matplotlib scikit-learn ucimlrepo

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
import pprint

import pandas as pd
import numpy as np

from ucimlrepo import fetch_ucirepo

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [4]:
# fetch dataset
iranian_churn = fetch_ucirepo(id=563)

# data (as pandas dataframes)
features = iranian_churn.data.features
targets = iranian_churn.data.targets

# metadata
print("------Metadata Information------\n")
pprint.pp(iranian_churn.metadata)
  
# variable information
print("------Variable Information\n", iranian_churn.variables)

------Metadata Information------

{'uci_id': 563,
 'name': 'Iranian Churn',
 'repository_url': 'https://archive.ics.uci.edu/dataset/563/iranian+churn+dataset',
 'data_url': 'https://archive.ics.uci.edu/static/public/563/data.csv',
 'abstract': 'This dataset is randomly collected from an Iranian telecom '
             "company's database over a period of 12 months.",
 'area': 'Business',
 'tasks': ['Classification', 'Regression'],
 'characteristics': ['Multivariate'],
 'num_instances': 3150,
 'num_features': 13,
 'feature_types': ['Integer'],
 'demographics': ['Age'],
 'target_col': ['Churn'],
 'index_col': None,
 'has_missing_values': 'no',
 'missing_values_symbol': None,
 'year_of_dataset_creation': 2020,
 'last_updated': 'Sat Mar 09 2024',
 'dataset_doi': '10.24432/C5JW3Z',
 'creators': [],
 'intro_paper': None,
 'additional_info': {'summary': 'This dataset is randomly collected from an '
                                'Iranian telecom company’s database over a '
                 

In [5]:
df = pd.read_csv(iranian_churn.metadata.data_url)

# normalize column names to snake_case (keys are copied as-is, including the double spaces)
df = df.rename(columns={
    'Subscription  Length': 'subscription_length',
    'Call  Failure': 'call_failure',
    'Complains': 'complains',
    'Charge  Amount': 'charge_amount',
    'Frequency of use': 'frequency_of_use',
    'Seconds of Use': 'seconds_of_use',
    'Frequency of SMS': 'frequency_of_sms',
    'Distinct Called Numbers': 'distinct_called_numbers',
    'Age Group': 'age_group',
    'Tariff Plan': 'tariff_plan',
    'Status': 'status',
    'Age': 'age',
    'Customer Value': 'customer_value',
    'Churn': 'churn',
})

cols = ['subscription_length', 'customer_value', 'age', 'frequency_of_use', 'churn']
X = df[cols]

num_features = ['subscription_length', 'customer_value', 'age', 'frequency_of_use']
num_transformer = Pipeline(steps=[('scaler', StandardScaler())])

cat_features = ['churn']
cat_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)
    ],
    remainder='passthrough'
)

X_processed = preprocessor.fit_transform(X)

#### Model Tuning

In [6]:
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans

# GMM
gmm = GaussianMixture(
    n_components=10,            # fit a mixture of 10 Gaussian components (clusters)
    covariance_type='full',     # each component gets its own full covariance matrix (non-spherical clusters)
    tol=1e-10,                  # stop EM when the gain in the log-likelihood lower bound falls below tol
    reg_covar=1e-10,            # added to the covariance diagonal to keep the matrices positive definite
    max_iter=1000,              # at most 1,000 EM iterations per initialization
    n_init=1000,                # 1,000 initializations, best result kept (costly with such a tight tol)
    init_params='kmeans',       # default: initialize from k-means
    weights_init=None,          # no user-supplied initial weights, means, or precisions
    means_init=None,
    precisions_init=None,
    random_state=42,
)
gmm_labels = gmm.fit_predict(X_processed)


# k-means (for comparison)
kmeans = KMeans(
    n_clusters=5, 
    init="k-means++",
    n_init=500,
    max_iter=500,
    tol=1e-10,
    random_state=42,
    algorithm='lloyd'
)
kmeans_labels = kmeans.fit_predict(X_processed)

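
The number of components above is fixed by hand. One common way to choose it, sketched here on synthetic data, is to fit a model for each candidate count and compare the Bayesian information criterion (BIC) via `GaussianMixture.bic`. The blob data, the candidate range, and the small `n_init` are assumptions chosen so the example runs quickly. They are not the settings used for the churn data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data with three well-separated blobs (illustrative only).
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
X_toy = np.vstack([rng.normal(loc=c, scale=0.5, size=(150, 2)) for c in centers])

# Fit one model per candidate component count and record its BIC (lower is better).
candidates = list(range(1, 7))
bics = []
for k in candidates:
    model = GaussianMixture(
        n_components=k,
        covariance_type="full",
        n_init=3,
        random_state=42,
    ).fit(X_toy)
    bics.append(model.bic(X_toy))

# Candidate with the lowest BIC.
best_k = candidates[int(np.argmin(bics))]
```

BIC penalizes extra parameters, so it tends to favor the smallest mixture that still explains the data well. The same loop could be run on `X_processed` with a modest `n_init` to keep the runtime manageable.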