# Competitor analysis

# Introduction

As Company **`A`**, a new brand in the market, we want to learn from a dataset belonging to our established competitor, Company **`B`**. We will segment the customers in this dataset, analyze the behavior of each segment, and then tailor a strategy for each segment to position ourselves in the market.

In this notebook, we segment users by categorical properties such as smoking status, gender, and health condition. We also group users into five-year age intervals.

Next, we'll compute the average revenue for each segment, along with the revenue-per-customer KPI. We'll also analyze the preferred mode of payment to determine the most suitable insurance plan for each segment, and report the minimum, maximum, and median payments and the number of users in each segment. By calculating the total revenue generated by each segment, we'll pinpoint the most valuable and the most populous customer segments.

Given the company's goal to increase revenue and the number of customers by 20% over the next six months, we'll provide recommendations aligned with these objectives.

In [1]:
import numpy as np 
import seaborn as sns
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', None)

In [2]:
path = "../insurance.csv"
data = pd.read_csv(path)
data["age"] = data["age"].astype("Int64")


In [3]:
data.drop_duplicates(inplace=True)
data.head(3)

|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |


--------------------

# Data preparation

We need to prepare our data for segmentation. Let's examine the dataset in its initial state.

In [4]:
data.head(3)

|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |


## Age intervals

In [5]:
bins = [18, 23, 28, 33, 38, 43, 48, 53, 58, 63, 68, 73, 78, 83]  # Age bin edges
labels = ['18-23', '24-28', '29-33', '34-38', '39-43', '44-48', "49-53", "54-58", "59-63", "64-68", "69-73", "74-78", "79-83"]  # One label per bin

# Create a new column with age intervals.
# right=False makes each bin left-closed: [18, 23), [23, 28), [28, 33), ...
# so an age equal to a bin edge (e.g. 28) falls into the next bin ('29-33').
data['age_range'] = pd.cut(data['age'], bins=bins, labels=labels, right=False)
data.head(4)

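|   | age | sex | bmi | children | smoker | region | charges | age_range |
|---|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 | 18-23 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 | 18-23 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 | 29-33 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 | 34-38 |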
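Note that with `right=False` the bins are left-closed, so the hand-written labels do not line up with the actual intervals: age 28 lands in the bin labeled `29-33`, and age 33 in `34-38`, as the output above shows. One way to avoid this, sketched below on toy ages, is to derive the labels from the bin edges. The helper `make_age_labels` is hypothetical and not part of the original notebook, and adopting it would change the label strings used in the rest of the analysis.

```python
import pandas as pd


def make_age_labels(edges):
    """Build 'lo-hi' labels for left-closed integer bins [lo, next_lo).

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return [f"{lo}-{hi - 1}" for lo, hi in zip(edges[:-1], edges[1:])]


# Same edges as in the cell above: 18, 23, ..., 83.
edges = list(range(18, 84, 5))
age_labels = make_age_labels(edges)

# Toy ages for illustration only.
ages = pd.Series([18, 19, 28, 33, 64])
age_ranges = pd.cut(ages, bins=edges, labels=age_labels, right=False)
print(age_ranges.tolist())
```

With labels generated this way, each label states the inclusive lower and upper age of its bin, so a boundary age such as 28 is reported in a range that actually contains it.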


## Health condition based on BMI score

In [6]:
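def bmi_condition(row):
    """Classify a row's BMI as under_weight, healthy, or over_weight."""
    if row['bmi'] < 18.5:
        # Below the healthy range
        return "under_weight"
    elif row['bmi'] > 24.9:
        # Above the healthy range
        return "over_weight"
    else:
        # 18.5 <= BMI <= 24.9
        return "healthy"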

In [7]:
data["bmi_condition"] = data.apply(bmi_condition, axis=1)
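The row-wise `apply` above works, but the same classification can also be written as a vectorized expression, which tends to be faster on larger frames. The sketch below uses `np.select` on a toy frame with the same thresholds; the `toy` dataframe and its values are illustrative assumptions, not data from the notebook.

```python
import numpy as np
import pandas as pd

# Toy BMI values for illustration only.
toy = pd.DataFrame({"bmi": [17.0, 18.5, 24.9, 25.0, 33.77]})

# Conditions are checked in order; rows matching none get the default.
conditions = [toy["bmi"] < 18.5, toy["bmi"] > 24.9]
choices = ["under_weight", "over_weight"]
toy["bmi_condition"] = np.select(conditions, choices, default="healthy")
print(toy)
```

The boundary values 18.5 and 24.9 both fall into `healthy`, matching the behavior of `bmi_condition`.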

With these two derived columns, the dataset is almost ready to serve as the foundation of our segmentation process. So far we have defined:

1. Age intervals
2. BMI conditions

Given the nature of the insurance business, and because the EDA report showed no significant differences in revenue between regions, we drop the `region` column.

In the next step, we segment the data by the categorical variables in the dataframe and create a separate dataframe for each combination of conditions.
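The EDA report itself is not reproduced in this notebook. As a rough sketch of the kind of check behind that decision, charges can be summarized per region before the column is dropped. The toy rows below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [1000.0, 2000.0, 4000.0, 3000.0, 5000.0],
})

# Count, mean, and median charges per region, highest mean first.
region_summary = (
    toy.groupby("region")["charges"]
       .agg(["count", "mean", "median"])
       .sort_values("mean", ascending=False)
)
print(region_summary)
```

On the real data, similar means and medians across regions would support dropping the column; a formal significance test would be a stronger check than this summary.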

In [8]:
data.drop(columns="region", inplace=True)
data.head(3)

|   | age | sex | bmi | children | smoker | charges | age_range | bmi_condition |
|---|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | 16884.924 | 18-23 | over_weight |
| 1 | 18 | male | 33.77 | 1 | no | 1725.5523 | 18-23 | over_weight |
| 2 | 28 | male | 33.0 | 3 | no | 4449.462 | 29-33 | over_weight |


# Segmentation

In our dataset, we have the following variables:

- Gender (male/female)
- Number of children (0 to 5)
- Smoking status (smoker/non-smoker)
- Age range (in 5-year intervals)
- Health condition: Underweight, Healthy, Overweight (based on BMI score of each user)

We will create several dataframes, each containing a combination of these conditions. For example: **`male-smoker-underweight-two_children`**.


In [9]:
gender = ["male", "female"]
smoke = ["yes", "no"]
children = list(data.children.unique())
health_condition = list(data.bmi_condition.unique())
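One possible way to build the per-combination dataframes described above is to iterate over a `groupby` and store each group under a name such as `male-yes-under_weight-2`. This is a minimal sketch on toy rows; the `segments` dictionary and its key format are assumptions for illustration, not something the notebook defines.

```python
import pandas as pd

# Toy rows for illustration only; column names follow the notebook.
toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female"],
    "smoker": ["yes", "yes", "no", "no"],
    "bmi_condition": ["under_weight", "under_weight", "healthy", "over_weight"],
    "children": [2, 2, 0, 1],
    "charges": [100.0, 300.0, 50.0, 70.0],
})

segment_cols = ["sex", "smoker", "bmi_condition", "children"]

# One dataframe per combination that actually occurs in the data.
segments = {
    "-".join(str(value) for value in key): frame
    for key, frame in toy.groupby(segment_cols)
}
print(sorted(segments))
```

Iterating over the groups only produces combinations that are present in the data, so no empty dataframes are created for combinations without customers.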

In [19]:
data.head(3)

|   | age | sex | bmi | children | smoker | charges | age_range | bmi_condition |
|---|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | 16884.924 | 18-23 | over_weight |
| 1 | 18 | male | 33.77 | 1 | no | 1725.5523 | 18-23 | over_weight |
| 2 | 28 | male | 33.0 | 3 | no | 4449.462 | 29-33 | over_weight |


In [20]:
data.groupby(by=["sex", "smoker", "bmi_condition", "children"], as_index=False).agg(
    charges_count=('charges', 'count'),
    charges_avg=('charges', 'mean'),
)

|   | sex | smoker | bmi_condition | children | charges_count | charges_avg |
|---|---|---|---|---|---|---|
| 0 | female | no | healthy | 0 | 36 | 7476.52933 |
| 1 | female | no | healthy | 1 | 22 | 9277.686915 |
| 2 | female | no | healthy | 2 | 17 | 8150.450399 |
| 3 | female | no | healthy | 3 | 11 | 8148.792741 |
| 4 | female | no | healthy | 5 | 2 | 8774.049025 |
| 5 | female | no | over_weight | 0 | 195 | 7728.144666 |
| 6 | female | no | over_weight | 1 | 109 | 8761.7134 |
| 7 | female | no | over_weight | 2 | 79 | 10341.478413 |
| 8 | female | no | over_weight | 3 | 52 | 10547.546353 |
| 9 | female | no | over_weight | 4 | 11 | 13937.674562 |
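The same named-aggregation pattern can be extended to the other per-segment figures mentioned in the introduction: customer count, total revenue, and minimum, median, and maximum payments. The sketch below runs on toy rows with only two grouping columns to stay short. The output column names are illustrative choices, and treating mean `charges` as revenue per customer assumes each row is one customer.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female", "female"],
    "smoker": ["yes", "yes", "no", "no", "no"],
    "charges": [100.0, 300.0, 50.0, 70.0, 90.0],
})

kpis = (
    toy.groupby(["sex", "smoker"], as_index=False)
       .agg(
           customers=("charges", "count"),        # users in the segment
           revenue_total=("charges", "sum"),      # revenue from the segment
           revenue_avg=("charges", "mean"),       # revenue per customer
           charges_min=("charges", "min"),
           charges_median=("charges", "median"),
           charges_max=("charges", "max"),
       )
       .sort_values("revenue_total", ascending=False)  # most valuable first
)
print(kpis)
```

Sorting by `revenue_total` surfaces the most valuable segments, while sorting by `customers` instead would surface the most populous ones.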
