# Data Mining Project

## Context

Market segmentation, the process of identifying groups of customers, draws on customers' geographic, demographic, psychographic, and behavioral characteristics. By understanding how the segments differ, organizations can make better strategic choices about opportunities, product definition, positioning, promotions, pricing, and target marketing.

A2Z Insurance (A2Z) is a long-standing Portuguese insurance company that offers a wide array of insurance services: Motor, Household, Health, Life, and Work Compensation. Although A2Z primarily serves Portuguese customers, a significant portion of its customer acquisition comes from its website.

In 2016, A2Z became one of the largest insurers in Portugal. The company has provided you with an ABT (Analytic Base Table) containing data on a sample of 10,290 customers from its active database. These are customers who had at least one insurance service with the company at the time the dataset was extracted.

## Goals
1. Explore the data and identify the variables that should be used to segment customers.
2. Identify customer segments.
3. Justify the number of clusters you chose (taking the business use into consideration as well).
4. Explain the clusters found.
5. Suggest business applications for the findings and define general marketing approaches for each cluster.

## Notes

- 2016 is the current year.
- Rename the columns.
- Fill missing insurance premiums with 0.
- Fill missing Children values with the mode.
- Fill missing Salary, etc. with the median.
- Deal with negative values.
- Remove outliers (IQR method and manual inspection).
- Perform scaling (MinMaxScaler, StandardScaler).
- Feature selection (check correlations).
- Create new variables (Age Group, Income Group, Has Children (Yes/No), Total_Premiums).
- Taking 2016 as the current year of the database, we calculate the customer Age as 2016 minus the birth year, and Customer_Years as 2016 minus the first policy year. Dividing Customer Monetary Value (CMV) by Customer_Years then gives an estimate of the customer's Annual_Profit.
- Coherence checking (Age < 18, Age < Years of Education + 4?, birth year > First_Policy, negative Salary, Children outside the valid range, Area outside the valid range).
- Clustering by demographic, value, and behavior (data partition).
- Create a new variable, Salary Invested = Total Premiums / (Salary x 14) x 100.
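
The derived variables described in the notes above can be sketched as follows. This is only a minimal illustration on invented rows, not the notebook's actual implementation. The column name `BirthYear`, the names `sample` and `premium_cols`, and the choice to leave `Annual_Profit` missing when `Customer_Years` is zero are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

CURRENT_YEAR = 2016  # stated in the notes

# Tiny synthetic sample; every value is invented purely for illustration.
sample = pd.DataFrame({
    "First_Policy": [1985, 1990, 2016],
    "BirthYear": [1960, 1975, 1990],  # assumed column name
    "Salary": [2000.0, 1000.0, 1500.0],
    "CMV": [310.0, -52.0, 40.0],
    "Motor": [300.0, 100.0, 200.0],
    "Household": [100.0, 50.0, 0.0],
    "Health": [150.0, 100.0, 100.0],
    "Life": [30.0, 20.0, 0.0],
    "Work": [20.0, 10.0, 0.0],
})

premium_cols = ["Motor", "Household", "Health", "Life", "Work"]

# Age and tenure relative to the current year of the database.
sample["Age"] = CURRENT_YEAR - sample["BirthYear"]
sample["Customer_Years"] = CURRENT_YEAR - sample["First_Policy"]

# Sum of all premium columns per customer.
sample["Total_Premiums"] = sample[premium_cols].sum(axis=1)

# Customers whose first policy is from 2016 have zero Customer_Years.
# Assumption: mask those with NaN instead of dividing by zero.
years = sample["Customer_Years"].where(sample["Customer_Years"] > 0)
sample["Annual_Profit"] = sample["CMV"] / years

# Share of yearly salary (14 monthly payments, per the notes) spent on premiums, in percent.
sample["Salary_Invested"] = sample["Total_Premiums"] / (sample["Salary"] * 14) * 100

print(sample[["Age", "Customer_Years", "Total_Premiums",
              "Annual_Profit", "Salary_Invested"]])
```

Masking zero tenure keeps the ratio from becoming infinite. Whether such customers should instead be dropped or given a tenure of one year is a modeling decision that the notes leave open.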

## Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

## Loading Data

In [None]:
df = pd.read_sas('a2z_insurance.sas7bdat')

In [None]:
df['CustID'] = df.CustID.astype(int)

In [None]:
df.set_index('CustID', inplace=True)

## Data Exploration

In [None]:
# Rename columns to shorter, more readable names
df.rename(columns={'FirstPolYear': 'First_Policy',
                   'EducDeg': 'Education',
                   'MonthSal': 'Salary',
                   'GeoLivArea': 'Area',
                   'CustMonVal': 'CMV',
                   'ClaimsRate': 'Claims',
                   'PremMotor': 'Motor',
                   'PremHousehold': 'Household',
                   'PremHealth': 'Health',
                   'PremLife': 'Life',
                   'PremWork': 'Work'}, inplace=True)

In [None]:
df.head()

In [None]:
#Checking Shape of the Dataset
df.shape

In [None]:
#Checking all the Columns in the Dataset
df.columns

In [None]:
df.describe().T

In [None]:
#Checking information for all columns
df.info()

In [None]:
# Count the null values in each column
df.isnull().sum()

In [None]:
# Percentage of missing data per column
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns,
                                 'percent_missing': percent_missing})
missing_value_df

## Data Preparation

### Handling Missing Values
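
The imputation rules listed in the notes (0 for premiums, mode for Children, median for Salary) might look roughly like the sketch below. It runs on a small invented frame rather than on `df`, and the names `sample` and `premium_cols` are hypothetical. Only two premium columns are shown to keep the example short.

```python
import numpy as np
import pandas as pd

# Synthetic rows, invented for illustration only.
sample = pd.DataFrame({
    "Salary": [1000.0, np.nan, 3000.0, 2000.0],
    "Children": [1.0, 1.0, np.nan, 0.0],
    "Motor": [300.0, np.nan, 120.0, 80.0],
    "Health": [np.nan, 150.0, 90.0, 60.0],
})

premium_cols = ["Motor", "Health"]

# A missing premium is read here as "no policy of that type", hence 0.
sample[premium_cols] = sample[premium_cols].fillna(0)

# Children is categorical (0/1), so fill with the most frequent value.
sample["Children"] = sample["Children"].fillna(sample["Children"].mode().iloc[0])

# Salary is continuous and can be skewed, so fill with the median.
sample["Salary"] = sample["Salary"].fillna(sample["Salary"].median())

print(sample)
```

Treating a missing premium as zero is an interpretation of the data, not something the dataset itself confirms, so it is worth checking against the business definition before applying it to the full table.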