# Customer Analysis Cleaning 



<i> You are working as an analyst for an auto insurance company. The company has collected some data about its customers including their demographics, education, employment, policy details, vehicle information on which insurance policy is, and claim amounts. You will help the senior management with some business questions that will help them to better understand their customers, improve their services, and improve profitability. </i>
    
 **Notebook operations summary:**
 <p> We import three data files in cvs format and clean them. <br>
     The cleaning process includes: </p>
        - Renaming columns (to ensure consistent concatenation and eliminate column redundancy) <br>
        - Replacing null data entries (zeros and nans) by mean values <br>
        - Standardizing string entries <br>
        - Removing duplicate records <br>

        

In [1]:
import numpy as np
import pandas as pd
pd.options.display.max_rows = 100

## Helper functions

In [2]:
def col_rename(df, dict_rules, ip=True):
    ''' rename columns in data frame using a dictionary of rules  '''
    if isinstance(df,pd.core.frame.DataFrame) and isinstance(dict_rules,dict):
        z = df.rename(columns = dict_rules, inplace=ip)
    else:
        raise TypeError
    return z

In [3]:
def print_unique(df,col):
    ''' gives a list of unique values in a field '''
    return df[col].unique()

In [4]:
def fill_nans_with_means(df,col):
    ''' fills nans in column with median '''
    return df[col].fillna(c_df.income.mean()).round()

In [5]:
def make_lower(df, col):
    '''lower the case of record in a field '''
    return df[col].str.lower()

In [6]:
def lower_case_column_names(df):
    ''' make columns lower case '''
    if isinstance(df,pd.core.frame.DataFrame):
        df.columns=[i.lower() for i in df.columns]
    else: 
        raise TypeError
    return df

In [7]:
def strip_char(df,col,char):
    ''' strips a char and rounds '''
    return list(map(lambda x: round(float(x.strip(char))/100,2)\
                                          if type(x)==str else round((x/100),2),df[col]))

In [8]:
def record_str_replace(df, col, dict_rules):
    return df[col].replace(dict_rules) # additional replacement


In [9]:
def get_between_slash_and_join(df,col):
    ''' returns middle entry in a string formatted by "a/b/cc" and avoids nans. 
        (Tailored to the vehicles data set)  '''
    res  = list(
              map(lambda x : int(x[2]) if type(x)==str else x, df[col])
    )
                
    return res

In [10]:
def state(old_names, new_names):
    ''' renames states '''
    return c_df['st'].replace(old_names, new_names)

## Data cleaning 

### Read in the data and examine columns

In [39]:
file1 = pd.read_csv('Data/file1.csv')
file2 = pd.read_csv('Data/file2.csv')
file3 = pd.read_csv('Data/file3.csv')

In [40]:
a_ll_columns = set(list(file1.columns) + list(file2.columns) + list(file3.columns))
a_ll_columns

{'Customer',
 'Customer Lifetime Value',
 'Education',
 'GENDER',
 'Gender',
 'Income',
 'Monthly Premium Auto',
 'Number of Open Complaints',
 'Policy Type',
 'ST',
 'State',
 'Total Claim Amount',
 'Vehicle Class'}

In [35]:
list(file1.columns)

['customer',
 'st',
 'gender',
 'education',
 'customer lifetime value',
 'income',
 'monthly premium auto',
 'number of open complaints',
 'policy type',
 'vehicle class',
 'total claim amount']

In [31]:
file2.columns

Index(['customer', 'st', 'gender', 'education', 'customer lifetime value',
       'income', 'monthly premium auto', 'number of open complaints',
       'total claim amount', 'policy type', 'vehicle class'],
      dtype='object')

In [14]:
file3.columns

Index(['Customer', 'State', 'Customer Lifetime Value', 'Education', 'Gender',
       'Income', 'Monthly Premium Auto', 'Number of Open Complaints',
       'Policy Type', 'Total Claim Amount', 'Vehicle Class'],
      dtype='object')

In [15]:
# rename State to ST before concat to avoid redundant columns
col_rename(file3, {'State': 'ST'}) 

In [16]:
files = [file1,file2,file3]
files = list(map(lower_case_column_names,files)) # make the headers lowercase
c_df  = pd.concat(files) # concat the data into a pandas frame
c_df  = c_df.drop(labels=['customer'], axis=1) # drop customer label
c_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12074 entries, 0 to 7069
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   st                         9137 non-null   object 
 1   gender                     9015 non-null   object 
 2   education                  9137 non-null   object 
 3   customer lifetime value    9130 non-null   object 
 4   income                     9137 non-null   float64
 5   monthly premium auto       9137 non-null   float64
 6   number of open complaints  9137 non-null   object 
 7   policy type                9137 non-null   object 
 8   vehicle class              9137 non-null   object 
 9   total claim amount         9137 non-null   float64
dtypes: float64(3), object(7)
memory usage: 1.0+ MB


In [17]:
gender_old = print_unique(c_df,'gender')
gender_old = list(gender_old)
gender_old

[nan, 'F', 'M', 'Femal', 'Male', 'female']

In [18]:
gender_new = ['U','F','M','F','M','F']
gender_replace_rules = dict(zip(gender_old,gender_new))
c_df['gender']=record_str_replace(c_df,'gender',gender_replace_rules)
print_unique(c_df,'gender')

array(['U', 'F', 'M'], dtype=object)

In [19]:
state_old = print_unique(c_df,'st')
state_old = list(state_old)
state_old

['Washington',
 'Arizona',
 'Nevada',
 'California',
 'Oregon',
 'Cali',
 'AZ',
 'WA',
 nan]

In [20]:
state_new = ['Washington',
 'Arizona',
 'Nevada',
 'California',
 'Oregon',
 'California',
 'Arizona',
 'Washington', 'Unknown']
state_replace_rules = dict(zip(state_old,state_new))
c_df['st']=record_str_replace(c_df,'st',state_replace_rules)
print_unique(c_df,'st')

array(['Washington', 'Arizona', 'Nevada', 'California', 'Oregon',
       'Unknown'], dtype=object)

In [21]:
c_df['number of open complaints'] = get_between_slash_and_join(c_df,'number of open complaints')
print_unique(c_df,'number of open complaints')

array([ 0.,  2.,  1.,  3.,  5.,  4., nan])

In [22]:
c_df['customer lifetime value'] = strip_char(c_df,'customer lifetime value','%')


In [25]:
c_df['customer lifetime value'] = fill_nans_with_means(c_df,'customer lifetime value').apply(round)
c_df['total claim amount']      = fill_nans_with_means(c_df,'total claim amount').apply(round)
c_df['monthly premium auto']    = fill_nans_with_means(c_df,'monthly premium auto').apply(round)
c_df['income']                  = fill_nans_with_means(c_df,'income').apply(round)

In [26]:
c_df.income = c_df.income.replace(0, c_df.income.mean()).round(0).astype(int)

In [27]:
c_df.head(10)

Unnamed: 0,st,gender,education,customer lifetime value,income,monthly premium auto,number of open complaints,policy type,vehicle class,total claim amount
0,Washington,U,Master,37829,37829,1000,0.0,Personal Auto,Four-Door Car,3
1,Arizona,F,Bachelor,6980,37829,94,0.0,Personal Auto,Four-Door Car,1131
2,Nevada,F,Bachelor,12887,48767,108,0.0,Personal Auto,Two-Door Car,566
3,California,M,Bachelor,7646,37829,106,0.0,Corporate Auto,SUV,530
4,Washington,M,High School or Below,5363,36357,68,0.0,Personal Auto,Four-Door Car,17
5,Oregon,F,Bachelor,8256,62902,69,0.0,Personal Auto,Two-Door Car,159
6,Oregon,F,College,5381,55350,67,0.0,Corporate Auto,Four-Door Car,322
7,Arizona,M,Master,7216,37829,101,0.0,Corporate Auto,Four-Door Car,363
8,Oregon,M,Bachelor,24128,14072,71,0.0,Corporate Auto,Four-Door Car,511
9,Oregon,F,College,7388,28812,93,0.0,Special Auto,Four-Door Car,426


In [28]:
old_st = ['CA', 'WA', 'OR', 'AZ', 'NV'] 
new_st = ['West Region', 'East Region', 'Northeast Region', 'Central', 'Central']
c_df['st'] = state(old_st,new_st)
c_df

Unnamed: 0,st,gender,education,customer lifetime value,income,monthly premium auto,number of open complaints,policy type,vehicle class,total claim amount
0,Washington,U,Master,37829,37829,1000,0.0,Personal Auto,Four-Door Car,3
1,Arizona,F,Bachelor,6980,37829,94,0.0,Personal Auto,Four-Door Car,1131
2,Nevada,F,Bachelor,12887,48767,108,0.0,Personal Auto,Two-Door Car,566
3,California,M,Bachelor,7646,37829,106,0.0,Corporate Auto,SUV,530
4,Washington,M,High School or Below,5363,36357,68,0.0,Personal Auto,Four-Door Car,17
...,...,...,...,...,...,...,...,...,...,...
7065,California,M,Bachelor,234,71941,73,0.0,Personal Auto,Four-Door Car,198
7066,California,F,College,31,21604,79,0.0,Corporate Auto,Four-Door Car,379
7067,California,M,Bachelor,82,37829,85,3.0,Corporate Auto,Four-Door Car,791
7068,California,M,College,75,21941,96,0.0,Personal Auto,Four-Door Car,691
