# Lab | Customer Analysis Round 1

## Remember the process:

1. Case Study
2. Get data
3. Cleaning/Wrangling/EDA
4. Processing Data
5. Modeling
6. Validation
7. Reporting

## Abstract

The objective of this analysis is to understand customer demographics and buying behavior. Later in the week, we will use predictive analytics to identify the most profitable customers and how they interact. After that, we will take targeted actions to increase profitable customer response, retention, and growth.

For this lab, we will gather the data from three _csv_ files provided in the `files_for_lab` folder. Use that data to complete the data cleaning tasks listed in the instructions below.

## Instructions

1. Read the three files into Python as dataframes.
2. Show the dataframes' shape.
3. Standardize header names.
4. Rearrange the columns in the dataframes as needed.
5. Concatenate the three dataframes.
6. Which columns are numerical?
7. Which columns are categorical?
8. Understand the meaning of all columns.
9. Perform the data cleaning operations mentioned so far in class:

  - A. Delete the column education and the number of open complaints from the dataframe.
  - B. Correct the values in the column customer lifetime value. They are given as a percent, so multiply them by 100 and change the `dtype` to a numerical type.
  - C. Check for duplicate rows in the data and remove any that are found.
  - D. Filter out the data for customers who have an income of 0 or less.


### Import libraries

In [1]:
import numpy as np
import pandas as pd

### 1. Read the three files into Python as dataframes

In [2]:
df1 = pd.read_csv('C:/Users/digit/Desktop/Ironhack/lab-work/lab-customer-analysis-round-1/files_for_lab/csv_files/file1.csv')
df2 = pd.read_csv('C:/Users/digit/Desktop/Ironhack/lab-work/lab-customer-analysis-round-1/files_for_lab/csv_files/file2.csv')
df3 = pd.read_csv('C:/Users/digit/Desktop/Ironhack/lab-work/lab-customer-analysis-round-1/files_for_lab/csv_files/file3.csv')
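
The absolute paths above only resolve on the author's machine. A minimal sketch of a more portable variant follows, assuming the notebook is started from the lab's root folder so that a relative `files_for_lab/csv_files` path resolves. The helper name `read_lab_files` is hypothetical and not part of the lab.

```python
from pathlib import Path

import pandas as pd


def read_lab_files(data_dir, names=("file1.csv", "file2.csv", "file3.csv")):
    """Read each CSV listed in `names` from `data_dir` and return the dataframes as a list."""
    data_dir = Path(data_dir)  # accepts a string or a Path
    return [pd.read_csv(data_dir / name) for name in names]
```

With that helper, `df1, df2, df3 = read_lab_files("files_for_lab/csv_files")` would stand in for the three separate `pd.read_csv` calls.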

### 2. Show the DataFrames' shape

In [3]:
df1.shape, df2.shape, df3.shape 

# all three dataframes have 11 columns but different numbers of rows

((4008, 11), (996, 11), (7070, 11))

### 3. Standardize header names

In [4]:
# before that I want to see the head of df1, df2, df3 to have an overview
df1.head(5)

Unnamed: 0,Customer,ST,GENDER,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Vehicle Class,Total Claim Amount
0,RB50392,Washington,,Master,,0.0,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,Bachelor,697953.59%,0.0,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,Bachelor,1288743.17%,48767.0,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,Bachelor,764586.18%,0.0,106.0,1/0/00,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,High School or Below,536307.65%,36357.0,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323


In [5]:
# now with df2
df2.head(5)

Unnamed: 0,Customer,ST,GENDER,Education,Customer Lifetime Value,Income,Monthly Premium Auto,Number of Open Complaints,Total Claim Amount,Policy Type,Vehicle Class
0,GS98873,Arizona,F,Bachelor,323912.47%,16061,88,1/0/00,633.6,Personal Auto,Four-Door Car
1,CW49887,California,F,Master,462680.11%,79487,114,1/0/00,547.2,Special Auto,SUV
2,MY31220,California,F,College,899704.02%,54230,112,1/0/00,537.6,Personal Auto,Two-Door Car
3,UH35128,Oregon,F,College,2580706.30%,71210,214,1/1/00,1027.2,Personal Auto,Luxury Car
4,WH52799,Arizona,F,College,380812.21%,94903,94,1/0/00,451.2,Corporate Auto,Two-Door Car


In [6]:
# and finally with df3
df3.head(5)

Unnamed: 0,Customer,State,Customer Lifetime Value,Education,Gender,Income,Monthly Premium Auto,Number of Open Complaints,Policy Type,Total Claim Amount,Vehicle Class
0,SA25987,Washington,3479.137523,High School or Below,M,0,104,0,Personal Auto,499.2,Two-Door Car
1,TB86706,Arizona,2502.637401,Master,M,0,66,0,Personal Auto,3.468912,Two-Door Car
2,ZL73902,Nevada,3265.156348,Bachelor,F,25820,82,0,Personal Auto,393.6,Four-Door Car
3,KX23516,California,4455.843406,High School or Below,F,0,121,0,Personal Auto,699.615192,SUV
4,FN77294,California,7704.95848,High School or Below,M,30366,101,2,Personal Auto,484.8,SUV


In [7]:
# let's convert the column names to lowercase so they are consistent across the files
# we start with df1
df1.columns = df1.columns.str.lower()
df1.columns

Index(['customer', 'st', 'gender', 'education', 'customer lifetime value',
       'income', 'monthly premium auto', 'number of open complaints',
       'policy type', 'vehicle class', 'total claim amount'],
      dtype='object')

In [8]:
# rename the 'st' column to 'state'

df1.rename(columns={"st": "state"}, inplace=True)

In [9]:
# check the results
df1.columns

Index(['customer', 'state', 'gender', 'education', 'customer lifetime value',
       'income', 'monthly premium auto', 'number of open complaints',
       'policy type', 'vehicle class', 'total claim amount'],
      dtype='object')

In [10]:
# now we do the same for df2
df2.columns = df2.columns.str.lower()
df2.columns

Index(['customer', 'st', 'gender', 'education', 'customer lifetime value',
       'income', 'monthly premium auto', 'number of open complaints',
       'total claim amount', 'policy type', 'vehicle class'],
      dtype='object')

In [11]:
# rename the 'st' column to 'state' for df2 as well
df2.rename(columns={"st":"state"}, inplace=True)
df2.columns

Index(['customer', 'state', 'gender', 'education', 'customer lifetime value',
       'income', 'monthly premium auto', 'number of open complaints',
       'total claim amount', 'policy type', 'vehicle class'],
      dtype='object')

In [12]:
# now we convert the column names into lowercase for df3
df3.columns = df3.columns.str.lower()
df3.columns

Index(['customer', 'state', 'customer lifetime value', 'education', 'gender',
       'income', 'monthly premium auto', 'number of open complaints',
       'policy type', 'total claim amount', 'vehicle class'],
      dtype='object')
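
The same two steps (lowercase the names, rename `st` to `state`) are repeated for each dataframe above. A small sketch of doing it once in a function is shown below. The name `standardize_headers` and the tiny demo frames are made up for illustration, and the extra `.str.strip()` is an added precaution that the notebook itself does not use.

```python
import pandas as pd


def standardize_headers(frame):
    """Return a copy with stripped, lowercase column names and 'st' renamed to 'state'."""
    cleaned = frame.copy()
    cleaned.columns = cleaned.columns.str.strip().str.lower()
    # rename() silently skips frames that have no 'st' column
    return cleaned.rename(columns={"st": "state"})


# Tiny stand-ins for file1 and file3, using header spellings seen in the heads above.
demo1 = pd.DataFrame({"Customer": ["RB50392"], "ST": ["Washington"], "GENDER": [None]})
demo3 = pd.DataFrame({"Customer": ["SA25987"], "State": ["Washington"], "Gender": ["M"]})

demo1, demo3 = [standardize_headers(frame) for frame in (demo1, demo3)]
print(list(demo1.columns))  # ['customer', 'state', 'gender']
print(list(demo3.columns))  # ['customer', 'state', 'gender']
```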

### 4. Rearrange the columns in the dataframe as needed

In [13]:
# start with df1 and rearrange the columns into one consistent order for all three dataframes
df1 = df1[["customer", "state", "gender", "education", "income", "customer lifetime value", "monthly premium auto", 
          "number of open complaints", "policy type", "vehicle class", "total claim amount"]]

df1.head()

Unnamed: 0,customer,state,gender,education,income,customer lifetime value,monthly premium auto,number of open complaints,policy type,vehicle class,total claim amount
0,RB50392,Washington,,Master,0.0,,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,Bachelor,0.0,697953.59%,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,Bachelor,48767.0,1288743.17%,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,Bachelor,0.0,764586.18%,106.0,1/0/00,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,High School or Below,36357.0,536307.65%,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323


In [14]:
# now with df2
df2 = df2[["customer", "state", "gender", "education", "income", "customer lifetime value", "monthly premium auto", 
          "number of open complaints", "policy type", "vehicle class", "total claim amount"]]

df2.head()

Unnamed: 0,customer,state,gender,education,income,customer lifetime value,monthly premium auto,number of open complaints,policy type,vehicle class,total claim amount
0,GS98873,Arizona,F,Bachelor,16061,323912.47%,88,1/0/00,Personal Auto,Four-Door Car,633.6
1,CW49887,California,F,Master,79487,462680.11%,114,1/0/00,Special Auto,SUV,547.2
2,MY31220,California,F,College,54230,899704.02%,112,1/0/00,Personal Auto,Two-Door Car,537.6
3,UH35128,Oregon,F,College,71210,2580706.30%,214,1/1/00,Personal Auto,Luxury Car,1027.2
4,WH52799,Arizona,F,College,94903,380812.21%,94,1/0/00,Corporate Auto,Two-Door Car,451.2


In [15]:
# now with df3
df3 = df3[["customer", "state", "gender", "education", "income", "customer lifetime value", "monthly premium auto", 
          "number of open complaints", "policy type", "vehicle class", "total claim amount"]]

df3.head()

Unnamed: 0,customer,state,gender,education,income,customer lifetime value,monthly premium auto,number of open complaints,policy type,vehicle class,total claim amount
0,SA25987,Washington,M,High School or Below,0,3479.137523,104,0,Personal Auto,Two-Door Car,499.2
1,TB86706,Arizona,M,Master,0,2502.637401,66,0,Personal Auto,Two-Door Car,3.468912
2,ZL73902,Nevada,F,Bachelor,25820,3265.156348,82,0,Personal Auto,Four-Door Car,393.6
3,KX23516,California,F,High School or Below,0,4455.843406,121,0,Personal Auto,SUV,699.615192
4,FN77294,California,M,High School or Below,30366,7704.95848,101,2,Personal Auto,SUV,484.8
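
Because the same column list is typed three times above, one option is to define it once and apply it to every dataframe. The sketch below uses a shortened, hypothetical `column_order` and made-up one-row frames. In the notebook the list would hold all eleven column names.

```python
import pandas as pd

# Hypothetical shortened column order; the notebook uses all eleven columns.
column_order = ["customer", "state", "gender"]

frames = [
    pd.DataFrame({"customer": ["RB50392"], "state": ["Washington"], "gender": [None]}),
    pd.DataFrame({"state": ["Washington"], "gender": ["M"], "customer": ["SA25987"]}),
]

# Selecting with a list returns the columns in exactly that order.
frames = [frame[column_order] for frame in frames]
print([list(frame.columns) for frame in frames])
```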


### 5. Concatenate the three dataframes

In [16]:
# concatenate them all together and assign under a variable name "df"
df = pd.concat([df1,df2,df3], axis=0)
df.head()

Unnamed: 0,customer,state,gender,education,income,customer lifetime value,monthly premium auto,number of open complaints,policy type,vehicle class,total claim amount
0,RB50392,Washington,,Master,0.0,,1000.0,1/0/00,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,Bachelor,0.0,697953.59%,94.0,1/0/00,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,Bachelor,48767.0,1288743.17%,108.0,1/0/00,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,Bachelor,0.0,764586.18%,106.0,1/0/00,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,High School or Below,36357.0,536307.65%,68.0,1/0/00,Personal Auto,Four-Door Car,17.269323


In [17]:
df.info()
# 4008 + 996 + 7070 = 12074 rows
# the concatenation was successful :)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12074 entries, 0 to 7069
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   customer                   9137 non-null   object 
 1   state                      9137 non-null   object 
 2   gender                     9015 non-null   object 
 3   education                  9137 non-null   object 
 4   income                     9137 non-null   float64
 5   customer lifetime value    9130 non-null   object 
 6   monthly premium auto       9137 non-null   float64
 7   number of open complaints  9137 non-null   object 
 8   policy type                9137 non-null   object 
 9   vehicle class              9137 non-null   object 
 10  total claim amount         9137 non-null   float64
dtypes: float64(3), object(8)
memory usage: 1.1+ MB
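
Two details of `pd.concat` are worth noting here. It aligns columns by name, so the manual reordering in step 4 helps readability rather than correctness. By default it also keeps each dataframe's own index, which is why the index above runs "0 to 7069" for 12074 rows. A small sketch with made-up frames `a` and `b`:

```python
import pandas as pd

a = pd.DataFrame({"customer": ["A1", "A2"], "income": [10.0, 20.0]})
b = pd.DataFrame({"income": [30.0], "customer": ["B1"]})  # same columns, different order

stacked = pd.concat([a, b], axis=0)                        # index labels: 0, 1, 0
renumbered = pd.concat([a, b], axis=0, ignore_index=True)  # index labels: 0, 1, 2

print(list(stacked.columns))     # ['customer', 'income']
print(list(stacked.index))       # [0, 1, 0]
print(list(renumbered.index))    # [0, 1, 2]
```

Passing `ignore_index=True` is one way to avoid repeated index labels without a separate `reset_index` call later.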


In [18]:
# let's check for NA values just for extra curiosity
df.isna().sum()

customer                     2937
state                        2937
gender                       3059
education                    2937
income                       2937
customer lifetime value      2944
monthly premium auto         2937
number of open complaints    2937
policy type                  2937
vehicle class                2937
total claim amount           2937
dtype: int64
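
Almost every column reports the same 2937 missing values, which suggests (but does not prove) that those are completely empty rows. One way to check is to count the rows where every column is missing, sketched here on a made-up frame:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        "customer": ["QZ44356", np.nan, "AI49188", np.nan],
        "gender": ["F", np.nan, np.nan, np.nan],
        "income": [0.0, np.nan, 48767.0, np.nan],
    }
)

empty_rows = demo.isna().all(axis=1)  # True where every column is NaN
print(int(empty_rows.sum()))          # 2
```

On the real data, `df.isna().all(axis=1).sum()` would give the corresponding count.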

### 6. Which columns are numerical?

In [19]:
# select the numeric columns and assign them to the variable "dfnum"
dfnum = df.select_dtypes(include="number")
dfnum
# "customer lifetime value" and "number of open complaints" could count as numerical as well,
# but they are stored as object and need cleaning first

Unnamed: 0,income,monthly premium auto,total claim amount
0,0.0,1000.0,2.704934
1,0.0,94.0,1131.464935
2,48767.0,108.0,566.472247
3,0.0,106.0,529.881344
4,36357.0,68.0,17.269323
...,...,...,...
7065,71941.0,73.0,198.234764
7066,21604.0,79.0,379.200000
7067,0.0,85.0,790.784983
7068,21941.0,96.0,691.200000


### 7. Which columns are categorical?

In [20]:
dfcols = [col for col in df.columns if df[col].dtype=="O"]
dfcols
# we need to clean customer lifetime value and number of open complaints

['customer',
 'state',
 'gender',
 'education',
 'customer lifetime value',
 'number of open complaints',
 'policy type',
 'vehicle class']
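
As an alternative to checking `dtype == "O"` column by column, `select_dtypes` can split the columns into numeric and non-numeric groups. The demo frame below is made up from values shown earlier. Note that the percent strings keep "customer lifetime value" in the non-numeric group until it is cleaned.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "state": ["Arizona", "Nevada"],
        "income": [0.0, 48767.0],
        "customer lifetime value": ["697953.59%", "1288743.17%"],
    }
)

numeric_cols = demo.select_dtypes(include="number").columns.tolist()
other_cols = demo.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)  # ['income']
print(other_cols)    # ['state', 'customer lifetime value']
```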

In [21]:
# we can also check for numericals and categoricals with this query
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12074 entries, 0 to 7069
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   customer                   9137 non-null   object 
 1   state                      9137 non-null   object 
 2   gender                     9015 non-null   object 
 3   education                  9137 non-null   object 
 4   income                     9137 non-null   float64
 5   customer lifetime value    9130 non-null   object 
 6   monthly premium auto       9137 non-null   float64
 7   number of open complaints  9137 non-null   object 
 8   policy type                9137 non-null   object 
 9   vehicle class              9137 non-null   object 
 10  total claim amount         9137 non-null   float64
dtypes: float64(3), object(8)
memory usage: 1.1+ MB


### 8. Understand the meaning of all columns

In [22]:
# for this question we can start by checking the dtype of each column and possible NA values
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12074 entries, 0 to 7069
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   customer                   9137 non-null   object 
 1   state                      9137 non-null   object 
 2   gender                     9015 non-null   object 
 3   education                  9137 non-null   object 
 4   income                     9137 non-null   float64
 5   customer lifetime value    9130 non-null   object 
 6   monthly premium auto       9137 non-null   float64
 7   number of open complaints  9137 non-null   object 
 8   policy type                9137 non-null   object 
 9   vehicle class              9137 non-null   object 
 10  total claim amount         9137 non-null   float64
dtypes: float64(3), object(8)
memory usage: 1.1+ MB


In [23]:
# it's good to start with describe to get a statistical overview
df.describe()

Unnamed: 0,income,monthly premium auto,total claim amount
count,9137.0,9137.0,9137.0
mean,37828.820291,110.391266,430.52714
std,30358.716159,581.376032,289.582968
min,0.0,61.0,0.099007
25%,0.0,68.0,266.996814
50%,34244.0,83.0,377.561463
75%,62447.0,109.0,546.420009
max,99981.0,35354.0,2893.239678
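
By default `describe()` summarizes only the numeric columns, as in the output above. To get a similar overview of the text columns (count, number of unique values, most frequent value and its frequency), the numeric dtypes can be excluded. A sketch on a made-up frame built from the first five rows shown earlier:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        "state": ["Washington", "Arizona", "Nevada", "California", "Washington"],
        "policy type": ["Personal Auto", "Personal Auto", "Personal Auto",
                        "Corporate Auto", "Personal Auto"],
        "income": [0.0, 0.0, 48767.0, 0.0, 36357.0],
    }
)

# Rows of the result: count, unique, top, freq (numeric columns are left out).
summary = demo.describe(exclude=[np.number])
print(summary)
```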


### 9. Perform the data cleaning operations mentioned so far in class


#### Delete the column education and the number of open complaints from the dataframe.

In [24]:
# through reassigning
df = df[["customer", "state", "gender", "income", "customer lifetime value", "monthly premium auto",
         "policy type", "vehicle class", "total claim amount"]]
df.head()

Unnamed: 0,customer,state,gender,income,customer lifetime value,monthly premium auto,policy type,vehicle class,total claim amount
0,RB50392,Washington,,0.0,,1000.0,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,0.0,697953.59%,94.0,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,48767.0,1288743.17%,108.0,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,0.0,764586.18%,106.0,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,36357.0,536307.65%,68.0,Personal Auto,Four-Door Car,17.269323
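
Reassigning with the list of columns to keep works, but it means retyping every remaining column. A shorter sketch names only the columns to remove with `DataFrame.drop` (the demo frame is made up):

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "customer": ["QZ44356", "AI49188"],
        "education": ["Bachelor", "Bachelor"],
        "number of open complaints": ["1/0/00", "1/0/00"],
        "income": [0.0, 48767.0],
    }
)

# drop() returns a new dataframe; the original is left unchanged.
trimmed = demo.drop(columns=["education", "number of open complaints"])
print(list(trimmed.columns))  # ['customer', 'income']
```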


#### Correct the values in the column customer lifetime value. They are given as a percent, so multiply them by 100 and change the `dtype` to a numerical type.

In [25]:
df["customer lifetime value"]

# the dtype is currently object: percent strings from file1 and file2 mixed with plain numbers from file3

0               NaN
1        697953.59%
2       1288743.17%
3        764586.18%
4        536307.65%
           ...     
7065    23405.98798
7066    3096.511217
7067    8163.890428
7068    7524.442436
7069    2611.836866
Name: customer lifetime value, Length: 12074, dtype: object

In [26]:
df["customer lifetime value"] = df["customer lifetime value"].str.rstrip("%").astype("float") *100.0

In [27]:
# check the result
df.dtypes

customer                    object
state                       object
gender                      object
income                     float64
customer lifetime value    float64
monthly premium auto       float64
policy type                 object
vehicle class               object
total claim amount         float64
dtype: object

In [28]:
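# check NA values
# the NA count in "customer lifetime value" rose from 2944 to 10014, an increase of 7070,
# which matches the row count of file3: .str methods return NaN for values that are not strings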
df.isna().sum()

customer                    2937
state                       2937
gender                      3059
income                      2937
customer lifetime value    10014
monthly premium auto        2937
policy type                 2937
vehicle class               2937
total claim amount          2937
dtype: int64

In [29]:
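# let's check the % of nulls
# "customer lifetime value" is almost 83 % null, but most of that comes from the conversion step
# rather than from the source files, so the conversion should be revisited before dropping the column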
df.isna().mean().round(4) *100

customer                   24.32
state                      24.32
gender                     25.34
income                     24.32
customer lifetime value    82.94
monthly premium auto       24.32
policy type                24.32
vehicle class              24.32
total claim amount         24.32
dtype: float64
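
The jump in missing values happens because the column mixes percent strings with plain numbers, and `.str.rstrip` returns NaN for anything that is not a string. Below is a hedged sketch of a conversion that keeps both kinds of value. The function name `clean_clv` and its `percent_factor` parameter are hypothetical. The default of 100 follows the lab instruction, although the much smaller values in file3 suggest the right factor may need a second look.

```python
import numpy as np
import pandas as pd


def clean_clv(series, percent_factor=100.0):
    """Convert a column of 'NNN.NN%' strings and plain numbers to floats.

    Only values that carried a '%' sign are scaled by `percent_factor`;
    plain numbers are kept as they are and missing values stay NaN.
    """
    text = series.astype(str)                         # numbers and NaN become text
    is_percent = text.str.endswith("%", na=False)     # remember which values had a '%'
    values = pd.to_numeric(text.str.rstrip("%"), errors="coerce")
    return values.where(~is_percent, values * percent_factor)


# Made-up column with one value of each kind seen in the notebook.
raw = pd.Series([np.nan, "697953.59%", 3479.137523], dtype=object)
clv = clean_clv(raw)
print(clv)
```

Here only the original NaN stays missing, while the plain number from the third row survives the conversion.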

#### Check for duplicate rows in the data and remove if any.

In [30]:
# check for duplicates
df.duplicated(subset=None, keep="first").sum()

2943

In [31]:
# we are not officially dropping duplicates here, just testing the query
df.drop_duplicates()

Unnamed: 0,customer,state,gender,income,customer lifetime value,monthly premium auto,policy type,vehicle class,total claim amount
0,RB50392,Washington,,0.0,,1000.0,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,0.0,69795359.0,94.0,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,48767.0,128874317.0,108.0,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,0.0,76458618.0,106.0,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,36357.0,53630765.0,68.0,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...
7065,LA72316,California,M,71941.0,,73.0,Personal Auto,Four-Door Car,198.234764
7066,PK87824,California,F,21604.0,,79.0,Corporate Auto,Four-Door Car,379.200000
7067,TD14365,California,M,0.0,,85.0,Corporate Auto,Four-Door Car,790.784983
7068,UP19263,California,M,21941.0,,96.0,Personal Auto,Four-Door Car,691.200000


In [32]:
# now we drop the duplicates
df = df.drop_duplicates()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9131 entries, 0 to 7069
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   customer                 9130 non-null   object 
 1   state                    9130 non-null   object 
 2   gender                   9008 non-null   object 
 3   income                   9130 non-null   float64
 4   customer lifetime value  2057 non-null   float64
 5   monthly premium auto     9130 non-null   float64
 6   policy type              9130 non-null   object 
 7   vehicle class            9130 non-null   object 
 8   total claim amount       9130 non-null   float64
dtypes: float64(4), object(5)
memory usage: 713.4+ KB
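
After `drop_duplicates()` the dataframe has 9131 rows but only 9130 non-null customers, so one row without a customer survives. It is most likely a completely empty row: `drop_duplicates()` treats all-NaN rows as equal and keeps the first one. A sketch, on made-up data, of removing it with `dropna(how="all")`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        "customer": ["QZ44356", np.nan, "QZ44356", np.nan],
        "income": [0.0, np.nan, 0.0, np.nan],
    }
)

deduped = demo.drop_duplicates()       # keeps one copy of the all-NaN row
cleaned = deduped.dropna(how="all")    # removes rows where every column is NaN
cleaned = cleaned.reset_index(drop=True)
print(len(demo), len(deduped), len(cleaned))  # 4 2 1
```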


In [33]:
# just checking this query
df.infer_objects()

Unnamed: 0,customer,state,gender,income,customer lifetime value,monthly premium auto,policy type,vehicle class,total claim amount
0,RB50392,Washington,,0.0,,1000.0,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,0.0,69795359.0,94.0,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,48767.0,128874317.0,108.0,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,0.0,76458618.0,106.0,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,36357.0,53630765.0,68.0,Personal Auto,Four-Door Car,17.269323
...,...,...,...,...,...,...,...,...,...
7065,LA72316,California,M,71941.0,,73.0,Personal Auto,Four-Door Car,198.234764
7066,PK87824,California,F,21604.0,,79.0,Corporate Auto,Four-Door Car,379.200000
7067,TD14365,California,M,0.0,,85.0,Corporate Auto,Four-Door Car,790.784983
7068,UP19263,California,M,21941.0,,96.0,Personal Auto,Four-Door Car,691.200000


#### Filter out the data for customers who have an income of 0 or less

In [34]:
# select the rows with an income of 0 or less using .loc (this only displays them; df itself is not changed)
df.loc[df.income <= 0]

Unnamed: 0,customer,state,gender,income,customer lifetime value,monthly premium auto,policy type,vehicle class,total claim amount
0,RB50392,Washington,,0.0,,1000.0,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,0.0,69795359.0,94.0,Personal Auto,Four-Door Car,1131.464935
3,WW63253,California,M,0.0,76458618.0,106.0,Corporate Auto,SUV,529.881344
7,CF85061,Arizona,M,0.0,72161003.0,101.0,Corporate Auto,Four-Door Car,363.029680
10,SX51350,California,M,0.0,47389920.0,67.0,Personal Auto,Four-Door Car,482.400000
...,...,...,...,...,...,...,...,...,...
7059,WZ45103,California,F,0.0,,76.0,Personal Auto,Four-Door Car,364.800000
7061,RX91025,California,M,0.0,,185.0,Personal Auto,SUV,1950.725547
7062,AC13887,California,M,0.0,,67.0,Corporate Auto,Two-Door Car,482.400000
7067,TD14365,California,M,0.0,,85.0,Corporate Auto,Four-Door Car,790.784983


In [35]:
# reset the index so the labels run from 0 again after concatenating and dropping duplicates
df = df.reset_index(drop=True)

In [36]:
df.head()

Unnamed: 0,customer,state,gender,income,customer lifetime value,monthly premium auto,policy type,vehicle class,total claim amount
0,RB50392,Washington,,0.0,,1000.0,Personal Auto,Four-Door Car,2.704934
1,QZ44356,Arizona,F,0.0,69795359.0,94.0,Personal Auto,Four-Door Car,1131.464935
2,AI49188,Nevada,F,48767.0,128874317.0,108.0,Personal Auto,Two-Door Car,566.472247
3,WW63253,California,M,0.0,76458618.0,106.0,Corporate Auto,SUV,529.881344
4,GA49547,Washington,M,36357.0,53630765.0,68.0,Personal Auto,Four-Door Car,17.269323
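
"Filter out" can be read two ways: selecting the zero-income customers or removing them. Either way the result has to be assigned to a variable to be kept. A sketch of both readings on a made-up frame follows. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        "customer": ["QZ44356", "AI49188", "GA49547", None],
        "income": [0.0, 48767.0, 36357.0, np.nan],
    }
)

# Reading 1: keep only the customers with an income of 0 or less.
zero_income = demo[demo["income"] <= 0]

# Reading 2: remove those customers and keep the rest.
positive_income = demo[demo["income"] > 0].reset_index(drop=True)

print(len(zero_income), len(positive_income))  # 1 2
```

Comparisons with NaN evaluate to False, so a row with a missing income appears in neither result.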
