# Lab | Customer Analysis Round 7

For this lab, we continue to use the `marketing_customer_analysis.csv` file, which you can find in the `files_for_lab` folder.

Remember the previous rounds. Follow the steps as shown in previous lectures and try to improve the accuracy of the model. Include both categorical columns in the exercise.
Some approaches you can try in this exercise:

- check for multicollinearity and remove insignificant variables
- use a different method of scaling the numerical variables
- use a different train-test split ratio
- apply a transformation to the numerical columns that brings them closer to a normal distribution
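
As a rough illustration of the multicollinearity check in the list above, the sketch below computes a variance inflation factor (VIF) per column from the inverse of the correlation matrix. The helper name `vif_table` and the toy columns `a`, `b`, `c` are assumptions made for this example; they are not part of the lab dataset or of any library.

```python
import numpy as np
import pandas as pd

def vif_table(features: pd.DataFrame) -> pd.Series:
    """VIF per column, read off the diagonal of the inverse correlation matrix."""
    corr = features.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns, name="vif").sort_values(ascending=False)

# Toy data (not the lab dataset): `b` is almost a copy of `a`, `c` is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

vifs = vif_table(toy)
print(vifs)
```

A common rule of thumb treats a VIF well above 5 to 10 as a sign that a column is largely explained by the others, so one of the correlated columns could be a candidate for removal. The threshold is a convention, not a hard rule.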

### Get the data

We are using the `marketing_customer_analysis.csv` file.

In [37]:
import pandas as pd

pd.set_option('display.max_columns', None)

import warnings
warnings.simplefilter('ignore')

data=pd.read_csv('files_for_lab/csv_files/marketing_customer_analysis.csv')

In [38]:
data.head()

| Unnamed: 0 | Customer | State | Customer Lifetime Value | Response | Coverage | Education | Effective To Date | EmploymentStatus | Gender | Income | Location Code | Marital Status | Monthly Premium Auto | Months Since Last Claim | Months Since Policy Inception | Number of Open Complaints | Number of Policies | Policy Type | Policy | Renew Offer Type | Sales Channel | Total Claim Amount | Vehicle Class | Vehicle Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | BU79786 | Washington | 2763.519279 | No | Basic | Bachelor | 2/24/11 | Employed | F | 56274 | Suburban | Married | 69 | 32 | 5 | 0 | 1 | Corporate Auto | Corporate L3 | Offer1 | Agent | 384.811147 | Two-Door Car | Medsize |
| 1 | QZ44356 | Arizona | 6979.535903 | No | Extended | Bachelor | 1/31/11 | Unemployed | F | 0 | Suburban | Single | 94 | 13 | 42 | 0 | 8 | Personal Auto | Personal L3 | Offer3 | Agent | 1131.464935 | Four-Door Car | Medsize |
| 2 | AI49188 | Nevada | 12887.43165 | No | Premium | Bachelor | 2/19/11 | Employed | F | 48767 | Suburban | Married | 108 | 18 | 38 | 0 | 2 | Personal Auto | Personal L3 | Offer1 | Agent | 566.472247 | Two-Door Car | Medsize |
| 3 | WW63253 | California | 7645.861827 | No | Basic | Bachelor | 1/20/11 | Unemployed | M | 0 | Suburban | Married | 106 | 18 | 65 | 0 | 7 | Corporate Auto | Corporate L2 | Offer1 | Call Center | 529.881344 | SUV | Medsize |
| 4 | HB64268 | Washington | 2813.692575 | No | Basic | Bachelor | 2/3/11 | Employed | M | 43836 | Rural | Single | 73 | 12 | 44 | 0 | 1 | Personal Auto | Personal L1 | Offer1 | Agent | 138.130879 | Four-Door Car | Medsize |


### Dealing with the data

Already done in rounds 2 to 7.

In [40]:
# clean: snake_case column names, drop rows with missing values, parse dates
data.columns=[e.lower().replace(' ', '_') for e in data.columns]
data=data.dropna()
data['effective_to_date']=pd.to_datetime(data['effective_to_date'], errors='coerce')

# select data: features X and target y
# (the 'unnamed:_0' index column is not dropped here, so it stays among the features)
X=data.drop(columns=['customer', 'effective_to_date', 'total_claim_amount'])
y=data.total_claim_amount

# num-cat split
X_num=X.select_dtypes(include='number')
X_cat=X.drop(columns=X_num.columns)

# numeric normalization (zero mean, unit variance per column)
from sklearn.preprocessing import StandardScaler

X_num=pd.DataFrame(StandardScaler().fit_transform(X_num),
                   columns=X_num.columns, index=X_num.index)

# cat, one hot encoding (dtype=float keeps every column numeric for statsmodels)
X_cat=pd.get_dummies(X_cat, drop_first=True, dtype=float)

# concat numerical and categorical transformations
X=pd.concat([X_num, X_cat], axis=1)
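
Two of the suggested approaches, a different scaling method and a transformation towards normality, could be tried at this step. The sketch below uses scikit-learn's `MinMaxScaler` and `PowerTransformer` on synthetic skewed columns. The column names `income_like` and `premium_like` are made up for illustration and only stand in for the lab's numerical columns.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

# Synthetic right-skewed data (an assumption, not the lab dataset)
rng = np.random.default_rng(1)
toy_num = pd.DataFrame({
    "income_like": rng.lognormal(mean=3.0, sigma=0.8, size=1000),
    "premium_like": rng.gamma(shape=2.0, scale=50.0, size=1000),
})

# Alternative scaling: squeeze every column into the [0, 1] range
minmax = pd.DataFrame(MinMaxScaler().fit_transform(toy_num),
                      columns=toy_num.columns, index=toy_num.index)

# Yeo-Johnson power transform: reduces skew and standardizes by default
power = pd.DataFrame(PowerTransformer(method="yeo-johnson").fit_transform(toy_num),
                     columns=toy_num.columns, index=toy_num.index)

print("skew before:", toy_num.skew().round(2).to_dict())
print("skew after: ", power.skew().round(2).to_dict())
```

Whether either choice helps the regression depends on the data, so it is worth comparing the resulting scores rather than assuming an improvement. Fitting the scaler on the training rows only, and then applying it to the test rows, avoids leaking test information into the preprocessing.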

**Bonus**: Build a function, from round 2 and round 7, to clean and process the data.

In [41]:
def normalize(X):         # normalization function (z-score per column)
    X_mean=X.mean(axis=0)
    X_std=X.std(axis=0)
    X_std[X_std==0]=1.0   # avoid division by zero for constant columns
    X=(X-X_mean)/X_std
    return X

def process_clean_data(df):
    # work on a copy so the caller's dataframe is not modified
    df=df.copy()

    # clean
    df.columns=[e.lower().replace(' ', '_') for e in df.columns]
    #df=df.drop(columns=['unnamed:_0', 'vehicle_type'])
    df=df.dropna()
    df['effective_to_date']=pd.to_datetime(df['effective_to_date'], errors='coerce')

    # select data
    X=df.drop(columns=['customer', 'effective_to_date', 'total_claim_amount'])
    y=df.total_claim_amount

    # num-cat split
    X_num=X.select_dtypes(include='number')
    X_cat=X.drop(columns=X_num.columns)

    # numeric normalization
    X_num=normalize(X_num)

    # cat, one hot encoding (dtype=float keeps every column numeric)
    X_cat=pd.get_dummies(X_cat, drop_first=True, dtype=float)

    # concat numerical and categorical transformations
    X=pd.concat([X_num, X_cat], axis=1)

    # return X,y
    return X, y
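
One possible alternative to a hand-written processing function is a scikit-learn pipeline, which learns the scaling and encoding from the training rows and reuses them on the test rows. This is only a sketch under assumptions: the toy frame, its three columns, and the target formula are invented for the example and do not come from the lab file.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the cleaned dataframe (hypothetical values)
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "income": rng.normal(50_000, 10_000, size=n),
    "monthly_premium_auto": rng.normal(90, 20, size=n),
    "coverage": rng.choice(["Basic", "Extended", "Premium"], size=n),
})
toy_target = (5 * toy["monthly_premium_auto"]
              + 100 * (toy["coverage"] == "Premium")
              + rng.normal(0, 10, size=n))

# Scale numeric columns, one-hot encode categorical ones (first level dropped)
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["income", "monthly_premium_auto"]),
        ("cat", OneHotEncoder(drop="first"), ["coverage"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
pipe = make_pipeline(preprocess, LinearRegression())

X_tr, X_te, y_tr, y_te = train_test_split(toy, toy_target, test_size=0.2, random_state=0)
pipe.fit(X_tr, y_tr)          # preprocessing is fitted on the training rows only
r2_test = pipe.score(X_te, y_te)
n_features = pipe[:-1].transform(X_te).shape[1]
print(round(r2_test, 3), n_features)
```

The design choice here is that the split happens before any statistics are computed, so the test rows never influence the scaler or the encoder.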

### Explore the data

Done in round 3.

### Modeling

Description:

- Create a linear regression model
- Try to improve the linear regression model

In [42]:
# train-test-split
from sklearn.model_selection import train_test_split as tts

X_train, X_test, y_train, y_test=tts(X, y, test_size=.2)
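
To try a different train-test split ratio, it may help to fix `random_state` so that runs are repeatable and differences are not just due to a new random shuffle. The sketch below loops over a few ratios on synthetic data from `make_regression`. The data and the chosen ratios are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression problem (not the lab dataset)
X_toy, y_toy = make_regression(n_samples=1000, n_features=10, noise=20.0, random_state=0)

scores = {}
for test_size in (0.2, 0.25, 0.3):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_toy, y_toy, test_size=test_size, random_state=42
    )
    model = LinearRegression().fit(X_tr, y_tr)
    # store the number of test rows and the test R^2 for this ratio
    scores[test_size] = (len(X_te), model.score(X_te, y_te))

for test_size, (n_test, r2) in scores.items():
    print(f"test_size={test_size}: {n_test} test rows, R^2={r2:.3f}")
```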

In [43]:
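# statsmodels version
import statsmodels.api as sm

X_train_const=sm.add_constant(X_train)   # adds the intercept column ('const')
model=sm.OLS(y_train, X_train_const).fit()

print(model.summary())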

                            OLS Regression Results                            
Dep. Variable:     total_claim_amount   R-squared:                       0.773
Model:                            OLS   Adj. R-squared:                  0.772
Method:                 Least Squares   F-statistic:                     516.1
Date:                Tue, 24 Jan 2023   Prob (F-statistic):               0.00
Time:                        15:20:41   Log-Likelihood:                -46467.
No. Observations:                7307   AIC:                         9.303e+04
Df Residuals:                    7258   BIC:                         9.337e+04
Df Model:                          48                                         
Covariance Type:            nonrobust                                         
                                     coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
const       

In [44]:
# sklearn version
from sklearn.linear_model import LinearRegression as LinReg

linreg=LinReg().fit(X_train, y_train)

In [45]:
linreg.score(X_train, y_train)

0.773411897854125

In [46]:
linreg.score(X_test, y_test)

0.7658679140225813

In [47]:
from sklearn.metrics import mean_squared_error as mse

train_mse=mse(linreg.predict(X_train), y_train)
test_mse=mse(linreg.predict(X_test), y_test)

print('train RMSE: {} -- test RMSE: {}'.format(train_mse**.5, test_mse**.5))

train RMSE: 139.81492478888 -- test RMSE: 134.09410643530302
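
For reference, RMSE is the square root of the mean squared error, so it is expressed in the same units as the target. The tiny worked example below uses made-up values to show that `mean_squared_error` followed by a square root matches the formula written out with NumPy.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration: errors are 1, 0 and -2
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

rmse_sklearn = mean_squared_error(y_true, y_pred) ** 0.5
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))  # sqrt((1 + 0 + 4) / 3)

print(round(rmse_sklearn, 3))  # 1.291
```

Because the errors are squared, the result is the same whichever of the two arrays is passed first.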


In [48]:
# drop some columns to test if model improves
X=X.drop(columns=['customer_lifetime_value','coverage_Premium',
                 'policy_Corporate L3', 'policy_Special L2',
                  'sales_channel_Branch', 'vehicle_class_Two-Door Car', 'vehicle_size_Small'
                 ])

# re-split after dropping columns; with no random_state this is a new random split
X_train, X_test, y_train, y_test=tts(X, y, test_size=.2)

linreg=LinReg().fit(X_train, y_train)

train_mse=mse(linreg.predict(X_train), y_train)
test_mse=mse(linreg.predict(X_test), y_test)

print ('train RMSE: {} -- test RMSE: {}'.format(train_mse**.5, test_mse**.5))

train RMSE: 138.74791351688256 -- test RMSE: 138.14925081002957
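
Since each call to the split draws a new random partition, a single pair of RMSE values may not be enough to tell whether dropping columns helped. One way to get a steadier comparison is k-fold cross-validation with the same folds for both feature sets. The sketch below does this on synthetic data. The helper `cv_rmse` and the choice of which columns to keep are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (not the lab dataset)
X_toy, y_toy = make_regression(
    n_samples=600, n_features=8, n_informative=4, noise=15.0, random_state=0
)

# Same shuffled folds for every feature set, so the comparison is like for like
cv = KFold(n_splits=5, shuffle=True, random_state=0)

def cv_rmse(features, target):
    """Mean RMSE over the folds (sklearn returns it negated, so flip the sign)."""
    scores = cross_val_score(
        LinearRegression(), features, target,
        cv=cv, scoring="neg_root_mean_squared_error",
    )
    return -scores.mean()

rmse_full = cv_rmse(X_toy, y_toy)
rmse_reduced = cv_rmse(X_toy[:, :4], y_toy)   # keep an arbitrary subset of columns

print(f"all columns: {rmse_full:.2f} -- reduced: {rmse_reduced:.2f}")
```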
