# Lesson 1.9: Improving the model

### Lesson Duration: 3 hours

> Purpose: The purpose of this DIY (Do It Yourself) lesson is to go through the complete process of building the model and using different techniques to improve the accuracy of the model. As George Box famously said, "All models are wrong but some are useful".

### Learning Objectives

After this lesson, students will be able to:

- Apply a linear regression model from the beginning
- Improve the accuracy of the model
- Collaborate using Git and GitHub
- Deliver impactful presentations

---

### Lesson 1 key concepts

> :clock10: 20 mins

- Revisiting the model
- List the steps followed so far
- Spot the areas in the model where changes could be made
- Apply the changes
- Fit the model and check the accuracy
- Compare the changes in the different models


# ~~Lab | Customer Analysis Round 7~~

For this lab, we keep using the `marketing_customer_analysis.csv` file that you can find in the `files_for_lab_round7` folder.

Remember the previous rounds. Follow the steps as shown in previous lectures and try to improve the accuracy of the model. Include both categorical columns in the exercise.
Some approaches you can try in this exercise:

- check for multicollinearity and remove insignificant variables
- use a different method of scaling the numerical variables
- use a different train/test split ratio
- apply a transformation to the numerical columns that brings their distribution closer to normal
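
As an illustration of the last approach, here is a minimal sketch using scikit-learn's `PowerTransformer`. The column is synthetic log-normal data (an assumption standing in for a skewed feature such as income), and the Yeo-Johnson method is only one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)

# Synthetic right-skewed column (assumption: stands in for a skewed numerical feature).
skewed = pd.DataFrame({"value": rng.lognormal(mean=0.0, sigma=1.0, size=1000)})

# Yeo-Johnson also accepts zeros and negative values; Box-Cox needs strictly positive data.
pt = PowerTransformer(method="yeo-johnson")
transformed = pd.Series(pt.fit_transform(skewed[["value"]]).ravel(), name="value_transformed")

skew_before = skewed["value"].skew()
skew_after = transformed.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness closer to zero suggests a more symmetric distribution. Whether it actually improves the regression still has to be checked on the test set.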

### Get the data

We are using the `marketing_customer_analysis.csv` file.

### Dealing with the data

Already done in rounds 2 to 7.

**Bonus**: Build a function, from round 2 and round 7, to clean and process the data.
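
A minimal sketch of what such a function could look like. The name `clean_data`, its `drop_cols` parameter, and the toy DataFrame are hypothetical; the real function should collect whatever steps you applied in the earlier rounds.

```python
import numpy as np
import pandas as pd


def clean_data(df: pd.DataFrame, drop_cols=None) -> pd.DataFrame:
    """Hypothetical helper: standardize headers, drop columns, drop rows with NaN."""
    df = df.copy()                                                  # leave the input untouched
    df.columns = [c.lower().replace(" ", "_") for c in df.columns]  # snake_case headers
    if drop_cols:
        df = df.drop(columns=[c for c in drop_cols if c in df.columns])
    return df.dropna().reset_index(drop=True)


# Toy data only, not the lab file.
raw = pd.DataFrame({
    "Customer": ["A1", "B2", "C3"],
    "Total Claim Amount": [100.0, np.nan, 250.0],
    "Sales Channel": ["Agent", "Web", "Branch"],
})
clean = clean_data(raw, drop_cols=["customer"])
print(clean)
```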

### Explore the data

Done in round 3.

### Modeling

Description:

- Try to improve the linear regression model.

### Lesson 2 key concepts

> :clock10: 20 min

- Quick recap of version control systems: why Git and GitHub?
- Managing a repository on GitHub (quick recap - already discussed on day 1):
  - Create a repo
  - Make changes and commit to a repo
  - Forking and cloning a repo
  - Fork vs. clone
  - Create a pull request


### Lesson 3 key concepts

> :clock10: 20 min

- Working with branches
- Resolving merge conflicts
- Adding large files on GitHub
- Initializing directories on your computer as Git repos



```shell
$ git branch                          # lists the local branches in the repo
$ git branch -a                       # lists all branches, including remote-tracking ones
$ git checkout -b <NameOfNewBranch>   # creates a new branch and switches to it
$ git checkout <BranchName>           # switches to the branch we want to work on
$ git pull                            # pulls the latest changes into the current branch (git pull = git fetch + git merge)
$ git fetch                           # downloads new commits and branches from the remote without merging them
```

#### Merging Branches

Case 1: Merge the changes from a branch into master

```shell
$ git checkout master
$ git merge <Name of Branch to be merged to Master>
```

Case 2: Merge the changes from master into the branch

```shell
$ git checkout <branch name>
$ git merge master
```

### :pencil2: Practice on key concepts - Lab

> :clock10: 30 min

# Lab | Customer Analysis Final Round

For this lab, we keep using the `marketing_customer_analysis.csv` file that you can find in the `files_for_lab_final` folder.

It's time to put it all together. Remember the previous rounds and follow the steps as shown in previous lectures.

### 01 - Problem (case study)

- Data Description.
- Goal.

### 02 - Getting Data

- Read the `.csv` file.

### 03 - Cleaning/Wrangling/EDA

- Change header names.
- Deal with NaN values.
- Categorical Features.
- Numerical Features.
- Exploration.

### 04 - Processing Data

- Dealing with outliers.
- Normalization.
- Encoding Categorical Data.
- Splitting into train set and test set.

### 05 - Modeling

- Apply model.

### 06 - Model Validation

- R2.
- MSE.
- RMSE.
- MAE.

### 07 - Reporting

- Present results.

# Solution to Lab: Customer Analysis Final Round

<font color='magenta'>
Please comment before each cell of code using a markdown cell. State clearly, in your own words, what the code in the cell below does, or add other insightful comments on that operation. Use the HTML tags in this cell to add your comments in a striking color for easy review.
</font>


### 01 - Problem (case study)

Data Description.

- **customer:** Customer ID
- **state:** US State
- **customer_lifetime_value:** CLV is the economic value of a client to the company over their whole relationship
- **response:** Response to marketing calls (customer engagement)
- **coverage:** Customer coverage type
- **education:** Customer education level
- **effective_to_date:** Effective to date
- **employmentstatus:** Customer employment status
- **gender:** Customer gender
- **income:** Customer income
- **location_code:** Customer living zone
- **marital_status:** Customer marital status
- **monthly_premium_auto:** Monthly premium
- **months_since_last_claim:** Last customer claim
- **months_since_policy_inception:** Policy Inception
- **number_of_open_complaints:** Open claims
- **number_of_policies:** Number policies
- **policy_type:** Policy type
- **policy:** Policy
- **renew_offer_type:** Renew
- **sales_channel:** Sales channel (customer-company first contact)
- **total_claim_amount:** Claims amount
- **vehicle_class:** Vehicle class
- **vehicle_size:** Vehicle size
- **vehicle_type:** Vehicle type

**Goal.**  
Can we predict the amount claimed by a client?

### 02 - Getting Data

- Read `.csv` file

In [None]:
import pandas as pd                                           # panel data, handling dataframes
pd.set_option('display.max_columns', None)

In [None]:
data=pd.read_csv('./files_for_lab_final/csv_files/marketing_customer_analysis.csv')    # import csv file
data.head()                                                    # show first 5 rows

### 03 - Cleaning/Wrangling/EDA

- Change header names.

In [None]:
data.shape       # dataframe dimensions

In [None]:
data.columns     # columns headers

In [None]:
data.columns=[e.lower().replace(' ', '_') for e in data.columns]   # lower and replace
data.columns

- Deal with NaN values.

In [None]:
data.info(memory_usage='deep')   # dataframe info

In [None]:
data.isna().sum()     # missing values

In [None]:
data = data.drop(columns=['vehicle_class', 'customer'])   # drop useless columns (no info or nan)

In [None]:
data=data.dropna()   # drop rows with nan values

In [None]:
for c in data.columns.tolist():         # know the unique values for each column
    print(c, len(data[c].unique()))

In [None]:
data.shape

- Datetime Features.

**Effective To Date**

In [None]:
print(f"Original dtype: {data['effective_to_date'].dtype}\n")   # object
data['effective_to_date']=pd.to_datetime(data['effective_to_date'])   # datetime
print(f"Meantime dtype: {data['effective_to_date'].dtype}")

In [None]:
print('--')
print(f"Min date: {data['effective_to_date'].min()}")         # from January 1st..
print(f"Max date: {data['effective_to_date'].max()}")         # to February 28th
print('--')

In [None]:
data['effective_to_date']=data['effective_to_date'].apply(lambda x: x.toordinal())   # you can change the type to ordinal.

print(f"New dtype: {data['effective_to_date'].dtype}")

In [None]:
# Or alternatively use Unix time
# data['effective_to_date']  = (data['effective_to_date']  - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')

- Categorical Features.

**Values for each class in categorical features**

In [None]:
cat_cols=[col for col in data.columns if (data[col].dtype==object)]     # categorical columns

In [None]:
print('Categorical Features:', len(cat_cols))
print('----------')
for c in cat_cols:
    print(f'Name: {data[c].name}')    # column name
    print(f'Type: {data[c].dtype}')   # column type
    print(f'Unique values: {len(data[c].unique())}')   # column unique values
    print(data[c].unique())
    print(((data[c].value_counts()/ sum(data[c].value_counts()))*100))   # percentage
    print('\n----------')

- Numerical Features.

In [None]:
data.describe()     # stats

In [None]:
num_cols=[c for c in data.columns if (data[c].dtype!='object') and (c!='effective_to_date')]   # numerical columns (headers were lowercased above)

- Exploration.

**Bar plot for each categorical variable.**

In [None]:
import matplotlib.pyplot as plt                 # visualization library
%matplotlib inline

for c in cat_cols:
    plt.figure(figsize=(10,5))
    counts=data[c].value_counts()                 # counts indexed by category, so labels and bar heights stay aligned
    plt.bar(counts.index, counts.values)
    plt.title(c)
    plt.show();

In [None]:
import seaborn as sns                           # visualization library, extends plt
sns.set(style="white")                          # style

**Correlation**

In [None]:
import numpy as np    # numerical python, algebra library


corr=data.select_dtypes(include=np.number).corr()      # compute the correlation matrix of the numeric columns only


mask=np.triu(np.ones_like(corr, dtype=bool))        # generate a mask for the upper triangle (np.bool was removed in NumPy 1.24)

f, ax=plt.subplots(figsize=(11, 9))                 # set up the matplotlib figure

cmap=sns.diverging_palette(220, 10, as_cmap=True)   # generate a custom diverging colormap

sns.heatmap(corr, mask=mask, cmap=cmap,             # draw the heatmap with the mask and correct aspect ratio
            vmax=.3, center=0, square=True,
            linewidths=.5, cbar_kws={"shrink": .5});
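
The heatmap is a visual check for multicollinearity. As a complement, here is a minimal sketch that lists strongly correlated pairs of numeric columns. The helper name `high_corr_pairs`, the 0.8 threshold, and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.8):
    """Return (col_a, col_b, corr) for numeric column pairs with |corr| above threshold."""
    corr = df.select_dtypes(include=np.number).corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):          # upper triangle only, skip the diagonal
            value = float(corr.iloc[i, j])
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], round(value, 3)))
    return pairs


rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + rng.normal(scale=0.01, size=500),  # nearly collinear with "a"
    "b": rng.normal(size=500),                            # independent of "a"
})
pairs = high_corr_pairs(demo, threshold=0.8)
print(pairs)
```

When a pair shows up, dropping one of the two columns is a common first step before refitting the linear model.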

**All variables**
(be careful, this command may be quite memory hungry)

In [None]:
sns.pairplot(data[num_cols]);

**Histogram for each numerical variable.**

In [None]:
for c in num_cols:
    plt.figure(figsize=(10,5))
    plt.hist(data[c])
    plt.title(c)
    plt.show();

**Box plot for each numerical variable, to spot the outliers of each feature.**

In [None]:
for c in num_cols:
    plt.figure(figsize=(10,5))
    plt.boxplot(data[c])
    plt.title(c)
    plt.show();

**Show a plot of the total number of responses.**

In [None]:
sns.countplot(x='response', data=data)   # recent seaborn versions require x= as a keyword
plt.ylabel('Total number of Response')
plt.show();

**Show a plot of the response rate by sales channel.**

In [None]:
plt.figure(figsize=(8,4))
sns.countplot(x='response', hue='sales_channel', data=data)   # recent seaborn versions require x= as a keyword
plt.ylabel('Response by Sales Channel')
plt.show();

**Show a plot of the response rate by total claim amount.**

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(y='total_claim_amount' , x='response', data=data)
plt.ylabel('Response by Total Claim Amount')
plt.show();

**Show a plot of the response rate by income.**

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(y='income' , x='response', data=data)
plt.ylabel('Response by Income')
plt.show();

### 04 - Processing Data

- Dealing with outliers

In [None]:
# e.g. 3*IQR in a column

q1=np.percentile(data['customer_lifetime_value'], 25)   # percentile 25
q3=np.percentile(data['customer_lifetime_value'], 75)   # percentile 75

iqr=q3-q1  # IQR

upper=q3+3*iqr   # upper boundary
lower=q1-3*iqr   # lower boundary

In [None]:
len(data[data['customer_lifetime_value']<lower])

In [None]:
len(data[data['customer_lifetime_value']>upper])
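
The cells above only count the rows outside the bounds. If you decide to remove them, a minimal sketch could look like the following. The helper `iqr_bounds` and the toy column are hypothetical, and `k=3.0` mirrors the 3*IQR rule used above.

```python
import numpy as np
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 3.0):
    """Return (lower, upper) boundaries at k times the IQR around the quartiles."""
    q1, q3 = np.percentile(series, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy data with one obvious outlier.
demo = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 10, 500]})
lower, upper = iqr_bounds(demo["value"], k=3.0)

# Keep only the rows inside the boundaries (inclusive on both ends).
filtered = demo[demo["value"].between(lower, upper)]
print(len(demo), "->", len(filtered))
```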

- Normalization

**Min-Max Scaler**

In [None]:
from sklearn.preprocessing import MinMaxScaler

data['effective_to_date']=MinMaxScaler().fit_transform(data['effective_to_date'].values.reshape(-1, 1))

data['effective_to_date'].head()

**Standardize**

In [None]:
from sklearn.preprocessing import StandardScaler

num_cols

In [None]:
for c in num_cols[:-1]:   # standardize every numerical column except the target (the last one in num_cols)
    data[c]=StandardScaler().fit_transform(data[c].values.reshape(-1, 1))

In [None]:
data.head()

- **Encoding Categorical Data**

In [None]:
one_hot_data=pd.get_dummies(data[cat_cols], drop_first=True)   # one hot encoding categorical variables

one_hot_data.head()

**Concat numerical and categorical DataFrames**

In [None]:
data=pd.concat([data, one_hot_data], axis=1)   # concat dataframes
data.drop(columns=cat_cols, inplace=True)
data.head()

- Splitting into train set and test set

In [None]:
# first, split X-y (learning-target data)
X=data.drop(columns=['total_claim_amount'])
y=data['total_claim_amount']

# checking shape
print(X.shape)
print(y.shape)

In [None]:
# train_test_split
from sklearn.model_selection import train_test_split as tts

In [None]:
# train-test-split (4 sets)

X_train, X_test, y_train, y_test = tts(X, y, test_size=0.2, random_state=42)  # random state fixed sample
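
Note that the scaling cells above fit the scalers on the full dataset before the split, so statistics from the test rows leak into training. One alternative, sketched here on synthetic data, is to split first and let a scikit-learn `Pipeline` fit the scaler on the training rows only. The feature names and coefficients below are made up for the demo.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic features and target (assumption: not the lab data).
X_demo = pd.DataFrame({"f1": rng.normal(50, 10, 400), "f2": rng.normal(0, 1, 400)})
y_demo = 2.0 * X_demo["f1"] - 5.0 * X_demo["f2"] + rng.normal(0, 1, 400)

# Split first, so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# fit() learns the scaling parameters from X_tr only; score() reuses them on X_te.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)
print(f"test R2: {test_r2:.3f}")
```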

### 05 - Modeling

The data is now prepared for the modeling phase.
Reference: [ML cheat sheet](https://s3.amazonaws.com/assets.datacamp.com/email/other/ML+Cheat+Sheet_2.pdf)


Linear regression is a linear model, so it works well when the relationship between the features and the target is roughly linear. When that relationship is non-linear, a linear model cannot capture it.

In that case you can use tree-based methods, which capture non-linearity better by repeatedly splitting the feature space into smaller sub-spaces, one question (split) at a time.
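
To make the difference concrete, here is a small sketch on synthetic data with a quadratic relationship (an assumption chosen for the demo). A single `DecisionTreeRegressor` stands in for the tree-based family.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# y depends on x squared, so no straight line fits it well.
X_demo = rng.uniform(-3, 3, size=(500, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(0, 0.2, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

linear = LinearRegression().fit(X_tr, y_tr)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_tr, y_tr)

linear_r2 = linear.score(X_te, y_te)   # R2 on held-out data
tree_r2 = tree.score(X_te, y_te)
print(f"linear R2: {linear_r2:.3f} -- tree R2: {tree_r2:.3f}")
```

On real data the gap is usually smaller, and deep trees can overfit, which is why the train and test scores are compared in the validation step.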

**Linear Regression**

In [None]:
from sklearn.linear_model import LinearRegression as LinReg
linreg=LinReg()    # model
linreg.fit(X_train, y_train)   # model train
y_pred_linreg=linreg.predict(X_test)   # model prediction

**Regularization**

In [None]:
from sklearn.linear_model import Lasso       # L1
from sklearn.linear_model import Ridge       # L2
from sklearn.linear_model import ElasticNet  # L1+L2

In [None]:
# Lasso L1

lasso=Lasso()
lasso.fit(X_train, y_train)

y_pred_lasso = lasso.predict(X_test)

In [None]:
# Ridge L2

ridge=Ridge()
ridge.fit(X_train, y_train)

y_pred_ridge = ridge.predict(X_test)

In [None]:
# ElasticNet L1+L2

elastic=ElasticNet()
elastic.fit(X_train, y_train)

y_pred_elastic = elastic.predict(X_test)
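
The three models above use the default penalty strength (`alpha=1.0` in scikit-learn). The sketch below, on synthetic data where only two of ten features matter, illustrates how the L1 penalty can push irrelevant coefficients to exactly zero. The value `alpha=0.5` is arbitrary; in practice it is a hyperparameter to tune.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(1)

# Ten features, but only the first two drive the target (assumption for the demo).
X_demo = rng.normal(size=(500, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 0.5, size=500)

ols = LinearRegression().fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Count coefficients that are exactly zero in each model.
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
print(f"zero coefficients -- OLS: {n_zero_ols}, Lasso: {n_zero_lasso}")
print("Lasso coefficients:", np.round(lasso_demo.coef_, 2))
```

Ridge shrinks coefficients without setting them to zero, and ElasticNet mixes both penalties.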

**Random Forest Regressor**

In [None]:
from sklearn.ensemble import RandomForestRegressor as RFR

rfr=RFR()
rfr.fit(X_train, y_train)

y_pred_rfr = rfr.predict(X_test)

**XGBoost**

In [None]:
# conda install -c conda-forge xgboost

In [None]:
from xgboost import XGBRegressor as XGBR

xgbr=XGBR()
xgbr.fit(X_train, y_train)

y_pred_xgbr = xgbr.predict(X_test)

**LightGBM**

In [None]:
# conda install -c conda-forge lightgbm

In [None]:
from lightgbm import LGBMRegressor as LGBMR

lgbmr=LGBMR()
lgbmr.fit(X_train, y_train)

y_pred_lgbmr = lgbmr.predict(X_test)

### 06 - Model Validation

In [None]:
models=[linreg, lasso, ridge, elastic, rfr, xgbr, lgbmr]
model_names=['linreg', 'lasso', 'ridge', 'elastic', 'rfr', 'xgbr', 'lgbmr']
preds=[y_pred_linreg, y_pred_lasso, y_pred_ridge, y_pred_elastic, y_pred_rfr, y_pred_xgbr, y_pred_lgbmr]

- R2.

In [None]:
for i in range(len(models)):

    train_score=models[i].score(X_train, y_train) #R2
    test_score=models[i].score(X_test, y_test)

    print ('Model: {}, train R2: {} -- test R2: {}'.format(model_names[i], train_score, test_score))

- MSE.

In [None]:
from sklearn.metrics import mean_squared_error as mse

for i in range(len(models)):

    train_mse=mse(models[i].predict(X_train), y_train) #MSE
    test_mse=mse(preds[i], y_test)

    print ('Model: {}, train MSE: {} -- test MSE: {}'.format(model_names[i], train_mse, test_mse))

- **RMSE.**

In [None]:
for i in range(len(models)):

    train_rmse=mse(models[i].predict(X_train), y_train)**0.5 #RMSE
    test_rmse=mse(preds[i], y_test)**0.5

    print ('Model: {}, train RMSE: {} -- test RMSE: {}'.format(model_names[i], train_rmse, test_rmse))

- MAE.

In [None]:
from sklearn.metrics import mean_absolute_error as mae
for i in range(len(models)):
    train_mae=mae(models[i].predict(X_train), y_train) #MAE
    test_mae=mae(preds[i], y_test)

    print ('Model: {}, train MAE: {} -- test MAE: {}'.format(model_names[i], train_mae, test_mae))
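
The four loops above can also be collapsed into one helper that returns all metrics at once. This is a minimal sketch: the name `regression_report` is hypothetical and the three-point example is toy data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred) -> dict:
    """Hypothetical helper: R2, MSE, RMSE and MAE for one set of predictions."""
    mse_value = mean_squared_error(y_true, y_pred)
    return {
        "R2": r2_score(y_true, y_pred),
        "MSE": mse_value,
        "RMSE": float(np.sqrt(mse_value)),   # RMSE is the square root of MSE
        "MAE": mean_absolute_error(y_true, y_pred),
    }


y_true_demo = [3.0, 5.0, 7.0]
y_pred_demo = [2.0, 5.0, 9.0]
report = regression_report(y_true_demo, y_pred_demo)
print(pd.Series(report).round(3))
```

Calling it once per model on both the train and the test predictions gives a table that is easier to compare than separate printouts.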

### Can you try to improve the model?

For example, you can remove columns that you feel are not predictive, or transform some columns to bring them closer to a normal distribution. Other ideas:

- Choosing a different way to fill null values
- Working with a categorical variable to reduce the number of categories
- Trying a different data transformation
- Using a different method to remove outliers
- Choosing a different scaling method
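
As one example, reducing the number of categories could be sketched as below. The helper `group_rare_categories`, the 5% threshold, and the toy values are assumptions; the right threshold depends on the column.

```python
import pandas as pd


def group_rare_categories(series: pd.Series, min_share: float = 0.05,
                          other_label: str = "other") -> pd.Series:
    """Hypothetical helper: replace categories rarer than min_share with other_label."""
    shares = series.value_counts(normalize=True)     # fraction of rows per category
    rare = shares[shares < min_share].index
    return series.where(~series.isin(rare), other_label)


# Toy categorical column: "call center" covers only 3% of the rows.
demo = pd.Series(["agent"] * 50 + ["branch"] * 40 + ["web"] * 7 + ["call center"] * 3)
grouped = group_rare_categories(demo, min_share=0.05)
print(grouped.value_counts())
```

Fewer categories mean fewer dummy columns after one-hot encoding, which can make the linear model more stable.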

<font color='magenta'>
Your code goes here:
    
</font>

### 07 - Reporting

- Present results.

**Data Level**

- Drop NaN values because they are, in fact, duplicates.
- Do not drop outliers because they are just a few.

**Problem Level**

- Total claim amount has a high variance.
- We can predict the total claim amount with an error of about 25%, even though R2 is high.
- We need to determine which variables are significant.
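
One rough way to start on the last point is a univariate F-test per feature, sketched here with scikit-learn's `f_regression` on synthetic data. This is only a first screen, an assumption about how you might proceed: it looks at each feature on its own and ignores interactions between features.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(7)

# Synthetic data: "signal" drives the target, "noise" does not (assumption for the demo).
X_demo = pd.DataFrame({"signal": rng.normal(size=300), "noise": rng.normal(size=300)})
y_demo = 4.0 * X_demo["signal"] + rng.normal(0, 1, size=300)

# f_regression returns one F statistic and one p-value per feature.
f_stats, p_values = f_regression(X_demo, y_demo)
summary = (
    pd.DataFrame({"feature": X_demo.columns, "F": f_stats, "p_value": p_values})
    .sort_values("p_value")
    .reset_index(drop=True)
)
print(summary)
```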

### Additional Resources

- [Best practices for PowerPoint presentations](https://alum.mit.edu/best-practices-powerpoint-presentations)
- [More tips on PowerPoint formatting](https://www.workfront.com/blog/10-tips-for-designing-presentations-that-dont-suck-part-1)