#### 1. X-y split.

To perform the X-y split, we first need to decide which columns are the inputs of our model (X) and which column is its output (y).

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler


# Find more information about the dataset
df = pd.read_csv('files_for_lab/csv_files/marketing_customer_analysis.csv')
df.info()  # info() prints its summary directly and returns None
print(df.shape)
print(df.columns)

# Run the transformations from the previous lab

# 1. Standardize column names
df.rename(columns = {'EmploymentStatus': 'Employment Status'}, inplace = True)
df.columns = df.columns.str.lower()

# 2. Remove columns that are highly correlated to each other
df.drop(['policy', 'vehicle size'], axis = 1, inplace = True)

We will assume that `total claim amount` is the output we want to predict. For an insurance company, it is useful to know which customer types are more likely to make claims, so that it can, for example, adjust policy pricing for customers considered "high-risk".

In [None]:
y = pd.DataFrame(df['total claim amount'])
X = df.drop('total claim amount', axis=1)

# Check that the operations ran correctly
print(y.columns)
print(X.columns)
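
As a quick sanity check on the split, the sketch below repeats the same steps on a tiny made-up dataframe. The values and the names `toy_df`, `X_demo` and `y_demo` are illustrative assumptions, not part of the lab data. It uses double brackets as an alternative way to keep the target as a one-column dataframe, then asserts that the target is no longer among the inputs and that both parts have the same number of rows.

```python
import pandas as pd

# Hypothetical toy data; only the target column name matches the lab dataset
toy_df = pd.DataFrame({
    "income": [48000, 0, 62000],
    "state": ["Washington", "Arizona", "Nevada"],
    "total claim amount": [384.8, 1131.5, 566.5],
})
target = "total claim amount"

# Double brackets return a one-column DataFrame instead of a Series
y_demo = toy_df[[target]]
X_demo = toy_df.drop(columns=target)

# The target must not leak into the inputs, and the rows must stay aligned
assert target not in X_demo.columns
assert len(X_demo) == len(y_demo)
print(X_demo.shape, y_demo.shape)
```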

#### 2. Normalize (numerical).

We need to separate the numerical columns in X from the categorical columns so that we can normalize all the numerical data in one step:

In [None]:
X_num = X.select_dtypes(include=np.number)

# Check that we have selected the correct data
X_num.info()  # info() prints its summary directly and returns None
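
The cell above only keeps the numerical part. A minimal sketch of the full separation, on a small made-up dataframe, could look like the following. The data and the names `X_toy`, `X_toy_num` and `X_toy_cat` are assumptions for illustration. Passing `exclude=np.number` to `select_dtypes` returns every column that is not numerical, which here stands in for the categorical columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for X; the values are made up
X_toy = pd.DataFrame({
    "income": [48000, 0, 62000],
    "monthly premium auto": [69, 94, 108],
    "state": ["Washington", "Arizona", "Nevada"],
    "gender": ["F", "F", "M"],
})

# Numerical columns (integers and floats)
X_toy_num = X_toy.select_dtypes(include=np.number)

# Everything that is not numerical, i.e. the categorical columns here
X_toy_cat = X_toy.select_dtypes(exclude=np.number)

print(list(X_toy_num.columns))
print(list(X_toy_cat.columns))
```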

Now we can normalize the data using `MinMaxScaler`:

In [None]:
# Compute the minimum and maximum for each column of the dataframe:
transformer = MinMaxScaler().fit(X_num) 

# Find out what the transformer is:
print(type(transformer))

# Show the per-column maximum learned during fit (mainly to see what information the transformer stores):
print(transformer.data_max_)

# Normalize the data (or transform):
x_minmax = transformer.transform(X_num)
print(type(x_minmax))
print(x_minmax.shape)

# Convert the numpy array back into a dataframe, keeping the original column names and row index
X_num_norm = pd.DataFrame(x_minmax, columns=X_num.columns, index=X_num.index)
print(X_num_norm.head())
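
To make the transformation less of a black box: with its default settings, `MinMaxScaler` rescales each column to the range [0, 1] using `(x - min) / (max - min)`, where the minimum and maximum are taken per column. The sketch below checks this on a small made-up dataframe (the data and the names `toy`, `scaled` and `manual` are illustrative assumptions) by comparing the scaler's output with the formula applied by hand.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data; the values are chosen to make the arithmetic easy to follow
toy = pd.DataFrame({
    "income": [0.0, 25000.0, 100000.0],
    "months": [12.0, 36.0, 24.0],
})

# Scaling with scikit-learn (fit and transform in one call)
scaled = MinMaxScaler().fit_transform(toy)

# The same scaling by hand: (x - min) / (max - min), column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

# Both approaches should agree up to floating-point error
assert np.allclose(scaled, manual.to_numpy())
print(manual)
```

Each column now has a minimum of 0 and a maximum of 1, which is also what we would expect from `X_num_norm.describe()` on the real data.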