<a href="https://colab.research.google.com/github/uday-routhu/week4/blob/master/Model_Pipeline_Core.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Pipeline (Core)

* Author: Udayakumar Routhu

* For this task, you will need to:
    * Load in the dataset.
    * Define X and y, using charges for your target vector (y).
    * Split your data into train and test sets. Please use the random number 42 for consistency!
    * Create a column transformer that will:
      
      * Impute missing values (if needed)
      * One-hot encode any nominal features
      * Scale any numeric features
      * (This dataset does not have ordinal features)
    * Instantiate a linear regression model.
    * Create a model pipeline with your preprocessor first and linear regression model last.
    * Fit the modeling pipeline on the training data.
    * Evaluate the model performance on both the training set and the test set using the R-squared score, MAE, MSE, and RMSE.


In [46]:
# Import packages
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn import set_config
set_config(transform_output='pandas')

In [47]:
# Mount google drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Prepare Dataset

In [48]:
# Load the data set
fpath = "/content/drive/MyDrive/CodingDojo/02-MachineLearning/Week06/Data/insurance.csv"
df = pd.read_csv(fpath)
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


2. Check for duplicate rows and address them, if needed.

In [49]:
#Explore Data
df.duplicated().sum()

1

In [50]:
# drop duplicated record
df = df.drop_duplicates()

In [51]:
# Confirm that the duplicate row was dropped
df.duplicated().sum()

0

* There are now no duplicated rows.

In [52]:
df.isna().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

* There are no null values, so no placeholder imputation is needed at this stage.

3. Check for inconsistent categories and fix them, if needed.

In [53]:
data_types = df.dtypes
str_cols = data_types[data_types=='object'].index
for col in str_cols:
    print(f'- {col}:')
    print(df[col].value_counts(dropna=False))
    print("\n\n")
    print(df[col])

- sex:
male      675
female    662
Name: sex, dtype: int64



0       female
1         male
2         male
3         male
4         male
         ...  
1333      male
1334    female
1335    female
1336    female
1337    female
Name: sex, Length: 1337, dtype: object
- smoker:
no     1063
yes     274
Name: smoker, dtype: int64



0       yes
1        no
2        no
3        no
4        no
       ... 
1333     no
1334     no
1335     no
1336     no
1337    yes
Name: smoker, Length: 1337, dtype: object
- region:
southeast    364
southwest    325
northwest    324
northeast    324
Name: region, dtype: int64



0       southwest
1       southeast
2       southeast
3       northwest
4       northwest
          ...    
1333    northwest
1334    northeast
1335    southeast
1336    southwest
1337    northwest
Name: region, Length: 1337, dtype: object


* There are no inconsistent categories
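
Had the check above turned up inconsistent labels, one common fix is to normalize whitespace and case before counting. The sketch below uses a hypothetical `demo` Series with made-up values (it is not taken from `insurance.csv`) purely to illustrate the idea.

```python
import pandas as pd

# Hypothetical column with inconsistent labels (illustration only,
# these values do not appear in insurance.csv)
demo = pd.Series(["yes", "Yes", " no", "NO", "no"], name="smoker")

# Strip stray whitespace and lowercase so equivalent labels collapse together
cleaned = demo.str.strip().str.lower()

# After cleaning, only the two intended categories remain
print(cleaned.value_counts())
```

For labels that differ by more than case or spacing, a `replace` with an explicit mapping dictionary is usually the safer choice.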

4. Check for impossible numeric values and fix them, if needed.

In [54]:
# Check data types and missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1337 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1337 non-null   int64  
 1   sex       1337 non-null   object 
 2   bmi       1337 non-null   float64
 3   children  1337 non-null   int64  
 4   smoker    1337 non-null   object 
 5   region    1337 non-null   object 
 6   charges   1337 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 83.6+ KB


In [55]:
# Obtain summary statistics
df.describe()

Unnamed: 0,age,bmi,children,charges
count,1337.0,1337.0,1337.0,1337.0
mean,39.222139,30.663452,1.095737,13279.121487
std,14.044333,6.100468,1.205571,12110.359656
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29,0.0,4746.344
50%,39.0,30.4,1.0,9386.1613
75%,51.0,34.7,2.0,16657.71745
max,64.0,53.13,5.0,63770.42801


In [56]:
stats =  df.describe()
stats.loc[['mean','min','max']]

Unnamed: 0,age,bmi,children,charges
mean,39.222139,30.663452,1.095737,13279.121487
min,18.0,15.96,0.0,1121.8739
max,64.0,53.13,5.0,63770.42801


In [57]:
# Summary for object columns
df.describe(include='object')

Unnamed: 0,sex,smoker,region
count,1337,1337,1337
unique,2,2,4
top,male,no,southeast
freq,675,1063,364


None of the categorical columns has high cardinality (each has only 2 to 4 unique values).
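
Cardinality can also be read off directly with `nunique()`. The sketch below rebuilds only the categorical values from the five rows shown in `df.head()` above, so its counts describe that small sample and not the full dataset; the name `demo` is an assumption for illustration.

```python
import pandas as pd

# Categorical values copied from the first five rows displayed by df.head()
demo = pd.DataFrame({
    "sex": ["female", "male", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
})

# Number of distinct values per column; small numbers mean low cardinality
cardinality = demo.nunique()
print(cardinality)
```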

### Define X and y, using charges for your target vector (y).

In [58]:
# Define features and target
X = df.drop(columns='charges')
y = df['charges']
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
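
Because no `test_size` is passed, `train_test_split` falls back to its default of a 25% test set. The sketch below uses stand-in arrays (`X_demo`, `y_demo`, assumed names) with the same 1,337 rows as the deduplicated dataframe to show the resulting split sizes, which agree with the 1,002 training rows reported by `X_train.info()` further down.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the deduplicated dataframe
X_demo = np.arange(1337).reshape(-1, 1)
y_demo = np.arange(1337)

# Same call pattern as the notebook: default test_size, fixed random_state
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(len(X_tr), len(X_te))
```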

### Create a column transformer

In [59]:
# Using select_dtypes for categorical data
list(X_train.select_dtypes('object').columns)

['sex', 'smoker', 'region']

In [60]:
# Creating a "column selector" from sklearn
cat_selector = make_column_selector(dtype_include='object')
# Works just like select_dtypes from pandas
cat_selector(X_train)

['sex', 'smoker', 'region']

In [61]:
# Create the preprocessing pipeline for categorical data
# (New) Select columns with make_column_selector
cat_selector = make_column_selector(dtype_include='object')
# Instantiate transformers
freq_imputer = SimpleImputer(strategy='most_frequent')
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
# Instantiate the pipeline
cat_pipe = make_pipeline(freq_imputer, ohe)
# Make a tuple for column transformer
cat_tuple = ('categorical',cat_pipe, cat_selector)
cat_tuple

('categorical',
 Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')),
                 ('onehotencoder',
                  OneHotEncoder(handle_unknown='ignore', sparse_output=False))]),
 <sklearn.compose._column_transformer.make_column_selector at 0x790cec51b1c0>)

In [62]:
# Create the preprocessing pipeline for numeric data
# (New) Select columns with make_column_selector
num_selector = make_column_selector(dtype_include='number')
# Instantiate the transformers
scaler = StandardScaler()
mean_imputer = SimpleImputer(strategy='mean')
# Instantiate the pipeline
num_pipe = make_pipeline(mean_imputer, scaler)
# Make the tuple for ColumnTransformer
num_tuple = ('numeric',num_pipe, num_selector)
num_tuple

('numeric',
 Pipeline(steps=[('simpleimputer', SimpleImputer()),
                 ('standardscaler', StandardScaler())]),
 <sklearn.compose._column_transformer.make_column_selector at 0x790cec519090>)

In [63]:
# Create the preprocessing ColumnTransformer
preprocessor = ColumnTransformer([cat_tuple, num_tuple],
                                 verbose_feature_names_out=False)
preprocessor

### Instantiate a linear regression model

In [64]:
# Instantiate a linear regression model
linreg = LinearRegression()
# Combine the preprocessing ColumnTransformer and the linear regression model in a Pipeline
linreg_pipe = make_pipeline(preprocessor, linreg)
linreg_pipe

In [65]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1002 entries, 763 to 1127
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1002 non-null   int64  
 1   sex       1002 non-null   object 
 2   bmi       1002 non-null   float64
 3   children  1002 non-null   int64  
 4   smoker    1002 non-null   object 
 5   region    1002 non-null   object 
dtypes: float64(1), int64(2), object(3)
memory usage: 54.8+ KB


In [75]:
# Fit the model pipeline on the training data
linreg_pipe.fit(X_train, y_train)

### Evaluate the model

In [77]:
def regression_metrics(y_true, y_pred, label='', verbose = True, output_dict=False):
  # Get metrics
  mae = mean_absolute_error(y_true, y_pred)
  mse = mean_squared_error(y_true, y_pred)
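  # RMSE is the square root of MSE; computing it directly avoids the
  # `squared` argument, which newer scikit-learn releases no longer accept
  rmse = mse ** 0.5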
  r_squared = r2_score(y_true, y_pred)
  if verbose:
    # Print Result with Label and Header
    header = "-"*60
    print(header, f"Regression Metrics: {label}", header, sep='\n')
    print(f"- MAE = {mae:,.3f}")
    print(f"- MSE = {mse:,.3f}")
    print(f"- RMSE = {rmse:,.3f}")
    print(f"- R^2 = {r_squared:,.3f}")
  if output_dict:
    metrics = {'Label':label, 'MAE':mae,
               'MSE':mse, 'RMSE':rmse, 'R^2':r_squared}
    return metrics

def evaluate_regression(reg, X_train, y_train, X_test, y_test, verbose = True,
                        output_frame=False):
  # Get predictions for training data
  y_train_pred = reg.predict(X_train)

  # Call the helper function to obtain regression metrics for training data
  results_train = regression_metrics(y_train, y_train_pred, verbose = verbose,
                                     output_dict=output_frame,
                                     label='Training Data')
  print()
  # Get predictions for test data
  y_test_pred = reg.predict(X_test)
  # Call the helper function to obtain regression metrics for test data
  results_test = regression_metrics(y_test, y_test_pred, verbose = verbose,
                                    output_dict=output_frame,
                                    label='Test Data')

  # Store results in a dataframe if output_frame is True
  if output_frame:
    results_df = pd.DataFrame([results_train,results_test])
    # Set the label as the index
    results_df = results_df.set_index('Label')
    # Set index.name to none to get a cleaner looking result
    results_df.index.name=None
    # Return the dataframe
    return results_df.round(3)

In [78]:
evaluate_regression(linreg_pipe, X_train, y_train, X_test, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 4,212.898
- MSE = 37,186,464.220
- RMSE = 6,098.071
- R^2 = 0.730

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 4,074.084
- MSE = 35,365,682.683
- RMSE = 5,946.905
- R^2 = 0.795
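
As a quick sanity check on the printed metrics, RMSE should equal the square root of MSE. The sketch below recomputes it from the rounded values reported above; the dictionary name `reported` is an assumption introduced for illustration.

```python
import math

# Rounded values copied from the printed output above
reported = {
    "Training Data": {"MSE": 37_186_464.220, "RMSE": 6_098.071},
    "Test Data": {"MSE": 35_365_682.683, "RMSE": 5_946.905},
}

for label, m in reported.items():
    # Recompute RMSE from MSE and compare with the reported figure
    recomputed = math.sqrt(m["MSE"])
    print(f"{label}: sqrt(MSE) = {recomputed:,.3f} vs reported RMSE = {m['RMSE']:,.3f}")
```

Because RMSE is in the same units as `charges`, it is usually easier to interpret than MSE alongside MAE.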
