## LinearRegression from SAS® Viya® on Customer Subscription Length

### About the [Churn Dataset](https://archive.ics.uci.edu/dataset/563/iranian+churn+dataset) 

This dataset was randomly collected from an Iranian telecom company's database over a 12-month period. It comprises 3150 rows, each representing a customer, with information across 13 columns. The dataset includes attributes such as call failures, SMS frequency, number of complaints, distinct calls, subscription length, age group, charge amount, service type, usage duration, status, usage frequency, and Customer Value.

All attributes, except for the churn attribute, consist of aggregated data from the first 9 months. The churn labels indicate the customers' status at the end of the 12-month period. The three-month gap is designated for planning purposes.

This constitutes a regression task aimed at predicting the subscription length.

### Cross-Validation with Linear Regression
This notebook illustrates how to perform cross-validation (CV), using linear regression as the example model. We primarily use scikit-learn (`sklearn`) for cross-validation.

The notebook is divided into the following sections:

1. Building a linear regression model without cross-validation
2. Hyperparameter tuning using CV

In [None]:
# import all libraries
import os
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns

import sklearn.metrics  # makes sklearn.metrics.r2_score available below
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

from sasviya.ml.linear_model import LinearRegression

import warnings  # suppress warnings
warnings.filterwarnings('ignore')

In [None]:
workspace=f'{os.path.abspath("")}/../data/'
subscrlength_df=pd.read_csv(workspace+'churn.csv')
subscrlength_df.head()

In [None]:
# number of observations 
len(subscrlength_df.index)

### Data exploration

#### View the distribution of the data

In [None]:
numeric_X_df = subscrlength_df.select_dtypes(exclude=['object'])
numeric_X_df.describe().T

In [None]:
fig, axs = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
axs = axs.flatten()
for k,v in numeric_X_df.items():
    sns.boxplot(y=k, data=numeric_X_df, ax=axs[index])
    index += 1
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=5.0)

The plots indicate that some variables have outliers. 
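
Box plots conventionally mark points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles as outliers. As a minimal sketch, the same rule can be used to count such points per column. The `count_iqr_outliers` helper and the toy DataFrame below are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def count_iqr_outliers(df: pd.DataFrame, whis: float = 1.5) -> pd.Series:
    """Count values per numeric column outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    # Compare each column against its own lower/upper bound
    outside = numeric.lt(lower, axis=1) | numeric.gt(upper, axis=1)
    return outside.sum()

# Toy data: "calls" contains one extreme value, "sms" contains none
toy_df = pd.DataFrame({
    "calls": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "sms": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
})
outlier_counts = count_iqr_outliers(toy_df)
print(outlier_counts)
```

In this notebook the helper could be applied as `count_iqr_outliers(numeric_X_df)` to see which variables the box plots flag.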

#### View the correlation between the variables in a correlation matrix

In [None]:
X_df = subscrlength_df.drop(['Subscription Length','Churn'], axis=1)

plt.figure(figsize=(10, 5))
sns.heatmap(X_df.corr().abs(),  annot=True, cmap="coolwarm", annot_kws={"size": 10})
plt.show()

Age and Age Group are correlated as expected, while Customer Value and Frequency of SMS are highly correlated.
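
Reading pairs off a heatmap gets harder as the number of columns grows. As a sketch, the strongly correlated pairs can also be listed programmatically. The `top_correlated_pairs` helper, the 0.8 threshold, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return column pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy_df = pd.DataFrame({
    "a": a,
    "a_noisy": a + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "b": rng.normal(size=200),                       # independent column
})
pairs = top_correlated_pairs(toy_df)
print(pairs)
```

Applied to `X_df`, this would return the same pairs that stand out in the heatmap, sorted by strength.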

### Building a Model Without Cross-Validation
Let's build a multiple linear regression (MLR) model. We start with a plain model, without any cross-validation.

For details about using the `LinearRegression` class, see the [LinearRegression documentation](https://documentation.sas.com/?cdcId=workbenchcdc&cdcVersion=default&docsetId=explore&docsetTarget=p0kx8n36nycmj0n1h1o8d3tqfxc3.htm).

#### Splitting into Train and Test

In [None]:
# train-test 70-30 split
subscrlength_df.rename(columns={'Subscription Length': 'SubscriptionLength'}, inplace=True)
X_df = subscrlength_df.drop(['SubscriptionLength', 'Churn'], axis=1)
y = subscrlength_df['SubscriptionLength']

X_train, X_test, y_train, y_test = train_test_split(X_df, y, test_size=0.3, random_state=3)

#### Using RFE
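After dropping the target and the `Churn` column, the remaining columns are the predictor features (the next cell prints how many). To build the model using Recursive Feature Elimination (RFE) for feature selection, we need to specify the number of features we want in the final model. RFE then repeatedly fits the model and eliminates the least important feature until that number remains.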

It is important to note that the number of features to be included in the model is a **hyperparameter**.
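
To make the elimination step concrete, here is a simplified sketch of the idea behind RFE for a linear model: fit, drop the feature whose coefficient has the smallest magnitude, and repeat until the requested number of features remains. It uses NumPy least squares on synthetic data as a stand-in. It is not the scikit-learn or SAS Viya implementation, and `simple_rfe` is a hypothetical helper.

```python
import numpy as np

def simple_rfe(X: np.ndarray, y: np.ndarray, n_features_to_select: int) -> list:
    """Return the column indices kept after backward elimination."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Fit least squares with an intercept on the remaining columns
        A = np.column_stack([np.ones(len(X)), X[:, remaining]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        # Skip the intercept, then drop the feature with the smallest |coefficient|
        weakest = int(np.argmin(np.abs(coef[1:])))
        remaining.pop(weakest)
    return remaining

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
# Only columns 0 and 3 influence the synthetic target
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.1, size=300)
selected = simple_rfe(X, y, n_features_to_select=2)
print(selected)
```

Comparing coefficient magnitudes is only meaningful when the features are on comparable scales, which holds for the synthetic data here but may require scaling on real data.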

In [None]:
# maximum number of features available
len(X_train.columns)

In [None]:
# first model with an arbitrary choice of n_features
# running RFE with number of features=8

lm = LinearRegression()
lm.fit(X_train, y_train)

rfe = RFE(lm, n_features_to_select=8)             
rfe = rfe.fit(X_train, y_train)

In [None]:
# tuples of (feature name, whether selected, ranking)
# note that the 'rank' is > 1 for non-selected features

list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [None]:
# predict subscription length for X_test
y_pred = rfe.predict(X_test)

# evaluate the model on test set
r2 = sklearn.metrics.r2_score(y_test, y_pred)
print('r2:', '{:.4f}'.format(r2))
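
For reference, the R² score printed above is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the observed values. A small sketch with made-up numbers follows; the arrays are illustrative and do not come from the churn data.

```python
import numpy as np

def r2_manual(y_true, y_pred) -> float:
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_pred_toy = [2.5, 5.5, 7.0, 9.5]
r2_toy = r2_manual(y_true_toy, y_pred_toy)
print(f"r2: {r2_toy:.4f}")  # prints "r2: 0.9625"
```

A model that always predicts the mean of `y_true` scores 0, and a perfect model scores 1.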

In [None]:
# try another value of n_features_to_select
lm = LinearRegression()
lm.fit(X_train, y_train)

rfe = RFE(lm, n_features_to_select=11)             
rfe = rfe.fit(X_train, y_train)

# predict subscription length for X_test
y_pred = rfe.predict(X_test)
r2 = sklearn.metrics.r2_score(y_test, y_pred)
print('r2:', '{:.4f}'.format(r2))

### Hyperparameter Tuning Using Grid Search Cross-Validation
A common use of cross-validation is tuning a model's hyperparameters. The most common technique is grid search cross-validation, which scores every candidate value in a predefined grid by cross-validation and keeps the best-scoring one.
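
As a rough sketch of the mechanics: for each candidate hyperparameter value, the training data is split into k folds, the model is fit on k-1 folds and scored on the held-out fold, and the mean held-out score is compared across candidates. The example below uses NumPy least squares on synthetic data, with the number of leading columns used as a hypothetical hyperparameter. All helper names are assumptions for illustration.

```python
import numpy as np

def fit_predict(X_tr, y_tr, X_te):
    """Ordinary least squares with an intercept."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
    A_te = np.column_stack([np.ones(len(X_te)), X_te])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return A_te @ coef

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def cv_score(X, y, n_splits=5, seed=100):
    """Mean held-out R^2 over shuffled k folds."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, n_splits)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th one
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        y_hat = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(r2(y[test_idx], y_hat))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Only the first two columns carry signal
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Grid: use only the first k columns, k = 1..4
grid_scores = {k: cv_score(X[:, :k], y) for k in range(1, 5)}
best_k = max(grid_scores, key=grid_scores.get)
print(grid_scores, best_k)
```

`GridSearchCV` automates this loop and additionally records per-fold scores and timings in `cv_results_`.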

In [None]:
# number of features in X_train
len(X_train.columns)

In [None]:
# step-1: create a cross-validation scheme
folds = KFold(n_splits = 5, shuffle = True, random_state = 100)

# step-2: specify range of hyperparameters to tune
hyper_params = [{'n_features_to_select': list(range(1, 13))}]


# step-3: perform grid search
# 3.1 specify model
lm = LinearRegression()
lm.fit(X_train, y_train)
rfe = RFE(lm)             

# 3.2 call GridSearchCV()
model_cv = GridSearchCV(estimator = rfe, 
                        param_grid = hyper_params, 
                        scoring= 'r2', 
                        cv = folds, 
                        verbose = 1,
                        return_train_score=True)      

# fit the model
model_cv.fit(X_train, y_train)    

In [None]:
# cv results
cv_results = pd.DataFrame(model_cv.cv_results_)
cv_results

In [None]:
# plotting cv results
plt.figure(figsize=(16,6))

plt.plot(cv_results["param_n_features_to_select"], cv_results["mean_test_score"])
plt.plot(cv_results["param_n_features_to_select"], cv_results["mean_train_score"])
plt.xlabel('number of features')
plt.ylabel('r-squared')
plt.title("Optimal Number of Features")
plt.legend(['test score', 'train score'], loc='upper left')

Now we can choose the optimal number of features from the plot and build a final model.

In [None]:
# final model
n_features_optimal = 10

lm = LinearRegression()
lm.fit(X_train, y_train)

rfe = RFE(lm, n_features_to_select=n_features_optimal)             
rfe = rfe.fit(X_train, y_train)

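# predict subscription length for X_test with the RFE-selected model
# (lm was fit on all features, so rfe must be used here)
y_pred = rfe.predict(X_test)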
r2 = sklearn.metrics.r2_score(y_test, y_pred)
print('r2:', '{:.4f}'.format(r2))
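
Instead of hard-coding the number of features, the selected value can be read from the fitted search object: `GridSearchCV` exposes `best_params_`, and with the default `refit=True` it also refits `best_estimator_` on the full training data. The sketch below is self-contained, so it uses scikit-learn's own `LinearRegression` and synthetic data from `make_regression` as stand-ins for the SAS Viya estimator and the churn data. Both substitutions are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression  # stand-in for the sasviya class
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic regression data (assumption: replaces churn.csv for this sketch)
X, y = make_regression(n_samples=300, n_features=8, n_informative=4,
                       noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)

search = GridSearchCV(
    estimator=RFE(LinearRegression()),
    param_grid={"n_features_to_select": list(range(1, 9))},
    scoring="r2",
    cv=KFold(n_splits=5, shuffle=True, random_state=100),
)
search.fit(X_tr, y_tr)

# Best grid value by mean cross-validated R^2
n_best = search.best_params_["n_features_to_select"]
# best_estimator_ is the RFE refit on all of X_tr with that value
test_r2 = r2_score(y_te, search.best_estimator_.predict(X_te))
print(n_best, round(test_r2, 4))
```

In this notebook the equivalent would be `n_features_optimal = model_cv.best_params_['n_features_to_select']`, with predictions taken from `model_cv.best_estimator_`.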