## 5.3.3 Multiple Linear Regression
Regression is the most common method for estimation problems, where the target variable is numeric or continuous. In this section, we implement an Estimation model to predict the likelihood that a given customer will churn, based on the customer's demographic and transaction profile.

To demonstrate Estimation modeling while reusing the same churn dataset, we need to replace the categorical values of the target attribute ('yes' or 'no') with numeric values (1 or 0). This is not entirely precise, but the 1 and 0 can be read as a churn estimate: 1 for a customer who churned and 0 for one who did not.
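As a small illustration of why the 0/1 encoding can be read as a churn estimate, the sketch below encodes a made-up `Churn` column and shows that the mean of the encoded column equals the share of churners. The five labels and the `toy` and `ChurnNum` names are hypothetical and are not taken from `ChurnFinal.csv`.

```python
import pandas as pd

# Hypothetical labels for illustration only; not from the course dataset
toy = pd.DataFrame({'Churn': ['yes', 'no', 'no', 'yes', 'no']})

# map 'yes' -> 1.0 and 'no' -> 0.0; any other label would become NaN
toy['ChurnNum'] = toy['Churn'].map({'yes': 1.0, 'no': 0.0})

# the mean of a 0/1 column is the proportion of 1s, i.e. the churn rate
churn_rate = toy['ChurnNum'].mean()
print(churn_rate)
```

A regression fitted on such a target therefore produces values that behave like a churn likelihood, although nothing forces them to stay between 0 and 1.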

The following Python code shows an example implementation of the data modeling phase that solves the Estimation problem with the Linear Regression algorithm (i.e., the `LinearRegression()` class). The comments embedded in the code explain the rationale of the programming logic.

In [11]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn import metrics 
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy import stats
import statsmodels.api as sm

# access the dataset from source
df = pd.read_csv('data/ChurnFinal.csv')

# map the categorical target values to numeric values for Estimation modeling
# ('yes' -> 1.0, 'no' -> 0.0; any other value becomes NaN)
df['Churn'] = df['Churn'].map({'yes': 1.0, 'no': 0.0})

# specify inputs and label
df_inputs = pd.get_dummies(df[['Gender', 'Age', 'PostalCode', 'Cash', 'CreditCard', 
        'Cheque', 'SinceLastTrx', 'SqrtTotal', 'SqrtMax', 'SqrtMin']])
df_label = df['Churn']

# split the dataset into train and test sets, in proportions of 70% and 30% respectively
# random_state fixes the random seed so that the same split is reproduced on every run
X_train, X_test, Y_train, Y_test = train_test_split(df_inputs, df_label, test_size=0.3, random_state=7) 

# create a LinearRegression object
model = LinearRegression()

# train the model using train data set
model.fit(X_train, Y_train)

# obtain the model intercept and the coefficient of each input attribute
print(f"intercept: {np.round(model.intercept_,2)}")

# pair each input attribute name with its coefficient, rounded to 2 decimal places;
# building the DataFrame from a dict keeps the coefficients as floats
coeffs_inputs = pd.DataFrame({'Attribute': list(df_inputs.columns),
                              'Coefficient': np.round(model.coef_, 2)})

# select only the coefficients with a magnitude of at least 0.01
# (a one-unit change in the attribute shifts the churn estimate by at least 0.01)
ci1 = coeffs_inputs[coeffs_inputs['Coefficient'] >= 0.01]
ci2 = coeffs_inputs[coeffs_inputs['Coefficient'] <= -0.01]
final_ci = pd.concat([ci1, ci2])
print(final_ci)

# assess the model on the held-out test set using R-squared and MSE
y_pred = model.predict(X_test)

# optional: overall F-test of the model with statsmodels (left commented out)
# est = sm.OLS(Y_train, sm.add_constant(X_train)).fit()
# if est.f_pvalue > 0.001:
#     print('F-test p-value: ', round(est.f_pvalue, 3))
# else:
#     print('F-test p-value: <0.001')
        
print('R-sq score: ', round(metrics.r2_score(Y_test, y_pred),2))
print('Mean Squared Error(MSE) : ', round(metrics.mean_squared_error(Y_test, y_pred),4))

intercept: 0.31
        Attribute  Coefficient
0             Age         0.01
2            Cash         0.01
9   Gender_female         0.05
1      PostalCode        -0.01
3      CreditCard        -0.22
4          Cheque        -0.09
6       SqrtTotal        -0.01
10    Gender_male        -0.05
R-sq score:  0.27
Mean Squared Error(MSE) :  0.1819


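Observe that in this example, we display only those input attributes whose coefficient has a magnitude of at least 0.01, meaning that a one-unit change in the attribute shifts the churn estimate by at least 0.01. The fitted model still uses every input attribute; the filter only limits which coefficients are reported. Because the size of a coefficient depends on the scale of its attribute, this is a rough screening of influential attributes, and it is one of the selection methods we can choose to perform.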
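A compact way to express the same magnitude filter is to compare the absolute value of each coefficient against the threshold. The sketch below rebuilds the coefficient table from the printed output. The three attributes missing from that output are entered as 0.0 on the assumption that they rounded to zero, and `threshold` is a name introduced here for illustration.

```python
import pandas as pd

# Coefficients copied from the printed output; the 0.0 entries are assumed
# values for the attributes that the filter left out
coeffs_inputs = pd.DataFrame({
    'Attribute': ['Age', 'PostalCode', 'Cash', 'CreditCard', 'Cheque',
                  'SinceLastTrx', 'SqrtTotal', 'SqrtMax', 'SqrtMin',
                  'Gender_female', 'Gender_male'],
    'Coefficient': [0.01, -0.01, 0.01, -0.22, -0.09,
                    0.0, -0.01, 0.0, 0.0, 0.05, -0.05],
})

# keep the rows whose coefficient magnitude is at least the threshold
threshold = 0.01
final_ci = coeffs_inputs[coeffs_inputs['Coefficient'].abs() >= threshold]
print(final_ci)
```

Unlike the two-step filter in the listing, this keeps the rows in their original order rather than listing the positive coefficients first.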

In Week 6, we will examine more approaches to evaluating and comparing model performance for selection.

After running the code above, the console prints `Mean Squared Error(MSE) :  0.1819`. MSE is the average squared difference between the estimated churn value and the actual value (1 or 0) over the test set. Taking the square root gives a root mean squared error of about 0.43, so the estimates typically deviate from the actual value by roughly 0.43 on the 0-to-1 scale.
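To make the measure concrete, the following sketch computes MSE by hand and takes its square root. The `y_true` and `y_pred` values are made up for illustration; only the 0.1819 figure comes from the output above.

```python
import math

# Hypothetical actual labels and model estimates, for illustration only
y_true = [1.0, 0.0, 1.0, 0.0, 0.0]
y_pred = [0.7, 0.2, 0.4, 0.1, 0.5]

# MSE: average of the squared differences between actual and estimated values
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# RMSE: square root of MSE, expressed in the same units as the target
rmse = math.sqrt(mse)

# RMSE implied by the MSE reported in the output above
reported_rmse = math.sqrt(0.1819)
print(round(mse, 4), round(rmse, 4), round(reported_rmse, 4))
```

This is the same quantity that `metrics.mean_squared_error` returns, written out step by step.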

Besides MSE, we will examine more measures of model performance in Week 6. 

We also observe the output `intercept: 0.31`, together with the coefficient (rounded to 2 decimal places) of each reported input attribute:
- Age = 0.01
- Cash = 0.01
- Gender_female = 0.05
- PostalCode = -0.01
- CreditCard = -0.22
- Cheque = -0.09
- SqrtTotal = -0.01
- Gender_male = -0.05

Thus, keeping only the reported attributes and their rounded coefficients, the generated Linear Regression model is approximately represented as follows:

       Churn = 0.31 + 0.01(Age) + 0.01(Cash) + 0.05(Gender_female) - 0.01(PostalCode) - 0.22(CreditCard)
                      - 0.09(Cheque) - 0.01(SqrtTotal) - 0.05(Gender_male)
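As a sketch of how this equation turns a customer profile into a churn estimate, the function below evaluates it for one customer. The function name `estimate_churn` and every value in the `customer` dictionary are hypothetical, chosen only to show the arithmetic.

```python
def estimate_churn(c):
    """Evaluate the rounded regression equation for one customer (a dict)."""
    return (0.31
            + 0.01 * c['Age']
            + 0.01 * c['Cash']
            + 0.05 * c['Gender_female']
            - 0.01 * c['PostalCode']
            - 0.22 * c['CreditCard']
            - 0.09 * c['Cheque']
            - 0.01 * c['SqrtTotal']
            - 0.05 * c['Gender_male'])

# Hypothetical customer profile; the values are not taken from the dataset
customer = {'Age': 30, 'Cash': 1, 'Gender_female': 1, 'PostalCode': 5,
            'CreditCard': 1, 'Cheque': 0, 'SqrtTotal': 10, 'Gender_male': 0}

estimate = estimate_churn(customer)
print(round(estimate, 2))
```

Because the equation is linear, nothing constrains the result to the 0-to-1 range. A sufficiently large `Age`, for example, pushes the estimate above 1.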

Further, the `R-sq score:  0.27` result indicates that the model explains about 27% of the variability of the churn target in the test set.
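The "share of variability explained" reading follows from how R-squared is defined: one minus the ratio of the unexplained variation to the total variation. The sketch below works this out on made-up values; `y_true` and `y_pred` are hypothetical and unrelated to the churn dataset.

```python
# Hypothetical actual labels and model estimates, for illustration only
y_true = [1.0, 0.0, 0.0, 1.0]
y_pred = [0.8, 0.3, 0.2, 0.6]

mean_true = sum(y_true) / len(y_true)

# total variability of the actual values around their mean
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

# variability left unexplained by the estimates
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

# R-squared: the share of total variability that the estimates account for
r2 = 1 - ss_res / ss_tot
print(round(r2, 2))
```

This is the same definition that `metrics.r2_score` applies to `Y_test` and `y_pred` in the listing.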

The F-test assesses the overall significance of the regression model. In the listing above, the F-test lines are commented out, so no F-test result appears in the printed output. Assuming 0.001 is the significance threshold, an `F-test p-value of <0.001` implies that the model is significant, suggesting that it fits the data better than a model that contains no input attributes (an intercept-only model).


For a detailed explanation of the `LinearRegression()` API parameters, refer to the official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html?highlight=linearregression#sklearn.linear_model.LinearRegression