# Chapter-3: Regression and Logistic Regression

In [None]:
import pandas as pd

## Regression

In [None]:
from google.colab import drive
drive.mount('/content/drive')

We have employee profile data with three variables. Import the data, draw the scatter plots, and try to answer the questions below.
1. Is there any association between monthly income and monthly expenses? If so, is it positive or negative?
2. Is there any association between monthly income and time spent reading books? If so, is it positive or negative?


In [None]:
#emp_profile=pd.read_csv(r"/content/drive/My Drive/DataSets/Chapter-3/Datasets/employee_profile.csv")
emp_profile=pd.read_csv(r"https://raw.githubusercontent.com/venkatareddykonasani/ML_DL_py_TF/master/Chapter3_Regression_Logistic/Datasets/employee_profile.csv")

#First few rows
emp_profile.head()

#Column names 
print(emp_profile.columns)

### Drawing the scatter plot

In [None]:
import matplotlib.pyplot as plt
plt.scatter(emp_profile["Monthly_Income"], emp_profile["Monthly_Expenses"])
plt.title('Income vs Expenses Plot')
plt.xlabel('Monthly Income')
plt.ylabel('Monthly Expenses')
plt.show()

From the above scatter plot, it is evident that there is a strong association between monthly income and monthly expenses: the higher the income, the higher the expenses, and the lower the income, the lower the expenses. This is a clear indication of a strong positive relationship. 

In [None]:
import matplotlib.pyplot as plt
plt.scatter(emp_profile["Monthly_Income"], emp_profile["Time_Spent_Reading_Books"])
plt.title('Income vs Time_Spent_Reading_Books')
plt.xlabel('Monthly Income')
plt.ylabel('Time_Spent_Reading_Books')
plt.show()

From the above scatter plot, it is clear that there is no visible relation between 'Monthly Income' and 'Time_Spent_Reading_Books'. The points are widely scattered and show no upward or downward trend.
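What the two scatter plots show visually can also be summarized with a single number, the Pearson correlation coefficient. The sketch below is only an illustration: the function name `pearson_r` and the toy lists are made up for this example and are not rows from employee_profile.csv.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Co-movement of x and y around their means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up toy values, not taken from the employee profile data
income = [1000, 2000, 3000, 4000, 5000]
expenses = [800, 1500, 2300, 3100, 3900]   # rises with income
reading_hours = [5, 1, 4, 1, 5]            # no trend with income

r_expenses = pearson_r(income, expenses)
r_reading = pearson_r(income, reading_hours)
print("income vs expenses:", round(r_expenses, 3))
print("income vs reading :", round(r_reading, 3))
```

A value near +1 indicates a strong positive association, a value near -1 a strong negative one, and a value near 0 no linear association. On a pandas DataFrame with numeric columns, `emp_profile.corr()` should report the same kind of coefficients for every pair of columns.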

### Regression Model Building
Import the data and start fitting the regression line. We don't need to plot the data first; we can directly start building the regression model. Here we skip data exploration and data cleaning, since the data is already clean and well-formatted.

In [None]:
#air_pass=pd.read_csv(r"/content/drive/My Drive/DataSets/Chapter-3/Datasets/Air_Passengers.csv")
air_pass=pd.read_csv("https://raw.githubusercontent.com/venkatareddykonasani/ML_DL_py_TF/master/Chapter3_Regression_Logistic/Datasets/Air_Passengers.csv")
print(air_pass.columns)

We can use two packages for building a regression model, 'sklearn' and 'statsmodels'. We will start with the “statsmodels” package. We need to import a subpackage from statsmodels. 

In [None]:
import statsmodels.formula.api as sm

#### Code description
* model = sm.ols(formula='y~x', data=data_name), where

1. y : target variable
2. x : predictor variables
3. sm.ols : sm is the alias we gave the package, and the ols function finds the regression beta coefficients by minimizing the sum of squared errors. OLS stands for “ordinary least squares”.

* fitted = model.fit()
1. The previous step is model configuration, and this step is model building.
2. Here the actual data is submitted, the optimization is performed, and the model is built.
3. After this step the beta coefficients are ready; they are stored in fitted. 

* fitted.summary()
1. This command gives the output summary.

In [None]:
model1 = sm.ols(formula='Passengers_count ~ marketing_cost', data=air_pass)
fitted1 = model1.fit()
print(fitted1.summary())

Now that the model is fitted, we can use it for predictions. We can predict Passengers_count at any given value of marketing_cost.

EXAMPLE:
When marketing_cost=4500, we will find the predicted value of Passengers_count. 

In [None]:
new_data=pd.DataFrame({"marketing_cost":[4500]})
print(fitted1.predict(new_data))

Similarly, we can predict Passengers_count at more than one point in a single call.

EXAMPLE: marketing_cost=4500,3600,3000,5000

In [None]:
new_data1=pd.DataFrame({"marketing_cost":[4500,3600, 3000,5000]})
print(fitted1.predict(new_data1))

### R-squared

In [None]:
air_pass["passengers_count_pred"]=round(fitted1.predict(air_pass))
keep_cols=["marketing_cost", "Passengers_count", "passengers_count_pred"]
air_pass[keep_cols]

If the model is good, then the predicted values will be very close to actual values.  Any gap between actual values and predicted values should be considered as an error. For a perfect model, the actual values are exactly equal to predicted values, and error will be zero in that case. 

R-squared is used for measuring the accuracy, or goodness of fit, of the model. R-squared is also known as the variance explained by the model. For a good model, the R-squared value should be close to 1.
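The idea of "explained variance" can be sketched in a few lines: R-squared compares the squared errors of the model with the squared errors of a baseline that always predicts the mean. The function name `r_squared` and the numbers below are made up for illustration and are not taken from the Air_Passengers data.

```python
def r_squared(actual, predicted):
    """R-squared = 1 - (sum of squared errors / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - sse / sst

# Made-up toy values
actual = [10.0, 20.0, 30.0, 40.0]
perfect_pred = [10.0, 20.0, 30.0, 40.0]    # zero error
close_pred = [12.0, 18.0, 33.0, 37.0]      # small errors
mean_pred = [25.0, 25.0, 25.0, 25.0]       # always predicts the mean

rsq_perfect = r_squared(actual, perfect_pred)
rsq_close = r_squared(actual, close_pred)
rsq_mean = r_squared(actual, mean_pred)
print(rsq_perfect, rsq_close, rsq_mean)
```

A perfect model scores 1, a model no better than predicting the mean scores 0, and most useful models fall in between.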


In [None]:
model2 = sm.ols(formula='Passengers_count ~ customer_ratings', data=air_pass)
fitted2 = model2.fit()
print(fitted2.summary())

Looking at the output summary, we can identify that model-1 has an R-squared value of 0.761 and model-2 has an R-squared value of 0.102. Using model-1, we have explained 76% of the variance in the target variable, whereas using model-2, we could explain only 10% variance in the target variable.  Hence, model-1 is better than model-2 when it comes to the accuracy of predictions.  

## Multiple regression
The model building process and the interpretation of the R-squared value remain the same. In multiple regression, we use the information from several variables to predict the target variable. 

In [None]:
import statsmodels.formula.api as sm
model3 = sm.ols(formula='Passengers_count ~ marketing_cost+percent_delayed_flights+number_of_trips+customer_ratings+poor_weather_index+percent_female_customers+Holiday_week+percent_male_customers', data=air_pass)
fitted3 = model3.fit()
print(fitted3.summary())

As you can observe, we need to mention all the predictor variables on the right-hand side. Here we are not summing the columns; the + sign is just the syntax for supplying the independent variables. We can see that the R-squared value has gone beyond 90%. We have added multiple predictor variables, and we expect the model to get better. It doesn’t necessarily mean that all the predictor variables are really important in the model.

## Multicollinearity in Regression

In [None]:
#income_expenses=pd.read_csv(r"/content/drive/My Drive/DataSets/Chapter-3/Datasets/customer_income_expenses.csv")
income_expenses=pd.read_csv("https://raw.githubusercontent.com/venkatareddykonasani/ML_DL_py_TF/master/Chapter3_Regression_Logistic/Datasets/customer_income_expenses.csv")

print(income_expenses.columns)

In [None]:
import statsmodels.formula.api as sm
model4=sm.ols(formula='Monthly_Expenses ~ Monthly_Income_in_USD+Number_of_Credit_cards+Number_of_personal_loans+Monthly_Income_in_Euro', data=income_expenses)
fitted4 = model4.fit()
print(fitted4.summary())

This model is a standard multiple regression model with four predictor variables.  The model gave us a 96% R-squared value. Let us look at the individual coefficients. 

* “Monthly_Income_in_USD” has a positive impact on overall expenses, which means the higher this variable, the higher the target variable, i.e., monthly expenses. 
* “Number_of_Credit_cards” has a positive impact on overall expenses, which means the higher this variable, the higher the target variable, i.e., monthly expenses. 
* “Number_of_personal_loans” has a positive impact on overall expenses, which means the higher this variable, the higher the target variable, i.e., monthly expenses. 
* “Monthly_Income_in_Euro” has a negative impact on overall expenses, which means the higher this variable, the lower the target variable, i.e., monthly expenses. 


Monthly expenses should be directly proportional to monthly income, whether the income is measured in dollars or euros. Monthly income in euros is just 0.9 times monthly income in dollars, since here one dollar = 0.9 euros. Yet the target variable shows a negative relation with Monthly_Income_in_Euro and a positive relation with Monthly_Income_in_USD. In the next example, we will remove the dollars variable from the model and rebuild the model with just three variables. 


In [None]:
model5=sm.ols(formula='Monthly_Expenses ~Number_of_Credit_cards+Number_of_personal_loans+Monthly_Income_in_Euro', data=income_expenses)
fitted5 = model5.fit()
print(fitted5.summary())

Here are a few observations from the above output.
* The model has the same R-squared value. So, dropping Monthly_Income_in_USD didn’t have a significant impact on the overall accuracy of the model.
* “Number_of_Credit_cards” has a positive impact on overall expenses, which means the higher this variable, the higher the target variable, i.e., monthly expenses. 
* “Number_of_personal_loans” has a positive impact on overall expenses, which means the higher this variable, the higher the target variable, i.e., monthly expenses. 
* “Monthly_Income_in_Euro” has a positive impact on overall expenses, which means the higher this variable, the higher the target variable, i.e., monthly expenses. 


**The above example is a classic illustration of the adverse effects of “Multicollinearity.”**


##  VIF-Variance Inflation Factor calculation
To detect multicollinearity, we follow a very simple technique. First of all, multicollinearity concerns only the predictor variables; the target variable plays no role in detecting it. 

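The actual measure for detecting multicollinearity is the Variance Inflation Factor, known as VIF. This measure is derived from individual model R-squared values: each predictor is regressed on the remaining predictors, and its VIF is 1/(1 - R-squared) of that model.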
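The link between an individual model's R-squared and its VIF is plain arithmetic, sketched below. The helper name `vif_from_rsq` is made up for this illustration, and the R-squared values are arbitrary examples rather than outputs of any model in this chapter.

```python
def vif_from_rsq(rsq):
    """VIF of a predictor, given the R-squared of the model that
    predicts it from all the other predictors."""
    return 1 / (1 - rsq)

# Arbitrary example R-squared values
for rsq in (0.0, 0.5, 0.8, 0.9, 0.99):
    print("R-squared =", rsq, "-> VIF =", round(vif_from_rsq(rsq), 2))

vif_at_80 = vif_from_rsq(0.8)
```

A VIF of 5 corresponds to an R-squared of 0.8, meaning 80% of the variance of that predictor is already explained by the other predictors. As R-squared approaches 1, the VIF grows without bound.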


First, we need to write our own VIF calculation function, one that takes all the predictor variables and builds the individual models. Here is the VIF function.

In [None]:
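def vif_cal(x_vars):
    # x_vars: DataFrame that contains only the predictor variables
    xvar_names=x_vars.columns
    for i in range(0,xvar_names.shape[0]):
        # Treat one predictor as the target and all the others as predictors
        y=x_vars[xvar_names[i]]
        x=x_vars[xvar_names.drop(xvar_names[i])]
        # R-squared of this individual model (the formula refers to the local y and x)
        rsq=sm.ols(formula="y~x", data=x_vars).fit().rsquared
        # VIF = 1/(1 - R-squared), rounded to two decimals
        vif=round(1/(1-rsq),2)
        print(xvar_names[i], " VIF = ", vif)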

In [None]:
vif_cal(x_vars=income_expenses.drop(["Monthly_Expenses"], axis=1))

* This output shows high VIF values for all the variables. All the variables have a VIF above 5. Should we drop all four variables from the model? Then how can we build a model? 
* This output shows high VIF values for Monthly_Income_in_USD and Monthly_Income_in_Euro. It does NOT mean that we drop both of these variables. In the presence of Monthly_Income_in_USD, the other variable Monthly_Income_in_Euro is redundant. Similarly, in the presence of Monthly_Income_in_Euro, the other variable Monthly_Income_in_USD is redundant. We should not drop both of them; we should drop only one of them. 
* This is important while dealing with multicollinearity: we should not drop all the variables with VIF>5 in one iteration. We should drop one variable at a time. That may auto-correct the VIF values of several other variables.  
* We will start with the variable that has the highest VIF. Here we will drop Monthly_Income_in_Euro first. That will leave three variables in the model. Then we will check multicollinearity among these three variables. 


In [None]:
vif_cal(x_vars=income_expenses.drop(["Monthly_Expenses","Monthly_Income_in_Euro"], axis=1))

From the above output, we can observe that the VIF value of Monthly_Income_in_USD has reduced significantly, due to the elimination of the Monthly_Income_in_Euro variable. There is still multicollinearity present in the system: Number_of_Credit_cards and Number_of_personal_loans both have VIF values higher than five. We need to drop the variable with the highest VIF and re-calculate the VIF values for the remaining variables.  

In [None]:
vif_cal(x_vars=income_expenses.drop(["Monthly_Expenses","Monthly_Income_in_Euro","Number_of_personal_loans"], axis=1))

Now both variables have VIF values less than five, which indicates that both of them carry independent information. So we can conclude that we don't need four variables for building this model; two variables are sufficient. 

Now let us build a final model after removing multicollinearity.

In [None]:
model6=sm.ols(formula='Monthly_Expenses ~ Monthly_Income_in_USD+Number_of_Credit_cards', data=income_expenses)
fitted6 = model6.fit()
print(fitted6.summary())

We can observe that there is no significant change in the overall R-squared. Since we have dropped only redundant information, the model is not adversely affected. While building multiple regression models, one of the important steps is to check for multicollinearity. 

In [None]:
vif_cal(x_vars=air_pass.drop(["Passengers_count","passengers_count_pred"], axis=1))

In [None]:
vif_cal(x_vars=air_pass.drop(["Passengers_count","passengers_count_pred", "percent_male_customers"], axis=1))

In [None]:
vif_cal(x_vars=air_pass.drop(["Passengers_count","passengers_count_pred","percent_male_customers", "percent_delayed_flights"], axis=1))

In [None]:
import statsmodels.formula.api as sm
model7 = sm.ols(formula='Passengers_count ~ marketing_cost+number_of_trips+customer_ratings+poor_weather_index+percent_female_customers+Holiday_week', data=air_pass)
fitted7 = model7.fit()
print(fitted7.summary())


The model is not suffering from multicollinearity now. Whether two variables are independent or not does not tell us anything about their impact on the target. With VIF, we check for independence among the predictor variables. The impact is measured by comparing a predictor variable against the target variable. 

## **Individual impact of the variables in regression**
While building multiple regression lines, we try to add as many dimensions (predictor variables) as possible to increase the model's prediction accuracy. If there are 30 variables in the model, that doesn’t necessarily mean all of them have a significant impact on the target variable. To find the impact of each variable, we use a specific measure. 


### **p-value**
The measure that identifies the impact of an individual variable is the p-value. To measure the impact of a variable, we perform a test, and the test gives us a p-value. If the p-value is less than 0.05, then that variable has a significant impact on the target. If the p-value of a variable is more than or equal to 0.05, then that variable is not impactful on the target variable. 
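The 0.05 rule can be sketched as a simple filter. The variable names and p-values below are hypothetical and chosen only to illustrate the rule; they are not taken from any model in this chapter.

```python
# Hypothetical p-values for illustration only
p_values = {"var_a": 0.000, "var_b": 0.031, "var_c": 0.050, "var_d": 0.412}
threshold = 0.05

# Keep variables whose p-value is strictly below the threshold
impactful = [name for name, p in p_values.items() if p < threshold]
# Variables at or above the threshold are candidates for dropping
not_impactful = [name for name, p in p_values.items() if p >= threshold]

print("impactful    :", impactful)
print("not impactful:", not_impactful)
```

For a fitted statsmodels regression, the per-variable p-values should be available through the `pvalues` attribute of the fitted result, so a similar filter could be applied to them instead of reading the summary table by eye.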

Let us work through an example and find the individual impact of the variables. We use a model built on all the remaining variables.

You can check the individual variable P-values under the heading P>|t|.

For example, in model7 two variables, percent_female_customers and number_of_trips, have P-values greater than 0.05. The P-values of the remaining variables are less than 0.05. Some of them are so small that they are rounded off to 0.000. We can drop the two variables with P-values above 0.05 from the model. 

In [None]:
import statsmodels.formula.api as sm
model8 = sm.ols(formula='Passengers_count ~ marketing_cost+customer_ratings+poor_weather_index+Holiday_week', data=air_pass)
fitted8 = model8.fit()
print(fitted8.summary())

We can see from the output that R-squared has not dropped significantly. Since we dropped non-impactful variables, dropping them has no adverse effect on the model and its accuracy. On the other hand, what will happen if we drop an impactful variable? We will see a significant drop in the R-squared value.

Let us see what happens if we drop an impactful variable. Suppose we drop marketing_cost and build a model on remaining variables.

In [None]:
import statsmodels.formula.api as sm
model9 = sm.ols(formula='Passengers_count ~  customer_ratings+poor_weather_index+Holiday_week', data=air_pass)
fitted9 = model9.fit()
print(fitted9.summary())

We dropped an impactful variable, and the output shows a significant drop in the R-squared value. The value of R-squared dropped from 90% to 62%. We should keep this variable in the model. So we can safely conclude that if the P-value is more than or equal to 0.05, we can drop such variables. The 0.05 threshold is the most widely used industry standard.

The impact of a variable has no relation to its independence. We started with six independent variables. Only four of them are impactful, and two are not. So, whether a variable is impactful or not cannot be judged from its independence. 

## Logistic Regression
We have discussed linear regression in detail so far. Let us look at the example below, where we try to predict whether a customer will buy a product or not based on income. It is a very simple example where income is the predictor and buying is the target. 

In [None]:
#product_sales=pd.read_csv(r"/content/drive/My Drive/DataSets/Chapter-3/Datasets/Product_sales.csv")
product_sales=pd.read_csv("https://raw.githubusercontent.com/venkatareddykonasani/ML_DL_py_TF/master/Chapter3_Regression_Logistic/Datasets/Product_sales.csv")
print(product_sales.columns)

In [None]:
import statsmodels.formula.api as sm
model10 = sm.ols(formula='Bought ~  Income', data=product_sales)
fitted10 = model10.fit()
print(fitted10.summary())

There seem to be no major issues in the model. We will go ahead with the predictions. This model accepts income as input and predicts whether a customer will buy or not. Below is the code for obtaining the predicted values from the model.

In [None]:
new_data=pd.DataFrame({"Income":[4000]})
print(fitted10.predict(new_data))

In [None]:
new_data1=pd.DataFrame({"Income":[85000]})
print(fitted10.predict(new_data1))

The above output shows that when income is 4,000, the predicted value is -0.096753, and when income is 85,000, the predicted value is 1.599893. 
* There is something wrong with these predictions. There is something wrong with this model itself. 
* The target variable Bought takes two values, 0 and 1. Here class-0 means not buying, and class-1 means buying. It takes no other value. 
* The predicted value for 4,000 income is -0.096753. What does a negative value mean? We know the meaning of class-0, but there is no negative class in the output. 
* Similarly, the predicted value for 85,000 income is 1.599893. What is the meaning of 1.59? We know the meaning of class-1. There is no definition for 1.59.  

Let us have a look at some sample data points and draw the scatter plot between predictor and target variables to get a better idea of the data.

In [None]:
print(product_sales.sample(10))

In [None]:
import matplotlib.pyplot as plt
plt.scatter(product_sales["Income"], product_sales["Bought"])
plt.title('Income vs Bought Plot')
plt.xlabel('Income')
plt.ylabel('Bought')
plt.show()

From the output, it is clear that the data is not in the format that we generally assume for a regression line. The whole data is concentrated at two places, class-0 and class-1. For this data, we are trying to fit a linear regression line. A straight line is not a good representation of this data. There is no way that a straight line can go through all the points in this data. 

Let us try to draw the regression line on top of this data to get a better idea.

In [None]:
pred_values= fitted10.predict(product_sales["Income"]) 
plt.scatter(product_sales["Income"], product_sales["Bought"])
plt.plot(product_sales["Income"], pred_values, color='green')
plt.title('Income vs Bought Plot')
plt.xlabel('Income')
plt.ylabel('Bought')
plt.show()

We see the regression line. It does not go through all the points in this data. It is not possible to build a meaningful linear regression model for this data. We can conclude that linear regression is not suitable for classification. The target variable in a classification problem takes a limited number of classes. In our example, we are trying to predict buying or not buying.

In fact, in most business cases, we find classification problems. For example, in a fraud detection model, we give more preference to the prediction of Fraud vs. No-Fraud. The amount of fraud in a transaction is secondary. Most of the problem statements in the real world are related to classification. Linear regression works for continuous output or point predictions. It cannot be used for solving classification problems. 

### Logistic Regression model building
Since the logistic function is the best option for our data, we will build a logistic regression line instead of a linear regression line. The syntax for building a logistic regression is not very different from building a linear regression line. Below is the code for building a logistic regression line. 

In [None]:
import statsmodels.api as sm
logit_model=sm.Logit(product_sales["Bought"],product_sales["Income"])
#Model with intercept
logit_model1=sm.Logit(product_sales["Bought"],sm.add_constant(product_sales["Income"]))
logit_fit1=logit_model1.fit()
print(logit_fit1.summary())

From this output, we can extract the logistic regression coefficients beta_0 and beta_1, and we can complete the logistic regression line equation. This equation can further be used in prediction. 

In the previous example, we built a linear regression line for this data, and we got the predictions for the x-values 4000 and 85000. We will use the above logistic regression line and try to get the predictions for the same values. We expect this line to give meaningful and accurate predictions. 
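The logistic regression line equation has the form p = 1 / (1 + exp(-(beta_0 + beta_1 * x))). The sketch below evaluates it by hand. The coefficient values are hypothetical placeholders chosen for illustration, not the coefficients from the model summary; substitute the values printed for your own fitted model.

```python
import math

def logistic_prediction(beta_0, beta_1, x):
    """Probability of class-1 from the logistic regression line equation."""
    z = beta_0 + beta_1 * x
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients, for illustration only
beta_0 = -8.0
beta_1 = 0.0002

p_low = logistic_prediction(beta_0, beta_1, 4000)
p_high = logistic_prediction(beta_0, beta_1, 85000)
print("income 4000  ->", round(p_low, 4))
print("income 85000 ->", round(p_high, 4))
```

Unlike the straight line, this function never leaves the range between 0 and 1, so every prediction can be read as the probability of buying.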


In [None]:
new_data=pd.DataFrame({"Constant":[1,1],"Income":[4000, 85000]})
print(logit_fit1.predict(new_data))

In the code, we need to supply a constant column with the value 1 to accommodate the intercept term. We added the intercept to the model using the function sm.add_constant(). We didn’t need to do this in the case of linear regression, but here we need to mention it separately. 

The predicted value for income 4000 is approximately 0, and for income 85000 it is approximately 1. If you plug the rounded coefficients from the summary into the equation by hand, there may be a slight difference from the predict() output. This difference is due to rounding off the values of beta_0 and beta_1. Always use the predict function to get accurate results. 


The code below is used for drawing the logistic regression line. You may not need to draw this while solving a business problem. This code is just for demonstration purposes.

In [None]:
new_data=product_sales.drop(["Bought"], axis=1)
new_data["Constant"]=1
new_data=new_data[["Constant","Income"]]
#Pass the variables to get the predicted values. Add actual values in a new column 
new_data["pred_values"]= logit_fit1.predict(new_data)
new_data["Actual"]=product_sales["Bought"]
#Sort the data and draw the graph
new_data=new_data.sort_values(["pred_values"])
plt.scatter(new_data["Income"], new_data["Actual"])
plt.plot(new_data["Income"], new_data["pred_values"], color='green')
#Add labels and title 
plt.title('Predicted vs Actual Plot')
plt.xlabel('Income')
plt.ylabel('Bought')
plt.show()

The above code gives us the below graph. We can observe that the logistic regression line is a much better option than a linear regression line. 

### Accuracy of Logistic Regression line
In the above exercise, we built a logistic regression line, and we got the predicted values from the model. Before going for predictions, we should get an idea of the accuracy of the model. In linear regression, we used the R-squared value for this. The R-squared value tells us the explained variation in the target variable. A variance-based measure may not work here, because the target variable takes only two values, 0 and 1, and there is hardly any variance. We will try to define a better measure of model performance. 

In [None]:
print(product_sales.head(10))

The predict function is used for extracting the predicted values. We need to pass a “constant” column separately. Logistic regression predicted values are bounded between 0 and 1. We round off the predicted values so that the final values are either class-0 or class-1.
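Rounding is one way to turn probabilities into classes; an explicit 0.5 cut-off is another. The sketch below compares the two on made-up probabilities. One detail worth knowing is that Python's built-in round() sends an exact 0.5 to 0 (it rounds halves to the nearest even number), while an explicit `>= 0.5` rule sends it to class-1. The two approaches differ only for probabilities of exactly 0.5.

```python
# Made-up predicted probabilities, for illustration only
predicted_probs = [0.02, 0.49, 0.50, 0.51, 0.97]

# Approach 1: built-in rounding (an exact 0.5 becomes 0)
rounded_classes = [round(p) for p in predicted_probs]

# Approach 2: explicit cut-off (an exact 0.5 becomes 1)
cutoff = 0.5
cutoff_classes = [1 if p >= cutoff else 0 for p in predicted_probs]

print("rounded:", rounded_classes)
print("cut-off:", cutoff_classes)
```

An explicit cut-off also makes it easy to try thresholds other than 0.5 when one type of wrong classification is costlier than the other.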

In [None]:
product_sales["Constant"]=1

In [None]:
product_sales["pred_Bought"]=logit_fit1.predict(product_sales[["Constant","Income"]])
product_sales["pred_Bought"]=round(product_sales["pred_Bought"])

In [None]:
print(product_sales[["Bought","pred_Bought"]])

Predicting zero as zero and one as one is the right classification by the model. The other two cases, i.e., predicting zero as one and one as zero, are wrong classifications by the model.
A matrix that is widely used with classification algorithms is the confusion matrix. We create a confusion matrix and calculate the accuracy from it. The code below creates the confusion matrix on our data. 

In [None]:
from sklearn.metrics import confusion_matrix
cm1 = confusion_matrix(product_sales["Bought"],product_sales["pred_Bought"])
print(cm1)

In [None]:
accuracy1=(cm1[0,0]+cm1[1,1])/(cm1[0,0]+cm1[0,1]+cm1[1,0]+cm1[1,1])
print(accuracy1)

The output shows that the model has 98% accuracy. A general industry standard for a good model is to have above 80% accuracy.
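The accuracy calculation above can be sketched without sklearn to show what the four cells of the matrix count. The function name `confusion_counts` and the labels are made up for this illustration; the layout (rows are actual classes, columns are predicted classes) follows the convention used by sklearn's confusion_matrix.

```python
def confusion_counts(actual, predicted):
    """2x2 confusion matrix: rows = actual class, columns = predicted class."""
    cm = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        cm[a][p] += 1
    return cm

# Made-up labels, for illustration only
actual = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 0, 1, 1, 1, 0, 1, 0]

cm = confusion_counts(actual, predicted)
# Diagonal cells are right classifications; off-diagonal cells are wrong ones
correct = cm[0][0] + cm[1][1]
total = cm[0][0] + cm[0][1] + cm[1][0] + cm[1][1]
accuracy = correct / total

print(cm)
print("accuracy:", accuracy)
```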

## **Multiple Logistic Regression line**
For the classification type of problems, we go for logistic regression. It is not just one predictor that helps us in predicting the target variable. There can be several factors that will impact the target class. A multiple logistic regression line is used in all such cases. It is just an extension of a simple logistic regression line. 

In [None]:
#telco_cust=pd.read_csv(r"/content/drive/My Drive/DataSets/Chapter-3/Datasets/telco_data.csv")
telco_cust=pd.read_csv("https://raw.githubusercontent.com/venkatareddykonasani/ML_DL_py_TF/master/Chapter3_Regression_Logistic/Datasets/telco_data.csv")

print(telco_cust.shape)
print(telco_cust.columns)

The target variable is still class-0 and class-1. The confusion matrix calculation and accuracy measure remain the same.

The code below is used for building the multiple logistic regression line. The code is the same as for the simple logistic regression line, except that we list all the remaining variables as predictors. 


In [None]:
import statsmodels.discrete.discrete_model as sd
logit_model2=sd.Logit(telco_cust['Active_cust'],telco_cust[["estimated_income"]+['months_on_network']+['complaints_count']+['plan_changes_count']+['relocated_new_place']+['monthly_bill_avg']+["CSAT_Survey_Score"]+['high_talktime_flag']+['internet_time']])
logit_fit2=logit_model2.fit()
print(logit_fit2.summary())

In [None]:
telco_cust["pred_Active_cust"]=logit_fit2.predict(telco_cust.drop(["Id","Active_cust"],axis=1))
telco_cust["pred_Active_cust"]=round(telco_cust["pred_Active_cust"])

In [None]:
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(telco_cust["Active_cust"],telco_cust["pred_Active_cust"])
print(cm2)

In [None]:
accuracy2=(cm2[0,0]+cm2[1,1])/(cm2[0,0]+cm2[0,1]+cm2[1,0]+cm2[1,1])
print(accuracy2)

## Multicollinearity in Logistic Regression 
In multiple linear regression, we have seen multicollinearity as an issue. The beta coefficients can not be trusted in the presence of interdependency of variables. Multicollinearity is an issue even in multiple logistic regression. The only notable change that happened from linear regression to logistic regression is in the target variable y. In linear regression, the target variable y was continuous. In logistic regression, the target variable is a class or categorical. 

While dealing with multicollinearity, we ignore the target variable. Multicollinearity is related to predictor variables. Multicollinearity exists in logistic regression also. The multicollinearity problem in linear regression is almost the same as the multicollinearity problem in logistic regression. The detection of multicollinearity using VIF is also exactly the same. We can use the same VIF function.

Here too, we drop a variable if its VIF value is greater than 5.

In [None]:
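def vif_cal(x_vars):
    # x_vars: DataFrame that contains only the predictor variables
    xvar_names=x_vars.columns
    for i in range(0,xvar_names.shape[0]):
        # Treat one predictor as the target and all the others as predictors
        y=x_vars[xvar_names[i]]
        x=x_vars[xvar_names.drop(xvar_names[i])]
        # R-squared of this individual model (the formula refers to the local y and x)
        rsq=sm.ols(formula="y~x", data=x_vars).fit().rsquared
        # VIF = 1/(1 - R-squared), rounded to two decimals
        vif=round(1/(1-rsq),2)
        print(xvar_names[i], " VIF = ", vif)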

We will use the above function to detect multicollinearity in our data.

In [None]:
import statsmodels.formula.api as sm
vif_cal(x_vars=telco_cust.drop(["Id","Active_cust","pred_Active_cust"], axis=1))

From the output, it is clear that a few variables are interdependent. We need to drop the variable with the highest VIF value. We will drop “CSAT_Survey_Score” and re-calculate the VIF values for the rest of the variables.

In [None]:
vif_cal(x_vars=telco_cust.drop(["Id","Active_cust","pred_Active_cust","CSAT_Survey_Score"], axis=1))

From the output, we can observe that all the variables have VIF less than five. We can conclude that all these variables are independent. We can use all of them in the model. Below is the model after removing the multicollinearity. 

In [None]:
logit_model3=sd.Logit(telco_cust['Active_cust'],telco_cust[["estimated_income"]+['months_on_network']+['complaints_count']+['plan_changes_count']+['relocated_new_place']+['monthly_bill_avg']+['high_talktime_flag']+['internet_time']])
logit_fit3=logit_model3.fit()
print(logit_fit3.summary())

In [None]:
telco_cust["pred_Active_cust"]=logit_fit3.predict(telco_cust.drop(["Id","Active_cust","pred_Active_cust","CSAT_Survey_Score"],axis=1))
telco_cust["pred_Active_cust"]=round(telco_cust["pred_Active_cust"])

In [None]:
from sklearn.metrics import confusion_matrix
cm3 = confusion_matrix(telco_cust["Active_cust"],telco_cust["pred_Active_cust"])
print(cm3)

In [None]:
accuracy3=(cm3[0,0]+cm3[1,1])/(cm3[0,0]+cm3[0,1]+cm3[1,0]+cm3[1,1])
print(accuracy3)

This model has an accuracy of 86.4%. We have to look into individual variable impacts now. 

## Individual impact of variables in Logistic Regression

Once we are done with multicollinearity detection using VIF, we are left with independent variables only. Not all of these independent variables may be impactful. If we have a list of 20 variables in the model, it doesn’t necessarily mean that all of them are impactful on the target.

Again, this is the same as in linear regression.

In [None]:
logit_model4=sd.Logit(telco_cust['Active_cust'],telco_cust[['months_on_network']+['complaints_count']+['plan_changes_count']+['relocated_new_place']+['monthly_bill_avg']+['internet_time']])
logit_fit4=logit_model4.fit()
print(logit_fit4.summary())

In [None]:
telco_cust["pred_Active_cust"]=logit_fit4.predict(telco_cust.drop(["Id","Active_cust","pred_Active_cust","CSAT_Survey_Score","estimated_income","high_talktime_flag"],axis=1))
telco_cust["pred_Active_cust"]=round(telco_cust["pred_Active_cust"])

In [None]:
from sklearn.metrics import confusion_matrix
cm4= confusion_matrix(telco_cust["Active_cust"],telco_cust["pred_Active_cust"])
print(cm4)

In [None]:
accuracy4=(cm4[0,0]+cm4[1,1])/(cm4[0,0]+cm4[0,1]+cm4[1,0]+cm4[1,1])
print(accuracy4)