# Micro Credit Loan Defaulter Case

#Problem Statement: 
A Microfinance Institution (MFI) is an organization that offers financial services to low-income populations. Microfinance services (MFS) are especially useful for reaching unbanked poor families living in remote areas with few sources of income. The services provided by MFIs include Group Loans, Agricultural Loans, Individual Business Loans and so on. 
Many microfinance institutions (MFIs), experts and donors support the idea of using mobile financial services (MFS), which they consider more convenient, efficient and cost-saving than the traditional high-touch model long used to deliver microfinance services. Although the MFI industry focuses primarily on low-income families and is very useful in such areas, the implementation of MFS has been uneven, with both significant challenges and successes.
Today, microfinance is widely accepted as a poverty-reduction tool, representing $70 billion in outstanding loans and a global outreach of 200 million clients.
We are working with one such client in the telecom industry, a fixed wireless telecommunications network provider. They have launched various products and have developed their business and organization on the budget operator model, offering better products at lower prices to all value-conscious customers through a strategy of disruptive innovation that focuses on the subscriber. 
They understand the importance of communication and how it affects a person’s life, and therefore focus on providing their services and products to low-income families and poor customers, which can help them in their hour of need. 
They are collaborating with an MFI to provide micro-credit on mobile balances, to be paid back within 5 days. A consumer is considered a defaulter if they fail to pay back the loaned amount within that 5-day period. For a loan amount of 5 (in Indonesian Rupiah), the payback amount should be 6 (in Indonesian Rupiah), while for a loan amount of 10 (in Indonesian Rupiah), the payback amount should be 12 (in Indonesian Rupiah). 
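
The repayment terms above work out to the same flat fee rate for both loan sizes. A quick check of that arithmetic (the names `payback` and `fee_rate` are illustrative only, not part of the client data):

```python
# Loan amount -> payback amount, as stated in the problem statement.
payback = {5: 6, 10: 12}

# Flat fee rate charged over the 5-day window, per loan size.
fee_rate = {loan: (due - loan) / loan for loan, due in payback.items()}
print(fee_rate)
```

Both loan sizes carry a fee of one fifth of the principal.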
The sample data comes from our client's database and is provided to you for this exercise. To improve the selection of customers for credit, the client wants predictions that could help them with further investment and with better customer selection. 
Exercise:
Build a model that predicts, as a probability for each loan transaction, whether the customer will pay back the loaned amount within 5 days of issuance of the loan. In this case, Label ‘1’ indicates that the loan has been paid, i.e. non-defaulter, while Label ‘0’ indicates that the loan has not been paid, i.e. defaulter.  
Points to Remember:
•	There are no null values in the dataset. 
•	There may be some customers with no loan history. 
•	The dataset is imbalanced. Label ‘1’ has approximately 87.5% records, while, label ‘0’ has approximately 12.5% records.
•	For some features, there may be values which might not be realistic. You may have to observe them and treat them with a suitable explanation.
•	You might come across outliers in some features which you need to handle as per your understanding. Keep in mind that data is expensive and we cannot lose more than 7-8% of the data.  

In [None]:
#Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [None]:
df=pd.read_csv('Micro_Credit_Loan_Data_file.csv')
df

In [None]:
#Dropping the 1st column as it is not necessary, only contains serial no.
df.drop(df.columns[[0]], axis=1, inplace=True)

In [None]:
df.head(10)

In [None]:
df.shape

In [None]:
#total 36 columns
df.columns

In [None]:
df.info()
#No null values

In [None]:
#pdate should be of date type
df['pdate']= pd.to_datetime(df['pdate'])

In [None]:
df.describe()
#Presence of outliers in some columns like last_rech_date_ma, last_rech_date_da

In [None]:
#As observed in minimum values above-
#aon (age on cellular network in days) cannot be negative
#daily_decr30 (Daily amount spent from main account, averaged over last 30 days (in Indonesian Rupiah)) 
#and daily_decr90 (Daily amount spent from main account, averaged over last 90 days (in Indonesian Rupiah)) cannot be negative.

#last_rech_date_ma (Number of days till last recharge of main account)
#and last_rech_date_da (Number of days till last recharge of data account) cannot be negative.

#Also, rental30(Average main account balance over last 30 days) and 
#rental90(Average main account balance over last 90 days) being negative are not favorable to provide them credit.

In [None]:
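#Replacing negative values with zero
#Keep only the numeric columns (non-numeric ones such as pdate are dropped here)
df = df.select_dtypes(include='number')
df = df.clip(lower=0)
df.describe()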


#All negative values are now replaced with zero
#Or Impute these values

In [None]:
#Working with 'label' column
df['label'].unique()
#Binary values in target 'label'

In [None]:
#Imbalanced dataset in target
df.label.value_counts()
#imbalanced dataset problem so we can use SMOTE just to increase instances of minority classes in training dataset

In [None]:
#Univariate analysis

In [None]:
#One bar per class, showing the number of records for each label
df['label'].value_counts().plot.bar()

In [None]:
#Checking outliers

In [None]:
#sns.boxplot(data=df.daily_decr30)

In [None]:
plt.figure(figsize=(15, 10))
ax = sns.boxplot(data=df.iloc[:,1:9], orient="h", palette="Set1")
#Presence of outliers in many columns

In [None]:
#https://www.kaggle.com/shrutidandagi/eda-bank-loan-default

In [None]:
#Total amount of recharges vs amount of loans taken versus default (cont. vs cont. vs cat.)

In [None]:
#For 30 days-
#sumamnt_ma_rech30	Total amount of recharge in main account over last 30 days (in Indonesian Rupiah)
#amnt_loans30	Total amount of loans taken by user in last 30 days
#versus label
#3 columns- 2 continuous and 1 categorical(as hue)
sns.relplot(x="amnt_loans30", y="sumamnt_ma_rech30", hue="label", kind="line", data=df)

In [None]:
#For 90 days-
#sumamnt_ma_rech90	Total amount of recharge in main account over last 90 days (in Indonesian Rupiah)
#amnt_loans90	Total amount of loans taken by user in last 90 days
#versus label

#3 columns- 2 continuous and 1 categorical(as hue)
sns.relplot(x="amnt_loans90", y="sumamnt_ma_rech90", hue="label", kind="line", data=df)

In [None]:
#Recharge frequency/no. of times of recharging versus no. of loans taken vs default (cont. vs cont. vs cat.)

In [None]:
#For 30 days-
#cnt_ma_rech30	Number of times main account got recharged in last 30 days
#cnt_loans30	Number of loans taken by user in last 30 days
#versus label
sns.relplot(x="cnt_loans30", y="cnt_ma_rech30", hue="label", kind="line", data=df)

In [None]:
#max. loan vs payback time vs default(cont. vs cont. vs cat.)
#There are only two loan amounts: 5 & 10 Rupiah, for which the user needs to pay back 6 & 12 Rupiah respectively

In [None]:
#For 30 days-
#amnt_loans30	Total amount of loans taken by user in last 30 days
#payback30	Average payback time in days over last 30 days
#versus label

#sns.displot(df, x="maxamnt_loans30", hue="label", multiple="dodge")
#sns.relplot(x="payback30", y="maxamnt_loans30", hue="label", kind="line", data=df)
#df["payback30"].plot.hist(bins=10, figsize=(10,8))

sns.displot(df, x="amnt_loans30", kind="kde")

In [None]:
#For 90 days-
#maxamnt_loans90	maximum amount of loan taken by the user in last 90 days
#payback90	Average payback time in days over last 90 days
#versus label
#sns.relplot(x="maxamnt_loans90", y="payback90", hue="label", kind="line", data=df)

In [None]:
#Average main account balance vs no. of loans taken vs payback time (cont. vs cont. vs cont.)

In [None]:
#For 30 days-
#rental30	Average main account balance over last 30 days
#cnt_loans30	Number of loans taken by user in last 30 days
#payback30	Average payback time in days over last 30 days

#Two columns (cont. var.) on same x-axis but different y-axis
#df.plot(x="rental30", y="cnt_loans30")
#ax = df.plot(secondary_y="payback30")


#df["cnt_loans90"].plot(secondary_y=True, style="g")  #Use of keyword
#Four columns (cont. var.) on two y-axis
#plt.figure()
#ax = df.plot(secondary_y=["A", "B"]) #Two columns on one y-axis
#ax.set_ylabel("CD scale")    #Other two columns on other y-axis
#ax.right_ax.set_ylabel("AB scale")

In [None]:
#For 90 days-
#rental90	Average main account balance over last 90 days
#cnt_loans90	Number of loans taken by user in last 90 days
#payback90	Average payback time in days over last 90 days
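#cnt_loans90 on the left y-axis and payback90 on the right y-axis, both against rental90
df_sorted = df.sort_values("rental90")
ax = df_sorted.plot(x="rental90", y="cnt_loans90")
df_sorted.plot(x="rental90", y="payback90", secondary_y=True, ax=ax)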

In [None]:
#main account vs data account recharge (cont. vs cont.)

In [None]:
#For 30 days-
#cnt_ma_rech30	Number of times main account got recharged in last 30 days
#versus
#cnt_da_rech30	Number of times data account got recharged in last 30 days
#df["cnt_ma_rech30"].plot()
#df["cnt_da_rech30"].plot(secondary_y=True, style="g")

In [None]:
#For 90 days-
#cnt_ma_rech90	Number of times main account got recharged in last 90 days
#versus
#cnt_da_rech90	Number of times data account got recharged in last 90 days
df["cnt_ma_rech90"].plot()
df["cnt_da_rech90"].plot(secondary_y=True, style="g")

In [None]:
#frequency of main and data recharge vs no. of loans (cont vs cont)

In [None]:
#fr_ma_rech30	Frequency of main account recharged in last 30 days
#versus
#cnt_loans30	Number of loans taken by user in last 30 days


In [None]:
#fr_da_rech30	Frequency of data account recharged in last 30 days
#versus
#cnt_loans30	Number of loans taken by user in last 30 days


In [None]:
#Number of days till last recharge of main account vs Number of days till last recharge of data account (cont. vs cont.)
#last_rech_date_ma	Number of days till last recharge of main account
#last_rech_date_da	Number of days till last recharge of data account


In [None]:
#For 30 days-


In [None]:
#For 90 days-

In [None]:
#Main account-related- 
#This will give insight into if there is relation between daily spendings and recharge frequencies and amount.
#With label as hue for defaulter
#(if default is done more by those having high no. of recharges of main account or for a specific amount of recharge.)

#versus label-
#1. medianamnt_ma_rech30	Median of amount of recharges done in main account over last 30 days at user level (in Indonesian Rupiah)
#   and medianamnt_ma_rech90	Median of amount of recharges done in main account over last 90 days at user level (in Indonesian Rupiah)

#2. daily_decr30	Daily amount spent from main account, averaged over last 30 days (in Indonesian Rupiah) and 
#   daily_decr90	Daily amount spent from main account, averaged over last 90 days (in Indonesian Rupiah) versus label


#Do it in tight layout/subplot

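#displot is figure-level and creates its own figure, so the size is set with height/aspect
sns.displot(df, x="medianamnt_ma_rech30", hue="label", kind="kde", height=5, aspect=2)

sns.displot(df, x="medianamnt_ma_rech90", hue="label", kind="kde", height=5, aspect=2)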

#For 30 days data-
#Defaulters, most of the recharges are done in range of 0 to 4000 rupiah
#Non-defaulters, most of the recharges are done roughly in range of 0 to 8000 rupiah
#For 90 days data- Shows the same trend.
#No significant relation

#displot is figure-level and creates its own figure, so the size is set with height/aspect
sns.displot(df, x="daily_decr30", hue="label", kind="kde", height=5, aspect=2)
sns.displot(df, x="daily_decr90", hue="label", kind="kde", height=5, aspect=2)
#For both the 30-day and 90-day data, defaulters spend less money from main account than non-defaulters.
#Most defaulters spend roughly 0 to 5000 rupiah, whereas the majority of non-defaulters spend up to roughly 40000 rupiah daily from main account

In [None]:
#Observation 1.
#Non- defaulters spend more amount from main account along with doing more amount of recharges as compared to defaulters.

In [None]:
#Try histogram (Histplot)

df.hist(by='label', column='payback30')
#Syntax- by='cat. var', column='cont. var.'


#Loan-related (resp. for 30 and 90 days)
#To check relation between frequency of recharges and loans-
#For 30 days-
#1. cnt_loans30	Number of loans taken by user in last 30 days versus 
#   cnt_ma_rech30	Number of times main account got recharged in last 30 days, with label as hue for defaulter

#2. amnt_loans30	Total amount of loans taken by user in last 30 days versus
#   payback30	Average payback time in days over last 30 days, with label as hue

In [None]:
plt.figure(figsize=(15, 8))
sns.lineplot(x="cnt_ma_rech30", y="cnt_loans30", hue="label", data=df)
#There is a positive relation between no. of loans and no. of times of recharging main account
#Non-defaulters take more no. of loans with increase in no. of recharges but is highly skewed.
#Defaulters' max no. of loans stand at approx 26 and max no. of recharges at approx 40, 
#i.e. they take lesser no. of loans and lesser no. of recharges 


In [None]:
plt.figure(figsize=(10, 8))
sns.lineplot(x="amnt_loans30", y="payback30", hue="label", data=df)

In [None]:
plt.figure(figsize=(10, 8))
sns.lineplot(x="amnt_loans90", y="payback90", hue="label", data=df)

In [None]:
#Observation 2- 
#Defaulters have higher payback time and that too for lesser amount of loans and vice-versa.
#Non-defaulters payback in lesser no. of days.

In [None]:
#Recharge-related, 
#Data account related- deep insight if people are recharging more for internet or calls .
#Main account vs data account recharge comparisons
#1. cnt_da_rech30	Number of times data account got recharged in last 30 days versus
#   cnt_ma_rech30	Number of times main account got recharged in last 30 days

#Main account vs data account recharge defaulters

#3. last_rech_date_ma	Number of days till last recharge of main account and
#   last_rech_date_da	Number of days till last recharge of data account versus label

#rental30	Average main account balance over last 30 days
#rental90	Average main account balance over last 90 days

In [None]:
sns.displot(df, x="cnt_da_rech30")

In [None]:
sns.displot(df, x="cnt_ma_rech30")

In [None]:
#Observation 3-
#Data account is recharged more than main account.

In [None]:
#X
plt.figure(figsize=(10, 8))
sns.lineplot(x="last_rech_date_da", y="last_rech_date_ma", hue="label", data=df)

In [None]:
#Countplot for target
sns.countplot(x='label', data=df)
#Class imbalance

In [None]:
#Bivariate analysis
#pairplot with subplots
sns.pairplot(df, kind="kde")

In [None]:
#Correlation analysis with corr()
df.corr()

In [None]:
#Correlation with heatmap
plt.figure(figsize=(20,10))
sns.heatmap(df.corr(), annot= True, linewidth=1.0, linecolor="black", fmt='0.2f')

In [None]:
#Checking columns which are strongly correlated to target 'label'
plt.figure(figsize=(22,7))
df.corr()['label'].sort_values(ascending=False).drop(['label']).plot(kind='bar', color='c')
plt.xlabel('Feature', fontsize=14)
plt.ylabel('Correlation with label', fontsize=14)
plt.title('Correlation', fontsize=18)
plt.show()

In [None]:
#Checking skewness
df.skew()

In [None]:
print("Count of features which are significantly skewed: ",len(df.skew().loc[abs(df.skew())>0.5]))

In [None]:
#Encoding
#Let's use the pd.get_dummies function to convert categorical columns into a numeric form that the machine can understand
df=pd.get_dummies(df, drop_first = True)
df.head()

In [None]:
#Data cleaning
#Remove outliers with quantiles or imputing or zscore

#Checking outliers with zscore
from scipy.stats import zscore
z=np.abs(zscore(df))
z

In [None]:
threshold=3
print(np.where(z>threshold))

In [None]:
#Removing outliers using zscore
df_new=df[(z<3).all(axis=1)]
df_new.shape

In [None]:
#Check for percentile in describe method for particular column
print(df['aon'].quantile(0.01))
print(df['aon'].quantile(0.75))
print(df['aon'].quantile(0.99))
#each of these give 28.0, 35.0, and 54.7 as respective quantiles value

In [None]:
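#Actual outlier treatment in the column on the basis of the selected quantile values:
#negative 'aon' values are replaced with 55.0 (approx. the 0.99 quantile printed above)
df['aon']=np.where(df['aon']<0, 55.0, df['aon'])
#Outliers replaced in column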

In [None]:
df['aon'].describe()
#Check for the column
#Negative values removed

In [None]:
df.shape

In [None]:
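#%Loss of data after z-score outlier removal
#(in the original run: (209593-161457)/209593*100)
Data_loss=((df.shape[0]-df_new.shape[0])/df.shape[0])*100
Data_loss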

In [None]:
#since data loss is more than 8%, so use IQR method to remove outliers

In [None]:
#Use IQR method to remove outliers

In [None]:
#dividing it into input and output
x=df_new.drop("label", axis=1)
y=df_new["label"]  #1-D Series, as expected by scikit-learn and SMOTE

In [None]:
#Check skewness in features
x.skew()

In [None]:
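#Remove skewness if any with transformation or other method
skewness=x.skew()  #computed once, before any column is transformed
for index in skewness.index:
    if skewness.loc[index]>0.5:
        #right-skewed: log1p compresses the long right tail
        x[index]=np.log1p(x[index])
    elif skewness.loc[index]<-0.5:
        #left-skewed: squaring stretches the upper end
        x[index]=np.square(x[index])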
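
To see what the `log1p` step does to a right-skewed, non-negative column, here is a small sketch on synthetic data. The lognormal sample is an assumption chosen only because it is strongly right-skewed; it is not drawn from the loan data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed, non-negative feature (not the real data).
s = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

skew_before = s.skew()
skew_after = np.log1p(s).skew()
print(round(skew_before, 2), round(skew_after, 2))
```

Re-running `x.skew()` after the transformation on the real features is the equivalent check.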

In [None]:
#Data scaling/normalizing
from sklearn.preprocessing import StandardScaler
stds=StandardScaler()
X=stds.fit_transform(x)
X
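
Fitting the scaler on the full dataset lets test-set statistics leak into training. A hedged alternative, sketched on synthetic arrays (`features` and `target` are stand-ins for `x` and `y`, not the real data), is to fit the scaler on the training split only and reuse it for the test split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic stand-in for x
target = rng.integers(0, 2, size=200)                       # synthetic stand-in for y

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.30, random_state=0, stratify=target)

scaler = StandardScaler().fit(f_train)      # statistics come from the training rows only
f_train_scaled = scaler.transform(f_train)
f_test_scaled = scaler.transform(f_test)    # test rows reuse the training mean and std
```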

In [None]:
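#Treating imbalance with SMOTE
# import SMOTE module from imblearn library

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

#Split first so that only the training part is oversampled
#(test_size and random_state here are assumed values)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=2, stratify=y)
sm = SMOTE(random_state = 2)
X_train_sm, y_train_sm = sm.fit_resample(X_train, np.ravel(y_train))

print('After OverSampling, the shape of train_X: {}'.format(X_train_sm.shape))
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_sm.shape))
  
print("After OverSampling, counts of label '1': {}".format(sum(y_train_sm == 1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_sm == 0)))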

In [None]:
#since it is an imbalanced dataset so we will focus on auc-roc score
!pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
#Finding best random state using SMOTE

def max_aucroc_score(clfr,x,y):
    best_score=0
    final_r_state=None
    y=np.ravel(y)  #1-D labels, whether y is a Series or a one-column DataFrame
    for r_state in range(30,100):
        #split the arguments passed in (not the global X, y)
        X_train, X_test, y_train, y_test = train_test_split(x, y,random_state = r_state,test_size=0.30,stratify=y)
        #oversample the training split only; fit_resample replaces the removed fit_sample
        X_train_sm, y_train_sm = SMOTE().fit_resample(X_train, y_train)
        clfr.fit(X_train_sm,y_train_sm)
        y_pred = clfr.predict(X_test)
        #note: this AUC is computed from hard class predictions, not probabilities
        auro_scr=roc_auc_score(y_test,y_pred)
        print("auc roc score corresponding to ",r_state," is ",auro_scr)
        if auro_scr>best_score:
            best_score=auro_scr
            final_r_state=r_state
    print("max auc roc score corresponding to ",final_r_state," is ",best_score)
    return final_r_state

In [None]:
#Prediction and Recall
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
lr = LogisticRegression()
lr.fit(X_train_sm, np.ravel(y_train_sm))
predictions = lr.predict(X_test)
# print classification report
print(classification_report(y_test, predictions))

print(max_aucroc_score(lr,X,y))

In [None]:
#lets use cross_val_score for logistic regression
from sklearn.model_selection import cross_val_score
print(cross_val_score(lr,X,y,cv=5,scoring="roc_auc"))

In [None]:
#Lets use logistic regression and check
#from sklearn.linear_model import LogisticRegression
#lr=LogisticRegression()
#max_aucroc_score(lr,x,y)

In [None]:
#lets check decision tree
from sklearn.tree import DecisionTreeClassifier
dtc=DecisionTreeClassifier()
max_aucroc_score(dtc,x,y)

In [None]:
#lets check cross_val_score for decision tree
#run the cross-validation once and reuse the scores
dtc_scores=cross_val_score(dtc,x,y,cv=5,scoring="roc_auc")
print("Mean auc roc score for decision tree classifier: ",dtc_scores.mean())
print("standard deviation in auc roc score for decision tree classifier: ",dtc_scores.std())
print(dtc_scores)

In [None]:
#lets use random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
parameters={"n_estimators":[10,100,500]}
rfo_clf=RandomForestClassifier()
clfs = GridSearchCV(rfo_clf, parameters, cv=5,scoring="roc_auc")
clfs.fit(x,y)
clfs.best_params_

In [None]:
rfo_clf=RandomForestClassifier(n_estimators=500)
max_aucroc_score(rfo_clf,x,y)

In [None]:
#lets check cross_val_score for random forest
#run the cross-validation once and reuse the scores
rfo_scores=cross_val_score(rfo_clf,x,y,cv=5,scoring="roc_auc")
print("Mean auc roc score for random forest classifier: ",rfo_scores.mean())
print("standard deviation in auc roc score for random forest classifier: ",rfo_scores.std())
print(rfo_scores)

In [None]:
#Lets use KNN
#For KNN we need to know the best value of k using grid search
from sklearn.neighbors import KNeighborsClassifier
knnc=KNeighborsClassifier()
neighbors={"n_neighbors":range(1,30)}
clfs = GridSearchCV(knnc, neighbors, cv=5,scoring="roc_auc")
clfs.fit(x,y)
clfs.best_params_

In [None]:
knnc=KNeighborsClassifier(n_neighbors=28)
max_aucroc_score(knnc,x,y)

In [None]:
#lets use cross_val_score for knn 
#run the cross-validation once and reuse the scores
knn_scores=cross_val_score(knnc,x,y,cv=5,scoring="roc_auc")
print("Mean roc auc score for knn classifier: ",knn_scores.mean())
print("standard deviation in roc auc score for knn classifier: ",knn_scores.std())
print(knn_scores)

In [None]:
#Lets use SVM
from sklearn.svm import SVC
svc=SVC()
parameters={"kernel":["linear", "poly", "rbf"],"C":[0.001,0.01,0.1,1,10]}
clfs = GridSearchCV(svc, parameters, cv=5,scoring="roc_auc")
clfs.fit(x,y)
clfs.best_params_

In [None]:
svc=SVC(kernel="linear",C=0.1)
max_aucroc_score(svc,x,y)

In [None]:
#lets use cross_val_score for svm
#run the cross-validation once and reuse the scores
svc_scores=cross_val_score(svc,x,y,cv=5,scoring="roc_auc")
print("Mean roc auc score for svm classifier: ",svc_scores.mean())
print("standard deviation in roc auc score for svm classifier: ",svc_scores.std())
print(svc_scores)

In [None]:
#Lets use Gradient boosting classifier
from sklearn.ensemble import GradientBoostingClassifier
parameters={"learning_rate":[0.001,0.01,0.1,1],"n_estimators":[10,100,500,1000]}
grb_clf=GradientBoostingClassifier()
clfs = GridSearchCV(grb_clf, parameters, cv=5,scoring="roc_auc")
clfs.fit(x,y)
clfs.best_params_

In [None]:
grb_clf=GradientBoostingClassifier(learning_rate=0.1,n_estimators=500)
max_aucroc_score(grb_clf,x,y)

In [None]:
#lets check cross_val_score for gradient boosting
#run the cross-validation once and reuse the scores
grb_scores=cross_val_score(grb_clf,x,y,cv=5,scoring="roc_auc")
print("Mean auc roc score for gradient boosting classifier: ",grb_scores.mean())
print("standard deviation in auc roc score for gradient boosting classifier: ",grb_scores.std())
print(grb_scores)

In [None]:
#Lets choose svm as our final model and random state 70
x_train, x_test, y_train, y_test = train_test_split(x, y,random_state = 70,test_size=0.20,stratify=y)
#fit_resample replaces fit_sample, which was removed from imbalanced-learn
x_train, y_train = SMOTE().fit_resample(x_train, y_train)
svc=SVC(kernel="linear",C=0.1)
svc.fit(x_train,y_train)
y_pred=svc.predict(x_test)

In [None]:

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
print("Confusion matrix \n",confusion_matrix(y_test,y_pred))
print("f1 score is : ",f1_score(y_test,y_pred))
print("classification report \n",classification_report(y_test,y_pred))
print("AUC ROC Score: ",roc_auc_score(y_test,y_pred))
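
The exercise asks for a probability per loan, and ROC AUC is normally computed from scores rather than hard 0/1 predictions. Below is a minimal sketch on synthetic data; `make_classification` and the logistic regression model are stand-ins chosen for illustration, not the notebook's final model. For an `SVC` without probability estimates, `decision_function` can supply the scores instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the loan features.
data_X, data_y = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.125, 0.875], random_state=0)
tr_X, te_X, tr_y, te_y = train_test_split(data_X, data_y, test_size=0.20,
                                          random_state=0, stratify=data_y)

model = LogisticRegression(max_iter=1000).fit(tr_X, tr_y)

proba_paid = model.predict_proba(te_X)[:, 1]   # P(label == 1) for each row
auc_from_proba = roc_auc_score(te_y, proba_paid)
auc_from_labels = roc_auc_score(te_y, model.predict(te_X))
print(auc_from_proba, auc_from_labels)
```

The probability-based score uses the full ranking of customers, while the label-based score only reflects a single threshold.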

In [None]:
#Check multicollinearity with VIF

In [None]:
#Finding best random state
#Apply the logistic regression model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
lr=LogisticRegression()
for i in range(0,10000):
    #X is the scaled feature matrix produced by StandardScaler above
    x_train, x_test, y_train, y_test= train_test_split(X, y, test_size=0.30, random_state=i)
    lr.fit(x_train, y_train)
    #scores of test and train datasets
    pred_train=lr.predict(x_train)
    pred_test=lr.predict(x_test)
    if round(accuracy_score(y_train, pred_train)*100,2)==round(accuracy_score(y_test, pred_test)*100,2):
        print("At random state", i, "this model performs well")
        print("Training score is:", accuracy_score(y_train, pred_train)*100)
        print("Testing score is:", accuracy_score(y_test, pred_test)*100)

In [None]:

#Here, the best random state is:
#So applying it in another algorithm


In [None]:
#Checking for overfitting/underfitting with train and test scores with cross_val_score
pred_lr=lr.predict(x_test)
from sklearn.model_selection import cross_val_score
lrs=accuracy_score(y_test, pred_lr)
for j in range(2,10):
    lrscore=cross_val_score(lr, X, y, cv=j)  #X is the scaled feature matrix
    lsc=lrscore.mean()
    print("At cv:", j)
    print("Cross validation score is:", lsc*100)
    print("Accuracy score is:", lrs*100)
    print("\n")

In [None]:
#AUC-ROC value for the above algo
from sklearn.metrics import roc_curve, auc
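#roc_curve expects (y_true, y_score); use the predicted probability of class 1 as the score
y_score=lr.predict_proba(x_test)[:,1]
fpr, tpr, thresholds=roc_curve(y_test, y_score)
roc_auc=auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=10, label='ROC curve (area = %0.2f)' % roc_auc)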
plt.plot([0,1], [0,1], color='navy', lw=10, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('Receiver operating characteristics')
plt.legend(loc="lower right")
plt.show()

In [None]:
#Regularization (if needed)
#3 more algos with the above random state in train_test_split
#Steps till AUC-ROC score

In [None]:
#Best AUC-ROC to find best model

In [None]:
#Model saving
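
A minimal sketch of this step using `joblib`, which is installed alongside scikit-learn. The model here is a toy logistic regression on random data and the file name is hypothetical; in the notebook the fitted final model would be passed to `joblib.dump` instead.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(100, 4))           # synthetic features, not the loan data
toy_y = (toy_X[:, 0] > 0).astype(int)       # synthetic binary target
final_model = LogisticRegression().fit(toy_X, toy_y)

# Hypothetical file name inside a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "micro_credit_model.joblib")
joblib.dump(final_model, model_path)        # save the fitted model to disk
restored = joblib.load(model_path)          # load it back for later predictions
```

If a scaler or other preprocessing step is used, it should be saved as well so that new data can be transformed the same way before prediction.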