### Classify loan applicants as expected defaulters or non-defaulters

In [1]:
# Close : real number
# default : 0 or 1 (categorical column)

In [2]:
import pandas as pd
import plotly.express as px

path = r"/home/harshit/Desktop/IntroductionToML/Dataset/original.csv"

df=pd.read_csv(path, index_col="clientid")

df

clientid,income,age,loan,default
1,66155.925095,59.017015,8106.532131,0
2,34415.153966,48.117153,6564.745018,0
3,57317.170063,63.108049,8020.953296,0
4,42709.534201,45.751972,6103.642260,0
5,66952.688845,18.584336,8770.099235,1
...,...,...,...,...
1996,59221.044874,48.518179,1926.729397,0
1997,69516.127573,23.162104,3503.176156,0
1998,44311.449262,28.017167,5522.786693,1
1999,43756.056605,63.971796,1622.722598,0


In [3]:
# information of data

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 1 to 2000
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   income   2000 non-null   float64
 1   age      1997 non-null   float64
 2   loan     2000 non-null   float64
 3   default  2000 non-null   int64  
dtypes: float64(3), int64(1)
memory usage: 78.1 KB


In [4]:
# null value check

df.isna().sum()

income     0
age        3
loan       0
default    0
dtype: int64

In [5]:
# number of unique values for each column
df.nunique()

income     2000
age        1997
loan       2000
default       2
dtype: int64

In [6]:
# column names, shape 

print(df.columns)
print(df.shape)

Index(['income', 'age', 'loan', 'default'], dtype='object')
(2000, 4)


In [7]:
df.dropna(inplace=True) #drop a row if it contains even a single missing value
df.reset_index(inplace=True)

In [8]:
df.isna().sum()

clientid    0
income      0
age         0
loan        0
default     0
dtype: int64

In [9]:
df[["income", "age", "loan"]].describe()

,income,age,loan
count,1997.0,1997.0,1997.0
mean,45333.864334,40.807559,4445.487716
std,14325.131177,13.624469,3046.792457
min,20014.48947,-52.42328,1.37763
25%,32804.904487,28.990415,1936.813257
50%,45788.7471,41.317159,3977.287432
75%,57787.565659,52.58704,6440.861434
max,69995.685578,63.971796,13766.051239


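### z-score scaling (standardization)

The columns are on very different scales. Suppose the average age is 30 and the average income is 50000:

- Harshit, age 32: 2 units above the average age
- Harshit, income 60000: 10000 units above the average income
- Rohan, age 27: 3 units below the average age

Raw differences such as 2 and 10000 cannot be compared directly. z-score scaling expresses each value as the number of standard deviations it lies from its column's mean, $z = (x - \text{mean}) / \text{std}$, so every scaled column ends up with mean 0 and standard deviation 1.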
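To make the idea concrete, here is a minimal sketch of z-score scaling in plain Python. The two extra people and Rohan's income are hypothetical values, chosen only so that the averages come out at 30 and 50000 as in the note above. It uses the population standard deviation, which is also what `StandardScaler` uses.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return (value - mean) / std for each value, using the population std."""
    mu = mean(values)
    sigma = pstdev(values)  # population std (divides by n, not n - 1)
    return [(v - mu) / sigma for v in values]

# Hypothetical sample: Harshit, Rohan, and two made-up people
ages = [32, 27, 30, 31]                  # average 30
incomes = [60000, 45000, 50000, 45000]   # average 50000

age_z = z_scores(ages)
income_z = z_scores(incomes)

# Harshit: raw deviations are +2 and +10000, but the z-scores are on the same scale
print(round(age_z[0], 2), round(income_z[0], 2))  # 1.07 1.63
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates just because its raw numbers are larger.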

In [10]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# replace each column with its z-scores (mean 0, standard deviation 1)
df[["loan", "age", "income"]] = sc.fit_transform(df[["loan", "age", "income"]])

df



,clientid,income,age,loan,default
0,1,1.453898,1.336861,1.201907,0
1,2,-0.762398,0.536639,0.695744,0
2,3,0.836733,1.637207,1.173812,0
3,4,-0.183244,0.362998,0.544366,0
4,5,1.509532,-1.631534,1.419754,1
...,...,...,...,...,...
1992,1996,0.969671,0.566081,-0.826899,0
1993,1997,1.688523,-1.295454,-0.309357,0
1994,1998,-0.071390,-0.939016,0.353673,1
1995,1999,-0.110170,1.700619,-0.926703,0


In [11]:
# class balance: share of non-defaulters (0) and defaulters (1)


df['default'].value_counts(normalize=True)

0    0.858287
1    0.141713
Name: default, dtype: float64
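Because the classes are imbalanced, accuracy has to be read against a baseline. As a rough sketch (the shares are copied from the output above; the variable names are only illustrative), a "model" that always predicts the majority class would already score about 86% on the full data:

```python
# Class shares copied from the value_counts(normalize=True) output above
class_share = {0: 0.858287, 1: 0.141713}

# Always predicting the most common class is right exactly as often as that class occurs
majority_class = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_class]

print(majority_class, round(baseline_accuracy, 3))  # 0 0.858
```

Any trained model should be compared with this number rather than with 50%.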

### step 3 : Selection of feature & target!

In [12]:
features=df[["income","age","loan"]]
label=df[['default']]

### step 4 : splitting into training & testing

In [13]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(features,label,test_size=0.2)
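The split above is random, so every run of the notebook gives a different train/test split and slightly different scores. `train_test_split` accepts a `random_state` argument to fix the shuffle. The sketch below is not sklearn's implementation, only the underlying idea in plain Python; `shuffled_split` and the seed value 42 are hypothetical choices.

```python
import math
import random

def shuffled_split(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows with a fixed seed, then cut off the test share."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)          # same seed -> same shuffle every run
    n_test = math.ceil(len(rows) * test_size)  # test share rounded up
    return rows[n_test:], rows[:n_test]        # (train, test)

row_ids = range(1997)  # stand-in for the 1997 rows left after dropna
train, test = shuffled_split(row_ids)
print(len(train), len(test))  # 1597 400
```

The 400 test rows match the number of test predictions that appear in this notebook's results.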

### step 5 : train the model

In [14]:
from sklearn.tree import DecisionTreeClassifier
model=DecisionTreeClassifier() #object

In [15]:
model.fit(x_train,y_train)

### ask the model to predict the category (0 or 1) for the test data

In [16]:
ans=model.predict(x_test)

### convert predicted values into a one column data frame

In [17]:
predicted=pd.DataFrame(ans, columns=["Prediction"])
predicted

,Prediction
0,0
1,0
2,0
3,0
4,0
...,...
395,0
396,0
397,0
398,0


### take the actual answers from y_test and make them a data frame (reset the index, since the split shuffles it)

In [18]:
actual_test=pd.DataFrame(y_test).reset_index(drop=True)


In [19]:
pd.concat([predicted,actual_test],axis=1)

,Prediction,default
0,0,0
1,0,0
2,0,0
3,0,0
4,0,0
...,...,...
395,0,0
396,0,0
397,0,0
398,0,0


In [20]:
from sklearn.metrics import accuracy_score

accuracy_score(actual_test,predicted)

0.985

In [21]:
from sklearn.metrics import confusion_matrix
confusion_matrix(actual_test,predicted)

array([[339,   3],
       [  3,  55]])
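The confusion matrix contains everything needed to recompute the accuracy and to look at the defaulter class on its own. A small sketch using the numbers printed above (sklearn puts actual classes on rows and predicted classes on columns, with labels in sorted order); the variable names are illustrative:

```python
# Confusion matrix copied from the output above:
# rows = actual class (0, 1), columns = predicted class (0, 1)
cm = [[339, 3],
      [3, 55]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total  # correct predictions sit on the diagonal

# Treating class 1 (defaulter) as the class of interest
recall_defaulter = cm[1][1] / (cm[1][0] + cm[1][1])     # caught defaulters / actual defaulters
precision_defaulter = cm[1][1] / (cm[0][1] + cm[1][1])  # real defaulters / predicted defaulters

print(total, accuracy)                                            # 400 0.985
print(round(recall_defaulter, 3), round(precision_defaulter, 3))  # 0.948 0.948
```

So the model misses 3 of the 58 actual defaulters in the test set, which the overall accuracy of 0.985 does not show by itself.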

### identify data points that were predicted incorrectly by the model

In [22]:
#storing those comparison table values in a variable
result = pd.concat([predicted,actual_test],axis=1)

#design a filter (to identify rows where values don't match)
condition = (    result["Prediction"] != result["default"]  )

#applying the filter
result[condition]

,Prediction,default
13,1,0
173,0,1
195,1,0
215,0,1
327,1,0
358,0,1


In [23]:

df["default"] = df["default"].astype("category")
px.scatter(
    y="loan",
    x="income",
    data_frame=df,
    color="default",
    color_discrete_map={    0: "green", 1: "red"   }
)

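The model is right 394 times and wrong 6 times on the 400 test rows (accuracy 394/400 = 0.985).

There are 4 possible scenarios when we make a prediction (assume that 0 means positive and 1 means negative):

| Model's Prediction | Actual Answer | Result |
|---|---|---|
| 0 | 0 | True Positive |
| 0 | 1 | False Positive |
| 1 | 0 | False Negative |
| 1 | 1 | True Negative |

Arranged as a grid with predictions on rows and actual answers on columns:

|  | Actual 0 | Actual 1 |
|---|---|---|
| Predicted 0 | TP | FP |
| Predicted 1 | FN | TN |

Note that sklearn's `confusion_matrix` output is laid out the other way round, with actual answers on rows and predictions on columns.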
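As a quick check of the four scenarios, the sketch below labels each (prediction, actual) pair using the same convention (0 is positive, 1 is negative) and applies it to the six mismatched rows found by the filter earlier. The helper name `scenario` is hypothetical.

```python
from collections import Counter

POSITIVE = 0  # convention used above: 0 is "positive", 1 is "negative"

def scenario(predicted, actual, positive=POSITIVE):
    """Name the outcome of one prediction: TP, FP, FN or TN."""
    if predicted == positive:
        return "TP" if actual == positive else "FP"
    return "FN" if actual == positive else "TN"

# (Prediction, default) pairs of the six mismatched rows listed earlier
mismatches = [(1, 0), (0, 1), (1, 0), (0, 1), (1, 0), (0, 1)]

counts = Counter(scenario(p, a) for p, a in mismatches)
print(counts["FP"], counts["FN"])  # 3 3
```

The 3 false positives and 3 false negatives agree with the off-diagonal entries of the confusion matrix.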




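P : probability of somebody being a defaulter
1-P : probability of the same person not being a defaulter

Log odds means the logarithm of the ratio of P to 1-P. Model the log odds as a linear function of x:

$$\log\left(\frac{P}{1-P}\right) = m x + C$$

Taking the exponential of both sides:

$$\frac{P}{1-P} = e^{m x + C}$$

Solving for P (note the minus sign in the exponent of the final form):

$$P = \frac{e^{m x + C}}{1 + e^{m x + C}} = \frac{1}{1 + e^{-(m x + C)}}$$

--------------------------------------

SIGMOID FUNCTION

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

so $P = \sigma(m x + C)$.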

In [24]:
# the sigmoid always returns a value strictly between 0 and 1
# (very close to 0 for large negative inputs, very close to 1 for large positive inputs)
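The sigmoid and the log odds can be tried out directly. This is a minimal sketch in plain Python; the slope `m` and intercept `c` passed to `predict_probability` are made-up numbers for illustration, not values fitted to this dataset.

```python
import math

def sigmoid(x):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def log_odds(p):
    """Logarithm of the ratio p / (1 - p); the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))

def predict_probability(x, m, c):
    """P = sigmoid(m * x + c), the probability implied by linear log odds."""
    return sigmoid(m * x + c)

print(sigmoid(0))                    # 0.5
print(sigmoid(-10) < 0.001)          # True: close to 0
print(sigmoid(10) > 0.999)           # True: close to 1

# Hypothetical slope and intercept, only to show the round trip
p = predict_probability(2.0, m=0.8, c=-0.5)
print(round(log_odds(p), 6))         # 1.1, which equals 0.8 * 2.0 - 0.5
```

Taking the log odds of the predicted probability gives back `m * x + c`, which is the relationship the derivation starts from.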