# Profit Prediction
## Regression using Support Vector Regression (SVR)

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore') 

### Importing & Loading the dataset

In [2]:
df1 = pd.read_csv('train2.csv')
df1.head()

|   | id | discount | price | no of items | location | class | segment | delivery_type | RID | address code | profit |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.2 | 16.448 | 2 | Central | kariox | Consumer | Standard Class | 7981 | 77095 | 5.5512 |
| 1 | 2 | 0.0 | 29.7 | 5 | Central | kariox | Consumer | Standard Class | 6334 | 48185 | 13.365 |
| 2 | 3 | 0.0 | 14.73 | 3 | Central | qexty | Consumer | Standard Class | 6333 | 48185 | 4.8609 |
| 3 | 4 | 0.0 | 43.92 | 3 | Central | kariox | Consumer | Standard Class | 6332 | 48185 | 12.7368 |
| 4 | 5 | 0.0 | 66.58 | 2 | Central | kariox | Consumer | Standard Class | 6331 | 48185 | 15.9792 |


In [3]:
df2 = pd.read_csv('test2.csv')
df2.head()

|   | id | discount | price | no of items | location | class | segment | delivery_type | RID | address code |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6701 | 0.0 | 24.2 | 5 | West | kariox | Consumer | Standard Class | 2408 | 94122 |
| 1 | 6702 | 0.2 | 359.976 | 3 | West | fynota | Consumer | Standard Class | 2409 | 94122 |
| 2 | 6703 | 0.0 | 3.52 | 2 | East | kariox | Consumer | Standard Class | 5425 | 6708 |
| 3 | 6704 | 0.2 | 11.52 | 5 | Central | kariox | Consumer | First Class | 7408 | 60653 |
| 4 | 6705 | 0.0 | 242.94 | 3 | West | kariox | Home Office | Standard Class | 733 | 98115 |


### Dataset Info:

In [4]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6700 entries, 0 to 6699
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             6700 non-null   int64  
 1   discount       6700 non-null   float64
 2   price          6700 non-null   float64
 3   no of items    6700 non-null   int64  
 4   location       6700 non-null   object 
 5   class          6700 non-null   object 
 6   segment        6700 non-null   object 
 7   delivery_type  6699 non-null   object 
 8   RID            6700 non-null   int64  
 9   address code   6700 non-null   int64  
 10  profit         6700 non-null   float64
dtypes: float64(3), int64(4), object(4)
memory usage: 575.9+ KB


In [8]:
df1.dropna(inplace=True)

### Dataset Shape:

In [9]:
df1.shape

(6699, 11)

## Data Cleaning

### Checking the Missing Values

In [10]:
df2.isnull().sum()

id               0
discount         0
price            0
no of items      0
location         0
class            0
segment          0
delivery_type    0
RID              0
address code     0
dtype: int64

### Let's check the training data for missing values one final time

The only incomplete training row (a missing `delivery_type`) was already removed from `df1` with `dropna` above.

In [11]:
df1.isnull().sum()

id               0
discount         0
price            0
no of items      0
location         0
class            0
segment          0
delivery_type    0
RID              0
address code     0
profit           0
dtype: int64

Only one training row was incomplete (a missing `delivery_type`), and it was dropped, reducing the training set from 6,700 to 6,699 rows. The model is therefore trained on complete records only.
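As a minimal sketch of the same clean-up step, the snippet below counts missing values and then drops incomplete rows. The `toy` frame is a hypothetical three-row stand-in for `df1`, not the real data, and it uses the non-in-place form of `dropna` so the original frame stays available for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the training data: one row lacks delivery_type.
toy = pd.DataFrame({
    "price": [16.448, 29.7, 14.73],
    "delivery_type": ["Standard Class", np.nan, "First Class"],
    "profit": [5.5512, 13.365, 4.8609],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop incomplete rows and report how many were removed.
cleaned = toy.dropna().reset_index(drop=True)
rows_dropped = len(toy) - len(cleaned)
print(f"dropped {rows_dropped} row(s); new shape: {cleaned.shape}")
```

Reporting the number of dropped rows makes it easy to confirm that the clean-up removed only what was expected.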

### Now, let's check the shape of the test dataset

In [12]:
df2.shape

(3294, 10)

### Exploratory Data Analysis

#### Unique values of the categorical features:

In [13]:
df1['location'].unique()

array(['Central', 'South', 'West', 'East'], dtype=object)

In [14]:
df1['class'].unique()

array(['kariox', 'qexty', 'fynota'], dtype=object)

In [15]:
df1['segment'].unique()

array(['Consumer', 'Corporate', 'Home Office'], dtype=object)

In [16]:
df1['delivery_type'].unique()



array(['Standard Class', 'First Class', 'Second Class', 'Same Day'],
      dtype=object)

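### Let's convert the categorical variables to numerical form and display the value counts

scikit-learn estimators expect numeric input, so each categorical column is mapped to integer codes. Note that integer codes impose an arbitrary ordering on the categories.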

In [18]:
# Assign the mapped column back; in-place replace on a selected column is unreliable in recent pandas.
df1['class'] = df1['class'].map({'kariox':2,'qexty':1,'fynota':0})

In [19]:
# Same encoding for the test set, so the codes match those used in training.
df2['class'] = df2['class'].map({'kariox':2,'qexty':1,'fynota':0})

In [20]:
df1['class'].value_counts()

2    4033
1    1437
0    1229
Name: class, dtype: int64

In [21]:
df1.segment=df1.segment.map({'Home Office':0,'Corporate':1,'Consumer':2})
df1['segment'].value_counts()

2    3522
1    2048
0    1129
Name: segment, dtype: int64

In [22]:
df2.segment=df2.segment.map({'Home Office':0,'Corporate':1,'Consumer':2})
df2['segment'].value_counts()

2    1668
1     972
0     654
Name: segment, dtype: int64

In [23]:
df1.location=df1.location.map({'East':0,'West':1,'South':2,'Central':3})
df1['location'].value_counts()

1    2117
0    1929
3    1552
2    1101
Name: location, dtype: int64

In [24]:
df2.location=df2.location.map({'East':0,'West':1,'South':2,'Central':3})
df2['location'].value_counts()

1    1085
0     919
3     771
2     519
Name: location, dtype: int64

In [25]:
df1.delivery_type=df1.delivery_type.map({'Standard Class':0, 'First Class':1, 'Second Class':2, 'Same Day':3})
df1['delivery_type'].value_counts()

0    4086
2    1290
1     966
3     357
Name: delivery_type, dtype: int64

In [26]:
df2.delivery_type=df2.delivery_type.map({'Standard Class':0, 'First Class':1, 'Second Class':2, 'Same Day':3})
df2['delivery_type'].value_counts()

0    1882
2     654
1     572
3     186
Name: delivery_type, dtype: int64
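The integer mappings above treat categories as if they were ordered (for example East < West < South < Central), which a distance-based model will take literally. One commonly used alternative is one-hot encoding. The sketch below compares the two on a hypothetical five-row `toy` frame; it is an illustration of the option, not the encoding this notebook uses.

```python
import pandas as pd

# Hypothetical sample of the location column.
toy = pd.DataFrame({"location": ["Central", "South", "West", "East", "West"]})

# Integer codes, as in the cells above: compact, but they imply an order.
location_codes = toy["location"].map({"East": 0, "West": 1, "South": 2, "Central": 3})

# One-hot alternative: one 0/1 indicator column per category, no implied order.
location_onehot = pd.get_dummies(toy["location"], prefix="location", dtype=int)

print(location_codes.tolist())
print(location_onehot)
```

One-hot encoding adds one column per category, so it trades a wider feature matrix for the removal of the artificial ordering.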

In [27]:
df1.head()

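|   | id | discount | price | no of items | location | class | segment | delivery_type | RID | address code | profit |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.2 | 16.448 | 2 | 3 | 2 | 2 | 0 | 7981 | 77095 | 5.5512 |
| 1 | 2 | 0.0 | 29.7 | 5 | 3 | 2 | 2 | 0 | 6334 | 48185 | 13.365 |
| 2 | 3 | 0.0 | 14.73 | 3 | 3 | 1 | 2 | 0 | 6333 | 48185 | 4.8609 |
| 3 | 4 | 0.0 | 43.92 | 3 | 3 | 2 | 2 | 0 | 6332 | 48185 | 12.7368 |
| 4 | 5 | 0.0 | 66.58 | 2 | 3 | 2 | 2 | 0 | 6331 | 48185 | 15.9792 |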


In [28]:
df2.head()

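|   | id | discount | price | no of items | location | class | segment | delivery_type | RID | address code |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6701 | 0.0 | 24.2 | 5 | 1 | 2 | 2 | 0 | 2408 | 94122 |
| 1 | 6702 | 0.2 | 359.976 | 3 | 1 | 0 | 2 | 0 | 2409 | 94122 |
| 2 | 6703 | 0.0 | 3.52 | 2 | 0 | 2 | 2 | 0 | 5425 | 6708 |
| 3 | 6704 | 0.2 | 11.52 | 5 | 3 | 2 | 2 | 1 | 7408 | 60653 |
| 4 | 6705 | 0.0 | 242.94 | 3 | 1 | 2 | 0 | 0 | 733 | 98115 |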


In [29]:
X1 = df1.iloc[:, 1:-1].values
y1 = df1.iloc[:, -1].values

In [30]:
X1

array([[2.00000e-01, 1.64480e+01, 2.00000e+00, ..., 0.00000e+00,
        7.98100e+03, 7.70950e+04],
       [0.00000e+00, 2.97000e+01, 5.00000e+00, ..., 0.00000e+00,
        6.33400e+03, 4.81850e+04],
       [0.00000e+00, 1.47300e+01, 3.00000e+00, ..., 0.00000e+00,
        6.33300e+03, 4.81850e+04],
       ...,
       [0.00000e+00, 2.04000e+00, 1.00000e+00, ..., 0.00000e+00,
        9.39000e+03, 5.40800e+03],
       [1.00000e-01, 2.07846e+02, 3.00000e+00, ..., 2.00000e+00,
        4.40000e+02, 1.00240e+04],
       [2.00000e-01, 1.60776e+02, 3.00000e+00, ..., 0.00000e+00,
        1.65900e+03, 9.00450e+04]])

### Extracting the test set features

In [31]:
X2 = df2.iloc[:, 1:].values

In [32]:
X2

array([[0.00000e+00, 2.42000e+01, 5.00000e+00, ..., 0.00000e+00,
        2.40800e+03, 9.41220e+04],
       [2.00000e-01, 3.59976e+02, 3.00000e+00, ..., 0.00000e+00,
        2.40900e+03, 9.41220e+04],
       [0.00000e+00, 3.52000e+00, 2.00000e+00, ..., 0.00000e+00,
        5.42500e+03, 6.70800e+03],
       ...,
       [0.00000e+00, 2.72940e+02, 3.00000e+00, ..., 2.00000e+00,
        1.16900e+03, 1.00350e+04],
       [2.00000e-01, 1.13568e+02, 2.00000e+00, ..., 0.00000e+00,
        4.14000e+02, 9.41100e+04],
       [2.00000e-01, 3.02400e+00, 3.00000e+00, ..., 0.00000e+00,
        5.09200e+03, 8.05380e+04]])

In [33]:
y1

array([ 5.5512, 13.365 ,  4.8609, ...,  0.9588,  2.3094, 10.0485])

In [34]:
y1 = y1.reshape(len(y1),1)
y1

array([[ 5.5512],
       [13.365 ],
       [ 4.8609],
       ...,
       [ 0.9588],
       [ 2.3094],
       [10.0485]])

### Support Vector Regression (SVR)

The target, `profit`, is a continuous value, so this is a regression problem rather than a binary classification one. Logistic regression, which models P(Y=1) as a function of X, does not apply here; the model fitted below is a support vector regressor (SVR) with an RBF kernel.

An RBF kernel depends on distances between samples, so features on large scales (such as `address code`) would otherwise dominate. The features and the target are therefore standardized before fitting.

In [35]:
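from sklearn.preprocessing import StandardScaler

# Standardize features and target to zero mean and unit variance.
sc_X1 = StandardScaler()
sc_X2 = StandardScaler()
sc_y1 = StandardScaler()
X1 = sc_X1.fit_transform(X1)
# Caution: sc_X2 is fitted on the test features themselves, so train and test
# are scaled with different statistics; sc_X1.transform(X2) would reuse the
# training mean and standard deviation instead.
X2 = sc_X2.fit_transform(X2)
y1 = sc_y1.fit_transform(y1)  # predictions will therefore be in standardized units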
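The cell above fits a separate scaler on `X2`. A common convention is instead to learn the scaling statistics from the training features only and reuse them on the test features, so both sets pass through the same transformation. The sketch below illustrates that convention on small hypothetical arrays (`X_train`, `X_test`) standing in for `X1` and `X2`; it is not the code that produced the outputs in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for X1 (train) and X2 (test).
X_train = np.array([[0.2, 16.0], [0.0, 30.0], [0.0, 14.0], [0.1, 44.0]])
X_test = np.array([[0.0, 24.0], [0.2, 360.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training set only
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std

print(X_train_scaled.mean(axis=0))
print(X_test_scaled)
```

With this approach a given raw feature value maps to the same scaled value in both sets, which is what the fitted model assumes at prediction time.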

In [37]:
sample = pd.read_csv('sampleSolution.csv')
y2 = sample.iloc[:, -1].values

In [38]:
y2

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

#### Fitting the SVR model

In [36]:
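from sklearn.svm import SVR

# RBF-kernel support vector regressor; ravel() passes the target as the 1-D array SVR expects.
regressor = SVR(kernel = 'rbf')
regressor.fit(X1, y1.ravel())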

SVR()

In [40]:
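y_pred = regressor.predict(X2)
# Predictions (left column, in standardized profit units) next to the sample-solution values (right column).
print(np.concatenate((y_pred.reshape(len(y_pred),1), y2.reshape(len(y2),1)),1))

print(y_pred)

print(y_pred.reshape(len(y_pred),1))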

[[-0.02717758  0.        ]
 [ 0.03083043  0.        ]
 [-0.08250771  0.        ]
 ...
 [ 0.17017487  0.        ]
 [-0.12598762  0.        ]
 [-0.07036839  0.        ]]
[-0.02717758  0.03083043 -0.08250771 ...  0.17017487 -0.12598762
 -0.07036839]
[[-0.02717758]
 [ 0.03083043]
 [-0.08250771]
 ...
 [ 0.17017487]
 [-0.12598762]
 [-0.07036839]]
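The predictions printed above are in standardized units because the target was scaled before fitting. To report them as profit values, the target scaler's `inverse_transform` can be applied. The sketch below shows the idea on synthetic data; the arrays and the names `sc_X`, `sc_y`, and `model` are hypothetical stand-ins, not the notebook's fitted objects.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data (assumption: purely illustrative, not the real dataset).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
y_train = 10.0 + 5.0 * X_train[:, 0] + rng.normal(scale=0.1, size=50)
X_test = rng.normal(size=(5, 3))

# Scale features and target; the target scaler is kept for the inverse step.
sc_X = StandardScaler()
sc_y = StandardScaler()
X_train_s = sc_X.fit_transform(X_train)
y_train_s = sc_y.fit_transform(y_train.reshape(-1, 1)).ravel()

model = SVR(kernel="rbf")
model.fit(X_train_s, y_train_s)

# Predict in standardized units, then map back to the original target units.
pred_scaled = model.predict(sc_X.transform(X_test))
pred_original = sc_y.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()

print(pred_scaled)
print(pred_original)
```

`inverse_transform` expects a 2-D array, hence the `reshape(-1, 1)` before the call and the `ravel()` after it.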


In [41]:
df = pd.DataFrame(y_pred)
print(df)

df.to_csv(r'E:\data analysis\project kaggle\1kdag\abcd.csv', index = False)

             0
0    -0.027178
1     0.030830
2    -0.082508
3    -0.014260
4     0.108193
...        ...
3289 -0.049494
3290 -0.062632
3291  0.170175
3292 -0.125988
3293 -0.070368

[3294 rows x 1 columns]
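The file written above contains a single unnamed column of predictions. If the predictions need to be matched back to test rows, it may help to write the `id` next to each value. The sketch below assumes a two-column layout with the hypothetical headers `id` and `profit` (the actual format of `sampleSolution.csv` is not shown here) and writes to a temporary directory rather than a fixed local path. The three prediction values are copied from the output above and are still in standardized units.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-ins: the first three test ids and their predictions.
test_ids = np.array([6701, 6702, 6703])
y_pred = np.array([-0.027178, 0.030830, -0.082508])

# Assumed layout: one row per test id with its predicted value.
submission = pd.DataFrame({"id": test_ids, "profit": y_pred})

# Write to a portable temporary location instead of a hard-coded drive path.
out_path = os.path.join(tempfile.gettempdir(), "submission_sketch.csv")
submission.to_csv(out_path, index=False)

# Read the file back to confirm what was written.
round_trip = pd.read_csv(out_path)
print(round_trip)
```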
