# Logistic regression: This type of statistical model (also known as logit model) is often used for classification and predictive analytics. Logistic regression estimates the probability of an event occurring, such as voted or didn't vote, based on a given dataset of independent variables.
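As a minimal sketch of how a logit model turns inputs into a probability: the model computes a weighted sum of the independent variables and passes it through the logistic (sigmoid) function. The coefficients and intercept below are made up for illustration and are not fitted to any dataset; the function names are also illustrative.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, coefficients, intercept):
    """Probability of the event for one observation."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return sigmoid(z)

# Hypothetical weights for two features (illustrative only, not fitted)
example_coefficients = [0.05, 0.02]
example_intercept = -4.0
p = predict_probability([40, 15], example_coefficients, example_intercept)
print(round(p, 3))
```

A probability above a chosen cut-off (0.5 by default in most libraries) is reported as class 1, otherwise class 0.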

# Problem Statement: Suppose a person is 40 years of age, currently smokes 15 cigarettes per day, takes BP medication, and has a total cholesterol of 200, a systolic BP of 120 and a diastolic BP of 90. Will this person have a heart attack in the next ten years or not?

In [20]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [3]:
#Step 1: Read the data
df = pd.read_csv(r"E:/Dataset/framingham.csv")
df.head()

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0


In [61]:
# Step 2: Data Pre-Processing:
# Creating a subset of data with only input and output variables
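# .copy() makes df_new an independent DataFrame, so the later fillna assignments do not act on a slice of df
df_new = df[['age','cigsPerDay','BPMeds','totChol', 'sysBP','diaBP','TenYearCHD']].copy()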
df_new.head()

Unnamed: 0,age,cigsPerDay,BPMeds,totChol,sysBP,diaBP,TenYearCHD
0,39,0.0,0.0,195.0,106.0,70.0,0
1,46,0.0,0.0,250.0,121.0,81.0,0
2,48,20.0,0.0,245.0,127.5,80.0,0
3,61,30.0,0.0,225.0,150.0,95.0,1
4,46,23.0,0.0,285.0,130.0,84.0,0


In [11]:
# Check the missing values in dataset
df_new.isnull().sum() # isnull().sum() counts the missing values in each column

age            0
cigsPerDay    29
BPMeds        53
totChol       50
sysBP          0
diaBP          0
TenYearCHD     0
dtype: int64

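Total number of missing values: 132 (29 + 53 + 50)

Total number of rows: 4240

Missing values relative to rows: 132/4240 ≈ 3.1%

So at most about 3% of the rows in our dataset have a missing value (fewer if some rows are missing more than one column).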
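The count of 132 refers to missing cells, not rows, so the share of affected rows can be lower when one row is missing several columns. A minimal sketch of both calculations, using a small made-up DataFrame (`toy` is a hypothetical stand-in for `df_new`):

```python
import numpy as np
import pandas as pd

# Made-up data, only to illustrate the two counts
toy = pd.DataFrame({
    'cigsPerDay': [0.0, np.nan, 20.0, np.nan],
    'BPMeds':     [0.0, np.nan, 0.0, 1.0],
    'totChol':    [195.0, 250.0, np.nan, 225.0],
})

missing_cells = int(toy.isnull().sum().sum())             # total missing values
rows_with_missing = int(toy.isnull().any(axis=1).sum())   # rows with at least one missing value
pct_rows_missing = 100 * rows_with_missing / len(toy)

print(missing_cells, rows_with_missing, pct_rows_missing)
```

On the real data the same two expressions applied to `df_new` give the exact share of rows that dropping would remove.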

If missing values make up less than about 10%-15% of the overall data, we can drop the affected rows.

Alternatively, we can replace the missing values and then build our model.

We will replace the missing values in our data:

Categorical column: Mode

Numerical column (i.e. totChol, cigsPerDay): Mean/Median

(a) Mean: if the data is roughly normal (symmetric), i.e. the skew value is between -1 and +1

(b) Median: if the data is not normal (skewed), i.e. the skew value is < -1 or > +1

In [14]:
# For Numerical column (i.e totChol,cigsPerDay) : Checking skewness value
print(df_new['cigsPerDay'].skew())
print(df_new['totChol'].skew())

1.2470523561848126
0.8718805634765354


For numerical columns:

1. The skewness of totChol is 0.8718805634765354, which is between -1 and +1 (data is roughly normal), so we can replace its missing values with the mean.

2. The skewness of cigsPerDay is 1.2470523561848126, which is greater than +1 (data is not normal), so we replace its missing values with the median.

For categorical columns:

3. BPMeds is categorical, so we will replace its missing values with the mode, i.e. 0.
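The imputation rule described above can be sketched as two small helpers. The function names and the toy series are hypothetical; the ±1 skew threshold is the rule of thumb used in this notebook, not a universal standard.

```python
import numpy as np
import pandas as pd

def fill_numeric_by_skew(series, threshold=1.0):
    """Fill NaNs with the mean if |skew| <= threshold, otherwise with the median."""
    use_mean = abs(series.skew()) <= threshold
    fill_value = series.mean() if use_mean else series.median()
    return series.fillna(fill_value)

def fill_categorical_by_mode(series):
    """Fill NaNs with the most frequent value."""
    return series.fillna(series.mode().iloc[0])

# Made-up series, each ending in one missing value
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])
skewed = pd.Series([1.0, 1.0, 1.0, 1.0, 100.0, np.nan])
category = pd.Series([0.0, 0.0, 1.0, np.nan])

print(fill_numeric_by_skew(symmetric).iloc[-1])   # filled with the mean
print(fill_numeric_by_skew(skewed).iloc[-1])      # filled with the median
print(fill_categorical_by_mode(category).iloc[-1])
```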

In [15]:
# For the categorical column (i.e. BPMeds): let's identify the mode
df_new['BPMeds'].value_counts()

0.0    4063
1.0     124
Name: BPMeds, dtype: int64

In [62]:
#fillna() will replace the missing values
df_new['cigsPerDay'] = df_new['cigsPerDay'].fillna(df_new['cigsPerDay'].median())
df_new['totChol'] = df_new['totChol'].fillna(df_new['totChol'].mean())
df_new['BPMeds'] = df_new['BPMeds'].fillna(0)

In [63]:
df_new.isnull().sum()

age           0
cigsPerDay    0
BPMeds        0
totChol       0
sysBP         0
diaBP         0
TenYearCHD    0
dtype: int64

In [64]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4240 entries, 0 to 4239
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   age         4240 non-null   int64  
 1   cigsPerDay  4240 non-null   float64
 2   BPMeds      4240 non-null   float64
 3   totChol     4240 non-null   float64
 4   sysBP       4240 non-null   float64
 5   diaBP       4240 non-null   float64
 6   TenYearCHD  4240 non-null   int64  
dtypes: float64(5), int64(2)
memory usage: 232.0 KB


In [65]:
# Step 3 : Defining X and Y
X = df_new[['age','cigsPerDay','BPMeds','totChol','sysBP','diaBP']]
Y = df_new[['TenYearCHD']]

In [66]:
# Step 4: Splitting the data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, train_size = 0.8, random_state=1234)
 
len(X_train), len(X_test), len(Y_train), len(Y_test)

(3392, 848, 3392, 848)

In [67]:
# Step 5: Creating the model using the training data set

# step a: Create a model object 
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression()

# step b: Fit the model object to the training data to build a model
# (the target is passed as a 1-D column, which is the shape scikit-learn expects)
model = LR.fit(X_train, Y_train['TenYearCHD'])
model

LogisticRegression()

In [68]:
# Step 6: Predict the values on test data using your model
Y_test['predicted'] = model.predict(X_test)

In [69]:
Y_test

Unnamed: 0,TenYearCHD,predicted
1226,0,0
1011,0,0
165,0,0
1311,0,0
1712,0,0
...,...,...
2981,1,0
374,0,0
2014,0,0
2010,0,0
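
To answer the question from the problem statement, the fitted model can also score a single new person. The sketch below is self-contained, so it trains on synthetic stand-in data rather than the Framingham file; `X_demo`, `y_demo` and `demo_model` are hypothetical names, and the printed numbers say nothing about real risk. With the notebook's own `model`, the same `predict` and `predict_proba` calls apply. "Taking BP medicines" is encoded as `BPMeds = 1`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

FEATURES = ['age', 'cigsPerDay', 'BPMeds', 'totChol', 'sysBP', 'diaBP']

# Synthetic stand-in data (NOT the Framingham data), only so the sketch runs on its own
rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame({
    'age': rng.integers(30, 70, n),
    'cigsPerDay': rng.integers(0, 40, n).astype(float),
    'BPMeds': rng.integers(0, 2, n).astype(float),
    'totChol': rng.normal(235, 40, n),
    'sysBP': rng.normal(130, 20, n),
    'diaBP': rng.normal(82, 12, n),
})[FEATURES]

# Hypothetical labels: risk made to rise with age and systolic BP
score = 0.08 * (X_demo['age'] - 50) + 0.03 * (X_demo['sysBP'] - 130)
y_demo = (rng.random(n) < 1 / (1 + np.exp(-score))).astype(int)

demo_model = LogisticRegression(max_iter=5000).fit(X_demo, y_demo)

# The person from the problem statement, in the same column order as training
person = pd.DataFrame([[40, 15, 1, 200, 120, 90]], columns=FEATURES)
predicted_class = demo_model.predict(person)[0]            # 0 or 1
chd_probability = demo_model.predict_proba(person)[0, 1]   # P(class 1)
print(predicted_class, round(chd_probability, 3))
```

Reporting the probability alongside the 0/1 label is useful here because the label only reflects whether the probability crossed the default 0.5 cut-off.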


In [72]:
# Step 7 : Using Confusion Matrix -> Check the accuracy of model
from sklearn.metrics import confusion_matrix
confusion_matrix(Y_test['TenYearCHD'],Y_test['predicted'])

array([[716,   1],
       [122,   9]], dtype=int64)

In [73]:
(716+9)/(716+1+122+9)

0.8549528301886793

Accuracy of the model is 85.50%
More than 70% is often considered a good model, but accuracy on its own can be misleading when one class dominates the data

In [77]:
from sklearn.metrics import classification_report
print(classification_report(Y_test['TenYearCHD'],Y_test['predicted']))
# Ideally, recall should be similar for both classes; the model should not be biased toward one class

              precision    recall  f1-score   support

           0       0.85      1.00      0.92       717
           1       0.90      0.07      0.13       131

    accuracy                           0.85       848
   macro avg       0.88      0.53      0.52       848
weighted avg       0.86      0.85      0.80       848
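
The class 1 figures in the report can be reproduced by hand from the confusion matrix, which also shows how little the accuracy says here. The sketch uses only the four counts from the notebook output; the variable names are illustrative.

```python
# Counts from the confusion matrix output: rows = actual class, columns = predicted class
tn, fp = 716, 1    # actual 0: predicted 0, predicted 1
fn, tp = 122, 9    # actual 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
baseline_accuracy = (tn + fp) / total   # accuracy of always predicting class 0
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f} baseline={baseline_accuracy:.4f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```

Always predicting class 0 would already score about 84.6%, so the fitted model gains less than one percentage point while finding only 9 of the 131 actual positive cases.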



# Next, we will check whether our dataset is balanced or imbalanced. SMOTE can be used to handle imbalance in the data.
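
A quick way to check balance is to look at the class proportions. The sketch below rebuilds the test-set labels from the support counts in the classification report (717 zeros, 131 ones) instead of loading the CSV; on the real data the same method would be called on `df_new['TenYearCHD']`. SMOTE itself comes from the separate imbalanced-learn package and is not shown here.

```python
import pandas as pd

# Test-set labels rebuilt from the support column of the classification report
labels = pd.Series([0] * 717 + [1] * 131, name='TenYearCHD')

class_share = labels.value_counts(normalize=True)          # proportion of each class
imbalance_ratio = class_share.max() / class_share.min()    # majority : minority

print(class_share.round(3))
print(round(imbalance_ratio, 2))
```

With roughly five negatives for every positive, the classes are clearly imbalanced, which is consistent with the low recall for class 1.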