# Logistic Regression

_class sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)_

The output of logistic regression always lies strictly between 0 and 1, so it can be read as a probability. That makes it suitable for classification tasks, such as deciding whether a specified email is Spam (1) or Not-Spam/Ham (0), or whether a tumour is malignant/cancerous or not.

If the value is greater than 0.5 (> 0.5) the sample is assigned to class 1 (Spam in this case), and if it is less than 0.5 (< 0.5) it is assigned to class 0 (Ham in this case).

If the model predicts values very close to 1, such as 0.90, 0.97 or 0.99, it is strongly confident that the sample belongs to class 1. Values very close to 0 mean the same for class 0.
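The thresholding rule described above can be sketched in a few lines. The probabilities below are made-up illustrative values, not the output of any fitted model:

```python
import numpy as np

# Hypothetical predicted probabilities for five emails (illustrative values only).
probabilities = np.array([0.03, 0.45, 0.51, 0.90, 0.99])

# Apply the 0.5 threshold: above it is class 1 (Spam), otherwise class 0 (Ham).
labels = (probabilities > 0.5).astype(int)

print(labels)
```

The first two emails end up in class 0 and the last three in class 1.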

###### Now, let us study the mathematical formula of Logistic Regression

First, it is essential to know the formula of the sigmoid function, also called the logistic function:

<img src="https://latex.codecogs.com/gif.latex?%5Cdpi%7B150%7D%20sigmoid%28t%29%20%3D%20%5Cfrac%7B1%7D%7B1%20&plus;%20e%5E%7B-t%7D%7D" title="sigmoid(t) = \frac{1}{1 + e^{-t}}" />


Basically, the sigmoid function takes any real value (from -$ \infty $ to +$ \infty $) and maps it into the open interval (0, 1). It never returns exactly 0 or 1.
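In logistic regression, the input $ t $ of the sigmoid is a linear combination of the features. As a short sketch, using $ w $ for the coefficients and $ b $ for the intercept (notation chosen here, not taken from the original text), the predicted probability of class 1 is:

$$
P(y = 1 \mid x) = sigmoid(w^{T} x + b) = \frac{1}{1 + e^{-(w^{T} x + b)}}
$$

A large positive value of $ w^{T} x + b $ gives a probability close to 1, a large negative value gives a probability close to 0, and a value of exactly 0 gives 0.5.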

###### Now, Let's understand sigmoid function by visualization...

In [None]:
import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

range_vals = np.linspace(-10, 10, 50)

sigmoid_values = sigmoid(range_vals)

In [None]:
import matplotlib.pyplot as plt

plt.plot(range_vals, sigmoid_values)
plt.title("Sigmoid Function")
plt.xlabel("t")
plt.ylabel("sigmoid(t)")
plt.show()

Logistic regression is used for binary classification, as discussed above, so why is it called logistic 'regression'? Because it takes a regression approach: the model outputs a continuous value, and we then set a threshold and assign a class depending on whether the value falls above or below it.

Let us consider a completely fictitious dataset of whether a person is obese or not, given their weight.

Obese == 1

Not Obese == 0

Let us Plot this...

In [None]:
X = [20, 23, 24, 25, 30, 35, 40, 50, 53, 55, 60, 65, 70] # Weights
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1] # Obese or not

In [None]:
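# 'o' draws markers only; the colour comes from the `color` keyword
# (the original 'bo' format string also set a colour, which conflicts with it).
plt.plot(X[:7], y[:7], 'o', color="black")  # not obese
plt.plot(X[7:], y[7:], 'o', color='red')    # obese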
plt.show()

Now, let us fit Linear Regression to this data...

In [None]:
X = np.array(X).reshape(-1, 1)
y = np.array(y).reshape(-1, 1)

In [None]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(X, y)

In [None]:
y_pred = reg.predict(X)

In [None]:
plt.plot(X[:7], y[:7], 'o', color="black")  # not obese
plt.plot(X[7:], y[7:], 'o', color='red')    # obese
plt.plot(X, y_pred)                         # fitted straight line
plt.show()

As you can see, this is not a good technique at all: a straight line fits these 0/1 labels poorly, so Linear Regression fails at this task...
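One concrete way to see the failure is that a straight line is unbounded, so its predictions can fall outside the range 0 to 1 and cannot be read as probabilities. The sketch below refits the same toy data under new variable names; the query weights 10 and 90 are chosen here purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same toy weights and labels as above, under names local to this sketch.
weights = np.array([20, 23, 24, 25, 30, 35, 40, 50, 53, 55, 60, 65, 70]).reshape(-1, 1)
obese = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

line = LinearRegression().fit(weights, obese)

# Predictions for a very low and a very high weight (hypothetical query points).
low, high = line.predict(np.array([[10], [90]]))
print(low, high)
```

The prediction for weight 10 is below 0 and the prediction for weight 90 is above 1, which makes no sense for a quantity meant to represent a class probability.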

This is where the sigmoid (logistic) function comes into play. Plotting the data and a sigmoid curve together makes the reason clear.

In [None]:
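# Sigmoid evaluated at 90 points and plotted against its index (0..89), which
# happens to cover the weight range. The curve is hand-placed for illustration,
# not fitted to the data.
sig_log_vals = sigmoid(np.linspace(-100, 100, 90))

plt.plot(X[:7], y[:7], 'o', color="black")
plt.plot(X[7:], y[7:], 'o', color="red")
plt.plot(sig_log_vals)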
plt.show()

### Building a Logistic Regression Model with scikit-learn

In [None]:
from sklearn.linear_model import LogisticRegression

logit_reg = LogisticRegression().fit(X, y.ravel())

#### Coefficient and intercept of fitted model

In [None]:
logit_reg.coef_

In [None]:
logit_reg.intercept_
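To connect these two numbers back to the sigmoid formula, the sketch below refits the same toy data (under new variable names, so it runs on its own) and rebuilds the class 1 probability by hand. The names `w`, `b` and `boundary` are introduced here for illustration only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same toy weights and labels as in the example above.
X_w = np.array([20, 23, 24, 25, 30, 35, 40, 50, 53, 55, 60, 65, 70]).reshape(-1, 1)
y_w = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

model = LogisticRegression().fit(X_w, y_w)
w = model.coef_[0, 0]      # single coefficient (one feature)
b = model.intercept_[0]    # intercept

# Probability of class 1 computed by hand: sigmoid(w * x + b).
manual_proba = 1 / (1 + np.exp(-(w * X_w.ravel() + b)))
sklearn_proba = model.predict_proba(X_w)[:, 1]

# Weight at which the predicted probability is exactly 0.5.
boundary = -b / w

print(np.allclose(manual_proba, sklearn_proba))
print(boundary)
```

The hand-computed probabilities should match `predict_proba`, and `boundary` is the weight at which the model switches from predicting class 0 to class 1.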

#### Prediction

In [None]:
print(logit_reg.predict([[10]]))
print(logit_reg.predict([[56]]))

We can see that our model predicts weight 10 as 'Not obese' and weight 56 as 'Obese', which is consistent with our data!

In [None]:
y_pred = logit_reg.predict(X)

In [None]:
y_pred

## Working with Real world Dataset

Now we will work with a real-world dataset and make predictions using logistic regression.

### About Dataset :

The dataset is publicly available on the Kaggle website, and it comes from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The classification goal is to predict whether the patient has a 10-year risk of future coronary heart disease (CHD). The dataset provides the patients’ information. It includes over 4,000 records and 15 attributes.

### Importing Dataset with Pandas:

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv(r'../input/heart-disease-prediction-using-logistic-regression/framingham.csv')

In [None]:
df.head()

### About dataset:

• Sex: male or female (Nominal)

• Age: age of the patient (Continuous - although the recorded ages have been truncated to whole numbers, the concept of age is continuous)

• Current Smoker: whether or not the patient is a current smoker (Nominal)

• Cigs Per Day: the number of cigarettes that the person smoked on average in one day (can be considered continuous, as one can have any number of cigarettes, even half a cigarette)

• BP Meds: whether or not the patient was on blood pressure medication (Nominal)

• Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal)

• Prevalent Hyp: whether or not the patient was hypertensive (Nominal)

• Diabetes: whether or not the patient had diabetes (Nominal)

• Tot Chol: total cholesterol level (Continuous)

• Sys BP: systolic blood pressure (Continuous)

• Dia BP: diastolic blood pressure (Continuous)

• BMI: Body Mass Index (Continuous)

• Heart Rate: heart rate (Continuous - in medical research, variables such as heart rate, though in fact discrete, are treated as continuous because of the large number of possible values)

• Glucose: glucose level (Continuous)

• 10-year risk of coronary heart disease (CHD) (binary: “1” means “Yes”, “0” means “No”)

In [None]:
df.info()

As we can see, this data has many null values.

So we will replace each null value with the mean of its column, computed with this pandas function:

In [None]:
df.mean()

In [None]:
df = df.fillna(df.mean())

In [None]:
df.info()
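`df.info()` reports non-null counts per column. A more direct check is to count the missing values before and after filling. Here is a minimal sketch on a small made-up frame (the column names and values are illustrative and are not read from the Framingham file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "glucose": [80.0, np.nan, 100.0],
    "BMI": [22.0, 26.0, np.nan],
})

missing_before = toy.isna().sum()   # missing values per column
filled = toy.fillna(toy.mean())     # each NaN becomes its column mean
missing_after = filled.isna().sum()

print(missing_before)
print(filled)
print(missing_after)
```

Each column starts with one missing value. After filling, the missing `glucose` entry becomes 90.0 and the missing `BMI` entry becomes 24.0, the means of the remaining values in each column.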

We have successfully got rid of the NaN/null values! Now we can prepare our DataFrame for fitting.

In [None]:
X = df.iloc[:, :-1] # X has all columns except the last column because it is the column we have to make predictions for.
y = df.iloc[:, -1]  # y has last column

X = np.array(X)
y = np.array(y)

In [None]:
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(X)

### Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
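Note that the scaler above was fitted on all rows before the split, so the test rows influence the scaling statistics. One common alternative, sketched here on synthetic data rather than the Framingham file, is to put the scaler and the model in a scikit-learn pipeline so that scaling is learned from the training rows only. The `*_demo` names and the synthetic dataset are assumptions made for this illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Framingham dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=15, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, shuffle=False)

# The pipeline fits the scaler on the training rows only, then reuses those
# statistics to transform the test rows before prediction.
demo_model = make_pipeline(StandardScaler(), LogisticRegression())
demo_model.fit(X_tr, y_tr)

demo_accuracy = demo_model.score(X_te, y_te)
print(demo_accuracy)
```

This is one possible arrangement, not the approach used in the rest of this notebook, and it would likely change the exact numbers reported below.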

### Making model

In [None]:
chd = LogisticRegression()

In [None]:
chd.fit(X_train, y_train)

In [None]:
chd.score(X_test, y_test)

In [None]:
y_pred = chd.predict(X_test)

In [None]:
print("Predicted values (first 10):\n")
y_pred[:10]

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

In [None]:
import itertools  # needed by the loop that writes the cell labels

def plot_confusion_matrix(cm, classes,
                        normalize=False,
                        title='Confusion matrix',
                        cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # Normalize first so the image and the cell labels show the same values.
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
            horizontalalignment="center",
            color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()


In [None]:
import itertools

plot_confusion_matrix(cm, ["No risk of CHD", "Risk of CHD"])

Our model's accuracy looks good, but it has predicted 'No risk of CHD' for 117 people who actually have 'Risk of CHD'. These are false negatives.
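Because accuracy alone hides this kind of error, it can help to read the false negatives and the recall straight from the confusion matrix. The sketch below uses a handful of made-up labels rather than the model's real predictions:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels for eight patients (1 = risk of CHD); values are made up.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Recall = tp / (tp + fn): the share of actual positives the model found.
recall = recall_score(y_true_demo, y_pred_demo)

print(fn, recall)
```

In this toy case two of the three at-risk patients are missed, so the recall is 1/3 even though six of the eight predictions are correct.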

Thanks for reading!