# Introduction to Logistic Regression
### This acts as a good benchmark to start understanding Logistic Regression

# Background and data exploration

Each biopsy sample in the dataset is labeled with an ID number and with whether the tumor it came from is malignant (M) or benign (B). Each sample also has 10 different features associated with it, some of which are described below. Remember that each feature value for a given biopsy sample is a real-valued number.

In [None]:
# Load the data
import pandas as pd
from sklearn import metrics
data = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')
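# Recode the diagnosis column: malignant (M) -> 1, benign (B) -> 0
# (assigning the result back avoids relying on an in-place edit of a column selection)
data['diagnosis'] = data['diagnosis'].replace({'M': 1, 'B': 0})
# Save without the row index so it is not read back as an extra unnamed column
data.to_csv('data.csv', index = False)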
del data

## Loading our annotated dataset

The first step in building our breast cancer tumor classification model is to load in the dataset we'll use to "teach" (or "train") our model.

In [None]:
# First, import helpful Python tools for loading/navigating data
import os             # Good for navigating your computer's files 
import numpy as np    # Great for lists (arrays) of numbers
import pandas as pd   # Great for tables (Google Sheets, Microsoft Excel, CSV files)

In [None]:
data_path  = 'data.csv'

In [None]:
# Use the 'pd.read_csv('file')' function to read in our data and store it in a variable called 'dataframe'
dataframe = pd.read_csv(data_path)

## Looking at our dataset

A key step in machine learning (and coding in general!) is to view the structure and dimensions of our new dataframe, which stores all our training data from the tumor biopsies. We want to confirm that the size of our table is correct, check out the features present, and get a more visual sense of what it looks like overall.

In [None]:
# Let's look at just a few of the biopsy sample features to start by subsetting our 'dataframe'
dataframe = dataframe[['diagnosis','radius_mean','area_mean', 'radius_se', 'area_se', 'smoothness_mean','smoothness_se']]

You can think of dataframes as being like Google Sheets or Microsoft Excel spreadsheets (large tables with row/column headers).

**Use the 'head()' method to show the first five rows of the table and their corresponding column headers (the diagnosis plus our 6 biopsy features!)**

In [None]:
dataframe.head()


* `diagnosis`: Whether the tumor was diagnosed as malignant (M, coded as 1) or benign (B, coded as 0)
* `radius_mean`: The radius data feature, averaged across cells in that particular biopsy
* `area_mean`: The area data feature, averaged across cells in that particular biopsy
* `radius_se`: The standard error of the radius data feature for cells in that particular biopsy
* `area_se`: The standard error of the area data feature for cells in that particular biopsy
* `smoothness_mean`: The smoothness data feature, averaged across cells in that particular biopsy
* `smoothness_se`: The standard error of the smoothness data feature for cells in that particular biopsy

Recall that the term mean refers to taking an average (summing the values for each cell and dividing by the total number of cells observed in that biopsy). Additionally, the standard error gives a sense of the spread of a feature (how much its value varies between cells in that biopsy).
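As a small illustration of these two summaries, the sketch below computes the mean and a standard error for a made-up list of cell radii. The values in `cell_radii` are hypothetical and not taken from the dataset. The standard error is computed here as the sample standard deviation divided by the square root of the number of cells, which is the usual definition, but it is an assumption that the dataset's `_se` columns were produced exactly this way.

```python
import math
import statistics

# Hypothetical radius measurements for the cells in one biopsy (not real data)
cell_radii = [12.0, 14.0, 13.0, 15.0, 16.0]

# Mean: sum the values and divide by the number of cells
radius_mean = sum(cell_radii) / len(cell_radii)

# Sample standard deviation: how spread out the cells are around the mean
radius_std = statistics.stdev(cell_radii)

# Standard error of the mean: standard deviation divided by sqrt(number of cells)
radius_se = radius_std / math.sqrt(len(cell_radii))

print("mean:", radius_mean)
print("standard deviation:", round(radius_std, 3))
print("standard error:", round(radius_se, 3))
```

A biopsy whose cells are all close to the same size would give a small standard error, while a biopsy with very mixed cell sizes would give a larger one.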

In [None]:
# Next, we'll use the 'info' method to see the data types of each column
dataframe.info()

In [None]:
# First, we'll import some handy data visualization tools
import seaborn as sns
import matplotlib.pyplot as plt 

In [None]:
# To see how well mean radius correlates with diagnosis, we'll plot the data
# separated based on diagnosis category on the x-axis and have the points' y-value
# be its mean radius value

sns.catplot(x = 'diagnosis', y = 'radius_mean', data = dataframe)

Next, we might want to check just how well mean radius can be used to classify, or separate, the datapoints in either category. Let's pick a boundary value for the radius mean and see how well it separates the data.

In [None]:
boundary = 10
sns.scatterplot(x = 'radius_mean', y = 'diagnosis', data = dataframe)
plt.plot([boundary, boundary], [0, 1], 'g', linewidth = 6)

Using a boundary value, we can build a boundary classifier function. This function will take in a boundary value of our choosing and then classify each data point based on whether it is above or below the boundary.

#### Building the boundary classifier
Here we build the function that takes in a target boundary (value of radius mean). Write a function to implement a boundary classifier. Think about what the return 'type' of this classifier might be.

The code below chooses a boundary and runs it for us. 

In [None]:
def boundary_classifier(target_boundary,x):
  result = []
  for i in x:
    if i > target_boundary:
      result.append(1)
    else:
      result.append(0)
  return result
     
chosen_boundary = 15
y_pred = boundary_classifier(chosen_boundary, dataframe['radius_mean'])
dataframe['predicted'] = y_pred
y_true = dataframe['diagnosis']
sns.scatterplot(x = 'radius_mean', y = 'diagnosis', hue = 'predicted', data = dataframe)
plt.plot([chosen_boundary, chosen_boundary], [0, 1], 'g', linewidth = 6)


In [None]:
accuracy = metrics.accuracy_score(y_true,y_pred)
accuracy
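One boundary value gives one accuracy score, so a natural follow-up is to try several boundaries and compare them. The sketch below does this on a tiny made-up dataset so that it runs on its own. The lists `radius` and `diagnosis`, the helper `accuracy`, and the candidate boundaries are all hypothetical, and the classifier is a compact rewrite of the same "above the boundary means 1" rule used earlier.

```python
# Hypothetical mean radii and diagnoses (1 = malignant, 0 = benign), not real data
radius = [9.0, 11.0, 12.5, 13.0, 14.5, 16.0, 17.5, 20.0]
diagnosis = [0, 0, 0, 1, 0, 1, 1, 1]

def boundary_classifier(target_boundary, x):
    # Same rule as before: predict 1 when the value exceeds the boundary
    return [1 if value > target_boundary else 0 for value in x]

def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Try several candidate boundaries and record the accuracy of each
results = {}
for candidate in [10, 12, 14, 15, 18]:
    results[candidate] = accuracy(diagnosis, boundary_classifier(candidate, radius))
    print("boundary =", candidate, "accuracy =", results[candidate])

# Pick the candidate with the highest accuracy on this toy data
best_boundary = max(results, key=results.get)
print("best boundary on this toy data:", best_boundary)
```

On the real `dataframe`, the same loop could call `metrics.accuracy_score(y_true, y_pred)` instead of the hand-written `accuracy` helper.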

**True positive rate (TPR)**: Sometimes called sensitivity, the TPR is the proportion of actual positives that are correctly identified as such. An analogy would be the percentage of sick people who are correctly identified as having the disease in some population.

**True negative rate (TNR)**: Sometimes called specificity, the TNR is the proportion of actual negatives that are correctly identified as such. An analogy would be the percentage of healthy people who are correctly identified as not having the disease in some population.

**False positive rate (FPR)**: The FPR is the proportion of actual negatives that are incorrectly identified as positives. An analogy would be the percentage of healthy people who are incorrectly identified as having the disease.

**False negative rate (FNR)**: The FNR is the proportion of actual positives that are incorrectly identified as negatives. An analogy would be the percentage of sick people who are incorrectly identified as healthy.

A key insight is that there is a tradeoff when trying to reduce the different types of errors. For instance, if we want to increase our TPR (and thus decrease our FNR by correctly identifying more sick people), we will have to increase the number of people we guess to be sick. However, doing so will tend to decrease our TNR (and thus increase our FPR, because we guess that more healthy people are sick).

Sometimes, one type of error is worse than the others for a given problem. Other times, however, we must strike an acceptable balance between the two.
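To make the four rates and the tradeoff concrete, here is a minimal sketch that counts the four outcomes by hand and compares two boundary values. The function name `error_rates` and the toy `radius` and `diagnosis` lists are hypothetical, and the code assumes both classes appear in the labels (otherwise a rate would divide by zero).

```python
def error_rates(y_true, y_pred):
    # Count the four possible outcomes (1 = positive/malignant, 0 = negative/benign)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "TPR": tp / (tp + fn),  # share of actual positives we caught
        "TNR": tn / (tn + fp),  # share of actual negatives we cleared
        "FPR": fp / (fp + tn),  # share of actual negatives we flagged by mistake
        "FNR": fn / (fn + tp),  # share of actual positives we missed
    }

# Hypothetical mean radii and diagnoses, not real data
radius = [9.0, 11.0, 12.5, 13.0, 14.5, 16.0, 17.5, 20.0]
diagnosis = [0, 0, 0, 1, 0, 1, 1, 1]

# A lower boundary flags more samples as malignant, a higher one flags fewer
for boundary in [12, 15]:
    predicted = [1 if r > boundary else 0 for r in radius]
    print("boundary =", boundary, error_rates(diagnosis, predicted))
```

On this toy data, the lower boundary catches every malignant sample but mislabels half of the benign ones, while the higher boundary clears every benign sample but misses one malignant sample. Note also that TPR and FNR always sum to 1, as do TNR and FPR.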

![alt text](https://drive.google.com/uc?export=view&id=1S4S2MBM86D74C-Q0aPPwHzbU8iUveLKq)

In [None]:
# Import the metrics module
from sklearn import metrics

# Create the Confusion Matrix
y_test = dataframe['diagnosis']
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)

# Visualizing the Confusion Matrix
class_names = [0,1] # Our diagnosis categories

fig, ax = plt.subplots()
# Setting up and visualizing the plot (do not worry about the code below!)
tick_marks = np.arange(len(class_names)) 
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g') # Creating heatmap
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y = 1.1)
plt.ylabel('Actual diagnosis')
plt.xlabel('Predicted diagnosis')

In [None]:
# Print the accuracy, precision, and recall for a set of predictions
def model_stats(y_test, y_pred):
  print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))
  print("Precision: ", metrics.precision_score(y_test, y_pred))
  print("Recall: ", metrics.recall_score(y_test, y_pred))

In [None]:
model_stats(y_test, y_pred)
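Precision and recall can also be computed by hand from the same counts used for the rates above: recall is another name for the true positive rate, and precision is the share of predicted positives that are actually positive. The sketch below uses hypothetical labels and a hypothetical helper name, `precision_and_recall`, and assumes at least one positive label and one positive prediction.

```python
def precision_and_recall(y_true, y_pred):
    # Count true positives, false positives, and false negatives
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)  # of everything we called positive, how much was right
    recall = tp / (tp + fn)     # of everything actually positive, how much we found
    return precision, recall

# Hypothetical true labels and predictions (1 = malignant, 0 = benign)
toy_true = [0, 0, 0, 1, 0, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 1, 1, 1]

toy_precision, toy_recall = precision_and_recall(toy_true, toy_pred)
print("Precision:", round(toy_precision, 3))
print("Recall:", toy_recall)
```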

# Finding a better separation with logistic regression


In [None]:
# Let's pull our handy linear fitter from our 'prediction' toolbox: sklearn!
from sklearn import linear_model

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(dataframe, test_size = 0.4, random_state = 1)


In [None]:
print('\n\nTraining dataframe has %d rows'%train_df.shape[0])
train_df.head()

In [None]:
print('\n\nTesting dataframe has %d rows'%test_df.shape[0])
test_df.head()

In [None]:
input_labels = ['radius_mean']
output_label = 'diagnosis'


x_train = train_df[input_labels]
print('Our x variables')
print(x_train.head())
print('\n\n')

y_train = train_df[output_label]
print('Our y variable:')
print(y_train.head())

In [None]:
# Here, we create a 'class_rm' object (a logistic regression classifier) that handles the fitting for us!
class_rm = linear_model.LogisticRegression()

In [None]:
class_rm = linear_model.LogisticRegression()
class_rm.fit(x_train, y_train)

In [None]:
x_test = test_df[input_labels]

In [None]:
y_test = test_df[output_label].values.squeeze()

In [None]:
y_pred = class_rm.predict(x_test)

In [None]:
print(y_pred)

Run the code below to visualize the results

In [None]:
y_pred = y_pred.squeeze()
x_test_view = x_test[input_labels].values.squeeze()
sns.scatterplot(x = x_test_view, y = y_pred, hue = y_test)
plt.xlabel('Radius')
plt.ylabel('Predicted')
plt.legend()

### Finally, let's re-evaluate the recall, accuracy, and precision for the model by calling the function we created.

In [None]:
model_stats(y_test, y_pred)

In [None]:
# Let's visualize the probabilities for `x_test`
y_prob = class_rm.predict_proba(x_test)
sns.scatterplot(x = x_test_view, y = y_prob[:,1], hue = y_test)
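The probabilities and the 0/1 predictions are closely linked: for a two-class `LogisticRegression`, `predict` labels a sample as class 1 when its predicted probability of class 1 is above 0.5. With a single input feature, this means the model has effectively learned its own boundary value, much like the one we picked by hand earlier. The sketch below illustrates this on synthetic data so that it runs without the CSV file. The generated radii, the class means and spreads, and the names `model` and `learned_boundary` are assumptions made for illustration only.

```python
import numpy as np
from sklearn import linear_model

# Synthetic one-feature data (hypothetical, not the biopsy dataset)
rng = np.random.default_rng(0)
benign = rng.normal(12.0, 1.5, size=100)
malignant = rng.normal(17.0, 2.5, size=100)
x = np.concatenate([benign, malignant]).reshape(-1, 1)
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

model = linear_model.LogisticRegression()
model.fit(x, y)

# Column 1 of predict_proba holds the probability of class 1
prob_class_1 = model.predict_proba(x)[:, 1]

# Thresholding the probabilities at 0.5 reproduces the predicted labels
labels_from_prob = (prob_class_1 > 0.5).astype(int)
print("same as predict():", np.array_equal(labels_from_prob, model.predict(x)))

# The probability is 0.5 where coef * x + intercept = 0, so solve for x
learned_boundary = -model.intercept_[0] / model.coef_[0][0]
print("learned boundary:", learned_boundary)
```

Applied to the notebook's own model, the same formula with `class_rm.intercept_` and `class_rm.coef_` would give the radius at which the predicted probability of malignancy crosses 0.5.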

## Visualization: linear vs. logistic regression



The plots below show what linear and logistic regression look like graphically. As you can see, the linear model can yield predicted values outside the [0,1] range because a straight line is unbounded.

On the other hand, the logistic model stays within our bounds. You can see that the logistic model gives a "line" with curvy ends in the [0,1] range, which is the best approximation for a line that will also always respect these boundaries. 

**Confusingly, the biggest difference between linear and logistic regression is that linear regression is used for regression problems (predicting the value of continuous variables) while logistic regression is used for classification problems!**
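A short numeric sketch of this difference: the logistic model passes a linear expression through the sigmoid function, which squeezes any real number into the range between 0 and 1. The `slope` and `intercept` values below are hypothetical and chosen only for illustration. The same pair is used for both functions purely for comparison; a fitted linear regression would generally have different coefficients.

```python
import math

def linear_output(x, slope, intercept):
    # A straight line: the output is unbounded
    return slope * x + intercept

def logistic_output(x, slope, intercept):
    # The sigmoid of the same linear expression: the output stays between 0 and 1
    return 1.0 / (1.0 + math.exp(-(slope * x + intercept)))

# Hypothetical coefficients, chosen only for illustration
slope, intercept = 0.5, -7.0

for radius in [5, 10, 14, 18, 25]:
    line_value = linear_output(radius, slope, intercept)
    curve_value = logistic_output(radius, slope, intercept)
    print("radius =", radius, "linear =", line_value, "logistic =", round(curve_value, 3))
```

For small and large radii the linear output falls below 0 or rises above 1, while the logistic output only approaches those limits. Where the linear expression equals 0 (radius 14 with these coefficients), the logistic output is exactly 0.5.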

*Linear Regression:*

![Linear Regression](https://i.stack.imgur.com/kW8YP.png)

*Logistic Regression:*

![Logistic Regression](https://techdifferences.com/wp-content/uploads/2018/01/graph-logistic-regression.jpg)