# Logistic Regression
[COMP20121 Machine Learning for Data Analytics](https://sites.google.com/site/hejunhomepage/Teaching/machine-learning-for-data-analytics)

Author: Jun He 

## Learning objectives
* Implement logistic regression for classification 
* Tune parameters in logistic regression

In [None]:
# Import Python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics  # scikit-learn metrics module, used for accuracy calculation
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # train_test_split function
from sklearn.preprocessing import OrdinalEncoder

## Activity 1 Implement Logistic Regression on the Pima Indians Diabetes Data
### Load data and understand data
Add the Pima Indians Diabetes dataset, which is available at https://www.kaggle.com/uciml/pima-indians-diabetes-database. You can attach this dataset to your Kaggle Kernel by using the `Add Data` button with the above URL.

*Question: what is the shape of the data? What are the column names? Which column is the class label?*



In [None]:
import pandas as pd
df = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")
df.head(10)

In [None]:
print("data shape:", df.shape)
print("column names:", df.columns.tolist())

### Exploratory data analysis
(1) Use a pie chart to show the percentage of records with `Outcome` = 1 and `Outcome` = 0.

In [None]:
df['Outcome'].value_counts().plot(kind = 'pie', title = 'Outcome', autopct='%1.1f%%') 

(2) Understand the relationship between two predictor variables and the target variable.

For example, draw a scatter plot with `BloodPressure` on the x-axis and `Glucose` on the y-axis, coloured by `Outcome`.

*Question: what is your first finding from the EDA?*


In [None]:
import seaborn as sns
sns.scatterplot(data=df, x="BloodPressure", y="Glucose", hue="Outcome")

(3) Check for outliers in the data. For example, check the features `BMI`, `Pregnancies` and `BloodPressure`.

*Question: are there any outliers in the data? If yes, how would you handle them?*

In [None]:
plt.figure() 
df["BMI"].plot(kind='box', fontsize=15)
plt.show()

In [None]:
plt.figure() 
df["Pregnancies"].plot(kind='box', fontsize=15)
plt.show()

In [None]:
plt.figure() 
df["BloodPressure"].plot(kind='box', fontsize=15)
plt.show()
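
A box plot marks points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically, as in the minimal sketch below. The helper `iqr_outliers` and the `toy_bmi` values are hypothetical and made up for illustration; they are not taken from the diabetes data. The sketch also counts zeros, since a zero for a measurement such as `BMI` or `BloodPressure` is implausible and may be better treated as a missing value than as a genuine reading.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Toy values for illustration only (not taken from the diabetes data)
toy_bmi = pd.Series(
    [0.0, 22.5, 24.1, 26.3, 27.8, 29.0, 31.2, 33.6, 35.4, 67.1], name="BMI"
)

outliers = iqr_outliers(toy_bmi)
print("outliers:", outliers.tolist())
print("zeros:", int((toy_bmi == 0).sum()))
```

In the notebook, the same helper could be called as `iqr_outliers(df["BMI"])`. Whether to drop, cap, or impute the flagged rows is a modelling decision that depends on what the values mean.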

### Prepare data
(1) Split the dataset into predictor and target variables

In [None]:
X = df.iloc[:, 0:8]  # first eight columns are the predictors (the stop index is excluded)
y = df.iloc[:, 8]    # ninth column, Outcome, is the target
print(X)

(2) Split the dataset into training data and test data
* 70% of records for training
* 30% of records for testing

In [None]:
import sklearn.model_selection as model_selection
X_train,X_test,y_train,y_test = model_selection.train_test_split(X,y,test_size=0.3,random_state=4)


### Build a logistic regression model
(1) Create a logistic regression object (classifier)

(2) Train the classifier on training data `fit(X_train,y_train)`

(3) Print parameters used in the classifier

In [None]:
from sklearn.linear_model import LogisticRegression
# Create a LogisticRegression object
clf = LogisticRegression(max_iter=2000)

# Train the LogisticRegression classifier
clf = clf.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)

# Show the parameters used by the classifier
clf.get_params()


### Evaluate the model
(1) Predict the labels of patients in the test dataset

(2) Calculate the accuracy of prediction

In [None]:
# Import the metrics module
from sklearn import metrics
# Model accuracy: the fraction of test records whose label is predicted correctly
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

(3) Print the coefficients of the logistic regression model

In [None]:
# One row per feature with its fitted coefficient
df3 = pd.DataFrame({'features': X_train.columns, 'coefficient': clf.coef_[0]})
# Add the intercept as a final row (DataFrame.append was removed in pandas 2.0, so use pd.concat)
intercept_row = pd.DataFrame({'features': ['intercept'], 'coefficient': [clf.intercept_[0]]})
df3 = pd.concat([df3, intercept_row], ignore_index=True)
df3
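
The coefficients and intercept define the model's prediction: each record gets a linear score, and the sigmoid function turns that score into the probability of class 1. The sketch below checks this relationship against `predict_proba`. It uses synthetic data from `make_classification` as a stand-in (an assumption), and the names `X_demo`, `y_demo` and `model` are hypothetical; in the notebook, `X_train`, `y_train` and `clf` would take their place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; swap in X_train, y_train from the notebook)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression(max_iter=2000).fit(X_demo, y_demo)

# Linear score z = intercept + sum(coefficient_i * feature_i) for each row
z = model.intercept_[0] + X_demo @ model.coef_[0]

# The sigmoid maps the score to P(class = 1)
p_manual = 1.0 / (1.0 + np.exp(-z))

# Probability that scikit-learn reports for the positive class
p_sklearn = model.predict_proba(X_demo)[:, 1]
print("manual and library probabilities agree:", np.allclose(p_manual, p_sklearn))
```

A positive coefficient raises the predicted probability of class 1 as the feature increases, and a negative one lowers it. Because the features are on different scales, coefficient magnitudes are not directly comparable unless the data are standardised first.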

### Tune parameters in logistic regression
Tune the following parameters in logistic regression and find the best combination:
* `max_iter`: the maximum number of iterations the solver may take to converge
* `penalty`: the type of regularization penalty
    * ‘l1’, ‘l2’, ‘elasticnet’, ‘none’ (scikit-learn 1.2 and later expect `None` instead of the string ‘none’)
* `solver`: the optimization algorithm used to fit the model
    * For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones.
    * ‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty
    * ‘liblinear’ and ‘saga’ also handle L1 penalty
    * ‘saga’ also supports ‘elasticnet’ penalty
    * ‘liblinear’ does not support fitting without a penalty
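
One way to search over these parameters is a cross-validated grid search. The sketch below is a minimal example, not a prescribed method: it uses `GridSearchCV` with synthetic stand-in data (an assumption), and the grid values are illustrative. The grid is split into two dictionaries so that each solver is only paired with a penalty it supports.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; use X_train, y_train in the notebook)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=4)

# Each dictionary lists only the penalties that the named solvers support
param_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "max_iter": [100, 2000]},
    {"solver": ["lbfgs", "newton-cg"], "penalty": ["l2"], "max_iter": [100, 2000]},
]

# 5-fold cross-validation on the training data, scored by accuracy
search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

After the search, `search.best_estimator_` holds a model refitted with the best combination, which can then be evaluated once on the held-out test data.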

## Activity 2 Comparison with KNN and decision tree
You have learned two classifiers, KNN and decision trees, in previous lectures. In this activity, you will apply KNN and decision trees to the above data.

*Question: which classifier performs the best?*
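
A possible starting point for the comparison is to fit all three classifiers on the same training split and score them on the same test split. The sketch below does this on synthetic stand-in data (an assumption); the names `X_demo`, `y_demo` and `classifiers` are hypothetical, and the hyperparameter values are illustrative rather than tuned.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; use X and y from Activity 1 instead)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=4
)

# Hyperparameter values here are illustrative, not tuned
classifiers = {
    "logistic regression": LogisticRegression(max_iter=2000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(random_state=4),
}

# Fit each classifier on the same split and record its test accuracy
scores = {}
for name, clf_demo in classifiers.items():
    clf_demo.fit(X_tr, y_tr)
    scores[name] = metrics.accuracy_score(y_te, clf_demo.predict(X_te))
    print(f"{name}: {scores[name]:.3f}")
```

The ranking from a single split can change with a different `random_state`, so it is worth repeating the comparison on several splits before drawing a conclusion.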


## Reflect
Briefly note what you’ve learnt, found easy and found challenging in your Jupyter notebook. Keep these notes safe and maintain a reflective log for each lab session.

## Resources/references
1. Sklearn  logistic regression: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
2. Understanding Logistic Regression in Python: https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python 