
# EXPERIMENT 8 MACHINE LEARNING 
# PRANJAL NIGAM N252 MBA TECH. CE 
# CLASSIFICATION  

Classification is a type of supervised learning. It assigns each data element to a class and is best suited to problems where the output takes a finite set of discrete values. In other words, a classification problem is one where the output variable is a category, such as “red” or “blue”, or “disease” or “no disease”.

Popular algorithms that can be used for binary classification include:

* Logistic Regression
* k-Nearest Neighbors
* Decision Trees
* Support Vector Machine
* Naive Bayes

In their basic form, Logistic Regression and Support Vector Machines are designed for binary classification and do not natively support more than two classes; multi-class problems are typically handled through extensions such as one-vs-rest.

First, we import the libraries we need and load the dataset to take a first look at it.

In [None]:
import pandas as pd  # data processing
import numpy as np   # linear algebra
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # statistical plotting
from sklearn.preprocessing import StandardScaler
from subprocess import check_output

# List the files available in the Kaggle input directory
print(check_output(["ls", "../input"]).decode("utf8"))
%matplotlib inline

sns.set_style("whitegrid")
plt.style.use("fivethirtyeight")

## Initial look at Data

In [None]:
data = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
data.head()

In [None]:
data.info()

In [None]:
data.describe()

## Check Missing Values
Looking at the summary statistics above, we can see some misleading data points. For example, the SkinThickness variable contains 0 as a value, which is not plausible. In this notebook, 'Pregnancies', 'Insulin' and 'Outcome' are allowed to take the value 0, but the other variables are not. We therefore have to treat the zeros in those other variables as missing or misleading values.

In [None]:
data.isnull().sum()

### Replacing 0's with 'NaN'

In [None]:
# Columns in which a value of 0 is treated as missing
zero_as_missing = ['Glucose','BloodPressure','SkinThickness','BMI','DiabetesPedigreeFunction','Age']
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
data[zero_as_missing] = data[zero_as_missing].replace(0, np.nan)
data.head(10)

Now that we have identified the misleading values, we need to know how many there are and how they might affect the final prediction.

In [None]:
data_nan = data.isna().sum()
data_nan = pd.DataFrame(data_nan, columns=['NaN count'])
data_nan

### Filling Missing Values

The missing values are filled with the median of their respective columns.

In [None]:
# Fill each column's missing values with that column's median.
# Assigning the result back avoids in-place modification of a column selection,
# which recent pandas versions warn about.
for col in ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI', 'DiabetesPedigreeFunction', 'Age']:
    data[col] = data[col].fillna(data[col].median())
data.describe()
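
The zero-to-median imputation used above can be sketched on a toy list with the standard library alone. The function name `impute_zeros_with_median` and the readings are made up for illustration; the point is that the median is computed only from the observed (non-zero) values, which mirrors how pandas skips `NaN` when computing a median.

```python
from statistics import median

def impute_zeros_with_median(values):
    """Replace 0 entries (treated as missing) with the median of the non-zero entries."""
    observed = [v for v in values if v != 0]   # ignore the placeholder zeros
    fill = median(observed)                    # median of the real measurements only
    return [fill if v == 0 else v for v in values]

# Hypothetical skin-thickness readings; 0 marks a missing measurement
skin_thickness = [35, 29, 0, 23, 0, 32]
print(impute_zeros_with_median(skin_thickness))
```

The median is often preferred over the mean here because it is less sensitive to extreme values.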


## EDA (Exploratory Data Analysis)

In this section, we create graphs that display the distributions of the data and the relationships between variables, to help us understand the dataset better.

### Checking the distribution of the target variable

In [None]:
sns.countplot(x = 'Outcome', data = data)

Next, we check the relationships between variables by computing their pairwise correlations, shown in the table below.
We then plot the correlations as a heatmap.

In [None]:
data.corr()

In [None]:
plt.subplots(figsize = (12,8))
sns.set(font_scale = 1.5)
sns.heatmap(data.corr(), annot=True, fmt='.2f')
plt.title('Correlation Plot', fontsize = 20)
plt.show()
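
By default, `data.corr()` computes the Pearson correlation coefficient for each pair of columns. As a rough illustration of what each cell in the heatmap represents, the coefficient can be computed by hand as below. The helper `pearson_r` and the toy glucose/outcome values are assumptions for this sketch, not taken from the dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values: higher glucose tends to go with outcome 1
glucose = [85, 90, 120, 150, 180]
outcome = [0, 0, 0, 1, 1]
print(round(pearson_r(glucose, outcome), 3))
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship.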

## Univariate Distributions

In [None]:
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True
# draws a comparable histogram with a density curve.
fig, ax = plt.subplots(4, 2, figsize=(16, 16))
columns = ['Age', 'Pregnancies', 'Glucose', 'BloodPressure',
           'SkinThickness', 'Insulin', 'DiabetesPedigreeFunction', 'BMI']
# ax.flat walks the 4x2 grid row by row, matching the column order above
for axis, column in zip(ax.flat, columns):
    sns.histplot(data[column], bins=20, kde=True, stat="density", ax=axis)

# Applying Classification Models

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
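# Features are every column except the last; the target is the last column (Outcome)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Hold out 25% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)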

### Logistic Regression
Logistic regression is the appropriate regression analysis when the dependent variable is binary. Like all regression analyses, it is a predictive analysis. It is used to describe data and to explain the relationship between one binary dependent variable and one or more nominal, ordinal, interval or ratio-level independent variables.
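
To make the idea concrete, here is a minimal sketch of how a fitted logistic regression turns feature values into a class prediction: a weighted sum of the features is passed through the sigmoid function to give a probability, which is then thresholded at 0.5. The weights, bias and patient values are hypothetical numbers chosen for illustration, not coefficients learned from this dataset.

```python
from math import exp

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients for two features: glucose and BMI
weights = [0.04, 0.09]
bias = -8.0

patient = [150.0, 33.0]           # hypothetical glucose and BMI values
p = predict_proba(patient, weights, bias)
label = int(p >= 0.5)             # threshold the probability at 0.5
print(f"P(diabetes) = {p:.3f}, predicted class = {label}")
```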

In [None]:
LR = LogisticRegression()

# Fit the model
LR.fit(X_train, y_train)

# Predict on the test set
y_pred = LR.predict(X_test)

# Accuracy (in percent)
print("Accuracy ", LR.score(X_test, y_test)*100)

# Plot the confusion matrix. scikit-learn expects (y_true, y_pred),
# so rows are true labels and columns are predicted labels.
sns.set(font_scale=1.5)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='g')
plt.show()
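
Accuracy alone can hide how a model behaves on each class, so it may help to see how the four cells of a binary confusion matrix lead to precision and recall. The sketch below counts the cells by hand on made-up labels; `confusion_counts` is a hypothetical helper. In scikit-learn, `confusion_matrix(y_true, y_pred).ravel()` returns the same four counts in the order `tn, fp, fn, tp` for a binary problem.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp

# Hypothetical true labels and predictions
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true_toy, y_pred_toy)
accuracy = (tp + tn) / len(y_true_toy)
precision = tp / (tp + fp)   # of the predicted positives, how many were correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
print(accuracy, precision, recall)
```

For a screening task such as diabetes detection, recall on the positive class is often as important as overall accuracy.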

### Decision Tree
A decision tree builds regression or classification models in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while the associated tree is developed incrementally. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy), each representing a value of the attribute tested. A leaf node represents the final decision: a class label in classification, or a numerical target (e.g., Hours Played) in regression. The topmost decision node, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.
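
As a rough sketch of how a single decision node might be chosen, the code below scores candidate thresholds on one numeric feature by weighted Gini impurity and keeps the best one. Gini impurity is the default split criterion of scikit-learn's `DecisionTreeClassifier`, but the helper names and toy data here are assumptions for illustration, and a real tree repeats this search over all features and recursively on each subset.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for more mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(values, labels, threshold):
    """Weighted impurity of the two subsets produced by value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; return the purest split."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return min(candidates, key=lambda t: split_gini(values, labels, t))

# Hypothetical glucose readings and outcomes
glucose_toy = [85, 90, 120, 150, 180, 100]
outcome_toy = [0, 0, 0, 1, 1, 0]
print(best_threshold(glucose_toy, outcome_toy))
```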

In [None]:
DT = DecisionTreeClassifier()

# Fit the model
DT.fit(X_train, y_train)

# Predict on the test set
y_pred = DT.predict(X_test)

# Accuracy (in percent)
print("Accuracy ", DT.score(X_test, y_test)*100)

# Plot the confusion matrix. scikit-learn expects (y_true, y_pred),
# so rows are true labels and columns are predicted labels.
sns.set(font_scale=1.5)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='g')
plt.show()

### Naive Bayes
Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is not a single algorithm but a family of algorithms that share a common principle: every feature is assumed to be independent of every other feature, given the class.
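
The independence assumption means the likelihood of a sample is simply the product of per-feature likelihoods. A minimal Gaussian variant can be sketched as follows: each class stores a prior plus a mean and variance per feature, and prediction picks the class with the largest prior-times-likelihood score. The function names and toy data are hypothetical, and the sketch multiplies probabilities directly for readability, whereas practical implementations usually work with log-probabilities.

```python
from math import exp, pi, sqrt
from statistics import mean, pvariance

def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return exp(-((x - mu) ** 2) / (2 * var)) / sqrt(2 * pi * var)

def fit_gaussian_nb(X, y):
    """Per class: the prior and a (mean, variance) pair for each feature."""
    model = {}
    for label in set(y):
        rows = [row for row, target in zip(X, y) if target == label]
        prior = len(rows) / len(X)
        stats = [(mean(col), pvariance(col)) for col in zip(*rows)]
        model[label] = (prior, stats)
    return model

def predict_gaussian_nb(model, row):
    """Pick the class maximizing prior * product of per-feature densities."""
    scores = {}
    for label, (prior, stats) in model.items():
        score = prior
        for x, (mu, var) in zip(row, stats):
            score *= gaussian_pdf(x, mu, var)   # features treated as independent
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical samples: [glucose, BMI]
X_toy = [[85, 22.0], [90, 24.5], [100, 26.0], [150, 33.0], [165, 35.5], [180, 31.0]]
y_toy = [0, 0, 0, 1, 1, 1]
nb_model = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(nb_model, [95, 25.0]), predict_gaussian_nb(nb_model, [170, 34.0]))
```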

In [None]:
NB = GaussianNB()

# Fit the model
NB.fit(X_train, y_train)

# Predict on the test set
y_pred = NB.predict(X_test)

# Accuracy (in percent)
print("Accuracy ", NB.score(X_test, y_test)*100)

# Plot the confusion matrix. scikit-learn expects (y_true, y_pred),
# so rows are true labels and columns are predicted labels.
sns.set(font_scale=1.5)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='g')
plt.show()

### KNN
The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm. It can be used for both regression and classification, but it is mostly used for classification problems.
KNN is a non-parametric algorithm, which means it does not make any assumption about the distribution of the underlying data.
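
The whole algorithm fits in a few lines: to classify a new point, find the k training points closest to it and take a majority vote over their labels. The sketch below uses Euclidean distance and k=3, the same k as the notebook cell that follows. The helper `knn_predict` and the toy samples are assumptions for illustration.

```python
from collections import Counter
from math import dist

def knn_predict(X_train, y_train, query, k=3):
    """Majority vote among the k training points nearest to the query."""
    by_distance = sorted(zip(X_train, y_train), key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical samples: [glucose, BMI]
X_knn = [[85, 22.0], [90, 24.5], [100, 26.0], [150, 33.0], [165, 35.5], [180, 31.0]]
y_knn = [0, 0, 0, 1, 1, 1]
print(knn_predict(X_knn, y_knn, [95, 25.0]), knn_predict(X_knn, y_knn, [160, 34.0]))
```

Because the vote depends on raw distances, features measured on larger scales dominate the result, which is one reason features are often standardized before applying KNN.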

In [None]:
KNN = KNeighborsClassifier(n_neighbors=3)

# Fit the model
KNN.fit(X_train, y_train)

# Predict on the test set
y_pred = KNN.predict(X_test)

# Accuracy (in percent)
print("Accuracy ", KNN.score(X_test, y_test)*100)

# Plot the confusion matrix. scikit-learn expects (y_true, y_pred),
# so rows are true labels and columns are predicted labels.
sns.set(font_scale=1.5)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='g')
plt.show()