<a href="https://colab.research.google.com/github/shubhangi-singh21/Data-Science/blob/master/Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Uploading dataset

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
# importing libraries
import io
import pandas as pd

import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error

Scikit-learn is a free, open-source machine learning library for Python.


In [None]:
# reading the dataset

df = pd.read_csv(io.BytesIO(uploaded['data_banknote_authentication.csv']))

In [None]:
df.head()

The file has no header row, so pandas used the first data row as column names. We load the dataset again, this time passing the column names explicitly.

In [None]:
cols = ["Variance", "Skewness", "Kurtosis", "Entropy", "Class"]

In [None]:
df = pd.read_csv(io.BytesIO(uploaded['data_banknote_authentication.csv']), names=cols)

In [None]:
df.head()

# Exploratory Data Analysis

In [None]:
df.shape

There are 1372 rows and 5 columns: four feature columns and the target column "Class".

In [None]:
df.info()

Observation : 

In [None]:
df.describe()

Observation : 

In [None]:
sns.pairplot(df)

Observation :

In [None]:
sns.heatmap(df.corr())

Observation : 

# Scaling

In [None]:
X = df[["Entropy", "Skewness", "Variance", "Kurtosis"]].copy()         # Features
y = df["Class"].copy()                                                 # Target

In [None]:
X.head()

In [None]:
y.head()

Standardize features by removing the mean and scaling to unit variance. 

The standard score of a sample x is calculated as:

    z = (x - u) / s

where u is the mean of the training samples and s is the standard deviation of the training samples. 
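As a quick check of the formula, the sketch below computes the standard score by hand and compares it with `StandardScaler`. The array `X_demo` holds made-up values and is not the banknote data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values, not the banknote data)
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

# Manual z-score: subtract the column mean, divide by the column standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), which is np.std's default.
u = X_demo.mean(axis=0)
s = X_demo.std(axis=0)
z_manual = (X_demo - u) / s

z_sklearn = StandardScaler().fit_transform(X_demo)

# The two results should agree up to floating-point error
print(np.allclose(z_manual, z_sklearn))
```

After scaling, each column has mean 0 and standard deviation 1.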

In [None]:
scaler = StandardScaler()

X = scaler.fit_transform(X)

In [None]:
X

# Train-test split

Split the scaled features and the target into random train and test subsets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
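The cells above scale the full feature matrix before splitting, so the mean and standard deviation used for scaling also include the test rows. One possible alternative, sketched here on synthetic stand-in data, is to split first and let a pipeline fit the scaler on the training rows only. The names `X_raw`, `y_demo`, `X_tr`, `X_te`, and `model` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 samples, 4 features, binary target
rng = np.random.default_rng(42)
X_raw = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y_demo = (X_raw[:, 0] > 5.0).astype(int)

# Split first, on the unscaled features
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_demo, test_size=0.33, random_state=42
)

# The pipeline fits the scaler on X_tr only, then reuses those statistics for X_te
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
model.fit(X_tr, y_tr)

fitted_scaler = model.named_steps["standardscaler"]
print(fitted_scaler.mean_)        # means computed from the training rows only
print(model.score(X_te, y_te))    # accuracy on the held-out rows
```

With a dataset as small and clean as this one, the difference in accuracy is likely to be minor, but keeping the test rows out of the fitting step gives a more honest estimate.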

In [None]:
print(len(X_train))
print(len(X_test))

In [None]:
y_train.value_counts()

In [None]:
y_test.value_counts()

Although class "0" is slightly more frequent than class "1", the two counts are close enough to treat this as a balanced classification problem.
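If the class proportions should be the same in both subsets, `train_test_split` accepts a `stratify` argument. The sketch below uses made-up labels (60 of class 0, 40 of class 1) rather than the banknote data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumption): 60% class 0, 40% class 1
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the 60/40 ratio in both the train and the test subset
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(np.bincount(y_tr))   # class counts in the training subset
print(np.bincount(y_te))   # class counts in the test subset
```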

# Logistic Regression

In linear regression, we had -

$ y = a + b*X $ , 

where a and b were unknown parameters.



The logistic function, also called the sigmoid function, is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.

Now, the logistic function is given by - 

$ f(x) = \frac{\exp(x)}{1+\exp(x)}$

![Logistic Function](https://miro.medium.com/max/386/1*P4RYaqxDUEZF16dKf0vyNg.png)

If we pass the y value from the linear equation to the logistic function, we get - 

$ p = f(y) = \frac{\exp(a + b*X)}{1+\exp(a + b*X)}$

Here, p is a value between 0 and 1. scikit-learn interprets it as the estimated probability of the positive class, which is class 1 in this dataset.

If p > 0.5 => class 1

If p <= 0.5 => class 0

Read more [here](https://machinelearningmastery.com/logistic-regression-for-machine-learning/).
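The logistic function and the 0.5 threshold can be sketched in a few lines of plain Python. The helper names `logistic` and `predict_class`, and the parameter values `a = 0` and `b = -2`, are made up for illustration.

```python
from math import exp

def logistic(x):
    """Logistic (sigmoid) function, equivalent to exp(x) / (1 + exp(x))."""
    if x >= 0:
        return 1.0 / (1.0 + exp(-x))
    # For negative x, this form avoids computing exp of a large positive number
    z = exp(x)
    return z / (1.0 + z)

def predict_class(x_value, a, b, threshold=0.5):
    """Return 1 if the modelled probability exceeds the threshold, else 0."""
    p = logistic(a + b * x_value)
    return 1 if p > threshold else 0

# Hypothetical parameters: intercept a = 0, slope b = -2
print(logistic(0.0))                      # midpoint of the curve
print(predict_class(1.0, a=0.0, b=-2.0))  # a + b*x = -2, so p < 0.5
print(predict_class(-1.0, a=0.0, b=-2.0)) # a + b*x = 2, so p > 0.5
```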

In [None]:
clf = LogisticRegression(random_state=0).fit(X_train, y_train)

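`coef_` holds the learned coefficients, one per feature, in the same order as the feature columns. The intercept is stored separately in `intercept_`.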

In [None]:
clf.coef_
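To see how the coefficients relate to the predicted probabilities, the sketch below fits a separate model on synthetic data and rebuilds the class-1 probability from `coef_` and `intercept_` by hand. The names `X_demo`, `y_demo`, and `demo_clf` are illustrative, and the data is randomly generated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 100 samples, 4 features, noisy binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
noise = rng.normal(scale=0.5, size=100)
y_demo = (X_demo[:, 0] - X_demo[:, 1] + noise > 0).astype(int)

demo_clf = LogisticRegression(random_state=0).fit(X_demo, y_demo)

# For a binary problem, coef_ has shape (1, n_features) and intercept_ has shape (1,)
linear_part = X_demo @ demo_clf.coef_.ravel() + demo_clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-linear_part))

# Column 1 of predict_proba is the probability of class 1
p_sklearn = demo_clf.predict_proba(X_demo)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```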

In [None]:
# Accuracy score 

clf.score(X_test, y_test)

In [None]:
# Mean Squared Error
# For 0/1 labels this is the fraction of misclassified samples, i.e. 1 - accuracy

mean_squared_error(y_test, clf.predict(X_test))
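For 0/1 labels, each squared error is either 0 or 1, so the mean squared error is simply the share of wrong predictions. The sketch below illustrates this on a small made-up example and also prints a confusion matrix, which shows where the errors fall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

# Made-up labels and predictions (2 of 8 predictions are wrong)
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

mse = mean_squared_error(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(mse, 1 - acc)   # both are the misclassification rate

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
```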

In [None]:
# prediction 

clf.predict(X_test)

# Exploration 

In [None]:
X_ = df["Variance"].copy()

In [None]:
import numpy as np

lin_eq = np.poly1d([-4.536, 0])

In [None]:
lin_eq

In [None]:
val = list(lin_eq(X_))

In [None]:
val

In [None]:
from math import exp

In [None]:
exp_val = [exp(i) for i in val]      # named exp_val so the target y is not overwritten

In [None]:
prob = []

# logistic function: exp(v) / (1 + exp(v))
for v in exp_val:
  prob.append(v / (1 + v))

In [None]:
len(prob)

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.scatter(X_, prob)
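The same curve can be computed without explicit loops by using NumPy arrays. This is a minimal sketch: `variance` is a made-up stand-in for `df["Variance"]`, and `slope` reuses the coefficient hard-coded in the cells above.

```python
import numpy as np

# Stand-in for df["Variance"] (assumption: made-up values in a similar range)
variance = np.linspace(-7.0, 7.0, 50)

slope = -4.536              # the coefficient hard-coded in the cells above
linear = slope * variance   # same values as np.poly1d([slope, 0])(variance)

# 1 / (1 + exp(-t)) is algebraically equal to exp(t) / (1 + exp(t))
prob_vec = 1.0 / (1.0 + np.exp(-linear))

print(prob_vec[:3])    # probabilities at the lowest variance values
print(prob_vec[-3:])   # probabilities at the highest variance values
```

Because the slope is negative, the probability falls as the variance rises, which gives the mirrored S-shape in the scatter plot.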