In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Reference Blogs

**Interactive CNN**

* **Interactive Convolutional Neural Network on MNIST -** https://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html

* **Interactive CNN with CIFAR-10 dataset -** https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html

* **Toy 2D data set classification using CNN -** https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html

**Blogs to follow for Computer Vision**

* **Computer Vision for Dummies -** https://www.visiondummy.com/

* **Learn OpenCV -** https://www.learnopencv.com/

* **Tombone's CV blog -** https://www.computervisionblog.com/

* **Andrej Karpathy Blog -** http://karpathy.github.io/

* **AI Shack Blog -** https://aishack.in/

* **Computer Vision Talks -** https://computer-vision-talks.com/

# <center> Case Study on Image Classification:


# Context:
- We are given a **dataset** that contains image data.
- The data contains the **pixel values** of the images in CSV format, one image per row.
- Each image shows one of the digits 0, 1, 2, ..., 8, 9, so there are 10 possible outcomes for each row.
- Each image is 28 pixels high and 28 pixels wide, for a total of 784 pixels.
- Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning a lighter shade.
- This pixel-value is an integer between 0 and 255, inclusive.
- The data set has 785 columns.
- The first column, called "label", is the digit shown in the image.
- The rest of the columns contain the pixel-values of the associated image.

In [None]:
28 * 28

# Problem:
- Classify each image based on its pixel values.
- The output of the model should be the digit that the pixel values represent.
- Use a supervised learning method for this.

# Data:
- **label:** A value between 0 and 9, both inclusive, giving 10 unique values.
- **pixel0, pixel1, pixel2, ..., pixel782, pixel783:** Each value in these columns is between 0 and 255 and represents the pixel intensity.
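The column names map onto image positions. Assuming the 784 pixel columns are stored row by row (the same row-major order that `reshape((28, 28))` uses later in this notebook), `pixelK` would sit at row `K // 28` and column `K % 28`. A small sketch of that assumed mapping; `pixel_position` is a hypothetical helper name, not part of the dataset or any library:

```python
import numpy as np

IMAGE_SIDE = 28  # images are 28 x 28, so there are 784 pixel columns

def pixel_position(k, side=IMAGE_SIDE):
    """Return (row, col) for column pixel<k>, assuming row-major order."""
    return divmod(k, side)

# A flat vector 0..783 reshaped to 28 x 28 puts value k at pixel_position(k).
flat = np.arange(IMAGE_SIDE * IMAGE_SIDE)
grid = flat.reshape(IMAGE_SIDE, IMAGE_SIDE)

row, col = pixel_position(30)
print(row, col)          # position of pixel30 in the grid
print(grid[row, col])    # value stored at that position
```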

In [None]:
# ! pip install opencv-python

In [None]:
import cv2
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import svm
from sklearn.model_selection import train_test_split

### 1. Loading the data
- We use pandas' `read_csv` to read the dataset CSV into a DataFrame.
- Then we separate our images and labels for supervised learning.
- We also do a `train_test_split` to break our data into two sets, one for training and one for testing. This lets us measure how well our model was trained by later evaluating it on held-out test data with known labels.

In [None]:
data = pd.read_csv('/kaggle/input/d/aggarwalrahul/dl-intro-to-nn/dataset.csv')    # Load the dataset by providing the path to the file.

In [None]:
data

In [None]:
data.shape

In [None]:
# The labels for images.
y = data["label"]

Let's see some of the labels.

In [None]:
print(y[0])               # Label for 1st image.
print(y[2000])            # Label for 2001st image.

### 2. Data Preparation

**Add on:**

What should we do to prepare the data so that it matches the input the model expects?

Let's see in the next steps.

In [None]:
# Drop 'label' column.
X = data.drop(labels = ["label"], axis = 1)

In [None]:
type(X)

In [None]:
print(y[0])               # Label for 1st image.
print(y[2000])            # Label for 2001st image.

In [None]:
i=2000
img = X.iloc[i].values
img = img.reshape((28,28))
plt.imshow(img, cmap='gray')
plt.title(y[i]);

## Think about it:

Why did we drop the labels from the data and save the result in a new variable "X"?

## Think about it:

Do we know what the distribution of the data looks like across all the digits?

That is, we need to know how many images correspond to each digit.

In [None]:
y.value_counts()

In [None]:
g = sns.countplot(x=y);

**Insight -** The plot above shows that the images are spread roughly evenly across all the classes (0, 1, 2, ..., 8, 9), so the dataset is well balanced.
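To put a number on the balance instead of reading it off a plot, one option is to look at class proportions. The sketch below uses randomly generated labels as a stand-in (an assumption, not the real `y`); `labels`, `proportions` and `balance_ratio` are illustrative names:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the label column (assumption: not the real data).
rng = np.random.default_rng(0)
labels = pd.Series(rng.integers(0, 10, size=5000), name="label")

# Share of each digit; a perfectly balanced set would show 0.10 everywhere.
proportions = labels.value_counts(normalize=True).sort_index()
print(proportions.round(3))

# Ratio of the rarest class to the most common one (1.0 means perfect balance).
balance_ratio = proportions.min() / proportions.max()
print(f"min/max class ratio: {balance_ratio:.2f}")
```

On the real data, the same two lines could be applied to `y` before it is converted to a NumPy array.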

### 3. Check for Null and Missing Values

In [None]:
# Check the data
X.isnull().sum()

**Insights -** We checked for corrupted images (rows with missing pixel values). There are no missing values in the dataset, so we can safely go ahead.
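`X.isnull().sum()` returns one count per column, which is hard to scan across 784 columns. A possible shortcut is to collapse it to a single number and, while we are at it, confirm the value range. This is a sketch on a tiny made-up frame (`demo` is a stand-in, not the real `X`):

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for X (assumption: not the real data).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 256, size=(5, 4)),
                    columns=[f"pixel{k}" for k in range(4)])

# One number for the whole frame instead of one count per column.
total_missing = int(demo.isnull().sum().sum())

# True only if every value lies in the expected 0..255 range.
in_range = bool(((demo >= 0) & (demo <= 255)).all().all())

print(total_missing)
print(in_range)
```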

### 4. Normalization [Depends upon the Algorithm]
* We perform a grayscale normalization to reduce the effect of illumination differences. Dividing by 255 rescales every pixel value from the range 0-255 to the range 0-1.

* https://machinelearningmastery.com/how-to-manually-scale-image-pixel-data-for-deep-learning/

In [None]:
# Normalize the data
X = X.astype('float32') / 255.0
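As a quick sanity check of the scaling step, the same expression can be tried on a few made-up pixel values (the array `raw` below is illustrative, not taken from the dataset):

```python
import numpy as np

raw = np.array([[0, 64, 128, 255]], dtype=np.uint8)  # made-up pixel values

# Cast first, then divide, so the result is float32 rather than float64.
scaled = raw.astype("float32") / 255.0

print(scaled.min(), scaled.max())  # smallest and largest value after scaling
print(scaled.dtype)
```

Raw values 0 and 255 map to the two ends of the new range, and everything else falls in between.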

### 5. Train and Test Split

In [None]:
X = X.values # Convert the features (pixel values) to numpy array to feed into the supervised learning model.
y = y.values # Convert the labels to numpy array to feed into the supervised learning model.

In [None]:
# Split data into test and train to build the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=0)
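The split above is purely random. A hedged variation, assuming you want every digit to appear in the same proportion in both parts, is to pass the labels to the `stratify` parameter of `train_test_split`. The sketch uses synthetic data (`X_demo`, `y_demo`) so that it runs without the CSV file:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows, 10 evenly sized classes.
rng = np.random.default_rng(0)
X_demo = rng.random((1000, 784), dtype=np.float32)
y_demo = np.repeat(np.arange(10), 100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))   # class counts in the test part
```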

In [None]:
type(X_train) # As we can see that the data to be fed into model is of the type numpy array.

#### We can get a better sense for one of these examples by visualising the image and looking at the label.

### 6. Viewing an Image
- Since each image is currently one-dimensional (a row of 784 values in a NumPy array), we reshape it so that it is two-dimensional (28x28 pixels).
- Then, we plot the image and label with matplotlib.

* You can change the value of variable i to check out other images and labels.

In [None]:
i=10
img = X[i]
img = img.reshape((28,28))
plt.imshow(img, cmap='gray')
plt.title(y[i]);

## Add-on:
- What is `X[i]`'s shape?
- How to reshape the whole array instead of just one row?
- What's the shape of the reshaped array?

* In the original data, the pixel values of each image were in a 1-dimensional array.
* We converted that 1-D array of 784 pixel values into a 2-D array of shape (28 x 28).
* Note that 28 multiplied by 28 is equal to 784.
* As each pixel value represents a dark or light spot, plotting the 28x28 array gives the above image.
* This confirms that an image can be represented by a NumPy array.
* Each value of the above array represents a pixel, with a value between 0 and 1 after normalization (between 0 and 255 in the raw data).
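For the add-on questions, here is a minimal sketch of reshaping a whole feature matrix in one call. It uses a small all-zero array (`X_demo`, an assumption) in place of the real `X`:

```python
import numpy as np

# Stand-in for the feature matrix (assumption): 5 flattened 28 x 28 images.
X_demo = np.zeros((5, 784), dtype=np.float32)

one_row = X_demo[0]
print(one_row.shape)                 # shape of a single flattened image

# -1 lets NumPy infer the number of images from the total size.
images = X_demo.reshape(-1, 28, 28)
print(images.shape)

# For this contiguous array the reshape is a view: no pixel data is copied.
print(np.shares_memory(images, X_demo))
```

After this, `images[i]` can be passed straight to `plt.imshow` without reshaping each row separately.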

### 7. Examining the Pixel Values:
- Note that these images aren't actually black and white (only 0 and 1). They are grayscale: 0-255 in the raw data, 0-1 after our normalization.
- A histogram of one image's pixel values shows the range.

In [None]:
plt.figure(figsize=(10,7), edgecolor='blue')

n, bins, patches = plt.hist(X[9], bins=10, range=(0.0, 1.0))
plt.xlabel('Pixel value')
plt.ylabel('Number of Pixels')
plt.title('Histogram of Pixel values')
plt.show();

## Think about it:

- What does the height = 641 mean? The same question applies to the other bar heights in the result.
- The bar width is the same for all bins: 0.1.

* From the histogram above, we can see that most pixels fall in the lowest bin, i.e. they have a value at or near zero.
* A value of zero represents a black pixel.
* In the image we saw before, the black portion was larger than the white portion.
* This is consistent with white pixels being represented by a value equal to 1.
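The bar heights can also be read as plain numbers. `plt.hist` relies on the same binning as `np.histogram`, so a sketch with a made-up image (mostly black with a white square, which is an assumption and not a real digit) shows what each height and width means:

```python
import numpy as np

# Made-up image (assumption): mostly black with a bright 10 x 10 square.
img = np.zeros((28, 28), dtype=np.float32)
img[9:19, 9:19] = 1.0

counts, edges = np.histogram(img.ravel(), bins=10, range=(0.0, 1.0))

print(counts)            # number of pixels falling in each bin
print(np.diff(edges))    # every bin is 0.1 wide
print(counts.sum())      # every pixel is counted exactly once
```

Each height is a pixel count for one bin, so the heights always add up to the 784 pixels of the image.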

----------------------------------------------------

## Think about it:

- What should be our next step while understanding about the images?

### 8. Training our model
- First, we use the sklearn.ensemble module to create a **random forest classifier**.
- Next, we pass our training images and labels to the classifier's fit method, which trains our model.
- Finally, the test images and labels are passed to the score method to see how well we trained our model. `score` returns a float between 0 and 1 indicating our accuracy on the test data set.

* Try playing with the parameters of RandomForestClassifier() to see how the results change.
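One way to play with the parameters is a small sweep over `max_depth`. The sketch below is only an illustration: it assumes scikit-learn's small bundled 8 x 8 digits set (`load_digits`) as a stand-in for the 28 x 28 CSV, and the values chosen for `n_estimators` and `max_depth` are arbitrary:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): bundled 8 x 8 digits, not the 28 x 28 CSV.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=0
)

scores = {}
for depth in (2, 8, None):          # None lets each tree grow fully
    model = RandomForestClassifier(
        n_estimators=50, max_depth=depth, random_state=0
    )
    model.fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for this depth.
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for depth, (train_acc, test_acc) in scores.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

Comparing the training and test columns for each depth gives a rough feel for when deeper trees stop helping on unseen data.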

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
?RandomForestClassifier

In [None]:
# random forest model creation
clf = RandomForestClassifier(random_state=190,verbose=False,n_estimators=201,n_jobs=8,max_depth=8)

In [None]:
clf.fit(X_train, y_train)

In [None]:
train_pred_cl = clf.predict(X_train)    # Predictions on the training set, matching the variable name.
train_pred_cl

In [None]:
clf.score(X_train, y_train)

In [None]:
clf.score(X_test, y_test)

In [None]:
y_pred = clf.predict(X_test)
y_pred

How did our model do?
- You should have gotten around 0.9376, or **93.76% accuracy**, on the test set: roughly 94 of every 100 test images are classified correctly.
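For a classifier, `score` reports mean accuracy, which is the fraction of predictions that match the true labels. A tiny sketch with made-up labels and predictions (both arrays are assumptions for illustration) shows the arithmetic:

```python
import numpy as np

# Made-up labels and predictions (assumption), just to show the arithmetic.
y_true_demo = np.array([3, 1, 4, 1, 5, 9, 2, 6])
y_pred_demo = np.array([3, 1, 4, 7, 5, 9, 2, 0])

# Comparing the arrays gives booleans; their mean is the fraction of matches.
accuracy = np.mean(y_true_demo == y_pred_demo)
print(accuracy)
```

Here 6 of the 8 predictions match, so the accuracy is 0.75.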

In [None]:
i=0
img = X_test[i]
img = img.reshape((28,28))
plt.imshow(img, cmap='gray')
plt.title(y_pred[i]);

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
print("=== Confusion Matrix ===")
cm = confusion_matrix(y_test, y_pred)
print(cm)

In [None]:
df_cm = pd.DataFrame(cm, index = [i for i in "0123456789"],
                     columns = [i for i in "0123456789"])
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True, fmt='d');

In [None]:
print("=== Classification Report ===")
print(classification_report(y_test, y_pred))

**Insights -** As the classification report shows, the model classifies the digit images with good accuracy.
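The precision and recall figures in the report can be derived from the confusion matrix. In scikit-learn's convention, rows are true classes and columns are predicted classes. The sketch below uses a made-up 3-class matrix (`cm_demo`, an assumption) to keep the arithmetic visible:

```python
import numpy as np

# Made-up 3-class confusion matrix (assumption); rows = true, columns = predicted.
cm_demo = np.array([
    [50,  2,  3],
    [ 4, 40,  6],
    [ 1,  9, 35],
])

correct = np.diag(cm_demo)                  # correctly classified per class
recall = correct / cm_demo.sum(axis=1)      # divide by row totals (true counts)
precision = correct / cm_demo.sum(axis=0)   # divide by column totals (predicted counts)
accuracy = correct.sum() / cm_demo.sum()

print(recall.round(3))
print(precision.round(3))
print(round(accuracy, 3))
```

The same expressions could be applied to the 10 x 10 matrix `cm` computed above to see which digits are confused most often.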

## Please Note:
Avoid looping element by element through a NumPy array or pandas DataFrame! Use vectorized operations instead!
https://www.pythonlikeyoumeanit.com/Module3_IntroducingNumpy/VectorizedOperations.html
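To make the difference concrete, here is a sketch that scales a made-up 28 x 28 array both ways (the array `pixels` is illustrative, not a real image):

```python
import numpy as np

pixels = np.arange(784, dtype=np.float32).reshape(28, 28) % 256  # made-up image

# Loop version: one Python-level operation per pixel.
looped = np.empty_like(pixels)
for r in range(pixels.shape[0]):
    for c in range(pixels.shape[1]):
        looped[r, c] = pixels[r, c] / 255.0

# Vectorized version: one expression over the whole array.
vectorized = pixels / 255.0

print(np.allclose(looped, vectorized))
```

Both versions give the same values. The vectorized line is shorter and does the work inside NumPy instead of in the Python interpreter, which is typically much faster on large arrays.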

# Session Summary
- Images can be represented as NumPy arrays.
- If the array is 1-D, we can visualize the image after reshaping the array to a suitable shape. In this case we converted it to a 2-D array.
- We visualized the image by plotting the NumPy array using matplotlib.
- We then looked at the distribution of pixel values in a histogram: a black pixel has the value 0 and a white pixel has the value 1. These are the values after normalization, i.e. after dividing each value by 255.
- We used RandomForestClassifier as the supervised classification method.