Steps in Creating a Data Science Project

* Defining the project
* Preparing the data
* Exploratory Data Analysis & Preprocessing
* Creating a machine learning model
* Predictions
* Presenting your findings

## 1. Defining the project
MNIST ("Modified National Institute of Standards and Technology") is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.

In this problem, our goal is to correctly identify digits from a dataset of tens of thousands of handwritten images.

## 2. Preparing the data

In [None]:
# Import libraries

import tensorflow as tf
#tf.__version__
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Ignore the warnings
import warnings
warnings.filterwarnings('ignore')

# display all dataframe columns & rows
pd.options.display.max_columns = None
pd.options.display.max_rows = None

# show floats with 7 decimal places
pd.options.display.float_format = '{:.7f}'.format


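The original MNIST database contains 60,000 training images and 10,000 testing images taken from American Census Bureau employees and American high school students. The Kaggle competition files used here split the data differently: ``train.csv`` has 42,000 labeled rows and ``test.csv`` has 28,000 unlabeled rows.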

In [None]:
train = pd.read_csv('../input/digit-recognizer/train.csv')
train.head()

In [None]:
test = pd.read_csv('../input/digit-recognizer/test.csv')
test.head()

## 3. Exploratory Data Analysis & Preprocessing

In [None]:
train.describe()

In [None]:
plt.figure(figsize=(10,7))
# pass the column as x= explicitly; recent seaborn versions require keyword arguments here
sns.countplot(x=train['label'])
plt.tight_layout()  # for better visualization

In [None]:
sns.set_style("darkgrid")
plt.figure(figsize=(12,6))
# distribution plot (sns.distplot is deprecated; histplot is its replacement)
ax = sns.histplot(train['label'], discrete=True, color='red')
ax.set_xlabel("Labels")

Let's extract the features for the train and test sets to use during modeling.

In [None]:
# Extract features
features = train.drop('label', axis=1)

# Extract label
y_train = train['label']

# Train images
X_ = np.array(features)
X_train = X_.reshape(X_.shape[0], 28, 28)

# Test images
X_test = np.array(test)

To view the image at a given index, pass ``X_train[index]`` to ``plt.imshow()``.

In [None]:
plt.imshow(X_train[0])
plt.colorbar()
plt.grid(False)
plt.show()

To verify that the data is in the correct format and that you're ready to build and train the network, let's display the first 16 images from the training set.

In [None]:
fig = plt.figure(figsize=(10,5))

for i in range(16):
  fig.add_subplot(4, 4, i+1)
  plt.xticks([])
  plt.yticks([])
  plt.imshow(X_train[i], cmap='gray')
  plt.xlabel('Digit: ' + str(y_train[i]))
plt.tight_layout()  # adjust spacing once, after all subplots are drawn
plt.show()


Check the shape of train and test set.

In [None]:
X_train.shape, X_test.shape

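The two arrays do not have the same shape: the training images are already 28 by 28, while each test row is still a flat vector of 784 pixels. Both need to be reshaped to a common ``(samples, 28, 28, 1)`` layout, where the trailing 1 is the single grayscale channel.
Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. (28*28 = 784)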

In [None]:
X_train = X_train.reshape(X_train.shape[0], 28, 28,1)
X_test = X_test.reshape(X_test.shape[0], 28, 28,1)

In [None]:
# let's check the shape again
X_train.shape, X_test.shape
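
The reshape above can be tried on a small stand-in array first. This is a minimal sketch with made-up data (the array ``flat`` is hypothetical, not the competition data); it shows that reshaping only changes how the 784 values are indexed, not the values themselves.

```python
import numpy as np

# Stand-in for the CSV pixel columns: 5 made-up "images", 784 values each.
flat = np.arange(5 * 784).reshape(5, 784)

# -1 lets NumPy infer the number of images; the trailing 1 is the channel axis.
images = flat.reshape(-1, 28, 28, 1)

print(flat.shape)
print(images.shape)

# Row-major order is preserved: pixel k of a flat row lands at (k // 28, k % 28).
k = 100
assert images[0, k // 28, k % 28, 0] == flat[0, k]
```

Using ``-1`` for the first axis avoids hard-coding the number of rows, so the same line works for both the train and the test arrays.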

Now it's time to scale (in other words, normalize) our data.
> Each pixel value is an integer between 0 and 255, inclusive. Dividing the train and test sets by 255, the maximum possible value, brings every pixel into the range 0 to 1.

In [None]:
X_train.min(), X_train.max()

In [None]:
X_test.min(), X_test.max()

In [None]:
X_train = X_train/255.
X_test = X_test/255.

Check the ``min()`` and ``max()`` of the data to confirm it is now normalized.

In [None]:
X_train.min(), X_train.max()
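
As a rough illustration of the scaling step, the sketch below applies the same division to made-up pixel data (the ``pixels`` array is hypothetical). It also shows that dividing an integer array by a float yields a floating-point array.

```python
import numpy as np

# Made-up uint8 pixel data standing in for the real images.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(4, 28, 28, 1), dtype=np.uint8)
pixels[0, 0, 0, 0] = 0      # make sure both extremes are present
pixels[0, 0, 1, 0] = 255

# Dividing by a float promotes the integer array to floating point.
scaled = pixels / 255.0

print(pixels.dtype, pixels.min(), pixels.max())
print(scaled.dtype, scaled.min(), scaled.max())
```

If memory is a concern, one option is ``pixels.astype("float32") / 255.0``, which keeps the result in 32-bit floats instead of the default 64-bit.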

## 4. Creating a machine learning model

Since we are working on a multi-class classification problem (ten digit classes, 0 through 9), let's create our neural network with TensorFlow.
The steps in modeling with TensorFlow are typically:

* Create or import a model
* Compile a model
* Fit the model
* Evaluate the model
* Tweak
* Evaluate.

> The first layer in this network, tf.keras.layers.Flatten, transforms the format of the images from a two-dimensional array (of 28 by 28 pixels) to a one-dimensional array (of 28 * 28 = 784 pixels). Think of this layer as unstacking rows of pixels in the image and lining them up. This layer has no parameters to learn; it only reformats the data.

The ``from_logits=True`` argument informs the loss function that the output values generated by the model are not normalized, a.k.a. logits. In other words, the softmax function has not been applied to them to produce a probability distribution.
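
To make the idea of logits concrete, here is a minimal NumPy sketch of what a sparse categorical cross-entropy computes conceptually when it is given raw logits: softmax first, then the negative log-probability of the true class. The logits and labels are made up, and this is an illustration of the math, not the Keras implementation.

```python
import numpy as np

# Made-up logits for 2 samples and 10 classes, plus their true labels.
logits = np.array([
    [2.0, 0.5, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
])
labels = np.array([0, 7])

# Softmax, shifted by the row maximum for numerical stability.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Sparse categorical cross-entropy: -log of the probability of the true class.
per_sample_loss = -np.log(probs[np.arange(len(labels)), labels])

print(probs.sum(axis=1))       # each row sums to 1
print(per_sample_loss.mean())  # mean loss over the batch
```

The second sample gets the larger loss because its true class (7) does not have the highest logit.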

In [None]:
# 1. Create the model

model = tf.keras.Sequential([
    # input_shape matches the reshaped images: 28 x 28 pixels, 1 channel
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10)  # 10 is the number of classes; outputs are logits
])

# 2. Compile the model

model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=["accuracy"])

# 3. Fit the model

history = model.fit(X_train, y_train, epochs=50)
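
For intuition about what the three layers compute, the following NumPy sketch runs one forward pass with randomly initialized stand-in weights (``w1``, ``b1``, ``w2``, ``b2`` are hypothetical, not the values Keras learns) and counts the trainable parameters. It is a sketch of the arithmetic only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up batch of 3 images with the same per-image shape as the training data.
x = rng.random((3, 28, 28, 1))

# Random stand-ins for the weights and biases of the two Dense layers.
w1 = rng.normal(size=(784, 128)) * 0.01
b1 = np.zeros(128)
w2 = rng.normal(size=(128, 10)) * 0.01
b2 = np.zeros(10)

flat = x.reshape(x.shape[0], -1)          # Flatten: (3, 784), no parameters
hidden = np.maximum(flat @ w1 + b1, 0.0)  # Dense(128) with ReLU: (3, 128)
logits = hidden @ w2 + b2                 # Dense(10), no activation: (3, 10)

n_params = w1.size + b1.size + w2.size + b2.size
print(flat.shape, hidden.shape, logits.shape)
print(n_params)  # 784*128 + 128 + 128*10 + 10 = 101770
```

The parameter count should match the total reported by ``model.summary()`` for the network defined above.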


The ``history`` variable is used in the next step to plot the loss and accuracy per epoch and see how they trend.

**Accuracy & Loss plot:**

In [None]:
# plotting the loss and accuracy graph

pd.DataFrame(history.history).plot(figsize=(12,7))
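
``history.history`` is a dictionary with one list per metric and one entry per epoch, which is why it converts directly to a DataFrame. The sketch below uses a hypothetical dictionary with invented numbers (``fake_history`` is not real training output) to show that structure and how to find the epoch with the lowest loss.

```python
import pandas as pd

# Hypothetical stand-in for history.history. The numbers are invented
# for illustration only and are not results from this notebook.
fake_history = {
    "loss": [0.30, 0.15, 0.10, 0.08],
    "accuracy": [0.91, 0.95, 0.97, 0.975],
}

df = pd.DataFrame(fake_history)
df.index = df.index + 1      # label rows by epoch number, starting at 1
df.index.name = "epoch"

best_epoch = df["loss"].idxmin()
print(df)
print("lowest training loss at epoch", best_epoch)
```

Because ``fit`` above is called without validation data, the real ``history.history`` holds training metrics only.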

## 5. Predictions

In [None]:
predictions = np.argmax(model.predict(X_test), axis=1)

In [None]:
predictions
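
The ``np.argmax(..., axis=1)`` call turns each row of ten logits into a single digit label. A small sketch with made-up model outputs (the ``logits`` array is hypothetical) shows the step in isolation:

```python
import numpy as np

# Made-up model outputs: 3 samples, 10 logits each.
logits = np.array([
    [0.1, 2.3, -0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 4.1, 0.0, 0.0, 0.0, 0.0, 1.2],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3, 5.0],
])

# axis=1 picks the highest-scoring class per row (per image).
labels = np.argmax(logits, axis=1)
print(labels)  # [1 4 9]

# Softmax is monotonic, so applying it would not change the argmax.
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
assert np.array_equal(np.argmax(probs, axis=1), labels)
```

This is why the model can output raw logits at prediction time: a softmax is only needed if the class probabilities themselves are of interest.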

Make a submission file for the competition.

In [None]:
submission = pd.read_csv('../input/digit-recognizer/sample_submission.csv')
submission.head()

The sample submission has 28000 rows and 2 columns, ``ImageId`` and ``Label``.

In [None]:
submission.shape

In [None]:
submission['Label'] = predictions

In [None]:
submission.to_csv('submission.csv',index=False)
submission.head()
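
As an alternative to filling in the sample file, the same two-column table can be built directly from the predictions. This is a sketch under the assumption that ``ImageId`` is 1-based and follows the row order of ``test.csv``; the predictions here are made up, and the CSV is written to an in-memory buffer instead of a file.

```python
import io

import numpy as np
import pandas as pd

# Made-up predictions for 5 test images.
predictions = np.array([2, 0, 9, 9, 3])

# Assumption: ImageId is 1-based and follows the row order of test.csv.
submission = pd.DataFrame({
    "ImageId": np.arange(1, len(predictions) + 1),
    "Label": predictions,
})

# Write to an in-memory buffer here; a real run would pass a file path.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing ``index=False`` matters in both versions: without it, pandas writes the row index as an extra unnamed column.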