Hello, my name is Rohan. I am interested in learning more techniques and tools in data analytics to strengthen my skills. I hope to work through a series of Kaggle datasets, gradually tackling more complex problems and learning new techniques along the way.

In this first notebook I will be working with a "beginner" dataset. The goal is just to get hands-on experience with the Kaggle Kernel and to explore some of the more basic data science libraries in Python. I have previously used pandas, NumPy and Matplotlib for basic visualization. In this notebook I hope to describe, visualize and extend the Iris data to be able to predict Iris species based on given parameters. I hope to use either Logistic Regression or K-nearest neighbors (or both) for class fitting. I will be outlining my steps as I go for reference.

I start with some basic imports and commands to view the structure of the data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
import seaborn as sns # statistical plotting
import os
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Model # import from the same Keras that ships with TensorFlow
# Any results you write to the current directory are saved as output.

In [None]:
df = pd.read_csv("../input/iris.csv")
df.head()

In [None]:
df.info()

In [None]:
df.describe()

The input features are sepal length, sepal width, petal length and petal width, and the target to predict is Species. Let's examine the Species variable further.

In [None]:
df['Species'].value_counts()

I didn't know about the seaborn library before this. Its pairplot command builds a square grid of plots in which every variable is plotted against every other variable, with the diagonal showing the distribution of each variable on its own.

In [None]:
sns.pairplot(df,hue ='Species')

The hue parameter colors the points by the categories in the "Species" column of the dataframe, so we can see how each pair of parameters separates the species. For example, the plot in the bottom row, second column shows that flowers with low petal width and low sepal length tend to be Setosa. The plots show large gaps between the groups, which makes me think the nearest neighbors algorithm might be a promising classification method.

Looking at the diagonals, we see that petal length and petal width are much smaller for the Setosa species and clearly separated from the other two. Sepal width by itself is almost indistinguishable across the three species, so when we look at it plotted against other parameters (the graphs in the middle row or column) the classes are separated in only one direction, vertical or horizontal.
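To make the nearest neighbors idea concrete, here is a minimal sketch of how such a classifier labels a new flower. The function name `knn_predict`, the toy points and the label strings are all made up for illustration; they are not rows from the Iris file, and a library implementation would normally be used instead.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training row
    distances = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Made-up (petal length, petal width) points, not taken from the real dataset
toy_X = np.array([[1.4, 0.2], [1.5, 0.3], [1.3, 0.2],
                  [4.5, 1.5], [4.7, 1.4], [4.4, 1.3]])
toy_y = np.array(["setosa", "setosa", "setosa",
                  "versicolor", "versicolor", "versicolor"])

prediction = knn_predict(toy_X, toy_y, np.array([1.6, 0.25]))
print(prediction)
```

Because the two toy groups sit far apart, the three closest points to the query all carry the same label, which is the situation the pairplot suggests for Setosa.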

Now that I have visualized the data, let's try to train a model on it.

I want to get more familiar with the TensorFlow and Keras libraries for Python, so I will be using them to perform a simple regression analysis.

I first need to separate the data into training and testing sets and then prep the data by normalizing it (rescaling each feature to zero mean and unit standard deviation so that the features are on a comparable scale). This can be done as follows:

![](http://www.statisticshowto.com/wp-content/uploads/2016/11/alternate-z-score.png)

Where:

- x_i is a data point (x_1, x_2, …, x_n).
- x̄ is the sample mean.
- s is the sample standard deviation.
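As a small numeric check of the formula, which reads z_i = (x_i - x̄) / s, the sketch below standardizes a handful of made-up numbers with NumPy. The array `values` is purely illustrative; `ddof=1` asks NumPy for the sample standard deviation, which is also what pandas uses by default.

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up numbers

mean = values.mean()
std = values.std(ddof=1)      # sample standard deviation (divides by n - 1)
z = (values - mean) / std     # z-score of every data point

print(z.mean())               # approximately 0
print(z.std(ddof=1))          # approximately 1
```

After the transformation the values are centered on zero with a sample standard deviation of one, whatever units they started in.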

So we separate the data (70:30 training to testing) and then apply this formula to the training DataFrame.

In [None]:
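#Splitting data into training and testing categories as well as input/output
#First 105 rows for training (70% of a 150-row dataset); the end index is excluded
training_data = df.iloc[:105]
training_results = training_data['Species']
training_data = training_data.drop(columns='Species')

#Remaining rows for testing, starting at row 105 so that no row is skipped
testing_data = df.iloc[105:]
testing_results = testing_data['Species']
testing_data = testing_data.drop(columns='Species')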
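One thing to watch with a split by row position: Iris files are often ordered by species, in which case the last 30% of the rows would contain only one class. A minimal sketch of shuffling before splitting is shown below. The frame `toy` is a hypothetical stand-in for `df`, and the `feature` column and the labels `a`, `b`, `c` are made up; I have not checked how the rows in this particular file are ordered.

```python
import pandas as pd

# Hypothetical stand-in for df: 150 rows ordered by species
toy = pd.DataFrame({
    "feature": range(150),
    "Species": ["a"] * 50 + ["b"] * 50 + ["c"] * 50,
})

# Shuffle all rows; random_state only makes the shuffle repeatable
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)

split = len(shuffled) * 7 // 10      # 70% of the rows go to training
train = shuffled.iloc[:split]
test = shuffled.iloc[split:]

print(train["Species"].value_counts())
print(test["Species"].value_counts())
```

The two `value_counts()` calls make it easy to confirm that every species shows up on both sides of the split.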

In [None]:
#normalize data
training_data = (training_data - training_data.mean())/training_data.std()
training_data.describe()


In [None]:
sns.pairplot(training_data)

Now we can see that the training features are on roughly the same scale: each has a standard deviation of one and a mean close to zero. I wanted to know why exactly normalization is so important, and upon further research I found that it takes variables that are scaled differently, with different means and standard deviations, and makes them directly comparable. This is useful for regression analysis, where the features are combined in one model and a feature with large raw values could otherwise dominate.
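The testing data needs the same transformation before it is fed to a model. A common practice is to reuse the mean and standard deviation computed from the training rows, so that no information from the test set leaks into preprocessing. The sketch below shows the idea on made-up frames; `train`, `test` and their column names are placeholders, not the notebook's variables.

```python
import pandas as pd

# Made-up feature values standing in for the real train/test frames
train = pd.DataFrame({"length": [5.0, 6.0, 7.0], "width": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"length": [6.5], "width": [1.5]})

# Statistics come from the training rows only
train_mean = train.mean()
train_std = train.std()

train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std   # reuse the training statistics

print(test_scaled)
```

Subtracting a Series from a DataFrame lines the values up by column name, so each feature is scaled with its own mean and standard deviation.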

Now we must create a model.

In [None]:
#initialize a sequential model object
model = keras.Sequential()

#add input layer
#Add a Dense layer with the input shape of the training data. Use
#a ReLU function for nonlinear activation.
model.add(keras.layers.Dense(64, activation=tf.nn.relu,
                       input_shape=(training_data.shape[1],)))

#add hidden layer
#Add a Dense layer again using a ReLU function.
model.add(keras.layers.Dense(64, activation=tf.nn.relu))

#add output layer
#A single unit with no activation, which produces one continuous value.
model.add(keras.layers.Dense(1))

model.summary()

Above we've created a simple neural network. It takes the training data variables as inputs and starts from randomly initialized weights. Training uses backpropagation with gradient descent to adjust these weights toward the best fit, which reveals how each data parameter most likely affects the species result. The activation parameter applies a rectifier function (ReLU), which makes the network nonlinear so it can fit more sophisticated relationships than a straight-line model.

![](https://cdn-images-1.medium.com/max/1000/0*kETHX4MtZfu8_0sE.png)
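To see what the layers compute, here is a rough NumPy sketch of a single forward pass through a network of the same shape (64 units, 64 units, 1 output). The four input features, the sample values and the normally distributed weights are assumptions for illustration; Keras uses its own weight initializer and learns the weights during training.

```python
import numpy as np

def relu(x):
    """Rectifier: keep positive values, replace negatives with zero."""
    return np.maximum(0, x)

rng = np.random.default_rng(0)

# One made-up sample with 4 features (the feature count is an assumption)
sample = np.array([[0.5, -1.2, 0.3, 0.8]])

# Random weights and zero biases for a 4 -> 64 -> 64 -> 1 network
w1, b1 = rng.normal(size=(4, 64)), np.zeros(64)
w2, b2 = rng.normal(size=(64, 64)), np.zeros(64)
w3, b3 = rng.normal(size=(64, 1)), np.zeros(1)

hidden1 = relu(sample @ w1 + b1)     # first Dense layer with ReLU
hidden2 = relu(hidden1 @ w2 + b2)    # second Dense layer with ReLU
output = hidden2 @ w3 + b3           # final layer has no activation

print(relu(np.array([-2.0, 0.0, 3.0])))
print(output.shape)
```

Each Dense layer is a matrix multiplication plus a bias, and ReLU is the only nonlinear step; without it the three layers would collapse into a single linear map.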

Now let's train our model.

In [None]:
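#RMSprop adapts the step size per weight to speed up gradient descent.
#keras.optimizers.RMSprop replaces tf.train.RMSPropOptimizer, which was removed in TensorFlow 2.
optimizer = keras.optimizers.RMSprop(0.001)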

#mean squared error loss(error) function to be minimized for the NN.
model.compile(loss='mse',
        optimizer=optimizer,
        metrics=['accuracy'])

history = model.fit(training_data, training_results, epochs=500,
                    validation_split=0.3, verbose=0)

#TODO: Convert training_results into floating points? Review optimizers and
#loss functions.
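On the TODO above: Keras needs numeric targets, so the species strings have to be encoded before fitting. Below is a minimal sketch of two common encodings using pandas. The Series `species` holds made-up labels standing in for `training_results`, and the actual label strings in the file may differ.

```python
import pandas as pd

# Made-up labels standing in for the Species column
species = pd.Series(["setosa", "virginica", "versicolor", "setosa"])

# Integer codes: 0, 1, 2, ... assigned in order of first appearance
codes, names = pd.factorize(species)

# One-hot encoding: one indicator column per species
one_hot = pd.get_dummies(species)

print(codes)
print(list(names))
print(one_hot)
```

Integer codes would let the model above run as a regression. For classification, a more typical setup is an output layer with one unit per class, a softmax activation and a cross-entropy loss, with either integer or one-hot targets.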