# <center>  MBA Tech Sem VI : Machine Learning 
#  Prepared by : Prof. Santosh Bothe </center>

In this session you will learn how to prepare your data for machine learning in Python using scikit-learn. The session covers recipes to:
- Rescale data.
- Standardize data.
- Normalize data.
- Binarize data.

# **1. Rescale Data**
When your data contains attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes so they all have the same scale. This is often referred to as normalization, and attributes are often rescaled into the range between 0 and 1. Rescaling is useful for the optimization algorithms at the core of many machine learning methods, such as gradient descent. It is also useful for algorithms that weight inputs, such as regression and neural networks, and for algorithms that use distance measures, such as k-Nearest Neighbors. You can rescale your data with scikit-learn using the MinMaxScaler class.

Reading

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

In [None]:
# Rescale data (between 0 and 1)
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import Binarizer
from sklearn.preprocessing import StandardScaler

filename = '../input/cern-electron-collision-data/dielectron.csv'
dataframe = read_csv(filename)
array = dataframe.values
print(array[0])
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
print(X)
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])
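
The rescaling above can also be sketched by hand. The snippet below is a minimal illustration on a small made-up array (`toy_X` is an assumption for this example, not the CERN dataset): it applies the min-max formula per column and checks that the result agrees with MinMaxScaler.

```python
# Minimal sketch: min-max rescaling by hand on a made-up array (not the CSV used above)
import numpy as np
from sklearn.preprocessing import MinMaxScaler

toy_X = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [4.0, 800.0]])

# x' = (x - min) / (max - min), computed separately for each column
col_min = toy_X.min(axis=0)
col_max = toy_X.max(axis=0)
manual = (toy_X - col_min) / (col_max - col_min)

# the same transform via scikit-learn
sk = MinMaxScaler(feature_range=(0, 1)).fit_transform(toy_X)

print(manual)
print(np.allclose(manual, sk))
```

After the transform, every column has a minimum of 0 and a maximum of 1, regardless of its original scale.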

# **2. Standardize Data**

Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1. It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression and linear discriminant analysis. You can standardize data using scikit-learn with the StandardScaler class.

Reading : http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [None]:
# Standardize data (0 mean, 1 stdev)
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])
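
As a rough cross-check of what StandardScaler computes, the sketch below standardizes a small made-up array (`toy_X` is an assumption for illustration) by subtracting the column mean and dividing by the column standard deviation. StandardScaler uses the population standard deviation, which corresponds to NumPy's default `ddof=0`.

```python
# Minimal sketch: standardization by hand on a made-up array
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_X = np.array([[50.0, 30.0],
                  [20.0, 90.0],
                  [30.0, 50.0]])

# z = (x - mean) / std, per column (population std, ddof=0)
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

# the same transform via scikit-learn
sk = StandardScaler().fit_transform(toy_X)

print(manual.mean(axis=0))  # approximately 0 for each column
print(manual.std(axis=0))   # approximately 1 for each column
print(np.allclose(manual, sk))
```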

# 3. Normalize Data
Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm or a vector with the length of 1 in linear algebra). This pre-processing method can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values, such as neural networks, and algorithms that use distance measures, such as k-Nearest Neighbors. You can normalize data in Python with scikit-learn using the Normalizer class.


Reading : http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html

In [None]:
# Normalize data (length of 1)
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(normalizedX[0:5,:])
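
To make the idea of "length of 1" concrete, here is a small sketch on made-up rows (`toy_X` is an assumption for illustration). It divides each row by its Euclidean (L2) length, which is the default norm used by Normalizer, and compares the result with the class.

```python
# Minimal sketch: row-wise normalization by hand on made-up rows
import numpy as np
from sklearn.preprocessing import Normalizer

toy_X = np.array([[3.0, 4.0],
                  [1.0, 0.0],
                  [6.0, 8.0]])

# Euclidean length of each row, kept as a column so it broadcasts over the row
row_norms = np.linalg.norm(toy_X, axis=1, keepdims=True)
manual = toy_X / row_norms

# the same transform via scikit-learn (default norm is 'l2')
sk = Normalizer().fit_transform(toy_X)

print(manual)
print(np.linalg.norm(manual, axis=1))  # each row now has length 1
```

Note that the first and third rows become identical: normalization keeps the direction of a row and discards its magnitude.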

# 4. Binarize Data (Make Binary)
You can transform your data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0. This is called binarizing your data or thresholding your data. It can be useful when you have probabilities that you want to make into crisp values. It is also useful when feature engineering and you want to add new features that indicate something meaningful. You can create new binary attributes in Python using scikit-learn with the Binarizer class.

Reading: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html

In [None]:
# binarization
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(binaryX[0:5,:])
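
The use case of turning probabilities into crisp values can be sketched as follows. The probabilities in `probs` and the threshold of 0.5 are assumptions chosen for illustration. Values strictly above the threshold become 1 and values equal to or below it become 0.

```python
# Minimal sketch: thresholding made-up probabilities at 0.5
import numpy as np
from sklearn.preprocessing import Binarizer

probs = np.array([[0.10, 0.50, 0.90],
                  [0.51, 0.49, 0.50]])

# values > 0.5 map to 1, values <= 0.5 map to 0
crisp = Binarizer(threshold=0.5).fit_transform(probs)

# the same rule written as a plain NumPy comparison
manual = (probs > 0.5).astype(float)

print(crisp)
print(np.array_equal(crisp, manual))
```

A value exactly equal to the threshold (0.50 here) is mapped to 0, which matters when choosing the cut-off.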



Let's look at a normal distribution. Below is some code to generate and plot an idealized Gaussian distribution.

Running the example generates a plot of an idealized Gaussian distribution. The x-axis shows the observations and the y-axis shows the likelihood of each observation. In this case, observations around 0.0 are the most common and observations around -3.0 and 3.0 are rare or unlikely. Technically, this is called a probability density function.

In [None]:
# generate and plot an idealized gaussian
from numpy import arange
from matplotlib import pyplot
from scipy.stats import norm
# x-axis for the plot
x_axis = arange(-3, 3, 0.001)
# y-axis as the gaussian
y_axis = norm.pdf(x_axis, 0, 1)
# plot data
pyplot.plot(x_axis, y_axis)
pyplot.show()
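
The curve plotted above is the Gaussian probability density function, f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi)). The sketch below writes that formula out directly and checks it against `norm.pdf`. The helper name `gaussian_pdf` is an assumption introduced for this illustration.

```python
# Minimal sketch: the Gaussian density written out by hand
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    # f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))
    return np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

x_axis = np.arange(-3, 3, 0.001)
manual = gaussian_pdf(x_axis)

# compare with SciPy's implementation for mean 0 and standard deviation 1
print(np.allclose(manual, norm.pdf(x_axis, 0, 1)))
# height of the peak at x = 0
print(gaussian_pdf(0.0))
```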

We can draw a sample of random Gaussian numbers, plot it using a histogram, and look for the expected shape in the plotted data. The complete example is listed below.

With the default number of bins, we can almost see the Gaussian shape of the data, but it is blocky. This highlights an important point. Sometimes the data will not be a perfect Gaussian, but it will have a Gaussian-like distribution. It is almost Gaussian, and it might look more Gaussian if it was plotted in a different way, scaled in some way, or if more data was gathered. Often, when working with Gaussian-like data, we can treat it as Gaussian, use all of the same statistical tools, and get reliable results.

In [None]:
# generate a sample of random gaussians
from numpy.random import seed
from numpy.random import randn
from matplotlib import pyplot
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# histogram of generated data
pyplot.hist(data)
pyplot.show()

Example of calculating and plotting the sample of Gaussian random numbers with
more bins.

In [None]:
# generate a sample of random gaussians
from numpy.random import seed
from numpy.random import randn
from matplotlib import pyplot
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# histogram of generated data
pyplot.hist(data, bins=100)
pyplot.show()

# Central Tendency
The central tendency of a distribution refers to the middle or typical value in the distribution, that is, the most common or most likely value. In the Gaussian distribution, the central tendency is called the mean, or more formally, the arithmetic mean, and is one of the two main parameters that define any Gaussian distribution.

Example of calculating the **mean** and **median** of a data sample.

In [None]:
# calculate the mean of a sample
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from numpy import median
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# calculate mean
result = mean(data)
print('Mean: %.3f' % result)
# calculate median
result = median(data)
print('Median: %.3f' % result)

# Variance
The variance of a distribution refers to how much, on average, the observations vary or differ from the mean value. It is useful to think of the variance as a measure of the spread of a distribution. A low variance will have values grouped around the mean (e.g. a narrow bell shape), whereas a high variance will have values spread out from the mean (e.g. a wide bell shape). We can demonstrate this with an example, by plotting idealized Gaussians with low and high variance. The complete example is listed below.

In [None]:
# generate and plot gaussians with different variance
from numpy import arange
from matplotlib import pyplot
from scipy.stats import norm
# x-axis for the plot
x_axis = arange(-3, 3, 0.001)
# plot low variance
pyplot.plot(x_axis, norm.pdf(x_axis, 0, 0.5))
# plot high variance
pyplot.plot(x_axis, norm.pdf(x_axis, 0, 1))
pyplot.show()

**Example of calculating the variance of a data sample.**

In [None]:
# calculate the variance of a sample
from numpy.random import seed
from numpy.random import randn
from numpy import var
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# calculate variance
result = var(data)
print('Variance: %.3f' % result)

Calculate the standard deviation of a sample

In [None]:
# calculate the standard deviation of a sample
from numpy.random import seed
from numpy.random import randn
from numpy import std
# seed the random number generator
seed(1)
# generate univariate observations
data = 5 * randn(10000) + 50
# calculate standard deviation
result = std(data)
print('Standard Deviation: %.3f' % result)
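
One detail worth knowing about the two examples above: NumPy's `var()` and `std()` divide by n by default (`ddof=0`), which gives the population estimate. Passing `ddof=1` divides by n - 1 and gives the sample estimate. The sketch below illustrates the relationship. It uses `numpy.random.default_rng`, a different generator from the `seed()`/`randn()` calls above, so the printed numbers are not expected to match the earlier cells.

```python
# Minimal sketch: population vs. sample variance, and std as the square root of variance
import numpy as np

rng = np.random.default_rng(1)  # different generator than seed()/randn(), numbers will differ
data = 5 * rng.standard_normal(10000) + 50
n = len(data)

pop_var = np.var(data)             # divides by n (ddof=0, the default)
sample_var = np.var(data, ddof=1)  # divides by n - 1
pop_std = np.std(data)             # square root of the population variance

print('Population variance: %.3f' % pop_var)
print('Sample variance: %.3f' % sample_var)
print('Standard deviation: %.3f' % pop_std)
```

With 10,000 observations the two variance estimates are nearly identical; the difference only becomes noticeable for small samples.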

Example creating a line plot from data

In [None]:
# example of a line plot
from numpy import sin
from matplotlib import pyplot
# consistent interval for x-axis
x = [x*0.1 for x in range(100)]
# function of x for y-axis
y = sin(x)
# create line plot
pyplot.plot(x, y)
# show line plot
pyplot.show()

# Bar Chart 

A bar chart is generally used to present relative quantities for multiple categories. The x-axis represents the categories, which are spaced evenly. The y-axis represents the quantity for each category and is drawn as a bar from the baseline to the appropriate level on the y-axis. A bar chart can be created by calling the bar() function and passing the category names for the x-axis and the quantities for the y-axis.

In [None]:
# example of a bar chart
from random import seed
from random import randint
from matplotlib import pyplot
# seed the random number generator
seed(1)
# names for categories
x = ['red', 'green', 'blue']
# quantities for each category
y = [randint(0, 100), randint(0, 100), randint(0, 100)]
# create bar chart
pyplot.bar(x, y)
# show the plot
pyplot.show()

# A histogram plot 
A histogram plot is generally used to summarize the distribution of a data sample. The x-axis represents discrete bins or intervals for the observations. For example, observations with values between 1 and 10 may be split into five bins: the values [1,2] would be allocated to the first bin, [3,4] to the second bin, and so on. The y-axis represents the frequency or count of the number of observations in the dataset that belong to each bin. Essentially, a data sample is transformed into a bar chart where each category on the x-axis represents an interval of observation values.

Histograms are density estimates. A density estimate gives a good impression of the distribution of the data.[...] The idea is to locally represent the data density by counting the number of observations in a sequence of consecutive intervals (bins) ...

Example creating a histogram plot from data

In [None]:
# example of a histogram plot
from numpy.random import seed
from numpy.random import randn
from matplotlib import pyplot
# seed the random number generator
seed(1)
# random numbers drawn from a Gaussian distribution
x = randn(1000)
# create histogram plot
pyplot.hist(x)
# show the plot
pyplot.show()

# A box and whisker plot
A box and whisker plot, or boxplot for short, is generally used to summarize the distribution of a data sample. The x-axis is used to represent the data sample, where multiple boxplots can be drawn side by side on the x-axis if desired.

The y-axis represents the observation values. A box is drawn to summarize the middle 50% of the dataset, starting at the observation at the 25th percentile and ending at the 75th percentile. The median, or 50th percentile, is drawn with a line. A value called the interquartile range, or IQR, is calculated as the difference between the 75th and 25th percentiles. Lines called whiskers are drawn extending from both ends of the box, reaching up to 1.5 times the IQR, to demonstrate the expected range of sensible values in the distribution. Observations outside the whiskers might be outliers and are drawn with small circles.

Example creating a box and whisker plot from data

In [None]:
# example of a box and whisker plot
from numpy.random import seed
from numpy.random import randn
from matplotlib import pyplot
# seed the random number generator
seed(1)
# random numbers drawn from a Gaussian distribution
x = [randn(1000), 5 * randn(1000), 10 * randn(1000)]
# create box and whisker plot
pyplot.boxplot(x)
# show the plot
pyplot.show()
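
The quantities that a boxplot draws can also be computed directly. The sketch below is an illustration under stated assumptions: it uses `numpy.random.default_rng` (so the sample differs from the one plotted above) and the common convention that points more than 1.5 times the IQR beyond the box are flagged as possible outliers. The names `lower_fence` and `upper_fence` are introduced here for illustration.

```python
# Minimal sketch: quartiles, IQR and outlier limits behind a boxplot
import numpy as np

rng = np.random.default_rng(1)  # different generator than seed()/randn(), numbers will differ
sample = rng.standard_normal(1000)

# 25th, 50th (median) and 75th percentiles
q25, q50, q75 = np.percentile(sample, [25, 50, 75])
# interquartile range: the height of the box
iqr = q75 - q25
# limits beyond which observations are treated as possible outliers
lower_fence = q25 - 1.5 * iqr
upper_fence = q75 + 1.5 * iqr
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print('Q1=%.3f, median=%.3f, Q3=%.3f, IQR=%.3f' % (q25, q50, q75, iqr))
print('Possible outliers: %d' % len(outliers))
```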


# A scatter plot
Scatter plots are useful for showing the association or correlation between two variables. A correlation can be quantified, for example with a line of best fit, which can also be drawn as a line plot on the same chart, making the relationship clearer. A dataset may have more than two measures (variables or columns) for a given observation. A scatter plot matrix is a chart containing scatter plots for each pair of variables in a dataset with more than two variables. The example below creates two data samples that are related. The first is a sample of random numbers drawn from a Gaussian distribution. The second depends on the first: it adds a second random Gaussian value to the value of the first measure.

In [None]:
# example of a scatter plot
from numpy.random import seed
from numpy.random import randn
from matplotlib import pyplot
# seed the random number generator
seed(1)
# first variable
x = 20 * randn(1000) + 100
# second variable
y = x + (10 * randn(1000) + 50)
# create scatter plot
pyplot.scatter(x, y)
# show the plot
pyplot.show()
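
The relationship shown in the scatter plot can be quantified. The following sketch is one possible way to do it, using `numpy.corrcoef` for the Pearson correlation and `numpy.polyfit` with degree 1 for a line of best fit. It uses `numpy.random.default_rng`, so the sample is not identical to the one plotted above.

```python
# Minimal sketch: correlation coefficient and line of best fit for two related samples
import numpy as np

rng = np.random.default_rng(1)  # different generator than seed()/randn(), numbers will differ
# first variable
x = 20 * rng.standard_normal(1000) + 100
# second variable depends on the first plus Gaussian noise
y = x + (10 * rng.standard_normal(1000) + 50)

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]
# least-squares straight line: y is approximately slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)

print('Correlation: %.3f' % r)
print('Best fit: y = %.3f * x + %.3f' % (slope, intercept))
```

The fitted line could be drawn over the scatter plot with `pyplot.plot(x, slope * x + intercept)`.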

Combining this with the functions to estimate the mean and standard deviation summary
statistics, we can standardize our contrived dataset.

In [None]:
#Example of standardizing a contrived dataset
from math import sqrt
# calculate column means
def column_means(dataset):
    means = [0 for i in range(len(dataset[0]))]
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        means[i] = sum(col_values) / float(len(dataset))
    return means
# calculate column standard deviations
def column_stdevs(dataset, means):
    stdevs = [0 for i in range(len(dataset[0]))]
    for i in range(len(dataset[0])):
        variance = [pow(row[i]-means[i], 2) for row in dataset]
        stdevs[i] = sum(variance)
    stdevs = [sqrt(x/(float(len(dataset)-1))) for x in stdevs]
    return stdevs
# standardize dataset
def standardize_dataset(dataset, means, stdevs):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - means[i]) / stdevs[i]
# Standardize dataset
dataset = [[50, 30], [20, 90], [30, 50]]
print(dataset)
# Estimate mean and standard deviation
means = column_means(dataset)
stdevs = column_stdevs(dataset, means)
print(means)
print(stdevs)
# standardize dataset
standardize_dataset(dataset, means, stdevs)
print(dataset)
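
The from-scratch result above can be cross-checked with NumPy. Note that `column_stdevs` divides by n - 1 (the sample standard deviation), so the equivalent NumPy call needs `ddof=1`. This is a sketch for verification only, using the same contrived dataset.

```python
# Minimal sketch: cross-check of the from-scratch standardization with NumPy
import numpy as np

data = np.array([[50.0, 30.0],
                 [20.0, 90.0],
                 [30.0, 50.0]])

# column means and sample standard deviations (n - 1 in the denominator)
means = data.mean(axis=0)
stdevs = data.std(axis=0, ddof=1)

# standardize every column in one vectorized step
standardized = (data - means) / stdevs

print(means)
print(stdevs)
print(standardized)
```

Because of the n - 1 denominator, these values differ slightly from what StandardScaler would return for the same data, since that class divides by n.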

# Simple Linear Regression
Linear regression is a prediction method that is more than 200 years old. Simple linear regression is a great first machine learning algorithm to implement as it requires you to estimate properties from your training dataset, but is simple enough for beginners to understand. In this tutorial, you will discover how to implement the simple linear regression algorithm from scratch in Python.
After completing this tutorial you will know:
- How to estimate statistical quantities from training data.
- How to estimate linear regression coefficients from data.
- How to make predictions using linear regression for new data.

Let's get started.

In [None]:
# Calculate the mean value of a list of numbers
def mean(values):
    return sum(values) / float(len(values))

In [None]:
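# Calculate the sum of squared differences from the mean.
# Note: this is the variance without the division by the number of values;
# the missing factor cancels when the slope is computed as covariance / variance.
def variance(values, mean):
    return sum([(x-mean)**2 for x in values])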

In [None]:
#calculate mean and variance
dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
x = [row[0] for row in dataset]
y = [row[1] for row in dataset]
mean_x, mean_y = mean(x), mean(y)
var_x, var_y = variance(x, mean_x), variance(y, mean_y)
print('x stats: mean=%.3f variance=%.3f' % (mean_x, var_x))
print('y stats: mean=%.3f variance=%.3f' % (mean_y, var_y))

In [None]:
# Calculate covariance between x and y (sum of products of deviations, not divided by n)
def covariance(x, mean_x, y, mean_y):
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - mean_x) * (y[i] - mean_y)
    return covar

In [None]:
# calculate covariance
dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
x = [row[0] for row in dataset]
y = [row[1] for row in dataset]
mean_x, mean_y = mean(x), mean(y)
covar = covariance(x, mean_x, y, mean_y)
print('Covariance: %.3f' % (covar))

In [None]:
# Calculate coefficients
def coefficients(dataset):
    x = [row[0] for row in dataset]
    y = [row[1] for row in dataset]
    x_mean, y_mean = mean(x), mean(y)
    b1 = covariance(x, x_mean, y, y_mean) / variance(x, x_mean)
    b0 = y_mean - b1 * x_mean
    return [b0, b1]

In [None]:
# calculate coefficients
dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
b0, b1 = coefficients(dataset)
print('Coefficients: B0=%.3f, B1=%.3f' % (b0, b1))
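
The estimates used above are B1 = sum((x_i - mean(x)) * (y_i - mean(y))) / sum((x_i - mean(x))^2) and B0 = mean(y) - B1 * mean(x). As a sanity check, the sketch below recomputes them in vectorized form with NumPy and compares them with `numpy.polyfit`, which fits the same least-squares line. It is offered as a cross-check, not as a replacement for the from-scratch functions.

```python
# Minimal sketch: vectorized coefficient estimates, cross-checked with numpy.polyfit
import numpy as np

dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
x = np.array([row[0] for row in dataset], dtype=float)
y = np.array([row[1] for row in dataset], dtype=float)

# slope: sum of products of deviations over sum of squared deviations of x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# intercept: the line passes through the point of means
b0 = y.mean() - b1 * x.mean()

# polyfit with degree 1 returns [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

print('Coefficients: B0=%.3f, B1=%.3f' % (b0, b1))  # prints B0=0.400, B1=0.800
print(np.allclose([b0, b1], [intercept, slope]))
```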

In [None]:
# Predict the output for each test row using coefficients estimated from the training data
def simple_linear_regression(train, test):
    predictions = list()
    b0, b1 = coefficients(train)
    for row in test:
        yhat = b0 + b1 * row[0]
        predictions.append(yhat)
    return predictions

In [None]:
from math import sqrt

# Calculate root mean squared error
def rmse_metric(actual, predicted):
    sum_error = 0.0
    for i in range(len(actual)):
        prediction_error = predicted[i] - actual[i]
        sum_error += (prediction_error ** 2)
    # average the squared errors once, after the loop
    mean_error = sum_error / float(len(actual))
    return sqrt(mean_error)

In [None]:
# Evaluate regression algorithm on training dataset
def evaluate_algorithm(dataset, algorithm):
    test_set = list()
    for row in dataset:
        row_copy = list(row)
        row_copy[-1] = None
        test_set.append(row_copy)
    # fit and predict once, after the full test set has been built
    predicted = algorithm(dataset, test_set)
    print(predicted)
    actual = [row[-1] for row in dataset]
    rmse = rmse_metric(actual, predicted)
    return rmse

In [None]:
# Test simple linear regression
dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
rmse = evaluate_algorithm(dataset, simple_linear_regression)
print('RMSE: %.3f' % (rmse))
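
For comparison, the same evaluation can be sketched in a few vectorized lines. This version fits the line with `numpy.polyfit` instead of the from-scratch functions, then computes RMSE as the square root of the mean squared prediction error. Like the code above, it evaluates on the training data itself.

```python
# Minimal sketch: vectorized RMSE of a straight-line fit on the training data
import numpy as np

dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
x = np.array([row[0] for row in dataset], dtype=float)
actual = np.array([row[1] for row in dataset], dtype=float)

# least-squares line fitted to the same five points
slope, intercept = np.polyfit(x, actual, 1)
predicted = intercept + slope * x

# root mean squared error between predictions and actual values
rmse = np.sqrt(np.mean((predicted - actual) ** 2))

print(predicted)
print('RMSE: %.3f' % rmse)  # prints RMSE: 0.693
```

Because the error is measured on the same rows used to fit the line, it is likely to understate the error on new data.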