**Experiment No. 02**

---

**Aim:** Using the SciPy library.
---

**Objectives:**

* Reading the Data
* Preprocessing and cleaning the data



In [0]:
import scipy.io as sio
import numpy as np

SciPy is a collection of mathematical algorithms and convenience functions built on the NumPy extension of Python. It adds significant power to the interactive Python session by providing the user with high-level commands and classes for manipulating and visualizing data. With SciPy, an interactive Python session becomes a data-processing and system-prototyping environment rivaling systems such as MATLAB, IDL, Octave, R-Lab, and SciLab.

Reading Data Using SciPy

The scipy.io module reads and writes several file formats. This experiment uses its support for MATLAB .mat files, which consists of three functions:


loadmat(file_name[, mdict, appendmat]) - Load a MATLAB file.

savemat(file_name, mdict[, appendmat, …]) - Save a dictionary of names and arrays into a MATLAB-style .mat file.

whosmat(file_name[, appendmat]) - List the variables inside a MATLAB file.


In [0]:
vect = np.arange(10)

In [0]:
sio.savemat('np_vector.mat', {'vect':vect})

In [0]:
mat_contents = sio.loadmat('np_vector.mat')

In [0]:
mat_contents['vect']

array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
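The output above is two-dimensional because MATLAB has no one-dimensional arrays, so a saved vector comes back from `loadmat` with shape (1, 10). The sketch below illustrates this, together with `whosmat` and the `squeeze_me` option of `loadmat`. Writing to a temporary directory is an assumption made only to keep the example self-contained.

```python
import os
import tempfile

import numpy as np
import scipy.io as sio

vect = np.arange(10)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Temporary location, chosen only for this illustration
    path = os.path.join(tmp_dir, "np_vector.mat")
    sio.savemat(path, {"vect": vect})

    # whosmat lists (name, shape, class) for each variable without loading the data
    variables = sio.whosmat(path)

    # Default load: the 1-D input comes back as a 1 x 10 row vector
    default_shape = sio.loadmat(path)["vect"].shape

    # squeeze_me=True removes the singleton dimension
    squeezed_shape = sio.loadmat(path, squeeze_me=True)["vect"].shape

print(variables)
print(default_shape, squeezed_shape)
```

Passing `squeeze_me=True` is convenient when the loaded vector should behave like the original NumPy array.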

PREPROCESSING THE DATA

Preprocessing refers to the transformations applied to data before it is fed to an algorithm. It converts raw data into a clean data set. Data gathered from different sources arrives in a raw format that is often unsuitable for analysis, and a machine learning model gives better results when its input is properly formatted. Some models also require their input in a specific form. For example, many Random Forest implementations do not accept null values, so null values must be handled in the raw data set before the algorithm is run.
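As a minimal sketch of handling null values, the example below counts the missing entries in a small table and fills each one with the median of its column. The table is hypothetical and is not taken from diabetes.csv, and median imputation is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with gaps; the values are made up for illustration
raw = pd.DataFrame({
    "plas": [148.0, np.nan, 183.0, 89.0],
    "mass": [33.6, 26.6, np.nan, 28.1],
})

# Count the missing values in each column
missing_per_column = raw.isnull().sum()

# Replace each missing value with the median of its column
cleaned = raw.fillna(raw.median())

print(missing_per_column)
print(cleaned)
```

Dropping incomplete rows with `raw.dropna()` is an alternative when only a few rows are affected.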


In [0]:
import pandas 
from sklearn.preprocessing import MinMaxScaler 

In [0]:
import io

In [0]:
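# Column names, listed for reference only
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# names is not passed to read_csv, so the first line of diabetes.csv is treated as the header row
dataframe = pandas.read_csv('diabetes.csv')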

In [0]:
array = dataframe.values
array

array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
       [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],
       [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],
       [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],
       [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]])

In [0]:
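# Input features: columns 1 to 7 (column 0, 'preg' in the names list, is not included)
X = array[:,1:8]
# Target: column 8, the class label
Y = array[:,8]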

RESCALING THE DATA


When the data contains attributes with varying scales, many machine learning algorithms benefit from rescaling all attributes to the same scale. This helps the optimization algorithms at the core of many machine learning methods, such as gradient descent. It also helps algorithms that weight inputs, such as regression and neural networks, and algorithms that use distance measures, such as K-Nearest Neighbors. We can rescale the data with the MinMaxScaler class from scikit-learn.

In [0]:
scaler = MinMaxScaler(feature_range=(0, 1))

In [0]:
rescaledX = scaler.fit_transform(X)

In [0]:
np.set_printoptions(precision=3)

In [0]:
print(rescaledX[0:5,:])

[[0.744 0.59  0.354 0.    0.501 0.234 0.483]
 [0.427 0.541 0.293 0.    0.396 0.117 0.167]
 [0.92  0.525 0.    0.    0.347 0.254 0.183]
 [0.447 0.541 0.232 0.111 0.419 0.038 0.   ]
 [0.688 0.328 0.354 0.199 0.642 0.944 0.2  ]]
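For a feature range of (0, 1), min-max rescaling maps each column with (x - min) / (max - min). The sketch below applies that formula by hand with NumPy to a small hypothetical matrix (`X_demo` is not part of the diabetes data), to show the arithmetic behind the scaler.

```python
import numpy as np

# Hypothetical feature matrix: 3 samples, 2 attributes on different scales
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 800.0]])

col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)

# Min-max rescaling to [0, 1], computed column by column
rescaled_demo = (X_demo - col_min) / (col_max - col_min)

print(rescaled_demo)
```

In this hand-written version a constant column would make the denominator zero, so it assumes every column has at least two distinct values.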


BINARIZING THE DATA

We can transform the data using a binary threshold. All values above the threshold are marked 1, and all values equal to or below it are marked 0. This is called binarizing or thresholding the data. It is useful when you have probabilities that you want to turn into crisp values. It is also useful in feature engineering, when you want to add new features that indicate something meaningful. We can create new binary attributes with the Binarizer class from scikit-learn.

In [0]:
from sklearn.preprocessing import Binarizer 

In [0]:
binarizer = Binarizer(threshold=0.01).fit(X) 
binaryX = binarizer.transform(X) 

In [0]:
print(binaryX[0:5,:])

[[1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 0. 0. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1.]]
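The thresholding rule can be written directly with a NumPy comparison. The sketch below uses a small hypothetical matrix (`X_demo` and its values are made up for illustration) and shows that a value exactly equal to the threshold maps to 0.

```python
import numpy as np

# Hypothetical values, including one exactly at the threshold
X_demo = np.array([[148.0, 0.0, 0.627],
                   [85.0, 0.01, 0.0]])

threshold = 0.01

# Strictly greater than the threshold -> 1.0, otherwise -> 0.0
binary_demo = (X_demo > threshold).astype(float)

print(binary_demo)
```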


STANDARDIZING THE DATA

Standardization transforms attributes with differing means and standard deviations so that each has a mean of 0 and a standard deviation of 1. It is most appropriate for attributes with a roughly Gaussian distribution, which become approximately standard Gaussian after the transform. We can standardize data with the StandardScaler class from scikit-learn.




In [0]:
from sklearn.preprocessing import StandardScaler

In [0]:
scaler = StandardScaler().fit(X) 
rescaledX = scaler.transform(X) 

In [0]:
print(rescaledX[0:5,:]) 

[[ 0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [ 0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]
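Standardization subtracts each column's mean and divides by its standard deviation. The sketch below does this by hand with NumPy on a small hypothetical matrix (`X_demo` is made up for illustration), using the population standard deviation (`ddof=0`), and then checks the resulting column statistics.

```python
import numpy as np

# Hypothetical feature matrix: 3 samples, 2 attributes
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)  # population standard deviation (ddof=0)

# z-score: (x - mean) / std, computed column by column
standardized_demo = (X_demo - mean) / std

print(standardized_demo.mean(axis=0))
print(standardized_demo.std(axis=0))
```

Each column of the result has a mean of approximately 0 and a standard deviation of approximately 1, up to floating-point rounding.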


Conclusion

In this experiment we studied how to read and write MATLAB files using the scipy.io module, and several ways to preprocess data using scikit-learn (sklearn), a separate library built on NumPy and SciPy.