In [1]:
from datetime import date
today = date.today()
print("Last Updated Date:", today.strftime("%d %B %Y"))

Last Updated Date: 30 April 2021


# Scikit Learn Module

<br>[Scikit Learn](#Scikit_Learn)
<br>[SciKitLearn Cheat Sheet](#SciKitLearn_Cheat_Sheet)
<br>[Scaling our data and splitting our data into Train and Test Datasets](#Scaling_our_data_and_splitting_our_data_into_Train_and_Test_Datasets)
<br>[Normalization of our Data / Scaling our Data](#Nomalization_of_our_Data_or_Scaling_our_Data)
<br>[Splitting our Data into Train and Test Datasets / Sets](#Splitting_our_Data_into_Train_and_Test_Datasets_or_Sets)
<br>[Now we do a train test split using an existing train-test-split library](#Now_we_do_a_train_test_split_using_an_existing_train-test-split_library)

# SciPy Module

<br>[SciPy](#SciPy)
<br>[Compute The Nth Derivative Of A Function](#Compute_The_Nth_Derivate_Of_A_Function)
<br>[Permutation And Combinations](#Permutation_And_Combinations)
<br>[Linear Algebra](#Linear_Algebra)

# <a id='Scikit_Learn'></a>Scikit Learn

* scikit-learn is a standalone machine learning library for Python and one of the most popular ones out there. It does not support the deep neural networks that PyTorch / TensorFlow provide, which is why we are not really going to use its models in this course. If you are interested in some of the other scikit-learn machine learning models, you can check out the Python for Data Science and Machine Learning Bootcamp course.
* scikit-learn contains many machine learning models, but we are going to use it mainly for pre-processing, specifically for two things:
    * 1) Scaling our data (Normalization of data)
    * 2) Splitting our data into train and test datasets / sets

In [3]:
                                            # Questions
# Which libraries do we import?
# What does MinMaxScaler do?
# How do you fit and transform data, and what is the purpose of normalization?
# How do you split data into training and test sets, and what is the purpose of splitting the data?
# What is a DataFrame in pandas?

# <a id='SciKitLearn_Cheat_Sheet'></a>SciKitLearn Cheat Sheet

In [7]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

data = np.random.randint(0, 100, (10, 2))      # sample data to scale
scalar_model = MinMaxScaler()
scalar_model.fit_transform(data)  # or in two steps: scalar_model.fit(data), then scalar_model.transform(data)

mydata = np.arange(0, 100).reshape(25, 4)      # 4 columns, to match the 4 column names below
df = pd.DataFrame(mydata, columns=['f1', 'f2', 'f3', 'label'])
X = df[['f1', 'f2', 'f3']]
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101) #Just type train_test_split and enter

- It provides simple and efficient tools for pre-processing and predictive modeling
<br>![](./Media/Py2.png)

***Steps to build a model in scikit-learn.***
1. Import the model
2. Prepare the data set
3. Separate the independent and target variables.
4. Create an object of the model
5. Fit the model with the data
6. Use the model to predict target.
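
The six steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's own workflow: the toy cat/dog measurements and the variable names `toy`, `X`, `y` and `predictions` are invented for the example, and `DecisionTreeClassifier` is just one of many models that could be plugged in.

```python
# 1. Import the model
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# 2. Prepare the data set (toy values, invented for illustration)
#    columns: height_cm, weight_kg, label (0 = cat, 1 = dog)
toy = np.array([
    [25, 4, 0],
    [23, 5, 0],
    [24, 4, 0],
    [60, 30, 1],
    [55, 25, 1],
    [65, 32, 1],
])

# 3. Separate the independent and target variables
X = toy[:, :2]
y = toy[:, 2]

# 4. Create an object of the model
model = DecisionTreeClassifier(random_state=0)

# 5. Fit the model with the data
model.fit(X, y)

# 6. Use the model to predict the target for two new rows
predictions = model.predict(np.array([[26, 5], [58, 28]]))
print(predictions)
```

The first new row resembles the small animals and the second resembles the large ones, so the tree should label them 0 and 1 respectively.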

***Learn more about the scikit-learn here: https://scikit-learn.org/stable/index.html***


In [0]:
# import the scikit-learn library
import sklearn

In [0]:
# check the version 
sklearn.__version__

'0.22.1'

- ***We have seen in the pandas notebook that we have some missing values in our data.***
- ***We will impute those missing values using scikit-learn's `SimpleImputer`.***

---

In [0]:
# read the data set and check for the null values
import pandas as pd
data = pd.read_csv('big_mart_sales.csv')
data.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [0]:
# import the SimpleImputer
from sklearn.impute import SimpleImputer

---

- For imputing the missing values, we will use `SimpleImputer`.
- First we will create an object of the Imputer and define the strategy.
- We will impute `Item_Weight` with the `mean` value and `Outlet_Size` with the `most_frequent` value.
- Fit the objects with the data.
- Transform the data

---

In [0]:
# create the object of the imputer for Item_Weight and Outlet_Size
impute_weight = SimpleImputer(strategy= 'mean')
impute_size   = SimpleImputer(strategy= 'most_frequent')

In [0]:
# fit the Item_Weight imputer with the data and transform
impute_weight.fit(data[['Item_Weight']])
data['Item_Weight'] = impute_weight.transform(data[['Item_Weight']]).ravel()  # flatten the (n, 1) result to 1-D

In [0]:
# fit the Outlet_Size imputer with the data and transform
impute_size.fit(data[['Outlet_Size']])
data['Outlet_Size'] = impute_size.transform(data[['Outlet_Size']]).ravel()  # flatten the (n, 1) result to 1-D

In [0]:
# check the null values.
data.isna().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

- ***After the preprocessing step, we separate the independent variables from the target variable and pass the data to the model object to train the model.***
---

- ***If we have to identify the category of an object based on some features, for example whether a given picture shows a cat or a dog, that is a `classification problem`.***
- ***If we have to predict a continuous attribute, such as sales, based on some features, that is a `regression problem`.***

---

***`scikit-learn` has tools that help you build regression models, classification models, and many others.***

---

In [0]:
# some of the very basic models scikit learn has.
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

---

After we have built the model, whenever new data points are added to the existing data, we need to perform the same preprocessing steps again before we can use the model to make predictions. This becomes a tedious and time-consuming process!

So scikit-learn provides tools to create a pipeline of all those steps, which makes your work a lot easier.

---

In [0]:
from sklearn.pipeline import Pipeline
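
As a rough sketch of what such a pipeline can look like, the example below chains an imputer, a scaler and a regression model so that new rows go through the same preprocessing automatically. The toy arrays and the step names `"impute"`, `"scale"` and `"model"` are assumptions made for illustration; they are not taken from the Big Mart data above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy training data with one missing value (invented for illustration)
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y_train = np.array([11.0, 22.0, 33.0, 44.0])

# Each step is a (name, estimator) pair; the last step is the model
pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", MinMaxScaler()),
    ("model", LinearRegression()),
])

# fit() runs fit_transform on every preprocessing step, then fits the model
pipe.fit(X_train, y_train)

# predict() reuses the already fitted imputer and scaler on the new row
X_new = np.array([[5.0, np.nan]])
prediction = pipe.predict(X_new)
print(prediction)
```

The design point is that the preprocessing objects are fitted once, on the training data, and then reapplied unchanged whenever `predict` is called.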

# <a id='Scaling_our_data_and_splitting_our_data_into_Train_and_Test_Datasets'></a>Scaling our data and splitting our data into Train and Test Datasets
* We are going to fit the scaler on the training data, then transform the training data, and then transform the test data with that same fitted scaler.
* The reason is that fitting on the test data as well as the training data would be cheating: you should not assume that you know what the test data is going to look like. So typically you fit on the training data only and then transform both the training data and the test data. The scaler model itself has only ever been fitted to the training data.
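
A minimal sketch of that order of operations is shown below. The random feature matrix is invented for illustration, and the names `features`, `X_train_scaled` and `X_test_scaled` are assumptions, not names used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: 50 rows, 3 feature columns
rng = np.random.RandomState(101)
features = rng.randint(0, 101, (50, 3))

# Split first, so the scaler never sees the test rows while fitting
X_train, X_test = train_test_split(features, test_size=0.3, random_state=101)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from the training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max, no refit

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
print(X_test_scaled.min(axis=0), X_test_scaled.max(axis=0))
```

The scaled training columns span 0 to 1 by construction. The scaled test columns can fall slightly outside that range, which is expected, because the test rows were not used to compute the min and max.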

# <a id='Nomalization_of_our_Data_or_Scaling_our_Data'></a>Normalization of our Data / Scaling our Data

In [4]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

In [5]:
data = np.random.randint(0,100,(10,2))

In [6]:
data

array([[42, 88],
       [19, 48],
       [83, 35],
       [87, 35],
       [19, 93],
       [67, 33],
       [ 2, 79],
       [58, 55],
       [54, 65],
       [58, 14]])

__If we want to feed data to a neural network, we should scale that data first. We can do that using the MinMaxScaler__
* from sklearn.preprocessing import MinMaxScaler
<br>
Note: There are other ways of normalizing the data, but since we are dealing with simple datasets, MinMaxScaler is sufficient

In [7]:
# Creating an instance of MinMaxScaler
scaler_model = MinMaxScaler()  # MinMaxScaler is a class; scaler_model is an instance of it
type(scaler_model)

sklearn.preprocessing.data.MinMaxScaler

In [8]:
scaler_model.fit(data)  # learns each column's min and max; the default output range is (0, 1)

MinMaxScaler(copy=True, feature_range=(0, 1))

In [9]:
scaler_model.transform(data)  # rescales each column into the fitted range using the learned min and max

array([[0.47058824, 0.93670886],
       [0.2       , 0.43037975],
       [0.95294118, 0.26582278],
       [1.        , 0.26582278],
       [0.2       , 1.        ],
       [0.76470588, 0.24050633],
       [0.        , 0.82278481],
       [0.65882353, 0.51898734],
       [0.61176471, 0.64556962],
       [0.65882353, 0.        ]])

* scaler_model.fit_transform(data)  **(does both of the steps below in a single call)**
* scaler_model.fit(data)
* scaler_model.transform(data)

In [10]:
scaler_model.fit_transform(data)   # fit and transform in a single step instead of two lines of code
# There are other ways of normalizing data, but since we are dealing with very basic data sets, MinMaxScaler() is enough

array([[0.47058824, 0.93670886],
       [0.2       , 0.43037975],
       [0.95294118, 0.26582278],
       [1.        , 0.26582278],
       [0.2       , 1.        ],
       [0.76470588, 0.24050633],
       [0.        , 0.82278481],
       [0.65882353, 0.51898734],
       [0.61176471, 0.64556962],
       [0.65882353, 0.        ]])
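
To make the transformation less of a black box: with the default range (0, 1), min-max scaling maps each value `x` in a column to `(x - min) / (max - min)`, where `min` and `max` are taken from that column. The sketch below recomputes this by hand with NumPy on the same ten rows shown above and checks it against `MinMaxScaler`. The names `col_min`, `col_max` and `manual` are introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The same ten rows as the `data` array printed earlier in this section
data = np.array([[42, 88], [19, 48], [83, 35], [87, 35], [19, 93],
                 [67, 33], [2, 79], [58, 55], [54, 65], [58, 14]])

# Per-column minimum and maximum (this is what fit() learns)
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Per-column rescaling (this is what transform() applies)
manual = (data - col_min) / (col_max - col_min)

# The hand-written version agrees with scikit-learn's scaler
assert np.allclose(manual, MinMaxScaler().fit_transform(data))
print(manual[0])
```

For the first row, 42 becomes (42 - 2) / (87 - 2) and 88 becomes (88 - 14) / (93 - 14), which matches the first row of the scaled output above.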

# <a id='Splitting_our_Data_into_Train_and_Test_Datasets_or_Sets'></a>Splitting our Data into Train and Test Datasets / Sets

In [11]:
mydata = np.random.randint(0,101,(50,4))

In [12]:
mydata

array([[ 65,   3,  51,  35],
       [ 10,  53,  49, 100],
       [ 21,  72,  96,  74],
       [ 63,  83,  51,  82],
       [  5,  74,   0,  29],
       [ 31,  57,  29,  68],
       [ 83,  84,   6,  68],
       [ 93,  43,   6,  77],
       [ 38,  20,  97,  97],
       [ 79,  95,   7,   6],
       [ 40,  99,  32,  17],
       [ 54,  58,  19,  34],
       [ 61,  16,  98,  86],
       [ 72,  66,   4,  60],
       [ 39,  32,  70,  80],
       [ 56,  74,  19,   0],
       [ 51,  89,  31,  45],
       [ 43,  87,  45,  89],
       [ 88,  94,  64,  13],
       [ 97,  28,  43,  47],
       [ 89,  89,  43,  22],
       [  9,  35,  30,  82],
       [ 22,  70,  71,  73],
       [ 40,   5,  73,  25],
       [ 47,  96,  39,  75],
       [ 31,  13,  86,  34],
       [ 44,  28,   2,  86],
       [ 90,  25,   1,  10],
       [  2,  14,  52,  94],
       [ 40,  35,  76,  75],
       [ 52,  15,  39,  67],
       [ 78,  44,  62,  71],
       [ 40,  41,  36,  34],
       [ 66,  71,   9,  86],
       [ 65,  

In [13]:
import pandas as pd

__DataFrames are like tables or Excel sheets: you can perform operations on individual columns or rows. The example below builds one from our matrix, where f1, f2 and f3 are feature columns and the last column is the label column.__

In [14]:
df = pd.DataFrame(mydata,columns = ['f1','f2','f3','label']) 
# 3 feature columns f1, f2, f3 (the inputs) and one label column (the value to predict)
# This is exactly what a supervised learning problem looks like
# We have 3 features and we are trying to predict 'label' from them
# (loosely analogous to solving a linear equation in 3 variables: to solve for 'n' unknowns we need at least 'n' equations)

In [15]:
df

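|   | f1 | f2 | f3 | label |
|---|----|----|----|-------|
| 0 | 65 | 3  | 51 | 35    |
| 1 | 10 | 53 | 49 | 100   |
| 2 | 21 | 72 | 96 | 74    |
| 3 | 63 | 83 | 51 | 82    |
| 4 | 5  | 74 | 0  | 29    |
| 5 | 31 | 57 | 29 | 68    |
| 6 | 83 | 84 | 6  | 68    |
| 7 | 93 | 43 | 6  | 77    |
| 8 | 38 | 20 | 97 | 97    |
| 9 | 79 | 95 | 7  | 6     |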


In [16]:
# Let's imagine df is our entire dataset and now we have to split the DataFrame into Training Set and Test Set.

X = df[['f1','f2','f3']]     # select the feature columns as X

In [17]:
y = df['label'] # 'label' is what we are trying to predict

In [18]:
# OK, so now we have the feature data set and the label that we are trying to predict

# <a id='Now_we_do_a_train_test_split_using_an_existing_train-test-split_library'></a>Now we do a train test split using an existing train-test-split library

In [19]:
from sklearn.model_selection import train_test_split

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)  
# X is the input (the features)
# y is the output we want to predict (the labels)
# random_state is for repeatability, just like NumPy's random seed: it makes sure we get the same random split every time we run the code
# test_size is very situation-specific. Most often we use 70% or 80% of the data for training and 30% or 20% for testing; sometimes a 50/50 split makes sense


__When we run the above code we get 4 variables X_train,X_test,y_train,y_test__

In [21]:
X_train.shape  #This is the 'feature' dataset for the Training Set

(35, 3)

In [22]:
X_test.shape  #This is the 'feature' dataset for the Testing Set

(15, 3)
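
The repeatability that `random_state` provides can be checked with a small sketch like the one below. The arrays here are simple placeholders built with `np.arange`, and the seed value 7 is an arbitrary choice for comparison.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 50 rows, 2 feature columns, and 50 labels
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# Two splits with the same random_state, and one with a different seed
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=101)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=101)
c_train, c_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=7)

same_seed_matches = np.array_equal(a_test, b_test)
different_seed_matches = np.array_equal(a_test, c_test)

print(a_train.shape, a_test.shape)
print(same_seed_matches)
print(different_seed_matches)
```

With the same seed, the two test sets contain exactly the same rows in the same order. With a different seed, the rows selected will generally differ.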

* __The basic idea is this: once I have my neural network model working in PyTorch / TensorFlow and I want to train it for supervised learning, I feed it the training sets X_train and y_train, and the model tries to learn how the X training features predict the y training labels.__
* __Once it has done that, I can evaluate my model by feeding it the X test data, for which it predicts what the labels should be. I can then compare those predicted values to the true y test values. That is the reason for a train-test split: it lets us evaluate the model on data it has not seen during training.__
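
The train-then-evaluate loop described above can be sketched end to end with a scikit-learn model standing in for the neural network. This is an illustration under stated assumptions: `LinearRegression` is swapped in for the PyTorch / TensorFlow model, the label is a made-up linear combination of the features plus noise, and mean squared error is just one possible way to compare predictions with the true test labels.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data: 3 features and a label that depends linearly on them
rng = np.random.RandomState(101)
X = rng.randint(0, 101, (50, 3)).astype(float)
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 1.0, 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# Train on the training set only
model = LinearRegression()
model.fit(X_train, y_train)

# Predict labels for the unseen test features
y_pred = model.predict(X_test)

# Compare the predictions against the true test labels
test_mse = mean_squared_error(y_test, y_pred)
print(test_mse)
```

Because the test rows played no part in fitting, the error on them is an estimate of how the model would behave on new data.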


# <a id='SciPy'></a>SciPy

* SciPy is a free and open-source Python library used for scientific computing and technical computing.
* The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation.
* SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and engineering.
* SciPy builds on the NumPy array object and is part of the NumPy stack, which includes tools like Matplotlib, pandas and SymPy, and an expanding set of scientific computing libraries. The NumPy stack has a similar user base to applications such as MATLAB, GNU Octave, and Scilab, and it is sometimes also referred to as the SciPy stack.

***Learn more about SciPy here: https://docs.scipy.org/doc/***

In [1]:
# import the scipy library
import scipy

In [2]:
# check the version of scipy
scipy.__version__

'1.4.1'

# <a id='Compute_The_Nth_Derivate_Of_A_Function'></a>Compute The Nth Derivative Of A Function

In [3]:
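# import the derivative from scipy
# (works with the SciPy 1.4.1 used here; scipy.misc.derivative was deprecated and later removed in newer SciPy releases)
from scipy.misc import derivative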

In [4]:
# define the function
def my_function(x):
    return x**2 + x + 1

# calculate the first derivative of the function at x = 2

derivative(func= my_function, x0=2)

5.0

#### Function: `f(x) = x**2 + x + 1`
#### Derivative:  `f'(x) = 2*x + 1`
#### Solution:  `f'(2) = 2*2 + 1 = 5`

---

***Now, calculate the second derivative***

---

In [5]:
derivative(func=my_function,x0=2,n=2)

2.0
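
`derivative` estimates these values numerically with finite differences. For readers on a SciPy release where `scipy.misc.derivative` is no longer available, the same idea can be sketched in plain Python. The helper names `central_difference` and `second_central_difference` and the step size `dx=1e-3` are assumptions for this illustration; they are not SciPy functions or SciPy defaults.

```python
def central_difference(func, x0, dx=1e-3):
    """Approximate the first derivative of func at x0."""
    return (func(x0 + dx) - func(x0 - dx)) / (2 * dx)


def second_central_difference(func, x0, dx=1e-3):
    """Approximate the second derivative of func at x0."""
    return (func(x0 + dx) - 2 * func(x0) + func(x0 - dx)) / dx**2


def my_function(x):
    return x**2 + x + 1


first = central_difference(my_function, 2.0)
second = second_central_difference(my_function, 2.0)
print(round(first, 6), round(second, 6))
```

For this quadratic, the results should agree with the analytic values `f'(2) = 5` and `f''(2) = 2` up to floating-point rounding.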

# <a id='Permutation_And_Combinations'></a>Permutation And Combinations

In [0]:
# COMBINATIONS
from scipy.special import comb

# total number of combinations from 4 different values taken 2 at a time
# Value of 4C2
com = comb(4, 2)

print(com)

In [0]:
# PERMUTATIONS: Value of 4P2
from scipy.special import perm
per = perm(4, 2)

print(per)
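
By default `comb` and `perm` return floating-point results. When whole numbers are wanted, both accept `exact=True`. The sketch below uses it and cross-checks against the textbook formulas nCk = n! / (k! (n - k)!) and nPk = n! / (n - k)!. The variable names are chosen only for this example.

```python
from math import factorial
from scipy.special import comb, perm

n, k = 4, 2

# exact=True returns Python integers instead of floats
comb_exact = comb(n, k, exact=True)
perm_exact = perm(n, k, exact=True)

# The same quantities from the factorial formulas
comb_formula = factorial(n) // (factorial(k) * factorial(n - k))
perm_formula = factorial(n) // factorial(n - k)

print(comb_exact, comb_formula)
print(perm_exact, perm_formula)
```

There are 6 ways to choose 2 items out of 4 when order does not matter, and 12 when it does, since each pair can be arranged in 2 orders.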

# <a id='Linear_Algebra'></a>Linear Algebra

In [0]:
# import linear algebra module and numpy
from scipy import linalg
import numpy as np

In [0]:
# square matrix
matrix = np.array([[1, 5, 2],
                   [3, 2, 1],
                   [1, 2, 1]])

In [0]:
matrix

In [0]:
# pass values to det() function
linalg.det(matrix)

In [0]:
# inverse of a matrix
linalg.inv(matrix)
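
A quick way to sanity-check these results is to multiply the matrix by its inverse and confirm the product is the identity, and to use the matrix to solve a small linear system. In the sketch below the right-hand side `b` is a hypothetical vector, chosen so that the solution is all ones.

```python
import numpy as np
from scipy import linalg

matrix = np.array([[1, 5, 2],
                   [3, 2, 1],
                   [1, 2, 1]])

# Determinant: a non-zero value means the matrix is invertible
det = linalg.det(matrix)

# matrix @ inverse should be (numerically) the 3x3 identity
inverse = linalg.inv(matrix)
identity_check = np.allclose(matrix @ inverse, np.eye(3))

# Solve matrix @ x = b; b is each row's sum, so x should be [1, 1, 1]
b = np.array([8, 6, 4])
x = linalg.solve(matrix, b)

print(det)
print(identity_check)
print(x)
```

Expanding along the first row gives a determinant of 1(2 - 2) - 5(3 - 1) + 2(6 - 2) = -2, so the inverse exists. For solving systems, `linalg.solve` is generally preferred over computing the inverse explicitly.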