## Learning Notebook: SkLearn

## Learning Objectives

At the end of the experiment, you will be able to:

* have an overview of the basics of Machine Learning

* understand the SkLearn Machine Learning framework

* understand the implementation of Train/Test Split

* perform Linear Regression using SkLearn


### Introduction

**Machine learning** is a subfield of artificial intelligence (AI). The goal of machine learning is to learn the structure of data and fit a model to it, so that the model can accurately predict the label or output for similar, unseen data.

**Machine Learning use cases:**

* detecting tumors in brain scans
* automatically classifying news articles
* automatically flagging offensive comments on discussion forums
* summarizing long documents automatically
* creating a chatbot or a personal assistant
* detecting credit card fraud
* making your app react to voice commands
* building an intelligent bot for a game

**Machine Learning Workflow:**

1. Frame the ML problem by looking at the business need
2. Gather the data and do Data Munging/Wrangling for each subproblem
3. Explore different models, perform verification and validation (V&V), and shortlist promising candidates
4. Fine-tune the shortlisted models and combine them to form the final solution
5. Present your solution
6. Deploy


**Model training and testing**

![Model training and testing](https://cdn.iisc.talentsprint.com/CDS/Images/model_train_test1.png)



### Training, Validation, and Test Set

Before training a machine learning model, we split the dataset into separate subsets.

Splitting your dataset is essential for an unbiased evaluation of prediction performance. In most cases, it’s enough to split your dataset randomly into two subsets:

**Training Dataset:** The sample of data used to fit the model.

**Test Dataset:** The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.

A common choice is to use 80% of the data for training and 20% for testing. Splits of 70%/30% and 75%/25% are also often used.

**Validation Set:** This is a separate section of your dataset that you will use during training to get a sense of how well your model is doing on data that are not being used in training.

In less complex cases, when you don’t have to tune hyperparameters, it’s okay to work with only the training and test sets.
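As a rough sketch of how a validation set could be carved out, `train_test_split()` can be applied twice: once to hold out the test set, and once to split the remainder into training and validation sets. The toy arrays, the 60/20/20 proportions, and the variable names below are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features each (values are arbitrary)
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Step 1: hold out 20% of the data as the final test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: split the remaining 80% into training and validation sets.
# 0.25 of the remaining 80% equals 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The validation set is used while tuning the model; the test set is touched only once, for the final evaluation.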


<img src="https://miro.medium.com/max/700/1*aNPC1ifHN2WydKHyEZYENg.png" alt="drawing" width="500"/>


#### Prerequisites for using train_test_split()

We will use scikit-learn (sklearn), a Python library that provides many packages for machine learning.

Refer to the sklearn documentation [here](https://scikit-learn.org/stable/)

**Applying train_test_split()**

You need to import:

1.   train_test_split()
2.   NumPy

We import NumPy because, in supervised machine learning applications, you'll typically work with two array-like sequences:

* A two-dimensional array with the inputs (X)
* A one-dimensional array with the outputs (y)






**`sklearn.model_selection.train_test_split(*arrays, **options)`**

* **arrays** is the sequence of lists, NumPy arrays, pandas DataFrames, or similar array-like objects that hold the data you want to split. All these objects together make up the dataset and must be of the same length.

* **options** are the optional keyword arguments that you can use to get desired behavior:

  * **train_size** is the number that defines the size of the training set. A float between 0.0 and 1.0 is the proportion of the dataset to include in the training split; an int is the absolute number of training samples.

  * **test_size** is the number that defines the size of the test set. You should provide either train_size or test_size.
      * If float (e.g. 0.25), it represents the proportion of the dataset to include in the test split and should be between 0.0 and 1.0.
      * If int (e.g. 4), it represents the absolute number of test samples, e.g. 4 samples out of 12.
      * If None, the value is set to the complement of the train size.
      * If train_size is also None, the test size defaults to 0.25, or 25 percent of the dataset.

  * **random_state** is the object that controls randomization during splitting. It can be either an int or an instance of RandomState. The default value is None.

  * **shuffle** is the Boolean object (True by default) that determines whether to shuffle the dataset before applying the split.

  * **stratify** is an array-like object that, if not None, determines how to use a stratified split.
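The options above can be tried out on a small array. The sketch below assumes the same kind of toy data used later in this notebook (12 samples with balanced labels) and illustrates `random_state`, `shuffle`, and `stratify`. The variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# Same random_state -> the same split on every call
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=4, random_state=4)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=4, random_state=4)
print(np.array_equal(X_test_a, X_test_b))  # True

# shuffle=False keeps the original order: the last 4 rows form the test set
_, X_test_ordered, _, _ = train_test_split(X, y, test_size=4, shuffle=False)
print(X_test_ordered)

# stratify=y keeps the class proportions of y in both subsets
_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=4, random_state=4, stratify=y
)
print(np.bincount(y_train_s), np.bincount(y_test_s))
```

Stratification is mostly useful for classification problems with imbalanced classes, where a purely random split could leave one class under-represented in the test set.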



### Importing required packages

In [None]:
# Importing Standard Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Importing sklearn Libraries
from sklearn import datasets
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

### Let us use a small synthetically created dataset to understand how to implement a train and test split

#### Creating a simple dataset to work with

In [None]:
# inputs in the two-dimensional array X
X = np.arange(1, 25).reshape(12, 2)

# outputs in the one-dimensional array y
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

In [None]:
print(X)

In [None]:
print(y)

#### Splitting input and output datasets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=4, random_state=4)

In [None]:
X_train

In [None]:
X_test

In [None]:
y_train

In [None]:
y_test

### Develop an understanding of Least Squares

The **least squares** method is a statistical procedure that finds the best fit for a set of data points by minimizing the sum of the squared offsets (residuals) of the points from the fitted curve.

**Calculate Line Of Best Fit**

The least squares method is a precise way of finding the line of best fit.

Use the following steps to find the equation of line of best fit for a set of ordered pairs $(x_1,y_1),(x_2,y_2),...(x_n,y_n)$.

**Step 1:** Calculate the slope ‘m’ by using the following formula:

$$m = \frac{\sum \left ( x-\bar{x} \right )* \left ( y-\bar{y} \right )}{\sum \left ( x-\bar{x} \right )^{2}}$$


**Step 2:** Compute the y-intercept of the line by using the formula below, where $\bar{x}$ and $\bar{y}$ are the means of the $x$ and $y$ values:

$$c = \bar{y} - m\bar{x}$$

**Step 3:** Substitute the values in the final equation

$$y = mx + c$$

* y: dependent variable
* m: the slope of the line
* x: independent variable
* c: y-intercept


As an example, we will try to find the least squares regression line for the below data set:

\begin{array} {|r|r|}\hline \text{Hours Spent} & \text{Grade} \\\hline 6 & 82 \\ \hline 10 & 88 \\ \hline 2 & 56 \\ \hline 4 & 64 \\ \hline 6 & 77 \\ \hline 7 & 92 \\ \hline 0 & 23 \\ \hline 1 & 41 \\ \hline 8 & 80 \\ \hline 5 & 59 \\ \hline 3 & 47 \\ \hline  \end{array}

$x$ = HoursSpent

$y$ = Grade

$\bar{x}$ = 4.73

$\bar{y}$ = 64.45


\begin{array} {|r|r|r|r|r|}\hline \text{Hours Spent} & \text{Grade} & x - \bar{x} & y - \bar{y} & (x - \bar{x})*(y - \bar{y}) \\ \hline 6 & 82 & 1.27 & 17.55 & 22.33 \\ \hline 10 & 88 & 5.27 & 23.55 & 124.15 \\ \hline 2 & 56 & -2.73 & -8.45 & 23.06 \\ \hline 4 & 64 & -0.73 & -0.45 & 0.33 \\ \hline 6 & 77 & 1.27 & 12.55 & 15.97 \\ \hline 7 & 92 & 2.27 & 27.55 & 62.60 \\ \hline 0 & 23 & -4.73 & -41.45 & 195.97 \\ \hline 1 & 41 & -3.73 & -23.45 & 87.42 \\ \hline 8 & 80 & 3.27 & 15.55 & 50.88 \\ \hline 5 & 59 & 0.27 & -5.45 & -1.49 \\ \hline 3 & 47 & -1.73 & -17.45 & 30.15 \\ \hline  \end{array}


$$\sum \left ( x-\bar{x} \right )* \left ( y-\bar{y} \right ) = 611.36$$

$$\sum \left ( x-\bar{x} \right )^{2} = 94.18$$

$$m = \frac{611.36}{94.18}$$

$$m = 6.49$$

**Calculate the intercept:**

$$c = \bar{y} - m\bar{x}$$

$$c = 64.45-(6.49*4.73)$$

$$c = 64.45 - 30.70$$

$$c = 33.75$$

Now we have all the values needed for the equation. To predict the grade of someone who spends 2.35 hours on their essay, substitute that value for $x$:

$$y =  (6.49 * x) + 33.75 $$

$$y = (6.49 * 2.35) + 33.75$$

$$y = 49.00$$
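The hand calculation can be cross-checked with a few lines of NumPy. This is a minimal sketch that applies the slope and intercept formulas from Steps 1 and 2 to the hours/grade table. The variable names are illustrative, and because the code works at full precision, its results can differ slightly from a hand calculation that rounds intermediate values.

```python
import numpy as np

# Data from the Hours Spent / Grade table
hours = np.array([6, 10, 2, 4, 6, 7, 0, 1, 8, 5, 3], dtype=float)
grade = np.array([82, 88, 56, 64, 77, 92, 23, 41, 80, 59, 47], dtype=float)

x_mean, y_mean = hours.mean(), grade.mean()

# Step 1: slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
m = np.sum((hours - x_mean) * (grade - y_mean)) / np.sum((hours - x_mean) ** 2)

# Step 2: intercept = y_mean - m * x_mean
c = y_mean - m * x_mean

# Step 3: predict the grade for 2.35 hours
predicted = m * 2.35 + c

print(f"slope m = {m:.2f}, intercept c = {c:.2f}")
print(f"predicted grade for 2.35 hours = {predicted:.2f}")

# Cross-check against NumPy's built-in degree-1 polynomial fit
m_check, c_check = np.polyfit(hours, grade, deg=1)
print(np.isclose(m, m_check), np.isclose(c, c_check))
```

At full precision the slope is about 6.49 and the intercept about 33.77, which gives a predicted grade of roughly 49 for 2.35 hours.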










### Example: Ordinary least squares Linear Regression

Ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being observed) in the given dataset and those predicted by the linear function of the independent variable.

Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface—the smaller the differences, the better the model fits the data.

In [None]:
# Generating Sample data

rng = np.random.RandomState(1)              # instantiate random number generator
x = 10 * rng.rand(50)                       # generate 50 random numbers from uniform distribution
y = 2 * x - 5 + rng.randn(50)               # use 50 random numbers from normal distribution as noise
plt.scatter(x, y, c='b');

**Using Scikit-Learn's Linear Regression estimator to fit the above data and construct the best-fit line**

In [None]:
model = LinearRegression(fit_intercept=True)                   # instantiate LinearRegression

model.fit(x[:, np.newaxis], y)                                 # fit the model on data using 'x' as column vector

xfit = np.linspace(0, 10, 1000)                                # create 1000 points between 0 and 10
yfit = model.predict(xfit[:, np.newaxis])                      # predict the values for dependent variable

plt.scatter(x, y, c='b')
plt.plot(xfit, yfit, 'k');
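To see what the estimator has learned, the fitted slope and intercept can be read from the `coef_` and `intercept_` attributes of `LinearRegression`. The sketch below regenerates the same sample data so that it runs on its own, and compares the fitted parameters with the closed-form least squares formulas. Since the data were generated as `y = 2x - 5` plus noise, the estimates should land near 2 and -5, though not exactly on them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same sample data as above: y = 2x - 5 plus Gaussian noise
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 2 * x - 5 + rng.randn(50)

model = LinearRegression(fit_intercept=True)
model.fit(x[:, np.newaxis], y)

slope = model.coef_[0]          # one coefficient per input feature
intercept = model.intercept_

# Closed-form least squares estimates for a single feature
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()

print("scikit-learn:", slope, intercept)
print("closed form: ", m, c)
```

Both approaches minimize the same sum of squared residuals, so they agree up to floating-point precision.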

### Example: Machine Learning Workflow using Linear-Regression with Auto-MPG Dataset

#### Dataset

In this example, we will be using the “Auto-MPG” dataset.

The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.

Attribute Information:

1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)

Number of instances: 398

**Problem statement:** Predict the fuel consumption in miles per gallon.

#### Loading Data

In [None]:
# Load and read the data
!wget https://cdn.iisc.talentsprint.com/CDS/Datasets/auto_mpg.csv
auto = pd.read_csv("auto_mpg.csv")

Displaying Dataframe

In [None]:
auto.head()

#### Exploring the dataset

In [None]:
# print names of the features
print(auto.columns)

In [None]:
# generating descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
auto.describe()

In [None]:
# summary of the DataFrame
auto.info()

#### Checking for Missing values

In [None]:
auto.isna().sum()

### Visualization of Auto-MPG Dataset

#### Creating a pairplot and a heatmap to check which features seem to be more correlated


In [None]:
# Pairplot
plt.style.use('ggplot')
sns.pairplot(auto)

In [None]:
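# Heatmap

# Work on a numeric copy so that the original DataFrame stays unchanged;
# non-numeric entries become NaN and all-NaN columns (e.g. car name) are dropped
auto_numeric = auto.apply(pd.to_numeric, errors='coerce').dropna(axis=1, how='all')

plt.figure(figsize=(8, 8))
sns.heatmap(auto_numeric.corr(), annot=True, linewidths=0.5, center=0)
plt.show()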

From the above plots, we can see that the features cylinders, displacement, and weight are highly correlated with one another. We can use any one of them for modeling.
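Highly correlated pairs can also be listed programmatically instead of being read off the heatmap. The sketch below is illustrative only: it uses a small synthetic DataFrame (not the Auto-MPG data), made-up column relationships, and an assumed threshold of 0.8.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: 'displacement' is built to depend on 'weight',
# while 'acceleration' is generated independently
rng = np.random.RandomState(0)
weight = rng.uniform(1500, 5000, size=200)
toy = pd.DataFrame({
    "weight": weight,
    "displacement": 0.08 * weight + rng.normal(0, 20, size=200),
    "acceleration": rng.uniform(8, 25, size=200),
})

# Absolute correlations; keep only the upper triangle (k=1 skips the diagonal)
# so that every pair of features appears once
corr = toy.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pairs whose absolute correlation exceeds the assumed threshold
strong = pairs[pairs > 0.8]
print(strong)
```

Here only the weight/displacement pair is reported, so one of the two could be dropped before modeling.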

### Modeling and Prediction (Linear Regression)

In [None]:
auto.head()

In [None]:
# Datatypes of all features
auto.dtypes

In [None]:
# Unique values in horsepower column
auto['horsepower'].unique()

In [None]:
# Removing rows where horsepower is '?'; .copy() avoids a SettingWithCopyWarning
# when the column is modified in the next cell
auto = auto[auto['horsepower'] != '?'].copy()
auto['horsepower'].unique()

In [None]:
# Converting horsepower column datatype from string to float
auto['horsepower'] = auto['horsepower'].astype(float)
auto.dtypes
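Dropping rows is one way to handle the `'?'` placeholder. An alternative is to convert the placeholder to `NaN` with `pd.to_numeric(..., errors='coerce')` and fill it in later with an imputer. The sketch below contrasts the two options on a hypothetical toy column; the values are made up and are not taken from the Auto-MPG file.

```python
import pandas as pd

# Hypothetical toy column that uses '?' as a missing-value placeholder
toy = pd.DataFrame({"horsepower": ["130", "165", "?", "150"]})

# Option 1: drop the rows containing '?', then convert to float
dropped = toy[toy["horsepower"] != "?"].copy()
dropped["horsepower"] = dropped["horsepower"].astype(float)

# Option 2: keep all rows and turn unparseable entries into NaN
coerced = toy.copy()
coerced["horsepower"] = pd.to_numeric(coerced["horsepower"], errors="coerce")

print(len(dropped))                          # 3 rows remain
print(coerced["horsepower"].isna().sum())    # 1 missing value to impute
```

Option 1 loses samples, while Option 2 keeps them at the cost of estimating the missing values.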

In [None]:
# Prediction features
feature_cols = ['displacement', 'horsepower', 'acceleration', 'model year', 'origin']
X = auto[feature_cols]
# Imputing missing values in X with the column mean
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(X)
X = pd.DataFrame(X, columns=feature_cols)
# Target feature
y = auto['mpg']
X.head()

In [None]:
# Splitting the Dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)

In [None]:
X_train.head()

In [None]:
# Instantiating LinearRegression() Model
lr = LinearRegression()

In [None]:
# Training/Fitting the Model
lr.fit(X_train, y_train)

Testing

In [None]:
# Making Predictions
pred = lr.predict(X_test)

In [None]:
# Evaluating Model's Performance
print('Mean Absolute Error:', mean_absolute_error(y_test, pred))
print('Mean Squared Error:', mean_squared_error(y_test, pred))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test, pred)))
print('Coefficient of Determination:', r2_score(y_test, pred))
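To make the four metrics concrete, they can be computed by hand with NumPy and compared against the scikit-learn functions. The sketch below uses made-up true and predicted values (not results from the Auto-MPG model), and the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values purely for illustration
y_true = np.array([18.0, 25.0, 31.0, 14.0, 22.0])
y_hat = np.array([20.0, 24.0, 28.0, 15.0, 22.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))        # Mean Absolute Error
mse = np.mean(errors ** 2)           # Mean Squared Error
rmse = np.sqrt(mse)                  # Root Mean Squared Error
# R^2 = 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, r2)
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```

MAE and RMSE are expressed in the units of the target (here, mpg), whereas the coefficient of determination is unitless, with values closer to 1 indicating a better fit.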

Predicting the value

In [None]:
pred = lr.predict(X_test)
print('Predicted fuel consumption(mpg):', pred[2])
print('Actual fuel consumption(mpg):', y_test.values[2])

### Let us now apply the above learnings to perform a linear regression based price prediction, using a 'Real estate' dataset (Practice section)

Linear regression model implementation

  * Fit the model
  * Do the prediction
  * Plot the straight line for the predicted data using linear regression model



#### Dataset

In this example, we will be using the “Real estate price prediction” dataset

- Transaction date (purchase)
- House age
- Distance to the nearest MRT station (metric not defined)
- Number of convenience stores
- Location (latitude and longitude)
- House price of unit area

Problem statement: Predict the house price of unit area based on various features provided such as house age, location, etc.

#### Importing all the required libraries

In [None]:
# Your Code Here

#### Loading the dataset

In [None]:
# Download the dataset

!wget https://cdn.iisc.talentsprint.com/CDS/Datasets/Real_estate.csv

# Convert it into a pandas dataframe:

df = pd.read_csv('Real_estate.csv')

In [None]:
# View the data

df.head()

#### Dropping non-useful columns

In [None]:
#dropping columns

# YOUR CODE HERE

#### Finding if there are any null values

In [None]:
# YOUR CODE HERE

#### Exploring the data using a scatter plot

In [None]:
# YOUR CODE HERE

#### Training our model

In [None]:
# Separating the data into independent and dependent variables

# YOUR CODE HERE

Splitting the data into training and testing data

In [None]:
# YOUR CODE HERE

#### Training the Linear Regression model on the Training set

In [None]:
# Instantiate a LinearRegression() model

Training/Fitting the Model

In [None]:
# YOUR CODE HERE

#### Exploring the results

In [None]:
# Scatter plot of predicted values

# YOUR CODE HERE

## Q&A


1. What is the difference between the training set and the test set?

    The training set is a subset of your data on which your model will learn how to predict the dependent variable with the independent variables.

    The test set is the complementary subset of the training set, on which you will evaluate your model to see whether it correctly predicts the dependent variable from the independent variables.



2. Why do we split on the dependent variable?

    We want to have well-distributed values of the dependent variable in the training and test set. For example, if we only had the same value of the dependent variable in the training set, our model wouldn't be able to learn any correlation between the independent and dependent variables.




3. What is the purpose of a validation set?

    The Validation Set is a separate section of your dataset that you will use during training to get a sense of how well your model is doing on data that are not being used in training.



4. If your model performs great on the training data but generalizes poorly to new instances, what is happening? Can you name three possible solutions?

   If the model performs well on the training data but poorly on new instances, then it has overfitted the training data. To address this, we can do any of the following: get more data, use a simpler model, or remove outliers and noise from the existing dataset.

## Reference Reading:

1. https://livebook.manning.com/book/real-world-machine-learning/chapter-3/173
(Section 3.3 and 3.3.1)