<a href="https://colab.research.google.com/github/reshma-03/IISc-Projects/blob/main/M4_SNB_MiniProject_01_MLR_MPI_OpenMP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Mini-Project: Implementation of Multiple Linear Regression using MPI, OpenMP

**DISCLAIMER:** THIS NOTEBOOK IS PROVIDED ONLY AS A REFERENCE SOLUTION NOTEBOOK FOR THE MINI-PROJECT. THERE MAY BE OTHER POSSIBLE APPROACHES/METHODS TO ACHIEVE THE SAME RESULTS.

## Learning Objectives

At the end of the mini-project, you will be able to:

* understand the collective communication operations like scatter, gather, broadcast
* understand the blocking and non-blocking communication
* implement multiple linear regression and run it using MPI
* implement the multiple linear regression based predictions using OpenMP

### Dataset

The dataset chosen for this mini-project is [Combined Cycle Power Plant](https://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant). The dataset consists of 9568 records and 5 columns. Each record contains the values for Ambient Temperature, Exhaust Vacuum, Ambient Pressure, Relative Humidity and Energy Output.

Predicting the full-load electrical power output of a base-load power plant is important for maximizing the profit from the available megawatt hours. The base-load operation of a power plant is influenced by four main parameters, which are used as input variables in the dataset: ambient temperature, atmospheric pressure, relative humidity, and exhaust steam pressure. These parameters affect the electrical power output, which is the target variable.

**Note:** The data was collected over a six-year period (2006-2011).

## Information

#### MPI in a Nutshell

MPI stands for "Message Passing Interface". It is a library of functions (in C / Python) or subroutines (in Fortran) that you insert into source code to perform data communication between processes. MPI was developed over two years of discussions led by the MPI Forum, a group of roughly sixty people representing some forty organizations.

To learn more about MPI, click [here](https://hpc-tutorials.llnl.gov/mpi/).


#### Multiple Linear Regression

Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable). The variables we are using to predict the value of the dependent variable are called the independent variables (or sometimes, the predictor, explanatory or regressor variables).

**Note:** We will be using the mpi4py Python package for MPI based code implementation

## Grading = 20 Points

**Run the below code to install mpi4py package**

In [None]:
!pip install mpi4py

Collecting mpi4py
  Downloading mpi4py-4.0.0.tar.gz (464 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: mpi4py
  Building wheel for mpi4py (pyproject.toml) ... done
  Created wheel for mpi4py: filename=mpi4py-4.0.0-cp310-cp310-linux_x86_64.whl size=4266273 sha256=6eb5c002f808605bf577f104f459caafd6bcc914ccb862027fcb2aa9d04b5409
  Stored in directory: /root/.cache/pip/wheels/96/17/12/83db63ee0ae5c4b040ee87f2e5c813aea4728b55ec6a37317c
Successfully built mpi4py
Installing collected packages: mpi4py
Successfully installed mpi4py-4.0.0


#### Importing Necessary Packages

In [None]:
# Importing pandas
import pandas as pd
# Importing Numpy
import numpy as np
# Importing MPI from mpi4py package
from mpi4py import MPI
# Importing the sqrt function from the math module
from math import sqrt
# Importing the Decimal class and the ROUND_HALF_UP rounding mode from the decimal module
from decimal import Decimal, ROUND_HALF_UP
# Importing time to measure execution time
import time

In [None]:
#@title Downloading the data
!wget -qq https://cdn.iisc.talentsprint.com/CDS/Datasets/PowerPlantData.csv

### Overview

* Load the data and perform data pre-processing
* Identify the features, target and split the data into train and test
* Implement multiple Linear Regression by estimating the coefficients on the given data
* Use MPI package to distribute the data and implement `communicator`
* Define functions for each objective and make a script (.py) file to execute using MPI command
* Use OpenMP component to predict the data and calculate the error on the predicted data
* Implement the Linear Regression from `sklearn` and compare the results

#### Exercise 1: Load data (1 point)

Write a function that takes the filename as input and loads the data in a pandas dataframe with the column names as Ambient Temperature, Exhaust Vaccum, Ambient Pressure, Relative Humidity and Energy Output respectively.

**Hint:** read_csv()


In [None]:
FILENAME = "/content/PowerPlantData.csv" # Storing File path
# Defining a function to load the data
def loadData(filename):
    # Loading the dataset and assigning the column names
    data = pd.read_csv(filename, header=0 , names = ['AmbientTemperature', 'ExhaustVaccum', 'AmbientPressure', 'RelativeHumidity', 'EnergyOutput'])
    # Returning the dataframe
    return data
# Calling the function loadData and storing the dataframe in a variable named df
df = loadData(FILENAME)

#### Exercise 2: Explore data (1 point)

Write a function that takes the data loaded using the above defined function as input and explore it.

**Hint:** You can define and check for following things in the dataset inside a function

- checking for the number of rows and columns
- summary of the dataset
- check for the null values
- check for the duplicate values

In [None]:
# Defining a function
def exploreData(data):
    print(data.shape) # Checking for number of rows and columns
    print(data.describe()) # Summary of the data
    print(data.isna().sum()) # Counting the null values in each column
    print(data.duplicated().sum())  # Counting the duplicate rows in the data

In [None]:
# Calling the function exploreData to understand the dataset
exploreData(df)

#### Exercise 3: Handle missing data (1 point)

If exploring the dataset reveals any null values, define a function that takes the data loaded by the function above as input and handles the null values accordingly.

**Hint:**

- Drop the records containing the null values - dropna()
- Replace the null values with the mean/median/mode - fillna()

In [None]:
# Function to handle missing data
def handleMissingData(data):
    data = data.dropna() # dropping the records containing null values using dropna function
    # returning the dataframe after dropping the values
    return data

In [None]:
newdf = handleMissingData(df) # storing the data after removing the null values from it

#### Exercise 4: Scale the data (1 point)

Write a function that takes the data after handling the missing data as input and returns the standardized data.

**Hint:**

- standardization of the data can be performed using the below formula

$ \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)} $
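As a small worked example of the formula above, the sketch below standardizes a made-up list of values using only the standard library. It assumes the sample standard deviation (n - 1 in the denominator), which is also the default of pandas `.std()`. The values and the helper name `standardize` are illustrative and not part of the dataset.

```python
from statistics import mean, stdev

def standardize(values):
    """Return (x - mean) / std for every x, using the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample std (n - 1), matching the pandas default
    return [(v - mu) / sigma for v in values]

temps = [10.0, 20.0, 30.0]  # made-up values for illustration
scaled = standardize(temps)
print(scaled)  # [-1.0, 0.0, 1.0]
```

After standardization the values have mean 0 and standard deviation 1, so features measured in different units become comparable.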

In [None]:
# Defining a function to standardize the data
def standardizeData(dataFile):
    # Applying standardization formula
    dataFile = (dataFile - dataFile.mean()) / dataFile.std()
    # returning the standardized data
    return dataFile

In [None]:
ScaledData = standardizeData(newdf) # Storing the data after applying standardization on the data

#### Exercise 5: Feature selection (1 point)

Write a function that takes scaled data as input and returns the features and target variable values

**Hint:**

- Features: AmbientTemperature, ExhaustVaccum, AmbientPressure, RelativeHumidity
- Target Variable: EnergyOutput

In [None]:
# Function which returns features and target variables
def FeatureSelector(data, target_name):
    target = data[target_name] # Storing the target values
    features = data.drop([target_name],axis=1) # Storing the features by dropping the target variable column
    return features, target # Returning the features and target

In [None]:
features, target = FeatureSelector(ScaledData,'EnergyOutput') # Storing the features and targets in variables respectively

#### Exercise 6: Correlation (1 point)

Calculate correlation between the variables

In [None]:
def correlation_factor(features):
    corr = features.corr()
    print(np.triu(corr))
    return np.triu(corr)

import seaborn as sns
sns.heatmap(correlation_factor(features),annot=True,xticklabels=features.columns, yticklabels=features.columns)
# heatmap is optional

#### Exercise 7: Estimate the coefficients (2 points)

Write a function that takes features and target as input and returns the estimated coefficient values

**Hint:**

- Calculate the estimated coefficients using the below formula

$ \beta = (X^T X)^{-1} X^T y $

- transpose(), np.linalg.inv()
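To see the formula at work, here is a minimal sketch on made-up, noise-free data where the true coefficients are known in advance. It also compares the explicit inverse with `np.linalg.lstsq`, a least-squares solver that avoids forming the inverse. The data, seed and variable names are assumptions for illustration only.

```python
import numpy as np

# Made-up, noise-free data: y = 2*x1 - 3*x2 (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Normal equation: beta = (X^T X)^{-1} X^T y
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Same estimate from a least-squares solver (no explicit inverse)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(beta_inv, 6))
print(np.allclose(beta_inv, beta_lstsq))
```

Because the data contain no noise, both estimates should recover the coefficients 2 and -3 up to floating-point error.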

In [None]:
# Calculating the coefficients
def estimatedCoefficients(x, y):
    # Implementing above formula
    xT = x.transpose() # Transpose of x
    inversed = np.linalg.inv( xT.dot(x) ) # Inverse of a matrix
    coefficients = inversed.dot( xT ).dot(y) # performing final dot operation
    # Returning the coefficients
    return coefficients

In [None]:
estimatedCoefficients(features, target) # Calculating the estimatedCoefficients

#### Exercise 8: Fit the data to estimate the coefficients (2 points)

Write a function named fit which takes features and targets as input and returns the intercept and coefficient values.

**Hint:**

- create a dummy column in the features dataframe which is made up of all ones
- convert the features dataframe into numpy array
- call the estimated coefficients function which is defined above
- np.ones(), np.concatenate()
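The first two hints can be tried on a tiny made-up array before wiring them into `fit`. The sketch below only shows how the column of ones is attached; the array values and the name `design` are illustrative.

```python
import numpy as np

features = np.array([[0.5, 1.5],
                     [2.0, 3.0],
                     [4.0, 5.0]])  # made-up feature matrix

ones = np.ones((features.shape[0], 1))              # one dummy value per row
design = np.concatenate((ones, features), axis=1)   # ones become column 0

print(design.shape)  # (3, 3)
```

With the ones in column 0, the first estimated coefficient plays the role of the intercept and the remaining ones belong to the features.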

In [None]:
# function to add dummy column into features dataframe and converting it into numpy array
def dummyvariable(features):
    # create a array of ones
    m = np.ones((features.shape[0],1))
    # combining the array of ones with features array
    f = np.concatenate((m,features),axis=1)
    # returning the features array
    return f

In [None]:
# defining a fit function
def fit(x, y):
    # prepare x and y values for coefficient estimates
    x = dummyvariable(x) # adding a dummy column of ones
    betas = estimatedCoefficients(x, y) # calculating the estimated coefficients
    # the first beta belongs to the column of ones, so it is the intercept
    intercept = betas[0]
    # the remaining betas are the feature coefficients
    coefficients = betas[1:]
    # returning the intercept and coefficients
    return intercept, coefficients

In [None]:
intercept, coefficients = fit(features, target) # fitting the data and calculating the intercept and coefficients

#### Exercise 9: Predict the data on estimated coefficients (1 point)

Write a function named predict which takes features, intercept and coefficient values as input and returns the predicted values.

**Hint:**

- Fit the intercept, coefficients values in the below equation

  $y = b_0 + b_1*x_1 + ... + b_i*x_i$
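The equation can be evaluated row by row or, equivalently, as one matrix-vector product. The sketch below shows both forms on made-up numbers; the intercept and coefficient values are illustrative, not fitted values.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])             # made-up feature rows
intercept = 0.5                        # b_0 (illustrative value)
coefficients = np.array([2.0, -1.0])   # b_1, b_2 (illustrative values)

# Row-by-row form of y = b_0 + b_1*x_1 + ... + b_i*x_i
loop_preds = [intercept + sum(row * coefficients) for row in X]

# Equivalent vectorized form: one matrix-vector product
vector_preds = X @ coefficients + intercept

print(vector_preds)  # [0.5 2.5]
```

The vectorized form usually runs much faster on large arrays, while the loop form maps more directly onto the parallel version used later in this mini-project.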

In [None]:
# function to predict the values
def predict(x, intercept, coefficients):
    '''
    y = b_0 + b_1*x_1 + ... + b_i*x_i
    '''
    predictions = [] # Defining empty list to store the predicted values
    for index, row in x.iterrows(): # iterating over the rows of the features
        values = row.values # converting each row into an array
        pred = np.multiply(values, coefficients) # multiply the coefficients with the feature values
        pred = sum(pred) # sum of the weighted features
        pred += intercept # finally adding the intercept value
        predictions.append(pred) # appending the values to the list
    # returning the predictions
    return predictions

In [None]:
predict(features, intercept, coefficients ) # Calling the predict function

#### Exercise 10: Root mean squared error (1 point)

Write a function to calculate the root mean squared error (RMSE).

**Hint:**

- [How to calculate the RMSE](https://towardsdatascience.com/what-does-rmse-really-mean-806b65f2e48e)
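For reference, the RMSE is $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$. A minimal worked example on made-up numbers follows; the helper name `rmse_example` is hypothetical and chosen so that it does not clash with the function you are asked to write.

```python
from math import sqrt

def rmse_example(actual, predicted):
    """Square root of the mean of the squared differences."""
    squared_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

actual = [1.0, 2.0, 3.0, 4.0]      # made-up targets
predicted = [1.0, 2.0, 3.0, 6.0]   # one prediction is off by 2
print(rmse_example(actual, predicted))  # 1.0
```

Here the squared errors are 0, 0, 0 and 4, their mean is 1, and the square root of 1 is 1.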

In [None]:
# function to calculate the error
def rmse(actual, predicted):
    # To store the sum of squared errors
    sum_err = 0.0
    # iterating over the actual values
    for i in range(len(actual)):
        # accumulating the squared prediction error
        pred_err = predicted[i] - actual[i]
        sum_err += pred_err ** 2
    # mean of the squared errors
    mean_err = sum_err / float(len(actual))
    # taking the square root of the mean squared error to get the RMSE
    return sqrt(mean_err)

#### Exercise 11: Split the data into train and test (1 point)

Write a function named train_test_split which takes features and targets as input and returns the train and test sets respectively.

**Hint:**

- Shuffle the data
- Consider 70% of the data as a train set and the rest of the data as a test set
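One possible way to follow these hints is to shuffle the row positions and cut them at 70%, which gives an exact 70/30 split. The sketch below uses plain NumPy arrays and a fixed seed; the helper name `split_70_30` and the data are assumptions for illustration, and with pandas objects the same indices could be applied through `.iloc`.

```python
import numpy as np

def split_70_30(X, y, seed=0):
    """Shuffle row indices, then cut them at 70% (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))   # shuffled row positions
    cut = (7 * len(X)) // 10          # number of training rows
    train_idx, test_idx = order[:cut], order[cut:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(20).reshape(10, 2)   # made-up features, 10 rows
y = np.arange(10)                  # made-up targets
X_tr, X_te, y_tr, y_te = split_70_30(X, y)
print(X_tr.shape, X_te.shape)      # (7, 2) (3, 2)
```

A random boolean mask is a simpler alternative, but it only yields roughly 70% of the rows on each run.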

In [None]:
# Divide the data into 70% train set and 30% test set
def train_test_split(features_X , expected_target_Y ):
    # Randomly pick about 70% of the rows using a boolean mask
    set_of_data = np.random.rand(len(features_X)) <= 0.7
    X_train = features_X[set_of_data] # Training features set
    Y_train = expected_target_Y[set_of_data] # target train set
    #the remaining 30% is for the test set
    X_test  = features_X[~set_of_data] # Test features set
    Y_test  = expected_target_Y[~set_of_data] # target test sets
    # Returning the train features set, train targets set, test features sets, test target sets
    return X_train, X_test, Y_train, Y_test

#### Exercise 12: Implement predict using OpenMP (1 point)

Get the predictions for the test data and calculate the test error (RMSE) using an OpenMP-style parallel loop (pymp).

**Hints:**

* Using the pymp.Parallel implement the predict function (use from above)

* Call the predict function by passing test data as an argument

* calculate the error (RMSE) by comparing the Actual test data and predicted test data
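The pattern behind these hints is: preallocate one shared output slot per row, then let each worker fill the slots for its own share of the row indices. The sketch below illustrates that pattern with the standard library's `concurrent.futures` as a stand-in; it is not pymp, and it uses threads, so it shows the structure rather than a speed-up. All names and values are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def predict_rows(rows, intercept, coefficients, out, start, stop):
    """Fill out[start:stop] with b_0 + sum(b_j * x_j) for the matching rows."""
    for i in range(start, stop):
        out[i] = intercept + sum(c * v for c, v in zip(coefficients, rows[i]))

rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]  # made-up
intercept, coefficients = 0.5, [2.0, -1.0]                              # made-up
n_workers = 2
out = [0.0] * len(rows)   # shared output, one slot per row

# Give each worker a contiguous block of row indices
bounds = [(w * len(rows)) // n_workers for w in range(n_workers + 1)]
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    for start, stop in zip(bounds[:-1], bounds[1:]):
        pool.submit(predict_rows, rows, intercept, coefficients, out, start, stop)

print(out)  # [0.5, 2.5, 4.5, 6.5, 8.5]
```

In the pymp version, `pymp.shared.array` takes the place of `out` and `p.range` hands out the indices, so no manual `bounds` computation is needed.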

In [None]:
!pip install pymp-pypi

In [None]:
import pymp
def predict(x, intercept, coefficients):
    '''
    y = b_0 + b_1*x_1 + ... + b_i*x_i
    '''
    st = time.perf_counter()
    predictions = pymp.shared.array((len(x),)) # shared array with one slot per row of x
    with pymp.Parallel(4) as p:
      for index in p.range(len(x)): # each worker handles a share of the rows
          values = x[index] # features of the current row
          pred = np.multiply(values, coefficients) # multiply the coefficients with the feature values
          pred = sum(pred) # sum of the weighted features
          pred += intercept # finally adding the intercept value
          predictions[index] = pred # storing the prediction at the row's position
    # printing the elapsed time
    print(time.perf_counter() - st)

    # returning the predictions
    return predictions

In [None]:
X_train, X_test, Y_train, Y_test =  train_test_split(features, target)

In [None]:
# fit the data with x, y to calculate the coefficients
b0_openmp, new_coefficients_openmp = fit(X_train, Y_train)

In [None]:
# predicting the test data
test_predictions = np.array(predict(X_test.values, b0_openmp, new_coefficients_openmp))

# calculating the error
print("Test set error(RMSE) is {}" .format(rmse(Y_test.values, test_predictions)))

#### Exercise 13: Create a communicator (1 point)

Create a communicator and define the rank and size

In [None]:
# Initialize communicator
comm = MPI.COMM_WORLD
# ID of the current worker
rank = comm.Get_rank()
# Status object that holds metadata (such as source and tag) of received messages
status = MPI.Status()
# Number of workers
size = comm.Get_size()
root = 0   # Root
# to calculate time
t_start = 0
t_diff = 0

#### Exercise 14: Divide the data into slices (1 point)

Write a function named dividing_data which takes train features set, train target set, and size of workers as inputs and returns the sliced data for each worker.

![img](https://cdn.iisc.talentsprint.com/CDS/Images/MiniProject_MPI_DataSlice.JPG)

For example, if there are 4 processes, slice the data into 4 equal parts of 25% each

**Hint:**

- Divide the Data equally among the workers
  - Create an empty list
  - Iterate over the size of workers
  - Append each slice of data to the list
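When the number of rows does not divide evenly by the number of workers, the slices cannot all be the same size. As a small illustration, `np.array_split` handles that case by making the first slices one row longer; the row count and worker count below are made up.

```python
import numpy as np

rows = np.arange(10)   # stand-in for 10 training rows
n_workers = 4          # illustrative worker count

# array_split tolerates sizes that do not divide evenly
slices = np.array_split(rows, n_workers)
print([len(s) for s in slices])  # [3, 3, 2, 2]
```

Every row lands in exactly one slice, so no training data is lost or duplicated.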

In [None]:
def dividing_data(x_train, y_train, size_of_workers):
    #Divide the data among the workers
    slice_for_each_worker = int(Decimal(x_train.shape[0]/size_of_workers).quantize(Decimal('1.'), rounding = ROUND_HALF_UP))
    print('Slice of data for each worker: {}'.format(slice_for_each_worker))
    x_data_for_worker = []
    y_data_for_worker = []
    for i in range(0,size_of_workers):
        if i < size_of_workers - 1:
            x_data_for_worker.append(x_train[slice_for_each_worker*i:slice_for_each_worker*(i+1)])
            y_data_for_worker.append(y_train[slice_for_each_worker*i:slice_for_each_worker*(i+1)])
        else:
            x_data_for_worker.append(x_train[slice_for_each_worker*i:])
            y_data_for_worker.append(y_train[slice_for_each_worker*i:])
    return x_data_for_worker, y_data_for_worker

# Alternate way is to use np.array_split(), which also handles row counts that do not divide evenly

#### Exercise 15: Prepare the data in root worker to assign data for all the workers (1 point)

- When it is the root worker, perform the below operation:
    - Store the features and target values in separate variables
    - Split the data into train and test sets using the train_test_split function defined above
    - Divide the data among the workers using the dividing_data function above

In [None]:
if rank == root:
    t_start = MPI.Wtime()
    # Splitting the data into train and test
    X_train, X_test, Y_train, Y_test =  train_test_split(features, target)
    #Divide the data among the workers
    x_data_for_worker, y_data_for_worker = dividing_data(X_train, Y_train, size)
wt = MPI.Wtime()

#### Exercise 16: Scatter and gather the data (1 point)

Perform the below operations:

- Send slices of the training set(the features data X and the expected target data Y) to every worker including the root worker
    - **Hint:** scatter()
    - use `barrier()` to block workers until all workers in the group reach a Barrier, to scatter from root worker.
- Every worker should get the predicted target Y(yhat) for each slice
- Get the new coefficient of each instance in a slice
    - **Hint:** fit function defined above
- Gather the new coefficient from each worker
    - **Hint:** gather()
    - Take the mean of the gathered coefficients
- Calculate the root mean square error for the test set

To know more about `scatter`, `gather` and `barrier` click [here](https://nyu-cds.github.io/python-mpi/05-collectives/)
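The data flow of these steps can be rehearsed in a single process before involving MPI. The sketch below is a simulation with plain NumPy, not mpi4py: list slicing stands in for `scatter`, a list of per-slice results stands in for `gather`, and the data are made-up and noise-free so the expected coefficients are known.

```python
import numpy as np

# Made-up, noise-free data: y = 2*x1 - 3*x2 (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1]
n_workers = 4

# "Scatter": the root cuts the training rows into one slice per worker
X_slices = np.array_split(X, n_workers)
y_slices = np.array_split(y, n_workers)

# Each "worker" fits its own slice by least squares
local_coefs = [np.linalg.lstsq(Xs, ys, rcond=None)[0]
               for Xs, ys in zip(X_slices, y_slices)]

# "Gather" + mean: the root averages the per-worker coefficients
mean_coef = np.mean(local_coefs, axis=0)
print(np.round(mean_coef, 6))
```

With noise-free data every slice recovers the same coefficients, so the average matches them too. On real, noisy data the average of per-slice fits is only an approximation of the fit on the full training set.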

In [None]:
# Send the slice of data to work on to each worker
sliced_features_X_train = comm.scatter(x_data_for_worker, root = root)
sliced_expected_target_Y_train = comm.scatter(y_data_for_worker, root = root)
Xm = sliced_features_X_train.values
ym = sliced_expected_target_Y_train.values

In [None]:
# checking the shape of the features and target received from scatter
Xm.shape, ym.shape

In [None]:
# fit the data with x, y to calculate the coefficients
b0, new_coefficients = fit(Xm, ym)

In [None]:
b0, new_coefficients

#### Predict the output using the new coefficients

In [None]:
# function to predict the values
def predict(x, intercept, coefficients):
    '''
    y = b_0 + b_1*x_1 + ... + b_i*x_i
    '''
    predictions = [] # Defining empty list to store the predicted values
    for index, row in x.iterrows(): # iterating over the rows of the features
        values = row.values # converting each row into an array
        pred = np.multiply(values, coefficients) # multiply the coefficients with the feature values
        pred = sum(pred) # sum of the weighted features
        pred += intercept # finally adding the intercept value
        predictions.append(pred) # appending the values to the list
    # returning the predictions
    return predictions

In [None]:
predicted_y_sliced = predict(sliced_features_X_train, b0, new_coefficients)

#### Gather the new coefficients and calculate the error on train and test data

In [None]:
# Gather the new coefficients and intercept computed on each slice of the training data
gather_new_coefficients = pd.DataFrame(comm.gather(new_coefficients, root=root))
gather_b0 = comm.gather(b0, root=root)
comm.barrier()
if rank == root:
    # averaging the per-worker estimates
    coef = gather_new_coefficients.mean()
    b0_mean = np.mean(gather_b0)
    #print(coef)
    predicted_y = predict(X_test, b0_mean, coef)
    print("Test set error(RMSE) is {}" .format(rmse(Y_test.values, np.array(predicted_y))))
    #print("Train set error(RMSE) is {}" .format( rmse(Y_train.values, np.array(predicted_y_sliced))))
t_diff = MPI.Wtime() - t_start
print('Process {}: {} secs.' .format(rank,t_diff))

#### Exercise 17: Make a script and execute everything in one place (1 point)

Write a script(.py) file which contains the code of all the above exercises in it so that you can run the code on multiple processes using MPI.

**Hint:**

- magic commands
- put MPI related code under main function
- !mpirun --allow-run-as-root -np 4 python filename.py

In [None]:
%%writefile mlrMPI.py

import pandas as pd # Importing pandas package under a name pd
import numpy as np # Importing Numpy package under a name np
from mpi4py import MPI # Importing MPI from mpi4py package
from math import sqrt # Importing sqrt function from the math module
from decimal import Decimal, ROUND_HALF_UP # Importing the Decimal class and ROUND_HALF_UP rounding mode from the decimal module

FILENAME = "/content/PowerPlantData.csv" # Storing File path
# Defining a function to load the data
def loadData(filename):
    # Loading the dataset and assigning the column names
    data = pd.read_csv(filename, header=0 , names = ['AmbientTemperature', 'ExhaustVaccum', 'AmbientPressure', 'RelativeHumidity', 'EnergyOutput'])
    # Returning the dataframe
    return data
# Calling the function loadData and storing the dataframe in a variable named df
df = loadData(FILENAME)

 # Defining a function
def exploreData(data):
    print(data.shape) # Checking for number of rows and columns
    print(data.describe()) # Summary of the data
    print(data.isna()) # Checking for the null values in the data
    print(sum(data.duplicated()))  # Checking for the duplicate values in the data
 # Function to handle missing data
def handleMissingData(data):
    data = data.dropna() # dropping the records containing null values using dropna function
    # returning the dataframe after dropping the values
    return data

newdf = handleMissingData(df) # storing the data after removing the null values from it

# Defining a function to standardize the data
def standardizeData(dataFile):
    # Applying standardization formula
    dataFile = (dataFile - dataFile.mean()) / dataFile.std()
    # returning the standardized data
    return dataFile

ScaledData = standardizeData(newdf) # Storing the data after applying standardization on the data

# Function which returns features and target variables
def FeatureSelector(data,target_name):
    target = data[target_name] # Storing the target values
    features = data.drop([target_name],axis=1) # Storing the features by dropping the target variable column
    return features, target # Returning the features and target

features, target = FeatureSelector(ScaledData,'EnergyOutput') # Storing the features and targets in variables respectively

 # Calculating the coefficients
def estimatedCoefficients(x, y):
    # Implementing above formula
    xT = x.transpose() # Transpose of x
    inversed = np.linalg.inv( xT.dot(x) ) # Inverse of a matrix
    coefficients = inversed.dot( xT ).dot(y) # performing final dot operation
    # Returning the coefficients
    return coefficients

 # function to add dummy column into features dataframe and converting it into numpy array
def dummyvariable(features):
    # create a array of ones
    m = np.ones((features.shape[0],1))
    # combining the array of ones with features array
    f = np.concatenate((m,features),axis=1)
    # returning the features array
    return f

# defining a fit function
def fit(x, y):
    # prepare x and y values for coefficient estimates
    x = dummyvariable(x) # adding a dummy column of ones
    betas = estimatedCoefficients(x, y) # calculating the estimated coefficients
    # the first beta belongs to the column of ones, so it is the intercept
    intercept = betas[0]
    # the remaining betas are the feature coefficients
    coefficients = betas[1:]
    # returning the intercept and coefficients
    return intercept, coefficients

intercept, coefficients = fit(features, target) # fitting the data and calculating the intercept and coefficients

# function to predict the values
def predict(x, intercept, coefficients):
    '''
    y = b_0 + b_1*x_1 + ... + b_i*x_i
    '''
    predictions = [] # Defining empty list to store the predicted values
    for index, row in x.iterrows(): # iterating over the rows of the features
        values = row.values # converting each row into an array
        pred = np.multiply(values, coefficients) # multiply the coefficients with the feature values
        pred = sum(pred) # sum of the weighted features
        pred += intercept # finally adding the intercept value
        predictions.append(pred) # appending the values to the list
    # returning the predictions
    return predictions

# function to calculate the error
def rmse(actual, predicted):
    # To store the sum of squared errors
    sum_err = 0.0
    # iterating over the actual values
    for i in range(len(actual)):
        # accumulating the squared prediction error
        pred_err = predicted[i] - actual[i]
        sum_err += pred_err ** 2
    # mean of the squared errors
    mean_err = sum_err / float(len(actual))
    # taking the square root of the mean squared error to get the RMSE
    return sqrt(mean_err)

 # Divide the data into 70% train set and 30% test set
def train_test_split(features_X , expected_target_Y ):
    # Randomly pick about 70% of the rows using a boolean mask
    set_of_data = np.random.rand(len(features_X)) <= 0.7
    X_train = features_X[set_of_data] # Training features set
    Y_train = expected_target_Y[set_of_data] # target train set
    #the remaining 30% is for the test set
    X_test  = features_X[~set_of_data] # Test features set
    Y_test  = expected_target_Y[~set_of_data] # target test sets
    # Returning the train features set, train targets set, test features sets, test target sets
    return X_train, X_test, Y_train, Y_test

def dividing_data(x_train, y_train, size_of_workers):
    #Divide the data among the workers
    slice_for_each_worker = int(Decimal(x_train.shape[0]/size_of_workers).quantize(Decimal('1.'), rounding = ROUND_HALF_UP))
    print('Slice of data for each worker: {}'.format(slice_for_each_worker))
    x_data_for_worker = []
    y_data_for_worker = []
    for i in range(0,size_of_workers):
        if i < size_of_workers - 1:
            x_data_for_worker.append(x_train[slice_for_each_worker*i:slice_for_each_worker*(i+1)])
            y_data_for_worker.append(y_train[slice_for_each_worker*i:slice_for_each_worker*(i+1)])
        else:
            x_data_for_worker.append(x_train[slice_for_each_worker*i:])
            y_data_for_worker.append(y_train[slice_for_each_worker*i:])
    return x_data_for_worker, y_data_for_worker

def main():
    comm = MPI.COMM_WORLD                       # Initialize communicator
    rank = comm.Get_rank()                      # ID of the current worker
    status = MPI.Status()                       # Status object for message metadata
    size = comm.Get_size()                      # Number of workers
    root = 0   # Root
    x_data_for_worker, y_data_for_worker = [], []
    if rank == root:
        # Splitting the data into train and test
        X_train, X_test, Y_train, Y_test =  train_test_split(features, target)
        #Divide the data among the workers
        x_data_for_worker, y_data_for_worker = dividing_data(X_train, Y_train, size)

    # Send the slice of data to work on to each worker
    sliced_features_X_train = comm.scatter(x_data_for_worker, root = root)
    sliced_expected_target_Y_train = comm.scatter(y_data_for_worker, root = root)
    Xm = sliced_features_X_train.values
    ym = sliced_expected_target_Y_train.values
    # fit the data with x, y to calculate the coefficients
    b0, new_coefficients = fit(Xm, ym)
    ##predicted_y_sliced = predict(sliced_features_X_train, b0, new_coefficients)
    # Gather the new coefficients and intercept computed on each slice of the training data
    gather_new_coefficients = pd.DataFrame(comm.gather(new_coefficients, root=root))
    gather_b0 = comm.gather(b0, root=root)
    comm.barrier()
    if rank == root:
        # averaging the per-worker estimates
        coef = gather_new_coefficients.mean()
        b0_mean = np.mean(gather_b0)
        #print(coef)
        predicted_y = predict(X_test, b0_mean, coef)
        print("Test set error(RMSE) is {}" .format(rmse(Y_test.values, np.array(predicted_y))))

if __name__ == "__main__":
    main()

In [None]:
!mpirun --allow-run-as-root -np 4 python mlrMPI.py

Note: If you encounter any issue while executing the MPI file, copy only the MPI-related code into a new notebook and execute it there.

#### Exercise 18: Use Sklearn to compare (1 point)

Apply linear regression to the given data using the sklearn package and compare with the above results

**Hint:**
* Split the data into train and test
* Fit the train data and predict the test data using `sklearn Linear Regression`
* Compare the coefficients and intercept with above estimated coefficients
* calculate loss (RMSE) on test data and predictions and compare

In [None]:
xtrain, xtest, ytrain, ytest = train_test_split(features, target)

In [None]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(xtrain, ytrain)
predictions = reg.predict(xtest)

In [None]:
# coefficients from sklearn Linear Regression
print(reg.coef_)
# intercept from sklearn Linear Regression
print(reg.intercept_)

In [None]:
from sklearn.metrics import mean_squared_error
print(sqrt(mean_squared_error(ytest, predictions)))