<center>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo">
</center>

# Exports and imports of India Analysis

# Lab. 4. Model Development

Estimated time needed: **30** minutes

## About dataset and a course

<details><summary>Click here for details about the dataset and the course.</summary>

# Incoming data
<ul>
    <li>Export: This field represents the total value of goods and services exported by India to other countries during a specific period. The value is measured in million US dollars.</li>
    <li>Import: This field represents the total value of goods and services imported by India from other countries during a specific period. The value is measured in million US dollars.</li>
    <li>Total Trade: This field represents the total value of exports and imports combined. It shows the volume of international trade that India has with other countries.</li>
    <li>Trade Balance: This field represents the difference between the total value of exports and the total value of imports. A positive trade balance indicates that India is exporting more than it is importing, while a negative trade balance indicates the opposite.</li>
    
</ul>

# Target value
<ul>
    <li>Trade Balance</li>
</ul>
    
# Course objectives
In this dataset, you will get data about exports and imports of India (1997-July 2022).<br>
During this course you will be learning Data Analysis with Python. You will learn how to:
<ul>
    <li>Find the average volume of India's exports and imports;</li>
    <li>Analyze the structure of export-import operations by partner countries;</li>
    <li>Examine the impact of the volume of exports and imports on the overall trade balance;</li>
    <li>Estimate the planned volume and structure of imports and exports from India's partner countries for the next period,</li>
</ul>

using Python libraries.
</details>

## Laboratory work objectives

After completing this lab you will be able to:

*   Develop prediction models


<p>In this section, we will develop several models that will predict the trade balance between the USA and India using the variables or features. This is just an estimate but should give us an objective idea of what the trade balance between the two countries will be.</p>


<p>In data analytics, we often use <b>Model Development</b> to help us predict future observations from the data we have.</p>

<p>A model will help us understand the exact relationship between different variables and how these variables are used to predict the result.</p>


#### Setup


Import libraries:


In [ ]:
! mamba install scikit-learn -y

In [ ]:
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
from sklearn import set_config
# set_config(display="diagram")
warnings.filterwarnings('ignore')

#### Reading the dataset from the URL and adding the related headers


First, we assign the URL of the dataset to "path".


In [ ]:
path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0SO3EN/new_df.csv'

Use the Pandas method <b>read_csv()</b> to load the data from the web address.


In [ ]:
df = pd.read_csv(path)
df.set_index('Financial Year', inplace = True)

In [ ]:
# To see what the data set looks like, we'll use the head() method.
df.head()

## 1. Linear Regression and Multiple Linear Regression


#### Linear Regression


<p>One example of a Data  Model that we will be using is:</p>
<b>Simple Linear Regression</b>

<br>
<p>Simple Linear Regression is a method to help us understand the relationship between two variables:</p>
<ul>
    <li>The predictor/independent variable (X)</li>
    <li>The response/dependent variable (that we want to predict)(Y)</li>
</ul>

<p>The result of Linear Regression is a <b>linear function</b> that predicts the response (dependent) variable as a function of the predictor (independent) variable.</p>


$$
Y: Response \ Variable\\\\
X: Predictor \ Variable
$$


<b>Linear Function</b>
$$
Yhat = a + b  X
$$


<ul>
    <li>a refers to the <b>intercept</b> of the regression line, in other words: the value of Y when X is 0</li>
    <li>b refers to the <b>slope</b> of the regression line, in other words: the value with which Y changes when X increases by 1 unit</li>
</ul>
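
As a rough illustration of how the intercept a and the slope b are obtained, the sketch below computes the ordinary least-squares estimates by hand with NumPy. The arrays `x_demo` and `y_demo` are made-up numbers chosen for this example, not values from the trade dataset.

```python
import numpy as np

# Hypothetical toy data (not from the trade dataset): y is exactly 2 + 3 * x
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = 2.0 + 3.0 * x_demo

# Centre both variables around their means
x_centered = x_demo - x_demo.mean()
y_centered = y_demo - y_demo.mean()

# Least-squares slope: co-variation of x and y divided by the variation of x
b_demo = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)

# Least-squares intercept: the fitted line passes through the point of means
a_demo = y_demo.mean() - b_demo * x_demo.mean()

print(a_demo, b_demo)
```

Because the toy data lies exactly on a line, the estimates recover the intercept 2 and the slope 3 used to generate it.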


#### Let's load the modules for linear regression:


In [ ]:
from sklearn.linear_model import LinearRegression

#### Create the linear regression object:


In [ ]:
lm = LinearRegression()
lm

#### How could "Export" help us predict trade balance?


For this example, we want to look at how export can help us predict trade balance.
Using simple linear regression, we will create a linear function with "Export" as the predictor variable and the "Trade Balance" as the response variable.


In [ ]:
X = df[['Export']]
Y = df['Trade Balance']

Fit the linear model using export:


In [ ]:
lm.fit(X, Y)

We can output a prediction:


In [ ]:
Yhat = lm.predict(X)
Yhat[0:5]

#### What is the value of the intercept (a)?


In [ ]:
lm.intercept_

#### What is the value of the slope (b)?


In [ ]:
lm.coef_

### What is the final estimated linear model we get?


As we saw above, we should get a final linear model with the structure:


$$
Yhat = a + b  X
$$


Plugging in the actual values we get:


<b>Trade Balance</b> = -495.83 + 0.40 x <b>Export</b>


In [ ]:
plt.figure()
Y.plot()
plt.plot(Y.index, lm.predict(X))

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #1 a):</b>

<b>Create a linear regression object called "lm1".</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
lm1 = LinearRegression()
lm1

<details><summary>Click here for the solution</summary>

```python
lm1 = LinearRegression()
lm1
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #1 b): </b>

<b>Train the model using "Import" as the independent variable and "Trade Balance" as the dependent variable.</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
lm1.fit(df[['Import']], df[['Trade Balance']])
lm1

<details><summary>Click here for the solution</summary>

```python
lm1.fit(df[['Import']], df[['Trade Balance']])
lm1
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #1 c): </b>

<b>Find the slope and intercept of the model.</b>

</div>


#### Slope


In [ ]:
# Write your code below and press Shift+Enter to execute 
lm1.coef_

<details><summary>Click here for the solution</summary>
    
```python
# Slope 
lm1.coef_
```
</details>


#### Intercept


In [ ]:
# Write your code below and press Shift+Enter to execute 
lm1.intercept_

<details><summary>Click here for the solution</summary>

```python
# Intercept
lm1.intercept_
```

</details>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #1 d): </b>

<b>What is the equation of the predicted line? You can use x and yhat or "Import" and "Trade Balance".</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
# using X and Y  

Yhat = 1617.82 + 0.54*X

Trade_Balance = 1617.82 + 0.54*df['Import']

In [ ]:
Trade_Balance

<details><summary>Click here for the solution</summary>

```python
# using X and Y  
Yhat = 1617.82 + 0.54*X

Trade_Balance = 1617.82 + 0.54*df['Import']

```

</details>


#### Multiple Linear Regression


<p>What if we want to predict trade balance using more than one variable?</p>

<p>If we want to use more variables in our model to predict trade balance, we can use <b>Multiple Linear Regression</b>.
Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and <b>two or more</b> predictor (independent) variables.
Most real-world regression models involve multiple predictors. We will illustrate the structure by using two predictor variables, but these results generalize to any number of predictors:</p>


$$
Y: Response \ Variable\\\\
X_{\scriptsize{1}}: Predictor\ Variable \ 1\\\\
X_{\scriptsize{2}}: Predictor\ Variable \ 2
$$


$$
a: intercept\\\\
b_{\scriptsize{1}}: coefficient \ of\ Variable \ 1\\\\
b_{\scriptsize{2}}: coefficient \ of\ Variable \ 2
$$


The equation is given by:


$$
Yhat = a +  b_{\scriptsize{1}} X_{\scriptsize{1}} + b_{\scriptsize{2}} X_{\scriptsize{2}}
$$


<p>From the previous section  we know that other good predictors of trade balance could be:</p>
<ul>
    <li>Export</li>
    <li>Import</li>
</ul>
Let's develop a model using these variables as the predictor variables.


In [ ]:
Z = df[['Export', 'Import']]

Fit the linear model using the two above-mentioned variables.


In [ ]:
lm.fit(Z, df['Trade Balance'])

What is the value of the intercept(a)?


In [ ]:
lm.intercept_

What are the values of the coefficients (b1, b2)?


In [ ]:
lm.coef_

What is the final estimated linear model that we get?


As we saw above, we should get a final linear function with the structure:

$$
Yhat = a +  b_{\scriptsize{1}} X_{\scriptsize{1}} + b_{\scriptsize{2}} X_{\scriptsize{2}}
$$


What is the linear function we get in this example?


<b>Trade Balance</b> = 0.004713630762125831 + 0.99999974 x <b>Export</b> - 0.99999974 x <b>Import</b>


In [ ]:
plt.figure()
Y.plot()
plt.plot(Y.index, lm.predict(Z))

As we can see, the two graphs overlap almost exactly. This is because "Trade Balance" is, by definition, "Export" minus "Import", so the fitted coefficients are approximately 1 and -1 and the model simply reproduces that identity rather than forecasting anything. Going forward, to forecast the Trade Balance, we will treat the dataset as a time series.
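
To make the point about the direct linear relationship concrete, here is a minimal sketch with made-up export and import figures (the arrays `export_demo` and `import_demo` are assumptions for illustration, not rows of the dataset). When the target is exactly export minus import, a least-squares fit returns coefficients of 1 and -1 with a zero intercept.

```python
import numpy as np

# Hypothetical export and import values (millions of US dollars), for illustration only
export_demo = np.array([100.0, 150.0, 180.0, 260.0, 300.0])
import_demo = np.array([90.0, 170.0, 160.0, 240.0, 350.0])
balance_demo = export_demo - import_demo  # trade balance by definition

# Design matrix with a column of ones for the intercept
A_demo = np.column_stack([np.ones_like(export_demo), export_demo, import_demo])

# Solve the least-squares problem; coef_demo holds intercept, export and import coefficients
coef_demo, *_ = np.linalg.lstsq(A_demo, balance_demo, rcond=None)

print(np.round(coef_demo, 6))
```

A perfect fit of this kind says nothing about future values, which is why the lab switches to lagged values of the trade balance as predictors.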


#### Linear Regression for time series


First of all, let's extract the Trade Balance column as a separate pandas Series:


In [ ]:
df_Trade_Balance = df['Trade Balance']

Let's visualize our new data set using plot_acf


In [ ]:
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.stattools import acf

In [ ]:
plot_acf(df_Trade_Balance, lags=10)

The plot shows the autocorrelation of the series, that is, how strongly the trade balance in a given year depends on the trade balance in previous years. For example, the correlation coefficient at a lag of 1 year is around 0.7, at a lag of 2 years around 0.55, and so on. We can conclude that the current value depends quite strongly on the data of the previous two years. We can also get an array of autocorrelation values using <code>acf</code>.


In [ ]:
acf(df_Trade_Balance, nlags=10)
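
For intuition about what an autocorrelation value measures, the following sketch computes it by hand for a made-up series using one common estimator (lagged products of the centred series divided by the total sum of squares). The helper `autocorr` and the data `series_demo` are hypothetical names introduced only for this example.

```python
import numpy as np

def autocorr(series, lag):
    """Sample autocorrelation at a positive lag, normalised by the full-series sum of squares."""
    s = np.asarray(series, dtype=float)
    d = s - s.mean()
    # Multiply each centred value by the centred value `lag` steps earlier
    return np.sum(d[lag:] * d[:-lag]) / np.sum(d ** 2)

# Hypothetical series with a steady upward trend, not the trade data
series_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

acf1_demo = autocorr(series_demo, 1)
acf2_demo = autocorr(series_demo, 2)
print(acf1_demo, acf2_demo)
```

As with the trade balance, the correlation weakens as the lag grows, because values further apart in time share less information.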

Let's create a new dataFrame, that will contain three columns:
<ul>
    <li><code>Trade Balance</code>: this field represents the difference between the total value of exports and the total value of imports.</li>
    <li><code>Trade Balance 1 year ago</code>: this field represents the difference between the total value of exports and the total value of imports 1 year ago. </li>
    <li><code>Trade Balance 2 years ago</code>: this field represents the difference between the total value of exports and the total value of imports 2 years ago. </li>
</ul>


In [ ]:
time_series_df = pd.DataFrame({'Trade Balance': df_Trade_Balance,
                   'Trade Balance 1 year ago': df_Trade_Balance.shift(1),
                   'Trade Balance 2 years ago': df_Trade_Balance.shift(2)})

# set the index name to 'Financial Year'
time_series_df.index.name = 'Financial Year'

# drop NaN values
time_series_df.dropna(inplace = True)

# print the new DataFrame
time_series_df

Now we have data that can be used for prediction of <code>Trade Balance</code>.
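
The effect of <code>shift</code> is easier to see on a tiny example. The sketch below uses a made-up Series `s_demo` (not the trade data) to show how each lagged column lines up with the current value and why the first rows are dropped.

```python
import pandas as pd

# Hypothetical yearly values, for illustration only
s_demo = pd.Series([10, 20, 30, 40, 50],
                   index=[2001, 2002, 2003, 2004, 2005], name='value')

lagged_demo = pd.DataFrame({
    'value': s_demo,
    'value 1 year ago': s_demo.shift(1),   # value from the previous row
    'value 2 years ago': s_demo.shift(2),  # value from two rows back
})

# The first two rows have no complete history, so they contain NaN and are removed
lagged_demo = lagged_demo.dropna()

print(lagged_demo)
```

Each remaining row now pairs a year's value with the values of the two preceding years, which is the layout a regression on lagged values needs.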


#### Linear Regression for time series


In [ ]:
X = time_series_df[['Trade Balance 1 year ago']]
Y = time_series_df['Trade Balance']

Fit the linear model using Trade Balance 1 year ago:


In [ ]:
lm.fit(X,Y)

We can output a prediction:


In [ ]:
Yhat=lm.predict(X)
Yhat[0:5]

#### What is the value of the intercept (a)?


In [ ]:
lm.intercept_

#### What is the value of the slope (b)?


In [ ]:
lm.coef_

### What is the final estimated linear model we get?


As we saw above, we should get a final linear model with the structure:


$$
Yhat = a + b  X
$$


Plugging in the actual values we get:


<b>Trade Balance</b> = 3713.712 + 0.69145869 x <b>Trade Balance 1 year ago</b>


In [ ]:
plt.figure()
Y.plot()
plt.plot(Y.index, lm.predict(X))

#### Multiple Linear Regression for time series


<p>Let's predict trade balance using two variables: <code>Trade Balance 1 year ago</code>; <code>Trade Balance 2 years ago</code>.</p>


$$
Y: Response \ Variable\\\\
X_{\scriptsize{1}}: Predictor\ Variable \ 1\\\\
X_{\scriptsize{2}}: Predictor\ Variable \ 2
$$


$$
a: intercept\\\\
b_{\scriptsize{1}}: coefficient \ of\ Variable \ 1\\\\
b_{\scriptsize{2}}: coefficient \ of\ Variable \ 2
$$


$$
Yhat = a +  b_{\scriptsize{1}} X_{\scriptsize{1}} + b_{\scriptsize{2}} X_{\scriptsize{2}}
$$


Let's develop a model using these variables as the predictor variables.


In [ ]:
Z = time_series_df[['Trade Balance 1 year ago', 'Trade Balance 2 years ago']]

Fit the linear model using the two above-mentioned variables.


In [ ]:
lm_ts = LinearRegression()
lm_ts

In [ ]:
lm_ts.fit(Z, time_series_df['Trade Balance'])

What is the value of the intercept(a)?


In [ ]:
lm_ts.intercept_

What are the values of the coefficients (b1, b2)?


In [ ]:
lm_ts.coef_

What is the final estimated linear model that we get?


As we saw above, we should get a final linear function with the structure:

$$
Yhat = a +  b_{\scriptsize{1}} X_{\scriptsize{1}} + b_{\scriptsize{2}} X_{\scriptsize{2}}
$$

What is the linear function we get in this example?


<b>Trade Balance</b> = 911.3831529376312 + 0.70005002 x <b>Trade Balance 1 year ago</b> + 0.10936497 x <b>Trade Balance 2 years ago</b>


In [ ]:
plt.figure()
Y.plot()
plt.plot(Z.index, lm_ts.predict(Z))

In [ ]:
Z = time_series_df[['Trade Balance 1 year ago', 'Trade Balance 2 years ago']]

## 2. Model Evaluation Using Visualization


Now that we've developed some models, how do we evaluate our models and choose the best one? One way to do this is by using a visualization.


Import the visualization package, seaborn:


In [ ]:
# import the visualization package: seaborn
import seaborn as sns
%matplotlib inline 

### Regression Plot


<p>When it comes to simple linear regression, an excellent way to visualize the fit of our model is by using <b>regression plots</b>.</p>

<p>This plot will show a combination of a scattered data points (a <b>scatterplot</b>), as well as the fitted <b>linear regression</b> line going through the data. This will give us a reasonable estimate of the relationship between the two variables, the strength of the correlation, as well as the direction (positive or negative correlation).</p>


Let's visualize **Trade Balance 2 years ago** as a potential predictor variable of Trade Balance:


In [ ]:
width, height = 10, 5
plt.figure(figsize=(width, height))
# Predictor on the x-axis, response on the y-axis
sns.regplot(x="Trade Balance 2 years ago", y="Trade Balance", data=time_series_df)
plt.ylim(0,)

<p>We can see from this plot that Trade Balance is positively correlated with Trade Balance 2 years ago, since the regression slope is positive.

One thing to keep in mind when looking at a regression plot is to pay attention to how scattered the data points are around the regression line. This will give you a good indication of the variance of the data and whether a linear model would be the best fit or not. If the data points lie too far from the line, a linear model might not be the best model for this data.

Let's compare this plot to the regression plot of "Trade Balance 1 year ago".</p>


In [ ]:
plt.figure(figsize=(width, height))
# Predictor on the x-axis, response on the y-axis
sns.regplot(x="Trade Balance 1 year ago", y="Trade Balance", data=time_series_df)
plt.ylim(0,)

<p>Comparing the regression plots of "Trade Balance 1 year ago" and "Trade Balance 2 years ago", we see that the points for "Trade Balance 1 year ago" are closer to the generated line. Both variables can be used for prediction, but the <code>Trade Balance 1 year ago</code> linear model might be the better model for this data.</p>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #2: </b>
    
<b>Given the regression plots above, is "Trade Balance 1 year ago" or "Trade Balance 2 years ago" more strongly correlated with "Trade Balance"? Use the method  ".corr()" to verify your answer.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
time_series_df[["Trade Balance 1 year ago","Trade Balance 2 years ago","Trade Balance"]].corr()

<details><summary>Click here for the solution</summary>

```python
# The variable "Trade Balance 1 year ago" has a slightly stronger correlation with "Trade Balance": approximately 0.703202, compared to approximately 0.701038 for "Trade Balance 2 years ago". You can verify it using the following command:

time_series_df[["Trade Balance 1 year ago","Trade Balance 2 years ago","Trade Balance"]].corr()

```

</details>


#### Residual Plot

<p>A good way to visualize the variance of the data is to use a residual plot.</p>

<p>What is a <b>residual</b>?</p>

<p>The difference between the observed value (y) and the predicted value (Yhat) is called the residual (e). When we look at a regression plot, the residual is the distance from the data point to the fitted regression line.</p>

<p>So what is a <b>residual plot</b>?</p>

<p>A residual plot is a graph that shows the residuals on the vertical y-axis and the independent variable on the horizontal x-axis.</p>

<p>What do we pay attention to when looking at a residual plot?</p>

<p>We look at the spread of the residuals:</p>

<p>- If the points in a residual plot are <b>randomly spread out around the x-axis</b>, then a <b>linear model is appropriate</b> for the data.

Why is that? Randomly spread out residuals means that the variance is constant, and thus the linear model is a good fit for this data.</p>
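
As a small numerical companion to the definition of a residual, the sketch below fits a straight line to made-up points and computes the residuals directly. The arrays `x_res` and `y_res` are assumptions for illustration only.

```python
import numpy as np

# Hypothetical data, for illustration only
x_res = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_res = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Fit a straight line; np.polyfit returns the highest-degree coefficient first
slope_res, intercept_res = np.polyfit(x_res, y_res, 1)

# Residual = observed value minus fitted value
residuals_demo = y_res - (intercept_res + slope_res * x_res)

print(np.round(residuals_demo, 3))
```

For a least-squares line with an intercept, the residuals always sum to zero, so what matters in a residual plot is their pattern around the x-axis rather than their average.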


In [ ]:
width, height = 10, 5
plt.figure(figsize=(width, height))
sns.residplot(x=time_series_df['Trade Balance 1 year ago'],y=time_series_df['Trade Balance'])
plt.show()

<i>What is this plot telling us?</i>

<p>We can see from this residual plot that the residuals are randomly spread around the x-axis, leading us to believe that a linear model is appropriate for this data.</p>


### Multiple Linear Regression


<p>How do we visualize a model for Multiple Linear Regression? This gets a bit more complicated because you can't visualize it with regression or residual plot.</p>

<p>One way to look at the fit of the model is by looking at the <b>distribution plot</b>. We can look at the distribution of the fitted values that result from the model and compare it to the distribution of the actual values.</p>


First, let's make a prediction:


In [ ]:
Y_hat = lm_ts.predict(Z)


In [ ]:
plt.figure(figsize=(width, height))

# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curves.
# The actual values come from time_series_df so that they match the rows used for Y_hat.
ax1 = sns.kdeplot(time_series_df['Trade Balance'], color="r", label="Actual Value")
sns.kdeplot(Y_hat, color="orange", label="Fitted Values", ax=ax1)

plt.title('Actual vs Fitted trade balance')
plt.xlabel('Trade Balance (millions of US dollars)')
plt.ylabel('Density')
plt.legend()

plt.show()
plt.close()

<p>We can see that the fitted values are reasonably close to the actual values since the two distributions overlap a bit. However, there is definitely some room for improvement.</p>


## 3. Polynomial Regression and Pipelines


<p><b>Polynomial regression</b> is a particular case of the general linear regression model or multiple linear regression models.</p> 
<p>We get non-linear relationships by squaring or setting higher-order terms of the predictor variables.</p>

<p>There are different orders of polynomial regression:</p>


<center><b>Quadratic - 2nd Order</b></center>
$$
Yhat = a + b_{\scriptsize{1}} X +b_{\scriptsize{2}} X^2 
$$

<center><b>Cubic - 3rd Order</b></center>
$$
Yhat = a + b_{\scriptsize{1}} X +b_{\scriptsize{2}} X^2 +b_{\scriptsize{3}} X^3
$$

<center><b>Higher-Order</b>:</center>
$$
Yhat = a + b_{\scriptsize{1}} X +b_{\scriptsize{2}} X^2 +b_{\scriptsize{3}} X^3 + \dots
$$


<p>We will use the following function to plot the data:</p>


In [ ]:
def PlotPolly(model, independent_variable, dependent_variable, Name):
    # Evaluate the fitted polynomial on an evenly spaced grid of predictor values
    x_new = np.linspace(-500, 35000, 1000)
    y_new = model(x_new)

    # Plot the observed points as dots and the fitted curve as a line
    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Trade Balance ~ Trade Balance year ago')
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    plt.xlabel(Name)
    plt.ylabel('Trade Balance Between India and USA')

    plt.show()
    plt.close()

Let's get the variables:


In [ ]:
x = time_series_df['Trade Balance 1 year ago']
y = time_series_df['Trade Balance']

Let's fit the polynomial using the function <b>polyfit</b>, then use the function <b>poly1d</b> to display the polynomial function.


In [ ]:
# Here we use a polynomial of the 3rd order (cubic) 
f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p)

Let's plot the function:


In [ ]:
plt.figure()
plt.title('Polynomial regression of 3rd order')
plt.ylabel('Trade Balance')
y.plot()
plt.plot(y.index, p(x))

Now let's call a Function <code>PlotPolly()</code>:


In [ ]:
PlotPolly(p, x, y, 'Trade Balance 1 year ago')

We can also make a distribution plot, to see how difference between actual values and fitted values depends on order of polynomial regression. 


In [ ]:
# distplot is deprecated in recent seaborn releases; kdeplot draws the same density curves
f = np.polyfit(x, y, 3)
p = np.poly1d(f)

plt.figure(figsize=(30, 5))
plt.subplot(131)
ax1 = sns.kdeplot(y, color="r", label="Actual Value")
sns.kdeplot(p(x), color="orange", label="Fitted Values", ax=ax1)

plt.title('Polynomial regression (of the 3rd order)')
plt.xlabel('Trade Balance (millions of US dollars)')
plt.ylabel('Density')
plt.legend()

f = np.polyfit(x, y, 8)
p = np.poly1d(f)

plt.subplot(132)
ax1 = sns.kdeplot(y, color="r", label="Actual Value")
sns.kdeplot(p(x), color="orange", label="Fitted Values", ax=ax1)

plt.title('Polynomial regression (of the 8th order)')
plt.xlabel('Trade Balance (millions of US dollars)')
plt.ylabel('Density')
plt.legend()

f = np.polyfit(x, y, 10)
p = np.poly1d(f)

plt.subplot(133)
ax1 = sns.kdeplot(y, color="r", label="Actual Value")
sns.kdeplot(p(x), color="orange", label="Fitted Values", ax=ax1)

plt.title('Polynomial regression (of the 10th order)')
plt.xlabel('Trade Balance (millions of US dollars)')
plt.ylabel('Density')
plt.legend()


We can see that the gap between the actual and fitted distributions depends on the order of the polynomial.
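
The dependence on the order can also be checked numerically. The sketch below uses synthetic data (the arrays `x_poly` and `y_poly` are made up for this example) and shows that the in-sample mean squared error does not increase as the order of the polynomial goes up, because each higher-order model contains the lower-order ones as special cases.

```python
import numpy as np

# Hypothetical noisy data around a quadratic curve, for illustration only
rng = np.random.default_rng(0)
x_poly = np.linspace(-1.0, 1.0, 20)
y_poly = 1.0 + 2.0 * x_poly - 3.0 * x_poly ** 2 + rng.normal(scale=0.2, size=x_poly.size)

# Fit polynomials of increasing order and record the in-sample MSE of each
mse_by_order = {}
for order in (1, 2, 5):
    fitted = np.poly1d(np.polyfit(x_poly, y_poly, order))
    mse_by_order[order] = np.mean((y_poly - fitted(x_poly)) ** 2)

print(mse_by_order)
```

Keep in mind that this is an in-sample measure: a lower error on the data used for fitting does not by itself mean better predictions on new data.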


In [ ]:
np.polyfit(x, y, 3)

<p>We can already see from plotting that this polynomial model performs better than the linear model. This is because the generated polynomial function  "hits" more of the data points.</p>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #3: </b>

<b>Create an 11th-order polynomial model with the variables x and y from above.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
f1 = np.polyfit(x, y, 11)
p1 = np.poly1d(f1)
print(p1)
PlotPolly(p1, x, y, 'Trade Balance 1 year ago')

<details><summary>Click here for the solution</summary>

```python
# Here we use a polynomial of the 11th order
f1 = np.polyfit(x, y, 11)
p1 = np.poly1d(f1)
print(p1)
PlotPolly(p1, x, y, 'Trade Balance 1 year ago')

```

</details>


<p>The analytical expression for Multivariate Polynomial function gets complicated. For example, the expression for a second-order (degree=2) polynomial with two variables is given by:</p>


$$
Yhat = a + b_{\scriptsize{1}} X_{\scriptsize{1}} + b_{\scriptsize{2}} X_{\scriptsize{2}} + b_{\scriptsize{3}} X_{\scriptsize{1}} X_{\scriptsize{2}} + b_{\scriptsize{4}}X_{\scriptsize{1}}^2 + b_{\scriptsize{5}} X_{\scriptsize{2}}^2
$$


We can perform a polynomial transform on multiple features. First, we import the module:


In [ ]:
from sklearn.preprocessing import PolynomialFeatures

We create a <b>PolynomialFeatures</b> object of degree 2:


In [ ]:
pr=PolynomialFeatures(degree=2)
pr

In [ ]:
Z_pr=pr.fit_transform(Z)

In the original data, there are 24 samples and 2 features.


In [ ]:
Z.shape

After the transformation, there are 24 samples and 6 features.


In [ ]:
Z_pr.shape
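
To see where the six features come from, the sketch below builds the degree-2 terms by hand for a made-up two-feature array (`Z_demo` is hypothetical, not the lab's Z). It produces the same six kinds of terms that appear in the second-order formula above: a constant, the two linear terms, the two squares and the cross term.

```python
import numpy as np

# Hypothetical two-feature input with three samples, for illustration only
Z_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])
x1_demo, x2_demo = Z_demo[:, 0], Z_demo[:, 1]

# Degree-2 expansion: constant, linear terms, squares and the cross term
Z_demo_pr = np.column_stack([
    np.ones_like(x1_demo),  # constant (bias) column
    x1_demo,
    x2_demo,
    x1_demo ** 2,
    x1_demo * x2_demo,
    x2_demo ** 2,
])

print(Z_demo.shape, Z_demo_pr.shape)
```

The number of samples stays the same; only the number of columns grows from 2 to 6.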

## Pipeline


<p>Data Pipelines simplify the steps of processing the data. We use the module <b>Pipeline</b> to create a pipeline. We also use <b>StandardScaler</b> as a step in our pipeline.</p>


In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

We create the pipeline by creating a list of tuples including the name of the model or estimator and its corresponding constructor.


In [ ]:
Input=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model',LinearRegression())]

We input the list as an argument to the pipeline constructor:


In [ ]:
pipe=Pipeline(Input)
pipe

First, we convert the data type Z to type float to avoid conversion warnings that may appear as a result of StandardScaler taking float inputs.

Then, we can normalize the data,  perform a transform and fit the model simultaneously.


In [ ]:
Z = Z.astype(float)
pipe.fit(Z,y)
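
The scaling step can be pictured with a few lines of NumPy. With its default settings, <b>StandardScaler</b> subtracts each column's mean and divides by its standard deviation; the sketch below does the same by hand on a made-up matrix (`F_demo` is hypothetical, not the lab's Z).

```python
import numpy as np

# Hypothetical feature matrix with columns on very different scales, for illustration only
F_demo = np.array([[100.0, 1.0],
                   [200.0, 3.0],
                   [300.0, 5.0]])

# Standardise each column: subtract its mean and divide by its population standard deviation
F_scaled = (F_demo - F_demo.mean(axis=0)) / F_demo.std(axis=0)

print(F_scaled.mean(axis=0), F_scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so the polynomial terms built in the next pipeline step are not dominated by the feature with the largest raw values.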

Similarly,  we can normalize the data, perform a transform and produce a prediction  simultaneously.


In [ ]:
ypipe=pipe.predict(Z)
ypipe[0:4]

In [ ]:
y.plot()
plt.plot(y.index, pipe.predict(Z))

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #4: </b>

<b>Create a pipeline that standardizes the data, then produce a prediction using a linear regression model using the features Z and target y.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
Input=[('scale',StandardScaler()),('model',LinearRegression())]

pipe=Pipeline(Input)

pipe.fit(Z,y)

ypipe=pipe.predict(Z)
ypipe[0:10]

<details><summary>Click here for the solution</summary>

```python
Input=[('scale',StandardScaler()),('model',LinearRegression())]

pipe=Pipeline(Input)

pipe.fit(Z,y)

ypipe=pipe.predict(Z)
ypipe[0:10]

```

</details>


## 4. Measures for In-Sample Evaluation


<p>When evaluating our models, not only do we want to visualize the results, but we also want a quantitative measure to determine how accurate the model is.</p>

<p>Two very important measures that are often used in Statistics to determine the accuracy of a model are:</p>
<ul>
    <li><b>R-squared ($R^2$)</b></li>
    <li><b>Mean Squared Error (MSE)</b></li>
</ul>

<b>R-squared</b>

<p>R squared, also known as the coefficient of determination, is a measure to indicate how close the data is to the fitted regression line.</p>

<p>The value of the R-squared is the percentage of variation of the response variable (y) that is explained by a linear model.</p>

<b>Mean Squared Error (MSE)</b>

<p>The Mean Squared Error measures the average of the squares of errors. That is, the difference between actual value (y) and the estimated value (ŷ).</p>
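
Both measures can be computed directly from their definitions. The sketch below does this with NumPy on made-up actual and predicted values (`y_true_demo` and `y_pred_demo` are assumptions for illustration, not model output from this lab).

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.0])

# MSE: the mean of the squared differences between actual and predicted values
mse_demo = np.mean((y_true_demo - y_pred_demo) ** 2)

# R^2: one minus the ratio of the residual sum of squares to the total sum of squares
ss_res_demo = np.sum((y_true_demo - y_pred_demo) ** 2)
ss_tot_demo = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1.0 - ss_res_demo / ss_tot_demo

print(mse_demo, r2_demo)
```

Note that the MSE is expressed in squared units of the target, while R-squared is unitless, which is why the two are usually read together.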


### Model 1: Simple Linear Regression


Let's calculate the R^2:


In [ ]:
lm.fit(X, Y)
# Find the R^2
r1 = lm.score(X, Y)
print('The R-square is: ', r1)

We can say that \~49.449% of the variation of the trade balance is explained by this simple linear model using "Trade Balance 1 year ago".


Let's calculate the MSE:


We can predict the output i.e., "yhat" using the predict method, where X is the input variable:


In [ ]:
Yhat=lm.predict(X)
print('The output of the first four predicted value is: ', Yhat[0:4])

Let's import the function <b>mean_squared_error</b> from the module <b>metrics</b>:


In [ ]:
from sklearn.metrics import mean_squared_error

We can compare the predicted results with the actual results:


In [ ]:
mse1 = mean_squared_error(time_series_df['Trade Balance'], Yhat)
print('The mean square error of trade balance and predicted value is: ', mse1)

### Model 2: Multiple Linear Regression for time series


Let's calculate the R^2:


In [ ]:
# fit the model 
lm.fit(Z, time_series_df['Trade Balance'])
r2 = lm.score(Z, time_series_df['Trade Balance'])
# Find the R^2
print('The R-square is: ', r2)

We can say that \~51.716 % of the variation of Trade Balance is explained by this multiple linear regression on the two lagged variables.


Let's calculate the MSE.


We produce a prediction:


In [ ]:
Y_predict_multifit_ts = lm.predict(Z)

In [ ]:
mse2 = mean_squared_error(time_series_df['Trade Balance'], Y_predict_multifit_ts)
print('The mean square error of trade balance and predicted value using multifit is: ', mse2)

### Model 3: Polynomial Fit


Let's calculate the R^2.


Let’s import the function <b>r2\_score</b> from the module <b>metrics</b> as we are using a different function.


In [ ]:
from sklearn.metrics import r2_score

We apply the function to get the value of R^2:


In [ ]:
r_squared = r2_score(y, p(x))
print('The R-square value is: ', r_squared)

We can say that \~93.12 % of the variation of trade balance is explained by this polynomial fit.


### MSE


We can also calculate the MSE:


In [ ]:
mse3 = mean_squared_error(time_series_df['Trade Balance'], p(x))
print('The mean square error of trade balance and predicted value using polynomial fit is: ', mse3)

### Model 4: Pipeline


Let's calculate the R^2:


In [ ]:
Input2=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model',LinearRegression())]
pipe2=Pipeline(Input2)
pipe2.fit(X, y)
r4 = pipe2.score(X, y)
# Find the R^2
print('The R-square is: ', r4)

We can say that \~70.43 % of the variation of trade balance is explained by this polynomial fit.


### MSE


Let's calculate the MSE.


We produce a prediction:


In [ ]:
pipe2.fit(Z,y)
ypipe=pipe2.predict(Z)

In [ ]:
mse4 = mean_squared_error(time_series_df['Trade Balance'], ypipe)
print('The mean square error of trade balance and predicted value using the pipeline is: ', mse4)

## 5. Prediction and Decision Making
### Prediction

<p>In the previous section, we trained the model using the method <b>fit</b>. Now we will use the method <b>predict</b> to produce a prediction. We will use <b>pyplot</b> for plotting and some functions from numpy, both of which were imported during setup.</p>


Create a new input:


In [ ]:
new_input=np.arange(0, 100, 1).reshape(-1, 1)

Fit the model:


In [ ]:
lm_ts.fit(X, Y)
lm_ts

Produce a prediction:


In [ ]:
yhat=lm_ts.predict(new_input)
yhat[0:5]

We can plot the data:


In [ ]:
plt.plot(new_input, yhat)
plt.show()

### Decision Making: Determining a Good Model Fit


<p>Now that we have visualized the different models, and generated the R-squared and MSE values for the fits, how do we determine a good model fit?
<ul>
    <li><i>What is a good R-squared value?</i></li>
</ul>
</p>

<p>When comparing models, <b>the model with the higher R-squared value is a better fit</b> for the data.
<ul>
    <li><i>What is a good MSE?</i></li>
</ul>
</p>

<p>When comparing models, <b>the model with the smallest MSE value is a better fit</b> for the data.</p>

<h4>Let's take a look at the values for the different models.</h4>
<p>Simple Linear Regression: Using Trade Balance 1 year ago as a Predictor Variable of Trade Balance.
<ul>
    <li>R-squared: 0.4944934231882335</li>
    <li>MSE: 3.2255348 x 10^7</li>
</ul>
</p>

<p>Multiple Linear Regression: Using Trade Balance 1 year ago, and Trade Balance 2 years ago as Predictor Variables of Trade Balance.
<ul>
    <li>R-squared: 0.5171551483341978</li>
    <li>MSE: 3.080935 x 10^7</li>
</ul>
</p>

<p>Polynomial Fit: Using Trade Balance 1 year ago as a Predictor Variable of Trade Balance.
<ul>
    <li>R-squared: 0.9311670358839029</li>
    <li>MSE: 4.392092 x 10^6</li>
</ul>
</p>

<p>Pipeline: Using Trade Balance 1 year ago as a Predictor Variable of Trade Balance (the R-squared comes from the one-feature fit; the MSE is computed after the pipeline is refit on both lagged features).
<ul>
    <li>R-squared: 0.7043046020634133</li>
    <li>MSE: 1.85079296 x 10^7</li>
</ul>
</p>


We can visualize our results:


In [ ]:
# Simple linear regression: one lagged predictor
lm_results = LinearRegression()

X = time_series_df[['Trade Balance 1 year ago']]
Y = time_series_df['Trade Balance']
lm_results.fit(X, Y)

plt.figure(figsize=(30, 5))

plt.subplot(141)
plt.title('Simple Linear Model (SLR)')
Y.plot()
plt.plot(Y.index, lm_results.predict(X))

# Multiple linear regression: two lagged predictors
Z = time_series_df[['Trade Balance 1 year ago', 'Trade Balance 2 years ago']]
lm_results.fit(Z, Y)

plt.subplot(142)
plt.title('Multiple Linear Regression Model (MLR)')
Y.plot()
plt.plot(Y.index, lm_results.predict(Z))

# Polynomial fit of order 15; x and y were defined in the polynomial regression section
plt.subplot(143)
f = np.polyfit(x, y, 15)
p = np.poly1d(f)
plt.title('Polynomial regression of 15th order')
plt.ylabel('Trade Balance')
y.plot()
plt.plot(y.index, p(x))

# Pipeline; Input is the list of steps defined in the pipeline section
pipe = Pipeline(Input)
Z = Z.astype(float)
pipe.fit(Z, y)

plt.subplot(144)
plt.title('Pipeline')
y.plot()
plt.plot(y.index, pipe.predict(Z))

# Display the plots
plt.show()

In [ ]:
measures_data = [[r1, mse1], [r2, mse2], [r_squared, mse3], [r4, mse4]]
measures = pd.DataFrame(data=measures_data, columns=['R-squared', 'MSE'],
                        index=['Simple Linear Model(SLR)', 'Multiple Linear Model(MLR)', 'Polynomial Regression Model', 'Pipeline Model'])
measures

### Simple Linear Regression Model (SLR) vs Multiple Linear Regression Model (MLR)


<p>Usually, the more variables you have, the better your model is at predicting, but this is not always true. Sometimes you may not have enough data, you may run into numerical problems, or many of the variables may not be useful and even act as noise. As a result, you should always check the MSE and R^2.</p>
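
The point about noise variables can be illustrated with a short sketch. It uses synthetic data, not the lab's dataset, and the helper `adjusted_r2_demo` is a hypothetical function written here for illustration. For ordinary least squares with an intercept, adding a predictor never lowers the R-squared measured on the training data, even when the predictor is pure noise, which is why a small increase alone is weak evidence.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): y depends on x1 only.
rng = np.random.default_rng(0)
n_demo = 40
x1_demo = rng.normal(size=n_demo)
y_demo = 3.0 * x1_demo + rng.normal(scale=1.0, size=n_demo)
noise_demo = rng.normal(size=n_demo)  # unrelated to y by construction

X_one = x1_demo.reshape(-1, 1)
X_two = np.column_stack([x1_demo, noise_demo])

# score() returns R-squared on the data passed in (here, the training data).
r2_one = LinearRegression().fit(X_one, y_demo).score(X_one, y_demo)
r2_two = LinearRegression().fit(X_two, y_demo).score(X_two, y_demo)

def adjusted_r2_demo(r2, n_samples, n_features):
    """Adjusted R-squared, which penalizes each extra predictor."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

print('R-squared, 1 feature: ', r2_one)
print('R-squared, 2 features:', r2_two)
print('Adjusted, 1 feature:  ', adjusted_r2_demo(r2_one, n_demo, 1))
print('Adjusted, 2 features: ', adjusted_r2_demo(r2_two, n_demo, 2))
```

Adjusted R-squared is one common way to account for the number of predictors; evaluating on data held out from training is another.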

<p>In order to compare the results of the MLR vs SLR models, we look at a combination of both the R-squared and MSE to make the best conclusion about the fit of the model.
<ul>
    <li><b>MSE</b>: The MSE of the SLR is 3.2255348 x 10^7, while the MLR has an MSE of 3.080935 x 10^7. The MSE of the MLR is slightly smaller.</li>
    <li><b>R-squared</b>: The R-squared for the SLR (~0.4945) is slightly smaller than the R-squared for the MLR (~0.5172), so the difference between the two models is modest.</li>
</ul>
</p>

Together, the R-squared and the MSE show that the MLR is a slightly better fit than the SLR in this case.
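
When two models are scored on the same target values, the two measures are tied together by the identity MSE = (1 - R-squared) x Var(y), where Var(y) is the population variance of the target. A higher R-squared therefore always comes with a lower MSE on the same data. The sketch below checks the identity on synthetic data; all `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data (assumption), used only to check the identity numerically.
rng = np.random.default_rng(1)
x_demo = rng.normal(size=(50, 1))
y_demo = 2.0 * x_demo.ravel() + rng.normal(size=50)

pred_demo = LinearRegression().fit(x_demo, y_demo).predict(x_demo)

mse_demo = mean_squared_error(y_demo, pred_demo)
r2_demo = r2_score(y_demo, pred_demo)

# np.var uses the population variance (ddof=0), which matches the identity.
implied_mse_demo = (1.0 - r2_demo) * np.var(y_demo)
print(mse_demo, implied_mse_demo)
```

The identity holds only when both measures are computed on the same set of target values, so it is also a quick consistency check on a table of results.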


### Simple Linear Model (SLR) vs. Polynomial Fit


<ul>
    <li><b>MSE</b>: We can see that Polynomial Fit brought down the MSE, since this MSE is smaller than the one from the SLR.</li> 
    <li><b>R-squared</b>: The R-squared for the Polynomial Fit is larger than the R-squared for the SLR, so the Polynomial Fit also brought up the R-squared quite a bit.</li>
</ul>
<p>Since the Polynomial Fit resulted in a lower MSE and a higher R-squared, we can conclude that this was a better fit model than the simple linear regression for predicting "Trade Balance" with "Trade Balance 1 year ago" as a predictor variable.</p>


### Simple Linear Model (SLR) vs. Pipeline


<ul>
    <li><b>MSE</b>: The MSE for the SLR is much larger than the MSE for the Pipeline.</li>
    <li><b>R-squared</b>: The R-squared for the Pipeline is also larger than for the SLR.</li>
</ul>
That means that the Pipeline was a better fit than the simple linear regression for predicting "Trade Balance".


### Multiple Linear Regression (MLR) vs. Polynomial Fit


<ul>
    <li><b>MSE</b>: The MSE for the MLR is much larger than the MSE for the Polynomial Fit.</li>
    <li><b>R-squared</b>: The R-squared for the Polynomial Fit is also much larger than for the MLR.</li>
</ul>
That means that the Polynomial Regression was a better fit than the multiple linear regression for predicting "Trade Balance".


### Multiple Linear Regression (MLR) vs. Pipeline


<ul>
    <li><b>MSE</b>: The MSE for the MLR is much larger than the MSE for the Pipeline.</li>
    <li><b>R-squared</b>: The R-squared for the Pipeline is also much larger than for the MLR.</li>
</ul>
That means that the Pipeline was a better fit than the multiple linear regression for predicting "Trade Balance".


### Polynomial Fit vs. Pipeline


<ul>
    <li><b>MSE</b>: The MSE for the Polynomial Fit is smaller than the MSE for the Pipeline.</li>
    <li><b>R-squared</b>: The R-squared for the Polynomial Fit is also larger than for the Pipeline.</li>
</ul>
That means that the Polynomial Regression was a better fit than the Pipeline for predicting "Trade Balance".


## Conclusion




<p>Comparing these four models on R-squared and MSE, we conclude that <b>the Polynomial Regression is the best model</b> for predicting trade balance from our dataset.</p>
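
The measures above were computed on the same data used for fitting. One possible follow-up check, sketched here on synthetic data and not part of the original lab, is to fit on the earlier observations and score on the most recent ones. The data, the split point, and the helper `train_test_mse_demo` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic, time-ordered stand-in for the lab data (assumption).
rng = np.random.default_rng(42)
x_all = np.linspace(-1.0, 1.0, 60).reshape(-1, 1)
y_all = 1.5 * x_all.ravel() + rng.normal(scale=0.3, size=60)

# Chronological split: fit on the first 45 points, evaluate on the last 15.
split = 45
x_train, x_test = x_all[:split], x_all[split:]
y_train, y_test = y_all[:split], y_all[split:]

def train_test_mse_demo(degree):
    """Fit a polynomial pipeline of the given degree; return (train MSE, test MSE)."""
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('scale', StandardScaler()),
        ('lr', LinearRegression()),
    ])
    model.fit(x_train, y_train)
    return (mean_squared_error(y_train, model.predict(x_train)),
            mean_squared_error(y_test, model.predict(x_test)))

results = {degree: train_test_mse_demo(degree) for degree in (1, 15)}
for degree, (train_mse, test_mse) in results.items():
    print(f'degree {degree}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}')
```

Comparing the training and test MSE for each degree shows whether a close fit on the training period carries over to later observations.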


### Thank you for completing this lab!

## Authors

<a href='https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0SO3EN3037-2023-01-01'>Yaroslav Vyklyuk</a>

<a href='https://author.skills.network/instructors/olga_kavun?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0SO3EN3037-2023-01-01'>Olga Kavun</a>

<a href='https://author.skills.network/instructors/petro_slobodian?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0SO3EN3037-2023-01-01'>Petro Slobodian</a>


## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description      |
| ----------------- | ------- | ---------- | ----------------------- |
| 2023-04-20        | 1.0     | Slobodian  | Created laboratory work |



<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
