If the data shows a curvy trend, then linear regression will not produce very accurate results when compared to a non-linear regression since linear regression presumes that the data is linear.
Let's learn about non linear regressions and apply an example in python. In this notebook, we fit a non-linear model to the datapoints corrensponding to China's GDP from 1960 to 2014.


<h2 id="importing_libraries">Importing required libraries</h2>


In [17]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go


Although linear regression can do a great job at modeling some datasets, it cannot be used for all datasets. First recall how linear regression, models a dataset. It models the linear relationship between a dependent variable y and the independent variables x. It has a simple equation, of degree 1, for example y = $2x$ + 3.


In [9]:
x = np.arange(-5.0, 5.0, 0.1)

# You can adjust the slope and intercept to verify the changes in the graph
y = 2 * (x) + 3
y_noise = 2 * np.random.normal(size=x.size)
y_data = y + y_noise

# scatter plot
fig = px.scatter(x=x, y=y_data)
fig.update_layout(
    title="Non-Linear Regression",
    xaxis_title="Dependent Variable",
    yaxis_title="Independent Variable",
)
fig.add_trace(go.Scatter(x=x, y=y, mode="lines", name="y = 2x + 3"))
fig.show()


Non-linear regression is a method to model the non-linear relationship between the independent variables $x$ and the dependent variable $y$. Essentially any relationship that is not linear can be termed as non-linear, and is usually represented by the polynomial of $k$ degrees (maximum power of $x$).  For example:

$$ \ y = a x^3 + b x^2 + c x + d \ $$

Non-linear functions can have elements like exponentials, logarithms, fractions, and so on. For example: $$ y = \log(x)$$

We can have a function that's even more complicated such as :
$$ y = \log(a x^3 + b x^2 + c x + d)$$


Let's take a look at a cubic function's graph.


In [10]:
x = np.arange(-5.0, 5.0, 0.1)

# You can adjust the slope and intercept to verify the changes in the graph
y = 1 * (x ** 3) + 1 * (x ** 2) + 1 * x + 3
y_noise = 20 * np.random.normal(size=x.size)
y_data = y + y_noise

fig = px.scatter(x=x, y=y_data)
fig.update_layout(
    title="Non-Linear Regression",
    xaxis_title="Dependent Variable",
    yaxis_title="Independent Variable",
)
fig.add_trace(go.Scatter(x=x, y=y, mode="lines", name="y = x^3 + x^2 + x + 3"))
fig.show()


As you can see, this function has $x^3$ and $x^2$ as independent variables. Also, the graphic of this function is not a straight line over the 2D plane. So this is a non-linear function.


Some other types of non-linear functions are:


### Quadratic


$$ Y = X^2 $$


In [11]:
x = np.arange(-5.0, 5.0, 0.1)

# You can adjust the slope and intercept to verify the changes in the graph
y = np.power(x, 2)
y_noise = 2 * np.random.normal(size=x.size)
y_data = y + y_noise

fig = px.scatter(x=x, y=y_data)
fig.update_layout(
    title="Non-Linear Regression",
    xaxis_title="Dependent Variable",
    yaxis_title="Independent Variable",
)
fig.add_trace(go.Scatter(x=x, y=y, mode="lines", name="y = x^2"))
fig.show()


### Exponential


An exponential function with base c is defined by $$ Y = a + b c^X$$ where b ≠0, c > 0 , c ≠1, and x is any real number. The base, c, is constant and the exponent, x, is a variable.


In [13]:
x = np.arange(-5.0, 5.0, 0.1)

# You can adjust the slope and intercept to verify the changes in the graph
y = np.exp(x)

fig = px.scatter(x=x, y=y)
fig.update_layout(
    title="Non-Linear Regression",
    xaxis_title="Dependent Variable",
    yaxis_title="Independent Variable",
)
fig.add_trace(go.Scatter(x=x, y=y, mode="lines", name="y = e^x"))
fig.show()



### Logarithmic

The response $y$ is a results of applying the logarithmic map from the input $x$ to the output $y$. It is one of the simplest form of **log()**: i.e. $$ y = \log(x)$$

Please consider that instead of $x$, we can use $X$, which can be a polynomial representation of the $x$ values. In general form it would be written as\
\begin{equation}
y = \log(X)
\end{equation}


In [14]:
x = np.arange(-5.0, 5.0, 0.1)
y = np.log(x)

fig = px.scatter(x=x, y=y)
fig.update_layout(
    title="Non-Linear Regression",
    xaxis_title="Dependent Variable",
    yaxis_title="Independent Variable",
)
fig.add_trace(go.Scatter(x=x, y=y, mode="lines", name="y = log(x)"))
fig.show()



invalid value encountered in log



### Sigmoidal/Logistic


$$ Y = a + \frac{b}{1+ c^{(X-d)}}$$


In [15]:
x = np.arange(-5.0, 5.0, 0.1)
y = 1 - 4 / (1 + np.power(3, x - 2))

fig = px.scatter(x=x, y=y)
fig.update_layout(
    title="Non-Linear Regression",
    xaxis_title="Dependent Variable",
    yaxis_title="Independent Variable",
)
fig.add_trace(go.Scatter(x=x, y=y, mode="lines", name="y = 1 - 4 / (1 + 3^(x - 2))"))
fig.show()


<a id="ref2"></a>

# Non-Linear Regression example


For an example, we're going to try and fit a non-linear model to the datapoints corresponding to China's GDP from 1960 to 2014. We download a dataset with two columns, the first, a year between 1960 and 2014, the second, China's corresponding annual gross domestic income in US dollars for that year.


In [19]:
df = pd.read_csv("../../data/raw/china_gdp.csv")
df.head(10)


Unnamed: 0,Year,Value
0,1960,59184120000.0
1,1961,49557050000.0
2,1962,46685180000.0
3,1963,50097300000.0
4,1964,59062250000.0
5,1965,69709150000.0
6,1966,75879430000.0
7,1967,72057030000.0
8,1968,69993500000.0
9,1969,78718820000.0


### Plotting the Dataset

This is what the datapoints look like. It kind of looks like an either logistic or exponential function. The growth starts off slow, then from 2005 on forward, the growth is very significant. And finally, it decelerate slightly in the 2010s.


In [20]:
fig = px.scatter(x=df["Year"], y=df["Value"])
fig.update_layout(
    title="China GDP",
    xaxis_title="Year",
    yaxis_title="GDP",
)
fig.show()


### Choosing a model

From an initial look at the plot, we determine that the logistic function could be a good approximation,
since it has the property of starting with a slow growth, increasing growth in the middle, and then decreasing again at the end; as illustrated below:


In [21]:
x = np.arange(-5.0, 5.0, 0.1)
y = 1.0 / (1.0 + np.exp(-x))

fig = px.scatter(x=x, y=y)
fig.update_layout(
    title="Sigmoid Function",
    xaxis_title="Dependent Variable",
    yaxis_title="Independent Variable",
)
fig.add_trace(go.Scatter(x=x, y=y, mode="lines", name="y = 1 / (1 + e^-x)"))
fig.show()


The formula for the logistic function is the following:

$$ \hat{Y} = \frac1{1+e^{\beta\_1(X-\beta\_2)}}$$

$\beta\_1$: Controls the curve's steepness,

$\beta\_2$: Slides the curve on the x-axis.


### Building The Model

Now, let's build our regression model and initialize its parameters.


In [22]:
def sigmoid(x, Beta_1, Beta_2):
    y = 1 / (1 + np.exp(-Beta_1 * (x - Beta_2)))
    return y


Lets look at a sample sigmoid line that might fit with the data:


In [23]:
beta_1 = 0.10
beta_2 = 1990.0

# logistic function
x = df["Year"]
y = df["Value"]
y_pred = sigmoid(x, beta_1, beta_2)

# plot initial prediction against datapoints
fig = px.scatter(x=x, y=y)
fig.update_layout(
    title="Logistic Function",
    xaxis_title="Year",
    yaxis_title="GDP",
)
fig.add_trace(go.Scatter(x=x, y=y_pred * 15000000000000.0, mode="lines", name="y = 1 / (1 + e^-x)"))
fig.show()


Our task here is to find the best parameters for our model. Lets first normalize our x and y:


In [24]:
# Lets normalize our data
x_data = x / max(x)
y_data = y / max(y)


#### How we find the best parameters for our fit line?

we can use **curve_fit** which uses non-linear least squares to fit our sigmoid function, to data. Optimal values for the parameters so that the sum of the squared residuals of sigmoid(xdata, \*popt) - ydata is minimized.

popt are our optimized parameters.


In [25]:
from scipy.optimize import curve_fit

popt, pcov = curve_fit(sigmoid, x_data, y_data)
# print the final parameters
print(" beta_1 = %f, beta_2 = %f" % (popt[0], popt[1]))


 beta_1 = 690.451700, beta_2 = 0.997207


Now we plot our resulting regression model.


In [26]:
x = np.linspace(1960, 2015, 55)
x = x / max(x)

fig = px.scatter(x=x_data, y=y_data)
fig.update_layout(
    title="Data",
    xaxis_title="Year",
    yaxis_title="GDP",
)
fig.add_trace(go.Scatter(x=x, y=sigmoid(x, *popt), mode="lines", name="y = 1 / (1 + e^-x)"))
fig.show()


Accuracy of our model


In [27]:
# split data into train/test
msk = np.random.rand(len(df)) < 0.8
x_train = x_data[msk]
x_test = x_data[~msk]
y_train = y_data[msk]
y_test = y_data[~msk]

# build the model using train set
popt, pcov = curve_fit(sigmoid, x_train, y_train)

# predict using test set
y_hat = sigmoid(x_test, *popt)

# evaluation
print("Mean absolute error: %.2f" % np.mean(np.absolute(y_hat - y_test)))
print("Residual sum of squares (MSE): %.2f" % np.mean((y_hat - y_test) ** 2))
from sklearn.metrics import r2_score

print("R2-score: %.2f" % r2_score(y_hat, y_test))


Mean absolute error: 0.22
Residual sum of squares (MSE): 0.19
R2-score: -99531148111881984193100087944019968.00



Covariance of the parameters could not be estimated

