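Variable transformation reshapes a variable (here, by taking the logarithm of GDP) so that it better suits the model. This notebook fits a linear regression to GDP before and after the transformation and compares the results.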

In [None]:
import sys #access to system parameters https://docs.python.org/3/library/sys.html
print("Python version: {}". format(sys.version))
import numpy as np # linear algebra
print("NumPy version: {}". format(np.__version__))
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
print("pandas version: {}". format(pd.__version__))
import matplotlib # collection of functions for scientific and publication-ready visualization
print("matplotlib version: {}". format(matplotlib.__version__))
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sklearn
from sklearn.linear_model import LinearRegression
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Skip the first 3 rows, which are non-data header lines in this file
gdp_data = pd.read_csv('../input/gdp2017/GDP.csv', skiprows=3)
# Drop the trailing column that pandas names 'Unnamed: 62'
gdp_data.drop(['Unnamed: 62'], axis=1, inplace=True)
gdp_data.shape # (264, 62)

In [None]:
gdp_data.head(3)

In [None]:
id_vars=['Country Name','Country Code', 'Indicator Name', 'Indicator Code']
df = pd.melt(frame=gdp_data, id_vars=id_vars, var_name='year', value_name='GDP') # wide (one column per year) to long (one row per country-year)
# df.describe()
df['year'] = df['year'].astype(int) # convert from object to int or float
df.info() # confirm data types
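
The reshape in the cell above can be hard to picture on the full file. Here is a minimal sketch on a made-up two-country frame (the names `wide` and `long_df` and all values are hypothetical, not taken from the GDP file) showing what `pd.melt` does to a wide table:

```python
import pandas as pd

# Hypothetical frame in the same wide layout as the source file:
# one row per country, one column per year. Values are made up.
wide = pd.DataFrame({
    'Country Name': ['A', 'B'],
    '1960': [1.0, 2.0],
    '1961': [1.5, None],
})

# melt turns every year column into its own row (wide -> long).
long_df = pd.melt(wide, id_vars=['Country Name'], var_name='year', value_name='GDP')
long_df['year'] = long_df['year'].astype(int)  # year labels arrive as strings

print(long_df)
print(long_df.shape)
```

Each country-year pair becomes one row, which is the shape the per-country filters and regressions later in the notebook expect.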

In [None]:
df.shape

In [None]:
df = df.dropna() # drop rows where GDP is NaN
df.shape

In [None]:
df.drop(['Country Code', 'Indicator Name', 'Indicator Code'], axis=1, inplace=True)
df.rename(columns={'Country Name':'Country'}, inplace=True)
df.head(50)

In [None]:
values = ['Arab World',
          'Caribbean small states',
          'Central Europe and the Baltics',
          'Early-demographic dividend',
          'East Asia & Pacific (excluding high income)',
          'East Asia & Pacific',
          'East Asia & Pacific (IDA & IBRD countries)',
          'Europe & Central Asia',
          'Europe & Central Asia (IDA & IBRD countries)',
          'Europe & Central Asia (excluding high income)',
          'Euro area',
          'European Union',
          'Fragile and conflict affected situations',
          'Heavily indebted poor countries (HIPC)',
          'High income',
          'IBRD only',
          'IDA & IBRD total',
          'IDA total',
          'IDA blend',
          'IDA only',
          'Late-demographic dividend',
          'Latin America and Caribbean',
          'Latin America & Caribbean',
          'Latin America & Caribbean (excluding high income)',
          'Latin America & the Caribbean (IDA & IBRD countries)',
          'Lower middle income',
          'Low & middle income',
          'Middle income',
          'Middle East & North Africa (IDA & IBRD countries)',
          'Middle East & North Africa',
          'Middle East & North Africa (excluding high income)',
          'North America',
          'OECD members',
          'Pacific island small states',
          'Post-demographic dividend',
          'Pre-demographic dividend',
          'South Asia (IDA & IBRD)',
          'Sub-Saharan Africa (IDA & IBRD countries)',
          'Sub-Saharan Africa (excluding high income)',
          'Sub-Saharan Africa',
          'Small states',
          'Upper middle income',
          'World']
# Drop the aggregate rows (regions, income groups, World) in one vectorized pass
df = df[~df['Country'].isin(values)].copy()

# df[df['column name'].map(lambda x: str(x)!=".")]

# df.where(m, -df)
df.head(50)

In [None]:
df.tail(50)

In [None]:
df['Country'].value_counts() # Pre-demographic dividend?

In [None]:
df.shape

In [None]:
filename = 'GDP_tidy.csv'
df.to_csv(filename, index=False)
print("{} saved".format(filename))

In [None]:
# Keep only the rows for one country
country = 'United States'
dfus = df[df['Country'] == country].copy() # filter by country; copy so later edits do not touch df
# df.shape
dfus.tail(5)

In [None]:
print('UNITED STATES')
x = dfus[['year']].values
y = dfus.GDP.values
regr = sklearn.linear_model.LinearRegression()
model = regr.fit(x,y) # SciKit-Learn
score = regr.score(x, y)
score = round(score*100,2)
title = f"USA Linear Regression Score = {score}"
plt.title(title)
print('score = {}'.format(score))
coef = regr.coef_ # slope: average change in GDP per year
print('coef = {}'.format(coef))
intercept = regr.intercept_
print('intercept = {}'.format(intercept))
y_pred = model.predict(x)
print('SciKit-Learn')
plt.scatter(x, y, color='gray') # sklearn
plt.plot(x, y_pred, color='orange') # model
plt.ylim(0) # start at zero
plt.show()

Conclusion: the score (the coefficient of determination, R², reported as a percentage) is 93%. That is good, but the next steps try to improve it with a variable transformation.
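
For reference, the score reported by scikit-learn's `LinearRegression.score` is the coefficient of determination, R². The sketch below recomputes it by hand on made-up data; the `_toy` names are hypothetical, and `np.polyfit` is swapped in for the scikit-learn fit only to keep the example short:

```python
import numpy as np

# Made-up, roughly linear series with noise (not the GDP data).
rng = np.random.default_rng(0)
x_toy = np.arange(1960, 2018, dtype=float)
y_toy = 3.0 * (x_toy - 1960) + rng.normal(scale=5.0, size=x_toy.size)

# Least-squares line; degree 1 returns the slope first, then the intercept.
slope_toy, intercept_toy = np.polyfit(x_toy, y_toy, 1)
y_hat = slope_toy * x_toy + intercept_toy

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_toy - y_hat) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(round(r2 * 100, 2))
```

A value near 100% means the line explains most of the variance in the target; it says nothing about how well the line extrapolates beyond the observed years.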

In [None]:
import matplotlib.pyplot as plt
from scipy import stats
X = dfus.year
y = dfus.GDP
slope, intercept, r, p, std_err = stats.linregress(X, y) # scipy
def modelPrediction(x):
  return slope * x + intercept
# Model Prediction GDP US (2018) = $16,904,994,673,321.25 USD
model = list(map(modelPrediction, X)) # scipy
x_pred = 2018
y_pred = modelPrediction(x_pred)
title='GDP US (2018) = ${} USD'.format(y_pred)
plt.title(title)
print('SciPy')
plt.scatter(X, y, label='GDP US data (1960-2017)') # Scatter Plot
plt.plot(X, model, color='red', label='Model Prediction using Linear Regression')
plt.ylim(bottom=0) # starts at zero
plt.legend() # labels are attached to each artist, so they cannot be swapped
plt.show()

In [None]:
gdpus_pred = y_pred # 2018
gdpus_pred = gdpus_pred / 1000000000000 # convert to trillions
gdpus_pred = round(gdpus_pred, 2) # round() returns a new value, so assign it
gdpus_pred

In [None]:
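gdpus = dfus.loc[dfus['year'] == 2017, 'GDP'].iloc[0] # 2017, selected by year rather than a hard-coded row label
gdpus = gdpus / 1000000000000 # convert to trillions
gdpus = round(gdpus, 2) # round() returns a new value, so assign it
gdpus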

In [None]:
# Keep only the rows for one country
country = 'China'
dfch = df[df['Country'] == country].copy() # filter by country; copy so later edits do not touch df
# df.shape
dfch.tail(5)

In [None]:
print('CHINA')
x = dfch[['year']].values
y = dfch.GDP.values
regr = sklearn.linear_model.LinearRegression()
model = regr.fit(x,y) # SciKit-Learn
score = regr.score(x, y)
score = round(score*100,2)
title = f"CHINA Linear Regression Score = {score}"
plt.title(title)
print('score = {}'.format(score))
coef = regr.coef_ # slope: average change in GDP per year
print('coef = {}'.format(coef))
intercept = regr.intercept_
print('intercept = {}'.format(intercept))
y_pred = model.predict(x)
print('SciKit-Learn')
plt.scatter(x, y, color='gray') # sklearn
plt.plot(x, y_pred, color='orange') # model
plt.ylim(0) # start at zero
plt.show()

In [None]:
from scipy import stats
X = dfch.year
y = dfch.GDP
slope, intercept, r, p, std_err = stats.linregress(X, y) # scipy
def modelPrediction(x):
  return slope * x + intercept
# Model Prediction GDP CHINA (2018) = $6,347,500,525,036.9375
model = list(map(modelPrediction, X)) # scipy
x_pred = 2018
y_pred = modelPrediction(x_pred)
title='GDP CHINA (2018) = ${}'.format(y_pred)
plt.title(title)
print('SciPy')
plt.scatter(X, y, color='red', label='GDP CHINA data (1960-2017)') # Scatter Plot
plt.plot(X, model, color='orange', label='Model Prediction using Linear Regression')
plt.ylim(bottom=0) # starts at zero
plt.legend() # labels are attached to each artist, so they cannot be swapped
plt.show()

In [None]:
gdpch_pred = y_pred # 2018
gdpch_pred = gdpch_pred / 1000000000000 # convert to trillions
gdpch_pred = round(gdpch_pred, 2) # round() returns a new value, so assign it
gdpch_pred

In [None]:
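gdpch = dfch.loc[dfch['year'] == 2017, 'GDP'].iloc[0] # 2017, selected by year rather than a hard-coded row label
gdpch = gdpch / 1000000000000 # convert to trillions
gdpch = round(gdpch, 2) # round() returns a new value, so assign it
gdpch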

In [None]:
# Keep only the rows for one country
country = 'Mexico'
dfmx = df[df['Country'] == country].copy() # filter by country; copy so later edits do not touch df
# df.shape
dfmx.tail(5)

In [None]:
import sklearn
from sklearn.linear_model import LinearRegression
print('MEXICO')
x = dfmx[['year']].values
y = dfmx.GDP.values
regr = sklearn.linear_model.LinearRegression()
model = regr.fit(x,y) # SciKit-Learn
score = regr.score(x, y)
score = round(score*100,2)
title = f"MEXICO Linear Regression Score = {score}"
plt.title(title)
print('score = {}'.format(score))
coef = regr.coef_ # slope: average change in GDP per year
print('coef = {}'.format(coef))
intercept = regr.intercept_
print('intercept = {}'.format(intercept))
y_pred = model.predict(x)
print('SciKit-Learn')
plt.scatter(x, y, color='gray') # sklearn
plt.plot(x, y_pred, color='orange') # model
plt.ylim(0) # start at zero
plt.show()

In [None]:
from scipy import stats
X = dfmx.year
y = dfmx.GDP
slope, intercept, r, p, std_err = stats.linregress(X, y) # scipy
def modelPrediction(x):
  return slope * x + intercept
# Model Prediction GDP Mexico (2018) = $1,131,888,421,568.4062 USD
model = list(map(modelPrediction, X)) # scipy
x_pred = 2018
y_pred = modelPrediction(x_pred)
# The US cell labels values from the same dataset in USD, so USD is assumed here too
title='GDP Mexico (2018) = ${} USD'.format(y_pred)
plt.title(title)
print('SciPy')
plt.scatter(X, y, color='green', label='GDP Mexico data (1960-2017)') # Scatter Plot
plt.plot(X, model, color='red', label='Model Prediction using Linear Regression')
plt.ylim(bottom=0) # starts at zero
plt.legend() # labels are attached to each artist, so they cannot be swapped
plt.show()

In [None]:
gdpmx_pred = y_pred # 2018
gdpmx_pred = gdpmx_pred / 1000000000000 # convert to trillions
gdpmx_pred = round(gdpmx_pred, 2) # round() returns a new value, so assign it
gdpmx_pred

In [None]:
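gdpmx = dfmx.loc[dfmx['year'] == 2017, 'GDP'].iloc[0] # 2017, selected by year rather than a hard-coded row label
gdpmx = gdpmx / 1000000000000 # convert to trillions
gdpmx = round(gdpmx, 2) # round() returns a new value, so assign it
gdpmx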
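
The three countries above go through the same filter, fit, and predict steps. One possible way to avoid the repetition is a small helper. This is only a sketch: `predict_gdp` and the `toy` frame are hypothetical, the values are made up, and `np.polyfit` is swapped in for `stats.linregress` to keep the example dependency-light (both fit an ordinary least-squares line).

```python
import numpy as np
import pandas as pd

def predict_gdp(frame, country, year=2018):
    """Fit GDP ~ year for one country; return the prediction in trillions."""
    sub = frame[frame['Country'] == country]
    slope, intercept = np.polyfit(sub['year'], sub['GDP'], 1)
    return (slope * year + intercept) / 1e12

# Made-up tidy frame with the same columns as df (Country, year, GDP).
toy = pd.DataFrame({
    'Country': ['A'] * 3 + ['B'] * 3,
    'year': [2015, 2016, 2017] * 2,
    'GDP': [1e12, 2e12, 3e12, 5e12, 5e12, 5e12],
})

preds = {c: round(float(predict_gdp(toy, c)), 2) for c in ['A', 'B']}
print(preds)
```

With the notebook's own `df`, a call such as `predict_gdp(df, 'Mexico')` would be the equivalent of the Mexico cells above, assuming `df` still holds untransformed GDP values.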

In [None]:
# Reset matplotlib settings to their defaults
plt.rcdefaults()
fig, ax = plt.subplots()
y = ('United States', 'China', 'Mexico')
y_pos = np.arange(len(y))
x = (gdpus, gdpch, gdpmx)
ax.barh(y_pos, x, align='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(y)
ax.invert_yaxis() # labels read top-to-bottom
ax.set_xlabel('GDP (trillions)')
ax.set_title('GDP per Country 2017')
for i, v in enumerate(x):
    ax.text(v + 1, i, str(v), color='black', va='center', fontweight='normal')
plt.show()

In [None]:
# Reset matplotlib settings to their defaults
plt.rcdefaults()
fig, ax = plt.subplots()
y = ('United States', 'China', 'Mexico')
y_pos = np.arange(len(y))
x = (gdpus_pred, gdpch_pred, gdpmx_pred)
ax.barh(y_pos, x, align='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(y)
ax.invert_yaxis() # labels read top-to-bottom
ax.set_xlabel('GDP (trillions)')
ax.set_title('Predicted GDP per Country 2018')
for i, v in enumerate(x):
    ax.text(v + 1, i, str(v), color='black', va='center', fontweight='normal')
plt.show()

## Transformation
Variable transformation can improve the fit of a linear model. Here we take the natural logarithm of GDP.
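
Why a logarithm? If a series grows by a roughly constant percentage each year, its logarithm grows by a roughly constant amount, which is exactly what a straight line models. The sketch below uses a made-up series (6% yearly growth starting at 100, not real GDP; all names are hypothetical) and also shows how `np.exp` brings a prediction back to the original scale:

```python
import numpy as np

years = np.arange(1960, 2018)
t = years - 1960  # years since the start of the series

# Made-up series: exactly 6% growth per year from a starting value of 100.
gdp_toy = 100.0 * 1.06 ** t

# On the log scale the series is a straight line:
# log(gdp) = log(100) + t * log(1.06)
log_gdp = np.log(gdp_toy)
slope_log, intercept_log = np.polyfit(t, log_gdp, 1)

# The slope on the log scale encodes the yearly growth rate.
growth_rate = np.exp(slope_log) - 1

# Predict 2018 on the log scale, then undo the log with np.exp.
pred_2018 = np.exp(intercept_log + slope_log * (2018 - 1960))

print(growth_rate, pred_2018)
```

The key point for the cells that follow: once the model is fit to `np.log(GDP)`, its predictions are logarithms and need `np.exp` before they can be read as currency amounts.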

In [None]:
print("Skewness: %f" % dfus['GDP'].skew())
print("Kurtosis: %f" % dfus['GDP'].kurt())
import seaborn as sns
f, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 4))
sns.histplot(dfus['GDP'], kde=True, ax=ax[0]) # distplot is deprecated in recent seaborn releases
sns.boxplot(x=dfus['GDP'], ax=ax[1])
from scipy import stats
stats.probplot(dfus['GDP'], plot=ax[2])
plt.show()

In [None]:
dfus['GDP'] = np.log(dfus['GDP'])

In [None]:
print("Skewness: %f" % dfus['GDP'].skew())
print("Kurtosis: %f" % dfus['GDP'].kurt())
import seaborn as sns
f, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 4))
sns.histplot(dfus['GDP'], kde=True, ax=ax[0]) # distplot is deprecated in recent seaborn releases
sns.boxplot(x=dfus['GDP'], ax=ax[1])
from scipy import stats
stats.probplot(dfus['GDP'], plot=ax[2])
plt.show()

In [None]:
print('UNITED STATES')
x = dfus[['year']].values
y = dfus.GDP.values
import sklearn
from sklearn.linear_model import LinearRegression
regr = sklearn.linear_model.LinearRegression()
model = regr.fit(x,y) # SciKit-Learn
score = regr.score(x, y)
score = round(score*100,2)
title = f"USA Linear Regression Score = {score}"
plt.title(title)
print('score = {}'.format(score))
coef = regr.coef_ # slope: average change in log(GDP) per year
print('coef = {}'.format(coef))
intercept = regr.intercept_
print('intercept = {}'.format(intercept))
y_pred = model.predict(x)
print('SciKit-Learn')
plt.scatter(x, y, color='gray') # sklearn
plt.plot(x, y_pred, color='orange') # model
# plt.ylim(0) # start at zero
plt.show()

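Conclusion: the score rises from 93% to 97%, a clear improvement. Note that the new score is measured on log(GDP) rather than on GDP itself, so the two numbers are not on the same scale and the comparison is indicative rather than exact.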
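
One way to put both models on an equal footing is to back-transform the log-model predictions with `np.exp` and score them against the untransformed values. The sketch below does this on a made-up series (6% growth with multiplicative noise, not the GDP data); the helper `r2` and all variable names are hypothetical:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up series: 6% yearly growth with 5% multiplicative noise.
rng = np.random.default_rng(1)
t = np.arange(58, dtype=float)  # years since the start of the series
gdp_toy = 100.0 * 1.06 ** t * np.exp(rng.normal(scale=0.05, size=t.size))

# Model 1: straight line fit to the raw values.
a_raw, b_raw = np.polyfit(t, gdp_toy, 1)
r2_raw = r2(gdp_toy, a_raw * t + b_raw)

# Model 2: straight line fit to log(values).
a_log, b_log = np.polyfit(t, np.log(gdp_toy), 1)
log_pred = a_log * t + b_log
r2_log_scale = r2(np.log(gdp_toy), log_pred)  # scored on the log scale
r2_log_back = r2(gdp_toy, np.exp(log_pred))   # scored on the original scale

print(r2_raw, r2_log_scale, r2_log_back)
```

`r2_raw` and `r2_log_back` are both measured against the raw values, so they can be compared directly; `r2_log_scale` corresponds to the score printed by the cell above.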

In [None]:
import matplotlib.pyplot as plt
from scipy import stats
X = dfus.year
y = dfus.GDP
slope, intercept, r, p, std_err = stats.linregress(X, y) # scipy
def modelPrediction(x):
  return slope * x + intercept
# The model is now fit to log(GDP), so its predictions are on the log scale
model = list(map(modelPrediction, X)) # scipy
x_pred = 2018
y_pred = modelPrediction(x_pred) # log of the predicted GDP
# np.exp undoes the log transform so the printed value is a currency amount
print('Model Prediction GDP US (2018) = ${:,.2f} USD'.format(np.exp(y_pred)))
print('SciPy')
plt.scatter(X, y, label='log GDP US data (1960-2017)') # Scatter Plot
plt.plot(X, model, color='red', label='Model Prediction using Linear Regression')
# plt.ylim(ymin=0) # starts at zero
plt.legend() # labels are attached to each artist, so they cannot be swapped
plt.show()

In [None]:
gdpus_pred = np.exp(y_pred) # 2018; back-transform from the log scale
gdpus_pred = gdpus_pred / 1000000000000 # convert to trillions
gdpus_pred = round(gdpus_pred, 2)
gdpus_pred

In [None]:
print("Skewness: %f" % dfch['GDP'].skew())
print("Kurtosis: %f" % dfch['GDP'].kurt())
import seaborn as sns
f, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 4))
sns.histplot(dfch['GDP'], kde=True, ax=ax[0]) # distplot is deprecated in recent seaborn releases
sns.boxplot(x=dfch['GDP'], ax=ax[1])
from scipy import stats
stats.probplot(dfch['GDP'], plot=ax[2])
plt.show()

In [None]:
dfch['GDP'] = np.log(dfch['GDP'])

In [None]:
print("Skewness: %f" % dfch['GDP'].skew())
print("Kurtosis: %f" % dfch['GDP'].kurt())
import seaborn as sns
f, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 4))
sns.histplot(dfch['GDP'], kde=True, ax=ax[0]) # distplot is deprecated in recent seaborn releases
sns.boxplot(x=dfch['GDP'], ax=ax[1])
from scipy import stats
stats.probplot(dfch['GDP'], plot=ax[2])
plt.show()

In [None]:
print('CHINA')
x = dfch[['year']].values
y = dfch.GDP.values
regr = sklearn.linear_model.LinearRegression()
model = regr.fit(x,y) # SciKit-Learn
score = regr.score(x, y)
score = round(score*100,2)
title = f"CHINA Linear Regression Score = {score}"
plt.title(title)
print('score = {}'.format(score))
coef = regr.coef_ # slope: average change in log(GDP) per year
print('coef = {}'.format(coef))
intercept = regr.intercept_
print('intercept = {}'.format(intercept))
y_pred = model.predict(x)
print('SciKit-Learn')
plt.scatter(x, y, color='gray') # sklearn
plt.plot(x, y_pred, color='orange') # model
# plt.ylim(0) # start at zero
plt.show()

In [None]:
from scipy import stats
X = dfch.year
y = dfch.GDP
slope, intercept, r, p, std_err = stats.linregress(X, y) # scipy
def modelPrediction(x):
  return slope * x + intercept
# The model is now fit to log(GDP), so its predictions are on the log scale
model = list(map(modelPrediction, X)) # scipy
x_pred = 2018
y_pred = modelPrediction(x_pred) # log of the predicted GDP
# np.exp undoes the log transform so the title shows a currency amount
title='GDP CHINA (2018) = ${:,.2f}'.format(np.exp(y_pred))
plt.title(title)
print('SciPy')
plt.scatter(X, y, color='red', label='log GDP CHINA data (1960-2017)') # Scatter Plot
plt.plot(X, model, color='orange', label='Model Prediction using Linear Regression')
# plt.ylim(ymin=0) # starts at zero
plt.legend() # labels are attached to each artist, so they cannot be swapped
plt.show()

In [None]:
gdpch_pred = np.exp(y_pred) # 2018; back-transform from the log scale
gdpch_pred = gdpch_pred / 1000000000000 # convert to trillions
gdpch_pred = round(gdpch_pred, 2)
gdpch_pred

In [None]:
print("Skewness: %f" % dfmx['GDP'].skew())
print("Kurtosis: %f" % dfmx['GDP'].kurt())
import seaborn as sns
f, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 4))
sns.histplot(dfmx['GDP'], kde=True, ax=ax[0]) # distplot is deprecated in recent seaborn releases
sns.boxplot(x=dfmx['GDP'], ax=ax[1])
from scipy import stats
stats.probplot(dfmx['GDP'], plot=ax[2])
plt.show()

In [None]:
dfmx['GDP'] = np.log(dfmx['GDP'])

In [None]:
print("Skewness: %f" % dfmx['GDP'].skew())
print("Kurtosis: %f" % dfmx['GDP'].kurt())
import seaborn as sns
f, ax = plt.subplots(nrows=1, ncols=3, figsize=(18, 4))
sns.histplot(dfmx['GDP'], kde=True, ax=ax[0]) # distplot is deprecated in recent seaborn releases
sns.boxplot(x=dfmx['GDP'], ax=ax[1])
from scipy import stats
stats.probplot(dfmx['GDP'], plot=ax[2])
plt.show()

In [None]:
import sklearn
from sklearn.linear_model import LinearRegression
print('MEXICO')
x = dfmx[['year']].values
y = dfmx.GDP.values
regr = sklearn.linear_model.LinearRegression()
model = regr.fit(x,y) # SciKit-Learn
score = regr.score(x, y)
score = round(score*100,2)
title = f"MEXICO Linear Regression Score = {score}"
plt.title(title)
print('score = {}'.format(score))
coef = regr.coef_ # slope: average change in log(GDP) per year
print('coef = {}'.format(coef))
intercept = regr.intercept_
print('intercept = {}'.format(intercept))
y_pred = model.predict(x)
print('SciKit-Learn')
plt.scatter(x, y, color='gray') # sklearn
plt.plot(x, y_pred, color='orange') # model
# plt.ylim(0) # start at zero
plt.show()

In [None]:
from scipy import stats
X = dfmx.year
y = dfmx.GDP
slope, intercept, r, p, std_err = stats.linregress(X, y) # scipy
def modelPrediction(x):
  return slope * x + intercept
# The model is now fit to log(GDP), so its predictions are on the log scale
model = list(map(modelPrediction, X)) # scipy
x_pred = 2018
y_pred = modelPrediction(x_pred) # log of the predicted GDP
# np.exp undoes the log transform; USD is assumed, matching the US cell's label for the same dataset
title='GDP Mexico (2018) = ${:,.2f} USD'.format(np.exp(y_pred))
plt.title(title)
print('SciPy')
plt.scatter(X, y, color='green', label='log GDP Mexico data (1960-2017)') # Scatter Plot
plt.plot(X, model, color='red', label='Model Prediction using Linear Regression')
# plt.ylim(ymin=0) # starts at zero
plt.legend() # labels are attached to each artist, so they cannot be swapped
plt.show()

In [None]:
gdpmx_pred = np.exp(y_pred) # 2018; back-transform from the log scale
gdpmx_pred = gdpmx_pred / 1000000000000 # convert to trillions
gdpmx_pred = round(gdpmx_pred, 2)
gdpmx_pred

In [None]:
x = (gdpus_pred, gdpch_pred, gdpmx_pred)
x

In [None]:
plt.rcdefaults()
fig, ax = plt.subplots()
y = ('United States', 'China', 'Mexico')
y_pos = np.arange(len(y))
x = (gdpus_pred, gdpch_pred, gdpmx_pred)
ax.barh(y_pos, x, align='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(y)
ax.invert_yaxis() # labels read top-to-bottom
ax.set_xlabel('GDP')
ax.set_title('Predicted GDP per Country 2018 (log-transformed model)')
for i, v in enumerate(x):
    ax.text(v + 1, i, str(v), color='black', va='center', fontweight='normal')
plt.show()

## Conclusion

[Linear Regression](https://towardsdatascience.com/a-summary-of-the-basic-machine-learning-models-e0a65627ecbe) tends to be the Machine Learning algorithm that all teachers explain first, most books start with, and most people end up learning to start their career with.

It is a very simple algorithm that takes a vector of features (the variables or characteristics of our data) as an input, and gives out a numeric, continuous output. 

As its name and the previous explanation suggest, it is a regression algorithm, and the founding member of the family of linear models from which Generalised Linear Models (GLMs) derive.

It can be trained using a closed form solution, or, as it is normally done in the Machine Learning world, using an iterative optimisation algorithm like Gradient Descent.
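
To make the two training routes concrete, the sketch below fits the same line both ways on made-up data: once with a closed-form least-squares solve (`np.linalg.lstsq`) and once with plain batch gradient descent on the mean squared error. The learning rate, iteration count, and data are arbitrary illustrative choices, not values from this notebook.

```python
import numpy as np

# Made-up data: y = 2x + 0.5 plus a little noise.
rng = np.random.default_rng(42)
x_demo = rng.uniform(-1, 1, size=200)
y_demo = 2.0 * x_demo + 0.5 + rng.normal(scale=0.1, size=x_demo.size)

# Design matrix with a column of ones so the intercept is learned too.
A = np.column_stack([x_demo, np.ones_like(x_demo)])

# Route 1: closed-form least-squares solution.
w_closed, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# Route 2: batch gradient descent on the mean squared error.
w_gd = np.zeros(2)
learning_rate = 0.1
for _ in range(5000):
    residual = A @ w_gd - y_demo
    gradient = 2.0 / len(y_demo) * A.T @ residual
    w_gd -= learning_rate * gradient

print(w_closed)  # [slope, intercept]
print(w_gd)
```

Both routes arrive at essentially the same slope and intercept here. For a single feature the closed form is simpler; iterative methods matter more when the data is too large to solve directly.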

Linear Regression is a parametric machine learning model: it has a fixed number of parameters, determined by the number of features in our data, and it trains quite quickly. It works well when the features are linearly correlated with the target variable (the continuous numeric value that we want to predict), and it is intuitive to learn and easy to explain.

It is what we call an ‘explainable AI model’, as the predictions it makes are very easy to explain knowing the model weights.

An example of a Linear Regression model could be a model that predicts house prices taking into account the characteristics of each home like the surface area, location, number of rooms, or if it has an elevator or not.

The following figure shows how Linear Regression would predict the price of a certain house using only 1 feature: the surface area of the house in square meters.

In the case of more variables being included in our model, the X axis would reflect a weighted linear combination of these features.
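
As a rough illustration of such a weighted linear combination, and of why the weights make the prediction easy to explain, here is a sketch with hypothetical weights for three house features. The numbers are invented for the example and are not fitted to any data.

```python
import numpy as np

# Hypothetical model weights; NOT fitted to real data.
feature_names = ['surface_m2', 'rooms', 'has_elevator']
weights = np.array([2000.0, 10000.0, 15000.0])  # price added per unit of each feature
bias = 50000.0                                  # base price when all features are zero

# One house: 80 square meters, 3 rooms, with an elevator.
house = np.array([80.0, 3.0, 1.0])

# Each feature's share of the prediction is simply weight * value.
contributions = weights * house
price = bias + contributions.sum()

for name, part in zip(feature_names, contributions):
    print(name, part)
print('predicted price:', price)
```

Because the prediction is a sum, each term can be read on its own: in this made-up model the surface area alone accounts for 160,000 of the 255,000 total.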

The line from the previous figure would have been fit in the training process using an optimisation algorithm, like gradient descent, that iteratively adjusts the slope and intercept of the line until the best possible line for our task is obtained.