## In this notebook I will try to build a linear model that can predict the GDP of Japan

> The data comes from the World Bank DataBank: https://databank.worldbank.org/reports.aspx?source=2&series=NY.GDP.MKTP.CD&country 

#### 1. Load necessary libraries

In [103]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

#### 2. Load data

In [104]:
data = pd.read_excel('P_Popular Indicators.xlsx', index_col=0)
data.index.name = False
data = data.T
data.head()

Unnamed: 0,"Population, total",Population growth (annual %),Surface area (sq. km),Poverty headcount ratio at national poverty lines (% of population),"GNI, Atlas method (current US$)","GNI per capita, Atlas method (current US$)","GNI, PPP (current international $)","GNI per capita, PPP (current international $)",Income share held by lowest 20%,"Life expectancy at birth, total (years)",...,"Foreign direct investment, net inflows (BoP, current US$)",Net ODA received per capita (current US$),GDP per capita (current US$),"Foreign direct investment, net (BoP, current US$)","Inflation, consumer prices (annual %)",NaN,NaN.1,NaN.2,Data from database: World Development Indicators,Last Updated: 12/22/2022
Series Code,SP.POP.TOTL,SP.POP.GROW,AG.SRF.TOTL.K2,SI.POV.NAHC,NY.GNP.ATLS.CD,NY.GNP.PCAP.CD,NY.GNP.MKTP.PP.CD,NY.GNP.PCAP.PP.CD,SI.DST.FRST.20,SP.DYN.LE00.IN,...,BX.KLT.DINV.CD.WD,DT.ODA.ODAT.PC.ZS,NY.GDP.PCAP.CD,BN.KLT.DINV.CD,FP.CPI.TOTL.ZG,,,,,
Country Name,Japan,Japan,Japan,Japan,Japan,Japan,Japan,Japan,Japan,Japan,...,Japan,Japan,Japan,Japan,Japan,,,,,
Country Code,JPN,JPN,JPN,JPN,JPN,JPN,JPN,JPN,JPN,JPN,...,JPN,JPN,JPN,JPN,JPN,,,,,
1960 [YR1960],93216000,..,..,..,..,..,..,..,..,67.666098,...,..,..,475.319076,..,3.574512,,,,,
1961 [YR1961],94055000,0.896034,377800,..,..,..,..,..,..,68.31,...,..,..,568.907743,..,5.368462,,,,,


#### 3. Data preparation

In [105]:
# Check for null values. The data also contains other placeholders for missing values, which are handled later

data.isna().sum()

False
Population, total                                                                                    0
Population growth (annual %)                                                                         0
Surface area (sq. km)                                                                                0
Poverty headcount ratio at national poverty lines (% of population)                                  0
GNI, Atlas method (current US$)                                                                      0
GNI per capita, Atlas method (current US$)                                                           0
GNI, PPP (current international $)                                                                   0
GNI per capita, PPP (current international $)                                                        0
Income share held by lowest 20%                                                                      0
Life expectancy at birth, total (years)                            

In [106]:
# Drop columns that are entirely NaN

data.dropna(axis=1, thresh=1, inplace=True)
data.head()

Unnamed: 0,"Population, total",Population growth (annual %),Surface area (sq. km),Poverty headcount ratio at national poverty lines (% of population),"GNI, Atlas method (current US$)","GNI per capita, Atlas method (current US$)","GNI, PPP (current international $)","GNI per capita, PPP (current international $)",Income share held by lowest 20%,"Life expectancy at birth, total (years)",...,Net barter terms of trade index (2000 = 100),"External debt stocks, total (DOD, current US$)",Total debt service (% of GNI),Net migration,"Personal remittances, paid (current US$)","Foreign direct investment, net inflows (BoP, current US$)",Net ODA received per capita (current US$),GDP per capita (current US$),"Foreign direct investment, net (BoP, current US$)","Inflation, consumer prices (annual %)"
Series Code,SP.POP.TOTL,SP.POP.GROW,AG.SRF.TOTL.K2,SI.POV.NAHC,NY.GNP.ATLS.CD,NY.GNP.PCAP.CD,NY.GNP.MKTP.PP.CD,NY.GNP.PCAP.PP.CD,SI.DST.FRST.20,SP.DYN.LE00.IN,...,TT.PRI.MRCH.XD.WD,DT.DOD.DECT.CD,DT.TDS.DECT.GN.ZS,SM.POP.NETM,BM.TRF.PWKR.CD.DT,BX.KLT.DINV.CD.WD,DT.ODA.ODAT.PC.ZS,NY.GDP.PCAP.CD,BN.KLT.DINV.CD,FP.CPI.TOTL.ZG
Country Name,Japan,Japan,Japan,Japan,Japan,Japan,Japan,Japan,Japan,Japan,...,Japan,Japan,Japan,Japan,Japan,Japan,Japan,Japan,Japan,Japan
Country Code,JPN,JPN,JPN,JPN,JPN,JPN,JPN,JPN,JPN,JPN,...,JPN,JPN,JPN,JPN,JPN,JPN,JPN,JPN,JPN,JPN
1960 [YR1960],93216000,..,..,..,..,..,..,..,..,67.666098,...,..,..,..,-46245,..,..,..,475.319076,..,3.574512
1961 [YR1961],94055000,0.896034,377800,..,..,..,..,..,..,68.31,...,..,..,..,-33403,..,..,..,568.907743,..,5.368462


In [107]:
# Drop unnecessary rows

data.drop(["Series Code", 'Country Name', 'Country Code'], axis=0, inplace=True)

In [108]:
# Convert the index to integer years so it is easier to work with

data.index = [int(i[:4]) for i in data.index]

In [109]:
# Missing values are encoded as the string '..'; we will replace them with np.nan

data.iat[0,1]

'..'

In [110]:
# Replace '..' with NaN, then count the NaNs in each column

data = data.replace('..', np.nan)
data.isna().sum()

False
Population, total                                                                                    0
Population growth (annual %)                                                                         1
Surface area (sq. km)                                                                                2
Poverty headcount ratio at national poverty lines (% of population)                                 62
GNI, Atlas method (current US$)                                                                      2
GNI per capita, Atlas method (current US$)                                                           2
GNI, PPP (current international $)                                                                  30
GNI per capita, PPP (current international $)                                                       30
Income share held by lowest 20%                                                                     59
Life expectancy at birth, total (years)                            
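Replacing `'..'` with `np.nan` can leave the columns with an `object` dtype, depending on the pandas version. A minimal sketch of an alternative, assuming a toy frame (`raw` is a hypothetical stand-in for the World Bank export, not the real file), is to coerce every column to numeric so that any unparseable placeholder becomes NaN in one step:

```python
import pandas as pd

# Toy frame standing in for the World Bank export: '..' marks a missing value
raw = pd.DataFrame(
    {
        "Population, total": ["93216000", "94055000"],
        "Population growth (annual %)": ["..", "0.896034"],
    },
    index=[1960, 1961],
)

# Coerce every column to numeric; anything unparseable (such as '..') becomes NaN
clean = raw.apply(pd.to_numeric, errors="coerce")

print(clean.dtypes)
print(clean.isna().sum())
```

The result has numeric dtypes throughout, which is what `corrwith` and scikit-learn expect later on.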

In [111]:
# Keep only columns that have at least 30 non-NaN values (thresh counts non-missing entries)

data_new = data.dropna(axis=1, thresh=30)

# The years before 1970 have a lot of NaN values

data_new.isna().sum(axis = 1).sort_values(ascending = False).head(15)

1960    20
1961    17
1962    15
1963    15
1964    15
1966    15
1967    15
1968    15
1969    15
1965    14
2021    13
1970    10
1974     9
1972     9
1971     9
dtype: int64
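The `thresh` argument of `dropna` is easy to misread: it is the minimum number of non-missing values a column (or row) needs in order to be kept, not the number of NaNs that triggers a drop. A small sketch on a made-up frame (`toy` is hypothetical) illustrates this:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],           # 4 non-missing values
    "b": [1.0, np.nan, 3.0, np.nan],     # 2 non-missing values
    "c": [np.nan, np.nan, np.nan, 4.0],  # 1 non-missing value
})

# Keep only columns with at least 2 non-missing values
kept = toy.dropna(axis=1, thresh=2)
print(list(kept.columns))  # ['a', 'b']
```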

In [112]:
# Drop the years before 1970 and the year 2021, which have many NaN values

data_new = data_new[(data_new.index > 1969) & (data_new.index < 2021)]
data_new

Unnamed: 0,"Population, total",Population growth (annual %),Surface area (sq. km),"GNI, Atlas method (current US$)","GNI per capita, Atlas method (current US$)","GNI, PPP (current international $)","GNI per capita, PPP (current international $)","Life expectancy at birth, total (years)","Fertility rate, total (births per woman)","Adolescent fertility rate (births per 1,000 women ages 15-19)",...,Gross capital formation (% of GDP),Market capitalization of listed domestic companies (% of GDP),Military expenditure (% of GDP),Mobile cellular subscriptions (per 100 people),Merchandise trade (% of GDP),Net migration,"Personal remittances, paid (current US$)","Foreign direct investment, net inflows (BoP, current US$)",GDP per capita (current US$),"Inflation, consumer prices (annual %)"
1970,103403000,1.15164,377800.0,194091300000.0,1880.0,,,71.950244,2.135,4.3486,...,43.507909,,0.77323,0.0,17.966768,124197,,94000000.0,2056.122046,6.924174
1971,105697000,2.194254,377800.0,226071600000.0,2140.0,,,72.882927,2.16,4.4788,...,39.867284,,0.824119,,18.199738,110997,,210000000.0,2272.077802,6.395349
1972,107188000,1.400779,377800.0,291828400000.0,2720.0,,,73.506585,2.14,4.609,...,39.613974,,0.842475,,16.649619,93814,,169000000.0,2967.041996,4.843517
1973,108707000,1.407189,377800.0,393149700000.0,3620.0,,,73.757561,2.14,4.4342,...,42.448537,,0.815014,,17.451753,67245,,-42000000.0,3974.745605,11.608624
1974,110162000,1.329582,377800.0,488054900000.0,4430.0,,,74.393902,2.05,4.2594,...,41.608935,,0.862068,,24.48095,38925,,202000000.0,4353.824355,23.222246
1975,111573000,1.272708,377800.0,576400600000.0,5170.0,,,75.057317,1.909,4.0846,...,36.546478,27.188437,0.907387,0.0,21.796714,22942,,226000000.0,4674.445481,11.731266
1976,112775000,1.07156,377800.0,609684800000.0,5410.0,,,75.456829,1.85,3.9098,...,35.505791,30.861945,0.891651,0.0,22.553327,47790,,113000000.0,5197.622337,9.374036
1977,113872000,0.968033,377800.0,676949100000.0,5940.0,,,75.898293,1.8,3.735,...,34.400408,2.984485,0.890681,0.0,21.128432,31569,130000000.0,21000000.0,6335.286871,8.161827
1978,114913000,0.910031,377800.0,856521800000.0,7450.0,,,76.038293,1.79,3.8548,...,34.464487,3.373889,0.891286,0.0,17.574079,852,160000000.0,8000000.0,8820.691945,4.209566
1979,115890000,0.846615,377800.0,1094764000000.0,9450.0,,,76.337561,1.77,3.9746,...,36.232092,27.404794,0.907392,0.0,20.106878,-21454,200000000.0,239000000.0,9103.564756,3.701851


In [113]:
# Fill NaN values backward: each gap takes the next valid value below it

data_filled = data_new.bfill()

In [114]:
# Fill the remaining NaN values (trailing gaps) forward from the last valid value

data_filled = data_filled.ffill()

In [115]:
data_filled.isna().sum()

False
Population, total                                                                                   0
Population growth (annual %)                                                                        0
Surface area (sq. km)                                                                               0
GNI, Atlas method (current US$)                                                                     0
GNI per capita, Atlas method (current US$)                                                          0
GNI, PPP (current international $)                                                                  0
GNI per capita, PPP (current international $)                                                       0
Life expectancy at birth, total (years)                                                             0
Fertility rate, total (births per woman)                                                            0
Adolescent fertility rate (births per 1,000 women ages 15-19)               
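To make the effect of the two fill steps concrete, here is a small sketch on a made-up yearly series (`s` is hypothetical). Backward fill copies a later year's value into earlier years, and forward fill then covers any gap at the end. Linear interpolation is shown only as a possible alternative for gaps between two known values, not as what the notebook does:

```python
import numpy as np
import pandas as pd

s = pd.Series(
    [np.nan, 2.0, np.nan, 4.0, np.nan],
    index=[1970, 1971, 1972, 1973, 1974],
)

# Same order as in the notebook: backward fill first, then forward fill
filled = s.bfill().ffill()
print(filled.tolist())  # [2.0, 2.0, 4.0, 4.0, 4.0]

# Alternative for interior gaps: estimate 1972 from its two neighbours
interp = s.interpolate()
print(interp.loc[1972])  # 3.0
```

With backward fill, 1972 receives the 1973 value unchanged, so a long run of missing years becomes a flat line in the training data.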

In [116]:
# Correlation between features and GDP

corr = data_filled.corrwith(data_filled['GDP (current US$)']).sort_values(ascending=False)
corr = pd.DataFrame(corr)
corr.style.background_gradient(cmap='coolwarm')

Indicator,Correlation with GDP (current US$)
GDP (current US$),1.0
GDP per capita (current US$),0.999825
"GNI, Atlas method (current US$)",0.989124
"GNI per capita, Atlas method (current US$)",0.98912
Electric power consumption (kWh per capita),0.963015
"School enrollment, secondary (% gross)",0.935889
"Water productivity, total (constant 2015 US$ GDP per cubic meter of total freshwater withdrawal)",0.930573
"Life expectancy at birth, total (years)",0.922684
"Population, total",0.916887
"Immunization, measles (% of children ages 12-23 months)",0.868245


In [117]:
# Drop indicators that are just another form of GDP or another way of calculating it

data_filled.drop(['GDP per capita (current US$)',
                  'GNI, Atlas method (current US$)',
                  "GNI per capita, Atlas method (current US$)"], axis=1, inplace=True)

#### 4. Modeling

In [118]:
# Separate the features from the target

X_data = data_filled.drop('GDP (current US$)', axis=1)
y_data = data_filled['GDP (current US$)']

In [137]:
# Split into train and test sets. The dataset is small, so I decided to hold out 10% of it for testing

X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.1, random_state=73)
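Because each row is a year, a random split places test years between training years, so the model is partly interpolating. A stricter check, sketched here on made-up data (`toy` and `y` are hypothetical), is a chronological hold-out where the most recent years form the test set. This is an alternative to consider, not what the cell above does:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

years = np.arange(1970, 2021)  # 51 yearly rows, like the filtered data
toy = pd.DataFrame({"x": np.arange(len(years), dtype=float)}, index=years)
y = pd.Series(np.arange(len(years), dtype=float) * 2.0, index=years)

# shuffle=False keeps the row order, so the last 10% of years become the test set
X_tr, X_te, y_tr, y_te = train_test_split(toy, y, test_size=0.1, shuffle=False)

print(X_tr.index.max(), X_te.index.min())
print(len(X_tr), len(X_te))
```

With 51 rows and `test_size=0.1` the test set has 6 rows, the same size as in the random split.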

In [138]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

In [139]:
# Training

model_linear = LinearRegression()
model_linear.fit(X_train_scaled,y_train)

LinearRegression()

In [140]:
# RMSE and R2 score for the train set

pred = model_linear.predict(X_train_scaled)
print(f'RMSE: {mean_squared_error(y_train, pred)**0.5:.2f} US$')
print(f'R2 score: {r2_score(y_train, pred) * 100:.2f} %')

RMSE: 145791199492.60 US$
R2 score: 99.44 %
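The first figure printed above is the square root of the mean squared error, which is why it carries the unit US$. For reference, with $y_i$ the actual GDP, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the actual values, and $n$ the number of rows (symbols introduced here only for illustration), the two metrics are:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$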


In [141]:
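# Note: fit_transform re-estimates the mean and standard deviation on the test set;
# scaler.transform(X_test) would instead reuse the statistics learned from X_train
X_test_scaled = scaler.fit_transform(X_test)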

In [142]:
# Predict on the test data

pred_test = model_linear.predict(X_test_scaled)

In [143]:
# RMSE and R2 score for the test set

print(f'RMSE: {mean_squared_error(y_test, pred_test)**0.5:.2f} US$')
print(f'R2 score: {r2_score(y_test, pred_test) * 100:.2f} %')

RMSE: 646346276369.70 US$
R2 score: 83.64 %
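One thing that may contribute to the gap between train and test scores is the scaling step: a model trained on features standardised with the training statistics expects test features on that same scale. A hedged sketch of one way to guarantee this is a scikit-learn pipeline, which fits the scaler on the training data only and reuses it at prediction time. The data below is synthetic, so the printed numbers say nothing about the GDP model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the indicators: 51 rows, 3 features, small noise
rng = np.random.default_rng(73)
X = rng.normal(size=(51, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=51)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=73
)

# fit() learns the scaling from X_train only; predict() applies it to X_test
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

pred_test = model.predict(X_test)
rmse = mean_squared_error(y_test, pred_test) ** 0.5
print(f"RMSE: {rmse:.3f}")
print(f"R2 score: {r2_score(y_test, pred_test):.3f}")
```

The same pipeline object can be passed to `cross_val_predict`, which is already imported at the top of the notebook, to get out-of-fold predictions for every year instead of only six.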


In [148]:
# DataFrame showing the difference between actual and predicted values for the test data

diff = pd.DataFrame()
diff.index = y_test.index
diff['actual'] = y_test
diff['pred'] = pred_test
diff['diff'] = y_test - pred_test
diff['%'] = ((y_test - pred_test) / y_test)*100

diff

Unnamed: 0,actual,pred,diff,%
1998,4098363000000.0,3830711000000.0,267651700000.0,6.530698
1990,3132818000000.0,3125811000000.0,7006457000.0,0.223647
2020,5040108000000.0,3919184000000.0,1120924000000.0,22.240074
2004,4893116000000.0,4531223000000.0,361893100000.0,7.395963
2002,4182846000000.0,5018801000000.0,-835955400000.0,-19.985326
1972,318031300000.0,-272423300000.0,590454600000.0,185.659264
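The percentage column can be condensed into a single summary figure. The sketch below recomputes the absolute percentage error from the values printed in the table (rounded as shown there) and compares the mean with the median, since one year with a negative prediction can dominate the mean:

```python
import numpy as np

# Actual and predicted GDP for the six test years, as printed in the table above
years = [1998, 1990, 2020, 2004, 2002, 1972]
actual = np.array([4098363e6, 3132818e6, 5040108e6, 4893116e6, 4182846e6, 318031.3e6])
pred = np.array([3830711e6, 3125811e6, 3919184e6, 4531223e6, 5018801e6, -272423.3e6])

# Absolute percentage error per year
abs_pct = np.abs((actual - pred) / actual) * 100

print(f"Mean absolute % error:   {abs_pct.mean():.1f}")
print(f"Median absolute % error: {np.median(abs_pct):.1f}")
print(f"Worst year: {years[int(abs_pct.argmax())]}")
```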


> In conclusion, many indicators correlate strongly with GDP, but there is too little data to build a good model. The larger errors on the test set are probably caused in part by the filled-in NaN values.