# <p style="text-align: center;">EDA FOR COUNTRIES OF THE WORLD</p>


In [None]:
from IPython.display import HTML
from IPython.display import Image
Image(url= "https://www.worldatlas.com/r/w728-h425-c728x425/upload/0f/59/b2/untitled-design-275.jpg")

# <p style="text-align: center;">ABSTRACT</p>
[Reference Link 1](#1)

Countries of the World is a dataset by Fernando Lasso that lists various factors affecting the GDP per capita of countries. In this notebook, I show the factors that strongly affect GDP per capita and some that have little or no effect on it. I also discuss total GDP, which is GDP per capita multiplied by the total population of the country. Correlation and linear regression are the methods used to find the factors affecting GDP. A few factors that affect the GDP of a country are phones owned per thousand people, the birthrate and the literacy rate.


In [None]:
from IPython.display import HTML
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')

In [None]:
# importing libraries
%matplotlib inline 
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import re
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import r2_score

In [None]:
# read csv file
df=pd.read_csv('../input/countries of the world.csv', decimal = ',')

### Dataset Overview
Here we look at the first 5 rows of the dataset to see what kind of data it contains. We also look at the columns, which show the various attributes in the dataset.

In [None]:
# View first 5 rows(default) to see the general distribution of data
df.head()

### Statistical Analysis
Here we run a basic statistical analysis on the data to find any abnormal values in the dataset.

In [None]:
# run basic statistical analysis on the given data to find any abnormal values
df.describe()

### Checking for empty fields
[Reference Link 2](#2)

Next we will check if there are any missing or Null values in the dataset.

In [None]:
print("Are there Null Values in the dataset? ")
df.isnull().values.any()

### Finding the location of Null values
[Reference Link 2](#2)

Now that we know there are missing (null) values in the dataset, we need to find the columns that contain them and the percentage of data missing in each of those columns to get a better picture.

In [None]:
# finding the missing or null values in the data
total = df.isnull().sum()[df.isnull().sum() != 0].sort_values(ascending = False)
percent = pd.Series(round(total/len(df)*100,2))
pd.concat([total, percent], axis=1, keys=['Total Missing', 'Percent'])

### How to handle the missing values in the dataset?
[Reference Link 2](#2)

Now that we know that there are missing values in the dataset, we need a remedy. We could ignore the missing values if there were only a few, but columns such as Climate and Literacy are missing about 10% of their data, which could significantly affect the analysis. So before we run our analysis, we need a way to replace these null values. Below we plot our dependent variable (GDP per capita) to see its distribution. Based on this distribution, we can decide how to replace the missing values.

In [None]:
# sorting and plotting Countries based on GDP
top_gdp_countries = df.sort_values('GDP ($ per capita)',ascending=False)
fig, ax = plt.subplots(figsize=(16,6))
sns.barplot(x='Country', y='GDP ($ per capita)', data=top_gdp_countries.head(33), palette='Set1')
ax.set_xlabel(ax.get_xlabel(), labelpad=15)
ax.set_ylabel(ax.get_ylabel(), labelpad=30)
ax.xaxis.label.set_fontsize(16)
ax.yaxis.label.set_fontsize(16)
ax.set_title('GDP per capita of the top 33 countries, sorted in descending order')
plt.xticks(rotation=90)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(16,6))
sns.barplot(x='Country', y='GDP ($ per capita)', data=top_gdp_countries.tail(33), palette='Set1')
ax.set_xlabel(ax.get_xlabel(), labelpad=15)
ax.set_ylabel(ax.get_ylabel(), labelpad=30)
ax.xaxis.label.set_fontsize(16)
ax.yaxis.label.set_fontsize(16)
ax.set_title('GDP per capita of the bottom 33 countries, sorted in descending order')
plt.xticks(rotation=90)
plt.show()

### Replacing missing values in the dataset
[Reference Link 2](#2)

From the graphs plotted above, we can see that GDP per capita is distributed in a skewed manner, i.e. it starts with a high value and then drops off roughly exponentially to a low value. Since the distribution is skewed, it is advisable to replace the missing values with the median as the measure of central tendency. We do not use the mean or the mode because the mean is influenced by outliers, while the mode can have multiple values and is generally used for categorical data.
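
To make the mean-versus-median argument concrete, here is a minimal sketch on made-up numbers (the values are hypothetical and are not taken from the dataset). A single very large value drags the mean far above most of the data, while the median stays near the typical value.

```python
import statistics

# Hypothetical GDP-per-capita-like values: mostly small, one very large
values = [500, 800, 1200, 1500, 2000, 55000]

mean_value = statistics.mean(values)      # pulled upward by the single large value
median_value = statistics.median(values)  # middle of the sorted values

print('mean  :', round(mean_value, 2))
print('median:', median_value)
```

Here the mean ends up larger than five of the six values, which is why it would be a poor stand-in for a missing entry in a skewed column.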

A region is an area of land that includes a number of places which have something in common. It is generally seen, that countries in the same region have the same climate, and thus agricultural patterns are similar. Other socio-cultural attributes such as literacy, industry etc are also found to be similar in a region.

Thus we group our data by region and calculate the median of each attribute. These median values can then be used to replace the missing values in our dataset. Below I show the median values of GDP per capita, Literacy (%) and Agriculture grouped by region. (Literacy and Agriculture were picked arbitrarily to illustrate the grouping by region.)

PS: Since climate is categorical data, we use the mode instead of the median for climate.
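
The idea can be tried out on a tiny table first. The sketch below uses a hypothetical `toy` DataFrame with made-up regions and values, and fills gaps with `groupby(...).transform(...)`, which is one possible vectorized way to express region-wise imputation; it is an illustration, not necessarily the approach used in the rest of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    'Region': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Literacy (%)': [90.0, np.nan, 80.0, 50.0, 60.0, np.nan],
    'Climate': [1.0, 1.0, np.nan, 2.0, np.nan, 2.0],
})

# Numeric column: fill each gap with the median of its own region
region_median = toy.groupby('Region')['Literacy (%)'].transform('median')
toy['Literacy (%)'] = toy['Literacy (%)'].fillna(region_median)

# Categorical code: fill each gap with the most frequent value in its region
region_mode = toy.groupby('Region')['Climate'].transform(lambda s: s.mode().max())
toy['Climate'] = toy['Climate'].fillna(region_mode)

print(toy)
```

Because `transform` returns one value per row, the fill values line up with the original index and no explicit loop over regions is needed.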

In [None]:
df.groupby('Region')[['GDP ($ per capita)', 'Literacy (%)', 'Agriculture']].median()

In [None]:
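for col in df.columns.values:
    if df[col].isnull().sum() == 0:
        continue
    if col == 'Climate':
        # Climate is categorical: use the most frequent value per region
        guess_values = df.groupby('Region')['Climate'].apply(lambda x: x.mode().max())
    else:
        # Numeric columns: use the per-region median
        guess_values = df.groupby('Region')[col].median()
    for region in df['Region'].unique():
        # Assign through df.loc; chained assignment (df[col].loc[...] = ...) may not modify df
        missing_in_region = df[col].isnull() & (df['Region'] == region)
        df.loc[missing_in_region, col] = guess_values[region]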

In [None]:
print("Are there Null Values in the dataset? ")
df.isnull().values.any()

In [None]:
print(df.isnull().sum())

### No missing values
Now that the missing values in the dataset are gone, we can begin our exploratory data analysis. EDA is the process of figuring out what the data can tell us; we use it to find patterns, relationships, or anomalies to inform our subsequent analysis.

# <p style="text-align: center;">Correlation</p>
[Reference Link 1](#1)
    
In the first step of our EDA we find the correlation among the various attributes of the dataset. The correlation coefficient measures the strength of the linear relationship between two numerical quantities. It ranges from -1 to 1.

When two variables have a positive correlation, it means the variables move in the same direction. This means that as one variable increases, so does the other one. In a negative correlation, the variables move in inverse, or opposite, directions. In other words, as one variable increases, the other variable decreases. When the correlation value is 0, no linear correlation exists between the attributes.

In this case, it gives us the attributes most closely related to GDP per capita. These attributes will help us analyze the factors that affect the GDP per capita of a country.
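
As a small illustration of what the coefficient measures, the sketch below computes the Pearson correlation by hand. The helper `pearson_r` and the three short lists are hypothetical and exist only for this example; they are not values from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

phones = [10, 50, 200, 400, 600]        # hypothetical phones per 1000
gdp = [800, 1500, 6000, 15000, 25000]   # hypothetical GDP per capita
birthrate = [45, 38, 25, 14, 10]        # hypothetical births per 1000

r_phones = pearson_r(phones, gdp)       # variables move together: close to +1
r_birth = pearson_r(birthrate, gdp)     # variables move in opposite directions: close to -1

print(round(r_phones, 3), round(r_birth, 3))
```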

In [None]:
# correlate numeric columns only; Country and Region are text
df.select_dtypes(include='number').corr()

### Heatmap
Below is a heatmap, which is a visualization of the correlation matrix.

In [None]:
plt.figure(figsize=(16,12))
ax=plt.axes()
sns.heatmap(data=df.iloc[:,2:].corr(),annot=True,fmt='.2f',cmap='coolwarm',ax=ax)
ax.set_title('Heatmap of all the Correlated values')
plt.show()

### Heatmap for Highly Correlated Values
As we can see from the heatmap above, attributes such as Literacy, Phones per thousand and Service are highly positively correlated with GDP per capita, whereas Infant Mortality Rate, Birthrate and Agriculture are highly negatively correlated with it.
This means that countries with a higher GDP per capita tend to have more phones per thousand, and vice versa.
Likewise, countries with a higher GDP per capita tend to have a lower infant mortality rate, and vice versa.

In [None]:
# choose attributes which shows relation
x = df[['GDP ($ per capita)','Literacy (%)','Phones (per 1000)','Service','Infant mortality (per 1000 births)','Birthrate','Agriculture']]

In [None]:
# show corr of the same
plt.figure(figsize=(9,7))
ax=plt.axes()
sns.heatmap(x.corr(), annot=True, cmap='coolwarm',ax=ax)
ax.set_title('Heatmap of all the highly Correlated values')
plt.show()

### Scatterplot
[Reference Link 1](#1)

A scatterplot is a graph of plotted points that shows the relationship between two sets of data. It helps us see the linear relationship between the independent and dependent variables (x vs y), which here is Factors vs GDP per capita. It shows how strongly two variables are linearly related.

Since some factors are inversely correlated with GDP per capita, we rank the factors by the absolute value of their correlation coefficient and plot the strongest ones.

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(25,16))
plt.subplots_adjust(hspace=0.4)

corr_to_gdp = pd.Series(dtype=float)
excluded = ['GDP ($ per capita)', 'Climate', 'Coastline (coast/area ratio)', 'Pop. Density (per sq. mi.)']
for col in df.columns.values[2:]:
    if col not in excluded:
        corr_to_gdp[col] = df['GDP ($ per capita)'].corr(df[col])
# order the factors by the strength of their correlation, ignoring the sign
abs_corr_to_gdp = corr_to_gdp.abs().sort_values(ascending=False)
corr_to_gdp = corr_to_gdp.loc[abs_corr_to_gdp.index]

for i in range(3):
    for j in range(3):
        sns.regplot(x=corr_to_gdp.index.values[i*3+j], y='GDP ($ per capita)', data=df,
                   ax=axes[i,j], fit_reg=False, marker='.')
        # .iloc selects by position; plain [] on a labelled Series is label-based
        title = 'correlation='+str(corr_to_gdp.iloc[i*3+j])
        axes[i,j].set_title(title)
axes[1,2].set_xlim(0,102)
fig.suptitle('Scatter Plot GDP against the factors',fontsize=30)
plt.show()

### Pairplot

[Reference Link 1](#1)

While there is an almost overwhelming number of methods to use in EDA, one of the most effective starting tools is the pairs plot (also called a scatterplot matrix). A pairs plot allows us to see both the distribution of single variables and the relationships between two variables (here GDP per capita, Phones per thousand and Service, colored by Region). Pair plots are a great method to identify trends for follow-up analysis.

The histogram on the diagonal allows us to see the distribution of a single variable while the scatter plots on the upper and lower triangles show the relationship (or lack thereof) between two variables.

In [None]:
x = df[['GDP ($ per capita)','Phones (per 1000)','Service', 'Region']]
pp=sns.pairplot(x, hue="Region", diag_kind="hist", aspect=1.55, markers="o")
pp.fig.suptitle('Scatter Plot of GDP, Phones per Thousand and Service',y=1.05)

### Distplot
#### Positively Correlated Factors
[Reference Link 1](#1)

The following figure shows the density plots of GDP per capita and the factors positively correlated with it (where r>0). Each plot is a univariate distribution.

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(23,20))
plt.subplots_adjust(hspace=0.4)

# GDP per capita plus the three factors most positively correlated with it
positive_cols = ['GDP ($ per capita)', 'Literacy (%)', 'Phones (per 1000)', 'Service']
for ax, col in zip(axes.flat, positive_cols):
    # histplot with a KDE overlay replaces the deprecated sns.distplot (needs seaborn 0.11+)
    sns.histplot(df[col], kde=True, stat='density', ax=ax)
    ax.set_xlabel(col)
        
fig.suptitle('Distplot of Positively Correlated Factors with GDP per Capita',fontsize=30)        
plt.show()

#### Negatively Correlated Factors
The following figure shows the density plots of GDP per capita and the factors negatively correlated with it (where r<0). Each plot is a univariate distribution.

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(23,20))
plt.subplots_adjust(hspace=0.2)

# GDP per capita plus the three factors most negatively correlated with it
negative_cols = ['Infant mortality (per 1000 births)', 'GDP ($ per capita)', 'Birthrate', 'Agriculture']
for ax, col in zip(axes.flat, negative_cols):
    # histplot with a KDE overlay replaces the deprecated sns.distplot (needs seaborn 0.11+)
    sns.histplot(df[col], kde=True, stat='density', ax=ax)
    ax.set_xlabel(col)
      
fig.suptitle('Distplot of Negatively Correlated Factors with GDP per capita',fontsize=30)                
plt.show()

### Boxplot
[Reference Link 1](#1)

A boxplot is often used in exploratory data analysis to show the shape of a distribution, its central value, and its variability. The following figure shows boxplots for the three factors most positively correlated and the three most negatively correlated with GDP per capita.
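
The numbers behind a boxplot can be computed directly. The sketch below uses a hypothetical list of values and assumes the common 1.5 × IQR convention for the whiskers, which is the default in matplotlib and seaborn; points beyond those fences are the ones drawn as individual outliers.

```python
import statistics

# Hypothetical sample with one unusually large value
data = [2, 4, 5, 7, 8, 9, 40]

# Quartiles with linear interpolation between data points
q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')
iqr = q3 - q1

# Fences under the 1.5 * IQR convention
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print('Q1, median, Q3:', q1, q2, q3)
print('outliers:', outliers)
```

The box spans Q1 to Q3, the line inside it marks the median, and the width of the box (the IQR) summarizes the variability.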

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(23,20))
plt.subplots_adjust(hspace=0.2)

# The three factors most positively and the three most negatively correlated with GDP per capita
box_cols = ['Infant mortality (per 1000 births)', 'Literacy (%)', 'Phones (per 1000)',
            'Birthrate', 'Agriculture', 'Service']
for ax, col in zip(axes.flat, box_cols):
    # pass the data by keyword so seaborn draws a horizontal boxplot of this column
    sns.boxplot(x=df[col], ax=ax)
    ax.set_title(col)
        
fig.suptitle('Boxplot of Correlated Factors with GDP per Capita',fontsize=30)                      
plt.show()

# <p style="text-align: center;">Linear Regression (Data Training and Modeling)</p>
[Reference Link 3](#3)
    
Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable (Factors), and the other is considered to be a dependent variable (GDP). Linear regression looks at various data points and plots a trend line. It can create a predictive model on apparently random data, showing trends in the data.

#### RMSE
Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit. Root mean square error is commonly used in forecasting and regression analysis to verify experimental results.

#### MSLE
The mean squared error (MSE) tells you how close a regression line is to a set of points. It does this by taking the distances from the points to the regression line (these distances are the “errors”) and squaring them. The squaring is necessary to remove any negative signs. It also gives more weight to larger differences. It’s called the mean squared error because you’re finding the average of a set of squared errors. MSLE stands for Mean Squared Log Error: the same average, but taken over the squared differences between the logarithms of (1 + predicted value) and (1 + actual value), so it measures relative rather than absolute error.
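
A minimal sketch of both metrics, written out by hand on made-up numbers. The helper functions `rmse` and `msle` are hypothetical names for this example; the MSLE follows the log(1 + x) definition described above.

```python
import math

def rmse(y_true, y_pred):
    """Root of the mean squared difference between actual and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def msle(y_true, y_pred):
    """Mean squared difference between log(1 + actual) and log(1 + predicted)."""
    squared = [(math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared) / len(squared)

# Hypothetical values: the first and third predictions are both off by 1000
actual = [1000, 5000, 30000]
predicted = [2000, 5000, 29000]

rmse_value = rmse(actual, predicted)
msle_value = msle(actual, predicted)

print('RMSE:', round(rmse_value, 2))
print('MSLE:', round(msle_value, 4))
```

Both errors of 1000 count equally towards the RMSE, but the miss on the small value (1000 predicted as 2000) dominates the MSLE, because in relative terms it is a much larger error.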

In [None]:
df.head()

### Label Encoder
From the above dataset we can see that Region holds non-numeric values, so it cannot be used directly in a linear regression. We therefore use LabelEncoder to transform Region into a Regional_label column, which assigns a numeric value to every Region entry.
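
To show what label encoding produces, here is a plain-Python sketch of the mapping: the distinct values are sorted and numbered from 0. The region names in the list are a hypothetical sample, and the code only mimics the idea rather than calling scikit-learn.

```python
# Hypothetical sample of region names
regions = ['WESTERN EUROPE', 'ASIA', 'OCEANIA', 'ASIA']

# Sort the distinct values and number them 0, 1, 2, ...
classes = sorted(set(regions))
label_of = {name: code for code, name in enumerate(classes)}
labels = [label_of[r] for r in regions]

print(label_of)
print(labels)
```

Note that the resulting integers carry an arbitrary order (here alphabetical), which a linear model will treat as a numeric scale. One-hot encoding is a common alternative when that ordering is not meaningful.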

In [None]:
LE = LabelEncoder()
df['Regional_label'] = LE.fit_transform(df['Region'])
df1 = df[['Region','Regional_label']]
df1.head(5)

### Multiple Linear Regression 
Multiple linear regression (MLR) is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. Its goal is to model the relationship between the explanatory and response variables.

In [None]:
train, test = train_test_split(df, test_size=0.3, shuffle=True)
training_features = ['Population', 'Area (sq. mi.)',
       'Pop. Density (per sq. mi.)', 'Coastline (coast/area ratio)',
       'Net migration', 'Infant mortality (per 1000 births)',
       'Literacy (%)', 'Phones (per 1000)',
       'Arable (%)', 'Crops (%)', 'Other (%)', 'Birthrate',
       'Deathrate', 'Agriculture', 'Industry', 'Service', 'Regional_label']
target = 'GDP ($ per capita)'
train_X = train[training_features]
train_Y = train[target]
test_X = test[training_features]
test_Y = test[target]

In [None]:
model = LinearRegression()
model.fit(train_X, train_Y)
train_pred_Y = model.predict(train_X)
test_pred_Y = model.predict(test_X)
train_pred_Y = pd.Series(train_pred_Y.clip(0, train_pred_Y.max()), index=train_Y.index)
test_pred_Y = pd.Series(test_pred_Y.clip(0, test_pred_Y.max()), index=test_Y.index)

# scikit-learn metrics take (y_true, y_pred)
rmse_train = np.sqrt(mean_squared_error(train_Y, train_pred_Y))
msle_train = mean_squared_log_error(train_Y, train_pred_Y)
rmse_test = np.sqrt(mean_squared_error(test_Y, test_pred_Y))
msle_test = mean_squared_log_error(test_Y, test_pred_Y)

print('Root Mean Squared Error for Training Data is:', '%.2f' %rmse_train,'\t\tMean Squared Log Error for Training Data is:', '%.2f' %msle_train)
print('Root Mean Squared Error for Test Data is:', '%.2f' %rmse_test,'\t\tMean Squared Log Error for Test Data is:', '%.2f' %msle_test)


### Simple Linear Regression

The simplest form of regression analysis uses one dependent variable and one independent variable. In this simple model, a straight line approximates the relationship between the dependent variable and the independent variable.
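
For a single explanatory variable, the least-squares line has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the two means. The sketch below applies this to made-up points that lie exactly on a line; `fit_line` and the data are hypothetical and only illustrate the idea.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on gdp = 1000 + 30 * phones
phones = [0, 100, 200, 300, 400]
gdp = [1000, 4000, 7000, 10000, 13000]

slope, intercept = fit_line(phones, gdp)
predicted_gdp = intercept + slope * 250   # prediction for 250 phones per 1000

print(slope, intercept, predicted_gdp)
```

With real data the points scatter around the line, and the distances from the points to the line are the residuals that RMSE summarizes.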

In [None]:
train, test = train_test_split(df, test_size=0.3, shuffle=True)
training_features = ['Phones (per 1000)']
target = 'GDP ($ per capita)'
train_X = train[training_features]
train_Y = train[target]
test_X = test[training_features]
test_Y = test[target]

In [None]:
model = LinearRegression()
model.fit(train_X, train_Y)
train_pred_Y = model.predict(train_X)
test_pred_Y = model.predict(test_X)
train_pred_Y = pd.Series(train_pred_Y.clip(0, train_pred_Y.max()), index=train_Y.index)
test_pred_Y = pd.Series(test_pred_Y.clip(0, test_pred_Y.max()), index=test_Y.index)

# scikit-learn metrics take (y_true, y_pred)
rmse_train = np.sqrt(mean_squared_error(train_Y, train_pred_Y))
msle_train = mean_squared_log_error(train_Y, train_pred_Y)
rmse_test = np.sqrt(mean_squared_error(test_Y, test_pred_Y))
msle_test = mean_squared_log_error(test_Y, test_pred_Y)

print('Root Mean Squared Error for Training Data is:', '%.2f' %rmse_train,'\t\tMean Squared Log Error for Training Data is:', '%.2f' %msle_train)
print('Root Mean Squared Error for Test Data is:', '%.2f' %rmse_test,'\t\tMean Squared Log Error for Test Data is:', '%.2f' %msle_test)

plt.scatter(test_X, test_Y,  color='red')
plt.plot(test_X, test_pred_Y, color='blue', linewidth=1)
plt.title('Linear Regression of Phones per Thousand with GDP')
plt.xlabel('Phones per Thousand')
plt.ylabel('GDP per Capita')
plt.xticks()
plt.yticks()

plt.show()

# <p style="text-align: center;">Total GDP</p>
    
GDP, which stands for Gross Domestic Product, is a measure describing the value of a country's economy. GDP takes into account all of the goods produced and services made available in a country over a specific period of time. Often, GDP is obtained quarterly and annually. GDP is a number that will ultimately indicate the overall economic health of the country.



In [None]:
df['Total_GDP ($)'] = df['GDP ($ per capita)'] * df['Population']
#plt.figure(figsize=(16,6))
top_gdp_countries = df.sort_values('Total_GDP ($)',ascending=False).head(10)
other = pd.DataFrame({'Country':['Other'], 'Total_GDP ($)':[df['Total_GDP ($)'].sum() - top_gdp_countries['Total_GDP ($)'].sum()]})
gdps = pd.concat([top_gdp_countries[['Country','Total_GDP ($)']],other],ignore_index=True)

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,7), gridspec_kw = {'width_ratios':[2,1]})
sns.barplot(x='Country', y='Total_GDP ($)', data=gdps, ax=axes[0], palette='Set3')
axes[0].set_xlabel('Country', labelpad=30, fontsize=16)
axes[0].set_ylabel('Total_GDP', labelpad=30, fontsize=16)

colors = sns.color_palette("Set3", gdps.shape[0]).as_hex()
axes[1].pie(gdps['Total_GDP ($)'], labels=gdps['Country'], colors=colors, autopct='%1.1f%%', shadow=True)
axes[1].axis('equal')
plt.show()

### Ranking of Countries according to the total GDP vs GDP per Capita

As we can see, countries like India and China, which have a low GDP per capita because of their large populations, jump up the ranking when it comes to total GDP.
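
The effect is easy to reproduce on invented figures. In the sketch below, the three country names and all numbers are hypothetical; total GDP is taken as GDP per capita times population, and each country is ranked under both measures.

```python
# Hypothetical figures, chosen only to illustrate the effect
countries = {
    'Smallrich': {'gdp_per_capita': 50000, 'population': 500_000},
    'Midland': {'gdp_per_capita': 20000, 'population': 40_000_000},
    'Bigpoor': {'gdp_per_capita': 3000, 'population': 1_000_000_000},
}

# Total GDP = GDP per capita * population
total_gdp = {name: c['gdp_per_capita'] * c['population'] for name, c in countries.items()}

def ranks(scores):
    """Rank names from highest score (rank 1) to lowest."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}

rank_per_capita = ranks({name: c['gdp_per_capita'] for name, c in countries.items()})
rank_total = ranks(total_gdp)

for name in countries:
    print(name, 'per capita rank:', rank_per_capita[name], 'total rank:', rank_total[name])
```

The most populous country is last per capita but first in total, which is the same pattern the paragraph above describes.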

In [None]:
Rank1 = df[['Country','Total_GDP ($)']].sort_values('Total_GDP ($)', ascending=False).reset_index()
Rank2 = df[['Country','GDP ($ per capita)']].sort_values('GDP ($ per capita)', ascending=False).reset_index()
Rank1 = pd.Series(Rank1.index.values+1, index=Rank1.Country)
Rank2 = pd.Series(Rank2.index.values+1, index=Rank2.Country)
Rank_change = (Rank2-Rank1).sort_values(ascending=False)
print('rank of GDP per capita - rank of total GDP:')
Rank_change.loc[top_gdp_countries.Country]

### Correlation of Total GDP with all the factors

In [None]:
# correlate numeric columns only; Country and Region are text
df.select_dtypes(include='number').corr()

### Heatmap Representation of the same

In [None]:
plt.figure(figsize=(16,12))
ax=plt.axes()
sns.heatmap(data=df.iloc[:,2:].corr(),annot=True,fmt='.2f',cmap='coolwarm',ax=ax)
ax.set_title('Heatmap of all the Correlated values')
plt.show()

In [None]:
# choose attributes which shows relation
x = df[['Total_GDP ($)','Population','Area (sq. mi.)','GDP ($ per capita)','Literacy (%)','Phones (per 1000)','Service','Infant mortality (per 1000 births)','Birthrate','Agriculture']]

### Heatmap of Highly correlated values of Total GDP and GDP per Capita

In [None]:
# show corr of the same
plt.figure(figsize=(9,7))
ax=plt.axes()
sns.heatmap(x.corr(), annot=True, cmap='coolwarm',ax=ax)
ax.set_title('Heatmap of all the highly Correlated values')
plt.show()

# <p style="text-align: center;">Conclusion</p>
    
1. The given dataset is right-skewed, and hence the median is the appropriate measure of central tendency.
2. GDP per capita is highly correlated with phones, services and literacy rate (positively) and with infant mortality rate, agriculture and birthrate (negatively).
3. When grouped by region, GDP per capita is positively correlated with phones and services: regions where people own more phones tend to have a higher GDP per capita, and the larger the service sector, the higher the GDP per capita.
4. For highly correlated factors, the density distribution is mostly skewed.
5. Climate has no effect on GDP per capita.
6. For multiple linear regression, RMSE(test)=-- and MSLE(test)=--. The RMSE is low relative to the range of GDP per capita (55000), which suggests that the model is a good predictor and lets us make theoretical claims. A lower MSLE likewise indicates a good estimator, i.e. that the test data fits the regression line well.
7. For simple linear regression (phones per 1000 vs GDP per capita), both the RMSE and MSLE values are good. Phones per 1000 is a good predictor of GDP per capita.
8. By total GDP, countries like India and China, which have a low GDP per capita (rank 146 and rank 118 respectively), jump to positions 4 and 2 respectively. This shows that although their total GDP is high, their GDP per capita is low.
9. Countries with a high total GDP are quite different from countries with a high GDP per capita.
10. Total GDP is highly correlated with Area and Population.
11. Factors that were highly correlated with GDP per capita have almost no correlation with total GDP, except for Phones per 1000, which has a correlation of 0.23.

# <p style="text-align: center;">Contribution</p>
As this was a learning assignment, the majority of the code has been taken from the GitHub account of the professor Nik Brown.
    
- Code by self : 25%
- Code from external Sources : 75%

# <p style="text-align: center;">Citation</p>
1. https://github.com/nikbearbrown/INFO_6105/blob/master/Assignments/Countries_of_the_World_EDA_Assignment_1.ipynb - GitHub Account of Professor 
   <a id="1"></a>
2. https://www.kaggle.com/stieranka/predicting-gdp-world-countries - Kaggle Kernel on the same dataset
   <a id="2"></a>
3. https://www.youtube.com/watch?v=E5RjzSK0fvY&t=394s - Youtube video of Linear Regression
   <a id="3"></a>

# <p style="text-align: center;">License</p>
Copyright (c) 2019 Rushabh Nisher

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.