# 1.Introduction

The goal of this notebook is to **prepare universal geographical data** using a huge spatial data set.

* First, I clean the 11-million-record data set and approximate missing values such as elevation
* Second, I plot the resulting data set and check whether it looks reasonable
* Third, I merge the data set with population information and sample from it on that basis

As a result, I create a big data set containing location (latitude, longitude), geographical elevation and density, which can be used as an auxiliary data set for any other analysis.

**Most of the code is hidden for clarity. Click "unhide" to look at it!**

# 2.Data preparation

In [None]:
import numpy as np
import pandas as pd
import random
import os
import matplotlib.pyplot as plt
import seaborn as sns 
from mpl_toolkits.basemap import Basemap
import math
from math import cos, asin, sqrt
from numpy import nansum
from numpy import nanmean
import netCDF4
%matplotlib inline

First, I import the libraries. The underlying data set, a list of locations all over the Earth, is very big: about 11 million records. I decide to load all 11 million rows, but the number of rows to load (`s`) can be changed below.

In [None]:
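n = 11061987  # number of records in the file
s = 11061987  # number of records to load; lower it to read a random subsample
# Line 0 of the file is the header, so only lines 1..n may be skipped
skip = sorted(random.sample(range(1, n + 1), n - s))

df_path = "../input/geonames-database/geonames.csv"
df = pd.read_csv(df_path, index_col='geonameid', skiprows=skip)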
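
To illustrate how the `skiprows` trick behaves when fewer rows are requested, here is a minimal sketch on a tiny in-memory CSV. The file content, `n_rows` and `keep` are made up for the illustration; only `pd.read_csv(..., skiprows=...)` mirrors the call above.

```python
import io
import random

import pandas as pd

# Hypothetical stand-in for geonames.csv: a header line plus 10 data rows
csv_text = "geonameid,name\n" + "\n".join(f"{i},place_{i}" for i in range(1, 11))

n_rows = 10  # data rows in the file (header not counted)
keep = 4     # rows we want to load

random.seed(0)  # fixed seed so the subsample is reproducible
# File line 0 is the header, so only lines 1..n_rows are candidates for skipping
skip = sorted(random.sample(range(1, n_rows + 1), n_rows - keep))

subset = pd.read_csv(io.StringIO(csv_text), index_col="geonameid", skiprows=skip)
print(subset)
```

Because the skipped lines are chosen before reading, the rows that are not needed never have to be parsed into the DataFrame.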

Next, I check the missing entries across the whole data set.

In [None]:
# Split the columns into categorical and numeric ones by dtype
C = (df.dtypes == 'object')
CategoricalVariables = list(C[C].index)
Integer = (df.dtypes == 'int64')
Float   = (df.dtypes == 'float64')
NumericVariables = list(Integer[Integer].index) + list(Float[Float].index)

# Share of missing cells in the whole table (df.size = rows * columns)
Missing_Percentage = df.isnull().sum().sum() / df.size * 100

print("The share of missing entries: " + str(round(Missing_Percentage,2)) + " %")

In [None]:
All_NaN = df.isnull().sum()
RowsCount = len(df.index)

print("The percentage of missing entries per variable:\n", format(round(All_NaN/RowsCount * 100,5)) )
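
The same two figures can also be obtained more compactly. The sketch below uses a small made-up frame (`toy` and its values are assumptions, not the real data) to show the per-column and overall share of missing cells.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps, standing in for the real data
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "elevation": [120.0, np.nan, np.nan, 45.0],
    "cc2": [np.nan, np.nan, np.nan, "PL"],
})

# isnull().mean() is the fraction of missing cells per column
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)

# Overall share of missing cells in the whole frame
overall_pct = toy.isnull().to_numpy().mean() * 100
print(round(overall_pct, 2), "%")
```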

Let's list some cleaning decisions:
* I drop `cc2` and the admin codes above level 1
* I estimate elevation from the surrounding area
* I drop `alternatenames`

In [None]:
df=df.drop(['alternatenames','admin2 code','admin3 code','admin4 code','cc2'], axis=1)

Next, I estimate elevation from local neighbours. First, I define a helper function that measures the distance between two points:

In [None]:
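def distance(lat1, lon1, lat2, lon2):
    # Great-circle (haversine) distance in kilometres between two points given in degrees
    p = 0.017453292519943295  # pi / 180, converts degrees to radians
    a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p)*cos(lat2*p) * (1-cos((lon2-lon1)*p)) / 2
    return 12742 * asin(sqrt(a))  # 12742 km = 2 * Earth's mean radius

print("The length is circa: "+ format(round(distance(51,14,55,24))) + " kilometers.")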

It lets us calculate the distance between two points with the haversine formula. Above, I checked the distance between two points whose latitudes and longitudes roughly span the Polish territory. As I expected, the distance between the furthest points is around 800 kilometers, so the function works. I will use it in a later part of the analysis. For now, a simple approximation will be enough.
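
For whole columns of coordinates, a vectorised variant may be handy. The sketch below is an assumption about how one could write it with NumPy, not part of the original notebook. It uses the textbook form of the haversine formula, which is algebraically the same as `distance` because `0.5 - cos(x)/2 = sin²(x/2)`; the name `haversine_km` is hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; 2 * 6371 = 12742, the constant in distance()

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or NumPy arrays given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Same check as above: two points roughly spanning Poland
d = haversine_km(51, 14, 55, 24)
print(round(float(d)), "km")

# Array input: distances for several point pairs at once
batch = haversine_km(np.array([51, 0]), np.array([14, 0]),
                     np.array([55, 0]), np.array([24, 1]))
print(batch)
```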

To approximate elevation, I apply a simplified grid of latitude and longitude in which each value is rounded to a whole degree (for example: 42.523432 E = ~43 E).

I found that rounding in that way is still not enough. The elevation data is so incomplete that I decide to round up to the next even degree instead (for example: 42.523432 E = ~44 E).

In [None]:
def round_up_to_even(f):
    return math.ceil(f / 2.) * 2
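
A quick check of how this helper maps a few coordinates onto the 2-degree grid. The sample values are arbitrary; the function body is repeated only to keep the snippet self-contained.

```python
import math

def round_up_to_even(f):
    # Same rule as the helper above: the smallest even integer >= f
    return math.ceil(f / 2.) * 2

for value in (42.523432, 41.2, 44.0, -3.7):
    print(value, "->", round_up_to_even(value))
```

Note that the helper always rounds up, so 42.523432 lands in the cell labelled 44, and negative values move towards zero (-3.7 becomes -2).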

In [None]:
# Snap every coordinate to the 2-degree grid (vectorised equivalent of round_up_to_even)
df["latitude_app"] = (np.ceil(df["latitude"] / 2) * 2).astype(int)
df["longitude_app"] = (np.ceil(df["longitude"] / 2) * 2).astype(int)

# Mean elevation per grid cell; mean() skips missing values by default
elevation_table = df.groupby(['latitude_app', 'longitude_app'])['elevation'].mean().reset_index()

# After the merge, 'elevation_x' is the raw value and 'elevation_y' is the cell mean
df = pd.merge(df, elevation_table, on=['latitude_app', 'longitude_app'], how='inner')
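
The grid-cell averaging can be seen on a handful of rows. The sketch below uses made-up values and `groupby(...).transform("mean")`, which is an alternative to the groupby-then-merge used above. The column `elevation_est` and the fallback of 840 for empty cells are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

# Made-up rows: one cell with partly known elevation, one cell with none at all
toy = pd.DataFrame({
    "latitude_app":  [52, 52, 52, 54, 54],
    "longitude_app": [20, 20, 20, 18, 18],
    "elevation":     [100.0, np.nan, 140.0, np.nan, np.nan],
})

# Mean elevation of each grid cell, broadcast back to every row of that cell
cell_mean = toy.groupby(["latitude_app", "longitude_app"])["elevation"].transform("mean")

# Cells without any known elevation stay NaN and receive a global fallback value
toy["elevation_est"] = cell_mean.fillna(840)
print(toy)
```

As in the merge-based version, every row ends up with the mean of its cell, so known individual elevations are smoothed as well.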

In [None]:
print("Still NAs in 'elevation': " + format(round(df['elevation_y'].isnull().mean() * 100, 2)) + " %.")

Some regions (like deserts and polar areas) are completely empty. There I apply the world average elevation, which is 840 m, more than I expected.

In [None]:
WorldAverageElevation = 840
df['elevation_y']=df['elevation_y'].fillna(WorldAverageElevation)

df=df.drop(['elevation_x'], axis=1)
df

It is not very obvious (at least to me) what all these country abbreviations mean. I identify them as the *universal Alpha-2 codes* used all over the world. I decide to match them to plain country names using a translation table based on the ISO 3166 standard, which contains the two-letter Alpha-2 codes.

In [None]:
ISO = pd.read_csv('../input/alpha-country-codes/Alpha__2_and_3_country_codes.csv', sep=';')

# Strip trailing whitespace from the country names
ISO['Country'] = ISO['Country'].str.rstrip()

ISO_toMerge = ISO.drop(['Alpha-3 code','Numeric'], axis=1)
ISO_toMerge=ISO_toMerge.rename(columns={"Alpha-2 code": "country code"})
df = pd.merge(df, ISO_toMerge,  on ='country code',  how ='inner')

df

Alright, my base spatial data is ready for use.

# 3.Data analysis

For memory reasons, let's plot a sample of 100,000 records.

In [None]:
df_sample = df.sample(n=100000)

plt.figure(1, figsize=(12,6))
m1 = Basemap(projection='merc',llcrnrlat=-60,urcrnrlat=65,llcrnrlon=-180,urcrnrlon=180,
             lat_ts=0,resolution='c')

m1.fillcontinents(color='#191919',lake_color='#000000') 
m1.drawmapboundary(fill_color='#000000')                
m1.drawcountries(linewidth=0.2, color="w")              

# Plot the data
mxy = m1(df_sample["longitude"].tolist(), df_sample["latitude"].tolist())
m1.scatter(mxy[0], mxy[1], s=3, c="#1292db", lw=0, alpha=1, zorder=5)

plt.title("Sample of 100,000 locations in the world")
plt.show()

Alright, the number of points more or less reflects the world's population density as well. Next, I zoom in on Europe in the same way.

In [None]:
lon_min, lon_max = -10, 40
lat_min, lat_max = 35, 65

idx_europe = (df["longitude"]>lon_min) &\
            (df["longitude"]<lon_max) &\
            (df["latitude"]>lat_min) &\
            (df["latitude"]<lat_max)

df_europe = df[idx_europe].sample(n=100000)

plt.figure(2, figsize=(12,6))
m2 = Basemap(projection='merc',llcrnrlat=lat_min,urcrnrlat=lat_max,llcrnrlon=lon_min,
             urcrnrlon=lon_max,lat_ts=35,resolution='c')

m2.fillcontinents(color='#191919',lake_color='#000000') 
m2.drawmapboundary(fill_color='#000000')                
m2.drawcountries(linewidth=0.2, color="w")              

mxy = m2(df_europe["longitude"].tolist(), df_europe["latitude"].tolist())
m2.scatter(mxy[0], mxy[1], s=5, c="#1292db", lw=0, alpha=0.05, zorder=5)

plt.title("Sample of 100,000 locations in Europe")
plt.show()

In this case, the situation is different than expected. Areas like the UK and Benelux are not dense enough. Surprisingly, Norway, which has a very low population, received a large number of points. I can see that these values do not reflect population density after all.

In [None]:
# Count locations per country: 'name' holds the count, 'Percentage' the share of all rows (as a fraction)
Aggregated = df.groupby('Country')[['name']].count().sort_values('name', ascending=False)
Aggregated['Percentage'] = (Aggregated['name'] / df.shape[0]).round(2)
Aggregated

The USA, China and India, or even Mexico, make sense at the top. Norway is not expected. I have an idea: let's use this data set but sample each country's points according to its population. In other words, I use the given points, but I sample them by population density. So, compared with the table above, both China and India will grow, the USA and Mexico will remain high, and Norway and other small countries will drop a lot.

I merge the population data with our ISO data. Of course, this requires some corrections (due to differences in how countries are named).

In [None]:
Population = pd.read_csv('../input/population-by-country-2020/population_by_country_2020.csv')

Population = Population.rename(columns={"Country (or dependency)": "Country"})

# Align the country names with the spelling used in the ISO table
Population['Country'] = Population['Country'].replace({
    "Czech Republic (Czechia)": "Czechia",
    "United States": "United States of America",
    "United Kingdom": "United Kingdom of Great Britain and Northern Ireland",
    "Vietnam": "Viet Nam",
    "Laos": "Lao People Democratic Republic",
    "State of Palestine": "Palestine",
    "North Macedonia": "Republic of North Macedonia",
    "Russia": "Russian Federation",
    "Syria": "Syrian Arab Republic",
})

I check what share of the total population was successfully merged after the corrections. OK, 97.6% is enough for me.

In [None]:
Population_Merged = pd.merge(ISO_toMerge, Population, on='Country', how='inner')

# Each country's share of the merged total population
Population_Merged['Population Perc'] = Population_Merged['Population (2020)'] / Population_Merged['Population (2020)'].sum()

# Share of the original population total that survived the merge
print(format(round(Population_Merged['Population (2020)'].sum() / Population['Population (2020)'].sum(), 3)))
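
To see which names still fail to match, a left merge with `indicator=True` can list them. The sketch below runs on two tiny made-up tables (`iso_toy`, `pop_toy`); the population figures are arbitrary toy numbers, not real data.

```python
import pandas as pd

# Made-up stand-ins for the ISO and population tables
iso_toy = pd.DataFrame({"Country": ["Czechia", "Viet Nam", "Poland"]})
pop_toy = pd.DataFrame({
    "Country": ["Czech Republic (Czechia)", "Vietnam", "Poland"],
    "Population (2020)": [10, 97, 38],  # arbitrary toy values
})

# Only one of the two mismatched names is corrected here, on purpose
pop_toy["Country"] = pop_toy["Country"].replace({"Vietnam": "Viet Nam"})

# indicator=True adds a '_merge' column telling where each row was found
check = pop_toy.merge(iso_toy, on="Country", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "Country"].tolist()
print("Unmatched:", unmatched)

# Share of the total population covered by the matched rows
matched_pop = check.loc[check["_merge"] == "both", "Population (2020)"].sum()
coverage = matched_pop / pop_toy["Population (2020)"].sum()
print("Coverage:", round(coverage, 3))
```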

The next steps:
* For each country, I divide its population by the total population
* I introduce a sample size of 1,000,000 and multiply it by this ratio. In that way, every country receives the number of rows it should have
* This ratio is applied to the main data set to determine how many points should be sampled for each country
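
The idea behind these steps can be sketched on a toy example with two hypothetical countries, where "A" has many records but a small population and "B" the opposite. All names and numbers below are assumptions made for the illustration.

```python
import pandas as pd

# Toy data: country A is over-represented in the records, B is under-represented
points = pd.DataFrame({"Country": ["A"] * 90000 + ["B"] * 10000})
pop_share = pd.Series({"A": 0.2, "B": 0.8})  # hypothetical population shares

# Per-row weight = population share of the country / number of rows of that country
rows_per_country = points["Country"].map(points["Country"].value_counts())
points["weight"] = points["Country"].map(pop_share) / rows_per_country

# pandas normalises the weights itself; sampling is without replacement by default
sampled = points.sample(n=1000, weights="weight", random_state=0)

raw_share = points["Country"].value_counts(normalize=True)
sampled_share = sampled["Country"].value_counts(normalize=True)
print("Share of B in the raw records:", raw_share["B"])
print("Share of B in the sample:     ", sampled_share["B"])
```

Because rows are drawn without replacement, the sampled shares only approximate the population shares, and the approximation gets worse when a populous country has few records to draw from.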

In [None]:
Sample_Size = 1000000

# Add the number of records per country ('name') to the population table
Population_Merged = pd.merge(Population_Merged, Aggregated, on='Country', how='inner')

# Per-record weight: population share divided by the number of records in that country
Population_Merged['Sample size'] = Population_Merged['Population Perc'] / Population_Merged['name'] * Population_Merged['name'].sum() * Sample_Size

Population_toMerge = Population_Merged[['Country', 'Sample size']]

df = pd.merge(df, Population_toMerge, on='Country', how='inner')

# Normalise the weights so that they sum to 1
df['Sample size'] = df['Sample size'] / df['Sample size'].sum()

# Draw the population-weighted sample
df_sampled = df.sample(n=Sample_Size, weights='Sample size')

In [None]:
# Share of sampled records per country, to compare with the population share
Aggregated = df_sampled.groupby('Country')[['name']].count().sort_values('name', ascending=False)
Aggregated['Sampled records'] = (Aggregated['name'] / df_sampled.shape[0]).round(2)

Population_toMerge_2 = Population_Merged[['Country', 'Population Perc']]

Aggregated = pd.merge(Aggregated, Population_toMerge_2, on='Country', how='inner')
Aggregated

In [None]:
lat_min, lat_max = 35, 65

idx_europe = (df_sampled["longitude"]>lon_min) &\
            (df_sampled["longitude"]<lon_max) &\
            (df_sampled["latitude"]>lat_min) &\
            (df_sampled["latitude"]<lat_max)

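# Cap the sample at the number of available European rows to avoid a ValueError
df_sampled_europe = df_sampled[idx_europe].sample(n=min(100000, int(idx_europe.sum())))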

plt.figure(2, figsize=(12,6))
m2 = Basemap(projection='merc',llcrnrlat=lat_min,urcrnrlat=lat_max,llcrnrlon=lon_min,
             urcrnrlon=lon_max,lat_ts=35,resolution='c')

m2.fillcontinents(color='#191919',lake_color='#000000') 
m2.drawmapboundary(fill_color='#000000')                
m2.drawcountries(linewidth=0.2, color="w")              

mxy = m2(df_sampled_europe["longitude"].tolist(), df_sampled_europe["latitude"].tolist())
m2.scatter(mxy[0], mxy[1], s=5, c="#1292db", lw=0, alpha=0.05, zorder=5)

plt.title("Population-weighted sample of locations in Europe")
plt.show()

And that is it. The map now reflects population density correctly, thanks to the weighting vector.

**Perfect stage to do further analysis!**