Proving the randomness of data using CitiPy

The largest hurdle I could see to using random latitudes and longitudes to call OpenWeatherMap is that 71% of the earth is water. Any lat/long that fell in open water too far from shore was likely to come back from OpenWeatherMap as "not found". By passing the coordinates through CitiPy first, you can map those that fall in the ocean to the nearest city on land. You might still get some cities that are "not found" on OpenWeatherMap, as well as duplicate cities (most likely because the same city is the closest one on land to two different sets of ocean coordinates). Overall, though, passing coordinates through CitiPy gives you cleaner data to hand to OpenWeatherMap. But does CitiPy's method of locating the closest city keep the cities spread out over the world?
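To make the duplicate-city effect concrete, here is a minimal sketch of nearest-city mapping. It does not call CitiPy. The four-city list and the `nearest_city` helper are hypothetical stand-ins, and the haversine lookup is only an assumption about how a nearest-city search could work, not a description of CitiPy's internals.

```python
import math
import random

# Hypothetical stand-in for a city database: (name, country, lat, lon).
# Coordinates are approximate and for illustration only.
CITIES = [
    ("honolulu", "US", 21.31, -157.86),
    ("lisbon", "PT", 38.72, -9.14),
    ("cape town", "ZA", -33.92, 18.42),
    ("perth", "AU", -31.95, 115.86),
]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes
    return 2 * 6371.0 * math.asin(math.sqrt(min(1.0, a)))

def nearest_city(lat, lon):
    """Return the entry in CITIES closest to the given coordinates."""
    return min(CITIES, key=lambda c: haversine_km(lat, lon, c[2], c[3]))

# Draw random coordinates the same way the notebook's premise describes
random.seed(0)
points = [(random.uniform(-90, 90), random.uniform(-180, 180)) for _ in range(50)]
names = [nearest_city(lat, lon)[0] for lat, lon in points]
unique_names = set(names)
print(len(points), "points ->", len(unique_names), "unique cities")
```

Two different mid-Pacific points, such as (20, -160) and (25, -150), both resolve to the same city here, which is the kind of duplicate that needs dropping before calling OpenWeatherMap.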

(additional resource: https://dev.maxmind.com/geoip/legacy/codes/country_continent/)

In [3]:
import pandas as pd

In [4]:
cities_df = pd.read_csv("WeatherPy.csv")

Out of 195 countries in the world, how many did our random coordinates find?

In [5]:
cities_df["Country"].nunique()

113

In [6]:
continents_df = pd.read_csv("country_continent.csv")
continents_df = continents_df.rename(columns={"iso 3166 country":"Country","continent code":"Continent"})
continents_df.set_index('Country', inplace=True)

In [7]:
coastal_df = cities_df[["Country","City ID"]]
coastal_df = coastal_df.groupby(["Country"]).count()
coastal_df = coastal_df.sort_values(["City ID"],ascending = False)
coastal_df = coastal_df.rename(columns={"City ID": "Cities in Random"})

In [8]:
coastal_df = coastal_df.join(continents_df)
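Because `DataFrame.join` matches on the index and defaults to a left join, any country code missing from the lookup file would get NaN for its continent rather than raise an error. A small sketch with toy data shows how to spot such rows. The frames and the made-up code "XX" are illustrative assumptions, not values from WeatherPy.csv.

```python
import pandas as pd

# Toy stand-ins for coastal_df and continents_df; values are illustrative only
toy_counts = pd.DataFrame(
    {"Cities in Random": [3, 2, 1]},
    index=pd.Index(["RU", "US", "XX"], name="Country"),  # "XX" is a made-up code
)
toy_continents = pd.DataFrame(
    {"Continent": ["Europe", "North America"]},
    index=pd.Index(["RU", "US"], name="Country"),
)

# join() aligns on the index and keeps every row of the left frame
joined = toy_counts.join(toy_continents)

# Rows with no match in the lookup get NaN in the joined column
unmatched = joined[joined["Continent"].isna()]
print(unmatched.index.tolist())
```

Running the same `isna()` check on the real `coastal_df` would confirm that every country was assigned a continent before counting them.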

I would expect to see more cities from the largest countries, as well as from the countries with the longest coastlines (since ocean coordinates get mapped to the nearest city on land).

In [11]:
# Hand-entered world ranks for the ten countries with the most cities ('unk' = not looked up)
rank = ['3rd','8th','1st','17th','6th','2nd','11th','unk','4th','14th']
mass = ['1st','3rd','2nd','5th','6th','unk','4th','8th','unk','14th']
# Pad with 'unk' so each list has one entry per row of coastal_df
padding = ['unk'] * (len(coastal_df) - len(rank))
coastal_df["Rank in World (coastline)"] = rank + padding
coastal_df["Rank in World (mass)"] = mass + padding
coastal_df.head(10)

Country,Cities in Random,Continent,Rank in World (coastline),Rank in World (mass)
RU,62,Europe,3rd,1st
US,48,North America,8th,3rd
CA,36,North America,1st,2nd
BR,28,South America,17th,5th
AU,18,Oceania,6th,6th
ID,15,Asia,2nd,unk
CN,15,Asia,11th,4th
AR,12,South America,unk,8th
PH,11,Asia,4th,unk
MX,10,North America,14th,14th


But are the countries, regardless of size, spread across all regions of the earth?

In [12]:
coastal_df["Continent"].value_counts()

Africa           32
Asia             26
Europe           20
North America    15
South America    10
Oceania          10
Name: Continent, dtype: int64
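To put these counts in proportion, a short sketch converts them to percentage shares. The numbers are copied from the output above, and the dictionary name is just for illustration.

```python
# Country counts per continent, copied from the value_counts() output above
countries_per_continent = {
    "Africa": 32,
    "Asia": 26,
    "Europe": 20,
    "North America": 15,
    "South America": 10,
    "Oceania": 10,
}

# The counts should add up to the 113 unique countries found earlier
total = sum(countries_per_continent.values())

# Share of the countries found that belong to each continent, in percent
shares = {c: round(100 * n / total, 1) for c, n in countries_per_continent.items()}
for continent, pct in shares.items():
    print(f"{continent:<14}{pct:>5}%")
```

In the notebook itself, `coastal_df["Continent"].value_counts(normalize=True)` should give the same proportions as fractions.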

CitiPy, fed randomly generated lat/long pairs, returned cities from 6 of the 7 continents. Antarctica has no permanent cities, so it produced no results. I therefore conclude that passing coordinates through CitiPy to make the OpenWeatherMap calls easier still gave us appropriately random and diverse data.