Proving the randomness of data using CitiPy

The largest hurdle I could see in calling OpenWeatherMap with random latitudes and longitudes is that 71% of the earth is water.  Any lat/long that fell in open water, too far from shore, was likely to come back from OpenWeatherMap as "not found".  By passing the coordinates through CitiPy first, you can map the ones that fall in the ocean to the nearest city on land.  You might still get some cities that are "not found" on OpenWeatherMap, as well as duplicate cities (most likely because one city is the closest city on land to two different sets of ocean coordinates).  Overall, though, passing coordinates through CitiPy gives you cleaner data to hand to OpenWeatherMap.  But does CitiPy's method of locating the closest city keep the cities spread out over the world?
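
To make the duplicate-city effect concrete, here is a minimal sketch of a nearest-city lookup.  This is not CitiPy's own code: the three-city list, the sample ocean points, and the helper names `haversine_km` and `nearest_city` are all assumptions made up for illustration, and the distance is a plain great-circle (haversine) calculation.

```python
import math
from collections import Counter

# Hypothetical mini-gazetteer (name, latitude, longitude), for illustration only.
CITIES = [
    ("Honolulu", 21.31, -157.86),
    ("Lima", -12.05, -77.04),
    ("Reykjavik", 64.15, -21.94),
]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def nearest_city(lat, lon):
    """Name of the listed city closest to (lat, lon)."""
    return min(CITIES, key=lambda c: haversine_km(lat, lon, c[1], c[2]))[0]

# Two mid-Pacific points and one North Atlantic point, none of them on land.
ocean_points = [(15.0, -160.0), (25.0, -150.0), (60.0, -25.0)]
matches = [nearest_city(lat, lon) for lat, lon in ocean_points]

# Any city matched more than once would show up as a duplicate row later.
duplicates = [name for name, n in Counter(matches).items() if n > 1]
print(matches)
print(duplicates)
```

Both Pacific points land on the same city, which is the kind of duplicate that then has to be dropped before calling OpenWeatherMap.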

(additional resource: https://dev.maxmind.com/geoip/legacy/codes/country_continent/)

In [10]:
import pandas as pd

In [11]:
cities_df = pd.read_csv("WeatherPy.csv")

Out of 195 countries in the world, how many did our random coordinates find?

In [12]:
cities_df["Country"].nunique()

109
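
As a quick sanity check on that number, the share of countries covered can be worked out directly.  This is only arithmetic on the two figures quoted above (109 found, 195 total); the variable names are my own.

```python
# Figures taken from the notebook text and output above.
countries_found = 109
countries_total = 195

coverage = countries_found / countries_total
print(f"{coverage:.1%} of countries represented")
```

So a little over half of the world's countries show up in the sample.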

In [13]:
continents_df = pd.read_csv("country_continent.csv")
continents_df = continents_df.rename(columns={"iso 3166 country":"Country","continent code":"Continent"})
continents_df.set_index('Country', inplace=True)

In [14]:
coastal_df = cities_df[["Country","City ID"]]
coastal_df = coastal_df.groupby(["Country"]).count()
coastal_df = coastal_df.sort_values(["City ID"],ascending = False)
coastal_df = coastal_df.rename(columns={"City ID": "Cities in Random"})

In [15]:
coastal_df = coastal_df.join(continents_df)

I would expect to see more cities from the largest countries, as well as from the countries with the longest coastlines (since ocean coordinates get mapped to the nearest city on land).

In [23]:
# World ranks (coastline length, land area) for the ten countries with the most cities, in table order
rank = ['3rd','8th','1st','6th','17th','11th','16th','24th','7th','2nd']
mass = ['1st','3rd','2nd','6th','5th','4th','7th','8th','62nd','15th']
# Mark every remaining country as unknown so the lists match the number of rows
padding = ['unk'] * (len(coastal_df) - len(rank))
coastal_df["Rank in World (coastline)"] = rank + padding
coastal_df["Rank in World (area)"] = mass + padding
coastal_df.head(10)

         Cities in Random      Continent  Rank in World (coastline)  Rank in World (area)
Country
RU                     62         Europe                        3rd                   1st
US                     43  North America                        8th                   3rd
CA                     31  North America                        1st                   2nd
AU                     26        Oceania                        6th                   6th
BR                     24  South America                       17th                   5th
CN                     17           Asia                       11th                   4th
IN                     15           Asia                       16th                   7th
AR                     13  South America                       24th                   8th
NO                     12         Europe                        7th                  62nd
ID                     10           Asia                        2nd                  15th


But are the countries, regardless of size, spread across all regions of the earth?

In [24]:
coastal_df["Continent"].value_counts()

Africa           32
Asia             22
Europe           18
North America    15
South America    11
Oceania          11
Name: Continent, dtype: int64
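
To read those counts as proportions, a small sketch like the following works.  The numbers are copied by hand from the output above rather than recomputed from `coastal_df`, and they count countries per continent, not cities.

```python
# Country counts per continent, copied from the value_counts() output above.
continent_counts = {
    "Africa": 32,
    "Asia": 22,
    "Europe": 18,
    "North America": 15,
    "South America": 11,
    "Oceania": 11,
}

total = sum(continent_counts.values())
shares = {name: count / total for name, count in continent_counts.items()}

for name, share in shares.items():
    print(f"{name:<14}{share:6.1%}")
```

The total matches the 109 distinct countries found earlier, so every country in the sample was assigned a continent.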

Using random lat/long generation, CitiPy pulled cities from 6 of 7 continents.  Antarctica has no permanent cities, and thus no returns.  So I conclude that passing coordinates through CitiPy, to make the OpenWeatherMap calls easier, still gave us appropriately random and diverse data.