
## COMP5712M: Programming for Data Science


## Assignment 2: Cities and Earthquakes

### @author: Wasif Shaukat
#### Last Modified: 12th November 2022

### Do not import any other module besides those already provided in the notebook

## Question 1: World Cities

In this coursework exercise, you will download data from the provided link and read it in as a `CSV` file using the Pandas data analysis package for Python.
The data we will use contains a variety of information about cities from around the world.

### Coding Techniques for Working with `DataFrame`s

To complete these tasks you will need to access and filter a `DataFrame`. 
The `DataFrame` data structure has many convenient features for extracting and ordering information. Although conceptually it can be thought of as a comptuational
represention of a table, it is quite a complex data structure and takes a while
to master. The following questions can be done with only a small but powerful
set of `DataFrame` operations; and the following examples of typical forms of programming with `DataFrame`s should be useful 
for coding your answers.

#### Loading data from a CSV file into a `DataFrame`

`DataFrame`s are specifically designed to handle data organised in a tabular format.
Hence, as we would expect, since `CSV` is the standard format for tabular data, it is very easy to 
create a `DataFrame` by loading data from a `CSV` file.

* ```pandas.read_csv(source)``` ---
The  [read_csv function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv) can accept a filename or URL as an argument.

#### Getting the Data for this Question

Download the data file ```worldcities.csv``` from Minerva, and put
the file in the same directory as this Jupyter notebook file.
Then, by running the following cell, we can set the 
global variable ```WC_DF``` to a DataFrame containing the information from ```worldcities.csv```. 

In [3]:
## WC_DF Initialisation

import pandas  ## This is the module for creating and manipulating DataFrames

WC_DF = pandas.read_csv("worldcities.csv")
WC_DF
# WC_DF["city"]
# WC_DF[WC_DF["country"] == "United States"]

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
0,Tokyo,Tokyo,35.6850,139.7514,Japan,JP,JPN,Tōkyō,primary,35676000.0,1392685764
1,New York,New York,40.6943,-73.9249,United States,US,USA,New York,,19354922.0,1840034016
2,Mexico City,Mexico City,19.4424,-99.1310,Mexico,MX,MEX,Ciudad de México,primary,19028000.0,1484247881
3,Mumbai,Mumbai,19.0170,72.8570,India,IN,IND,Mahārāshtra,admin,18978000.0,1356226629
4,São Paulo,Sao Paulo,-23.5587,-46.6250,Brazil,BR,BRA,São Paulo,admin,18845000.0,1076532519
...,...,...,...,...,...,...,...,...,...,...,...
15488,Timmiarmiut,Timmiarmiut,62.5333,-42.2167,Greenland,GL,GRL,Kujalleq,,10.0,1304206491
15489,Cheremoshna,Cheremoshna,51.3894,30.0989,Ukraine,UA,UKR,Kyyivs’ka Oblast’,,0.0,1804043438
15490,Ambarchik,Ambarchik,69.6510,162.3336,Russia,RU,RUS,Sakha (Yakutiya),,0.0,1643739159
15491,Nordvik,Nordvik,74.0165,111.5100,Russia,RU,RUS,Krasnoyarskiy Kray,,0.0,1643587468


**Be sure to keep the same variable name** `WC_DF` **for this global variable, otherwise most of the
following code will not work and you will break the autograder.**

#### Checking the contents of a DataFrame

Pandas provides the following useful methods that enable you to quickly check the contents of a `DataFrame`:

* ```df.head()``` ---
  For a DataFrame object, ```df```, this method extracts the first 5 rows of data, so you can easily check what the data looks like.
  
* ```df.describe()``` --- for a DataFrame, ```df```, this method provides a table giving and overview of some basic statistical properties of the DataFrame.

Note that the ```head()``` and ```describe()``` methods are actually operations that 
return a new DataFrame object. If this value is returned by the last line of a cell
it will be displayed as a table, but if it is generated elsewhere in the code
you will not see any output unless you use the ```display``` function from the
```IPython.display``` module.

#### Accessing DataFrame columns and rows

Each column of a `DataFrame` is a list-like object called a
`Series`. Elements, and slices of a `Series` can then be accessed in similar
fashion to a list. The following illustrates how get the `Series` containing
the first 5 elements of the `city` column of `WC_DF`:

In [4]:
top_5_cities = WC_DF["city"][:5]   ## selects the first 5 items of the "city" column.
top_5_cities

0          Tokyo
1       New York
2    Mexico City
3         Mumbai
4      São Paulo
Name: city, dtype: object

In the above output, the left hand column of the displayed value of `top_5_cities` 
shows the index label of each element. One of the differences between a `Series` and an ordinary list is that, whereas a list always has integers for its index labels, a `Series` can have different kinds of values for these. For instance (though there
is no reason to do this for the current assignment) we could set the index values to alphabetic letters, as follows:

In [5]:
top_5_cities.index = list("abcde")
top_5_cities

a          Tokyo
b       New York
c    Mexico City
d         Mumbai
e      São Paulo
Name: city, dtype: object

In [6]:
top_5_cities.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

You can also use ```.values``` to return an `array` of the column values without
the index:

In [7]:
WC_DF["city"][:5].values

array(['Tokyo', 'New York', 'Mexico City', 'Mumbai', 'São Paulo'],
      dtype=object)

An `array` is also a list-like datastructure. It does not have an index. The main difference between a list and an `array` is that the list is optimised for
storing large amounts of information and for efficiently applying numerical and
other operations to all elements of the array. Hence, `array`s are usually preferred
to lists when handling large amounts of information, or when storing numerical
vectors.

You can also easily find the column names of the DataFrame using ```.columns```, for example:

In [8]:
WC_DF.columns

Index(['city', 'city_ascii', 'lat', 'lng', 'country', 'iso2', 'iso3',
       'admin_name', 'capital', 'population', 'id'],
      dtype='object')

__Note:__ The `Index` returned here is yet another type of list-like, object. It is similar to an array,
except that it is used for indexing a `Series` or `DataFrame`. You do not usually
need to create or deal with `Index` objects directly, since this is done automatically when you create and minipulate `DataFrame`s. So you will normally only see one, when
you want to look at the columns or rows of a `DataFrame`. But what you should be
aware of, when dealing with `DataFrames`, is that the word _index_ can refer to
several different types of thing.

In many cases you can treat `Series`, `array` and `Index` objects like lists and if you want to change them to an ordinary list you can just use the `list` operator,
as in the following:

In [9]:
list(WC_DF.columns)

['city',
 'city_ascii',
 'lat',
 'lng',
 'country',
 'iso2',
 'iso3',
 'admin_name',
 'capital',
 'population',
 'id']

We can refer to rows of a `DataFrame` either by the expression `DF.loc[label]`, where `label` is the index label of the row we want, or by `DF.iloc[n]`,
where `n` is an `int` giving the position of the row in the `DataFrame`.
In the case of `WC_DF`, the labels are integers, so we would get the same result using either. You could test this. You could also see the difference if you try finding a row of `top_5_cities` `DataFrame` defined above, after its index labels have been replaced by letters. In this case you could access rows either using letters, using `loc`, or by `int`s, using `iloc`.

In [10]:
WC_DF.iloc[1]

city               New York
city_ascii         New York
lat                 40.6943
lng                -73.9249
country       United States
iso2                     US
iso3                    USA
admin_name         New York
capital                 NaN
population       19354922.0
id               1840034016
Name: 1, dtype: object

#### Iterrating through the rows of a DataFrame
A convenient way of going through the rows of a `DataFrame` to perform some operation i by using the `iterrows` method in a `for` loop. This enables you to get both the index label and the row itself, for each successive row of the `DataFrame`. The following code is a simple example:

In [11]:
for i, row in WC_DF.iterrows():
    print(i, row['city_ascii'], row['lat'], row['lng'])
    if i >3: break

0 Tokyo 35.685 139.7514
1 New York 40.6943 -73.9249
2 Mexico City 19.4424 -99.131
3 Mumbai 19.017 72.857
4 Sao Paulo -23.5587 -46.625


#### Sorting the rows of a DataFrame

It is easy, and often very useful, to sort the DataFrame by column values using ```.sort_values```, for example:

In [12]:
WC_DF.sort_values(by=["country"], ascending=True)[:10] # Sorts countries by alphabet

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
10235,Karukh,Karukh,34.4868,62.5918,Afghanistan,AF,AFG,Herāt,minor,17484.0,1004546127
8527,Kōṯah-ye ‘As̲h̲rō,Kotah-ye `Ashro,34.45,68.8,Afghanistan,AF,AFG,Wardak,,35008.0,1004450357
3341,Shibirghān,Shibirghan,36.658,65.7383,Afghanistan,AF,AFG,Jowzjān,admin,93241.0,1004805783
6050,Khōst,Khost,33.3395,69.9204,Afghanistan,AF,AFG,Khōst,admin,,1004919977
5141,Maḩmūd-e Rāqī,Mahmud-e Raqi,35.0167,69.3333,Afghanistan,AF,AFG,Kāpīsā,admin,7407.0,1004151943
2088,Lashkar Gāh,Lashkar Gah,31.583,64.36,Afghanistan,AF,AFG,Helmand,admin,201546.0,1004765445
3210,Gardēz,Gardez,33.6001,69.2146,Afghanistan,AF,AFG,Paktiyā,admin,103601.0,1004468894
6249,Maīdān Shahr,Maidan Shahr,34.3956,68.8662,Afghanistan,AF,AFG,Wardak,admin,,1004798735
2105,Maīmanah,Maimanah,35.9302,64.7701,Afghanistan,AF,AFG,Fāryāb,admin,199795.0,1004622920
7399,Andkhōy,Andkhoy,36.9317,65.1015,Afghanistan,AF,AFG,Fāryāb,minor,71730.0,1004472345


##### Note on __encodings__ of the city name
there are two columns that hold the city name. The first column name is `'city'` and
the second is `city_ascii`. There are various different ways in which textual
information can be encoded into bytes. These days [Unicode characters](https://home.unicode.org/)
encoded using [UTF-8](https://en.wikipedia.org/wiki/UTF-8) are pretty standard.
But the older [ASCII](https://en.wikipedia.org/wiki/ASCII) code, which uses
a single byte per character is still commonly used. Unicode provides a huge
variaty of text characters and other symbols, whereas ASCII is quite 
limited (mainly to characters and symbols found in standard English). 
But ASCII and is simpler and in 
some ways easier to deal with than UTF-8. In the following questions you
will be asked to use the ASCII version of the city name (from the `city_ascii` column). This mainly just to make you aware that there are different encodings
of text strings, but it will also prevent cerain problems that could occur in the
Autograder, if different people used different encodings.

#### Filtering DataFrames
By _filtering_ we mean keeping some parts that we want and throwing away others.
Typically, we look for rows that match some condition; and the filter condition
is often some constraint involving the values for that row in one or more
columns.
`pandas` `DataFrame`s can  be filtered according values of a column by using a boolean expression, for example:

In [13]:
filtered_DF = WC_DF[ WC_DF['capital'] == 'admin'] # This keeps only administrative capitals
filtered_DF.head()

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
3,Mumbai,Mumbai,19.017,72.857,India,IN,IND,Mahārāshtra,admin,18978000.0,1356226629
4,São Paulo,Sao Paulo,-23.5587,-46.625,Brazil,BR,BRA,São Paulo,admin,18845000.0,1076532519
5,Delhi,Delhi,28.67,77.23,India,IN,IND,Delhi,admin,15926000.0,1356872604
6,Shanghai,Shanghai,31.2165,121.4365,China,CN,CHN,Shanghai,admin,14987000.0,1156073548
7,Kolkata,Kolkata,22.495,88.3247,India,IN,IND,West Bengal,admin,14787000.0,1356060520


This way of filtering is a very powerful and useful aspect of `DataFrames`. 
However, the
syntax of the filter operation is rather unusual and a bit difficult to understand.

What is happening can be explained by these steps in the way a filter expression
is evaluated:
*  `DF['label']` (where `DF` is any `DataFrame`), gives a `Series` corresponding
    to    the `'label'` column of `DF.
    
*  `a_series == val` is a special use of `==`. When a Boolean operator
    (such `==`, `<` etc.) is applied to a `Series` object
   the result is actually a `Series` of Boolean values (not a single Boolean).
   The new `Series` obtained will have the value `True` for each element where
   the original `Series` satisfies `element == val`, and `False` for the rest.
   
* `DF[ bool_series ]`, is a special kind of slice-like operation, where a boolean
   series is given as a selection argument to the `DataFrame`. It will return 
   a new DF, containing all the rows of `DF` for which `bool_series` has the 
   value `True`. (These rows can be quickly found because the `DataFrame` and
   the Boolean series both have the same `Index`.)
   
You do not necessarily need to follow all of that precise desciption of filtering
but it will be extremely helpful if you are able to construct filtering
operations similar to the above example. You will see another example below,
in relation to the earthquake data you will be processing.

### Overview of Question 1 tasks

This question requires you to write functions to carry out the following specific tasks. 

* __Question 1a__ --- Find all city names that include *non-ASCII* characters. __[1 Mark]__

* __Question 1b__ --- Find city names that occur multiple times in the dataset. __[2 Marks]__
* __Question 1c__ --- Create a dictionary of the number of cities in each country that are included in the dataset. __[1 Mark]__

* __Question 1d__ --- Create a `DataFrame` of the largest cities (by population) in the world. __[2 Marks]__
* __Quesiton 1e__  --- Find all cities in a given country whose population is above a given number. __[2 Marks]__

* __Question 1f__ --- Find the total population of people living in cities in a given country. __[2 Marks]__

Full details for each task are explained below.

### Question 1a

As we have seen, the names of some cities include accents or symbols that are not represented in the standard ASCII character set. Write a function `non_ascii_cities`, which returns a `Set` containing all the cities that occur in the `world_cities.csv` dataset whose name **cannot** be properly represented using only ASCII characters. 

Your function should not have any arguments. You should assume that the global variable WC_DF has already been initialised by running the first code cell in this notebook (see above).

In [14]:
# Question 1a answer cell

def non_ascii_cities():
    world_cities = WC_DF['city']
    world_cities = ([x for x in WC_DF['city'] if x.isascii() == False])
    return set(world_cities)
    # Modify to return a set of all non-ascii city names in the world_cities data

In [15]:
non_ascii_cities()

{'Bački Petrovac',
 'Alexandroúpoli',
 'Atasū',
 'Xi’an',
 'Bắc Kạn',
 'Kamen’-na-Obi',
 'Colón',
 'Pietà',
 'Medinīpur',
 'Šentjernej',
 'Lezhë',
 'Hà Giang',
 'Vilkaviškis',
 'Shahreẕā',
 'Chornomors’k',
 'Šilalė',
 'Wŏnsan',
 'An Nuhūd',
 'Hūn',
 'Kavála',
 'Gonaïves',
 'Asyūţ',
 'Sodankylä',
 'Behbahān',
 'Beppuchō',
 'Ḩafar al Bāţin',
 'Sept-Îles',
 'Rokiškis',
 'Alīgarh',
 'Labé',
 'Carúpano',
 'Būlaevo',
 'Botoşani',
 'Mazatlán',
 'Baturité',
 'Criciúma',
 'Türkmenabat',
 'Rạch Giá',
 'Palmeira dos Índios',
 'Tvøroyri',
 'Ciego de Ávila',
 'São Tomé',
 'Kaçanik',
 'Zárate',
 'Mežica',
 'Villazón',
 'San Ramón',
 'San José de Chiquitos',
 'San Ġwann',
 'Ríohacha',
 'L’gov',
 'Paraná',
 'Farāh',
 'Santa Maria da Vitória',
 'Délįne',
 'Ust’-Kamchatsk',
 'Qubadlı',
 'Bogotá',
 'Petrópolis',
 'Mālpils',
 'Târgovişte',
 'Bačka Palanka',
 'Khromtaū',
 'Cañas',
 'Örnsköldsvik',
 'Ōita',
 'Bilāspur',
 'Moravče',
 'Bến Tre',
 'Tyumen’',
 'Aïn Defla',
 'Coroatá',
 'Hāthras',
 'Jaú',
 'Salt

### Question 1b

One issue that you will discover if you investigate the `worldcities.csv` data is that there are many different cities that have the same name. You may find it interesting to discover which are the most common city names. But for this question you need to write a function `num_cities_occurring_n_times(n)`, such that:
    
* The argument `n` is an integer;
* The returns an integer, which is the number of different city names that occur `n` times in the `worldcities` dataset. 
* The function should consider the city name to be the value in the `city` column. In other words the name in the form that may contain non-ASCII characters.

In [16]:
# Question 1b answer cell

def num_cities_occurring_n_times(n):
    occur = WC_DF.groupby(['city']).size()
    return len ([x for x in occur if x == n])
# Modify to return a value according to the specification given above

In [17]:

num_cities_occurring_n_times(3)


199

### Question 1c

Write a function that returns a dictionary (a `dict` object), whose keys are all the country name strings that occur in the `worldcities` data and whose values are `int`s giving the number of cities of that country that are included in the dataset.



In [18]:
# Question 1c answer cell

def country_num_cities_dict():
    occur = WC_DF.groupby(['city']).size()
    return occur.to_dict()

In [19]:
# occur = WC_DF.groupby(['city']).size()
# dict_occur = occur.to_dict()
# dict_occur
country_num_cities_dict()

{'A Coruña': 1,
 'Aalborg': 1,
 'Aarau': 2,
 'Aarhus': 1,
 'Aasiaat': 1,
 'Aba': 1,
 'Abadan': 1,
 'Abadla': 1,
 'Abaetetuba': 1,
 'Abakaliki': 1,
 'Abakan': 1,
 'Abancay': 1,
 'Abaza': 1,
 'Abaí': 1,
 'Abbeville': 2,
 'Abbotsford': 1,
 'Abbottabad': 1,
 'Abdurahmoni Jomí': 1,
 'Abengourou': 1,
 'Abeokuta': 1,
 'Aberaeron': 1,
 'Aberdeen': 6,
 'Abertawe': 1,
 'Abhā': 1,
 'Abidjan': 1,
 'Abilene': 2,
 'Abim': 1,
 'Abingdon': 1,
 'Abington': 1,
 'Abohar': 1,
 'Aboisso': 1,
 'Abomey': 1,
 'Abong Mbang': 1,
 'Abra Pampa': 1,
 'Absecon': 1,
 'Abu Dhabi': 1,
 'Abuja': 1,
 'Abunã': 1,
 'Abéché': 1,
 'Acapulco de Juárez': 1,
 'Acaraú': 1,
 'Acarigua': 1,
 'Acatlán de Osorio': 1,
 'Accokeek': 1,
 'Accra': 1,
 'Achacachi': 1,
 'Achinsk': 1,
 'Acquaviva': 1,
 'Acton': 2,
 'Acushnet': 1,
 'Acworth': 1,
 'Ad Dammām': 1,
 'Ad Diwem': 1,
 'Ad Dīwānīyah': 1,
 'Ada': 3,
 'Adams': 1,
 'Adana': 1,
 'Addis': 1,
 'Addis Ababa': 1,
 'Addison': 2,
 'Adel': 1,
 'Adelaide': 1,
 'Adelaide River': 1,
 'Adelanto'

### Question 1d
Write a function `largest_cities_dataframe` that takes an `int` argument `n`  and uses the
pandas DataFrame `WC_DF` to return a new `DataFrame` containing `n` rows corresponding to
the `n` largest cities in terms of population size, in order of decreasing population size.

You should return a dataframe such that it has the same columns as the `WC_DF` and each
row has the same values as a corresponding row of `WC_DF`. It does not matter if the row
indexes are the same. (They may or may not be the same depending on the specific way
that you create the new DataFrame.)

In [20]:
# Question 1d answer cell

def largest_cities_dataframe(n):
    
#     wc_df_sorted = WC_DF.sort_values(by= 'population', ascending = False)
#     return wc_df_sorted.head(n)
# Above code is taking longer time so I have use following code
    return WC_DF.nlargest(n, columns = ["population"], keep = "all")
    # Modify to return a list of the n cities with the largest population

In [21]:
# wc_df_sorted = WC_DF.sort_values(by= 'population', ascending = False)
# wc_df_sorted.head(10)
largest_cities_dataframe(10)

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
0,Tokyo,Tokyo,35.685,139.7514,Japan,JP,JPN,Tōkyō,primary,35676000.0,1392685764
1,New York,New York,40.6943,-73.9249,United States,US,USA,New York,,19354922.0,1840034016
2,Mexico City,Mexico City,19.4424,-99.131,Mexico,MX,MEX,Ciudad de México,primary,19028000.0,1484247881
3,Mumbai,Mumbai,19.017,72.857,India,IN,IND,Mahārāshtra,admin,18978000.0,1356226629
4,São Paulo,Sao Paulo,-23.5587,-46.625,Brazil,BR,BRA,São Paulo,admin,18845000.0,1076532519
5,Delhi,Delhi,28.67,77.23,India,IN,IND,Delhi,admin,15926000.0,1356872604
6,Shanghai,Shanghai,31.2165,121.4365,China,CN,CHN,Shanghai,admin,14987000.0,1156073548
7,Kolkata,Kolkata,22.495,88.3247,India,IN,IND,West Bengal,admin,14787000.0,1356060520
8,Los Angeles,Los Angeles,34.1139,-118.4068,United States,US,USA,California,,12815475.0,1840020491
9,Dhaka,Dhaka,23.7231,90.4086,Bangladesh,BD,BGD,Dhaka,primary,12797394.0,1050529279


__NOTE:__ In answering __1d__ you may assume that no two cities have exactly the same population, which is almost but not quite certain, when dealing with large numbers like this. But, of course, when dealing with quantites where multiple data records could have the same value, we need to be careful, because this may not be the case.
For example, if we are interested in what equipement students own, we might think it would be informative
to find 'the top 10 students owning the most laptops'. In this case there could be: 1 student with 3 laptops, 23 students with 2 laptops, 160 with 1 laptop and 3 who do not own a laptop. In such a case it is not meaningful to pick the 'top 10' in terms of laptop ownership. A similar problem could potentiall occur with the earthquake data that we will look at later, because the earthquake magnitudes are only recorded to 1 decimal place.

### Question 1e
Define a function `big_cities_in_country( country, population)` that takes as arguments a string corresponding to the name of a country and an integer, which will referes to a population number.
The function should return a `list` of the form 
<center>
    <tt>[("city1", pop1), ("city2", pop2),... ]</tt>, 
</center>

where each pair
`("cityN", popN)` is a `tuple` consisting of the name, **in ASCII form**, of a city in the given `country`, followed by an `int`, which is the population of that city (according to the worldcities data). The list should include all and only those cities in the country whose population is greater than or equal to the given `popuplation` argument. The list should be ordered so that the `("cityN", popN)` items occur in **increasing** order of the population size `popN`.



In [22]:
# Complete question 1e in this cell

def big_cities_in_country(country, population):
    if type(country) != str and type(population) != int:
        return ('Please enter country as string and population as Integer')
    country = WC_DF.loc[WC_DF['country'] == country.title()]
    cities = country[country['population'] >= population]
    cities = cities.sort_values(by= 'population')
    tup =()
    lst = []
    for row in cities.itertuples():
        tup = (row[2],int(row[-2]))
        lst +=[tup]
    return lst
    # country is a string argument
   # pass # Edit this function to return a list, as specified above

In [23]:
big_cities_in_country("brazil", 1209091)

[('Vila Velha', 1209091),
 ('Vila Velha', 1209091),
 ('Niteroi', 1500513),
 ('Vitoria', 1704000),
 ('Santos', 1709000),
 ('Manaus', 1753000),
 ('Goiania', 2022000),
 ('Belem', 2167000),
 ('Campinas', 2791000),
 ('Curitiba', 3084000),
 ('Salvador', 3484000),
 ('Fortaleza', 3602319),
 ('Recife', 3651000),
 ('Brasilia', 3716996),
 ('Porto Alegre', 3917000),
 ('Belo Horizonte', 5575000),
 ('Rio de Janeiro', 11748000),
 ('Sao Paulo', 18845000)]

### Question 1f
Create a function that given a country name, returns an `int` which is the total population of 
people liveing in all the cities of that country, as given in `WC_DF`. 

**Hints:** 
* You can ignore cities for which no population value is recorded in `WC_DF`. You will need to be able to deal with `NaN` values. A distinctive feature of `NaN` values is that they have been defined so that `x != x` has the value `True` if `x` has a `NaN` value. 

In [24]:
## Question 1f Answer Code Cell

def country_total_cities_population(country):
    country = WC_DF.loc[WC_DF['country'] == country.title()]
    country = country.dropna(subset = "population")
    return int(country["population"].sum())

In [25]:
country_total_cities_population("pakistan")

41364699

### Question 2: Earthquakes - Web Access and Pandas DataFrames

In this coursework exercise, you will learn how to download live information
from the web and procress it using the Pandas data analysis package for Python.

The data we will use as an example is from the 
[United States Geological Survey (USGS)](https://www.usgs.gov/), which 
provides a wide range of geographic and geological information and data.
We shall be using their data relating to seismological 
events (i.e. Earthquakes) from around the world, which is published in the 
form of continually updated CSV files. Information about these feeds can
be found [here](https://earthquake.usgs.gov/earthquakes/feed/). The URL for the particular feed we shall be using is given below.

Questions Overview

* __Q2a__ --- Initialise a Pandas DataFrame by downloading earthquake data from the web. __[1 mark]__
* __Q2b__ --- Find earthquakes of a given magnitude or higher.           __[1 mark]__
* __Q2c__ --- Make a DataFrame showing the most powerful quakes          __[2 marks]__
* __Q2d__ --- Make a DataFrame showing distance of quakes from a given location __[3 marks]__
* __Q2e__ --- Identify all cities endangered by earthquakes   __[3 marks]__

### Question 2a: Read in data file
Read earthquake data from the USGS live feed CSV ```all_day.csv``` into a Pandas DataFrame.
The data can be obtained directly from  http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv and read into a Pandas DataFrame.

__Note:__ For this question you do not need to download and save the file ```all_day.csv```. It should
be loaded directly from the web feed. However, while testing, if you have no internet connection or
a bad connection you could download a copy of the file. But remember to put it back to downloading
the current one before you submit. Note also that ```all_day.csv``` is a live file, which lists
quakes recorded during the past 24 hours, and is updated every minute, so of course,
you will not always get the same file or the same results. More information about this and other
earthquake feeds provided by USGS can be found [here](https://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php).

In [26]:
# Q2a answer code cell

import pandas as pd    ## This is the module for creating and manupulating DataFrames

# Here we have assigned the url of the quake datasource to the global variable 
# 'QUAKE_SOURCE' for your convenience.
QUAKE_SOURCE = ( "http://earthquake.usgs.gov/" +
                 "earthquakes/feed/v1.0/summary/all_day.csv" )


QUAKE_DF = pd.read_csv(QUAKE_SOURCE)## Modify this line to import the data using Pandas

#### You can use the following cell to test if you have read the quake data into `QUAKE_DF`

In [27]:
## If QUAKE_DF is a DataFrame, show the first 5 rows
try:
    if type(QUAKE_DF) == pandas.DataFrame:
        display(QUAKE_DF.head())
    else:
        print("QUAKE_DF is not a DataFrame")
except:
    print("QUAKE_DF has not been assigned a value")

Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,...,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
0,2022-11-13T18:09:54.690Z,38.825165,-122.854332,1.79,0.37,md,8.0,130.0,0.01854,0.01,...,2022-11-13T18:11:31.682Z,"10km WNW of The Geysers, CA",earthquake,0.37,1.17,,1.0,automatic,nc,nc
1,2022-11-13T17:39:27.620Z,19.466499,-155.61467,-2.51,2.15,ml,25.0,65.0,,0.39,...,2022-11-13T17:44:58.260Z,"26 km E of Honaunau-Napoopoo, Hawaii",earthquake,0.52,0.58,4.06,9.0,automatic,hv,hv
2,2022-11-13T17:03:13.755Z,60.6533,-149.7103,30.4,2.1,ml,,,,0.68,...,2022-11-13T17:05:05.945Z,"19 km NNE of Cooper Landing, Alaska",earthquake,,0.5,,,automatic,ak,ak
3,2022-11-13T16:37:39.774Z,-8.2669,158.8432,101.54,4.7,mb,75.0,94.0,1.596,0.4,...,2022-11-13T17:29:12.040Z,"83 km W of Buala, Solomon Islands",earthquake,10.51,7.966,0.063,81.0,reviewed,us,us
4,2022-11-13T16:01:38.450Z,19.136168,-155.489166,35.400002,2.05,ml,41.0,162.0,,0.16,...,2022-11-13T16:07:08.870Z,"7 km S of Pāhala, Hawaii",earthquake,0.68,0.96,3.87,3.0,automatic,hv,hv


##### Note:
The columns containing latitude and longitude values are labelled differently in the `worldcities.csv` and the earthquake data from USGS. This is a minor but very typical form of incompatibility between data formats that you will often need to deal with when working with real data.

### More examples of useful `pandas` functions

Here we show you some more pandas functions that you may find useful in this exercise. 

As we have seen, versatile filtering and sorting capabilities are provided by pandas. To get more understanding of these, you should look at tutorials of using Pandas DataFrames. But the following example illustrates how you can find and display quakes whose depth is greater than or equal to a given threshold:

In [28]:
def show_deep_quakes( depth ):
    # make deep_quakes DataFrame by selecting rows from QUAKE_DF
    deep_quakes = QUAKE_DF[ QUAKE_DF["depth"] >= depth ]  ## This is how you select rows by a condition
                                                          ## on one of the column values.
        
    print("Number of quakes of depth {} or deeper:".format(depth), 
           len(deep_quakes.index))     ## This finds the number of rows of the deep_quakes DataFrame
    
    display(deep_quakes.sort_values("depth", ascending=False))  ## Sort by descending depth value

**Note:**
The `QUAKES_DF` global variable needs to be set before these examples will work, so I am using a `try`, `except` construct to avoid getting an error.

In [29]:
try:
    show_deep_quakes(100)
except:
    print("Probably QUAKE_DF not correctly set")

Number of quakes of depth 100 or deeper: 6


Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,...,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
59,2022-11-13T08:49:29.289Z,-17.4434,-178.6575,550.977,4.6,mb,38.0,93.0,3.152,0.59,...,2022-11-13T09:04:55.040Z,"225 km ENE of Levuka, Fiji",earthquake,10.38,9.158,0.103,32.0,reviewed,us,us
21,2022-11-13T13:45:30.167Z,2.494,124.0157,356.43,4.2,mb,46.0,76.0,3.512,0.43,...,2022-11-13T14:34:39.040Z,"145 km NW of Manado, Indonesia",earthquake,10.02,8.318,0.092,33.0,reviewed,us,us
154,2022-11-12T21:32:45.367Z,-22.6399,-66.2214,262.4,4.0,mwr,29.0,81.0,1.833,0.79,...,2022-11-12T22:48:23.040Z,"54 km W of Abra Pampa, Argentina",earthquake,10.08,11.102,,,reviewed,us,guc
131,2022-11-12T23:37:00.119Z,-23.4167,-66.8542,245.417,4.1,mb,20.0,108.0,1.304,0.24,...,2022-11-13T00:12:37.040Z,"104 km NNW of San Antonio de los Cobres, Argen...",earthquake,6.02,14.272,0.37,2.0,reviewed,us,us
14,2022-11-13T15:09:08.859Z,61.6878,-151.255,151.6,1.9,ml,,,,2.1,...,2022-11-13T15:11:36.944Z,"34 km SSE of Skwentna, Alaska",earthquake,,2.7,,,automatic,ak,ak
3,2022-11-13T16:37:39.774Z,-8.2669,158.8432,101.54,4.7,mb,75.0,94.0,1.596,0.4,...,2022-11-13T17:29:12.040Z,"83 km W of Buala, Solomon Islands",earthquake,10.51,7.966,0.063,81.0,reviewed,us,us


You can also find ```max``` and ```min``` values in a column. Eg:

In [30]:
try:
    QUAKE_DF["depth"].max()
#     print("correct")
except:
    print("Probably QUAKE_DF not correctly set")    

In [31]:
try:
    QUAKE_DF["depth"].min()
except:
    print("Probably QUAKE_DF not correctly set")    

### Question 2b: Find Powerful Quakes

Write a function `powerful_quakes` that takes a numerical argument and returns a `DataFrame` including
all the quakes in `QUAKE_DF` that have a magnitude greater than or equal to the given argument.

In [32]:
# Complete question 2b answer cell

def powerful_quakes(mag):
    pwr_quakes = QUAKE_DF[QUAKE_DF["mag"] >= mag]
    ## This is just returning an empty DataFrame you need to code it to return
    ## a DataFrame with all quakes of magnitude greater than or equal to mag
    return pwr_quakes

In [33]:
powerful_quakes(6)

Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,...,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
109,2022-11-13T02:24:57.806Z,-37.4557,-73.7425,18.0,6.2,mww,102.0,62.0,0.833,0.92,...,2022-11-13T17:22:00.537Z,"Near the coast of Bio-Bio, Chile",earthquake,6.3,1.71,0.069,20.0,reviewed,us,us


### Question 2c: Find `n+` most powerful earthquakes

Produce a `DataFrame` with rows represent the `n`(or maybe more) most powerful quakes
in descending order of magnitude. The returned ``DataFrame`` should show at least `n`
quakes and may sometimes show more since we do not want to leave out any quake that is equally
powerful as the last quake listed in the `DataFrame`,
More specificially, we want the function to return a `DataFrame` that:
* has exactly the same column names as `QUAKES_DF`,
* each row has the same vaues in each column as a corresponding row in `QUAKES_DF` (but it does not matter whether the row indices in the returned `DataFrame` are the same or different from corresponding rows in `QUAKES_DF`),
* the rows of the returned `DataFrame` are ordered in _descending order_ of their magnitude column value (rows of equal magnitude can appear in any order),
* contains all and only those rows `QUAKES_DF`, such that there are fewer than `n` other rows in
  `QUAKES_DF` that have a higher magnitude.


#### Note:
The above definition of the requirements is clear and precise. Though you may ask for help and advice regarding implementation, you will not be given help with understanding the specification. 

In [34]:
# Question 2c answer cell

def most_powerful_n_quakes(n):
    return QUAKE_DF.nlargest(n, columns = ["mag"], keep = "all")
    # Edit this function to make it return a DataFrame of 
    # the 'top n' quakes of the all_day.csv file

In [35]:
most_powerful_n_quakes(7)

Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,...,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
109,2022-11-13T02:24:57.806Z,-37.4557,-73.7425,18.0,6.2,mww,102.0,62.0,0.833,0.92,...,2022-11-13T17:22:00.537Z,"Near the coast of Bio-Bio, Chile",earthquake,6.3,1.71,0.069,20.0,reviewed,us,us
110,2022-11-13T02:24:02.036Z,-37.553,-73.7655,29.523,5.7,mwr,57.0,119.0,0.926,0.82,...,2022-11-13T04:27:37.727Z,"11 km WNW of Lebu, Chile",earthquake,4.8,5.52,0.068,21.0,reviewed,us,us
13,2022-11-13T15:10:22.199Z,-8.9513,108.8289,42.927,5.2,mww,60.0,101.0,1.968,0.59,...,2022-11-13T16:01:26.619Z,"152 km SSW of Kroya, Indonesia",earthquake,8.77,6.951,0.098,10.0,reviewed,us,us
65,2022-11-13T08:16:48.328Z,44.0411,148.2666,37.834,5.1,mb,94.0,123.0,4.089,1.03,...,2022-11-13T08:38:28.040Z,"126 km ENE of Shikotan, Russia",earthquake,8.63,5.644,0.024,565.0,reviewed,us,us
118,2022-11-13T01:47:44.731Z,22.4014,121.1602,5.055,5.1,mww,56.0,101.0,0.421,0.62,...,2022-11-13T05:49:01.819Z,"61 km NE of Hengchun, Taiwan",earthquake,5.56,4.555,0.078,16.0,reviewed,us,us
98,2022-11-13T03:55:37.236Z,-15.8943,-172.7384,10.0,5.0,mb,16.0,123.0,2.663,1.07,...,2022-11-13T04:16:17.040Z,"113 km E of Hihifo, Tonga",earthquake,11.13,1.911,0.17,13.0,reviewed,us,us
105,2022-11-13T03:00:43.776Z,-37.4362,-73.7566,17.514,5.0,mb,42.0,134.0,0.823,0.7,...,2022-11-13T12:35:02.625Z,"21 km NNW of Lebu, Chile",earthquake,4.31,4.359,0.088,41.0,reviewed,us,us


### Distance between locations on the Earth's surface

Clearly, when dealing with data pertaining to locations in space, the distance between such locations is often of great significance when interpreting or extracting further information from the data.

To help answer the following questions you are provided with the function ```haversine_distance```, 
which implements the [_Haversine formula_](https://en.wikipedia.org/wiki/Haversine_formula) to find the surface distance in kilometres between two locations, that are specified in terms of
latitude and longitude values. When finding distances betwen points on the surface of the
Earth we need to use this formula, rather than the simpler Pythagorean distance formula,
because the Earth's surface is a sphere.

In [36]:
## Function to compute distance between locations (kilometres) 
# Returns the surface distance in meters, according to the Haversine formula,
# between two locations given as (latitude, longitude) coordinate pairs.

import math
def haversine_distance( loc1 , loc2 ): 
    '''finds the distance (m) between 2 locations, where locations are defined by
    longitudes and latitudes'''
    lat1, lon1 = loc1
    lat2, lon2 = loc2
    radius = 6371  # kilometers
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) * math.sin(dlat / 2) +
         math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) *
         math.sin(dlon / 2) * math.sin(dlon / 2))
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    d = radius * c
    return d

### Question 2d: Sort quakes by distance from a given location

Write a function `quake_distance_from_loc_dataframe(loc)` satisfying the following requirements:
* It takes a location argument `(latitude,longitude)` consisting of a pair of `float`s 
    (Note: this is a single argument but consists of a pair of values represented as Python `tuple`.)

* It returns a `DataFrame` object derived from `QUAKE_DF` but with one extra column `distance_from_loc`, giving the distance of each quake from the given location. 

* The rows of the returned `DataFrame` should be the same as those in `QUAKE_DF` except for the aditional `distance_from_loc` column. (However, it is not necessary to preserve the index of the `DataFrame`, this will be ignored when your solution is tested.)

* The rows of the returned `DataFrame` should be _sorted_ in order of _increasing_ values of `distance_from_loc`. 

* The original DataFrame `QUAKE_DF` should not be altered by the execution of 
`quake_distance_from_loc_dataframe(loc)`.

#### Note:
You will need to do some research to find out how to create a new column and set its values.

In [37]:
## 2d Answer Code Cell

def quake_distance_from_loc_dataframe(loc):
    QUAKE_DF["distance_from_loc"] = ""
    for i in range(len(QUAKE_DF)):
        QUAKE_DF.at[i,"distance_from_loc"] = haversine_distance((QUAKE_DF['latitude'][i],QUAKE_DF['longitude'][i]),loc)
    return QUAKE_DF.sort_values("distance_from_loc")

In [38]:
quake_distance_from_loc_dataframe((3,5))

Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,...,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource,distance_from_loc
112,2022-11-13T02:08:07.419Z,34.338000,26.486000,10.000000,4.80,mww,82.0,51.0,1.619000,1.11,...,"Crete, Greece",earthquake,4.33,1.837,0.098000,10.0,reviewed,us,us,4133.491211
137,2022-11-12T22:48:02.041Z,30.242400,57.614800,10.000000,4.20,mb,47.0,96.0,5.423000,0.91,...,"51 km E of Kerman, Iran",earthquake,10.39,1.965,0.098000,29.0,reviewed,us,us,6295.952489
63,2022-11-13T08:29:13.510Z,17.929300,-65.450500,14.000000,3.51,md,22.0,182.0,0.196500,0.13,...,"18 km S of Esperanza, Puerto Rico",earthquake,0.57,1.160,0.130000,20.0,reviewed,pr,pr,7837.626661
81,2022-11-13T05:47:46.980Z,18.740667,-65.598000,37.240000,3.37,md,32.0,217.0,0.406800,0.23,...,"42 km NNE of Luquillo, Puerto Rico",earthquake,1.08,2.340,0.097694,15.0,reviewed,pr,pr,7858.435273
115,2022-11-13T01:56:26.680Z,18.019167,-66.772833,12.770000,2.70,md,10.0,91.0,0.006911,0.09,...,"0 km WNW of Magas Arriba, Puerto Rico",earthquake,0.30,0.590,0.185124,3.0,reviewed,pr,pr,7977.81963
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4,2022-11-13T16:01:38.450Z,19.136168,-155.489166,35.400002,2.05,ml,41.0,162.0,,0.16,...,"7 km S of Pāhala, Hawaii",earthquake,0.68,0.960,3.870000,3.0,automatic,hv,hv,16757.692352
158,2022-11-12T20:44:42.740Z,19.016333,-155.414667,34.750000,3.67,ml,49.0,213.0,,0.13,...,"18 km ESE of Naalehu, Hawaii",earthquake,0.60,0.840,0.167428,37.0,reviewed,hv,hv,16762.071482
3,2022-11-13T16:37:39.774Z,-8.266900,158.843200,101.540000,4.70,mb,75.0,94.0,1.596000,0.40,...,"83 km W of Buala, Solomon Islands",earthquake,10.51,7.966,0.063000,81.0,reviewed,us,us,17063.250306
59,2022-11-13T08:49:29.289Z,-17.443400,-178.657500,550.977000,4.60,mb,38.0,93.0,3.152000,0.59,...,"225 km ENE of Levuka, Fiji",earthquake,10.38,9.158,0.103000,32.0,reviewed,us,us,18360.215294


### Question 2e: Identifying Endangered Cities

The idea of this question is to identify possible emergency situations by finding
cities that are likely to suffer from the effects of
an earthquake.

#### Effect of an qarthquake at a distance from its epicenter
The effect of an earthquake on a city or person will depend on their distance from the source of the quake. The effect of an earthquake will depend on many factors and even the dependence on distance to source is very complex. However, after a bit of background research, Brandon has come up with a simple formula which hopefully at least gives a very crude estimate of relative effect of a quake with a particular magnitude and depth on a surface location at a known surface distance from the quake's epicenter. The calculated effective magnitude of an earthquake will be less that the source magnitude, for instance a magnitude 9 quake at a depth of 100km (which is likely to be extremely destructive), would have an effective magnitude of 5 at its epicentre (directly above the source) and 3.585 at a point on the earth surface 500km away from the epicenter.

In [39]:
def effective_magnitude( magnitude, depth, surface_distance ):
    energy = 10**magnitude  # convert logarithmic magnitude to a linear energy value
    if depth < 1:   # Crude fix for small or negative depths (can occur where land is above sea level)
        depth = 1
    ## Calculate distance to source by Pythagorus (ignoring curvature of surface)
    dist_to_source_squared =  depth**2 + surface_distance**2
    ## Apply inverse square distance multiplier to get energy density at distance from source
    ## (Ignores damping effects)
    attenuated_energy = energy/dist_to_source_squared
    attenuated_magnitude =  math.log10(attenuated_energy) ## Convert back to a log base 10 scale
    return attenuated_magnitude

# Some test cases.
# effective_magnitude(9,100,500)
# effective_magnitude(6,50, 100)

The _epicenter_ of an earthquake is the point on the earth that is directly above its source. Thus the effective magnitude at the epicenter is just the effective magnitude of the quake at surface distance zero:

In [40]:
def epicenter_magnitude( magnitude, depth ):
    return effective_magnitude( magnitude, depth, 0)

**Note:** For any given quake, its `effective_magnitude` at any point on Earth is always less than or equal to its `epicenter_magnitude`.

### Specification of the `endangered_cities` function

Now we get to the specification of the function.

Write a function `endangered_cities( minimum_population, minimum_effective_magnitude)`
that takes 
two numerical arguments: an `int` (`minimum_population`) and a `float` (`minimum_effective_magnitude`) 
and  returns a `list` specifying all those cities listed in 
the `WC_DF` such that:
* the city has  a population greater than or equal to the given `minimum_population`;

* for at least one of the quakes recorded in `QUAKE_DF`, the `effective_magnitude` (as determined by the
  function defined above) of that quake
   at the location of the city (as given in `worldcities.csv`) is equal to or greater than the  `minimum_effective_magnitude`.
   
* each city in the list should be represented by a tuple `(city_ascii, country, (lat, lng))`, giving
the city name in ASCII form, the country and the location (as a latitude, longitude pair).

* the returned list should be ordered aphabetically, primarily in terms of `country` and secondarily
  cities of the same country should be ordered in terms of `city_ascii`. 
  This ordering is illustrated by the sample output given below.

* A final condition is that the function should run in a reasonable time frame such that it can be expected to give correct and complete results after no more that 5 minutes excecution time. To ensure that this is feasible, the example cases you will be tested on, will be ones for which Brandon's solution ran in less than 2 minues on the Gradescope autograder platform. (All submissions will be tested on the same saved copy of `all_day.csv` previously downloaded from the USGS feed.)
   
##### Example Output:   

<pre>
 in [165]    %time        
             endangered_cities(200000, 0.5)
 
 out[165]    CPU times: user 1min 58s, sys: 768 µs, total: 1min 58s
             Wall time: 1min 58s
             [('Baghlan', 'Afghanistan', (36.1393, 68.6993)),
              ('Kunduz', 'Afghanistan', (36.728, 68.8725)),
              ('Mazar-e Sharif', 'Afghanistan', (36.7, 67.1)),
              ('Ambon', 'Indonesia', (-3.7167, 128.2)),
              ('Denov', 'Uzbekistan', (38.2772, 67.8872))]
</pre>

##### Notes:

* Your code should make use of the functions defined above: `haversine_distance`, `effective_magnitude` and `epicenter_magnitude`. You are advised not to change these otherwise you may get different results from what the Autograder is expecting.

* The calculation of this function is computationally intensive. For prelimiary testing you might use smaller data sets by, say only using cities in one country. 

* There are many optimisations that can be done to reduce the computational cost of finding the endangered cities. For example, calculating `epicenter_magnitude` enables weaker earthquakes to be ignored without the need to determine their distance from every city on earth. It may also be useful to not that when calcuating the effective magnitude of a given quake (with a specific depth) at different locations, the value always decreases as surface distance from the quake increases. 

* Remember that the autograder will run your actual code, so will probably take a while to grade this function. However, it only runs the function once and then performs 3 different tests on the value that is returned.

* The autograder will run on a presaved set of quakes (so the test is the same for all submissions). However, your code should run on any version of `all_day.csv` that you download from the USGS live feed.

In [41]:
## 2e Answer Code Cell
def endangered_cities(min_population, min_effective_magnitude):
    lst_cities=[]
    world_cities = WC_DF[WC_DF['population'] >= min_population]
    cities = world_cities.filter(items = ["city_ascii","lat","lng","country"])
    quakes = QUAKE_DF.filter(items = ["latitude","longitude","depth","mag"])
    for l1,row1 in cities.iterrows():
        loc1 = (row1[1],row1[2])
        for l2,row2 in quakes.iterrows(): 
            loc2= (row2[0], row2[1])
            
            epi = epicenter_magnitude( row2[3], row2[2])
            if epi>=min_effective_magnitude:
                dist = haversine_distance(loc1,loc2)
                eff_quake = effective_magnitude( row2[3], row2[2], dist)
                if eff_quake >= min_effective_magnitude:
                    lst_cities.append(tuple((row1[0],row1[3],(row1[1],row1[2]))))
    sorting = sorted(sorted(lst_cities, key = lambda sort_by:sort_by[0]),key = lambda sort_by:sort_by[1])
    return sorting

In [42]:
%%time
endangered_cities(200000, -.5)

CPU times: total: 18.4 s
Wall time: 18.4 s


[('Bahia Blanca', 'Argentina', (-38.74, -62.265)),
 ('Bahia Blanca', 'Argentina', (-38.74, -62.265)),
 ('Buenos Aires', 'Argentina', (-34.6025, -58.3975)),
 ('Cordoba', 'Argentina', (-31.4, -64.1823)),
 ('Cordoba', 'Argentina', (-31.4, -64.1823)),
 ('Corrientes', 'Argentina', (-27.49, -58.81)),
 ('Formosa', 'Argentina', (-26.1728, -58.1828)),
 ('La Plata', 'Argentina', (-34.9096, -57.96)),
 ('Mar del Plata', 'Argentina', (-38.0, -57.58)),
 ('Mendoza', 'Argentina', (-32.8833, -68.8166)),
 ('Mendoza', 'Argentina', (-32.8833, -68.8166)),
 ('Neuquen', 'Argentina', (-38.95, -68.06)),
 ('Neuquen', 'Argentina', (-38.95, -68.06)),
 ('Neuquen', 'Argentina', (-38.95, -68.06)),
 ('Parana', 'Argentina', (-31.7333, -60.5333)),
 ('Posadas', 'Argentina', (-27.3578, -55.8851)),
 ('Resistencia', 'Argentina', (-27.46, -58.99)),
 ('Rosario', 'Argentina', (-32.9511, -60.6663)),
 ('Salta', 'Argentina', (-24.7834, -65.4166)),
 ('San Juan', 'Argentina', (-31.55, -68.52)),
 ('San Juan', 'Argentina', (-31.55, 

## Optional Exercises

Having got this far, you may find it interesting and informative to do some more processing
of the city and earthquake information.
Since the previous exercises were designed so they can be quickly and reliably assessed by the autograding software, they involve coding particular functions with very specific requirements. But the following exercises are more open ended and give suggestions for interactive and visual use of the city and earthquake data.

To ensure that your assignment submission works with the autograding software when submitted, it is recommended that you now save this file and make a new copy with a different name, such as `Earthquakes_optional.ipynb`. Then use the new file to continue with the optional exercises.

### Constructing a city risk status alert `DataFrame`

A government or other organisation may want to monitor a certain list of cities with regard to whether
they may be at risk of earthquake damage. To answer this question you should create a function
that uses the `endangered_cities` function you have defined above to create such a `DataFrame`.

Your function `city_risk_alert` should return a pandas DataFrame that includes the status of ```'ENDANGERED'``` or ```'SAFE'``` for a certain city. The dataframe should also contain the city name, country and status for each city input. You could also extend this to add more columns showing
things like the distance and magnitude of the nearest earthquake. And you could perhaps make it
so any endangered cities were put at the top of the list.

For example:
```
display( city_risk_alert( ['Rome', 'Milan', 'Pisa'] )
```

might give the following output:

 
| city  | country|status|
|-------|-------|-----|
| Pisa  | Italy | ENDANGERED |
| Rome  | Italy | SAFE |
| Milan | Italy | SAFE |
 

## Visualisation Exercise: display endangered cities on a map

The code below creates a Map object using the ```ipyleaflet``` module and uses this to
display powerful quakes on the map. If you have coded the `powerful_quakes` function for
Question 2b above, the code in the cell below the map should draw the detected powerful
quakes onto the map at their correct locations.

To install the ```ipyleaflet``` module use ```pip3 install ipyleaflet```. If the map does not display after installation be sure to restart the kernel, and close and reopen this file. We provide the ```draw_circle_on_map``` function, this add circles to a specified location on the map, where the location is defined by longitudes and latitudes.

In [None]:
from ipyleaflet import Map, basemaps, basemap_to_tiles, Circle, Polyline
from ipywidgets import Layout

LEEDS_LOC  = ( 53.8008,  -1.5491  ) # Here we define the longitude and latitude of Leeds
WORLD_MAP = Map(basemap=basemaps.OpenTopoMap, center=LEEDS_LOC, zoom=1.5,
                layout=Layout(height="500px")) # Here we create a map object centred on Leeds

WORLD_MAP

In [None]:
def draw_circle_on_map( a_map, location, radius = 1000, color="red", fill_color=None ):
    if not fill_color:
        fill_color = color
    circle = Circle()
    circle.location = location
    circle.radius = radius
    circle.color = color
    circle.fill_color = fill_color
    a_map.add_layer(circle)

# This will edit your previous map rather than produce a new one    
draw_circle_on_map(WORLD_MAP, LEEDS_LOC, color="green" ) 

def display_powerful_quakes_on_map(mag):
    powerful = powerful_quakes(3)
    for i, quake in powerful.iterrows():
        draw_circle_on_map( WORLD_MAP,
                            (quake["latitude"],quake["longitude"]), 
                            radius= 20000*int(quake["mag"]) )

display_powerful_quakes_on_map(3)

### More Ideas for Graphical Display
It would be nice to also see the endangered cities on the map. For an ambitious exercise you
could see if you can draw lines on the map running from from the locations of powerful
earthquakes to the cities that are endangered by them.