# Final Exam (110 Points)

## General Instructions:

For **Pandas** questions, you may use standard Python and Pandas commands (joining, merging, math, etc.).  Ensure that the answer is printed in the requested format.  (Note: Your solutions will probably require several lines of Python and Pandas DataFrame manipulation.)

For **SQLite** questions, you should execute as few queries as possible; the queries may include subqueries and CTEs as needed.  Ensure that the answer is printed in the requested format.  Several of the questions have two parts, in which case it is appropriate to provide a separate query for each part.  You are also expected to use Python as necessary to put the results in the proper format, but you should not use Python to do work or computations that SQLite can do for you.
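
As a minimal sketch of letting SQLite do the computation inside a single query, the example below uses a CTE on a hypothetical `sales` table (the table, its columns, and its values are illustrative assumptions, not part of the exam data). Python is used only to run the query and print the result.

```python
import sqlite3

# Hypothetical toy table; names and values are illustrative only.
demo_db = sqlite3.connect(":memory:")
demo_db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
demo_db.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 30.0), ("south", 5.0)],
)

# The CTE computes per-region totals; the outer query filters and sorts them,
# so Python only has to format the result.
query = """
    WITH totals AS (
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
    )
    SELECT region, total
    FROM totals
    WHERE total > 6
    ORDER BY total DESC;
"""
cte_rows = demo_db.execute(query).fetchall()
print(cte_rows)
```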

In [76]:
### CELL 0 ###
# Please enter your name here:
# 1.) Sterling Bhollah
##############
# NOTE: If you are using a nickname, please include your real name in parentheses.
# Example: Bubba (Corey) Pennycuff
##############

## Loading the data

The following cells will load some datasets into memory as both Pandas DataFrames and tables in an in-memory SQLite database.

In [77]:
import numpy as np
import pandas as pd
import sqlite3
country = pd.read_csv('https://www3.nd.edu/~cpennycu/2019-2020/Spring/CSE10102-CDT30020/assets/gapminder.csv')

db = sqlite3.connect(':memory:')
country.to_sql('country', db)
cursor = db.cursor()
db.row_factory = sqlite3.Row
fancyCursor = db.cursor()

# Show the contents of the `country` DataFrame.
country

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
0,Afghanistan,1952,8425333.0,Asia,28.801,779.445314
1,Afghanistan,1957,9240934.0,Asia,30.332,820.853030
2,Afghanistan,1962,10267083.0,Asia,31.997,853.100710
3,Afghanistan,1967,11537966.0,Asia,34.020,836.197138
4,Afghanistan,1972,13079460.0,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418.0,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340.0,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948.0,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563.0,Africa,39.989,672.038623
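
The loading cell creates `cursor` before assigning `db.row_factory` and `fancyCursor` after it. In Python's `sqlite3` module a cursor takes its row factory from the connection at the moment the cursor is created, so the two cursors should return different row types. The sketch below illustrates this on a hypothetical one-row table `t` (the table and the names `plain_cursor` and `row_cursor` are assumptions for illustration).

```python
import sqlite3

demo_db = sqlite3.connect(":memory:")
demo_db.execute("CREATE TABLE t (name TEXT, value REAL)")
demo_db.execute("INSERT INTO t VALUES ('a', 1.5)")

plain_cursor = demo_db.cursor()      # created before row_factory is set
demo_db.row_factory = sqlite3.Row
row_cursor = demo_db.cursor()        # created after row_factory is set

plain_row = plain_cursor.execute("SELECT name, value FROM t").fetchone()
fancy_row = row_cursor.execute("SELECT name, value FROM t").fetchone()

print(plain_row)             # a plain tuple
print(fancy_row["name"])     # sqlite3.Row supports access by column name
print(dict(fancy_row))       # and converts cleanly to a dict
```

This is why rows fetched through `fancyCursor` can be treated as dictionary-like values in the later questions.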


In [78]:
co2Raw = pd.read_csv('https://www3.nd.edu/~cpennycu/2019-2020/Spring/CSE10102-CDT30020/assets/co2_emissions_tonnes_per_person.csv')

# Re-shape the data so that it can be easily combined with other tables.
co2 = co2Raw.melt(id_vars=["country"], var_name="year", value_name="co2PerCapita").dropna().reset_index(drop=True)
co2["year"] = pd.to_numeric(co2["year"])
co2.to_sql('co2', db)

# Show the contents of the `co2` DataFrame.
# The data is the number of metric tons of CO2 that each country releases per capita per year.
co2

Unnamed: 0,country,year,co2PerCapita
0,Canada,1800,0.00733
1,Germany,1800,0.04420
2,Poland,1800,0.04520
3,United Kingdom,1800,2.48000
4,United States,1800,0.04220
...,...,...,...
16900,Venezuela,2014,6.17000
16901,Vietnam,2014,1.82000
16902,Yemen,2014,0.87900
16903,Zambia,2014,0.29200
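
To make the re-shaping step concrete, here is a small sketch of the same `melt`, `dropna`, `reset_index` chain applied to a hypothetical two-country table (the frame `wide` and its values are made up for illustration). Each year column becomes a row, and rows with missing values are dropped.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "1800": [0.1, None],
    "1801": [0.2, 0.3],
})

# Turn the year columns into rows, drop missing values, and renumber the index.
long = (
    wide.melt(id_vars=["country"], var_name="year", value_name="co2PerCapita")
        .dropna()
        .reset_index(drop=True)
)
# The year labels start out as strings (column names), so convert them to numbers.
long["year"] = pd.to_numeric(long["year"])
print(long)
```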


## 1. Knowing the individual parts (50 Points)

For each part, **(1)** give the type of the variable `data` and **(2)** describe what it contains/represents.  If there is an index, **(3)** include a description of the index in your answer.  **(4)** Indicate whether or not the index labels are unique.  If the variable is a Series, then **(5)** describe where the name comes from (if applicable).  When in doubt, imagine that you are trying to describe what is in the `data` variable to someone who can't see the code that created the `data` variable.

This part is not meant to be code, per se, although it is in a code block.  Don't try to run the code unless you comment out your answers (so that the Python kernel doesn't try to parse them and throw an error).

Example:

```
# Part 0:
data = co2

# data is a DataFrame containing 3 columns: country, year, and co2PerCapita.
# The 3 columns are from the original co2 data set.
# The index labels are a number that was assigned to each row in the original dataset.
# The index labels are unique.
```
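
When checking answers of this kind, a few attributes are usually enough: `type(data)`, `data.index.is_unique`, and, for a Series, `data.name`. The sketch below uses a hypothetical four-row frame `toy` (not the exam data) to show how a label lookup with `.loc` and a position lookup with `.iloc` can return different types after `set_index`.

```python
import pandas as pd

# Hypothetical data: two countries observed in two years.
toy = pd.DataFrame({
    "country": ["A", "B", "A", "B"],
    "year": [1800, 1800, 1801, 1801],
    "value": [1.0, 2.0, 3.0, 4.0],
})
by_year = toy.set_index("year")

label_rows = by_year.loc[1801]    # label lookup: every row labelled 1801
position_row = by_year.iloc[2]    # position lookup: the single row at position 2

# A repeated label yields a DataFrame whose index labels are not unique.
print(type(label_rows).__name__, label_rows.index.is_unique)
# A single position yields a Series named after that row's index label,
# indexed by the (unique) column names.
print(type(position_row).__name__, position_row.name, list(position_row.index))
print(position_row.index.is_unique)
```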

In [79]:
### CELL 1 ###
# Write your answers to each part in this cell.
##############

# Part 1
data = co2.set_index('country')
# Your answer here...

'''
data is a DataFrame containing 2 columns: year and co2PerCapita
The data shows the co2PerCapita for each year for each country.
For example, the United Kingdom has one row for each year in which it has CO2 data
The index labels are the country names from the country column of the original dataset
The index labels are not unique
'''


# Part 2
data = country["gdpPercap"]
# Your answer here...
'''
data is a Series, which is a single column of the country DataFrame
the name of the Series is gdpPercap, which comes from the column name in the country DataFrame
data shows the gdpPercap of each country for each year in the data, starting in 1952
The index is the same index as in the original country dataset
The index labels are unique
'''

# Part 3
data = country.set_index('country')["gdpPercap"]
# Your answer here...
'''
data is a Series named gdpPercap
the name of the Series comes from the column name in the country DataFrame
Each row of the Series shows the gdpPercap of a country for one of the years that has data
The index is country
the index labels are not unique
'''

# Part 4
data = country[country["year"] == 2007].set_index('country')["gdpPercap"]
# Your answer here...
'''
data is a series that displays the gdpPercap for each country in 2007
The name of the series, gdpPercap, comes from the column name of the country DataFrame
The index is country
The index labels are unique
'''

# Part 5
data = country[["gdpPercap"]]
# Your answer here...
'''
data is a DataFrame containing only one column: gdpPercap
The data shows the gdpPercap for each country for each unique year in the dataset
The index labels are the same index labels from the original country DataFrame
The index labels are unique
'''


# Part 6
data = co2.set_index('year')[3:5]
# Your answer here...
'''
data is a DataFrame containing 2 columns, country and co2PerCapita, and the rows at positions 3 and 4 (the slice 3:5 excludes position 5)
The data shows the co2PerCapita for the countries in those two rows
The index is the year
The index labels are not unique
'''


# Part 7
data = co2.set_index('year').loc[1801]
# Your answer here...

'''
data is a DataFrame containing 2 columns, country and co2PerCapita, with one row for each country that has data in the year 1801
data only shows the rows whose index label is the integer 1801
the index is year
the index labels are not unique
'''
# Part 8
data = co2.set_index('year').iloc[1801]
# Your answer here...


#NEED HELP
'''
data is a Series whose name is 1888
the name is the index label (the year) of the row at position 1801 after the index is set to year
data holds the values of that single row, not every row for one year
the index labels are the column names country and co2PerCapita
the index labels are unique
'''

# Part 9
data = country["country"].iloc[0:1]
# Your answer here...
'''
data is a Series whose name is country
the name comes from the column header of the country column in the original country DataFrame
the Series contains one element: the country at position 0 of the original country DataFrame (the slice 0:1 excludes position 1)
the index labels are the same labels that were assigned in the country DataFrame
the index labels are unique
'''


# Part 10
data = country["country"].iloc[0:1][0]
# Your answer here...

'''
data is a string that contains Afghanistan
the data comes from the Series named country, sliced to positions 0 up to (but not including) 1, which came from the original country DataFrame
the string, data, is the item with index label 0 in that Series, which was Afghanistan
there is no index
'''



## 2. Imperfect Datasets (20 Points)

Datasets are rarely perfect or complete.  Here, take the two datasets that have been provided, `country` and `co2`, and compare the countries in each to find which countries are missing from each dataset.  Sort the country lists alphabetically.

The output should start like this:

```
Countries in `country` but not in `co2` are: ['Congo Dem. Rep.', 'Congo Rep.', ...]

Countries in `co2` but not in `country` are: ['Andorra', 'Antigua and Barbuda', ...]
```

Of course, the lists will be longer.
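
One way to frame this comparison is as a pair of set differences. The sketch below is only an illustration on two hypothetical Series, `left` and `right` (made-up names and values, not the exam data); duplicates disappear when the values are placed in a set, and `sorted` returns an alphabetical list.

```python
import pandas as pd

# Hypothetical country columns; "Peru" is repeated to mimic one row per year.
left = pd.Series(["Chile", "Peru", "Yemen Rep.", "Peru"])
right = pd.Series(["Chile", "Yemen", "Peru", "Fiji"])

# Set difference keeps the names found on one side only.
only_left = sorted(set(left) - set(right))
only_right = sorted(set(right) - set(left))

print(only_left)
print(only_right)
```

Note that differently spelled names for the same country (here `Yemen Rep.` and `Yemen`) show up on both lists, which is the kind of imperfection this question is about.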

In [80]:
### CELL 2 ###
# Write your code in this cell.
# Do not remove this comment block.  Write your code after this comment block.
#
# Write your Pandas solution in this cell.
##############

countryList = country['country'].drop_duplicates()

co2List = co2['country'].drop_duplicates()
#print(co2List)

countryList = [*countryList.values]
co2List = [*co2List.values]

countryList.sort()
co2List.sort()

notInCo2 = []
notInCountry = []


for i in countryList:
    if i not in co2List:
        notInCo2.append(i)
        
for k in co2List:
    if k not in countryList:
        notInCountry.append(k)
        

##############
# Do not change the code below this point.
##############
print(f"Countries in `country` but not in `co2` are:", notInCo2)
print()
print(f"Countries in `co2` but not in `country` are:", notInCountry)

Countries in `country` but not in `co2` are: ['Congo Dem. Rep.', 'Congo Rep.', 'Hong Kong China', 'Korea Dem. Rep.', 'Korea Rep.', 'Puerto Rico', 'Reunion', 'Taiwan', 'West Bank and Gaza', 'Yemen Rep.']

Countries in `co2` but not in `country` are: ['Andorra', 'Antigua and Barbuda', 'Armenia', 'Azerbaijan', 'Bahamas', 'Barbados', 'Belarus', 'Belize', 'Bhutan', 'Brunei', 'Cape Verde', 'Congo, Dem. Rep.', 'Congo, Rep.', 'Cyprus', 'Dominica', 'Estonia', 'Fiji', 'Georgia', 'Grenada', 'Guyana', 'Kazakhstan', 'Kiribati', 'Kyrgyz Republic', 'Lao', 'Latvia', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Maldives', 'Malta', 'Marshall Islands', 'Micronesia, Fed. Sts.', 'Moldova', 'Nauru', 'North Korea', 'North Macedonia', 'Palau', 'Palestine', 'Papua New Guinea', 'Qatar', 'Russia', 'Samoa', 'Seychelles', 'Solomon Islands', 'South Korea', 'South Sudan', 'St. Kitts and Nevis', 'St. Lucia', 'St. Vincent and the Grenadines', 'Suriname', 'Tajikistan', 'Timor-Leste', 'Tonga', 'Turkmenistan', 'Tuvalu', 

In [81]:
### CELL 3 ###
# Write your code in this cell.
# Do not remove this comment block.  Write your code after this comment block.
#
# Write your SQLite solution in this cell.
##############

notInCo2 = []
notInCountry = []
fancyCursor.execute("""
    SELECT DISTINCT d.country
    FROM country d
    WHERE d.country NOT IN (
        SELECT m.country
        FROM co2 m)
    ORDER BY d.country ASC;
"""
)

for row in fancyCursor.fetchall():
    notInCo2.append(row[0])

fancyCursor.execute("""
    SELECT DISTINCT d.country
    FROM co2 d
    WHERE d.country NOT IN(
        SELECT m.country
        FROM country m)
    ORDER BY d.country ASC;
""")

for row in fancyCursor.fetchall():
    notInCountry.append(row[0])
##############
# Do not change the code below this point.
##############
print(f"Countries in `country` but not in `co2` are:", notInCo2)
print()
print(f"Countries in `co2` but not in `country` are:", notInCountry)

Countries in `country` but not in `co2` are: ['Congo Dem. Rep.', 'Congo Rep.', 'Hong Kong China', 'Korea Dem. Rep.', 'Korea Rep.', 'Puerto Rico', 'Reunion', 'Taiwan', 'West Bank and Gaza', 'Yemen Rep.']

Countries in `co2` but not in `country` are: ['Andorra', 'Antigua and Barbuda', 'Armenia', 'Azerbaijan', 'Bahamas', 'Barbados', 'Belarus', 'Belize', 'Bhutan', 'Brunei', 'Cape Verde', 'Congo, Dem. Rep.', 'Congo, Rep.', 'Cyprus', 'Dominica', 'Estonia', 'Fiji', 'Georgia', 'Grenada', 'Guyana', 'Kazakhstan', 'Kiribati', 'Kyrgyz Republic', 'Lao', 'Latvia', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Maldives', 'Malta', 'Marshall Islands', 'Micronesia, Fed. Sts.', 'Moldova', 'Nauru', 'North Korea', 'North Macedonia', 'Palau', 'Palestine', 'Papua New Guinea', 'Qatar', 'Russia', 'Samoa', 'Seychelles', 'Solomon Islands', 'South Korea', 'South Sudan', 'St. Kitts and Nevis', 'St. Lucia', 'St. Vincent and the Grenadines', 'Suriname', 'Tajikistan', 'Timor-Leste', 'Tonga', 'Turkmenistan', 'Tuvalu', 

## 3. GDP and CO2  (20 Points)

The `country` dataset provides the GDP per capita for each country, and `co2` provides the metric tonnes of CO2 per capita for each country.

Combining these two datasets, we can see which country is the most carbon efficient with respect to its GDP by dividing the GDP per capita by the CO2 tonnage per capita, resulting in the GDP per metric tonne of CO2.

For the year 2007, we will find the best and worst performing countries with respect to this GDP/CO2 metric.  The best country will have the highest GDP per CO2 tonnage.

Ignore any rows with missing values.

For simplicity, the variables `best` and `worst` should both be dictionary-like objects containing `country`, `co2PerCapita`, `gdpPercap`, and `metric` (the value that you compute, described above).
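
As a rough sketch of the metric described above, the example below merges two hypothetical frames, `gdp` and `emissions` (made-up names and values), computes GDP per tonne of CO2, and picks the rows with the highest and lowest values. Under this definition the best performer is the row with the maximum metric.

```python
import pandas as pd

# Hypothetical inputs: country "C" has no emissions row and "D" has no GDP row.
gdp = pd.DataFrame({"country": ["A", "B", "C"], "gdpPercap": [1000.0, 9000.0, 4000.0]})
emissions = pd.DataFrame({"country": ["A", "B", "D"], "co2PerCapita": [0.5, 9.0, 2.0]})

# An inner merge keeps only countries present in both frames.
merged = gdp.merge(emissions, on="country", how="inner")
merged["metric"] = merged["gdpPercap"] / merged["co2PerCapita"]

# idxmax/idxmin give the index labels of the highest and lowest metric.
best_row = merged.loc[merged["metric"].idxmax()].to_dict()
worst_row = merged.loc[merged["metric"].idxmin()].to_dict()

print(best_row)
print(worst_row)
```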

In [82]:
### CELL 4 ###
# Write your code in this cell.
# Do not remove this comment block.  Write your code after this comment block.
#
# Write your Pandas solution in this cell.
##############


# Keep only the 2007 rows and the columns needed from each dataset.
countrydf = country[country["year"] == 2007][["country", "gdpPercap"]]
co2df = co2[co2["year"] == 2007][["country", "co2PerCapita"]]

# An inner merge keeps only countries present in both datasets; dropna removes missing values.
results = countrydf.merge(co2df, on="country", how="inner").dropna()
results["metric"] = results["gdpPercap"] / results["co2PerCapita"]

# The best country has the highest GDP per tonne of CO2, so sort in descending order.
results = results.sort_values("metric", ascending=False)
results = results[["country", "gdpPercap", "co2PerCapita", "metric"]]

best = results.iloc[0].to_dict()
worst = results.iloc[-1].to_dict()

##############
# Do not change the code below this point.
##############
print(f"The best performing country is {best['country']} with a GDP per capita of {best['gdpPercap']:,.2f}, CO2 emission of {best['co2PerCapita']:,.2f} metric tonnes, and a GDP per CO2 tonne per capita of {best['metric']:,.2f}.")
print()
print(f"The worst performing country is {worst['country']} with a GDP per capita of {worst['gdpPercap']:,.2f}, CO2 emission of {worst['co2PerCapita']:,.2f} metric tonnes, and a GDP per CO2 tonne per capita of {worst['metric']:,.2f}.")

The best performing country is Chad with a GDP per capita of 1,704.06, CO2 emission of 0.04 metric tonnes, and a GDP per CO2 tonne per capita of 39,907.82.

The worst performing country is Trinidad and Tobago with a GDP per capita of 18,008.51, CO2 emission of 34.80 metric tonnes, and a GDP per CO2 tonne per capita of 517.49.


In [83]:
### CELL 5 ###
# Write your code in this cell.
# Do not remove this comment block.  Write your code after this comment block.
#
# Write your SQLite solution in this cell.
##############


fancyCursor.execute("""
    SELECT d.country, d.gdpPercap, c.co2PerCapita,
           (d.gdpPercap / c.co2PerCapita) AS metric
    FROM country d
    INNER JOIN co2 c
        ON d.country = c.country AND c.year = d.year
    WHERE d.year = 2007
      AND d.gdpPercap IS NOT NULL
      AND c.co2PerCapita IS NOT NULL
    ORDER BY metric DESC;
""")

# The best country has the highest metric, so it is the first row.
results = fancyCursor.fetchall()

# sqlite3.Row is dictionary-like, so dict() yields the requested keys.
best = dict(results[0])
worst = dict(results[-1])

##############
# Do not change the code below this point.
##############
print(f"The best performing country is {best['country']} with a GDP per capita of {best['gdpPercap']:,.2f}, CO2 emission of {best['co2PerCapita']:,.2f} metric tonnes, and a GDP per CO2 tonne per capita of {best['metric']:,.2f}.")
print()
print(f"The worst performing country is {worst['country']} with a GDP per capita of {worst['gdpPercap']:,.2f}, CO2 emission of {worst['co2PerCapita']:,.2f} metric tonnes, and a GDP per CO2 tonne per capita of {worst['metric']:,.2f}.")

The best performing country is Chad with a GDP per capita of 1,704.06, CO2 emission of 0.04 metric tonnes, and a GDP per CO2 tonne per capita of 39,907.82.

The worst performing country is Trinidad and Tobago with a GDP per capita of 18,008.51, CO2 emission of 34.80 metric tonnes, and a GDP per CO2 tonne per capita of 517.49.


## 4. Multiple sorts (20 Points)

As of 2002, give the metric tonnes of CO2 emissions for the 10 countries with the highest national GDP (total GDP, not per capita).  For each country provide the country name, the total GDP, and the level of CO2 emissions per capita.  The final list should be ordered by population in descending order.

Ignore any rows with missing values.

`rows` should be a list of dictionary-like values, each containing keys for `country`, `totalGDP` and `co2PerCapita`.
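
The question involves two different orderings: one to select the top rows and another to present them. A minimal sketch on a hypothetical four-country frame `toy` (made-up values) selects the largest totals with `nlargest` and then re-sorts only those rows by population.

```python
import pandas as pd

# Hypothetical data; totals work out to A 500, B 1000, C 800, D 100.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "pop": [50.0, 10.0, 200.0, 5.0],
    "gdpPercap": [10.0, 100.0, 4.0, 20.0],
    "co2PerCapita": [1.0, 8.0, 0.5, 3.0],
})
toy["totalGDP"] = toy["gdpPercap"] * toy["pop"]

# First sort key: keep the 3 largest totals. Second sort key: order them by population.
top = toy.nlargest(3, "totalGDP").sort_values("pop", ascending=False)

# One dictionary per row, with only the requested keys.
demo_rows = top[["country", "totalGDP", "co2PerCapita"]].to_dict("records")
print(demo_rows)
```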

In [84]:
### CELL 6 ###
# Write your code in this cell.
# Do not remove this comment block.  Write your code after this comment block.
#
# Write your Pandas solution in this cell.
##############


countrydf = country[country['year'] == 2002][['country', 'gdpPercap', 'pop']]
co2df = co2[co2['year'] == 2002][['country', 'co2PerCapita']]

# An inner merge keeps only countries present in both datasets; dropna removes missing values.
results = countrydf.merge(co2df, on='country', how='inner').dropna()
results['totalGDP'] = results['gdpPercap'] * results['pop']

# Take the 10 largest economies, then order those 10 by population.
results = results.nlargest(10, 'totalGDP').sort_values('pop', ascending=False)

# Build the rows without reusing the names `country` or `co2`, which hold the DataFrames.
rows = results[['country', 'totalGDP', 'co2PerCapita']].to_dict('records')

###############
# Do not change the code below this point.
##############
for row in rows:
    print(f"{row['country']:>15} has GDP of {row['totalGDP']:18,.0f} and CO2 emissions per capita of {row['co2PerCapita']:5,.2f} tonnes.")

          China has GDP of  3,993,927,259,238 and CO2 emissions per capita of  2.95 tonnes.
          India has GDP of  1,806,461,015,265 and CO2 emissions per capita of  0.96 tonnes.
  United States has GDP of 11,247,278,678,121 and CO2 emissions per capita of 19.60 tonnes.
         Brazil has GDP of  1,462,920,751,253 and CO2 emissions per capita of  1.85 tonnes.
          Japan has GDP of  3,634,666,526,235 and CO2 emissions per capita of  9.54 tonnes.
         Mexico has GDP of  1,100,884,521,316 and CO2 emissions per capita of  4.08 tonnes.
        Germany has GDP of  2,473,468,447,076 and CO2 emissions per capita of 10.20 tonnes.
         France has GDP of  1,733,393,500,386 and CO2 emissions per capita of  6.27 tonnes.
 United Kingdom has GDP of  1,766,158,504,920 and CO2 emissions per capita of  8.91 tonnes.
          Italy has GDP of  1,620,107,994,725 and CO2 emissions per capita of  7.92 tonnes.


In [85]:
### CELL 7 ###
# Write your code in this cell.
# Do not remove this comment block.  Write your code after this comment block.
#
# Write your SQLite solution in this cell.
##############

fancyCursor.execute("""
    SELECT country, totalGDP, co2PerCapita
    FROM (
        SELECT d.country, d.pop, (d.gdpPercap * d.pop) AS totalGDP, m.co2PerCapita
        FROM country d
        INNER JOIN co2 m
            ON d.country = m.country AND m.year = d.year
        WHERE d.year = 2002
          AND d.gdpPercap IS NOT NULL
          AND d.pop IS NOT NULL
        ORDER BY totalGDP DESC
        LIMIT 10
    )
    ORDER BY pop DESC;
""")

# sqlite3.Row is dictionary-like; dict() avoids loop variables named `country` or `co2`,
# which would overwrite the DataFrames of the same names.
rows = [dict(row) for row in fancyCursor.fetchall()]

##############
# Do not change the code below this point.
##############
for row in rows:
    print(f"{row['country']:>15} has GDP of {row['totalGDP']:18,.0f} and CO2 emissions per capita of {row['co2PerCapita']:5,.2f} tonnes.")

          China has GDP of  3,993,927,259,238 and CO2 emissions per capita of  2.95 tonnes.
          India has GDP of  1,806,461,015,265 and CO2 emissions per capita of  0.96 tonnes.
  United States has GDP of 11,247,278,678,121 and CO2 emissions per capita of 19.60 tonnes.
         Brazil has GDP of  1,462,920,751,253 and CO2 emissions per capita of  1.85 tonnes.
          Japan has GDP of  3,634,666,526,235 and CO2 emissions per capita of  9.54 tonnes.
         Mexico has GDP of  1,100,884,521,316 and CO2 emissions per capita of  4.08 tonnes.
        Germany has GDP of  2,473,468,447,076 and CO2 emissions per capita of 10.20 tonnes.
         France has GDP of  1,733,393,500,386 and CO2 emissions per capita of  6.27 tonnes.
 United Kingdom has GDP of  1,766,158,504,920 and CO2 emissions per capita of  8.91 tonnes.
          Italy has GDP of  1,620,107,994,725 and CO2 emissions per capita of  7.92 tonnes.
