## Assignment 1: Assess your Python Programming Skills

This short set of exercises will help you assess your Python programming skills. Try to solve each problem on your own, relying on your existing skills rather than searching online for a solution. Keep in mind that the assignments in this course may be more challenging than this one.


## Dataset Information

The dataset for this problem comes from [this Kaggle page](https://www.kaggle.com/yamqwe/omicron-covid19-variant-daily-cases?select=covid-variants.csv). A copy of the file is also in the same directory as this Jupyter notebook.

This dataset contains data about the processing of COVID-19 sequences by different countries over time. It comes as a Comma-Separated Value (CSV) file. This file contains the following 6 columns:

1. `location`: the country for which the information is provided
2. `date`: the date of the data entry
3. `variant`: the COVID-19 variant for the data entry
4. `num_sequences`: the number of sequences **processed** (for the country, variant, and date)
5. `num_sequences_total`: the number of sequences **available** (for the country, variant, and date)
6. `perc_sequences`: the percentage of available sequences that were processed (*Note: this value is out of 100*)

Each row in the dataset represents the processing of *one* variant by *one* country on *one* day.
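
The column descriptions above imply two properties that can be checked before starting. The sketch below does so on a few made-up rows (the values are illustrative, not taken from the dataset). It assumes that `perc_sequences` equals `num_sequences / num_sequences_total * 100` and that each (`location`, `date`, `variant`) combination appears only once. The names `toy`, `recomputed`, `consistent`, and `has_duplicate_keys` are hypothetical.

```python
import pandas as pd

# Hypothetical rows that mimic the dataset's columns; the values are made up.
toy = pd.DataFrame({
    "location": ["A", "A", "B"],
    "date": ["2021-01-01", "2021-01-01", "2021-01-01"],
    "variant": ["Alpha", "Delta", "Alpha"],
    "num_sequences": [1, 3, 0],
    "num_sequences_total": [4, 4, 10],
    "perc_sequences": [25.0, 75.0, 0.0],
})

# Recompute the percentage from the two count columns (assumed relationship).
recomputed = toy["num_sequences"] / toy["num_sequences_total"] * 100

# True when every stored percentage matches the recomputed one within rounding.
consistent = bool(((recomputed - toy["perc_sequences"]).abs() < 0.01).all())

# True if any (location, date, variant) combination appears more than once.
has_duplicate_keys = bool(toy.duplicated(subset=["location", "date", "variant"]).any())
```

Running the same two checks on `data` after loading the CSV would show whether these assumptions hold for the real file.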

In [1]:
## Import any package you may need here below 
import pandas as pd
import numpy as np

In [2]:
## Add here anything else you may need
data = pd.read_csv('covid-variants.csv')
data

| | location | date | variant | num_sequences | perc_sequences | num_sequences_total |
|---|---|---|---|---|---|---|
| 0 | Angola | 2020-07-06 | Alpha | 0 | 0.0 | 3 |
| 1 | Angola | 2020-07-06 | B.1.1.277 | 0 | 0.0 | 3 |
| 2 | Angola | 2020-07-06 | B.1.1.302 | 0 | 0.0 | 3 |
| 3 | Angola | 2020-07-06 | B.1.1.519 | 0 | 0.0 | 3 |
| 4 | Angola | 2020-07-06 | B.1.160 | 0 | 0.0 | 3 |
| ... | ... | ... | ... | ... | ... | ... |
| 100411 | Zimbabwe | 2021-11-01 | Omicron | 0 | 0.0 | 6 |
| 100412 | Zimbabwe | 2021-11-01 | S:677H.Robin1 | 0 | 0.0 | 6 |
| 100413 | Zimbabwe | 2021-11-01 | S:677P.Pelican | 0 | 0.0 | 6 |
| 100414 | Zimbabwe | 2021-11-01 | others | 0 | 0.0 | 6 |


## 1. Find Uncommon Variants

The 3 main variants of COVID-19 that we've experienced in the US are:

1. Alpha
2. Delta
3. Omicron

However, there are many other variants recognized by the WHO. 

Determine which other variants are included in this dataset.

Sort the variant names alphanumerically and store them in a Python list.

*Note: the `variant` column of the dataset contains 2 "catch-all" categories called "**non_who**" and "**others**". Do **NOT** include these categories in the list.*
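
One possible way to combine the three requirements (unique names, no catch-all categories, sorted order) is sketched below on a small made-up column. The variable names `variants`, `catch_all`, and `variant_names` are hypothetical, and the sample values are only for illustration.

```python
import pandas as pd

# Hypothetical variant column with repeats and both catch-all labels.
variants = pd.Series(["Delta", "others", "Alpha", "non_who", "B.1.160", "Alpha"])

# Labels that must not appear in the final list.
catch_all = {"non_who", "others"}

# Deduplicate, drop the catch-all labels, then sort into a plain Python list.
variant_names = sorted(v for v in variants.unique() if v not in catch_all)
```

Python's default string sort compares characters by code point, so uppercase names come before lowercase ones. For this toy column the result is `['Alpha', 'B.1.160', 'Delta']`.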

In [3]:
## Write here your code
# Unique variant names, excluding the two catch-all categories, sorted alphanumerically
variantlist = sorted(v for v in data.variant.unique() if v not in ('non_who', 'others'))
variantlist

['Alpha',
 'B.1.1.277',
 'B.1.1.302',
 'B.1.1.519',
 'B.1.160',
 'B.1.177',
 'B.1.221',
 'B.1.258',
 'B.1.367',
 'B.1.620',
 'Beta',
 'Delta',
 'Epsilon',
 'Eta',
 'Gamma',
 'Iota',
 'Kappa',
 'Lambda',
 'Mu',
 'Omicron',
 'S:677H.Robin1',
 'S:677P.Pelican']

## 2. Find the Most Processed Variant

Determine which variant of COVID-19 has the most sequences processed.
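
A compact pattern for this kind of question is to total the counts per group and then ask for the label of the largest total. The sketch below illustrates it on made-up numbers. The names `toy`, `totals`, `top_variant`, and `top_count` are hypothetical, and it assumes "most processed" means the largest sum of `num_sequences`.

```python
import pandas as pd

# Hypothetical rows; the counts are made up for illustration.
toy = pd.DataFrame({
    "variant": ["Alpha", "Delta", "Alpha", "Delta", "Omicron"],
    "num_sequences": [5, 7, 2, 4, 1],
})

# Total processed sequences per variant (a Series indexed by variant name).
totals = toy.groupby("variant")["num_sequences"].sum()

# idxmax returns the index label of the largest total.
top_variant = totals.idxmax()
top_count = int(totals.max())
```

Note that `idxmax` returns only the first label when several groups tie for the maximum, so a boolean filter on the maximum value may be preferable when ties matter.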


In [4]:
## Write here your code
# Total processed sequences per variant
mostprocessed = data.groupby('variant')['num_sequences'].sum().reset_index()
# Keep only the row(s) with the largest total
mostprocessed = mostprocessed[mostprocessed.num_sequences == mostprocessed.num_sequences.max()]
mostprocessed
# Delta is the most processed variant, with 3,834,100 sequences

| | variant | num_sequences |
|---|---|---|
| 11 | Delta | 3834100 |


## 3. Find Best Country at Processing All Sequences

Determine which country did the best at processing sequences across **all** variants (including the "catch-all" categories).

The output should be the name of a single country.


In [5]:
## Write here your code
# Total processed sequences per country, across all variants
bestcountry = data.groupby('location')['num_sequences'].sum().reset_index()
# Keep only the country with the largest total
bestcountry = bestcountry[bestcountry.num_sequences == bestcountry.num_sequences.max()]
bestcountry.location.iloc[0]
# The United States processed the most sequences across all variants

'United States'

## 4a. Find Best Country at Processing Specific Sequences

Determine which country did the best at processing sequences across the Alpha, Delta, and Omicron variants.

The output should be the name of a single country.


In [6]:
## Write here your code
# Keep only the three main variants, then total processed sequences per country
bestcountry2 = data[data.variant.isin(['Alpha', 'Delta', 'Omicron'])]
bestcountry2 = bestcountry2.groupby('location')['num_sequences'].sum().reset_index()
bestcountry2out = bestcountry2[bestcountry2.num_sequences == bestcountry2.num_sequences.max()]
bestcountry2out.location.iloc[0]
# The United States processed the most Alpha, Delta, and Omicron sequences

'United States'

## 4b. Find the Ranking of the US at Processing Specific Sequences

Determine the ranking of the US at processing sequences across the Alpha, Delta, and Omicron variants.

Store the ranking as an integer.

*Note: the best country has a ranking of 1, but indexing in Python starts at 0.*

*Note: in Jupyter, variables from already executed code cells are available in other code cells. This means you shouldn't have to copy and paste code from problem 4a.*
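
The off-by-one between 0-based positions and 1-based rankings can be handled in at least two ways, sketched below on made-up totals. The names `totals`, `ordered`, `rank_by_position`, and `rank_by_method` are hypothetical, and the country labels and numbers are only for illustration.

```python
import pandas as pd

# Hypothetical per-country totals; names and numbers are made up.
totals = pd.Series({"Country A": 50, "Country B": 120, "Country C": 80})

# Option 1: sort descending, then convert the 0-based position to a 1-based rank.
ordered = totals.sort_values(ascending=False)
rank_by_position = int(ordered.index.get_loc("Country C")) + 1

# Option 2: let pandas assign ranks directly (tied values share the lowest rank).
rank_by_method = int(totals.rank(ascending=False, method="min")["Country C"])
```

Both options give 2 for "Country C" here. They can differ when totals tie, because a position depends on the sort order while `method="min"` gives tied countries the same rank.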

In [7]:
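## Write here your code
# Rank countries by processed Alpha, Delta, and Omicron sequences (1 = best)
bestcountry3 = bestcountry2.sort_values('num_sequences', ascending = False).reset_index(drop = True)
bestcountry3['ranking'] = bestcountry3.index + 1
# Store the US ranking as an integer, as the task requires
usranking = int(bestcountry3.loc[bestcountry3.location == 'United States', 'ranking'].iloc[0])
bestcountry3
# The ranking column shows the United States at rank 1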

| | location | num_sequences | ranking |
|---|---|---|---|
| 0 | United States | 1595808 | 1 |
| 1 | United Kingdom | 1413632 | 2 |
| 2 | Germany | 292106 | 3 |
| 3 | Denmark | 225315 | 4 |
| 4 | Japan | 140074 | 5 |
| ... | ... | ... | ... |
| 116 | Monaco | 79 | 117 |
| 117 | Hungary | 29 | 118 |
| 118 | Madagascar | 27 | 119 |
| 119 | Cyprus | 20 | 120 |


## 5. Find the Number of Processed Sequences Per Country on Date

Determine each country's total number of processed sequences for the Omicron variant on December 27, 2021.

Sort the output from the highest to the lowest number of processed sequences.

Store the result as a list of tuples, with each tuple containing the country name first and the number of processed sequences second.
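
The conversion from a sorted table to a list of `(country, count)` tuples can be sketched as follows on made-up rows. The names `toy`, `ordered`, and `pairs` are hypothetical. Sorting on the country name as a second key is an assumption added here so that tied counts come out in a predictable order; the task itself does not specify how to break ties.

```python
import pandas as pd

# Hypothetical per-country counts; two countries tie on purpose.
toy = pd.DataFrame({
    "location": ["Country C", "Country B", "Country A"],
    "num_sequences": [3, 9, 3],
})

# Highest count first; ties broken alphabetically by country (an assumption).
ordered = toy.sort_values(["num_sequences", "location"], ascending=[False, True])

# Pair each country with its count; tolist() yields plain Python ints.
pairs = list(zip(ordered["location"], ordered["num_sequences"].tolist()))
```

For these rows, `pairs` is `[('Country B', 9), ('Country A', 3), ('Country C', 3)]`.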


In [8]:
## Write here your code
# Omicron rows for December 27, 2021, totaled per country
bestcountry4 = data[(data.variant == 'Omicron') & (data.date == '2021-12-27')]
bestcountry4 = bestcountry4.groupby('location')['num_sequences'].sum().reset_index()
# Sort from the highest to the lowest number of processed sequences
bestcountry4 = bestcountry4.sort_values('num_sequences', ascending = False)
# Convert to a list of (country, num_sequences) tuples
bestcountry4 = list(bestcountry4.itertuples(index = False, name = None))
bestcountry4

[('United Kingdom', 52456),
 ('United States', 24681),
 ('Denmark', 3331),
 ('Germany', 1701),
 ('Israel', 1578),
 ('Australia', 1319),
 ('Switzerland', 514),
 ('France', 509),
 ('Italy', 486),
 ('Belgium', 464),
 ('Spain', 461),
 ('Sweden', 434),
 ('Chile', 260),
 ('Netherlands', 254),
 ('Singapore', 249),
 ('Mexico', 240),
 ('Turkey', 202),
 ('India', 174),
 ('Brazil', 147),
 ('Botswana', 142),
 ('Indonesia', 128),
 ('Portugal', 118),
 ('Japan', 118),
 ('Argentina', 80),
 ('New Zealand', 63),
 ('South Africa', 61),
 ('Lithuania', 50),
 ('Czechia', 49),
 ('Georgia', 46),
 ('Russia', 45),
 ('Colombia', 37),
 ('Sri Lanka', 37),
 ('Hong Kong', 35),
 ('Malta', 34),
 ('Poland', 28),
 ('Ecuador', 26),
 ('Canada', 25),
 ('Jordan', 22),
 ('Malawi', 21),
 ('Cambodia', 18),
 ('Norway', 17),
 ('Morocco', 15),
 ('Senegal', 15),
 ('Costa Rica', 14),
 ('Pakistan', 11),
 ('Nigeria', 10),
 ('Peru', 10),
 ('Trinidad and Tobago', 8),
 ('Brunei', 8),
 ('Slovakia', 8),
 ('Zambia', 7),
 ('Maldives', 7),
 

## 6. Find Percentage of Sequences Processed in the US

Determine the percentage of processed sequences for the Alpha, Delta, and Omicron variants in the US.

Store the result as a dictionary where keys are variant names and values are percentages.
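
One reading of this task is "each variant's share of all sequences processed in the US". The sketch below follows that reading on made-up rows and keeps the values numeric instead of formatting them as strings. The denominator choice is an assumption, and the names `toy`, `us`, `by_variant`, `total`, and `shares` are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the counts are made up for illustration.
toy = pd.DataFrame({
    "location": ["United States"] * 4 + ["Elsewhere"],
    "variant": ["Alpha", "Delta", "Omicron", "others", "Alpha"],
    "num_sequences": [10, 60, 5, 25, 99],
})

# Restrict to US rows, then total processed sequences per variant.
us = toy[toy["location"] == "United States"]
by_variant = us.groupby("variant")["num_sequences"].sum()

# Assumed denominator: all processed US sequences, across every variant.
total = float(by_variant.sum())

# Percentage (out of 100) for each of the three main variants.
shares = {
    name: round(float(by_variant[name]) / total * 100, 2)
    for name in ["Alpha", "Delta", "Omicron"]
}
```

Keeping the values as floats leaves them usable for further arithmetic; they can still be formatted with `"{:.2f}%".format(value)` when displayed.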


In [9]:
## Write here your code
us = data[data.location == 'United States']
# Denominator: processed sequences for all variants in the US
totalsequences = us.num_sequences.sum()
alphatotal = us.num_sequences[us.variant == 'Alpha'].sum()
deltatotal = us.num_sequences[us.variant == 'Delta'].sum()
omicrontotal = us.num_sequences[us.variant == 'Omicron'].sum()
# Values are formatted as percentage strings with two decimals
variantdictionary = {
    'Alpha': "{:.2%}".format(alphatotal / totalsequences),
    'Delta': "{:.2%}".format(deltatotal / totalsequences),
    'Omicron': "{:.2%}".format(omicrontotal / totalsequences)
}

variantdictionary

{'Alpha': '9.91%', 'Delta': '54.84%', 'Omicron': '1.18%'}

Report below the challenges you faced in solving this assignment:

Write here your answer

In [10]:
# I have not coded in Python recently, so recalling even basic operations took some research.
# I have used R recently, though, so I still understood the general steps needed.