# COSC 426 / 526 - Assignment 01

### Discussed: January 24, 2025
### Due: January 31, 2025, before 8 AM ET

---
This notebook contains essential functions for your assignment. You will need to enhance and write additional code to complete the tasks. Please submit your completed work to the designated GitHub repository.

This assignment is crafted to assess your proficiency in Python programming. You are encouraged to solve the problems using your current knowledge and skills, refraining from searching for solutions online. Be aware that future assignments in this course may present more complexity than this initial task.

## Implementation Requirements

For your project, please adhere to the following straightforward requirements:

**Use Python 3:** Ensure that all your code is written in Python 3. We will be executing your code using a Python 3 interpreter. Code written in Python 2 is likely to encounter issues and may not run correctly.

**Code Format:** Your code should be presented in one of two formats:

* As a single Python script, saved with a .py file extension.
* As a Jupyter Notebook.

These requirements are essential to ensure that your code is compatible and can be executed smoothly in our environment.

## Dataset Information

For this assignment, you will be working with a dataset available from [this Kaggle page](https://www.kaggle.com/yamqwe/omicron-covid19-variant-daily-cases?select=covid-variants.csv). Additionally, a copy of this dataset is located in the same directory as this Jupyter notebook.

The dataset provides information on the processing of COVID-19 sequence data by different countries over a period of time. It is formatted as a CSV (Comma-Separated Values) file, containing the following six key columns:

1. `location`: This column indicates the specific country to which the data pertains.
2. `date`: This column shows the date on which the data was recorded.
3. `variant`: This column identifies the specific COVID-19 variant related to the data entry.
4. `num_sequences`: This column reflects the count of sequences that have been **processed** for the respective country, variant, and date.
5. `num_sequences_total`: This column indicates the total count of sequences recorded for the specified country and date, across all variants (the same value is repeated on every variant row for that country and date).
6. `perc_sequences`: This column indicates the percentage of `num_sequences_total` that `num_sequences` represents for the specified country, variant, and date.

Each row in the dataset corresponds to the record of sequences processed for a particular COVID-19 variant in a specific country on a given day.
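To make the relationship between the three numeric columns concrete, here is a minimal sketch on invented rows. The country name `Atlantis` and all counts are hypothetical, and the formula `num_sequences / num_sequences_total * 100` is an assumption about how `perc_sequences` is derived; check it against the real file before relying on it.

```python
import pandas as pd

# Hypothetical rows, invented for illustration; they are not taken from covid-variants.csv.
toy = pd.DataFrame(
    {
        "location": ["Atlantis", "Atlantis", "Atlantis"],
        "date": ["2021-01-04"] * 3,
        "variant": ["Alpha", "Delta", "others"],
        "num_sequences": [2, 6, 2],
        # The daily total is repeated on every variant row for that country and date.
        "num_sequences_total": [10, 10, 10],
    }
)

# Assumed relationship: each variant's share of the day's total, in percent.
toy["perc_sequences"] = toy["num_sequences"] / toy["num_sequences_total"] * 100

print(toy[["variant", "num_sequences", "num_sequences_total", "perc_sequences"]])
```

On the real data, the same expression can be compared with the stored `perc_sequences` column to confirm or refute the assumption.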

In [150]:
## Import any package you may need here below (Suggestion: consider pandas)
import pandas as pd

In [151]:
## Add here anything else you may need (Suggestion: import your data) 
df = pd.read_csv('covid-variants.csv')
display(df)

| | location | date | variant | num_sequences | perc_sequences | num_sequences_total |
|---|---|---|---|---|---|---|
| 0 | Angola | 2020-07-06 | Alpha | 0 | 0.0 | 3 |
| 1 | Angola | 2020-07-06 | B.1.1.277 | 0 | 0.0 | 3 |
| 2 | Angola | 2020-07-06 | B.1.1.302 | 0 | 0.0 | 3 |
| 3 | Angola | 2020-07-06 | B.1.1.519 | 0 | 0.0 | 3 |
| 4 | Angola | 2020-07-06 | B.1.160 | 0 | 0.0 | 3 |
| ... | ... | ... | ... | ... | ... | ... |
| 100411 | Zimbabwe | 2021-11-01 | Omicron | 0 | 0.0 | 6 |
| 100412 | Zimbabwe | 2021-11-01 | S:677H.Robin1 | 0 | 0.0 | 6 |
| 100413 | Zimbabwe | 2021-11-01 | S:677P.Pelican | 0 | 0.0 | 6 |
| 100414 | Zimbabwe | 2021-11-01 | others | 0 | 0.0 | 6 |


## Problem 1. Identifying Less Common COVID-19 Variants

In the United States, the three primary COVID-19 variants we have encountered are:
1. Alpha
2. Delta
3. Omicron

Besides these, the World Health Organization (WHO) has recognized several other variants of the virus.

Your task is to identify these additional variants present in the dataset. To do this, you should:

Examine the dataset, specifically focusing on the contents of the `variant` column.

Identify and compile a list of variant names that are different from the main variants (Alpha, Delta, Omicron) recognized in the US.

Ensure that you exclude the two categories labeled "non_who" and "others" in the `variant` column, as these are not individual variants but collective categories.

Organize the names of these less common variants in an alphanumeric order.

Store this organized list of variant names in a Python list.

Assign this list to a variable named uncommon_variants.

**Remember** this list should only include the names of the variants that are not as commonly known or discussed in the US context but are acknowledged by the WHO.


In [153]:
# Labels to drop: the two collective categories and the three main US variants
remove_variants = ['non_who', 'others', 'Alpha', 'Delta', 'Omicron']
# Keep each remaining variant name once and sort explicitly rather than relying on file order
uncommon_variants = sorted(df.loc[~df['variant'].isin(remove_variants), 'variant'].unique())
print(uncommon_variants)

['B.1.1.277', 'B.1.1.302', 'B.1.1.519', 'B.1.160', 'B.1.177', 'B.1.221', 'B.1.258', 'B.1.367', 'B.1.620', 'Beta', 'Epsilon', 'Eta', 'Gamma', 'Iota', 'Kappa', 'Lambda', 'Mu', 'S:677H.Robin1', 'S:677P.Pelican']


**Solution:**

['B.1.1.277', 'B.1.1.302', 'B.1.1.519', 'B.1.160', 'B.1.177', 'B.1.221', 'B.1.258', 'B.1.367', 'B.1.620', 'Beta', 'Epsilon', 'Eta', 'Gamma', 'Iota', 'Kappa', 'Lambda', 'Mu', 'S:677H.Robin1', 'S:677P.Pelican']
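The ordering in the list above can be reproduced on a small, deliberately shuffled sample. This is a sketch under the assumption that "alphanumeric order" means Python's default string ordering (by Unicode code point); `sample_variants` is a hypothetical subset chosen for illustration.

```python
# Hypothetical, unsorted sample of labels (a subset of names that appear in the dataset).
sample_variants = ["Mu", "B.1.160", "Beta", "others", "S:677H.Robin1", "B.1.1.519", "Alpha"]

# The three main US variants plus the two collective categories.
excluded = {"Alpha", "Delta", "Omicron", "non_who", "others"}

# sorted() compares strings character by character using Unicode code points,
# so '.' (46) sorts before digits, and digits sort before letters.
ordered = sorted(v for v in set(sample_variants) if v not in excluded)
print(ordered)
```

Because `'.'` sorts before `'6'`, `B.1.1.519` comes before `B.1.160`, which matches the order in the expected solution.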

## Problem 2. Identifying the Most Processed COVID-19 Variant

Your objective is to find out which COVID-19 variant has undergone the most processing, as indicated by the number of sequences processed.

To complete this task, follow these steps:

Examine the dataset to compare the number of processed sequences for each COVID-19 variant.

Identify the variant that has the highest number of sequences processed.

Store the name of this variant in a variable named variant_most_proc.

The variable `variant_most_proc` should ultimately contain the name of the COVID-19 variant with the greatest number of processed sequences according to the dataset.

In [118]:
filtered_df = df.groupby('variant')['num_sequences'].sum()
variant_most_proc = filtered_df.idxmax()
print(variant_most_proc)

Delta


**Solution:**

Delta
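As a quick sanity check of the group-then-sum approach, the sketch below runs it on invented counts. The numbers are hypothetical; `nlargest` is used only to show the runner-up alongside the winner, since `idxmax` returns a single label.

```python
import pandas as pd

# Hypothetical counts, invented for illustration only.
toy = pd.DataFrame(
    {
        "variant": ["Alpha", "Delta", "Alpha", "Delta", "others"],
        "num_sequences": [3, 10, 4, 1, 8],
    }
)

# Total processed sequences per variant across all rows.
totals = toy.groupby("variant")["num_sequences"].sum()

# Label of the largest total (only the first label is returned if there is a tie).
top = totals.idxmax()

# The two largest totals, to see how close the runner-up is.
top_two = totals.nlargest(2)

print(top)
print(top_two)
```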

## Problem 3. Identifying the Top Country in Processing Sequences Across All COVID-19 Variants

Your task is to find out which country has been the most effective in processing sequences for all COVID-19 variants, including the general categories labeled as "catch all."

To achieve this:

Analyze the dataset to evaluate the sequence processing performance of each country for all COVID-19 variants.

Identify the country that has the highest efficiency or success rate in processing sequences across all these variants.

Assign the name of this country to a variable named best_proc_country.

The `best_proc_country` variable should ultimately hold the name of the country that stands out as the best in processing COVID-19 sequences for all variants, inclusive of the "catch all" categories.

In [119]:
success_by_country = df.groupby('location')['perc_sequences'].mean()
best_proc_country = success_by_country.idxmax()
print(best_proc_country)

Cyprus


**Solution:**

Cyprus
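The cell above averages the per-row `perc_sequences` values, whereas the Problem 4a cell pools the counts and computes one percentage per country. The assignment text does not say which definition of "efficiency" is intended, so the following sketch only illustrates, on invented data, that the two metrics can pick different winners. Countries `A` and `B` and all counts are hypothetical.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
toy = pd.DataFrame(
    {
        "location": ["A", "A", "B", "B"],
        "num_sequences": [1, 10, 2, 2],
        "num_sequences_total": [2, 100, 10, 10],
    }
)
toy["perc_sequences"] = toy["num_sequences"] / toy["num_sequences_total"] * 100

# Metric 1: average the per-row percentages (every row counts equally).
mean_of_perc = toy.groupby("location")["perc_sequences"].mean()

# Metric 2: pool the counts first, then compute one percentage (large rows dominate).
sums = toy.groupby("location")[["num_sequences", "num_sequences_total"]].sum()
ratio_of_sums = sums["num_sequences"] / sums["num_sequences_total"] * 100

print(mean_of_perc.idxmax())   # winner under metric 1
print(ratio_of_sums.idxmax())  # winner under metric 2
```

Here country `A` wins on the mean of percentages because of one small, high-percentage row, while `B` wins once the counts are pooled. Stating which metric was used makes a result such as the one above easier to interpret.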

## Problem 4a. Identifying the Leading Country in Processing Alpha, Delta, and Omicron Sequences

Your objective in this task is to ascertain which country has demonstrated the highest proficiency in processing sequences specifically for the Alpha, Delta, and Omicron COVID-19 variants.

Follow these steps:

Review the dataset to assess how different countries have processed sequences for the Alpha, Delta, and Omicron variants.

Identify which country has shown the best overall performance in processing sequences for these three specific variants.

Record the name of this country in a variable named best_proc_country_ado.

Ultimately, `best_proc_country_ado` should contain the name of the country that stands out for its efficiency in handling sequences of the Alpha, Delta, and Omicron variants.

In [136]:
keep_variants = ['Alpha', 'Delta', 'Omicron']
filtered_df = df[df['variant'].isin(keep_variants)]

success_by_country = filtered_df.groupby('location')[['num_sequences', 'num_sequences_total']].sum()
success_by_country['perc_sequences_calc'] = (success_by_country['num_sequences'] / success_by_country['num_sequences_total']) * 100
best_proc_country_ado = success_by_country['perc_sequences_calc'].idxmax()
print(best_proc_country_ado)

Vietnam


**Solution:**

Vietnam

## Problem 4b. Assessing the U.S. Ranking in Processing Alpha, Delta, and Omicron Sequences

In this task, you are required to find out the U.S.'s ranking in terms of its efficiency in processing sequences for the Alpha, Delta, and Omicron COVID-19 variants.

To complete this task:

Analyze the dataset to compare the U.S.'s performance in processing sequences for these three variants against other countries.

Determine the U.S.'s position in this ranking, with the understanding that the top-performing country is ranked 1.

Store the U.S.'s ranking as an integer in a variable named `us_ranking`.

**Important Notes:**

Keep in mind that while the highest-ranking country is numbered 1, Python indexing begins at 0.

Remember, in Jupyter notebooks, variables from previously executed code cells remain accessible in subsequent cells. This means you can utilize data or results from problem 4a without needing to duplicate code.


In [149]:
success_by_country_sorted = success_by_country.sort_values('perc_sequences_calc', ascending=False)
us_ranking = success_by_country_sorted.index.get_loc('United States') + 1
print(us_ranking)

57


**Solution:**
    
57
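The off-by-one note in this problem can be illustrated on a small series. This is a sketch on invented scores, not the assignment data; it also shows `Series.rank` as one possible alternative when ties matter, which the positional approach does not handle explicitly.

```python
import pandas as pd

# Hypothetical scores, invented for illustration only.
scores = pd.Series({"A": 80.0, "B": 95.0, "C": 80.0, "D": 10.0})

# Positional approach: sort descending, take the 0-based position, then add 1.
ordered = scores.sort_values(ascending=False)
position_rank = ordered.index.get_loc("D") + 1

# Rank-based approach: with method="min", tied entries share the best rank.
ranks = scores.rank(ascending=False, method="min").astype(int)

print(position_rank)
print(ranks.to_dict())
```

With the positional approach, the order of tied entries (`A` and `C` here) depends on the sort, so their ranks would differ by one; `rank(method="min")` assigns both the same rank.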

## Problem 5. Calculating Processed Omicron Sequences by Country for a Specific Date

This task involves determining and sorting the total number of sequences processed for the Omicron variant by each country on December 27, 2021.

To achieve this:

Analyze the dataset to calculate the total number of processed sequences for the Omicron variant in each country specifically for the date December 27, 2021.

Arrange the results in descending order, starting with the country that processed the highest number of sequences down to the smallest.

Format the output as a list of tuples. Each tuple should contain two elements: the country's name and the corresponding number of processed sequences.

Store this sorted list of tuples in a variable named total_omicron_2021.

The `total_omicron_2021` variable will thus represent a structured list, providing a clear overview of each country's contribution to processing Omicron sequences on the specified date.


In [157]:
filtered_df = df[df['date'] == '2021-12-27']
filtered_df = filtered_df[filtered_df['variant'] == 'Omicron']
filtered_df = filtered_df[['location', 'num_sequences']].sort_values(by=['num_sequences', 'location'], ascending=[False, True])
total_omicron_2021 = list(filtered_df.itertuples(index=False, name=None))
print(total_omicron_2021)

[('United Kingdom', 52456), ('United States', 24681), ('Denmark', 3331), ('Germany', 1701), ('Israel', 1578), ('Australia', 1319), ('Switzerland', 514), ('France', 509), ('Italy', 486), ('Belgium', 464), ('Spain', 461), ('Sweden', 434), ('Chile', 260), ('Netherlands', 254), ('Singapore', 249), ('Mexico', 240), ('Turkey', 202), ('India', 174), ('Brazil', 147), ('Botswana', 142), ('Indonesia', 128), ('Japan', 118), ('Portugal', 118), ('Argentina', 80), ('New Zealand', 63), ('South Africa', 61), ('Lithuania', 50), ('Czechia', 49), ('Georgia', 46), ('Russia', 45), ('Colombia', 37), ('Sri Lanka', 37), ('Hong Kong', 35), ('Malta', 34), ('Poland', 28), ('Ecuador', 26), ('Canada', 25), ('Jordan', 22), ('Malawi', 21), ('Cambodia', 18), ('Norway', 17), ('Morocco', 15), ('Senegal', 15), ('Costa Rica', 14), ('Pakistan', 11), ('Nigeria', 10), ('Peru', 10), ('Brunei', 8), ('Slovakia', 8), ('Trinidad and Tobago', 8), ('Maldives', 7), ('Zambia', 7), ('Thailand', 6), ('Malaysia', 5), ('Bangladesh', 4),

**Solution:**

[('United Kingdom', 52456), ('United States', 24681), ('Denmark', 3331), ('Germany', 1701), ('Israel', 1578), ('Australia', 1319), ('Switzerland', 514), ('France', 509), ('Italy', 486), ('Belgium', 464), ('Spain', 461), ('Sweden', 434), ('Chile', 260), ('Netherlands', 254), ('Singapore', 249), ('Mexico', 240), ('Turkey', 202), ('India', 174), ('Brazil', 147), ('Botswana', 142), ('Indonesia', 128), ('Japan', 118), ('Portugal', 118), ('Argentina', 80), ('New Zealand', 63), ('South Africa', 61), ('Lithuania', 50), ('Czechia', 49), ('Georgia', 46), ('Russia', 45), ('Colombia', 37), ('Sri Lanka', 37), ('Hong Kong', 35), ('Malta', 34), ('Poland', 28), ('Ecuador', 26), ('Canada', 25), ('Jordan', 22), ('Malawi', 21), ('Cambodia', 18), ('Norway', 17), ('Morocco', 15), ('Senegal', 15), ('Costa Rica', 14), ('Pakistan', 11), ('Nigeria', 10), ('Peru', 10), ('Brunei', 8), ('Slovakia', 8), ('Trinidad and Tobago', 8), ('Maldives', 7), ('Zambia', 7), ('Thailand', 6), ('Malaysia', 5), ('Bangladesh', 4), ('Romania', 3), ('Iran', 1), ('Oman', 1), ('Ukraine', 1), ('Vietnam', 1), ('Moldova', 0), ('Monaco', 0), ('Nepal', 0), ('South Korea', 0)]
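The filter, sort, and tuple-conversion steps can be checked on a few invented rows. This sketch makes two choices that are assumptions rather than requirements of the assignment: it parses `date` with `pd.to_datetime` instead of comparing strings, and it breaks ties alphabetically by country. Locations `A`, `B`, and `C` are hypothetical.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only.
toy = pd.DataFrame(
    {
        "location": ["B", "A", "C", "A"],
        "date": ["2021-12-27", "2021-12-27", "2021-12-27", "2021-12-13"],
        "variant": ["Omicron", "Omicron", "Delta", "Omicron"],
        "num_sequences": [5, 5, 9, 1],
    }
)

# Parsing the dates makes the comparison independent of the exact string format.
toy["date"] = pd.to_datetime(toy["date"])
mask = (toy["date"] == pd.Timestamp("2021-12-27")) & (toy["variant"] == "Omicron")

# Sort by count (descending), breaking ties alphabetically by country.
selected = toy.loc[mask, ["location", "num_sequences"]].sort_values(
    by=["num_sequences", "location"], ascending=[False, True]
)

# Build (country, count) tuples; .tolist() converts the counts to plain Python ints.
pairs = list(zip(selected["location"], selected["num_sequences"].tolist()))
print(pairs)
```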

## Problem 6. Calculating U.S. Processing Percentages for Key COVID-19 Variants

Your task is to find out the percentage of sequences that have been processed in the United States (U.S.) for three specific COVID-19 variants: Alpha, Delta, and Omicron.

To complete this:

Examine the dataset to determine the percentage of sequences processed for each of the Alpha, Delta, and Omicron variants in the U.S.

Compile these percentages into a dictionary. In this dictionary:

The keys should be the names of the variants ("Alpha", "Delta", and "Omicron").

The values should be the corresponding percentages of sequences processed for each variant.

Assign this dictionary to a variable named proc_seq_us.

Ultimately, the proc_seq_us variable will hold a structured dictionary, providing a clear representation of the U.S.'s processing percentages for each of these significant COVID-19 variants.




In [129]:
keep_variants = ['Alpha', 'Delta', 'Omicron']
filtered_df = df[df['variant'].isin(keep_variants)]
filtered_df = filtered_df[filtered_df['location'] == 'United States']

variant_totals = filtered_df.groupby('variant')[['num_sequences', 'num_sequences_total']].sum()
variant_totals['perc_sequences_calc'] = (variant_totals['num_sequences'] / variant_totals['num_sequences_total']) * 100
proc_seq_us = variant_totals['perc_sequences_calc'].to_dict()
print(proc_seq_us)

{'Alpha': 11.520951617373877, 'Delta': 63.76796208057254, 'Omicron': 1.370817855027461}


**Solution:**

{'Alpha': 11.520951617373877, 'Delta': 63.76796208057254, 'Omicron': 1.370817855027461}


## Problem 7. Write the comprehensive README files

**Note:** These directions are for a README file for your assignments. An extensive README file should be used for your project. 

***Write a comprehensive README file for Assignment 1***

A comprehensive README file on GitHub is the primary information source for anyone exploring your repository. It is essential for clearly conveying your assignment's purpose, setup, and usage.

Key elements of a comprehensive README for an assignment include:

* **Assignment title:** This should clearly state the name of your project.
* **Assignment description:** Provide a concise overview of what the project entails. This section should explain the project's usefulness and the problems it addresses.
* **Installation instructions:** Offer detailed steps for setting up the project. This includes any prerequisites, dependencies, and a step-by-step guide to getting the project running.
* **Use:** Give clear instructions on how to use the project. Enhance this section with practical examples, including code snippets, screenshots, or videos.
* **Contact information:** Detail how to contact you, for example by email.
* **Acknowledgments:** Credit any individuals, organizations, or other entities that contributed significantly to the assignment.

**Add the README file to the GitHub repository with the solution of Problems 1-6.**

# Live Chat: The History of Big Data

In her [keynote speech](https://youtu.be/CNoi-XqwJnA) at Supercomputing 2013, Intel's Genevieve Bell illustrates that humanity has been managing big data for thousands of years. She emphasizes that adopting the appropriate perspective is crucial for solving many contemporary challenges related to big data. Watch the video here: [Genevieve Bell Keynote - Supercomputing 2013](https://youtu.be/CNoi-XqwJnA).

Provide a summary of the video in no more than 150 words. Additionally, list three key concepts or insights that you gained from watching this video.

**Important Note:** Your response to this question is as important as your code. Brief answers, or responses limited to just a few words, will be considered inadequate and will negatively impact the overall grade for your assignment.


Write your answer here

Summary:
Genevieve Bell describes at length how William the Conqueror surveyed his country and how that data affected English life for hundreds of years. This early big data can be broken down into surveys (data), the Winchester Roll used to organize the data (framework), and its use in judging citizens (extracting value). She states that today's big data is broken into facts, visualization/analytics, and algorithms. Facts and raw data are hard to interpret because people tend to lie. Visualizations can clarify facts, and analytics can decide which data is relevant. Algorithms are only as good as the facts we feed them because they ultimately produce the answers we want them to produce. Although big data may look like data + visualization/analytics + algorithms, humans are also part of the equation. Big data starts and ends with humans, not technology. The only things stopping us are intellect and imagination.

Three key concepts/insights gained from video:
- Big data is just a combination of data, visualization/analytics, algorithms, and humans.
- Visualization makes facts easier to see whereas analytics gives those facts meaning.
- The Domesday Book preserved centuries-old data and was still used to shape English life in areas such as land ownership and the tax base. It raises the question of whether, and how, modern data could be stored so that it lasts 1,000 years, as the Domesday Book has.