## Import

Before we start with today's exercise, let us talk a little bit about **modules** and what they represent. Imagine you are working on a script that includes a variety of functions for common tasks, for instance functions to perform matrix operations or to visualise huge amounts of data. Wouldn't it be convenient if we could use those functions in another Python script? Well... since we are too lazy to rewrite everything... yes, it would!

Modules - it's your time to shine🌞<br>
Modules are nothing more than <ins>.py files</ins> containing different kinds of components, e.g. functions, which can be made available in any other Python script using the `import` statement. And yes, you can import your own Python scripts as well! Besides that, Python comes with an extensive collection of modules, known as the <ins>standard library</ins>. You can also use third-party packages (numpy, pandas, scipy, matplotlib, ...), but you will have to install them first.

To keep the <ins>namespace</ins> clean (and your brain sane), let's have a look at how to use `import`.
### How to `import` everything from a module

**Python-Syntax:** 
```python
import module_name
```

It doesn't get easier than that: the `import` keyword is followed by the name of the module. For example, after `import pandas` you can use all functions of the `pandas` module by prefixing their names with the module's <ins>namespace</ins>, `pandas.`. By convention, all import statements are placed at the top of the script to keep the code tidy and clear.
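
As a small illustration, here is the same pattern with the standard-library module `math` (chosen only because it needs no installation; the variable names are made up for this sketch):

```python
import math

# Every name in the module is reached through the "math." prefix.
radius = 2.0
area = math.pi * radius ** 2
root = math.sqrt(16)

print(root)            # 4.0
print(round(area, 2))  # 12.57
```

The prefix makes it obvious where `sqrt` and `pi` come from, even in a long script.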



### How to `import` specific contents `from` a module

**Python-Syntax:**
```python
from module_name import content_name1, content_name2, etc
``` 

Instead of importing an entire module, we can import specific contents, e.g. only the functions we really need. This allows us to use those functions without the namespace prefix. Keep in mind that multiple names are separated by commas (`,`).

**Pitfall:**
```python
from statistics import mean
from numpy import mean
```
Always keep an eye on which names you import from different modules. Here, two imported functions share the same name (a name collision). Python binds the name to whichever function was imported last, in our case the `mean` function of the numpy module. <ins>The last import always wins!</ins>
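
The same effect can be reproduced with the standard library alone. The sketch below swaps in `math.sqrt` and `cmath.sqrt` for the two `mean` functions, simply because both ship with Python; the variable names are illustrative:

```python
from math import sqrt
from cmath import sqrt  # same name: this import replaces math.sqrt

result = sqrt(4)
print(result)  # (2+0j), a complex number, so cmath.sqrt is in use

# Importing the modules themselves keeps both functions available.
import cmath
import math

real_root = math.sqrt(4)       # 2.0
complex_root = cmath.sqrt(-4)  # 2j
```

Keeping the module prefix is one way to avoid the collision; renaming on import is another.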



### How to `import` a module `as` you like

**Python-Syntax:** 
```python
import module_name as new_module_name_in_namespace
from module_name import component as new_component_name_in_namespace
```

Modules and packages can be renamed on import to keep code more succinct. Most widely used packages have an established abbreviation. Stick to it to make your code readable for others! For example, the established abbreviation for pandas is `pd`, so you would import it as:
```python
import pandas as pd
```
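
Both forms of renaming can be tried without installing anything. This is a minimal sketch using standard-library modules; the aliases `st` and `fact` are arbitrary choices for illustration, not established conventions:

```python
import statistics as st
from math import factorial as fact

average = st.mean([1, 2, 3, 4])
product = fact(5)

print(average)  # 2.5
print(product)  # 120
```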

# Exercise 06: Pandas A

[Pandas](https://pandas.pydata.org/docs/) is a Python package that provides data structures for working with tabular, labeled data (i.e. data in a table with rows and columns). It is a good tool for real-world data analysis in Python. In Google Colab, pandas is already installed. If you work locally, install it by executing `python -m pip install pandas` in a terminal.

[Here](https://www.youtube.com/playlist?list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS) is an extensive playlist covering all basic pandas operations. The skills required for this exercise are covered in Parts 1 to 6. Feel free to skip around, as the videos cover lots of details :)

If you prefer a text-based tutorial, take a look at the [Getting started](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html) section of the pandas documentation. 

You can also just go ahead and try to solve the tasks without any tutorial - each time something new is required, a link to a Google Search is provided. Since programming usually requires lots and lots of googling and reading documentation or Stack Overflow, this might give you an idea of what to google and which sites are helpful ;)

This exercise uses COVID-19 data from [Our World in Data](https://ourworldindata.org/). The cell below extracts part of that data from the source on [GitHub](https://github.com/owid/covid-19-data/) and stores it in .csv files in a directory "data" next to this notebook. Execute it to get the most recent data and have a look at the .csv files!


In [3]:
from pathlib import Path

import pandas as pd

data_dir = Path("./data")
data_dir.mkdir(parents=True, exist_ok=True)
vaccinations_raw = pd.read_csv("https://github.com/owid/covid-19-data/raw/master/public/data/vaccinations/vaccinations.csv")
vaccinations_raw[['location', 'date', 'daily_vaccinations', 'people_fully_vaccinated']].to_csv(data_dir / "vaccinations.csv", index=False)
cases_deaths_raw = pd.read_csv("https://github.com/owid/covid-19-data/raw/master/public/data/jhu/full_data.csv")
cases_deaths_raw[['location', 'date', 'new_cases', 'new_deaths']].to_csv(data_dir / "cases_deaths.csv", index=False)
locations_raw = pd.read_csv("https://github.com/owid/covid-19-data/raw/master/public/data/jhu/locations.csv")
locations_raw[['location', 'continent', 'population']].dropna().to_csv(data_dir / "locations.csv", index=False)

Read the three .csv files. Make sure the "date" columns are parsed as a datetime type (check by viewing the `.dtypes` attribute).

[Help!](https://www.google.com/search?q=pandas+read+csv)  
[Help with dates!](https://www.google.com/search?q=pandas+csv+parse+date)

In [122]:
# your code goes here:
vaccinations = pd.read_csv("./data/vaccinations.csv", parse_dates=["date"])
cases_deaths = pd.read_csv("./data/cases_deaths.csv", parse_dates=["date"])
locations = pd.read_csv("./data/locations.csv")
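
To see what `parse_dates` changes, the following sketch reads a made-up, in-memory CSV twice. The sample rows and variable names are hypothetical, and `pd.api.types.is_datetime64_any_dtype` is used so the check does not depend on how a particular pandas version prints the dtype:

```python
import io

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical miniature CSV standing in for one of the files in ./data.
csv_text = "location,date,new_cases\nAustria,2022-05-19,10\nAustria,2022-05-20,12\n"

without_parsing = pd.read_csv(io.StringIO(csv_text))
with_parsing = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Without parse_dates the column holds plain strings, with it real timestamps.
print(is_datetime64_any_dtype(without_parsing["date"]))  # False
print(is_datetime64_any_dtype(with_parsing["date"]))     # True
print(with_parsing.dtypes)
```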

Access the rows containing the most recent vaccination data for Austria.

[Help1](https://www.google.com/search?q=pandas+last+rows), [Help2](https://www.google.com/search?q=pandas+filter+rows)

In [8]:
# your code goes here:
vaccinations.loc[vaccinations["location"] == "Austria"].tail()

| | location | date | daily_vaccinations | people_fully_vaccinated |
|---|---|---|---|---|
| 6531 | Austria | 2022-05-16 | 2668.0 | NaN |
| 6532 | Austria | 2022-05-17 | 2376.0 | NaN |
| 6533 | Austria | 2022-05-18 | 2084.0 | NaN |
| 6534 | Austria | 2022-05-19 | 1792.0 | NaN |
| 6535 | Austria | 2022-05-20 | 1500.0 | 6616365.0 |


Create a new dataframe which contains dates, locations, and new cases - but no information about deaths.

[Help](https://www.google.com/search?q=pandas+remove+column) ([alternative](https://www.google.com/search?q=pandas+select+columns))

In [None]:
# your code goes here:
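# Option 1: drop the column that is not needed.
cases = cases_deaths.drop(columns='new_deaths')
# Option 2: select the wanted columns explicitly (this reassignment replaces option 1).
cases = cases_deaths.loc[:, ['date', 'location', 'new_cases']]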
cases

Get the names of all locations starting with an "E"!

[Help](https://www.google.com/search?q=pandas+string+starts+with) ([alternative](https://www.google.com/search?q=pandas+string+slice))


In [None]:
# your code goes here:
locations.loc[locations['location'].str.startswith('E'), 'location']

For each letter in the alphabet, print how many location names start with that letter.

[Help](https://www.google.com/search?q=python+loop+over+alphabet)

In [None]:
# your code goes here:
for letter in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
    print(letter, locations['location'].str.startswith(letter).sum())
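
The loop can also be replaced by a vectorised count. This is only a sketch on a made-up list of names (a stand-in for `locations['location']`); note that `value_counts` omits letters with zero matches, so `reindex` is used to add them back:

```python
import string

import pandas as pd

# Hypothetical stand-in for locations["location"].
names = pd.Series(["Austria", "Albania", "Belgium", "Egypt", "Estonia", "Eritrea"])

# Take the first character of every name, then count how often each occurs.
counts = names.str[0].value_counts().sort_index()

# Add the letters that never occur, with a count of 0.
all_letters = counts.reindex(list(string.ascii_uppercase), fill_value=0)

print(all_letters["A"], all_letters["E"], all_letters["Z"])  # 2 3 0
```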

Get the names of all locations with a population above 200,000,000.

[Help](https://www.google.com/search?q=pandas+select+larger+than)

In [None]:
# your code goes here:
locations.loc[locations['population'] > 200_000_000, 'location']

Get the names of all locations with a population between 7,000,000 and 9,000,000.

[Help](https://www.google.com/search?q=pandas+select+between)

In [None]:
# your code goes here:
locations.loc[(7_000_000 < locations['population']) & (locations['population'] < 9_000_000), 'location']
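
An alternative to combining two comparisons is `Series.between`. The sketch below assumes pandas 1.3 or newer (where `inclusive` accepts the string `"neither"`) and uses a made-up table in place of `locations`:

```python
import pandas as pd

# Hypothetical stand-in for the locations dataframe.
populations = pd.DataFrame({
    "location": ["A", "B", "C", "D"],
    "population": [7_000_000, 8_500_000, 9_000_000, 12_000_000],
})

# inclusive="neither" excludes both boundaries, matching the strict < comparisons.
mask = populations["population"].between(7_000_000, 9_000_000, inclusive="neither")
selected = populations.loc[mask, "location"]

print(selected.tolist())  # ['B']
```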

Vaccinations, cases, and deaths are reported not only for individual countries, but also for groups of countries (e.g. continents). Create a new column named "is_country" in each of the dataframes, based on whether the location is present in `locations.csv`.

[Help](https://www.google.com/search?q=pandas+select+if+in+list)

In [None]:
# your code goes here:
for df in (cases, cases_deaths, vaccinations):
    df["is_country"] = df["location"].isin(locations["location"])

Get the _country_ with the highest number of vaccinations in a single day - continents and other country groups don't count!

[Help](https://www.google.com/search?q=pandas+row+with+max+value+in+column)

In [None]:
# your code goes here:
vaccinations.loc[vaccinations.loc[vaccinations['is_country'], 'daily_vaccinations'].idxmax()]
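
Broken into steps on a tiny, made-up dataframe, the idea looks roughly like this (the values and the names `toy`, `country_rows`, `best_label` and `best_row` are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "location": ["World", "Austria", "Austria", "Italy"],
    "daily_vaccinations": [900.0, 10.0, 30.0, 20.0],
    "is_country": [False, True, True, True],
})

# Restrict to countries first, so the group row "World" cannot win.
country_rows = toy.loc[toy["is_country"], "daily_vaccinations"]

# idxmax returns the index label of the largest value, not the value itself.
best_label = country_rows.idxmax()

# Use that label to fetch the complete row from the original dataframe.
best_row = toy.loc[best_label]

print(best_row["location"], best_row["daily_vaccinations"])  # Austria 30.0
```

Because filtering keeps the original index labels, the label returned by `idxmax` can be looked up directly in the unfiltered dataframe.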

Get the 10 least-populated locations.

[Help](https://www.google.com/search?q=pandas+smallest+rows)

In [None]:
# your code goes here:
locations.nsmallest(10, 'population')

Find the unique continent names contained in the locations file.

[Help](https://www.google.com/search?q=pandas+find+unique+values)

In [130]:
# your code goes here:
locations["continent"].unique()

array(['Asia', 'Europe', 'Africa', 'North America', 'South America',
       'Oceania'], dtype=object)

Count the number of locations associated with each continent.

[Help](https://www.google.com/search?q=pandas+count+values)

In [109]:
# your code goes here:
locations["continent"].value_counts()

Africa           55
Asia             49
Europe           49
North America    34
Oceania          16
South America    13
Name: continent, dtype: int64