# Session 06

## Pandas

> pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

### [Website](https://pandas.pydata.org)

pandas adds a new data structure to Python, called **DataFrame**. It helps us store and work with tabular data.
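
As a quick illustration of what a DataFrame looks like, here is a minimal sketch that builds one by hand. The values are hypothetical sample data, not taken from the real dataset; only the column names mirror the ones used later in this session.

```python
import pandas as pd

# Hypothetical sample values, made up for illustration only.
sample = pd.DataFrame(
    {
        "location": ["Mexico", "Mexico", "Chile"],
        "date": ["2022-06-01", "2022-06-02", "2022-06-01"],
        "new_cases": [100.0, 120.0, 80.0],
    }
)

print(sample.shape)              # (rows, columns)
print(sample.columns.tolist())   # column names
print(sample["new_cases"].sum()) # each column behaves like a labelled array
```

Each key of the dictionary becomes a column, and every column must have the same number of rows.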

We are going to read COVID-19 data from the [COVID-19 Dataset by Our World in Data](https://github.com/owid/covid-19-data) repository.


In [None]:
import pandas as pd

covid_data_full = pd.read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv")

The `shape` property tells us the dimensions of the DataFrame as a `(rows, columns)` tuple.

In [None]:
print("Shape: ", covid_data_full.shape)

The `info` method tells us the column names and types.

With the `memory_usage="deep"` parameter, pandas inspects the contents of `object` columns and reports their actual memory consumption instead of an estimate.

In [None]:
covid_data_full.info(memory_usage="deep")

The `head` method gives us a sample at the start of the DataFrame.

You can pass a number to select how many rows to show, and there is also a `tail` method for the end of the DataFrame.

In [None]:
covid_data_full.head()

Note: `NaN` means "Not a Number", and it is how pandas represents missing values.
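
A short sketch of how `NaN` behaves may help here. It uses a small hand-made Series (hypothetical values) rather than the real dataset.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

# NaN never compares equal to anything, including itself,
# so use isna() / notna() instead of ==.
print(np.nan == np.nan)
print(values.isna().tolist())

# Most reductions skip NaN by default (skipna=True).
print(values.sum())
print(values.mean())
```

This is why the summary statistics shown later can be computed even though many cells are empty.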

The `describe` method gives us information about the data in the DataFrame.

In [None]:
covid_data_full.describe()

To make the output clearer, we will format floating-point numbers with 2 decimal places.

In [None]:
pd.options.display.float_format = "{:.2f}".format

covid_data_full.describe()
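
Setting `pd.options.display.float_format` changes the display for the whole session. If you only want the formatting for one output, a possible alternative is `pd.option_context`, sketched below on a small hypothetical DataFrame.

```python
import pandas as pd

# Hypothetical values, used only to show the formatting.
df = pd.DataFrame({"value": [3.14159, 1234.5]})

# The option applies only inside the with-block and is restored afterwards.
with pd.option_context("display.float_format", "{:.2f}".format):
    formatted = df.to_string()

print(formatted)
```

Either way, only the display changes; the stored values keep their full precision.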

For this exercise we will use only 4 columns.

In [None]:
covid_data = covid_data_full[["location", "date", "new_tests", "new_cases"]].copy()
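
The `.copy()` call makes the new DataFrame independent of the original, so later changes (such as dropping rows or adding columns) do not touch `covid_data_full`. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical stand-in for the full dataset.
full = pd.DataFrame(
    {
        "location": ["Mexico", "Chile"],
        "new_cases": [100.0, 80.0],
        "other": [1, 2],
    }
)

# Select two columns and take an independent copy.
subset = full[["location", "new_cases"]].copy()
subset["new_cases"] = subset["new_cases"] * 2

print(full["new_cases"].tolist())    # original is unchanged
print(subset["new_cases"].tolist())  # only the copy was modified
```

In older pandas versions, modifying a selection without `.copy()` can also trigger a `SettingWithCopyWarning`.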

In [None]:
print("Shape: ", covid_data.shape)
print("Info:")
covid_data.info(memory_usage="deep")
print("Head:")
covid_data.head()

By default, `describe` only summarizes numeric columns.

In [None]:
covid_data.describe()

But we can ask it to summarize only `object` columns (usually strings):

In [None]:
covid_data.describe(include="object")
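
For `object` columns the summary contains different rows than for numeric ones: `count`, `unique`, `top` (the most frequent value) and `freq` (how often it appears). The sketch below shows this on hypothetical data; the `dtype="object"` is set explicitly as an assumption, so the example does not depend on the default string dtype of the installed pandas version.

```python
import pandas as pd

# Hypothetical data; dtype is forced to object for the text column.
df = pd.DataFrame(
    {
        "location": pd.Series(["Mexico", "Mexico", "Chile"], dtype="object"),
        "new_cases": [100.0, 120.0, 80.0],
    }
)

text_summary = df.describe(include="object")
print(text_summary)
```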

As we saw, we have missing data. We can count the missing values in each column:

In [None]:
print(covid_data.isnull().sum(axis=0))

We can remove the records that have no value in the `new_tests` column:

In [None]:
covid_data.dropna(subset=["new_tests"], inplace=True)
print(covid_data.isnull().sum(axis=0))

We do the same for `new_cases`:

In [None]:
covid_data.dropna(subset=["new_cases"], inplace=True)
print(covid_data.isnull().sum(axis=0))
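
The two `dropna` steps above can also be written as a single call by listing both columns in `subset`. This is a minimal sketch on hypothetical data; it returns a new DataFrame instead of using `inplace=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering every combination of missing values.
df = pd.DataFrame(
    {
        "location": ["A", "B", "C", "D"],
        "new_tests": [10.0, np.nan, 30.0, np.nan],
        "new_cases": [1.0, 2.0, np.nan, np.nan],
    }
)

# Drop rows where either column is missing; df itself is left untouched.
clean = df.dropna(subset=["new_tests", "new_cases"])

print(clean["location"].tolist())
print(len(df), len(clean))
```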

Examine the DataFrame now that the missing data has been removed:

In [None]:
covid_data.describe(include="all")

We can select a subset of the rows with a boolean condition, for example all records for one country:

In [None]:
covid_data[covid_data.location == "Mexico"]

In [None]:
covid_data["date"].max()

In [None]:
covid_data[covid_data.date == "2022-06-23"]

In [None]:
covid_data[covid_data.location == "Mexico"]["date"].max()

In [None]:
covid_data[covid_data.date == "2022-06-18"]

In [None]:
covid_data[covid_data.date == "2022-06-18"].head(5)
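
The filters above each use a single condition. Conditions can be combined with `&` (and) or `|` (or), as long as each one is wrapped in parentheses. The sketch below uses hypothetical data and also shows the equivalent `query` form.

```python
import pandas as pd

# Hypothetical rows, made up for illustration.
df = pd.DataFrame(
    {
        "location": ["Mexico", "Mexico", "Chile"],
        "date": ["2022-06-17", "2022-06-18", "2022-06-18"],
        "new_cases": [100.0, 120.0, 80.0],
    }
)

# Each comparison yields a boolean Series; & combines them element-wise.
mask = (df["location"] == "Mexico") & (df["date"] == "2022-06-18")
selected = df[mask]

# The same filter written as a query string.
same = df.query('location == "Mexico" and date == "2022-06-18"')

print(selected)
```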

Look for rows where there are more new cases than new tests:

In [None]:
covid_data.loc[covid_data["new_cases"] > covid_data["new_tests"]]

Remove those rows, then repeat the query to confirm that none remain:

In [None]:
covid_data.drop(covid_data[covid_data["new_cases"] > covid_data["new_tests"]].index, inplace=True)
covid_data.loc[covid_data["new_cases"] > covid_data["new_tests"]]
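
An alternative to `drop` is to keep the rows that do not match the condition, by negating the boolean mask with `~`. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical rows; "B" reports more cases than tests.
df = pd.DataFrame(
    {
        "location": ["A", "B", "C"],
        "new_tests": [10.0, 5.0, 20.0],
        "new_cases": [1.0, 8.0, 20.0],
    }
)

suspicious = df["new_cases"] > df["new_tests"]

# ~ inverts the mask, so only the consistent rows are kept.
kept = df[~suspicious]

print(int(suspicious.sum()), "row(s) removed")
print(kept["location"].tolist())
```

This form returns a new DataFrame and leaves `df` unchanged, which can make a notebook cell safer to re-run.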

Count the records for each country:

In [None]:
covid_data.groupby(["location"])["location"].count()

In [None]:
covid_data.groupby(["location"])["location"].count()["Mexico"]
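
The same per-country count can be obtained in a couple of other ways. The sketch below, on hypothetical data, compares the `groupby(...).count()` form used above with `groupby(...).size()` and `value_counts()`.

```python
import pandas as pd

# Hypothetical rows, made up for illustration.
df = pd.DataFrame({"location": ["Mexico", "Chile", "Mexico"]})

counts_groupby = df.groupby("location")["location"].count()
counts_size = df.groupby("location").size()
# value_counts also sorts the result, most frequent value first.
counts_vc = df["location"].value_counts()

print(counts_vc)
```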

Let's create a new column, `test_pct`, holding the new cases as a percentage of the new tests:

In [None]:
covid_data["test_pct"] = covid_data["new_cases"] / covid_data["new_tests"] * 100
covid_data.describe(include="all")

In [None]:
covid_data.head()
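
One thing to watch with this kind of ratio is a day with zero tests: pandas does not raise an error, it produces `NaN` (for 0/0) or infinity. The sketch below shows this on hypothetical numbers and one possible way to handle it; whether the real dataset contains such rows is not checked here.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, including two days with zero tests.
df = pd.DataFrame(
    {
        "new_tests": [200.0, 50.0, 0.0, 0.0],
        "new_cases": [10.0, 5.0, 0.0, 3.0],
    }
)

raw_pct = df["new_cases"] / df["new_tests"] * 100

# Treat infinite ratios as missing so they do not distort averages.
df["test_pct"] = raw_pct.replace([np.inf, -np.inf], np.nan)

print(raw_pct.tolist())
print(df["test_pct"].tolist())
```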

We can get the average by country

In [None]:
covid_data.groupby(["location"])["test_pct"].mean()

And sort it in descending order:

In [None]:
test_pct_df = covid_data.groupby(["location"])["test_pct"].mean().reset_index().sort_values(["test_pct"], ascending=False)
test_pct_df.head(20)
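
When only the top few groups are needed, `nlargest` on the grouped result is a possible shortcut for sorting and then taking `head`. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical percentages for three made-up locations.
df = pd.DataFrame(
    {
        "location": ["A", "A", "B", "C", "C"],
        "test_pct": [10.0, 20.0, 5.0, 30.0, 40.0],
    }
)

mean_pct = df.groupby("location")["test_pct"].mean()

# Sort-then-head, as in the cell above.
sorted_df = mean_pct.reset_index().sort_values("test_pct", ascending=False)

# Shortcut: the two largest means, already in descending order.
top2 = mean_pct.nlargest(2)

print(top2)
```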

Plot the 20 countries with the highest cases-to-tests ratio:

In [None]:
import matplotlib.pyplot as plt

data_plot = test_pct_df.head(20).plot(
    x="location",
    y="test_pct",
    kind="bar",
    figsize=(10,5),
)
plt.xticks(rotation=90)
plt.title("COVID-19 data results")
plt.legend(["Percentage"])
plt.xlabel("Country")
plt.ylabel("cases vs. test ratio in %")
plt.show()


Plot the new cases over time for one country:

In [None]:
country = "Mexico"
covid_data[covid_data["location"] == country].plot(
    x="date",
    y="new_cases",
    kind="line",
    figsize=(10, 5),
)
plt.xticks(rotation=90)
plt.title(country + " cases over time")
plt.legend(["New Cases"])
plt.xlabel("Date")
plt.ylabel("Number of New Cases")
plt.show()
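
In the plot above the `date` column is still text, because `read_csv` does not parse dates unless asked to. Converting it with `pd.to_datetime` lets pandas sort the rows chronologically and do date arithmetic. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical rows, deliberately out of order.
df = pd.DataFrame(
    {
        "date": ["2022-06-03", "2022-06-01", "2022-06-02"],
        "new_cases": [30.0, 10.0, 20.0],
    }
)

# Convert the text column to real timestamps, then sort chronologically.
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date")

# Date arithmetic now works, e.g. the span covered by the data.
span_days = (df["date"].max() - df["date"].min()).days

print(df["date"].dtype)
print(span_days)
```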

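With additional libraries such as geopandas we can go further, for example drawing a world map colored by one of the columns of the dataset.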

In [None]:
import seaborn as sns
import altair as alt
import geopandas as gpd
from pyproj import CRS

In [None]:
latest_df = covid_data_full[covid_data_full.date == "2022-06-01"].copy()
latest_df.rename(columns={"location":"CNTRY_NAME"}, inplace=True)

In [None]:
latest_df.head(5)

In [None]:
# Named world_map to avoid shadowing Python's built-in map() function.
world_map = gpd.read_file('https://opendata.arcgis.com/datasets/a21fdb46d23e4ef896f31475217cbb08_1.geojson')

In [None]:
world_map.head()

In [None]:
covid_map = pd.merge(world_map, latest_df, on="CNTRY_NAME")
crs_epsg = CRS("epsg:4326")
corona_gpd = gpd.GeoDataFrame(
    covid_map,
    crs=crs_epsg,
    geometry="geometry",
)
corona_gpd.head(5)
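
`pd.merge` performs an inner join by default, so any country whose name is spelled differently in the two tables silently drops out of the result. One way to check for this is an outer join with `indicator=True`. The sketch below uses plain DataFrames with hypothetical names; the spelling mismatch is made up and is not a statement about the real datasets.

```python
import pandas as pd

# Hypothetical stand-ins for the map table and the statistics table.
shapes = pd.DataFrame({"CNTRY_NAME": ["Mexico", "Chile", "Russia"]})
stats = pd.DataFrame(
    {
        "CNTRY_NAME": ["Mexico", "Chile", "Russian Federation"],
        "total_deaths": [1.0, 2.0, 3.0],
    }
)

# Default inner join: only names present in both tables survive.
inner = pd.merge(shapes, stats, on="CNTRY_NAME")

# Outer join with an indicator column to list the names that did not match.
check = pd.merge(shapes, stats, on="CNTRY_NAME", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]

print(len(inner))
print(unmatched[["CNTRY_NAME", "_merge"]])
```

The `_merge` column reports `left_only`, `right_only` or `both` for each row, which makes it easy to spot names that need to be harmonized before plotting.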

In [None]:
ax = corona_gpd.plot(
    figsize=(30, 22),
    column="total_deaths",
    cmap="inferno",
    scheme="HeadTailBreaks",
    k=9,
    alpha=1,
    legend=True,
    markersize=0.5,
)
plt.title("Coronavirus Total Deaths by Country")
plt.show()