
# Assignment 3: Data Wrangling

In this assignment, you'll be putting your new data wrangling skills to use. You'll be working with data on threatened species from the OECD Environment Statistics database. This dataset on biodiversity shows numbers of known species and threatened species with the aim of indicating the state of mammals, birds, freshwater fish, reptiles, amphibians and vascular plants.

An acronym that may be helpful to know is IUCN: the International Union for Conservation of Nature.

# Part 1. Proportion of endangered mammal species

In part 1, we'll do a series of data manipulations to find out which country has the highest proportion of endangered species among all mammal species. To get started:

1. Import the Pandas package
2. Read the data into a dataframe using `pd.read_csv()`. The CSV is available at `https://raw.githubusercontent.com/envirodatascience/ENVS-617-Class-Data/main/OECD_threatened_species.csv`.

In [3]:
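import pandas as pd
import numpy as np

# Read the full species table (all species groups) from the course repository.
# It is named species_df so that mammals_df can be reserved for the filtered data in 1.3.
url = 'https://raw.githubusercontent.com/envirodatascience/ENVS-617-Class-Data/main/OECD_threatened_species.csv'
species_df = pd.read_csv(url)
species_df.head()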

| Unnamed: 0 | IUCN | IUCN Category | SPEC | Species | COU | Country | Unit Code | Unit | Reference Period Code | Reference Period | Value | Flag Codes | Flags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TOT_KNOWN | Total number of known species | MAMMAL | Mammals | AUS | Australia | NBR | Number | | | 380.0 | | |
| 1 | TOT_KNOWN_IND | Total number of indigenous known species | MAMMAL | Mammals | AUS | Australia | NBR | Number | | | 353.0 | | |
| 2 | ENDANGERED | Number of endangered species | MAMMAL | Mammals | AUS | Australia | NBR | Number | | | 38.0 | | |
| 3 | CRITICAL | Number of critically endangered species | MAMMAL | Mammals | AUS | Australia | NBR | Number | | | 10.0 | | |
| 4 | VULNERABLE | Number of vulnerable species | MAMMAL | Mammals | AUS | Australia | NBR | Number | | | 59.0 | | |


In [None]:
# Put your code here to load the pandas package and read in the data frame

## 1.1. Look at the data
First things first: let's see what this data set looks like. Write code below to examine the overall structure of the dataframe. Look at the first rows of the data, the types of the columns, and the shape of the data. What columns are there (what are the column names)? How many columns and rows are there? What do they represent?
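
As a rough sketch of the inspection calls mentioned above, the snippet below uses a small hypothetical `toy_df` rather than the real file. The mammal values are copied from the preview earlier in this notebook, and the bird values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows mimicking the long layout of the species data.
# The BIRD values are invented for illustration only.
toy_df = pd.DataFrame({
    "IUCN": ["TOT_KNOWN", "ENDANGERED", "TOT_KNOWN", "ENDANGERED"],
    "SPEC": ["MAMMAL", "MAMMAL", "BIRD", "BIRD"],
    "Country": ["Australia"] * 4,
    "Value": [380.0, 38.0, 800.0, 20.0],
})

print(toy_df.head())         # first rows
print(toy_df.dtypes)         # type of each column
print(toy_df.shape)          # (number of rows, number of columns)
print(list(toy_df.columns))  # column names
toy_df.info()                # non-null counts and dtypes in one summary
```

The same calls should work unchanged on the dataframe you loaded from the CSV.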

## 1.2. How would you describe the structure of this data in words?



## 1.3. Filtering the data

Since we're interested in mammals, filter the dataframe to keep only the mammal rows and store the result in a new `mammals_df`.
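
One common way to filter rows is a boolean mask. The sketch below assumes the species group is identified by the `SPEC` column, as in the preview above. The names `toy_df` and `toy_mammals` and all non-mammal values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the real table.
toy_df = pd.DataFrame({
    "SPEC": ["MAMMAL", "BIRD", "MAMMAL"],
    "Country": ["Australia", "Australia", "Austria"],
    "Value": [380.0, 800.0, 100.0],
})

# Build a True/False mask, then keep only the rows where it is True.
mask = toy_df["SPEC"] == "MAMMAL"
toy_mammals = toy_df[mask].copy()  # .copy() avoids modifying a view of toy_df later
print(toy_mammals)
```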

## 1.4. Reshaping the data

Let's reshape the data so that we can compare categories of the status of mammals within a country. Pivot the dataframe so that it has one row per country, with columns for each IUCN category. You will want to have a table that keeps the columns `Country`, `COU`, and `Species` as they are, and then creates new columns for each category in the existing `IUCN` column. Store that in a new `mammals_df_wide`.

Hint: Try reading the `.pivot()` documentation [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html).

Now check the shape of the new wide dataframe:
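
The pivot and shape check described above might look roughly like this on a small hypothetical table. The frame `toy_long`, the Austria values, and the choice to pass `Value` as the `values` argument are assumptions for illustration. Passing a list to `index` requires a reasonably recent pandas version.

```python
import pandas as pd

# Hypothetical long-format data: one row per country and IUCN category.
toy_long = pd.DataFrame({
    "Country": ["Australia", "Australia", "Austria", "Austria"],
    "COU": ["AUS", "AUS", "AUT", "AUT"],
    "Species": ["Mammals"] * 4,
    "IUCN": ["TOT_KNOWN", "ENDANGERED", "TOT_KNOWN", "ENDANGERED"],
    "Value": [380.0, 38.0, 100.0, 5.0],
})

# Keep Country, COU, and Species as row labels; spread IUCN categories into columns.
toy_wide = toy_long.pivot(
    index=["Country", "COU", "Species"],
    columns="IUCN",
    values="Value",
)

print(toy_wide)
print(toy_wide.shape)  # one row per country, one column per IUCN category
```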

## 1.5. Resetting the index

Depending on how you structured your pivot table, you may have noticed that your DataFrame now contains multiple column headers. This is called a _MultiIndex_ in Pandas. Although it sometimes has its uses, to simplify things for now, turn these into their own regular columns by using `.reset_index()`.

Uncomment and run the line of code below. (If your `mammals_df_wide` already only has a single index column, then you can skip this.)

In [None]:
# mammals_df_wide = mammals_df_wide.reset_index()
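
To see what `.reset_index()` does, here is a minimal sketch on a hypothetical `toy_wide` frame whose row labels form a MultiIndex. The values are illustrative only.

```python
import pandas as pd

# Hypothetical wide table whose row labels are a two-level MultiIndex.
toy_wide = pd.DataFrame(
    {"ENDANGERED": [38.0, 5.0], "TOT_KNOWN": [380.0, 100.0]},
    index=pd.MultiIndex.from_tuples(
        [("Australia", "AUS"), ("Austria", "AUT")],
        names=["Country", "COU"],
    ),
)

# reset_index() moves the index levels into ordinary columns
# and replaces the index with a default 0..n-1 range.
toy_flat = toy_wide.reset_index()
print(toy_flat.columns.tolist())
```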

## 1.6. Check for duplicates

Let's now make sure we have no duplicated rows of data. Run code to check whether there are duplicates.
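
One possible duplicate check uses `DataFrame.duplicated()`. The frame `toy` below is hypothetical and deliberately contains one repeated row.

```python
import pandas as pd

# Hypothetical data with one fully repeated row (Austria).
toy = pd.DataFrame({
    "Country": ["Australia", "Austria", "Austria"],
    "ENDANGERED": [38.0, 5.0, 5.0],
})

# duplicated() marks every repeat of an earlier row as True.
n_dupes = int(toy.duplicated().sum())

# keep=False marks all copies, which is handy for inspecting them side by side.
dupe_rows = toy[toy.duplicated(keep=False)]

print(n_dupes)
print(dupe_rows)
```

If the count is zero on your own dataframe, there are no fully duplicated rows.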

## 1.7. Calculating the proportion of endangered mammal species
Create a new column that contains the proportion of mammal species in each country that are categorized as endangered. To do this, simply select your dataframe columns and use the `/` operator. Assign the result to a new column.


Ex:
```
df['new_column'] = df.column_1 / df.column_2
```


Verify this worked as expected by looking at one row and making sure the calculation is correct.
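
A spot check of this kind could be sketched as follows. The column names `ENDANGERED` and `TOT_KNOWN` follow the IUCN codes in the preview above. The frame `toy` and the choice of `TOT_KNOWN` as the denominator are assumptions, and the Austria row is made up.

```python
import pandas as pd

# Hypothetical wide table; the Australia values come from the preview above.
toy = pd.DataFrame({
    "Country": ["Australia", "Austria"],
    "ENDANGERED": [38.0, 5.0],
    "TOT_KNOWN": [380.0, 100.0],
})

# Element-wise division of one column by another.
toy["proportion_endangered"] = toy["ENDANGERED"] / toy["TOT_KNOWN"]

# Recompute one row by hand and compare it with the stored value.
row = toy.loc[0]
manual = row["ENDANGERED"] / row["TOT_KNOWN"]
print(row["proportion_endangered"], manual)
```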

## 1.8. Sort your data: Which country has the highest proportion?
Now it's time to answer our question: in this data set, which country has the highest proportion of endangered mammal species?


HINT: when you sort the dataframe, you will need to use the argument `ascending=False`.
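
As an illustration of the hint, the sketch below sorts a hypothetical frame in descending order with `sort_values`. The country labels and proportions are placeholders, not results from the real data.

```python
import pandas as pd

# Placeholder countries and made-up proportions.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "proportion_endangered": [0.10, 0.25, 0.05],
})

# ascending=False puts the largest value in the first row.
toy_sorted = toy.sort_values("proportion_endangered", ascending=False)
top_country = toy_sorted.iloc[0]["Country"]

print(toy_sorted)
print(top_country)
```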

Does this answer make sense? Do you have any hypotheses why? (Just a few sentences.)

# Part 2. Protected areas
Now let's compare the proportion of endangered mammal species with protected land areas in those countries. First, we'll import a DataFrame with this information. The values in this DataFrame are the percent of total land area that is protected in the country.

In [None]:
url = 'https://raw.githubusercontent.com/envirodatascience/ENVS-617-Class-Data/main/OECD_protected_areas.csv'

## 2.1. Look at the data
Again, look at the first few rows of the data and the structure of the data. What data types do the columns have?

## 2.2. Merge with the species data

Merge this DataFrame with `mammals_df_wide` so that for every country in `mammals_df_wide`, we have the protected land area %. This way, we can compare the proportion of endangered mammal species with percent protected areas. You will need to decide whether to use a left join, inner join, or outer join.

Hint: To match rows of two dataframes, the values in the key column need to be identical. How can we transform one or both key columns so that we have identical matches?
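
The sketch below shows one way key columns can be made to match before a merge. It assumes, purely for illustration, that the mismatch is stray whitespace and letter case; the real files may differ in another way. The frames `species` and `protected`, the helper column `key`, and all numeric values are hypothetical. A left join is used here only as one of the options mentioned above.

```python
import pandas as pd

# Hypothetical frames with made-up values and deliberately inconsistent keys.
species = pd.DataFrame({
    "Country": ["Australia", "Austria"],
    "proportion_endangered": [0.10, 0.05],
})
protected = pd.DataFrame({
    "Country": [" australia", "AUSTRIA ", "Belgium"],
    "percent_protected": [19.0, 28.0, 15.0],
})

# Normalize both keys the same way: trim whitespace and lowercase.
species["key"] = species["Country"].str.strip().str.lower()
protected["key"] = protected["Country"].str.strip().str.lower()

# Left join keeps every row of `species`; indicator=True adds a `_merge`
# column showing whether each row found a match.
merged = species.merge(
    protected[["key", "percent_protected"]],
    on="key",
    how="left",
    indicator=True,
)
print(merged)
```

Rows whose `_merge` value is `left_only` did not find a partner, which is a quick way to spot keys that still differ.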

## 2.3. Checking the merged DataFrame

After merging the two DataFrames, do some checks to make sure you were able to match all the rows. Get `.info()` on the merged data and then check the first rows.



## 2.4. Missing values


Let's check whether any rows have missing data. Display any rows that contain missing values.
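
One way to show only the rows that contain a missing value is to combine `isna()` with `any(axis=1)`. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical merged table with gaps in two different columns.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "proportion_endangered": [0.1, None, 0.2],
    "percent_protected": [10.0, 20.0, None],
})

# isna() flags each missing cell; any(axis=1) is True for a row
# if at least one of its cells is missing.
missing_rows = toy[toy.isna().any(axis=1)]
print(missing_rows)
```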

Filter the merged dataframe to look at Brazil, Russia, and Turkey and display the data for those three countries:
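
Selecting several countries at once can be done with `Series.isin`. In the sketch below, the frame `toy`, the column `value`, and its numbers are placeholders. It also assumes the country names are spelled exactly as in the text above, which may not hold in the real data.

```python
import pandas as pd

# Placeholder table; the numbers carry no meaning.
toy = pd.DataFrame({
    "Country": ["Brazil", "Russia", "Turkey", "Australia"],
    "value": [1, 2, 3, 4],
})

# isin() is True for rows whose Country appears in the list.
wanted = ["Brazil", "Russia", "Turkey"]
subset = toy[toy["Country"].isin(wanted)]
print(subset)
```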

What do you notice for the values for those countries? Do you have any thoughts on why this might be?

## 2.5. Displaying the results
To wrap things up, let's display the data we just cleaned in a sorted table.

First, sort your table by the proportion of endangered mammals in each country so that the country with the highest proportion is first (descending order).



## 2.6. Describe the data

Output the count, mean, standard deviation, min, percentiles, and max for the `proportion_endangered` and `percent_protected` columns. What is the max value for each?
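
Summary statistics like these can be produced with `describe()`. The sketch below runs it on a hypothetical frame `toy` with made-up values, then pulls out the `max` row.

```python
import pandas as pd

# Made-up values, chosen only so the summary is easy to read.
toy = pd.DataFrame({
    "proportion_endangered": [0.1, 0.2, 0.3],
    "percent_protected": [10.0, 20.0, 30.0],
})

# describe() returns count, mean, std, min, quartiles, and max per column.
summary = toy[["proportion_endangered", "percent_protected"]].describe()
max_values = summary.loc["max"]

print(summary)
print(max_values)
```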

Make a histogram for each of these columns. Describe what you observe in the two graphs.

Now, sort your table by the percent of land area protected in each country so that the country with the highest percentage is first (descending order).

Scrolling through each of these two tables, what's something you notice? (Write a few sentences - there is not necessarily a 'right' answer here!)
