# Exploring Pokemon Data
In this Notebook, we will be exploring our first dataset - the [Pokemon](https://www.kaggle.com/mariotormo/complete-pokemon-dataset-updated-090420) data. In the process of exploring this data, we will be covering basics on reading, processing, analyzing, and visualizing simple tabular datasets.

Our Notebooks in CSMODEL are designed to be guided learning activities. To use them, simply go through the cells from top to bottom, following the directions along the way. If you find any unclear parts or mistakes in the Notebooks, email your instructor.

--------------------------------------------------------

*Name:* Arwyn Gabrielle A. Telosa (12209228) <br>
*Course:* BS Computer Science Major in Network and Information Security (BSCS-NIS) <br>

## Instructions
* Read each cell and implement the TODOs sequentially. The markdown/text cells also contain instructions which you need to follow to get the whole notebook working.
* Do not change the variable names unless the instructor allows you to.
* Answer all the markdown/text cells with 'Question #' on them. The answer must strictly consume one line only.
* You are expected to look up how some functions work on the Internet or in the docs.
* The notebooks will undergo a 'Restart and Run All' command, so make sure that your code is working properly.
* You are expected to understand the dataset loading and processing separately from this class.
* You may not reproduce this notebook or share it with anyone.

## `pandas` and `matplotlib`
* **`pandas`** is a software library for Python that is designed for data manipulation and data analysis. 
* **`matplotlib`** is a software library for data visualization, which allows us to easily render various types of graphs.

We will be using these two libraries in this Notebook.

In [46]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# sets the theme of the charts
plt.style.use('seaborn-v0_8-darkgrid')

%matplotlib inline

The `%matplotlib inline` command allows for the visualization charts from `matplotlib` to be automatically rendered and displayed in the Notebook.

Additionally, you can also customize or choose the theme you'd like to use with `matplotlib` using the `plt.style.use()` command. See the [style sheets reference](https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html) from `matplotlib` for more options.

## The Dataset
For this notebook, we will be working on a dataset called `pokemon`. This dataset contains the 890 Pokemon known up to the 8th Generation, along with their varieties.

The dataset is provided to you as a `.csv` file. `.csv` means comma-separated values. You can open the file in Notepad to see how it is exactly formatted.
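
To make the comma-separated layout concrete, here is a minimal sketch that parses a tiny CSV snippet held in memory. The three rows and four columns are a hypothetical excerpt chosen for illustration, not the actual contents of `pokemon.csv`, and `csv_text` and `sample_df` are names made up for this example.

```python
import io

import pandas as pd

# Hypothetical excerpt: the first line holds the column names,
# and every following line is one observation.
csv_text = """pokedex_number,name,type_1,hp
1,Bulbasaur,Grass,45
4,Charmander,Fire,39
7,Squirtle,Water,44
"""

# read_csv accepts a path or any file-like object,
# so StringIO stands in for a file on disk here.
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
```

Each comma marks a column boundary, which is why the header line and every data line contain the same number of commas.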

If you view the `.csv` file in Excel, you can see that our dataset contains many **observations** (rows) across 48 **variables** (columns). The following are the descriptions of each variable in the dataset.

- **`pokedex_number`**: entry number of the Pokemon in the National Pokedex
- **`name`**: English name of the Pokemon
- **`generation`**: numbered generation which the Pokemon was first introduced
- **`status`**: denotes if the Pokemon is normal, sub legendary, legendary or mythical
- **`species`**: category of the Pokemon
- **`type_number`**: number of types that the Pokemon has
- **`type_1`**: primary type of the Pokemon
- **`type_2`**: secondary type of the Pokemon (if any)
- **`height_m`**: height of the Pokemon in meters
- **`weight_kg`**: weight of the Pokemon in kilograms
- **`abilities_number`**: the number of abilities of the Pokemon
- **`ability_1`**: ability of the Pokemon
- **`ability_2`**: another ability of the Pokemon (if any)
- **`ability_hidden`**: hidden ability of the Pokemon (if any)
- **`total_points`**: total number of base points
- **`hp`**: base HP of the Pokemon
- **`attack`**: base attack of the Pokemon
- **`defense`**: base defense of the Pokemon
- **`sp_attack`**: base special attack of the Pokemon
- **`sp_defense`**: base special defense of the Pokemon
- **`speed`**: base speed of the Pokemon
- **`catch_rate`**: catch rate of the Pokemon
- **`base_friendship`**: base friendship of the Pokemon
- **`base_experience`**: base experience of a wild Pokemon when caught
- **`growth_rate`**: growth rate of the Pokemon
- **`egg_type_number`**: number of groups where a Pokemon can hatch
- **`egg_type_1`**: name of an egg group where a Pokemon can hatch
- **`egg_type_2`**: name of an egg group where a Pokemon can hatch
- **`percentage_male`**: percentage of the species that are male, blank if the Pokemon is genderless.
- **`egg_cycles`**: number of cycles (255-257 steps) required to hatch an egg of the Pokemon
- **`against_normal`**: denote the amount of damage taken against an attack of a normal Pokemon
- **`against_fire`**: denote the amount of damage taken against an attack of a fire Pokemon
- **`against_water`**: denote the amount of damage taken against an attack of a water Pokemon
- **`against_electric`**: denote the amount of damage taken against an attack of an electric Pokemon
- **`against_grass`**: denote the amount of damage taken against an attack of a grass Pokemon
- **`against_ice`**: denote the amount of damage taken against an attack of an ice Pokemon
- **`against_fight`**: denote the amount of damage taken against an attack of a fighting Pokemon
- **`against_poison`**: denote the amount of damage taken against an attack of a poison Pokemon
- **`against_ground`**: denote the amount of damage taken against an attack of a ground Pokemon
- **`against_flying`**: denote the amount of damage taken against an attack of a flying Pokemon
- **`against_psychic`**: denote the amount of damage taken against an attack of a psychic Pokemon
- **`against_bug`**: denote the amount of damage taken against an attack of a bug Pokemon
- **`against_rock`**: denote the amount of damage taken against an attack of a rock Pokemon
- **`against_ghost`**: denote the amount of damage taken against an attack of a ghost Pokemon
- **`against_dragon`**: denote the amount of damage taken against an attack of a dragon Pokemon
- **`against_dark`**: denote the amount of damage taken against an attack of a dark Pokemon
- **`against_steel`**: denote the amount of damage taken against an attack of a steel Pokemon
- **`against_fairy`**: denote the amount of damage taken against an attack of a fairy Pokemon

## Reading the Dataset

Our first step is to load the dataset using `pandas`. This will load the dataset into a pandas `DataFrame`. To load the dataset, we use the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function. Note that you may need to change the path depending on the location of the file on your machine.

In [47]:
pokemon_df = pd.read_csv('pokemon.csv')

The dataset should now be loaded in the `pokemon_df` variable. `pokemon_df` is a [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). It is a data structure for storing tabular data, and the main data structure used in pandas.

Whenever we load a new dataset, it is generally a good idea to call the [`info`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) function, which displays general information about the dataset.

In [48]:
pokemon_df.info()
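
The same facts that `info` prints can also be read off individual attributes. The sketch below uses a small made-up `toy_df` (an assumption for illustration, not the real dataset) to show where the row count, column count, data types, and non-null counts come from.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing value.
toy_df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "hp": [45, 39, 44],
    "type_2": ["Poison", None, None],
})

# shape is a (rows, columns) tuple: observations first, variables second.
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)

# dtypes lists the data type of every column.
print(toy_df.dtypes)

# notna().sum() counts the non-missing entries per column.
non_null = toy_df.notna().sum()
print(non_null)
```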

**Question #1:** How many observations are there in the dataset?

Answer: 1028

**Question #2:** How many variables are there in the dataset?

Answer: 48

**Question #3:** What is the data type of the `hp` column?

Answer: int64

We can call the [`head`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) function to display the first `n` rows of the dataset.

In [49]:
pokemon_df.head(10)

We can also call the [`tail`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) function to display the last `n` rows of the dataset.

Use the `tail` function to find out the type of the Pokemon in the **last** row of the dataset.

In [50]:
# Write your code here
pokemon_df.tail(1)

**Question #4:** What is/are the type/s of the Pokemon in the last row of the dataset?

Answer: Poison & Dragon

We can get the columns of the dataset by accessing the [`columns`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html) property of the `DataFrame`.

In [51]:
pokemon_df.columns

## Exploratory Data Analysis

The `pokemon` dataframe is a massive trove of information. Let's think about some questions we might want to answer with these data.

### Which type has the highest number of Pokemons?

To answer this question, the variable of interest is:
- **`type_1`**: primary type of the Pokemon

We can select a specific column from a `DataFrame` as a `Series` by using square brackets. For example, we can get the primary type of various Pokemon in the dataset by accessing the `type_1` column:

In [52]:
pokemon_df["type_1"]

Note that the data type of the column above is a `Series`.

In [53]:
type(pokemon_df["type_1"])

Count the number of pokemon per primary type.

In [54]:
type_count_df = pokemon_df['type_1'].value_counts()
type_count_df
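
As a rough illustration of what `value_counts` returns, the sketch below counts a hypothetical six-element `Series` named `toy_types`. The result is itself a `Series`, indexed by the distinct values and sorted from most to least frequent.

```python
import pandas as pd

# Hypothetical primary types for six Pokemon.
toy_types = pd.Series(
    ["Water", "Fire", "Water", "Grass", "Water", "Fire"], name="type_1"
)

# Counts per distinct value, sorted in descending order by default.
counts = toy_types.value_counts()
print(counts)

# normalize=True returns proportions instead of raw counts.
shares = toy_types.value_counts(normalize=True)
print(shares)
```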

Bar plots show the count of each value and are typically used for categorical data. Use the [`bar`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.bar.html) function for vertical bars, or `barh` for horizontal bars.

Let's create a plot to show the count per primary type of Pokemon.

In [55]:
type_count_df.plot.barh(figsize=(6,7)).invert_yaxis()
# In a horizontal bar chart the counts run along the x-axis and the categories along the y-axis
plt.xlabel('Count')
plt.ylabel('Primary Type')
plt.title('Pokemon count per primary type')
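
When working with horizontal bars it is easy to mix up the two axes. The sketch below, built on a hypothetical `toy_counts` Series, keeps a reference to the `Axes` object returned by `plot.barh` and labels it directly, which is one way (not the only one) to make the axis roles explicit.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts per primary type.
toy_counts = pd.Series({"Water": 3, "Fire": 2, "Grass": 1})

# plot.barh returns a matplotlib Axes object that we can keep styling.
ax = toy_counts.plot.barh(figsize=(4, 3))
ax.invert_yaxis()             # largest category at the top
ax.set_xlabel("Count")        # bar lengths run along the x-axis
ax.set_ylabel("Primary Type") # categories sit on the y-axis
ax.set_title("Toy count per primary type")
```

With `plot.bar` (vertical bars) the two labels would trade places.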

**Question #5:** What are the top 3 types with the highest Pokemon count?

Answer: Water, Normal, Grass

### What is the average base HP of a specific type of Pokemon?

To answer this question, the variables of interest are:
- **`type_1`**: primary type of the Pokemon
- **`hp`**: base HP of the Pokemon

Write code to select the `hp` column as a series.

In [56]:
# Write your code here
pokemon_df["hp"]
type(pokemon_df["hp"])

We can also select a list of columns from the dataset by providing a list instead of the name of a single column. For example, we can select both the `type_1` and `hp` columns at the same time as follows:

In [57]:
pokemon_df[["type_1", "hp"]]

Note that by doing this, we are getting a `DataFrame` (albeit a smaller one) instead of a `Series`.

In [58]:
type(pokemon_df[["type_1", "hp"]])
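
The difference between single and double square brackets can be checked on a small example. The `toy_df` below is a hypothetical stand-in for the real dataset.

```python
import pandas as pd

# Hypothetical two-row frame.
toy_df = pd.DataFrame({
    "type_1": ["Grass", "Fire"],
    "hp": [45, 39],
    "attack": [49, 52],
})

one_column = toy_df["hp"]               # a string selects a Series
as_frame = toy_df[["hp"]]               # a one-element list selects a DataFrame
two_columns = toy_df[["type_1", "hp"]]  # a longer list selects several columns

print(type(one_column).__name__, one_column.shape)
print(type(as_frame).__name__, as_frame.shape)
print(type(two_columns).__name__, two_columns.shape)
```

Note that even a list with a single column name yields a `DataFrame`, so the brackets, not the number of columns, decide the result type.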

A good way to get an understanding of numerical values in the dataset is to use a histogram. Let's use a histogram to visualize the weight of all Pokemons in the dataset. To do this, we will call the [`hist`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html) function of the `DataFrame` which in turn calls the appropriate matplotlib function.

Note that we also call the [`show`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.show.html) function of matplotlib to display only the graph.

In [59]:
pokemon_df.hist("weight_kg", bins=30, edgecolor='w', figsize=(8, 4))
plt.show()   # explicit call to show the chart (not needed with the matplotlib inline command)

You can play around with the `bins` parameter by changing its value above.
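
To see what the number of bins changes without drawing anything, the sketch below uses `numpy.histogram` on ten hypothetical weights. This is only an illustration of the binning idea; the `weights` values and the `results` dictionary are assumptions made for this example.

```python
import numpy as np

# Hypothetical weights in kilograms, with one very heavy outlier.
weights = np.array([0.1, 2.0, 6.9, 8.5, 9.0, 13.0, 29.5, 60.0, 90.5, 460.0])

results = {}
for n_bins in (2, 5):
    # The range from min to max is split into n_bins equal-width intervals,
    # and counts[i] is the number of values falling in interval i.
    counts, edges = np.histogram(weights, bins=n_bins)
    results[n_bins] = counts
    print(n_bins, counts, edges)
```

More bins give narrower intervals, so the same data is spread over more (and often emptier) bars.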

Let's say we want to investigate the base HP for normal Pokemons only. To do this, we have to consider **only the observations in which the `type_1` is `Normal`**.

In [60]:
pokemon_df[pokemon_df["type_1"] == "Normal"]

As you can see, the above query resulted in a new `DataFrame` containing only the Pokemons where `type_1` is `Normal`. For now, we will assign this new `DataFrame` to a new variable for convenience.

In [61]:
normal_pokemon_df = pokemon_df[pokemon_df["type_1"] == "Normal"]
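
Filtering works in two steps: the comparison produces a boolean `Series`, and indexing with that `Series` keeps only the rows marked `True`. A minimal sketch on a hypothetical three-row `toy_df`:

```python
import pandas as pd

# Hypothetical three-row frame.
toy_df = pd.DataFrame({
    "name": ["Rattata", "Charmander", "Snorlax"],
    "type_1": ["Normal", "Fire", "Normal"],
    "hp": [30, 39, 160],
})

# Step 1: an element-wise comparison yields one True/False per row.
mask = toy_df["type_1"] == "Normal"
print(mask)

# Step 2: indexing with the mask keeps only the True rows.
normal_only = toy_df[mask]
print(normal_only)
```

The filtered result keeps the original row labels (here `0` and `2`) rather than renumbering them.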

Plot a histogram of the base HP of normal Pokemons.

In [62]:
# Write your code here
# Use the filtered DataFrame so that only Normal-type Pokemons are plotted
normal_pokemon_df.hist("hp", bins=30, edgecolor='w', figsize=(8, 6))
plt.show() 

**Question #6:** Which best describes the shape of the distribution of the base HP of normal Pokemons? (a) symmetric (b) positively-skewed (c) negatively-skewed (d) uniform

Answer: (B) Positively Skewed

We can also aggregate some summary statistics regarding the base HP using the [`agg`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html) function. Note that for this function, we pass a dictionary where the key is a column name and the corresponding value is a list of functions that we want to apply to that column. We can pass either an actual function **or** a string containing the name of a common function such as `"mean"` or `"std"`. 

Get the mean, standard deviation and length of the `hp` column.

In [63]:
normal_pokemon_df.agg({"hp": ["mean", "std", "count"]})
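
The dictionary passed to `agg` may name several columns, each with its own list of functions. The sketch below, on a hypothetical `toy_df`, shows the shape of the result: one row per function name and one column per aggregated column, with `NaN` wherever a function was not requested for that column.

```python
import pandas as pd

# Hypothetical stats for four Pokemon.
toy_df = pd.DataFrame({
    "hp": [30, 40, 50, 160],
    "attack": [56, 45, 60, 110],
})

# Different functions per column, given as strings.
summary = toy_df.agg({
    "hp": ["mean", "std", "count"],
    "attack": ["min", "max"],
})
print(summary)
```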

Next, let's try to do the same thing for normal Pokemons which were introduced on or before the 5th generation. We can filter observations using multiple criteria by using `&` (and) and `|` (or). Note that these are not the normal `and` and `or` operators in Python. These are bitwise operators that perform element-wise operations on two boolean `Series`. Each condition must be wrapped in parentheses because `&` and `|` bind more tightly than comparison operators such as `==` and `<=`.

In [64]:
normal_pokemon_5thgen_df = pokemon_df[(pokemon_df["type_1"] == "Normal") & (pokemon_df["generation"] <= 5)]
normal_pokemon_5thgen_df
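
A small sketch of combining conditions, using a hypothetical four-row `toy_df`. It also shows that the plain Python `and` raises an error when applied to two boolean `Series`, which is why the element-wise operators are needed.

```python
import pandas as pd

# Hypothetical four-row frame.
toy_df = pd.DataFrame({
    "type_1": ["Normal", "Normal", "Fire", "Water"],
    "generation": [1, 7, 1, 6],
})

is_normal = toy_df["type_1"] == "Normal"
is_early = toy_df["generation"] <= 5

both = toy_df[is_normal & is_early]        # rows meeting both conditions
either = toy_df[is_normal | is_early]      # rows meeting at least one
neither = toy_df[~(is_normal | is_early)]  # ~ negates element-wise

# The plain `and` asks for a single truth value of a whole Series,
# which pandas refuses to guess.
try:
    toy_df[is_normal and is_early]
except ValueError as err:
    print("`and` fails:", err)
```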

Use the `agg` function to determine the **median** base HP of normal Pokemons introduced on or before the 5th generation.

In [65]:
# Write your code here
normal_pokemon_5thgen_df = pokemon_df[(pokemon_df["type_1"] == "Normal") & (pokemon_df["generation"] <= 5)]
median_hp = normal_pokemon_5thgen_df.agg({"hp": ["median"]})
median_hp

**Question #7:** What is the median base HP of normal Pokemons introduced on or before the 5th generation? Limit to 2 decimal places.

Answer: 70.00

### Which type has the highest average base attack?

To answer this question, the variables of interest are:
- **`type_1`**: primary type of the Pokemon
- **`attack`**: base attack of the Pokemon

Sometimes, we may want to form groups in the datasets and compute summary statistics for each group. For instance, to determine which type has the highest average base attack in the dataset, we need to compute the average for each type.

To do this, we can use the [`groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) function. The function will group the dataset using the value of the provided column name.

In [66]:
pokemon_df.groupby("type_1")

Get the mean and standard deviation of the base attack per Pokemon type.

In [67]:
pokemon_df.groupby("type_1").agg({"attack": ["mean", "std"]})

You can sort a `DataFrame` by a column using the `sort_values` function. Sort the resulting table above in descending order to easily see which type has the highest average base attack.

In [68]:
mean_df = pokemon_df.groupby("type_1").agg({"attack": ["mean", "std"]})
mean_df.sort_values(("attack", "mean"), ascending=False)

Note that `sort_values` accepts the name of the column you want to sort by as its parameter. In this case, we pass in the **tuple** `("attack", "mean")` because, if you look at the `DataFrame` in the previous cell, the column names have a hierarchical structure where the first level is `attack`, and the `mean` column sits under it.
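
The hierarchical column names can be inspected directly. The sketch below uses a hypothetical five-row `toy_df` to show the tuples behind the two-level header, and, as one possible alternative, how selecting the column before aggregating gives flat column names that can be sorted with a plain string.

```python
import pandas as pd

# Hypothetical five-row frame.
toy_df = pd.DataFrame({
    "type_1": ["Fire", "Fire", "Water", "Water", "Grass"],
    "attack": [52, 84, 48, 83, 49],
})

# Dictionary form: columns become (column, function) tuples.
stats_df = toy_df.groupby("type_1").agg({"attack": ["mean", "std"]})
print(stats_df.columns.tolist())
sorted_df = stats_df.sort_values(("attack", "mean"), ascending=False)
print(sorted_df)

# Selecting the column first gives single-level column names.
flat_df = toy_df.groupby("type_1")["attack"].agg(["mean", "std"])
flat_sorted_df = flat_df.sort_values("mean", ascending=False)
print(flat_sorted_df)
```

A group with a single member (here `Grass`) has no sample standard deviation, so its `std` is `NaN`.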

Find out which type has the highest median base attack. Make sure to sort the median values per type in descending order.

In [69]:
# Write your code here
median_attack_df = pokemon_df.groupby("type_1").agg({"attack": ["median"]})
median_attack_df.sort_values(("attack", "median"), ascending=False)

**Question #8:** If you can choose which Pokemon type to use, using only the average base attack as the basis for your decision, which type has the **best** average base attack?

Answer: Fighting

**Question #9:** In choosing the best Pokemon type according to base attack (previous question), would it be better to use mean or median? Why?

Answer: Median. It is less affected by extreme values or outliers than the mean. Moreover, the median represents the middle value when all values are sorted, making it a better measure of central tendency, especially for skewed distributions.
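
The sensitivity of the mean to extreme values can be checked with a few numbers. The sketch below uses hypothetical attack values and adds a single large value to see how far each statistic moves.

```python
import pandas as pd

# Hypothetical base attack values for one type.
attack = pd.Series([50, 55, 60, 65, 70])

# The same values plus one extreme observation.
with_outlier = pd.concat([attack, pd.Series([190])], ignore_index=True)

print(attack.mean(), attack.median())
print(with_outlier.mean(), with_outlier.median())
```

In this toy case one extra value pulls the mean up by more than 20 points, while the median moves by only 2.5.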

We can also visualize the base attack per type by using a side-by-side boxplot from the [`boxplot`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html) function of pandas, which draws the chart using matplotlib. Notice that you can control the size of the figures using the `figsize` parameter (this also works in other plots).

In [70]:
pokemon_df.boxplot("attack", by="type_1", figsize=(15, 10))
plt.show()
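
The elements of a boxplot correspond to quantities that can be computed directly. The sketch below does this per group for a hypothetical `toy_df`. It assumes the common convention, which is matplotlib's default as far as we know, that points more than 1.5 times the interquartile range above the third quartile are drawn as individual outliers.

```python
import pandas as pd

# Hypothetical attack values for two types.
toy_df = pd.DataFrame({
    "type_1": ["Fire"] * 5 + ["Water"] * 5,
    "attack": [40, 50, 60, 70, 130, 30, 45, 50, 55, 65],
})

# Box edges (Q1, Q3) and the median line for each group.
quartiles = (
    toy_df.groupby("type_1")["attack"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

# Interquartile range and the assumed upper limit for the whisker.
quartiles["iqr"] = quartiles[0.75] - quartiles[0.25]
quartiles["upper_fence"] = quartiles[0.75] + 1.5 * quartiles["iqr"]
print(quartiles)
```

Under this convention the `Fire` value of 130 lies above its fence of 100 and would appear as a separate point, while every `Water` value stays within its fence.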

### Which type has the highest average base defense?

To answer this question, the variables of interest are:
- **`type_1`**: primary type of the Pokemon
- **`defense`**: base defense of the Pokemon

Get the mean and standard deviation of the base defense per Pokemon type.

In [71]:
# Write your code here
defense_mean_df = pokemon_df.groupby("type_1").agg({"defense": ["mean", "std"]})
defense_mean_df

Sort the resulting table above in descending order to easily see which type has the highest average base defense.

In [72]:
# Write your code here
sorted_defense_mean_df = defense_mean_df.sort_values(("defense", "mean"), ascending=False)
sorted_defense_mean_df

Find out which type has the highest median base defense. Make sure to sort the median values per type in descending order.

In [73]:
# Write your code here
median_defense_df = pokemon_df.groupby("type_1").agg({"defense": ["median"]})
median_defense_df.sort_values(("defense", "median"), ascending=False)

**Question #10:** Which type has the highest median base defense? What is its median base defense? Limit to 2 decimal places.

Answer: 115.00

Visualize the base defense per type by using a side-by-side boxplot from the [`boxplot`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html) function of pandas.

In [74]:
# Write your code here
pokemon_df.boxplot("defense", by="type_1", figsize=(15, 10))
plt.show()

**Question #11:** Based on the visualization alone, which type has the best base defense value? Why?

Answer: Steel. The median line (green line inside the box) for Steel type is visibly higher than those of most other types, showing that the typical (middle) Steel Pokémon has a high defense value.

### Is there a relationship between `hp`, `attack`, `defense`?

To answer this question, the variables of interest are:
- **`hp`**: base HP of the Pokemon
- **`attack`**: base attack of the Pokemon
- **`defense`**: base defense of the Pokemon

In [75]:
hp_atk_def = pokemon_df[['hp', 'attack', 'defense']]
hp_atk_def

When we want to understand the relationship of different variables, typically numerical/continuous variables, we can get the correlation between the variables and check how they are related.

In [76]:
hp_atk_def.corr()

`pandas` has a built-in `corr()` function which, by default, computes Pearson's correlation coefficient for each pair of columns. The resulting table is called a correlation matrix.
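
To connect the matrix to the underlying formula, the sketch below computes Pearson's coefficient by hand for two columns of a hypothetical `toy_df` and compares it with the value from `corr()`. The coefficient is the sum of the products of the deviations from the mean, divided by the square root of the product of the summed squared deviations.

```python
import numpy as np
import pandas as pd

# Hypothetical stats for five Pokemon.
toy_df = pd.DataFrame({
    "hp": [30, 45, 60, 80, 100],
    "attack": [40, 49, 70, 65, 95],
})

# Deviations of each value from its column mean.
x = toy_df["hp"] - toy_df["hp"].mean()
y = toy_df["attack"] - toy_df["attack"].mean()

# Pearson's r computed from the definition.
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same entry read from the correlation matrix.
r_pandas = toy_df.corr().loc["hp", "attack"]
print(r_manual, r_pandas)
```

The diagonal of a correlation matrix is always 1, since every variable is perfectly correlated with itself.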

A good way to get a visual of the relationship of two variables is to use a scatter plot. Let's use a scatter plot to visualize the relationship of the HP and the attack stat of all Pokemons in the dataset.

To do this, we can call the `scatter` function.

In [77]:
hp_atk_def.plot.scatter(x='hp', y='attack', alpha=0.5)
plt.title('Relationship of Pokemon HP and Attack')

From the correlation matrix above, the correlation of HP and attack is **0.442992**. This is a positive relationship, which means that as one value increases, the other value also tends to increase.

In most cases, to test whether the correlation is significant or not, a statistical test is performed. 
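
The notebook does not specify which test to use. As one possible illustration, the sketch below runs a simple permutation test with `numpy`: it shuffles one variable many times to see how often a correlation at least as strong as the observed one arises by chance. The data, the seed, and the number of permutations are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical stats for eight Pokemon.
hp = np.array([30, 45, 60, 80, 100, 55, 70, 90], dtype=float)
attack = np.array([40, 49, 70, 65, 95, 50, 80, 85], dtype=float)

# Correlation observed in the (toy) data.
observed_r = np.corrcoef(hp, attack)[0, 1]

# Shuffling attack breaks any real pairing with hp, so these values
# show what correlations look like when there is no relationship.
n_permutations = 2000
permuted_r = np.array([
    np.corrcoef(hp, rng.permutation(attack))[0, 1]
    for _ in range(n_permutations)
])

# Two-sided p-value: share of shuffles at least as extreme as observed.
p_value = np.mean(np.abs(permuted_r) >= abs(observed_r))
print(observed_r, p_value)
```

A small p-value suggests the observed correlation would be unusual if the two variables were unrelated.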

Now, let's look at the relationship of the Pokemons' attack and defense. 

Visualize the attack and defense of all Pokemons using a scatter plot.

In [78]:
# Write your code here
atk_def = pokemon_df[['attack', 'defense']]
atk_def.corr()
hp_atk_def.plot.scatter(x='attack', y='defense', alpha=0.5)
plt.title('Relationship of Pokemon Attack & Defense')

**Question #12:** What can you say about the correlation of attack and defense? (a) positively correlated, (b) negatively correlated

Answer: Positively Correlated. As the attack values increase, the defense values also tend to increase. This indicates a positive correlation between attack and defense.

## Conclusions

Given that the dataset is rich with information, there are many more insights and relationships that you can extract from it. Try looking at each variable, ask more questions that interest you, and visualize the results to confirm or reject your initial hypotheses!