# Task 1 and 2: Pokemon EDA

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Task 1: Setting the Aesthetics for the plots

### 1.1: Set the Seaborn figure theme and scale up the text in the figures

There are five preset Seaborn themes: `darkgrid`, `whitegrid`, `dark`, `white`, and `ticks`. 
They are each suited to different applications and personal preferences.
You can see what they look like [here](https://seaborn.pydata.org/tutorial/aesthetics.html#seaborn-figure-styles).

Hint: You will need to use the `font_scale` parameter of Seaborn's `set_theme()` function.

In [3]:
# Your Solution Here
sns.set_theme(style = "whitegrid", font_scale = 2)

Once you are able to set the theme, you will see all plots in this Jupyter Notebook update using the same theme.

Remember to copy the code above into your other Jupyter Notebooks as well!

## Task 2: Exploratory Data Analysis 

In this section of the task, you will perform EDA on a given dataset, with the goal of being able to describe it.

### 2.1. Describe your dataset 

Consider (and keep in mind) the following questions to guide you in your exploration:

- Who: Which company/agency/organization provided this data?
- What: What is in your data?
- When: When was your data collected (for example, for which years)?
- Why: What is the purpose of your dataset? Is it for transparency/accountability, public interest, fun, learning, etc...
- How: How was your data collected? Was it a human collecting the data? Historical records digitized? Server logs?

**Hint:** The [pokemon dataset is from this Kaggle page.](https://www.kaggle.com/rounakbanik/pokemon)


**Hint:** *You probably will not need more than 250 words to describe your dataset. You do not need to answer every question above; they are there to guide your exploration and get you thinking about the context of your data. It is also possible you will not know the answers to some of the questions, and that is FINE: data scientists are often faced with the challenge of analyzing data from unknown sources. Do your best, and acknowledge the limitations of your data as well as of your understanding of it. Also, make it clear what you're speculating about. For example, "I speculate that the {...column_name...} column must be related to {....} because {....}."*


#### My Description
This dataset contains a description of each Pokemon across seven generations, with all of the information coming from the Pokemon fan site Serebii.net. Within the data, you'll find the individual base statistics, performance, height, weight, classification, egg steps, experience points, abilities, etc. 

I speculate that it was collected by someone out of curiosity, inspired by questions about how Pokemon relate and compare. For example, the attack and speed columns combined with the defense ones can be used to find which Pokemon are strongest and weakest overall. You could also look into questions that aren't directly answered in the game, such as how a Pokemon's height and weight influence its health and overall stats.

The main limitations of this dataset concern its accuracy. The data was gathered by a person, which leaves room for error. Furthermore, it is based on an unofficial fan site, which may contain errors itself. According to the Kaggle page, the dataset has not been updated in five years, so some of the information may be outdated. 

### 2.2. Load the dataset from a file, or URL 

This needs to be a pandas dataframe. Remember that others may be running your Jupyter notebook, so it's important that the data is accessible to them. 

***General advice:** If your dataset isn't accessible as a URL, make sure to commit it into your repo. If your dataset is too large to commit (>100 MB), and it's not possible to get a URL to it, you should contact your instructor for advice.*

For this question, luckily the dataset is available as a URL.
You can use this URL to load the data: https://github.com/firasm/bits/raw/master/pokemon.csv

In [4]:
# Your solution here
df = pd.read_csv("https://github.com/firasm/bits/raw/master/pokemon.csv")
df

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


### 2.3. Explore your dataset 

Which of your columns are interesting/relevant? Remember to take some notes on your observations; you'll need them for the next EDA step (initial thoughts).

#### 2.3.1:  You should start with [`df.describe().T`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) 

See [linked documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) for the use of `include`/`exclude` to look at numerical and categorical data.

In [5]:
# Your solution to output `df.describe().T` for numerical columns:
df.describe(include=[np.number]).T

In [6]:
# Your solution to output `df.describe.T` for categorical columns:
df.describe(exclude = [np.number])

Unnamed: 0,Name,Type 1,Type 2,Legendary
count,800,800,414,800
unique,800,18,18,2
top,Bulbasaur,Water,Flying,False
freq,1,112,97,735
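
The `include`/`exclude` selection can be tried out on a small frame first. The sketch below is only illustrative: the `sample` frame is an assumption built from three rows of the table shown earlier, and the variable names are not part of the lab.

```python
import numpy as np
import pandas as pd

# A few rows copied from the table shown earlier in this notebook.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur"],
    "Type 1": ["Grass", "Grass", "Grass"],
    "HP": [45, 60, 80],
    "Attack": [49, 62, 82],
})

# Numerical columns only; `.T` puts one column per row for easier reading.
numeric_summary = sample.describe(include=[np.number]).T

# Everything that is not numerical (here: the text columns).
categorical_summary = sample.describe(exclude=[np.number]).T

print(numeric_summary)
print(categorical_summary)
```

The numerical summary reports statistics such as `mean` and `std`, while the non-numerical one reports `count`, `unique`, `top`, and `freq`.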


### 2.4. Initial Thoughts

#### 2.4.1. Use this section to record your observations. 

Does anything jump out at you as surprising or particularly interesting? Feel free to make additional plots as needed to explore your data set.

Where do you think you'll go with exploring this dataset? Feel free to take notes in this section and use it as a scratch pad.

Any content in this area will only be marked for effort and completeness.

#### Your observations here:

- The `describe` output for the categorical columns shows that "legendary" Pokemon are rare: 735 of the 800 recorded Pokemon are not legendary.
- Water is the most common primary type (`Type 1`), with 112 Pokemon.
- The averages of HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed are all around 70. A mean on its own does not say how strong and weak Pokemon are distributed, so the spread of each stat still needs to be checked.

I think it would be interesting to dive deeper into how much "better" legendary Pokemon are and whether their stats fall above the average of about 70.
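
One possible way to start on the legendary comparison is a grouped mean. This is a minimal sketch under stated assumptions: `sample` holds only five rows copied from the table shown earlier, so the numbers say nothing about the full dataset; swapping in the full `df` would be the real analysis.

```python
import pandas as pd

# Five rows copied from the table shown earlier (three regular, two legendary).
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "Diancie", "HoopaHoopa Confined"],
    "HP": [45, 60, 80, 50, 80],
    "Attack": [49, 62, 82, 100, 110],
    "Defense": [49, 63, 83, 150, 60],
    "Speed": [45, 60, 80, 50, 70],
    "Legendary": [False, False, False, True, True],
})

stat_cols = ["HP", "Attack", "Defense", "Speed"]

# Average of each stat, computed separately for legendary and regular Pokemon.
means_by_legendary = sample.groupby("Legendary")[stat_cols].mean()
print(means_by_legendary)
```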

### 2.5. Wrangling 

The next step is to wrangle your data based on your initial explorations. Normally, by this point, you have some idea of what your research question will be, and that will help you narrow and focus your dataset. 

In this lab, we will guide you through some wrangling tasks with this dataset.

#### 2.5.1. Drop the 'Generation', 'Sp. Atk', 'Sp. Def', 'Total', and the '#' columns

In [19]:
# Your solution here
# Drop the five columns named in the heading. Note that `dropna()` without a
# `subset` also removes every row with a missing 'Type 2'.
dfClean = df.drop(['Generation', 'Sp. Atk', 'Sp. Def', 'Total', '#'], axis=1).dropna(axis=0)
dfClean

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Speed,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,45,False
1,Ivysaur,Grass,Poison,60,62,63,60,False
2,Venusaur,Grass,Poison,80,82,83,80,False
3,VenusaurMega Venusaur,Grass,Poison,80,100,123,80,False
6,Charizard,Fire,Flying,78,84,78,100,False
...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,50,True
796,DiancieMega Diancie,Rock,Fairy,50,160,110,110,True
797,HoopaHoopa Confined,Psychic,Ghost,80,110,60,70,True
798,HoopaHoopa Unbound,Psychic,Dark,80,160,60,80,True


#### 2.5.2. Drop any NaN values in HP, Attack, Defense, Speed

In [20]:
# Your solution here
# Rows with NaN values were already removed in 2.5.1, so this changes nothing here.
dfClean = dfClean.dropna()
dfClean

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Speed,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,45,False
1,Ivysaur,Grass,Poison,60,62,63,60,False
2,Venusaur,Grass,Poison,80,82,83,80,False
3,VenusaurMega Venusaur,Grass,Poison,80,100,123,80,False
6,Charizard,Fire,Flying,78,84,78,100,False
...,...,...,...,...,...,...,...,...
795,Diancie,Rock,Fairy,50,100,150,50,True
796,DiancieMega Diancie,Rock,Fairy,50,160,110,110,True
797,HoopaHoopa Confined,Psychic,Ghost,80,110,60,70,True
798,HoopaHoopa Unbound,Psychic,Dark,80,160,60,80,True


#### 2.5.3. Reset the index to get a new index without missing values

In [21]:
# Your solution here
dfClean = dfClean.reset_index()
dfClean

Unnamed: 0,index,Name,Type 1,Type 2,HP,Attack,Defense,Speed,Legendary
0,0,Bulbasaur,Grass,Poison,45,49,49,45,False
1,1,Ivysaur,Grass,Poison,60,62,63,60,False
2,2,Venusaur,Grass,Poison,80,82,83,80,False
3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,80,False
4,6,Charizard,Fire,Flying,78,84,78,100,False
...,...,...,...,...,...,...,...,...,...
409,795,Diancie,Rock,Fairy,50,100,150,50,True
410,796,DiancieMega Diancie,Rock,Fairy,50,160,110,110,True
411,797,HoopaHoopa Confined,Psychic,Ghost,80,110,60,70,True
412,798,HoopaHoopa Unbound,Psychic,Dark,80,160,60,80,True


#### 2.5.4. A new column was added called `index`; remove it. 

In [23]:
# Your solution here
dfClean = dfClean.drop(['index'], axis=1)
dfClean

Unnamed: 0,Name,Type 1,Type 2,HP,Attack,Defense,Speed,Legendary
0,Bulbasaur,Grass,Poison,45,49,49,45,False
1,Ivysaur,Grass,Poison,60,62,63,60,False
2,Venusaur,Grass,Poison,80,82,83,80,False
3,VenusaurMega Venusaur,Grass,Poison,80,100,123,80,False
4,Charizard,Fire,Flying,78,84,78,100,False
...,...,...,...,...,...,...,...,...
409,Diancie,Rock,Fairy,50,100,150,50,True
410,DiancieMega Diancie,Rock,Fairy,50,160,110,110,True
411,HoopaHoopa Confined,Psychic,Ghost,80,110,60,70,True
412,HoopaHoopa Unbound,Psychic,Dark,80,160,60,80,True
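
Steps 2.5.2 to 2.5.4 can also be written more narrowly. The sketch below is an illustration, not the lab's required solution: it restricts `dropna` to the four stat columns through its `subset` parameter, so rows that are only missing `Type 2` survive, and it uses `reset_index(drop=True)` so no `index` column is created. The `toy` frame and its missing values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: names and missing values are invented for illustration.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Type 2": ["Poison", None, "Flying", None],
    "HP": [45, 39, np.nan, 44],
    "Attack": [49, 52, 84, 48],
    "Defense": [49, 43, 78, 65],
    "Speed": [45, 65, 100, 43],
})

stat_cols = ["HP", "Attack", "Defense", "Speed"]

cleaned = (
    toy.dropna(subset=stat_cols)   # only rows with a missing stat are removed
       .reset_index(drop=True)     # renumber from 0 without adding an `index` column
)
print(cleaned)
```

Here only row `C` is dropped, because it is the only one with a missing stat; rows `B` and `D` stay even though their `Type 2` is missing.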


#### 2.5.5. Calculate a new column called "Weighted Score" that computes an aggregate score comprising:

- 20% 'HP'
- 40% 'Attack'
- 30% 'Defense'
- 10% 'Speed'


In [12]:
# Your solution here
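
One way the weighted score could be computed is sketched below. It is a hedged example: `sample` contains three rows copied from the table shown earlier, and the `weights` dictionary is just a convenient way to hold the percentages listed above. On the real data the same expression would be applied to `dfClean`.

```python
import pandas as pd

# Three rows copied from the table shown earlier in this notebook.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur"],
    "HP": [45, 60, 80],
    "Attack": [49, 62, 82],
    "Defense": [49, 63, 83],
    "Speed": [45, 60, 80],
})

# The percentages from the task, written as fractions that sum to 1.
weights = pd.Series({"HP": 0.2, "Attack": 0.4, "Defense": 0.3, "Speed": 0.1})

# Multiply each stat column by its weight, then add across the columns per row.
sample["Weighted Score"] = sample[weights.index].mul(weights, axis="columns").sum(axis=1)
print(sample[["Name", "Weighted Score"]])
```

For Bulbasaur this works out to 0.2 × 45 + 0.4 × 49 + 0.3 × 49 + 0.1 × 45 = 47.8.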

### 2.6. Research questions 

#### 2.6.1 Come up with at least two research questions about your dataset that will require data visualizations to help answer. 

Recall that for this purpose, you should only aim for "Descriptive" or "Exploratory" research questions.

**Hint 1: You are welcome to calculate any columns that you think might be useful to answer the question (or re-add dropped columns like 'Generation', 'Sp. Atk', 'Sp. Def').**

**Hint 2: Try not to overthink this; this is a toy dataset about Pokémon, and you're not going to solve climate change or cure world hunger. Focus your research questions on the various Pokémon attributes and types.**

#### Your solution here:

**1. Sample Research Question:** Which Pokemon Types are the best, as determined by the Weighted Score?

**2. Your RQ 1:**

**3. Your RQ 2:**
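
The sample research question could be approached by averaging the weighted score per type. The following is a rough sketch on five rows copied from the table shown earlier; with so few rows the ranking is not meaningful, and the column name `Weighted Score` assumes the column from 2.5.5 has been created.

```python
import pandas as pd

# Five rows copied from the table shown earlier in this notebook.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "Charmander", "Diancie"],
    "Type 1": ["Grass", "Grass", "Grass", "Fire", "Rock"],
    "HP": [45, 60, 80, 39, 50],
    "Attack": [49, 62, 82, 52, 100],
    "Defense": [49, 63, 83, 43, 150],
    "Speed": [45, 60, 80, 65, 50],
})

# Same weighting as in 2.5.5.
sample["Weighted Score"] = (
    0.2 * sample["HP"]
    + 0.4 * sample["Attack"]
    + 0.3 * sample["Defense"]
    + 0.1 * sample["Speed"]
)

# Mean score per primary type, highest first.
score_by_type = (
    sample.groupby("Type 1")["Weighted Score"]
          .mean()
          .sort_values(ascending=False)
)
print(score_by_type)
```

A sorted Series like this can be passed to a bar plot to compare types visually.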



### 2.7. Save your dataset

Here, using the [pandas.DataFrame.to_csv](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) function, save your dataframe to be used in other tasks by naming it **task2.csv** in the data directory.

In [None]:
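# Your solution here.
# Assumes a `data/` directory exists relative to this notebook; adjust the path if needed.
dfClean.to_csv("data/task2.csv", index=False)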
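
To check that a saved file can be read back, a round trip like the one below may help. It is a sketch under assumptions: it writes to a temporary directory rather than the real `data` directory, and `sample` is a two-row stand-in for the wrangled dataframe.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two-row stand-in for the wrangled dataframe.
sample = pd.DataFrame({"Name": ["Bulbasaur", "Ivysaur"], "HP": [45, 60]})

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    out_path = data_dir / "task2.csv"

    # `index=False` keeps the row index out of the file, so reading it back
    # does not produce an extra "Unnamed: 0" column.
    sample.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded)
```

Calling `to_csv()` without a path returns the CSV text as a string instead of writing a file, which is why a path is passed here.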