## Data Wrangling with Python: Intro to Pandas
Note: Notebook adapted from [here](https://github.com/EricElmoznino/lighthouse_pandas_tutorial/blob/master/pandas_tutorial.ipynb) & [here](https://github.com/sedv8808/LighthouseLabs/tree/main/W02D2) & from LHL's [21 Day Data Challenge](https://data-challenge.lighthouselabs.ca/start)
#### Instructor: Andrew Berry
#### Date: March 4, 2021

**Agenda:**
 - Why Pandas? --- because it's da-BESSSST
 - Pandas Basics
     - Pandas Series vs. Pandas DataFrames
     - .loc() vs. iloc()
 - Pandas Advance
     - Filtering
     - Group bys
 - Pandas Exercises
     - Challenge 1
     - Challenge 2

### Pandas: Why Pandas? What is it? 

To do data anlaysis with Python, Pandas is a great tool to for dealing with data in a tabular and time series formats. Designed by Wes McKinney as an attempt to port R's dataframes to python. 

- Python Package for working with **tables**
- Similar to SQL & Excel
    - Faster
    - More features to manipulate, transform, and aggregate data
- Easy to handle messy and missing data
- Great at working with large data files
- When combing with other Python libraries, it's fairly easy to create bautiful and customazied visuals. Easy integration with Matplotlib, Seaborn, Plotly.
- Easy integration with machine learning plugins (sckit-learn)
    
    
-----------
To read more about, Wes McKinney, the creator of Pandas, check out the article below.

1. https://qz.com/1126615/the-story-of-the-most-important-tool-in-data-science/

--------------


## Think of how we would try to represent a table in Python?


In [61]:
#A dicitonary of lists example
students = {
    'student_id': [1, 2, 3, 4, 5, 6],
    'name': ['Daenerys', 'Jon', 'Arya', 'Sansa', 'Eddard', 'Khal Drogo'],
    'course_mark': [82, 100, 12, 76, 46, 20],
    'species': ['cat', 'human', 'cat', 'human', 'human', 'human']
}

**What are some operations we might want to do on this data?**

- 1.Select a subset of columns
- 2.Filter out some rows based on an attribute
- 3.Group by some attribute
- 4.Compute some aggregate values within groups
- 5.Save to a file

How about we try out one of these to see how easy it is

### Try to return a table with the mean course mark per-species.

In [62]:
# Return a table with the mean course mark per-species
# Think about a SQL statment where we group by species with the average course mark

species_sums = {} # Tables of Sums
species_counts = {} # Count per Species
for i in range(len(students['species'])):  # iterating over the rows
    species = students['species'][i] # every row number I get species 
    course_mark = students['course_mark'][i] # and course mark
    if species not in species_sums: # Intializing Species if not in list
        species_sums[species] = 0
        species_counts[species] = 0
    species_sums[species] += course_mark # Add each course mark for each species
    species_counts[species] += 1 

species_means = {}
                                  
for species in species_sums: # for every unique species we found
    species_means[species] = species_sums[species] / species_counts[species] # sum/count

species_means # return

{'cat': 47.0, 'human': 60.5}

- Did you like looking at is? Does this look fun to do?
- Super Tiring. 

## Pandas Version

In [63]:
# Pandas Version
import pandas as pd

# Can take in a dictionry of list to instatiate a DataFrame
students = pd.DataFrame(students) 
students

Unnamed: 0,student_id,name,course_mark,species
0,1,Daenerys,82,cat
1,2,Jon,100,human
2,3,Arya,12,cat
3,4,Sansa,76,human
4,5,Eddard,46,human
5,6,Khal Drogo,20,human


In [6]:
species_means = students[['species', 'course_mark']].groupby('species').mean()
# species_means = students.groupby('species')['course_mark'].mean()
species_means

Unnamed: 0_level_0,course_mark
species,Unnamed: 1_level_1
cat,47.0
human,60.5


### Dissecting the above code!

In [7]:
#Step 1: Filter out the columns we want to keep
students_filtered = students[['species','course_mark']]
students_filtered

Unnamed: 0,species,course_mark
0,cat,82
1,human,100
2,cat,12
3,human,76
4,human,46
5,human,20


In [8]:
# Step 2: Group by species column
students_grouped_by_species = students_filtered.groupby('species') 
students_grouped_by_species

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f0c03806790>

In [9]:
#Step 3: Specify how to aggregate the course-mark column
species_means = students_grouped_by_species.mean()

In [10]:
species_means

Unnamed: 0_level_0,course_mark
species,Unnamed: 1_level_1
cat,47.0
human,60.5


#### As shown, Pandas makes use of vectorized operations. 


- Rather than use for-loops, we specify the operation that will apply to the structure as a whole (i.e. all the rows)
- By vectorizing, **the code becomes more concise and more readable**
- Pandas is optimized for vectorized operations (parallel vs. serial computation), which makes them **much faster**
- It is almost always possible to vectorize operations on Pandas data types


### Getting Started: Pandas Series & Pandas DataFrames

There are two Pandas data types of interest:

- Series (column)
    - A pandas series is similar to an array but it has an index. The index is constant, and doesnt change through the operations we apply to the series. 
- DataFrame (table)
    - A pandas dataframe is an object that is similar to a collection of pandas series.

In [11]:
# One way to construct a Series
series = pd.Series([82, 100, 12, 76, 46, 20]) 
series

0     82
1    100
2     12
3     76
4     46
5     20
dtype: int64

In [12]:
#We can specify some index when building a series. 
grades = pd.Series([82, 100, 12, 76, 46, 20], 
                   index = ['Daenerys', 'Jon', 'Arya', 'Sansa', 'Eddard', 'Khal Drogo'] ) 

grades

Daenerys       82
Jon           100
Arya           12
Sansa          76
Eddard         46
Khal Drogo     20
dtype: int64

In [13]:
print("The values:", grades.values)
print("The indexes:", grades.index)

The values: [ 82 100  12  76  46  20]
The indexes: Index(['Daenerys', 'Jon', 'Arya', 'Sansa', 'Eddard', 'Khal Drogo'], dtype='object')


**Note:** The underlying index is still 0, 1, 2, 3.... and we can still index on that:

In [14]:
grades[2]

12

### Pandas DataFrames

In [15]:
# One way to construct a DataFrame
df = pd.DataFrame({
    'name': ['Daenerys', 'Jon', 'Arya', 'Sansa'],
    'course_mark': [82, 100, 12, 76],
    'species': ['human', 'human', 'cat', 'human']},
    index=[1412, 94, 9351, 14])
df

Unnamed: 0,name,course_mark,species
1412,Daenerys,82,human
94,Jon,100,human
9351,Arya,12,cat
14,Sansa,76,human


#### Reading a CSV file

We'll use the function `read_csv()` to load the data into our notebook

- The `read_csv()` function can read data from a locally saved file or from a URL
- We'll store the data as a variable `df_pokemon`

In [64]:
df_pokemon = pd.read_csv('data/pokemon.csv') # good practice to save dataframe info under 'df' then identifiers like 'pokemon'

In [20]:
df_pokemon # not the best practice if the dataframe is very large

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


**What do we see here?**
- Each row of the table is an observation, containing data of a single pokemon

In [21]:
df_pokemon.shape # also tells us rows and columns

(800, 13)

For large DataFrames, it's often useful to display just the first few or last few rows:

In [23]:
df_pokemon.head(10) # gives the first 5 rows alike the LIMIT function of SQL

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
5,5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False
6,6,Charizard,Fire,Flying,534,78,84,78,109,85,100,1,False
7,6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100,1,False
8,6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100,1,False
9,7,Squirtle,Water,,314,44,48,65,50,64,43,1,False


> **Pro tip:**
> - To display the documentation for this method within Jupyter notebook, you can run the command `df_pokemon.head?` or press `Shift-Tab` within the parentheses of `df_pokemon.head()`
> - To see other methods available for the DataFrame, type `df_pokemon.` followed by `Tab` for auto-complete options 

In [25]:
df_pokemon.tail()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True
799,721,Volcanion,Fire,Water,600,80,110,120,130,90,70,6,True


## Data at a Glance

`pandas` provides many ways to quickly and easily summarize your data:
- How many rows and columns are there?
- What are all the column names and what type of data is in each column?
- How many values are missing in each column or row?
- Numerical data: What is the average and range of the values?
- Text data: What are the unique values and how often does each occur?

### Peeking into the pokemon dataset

- Similar with getting familar with SQL tables, it is often a great idea to look at the pandas dataframes we are working with. Below are some of the basic methods to glance at a dataset. 

In [26]:
#Getting the Columns
df_pokemon.columns

Index(['#', 'Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense',
       'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')

In [27]:
#Getting Summary Statistics
df_pokemon.describe()

Unnamed: 0,#,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation
count,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0,800.0
mean,362.81375,435.1025,69.25875,79.00125,73.8425,72.82,71.9025,68.2775,3.32375
std,208.343798,119.96304,25.534669,32.457366,31.183501,32.722294,27.828916,29.060474,1.66129
min,1.0,180.0,1.0,5.0,5.0,10.0,20.0,5.0,1.0
25%,184.75,330.0,50.0,55.0,50.0,49.75,50.0,45.0,2.0
50%,364.5,450.0,65.0,75.0,70.0,65.0,70.0,65.0,3.0
75%,539.25,515.0,80.0,100.0,90.0,95.0,90.0,90.0,5.0
max,721.0,780.0,255.0,190.0,230.0,194.0,230.0,180.0,6.0


In [28]:
df_pokemon['Total'].describe() # the info from this column alone

count    800.00000
mean     435.10250
std      119.96304
min      180.00000
25%      330.00000
50%      450.00000
75%      515.00000
max      780.00000
Name: Total, dtype: float64

In [29]:
df_pokemon[['Total','Attack']].describe() # now for two columns

Unnamed: 0,Total,Attack
count,800.0,800.0
mean,435.1025,79.00125
std,119.96304,32.457366
min,180.0,5.0
25%,330.0,55.0
50%,450.0,75.0
75%,515.0,100.0
max,780.0,190.0


In [30]:
df_pokemon.Total.describe() # works only if the one column you want to look through has no spces

count    800.00000
mean     435.10250
std      119.96304
min      180.00000
25%      330.00000
50%      450.00000
75%      515.00000
max      780.00000
Name: Total, dtype: float64

In [32]:
#Checking for Missing Data
df_pokemon.isnull().sum() # weithout the aggregate function it is difficult to parse through the missing files but with sum we can identify where it is missing

#               0
Name            0
Type 1          0
Type 2        386
Total           0
HP              0
Attack          0
Defense         0
Sp. Atk         0
Sp. Def         0
Speed           0
Generation      0
Legendary       0
dtype: int64

## The .loc() vs .iloc() method
##### ---the method to find subsets of their data. Only difference is that iloc uses index integars

To select rows and columns at the same time, we use the syntax `.loc[<rows>, <columns>]`:

In [33]:
#Notice the square brackets on loc and the colon
df_pokemon.loc[ : , ['Name']] # colon gets us all the rows for the column listed as Name
# you can add a .head() to the end to see the top 5 or you can change your colon to 0:4 which slices the top 5 only

Unnamed: 0,Name
0,Bulbasaur
1,Ivysaur
2,Venusaur
3,VenusaurMega Venusaur
4,Charmander
...,...
795,Diancie
796,DiancieMega Diancie
797,HoopaHoopa Confined
798,HoopaHoopa Unbound


In [35]:
#Taking a slice of index values
df_pokemon.loc[20:30,['Name']]

Unnamed: 0,Name
20,Pidgey
21,Pidgeotto
22,Pidgeot
23,PidgeotMega Pidgeot
24,Rattata
25,Raticate
26,Spearow
27,Fearow
28,Ekans
29,Arbok


In [36]:
# Getting more than one columns
df_pokemon.loc[20:30,['Name','Legendary']]

Unnamed: 0,Name,Legendary
20,Pidgey,False
21,Pidgeotto,False
22,Pidgeot,False
23,PidgeotMega Pidgeot,False
24,Rattata,False
25,Raticate,False
26,Spearow,False
27,Fearow,False
28,Ekans,False
29,Arbok,False


In [37]:
#we can also feed in a list for the rows
df_pokemon.loc[[1,8,21,7,432,666],['Name','Legendary']]

Unnamed: 0,Name,Legendary
1,Ivysaur,False
8,CharizardMega Charizard Y,False
21,Pidgeotto,False
7,CharizardMega Charizard X,False
432,Turtwig,False
666,Elgyem,False


In [None]:
#We can also slice over  range of column values

In [38]:
#Iloc is use for integer based indexing
df_pokemon.iloc[0:3,1:4]


Unnamed: 0,Name,Type 1,Type 2
0,Bulbasaur,Grass,Poison
1,Ivysaur,Grass,Poison
2,Venusaur,Grass,Poison


### Modifying a Column or Creating a new column

In [None]:
# good practice to create a copy of the dataframe to keep track of the changes you make
df_pokemon_2 = df_pokemon.copy()

In [39]:
df_pokemon.head(3)

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False


In [47]:
# Combine 'Attack' + 'Special Attack'
df_pokemon['Total_Attack'] = df_pokemon['Attack'] + df_pokemon['Sp. Atk']
df_pokemon.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Total_Attack,filler
0,1,Bulbasaur,Grass,Poison,2544,45,49,49,65,65,45,1,False,114,True
1,2,Ivysaur,Grass,Poison,3240,60,62,63,80,80,60,1,False,142,True
2,3,Venusaur,Grass,Poison,4200,80,82,83,100,100,80,1,False,182,True
3,3,VenusaurMega Venusaur,Grass,Poison,5000,80,100,123,122,120,80,1,False,222,True
4,4,Charmander,Fire,,2472,39,52,43,60,50,65,1,False,112,True


In [49]:
#Create a filler column
df_pokemon['filler'] = True
df_pokemon.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Total_Attack,filler
0,1,Bulbasaur,Grass,Poison,2544,45,49,49,65,65,45,1,False,114,True
1,2,Ivysaur,Grass,Poison,3240,60,62,63,80,80,60,1,False,142,True
2,3,Venusaur,Grass,Poison,4200,80,82,83,100,100,80,1,False,182,True
3,3,VenusaurMega Venusaur,Grass,Poison,5000,80,100,123,122,120,80,1,False,222,True
4,4,Charmander,Fire,,2472,39,52,43,60,50,65,1,False,112,True


In [None]:
#Copy total and creat that


In [46]:
#Modify an orginal 
df_pokemon['Total'] = df_pokemon['Total'] * 2
df_pokemon.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Total_Attack,filler
0,1,Bulbasaur,Grass,Poison,2544,45,49,49,65,65,45,1,False,114,True
1,2,Ivysaur,Grass,Poison,3240,60,62,63,80,80,60,1,False,142,True
2,3,Venusaur,Grass,Poison,4200,80,82,83,100,100,80,1,False,182,True
3,3,VenusaurMega Venusaur,Grass,Poison,5000,80,100,123,122,120,80,1,False,222,True
4,4,Charmander,Fire,,2472,39,52,43,60,50,65,1,False,112,True


In [51]:
#Modify Data Frame with .loc() method
df_pokemon.loc[[1,2,3], ['Name']] = 'Andrew'
df_pokemon.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Total_Attack,filler
0,1,Bulbasaur,Grass,Poison,2544,45,49,49,65,65,45,1,False,114,True
1,2,Andrew,Grass,Poison,3240,60,62,63,80,80,60,1,False,142,True
2,3,Andrew,Grass,Poison,4200,80,82,83,100,100,80,1,False,182,True
3,3,Andrew,Grass,Poison,5000,80,100,123,122,120,80,1,False,222,True
4,4,Charmander,Fire,,2472,39,52,43,60,50,65,1,False,112,True


### Sort_values() & value_counts()

1. ***df.sort_values()***
2. ***df.value_counts()***


The ***pandas.sort_values()*** allows us to reorder our dataframe in an ascending or descending order given a column for pandas to work from. This is similar to the excel sort function.

```python
import pandas as pd
df = pd.read_csv('random.csv')
df


df.sort_values(by=['some_column'], ascending = True)
```
In the above code snippet, we are sorting our *random.csv* pandas data frame by the column *some_column* in ascending order. To read more on the ***df.sort_values()*** function, read this [article](https://datatofish.com/sort-pandas-dataframe/).

The second function is ***df.value_counts()***, it allows us to count how many times a specific value/item occurred in the dataframe. This function is best used on a specific column on a data frame, ideally on a column representing categorical data. Categorical data refers to a statistical data type consisting of categorical variables. 

```python
df['column'].value_counts()
```

To read more on some of the advanced functionalities of ***df.value_counts()***, please refer to the pandas documentation or this [article](https://towardsdatascience.com/getting-more-value-from-the-pandas-value-counts-aa17230907a6).

In [52]:
df_pokemon.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Total_Attack,filler
0,1,Bulbasaur,Grass,Poison,2544,45,49,49,65,65,45,1,False,114,True
1,2,Andrew,Grass,Poison,3240,60,62,63,80,80,60,1,False,142,True
2,3,Andrew,Grass,Poison,4200,80,82,83,100,100,80,1,False,182,True
3,3,Andrew,Grass,Poison,5000,80,100,123,122,120,80,1,False,222,True
4,4,Charmander,Fire,,2472,39,52,43,60,50,65,1,False,112,True


In [53]:
df_pokemon.sort_values(by = ['Generation'], ascending=False) # we found ascending by shift+TAB to look at the variables

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Total_Attack,filler
799,721,Volcanion,Fire,Water,4800,80,110,120,130,90,70,6,True,240,True
738,670,Floette,Fairy,,2968,54,45,47,75,98,52,6,False,120,True
740,672,Skiddo,Grass,,2800,66,65,48,62,57,52,6,False,127,True
741,673,Gogoat,Grass,,4248,123,100,62,97,81,68,6,False,197,True
742,674,Pancham,Fighting,,2784,67,82,62,46,48,43,6,False,128,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110,102,Exeggcute,Grass,Psychic,2600,60,40,80,60,45,40,1,False,100,True
109,101,Electrode,Electric,,3840,60,50,70,80,80,140,1,False,130,True
108,100,Voltorb,Electric,,2640,40,30,50,55,55,100,1,False,85,True
107,99,Kingler,Water,,3800,55,130,115,50,50,75,1,False,180,True


In [54]:
df_pokemon.sort_values(by = ['HP', 'Attack'], ascending = [True, False]) # where HP ascends and Attack descends

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Total_Attack,filler
316,292,Shedinja,Bug,Ghost,1888,1,90,45,30,30,40,3,False,120,True
55,50,Diglett,Ground,,2120,10,55,25,35,45,95,1,False,90,True
186,172,Pichu,Electric,,1640,20,40,15,35,35,60,2,False,75,True
388,355,Duskull,Ghost,,2360,20,40,90,30,90,25,3,False,70,True
487,439,Mime Jr.,Psychic,Fairy,2480,20,25,45,70,90,60,4,False,95,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
655,594,Alomomola,Water,,3760,165,75,80,40,45,65,5,False,115,True
351,321,Wailord,Water,,4000,170,90,45,90,45,60,3,False,180,True
217,202,Wobbuffet,Psychic,,3240,190,33,58,33,58,33,2,False,66,True
121,113,Chansey,Normal,,3600,250,5,5,35,105,50,1,False,40,True


In [55]:
df_pokemon['Type 1'].value_counts() # see how often we get X data type
# can only search by one column at a time

Water       112
Normal       98
Grass        70
Bug          69
Psychic      57
Fire         52
Rock         44
Electric     44
Dragon       32
Ghost        32
Ground       32
Dark         31
Poison       28
Steel        27
Fighting     27
Ice          24
Fairy        17
Flying        4
Name: Type 1, dtype: int64

In [65]:
# order the above code further such as alphabetical or by count
df_pokemon['Type 1'].value_counts().sort_index()

Bug          69
Dark         31
Dragon       32
Electric     44
Fairy        17
Fighting     27
Fire         52
Flying        4
Ghost        32
Grass        70
Ground       32
Ice          24
Normal       98
Poison       28
Psychic      57
Rock         44
Steel        27
Water       112
Name: Type 1, dtype: int64

In [58]:
#Just Unique Values (omit the count)
df_pokemon['Type 1'].unique()

array(['Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Poison', 'Electric',
       'Ground', 'Fairy', 'Fighting', 'Psychic', 'Rock', 'Ghost', 'Ice',
       'Dragon', 'Dark', 'Steel', 'Flying'], dtype=object)

In [59]:
#How many unique Values
df_pokemon['Type 1'].nunique()

18

### How to Query or Filter Data with Conditions?

- We can extract specific data from our dataframe based on a specific condition. We will be using the syntax below. Pandas will return a subset of the dataframe based on the given condition. 

```python
df[<insert_condition>]
```

Conditions follow the generic boolean logic in Python. Below is a cheat sheet python boolean logic.

**Conditional Logic:** 

Conditional logic refers to the execution of different actions based on whether a certain condition is met. In programming, these conditions are expressed by a set of symbols called **Boolean Operators**. 

| Boolean Comparator | Example | Meaning                         |
|--------------------|---------|---------------------------------|
| >                  | x > y   | x is greater than y             |
| >=                 | x >= y  | x is greater than or equal to y |
| <                  | x < y   | x is less than y                |
| <=                 | x <= y  | x is less than or equal to y    |
| !=                 | x != y  | x is not equal to y             |
| ==                 | x == y  | x is equal to y                 |




In [65]:
#Step 1: Create a filter
the_filter = df_pokemon['Total'] >= 500

In [66]:
#Step 2: Apply Filter
df_pokemon[the_filter]

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Total_Attack,filler
0,1,Bulbasaur,Grass,Poison,2544,45,49,49,65,65,45,1,False,114,True
1,2,Andrew,Grass,Poison,3240,60,62,63,80,80,60,1,False,142,True
2,3,Andrew,Grass,Poison,4200,80,82,83,100,100,80,1,False,182,True
3,3,Andrew,Grass,Poison,5000,80,100,123,122,120,80,1,False,222,True
4,4,Charmander,Fire,,2472,39,52,43,60,50,65,1,False,112,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,4800,50,100,150,100,150,50,6,True,200,True
796,719,DiancieMega Diancie,Rock,Fairy,5600,50,160,110,160,110,110,6,True,320,True
797,720,HoopaHoopa Confined,Psychic,Ghost,4800,80,110,60,150,130,70,6,True,260,True
798,720,HoopaHoopa Unbound,Psychic,Dark,5440,80,160,60,170,130,80,6,True,330,True


In [67]:
#Finding Only Legendary Pokemons
the_filter = (df_pokemon['Total'] >= 500) & (df_pokemon['Legendary'] == True)

In [68]:
df_pokemon[the_filter]

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Total_Attack,filler
156,144,Articuno,Ice,Flying,4640,90,85,100,95,125,85,1,True,180,True
157,145,Zapdos,Electric,Flying,4640,90,90,85,125,90,100,1,True,215,True
158,146,Moltres,Fire,Flying,4640,90,100,90,125,85,90,1,True,225,True
162,150,Mewtwo,Psychic,,5440,106,110,90,154,90,130,1,True,264,True
163,150,MewtwoMega Mewtwo X,Psychic,Fighting,6240,106,190,100,154,100,130,1,True,344,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,4800,50,100,150,100,150,50,6,True,200,True
796,719,DiancieMega Diancie,Rock,Fairy,5600,50,160,110,160,110,110,6,True,320,True
797,720,HoopaHoopa Confined,Psychic,Ghost,4800,80,110,60,150,130,70,6,True,260,True
798,720,HoopaHoopa Unbound,Psychic,Dark,5440,80,160,60,170,130,80,6,True,330,True


Alternativly we can use the pandas **.where()** function, and it has the following syntax.


```python 
df.where(<condition>, <What to fill inplace where the condition is not True>) #default is Nan
```

In [69]:
df_pokemon.where(the_filter, 'Weak Pokemon') # good to filter out the condition true or not and this can be saved to another variable

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Total_Attack,filler
0,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon
1,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon
2,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon
3,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon
4,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon,Weak Pokemon
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,4800,50,100,150,100,150,50,6,True,200,True
796,719,DiancieMega Diancie,Rock,Fairy,5600,50,160,110,160,110,110,6,True,320,True
797,720,HoopaHoopa Confined,Psychic,Ghost,4800,80,110,60,150,130,70,6,True,260,True
798,720,HoopaHoopa Unbound,Psychic,Dark,5440,80,160,60,170,130,80,6,True,330,True


### Grouping and Aggregation 

Grouping and aggregation can be used to calculate statistics on groups in the data.

**Common Aggregation Functions**
- mean()
- median()
- sum()
- count()


In [70]:
df_pokemon.groupby('Type 1').mean()

Unnamed: 0_level_0,#,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Total_Attack,filler
Type 1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Bug,334.492754,3031.42029,56.884058,70.971014,70.724638,53.869565,64.797101,61.681159,3.217391,0.0,124.84058,1.0
Dark,461.354839,3565.935484,66.806452,88.387097,70.225806,74.645161,69.516129,76.16129,4.032258,0.064516,163.032258,1.0
Dragon,474.375,4404.25,83.3125,112.125,86.375,96.84375,88.84375,83.03125,3.875,0.375,208.96875,1.0
Electric,363.5,3547.272727,59.795455,69.090909,66.295455,90.022727,73.704545,84.5,3.272727,0.090909,159.113636,1.0
Fairy,449.529412,3305.411765,74.117647,61.529412,65.705882,78.529412,84.705882,48.588235,4.117647,0.058824,140.058824,1.0
Fighting,363.851852,3331.555556,69.851852,96.777778,65.925926,53.111111,64.703704,66.074074,3.37037,0.0,149.888889,1.0
Fire,327.403846,3664.615385,69.903846,84.769231,67.769231,88.980769,72.211538,74.442308,3.211538,0.096154,173.75,1.0
Flying,677.75,3880.0,70.75,78.75,66.25,94.25,72.5,102.5,5.5,0.5,173.0,1.0
Ghost,486.5,3516.5,64.4375,73.78125,81.1875,79.34375,76.46875,64.34375,4.1875,0.0625,153.125,1.0
Grass,344.871429,3369.142857,67.271429,73.214286,70.8,77.5,70.428571,61.928571,3.357143,0.042857,150.714286,1.0


- By default, `groupby()` assigns the variable that we're grouping on (in this case `Type 1`) to the index of the output data
- If we use the keyword argument `as_index=False`, the grouping variable is instead assigned to a regular column
  - This can be useful in some situations, such as data visualization functions which expect the relevant variables to be in columns rather than the index

In [72]:
df_pokemon.groupby('Type 1', as_index = False).sum() # lets us not use the grouping as a column, better for transforming later

Unnamed: 0,Type 1,#,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Total_Attack,filler
0,Bug,23080,209168,3925,4897,4880,3717,4471,4256,222,0.0,8614,69.0
1,Dark,14302,110544,2071,2740,2177,2314,2155,2361,125,2.0,5054,31.0
2,Dragon,15180,140936,2666,3588,2764,3099,2843,2657,124,12.0,6687,32.0
3,Electric,15994,156080,2631,3040,2917,3961,3243,3718,144,4.0,7001,44.0
4,Fairy,7642,56192,1260,1046,1117,1335,1440,826,70,1.0,2381,17.0
5,Fighting,9824,89952,1886,2613,1780,1434,1747,1784,91,0.0,4047,27.0
6,Fire,17025,190560,3635,4408,3524,4627,3755,3871,167,5.0,9035,52.0
7,Flying,2711,15520,283,315,265,377,290,410,22,2.0,692,4.0
8,Ghost,15568,112528,2062,2361,2598,2539,2447,2059,134,2.0,4900,32.0
9,Grass,24141,235840,4709,5125,4956,5425,4930,4335,235,3.0,10550,70.0


In [74]:
df_pokemon.groupby('Type 1', as_index = False)['Attack'].max() # not showing all the columns

Unnamed: 0,Type 1,Attack
0,Bug,185
1,Dark,150
2,Dragon,180
3,Electric,123
4,Fairy,131
5,Fighting,145
6,Fire,160
7,Flying,115
8,Ghost,165
9,Grass,132


In [76]:
df_pokemon.groupby(['Type 1', 'Legendary']).max() # index becomes a multi-index

Unnamed: 0_level_0,Unnamed: 1_level_0,#,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Total_Attack,filler
Type 1,Legendary,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Bug,False,666,4800,86,185,230,135,230,160,6,240,True
Dark,False,687,4800,110,150,125,140,130,125,6,265,True
Dark,True,717,5440,126,131,95,135,98,125,6,262,True
Dragon,False,706,5600,108,170,130,120,150,120,6,290,True
Dragon,True,718,6240,125,180,121,180,150,115,6,360,True
Electric,False,702,4880,90,123,115,165,110,140,6,260,True
Electric,True,642,4640,90,115,85,145,100,115,5,250,True
Fairy,False,700,4416,101,120,95,120,154,80,6,180,True
Fairy,True,716,5440,126,131,95,131,98,99,6,262,True
Fighting,False,701,5000,144,145,95,140,110,118,6,285,True


We can use the `agg` method to compute multiple aggregated statistics on our data, for example minimum and maximum country populations in each region:

In [77]:
df_pokemon.groupby('Type 1', as_index = False)['Attack'].agg(['min','max','mean'])

Unnamed: 0_level_0,min,max,mean
Type 1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Bug,10,185,70.971014
Dark,50,150,88.387097
Dragon,50,180,112.125
Electric,30,123,69.090909
Fairy,20,131,61.529412
Fighting,35,145,96.777778
Fire,30,160,84.769231
Flying,30,115,78.75
Ghost,30,165,73.78125
Grass,27,132,73.214286


We can also use `agg` to compute different statistics for different columns:

In [78]:
agg_dict = {
    'Attack' : 'mean',
    'Defense' : ['min', 'max'] 
}

In [79]:
df_pokemon.groupby(['Type 1','Legendary'], as_index = False).agg(agg_dict)

Unnamed: 0_level_0,Type 1,Legendary,Attack,Defense,Defense
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,mean,min,max
0,Bug,False,70.971014,30,230
1,Dark,False,86.862069,30,125
2,Dark,True,110.5,90,95
3,Dragon,False,103.4,35,130
4,Dragon,True,126.666667,80,121
5,Electric,False,66.125,15,115
6,Electric,True,98.75,70,85
7,Fairy,False,57.1875,28,95
8,Fairy,True,131.0,95,95
9,Fighting,False,96.777778,30,95


### Challenge 1 (20 minutes)

Let's play around with Pandas on a more intricate dataset: a dataset on wines!

**Challenge 14 from the 21 Day Data Challenge** 

Dot's neighbour said that he only likes wine from Stellenbosch, Bordeaux, and the Okanagan Valley, and that the sulfates can't be that high. The problem is, Dot can't really afford to spend tons of money on the wine. Dot's conditions for searching for wine are: 
1. Sulfates cannot be higher than 0.6. 
2. The price has to be less than  $20. 

Use the above conditions to filter the data for questions **2 and 3** below. 

**Questions:**
1. Where is Stellenbosch, anyway? How many wines from Stellenbosch are there in the *entire dataset*? 
2. *After filtering with the 2 conditions*, what is the average price of wine from the Bordeaux region? 
3. *After filtering with the 2 conditions*, what is the least expensive wine that's of the highest quality from the Okanagan Valley?



**Stretch Question:**
1. What is the average price of wine from Stellenbosch, according to the entire unfiltered dataset? 


**Note: Check the dataset to see if there are missing values; if there are, fill in missing values with the mean.**


In [None]:
#Write your Code Below

In [34]:
import pandas as pd
df = pd.read_csv('winequality-red_2.csv')
df = df.drop(columns = ['Unnamed: 0'])

df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,region,price
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,Colchagua Valley,64
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,Bordeaux,89
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,La Rjoja,25
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,Willamette,27
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,Marlborough,9


In [35]:
df['region'].value_counts()

La Rjoja            341
Bordeaux            264
Colchagua Valley    260
Okanagan Valley     256
Willamette          233
Marlborough         210
Stellenbosch         35
Name: region, dtype: int64

In [36]:
df_wine = (df['sulphates'] <= 0.6) & (df['price'] <= 20)

2. *After filtering with the 2 conditions*, what is the average price of wine from the Bordeaux region? 

In [37]:
df[df_wine].groupby(['region'])['price'].mean()

region
Bordeaux            11.714286
Colchagua Valley    13.818182
La Rjoja            12.968750
Marlborough         10.000000
Okanagan Valley     11.000000
Stellenbosch        17.333333
Willamette          12.526316
Name: price, dtype: float64

3. After filtering with the 2 conditions, what is the least expensive wine that's of the highest quality from the Okanagan Valley?

In [38]:
df_wine = (df['sulphates'] <= 0.6) & (df['price'] <= 20) & (df['region'] == 'Okanagan Valley')

In [40]:
df[df_wine].sort_values(by=['quality', 'price'], ascending = [False,True]).head(1)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,region,price
1025,8.6,0.83,0.0,2.8,0.095,17.0,43.0,0.99822,3.33,0.6,10.4,6,Okanagan Valley,4


### Challenge 2 (25 minutes)

**Challenge 21 from the 21DDC (Adapted)**

Dot wants to play retro video games with all their new friends! Help them figure out which games would be best.

Questions: 
    
1. What is the top 5 best selling games released before the year 2000.

     -  **Note**: Use Global_Sales
    
    
2. Create a new column called Aggregate_Score, which returns the proportional average between Critic Score and User_Score based on Critic_Count and User_Count. Plot a horizontal bar chart of the top 5 highest rated games by Aggregate_Score, not published by Nintendo before the year 2000. From this bar chart, what is the highest rated game by Aggregate_Score?

    -  **Note**: Critic_Count should be filled with the mean. User_Count should be filled with the median.
    
    
#### In the exercise above, there is some missing values in the dataset. Look up the pandas documentation to figure out how to fill missing values in a column. You will be using the **ffill()** function.   

In [41]:
df = pd.read_csv('video_games.csv')
df.head(2)

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,7.5,,,


In [42]:
df_games = df.copy()
df_games.head()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,7.5,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,,,7.5,,,


In [43]:
# Q1 What is the top 5 best selling games released before the year 2000.
filter_year = (df_games['Year_of_Release'] < 2000.0)
df_games[filter_year].sort_values(by='Global_Sales', ascending = False).head(5)

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,7.5,,,
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,,,7.5,,,
5,Tetris,GB,1989.0,Puzzle,Nintendo,23.2,2.26,4.22,0.58,30.26,,,7.5,,,
9,Duck Hunt,NES,1984.0,Shooter,Nintendo,26.93,0.63,0.28,0.47,28.31,,,7.5,,,
12,Pokemon Gold/Pokemon Silver,GB,1999.0,Role-Playing,Nintendo,9.0,6.18,7.2,0.71,23.1,,,7.5,,,


Create a new column called Aggregate_Score, which returns the proportional average between Critic Score and User_Score based on Critic_Count and User_Count. Plot a horizontal bar chart of the top 5 highest rated games by Aggregate_Score, not published by Nintendo before the year 2000. From this bar chart, what is the highest rated game by Aggregate_Score?

Note: Critic_Count should be filled with the mean. User_Count should be filled with the median.

𝐴𝑔𝑔𝑟𝑒𝑔𝑎𝑡𝑒𝑆𝑐𝑜𝑟𝑒 = (𝐶𝑟𝑖𝑡𝑖𝑐𝐶𝑜𝑢𝑛𝑡 ∗ 𝐶𝑟𝑖𝑡𝑖𝑐𝑆𝑐𝑜𝑟𝑒) + (𝑈𝑠𝑒𝑟𝐶𝑜𝑢𝑛𝑡 ∗ 𝑈𝑠𝑒𝑟𝑆𝑐𝑜𝑟𝑒) / 𝑈𝑠𝑒𝑟𝐶𝑜𝑢𝑛𝑡 + 𝐶𝑟𝑖𝑡𝑖𝑐𝐶𝑜𝑢𝑛𝑡

In [50]:
df_games.Critic_Count.isnull().sum()

0

In [51]:
df_games['Critic_Count'] = df_games['Critic_Count'].fillna(value = df_games.Critic_Score.mean())

In [52]:
df_games['User_Count'] = df_games['User_Count'].fillna(value = df_games.Critic_Score.median())

In [53]:
df_games['User_Score'] = df_games['User_Score'] * 10

In [54]:
df_games['Aggregate_Score'] = ((df_games.Critic_Count * df_games.Critic_Score) + (df_games.User_Count * df_games.User_Score)) / (df_games.User_Count + df_games.Critic_Count)

In [55]:
df_games["Aggregate_Score"].describe()

count    8137.000000
mean       70.746777
std        12.150460
min        10.054054
25%        65.290541
50%        73.285714
75%        78.753247
max        94.193168
Name: Aggregate_Score, dtype: float64

In [57]:
nintendo_filter = df_games["Publisher"] != 'Nintendo'

In [58]:
nintendo = df_games[filter_year]
nintendo = nintendo[nintendo_filter]

nintendo.sort_values('Aggregate_Score', ascending = False).head(5)

  


Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating,Aggregate_Score
146,Metal Gear Solid,PS,1998.0,Action,Konami Digital Entertainment,3.18,1.83,0.78,0.24,6.03,94.0,20.0,94.0,918.0,KCEJ,M,94.0
1546,Castlevania: Symphony of the Night,PS,1997.0,Platform,Konami Digital Entertainment,0.58,0.4,0.21,0.08,1.27,93.0,12.0,94.0,358.0,Konami,T,93.967568
1712,Shenmue,DC,1999.0,Adventure,Sega,0.52,0.24,0.38,0.04,1.18,88.0,9.0,94.0,201.0,Sega AM2,T,93.742857
5585,Harvest Moon: Back to Nature,PS,1999.0,Simulation,Ubisoft,0.11,0.07,0.12,0.02,0.32,82.0,6.0,93.0,78.0,Victor Interactive Software,E,92.214286
65,Final Fantasy VII,PS,1997.0,Role-Playing,Sony Computer Entertainment,3.01,2.47,3.28,0.96,9.72,92.0,20.0,92.0,1282.0,SquareSoft,T,92.0


# HINT

**How to create the Aggregate Score Column?**

\begin{equation*}
AggregateScore = \frac{(CriticCount * CriticScore)+(UserCount * UserScore)}{UserCount + CriticCount}
\end{equation*}

**Check Your Column Values**

The Critic_Score column is scored out of 100. The User_Score column is scored out of 10. You will need to modify one of the columns to match the other.

## Documentation

In the meantime, check out pandas the user guide in the [pandas documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide).

-------
**Why should I use the documentation?**

On the job as a data scientist or data analyst, more often than not, you may find yourself looking up the documentation of a particular function or plugin you use. Don't worry if there are a few functions you don't know by heart. However, there are just too many to know! An essential skill is to learn how to navigate documentation and understand how to apply the examples to your work. 

--------

Additional resources:

- To learn more about these topics, as well as other topics not covered here (e.g. reshaping, merging, additional subsetting methods, working with text data, etc.) check out [these introductory tutorials](https://pandas.pydata.org/docs/getting_started/index.html#getting-started) from the `pandas` documentation
- To learn more about subsetting your data, check out [this tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html#min-tut-03-subset)
- This [pandas cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) may also be helpful as a reference.