# Quick Pandas Go-To reference

This notebook is intended as a go-to reference for those Pandas commands that are used often but still need to be looked up every time.

## Data to Hero

We are going to use a [dataset](https://www.kaggle.com/mbogernetto/women-in-nobel-prize-19012019?select=nobel_prize_awarded_women_details_1901_2019.csv) that contains the people who have won the Nobel Prize from 1901 to 2019.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Basic Operations

First, we need to read the data from the .csv file. We are nothing without the data, so let's populate our dataframe with the dataset mentioned above. In this case we have a `csv`, but we could also read from a lot of other formats.
> For more information about I/O using Pandas, click [here](https://pandas.pydata.org/pandas-docs/stable/reference/io.html).

In [None]:
df = pd.read_csv('../input/women-in-nobel-prize-19012019/nobel_prize_awarded_1901_2019.csv')
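
As a small illustration of the point about other formats, here is a minimal sketch that reads the same two rows once as CSV and once as JSON. The in-memory strings are hypothetical stand-ins for real files; they are not part of the Kaggle dataset.

```python
import io

import pandas as pd

# Hypothetical in-memory text standing in for real files on disk.
csv_text = "Name,Year\nMarie Curie,1903\nMarie Curie,1911\n"
json_text = (
    '[{"Name": "Marie Curie", "Year": 1903},'
    ' {"Name": "Marie Curie", "Year": 1911}]'
)

# Each reader returns a DataFrame, whatever the source format.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

Both calls produce a two-row frame with the columns `Name` and `Year`, so the rest of a notebook does not need to care where the data came from.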

To take a look at what the data offers we will use the method [head](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html?highlight=head#pandas.DataFrame.head). We can specify the number of rows we would like to see.

In [None]:
df.head(10)

The columns `Description` and `Details` do not give a lot of information, at least in the first rows. We will not need them for now anyway, so let's keep just the "interesting" ones:

In [None]:
df = df[['Name', 'Year', 'Category', 'Countries', 'Gender']]
df.head(10)
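
The selection above passes a list of column names. As a sketch on a hypothetical toy frame (the `Details` values are placeholders), the same result can also be stated as "drop what I don't want", and a single label without the inner list returns a Series instead of a DataFrame:

```python
import pandas as pd

# Hypothetical toy frame; the "Details" text is a placeholder.
toy = pd.DataFrame(
    {
        "Name": ["Marie Curie", "Albert Einstein"],
        "Year": [1903, 1921],
        "Details": ["placeholder", "placeholder"],
    }
)

kept = toy[["Name", "Year"]]             # list of labels -> DataFrame
dropped = toy.drop(columns=["Details"])  # same result, stated as what to remove
names = toy["Name"]                      # single label -> Series

print(kept.equals(dropped))
print(type(names).__name__)
```

Which form reads better usually depends on whether the list of columns to keep or the list to remove is shorter.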

Let's check the last element of the DataFrame, so that we can confirm later that we added one row successfully:

In [None]:
df.iloc[-1]

Let's add a new element to our Dataframe:

In [None]:
new_row = pd.DataFrame([{'Name': 'Cecilia Merelo', 'Year': 2030, 'Category': 'Life', 'Countries': 'Spain', 'Gender': 'Woman'}])
df = pd.concat([df, new_row], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
df.iloc[-1]
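
Another way to add a single row, sketched here on a hypothetical toy frame, is to assign to the next free label with `.loc`. This assumes the frame still has the default 0, 1, ..., n-1 index; with any other index, `len(toy)` is not guaranteed to be a free label.

```python
import pandas as pd

# Hypothetical toy data with the default integer index.
toy = pd.DataFrame({"Name": ["A", "B"], "Year": [2001, 2002]})

# len(toy) is the next free label only because the index is 0, 1, ..., n-1.
toy.loc[len(toy)] = ["C", 2003]

print(toy.iloc[-1])
```

The assignment modifies `toy` in place, so there is no need to rebind the variable.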

Now we will delete it; we don't want any data that is not real (yet).

In [None]:
df = df[df.Name != 'Cecilia Merelo']
df.iloc[-1]
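
One detail worth knowing about deleting rows with a boolean mask: the surviving rows keep their original index labels. The sketch below uses a hypothetical toy frame to show the gap this leaves and how `reset_index(drop=True)` renumbers the rows if that matters.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"Name": ["A", "B", "C"], "Year": [2001, 2002, 2003]})

# Keep every row whose Name is not "B"; labels 0 and 2 survive.
filtered = toy[toy["Name"] != "B"]

# Renumber from zero and discard the old labels.
renumbered = filtered.reset_index(drop=True)

print(filtered.index.tolist())
print(renumbered.index.tolist())
```

Positional access such as `iloc[-1]` is not affected by the gap, which is why the check above still returns the last remaining row.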

So these are the basic operations that I use most. We also now have an overview of the data we are going to work with.

### Visualization

First we want to see how many women have won this prize compared to men. For this we count the rows that have 'Woman' in the `Gender` column, and then the rows that have 'Man':

In [None]:
df.loc[df['Gender'] == 'Woman'].count()


In [None]:
df.loc[df['Gender'] == 'Man'].count()
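
The two cells above count one gender at a time, and `count()` reports the non-null values of every column. As a sketch on hypothetical toy data, `value_counts()` gives one number per distinct value in a single call, and summing a boolean mask counts the matching rows directly:

```python
import pandas as pd

# Hypothetical toy data, not the real totals.
toy = pd.DataFrame({"Gender": ["Man", "Woman", "Man", "Man"]})

# One count per distinct value of the column.
counts = toy["Gender"].value_counts()

# True counts as 1, so the sum is the number of matching rows.
women = (toy["Gender"] == "Woman").sum()

print(counts)
print(women)
```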


54 women have won this prize vs. 866 men. Wow. Now let's see in which categories they won them. Let's get the women's rows:

In [None]:
df_women = df[df.Gender == 'Woman']
df_women.head()

The `df_women` dataframe holds the women who have won the prize; what we need now are the categories in which they won. For that we will use the `groupby` method, which groups the rows by the column we choose. As we need the number of women in each category, we group by `Category` and count the rows in each group, which is why we call `count()`.

In [None]:
df_women_category = df_women.groupby(['Category'])['Category'].count()
df_women_category
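
A caveat when counting groups, sketched on hypothetical toy data: `count()` only counts non-null values of the selected column, while `size()` counts rows. They agree when the counted column has no missing values, as with `Category` above, and differ otherwise.

```python
import pandas as pd

# Hypothetical toy data with one missing Name.
toy = pd.DataFrame(
    {
        "Category": ["Physics", "Chemistry", "Physics"],
        "Name": ["a", None, "c"],
    }
)

# Non-null Name values per category (the missing one is skipped).
non_null_names = toy.groupby("Category")["Name"].count()

# Number of rows per category, regardless of missing values.
rows_per_group = toy.groupby("Category").size()

print(non_null_names)
print(rows_per_group)
```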

In [None]:
ax = df_women_category.plot.bar()  # a Series plots its index (Category) on the x axis
ax.set_ylabel('Amount')

Now let's compare these values with the categories that men have won. For that we need to count the prizes per category for each gender:

In [None]:
df_categories = df.groupby(['Category','Gender'], as_index=False).count()
df_categories

The `Name`, `Year` and `Countries` columns look a bit confusing: the information they hold does not correspond to the column names. What they really show is the number of people who have won the prize. So let's rename one column and keep only that one, so that we have more accurate information.

In [None]:
change = {'Year': 'Amount'}
df_categories = df.rename(columns=change).groupby(['Category','Gender'], as_index=False)['Amount'].count()
df_categories
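
An alternative to renaming a column before grouping is pandas' named aggregation, where the output column name is given directly in `agg`. This is a sketch on hypothetical toy data, not the notebook's own method:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {
        "Category": ["Physics", "Physics", "Chemistry"],
        "Gender": ["Man", "Woman", "Man"],
        "Name": ["a", "b", "c"],
    }
)

# Amount=("Name", "count") means: new column "Amount" = count of "Name" per group.
amounts = toy.groupby(["Category", "Gender"], as_index=False).agg(
    Amount=("Name", "count")
)

print(amounts)
```

The result has exactly the columns `Category`, `Gender` and `Amount`, so no extra selection step is needed.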

We can see that each category is repeated, once for each gender value. So let's rearrange the dataframe so that the Gender values become column titles.

In [None]:
df_categories_pivotted = df_categories.pivot(index=['Category'], columns=['Gender'], values='Amount').reindex(['Man', 'Woman'], axis=1)
df_categories_pivotted
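
To make the reshaping step easier to follow, here is `pivot` on a hypothetical four-row table (the amounts are made up). Each distinct `Gender` value becomes a column, each distinct `Category` becomes an index label, and `Amount` fills the cells:

```python
import pandas as pd

# Hypothetical long-format table; the amounts are invented for illustration.
long = pd.DataFrame(
    {
        "Category": ["Chemistry", "Chemistry", "Physics", "Physics"],
        "Gender": ["Man", "Woman", "Man", "Woman"],
        "Amount": [3, 1, 5, 2],
    }
)

# One row per Category, one column per Gender.
wide = long.pivot(index="Category", columns="Gender", values="Amount")

print(wide)
print(wide.columns.name)
```

Note that the column axis keeps the name `Gender` and `Category` ends up in the index rather than as a regular column.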

In [None]:
df_categories_pivotted.columns.values

In this dataframe the column values are only the genders; `Category` is the index, so it cannot be passed to the plot by column name. So let's turn `Category` into a regular column:

In [None]:
df_categories_pivotted.columns.name = None #remove gender
df_categories_pivotted = df_categories_pivotted.reset_index() #index to columns 
df_categories_pivotted = df_categories_pivotted[['Category', 'Man', 'Woman']]
df_categories_pivotted
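
For comparison, `pd.crosstab` can build the same kind of Category-by-Gender table directly from the raw rows, without the intermediate `groupby` and `pivot`. This is a swapped-in technique shown on hypothetical toy data, followed by the same flattening steps used above:

```python
import pandas as pd

# Hypothetical raw rows, one per prize winner.
raw = pd.DataFrame(
    {
        "Category": ["Physics", "Physics", "Physics", "Chemistry"],
        "Gender": ["Man", "Man", "Woman", "Man"],
    }
)

# Count rows for every (Category, Gender) combination; missing pairs become 0.
table = pd.crosstab(raw["Category"], raw["Gender"])

# Same flattening as in the notebook: drop the axis name, move the index to a column.
table.columns.name = None
flat = table.reset_index()

print(flat)
```

Unlike `pivot`, which leaves missing combinations as NaN, `crosstab` fills them with zero counts.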

In [None]:
df_categories_pivotted.columns.values

In [None]:
ax = df_categories_pivotted.plot.bar(x='Category')

YAY! We have a plot that shows the deep gender gap among Nobel Prize winners. Readers can draw their own conclusions.