# Pandas Overview

[Pandas](https://pandas.pydata.org/) is one of the most widely used Python libraries in data science. In this lab, you will learn commonly used data wrangling operations/tools in Pandas. We aim to give you familiarity with:

* Creating dataframes
* Slicing data frames (i.e. selecting rows and columns)
* Filtering data (using boolean arrays)

In this lab you are going to use several pandas methods, such as `drop` and `loc`.


<h3> Imports </h3>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

%matplotlib inline

## 1. Creating DataFrames & Basic Manipulations

A [dataframe](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) is a table in which each column has a type; there is an index over the columns (typically string labels) and an index over the rows (typically ordinal numbers).

The [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) for the pandas `DataFrame` class provides at least two syntaxes to create a data frame.

**Syntax 1:** You can create a data frame by specifying the columns and values using a dictionary as shown below. 

The keys of the dictionary are the column names, and the values of the dictionary are lists containing the row entries.

In [None]:
fruit_info = pd.DataFrame(
    data = {'fruit': ['apple', 'orange', 'banana', 'raspberry'],
            'color': ['red', 'orange', 'yellow', 'pink']})
fruit_info

**Syntax 2:** You can also define a dataframe by specifying the rows, as shown below.

Each row corresponds to a distinct tuple, and the columns are specified separately.

In [None]:
fruit_info2 = pd.DataFrame(
    [("red", "apple"), ("orange", "orange"), ("yellow", "banana"),
     ("pink", "raspberry")], 
    columns = ["color", "fruit"])
fruit_info2

You can obtain the dimensions of a dataframe by using the shape attribute `dataframe.shape`.

In [None]:
fruit_info.shape

You can also convert the entire dataframe into a two-dimensional numpy array.

In [None]:
fruit_info.values
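
As a small aside, pandas also offers `DataFrame.to_numpy()`, which returns the same kind of two-dimensional array as `.values`. The sketch below rebuilds the fruit table so that it runs on its own.

```python
import pandas as pd

# Stand-alone copy of the fruit table from above.
fruit_info = pd.DataFrame({
    'fruit': ['apple', 'orange', 'banana', 'raspberry'],
    'color': ['red', 'orange', 'yellow', 'pink'],
})

as_array = fruit_info.to_numpy()  # same data as fruit_info.values
print(as_array.shape)             # (number of rows, number of columns)
print(as_array[0])                # first row as a 1-D array
```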

**Example 1.1.** For a DataFrame `d`, you can add a column with `d['new column name'] = ...` and assign a list or array of values to the column. Add a column of integers containing 1, 2, 3, and 4 called `rank1` to the `fruit_info` table which expresses your personal preference about the taste ordering for each fruit (1 is tastiest; 4 is least tasty). 

In [None]:
fruit_info['rank1'] = [2, 3, 1, 4]
fruit_info

**Example 1.2.** You can also add a column to `d` with `d.loc[:, 'new column name'] = ...`. As discussed in the lesson, the first parameter is for the rows and second is for columns. The `:` means change all rows and the `new column name` indicates the column you are modifying (or in this case, adding). 

Add a column called `rank2` to the `fruit_info` table which contains the same values in the same order as the `rank1` column.

In [None]:
fruit_info.loc[:, 'rank2'] =  [2, 3, 1, 4]
fruit_info

**Example 1.3.** Use the `.drop()` method to [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) both the `rank1` and `rank2` columns you created (make sure to use the `axis` parameter correctly).

**Note:** `drop` does not change a table, but instead returns a new table with fewer columns or rows unless you set the optional `inplace` parameter.

**Hint:** Look through the documentation to see how you can drop multiple columns of a Pandas dataframe at once using a list of column names.


In [None]:
fruit_info_original = fruit_info.drop(['rank1', 'rank2'], axis=1)
fruit_info_original
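
To see the note about `drop` in action, here is a minimal sketch on a shortened, stand-alone copy of the fruit table. It uses the `columns=` keyword, which is equivalent to passing the labels together with `axis=1`.

```python
import pandas as pd

# Shortened stand-alone copy of the fruit table with one rank column.
fruit_info = pd.DataFrame({
    'fruit': ['apple', 'orange'],
    'color': ['red', 'orange'],
    'rank1': [2, 3],
})

# drop returns a new dataframe; fruit_info itself is left untouched.
dropped = fruit_info.drop(columns=['rank1'])

print(list(fruit_info.columns))  # still contains 'rank1'
print(list(dropped.columns))     # 'rank1' is gone
```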

**Example 1.4.** Use the `.rename()` method to [rename](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html) the columns of `fruit_info_original` so they begin with capital letters. Set this new dataframe to `fruit_info_caps`.


In [None]:
fruit_info_caps = fruit_info_original.rename(str.capitalize, axis=1)
fruit_info_caps
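
Passing a function such as `str.capitalize` is one option. As an alternative sketch, `rename` also accepts a dictionary that maps old column names to new ones, which is handy when only some columns change. The table below is a shortened stand-alone copy.

```python
import pandas as pd

fruit_info_original = pd.DataFrame({
    'fruit': ['apple', 'orange'],
    'color': ['red', 'orange'],
})

# Map each old column name to its new name explicitly.
fruit_info_caps = fruit_info_original.rename(
    columns={'fruit': 'Fruit', 'color': 'Color'})

print(list(fruit_info_caps.columns))
```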

## 2. Babyname dataset

Now that we have learned the basics, let's move on to the babynames dataset. The babynames dataset contains a record of the given names of babies born in the United States each year.

First, let's run the following cells to build the dataframe `baby_names`. The first cell reads the data from a CSV file into a dataframe. There should be a total of 890627 records.

In [None]:
baby_names = pd.read_csv('../input/baby-names/baby_names.csv', index_col = 0)
len(baby_names)

In [None]:
baby_names.head()

## 3. Slicing Data Frames - Selecting rows and columns

### Selection Using Label/Index (using loc)

#### Column Selection 

To select a column of a `DataFrame` by column label, an explicit and reliable way is to use the `.loc` [accessor](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html). General usage of `.loc` looks like `df.loc[rowname, colname]`. (Recall that the colon `:` means "everything.") For example, if we want the `color` column of the `fruit_info` data frame, we would use `fruit_info.loc[:, 'color']`.

- You can also slice across columns. For example, `baby_names.loc[:, 'Name':]` would select the column `Name` and all columns after `Name`.

- **Alternative:** While `.loc` is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is the `[]` operator, which takes the form `df['colname']`.
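
The column-selection options above can be sketched on a small made-up table. The rows and the column order here are hypothetical and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only.
toy = pd.DataFrame({
    'State': ['NC', 'NC', 'CA'],
    'Sex': ['F', 'M', 'F'],
    'Year': [2018, 2019, 2019],
    'Name': ['Ava', 'Liam', 'Mia'],
    'Count': [510, 620, 2300],
})

names_loc = toy.loc[:, 'Name']    # all rows, one column -> Series
names_brackets = toy['Name']      # shorter form, same Series
from_year = toy.loc[:, 'Year':]   # 'Year' and every column after it

print(names_loc.equals(names_brackets))
print(list(from_year.columns))
```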

#### Row Selection

Similarly, if we want to select a row by its label, we can use the same `.loc` accessor. In this case, the "label" of each row refers to the index (i.e. primary key) of the dataframe.

**Example 3.1.**

In [None]:
baby_names.loc[2:5, 'Name']

**Example 3.2.** Notice the difference between this selection and the one in **Example 3.1**.

Passing in just `'Name'` returns a Series, while `['Name']` returns a DataFrame.

In [None]:
baby_names.loc[2:5, ['Name']]

**Note:** `.loc` actually uses the Pandas row index rather than row id/position of rows in the dataframe to perform the selection. Also, notice that if you write `2:5` with `loc[]`, contrary to normal Python slicing functionality, the end index is included, so you get the row with index 5.
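
A minimal sketch of both points, using a made-up table with the default 0, 1, 2, ... index (the names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['Ava', 'Liam', 'Mia', 'Noah', 'Emma', 'Olivia']})

as_series = toy.loc[2:5, 'Name']     # labels 2, 3, 4 AND 5 -> Series
as_frame = toy.loc[2:5, ['Name']]    # same rows, but a one-column DataFrame
python_slice = list(toy['Name'])[2:5]  # plain Python slice stops before 5

print(len(as_series), len(python_slice))
print(type(as_series).__name__, type(as_frame).__name__)
```

The `loc` slice returns one more row than the plain list slice because its end label is included.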

#### Selection using Integer location (using iloc)

In the lesson we discussed another pandas feature `iloc[]` which lets you slice the dataframe by row position and column position instead of by row index and column label (which is the case for `loc[]`). This is really the main difference between the two functions and it is **important** that you remember the difference and why you might want to use one over the other. In addition, with `iloc[]`, the end index is **not** included, like with normal Python slicing.

**Note:** As a mnemonic, remember that the i in `iloc` means "integer". 

Below, we have sorted the `baby_names` dataframe. Notice how the **position** of a row is not necessarily equal to the **index** of a row. For example, the first row is not necessarily the row associated with index 0. This distinction is important in understanding the difference between `loc[]` and `iloc[]`.

In [None]:
sorted_baby_names = baby_names.sort_values(by = ['Name'])
sorted_baby_names.head()

**Example 3.3.** Here is an example of how we would get the 2nd, 3rd, and 4th rows with only the `Name` column of the `baby_names` dataframe using both `iloc[]` and `loc[]`. Observe the difference, especially after sorting `baby_names` by name.

In [None]:
sorted_baby_names.iloc[1:4, 3]

Notice that using `loc[]` with 1:4 gives different results, since it selects using the **index**.

In [None]:
sorted_baby_names.loc[1:4, 'Name']
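
The same effect is easier to see on a tiny made-up table. In this sketch (hypothetical names), sorting moves the rows but each row keeps its original index label.

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['Mia', 'Ava', 'Zoe', 'Liam']})  # index 0..3
sorted_toy = toy.sort_values(by='Name')

print(list(sorted_toy.index))  # labels travel with their rows

first_by_position = sorted_toy.iloc[0, 0]    # row in position 0 after sorting
row_labelled_zero = sorted_toy.loc[0, 'Name']  # row whose index label is 0

print(first_by_position, row_labelled_zero)
```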

**Example 3.4.** Lastly, we can change the index of a dataframe using the `set_index` method. We change the index from 0,1,2,... to the `Name` column.

In [None]:
df = baby_names[:5].set_index("Name") 
df

**Example 3.5.** However, if we still want to access rows by location we will need to use the integer location (`iloc`) accessor.

**Note:** We can't use `df.loc[2:5, 'Year']` here, because the index now holds names rather than integers.

In [None]:
df.iloc[1:4, 2:3]
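
Here is a small sketch of what still works once the index holds names. The table is made up, and the `try`/`except` is only there to show that an integer label slice is rejected on a string index.

```python
import pandas as pd

# Hypothetical rows, indexed by name as in Example 3.4.
df = pd.DataFrame({
    'Name': ['Ava', 'Liam', 'Mia', 'Noah'],
    'Year': [2016, 2017, 2018, 2019],
    'Count': [510, 620, 480, 700],
}).set_index('Name')

try:
    df.loc[1:3, 'Year']          # integer labels do not exist in this index
    integer_slice_worked = True
except (TypeError, KeyError):
    integer_slice_worked = False

by_label = df.loc['Liam':'Mia', 'Year']  # label slice, both ends included
by_position = df.iloc[1:3, 0]            # positions 1 and 2, column 0 ('Year')

print(integer_slice_worked)
print(by_label.equals(by_position))
```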

**Example 3.6.** Selecting multiple columns is easy.  You just need to supply a list of column names.  Select the `Name` and `Year` **in that order** from the `baby_names` table.


In [None]:
name_and_year = baby_names[["Name", "Year"]]
name_and_year[:5]

**Note:** `.loc[]` can be used to re-order the columns within a dataframe.
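
For instance, passing the column labels to `.loc[]` in a new order returns a dataframe with the columns rearranged. This is a minimal sketch on a made-up table.

```python
import pandas as pd

toy = pd.DataFrame({
    'State': ['NC', 'CA'],
    'Year': [2019, 2019],
    'Name': ['Ava', 'Mia'],
})

# All rows, columns listed in the order we want them.
reordered = toy.loc[:, ['Name', 'Year', 'State']]

print(list(reordered.columns))
```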

## 4. Filtering Data

### Filtering with boolean arrays

Filtering is the process of removing unwanted material.  In your quest for cleaner data, you will undoubtedly filter your data at some point: whether it be for clearing up cases with missing values, for culling out fishy outliers, or for analyzing subgroups of your data set.  Note that compound expressions have to be grouped with parentheses. Example usage looks like `df[df['column name'] < 5]`.

For your reference, some commonly used comparison operators are given below.

Symbol   | Usage      | Meaning 
------   | ---------- | -------------------------------------
$==$     | a == b     | Does a equal b?
$\lt =$  | a <= b     | Is a less than or equal to b?
$\gt =$  | a >= b     | Is a greater than or equal to b?
$\lt$    | a < b      | Is a less than b?
$\gt$    | a > b      | Is a greater than b?
~        | ~p         | Returns negation of p
&#124;   | p &#124; q | p OR q
&        | p & q      | p AND q
^        | p ^ q      | p XOR q (exclusive or)
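
The logical operators in the table can be tried out on a small made-up dataframe. The parentheses in the last line matter: in Python, `&` binds more tightly than `==` and `>=`, so each comparison has to be wrapped.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    'State': ['NC', 'NC', 'CA', 'CA'],
    'Year': [2018, 2019, 2019, 2018],
    'Count': [400, 620, 2300, 90],
})

in_nc = toy['State'] == 'NC'   # boolean Series, one entry per row
big = toy['Count'] >= 500

both = toy[in_nc & big]         # AND: in NC and at least 500
either = toy[in_nc | big]       # OR: in NC, or at least 500, or both
exactly_one = toy[in_nc ^ big]  # XOR: exactly one condition holds
not_nc = toy[~in_nc]            # NOT: everything outside NC

# Same as `both`, written inline with explicit parentheses.
inline = toy[(toy['State'] == 'NC') & (toy['Count'] >= 500)]

print(list(both.index), list(either.index))
print(list(exactly_one.index), list(not_nc.index))
```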

In the following we construct the DataFrame containing only names registered in North Carolina.

In [None]:
nc = baby_names[baby_names['State'] == 'NC']

**Example 4.1.** To count the number of instances of each unique value in a Series, we can use the `value_counts()` method as `df['col_name'].value_counts()`. Count the number of different names for each Year in NC (North Carolina).

**Note:** We are **not** computing the number of babies but instead the number of names (rows in the table) for each year.

In [None]:
num_of_names_per_year = nc['Year'].value_counts()
num_of_names_per_year.head()

**Example 4.2.** Count the number of different names for each gender in NC (North Carolina).

**Note:** An implementation with `groupby` and `.size()` is also possible.

In [None]:
num_of_names_per_gender = nc['Sex'].value_counts()
num_of_names_per_gender
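
The `groupby` alternative mentioned in the note might look like the sketch below, shown on a made-up stand-in for `nc`. Both approaches give the same counts; `value_counts()` orders the result by count, while `groupby(...).size()` orders it by group label.

```python
import pandas as pd

# Hypothetical stand-in for the nc dataframe.
nc_toy = pd.DataFrame({
    'Sex': ['F', 'M', 'F', 'F'],
    'Name': ['Ava', 'Liam', 'Mia', 'Emma'],
})

via_value_counts = nc_toy['Sex'].value_counts()
via_groupby = nc_toy.groupby('Sex').size()

print(via_value_counts['F'], via_groupby['F'])
print(via_value_counts['M'], via_groupby['M'])
```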

**Example 4.3.** Using a boolean array, select the names in Year 2019 (from `baby_names`) that have at least 500 counts and are in NC. Keep all columns from the original `baby_names` dataframe.

**Hint:** Any time you use `p & q` to filter the dataframe, make sure to use `df[(df[p]) & (df[q])]` or `df.loc[(df[p]) & (df[q])]`. That is, make sure to wrap conditions with parentheses.

**Note:** Both slicing and `loc` achieve the same result here; `loc` is simply the more explicit choice for production code. You are free to use whichever one you would like.

In [None]:
# Build the mask from nc itself so it lines up with the rows being filtered
result = nc.loc[(nc['Year'] == 2019) & (nc['Count'] >= 500)]
result.head()

In [None]:
all(i >= 500 for i in result['Count'].values)

**Example 4.4.** Some names gain or lose popularity because of cultural phenomena, such as a political figure coming to power or a successful athlete or entertainer during the prime years of their career.

Below, we plot the popularity of the name Jordan in North Carolina over time. What do you notice about this plot? What might be the cause of the steep drop?

In [None]:
name = 'Jordan'
state = 'NC'

# Rows for the chosen name and state, split by sex
male_baby = baby_names[(baby_names['Name'] == name) & (baby_names['State'] == state) & (baby_names['Sex'] == 'M')]
female_baby = baby_names[(baby_names['Name'] == name) & (baby_names['State'] == state) & (baby_names['Sex'] == 'F')]

plt.plot(male_baby['Year'], male_baby['Count'], 'b', label = 'Male')
plt.plot(female_baby['Year'], female_baby['Count'], 'r', label = 'Female')
plt.title(f'Popularity of {name} Over Time')
plt.xticks(np.arange(1940, 2020, step = 10))
plt.xlabel('Year')
plt.ylabel('Count')
plt.legend()
plt.show()

And that's it. I hope this Pandas overview was helpful :)