# Pandas Basics

This notebook is based on [100 Pandas Puzzles](https://github.com/ajcr/100-pandas-puzzles/tree/master). For more examples, visit the link!

## Importing pandas

**1.** Import pandas under the alias `pd`.

In [1]:
import pandas as pd

**2.** Print the version of pandas that has been imported.

In [2]:
pd.__version__

'2.3.0'

## DataFrame basics

``` python
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
```

**3.** Create a DataFrame `df` from this dictionary `data` which has the index `labels`.

In [4]:
import numpy as np

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(data, index=labels)

**4.** Display a summary of the basic information about this DataFrame and its data (*hint: there is a single method that can be called on the DataFrame*).

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   animal    10 non-null     object 
 1   age       8 non-null      float64
 2   visits    10 non-null     int64  
 3   priority  10 non-null     object 
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes


**5.** Return the first few rows of the DataFrame `df`.

In [7]:
df.head()

,animal,age,visits,priority
a,cat,2.5,1,yes
b,cat,3.0,3,yes
c,snake,0.5,2,no
d,dog,,3,yes
e,dog,5.0,2,no


In [17]:
# You can choose the number of rows to return
df.head(3)

,animal,age,visits,priority
a,cat,2.5,1,yes
b,cat,3.0,3,yes
c,snake,0.5,2,no


**6.** Select just the 'animal' and 'age' columns from the DataFrame `df`.

In [None]:
df[['animal', 'age']]

**7.** Select the data in rows `['d', 'e', 'i']` *and* in columns `['animal', 'age']`.

In [18]:
df.loc[['d', 'e', 'i'], ['animal', 'age']]

,animal,age
d,dog,
e,dog,5.0
i,dog,7.0


**8.** Select the data in rows `[3, 4, 8]` *and* in columns `[0, 1]`.

In [19]:
df.iloc[[3, 4, 8], [0, 1]]

,animal,age
d,dog,
e,dog,5.0
i,dog,7.0
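Puzzles 7 and 8 return the same rows because `.loc` works on labels and `.iloc` on integer positions. One difference that is easy to miss shows up with slices. The sketch below rebuilds the frame so it runs on its own; the names `by_label` and `by_position` are only illustrative.

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data, index=list('abcdefghij'))

# Label slices include the end label: rows d, e and f
by_label = df.loc['d':'f', ['animal', 'age']]

# Position slices exclude the end position: rows at positions 3 and 4 only
by_position = df.iloc[3:5, [0, 1]]

print(by_label)
print(by_position)
```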


**9.** Convert index labels to integer positions, and vice versa.

In [26]:
# index label to index integer position
index_labels = ['d', 'e', 'i']
index_positions = [df.index.get_loc(label) for label in index_labels]
index_positions

[3, 4, 8]

In [25]:
df.iloc[index_positions, [0, 1]]

,animal,age
d,dog,
e,dog,5.0
i,dog,7.0


In [None]:
# index integer position to index label
df.index[[3, 4, 8]]

Index(['d', 'e', 'i'], dtype='object')

In [27]:
df.loc[df.index[[3, 4, 8]], ['animal', 'age']]

,animal,age
d,dog,
e,dog,5.0
i,dog,7.0
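The list comprehension with `get_loc` looks up one label at a time. As a possible alternative, `Index.get_indexer` converts several labels in a single call. This is a minimal sketch that assumes the index labels are unique; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1]},
                  index=list('abcdefghij'))

# Convert several labels to integer positions at once
positions = df.index.get_indexer(['d', 'e', 'i'])
print(positions)

# Unlike get_loc, a label that is not in the index does not raise;
# it is reported as position -1 instead
missing = df.index.get_indexer(['z'])
print(missing)
```

Because a missing label maps to `-1`, it is worth checking for that value before passing the result to `.iloc`, where `-1` would silently select the last row.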


**10.** Convert column labels to integer positions, and vice versa.

In [28]:
# column label to column integer position
col_labels = ['animal', 'age']
col_positions = [df.columns.get_loc(col) for col in col_labels]
col_positions

[0, 1]

In [30]:
df.iloc[[3,4,8], col_positions]

,animal,age
d,dog,
e,dog,5.0
i,dog,7.0


In [29]:
# column integer position to column label
df.columns[[0, 1]]

Index(['animal', 'age'], dtype='object')

In [31]:
df.loc[['d', 'e', 'i'], df.columns[[0, 1]]]

,animal,age
d,dog,
e,dog,5.0
i,dog,7.0


**11.** Select only the rows where the number of visits is greater than 2.

In [34]:
df[df['visits'] > 2]

,animal,age,visits,priority
b,cat,3.0,3,yes
d,dog,,3,yes
f,cat,2.0,3,no


**12.** Select the rows where the age is missing, i.e. it is `NaN`.

In [35]:
df[df['age'].isnull()]

,animal,age,visits,priority
d,dog,,3,yes
h,cat,,1,yes
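The opposite selection, rows where the age is known, can be written in at least two ways. The sketch below rebuilds a two-column version of the frame so it runs on its own; `known_age` and `same_rows` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
                   'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3]},
                  index=list('abcdefghij'))

# Boolean mask: keep rows whose age is not NaN
known_age = df[df['age'].notna()]

# dropna restricted to the 'age' column gives the same rows
same_rows = df.dropna(subset=['age'])

print(known_age)
```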


**13.** Select the rows where the animal is a cat *and* the age is less than 3.

In [36]:
df[(df['animal'] == 'cat') & (df['age'] < 3)]

,animal,age,visits,priority
a,cat,2.5,1,yes
f,cat,2.0,3,no
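The same filter can also be expressed with `DataFrame.query`, which some readers find easier to scan than combined boolean masks. This is only a sketch of an alternative, not the puzzle's intended solution; the frame is rebuilt here (with the original age of 2 in row 'f') so the block runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
                   'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3]},
                  index=list('abcdefghij'))

# Column names are used directly inside the expression string;
# 'and' plays the role of '&' in the boolean-mask version
young_cats = df.query("animal == 'cat' and age < 3")
print(young_cats)
```

Rows with a missing age never satisfy `age < 3`, so row 'h' is excluded in both versions.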


**14.** Select the rows where the age is between 2 and 4 (inclusive).

In [37]:
df[df['age'].between(2, 4)]

,animal,age,visits,priority
a,cat,2.5,1,yes
b,cat,3.0,3,yes
f,cat,2.0,3,no
j,dog,3.0,1,no
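`between` includes both endpoints by default, which is why row 'f' (age 2) appears above. In recent pandas versions (1.3 or later, to the best of my knowledge) the `inclusive` parameter accepts `'both'`, `'neither'`, `'left'` or `'right'`. A small sketch, with the frame rebuilt so it runs on its own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3]},
                  index=list('abcdefghij'))

# Exclude both endpoints: ages strictly greater than 2 and strictly less than 4
strict = df[df['age'].between(2, 4, inclusive='neither')]
print(strict)
```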


**15.** Change the age in row 'f' to 1.5.

In [38]:
df.loc['f', 'age'] = 1.5

**16.** Calculate the sum of all visits in `df` (i.e. the total number of visits).

In [39]:
df['visits'].sum()

np.int64(19)

**17.** Calculate the mean age for each different animal in `df`.

In [40]:
df.groupby('animal')['age'].mean()

animal
cat      2.333333
dog      5.000000
snake    2.500000
Name: age, dtype: float64
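The means above skip missing ages rather than treating them as zero. One way to make that visible is to report the number of non-missing ages next to each mean. The sketch below uses named aggregation; the output column names `mean_age` and `known_ages` are made up for this example, and the frame is rebuilt (including the change to row 'f') so the block runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
                   'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3]},
                  index=list('abcdefghij'))
df.loc['f', 'age'] = 1.5  # same update as in puzzle 15

# 'mean' ignores NaN; 'count' reports how many ages were actually used
summary = df.groupby('animal')['age'].agg(mean_age='mean', known_ages='count')
print(summary)
```

There are four cats and four dogs in the data, but only three of each have a known age, so each of those means is taken over three values.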

**18.** Append a new row 'k' to `df` with your choice of values for each column.

In [44]:
# The values must follow the column order: animal, age, visits, priority
df.loc['k'] = ['dog', 5.5, 2, 'no']
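Assigning a plain list relies on the values being given in column order. A sketch of an alternative that names every column explicitly is to build a one-row DataFrame and combine it with `pd.concat`; `new_row` and `extended` are illustrative names, and the frame is rebuilt so the block runs on its own.

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data, index=list('abcdefghij'))

# Each value is tied to its column by name, so ordering mistakes cannot happen
new_row = pd.DataFrame({'animal': ['dog'], 'age': [5.5], 'visits': [2], 'priority': ['no']},
                       index=['k'])

# concat returns a new DataFrame and leaves df unchanged
extended = pd.concat([df, new_row])
print(extended.tail(2))
```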

**19.** Delete the row you appended in the previous step and return the original DataFrame.

In [45]:
df.drop('k')

,animal,age,visits,priority
a,cat,2.5,1,yes
b,cat,3.0,3,yes
c,snake,0.5,2,no
d,dog,,3,yes
e,dog,5.0,2,no
f,cat,1.5,3,no
g,snake,4.5,1,no
h,cat,,1,yes
i,dog,7.0,2,no
j,dog,3.0,1,no


In [None]:
# If you want to save it back to the original DataFrame
df = df.drop('k')

**20.** Count the number of each type of animal in `df`.

In [None]:
df['animal'].value_counts()
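`value_counts` can also report proportions instead of raw counts through its `normalize` parameter. A short sketch, using only the 'animal' column so the block runs on its own:

```python
import pandas as pd

animals = pd.Series(['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
                    index=list('abcdefghij'), name='animal')

# Raw counts per animal
counts = animals.value_counts()

# Share of each animal; the values sum to 1
shares = animals.value_counts(normalize=True)

print(counts)
print(shares)
```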

**21.** Sort `df` first by the values in the 'age' column in *descending* order, then by the values in the 'visits' column in *ascending* order (so row `i` should be first, and row `d` should be last).

In [None]:
df.sort_values(by=['age', 'visits'], ascending=[False, True])
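Row `d` ends up last because `sort_values` places missing ages at the end by default, whatever the sort direction. The `na_position` parameter changes that. A sketch, with the relevant columns rebuilt so the block runs on its own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
                   'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1]},
                  index=list('abcdefghij'))
df.loc['f', 'age'] = 1.5  # same update as in puzzle 15

# Default behaviour: rows with a missing age are placed last
ordered = df.sort_values(by=['age', 'visits'], ascending=[False, True])

# na_position='first' moves the rows with a missing age to the top
nan_first = df.sort_values(by=['age', 'visits'], ascending=[False, True],
                           na_position='first')

print(ordered)
print(nan_first)
```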

**22.** The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be `True` and 'no' should be `False`.

In [None]:
df['priority'] = df['priority'].map({'yes': True, 'no': False})
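One caveat with `map` and a dictionary: any value that is not a key in the dictionary becomes `NaN`. When the column really holds only 'yes' and 'no', a direct comparison is a possible alternative that always yields a boolean column. The value `'maybe'` below is invented purely to show the caveat.

```python
import pandas as pd

priority = pd.Series(['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
                     index=list('abcdefghij'))

# Comparison gives a bool Series: True for 'yes', False for anything else
as_bool = priority == 'yes'
print(as_bool.dtype)

# With map, a value missing from the dictionary turns into NaN
mapped = pd.Series(['yes', 'no', 'maybe']).map({'yes': True, 'no': False})
print(mapped)
```

The trade-off is that the comparison silently treats unexpected values as `False`, while `map` surfaces them as missing.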

## Using MultiIndex

Consider the following dataframe:

In [78]:
idf = pd.DataFrame({"letters": ["A", "A", "B", "B", "C", "C"],
                     "numbers": [0, 1, 0, 1, 0, 1],
                     "col1": ["a", "b", "c", "d", "e", "f"],
                     "col2": [1, 2, 3, 4, 5, 6]})
idf

,letters,numbers,col1,col2
0,A,0,a,1
1,A,1,b,2
2,B,0,c,3
3,B,1,d,4
4,C,0,e,5
5,C,1,f,6


**23.** Create a DataFrame with the MultiIndex `('letters', 'numbers')` from `idf`.

In [79]:
midf = idf.set_index(['letters', 'numbers'])
midf

letters,numbers,col1,col2
A,0,a,1
A,1,b,2
B,0,c,3
B,1,d,4
C,0,e,5
C,1,f,6


In this DataFrame, `['A', 'B', 'C']` are the values of the level-0 index, and `[0, 1]` are the values of the level-1 index. You can access the names of the index levels with:

In [89]:
midf.index.names

FrozenList(['letters', 'numbers'])

**24.** Select all rows where the level-0 index is 'B'.

In [80]:
midf.loc['B']

numbers,col1,col2
0,c,3
1,d,4


In [81]:
midf[midf.index.get_level_values(0) == 'B']

letters,numbers,col1,col2
B,0,c,3
B,1,d,4


In [82]:
midf[midf.index.get_level_values('letters') == 'B']

letters,numbers,col1,col2
B,0,c,3
B,1,d,4


**25.** Select all rows where the level-1 index is 0.

In [83]:
midf.loc[(slice(None), 0), :]

letters,numbers,col1,col2
A,0,a,1
B,0,c,3
C,0,e,5


In [84]:
midf[midf.index.get_level_values(1) == 0]

letters,numbers,col1,col2
A,0,a,1
B,0,c,3
C,0,e,5


In [86]:
midf[midf.index.get_level_values('numbers') == 0]

letters,numbers,col1,col2
A,0,a,1
B,0,c,3
C,0,e,5
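Two further ways to select on an inner level are `DataFrame.xs` and `pd.IndexSlice`. The sketch below rebuilds `midf` so it runs on its own; `dropped`, `kept` and `sliced` are illustrative names.

```python
import pandas as pd

idf = pd.DataFrame({"letters": ["A", "A", "B", "B", "C", "C"],
                    "numbers": [0, 1, 0, 1, 0, 1],
                    "col1": ["a", "b", "c", "d", "e", "f"],
                    "col2": [1, 2, 3, 4, 5, 6]})
midf = idf.set_index(['letters', 'numbers'])

# xs takes a cross-section; by default the selected level is removed from the index
dropped = midf.xs(0, level='numbers')

# drop_level=False keeps both index levels in the result
kept = midf.xs(0, level='numbers', drop_level=False)

# IndexSlice is a more readable spelling of (slice(None), 0)
idx = pd.IndexSlice
sliced = midf.loc[idx[:, 0], :]

print(dropped)
print(kept)
```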


**26.** Select all rows where the level-0 index is either 'A' or 'C'.

In [88]:
midf.loc[midf.index.get_level_values(0).isin(['A', 'C'])]

letters,numbers,col1,col2
A,0,a,1
A,1,b,2
C,0,e,5
C,1,f,6


In [87]:
midf.loc[midf.index.get_level_values('letters').isin(['A', 'C'])]

letters,numbers,col1,col2
A,0,a,1
A,1,b,2
C,0,e,5
C,1,f,6
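Because the condition is on the outermost level, passing a list of labels straight to `.loc` is a shorter alternative. A sketch, with `midf` rebuilt so the block runs on its own:

```python
import pandas as pd

idf = pd.DataFrame({"letters": ["A", "A", "B", "B", "C", "C"],
                    "numbers": [0, 1, 0, 1, 0, 1],
                    "col1": ["a", "b", "c", "d", "e", "f"],
                    "col2": [1, 2, 3, 4, 5, 6]})
midf = idf.set_index(['letters', 'numbers'])

# A list of labels given to .loc is matched against the level-0 index
subset = midf.loc[['A', 'C']]
print(subset)
```

This shortcut only applies to level 0; for inner levels the `get_level_values(...).isin(...)` form shown above is more general.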


**27.** Compute the sum of col2 grouped by the level-1 index.

In [98]:
midf.groupby(level=1)['col2'].sum()

numbers
0     9
1    12
Name: col2, dtype: int64
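Index levels can be referred to by name as well as by number when grouping, which tends to read more clearly. A sketch, with `midf` rebuilt so the block runs on its own:

```python
import pandas as pd

idf = pd.DataFrame({"letters": ["A", "A", "B", "B", "C", "C"],
                    "numbers": [0, 1, 0, 1, 0, 1],
                    "col1": ["a", "b", "c", "d", "e", "f"],
                    "col2": [1, 2, 3, 4, 5, 6]})
midf = idf.set_index(['letters', 'numbers'])

# Same grouping as level=1, written with the level name
by_number = midf.groupby(level='numbers')['col2'].sum()

# A named index level can also be passed like a column key
by_letter = midf.groupby('letters')['col2'].sum()

print(by_number)
print(by_letter)
```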

**28.** Find the full index of the row where col1 is 'e'.

In [99]:
midf[midf['col1'] == 'e'].index

MultiIndex([('C', 0)],
           names=['letters', 'numbers'])

**29.** Reset the index so that the MultiIndex levels become regular columns.

In [100]:
midf.reset_index()

,letters,numbers,col1,col2
0,A,0,a,1
1,A,1,b,2
2,B,0,c,3
3,B,1,d,4
4,C,0,e,5
5,C,1,f,6
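`reset_index` does not have to move every level. Its `level` parameter turns only the chosen level into a column and keeps the rest as the index. A sketch, with `midf` rebuilt so the block runs on its own:

```python
import pandas as pd

idf = pd.DataFrame({"letters": ["A", "A", "B", "B", "C", "C"],
                    "numbers": [0, 1, 0, 1, 0, 1],
                    "col1": ["a", "b", "c", "d", "e", "f"],
                    "col2": [1, 2, 3, 4, 5, 6]})
midf = idf.set_index(['letters', 'numbers'])

# Only 'numbers' becomes a column; 'letters' stays as the index
partial = midf.reset_index(level='numbers')
print(partial)
```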


**30.** Given the lists `letters = ['A', 'B', 'C']` and `numbers = list(range(2))`, construct a MultiIndex object from the product of the two lists.

In [101]:
letters = ['A', 'B', 'C']
numbers = list(range(2))

mi = pd.MultiIndex.from_product([letters, numbers])
mi

MultiIndex([('A', 0),
            ('A', 1),
            ('B', 0),
            ('B', 1),
            ('C', 0),
            ('C', 1)],
           )

In [102]:
# We can construct a DataFrame from this MultiIndex
midf2 = pd.DataFrame({"col1": ["a", "b", "c", "d", "e", "f"],
                     "col2": [1, 2, 3, 4, 5, 6]}, index=mi)
midf2

,,col1,col2
A,0,a,1
A,1,b,2
B,0,c,3
B,1,d,4
C,0,e,5
C,1,f,6


In [103]:
# Set the names of the index levels
midf2.index.names = ['letters', 'numbers']
midf2

letters,numbers,col1,col2
A,0,a,1
A,1,b,2
B,0,c,3
B,1,d,4
C,0,e,5
C,1,f,6
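The level names can also be supplied when the MultiIndex is created, through the `names` parameter of `from_product`, which avoids the separate assignment step. A sketch; `named_mi` and `named_df` are illustrative names.

```python
import pandas as pd

letters = ['A', 'B', 'C']
numbers = list(range(2))

# Name the levels at construction time
named_mi = pd.MultiIndex.from_product([letters, numbers], names=['letters', 'numbers'])

named_df = pd.DataFrame({"col1": ["a", "b", "c", "d", "e", "f"],
                         "col2": [1, 2, 3, 4, 5, 6]}, index=named_mi)
print(named_df)
```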
