# Week 2 — Introduction to Pandas

**In this session we will cover Pandas**, a powerful data manipulation library in Python, through many exercises.

This practical session is designed to prepare you for Data Analysis and Data Wrangling tasks that you may encounter in your Data Science projects.

**Roadmap**
1. Why Pandas? What is Tabular Data?
    - Creating DataFrames from Scratch
    - Exploring DataFrames
    - Checking indexes and columns
    - Renaming Columns
2. Selecting and Filtering Data
    - Multiple conditions
    - Filtering using `.loc[]`
    - Filtering using `.iloc[]`
3. Updating DataFrames
    - Adding New Data
    - Casting Data Types
4. Reading and Exporting DataFrames
5. Datasets



## 1. Why Pandas? What is Tabular Data?

**Pandas** is the standard Python library for manipulating tabular data (rows × columns). It provides the `DataFrame` (table) and `Series` (column) abstractions, efficient I/O (CSV, Excel, SQL, JSON), and a rich API for cleaning, transforming, and analyzing datasets.

**Tabular data**: each row is an observation (e.g., a person or transaction), and each column is an attribute (e.g., age, city, price). Pandas lets us read, inspect, filter, modify, and export this data efficiently.


Let's get started by installing Pandas!

In [None]:
!pip install pandas

Now you can import the library:

In [1]:
import pandas as pd

### Creating DataFrames from Scratch

`DataFrame` is the core Pandas data structure for representing tabular data. It consists of rows and columns, similar to a spreadsheet or SQL table, and is built on top of NumPy arrays for performance.

We can create DataFrames from Python structures such as lists, dictionaries, and lists of dictionaries.

**Example: Creating a DataFrame from a dictionary.**

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
```

Output:
```
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
```

In [None]:
# your code here


**Example: Creating a DataFrame from a list of dictionaries.**

```python
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
df = pd.DataFrame(data)
print(df)
```
Output:
```
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
```

In [None]:
# your code here


We can also specify **column names** when creating a `DataFrame`.

**Example — From lists**  
Usage:
```python
data = [["Alice", 24], ["Bob", 27], ["Charlie", 22]]
pd.DataFrame(data, columns=[...])
```


Try it yourself!

In [None]:
# your code here


**Example — From a dictionary of columns**  
Usage:
```python
pd.DataFrame({"col1": [...], "col2": [...]})
```

In [None]:
data = {"Name": ["Alice", "Bob", "Charlie"], "Age": [24, 27, 22]}
df = pd.DataFrame(data)
df

**Example — From a list of dictionaries**  
Usage:
```python
pd.DataFrame([{"col": val, ...}, {...}])
```


In [None]:
data = [
    {"name": "Alice", "age": 24},
    {"name": "Bob", "age": 27},
    {"name": "Charlie", "age": 22},
]
df = pd.DataFrame(data)
df

### Q 1.1. Create a DataFrame of 4 students with columns `name`, `age`, `grade`, `city`.

Output desired:

```
      name  age  grade    city
0    Alice   24     20  Lisboa
1      Bob   27     15   Porto
2  Charlie   22     12   Braga
3    David   23     17  Aveiro
```

In [None]:
# your code here


### Exploring DataFrames

Essential inspection methods:
- `shape` → tuple with (n_rows, n_cols)
- `head(n)` / `tail(n)` → first/last rows
- `info()` → dtypes, non-null counts, memory
- `describe()` → summary stats (numeric by default)

**Example — Quick inspection**  
Usage:
```python
df.shape; df.head(3); df.tail(2); df.info(); df.describe()
```
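As a minimal sketch of these inspection calls, the cell below builds a small, made-up `products` DataFrame and inspects it. The name `products` and its values are illustrative assumptions and are not part of the exercises.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the inspection methods
products = pd.DataFrame({
    "item": ["pen", "book", "lamp", "desk", "chair", "mug"],
    "price": [1.5, 12.0, 30.0, 150.0, 85.0, 6.0],
    "stock": [100, 40, 15, 3, 8, 60],
})

print(products.shape)       # (n_rows, n_cols)
print(products.head(3))     # first 3 rows
print(products.tail(2))     # last 2 rows
products.info()             # dtypes, non-null counts, memory usage (prints directly)
print(products.describe())  # summary statistics for the numeric columns only
```

Note that `shape` is an attribute (no parentheses), while the others are methods.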


Let's try by using the DataFrame you created in Q 1.1!

### Q 1.2. Inspect the `shape` of the `DataFrame`.

In [None]:
# your code here


### Q 1.3. Display the first 2 rows of the `DataFrame`.

**Note: by omitting the `n` in `df.head(n)` you display the first 5 rows.**

In [None]:
# your code here


### Q 1.4. Display the last 2 rows of the `DataFrame`.

In [None]:
# your code here


### Q 1.5. Use the `info()` method to display a summary of the `DataFrame`.

In [None]:
# your code here


### Q 1.6. Use the `describe()` method to display summary statistics of the `DataFrame`.

In [None]:
# your code here


**The larger the dataset, the more useful these methods become for inspection!**

Now that we have covered the **basics of creating and inspecting** DataFrames, we can look at their indexes and columns before moving on to selecting and filtering data.

### Checking indexes and columns

You can check the index and columns of a DataFrame using the `.index` and `.columns` attributes.

- `.index` returns the index (row labels) of the DataFrame.
- `.columns` returns the column labels of the DataFrame.
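The two attributes can be sketched on a tiny, made-up DataFrame. The name `scores` and the custom row labels are assumptions chosen only to make the index visible.

```python
import pandas as pd

# Hypothetical toy data with custom row labels instead of the default 0, 1, 2, ...
scores = pd.DataFrame(
    {"math": [14, 18], "physics": [12, 16]},
    index=["ana", "rui"],
)

print(scores.index)          # row labels
print(scores.columns)        # column labels
print(list(scores.columns))  # column labels as a plain Python list
```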

### Q 1.7. Check the index and columns of the DataFrame you created in Q 1.1.

In [None]:
# your code here


### Renaming Columns

You can rename columns in a DataFrame using the `rename()` method or by directly assigning a new list to the `columns` attribute.

**Example — Using `rename()` method**
```python
df.rename(columns={'old_name': 'new_name'}, inplace=True)
```

Notice the `inplace=True` argument, which modifies the DataFrame in place. If you omit it, `rename()` will return a new DataFrame with the changes.

**Example — Directly assigning to `columns` attribute**
```python
df.columns = ['new_name1', 'new_name2', ...]
```

Renaming columns can help make your DataFrame more understandable and easier to work with.
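A short sketch of both approaches, using a made-up `people` DataFrame (an assumption for illustration only), shows the difference between getting a renamed copy and modifying the labels directly.

```python
import pandas as pd

# Hypothetical toy data
people = pd.DataFrame({"nm": ["Ana", "Rui"], "yrs": [31, 28]})

# Without inplace=True, rename() returns a new DataFrame and leaves `people` untouched
renamed = people.rename(columns={"nm": "name", "yrs": "age"})
print(list(people.columns))   # ['nm', 'yrs']
print(list(renamed.columns))  # ['name', 'age']

# Assigning to .columns replaces all labels at once; the list length must match
people.columns = ["Name", "Age"]
print(list(people.columns))   # ['Name', 'Age']
```

`rename()` is convenient when only some columns change, since columns missing from the mapping keep their names.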

### Q 1.8. Rename the columns `name`, `age` and `city` of the DataFrame you created in Q 1.1 to `Name`, `Age` and `City`.

In [None]:
# your code here


## 2. Selecting and Filtering Data

Pandas provides several selection patterns:
- Column access: `df['col']` or `df[['col1','col2']]`
- Row selection by **label**: `df.loc[row_label]`
- Row selection by **position**: `df.iloc[row_position]`
- Boolean filtering with masks and combined conditions using `&` (AND) and `|` (OR) with parentheses.

Let's start with some basic filtering!

### Q 2.1. Select the `Name` column from the given `DataFrame`.

```python
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 23],
    'Grade': [20, 15, 12, 17],
    'City': ['Lisboa', 'Porto', 'Braga', 'Aveiro']
})
```

In [3]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 23],
    'Grade': [20, 15, 12, 17],
    'City': ['Lisboa', 'Porto', 'Braga', 'Aveiro']
})

# your code here


### Q 2.2. Select the columns `Name` and `Grade` from the `DataFrame`.

**Note: by using double brackets `[['col1','col2']]` you select multiple columns and the result is a subset of the `DataFrame`.**

In [None]:
# your code here


### Q 2.3. Try again with a different subset of columns.

In [None]:
# your code here


Pandas allows boolean indexing to filter rows that meet specific criteria.

We need to create a boolean mask (a Series of `True`/`False` values) based on a condition applied to a column, and then use that mask to filter the DataFrame.

```python
df[condition]
```

This will return only the rows where `condition` is `True`.

But what is `condition`?

A condition is typically a comparison operation applied to a DataFrame column, such as:
- `df['col'] > value`
- `df['col'] < value`
- `df['col'] == value`
- `df['col'] != value`

**Example: Filtering rows where `Age` is greater than 23.**

```python
df[df['Age'] > 23]
```

Output:
```
    Name  Age  Grade    City
0  Alice   24     20  Lisboa
1    Bob   27     15   Porto
```
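To see what happens behind the scenes, the sketch below re-creates the students DataFrame from Q 2.1 and stores the condition in a variable before using it. The variable name `mask` is just an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 23],
    'Grade': [20, 15, 12, 17],
    'City': ['Lisboa', 'Porto', 'Braga', 'Aveiro']
})

mask = df['Age'] > 23   # boolean Series, one True/False value per row
print(mask.tolist())    # [True, True, False, False]
print(df[mask])         # keeps only the rows where mask is True
```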

### Q 2.4. Specify a condition to filter the DataFrame for students with a `Grade` greater than 15.

- What do you expect the output to be?

In [None]:
# your code here


`pd.Series` is the Pandas data structure representing a one-dimensional labeled array, similar to a single column in a DataFrame.

**When we apply a condition to a DataFrame column, it returns a Series of boolean values (`True` or `False`) indicating whether each row meets the condition.**

Now put your `condition` inside the brackets to filter the DataFrame!

```python
df[condition]
```

In [None]:
# your code here


### Q 2.5. Filter for students from the `City` of "Porto" **or** "Braga".
- `|` for logical OR

In [None]:
# your code here


### Q 2.6. Filter for students with a `Grade` greater than 15 **and** from the city of "Lisboa".
- `&` for logical AND

In [None]:
# your code here


### Q 2.7. Filter for students **not** from the city of "Aveiro".
- `!=` for not equal
- `~` for NOT

In [10]:
# your code here


We could also use the `~` operator to **negate a condition** but it's often clearer to use `!=` for not equal.

```python
df[~(df['City'] == 'Aveiro')]
```

Let's practice a bit more by combining row filtering with column selection!

### Q 2.8. Filter for students with a `Grade` greater than 15 and select only their `Name` and `City` columns.

In [None]:
# your code here


### Q 2.9. Filter for students from the city of "Porto" and select only their `Name` and `Grade` columns.

In [None]:
# your code here


### Q 2.10. Filter for students **not** from the city of "Aveiro" and select only their `Name` and `City` columns.

In [None]:
# your code here


### Q 2.11. Filter for students with a `Grade` greater than 15 and select only their `Name`, `City` and `Grade` columns.

In [None]:
# your code here


### Multiple Conditions

**Still on filtering, we can set multiple conditions.**

When combining conditions, it's **important to use parentheses** to group conditions properly.

The logical operators are:
- `&` for **AND**
- `|` for **OR**
- `~` for **NOT**

**Usage:**
```python
df[(condition1) & (condition2)]  # AND
df[(condition1) | (condition2)]  # OR
df[~(condition)]                 # NOT
```

**Example: Filtering for students with `Grade` greater than 15 and from "Lisboa" or "Porto".**

```python
df[(df['Grade'] > 15) & ((df['City'] == 'Lisboa') | (df['City'] == 'Porto'))]
```
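Why do the parentheses matter? In Python, `&` and `|` bind more tightly than comparison operators such as `>`. The sketch below, which assumes the students DataFrame from Q 2.1, contrasts the two spellings using two numeric columns.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 23],
    'Grade': [20, 15, 12, 17],
    'City': ['Lisboa', 'Porto', 'Braga', 'Aveiro']
})

# Correct: each comparison is wrapped in parentheses
both = df[(df['Grade'] > 15) & (df['Age'] > 23)]
print(both['Name'].tolist())   # ['Alice']

# Without parentheses, `15 & df['Age']` is evaluated first, and Python then
# tries to chain the two comparisons, which fails on a Series
try:
    df[df['Grade'] > 15 & df['Age'] > 23]
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```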

### Q 2.12. Filter for students with an `Age` greater than 23 from the `City` of **Aveiro** and select only their `Name` and `Grade` columns.


In [None]:
# your code here


As these conditions can get complex, make sure to group them properly with parentheses, or break them into intermediate steps for clarity!

```python
condition1 = df['Age'] > 23
condition2 = df['City'] == 'Aveiro'

df[condition1 & condition2]
```

### Q 2.13. Rewrite the filtering from `Q 2.12` using intermediate conditions for clarity.

In [None]:
# your code here


**Filtering is an essential skill when working with tabular data in Pandas.**

**It allows us to extract relevant subsets of data based on specific criteria, enabling focused analysis and insights.**

Don't sleep on that!

There are many more advanced selection and filtering techniques in Pandas, including using `.loc[]` and `.iloc[]` for label-based and position-based indexing, respectively.

Those are useful when you need more control over row and column selection based on labels or integer positions.

### Filtering data using `.loc[]`.

`.loc[]` is a powerful method for label-based indexing and selection in Pandas DataFrames. It allows you to select rows and columns based on their labels.

- For a single column:

    ```python
    df.loc[condition, 'col1']
    ```

- For multiple columns:

    ```python
    df.loc[condition, ['col1', 'col2']]
    ```

### Q 2.14. Using `.loc[]`, filter for students with a `Grade` greater than 15 and select only their `Name` column.

In [None]:
# your code here


### Q 2.15. Using `.loc[]`, filter for students from the city of "Porto" and select only their `Name` and `Grade` columns.

In [None]:
# your code here


### Q 2.16. Using `.loc[]`, filter for students **not** from the city of "Aveiro" and select only their `Name` and `City` columns.

In [None]:
# your code here


Why use `.loc[]` over standard filtering?
- `.loc[]` provides a more explicit and readable way to select rows and columns based on labels.
- When we need to update values filtered by a condition, `.loc[]` is the preferred method because it avoids unsafe updates.
- Unsafe updates (assigning through chained indexing such as `df[condition]['col'] = value`) can modify a temporary copy instead of the DataFrame and trigger warnings in Pandas, so using `.loc[]` helps ensure that we modify the DataFrame itself.
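The difference can be sketched on a reduced version of the students DataFrame. The warning filter is only there to keep the demo output quiet; the exact warning pandas emits depends on the installed version.

```python
import warnings
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Grade': [20, 15, 12, 17],
})

# Chained indexing: df[mask] produces a temporary copy, so the assignment
# is lost (pandas normally emits a warning here; silenced for this demo)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[df['Grade'] < 15]['Grade'] = 15
after_chained = df['Grade'].tolist()
print(after_chained)          # [20, 15, 12, 17] -> unchanged

# .loc[] selects rows and column in one step and updates df itself
df.loc[df['Grade'] < 15, 'Grade'] = 15
print(df['Grade'].tolist())   # [20, 15, 15, 17]
```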

### Filtering data using `.iloc[]`.

`.iloc[]` is another powerful method for integer-location based indexing and selection in Pandas DataFrames. It allows you to select rows and columns based on their integer positions.

The syntax is similar to `.loc[]`, but instead of using labels, we use integer indices that represent the position of rows and columns.

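It uses zero-based indexing and follows the format `df.iloc[row_positions, column_positions]`.

Unlike `.loc[]`, `.iloc[]` does not accept a boolean `Series` as a row mask, so convert the condition to a NumPy array first with `.to_numpy()`.

- For a single column:

    ```python
    df.iloc[condition.to_numpy(), col_index]
    ```

- For multiple columns:

    ```python
    df.iloc[condition.to_numpy(), [col_index1, col_index2]]
    ```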

### Q 2.17. Using `.iloc[]`, filter for students with a `Grade` greater than 15 and select only their `Name` column.

In [None]:
# your code here


### Q 2.18. Using `.iloc[]`, filter for students from the city of "Porto" and select only their `Name` and `Grade` columns.

In [None]:
# your code here


It is important to note that `.iloc[]` uses integer positions, so you need to know the index of the columns you want to select.

The same logic we used for NumPy indexing applies here.

So, to select the `Name` column (index 0) and `Grade` column (index 2), we would use:
```python
df.iloc[condition.to_numpy(), [0, 2]]  # .iloc[] needs a boolean array, not a Series
```
We can select blocks of columns too by using `:` for ranges.
```python
df.iloc[condition.to_numpy(), 0:3]
```

We can select blocks of rows too by using `:` for ranges.
```python
df.iloc[0:2, [0, 2]]
```

We can select rows and columns blocks together too by using `:` for ranges.
```python
df.iloc[0:2, 0:3]
```
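A runnable sketch of these patterns, assuming the students DataFrame from Q 2.1, is shown below. Since `.iloc[]` works with positions and boolean arrays rather than a boolean `Series`, the condition is converted with `.to_numpy()` first.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 23],
    'Grade': [20, 15, 12, 17],
    'City': ['Lisboa', 'Porto', 'Braga', 'Aveiro']
})

# Boolean NumPy array with one entry per row
mask = (df['Grade'] > 15).to_numpy()

by_mask = df.iloc[mask, [0, 2]]   # rows where mask is True, columns 0 (Name) and 2 (Grade)
block = df.iloc[0:2, 0:3]         # first two rows, first three columns

print(by_mask)
print(block)
```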

### Q 2.19. Using `.iloc[]`, filter for students **not** from the city of "Aveiro" and select only their `Name` and `City` columns.

In [None]:
# your code here


### Q 2.20. Using `.iloc[]`, filter for the second and third students (zero-based) from columns 0 to 2 (exclusive).

In [None]:
# your code here


## 3. Updating DataFrames

Now that we have covered filtering and selecting data in Pandas DataFrames, we can move on to updating data.

Updating data in a DataFrame can be done in several ways, including:
- Direct assignment using column names or `.loc[]`/`.iloc[]`
- Using conditional updates with boolean masks

Example: Direct assignment using column names.

```python
df['Age'] = df['Age'] + 1  # Increment age by 1
```

As a result, the `Age` column will have all values incremented by 1.

Example: Using `.loc[]` for conditional updates.

```python
df.loc[df['Grade'] < 15, 'Grade'] = 15  # Set Grade to 15 if less than 15
```

In this case, all rows where the `Grade` is less than 15 will have their `Grade` updated to 15.

**Note that we used `.loc[]` to safely update the DataFrame based on a condition.**

### Q 3.1. Decrement the `Age` of all students by 1 using direct assignment.

In [None]:
# your code here


### Q 3.2. Using `.loc[]`, set the `Grade` to 18 for students from the city of "Lisboa".

In [None]:
# your code here


### Q 3.3. Using `.loc[]`, set the `City` to "Coimbra" for students with a `Grade` less than 15.

In [None]:
# your code here


### Adding new data

We can add new columns to a DataFrame by direct assignment, providing one value per row.

```python
df['New_Column'] = [1, 2, 3, 4]  # our DataFrame has 4 rows
```

This will create a new column named `New_Column` with the specified values.
If the length of the list does not match the number of rows in the DataFrame, Pandas will raise a `ValueError`.

If we want to add a NaN value, we can use `numpy.nan`.

```python
import numpy as np
df['New_Column'] = [1, np.nan, 3, 4]
```

You can also add new rows using the `pd.concat()` function. The older `DataFrame.append()` method was deprecated in pandas 1.4 and removed in pandas 2.0.

```python
new_row = {'Name': 'Eve', 'Age': 21, 'Grade': 19, 'City': 'Faro'}
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
```

The `ignore_index=True` parameter rebuilds the index after the new row is added, so that it remains sequential (0, 1, 2, ...) and consistent.
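Here is a self-contained sketch of adding a column and a row on a small, made-up DataFrame. The `Score` column is an illustrative assumption, and `pd.concat()` is used for the new row because `DataFrame.append()` is no longer available in pandas 2.0 and later.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [24, 27]})

# New column: one value per existing row; np.nan marks a missing value
df['Score'] = [9.5, np.nan]

# New row: wrap the dict in a one-row DataFrame and concatenate
new_row = pd.DataFrame([{'Name': 'Eve', 'Age': 21, 'Score': 8.0}])
df = pd.concat([df, new_row], ignore_index=True)
print(df)
```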

### Q 3.4. Add a new column `Graduated` with boolean values indicating whether the student has graduated (True/False).

In [None]:
# your code here


### Q 3.5. Add a column `Gender` with values `Male`, `Female` or `Other`.

In [None]:
# your code here


### Casting Data Types
Pandas allows us to change the data type of a column using the `astype()` method. This is useful when we need to ensure that a column has the correct type for analysis or when preparing data for export.

It is a good practice to ensure that columns have the appropriate data types for efficient storage and accurate computations.

We can check the data types of each column using the `dtypes` attribute of the DataFrame.

- `.info()` → provides a summary of the DataFrame including data types.
- `.dtypes` → returns a Series with the data type of each column.

**Example of checking data types.**
```python
df.info()          # prints the summary directly (it returns None)
print(df.dtypes)   # dtype of each column
```

**Example: Changing the data type of the `Grade` column to `float`.**

Usage:
```python
df['Grade'] = df['Grade'].astype(float)
```

Useful data types to consider:
- `int` → Integer numbers
- `float` → Floating-point numbers
- `object` → General Python objects (often used for strings)
- `category` → Categorical data with a fixed number of possible values (saves memory)
- `datetime64[ns]` → Date and time values (usually created with `pd.to_datetime()`)
- `bool` → Boolean values (`True`/`False`)
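As a small sketch of casting, the cell below converts one column to `float` and another to `category`. The DataFrame is a reduced, made-up variant of the students table in which cities repeat, an assumption chosen so that `category` makes sense.

```python
import pandas as pd

df = pd.DataFrame({
    'Grade': [20, 15, 12, 17],
    'City': ['Lisboa', 'Porto', 'Lisboa', 'Porto'],
})

df['Grade'] = df['Grade'].astype(float)      # integers -> floats
df['City'] = df['City'].astype('category')   # repeated strings -> categories
print(df.dtypes)
```

`astype()` returns a new Series, which is why the result is assigned back to the column.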

### Q 3.6. Check the data types of each column in the DataFrame.

In [None]:
# your code here


### Q 3.7. Change the data type of the `Grade` column to `float`.

In [None]:
# your code here


**Note that each column has a single dtype, so one value of a more general type is enough to change the dtype of the whole column.**

For example, if a column has mostly integers but one value is a float, the entire column will be of type `float` to accommodate that value.

The same applies to other types as well such as object (strings) dtype.
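This behaviour can be checked quickly with three small Series. The variable names below are illustrative only.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
mixed_numbers = pd.Series([1, 2.5, 3])
mixed_types = pd.Series([1, 'two', 3])

print(ints.dtype)           # an integer dtype
print(mixed_numbers.dtype)  # float64, because of the single 2.5
print(mixed_types.dtype)    # object, because of the single string
```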

## 4. Reading and Exporting DataFrames

Pandas provides functions to read data from various file formats into DataFrames and to export DataFrames back to files.

Common file formats include:
- CSV (`.csv`)
- JSON (`.json`)
- Excel (`.xlsx`, `.xls`)

You can use the following functions to read data:
- `pd.read_csv('file.csv')`
- `pd.read_json('file.json')`
- `pd.read_excel('file.xlsx')`

To export DataFrames to files, you can use:
- `df.to_csv('file.csv', index=False)`
- `df.to_json('file.json', orient='records')`
- `df.to_excel('file.xlsx', index=False)`

Those extra parameters help to control the output format, such as whether to include the index or how to structure the JSON data.
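The effect of these parameters can be sketched without touching the disk. When no path is given, `to_csv()` and `to_json()` return the text they would have written, and `io.StringIO` lets `read_csv()` read that text back. The small DataFrame is a made-up assumption.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Grade': [20, 15]})

# With no path, to_csv() returns the CSV text instead of writing a file
without_index = df.to_csv(index=False)
with_index = df.to_csv(index=True)

# Reading both back shows the effect of the index parameter
cols_without = pd.read_csv(io.StringIO(without_index)).columns.tolist()
cols_with = pd.read_csv(io.StringIO(with_index)).columns.tolist()
print(cols_without)  # ['Name', 'Grade']
print(cols_with)     # ['Unnamed: 0', 'Name', 'Grade']

# orient='records' writes one JSON object per row
records_json = df.to_json(orient='records')
print(records_json)
```

Exporting with `index=True` stores the row labels as an extra, unnamed column, which is why it reappears as `Unnamed: 0` when the file is read back.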

### Q 4.1. Export your DataFrame from previous questions to a CSV file named `students.csv`. Experiment with the `index` parameter set to True and False.

In [None]:
# your code here


### Q 4.2. Read the `students.csv` file back into a new DataFrame named `df_students`.

In [None]:
# your code here


### Q 4.3. Export your DataFrame to a JSON file named `students.json`. Experiment with the `orient` parameter set to `records` and `columns`.

In [None]:
# your code here


Alternatively, some libraries provide built-in datasets that we can load directly into Pandas DataFrames for practice. In this case they provide functions to load those datasets.

Pandas itself does not include built-in datasets, but libraries like `seaborn` and `sklearn` do.

We can also read datasets from online sources directly into DataFrames using URLs.

## 5. Datasets

In this section, we will work with real-world datasets to apply the concepts we have learned so far.

Let's load a sample dataset using Pandas.

We are going to use the popular `Titanic` dataset, which contains information about passengers on the Titanic, including whether they survived the disaster.

**Data dictionary**

| Column | Description |
|---|---|
| PassengerId | Unique ID |
| Survived | 0 = No, 1 = Yes |
| Pclass | Ticket class (1, 2, 3) |
| Name | Passenger name |
| Sex | Gender |
| Age | Age in years |
| SibSp | Siblings/spouses aboard |
| Parch | Parents/children aboard |
| Ticket | Ticket number |
| Fare | Ticket fare |
| Cabin | Cabin number |
| Embarked | Port of embarkation (C/Q/S)|


### Q 5.1. Load the Titanic dataset:

### Q 5.2. Inspect the dataset using the methods we learned earlier (`shape`, `head()`, `info()`, `describe()`).

In [None]:
# your code here


### Q 5.3. Filter for passengers who survived (Survived == 1) and select only their `Name`, `Age`, and `Fare` columns. Can you calculate the sum, median and average of age and fare from the survivors?

In [None]:
# your code here


### Q 5.4. Filter for passengers who did not survive (Survived == 0) and select only their `Name`, `Age`, and `Fare` columns. Can you calculate the sum, median and average of age and fare of those who did not survive?

In [None]:
# your code here


In the next session, we will cover more Pandas topics, including further data wrangling, casting dtypes, and handling missing values!