# Week 2 — Introduction to Pandas

**In this session we will cover Pandas**, a powerful data manipulation library in Python, through many exercises.

This practical session is designed to prepare you for Data Analysis and Data Wrangling tasks that you may encounter in your Data Science projects.

**Roadmap**
1. Why Pandas? What is Tabular Data?
    - Creating DataFrames from Scratch
    - Exploring DataFrames
    - Checking indexes and columns
    - Renaming Columns
2. Selecting and Filtering Data
    - Multiple conditions
    - Filtering using `.loc[]`
    - Filtering using `.iloc[]`
3. Updating DataFrames
    - Adding New Data
    - Casting Data Types
4. Reading and Exporting DataFrames
5. Datasets



## 1. Why Pandas? What is Tabular Data?

**Pandas** is the standard Python library for manipulating tabular data (rows × columns). It provides the `DataFrame` (table) and `Series` (column) abstractions, efficient I/O (CSV, Excel, SQL, JSON), and a rich API for cleaning, transforming, and analyzing datasets.

**Tabular data**: each row is an observation (e.g., a person or transaction), and each column is an attribute (e.g., age, city, price). Pandas lets us read, inspect, filter, modify, and export this data efficiently.


Let's get started by installing Pandas!

In [None]:
!pip install pandas

Now you can import the library:

In [1]:
import pandas as pd

### Creating DataFrames from Scratch

`DataFrame` is the core Pandas data structure representing tabular data. It consists of rows and columns, similar to a spreadsheet or SQL table and is built on top of NumPy arrays for performance.

We can create DataFrames from Python structures such as lists, dictionaries, and lists of dictionaries.

**Example: Creating a DataFrame from a dictionary.**

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
```

Output:
```
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
```

In [2]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles
2,Charlie,35,Chicago


**Example: Creating a DataFrame from a list of dictionaries.**

```python
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
df = pd.DataFrame(data)
print(df)
```
Output:
```
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
```

In [3]:
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles
2,Charlie,35,Chicago


We can also specify **column names** when creating a `DataFrame`.

**Example — From lists**  
Usage:
```python
data = [["Alice", 24], ["Bob", 27], ["Charlie", 22]]
pd.DataFrame(data, columns=[...])
```


Try it yourself!

In [4]:
data = [["Alice", 24], ["Bob", 27], ["Charlie", 22]]
df = pd.DataFrame(data, columns=["Name", "Age"])
df

Unnamed: 0,Name,Age
0,Alice,24
1,Bob,27
2,Charlie,22


**Example — From a dictionary of columns**  
Usage:
```python
pd.DataFrame({"col1": [...], "col2": [...]})
```

In [5]:
data = {"Name": ["Alice", "Bob", "Charlie"], "Age": [24, 27, 22]}
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age
0,Alice,24
1,Bob,27
2,Charlie,22


**Example — From a list of dictionaries**  
Usage:
```python
pd.DataFrame([{"col": val, ...}, {...}])
```


In [6]:
data = [
    {"name": "Alice", "age": 24},
    {"name": "Bob", "age": 27},
    {"name": "Charlie", "age": 22},
]
df = pd.DataFrame(data)
df

Unnamed: 0,name,age
0,Alice,24
1,Bob,27
2,Charlie,22


### Q 1.1. Create a DataFrame of 4 students with columns `name`, `age`, `grade`, `city`.

Desired output:

```
      name  age  grade    city
0    Alice   24     20  Lisboa
1      Bob   27     15   Porto
2  Charlie   22     12   Braga
3    David   23     17  Aveiro
```

In [7]:
data = [
    ["Alice", 24, 20, "Lisboa"],
    ["Bob", 27, 15, "Porto"],
    ["Charlie", 22, 12, "Braga"],
    ["David", 23, 17, "Aveiro"]
]

df = pd.DataFrame(data, columns=["name", "age", "grade", "city"])
df

Unnamed: 0,name,age,grade,city
0,Alice,24,20,Lisboa
1,Bob,27,15,Porto
2,Charlie,22,12,Braga
3,David,23,17,Aveiro


### Exploring DataFrames

Essential inspection methods:
- `shape` → tuple with (n_rows, n_cols)
- `head(n)` / `tail(n)` → first/last rows
- `info()` → dtypes, non-null counts, memory
- `describe()` → summary stats (numeric by default)

**Example — Quick inspection**  
Usage:
```python
df.shape; df.head(3); df.tail(2); df.info(); df.describe()
```
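As a small sketch of the "numeric by default" remark: `describe()` skips text columns unless asked otherwise, and passing `include="all"` brings them in. The `students` frame below is a stand-in built for this example, not the exercise answer.

```python
import pandas as pd

# Stand-in data for illustration only.
students = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [24, 27, 22, 23],
    "city": ["Lisboa", "Porto", "Braga", "Aveiro"],
})

# Default behaviour: only numeric columns are summarized.
numeric_summary = students.describe()

# include="all" also summarizes text columns (rows such as unique, top, freq).
full_summary = students.describe(include="all")

print(numeric_summary.columns.tolist())
print(full_summary.columns.tolist())
```

Cells that do not apply to a column (for example the mean of `name`) show up as `NaN` in the full summary.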


Let's try by using the DataFrame you created in Q 1.1!

### Q 1.2. Inspect the `shape` of the `DataFrame`.

In [8]:
df.shape

(4, 4)

### Q 1.3. Display the first 2 rows of the `DataFrame`.

**Note: by omitting the `n` in `df.head(n)` you display the first 5 rows.**

In [9]:
df.head(2)

Unnamed: 0,name,age,grade,city
0,Alice,24,20,Lisboa
1,Bob,27,15,Porto


### Q 1.4. Display the last 2 rows of the `DataFrame`.

In [10]:
df.tail(2)

Unnamed: 0,name,age,grade,city
2,Charlie,22,12,Braga
3,David,23,17,Aveiro


### Q 1.5. Use the `info()` method to display a summary of the `DataFrame`.

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    4 non-null      object
 1   age     4 non-null      int64 
 2   grade   4 non-null      int64 
 3   city    4 non-null      object
dtypes: int64(2), object(2)
memory usage: 260.0+ bytes


### Q 1.6. Use the `describe()` method to display summary statistics of the `DataFrame`.

In [12]:
df.describe()

Unnamed: 0,age,grade
count,4.0,4.0
mean,24.0,16.0
std,2.160247,3.366502
min,22.0,12.0
25%,22.75,14.25
50%,23.5,16.0
75%,24.75,17.75
max,27.0,20.0


**The larger the dataset, the more useful these methods become for inspection!**

Now that we have covered the **basics of creating and inspecting** DataFrames, we can look at their row and column labels and then move on to selecting and filtering data.

### Checking indexes and columns

You can check the index and columns of a DataFrame using the `.index` and `.columns` attributes.

- `.index` returns the index (row labels) of the DataFrame.
- `.columns` returns the column labels of the DataFrame.
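A minimal sketch of both attributes, assuming a small frame with custom row labels. The labels `s1` and `s2` are made up for this example; by default Pandas numbers rows 0, 1, 2, and so on.

```python
import pandas as pd

students = pd.DataFrame(
    {"name": ["Alice", "Bob"], "age": [24, 27]},
    index=["s1", "s2"],  # hypothetical custom row labels
)

# Both attributes return Index objects; tolist() gives plain Python lists.
row_labels = students.index.tolist()
column_labels = students.columns.tolist()

print(row_labels)
print(column_labels)
```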

### Q 1.7. Check the index and columns of the DataFrame you created in Q 1.1.

In [13]:
df.index

RangeIndex(start=0, stop=4, step=1)

In [14]:
df.columns

Index(['name', 'age', 'grade', 'city'], dtype='object')

### Renaming Columns

You can rename columns in a DataFrame using the `rename()` method or by directly assigning a new list to the `columns` attribute.

**Example — Using `rename()` method**
```python
df.rename(columns={'old_name': 'new_name'}, inplace=True)
```

Notice the `inplace=True` argument, which modifies the DataFrame in place. If you omit it, `rename()` will return a new DataFrame with the changes.

**Example — Directly assigning to `columns` attribute**
```python
df.columns = ['new_name1', 'new_name2', ...]
```

Renaming columns can help make your DataFrame more understandable and easier to work with.
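The difference between the two approaches can be sketched as follows, using a small stand-in frame. The new column names are arbitrary choices for this example.

```python
import pandas as pd

students = pd.DataFrame({"name": ["Alice", "Bob"], "age": [24, 27]})

# Without inplace=True, rename() returns a new DataFrame...
renamed = students.rename(columns={"name": "Name"})

# ...and the original keeps its column labels.
print(students.columns.tolist())
print(renamed.columns.tolist())

# Assigning to .columns replaces every label at once, in order,
# so the list must have one entry per column.
students.columns = ["full_name", "age_years"]
print(students.columns.tolist())
```

`rename()` is usually the safer choice when only some columns change, because it does not depend on column order.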

### Q 1.8. Rename the columns `name`, `age` and `city` of the DataFrame you created in Q 1.1 to `Name`, `Age` and `City`.

In [15]:
# don't forget the inplace=True param
df.rename(columns={"name": "Name", "age": "Age", "city": "City"}, inplace=True)
df

Unnamed: 0,Name,Age,grade,City
0,Alice,24,20,Lisboa
1,Bob,27,15,Porto
2,Charlie,22,12,Braga
3,David,23,17,Aveiro


## 2. Selecting and Filtering Data

Pandas provides several selection patterns:
- Column access: `df['col']` or `df[['col1','col2']]`
- Row selection by **label**: `df.loc[row_label]`
- Row selection by **position**: `df.iloc[row_position]`
- Boolean filtering with masks and combined conditions using `&` (AND) and `|` (OR) with parentheses.

Let's start with some basic filtering!

### Q 2.1. Select the `Name` column from the given `DataFrame`.

```python
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 23],
    'Grade': [20, 15, 12, 17],
    'City': ['Lisboa', 'Porto', 'Braga', 'Aveiro']
})
```

In [16]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 23],
    'Grade': [20, 15, 12, 17],
    'City': ['Lisboa', 'Porto', 'Braga', 'Aveiro']
})

df["Name"]

0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object

### Q 2.2. Select the columns `Name` and `Grade` from the `DataFrame`.

**Note: by using double brackets `[['col1','col2']]` you select multiple columns and the result is a subset of the `DataFrame`.**

In [17]:
df[['Name', 'Grade']]

Unnamed: 0,Name,Grade
0,Alice,20
1,Bob,15
2,Charlie,12
3,David,17


### Q 2.3. Try again with a different subset of columns.

In [18]:
df[['Age', 'Grade']]


Unnamed: 0,Age,Grade
0,24,20
1,27,15
2,22,12
3,23,17


Pandas allows boolean indexing to filter rows that meet specific criteria.

We need to create a boolean mask (a Series of `True`/`False` values) based on a condition applied to a column, and then use that mask to filter the DataFrame.

```python
df[condition]
```

This returns only the rows where `condition` is `True`.

But what is `condition`?

A condition is typically a comparison operation applied to a DataFrame column, such as:
- `df['col'] > value`
- `df['col'] < value`
- `df['col'] == value`
- `df['col'] != value`

**Example: Filtering rows where `Age` is greater than 23.**

```python
df[df['Age'] > 23]
```

Output:
```
    Name  Age  Grade      City
0  Alice   24     20    Lisboa
1    Bob   27     15     Porto
```

### Q 2.4. Specify a condition to filter the DataFrame for students with a `Grade` greater than 15.

- What do you expect the output to be?

In [19]:
df[df['Grade'] > 15]

Unnamed: 0,Name,Age,Grade,City
0,Alice,24,20,Lisboa
3,David,23,17,Aveiro


`pd.Series` is the Pandas data structure representing a one-dimensional labeled array, similar to a single column in a DataFrame.

**When we apply a condition to a DataFrame column, it returns a Series of boolean values (`True` or `False`) indicating whether each row meets the condition.**

Now put your `condition` inside the brackets to filter the DataFrame!

```python
df[condition]
```

In [20]:
df['Grade'] > 15

0     True
1    False
2    False
3     True
Name: Grade, dtype: bool

### Q 2.5. Filter for students from the `City` of "Porto" **or** "Braga".
- `|` for logical OR

In [21]:
df[(df["City"] == "Porto") | (df["City"] == "Braga")]

Unnamed: 0,Name,Age,Grade,City
1,Bob,27,15,Porto
2,Charlie,22,12,Braga


### Q 2.6. Filter for students with a `Grade` greater than 15 **and** from the city of "Lisboa".
- `&` for logical AND

In [22]:
df[(df["Grade"] > 15) & (df["City"] == "Lisboa")]

Unnamed: 0,Name,Age,Grade,City
0,Alice,24,20,Lisboa


### Q 2.7. Filter for students **not** from the city of "Aveiro".
- `!=` for not equal
- `~` for NOT

In [23]:
df[df["City"] != "Aveiro"]

Unnamed: 0,Name,Age,Grade,City
0,Alice,24,20,Lisboa
1,Bob,27,15,Porto
2,Charlie,22,12,Braga


We could also use the `~` operator to **negate a condition** but it's often clearer to use `!=` for not equal.

```python
df[~(df['City'] == 'Aveiro')]
```
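A short sketch of where `~` earns its place: the two spellings above select the same rows, but `~` can also negate a whole combined condition in one step. The frame mirrors the one from Q 2.1, and the variable names are chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Grade": [20, 15, 12, 17],
    "City": ["Lisboa", "Porto", "Braga", "Aveiro"],
})

# Both spellings select the same rows.
same = df[~(df["City"] == "Aveiro")].equals(df[df["City"] != "Aveiro"])

# ~ negates a combined condition without rewriting each part.
high_grade_in_lisboa = (df["Grade"] > 15) & (df["City"] == "Lisboa")
everyone_else = df[~high_grade_in_lisboa]

print(same)
print(everyone_else["Name"].tolist())
```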

Let's practice a bit more by combining row filtering with column selection!

### Q 2.8. Filter for students with a `Grade` greater than 15 and select only their `Name` and `City` columns.

In [24]:
df.loc[df["Grade"] > 15, ["Name", "City"]]

Unnamed: 0,Name,City
0,Alice,Lisboa
3,David,Aveiro


### Q 2.9. Filter for students from the city of "Porto" and select only their `Name` and `Grade` columns.

In [25]:
df.loc[df["City"] == "Porto", ["Name", "Grade"]]

Unnamed: 0,Name,Grade
1,Bob,15


### Q 2.10. Filter for students **not** from the city of "Aveiro" and select only their `Name` and `City` columns.

In [26]:
df.loc[df["City"] != "Aveiro", ["Name", "City"]]

Unnamed: 0,Name,City
0,Alice,Lisboa
1,Bob,Porto
2,Charlie,Braga


### Q 2.11. Filter for students with a `Grade` greater than 15 and select only their `Name`, `City` and `Grade` columns.

In [27]:
df.loc[df["Grade"] > 15, ["Name", "City", "Grade"]]

Unnamed: 0,Name,City,Grade
0,Alice,Lisboa,20
3,David,Aveiro,17


### Multiple Conditions

**Still on filtering, we can set multiple conditions.**

When combining conditions, it's **important to use parentheses** to group conditions properly.

The logical operators are:
- `&` for **AND**
- `|` for **OR**
- `~` for **NOT**

**Usage:**
```python
df[(condition1) & (condition2)]  # AND
df[(condition1) | (condition2)]  # OR
df[~(condition)]                 # NOT
```

**Example: Filtering for students with `Grade` greater than 15 and from "Lisboa" or "Porto".**

```python
df[(df['Grade'] > 15) & ((df['City'] == 'Lisboa') | (df['City'] == 'Porto'))]
```
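To illustrate why the parentheses matter, here is a sketch that runs the example above on the Q 2.1 data and then tries a version without parentheses. The second condition (`Age` greater than 23) is only there to trigger the failure.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [24, 27, 22, 23],
    "Grade": [20, 15, 12, 17],
    "City": ["Lisboa", "Porto", "Braga", "Aveiro"],
})

in_lisboa_or_porto = (df["City"] == "Lisboa") | (df["City"] == "Porto")
result = df[(df["Grade"] > 15) & in_lisboa_or_porto]
print(result["Name"].tolist())

# Without parentheses, & binds tighter than >, so Python reads this as
# df["Grade"] > (15 & df["Age"]) > 23. That chained comparison needs a
# single True/False from a whole Series, which Pandas refuses to give.
try:
    df[df["Grade"] > 15 & df["Age"] > 23]
    failed = False
except ValueError:
    failed = True
print(failed)
```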

### Q 2.12. Filter for students with an `Age` greater than 23 from the `City` of **Aveiro** and select only their `Name` and `Grade` columns.


In [None]:
# in this case, there are no students that have age above 23 from the city of Aveiro
df.loc[(df["Age"] > 23) & (df["City"] == "Aveiro"), ["Name", "Grade"]]

Unnamed: 0,Name,Grade


These conditions can get complex, so make sure to group them properly with parentheses. Alternatively, break them into multiple steps for clarity!

```python
condition1 = df['Age'] > 23
condition2 = df['City'] == 'Aveiro'

df[condition1 & condition2]
```

### Q 2.13. Rewrite the filtering from `Q 2.12` using intermediate conditions for clarity.

In [29]:
condition1 = df["Age"] > 23
condition2 = df["City"] == "Aveiro"

df.loc[condition1 & condition2, ["Name", "Grade"]]


Unnamed: 0,Name,Grade


In [None]:
# let's be more lenient now with age above 20
condition1 = df["Age"] > 20
condition2 = df["City"] == "Aveiro"

df.loc[condition1 & condition2, ["Name", "Grade"]]


Unnamed: 0,Name,Grade
3,David,17


**Filtering is an essential skill when working with tabular data in Pandas.**

**It allows us to extract relevant subsets of data based on specific criteria, enabling focused analysis and insights.**

Don't sleep on that!

There are many more advanced selection and filtering techniques in Pandas, including using `.loc[]` and `.iloc[]` for label-based and position-based indexing, respectively.

Those are useful when you need more control over row and column selection based on labels or integer positions.

### Filtering data using `.loc[]`.

`.loc[]` is a powerful method for label-based indexing and selection in Pandas DataFrames. It allows you to select rows and columns based on their labels.

- For a single column:

    ```python
    df.loc[condition, 'col1']
    ```

- For multiple columns:

    ```python
    df.loc[condition, ['col1', 'col2']]
    ```

### Q 2.14. Using `.loc[]`, filter for students with a `Grade` greater than 15 and select only their `Name` column.

In [31]:
df.loc[df["Grade"] > 15, "Name"]

0    Alice
3    David
Name: Name, dtype: object

### Q 2.15. Using `.loc[]`, filter for students from the city of "Porto" and select only their `Name` and `Grade` columns.

In [32]:
df.loc[df["City"] == "Porto", ["Name", "Grade"]]

Unnamed: 0,Name,Grade
1,Bob,15


### Q 2.16. Using `.loc[]`, filter for students **not** from the city of "Aveiro" and select only their `Name` and `City` columns.

In [33]:
df.loc[df["City"] != "Aveiro", ["Name", "City"]]

Unnamed: 0,Name,City
0,Alice,Lisboa
1,Bob,Porto
2,Charlie,Braga


Why use `.loc[]` over standard filtering?
- `.loc[]` provides a more explicit and readable way to select rows and columns based on labels.
- when we need to update values filtered by a condition, `.loc[]` is the preferred method to avoid unsafe updates.
- unsafe updates can lead to unexpected behavior and warnings in Pandas, so using `.loc[]` helps ensure that we are modifying the DataFrame safely and correctly.
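A minimal sketch of such an unsafe update, assuming a small stand-in frame. Filtering first and assigning second writes to a temporary copy, so the original DataFrame is left unchanged. Pandas normally warns about this pattern. The warning is silenced here only to keep the output short.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Grade": [20, 15, 12]})

# Chained indexing: df[mask] builds a temporary copy, and the
# assignment lands on that copy instead of on df.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[df["Grade"] < 15]["Grade"] = 15
grades_after_chained = df["Grade"].tolist()

# A single .loc call selects rows and column together and updates df itself.
df.loc[df["Grade"] < 15, "Grade"] = 15
grades_after_loc = df["Grade"].tolist()

print(grades_after_chained)
print(grades_after_loc)
```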

### Filtering data using `.iloc[]`.

`.iloc[]` is another powerful method for integer-location based indexing and selection in Pandas DataFrames. It allows you to select rows and columns based on their integer positions.

The syntax is similar to `.loc[]`, but instead of using labels, we use integer indices that represent the position of rows and columns.

It is zero-based indexing and follows the format `df.iloc[row_indices, column_indices]`.
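The contrast with `.loc[]` is easiest to see when the row labels are not 0, 1, 2. The sketch below assumes made-up string labels `a`, `b`, `c`.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Grade": [20, 15, 12]},
    index=["a", "b", "c"],  # hypothetical string labels
)

by_label = df.loc["b", "Grade"]  # row labelled "b", column "Grade"
by_position = df.iloc[1, 1]      # second row, second column

# Slices differ: .loc includes the end label, .iloc excludes the end position.
loc_rows = df.loc["a":"b"].shape[0]
iloc_rows = df.iloc[0:1].shape[0]

print(by_label, by_position, loc_rows, iloc_rows)
```

With the default `RangeIndex`, labels and positions coincide, which is why the two accessors can look interchangeable on small examples.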

- For a single column:

    ```python
    df.iloc[list(condition), col_index]
    ```

- For multiple columns:

    ```python
    df.iloc[list(condition), [col_index1, col_index2]]
    ```

**Despite being able to filter for conditions using `.iloc`, it is not ideal to use it like this.**

### Q 2.17. Using `.iloc[]`, filter for students with a `Grade` greater than 15 and select only their `Name` column.

In [34]:
condition = df["Grade"] > 15
df.iloc[list(condition), 0]

0    Alice
3    David
Name: Name, dtype: object

### Q 2.18. Using `.iloc[]`, filter for students from the city of "Porto" and select only their `Name` and `Grade` columns.

In [35]:
condition = df["City"] == "Porto"
df.iloc[list(condition), [0, 2]]

Unnamed: 0,Name,Grade
1,Bob,15


It is important to note that `.iloc[]` uses integer positions, so you need to know the index of the columns you want to select.

The same logic we used for NumPy indexing applies here.

So, to select the `Name` column (index 0) and `Grade` column (index 2), we would use:
```python
df.iloc[list(condition), [0, 2]]
```
We can select blocks of columns too by using `:` for ranges.
```python
df.iloc[list(condition), 0:3]
```

We can select blocks of rows too by using `:` for ranges.
```python
df.iloc[0:2, [0, 2]]
```

We can select rows and columns blocks together too by using `:` for ranges.
```python
df.iloc[0:2, 0:3]
```

### Q 2.19. Using `.iloc[]`, filter for students **not** from the city of "Aveiro" and select only their `Name` and `City` columns.

In [36]:
condition = df["City"] != "Aveiro"
df.iloc[list(condition), [0, 3]]

Unnamed: 0,Name,City
0,Alice,Lisboa
1,Bob,Porto
2,Charlie,Braga


### Q 2.20. Using `.iloc[]`, select the second and third students (row positions 1 and 2) and the columns at positions 0 to 2 (exclusive).

In [39]:
df.iloc[1:3, 0:2]

Unnamed: 0,Name,Age
1,Bob,27
2,Charlie,22


## 3. Updating DataFrames

Now that we have covered filtering and selecting data in Pandas DataFrames, we can move on to updating data.

Updating data in a DataFrame can be done in several ways, including:
- Direct assignment using column names or `.loc[]`/`.iloc[]`
- Using conditional updates with boolean masks

Example: Direct assignment using column names.

```python
df['Age'] = df['Age'] + 1  # Increment age by 1
```

As a result, the `Age` column will have all values incremented by 1.

Example: Using `.loc[]` for conditional updates.

```python
df.loc[df['Grade'] < 15, 'Grade'] = 15  # Set Grade to 15 if less than 15
```

In this case, all rows where the `Grade` is less than 15 will have their `Grade` updated to 15.

**Note that we used `.loc[]` to safely update the DataFrame based on a condition.**
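Both updates can be sketched end to end on a fresh copy of the Q 2.1 data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [24, 27, 22, 23],
    "Grade": [20, 15, 12, 17],
})

# Direct assignment: the arithmetic applies to every row at once.
df["Age"] = df["Age"] + 1

# Conditional update: only rows matching the mask are changed.
df.loc[df["Grade"] < 15, "Grade"] = 15

print(df)
```

Only Charlie's grade changes here, since his is the only value below 15.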

### Q 3.1. Decrement the `Age` of all students by 1 using direct assignment.

In [40]:
df["Age"] = df["Age"] - 1
df

Unnamed: 0,Name,Age,Grade,City
0,Alice,23,20,Lisboa
1,Bob,26,15,Porto
2,Charlie,21,12,Braga
3,David,22,17,Aveiro


### Q 3.2. Using `.loc[]`, set the `Grade` to 18 for students from the city of "Lisboa".

In [41]:
df.loc[df["City"] == "Lisboa", "Grade"] = 18
df

Unnamed: 0,Name,Age,Grade,City
0,Alice,23,18,Lisboa
1,Bob,26,15,Porto
2,Charlie,21,12,Braga
3,David,22,17,Aveiro


### Q 3.3. Using `.loc[]`, set the `City` to "Coimbra" for students with a `Grade` less than 15.

In [42]:
df.loc[df["Grade"] < 15, "City"] = "Coimbra"
df

Unnamed: 0,Name,Age,Grade,City
0,Alice,23,18,Lisboa
1,Bob,26,15,Porto
2,Charlie,21,12,Coimbra
3,David,22,17,Aveiro


### Adding new data

We can add new columns to a DataFrame by direct assignment.

```python
df['New_Column'] = [1, 2, 3, 4]  # one value per row (our DataFrame has 4 rows)
```

This will create a new column named `New_Column` with the specified values.
If the length of the list does not match the number of rows in the DataFrame, Pandas will raise a `ValueError`.

If we want to add a missing value (NaN), we can use `numpy.nan`.

```python
import numpy as np
df['New_Column'] = [1, np.nan, 3, 4]
```
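A small sketch of both points, using a three-row stand-in frame and a hypothetical `Score` column: a list of the wrong length is rejected, and a single `NaN` turns an otherwise integer column into floats.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"]})

# A list with the wrong number of values is rejected.
try:
    df["Score"] = [1, 2]
    length_error = False
except ValueError:
    length_error = True

# A missing value forces the whole column to a float dtype.
df["Score"] = [1, np.nan, 3]

print(length_error)
print(df["Score"].dtype)
```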

You can also add new rows using the `pd.concat()` function. (The older `DataFrame.append()` method was removed in pandas 2.0.)

```python
new_row = pd.DataFrame([{'Name': 'Eve', 'Age': 21, 'Grade': 19, 'City': 'Faro'}])
df = pd.concat([df, new_row], ignore_index=True)
```

The `ignore_index=True` parameter gives the combined DataFrame a fresh 0, 1, 2, ... index. Without it, the new row would keep its own label `0`, and the index would no longer be sequential and unique.

### Q 3.4. Add a new column `Graduated` with boolean values indicating whether the student has graduated (True/False).

In [None]:
df["Graduated"] = [True, False, False, True]
df

### Q 3.5. Add a column `Gender` with values `Male`, `Female` or `Other`.

In [43]:
df["Gender"] = ["Female", "Male", "Other", "Male"]
df

Unnamed: 0,Name,Age,Grade,City,Gender
0,Alice,23,18,Lisboa,Female
1,Bob,26,15,Porto,Male
2,Charlie,21,12,Coimbra,Other
3,David,22,17,Aveiro,Male


### Casting Data Types
Pandas allows us to change the data type of a column using the `astype()` method. This is useful when we need to ensure that a column has the correct type for analysis or when preparing data for export.

It is a good practice to ensure that columns have the appropriate data types for efficient storage and accurate computations.

We can check the data types of each column using the `dtypes` attribute of the DataFrame.

- `.info()` → provides a summary of the DataFrame including data types.
- `.dtypes` → returns a Series with the data type of each column.

**Example: Checking data types.**
```python
df.info()    # prints the summary itself, so no print() is needed
df.dtypes
```

**Example: Changing the data type of the `Grade` column to `float`.**

Usage:
```python
df['Grade'] = df['Grade'].astype(float)
```

Useful data types to consider:
- `int` → Integer numbers
- `float` → Floating-point numbers
- `object` → General Python objects (often used for strings)
- `category` → Categorical data with a fixed number of possible values (saves memory)
- `datetime64[ns]` → Date and time values (usually created with `pd.to_datetime()`)
- `bool` → Boolean values (`True`/`False`)
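A hedged sketch of three of these conversions. The column names and dates are made up for this example, and `pd.to_datetime()` is used for the date column because it parses strings directly.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Lisboa", "Porto", "Lisboa", "Porto"],
    # Made-up enrolment dates stored as text.
    "Enrolled": ["2024-09-01", "2024-09-02", "2024-09-03", "2024-09-04"],
    "Passed": [1, 0, 1, 1],
})

df["City"] = df["City"].astype("category")       # few distinct values
df["Enrolled"] = pd.to_datetime(df["Enrolled"])  # text to datetime
df["Passed"] = df["Passed"].astype(bool)         # 0/1 to False/True

print(df.dtypes)
```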

### Q 3.6. Check the data types of each column in the DataFrame.

In [44]:
df.dtypes

Name      object
Age        int64
Grade      int64
City      object
Gender    object
dtype: object

### Q 3.7. Change the data type of the `Grade` column to `float`.

In [45]:
df["Grade"] = df["Grade"].astype(float)
df

Unnamed: 0,Name,Age,Grade,City,Gender
0,Alice,23,18.0,Lisboa,Female
1,Bob,26,15.0,Porto,Male
2,Charlie,21,12.0,Coimbra,Other
3,David,22,17.0,Aveiro,Male


**Note that each column has a single dtype: if even one value needs a more general type, the whole column takes that type.**

For example, if a column has mostly integers but one value is a float, the entire column will be of type `float` to accommodate that value.

The same applies to other types: a single string among numbers makes the whole column `object`.
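This behaviour can be checked with three tiny Series. The variable names are chosen for this example.

```python
import pandas as pd

only_ints = pd.Series([1, 2, 3])
with_float = pd.Series([1, 2.5, 3])     # one float makes the column float
with_text = pd.Series([1, "two", 3])    # one string makes the column object

print(only_ints.dtype, with_float.dtype, with_text.dtype)
```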

## 4. Reading and Exporting DataFrames

Pandas provides functions to read data from various file formats into DataFrames and to export DataFrames back to files.

Common file formats include:
- CSV (`.csv`)
- JSON (`.json`)
- Excel (`.xlsx`, `.xls`)

You can use the following functions to read data:
- `pd.read_csv('file.csv')`
- `pd.read_json('file.json')`
- `pd.read_excel('file.xlsx')`

To export DataFrames to files, you can use:
- `df.to_csv('file.csv', index=False)`
- `df.to_json('file.json', orient='records')`
- `df.to_excel('file.xlsx', index=False)`

The extra parameters (`index`, `orient`) control the output format, such as whether to include the index or how to structure the JSON data.
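To see what the `index` parameter changes, here is a sketch of a CSV round trip. It writes to in-memory `io.StringIO` buffers instead of files, an assumption made only so the example leaves nothing on disk.

```python
import io

import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Grade": [18.0, 15.0]})

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
restored = pd.read_csv(without_index)

# index=True: the row labels become an extra first column with no header.
with_index = io.StringIO()
df.to_csv(with_index, index=True)
with_index.seek(0)
restored_with_index = pd.read_csv(with_index)

print(restored.columns.tolist())
print(restored_with_index.columns.tolist())
```

When a file was saved with its index, `pd.read_csv(..., index_col=0)` reads that first column back as the index instead of as data.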

### Q 4.1. Export your DataFrame from previous questions to a CSV file named `students.csv`. Experiment with the `index` parameter set to True and False.

In [53]:
# this will include the index
df.to_csv("students.csv", index=True)

In [54]:
!cat students.csv

,Name,Age,Grade,City,Gender
0,Alice,23,18.0,Lisboa,Female
1,Bob,26,15.0,Porto,Male
2,Charlie,21,12.0,Coimbra,Other
3,David,22,17.0,Aveiro,Male


In [None]:
# we usually use this one
# this will not include the index
df.to_csv("students.csv", index=False)

In [56]:
!cat students.csv

Name,Age,Grade,City,Gender
Alice,23,18.0,Lisboa,Female
Bob,26,15.0,Porto,Male
Charlie,21,12.0,Coimbra,Other
David,22,17.0,Aveiro,Male


### Q 4.2. Read the `students.csv` file back into a new DataFrame named `df_students`.

In [47]:
df_students = pd.read_csv("students.csv")
df_students

Unnamed: 0,Name,Age,Grade,City,Gender
0,Alice,23,18.0,Lisboa,Female
1,Bob,26,15.0,Porto,Male
2,Charlie,21,12.0,Coimbra,Other
3,David,22,17.0,Aveiro,Male


### Q 4.3. Export your DataFrame to a JSON file named `students.json`. Experiment with the `orient` parameter set to `records` and `columns`.

In [48]:
df.to_json("students.json", orient="records")

In [50]:
!cat students.json

[{"Name":"Alice","Age":23,"Grade":18.0,"City":"Lisboa","Gender":"Female"},{"Name":"Bob","Age":26,"Grade":15.0,"City":"Porto","Gender":"Male"},{"Name":"Charlie","Age":21,"Grade":12.0,"City":"Coimbra","Gender":"Other"},{"Name":"David","Age":22,"Grade":17.0,"City":"Aveiro","Gender":"Male"}]


In [51]:
df.to_json("students.json", orient="columns")

In [52]:
!cat students.json

{"Name":{"0":"Alice","1":"Bob","2":"Charlie","3":"David"},"Age":{"0":23,"1":26,"2":21,"3":22},"Grade":{"0":18.0,"1":15.0,"2":12.0,"3":17.0},"City":{"0":"Lisboa","1":"Porto","2":"Coimbra","3":"Aveiro"},"Gender":{"0":"Female","1":"Male","2":"Other","3":"Male"}}


Alternatively, some libraries ship built-in datasets, along with functions that load them directly into Pandas DataFrames for practice.

Pandas itself does not include built-in datasets, but libraries like `seaborn` and `sklearn` do.

We can also read datasets from online sources directly into DataFrames using URLs.

## 5. Datasets

In this section, we will work with real-world datasets to apply the concepts we have learned so far.

Let's load a sample dataset using Pandas.

We are going to use the popular `Titanic` dataset, which contains information about passengers on the Titanic, including whether they survived the disaster.

**Data dictionary**

| Column | Description |
|---|---|
| PassengerId | Unique ID |
| Survived | 0 = No, 1 = Yes |
| Pclass | Ticket class (1, 2, 3) |
| Name | Passenger name |
| Sex | Gender |
| Age | Age in years |
| SibSp | Siblings/spouses aboard |
| Parch | Parents/children aboard |
| Ticket | Ticket number |
| Fare | Ticket fare |
| Cabin | Cabin number |
| Embarked | Port of embarkation (C/Q/S)|


### Q 5.1. Load the Titanic dataset:

In [58]:
df = pd.read_csv('titanic.csv')
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


### Q 5.2. Inspect the dataset using the methods we learned earlier (`shape`, `head()`, `info()`, `describe()`).

In [None]:
# 891 rows and 12 columns
df.shape

(891, 12)

In [60]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [61]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


### Q 5.3. Filter for passengers who survived (Survived == 1) and select only their `Name`, `Age`, and `Fare` columns. Can you calculate the sum, median and average of age and fare from the survivors?

In [None]:
# here we create a subset with the columns we want, filtered for survivors
survivors = df[df["Survived"] == 1][["Name", "Age", "Fare"]]

age_sum = survivors["Age"].sum()
age_median = survivors["Age"].median()
age_mean = survivors["Age"].mean()

fare_sum = survivors["Fare"].sum()
fare_median = survivors["Fare"].median()
fare_mean = survivors["Fare"].mean()

print("Age statistics (survivors):")
print("  • Sum    =", age_sum)
print("  • Median =", age_median)
print("  • Mean   =", age_mean)
print()
print("Fare statistics (survivors):")
print("  • Sum    =", fare_sum)
print("  • Median =", fare_median)
print("  • Mean   =", fare_mean)

survivors

Age statistics (survivors):
  • Sum    = 8219.67
  • Median = 28.0
  • Mean   = 28.343689655172415

Fare statistics (survivors):
  • Sum    = 16551.2294
  • Median = 26.0
  • Mean   = 48.39540760233918


Unnamed: 0,Name,Age,Fare
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,71.2833
2,"Heikkinen, Miss. Laina",26.0,7.9250
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,53.1000
8,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",27.0,11.1333
9,"Nasser, Mrs. Nicholas (Adele Achem)",14.0,30.0708
...,...,...,...
875,"Najib, Miss. Adele Kiamie ""Jane""",15.0,7.2250
879,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",56.0,83.1583
880,"Shelley, Mrs. William (Imanita Parrish Hall)",25.0,26.0000
887,"Graham, Miss. Margaret Edith",19.0,30.0000


### Q 5.4. Filter for passengers who did not survive (Survived == 0) and select only their `Name`, `Age`, and `Fare` columns. Can you calculate the sum, median and average of age and fare of those who did not survive?

In [None]:
# here we create a subset with the columns we want, filtered for non-survivors
nonsurvivors = df[df["Survived"] == 0][["Name", "Age", "Fare"]]

age_sum = nonsurvivors["Age"].sum()
age_median = nonsurvivors["Age"].median()
age_mean = nonsurvivors["Age"].mean()

fare_sum = nonsurvivors["Fare"].sum()
fare_median = nonsurvivors["Fare"].median()
fare_mean = nonsurvivors["Fare"].mean()

print("Age statistics (non-survivors):")
print("  • Sum   =", age_sum)
print("  • Median =", age_median)
print("  • Mean   =", age_mean)
print()
print("Fare statistics (non-survivors):")
print("  • Sum   =", fare_sum)
print("  • Median =", fare_median)
print("  • Mean   =", fare_mean)

nonsurvivors


Age statistics (non-survivors):
  • Sum   = 12985.5
  • Median = 28.0
  • Mean   = 30.62617924528302

Fare statistics (non-survivors):
  • Sum   = 12142.7199
  • Median = 10.5
  • Mean   = 22.117886885245902


Unnamed: 0,Name,Age,Fare
0,"Braund, Mr. Owen Harris",22.0,7.2500
4,"Allen, Mr. William Henry",35.0,8.0500
5,"Moran, Mr. James",,8.4583
6,"McCarthy, Mr. Timothy J",54.0,51.8625
7,"Palsson, Master. Gosta Leonard",2.0,21.0750
...,...,...,...
884,"Sutehall, Mr. Henry Jr",25.0,7.0500
885,"Rice, Mrs. William (Margaret Norton)",39.0,29.1250
886,"Montvila, Rev. Juozas",27.0,13.0000
888,"Johnston, Miss. Catherine Helen ""Carrie""",,23.4500
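As a compact alternative to computing each statistic separately, the sketch below uses `.loc[]` with `agg()` on a handful of made-up passengers (not rows from the Titanic file). It also shows that missing ages are skipped rather than counted as zero.

```python
import numpy as np
import pandas as pd

# Made-up passengers for illustration, not taken from titanic.csv.
passengers = pd.DataFrame({
    "Survived": [1, 0, 1, 1, 0],
    "Age": [22.0, 40.0, np.nan, 30.0, np.nan],
    "Fare": [10.0, 8.0, 50.0, 30.0, 7.0],
})

survived = passengers["Survived"] == 1

# One row per statistic, one column per variable.
stats = passengers.loc[survived, ["Age", "Fare"]].agg(["sum", "median", "mean"])
print(stats)

# The missing age is skipped, so the mean uses 2 values, not 3.
age_mean = stats.loc["mean", "Age"]
print(age_mean)
```

This explains why, in the Titanic results above, dividing the age sum by the number of passengers in the group does not reproduce the age mean: only non-missing ages enter the calculation.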


In the next session we will cover more Pandas topics: more data wrangling, casting dtypes, handling missing values, and more!