# Section 1: Getting Started With Pandas

## Introduction
We will begin by introducing the `Series`, `DataFrame`, and `Index` classes, which are the basic building blocks of the Pandas library, and by showing how to work with them. By the end of this section, you will be able to create DataFrames and perform operations on them to inspect and filter the data.

If at the end of this introduction to Pandas you need more information, you can always refer to the [official Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/index.html), the [official Pandas website](http://pandas.pydata.org/), or the [source code](https://github.com/pandas-dev/pandas/).

## Anatomy of a DataFrame
A **DataFrame** is composed of one or more **Series**. The names of the Series form the <span style="color:#fa5252">**Column Names**</span>, and the row labels form the <span style="color:#a5d8ff">**Index**</span>.

![](https://www.w3resource.com/w3r_images/pandas-data-frame.svg)
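
To make the anatomy concrete, here is a minimal sketch that assembles a DataFrame from two named Series. The variable name `toy` and all values are hypothetical and chosen only for illustration; they are not taken from the penguins file.

```python
import pandas as pd

# Hypothetical toy data (illustrative values only)
species = pd.Series(["Adelie", "Gentoo", "Chinstrap"], name="species")
body_mass_g = pd.Series([3750.0, 5000.0, 3800.0], name="body_mass_g")

# Each Series becomes one column; the Series names become the column names
toy = pd.concat([species, body_mass_g], axis=1)

print(toy.columns.tolist())  # ['species', 'body_mass_g']
print(toy.index.tolist())    # [0, 1, 2]

# Selecting a single column gives back a Series
print(type(toy["species"]).__name__)  # Series
```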

In [1]:
import pandas as pd
# import the pandas library (comparable to including a header and linking a library in C)
import numpy as np  # optional: it can be left out if it is never called
# NumPy provides efficient array operations
# pandas also offers plotting tools (otherwise, use Matplotlib directly)
penguins = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')
penguins
# the index is present by default; it labels the instances (the rows)
# NaN marks incomplete data, which should be filtered out.

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
...,...,...,...,...,...,...,...,...
339,Chinstrap,Dream,55.8,19.8,207.0,4000.0,male,2009
340,Chinstrap,Dream,43.5,18.1,202.0,3400.0,female,2009
341,Chinstrap,Dream,49.6,18.2,193.0,3775.0,male,2009
342,Chinstrap,Dream,50.8,19.0,210.0,4100.0,male,2009


### Series

In [8]:
penguins.species
# a Series is a single column together with its name; this command selects one
# column. Columns can also be selected with square brackets.
penguins['species']

Unnamed: 0,species
0,Adelie
1,Adelie
2,Adelie
3,Adelie
4,Adelie
...,...
339,Chinstrap
340,Chinstrap
341,Chinstrap
342,Chinstrap


### Columns

In [9]:
penguins.columns
# names of the columns

Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex', 'year'],
      dtype='object')

### Index

In [10]:
penguins.index
# returns the range of index values; the stop value is excluded.

RangeIndex(start=0, stop=344, step=1)

## Creating DataFrames
We can create DataFrames from a variety of sources such as other Python objects, flat files, web scraping, and API requests. Here, we will see just one example, but be sure to check out [this page](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) in the documentation for a complete list, and have a look at [this page](https://towardsdatascience.com/top-5-ridiculously-better-csv-alternatives-595f70a9c936/) for alternatives to CSV files when storing large datasets.

![](https://github.com/applied-machine-learning-aa-2024-25/data-analysis-with-pandas-notebooks-stephenpasqualottounipd/blob/main/images/readwrite.svg?raw=1)

### Using a flat file

In [12]:
# pandas reads many kinds of data sources, including Excel, SQL, HTML, ...
# it can also convert between them, e.g. penguins.to_excel("penguins.xlsx")
penguins = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')

*Tip: There are many parameters to this function to handle some initial processing while reading in the file. Be sure to check out the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).*

*Tip 2: You can also write this data to a new file using the `df.to_csv('data.csv')` method.*
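
As a rough sketch of both tips, the example below reads a small CSV with a couple of `read_csv()` parameters and then writes it back out with `to_csv()`. The CSV text, the `"unknown"` placeholder for missing values, and the in-memory buffers (used here instead of files on disk) are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """species,island,body_mass_g,sex
Adelie,Torgersen,3750,male
Adelie,Torgersen,unknown,
Gentoo,Biscoe,5000,female
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["species", "body_mass_g", "sex"],  # read only these columns
    na_values=["unknown"],                       # treat this marker as missing
)
print(df)

# Write the data without the index column, then read it back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
roundtrip = pd.read_csv(buffer)
print(roundtrip.shape)  # (3, 3)
```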

## Inspecting the data
Now that we have some data, we need to perform an initial inspection of it. This gives us information on what the data looks like, how many rows/columns there are, and how much data we have.

Let's inspect the `penguins` data.

#### 🐧 Step 1: How big is this thing?

In [13]:
penguins.shape
# dimensions (shape) of the table as (rows, columns); the index is not counted as a column
# the size gives a first idea of how long operations on the data will take

(344, 8)

#### 🧠 Step 2: What are we even looking at?

In [14]:
penguins.columns
# see which columns there are and judge whether they all make sense

Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex', 'year'],
      dtype='object')

#### 📦 Step 3: What type of stuff is in each column?

In [15]:
penguins.dtypes
# data type of each column; `object` usually means strings.
# these types are inferred while reading the dataset, so they may not be the most sensible ones
# it could make sense to treat the object columns as categories

Unnamed: 0,0
species,object
island,object
bill_length_mm,float64
bill_depth_mm,float64
flipper_length_mm,float64
body_mass_g,float64
sex,object
year,int64
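
Columns that hold a small set of repeated strings, such as `species`, can be stored as categories instead of plain strings. The following is a minimal sketch on hypothetical toy data; whether the conversion is worthwhile for a given column is a judgment call.

```python
import pandas as pd

# Hypothetical toy data (illustrative values only)
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "year": [2007, 2007, 2008, 2009],
})

# Convert the string column to the categorical dtype
toy["species"] = toy["species"].astype("category")

print(toy.dtypes)
print(toy["species"].cat.categories.tolist())  # ['Adelie', 'Chinstrap', 'Gentoo']
```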


#### 👀 Step 4: Show me the penguins!

In [16]:
penguins.head()
# first 5 rows

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007


Sometimes there may be extraneous data at the end of the file, so checking the bottom few rows is also important:

In [17]:
penguins.tail()
# last 5 rows; useful because the end of a file may contain unneeded content such as
# totals or other rows that are not data

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
339,Chinstrap,Dream,55.8,19.8,207.0,4000.0,male,2009
340,Chinstrap,Dream,43.5,18.1,202.0,3400.0,female,2009
341,Chinstrap,Dream,49.6,18.2,193.0,3775.0,male,2009
342,Chinstrap,Dream,50.8,19.0,210.0,4100.0,male,2009
343,Chinstrap,Dream,50.2,18.7,198.0,3775.0,female,2009


#### 🔢 Step 5: How much non-missing data do we actually have?

In [18]:
penguins.count()
# number of valid values per column; "not valid" means NaN

Unnamed: 0,0
species,344
island,344
bill_length_mm,342
bill_depth_mm,342
flipper_length_mm,342
body_mass_g,342
sex,333
year,344
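
The complementary view, how many values are missing per column, can be obtained with `isna().sum()`. This is a small sketch on hypothetical toy data rather than the penguins file:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "body_mass_g": [3750.0, np.nan, 5000.0],
    "sex": ["male", None, None],
})

non_missing = toy.count()    # non-missing values per column
missing = toy.isna().sum()   # missing values per column

print(missing.tolist())  # [0, 1, 2]
# For every column, the two counts add up to the number of rows
print((non_missing + missing == len(toy)).all())  # True
```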


#### 🕵️ Step 6: Want a quick profile of the whole DataFrame?

In [19]:
penguins.info()
# gives an overall summary

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
 7   year               344 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB


## Extracting subsets
A crucial part of working with DataFrames is extracting subsets of the data: finding rows that meet a certain set of criteria, isolating columns/rows of interest, etc. After narrowing down our data, we are closer to discovering insights. This section will be the backbone of many analysis tasks.

#### Selecting columns
We can select columns as attributes if their names would be valid Python variables:

In [20]:
penguins.species

Unnamed: 0,species
0,Adelie
1,Adelie
2,Adelie
3,Adelie
4,Adelie
...,...
339,Chinstrap
340,Chinstrap
341,Chinstrap
342,Chinstrap


If they aren't, we have to select them as keys, a form that works for any column name:

In [21]:
penguins['body_mass_g']
# prefer this form as the standard

Unnamed: 0,body_mass_g
0,3750.0
1,3800.0
2,3250.0
3,
4,3450.0
...,...
339,4000.0
340,3400.0
341,3775.0
342,4100.0


Selecting by key also lets us select multiple columns at once by passing a list of names:

In [22]:
penguins[['species','body_mass_g']]

Unnamed: 0,species,body_mass_g
0,Adelie,3750.0
1,Adelie,3800.0
2,Adelie,3250.0
3,Adelie,
4,Adelie,3450.0
...,...,...
339,Chinstrap,4000.0
340,Chinstrap,3400.0
341,Chinstrap,3775.0
342,Chinstrap,4100.0


#### Selecting rows

In [23]:
penguins[100:110]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
100,Adelie,Biscoe,35.0,17.9,192.0,3725.0,female,2009
101,Adelie,Biscoe,41.0,20.0,203.0,4725.0,male,2009
102,Adelie,Biscoe,37.7,16.0,183.0,3075.0,female,2009
103,Adelie,Biscoe,37.8,20.0,190.0,4250.0,male,2009
104,Adelie,Biscoe,37.9,18.6,193.0,2925.0,female,2009
105,Adelie,Biscoe,39.7,18.9,184.0,3550.0,male,2009
106,Adelie,Biscoe,38.6,17.2,199.0,3750.0,female,2009
107,Adelie,Biscoe,38.2,20.0,190.0,3900.0,male,2009
108,Adelie,Biscoe,38.1,17.0,181.0,3175.0,female,2009
109,Adelie,Biscoe,43.2,19.0,197.0,4775.0,male,2009


#### Indexing
We use `iloc[]` to select rows and columns by their position:

In [None]:
penguins.iloc[100:110, [0, 3, 4, 6]]

and use `loc[]` to select rows and columns by their labels:

In [None]:
penguins.loc[100:110, ['species', 'body_mass_g']]
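
One difference is easy to miss: slices in `iloc[]` exclude the end position, while slices in `loc[]` include the end label. With the default `RangeIndex`, `iloc[100:110]` therefore returns 10 rows and `loc[100:110]` returns 11. The sketch below shows the same effect on hypothetical toy data with a non-default index:

```python
import pandas as pd

# Hypothetical toy data whose index labels start at 10
toy = pd.DataFrame(
    {"species": list("abcdef"), "body_mass_g": [10, 20, 30, 40, 50, 60]},
    index=[10, 11, 12, 13, 14, 15],
)

by_position = toy.iloc[1:3]  # positions 1 and 2; end position excluded
by_label = toy.loc[11:13]    # labels 11, 12, and 13; end label included

print(by_position.index.tolist())  # [11, 12]
print(by_label.index.tolist())     # [11, 12, 13]
```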

#### Filtering with Boolean masks
A **Boolean mask** is an array-like structure of Boolean values: it's a way to specify which rows/columns we want to select (`True`) and which we don't (`False`).

Here's an example of a Boolean mask for penguins weighing more than 3500 grams that are female:

In [None]:
(penguins['body_mass_g'] > 3500) & (penguins.sex == 'female')

**Important**: Take note of the syntax here. We surround each condition with parentheses, and we use bitwise operators (`&`, `|`, `~`) instead of logical operators (`and`, `or`, `not`).
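
To see why the bitwise operators are needed, the sketch below combines two masks with `&` and then tries `and`, which raises a `ValueError` because pandas cannot reduce a whole Series to a single `True`/`False`. The toy data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data (illustrative values only)
toy = pd.DataFrame({
    "body_mass_g": [3400.0, 3800.0, 4200.0],
    "sex": ["female", "female", "male"],
})

heavy = toy["body_mass_g"] > 3500
female = toy["sex"] == "female"

mask = heavy & female          # element-wise AND
print(mask.tolist())           # [False, True, False]
print((~female).tolist())      # [False, False, True]

try:
    heavy and female           # Python's `and` needs a single truth value
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)
```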

We can use a Boolean mask to select the subset of penguins weighing more than 3500 grams that are female:


In [None]:
penguins[(penguins['body_mass_g'] > 3500) & (penguins.sex == 'female')]

*Tip: Boolean masks can be used with `loc[]` and `iloc[]`; note that `iloc[]` expects a plain Boolean array rather than a Boolean `Series`.*
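
A brief sketch of masks inside the indexers, on hypothetical toy data: `loc[]` takes the Boolean Series directly, while for `iloc[]` the mask is first converted to a NumPy array with `to_numpy()`.

```python
import pandas as pd

# Hypothetical toy data (illustrative values only)
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "body_mass_g": [3400.0, 4800.0, 5200.0, 3700.0],
    "sex": ["female", "female", "male", "female"],
})

mask = (toy["body_mass_g"] > 3500) & (toy["sex"] == "female")

# loc: Boolean Series for the rows, labels for the columns
subset = toy.loc[mask, ["species", "body_mass_g"]]

# iloc: Boolean array for the rows, positions for the columns
same_rows = toy.iloc[mask.to_numpy(), [0, 1]]

print(subset.index.tolist())     # [1, 3]
print(subset.equals(same_rows))  # True
```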

An alternative to this is the `query()` method:

In [None]:
penguins.query("body_mass_g > 3500 and sex == 'female'")

*Tip: Here, we can use both logical operators and bitwise operators.*
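
The sketch below compares both operator styles in `query()` with the equivalent Boolean mask, using hypothetical toy data. It also refers to a local Python variable with the `@` prefix; the name `min_mass` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data (illustrative values only)
toy = pd.DataFrame({
    "body_mass_g": [3400.0, 3800.0, 4200.0, 3900.0],
    "sex": ["female", "female", "male", "female"],
})

min_mass = 3500  # local variable, referenced inside query() as @min_mass

with_and = toy.query("body_mass_g > @min_mass and sex == 'female'")
with_amp = toy.query("(body_mass_g > @min_mass) & (sex == 'female')")
masked = toy[(toy["body_mass_g"] > min_mass) & (toy["sex"] == "female")]

print(with_and.index.tolist())  # [1, 3]
print(with_and.equals(masked))  # True
print(with_amp.equals(masked))  # True
```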

## Calculating summary statistics
In the next section of this workshop, we will discuss data cleaning for a more meaningful analysis of our datasets; however, we can already extract some interesting insights from the `penguins` data by calculating summary statistics.

#### 📊 Island Frequency Counts

In [None]:
penguins['island'].value_counts()

In [None]:
penguins['island'].value_counts().plot(kind='bar', title='Number of Penguins per Island')


#### ⚖️ Mean Body Mass

In [None]:
print("Mean:", penguins['body_mass_g'].mean())
print("Median:", penguins['body_mass_g'].median())




**Important**: The mean isn't always the best measure of central tendency. If there are outliers in the distribution, the mean will be skewed.
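
A small sketch of this effect, using hypothetical masses plus one deliberately extreme value: the outlier moves the mean a long way, while the median barely changes.

```python
import pandas as pd

# Hypothetical masses in grams (illustrative values only)
masses = pd.Series([3400.0, 3600.0, 3800.0, 4000.0, 4200.0])
print(masses.mean(), masses.median())  # 3800.0 3800.0

# Append one extreme, made-up value to act as an outlier
with_outlier = pd.concat([masses, pd.Series([40000.0])], ignore_index=True)
print(round(with_outlier.mean(), 1))  # 9833.3
print(with_outlier.median())          # 3900.0
```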

Comparing the mean with the median and with a few quantiles shows how strongly the tails of the distribution pull on it. For penguin body mass, the mean sits above the median, which points to a right-skewed distribution; a histogram makes this visible:

In [None]:

penguins['body_mass_g'].plot(kind='hist', bins=20, title='Body Mass Distribution', edgecolor='black')


#### 📐 Mass by Species – Boxplot

In [None]:
import seaborn as sns
sns.boxplot(x='species', y='body_mass_g', data=penguins)


#### 🧮 Using Quantiles

In [None]:
penguins['body_mass_g'].quantile([0.01, 0.05, 0.5, 0.95, 0.99])

A better measure in this case is the median (50th percentile), since it is robust to outliers:

In [None]:
penguins['body_mass_g'].median()

#### 🐘 Heaviest Penguin – Outlier Check

In [None]:
penguins['body_mass_g'].max()

Let's extract the information on this penguin:

In [None]:
penguins.loc[penguins['body_mass_g'].idxmax()]
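
To spell out the two steps: `idxmax()` returns the index label of the largest value (missing values are skipped by default), and `loc[]` then looks up the row with that label. The sketch below uses hypothetical toy data with string labels to make clear that a label, not a position, is returned.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (illustrative values only)
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
        "body_mass_g": [3750.0, np.nan, 6300.0, 4100.0],
    },
    index=["p1", "p2", "p3", "p4"],
)

heaviest_label = toy["body_mass_g"].idxmax()  # label of the maximum; NaN is skipped
heaviest_row = toy.loc[heaviest_label]        # the full row as a Series

print(heaviest_label)           # p3
print(heaviest_row["species"])  # Gentoo
```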

#### 📚 Unique Categorical Classes

In [None]:
penguins.island.nunique()

Some examples:

In [None]:
penguins.island.unique()[:2]

#### 🧾 General Summary: describe()

We can get common summary statistics for all columns at once. By default, this will only be numeric columns, but here, we will summarize everything together:

In [None]:
penguins.describe(include='all')

**Important**: `NaN` values signify missing data. For instance, the `species` column contains strings, so there is no value for `mean`; likewise, `body_mass_g` is numeric, so we don't have entries for the categorical summary statistics (`unique`, `top`, `freq`).
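
The pattern of `NaN` values is easy to check on a tiny example. This sketch uses hypothetical toy data with one string column and one numeric column:

```python
import pandas as pd

# Hypothetical toy data (illustrative values only)
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "body_mass_g": [3750.0, 3800.0, 5000.0],
})

summary = toy.describe(include="all")
print(summary)

# Numeric statistics are missing for the string column ...
print(pd.isna(summary.loc["mean", "species"]))     # True
# ... and categorical statistics are missing for the numeric column
print(pd.isna(summary.loc["top", "body_mass_g"]))  # True
print(summary.loc["top", "species"])               # Adelie
```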

#### Check out the documentation for more descriptive statistics:

- [Series](https://pandas.pydata.org/docs/reference/series.html#computations-descriptive-stats)
- [DataFrame](https://pandas.pydata.org/docs/reference/frame.html#computations-descriptive-stats)