## Examine the Data

We load the data with `load_dataset` from the Hugging Face `datasets` library, convert it to a pandas DataFrame, and take a quick look with `head()`.

In [1]:
# Importing Libraries
import pandas as pd
import numpy as np
from datasets import load_dataset
import matplotlib.pyplot as plt

# Load the dataset
dataset = load_dataset("yiqing111/Engineering_Jobs_Insight_Dataset")
# Convert to a pandas DataFrame
df = dataset['train'].to_pandas()
# Replace spaces in column names with underscores
df.columns = df.columns.str.replace(' ', '_')
# Convert 'Date_Posted' to datetime without specifying the exact format;
# values that cannot be parsed become NaT (errors='coerce')
df['Date_Posted'] = pd.to_datetime(df['Date_Posted'], errors='coerce')

Repo card metadata block was not found. Setting CardData to empty.


In [2]:
df.head()

,Job_Title,Company,Description,Location,Salary_Min,Salary_Max,Date_Posted,URL
0,Senior Software Engineer (Python),BP Energy,Entity: Trading & Shipping Job Family Group: S...,"Crestwood, Houston",138992.4,138992.4,2024-10-29 16:35:26+00:00,https://www.adzuna.com/land/ad/4917931721?se=N...
1,Sr. Backend Software Engineer,Meijer,"As a family company, we serve people and commu...","Belmont, Kent County",118638.8,118638.8,2024-11-10 01:13:11+00:00,https://www.adzuna.com/land/ad/4933370156?se=N...
2,Sr. Software Engineer - Mobile,Meijer,"As a family company, we serve people and commu...","Belmont, Kent County",108041.95,108041.95,2024-10-15 11:51:30+00:00,https://www.adzuna.com/land/ad/4902683574?se=N...
3,Acquisition Software Engineer,Naval Air Systems Command,Position Description The Harpoon/SLAM ER/JSOW ...,"China Lake, Kern County",88583.57,88583.57,2024-11-16 04:21:41+00:00,https://www.adzuna.com/land/ad/4941260438?se=N...
4,Senior Software Engineer,Innova,A client of Innova Solutions is immediately hi...,"Richardson, Dallas",121932.35,121932.35,2024-11-15 09:42:55+00:00,https://www.adzuna.com/details/4940271538?utm_...


## iloc

We have already seen how to get rows with `iloc[]`.  
But this tool can do much more: it can select *both* rows *and* columns.

To do that, we need to know the integer positions of the rows and columns in our DataFrame.


In [3]:
df.iloc[0]  # first row

Job_Title                      Senior Software Engineer (Python)
Company                                                BP Energy
Description    Entity: Trading & Shipping Job Family Group: S...
Location                                      Crestwood, Houston
Salary_Min                                              138992.4
Salary_Max                                              138992.4
Date_Posted                            2024-10-29 16:35:26+00:00
URL            https://www.adzuna.com/land/ad/4917931721?se=N...
Name: 0, dtype: object

In [4]:
df.iloc[0][5]


np.float64(138992.4)

##### Note: use `df.iloc[0, 5]` instead of `df.iloc[0][5]` to stay compatible with future pandas versions.

Chained indexing such as `df.iloc[0][5]` triggers a deprecation warning in recent pandas versions. `df.iloc[0]` returns a Series labelled by column name, and applying `[5]` to that Series mixes label-based and position-based access: pandas currently falls back to treating the integer as a position, and that fallback is deprecated.  
With `df.iloc[0, 5]` you state the row and column positions directly, which is clearer and avoids future errors caused by how integer keys on a Series are interpreted.

So we should write:

In [6]:
df.iloc[0,5]

np.float64(138992.4)
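
As a minimal sketch of the same idea on a small DataFrame (the `toy` frame and its values are hypothetical, not taken from the job-postings dataset), the single-call form and a purely positional chained form return the same value, and `columns.get_loc` can look up a column's position so you don't have to count by hand:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "Job_Title": ["Engineer", "Analyst"],
    "Salary_Min": [100.0, 80.0],
    "Salary_Max": [120.0, 90.0],
})

# Single indexing call: row position 0, column position 2
direct = toy.iloc[0, 2]

# Chained, but positional at both steps: take the row as a Series,
# then index that Series with .iloc instead of plain []
chained = toy.iloc[0].iloc[2]

# Look up the column position from its label
pos = toy.columns.get_loc("Salary_Max")
by_label = toy.iloc[0, pos]

print(direct, chained, pos, by_label)
```

The single-call form is usually preferable because it states both positions in one place.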

In [7]:
df.iloc[2:4,4:6]

,Salary_Min,Salary_Max
2,108041.95,108041.95
3,88583.57,88583.57


In [7]:
df.iloc[[2,4],[4,5]]

,Salary_Min,Salary_Max
2,108041.95,108041.95
4,121932.35,121932.35
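
The two cells above differ in how positions are given: a slice such as `2:4` excludes its end position, while a list such as `[2, 4]` picks exactly the listed positions. A small sketch on a hypothetical `toy` DataFrame (names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data with five rows and three columns
toy = pd.DataFrame({
    "a": [10, 11, 12, 13, 14],
    "b": [20, 21, 22, 23, 24],
    "c": [30, 31, 32, 33, 34],
})

# Slices: rows at positions 2 and 3 (4 is excluded), columns at positions 1 and 2
block = toy.iloc[2:4, 1:3]

# Lists: exactly the rows at positions 2 and 4, columns at positions 1 and 2
picked = toy.iloc[[2, 4], [1, 2]]

print(block)
print(picked)
```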


## loc

* `df.loc[]`: selects rows and columns by **label** or **label range**.
* It is similar to `df.iloc[]`, but it takes **labels** instead of integer positions.


In [8]:
df.loc[:9, ['Salary_Min', 'Salary_Max']]

,Salary_Min,Salary_Max
0,138992.4,138992.4
1,118638.8,118638.8
2,108041.95,108041.95
3,88583.57,88583.57
4,121932.35,121932.35
5,133348.23,133348.23
6,88769.2,88769.2
7,79830.78,79830.78
8,94173.15,94173.15
9,91324.85,91324.85
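
Note that `df.loc[:9]` returned ten rows (labels 0 through 9): label slices in `loc` include the end label, unlike position slices in `iloc`. A minimal sketch with a hypothetical string index makes the difference easier to see:

```python
import pandas as pd

# Hypothetical toy data with string row labels
toy = pd.DataFrame(
    {"Salary_Min": [50, 60, 70, 80], "Salary_Max": [55, 65, 75, 85]},
    index=["w", "x", "y", "z"],
)

# Label slice: the end label "y" is included
by_label = toy.loc["w":"y", ["Salary_Min"]]

# Position slice: the end position 2 is excluded
by_position = toy.iloc[0:2, [0]]

print(by_label)
print(by_position)
```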


## Handling Missing Data

- `df.dropna()`: drops the rows that contain missing values.

### Fillna

- `df.fillna()`: fills in the missing values.

In [8]:
data = {
    'Nome': ['Anna', 'Luca', 'Marco', 'Elisa'],
    'Età': [25, np.nan, 30, np.nan],
    'Città': ['Roma', 'Milano', np.nan, 'Torino']
}

df2 = pd.DataFrame(data)
df2

,Nome,Età,Città
0,Anna,25.0,Roma
1,Luca,NaN,Milano
2,Marco,30.0,NaN
3,Elisa,NaN,Torino


In [9]:
df_drop = df2.dropna()
df_drop

,Nome,Età,Città
0,Anna,25.0,Roma
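
By default `dropna()` removes every row with at least one missing value, which here leaves only one row. As a sketch of two less drastic variants (the `people` frame below rebuilds the same example data; the variable names are just for illustration), `subset` restricts the check to chosen columns and `axis=1` drops columns instead of rows:

```python
import numpy as np
import pandas as pd

# Same example data as above, rebuilt so the block is self-contained
people = pd.DataFrame({
    "Nome": ["Anna", "Luca", "Marco", "Elisa"],
    "Età": [25, np.nan, 30, np.nan],
    "Città": ["Roma", "Milano", np.nan, "Torino"],
})

# Drop a row only when 'Età' is missing
only_age = people.dropna(subset=["Età"])

# Drop the columns (not the rows) that contain any missing value
only_full_columns = people.dropna(axis=1)

print(only_age)
print(only_full_columns)
```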


In [10]:
media_eta = df2['Età'].mean()
df2['Età'] = df2['Età'].fillna(media_eta)
df2

,Nome,Età,Città
0,Anna,25.0,Roma
1,Luca,27.5,Milano
2,Marco,30.0,NaN
3,Elisa,27.5,Torino
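
The city for Marco is still missing, because only the `Età` column was filled. One possible way to fill several columns at once is to pass `fillna()` a dictionary that maps each column to its replacement value. In this sketch the placeholder `"Unknown"` is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd

# Same example data as above, rebuilt so the block is self-contained
people = pd.DataFrame({
    "Nome": ["Anna", "Luca", "Marco", "Elisa"],
    "Età": [25, np.nan, 30, np.nan],
    "Città": ["Roma", "Milano", np.nan, "Torino"],
})

# Column-specific fill values: the mean age and a placeholder city
filled = people.fillna({
    "Età": people["Età"].mean(),
    "Città": "Unknown",  # arbitrary placeholder
})

print(filled)
```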


## Removing Duplicates

* `drop_duplicates()`: removes duplicate rows.
* Analysts often have to clean data, and duplicated values are among the most common problems they run into.
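
A minimal sketch of `drop_duplicates()` on a small, hypothetical `jobs` DataFrame (the company names are invented). The `subset` and `keep` parameters control which columns define a duplicate and which occurrence survives:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical
jobs = pd.DataFrame({
    "Job_Title": ["Engineer", "Engineer", "Analyst", "Engineer"],
    "Company": ["Acme", "Acme", "Acme", "Globex"],
})

# Remove rows that are identical in every column (the first occurrence is kept)
unique_rows = jobs.drop_duplicates()

# Treat rows as duplicates when 'Job_Title' matches, keeping the last occurrence
unique_titles = jobs.drop_duplicates(subset=["Job_Title"], keep="last")

print(unique_rows)
print(unique_titles)
```

`drop_duplicates()` returns a new DataFrame and leaves `jobs` unchanged.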

In [15]:
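# df.copy() creates an independent copy of df
df_original = df.copy()
# Plain assignment does not copy: df_altered is a second name for the same object
df_altered = df_original

# Overwrite every value in 'Salary_Min' with 5 (this also changes df_original)
df_altered['Salary_Min'] = 5
df_altered['Salary_Min']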

0        5
1        5
2        5
3        5
4        5
        ..
11180    5
11181    5
11182    5
11183    5
11184    5
Name: Salary_Min, Length: 11185, dtype: int64

In [16]:
df_original['Salary_Min']

0        5
1        5
2        5
3        5
4        5
        ..
11180    5
11181    5
11182    5
11183    5
11184    5
Name: Salary_Min, Length: 11185, dtype: int64

In [14]:
df['Salary_Min']

0        138992.40
1        118638.80
2        108041.95
3         88583.57
4        121932.35
           ...    
11180    110000.00
11181    110000.00
11182    126635.63
11183    155731.52
11184    174527.50
Name: Salary_Min, Length: 11185, dtype: float64

In [17]:

print('ID of df_original:               ', id(df_original))
print('ID of df_altered:                ', id(df_altered))
print('ID of df:                        ', id(df))

ID of df_original:                6265881808
ID of df_altered:                 6265881808
ID of df:                         6291379008
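
The matching IDs show that `df_altered` and `df_original` are two names for one object, while `df` is a separate object and kept its original salaries. The same behaviour can be sketched on a small, hypothetical DataFrame, using `is` instead of comparing IDs by eye:

```python
import pandas as pd

# Hypothetical toy data
base = pd.DataFrame({"Salary_Min": [100.0, 200.0]})

alias = base               # second name for the same object
independent = base.copy()  # new object with its own data

# Writing through the alias also changes `base`, but not the copy
alias["Salary_Min"] = 5

print(alias is base)        # same object
print(independent is base)  # different object
print(base["Salary_Min"].tolist(), independent["Salary_Min"].tolist())
```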


## Sampling

* `sample()`: draws a random sample of rows.

### Examples

Let's take a random sample of the data.  
You can request a sample with a fixed number of rows.


In [21]:
df.sample(n=2)

,Job_Title,Company,Description,Location,Salary_Min,Salary_Max,Date_Posted,URL
847,Software Engineer,Moveworks.ai,Who We Are Moveworks is the universal AI copil...,"Mountain View, Santa Clara County",181433.84,181433.84,2024-11-19 14:11:27+00:00,https://www.adzuna.com/details/4944619905?utm_...
10780,Lead Solutions Architect,Humana,Become a part of our caring community and help...,"West End, Dauphin County",114487.07,114487.07,2024-11-19 14:09:06+00:00,https://www.adzuna.com/details/4944595636?utm_...


Alternatively, you can randomly select a fraction of the data (for example, 10% of the rows), with or without replacement.


In [28]:
df.sample(frac=0.10, replace=False)  # without replacement: each row can appear at most once in the sample



,Job_Title,Company,Description,Location,Salary_Min,Salary_Max,Date_Posted,URL
5141,Business Analyst,Maximus,"Description & Requirements Maximus, Inc. is lo...","Jenks, Tulsa County",72010.59,72010.59,2024-11-20 10:55:28+00:00,https://www.adzuna.com/details/4945472946?utm_...
3823,Associate Product Manager Social app startup,Cheez,Cheez is a new app that sends you the pictures...,"Kansas City, Wyandotte County",80706.04,80706.04,2024-05-02 12:57:18+00:00,https://www.adzuna.com/land/ad/4675826411?se=g...
4896,Principal Product Manager - Controls,Lennox International,Lennox (NYSE: LII) Driven by 130 years of lega...,"Richardson, Dallas",133322.16,133322.16,2024-11-16 11:04:03+00:00,https://www.adzuna.com/details/4941503812?utm_...
7786,Cloud Engineer,02 Caci-Federal,Cloud Engineer Job Category: Information Techn...,US,74600.00,156700.00,2024-10-26 05:47:35+00:00,https://www.adzuna.com/details/4914828971?utm_...
1324,Fullstack Software Engineer,Hadrian,Hadrian — Manufacturing the Future Hadrian is ...,"Los Angeles, Los Angeles County",100361.40,100361.40,2024-08-01 05:30:31+00:00,https://www.adzuna.com/details/4804621695?utm_...
...,...,...,...,...,...,...,...,...
3741,Product Manager Social app startup,Cheez,Cheez is a new app that sends you the pictures...,"Indianapolis, Marion County",83070.43,83070.43,2024-05-02 13:04:28+00:00,https://www.adzuna.com/land/ad/4675829230?se=v...
8877,"Cyber Security Engineer - Hybrid Alexandria, VA",Addison Group,Position Title: Cyber Security Engineer Locati...,"Alexandria, Alexandria City",142336.69,142336.69,2024-11-23 11:35:45+00:00,https://www.adzuna.com/details/4949062968?utm_...
6321,UNIV - Information Systems/Business Analyst II...,MUSC,Job Description Summary The Department of Publ...,"Charleston, Charleston County",58909.35,58909.35,2024-11-07 05:50:45+00:00,https://www.adzuna.com/details/4929758478?utm_...
759,Software Engineer,Mudrasys,Job Id: C2S_ Software Engineer _0123_2024 Job ...,"Times Square, King County",96723.48,96723.48,2024-11-19 14:11:08+00:00,https://www.adzuna.com/details/4944617059?utm_...
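
Each call to `sample()` returns different rows. As a sketch on a hypothetical `toy` DataFrame, passing `random_state` makes the draw reproducible, and `replace=True` allows the same row to be drawn more than once (so the sample can even be larger than the data):

```python
import pandas as pd

# Hypothetical toy data with 100 rows
toy = pd.DataFrame({"x": range(100)})

# The same seed gives the same sample
first = toy.sample(n=5, random_state=42)
second = toy.sample(n=5, random_state=42)

# With replacement, rows may repeat, so n can exceed the number of rows
oversample = toy.sample(n=150, replace=True, random_state=0)

print(first.equals(second))
print(len(oversample), oversample.index.duplicated().any())
```

The seed values 42 and 0 are arbitrary.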


## Pandas Pivot

`pivot()` reshapes a DataFrame: the values of one column become the row index, the values of another become the column headers, and a third column fills the cells. It is useful when you want to reorganize data into a tabular form that is easier to read or better suited to analysis.



In [29]:
df = pd.DataFrame({
    'Data': ['2024-01', '2024-01', '2024-02', '2024-02'],
    'Prodotto': ['A', 'B', 'A', 'B'],
    'Vendite': [100, 150, 120, 130]
})
df

,Data,Prodotto,Vendite
0,2024-01,A,100
1,2024-01,B,150
2,2024-02,A,120
3,2024-02,B,130


In [None]:
# Reorganize the data so that the rows (index) are the dates, the columns are the products, and the cells hold the sales values.

df.pivot(index='Data', columns='Prodotto', values='Vendite')

Prodotto,A,B
Data,,
2024-01,100,150
2024-02,120,130




* `pivot_table()` is a more flexible version of `pivot()`, because it lets you aggregate the data when the same index/column combination appears more than once.
* Syntax: `pivot_table(values='column_to_aggregate', index='row_index', columns='column_index', aggfunc='mean')`


In [30]:
df = pd.DataFrame({
    'Data': ['2024-01', '2024-01', '2024-01', '2024-02'],
    'Prodotto': ['A', 'A', 'B', 'B'],
    'Vendite': [100, 120, 150, 130]
})
df

,Data,Prodotto,Vendite
0,2024-01,A,100
1,2024-01,A,120
2,2024-01,B,150
3,2024-02,B,130


In [31]:
df.pivot(index='Data', columns='Prodotto', values='Vendite')

ValueError: Index contains duplicate entries, cannot reshape

In [32]:
df.pivot_table(index='Data', columns='Prodotto', values='Vendite', aggfunc='sum')

Prodotto,A,B
Data,,
2024-01,220.0,150.0
2024-02,NaN,130.0
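
Product A has no sales in 2024-02, so that cell is missing and the columns are shown as floats. If a missing combination should count as zero, one option is the `fill_value` parameter of `pivot_table()`. The sketch below rebuilds the same data under the hypothetical name `sales`:

```python
import pandas as pd

# Same example data as above, rebuilt so the block is self-contained
sales = pd.DataFrame({
    "Data": ["2024-01", "2024-01", "2024-01", "2024-02"],
    "Prodotto": ["A", "A", "B", "B"],
    "Vendite": [100, 120, 150, 130],
})

# Sum duplicate (Data, Prodotto) pairs and put 0 where a pair never occurs
table = sales.pivot_table(
    index="Data",
    columns="Prodotto",
    values="Vendite",
    aggfunc="sum",
    fill_value=0,
)

print(table)
```

Whether zero is the right replacement depends on the analysis: "no sales recorded" and "zero sales" are not always the same thing.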


In [33]:
df.pivot_table(index='Prodotto', values='Vendite', aggfunc='sum')

,Vendite
Prodotto,
A,220
B,280


In [None]:
df.groupby('Prodotto')['Vendite'].sum()


Prodotto
A    220
B    280
Name: Vendite, dtype: int64

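## Example
Let's count how many postings there are for each job title. The pivot examples above reassigned `df`, so `df` here must again be the job-postings DataFrame from the first section.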

In [None]:
df.pivot_table(index='Job_Title', aggfunc='size').sort_values(ascending=False)

Job_Title
Software Engineer                                                     311
Product Manager  Social app startup                                   282
Associate Product Manager  Social app startup                         277
2025 Virtual Summer Intern Program - Product Analyst Intern (Xome)    267
Real-Time Software Engineer                                           266
                                                                     ... 
Information Systems Security Engineer, Senior                           1
Information Systems Security Engineering (ISSE)                         1
Information Systems Security Officer/ Engineer                          1
Information Technology - Senior Business Analyst IT                     1
☁ Cloud HPC Engineer ☁                                                  1
Length: 4848, dtype: int64

In [None]:
df.groupby('Job_Title').size().sort_values(ascending=False)

Job_Title
Software Engineer                                                     311
Product Manager  Social app startup                                   282
Associate Product Manager  Social app startup                         277
2025 Virtual Summer Intern Program - Product Analyst Intern (Xome)    267
Real-Time Software Engineer                                           266
                                                                     ... 
Information Systems Security Engineer, Senior                           1
Information Systems Security Engineering (ISSE)                         1
Information Systems Security Officer/ Engineer                          1
Information Technology - Senior Business Analyst IT                     1
☁ Cloud HPC Engineer ☁                                                  1
Length: 4848, dtype: int64
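
For this kind of frequency count, `Series.value_counts()` is a common shortcut: it counts the occurrences of each value and sorts them in descending order by default. A minimal sketch on a hypothetical `toy` DataFrame, compared with the `groupby` form used above:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Job_Title": ["Engineer", "Analyst", "Engineer", "Engineer", "Manager", "Analyst"],
})

# Count occurrences of each title, most frequent first
counts = toy["Job_Title"].value_counts()

# Equivalent count with groupby, sorted the same way
grouped = toy.groupby("Job_Title").size().sort_values(ascending=False)

print(counts)
print(grouped)
```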