
# Pandas: Accessing Data

## Reviewing the Data

Let's load the data with `load_dataset`, convert it to a DataFrame, and take a quick look at it using `head()`.

In [None]:
!pip install datasets

In [1]:
# Importing Libraries
import pandas as pd
from datasets import load_dataset
import matplotlib.pyplot as plt

# Loading Data
dataset = load_dataset('lukebarousse/data_jobs')
df = dataset['train'].to_pandas()

# Data Cleanup
df['job_posted_date'] = pd.to_datetime(df['job_posted_date'])

In [2]:
df.head()

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
0,Senior Data Engineer,Senior Clinical Data Engineer / Principal Clin...,"Watertown, CT",via Work Nearby,Full-time,False,"Texas, United States",2023-06-16 13:44:15,False,False,United States,,,,Boehringer Ingelheim,,
1,Data Analyst,Data Analyst,"Guadalajara, Jalisco, Mexico",via BeBee México,Full-time,False,Mexico,2023-01-14 13:18:07,False,False,Mexico,,,,Hewlett Packard Enterprise,"['r', 'python', 'sql', 'nosql', 'power bi', 't...","{'analyst_tools': ['power bi', 'tableau'], 'pr..."
2,Data Engineer,"Data Engineer/Scientist/Analyst, Mid or Senior...","Berlin, Germany",via LinkedIn,Full-time,False,Germany,2023-10-10 13:14:55,False,False,Germany,,,,ALPHA Augmented Services,"['python', 'sql', 'c#', 'azure', 'airflow', 'd...","{'analyst_tools': ['dax'], 'cloud': ['azure'],..."
3,Data Engineer,LEAD ENGINEER - PRINCIPAL ANALYST - PRINCIPAL ...,"San Antonio, TX",via Diversity.com,Full-time,False,"Texas, United States",2023-07-04 13:01:41,True,False,United States,,,,Southwest Research Institute,"['python', 'c++', 'java', 'matlab', 'aws', 'te...","{'cloud': ['aws'], 'libraries': ['tensorflow',..."
4,Data Engineer,Data Engineer- Sr Jobs,"Washington, DC",via Clearance Jobs,Full-time,False,Sudan,2023-08-07 14:29:36,False,False,Sudan,,,,Kristina Daniel,"['bash', 'python', 'oracle', 'aws', 'ansible',...","{'cloud': ['oracle', 'aws'], 'other': ['ansibl..."


We learned earlier how to get rows using `iloc[]`, but it can do much more: it can select both rows and columns.

## iloc

### Notes

* `df.iloc[]`: Selects rows and columns by position.

### Examples
Using `iloc` we will:

1. Get the first row (index 0).
2. Get the `job_skills` element of the third row (index 2).
3. Get `job_skills` (index 15) and `job_type_skills` (index 16) for the third (index 2) and fourth (index 3) rows.
4. Get the first nine rows of the DataFrame.
5. Get the first five columns of the DataFrame and all rows.

For this we need to know the positional index numbers of the rows and columns in our DataFrame.

![image](https://github.com/sebabecerra/CursoBasicoPython/blob/main/images/iloc_visual_1.png?raw=1)
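If you prefer not to count columns by eye, their positions can also be looked up in code. The sketch below uses a small, hypothetical DataFrame named `toy` (only the column names are borrowed from the dataset) and assumes the column labels are unique.

```python
import pandas as pd

# Hypothetical toy DataFrame; the values are made up for illustration.
toy = pd.DataFrame({
    "job_title_short": ["Data Analyst", "Data Engineer"],
    "job_country": ["Mexico", "Germany"],
    "job_skills": ["['sql']", "['python']"],
})

# Map every position to its column name.
positions = dict(enumerate(toy.columns))

# Look up the position of a single column by its label.
skills_pos = toy.columns.get_loc("job_skills")

print(positions)
print(skills_pos)
```

The same two calls should work on `df`, where we would expect `job_skills` to sit at position 15.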


1. Get the first row.

In [None]:
df.iloc[0]

job_title_short                                       Senior Data Engineer
job_title                Senior Clinical Data Engineer / Principal Clin...
job_location                                                 Watertown, CT
job_via                                                    via Work Nearby
job_schedule_type                                                Full-time
job_work_from_home                                                   False
search_location                                       Texas, United States
job_posted_date                                        2023-06-16 13:44:15
job_no_degree_mention                                                False
job_health_insurance                                                 False
job_country                                                  United States
salary_rate                                                           None
salary_year_avg                                                        NaN
salary_hour_avg          

2. Get the `job_skills` element of the third row (index 2).

In [None]:
df.iloc[2][15]

"['python', 'sql', 'c#', 'azure', 'airflow', 'dax', 'docker', 'kubernetes', 'jenkins']"

##### Note: Use `df.iloc[2, 15]` instead of `df.iloc[2][15]` for forward compatibility with pandas.

`df.iloc[2]` returns a Series whose index holds the column names, so the second step, `[15]`, applies an integer key to a label-based index. pandas currently falls back to treating that key as a position, but this fallback is deprecated, and in a future version integer keys on a Series will always be treated as labels. Writing `df.iloc[2, 15]` states the row and column positions in a single call, which is clearer and avoids that ambiguity.
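As a minimal sketch of the difference, the example below uses a hypothetical three-row DataFrame named `toy`. It shows the single-call form and, for cases where a row has already been extracted, a chained form that stays purely positional by using `.iloc` on the intermediate Series.

```python
import pandas as pd

# Hypothetical toy DataFrame for illustration.
toy = pd.DataFrame({
    "title": ["Analyst", "Engineer", "Scientist"],
    "skills": ["sql", "python", "r"],
})

# Single call: row position 2, column position 1.
direct = toy.iloc[2, 1]

# The intermediate row is a Series indexed by column names,
# so a bare integer key on it would be ambiguous.
row = toy.iloc[2]
row_labels = list(row.index)

# Staying positional on the Series means using .iloc again.
chained = row.iloc[1]

print(direct, chained, row_labels)
```

Both forms return the same value here; the single call is simply the shorter and less error-prone way to write it.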

So instead we should write:


In [None]:
df.iloc[2,15]

"['python', 'sql', 'c#', 'azure', 'airflow', 'dax', 'docker', 'kubernetes', 'jenkins']"

3. Get `job_skills` (index 15) and `job_type_skills` (index 16) for the third (index 2) and fourth (index 3) rows.
    * To get the third (index 2) and fourth (index 3) rows: `[2,3]`
    * To get `job_skills` and `job_type_skills`, which are at indices 15 and 16 respectively: `[15,16]`
    * Then pass both lists to `iloc`, rows first and columns second: `df.iloc[[2,3],[15,16]]`

![image2](https://github.com/sebabecerra/CursoBasicoPython/blob/main/images/iloc_visual_2_a.png?raw=1)


In [None]:
df.iloc[[2,3],[15,16]]

Unnamed: 0,job_skills,job_type_skills
2,"['python', 'sql', 'c#', 'azure', 'airflow', 'd...","{'analyst_tools': ['dax'], 'cloud': ['azure'],..."
3,"['python', 'c++', 'java', 'matlab', 'aws', 'te...","{'cloud': ['aws'], 'libraries': ['tensorflow',..."
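Lists of positions are not limited to neighbouring rows or columns. The sketch below, on a hypothetical DataFrame named `toy`, compares a list-based selection with the equivalent slice and then picks positions that are not adjacent, which a single slice cannot express.

```python
import pandas as pd

# Hypothetical toy DataFrame for illustration.
toy = pd.DataFrame({
    "a": [10, 11, 12, 13],
    "b": [20, 21, 22, 23],
    "c": [30, 31, 32, 33],
})

# Adjacent positions: a list and a slice select the same cells.
by_list = toy.iloc[[1, 2], [1, 2]]
by_slice = toy.iloc[1:3, 1:3]
same_values = by_list.values.tolist() == by_slice.values.tolist()

# Non-adjacent positions: first and last rows, first and last columns.
picked = toy.iloc[[0, 3], [0, 2]]

print(same_values)
print(picked)
```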


##### Preview

Below are a few more examples of what `iloc` can do. Pay close attention to them, because we will use them later.

4. Get the first nine rows of the DataFrame. The end of an `iloc` slice is exclusive, so `:9` returns positions 0 through 8; use `df.iloc[:10]` for ten rows.

In [None]:
df.iloc[:9]

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
0,Senior Data Engineer,Senior Clinical Data Engineer / Principal Clin...,"Watertown, CT",via Work Nearby,Full-time,False,"Texas, United States",2023-06-16 13:44:15,False,False,United States,,,,Boehringer Ingelheim,,
1,Data Analyst,Data Analyst,"Guadalajara, Jalisco, Mexico",via BeBee México,Full-time,False,Mexico,2023-01-14 13:18:07,False,False,Mexico,,,,Hewlett Packard Enterprise,"['r', 'python', 'sql', 'nosql', 'power bi', 't...","{'analyst_tools': ['power bi', 'tableau'], 'pr..."
2,Data Engineer,"Data Engineer/Scientist/Analyst, Mid or Senior...","Berlin, Germany",via LinkedIn,Full-time,False,Germany,2023-10-10 13:14:55,False,False,Germany,,,,ALPHA Augmented Services,"['python', 'sql', 'c#', 'azure', 'airflow', 'd...","{'analyst_tools': ['dax'], 'cloud': ['azure'],..."
3,Data Engineer,LEAD ENGINEER - PRINCIPAL ANALYST - PRINCIPAL ...,"San Antonio, TX",via Diversity.com,Full-time,False,"Texas, United States",2023-07-04 13:01:41,True,False,United States,,,,Southwest Research Institute,"['python', 'c++', 'java', 'matlab', 'aws', 'te...","{'cloud': ['aws'], 'libraries': ['tensorflow',..."
4,Data Engineer,Data Engineer- Sr Jobs,"Washington, DC",via Clearance Jobs,Full-time,False,Sudan,2023-08-07 14:29:36,False,False,Sudan,,,,Kristina Daniel,"['bash', 'python', 'oracle', 'aws', 'ansible',...","{'cloud': ['oracle', 'aws'], 'other': ['ansibl..."
5,Data Engineer,GCP Data Engineer,Anywhere,via ZipRecruiter,Contractor and Temp work,True,Georgia,2023-11-07 14:01:59,False,False,United States,,,,smart folks inc,"['python', 'sql', 'gcp']","{'cloud': ['gcp'], 'programming': ['python', '..."
6,Senior Data Engineer,Senior Data Engineer - GCP Cloud,"Dearborn, MI",via LinkedIn,Full-time,False,"Florida, United States",2023-03-27 13:18:18,False,False,United States,,,,"Miracle Software Systems, Inc","['sql', 'python', 'java', 'sql server', 'gcp',...","{'cloud': ['gcp', 'bigquery'], 'databases': ['..."
7,Data Engineer,Data Engineer,Anywhere,via LinkedIn,Full-time,True,Romania,2023-12-07 13:40:49,False,False,Romania,,,,Zitec,"['sql', 'nosql', 'gcp', 'azure', 'aws', 'bigqu...","{'cloud': ['gcp', 'azure', 'aws', 'bigquery', ..."
8,Business Analyst,Technology & Operations Business Analyst,"Copenhagen, Denmark",via Trabajo.org,Full-time,False,Denmark,2023-06-05 13:44:34,False,False,Denmark,,,,Hempel,"['excel', 'powerpoint', 'power bi']","{'analyst_tools': ['excel', 'powerpoint', 'pow..."
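The end position of an `iloc` slice is excluded, just as in ordinary Python slicing, which makes it easy to be off by one. A short sketch on a hypothetical 20-row DataFrame named `frame`:

```python
import pandas as pd

# Hypothetical 20-row DataFrame with the default integer index.
frame = pd.DataFrame({"value": range(20)})

first_nine = frame.iloc[:9]    # positions 0 through 8
first_ten = frame.iloc[:10]    # positions 0 through 9

# head(10) is a convenient equivalent of iloc[:10].
matches_head = first_ten.equals(frame.head(10))

print(len(first_nine), first_nine.index[-1])
print(len(first_ten), first_ten.index[-1])
```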


5. Get the first five columns of the DataFrame and all rows.

In [None]:
df.iloc[:, :5]

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type
0,Senior Data Engineer,Senior Clinical Data Engineer / Principal Clin...,"Watertown, CT",via Work Nearby,Full-time
1,Data Analyst,Data Analyst,"Guadalajara, Jalisco, Mexico",via BeBee México,Full-time
2,Data Engineer,"Data Engineer/Scientist/Analyst, Mid or Senior...","Berlin, Germany",via LinkedIn,Full-time
3,Data Engineer,LEAD ENGINEER - PRINCIPAL ANALYST - PRINCIPAL ...,"San Antonio, TX",via Diversity.com,Full-time
4,Data Engineer,Data Engineer- Sr Jobs,"Washington, DC",via Clearance Jobs,Full-time
...,...,...,...,...,...
785736,Software Engineer,DevOps Engineer,Singapura,melalui Trabajo.org,Pekerjaan tetap
785737,Data Analyst,CRM Data Analyst,"Bad Rodach, Jerman",melalui BeBee Deutschland,Pekerjaan tetap
785738,Business Analyst,Commercial Analyst - Start Now,Malaysia,melalui Ricebowl,Pekerjaan tetap
785739,Data Engineer,"Principal Associate, Data Engineer (Remote-Eli...","Newark, New Jersey, Amerika Serikat",melalui Recruit.net,Pekerjaan tetap


## loc

### Notes

* `df.loc[]`: Selects rows and columns by label (or with a boolean mask), not by position.
* Similar to `df.iloc[]`, except that we use labels instead of integer positions, and slices include their end label.
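The distinction between labels and positions is easiest to see when the row index is not the default 0, 1, 2, and so on. The sketch below uses a hypothetical DataFrame named `toy` with made-up string labels `x`, `y`, and `z`.

```python
import pandas as pd

# Hypothetical toy DataFrame with string row labels.
toy = pd.DataFrame(
    {"title": ["Analyst", "Engineer", "Scientist"], "remote": [True, False, True]},
    index=["x", "y", "z"],
)

# Same cell, addressed by label and by position.
by_label = toy.loc["y", "title"]
by_position = toy.iloc[1, 0]

# A loc slice includes its end label; an iloc slice excludes its end position.
label_slice = toy.loc["x":"y", "title"]   # rows x and y
position_slice = toy.iloc[0:1, 0]         # row at position 0 only

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

With the default integer index used by `df`, row labels and row positions happen to coincide, which is why `loc` and `iloc` can look interchangeable for rows in this notebook.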

### Example

We'll select data much as before, but with `loc`, which uses row and column labels. We will:

1. Get the first row (index 0).
2. Get the first 10 rows of `job_skills` and `job_type_skills`.
3. Get the first six columns (`job_title_short` through `job_work_from_home`) for rows 10 to 20.
4. Get the first 12 rows of the DataFrame.
5. Get the first 5 columns of the DataFrame and all rows.

1. Get the first row. The call looks the same as with `iloc` because the DataFrame uses the default integer index, so the label of the first row is also `0`.

In [None]:
df.loc[0]

job_title_short                                       Senior Data Engineer
job_title                Senior Clinical Data Engineer / Principal Clin...
job_location                                                 Watertown, CT
job_via                                                    via Work Nearby
job_schedule_type                                                Full-time
job_work_from_home                                                   False
search_location                                       Texas, United States
job_posted_date                                        2023-06-16 13:44:15
job_no_degree_mention                                                False
job_health_insurance                                                 False
job_country                                                  United States
salary_rate                                                           None
salary_year_avg                                                        NaN
salary_hour_avg          

2. Get the first 10 rows of `job_skills` and `job_type_skills`. The slice `:9` covers labels 0 through 9 because `loc` includes the end label.

In [None]:
df.loc[:9,['job_skills','job_type_skills']]


Unnamed: 0,job_skills,job_type_skills
0,,
1,"['r', 'python', 'sql', 'nosql', 'power bi', 't...","{'analyst_tools': ['power bi', 'tableau'], 'pr..."
2,"['python', 'sql', 'c#', 'azure', 'airflow', 'd...","{'analyst_tools': ['dax'], 'cloud': ['azure'],..."
3,"['python', 'c++', 'java', 'matlab', 'aws', 'te...","{'cloud': ['aws'], 'libraries': ['tensorflow',..."
4,"['bash', 'python', 'oracle', 'aws', 'ansible',...","{'cloud': ['oracle', 'aws'], 'other': ['ansibl..."
5,"['python', 'sql', 'gcp']","{'cloud': ['gcp'], 'programming': ['python', '..."
6,"['sql', 'python', 'java', 'sql server', 'gcp',...","{'cloud': ['gcp', 'bigquery'], 'databases': ['..."
7,"['sql', 'nosql', 'gcp', 'azure', 'aws', 'bigqu...","{'cloud': ['gcp', 'azure', 'aws', 'bigquery', ..."
8,"['excel', 'powerpoint', 'power bi']","{'analyst_tools': ['excel', 'powerpoint', 'pow..."
9,"['sql', 'python', 'r', 'mongodb', 'mongodb', '...","{'analyst_tools': ['excel'], 'cloud': ['azure'..."


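The same selection can also be written as a chain, taking the rows first and the columns second. It returns the same result here, though the single `loc` call above is the form to prefer when assigning values.

In [None]: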
df.loc[:9][['job_skills','job_type_skills']]

Unnamed: 0,job_skills,job_type_skills
0,,
1,"['r', 'python', 'sql', 'nosql', 'power bi', 't...","{'analyst_tools': ['power bi', 'tableau'], 'pr..."
2,"['python', 'sql', 'c#', 'azure', 'airflow', 'd...","{'analyst_tools': ['dax'], 'cloud': ['azure'],..."
3,"['python', 'c++', 'java', 'matlab', 'aws', 'te...","{'cloud': ['aws'], 'libraries': ['tensorflow',..."
4,"['bash', 'python', 'oracle', 'aws', 'ansible',...","{'cloud': ['oracle', 'aws'], 'other': ['ansibl..."
5,"['python', 'sql', 'gcp']","{'cloud': ['gcp'], 'programming': ['python', '..."
6,"['sql', 'python', 'java', 'sql server', 'gcp',...","{'cloud': ['gcp', 'bigquery'], 'databases': ['..."
7,"['sql', 'nosql', 'gcp', 'azure', 'aws', 'bigqu...","{'cloud': ['gcp', 'azure', 'aws', 'bigquery', ..."
8,"['excel', 'powerpoint', 'power bi']","{'analyst_tools': ['excel', 'powerpoint', 'pow..."
9,"['sql', 'python', 'r', 'mongodb', 'mongodb', '...","{'analyst_tools': ['excel'], 'cloud': ['azure'..."
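To check that the two spellings agree, and to see why the single call matters for assignment, here is a minimal sketch on a hypothetical DataFrame named `toy` with made-up columns.

```python
import pandas as pd

# Hypothetical toy DataFrame for illustration.
toy = pd.DataFrame({
    "skills": ["sql", "python", "r", "go"],
    "type": ["a", "b", "c", "d"],
    "other": [1, 2, 3, 4],
})

# Reading: rows and columns in one call, or rows first and then columns.
single = toy.loc[:2, ["skills", "type"]]
chained = toy.loc[:2][["skills", "type"]]
same = single.equals(chained)

# Writing: the single call assigns directly into toy.
toy.loc[:2, "other"] = 0

print(same)
print(toy["other"].tolist())
```

For reads the two forms are interchangeable. For writes, a chained expression may end up modifying a temporary copy rather than the original DataFrame, so the single `loc[rows, columns]` call is the reliable choice.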


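3. Get the first six columns for rows 10 to 20. Unlike `iloc`, a `loc` slice includes its end label, so the column slice runs through `job_work_from_home` and the row slice runs through label 20.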

In [None]:
df.loc[10:20,'job_title_short':'job_work_from_home']

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home
10,Data Engineer,Erfahrene*r Data Engineer*in (m/w/d),"Basel, Switzerland",via LinkedIn,Full-time,False
11,Data Engineer,Data Engineer,"Colorado Springs, CO (+3 others)",via The Muse,Full-time,False
12,Data Analyst,Stagiaire Data Analyst (H/F) - Lyon (69006),"Lyon, France",via Jobijoba,Full-time and Internship,False
13,Senior Data Engineer,Senior Data Engineer,"New York, NY",via LinkedIn,Full-time,False
14,Data Engineer,Data Engineer,"Rio de Janeiro, State of Rio de Janeiro, Brazil",via BeBee,Full-time,False
15,Data Engineer,Data Engineer,"Aberdeen, UK",via LinkedIn,Temp work,False
16,Data Engineer,Big Data Engineer,"Pune, Maharashtra, India",via LinkedIn,Full-time,False
17,Data Scientist,Data Science Team Lead,"Laurel, MD",via APL Careers - Johns Hopkins University App...,Full-time,False
18,Data Analyst,Data Analyst,"Warsaw, Poland",via Praca Trabajo.org,Full-time,False
19,Data Scientist,Data Scientist,"Pune, Maharashtra, India",via Indeed,Full-time,False
