## Section 1 - Cleaning a JSON file with pandas

In [1]:
import pandas as pd

### Loading the data

**Load the JSON file into a DataFrame**

In [2]:
df = pd.read_json('data_large.json')
# print("Original DataFrame:")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    100 non-null    object
 1   age     52 non-null     object
 2   city    100 non-null    object
 3   email   100 non-null    object
dtypes: object(4)
memory usage: 3.3+ KB


In [4]:
df.columns

Index(['name', 'age', 'city', 'email'], dtype='object')

In [5]:
df.dtypes

name     object
age      object
city     object
email    object
dtype: object

In [6]:
df.head(50)

Unnamed: 0,name,age,city,email
0,Patricia Miller,,Paris,user0@example.com
1,Michael Brown,unknown,Paris,user1@example.com
2,Robert Wilson,31,Madrid,user2@example.com
3,Linda Johnson,unknown,London,user3@example.com
4,Maria Garcia,,Madrid,user4@example.com
5,Peter Jones,29,Rio de Janeiro,user5@example.com
6,Michael Brown,unknown,,user6@example.com
7,John Doe,unknown,Moscow,user7@example.com
8,James Lee,25,New York,user8@example.com
9,Maria Garcia,,Berlin,user9@example.com


### Convert the age field to a numeric type

In [7]:
# Coerce non-numeric values in 'age' (e.g. 'unknown') to NaN
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df.head(60)

Unnamed: 0,name,age,city,email
0,Patricia Miller,,Paris,user0@example.com
1,Michael Brown,,Paris,user1@example.com
2,Robert Wilson,31.0,Madrid,user2@example.com
3,Linda Johnson,,London,user3@example.com
4,Maria Garcia,,Madrid,user4@example.com
5,Peter Jones,29.0,Rio de Janeiro,user5@example.com
6,Michael Brown,,,user6@example.com
7,John Doe,,Moscow,user7@example.com
8,James Lee,25.0,New York,user8@example.com
9,Maria Garcia,,Berlin,user9@example.com
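
As a small illustration of what `errors='coerce'` does, the sketch below runs `pd.to_numeric` on a hypothetical series (`raw_age` is made up for this example, not taken from `data_large.json`) that mixes numeric strings, the literal `'unknown'`, an empty string, and `None`. Every value that cannot be parsed becomes `NaN`, which is why the column ends up as floats.

```python
import pandas as pd

# Hypothetical sample imitating the mixed values seen in the 'age' column
raw_age = pd.Series(["31", "unknown", "", None, 25])

# errors="coerce" replaces anything that cannot be parsed with NaN
numeric_age = pd.to_numeric(raw_age, errors="coerce")

print(numeric_age)
print(numeric_age.dtype)  # float64, because NaN cannot live in a plain int column
```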


### Compute the mean age, ignoring NaN values

In [8]:
media_age = df['age'].mean()
print(media_age)

40.76923076923077


### Fill the NaNs with the mean of 'age' and convert to int

In [9]:
df['age'] = df['age'].fillna(media_age).astype(int)
df.head(30)

Unnamed: 0,name,age,city,email
0,Patricia Miller,40,Paris,user0@example.com
1,Michael Brown,40,Paris,user1@example.com
2,Robert Wilson,31,Madrid,user2@example.com
3,Linda Johnson,40,London,user3@example.com
4,Maria Garcia,40,Madrid,user4@example.com
5,Peter Jones,29,Rio de Janeiro,user5@example.com
6,Michael Brown,40,,user6@example.com
7,John Doe,40,Moscow,user7@example.com
8,James Lee,25,New York,user8@example.com
9,Maria Garcia,40,Berlin,user9@example.com
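
Note that `astype(int)` truncates rather than rounds, so a mean such as 40.77 is stored as 40. The following sketch, on a hypothetical series `age` whose mean is 40.75, compares truncation with rounding and also shows pandas' nullable `Int64` dtype as one possible way to keep integers without imputing at all. Whether imputing the mean is appropriate depends on the analysis; this is only an illustration.

```python
import pandas as pd

# Hypothetical ages with two gaps; the mean of the known values is 40.75
age = pd.Series([31.0, None, 29.0, 25.0, 78.0, None])
mean_age = age.mean()  # NaN values are skipped by default

# astype(int) drops the fractional part: 40.75 becomes 40
truncated = age.fillna(mean_age).astype(int)

# Rounding first gives the nearest integer: 40.75 becomes 41
rounded = age.fillna(mean_age).round().astype(int)

# Alternative: keep the gaps as <NA> using the nullable integer dtype
nullable = age.astype("Int64")

print(truncated.tolist())
print(rounded.tolist())
print(nullable)
```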


In [10]:
df.dtypes

name     object
age       int64
city     object
email    object
dtype: object

### Identify missing data

In [11]:
faltantes = df.isna().sum() + (df == '').sum()
faltantes

name      8
age       0
city     10
email     0
dtype: int64
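
The expression above counts `NaN` and empty strings separately. One alternative, sketched here on a hypothetical `sample` frame, is to first convert empty or whitespace-only strings to `NaN` so that a single `isna()` call covers both cases. Treating whitespace-only strings as missing is an assumption that goes slightly beyond the original check.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with None, empty, and whitespace-only entries
sample = pd.DataFrame({
    "name": ["Ana", "", None, "  "],
    "city": ["Paris", None, "", "Rome"],
})

# Replace empty or whitespace-only strings with NaN
normalized = sample.replace(r"^\s*$", np.nan, regex=True)

# A single isna() now counts every kind of gap
missing = normalized.isna().sum()
print(missing)
```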

### Identify duplicate rows

In [12]:
duplicados = df.duplicated()
duplicados

0     False
1     False
2     False
3     False
4     False
      ...  
95    False
96    False
97    False
98    False
99    False
Length: 100, dtype: bool

In [13]:
filas_duplicadas = df[df.duplicated()]
filas_duplicadas

Unnamed: 0,name,age,city,email


The result shows that there are no fully duplicated rows in the DataFrame.

To check for duplicates based on specific columns, pass those columns to the `duplicated()` method through `subset`. For example, the following code checks for duplicates based on the `name` and `email` columns:

In [14]:
duplicados_especificos = df[df.duplicated(subset=['name', 'email'])]
duplicados_especificos

Unnamed: 0,name,age,city,email
55,Michael Brown,40,Paris,user6@example.com


### Duplicates by email

In [15]:
duplicado_email = df[df.duplicated(subset=['email'])]
duplicado_email 

Unnamed: 0,name,age,city,email
25,Linda Johnson,40,Moscow,user5@example.com
54,Michael Brown,40,London,user3@example.com
55,Michael Brown,40,Paris,user6@example.com
63,James Lee,40,Sydney,user3@example.com
66,Patricia Miller,40,Moscow,user6@example.com
71,,40,Berlin,user8@example.com
73,Peter Jones,40,New York,user0@example.com


### Count the repeated emails and show where they occur

In [16]:
# Count the occurrences of each 'email'
email_counts = df['email'].value_counts()

# Keep only the emails that appear more than once
duplicated_emails_with_counts = email_counts[email_counts > 1]

duplicated_emails_with_counts

email
user3@example.com    3
user6@example.com    3
user0@example.com    2
user5@example.com    2
user8@example.com    2
Name: count, dtype: int64

In [17]:
# Count the occurrences of each 'email'
email_counts = df['email'].value_counts()

# Keep only the emails that appear more than once
duplicated_emails = email_counts[email_counts > 1].index

# Filter the DataFrame to show every row whose email is duplicated
duplicated_email_rows = df[df['email'].isin(duplicated_emails)]

duplicated_email_rows

Unnamed: 0,name,age,city,email
0,Patricia Miller,40,Paris,user0@example.com
3,Linda Johnson,40,London,user3@example.com
5,Peter Jones,29,Rio de Janeiro,user5@example.com
6,Michael Brown,40,,user6@example.com
8,James Lee,25,New York,user8@example.com
25,Linda Johnson,40,Moscow,user5@example.com
54,Michael Brown,40,London,user3@example.com
55,Michael Brown,40,Paris,user6@example.com
63,James Lee,40,Sydney,user3@example.com
66,Patricia Miller,40,Moscow,user6@example.com
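
A shorter route to the same kind of listing is `duplicated(..., keep=False)`, which flags every member of a duplicate group instead of only the later occurrences. A minimal sketch on hypothetical data (`people` is invented for the example):

```python
import pandas as pd

# Hypothetical frame where one email appears twice
people = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "email": ["a@example.com", "b@example.com", "a@example.com", "d@example.com"],
})

# keep=False marks the first occurrence as well as the repeats
all_repeats = people[people.duplicated(subset=["email"], keep=False)]
print(all_repeats)
```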


### Remove duplicate rows

In [18]:
# Drop rows that are duplicated across all columns
df_cleaned = df.drop_duplicates()
df_cleaned

Unnamed: 0,name,age,city,email
0,Patricia Miller,40,Paris,user0@example.com
1,Michael Brown,40,Paris,user1@example.com
2,Robert Wilson,31,Madrid,user2@example.com
3,Linda Johnson,40,London,user3@example.com
4,Maria Garcia,40,Madrid,user4@example.com
...,...,...,...,...
95,Patricia Miller,30,Berlin,user95@example.com
96,James Lee,56,London,user96@example.com
97,Patricia Miller,40,New York,user97@example.com
98,Maria Garcia,40,Madrid,user98@example.com
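
Called without arguments, `drop_duplicates()` only removes rows that match in every column, so rows that merely share an email remain in `df_cleaned`. If one row per email were desired (an assumption; the notebook itself does not do this), a sketch with `subset` and `keep` on hypothetical data could look like this:

```python
import pandas as pd

# Hypothetical frame: rows 0 and 2 share an email but differ elsewhere
people = pd.DataFrame({
    "name": ["A", "B", "C"],
    "city": ["Paris", "Berlin", "Madrid"],
    "email": ["a@example.com", "b@example.com", "a@example.com"],
})

# Full-row comparison finds nothing to drop
full_row = people.drop_duplicates()

# Comparing only 'email' keeps the first row of each email
one_per_email = people.drop_duplicates(subset=["email"], keep="first")

print(len(full_row), len(one_per_email))
```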


### Remove rows with empty data

In [19]:
faltantes_1 = df_cleaned.isna().sum() + (df_cleaned == '').sum()
faltantes_1

name      8
age       0
city     10
email     0
dtype: int64

In [20]:
# Drop rows with missing values in the 'name' and 'city' columns
df_no_missing_specific = df_cleaned.dropna(subset=['name', 'city'])

# Also drop rows where 'name' or 'city' is an empty string
df_no_missing_specific = df_no_missing_specific[(df_no_missing_specific['name'] != '') & (df_no_missing_specific['city'] != '')]

df_no_missing_specific.head(50)

Unnamed: 0,name,age,city,email
0,Patricia Miller,40,Paris,user0@example.com
1,Michael Brown,40,Paris,user1@example.com
2,Robert Wilson,31,Madrid,user2@example.com
3,Linda Johnson,40,London,user3@example.com
4,Maria Garcia,40,Madrid,user4@example.com
5,Peter Jones,29,Rio de Janeiro,user5@example.com
7,John Doe,40,Moscow,user7@example.com
8,James Lee,25,New York,user8@example.com
9,Maria Garcia,40,Berlin,user9@example.com
10,Patricia Miller,40,Paris,user10@example.com


In [21]:
df_no_missing_specific.tail(30)

Unnamed: 0,name,age,city,email
65,Maria Garcia,22,Madrid,user65@example.com
66,Patricia Miller,40,Moscow,user6@example.com
67,Patricia Miller,40,Paris,user67@example.com
68,James Lee,40,London,user68@example.com
69,Maria Garcia,40,Sydney,user69@example.com
70,Robert Wilson,40,Tokyo,user70@example.com
72,Peter Jones,40,Tokyo,user72@example.com
73,Peter Jones,40,New York,user0@example.com
74,Patricia Miller,40,Paris,user74@example.com
75,James Lee,40,London,user75@example.com


In [22]:
df_no_missing_specific.shape

(83, 4)

### Reset the index

In [23]:
df_no_missing_specific_reset = df_no_missing_specific.reset_index(drop=True)

# Show the DataFrame with the reset index
df_no_missing_specific_reset

Unnamed: 0,name,age,city,email
0,Patricia Miller,40,Paris,user0@example.com
1,Michael Brown,40,Paris,user1@example.com
2,Robert Wilson,31,Madrid,user2@example.com
3,Linda Johnson,40,London,user3@example.com
4,Maria Garcia,40,Madrid,user4@example.com
...,...,...,...,...
78,Patricia Miller,30,Berlin,user95@example.com
79,James Lee,56,London,user96@example.com
80,Patricia Miller,40,New York,user97@example.com
81,Maria Garcia,40,Madrid,user98@example.com


## Section 2 - Analyzing nested JSON with pandas

Let's start by creating a nested JSON structure, using simulated data.


In [24]:
import json

# Creating a nested JSON structure
nested_json = {
    "company": "Tech Solutions",
    "employees": [
        {
            "name": "John Doe",
            "age": 30,
            "address": {
                "street": "123 Main St",
                "city": "New York",
                "zipcode": "10001"
            },
            "skills": ["Python", "Data Analysis"]
        },
        {
            "name": "Jane Smith",
            "age": 25,
            "address": {
                "street": "456 Market St",
                "city": "San Francisco",
                "zipcode": "94105"
            },
            "skills": ["JavaScript", "React"]
        },
        {
            "name": "Mike Johnson",
            "age": 35,
            "address": {
                "street": "789 Broadway",
                "city": "Los Angeles",
                "zipcode": "90015"
            },
            "skills": ["Java", "Spring"]
        }
    ]
}

# Convert the dictionary to a JSON string
nested_json_str = json.dumps(nested_json, indent=4)
print(nested_json_str)

{
    "company": "Tech Solutions",
    "employees": [
        {
            "name": "John Doe",
            "age": 30,
            "address": {
                "street": "123 Main St",
                "city": "New York",
                "zipcode": "10001"
            },
            "skills": [
                "Python",
                "Data Analysis"
            ]
        },
        {
            "name": "Jane Smith",
            "age": 25,
            "address": {
                "street": "456 Market St",
                "city": "San Francisco",
                "zipcode": "94105"
            },
            "skills": [
                "JavaScript",
                "React"
            ]
        },
        {
            "name": "Mike Johnson",
            "age": 35,
            "address": {
                "street": "789 Broadway",
                "city": "Los Angeles",
                "zipcode": "90015"
            },
            "skills": [
                "Java",
                "Spr

Load the nested JSON into a pandas DataFrame.  

We will use `pandas.json_normalize` to flatten the JSON structure and load it into a DataFrame.

In [25]:
import pandas as pd

# Flatten the nested JSON structure
df = pd.json_normalize(nested_json, record_path='employees', meta='company')
print(df.head())

           name  age                   skills address.street   address.city  \
0      John Doe   30  [Python, Data Analysis]    123 Main St       New York   
1    Jane Smith   25      [JavaScript, React]  456 Market St  San Francisco   
2  Mike Johnson   35           [Java, Spring]   789 Broadway    Los Angeles   

  address.zipcode         company  
0           10001  Tech Solutions  
1           94105  Tech Solutions  
2           90015  Tech Solutions  


In [26]:
df

Unnamed: 0,name,age,skills,address.street,address.city,address.zipcode,company
0,John Doe,30,"[Python, Data Analysis]",123 Main St,New York,10001,Tech Solutions
1,Jane Smith,25,"[JavaScript, React]",456 Market St,San Francisco,94105,Tech Solutions
2,Mike Johnson,35,"[Java, Spring]",789 Broadway,Los Angeles,90015,Tech Solutions


### Data cleaning

Handling missing values

Let's deal with missing values in our DataFrame. This example has none, but we will show how they can be handled.

In [27]:
# Check for missing values
print(df.isna().sum())

# Fill missing values with a placeholder or drop rows/columns with missing values
# Example: Fill missing values with 'Unknown'
df.fillna('Unknown', inplace=True)

name               0
age                0
skills             0
address.street     0
address.city       0
address.zipcode    0
company            0
dtype: int64
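
Filling every column with the string `'Unknown'` would also place text in numeric columns such as `age` if they had gaps. A hedged alternative is to pass a dictionary to `fillna` so that only chosen columns are filled; the `staff` frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in both text and numeric columns
staff = pd.DataFrame({
    "name": ["Ann", None],
    "age": [30.0, np.nan],
    "city": [None, "Oslo"],
})

# Fill only the text columns; 'age' keeps its NaN and its float dtype
filled = staff.fillna({"name": "Unknown", "city": "Unknown"})
print(filled)
```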


### Rename columns

In [28]:
df.rename(columns={
    'address.street': 'street',
    'address.city': 'city',
    'address.zipcode': 'zipcode'
}, inplace=True)
print(df.head())

           name  age                   skills         street           city  \
0      John Doe   30  [Python, Data Analysis]    123 Main St       New York   
1    Jane Smith   25      [JavaScript, React]  456 Market St  San Francisco   
2  Mike Johnson   35           [Java, Spring]   789 Broadway    Los Angeles   

  zipcode         company  
0   10001  Tech Solutions  
1   94105  Tech Solutions  
2   90015  Tech Solutions  
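
If dotted column names are not wanted in the first place, `pd.json_normalize` accepts a `sep` argument that controls how nested keys are joined. The sketch below uses a reduced, hypothetical payload (`data`) and `sep='_'`; it yields prefixed names such as `address_city` rather than the short names produced by the rename above, so it is an alternative and not an equivalent.

```python
import pandas as pd

# Reduced, hypothetical payload with the same shape as the nested JSON
data = {
    "company": "Acme",
    "employees": [
        {"name": "Ann", "address": {"city": "Oslo", "zipcode": "0150"}},
    ],
}

# sep="_" joins nested keys with an underscore instead of a dot
flat = pd.json_normalize(data, record_path="employees", meta="company", sep="_")
print(flat.columns.tolist())
```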


### Change the data types

Make sure the data types are appropriate for the analysis.

In [29]:
print(df.dtypes)

name       object
age         int64
skills     object
street     object
city       object
zipcode    object
company    object
dtype: object


In [30]:
df['age'] = df['age'].astype(int)
print(df.dtypes)

name       object
age         int64
skills     object
street     object
city       object
zipcode    object
company    object
dtype: object


### Expanding list columns

If you have columns that contain lists (for example, skills), you may want to expand them into separate rows or columns.

In [31]:
# Expand the 'skills' column
skills_expanded = df['skills'].apply(pd.Series)
skills_expanded.columns = [f'skill_{i+1}' for i in skills_expanded.columns]

# Concatenate the expanded skills back to the original DataFrame
df = pd.concat([df.drop(columns='skills'), skills_expanded], axis=1)
print(df.head())


           name  age         street           city zipcode         company  \
0      John Doe   30    123 Main St       New York   10001  Tech Solutions   
1    Jane Smith   25  456 Market St  San Francisco   94105  Tech Solutions   
2  Mike Johnson   35   789 Broadway    Los Angeles   90015  Tech Solutions   

      skill_1        skill_2  
0      Python  Data Analysis  
1  JavaScript          React  
2        Java         Spring  
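
The cell above expands the lists into columns. For the row-wise variant, `DataFrame.explode` produces one row per list element. A minimal sketch on a hypothetical `team` frame (the notebook's `df` no longer has a `skills` column at this point):

```python
import pandas as pd

# Hypothetical frame with a list-valued column
team = pd.DataFrame({
    "name": ["John Doe", "Jane Smith"],
    "skills": [["Python", "Data Analysis"], ["JavaScript", "React"]],
})

# One row per skill; the other columns are repeated for each element
long_form = team.explode("skills").reset_index(drop=True)
print(long_form)
```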


### Save the modified DataFrame to CSV

In [32]:
df.to_csv('cleaned_data.csv', index=False)
print("DataFrame saved to 'cleaned_data.csv'")

DataFrame saved to 'cleaned_data.csv'
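
When the CSV is read back, pandas infers types again, so a text column of digits such as `zipcode` would come back as integers unless a `dtype` is given. The sketch below demonstrates this with an in-memory buffer and a hypothetical zip code with a leading zero (`'02134'` is not from the dataset).

```python
import io

import pandas as pd

# Hypothetical frame; the zip code is stored as text
original = pd.DataFrame({"name": ["Ann"], "zipcode": ["02134"]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Default type inference parses the zip code as an integer
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Forcing the column to str preserves the leading zero
buffer.seek(0)
preserved = pd.read_csv(buffer, dtype={"zipcode": str})

print(inferred["zipcode"][0], preserved["zipcode"][0])
```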
