## Exploratory Data Analysis (EDA): IMDB Top 250 Dataset

Objective: explore and analyze a dataset both conceptually and graphically to gain a better understanding of the data available, while cleaning and organizing the data so that later analysis is reliable.

### Contents
1. Data description
2. Data cleaning
3. Missing values
4. Data visualization
5. References



### 1. Data description
This dataset contains basic information about the 250 highest-rated movies on imdb.com.

|Field          |Type       |Description                            |Notes                                      |
|---------------|-----------|---------------------------------------|-------------------------------------------|
|rank           |integer    |Position of the movie in the Top 250   |Value from 1 to 250                        |
|name           |text       |Movie title                            |Title in the original language             |
|year           |integer    |Release year                           |                                           |
|rating         |decimal    |Rating                                 |Given by users                             |
|genre          |text       |Genres                                 |List of the genres the movie belongs to    |
|certificate    |text       |Content rating                         |Rating assigned in the country of origin   |
|run_time       |text       |Running time                           |Given in hours and minutes                 |
|tagline        |text       |Tagline                                |In English                                 |
|budget         |integer    |Budget                                 |In dollars                                 |
|box_office     |integer    |Opening revenue                        |In dollars                                 |
|casts          |text       |Cast                                   |List of the main cast                      |
|directors      |text       |Directors                              |List of the directors                      |
|writers        |text       |Writers                                |List of the writers                        |

The question we want to answer is related to profit. Because the amounts provided are not adjusted for inflation, we will use a derived value instead: the ratio of box_office to budget, both in their original values. We will call this field ganancia_proporcional.

$$ \mathrm{ganancia\_proporcional} = \frac{\mathrm{box\_office}}{\mathrm{budget}} $$
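As a rough illustration of this ratio, the sketch below computes ganancia_proporcional for three rows copied from the data preview in this notebook. The small DataFrame `sample` is a hypothetical stand-in for the cleaned dataset, and it assumes budget and box_office are already numeric.

```python
import pandas as pd

# Hypothetical sample: three rows taken from the notebook's data preview,
# with budget and box_office already stored as integers.
sample = pd.DataFrame({
    "name": ["The Shawshank Redemption", "The Godfather", "The Dark Knight"],
    "budget": [25_000_000, 6_000_000, 185_000_000],
    "box_office": [28_884_504, 250_341_816, 1_006_234_167],
})

# Ratio of box office to budget; a value above 1 means revenue exceeded the budget.
sample["ganancia_proporcional"] = sample["box_office"] / sample["budget"]

print(sample[["name", "ganancia_proporcional"]].round(2))
```

For The Godfather, for example, the ratio is 250341816 / 6000000, or roughly 41.7.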

### 2. Data cleaning

In [18]:
# Import libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


In [20]:
# Load the dataset

data = pd.read_csv("../data/IMDB_Top_250_Movies.csv")

In [22]:
data.dtypes

rank             int64
name            object
year             int64
rating         float64
genre           object
certificate     object
run_time        object
tagline         object
budget          object
box_office      object
casts           object
directors       object
writers         object
dtype: object

In [23]:
data.head()

Unnamed: 0,rank,name,year,rating,genre,certificate,run_time,tagline,budget,box_office,casts,directors,writers
0,1,The Shawshank Redemption,1994,9.3,Drama,R,2h 22m,Fear can hold you prisoner. Hope can set you f...,25000000,28884504,"Tim Robbins,Morgan Freeman,Bob Gunton,William ...",Frank Darabont,"Stephen King,Frank Darabont"
1,2,The Godfather,1972,9.2,"Crime,Drama",R,2h 55m,An offer you can't refuse.,6000000,250341816,"Marlon Brando,Al Pacino,James Caan,Diane Keato...",Francis Ford Coppola,"Mario Puzo,Francis Ford Coppola"
2,3,The Dark Knight,2008,9.0,"Action,Crime,Drama",PG-13,2h 32m,Why So Serious?,185000000,1006234167,"Christian Bale,Heath Ledger,Aaron Eckhart,Mich...",Christopher Nolan,"Jonathan Nolan,Christopher Nolan,David S. Goyer"
3,4,The Godfather Part II,1974,9.0,"Crime,Drama",R,3h 22m,All the power on earth can't change destiny.,13000000,47961919,"Al Pacino,Robert De Niro,Robert Duvall,Diane K...",Francis Ford Coppola,"Francis Ford Coppola,Mario Puzo"
4,5,12 Angry Men,1957,9.0,"Crime,Drama",Approved,1h 36m,Life Is In Their Hands -- Death Is On Their Mi...,350000,955,"Henry Fonda,Lee J. Cobb,Martin Balsam,John Fie...",Sidney Lumet,Reginald Rose


In [24]:
data.tail()

Unnamed: 0,rank,name,year,rating,genre,certificate,run_time,tagline,budget,box_office,casts,directors,writers
245,246,The Help,2011,8.1,Drama,PG-13,2h 26m,Change begins with a whisper.,25000000,216639112,"Viola Davis,Emma Stone,Octavia Spencer,Bryce D...",Tate Taylor,"Tate Taylor,Kathryn Stockett"
246,247,Dersu Uzala,1975,8.2,"Adventure,Biography,Drama",G,2h 22m,There is man and beast at nature's mercy. Ther...,4000000,14480,"Maksim Munzuk,Yuriy Solomin,Mikhail Bychkov,Vl...",Akira Kurosawa,"Akira Kurosawa,Yuriy Nagibin,Vladimir Arsenev"
247,248,Aladdin,1992,8.0,"Animation,Adventure,Comedy",G,1h 30m,Wish granted! (DVD re-release),Not Available,Not Available,"Scott Weinger,Robin Williams,Linda Larkin,Jona...","Ron Clements,John Musker","Ron Clements,John Musker,Ted Elliott"
248,249,Gandhi,1982,8.0,"Biography,Drama,History",PG,3h 11m,His Triumph Changed The World Forever.,22000000,52767889,"Ben Kingsley,John Gielgud,Rohini Hattangadi,Ro...",Richard Attenborough,John Briley
249,250,Dances with Wolves,1990,8.0,"Adventure,Drama,Western",PG-13,3h 1m,Inside everyone is a frontier waiting to be di...,22000000,424208848,"Kevin Costner,Mary McDonnell,Graham Greene,Rod...",Kevin Costner,Michael Blake


In this first look we confirm that the column names are correct and that the information (at least in the first and last records) matches the expected content.

### 3. Missing values

In [15]:
data.isnull().sum()

rank           0
name           0
year           0
rating         0
genre          0
certificate    0
run_time       0
tagline        0
budget         0
box_office     0
casts          0
directors      0
writers        0
dtype: int64

isnull() reports no missing values. However, the budget and box_office columns use the placeholder "Not Available" for unknown amounts. This information is essential for the exercise, so we drop the records whose values are not numeric.
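Because isnull() only detects real NaN values, a text placeholder such as "Not Available" goes unnoticed. A minimal sketch, using a hypothetical three-row frame `toy` instead of the real dataset, of how to count the placeholder and make it visible to isnull():

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the text-typed budget and box_office columns.
toy = pd.DataFrame({
    "name": ["The Help", "Aladdin", "Gandhi"],
    "budget": ["25000000", "Not Available", "22000000"],
    "box_office": ["216639112", "Not Available", "52767889"],
})

cols = ["budget", "box_office"]

# Count the placeholder strings per column; isnull() alone would report 0 here.
placeholder_counts = (toy[cols] == "Not Available").sum()
print(placeholder_counts)

# Replace the placeholder with NaN so that isnull() can detect it.
toy_nan = toy.replace("Not Available", np.nan)
print(toy_nan[cols].isnull().sum())
```

Once the placeholder is NaN, the usual tools for missing data (isnull, dropna) apply directly.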

In [27]:
# Keep only the rows whose budget is not the "Not Available" placeholder
dataC = data[~data['budget'].str.contains("Not Available")]
dataC

Unnamed: 0,rank,name,year,rating,genre,certificate,run_time,tagline,budget,box_office,casts,directors,writers
0,1,The Shawshank Redemption,1994,9.3,Drama,R,2h 22m,Fear can hold you prisoner. Hope can set you f...,25000000,28884504,"Tim Robbins,Morgan Freeman,Bob Gunton,William ...",Frank Darabont,"Stephen King,Frank Darabont"
1,2,The Godfather,1972,9.2,"Crime,Drama",R,2h 55m,An offer you can't refuse.,6000000,250341816,"Marlon Brando,Al Pacino,James Caan,Diane Keato...",Francis Ford Coppola,"Mario Puzo,Francis Ford Coppola"
2,3,The Dark Knight,2008,9.0,"Action,Crime,Drama",PG-13,2h 32m,Why So Serious?,185000000,1006234167,"Christian Bale,Heath Ledger,Aaron Eckhart,Mich...",Christopher Nolan,"Jonathan Nolan,Christopher Nolan,David S. Goyer"
3,4,The Godfather Part II,1974,9.0,"Crime,Drama",R,3h 22m,All the power on earth can't change destiny.,13000000,47961919,"Al Pacino,Robert De Niro,Robert Duvall,Diane K...",Francis Ford Coppola,"Francis Ford Coppola,Mario Puzo"
4,5,12 Angry Men,1957,9.0,"Crime,Drama",Approved,1h 36m,Life Is In Their Hands -- Death Is On Their Mi...,350000,955,"Henry Fonda,Lee J. Cobb,Martin Balsam,John Fie...",Sidney Lumet,Reginald Rose
...,...,...,...,...,...,...,...,...,...,...,...,...,...
244,245,The Iron Giant,1999,8.1,"Animation,Action,Adventure",PG,1h 26m,Some secrets are too huge to hide,70000000,23335817,"Eli Marienthal,Harry Connick Jr.,Jennifer Anis...",Brad Bird,"Tim McCanlies,Brad Bird,Ted Hughes"
245,246,The Help,2011,8.1,Drama,PG-13,2h 26m,Change begins with a whisper.,25000000,216639112,"Viola Davis,Emma Stone,Octavia Spencer,Bryce D...",Tate Taylor,"Tate Taylor,Kathryn Stockett"
246,247,Dersu Uzala,1975,8.2,"Adventure,Biography,Drama",G,2h 22m,There is man and beast at nature's mercy. Ther...,4000000,14480,"Maksim Munzuk,Yuriy Solomin,Mikhail Bychkov,Vl...",Akira Kurosawa,"Akira Kurosawa,Yuriy Nagibin,Vladimir Arsenev"
248,249,Gandhi,1982,8.0,"Biography,Drama,History",PG,3h 11m,His Triumph Changed The World Forever.,22000000,52767889,"Ben Kingsley,John Gielgud,Rohini Hattangadi,Ro...",Richard Attenborough,John Briley


We drop the records that have "Not Available" in the budget column, leaving 211 records.

In [28]:
# Same filter for box_office; copy so later column assignments act on an independent frame
dataD = dataC[~dataC['box_office'].str.contains("Not Available")].copy()
dataD

Unnamed: 0,rank,name,year,rating,genre,certificate,run_time,tagline,budget,box_office,casts,directors,writers
0,1,The Shawshank Redemption,1994,9.3,Drama,R,2h 22m,Fear can hold you prisoner. Hope can set you f...,25000000,28884504,"Tim Robbins,Morgan Freeman,Bob Gunton,William ...",Frank Darabont,"Stephen King,Frank Darabont"
1,2,The Godfather,1972,9.2,"Crime,Drama",R,2h 55m,An offer you can't refuse.,6000000,250341816,"Marlon Brando,Al Pacino,James Caan,Diane Keato...",Francis Ford Coppola,"Mario Puzo,Francis Ford Coppola"
2,3,The Dark Knight,2008,9.0,"Action,Crime,Drama",PG-13,2h 32m,Why So Serious?,185000000,1006234167,"Christian Bale,Heath Ledger,Aaron Eckhart,Mich...",Christopher Nolan,"Jonathan Nolan,Christopher Nolan,David S. Goyer"
3,4,The Godfather Part II,1974,9.0,"Crime,Drama",R,3h 22m,All the power on earth can't change destiny.,13000000,47961919,"Al Pacino,Robert De Niro,Robert Duvall,Diane K...",Francis Ford Coppola,"Francis Ford Coppola,Mario Puzo"
4,5,12 Angry Men,1957,9.0,"Crime,Drama",Approved,1h 36m,Life Is In Their Hands -- Death Is On Their Mi...,350000,955,"Henry Fonda,Lee J. Cobb,Martin Balsam,John Fie...",Sidney Lumet,Reginald Rose
...,...,...,...,...,...,...,...,...,...,...,...,...,...
244,245,The Iron Giant,1999,8.1,"Animation,Action,Adventure",PG,1h 26m,Some secrets are too huge to hide,70000000,23335817,"Eli Marienthal,Harry Connick Jr.,Jennifer Anis...",Brad Bird,"Tim McCanlies,Brad Bird,Ted Hughes"
245,246,The Help,2011,8.1,Drama,PG-13,2h 26m,Change begins with a whisper.,25000000,216639112,"Viola Davis,Emma Stone,Octavia Spencer,Bryce D...",Tate Taylor,"Tate Taylor,Kathryn Stockett"
246,247,Dersu Uzala,1975,8.2,"Adventure,Biography,Drama",G,2h 22m,There is man and beast at nature's mercy. Ther...,4000000,14480,"Maksim Munzuk,Yuriy Solomin,Mikhail Bychkov,Vl...",Akira Kurosawa,"Akira Kurosawa,Yuriy Nagibin,Vladimir Arsenev"
248,249,Gandhi,1982,8.0,"Biography,Drama,History",PG,3h 11m,His Triumph Changed The World Forever.,22000000,52767889,"Ben Kingsley,John Gielgud,Rohini Hattangadi,Ro...",Richard Attenborough,John Briley


We drop the records that have "Not Available" in the box_office column, leaving 211 records.
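The record counts quoted above can be checked with len() after each filter. The sketch below applies both filters to a hypothetical four-row frame `toy`, so its counts are illustrative and are not the notebook's 211. It also assumes the placeholder matches the string exactly, which is why it uses `!=` rather than `str.contains`.

```python
import pandas as pd

# Hypothetical toy frame; the real dataset has 250 rows.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "budget": ["100", "Not Available", "300", "400"],
    "box_office": ["150", "Not Available", "Not Available", "900"],
})

# Keep rows whose budget is not the placeholder, then do the same for box_office.
step1 = toy[toy["budget"] != "Not Available"]
step2 = step1[step1["box_office"] != "Not Available"]

# Compare the row count before and after each filter.
print(len(toy), len(step1), len(step2))  # prints: 4 3 2
```

If the second count equals the first, the second filter removed nothing, meaning every row with a missing box_office also had a missing budget.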

In [30]:
dataD.dtypes

rank             int64
name            object
year             int64
rating         float64
genre           object
certificate     object
run_time        object
tagline         object
budget          object
box_office      object
casts           object
directors       object
writers         object
dtype: object

Both budget and box_office should be numeric, so we will change their type.

In [33]:
dataD['budget'] = pd.to_numeric(dataD['budget'])
print(dataD)
print(dataD.dtypes)

ValueError: Unable to parse string "$3300000" at position 22
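The error shows that at least one budget value carries a currency symbol ("$3300000"), so pd.to_numeric cannot parse it directly. One possible fix, sketched here on a hypothetical Series `raw`, is to strip every non-digit character before converting. This assumes all amounts are whole numbers. It also discards any currency prefix, so amounts given in other currencies would need separate handling.

```python
import pandas as pd

# Hypothetical values mimicking the budget column, including the one from the error.
raw = pd.Series(["25000000", "$3300000", "Not Available"], name="budget")

# Remove everything except digits, then convert; strings left empty become NaN.
digits = raw.str.replace(r"\D", "", regex=True)
budget_num = pd.to_numeric(digits, errors="coerce")

print(budget_num)
```

On the notebook's data, the same two calls could be applied to dataD['budget'] and dataD['box_office']. Working on an explicit copy of the filtered frame (for example `dataD = dataD.copy()`) avoids pandas' SettingWithCopyWarning on assignment.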

### 4. Data visualization

### 5. References
- Análisis Exploratorio de Datos. JMP. (n.d.). Retrieved March 16, 2023, from https://www.jmp.com/es_co/statistics-knowledge-portal/exploratory-data-analysis.html#:~:text=El%20an%C3%A1lisis%20exploratorio%20de%20datos%20(EDA%20por%20sus%20siglas%20en,aprender%2C%20no%20confirmar%20hip%C3%B3tesis%20estad%C3%ADsticas. 

- Eduardo Sánchez Z. 2022. "Proyecto-Sports-Analytics". https://github.com/EduardoSanchezZ/Proyecto-Sports-Analytics

- Carol Castañeda. 2022. "EDA_Airbnb_Buenos-Aires". https://github.com/Carol-Castaneda/EDA_Airbnb_Buenos-Aires

- K. Katari. (Aug 21, 2020). Exploratory Data Analysis (EDA): Python. Towards Data Science.