# Exploratory Data Analysis (EDA)

## 1. The business problem

A start-up that provides streaming-platform aggregation services has requested the development, through all of its stages, of <span style="color:darkgreen">a movie recommendation system</span>. For this project, the **Data Scientist** was given a set of files in **.csv and Excel** formats, which required careful [E.T.L](ETL-(Extract-Transform-Load).ipynb) work; follow the link to see that notebook.

## 2. The dataset after cleaning

Once processed, the data is stored in a CSV file (`movies_cleaned.csv`) with 43,641 rows and 15 columns.

Each record contains 15 features. The columns are as follows:
1. **belongs_to_collection**: Franchise or film series the movie belongs to.
2. **budget**: The movie's budget, in dollars.
3. **genres**: A list of all genres associated with the movie.
4. **id**: Movie ID.
5. **original_language**: ISO 639 code of the language in which the movie was originally filmed.
6. **overview**: Summary of the movie.
7. **production_companies**: List of the production companies associated with the movie.
8. **production_countries**: List of the countries where the movie was produced.
9. **release_date**: Release date of the movie.
10. **revenue**: The movie's box-office revenue, in dollars (also treated as profit for the purposes of this work).
11. **runtime**: Running time of the movie, in minutes.
12. **title**: Title of the movie.
13. **cast**: Leading cast of the movie.
14. **director**: Name of the movie's director.
15. **return**: Return-on-investment ratio.
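
The `return` column is described only as a return-on-investment ratio. The sample rows shown in this notebook are consistent with `revenue / budget`, with `0.0` whenever the budget is zero (for Toy Story, 373554033 / 30000000 ≈ 12.4518). The sketch below reproduces that calculation under this assumption; `compute_return` is a hypothetical helper and may not match what the ETL notebook actually does.

```python
import pandas as pd

# Toy rows copied from the sample output of this notebook (amounts in dollars).
sample = pd.DataFrame({
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
    "budget": [30000000.0, 65000000.0, 0.0],
    "revenue": [373554033.0, 262797249.0, 0.0],
})

def compute_return(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: revenue / budget, or 0.0 when the budget is zero."""
    # Mask zero budgets as NaN so the division never divides by zero.
    safe_budget = df["budget"].where(df["budget"] > 0)
    return df["revenue"].div(safe_budget).fillna(0.0)

sample["return"] = compute_return(sample)
print(sample[["title", "return"]])
```

Guarding the zero-budget case matters here because most rows in the cleaned file have a budget of 0.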


## 3. Reading the dataset

In [1]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Read the cleaned dataset
path = "datasets/movies_cleaned.csv"
data = pd.read_csv(path)

In [3]:
print(data.shape)
data.head()

(43641, 15)


|   | belongs_to_collection | budget | genres | id | original_language | overview | production_companies | production_countries | release_date | revenue | runtime | title | cast | director | return |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | toy story collection | 30000000.0 | ['animation', 'comedy', 'family'] | 862 | en | Led by Woody, Andy's toys live happily in his ... | ['pixar animation studios'] | ['united states of america'] | 1995-10-30 | 373554033.0 | 81.0 | Toy Story | ['tom hanks', 'tim allen', 'don rickles', 'jim... | John Lasseter | 12.451801 |
| 1 | no collection | 65000000.0 | ['adventure', 'fantasy', 'family'] | 8844 | en | When siblings Judy and Peter discover an encha... | ['tristar pictures', 'teitler film', 'intersco... | ['united states of america'] | 1995-12-15 | 262797249.0 | 104.0 | Jumanji | ['robin williams', 'jonathan hyde', 'kirsten d... | Joe Johnston | 4.043035 |
| 2 | grumpy old men collection | 0.0 | ['romance', 'comedy'] | 15602 | en | A family wedding reignites the ancient feud be... | ['warner bros.', 'lancaster gate'] | ['united states of america'] | 1995-12-22 | 0.0 | 101.0 | Grumpier Old Men | ['walter matthau', 'jack lemmon', 'ann-margret... | Howard Deutch | 0.0 |
| 3 | no collection | 16000000.0 | ['comedy', 'drama', 'romance'] | 31357 | en | Cheated on, mistreated and stepped on, the wom... | ['twentieth century fox film corporation'] | ['united states of america'] | 1995-12-22 | 81452156.0 | 127.0 | Waiting to Exhale | ['whitney houston', 'angela bassett', 'loretta... | Forest Whitaker | 5.09076 |
| 4 | father of the bride collection | 0.0 | ['comedy'] | 11862 | en | Just when George Banks has recovered from his ... | ['sandollar productions', 'touchstone pictures'] | ['united states of america'] | 1995-02-10 | 76578911.0 | 106.0 | Father of the Bride Part II | ['steve martin', 'diane keaton', 'martin short... | Charles Shyer | 0.0 |


## 4. Exploratory analysis

The idea is to use statistical and visualization tools to:

- Build a mental map of the dataset (understand it)
- Start identifying the best practices for building a successful and reliable movie recommendation system.

### 4.1 Analysis of each variable individually

This lets us understand the general characteristics of each variable in our dataset.

In [4]:
# Check the general information about the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43641 entries, 0 to 43640
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   belongs_to_collection  43641 non-null  object 
 1   budget                 43641 non-null  float64
 2   genres                 43641 non-null  object 
 3   id                     43641 non-null  int64  
 4   original_language      43641 non-null  object 
 5   overview               43641 non-null  object 
 6   production_companies   43641 non-null  object 
 7   production_countries   43641 non-null  object 
 8   release_date           43641 non-null  object 
 9   revenue                43641 non-null  float64
 10  runtime                43641 non-null  float64
 11  title                  43641 non-null  object 
 12  cast                   43641 non-null  object 
 13  director               43641 non-null  object 
 14  return                 43641 non-null  float64
dtypes: float64(4), int64(1), object(10)
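
The summary above reports `release_date`, `genres` and `cast` as `object` columns, which means they were read from the CSV as plain strings. A minimal sketch of how they could be converted to richer types follows; the two-row `raw` frame is toy data, and whether the conversion is needed depends on the analysis that comes next.

```python
from ast import literal_eval

import pandas as pd

# Toy frame that mimics how the CSV stores dates and lists (as strings).
raw = pd.DataFrame({
    "release_date": ["1995-10-30", "1995-12-15"],
    "genres": ["['animation', 'comedy', 'family']",
               "['adventure', 'fantasy', 'family']"],
})

typed = raw.assign(
    # Unparseable dates become NaT instead of raising an error.
    release_date=pd.to_datetime(raw["release_date"], errors="coerce"),
    # Turn "['a', 'b']" into a real Python list.
    genres=raw["genres"].apply(literal_eval),
)

print(typed.dtypes)
print(typed["release_date"].dt.year.tolist())
```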

In [5]:
# Extract some basic descriptive statistics
data.describe()

|       | budget | id | revenue | runtime | return |
|-------|--------|----|---------|---------|--------|
| count | 43641.0 | 43641.0 | 43641.0 | 43641.0 | 43641.0 |
| mean | 4392429.0 | 105551.922321 | 11669780.0 | 95.452648 | 686.283 |
| std | 17760150.0 | 111074.305613 | 65617320.0 | 36.563856 | 76163.5 |
| min | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 25587.0 | 0.0 | 86.0 | 0.0 |
| 50% | 0.0 | 57536.0 | 0.0 | 95.0 | 0.0 |
| 75% | 0.0 | 149154.0 | 0.0 | 107.0 | 0.0 |
| max | 380000000.0 | 469172.0 | 2787965000.0 | 1256.0 | 12396380.0 |
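
In the summary above, the 25%, 50% and 75% quantiles of `budget`, `revenue` and `return` are all 0.0, so at least three quarters of the rows carry no financial information. The sketch below shows one way to measure the share of zeros and to isolate the rows where both amounts are known; the `toy` frame is illustrative data taken from the sample rows, not the full dataset.

```python
import pandas as pd

# Toy data: the five sample rows shown earlier in the notebook.
toy = pd.DataFrame({
    "budget": [30000000.0, 65000000.0, 0.0, 16000000.0, 0.0],
    "revenue": [373554033.0, 262797249.0, 0.0, 81452156.0, 76578911.0],
})

# Fraction of rows where each amount is exactly zero.
zero_share = (toy[["budget", "revenue"]] == 0).mean()
print(zero_share)

# Rows where both budget and revenue are known (non-zero).
known = toy[(toy["budget"] > 0) & (toy["revenue"] > 0)]
print(len(known), "rows with both amounts")
```

On the real data, the same expression applied to `data` would show how much of the `mean` and `std` of these columns is driven by placeholder zeros.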


In [6]:
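# Work with the genre, director, cast and overview features (plus the title)
nro_registros = 1000
features = ['genres', 'director', 'cast', 'overview', 'title']
# Keep only the first nro_registros rows and drop rows with missing features.
# .copy() gives an independent DataFrame that can be modified safely later on.
df_ml = data.head(nro_registros).dropna(how='any', subset=features).copy()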

In [7]:
# Clean the text data
# !pip install nltk
import re
import string

import nltk
from nltk.corpus import stopwords

# Download the stop-word list (only needed once)
nltk.download('stopwords')

stemmer = nltk.SnowballStemmer("english")
stopword = set(stopwords.words("english"))

# Define the cleaning function
def clean_data(text):
    text = str(text).lower()
    # Remove anything inside square brackets. Note: list-like strings such as
    # "['animation', 'comedy']" are erased entirely by this step.
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>+', '', text)  # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)  # line breaks
    text = re.sub(r'\w*\d\w*', '', text)  # words containing digits
    # Drop stop words, then stem the remaining words
    words = [word for word in text.split(' ') if word not in stopword]
    words = [stemmer.stem(word) for word in words]
    return " ".join(words)

[nltk_data] Downloading package stopwords to /home/cnd-
[nltk_data]     sen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
# Clean every feature except the title
for feature in features:
    if feature != 'title':
        df_ml[feature] = df_ml[feature].apply(clean_data)

df_ml.head(2)

|   | belongs_to_collection | budget | genres | id | original_language | overview | production_companies | production_countries | release_date | revenue | runtime | title | cast | director | return |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | toy story collection | 30000000.0 |  | 862 | en | led woodi andi toy live happili room andi birt... | ['pixar animation studios'] | ['united states of america'] | 1995-10-30 | 373554033.0 | 81.0 | Toy Story |  | john lasset | 12.451801 |
| 1 | no collection | 65000000.0 |  | 8844 | en | sibl judi peter discov enchant board game open... | ['tristar pictures', 'teitler film', 'intersco... | ['united states of america'] | 1995-12-15 | 262797249.0 | 104.0 | Jumanji |  | joe johnston | 4.043035 |
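
In the output above, `genres` and `cast` are empty after cleaning. Those columns hold list-like strings such as `['animation', 'comedy', 'family']`, and the pattern `\[.*?\]` in `clean_data` deletes everything between square brackets. If these columns are meant to contribute to the recommendation, one option is to flatten them into plain tokens before cleaning. The sketch below is an assumption about how that could be done; `flatten_list_string` is a hypothetical helper, and gluing each name into a single token (`tomhanks`) is a design choice, not something the original notebook does.

```python
import re
from ast import literal_eval

def flatten_list_string(value: str) -> str:
    """Hypothetical helper: "['tom hanks', 'tim allen']" -> "tomhanks timallen".

    Values that are not list-like strings are returned unchanged.
    """
    try:
        items = literal_eval(value)
    except (ValueError, SyntaxError):
        return str(value)
    if not isinstance(items, list):
        return str(value)
    # Remove inner whitespace so each name or genre stays a single token.
    return " ".join(re.sub(r"\s+", "", str(item)).lower() for item in items)

print(flatten_list_string("['animation', 'comedy', 'family']"))
print(flatten_list_string("['tom hanks', 'tim allen']"))
print(flatten_list_string("John Lasseter"))
```

Applied to `genres` and `cast` before the cleaning step, this would leave no brackets for the regular expression to remove.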


In [9]:
# Now we can build our 'metadata soup': a single string containing
# all the metadata we want to feed to the vectorizer
def create_soup(x):
    return x['genres'] + ' ' + x['director'] + ' ' + x['cast'] + ' ' + x['overview']

In [10]:
df_ml['soup'] = df_ml.apply(create_soup, axis=1)
df_ml['soup'][0]

' john lasset  led woodi andi toy live happili room andi birthday bring buzz lightyear onto scene afraid lose place andi heart woodi plot buzz circumst separ buzz woodi owner duo eventu learn put asid differ'

In [11]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
count_matrix = count.fit_transform(df_ml['soup'])

In [None]:
# Compute the cosine similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(count_matrix, count_matrix)
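
To make the two steps above concrete: a count matrix stores, for each document, how many times each term appears, and the cosine similarity of two rows `a` and `b` is `a·b / (‖a‖ ‖b‖)`. The sketch below computes both by hand on three made-up soups, using only the standard library. It tokenizes on whitespace, which is a simplification of what `CountVectorizer` does by default.

```python
import math
from collections import Counter

def count_vector(text: str) -> Counter:
    """Bag-of-words counts for one document (whitespace tokenization)."""
    return Counter(text.split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up soups, for illustration only.
toy_soups = [
    "toy cowboy space ranger friendship",
    "toy cowboy rescue friendship",
    "wedding feud neighbors",
]
vectors = [count_vector(soup) for soup in toy_soups]

print(round(cosine(vectors[0], vectors[1]), 3))  # 3 shared terms -> 0.671
print(round(cosine(vectors[0], vectors[2]), 3))  # no shared terms -> 0.0
```

The first two soups share three of their terms, so they score 3 / (√5 · √4) ≈ 0.671, while soups with no terms in common score 0.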

In [None]:
# Reset the index of the main DataFrame and build the reverse mapping from title to row position
df_ml = df_ml.reset_index()
indices = pd.Series(df_ml.index, index=df_ml['title'])

In [None]:
def get_recomentacion(title, cosine_sim=cosine_sim):
    # Row position of the requested movie
    idx = indices[title]

    # Pairwise similarity scores between that movie and every movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 5 most similar movies (position 0 is normally the movie itself)
    sim_scores = sim_scores[1:6]

    # Get the row positions of those movies
    movie_indices = [i[0] for i in sim_scores]

    # Return the titles of the 5 most similar movies
    return df_ml['title'].iloc[movie_indices]

In [None]:
get_recomentacion('Born to Win')
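
A lookup such as `indices[title]` raises a `KeyError` when the title is not in the mapping, and returns several positions when two movies share a title. The sketch below shows one possible way to guard against both cases on toy data; `lookup_position` is a hypothetical helper, and keeping the first occurrence of a repeated title is an arbitrary choice.

```python
import pandas as pd

# Toy titles with one repeated entry, for illustration only.
titles = pd.Series(["Toy Story", "Jumanji", "Heat", "Jumanji"])
indices = pd.Series(titles.index, index=titles)

# Keep only the first row for each title so a lookup returns a single position.
unique_indices = indices[~indices.index.duplicated(keep="first")]

def lookup_position(title, index_map=unique_indices):
    """Hypothetical helper: row position for a title, or None if it is unknown."""
    if title not in index_map.index:
        return None
    return int(index_map[title])

print(lookup_position("Jumanji"))
print(lookup_position("Not In The Catalogue"))
```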

In [None]:
# Second approach: build a list of tags per movie from the raw columns
movies = data[['id', 'title', 'overview', 'genres', 'cast', 'director']].copy()
movies.sample(5)

In [None]:
# Silence pandas' chained-assignment warning (default='warn')
pd.options.mode.chained_assignment = None
# Split each overview into a list of words
movies['overview'] = movies['overview'].apply(lambda x: x.split())

In [None]:
movies.sample(5)

In [None]:
# Wrap each director name in a list so it can be concatenated with the other lists
movies['director'] = movies['director'].apply(lambda x: [x])
movies.sample(5)

In [None]:
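from ast import literal_eval

# genres and cast are stored as strings like "['a', 'b']"; parse them into lists,
# then concatenate overview words, genres, cast and director into one list of tags
movies['tags'] = (
    movies['overview']
    + movies['genres'].apply(literal_eval)
    + movies['cast'].apply(literal_eval)
    + movies['director']
)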

In [None]:
# Keep only id, title and tags
new = movies.drop(columns=['overview', 'genres', 'cast', 'director'])

In [None]:
# Join each list of tags into a single space-separated string
new['tags'] = new['tags'].apply(lambda x: " ".join(x))
new.head()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Bag of words limited to the 1,000 most frequent terms, ignoring English stop words
cv = CountVectorizer(max_features=1000, stop_words='english')

In [None]:
# Dense document-term matrix: one row per movie, one column per term
vector = cv.fit_transform(new['tags']).toarray()

In [None]:
vector.shape

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Pairwise cosine similarity between all movies (an n_movies x n_movies matrix)
similarity = cosine_similarity(vector)
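
With all 43,641 movies, the full similarity matrix has 43,641² ≈ 1.9 billion entries, which is roughly 15 GB when stored as 64-bit floats. A possible alternative, sketched below with NumPy on toy data, is to compute the similarities for one movie at a time: a single matrix-vector product gives that movie's score against every other one. `top_k_similar` is a hypothetical helper and is not part of the original notebook.

```python
import numpy as np

def top_k_similar(vectors, query_row, k=5):
    """Hypothetical helper: cosine similarity of one row against all rows."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1)
    norms[norms == 0] = 1.0  # rows with no terms get similarity 0 instead of NaN
    scores = (vectors @ vectors[query_row]) / (norms * norms[query_row])
    scores[query_row] = -np.inf  # never recommend the query movie itself
    best = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in best]

# Toy document-term matrix: 4 movies, 4 terms.
toy_vector = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
])

print(top_k_similar(toy_vector, query_row=0, k=2))
```

This keeps memory proportional to the number of movies rather than to its square, at the cost of recomputing the scores for each query.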