# <h1 align=center>**`Movies Score - Data Engineering - PI`**</h1>

## ``Library imports``

In [1]:
import pandas as pd
import numpy as np
# show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

## ``Loading the datasets``

In [115]:
amazon = pd.read_csv('../app/Datasets/amazon_prime_titles-score.csv')
disney = pd.read_csv('../app/Datasets/disney_plus_titles-score.csv')
hulu = pd.read_csv('../app/Datasets/hulu_titles-score (2).csv')
netflix = pd.read_csv('../app/Datasets/netflix_titles-score.csv')

## **Work proposal (requirements for approval)**

**`Transformations`**: The data analyst requires these, ***and only these***, transformations for the data:

+ **Task 1:** Generate an **`id`** field: each id consists of the first letter of the platform name followed by the show_id already present in the datasets (example for Amazon titles: **`as123`**)

+ **Task 2:** Null values in the rating field must be replaced with the string “**`G`**” (the maturity rating “general for all audiences”)

+ **Task 3:** Any dates present must use the **`YYYY-mm-dd`** format

+ **Task 4:** Text fields must be in **lowercase**, without exception

+ **Task 5:** The ***duration*** field must be split into two fields: **`duration_int`** and **`duration_type`**. The first is an integer and the second a string giving the unit of duration: min (minutes) or season (seasons)

### ``Task 1``

In [116]:
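def generate_id_show(initial_name, df):
    # Prefix each show_id with the platform's initial (e.g. 's123' -> 'as123').
    # Note: this overwrites show_id in place instead of adding a separate id column.
    df['show_id'] = initial_name + df['show_id']
    return df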

In [117]:
amazon = generate_id_show('a', amazon)
disney = generate_id_show('d', disney)
hulu = generate_id_show('h', hulu)
netflix = generate_id_show('n', netflix)
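
The requirement names the new field `id`, while the cell above reuses `show_id`. A minimal sketch of a variant that keeps `show_id` and adds a separate `id` column is shown below. The function name `add_id` and the toy dataframe are assumptions made for illustration; they are not part of the notebook.

```python
import pandas as pd

def add_id(df: pd.DataFrame, platform_initial: str) -> pd.DataFrame:
    """Return a copy of df with an `id` column built as initial + show_id."""
    out = df.copy()  # leave the caller's dataframe untouched
    out["id"] = platform_initial + out["show_id"].astype(str)
    return out

# Toy frame standing in for one platform dataset (hypothetical values)
toy = pd.DataFrame({"show_id": ["s1", "s2", "s123"]})
toy_with_id = add_id(toy, "a")
print(toy_with_id)
```

Working on a copy means the original `show_id` values remain available, at the cost of one extra column.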

### ``Task 2``

In [118]:
# amazon[~amazon.isna()]
amazon['rating'].unique()

array([nan, '13+', 'ALL', '18+', 'R', 'TV-Y', 'TV-Y7', 'NR', '16+',
       'TV-PG', '7+', 'TV-14', 'TV-NR', 'TV-G', 'PG-13', 'TV-MA', 'G',
       'PG', 'NC-17', 'UNRATED', '16', 'AGES_16_', 'AGES_18_', 'ALL_AGES',
       'NOT_RATE'], dtype=object)

In [119]:
# Fill missing ratings. Lowercase 'g' is used because every text field is lowercased later on.
amazon['rating'] = amazon['rating'].fillna('g')
disney['rating'] = disney['rating'].fillna('g')
hulu['rating'] = hulu['rating'].fillna('g')
netflix['rating'] = netflix['rating'].fillna('g')
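
A quick way to confirm that the replacement worked is to count missing values before and after. The sketch below does this on a small made-up column; the toy data and variable names are assumptions, and it fills with uppercase `"G"` as the requirement states.

```python
import numpy as np
import pandas as pd

# Toy rating column with two missing entries (hypothetical data)
toy = pd.DataFrame({"rating": [np.nan, "13+", None, "TV-MA"]})

n_missing_before = int(toy["rating"].isna().sum())
toy["rating"] = toy["rating"].fillna("G")
n_missing_after = int(toy["rating"].isna().sum())

print(n_missing_before, n_missing_after)
```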

In [120]:
amazon.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description',
       'score'],
      dtype='object')

### ``Task 3``

In [121]:
def change_date_added_datetime(df):
    # str.strip(): the Netflix data had stray whitespace in the date_added column.
    # Parse dates such as 'March 30, 2021' and rewrite them as YYYY-mm-dd strings.
    df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), format='%B %d, %Y').dt.strftime('%Y-%m-%d')
    return df

In [122]:
amazon = change_date_added_datetime(amazon)
disney = change_date_added_datetime(disney)
hulu = change_date_added_datetime(hulu)
netflix = change_date_added_datetime(netflix)
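
The conversion can be checked on a few hand-written values. The sketch below uses a toy series with leading whitespace and a missing entry. The data is made up, and `errors="coerce"` is an addition that the notebook's function does not use: it turns unparseable values into `NaT` instead of raising.

```python
import pandas as pd

# Toy date strings: one clean, one with leading whitespace, one missing
raw = pd.Series(["March 30, 2021", " September 25, 2021", None])

# Strip whitespace, parse with the explicit format, coerce failures to NaT
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

# Render as YYYY-mm-dd strings; missing dates stay missing
formatted = parsed.dt.strftime("%Y-%m-%d")
print(formatted)
```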

### ``Task 4``

In [123]:
def convert_to_lower(df):
    # Lowercase every column stored with the 'object' (text) dtype.
    object_columns_list = df.dtypes[df.dtypes == 'object'].index.to_list()
    for col in object_columns_list:
        df[col] = df[col].str.lower()
    return df

In [124]:
amazon = convert_to_lower(amazon)
disney = convert_to_lower(disney)
hulu = convert_to_lower(hulu)
netflix = convert_to_lower(netflix)
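
Calling the helper by hand once per dataframe makes it easy to pass the wrong variable. One way to reduce that risk is to keep the dataframes in a dictionary and apply the transformation in a loop. The sketch below illustrates the idea on toy frames; `lower_text_columns`, `frames`, and the sample rows are assumptions made for illustration.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

def lower_text_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every text column lowercased."""
    out = df.copy()
    # Treat both object and dedicated string dtypes as text columns
    text_cols = [
        col for col in out.columns
        if is_object_dtype(out[col]) or is_string_dtype(out[col])
    ]
    for col in text_cols:
        out[col] = out[col].str.lower()
    return out

# Toy frames keyed by platform name (hypothetical rows)
frames = {
    "amazon": pd.DataFrame({"title": ["The Grand Seduction"], "release_year": [2014]}),
    "netflix": pd.DataFrame({"title": ["Some Netflix Title"], "release_year": [2020]}),
}

# Each frame is transformed from its own entry, so names cannot be mixed up
frames = {name: lower_text_columns(df) for name, df in frames.items()}
print(frames["netflix"])
```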

### ``Task 5``

The ***duration*** field must be split into two fields: **`duration_int`** and **`duration_type`**. The first is an integer and the second a string giving the unit of duration: min (minutes) or season (seasons)

In [134]:
def normalize_duration(df):
    # Split e.g. '113 min' or '2 seasons' into a number and a unit.
    df[['duration_int', 'duration_type']] = df['duration'].str.split(' ', expand=True)
    # Collapse the plural so the unit is always 'min' or 'season'.
    df['duration_type'] = df['duration_type'].str.replace('seasons', 'season')
    # Unparseable or missing values become NaN; a column that contains NaN stays float.
    df['duration_int'] = pd.to_numeric(df['duration_int'], downcast='integer', errors='coerce')
    return df

In [135]:
amazon = normalize_duration(amazon)
disney = normalize_duration(disney)
hulu = normalize_duration(hulu)
netflix = normalize_duration(netflix)

In [136]:
amazon['duration_int'].unique()  # Every value can be converted to an integer (int): there are no alphabetic or other non-numeric characters.

array([113, 110,  74,  69,  45,  52,  98, 131,  87,  92,  88,  93,  94,
        46,  96,   1, 104,  62,  50,   3,   2,  86,  36,  37, 103,   9,
        18,  14,  20,  19,  22,  60,   6,  54,   5,  84, 126, 125, 109,
        89,  85,  56,  40, 111,  33,  34,  95,  99,  78,   4,  77,  55,
        53, 115,  58,  49, 135,  91,  64,  59,  48, 122,  90, 102,  65,
       114, 136,  70, 138, 100, 480,  30, 152,  68,  57,   7,  31, 151,
       149, 141, 121,  79, 140,  51, 106,  75,  27, 107, 108,  38, 157,
        43, 118, 139, 112,  15,  72, 116, 142,  71,  42,  81,  32,  66,
       127, 159,  67,  29, 132, 101, 164,  73,  61,  80,  83,  44, 120,
        26,  97,  23, 105,  82,  11, 148, 161, 123,   0, 124, 143,  35,
        47, 170, 146, 601,  24,  21, 154, 128, 133, 153, 119,  63, 169,
       174, 144, 137,  76,  39,   8,  12, 134, 163, 145, 162,  41, 147,
       155, 117, 167,  28,  25, 180, 541, 240, 129, 178, 171, 172, 173,
        10, 166, 160, 130, 479,  13,  17,  16, 158, 183, 150, 48
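
The requirement asks for an integer `duration_int`, but a numeric column that contains missing values is stored as float in pandas by default. One possible alternative, sketched below, extracts the two parts with a regular expression and stores the number in the nullable `Int64` dtype. The function name `split_duration`, the regular expression, and the toy data are assumptions for illustration; this is not the notebook's method.

```python
import pandas as pd

def split_duration(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with duration_int (nullable integer) and duration_type."""
    out = df.copy()
    # Capture the leading number and the unit word that follows it
    parts = out["duration"].str.extract(r"^\s*(\d+)\s+(\w+)", expand=True)
    # Int64 (capital I) is the nullable integer dtype, so missing values stay <NA>
    out["duration_int"] = pd.to_numeric(parts[0], errors="coerce").astype("Int64")
    # Lowercase the unit and drop a trailing 's' ('seasons' -> 'season')
    out["duration_type"] = parts[1].str.lower().str.replace(r"s$", "", regex=True)
    return out

# Toy durations, including a missing value (hypothetical rows)
toy = pd.DataFrame({"duration": ["113 min", "2 Seasons", "1 Season", None]})
result = split_duration(toy)
print(result)
```

Note that the trailing-`s` rule would also shorten any other unit ending in `s`, which is acceptable only if the units are limited to min and season as the requirement describes.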

## ``Concatenating the dataframes``

In [137]:
movies = pd.concat([amazon, disney, hulu, netflix], axis=0)
movies.head()

| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | score | duration_int | duration_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | as1 | movie | the grand seduction | don mckellar | brendan gleeson, taylor kitsch, gordon pinsent | canada | 2021-03-30 | 2014 | g | 113 min | comedy, drama | a small fishing village must procure a local d... | 99 | 113.0 | min |
| 1 | as2 | movie | take care good night | girish joshi | mahesh manjrekar, abhay mahajan, sachin khedekar | india | 2021-03-30 | 2018 | 13+ | 110 min | drama, international | a metro family decides to fight a cyber crimin... | 37 | 110.0 | min |
| 2 | as3 | movie | secrets of deception | josh webber | tom sizemore, lorenzo lamas, robert lasardo, r... | united states | 2021-03-30 | 2017 | g | 74 min | action, drama, suspense | after a man discovers his wife is cheating on ... | 20 | 74.0 | min |
| 3 | as4 | movie | pink: staying true | sonia anderson | interviews with: pink, adele, beyoncé, britney... | united states | 2021-03-30 | 2014 | g | 69 min | documentary | pink breaks the mold once again, bringing her ... | 27 | 69.0 | min |
| 4 | as5 | movie | monster maker | giles foster | harry dean stanton, kieran o'brien, george cos... | united kingdom | 2021-03-30 | 1989 | g | 45 min | drama, fantasy | teenage matt banting wants to work with a famo... | 75 | 45.0 | min |
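
With `axis=0` and default settings, `pd.concat` keeps each dataframe's original index, so row labels repeat across platforms. If a single running index is preferred, `ignore_index=True` is one option. The sketch below shows the effect on two toy frames; the frames and their values are made up for illustration.

```python
import pandas as pd

# Toy frames standing in for two platform datasets (hypothetical rows)
amazon_toy = pd.DataFrame({"show_id": ["as1", "as2"], "score": [99, 37]})
disney_toy = pd.DataFrame({"show_id": ["ds1"], "score": [80]})

# Stack the rows and renumber them 0..n-1 instead of repeating 0, 1, 0
combined = pd.concat([amazon_toy, disney_toy], axis=0, ignore_index=True)
print(combined)
```

Because the prefixed `show_id` values are unique across platforms, they can still identify each row after the index is reset.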
