# AMAZON PRIME DATAFRAME

1. Data Cleaning and Formatting
2. Data Aggregation and Filtering
3. Data Structuring and Combining Data

## Import library & import csv

In [192]:
# Import all libraries. 

import pandas as pd
import numpy as np
import re
import seaborn as sns 
import matplotlib.pyplot as plt

In [193]:
# Import the csv.

amazon = pd.read_csv("/Users/roraimachavez/Downloads/7.IRONHACK/Projects/data-wrangling-project/src/amazon_prime_titles.csv")

## General information

In [194]:
amazon #General info of the DataFrame

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,The Grand Seduction,Don McKellar,"Brendan Gleeson, Taylor Kitsch, Gordon Pinsent",Canada,"March 30, 2021",2014,,113 min,"Comedy, Drama",A small fishing village must procure a local d...
1,s2,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,"March 30, 2021",2018,13+,110 min,"Drama, International",A Metro Family decides to fight a Cyber Crimin...
2,s3,Movie,Secrets of Deception,Josh Webber,"Tom Sizemore, Lorenzo Lamas, Robert LaSardo, R...",United States,"March 30, 2021",2017,,74 min,"Action, Drama, Suspense",After a man discovers his wife is cheating on ...
3,s4,Movie,Pink: Staying True,Sonia Anderson,"Interviews with: Pink, Adele, Beyoncé, Britney...",United States,"March 30, 2021",2014,,69 min,Documentary,"Pink breaks the mold once again, bringing her ..."
4,s5,Movie,Monster Maker,Giles Foster,"Harry Dean Stanton, Kieran O'Brien, George Cos...",United Kingdom,"March 30, 2021",1989,,45 min,"Drama, Fantasy",Teenage Matt Banting wants to work with a famo...
...,...,...,...,...,...,...,...,...,...,...,...,...
9663,s9664,Movie,Pride Of The Bowery,Joseph H. Lewis,"Leo Gorcey, Bobby Jordan",,,1940,7+,60 min,Comedy,New York City street principles get an East Si...
9664,s9665,TV Show,Planet Patrol,,"DICK VOSBURGH, RONNIE STEVENS, LIBBY MORRIS, M...",,,2018,13+,4 Seasons,TV Shows,"This is Earth, 2100AD - and these are the adve..."
9665,s9666,Movie,Outpost,Steve Barker,"Ray Stevenson, Julian Wadham, Richard Brake, M...",,,2008,R,90 min,Action,"In war-torn Eastern Europe, a world-weary grou..."
9666,s9667,TV Show,Maradona: Blessed Dream,,"Esteban Recagno, Ezequiel Stremiz, Luciano Vit...",,,2021,TV-MA,1 Season,"Drama, Sports","The series tells the story of Diego Maradona, ..."


In [195]:
amazon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9668 entries, 0 to 9667
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       9668 non-null   object
 1   type          9668 non-null   object
 2   title         9668 non-null   object
 3   director      7585 non-null   object
 4   cast          8435 non-null   object
 5   country       672 non-null    object
 6   date_added    155 non-null    object
 7   release_year  9668 non-null   int64 
 8   rating        9331 non-null   object
 9   duration      9668 non-null   object
 10  listed_in     9668 non-null   object
 11  description   9668 non-null   object
dtypes: int64(1), object(11)
memory usage: 906.5+ KB


`Rows: 9668`

`Columns: 12`

## Data cleaning & formatting

1. Edit column names.
2. Delete columns I won't use.
3. Keep only the last 10 years.
4. Delete duplicate rows.
5. Replace null values.
6. Transform column types if necessary.
7. Add a column for "platform".

`Edit column names & Delete columns I won't use.`

In [196]:
# Edit column names.

amazon = amazon.rename(columns=lambda x: x.replace('_', ' '))

# Change column name. 

amazon.rename(columns={'listed in': 'genres'}, inplace=True)

# Delete columns I won't use.

amazon = amazon.drop(["show id", "date added", "rating", "duration", "description"], axis=1)

In [197]:
amazon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9668 entries, 0 to 9667
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   type          9668 non-null   object
 1   title         9668 non-null   object
 2   director      7585 non-null   object
 3   cast          8435 non-null   object
 4   country       672 non-null    object
 5   release year  9668 non-null   int64 
 6   genres        9668 non-null   object
dtypes: int64(1), object(6)
memory usage: 528.8+ KB


`Filter rows: keep titles released from 2014 to 2024 (inclusive).`

In [198]:
# Keep only titles released from 2014 to 2024 inclusive (roughly the last 10 years).
amazon_filter = amazon.loc[(amazon['release year'] > 2013) & (amazon['release year'] <= 2024)]
amazon = amazon_filter.copy()  # assign the filtered DataFrame back to the original name
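
The same inclusive year filter can also be written with `Series.between`, which makes both bounds explicit. This is a minimal sketch on a made-up toy frame (the titles and years below are illustrative, not taken from the dataset). Note that 2014 through 2024 inclusive covers eleven release years.

```python
import pandas as pd

# Toy data standing in for the real catalogue (illustrative values only).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release year": [2013, 2014, 2024, 2025],
})

# Series.between is inclusive on both ends by default,
# so this keeps 2014 and 2024 and drops 2013 and 2025.
recent = toy.loc[toy["release year"].between(2014, 2024)].copy()
print(recent["release year"].tolist())
```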

In [199]:
amazon.shape

(5808, 7)

`Delete duplicate data.`

In [200]:
# Delete duplicate rows.
amazon_drop_duplicates = amazon.drop_duplicates()
amazon = amazon_drop_duplicates.copy()

In [201]:
amazon.shape  # There weren't any duplicate rows

(5808, 7)
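
Comparing shapes before and after works, but the number of duplicates can also be counted directly. A small sketch on a hypothetical toy frame (not the real data):

```python
import pandas as pd

# Toy frame with one fully repeated row (illustrative values only).
toy = pd.DataFrame({
    "title": ["A", "A", "B"],
    "release year": [2015, 2015, 2016],
})

# duplicated() marks rows that repeat an earlier row across all columns.
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()
print(n_duplicates, deduped.shape)
```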

`Check null values.`

In [202]:
# Check how many null values we have.
num_nans = amazon.isna().sum() 
num_nans

type               0
title              0
director        1523
cast             905
country         5333
release year       0
genres             0
dtype: int64

In [203]:
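# Replace null values with a placeholder.
# Assign the result back instead of calling fillna(inplace=True) on each column,
# which is chained assignment and may not update the DataFrame in recent pandas versions.
amazon = amazon.fillna("not found")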
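
If only the text columns should receive the placeholder, the fill can be limited to an explicit list of columns. The sketch below uses a toy frame, and `text_cols` is a hypothetical name chosen for illustration; numeric columns such as `release year` are left untouched.

```python
import numpy as np
import pandas as pd

# Toy frame with missing text values (illustrative values only).
toy = pd.DataFrame({
    "director": ["Don McKellar", np.nan],
    "country": [np.nan, "India"],
    "release year": [2014, 2018],
})

# Hypothetical list of text columns that should get the placeholder.
text_cols = ["director", "country"]
toy[text_cols] = toy[text_cols].fillna("not found")

print(toy.isna().sum())
```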

In [204]:
num_nans = amazon.isna().sum() 
num_nans

type            0
title           0
director        0
cast            0
country         0
release year    0
genres          0
dtype: int64

`Add a column naming the platform.`

In [205]:
# Add a column for "platform"
amazon['platform'] = 'Amazon Prime'

In [206]:
amazon.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5808 entries, 0 to 9666
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   type          5808 non-null   object
 1   title         5808 non-null   object
 2   director      5808 non-null   object
 3   cast          5808 non-null   object
 4   country       5808 non-null   object
 5   release year  5808 non-null   int64 
 6   genres        5808 non-null   object
 7   platform      5808 non-null   object
dtypes: int64(1), object(7)
memory usage: 408.4+ KB


In [207]:
amazon.head()

Unnamed: 0,type,title,director,cast,country,release year,genres,platform
0,Movie,The Grand Seduction,Don McKellar,"Brendan Gleeson, Taylor Kitsch, Gordon Pinsent",Canada,2014,"Comedy, Drama",Amazon Prime
1,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,2018,"Drama, International",Amazon Prime
2,Movie,Secrets of Deception,Josh Webber,"Tom Sizemore, Lorenzo Lamas, Robert LaSardo, R...",United States,2017,"Action, Drama, Suspense",Amazon Prime
3,Movie,Pink: Staying True,Sonia Anderson,"Interviews with: Pink, Adele, Beyoncé, Britney...",United States,2014,Documentary,Amazon Prime
6,Movie,Hired Gun,Fran Strine,"Alice Cooper, Liberty DeVitto, Ray Parker Jr.,...",United States,2017,"Documentary, Special Interest",Amazon Prime


## Genre filter

In [208]:
# List every unique genre combination and count them.
unicos = amazon["genres"].unique()
for i in unicos:
    print(i)
len(unicos)

Comedy, Drama
Drama, International
Action, Drama, Suspense
Documentary
Documentary, Special Interest
Comedy
Action, Science Fiction, Suspense
Adventure, Kids
Horror, Suspense
Documentary, Sports
Horror, Science Fiction
Comedy, Talk Show and Variety
Science Fiction
Action, Anime, Comedy
TV Shows
Animation, Anime, Fantasy
Action, Adventure, Animation
Drama
Fitness, Special Interest
Faith and Spirituality, Special Interest
Special Interest
Fitness
Arts, Entertainment, and Culture, Comedy, Talk Show and Variety
Documentary, Science Fiction
Adventure, Animation, Kids
Drama, Romance, Suspense
Unscripted
Documentary, Military and War
Kids
Animation, Kids
Arts, Entertainment, and Culture, Comedy
Arts, Entertainment, and Culture, Comedy, Special Interest
Action, Drama
Arts, Entertainment, and Culture
Drama, Special Interest
Action, Science Fiction
Documentary, Faith and Spirituality, Special Interest
Action, Drama, Special Interest
Drama, Young Adult Audience
Sports
Comedy, International
Arts, 

416

In [210]:
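# Make one indicator column per selected genre.
# Note: with sep=',' every genre after the first keeps its leading space (e.g. ' Drama'),
# so a column such as 'Drama' is set only when that genre is listed first.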
dfama = pd.concat([
    amazon, 
    amazon['genres'].str.get_dummies(sep=',')[[
        'Drama','Comedy','Documentary','Action','Animation',
        'Fantasy','Horror','Suspense','Science Fiction',
        'Adventure','Romance','Kids','Anime', ' Entertainment', 'Music Videos and Concerts']]
], axis=1).drop(columns=['genres'])
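
The separator passed to `str.get_dummies` decides how the genre tokens are named. The sketch below, on three genre strings copied from the output above, compares `sep=','` with `sep=', '`. It is only an illustration of the difference, not a change to the pipeline. One caveat to keep in mind: splitting on `', '` would also break "Arts, Entertainment, and Culture" into three separate tokens.

```python
import pandas as pd

# Three genre strings as they appear in the dataset.
genres = pd.Series(["Comedy, Drama", "Drama, International", "Documentary"])

# Splitting on ',' keeps the leading space on every token after the first,
# so 'Drama' and ' Drama' become two different columns.
by_comma = genres.str.get_dummies(sep=",")

# Splitting on ', ' removes that space, so each genre gets a single column.
by_comma_space = genres.str.get_dummies(sep=", ")

print(sorted(by_comma.columns))
print(sorted(by_comma_space.columns))
print(by_comma["Drama"].tolist(), by_comma_space["Drama"].tolist())
```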

In [211]:
# Convert column names to lower case.
dfama.columns = dfama.columns.str.lower()

In [212]:
# Rename new columns.
dfama.rename(columns={
    'kids': 'children & family',
    'science fiction': 'sci-fi'},
    inplace=True)

In [213]:
dfama.head()

Unnamed: 0,type,title,director,cast,country,release year,platform,drama,comedy,documentary,...,fantasy,horror,suspense,sci-fi,adventure,romance,children & family,anime,entertainment,music videos and concerts
0,Movie,The Grand Seduction,Don McKellar,"Brendan Gleeson, Taylor Kitsch, Gordon Pinsent",Canada,2014,Amazon Prime,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,Movie,Take Care Good Night,Girish Joshi,"Mahesh Manjrekar, Abhay Mahajan, Sachin Khedekar",India,2018,Amazon Prime,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Movie,Secrets of Deception,Josh Webber,"Tom Sizemore, Lorenzo Lamas, Robert LaSardo, R...",United States,2017,Amazon Prime,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Movie,Pink: Staying True,Sonia Anderson,"Interviews with: Pink, Adele, Beyoncé, Britney...",United States,2014,Amazon Prime,0,0,1,...,0,0,0,0,0,0,0,0,0,0
6,Movie,Hired Gun,Fran Strine,"Alice Cooper, Liberty DeVitto, Ray Parker Jr.,...",United States,2017,Amazon Prime,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [216]:
# Create new columns that each combine two genre columns.
dfama['action & adventure'] = dfama['adventure'] | dfama['action']

dfama.drop(columns=['adventure', 'action'], inplace=True)

dfama['sci-fi & fantasy'] = dfama['sci-fi'] | dfama['fantasy']

dfama.drop(columns=['sci-fi', 'fantasy'], inplace=True)

dfama['animation2'] = dfama['animation'] | dfama['anime']

dfama.drop(columns=['animation', 'anime'], inplace=True)
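
Because the genre columns hold 0/1 integers, the bitwise OR (`|`) yields 1 whenever either source column is 1. A minimal sketch on a toy frame with made-up flags:

```python
import pandas as pd

# Toy 0/1 indicator columns (illustrative values only).
flags = pd.DataFrame({
    "action":    [1, 0, 0, 1],
    "adventure": [0, 1, 0, 1],
})

# Bitwise OR on integer 0/1 columns: 1 if either genre is flagged.
flags["action & adventure"] = flags["action"] | flags["adventure"]
flags = flags.drop(columns=["action", "adventure"])

print(flags["action & adventure"].tolist())
```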

In [219]:
# Change column names.

dfama.rename(columns={
    'animation2': 'animation',
    'suspense': 'thrillers',
    'music videos and concerts': 'music & musicals',
    ' entertainment': 'entertainment'},
    inplace=True)

In [220]:
dfama.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5808 entries, 0 to 9666
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   type                5808 non-null   object
 1   title               5808 non-null   object
 2   director            5808 non-null   object
 3   cast                5808 non-null   object
 4   country             5808 non-null   object
 5   release year        5808 non-null   int64 
 6   platform            5808 non-null   object
 7   drama               5808 non-null   int64 
 8   comedy              5808 non-null   int64 
 9   documentary         5808 non-null   int64 
 10  horror              5808 non-null   int64 
 11  thrillers           5808 non-null   int64 
 12  romance             5808 non-null   int64 
 13  children & family   5808 non-null   int64 
 14  entertainment       5808 non-null   int64 
 15  music & musicals    5808 non-null   int64 
 16  action & adventure  5808 non-

In [221]:
# Check null values. 
num_nans = dfama.isna().sum() 
num_nans

type                  0
title                 0
director              0
cast                  0
country               0
release year          0
platform              0
drama                 0
comedy                0
documentary           0
horror                0
thrillers             0
romance               0
children & family     0
entertainment         0
music & musicals      0
action & adventure    0
sci-fi & fantasy      0
animation             0
dtype: int64

In [223]:
# Save DataFrame
dfama.to_csv('amazonprime.csv', index=True)
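
Since the file is written with `index=True`, the original row labels are stored as an unnamed first column. When loading the file again, passing `index_col=0` restores them as the index rather than as an extra `Unnamed: 0` column. The sketch below uses an in-memory buffer and a toy frame instead of the real file:

```python
import io

import pandas as pd

# Toy frame with a non-contiguous index, like the one left after filtering rows.
toy = pd.DataFrame(
    {"title": ["The Grand Seduction", "Hired Gun"], "drama": [0, 0]},
    index=[0, 6],
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=True)
buffer.seek(0)

# index_col=0 tells read_csv to use the first column as the index.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```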