# Data Characterization

First, the required <span style="color:seagreen">packages</span> are imported, and the Netflix and Disney+ <span style="color:coral">csv</span> files are read in and stored in <span style="color:lightblue">df_list</span>, so that both datasets can be examined in a single loop with fewer lines of code.

# Table of Contents
- [Examining the Data](#untersuchung-der-Daten): [Netflix](##untersuchung-netflix) & [Disney+](##untersuchung-disney) & [Both](##untersuchung-beide)
- [Cleaning](#cleaning): [For Both](##für-beide) & [Netflix](##netflix) & [Disney+](##disney)


In [2]:
import pandas as pd
import numpy as np
import os
import plotly as pl

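# `__file__` is not defined inside a notebook, so the working directory is used instead
# (assumes the notebook is run from its own directory).
notebook_dir = os.getcwd()
dataset_dir = os.path.join(notebook_dir, '../1_Datenset/ursprüngliche')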

disney_df = pd.read_csv(os.path.join(dataset_dir, 'disney_plus_titles.csv'), sep=',')
netflix_df = pd.read_csv(os.path.join(dataset_dir, 'netflix_titles.csv'), sep=',')

df_list = [
    ('disney_df', disney_df),
    ('netflix_df', netflix_df)
]
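Because `df_list` pairs each DataFrame with its name, the same check can be run on both datasets in one loop. The following is a minimal sketch of that idea, not the notebook's own code: the helper `summarize` and the toy DataFrames are hypothetical stand-ins so the example runs without the CSV files.

```python
import pandas as pd

# Toy stand-ins for disney_df and netflix_df (hypothetical values).
toy_disney = pd.DataFrame({"title": ["A", "B"], "director": ["X", None]})
toy_netflix = pd.DataFrame({"title": ["C", "D", "E"], "director": [None, None, "Y"]})

toy_df_list = [("disney_df", toy_disney), ("netflix_df", toy_netflix)]


def summarize(frames):
    """Return one row per DataFrame: row count, column count, total missing values."""
    rows = []
    for name, df in frames:
        rows.append({
            "name": name,
            "rows": df.shape[0],
            "columns": df.shape[1],
            "missing_values": int(df.isna().sum().sum()),
        })
    return pd.DataFrame(rows).set_index("name")


summary = summarize(toy_df_list)
print(summary)
```

With the real data, `summarize(df_list)` would put the shape and the total number of missing values of both datasets side by side in one table.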

## Examining the Data

### Examining Netflix

In [3]:
print('--------------------------------------------------------------------')
print('Info:')
netflix_df.info()  # info() prints its report itself and returns None, so it is not wrapped in print()
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
--------------------------------------------------------------------


In [4]:
print('--------------------------------------------------------------------')
print('Head:')
print(netflix_df.head())
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Head:
  show_id     type                  title         director  \
0      s1    Movie   Dick Johnson Is Dead  Kirsten Johnson   
1      s2  TV Show          Blood & Water              NaN   
2      s3  TV Show              Ganglands  Julien Leclercq   
3      s4  TV Show  Jailbirds New Orleans              NaN   
4      s5  TV Show           Kota Factory              NaN   

                                                cast        country  \
0                                                NaN  United States   
1  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa   
2  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...            NaN   
3                                                NaN            NaN   
4  Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...          India   

           date_added  release_year rating   duration  \
0  September 25, 2021          2020  PG-13     90 min   
1  Septembe

In [5]:
print('--------------------------------------------------------------------')
print('Shape:')
print(netflix_df.shape)
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Shape:
(8807, 12)
--------------------------------------------------------------------


In [6]:
print('--------------------------------------------------------------------')
print('Missing values per column: ')
print(netflix_df.isna().sum())
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Missing values per column: 
show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
--------------------------------------------------------------------
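The absolute counts above are easier to compare when they are shown next to their share of all rows; in the Netflix data, for instance, `director` is missing in 2634 of 8807 rows, roughly 30 %. The sketch below shows one way to compute both figures. It uses a small hypothetical DataFrame (`toy_df`) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data standing in for netflix_df.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "country": [None, "India", "Japan", "Japan"],
})

# isna().sum() counts missing cells; isna().mean() gives their fraction per column.
missing = pd.DataFrame({
    "missing": toy_df.isna().sum(),
    "share_pct": (toy_df.isna().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(missing)
```

Replacing `toy_df` with `netflix_df` or `disney_df` would give the same table for the real data.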


### Examining Disney+

In [8]:
print('--------------------------------------------------------------------')
print('Info:')
disney_df.info()  # info() prints its report itself and returns None, so it is not wrapped in print()
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1450 entries, 0 to 1449
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       1450 non-null   object
 1   type          1450 non-null   object
 2   title         1450 non-null   object
 3   director      977 non-null    object
 4   cast          1260 non-null   object
 5   country       1231 non-null   object
 6   date_added    1447 non-null   object
 7   release_year  1450 non-null   int64 
 8   rating        1447 non-null   object
 9   duration      1450 non-null   object
 10  listed_in     1450 non-null   object
 11  description   1450 non-null   object
dtypes: int64(1), object(11)
memory usage: 136.1+ KB
--------------------------------------------------------------------


In [9]:
print('--------------------------------------------------------------------')
print('Head:')
print(disney_df.head())
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Head:
  show_id     type                                             title  \
0      s1    Movie  Duck the Halls: A Mickey Mouse Christmas Special   
1      s2    Movie                            Ernest Saves Christmas   
2      s3    Movie                      Ice Age: A Mammoth Christmas   
3      s4    Movie                        The Queen Family Singalong   
4      s5  TV Show                             The Beatles: Get Back   

                            director  \
0  Alonso Ramirez Ramos, Dave Wasson   
1                        John Cherry   
2                       Karen Disher   
3                    Hamish Hamilton   
4                                NaN   

                                                cast        country  \
0  Chris Diamantopoulos, Tony Anselmo, Tress MacN...            NaN   
1           Jim Varney, Noelle Parker, Douglas Seale            NaN   
2  Raymond Albert Romano, John Leguiza

In [10]:
print('--------------------------------------------------------------------')
print('Shape:')
print(disney_df.shape)
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Shape:
(1450, 12)
--------------------------------------------------------------------


In [11]:
print('--------------------------------------------------------------------')
print('Missing values per column: ')
print(disney_df.isna().sum())
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Missing values per column: 
show_id           0
type              0
title             0
director        473
cast            190
country         219
date_added        3
release_year      0
rating            3
duration          0
listed_in         0
description       0
dtype: int64
--------------------------------------------------------------------


### Examining Both

In [12]:
for name, df in df_list:
    print(name)
    print('Number of duplicates:', df.duplicated(subset=['title']).sum())
    print('Types:', df['type'].value_counts())
    print('Number of age ratings:', df['rating'].nunique())
    print('Number of directors:', df['director'].nunique())
    # nunique() counts distinct cell values, i.e. whole comma-separated combinations rather than single genres or countries
    print('Number of genres:', df['listed_in'].nunique())  # very high number -> inspect the values
    print('Number of countries:', df['country'].nunique())  # 748 for Netflix -> inspect the values
    print('--------------------------------------------------------------------')


disney_df
Number of duplicates: 0
Types: type
Movie      1052
TV Show     398
Name: count, dtype: int64
Number of age ratings: 9
Number of directors: 609
Number of genres: 329
Number of countries: 89
--------------------------------------------------------------------
netflix_df
Number of duplicates: 0
Types: type
Movie      6131
TV Show    2676
Name: count, dtype: int64
Number of age ratings: 17
Number of directors: 4528
Number of genres: 514
Number of countries: 748
--------------------------------------------------------------------
--------------------------------------------------------------------
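The very high genre and country counts suggest that a single cell can hold several comma-separated values, so `nunique()` counts combinations rather than individual entries. One way to check this is to split the strings and count the parts separately. The sketch below assumes a comma as the separator and uses a hypothetical toy column instead of the real data.

```python
import pandas as pd

# Hypothetical toy column; the real `country` column is assumed to look similar.
toy_df = pd.DataFrame({
    "country": [
        "United States",
        "United States, India",
        "India, United States",
        "India",
        None,
    ],
})

# Split each cell on commas, give every part its own row, and trim whitespace.
individual = (
    toy_df["country"]
    .dropna()
    .str.split(",")
    .explode()
    .str.strip()
)
individual = individual[individual != ""]  # drop empty parts left by stray commas

n_combinations = toy_df["country"].nunique()
n_individual = individual.nunique()

print("Distinct cell values:", n_combinations)
print("Distinct individual countries:", n_individual)
print(individual.value_counts())
```

Applied to `df['country']` or `df['listed_in']` inside the loop over `df_list`, the same steps would show how many individual countries and genres each dataset actually contains.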
