# Data Preparation

First, the required <span style="color:seagreen">packages</span> are imported and the Netflix and Disney+ <span style="color:coral">csv</span> files are read in. Both DataFrames are stored in <span style="color:lightblue">df_list</span> so that the cleaning steps can run over both in a single loop, which keeps the code short and prepares the later merge.

# Table of Contents
- [Exploring the Data](#exploring-the-data): [Netflix](#exploring-netflix) & [Disney+](#exploring-disney) & [Both](#exploring-both)
- [Cleaning](#cleaning): [For Both](#for-both) & [Netflix](#netflix) & [Disney+](#disney)
- [Saving](#1-saving)
- [Merging and Saving](#merging-and-saving)

In [1]:
import pandas as pd
import numpy as np
import os
import plotly as pl
import pycountry

# A notebook has no __file__, so paths are resolved relative to the current working directory
notebook_dir = os.getcwd()
dataset_dir = os.path.join(notebook_dir, '../1_Datenset/ursprüngliche')

disney_df = pd.read_csv(os.path.join(dataset_dir, 'disney_plus_titles.csv'), sep=',')
netflix_df = pd.read_csv(os.path.join(dataset_dir, 'netflix_titles.csv'), sep=',')

df_list = [
    ('disney_df', disney_df),
    ('netflix_df', netflix_df)
]

## Exploring the Data

### Exploring Netflix

In [2]:
print('--------------------------------------------------------------------')
print('Info:')
print(netflix_df.info())
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
None
--------------------------------------------------------------------


In [3]:
print('--------------------------------------------------------------------')
print('Head:')
print(netflix_df.head())
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Head:
  show_id     type                  title         director  \
0      s1    Movie   Dick Johnson Is Dead  Kirsten Johnson   
1      s2  TV Show          Blood & Water              NaN   
2      s3  TV Show              Ganglands  Julien Leclercq   
3      s4  TV Show  Jailbirds New Orleans              NaN   
4      s5  TV Show           Kota Factory              NaN   

                                                cast        country  \
0                                                NaN  United States   
1  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa   
2  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...            NaN   
3                                                NaN            NaN   
4  Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...          India   

           date_added  release_year rating   duration  \
0  September 25, 2021          2020  PG-13     90 min   
1  Septembe

In [4]:
print('--------------------------------------------------------------------')
print('Shape:')
print(netflix_df.shape)
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Shape:
(8807, 12)
--------------------------------------------------------------------


In [5]:
print('--------------------------------------------------------------------')
print('Missing values per column: ')
print(netflix_df.isna().sum())
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Missing values per column: 
show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
--------------------------------------------------------------------


### Exploring Disney+

In [6]:
print('--------------------------------------------------------------------')
print('Info:')
print(disney_df.info())
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1450 entries, 0 to 1449
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       1450 non-null   object
 1   type          1450 non-null   object
 2   title         1450 non-null   object
 3   director      977 non-null    object
 4   cast          1260 non-null   object
 5   country       1231 non-null   object
 6   date_added    1447 non-null   object
 7   release_year  1450 non-null   int64 
 8   rating        1447 non-null   object
 9   duration      1450 non-null   object
 10  listed_in     1450 non-null   object
 11  description   1450 non-null   object
dtypes: int64(1), object(11)
memory usage: 136.1+ KB
None
--------------------------------------------------------------------


In [7]:
print('--------------------------------------------------------------------')
print('Head:')
print(disney_df.head())
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Head:
  show_id     type                                             title  \
0      s1    Movie  Duck the Halls: A Mickey Mouse Christmas Special   
1      s2    Movie                            Ernest Saves Christmas   
2      s3    Movie                      Ice Age: A Mammoth Christmas   
3      s4    Movie                        The Queen Family Singalong   
4      s5  TV Show                             The Beatles: Get Back   

                            director  \
0  Alonso Ramirez Ramos, Dave Wasson   
1                        John Cherry   
2                       Karen Disher   
3                    Hamish Hamilton   
4                                NaN   

                                                cast        country  \
0  Chris Diamantopoulos, Tony Anselmo, Tress MacN...            NaN   
1           Jim Varney, Noelle Parker, Douglas Seale            NaN   
2  Raymond Albert Romano, John Leguiza

In [8]:
print('--------------------------------------------------------------------')
print('Shape:')
print(disney_df.shape)
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Shape:
(1450, 12)
--------------------------------------------------------------------


In [9]:
print('--------------------------------------------------------------------')
print('Missing values per column: ')
print(disney_df.isna().sum())
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Missing values per column: 
show_id           0
type              0
title             0
director        473
cast            190
country         219
date_added        3
release_year      0
rating            3
duration          0
listed_in         0
description       0
dtype: int64
--------------------------------------------------------------------


### Exploring Both

In [10]:
for name, df in df_list:
    print(name)
    print('Number of duplicates:', df[df.duplicated(subset=['title'])].shape[0])
    print('Types:', df['type'].value_counts())
    print('Number of age ratings:', df['rating'].nunique())
    print('Number of directors:', df['director'].nunique())
    print('Number of genres:', df['listed_in'].nunique()) # very high number -> inspect the values
    print('Number of countries:', df['country'].nunique()) # 748 for Netflix -> inspect the values
    print('--------------------------------------------------------------------')


disney_df
Number of duplicates: 0
Types: type
Movie      1052
TV Show     398
Name: count, dtype: int64
Number of age ratings: 9
Number of directors: 609
Number of genres: 329
Number of countries: 89
--------------------------------------------------------------------
netflix_df
Number of duplicates: 0
Types: type
Movie      6131
TV Show    2676
Name: count, dtype: int64
Number of age ratings: 17
Number of directors: 4528
Number of genres: 514
Number of countries: 748
--------------------------------------------------------------------


## Cleaning

### For Both

1. Replace empty cells in the director column with: unknown
2. Replace empty cells in the cast column with: unknown
3. Replace empty cells in the country column with: unknown

In [11]:
print('Missing values per column: ')
for name, df in df_list:
    df['director'] = df['director'].fillna('unknown')
    df['cast'] = df['cast'].fillna('unknown')
    df['country'] = df['country'].fillna('unknown')

    print(name)
    print(df.isna().sum())


Missing values per column: 
disney_df
show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      3
release_year    0
rating          3
duration        0
listed_in       0
description     0
dtype: int64
netflix_df
show_id          0
type             0
title            0
director         0
cast             0
country          0
date_added      10
release_year     0
rating           4
duration         3
listed_in        0
description      0
dtype: int64
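
The three replacements above can also be written as a single assignment over a list of columns. The following is only a minimal sketch on toy data; `toy_df` and `text_columns` are hypothetical names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with gaps in the same columns as the real datasets
toy_df = pd.DataFrame({
    'director': ['Jane Doe', np.nan],
    'cast': [np.nan, 'John Roe'],
    'country': [np.nan, np.nan],
    'rating': ['PG', np.nan],
})

# Fill all three text columns in one step; other columns stay untouched
text_columns = ['director', 'cast', 'country']
toy_df[text_columns] = toy_df[text_columns].fillna('unknown')

print(toy_df.isna().sum())
```

Only the listed columns are filled, so the missing `rating` value remains and can still be handled by a later `dropna`.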


4. Delete rows that have empty cells in rating or duration
5. Delete columns that are not needed: date_added, release_year

In [12]:
for name, df in df_list:
    print(name)
    print('Shape before:')
    print(df.shape)

    df.dropna(subset=['duration', 'rating'], inplace=True)

    print('Shape after:')
    print(df.shape)
    
    df.drop(columns=['date_added', 'release_year'], inplace=True)
    


disney_df
Shape before:
(1450, 12)
Shape after:
(1447, 12)
netflix_df
Shape before:
(8807, 12)
Shape after:
(8800, 12)


6. Inspect the countries and then split them
    - why are there so many?
    - are they plausible?

-> 'United States, Ghana, Burkina Faso, United Kingdom, Germany, Ethiopia' is counted as a single country, which is wrong
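
The effect can be reproduced on a small example: `nunique()` counts distinct strings, so every combination of countries counts as its own value until the cells are split. This is a minimal sketch on hypothetical toy data (`toy_df` is not part of the notebook).

```python
import pandas as pd

# Toy data (hypothetical), mimicking the comma-separated country column
toy_df = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'country': ['United States, Ghana', 'Germany', 'unknown'],
})

# Split each cell on commas, create one row per country, then trim whitespace
countries = toy_df['country'].str.split(',').explode().str.strip()

print(toy_df['country'].nunique())  # distinct raw strings
print(countries.nunique())          # distinct individual countries
print(countries.value_counts())
```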

#### Netflix

In [13]:
netflix_df['country'].unique()


array(['United States', 'South Africa', 'unknown', 'India',
       'United States, Ghana, Burkina Faso, United Kingdom, Germany, Ethiopia',
       'United Kingdom', 'Germany, Czech Republic', 'Mexico', 'Turkey',
       'Australia', 'United States, India, France', 'Finland',
       'China, Canada, United States',
       'South Africa, United States, Japan', 'Nigeria', 'Japan',
       'Spain, United States', 'France', 'Belgium',
       'United Kingdom, United States', 'United States, United Kingdom',
       'France, United States', 'South Korea', 'Spain',
       'United States, Singapore', 'United Kingdom, Australia, France',
       'United Kingdom, Australia, France, United States',
       'United States, Canada', 'Germany, United States',
       'South Africa, United States', 'United States, Mexico',
       'United States, Italy, France, Japan',
       'United States, Italy, Romania, United Kingdom',
       'Australia, United States', 'Argentina, Venezuela',
       'United States, Unit

In [14]:
netflix_df_copy = netflix_df.copy()
netflix_df_copy['country'] = netflix_df_copy['country'].str.split(',')
netflix_df_copy['country'] = netflix_df_copy['country'].apply(lambda x: [country.strip() for country in x])
netflix_df_countries = netflix_df_copy.explode('country')
netflix_movies_per_country = netflix_df_countries['country'].value_counts()
print(netflix_movies_per_country)

country
United States     3687
India             1046
unknown            830
United Kingdom     806
Canada             445
                  ... 
Sudan                1
Panama               1
Uganda               1
East Germany         1
Montenegro           1
Name: count, Length: 124, dtype: int64


In [15]:
netflix_df_copy_2 = netflix_df.copy()
netflix_df_copy_2['director'] = netflix_df_copy_2['director'].str.split(',')
netflix_df_copy_2['director'] = netflix_df_copy_2['director'].apply(lambda x: [director.strip() for director in x])
netflix_df_director = netflix_df_copy_2.explode('director')
netflix_movies_per_director = netflix_df_director['director'].value_counts()
print(netflix_movies_per_director)

director
unknown                   2631
Rajiv Chilaka               22
Jan Suter                   21
Raúl Campos                 19
Suhas Kadav                 16
                          ... 
Robert Cullen                1
Kirsten Johnson              1
Lawrence Kasdan              1
Milla Harrison-Hansley       1
Alicky Sussman               1
Name: count, Length: 4992, dtype: int64


In [16]:
netflix_df_copy_3 = netflix_df.copy()
netflix_df_copy_3['listed_in'] = netflix_df_copy_3['listed_in'].str.split(',')
netflix_df_copy_3['listed_in'] = netflix_df_copy_3['listed_in'].apply(lambda x: [genre.strip() for genre in x])
netflix_df_genres = netflix_df_copy_3.explode('listed_in')
netflix_movies_per_genres = netflix_df_genres['listed_in'].value_counts()
print(netflix_movies_per_genres)

listed_in
International Movies            2752
Dramas                          2426
Comedies                        1674
International TV Shows          1350
Documentaries                    869
Action & Adventure               859
TV Dramas                        763
Independent Movies               756
Children & Family Movies         641
Romantic Movies                  616
TV Comedies                      580
Thrillers                        577
Crime TV Shows                   470
Kids' TV                         450
Docuseries                       395
Music & Musicals                 375
Romantic TV Shows                370
Horror Movies                    357
Stand-Up Comedy                  343
Reality TV                       255
British TV Shows                 253
Sci-Fi & Fantasy                 243
Sports Movies                    219
Anime Series                     175
Spanish-Language TV Shows        174
TV Action & Adventure            168
Korean TV Shows             

In [17]:
def is_valid_country(country_name):
    """Return True for 'unknown', an accepted name, or a name pycountry can resolve."""
    # Names accepted without a pycountry lookup
    valid_countries = ['turkey', 'russia', 'palestine', 'vatican city', 'soviet union']
    if pd.isna(country_name) or country_name.lower() in ['unknown'] + valid_countries:
        return True
    # Historical German states are checked as Germany
    if country_name.lower() in ['west germany', 'east germany']:
        country_name = 'germany'
    try:
        pycountry.countries.lookup(country_name)
        return True
    except LookupError:
        return False


# Per row, collect every country name that fails the plausibility check
invalid_countries = netflix_df['country'].apply(lambda x: [country.strip() for country in str(x).split(',') if country.strip() != 'unknown' and not is_valid_country(country.strip())])
print('Implausible countries')

print(invalid_countries[invalid_countries.apply(len) > 0])



Implausible countries
193     []
365     []
1192    []
2224    []
4653    []
5925    []
7007    []
Name: country, dtype: object
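
Every list shown above passed the `len > 0` filter, so none of them is actually empty. Presumably each holds an empty string left behind by a leading or trailing comma, which the default display does not make visible. The sketch below, on hypothetical toy values, uses `repr()` to show such entries explicitly.

```python
import pandas as pd

# Toy values (hypothetical) with stray commas, as in the flagged rows
country = pd.Series([', South Korea', 'United Kingdom,', 'India'])

# Split on commas and keep only the empty fragments left by stray commas
empty_parts = country.apply(
    lambda x: [part.strip() for part in x.split(',') if part.strip() == '']
)
flagged = empty_parts[empty_parts.apply(len) > 0]

# repr() renders the empty strings with their quotes
print(flagged.apply(repr))
```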


In [18]:
print(netflix_df.loc[193, ['show_id', 'title', 'country']])
print('--------------------------------------------------------------------')
print(netflix_df.loc[365, ['show_id', 'title', 'country']])
print('--------------------------------------------------------------------')
print(netflix_df.loc[1192, ['show_id', 'title', 'country']])
print('--------------------------------------------------------------------')
print(netflix_df.loc[2224, ['show_id', 'title', 'country']])
print('--------------------------------------------------------------------')
print(netflix_df.loc[4653, ['show_id', 'title', 'country']])
print('--------------------------------------------------------------------')
print(netflix_df.loc[5925, ['show_id', 'title', 'country']])
print('--------------------------------------------------------------------')
print(netflix_df.loc[7007, ['show_id', 'title', 'country']])

show_id             s194
title               D.P.
country    , South Korea
Name: 193, dtype: object
--------------------------------------------------------------------
show_id                 s366
title        Eyes of a Thief
country    , France, Algeria
Name: 365, dtype: object
--------------------------------------------------------------------
show_id              s1193
title          The Present
country    United Kingdom,
Name: 1192, dtype: object
--------------------------------------------------------------------
show_id                                     s2225
title                                       Funan
country    France, Belgium, Luxembourg, Cambodia,
Name: 2224, dtype: object
--------------------------------------------------------------------
show_id             s4654
title         City of Joy
country    United States,
Name: 4653, dtype: object
--------------------------------------------------------------------
show_id              s5926
title              Virunga
co

In [19]:
# Fix the countries so they are plausible (remove stray leading/trailing commas)
netflix_df.loc[365, 'country'] = netflix_df.loc[365, 'country'].replace(', France, Algeria', 'France, Algeria')
netflix_df.loc[4653, 'country'] = netflix_df.loc[4653, 'country'].replace('United States,', 'United States')
netflix_df.loc[5925, 'country'] = netflix_df.loc[5925, 'country'].replace('United Kingdom,', 'United Kingdom')
netflix_df.loc[7007, 'country'] = netflix_df.loc[7007, 'country'].replace('Poland,', 'Poland')
# Row 1192 ('The Present') holds 'United Kingdom,'; row 2224 ('Funan') holds the four-country list
netflix_df.loc[1192, 'country'] = 'United Kingdom'
netflix_df.loc[193, 'country'] = netflix_df.loc[193, 'country'].replace(', South Korea', 'South Korea')
netflix_df.loc[2224, 'country'] = 'France, Belgium, Luxembourg, Cambodia'

In [20]:
invalid_countries = netflix_df['country'].apply(lambda x: [country.strip() for country in str(x).split(',') if country.strip() != 'unknown' and not is_valid_country(country.strip())])
print('Implausible countries:')
if len(invalid_countries[invalid_countries.apply(len) > 0]) == 0:
    print('All countries are plausible')
else:
    print(invalid_countries[invalid_countries.apply(len) > 0])


Implausible countries:
All countries are plausible
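
As an alternative to correcting each row by index, stray commas at the start or end of a value could be removed for the whole column at once. This is only a sketch on hypothetical toy values, not the approach used in the notebook; it does not handle an empty entry in the middle of a list (for example `'A, , B'`).

```python
import pandas as pd

# Toy values (hypothetical) with the same defects as the corrected rows
country = pd.Series([
    ', South Korea',
    'United Kingdom,',
    'France, Belgium, Luxembourg, Cambodia,',
    'India',
])

# Remove any run of commas and spaces at the start or end of each string
cleaned = country.str.strip(', ')

print(cleaned.tolist())
```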


#### Disney+

In [26]:
disney_df_copy = disney_df.copy()
disney_df_copy['country'] = disney_df_copy['country'].str.split(',')
disney_df_copy['country'] = disney_df_copy['country'].apply(lambda x: [country.strip() for country in x])
disney_df_countries = disney_df_copy.explode('country')
disney_movies_per_country = disney_df_countries['country'].value_counts()
print(disney_movies_per_country)

country
United States           1184
unknown                  216
United Kingdom           101
Canada                    77
Australia                 23
France                    22
South Korea               13
Japan                     10
China                     10
Germany                    9
Ireland                    8
Taiwan                     6
India                      5
Mexico                     4
Hong Kong                  4
Spain                      4
South Africa               3
Argentina                  3
Hungary                    3
Denmark                    3
New Zealand                3
Poland                     2
Singapore                  2
Austria                    2
Philippines                2
United Arab Emirates       2
Malaysia                   2
Brazil                     1
Switzerland                1
Tanzania                   1
Belgium                    1
Thailand                   1
Angola                     1
Panama                     1
Luxemb

In [27]:
disney_df_copy_2 = disney_df.copy()
disney_df_copy_2['director'] = disney_df_copy_2['director'].str.split(',')
disney_df_copy_2['director'] = disney_df_copy_2['director'].apply(lambda x: [director.strip() for director in x])
disney_df_director = disney_df_copy_2.explode('director')
disney_movies_per_director = disney_df_director['director'].value_counts()
print(disney_movies_per_director)

director
unknown             471
Jack Hannah          17
Wilfred Jackson      16
John Lasseter        16
Paul Hoen            16
                   ... 
Zhong Yu              1
Byron Haskin          1
Steven Lisberger      1
Jay Russell           1
Nick Castle           1
Name: count, Length: 636, dtype: int64


In [28]:
disney_df_copy_3 = disney_df.copy()
disney_df_copy_3['listed_in'] = disney_df_copy_3['listed_in'].str.split(',')
disney_df_copy_3['listed_in'] = disney_df_copy_3['listed_in'].apply(lambda x: [genre.strip() for genre in x])
disney_df_genres = disney_df_copy_3.explode('listed_in')
disney_df_genres['listed_in'] = disney_df_genres['listed_in'].str.strip()
disney_movies_per_genres = disney_df_genres['listed_in'].value_counts()
print(disney_movies_per_genres)

listed_in
Family                     632
Animation                  542
Comedy                     526
Action-Adventure           452
Animals & Nature           208
Coming of Age              205
Fantasy                    192
Documentary                173
Kids                       141
Drama                      134
Docuseries                 120
Science Fiction             91
Historical                  52
Music                       46
Musical                     44
Sports                      43
Buddy                       40
Biographical                40
Anthology                   27
Reality                     26
Romance                     20
Superhero                   19
Crime                       16
Variety                     12
Mystery                     12
Game Show / Competition     10
Survival                     9
Parody                       9
Lifestyle                    8
Western                      7
Concert Film                 7
Medical                      

In [29]:
d_invalid_countries = disney_df['country'].apply(lambda x: [country.strip() for country in str(x).split(',') if country.strip() != 'unknown' and not is_valid_country(country.strip())])

print('Implausible countries')
if len(d_invalid_countries[d_invalid_countries.apply(len) > 0]) == 0:
    print('All countries are plausible')
else:
    print(d_invalid_countries[d_invalid_countries.apply(len) > 0])


Implausible countries
All countries are plausible


## 1. Saving

In [30]:
netflix_df.to_csv('../1_Datenset/erstellte/cleaned/netflix_titles_cleaned.csv', index=False)

disney_df.to_csv('../1_Datenset/erstellte/cleaned/disney_plus_titles_cleaned.csv', index=False)

## Merging and Saving

Here the two datasets are merged into a single Movies_Shows dataset.
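
The idea behind the merge can be sketched on toy data: find the titles present in both catalogues, label each row with its platform, stack the frames, and keep one row per title. All names and titles below are hypothetical. Matching is done on the title alone, so two different works that share a title would be treated as the same entry.

```python
import pandas as pd

# Toy catalogues (hypothetical titles), standing in for the two cleaned CSV files
netflix_toy = pd.DataFrame({'title': ['Alpha', 'Shared'], 'type': ['Movie', 'Movie']})
disney_toy = pd.DataFrame({'title': ['Shared', 'Beta'], 'type': ['Movie', 'TV Show']})

# Titles that occur in both catalogues
common = set(netflix_toy['title']) & set(disney_toy['title'])

# Label each row with its platform; shared titles get both names
netflix_toy['platform'] = netflix_toy['title'].map(
    lambda t: 'Netflix, Disney+' if t in common else 'Netflix')
disney_toy['platform'] = disney_toy['title'].map(
    lambda t: 'Netflix, Disney+' if t in common else 'Disney+')

# Stack both frames and keep the first row per title (the Netflix row for shared titles)
combined = pd.concat([netflix_toy, disney_toy], ignore_index=True)
combined = combined.drop_duplicates(subset=['title']).reset_index(drop=True)

print(combined)
```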

In [34]:
# Paths to the CSV files
netflix_cleaned_file = '../1_Datenset/erstellte/cleaned/netflix_titles_cleaned.csv'
disney_cleaned_file = '../1_Datenset/erstellte/cleaned/disney_plus_titles_cleaned.csv'

# Load the Netflix CSV file and add a column for the platform
netflix_cleaned_df = pd.read_csv(netflix_cleaned_file)
netflix_cleaned_df['platform'] = 'Netflix'

# Load the Disney+ CSV file and add a column for the platform
disney_cleaned_df = pd.read_csv(disney_cleaned_file)
disney_cleaned_df['platform'] = 'Disney+'

# Find titles that appear in both datasets
common_titles_cleaned = set(netflix_cleaned_df['title']).intersection(set(disney_cleaned_df['title']))

# Mark shared titles in both DataFrames
netflix_cleaned_df['platform'] = netflix_cleaned_df.apply(lambda row: 'Netflix, Disney+' if row['title'] in common_titles_cleaned else 'Netflix', axis=1)
disney_cleaned_df['platform'] = disney_cleaned_df.apply(lambda row: 'Netflix, Disney+' if row['title'] in common_titles_cleaned else 'Disney+', axis=1)

# Combine the DataFrames
combined_cleaned_df = pd.concat([netflix_cleaned_df, disney_cleaned_df], ignore_index=True)

# Remove duplicate entries (the first occurrence of each title is kept)
combined_cleaned_df = combined_cleaned_df.drop_duplicates(subset=['title'])

# Rename the column 'rating' to 'agerating'
combined_cleaned_df.rename(columns={'rating': 'agerating'}, inplace=True)

# Group by all columns except 'platform' and combine the platforms
combined_cleaned_df = combined_cleaned_df.groupby(['show_id', 'type', 'title', 'director', 'cast', 'country', 'agerating', 'duration', 'listed_in', 'description'], as_index=False).agg({'platform': lambda x: ', '.join(sorted(set(x)))})

# Reassign show_id from top to bottom in the form 's1', 's2', ...
combined_cleaned_df['show_id'] = ['s' + str(i+1) for i in range(len(combined_cleaned_df))]

# Print the combined DataFrame
print(combined_cleaned_df.head())

# Save the combined DataFrame to a new CSV file
combined_cleaned_df.to_csv('../1_Datenset/erstellte/fertig/fertig.csv', index=False)


  show_id   type                                             title  \
0      s1  Movie                              Dick Johnson Is Dead   
1      s2  Movie  Duck the Halls: A Mickey Mouse Christmas Special   
2      s3  Movie             A Muppets Christmas: Letters To Santa   
3      s4  Movie                                      The Starling   
4      s5  Movie                       Confessions of a Shopaholic   

                            director  \
0                    Kirsten Johnson   
1  Alonso Ramirez Ramos, Dave Wasson   
2                   Kirk R. Thatcher   
3                     Theodore Melfi   
4                         P.J. Hogan   

                                                cast        country agerating  \
0                                            unknown  United States     PG-13   
1  Chris Diamantopoulos, Tony Anselmo, Tress MacN...        unknown      TV-G   
2  Steve Whitmire, Dave Goelz, Bill Barretta, Eri...  United States         G   
3  Melissa McC

In [35]:
# Show the unique values in the 'platform' column
unique_platforms = combined_cleaned_df['platform'].unique()
print(f"Unique values in column 'platform': {unique_platforms}")

# Count the number of entries for each platform
cleaned_netflix_count = combined_cleaned_df[combined_cleaned_df['platform'] == 'Netflix'].shape[0]
cleaned_disney_count = combined_cleaned_df[combined_cleaned_df['platform'] == 'Disney+'].shape[0]
cleaned_both_count = combined_cleaned_df[combined_cleaned_df['platform'] == 'Netflix, Disney+'].shape[0]

print(f"Number of Netflix entries: {cleaned_netflix_count}")
print(f"Number of Disney+ entries: {cleaned_disney_count}")
print(f"Number of entries on both platforms: {cleaned_both_count}")


Unique values in column 'platform': ['Netflix' 'Disney+' 'Netflix, Disney+']
Number of Netflix entries: 8757
Number of Disney+ entries: 1404
Number of entries on both platforms: 43
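
The same per-platform counts can be obtained in one call with `value_counts()`. A minimal sketch on a hypothetical toy column:

```python
import pandas as pd

# Toy platform column (hypothetical values)
platform = pd.Series(
    ['Netflix', 'Disney+', 'Netflix', 'Netflix, Disney+'],
    name='platform',
)

# One call returns the number of rows per distinct label
counts = platform.value_counts()

print(counts)
```

Applied to the merged data, the three counts should add up to the total number of rows, which is a quick consistency check after the merge.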
