# Data Characterization

First, the required <span style="color:seagreen">packages</span> are imported, and the Netflix and Disney+ <span style="color:coral">csv</span> files are read in and stored in <span style="color:lightblue">df_list</span>, so that both datasets can be examined faster and with fewer lines of code.

# Table of Contents
- [1. Exploring the Data](#untersuchung-der-Daten)
    - 1.1 Overview & Data Types
    - 1.2 Size and Sample
    - 1.3 Condition
- [2. Charts and Tables](#diagramme-und-tabellen)


In [1]:
import pandas as pd
import numpy as np
import os
import plotly as pl
import pycountry
import plotly.graph_objects as go
from plotly.subplots import make_subplots

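# __file__ is not defined inside a notebook, so the literal string "__file__" is
# resolved against the working directory; dirname() then yields that directory.
notebook_dir = os.path.dirname(os.path.abspath("__file__"))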
dataset_dir = os.path.join(notebook_dir, '../1_Datenset/ursprüngliche')

disney_df = pd.read_csv(os.path.join(dataset_dir, 'disney_plus_titles.csv'), sep=',')
netflix_df = pd.read_csv(os.path.join(dataset_dir, 'netflix_titles.csv'), sep=',')

df_list = [
    ('disney_df', disney_df),
    ('netflix_df', netflix_df)
]
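
Keeping both frames in `df_list` means that every check can be written once and applied to both datasets. A minimal sketch of that pattern, using two tiny stand-in frames (the sample data and the helper name `summarize` are assumptions for illustration, not part of the notebook):

```python
import pandas as pd

# Hypothetical stand-ins for disney_df and netflix_df.
demo_a = pd.DataFrame({"title": ["A", "B"], "director": ["X", None]})
demo_b = pd.DataFrame({"title": ["C", "D", "E"], "director": [None, None, "Y"]})

demo_list = [("demo_a", demo_a), ("demo_b", demo_b)]


def summarize(named_frames):
    """Collect row count, column count and total missing values per frame."""
    rows = []
    for name, df in named_frames:
        rows.append(
            {
                "name": name,
                "rows": df.shape[0],
                "columns": df.shape[1],
                "missing": int(df.isna().sum().sum()),
            }
        )
    return pd.DataFrame(rows).set_index("name")


summary = summarize(demo_list)
print(summary)
```

The same loop structure would work with the real `df_list`, since it only relies on `(name, DataFrame)` pairs.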

## 1. Exploring the Data

### 1.1 Overview & Data Types

#### 1.1.1 Netflix

In [2]:
print('--------------------------------------------------------------------')
print('Info:')
print(netflix_df.info())  # info() prints its report itself and returns None, hence the trailing "None"
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
None
--------------------------------------------------------------------


In [18]:
print(netflix_df.describe())


       release_year
count   8807.000000
mean    2014.180198
std        8.819312
min     1925.000000
25%     2013.000000
50%     2017.000000
75%     2019.000000
max     2021.000000


Data types (levels of measurement)
- Qualitative
    - Nominal:
        - no order
        - no median can be computed
    - Ordinal:
        - may be coded as numbers
        - ordered, but the distances between values cannot be computed
- Quantitative
    - Interval:
        - meaningful absolute distances
        - no absolute zero point
        - computable order
    - Absolute (ratio):
        - absolute zero point
        - computable order
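
The four levels above can be illustrated with pandas. The following is only a sketch on made-up rows; the assignment of columns to scale levels and the rating order `G < PG < PG-13` are assumptions for illustration, not results from the notebook:

```python
import pandas as pd

# Hypothetical sample; the column-to-scale assignment is an assumption.
sample = pd.DataFrame(
    {
        "type": ["Movie", "TV Show", "Movie"],  # nominal
        "rating": ["PG", "G", "PG-13"],         # ordinal
        "release_year": [2020, 2011, 1999],     # interval
        "duration_min": [90, 45, 120],          # absolute / ratio
    }
)

# Nominal: only counting and the mode are meaningful.
most_common_type = sample["type"].mode()[0]

# Ordinal: an ordered categorical supports min/max, but no arithmetic.
rating_order = pd.CategoricalDtype(["G", "PG", "PG-13"], ordered=True)
ratings = sample["rating"].astype(rating_order)
lowest, highest = ratings.min(), ratings.max()

# Interval: differences are meaningful, ratios are not.
year_span = sample["release_year"].max() - sample["release_year"].min()

# Absolute zero point: statements such as "twice as long" become valid.
length_ratio = sample["duration_min"].iloc[0] / sample["duration_min"].iloc[1]

print(most_common_type, lowest, highest, year_span, length_ratio)
```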

In [3]:
print('--------------------------------------------------------------------')
print('Head:')
print(netflix_df.head())
print('--------------------------------------------------------------------')


--------------------------------------------------------------------
Head:
  show_id     type                  title         director  \
0      s1    Movie   Dick Johnson Is Dead  Kirsten Johnson   
1      s2  TV Show          Blood & Water              NaN   
2      s3  TV Show              Ganglands  Julien Leclercq   
3      s4  TV Show  Jailbirds New Orleans              NaN   
4      s5  TV Show           Kota Factory              NaN   

                                                cast        country  \
0                                                NaN  United States   
1  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa   
2  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...            NaN   
3                                                NaN            NaN   
4  Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...          India   

           date_added  release_year rating   duration  \
0  September 25, 2021          2020  PG-13     90 min   
1  Septembe

In [19]:
print(disney_df.describe())


       release_year
count   1450.000000
mean    2003.091724
std       21.860162
min     1928.000000
25%     1999.000000
50%     2011.000000
75%     2018.000000
max     2021.000000


In [4]:
print('--------------------------------------------------------------------')
print('Shape:')
print(netflix_df.shape)
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Shape:
(8807, 12)
--------------------------------------------------------------------


In [5]:
print('--------------------------------------------------------------------')
print('Missing values per column:')
print(netflix_df.isna().sum())
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Missing values per column:
show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
--------------------------------------------------------------------


#### 1.1.2 Disney+

In [6]:
print('--------------------------------------------------------------------')
print('Info:')
print(disney_df.info())  # info() returns None, which print() shows after the report
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1450 entries, 0 to 1449
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       1450 non-null   object
 1   type          1450 non-null   object
 2   title         1450 non-null   object
 3   director      977 non-null    object
 4   cast          1260 non-null   object
 5   country       1231 non-null   object
 6   date_added    1447 non-null   object
 7   release_year  1450 non-null   int64 
 8   rating        1447 non-null   object
 9   duration      1450 non-null   object
 10  listed_in     1450 non-null   object
 11  description   1450 non-null   object
dtypes: int64(1), object(11)
memory usage: 136.1+ KB
None
--------------------------------------------------------------------


In [7]:
print('--------------------------------------------------------------------')
print('Head:')
print(disney_df.head())
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Head:
  show_id     type                                             title  \
0      s1    Movie  Duck the Halls: A Mickey Mouse Christmas Special   
1      s2    Movie                            Ernest Saves Christmas   
2      s3    Movie                      Ice Age: A Mammoth Christmas   
3      s4    Movie                        The Queen Family Singalong   
4      s5  TV Show                             The Beatles: Get Back   

                            director  \
0  Alonso Ramirez Ramos, Dave Wasson   
1                        John Cherry   
2                       Karen Disher   
3                    Hamish Hamilton   
4                                NaN   

                                                cast        country  \
0  Chris Diamantopoulos, Tony Anselmo, Tress MacN...            NaN   
1           Jim Varney, Noelle Parker, Douglas Seale            NaN   
2  Raymond Albert Romano, John Leguiza

In [8]:
print('--------------------------------------------------------------------')
print('Shape:')
print(disney_df.shape)
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Shape:
(1450, 12)
--------------------------------------------------------------------


In [9]:
print('--------------------------------------------------------------------')
print('Missing values per column:')
print(disney_df.isna().sum())
print('--------------------------------------------------------------------')

--------------------------------------------------------------------
Missing values per column:
show_id           0
type              0
title             0
director        473
cast            190
country         219
date_added        3
release_year      0
rating            3
duration          0
listed_in         0
description       0
dtype: int64
--------------------------------------------------------------------
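
Because the two datasets differ in size (8807 vs. 1450 rows), the absolute counts are easier to compare as shares. A small sketch of that idea, assuming two toy frames in place of the real data:

```python
import pandas as pd

# Hypothetical frames of different sizes standing in for the two datasets.
demo_list = [
    ("small", pd.DataFrame({"director": ["X", None], "country": ["US", "US"]})),
    (
        "large",
        pd.DataFrame(
            {
                "director": [None, None, "Y", "Z"],
                "country": [None, "IN", "IN", "IN"],
            }
        ),
    ),
]

# isna().mean() is the fraction of missing values per column; scale to percent.
missing_share = pd.DataFrame(
    {name: df.isna().mean().mul(100).round(1) for name, df in demo_list}
)
print(missing_share)
```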


### Examining Both Datasets

In [10]:
for name, df in df_list:
    print(name)
    print('Number of duplicates:', df.duplicated(subset=['title']).sum())
    print('Types:', df['type'].value_counts())
    print('Number of age ratings:', df['rating'].nunique())
    print('Number of directors:', df['director'].nunique())
    print('Number of genres:', df['listed_in'].nunique()) # very high number -> inspect the values
    print('Number of countries:', df['country'].nunique()) # 748 for Netflix -> inspect the values
    print('--------------------------------------------------------------------')


disney_df
Number of duplicates: 0
Types: type
Movie      1052
TV Show     398
Name: count, dtype: int64
Number of age ratings: 9
Number of directors: 609
Number of genres: 329
Number of countries: 89
--------------------------------------------------------------------
netflix_df
Number of duplicates: 0
Types: type
Movie      6131
TV Show    2676
Name: count, dtype: int64
Number of age ratings: 17
Number of directors: 4528
Number of genres: 514
Number of countries: 748
--------------------------------------------------------------------


##### Rating

In [11]:
# Ratings and their counts for Disney+
disney_ratings = disney_df.groupby(['rating', 'type']).size().reset_index(name='count')
disney_ratings['source'] = disney_ratings['type'].apply(lambda x: 'Disney+ Movie' if x == 'Movie' else 'Disney+ Serie')
disney_ratings = disney_ratings.drop(columns=['type'])

# Ratings and their counts for Netflix
netflix_ratings = netflix_df.groupby(['rating', 'type']).size().reset_index(name='count')
netflix_ratings['source'] = netflix_ratings['type'].apply(lambda x: 'Netflix Movie' if x == 'Movie' else 'Netflix Serie')
netflix_ratings = netflix_ratings.drop(columns=['type'])

# Combine the two DataFrames
all_ratings = pd.concat([disney_ratings, netflix_ratings])

# Print the ratings with their respective counts
print(all_ratings)

      rating  count         source
0          G    253  Disney+ Movie
1         PG    235  Disney+ Movie
2         PG      1  Disney+ Serie
3      PG-13     66  Disney+ Movie
4      TV-14     37  Disney+ Movie
5      TV-14     42  Disney+ Serie
6       TV-G    233  Disney+ Movie
7       TV-G     85  Disney+ Serie
8      TV-PG    181  Disney+ Movie
9      TV-PG    120  Disney+ Serie
10      TV-Y      3  Disney+ Movie
11      TV-Y     47  Disney+ Serie
12     TV-Y7     36  Disney+ Movie
13     TV-Y7     95  Disney+ Serie
14  TV-Y7-FV      7  Disney+ Movie
15  TV-Y7-FV      6  Disney+ Serie
0     66 min      1  Netflix Movie
1     74 min      1  Netflix Movie
2     84 min      1  Netflix Movie
3          G     41  Netflix Movie
4      NC-17      3  Netflix Movie
5         NR     75  Netflix Movie
6         NR      5  Netflix Serie
7         PG    287  Netflix Movie
8      PG-13    490  Netflix Movie
9          R    797  Netflix Movie
10         R      2  Netflix Serie
11     TV-14   1427 
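
The Netflix rows with the ratings `66 min`, `74 min` and `84 min` look like durations that ended up in the `rating` column, which would also fit the three missing `duration` values reported earlier. A hedged sketch of how such rows could be detected and repaired, shown on made-up rows (the regular expression and the repair step are assumptions, not the notebook's cleaning code):

```python
import pandas as pd

# Hypothetical rows; the second one has its duration in the rating column.
demo = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "rating": ["PG", "74 min", "TV-14"],
        "duration": ["90 min", None, "2 Seasons"],
    }
)

# A rating made of digits followed by "min" is very likely a duration.
misplaced = demo["rating"].str.fullmatch(r"\d+ min", na=False)

# Move the value only where the duration is actually missing,
# then mark the rating as missing.
fix = misplaced & demo["duration"].isna()
demo.loc[fix, "duration"] = demo.loc[fix, "rating"]
demo.loc[fix, "rating"] = pd.NA

print(demo)
```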

6. Inspect the countries and then split them
    - Why are there so many?
    - Are they plausible?

-> 'United States, Ghana, Burkina Faso, United Kingdom, Germany, Ethiopia' is counted as a single country, which is wrong.
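
The effect can be reproduced on a few made-up rows: before splitting, every distinct combination counts as its own value; after splitting, only the individual countries remain. This is a minimal sketch with assumed sample data:

```python
import pandas as pd

# Hypothetical sample with combined country strings and one missing value.
demo = pd.DataFrame(
    {
        "country": [
            "United States",
            "United States, Ghana",
            "Ghana, Germany",
            "Germany, United States",
            None,
        ]
    }
)

# Before splitting, each distinct combination is counted as one "country".
combos = demo["country"].nunique()

# Split on commas, create one row per country, trim whitespace, drop missing values.
countries = demo["country"].str.split(",").explode().str.strip().dropna()
singles = countries.nunique()

print(combos, singles)
print(countries.value_counts())
```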

#### Netflix

In [12]:
netflix_df['country'].unique()


array(['United States', 'South Africa', nan, 'India',
       'United States, Ghana, Burkina Faso, United Kingdom, Germany, Ethiopia',
       'United Kingdom', 'Germany, Czech Republic', 'Mexico', 'Turkey',
       'Australia', 'United States, India, France', 'Finland',
       'China, Canada, United States',
       'South Africa, United States, Japan', 'Nigeria', 'Japan',
       'Spain, United States', 'France', 'Belgium',
       'United Kingdom, United States', 'United States, United Kingdom',
       'France, United States', 'South Korea', 'Spain',
       'United States, Singapore', 'United Kingdom, Australia, France',
       'United Kingdom, Australia, France, United States',
       'United States, Canada', 'Germany, United States',
       'South Africa, United States', 'United States, Mexico',
       'United States, Italy, France, Japan',
       'United States, Italy, Romania, United Kingdom',
       'Australia, United States', 'Argentina, Venezuela',
       'United States, United Kin

In [13]:
def is_valid_country(country_name):
    """Return True if the name is missing, explicitly allowed, or known to pycountry."""
    # Names that are accepted without a pycountry lookup.
    valid_countries = ['unknown', 'turkey', 'russia', 'palestine', 'vatican city', 'soviet union']
    if pd.isna(country_name) or country_name.lower() in valid_countries:
        return True
    # Map the former German states to Germany before the lookup.
    if country_name.lower() in ['west germany', 'east germany']:
        country_name = 'germany'
    try:
        pycountry.countries.lookup(country_name)
        return True
    except LookupError:
        return False


# Note: str(x) turns a missing value (NaN) into the string 'nan', which fails the
# lookup, so rows without a country are reported as [nan] in the output.
invalid_countries = netflix_df['country'].apply(lambda x: [country.strip() for country in str(x).split(',') if country.strip() != 'unknown' and not is_valid_country(country.strip())])
print('Implausible countries')

print(invalid_countries[invalid_countries.apply(len) > 0])


Implausible countries
2       [nan]
3       [nan]
5       [nan]
6       [nan]
10      [nan]
        ...  
8718    [nan]
8759    [nan]
8783    [nan]
8785    [nan]
8803    [nan]
Name: country, Length: 838, dtype: object


-> This is cleaned in cleanen.ipynb and then examined again.
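
The `[nan]` entries stem from missing values rather than from implausible names. One possible way to keep the two cases apart is to drop missing values before splitting. The sketch below uses a small hard-coded set `KNOWN` in place of the `pycountry` lookup so that it runs on its own; both the set and the sample data are assumptions:

```python
import pandas as pd

# Stand-in for the pycountry lookup (assumption for this sketch).
KNOWN = {"united states", "germany", "india"}

demo = pd.Series(
    ["United States, Germany", None, "India, Atlantis", ", Germany"],
    name="country",
)

# Drop missing rows first so that NaN never becomes the string 'nan'.
tokens = demo.dropna().str.split(",").explode().str.strip()

# Keep only tokens that are not recognized, including empty strings
# left over from stray commas.
invalid = tokens[~tokens.str.lower().isin(KNOWN)]
print(invalid)
```

With this approach the row index of each flagged token is preserved, so the affected titles can be looked up directly.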

In [14]:
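# Work on a copy so the original DataFrame keeps the combined genre strings.
netflix_df_copy_3 = netflix_df.copy()
# Split the comma-separated genres into lists and trim surrounding whitespace.
netflix_df_copy_3['listed_in'] = netflix_df_copy_3['listed_in'].str.split(',')
netflix_df_copy_3['listed_in'] = netflix_df_copy_3['listed_in'].apply(lambda x: [genre.strip() for genre in x])
# One row per title and genre.
netflix_df_genres = netflix_df_copy_3.explode('listed_in')
# Number of titles per genre.
netflix_movies_per_genres = netflix_df_genres['listed_in'].value_counts()
print(netflix_movies_per_genres)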

listed_in
International Movies            2752
Dramas                          2427
Comedies                        1674
International TV Shows          1351
Documentaries                    869
Action & Adventure               859
TV Dramas                        763
Independent Movies               756
Children & Family Movies         641
Romantic Movies                  616
TV Comedies                      581
Thrillers                        577
Crime TV Shows                   470
Kids' TV                         451
Docuseries                       395
Music & Musicals                 375
Romantic TV Shows                370
Horror Movies                    357
Stand-Up Comedy                  343
Reality TV                       255
British TV Shows                 253
Sci-Fi & Fantasy                 243
Sports Movies                    219
Anime Series                     176
Spanish-Language TV Shows        174
TV Action & Adventure            168
Korean TV Shows             



#### Disney+

##### Countries

In [15]:
disney_df['country'].unique()

array([nan, 'United States', 'United States, Canada',
       'United States, Australia', 'Canada',
       'United States, United Kingdom', 'United States, South Korea',
       'Ireland, United States, Canada, United Kingdom, Denmark, Spain, Poland, Hungary',
       'France, United Kingdom', 'United Kingdom, Australia',
       'Ireland, United States', 'Canada, United States, France',
       'France, South Korea, Japan, United States', 'France',
       'United States, United Kingdom, Hungary', 'United States, Germany',
       'United States, United Kingdom, Australia', 'United States, India',
       'United States, Canada, United Kingdom, Singapore, Australia, Thailand',
       'Canada, United States',
       'South Korea, United States, China, Japan',
       'Australia, United Kingdom', 'United Kingdom',
       'United States, United Kingdom, South Korea',
       'United States, United Kingdom, Canada',
       'United States, Germany, United Kingdom',
       'United States, Canada, Ire

In [16]:
# Work on a copy so the original DataFrame keeps the combined genre strings.
disney_df_copy_3 = disney_df.copy()
# Split the comma-separated genres into lists and trim surrounding whitespace.
disney_df_copy_3['listed_in'] = disney_df_copy_3['listed_in'].str.split(',')
disney_df_copy_3['listed_in'] = disney_df_copy_3['listed_in'].apply(lambda x: [genre.strip() for genre in x])
# One row per title and genre (values are already stripped above).
disney_df_genres = disney_df_copy_3.explode('listed_in')
# Number of titles per genre.
disney_movies_per_genres = disney_df_genres['listed_in'].value_counts()
print(disney_movies_per_genres)

listed_in
Family                     632
Animation                  542
Comedy                     526
Action-Adventure           452
Animals & Nature           208
Coming of Age              205
Fantasy                    192
Documentary                174
Kids                       141
Drama                      134
Docuseries                 122
Science Fiction             91
Historical                  53
Music                       48
Musical                     44
Sports                      43
Biographical                41
Buddy                       40
Anthology                   28
Reality                     26
Romance                     20
Superhero                   19
Crime                       16
Variety                     12
Mystery                     12
Game Show / Competition     10
Survival                     9
Parody                       9
Lifestyle                    8
Western                      7
Concert Film                 7
Medical                      

In [17]:
d_invalid_countries = disney_df['country'].apply(lambda x: [country.strip() for country in str(x).split(',') if country.strip() != 'unknown' and not is_valid_country(country.strip())])

print('Implausible countries')
if len(d_invalid_countries[d_invalid_countries.apply(len) > 0]) == 0:
    print('All countries are plausible')
else:
    print(d_invalid_countries[d_invalid_countries.apply(len) > 0])


Implausible countries
0       [nan]
1       [nan]
3       [nan]
4       [nan]
6       [nan]
        ...  
1136    [nan]
1204    [nan]
1210    [nan]
1259    [nan]
1388    [nan]
Name: country, Length: 219, dtype: object


-> This is cleaned in cleanen.ipynb and then examined again.

## Charts

1. Types and age ratings
2. Top 10 genres on each platform
3. 
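
The tables behind the first two planned charts could be prepared with pandas alone and then handed to a plotting library. This is only a sketch on made-up rows; the sample data and the variable names are assumptions:

```python
import pandas as pd

# Hypothetical rows with the same column names as the datasets.
demo = pd.DataFrame(
    {
        "type": ["Movie", "Movie", "TV Show", "Movie"],
        "rating": ["PG", "G", "TV-G", "PG"],
        "listed_in": ["Comedy, Family", "Family", "Kids, Family", "Comedy"],
    }
)

# Chart 1: number of titles per age rating and type.
type_by_rating = pd.crosstab(demo["rating"], demo["type"])

# Chart 2: the n most frequent genres after splitting the combined strings.
top_n = 2
top_genres = (
    demo["listed_in"].str.split(",").explode().str.strip().value_counts().head(top_n)
)

print(type_by_rating)
print(top_genres)
```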

## Tables

1. Description of the columns with the data types of the original and the new datasets
2. Explanation of the age ratings
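
A column description table of the kind listed in item 1 could be derived directly from a DataFrame. The following sketch uses a made-up frame; the column selection and the name `column_overview` are assumptions:

```python
import pandas as pd

# Hypothetical frame standing in for one of the datasets.
demo = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "director": ["X", None, None],
        "release_year": [2020, 2011, 1999],
    }
)

# One row per column: data type, number of non-missing and distinct values.
column_overview = pd.DataFrame(
    {
        "dtype": demo.dtypes.astype(str),
        "non_null": demo.notna().sum(),
        "unique": demo.nunique(),
    }
)
print(column_overview)
```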
