# Task

- Produce a compact analysis of the current film and series catalogue
- What makes a good film?
- What do people like to watch, and why?
- Netflix wants to use your analysis as the basis for producing its own new films and series in the future.


## Import libraries and datasets

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
movies = pd.read_csv("../input/imdb-extensive-dataset/IMDb movies.csv")
ratings = pd.read_csv("../input/imdb-extensive-dataset/IMDb ratings.csv")

In [None]:
ratings = ratings.rename(columns={"imdb_title_id": "imdb_title_id2"})
data = movies.merge(ratings, left_on='imdb_title_id', right_on='imdb_title_id2')
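
Because both files share the `imdb_title_id` key, the join can also be written without renaming the column first. The following is a minimal sketch on made-up rows (the `*_demo` frames and their values are illustrative assumptions, not data from the CSV files); `validate="one_to_one"` is an optional safeguard that raises an error if an id appears more than once on either side.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are made up for illustration)
movies_demo = pd.DataFrame({
    "imdb_title_id": ["tt0000001", "tt0000002", "tt0000003"],
    "genre": ["Drama", "Comedy, Romance", "Film-Noir"],
})
ratings_demo = pd.DataFrame({
    "imdb_title_id": ["tt0000001", "tt0000002", "tt0000003"],
    "total_votes": [120, 45, 300],
})

# Merging on the shared key keeps a single id column;
# validate raises MergeError if an id is duplicated on either side
merged_demo = movies_demo.merge(ratings_demo, on="imdb_title_id", validate="one_to_one")
print(merged_demo.columns.tolist())
```

With `on=`, the result contains one id column instead of two, so no duplicate key needs to be dropped later.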

## Explore the dataset

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.info()

In [None]:
data.shape

## Clean the data

### Replace missing values with "missing"

In [None]:
# Replace missing values in these columns with the placeholder "missing".
# Assigning the result back avoids relying on in-place changes through attribute access.
# "worlwide_gross_income" (sic) is kept verbatim so it matches the column name.
fill_cols = ["director", "writer", "production_company", "actors",
             "description", "budget", "worlwide_gross_income", "country"]
data[fill_cols] = data[fill_cols].fillna("missing")
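
Filling an amount column such as `budget` with the string "missing" keeps it as text, so it cannot be averaged later. As a hedged alternative, the sketch below extracts the digits and leaves gaps as missing values. The `"$ 2250"` format is an assumption about how the amounts are stored, the `demo` frame is made up, and currency conversion is deliberately ignored.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "director": ["A. Example", None],
    "budget": ["$ 2250", np.nan],  # assumed format: currency prefix plus amount
})

# Text column: a placeholder string is harmless
demo["director"] = demo["director"].fillna("missing")

# Amount column: pull out the digits and convert them, leaving gaps as missing
demo["budget_amount"] = pd.to_numeric(
    demo["budget"].str.extract(r"(\d+)", expand=False), errors="coerce"
)
print(demo)
```

Whether this is preferable depends on the analysis: the placeholder is fine for columns that are only displayed, while numeric columns are easier to aggregate when gaps stay missing.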

In [None]:
# Keep only the first listed genre and country as the "main" one
data["genre_main"] = data["genre"].apply(lambda x: x.split(", ")[0])
data["country_main"] = data["country"].apply(lambda x: x.split(", ")[0])

# Drop the columns that are not used in the rest of the analysis
vote_cols = [f"votes_{i}" for i in range(1, 11)]  # votes_1 ... votes_10
drop_cols = [
    "country", "title", "usa_gross_income", "metascore",
    "reviews_from_users", "reviews_from_critics",
    "top1000_voters_rating", "top1000_voters_votes",
    "us_voters_rating", "us_voters_votes",
    "non_us_voters_rating", "non_us_voters_votes",
    "mean_vote", "median_vote", "weighted_average_vote", "genre",
]
data = data.drop(columns=drop_cols + vote_cols)
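
The `lambda x: x.split(", ")[0]` pattern calls `split` on every value and would fail on a missing entry. A minimal sketch of the vectorised equivalent, using a made-up series (`genres_demo` is an illustrative assumption):

```python
import pandas as pd

genres_demo = pd.Series(["Drama, Romance", "Comedy", None])

# The .str accessor passes missing values through instead of raising an error
main_genre = genres_demo.str.split(", ").str[0]
print(main_genre)
```

The missing entry stays missing, so this form works whether or not the column was filled beforehand.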

In [None]:
data.info()

## Visualization

- Which genre is produced most often?

In [None]:
plt.figure(figsize=(20,10))
sns.countplot(y="genre_main", data=data, order = data['genre_main'].value_counts().index)
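
The `order` argument above comes from `value_counts()`, which sorts categories by descending frequency. A small sketch on made-up labels shows the resulting order:

```python
import pandas as pd

genre_demo = pd.Series(["Drama", "Comedy", "Drama", "Action", "Drama", "Comedy"])

# value_counts() sorts by frequency, so its index lists the most common genre first
order = genre_demo.value_counts().index.tolist()
print(order)
```

Passing this list as `order` makes the longest bar appear at the top of the horizontal count plot.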

- Film duration across countries

In [None]:
data = data.replace({"year":"TV Movie 2019"}, "2019")
data["year"] = data["year"].astype(int)
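
The replacement above handles the single value "TV Movie 2019". If other entries carried similar prefixes, a more general approach would be to extract the four-digit year with a regular expression. This is a sketch under that assumption, using a made-up series:

```python
import pandas as pd

# Mixed integers and strings, as can happen when a CSV column is not purely numeric
years_demo = pd.Series([1994, "TV Movie 2019", "2005"])

# Convert everything to text, pull out the first four-digit group, then cast to int
clean_years = (
    years_demo.astype(str)
    .str.extract(r"(\d{4})", expand=False)
    .astype(int)
)
print(clean_years.tolist())
```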

In [None]:
top = data["country_main"].value_counts()[:5].index.tolist()
TopMovies = data[data["country_main"].isin(top)]
TopMovies2 = TopMovies.groupby(["year", "country_main"]).mean(numeric_only=True)  # average numeric columns only; text columns such as budget are skipped
TopMovies2 = TopMovies2.reset_index()
TopMovies2
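
The filter-then-aggregate step can be checked on a toy frame. The sketch assumes a recent pandas release, where averaging text columns raises an error unless `numeric_only=True` is passed; all names and values in `demo` are made up.

```python
import pandas as pd

demo = pd.DataFrame({
    "country_main": ["USA", "USA", "India", "India", "France"],
    "year": [2000, 2000, 2000, 2001, 2000],
    "duration": [100, 120, 150, 170, 90],
    "budget": ["missing", "$ 100", "missing", "missing", "missing"],
})

# Keep the two most frequent countries (the notebook keeps five)
top_demo = demo["country_main"].value_counts()[:2].index.tolist()
subset = demo[demo["country_main"].isin(top_demo)]

# Average the numeric columns per year and country; the text column is left out
yearly = subset.groupby(["year", "country_main"]).mean(numeric_only=True).reset_index()
print(yearly)
```

The result has one row per year and country, which is the shape the line plot of duration over time expects.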

In [None]:
data[["duration","avg_vote"]].hist(figsize=(20,7))

In [None]:
plt.figure(figsize=(20,10))
sns.lineplot(data=TopMovies2, x="year", y="duration",hue="country_main", markers=True, dashes=False)

- Group by genre and compute the mean

In [None]:
# Mean of every numeric column per main genre
Genres = data.groupby("genre_main").mean(numeric_only=True)
# Number of films per genre; the assignment aligns on the genre index
Genres["Movie_count"] = data["genre_main"].value_counts()
# Note: the vote columns are already per-film means, so these ratios divide by the film count a second time
Genres["Average_Vote"] = Genres["total_votes"] / Genres["Movie_count"]
Genres["Average_Votes_Male"] = Genres["males_allages_votes"] / Genres["Movie_count"]
Genres["Average_Votes_Female"] = Genres["females_allages_votes"] / Genres["Movie_count"]
# Turn the genre index into a regular "Genre" column for plotting
Genres = Genres.rename_axis("Genre").reset_index()
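
To make the derived columns easier to interpret, the same quantities can be reproduced on a few made-up rows with named aggregation. The `films_demo` frame and the `Mean_votes` column are illustrative assumptions; `Average_Vote` follows the notebook's definition.

```python
import pandas as pd

films_demo = pd.DataFrame({
    "genre_main": ["Drama", "Drama", "Drama", "Film-Noir"],
    "total_votes": [100, 200, 300, 400],
})

# One pass per genre: number of films and mean votes per film
summary = (
    films_demo.groupby("genre_main")
    .agg(Movie_count=("total_votes", "size"), Mean_votes=("total_votes", "mean"))
    .rename_axis("Genre")
    .reset_index()
)

# Same derived quantity as in the notebook cell: per-film mean divided by film count
summary["Average_Vote"] = summary["Mean_votes"] / summary["Movie_count"]
print(summary)
```

In this toy data the genre with a single film ends up with the largest `Average_Vote`, because dividing by the film count favours genres with few titles. `Mean_votes` may be the fairer comparison if the goal is votes per film.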

- What did people watch most? (shown here as the number of films per genre)

In [None]:
sns.catplot(kind="bar",data=Genres,x="Movie_count", y="Genre")

- Which genres received the most votes on average?

In [None]:
sns.catplot(kind="bar",data=Genres,x="Average_Vote", y="Genre")

In [None]:
sns.catplot(kind="bar",data=Genres,x="Average_Votes_Male", y="Genre")

In [None]:
sns.catplot(kind="bar",data=Genres,x="Average_Votes_Female", y="Genre")

In [None]:
Genres[Genres["Genre"].isin(["Drama","Film-Noir"])][["Genre","Average_Vote","Average_Votes_Male","Average_Votes_Female","votes","Movie_count"]]