
In [None]:
import pandas as pd

In [None]:
!gdown 1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd

In [None]:
!gdown 1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm

In [None]:
movies = pd.read_csv("movies.csv", index_col=0)
directors = pd.read_csv("directors.csv", index_col=0)

In [None]:
movies.head()

In [None]:
directors.head()

In [None]:
data = movies.merge(directors, left_on='director_id', right_on='id', how='left')
data.drop(['director_id', 'id_y'], axis=1, inplace=True)
data
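
The `id_y` column dropped above suggests that both files carry an `id` column, which pandas disambiguates with the `_x` / `_y` suffixes. The following is a minimal sketch on made-up toy frames (`movies_toy` and `directors_toy` are hypothetical stand-ins, not the real CSVs) showing where those suffixes come from and how `indicator=True` can reveal movies that found no matching director in a left merge.

```python
import pandas as pd

# Toy stand-ins for the real files; all values are invented for illustration.
movies_toy = pd.DataFrame({
    "id": [101, 102, 103],
    "title": ["A", "B", "C"],
    "director_id": [1, 2, 99],  # 99 has no matching director
})
directors_toy = pd.DataFrame({
    "id": [1, 2],
    "director_name": ["Dir One", "Dir Two"],
})

merged = movies_toy.merge(
    directors_toy,
    left_on="director_id",
    right_on="id",
    how="left",
    indicator=True,  # adds a `_merge` column: 'both' or 'left_only'
)

# Both inputs had an `id` column, so pandas names them id_x and id_y.
print(merged.columns.tolist())

# Rows with no match keep NaN in the director columns and are tagged 'left_only'.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["title"].tolist())
```

With `how='left'` every movie is kept, so checking the `left_only` rows is one way to see whether any director information is missing before grouping by `director_name`.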

# Flag the risky movies in our data
# A movie is risky if its budget is higher than the average revenue of all movies by the same director

In [None]:
def is_risky(group):
  # `group` is a DataFrame holding all movies of one director
  group = group.copy()  # work on a copy so the original data is not modified
  avg_revenue_by_director = group['revenue'].mean()
  group['risky'] = group['budget'] > avg_revenue_by_director  # new column with boolean values
  return group

In [None]:
data_risky = data.groupby('director_name').apply(is_risky)

### Pass group_keys=False so the combined result keeps the original row index instead of adding director_name as an extra index level

In [None]:
data_risky = data.groupby('director_name', group_keys=False).apply(is_risky)
data_risky

In [None]:
data_risky['risky'].value_counts()

In [None]:
data_risky[data_risky['risky']]
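
As an alternative to `apply`, the same flag can be sketched with `groupby(...).transform('mean')`, which returns the per-director average aligned to every row. The toy frame below is invented for illustration; only the column names (`director_name`, `budget`, `revenue`) are taken from the notebook.

```python
import pandas as pd

# Invented toy data; only the column names mirror the notebook.
toy = pd.DataFrame({
    "director_name": ["Dir One", "Dir One", "Dir Two", "Dir Two"],
    "title": ["A", "B", "C", "D"],
    "budget": [100, 10, 50, 5],
    "revenue": [60, 20, 80, 40],
})

# transform('mean') yields one value per row, aligned with the original index.
avg_revenue = toy.groupby("director_name")["revenue"].transform("mean")

# Row-wise comparison of each movie's budget with its director's average revenue.
toy["risky"] = toy["budget"] > avg_revenue

print(toy[toy["risky"]])
```

Because the result of `transform` already has the original index, no `group_keys` handling is needed with this approach.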

# Which director is the most productive?

In [None]:
data.groupby('director_name')['title'].count()

In [None]:
data.groupby('director_name')['title'].count().sort_values(ascending=False)

However, the number of movies alone is not a good measure of productivity.
We also need to take into account how long each director has been active.
For that, we find the first and the last year in which each director released a movie.

Based on the number of active years and the number of movies, we will decide who the most productive director is.

In [None]:
data_agg = data.groupby('director_name')[['year', 'title']].agg({'year': ['min', 'max'], 'title': 'count'})

The output has two levels of columns.

Under year we have min and max, and under title we have count.

In [None]:
data_agg.columns
# the output shows that data_agg.columns is a MultiIndex (multi-level column index)

In [None]:
data_agg['year'].head()

In [None]:
data_agg['year']['min']

In [None]:
data_agg['title']['count']

In [None]:
# Select a single column by giving both levels as a tuple
data_agg[('title', 'count')]

To simplify this, we convert the multi-level column index into a single-level one.

To do this, we overwrite the column labels.

In [None]:
data_agg.columns = ['year_min', 'year_max', 'title_count']
data_agg
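
The hard-coded list above works because we know the column order. As a sketch of a more general approach, the flat names can be built from the MultiIndex tuples themselves; the toy frame here is invented and only mirrors the notebook's column names.

```python
import pandas as pd

# Invented toy data with the same column names as the notebook.
toy = pd.DataFrame({
    "director_name": ["Dir One", "Dir One", "Dir Two"],
    "title": ["A", "B", "C"],
    "year": [2001, 2005, 2010],
})

agg = toy.groupby("director_name").agg({"year": ["min", "max"], "title": "count"})

# Each column label is a tuple such as ('year', 'min'); join its parts with '_'.
agg.columns = ["_".join(col) for col in agg.columns]

print(agg.columns.tolist())
```

This keeps the names in sync with the aggregation even if more functions are added later.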

In [None]:
# reset index: move director_name from the index back into a regular column
data_agg.reset_index(inplace=True)
data_agg
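
The aggregate, rename, and reset-index steps can also be combined. The sketch below assumes pandas 0.25 or newer, where named aggregation is available, and uses an invented toy frame rather than the real data.

```python
import pandas as pd

# Invented toy data with the same column names as the notebook.
toy = pd.DataFrame({
    "director_name": ["Dir One", "Dir One", "Dir Two"],
    "title": ["A", "B", "C"],
    "year": [2001, 2005, 2010],
})

# Named aggregation: output_name=(input_column, function).
# as_index=False keeps director_name as a regular column.
agg = toy.groupby("director_name", as_index=False).agg(
    year_min=("year", "min"),
    year_max=("year", "max"),
    title_count=("title", "count"),
)

print(agg)
```

The result already has flat column names, so no MultiIndex handling is needed afterwards.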

In [None]:
data_agg['active_years'] = data_agg['year_max'] - data_agg['year_min']
data_agg

Movies per year by each director

In [None]:
data_agg['movies_per_year'] = data_agg['title_count'] / data_agg['active_years']
data_agg
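
One thing to watch for: a director whose movies all fall in a single year has `active_years` equal to 0, and pandas returns `inf` for the division, which then ranks above every other director when sorting in descending order. The sketch below shows this on invented toy values and, as one possible alternative definition (an assumption, not the notebook's method), counts years inclusively so the span is always at least 1. The `_incl` column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented toy aggregate; column names follow the notebook.
agg = pd.DataFrame({
    "director_name": ["Dir One", "Dir Two"],
    "year_min": [2000, 2010],
    "year_max": [2009, 2010],  # Dir Two: all movies released in a single year
    "title_count": [5, 2],
})

# Definition used in the notebook: the span can be 0, and x / 0 gives inf.
agg["active_years"] = agg["year_max"] - agg["year_min"]
agg["movies_per_year"] = agg["title_count"] / agg["active_years"]

# Assumed alternative: count calendar years inclusively (span is at least 1).
agg["active_years_incl"] = agg["year_max"] - agg["year_min"] + 1
agg["movies_per_year_incl"] = agg["title_count"] / agg["active_years_incl"]

print(agg[["director_name", "movies_per_year", "movies_per_year_incl"]])

# Rows affected by the zero span can be listed explicitly.
print(agg[np.isinf(agg["movies_per_year"])]["director_name"].tolist())
```

Which definition is appropriate depends on how "active years" should be interpreted; filtering out directors with very few movies is another option.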

In [None]:
data_agg.sort_values(by='movies_per_year', ascending=False, inplace=True)
data_agg

In [None]:
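# renumber the rows after sorting; drop=True avoids adding the old index as a new column
data_agg.reset_index(drop=True, inplace=True)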
data_agg