<a href="https://colab.research.google.com/github/vyavasthita/dsml_learning/blob/master/IMDBCaseStudy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd

In [None]:
!gdown 1s2TkjSpzNc4SyxqRrQleZyDIHlc7bxnd

In [None]:
!gdown 1Ws-_s1fHZ9nHfGLVUQurbHDvStePlEJm

In [None]:
# Without index_col, the index saved in the CSV is read as an ordinary column (typically 'Unnamed: 0')
# and pandas adds its own default integer index on top of it
movies = pd.read_csv('movies.csv')
directors = pd.read_csv('directors.csv')

In [None]:
movies.head()

In [None]:
# ask pandas not to create an implicit index, but to use the file's first column (position 0) as the index
movies = pd.read_csv('movies.csv', index_col=0)
directors = pd.read_csv('directors.csv', index_col=0)
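The effect of `index_col=0` can be seen on a tiny in-memory CSV. This is a minimal sketch: the CSV text below is made up for illustration and only assumes the file was saved with its index as the first, unnamed column.

```python
import io
import pandas as pd

# Hypothetical CSV text, shaped like a file written by DataFrame.to_csv():
# the first header cell is empty because that column holds the saved index.
csv_text = ",id,title\n0,11,Movie A\n1,12,Movie B\n"

# Default read: the saved index becomes a regular column.
plain = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())
print(indexed.columns.tolist())
```

Comparing the two column lists shows the extra column that the default read introduces.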

In [None]:
directors.head()

In [None]:
movies.shape

In [None]:
directors.shape

In [None]:
movies['director_id'].nunique()

In [None]:
directors['id'].nunique()

Check whether every director_id in the movies table also appears in the id column of the directors table

In [None]:
movies['director_id'].isin(directors['id'])

In [None]:
movies['director_id'].isin(directors['id']).value_counts()

In [None]:
np.sum(movies['director_id'].isin(directors['id']))

In [None]:
np.all(movies['director_id'].isin(directors['id']))
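The same membership check can be sketched on toy data, which also shows how to list any ids that fail it. The two Series below are invented stand-ins for `movies['director_id']` and `directors['id']`.

```python
import pandas as pd

# Toy stand-ins for movies['director_id'] and directors['id'] (values are made up).
movie_director_ids = pd.Series([1, 2, 2, 3], name="director_id")
director_ids = pd.Series([1, 2, 3, 4], name="id")

# True where the movie's director_id exists in the directors table.
mask = movie_director_ids.isin(director_ids)

all_present = bool(mask.all())      # same idea as np.all(mask)
n_missing = int((~mask).sum())      # number of movies without a matching director
missing_ids = movie_director_ids[~mask].unique().tolist()

print(all_present, n_missing, missing_ids)
```

When `all_present` is False, `missing_ids` tells you which keys would be lost by an inner join.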

We want to keep every movie's data, so a left join (with movies as the left table) is the safe choice.

However, since every director_id in movies is present in the directors table, an inner join would not drop any movie either.

In this case, inner and left joins return the same rows.
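A small sketch of why the two joins agree here, and when they would not. All table contents and the `*_toy` / `*_orphan` names are hypothetical; only the column layout mirrors the notebook.

```python
import pandas as pd

# Toy tables with the same key layout as the notebook (values are made up).
movies_toy = pd.DataFrame({
    "id": [101, 102, 103],
    "title": ["A", "B", "C"],
    "director_id": [1, 2, 2],
})
directors_toy = pd.DataFrame({"id": [1, 2], "director_name": ["X", "Y"]})

keys = dict(left_on="director_id", right_on="id")

# Every director_id has a match, so both joins keep all three movies.
left = movies_toy.merge(directors_toy, how="left", **keys)
inner = movies_toy.merge(directors_toy, how="inner", **keys)

# Both tables have an 'id' column, so merge suffixes them as id_x / id_y.
print(left.columns.tolist())

# Add a movie whose director is missing from directors_toy.
orphan = pd.DataFrame({"id": [104], "title": ["D"], "director_id": [9]})
movies_orphan = pd.concat([movies_toy, orphan], ignore_index=True)

left_o = movies_orphan.merge(directors_toy, how="left", **keys)    # keeps the orphan, director columns are NaN
inner_o = movies_orphan.merge(directors_toy, how="inner", **keys)  # drops the orphan

print(len(left), len(inner), len(left_o), len(inner_o))
```

The left join is the safer default when the goal is to keep every movie, because it does not depend on the key check passing.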

In [None]:
pd.merge(movies, directors, how='left', left_on='director_id', right_on='id')

In [None]:
data = movies.merge(directors, how='left', left_on='director_id', right_on='id')

In [None]:
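# drop the join keys: director_id and id_y (the directors' id, suffixed by merge) are no longer needed
data.drop(['director_id', 'id_y'], axis=1, inplace=True)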
data

Post-read => IMDB data exploration

https://colab.research.google.com/drive/1yrfHSQYUMxxLKGUG-gCPf-R232BuimiR?usp=sharing

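apply() => applies a function along an axis of the DataFrame. The default is axis=0 (the function receives each column); axis=1 passes each row. On a Series, apply() calls the function on each element.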
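The axis behaviour can be sketched on a two-row frame. The `toy` frame and its numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (values are made up).
toy = pd.DataFrame({"budget": [10, 20], "revenue": [30, 50]})

# axis=0 (default): the function receives each column, one result per column.
per_column = toy.apply(np.sum)

# axis=1: the function receives each row, one result per row.
per_row = toy.apply(np.sum, axis=1)

# On a Series, apply() calls the function on each element.
doubled = toy["budget"].apply(lambda value: value * 2)

print(per_column.to_dict(), per_row.tolist(), doubled.tolist())
```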

In [None]:
# encode gender as a number: 0 for "Male", 1 for anything else
def encode(gender):
  return 0 if gender == "Male" else 1

In [None]:
data['gender'] = data['gender'].apply(encode)
data

In [None]:
# sum of revenue and budget for each row
data['total_money'] = data[['budget', 'revenue']].apply(np.sum, axis = 1)
data

In [None]:
# profit per movie (revenue - budget)
def profit(temp_data):
  return temp_data['revenue'] - temp_data['budget']

In [None]:
data['profit'] = data[['revenue','budget']].apply(profit, axis=1)
data
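The three derived columns above can also be computed without `apply()`, using vectorized column operations, which avoid calling a Python function once per row. This is a sketch of an alternative on an invented `toy` frame, not a change to the notebook's approach; the `gender_code` column name is hypothetical.

```python
import pandas as pd

# Toy frame with the same column names as the notebook (values are made up).
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "budget": [10, 20, 30],
    "revenue": [15, 10, 60],
})

# Same rule as encode(): 0 for "Male", 1 for anything else.
# Writing to a new column keeps the cell safe to re-run.
toy["gender_code"] = (toy["gender"] != "Male").astype(int)

# Column arithmetic replaces the row-wise apply calls.
toy["total_money"] = toy["budget"] + toy["revenue"]
toy["profit"] = toy["revenue"] - toy["budget"]

print(toy)
```

Overwriting `gender` in place, as the notebook does, means a second run of that cell would see only numbers and encode every row as 1, so a separate column is the more forgiving choice.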

GroupBy

<img src="https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781491912126/files/assets/pyds_03in01.png" height="350" width="700"/>

In [None]:
data['director_name'].nunique()

In [None]:
data.groupby('director_name')

In [None]:
data.groupby('director_name').ngroups

In [None]:
# show the groups: a dict-like mapping of each director_name (key) to the index labels of its rows (value)
data.groupby('director_name').groups

In [None]:
# get particular group
data.groupby('director_name').get_group('Adam McKay')

In [None]:
# count of movies by each director
data.groupby('director_name')['title'].count()

In [None]:
# min year and max year for each director (multiple aggregate functions)
data.groupby('director_name')['year'].aggregate(['min', 'max'])
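The split-apply-combine steps above can be sketched on a three-row frame. The data is invented, and the output column names in `summary` (`n_movies`, `first_year`, `last_year`) are hypothetical labels chosen for this example.

```python
import pandas as pd

# Toy frame (values are made up).
toy = pd.DataFrame({
    "director_name": ["X", "X", "Y"],
    "title": ["A", "B", "C"],
    "year": [2001, 2010, 2005],
})

grouped = toy.groupby("director_name")

# One aggregate on one column.
movie_counts = grouped["title"].count()

# Several aggregates on one column; agg is an alias of aggregate.
year_range = grouped["year"].agg(["min", "max"])

# Named aggregation: choose the output column names explicitly.
summary = grouped.agg(
    n_movies=("title", "count"),
    first_year=("year", "min"),
    last_year=("year", "max"),
)

print(summary)
```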

In [None]:
data.head()

# Find the details of movies by high-budget directors.
# High-budget directors -> any director with at least one movie whose budget exceeds 100M (100,000,000)

In [None]:
# maximum budget per director; reset_index() turns director_name back into a regular column
dir_budget = data.groupby('director_name')['budget'].max().reset_index()


In [None]:
dir_budget['budget'] > 100000000

In [None]:
dir_budget[dir_budget['budget'] > 100000000]

In [None]:
names = dir_budget[dir_budget['budget'] > 100000000]['director_name']
names

In [None]:
data['director_name'].isin(names)

In [None]:
data[data['director_name'].isin(names)]

In [None]:
# Same result in a single line: keep only the groups (directors) whose max budget exceeds 100M
data.groupby('director_name').filter(lambda val: val['budget'].max() > 100000000)
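The step-by-step mask and the `filter()` one-liner select the same rows. The sketch below checks this on invented data and adds a third variant based on `transform('max')`, which is an alternative not used in the notebook. `THRESHOLD` and the `via_*` names are hypothetical.

```python
import pandas as pd

THRESHOLD = 100_000_000  # 100M, as in the notebook

# Toy frame (values are made up). Director Z sits exactly at the threshold
# and is excluded because the comparison is strict.
toy = pd.DataFrame({
    "director_name": ["X", "X", "Y", "Z"],
    "title": ["A", "B", "C", "D"],
    "budget": [150_000_000, 20_000_000, 90_000_000, 100_000_000],
})

# 1) Step by step: max budget per director, then a membership mask.
max_budget = toy.groupby("director_name")["budget"].max()
names = max_budget[max_budget > THRESHOLD].index
via_isin = toy[toy["director_name"].isin(names)]

# 2) groupby().filter(): keep whole groups for which the function returns True.
via_filter = toy.groupby("director_name").filter(
    lambda group: group["budget"].max() > THRESHOLD
)

# 3) transform('max'): broadcast each group's max back to its rows, then mask.
via_transform = toy[toy.groupby("director_name")["budget"].transform("max") > THRESHOLD]

print(via_filter)
```

All three keep every movie by director X, including the low-budget one, because the condition is evaluated per director rather than per movie.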
