# Investigate the TMDb Movies Dataset
> The primary goal of this project is to walk through the dataset and the general data analysis process using NumPy, pandas and Matplotlib. The notebook contains four parts:
## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

# Introduction

### Questions to be answered
> 
1.  Which year had the most movie releases?
2.  Which movies have the highest and lowest profit? Which 10 movies earned the most profit?
3.  Which movies have the highest and lowest budgets?
4.  Which movies made the highest and lowest revenue?
5.  Which movies have the shortest and longest runtimes?
6.  Which movies received the highest and lowest number of votes?
7.  Which year has the highest profit rate?

In [None]:
# importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
# reading data
data = pd.read_csv('../input/tmdb-movies-dataset/tmdb_movies_data.csv')

# Data Exploration
### Steps
> 
1. Show the data and its shape
2. Get information about the data
3. Describe the data with summary statistics
4. Identify nulls


In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.info()


> 
1. Some features are not useful for this analysis and need to be removed
2. There are many nulls
3. The release_date feature needs to be converted to datetime


In [None]:
data.describe()

In [None]:
data.isna().sum()

#### There are movies with zero budget and zero revenue, which will affect the results of the analysis


# Data Cleaning

### Steps to be done
> 
1. Remove any duplicate records, if they exist
2. Convert the release_date feature to datetime
3. Drop useless features
4. Handle nulls
5. Handle records with zero budget and zero revenue

In [None]:
# check whether there are any duplicate records
data.duplicated().sum()

In [None]:
# removing duplicates
print('the shape before dropping duplicates is:', data.shape)
data.drop_duplicates(inplace = True)
print('the shape after dropping duplicates is:', data.shape)


In [None]:
data['release_date'] = pd.to_datetime(data['release_date'])
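
One thing worth checking after the conversion is the century of the parsed dates. The sketch below assumes the dates are stored as text with two-digit years (for example `12/25/66`), which is an assumption about the file and not something shown above. In that case older films can be parsed into the wrong century, and `release_year` can be used to correct them. The `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "release_date": ["6/9/15", "12/25/66"],
    "release_year": [2015, 1966],
})

# Parse with an explicit format so the two-digit year is read consistently.
parsed = pd.to_datetime(toy["release_date"], format="%m/%d/%y")

# Rebuild each date from the trusted release_year plus the parsed month and day,
# so a year that was read into the wrong century is corrected.
fixed = pd.to_datetime(pd.DataFrame({
    "year": toy["release_year"],
    "month": parsed.dt.month,
    "day": parsed.dt.day,
}))
print(fixed)
```

If the years in `release_date` already agree with `release_year`, this step changes nothing and can be skipped.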


In [None]:
# features which are not needed in this analysis
data.drop(columns = ['budget_adj', 'revenue_adj', 'overview', 'imdb_id', 'homepage', 'tagline',
                     'keywords', 'production_companies'], inplace = True)

In [None]:
# filling nulls with zeros to avoid losing some records 
data.fillna(0, inplace = True)

In [None]:
print('number of records with revenue = 0:', (data['revenue'] == 0).sum())
print('number of records with budget = 0:', (data['budget'] == 0).sum())
print('number of records with both revenue and budget = 0:', ((data['budget'] == 0) & (data['revenue'] == 0)).sum())


#### We can't remove this many records, so we keep them and deal with the zeros during the analysis
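
One possible way to keep these records without letting the zeros distort the results is to treat a zero budget or revenue as "not reported". The sketch below is an alternative illustration, not the approach this notebook takes, and the `toy` frame is made-up data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "budget": [100, 0, 50],
    "revenue": [300, 40, 0],
})

# Treat 0 as "not reported" instead of a real amount.
money = toy[["budget", "revenue"]].replace(0, np.nan)

# Profit is only defined where both figures are known; NaN propagates otherwise.
toy["profit"] = money["revenue"] - money["budget"]
print(toy)
```

Aggregations such as `idxmin`, `sum` and `mean` skip NaN by default, so rows with unreported figures stay in the table but no longer compete for "lowest budget" or "lowest profit".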

# Exploratory Data Analysis


### Q.1 Which year had the most movie releases?

In [None]:
sns.set_style('whitegrid')  # set the style before plotting so it applies to this figure
movies_number_in_year = data.groupby('release_year')['id'].count()
print(movies_number_in_year.sort_values(ascending = False)[:1])
movies_number_in_year.plot(xticks = np.arange(1960, 2016, 5), figsize = (10,6))
plt.title('year vs number of movies')
plt.xlabel('year')
plt.ylabel('number of movies released')

### Q.2 Which movies have the highest and lowest profit? Which 10 movies earned the most profit?


In [None]:
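# getting profit for each movie
data['profit'] = data['revenue'] - data['budget']
# treat a profit of exactly 0 as missing (e.g. budget and revenue both 0);
# assign the result back, since an in-place replace on a selected column may not update the frame
data['profit'] = data['profit'].replace(0, np.nan)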

In [None]:
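def min_max(df, col):
    """Return the titles of the movies with the highest and lowest value in `col`."""
    max_ = df[col].idxmax()  # index label of the row with the largest value
    min_ = df[col].idxmin()  # index label of the row with the smallest value
    highest_movie = df.loc[max_, 'original_title']
    lowest_movie = df.loc[min_, 'original_title']
    return highest_movie, lowest_movie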

In [None]:
highest_movie, lowest_movie = min_max(data, 'profit')
print('the movie with highest profit is ', highest_movie)
print('the movie with lowest profit is ', lowest_movie)

In [None]:
# top 10 movies in profit
highest_movies = data.sort_values(by = 'profit', ascending = False)[:10]
highest_movies.loc[: , ['original_title', 'profit']]


### Q.3 Which movies have the highest and lowest budgets?


In [None]:
highest_bud_movie, lowest_bud_movie = min_max(data, 'budget')
print('the movie with highest budget is ', highest_bud_movie)
print('the movie with lowest budget is ', lowest_bud_movie)

### Q.4 Which movies made the highest and lowest revenue?
 

In [None]:
highest_rev_movie, lowest_rev_movie = min_max(data, 'revenue')
print('the movie with highest revenue is ', highest_rev_movie)
print('the movie with lowest revenue is ', lowest_rev_movie)

### Q.5 Which movies have the shortest and longest runtimes?


In [None]:
highest_runtime_movie, lowest_runtime_movie = min_max(data, 'runtime')
print('the movie with the longest runtime is ', highest_runtime_movie)
print('the movie with the shortest runtime is ', lowest_runtime_movie)

### Q.6 Which movies received the highest and lowest number of votes?

In [None]:
highest_votes_movie, lowest_votes_movie = min_max(data, 'vote_count')
print('the movie with highest votes is ', highest_votes_movie)
print('the movie with lowest votes is ', lowest_votes_movie)

### Q.7 Which Year Has The Highest Profit Rate?

In [None]:
# select the column before aggregating so only 'profit' is summed
yearly_profit = data.groupby('release_year')['profit'].sum()
print(yearly_profit.sort_values(ascending = False)[:1])
yearly_profit.plot();
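
The cell above ranks years by total profit. If "profit rate" is read as profit relative to the money spent, the ranking can differ. The following is a minimal sketch under that assumed interpretation; the column name `profit_per_budget` and the `toy` data are hypothetical.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "release_year": [2014, 2014, 2015],
    "budget": [100, 100, 50],
    "revenue": [150, 250, 200],
})
toy["profit"] = toy["revenue"] - toy["budget"]

# Total budget and total profit per year.
yearly = toy.groupby("release_year")[["budget", "profit"]].sum()

# Profit earned per unit of budget in each year (hypothetical column name).
yearly["profit_per_budget"] = yearly["profit"] / yearly["budget"]
print(yearly)
```

In this toy example 2014 has the larger total profit while 2015 has the larger ratio, which is why it helps to state which of the two measures a conclusion refers to.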

### Q.8 Which movie lengths are most liked by audiences, according to popularity?

In [None]:
data.groupby('runtime')['popularity'].mean().sort_values(ascending = False)[:10]

#### The most popular runtimes are between 160 and 200 minutes
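
Grouping by the exact runtime means many groups hold only one or two movies, so a single popular film can dominate the top of the list. A possible cross-check is to group runtimes into bands and look at the mean popularity together with the number of movies in each band. This is a sketch only: the band edges and the `toy` data are assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "runtime": [90, 95, 130, 170, 180],
    "popularity": [0.5, 0.7, 1.0, 2.0, 3.0],
})

# Hypothetical band edges in minutes; intervals are right-closed, so a runtime
# of 0 falls outside every band and is left out of the summary.
bins = [0, 80, 120, 160, 200, float("inf")]
toy["runtime_band"] = pd.cut(toy["runtime"], bins=bins)

# Mean popularity and number of movies per band (only bands that contain movies).
by_band = toy.groupby("runtime_band", observed=True)["popularity"].agg(["mean", "count"])
print(by_band)
```

Reporting the count next to the mean makes it easier to judge whether a band's high popularity rests on many movies or only a handful.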

# Conclusions
>
1. 2014 has the highest number of movies released
2. Avatar, Star Wars and Titanic have the highest profits
3. The Warrior's Way has the highest budget and Mr. Holmes has the lowest budget
4. 2015 has the highest profit rate
5. The most popular runtimes are between 160 and 200 minutes
