# TMDB Movie Data Analysis
> Analysed by Simba Pfaira

# Project: Investigate a Dataset - TMDB Movie Dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

This project analyses the movie data from TMDB to determine which attributes are associated with popular movies. I first clean the data so that only the information relevant to the project remains, then plot visualizations to illustrate the findings, and finally summarise my conclusions.

<a id='qfa'></a>
### Question(s) for Analysis
##### Data overview
The TMDB movie dataset contains information about more than 10,000 movies. The attributes recorded for each movie are: id, imdb_id, popularity, budget, revenue, original_title, cast, homepage, director, tagline, keywords, overview, runtime, genres, production_companies, release_date, vote_count, vote_average, release_year, budget_adj, revenue_adj.

##### Questions to ask
The questions I would like to answer with this dataset are:

1. What is the relation between the revenue and vote average?
1. Which movie has the longest runtime in the data set?
1. Which movie has the shortest runtime in the data set?
1. In which year were the most movies released?
1. Which runtime range has the most sold copies of movies?
1. Do movies with higher expenditures also receive higher user rating?

### Importing the required modules
First, I imported the packages used in this project and loaded the data file.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv('tmdb-movies.csv')

### Exploring the dataframe
#### 1. Data overview
> First I read the data file using pandas' `read_csv` function, then I called `head()` to get a glimpse of the first five rows of the dataframe.

In [None]:
df.head()

#### 2.  Dimensions of the dataframe
> I used the `.shape` attribute to find the dimensions of the dataframe, i.e. the number of rows and columns.

In [None]:
df.shape

#### 3.  Visualization of the dataframe
> I called the `.hist()` method to plot a histogram of each numeric column in the dataframe.

In [None]:
df.hist(figsize= (10,8));

<a id='wrangling'></a>
## Data Wrangling
### General Properties
> I called the `.info()` method to get the data type and the count of non-null cells for each column. From this I can see that some columns have blank cells, e.g. `homepage` and `keywords`. This information will be useful for the next step, data cleaning.

In [None]:
df.info()
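
To put a number on the blank cells that `.info()` hints at, a minimal sketch like the one below could be used. The `toy` frame and its values are made up for illustration and are not taken from the TMDB file; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the TMDB data (values are made up).
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "homepage": ["http://a.example", np.nan, np.nan, "http://d.example"],
    "keywords": ["x", "y", np.nan, "z"],
    "runtime": [100, 95, 120, 88],
})

# Count the missing cells in every column, largest count first.
missing_per_column = toy.isnull().sum().sort_values(ascending=False)
print(missing_per_column)
```

Running the same two method calls on `df` would list the real columns by how many values they lack.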

### Data Cleaning
1. **First, I checked for duplicate rows in the dataframe.**

In [None]:
df.duplicated().sum()

2. **The code above showed that there is one duplicated entry, so the next step is to drop it.**

In [None]:
df.drop_duplicates(inplace = True)

3. **Next, I removed the columns that contain data I will not use in this project: imdb_id, homepage, tagline, overview, cast, director, production_companies, keywords, budget, revenue and release_date.**

In [None]:
df.drop(['imdb_id','homepage','budget','revenue','tagline', 'cast', 'director', 'production_companies', 'overview','release_date','keywords'], axis=1, inplace=True)

In [None]:
df.info()
df.head(2)

4. **Next, I dropped all rows with missing values or zero values, so that only rows with complete data remain.**

In [None]:
# Treat zeros as missing values, then remove every row that has a missing value
df.replace(0, np.nan, inplace=True)
df.dropna(axis=0, inplace=True)
df.shape
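
The step above treats a zero in any column as missing and drops every incomplete row. As an alternative sketch (not the approach used in this notebook), zeros could be treated as missing only in chosen columns, and rows dropped only when the column a question needs is missing. The `toy` frame, its values, and the `zero_means_missing` list are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (made-up values); column names mirror the dataset.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "runtime": [100, 0, 120, 90],
    "budget_adj": [1.0e6, 2.0e6, 0.0, 5.0e5],
    "revenue_adj": [3.0e6, 1.0e6, 4.0e6, 0.0],
})

# Assumption: a zero only means "missing" in these columns.
zero_means_missing = ["runtime", "budget_adj", "revenue_adj"]

cleaned = toy.copy()
cleaned[zero_means_missing] = cleaned[zero_means_missing].replace(0, np.nan)

# Rows usable for a runtime-only question: drop only where runtime is missing.
runtime_rows = cleaned.dropna(subset=["runtime"])

# Rows that are complete in every column (what a plain dropna() keeps).
complete_rows = cleaned.dropna()

# Share of the original rows that survives the strict cleaning.
kept_fraction = len(complete_rows) / len(toy)
print(len(runtime_rows), len(complete_rows), kept_fraction)
```

Comparing `len(runtime_rows)` with `len(complete_rows)` shows how many rows the strict version gives up for questions that only need one column.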

<a id='eda'></a>
## Exploratory Data Analysis

Before answering the questions, I define a helper function for the scatter-plot code that is repeated below.

In [None]:
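def drawScatterPlot(x_values, y_values, plot_title):
    # Scatter plot of two columns of the global dataframe `df`;
    # x_values and y_values are column names, plot_title is the chart title
    df.plot(x=x_values, y=y_values, title=plot_title, kind='scatter', color='blue')
    plt.show()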

### Research Question 1
**What is the relation between the revenue and vote average?**

In [None]:
drawScatterPlot('vote_average', 'revenue_adj', 'Vote Average vs Revenue')

The scatter plot suggests a positive correlation between revenue and vote average: movies that are rated highly by viewers tend to earn more revenue than lower-rated ones.
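
A scatter plot shows the direction of a relationship but not its strength. One way to quantify it is the Pearson correlation coefficient, available as `Series.corr` in pandas. The sketch below uses a made-up `toy` frame, so the value it prints says nothing about the real TMDB data.

```python
import pandas as pd

# Hypothetical toy data (made-up values); column names mirror the dataset.
toy = pd.DataFrame({
    "vote_average": [5.0, 6.0, 7.0, 8.0],
    "revenue_adj": [1.0e6, 2.5e6, 2.0e6, 6.0e6],
})

# Pearson correlation: +1 is a perfect positive linear relationship,
# 0 is no linear relationship, -1 is a perfect negative one.
r = toy["vote_average"].corr(toy["revenue_adj"])
print(round(r, 2))
```

The same call on `df['vote_average']` and `df['revenue_adj']` would give the coefficient for the cleaned dataset.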

### Research Question 2
**Which movie has the longest runtime in the data set?**

In [None]:
df_longest = df[df.runtime == df.runtime.max()]
print(df_longest)


The analysis reveals that the longest movie is `Carlos`, with a runtime of `338` minutes.

### Research Question 3
**Which movie has the shortest runtime in the data set?**

In [None]:
df_shortest = df[df.runtime == df.runtime.min()]
print(df_shortest)

The result indicates that the shortest movie is `Kid's Story` which has a runtime of `15` minutes.
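
Printing the whole filtered row works, but when only the title and runtime are needed, `idxmax` and `idxmin` give a more compact lookup. This is a sketch on a small `toy` frame: the two titles and runtimes reported above are reused, and the middle row is a made-up placeholder. Note that `idxmax` and `idxmin` return only the first match if several movies tie.

```python
import pandas as pd

# Toy frame: two rows reuse the results reported above, one row is made up.
toy = pd.DataFrame({
    "original_title": ["Kid's Story", "Hypothetical Feature", "Carlos"],
    "runtime": [15, 110, 338],
})

# idxmax / idxmin return the index label of the largest / smallest value.
longest = toy.loc[toy["runtime"].idxmax(), ["original_title", "runtime"]]
shortest = toy.loc[toy["runtime"].idxmin(), ["original_title", "runtime"]]

print(longest["original_title"], longest["runtime"])
print(shortest["original_title"], shortest["runtime"])
```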

### Research Question 4
**In which year were the most movies released?**

In [None]:
plt.figure(figsize=(30,10))
sns.countplot(x='release_year', data=df)  # keyword arguments are accepted by older and newer seaborn releases
plt.title('Movies by year')
plt.xlabel('Year')
plt.ylabel('Quantity')
plt.show()

The analysis shows that more movies were released in `2011` than in any other year.
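
Reading the tallest bar off a crowded chart can be error-prone, so the year can also be obtained directly with `value_counts`. The sketch below uses a short made-up series of years in place of the real `release_year` column.

```python
import pandas as pd

# Hypothetical release years (made-up sample, not the TMDB column).
release_year = pd.Series([2009, 2011, 2011, 2010, 2011, 2010], name="release_year")

# Number of movies per year, then the year with the highest count.
counts = release_year.value_counts()
busiest_year = counts.idxmax()

print(busiest_year, counts.max())
```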

### Research Question 5
**Which runtime range has the most sold copies of movies?**

In [None]:
df.groupby('runtime')['revenue_adj'].mean().plot(kind='line');
plt.title('Revenue vs Runtime')
plt.ylabel('Revenue $')
plt.xlabel('Runtime')
plt.show()

The plot shows that average adjusted revenue, used here as a proxy for copies sold, is highest for movies with a runtime of `180` to `190` minutes.
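
Grouping by the exact runtime gives one mean per minute value, which can be dominated by a single movie. Since the question asks about a runtime range, one option is to bin the runtimes with `pd.cut` before averaging. This is a sketch under assumptions: the `toy` values are made up and the 30-minute bin edges are an arbitrary choice.

```python
import pandas as pd

# Hypothetical toy data (made-up values); column names mirror the dataset.
toy = pd.DataFrame({
    "runtime": [85, 95, 118, 125, 182, 188],
    "revenue_adj": [1.0e6, 2.0e6, 3.0e6, 5.0e6, 9.0e6, 7.0e6],
})

# Assumed bin edges: 30-minute ranges, each closed on the right, e.g. (60, 90].
bins = [60, 90, 120, 150, 180, 210]
toy["runtime_range"] = pd.cut(toy["runtime"], bins=bins)

# Mean adjusted revenue per runtime range (empty ranges are skipped).
mean_revenue = toy.groupby("runtime_range", observed=True)["revenue_adj"].mean()
top_range = mean_revenue.idxmax()

print(mean_revenue)
print(top_range)
```

Calling `mean_revenue.plot(kind='bar')` would then show one bar per range instead of one point per minute.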

### Research Question 6
**Do movies with higher expenditures also receive higher user rating?**

In [None]:
drawScatterPlot('vote_average', 'budget_adj', 'Vote Average vs Expenditures')

This scatter plot suggests a positive correlation: movies with higher budgets also tend to be rated more highly by users.
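
Another way to check this pattern is to split the movies into budget tiers and compare the mean rating of each tier. The sketch below uses `pd.qcut` to form three equally sized tiers. The `toy` values and the tier labels `low`, `mid` and `high` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data (made-up values); column names mirror the dataset.
toy = pd.DataFrame({
    "budget_adj": [1.0e6, 2.0e6, 5.0e6, 1.0e7, 5.0e7, 1.0e8],
    "vote_average": [5.5, 6.0, 5.8, 6.4, 6.9, 7.1],
})

# Split the budgets into three quantile-based tiers of equal size.
toy["budget_tier"] = pd.qcut(toy["budget_adj"], q=3, labels=["low", "mid", "high"])

# Mean user rating within each budget tier.
mean_rating = toy.groupby("budget_tier", observed=True)["vote_average"].mean()
print(mean_rating)
```

If the mean rating rises from the low tier to the high tier on the real data, that would support the reading of the scatter plot.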

<a id='conclusions'></a>
## Conclusions

> In this project, I analysed the movie dataset by assessing the qualities, attributes and properties associated with the movies provided. After gathering the data, I cleaned and trimmed the dataset by removing duplicated rows, rows with zero or missing values, and columns that were irrelevant to my research questions.

> I then answered my research questions using pandas and plotted a few charts to visualize the results. The main attribute I used in the plots was `vote_average`. From those visualizations I found that: a. movies with higher budgets also tend to be rated highly by users, and b. movies that are rated highly by viewers tend to earn more revenue than their counterparts.

## Limitations
* Many rows had zero values in some columns. Dropping them caused me to lose about 50% of the dataset, so the results may not fully represent the whole dataset. They would have been more accurate if all values had been present.
* This dataset is not a fully adequate representation of the movie industry, since many movies released in other countries do not appear in it.
