<h1>PROJECT 1: INVESTIGATION OF A DATASET - TMDB MOVIES DATASET</h1>

<h2>Table of Contents</h2>
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling Phase</a></li>
<li><a href="#cleaning">Data Cleaning Phase</a></li>
<li><a href="#exploration">Data Exploration Phase</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>

<a id="intro"></a>
<h2>Introduction</h2>

<p>This dataset contains information about roughly 10,000 movies, short films and TV series collected from The Movie Database (TMDb), including user ratings, revenue, runtime and budget.</p>
<p>In this project, I'll be answering the following questions:</p>

<strong>Single variable (1d) questions:</strong>
<ol>
    <li>Which <u>year</u> has the highest release of movies? <a href="#1d-1">go-to</a></li>
    <li>Which <u>genre</u> has the highest release of movies? <a href="#1d-2">go-to</a></li>
    <li>Which 10 <u>actors</u> are cast the most? <a href="#1d-3">go-to</a></li>
</ol>

<strong>Multivariable (2d...) questions:</strong>
<ol>
    <li>Which movie lengths (<u>runtime</u>) are most liked according to their <u>popularity</u>? <a href="#2d-1">go-to</a></li>
    <li>What is the correlation between movies' <u>budgets</u> and their <u>revenue</u>? <a href="#2d-2">go-to</a></li>
    <li>What is the correlation between movies' <u>average ratings</u> and <u>revenue</u> generated? <a href="#2d-3">go-to</a></li>
    <li>What is the correlation between movies' <u>popularity</u> and <u>revenue</u> generated? <a href="#2d-4">go-to</a></li>
</ol>

In [None]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import operator
%matplotlib inline

: 

<a id="wrangling"></a>
<h2>DATA WRANGLING PHASE</h2>

<h3>Loading the dataset:</h3>
<p>Here, I load the dataset into memory.</p>

In [None]:
df = pd.read_csv('tmdb-movies.csv')

: 

<h3>Shape of the dataset:</h3>
<p>Here, I check for the number of columns and rows.</p>

In [None]:
df.shape

: 

<h4>Findings:</h4>
<ul>
    <li>The dataset has <b>10866 rows</b> and <b>21 columns</b>.</li>
</ul>

<h3>Columns of the dataset:</h3>
<p>Here, I check the column names.</p>

In [None]:
list(df.columns)

: 

<h3>Data types of the columns:</h3>
<p>Here, I check for the data type of each column.</p>

In [None]:
df.dtypes

: 

<h3>Unique values per column:</h3>
<p>Here, I check for the number of unique values in each column:</p>

In [None]:
df.nunique()

: 

<h4>Findings:</h4>
<ul>
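    <li>On the assumption that an imdb_id uniquely identifies a movie, there could be up to 11 duplicate movie entries, since the number of unique values in the imdb_id column is 11 less than the total number of rows. Because <code>nunique()</code> ignores null values by default, part of this gap may come from rows with a missing imdb_id rather than from true duplicates.</li>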
</ul>
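
<p>A minimal sketch of how the gap between the row count and the unique count could be split into missing ids and repeated ids. The frame below is made up for illustration (the ids and titles are hypothetical, not taken from the dataset); the same three calls could be run against <code>df</code>.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one repeated id and two missing ids
toy = pd.DataFrame({
    'imdb_id': ['tt001', 'tt002', 'tt002', np.nan, np.nan],
    'original_title': ['A', 'B', 'B', 'C', 'D'],
})

n_rows = len(toy)
n_unique = toy['imdb_id'].nunique()        # nulls are not counted by default
n_missing = toy['imdb_id'].isnull().sum()  # rows without an id
# Repeats among the rows that do have an id
n_repeated = toy['imdb_id'].dropna().duplicated().sum()

print('gap:', n_rows - n_unique)
print('missing ids:', n_missing)
print('repeated ids:', n_repeated)
```

<p>In this toy frame the gap equals the missing ids plus the repeated ids, which is the breakdown worth confirming on the real data before concluding how many duplicates exist.</p>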

<h3>First three rows of the dataset:</h3>
<p>Here, I take a look at the first three rows.</p>

In [None]:
df.head(3)

: 

<h3>The last three rows of the dataset:</h3>
<p>Here, I take a look at the last three rows.</p>

In [None]:
df.tail(3)

: 

<h3>Descriptive statistics of the dataset:</h3>

In [None]:
df.describe()

: 

<h3>Summary information of the dataset:</h3>

In [None]:
df.info()

: 

<a id="cleaning"></a>
<h2>DATA CLEANING PHASE</h2>

<h3>Duplicated rows of the dataset</h3>
<p>Here, I check for the duplicated rows in the dataset and drop them.</p>

In [None]:
# Check for sum of duplicated rows
df.duplicated().sum()

: 

In [None]:
# Drop duplicated rows
df.drop_duplicates(inplace=True)
# confirm no duplicates exist in the dataset
print('Number of duplicates in the data set: ', df.duplicated().sum())

: 

<h3>Rows with null values in the dataset</h3>
<p>Here, I check for the number of rows with null values per column.</p>

In [None]:
df.isnull().sum()

: 

<h3>Drop rows with a zero budget or revenue</h3>
<p>Here, I drop the rows whose adjusted budget or adjusted revenue is zero, since they might affect the accuracy of my investigation.</p>

In [None]:
# Remove rows where budget_adj or revenue_adj is equal to zero
# (.copy() makes df an independent frame so later column assignments are safe)
df = df[(df['budget_adj'] != 0) & (df['revenue_adj'] != 0)].copy()

: 
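
<p>Dropping these rows also removes them from the questions that do not involve money (year, genre, cast, runtime). A hedged alternative, sketched on a made-up frame, is to keep every row and mark the zeros as missing instead; pandas aggregations such as <code>mean()</code> and <code>corr()</code> skip missing values. The name <code>money_cols</code> and the toy values are assumptions for illustration only.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 0 stands in for "not recorded"
toy = pd.DataFrame({
    'budget_adj': [100.0, 0.0, 250.0, 80.0],
    'revenue_adj': [300.0, 500.0, 0.0, 60.0],
    'runtime': [90, 120, 150, 100],
})

money_cols = ['budget_adj', 'revenue_adj']

# Option A (as in the cell above): keep only rows where both values are non-zero
dropped = toy[(toy[money_cols] != 0).all(axis=1)]

# Option B: keep every row but mark the zeros as missing
masked = toy.copy()
masked[money_cols] = masked[money_cols].replace(0, np.nan)

print('rows kept by option A:', len(dropped))
print('rows kept by option B:', len(masked))
print('missing values after masking:', masked[money_cols].isnull().sum().sum())
```

<p>Which option fits better depends on whether the non-monetary questions should be answered on the full dataset or only on movies with known budgets and revenues.</p>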

<h3>Replace rows with null values for necessary columns in the dataset:</h3>
<p>Here, I replace the rows that contain null values in necessary columns for my investigation.</p>

In [None]:
# Replace missing directors with 'no director'
# (assigning the result back avoids relying on inplace changes to a column)
df['director'] = df['director'].fillna('no director')
# confirm that no movie has a 'null' value for a director
print('Movies with null for director: ', df['director'].isnull().sum())

# Replace missing production companies with 'no production company'
df['production_companies'] = df['production_companies'].fillna('no production company')
# confirm that no movie has a 'null' value for the production companies
print('Movies with null for production companies: ', df['production_companies'].isnull().sum())

# Replace missing genres with 'no genre'
df['genres'] = df['genres'].fillna('no genre')
# confirm that no movie has a 'null' value for the genres
print('Movies with null for genres: ', df['genres'].isnull().sum())

# Replace missing cast with 'no cast'
df['cast'] = df['cast'].fillna('no cast')
# confirm that no movie has a 'null' value for the cast
print('Movies with null for cast: ', df['cast'].isnull().sum())

: 
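
<p>The four replacements above follow the same pattern, so they could also be written as one call. The sketch below is an illustration on a made-up frame: <code>DataFrame.fillna</code> accepts a dictionary that maps column names to fill values. The name <code>placeholders</code> is hypothetical.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing value in each column
toy = pd.DataFrame({
    'director': ['X', np.nan],
    'genres': [np.nan, 'Drama'],
    'cast': ['P|Q', np.nan],
})

# One placeholder per column
placeholders = {
    'director': 'no director',
    'genres': 'no genre',
    'cast': 'no cast',
}

filled = toy.fillna(value=placeholders)

# confirm that no null values remain
print(filled.isnull().sum())
```

<p>Keeping the placeholders in a single dictionary also makes a copy-paste slip between columns easier to spot.</p>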

<h3>Convert columns to appropriate data types</h3>
<p>Here, I ensure columns have appropriate data types</p>

In [None]:
# Convert release_year to int (to support calculations on it)
df['release_year'] = df['release_year'].astype(int)

: 

In [None]:
# confirm change of data types
df.dtypes

: 

<a id="exploration"></a>
<h2>DATA EXPLORATION PHASE</h2>

<strong>REUSABLE FUNCTIONS AND CONSTANTS</strong>

In [None]:
# Correlation columns
corr_df = df[['revenue_adj', 'budget_adj', 'popularity', 'vote_average']]

# Reusable functions
def split_string(value, sep):
    # split only string values; return anything else (e.g. NaN) unchanged
    if isinstance(value, str):
        return value.split(sep)
    return value

: 

<a id="1d-1"></a><h3>1. Which year has the highest release of movies?</h3>

In [None]:
# set the figure size and style sheet before plotting so they apply to this figure
sns.set(rc={'figure.figsize': (25, 5)})
sns.set_style("whitegrid")

# count the number of movies in each year
years_with_movie_count = df.groupby('release_year')['id'].count()

# plot my findings
years_with_movie_count.plot(xticks=np.arange(1960, 2016, 5))
plt.title("Number of movies per year", fontsize=14)
plt.xlabel('Release year', fontsize=13)
plt.ylabel('Number of movies', fontsize=13)

: 
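
<p>Reading the busiest and quietest years off the line plot can be imprecise. As a minimal sketch on made-up release years (not the dataset's values), the same counts could be queried directly with <code>idxmax()</code> and <code>idxmin()</code>:</p>

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'release_year': [2010, 2011, 2011, 2009, 2011, 2010]})

# number of movies per year, ordered by year
per_year = toy['release_year'].value_counts().sort_index()

busiest_year = per_year.idxmax()   # year with the most releases
quietest_year = per_year.idxmin()  # year with the fewest releases

print(per_year)
print('busiest:', busiest_year, 'quietest:', quietest_year)
```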


<a id="1d-2"></a><h3>2. Which genre has the highest release of movies?</h3>

In [None]:
# count the total number of genre appearances
total_genres_appearances = df['genres'].str.cat(sep='|')
genres_count = pd.Series(total_genres_appearances.split('|')).value_counts(ascending=False)

# draw a bar plot of 'genre vs number of movies'
genres_count.plot(kind='bar', figsize=(13, 10), fontsize=12)
# set up the title and the labels of the plot (genres are on the x-axis, counts on the y-axis)
plt.title("Number of movies per genre", fontsize=15)
plt.xlabel('Genres', fontsize=13)
plt.ylabel('Number of movies', fontsize=13)

: 
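
<p>The pipe-separated counting above joins every row into one long string first. An alternative sketch, shown on made-up genre strings, splits each row into a list and uses <code>explode()</code> to give every genre its own row; this keeps the link between a genre and its movie, which could be useful for later per-genre statistics.</p>

```python
import pandas as pd

# Hypothetical toy data in the same pipe-separated layout as the genres column
toy = pd.DataFrame({'genres': ['Drama|Comedy', 'Drama', 'Action|Drama', 'Comedy']})

# split each string into a list, then give every list element its own row
genre_rows = toy['genres'].str.split('|').explode()

# count how often each genre appears
genre_counts = genre_rows.value_counts()

print(genre_counts)
```

<p>The same pattern would apply to the <code>cast</code> column, which uses the same separator.</p>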

<a id="1d-3"></a><h3>3. Which 10 actors have the most appearances?</h3>

In [None]:
# set the style sheet before plotting so it applies to this figure
sns.set_style("whitegrid")

# count the total number of actor appearances
total_actor_appearances = df['cast'].str.cat(sep='|')
actor_count = pd.Series(total_actor_appearances.split('|')).value_counts(ascending=False)

# draw a vertical bar plot of the 10 most frequent actors
actor_count.iloc[:10].plot.bar(figsize=(15, 6), colormap='tab20c', fontsize=12)

# set up the title and the labels of the plot
plt.title("Most frequent actors", fontsize=15)
plt.xticks(rotation=70)
plt.xlabel('Actor', fontsize=13)
plt.ylabel('Number of movies', fontsize=13)


: 

<a id="2d-1"></a><h3>4. Which movie lengths (runtime) are most liked according to their popularity?</h3>

In [None]:
# set the default figure size and the style sheet before plotting
sns.set(rc={'figure.figsize': (15, 8)})
sns.set_style("whitegrid")

# group the data by runtime, compute the mean popularity of each runtime and plot it
df.groupby('runtime')['popularity'].mean().plot(figsize=(13, 5), xticks=np.arange(0, 400, 20))

# set up the title and the labels of the plot
plt.title("Runtime vs popularity", fontsize=14)
plt.xlabel('Runtime', fontsize=13)
plt.ylabel('Average popularity', fontsize=13)

: 
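
<p>Averaging popularity per exact minute can be noisy, because some runtimes may be represented by a single movie. A hedged sketch of one way to smooth this is to group runtimes into ranges with <code>pd.cut</code> and report both the mean and the number of movies in each range. The bin edges and toy values below are assumptions for illustration, not taken from the dataset.</p>

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'runtime': [85, 95, 110, 118, 125, 170],
    'popularity': [0.5, 0.7, 1.0, 1.4, 1.2, 2.0],
})

# Assumed bin edges in minutes; each bin includes its right edge
bins = [60, 100, 120, 140, 180]
toy['runtime_bin'] = pd.cut(toy['runtime'], bins=bins)

# mean popularity and number of movies per runtime range
binned = toy.groupby('runtime_bin', observed=True)['popularity'].agg(['mean', 'count'])

print(binned)
```

<p>The <code>count</code> column shows how many movies support each average, which helps judge whether a peak reflects a trend or just a handful of titles.</p>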

<h3>How do different attributes correlate with each other?</h3>

<a id="2d-2"></a><h3>5. budget_adj and revenue_adj</h3>

In [None]:
budget_revenue_correlation_fig = sns.jointplot(x = "budget_adj", y = "revenue_adj", kind = "scatter", data = corr_df)
budget_revenue_correlation_fig.fig.suptitle('Scatterplot and correlation for budget_adj and revenue_adj')
print('Correlation coefficient: ', corr_df['budget_adj'].corr(corr_df['revenue_adj']))

: 

<h4>Findings:</h4>
<ul>
    <li>There is a <b>moderate positive correlation</b> between budget_adj and revenue_adj.</li>
</ul>
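
<p>For reference, <code>Series.corr</code> computes the Pearson coefficient by default. The sketch below, on made-up numbers, works out the same quantity by hand (the sum of the products of the deviations divided by the square root of the product of the summed squared deviations) and compares it with the pandas result.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

# deviations from the mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas default method is 'pearson'
r_pandas = x.corr(y)

print(r_manual, r_pandas)
```

<p>The coefficient ranges from -1 to 1; values near 0 indicate little linear relationship, and the labels "weak", "moderate" and "strong" are conventions rather than fixed thresholds.</p>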

<a id="2d-3"></a><h3>6. revenue_adj and vote_average</h3>

In [None]:
revenue_vote_correlation_fig = sns.jointplot(x = "revenue_adj", y = "vote_average", kind = "scatter", data = corr_df)
revenue_vote_correlation_fig.fig.suptitle('Scatterplot and correlation for revenue_adj and vote_average')
print('Correlation coefficient: ', corr_df['revenue_adj'].corr(corr_df['vote_average']))

: 

<h4>Findings:</h4>
<ul>
    <li>There is almost <b>no linear correlation</b> between revenue_adj and vote_average. This implies that movies with high revenues do not necessarily have high average ratings too.</li>
</ul>
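
<p>A low Pearson coefficient only rules out a <i>linear</i> relationship. As a hedged illustration on made-up numbers, a rank-based (Spearman) coefficient can be obtained by correlating the ranks of the two series, and it picks up a monotonic but curved relationship that Pearson understates. Whether this changes the picture for revenue_adj and vote_average would need to be checked on the real data.</p>

```python
import pandas as pd

# Hypothetical toy data: y always rises with x, but not along a straight line
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 4

# Pearson measures linear association
pearson = x.corr(y)

# Spearman is the Pearson coefficient of the ranks
spearman = x.rank().corr(y.rank())

print('pearson:', pearson)
print('spearman:', spearman)
```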

<a id="2d-4"></a><h3>7. revenue_adj and popularity</h3>

In [None]:
revenue_popularity_correlation_fig = sns.jointplot(x = "revenue_adj", y = "popularity", kind = "scatter", data = corr_df)
revenue_popularity_correlation_fig.fig.suptitle('Scatterplot and correlation for revenue_adj and popularity')
print('Correlation coefficient: ', corr_df['revenue_adj'].corr(corr_df['popularity']))

: 

<h4>Findings:</h4>
<ul>
    <li>There is a <b>moderate positive correlation</b> between revenue_adj and popularity.</li>
</ul>

<a id="conclusion"></a>
<h2>CONCLUSION</h2>
<div>
    <h3>Summary of analysis:</h3>
    <h4>Descriptive summary:</h4>
    <p>The dataset has 10866 rows and 21 columns. Some columns have null and zero values, so these needed to be handled in the cleaning process.</p>
    <h4>Exploratory summary:</h4>
    <h5>1d exploratory findings:</h5>
    <ol>
        <li>The biggest annual release of movies happened in 2011 and the smallest was in 1969.</li>
        <li>The Drama genre had the biggest release of movies and TV Movie had the least.</li>
        <li>Robert De Niro is the most frequently cast actor, with over 50 appearances in the dataset.</li>
    </ol>
    <h5>Multidimensional exploratory findings:</h5>
    <ol>
        <li>Movies with a runtime between 160 and 180 minutes tend to have the highest popularity.</li>
        <li>There is a moderate positive correlation between movie budgets and their consequent revenue.</li>
        <li>There is almost no linear correlation between movie revenue and vote average.</li>
        <li>There is a moderate positive correlation between movie revenues and their popularity.</li>
    </ol>
</div>
<div>
    <h3>Limitations:</h3>
    <ul>
        <li>The completeness of the dataset is questionable, because the revenue and budget columns do not specify a currency. If different movies report budgets and revenues in different currencies, for example based on their country of production, such a disparity could render parts of this analysis incorrect and misleading.</li>
        <li>In spite of the fact that this is a relatively rich dataset, many rows had null and zero values. Besides making the cleaning phase tedious and time consuming, this can create false positives during the investigation (assuming I replace these with placeholders instead of completely dropping them).</li>
    </ul>
</div>
<div>
    <h3>Resources used:</h3>
    <ul>
        <li>Kaggle: <i>https://kaggle.com</i></li>
        <li>Pandas Documentation: <i>https://pandas.pydata.org/docs/index.html</i></li>
        <li>Seaborn API Reference: <i>https://seaborn.pydata.org/api.html</i></li>
        <li>NumPy Documentation: <i>https://numpy.org/doc/stable/</i></li>
        <li>"Correlation does not imply causation": <i>https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation</i></li>
    </ul>
</div>