# Project: Investigate a Dataset - TMDb Movies Dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 

For this project, I have decided to use the TMDb Movies Dataset. The dataset can be found [here](https://docs.google.com/document/d/e/2PACX-1vTlVmknRRnfy_4eTrjw5hYGaiQim5ctr9naaRd4V9du2B5bxpd8FEH3KtDgp8qVekw7Cj1GLk1IXdZi/pub?embedded=True).

This dataset contains information about 10,000 movies collected from The Movie Database (TMDb). The information has been organised into 21 columns, including title, genre, imdb_id, revenue, and budget.


### Question(s) for Analysis
For my analysis, I will ask the following questions:
1. What is the average budget of the better movies? (Here, better movies are those with a vote_average greater than the mean vote_average.)
2. Is there any correlation between genre and budget? Which genres have the highest and lowest budgets?
3. Which months had better-performing movies (in terms of revenue)?


In [3]:
# Import statements for the packages I will be using
import pandas as pd
import numpy as np
import csv
from datetime import datetime
import matplotlib.pyplot as plt

# Magic command so that visualizations are plotted inline in the notebook.
%matplotlib inline


In [None]:
# Upgrade pandas so that the DataFrame.explode() function is available.
!pip install --upgrade pandas==0.25.0

<a id='wrangling'></a>
## Data Wrangling


### General Properties
First, I load the CSV file containing the data and store it in a variable named "tmdb_data".

In [None]:
# Load the CSV file and store it in the variable "tmdb_data"
tmdb_data = pd.read_csv('tmdb-movies.csv')

# Print the first five rows of the TMDb movies dataset
tmdb_data.head()



### Data Cleaning
After reviewing the dataset against the questions for analysis, I drop the columns that are not relevant to my analysis: imdb_id, popularity, cast, homepage, director, tagline, keywords, overview, runtime, and production_companies.

Secondly, I inspect the data for rows with missing values.
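
The inspection for missing values could look roughly like the sketch below. The helper names `missing_value_report` and `rows_with_missing` are hypothetical, and the small `demo` frame is a made-up stand-in for `tmdb_data`, not data from the TMDb file; only the pandas calls `isnull`, `sum`, `any`, and `sort_values` are standard.

```python
import numpy as np
import pandas as pd


def missing_value_report(df):
    """Return the number of missing values per column, largest first."""
    return df.isnull().sum().sort_values(ascending=False)


def rows_with_missing(df):
    """Return only the rows that contain at least one missing value."""
    return df[df.isnull().any(axis=1)]


# Hypothetical stand-in for tmdb_data; the values are made up for illustration.
demo = pd.DataFrame({
    'original_title': ['A', 'B', 'C'],
    'genres': ['Action', np.nan, 'Drama'],
    'budget': [100, 200, 300],
})

report = missing_value_report(demo)
incomplete = rows_with_missing(demo)
print(report)
print(incomplete)
```

In the notebook, the same two helpers would be called on `tmdb_data` once it has been loaded.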

In [2]:
# Drop the columns that are irrelevant to the analysis.

# Create a list of columns to drop
del_col = ['imdb_id', 'popularity', 'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview', 'runtime', 'production_companies']

# Drop the columns (the keyword form avoids the deprecated positional axis argument)
tmdb_data = tmdb_data.drop(columns=del_col)

tmdb_data.info()

<a id='eda'></a>
## Exploratory Data Analysis

### Research Question 1: What is the average budget of the better movies?

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2: Is there any correlation between genre and budget?

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.
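
Because genre is a categorical variable, the genre and budget question is more naturally explored by comparing average budgets per genre than by a correlation coefficient. The sketch below assumes that the `genres` column holds pipe-separated names such as "Action|Drama"; that format, the function name `mean_budget_by_genre`, and the `demo` values are assumptions for illustration. It relies on `DataFrame.explode()`, which is available from pandas 0.25.0.

```python
import pandas as pd


def mean_budget_by_genre(df):
    """Average budget per genre, highest first."""
    # Assumption: 'genres' holds pipe-separated names, e.g. "Action|Drama".
    exploded = (
        df.dropna(subset=['genres'])
          .assign(genres=lambda d: d['genres'].str.split('|'))
          .explode('genres')  # one row per (movie, genre) pair
    )
    by_genre = exploded.groupby('genres')['budget'].mean()
    return by_genre.sort_values(ascending=False)


# Hypothetical stand-in for tmdb_data; the values are made up for illustration.
demo = pd.DataFrame({
    'genres': ['Action|Drama', 'Drama', 'Comedy'],
    'budget': [100, 50, 20],
})

by_genre = mean_budget_by_genre(demo)
print(by_genre)
print('Highest:', by_genre.index[0], 'Lowest:', by_genre.index[-1])
```

Note that a movie with several genres contributes its full budget to each of them, so the per-genre averages are not independent of one another.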

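### Research Question 3: Which months had better-performing movies (in terms of revenue)?

A possible starting point for the third question is to parse `release_date`, extract the month, and average `revenue` per month. The format of `release_date` in the CSV is not shown in this notebook, so the sketch leaves it as an optional `date_format` parameter. The function name `mean_revenue_by_month` and the `demo` values are hypothetical.

```python
import pandas as pd


def mean_revenue_by_month(df, date_format=None):
    """Average revenue per release month (1 = January, 12 = December)."""
    # Parse the dates; pass date_format explicitly if the strings are ambiguous.
    dates = pd.to_datetime(df['release_date'], format=date_format)
    return df['revenue'].groupby(dates.dt.month).mean().sort_index()


# Hypothetical stand-in for tmdb_data; the values are made up for illustration.
demo = pd.DataFrame({
    'release_date': ['2015-06-09', '2014-06-20', '2015-12-15'],
    'revenue': [100, 300, 50],
})

by_month = mean_revenue_by_month(demo)
print(by_month)
print('Best month:', by_month.idxmax())
```

The resulting series can be passed to `by_month.plot(kind='bar')` to compare months visually.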

<a id='conclusions'></a>
## Conclusions

> **Tip**: Finally, summarize your findings and the results that have been performed in relation to the question(s) provided at the beginning of the analysis. Summarize the results accurately, and point out where additional research can be done or where additional information could be useful.

> **Tip**: Make sure that you are clear with regards to the limitations of your exploration. You should have at least 1 limitation explained clearly. 

> **Tip**: If you haven't done any statistical tests, do not imply any statistical conclusions. And make sure you avoid implying causation from correlation!

> **Tip**: Once you are satisfied with your work here, check over your report to make sure that it satisfies all the areas of the rubric (found on the project submission page at the end of the lesson). You should also probably remove all of the "Tips" like this one so that the presentation is as polished as possible.

## Submitting your Project 

> **Tip**: Before you submit your project, you need to create a .html or .pdf version of this notebook in the workspace here. To do that, run the code cell below. If it worked correctly, you should get a return code of 0, and you should see the generated .html file in the workspace directory (click on the orange Jupyter icon in the upper left).

> **Tip**: Alternatively, you can download this report as .html via the **File** > **Download as** submenu, and then manually upload it into the workspace directory by clicking on the orange Jupyter icon in the upper left, then using the Upload button.

> **Tip**: Once you've done this, you can submit your project by clicking on the "Submit Project" button in the lower right here. This will create and submit a zip file with this .ipynb doc and the .html or .pdf version you created. Congratulations!

In [None]:
from subprocess import call
# Convert the notebook to HTML; a return code of 0 indicates success.
call(['python', '-m', 'nbconvert', '--to', 'html', 'Investigate_a_Dataset.ipynb'])