### <font color=cyan>MICROSOFT MOVIE STUDIO: MOVIE ANALYSIS</font> 

![movie_analysis](img-1.png) 

## Introduction
Microsoft aims to enter the cinematic market with a strategic approach. As a data analyst for their new movie studio, my role is to identify the types of films that are performing exceptionally well, along with genre popularity, audience preferences, and cost. This study aims to uncover actionable insights that will guide Microsoft's new studio in making informed decisions.



## Business problem
Despite Microsoft's extensive experience in technology and software, the company lacks expertise in the film industry. To ensure a successful entry into the cinematic market, Microsoft needs to understand which types of films are currently excelling. This involves:

- Identifying popular films and their original languages.
- Identifying popular studios.
- Identifying high-performing genres.
- Analyzing the costs associated with films that resonate with audiences and deliver a strong return on investment.

This requires a detailed analysis to minimize risks and maximize the potential for success in the highly competitive movie industry.

## Data understanding
For a successful analysis, I am using the following datasets, which provide an array of information crucial for analyzing our business problem.

- <font color=cyan>tmdb.movie</font> - This dataset provides insights into film popularity, original language, and release dates of different films.
- <font color=cyan>bom.movie_gross</font> - This dataset helps identify top-performing studios.
- <font color=cyan>im.db</font> - This dataset helps identify popular genres.
- <font color=cyan>m.movie_budget</font> - This dataset gives details on costs such as production, domestic, and foreign, which helps in identifying high-performing film types and trends in audience spending.



In [4]:
#importing all necessary libraries for my analysis.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import sqlite3


### <font color=gray>Loading Data for cleaning</font>

Data cleaning is an important part of data analysis because it normalizes and standardizes the data, leading to more meaningful and valid insights. The process involves identifying and correcting errors, handling missing values, removing duplicates, and converting data into appropriate formats. By addressing these issues, I will be able to minimize inconsistencies, ultimately enhancing the quality of the data and ensuring that the results of the analysis are robust and trustworthy.
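
The cleaning steps named above can be sketched on a small, made-up table. This is only an illustration under stated assumptions: the `raw` DataFrame, its column names, and the `clean` helper are hypothetical and are not part of the project datasets.

```python
import pandas as pd

# Hypothetical toy data, not taken from any of the project datasets.
raw = pd.DataFrame({
    "title": ["Film A", "Film A", "Film B", "Film C"],
    "release_date": ["2010-01-15", "2010-01-15", "2011-06-30", None],
    "budget": ["1000", "1000", "2500", "400"],
})


def clean(df):
    """Apply the cleaning steps described above to a copy of df."""
    out = df.drop_duplicates().copy()                           # remove duplicate rows
    out = out.dropna(subset=["release_date"])                   # handle missing values
    out["release_date"] = pd.to_datetime(out["release_date"])   # convert text to dates
    out["budget"] = out["budget"].astype(int)                   # convert text to numbers
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned.dtypes)
```

Dropping rows with missing values is only one option; whether to drop or fill depends on how much data would be lost.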

In [5]:
#loading my first dataset.

df1 = pd.read_csv(r'moviedata\zippedData\tmdb.movies.csv\tmdb.movies.csv')

### <font color=gray>Data inspection</font>
After the data has been loaded successfully, inspecting the dataset is a crucial step in data cleaning.
Inspection allows us to have information on the following to make our analysis better.
- Dataset range
- Total number of columns
- Total number of rows
- Datatype for each column.
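
Each item in the list above maps to a DataFrame attribute. The sketch below shows them one by one on a two-row stand-in table; `sample` is a hypothetical name, and its values are copied from rows displayed later in this notebook.

```python
import pandas as pd

# Small stand-in for a loaded dataset (values copied from the tmdb rows shown later).
sample = pd.DataFrame({
    "title": ["Toy Story", "Inception"],
    "popularity": [28.005, 27.92],
    "vote_count": [10174, 22186],
})

n_rows, n_cols = sample.shape   # total number of rows and columns
print(sample.index)             # dataset (index) range
print(n_rows, n_cols)
print(sample.dtypes)            # datatype of each column
```

`DataFrame.info()` reports the same details in a single call.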


In [6]:
#running code for data inspection
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


Having gotten the structure of the dataset, I will dig deeper into the dataset to check whether the column names accurately describe the data.

In [7]:
#checking the first five rows of the dataset
df1.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


From the output produced, we can determine which data cleaning steps to employ to standardize our dataset. These are:
- Dropping duplicate columns
- Renaming columns
- Splitting a column
- Performing case conversion
- Checking for missing values



In [8]:
#dropping the duplicate columns
df1.drop(columns=["Unnamed: 0", "original_title"], inplace=True)

In [9]:
df1.head()

Unnamed: 0,genre_ids,id,original_language,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,27.92,2010-07-16,Inception,8.3,22186


In [10]:
#splitting the date column to get year
df1["Year"] = df1['release_date'].str[:4]
df1.head()

Unnamed: 0,genre_ids,id,original_language,popularity,release_date,title,vote_average,vote_count,Year
0,"[12, 14, 10751]",12444,en,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788,2010
1,"[14, 12, 16, 10751]",10191,en,28.734,2010-03-26,How to Train Your Dragon,7.7,7610,2010
2,"[12, 28, 878]",10138,en,28.515,2010-05-07,Iron Man 2,6.8,12368,2010
3,"[16, 35, 10751]",862,en,28.005,1995-11-22,Toy Story,7.9,10174,1995
4,"[28, 878, 12]",27205,en,27.92,2010-07-16,Inception,8.3,22186,2010
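
Slicing the first four characters works as long as every date follows the `YYYY-MM-DD` layout. As an alternative sketch, the dates could be parsed with `pd.to_datetime` and the year read from the `.dt` accessor. The `dates` frame below is a hypothetical stand-in built from release dates shown above. Note that this yields a numeric year, whereas the slicing approach keeps `Year` as text.

```python
import pandas as pd

# Sample release dates copied from the rows shown above.
dates = pd.DataFrame({"release_date": ["2010-11-19", "2010-03-26", "1995-11-22"]})

# errors="coerce" turns unparseable strings into NaT instead of raising an error.
parsed = pd.to_datetime(dates["release_date"], format="%Y-%m-%d", errors="coerce")
dates["Year"] = parsed.dt.year

print(dates)
```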


In [11]:
#performing case conversion
#capitalizing the first letter of each column name.
df1.columns = [col.capitalize() for col in df1.columns]
df1.head()

Unnamed: 0,Genre_ids,Id,Original_language,Popularity,Release_date,Title,Vote_average,Vote_count,Year
0,"[12, 14, 10751]",12444,en,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788,2010
1,"[14, 12, 16, 10751]",10191,en,28.734,2010-03-26,How to Train Your Dragon,7.7,7610,2010
2,"[12, 28, 878]",10138,en,28.515,2010-05-07,Iron Man 2,6.8,12368,2010
3,"[16, 35, 10751]",862,en,28.005,1995-11-22,Toy Story,7.9,10174,1995
4,"[28, 878, 12]",27205,en,27.92,2010-07-16,Inception,8.3,22186,2010


In [12]:
#renaming columns
df1.rename(columns={"Title":"Movie_title"}, inplace=True)
df1.head()

Unnamed: 0,Genre_ids,Id,Original_language,Popularity,Release_date,Movie_title,Vote_average,Vote_count,Year
0,"[12, 14, 10751]",12444,en,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788,2010
1,"[14, 12, 16, 10751]",10191,en,28.734,2010-03-26,How to Train Your Dragon,7.7,7610,2010
2,"[12, 28, 878]",10138,en,28.515,2010-05-07,Iron Man 2,6.8,12368,2010
3,"[16, 35, 10751]",862,en,28.005,1995-11-22,Toy Story,7.9,10174,1995
4,"[28, 878, 12]",27205,en,27.92,2010-07-16,Inception,8.3,22186,2010


In [13]:
#checking for missing values
df1.isna().sum()

Genre_ids            0
Id                   0
Original_language    0
Popularity           0
Release_date         0
Movie_title          0
Vote_average         0
Vote_count           0
Year                 0
dtype: int64

The above output shows that our data has no missing values.

### <font color=gray>Data Analysis</font>
With the cleaned data, we can now determine:
- Popular movie titles and a factor contributing to popularity (language)
- Original language of the most voted movies
- Year with the most released movies
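
One possible way to answer these three questions is sketched below. It runs on the five rows displayed earlier rather than the full dataset, so its results are illustrative only. The names `sample`, `top_popular`, `votes_by_language`, and `releases_per_year` are assumptions introduced for this example.

```python
import pandas as pd

# The five rows displayed earlier, using the cleaned column names.
sample = pd.DataFrame({
    "Movie_title": [
        "Harry Potter and the Deathly Hallows: Part 1",
        "How to Train Your Dragon",
        "Iron Man 2",
        "Toy Story",
        "Inception",
    ],
    "Original_language": ["en"] * 5,
    "Popularity": [33.533, 28.734, 28.515, 28.005, 27.92],
    "Vote_count": [10788, 7610, 12368, 10174, 22186],
    "Year": ["2010", "2010", "2010", "1995", "2010"],
})

# Most popular titles together with their original language.
top_popular = sample.nlargest(3, "Popularity")[
    ["Movie_title", "Original_language", "Popularity"]
]

# Total votes per original language, highest first.
votes_by_language = (
    sample.groupby("Original_language")["Vote_count"].sum().sort_values(ascending=False)
)

# Number of releases per year, most frequent first.
releases_per_year = sample["Year"].value_counts()

print(top_popular)
print(votes_by_language)
print(releases_per_year)
```

Replacing `sample` with `df1` would apply the same steps to the full cleaned dataset.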

In [14]:
df1.head()

Unnamed: 0,Genre_ids,Id,Original_language,Popularity,Release_date,Movie_title,Vote_average,Vote_count,Year
0,"[12, 14, 10751]",12444,en,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788,2010
1,"[14, 12, 16, 10751]",10191,en,28.734,2010-03-26,How to Train Your Dragon,7.7,7610,2010
2,"[12, 28, 878]",10138,en,28.515,2010-05-07,Iron Man 2,6.8,12368,2010
3,"[16, 35, 10751]",862,en,28.005,1995-11-22,Toy Story,7.9,10174,1995
4,"[28, 878, 12]",27205,en,27.92,2010-07-16,Inception,8.3,22186,2010


In [15]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Genre_ids          26517 non-null  object 
 1   Id                 26517 non-null  int64  
 2   Original_language  26517 non-null  object 
 3   Popularity         26517 non-null  float64
 4   Release_date       26517 non-null  object 
 5   Movie_title        26517 non-null  object 
 6   Vote_average       26517 non-null  float64
 7   Vote_count         26517 non-null  int64  
 8   Year               26517 non-null  object 
dtypes: float64(2), int64(2), object(5)
memory usage: 1.8+ MB


In [23]:
conn = sqlite3.connect(r'moviedata\zippedData\im.db\im.db')
cur = conn.cursor()

In [24]:
cur.execute("SELECT name FROM sqlite_master WHERE type='table';")
table_names = cur.fetchall()
table_names

[('movie_basics',),
 ('directors',),
 ('known_for',),
 ('movie_akas',),
 ('movie_ratings',),
 ('persons',),
 ('principals',),
 ('writers',)]
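
The same table listing can also be read straight into pandas. The sketch below uses an in-memory SQLite database so that it runs without the project files. The table `demo_movies` and its rows are hypothetical and exist only for this illustration.

```python
import sqlite3

import pandas as pd

# In-memory database with a hypothetical table, used only for illustration.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE demo_movies (movie_id TEXT, start_year INTEGER)")
conn_demo.executemany(
    "INSERT INTO demo_movies VALUES (?, ?)",
    [("tt0000001", 2018), ("tt0000002", 2019)],
)
conn_demo.commit()

# List the tables by querying sqlite_master, as in the cell above.
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", conn_demo)
table_names_demo = tables["name"].tolist()

# Load one table into a DataFrame.
demo = pd.read_sql("SELECT * FROM demo_movies;", conn_demo)
conn_demo.close()

print(table_names_demo)
print(demo)
```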

In [26]:
movie_basics = pd.read_sql("""SELECT *
                    FROM movie_basics;
                """, conn)
movie_basics.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"
