
**Welcome to this exploration of 10000 movies!**

**Here are the steps we'll be taking:**



**Step 1:** Questions. What do we want to figure out with this data? We will sift through 10,000 movie titles to discover valuable relationships between variables such as revenue, genre, and popularity. We are especially curious about:

• Which types of movies are more popular?

• How many votes did horror movies get in 2021?

• Which 10 movies have the lowest vote average?

• Which 10 movies have the highest revenue?

• How has revenue trended over the past century?

• How has runtime trended over the past century?

We will direct our analysis towards answering these questions.
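As a rough sketch of how the two "top 10" questions might be approached, pandas offers `nlargest` and `nsmallest`. The tiny frame below is made up for illustration; only its column names mirror the dataset, and it asks for two rows rather than ten.

```python
import pandas as pd

# Made-up sample; only the column names mirror the real dataset.
sample = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "revenue": [500, 0, 1200, 300],
    "vote_average": [6.8, 4.1, 7.9, 5.5],
})

# Highest-revenue titles (use 10 instead of 2 on the full dataset).
top_revenue = sample.nlargest(2, "revenue")[["original_title", "revenue"]]

# Lowest-rated titles.
lowest_rated = sample.nsmallest(2, "vote_average")[["original_title", "vote_average"]]

print(top_revenue)
print(lowest_rated)
```

On the real data, movies with very few votes may dominate the lowest-rated list, so a minimum `vote_count` filter could be worth considering first.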




**Step 2:** Data Wrangling. Gather, load, and assess the data. Make modifications, such as adding and replacing information and removing duplicates and extraneous data, to ensure our dataset is clean for analysis.












**Step 3:** Data Exploration. Augment the data, remove outliers, create better features, and find patterns. This step might lead us back to the first two steps, questioning and wrangling.











**Step 4:** Conclusions. Lastly, we will summarize the relationships we found, make predictions, and present our findings visually.












In [None]:
# import all libraries we'll use
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

**Data Wrangling**










In this step, we will gather our data, then load it into a dataframe to assess its quality. We will be looking for missing values and other quality problems. We will remove extraneous data and make modifications, such as replacing information and removing duplicates, to ensure our dataset is trim and clean for analysis.











In [None]:
# Gather data: load the data and print out a few lines.

df = pd.read_csv('Top_10000_Movies.csv.zip',
                 lineterminator='\n')
df.head()

Unnamed: 0.1,Unnamed: 0,id,original_language,original_title,popularity,release_date,vote_average,vote_count,genre,overview,revenue,runtime,tagline
0,0,580489,en,Venom: Let There Be Carnage,5401.308,2021-09-30,6.8,1736,"['Science Fiction', 'Action', 'Adventure']",After finding a host body in investigative rep...,424000000,97.0,
1,1,524434,en,Eternals,3365.535,2021-11-03,7.1,622,"['Action', 'Adventure', 'Science Fiction', 'Fa...",The Eternals are a team of ancient aliens who ...,165000000,157.0,In the beginning...
2,2,438631,en,Dune,2911.423,2021-09-15,8.0,3632,"['Action', 'Adventure', 'Science Fiction']","Paul Atreides, a brilliant and gifted young ma...",331116356,155.0,"Beyond fear, destiny awaits."
3,3,796499,en,Army of Thieves,2552.437,2021-10-27,6.9,555,"['Action', 'Crime', 'Thriller']",A mysterious woman recruits bank teller Ludwig...,0,127.0,"Before Vegas, one locksmith became a legend."
4,4,550988,en,Free Guy,1850.47,2021-08-11,7.8,3493,"['Comedy', 'Action', 'Adventure', 'Science Fic...",A bank teller called Guy realizes he is a back...,331096766,115.0,Life's too short to be a background character.


There is a lot of information here, and much of it does not concern us.
















In [None]:
#Assess number of rows and columns of dataset
df.shape

(10000, 13)

In [None]:
#Assess summary of dataset.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         10000 non-null  int64  
 1   id                 10000 non-null  int64  
 2   original_language  10000 non-null  object 
 3   original_title     10000 non-null  object 
 4   popularity         10000 non-null  float64
 5   release_date       9962 non-null   object 
 6   vote_average       10000 non-null  float64
 7   vote_count         10000 non-null  int64  
 8   genre              10000 non-null  object 
 9   overview           9900 non-null   object 
 10  revenue            10000 non-null  int64  
 11  runtime            9991 non-null   float64
 12  tagline            7080 non-null   object 
dtypes: float64(3), int64(4), object(6)
memory usage: 1015.8+ KB


Four columns have missing values: release_date, overview, runtime, and tagline. I plan on removing overview and tagline since they aren't directly relevant to our questions, and I'll revisit the remaining missing data once the dataset is trimmed. Next we'll assess statistics for the numeric columns.











In [None]:
# assess statistics for each column
df.describe()

Unnamed: 0.1,Unnamed: 0,id,popularity,vote_average,vote_count,revenue,runtime
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,9991.0
mean,4999.5,250053.0833,34.516871,6.29875,1315.0849,57363880.0,98.773596
std,2886.89568,261734.6183,100.693958,1.43426,2501.899103,148077100.0,28.800581
min,0.0,5.0,6.269,0.0,0.0,0.0,0.0
25%,2499.75,11866.75,11.908,5.9,118.0,0.0,89.0
50%,4999.5,144476.0,17.488,6.5,425.5,591230.0,99.0
75%,7499.25,451485.0,29.62625,7.1,1297.25,47645490.0,113.0
max,9999.0,893478.0,5401.308,9.5,30184.0,2847246000.0,400.0


Now we'll make modifications to our dataset. First we'll remove extraneous columns and check for duplicates, then add and replace information to ensure our dataset is clean for analysis.
I'll drop the columns that aren't relevant to our analysis: original_language, overview, and tagline.
I'm keeping release_date for now, since the release year I'm more interested in has to be derived from it.
I'll keep the id here in case I want to merge with another dataset.
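One example of "replacing information": the summary statistics above show a minimum of 0 for both revenue and runtime. If those zeros stand for unknown values (an assumption, not something the dataset documents), they could be converted to NaN so they don't skew averages. A minimal sketch on made-up rows:

```python
import numpy as np
import pandas as pd

# Made-up rows; the column names mirror the real dataset.
sample = pd.DataFrame({
    "revenue": [424000000, 0, 331116356],
    "runtime": [97.0, 0.0, 155.0],
})

# Assumption: a 0 means "unknown", so treat it as missing.
cleaned = sample.assign(
    revenue=sample["revenue"].replace(0, np.nan),
    runtime=sample["runtime"].replace(0, np.nan),
)

print(cleaned.isnull().sum())
```

Statistics such as `mean()` and `median()` skip NaN by default, so the placeholder zeros no longer pull them down.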










In [None]:
df.drop(['original_language', 'overview', 'tagline'], axis=1, inplace=True)
df.head()

Unnamed: 0.1,Unnamed: 0,id,original_title,popularity,release_date,vote_average,vote_count,genre,revenue,runtime
0,0,580489,Venom: Let There Be Carnage,5401.308,2021-09-30,6.8,1736,"['Science Fiction', 'Action', 'Adventure']",424000000,97.0
1,1,524434,Eternals,3365.535,2021-11-03,7.1,622,"['Action', 'Adventure', 'Science Fiction', 'Fa...",165000000,157.0
2,2,438631,Dune,2911.423,2021-09-15,8.0,3632,"['Action', 'Adventure', 'Science Fiction']",331116356,155.0
3,3,796499,Army of Thieves,2552.437,2021-10-27,6.9,555,"['Action', 'Crime', 'Thriller']",0,127.0
4,4,550988,Free Guy,1850.47,2021-08-11,7.8,3493,"['Comedy', 'Action', 'Adventure', 'Science Fic...",331096766,115.0


The correct columns were removed. Next we'll assess if there are any duplicates.











In [None]:
# Count fully duplicated rows
df.duplicated().sum()

0

There are no duplicate rows, so nothing needs to be removed. Next, I'll assess if any rows have missing values.










In [None]:
df.isnull().sum()

Unnamed: 0         0
id                 0
original_title     0
popularity         0
release_date      38
vote_average       0
vote_count         0
genre              0
revenue            0
runtime            9
dtype: int64

Let's view the rows with missing information to assess whether it's OK to drop them. I'd like to order by runtime to get a sense of whether these are full feature-length films.









In [None]:
df[df.isnull().any(axis=1)].sort_values(['runtime'], ascending=True)

Unnamed: 0.1,Unnamed: 0,id,original_title,popularity,release_date,vote_average,vote_count,genre,revenue,runtime
564,564,875828,Untitled Peaky Blinders Film,100.232,,0.0,0,[],0,0.0
9581,9581,841281,El sexo me da risa 7,13.09,,5.0,1,[],0,0.0
7727,7727,346698,Barbie,11.329,,0.0,0,['Comedy'],0,0.0
7655,7655,642885,Hocus Pocus 2,9.594,,0.0,0,"['Fantasy', 'Family', 'Comedy']",0,0.0
7497,7497,774079,Happy Death Day to Us,12.127,,0.0,0,"['Thriller', 'Comedy']",0,0.0
7087,7087,879632,Gekijô ban poketto monsutâ: Daiamondo & Pâru -...,10.23,,0.0,0,[],0,0.0
7023,7023,842246,Chucky Boy Blue,11.747,,0.0,0,[],0,0.0
6997,6997,617127,Blade,9.506,,0.0,0,"['Action', 'Fantasy']",0,0.0
6030,6030,724334,My Hero Academia,17.816,,0.0,0,[],0,0.0
5980,5980,700028,True Stories Scream,13.589,,0.0,0,['Documentary'],0,0.0


In [None]:
df.dropna(inplace=True) 
print(df.isnull().sum().any()) 
print(df.info())

False
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9953 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      9953 non-null   int64  
 1   id              9953 non-null   int64  
 2   original_title  9953 non-null   object 
 3   popularity      9953 non-null   float64
 4   release_date    9953 non-null   object 
 5   vote_average    9953 non-null   float64
 6   vote_count      9953 non-null   int64  
 7   genre           9953 non-null   object 
 8   revenue         9953 non-null   int64  
 9   runtime         9953 non-null   float64
dtypes: float64(3), int64(4), object(3)
memory usage: 855.3+ KB
None


Let's review popularity, vote count, and vote average more closely. Are there any outliers we should remove?
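One common way to flag candidate outliers is the 1.5 × IQR rule. This is only a sketch of that general technique, not a step the notebook has committed to, and the popularity values below are made up.

```python
import pandas as pd

# Made-up popularity scores with one extreme value.
popularity = pd.Series([6.3, 11.9, 17.5, 29.6, 34.0, 5401.3], name="popularity")

# Interquartile range and the conventional upper fence.
q1, q3 = popularity.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Values above the fence are candidates for review, not automatic removal.
outliers = popularity[popularity > upper_fence]
print(outliers)
```

Because very popular movies are genuine data points, it may be better to inspect flagged rows than to drop them outright.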











In [None]:
df[['original_title','popularity', 'vote_count', 'vote_average']].sort_values('popularity', ascending=False).head(25)

Unnamed: 0,original_title,popularity,vote_count,vote_average
0,Venom: Let There Be Carnage,5401.308,1736,6.8
1,Eternals,3365.535,622,7.1
2,Dune,2911.423,3632,8.0
3,Army of Thieves,2552.437,555,6.9
4,Free Guy,1850.47,3493,7.8
5,Gunpowder Milkshake,1453.423,347,6.5
12,Shang-Chi and the Legend of the Ten Rings,1327.18,1414,7.7
6,Venom,1212.352,12126,6.8
9,American Badger,1148.822,14,6.3
11,劇場版 七つの大罪 光に呪われし者たち,1108.815,210,8.4


In [None]:
df[['original_title','popularity', 'vote_count', 'vote_average']].sort_values('popularity', ascending=False).tail()

Unnamed: 0,original_title,popularity,vote_count,vote_average
9357,Auntie Mame,6.338,72,7.0
9520,巨乳ドラゴン 温泉ゾンビVSストリッパー5,6.33,27,5.6
9458,"Steamboat Bill, Jr.",6.325,222,7.7
9839,Si on chantait,6.269,12,7.1
9513,Проклятый чиновник,6.269,3,4.7


The release dates shown above run through 2021, and our questions cover the past century, so I'll create a column for the decades based on the release year.
 











In [None]:
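# Derive the release year (assumes release_date parses as a date)
df['release_year'] = pd.to_datetime(df['release_date']).dt.year
# 13 increasing edges give 12 right-closed bins, e.g. (1909, 1919] is the 1910s
bin_edges = [1909, 1919, 1929, 1939, 1949, 1959, 1969, 1979, 1989, 1999, 2009, 2019, 2029]
bin_names = ['teens','twenties','thirties','forties','fifties','sixties', 'seventies', 'eighties', 'nineties', 'two_thousands', 'two_thousand_tens','two_thousand_twenties']
# Years outside the edges (before 1910) become NaN
df['decades'] = pd.cut(df['release_year'], bin_edges, labels=bin_names)
df.head()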
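As an alternative to hand-written bin edges, a decade can be computed arithmetically from the year. This is a sketch on made-up dates; it assumes `release_date` holds strings in `YYYY-MM-DD` form, as the rows printed earlier suggest.

```python
import pandas as pd

# Made-up release dates for illustration.
sample = pd.DataFrame({"release_date": ["2021-09-30", "1928-05-12", "1999-03-31"]})

# Parse the dates; unparseable values become NaT instead of raising.
release_year = pd.to_datetime(sample["release_date"], errors="coerce").dt.year

# Floor each year to the start of its decade, e.g. 1999 -> 1990.
sample["decade"] = (release_year // 10) * 10
print(sample)
```

This avoids maintaining matching lists of edges and labels, at the cost of numeric labels (1990) instead of names ("nineties").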

In [None]:
# The column is named 'genre' (lowercase)
df['genre'].value_counts()
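Counting the raw `genre` values treats each combination of genres as its own category. If the column stores each list as a string such as `"['Action', 'Comedy']"` (an assumption based on the printed rows), the individual genres can be counted by parsing and exploding the lists. A minimal sketch on made-up rows:

```python
import ast

import pandas as pd

# Made-up rows; genre is assumed to be a string representation of a list.
sample = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "genre": ["['Action', 'Comedy']", "['Action']", "[]"],
})

# Turn each string into a real Python list.
genres = sample["genre"].apply(ast.literal_eval)

# One row per (movie, genre) pair; empty lists become NaN and are not counted.
genre_counts = genres.explode().value_counts()
print(genre_counts)
```

The same exploded form could be joined back to `popularity` to compare genres, which is what the first question asks.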