## Dataset Description
This dataset contains information about more than 9,000 movies extracted from the TMDB API.

## Column Descriptions
1. `Release_Date`: Date when the movie was released.
2. `Title`: Name of the movie.
3. `Overview`: Brief summary of the movie.
4. `Popularity`: A metric computed by TMDB from the number of views per day, votes per day, the number of users who marked the movie as a "favorite" or added it to their "watchlist", the release date, and other factors.
5. `Vote_Count`: Total votes received from the viewers.
6. `Vote_Average`: Average rating out of 10, based on the votes received.
7. `Original_Language`: Original language of the movie. Dubbed versions are not considered the original language.
8. `Genre`: Categories the movie can be classified under.
9. `Poster_Url`: URL of the movie poster.

## EDA Questions
- Q1: What is the most frequent `genre` in the dataset?
- Q2: Which `genres` have the highest `votes`?
- Q3: Which movie has the highest `popularity`? What is its `genre`?
- Q4: Which year has the most filmed movies?
___

## Environment Set-up

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Data Wrangling

In [4]:
df = pd.read_csv('mymoviedb.csv', lineterminator='\n')
df.head()

|   | Release_Date | Title | Overview | Popularity | Vote_Count | Vote_Average | Original_Language | Genre | Poster_Url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-12-15 | Spider-Man: No Way Home | Peter Parker is unmasked and no longer able to... | 5083.954 | 8940 | 8.3 | en | Action, Adventure, Science Fiction | https://image.tmdb.org/t/p/original/1g0dhYtq4i... |
| 1 | 2022-03-01 | The Batman | In his second year of fighting crime, Batman u... | 3827.658 | 1151 | 8.1 | en | Crime, Mystery, Thriller | https://image.tmdb.org/t/p/original/74xTEgt7R3... |
| 2 | 2022-02-25 | No Exit | Stranded at a rest stop in the mountains durin... | 2618.087 | 122 | 6.3 | en | Thriller | https://image.tmdb.org/t/p/original/vDHsLnOWKl... |
| 3 | 2021-11-24 | Encanto | The tale of an extraordinary family, the Madri... | 2402.201 | 5076 | 7.7 | en | Animation, Comedy, Family, Fantasy | https://image.tmdb.org/t/p/original/4j0PNHkMr5... |
| 4 | 2021-12-22 | The King's Man | As a collection of history's worst tyrants and... | 1895.511 | 1793 | 7.0 | en | Action, Adventure, Thriller, War | https://image.tmdb.org/t/p/original/aq4Pwv5Xeu... |
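
The `Genre` values in these rows pack several genres into one comma-separated string, which matters for a question like Q1. As a minimal sketch, assuming the separator is always a comma followed by a space, each genre can be given its own row before counting. The `sample` frame is a toy built from three of the rows shown here, not the full dataset.

```python
import pandas as pd

# Toy sample built from three rows of the preview; the real analysis
# would run on the full `df` loaded from the CSV.
sample = pd.DataFrame({
    "Title": ["Spider-Man: No Way Home", "The Batman", "No Exit"],
    "Genre": [
        "Action, Adventure, Science Fiction",
        "Crime, Mystery, Thriller",
        "Thriller",
    ],
})

# Split each string into a list of genres, then give every genre its own row.
genres = sample["Genre"].str.split(", ").explode()

# Count how often each genre appears and pick the most frequent one.
genre_counts = genres.value_counts()
most_frequent = genre_counts.idxmax()

print(genre_counts)
print("Most frequent genre in the sample:", most_frequent)
```

On the full data the same three calls (`str.split`, `explode`, `value_counts`) would apply to `df["Genre"]`, provided the separator assumption holds.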


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9827 entries, 0 to 9826
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Release_Date       9827 non-null   object 
 1   Title              9827 non-null   object 
 2   Overview           9827 non-null   object 
 3   Popularity         9827 non-null   float64
 4   Vote_Count         9827 non-null   int64  
 5   Vote_Average       9827 non-null   float64
 6   Original_Language  9827 non-null   object 
 7   Genre              9827 non-null   object 
 8   Poster_Url         9827 non-null   object 
dtypes: float64(2), int64(1), object(6)
memory usage: 691.1+ KB


In [6]:
df.duplicated().sum()

0

### Exploration Summary
- The dataframe consists of 9,827 rows and 9 columns.
- The dataset looks fairly tidy, with no NaNs and no duplicated rows.
- The `Release_Date` column needs to be cast to datetime so that only the year value can be extracted.
- `Overview`, `Original_Language` and `Poster_Url` won't be useful during the analysis, so we'll drop them.
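
The checks behind this summary can be gathered in one place. The following is a sketch on a hypothetical three-row frame named `toy`, standing in for `df`; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame standing in for `df`.
toy = pd.DataFrame({
    "Title": ["The Batman", "No Exit", "Encanto"],
    "Vote_Count": [1151, 122, 5076],
    "Vote_Average": [8.1, 6.3, 7.7],
})

# Shape gives (rows, columns).
n_rows, n_cols = toy.shape

# Missing values per column; all zeros means no NaNs.
missing_per_column = toy.isna().sum()

# Number of rows that are exact copies of an earlier row.
n_duplicates = int(toy.duplicated().sum())

print(f"{n_rows} rows x {n_cols} columns")
print(missing_per_column)
print(f"duplicated rows: {n_duplicates}")
```

Note that `duplicated()` compares whole rows by default, so two entries for the same title with different vote counts would not be flagged.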
___

## Data Cleaning

**Casting the `Release_Date` column and extracting year values**

In [7]:
df['Release_Date'] = pd.to_datetime(df['Release_Date'], format='%Y-%m-%d')
df['Release_Date'] = df['Release_Date'].dt.year

df['Release_Date'].head()

0    2021
1    2022
2    2022
3    2021
4    2021
Name: Release_Date, dtype: int64
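
The cast above relies on every value matching `%Y-%m-%d`. For messier input, one possible variation (an assumption, not something this dataset required) is to pass `errors="coerce"` so that unparseable strings become `NaT` instead of raising. The `dates` series below is made up for illustration.

```python
import pandas as pd

# Made-up values: two valid dates and one malformed entry.
dates = pd.Series(["2021-12-15", "2022-03-01", "not a date"])

# Unparseable strings become NaT rather than raising an error.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# With NaT present, `.dt.year` yields floats; the nullable Int64 dtype
# keeps whole-number years and marks the missing one as <NA>.
years = parsed.dt.year.astype("Int64")

# Count how many entries failed to parse.
n_bad = int(parsed.isna().sum())

print(years)
print("unparseable dates:", n_bad)
```

Counting the coerced values before continuing makes it clear how many rows would be affected.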

___
**Dropping the `Overview`, `Original_Language` and `Poster_Url` columns**

In [8]:
cols = ['Overview', 'Original_Language', 'Poster_Url']
df.drop(columns=cols, inplace=True)
df.columns

Index(['Release_Date', 'Title', 'Popularity', 'Vote_Count', 'Vote_Average',
       'Genre'],
      dtype='object')
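
`drop` raises a `KeyError` when a label does not match a column name exactly, and `inplace=True` mutates the frame, so re-running the cell fails. A hedged alternative, shown on a hypothetical `toy` frame with placeholder values, is to check the names first and keep the result in a new variable.

```python
import pandas as pd

# Hypothetical toy frame; the values are placeholders, not real dataset rows.
toy = pd.DataFrame({
    "Title": ["No Exit"],
    "Genre": ["Thriller"],
    "Overview": ["placeholder overview"],
    "Original_Language": ["en"],
    "Poster_Url": ["placeholder-url"],
})

cols = ["Overview", "Original_Language", "Poster_Url"]

# Labels that are not present would make `drop` raise a KeyError,
# so report them explicitly before dropping.
missing = [c for c in cols if c not in toy.columns]
if missing:
    raise KeyError(f"columns not found: {missing}")

# Without `inplace=True`, `drop` returns a new frame and leaves `toy` untouched.
trimmed = toy.drop(columns=cols)

print(list(trimmed.columns))
```

Keeping the original frame intact makes the cell safe to re-run, at the cost of holding two frames in memory.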

___