## Dataset Description
This dataset contains information about 9,000+ movies extracted from the TMDB API.

## Column Descriptions
1. `Release_Date`: Date when the movie was released.
2. `Title`: Name of the movie.
3. `Overview`: Brief summary of the movie.
4. `Popularity`: A key metric computed by TMDB developers based on the number of views per day, votes per day, the number of users who marked the movie as a "favorite" or added it to their "watchlist", the release date, and other factors.
5. `Vote_Count`: Total votes received from the viewers.
6. `Vote_Average`: Average rating, out of 10, based on the vote count and the number of viewers.
7. `Original_Language`: Original language of the movie. Dubbed versions are not considered the original language.
8. `Genre`: Categories the movie can be classified under.
9. `Poster_Url`: URL of the movie poster.

## EDA Questions
- Q1: What is the most frequent `genre` in the dataset?
- Q2: Which `genres` have the highest `votes`?
- Q3: Which movie has the highest `popularity`? What is its `genre`?
- Q4: Which year has the most filmed movies?
___

## Environment Set-up

In [1]:
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# listing the dataset files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

___
### Public Functions
Here, we put all of the public functions used in this notebook.

**For usability and functionality, we categorize columns using a function.**

In [2]:
def catigorize_col(df, col, labels):
    """
    Categorizes a column based on its quartiles.

    Args:
        (df)     df   - dataframe we are processing
        (col)    str  - name of the column to be categorized
        (labels) list - list of labels from min to max

    Returns:
        (df)     df   - dataframe with the categorized column
    """

    # quartile edges used to cut the column, computed once
    stats = df[col].describe()
    edges = [stats['min'], stats['25%'], stats['50%'], stats['75%'], stats['max']]

    # bins are left-open, so values equal to the minimum become NaN
    df[col] = pd.cut(df[col], edges, labels = labels, duplicates='drop')
    return df
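
As a cross-check on the quartile edges used by `catigorize_col()`, the sketch below compares them with `pd.qcut`, which computes quantile edges itself. The `toy` frame, the `score` column and the label names are hypothetical, and `pd.qcut` is shown as an alternative, not as what this notebook uses. One difference worth knowing: `pd.qcut` places the minimum value in the first bin.

```python
import pandas as pd

# Hypothetical toy scores, not taken from the movie dataset
toy = pd.DataFrame({"score": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]})
labels = ["low", "mid_low", "mid_high", "high"]

# q=4 asks pandas to compute the quartile edges itself
toy["bucket"] = pd.qcut(toy["score"], q=4, labels=labels)

# Quartile edges as reported by describe(), the same source catigorize_col() uses
stats = toy["score"].describe()
edges = [stats["min"], stats["25%"], stats["50%"], stats["75%"], stats["max"]]

print(edges)
print(toy["bucket"].value_counts().sort_index())
```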

___
## Data Wrangling
Here, we load our data from the CSV file and dive deeper into it to check for any required cleaning steps.

In [3]:
# loading data and viewing its first 5 rows
df = pd.read_csv('mymoviedb.csv', lineterminator='\n')
df.head()

Unnamed: 0,Release_Date,Title,Overview,Popularity,Vote_Count,Vote_Average,Original_Language,Genre,Poster_Url
0,2021-12-15,Spider-Man: No Way Home,Peter Parker is unmasked and no longer able to...,5083.954,8940,8.3,en,"Action, Adventure, Science Fiction",https://image.tmdb.org/t/p/original/1g0dhYtq4i...
1,2022-03-01,The Batman,"In his second year of fighting crime, Batman u...",3827.658,1151,8.1,en,"Crime, Mystery, Thriller",https://image.tmdb.org/t/p/original/74xTEgt7R3...
2,2022-02-25,No Exit,Stranded at a rest stop in the mountains durin...,2618.087,122,6.3,en,Thriller,https://image.tmdb.org/t/p/original/vDHsLnOWKl...
3,2021-11-24,Encanto,"The tale of an extraordinary family, the Madri...",2402.201,5076,7.7,en,"Animation, Comedy, Family, Fantasy",https://image.tmdb.org/t/p/original/4j0PNHkMr5...
4,2021-12-22,The King's Man,As a collection of history's worst tyrants and...,1895.511,1793,7.0,en,"Action, Adventure, Thriller, War",https://image.tmdb.org/t/p/original/aq4Pwv5Xeu...


In [4]:
# viewing dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9827 entries, 0 to 9826
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Release_Date       9827 non-null   object 
 1   Title              9827 non-null   object 
 2   Overview           9827 non-null   object 
 3   Popularity         9827 non-null   float64
 4   Vote_Count         9827 non-null   int64  
 5   Vote_Average       9827 non-null   float64
 6   Original_Language  9827 non-null   object 
 7   Genre              9827 non-null   object 
 8   Poster_Url         9827 non-null   object 
dtypes: float64(2), int64(1), object(6)
memory usage: 691.1+ KB


- Looks like our dataset has no NaNs!
- `Overview`, `Original_Language` and `Poster_Url` wouldn't be so useful during analysis
- `Release_Date` column needs to be cast to datetime so that only the year value can be extracted

In [5]:
# exploring genres column
df['Genre'].head()

0    Action, Adventure, Science Fiction
1              Crime, Mystery, Thriller
2                              Thriller
3    Animation, Comedy, Family, Fantasy
4      Action, Adventure, Thriller, War
Name: Genre, dtype: object

- Genres are separated by commas followed by whitespace.

In [6]:
# check for duplicated rows
df.duplicated().sum()

0

- our dataset has no duplicated rows either.

In [7]:
# exploring summary statistics
df.describe()

Unnamed: 0,Popularity,Vote_Count,Vote_Average
count,9827.0,9827.0,9827.0
mean,40.326088,1392.805536,6.439534
std,108.873998,2611.206907,1.129759
min,13.354,0.0,0.0
25%,16.1285,146.0,5.9
50%,21.199,444.0,6.5
75%,35.1915,1376.0,7.1
max,5083.954,31077.0,10.0


### Exploration Summary
- We have a dataframe consisting of 9827 rows and 9 columns.
- Our dataset looks fairly tidy, with no NaNs or duplicated values.
- `Release_Date` column needs to be cast to datetime so that only the year value can be extracted.
- `Overview`, `Original_Language` and `Poster_Url` wouldn't be so useful during analysis, so we'll drop them.
- There are noticeable outliers in the `Popularity` column.
- `Vote_Average` is better categorized for proper analysis.
- `Genre` column has comma-separated values and whitespace that need to be handled, and it should be cast to category.
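
To put a rough number on the `Popularity` outliers, the sketch below applies the common 1.5 × IQR rule of thumb to the quartiles printed by `df.describe()` above. The rule is an assumption brought in here for illustration, not something the notebook itself applies, and the variable names are hypothetical.

```python
# Quartiles and maximum of Popularity, copied from the df.describe() output
q1, q3 = 16.1285, 35.1915
max_popularity = 5083.954

iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # values above this are flagged by the rule

print(f"IQR = {iqr:.3f}, upper fence = {upper_fence:.3f}")
print(f"max exceeds the fence: {max_popularity > upper_fence}")
```

On the full dataframe, a filter such as `df[df['Popularity'] > upper_fence]` would list the flagged rows.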
___

## Data Cleaning

**Casting `Release_Date` column and extracting year values**

In [8]:
# casting column and extracting year value
df['Release_Date'] = pd.DatetimeIndex(df['Release_Date']).year

# confirming changes
print(df['Release_Date'].dtypes)
df['Release_Date'].unique()

int64


array([2021, 2022, 2020, 1986, 2018, 1989, 2012, 2009, 2005, 2019, 2014,
       2016, 2001, 2015, 2002, 2017, 2011, 2007, 2004, 2010, 1972, 2023,
       2008, 1995, 1994, 1992, 2013, 2003, 2006, 1999, 1993, 1985, 1998,
       1950, 1997, 2000, 1991, 1937, 1990, 1977, 1940, 1981, 1942, 1973,
       1941, 1974, 1951, 1996, 1959, 1961, 1967, 1953, 1988, 1956, 1983,
       1976, 1955, 1970, 1984, 1971, 1987, 1963, 1980, 1939, 1975, 1979,
       1982, 1966, 1978, 1965, 1946, 1964, 1968, 1962, 1960, 1957, 1949,
       1922, 2024, 1969, 1954, 1933, 1932, 1958, 1952, 1944, 1931, 1948,
       1938, 1945, 1927, 1925, 1935, 1936, 1902, 1920, 1947, 1943, 1929,
       1926, 1921, 1930], dtype=int64)
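
An equivalent way to get the year is `pd.to_datetime` followed by the `.dt` accessor. This is a minimal sketch on a hypothetical three-row sample, and `Release_Year` is an illustrative column name that the notebook does not use. Writing to a separate column keeps the original date strings available.

```python
import pandas as pd

# Hypothetical sample in the same YYYY-MM-DD string format as Release_Date
sample = pd.DataFrame({"Release_Date": ["2021-12-15", "2022-03-01", "1986-05-16"]})

# Parse the strings to datetime64, then read the year through the .dt accessor
parsed = pd.to_datetime(sample["Release_Date"], format="%Y-%m-%d")
sample["Release_Year"] = parsed.dt.year

print(sample)
```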

In [9]:
# confirming release years
df[df['Release_Date'] > 2022]

Unnamed: 0,Release_Date,Title,Overview,Popularity,Vote_Count,Vote_Average,Original_Language,Genre,Poster_Url
210,2023,Fast & Furious 10,The tenth installment in the Fast Saga.,181.216,0,0.0,en,"Action, Crime",https://image.tmdb.org/t/p/original/2DyEk84Xnb...
414,2023,The Flash,The Flash travels back in time to prevent the ...,119.846,0,0.0,en,"Action, Adventure, Science Fiction, Fantasy, D...",https://image.tmdb.org/t/p/original/oduJooXJya...
1158,2023,Aquaman and The Lost Kingdom,"When an ancient power is unleashed, Aquaman mu...",59.539,0,0.0,en,"Adventure, Action, Fantasy",https://image.tmdb.org/t/p/original/M2diAjJukA...
1757,2023,Transformers: Rise of the Beasts,The origin story for Optimus Prime and how he ...,44.599,0,0.0,en,"Action, Science Fiction, Adventure",https://image.tmdb.org/t/p/original/bgz1IdXKb1...
2732,2023,Guardians of the Galaxy Vol. 3,The third and final film in the Guardians of t...,32.375,0,0.0,en,"Action, Adventure, Science Fiction, Comedy",https://image.tmdb.org/t/p/original/1neAtYqPFf...
3323,2024,Despicable Me 4,The fourth installment of the Despicable Me fr...,27.987,0,0.0,en,"Animation, Family, Comedy, Adventure",https://image.tmdb.org/t/p/original/a7aJ1gIrsc...
6485,2023,The Marvels,The sequel to Captain Marvel (2019).,17.592,0,0.0,en,"Action, Adventure, Science Fiction",https://image.tmdb.org/t/p/original/bnz17wOkp9...
6785,2023,Ant-Man and the Wasp: Quantumania,The third film in the Marvel Studios' Ant-Man ...,17.05,0,0.0,en,"Adventure, Science Fiction, Comedy",https://image.tmdb.org/t/p/original/mYCBuVGQGW...
9134,2023,The Little Mermaid,A young mermaid makes a deal with a sea witch:...,14.016,0,0.0,en,"Adventure, Family, Fantasy, Romance",https://image.tmdb.org/t/p/original/57bEiV6qwc...


- It looks like we have some movies that are unreleased yet already have a popularity score; we'll drop them.

In [10]:
# dropping rows
df.drop(df[df['Release_Date'] > 2022].index, inplace = True)

___
**Dropping `Overview`, `Original_Language` and `Poster_Url`**

In [11]:
# making list of column to be dropped
cols = ['Overview', 'Original_Language', 'Poster_Url']

# dropping columns and confirming changes
df.drop(cols, axis = 1, inplace = True)
df.columns

Index(['Release_Date', 'Title', 'Popularity', 'Vote_Count', 'Vote_Average',
       'Genre'],
      dtype='object')

___
**Categorizing `Vote_Average` column**

We cut the `Vote_Average` values into 4 categories (`popular`, `average`, `below_avg`, `not_popular`) using the `catigorize_col()` function provided above.

In [12]:
# define labels for edges
labels = ['not_popular', 'below_avg', 'average', 'popular']

# categorize column based on labels and edges
catigorize_col(df, 'Vote_Average', labels)

# confirming changes
df['Vote_Average'].unique()

['popular', 'below_avg', 'average', 'not_popular', NaN]
Categories (4, object): ['not_popular' < 'below_avg' < 'average' < 'popular']

In [13]:
# exploring column
df['Vote_Average'].value_counts()

not_popular    2467
popular        2450
average        2412
below_avg      2398
Name: Vote_Average, dtype: int64
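
The `NaN` in the `unique()` output appears to come from how `pd.cut` builds its bins: by default they are open on the left, so a value equal to the lowest edge (here the column minimum, 0.0 in the `describe()` output) falls outside the first bin. `value_counts()` skips `NaN`, which is why the four counts above sum to 9727 rather than the 9818 rows left after dropping unreleased movies. The sketch below reproduces this on hypothetical ratings and shows `include_lowest=True` as one option for keeping such rows, under the assumption that they should be kept at all.

```python
import pandas as pd

# Hypothetical ratings; 0.0 plays the role of the column minimum
ratings = pd.Series([0.0, 5.9, 6.5, 7.1, 10.0])
edges = [0.0, 5.9, 6.5, 7.1, 10.0]
labels = ["not_popular", "below_avg", "average", "popular"]

default_cut = pd.cut(ratings, edges, labels=labels)
inclusive_cut = pd.cut(ratings, edges, labels=labels, include_lowest=True)

print(default_cut.isna().sum())    # the minimum falls outside (0.0, 5.9]
print(inclusive_cut.isna().sum())  # the minimum is kept in the first bin
```

In this notebook the affected rows have a rating of 0.0, so dropping them, as the next cell does, may well be the intended outcome.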

In [14]:
# dropping NaNs
df.dropna(inplace = True)

# confirming
df.isna().sum()

Release_Date    0
Title           0
Popularity      0
Vote_Count      0
Vote_Average    0
Genre           0
dtype: int64

___
**Handling `Genre` column's comma-separated values**

### TODO
One approach for this challenging column is to stack the genres into a dataframe and then merge it with our original dataframe. Its cells are left commented out below, and we explain further in the next cells.

In [15]:
# creating a new dataframe that holds all genres for each movie
#genres_df = df['Genre'].str.split(", ", expand=True)

# viewing its head
#genres_df.head()

Now that we have our dataframe of genres, the next step is to stack it, so that every movie is represented by a stack of genres.

In [16]:
# stacking genres dataframe 
#genres_df = genres_df.stack()

# configuring it as pandas dataframe
#genres_df = pd.DataFrame(genres_df)

# viewing its first 10 rows
#genres_df.head(10)

In [17]:
#Renaming the genres column and confirming value count
#genres_df.rename(columns={0:'genres_stack'}, inplace=True)
#genres_df.genres_stack.value_counts()

Now that we have created a new dataframe containing a stack of all movies' genres, the next step is to merge it with the original dataframe.
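
Since the stacking cells are commented out, here is a minimal sketch of how the split, stack and merge steps could fit together on a hypothetical two-movie sample. The `toy` frame and the join step are assumptions; only the `genres_stack` name comes from the cells above.

```python
import pandas as pd

# Hypothetical two-movie sample in the same "a, b, c" format as Genre
toy = pd.DataFrame({
    "Title": ["Movie A", "Movie B"],
    "Genre": ["Action, Adventure", "Thriller"],
})

# One column per genre position; shorter lists are padded with missing values
wide = toy["Genre"].str.split(", ", expand=True)

# stack() yields a (row, position) MultiIndex; dropna() removes any padding
genres_stack = wide.stack().dropna().rename("genres_stack")

# Drop the position level so the index lines up with toy, then join
genres_stack = genres_stack.reset_index(level=1, drop=True)
merged = toy.drop(columns="Genre").join(genres_stack)

print(merged)
```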
___

### We split genres into a list and then explode our dataframe to have only one genre per row for each movie

In [18]:
# split the strings into lists
df['Genre'] = df['Genre'].str.split(', ')

# explode the lists
df = df.explode('Genre').reset_index(drop=True)
df.head()

Unnamed: 0,Release_Date,Title,Popularity,Vote_Count,Vote_Average,Genre
0,2021,Spider-Man: No Way Home,5083.954,8940,popular,Action
1,2021,Spider-Man: No Way Home,5083.954,8940,popular,Adventure
2,2021,Spider-Man: No Way Home,5083.954,8940,popular,Science Fiction
3,2022,The Batman,3827.658,1151,popular,Crime
4,2022,The Batman,3827.658,1151,popular,Mystery
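
One consequence of exploding is that each movie now appears once per genre, so movie-level counts (for example, movies per year) would be inflated unless the rows are de-duplicated first. The sketch below illustrates this on a hypothetical sample. Using `Title` plus `Release_Date` as the de-duplication key is an assumption, not something the notebook defines.

```python
import pandas as pd

# Hypothetical sample, not from the movie dataset
toy = pd.DataFrame({
    "Release_Date": [2021, 2022],
    "Title": ["Movie A", "Movie B"],
    "Genre": ["Action, Adventure, Science Fiction", "Thriller"],
})

toy["Genre"] = toy["Genre"].str.split(", ")
expected_rows = toy["Genre"].str.len().sum()  # total number of genre entries

exploded = toy.explode("Genre").reset_index(drop=True)

# Collapse back to one row per movie before counting movies
movies = exploded.drop_duplicates(["Title", "Release_Date"])

print(len(exploded), expected_rows, len(movies))
```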


In [19]:
# casting column into category
df['Genre'] = df['Genre'].astype('category')

# confirming changes
df['Genre'].dtypes

CategoricalDtype(categories=['Action', 'Adventure', 'Animation', 'Comedy', 'Crime',
                  'Documentary', 'Drama', 'Family', 'Fantasy', 'History',
                  'Horror', 'Music', 'Mystery', 'Romance', 'Science Fiction',
                  'TV Movie', 'Thriller', 'War', 'Western'],
, ordered=False)
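
A `category` column stores each distinct label once and keeps a small integer code per row, which tends to save memory when a few labels repeat many times, as the 19 genres do here. A minimal sketch on hypothetical data (the series names are illustrative):

```python
import pandas as pd

# Hypothetical column: three genre names repeated many times
as_object = pd.Series(["Action", "Drama", "Comedy"] * 1000, dtype=object)
as_category = as_object.astype("category")

# deep=True counts the Python string objects, not just the pointers to them
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
print(as_category.cat.categories.tolist())
```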

___

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25552 entries, 0 to 25551
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   Release_Date  25552 non-null  int64   
 1   Title         25552 non-null  object  
 2   Popularity    25552 non-null  float64 
 3   Vote_Count    25552 non-null  int64   
 4   Vote_Average  25552 non-null  category
 5   Genre         25552 non-null  category
dtypes: category(2), float64(1), int64(2), object(1)
memory usage: 849.4+ KB


In [21]:
df.nunique()

Release_Date     100
Title           9415
Popularity      8088
Vote_Count      3265
Vote_Average       4
Genre             19
dtype: int64

___

## Data Visualization
Here, we use `Matplotlib` and `seaborn` to make some informative visuals and gain insights about our data.