## Introduction
The dataset describes various movies and TV shows found on Netflix and Amazon Prime Video. It is available on Kaggle as the [Popular Movies n TV shows dataset](https://www.kaggle.com/jyotmakadiya/popular-movies-and-tv-shows-amazon-prime-netflix).

If you find this notebook helpful, please share your thoughts and upvote!  
Let's explore the contents of the data and see what we can conclude.

## Exploratory Data Analysis


In [None]:
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing
import seaborn as sns # plotting and visualization
import pandas_profiling
# to avoid warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings("ignore")


## Reading the Data

In [None]:
# Load "Popular Movies TV shows from Prime Videos Netflix version_3.csv" and look at the first rows of the raw data
df1 = pd.read_csv('/kaggle/input/Popular Movies TV shows from Prime Videos Netflix version_3.csv',delimiter = ',')
df1.head(10)

In [None]:
df1.describe()

**Here we use `ProfileReport` from pandas_profiling to describe the whole dataset and visualize it interactively. This works well here because the dataset has only a few features, which are easy to describe.**

In [None]:
#separate features 
features = ['ID', 'Title', 'Year', 'Rating', 'IMDb',
       'Rotten Tomatoes', 'Genre', 'Netflix', 'Amazon Prime Video']

#generate data profile
profile = pandas_profiling.ProfileReport(df1[features])
profile

We can see that many values are missing in the age `Rating` variable, and a few in `IMDb`.  
There are some interesting conclusions we can draw from the statistics of this data:
* The mean year of production is around 2002 with a standard deviation of 21, so most of the titles in the dataset fall roughly between the 1980s and 2020. This matches our general understanding, as most of the movies/TV shows found on Netflix/Prime Video lie in this range.
* There are 14 different genres, of which Drama seems to be the largest, with around 6000 entries.
* Around 75% of the entries are found on Amazon Prime Video and 25% on Netflix (as seen from the 0/1 ratio).
* Some titles appear multiple times in the dataset, which indicates that the same movie can be listed on Netflix and Prime Video under multiple genres, which is pretty interesting.
* The age rating has most of its values missing, so we should consider dropping it.
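
The figures in the list above can also be checked directly with pandas rather than read off the profile report. The sketch below is only an illustration: `toy` is a small hypothetical DataFrame that reuses the notebook's column names, not the real data, so the numbers it prints are not the ones quoted above.

```python
import pandas as pd

# Hypothetical toy rows that reuse the notebook's column names (not the real data).
toy = pd.DataFrame({
    "Title": ["A", "A", "B", "C"],
    "Year": [1985, 1985, 2010, 2019],
    "Genre": ["Drama", "Comedy", "Drama", "Action"],
    "Netflix": [0, 0, 1, 0],
    "Amazon Prime Video": [1, 1, 0, 1],
})

# Mean and standard deviation of the production year.
year_mean = toy["Year"].mean()
year_std = toy["Year"].std()

# Number of distinct genres and the most frequent one.
n_genres = toy["Genre"].nunique()
top_genre = toy["Genre"].value_counts().idxmax()

# For a 0/1 flag column, the mean is the share of rows flagged 1.
platform_share = toy[["Netflix", "Amazon Prime Video"]].mean()

# Titles listed under more than one genre.
genres_per_title = toy.groupby("Title")["Genre"].nunique()
multi_genre_titles = genres_per_title[genres_per_title > 1]

print(year_mean, year_std)
print(n_genres, top_genre)
print(platform_share)
print(multi_genre_titles)
```

On the real data, the same calls would be applied to `df1` instead of `toy`.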

## Now let us do some data cleaning

In [None]:
# Before visualizing, we need to clean the data; start by counting the missing values per column
features = ['ID', 'Title', 'Year', 'Rating', 'IMDb',
       'Rotten Tomatoes', 'Genre', 'Netflix', 'Amazon Prime Video']

train_data = df1[features].copy()
train_data.isnull().sum()

It seems that around 60% of the age `Rating` values and 420 IMDb ratings are missing. The age rating column doesn't seem to help in further analysis, so we will drop that feature altogether and then drop the rows with null IMDb values.
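
A percentage such as the 60% above can be derived from the null counts. The following is a minimal sketch on a hypothetical toy frame (`toy`, with made-up values), and the 50% cut-off named `threshold` is an assumption chosen for illustration, not a rule used by this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same two columns; values are made up.
toy = pd.DataFrame({
    "Rating": ["16+", np.nan, np.nan, "7+", np.nan],
    "IMDb": ["7.1", "6.4", np.nan, "8.0", "5.5"],
})

# Count of missing values per column, and the same as a percentage of rows.
missing_count = toy.isnull().sum()
missing_pct = toy.isnull().mean() * 100

# Assumed cut-off: drop columns with more than 50% missing values,
# then drop the remaining rows that still contain nulls.
threshold = 50
sparse_cols = missing_pct[missing_pct > threshold].index
cleaned = toy.drop(columns=sparse_cols).dropna()

print(missing_pct)
print(cleaned)
```

Dropping the sparse column first matters: calling `dropna()` before removing `Rating` would discard every row where only the age rating is missing.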

In [None]:
train_data = train_data.drop('Rating',axis = 1)

***Dropping Null Values***

In [None]:
train_data = train_data.dropna()
train_data

In [None]:
#features.remove('Rating')
train_data.isnull().sum()

> Now we can clearly see that no null values remain in the data.

### Converting the string values of the IMDb feature to float

In [None]:

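# pd.to_numeric(train_data['IMDb']) raises on non-numeric strings,
# so convert value by value and report the entries that fail.
def convert_to_float(imdb):
    """Return imdb as a float; print and keep the raw value if it cannot be parsed."""
    try:
        return float(imdb)
    except (TypeError, ValueError):
        print('the unwanted value is:', imdb)
        return imdb

# Apply to the column directly instead of indexing each row by position
train_data["IMDb"] = train_data["IMDb"].apply(convert_to_float)
train_data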

We can see some string values, such as `d;}`, that cause errors when converting to float, so we need to get rid of these values.

In [None]:
train_data = train_data[train_data.IMDb != 'd;}']
train_data
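
As an alternative to filtering out one known bad string, unparseable entries can be coerced to `NaN` and dropped in a single pass, which also leaves the column with a numeric dtype. This is a sketch on a hypothetical toy frame (`toy`), not the approach taken in the cell above.

```python
import pandas as pd

# Hypothetical toy data containing one malformed rating.
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "IMDb": ["7.5", "d;}", "6.0"],
})

# errors="coerce" replaces anything that cannot be parsed with NaN
# instead of raising an exception.
toy["IMDb"] = pd.to_numeric(toy["IMDb"], errors="coerce")

# Drop the rows whose rating could not be parsed.
toy = toy.dropna(subset=["IMDb"])

print(toy)
print(toy["IMDb"].dtype)
```

The trade-off is that coercion removes every malformed value silently, so it is worth inspecting the offending entries first, as the printing in `convert_to_float` does.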

## Visualizations
Now our data is cleaned so we can visualize it with seaborn 

In [None]:
#movies/shows annual production data
plt.figure(figsize=(10,6))
fig = sns.distplot(train_data['Year'],color = 'coral')
fig.set_xlabel("Year",size=15)
fig.set_ylabel("Movie/Shows Count",size=15)
plt.title('Movies/TV shows in each Year',size = 20)
plt.show()

Most of the movies/shows were released after the 1950s, so the plot is left-skewed, with a long tail toward earlier years.
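
The direction of the skew can also be checked numerically. The sketch below uses a hypothetical list of years (`years`) shaped like the plot, with a few old titles and many recent ones. A negative sample skewness, or a mean below the median, indicates a left-skewed distribution.

```python
import pandas as pd

# Hypothetical years: a few old titles and many recent ones.
years = pd.Series([1925, 1960, 1995, 2005, 2012, 2015, 2017, 2018, 2019, 2020])

# Sample skewness: negative means the long tail is on the left.
skewness = years.skew()

# In a left-skewed sample, the mean is pulled below the median.
mean_year = years.mean()
median_year = years.median()

print(skewness)
print(mean_year, median_year)
```

On the notebook's data, the same check would be `train_data['Year'].skew()`.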

In [None]:
plt.figure(figsize=(10,6))
fig = sns.distplot(train_data[train_data['Year'] > 1980]["Year"],color = 'coral')
fig.set_xlabel("Year",size=15)
fig.set_ylabel("Movie/Shows Frequency",size=15)
#fig.set_xticklabels(fig.get_xticklabels(),rotation=45)
plt.title('Movies/TV shows in each Year',size = 20)
plt.show()

### Let's visualize the IMDb rating data

In [None]:
plt.figure(figsize=(10,6))
fig = sns.distplot(train_data["IMDb"],color = 'coral')
fig.set_xlabel("IMDb Rating Rounded",size=15)
fig.set_ylabel("Movie/Shows Frequency",size=15)
plt.title('Movies/TV shows IMDb Rating',size = 20)
plt.show()

*The IMDb distribution looks good: it resembles a Gaussian distribution.*
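
A rough numeric check of the Gaussian impression is the empirical rule: in a normal distribution about 68% of values lie within one standard deviation of the mean and about 95% within two. The sketch below runs that check on synthetic ratings (`ratings`, drawn from a normal distribution with assumed parameters), so it shows how the check behaves and does not reproduce the real IMDb column.

```python
import numpy as np
import pandas as pd

# Synthetic ratings drawn from a normal distribution (assumed mean 6.5, std 1.0).
rng = np.random.default_rng(0)
ratings = pd.Series(rng.normal(loc=6.5, scale=1.0, size=10_000))

mean, std = ratings.mean(), ratings.std()

# Share of values within one and two standard deviations of the mean.
within_1 = ((ratings - mean).abs() <= std).mean()
within_2 = ((ratings - mean).abs() <= 2 * std).mean()

# Skewness close to 0 is also expected for a symmetric bell curve.
skewness = ratings.skew()

print(within_1, within_2, skewness)
```

If the real `train_data['IMDb']` column gives shares far from 68% and 95%, or a skewness far from 0, the Gaussian description would need to be qualified.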

### Visualizing all the 14 different Genres

In [None]:
plt.figure(figsize=(10,6))
fig = sns.countplot(x=train_data["Genre"],color = 'coral')  # pass the column as x explicitly
fig.set_xlabel("Genre",size=15)
fig.set_ylabel("Movie/Shows Count",size=15)
fig.set_xticklabels(fig.get_xticklabels(),rotation=30)
plt.title('Movies/TV shows in each Genre',size = 20)
plt.show()

### Visualizing the distribution of rotten tomatoes

In [None]:
plt.figure(figsize=(10,6))
fig = sns.distplot(train_data["Rotten Tomatoes"],color = 'coral')
fig.set_xlabel("Rotten Tomatoes",size=15)
fig.set_ylabel("Movie/Shows frequency",size=15)
plt.xticks(rotation=30)  # rotate the numeric tick labels without overwriting them
plt.title('Movies/TV shows distribution over Rotten tomatoes',size = 20)
plt.show()

This also looks like a bell-shaped curve, but slightly right-skewed.

## Conclusion
* The dataset seems pretty interesting: we can see that as the Rotten Tomatoes score increases, the number of movies/shows decreases.
* The IMDb ratings show that a large number of movies/shows lie in the mid range, indicating average performance.